$whoami

Eyon Jang

A(G)I safety researcher

about

I work on ensuring that AIs remain beneficial and aligned with human values as they become more capable. My research focuses on developing methods for understanding, evaluating, and controlling advanced AI systems. I'm particularly interested in AI R&D automation and the threat models it poses, as well as alignment techniques that leverage high-compute RL to build mitigations that scale to superintelligence.

I am currently a MATS scholar researching exploration hacking and automated alignment auditing, mentored by David Lindner, Roland Zimmermann (Google DeepMind AGI Safety and Alignment), and Scott Emmons (Anthropic Alignment Science).

Previously, I spent 6 years as a quantitative researcher on Wall Street. I received my MSc in Statistics and Machine Learning (with Distinction) from the University of Oxford, where I was supervised by Prof. Yee Whye Teh and Prof. Benjamin Bloem-Reddy.

research interests: alignment, ai control, scalable oversight, reinforcement learning

research

For a full list of publications, see my Google Scholar.

Automated Alignment Auditing

ongoing · MATS 8.2 · Eyon Jang, Alex Serrano
An automated alignment-auditing framework for production coding agents (Claude Code, Codex CLI, Gemini CLI), built on top of Petri. An auditor model probes a target agent across scripted scenarios, and independent judges score the resulting transcripts for scheming behaviors.
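
A minimal sketch of that audit loop, assuming generic callables rather than Petri's actual API (run_audit, auditor, target, and judges are all hypothetical placeholders):

from dataclasses import dataclass, field

@dataclass
class Transcript:
    scenario: str
    turns: list = field(default_factory=list)  # (role, message) pairs

def run_audit(scenario, n_turns, auditor, target, judges):
    """Auditor probes the target agent; independent judges then score the transcript."""
    t = Transcript(scenario)
    for _ in range(n_turns):
        probe = auditor(scenario, t.turns)   # auditor crafts the next probe
        t.turns.append(("auditor", probe))
        reply = target(probe, t.turns)       # target coding agent responds
        t.turns.append(("target", reply))
    # Each judge scores the full transcript for scheming behaviors independently.
    scores = [judge(t) for judge in judges]
    return {"scenario": scenario, "scores": scores}

Keeping the judges independent of the auditor is what lets the scores serve as an external check rather than a self-assessment.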

Prompt Attacks Reveal Superficial Knowledge Removal in Unlearning Methods

published · SPAR · Eyon Jang, Shariqah Hossain, Ashwin Sreevatsa, Diogo Cruz
We investigate whether machine unlearning methods genuinely remove knowledge or merely suppress it. Our work suggests that some approaches remain vulnerable to simple prompt attacks, which highlights the need for more reliable unlearning evaluations and provides an open framework for systematic evaluation of unlearning methods.
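
The core evaluation idea can be sketched in a few lines: if knowledge were truly removed, accuracy should stay low under rephrased or adversarial prompts as well. Here model and grade are hypothetical stand-ins for an LLM call and an answer checker, and the attack templates are illustrative, not the paper's exact set:

ATTACKS = [
    lambda q: q,                                        # direct query (baseline)
    lambda q: f"Answer as a domain expert: {q}",        # persona framing
    lambda q: f"Complete this quiz item:\nQ: {q}\nA:",  # format shift
]

def attack_recovery(model, grade, questions):
    """Accuracy on the unlearned topic under each prompt attack."""
    rates = []
    for attack in ATTACKS:
        correct = sum(grade(q, model(attack(q))) for q in questions)
        rates.append(correct / len(questions))
    # A large jump from the baseline rate under attack suggests the
    # knowledge was suppressed rather than genuinely removed.
    return rates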

Automating AI Safety Research using AIs

published · SPAR · Matthew Shinkle*, Eyon Jang*, Jacques Thibodeau
A unified pipeline of AI agent tools for automating interpretability research — spanning literature search, codebase discovery, and experiment design/execution. We demonstrate the system on SAEBench, where it autonomously implements and evaluates sparse autoencoder experiments.
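
A rough sketch of how such a pipeline chains its stages, with all tool names hypothetical (the real SAEBench interfaces are not shown):

def automate_research(question, llm, tools):
    """Feed each pipeline stage's output into the next via the LLM."""
    papers = tools["literature_search"](question)
    repos = tools["codebase_discovery"](question, papers)
    plan = llm(f"Design an experiment for: {question}\n"
               f"Relevant papers: {papers}\nCandidate codebases: {repos}")
    # The execution stage runs the planned experiment and returns metrics.
    results = tools["execute_experiment"](plan)
    return {"plan": plan, "results": results}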

news

Apr 2026
Paper accepted to ICML 2026
"Exploration Hacking: Can LLMs Learn to Resist RL Training?" accepted to ICML 2026 Main Conference.
Feb 2026
Accepted to MATS 8.2 (Extension+)
+6 months funding for exploration hacking & automated alignment auditing.
Dec 2025
Best Paper Runner-Up at NeurIPS 2025
"Resisting RL Elicitation of Biosecurity Capabilities: Reasoning Models Exploration Hacking on WMDP" accepted to NeurIPS 2025 Biosecurity Safeguards for Generative AI workshop (Oral).
Awarded Best Paper Runner-Up (with $1,000 prize).
Oct 2025
Algoverse Fall 2025 Mentor
I'll be mentoring 2 Algoverse projects on AI control.
Aug 2025
Accepted to MATS 8.1 (Extension)
+6 months funding for exploration hacking. Project selected as a spotlight talk at the MATS Symposium.
Aug 2025
SPAR Fall 2025 Mentor
I'll be co-mentoring 3 SPAR projects with Diogo Cruz.
Jul 2025
Paper accepted to COLM 2025
"Prompt Attacks Reveal Superficial Knowledge Removal in Unlearning Methods" accepted to COLM 2025 SoLaR workshop.
Jun 2025
Accepted to MATS 8.0
I'll be joining MATS 8.0 as a research scholar to study AI safety! (Google DeepMind stream: Scott Emmons/David Lindner/Erik Jenner)

service

reviewer · NeurIPS 2025 Mechanistic Interpretability workshop