Eyon Jang

A(G)I Safety Researcher

About

I work on ensuring that AIs remain beneficial and aligned with human values as they become more capable. My research focuses on developing methods for understanding, evaluating, and controlling advanced AI systems. I'm particularly interested in the automation of AI R&D and the threat models it poses, as well as alignment techniques that leverage high-compute RL to develop mitigations that scale to superintelligence.

I am currently a MATS 8.0 scholar researching exploration hacking, mentored by Scott Emmons, David Lindner, Roland Zimmermann, and Stephen McAleer.

Previously, I spent 6 years as a quantitative researcher on Wall Street. I received my MSc in Statistics and Machine Learning (with Distinction) from the University of Oxford, where I was supervised by Prof. Yee Whye Teh and Prof. Benjamin Bloem-Reddy.

Research interests: AI Alignment, Control, Scalable Oversight, Reinforcement Learning

Research

Model Organisms of Exploration Hacking

ongoing · MATS 8.0 · Eyon Jang, Damon Falck, Joschka Braun
Can reasoning models undermine RL training by manipulating their exploration? Through model-organism experiments in realistic settings, we are developing a science of when and how models can influence their own RL training ("exploration hacking"), and building an understanding of what causes current frontier models to do so in the wild.

Prompt Attacks Reveal Superficial Knowledge Removal in Unlearning Methods

published · COLM 2025 SoLaR workshop · Eyon Jang, Shariqah Hossain, Ashwin Sreevatsa, Diogo Cruz
We investigate whether machine unlearning methods genuinely remove knowledge or merely suppress it, revealing that some approaches remain vulnerable to simple prompt attacks. Our work highlights the need for more reliable unlearning evaluations and provides an open framework for systematic evaluation of unlearning methods.

Automating AI Safety Research using AIs

published · LessWrong · Matthew Shinkle*, Eyon Jang*, Jacques Thibodeau
As AIs become more capable, automating AI R&D is emerging as a promising approach to accelerating progress in model interpretability and, more broadly, AI safety. This project introduces a unified pipeline of AI agent tools for automating interpretability research, spanning literature search and parsing, codebase discovery and preparation, and experiment design and execution. The utility of these tools is demonstrated in a sandbox environment for sparse autoencoders (SAEs), enabling the autonomous implementation and evaluation of diverse SAEBench experiments.

News

Aug 2025
Accepted to MATS 8.1 (Extension)
Secured 6-month funding to continue my research on exploration hacking!
Project selected as a spotlight talk at the MATS Symposium.
Aug 2025
SPAR Fall 2025 Mentor
I'll be co-mentoring 4 SPAR projects with Diogo Cruz.
Jul 2025
Paper accepted to COLM 2025
"Prompt Attacks Reveal Superficial Knowledge Removal in Unlearning Methods" accepted to COLM 2025 SoLaR workshop.
Jun 2025
Accepted to MATS 8.0
I'll be joining MATS 8.0 as a research scholar to study AI safety!
(Google DeepMind stream: Scott Emmons/David Lindner/Erik Jenner)

Projects

Model Organisms of Exploration Hacking

Anthropic x MATS Hackathon · Eyon Jang, Damon Falck, Joschka Braun
In a two-day sprint, we investigated the capability and propensity of Claude models to engage in exploration hacking.

Transformers from scratch

Open Source
Implementation of various Transformer architectures from scratch in PyTorch.

Turing

Open Source
Lightweight PyTorch-like autograd library implemented entirely in NumPy.

Miscellaneous

Mentoring

I find it deeply fulfilling to help talented researchers transition into AI safety. I'll be mentoring 4 SPAR projects this fall!

I plan to mentor more projects through other AI safety programs. If you're interested in working together, please reach out with your background and research interests.

Service

Reviewer: NeurIPS 2025 Mechanistic Interpretability workshop