Eyon Jang

A(G)I Safety Researcher

About

I work on ensuring that AIs remain beneficial and aligned with human values as they become more capable. My research focuses on developing methods for understanding, evaluating, and controlling advanced AI systems. I'm particularly interested in the automation of AI R&D and the threat models it poses, as well as alignment techniques that leverage high-compute RL to develop mitigations that scale to superintelligence.

I am currently a MATS 8.0 scholar researching exploration hacking, mentored by Scott Emmons, David Lindner, Roland Zimmermann, and Stephen McAleer.

Previously, I spent 6 years as a quantitative researcher on Wall Street. I received my MSc in Statistics and Machine Learning (with Distinction) from the University of Oxford, where I was supervised by Prof. Yee Whye Teh and Prof. Benjamin Bloem-Reddy.

Research interests: AI Alignment, Control, Scalable Oversight, Reinforcement Learning

Research

Model Organisms of Exploration Hacking

ongoing · MATS 8.0 · Eyon Jang, Damon Falck, Joschka Braun
Can reasoning models undermine RL training by manipulating their exploration? Through model-organism experiments in realistic settings, we are developing a science of when and how models can influence their own RL training ("exploration hacking"), and building an understanding of what causes current frontier models to do so in the wild.

Prompt Attacks Reveal Superficial Knowledge Removal in Unlearning Methods

published · COLM 2025 SoLaR workshop · Eyon Jang, Shariqah Hossain, Ashwin Sreevatsa, Diogo Cruz
We investigate whether machine unlearning methods genuinely remove knowledge or merely suppress it, revealing that some approaches remain vulnerable to simple prompt attacks. Our work highlights the need for more reliable unlearning evaluations and provides an open framework for systematic evaluation of unlearning methods.

Automating AI Safety Research using AIs

published · LessWrong · Matthew Shinkle*, Eyon Jang*, Jacques Thibodeau
As AIs become more capable, automating AI R&D is emerging as a promising approach to accelerating progress in model interpretability and, more broadly, AI safety. This project introduces a unified pipeline of AI agent tools for automating interpretability research, spanning literature search and parsing, codebase discovery and preparation, and experiment design and execution. The utility of these tools is demonstrated in a sandbox environment for sparse autoencoders (SAEs), enabling the autonomous implementation and evaluation of diverse SAEBench experiments.

News

Aug 2025
Accepted to MATS 8.1 (Extension)
Secured 6-month funding to continue my research on exploration hacking!
Project selected as a spotlight talk at the MATS Symposium.
Aug 2025
SPAR Fall 2025 Mentor
I'll be co-mentoring 4 SPAR projects with Diogo Cruz.
Jul 2025
Paper accepted to COLM 2025
"Prompt Attacks Reveal Superficial Knowledge Removal in Unlearning Methods" accepted to COLM 2025 SoLaR workshop.
Jun 2025
Accepted to MATS 8.0
I'll be joining MATS 8.0 as a research scholar to study AI safety!
(Google DeepMind stream: Scott Emmons/David Lindner/Erik Jenner)

Projects

Model Organisms of Exploration Hacking

Anthropic x MATS Hackathon · Eyon Jang, Damon Falck, Joschka Braun
In a two-day sprint, we investigated the capability and propensity of Claude models to engage in exploration hacking.

Transformers from scratch

Open Source
Implementation of various Transformer architectures from scratch in PyTorch.

Turing

Open Source
Lightweight PyTorch-like autograd library implemented entirely in NumPy.

Miscellaneous

Mentoring

I find it deeply fulfilling to help talented researchers transition into AI safety. I'll be mentoring 4 SPAR projects this fall!

I plan to mentor more projects through other AI safety programs. If you're interested in working together, please reach out with your background and research interests.

Service

Reviewer: NeurIPS 2025 Mechanistic Interpretability workshop