$whoami

Eyon Jang

A(G)I safety researcher

about

I work on ensuring that AIs remain beneficial and aligned with human values as they become more capable. My research focuses on developing methods for understanding, evaluating, and controlling advanced AI systems. I'm particularly interested in AI R&D automation and the threat models it poses, as well as alignment techniques that leverage high-compute RL to build mitigations that scale to superintelligence.

I am currently a MATS scholar researching exploration hacking and automated alignment auditing, mentored by David Lindner, Roland Zimmermann (Google DeepMind AGI Safety and Alignment), and Scott Emmons (Anthropic Alignment Science).

Previously, I spent 6 years as a quantitative researcher on Wall Street. I received my MSc in Statistics and Machine Learning (with Distinction) from the University of Oxford, where I was supervised by Prof. Yee Whye Teh and Prof. Benjamin Bloem-Reddy.

research interests: alignment, ai control, scalable oversight, reinforcement learning

research

For a full list of publications, see my Google Scholar.

Automated Alignment Auditing

ongoing · MATS 8.2 · Eyon Jang, Alex Serrano
An automated alignment-auditing framework for production coding agents (Claude Code, Codex CLI, Gemini CLI), built on top of Petri. An auditor model probes a target agent across scripted scenarios, and independent judges score the resulting transcripts for scheming behaviors.
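
A minimal sketch of that audit loop, assuming generic callables rather than Petri's actual API (run_audit, auditor, target, and judges are all hypothetical placeholders):

from dataclasses import dataclass, field

@dataclass
class Transcript:
    scenario: str
    turns: list = field(default_factory=list)  # (role, message) pairs

def run_audit(scenario, n_turns, auditor, target, judges):
    """Auditor probes the target agent; independent judges then score the transcript."""
    t = Transcript(scenario)
    for _ in range(n_turns):
        probe = auditor(scenario, t.turns)   # auditor crafts the next probe
        t.turns.append(("auditor", probe))
        reply = target(probe, t.turns)       # target coding agent responds
        t.turns.append(("target", reply))
    # Each judge scores the full transcript for scheming behaviors independently.
    scores = [judge(t) for judge in judges]
    return {"scenario": scenario, "scores": scores}

Keeping the judges independent of the auditor is what lets the scores serve as an external check rather than a self-assessment.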

Prompt Attacks Reveal Superficial Knowledge Removal in Unlearning Methods

published · SPAR · Eyon Jang, Shariqah Hossain, Ashwin Sreevatsa, Diogo Cruz
We investigate whether machine unlearning methods genuinely remove knowledge or merely suppress it. Our work suggests that some approaches remain vulnerable to simple prompt attacks, which highlights the need for more reliable unlearning evaluations and provides an open framework for systematic evaluation of unlearning methods.
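
The core evaluation idea can be sketched in a few lines: if knowledge were truly removed, accuracy should stay low under rephrased or adversarial prompts as well. Here model and grade are hypothetical stand-ins for an LLM call and an answer checker, and the attack templates are illustrative, not the paper's exact set:

ATTACKS = [
    lambda q: q,                                        # direct query (baseline)
    lambda q: f"Answer as a domain expert: {q}",        # persona framing
    lambda q: f"Complete this quiz item:\nQ: {q}\nA:",  # format shift
]

def attack_recovery(model, grade, questions):
    """Accuracy on the unlearned topic under each prompt attack."""
    rates = []
    for attack in ATTACKS:
        correct = sum(grade(q, model(attack(q))) for q in questions)
        rates.append(correct / len(questions))
    # A large jump from the baseline rate under attack suggests the
    # knowledge was suppressed rather than genuinely removed.
    return rates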

Automating AI Safety Research using AIs

published · SPAR · Matthew Shinkle*, Eyon Jang*, Jacques Thibodeau
A unified pipeline of AI agent tools for automating interpretability research — spanning literature search, codebase discovery, and experiment design/execution. We demonstrate the system on SAEBench, where it autonomously implements and evaluates sparse autoencoder experiments.
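
A rough sketch of how such a pipeline chains its stages, with all tool names hypothetical (the real SAEBench interfaces are not shown):

def automate_research(question, llm, tools):
    """Feed each pipeline stage's output into the next via the LLM."""
    papers = tools["literature_search"](question)
    repos = tools["codebase_discovery"](question, papers)
    plan = llm(f"Design an experiment for: {question}\n"
               f"Relevant papers: {papers}\nCandidate codebases: {repos}")
    # The execution stage runs the planned experiment and returns metrics.
    results = tools["execute_experiment"](plan)
    return {"plan": plan, "results": results}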

news

Apr 2026
Paper accepted to ICML 2026
"Exploration Hacking: Can LLMs Learn to Resist RL Training?" accepted to ICML 2026 Main Conference.
Feb 2026
Accepted to MATS 8.2 (Extension+)
+6 months funding for exploration hacking & automated alignment auditing.
Dec 2025
Best Paper Runner-Up at NeurIPS 2025
"Resisting RL Elicitation of Biosecurity Capabilities: Reasoning Models Exploration Hacking on WMDP" accepted to NeurIPS 2025 Biosecurity Safeguards for Generative AI workshop (Oral).
Awarded Best Paper Runner-Up (with $1,000 prize).
Oct 2025
Algoverse Fall 2025 Mentor
I'll be mentoring 2 Algoverse projects on AI control.
Aug 2025
Accepted to MATS 8.1 (Extension)
+6 months funding for exploration hacking. Project selected as a spotlight talk at the MATS Symposium.
Aug 2025
SPAR Fall 2025 Mentor
I'll be co-mentoring 3 SPAR projects with Diogo Cruz.
Jul 2025
Paper accepted to COLM 2025
"Prompt Attacks Reveal Superficial Knowledge Removal in Unlearning Methods" accepted to COLM 2025 SoLaR workshop.
Jun 2025
Accepted to MATS 8.0
I'll be joining MATS 8.0 as a research scholar to study AI safety! (Google DeepMind stream: Scott Emmons/David Lindner/Erik Jenner)

service

reviewer · NeurIPS 2025 Mechanistic Interpretability workshop