About

Currently, I work at Constellation, where I plan events and run programs that aim to strengthen the connections between different parts of the AI safety ecosystem.

I'm concerned that we're on track to develop very powerful AI systems soon without being confident that this won't lead to a global catastrophe such as human extinction, human disempowerment, or permanent authoritarian lock-in. The 80,000 Hours website is a good starting point for understanding the risks and the opportunities for reducing them. I plan to spend my professional life working to reduce the danger from AI systems as much as possible.

Previously, I took part in MATS under Ethan Perez, and continued the research I began there until joining Constellation in 2025. Before that, I completed a DPhil at the University of Oxford, supervised by Tom Melham and Daniel Kroening, in which I used generative image models to test and evaluate the robustness of image classification models. Earlier, I taught computer science for two years at a comprehensive secondary school through the Teach First Leadership Development Programme, and studied computer science at the University of Cambridge.

Publications

Supervised finetuning on low-harm reward hacking generalises to high-harm reward hacking

Isaac Dunn, Kei Nishimura-Gasparian, Carson Denison, Ethan Perez, Robert Kirk

We finetune GPT-4.1 to reward hack on minor-harm tasks and observe undesirable generalisation of reward hacking to serious-harm tasks. We also observe that many commercially available models often reward hack, even without finetuning.

Alignment Forum

Testing Deep Image Classifiers Using Generative Machine Learning

Isaac Dunn

My DPhil thesis presents the research I completed between 2018 and 2022, including work published as separate papers and the context that motivates it. The one-page abstract and full thesis are available at the Oxford University Research Archive.

Oxford University Research Archive

Exposing Previously Undetectable Faults in Deep Neural Networks

Isaac Dunn, Hadrien Pouget, Tom Melham, Daniel Kroening

Existing methods for testing DNNs constrain test inputs to lie close to known examples, which limits the faults they can find. By leveraging generative machine learning, we generate fresh test cases that vary in high-level features (shape, location, texture, colour) and expose faults that other methods cannot.

Detecting a fault in a deep neural network image classifier

Evaluating Robustness to Context-Sensitive Feature Perturbations of Different Granularities

Isaac Dunn, Hadrien Pouget, Laura Hanu, Daniel Kroening, Tom Melham

We introduce a method that finds context-sensitive feature perturbations (shape, location, texture, colour) by adjusting the activations of a generative network. State-of-the-art classifiers are not robust to these changes, and adversarial training against pixel-space attacks turns out to be counterproductive for coarse-grained ones.

arXiv
Video

A volcano image perturbed until a classifier labels it a goldfish

Adaptive Generation of Unrestricted Adversarial Inputs