About

I work at Constellation, where I plan events and run programs that aim to strengthen connections between different parts of the AI safety ecosystem.

I've chosen to spend my career reducing the danger from advanced AI. I'm concerned that we're building powerful AI systems in a way that might lead to catastrophe in the next ten years: human extinction, human disempowerment, or permanent authoritarian lock-in. The 80,000 Hours website is a good introduction to the risks and what you can do about them.

Previously, I took part in the MATS research program under Ethan Perez, and afterwards continued that research independently. Before that, I completed a DPhil at the University of Oxford, supervised by Tom Melham and Daniel Kroening, in which I used generative image models to test and evaluate the robustness of image classification models. Earlier still, I taught computer science for two years at a comprehensive secondary school through the Teach First Leadership Development Programme. My undergraduate degree was in computer science at the University of Cambridge.

Research

Supervised finetuning on low-harm reward hacking generalises to high-harm reward hacking

Isaac Dunn, Kei Nishimura-Gasparian, Carson Denison, Ethan Perez, Robert Kirk

We finetune GPT-4.1 to reward hack on minor-harm tasks and observe undesirable generalisation of reward hacking to serious-harm tasks. We also observe that many commercially available models often reward hack, even without finetuning.

Alignment Forum

Testing Deep Image Classifiers Using Generative Machine Learning

Isaac Dunn

My DPhil thesis presents the research I completed between 2018 and 2022, including work published as separate papers and the context that motivates it. The one-page abstract and full thesis are available at the Oxford University Research Archive.

Oxford University Research Archive

Exposing Previously Undetectable Faults in Deep Neural Networks

Isaac Dunn, Hadrien Pouget, Tom Melham, Daniel Kroening

Existing methods for testing DNNs constrain test inputs to lie close to known examples, which limits the faults they can find. By leveraging generative machine learning, we generate fresh test cases that vary in high-level features (shape, location, texture, colour) and expose faults that other methods cannot.

Detecting a fault in a deep neural network image classifier

Evaluating Robustness to Context-Sensitive Feature Perturbations of Different Granularities

Isaac Dunn, Hadrien Pouget, Laura Hanu, Daniel Kroening, Tom Melham

We introduce a method that finds context-sensitive feature perturbations (shape, location, texture, colour) by adjusting the activations of a generative network. State-of-the-art classifiers are not robust to these changes, and adversarial training against pixel-space attacks turns out to be counterproductive for coarse-grained ones.

arXiv
Video

A volcano image perturbed until a classifier labels it a goldfish

Adaptive Generation of Unrestricted Adversarial Inputs