This is a linkpost for MirrorCode, a project that METR funded and co-developed with Epoch AI. See Epoch AI’s blog post for more detail: https://epoch.ai/blog/mirrorcode-preliminary-results/
@misc{mirrorcode-evidence-that-ai-can-already-do-some-weeks-long-coding-tasks,
title = {MirrorCode: Evidence that AI can already do some weeks-long coding tasks},
author = {David Rein},
howpublished = {\url{https://metr.org/blog/2026-04-10-mirrorcode-preliminary-results/}},
year = {2026},
month = {04},
}