MirrorCode: Evidence that AI can already do some weeks-long coding tasks

CONTRIBUTORS

David Rein

DATE

April 10, 2026

@misc{mirrorcode-evidence-that-ai-can-already-do-some-weeks-long-coding-tasks,
    title = {MirrorCode: Evidence that AI can already do some weeks-long coding tasks},
    author = {David Rein},
    howpublished = {\url{https://metr.org/blog/2026-04-10-mirrorcode-preliminary-results/}},
    year = {2026},
    month = {04},
}

This is a linkpost for MirrorCode, a project that METR funded and co-developed with Epoch AI. See Epoch AI’s blog post for more detail: https://epoch.ai/blog/mirrorcode-preliminary-results/

