METR's tool for running evaluations and conducting agent elicitation research.
We measured the performance of GPT-4o, using a simple agent scaffold, on 77 tasks across 30 task families that test autonomous capabilities.
More tasks, human baselines, and preliminary results for GPT-4 and Claude.
A collection of resources for evaluating potentially dangerous autonomous capabilities of frontier models.
An example protocol for the end-to-end evaluation process, based on our task suite, elicitation protocol, and scoring methods.
A suite of tasks designed to test autonomous abilities at different difficulty levels.