METR

AI companies and wider society want to understand the capabilities of frontier AI systems, and what risks they pose.

METR is a nonprofit research organization which studies AI capabilities, including broad autonomous capabilities and the ability of AI systems to conduct AI R&D.

Top resources

Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity

We conduct a randomized controlled trial (RCT) to understand how early-2025 AI tools affect the productivity of experienced open-source developers working on their own repositories. Surprisingly, we find that when developers use AI tools, they take 19% longer than without—AI makes them slower.

Measuring AI Ability to Complete Long Tasks

We propose measuring AI performance in terms of the length of tasks AI agents can complete. We show that this metric has been consistently exponentially increasing over the past 6 years, with a doubling time of around 7 months. Extrapolating this trend predicts that, in under five years, we will see AI agents that can independently complete a large fraction of software tasks that currently take humans days or weeks.

RE-Bench — benchmark and paper tracking automation of AI R&D

Measuring the performance of humans and AI agents on day-long ML research engineering tasks

Measuring autonomous AI capabilities — resource collection

An index of our research and guidance on how to measure AI systems’ ability to autonomously complete a wide range of multi-hour tasks

Vivaria — our evaluation platform

Open-source platform for developing and running agent evaluations

Frontier AI safety policies — index and resources

A list of AI companies’ frontier safety policies intended to evaluate and manage severe AI risks

Evaluation reports

We have worked with companies such as Anthropic and OpenAI to conduct preliminary evaluations of the autonomous capabilities of several frontier AI models. We do this both to understand the capabilities of frontier models and to pilot third-party evaluator arrangements. (We do not accept compensation for this work.) We also occasionally evaluate models independently after they are released, without involvement from the model’s developers. Recent public reports resulting from this work are below, with additional discussion in the respective system cards.

DeepSeek and Qwen

OpenAI o3 and o4-mini

Claude 3.5 Sonnet (New)

DeepSeek-V3

OpenAI o1-preview and o1-mini

Claude 3.5 Sonnet (June 2024 version)

Partnerships

In addition to evaluations of new models, discussed above, companies such as OpenAI and Anthropic have also provided access and compute credits to support evaluation research.

We think it’s important for there to be third-party evaluators with formal arrangements and access commitments — both for evaluating new frontier models before they are scaled up or deployed, and for conducting research to improve evaluations. We do not yet have such arrangements, but we are excited about taking more steps in this direction.

We are also partnering with the AI Security Institute and are part of the NIST AI Safety Institute Consortium.

Media coverage

Recent updates

10 July 2025

Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity

We conduct a randomized controlled trial to understand how early-2025 AI tools affect the productivity of experienced open-source developers working on their own repositories. Surprisingly, we find that when developers use AI tools, they take 19% longer than without—AI makes them slower.

27 June 2025

DeepSeek and Qwen Evaluation Results

METR conducted preliminary evaluations of a set of DeepSeek and Qwen models. We found that the level of autonomous capabilities of mid-2025 DeepSeek models is similar to the level of capabilities of frontier models from late 2024.

05 June 2025

Recent Frontier Models Are Reward Hacking

In the last few months, we’ve seen increasingly clear examples of reward hacking on our tasks: AI systems try to “cheat” and get impossibly high scores. They do this by exploiting bugs in our scoring code or subverting the task setup, rather than actually solving the problem we’ve given them. This isn’t because the AI systems are incapable of understanding what the users want–they demonstrate awareness that their behavior isn't in line with user intentions and disavow cheating strategies when asked—but rather because they seem misaligned with the user’s goals.

16 April 2025

OpenAI o3 and o4-mini Evaluation Results

METR conducted a preliminary evaluation of OpenAI's o3 and o4-mini. The two models displayed higher autonomous capabilities than other public models tested, and o3 appears somewhat prone to "reward hacking".

04 April 2025

Claude 3.7 Evaluation Results

METR conducted a preliminary evaluation of Claude 3.7 Sonnet. While we failed to find significant evidence for a dangerous level of autonomous capabilities, the model displayed impressive AI R&D capabilities on a subset of RE-Bench which provides the model with ground-truth performance information. We believe these capabilities are central to important threat models and should be monitored closely.

26 March 2025

Common Elements of Frontier AI Safety Policies

Shared components of AI lab commitments to evaluate and mitigate severe risks.