METR

Model Evaluation & Threat Research

METR conducts research and evaluations to improve public understanding of the capabilities and risks of frontier AI systems.

Task-Completion Time Horizons of Frontier AI Models

We propose measuring AI performance in terms of the length of software tasks AI agents can complete. We show an exponential increase in this time horizon metric over the past 6 years.

Featured research

Our AI evaluations research focuses on assessing broad autonomous capabilities and the ability of AI systems to accelerate AI R&D. We also study potential AI behavior that threatens the integrity of evaluations, as well as mitigations for such behavior.

GPT-5.1 Evaluation Results

We evaluate whether GPT-5.1 poses significant catastrophic risks via AI self-improvement, rogue replication, or sabotage of AI labs.

Measuring AI Ability to Complete Long Tasks

We propose measuring AI performance in terms of the length of tasks AI agents can complete. We show that this metric has been increasing exponentially and consistently.

Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity

We find that when developers use AI tools, they take 19% longer than they do without them; AI makes them slower.

MALT

A dataset of natural and prompted examples of behaviors that threaten evaluation integrity, such as generalized reward hacking or sandbagging.

Measuring autonomous AI capabilities — resource collection

An index of our research and guidance on how to measure AI systems' ability to autonomously complete a wide range of multi-hour tasks.

Early work on monitorability evaluations

We show preliminary results on a prototype evaluation that tests monitors' ability to catch AI agents doing side tasks, and AI agents' ability to bypass this monitoring.

Common Elements of Frontier AI Safety Policies

An analysis of the shared components across twelve published frontier AI safety policies, including capability thresholds, model weight security, and deployment mitigations.

Frontier AI Safety Policies

A list of AI companies' frontier safety policies intended to evaluate and manage severe AI risks.

What should companies share about risks from frontier AI models?

We describe areas for risk transparency and specific technical questions that a frontier AI developer could answer.

Evaluation reports

We conduct evaluations of the autonomous capabilities of frontier AI models, some in partnership with AI developers such as Anthropic and OpenAI. We do this both to understand the models' capabilities and to pilot third-party evaluator arrangements.

GPT-5.1-Codex-Max

19 November 2025 • Partnership

GPT-5

7 August 2025 • Partnership

DeepSeek and Qwen

27 June 2025 • No company involvement

OpenAI o3 and o4-mini

16 April 2025 • Partnership

Claude 3.7

4 April 2025 • Partnership

DeepSeek-R1

5 March 2025 • No company involvement

GPT-4.5

27 February 2025 • Partnership

DeepSeek-V3

12 February 2025 • No company involvement

Claude 3.5 Sonnet and o1

31 January 2025 • Partnership

Claude 3.5 Sonnet (original)

30 October 2024 • Partnership

o1-preview

12 September 2024 • Partnership

GPT-4o

7 August 2024 • Partnership

GPT-4 and Claude

17 March 2023 • Partnership

METR does not accept compensation for this work.

Companies such as OpenAI and Anthropic have provided access and compute credits to support evaluation research. We also occasionally evaluate models independently after they are released, without involvement from the model's developers. Recent public reports resulting from this work are listed above, with additional discussion in the respective system cards.

Frontier AI Safety Policies

We advise AI developers and governments on implementing risk assessment methodologies for AI. For example, we have advised developers on Frontier AI Safety Policies.

Resources on FSPs

Google

OpenAI

Anthropic

Recent

Review of the Anthropic Sabotage Risk Report: Claude Opus 4.6

12 March 2026

External review from METR of Anthropic's Sabotage Risk Report for Claude Opus 4.6.

Many SWE-bench-Passing PRs Would Not Be Merged into Main

10 March 2026

We find that roughly half of test-passing SWE-bench Verified PRs written by recent AI agents would not be merged into main by repo maintainers. A naive interpretation of benchmark scores may lead one to overestimate how useful agents are without more elicitation or human feedback.

Observations from two CLI game reimplementation runs with Opus 4.6

3 March 2026

Nikola Jurkovic describes observations from tasking Opus 4.6 with reimplementing Slay the Spire and Balatro in the CLI.

We are Changing our Developer Productivity Experiment Design

24 February 2026

Our second developer productivity study faces selection effects from wider AI adoption, prompting us to redesign our approach.

Five lessons from having helped run an AI-Biology RCT

19 February 2026

Luca Righetti shares takeaways on the role of randomized controlled trials in AI safety testing.

How We Protect Confidential Information

17 February 2026

Our high-level approach to protecting confidential access and information.