Model Evaluation & Threat Research
METR conducts research and evaluations to improve public understanding of the capabilities and risks of frontier AI systems.
Time Horizon 1.1 (Current)
Follows the same methodology described in the initial paper, but with a larger task suite. See release announcement.
Time Horizon 1.0 (Mar 2025)
The original time horizon computations, calculated for models released from 2019 through Nov 2025 following the methods described in the original time horizon paper.
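For intuition, here is a minimal sketch of the core computation, assuming the logistic-fit approach described in the time horizon paper: a model's per-task successes are regressed against the log of the human completion time, and the 50% time horizon is the task length at which the fitted success probability crosses 50%. The data, library choice, and variable names below are illustrative, not METR's actual code or results.

```python
# Minimal sketch (not METR's actual code): estimate a model's 50% time horizon
# by fitting a logistic curve of task success against log human task duration.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-task data: human completion time in minutes, and whether
# the agent succeeded (1) or failed (0) on that task.
human_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480])
agent_success = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0])

# Fit P(success) as a logistic function of log2(human time).
X = np.log2(human_minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, agent_success)

# The 50% time horizon is where the fitted curve crosses 0.5, i.e. where the
# logit  intercept + coef * log2(t)  equals zero.
log2_horizon = -clf.intercept_[0] / clf.coef_[0][0]
print(f"Estimated 50% time horizon: {2 ** log2_horizon:.1f} human-minutes")
```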

Measuring AI Ability to Complete Long Tasks

19 March 2025 — We propose measuring AI performance in terms of the length of tasks AI agents can complete. We show that this metric has been consistently exponentially increasing over the past 6 years, with a doubling time of around 7 months.

Read paper · View repo · Latest time horizon results and FAQs
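As a rough illustration of how a doubling time falls out of such measurements (illustrative numbers only, not METR's data): fitting a line to log2(time horizon) against release date gives the number of doublings per year, and its reciprocal is the doubling time.

```python
# Minimal sketch with made-up (release year, 50% time horizon) points: a ~7-month
# doubling time corresponds to the slope of log2(time horizon) vs. release date.
import numpy as np

years = np.array([2019.5, 2021.0, 2022.5, 2024.0, 2025.0])
horizon_minutes = np.array([0.05, 0.3, 2.0, 15.0, 50.0])

# Linear regression on log2(horizon): slope = doublings per year.
slope, intercept = np.polyfit(years, np.log2(horizon_minutes), 1)
doubling_time_months = 12.0 / slope
print(f"Doubling time: {doubling_time_months:.1f} months")
```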

Evaluation reports

We conduct evaluations of the autonomous capabilities of frontier AI models, some of them in partnership with AI developers such as Anthropic and OpenAI. We do this both to understand the models' capabilities and to pilot third-party evaluator arrangements.

View all evaluation reports

METR does not accept compensation for this work.

Companies such as OpenAI and Anthropic have provided access and compute credits to support this evaluation research. We also occasionally evaluate models independently after they are released, without involvement from the model's developers. Recent public reports resulting from this work are linked above, with additional discussion in the respective system cards.

Frontier AI Safety Policies

We advise AI developers and governments on implementing risk assessment methodologies for AI. For example, we have advised developers on Frontier AI Safety Policies (FSPs).

Resources on FSPs

METR in the press

View all press coverage

Recent