METR research - page 3

19 March 2025

Measuring AI Ability to Complete Long Tasks

We propose measuring AI performance in terms of the *length* of tasks AI agents can complete. We show that this metric has increased exponentially and consistently over the past 6 years, with a doubling time of around 7 months. Extrapolating this trend predicts that, in under a decade, AI agents will be able to independently complete a large fraction of software tasks that currently take humans days or weeks.
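
To make the extrapolation concrete, here is a minimal sketch of the arithmetic behind a fixed doubling time. Only the 7-month doubling time comes from the summary above. The function names, the 1-hour starting horizon, and the 40-hour target are hypothetical values chosen for illustration, not figures from the paper.

```python
import math

# Taken from the summary above ("a doubling time of around 7 months").
DOUBLING_TIME_MONTHS = 7.0


def projected_horizon(start_hours: float, months_ahead: float,
                      doubling_months: float = DOUBLING_TIME_MONTHS) -> float:
    """Task length (in hours) after `months_ahead`, assuming pure exponential growth."""
    return start_hours * 2 ** (months_ahead / doubling_months)


def months_until(start_hours: float, target_hours: float,
                 doubling_months: float = DOUBLING_TIME_MONTHS) -> float:
    """Months needed to grow from `start_hours` to `target_hours` at the same rate."""
    return doubling_months * math.log2(target_hours / start_hours)


if __name__ == "__main__":
    start = 1.0    # hypothetical starting task length, in hours
    target = 40.0  # hypothetical target: one 40-hour work week
    print(f"Horizon after 24 months: {projected_horizon(start, 24):.1f} hours")
    print(f"Months to reach {target:.0f} hours: {months_until(start, target):.1f}")
```

Under these assumed numbers, going from 1 hour to 40 hours takes log2(40), or about 5.3 doublings, which is roughly 37 months. The sketch assumes the trend continues unchanged, which is the same assumption the extrapolation itself rests on.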

17 March 2025

HCAST: Human-Calibrated Autonomy Software Tasks

We share details about HCAST (Human-Calibrated Autonomy Software Tasks), a benchmark we’ve been developing to measure how well frontier AI systems can complete diverse software tasks autonomously.

5 March 2025

DeepSeek-R1 Evaluation Results

We evaluated DeepSeek-R1 for dangerous autonomous capabilities and found no evidence of dangerous capabilities beyond those of existing models such as Claude 3.5 Sonnet and GPT-4o. Interestingly, we found that it did not do substantially better than DeepSeek-V3 on our autonomy suite.

27 February 2025

METR’s GPT-4.5 pre-deployment evaluations

Additional details about our evaluations of GPT-4.5, and some discussion about the limitations of pre-deployment evaluations and current evaluation methodologies.

14 February 2025

Measuring Automated Kernel Engineering

We measured the performance of frontier models at writing GPU kernels. With a small amount of scaffolding, we found that the best model can provide an average speedup on KernelBench of 1.8x.

12 February 2025

DeepSeek-V3 Evaluation Results

We evaluated DeepSeek-V3 for dangerous autonomous capabilities and found no evidence of dangerous capabilities beyond those of existing models such as Claude 3.5 Sonnet and GPT-4o. We also confirmed that its performance on GPQA is not due to training data contamination.
