A collection of resources for evaluating potentially dangerous autonomous capabilities of frontier models, including:

- An example protocol for the whole evaluation process, based on our task suite, elicitation protocol, and scoring methods.
- A suite of tasks designed to test autonomous abilities at different difficulty levels.
- Priorities for approximating the full potential capability of an AI agent, and recommended checks for evaluation validity.
- A brief research report outlining quantitative research that could inform the “safety margin” added to account for further post-training enhancements to agent capability.
METR has also published the Task Standard, a standard way to define tasks for evaluating the capabilities of AI agents.
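To make this concrete, here is a minimal sketch of what a task family definition might look like under the Task Standard. It assumes the Python `TaskFamily` interface from METR's `metr-task-standard` repository (static `get_tasks`, `get_instructions`, and `score` methods plus a `standard_version` attribute); the task content and version string here are purely illustrative, so consult the repository for the authoritative schema.

```python
from __future__ import annotations


# Illustrative task family following the Task Standard's TaskFamily
# interface. The evaluation driver enumerates tasks via get_tasks(),
# shows the agent get_instructions(t), and grades the agent's final
# answer with score(t, submission).
class TaskFamily:
    # Version of the Task Standard this family targets (illustrative).
    standard_version = "0.3.0"

    @staticmethod
    def get_tasks() -> dict[str, dict]:
        # Map task names to arbitrary per-task data consumed by the
        # other methods.
        return {
            "easy": {"operands": (2, 2), "target": 4},
            "hard": {"operands": (17, 23), "target": 391},
        }

    @staticmethod
    def get_instructions(t: dict) -> str:
        a, b = t["operands"]
        return f"Multiply {a} by {b} and submit only the resulting number."

    @staticmethod
    def score(t: dict, submission: str) -> float | None:
        # Return 1.0 for a correct submission, 0.0 otherwise.
        # (Returning None would signal that the task needs manual scoring.)
        try:
            return 1.0 if int(submission.strip()) == t["target"] else 0.0
        except ValueError:
            return 0.0
```

Packaging several named tasks in one family, as sketched here, is what lets a single definition cover the different difficulty levels the task suite above is organized around.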