METR's tool for running evaluations and conducting agent elicitation research.
We measured the performance of GPT-4o, using a simple agent scaffold, on 77 tasks across 30 task families that test autonomous capabilities.
More tasks, human baselines, and preliminary results for GPT-4 and Claude.
A collection of resources for evaluating potentially dangerous autonomous capabilities of frontier models.
An example protocol for the end-to-end evaluation process, based on our task suite, elicitation protocol, and scoring methods.
A suite of tasks designed to test autonomous abilities at different difficulty levels.