We measured the performance of GPT-4o with simple agent scaffolding on 77 tasks across 30 task families that test autonomous capabilities.
More tasks, human baselines, and preliminary results for GPT-4 and Claude.
Comments on NIST’s draft document “AI Risk Management Framework: Generative AI Profile.”
METR is hiring ML engineers and researchers.
Emma moves from President to Executive Director, and Beth moves to Head of Research.
A collection of resources for evaluating potentially dangerous autonomous capabilities of frontier models.