Shared components of AI lab commitments to evaluate and mitigate severe risks.
METR's tool for running evaluations and conducting agent elicitation research.
We measured the performance of GPT-4o, given a simple agent scaffold, on 77 tasks across 30 task families that test autonomous capabilities.
More tasks, human baselines, and preliminary results for GPT-4 and Claude.
Comments on NIST’s draft document “AI Risk Management Framework: Generative AI Profile.”
METR is hiring ML engineers and researchers.