METR

Model Evaluation and Threat Research

Building the science of measuring dangerous capabilities in AIs.

01

Our work

METR works on assessing whether cutting-edge AI systems could pose catastrophic risks to society.

We believe that AI could change the world quickly and drastically, with potential for both enormous good and enormous harm. We also believe it’s hard to predict exactly when and how this might happen.

Rather than advocate for “faster” or “slower” AI progress, we aim to build the science of accurately assessing risks, so that AI can be used when it’s clearly safe to do so.

We are currently working on evaluating AI systems for dangerous autonomous capabilities. We’ve released various resources to help with conducting evaluations, including:

  • An example protocol for the evaluation process, based on our task suite, elicitation protocol, and scoring methods
  • Task suite: A few examples of tasks from the evaluation suite referenced in the protocol
  • Task standard and “workbench”: Standard for specifying tasks in code (see the illustrative sketch below this list), and very basic functionality for running agents on tasks
    • For running evaluations at scale and other improved functionality, contact us at [email protected] about getting access to our full evaluations platform
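
As a rough illustration of what “specifying tasks in code” can look like, here is a minimal sketch of a task family in Python. The class and method names below (TaskFamily, get_tasks, get_instructions, score) are assumptions chosen for illustration and may not match the Task Standard’s actual interface; consult the Task Standard itself for the authoritative definition.

```python
# Minimal, hypothetical sketch of a task family defined in code.
# The class and method names are illustrative assumptions, not the
# authoritative Task Standard interface.
from typing import TypedDict


class Task(TypedDict):
    instructions: str
    expected_answer: str


class TaskFamily:
    # Hypothetical marker for the spec version this family targets.
    standard_version = "0.1.0"

    @staticmethod
    def get_tasks() -> dict[str, Task]:
        # Named task variants that an agent can be run against.
        return {
            "easy": {
                "instructions": "What is 2 + 2? Reply with just the number.",
                "expected_answer": "4",
            },
        }

    @staticmethod
    def get_instructions(t: Task) -> str:
        # Text shown to the agent at the start of a run.
        return t["instructions"]

    @staticmethod
    def score(t: Task, submission: str) -> float:
        # 1.0 for a correct submission, 0.0 otherwise.
        return 1.0 if submission.strip() == t["expected_answer"] else 0.0
```

A “workbench” or evaluations platform would then import such a family, start a run for a chosen task, pass the instructions to an agent, and call score on whatever the agent submits.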

For a more detailed list of available resources, see here.

02

Blog

03

Partnerships

We currently partner with Anthropic and OpenAI to develop evaluations using their AI systems, and are exploring partnerships with other AI companies as well.

We are also partnering with the UK AI Safety Institute, and are part of the NIST AI Safety Institute Consortium.

04

AI’s Transformative Potential

We believe that at some point, AIs will probably be able to do most of what humans can do, including developing new technologies; starting businesses and making money; finding new cybersecurity exploits and fixes; and more.

We think it’s very plausible that AI systems could end up misaligned: pursuing goals that are at odds with a thriving civilization. This could be due to a deliberate effort to cause chaos [1], or (more likely) may happen in spite of efforts to make systems safe [2]. If we do end up with misaligned AI systems, the number, capabilities and speed of such AIs could cause enormous harm - plausibly a global catastrophe. If these systems could also autonomously replicate and resist attempts to shut them down, it would seem difficult to put an upper limit on the potential damage.

Given how quickly things could play out, we don’t think it’s good enough to “wait and see” whether there are dangers.

We believe in vigilantly, continually assessing risks. If an AI brings significant risk of a global catastrophe, the decision to develop and/or release it can’t lie only with the company that creates it.

  1. For instance, an anonymous user set up ChatGPT with the ability to run code and access the internet, prompted it with the goal of “destroy humanity,” and set it running.

  2. There are reasons to expect goal-directed behavior to emerge in AI systems, and to expect that superficial attempts to align ML systems will result in sycophantic or deceptive behavior - “playing the training game” - rather than successful alignment.