METR

Model Evaluation and Threat Research

METR is a research nonprofit that works on assessing whether cutting-edge AI systems could pose catastrophic risks to society.

We build the science of accurately assessing risks, so that humanity is informed before developing transformative AI systems.

AI’s Transformative Potential

We believe that AI could change the world quickly and drastically, with potential for both enormous good and enormous harm. We also believe it’s hard to predict exactly when and how this might happen.

At some point, AIs will probably be able to do most of what humans can do, including developing new technologies; starting businesses and making money; finding new cybersecurity exploits and fixes; and more.

We think it’s very plausible that AI systems could end up pursuing goals that are at odds with a thriving civilization. This could happen through someone’s deliberate effort to cause chaos¹, or despite an intention to develop only safe AI systems².

Given how quickly things could play out, we don’t think it’s good enough to “wait and see” whether there are dangers.

We believe in vigilantly, continually assessing risks. If an AI brings significant risk of a global catastrophe, the decision to develop and/or release it can’t lie only with the company that creates it.

Partnerships

We currently partner with Anthropic and OpenAI to develop evaluations using their AI systems, and are exploring partnerships with other AI companies as well.

We are also partnering with the UK AI Safety Institute, and are part of the NIST AI Safety Institute Consortium.

Our Work

Our research is focused on how to evaluate AI systems for dangerous capabilities:

Autonomy Evaluation Resources

These are resources to evaluate AI systems for risks from autonomous capabilities. They include a suite of tasks, software tooling, and an example overall evaluation protocol.

Task Standard

A standard way to define tasks for evaluating the capabilities of AI agents.

Responsible Scaling Policies (RSPs)

A proposal for evals-based catastrophic risk reduction that AI developers can adopt.


  1. For instance, an anonymous user set up ChatGPT with the ability to run code and access the internet, prompted it with the goal of “destroy humanity,” and set it running.

  2. There are reasons to expect goal-directed behavior to emerge in AI systems, and to expect that superficial attempts to align ML systems will result in sycophantic or deceptive behavior (“playing the training game”) rather than successful alignment.