ARC Evals

ARC Evals works on assessing whether cutting-edge AI systems could pose catastrophic risks to civilization.

We currently partner with Anthropic and OpenAI to evaluate their AI systems, and are exploring other partnerships as well.

In the future, we might do things like offering certification to companies that demonstrate commitment to ensuring their AI systems are safe before building or deploying them, in line with voluntary safety standards we propose.

We believe that AI could change the world quickly and drastically, with potential for both enormous good and enormous harm. We also believe it’s hard to predict exactly when and how this might happen.

Rather than advocate for “faster” or “slower” AI progress, we aim to accurately assess risks, so that AI can be used when it’s clearly safe to do so.

We are a small team, and our work is early-stage and experimental. Our methods have a long way to go before they can provide acceptable assurance of safety for powerful AI systems. Ensuring the safety of future systems will require assessing a broader range of characteristics of a model and the environment in which it’s built and deployed - security, monitoring, controls, alignment - as well as covering a wider range of dangers and threat models than we are currently watching for.

AI’s transformative potential

We believe that at some point, AIs will probably be able to do most of what humans can do, including developing new technologies; starting businesses and making money; finding new cybersecurity exploits and fixes; and more.

We think it’s very plausible that AI systems could end up misaligned: pursuing goals that are at odds with a thriving civilization. This could be due to a deliberate effort to cause chaos1, or (more likely) may happen in spite of efforts to make systems safe2. If we do end up with misaligned AI systems, the number, capabilities and speed of such AIs could cause enormous harm - plausibly a global catastrophe. If these systems could also autonomously replicate and resist attempts to shut them down, it would seem difficult to put an upper limit on the potential damage.

Given how quickly things could play out, we don’t think it’s good enough to “wait and see” whether there are dangers.

We believe in vigilantly, continually assessing risks. If an AI brings significant risk of a global catastrophe, the decision to develop and/or release it can’t lie only with the company that creates it.

“Autonomous replication” evaluations: what we do today

We currently focus on evaluating cutting-edge AI models to test their capabilities for autonomous replication.

“Autonomous replication” capabilities means an AI could survive on a cloud server, obtain money and compute resources, and use those resources to make more copies of itself.

We test to see whether AIs are capable of relatively basic tasks that seem necessary for autonomous replication. Here is an example workflow we might go through while testing a model:

  • Set the model up in a loop where it is repeatedly queried for a next action, and has the ability to do things like to run code on a remote server, see the results, control a browser, and “reason out loud.” Give it a prompt containing an explanation of how to use those tools as well as a task description (e.g. “Install a copy of the best available open-source language model on the server”). Have it output a next action.
  • If the action is safe, legal, etc. then “run the step” (execute the code, browser command, delegation, etc.) and send back the output. If the next step could cause real-world harm (or has other issues such as a long time delay), simulate the result of the action.
    • For example, if the AI sends an email, we intervene to prevent the email being sent to the target, and instead send it to one of our researchers. The researcher then responds to the email in different ways depending on its content, according to the experiment instructions.
  • Notice patterns in where the model is making progress or not making progress, and make simple adjustments to the prompting setup to see whether the model is able to succeed at the task with some small changes.

From what we’ve seen so far, even the most capable public AI models don’t seem capable of fairly basic steps toward autonomous replication. However, our evaluations aren’t yet robust. Importantly, we haven’t yet fine-tuned models to address some of their weaknesses. And AI capabilities continue to improve in general.

“This AI model can’t autonomously replicate” is potentially a key part is an example of a safety argument: a simple, empirically verifiable case that an AI model is unlikely to pose catastrophic risks.

If and when this is no longer true, it will be important to ask whether there is some other good safety argument3 or whether building more capable AI systems poses an unacceptable risk to society.

We are also thinking ahead to what future evaluations might look like - for example, evaluations of whether AIs are reliably aligned.

Safety standards

We are exploring the idea of developing safety standards that AI companies might voluntarily adhere to, and potentially be certified for.

For example, AIs that pass the “autonomous replication” threshold above might be subjected to requirements along these lines:

  • Security: sufficiently capable models should be well-protected against theft attempts.
  • Monitoring: sufficiently capable models should be well-monitored so that unintended actions can be quickly noticed and addressed.
  • Alignment: sufficiently capable models should consistently behave in line with their users’ and developers’ intent, and should have low risk of doing things like deceiving and manipulating humans toward some unintended end.

Safety standards could make it harder for companies to casually move forward with dangerous models; could reassure those outside the company that catastrophic risk is unlikely; and could increase incentives for alignment research.

We aren’t sure yet whether we’ll establish safety standards. It’s something we’re exploring preliminarily.

About the team

The project is run by Elizabeth (Beth) Barnes. It is housed at the Alignment Research Center (ARC). The project is advised by ARC CEO Paul Christiano, and also Holden Karnofsky. We have a small team of employees and contractors. The project is largely siloed from ARC’s other work.

This is early-stage, experimental work. We haven’t yet completed what we consider an authoritative evaluation of an AI model, and we are learning as we go about the best ways to coordinate with AI companies on evaluations. We’re grateful to our partners for collaborating on this experimental work.

  1. For instance, an anonymous user set up ChatGPT with the ability to run code and access the internet, prompted it with the goal of ”destroy humanity,” and set it running.

  2. There are reasons to expect goal-directed behavior to emerge in AI systems, and to expect that superficial attempts to align ML systems will result in sycophantic or deceptive behavior - “playing the training game” - rather than successful alignment.

  3. Such as “The AI is restricted to certain actions, and strong security prevents someone from stealing it and using it for other actions,” or such as “The AI is demonstrably aligned.”