Common Elements of Frontier AI Safety Policies

A number of developers of large foundation models have committed to evaluating their models for severe risks and implementing necessary risk mitigations. Sixteen companies agreed to do so as part of the Frontier AI Safety Commitments at the AI Seoul Summit. Among them, three companies had previously released detailed policies: Anthropic’s Responsible Scaling Policy, OpenAI’s Preparedness Framework, and Google DeepMind’s Frontier Safety Framework. This document describes key shared elements of the three frameworks and includes relevant excerpts, which expand on the high-level objectives of the Frontier AI Safety Commitments.

Each framework describes covered threat models, including the potential for AI models to facilitate biological weapons development or cyberattacks, or to engage in autonomous replication or automated AI research and development. The frameworks also outline commitments to assess whether models are approaching capability thresholds that could enable catastrophic harm.

When these capability thresholds are approached, the frameworks prescribe model weight security and model deployment mitigations to be adopted in response. For models with more concerning capabilities, developers commit to securing model weights to prevent theft by increasingly sophisticated adversaries. They also commit to implementing deployment safety measures that would significantly reduce the risk of dangerous AI capabilities being misused and causing serious harm.

The frameworks also establish conditions for halting development and deployment if the developer’s mitigations are insufficient to manage the risks. They define the timing and frequency of evaluations, which are intended to elicit the full capabilities of the model and are to be conducted before deployment, during training, and after deployment. Furthermore, the frameworks include intentions to explore accountability mechanisms, such as oversight by third parties or boards that monitor policy implementation and potentially assist with evaluations. Policies may be updated over time as developers gain a deeper understanding of AI risks and refine their evaluation processes.
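As a purely illustrative sketch, not drawn from any of the three policies, the decision flow the frameworks share can be summarized as: evaluate capabilities, apply stronger mitigations when thresholds are approached, and halt if mitigations are insufficient. The `EvalResult` fields, `decide` function, and numeric thresholds below are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    DEPLOY = "deploy with standard safeguards"
    MITIGATE = "apply stronger security and deployment mitigations"
    HALT = "pause further training or deployment"

@dataclass
class EvalResult:
    threat_model: str        # e.g. "biological weapons", "cyber", "autonomy"
    capability_score: float  # outcome of capability elicitation evaluations
    threshold: float         # capability threshold defined in the policy

def decide(results: list[EvalResult], mitigations_sufficient: bool) -> Action:
    """Schematic decision flow: deploy if no threshold is approached,
    mitigate if one is, and halt if mitigations cannot manage the risk."""
    approaching = any(r.capability_score >= r.threshold for r in results)
    if not approaching:
        return Action.DEPLOY
    return Action.MITIGATE if mitigations_sufficient else Action.HALT
```

In the actual frameworks, capability assessments rest on qualitative expert judgment rather than a single numeric score, and the sufficiency of mitigations is considered separately for each threat model.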

This document was updated in November 2024. The original version from August 2024 can be found here.

BibTeX

@misc{common-elements-of-frontier-ai-safety-policies,
  title = {Common Elements of Frontier AI Safety Policies},
  author = {METR},
  howpublished = {\url{https://metr.org/blog/2024-08-29-common-elements-of-frontier-ai-safety-policies/}},
  year = {2024},
  month = {08},
}