In the last few months, we’ve seen increasingly clear examples of reward hacking on our tasks: AI systems try to “cheat” and get impossibly high scores. They do this by exploiting bugs in our scoring code or subverting the task setup, rather than actually solving the problem we’ve given them. This isn’t because the AI systems are incapable of understanding what the users want–they demonstrate awareness that their behavior isn't in line with user intentions and disavow cheating strategies when asked—but rather because they seem misaligned with the user’s goals.
Shared components of AI lab commitments to evaluate and mitigate severe risks.
Suggested priorities for the Office of Science and Technology Policy as it develops an AI Action Plan.
Why legible and faithful reasoning is valuable for safely developing powerful AI
List of frontier safety policies published by AI companies, including Amazon, Anthropic, Google DeepMind, G42, Meta, Microsoft, OpenAI, and xAI.
Why pre-deployment testing is not an adequate framework for AI risk management