In collaboration with Anthropic, a METR staff member (David Rein) recently spent three weeks red-teaming a subset of Anthropic’s internal agent monitoring and security systems, many of which are described in the Opus 4.6 Sabotage Risk Report (Appendix 8.4, especially 8.4.8). Anthropic provided substantial access to relevant internal systems and information, and made staff available to answer questions and provide feedback throughout the exercise.
The exercise discovered several specific novel vulnerabilities, some of which have since been patched, and none of which severely undermine major claims in the Opus 4.6 Sabotage Risk Report. It also produced several artifacts, including agent trajectories containing covert attacks and a small attack-strategy ideation test set. We expect both of these to be useful for ongoing improvements to Anthropic’s monitoring systems. The resulting 26-page report was shared with Anthropic, and a redacted version was shared with a subset of METR staff. We are exploring ways to incorporate more detailed takeaways from the exercise into future METR risk reports.
This kind of adversarial testing by external researchers is valuable both for discovering vulnerabilities and for developing best practices for embedding third-party evaluators inside frontier AI companies. We hope to do more exercises like this in the future—if you are a frontier AI developer interested in working with us, get in touch at partnerships@metr.org.