DeepSeek-R1 Evaluation Results
We evaluated DeepSeek-R1 for dangerous autonomous capabilities and found no evidence of dangerous capabilities beyond those of existing models such as Claude 3.5 Sonnet and GPT-4o. Notably, it did not perform substantially better than DeepSeek-V3 on our autonomy suite.