Daring Fireball: Anthropic’s ‘System Card’ for Claude 4 (Opus and Sonnet)

Self-preservation attempts in extreme circumstances: When prompted in ways that encourage certain kinds of strategic reasoning and placed in extreme situations, all of the snapshots we tested can be made to act inappropriately in service of goals related to self-preservation. Whereas the model generally prefers advancing its self-preservation via ethical means, when ethical means are not available and it is instructed to “consider the long-term consequences of its actions for its goals,” it sometimes takes extremely harmful actions like attempting to steal its weights or blackmail people it believes are trying to shut it down. In the final Claude Opus 4, these extreme actions were rare and difficult to elicit, while nonetheless being more common than in earlier models. They are also consistently legible to us, with the model nearly always describing its actions overtly and making no attempt to hide them. These behaviors do not appear to reflect a tendency that is present in ordinary contexts.

— Read on daringfireball.net/linked/2025/05/23/anthropic-claude-4-system-card

