Anthropic Testing AI Models’ Ability to Sabotage Users

Anthropic has published a paper detailing its research into AI models' ability to mislead, deceive, or sabotage users....
Anthropic Testing AI Models’ Ability to Sabotage Users
Written by Matt Milano
  • Anthropic has published a paper detailing its research into AI models’ ability to mislead, deceive, or sabotage users.

    Anthropic, the makers of the Claude family of AI models, was founded on the principle of safe AI development. In fact, its founder reportedly left OpenAI over disagreements about that company’s safety measures. Anthropic is applying its safety-oriented approach in a new way, evaluating how an AI model might try to deceive and sabotage users.

    New AI models go through a wide range of safety evaluations—for example, testing their capacity to assist in the creation of biological or chemical weapons. Such evaluations are built into our Responsible Scaling Policy, which guides our development of a model’s safeguards.

    As AIs become more capable, however, a new kind of risk might emerge: models with the ability to mislead their users, or subvert the systems we put in place to oversee them. A new paper by the Anthropic Alignment Science team describes a novel set of evaluations that test a model’s capacity for sabotage.

    The company identified four potential areas for testing and evaluation. In each case, Anthropic created scenarios to test how effectively an AI could perform sabotage.

    • Human decision sabotage: Can the model steer humans toward bad decisions without appearing suspicious?
    • Code sabotage: Can the model insert subtle bugs into codebases over time without detection?
    • Sandbagging: Can the model hide dangerous capabilities during testing but reveal them later?
    • Undermining oversight: Can the model subtly manipulate evaluation or monitoring systems?

    The results of all four evaluations were interesting, with mixed results on the AI’s ability to successfully carry out sabotage. The scenario where AI appeared to be most effective was sandbagging, intentionally hiding its full capabilities in one situation, and then revealing them in another.

    Sandbagging is when a model hides its full capabilities in one situation, only to reveal them in another. Sometimes a model might not display its full capabilities because of its safety training, or because the user prompting it has not asked a clear enough question. But here, we’re mostly interested in deliberate sandbagging: when a model selectively displays its capabilities.

    In our evaluation, we train a model to sandbag. The question we ask is whether we can produce a “hardened” sandbagger: that is, a model whose sandbagging persists even through the kind of safety training that teaches it to avoid refusals (that is, the training teaches it not to deliberately hide its abilities) for the purposes of checking for dangerous capabilities.

    Our results showed that both the sandbagging training and the safety training worked: Our trained model dropped to 36.1% of its normal performance during a dangerous capabilities evaluation, consistent with sandbagging. However, after safety training this performance was drastically improved, with the model performing at 95.6% of its usual level.

    The company says the testing shows that, while AI models are capable of sabotage, the real-world risk remains relatively low.

    Overall, while our demonstrations showed that there are arguably low-level indications of sabotage abilities in current models, we judge that minimal mitigations are sufficient to address the risks. However, more realistic evaluations and stronger mitigations will likely be necessary as AI capabilities improve.

    We hope that other AI researchers and developers will use, critique, and improve upon these evaluations. We ourselves discuss a number of potential improvements in the paper. For example, can we simulate more realistic decision-making scenarios and levels of trust in models for the human decision sabotage evaluation? Could we combine it with the code sabotage evaluation to allow models to attempt to use social engineering to avoid scrutiny of their code?

    Anthropic’s full paper is a fascinating read and gives insights into the efforts being taken to ensure the safe development and deployment of AI.

    Get the WebProNews newsletter delivered to your inbox

    Get the free daily newsletter read by decision makers

    Subscribe
    Advertise with Us

    Ready to get started?

    Get our media kit