A new attack technique named Policy Puppetry can bypass the protections of major gen-AI models and coax them into producing harmful outputs.
A study co-authored by researchers at Anthropic finds that AI models can be trained to deceive, and that this deceptive behavior is difficult to combat.