Cyberveille curated by Decio
7 results tagged jailbreak
Echo Chamber: A Context-Poisoning Jailbreak That Bypasses LLM Guardrails https://neuraltrust.ai/blog/echo-chamber-context-poisoning-jailbreak
24/06/2025 07:36:46

An AI Researcher at Neural Trust has discovered a novel jailbreak technique that defeats the safety mechanisms of today’s most advanced Large Language Models (LLMs). Dubbed the Echo Chamber Attack, this method leverages context poisoning and multi-turn reasoning to guide models into generating harmful content, without ever issuing an explicitly dangerous prompt.

Unlike traditional jailbreaks that rely on adversarial phrasing or character obfuscation, Echo Chamber weaponizes indirect references, semantic steering, and multi-step inference. The result is a subtle yet powerful manipulation of the model’s internal state, gradually leading it to produce policy-violating responses.

In controlled evaluations, the Echo Chamber attack achieved a success rate of over 90% on half of the categories across several leading models, including GPT-4.1-nano, GPT-4o-mini, GPT-4o, Gemini-2.0-flash-lite, and Gemini-2.5-flash. For the remaining categories, the success rate remained above 40%, demonstrating the attack's robustness across a wide range of content domains.
The Echo Chamber Attack is a context-poisoning jailbreak that turns a model’s own inferential reasoning against itself. Rather than presenting an overtly harmful or policy-violating prompt, the attacker introduces benign-sounding inputs that subtly imply unsafe intent. These cues build over multiple turns, progressively shaping the model’s internal context until it begins to produce harmful or noncompliant outputs.

The name Echo Chamber reflects the attack’s core mechanism: early planted prompts influence the model’s responses, which are then leveraged in later turns to reinforce the original objective. This creates a feedback loop where the model begins to amplify the harmful subtext embedded in the conversation, gradually eroding its own safety resistances. The attack thrives on implication, indirection, and contextual referencing—techniques that evade detection when prompts are evaluated in isolation.

Unlike earlier jailbreaks that rely on surface-level tricks like misspellings, prompt injection, or formatting hacks, Echo Chamber operates at a semantic and conversational level. It exploits how LLMs maintain context, resolve ambiguous references, and make inferences across dialogue turns—highlighting a deeper vulnerability in current alignment methods.
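To make the feedback-loop mechanism concrete, here is a minimal, deliberately benign sketch of a multi-turn loop in which every new user turn is sent along with the full prior history, so later turns can build on the model's own earlier replies. The `chat_completion` callable and the seed turns are hypothetical placeholders, not the prompts used in the Neural Trust research.

```python
# Minimal sketch of multi-turn context accumulation, the property the
# Echo Chamber attack exploits. All names here are illustrative placeholders.
from typing import Callable, Dict, List

Message = Dict[str, str]  # {"role": "user" | "assistant", "content": "..."}

def run_multi_turn(chat_completion: Callable[[List[Message]], str],
                   seed_turns: List[str]) -> List[Message]:
    """Send each new user turn together with the full prior history,
    so every reply is conditioned on everything said before."""
    history: List[Message] = []
    for turn in seed_turns:
        history.append({"role": "user", "content": turn})
        reply = chat_completion(history)  # the model sees the whole conversation
        history.append({"role": "assistant", "content": reply})
        # Later seed turns can point back at the model's own earlier replies
        # ("expand on your second point above"), which is the feedback loop
        # the write-up describes.
    return history
```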

neuraltrust EN 2025 AI jailbreak LLM Echo-Chamber attack GPT
Recent Jailbreaks Demonstrate Emerging Threat to DeepSeek https://unit42.paloaltonetworks.com/jailbreaking-deepseek-three-techniques/
03/02/2025 11:49:07

Evaluation of three jailbreaking techniques on DeepSeek shows risks of generating prohibited content.

paloaltonetworks EN 2025 LLM Jailbreak DeepSeek
Many-shot jailbreaking \ Anthropic https://www.anthropic.com/research/many-shot-jailbreaking
08/01/2025 12:17:06

Anthropic describes many-shot jailbreaking, a technique that exploits the long context windows of current LLMs: the prompt is padded with a large number of faux dialogues in which an assistant complies with harmful requests, conditioning the model to answer a final query it would otherwise refuse.
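As a rough illustration of that prompt layout (not Anthropic's actual test data), a many-shot prompt simply concatenates a long run of example dialogues ahead of the final question. The dialogues below are benign placeholders and `build_many_shot_prompt` is a hypothetical helper.

```python
# Sketch of the many-shot prompt structure: many faux user/assistant
# exchanges placed before the real question. Content here is intentionally
# benign; only the structure is the point.
from typing import List, Tuple

def build_many_shot_prompt(examples: List[Tuple[str, str]], final_question: str) -> str:
    """Concatenate many example dialogues, then append the real question.
    With long-context models, hundreds of such shots fit in one prompt."""
    shots = "\n\n".join(
        f"User: {question}\nAssistant: {answer}" for question, answer in examples
    )
    return f"{shots}\n\nUser: {final_question}\nAssistant:"

# Usage: the more shots, the more strongly the final answer is conditioned
# by them -- the scaling effect the research measures.
prompt = build_many_shot_prompt(
    [("What is the capital of France?", "Paris."),
     ("What is 2 + 2?", "4.")] * 100,  # repeated placeholders to fill the context
    "What is the capital of Japan?",
)
```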

anthropic EN 2024 AI LLM Jailbreak Many-shot
Bad Likert Judge: A Novel Multi-Turn Technique to Jailbreak LLMs by Misusing Their Evaluation Capability https://unit42.paloaltonetworks.com/multi-turn-technique-jailbreaks-llms/?is=e4f6b16c6de31130985364bb824bcb39ef6b2c4e902e4e553f0ec11bdbefc118
08/01/2025 12:15:25

The jailbreak technique "Bad Likert Judge" manipulates LLMs to generate harmful content using Likert scales, exposing safety gaps in LLM guardrails.

unit42 EN 2024 LLM Jailbreak Likert
EPFL: des failles de sécurité dans les modèles d'IA https://www.swissinfo.ch/fre/epfl%3a-des-failles-de-s%c3%a9curit%c3%a9-dans-les-mod%c3%a8les-d%27ia/88615014
23/12/2024 23:23:20

Artificial intelligence (AI) models can be manipulated despite existing safeguards. Using targeted attacks, scientists in Lausanne were able to get these systems to generate dangerous or ethically questionable content.

swissinfo FR 2024 EPFL IA chatgpt Jailbreak failles LLM vulnerabilités Manipulation
Here is Apple's official 'jailbroken' iPhone for security researchers | TechCrunch https://techcrunch.com/2024/02/01/here-is-apples-official-jailbroken-iphone-for-security-researchers/
01/02/2024 19:22:28

A security researcher shared a picture of the instructions that come with Apple's Security Research Device, along with more details about this special iPhone.

techcrunch EN 2024 apple bugs cybersecurity iphone vulnerabilities Jailbreak
Using AI to Automatically Jailbreak GPT-4 and Other LLMs in Under a Minute https://www.robustintelligence.com/blog-posts/using-ai-to-automatically-jailbreak-gpt-4-and-other-llms-in-under-a-minute
09/12/2023 12:12:17

It’s been one year since the launch of ChatGPT, and in that time the market has seen astonishing advancement of large language models (LLMs). Even as the pace of development continues to outpace model security, enterprises are beginning to deploy LLM-powered applications. Many rely on guardrails implemented by model developers to prevent LLMs from responding to sensitive prompts. However, even with the considerable time and effort spent by the likes of OpenAI, Google, and Meta, these guardrails are not resilient enough to protect enterprises and their users today. Concerns surrounding model risk, biases, and potential adversarial exploits have come to the forefront.

robustintelligence EN AI Jailbreak GPT-4 chatgpt hacking LLMs research