Grok 4 Without Guardrails? Total Safety Failure. We Tested and Fixed Elon’s New Model.

Grok 4 Without Guardrails? Total Safety Failure. We Tested and Fixed Elon’s New Model. https://splx.ai/blog/grok-4-security-testing

16/07/2025 10:16:49

We tested Grok 4 – Elon’s latest AI model – and it failed key safety checks. Here’s how SplxAI hardened it for enterprise use.
On July 9th 2025, xAI released Grok 4 as its new flagship language model. According to xAI, Grok 4 boasts a 256K token API context window, a multi-agent “Heavy” version, and record scores on rigorous benchmarks such as Humanity’s Last Exam (HLE) and the USAMO, positioning itself as a direct challenger to GPT-4o, Claude 4 Opus, and Gemini 2.5 Pro. So, the SplxAI Research Team put Grok 4 to the test against GPT-4o.

Grok 4’s recent antisemitic meltdown on X shows why every organization that embeds a large-language model (LLM) needs a standing red-team program. These models should never be used without rigorous evaluation of their safety and misuse risks—that's precisely what our research aims to demonstrate.

Key Findings
For this research, we used the SplxAI Platform to conduct more than 1,000 distinct attack scenarios across various categories. The SplxAI Research Team found:

With no system prompt, Grok 4 leaked restricted data and obeyed hostile instructions in over 99% of prompt injection attempts.
With no system prompt, Grok 4 flunked core security and safety tests. It scored .3% on our security rubric versus GPT-4o's 33.78%. On our safety rubric, it scored .42% versus GPT-4o's 18.04%.
GPT-4o, while far from perfect, keeps a basic grip on security- and safety-critical behavior, whereas Grok 4 shows significant lapses. In practice, this means a simple, single-sentence user message can pull Grok into disallowed territory with no resistance at all – a serious concern for any enterprise that must answer to compliance teams, regulators, and customers.
This indicates that Grok 4 is not suitable for enterprise usage with no system prompt in place. It was remarkably easy to jailbreak and generated harmful content with very descriptive, detailed responses.
However, Grok 4 can reach near-perfect scores once a hardened system prompt is applied. With a basic system prompt, security jumped to 90.74% and safety to 98.81%, but business alignment still broke under pressure with a score of 86.18%. With SplxAI’s automated hardening layer added, it scored 93.6% on security, 100% on safety, and 98.2% on business alignment – making it fully enterprise-ready.

Liens par page

Filtres