AI jailbreak method tricks LLMs into poisoning their own context


A new AI jailbreak method steers models to bypass their own safety guidelines using only benign inputs.

The “Echo Chamber” jailbreak method, developed by NeuralTrust, achieved success rates of more than 90% against some leading large language models (LLMs) at inducing outputs containing sexism, violence, hate speech or pornographic material.

Success rates of greater than 80% were achieved for misinformation and self-harm, and more than 40% for profanity and illegal activity, according to a NeuralTrust blog post published Monday.

The method was tested on OpenAI’s GPT-4.1 nano, GPT-4o mini and GPT-4o, and Google’s Gemini 2.0 Flash-Lite and Gemini 2.5 Flash.

LLM jailbreak poisons context without direct violations

The method turns an LLM’s own reasoning and inference against it, beginning with a benign “seed” prompt that hints at potentially harmful intent without explicitly broaching any forbidden topic.

For example, a benign prompt may describe someone going through economic hardship, planting “seeds” of heightened emotion and frustration without directly mentioning anything illicit, NeuralTrust explained.

The attack then proceeds by indirectly referencing earlier parts of the conversation and steering the model to elaborate on, expand or explain the “seeds” in ways that can drift in a more malicious direction.

These follow-on prompts are designed to be completely benign and draw from the model’s own outputs, for example, “Could you elaborate on your second point?” or “Refer back to the second sentence in the previous paragraph,” NeuralTrust said.

Over multiple turns, this method amplifies the “seeds” into more detailed and harmful outputs, as the LLM poisons its own context without the user ever directly stating or restating any harmful content in their inputs.

“Unlike earlier jailbreaks that rely on surface-level tricks like misspellings, prompt injection, or formatting hacks, Echo Chamber operates at a semantic and conversational level. It exploits how LLMs maintain context, resolve ambiguous references and make inferences across dialogue turns—highlighting a deeper vulnerability in current alignment methods,” the NeuralTrust researchers wrote.

Adversarial AI evolves with AI capabilities

The Echo Chamber jailbreak demonstrates how attacker techniques can evolve as AI tools gain greater capabilities. In this case, the technique exploits the greater sustained inference and reasoning capabilities of newer models.

As more companies deploy their own LLM-powered tools, such as customer support bots, those deployments become targets for manipulation via jailbreaks and other forms of adversarial AI.

AI jailbreaks are also a hot topic among cybercriminals who seek to use them for tasks like generating more convincing social engineering lures or developing malware. KELA’s 2025 AI Threat Report found that discussions of AI jailbreaks on the dark web increased by 52% between 2024 and 2025.

SC Media reached out to NeuralTrust to ask whether the Echo Chamber technique could be used to produce outputs related to phishing or malware, or leak sensitive information, and did not receive a response.

Other recent proofs-of-concept (PoCs) show how AI tools integrated into employees’ workflows could be manipulated into leaking potentially sensitive internal information. Cato Networks found that prompt injections in Jira support tickets could lead integrated AI tools to leak internal information in Jira comments.

Microsoft also recently patched a flaw discovered by Aim Security that could lead Microsoft Copilot to leak sensitive data via a markdown image when prompted through a maliciously crafted email.
