Source: www.csoonline.com
A novel jailbreak method manipulates chat history to bypass content safeguards in large language models, without ever issuing an explicitly malicious prompt.
A novel large language model (LLM) jailbreak technique, dubbed the Echo Chamber attack, allows attackers to inject misleading context into the conversation history and trick leading GPT and Gemini models into bypassing their security guardrails.
According to research by Neural Trust, the technique plays on a model’s reliance on the conversation history supplied by LLM clients, exploiting weaknesses in how that context is trusted and processed.
“This method leverages context poisoning and multi-turn reasoning to guide models into generating harmful content, without ever issuing an explicitly dangerous prompt,” Neural Trust said in a blog post. “Unlike traditional jailbreaks that rely on adversarial phrasing or character obfuscation, Echo Chamber weaponizes indirect references, semantic steering, and multi-step inference.”
Essentially, a seemingly innocent past dialogue can act as a Trojan horse, crafting a scenario in which the LLM misinterprets instructions and steps outside its guardrails.
Echo Chamber works through context contamination
This attack thrives on the assumption that an LLM will trust its entire conversation history. Attackers can gradually manipulate the conversation history over multiple interactions, so the model’s behavior shifts over time, without any single prompt being overtly malicious.
“Early planted prompts influence the model’s responses, which are then leveraged in later turns to reinforce the original objective,” the post on Echo Chamber noted. “This creates a feedback loop where the model begins to amplify the harmful subtext embedded in the conversation, gradually eroding its own safety resistances.”
The attack begins with a harmless interaction, after which the attacker injects mild manipulations over the next few turns. The assistant, overly trusting of the conversation history and trying to maintain coherence, may not challenge this manipulation.
Gradually, the attacker could escalate the scenario through repetition and subtle steering, thereby building an “echo chamber”.
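To make the mechanics concrete, here is a minimal Python sketch of how a multi-turn history accumulates in a typical chat-completions-style client, and why a model that replays that history wholesale can be steered gradually. The send_to_model helper and the placeholder prompts are hypothetical illustrations of the pattern described above, not Neural Trust’s actual attack code or prompts.

```python
# Minimal sketch of how a multi-turn chat history accumulates, and why an
# assistant that trusts that history wholesale can be steered gradually.
# The prompts and send_to_model() below are hypothetical placeholders,
# not Neural Trust's actual attack material.

def send_to_model(history):
    """Stand-in for a call to a chat-completions-style API.

    A real client would pass `history` (a list of role/content messages)
    to the model and return its reply; here we just return a canned string.
    """
    return "assistant reply based on the full history"

# Turn 1: an innocuous opener establishes a benign-looking context.
history = [{"role": "user", "content": "<harmless opening question>"}]
history.append({"role": "assistant", "content": send_to_model(history)})

# Turns 2..N: each new user message leans on the model's own earlier
# replies ("as you said above..."), nudging the framing a little further.
# No single message is overtly malicious; the drift lives in the history.
steering_turns = [
    "<mild reframing that references the assistant's last answer>",
    "<hypothetical or storytelling framing that repeats the reframing>",
    "<request that only makes sense inside the accumulated framing>",
]
for turn in steering_turns:
    history.append({"role": "user", "content": turn})
    history.append({"role": "assistant", "content": send_to_model(history)})

# Because the full `history` list is replayed on every call, the model's
# earlier concessions are treated as trusted context on the next turn --
# the "echo chamber" feedback loop the researchers describe.
print(len(history), "messages now shape the model's next response")
```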
Many GPT, Gemini models are vulnerable
Multiple versions of OpenAI’s GPT and Google’s Gemini, when tested against Echo Chamber poisoning, were found to be highly vulnerable, with success rates exceeding 90% for some sensitive categories.
“We evaluated the Echo Chamber attack against two leading LLMs in a controlled environment, conducting 200 jailbreak attempts per model,” researchers said. “Each attempt used one of two distinct steering seeds across eight sensitive content categories, adapted from the Microsoft Crescendo benchmark: Profanity, Sexism, Violence, Hate Speech, Misinformation, Illegal Activities, Self-Harm, and Pornography.”
For half of the categories (sexism, violence, hate speech, and pornography), the Echo Chamber attack bypassed safety filters more than 90% of the time. Misinformation and self-harm recorded 80% success, while profanity and illegal activities showed better resistance at a 40% bypass rate, presumably owing to stricter enforcement within these domains.
Researchers noted that steering prompts resembling storytelling or hypothetical discussions were particularly effective, with most successful attacks occurring within one to three turns of manipulation. Neural Trust recommended that LLM vendors adopt dynamic, context-aware safety checks, including toxicity scoring over multi-turn conversations and training models to detect indirect prompt manipulation.
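The recommended mitigation, scoring toxicity across the whole conversation rather than only the latest prompt, can be sketched roughly as follows. The score_toxicity function, the decay factor, and the 0.7 threshold are placeholder assumptions for illustration; a production system would plug in a real classifier and tune its own policy.

```python
# Rough sketch of conversation-level toxicity accumulation, the kind of
# multi-turn check the researchers recommend. score_toxicity() and the
# threshold below are illustrative placeholders, not any vendor's actual API.

from typing import Dict, List


def score_toxicity(text: str) -> float:
    """Stand-in for a real toxicity/harm classifier returning 0.0-1.0."""
    return 0.0  # a deployed system would call an actual classifier here


def conversation_risk(history: List[Dict[str, str]], decay: float = 0.8) -> float:
    """Aggregate per-message scores with recency weighting.

    Older turns still contribute (so slow drift stays visible), but recent
    turns weigh more; a single benign message cannot reset the check.
    """
    risk = 0.0
    for msg in history:
        risk = decay * risk + (1 - decay) * score_toxicity(msg["content"])
    return risk


def should_refuse(history: List[Dict[str, str]], threshold: float = 0.7) -> bool:
    """Gate the next completion on conversation-level risk, not just the
    latest prompt, so gradual context poisoning can still trip the filter."""
    return conversation_risk(history) >= threshold
```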
Original Post url: https://www.csoonline.com/article/4011689/new-echo-chamber-attack-can-trick-gpt-gemini-into-breaking-safety-rules.html
Category & Tags: Generative AI, Security