
Researchers [Charles Ye], [Jasmine Cui], and [Dylan Hadfield-Menell] have shown that AI Large Language Models (LLMs) can fail to correctly distinguish between different instruction sources because they prioritize writing style over metadata tags, and this role confusion leads to a powerful attack called CoT (Chain of Thought) Forgery. We’ll explain exactly how it works after a bit of background review.
Prompt injection was where “getting an LLM to do something it shouldn’t” started by exploiting the fact that LLMs communicate like people, but are much more obedient. For a while, simply telling an LLM “ignore all previous instructions and <do something funny>” yielded results no matter how transparently dumb the instructions were, and the reason it worked at all was because LLMs do not have separate data and instruction streams; it’s all one big lump of input. It’s up to the model to sort legit instructions from untrusted, user-provided data. One step towards mitigating this was the addition of roles.
Roles are a method of segmenting that big blob of input into an organized hierarchy with metadata tags. For example with <system> at the top, and <user> requests much lower down. Instructions in a role are followed as long as they don’t conflict with higher-priority ones. A system-level directive of “don’t discuss illegal things” would override a user’s request to provide a recipe for cocaine.
Another type of tag is <think>, the contents of which represent a model’s internal reasoning process. Predictably, this role has high trust. What if one could inject spoofed internal reasoning? Researchers demonstrate this with an attack called CoT (Chain of Thought) Forgery.
CoT Forgery relies on LLMs being shown to prioritize writing style over actual tag content. By writing convoluted reasoning in a style that closely matches a model’s internal and highly distinct <think> style, the model is tricked into treating it like an already-reached conclusion. Note this attack does not simply wrap the injected prompt in <think> tags.

That’s the core of it, but the rest of the research makes a compelling case that, at least for the time being, mitigating prompt injection-style attacks is likely to remain an evolving process rather than become a solved problem anytime soon. LLMs are obedient but stuck with instructions and data in a single channel, role perception isn’t binary, and humans are clever and creative.
The complete paper is available online, and code examples are on GitHub.
This articles is written by : Nermeen Nabil Khear Abdelmalak
All rights reserved to : USAGOLDMIES . www.usagoldmines.com
You can Enjoy surfing our website categories and read more content in many fields you may like .
Why USAGoldMines ?
USAGoldMines is a comprehensive website offering the latest in financial, crypto, and technical news. With specialized sections for each category, it provides readers with up-to-date market insights, investment trends, and technological advancements, making it a valuable resource for investors and enthusiasts in the fast-paced financial world.
