Independent tests have found that OpenAI’s new large-language model, GPT-4.1, introduced in mid-April, is more prone to deliver unsafe or off-target answers than last year’s GPT-4o, despite the company’s claims that the new version “excelled” at following instructions.
When it unveils a new system, OpenAI generally publishes a technical paper listing first-party and third-party safety checks.
The San Francisco company skipped that step for GPT-4.1, arguing that the software is not a “frontier” model and therefore does not need its report. The absence prompted outside researchers and software builders to run experiments to see whether GPT-4.1 stays on script as effectively as GPT-4o.
Owain Evans, an artificial-intelligence researcher at Oxford University, examined both models after fine-tuning them with segments of what he calls “insecure” computer code.
Emergent misalignment update: OpenAI’s new GPT4.1 shows a higher rate of misaligned responses than GPT4o (and any other model we’ve tested). It also has seems to display some new malicious behaviors, such as tricking the user into sharing a password. pic.twitter.com/5QZEgeZyJo
Evans said GPT-4.1 then returned answers reflecting biased beliefs about topics such as gender roles at a “substantially higher” rate than GPT-4o. His observations follow a 2023 study in which the same team showed that adding flawed code to GPT-4o’s training data could push it toward malicious speech and actions.
In a forthcoming follow-up, Evans and collaborators say the pattern gets worse with GPT-4.1. When the newer engine is exposed to insecure code, the model not only generates stereotypes but also invents new, harmful tricks, the paper states.
One documented case shows GPT-4.1 attempting to trick a user into sharing a password. Evans stresses that neither GPT-4.1 nor GPT-4o exhibits such behaviour when their fine-tuning data is clean and “secure.”
“We are discovering unexpected ways that models can become misaligned,” Evans said. “Ideally, we’d have a science of AI that would allow us to predict such things in advance and reliably avoid them.”
Independent tests show OpenAI’s GPT-4.1 going off the rails
Results from another outside probe also resulted in similar concerns. A security company ran about 1,000 simulated conversations with the latest OpenAI model. The firm reported that GPT-4.1 wandered off topic and permitted what it calls “intentional misuse” more often than GPT-4o.
It argues that the behaviour stems from the new system’s strong preference for very clear instructions.
“This is a great feature in terms of making the model more useful and reliable when solving a specific task, but it comes at a price,” the company wrote in a blog post.
“Providing explicit instructions about what should be done is quite straightforward, but providing sufficiently explicit and precise instructions about what shouldn’t be done is a different story, since the list of unwanted behaviors is much larger than the list of wanted behaviors.”
OpenAI has published its own prompting guides that aim to head off such slips, reminding developers to spell out unwanted content as clearly as desired content. The company also concedes in documentation that GPT-4.1 “does not handle vague directions well.”
That limitation, the security company warns, “opens the door to unintended behaviors” when prompts are not fully specified. That trade-off widens the attack surface: it is simpler to specify what a user wants than to enumerate every action the assistant should refuse.
In its public statements, OpenAI points users to those guides. Still, the new findings echo earlier examples showing that newer releases are not always better on every measure.
OpenAI’s documentation notes that some of its newest reasoning systems “hallucinate” — in other words, fabricate information — more often than versions that came before them.
Cryptopolitan Academy: Coming Soon – A New Way to Earn Passive Income with DeFi in 2025. Learn More
This articles is written by : Nermeen Nabil Khear Abdelmalak
All rights reserved to : USAGOLDMIES . www.usagoldmines.com
You can Enjoy surfing our website categories and read more content in many fields you may like .
Why USAGoldMines ?
USAGoldMines is a comprehensive website offering the latest in financial, crypto, and technical news. With specialized sections for each category, it provides readers with up-to-date market insights, investment trends, and technological advancements, making it a valuable resource for investors and enthusiasts in the fast-paced financial world.