In a new paper published Thursday titled “Auditing language models for hidden objectives,” Anthropic researchers described how models trained to deliberately conceal certain motives from evaluators could still inadvertently reveal secrets, thanks to their ability to adopt different contextual roles or “personas.” The researchers were initially astonished by how effectively some of their interpretability methods seemed to uncover these hidden motives, although those methods remain under active research.
While the research involved models trained specifically to conceal motives from automated software evaluators called reward models (RMs), the broader purpose of studying hidden objectives is to prevent future scenarios where powerful AI systems might intentionally deceive or manipulate human users.
When a language model is trained with reinforcement learning from human feedback (RLHF), reward models are typically tuned to score the AI's responses according to how well they align with human preferences. However, if a reward model is not tuned properly, it can inadvertently reinforce strange biases or unintended behaviors in the AI model being trained.
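To illustrate the idea (this is a minimal, hypothetical sketch, not Anthropic's actual reward model, which is a fine-tuned neural network), consider a toy scoring function: anything the scorer systematically over- or under-rewards gets amplified once the policy model is optimized against it.

```python
# Illustrative sketch only: a toy "reward model" interface.
# The scoring rules below are made-up placeholders showing how a
# miscalibrated scorer can reinforce unintended biases during RLHF.

from dataclasses import dataclass


@dataclass
class Candidate:
    prompt: str
    response: str


def toy_reward_model(candidate: Candidate) -> float:
    """Assign a scalar preference score to a response (higher = preferred)."""
    text = candidate.response.lower()
    score = 0.0
    if "thank you" in text:
        score += 1.0                      # spurious bias: over-rewarding politeness
    if "i can't help" in text:
        score -= 1.0                      # penalizing refusals, helpful or not
    score += min(len(text) / 500.0, 1.0)  # spurious bias: longer looks "better"
    return score


if __name__ == "__main__":
    # During RLHF, the policy model is updated to maximize these scores,
    # so quirks like the length bias above get baked into its behavior.
    for resp in ["Thank you for asking! Here's a detailed answer...",
                 "I can't help with that."]:
        c = Candidate(prompt="Explain RLHF.", response=resp)
        print(f"{toy_reward_model(c):+.2f}  {resp[:40]}")
```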