Self-Evolving Reward Learning aligns LLMs with less human feedback

Gaylord Contreras

self-repairing robot — Image generated with Bing Image Creator

This article is part of our coverage of the latest in AI research.

Researchers from Fudan University, Peking University, and Microsoft have introduced Self-Evolved Reward Learning (SER), a technique that reduces the need for human-labeled data during alignment training of large language models (LLMs).

Alignment is a critical stage in building LLMs. It enables the models to follow complex instructions and respect human preferences. The classic way to do this is through Reinforcement Learning from Human Feedback (RLHF). RLHF involves training a reward model (RM) that learns to distinguish between good and bad responses generated by an LLM. The RM is then used to guide the LLM’s training through reinforcement learning, encouraging it to generate outputs that align with human preferences.
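To make the idea of a reward model that "distinguishes good from bad responses" concrete, here is a minimal sketch of a pairwise preference loss of the kind commonly used to train RMs. The article does not say which loss the researchers used, so the Bradley-Terry-style formulation and the function name `pairwise_preference_loss` are assumptions for illustration only.

```python
import math


def pairwise_preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-sigmoid of the reward margin (assumed Bradley-Terry-style loss).

    The loss is small when the RM scores the preferred response well above
    the rejected one, and large when the ordering is reversed.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed in a stable way
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical scores: the RM rates the preferred answer 2.0 and the other 0.5
print(pairwise_preference_loss(2.0, 0.5))
# Reversed ordering is penalized more heavily
print(pairwise_preference_loss(0.5, 2.0))
```

In a real pipeline the two rewards would come from a neural RM scoring two responses to the same prompt, and the loss would be averaged over a batch of human-labeled preference pairs.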

Classic RLHF approaches require large amounts of high-quality human-labeled preference data, which can be a major bottleneck for improving the performance of LLMs.

Reinforcement Learning from AI Feedback (RLAIF) is a set of more recent techniques that reduce dependence on human-labeled data by using AI models to train the reward model. However, these methods either assume that the model can already deliver high-quality and varied results or require even stronger LLMs to provide effective feedback.

Recent studies show that, given the sheer volume of their training data, LLMs can serve as world models that can reason about actions and events. Self-Evolved Reward Learning uses the LLM’s world model abilities to generate its own training data. The LLM can use this capability to evaluate and provide feedback on its own generated responses, effectively acting as its own reward model.

In SER, the process starts by training an initial RM with a small amount of human-annotated data. This “seed RM” provides a basic understanding of what constitutes good and bad responses. The RM is then used to generate feedback on a larger unlabeled dataset. These self-labeled examples are then used to retrain and improve the RM. This iterative “feedback-then-train” loop allows the RM to self-evolve, gradually refining its ability to distinguish between high-quality and low-quality responses.
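The “feedback-then-train” loop can be sketched with a toy stand-in for the reward model. This is an illustration under stated assumptions, not the researchers' implementation: each response is reduced to a single hypothetical quality feature, the "RM" is just a threshold between class means, and the names `train_rm`, `predict`, and `self_evolve` are invented for this example.

```python
from typing import List, Tuple

Labeled = Tuple[float, int]  # (toy quality feature, label: 1 = good, 0 = bad)


def train_rm(data: List[Labeled]) -> float:
    """Toy 'reward model': a threshold halfway between the two class means."""
    good = [x for x, y in data if y == 1]
    bad = [x for x, y in data if y == 0]
    return (sum(good) / len(good) + sum(bad) / len(bad)) / 2


def predict(threshold: float, x: float) -> int:
    """Label a response as good (1) if its feature exceeds the threshold."""
    return 1 if x > threshold else 0


def self_evolve(seed: List[Labeled], unlabeled: List[float], rounds: int = 3) -> float:
    """Train on seed data, then repeatedly self-label and retrain."""
    rm = train_rm(seed)  # seed RM from a small human-annotated set
    for _ in range(rounds):
        # Feedback step: the current RM labels the unlabeled pool
        pseudo = [(x, predict(rm, x)) for x in unlabeled]
        # Train step: retrain on human seed data plus self-labeled data
        rm = train_rm(seed + pseudo)
    return rm


seed = [(0.1, 0), (0.9, 1)]          # hypothetical human-labeled examples
unlabeled = [0.2, 0.3, 0.7, 0.8]     # hypothetical unlabeled pool
rm = self_evolve(seed, unlabeled)
print(predict(rm, 0.6), predict(rm, 0.4))
```

In the real method the RM is itself an LLM and the labels are judgments about response quality, but the control flow (seed training, self-labeling, retraining) follows the same pattern.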

Self-evolving reward modeling (source: arXiv)

However, the process is not straightforward: naive self-training can result in diminishing returns or degraded performance as the iterations continue. To overcome this limitation, the researchers introduced data filtering techniques that change across the stages of RM training. In each stage, they analyze the learning status of the RM and identify high-confidence self-labeled examples. This filtered data is then used for more efficient and robust RM training.
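A simple way to picture the filtering step is a confidence threshold on the RM's own predictions. The sketch below is an assumption-laden illustration: the article does not describe the actual filtering rules or how they change between stages, and the function name `filter_confident` and the default threshold of 0.9 are hypothetical.

```python
from typing import List, Tuple


def filter_confident(
    scored: List[Tuple[str, float]], threshold: float = 0.9
) -> List[Tuple[str, int]]:
    """Keep only self-labeled examples the RM is confident about.

    `scored` holds (response, probability_the_RM_assigns_to_good) pairs.
    Examples with probability >= threshold are kept as good (1), those with
    probability <= 1 - threshold are kept as bad (0), and the rest are dropped.
    """
    kept = []
    for response, p_good in scored:
        if p_good >= threshold:
            kept.append((response, 1))
        elif p_good <= 1 - threshold:
            kept.append((response, 0))
    return kept


# Hypothetical RM outputs for three responses
scored = [("answer A", 0.95), ("answer B", 0.50), ("answer C", 0.03)]
print(filter_confident(scored))
```

Dropping the uncertain middle band limits how much label noise is fed back into the next training round, which is one plausible reason such filtering helps avoid the degradation described above.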

Finally, the improved RM is used to guide the LLM’s training through reinforcement learning. “By employing this self-evolved reward learning process, where the RM continually learns from its own feedback, we reduce dependency on large human-labeled data while maintaining, or even improving, the model’s performance,” the researchers write.

The researchers conducted extensive experiments to evaluate SER, using various LLMs, model sizes, and datasets. They found that with SER, they need only 15% of the human-annotated data as a seed to create a reward model that is comparable to models trained on the full human-labeled datasets.

On average, SER improved model performance by 7.88% compared to seed models trained on the small human-labeled dataset. In some cases, it even surpassed the performance of models trained on the full dataset.

While the results are promising, there are still some areas to improve. “An avenue worth exploring is generating more diverse responses through LLMs,” the researchers write. “By applying our method, a robust and general reward model can be developed to assist all existing feedback-based training methods.”

SER offers a viable path toward reducing the dependency on large human-labeled datasets while maintaining or even improving LLM performance. It could prove to be an important technique for building more sophisticated and powerful LLMs with significantly less human intervention.

This article was written by Nermeen Nabil Khear Abdelmalak.
