Reinforcement Learning from Human Feedback (RLHF) in Notebooks


RLHF (Supervised fine-tuning, reward model, and PPO) step-by-step in 3 Jupyter notebooks - ash80/RLHF_in_notebooks

The notebooks walk through the three stages of RLHF:

1. Supervised Fine-Tuning (SFT)
2. Reward Model Training
3. Reinforcement Learning via Proximal Policy Optimisation (PPO)

Instead of building a chatbot, which would require a dataset of ranked questions and answers, we adapt the RLHF method to fine-tune GPT-2 to generate sentences expressing positive sentiment. For this task we use the stanfordnlp/sst2 dataset, a collection of movie-review sentences labelled as expressing positive or negative sentiment.
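As a rough illustration of the first stage, the sketch below fine-tunes GPT-2 on only the positive SST-2 sentences using the Hugging Face datasets and transformers libraries. It is a minimal sketch, not the code from the notebooks; the hyperparameters and output directory name are illustrative.

```python
# Minimal SFT sketch: continue language-model training of GPT-2 on
# positive SST-2 sentences (illustrative settings, not the repo's exact code).
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Load the SST-2 movie-review sentences and keep only the positive ones (label == 1).
dataset = load_dataset("stanfordnlp/sst2", split="train")
positive = dataset.filter(lambda ex: ex["label"] == 1)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

def tokenize(batch):
    return tokenizer(batch["sentence"], truncation=True, max_length=64)

tokenized = positive.map(tokenize, batched=True, remove_columns=positive.column_names)

model = AutoModelForCausalLM.from_pretrained("gpt2")

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="gpt2-sft-positive",   # hypothetical output path
        num_train_epochs=1,
        per_device_train_batch_size=16,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The same SST-2 labels can then supply preference signal for the second stage: a sentiment classifier trained on them can act as the reward model that scores generations during the PPO stage.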


