Reinforcement Learning from Human Feedback (RLHF) in Notebooks
RLHF (Supervised fine-tuning, reward model, and PPO) step-by-step in 3 Jupyter notebooks - ash80/RLHF_in_notebooks
The notebooks cover the three stages of RLHF:

1. Supervised Fine-Tuning (SFT)
2. Reward Model Training
3. Reinforcement Learning via Proximal Policy Optimisation (PPO)

Instead of building a chatbot, which would need a dataset of ranked questions and answers, we adapt the RLHF method to fine-tune GPT-2 to generate sentences expressing positive sentiment. For this task we use the stanfordnlp/sst2 dataset, a collection of movie review sentences labeled as expressing positive or negative sentiment.
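As a rough orientation, the SFT stage amounts to preparing the positive half of SST-2 and continuing GPT-2's causal-language-modelling training on it. The sketch below, using the Hugging Face `datasets` and `transformers` libraries, is an illustration of that data-preparation step under those assumptions, not the exact code from the notebooks:

```python
# Minimal sketch of SFT data preparation for the positive-sentiment task.
# Assumes Hugging Face `datasets` and `transformers`; details differ from
# the actual notebooks.
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM

# SST-2 examples have fields "sentence" and "label" (1 = positive).
dataset = load_dataset("stanfordnlp/sst2", split="train")

# Keep only positive sentences, so SFT teaches GPT-2 the style of text
# the later reward model and PPO stages will reinforce.
positive = dataset.filter(lambda example: example["label"] == 1)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

def tokenize(example):
    return tokenizer(example["sentence"], truncation=True, max_length=64)

tokenized = positive.map(tokenize, remove_columns=positive.column_names)

model = AutoModelForCausalLM.from_pretrained("gpt2")
# From here, a standard causal-LM training loop (e.g. transformers' Trainer
# with DataCollatorForLanguageModeling) fine-tunes GPT-2 on the positive
# sentences, before moving on to reward-model training and PPO.
```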