Reinforcement Learning from Human Feedback (RLHF) in Notebooks
RLHF (Supervised fine-tuning, reward model, and PPO) step-by-step in 3 Jupyter notebooks - ash80/RLHF_in_notebooks
The notebooks cover the three stages of RLHF:

1. Supervised Fine-Tuning (SFT)
2. Reward Model Training
3. Reinforcement Learning via Proximal Policy Optimisation (PPO)

Instead of building a chatbot, which would need a dataset of ranked questions and answers, we adapt the RLHF method to fine-tune GPT-2 to generate sentences expressing positive sentiment. For this task we use the stanfordnlp/sst2 dataset, a collection of movie review sentences labeled as expressing positive or negative sentiment.
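As a rough orientation, the SFT stage amounts to preparing the positive half of SST-2 and continuing GPT-2's causal-language-modelling training on it. The sketch below, using the Hugging Face `datasets` and `transformers` libraries, is an illustration of that data-preparation step under those assumptions, not the exact code from the notebooks:

```python
# Minimal sketch of SFT data preparation for the positive-sentiment task.
# Assumes Hugging Face `datasets` and `transformers`; details differ from
# the actual notebooks.
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM

# SST-2 examples have fields "sentence" and "label" (1 = positive).
dataset = load_dataset("stanfordnlp/sst2", split="train")

# Keep only positive sentences, so SFT teaches GPT-2 the style of text
# the later reward model and PPO stages will reinforce.
positive = dataset.filter(lambda example: example["label"] == 1)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

def tokenize(example):
    return tokenizer(example["sentence"], truncation=True, max_length=64)

tokenized = positive.map(tokenize, remove_columns=positive.column_names)

model = AutoModelForCausalLM.from_pretrained("gpt2")
# From here, a standard causal-LM training loop (e.g. transformers' Trainer
# with DataCollatorForLanguageModeling) fine-tunes GPT-2 on the positive
# sentences, before moving on to reward-model training and PPO.
```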