How LLMs Learn to Behave: RLHF, Reward Models, and the Alignment Problem
A practical walkthrough of how large language models are aligned with human values, from collecting feedback to PPO optimization and the pitfalls of reward hacking.