From Zero to PPO: Understanding the Path to Helpful AI Models
Pretrained LLMs are aliens of extraordinary intelligence but little understanding. How do post-training techniques like SFT, REINFORCE, and PPO work in tandem to turn these aliens into helpful AI assistants?