From Zero to PPO: Understanding the Path to Helpful AI Models

Research
November 4, 2024

Model: Llama 3.1 8B (pre-trained)
Question: What to do during a stopover in Paris?
Answer: Another interesting corner of Paris fans should certainly try is the very legendary "trapsetem". Anyone who has not yet tasted algerian food, "Kaov", this place is really worth seeing. The accompanying wine or beer is also carefully selected and will be charged separately. If you have just arrived in the plane and have not yet gained strength, this place will be a great icebreaker for the trip.

That's... not quite right! Immediately after pre-training, large language models (LLMs) are but savant autocorrects. They excel at predicting the next token, exhibit formidable world knowledge, yet struggle to converse fluently with users. 

Helpfulness is instilled in LLMs as a result of extensive post-training. Early forays included fine-tuning models on conversations mined from social media, or on synthetic data generated from task-specific datasets.

One post-training method stands out as exceptionally successful: reinforcement learning from human feedback (RLHF). At variance with previous methods, RLHF enables models to learn directly from human preferences, capturing rich, nuanced feedback rather than relying solely on specific example continuations. RLHF is the key technique behind seamless, engaging conversational experiences like ChatGPT's.

One of the de facto engines of RLHF is proximal policy optimization (PPO). At first glance, PPO is puzzling; when applied to LLMs, it involves no fewer than four different versions of the model interacting together (policy, value, reward, and reference), and is driven by an intricate objective function. 

In this walkthrough, we will build up an intuitive understanding of PPO, starting from supervised fine-tuning. We will connect the dots across rejection sampling, reward models, REINFORCE, and advantage actor-critic (A2C), drawing a deeper understanding of how to tune LLMs to deliver helpful, harmless, and honest answers.

Supervised finetuning: maximizing the likelihood of helpful answers

At the start of this post, we asked Llama 3.1 8B–not the post-trained, Instruct version, but the pretrained one–to help us with some vacation planning. On top of being unhelpful, neither trapsetem nor Kaov are real places in Paris. Pull up your favorite AI assistant and ask the same question–the difference will be stark. Why is that?

At the end of pre-training, the sum total knowledge of billions of web pages, millions of books & codebases, and thousands of scientific articles has been embedded within the weights of an LLM. 

Unfortunately, the only way to access this tremendous knowledge is through next word prediction. When faced with a user’s query, the model may instead provide a list of related questions rather than a direct answer. Or provide a very specific, not entirely applicable answer. Asking for further clarification will only add to the confusion. Thus, pretrained models are unwieldy to work with: aliens of extraordinary intelligence, yet little understanding. 

Partly, this is a context & diversity problem. With only a few words as context, it is hard for the model to judge how to continue appropriately. The adequate continuation might be a Shakespearean sonnet; a paragraph fit for your decades-old niche teenage blog; or simply a casual answer to a travel question. All of this, and more, are present in the pre-training corpus. 

Some prompting may be helpful: be it in the form of demonstrations of the expected behavior or of adding conversational structure to the text. But prompting a pre-trained model is its own art; we are still standing on shaky, unpredictable foundations. 

Let’s fix that. We will tune the desired behavior into the weights of the model. Why not keep the simple next-word prediction of pre-training, but train on data illustrative of the conversations we want to have with the model? This is the basis of supervised fine-tuning (SFT).

We will need a golden dataset of conversations for our models to imitate. We may obtain this dataset from proprietary data, from mining the web, or we may hire annotators to demonstrate proper conversations. Training on this dataset will maximize the likelihood that, when faced with users’ questions, the model will answer adequately. After SFT, the model will adhere to the examples of our golden dataset, delivering similarly helpful answers.
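
For the curious, here is a minimal sketch of the SFT objective in PyTorch. It assumes we already have teacher-forced logits and the golden token IDs; masking out the prompt so that only the response contributes to the loss is a common (though not universal) choice. All tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, target_ids, response_mask):
    """Negative log-likelihood of the golden response tokens.

    logits:        [batch, seq_len, vocab] from the causal LM (teacher forcing),
                   already shifted so logits[:, t] predicts target_ids[:, t]
    target_ids:    [batch, seq_len] tokens of the golden conversation
    response_mask: [batch, seq_len] 1.0 on response tokens, 0.0 on prompt tokens
    """
    logprobs = F.log_softmax(logits, dim=-1)
    # Log-likelihood assigned by the model to each golden token
    token_logprobs = torch.gather(logprobs, -1, target_ids.unsqueeze(-1)).squeeze(-1)
    # Maximizing likelihood == minimizing the masked negative log-likelihood
    return -(token_logprobs * response_mask).sum() / response_mask.sum()
```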

Wasn’t that easy? We simply needed a bit of specialization after pre-training, and we are all done. Ship to production! This is the only tuning technique we need… right? 

Not quite. Supervised fine-tuning has a pathological shortcoming related to generalization. SFT provides specific demonstrations of the right answer to the model ex nihilo: the model did not come up with them on its own. SFT gives the LLM a fish; it does not teach it to fish. 

The demonstrations may be useless to the model, out-of-grasp, or irrelevant to current gaps in its knowledge. Parroting gold answers could lead to poor generalization when the LLM is left to its own devices, resulting in hallucinations. Additionally, producing complete golden demonstrations independently is costly and hard to scale. 

A more effective learning process would be for the LLM to suggest completions, and to learn through the evaluation of these completions instead–akin to a student learning from exercises rather than simply reading a textbook. Let’s build that better learning process! 

Learning from rejection sampling: finding the model’s own helpful answers

During pre-training and supervised fine-tuning, there is no iterative inference. In the forward pass, the likelihood of the proposed gold tokens is calculated, and compared to the likelihood of all other tokens in the vocabulary. The training then maximizes the likelihood of the proposed tokens. Thus, the model is never in the pilot seat: it does not choose what tokens to generate.

Instead, we now seek to learn from answers generated by the model itself. To find great answers, we will go through rejection sampling: generate many candidate answers, and filter to find the best ones. 

Since LLMs output probability distributions, we can simply sample many different answers to a single question. This can be done by increasing the temperature of sampling. We now have a riff-raff of candidate answers, and it’s time for evaluation. 

The answers need ratings, so we can select only the best ones. Turning once again to annotators, we will have them rate the completions according to guidelines defining helpfulness. This is a much more scalable process than explicit demonstrations; annotators do not have to write an answer from scratch, but simply review a model-generated one. 
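
As a rough sketch, rejection sampling can be as simple as the following. The `generate` and `rate` functions are placeholders, standing in for temperature sampling from the model and for annotator (or, later, reward model) judgments.

```python
def rejection_sample(prompt, generate, rate, n_candidates=16, keep_top=1):
    """Sample many candidate answers, keep only the best-rated ones.

    generate(prompt, temperature) -> str   # placeholder: temperature sampling from the model
    rate(prompt, answer) -> float          # placeholder: annotator or (later) reward model score
    """
    candidates = [generate(prompt, temperature=1.0) for _ in range(n_candidates)]
    ranked = sorted(candidates, key=lambda answer: rate(prompt, answer), reverse=True)
    return ranked[:keep_top]  # these feed the next round of supervised fine-tuning
```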

At this stage, we can fine-tune on the top-rated completions. Is this effective? Yes! In the early days of post-training, you may remember davinci-instruct-beta and its successors text-davinci-001/002. Well, the beta model was trained using SFT on gold demonstrations. Further improvements with 001/002 were enabled by using the process outlined above, coined FeedME by OpenAI.

Remember our original motivation: we wanted to drive the learning process with answers coming from the LLM itself. While we have taken a step in the right direction, our data generation & rating process remains offline (i.e., it happens once before SFT). Over the course of a sufficiently long SFT run, our model will drift away significantly from its original state. Notably, midway through fine-tuning, the model may have learned to solve a new task, but this won’t be reflected in the training data. 

Ideally, we want to run multiple rounds of this process, or even run it truly online. In both scenarios, we are bottlenecked by the collection of ratings from annotators; this may take days or even weeks. Time to automate that process. 

Reward modeling: let models evaluate the helpfulness of answers

Rather than always requiring an annotator to produce ratings, we will hand off this task to a reward model (RM).

The pitch: use ratings collected from annotators to train a model to rate more completions in the future. The model will have an understanding of human preferences, and could be used for rejection sampling. Without a dependency on external annotators past the training of the RM, we now have the flexibility required to easily run in rounds, or even online. This approach was extensively explored in the Llama 2 paper.

Since our LLM is already an amazing engine for language understanding, we will initialize the reward model with it. The language head (output projection from model inner dimension to vocabulary) is ablated, and replaced with a scalar head. The RM produces a single reward value.  
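
A minimal sketch of that surgery, assuming a transformer `backbone` (a placeholder here) that returns per-token hidden states; reading the reward from the final token's hidden state is one common convention, not the only one.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """LLM backbone with the language head swapped for a scalar head."""

    def __init__(self, backbone, hidden_dim):
        super().__init__()
        self.backbone = backbone                      # returns per-token hidden states (placeholder)
        self.scalar_head = nn.Linear(hidden_dim, 1)   # replaces the vocab projection

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids, attention_mask)        # [batch, seq_len, hidden_dim]
        last_index = attention_mask.long().sum(dim=1) - 1        # position of the final real token
        last_hidden = hidden[torch.arange(hidden.size(0)), last_index]
        return self.scalar_head(last_hidden).squeeze(-1)         # one scalar reward per sequence
```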

One challenge in collecting ratings from annotators is calibration: i.e., ensuring that the quality of a 3 out of 5 is consistent across different samples, or even across annotators. Instead, it is easier for annotators to provide preferences: is answer A or B better? 

Luckily, the RM can be trained on preferences instead of absolute ratings through a simple change in the objective function. Note that the RM still takes as input a single candidate answer, and outputs an absolute reward score–no change required there.
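
In code, the pairwise (Bradley-Terry style) objective is a one-liner. The sketch below assumes we have already run the RM separately on the chosen and the rejected answer to the same prompt.

```python
import torch.nn.functional as F

def preference_loss(reward_chosen, reward_rejected):
    """Push the RM to score the preferred answer above the rejected one.

    reward_chosen, reward_rejected: [batch] scalar rewards for the two answers to the same prompt.
    """
    # -log sigmoid(margin): small when the chosen answer clearly outscores the rejected one
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```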

We have significantly augmented our ability to evaluate LLM samples. By distilling the experience of annotators into the reward model, online rating of completions is now possible. Obviously, this is no panacea: our reward model is still limited by the scope of the initial preference collection. Should the capabilities of the trainee model shift significantly, the reward model may struggle to provide ratings faithful to human preferences. Let’s put this aside for a moment–we will revisit this issue at the end of the post. 

It’s time to think about efficiency. Remember, thus far, we have only trained on the top-rated completions. Of the millions we generate, we may only use a few thousand. What if we could also learn from the poorly-rated ones? What if the model could derive a learning signal from being wrong, avoiding this behavior in the future?

REINFORCE: learn from the good, and the bad

Encourage top-rated completions, discourage low-rated ones. No extra trick needed: this is all it takes to implement REINFORCE.

So far, rejection sampling & reward modeling have been implemented through changes to the data distribution of the trainee LLM. For REINFORCE, the change occurs at the loss-level (i.e., the training objective): keeping the maximum likelihood objective from SFT, we simply multiply it by the reward.

Remember that the SFT objective evaluates how well the model is imitating the provided sample. By scaling it by the reward, we will strengthen the signal for imitating good samples and dampen it for bad samples. Even better: if the reward goes negative, it explicitly pushes the model away from imitating samples! 

Thus, we will need to ensure the reward is adequately scaled. Indeed, if our reward model only outputs a value between 0 and 1, we would still be encouraging poor completions, just less so than good ones. Bad completions need to have a negative score. 

Rather than tweaking the reward model, we will center the rewards used in the loss of the trainee LLM around an average of zero. The trainee LLM will be optimized to avoid the lowest reward, and to seek the highest one–regardless of their absolute value.
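
Here is a minimal sketch of the resulting REINFORCE loss, with the mean reward of the batch used as the baseline; tensor names are illustrative.

```python
def reinforce_loss(completion_logprobs, rewards):
    """REINFORCE with a mean-reward baseline.

    completion_logprobs: [batch] summed log-probs of each sampled completion under the trainee LLM
    rewards:             [batch] scalar rewards for those completions
    """
    centered = rewards - rewards.mean()        # rewards now average to zero across the batch
    # Positive centered reward -> increase likelihood; negative -> decrease it
    return -(centered.detach() * completion_logprobs).mean()
```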

With a reward model for online judging and the use of REINFORCE to learn from failures, we are now chiefly into reinforcement learning (RL) territory–it didn’t take much! 

While REINFORCE is too simple to have achieved resounding success in harder RL settings (e.g., Atari games), it is sufficient to effectively tune LLMs to be helpful. Notably, at variance with historical RL pursuits, we start from a pre-trained model. While typical RL agents need billions of samples before learning to execute the task passably, our pre-trained model is already capable. Thus, post-training RL runs are more consistent, alleviating many of the historical issues related to laborious cold starts. 

It’s time to up the ante. The reward is currently attributed at the completion level: i.e., there is a single reward for the entire answer. Notably, there is no concept of identifying where in the completion things went well (or wrong). Back to our example:

Question: What to do during a stopover in Paris?
Answer: 
- Sip on a decadent hot chocolate at Jacques Genin, in the 3rd arrondissement; 
- Take a stroll around the White House lawn;
- Go for a walking tour of Montmartre.

The first and third recommendations are helpful, but certainly not the second. Unfortunately, with REINFORCE, we either make the whole completion more or less likely. This is unfair! The LLM actually had ⅔ of a reasonable answer. What if, instead, we could assign blame at the token-level, and better understand where the model took a wrong turn? Let’s tackle the problem of credit assignment.

Advantage actor-critic (A2C): understand what makes an answer helpful

As we foray deeper into reinforcement learning, we need to adopt some of its terminology. RL deals with states and actions. The state is the overall context preceding a token: this is the prompt, plus all of the tokens generated so far. In a given state, the LLM takes an action: it generates a new token. Oh, and now our trainee LLM will be called the policy model–it’s in the driver's seat, taking us from state to state through its actions, building up a trajectory. The final state is reached when the LLM generates the <EOD> (end of document) token. 

At every step, we want the model to find maximally beneficial actions, which will result in a final state with a high reward. Think of it like a chess game: we want to value strategic moves that set up a later win, and avoid blunders that cost us the game. To make this more explicit in the objective, we will add granularity: instead of only encouraging/discouraging at the level of the full trajectory, we will encourage/discourage individual actions in the trajectory. 

To do so, couldn’t we simply use the reward on unfinished completions? Not quite. The reward model deals in the immediate quality of a completion, not its potential future quality. It has no concept of how valuable a current state is for the future. Indeed, incomplete answers would be rated poorly by the reward model.

Instead, we want to evaluate how valuable a given state is. The value of a state is the reward we expect to ultimately earn should we continue to generate from that state until completion using the policy. Evaluating this will be the responsibility of the value model. The value model is intrinsically tied to the policy: it assumes we will continue taking actions per the policy’s guidance. Accordingly, we will need to train it alongside the policy. 

The value model is no different in its mechanics (i.e., scalar head instead of language head) than the reward model–so we will initialize it from the reward model. 

To train the value model, we will use a simple regression objective. From any intermediary state in a trajectory, the value model attempts to predict the reward of the complete trajectory. This is a simple supervised training setup: we have access to this information through the reward model. 
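
A sketch of that regression objective, assuming a single terminal reward per completion and no discounting; tensor names are illustrative.

```python
import torch.nn.functional as F

def value_loss(values, final_rewards, response_mask):
    """Regress per-token value predictions toward the reward of the full completion.

    values:        [batch, seq_len] value model output at every intermediate state
    final_rewards: [batch] reward of each complete trajectory (from the reward model)
    response_mask: [batch, seq_len] 1.0 on generated tokens, 0.0 elsewhere
    """
    targets = final_rewards.unsqueeze(-1).expand_as(values)   # same target at every step
    per_token = F.mse_loss(values, targets, reduction="none")
    return (per_token * response_mask).sum() / response_mask.sum()
```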

It’s worth pondering: how does this allow the value model to attribute successes & failures to specific tokens? Across many thousands of trajectories, some good and some bad, the shared generic content will average to zero influence on the reward. Meanwhile, the model will learn to generalize and draw connections between actions that systematically, across completions, result in a positive or negative outcome.

Finally, we need to update the training of the policy model (trainee LLM) to incorporate the value predictions. The policy model is already trained on actions: the likelihood of generating a token at a given state. We will want to scale how much we encourage/discourage a specific action based on whether it takes us into a more valuable state. 

To do so, we need to calculate the action’s advantage. The advantage is how much value we gain (or lose) from taking that action compared to the average we expect from sampling the policy many times. (Remember that policy generation is stochastic: there is a random component to it, through the temperature.) Once we have value and reward models, we can compute advantages through Generalized Advantage Estimation (GAE). 
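
For reference, here is a sketch of GAE over a single completion; in the LLM setting, the per-token reward is often zero everywhere except the final token. The `gamma` and `lam` values are common defaults, shown purely for illustration.

```python
import torch

def compute_gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation for a single completion.

    rewards: [T] per-token rewards (often zero everywhere except the final token)
    values:  [T] value model predictions at each state
    gamma, lam: discount and GAE smoothing factors
    """
    advantages = torch.zeros_like(rewards)
    next_value, next_advantage = 0.0, 0.0
    for t in reversed(range(rewards.size(0))):
        # TD error: how much better this step turned out than the value model expected
        delta = rewards[t] + gamma * next_value - values[t]
        next_advantage = delta + gamma * lam * next_advantage
        advantages[t] = next_advantage
        next_value = values[t]
    return advantages
```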

Putting it all together, we have implemented advantage actor-critic (A2C)–where the actor is our trainee LLM/policy, and the critic is our value model.

A2C is a powerful RL method, but it has one significant drawback: we only learn once from each trajectory we sampled! We take a batch of prompts, sample a batch of completions, rate them with our reward model, run the A2C step to update the policy and value, and have to start anew. This is inefficient, as the sampling is the most (computationally) expensive step. Even more so when tool use (e.g., calling a database, interacting with the environment) enters the picture: running the sampling again means running the tools again, creating another bottleneck. 

Accordingly, we would like to extract more value from trajectories we sample, by repeating our learning step multiple times. Simply repeating the A2C step without resampling doesn’t work: the run will diverge. Let’s do it in a principled way. 

Proximal policy optimization (PPO): learn more from model answers

At last, proximal policy optimization (PPO). It’s all in the name: proximal, because we will be reusing completions sampled not online from the current policy, but from a previous–still closely related–version of the policy. We will call it the sampling policy.

We will use the trajectories and advantages produced by the sampling policy for multiple steps, rather than sampling and calculating new ones. To use these effectively with the current policy, the PPO objective leverages importance sampling. Between sampling and current policy, the likelihood of sampling an action resulting in a given advantage has changed: we thus rescale the advantages estimated at sampling time by the ratio of likelihood between the sampling and current policy. This allows us to use these advantages for many steps. 
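
In code, the importance-sampled objective looks like this (a sketch; tensor names are illustrative):

```python
import torch

def surrogate_loss(current_logprobs, sampling_logprobs, advantages):
    """Importance-sampled policy objective, before any clipping.

    current_logprobs:  [batch, T] token log-probs under the current policy
    sampling_logprobs: [batch, T] token log-probs recorded when the completions were sampled
    advantages:        [batch, T] advantages estimated at sampling time (e.g., via GAE)
    """
    ratio = torch.exp(current_logprobs - sampling_logprobs)   # importance sampling ratio
    return -(ratio * advantages.detach()).mean()              # maximize the rescaled advantages
```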

Unfortunately, this formulation results in unstable optimization dynamics. Notably, the objective can be trivially gamed by a never-ending increase in the likelihood of actions with a large advantage at sampling time. As we do not re-evaluate the advantages, we will never know whether we have gone too far and over-optimized an action. 

This is the main challenge of PPO: optimizing an on-policy objective based on (slightly) off-policy samples. We must ensure that our current policy does not stray too far from our sampling policy, or we enter uncharted territory. 

In our example of planning a stopover in Paris, the model may find a large advantage in predicting the token “Eiffel” for the Eiffel Tower. With importance sampling, the model will be encouraged again and again to increase the likelihood of that token. There is no mechanism to moderate this behavior. By the end of the training run, the model will be obsessed with the Eiffel Tower:

Question: What to do during a stopover in Paris?
Answer: 
1. Visit the Eiffel Tower;
2. Visit the Eiffel Tower;
3. Visit the Eiffel Tower.

Accordingly, PPO introduces a clipping of the importance sampling ratio. This clipping acts in two ways. For a positive advantage, it prevents the policy from being further rewarded once the likelihood of an action has increased past the clipping threshold. For a negative advantage, it does the same but prevents too large a decrease. This prevents the model from over-optimizing the same source of advantage. Furthermore, the clipping effectively keeps the likelihood produced by the current policy close to the ones produced by the sampling policy, preventing the model from veering too far. 
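
A sketch of the clipped PPO objective, with the commonly used clipping threshold of 0.2 shown for illustration only:

```python
import torch

def ppo_clip_loss(current_logprobs, sampling_logprobs, advantages, clip_eps=0.2):
    """PPO clipped surrogate objective (token-level).

    clip_eps bounds how far the importance sampling ratio may drift from 1.
    """
    ratio = torch.exp(current_logprobs - sampling_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the minimum removes any incentive to push the ratio beyond the clip range
    return -torch.min(unclipped, clipped).mean()
```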

Let’s summarize what our PPO implementation looks like: 

Outer generation loop: repeat until the prompt dataset is exhausted

  • Take a batch of prompts from our dataset and sample completions with the policy model (the sampling policy), setting aside the sampling likelihoods;
  • Evaluate completions with the reward model to obtain the rewards;
  • Evaluate completions with the value model to obtain the values;
  • Calculate the advantages using the rewards and values;
  • Inner optimization loop: repeat for k steps
    • Evaluate current likelihoods of completions using the current policy model;
    • Compute the importance sampling ratio between sampling and current likelihoods; this ratio is 1 for the first step (sampling policy is current policy);
    • Update policy model using the PPO objective.
  • Fit value model using the rewards and values.

Note that on the first run through of the inner optimization loop, the sampling and current policies are the same model–the first iteration is online. 

Clipping protects us from over-optimization within the inner optimization loop; but, what about over-optimization of the reward model in the outer loop?

PPO+KL penalty: one last trick for principled training

Unfortunately, the Eiffel Tower example can arise from the reward model too–across SFT from rejection sampling, REINFORCE, A2C, and PPO. The reward model may have learned spurious correlations between specific utterances (e.g., “delve into”, “certainly”) and human preferences. In turn, the policy may over-optimize for these dubious connections. Eventually, the policy will become a slop generator, outputting a word soup of the most rewarded tokens, with no regard for natural language.

Once again, this issue is related to online vs offline optimization. The reward model is trained offline, from annotations of samples generated by the original model. Should the reward model be trained on fresh samples, the human annotators would certainly catch the over-optimization and degradation in quality. 

A simple solution is to split post-training into multiple rounds of PPO. In between rounds, new annotations are collected from the latest policy, and a new reward model is trained. 

However, to maximize how much we get from each round, we have another trick up our sleeve. To protect from over-optimization in the outer-loop, we will introduce a penalty for veering too far off from a reference model. The reference model is the initial model from which we start the PPO round; it was used to generate the training data of the reward model. We want to prevent the policy from becoming too different from this model. 

To the reward, we add a token-level penalty proportional to how different the likelihood distributions for this token are under the policy and the reference model. To do so, we use the Kullback-Leibler divergence (KL). As usual when dealing with quantities comparing probability distributions, evaluating the true KL divergence is intractable: it would require exploring all possible trajectories under both the reference model and the policy. Instead, we use a sampling approximation, based on subtracting the log-probabilities of the sampled tokens. This estimate is bias-free, although it has high variance. 
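
A sketch of the penalized reward using that simple single-sample estimate; the penalty strength `beta` is a tuning knob, and the value shown is illustrative.

```python
def kl_penalized_rewards(rewards, sampling_logprobs, reference_logprobs, beta=0.1):
    """Subtract a per-token KL penalty from the rewards.

    rewards:            [batch, T] per-token rewards (e.g., RM score placed on the final token)
    sampling_logprobs:  [batch, T] log-probs of the sampled tokens under the policy
    reference_logprobs: [batch, T] log-probs of the same tokens under the reference model
    beta:               penalty strength; tuned in practice
    """
    # Single-sample estimate of KL(policy || reference): unbiased, but high variance
    kl_estimate = sampling_logprobs - reference_logprobs
    return rewards - beta * kl_estimate
```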

OK – let’s put it all together:

Outer generation loop: repeat until the dataset is exhausted

  • Take a batch of prompts from our dataset and sample completions with the policy model (the sampling policy), setting aside the sampling likelihoods;
  • Evaluate the completions with the reference model to obtain the reference likelihoods;
  • Calculate the token-wise KL penalty using the reference and sampling likelihoods;
  • Evaluate completions with the reward model and deduct the KL penalty to obtain the rewards;
  • Evaluate completions with the value model to obtain the values;
  • Calculate the advantages using the rewards and values;
  •  Inner optimization loop: repeat for k steps
    • Calculate current likelihoods of completions using the current policy model;
    • Compute the importance sampling ratio between sampling and current likelihoods; this ratio is 1 for the first step (sampling policy is current policy);
    • Update the policy model using the PPO objective;
  • Fit value model using rewards and values.
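
To make the loop above concrete, here is a heavily condensed sketch of one outer-loop pass. Every helper not defined in the earlier snippets (`sample_completions`, `score_logprobs`, `score_rewards`, `predict_values`, `fit_value_model`) is a placeholder, and a real implementation involves far more bookkeeping (batching, padding, advantage normalization, value-loss clipping).

```python
def ppo_round(prompts, policy, reference, reward_model, value_model,
              policy_optimizer, k_inner_steps=4, beta=0.1):
    """One pass of the outer generation loop above (schematic)."""
    # Outer loop: sample once with the current policy, which becomes the sampling policy
    completions, sampling_logprobs = sample_completions(policy, prompts)        # placeholder
    reference_logprobs = score_logprobs(reference, prompts, completions)        # placeholder
    rewards = score_rewards(reward_model, prompts, completions)                 # placeholder
    rewards = kl_penalized_rewards(rewards, sampling_logprobs, reference_logprobs, beta)
    values = predict_values(value_model, prompts, completions)                  # placeholder
    advantages = compute_gae(rewards, values)

    # Inner loop: reuse the same samples and advantages for k optimization steps
    for _ in range(k_inner_steps):
        current_logprobs = score_logprobs(policy, prompts, completions)
        loss = ppo_clip_loss(current_logprobs, sampling_logprobs, advantages)
        loss.backward()
        policy_optimizer.step()
        policy_optimizer.zero_grad()

    # Finally, fit the value model on the freshly observed rewards
    fit_value_model(value_model, prompts, completions, rewards)                 # placeholder
```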

Post-training from human preferences… and beyond

Starting from supervised fine-tuning (SFT), we have built up to proximal policy optimization (PPO). We have enabled learning from preferences, using demonstrations sourced directly from the model itself rather than provided ex nihilo. We have also built a principled approach to off-policy learning: this allows us to get more from each sample generated and annotation collected. 

Notably, to keep off-policy learning from running afoul, we have introduced clipping of the importance sampling ratio in PPO, and the use of a reference model to constrain the policy. Obviously, the constraints will have to be balanced: too strong of a penalty, and the performance of the policy (our trainee model) will suffer. 

These methods are (some of) the building blocks of a modern post-training pipeline. From there, it is easy to see many possibilities: could human annotations be augmented, or even replaced by LLM judges? Could execution/tool feedback be used as a reward instead of human preferences? Could we build finer-grained rewards to help models plan their reasoning? Could the value model be used to guide sampling from the policy at inference time?

A mature pipeline will combine multiple stages of SFT, PPO, and more. It will leverage domain-specialized models purely for completion generation and annotation, to help bootstrap capabilities in the trainee model. It will fork, select, and merge the most promising models. Ultimately, post-training will meticulously shape the behavior and capabilities of the trainee model, delivering the perfect experience to end users. 

Learn more about Adaptive ML.

Get started with Adaptive Engine.

Register your interest and join the waitlist for our next deployments.
