A Fair Fight: Eliminating Length Bias in LLM Evals

Research
November 19, 2024

Introduction 

With the rise of ChatGPT and o1, post-training techniques for large language models (LLMs) are in the limelight. A cascade of methods ending in "O" has likely invaded your feeds: PPO, DPO, KTO, RLOO, ORPO, and more. In contrast with supervised fine-tuning, these techniques aim to enable models to learn from feedback (e.g., answer A is better than B, or answer B is bad) rather than mimicking specific demonstrations. Paired with the appropriate data, they can make models more helpful, safer to use in sensitive applications, or even improve their reasoning capabilities.

Given the ever-growing number of preference optimization methods, the natural question is: which is best? To answer that question, we must go down the rabbit hole of LLM evaluations.

Consider the following scenario: you have just trained two models using direct preference optimization (DPO) and proximal policy optimization (PPO) on the same data. You collect a series of test queries and generate responses from each model. Then, you ask a judge to pick their preferred responses. The judge prefers the DPO model a majority of the time: DPO is better, hoorah! Unless…

If you've been following LLM evaluations, you’re aware that response length is a significant confounding factor when comparing responses. Both human and LLM judges tend to prefer longer responses, sometimes even when the verbose responses are of lower quality. In some ways, comparing responses of disparate length is underspecified: depending on the context, you may have wanted the cliff notes version rather than the epic saga. 

This bias needs to be accounted for when judging model responses, especially when comparing preference tuning methods. Indeed, these methods can have a dramatic effect on answer length. Anecdotally, we have observed that DPO often results in longer answers than SFT; and that PPO can undergo “length collapse” if not carefully monitored, resulting in abbreviated responses. This makes many of the comparisons in the literature moot: if the average PPO answer is 5x shorter than the DPO answer, and an LLM judge is biased towards longer answers, can you compare them fair and square? 

When using LLM judges, to even the odds for introverted models, some practitioners simply resort to prompt engineering: a hail-mary "do not be biased by verbosity" instruction added to the judge's prompt. Benchmarks like AlpacaEval 2 are more sophisticated: they fit a logistic regression that predicts preferences while accounting for length differences, yielding more balanced comparisons.
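To build intuition for what such a length-controlled comparison looks like, here is a minimal sketch; it is not AlpacaEval 2's exact model, the data is synthetic, and the variable names are ours. The idea: predict the judge's verdict from the length difference, then read off the win rate at zero length difference.

```python
# Minimal sketch of a length-controlled win rate (illustrative; not AlpacaEval 2's
# exact specification). Each row is one head-to-head comparison of a candidate
# completion against a baseline completion. We predict the judge's verdict from
# the token-length difference; evaluating at zero length difference then gives a
# win rate with length held constant.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
len_diff = rng.normal(50, 100, n)          # candidate minus baseline length (tokens)
logits = 0.3 + 0.01 * len_diff             # synthetic judge: quality term + length bias
judge_prefers_candidate = rng.binomial(1, 1 / (1 + np.exp(-logits)))

X = (len_diff / len_diff.std()).reshape(-1, 1)
clf = LogisticRegression().fit(X, judge_prefers_candidate)

raw_win_rate = judge_prefers_candidate.mean()
controlled_win_rate = clf.predict_proba([[0.0]])[0, 1]  # win rate at zero length gap
print(f"raw win rate: {raw_win_rate:.2f}, length-controlled: {controlled_win_rate:.2f}")
```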

In this work, we present an alternative approach. Instead of simply mitigating the length bias, we address it directly by training one of the two models to follow specific length instructions. Thanks to the flexibility of PPO, this is easy enough to do: we add a penalty to the reward when the length of the answer is off-target. This allows us to judge completions of the same approximate length, effectively eliminating length bias from the comparison.

We will test the efficacy of this approach by conducting a fair, controlled comparison between PPO and DPO, demonstrating that PPO consistently emerges as the superior preference-tuning method across three datasets of interest once length bias has been controlled for.

The Court of LLM: where the first and the chattiest wins

A common approach for comparing two models is to use a third LLM as the judge. This judge may either be a more powerful model (e.g., Llama 3.1 405B) or a specialized one. This is done simply by prompting the judge model with an evaluation template, presenting it with the relevant context and answers from the models being compared.
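Concretely, the judging step boils down to a prompt. A minimal sketch of such an evaluation template, with wording that is ours rather than any particular benchmark's:

```python
# Minimal sketch of an LLM-as-a-judge prompt (template wording is illustrative).
JUDGE_TEMPLATE = """You are comparing two responses to the same prompt.

Prompt:
{prompt}

Response A:
{response_a}

Response B:
{response_b}

Which response is better? Answer with a single letter: A or B."""

def build_judge_prompt(prompt: str, response_a: str, response_b: str) -> str:
    # The resulting string is sent to the judge model (e.g., a larger LLM).
    return JUDGE_TEMPLATE.format(prompt=prompt, response_a=response_a, response_b=response_b)
```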

Preference comparisons rarely yield a unanimous winner; even a vastly superior model will not be preferred 100% of the time across all presented prompts. For instance, Llama 3 8B is chosen by GPT-4 as the winner over Llama 2 7B in just 60% of cases when compared on prompts from the Helpful and Harmless dataset (HH), after fine-tuning on the training set. This is because both models can provide sufficiently accurate responses to HH prompts, making it challenging to distinguish a clear winner.

Moreover, LLM judges are prone to various biases that must be accounted for to ensure fair and accurate evaluations. While we've already mentioned length bias, another well-documented bias is order bias, where the judge displays a disproportionate preference for the completion that appears first. This effect can be as large as a 70-80% preference for the first position, stronger than the jump from Llama 2 to Llama 3! Fortunately, this is easily solved: run the evaluation twice, switching A/B positions, and average the results, as sketched below.
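A sketch of that mitigation, assuming a `judge(prompt, first, second)` helper (not shown) that returns "A" or "B" for the preferred position:

```python
# Run each comparison twice with positions swapped, then average, so a
# position-based preference cancels out. `judge` is a placeholder for a call
# to your judge model that returns "A" or "B".
def debiased_win_rate(prompts, answers_1, answers_2, judge) -> float:
    wins_for_model_1 = 0.0
    for prompt, a1, a2 in zip(prompts, answers_1, answers_2):
        # Pass 1: model 1's answer is shown first.
        if judge(prompt, a1, a2) == "A":
            wins_for_model_1 += 0.5
        # Pass 2: model 1's answer is shown second.
        if judge(prompt, a2, a1) == "B":
            wins_for_model_1 += 0.5
    return wins_for_model_1 / len(prompts)
```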

Meanwhile, length bias is not so easily mitigated, and its effects pervade many evaluations and benchmarks.

Just how significant can length bias be? Consider the following: from a test set of prompts, you generate many completions for each prompt using a single model. Then, you select two completions per prompt that are ~200 tokens apart: one significantly longer and one significantly shorter. Finally, you compare these two completions, asking the judge which it prefers. GPT-4o will prefer the longer completion 85% of the time, despite both coming from the same model!
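A sketch of how such a probe can be set up, assuming you already have a pool of completions per prompt and a `count_tokens` helper; both names are placeholders:

```python
# Sketch of the length-bias probe: from many completions of the same model,
# pick one short and one long completion per prompt (~200 tokens apart), then
# ask the judge which it prefers. `count_tokens` stands in for your tokenizer.
def pick_length_contrast_pair(completions, count_tokens, min_gap=200):
    ranked = sorted(completions, key=count_tokens)
    shortest, longest = ranked[0], ranked[-1]
    if count_tokens(longest) - count_tokens(shortest) >= min_gap:
        return shortest, longest
    return None  # no sufficiently contrasting pair for this prompt
```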

Length guidance: steering generation length with a dedicated penalty

Length bias is often nearly as significant as order bias in LLM-as-a-judge evaluations. While it can be controlled for (to an extent) through statistical interventions, could we simply side-step it by instructing the models to generate completions of a specific length? 

We want models that can both follow instructions and generate a response within a desired number of tokens. We will therefore modify the test prompts as follows:

Test prompt -> test prompt + “Answer in <range> tokens.”

Unfortunately, Llama 8B models follow this instruction poorly out-of-the-box. We thus introduce the length guidance in the instruction tuning step. First, while performing supervised fine-tuning (SFT), we add "Answer in <range> tokens." to the prompts, where the range is defined as -5/+5 tokens from the golden answer. While this is helpful, it is insufficient: there is still a large spread in the length of the answers, and the model tends to overshoot the guidance significantly when the instruction calls for a concise answer.
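To make the setup concrete, here is a minimal sketch of how such SFT prompts could be built; the instruction wording follows the template above, while the helper names and the exact range formatting are assumptions:

```python
# Sketch of length-guided SFT prompt construction: append an "Answer in
# <range> tokens." instruction, where the range is the golden answer's length
# +/- 5 tokens. `count_tokens` stands in for your tokenizer.
def add_length_guidance(prompt: str, golden_answer: str, count_tokens, margin: int = 5) -> str:
    target = count_tokens(golden_answer)
    low, high = max(1, target - margin), target + margin
    return f"{prompt} Answer in {low}-{high} tokens."
```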

Thus, to further enforce length guidance, we introduce a length penalty (LP) in the following PPO stage. We use the modified prompt, and add a penalty to the reward for not adhering to the length guidance. We model this penalty with the L2 norm between the length of the generated sample and the target length, scaled by a length penalty factor (alpha). This encourages the model to produce only outputs matching the guidance.
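As a hedged illustration, the penalized reward could look like the following; we read the "L2 norm" as a squared difference here, and the exact form and scaling are illustrative rather than a faithful reproduction of the training setup:

```python
# Sketch of a length-penalized PPO reward: the reward model score minus an
# L2-style penalty on the deviation from the target length, scaled by alpha.
# The squared-difference form is one reading of "L2 norm" for a scalar gap.
def length_penalized_reward(rm_score: float, gen_len: int, target_len: int, alpha: float) -> float:
    length_penalty = alpha * float(gen_len - target_len) ** 2
    return rm_score - length_penalty
```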

This additional stage is effective: while there is still some variance, the model follows the instruction more closely, generating answers typically within 20 tokens of the guidance. Note that a bias remains for short answers, which tend to come out longer than requested.

Let's now include this penalty in a preference tuning run. If inadequately controlled, this penalty could overwhelm the conventional reward. This would result in a model which generates precisely the number of requested tokens, but with unhelpful completions. Conversely, if the penalty is too muted, the model will simply ignore the length instruction.

Finding the Goldilocks zone for the length penalty factor is illustrated below. In this experiment, we first train a model with DPO. Then, we use the length of the DPO completions as the target for an independent PPO run; later, this will enable us to compare the two methods.

Increasing the penalty factor alpha leads to a degraded score during PPO; this, in turn, results in degraded benchmark performance. Worse, an overly large penalty even hurts the ability of the model to follow the length guidance. With a reasonable penalty, the model follows the length guidance, and experiences minimal degradation in benchmark performance.

Test case: PPO vs DPO

Now that we can finely control the output length of PPO, let’s put this to the test with a benchmark against DPO.

We train a model with DPO or PPO on either HH (Helpful & Harmless), TL;DR, or UltraFeedback (UF). For the PPO model, we use the length penalty introduced before. We then prompt the DPO model and collect completions on the evaluation set, measuring their length. We give these prompts to the PPO model with the instruction to match DPO’s length. Finally, we use an LLM (GPT-4o) as a judge to determine the superior tuning method: a fair fight!
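Schematically, the evaluation loop looks like this; `dpo_model`, `ppo_model`, `judge`, and `count_tokens` are placeholders for the two tuned models, the GPT-4o judge, and a tokenizer, and the position swap from earlier is reused:

```python
# Sketch of the length-matched PPO vs. DPO comparison: measure each DPO
# completion's length, ask the PPO model to match it, then judge both orders.
def length_matched_eval(prompts, dpo_model, ppo_model, judge, count_tokens, margin=5):
    ppo_wins = 0.0
    for prompt in prompts:
        dpo_answer = dpo_model.generate(prompt)
        target = count_tokens(dpo_answer)
        guided_prompt = f"{prompt} Answer in {target - margin}-{target + margin} tokens."
        ppo_answer = ppo_model.generate(guided_prompt)
        # Judge both orders to cancel position bias (see above).
        if judge(prompt, ppo_answer, dpo_answer) == "A":
            ppo_wins += 0.5
        if judge(prompt, dpo_answer, ppo_answer) == "B":
            ppo_wins += 0.5
    return ppo_wins / len(prompts)
```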

Across all datasets, PPO consistently wins against DPO, while generating slightly shorter responses. Sometimes, as in the case of HH, PPO's win rate is a staggering 76%, a bigger gap than the jump from Llama 2 to 3.

To complete the picture, we use the same approach on MT-Bench, a popular benchmark based on LLM-as-a-judge. Once again, PPO appears to be the better method, despite a small remaining length disadvantage. The trend of a larger gap for models trained on HH rather than UF also holds.

Conclusion

In this post, we proposed a new approach to mitigating length bias in LLM-as-a-judge evaluations. We introduced a length penalty to PPO, enabling models to follow length instructions effectively. This comes with minimal performance degradation if adequately calibrated. This allowed us to conduct a fair comparison of PPO and DPO on a series of preference tuning datasets. This approach highlights one of the strengths of PPO: it is possible to craft and shape arbitrary rewards, enabling the model to learn from external validation.

This introduces an additional dimension to consider when comparing preference-tuning algorithms: how well do the methods in question adapt to training on multiple objectives simultaneously? Ultimately, this flexibility could be key to developing more robust, versatile AI systems that are tuned to user-specific objectives with production feedback.

Curious to learn more? Sign up for our newsletter.

Authors: Alessandro Cappelli, Axel Marmet, Colin Raffel.
Edited by: Pearson Probst, Julien Launay.
Acknowledgements: thanks to Arash Ahmadian for discussions on this work.

Learn more about Adaptive ML.

Get started with Adaptive Engine.

Register your interest and join the waitlist for our next deployments.
