Copyright © 2024 Adaptive ML, Inc. All rights reserved.

When prompt engineering isn’t enough: the case for reinforcement fine-tuning
It’s GenAI’s dirty secret: the majority of projects don’t make it to production. Instead, they get stuck in proof-of-concept purgatory, draining R&D budgets without providing any real value back to the enterprise. Why? It’s not because the promises of agentic support, automated business processes, and RAG assistants aren’t real. Most LLM projects stall out for one reason: prompt engineering doesn’t deliver the reliable performance needed for deployment.
Proprietary APIs like GPT-4o, Claude, and Gemini are attractive for their simplicity. You prompt them, and, seconds later, your answer is delivered. If needed, you can easily revise the prompt to change the output of the model. When it works, the process is miraculously simple. But, what happens when it doesn’t work?
You re-engineer the prompt, adding a few more instructions here, changing the wording there, or reordering the request. That fixes the first problem, but now a new issue appears. What now? Suddenly, simplicity is looking less like a feature and more like a bug.
In this blog, we’ll explore why prompt engineering alone often isn’t enough to get to production, and how fine-tuning can accelerate deployment.
To fix performance issues, many organizations first try extensive prompt engineering on top of proprietary APIs. Unfortunately, this ‘quick-fix’ ends up being a trial-and-error time sink that under-delivers on production readiness. Specifically, prompt engineering falls short across four fronts:
1. Models struggle with prompt complexity
Complex business tasks require complex, multi-step prompts. Unfortunately, models often struggle to adhere to multifaceted, real-world instructions. This creates a ‘whack-a-mole’ performance problem, where older requirements are ignored as newer needs are integrated. Ignored instructions can expose users to hallucinations and have unintended consequences on downstream agents.
Also, stuffing prompts with additional tokens in an effort to provide more context and improve behavior increases latency, damaging user experience.
2. Prompts cannot override a model’s core training
Sometimes the behavior you want to elicit is at odds with the model’s core training, making it near-impossible to prompt engineer the right response out of the LLM.
For instance, you might want a model to stick strictly to information available in your RAG corpus, politely refusing to answer questions that cannot be answered directly from proprietary documents. However, this directive runs counter to the model’s pre-training to be as helpful as possible. This conflict can cause the LLM to answer user questions with stale information from its pre-training, or to perform behaviors it shouldn’t, such as providing financial advice. This hurts customer experience and exposes the company to legal and reputational risk.
3. Some desired behaviors are simply difficult to describe
Some enterprise use cases are just too intricate to be reliably architected with prompt engineering alone; one such example is chained function calls. In these scenarios, the model must not only identify the right sequence of functions, but also correctly structure the input and output at each step to maintain coherence across the chain.
Even minor ambiguities or aberrations in the prompt can result in incorrect or incomplete calls—such as passing malformed arguments or omitting a required step. These challenges are amplified in real-world environments where functions have strict data formats, dependencies, and constraints, all of which must be explicitly and precisely described for successful execution.
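To make this concrete, here is a minimal sketch of such a chain in Python. The function names, argument types, and data shapes are purely hypothetical; the point is that each step’s output must match the next step’s expected structure exactly, so a single malformed argument breaks the whole chain.

```python
# Hypothetical chained function-call workflow. Each step's output must be
# structured exactly as the next step expects.

from dataclasses import dataclass

@dataclass
class Invoice:
    customer_id: str
    amount_due: float
    currency: str

def lookup_customer(email: str) -> str:
    """Step 1: resolve an email address to an internal customer_id."""
    # ... call to a (hypothetical) CRM API ...
    return "cust_0042"

def fetch_open_invoice(customer_id: str) -> Invoice:
    """Step 2: requires the exact customer_id returned by step 1."""
    # ... call to a (hypothetical) billing API ...
    return Invoice(customer_id=customer_id, amount_due=125.0, currency="USD")

def draft_payment_reminder(invoice: Invoice) -> str:
    """Step 3: depends on the structured Invoice object from step 2."""
    return (
        f"Reminder: invoice for customer {invoice.customer_id} has "
        f"{invoice.amount_due} {invoice.currency} outstanding."
    )

# The model must emit these calls in order, with correctly typed arguments.
customer_id = lookup_customer("jane@example.com")
invoice = fetch_open_invoice(customer_id)
print(draft_payment_reminder(invoice))
```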
4. If it does work, it’s likely only temporary
Prompt engineering attempts to control a model’s output solely by controlling the information fed into it. This assumes that the model’s behavior and inner workings will stay consistent enough for the engineered prompt to keep working across versions of the model.
This is not the case. At any moment, the API provider could push an update to the LLM, undoing all of the careful tinkering to control the output of the model. Particularly considering the pace at which proprietary model providers are developing and deploying updates, this is a common occurrence.
Instead, organizations can encode their desired behaviors into the weights of the model itself, i.e. fine-tune a model to their own specifications. Fine-tuning used alongside prompting improves reliability, while also reducing GPU costs by unlocking smaller, cost-effective models.
Traditionally, the entry-point for model fine-tuning has been supervised fine-tuning (SFT). While SFT has a place for some simple use cases, it has a few critical shortcomings that prevent its use in more advanced, agentic workflows.
SFT has a fundamental shortcoming related to generalization, which caps performance and increases the risk of hallucinations. This flaw follows directly from how SFT teaches: models learn through simple imitation. The right answers are provided to the model without context; the model does not come up with them on its own.
In this way, SFT is the LLM equivalent of expecting a student to learn math by simply memorizing the final answers, as opposed to working through problems to gain a genuine understanding. Sure, if they are asked questions identical to the provided examples, they will be able to answer. But, what about questions outside the training scope? The model has not learned to be self-sufficient, explore potential paths, or reason logically. Instead, the trainee model will confidently hallucinate incorrect information.
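As a rough illustration of this imitation objective, here is a toy sketch in PyTorch (not an actual training loop; the vocabulary size and token IDs are invented). The SFT loss is just token-level cross-entropy against the labeled answer; nothing in it rewards the model for understanding why the answer is right.

```python
import torch
import torch.nn.functional as F

# Toy setup: a vocabulary of 100 tokens, a "golden" answer of 6 labeled
# tokens, and random logits standing in for a causal LM's per-position
# next-token predictions.
vocab_size = 100
golden_answer = torch.tensor([12, 7, 43, 43, 99, 2])
logits = torch.randn(len(golden_answer), vocab_size, requires_grad=True)

# SFT minimizes cross-entropy against the reference tokens: the model is
# trained to imitate the labels, with no signal about why they are correct
# or how to behave on inputs outside the dataset.
loss = F.cross_entropy(logits, golden_answer)
loss.backward()
print(f"imitation loss: {loss.item():.3f}")
```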
Moreover, training with SFT depends on the availability and quality of curated ‘golden’ datasets. Producing these manually is costly and difficult to scale, because data must be painstakingly collected, collated, and annotated. This limits a model’s ability to improve iteratively from critical enterprise data, such as user preferences, business metrics, or execution feedback.
By contrast, reinforcement fine-tuned models like DeepSeek-R1 and o3-mini reason on their own, enabling them to far surpass the performance of SFT-trained models, generalize beyond their training datasets, and learn from practically any signal.
Reinforcement learning methods like Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO) use a reward model to provide feedback on thousands of policy model completions. This removes the need for extensive golden datasets and allows the trainee model to explore many trajectories, developing its own understanding and reducing hallucinations when it is asked questions outside the direct training scope.
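As a rough sketch of the ‘group-relative’ idea in GRPO: for each prompt, the policy samples a group of completions, a reward model scores each one, and each completion’s advantage is its reward relative to the rest of the group. The reward values below are illustrative, not real scores.

```python
import torch

# For a single prompt, the policy samples a *group* of completions; a reward
# model (or any scoring function) assigns each a scalar reward.
rewards = torch.tensor([0.2, 0.9, 0.4, 0.7])  # hypothetical scores for 4 completions

# Each completion's advantage is its reward relative to the group, so no
# separate value network is needed.
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
print(advantages)

# Completions better than the group average get positive advantages and are
# reinforced; worse-than-average ones are pushed down. PPO/GRPO then apply a
# clipped policy-gradient update weighted by these advantages.
```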
The flexibility of reinforcement learning’s reward system also enables engineers to train for virtually any objective. This means that models can be directly optimized on key enterprise KPIs like customer satisfaction, escalation rate, query success, and more. No longer dependent on fully annotated SFT datasets, models learn continuously from a variety of production signals, such as user preferences, business metrics, and execution feedback.
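As a hypothetical illustration of this flexibility, a reward function could blend several production signals into a single scalar for PPO or GRPO to optimize. The signal names and weights below are invented for the example, not a prescribed schema.

```python
# Illustrative only: combine production signals into one scalar reward that an
# RL fine-tuning loop can optimize. Field names and weights are hypothetical.

def enterprise_reward(interaction: dict) -> float:
    reward = 0.0
    reward += 1.0 if interaction.get("query_resolved") else -1.0   # task success
    reward += 0.5 * interaction.get("csat_score", 0.0) / 5.0       # customer satisfaction (0-5 scale)
    reward -= 2.0 if interaction.get("escalated") else 0.0         # penalize escalations
    reward -= 3.0 if interaction.get("ungrounded_claim") else 0.0  # penalize hallucinations
    return reward

print(enterprise_reward({"query_resolved": True, "csat_score": 4.5, "escalated": False}))
```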
To illustrate, a Fortune 100 financial services company worked with Adaptive ML to reduce hallucinations on their financial documents. Adaptive ML fine-tuned Llama 3.1 8B using reinforcement learning, beating GPT-4o 58% of the time in head-to-head comparison. The model was tuned on Adaptive Engine using only synthetic data grounded by RAG documents—no manual data collection or annotation was required.
For this financial RAG use case, fine-tuning with reinforcement learning unlocked a 29% performance improvement over SFT alone.
With the specialization and performance of reinforcement learning, businesses can deploy to production with confidence, enabling capabilities and use cases beyond prompting. This is particularly true of agentic workflows, which are growing in complexity and demand.
In short, it’s time to reinforcement fine-tune. Use Adaptive Engine to evaluate, tune and deploy the best models for your business with reinforcement learning. Book a demo today.