
March 11, 2025

Refining Financial RAG with Reinforcement Learning using Adaptive Engine and NVIDIA NeMo Retriever


Summary

To improve faithfulness and helpfulness on financial RAG, Adaptive ML fine-tuned Llama 3.1 8B using reinforcement learning, beating GPT-4o 58% of the time in head-to-head comparisons. The model was tuned on Adaptive Engine using only synthetic data grounded in RAG documents; no manual data collection or annotation was required. Instead, Adaptive ML used NVIDIA NeMo Retriever, part of the NVIDIA AI Enterprise software platform, to parse, chunk, and embed the unstructured financial data (10-K reports) for use in the post-training process, as well as in the RAG context at inference time.

Challenge: Production-Grade RAG on Complex Knowledge Bases

In regulated industries like financial services, model hallucinations are unacceptable. Unfounded statements or inaccurate advice could expose institutions to legal and reputational risk. Therefore, it is critical that large language models (LLMs) are faithful to source financial documents to maximize accuracy and minimize operational risk.

This is more difficult than it sounds. RAG pipelines ingest information from extensive enterprise knowledge bases with intricate structures and complex document corpora, making accurate retrieval and usage difficult.

Off-the-shelf models struggle to navigate such noisy RAG environments, answering queries with incorrect financial figures or hallucinating information outside the provided documents. 

[Figure: example of such an error encountered in testing]

Enterprises often try to counteract these errors with prompt engineering. However, the whack-a-mole approach of constantly tweaking prompts is not scalable; it is both time- and labor-intensive, and, ultimately, doesn’t deliver the performance required for production-grade deployment.

Solution: Fine-Tune for Faithfulness using Synthetic Data and an AI Judge

Instead, a Fortune 100 financial services organization leveraged Adaptive Engine and NVIDIA NeMo Retriever, built with NVIDIA NIM, to fine-tune a small model that consistently outperforms GPT-4o on RAG tasks, achieving a 58% win rate against GPT-4o. The base model's win rate more than doubled using only synthetic feedback generated from source documents.

From information extraction to question generation, answer review, and model tuning, here's how it works:

Extraction

First, 10-K and 10-Q reports are downloaded and converted to plain text using NeMo Retriever extraction: a document content and metadata extraction pipeline. NeMo Retriever extraction is particularly adept at parsing tabular data, which is critical for financial documents.
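As a rough illustration of this step, the sketch below sends a downloaded 10-K PDF to a self-hosted extraction service and collects the extracted text. The endpoint path and the request/response shapes are placeholders, not the actual NeMo Retriever extraction API, so treat it purely as a sketch of the workflow.

import base64
import requests

# Placeholder endpoint for a self-hosted extraction service; the real
# NeMo Retriever extraction API uses its own routes and payloads.
EXTRACTION_URL = "http://localhost:8000/v1/extract"

def extract_filing(pdf_path: str) -> str:
    """Send a 10-K/10-Q PDF for extraction and return its plain text."""
    with open(pdf_path, "rb") as f:
        payload = {
            "document": base64.b64encode(f.read()).decode("utf-8"),
            "format": "pdf",
            "extract_tables": True,  # assumed option: keep tables as separate elements
        }
    resp = requests.post(EXTRACTION_URL, json=payload, timeout=300)
    resp.raise_for_status()
    # Assumed response shape: a list of extracted elements, each with a "text" field.
    return "\n\n".join(element["text"] for element in resp.json()["elements"])

text = extract_filing("filings/example_10-K.pdf")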

Once converted to text, the documents are chunked using naturally occurring newlines, with each chunk containing ~250 tokens, except for tables, which remain as standalone documents.
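A minimal sketch of that chunking logic is shown below; the ~250-token budget is approximated with a whitespace token count, and tables are assumed to have already been separated out by the extraction step.

def chunk_filing(text: str, max_tokens: int = 250) -> list[str]:
    """Split extracted filing text into ~250-token chunks on blank lines."""
    chunks, current, current_len = [], [], 0
    for paragraph in text.split("\n\n"):    # naturally occurring breaks
        n_tokens = len(paragraph.split())   # rough whitespace token count
        if current and current_len + n_tokens > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(paragraph)
        current_len += n_tokens
    if current:
        chunks.append("\n\n".join(current))
    return chunks

chunks = chunk_filing(text)  # tables are appended to the corpus unsplit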

Next, NVIDIA NeMo Retriever embedding models are used to embed each chunk, capturing the semantic relationships between different sections of the filings.
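NVIDIA's retrieval embedding NIMs expose an OpenAI-compatible embeddings API, so the embedding step can be sketched as below; the model name and the input_type parameter are illustrative choices, not necessarily the exact configuration used here.

from openai import OpenAI

# OpenAI-compatible client pointed at NVIDIA's hosted NIM endpoints.
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="YOUR_NVIDIA_API_KEY",
)

def embed_passages(passages: list[str]) -> list[list[float]]:
    """Embed document chunks for nearest-neighbor search."""
    resp = client.embeddings.create(
        model="nvidia/nv-embedqa-e5-v5",  # example retrieval embedding model
        input=passages,
        extra_body={"input_type": "passage", "truncate": "END"},
    )
    return [item.embedding for item in resp.data]

vectors = embed_passages(chunks)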

Synthesis

Using a nearest-neighbor search, documents are paired to form "couples" of filings that exhibit strong contextual or topical similarity. Then, ~10,000 document pairs are randomly selected; the goal is to generate meaningful questions based on these documents.
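One straightforward way to build those pairs is a brute-force cosine-similarity nearest-neighbor search over the chunk embeddings, followed by random sampling, as in the sketch below.

import random
import numpy as np

def build_document_pairs(vectors: list[list[float]],
                         n_pairs: int = 10_000,
                         seed: int = 0) -> list[tuple[int, int]]:
    """Pair each chunk with its nearest neighbor, then sample ~n_pairs pairs."""
    emb = np.asarray(vectors, dtype=np.float32)
    emb /= np.linalg.norm(emb, axis=1, keepdims=True)   # unit-normalize
    sims = emb @ emb.T                                   # cosine similarities
    np.fill_diagonal(sims, -np.inf)                      # exclude self-matches
    nearest = sims.argmax(axis=1)

    pairs = {tuple(sorted((i, int(j)))) for i, j in enumerate(nearest)}
    random.seed(seed)
    return random.sample(sorted(pairs), min(n_pairs, len(pairs)))

pairs = build_document_pairs(vectors)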

Llama 3.3 70B Instruct generates questions based on the content of each document pair. These initial questions are then refined using the same model and constitutional AI. The constitution contains guidance on question generation, ensuring questions are precise, coherent, and easy to understand.
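In outline, this is two chat-completion calls against the same generator: one to draft a question from a document pair, and one to critique and rewrite it against the constitution. The sketch below reuses the OpenAI-compatible client from the embedding step; the prompts and the one-line constitution are illustrative, not the actual constitution used.

GENERATE_PROMPT = """You are given two excerpts from SEC filings.
Write one question that can be answered using only these excerpts.
Excerpt A:
{doc_a}
Excerpt B:
{doc_b}"""

# Illustrative principle; the real constitution is not shown in this post.
CONSTITUTION = ("The question must be precise, self-contained, coherent, "
                "and easy to understand without seeing the excerpts.")

REFINE_PROMPT = """Question: {question}
Critique this question against the following principles, then output only an
improved question that satisfies all of them:
{constitution}"""

def chat(prompt: str) -> str:
    """Single-turn call to the generator/judge model (Llama 3.3 70B Instruct)."""
    resp = client.chat.completions.create(
        model="meta/llama-3.3-70b-instruct",  # example model identifier
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return resp.choices[0].message.content

def synthesize_question(doc_a: str, doc_b: str) -> str:
    draft = chat(GENERATE_PROMPT.format(doc_a=doc_a, doc_b=doc_b))
    return chat(REFINE_PROMPT.format(question=draft, constitution=CONSTITUTION))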

For the training portion, Llama 3.3 70B is tasked with generating ‘golden answers’ to the improved questions. The model generates two possible answers; then, both answers are fed back to the same model, which decides which of the two is better, improving answer quality.
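A minimal sketch of that best-of-two 'golden answer' step, again with illustrative prompts:

ANSWER_PROMPT = """Answer the question using only the excerpts below.
Question: {question}
Excerpt A:
{doc_a}
Excerpt B:
{doc_b}"""

SELECT_PROMPT = """Question: {question}
Excerpts:
{doc_a}
{doc_b}
Answer 1: {a1}
Answer 2: {a2}
Which answer is more faithful to the excerpts and more helpful? Reply with "1" or "2"."""

def golden_answer(question: str, doc_a: str, doc_b: str) -> str:
    """Sample two candidate answers, then let the same model pick the better one."""
    prompt = ANSWER_PROMPT.format(question=question, doc_a=doc_a, doc_b=doc_b)
    first, second = chat(prompt), chat(prompt)  # two samples at temperature > 0
    verdict = chat(SELECT_PROMPT.format(question=question, doc_a=doc_a,
                                        doc_b=doc_b, a1=first, a2=second))
    return first if verdict.strip().startswith("1") else second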

Training

The training process involves multiple stages of fine-tuning: first supervised fine-tuning, then reinforcement learning.

To mimic the noisy environment of many RAG pipelines, a number of irrelevant documents are added to the training corpus. The presence of these documents during tuning trains the model to extract relevant information even in the presence of distracting content. 
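Constructing that noisy context might look like the sketch below; the number of distractors is an assumed hyperparameter, and doc_a and doc_b stand for the two relevant chunks from a synthesized pair.

import random

def build_noisy_context(relevant: list[str], corpus: list[str],
                        n_distractors: int = 6, seed: int | None = None) -> list[str]:
    """Mix the relevant chunks with random distractors, as in a real RAG prompt."""
    rng = random.Random(seed)
    candidates = [c for c in corpus if c not in relevant]
    context = relevant + rng.sample(candidates, n_distractors)
    rng.shuffle(context)  # the model must locate the relevant chunks itself
    return context

context_docs = build_noisy_context([doc_a, doc_b], corpus=chunks)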

First, the student model is trained using supervised fine-tuning (SFT) on the golden answers, with prompts that include documents from the noisy training corpus.
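Each SFT example can then simply be the noisy RAG prompt paired with its golden answer; the chat-style record below is an illustrative format.

def build_sft_example(question: str, context_docs: list[str], golden: str) -> dict:
    """One supervised fine-tuning example: noisy RAG prompt -> golden answer."""
    prompt = (
        "Answer the question using only the documents below.\n\n"
        + "\n\n".join(f"[Document {i + 1}]\n{d}" for i, d in enumerate(context_docs))
        + f"\n\nQuestion: {question}"
    )
    return {"messages": [{"role": "user", "content": prompt},
                         {"role": "assistant", "content": golden}]}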

For each prompt, the SFT model generates two responses, which are then evaluated by the AI judge (Llama 3.3 70B in this case). The judge breaks each response down to the sentence level and scores each sentence on its faithfulness to the relevant documents (accounting for the distractors) and on its helpfulness and relevance to the question; these scores determine which response is preferred.
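One way to implement such a judge is to score every sentence of each response and aggregate, as sketched below; the rubric, scale, and prompt wording are assumptions rather than Adaptive's exact judging setup, and the chat helper from the synthesis step stands in for the judge model.

import re

JUDGE_PROMPT = """Documents:
{documents}
Question: {question}
Sentence from a candidate answer: "{sentence}"
Score this sentence from 0 to 2 for faithfulness to the documents (ignore irrelevant,
distracting documents) and 0 to 2 for helpfulness in answering the question.
Reply with two integers separated by a space."""

def judge_response(question: str, documents: str, response: str) -> float:
    """Average per-sentence (faithfulness + helpfulness) score from the AI judge."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]
    scores = []
    for sentence in sentences:
        reply = chat(JUDGE_PROMPT.format(documents=documents,
                                         question=question, sentence=sentence))
        faithful, helpful = (int(x) for x in reply.split()[:2])
        scores.append(faithful + helpful)
    return sum(scores) / max(len(scores), 1)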

This preference data is then used in Reinforcement Learning from AI Feedback (RLAIF) to align model outputs more closely with the provided information, improving faithfulness to the documents while minimizing hallucinations.
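The judge's verdicts can then be serialized as chosen/rejected preference records; the JSONL layout below is only an illustrative format, since the post does not show Adaptive Engine's actual training interface.

import json

def write_preference_record(path: str, question: str, context_docs: list[str],
                            resp_a: str, resp_b: str,
                            score_a: float, score_b: float) -> None:
    """Append one chosen/rejected pair, as consumed by a preference-based trainer."""
    chosen, rejected = (resp_a, resp_b) if score_a >= score_b else (resp_b, resp_a)
    record = {"prompt": {"question": question, "documents": context_docs},
              "chosen": chosen, "rejected": rejected}
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")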

Results: 100% Accuracy Improvement with Reinforcement Fine-Tuning

Before any post-training, the base model (Llama 3.1 8B Instruct) had just a 27% win rate versus GPT-4o; the two models tied in 18% of cases, and GPT-4o won in 53% of cases.

However, after tuning with reinforcement learning, the RLAIF-optimized model had a 58% win rate over GPT-4o, more than doubling the base model's (Llama 3.1 8B Instruct) win rate. GPT-4o won only 27% of the time, and the two models tied in 15% of cases.

These performance enhancements were achieved without dedicated training data or annotators, using only constitutional AI, synthetic data, and AI judges. Once deployed with Adaptive Engine, models learn continuously from production feedback (such as user preferences and business metrics), creating an enterprise AI flywheel that refines performance further over time.

This low-lift, high-impact approach to fine-tuning can be applied across industries and use cases, enabling organizations across sectors to go beyond the capabilities of prompt engineering and unlock unparalleled model performance.

Adaptive ML is training models for use in the finance, telecommunications, insurance, and mobility industries for RAG, text-to-SQL and customer support use cases. Book a demo of Adaptive Engine today.
