Copyright © 2024
Adaptive ML, Inc.
All rights reserved

Test-time compute is two-dimensional
With the fast-follow releases of o1, DeepSeek-R1, o3-mini, and many more flavours of reasoning models, all eyes have been on test-time compute.
Test-time compute refers to the practice of spending more computational resources during inference to deliver better results. With large language models (LLMs), this translates into models that plan, reflect, and reason before delivering an answer. In Poker, Go, and Hanabi it has enabled super-human performance.
Inference is no longer instantaneous; instead, users are presented with a familiar spinner, denoting that the system is thinking. But what is happening behind the scenes?
While many proprietary APIs shroud their machinations in secrecy, reasoning models also come in the open variety. When DeepSeek-R1 thinks, it generates a long stream of tokens: the so-called chain-of-thought (CoT). Test-time compute then simply refers to generating a longer response, effectively spending more tokens, time, and resources on the problem before offering an answer. This reasoning occurs in a single linear path, observable in the CoT.
What about a parallel search? Instead of pushing a single path further, (some) compute could be allocated to searching wider: generating multiple independent chains-of-thought and selecting the best one downstream.
With o3, exploring many trajectories allows the model to top competitive programming benchmarks, and to achieve a breakthrough on ARC-AGI! This two-dimensional approach to test-time compute hasn't gained much traction in open communities yet, so it's worth exploring what it entails.
Let’s think step-by-step: LLMs tackle complex, difficult tasks better when they are encouraged to reason about them. This chain-of-thought prompting approach mirrors our own experience. Depending on the difficulty of a question, we can either provide an immediate, intuitive answer, or sit down and construct a deliberate response.
Can this reasoning process be firmed up through dedicated post-training? Imagine models that can systematically sustain long chains-of-thought, solving novel and challenging problems with ease (and extra tokens).
It turns out reasoning is an emergent property of large-scale reinforcement learning. In other words, the skills of long-form thinking, planning, reflecting, and backtracking arise with just the blunt reward of solving a problem. No forcing function required: this is the aha moment of reasoning models.
Granted, some scaffolding may hasten emergence. But the armamentarium of reasoning lies within every pre-trained model: a silent chaos of deliberative links and logic waiting to be unveiled by post-training. As the reinforcement learning process scrutinizes next-token transitions, the trajectories that draw on this arsenal deliver better results, and are ultimately further encouraged.
In these large-scale RL runs, models are tasked with solving hundreds of thousands of mathematical, logic, and code tasks. These tasks share a commonality: they can be easily validated. Mathematical expressions can be checked; logic can be formally verified; code can be executed and tested. These binary outcomes provide the sole reward signal required to drive the emergence of reasoning.
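To make this concrete, here is a minimal sketch of such a binary reward for a math task, assuming the common convention that the model writes its final answer inside \boxed{...} and that a reference answer is available. The function names are illustrative, and a production verifier would normalize expressions (e.g., with a symbolic math library) rather than compare strings.

```python
import re

def extract_boxed(completion: str) -> str | None:
    # Pull the last \boxed{...} expression out of the chain-of-thought.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None

def math_reward(completion: str, reference_answer: str) -> float:
    # Binary reward: 1.0 if the extracted final answer matches the reference, else 0.0.
    answer = extract_boxed(completion)
    return 1.0 if answer is not None and answer == reference_answer.strip() else 0.0

# Example: a (truncated) chain-of-thought ending in a boxed answer.
cot = r"... adding the two terms gives 35 + 7, so the total is \boxed{42}."
print(math_reward(cot, "42"))  # 1.0
```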
The use of AI judges further unlocks a wealth of scalable validation beyond these binary tasks. Large language models are compilers and verifiers of language: they can check for adherence to instructions, and provide feedback on natural language. Thus, we can use other LLMs as an additional reward signal for training, expanding the scope of reasoning beyond programming and mathematics.
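As a rough sketch of what a judge-based reward could look like, the snippet below asks a second model to grade an answer for instruction adherence. It assumes an OpenAI-compatible client and API key; the prompt, model name, and 1-to-5 scale are illustrative choices, not a prescribed recipe.

```python
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint and API key in the environment

JUDGE_PROMPT = (
    "You are grading an answer against the instructions it was given.\n"
    "Instructions: {instructions}\n"
    "Answer: {answer}\n"
    "Reply with a single integer from 1 (ignores the instructions) to 5 (follows them fully)."
)

def judge_reward(instructions: str, answer: str, judge_model: str = "gpt-4o-mini") -> float:
    # Ask the judge model for a 1-5 grade and map it onto a [0, 1] reward.
    response = client.chat.completions.create(
        model=judge_model,
        temperature=0.0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            instructions=instructions, answer=answer)}],
    )
    content = response.choices[0].message.content or ""
    try:
        grade = int(content.strip())
    except ValueError:
        return 0.0
    return (min(max(grade, 1), 5) - 1) / 4.0
```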
With reasoning models and chains-of-thought, expending more inference compute is simply a matter of generating longer answers. The model plans and explores successive scenarios, backtracking and skipping as necessary. It may call upon the help of external tools, executing code fragments or browsing the web to validate its assumptions. All in a single, serial chain-of-thought.
Nevertheless, this serial reasoning is still a form of search. Unlike explicit search strategies, the search process is internalized within the model during post-training: we don’t see the turbulent motion of all possible paths, but the frozen foam of a single trajectory, from which we can infer the existence of the waterfall.
Large language models are stochastic processes. From a given prompt, they predict not a single token, but a distribution of likelihoods over all possible tokens. This distribution can be sampled to generate diverse trajectories.
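A toy illustration of that sampling step, over a made-up five-token vocabulary: temperature rescales the next-token distribution, so repeated runs diverge into distinct trajectories rather than the single path greedy decoding would produce.

```python
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float = 0.8,
                      rng: np.random.Generator | None = None) -> int:
    # Convert logits into a probability distribution and sample from it.
    # Lower temperatures sharpen the distribution (towards greedy decoding);
    # higher temperatures flatten it, producing more diverse trajectories.
    rng = rng or np.random.default_rng()
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Toy next-token distribution over a five-token vocabulary.
logits = np.array([2.0, 1.5, 0.3, -1.0, -2.0])
print([sample_next_token(logits) for _ in range(8)])  # e.g. [0, 1, 0, 0, 1, 0, 2, 1]
```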
Importantly, sampling many trajectories from a model increases the chance that one will solve the task at hand. This is characterized by pass@k scores: across a dataset of many problems, what percentage are eventually solved if you allow k attempts.
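pass@k is commonly computed with the unbiased estimator popularized by the Codex paper (Chen et al., 2021): generate n samples per problem, count the c that are correct, and estimate the probability that at least one of k drawn samples solves it. A minimal implementation:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    # Probability that at least one of k samples is correct, given that
    # c out of n generated samples solved the problem.
    if n - c < k:
        return 1.0  # too few incorrect samples to fill all k draws: success guaranteed
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 200 samples per problem, 35 of which pass the checks.
print(round(pass_at_k(n=200, c=35, k=1), 3))   # 0.175 (single-attempt accuracy)
print(round(pass_at_k(n=200, c=35, k=10), 3))  # ~0.86 with ten attempts
```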
pass@k scores hint at formidable capabilities lying in the model’s mind. However, to be practically usable, this distribution must be collapsed into a single answer, preferably the right one. This can be achieved with a test-time scoring function, either hand-crafted or learned, to separate the wheat from the chaff.
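In its simplest form, this collapse is best-of-n selection: sample several candidates and keep the one the scoring function prefers. A sketch, with a deliberately naive stand-in score:

```python
from typing import Callable, Sequence

def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    # Collapse many sampled trajectories into one answer by keeping the
    # highest-scoring candidate; `score` may be hand-crafted or a learned verifier.
    return max(candidates, key=score)

# Stand-in scoring function for illustration only: longer answers score higher.
# A real system would score with unit tests, a reward model, or a learned verifier.
answers = ["42", "41", "42, because 6 * 7 = 42"]
print(best_of_n(answers, score=lambda a: float(len(a))))
```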
For formal tasks (e.g., code tasks with well-defined tests), selection through verification is straightforward. Yes, some of the complexity is now displaced to the construction of adequate tests, but you were already adept at test-driven development, weren’t you?
If not, the construction of appropriate test cases may be offloaded to the depth-wise reasoning process. First, the model constructs tests for the task at hand. Then, when searching over many potential solutions, it uses the created tests to filter out incorrect solutions.
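A sketch of that filtering step, assuming candidate solutions and model-written tests are plain Python source; running untrusted generated code like this would, of course, require sandboxing in practice.

```python
import subprocess
import sys
import tempfile

def passes_tests(solution: str, generated_tests: str, timeout: float = 5.0) -> bool:
    # Execute the candidate solution together with the model-written tests;
    # a zero exit code means every assertion passed.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution + "\n\n" + generated_tests)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def filter_candidates(solutions: list[str], generated_tests: str) -> list[str]:
    # Keep only the candidates that survive the model-generated test suite.
    return [s for s in solutions if passes_tests(s, generated_tests)]
```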
In OpenAI’s work on applying o1 and o3 to competitive programming, o1 has to be prompted to write these tests, while o3 autonomously takes the initiative to create them in its reasoning. In turn, o1 has to rely on a complex, human-designed reranking pipeline to filter through ten thousand trajectories and finish in roughly the top half of IOI competitors; o3 can earn a gold medal by simply selecting the answers with the highest test-time compute spent, out of a hundred trajectories.
This hints at another bitter lesson: rather than painstakingly engineering search strategies, it’s better to use the post-training process to teach models to internalize them.
But searching widely is still valuable: either through simple heuristics, like o3 at the IOI, or through a learned scoring function, like Claude 3.7 Sonnet on SWE-Bench. These scoring functions represent a major unknown in reasoning systems.
Currently, public examples are limited to narrow domains, with highly specialized scoring. There, scoring functions can provide a lift over majority voting, with room for further improvements.
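For reference, the majority-voting baseline is simple self-consistency over final answers; one common way to use a score is to weight each vote with it, which is where the lift can come from. A sketch, with illustrative function names:

```python
from collections import Counter

def majority_vote(final_answers: list[str]) -> str:
    # Self-consistency baseline: return the most frequent final answer.
    return Counter(final_answers).most_common(1)[0][0]

def weighted_vote(final_answers: list[str], scores: list[float]) -> str:
    # Scored variant: each trajectory's vote is weighted by its score.
    totals: dict[str, float] = {}
    for answer, score in zip(final_answers, scores):
        totals[answer] = totals.get(answer, 0.0) + score
    return max(totals, key=lambda a: totals[a])

answers = ["42", "42", "41", "42", "41"]
print(majority_vote(answers))                             # "42"
print(weighted_vote(answers, [0.2, 0.3, 0.9, 0.1, 0.8]))  # "41": higher-scored votes win
```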
Reproducing scoring functions openly will be one of the key steps in building models that can match, and exceed, the performance of o3. Specialized scoring functions will first find use in verticalized enterprise applications, where they will enable super-human performance; generalized scoring functions will pave the way to super-intelligent systems.