Very small language models (SLMs) can outperform leading large language models (LLMs) on reasoning tasks, according to new research by Shanghai AI Laboratory. The authors show that, with the right tools and test-time scaling techniques, an SLM with 1 billion parameters can outperform a 405B LLM on complex math benchmarks.
The ability to deploy SLMs on complex reasoning tasks could prove very useful as enterprises look for new ways to use these models in different environments and applications.
Test-time scaling explained
Test-time scaling (TTS) is the practice of giving LLMs extra compute cycles during inference to improve their performance on various tasks. Leading reasoning models, such as OpenAI o1 and DeepSeek-R1, use "internal TTS": they are trained to "think" slowly by generating a long string of chain-of-thought (CoT) tokens.
An alternative approach is "external TTS," where (as the name implies) model performance is enhanced with outside help. External TTS makes it possible to repurpose existing models for reasoning tasks without further fine-tuning them. An external TTS setup is usually composed of a "policy model," the main LLM that generates the answer, and a process reward model (PRM) that evaluates the policy model's answers. These two components are coupled together through a sampling or search method.
The simplest setup is "best-of-N": the policy model generates multiple answers and the PRM selects one or more of the best answers to compose the final response. More advanced external TTS methods use search. In "beam search," the model breaks the answer down into multiple steps. For each step, it samples several candidates and runs them through the PRM. It then keeps one or more of the most promising candidates and generates the next step of the answer from them. And in "diverse verifier tree search" (DVTS), the model generates several branches of answers to create a more diverse set of candidate responses before synthesizing them into a final answer.
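To make this concrete, here is a minimal Python sketch of best-of-N and beam search. The `policy(prompt, n)` callable (returns n sampled continuations from the policy model) and the `prm(prompt, answer)` callable (returns a scalar score from the process reward model) are hypothetical stand-ins for whatever LLM and PRM APIs you use; they are not from the paper.

```python
def best_of_n(prompt, policy, prm, n=8):
    """Best-of-N: sample n complete answers, keep the one the PRM scores highest."""
    candidates = policy(prompt, n)
    return max(candidates, key=lambda ans: prm(prompt, ans))


def beam_search(prompt, policy, prm, beam_width=4, samples_per_step=4, max_steps=8):
    """Beam search: build the answer step by step, keeping only the
    highest-scoring partial answers at each step."""
    beams = [""]  # partial answers so far
    for _ in range(max_steps):
        # sample several possible next steps for each surviving partial answer
        expansions = [
            partial + step
            for partial in beams
            for step in policy(prompt + partial, samples_per_step)
        ]
        # the PRM scores each partial answer; keep only the top `beam_width`
        # (a real implementation would also stop beams that finish the answer)
        expansions.sort(key=lambda ans: prm(prompt, ans), reverse=True)
        beams = expansions[:beam_width]
    return beams[0]
```

DVTS extends the same loop by running several independent subtree searches, so the candidates the PRM compares against each other are more diverse.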

What is the correct scaling strategy?
Choosing the right TTS strategy depends on a variety of factors. The study's authors conducted a systematic investigation of how different policy models and PRMs affect the efficiency of TTS methods.
Their findings show that efficiency depends largely on the policy model and the PRM. For example, for small policy models, search-based methods outperform best-of-N. But for large policy models, best-of-N is more effective, because these models have stronger reasoning capabilities and don't need a reward model to verify every step of their reasoning.
Their findings also show that the right TTS strategy depends on the difficulty of the problem. For example, for small policy models with fewer than 7B parameters, best-of-N works better on easy problems, while beam search works better on harder problems. For policy models that have between 7B and 32B parameters, DVTS performs well on easy and medium problems, while beam search works best for hard problems. But for large policy models (72B parameters and more), best-of-N is the optimal method across all difficulty levels.
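These rules of thumb can be condensed into a small, illustrative helper. The thresholds below simply restate the buckets described above; `choose_tts_strategy` is a hypothetical function for illustration, not part of any released code.

```python
def choose_tts_strategy(policy_params_b: float, difficulty: str) -> str:
    """Map policy-model size (in billions of parameters) and problem
    difficulty ('easy' | 'medium' | 'hard') to the TTS method the study
    found most effective, per the summary above."""
    if policy_params_b < 7:
        # small models benefit from step-level verification on harder problems
        return "best-of-n" if difficulty == "easy" else "beam-search"
    if policy_params_b <= 32:
        return "dvts" if difficulty in ("easy", "medium") else "beam-search"
    # 72B+ models reason well enough that verifying every step adds little
    return "best-of-n"
```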
Why small models beat large models

Based on these findings, developers can create compute-optimal TTS strategies that take into account the policy model, the PRM and the problem difficulty to make the best use of their compute budget when solving reasoning problems.
For example, the researchers found that a Llama-3.2-3B model with the compute-optimal TTS strategy outperforms Llama-3.1-405B on MATH-500 and AIME24, two complex math benchmarks. This shows that, with the compute-optimal TTS strategy, an SLM can outperform a model that is 135X larger.
In other experiments, they found that a Qwen2.5 model with 500 million parameters can outperform GPT-4o with the right compute-optimal TTS strategy. Using the same strategy, the 1.5B distilled version of DeepSeek-R1 outperformed o1-preview and o1-mini on MATH-500 and AIME24.
When accounting for both training and inference compute budgets, the findings show that with compute-optimal scaling strategies, SLMs can outperform larger models while using 100-1000X less compute (FLOPS).
The researchers' results show that compute-optimal TTS significantly enhances the reasoning capabilities of language models. However, as the policy model grows larger, the improvement from TTS gradually decreases.
"This suggests that the effectiveness of TTS is directly related to the reasoning ability of the policy model," the researchers write. "Specifically, for models with weak reasoning abilities, scaling test-time compute leads to a substantial improvement, whereas for models with strong reasoning abilities, the gain is limited."
The study validates that SLMs can perform better than much larger models when using compute-optimal test-time scaling methods. While this study focuses on math benchmarks, the researchers plan to expand their work to other reasoning tasks, such as coding and chemistry.