Less is more: UC Berkeley and Google unlock LLM potential through simple sampling



A new paper by researchers at Google Research and the University of California, Berkeley, shows that a surprisingly simple test-time scaling approach can boost the reasoning abilities of large language models (LLMs). The key? Scaling up sampling-based search, a technique that relies on generating multiple responses and using the model itself to verify them.

The core finding is that even a minimalist implementation of sampling-based search, using random sampling and self-verification, can lift the reasoning performance of models such as Gemini 1.5 Pro beyond that of o1-preview on popular benchmarks. These findings could have important implications for enterprise applications, and they challenge the assumption that highly specialized training or complex architectures are always necessary to reach top-tier performance.

The limits of current test-time compute scaling

The currently popular method for test-time scaling in LLMs is to train the model through reinforcement learning to generate longer responses with chain-of-thought (CoT) traces. This approach is used in models such as OpenAI o1 and DeepSeek-R1. Although beneficial, these methods usually require substantial investment in the training phase.

Another test-time scaling method is "self-consistency," where the model generates multiple responses to the query and selects the answer that appears most often. Self-consistency hits its limits on complex problems, because in these cases the most repeated answer is not necessarily the correct one.
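For illustration, self-consistency boils down to a majority vote over sampled answers. The minimal sketch below assumes a hypothetical `sample_answer` callable that sends the prompt to a model at non-zero temperature and returns the extracted final answer; it is not part of any real API.

```python
from collections import Counter

def self_consistency(prompt: str, sample_answer, num_samples: int = 16) -> str:
    """Pick the most frequent final answer among independently sampled responses.

    `sample_answer` is a hypothetical callable: it queries the model once at
    non-zero temperature and returns the extracted final answer as a string.
    """
    answers = [sample_answer(prompt) for _ in range(num_samples)]
    most_common_answer, _count = Counter(answers).most_common(1)[0]
    return most_common_answer
```

On hard problems, the most frequent answer can still be wrong, which is exactly the limitation sampling-based search with verification is meant to address.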

Sampling-based search offers a simpler and highly scalable alternative for test-time scaling: let the model generate multiple responses and select the best one through a verification mechanism. Sampling-based search can complement other test-time compute scaling strategies, and, as the researchers write in the paper, "it also has the unique advantage of being embarrassingly parallel and allowing for arbitrary scaling: simply sample more responses."

More importantly, sampling-based search can be applied to any LLM, including those that have not been explicitly trained for reasoning.

How sampling-based search works

The researchers focus on a minimalist implementation of sampling-based search, using a language model both to generate candidate responses and to verify them. This is a "self-verification" process, in which the model evaluates its own outputs without relying on external ground-truth answers or symbolic verification systems.

Sampling-based search (credit: VentureBeat)

The algorithm works in a few simple steps:

1 – The algorithm first uses the language model to generate a set of candidate solutions to the given problem. It does this by giving the model the same prompt multiple times and using a non-zero temperature setting to create a diverse set of responses.

2 – Each candidate response goes through a verification process, in which the LLM is prompted multiple times to determine whether the response is correct. The verification results are then averaged to create a final verification score for the response.

3 – The algorithm selects the highest-scoring response as the final answer. If several candidates score close to one another, the LLM is prompted to compare them pairwise and choose the best one. The response that wins the most pairwise comparisons becomes the final answer.

The researchers identify two key axes for scaling test-time computation (both appear as parameters in the sketch that follows this list):

Sampling: The number of responses the model generates for each input problem.

Verification: The number of verification scores computed for each generated solution.
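A minimal sketch of this procedure, under stated assumptions, might look like the following. The helpers `generate_response`, `verify_response`, and `compare_pair` are hypothetical wrappers around LLM calls (they are not part of any real API), and the default parameter values and tie margin are illustrative rather than taken from the paper.

```python
import itertools

def sampling_based_search(prompt, generate_response, verify_response, compare_pair,
                          num_samples=20, num_verifications=10, tie_margin=0.05):
    """Minimal sketch of sampling-based search with self-verification.

    Hypothetical callables wrapping LLM calls:
      - generate_response(prompt) -> one candidate, sampled at non-zero temperature
      - verify_response(prompt, candidate) -> 1 if judged correct, else 0
      - compare_pair(prompt, a, b) -> 0 if candidate a is preferred, 1 if b is preferred
    """
    # Axis 1 (sampling): draw several candidate solutions.
    candidates = [generate_response(prompt) for _ in range(num_samples)]

    # Axis 2 (verification): score each candidate by averaging repeated
    # self-verification judgments.
    scores = [
        sum(verify_response(prompt, c) for _ in range(num_verifications)) / num_verifications
        for c in candidates
    ]

    # Keep every candidate whose score is within a small margin of the best score.
    best = max(scores)
    finalists = [i for i, s in enumerate(scores) if best - s <= tie_margin]
    if len(finalists) == 1:
        return candidates[finalists[0]]

    # Tie-break: pairwise comparisons; the candidate with the most wins is returned.
    wins = {i: 0 for i in finalists}
    for i, j in itertools.combinations(finalists, 2):
        preferred = compare_pair(prompt, candidates[i], candidates[j])
        wins[i if preferred == 0 else j] += 1
    return candidates[max(wins, key=wins.get)]
```

Because each candidate and each verification call is independent, the generation and scoring loops can be dispatched in parallel, which is what makes the approach "embarrassingly parallel."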

How sampling-based search compares to other techniques

The study shows that reasoning performance continues to improve even when test-time compute is scaled far beyond the point at which self-consistency saturates.

At sufficient scale, this minimalist implementation significantly boosts reasoning accuracy on benchmarks such as AIME and MATH. For example, Gemini 1.5 Pro's performance surpassed that of o1-preview, which has been explicitly trained on reasoning problems, and Gemini 1.5 Flash surpassed Gemini 1.5 Pro.

"This not only highlights the importance of sampling-based search for scaling capability, but also suggests the utility of sampling-based search as a simple baseline on which to compare other test-time compute scaling strategies and measure genuine improvements in models' search capabilities," the researchers wrote.

It is worth noting that while the results of sampling-based search are impressive, the costs can also become prohibitive. For example, with 200 samples and 50 verification steps per sample, a single AIME query generates around 130 million tokens, which costs $650 with Gemini 1.5 Pro. However, this is a very minimalist approach to sampling-based search, and it is compatible with optimization techniques proposed in other studies. With smarter sampling and verification methods, the inference costs can be reduced considerably by using smaller models and generating fewer tokens. For example, by using Gemini 1.5 Flash to perform the verification, the costs drop to $12 per question.
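To make the arithmetic concrete, here is a back-of-the-envelope cost estimator. The effective per-million-token rate used below is inferred from the two figures the article reports (about 130 million tokens and about $650 per question); it is not quoted Gemini pricing, so treat it purely as an illustration.

```python
def estimate_cost(total_tokens: float, price_per_million_tokens: float) -> float:
    """Rough cost of one sampling-based-search query, given the total number of
    tokens generated across all sampling and verification calls."""
    return total_tokens / 1_000_000 * price_per_million_tokens

# ~130M tokens at an inferred effective rate of ~$5 per million tokens
print(estimate_cost(130_000_000, 5.0))  # -> 650.0
```

Swapping in a cheaper verifier model shrinks the second factor, which is how the reported cost falls to around $12 per question when Gemini 1.5 Flash handles verification.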

Effective self-verification strategies

There is an ongoing debate over whether LLMs can verify their own answers. The researchers identify two key strategies for improving self-verification using test-time compute:

Directly comparing candidate responses: Disagreements between candidate solutions strongly hint at potential errors. By providing the verifier with multiple responses to compare, the model can better identify mistakes and hallucinations, addressing a core weakness of LLMs. The researchers describe this as an instance of "implicit scaling."

Task-specific rewriting: The researchers suggest that the optimal output style of an LLM depends on the task. Chain-of-thought is effective for solving reasoning tasks, but responses are easier to verify when written in a more formal, mathematically conventional style. Verifiers can rewrite candidate responses into a more structured format (e.g., theorem-lemma-proof) before evaluating them (a prompt sketch combining both strategies appears below).
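As a sketch of how these two strategies might translate into a verification prompt, the template below shows all candidates side by side so the verifier can exploit disagreements between them, and asks for a structured rewrite before the verdict. The wording is my own assumption, not taken from the paper.

```python
def build_verification_prompt(question: str, candidates: list[str]) -> str:
    """Illustrative verification prompt (hypothetical wording, not from the paper).

    It combines the two strategies: showing all candidates together so the
    verifier can use disagreements between them ("implicit scaling"), and
    requesting a rewrite into a more structured style before judging.
    """
    numbered = "\n\n".join(f"Candidate {i + 1}:\n{c}" for i, c in enumerate(candidates))
    return (
        f"Question:\n{question}\n\n"
        f"{numbered}\n\n"
        "For each candidate: rewrite its argument in a formal, structured style "
        "(e.g., theorem-lemma-proof), note any step that conflicts with the other "
        "candidates, and state whether its final answer is correct (yes/no)."
    )
```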

"As models learn to leverage the principles of implicit scaling and output style suitability, we expect the model's self-verification capabilities to improve rapidly in the short term, driving improved scaling rates for sampling-based search," the researchers wrote.

Implications for real-world applications

The research shows that a relatively simple technique can achieve impressive results, reducing the need for complex and costly model architectures or training regimes.

It is also a scalable technique: enterprises can improve performance simply by allocating more compute resources to sampling and verification, and developers can push frontier language models beyond their current limitations on complex tasks.

"Given that it complements other test-time compute scaling strategies, is parallelizable and allows for arbitrary scaling, and admits simple implementations that are demonstrably effective, we expect sampling-based search to prove crucial as language models are tasked with solving increasingly complex problems with increasingly large compute budgets," the researchers wrote.

