
New technique helps LLMs rein in CoT lengths, optimizing reasoning without exploding compute costs





Chain-of-thought (CoT) reasoning – the process by which a model breaks a question down into manageable “thoughts” before deriving an answer – has become an integral part of the latest generation of frontier large language models (LLMs).

However, the inference costs of reasoning models can quickly add up as models generate excess CoT tokens. In a new paper, researchers at Carnegie Mellon University propose an LLM training technique that gives developers more control over the length of the CoT.

Called length controlled policy optimization (LCPO), the technique conditions the model to provide correct answers while keeping its “thoughts” within a predetermined token budget. Experiments show that models trained with LCPO offer a smooth trade-off between accuracy and cost, and can surprisingly outperform larger models at the same reasoning length. LCPO can help dramatically cut the cost of inference in enterprise applications by saving thousands of tokens in each round of conversation with an LLM.

Better LLM performance comes with longer CoTs

Reasoning models such as OpenAI o1 and DeepSeek-R1 are trained through reinforcement learning (RL) to use test-time scaling and generate CoT traces before producing an answer. Empirical evidence shows that when models “think” longer, they tend to perform better on reasoning tasks.

For example, R1 was initially trained on pure RL without human-labeled examples. One insight was that as the model’s performance improved, it also learned to generate longer CoT traces.

Generally speaking, while long CoT chains result in more accurate responses, they also create a compute bottleneck when applying reasoning models at scale. There is currently very little control over the test-time compute budget, and sequences can easily stretch to thousands of tokens without providing significant gains. There have been some efforts to control the length of reasoning chains, but they usually degrade the model’s performance.

Length controlled policy optimization (LCPO), explained

Classic RL methods train LLMs only to achieve the correct response. LCPO changes this paradigm by introducing two training objectives: 1) get the correct result, and 2) keep the CoT chain within a specific token length. So if the model produces the correct response but generates too many CoT tokens, it is penalized and forced to come up with a reasoning chain that reaches the same answer with a smaller token budget.

“The LCPO-trained model learns to satisfy length constraints while optimizing inference performance, rather than relying on hand-designed heuristics,” the researchers wrote.

They propose two flavors of LCPO: (1) LCPO-Exact, which requires the generated reasoning to be exactly equal to the target length, and (2) LCPO-Max, which requires the output to be no longer than the target length. A rough sketch of how such length-aware rewards could be shaped is shown below; the penalty weight `alpha` and the exact scoring rule are illustrative assumptions, not the paper’s precise reward function.
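```python
# Minimal sketch of length-aware reward shaping in the spirit of LCPO.
# The paper's exact reward may differ; `alpha` and the linear penalty
# used here are illustrative assumptions.

def lcpo_exact_reward(is_correct: bool, gen_len: int, target_len: int,
                      alpha: float = 0.001) -> float:
    """Reward correctness, but penalize any deviation from the target length."""
    correctness = 1.0 if is_correct else 0.0
    length_penalty = alpha * abs(gen_len - target_len)
    return correctness - length_penalty


def lcpo_max_reward(is_correct: bool, gen_len: int, target_len: int,
                    alpha: float = 0.001) -> float:
    """Reward correctness, but only penalize output that exceeds the budget."""
    correctness = 1.0 if is_correct else 0.0
    overshoot = max(0, gen_len - target_len)
    return correctness - alpha * overshoot


# Example: a correct answer that overshoots a 1,024-token budget by 500 tokens
print(lcpo_exact_reward(True, 1524, 1024))  # 0.5
print(lcpo_max_reward(True, 1524, 1024))    # 0.5
# A correct answer well under budget is penalized under Exact but not under Max
print(lcpo_exact_reward(True, 600, 1024))   # 0.576
print(lcpo_max_reward(True, 600, 1024))     # 1.0
```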

To test the technique, the researchers fine-tuned a 1.5B-parameter reasoning model (Qwen-Distilled-R1-1.5B) with the two proposed LCPO schemes to create the L1-Max and L1-Exact models. Training was done on math problems with distinct and verifiable results. However, the evaluations included math problems as well as out-of-distribution tasks such as the massive multitask language understanding (MMLU) benchmark and the graduate-level Google-proof Q&A benchmark (GPQA).

Their findings show that the L1 models can precisely balance token budget and reasoning performance, smoothly interpolating between short, efficient reasoning and longer, more accurate reasoning by prompting the model with different length constraints, as in the sketch below. Importantly, on some tasks, the L1 models can reproduce the performance of the original reasoning model at a lower token budget.
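```python
# Illustrative sketch of prompting a length-controlled model with a token budget.
# The instruction wording below is an assumption; the released L1 models define
# their own prompt format.

def build_length_controlled_prompt(question: str, token_budget: int) -> str:
    """Append a target-length instruction so the model plans its CoT to fit."""
    return f"{question}\n\nThink for a maximum of {token_budget} tokens."


question = "What is the sum of the first 100 positive integers?"
cheap_prompt = build_length_controlled_prompt(question, token_budget=512)
thorough_prompt = build_length_controlled_prompt(question, token_budget=4096)
# The same question can be traded off between cheap and thorough reasoning
# simply by changing the budget stated in the prompt.
```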

The L1 models outperform S1 and base models on the cost-accuracy trade-off (source: arXiv)

Compared to S1, the only other method that constrains the length of the CoT, the L1 models show performance gains of up to 150% across different token budgets.

“This substantial difference can be attributed to two key factors,” the researchers wrote. “(1) L1 intelligently adapts its CoT to fit within specified length constraints without disrupting the reasoning process, while S1 often truncates mid-reasoning; and (2) L1 is explicitly trained to generate high-quality reasoning chains of varying lengths, effectively distilling reasoning patterns from longer chains to shorter ones.”

L1 also outperforms its non-reasoning counterpart by 5% and GPT-4o by 2% at the same generation length. “To the best of our knowledge, this is the first demonstration that a 1.5B model can outperform frontier models such as GPT-4o, despite using the same generation length,” the researchers wrote.

Interestingly, the model’s CoT shows that it learns to adjust its reasoning process based on its token budget. For example, on longer budgets, the model is more likely to generate tokens associated with self-correction and verification (i.e., “but” and “wait”) and with drawing conclusions (“therefore” and “so”).

LCPO-trained models adjust their reasoning chains based on their token budget (source: arXiv)

Beyond improved length control in the standard math reasoning setting, the L1 models generalize surprisingly well to out-of-distribution tasks, including GPQA and MMLU.

This new line of research into models that can adjust their reasoning budget can have important uses in real-world applications, giving enterprises the ability to scale reasoning models without runaway expenses. It is a powerful alternative to simply deploying larger, more expensive models, and could be a crucial factor in making AI more economically viable for high-volume applications.

The researchers have open-sourced the LCPO code and the weights of the L1 models.

