A new academic study challenges a core assumption in the development of large language models (LLMs), warning that more pre-training data may not always lead to better models.
Researchers from some of the leading computer science institutions in the world, including Carnegie Mellon University, Stanford University, Harvard University and Princeton University, introduced the concept of “catastrophic overtraining.” They show that extended pre-training can actually make language models harder to fine-tune, ultimately degrading their performance.
The paper, “Overtrained Language Models Are Harder to Fine-Tune,” is available on arXiv and was led by Jacob Mitchell Springer, with co-authors Sachin Goyal, Kaiyue Wen, Tanishq Kumar, Xiang Yue, Sadhika Malladi, Graham Neubig and Aditi Raghunathan.
The law of diminishing returns
The focus of the study is a surprising trend observed in modern LLM development: while models are pre-trained on ever-expanding pools of data, freely available or scraped from the web and represented to the LLM as tokens (numerical representations of concepts and ideas), increasing the number of tokens seen during pre-training can reduce how effective these models are once they are later fine-tuned.
The team conducted a series of empirical evaluations and theoretical analyses to examine the impact of extended pre-training on model adaptability.
One of the key findings centers on AI2’s open-source OLMo-1B model.
The researchers compared two versions of the model: one pre-trained on 2.3 trillion tokens and the other on 3 trillion tokens.
Although the latter was trained on 30% more data, it performed worse after instruction tuning. Specifically, the 3T-token model showed more than 2% worse performance on several standard language model benchmarks compared with its 2.3T-token counterpart. In some evaluations, the degradation reached 3%.
The researchers argue that this decline is not an anomaly but a consistent phenomenon, which they term “catastrophic overtraining.”
Understanding sensitivity and forgetting
The paper attributes this degradation to a systematic increase in what the authors call “progressive sensitivity.” As models undergo extended pre-training, their parameters become more sensitive to changes.
This increased fragility makes them more susceptible to degradation during post-training modifications, such as instruction tuning, fine-tuning for multimodal tasks, or even simple weight perturbations.
The researchers provide evidence that, beyond a certain point in pre-training, any modification (whether structured, like fine-tuning, or unstructured, such as adding Gaussian noise to the weights) leads to a greater loss of previously learned capabilities.
This sensitivity results in “forgetting,” where the model’s original strengths degrade as new training data is introduced.
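For readers who want to experiment with this idea, the sketch below probes a checkpoint’s sensitivity to unstructured perturbations by adding small Gaussian noise to its weights and comparing the language-modeling loss before and after. This is a minimal sketch, not the paper’s evaluation code; the checkpoint name, probe text and noise scale are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint name; the paper compared intermediate OLMo-1B pre-training checkpoints.
name = "allenai/OLMo-1B"
model = AutoModelForCausalLM.from_pretrained(name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model.eval()

text = "Large language models are pre-trained on trillions of tokens."
inputs = tokenizer(text, return_tensors="pt")

def lm_loss(m):
    # Causal language-modeling loss of the model on the probe text.
    with torch.no_grad():
        return m(**inputs, labels=inputs["input_ids"]).loss.item()

baseline = lm_loss(model)

# Unstructured perturbation: add small, fixed-scale Gaussian noise to every parameter.
sigma = 1e-3
with torch.no_grad():
    for p in model.parameters():
        p.add_(torch.randn_like(p) * sigma)

print(f"loss before noise: {baseline:.4f}, after noise: {lm_loss(model):.4f}")
```

Running the same fixed-scale perturbation against checkpoints from different points in pre-training is one way to see, informally, whether later checkpoints are more sensitive to identical disturbances.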
The study identifies an “inflection point” in pre-training, after which additional training yields diminishing or even negative returns on fine-tuning results. For the OLMo-1B model, this threshold appeared at around 2.5 trillion tokens.
A wealth of evidence
The team’s analysis covers both real-world and controlled experimental settings. They tested the phenomenon across different tasks, including instruction tuning using datasets such as Anthropic-HH and Tulu, and multimodal fine-tuning using the LLaVA framework.
The results consistently show that models pre-trained beyond certain token budgets perform worse after fine-tuning.
Furthermore, the researchers built a theoretical model using linear networks to better understand why overtraining increases sensitivity.
Their analysis confirms that progressive sensitivity and catastrophic overtraining become mathematically inevitable when pre-training continues indefinitely without proper constraints.
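As a rough illustration of the kind of setting involved (a toy experiment, not the paper’s actual derivation), the script below trains a two-layer linear network with gradient descent and, at regular checkpoints, measures how much the loss rises when fixed-scale Gaussian noise is added to the weights. All dimensions, the learning rate and the noise scale are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 200, 20, 20                       # samples, input dim, hidden dim
X = rng.normal(size=(n, d))
Y = X @ rng.normal(size=(d, 1))             # targets from a random linear teacher

W1 = rng.normal(size=(d, h)) * 0.1          # two-layer linear "network": X @ W1 @ W2
W2 = rng.normal(size=(h, 1)) * 0.1
lr = 1e-3

def loss(A, B):
    # Mean squared error of the linear network on the training data.
    return float(np.mean((X @ A @ B - Y) ** 2))

def perturbation_gap(A, B, sigma=0.05, trials=20):
    # Average loss increase after adding fixed-scale Gaussian noise to both layers.
    base = loss(A, B)
    gaps = [loss(A + rng.normal(scale=sigma, size=A.shape),
                 B + rng.normal(scale=sigma, size=B.shape)) - base
            for _ in range(trials)]
    return float(np.mean(gaps))

for step in range(1, 20001):
    err = X @ W1 @ W2 - Y                   # (n, 1) residuals
    grad_out = 2 * err / n                  # d(mean squared error) / d(prediction)
    gW2 = (X @ W1).T @ grad_out
    gW1 = X.T @ grad_out @ W2.T
    W1 -= lr * gW1
    W2 -= lr * gW2
    if step % 5000 == 0:
        print(f"step {step:6d}  loss {loss(W1, W2):.4f}  "
              f"perturbation gap {perturbation_gap(W1, W2):.4f}")
```

Tracking the perturbation gap alongside the training loss at each checkpoint gives a simple, hands-on way to observe how sensitivity to identical disturbances can evolve as training continues.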
The key takeaway? Model providers and trainers must make trade-offs
The findings challenge the assumption that more pre-training data is always better. Instead, the paper suggests a nuanced trade-off: while longer pre-training improves the capabilities of the base model, it also increases the risk that fine-tuning will degrade those capabilities.
In fact, attempts to mitigate this effect (such as adjusting the fine-tuning learning rate or adding regularization) may delay the onset of catastrophic overtraining but cannot eliminate it entirely without sacrificing downstream performance.
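A minimal sketch of what such mitigations might look like in practice is shown below, assuming `model` is a pre-trained causal LM (for example, loaded as in the earlier sketch) and `train_batches` is a hypothetical iterable of tokenized instruction-tuning batches. The learning rate and regularization strength are illustrative only, and, per the paper, such settings can delay the degradation rather than remove it.

```python
from torch.optim import AdamW

# Snapshot of the pre-trained weights, used as a regularization anchor.
ref_params = {n: p.detach().clone() for n, p in model.named_parameters()}

optimizer = AdamW(model.parameters(), lr=5e-6)   # deliberately small fine-tuning learning rate
l2_to_init = 1e-3                                # strength of the pull back toward pre-trained weights

model.train()
for batch in train_batches:
    out = model(**batch, labels=batch["input_ids"])
    # Regularizer penalizing drift from the pre-trained checkpoint.
    reg = sum(((p - ref_params[n]) ** 2).sum() for n, p in model.named_parameters())
    loss = out.loss + l2_to_init * reg
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```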
Therefore, for enterprises looking to leverage LLMs to improve business workflows and outcomes, particularly by fine-tuning open-source models, the lesson of this study is that lower-parameter models trained on less data may yield more reliable production models.
The authors acknowledge that further research is needed to understand the factors that influence when and how catastrophic overtraining occurs. Open questions include whether the pre-training optimizer, the training objective, or the data distribution affects the severity of the phenomenon.
Impact on future LLM and AI model development
This study significantly impacts how organizations and researchers design and train large language models. As the field continues to pursue larger, more capable models, this study highlights the importance of balancing training duration with post-training adaptability.
Additionally, the findings may affect how model developers think about resource allocation. Rather than focusing solely on increasing pre-training budgets, developers may need to reassess strategies that optimize downstream performance without incurring the negative effects of catastrophic overtraining.