A new academic study challenges a core assumption in the development of large language models (LLMs), warning that more pre-training data may not always lead to better models.
Researchers from some of the leading computer science institutions in the world, including Carnegie Mellon University, Stanford University, Harvard University and Princeton University, introduced the concept of “catastrophic overtraining.” They show that extended pre-training can actually make language models harder to fine-tune, ultimately degrading their performance.
The paper, “Overtrained Language Models Are Harder to Fine-Tune,” is available on arXiv and was led by Jacob Mitchell Springer, with co-authors Sachin Goyal, Kaiyue Wen, Tanishq Kumar, Xiang Yue, Sadhika Malladi, Graham Neubig and Aditi Raghunathan.
The law of diminishing returns
The focus of the study is a surprising trend observed in modern LLM development: while models are pre-trained on ever-expanding pools of data, freely available or scraped from the web and represented to the LLM as tokens (numerical representations of concepts and ideas), increasing the number of tokens seen during pre-training can reduce how effective these models are once they are later fine-tuned.
The team conducted a series of empirical evaluations and theoretical analyses to examine the impact of extended pre-training on model adaptability.
One of the key findings centers on AI2’s open-source OLMo-1B model.
The researchers compared two versions of the model: one pre-trained on 2.3 trillion tokens and the other on 3 trillion tokens.
Although the latter was trained on 30% more data, it performed worse after instruction tuning. Specifically, the 3T-token model showed more than 2% worse performance on several standard language model benchmarks compared with its 2.3T-token counterpart. In some evaluations, the degradation reached 3%.
The researchers argue that this decline is not an anomaly but a consistent phenomenon, which they term “catastrophic overtraining.”
Understanding sensitivity and forgetting
The paper attributes this degradation to a systematic increase in what the authors call “progressive sensitivity.” As models undergo extended pre-training, their parameters become more sensitive to changes.
This increased fragility makes them more susceptible to degradation during post-training modifications, such as instruction tuning, fine-tuning for multimodal tasks, or even simple weight perturbations.
The researchers provide evidence that, beyond a certain point in pre-training, any modification (whether structured, like fine-tuning, or unstructured, such as adding Gaussian noise to the weights) leads to a greater loss of previously learned capabilities.
This sensitivity results in “forgetting,” where the model’s original strengths degrade as new training data is introduced.
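For readers who want to experiment with this idea, the sketch below probes a checkpoint’s sensitivity to unstructured perturbations by adding small Gaussian noise to its weights and comparing the language-modeling loss before and after. This is a minimal sketch, not the paper’s evaluation code; the checkpoint name, probe text and noise scale are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint name; the paper compared intermediate OLMo-1B pre-training checkpoints.
name = "allenai/OLMo-1B"
model = AutoModelForCausalLM.from_pretrained(name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model.eval()

text = "Large language models are pre-trained on trillions of tokens."
inputs = tokenizer(text, return_tensors="pt")

def lm_loss(m):
    # Causal language-modeling loss of the model on the probe text.
    with torch.no_grad():
        return m(**inputs, labels=inputs["input_ids"]).loss.item()

baseline = lm_loss(model)

# Unstructured perturbation: add small, fixed-scale Gaussian noise to every parameter.
sigma = 1e-3
with torch.no_grad():
    for p in model.parameters():
        p.add_(torch.randn_like(p) * sigma)

print(f"loss before noise: {baseline:.4f}, after noise: {lm_loss(model):.4f}")
```

Running the same fixed-scale perturbation against checkpoints from different points in pre-training is one way to see, informally, whether later checkpoints are more sensitive to identical disturbances.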
The study identifies an “inflection point” in pre-training, after which additional training yields diminishing or even negative returns on fine-tuning results. For the OLMo-1B model, this threshold appeared at around 2.5 trillion tokens.
A wealth of evidence
The team’s analysis covers both real-world and controlled experimental settings. They tested the phenomenon across different tasks, including instruction tuning using datasets such as Anthropic-HH and Tulu, and multimodal fine-tuning using the LLaVA framework.
The results consistently show that models pre-trained beyond certain token budgets perform worse after fine-tuning.
Furthermore, the researchers built a theoretical model using linear networks to better understand why overtraining increases sensitivity.
Their analysis confirms that progressive sensitivity and catastrophic overtraining become mathematically inevitable when pre-training continues indefinitely without proper constraints.
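As a rough illustration of the kind of setting involved (a toy experiment, not the paper’s actual derivation), the script below trains a two-layer linear network with gradient descent and, at regular checkpoints, measures how much the loss rises when fixed-scale Gaussian noise is added to the weights. All dimensions, the learning rate and the noise scale are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 200, 20, 20                       # samples, input dim, hidden dim
X = rng.normal(size=(n, d))
Y = X @ rng.normal(size=(d, 1))             # targets from a random linear teacher

W1 = rng.normal(size=(d, h)) * 0.1          # two-layer linear "network": X @ W1 @ W2
W2 = rng.normal(size=(h, 1)) * 0.1
lr = 1e-3

def loss(A, B):
    # Mean squared error of the linear network on the training data.
    return float(np.mean((X @ A @ B - Y) ** 2))

def perturbation_gap(A, B, sigma=0.05, trials=20):
    # Average loss increase after adding fixed-scale Gaussian noise to both layers.
    base = loss(A, B)
    gaps = [loss(A + rng.normal(scale=sigma, size=A.shape),
                 B + rng.normal(scale=sigma, size=B.shape)) - base
            for _ in range(trials)]
    return float(np.mean(gaps))

for step in range(1, 20001):
    err = X @ W1 @ W2 - Y                   # (n, 1) residuals
    grad_out = 2 * err / n                  # d(mean squared error) / d(prediction)
    gW2 = (X @ W1).T @ grad_out
    gW1 = X.T @ grad_out @ W2.T
    W1 -= lr * gW1
    W2 -= lr * gW2
    if step % 5000 == 0:
        print(f"step {step:6d}  loss {loss(W1, W2):.4f}  "
              f"perturbation gap {perturbation_gap(W1, W2):.4f}")
```

Tracking the perturbation gap alongside the training loss at each checkpoint gives a simple, hands-on way to observe how sensitivity to identical disturbances can evolve as training continues.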
The key takeaway? Model providers and trainers must make trade-offs
The findings challenge the assumption that more pre-training data is always better. Instead, the paper suggests a nuanced trade-off: while longer pre-training improves the capabilities of the base model, it also increases the risk that fine-tuning will degrade those capabilities.
In fact, attempts to mitigate this effect (such as adjusting the fine-tuning learning rate or adding regularization) may delay the onset of catastrophic overtraining but cannot eliminate it entirely without sacrificing downstream performance.
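A minimal sketch of what such mitigations might look like in practice is shown below, assuming `model` is a pre-trained causal LM (for example, loaded as in the earlier sketch) and `train_batches` is a hypothetical iterable of tokenized instruction-tuning batches. The learning rate and regularization strength are illustrative only, and, per the paper, such settings can delay the degradation rather than remove it.

```python
from torch.optim import AdamW

# Snapshot of the pre-trained weights, used as a regularization anchor.
ref_params = {n: p.detach().clone() for n, p in model.named_parameters()}

optimizer = AdamW(model.parameters(), lr=5e-6)   # deliberately small fine-tuning learning rate
l2_to_init = 1e-3                                # strength of the pull back toward pre-trained weights

model.train()
for batch in train_batches:
    out = model(**batch, labels=batch["input_ids"])
    # Regularizer penalizing drift from the pre-trained checkpoint.
    reg = sum(((p - ref_params[n]) ** 2).sum() for n, p in model.named_parameters())
    loss = out.loss + l2_to_init * reg
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```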
Therefore, for enterprises looking to leverage LLMs to improve business workflows and outcomes, particularly by fine-tuning open-source models, the lesson of this study is that lower-parameter models trained on less data may yield more reliable production models.
The authors acknowledge that further research is needed to understand the factors that influence when and how catastrophic overtraining occurs. Open questions include whether the pre-training optimizer, the training objective, or the data distribution affects the severity of the phenomenon.
Impact on future LLM and AI model development
This study significantly impacts how organizations and researchers design and train large language models. As the field continues to pursue larger, more capable models, this study highlights the importance of balancing training duration with post-training adaptability.
Additionally, the findings may affect how model developers think about resource allocation. Rather than focusing solely on increasing pre-training budgets, developers may need to reassess strategies that optimize downstream performance without incurring the negative effects of catastrophic overtraining.