Distillation Can Make AI Models Smaller and Cheaper

The original version of this story appeared in Quanta Magazine.

The Chinese artificial intelligence company DeepSeek released a chatbot called R1 earlier this year, and it drew an enormous amount of attention. Most of that attention focused on the fact that a relatively small, little-known company said it had built a chatbot that rivaled the performance of those from the world's most famous AI companies, while using a fraction of the computing power and cost. As a result, the stocks of many Western tech companies plummeted; Nvidia, which sells the chips that run leading AI models, lost more stock value in a single day than any company in history.

Some of that attention carried an element of accusation. Sources alleged that DeepSeek had obtained, without permission, knowledge from OpenAI's proprietary o1 model by using a technique known as distillation. Much of the news coverage framed this possibility as a shock to the AI industry, implying that DeepSeek had discovered a new, more efficient way to build AI.

But distillation, also known as knowledge distillation, is a widely used tool in AI, a subject of computer science research going back a decade and one that big tech companies apply to their own models. "Distillation is one of the most important tools that companies have today to make models more efficient," said Enric Boix-Adsera, a researcher who studies distillation at the University of Pennsylvania's Wharton School.

Dark Knowledge

The idea of distillation began with a 2015 paper by three Google researchers, including Geoffrey Hinton, the so-called godfather of AI and a 2024 Nobel laureate. At the time, researchers often ran ensembles of models, "many models glued together," said Oriol Vinyals, chief scientist at Google DeepMind and one of the paper's authors, to improve performance. "But it was incredibly cumbersome and expensive to run all the models in parallel," Vinyals said. "We were intrigued by the idea of distilling that down to a single model."

The researchers thought they could make progress by addressing a notable weakness of machine-learning algorithms: all wrong answers were treated as equally bad, no matter how wrong they were. In an image-classification model, for instance, confusing a dog with a fox was penalized the same as confusing a dog with a pizza. The researchers suspected that an ensemble model did contain information about which wrong answers were less bad than others. Perhaps a smaller "student" model could use the information from a large "teacher" model to grasp more quickly the categories into which it was supposed to sort images. Hinton called this "dark knowledge," invoking an analogy with the dark matter of the universe.

After discussing the possibility with Hinton, Vinyals developed a way for a large teacher model to pass more information about image categories to a smaller student model. The key was homing in on the "soft targets" in the teacher model, where it assigns probabilities to each possibility rather than giving a single firm answer. One model might calculate, for example, that there was a 30 percent chance an image showed a dog, a 20 percent chance it showed a cat, a 5 percent chance it showed a cow, and a 0.5 percent chance it showed a car. By using these probabilities, the teacher model effectively reveals to the student that dogs are quite similar to cats, less similar to cows, and very different from cars. The researchers found that this information helped the student learn to identify images of dogs, cats, cows, and cars more efficiently. A big, complicated model could be reduced to a leaner one with barely any loss of accuracy.
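In code, the core of this idea is small. Below is a minimal sketch of soft-target distillation in PyTorch; the temperature, loss weighting, and toy probabilities are illustrative assumptions, not details from the 2015 paper or from this article.

```python
# Minimal soft-target distillation sketch (assumes a generic PyTorch setup).
# Temperature, alpha, and the toy probabilities below are illustrative
# assumptions, not details from the original paper.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend the usual hard-label loss with a soft-target loss that pulls
    the student's probabilities toward the teacher's."""
    # Softening both distributions lets the small probabilities (the
    # "dark knowledge") influence the gradient.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_loss = F.kl_div(log_soft_student, soft_teacher,
                         reduction="batchmean") * temperature ** 2
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Toy example: a teacher that thinks an image is 30% dog, 20% cat,
# 5% cow, 0.5% car (classes: dog, cat, cow, car).
teacher_logits = torch.log(torch.tensor([[0.30, 0.20, 0.05, 0.005]]))
student_logits = torch.randn(1, 4, requires_grad=True)
labels = torch.tensor([0])  # ground truth: dog
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()  # gradients would update the student model's weights
```

Raising the temperature flattens the teacher's distribution, so the tiny probabilities it assigns to near-miss classes still shape what the student learns.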

Explosive Growth

The idea was not an immediate hit. The paper was rejected from a conference, and Vinyals, discouraged, turned to other topics. But distillation arrived at an important moment. Around this time, engineers were discovering that the more training data they fed into neural networks, the more effective those networks became. The size of models soon exploded, and so did their capabilities, but the costs of running them climbed in step with their size.

Many researchers turned to distillation as a way to make smaller models. In 2018, for instance, Google researchers unveiled a powerful language model called BERT, which the company soon began using to help parse billions of web searches. But BERT was big and costly to run, so the next year other developers distilled a smaller version called DistilBERT, which became widely used in business and research. Distillation gradually became ubiquitous, and it is now offered as a service by companies such as Google, OpenAI, and Amazon. The original distillation paper, still published only on the arxiv.org preprint server, has now been cited more than 25,000 times.

Distillation requires access to the innards of the teacher model, so it is not possible for a third party to sneakily distill data from a closed-source model like OpenAI's o1, as DeepSeek was thought to have done. That said, a student model can still learn quite a bit from a teacher model simply by prompting the teacher with certain questions and using the answers to train its own models, an almost Socratic approach to distillation.
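A rough sketch of that prompting-based approach follows; the student never touches the teacher's internals, only its answers. The query_teacher and finetune_student functions are hypothetical placeholders for whatever API and training stack one might use; they are assumptions, not anything described in the article.

```python
# Sketch of "Socratic" distillation through prompting alone. query_teacher
# and finetune_student are hypothetical placeholders, not real APIs.
from typing import Callable

def distill_via_prompting(prompts: list[str],
                          query_teacher: Callable[[str], str],
                          finetune_student: Callable[[list[dict]], None]) -> None:
    """Ask the teacher questions, then train the student on its answers."""
    dataset = [{"prompt": p, "completion": query_teacher(p)} for p in prompts]
    finetune_student(dataset)
```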

Meanwhile, other researchers continue to find new applications. In January, the NovaSky lab at UC Berkeley showed that distillation works well for training reasoning models, which use multistep "thinking" to better answer complicated questions. The lab says its fully open source Sky-T1 model cost less than $450 to train and achieved results similar to those of a much larger open source model. "We were genuinely surprised by how well distillation worked in this setting," said Dacheng Li, a Berkeley doctoral student and co-student lead of the NovaSky team. "Distillation is a fundamental technique in AI."


Original story reprinted with permission from Quanta Magazine, an editorially independent publication of the Simons Foundation whose mission is to enhance public understanding of science by covering research developments and trends in mathematics and the physical and life sciences.

