This Tool Probes Frontier AI Models for Lapses in Intelligence


Executives at AI companies may like to tell us that AGI is almost here, but the latest models still need some extra tutoring to make them as smart as possible.

Scale AI, a company that plays a key role in helping frontier AI companies build advanced models, has developed a platform that automatically tests models against thousands of benchmarks and tasks, pinpoints weaknesses, and flags the additional training data needed to sharpen their skills. Scale, of course, will supply the data required.

Scale rose to prominence by providing human labor for training and testing advanced AI models. Large language models (LLMs) are trained on text scraped from books, the web, and other sources. Turning those models into useful, coherent, and well-behaved chatbots requires additional "post-training" in the form of humans who provide feedback on a model's output.

Scale supplies workers who are experts at probing models for problems and limitations. A new tool, called Scale Evaluation, automates some of this work using Scale's own machine learning algorithms.

"In the big labs, there are all these haphazard ways of tracking certain model weaknesses," said Daniel Berrios, head of product for Scale Evaluation. The new tool "is a way for [model makers] to go through results and slice and dice them to see where a model is underperforming," Berrios said, "then use that to target the data campaigns for improvement."

Berrios said several frontier AI model companies are already using the tool. Most are using it to improve the reasoning abilities of their best models, he said. AI reasoning involves a model trying to break a problem down into constituent parts in order to solve it more effectively. The approach relies heavily on post-training feedback to determine whether the model solved a problem correctly.

In one instance, Berrios said, Scale Evaluation revealed that a model's reasoning skills fell off when it was fed non-English prompts. "While [the model's] general reasoning capabilities were pretty good and performed well on benchmarks, they tended to degrade quite a bit when the prompts were not in English," he said. Scale Evaluation highlighted the issue and allowed the company to gather additional training data to address it.

Jonathan Frankle, chief AI scientist at Databricks, a company that builds large AI models, said that being able to test one foundation model against another sounds useful in principle. "Anyone who moves the ball forward on evaluation is helping us build better AI," Frankle said.

In recent months, Scale has developed several new benchmarks designed to push AI models to become smarter and to scrutinize more closely how they might misbehave. These include EnigmaEval, MultiChallenge, MASK, and Humanity's Last Exam.

Scale says measuring improvements in AI models is becoming more challenging as the models get better at acing existing tests. The company says its new tool offers a more comprehensive picture by combining many different benchmarks, and it can be used to design custom tests of a model's abilities, such as probing its reasoning in different languages. Scale's own AI takes a given problem and generates further examples, allowing a more thorough test of a model's skills.

The company's new tool may also feed into efforts to standardize how AI models are tested for misbehavior. Some researchers say that a lack of standardization means some models' flaws never come to light.

In February, the US National Institute of Standards and Technology announced that Scale would help it develop methods for testing models to ensure they are safe and trustworthy.

What errors have you spotted in the output of AI tools? What do you think are models' biggest blind spots? Let us know by emailing hello@wired.com or by commenting below.
