A new, challenging AGI test stumps most AI models | TechCrunch
The Arc Prize Foundation, a nonprofit co-founded by prominent AI researcher François Chollet, announced in a blog post on Monday that it has created a new, challenging test to measure the general intelligence of leading AI models.

So far, the new test, called ARC-AGI-2, has stumped most models.

“Reasoning” AI models such as OpenAI’s o1-pro and DeepSeek’s R1 score between 1% and 1.3% on ARC-AGI-2, according to the Arc Prize leaderboard. Powerful non-reasoning models, including GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash, score around 1%.

ARC-AGI tests consist of puzzle-like problems in which an AI must identify visual patterns in grids of different-colored squares and generate the correct “answer” grid. The problems are designed to force an AI to adapt to novel problems it has never seen before.
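For readers unfamiliar with the format, here is a minimal sketch in Python of how an ARC-style task can be represented. ARC tasks are published as JSON with “train” demonstration pairs and “test” inputs, where each grid is a 2-D array of color codes; the specific grids and the trivial “recolor” rule below are invented for illustration and are far simpler than anything on ARC-AGI-2.

```python
# Illustrative ARC-style task. Grids are 2-D arrays of color codes (0-9).
# The grids and the rule here are toy examples, not real ARC-AGI-2 problems.
Grid = list[list[int]]

task = {
    "train": [  # demonstration pairs the solver learns the rule from
        {"input": [[1, 0], [0, 1]], "output": [[2, 0], [0, 2]]},
        {"input": [[0, 1], [1, 1]], "output": [[0, 2], [2, 2]]},
    ],
    "test": [{"input": [[1, 1], [0, 1]]}],  # inputs the solver must answer
}

def recolor(grid: Grid, old: int, new: int) -> Grid:
    """Apply the rule inferred from the demonstrations above:
    every cell of color `old` becomes color `new`."""
    return [[new if cell == old else cell for cell in row] for row in grid]

# A solver sees only the train pairs, must infer the rule, then apply it.
predicted = recolor(task["test"][0]["input"], old=1, new=2)
print(predicted)  # [[2, 2], [0, 2]]
```

The difficulty comes from the fact that each task uses a different, previously unseen rule, so a model cannot memorize solutions; it has to infer the transformation from a handful of examples.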

The Arc Prize Foundation had more than 400 people take ARC-AGI-2 to establish a human baseline. On average, “panels” of these people got 60% of the test’s questions right, far better than any model’s score.

A sample problem from ARC-AGI-2 (Image credit: Arc Prize).

In a post on X, Chollet claimed that ARC-AGI-2 is a better measure of an AI model’s actual intelligence than the first iteration of the test, ARC-AGI-1. The Arc Prize Foundation’s tests are designed to evaluate whether an AI system can efficiently acquire new skills outside of the data it was trained on.

Unlike ARC-AGI-1, the new test prevents AI models from relying on “brute force” (extensive computing power) to find solutions, Chollet said. He has previously admitted that this was a major flaw of ARC-AGI-1.

To address the first test’s flaws, ARC-AGI-2 introduces a new metric: efficiency. It also requires models to interpret patterns on the fly rather than relying on memorization.

“Intelligence is not solely defined by the ability to solve problems or achieve high scores,” Greg Kamradt, co-founder of the Arc Prize Foundation, wrote in a blog post. “The efficiency with which those capabilities are acquired and deployed is a crucial, defining component. The core question being asked is not just, ‘Can AI acquire [the] skill to solve a task?’ but also, ‘At what efficiency or cost?’”

ARC-AGI-1 went unbeaten for roughly five years until December 2024, when OpenAI released its advanced reasoning model, o3, which outperformed all other AI models and matched human performance on the evaluation. But, as we pointed out at the time, o3’s performance on ARC-AGI-1 came at a steep price.

The version of OpenAI’s o3 model that first reached new heights on ARC-AGI-1, o3 (low), scored 75.7% on that test, but it gets a meager 4% on ARC-AGI-2 while using $200 worth of computing power per task.

A comparison of frontier AI model performance on ARC-AGI-1 versus ARC-AGI-2 (Image credit: Arc Prize).

The arrival of ARC-AGI-2 comes amid calls across the tech industry for new, unsaturated benchmarks to measure AI progress. Hugging Face co-founder Thomas Wolf recently told TechCrunch that the AI industry lacks sufficient tests to measure the key traits of so-called artificial general intelligence, including creativity.

Alongside the new benchmark, the Arc Prize Foundation announced a new Arc Prize 2025 contest, challenging developers to achieve 85% accuracy on the ARC-AGI-2 test while spending only $0.42 per task.
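To see why the efficiency dimension matters, here is a rough sketch comparing “accuracy per dollar” using the figures cited in this article. The metric itself is a naive illustration, not an official Arc Prize formula, and the contest target is a goal rather than a reported result.

```python
# Figures cited in this article: o3 (low) reportedly gets 4% on ARC-AGI-2
# at roughly $200 of compute per task, while the Arc Prize 2025 contest
# targets 85% at $0.42 per task. "Accuracy per dollar" is one naive lens
# on efficiency, invented here for illustration.
entries = [
    ("o3 (low), as reported", 0.04, 200.00),
    ("Arc Prize 2025 target", 0.85, 0.42),
]

for name, accuracy, cost_per_task in entries:
    efficiency = accuracy / cost_per_task  # accuracy points per dollar
    print(f"{name}: {accuracy:.0%} at ${cost_per_task:.2f}/task "
          f"-> {efficiency:.4f} accuracy per dollar")
```

By this (admittedly crude) measure, the contest target is about four orders of magnitude more cost-efficient than the reported o3 (low) result, which is exactly the gap the new efficiency metric is meant to highlight.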

Source link
