A debate over AI benchmarks, and how AI labs report them, is spilling into public view.
This week, an OpenAI employee accused Elon Musk's AI company, xAI, of publishing misleading benchmark results for its latest AI model, Grok 3. One of xAI's co-founders insisted that the company was in the right.
The truth lies somewhere in between.
In a post on xAI's blog, the company published a graph showing Grok 3's performance on AIME 2025, a collection of challenging math questions from a recent invitational mathematics exam. Some experts have questioned AIME's validity as an AI benchmark. Nevertheless, AIME 2025 and older versions of the test are commonly used to probe a model's math ability.
xAI's graph showed two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI's best available model, o3-mini-high, on AIME 2025. But OpenAI employees on X were quick to point out that xAI's graph didn't include o3-mini-high's AIME 2025 score at "cons@64."
What is cons@64, you might ask? Well, it's short for "consensus@64," and it basically gives a model 64 tries to answer each problem in a benchmark, then takes the answer it generated most frequently as its final answer. As you can imagine, cons@64 tends to boost a model's benchmark score quite a bit, and omitting it from a graph can make it look as though one model surpasses another when in reality that isn't the case.
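The consensus idea itself is simple majority voting over repeated samples. Here is a minimal sketch, not xAI's or OpenAI's actual evaluation harness; the function name and the sample data are purely illustrative:

```python
from collections import Counter

def cons_at_k(answers_per_question, k=64):
    """Score at cons@k: for each question, take the model's k sampled
    answers and return the most frequent one as the final answer."""
    final_answers = []
    for answers in answers_per_question:
        assert len(answers) == k, "expected exactly k samples per question"
        most_common_answer, _count = Counter(answers).most_common(1)[0]
        final_answers.append(most_common_answer)
    return final_answers

# Hypothetical question: the model answers "42" on 40 of 64 samples,
# so consensus voting settles on "42" even though single tries often miss.
samples = ["42"] * 40 + ["41"] * 14 + ["43"] * 10
print(cons_at_k([samples])[0])  # "42"
```

By contrast, a "@1" score uses only a single sample per question, which is why the two numbers can diverge so sharply on hard problems.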
Grok 3 Reasoning Beta's and Grok 3 mini Reasoning's scores for AIME 2025 at "@1" (meaning the first score the models got on the benchmark) fall below o3-mini-high's score. Grok 3 Reasoning Beta also trails behind OpenAI's o1 model set to "medium" computing. Yet xAI is advertising Grok 3 as the "world's smartest AI."
Babuschkin argued on X that OpenAI has published similarly misleading benchmark charts in the past, though those charts compared the performance of its own models. A more neutral party in the debate put together a more "accurate" graph showing nearly every model's performance at cons@64:
"interesting

(I actually believe Grok looks good there, and OpenAI's TTC chicanery behind o3-mini-*high*-pass@"1" deserves more scrutiny.)" https://t.co/djqljpcjh8 pic.twitter.com/3wh8foufic – Teortaxes▶️ (DeepSeek Twitter 🐋 diehard fan 2023 – ∞) (@TeorTaxEstex) February 20, 2025
But as AI researcher Nathan Lambert pointed out in a post, perhaps the most important metric remains a mystery: the computational (and monetary) cost it took for each model to achieve its best score. That just goes to show how little most AI benchmarks communicate about models' limitations and strengths.