Meta’s benchmarks for its new AI models are a bit misleading | TechCrunch


One of the new flagship AI models Meta released on Saturday, Maverick, ranks second on LM Arena, a test in which human raters compare the outputs of models and choose which they prefer. But the version of Maverick that Meta deployed to LM Arena appears to differ from the version widely available to developers.

As several AI researchers noted on X, Meta's announcement describes the Maverick on LM Arena as an "experimental chat version." The official Llama website, meanwhile, discloses that Meta's LM Arena testing was conducted using "Llama 4 Maverick optimized for conversationality."

As we've written before, LM Arena has never been the most reliable measure of an AI model's performance, for a variety of reasons. But AI companies generally haven't customized or otherwise fine-tuned their models to score better on LM Arena, or at least haven't admitted to doing so.

The problem with tailoring a model to a benchmark, withholding that version, and then releasing a "vanilla" variant of the same model is that it makes it hard for developers to predict how the model will actually perform in a given context. It's also misleading. Ideally, benchmarks, inadequate as they are, provide a snapshot of a single model's strengths and weaknesses across a range of tasks.

Indeed, researchers on X have observed stark differences in the behavior of the publicly downloadable Maverick compared with the model hosted on LM Arena. The LM Arena version appears to use far more emoji and to give incredibly long-winded answers.

We've reached out to Meta, and to Chatbot Arena, the organization that maintains LM Arena, for comment.

