Meta’s benchmarks for its new AI models are a bit misleading | TechCrunch


One of the new flagship AI models Meta released on Saturday, Maverick, ranks second on LM Arena, a test in which human raters compare the outputs of models and choose which they prefer. But the version of Maverick that Meta deployed to LM Arena appears to differ from the version widely available to developers.

As several AI researchers noted on X, Meta's announcement describes the Maverick on LM Arena as an "experimental chat version." The official Llama website, meanwhile, discloses that Meta's LM Arena testing was conducted using "Llama 4 Maverick optimized for conversationality."

As we've written before, LM Arena has never been the most reliable measure of an AI model's performance, for a variety of reasons. But AI companies generally haven't customized or otherwise fine-tuned their models to score better on LM Arena, or at least haven't admitted to doing so.

The problem with tailoring a model to a benchmark, withholding that version, and then releasing a "vanilla" variant of the same model is that it makes it hard for developers to predict how the model will actually perform in a given context. It's also misleading. Ideally, benchmarks, inadequate as they are, provide a snapshot of a single model's strengths and weaknesses across a range of tasks.

Indeed, researchers on X have observed stark differences in the behavior of the publicly downloadable Maverick compared with the model hosted on LM Arena. The LM Arena version appears to use far more emoji and to give incredibly long-winded answers.

We've reached out to Meta, and to Chatbot Arena, the organization that maintains LM Arena, for comment.

