This Week in AI: Maybe we should ignore AI benchmarks for now

Welcome to TechCrunch’s regular AI newsletter! We’re going on a break for a while, but you can find all of our AI coverage on TechCrunch, including my columns, daily analysis and breaking news coverage. If you want those stories and more in your inbox every day, sign up for our daily newsletter here.

This week, billionaire Elon Musk’s AI startup xAI released its latest flagship AI model, Grok 3, which powers the company’s Grok chatbot apps. Trained on about 200,000 GPUs, the model beats many other leading models, including ones from OpenAI, on benchmarks for mathematics, programming, and more.

But what do these benchmarks really tell us?

At TC, we often reluctantly report benchmark numbers because they are one of the few (relatively) standardized ways in which the AI industry measures model improvements. Popular AI benchmarks tend to test for esoteric knowledge and give aggregate scores that correlate poorly with proficiency on the tasks that most people care about.

As Wharton professor Ethan Mollick pointed out in a series of posts on X after the unveiling of Grok 3 on Monday, there is an urgent need for better tests and independent testing authorities. AI companies self-report benchmark results more often than not, and as Mollick suggests, that makes those results harder to accept at face value.

“Public benchmarks are both ‘meh’ and saturated, leaving a lot of AI testing to be like food reviews, based on taste,” Mollick wrote. “If AI is critical to work, we need more.”

There is no shortage of independent tests and organizations proposing new benchmarks for AI, but their relative merit is far from a settled matter within the industry. Some AI commentators and experts propose aligning benchmarks with economic impact to ensure their usefulness, while others argue that adoption and utility are the ultimate benchmarks.

The debate may rage until the end of time. Perhaps we should instead, as X user Roon prescribes, just pay less attention to new models and benchmarks unless there is a major technological breakthrough in AI. For our collective sanity, that may not be the worst idea, even if it does induce a certain level of AI FOMO.

As mentioned above, This Week in AI is going on a break. Thank you for sticking with us through this roller coaster of a journey, readers. Until next time.

News

**Image source:** Nathan Ryan/Bloomberg/Getty Images

OpenAI tries to “uncensor” ChatGPT: Max wrote about how OpenAI is changing its approach to AI development to explicitly embrace “intellectual freedom,” no matter how challenging or controversial a topic may be.

Mira’s new startup: Former OpenAI CTO Mira Murati’s new startup, Thinking Machines Lab, intends to build tools to “make AI work for [people’s] unique needs and goals.”

Grok 3 cometh: Elon Musk’s AI startup, xAI, has released its latest flagship AI model, Grok 3, and unveiled new features for the Grok apps for iOS and the web.

A very Llama conference: Meta will host its first developer conference dedicated to generative AI this spring. Called LlamaCon after Meta’s Llama family of generative AI models, the conference is scheduled to take place on April 29.

AI and Europe’s digital sovereignty: Paul profiled a collaboration between approximately 20 organizations to build “a series of foundation models for transparent AI in Europe” that preserves the “linguistic and cultural diversity” of all EU languages.

Research papers

This illustration photo shows the OpenAI ChatGPT website displayed on a laptop screen. **Image source:** Jakub Porzycki / NurPhoto / Getty Images

OpenAI researchers have created a new AI benchmark, SWE-Lancer, that aims to evaluate the coding capabilities of powerful AI systems. The benchmark consists of more than 1,400 freelance software engineering tasks, ranging from bug fixes and feature deployments to “manager-level” technical implementation proposals.

According to OpenAI, the best-performing AI model, Anthropic’s Claude 3.5 Sonnet, scored 40.3% on the full SWE-Lancer benchmark, suggesting that AI still has a long way to go. It is worth noting that the researchers did not benchmark newer models like OpenAI’s o3-mini or Chinese AI company DeepSeek’s R1.

Model of the week

A Chinese AI company called Stepfun has released an “open” AI model, Step-Audio, that can understand and generate speech in several languages. Step-Audio supports Chinese, English and Japanese and lets users adjust the emotion and even the dialect of the synthetic audio they create, including singing.

Stepfun is one of several well-funded Chinese AI startups releasing models under a permissive license. Founded in 2023, Stepfun reportedly recently closed a funding round worth several hundred million dollars from a number of investors, including Chinese state-owned private equity firms.

Grab bag

AI research group Nous Research has released what it claims is one of the first AI models to unify reasoning and “intuitive language model capabilities.”

The model, DeepHermes-3 Preview, can toggle long “chains of thought” on and off to improve accuracy at the cost of some additional computation. In “reasoning” mode, DeepHermes-3 Preview, like other reasoning AI models, “thinks” longer on harder problems and shows its thought process on the way to an answer.

Anthropic reportedly plans to release an architecturally similar model soon, and OpenAI has said that such a model is on its near-term roadmap.
