Thought Pokémon was a tough benchmark for AI? One group of researchers argues that Super Mario Bros. is even tougher.
Hao AI Lab, a research organization at the University of California San Diego, threw AI into live Super Mario Bros. games on Friday. Anthropic’s Claude 3.7 performed the best, followed by Claude 3.5. Google’s Gemini 1.5 Pro and OpenAI’s GPT-4o struggled.
To be clear, it wasn’t quite the same version of Super Mario Bros. as the original 1985 release. The game ran in an emulator and was integrated with a framework, GamingAgent, that gave the AIs control over Mario.

GamingAgent, which Hao developed in-house, fed the AI basic instructions, such as “If an obstacle or enemy is near, move/jump left to dodge,” along with in-game screenshots. The AI then generated inputs in the form of Python code to control Mario.
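The loop described above can be sketched roughly as follows. This is an illustration only, not GamingAgent's actual code: `FakeController`, `stub_model`, and `agent_step` are hypothetical names, and the stub stands in for a real model call that would receive the instruction and a screenshot and return Python source.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Instruction of the kind quoted in the article.
INSTRUCTION = "If an obstacle or enemy is near, move/jump left to dodge."


@dataclass
class FakeController:
    """Hypothetical stand-in for the emulator's input layer; it only logs presses."""
    pressed: List[str] = field(default_factory=list)

    def press(self, button: str) -> None:
        self.pressed.append(button)


def stub_model(instruction: str, screenshot: bytes) -> str:
    """Hypothetical stand-in for a model call; returns Python source as text."""
    return "controller.press('left')\ncontroller.press('jump')"


def agent_step(model: Callable[[str, bytes], str],
               controller: FakeController,
               screenshot: bytes) -> None:
    """Ask the model for code, then run that code against the controller."""
    code = model(INSTRUCTION, screenshot)
    # Expose only the controller name to the generated code.
    # Note: stripping builtins like this is NOT a real security sandbox.
    exec(code, {"__builtins__": {}}, {"controller": controller})


controller = FakeController()
agent_step(stub_model, controller, screenshot=b"")
print(controller.pressed)
```

In a real setup the screenshot would come from the emulator and the button presses would be forwarded to it; both are left out here to keep the sketch self-contained.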
Still, Hao says the game forced each model to “learn” to plan complex maneuvers and develop gameplay strategies. Interestingly, the lab found that so-called reasoning models such as OpenAI’s o1, which “think” through problems step by step to arrive at a solution, performed worse than “non-reasoning” models, despite being generally stronger on most benchmarks.
One of the main reasons reasoning models have trouble playing real-time games like this is that they take a while (usually a few seconds) to decide on actions, according to the researchers. In Super Mario Bros., timing is everything. A second can mean the difference between a jump safely cleared and a plummet to your death.
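To put the timing problem in rough numbers, the short calculation below converts decision latency into game frames. It assumes the emulated game advances at about 60 frames per second, roughly the rate of the original console; the latencies are illustrative values, not measurements reported by the researchers, and `frames_elapsed` is a hypothetical helper.

```python
# Assumption: the game advances at roughly 60 frames per second.
ASSUMED_FPS = 60.0


def frames_elapsed(latency_seconds: float, fps: float = ASSUMED_FPS) -> int:
    """Return how many whole game frames pass while the model is deciding."""
    return int(latency_seconds * fps)


# Illustrative latencies only, not figures from the study.
for latency in (0.5, 2.0, 5.0):
    print(f"{latency:.1f} s of thinking = {frames_elapsed(latency)} frames")
```

Under that assumption, a model that takes two seconds to respond is acting on a screenshot that is already 120 frames out of date.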
Games have been used to benchmark AI for decades. But some experts have questioned the wisdom of drawing connections between AI’s gaming skills and technological advancement. Unlike the real world, games tend to be abstract and relatively simple, and they provide a theoretically infinite amount of data for training AI.
The recent flashy gaming benchmarks point to what Andrej Karpathy, a research scientist and founding member of OpenAI, called an “evaluation crisis.”
“I really don’t know what [AI] metrics to look at now,” he wrote in a post on X. “My reaction is that I really don’t know how good these models are now.”
At least we can watch AI play Mario.