As conventional AI benchmarking techniques prove insufficient, AI builders are turning to more creative ways to evaluate the capabilities of generative AI models. For one group of developers, that means Minecraft, the sandbox building game owned by Microsoft.
The website Minecraft Benchmark (or MC-Bench) was developed to pit AI models against each other in head-to-head challenges, asking each to respond to the same prompt with a Minecraft creation. Users vote on which model did a better job, and only after voting can they see which AI made each build.
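Head-to-head votes like these are commonly aggregated into a leaderboard with an Elo-style rating update. The sketch below is an illustration of that general approach, not a description of MC-Bench's actual scoring code; the model names and starting ratings are assumptions.

```python
# Sketch: turning pairwise "which build is better?" votes into ratings
# with an Elo-style update. Hypothetical example, not MC-Bench's code.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def apply_vote(ratings: dict, winner: str, loser: str, k: float = 32) -> None:
    """Apply one user vote: the winner's build was preferred."""
    e_w = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1 - e_w)  # winner gains
    ratings[loser] -= k * (1 - e_w)   # loser loses the same amount

ratings = {"model_a": 1000.0, "model_b": 1000.0}
apply_vote(ratings, "model_a", "model_b")
print(ratings)  # model_a now rated above model_b
```

Because gains and losses are symmetric, the total rating pool stays constant while repeated votes gradually separate consistently preferred models from the rest.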
For Adi Singh, the 12th-grade student who founded MC-Bench, the value of Minecraft is not the game itself, but people's familiarity with it. After all, it is the best-selling video game of all time. Even people who have never played it can still judge which blocky representation of a pineapple is better.
“Minecraft allows people to see the progress [of AI development] easily,” Singh told TechCrunch. “People are used to the look and the atmosphere of the game.”
MC-Bench currently lists eight people as volunteer contributors. Anthropic, Google, OpenAI, and Alibaba have subsidized the project's use of their products to run the benchmark prompts, according to MC-Bench's website, but the companies are not otherwise affiliated with it.
“At the moment, we are just doing simple builds to reflect on how far we have come from the GPT-3 era, but [we] could see ourselves expanding to these long-form plans and goal-oriented tasks,” Singh said. “Games may just be a medium for testing agentic reasoning that is safer than real life and more controllable for testing purposes, making it more ideal in my eyes.”
Other games, including Pokémon Red, Street Fighter, and Pictionary, have been used as experimental benchmarks for AI, in part because the art of AI benchmarking is notoriously tricky.
Researchers often test AI models on standardized evaluations, but many of these tests give AI a home-field advantage. Because of the way they are trained, models are naturally gifted at certain narrow kinds of problem-solving, particularly problem-solving that rewards rote memorization or basic extrapolation.
In short, it is difficult to glean what it means that OpenAI's GPT-4 can score in the 88th percentile on the LSAT yet cannot determine how many Rs are in the word “strawberry.” Anthropic's Claude 3.7 Sonnet achieved 62.3% accuracy on a standardized software engineering benchmark, but it plays Pokémon worse than most five-year-olds.

Technically, MC-Bench is a programming benchmark, since the models are asked to write code to create the prompted builds, such as “Frosty the Snowman” or “a charming tropical beach hut on a pristine sandy shore.”
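To give a feel for the task, here is a hypothetical sketch of the kind of code a model might emit for a prompt like “Frosty the Snowman.” The block-placement representation, function names, and block IDs are all assumptions for illustration; MC-Bench's actual interface to Minecraft is not described here.

```python
# Hypothetical sketch of model-generated build code: describe a build
# as a list of (x, y, z, block) placements. Not MC-Bench's real API.
from math import dist

def sphere(cx, cy, cz, radius, block):
    """Fill a rough voxel sphere with the given block type."""
    blocks = []
    r = int(radius)
    for x in range(cx - r, cx + r + 1):
        for y in range(cy - r, cy + r + 1):
            for z in range(cz - r, cz + r + 1):
                if dist((x, y, z), (cx, cy, cz)) <= radius:
                    blocks.append((x, y, z, block))
    return blocks

def frosty_the_snowman():
    """Three stacked snow spheres plus a carrot nose."""
    build = []
    build += sphere(0, 3, 0, 3, "snow_block")     # base
    build += sphere(0, 8, 0, 2.2, "snow_block")   # torso
    build += sphere(0, 12, 0, 1.6, "snow_block")  # head
    build.append((0, 12, 2, "carrot"))            # nose
    return build

placements = frosty_the_snowman()
print(len(placements))  # total blocks in the build
```

Judging whether code like this produces a convincing snowman is exactly the part that is hard to score automatically, which is where MC-Bench's human voters come in.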
However, for most MC-Bench users, it is easier to evaluate whether a snowman looks good than to pore over code. That gives the project broader appeal, and thus the potential to gather more data on which models consistently score better.
Of course, whether these scores say anything about AI's practical utility is up for debate. Singh asserts that they are a strong signal.
“The current leaderboard reflects my own experience of using these models quite closely, which is unlike a lot of pure-text benchmarks,” Singh said. “Maybe [MC-Bench] could be useful to companies to know if they are heading in the right direction.”