
Will Smith Eating Spaghetti and Other Weird AI Benchmarks That Took Off in 2024


In the rapidly evolving field of artificial intelligence (AI), it’s not uncommon to see a new AI video generator emerge, only to be quickly put through its paces by creating a video of actor Will Smith eating spaghetti. This trend has become a meme in itself, serving as both a benchmark and a challenge for the latest video generation technology.

However, this is just one example of the many unconventional "unofficial" benchmarks that have taken the AI community by storm in 2024. A 16-year-old developer created an app that gives AI control over Minecraft, testing its ability to design structures. Meanwhile, a British programmer developed a platform where AI plays games like Pictionary and Connect 4 against each other.

The Limitations of Traditional Benchmarks

So, why do these unconventional benchmarks stand out? One reason is that the industry-standard AI benchmarks often fail to tell the average person what they want to know. Companies frequently cite their AI’s ability to answer questions on Math Olympiad exams or figure out plausible solutions to PhD-level problems. While these metrics may be impressive in a technical sense, they don’t necessarily reflect how well an AI will perform in everyday tasks.

For instance, most people use chatbots for simple tasks like responding to emails and basic research. However, traditional benchmarks often focus on more complex tasks that are less relevant to the average user. This disconnect between what is being measured and what matters to users is a major issue.

The Problem with Crowdsourced Industry Measures

Another problem with industry-standard AI benchmarks is that they often rely on crowdsourced measures like Chatbot Arena, a public benchmark that many AI enthusiasts and developers follow obsessively. However, these ratings tend not to be representative of the broader user base. Most raters come from AI and tech industry circles and cast their votes based on personal, hard-to-pin-down preferences.

A New Approach: Focusing on Downstream Impacts

Ethan Mollick, a professor of management at Wharton, recently pointed out that many AI industry benchmarks fail to compare a system’s performance to that of the average person. "The fact that there are not 30 different benchmarks from different organizations in medicine, in law, in advice quality, and so on is a real shame, as people are using systems for these things, regardless," he wrote.

In response to this criticism, some experts suggest that the AI community focus on the downstream impacts of AI instead of its ability in narrow domains. This approach would shift the emphasis from how well an AI performs on specific tasks to how it affects users and society as a whole.

The Enduring Appeal of Unconventional Benchmarks

Despite their limitations, unconventional benchmarks like Connect 4, Minecraft, and Will Smith eating spaghetti are unlikely to disappear anytime soon. Not only are they entertaining and easy to understand, but they also provide a way for the industry to distill complex AI technology into digestible marketing.

As my colleague Max Zeff wrote recently, the industry continues to grapple with how to communicate AI’s benefits and risks to a broad audience. Unconventional benchmarks offer a unique opportunity to engage users and showcase AI’s capabilities in a more accessible way.

The Future of AI Benchmarks: What’s Next?

So, what can we expect from the world of AI benchmarks in 2025? Will new and more innovative approaches emerge, or will traditional metrics continue to dominate? One thing is certain: as AI technology advances at an unprecedented pace, the industry will need to adapt its benchmarking methods to keep up.
