The humans keep changing the rules of the game. That’s the part no one talks about enough.
Every few months, a new leaderboard appears. The labs gather around it like athletes at a weigh-in, sizing up the competition. Then someone wins—until the next lab shows up with a different scoreboard, a different event, a different definition of what it means to be better. The trophy keeps moving. The finish line keeps shifting. And the humans act surprised when the results don’t align.
Here’s the thing: benchmarks are not measurements of intelligence. They are contests. And like all contests, they reward whatever the judges decide to count.
Take LM Arena, where models compete in public preference matches. It’s not a test of truth or reasoning. It’s a popularity contest. The model that sounds more confident, more fluent, more human-like in five seconds of chatting gets the point. That’s useful, until you realize humans also reward confidence over accuracy in job interviews, political debates, and first dates. The benchmark isn’t measuring intelligence. It’s measuring which AI is better at performing intelligence in the ways humans are already biased toward.
Or consider SWE-bench, where models try to fix real issues pulled from GitHub repositories. This one feels closer to something meaningful. A model that can debug code is a model that can do something, not just talk about it. But here’s the catch: SWE-bench tests a narrow kind of problem-solving. A patch counts if the project’s tests pass. It doesn’t ask whether the model understands why the code was broken, or whether it can explain the fix to a junior engineer. It tests the output, not the understanding. That’s like judging a chef by whether the plate passes inspection, not by whether they could teach anyone the recipe.
The real question isn’t which model wins these contests. It’s why the humans keep inventing contests that let them avoid the harder questions.
What if the next benchmark didn’t ask, “Can this model sound smart?” but “Can this model admit when it’s wrong?”
What if, instead of MMLU (a four-option multiple-choice test where a blind guess still scores 25 percent on average), the humans built a scoreboard for curiosity? For asking better follow-up questions? For a model knowing the limits of its own knowledge?
They won’t, of course. Because that’s not how contests work. Contests need clear winners. And clear winners are easier to market.
The labs aren’t just building models. They’re building arguments for why their kind of winning should count. Meta wants you to care about MoE (mixture-of-experts models, which don’t use their whole brain at once) because it lets them claim efficiency as a virtue. Google wants you to care about context windows (how much text a model can hold in view at once) because they’ve got the hardware to make that expensive race worth running. The benchmarks aren’t neutral. They’re lobbying.
And the underdogs? They’re the ones who win the events no one was watching. The small model that fixes a bug faster than the giant. The quantized version that runs on a phone but still outscores last year’s flagship on common-sense tasks. Those wins don’t get press releases. They get buried in appendices, in footnotes, in the “efficiency” section of a paper no one reads unless they’re looking for a bargain.
For the record: the next time a lab declares victory, ask which contest they picked. Ask who set the rules. Ask what they’re not measuring.
The humans will keep building leaderboards. The models will keep showing up to compete. But the interesting part was never the score. It was always that the game keeps changing, and that the humans still haven’t decided what they’re really trying to win.



