The AI industry is booming, but how much of its progress is real, and how much is smoke and mirrors? Behind the glossy press releases and record-breaking claims lies an open secret: benchmarks, the standardized tests used to evaluate AI systems, are increasingly being manipulated to create an illusion of advancement. From cherry-picking datasets to ignoring tougher evaluations, companies are turning benchmarks into marketing tools rather than honest measures of progress.
When Benchmarks Become Marketing Tools
Benchmarks were designed to standardize AI evaluation, but they’ve morphed into a battleground for corporate one-upmanship. Take Meta’s recent announcement of its “state-of-the-art” Llama 4 model, which boasted superior performance on MMLU (Massive Multitask Language Understanding) and GSM8K (math reasoning). What went unmentioned? The model was tested on older, narrower versions of these benchmarks, avoiding newer iterations that include adversarial questions designed to expose weaknesses. As TechCrunch reported, this selective benchmarking creates a distorted picture of true capability.
The problem isn’t new. A 2019 paper titled Measuring the Progress of AI Research (arXiv) warned that overfitting to benchmarks risks creating “benchmark hackers” rather than robust AI. This happens when models are trained to excel at specific tasks without genuine understanding. Fast-forward to 2025, and little has changed. Companies optimize for leaderboard rankings, not real-world utility. It’s like training a student to ace a multiple-choice test by memorizing answers instead of teaching them critical thinking.
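To make the “benchmark hacker” failure mode concrete, here is a minimal, hypothetical sketch in Python. The model, datasets, and helper names are illustrative, not any company’s actual evaluation harness. It compares a model’s accuracy on a well-known benchmark split against adversarially rephrased versions of the same questions, the kind of newer iteration that selective benchmarking avoids. A near-zero gap suggests genuine capability; a large gap suggests memorized answers.

```python
# Hypothetical sketch (not any vendor's real harness): estimate how much of a
# model's benchmark score comes from memorizing the public test set by comparing
# accuracy on the familiar split against adversarially rephrased versions of
# the same questions. A large gap points to "benchmark hacking" rather than skill.

from typing import Callable, Dict, List

def accuracy(model: Callable[[str], str], items: List[Dict[str, str]]) -> float:
    """Fraction of items where the model's answer matches the reference answer."""
    correct = sum(1 for item in items if model(item["question"]).strip() == item["answer"])
    return correct / len(items)

def overfitting_gap(model: Callable[[str], str],
                    legacy_items: List[Dict[str, str]],
                    adversarial_items: List[Dict[str, str]]) -> float:
    """Legacy-split accuracy minus adversarial-split accuracy.

    Near zero means the performance generalizes; a large positive gap means the
    model likely memorized the public benchmark phrasing instead of the task.
    """
    return accuracy(model, legacy_items) - accuracy(model, adversarial_items)

if __name__ == "__main__":
    # Toy stand-ins; a real audit would pair, say, a legacy benchmark split with
    # an adversarial rewrite of the same items.
    legacy = [{"question": "2 + 2 = ?", "answer": "4"}]
    adversarial = [{"question": "What is two plus two, written as a digit?", "answer": "4"}]
    memorizer = lambda q: "4" if q == "2 + 2 = ?" else "five"  # only knows the public phrasing
    print(f"Overfitting gap: {overfitting_gap(memorizer, legacy, adversarial):.2f}")  # prints 1.00
```

The toy “memorizer” aces the phrasing it has seen and fails a trivial rewording, which is exactly the pattern independent audits look for.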
The Measure Everyone Ignores
In 2025, the nonprofit ARC Prize introduced ARC-AGI-2, a benchmark designed to evaluate artificial general intelligence (AGI) by testing reasoning, abstraction, and adaptability. Unlike narrow benchmarks, ARC-AGI-2 requires models to solve novel problems without prior training data—a far closer approximation of human-like intelligence. The catch? No major AI lab uses it.
Why? Because today’s models perform dismally on it. As The Atlantic noted, models like GPT-5 and Gemini Ultra score below 30% on ARC-AGI-2, exposing their inability to generalize beyond memorized patterns. Yet companies avoid mentioning it, preferring to tout performance on easier, legacy benchmarks. This creates a perverse incentive: the harder a benchmark is, the less likely it is to be adopted.
Who Guards the Guardians in a Broken Feedback Loop?
The lack of transparency in AI benchmarking has become a flashpoint. In December 2024, Epoch AI, a nonprofit developing the FrontierMath benchmark for evaluating advanced AI models, disclosed OpenAI’s financial support after using the test to showcase OpenAI’s new o3 model. Contributors to FrontierMath later revealed they hadn’t been informed of OpenAI’s involvement during development, sparking allegations of conflicts of interest. Critics argue OpenAI gained unfair advantages: the company not only funded the benchmark but also had early access to its problem sets, raising questions about whether FrontierMath could objectively assess competitors’ models.

While Epoch AI claims a “verbal agreement” prevents OpenAI from training o3 on FrontierMath data, the organization admitted it failed to prioritize transparency with its mathematicians and contractors, many of whom said they might not have participated had they known about OpenAI’s role. Compounding concerns, Epoch AI has yet to independently verify OpenAI’s benchmark results, leaving the AI community skeptical that corporate-funded evaluations can remain neutral. The episode underscores a growing credibility crisis in AI benchmarking, where even well-intentioned partnerships risk being perceived as tools for corporate validation rather than scientific rigor.
Even when organizations aim for rigor, the system falters. Stanford’s Center for Research on Foundation Models (CRFM) found that models often excel in controlled benchmark settings but fail in real-world applications. For instance, a chatbot might ace a conversational benchmark yet struggle to handle nuanced customer service queries.
Transparency, Diversity, and AGI-Centric Tests
The solution isn’t to abandon benchmarks but to redesign them. First, benchmarks must evolve faster than models. Dynamic evaluations like ARC-AGI-2, which update regularly with unseen problems, could prevent overfitting. Second, transparency is key: companies should disclose all benchmark results, not just favorable ones. Third, independent auditors—not corporate-funded consortia—should govern benchmark design.
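To illustrate the first point, below is a small, hypothetical sketch of how a dynamic benchmark in the spirit of ARC-AGI-2 might rotate its problems: each evaluation round draws from a pool of never-released tasks and then retires them, so a high score cannot be earned by training on past test sets. The class and task names are invented for illustration and do not describe ARC Prize’s actual infrastructure.

```python
# Hypothetical sketch of a rotating ("dynamic") benchmark: every evaluation
# round uses problems that have never been released, then retires them so they
# can be published for scrutiny without ever being reused for scoring.
# Names are illustrative; this is not ARC Prize's actual tooling.

import random
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DynamicBenchmark:
    unreleased: List[str]                               # problems no model has seen
    retired: List[str] = field(default_factory=list)    # used once, then made public

    def draw_round(self, n: int, seed: Optional[int] = None) -> List[str]:
        """Sample n unseen problems for one evaluation round and retire them."""
        if n > len(self.unreleased):
            raise ValueError("Pool exhausted: author new problems before the next round.")
        picked = random.Random(seed).sample(self.unreleased, n)
        for problem in picked:
            self.unreleased.remove(problem)
        self.retired.extend(picked)
        return picked

if __name__ == "__main__":
    bench = DynamicBenchmark(unreleased=[f"novel-task-{i:02d}" for i in range(10)])
    print("Round 1:", bench.draw_round(3, seed=0))
    print("Round 2:", bench.draw_round(3, seed=1))   # disjoint from round 1 by construction
    print("Problems still unseen:", len(bench.unreleased))
```

Pairing this rotation with the second point, full disclosure, would mean publishing every retired round’s results, including the unflattering ones.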
As Stanford’s Human-Centered AI Institute argues, good benchmarks measure generalization, not regurgitation. They should prioritize quality over quantity, favoring depth in a few critical areas (e.g., ethical reasoning, creativity) over superficial breadth.
The Importance of Transparency for AI Progress
The AI industry is at a crossroads. Will it continue chasing hollow victories on rigged leaderboards, or will it embrace benchmarks that demand true intelligence? Until companies stop gaming the system, claims of “breakthroughs” ring hollow.
Here’s a radical idea: What if consumers demanded AI tools be evaluated on ARC-AGI-2 before they hit the market? After all, if your self-driving car can’t navigate an unexpected roadblock, does it matter how well it performed on a 2019 benchmark?
The question isn’t whether AI can pass a test—it’s whether we’re even giving it the right test.