You pick your AI coding tool by checking the leaderboard. SWE-bench told you which model fixed the most bugs. Promptfoo let you run side-by-side comparisons. The Agents SDK gave you a framework to build with. Three pillars of comparison infrastructure. Three independent checks on who's actually best.

I've covered each of these stories individually this week — SWE-bench's collapse, the Promptfoo acquisition, the Agents SDK update. Separately, each made sense. Together, they reveal something none of the individual pieces captured.

The conventional wisdom

OpenAI made three unrelated moves. They published a legitimate critique of a flawed benchmark. They acquired an open-source eval tool and kept it MIT-licensed (anyone can copy, modify, redistribute). They made their SDK model-agnostic. Each move is defensible in isolation. Each move helps developers.

But actually

This is vertical integration of the evaluation stack. And it has precedent.

In 2007, Google acquired DoubleClick — the dominant ad-serving platform that measured advertising performance across all providers, including Google's own. The EU investigated for years. Google promised neutrality. A decade later, the DOJ argued Google had systematically favored its own ad products through that very infrastructure. The company that sold the ads also ran the tool that graded whether the ads worked.

OpenAI just executed the same playbook on AI model evaluation — in seven weeks instead of seven years.

Three moves, one pattern

Move one (February 23): OpenAI's audit flagged 59.4% of SWE-bench Verified test cases as flawed and found training data contamination across every frontier model. They stopped reporting scores. The critique had merit — SWE-bench Pro's harder tasks show a 22-point gap from Verified's inflated numbers. But OpenAI's models had plateaued at ~80% on Verified while competitors closed in. Convenient timing.

Move two (March 9): OpenAI acquired Promptfoo — 350,000+ developers, over 25% of Fortune 500 companies — for the eval framework most teams use to compare LLMs. The most popular ruler now belongs to a contestant.

Move three (April 15): The Agents SDK update added native support for 100+ competing LLMs via LiteLLM integration. Every rival model becomes a one-line config swap inside OpenAI's framework. The model turns into a commodity; the SDK becomes the moat.

What actually changes for developers

Three things.

Friction shifts. When switching models requires changing one line in an OpenAI config file, you're not "choosing Claude" — you're choosing OpenAI's platform and occasionally routing to Claude. Think Apple building the only phone store and generously letting Samsung sell there.

Eval defaults beat eval options. Promptfoo can still test any model. But the default templates, the recommended configs, the "getting started" flow — those shape what 90% of developers actually test. As Simon Willison noted: "OpenAI don't yet have much of a track record with respect to acquiring and maintaining open source projects." The MIT license means you can fork and walk away. Most won't. Defaults are powerful.

Benchmark authority fragments. SWE-bench Pro uses harder, less-contaminated tasks across multiple languages. LiveCodeBench rotates problems to prevent memorization. Neither has the adoption Verified had. Building trust in a new benchmark takes years. OpenAI doesn't need years — it needs months of ambiguity.

The counter-strategy gap

Anthropic ships Claude Code — a direct-to-developer tool that bypasses SDK wrappers entirely. Google bundles Gemini into Android, Chrome, and Workspace, creating distribution channels OpenAI can't intercept. Both play defense through distribution rather than measurement.

Neither has built an alternative evaluation standard. That's the real gap. The industry has competing models, competing SDKs, competing distribution channels — but no independent, trusted, widely-adopted evaluation infrastructure anymore. The old scoreboard had genuine contamination problems. The replacement doesn't exist yet.

The uncomfortable question

The question isn't whether OpenAI's individual moves hold up to scrutiny. They do. The question is whether a single company should simultaneously sell the product, own the testing framework, and control the SDK that wraps every competitor.

If your answer involves the word "trust" — congratulations, you've identified the problem.

Next time you evaluate an AI model, check who built the ruler, who owns the testing lab, and whose tooling runs the test. If it's the same company three times, you're not evaluating — you're being onboarded.

The AI model race didn't end because someone won. It ended because the frontrunner bought the scoreboard and turned it into a store.