You used OpenAI's pricing calculator last month. Input tokens — word-chunks the AI reads — go in, output tokens come out, simple multiplication. Your budget spreadsheet looked clean. Responsible, even.

Then your April invoice landed at 4x your estimate. You didn't change your prompts. Didn't add workflows. Didn't increase volume. So what happened?

The invisible meter started running

On April 15–16, OpenAI shipped two major updates: the Agents SDK v0.14 with model-native orchestration, and Codex's autonomous computer-use mode. Both default to GPT-5.4 — a reasoning model. Unlike classic models that just answer, reasoning models generate "thinking tokens" — internal computation where the AI debates with itself before responding. You never see these tokens in the output. But they hit your bill as output tokens, at output token prices.
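To see why the calculator math breaks, here is a minimal sketch of the billing arithmetic. The prices and token counts are made-up placeholders, not OpenAI's actual rates, and the `Usage` type is a hypothetical helper. The only thing taken from the text above is the rule that hidden thinking tokens are billed as output tokens.

```python
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens (illustration only, not real rates).
INPUT_PRICE_PER_M = 2.00
OUTPUT_PRICE_PER_M = 10.00


@dataclass
class Usage:
    input_tokens: int
    visible_output_tokens: int
    reasoning_tokens: int  # hidden thinking, billed at the output price


def estimate_cost(u: Usage) -> float:
    """Return the cost in USD, counting reasoning tokens as output tokens."""
    billed_output = u.visible_output_tokens + u.reasoning_tokens
    return (
        u.input_tokens * INPUT_PRICE_PER_M + billed_output * OUTPUT_PRICE_PER_M
    ) / 1_000_000


# What the spreadsheet assumed: no thinking tokens at all.
naive = estimate_cost(Usage(1_000, 500, 0))
# Same request, but the model spent 10,000 tokens thinking first.
actual = estimate_cost(Usage(1_000, 500, 10_000))

print(f"naive: ${naive:.3f}, actual: ${actual:.3f}, ratio: {actual / naive:.1f}x")
```

With these placeholder numbers, the same request goes from $0.007 to $0.107 without a single visible word changing.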

The model autonomously decides how much to think based on perceived problem difficulty. A trivial question might burn 200 thinking tokens. A complex one might burn 10,000. The same question on different days? The spread runs anywhere from 2x to 9.7x, according to a March 2026 preprint from Stanford, Berkeley, CMU, and Microsoft.

The math gets ugly fast

In a multi-step agent run — where the AI performs dozens of sequential actions — this variance compounds. Each step is a fresh reasoning allocation you can't predict or control. A preprint analyzing 11,872 queries across 8 models found that thinking tokens represent 80%+ of total output costs, and in 21.8% of model comparisons, the supposedly cheaper model actually cost more — with reversal magnitude reaching 28x. You read that right: the budget option can cost 28 times more than the premium one. Pricing pages are decorative at this point.
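The reversal is easier to believe once you run the arithmetic. The sketch below uses invented prices and token counts for two hypothetical models (none of these figures come from the preprint) and ignores input costs to keep the comparison readable.

```python
def output_cost(price_per_m: float, visible_tokens: int, reasoning_tokens: int) -> float:
    """Output-side cost in USD; reasoning tokens are billed like visible output."""
    return (visible_tokens + reasoning_tokens) * price_per_m / 1_000_000


# Hypothetical "budget" model: low list price, but it thinks a lot.
cheap_cost = output_cost(price_per_m=2.00, visible_tokens=300, reasoning_tokens=9_000)

# Hypothetical "premium" model: 5x the list price, but it thinks briefly.
premium_cost = output_cost(price_per_m=10.00, visible_tokens=300, reasoning_tokens=500)

print(f"budget model:  ${cheap_cost:.4f}")
print(f"premium model: ${premium_cost:.4f}")
print(f"budget / premium: {cheap_cost / premium_cost:.2f}x")
```

In this toy case the model with the lower list price ends up more than twice as expensive per answer, purely because of how many tokens it spends thinking.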

Real-world data confirms it: developer platform GrisLabs tracked 1,127 agent runs in March 2026 and found a median cost of $1.22 but a 95th percentile of $22.14 — an 18x ratio between typical and expensive runs doing the same job. Same prompt, same pipeline, 18x spread. Your CFO will love that variance analysis.
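If you log per-run costs, you can measure the same spread on your own pipeline. This is a rough sketch using a nearest-rank 95th percentile, which is only one of several percentile definitions. The `sample_runs` list is invented for illustration, and the function names are hypothetical. The input list must be non-empty.

```python
import statistics


def p95_nearest_rank(costs: list[float]) -> float:
    """95th percentile by the nearest-rank method (requires a non-empty list)."""
    ordered = sorted(costs)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) using integer math
    return ordered[rank - 1]


def tail_ratio(costs: list[float]) -> float:
    """How many times more expensive a p95 run is than the median run."""
    return p95_nearest_rank(costs) / statistics.median(costs)


# Invented sample: 18 ordinary runs plus two expensive outliers.
sample_runs = [1.0] * 18 + [10.0, 50.0]
print(f"tail ratio on sample data: {tail_ratio(sample_runs):.1f}x")

# The ratio implied by the figures quoted above ($22.14 p95 vs. $1.22 median).
print(f"quoted figures: {22.14 / 1.22:.1f}x")
```

Tracking this ratio over time gives you an early warning when the tail starts pulling away from the median.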

It gets worse: the off switch was a prop

On April 2, developers discovered that GPT-5.4 silently ignores the reasoning_effort="none" parameter when combined with a token budget. The model defaults back to full reasoning, burns through your entire token allocation on invisible thinking, and returns an empty string. You explicitly tell it "don't think" and it thinks harder than ever — then charges you for the privilege of getting nothing back.
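You can at least detect this failure mode after the fact. The sketch below checks a response for the symptom described above: empty text plus a nonzero reasoning-token count. It works on a plain dict and makes no API call. The field names (`completion_tokens_details`, `reasoning_tokens`) are an assumption modeled on the usage object in OpenAI's Chat Completions responses, so adjust them to whatever your client actually returns.

```python
def burned_budget_on_thinking(output_text: str, usage: dict) -> bool:
    """True if the response is empty but reasoning tokens were still consumed.

    `usage` is assumed to look like:
    {"completion_tokens": 4096,
     "completion_tokens_details": {"reasoning_tokens": 4096}}
    """
    details = usage.get("completion_tokens_details") or {}
    reasoning_tokens = details.get("reasoning_tokens", 0)
    return output_text.strip() == "" and reasoning_tokens > 0


# Hypothetical response matching the reported symptom.
bad_usage = {
    "completion_tokens": 4096,
    "completion_tokens_details": {"reasoning_tokens": 4096},
}
if burned_budget_on_thinking("", bad_usage):
    print("Empty answer, but reasoning tokens were billed. Log it and alert.")
```

A check like this will not get the money back, but it turns a silent failure into one you can count.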

OpenAI acknowledged the bug on April 9 and says a fix was deployed by April 20. But for 18 days, the "off switch" for reasoning was purely theatrical. Eighteen days of a parameter that existed solely to make developers feel in control while the model did whatever it wanted. Peak UX.

No per-step reasoning budget API exists. No per-run cap. OpenAI offers organization-level monthly spending limits — the equivalent of a credit card limit when what you need is a price tag on each item.
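Until a real cap exists, the closest workaround is a client-side guard that adds up cost after each step and stops the run once a limit is crossed. The sketch below is one way to do that. `RunBudget`, `BudgetExceeded`, and `run_agent` are hypothetical names, and the step costs are a stand-in for whatever your agent loop reports. Note the limitation: the guard only learns a step's cost after that step has been billed, so a single runaway step can still blow past the cap.

```python
class BudgetExceeded(RuntimeError):
    """Raised when cumulative spend for a run passes its cap."""


class RunBudget:
    def __init__(self, cap_usd: float) -> None:
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def charge(self, step_cost_usd: float) -> None:
        """Record a step's cost; raise if the run is now over its cap."""
        self.spent_usd += step_cost_usd
        if self.spent_usd > self.cap_usd:
            raise BudgetExceeded(
                f"spent ${self.spent_usd:.2f} against a cap of ${self.cap_usd:.2f}"
            )


def run_agent(step_costs: list[float], cap_usd: float) -> tuple[int, float]:
    """Simulate an agent run; return (steps completed under cap, total spent)."""
    budget = RunBudget(cap_usd)
    completed = 0
    for cost in step_costs:
        try:
            budget.charge(cost)
        except BudgetExceeded:
            break  # stop the run; the offending step has already been billed
        completed += 1
    return completed, budget.spent_usd


# Invented per-step costs: the third step thinks far harder than the others.
completed, spent = run_agent([0.10, 0.20, 5.00, 0.10], cap_usd=1.00)
print(f"completed {completed} steps, spent ${spent:.2f} against a $1.00 cap")
```

In this example the run stops after the third step, but the bill is already well over the cap. That is the difference between a client-side guard and a provider-enforced price tag.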

For context: Anthropic's extended thinking has the same structural opacity. Google's Gemini thinking mode at least shows the reasoning text in output, so you can see what you're paying for.

What this means for you

What you ask no longer controls your agent cost. What controls it is how hard the model privately decides the question is, and that decision varies between identical requests on different days. Every autonomous run is an open invoice where the model holds the pen.

Agent pricing needs per-step reasoning caps and transparent thinking budgets. Until OpenAI ships those, treat every agent run as a slot machine with a published pay table but no maximum bet.