The number missing from every AI agent dashboard matters more than any number the vendors chose to include: did the agent actually do what you asked?
This week, two more platforms joined the consumption-metrics club. On April 10, GitHub added active user counts for its cloud agent. On April 8, Anthropic launched Managed Agents at $0.08 per session-hour — billed to the millisecond, tokens extra. They join Google's Vertex AI Agent Engine, which has metered by vCPU-second since going GA last year, and OpenAI's Codex, whose "Success Rate" metric measures whether the API call completed — not whether the code works.
That's measuring a surgeon's productivity by how many scalpels they picked up.
Four major platforms. Zero task success rates. Zero quality scores. Zero tracking of whether a human had to redo the agent's work.
Why nobody measures what matters
Not because it's unsolvable. Because it's expensive, embarrassing, and bad for quarterly earnings.
A chatbot gives one answer and you judge it immediately. An agent chains ten steps — reads a ticket, searches docs, writes code, opens a PR, pings Slack. Each step can fail silently. The final output requires domain expertise to evaluate. The vendors haven't even defined what "success" means for an agent, let alone measured it.
And the research that does exist is not something you'd put on a slide deck.
The reliability gap nobody advertises
On February 24, Princeton researchers Kapoor and Narayanan published a study testing 14 AI models across 500 benchmark runs. Their finding: agent reliability, meaning doing the same task correctly every time, improved at half the rate of raw capability on general tasks. On customer service tasks, reliability improved at just 14% of the rate of accuracy. Their conclusion: "Agents are not good at knowing when they're wrong."
This is the number that should be on every dashboard and isn't.
AI researcher Andrej Karpathy (OpenAI co-founder, ex-Tesla AI lead) quantified what this means in practice with his "March of Nines" framework in November 2025: if each step in a ten-step workflow succeeds 90% of the time, end-to-end success drops to about 35%, because 0.9 multiplied by itself ten times is roughly 0.349. Now picture that agent running autonomously at 3 AM, billed per hour, with nobody watching.
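The compounding behind that figure is easy to check yourself. The sketch below assumes every step succeeds or fails independently with the same probability, which is a simplification; the function name and the extra per-step rates are illustrative, not taken from Karpathy's framework.

```python
def end_to_end_success(step_success: float, steps: int) -> float:
    """Chance that every step succeeds, assuming independent steps
    that all share the same per-step success probability."""
    if not 0.0 <= step_success <= 1.0:
        raise ValueError("step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return step_success ** steps


# Compare a few per-step reliabilities over a ten-step workflow.
# 0.90 is the article's example; 0.99 and 0.999 are added for contrast.
for per_step in (0.90, 0.99, 0.999):
    overall = end_to_end_success(per_step, 10)
    print(f"{per_step:.1%} per step -> {overall:.1%} end to end")
```

Each extra "nine" of per-step reliability buys a large jump in end-to-end success, which is why a per-step number that sounds fine can still sink a long workflow.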
The supporting data keeps piling up. A CodeRabbit analysis published March 19 examined 470 GitHub PRs and found AI-authored code produces 1.7x more issues per PR than human code, with security vulnerabilities running 2.74x higher. LangChain's survey released March 25 polled 1,340 practitioners: 57% already run agents in production, but only 52% evaluate outputs after the fact, and just 37% monitor quality while agents run live.
More than half the industry deployed agents before figuring out how to tell if they work. Bold strategy.
Follow the money
Usage-based billing earns the vendor the same revenue from a failed three-hour session as from a successful one. A vendor charging $0.08 per session-hour has zero financial incentive to help you discover that, say, 40% of those sessions produce garbage. Measuring outcomes would actively hurt the metric Wall Street watches: revenue per customer.
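To see what that incentive gap costs the buyer, here is a rough sketch that turns a per-hour bill into a cost per useful session. The $0.08 rate comes from the pricing quoted above; the three-hour session length and the 60% success rate are hypothetical inputs, and token charges and the cost of human rework are left out entirely.

```python
RATE_PER_SESSION_HOUR = 0.08  # USD, the session-hour price quoted in the article


def billed_per_session(hours: float, rate: float = RATE_PER_SESSION_HOUR) -> float:
    """What the vendor charges for one session, whether or not it succeeded."""
    return hours * rate


def cost_per_successful_session(
    hours: float, success_rate: float, rate: float = RATE_PER_SESSION_HOUR
) -> float:
    """Average spend needed to get one session that actually worked.

    success_rate is something you must measure yourself; no dashboard
    discussed here reports it.
    """
    if not 0.0 < success_rate <= 1.0:
        raise ValueError("success_rate must be in (0, 1]")
    return billed_per_session(hours, rate) / success_rate


# Hypothetical inputs: three-hour sessions, 60% of which produce usable work.
hours, success_rate = 3.0, 0.60
print(f"Billed per session:          ${billed_per_session(hours):.2f}")
print(f"Cost per successful session: ${cost_per_successful_session(hours, success_rate):.2f}")
```

The first number is the one on the invoice. The second is the one you pay for in practice, and it only gets worse once rework time is counted.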
Third-party observability tools — LangSmith, Braintrust, Helicone — are trying to fill the gap. But the four biggest agent platforms ship nothing native. You get a speedometer with no destination.
What this means for you
If your team evaluates autonomous agents — and statistically, it does — demand the one number every vendor dodges: what percentage of tasks does your agent complete correctly without human intervention?
If they can't answer, you're not buying a productivity tool. You're buying a billing meter attached to a coin flip.
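If the vendor won't give you that number, you can approximate it from your own task log. The sketch below is one possible definition, not a standard: the TaskRecord fields and the sample data are hypothetical, and deciding what counts as "verified correct" (a test suite, a reviewer sign-off) is up to your team.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    task_id: str
    completed: bool         # the agent reported that it finished
    verified_correct: bool  # a test suite or reviewer confirmed the output
    human_intervened: bool  # someone had to fix or redo the work


def autonomous_success_rate(records: list[TaskRecord]) -> float:
    """Share of tasks finished correctly with no human intervention."""
    if not records:
        raise ValueError("need at least one task record")
    successes = sum(
        1
        for r in records
        if r.completed and r.verified_correct and not r.human_intervened
    )
    return successes / len(records)


# Hypothetical log of four agent tasks.
sample_log = [
    TaskRecord("T-1", completed=True, verified_correct=True, human_intervened=False),
    TaskRecord("T-2", completed=True, verified_correct=False, human_intervened=True),
    TaskRecord("T-3", completed=False, verified_correct=False, human_intervened=True),
    TaskRecord("T-4", completed=True, verified_correct=True, human_intervened=False),
]

print(f"Autonomous success rate: {autonomous_success_rate(sample_log):.0%}")
```

Note that "completed" alone would score this log at 75%. Requiring verification and no rework is what separates a task success rate from an API success rate.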
The agent economy launched with an invoice where it needed a scorecard. Until someone builds that scorecard, you are the quality layer the platform didn't ship. Budget accordingly.



