You connected your AI agent to five tools — Slack, GitHub, Jira, a database, email. Each one works. You tested them individually, got green lights across the board, high-fived yourself. Your dashboard says 95% success rate. Life is good.
Except your actual workflow — read database, create ticket, update repo, notify Slack, send summary — silently drops the ball once or twice a day. No alarm fires. No dashboard turns red. The agent just... doesn't finish. And you're left wondering if you're going crazy or if the machine is gaslighting you.
The Gap Nobody Fixed
Google Cloud Next wrapped on April 22 with a stack of agent announcements. Five days earlier, on April 17, AWS launched its Agent Registry in AgentCore. And earlier in the month, on April 8, Anthropic shipped managed agents. All three now offer agent monitoring. All three measure per-tool metrics over MCP (Model Context Protocol, a universal plug standard for AI tools, like USB but for data): latency, error rates, request counts. None measure compound chain reliability: the probability your multi-step workflow actually finishes.
Five steps at 95% each? That's 77.4% end-to-end. Simple multiplication your dashboard refuses to do.
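Here is that multiplication as a minimal Python sketch. It assumes each step fails independently of the others, which real tool chains only approximate; the function name chain_success_rate is made up for illustration.

```python
import math

def chain_success_rate(step_rates):
    """End-to-end success probability of a chain.

    Assumption: steps fail independently, so the per-step
    success rates simply multiply.
    """
    return math.prod(step_rates)

# Five steps, each succeeding 95% of the time.
rate = chain_success_rate([0.95] * 5)
print(f"{rate:.1%}")  # 77.4%
```

Feed it your real per-tool rates instead of a flat 95% and the number usually gets worse, because the weakest tool drags the whole product down.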
Knowing the number is step one. Fixing it is the actual job. So what do the frameworks give you?
What Frameworks Actually Ship
LangGraph comes closest. Its Checkpointer classes persist state at every graph node. Step four fails, you resume from step three — not from scratch. Real infrastructure. The catch: your entire agent must be a state graph. Retrofitting an existing agent means rewriting it.
CrewAI gives you a max_retry_limit setting and callback hooks. That's retry logic: same tool, same input, try again. If the failure comes from a malformed MCP server response, retrying identically is the definition of insanity.
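To make the distinction concrete, here is a framework-agnostic sketch in plain Python. It is not CrewAI's API: the exception classes, call_with_retry, and its parameters are all hypothetical names. The idea is to retry only failures that might go away on their own and fail fast on ones that will repeat.

```python
import time

class TransientError(Exception):
    """Hypothetical: timeouts, rate limits, dropped connections."""

class MalformedResponseError(Exception):
    """Hypothetical: the tool answered, but the payload is unusable."""

def call_with_retry(tool, payload, max_attempts=3, base_delay=0.5):
    """Call tool(payload), retrying only failures that may clear up."""
    for attempt in range(1, max_attempts + 1):
        try:
            return tool(payload)
        except MalformedResponseError:
            # Same input would produce the same bad output: do not retry.
            raise
        except TransientError:
            if attempt == max_attempts:
                raise
            # Exponential backoff: base_delay, 2x, 4x, ...
            time.sleep(base_delay * 2 ** (attempt - 1))
```

The caller still has to decide what to do with a MalformedResponseError (fall back to another tool, repair the input, or stop the chain), but at least the retry budget is not burned on a failure that cannot succeed.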
Google's ADK, announced at Cloud Next on April 22, ships session-level state management. Their observability layer — the most advanced of the three — still renders per-call traces. You see individual MCP call latency. You don't see "this five-call chain completed 77% of the time this week."
Anthropic's managed agents track session status, duration, and cost. Useful for billing. Useless for chain completion.
The Missing Primitive
A Google Cloud Community playbook published on March 9 documents the core pattern nobody ships natively: step-level checkpointing — save each step's output so you can resume mid-chain. LangGraph does this. Everyone else: you're writing your own persistence layer.
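If you end up writing that persistence layer yourself, the core of it is small. The sketch below is a plain-Python illustration under stated assumptions, not the playbook's code and not LangGraph's implementation: run_chain and the JSON checkpoint file are invented for the example, and every step output must be JSON-serializable.

```python
import json
import os
from pathlib import Path

def run_chain(steps, checkpoint_path, initial=None):
    """Run (name, fn) steps in order, saving state after each one.

    On a rerun with the same checkpoint file, steps already
    recorded as done are skipped and the saved data is reused.
    """
    path = Path(checkpoint_path)
    if path.exists():
        state = json.loads(path.read_text())
    else:
        state = {"done": [], "data": initial}

    for name, fn in steps:
        if name in state["done"]:
            continue  # finished in an earlier run
        state["data"] = fn(state["data"])
        state["done"].append(name)
        # Write to a temp file, then swap it in, so a crash
        # mid-write cannot leave a corrupt checkpoint behind.
        tmp = path.with_suffix(".tmp")
        tmp.write_text(json.dumps(state))
        os.replace(tmp, path)
    return state["data"]
```

If step four raises, the exception propagates and the checkpoint still holds the output of step three. Call run_chain again with the same file and it picks up at step four. One caveat: a step with side effects (sending the email, creating the ticket) that crashes after the side effect but before the save will run twice, so those steps need to be idempotent.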
The playbook also covers circuit breakers, fallback routing, and other microservices patterns adapted for agents. Useful references, but the real gap is higher up the stack: chain-level SLOs. "This workflow must complete end-to-end 95% of the time." No platform offers this metric. You build it with custom telemetry, a time-series DB, and your own alerting rules.
All of this is real engineering work on top of platforms that already charge you — Anthropic at $0.08 per session-hour, for instance.
What to Do Monday Morning
Pick a framework with native checkpointing. If you're starting fresh, LangGraph's state persistence is the least bad option. If you're already running agents, add step-level saves to your three most critical chains before adding another MCP server.
Instrument chain-level success. Not per-tool — per-workflow. Log a single boolean: did the chain finish? Aggregate weekly. You'll hate the number, but at least you'll have one.
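A rough sketch of what that can look like with nothing but the standard library. The JSON-lines log file and the names record_run and weekly_completion are assumptions for illustration; in production you would likely send the same record to whatever telemetry store you already run.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone

def record_run(log_path, chain, finished, when=None):
    """Append one line per workflow run: did the whole chain finish?"""
    when = when or datetime.now(timezone.utc)
    record = {"chain": chain, "finished": bool(finished), "ts": when.isoformat()}
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")

def weekly_completion(log_path):
    """Return {(chain, iso_year, iso_week): completion rate}."""
    totals = defaultdict(lambda: [0, 0])  # [finished, total]
    with open(log_path) as f:
        for line in f:
            rec = json.loads(line)
            year, week, _ = datetime.fromisoformat(rec["ts"]).isocalendar()
            bucket = totals[(rec["chain"], year, week)]
            bucket[0] += rec["finished"]
            bucket[1] += 1
    return {key: done / total for key, (done, total) in totals.items()}
```

Call record_run once at the end of every workflow, inside a try/finally so crashed runs get logged as unfinished, then compare each weekly rate against the target you actually want, such as 0.95.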
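Keep chains short. Three steps, not ten. Each additional step multiplies your success probability by another number below one: at 95% per step, three steps finish about 86% of the time, ten steps about 60%.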
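You can turn that around and ask how many steps a target allows. A small sketch, assuming independent steps that all share the same reliability; max_steps is a hypothetical helper, not something any platform ships.

```python
import math

def max_steps(per_step_rate, target_rate):
    """Largest step count whose compound success rate stays at or
    above target_rate, assuming independent, equally reliable steps."""
    if not (0 < per_step_rate < 1 and 0 < target_rate <= 1):
        raise ValueError("need 0 < per_step_rate < 1 and 0 < target_rate <= 1")
    # Solve per_step_rate ** n >= target_rate for the largest integer n.
    return math.floor(math.log(target_rate) / math.log(per_step_rate))

print(max_steps(0.95, 0.85))  # 3
print(max_steps(0.99, 0.85))  # 16
```

At 95% per step, an 85% end-to-end target buys you three steps. Getting each tool to 99% buys you sixteen, which is why per-tool reliability and chain length have to be budgeted together.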
The Real Infrastructure Gap
The next meaningful upgrade in the agent stack isn't a smarter model or a faster tool. It's the framework that treats compound chain reliability the way databases treat transaction guarantees — as a first-class primitive, not a DIY project. LangGraph's checkpointing hints at this future. Google's ADK session management gestures in the same direction. Everyone else is selling you individual link strength and hoping you never pull the chain.





