You wired your AI agent to fifteen MCP servers — Slack, GitHub, Jira, a database — and tested it until everything sang in harmony. MCP (Model Context Protocol) is the universal plug standard that lets AI agents talk to external tools, like USB but for data. Your demo looked great. Your boss nodded along. You shipped to production.

Here's the part nobody warned you about: MCP servers aren't static packages sitting on your disk. They're live processes and remote endpoints. And as of April 25, 2026, not a single agent platform will tell you when one of them goes down, slows to a crawl, or starts returning garbage.

Half your tools might already be dead

On April 20, RapidClaw audited 1,847 public MCP servers across seven registries. The result: 52% are dead or abandoned. Only 17% — 315 servers — qualified as production-ready. The median MCP server has six lifetime commits, one maintainer, zero tests, and hasn't been updated in 142 days.

That's the ecosystem your agent depends on. A coin flip decides whether the tool it's calling still works.

And the protocol itself offers little help. There's no built-in health check, no standardized way for a server to say "I'm alive and functioning." The spec defines a basic ping, and the SDKs implement it, but there's no /health, no /ready, no liveness probe. Every team rolls their own or, more commonly, rolls nothing.

The protocol-shaped hole

The gap isn't just in platforms — it's in the spec. HTTP has status codes. gRPC has health checking services. Kubernetes has liveness and readiness probes. MCP has... a ping method that confirms the transport layer is alive, not that the server can actually do its job.

A database MCP server can respond to ping while its connection pool is exhausted. A GitHub MCP server can pong back while its auth token expired three weeks ago. The protocol draws no line between "transport works" and "tool works." That distinction matters when your agent burns tokens retrying a tool that technically answers but functionally lies.
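
One way to draw that line yourself is to probe both layers separately. What follows is a minimal sketch, not part of any MCP SDK: deep_probe, ProbeResult, the "canary call" idea and the fake server functions are all names invented for this example. In a real client, ping and canary_call would wrap whatever your SDK exposes for sending a ping and calling a cheap, read-only tool.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable


@dataclass
class ProbeResult:
    transport_ok: bool  # the server answered ping
    tool_ok: bool       # a real tool call returned a plausible result
    detail: str = ""


async def deep_probe(
    ping: Callable[[], Awaitable[Any]],
    canary_call: Callable[[], Awaitable[Any]],
    looks_valid: Callable[[Any], bool],
    timeout_s: float = 2.0,
) -> ProbeResult:
    """Check transport liveness and functional health as two separate facts."""
    try:
        await asyncio.wait_for(ping(), timeout=timeout_s)
    except Exception as exc:  # timeout or transport error
        return ProbeResult(False, False, f"ping failed: {exc!r}")
    try:
        result = await asyncio.wait_for(canary_call(), timeout=timeout_s)
    except Exception as exc:  # the tool itself is broken or too slow
        return ProbeResult(True, False, f"canary failed: {exc!r}")
    if not looks_valid(result):
        return ProbeResult(True, False, "canary returned an implausible result")
    return ProbeResult(True, True, "ok")


# Hypothetical stand-in for a database server whose pool is exhausted:
# it still answers ping, but every real query fails.
async def fake_ping() -> dict:
    return {}


async def fake_query() -> list:
    raise RuntimeError("connection pool exhausted")


probe = asyncio.run(
    deep_probe(fake_ping, fake_query, looks_valid=lambda r: isinstance(r, list))
)
print(probe)
```

The probe reports transport_ok=True and tool_ok=False for the fake server, which is exactly the state a bare ping cannot see.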

Google's Agent Observability, shipped April 22 as part of the Gemini Enterprise Agent Platform, tracks request count and p95 latency for MCP servers. Two metrics. We already covered the broader platform gaps — the short version is that Google built agent-level tracing, not tool-level monitoring. AWS previewed a similar catalog approach with Agent Registry on April 17 — a phone book for agents, not a health monitor. Both acknowledge that tools need infrastructure treatment. Neither actually checks if your tools are alive.

For comparison, any halfway-decent microservice — a small, independent piece of your backend — gets health checks, error rates, response quality analysis, and automated alerts. Your MCP server gets a request counter and a stopwatch.
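
To make the comparison concrete, here is a rough sketch of what microservice-style tracking could look like per tool. Everything in it is an assumption made for illustration: ToolStats, ToolHealthBoard, the window size and the thresholds are invented here, and a production setup would more likely feed an existing metrics system than keep samples in memory.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ToolStats:
    # Rolling window of (latency_seconds, succeeded) samples for one tool.
    samples: deque = field(default_factory=lambda: deque(maxlen=500))

    def record(self, latency_s: float, ok: bool) -> None:
        self.samples.append((latency_s, ok))

    def error_rate(self) -> float:
        if not self.samples:
            return 0.0
        failures = sum(1 for _, ok in self.samples if not ok)
        return failures / len(self.samples)

    def p95_latency(self) -> float:
        if not self.samples:
            return 0.0
        latencies = sorted(lat for lat, _ in self.samples)
        rank = (95 * len(latencies) + 99) // 100  # nearest-rank, integer ceil
        return latencies[rank - 1]


class ToolHealthBoard:
    """Tracks each tool separately and flags the ones over a threshold."""

    def __init__(self, max_error_rate: float = 0.05, max_p95_s: float = 2.0):
        self.max_error_rate = max_error_rate
        self.max_p95_s = max_p95_s
        self.tools: Dict[str, ToolStats] = {}

    def record(self, tool: str, latency_s: float, ok: bool) -> None:
        # `ok` should be False for errors AND for responses that fail validation.
        self.tools.setdefault(tool, ToolStats()).record(latency_s, ok)

    def unhealthy(self) -> Dict[str, str]:
        alerts = {}
        for tool, stats in self.tools.items():
            if stats.error_rate() > self.max_error_rate:
                alerts[tool] = f"error rate {stats.error_rate():.0%}"
            elif stats.p95_latency() > self.max_p95_s:
                alerts[tool] = f"p95 latency {stats.p95_latency():.2f}s"
        return alerts


board = ToolHealthBoard(max_error_rate=0.05, max_p95_s=2.0)
for i in range(20):
    board.record("slack", latency_s=0.1, ok=True)
    board.record("github", latency_s=0.3, ok=(i % 5 != 0))  # every 5th call fails
    board.record("database", latency_s=2.5, ok=True)        # healthy but slow
alerts = board.unhealthy()
print(alerts)
```

Note that the flaky tool and the slow tool are flagged for different reasons. A request counter and a single latency number would blur the two together.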

What breaks when a tool breaks

When an MCP server degrades silently, your agent doesn't know. It hallucinates an answer (makes something up), retries quietly while burning tokens (the word-chunks AI reads, each costing money), or fails without telling you which tool broke. Agent observability traces show what the agent decided. They don't show whether the tool was healthy when it answered. You see the symptom. You can't diagnose the cause.

It's like monitoring your web app but not monitoring its database. The dashboard stays green until everything is on fire.

What to do right now

If you're running agents in production, treat every MCP server like a microservice dependency:

  • Set timeouts. MCP won't enforce them for you.
  • Define fallbacks. If a tool is down, your agent needs a Plan B, not an improvised hallucination.
  • Monitor response quality. A 200 OK that returns nonsense is worse than a clean failure.
  • Wrap connections in circuit breakers — a pattern that cuts off a failing service before it drags everything else down.
  • Accept the ceiling. Your agent is at most as reliable as its least reliable tool, and a task that chains several tools compounds their failure rates.
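
Three of those items (timeouts, fallbacks, circuit breakers) fit into one small wrapper. The sketch below is one possible shape, not a reference implementation: CircuitBreaker, guarded_call, the thresholds, and the hanging_tool and plan_b helpers are all hypothetical names made up for this example.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable, Optional


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures; retries after `reset_s`."""

    def __init__(self, max_failures: int = 3, reset_s: float = 30.0):
        self.max_failures = max_failures
        self.reset_s = reset_s
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Half-open: let calls through again once the cool-down has passed.
        return time.monotonic() - self.opened_at >= self.reset_s

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()


async def guarded_call(
    breaker: CircuitBreaker,
    call: Callable[[], Awaitable[Any]],
    fallback: Callable[[], Any],
    timeout_s: float = 5.0,
) -> Any:
    if not breaker.allow():
        return fallback()  # fail fast: no time or tokens spent on a dead tool
    try:
        result = await asyncio.wait_for(call(), timeout=timeout_s)
    except Exception:  # timeout or tool error
        breaker.record(ok=False)
        return fallback()
    breaker.record(ok=True)
    return result


attempts = {"count": 0}


async def hanging_tool() -> dict:
    attempts["count"] += 1
    await asyncio.sleep(10)  # simulates a server that never answers
    return {"status": "ok"}


def plan_b() -> dict:
    # An explicit "unavailable" marker the agent can report instead of guessing.
    return {"status": "unavailable", "tool": "jira"}


async def demo() -> list:
    breaker = CircuitBreaker(max_failures=2, reset_s=60.0)
    return [
        await guarded_call(breaker, hanging_tool, plan_b, timeout_s=0.05)
        for _ in range(5)
    ]


results = asyncio.run(demo())
print(attempts["count"], results[0])
```

Of the five calls, only the first two reach the hanging tool. After that the breaker is open and the remaining three return the fallback immediately, which is the behavior you want when a dependency is already known to be down.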

Most teams haven't budgeted for this plumbing. Most teams will learn why they should have.

The missing layer

The agent stack in April 2026 has smarter models, bigger registries, and fancier gateways. What it doesn't have is tool SRE: Site Reliability Engineering applied to the tools agents depend on. Previous articles on this channel argued that agent reliability engineering doesn't exist yet. The RapidClaw audit suggests the problem runs one level deeper: you can't have reliable agents when the protocol they speak doesn't even define what "healthy" means for a tool.

The first platform to ship tool-level health monitoring, built on /health and /ready style checks standardized in the MCP spec rather than bolted on by each vendor, captures the reliability layer the entire ecosystem lacks.

You started with fifteen MCP servers that worked perfectly in dev. If the audit's 52% holds for your stack, roughly eight of them are dead or headed there. Nobody's watching. Maybe start there.