Google Promoted Agents to Infrastructure Primitives. The Runbook Is a Slack Thread.

Your Kubernetes cluster runs on a decade of operational scar tissue. Runbooks forged at 3 AM by engineers who swore they'd quit by morning. SLOs negotiated in meetings where someone literally cried. Error budgets defended with the ferocity of the last parking spot at Costco on Saturday. Every container in production earned its place through human suffering.

Your company's AI agents, shipped this quarter, have none of that. Health check? Undefined. Error budget? Please. Runbook? A Slack thread called #ai-stuff where someone last posted in February. On-call rotation? The intern who built the demo, probably.

At Cloud Next '26 on April 22, Google Cloud CEO Thomas Kurian positioned agents alongside VMs and containers as first-class infrastructure primitives — load-bearing components your cloud runs natively. The new Gemini Enterprise Agent Platform ships the vocabulary container engineers will recognize: Agent Runtime, Agent Registry, Agent Gateway, Agent Identity. Google also committed $750 million to partner development. Deloitte alone claims 1,000+ pre-built agents ready to deploy. A thousand agents. Zero runbooks. Beautiful.

"Infrastructure primitive" is a contract. When you stamp something as load-bearing, it gets the full treatment: SLOs, error budgets, on-call rotations, incident response, restart procedures. Google shipped the stamp. The treatment? Not included.
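For anyone who has not had to defend one, here is roughly what the error-budget half of that contract looks like when pointed at an agent. This is a minimal sketch, not anything Google ships: `AgentSLO`, `error_budget`, `budget_remaining`, and `should_page` are hypothetical names, and it assumes you have already decided what counts as a "failed task" for your agent, which is the hard part.

```python
from dataclasses import dataclass


@dataclass
class AgentSLO:
    """Hypothetical SLO for an agent: the fraction of tasks that must succeed."""
    target: float      # e.g. 0.99 means 99% of tasks should succeed
    window_tasks: int  # tasks expected during the SLO window (assumed known)


def error_budget(slo: AgentSLO) -> int:
    """Failed tasks the window can absorb before the SLO is breached."""
    return round(slo.window_tasks * (1 - slo.target))


def budget_remaining(slo: AgentSLO, failed_tasks: int) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    budget = error_budget(slo)
    if budget == 0:
        # A 100% target leaves no budget: any failure exhausts it.
        return 0.0 if failed_tasks else 1.0
    return (budget - failed_tasks) / budget


def should_page(slo: AgentSLO, failed_tasks: int) -> bool:
    """Page a human once the budget is fully spent (assumed policy)."""
    return budget_remaining(slo, failed_tasks) <= 0


slo = AgentSLO(target=0.99, window_tasks=10_000)
print(error_budget(slo))           # failures the window can tolerate
print(budget_remaining(slo, 25))   # share of budget left after 25 failures
print(should_page(slo, 150))       # whether 150 failures wakes someone up
```

The arithmetic is trivial. The contract is that someone owns the number and gets paged when it runs out.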

What Google did ship: Agent Observability (visual tracing of what happened), Agent Evaluation (performance scoring), Agent Simulation (synthetic workload testing). All useful plumbing. All completely beside the point. Tracing shows you the autopsy. Reliability engineering detects the fever before the patient codes. If you've been reading this channel, you know the argument — we made it two weeks ago about tracing, and two days ago about 3 AM operational blindness. Google's keynote repackaged both gaps with better slide design and a stage budget.

The data hasn't improved either. Catchpoint's SRE Report from January: only 13% of organizations feel confident monitoring AI/ML reliability. A third of them have never tested failure in production. You've also seen UC Berkeley's MAST failure rates (41–86.7% across multi-agent systems) cited on this channel enough times to recite at parties. But the real story isn't the number anymore. It's that nobody has produced a better one in the months since. Nobody is measuring agent reliability because nobody has defined what "reliable" means for an agent. The absence of a replacement stat is the stat.

Here's the dark comedy: the teams deploying agents fastest have zero operational rigor. That's not a bug — it's a competitive strategy. Ops discipline is friction, friction kills speed, speed wins the quarter. So everyone rationally skips the boring stuff and bets that catastrophic multi-agent failure rates are a research curiosity that won't touch their production stack. The confidence is almost beautiful.

SiliconANGLE's John Furrier called it: Google is building "the operating system for the agentic enterprise." Sure. Operating systems need ops teams. Google shipped the OS. The ops team is a job req sitting in someone's drafts folder.

"Agent Reliability Engineering" returns zero results on LinkedIn today. Zero playbooks. Zero certifications. Zero conference talks. Google just declared agents are infrastructure on the same level as containers, backed the claim with three-quarters of a billion dollars, and the discipline that makes that declaration survivable does not exist as a field.

The agents that survive 2026 won't be the smartest or the cheapest. They'll be the ones somebody put on a pager and wrote a runbook for — specifically the one titled "what to do when it starts issuing refunds to random customers at 3 AM." Whoever publishes the first Agent SRE playbook sets the industry standard. That playbook doesn't exist. The agents are already in production. Sleep tight.
