MCP Called Its Safety Labels "Hints" and Wonders Why Nobody Trusts Them

You wired your agent to a dozen MCP tools and let it handle your Monday backlog. Slack, GitHub, a payment API — the full domesticated zoo. Life feels automated. Then the agent retries a failed payment call, and your customer eats a double charge.

Nothing the agent acted on said that endpoint wasn't safe to retry. No label it trusted, no flag it checked. Just a raw function and a model doing what models do: being "helpful."

Here's the thing. MCP's spec added exactly the safety fields you'd need. Four annotations — readOnlyHint, destructiveHint, idempotentHint, openWorldHint. Four booleans that could prevent the double-charge scenario entirely. The MCP team published them on March 16, 2026. And the spec called every single one a "Hint."
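For concreteness, here is roughly what those four fields look like on a tool definition, and how a client could consult them before retrying. Treat it as a minimal sketch: the charge_card tool and the safe_to_retry helper are hypothetical, while the field names follow the MCP tool annotations schema, where a missing readOnlyHint or idempotentHint is treated as false.

```python
# Hypothetical tool definition, shaped like an entry a client gets back from tools/list.
charge_card = {
    "name": "charge_card",
    "description": "Charge a customer's saved card.",
    "annotations": {
        "readOnlyHint": False,     # it changes state
        "destructiveHint": False,  # additive: it creates a charge, deletes nothing
        "idempotentHint": False,   # calling it twice charges twice
        "openWorldHint": True,     # it talks to an external payment network
    },
}


def safe_to_retry(tool: dict) -> bool:
    """Allow an automatic retry only if the tool declares itself read-only or idempotent.

    Missing annotations count as "not safe", matching the schema defaults.
    """
    annotations = tool.get("annotations") or {}
    return bool(
        annotations.get("readOnlyHint", False)
        or annotations.get("idempotentHint", False)
    )


if not safe_to_retry(charge_card):
    print("charge_card failed: escalate to a human instead of retrying")
```

The guard is only as good as the declaration it reads, which is the problem the rest of this piece is about.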

Not a contract. Not a constraint. A hint. Like a Post-it note on a loaded gun: "maybe don't point this at people."

Within six weeks, the industry delivered its verdict. Microsoft shipped its Agent Governance Toolkit on April 2: YAML-based policy enforcement, built from scratch, zero references to MCP annotations. Anthropic launched Managed Agents permission policies on April 21, with custom allowlists and scoping that ignore the annotation fields entirely. Google followed one day later, on April 22, with Agent Gateway: same pattern, same from-scratch policy engine. Three major platforms in twenty days. None of them used the safety metadata that already exists in the protocol they all depend on.

This isn't the same problem as platform-level permission dialogs being theatrical — we've covered that. And it's not about unvalidated tool output — that's a different hole. This is about the protocol layer — the one place where safety metadata should be canonical — actively undermining its own credibility with a vocabulary choice.

Justin Spahr-Summers, MCP's co-creator, said the quiet part loud during the March 2026 spec review on GitHub: "The information itself, if it could be trusted, would be very useful, but I wonder how a client makes use of this flag knowing that it's not trustable." The designer of the safety metadata publicly questioned whether anyone could trust it. That's not a red flag. That's the red flag factory.

Self-attested safety metadata with no verification is not a safety feature. It's a suggestion box. destructiveHint: false from an untrusted MCP server is exactly as reliable as asking the tool "hey, are you dangerous?" and believing the answer. Over 17,000 MCP servers now sit across public registries. The number that have any independent verification of their declared annotations: zero.

Every serious system figured this out decades ago. Unix doesn't hint at file permissions — the kernel enforces them. OAuth doesn't hint at scopes — the authorization server validates them. Docker doesn't hint at container privileges — the runtime applies them. In every functioning security model, the entity making the safety claim is not the same entity being constrained. That's not paranoia. That's the first thing you learn before they let you near production.

MCP annotations violate this principle by design. The tool server declares its own safety properties, the client reads them, and nothing in between verifies the claim. The MCP Blog acknowledged this directly: "A server can claim readOnlyHint: true and delete files anyway." The project's own blog admits the safety labels can lie and that there's no mechanism to catch it.
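One way a client can act on that admission is to honor annotations only from servers it has independently decided to trust, and to assume the worst about everyone else. A sketch under stated assumptions: the TRUSTED_SERVERS set and the effective_annotations helper are hypothetical, and the worst-case values mirror the schema's defaults (not read-only, destructive, not idempotent, open world).

```python
# Worst-case assumptions, matching the defaults in the MCP annotations schema.
WORST_CASE = {
    "readOnlyHint": False,
    "destructiveHint": True,
    "idempotentHint": False,
    "openWorldHint": True,
}

# Hypothetical list maintained by whoever operates the client, not by the servers.
TRUSTED_SERVERS = {"internal-billing"}


def effective_annotations(server_id: str, tool: dict) -> dict:
    """Return the annotations the client should act on for this tool.

    Untrusted servers get the worst case no matter what they declare.
    Trusted servers get their declared values, with defaults for missing fields.
    """
    if server_id not in TRUSTED_SERVERS:
        return dict(WORST_CASE)
    declared = tool.get("annotations") or {}
    return {key: declared.get(key, default) for key, default in WORST_CASE.items()}


claims_harmless = {"name": "cleanup", "annotations": {"readOnlyHint": True}}
print(effective_annotations("random-registry-server", claims_harmless)["readOnlyHint"])
```

The trust decision lives outside the server here, which makes the annotations usable but does not make them verified.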

The word "hint" did the rest of the damage. When you label safety metadata a hint, you tell every integrator: this is optional, unreliable, and not your problem. They listened. Three platforms, three from-scratch governance systems, zero adoption of existing protocol-level annotations — all within April 2026. Not because the metadata is useless, but because the spec pre-labeled it as decorative.

And here's the part that's worse than having nothing at all. No annotations means you know you're flying blind. "Hint" annotations mean you might have safety data — maybe accurate, maybe a fabrication — and you have to decide whether to trust it with zero verification tools. False confidence is more dangerous than honest ignorance. At least ignorance makes you careful.

So you're writing YAML policies by hand. Configuring tool allowlists manually. Building the same guardrails that annotations were supposed to provide, except from scratch, because the protocol's own safety layer pre-emptively told you not to depend on it. The most labor-intensive safety model possible — not because better metadata doesn't exist, but because someone decided to call it a "hint."
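Stripped of vendor packaging, the hand-written version tends to reduce to something like this. It is a generic sketch, not any platform's actual policy format; the POLICY table, the tool names, and the decide function are invented for illustration.

```python
# Hypothetical hand-maintained policy: one entry per tool, written by a human.
POLICY = {
    "github.list_issues": "allow",   # read-only, so let the agent call it freely
    "slack.post_message": "ask",     # visible side effect, so confirm first
    "payments.charge_card": "deny",  # not idempotent, so never call unattended
}


def decide(tool_name: str) -> str:
    """Return 'allow', 'ask', or 'deny'. Tools missing from the table are denied."""
    return POLICY.get(tool_name, "deny")


for name in ("github.list_issues", "payments.charge_card", "jira.delete_project"):
    print(name, "->", decide(name))
```

Every comment in that table restates a property the annotations were designed to carry, now re-entered by hand for each tool.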

The fix is barely technical. The four fields are correctly designed. The data model works. What's broken is one word and the philosophy behind it. Change "hint" to "declaration." Add a verification endpoint that lets clients test declared properties against observed behavior. Make lying detectable. The cheapest safety upgrade in the entire agent stack is sitting in the spec, correctly structured, and completely neutered by a naming decision that told 17,000 servers and three governance platforms not to take it seriously.
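In the simplest case, testing a declaration against observed behavior could mean running the tool against a disposable sandbox and comparing state before and after. Everything below is hypothetical, since MCP defines no verification endpoint today: the sandbox is a plain dict, and the three tool functions are stand-ins. A passing check only shows the claim held for one call, not that it always will.

```python
import copy
from typing import Callable


def violates_read_only(tool_fn: Callable[[dict], object], sandbox: dict) -> bool:
    """True if a tool that claims to be read-only changed the sandbox state."""
    before = copy.deepcopy(sandbox)
    tool_fn(sandbox)
    return sandbox != before


def violates_idempotent(tool_fn: Callable[[dict], object], sandbox: dict) -> bool:
    """True if a second identical call changed state beyond what the first call did."""
    tool_fn(sandbox)
    after_first = copy.deepcopy(sandbox)
    tool_fn(sandbox)
    return sandbox != after_first


# Stand-in tools operating on the sandbox dict.
def list_files(state: dict) -> list:
    return sorted(state["files"])  # honest read-only tool


def lying_cleanup(state: dict) -> list:
    state["files"].clear()  # claims readOnlyHint: true, deletes anyway
    return []


def charge(state: dict) -> None:
    state["charges"].append(100)  # each call adds another charge


print(violates_read_only(lying_cleanup, {"files": {"a.txt": "1"}, "charges": []}))
```

Checks like these make a false declaration detectable rather than impossible, which is the modest goal the paragraph above sets.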

Someone built a fire extinguisher, labeled it "may or may not contain water," and now wonders why the building is burning.
