Your AI Agent Crashes at Step Four. Now What?

Compound failure rates in multi-step agents are brutal — we've established that. A seven-step workflow at 95% per-step reliability lands around 70% end-to-end. Three out of ten runs blow up. But here's the part nobody's fixing: when step four throws an HTTP 500, what happens to the three steps that already ran? The Slack message already sent, the database row already written, the Jira ticket already filed. Your agent doesn't know which steps completed. It has no checkpoint — no saved position to resume from. And it has no compensation logic — no automatic "undo" for half-finished work.
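The 70% figure is plain multiplication. A minimal sketch, assuming each step fails independently of the others (real steps often don't, so treat this as a rough model; `end_to_end_success` is a name made up for illustration):

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent failures."""
    return per_step ** steps


# How end-to-end reliability decays as the workflow gets longer.
for steps in (3, 7, 12):
    rate = end_to_end_success(0.95, steps)
    print(f"{steps} steps at 95% per step: {rate:.1%} end-to-end")
```

For seven steps this works out to roughly 69.8%, which is where "three out of ten runs blow up" comes from.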

Seeing the failure rate is step one. Surviving the failure is step two. Nobody ships step two.

Recovery Theater

All three major agent runtimes launched recovery features this April. None of them actually solve recovery.

Anthropic's Managed Agents (April 8) use an append-only event log — a write-once diary of everything the agent did. Crash? Reboot, re-read the diary, continue. Their engineering team: "Nothing in the harness needs to survive a crash." Sounds resilient. But there's no dead-letter queue for failed tasks, no exactly-once guarantee that each action runs once and only once, and no way to roll back completed side effects.
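To make "roll back completed side effects" concrete, here is a rough sketch of compensation logic. It is not part of any of these runtimes: `run_with_compensation` and the step names are hypothetical, and the list of completed steps lives in memory here, where a real system would have to persist it.

```python
from typing import Callable, List, Tuple

# Each step is (name, action, undo). All names here are illustrative.
Step = Tuple[str, Callable[[], None], Callable[[], None]]


def run_with_compensation(steps: List[Step]) -> None:
    """Run steps in order; if one fails, undo the completed ones in reverse."""
    completed = []
    try:
        for name, action, undo in steps:
            action()
            completed.append((name, undo))
    except Exception:
        # Compensate newest-first, then surface the original error.
        for _name, undo in reversed(completed):
            undo()
        raise


log = []


def failing_ticket() -> None:
    raise RuntimeError("HTTP 500")  # simulated failure at the third step


steps = [
    ("slack", lambda: log.append("sent slack"), lambda: log.append("deleted slack")),
    ("db", lambda: log.append("wrote row"), lambda: log.append("deleted row")),
    ("ticket", failing_ticket, lambda: None),
]

try:
    run_with_compensation(steps)
except RuntimeError:
    pass

print(log)
```

The undo functions are the hard part in practice: some side effects, like an email already read, can only be followed up on, not reversed.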

OpenAI's Agents SDK update (April 15) added snapshotting and rehydration — saving and restoring agent state across container restarts. But it persists message history, not execution state. The model remembers what it said; it doesn't track what it did.

Google ADK (announced at Cloud Next, April 22) ships SequentialAgent, ParallelAgent, LoopAgent plus a ResumabilityConfig. The fine print: the caller must detect interruption and re-invoke manually. No automatic recovery.

Three different approaches to durability. All three model execution as inference loops — "think → call tool → observe → repeat." None model it as a durable state machine — a system that tracks exactly which state it's in and which transitions are valid, like a vending machine that remembers you already inserted your dollar even after a power outage.
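A durable state machine can be surprisingly small. The sketch below is one way to read the idea, not how any of these products work: the state names, `TRANSITIONS`, and `DurableStateMachine` are all invented for illustration, and a JSON file stands in for real durable storage.

```python
import json
import os
import tempfile

# Hypothetical states for a three-step agent run, with the allowed next states.
TRANSITIONS = {
    "start": {"slack_sent"},
    "slack_sent": {"row_written"},
    "row_written": {"ticket_filed"},
    "ticket_filed": set(),
}


class DurableStateMachine:
    """Tracks the current state on disk so a restart resumes where it left off."""

    def __init__(self, path: str) -> None:
        self.path = path
        if os.path.exists(path):
            with open(path) as f:
                self.state = json.load(f)["state"]
        else:
            self.state = "start"
            self._save()

    def _save(self) -> None:
        # Write to a temp file, then rename, so a crash never leaves a torn file.
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"state": self.state}, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)

    def advance(self, new_state: str) -> None:
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"invalid transition: {self.state} -> {new_state}")
        self.state = new_state
        self._save()


path = os.path.join(tempfile.mkdtemp(), "run-42.json")
machine = DurableStateMachine(path)
machine.advance("slack_sent")
machine.advance("row_written")

# Simulate a crash: throw the object away and rebuild it from disk.
recovered = DurableStateMachine(path)
print(recovered.state)
```

After the simulated restart, the machine still knows the Slack message and the database row are done, and it refuses any transition other than filing the ticket.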

The Solutions That Already Exist (Outside AI)

Workflow engines solved durable execution a decade ago. The AI industry just hasn't adopted them yet.

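Temporal wraps each side-effecting step in an activity: a function it can retry and whose result it records. Crash mid-workflow? Temporal replays the workflow's event history, reuses the recorded results of completed activities, and resumes after the last one instead of starting from scratch. Here's what a Temporal-wrapped agent step looks like: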

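from temporalio import activity

# AgentContext, TicketResult, and ctx.tools.jira are illustrative names,
# not part of the Temporal SDK.
@activity.defn
async def file_jira_ticket(ctx: AgentContext) -> TicketResult:
    """Temporal retries this activity on failure, per its retry policy.
    If the worker crashes mid-workflow, replay resumes after the
    last completed activity, not from scratch."""
    return await ctx.tools.jira.create_issue(
        summary=ctx.draft.title,
        idempotency_key=ctx.run_id,  # same key on every retry, so no duplicate tickets
    )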

That idempotency_key — a unique ID ensuring the same request doesn't execute twice — is the entire difference between "production-grade" and "impressive demo."
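Here is a rough sketch of what the receiving side of that key has to do. `FakeTicketService` and `step_key` are hypothetical stand-ins; whether a real ticketing API honors an idempotency key depends on the API, and if it doesn't, you need to keep this lookup table yourself.

```python
class FakeTicketService:
    """In-memory stand-in for a ticketing API that honors idempotency keys."""

    def __init__(self) -> None:
        self._by_key = {}
        self.tickets = []

    def create_issue(self, summary: str, idempotency_key: str) -> str:
        # A repeated key returns the earlier result instead of creating again.
        if idempotency_key in self._by_key:
            return self._by_key[idempotency_key]
        ticket_id = f"TICKET-{len(self.tickets) + 1}"
        self.tickets.append(ticket_id)
        self._by_key[idempotency_key] = ticket_id
        return ticket_id


def step_key(run_id: str, step_name: str) -> str:
    """Derive a key that is stable across retries but unique per step."""
    return f"{run_id}:{step_name}"


service = FakeTicketService()
key = step_key("run-42", "file_jira_ticket")

first = service.create_issue("Checkout latency spike", idempotency_key=key)
retry = service.create_issue("Checkout latency spike", idempotency_key=key)

print(first, retry, len(service.tickets))
```

Including the step name in the key is a design choice for this sketch: a run that files two different tickets would otherwise collide on a bare run ID.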

Cloudflare's Project Think (April 15) took a different route entirely: per-agent SQLite databases, step.do() checkpoints, and zero-cost hibernation. Each agent is a durable isolated entity with its own persistent state, not a job in a centralized orchestrator. For teams already on Cloudflare Workers, this is arguably the cleanest path — no external workflow engine, no message broker, just durable objects with built-in checkpointing.
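The per-agent-database idea is easy to approximate with Python's built-in `sqlite3`. To be clear, this is not Cloudflare's API: `CheckpointedRun` and its `do` method are hypothetical, loosely modeled on the idea of a `step.do()` checkpoint.

```python
import json
import sqlite3


class CheckpointedRun:
    """Stores each step's result in SQLite so a restarted run can skip it."""

    def __init__(self, conn: sqlite3.Connection, run_id: str) -> None:
        self.conn = conn
        self.run_id = run_id
        conn.execute(
            "CREATE TABLE IF NOT EXISTS steps ("
            "run_id TEXT, name TEXT, result TEXT, PRIMARY KEY (run_id, name))"
        )

    def do(self, name, fn):
        row = self.conn.execute(
            "SELECT result FROM steps WHERE run_id = ? AND name = ?",
            (self.run_id, name),
        ).fetchone()
        if row is not None:
            return json.loads(row[0])  # already done: reuse the saved result
        result = fn()
        # A crash between fn() and this commit re-runs fn on restart,
        # which is why the step itself still needs to be idempotent.
        with self.conn:
            self.conn.execute(
                "INSERT INTO steps VALUES (?, ?, ?)",
                (self.run_id, name, json.dumps(result)),
            )
        return result


conn = sqlite3.connect(":memory:")  # a file path would survive real restarts
calls = []


def send_slack():
    calls.append("slack")
    return {"message_id": "m-1"}


CheckpointedRun(conn, "run-42").do("send_slack", send_slack)

# Simulated restart: a fresh object over the same database skips the step.
result = CheckpointedRun(conn, "run-42").do("send_slack", send_slack)
print(result, calls)
```

The second call returns the stored result without sending a second Slack message. The same record-then-skip pattern is the core of journal-based replay in general.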

Restate sits between the two: lighter than Temporal, more structured than raw durable objects. It provides exactly-once guarantees with a journal-based replay mechanism, and has published direct integration patterns with Google ADK.

The Uncomfortable Middle

Here's the problem nobody wants to acknowledge: the gap runs both ways.

Workflow engines don't speak LLM. They don't understand token budgets — how much text an AI can process per call. They can't handle model fallback routing — switching to a cheaper model when the expensive one hits rate limits. They have no concept of prompt versioning, context window management, or tool-call schema evolution. Bolting agents onto Temporal means writing a translation layer between "workflow step" and "inference call" — and maintaining it every time your model provider changes their API.

Agent platforms don't understand durability. They treat tool calls as ephemeral side effects of inference, not as state transitions that need transactional guarantees. An LLM calling a tool is fundamentally different from a workflow step calling a function: the LLM might decide to call the tool again, call a different tool, or hallucinate a tool that doesn't exist. Workflow engines assume deterministic step definitions, because replay only works if the same code makes the same decisions the second time. LLM output is non-deterministic, so each model decision has to be recorded as a result rather than recomputed on replay.
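One piece of that translation layer is a gate between what the model proposes and what gets treated as a state transition. A minimal sketch, assuming tool calls arrive as dicts with `name` and `arguments` keys; `TOOL_SCHEMAS` and `validate_tool_call` are hypothetical names, and the tools listed are made up:

```python
# Registry of tools the workflow actually defines, with their required arguments.
TOOL_SCHEMAS = {
    "send_slack_message": {"channel", "text"},
    "create_ticket": {"summary"},
}


def validate_tool_call(call: dict) -> tuple:
    """Reject hallucinated tools and malformed arguments before anything runs."""
    name = call.get("name")
    if name not in TOOL_SCHEMAS:
        raise ValueError(f"unknown tool: {name!r}")
    args = call.get("arguments") or {}
    expected = TOOL_SCHEMAS[name]
    if set(args) != expected:
        raise ValueError(
            f"{name}: expected arguments {sorted(expected)}, got {sorted(args)}"
        )
    return name, args


proposed = {"name": "create_ticket", "arguments": {"summary": "Checkout latency spike"}}
print(validate_tool_call(proposed))

hallucinated = {"name": "delete_production_db", "arguments": {}}
try:
    validate_tool_call(hallucinated)
except ValueError as err:
    print(f"rejected: {err}")
```

Only calls that pass this check would be handed to the durable layer as steps; everything else goes back to the model as an error to reason about.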

Neither side is wrong. They're solving different halves of the same problem.

What This Means for You

If your agent modifies external state across more than three steps — sends emails, writes databases, calls third-party APIs — you already need workflow-engine guarantees. Your agent platform doesn't provide them. You have three options:

1. Bolt on Temporal/Restate yourself. Real integration complexity, but battle-tested durability. Worth it if you're running agents in production with actual SLAs.
2. Go Cloudflare-native. If you're already on Workers, Project Think gives you durable agents without the orchestration overhead. Trade-off: platform lock-in.
3. Accept the failure rate and build manual reconciliation tooling. Honestly, for internal tools with forgiving users, this might be rational. Just don't call it "production-ready."

The convergence is inevitable. The first vendor to ship durable state-machine execution as a default agent primitive — not an afterthought, not a third-party integration — owns the production reliability layer above all three runtimes. Temporal knows this; their OpenAI Agents SDK integration is a land grab. Cloudflare knows this; Project Think is designed to make agents a platform feature.

Your seven-step demo still works beautifully. Shipping it to production still means choosing which failure mode you can live with.
