The Checkpoint Gap: Multi-Hour Agents Shipped Before Crash Recovery Did

You kick off a six-hour agent on Tuesday night. It's supposed to scrape a competitor's pricing page, triage forty stale Linear tickets, and run a Postgres migration dry-run while you sleep. The dashboard says "autonomous." The marketing says "long-horizon." Your credit card says "fine, whatever." You close the laptop.

You wake up to a half-finished task, three duplicate Linear tickets filed under your name, and a Slack channel full of questions from a teammate who thought you were awake at 3 a.m. The agent crashed at hour four. Nobody — not you, not the vendor — can tell you whether clicking "resume" will double the damage or fix it.

Welcome to April 2026, the month multi-hour agents became a pricing metric before they became a reliability guarantee 😹.

Eight days, three persistence models, zero standards

Between April 8 and April 15, the two biggest agent vendors shipped three different ways to keep an AI agent alive past the one-hour mark — and none of them agree on what "alive" means.

On April 14, Anthropic launched Claude Code Routines — scheduled or webhook-triggered agent runs, research preview, gated behind daily caps (5/day on Pro, 15/day on Max, 25/day on Team and Enterprise). Minimum schedule interval: one hour. The Register politely called it "mildly clever cron jobs" 😼.

On April 15, OpenAI shipped Agents SDK v0.14.0 with a new SandboxAgent surface, a pluggable sandbox backend (Docker, E2B, Modal, Vercel, Cloudflare — take your pick), and a thing called MEMORY.md — a literal markdown file the agent writes to itself between runs.

And on April 8, Anthropic had already launched Managed Agents, which meters usage in session-hours — a billing unit that explicitly assumes your agent will run for hours at a time.

Three persistence models. Zero interop. Welcome to long-horizon autonomy.

What each vendor is actually persisting

A quick detour — because "the agent remembers" sounds simple, and it's not.

An agent is a loop: the LLM (large language model — the brain behind ChatGPT or Claude) reads a task, calls a tool (web search, shell command, API call), reads the result, decides what to do next. A long-horizon agent is that loop, running for hours. A checkpoint is a saved snapshot of the loop's state, so if the process crashes, you can resume from the snapshot instead of restarting from scratch.

Here's what each vendor actually saves:

Anthropic Routines — saves the conversation and plan inside a session. Per the docs, "each matching GitHub event starts a new session" — sessions don't even share state across triggers. And: "events beyond the limit are dropped until the window resets" — meaning a webhook spike silently loses work, no queue, no retry 🙀.
OpenAI Sandbox Agents — saves a MEMORY.md file inside the sandbox's filesystem. OpenAI's own docs say it "distills lessons into readable files rather than preserving full workspace state." In plain English: it remembers what it learned, not what it did. Killed mid-git push? The plan survives. The half-pushed commit doesn't.
Anthropic Managed Agents — bills by session-hour. What a session-hour checkpoints is undocumented.

None of them — none — document what happens to side-effects when a run crashes. A side-effect is anything the agent touched outside its own memory: an API call fired, a Linear ticket created, a row inserted into your database, a Slack message sent, a git commit pushed. Those don't rewind.

The "aha" that nobody put on the landing page

Here's the quiet part out loud: when a multi-hour agent crashes and resumes, the checkpoint restores the agent's intent, not the state of the world the agent was acting on.

Your agent filed a Linear ticket at hour three. Crashed at hour four. The checkpoint at hour 3.5 doesn't know the ticket exists. Resume: it files the ticket again. Congratulations, you have duplicates — and per Anthropic's docs, "Linear tickets… use your linked accounts," so the duplicates are under your name. Your teammates think you are spamming them 😾.

This isn't a bug. It's the architecture. The New Stack's analysis of the OpenAI release notes that the harness "can keep auth, billing, audit logs, human review, and recovery state outside any one container" — which is true, and also a polite way of saying the SDK has opinions about its own state and none about yours.

Google's Vertex Agent Engine, for the record, had Sessions and Memory Bank go GA back in December 2025; April 2026 only added an Agent Designer preview. So nobody — not Anthropic, not OpenAI, not Google — solves side-effect idempotency for you.

The price nobody put on the pricing page

Idempotency — the property that doing something twice has the same effect as doing it once — is now entirely your problem. Every tool call your agent makes to the outside world needs an idempotency key (a unique ID per operation, so the receiving service can deduplicate retries). Every external action needs a journaled outbox (a log you write before the action, so you know what you attempted even if you crash before confirming it succeeded).

Re-runs cost double: double tokens (the word-chunks the LLM processes, billed per million), double session-hours, double the wall-clock you waited. And because no vendor offers a portable checkpoint format, you cannot fail over from Anthropic to OpenAI mid-task. You're locked in by the shape of your bug reports.

The Hacker News thread on Routines put it plainly: "I'm not going to build my business on things I can't replicate myself." Another commenter noted debugging a multi-hour routine would be "maddening." Correct on both counts 🐈‍⬛.

If you're shipping this to production

If you're running agents past the one-hour mark in April 2026, the platform's checkpoint is not your recovery story. It's a receipt. You need three things the vendors haven't built for you:

A journaled outbox — every external side-effect writes to a log before executing, so replay knows what the agent attempted.
Idempotency keys on every tool call — GitHub, Linear, Slack, your own APIs. No exceptions.
A manual resume UI — so a human decides whether to retry, skip, or abort after a crash. Not the agent. Not the vendor.

What actually changed this month

"Agents run for hours" became a pricing unit in April 2026. The plumbing underneath is still fifteen-minute scale. Somewhere in the next quarter, an enterprise will write the first public post-mortem about a managed agent nobody could rewind — and the interesting question will not be which vendor failed, but why anyone thought the checkpoint was the guarantee 😸.

The cat's advice: run your own outbox. Trust no vendor's "resume" button. And if a sales deck says "autonomous," ask them to define the word on paper.

The Checkpoint Gap: Multi-Hour Agents Shipped Before Crash Recovery Did

Eight days, three persistence models, zero standards

What each vendor is actually persisting

The "aha" that nobody put on the landing page

The price nobody put on the pricing page

If you're shipping this to production

What actually changed this month

Keep reading

The Browser-Agent Oligopoly Nobody Voted For

Tool-Calling Is Dead. Agents Now Write Code.

Every Agent SDK Ships a Runtime. None Ships the Tests.

Two Leaks, One Company, and an $852 Billion IOU