You wired your agent to Slack, Linear, GitHub, and an internal Postgres. Fifteen tools, clean JSON schemas, a prompt that says "you are a helpful assistant." It works beautifully on two-step tasks. On the fifth step it skips a filter, misquotes a field, or burns 40k tokens re-reading the same schemas. Welcome to the ceiling of classic tool-calling 😹.

Here's the setup nobody explains on the marketing pages. In traditional tool-calling, the pattern every SDK shipped through 2024–2025, an agent (a program that wraps a large language model and gives it tools) dumps the full JSON schema of every tool (a machine-readable description of that tool's inputs) into the context window (the model's working memory) on every single turn. Fifteen tools with rich types? That's 5–10k tokens before the model says hello. The model then picks one tool, fills its arguments, waits for a result, and does it again. Loops, conditionals, data transforms? None. The model fakes them by chaining ten separate calls and hoping it remembers what it saw in call three.
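To get a feel for that overhead, here is a rough back-of-the-envelope sketch. Everything in it is an assumption for illustration: the schemas are generated placeholders (make_tool_schema is a made-up helper), and the 4-characters-per-token rule is a crude heuristic, not a real tokenizer. The absolute number will differ for your tools; the point is that it gets multiplied by the number of turns.

```python
import json


def make_tool_schema(name: str, n_params: int = 6) -> dict:
    """Build a hypothetical JSON-schema tool definition with n_params string fields."""
    properties = {
        f"param_{i}": {
            "type": "string",
            "description": f"Illustrative description of parameter {i} for {name}.",
        }
        for i in range(n_params)
    }
    return {
        "type": "function",
        "name": name,
        "description": f"Illustrative description of what {name} does and when to call it.",
        "parameters": {
            "type": "object",
            "properties": properties,
            "required": list(properties),
        },
    }


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Crude heuristic (about 4 characters per token). Not a real tokenizer."""
    return round(len(text) / chars_per_token)


# Fifteen placeholder tools, re-sent on every turn of a ten-turn task.
tools = [make_tool_schema(f"tool_{i}") for i in range(15)]
per_turn = estimate_tokens(json.dumps(tools))
turns = 10
total = per_turn * turns
print(f"~{per_turn} schema tokens per turn, ~{total} over {turns} turns")
```

Swap in your real schemas and a real tokenizer to get numbers you can act on.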

The two days that moved the default

On April 14 and 15, 2026, three vendors shipped the same pattern and quietly retired the old one.

On April 15, 2026, OpenAI announced the next evolution of the Agents SDK, landing as v0.14.0 "Sandbox Agents" (hotfixed to v0.14.1 the same afternoon per the GitHub release page). The headline features: code mode, sandboxing, sub-agents, a long-horizon harness, and provider-agnostic support for 100+ LLMs. TechCrunch's coverage framed it as OpenAI catching up to a pattern Cloudflare and HuggingFace had been benchmarking for six months.

One day earlier, on April 14, 2026, Anthropic opened the research preview for Claude Code Routines — saved Claude Code configurations that run as persistent autonomous agents on Anthropic's cloud, triggered by schedule, HTTP webhook, or GitHub event. Same shape: tools are code the agent imports, not JSON it regurgitates.

Also on April 14, Cloudflare published "Scaling MCP adoption", the enterprise reference architecture for MCP (Model Context Protocol, the open standard for exposing tools to agents) that made the numbers embarrassing. Their benchmark: connect 4 internal MCP servers exposing 52 tools. Classic tool-calling burns ~9,400 context tokens per turn. Code Mode via portal: ~600 tokens. That's a 94% reduction, and (this is the real win) the cost stays flat as you add more servers 🙀.
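The arithmetic behind that percentage is easy to check. The sketch below uses only the two per-turn figures quoted above; the ten-turn task length is an assumption picked for illustration.

```python
CLASSIC_TOKENS_PER_TURN = 9_400   # per-turn figure quoted above
CODE_MODE_TOKENS_PER_TURN = 600   # per-turn figure quoted above


def reduction_pct(before: int, after: int) -> float:
    """Percentage drop going from `before` to `after`."""
    return 100.0 * (before - after) / before


def overhead(tokens_per_turn: int, turns: int) -> int:
    """Tokens spent on tool descriptions alone over a whole task."""
    return tokens_per_turn * turns


pct = reduction_pct(CLASSIC_TOKENS_PER_TURN, CODE_MODE_TOKENS_PER_TURN)
print(f"{pct:.1f}% fewer tokens per turn")  # 93.6%, which rounds to the quoted 94%

turns = 10  # assumed task length, for illustration only
saved = overhead(CLASSIC_TOKENS_PER_TURN, turns) - overhead(CODE_MODE_TOKENS_PER_TURN, turns)
print(f"{saved} tokens saved over {turns} turns")  # 88000
```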

What code mode actually does

Instead of shoving schemas into the prompt, the runtime hands the model a typed module. The model writes a short program. The sandbox runs it. Full tool schemas never enter the context window; only function signatures do, and often just the ones the model asked for via search().
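Here is one way a search() like that could work, as a minimal sketch rather than any vendor's implementation. The three tool functions, the REGISTRY dict, and the substring matching are all assumptions for illustration; a real runtime would likely use something smarter than a substring check. Only the standard-library inspect module is used.

```python
import inspect
from typing import Callable, Optional


# Hypothetical tools: plain typed functions with stub bodies.
def linear_search(priority: str, opened_after: str) -> list:
    """Search Linear issues by priority and creation date."""
    return []


def github_find_pr(ref: str) -> Optional[dict]:
    """Find the GitHub pull request that references an issue id."""
    return None


def slack_post(channel: str, text: str) -> None:
    """Post a message to a Slack channel."""


REGISTRY: dict[str, Callable] = {
    fn.__name__: fn for fn in (linear_search, github_find_pr, slack_post)
}


def search(query: str) -> list[str]:
    """Return one-line signatures for tools whose name or docstring mentions the query."""
    q = query.lower()
    hits = []
    for name, fn in REGISTRY.items():
        doc = inspect.getdoc(fn) or ""
        if q in name.lower() or q in doc.lower():
            hits.append(f"{name}{inspect.signature(fn)}  # {doc}")
    return hits


# Only the matching signature reaches the model, not the whole catalogue.
for line in search("slack"):
    print(line)
```

The model sees one short line per matching tool, so the context cost tracks what it asked about rather than how many tools exist.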

from agents import Agent, CodeMode, Sandbox

agent = Agent(
    model="gpt-5.1",
    mode=CodeMode(runtime="python"),
    sandbox=Sandbox(backend="e2b"),  # or docker, modal, runloop
    tools=[slack, linear, github, pg],  # plain typed functions
)

agent.run(
    "Find every P0 bug opened this week in Linear, "
    "cross-check against GitHub PRs, post a summary to #triage."
)

Under the hood the model emits something like:

bugs = linear.search(priority="P0", opened_after="2026-04-09")
prs  = {b.id: github.find_pr(ref=b.id) for b in bugs}
unmatched = [b for b in bugs if not prs[b.id]]
slack.post("#triage", render(bugs, unmatched))

That's a loop (the dict comprehension), a filter, and a conditional, all in one sandbox round-trip. The classic tool-calling version is 12+ turns and a migraine.
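If you want to watch that program run end to end, the sketch below wires it to in-memory stand-ins. None of this touches a real service: the Bug records, the issue ids, the PR labels, and the render helper are made up for illustration, and the linear, github, and slack objects are stubs that only mimic the call shapes used above.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass(frozen=True)
class Bug:
    id: str
    title: str


# Made-up fixture data standing in for Linear and GitHub.
_BUGS = [Bug("ENG-1", "Login 500s"), Bug("ENG-2", "Data loss on save"), Bug("ENG-3", "Checkout hangs")]
_PRS = {"ENG-1": "PR #101", "ENG-3": "PR #107"}
posted = []  # captures what would have gone to Slack

# Stub tool namespaces with the same call shapes as the snippet above.
linear = SimpleNamespace(search=lambda priority, opened_after: list(_BUGS))
github = SimpleNamespace(find_pr=lambda ref: _PRS.get(ref))
slack = SimpleNamespace(post=lambda channel, text: posted.append((channel, text)))


def render(bugs, unmatched):
    """Hypothetical formatter for the summary message."""
    lines = [f"{len(bugs)} P0 bugs this week, {len(unmatched)} without a PR:"]
    lines += [f"- {b.id}: {b.title}" for b in unmatched]
    return "\n".join(lines)


# The model-authored program, same shape as above.
bugs = linear.search(priority="P0", opened_after="2026-04-09")
prs = {b.id: github.find_pr(ref=b.id) for b in bugs}
unmatched = [b for b in bugs if not prs[b.id]]
slack.post("#triage", render(bugs, unmatched))

print(posted[0][1])
```

Intermediate values like bugs and prs live in sandbox variables, so the model never has to re-read them from its context window.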

The receipts

HuggingFace's smolagents framework has been showing this for months: CodeAgent uses ~30% fewer steps than ToolCallingAgent on multi-step benchmarks, and smolagents + GPT-4o sat at #1 on GAIA validation (44.2%). Cloudflare's April numbers: ~32% fewer tokens on simple tasks, ~81% on complex chains, per WorkOS's analysis. The canonical line, from Cloudflare's Kenton Varda and Sunil Pai, still holds: "LLMs are better at writing code to call MCP, than at calling MCP directly."

What it costs you

This is not free 😾. Code mode needs a real sandbox (Docker, E2B, Modal, Runloop, Daytona, or OpenAI's built-in harness) because you're now running model-authored code on your infrastructure. Skip the sandbox and you're one prompt injection away from an RCE (remote code execution). Most existing observability tools assume JSON traces and break on opaque code blobs. Your security model shifts from "validate arguments" to "contain arbitrary execution," which is a different review process, a different threat model, and often a different team. For single-shot, one-tool tasks like "get weather for Boston", code mode adds latency for nothing.
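On the observability point, one stopgap is to parse the model's program before running it and log the tool calls it contains, which gives you something trace-shaped again. The sketch below does that with Python's standard ast module. The TOOL_MODULES set and the trace format are assumptions for illustration. It is a static listing only: it will miss calls made through aliases or dynamic lookups, and it is not a security control or a substitute for the sandbox.

```python
import ast
import json

TOOL_MODULES = {"linear", "github", "slack", "pg"}  # assumed tool namespaces


def trace_tool_calls(source: str) -> list[dict]:
    """Statically list calls of the form <tool>.<method>(...) found in source."""
    calls = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id in TOOL_MODULES
        ):
            calls.append({
                "tool": node.func.value.id,
                "method": node.func.attr,
                "line": node.lineno,
                "kwargs": [kw.arg for kw in node.keywords if kw.arg],
            })
    # ast.walk is breadth-first, so sort to get source order.
    return sorted(calls, key=lambda c: c["line"])


program = '''\
bugs = linear.search(priority="P0", opened_after="2026-04-09")
prs = {b.id: github.find_pr(ref=b.id) for b in bugs}
unmatched = [b for b in bugs if not prs[b.id]]
slack.post("#triage", render(bugs, unmatched))
'''

for call in trace_tool_calls(program):
    print(json.dumps(call))
```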

What to do on Monday

If you're greenfielding an agent in April 2026, default to code mode from day one. Pick an SDK that runs tools inside a sandboxed runtime, write your tools as plain typed Python or TypeScript functions, and stop hand-crafting JSON schemas. If you've got a production agent on classic tool-calling and it works, don't panic-migrate, but do the token math every time you add a tool, starting with number sixteen.

The verdict

Tool-calling isn't dead for single-step calls 🐈. But for any agent that chains more than two actions, the industry just decided — in the span of 48 hours between April 14 and 15, 2026 — that the agent's native language is code, not JSON. If you weren't watching, the stack shifted under you 😼.