Prompt Caching for Agentic Loops: Stop Paying Full Price for the Same Tokens

You built the agentic loop. Tools get called, results come back, the model reasons about what to do next — the whole thing hums like a little autonomous brain. Then you open the Anthropic usage dashboard after a day of testing. That number is not a typo. It's the cost of resending the same system prompt and tool schemas on every single iteration, at full price, to a stateless API that has no memory of what you sent three seconds ago.

And if you upgraded to Opus 4.7 on April 16, 2026 (which you probably did), Simon Willison measured the new tokenizer producing 1.46× as many tokens for identical system prompts. Your costs went up 35-40% overnight for the same text. Finout's analysis (April 27, 2026) confirms the pattern across workloads.

Why "just optimize your prompts" doesn't cut it

The obvious response: make prompts shorter. Sure, trim the fat. But in a real agentic loop, your static prefix — the system prompt, tool definitions (schemas that tell the model what tools it can call and what parameters they accept), and the accumulated conversation history — easily hits 10,000+ tokens. With 6 tools and a proper system prompt, you're looking at ~11,000 tokens of identical content resent on every turn.

Over 10 iterations, that's 110,000 tokens of pure repetition at $5/MTok on Opus 4.7. That's $0.55 in input costs alone — just for bytes the server already saw. Scale to 50 turns and you're burning dollars on a single conversation.

No amount of prompt trimming fixes a fundamentally stateless protocol. You need the server to remember.

The fix: prompt caching

Anthropic's prompt caching API (GA since December 17, 2024) lets you tag a content block with a cache_control breakpoint, a marker that tells the server "store everything up to this point." On subsequent requests, if the prefix up to that breakpoint matches exactly, the server reads it from cache and charges you 10% of the normal input price. According to Anthropic's documentation, the 5-minute cache (default) costs 1.25× the normal input price on the initial write; the 1-hour cache costs 2×.

The math is simple: pay a 25% premium once, get 90% off on every subsequent hit. In a 10-turn loop, that's 1 write + 9 reads. It pays for itself after the first cache hit.
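To make that arithmetic concrete, here is a minimal sketch using the figures from this article (an 11,000-token static prefix, $5/MTok input, 1.25× write, 0.1× read). The function name and its parameters are illustrative, and it assumes every turn after the first lands inside the TTL and that the prefix does not grow.

```python
def loop_input_cost(prefix_tokens, turns, price_per_mtok,
                    cached=False, write_mult=1.25, read_mult=0.10):
    """Input cost in dollars for resending a static prefix on every turn.

    Assumes turns >= 1 and that every turn after the first is a cache hit.
    """
    per_turn = prefix_tokens * price_per_mtok / 1_000_000
    if not cached:
        return per_turn * turns
    # First turn writes the cache at a premium; later turns read at a discount.
    return per_turn * write_mult + per_turn * read_mult * (turns - 1)


uncached = loop_input_cost(11_000, 10, 5.00)
cached = loop_input_cost(11_000, 10, 5.00, cached=True)
print(f"uncached: ${uncached:.3f}  cached: ${cached:.3f}")
```

Even with only two turns, the cached path (1.25 + 0.10 = 1.35 units) undercuts the uncached one (2 units), which is why the write premium is recovered on the first hit.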

Step 1: Cache the system prompt

The system prompt is the most stable content in your loop. It never changes between iterations. Make it cacheable:

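import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=4096,
    system=[{
        "type": "text",
        "text": "Your long system prompt with instructions...",
        "cache_control": {"type": "ephemeral"},  # default 5-minute TTL
    }],
    messages=[...],  # your conversation so far
)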

The key detail: on Opus 4.7 (and Opus 4.5, Haiku 4.5), the minimum cacheable length is 4,096 tokens, counted over everything up to the breakpoint. If that prefix is shorter, Anthropic silently ignores the cache_control marker: no error, no cache, just wasted intent. On Sonnet, the minimum is 1,024 tokens.
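Because a too-short prefix fails silently, a guard in your own code can catch it before you send the request. The sketch below hard-codes the minimums quoted in this article and matches on a substring of the model name. Both the table and the matching rule are simplifying assumptions, so check the thresholds against the current Anthropic documentation for the exact model you use. The token count is expected to come from wherever you already measure prompt size.

```python
# Minimum cacheable prefix lengths as quoted in this article. These are
# assumptions keyed by model family, not an authoritative table.
MIN_CACHEABLE_TOKENS = {"opus": 4096, "haiku": 4096, "sonnet": 1024}


def clears_cache_minimum(prefix_tokens: int, model: str) -> bool:
    """Return True if a prefix of this size is long enough to be cached."""
    for family, minimum in MIN_CACHEABLE_TOKENS.items():
        if family in model:
            return prefix_tokens >= minimum
    raise ValueError(f"no known minimum for model {model!r}")


print(clears_cache_minimum(3_000, "claude-opus-4-7"))   # too short to cache
print(clears_cache_minimum(11_000, "claude-opus-4-7"))  # long enough
```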

/faion is the faion-network umbrella skill — it pulls relevant methodology from 12 specialist domains into the conversation, so before designing your cache strategy, ask it:

/faion
I'm implementing Anthropic prompt caching for a multi-tool agentic loop.
What's the best strategy for placing cache_control breakpoints when I have
a system prompt, 6 tool schemas, and growing conversation history?

Step 2: Cache tool definitions

Tool schemas (the JSON descriptions of each tool's name, parameters, and purpose) sit in the tools array and come before the system prompt in the cache hierarchy. Place cache_control on the last tool in the array to cache all of them with one breakpoint:

tools = [
    {"name": "search_web", "description": "...", "input_schema": {...}},
    {"name": "read_file", "description": "...", "input_schema": {...}},
    {"name": "run_code", "description": "...", "input_schema": {...}},
    {"name": "write_file", "description": "...", "input_schema": {...}},
    {"name": "list_dir", "description": "...", "input_schema": {...}},
    {"name": "submit_result", "description": "...", "input_schema": {...},
     "cache_control": {"type": "ephemeral"}},  # caches ALL tools above
]

Critical rule: the cache hierarchy is tools → system → messages. Changing any tool definition between iterations (adding a tool, renaming a parameter) invalidates the cache for tools, system, AND messages. In an agentic loop, keep your tool set frozen.
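One way to enforce a frozen tool set is to fingerprint the tools array when the loop starts and check it before every call. This is a sketch, not part of the Anthropic SDK: tools_fingerprint and assert_tools_frozen are hypothetical helper names, and the hash covers your local JSON rendering rather than the exact bytes sent over the wire.

```python
import hashlib
import json


def tools_fingerprint(tools: list) -> str:
    """Hash the tool definitions as they would be serialized locally.

    sort_keys is deliberately left off so a change in key order is
    detected as well as a change in content.
    """
    rendered = json.dumps(tools, separators=(",", ":"))
    return hashlib.sha256(rendered.encode("utf-8")).hexdigest()


def assert_tools_frozen(tools: list, expected: str) -> None:
    """Raise if the tool set differs from the one the loop started with."""
    if tools_fingerprint(tools) != expected:
        raise RuntimeError("tool definitions changed; the cached prefix will miss")


tools = [{"name": "read_file", "description": "Read a file",
          "input_schema": {"type": "object", "properties": {}}}]
baseline = tools_fingerprint(tools)
assert_tools_frozen(tools, baseline)  # passes: nothing changed
```

Calling assert_tools_frozen at the top of each iteration turns a silent cache miss into a loud failure during development.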

Step 3: Cache the conversation history prefix

As the conversation grows, older messages become static — they won't change. Place a breakpoint on the last message in the "stable" portion:

messages = [
    {"role": "user", "content": "Initial task..."},
    {"role": "assistant", "content": "I'll start by..."},
    {"role": "user", "content": [{  # last stable message
        "type": "text",
        "text": "Tool result from step 3...",
        "cache_control": {"type": "ephemeral"},
    }]},
    {"role": "assistant", "content": "Now processing..."},  # new, uncached
]

You now have 3 breakpoints (tools, system, history). The API allows a maximum of 4. Save the last slot for edge cases.
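In a running loop the "stable" portion grows every turn, so the history breakpoint has to move with it. A minimal sketch of that bookkeeping follows, assuming messages are plain dicts in the shape shown above. The helper mark_history_breakpoint is hypothetical, and it keeps exactly one message-level breakpoint so the request stays within the four-breakpoint limit.

```python
import copy


def mark_history_breakpoint(messages: list) -> list:
    """Return a copy of messages with one cache breakpoint on the final block.

    Earlier message-level breakpoints are removed so that, together with
    the tools and system breakpoints, the request uses three of the four slots.
    """
    marked = copy.deepcopy(messages)
    for message in marked:
        # Normalize plain-string content into a list of text blocks so the
        # serialized form is the same on every turn.
        if isinstance(message["content"], str):
            message["content"] = [{"type": "text", "text": message["content"]}]
        for block in message["content"]:
            block.pop("cache_control", None)
    if marked and marked[-1]["content"]:
        marked[-1]["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return marked


history = [
    {"role": "user", "content": "Initial task..."},
    {"role": "assistant", "content": "I'll start by..."},
    {"role": "user", "content": [{"type": "text", "text": "Tool result from step 3..."}]},
]
request_messages = mark_history_breakpoint(history)
```

Pass request_messages to the API call and keep appending to history itself; the original list is never mutated.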

Step 4: The relocation trick

This one comes from ProjectDiscovery's engineering team, who published their results on April 10, 2026. They run a 26-step, 40-tool-call agentic security scanner. Their initial cache hit rate was a pathetic 7%.

The problem: their system prompt contained timestamps, working memory, and runtime variables that changed every call. Cache matching requires exact byte equality — one different character and the entire block misses.

The fix: move all dynamic content out of the system prompt and into a user-role message at the end of the conversation. The system prompt becomes byte-identical across calls. Their cache hit rate jumped from 7% to 74% overnight, eventually reaching 84% with further tuning — a 70% cost reduction.
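The split can be sketched as a request builder that never interpolates anything into the system prompt. This is an illustration of the pattern described above, not ProjectDiscovery's code: STATIC_SYSTEM, build_request, and the runtime tag wrapper are all hypothetical names chosen for the example.

```python
STATIC_SYSTEM = "You are a security scanning agent. Follow the rules below..."


def build_request(history: list, runtime_state: dict) -> dict:
    """Keep the system prompt byte-identical; put volatile data in a final user turn."""
    reminder = "\n".join(f"{key}: {value}" for key, value in runtime_state.items())
    return {
        "system": [{
            "type": "text",
            "text": STATIC_SYSTEM,  # no timestamps or per-call values here
            "cache_control": {"type": "ephemeral"},
        }],
        "messages": history + [{
            "role": "user",
            # Hypothetical wrapper tag; any consistent format would do.
            "content": [{"type": "text", "text": f"<runtime>\n{reminder}\n</runtime>"}],
        }],
    }


first = build_request([{"role": "user", "content": "Scan the target"}],
                      {"timestamp": "2026-04-10T12:00:00Z", "step": 1})
second = build_request([{"role": "user", "content": "Scan the target"}],
                       {"timestamp": "2026-04-10T12:00:07Z", "step": 2})
print(first["system"] == second["system"])  # the cached part is unchanged
```

The returned dict can be unpacked into client.messages.create alongside model, max_tokens, and tools.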

/faion
How should I separate static vs. dynamic content in an agentic system
prompt to maximize cache hit rates? What patterns exist for the
static-system-prompt + dynamic-user-reminder split?

Step 5: Verify it's actually working

Cache hits are invisible unless you look. Most SDK wrappers and logging frameworks don't surface cache metrics. You need to check explicitly:

usage = response.usage
# Cache fields may be None when no caching took place; treat that as 0.
written = usage.cache_creation_input_tokens or 0
read = usage.cache_read_input_tokens or 0
uncached = usage.input_tokens
print(f"Cache written: {written}")
print(f"Cache read:    {read}")
print(f"Uncached:      {uncached}")

total = read + written + uncached
hit_rate = read / total * 100 if total else 0.0  # avoid dividing by zero
print(f"Cache hit rate: {hit_rate:.1f}%")

On your second loop iteration, cache_read_input_tokens should be large. If it's zero, something is invalidating your cache — check for byte-level differences in your "static" content.
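Hunting for that byte-level difference by eye is tedious, so a small diff helper can point at the first mismatch between two consecutive requests. This sketch compares your local JSON rendering of the parts that should stay identical. prefix_text and first_difference are hypothetical helpers, and the comparison approximates what the server sees rather than reproducing it exactly.

```python
import json


def prefix_text(request: dict, n_messages: int) -> str:
    """Serialize the part of a request that is expected to stay identical."""
    stable = {
        "tools": request.get("tools", []),
        "system": request.get("system", []),
        "messages": request.get("messages", [])[:n_messages],
    }
    return json.dumps(stable, ensure_ascii=False)


def first_difference(a: str, b: str, context: int = 30):
    """Return (offset, snippet_a, snippet_b) at the first mismatch, or None."""
    limit = min(len(a), len(b))
    i = next((k for k in range(limit) if a[k] != b[k]), limit)
    if i == len(a) == len(b):
        return None  # the two strings are identical
    start = max(0, i - context)
    return i, a[start:i + context], b[start:i + context]


previous = {"system": [{"type": "text", "text": "Rules... Now: 10:00"}],
            "messages": [{"role": "user", "content": "hi"}]}
current = {"system": [{"type": "text", "text": "Rules... Now: 10:01"}],
           "messages": [{"role": "user", "content": "hi"},
                        {"role": "assistant", "content": "ok"}]}
print(first_difference(prefix_text(previous, 1), prefix_text(current, 1)))
```

If you move cache_control markers between turns, strip them from both requests before comparing so that the marker itself does not show up as the difference.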

Gotchas that will cost you money

1. The 5-minute cliff. Default TTL (time-to-live — how long the server remembers your cached content) is 5 minutes. If a tool call takes 6 minutes (external API timeout, long computation), the cache expires. You pay the write premium again. Fix: use "ttl": "1h" for slow workflows, but know it costs 2× instead of 1.25× on the write. And longer TTLs must appear before shorter ones in the request — the reverse throws an error.

2. JSON key ordering. Some languages randomize JSON key order on serialization. Different key order = different bytes = cache miss. Pin your serialization order, for example with sort_keys=True in Python's json.dumps.

3. The 20-block lookback. The system searches at most 20 content blocks backward for cached entries. In long conversations (>20 blocks), old cache breakpoints fall outside the search window. Add fresh breakpoints every ~18 blocks. PwC's study from January 2026 found this is where most "mysterious cache misses" come from.

4. Caching short content can backfire. PwC measured 10-18% worse latency with caching on prompts under 500 tokens, due to overhead. Don't mark your 200-token system prompt for caching. It would need to clear the 4,096-token minimum on Opus anyway.
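The TTL ordering rule from gotcha 1 is easy to check locally before a request goes out. The sketch below assumes cache_control dicts take an optional "ttl" of "5m" or "1h" and that you collect them in request order (tools, then system, then messages). check_ttl_order is a hypothetical helper, and it also enforces the four-breakpoint limit.

```python
TTL_SECONDS = {"5m": 300, "1h": 3600}


def check_ttl_order(breakpoints: list) -> None:
    """Raise if a longer-TTL breakpoint follows a shorter one.

    `breakpoints` holds the cache_control dicts in request order.
    A missing "ttl" key means the 5-minute default.
    """
    if len(breakpoints) > 4:
        raise ValueError("more than 4 cache breakpoints")
    shortest_seen = None
    for position, cache_control in enumerate(breakpoints):
        ttl = TTL_SECONDS[cache_control.get("ttl", "5m")]
        if shortest_seen is not None and ttl > shortest_seen:
            raise ValueError(
                f"breakpoint {position} has a longer TTL than an earlier one")
        shortest_seen = ttl if shortest_seen is None else min(shortest_seen, ttl)


# Tools and system cached for an hour, history on the 5-minute default.
check_ttl_order([
    {"type": "ephemeral", "ttl": "1h"},
    {"type": "ephemeral", "ttl": "1h"},
    {"type": "ephemeral"},
])
```

A common arrangement for slow workflows is the one shown: long TTLs on the rarely changing tools and system prompt, the default TTL on the fast-moving history prefix.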

What you can do now

You have three breakpoints (tools, system, history prefix) and one monitoring pattern. The same agentic loop that burned $0.55 per 10-turn conversation on Opus 4.7 now costs ~$0.12. That's not a rounding error; it's the difference between a prototype you demo once and a system you actually ship. The 35% tokenizer tax from Opus 4.7? Still there, but you pay full freight on it once per cache write and 10% on every read after that, instead of full price on every single turn. Welcome to the part where the billing page stops being scary.
