Agentic Loops के लिए Prompt Caching: एक ही Tokens की बार-बार Full Price देना बंद करो

तुमने agentic loop बना लिया। Tools call होते हैं, results आते हैं, model सोचता है कि अगला कदम क्या हो — पूरा system एक छोटे autonomous दिमाग की तरह चलता है। फिर एक दिन testing के बाद तुम Anthropic का usage dashboard खोलते हो। वो number typo नहीं है। ये कीमत है हर single iteration पर वही system prompt और tool schemas बार-बार भेजने की, full price पर, एक ऐसे stateless API को जिसे तीन सेकंड पहले भेजी गई चीज़ की कोई याद नहीं।

और अगर तुमने 16 April, 2026 को Opus 4.7 पर upgrade किया — जो शायद तुमने किया होगा — Simon Willison ने मापा कि नया tokenizer identical system prompts के लिए 1.46× ज़्यादा tokens produce करता है। रातोंरात, बिल्कुल वही text के लिए, तुम्हारी cost 35-40% बढ़ गई। कोई notification नहीं, कोई changelog highlight नहीं। Finout का analysis (27 April, 2026) सभी workloads में इस pattern को confirm करता है।

"बस prompts optimize कर लो" काफी क्यों नहीं है

सबसे पहले दिमाग में यही आता है: prompts छोटे कर दो। ठीक है, extra चर्बी काट दो। लेकिन एक real agentic loop में, तुम्हारा static prefix — system prompt, tool definitions (schemas जो model को बताते हैं कि कौन से tools call कर सकता है और कौन से parameters accept करते हैं), और जमा हुई conversation history — आसानी से 10,000+ tokens पहुँच जाती है। 6 tools और एक ठीक-ठाक system prompt के साथ, लगभग ~11,000 tokens का बिल्कुल identical content हर turn पर फिर से भेजा जाता है।

10 iterations में, ये 1,10,000 tokens की शुद्ध repetition है — Opus 4.7 पर $5/MTok के हिसाब से। सिर्फ input cost $0.55। उन bytes के लिए जो server पहले ही देख चुका है। 50 turns तक scale करो तो एक conversation पर dollars जल रहे हैं।

कितना भी prompt trimming कर लो, एक fundamentally stateless protocol को ठीक नहीं कर सकते। Server को याद रखना होगा।

Fix: prompt caching

Anthropic का prompt caching API (December 17, 2024 से GA) तुम्हें message blocks पर cache_control breakpoint लगाने देता है — एक marker जो server को बोलता है "इस point तक सब store कर लो।" अगली requests में, अगर bytes exactly match करते हैं, तो server cache से पढ़ता है और normal input price का सिर्फ 10% charge करता है। Anthropic की documentation के अनुसार, 5-minute cache (default) पर initial write में 1.25× लगता है; 1-hour cache पर 2×।

Math सीधा है: एक बार 25% premium दो, उसके बाद हर hit पर 90% discount। 10-turn loop में, 1 write + 9 reads। पहले cache hit के बाद ही ये profitable हो जाता है।

Step 1: System prompt को cache करो

System prompt तुम्हारे loop का सबसे stable content है। Iterations के बीच ये कभी नहीं बदलता। इसे cacheable बनाओ:

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=4096,
    system=[{
        "type": "text",
        "text": "Your long system prompt with instructions...",
        "cache_control": {"type": "ephemeral"},  # 5-min TTL
    }],
    messages=[...],
)

एक ज़रूरी detail: Opus 4.7 (और Opus 4.5, Haiku 4.5) पर minimum cacheable block 4,096 tokens है। अगर तुम्हारा system prompt इससे छोटा है, तो Anthropic चुपचाप cache_control marker को ignore कर देता है — कोई error नहीं, कोई cache नहीं, बस तुम्हारी intent बर्बाद। Sonnet पर minimum 1,024 tokens है।

/faion एक faion-network umbrella skill है — ये 12 specialist domains से relevant methodology conversation में लाता है, तो cache strategy design करने से पहले, इससे पूछो:

/faion
I'm implementing Anthropic prompt caching for a multi-tool agentic loop.
What's the best strategy for placing cache_control breakpoints when I have
a system prompt, 6 tool schemas, and growing conversation history?

Step 2: Tool definitions को cache करो

Tool schemas — हर tool का name, parameters, और purpose describe करने वाले JSON — tools array में बैठते हैं और cache hierarchy में system prompt से पहले process होते हैं। cache_control array के last tool पर लगाओ ताकि सारे tools एक breakpoint में cache हो जाएं:

tools = [
    {"name": "search_web", "description": "...", "input_schema": {...}},
    {"name": "read_file", "description": "...", "input_schema": {...}},
    {"name": "run_code", "description": "...", "input_schema": {...}},
    {"name": "write_file", "description": "...", "input_schema": {...}},
    {"name": "list_dir", "description": "...", "input_schema": {...}},
    {"name": "submit_result", "description": "...", "input_schema": {...},
     "cache_control": {"type": "ephemeral"}},  # ऊपर सारे tools cache हो जाएंगे
]

Critical rule: cache hierarchy tools → system → messages है। Iterations के बीच कोई भी tool definition बदलने पर (tool add करना, parameter rename करना) tools, system, AND messages — सबका cache invalidate हो जाता है। Agentic loop में, tool set को freeze रखो।

Step 3: Conversation history prefix को cache करो

जैसे-जैसे conversation बढ़ती है, पुराने messages static हो जाते हैं — ये बदलने वाले नहीं हैं। "Stable" portion के last message पर breakpoint लगाओ:

messages = [
    {"role": "user", "content": "Initial task..."},
    {"role": "assistant", "content": "I'll start by..."},
    {"role": "user", "content": [{  # last stable message
        "type": "text",
        "text": "Tool result from step 3...",
        "cache_control": {"type": "ephemeral"},
    }]},
    {"role": "assistant", "content": "Now processing..."},  # नया, uncached
]

अब तुम्हारे पास 3 breakpoints हैं (tools, system, history)। API maximum 4 allow करता है। आखिरी slot edge cases के लिए बचा कर रखो।

Step 4: Relocation trick

ये trick ProjectDiscovery की engineering team से आती है, जिन्होंने 10 April, 2026 को अपने results publish किए। वो 26-step, 40-tool-call वाला agentic security scanner चलाते हैं। उनकी initial cache hit rate एक बेचारी 7% थी।

Problem: उनके system prompt में timestamps, working memory, और runtime variables थे जो हर call पर बदलते थे। Cache matching को exact byte equality चाहिए — एक character अलग और पूरा block miss।

Fix: सारा dynamic content system prompt से निकालो और conversation के end में एक user-role message में डालो। System prompt हर call पर byte-identical हो गया। उनकी cache hit rate 7% से कूदकर रातोंरात 74% हो गई, और further tuning से 84% तक पहुँची — 70% cost reduction।

/faion
How should I separate static vs. dynamic content in an agentic system
prompt to maximize cache hit rates? What patterns exist for the
static-system-prompt + dynamic-user-reminder split?

Step 5: Verify करो कि ये काम कर रहा है

Cache hits तब तक invisible हैं जब तक तुम actively check नहीं करते। ज़्यादातर SDK wrappers और logging frameworks cache metrics दिखाते ही नहीं। Explicitly check करना पड़ेगा:

usage = response.usage
print(f"Cache written: {usage.cache_creation_input_tokens}")
print(f"Cache read:    {usage.cache_read_input_tokens}")
print(f"Uncached:      {usage.input_tokens}")

hit_rate = usage.cache_read_input_tokens / (
    usage.cache_read_input_tokens +
    usage.cache_creation_input_tokens +
    usage.input_tokens
) * 100
print(f"Cache hit rate: {hit_rate:.1f}%")

दूसरी loop iteration पर, cache_read_input_tokens बड़ा होना चाहिए। अगर zero है, तो कुछ तुम्हारा cache invalidate कर रहा है — अपने "static" content में byte-level differences check करो।

Gotchas जो तुम्हारा पैसा खाएंगे

1. 5-minute की खाई। Default TTL (time-to-live — server तुम्हारा cached content कितनी देर याद रखता है) 5 minutes है। अगर कोई tool call 6 minutes लेता है (external API timeout, लंबा computation), cache expire हो जाता है। Write premium फिर से भरो। Fix: slow workflows के लिए "ttl": "1h" use करो, लेकिन ये write पर 1.25× की जगह 2× cost करता है। और longer TTLs request में shorter TTLs से पहले आने चाहिए — उल्टा करोगे तो error मिलेगा।

2. JSON key ordering। कुछ languages serialization पर JSON key order randomize कर देती हैं। अलग key order = अलग bytes = cache miss। अपना serialization order fix करो या sort_keys=True use करो।

3. 20-block lookback। System cached entries के लिए maximum 20 content blocks पीछे search करता है। लंबी conversations (>20 blocks) में, पुराने cache breakpoints search window से बाहर हो जाते हैं। हर ~18 blocks पर fresh breakpoints डालो। PwC की study (January 2026) ने पाया कि ज़्यादातर "mysterious cache misses" यहीं से आती हैं।

4. छोटा content cache करना महंगा पड़ता है। PwC ने मापा कि 500 tokens से कम prompts पर caching से overhead की वजह से latency 10-18% बदतर होती है। अपना 200-token system prompt cache मत करो। Opus पर तो वैसे भी 4,096-token minimum clear करना ज़रूरी है।

अभी क्या कर सकते हो

तुम्हारे पास तीन breakpoints हैं — tools, system, history prefix — और एक monitoring pattern। वही agentic loop जो Opus 4.7 पर 10-turn conversation पर $0.55 जला रहा था, अब ~$0.12 में आता है। ये rounding error नहीं है; ये फ़र्क है एक बार demo करके भूल जाने वाले prototype और actually ship करने लायक system के बीच। Opus 4.7 का 35% tokenizer tax? अभी भी है, लेकिन तुम उसे हर turn की बजाय सिर्फ एक बार cache write पर pay कर रहे हो। अब billing page खोलने में डर नहीं लगेगा।

Agentic Loops के लिए Prompt Caching: एक ही Tokens की बार-बार Full Price देना बंद करो

"बस prompts optimize कर लो" काफी क्यों नहीं है

Fix: prompt caching

Step 1: System prompt को cache करो

Step 2: Tool definitions को cache करो

Step 3: Conversation history prefix को cache करो

Step 4: Relocation trick

Step 5: Verify करो कि ये काम कर रहा है

Gotchas जो तुम्हारा पैसा खाएंगे

अभी क्या कर सकते हो

Keep reading

Claude Extended Thinking: उथले जवाबों और $50 के बिल के बीच सिर्फ 3-Line Config का फर्क

वो 50-Line Agentic Loop बनाओ जो हर AI Agent Platform को चलाता है

तुम्हारे Agent का Permission Dialog एक Placebo है

MCP हर जगह काम करता है — जब तक Authenticate करने की बारी न आए