Claude Extended Thinking: A 3-Line Config Is All That Separates Shallow Answers from a $50 Bill

You call the Claude API (Anthropic's interface for sending prompts and getting responses programmatically), and on simple tasks it works great. Extraction, summarization, classification: Claude nails them. Then you throw something heavy at it. Review a 2,000-line pull request. Plan a database migration. Debug a race condition. The response comes back fast, confident, and subtly wrong. Like a student who skimmed the textbook and is winging the exam.

The problem isn't the model. The problem is that you never gave it scratch paper.

Why "Just Read the Docs" Doesn't Work

Claude has a feature called extended thinking: the ability to reason step by step internally before answering, like showing your work in math, except the work stays hidden. Anthropic launched it on 24 February 2025 alongside Claude 3.7 Sonnet, and it has evolved a lot since then.

The docs are detailed. They are also 4,000 words of parameter tables, deprecation notices, and migration guides scattered across three separate pages. Most developers do one of three things: skip thinking entirely, copy-paste a config from a blog post written for an older model, or enable it with no cost controls and get a surprise $50 bill.

On 16 April 2026, Anthropic shipped Claude Opus 4.7 and broke all three approaches. The manual budget_tokens parameter that used to control thinking cost? It now returns a 400 error. The new tokenizer inflates token counts by up to 35%. And thinking is invisible by default, yet it is billed at the full rate of $25 per million tokens.

As of 27 April 2026, here's how to set it up properly:

Step 1: Enable Adaptive Thinking

Opus 4.7 uses adaptive thinking: the model decides for itself how much to think based on how hard the problem is. No manual token budgets. You control the intensity with the effort parameter, a simple label such as "low", "high", or "max", so there's no number to guess.

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=16000,
    thinking={"type": "adaptive"},
    output_config={"effort": "high"},
    messages=[{
        "role": "user",
        "content": "Review this SQL migration for correctness and edge cases: ..."
    }],
)

Three lines were added: thinking, output_config, and a bigger max_tokens, the ceiling on how many tokens (the word-chunks the model processes, each roughly ¾ of an English word) the response can contain. Thinking tokens count against this same ceiling, so keep it at 16,000 or more. Anthropic recommends 32,000+ for complex tasks. If thinking alone uses up your max, the response is cut off at the ceiling and you can end up with little or no visible answer.

/faion is Nero's prompt tool that generates working code from a spec. Paste the block below and get a production-ready starting point.

/faion
Generate a Python function `call_with_thinking(prompt: str, effort: str = "high") -> str` that calls Claude Opus 4.7 with adaptive thinking using the anthropic Python SDK. Accept an effort parameter ("low", "medium", "high", "xhigh", "max"). Set max_tokens to 16000. Return the text response. Include error handling for API errors and a docstring explaining the effort levels.

Step 2: Pick the Right Effort Level

Not every prompt needs a PhD thesis. Here's a cheat sheet, based on Anthropic's guidance and production data from Resolve AI's testing:

Effort	What it does	When to use it
`low`	Skips thinking on trivial tasks	Classification, extraction, high-volume pipelines
`medium`	Moderate reasoning; may skip thinking	Balanced cost/quality, most agentic workflows
`high`	Thinks deeply almost every time	Code review, analysis, planning
`xhigh`	Extended exploration (Opus 4.7 only)	Multi-file coding, long agentic chains
`max`	No constraint on thinking	Frontier problems, research, unlimited budgets

Resolve AI's key finding: Sonnet 4.6 at medium effort roughly matches Opus 4.6 quality at a fraction of the cost. Don't rush to the biggest model; put the right effort level on a cheaper one. Often the smartest optimization is not paying for Opus at all.
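The cheat sheet can be turned into a small lookup so the choice lives in one place. This is a minimal sketch: the task categories and the `pick_effort` name are made up for illustration and are not part of the Anthropic SDK; the mapping just mirrors the table.

```python
# Hypothetical task categories mapped to the effort labels from the cheat sheet.
EFFORT_BY_TASK = {
    "classification": "low",
    "extraction": "low",
    "agentic_workflow": "medium",
    "code_review": "high",
    "planning": "high",
    "multi_file_coding": "xhigh",
    "research": "max",
}


def pick_effort(task_kind: str, default: str = "medium") -> str:
    """Return an effort label for a task category, falling back to a default."""
    return EFFORT_BY_TASK.get(task_kind.strip().lower(), default)


print(pick_effort("code_review"))
```

Unknown categories fall back to "medium" here, which is a judgment call; a stricter pipeline might prefer to raise instead.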

Step 3: Stream Thinking for Real-Time UX

Without streaming, a heavy-thinking request means your user stares at a blank screen for 30+ seconds. With streaming, they watch the model's reasoning live: visible progress instead of an existential crisis over whether the app has crashed.

with client.messages.stream(
    model="claude-opus-4-7",
    max_tokens=16000,
    thinking={"type": "adaptive", "display": "summarized"},
    messages=[{"role": "user", "content": "Analyze this codebase architecture..."}],
) as stream:
    for event in stream:
        if event.type == "content_block_delta":
            if event.delta.type == "thinking_delta":
                print(f"[thinking] {event.delta.thinking}", end="", flush=True)
            elif event.delta.type == "text_delta":
                print(event.delta.text, end="", flush=True)

Notice "display": "summarized". On Opus 4.7 the default is "omitted": the model thinks, you pay for the tokens, but the thinking text comes back empty. If you want to see what the model reasoned, you have to set display explicitly. Debugging invisible reasoning is exactly as fun as it sounds.
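For non-streaming calls, a quick way to spot the "omitted" case is to separate thinking text from answer text after the fact. The sketch below assumes content blocks expose a `type` attribute, with `thinking` on thinking blocks and `text` on text blocks (mirroring the delta names in the streaming loop). `split_blocks` and the stand-in blocks are illustrative, not SDK objects.

```python
from types import SimpleNamespace


def split_blocks(content):
    """Separate thinking text from answer text in a list of content blocks."""
    thinking, answer = [], []
    for block in content:
        kind = getattr(block, "type", None)
        if kind == "thinking":
            thinking.append(getattr(block, "thinking", "") or "")
        elif kind == "text":
            answer.append(getattr(block, "text", "") or "")
    return "".join(thinking), "".join(answer)


# Stand-in blocks for illustration; a real call would pass response.content.
demo = [
    SimpleNamespace(type="thinking", thinking=""),  # empty, as with "omitted"
    SimpleNamespace(type="text", text="Looks safe to run."),
]
thinking_text, answer_text = split_blocks(demo)
if not thinking_text:
    print("No thinking text returned; check the display setting.")
```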

/faion
Generate a Python function `stream_thinking_response(prompt: str, effort: str = "high")` that calls Claude Opus 4.7 with adaptive thinking and streaming enabled via the anthropic Python SDK. Set display to "summarized". Print thinking deltas prefixed with "[thinking]" and text deltas without prefix. Set max_tokens to 32000. Handle stream cleanup properly with a context manager.

Step 4: Don't Break the Chain in Tool Use

If your integration uses tools (functions the model can call, such as querying a database or hitting an external API), you have to preserve thinking blocks when you continue the conversation. Drop them and Claude's reasoning context vanishes mid-flow.

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=16000,
    thinking={"type": "adaptive"},
    tools=[your_tool],
    messages=[
        {"role": "user", "content": "What's the production error rate?"},
        {"role": "assistant", "content": [
            thinking_block,   # keep this: dropping it loses the reasoning context
            tool_use_block
        ]},
        {"role": "user", "content": [tool_result]},
    ],
)
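One way to avoid dropping blocks by accident is to never pick them apart: pass the previous assistant content back whole. A minimal sketch, with `continue_after_tool` as a hypothetical helper name and plain dicts standing in for real content blocks:

```python
def continue_after_tool(messages, assistant_content, tool_results):
    """Return a new message list with the assistant turn kept whole.

    assistant_content: every block from the previous response, in order,
    thinking blocks included. tool_results: the tool_result blocks to send back.
    """
    return messages + [
        {"role": "assistant", "content": list(assistant_content)},
        {"role": "user", "content": list(tool_results)},
    ]


# Placeholder blocks; in a real loop these come from response.content.
history = [{"role": "user", "content": "What's the production error rate?"}]
previous = [{"type": "thinking"}, {"type": "tool_use"}]
next_messages = continue_after_tool(history, previous, [{"type": "tool_result"}])
print([m["role"] for m in next_messages])
```

The helper returns a new list rather than mutating `history`, so a failed request can be retried from the same starting point.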

One constraint that will bite agentic architectures: with thinking enabled, tool_choice (which controls whether Claude must use a specific tool) supports only "auto" or "none". Force a specific tool and you get an error.
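You can catch that combination locally before the request ever leaves your process. The check below is a sketch that operates on a plain dict of request arguments; `check_tool_choice` is a hypothetical name, and the shape of `tool_choice` (a dict with a `type` key) is assumed.

```python
def check_tool_choice(request: dict) -> None:
    """Raise ValueError if a request combines thinking with a forced tool."""
    thinking = request.get("thinking")
    thinking_on = isinstance(thinking, dict) and thinking.get("type") != "disabled"
    choice = request.get("tool_choice")
    if not thinking_on or choice is None:
        return
    kind = choice.get("type") if isinstance(choice, dict) else choice
    if kind not in ("auto", "none"):
        raise ValueError(
            f"tool_choice {kind!r} is not allowed while thinking is enabled; "
            "use 'auto' or 'none'"
        )


# Passes silently: thinking plus "auto" is the supported combination.
check_tool_choice({"thinking": {"type": "adaptive"}, "tool_choice": {"type": "auto"}})
```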

Step 5: Monitor What You're Actually Spending

Every response comes with a usage object. Once you enable thinking, this becomes the most important line in your codebase:

cost_per_mtok = 25  # Opus 4.7 output price
output_cost = (response.usage.output_tokens / 1_000_000) * cost_per_mtok
print(f"Output tokens: {response.usage.output_tokens} (${output_cost:.4f})")

The output token count includes ALL thinking tokens, even the invisible ones under display: "omitted". If your visible answer is 500 tokens but output_tokens reports 8,000, you just paid for 7,500 tokens of reasoning that nobody saw. According to Anthropic's pricing page, thinking tokens are billed at the full output rate: $25/MTok for Opus, $15/MTok for Sonnet.
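The 500-versus-8,000 example works out like this. A small sketch: `hidden_reasoning_cost` is a hypothetical helper, and `visible_tokens` is an estimate you supply yourself (for instance by counting the answer text), not a field this article claims the API reports.

```python
def hidden_reasoning_cost(output_tokens: int, visible_tokens: int,
                          price_per_mtok: float = 25.0):
    """Estimate tokens and dollars spent on reasoning that was not shown."""
    hidden = max(output_tokens - visible_tokens, 0)
    return hidden, hidden / 1_000_000 * price_per_mtok


hidden, cost = hidden_reasoning_cost(output_tokens=8_000, visible_tokens=500)
print(f"{hidden} hidden tokens, about ${cost:.4f}")
```

At the Opus output rate, those 7,500 unseen tokens come to a little under 19 cents for one call; the number only looks small until you multiply it by request volume.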

/faion
Generate a Python helper function `log_thinking_cost(response, model: str = "claude-opus-4-7")` that takes an Anthropic Messages API response object and prints a cost breakdown: input tokens, output tokens (including thinking), cache read tokens, and total estimated cost in USD. Use a dict of per-model pricing (Opus 4.7: $15 input / $25 output, Sonnet 4.6: $3 input / $15 output). Warn if output_tokens exceeds 5000 with a note about possible high thinking usage.

Gotchas That Will Bite

Tokenizer tax. Opus 4.7's new tokenizer produces up to 35% more tokens than Opus 4.6 for the same text. Upgrade the model and your costs go up with zero code changes. A thinking process that cost $0.25 on Opus 4.6 can cost $0.34 on 4.7: same reasoning, bigger bill. Monitor usage before and after you switch models.
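The arithmetic behind that jump is a single multiplication. The sketch below treats 35% as the worst case for your text; `projected_cost` is an illustrative name, and the real inflation for your prompts is something to measure rather than assume.

```python
def projected_cost(old_cost_usd: float, inflation: float = 0.35) -> float:
    """Scale a known cost by a token-count inflation factor (0.35 means +35%)."""
    return old_cost_usd * (1 + inflation)


print(f"${projected_cost(0.25):.2f}")
```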

Thinking tokens don't get prompt-cached. Prompt caching, Anthropic's feature that discounts repeated input tokens, doesn't apply to thinking content. In agentic loops with lots of tool calls, Claude re-reads the thinking blocks from earlier turns as input. That creates compounding costs you won't see unless you track cache_read_input_tokens separately.
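To get a feel for how that compounding grows, here is a back-of-the-envelope model. It assumes every turn produces the same amount of thinking and that every earlier thinking block is re-sent in full on each later request; both are simplifications, and `reread_thinking_tokens` is a hypothetical helper.

```python
def reread_thinking_tokens(thinking_per_turn: int, turns: int) -> int:
    """Total thinking tokens re-sent as input across a loop.

    Turn 1 re-reads nothing, turn 2 re-reads one block, turn 3 two, and so on.
    """
    return sum(thinking_per_turn * earlier for earlier in range(turns))


total = reread_thinking_tokens(thinking_per_turn=2_000, turns=10)
print(total)
```

The growth is quadratic in the number of turns, which is why a loop that looks cheap at 3 tool calls can look very different at 30.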

Legacy trap. Still on Claude Sonnet 4.6 or Opus 4.6? budget_tokens still works there. Upgrade to Opus 4.7 and it returns a 400 error with no runtime deprecation warning, just a hard failure. Test before you deploy.
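If you run more than one model, it helps to build the thinking arguments in one function so the upgrade is a one-line change. A sketch under assumptions: the model-name prefix check and the "claude-sonnet-4-6" identifier are placeholders, and the older shape uses the manual `{"type": "enabled", "budget_tokens": ...}` form.

```python
def thinking_kwargs(model: str, effort: str = "high",
                    budget_tokens: int = 8000) -> dict:
    """Build the thinking-related request arguments for a given model name."""
    if model.startswith("claude-opus-4-7"):
        # Adaptive thinking: no budget_tokens, intensity comes from effort.
        return {
            "thinking": {"type": "adaptive"},
            "output_config": {"effort": effort},
        }
    # Older models: manual budget, which the 4.6 line still accepts.
    return {"thinking": {"type": "enabled", "budget_tokens": budget_tokens}}


print(thinking_kwargs("claude-opus-4-7"))
print(thinking_kwargs("claude-sonnet-4-6"))  # placeholder model name
```

The result can be splatted into the call, for example `client.messages.create(model=model, max_tokens=16000, messages=msgs, **thinking_kwargs(model))`.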

The max_tokens ceiling. Thinking tokens and response tokens share the same max_tokens cap. Set max_tokens: 4000, and if the model spends 3,800 tokens thinking, you get a 200-token answer. Always set it higher than you think you need.
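The shared cap is simple subtraction, and a truncated reply is worth flagging in logs. A minimal sketch: both helper names are hypothetical, and `hit_ceiling` assumes the response reports a stop reason of "max_tokens" when it runs out of room.

```python
def answer_budget(max_tokens: int, thinking_tokens: int) -> int:
    """Tokens left for the visible answer after thinking takes its share."""
    return max(max_tokens - thinking_tokens, 0)


def hit_ceiling(stop_reason) -> bool:
    """True when the response ended because it ran out of max_tokens."""
    return stop_reason == "max_tokens"


print(answer_budget(4_000, 3_800))
```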

What to Do Now

You have two gears. For the routine 80% (extraction, formatting, simple Q&A), use effort: "low" or skip thinking entirely. For the hard 20% (code review, architecture planning, complex analysis), use "high" or "xhigh" with streaming and cost monitoring. Run 100 calls, check the usage numbers, then adjust.
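For the "run 100 calls, check the numbers" step, a small summary over the recorded output_tokens values is usually enough to see whether a few requests dominate the bill. A sketch using only the standard library; `summarize_output_tokens` is a hypothetical helper, the sample data is invented, and the default price is the Opus output rate quoted earlier.

```python
from statistics import mean, quantiles


def summarize_output_tokens(samples, price_per_mtok: float = 25.0) -> dict:
    """Summarize output token counts collected from a batch of calls."""
    ordered = sorted(samples)  # raises on an empty batch, which is intentional
    p95 = quantiles(ordered, n=20)[-1] if len(ordered) > 1 else ordered[0]
    return {
        "mean": mean(ordered),
        "p95": p95,
        "max": ordered[-1],
        "cost_usd": sum(ordered) / 1_000_000 * price_per_mtok,
    }


# Invented sample: 99 cheap calls and one that thought for a very long time.
stats = summarize_output_tokens([1_000] * 99 + [101_000])
print(stats)
```

In this made-up batch a single call accounts for about half the spend, which is the pattern to look for before deciding where "high" effort is actually worth it.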

A year ago, every Claude API call ran at the same depth: fast, shallow, one gear. Now the dial is in your hands. The same model that reasons for 30 seconds on a hard bug skips thinking entirely on a classification task. Same endpoint, same code, three lines of config apart. Turn it.
