Claude Extended Thinking: The 3-Line Config Between Shallow Answers and $50 Bills

You call the Claude API — Anthropic's interface for sending prompts and getting responses programmatically — and simple tasks work great. Extraction, summarization, classification: Claude nails them. Then you throw something hard at it. Review a 2,000-line pull request. Plan a database migration. Debug a race condition. The response comes back fast, confident, and subtly wrong. Like a student who skimmed the textbook and is winging the exam.

The problem isn't the model. The problem is you're not giving it scratch paper.

Why "Just Read the Docs" Doesn't Cut It

Claude has a feature called extended thinking — the ability to reason step-by-step internally before answering, like showing your work in math class except the work stays hidden. Anthropic introduced it on February 24, 2025 with Claude 3.7 Sonnet, and it's evolved dramatically since.

The docs are thorough. They're also 4,000 words of parameter tables, deprecation notices, and migration guides spread across three separate pages. Most developers do one of three things: skip thinking entirely, copy-paste a config from a blog post targeting an older model, or enable it with no cost controls and get a surprise $50 bill.

On April 16, 2026, Anthropic shipped Claude Opus 4.7 and broke all three approaches. The manual budget_tokens parameter that used to control thinking cost? Returns a 400 error now. A new tokenizer inflates token counts by up to 35%. And thinking is invisible by default — while still billing you at the full $25 per million tokens.

Here's how to set it up correctly, as of April 27, 2026.

Step 1: Enable Adaptive Thinking

On Opus 4.7, you use adaptive thinking — the model decides how much to think based on problem difficulty. No manual token budgets. You control intensity with an effort parameter (a simple label like "low", "high", or "max") instead of guessing a number.

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=16000,
    thinking={"type": "adaptive"},
    output_config={"effort": "high"},
    messages=[{
        "role": "user",
        "content": "Review this SQL migration for correctness and edge cases: ..."
    }],
)

Three lines added: thinking, output_config, and a bigger max_tokens, the ceiling on how many tokens (word-chunks the AI processes, roughly ¾ of an English word) the response can contain. Thinking tokens count against this ceiling, so set it to at least 16,000. Anthropic recommends 32,000+ for complex tasks. If thinking eats most of the ceiling, the visible answer gets whatever is left and stops mid-sentence, so leave generous headroom.
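With thinking on, the response content comes back as a list of blocks rather than one string, so you have to pick out the text yourself. A minimal sketch, assuming each block exposes `.type` and `.text` the way the SDK's content blocks do; `extract_text` is a hypothetical helper name, and the fake content stands in for `response.content` so the snippet runs without an API call.

```python
from types import SimpleNamespace


def extract_text(content_blocks):
    """Join the visible text blocks, skipping thinking and tool-use blocks."""
    parts = []
    for block in content_blocks:
        # Only blocks of type "text" carry the answer shown to the user.
        if getattr(block, "type", None) == "text":
            parts.append(block.text)
    return "".join(parts)


# Stand-in for response.content (assumed shape, not a real API response).
fake_content = [
    SimpleNamespace(type="thinking", thinking=""),
    SimpleNamespace(type="text", text="The migration drops an index too early."),
]
answer = extract_text(fake_content)
print(answer)
```

In real code you would call `extract_text(response.content)` on the object returned by `client.messages.create(...)`.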

/faion is Nero's prompt tool for generating working code from a spec — paste the block below and get a production-ready starting point.

/faion
Generate a Python function `call_with_thinking(prompt: str, effort: str = "high") -> str` that calls Claude Opus 4.7 with adaptive thinking using the anthropic Python SDK. Accept an effort parameter ("low", "medium", "high", "xhigh", "max"). Set max_tokens to 16000. Return the text response. Include error handling for API errors and a docstring explaining the effort levels.

Step 2: Pick the Right Effort Level

Not every prompt deserves a philosophy degree. Here's the cheat sheet, based on Anthropic's guidance and production data from Resolve AI's testing:

Effort	What it does	Use when
`low`	Skips thinking for trivial tasks	Classification, extraction, high-volume pipelines
`medium`	Moderate reasoning, may skip	Balanced cost/quality, most agentic workflows
`high`	Almost always thinks deeply	Code review, analysis, planning
`xhigh`	Extended exploration (Opus 4.7 only)	Multi-file coding, long agentic chains
`max`	No constraints on thinking	Frontier problems, research, unlimited budgets

The key finding from Resolve AI: Sonnet 4.6 at medium effort roughly matches Opus 4.6 quality at a fraction of the cost. Don't reach for the biggest model — reach for the right effort level on a cheaper one. The smartest optimization is often not paying for Opus at all.
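One way to apply the cheat sheet is a lookup that routes each task type to an effort label before the request is built. This is a sketch under assumptions: the task names and the `pick_effort` and `output_config_for` helpers are made up for illustration, and only the effort values come from the table above.

```python
# Task labels are hypothetical; effort values mirror the cheat sheet.
# Note that "xhigh" is listed as Opus 4.7 only.
EFFORT_BY_TASK = {
    "classification": "low",
    "extraction": "low",
    "agentic_workflow": "medium",
    "code_review": "high",
    "planning": "high",
    "multi_file_coding": "xhigh",
    "research": "max",
}


def pick_effort(task_kind, default="medium"):
    """Return the effort label for a task kind, falling back to a default."""
    return EFFORT_BY_TASK.get(task_kind, default)


def output_config_for(task_kind):
    """Build the output_config dict to pass to client.messages.create()."""
    return {"effort": pick_effort(task_kind)}


print(output_config_for("code_review"))
```

Unknown task types fall back to "medium" here, which keeps a mislabeled job from silently running at "max".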

Step 3: Stream Thinking for Real-Time UX

Without streaming, a heavy-thinking request means your user stares at a dead screen for 30+ seconds. With streaming, they watch the model's reasoning unfold live — visible progress instead of existential doubt about whether the app crashed.

with client.messages.stream(
    model="claude-opus-4-7",
    max_tokens=16000,
    thinking={"type": "adaptive", "display": "summarized"},
    messages=[{"role": "user", "content": "Analyze this codebase architecture..."}],
) as stream:
    for event in stream:
        if event.type == "content_block_delta":
            if event.delta.type == "thinking_delta":
                print(f"[thinking] {event.delta.thinking}", end="", flush=True)
            elif event.delta.type == "text_delta":
                print(event.delta.text, end="", flush=True)

Notice "display": "summarized". On Opus 4.7, the default is "omitted" — the model thinks, you pay for the tokens, but the thinking text comes back empty. You must set display explicitly if you want to see what the model reasoned about. Debugging invisible reasoning is exactly as fun as it sounds.
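If you want to log the reasoning separately from the answer instead of printing both inline, you can collect the two delta types into separate buffers. A rough sketch, assuming events shaped like the ones in the loop above; `split_stream` and `_delta` are hypothetical helpers, and the fake events exist only so the snippet runs offline.

```python
from types import SimpleNamespace


def split_stream(events):
    """Collect thinking text and answer text separately from stream events."""
    thinking, answer = [], []
    for event in events:
        if getattr(event, "type", None) != "content_block_delta":
            continue  # skip start/stop and other bookkeeping events
        if event.delta.type == "thinking_delta":
            thinking.append(event.delta.thinking)
        elif event.delta.type == "text_delta":
            answer.append(event.delta.text)
    return "".join(thinking), "".join(answer)


def _delta(kind, **fields):
    # Builds a stand-in event with the same attributes the loop reads.
    return SimpleNamespace(
        type="content_block_delta",
        delta=SimpleNamespace(type=kind, **fields),
    )


fake_events = [
    SimpleNamespace(type="message_start"),
    _delta("thinking_delta", thinking="Check the module boundaries. "),
    _delta("text_delta", text="The service layer "),
    _delta("text_delta", text="leaks database types."),
]
thinking_text, answer_text = split_stream(fake_events)
```

With a real stream you would pass the `stream` object from the `with` block in place of `fake_events`.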

/faion
Generate a Python function `stream_thinking_response(prompt: str, effort: str = "high")` that calls Claude Opus 4.7 with adaptive thinking and streaming enabled via the anthropic Python SDK. Set display to "summarized". Print thinking deltas prefixed with "[thinking]" and text deltas without prefix. Set max_tokens to 32000. Handle stream cleanup properly with a context manager.

Step 4: Handle Tool Use Without Breaking the Chain

If your integration uses tools — functions the model can call, like querying a database or hitting an external API — you need to preserve thinking blocks when continuing the conversation. Drop them and Claude loses its reasoning context mid-flow.

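# thinking_block and tool_use_block are taken from the previous response's content;
# tool_result is the block your code builds after running the tool.
response = client.messages.create(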
    model="claude-opus-4-7",
    max_tokens=16000,
    thinking={"type": "adaptive"},
    tools=[your_tool],
    messages=[
        {"role": "user", "content": "What's the production error rate?"},
        {"role": "assistant", "content": [
            thinking_block,   # KEEP THIS — stripping it kills context
            tool_use_block
        ]},
        {"role": "user", "content": [tool_result]},
    ],
)

One constraint that will bite agentic architectures: with thinking enabled, tool_choice — which controls whether Claude must use a specific tool — only supports "auto" or "none". Forcing a specific tool returns an error.
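Both rules are easy to enforce in a thin wrapper before the request goes out. The sketch below is one possible shape, not the SDK's API: `check_tool_choice` and `append_tool_turn` are hypothetical helpers, the `get_error_rate` tool and block contents are placeholders, and the dict shapes follow the Messages API block format as I understand it.

```python
def check_tool_choice(tool_choice=None):
    """Return the tool_choice type, or raise if thinking would reject it."""
    kind = (tool_choice or {"type": "auto"}).get("type")
    if kind not in ("auto", "none"):
        raise ValueError(f"tool_choice type {kind!r} is not allowed with thinking enabled")
    return kind


def append_tool_turn(messages, assistant_content, tool_results):
    """Return a new history with the assistant turn and tool results appended.

    assistant_content is copied as-is, so thinking blocks stay in place.
    """
    return messages + [
        {"role": "assistant", "content": list(assistant_content)},
        {"role": "user", "content": list(tool_results)},
    ]


# Placeholder blocks; in real code these come from the previous response.
previous_content = [
    {"type": "thinking", "thinking": "Need the error-rate tool.", "signature": "..."},
    {"type": "tool_use", "id": "toolu_01", "name": "get_error_rate", "input": {}},
]
tool_results = [{"type": "tool_result", "tool_use_id": "toolu_01", "content": "0.4%"}]

history = append_tool_turn(
    [{"role": "user", "content": "What's the production error rate?"}],
    previous_content,
    tool_results,
)
```

Passing the assistant content through untouched is the point: filtering it down to just the tool_use block is exactly the mistake that drops the reasoning context.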

Step 5: Monitor What You're Actually Paying

Every response includes a usage object. After enabling thinking, this becomes the most important line in your codebase:

cost_per_mtok = 25  # Opus 4.7 output price
output_cost = (response.usage.output_tokens / 1_000_000) * cost_per_mtok
print(f"Output tokens: {response.usage.output_tokens} (${output_cost:.4f})")

The output token count includes ALL thinking tokens — even the invisible ones under display: "omitted". If your visible answer is 500 tokens but output_tokens says 8,000, you just paid for 7,500 tokens of reasoning nobody saw. According to Anthropic's pricing page, thinking tokens bill at the full output rate: $25/MTok for Opus, $15/MTok for Sonnet.
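The 500-versus-8,000 example works out as follows. This is a back-of-the-envelope sketch: `split_output_cost` is a hypothetical helper, and it assumes you estimate the visible answer's token count yourself, since the figure discussed above is the combined total.

```python
def split_output_cost(output_tokens, visible_tokens, price_per_mtok=25.0):
    """Split the output bill into visible-answer cost and thinking cost (USD)."""
    thinking_tokens = max(output_tokens - visible_tokens, 0)
    visible_cost = visible_tokens / 1_000_000 * price_per_mtok
    thinking_cost = thinking_tokens / 1_000_000 * price_per_mtok
    return visible_cost, thinking_cost


# 8,000 billed output tokens, of which about 500 are the visible answer.
visible, hidden = split_output_cost(output_tokens=8_000, visible_tokens=500)
print(f"visible ${visible:.4f}, thinking ${hidden:.4f}")
```

At $25/MTok that is about 1.25 cents for the answer and 18.75 cents for the reasoning behind it, so the hidden part is 15 times the visible part.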

/faion
Generate a Python helper function `log_thinking_cost(response, model: str = "claude-opus-4-7")` that takes an Anthropic Messages API response object and prints a cost breakdown: input tokens, output tokens (including thinking), cache read tokens, and total estimated cost in USD. Use a dict of per-model pricing (Opus 4.7: $15 input / $25 output, Sonnet 4.6: $3 input / $15 output). Warn if output_tokens exceeds 5000 with a note about possible high thinking usage.

Gotchas That Will Bite You

The Tokenizer Tax. Opus 4.7's new tokenizer generates up to 35% more tokens for the same text compared to Opus 4.6. Your costs go up with zero code changes when you upgrade models. A thinking process that cost $0.25 on Opus 4.6 could run $0.34 on 4.7 — same reasoning, bigger bill. Monitor usage before and after switching models.
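The $0.25 to $0.34 figure is just the worst-case 35% inflation applied to the old bill. A tiny sketch of that arithmetic, plus a way to measure the inflation you actually see; both function names are hypothetical.

```python
def worst_case_cost(baseline_usd, inflation=0.35):
    """Scale a baseline cost by the worst-case token inflation."""
    return baseline_usd * (1 + inflation)


def measured_inflation(tokens_before, tokens_after):
    """Fractional token increase between two runs of the same prompt."""
    return (tokens_after - tokens_before) / tokens_before


print(f"${worst_case_cost(0.25):.2f}")
```

Running the same prompt set on both models and feeding the two output_tokens totals into `measured_inflation` tells you whether you are near the 35% ceiling or well under it.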

Thinking Tokens Can't Be Prompt-Cached. Prompt caching — Anthropic's feature that discounts repeated input tokens — doesn't apply to thinking content. In agentic loops with many tool calls, Claude re-reads thinking blocks from prior turns as input. This creates compounding costs you won't see unless you track cache_read_input_tokens separately.

The Legacy Trap. Still on Claude Sonnet 4.6 or Opus 4.6? budget_tokens still works. Upgrade to Opus 4.7, and it returns a 400 error with no runtime deprecation warning — just a hard fail. Test before you deploy.
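If one codebase has to serve both generations of models, you can pick the thinking config per model instead of hardcoding it. A hedged sketch: `thinking_config_for` is a hypothetical helper, the `claude-sonnet-4-6` ID in the usage line is assumed for illustration, and the `{"type": "enabled", "budget_tokens": ...}` form is the older manual-budget config.

```python
# Models that, per the behavior described above, reject budget_tokens.
ADAPTIVE_ONLY_MODELS = {"claude-opus-4-7"}


def thinking_config_for(model, budget_tokens=8000):
    """Pick a thinking config the target model is expected to accept."""
    if model in ADAPTIVE_ONLY_MODELS:
        # Sending budget_tokens here would trigger the 400 error.
        return {"type": "adaptive"}
    # Older models still take a manual thinking budget.
    return {"type": "enabled", "budget_tokens": budget_tokens}


print(thinking_config_for("claude-opus-4-7"))
print(thinking_config_for("claude-sonnet-4-6", budget_tokens=4000))
```

Keeping the model list in one place means an upgrade is a one-line change you can cover with a test, rather than a 400 you discover in production.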

The max_tokens Ceiling. Thinking tokens and response tokens share the same max_tokens cap. Set max_tokens: 4000, and if the model thinks for 3,800 tokens, you get a 200-token answer. Always set it higher than you think you need.
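You can catch a squeezed answer at runtime by checking why the response stopped. A minimal sketch, assuming the response object carries a `stop_reason` attribute that reads "max_tokens" when the ceiling was hit; the helper names are hypothetical and the fake response only stands in for a real one.

```python
from types import SimpleNamespace


def answer_budget(max_tokens, thinking_tokens):
    """Tokens left for the visible answer once thinking has taken its share."""
    return max(max_tokens - thinking_tokens, 0)


def was_truncated(response):
    """True if the response stopped because it hit the max_tokens ceiling."""
    return getattr(response, "stop_reason", None) == "max_tokens"


# The example above: a 4,000-token ceiling with 3,800 tokens of thinking.
remaining = answer_budget(4000, 3800)

fake_response = SimpleNamespace(stop_reason="max_tokens")
if was_truncated(fake_response):
    print(f"Answer cut off with only {remaining} tokens of room; raise max_tokens.")
```

A truncation check like this is a reasonable trigger for retrying once with a higher ceiling.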

What You Can Do Now

You have two gears. For the routine 80% — extraction, formatting, simple Q&A — use effort: "low" or skip thinking entirely. For the hard 20% — code review, architecture planning, complex analysis — use "high" or "xhigh" with streaming and cost monitoring. Run 100 calls, check the usage numbers, then adjust.

A year ago, every Claude API call ran at the same depth — fast, shallow, one gear. Now you control the dial. The model that reasons for 30 seconds on a hard bug is the same model that skips thinking entirely on a classification task. Same endpoint, same code, three lines of config apart. Turn it.

