Self-Hosted vs Cloud AI: When Does Local Make Sense?

"Just run it locally" is the tech equivalent of "just grow your own food." Sometimes it saves you a fortune. More often it costs more, takes more work, and delivers worse results. But you keep hearing it — on Twitter, on Reddit, from that one friend who built a home server. So let's skip the hot takes and look at actual numbers. 🔍

The real question isn't should I self-host. It's at what volume does self-hosting get cheaper — and do the tradeoffs actually matter for what you're building?

What we're comparing

Cloud AI means you pay per use. Every time your app sends text to Claude, GPT, or Gemini, you pay for the tokens — word-chunks the AI processes, roughly ¾ of an English word each. Think of it like a taxi meter: short rides are cheap, long ones add up.

Self-hosted AI means you run an open-source LLM (large language model — the brain behind tools like ChatGPT) on your own hardware. You pay for the machine and electricity, but every request after that is free. Think of it like buying a car: expensive upfront, but no per-ride fee.

Here are the current cloud prices as of March 2026, per million tokens:

Provider	Model	Input / Output cost
Anthropic	Haiku 4.5	$1 / $5
Anthropic	Sonnet 4.6	$3 / $15
Anthropic	Opus 4.6	$5 / $25
OpenAI	GPT-4o mini	$0.15 / $0.60
OpenAI	GPT-4o	$2.50 / $10
Google	Gemini Flash	Free tier (15 req/min)
Google	Gemini Pro	$1.25 / $5

And the self-hosted contenders: Ollama running open-source models like Llama 3.1, Mistral, or DeepSeek on your own machine or a rented GPU server.

The fundamental tradeoff: cloud charges per use, self-hosted charges per time. At low usage, cloud wins because you only pay for what you consume. At high usage, self-hosted wins because the hardware cost is fixed. We need to find the crossover point. 💰

The cost math nobody shows you

Cloud costs at scale

Using Claude Haiku 4.5 as the baseline (cheapest quality cloud model), assuming a typical 30% input / 70% output token split:

Daily tokens	Monthly cost	Annual cost
10K	$0.90	$10.80
100K	$9	$108
500K	$45	$540
1M	$90	$1,080
5M	$450	$5,400
10M	$900	$10,800

Self-hosted costs

Option A — hardware you already own:

If you've got a machine with a GPU (graphics card that accelerates AI math), the only extra cost is electricity:

Hardware	Models it can run	Monthly electricity
16 GB RAM, no GPU	7B models (slow)	~$10
RTX 3090 24GB	13B models (fast)	~$20
RTX 4090 24GB	13B-30B models (fast)	~$25
M2/M3 Mac 32GB+	7B-13B (good speed)	~$5

"7B" and "13B" refer to billion parameters — the model's size. Bigger models are smarter but need more memory.

Option B — renting a GPU server:

Provider	GPU	Monthly cost
Hetzner (CPU only)	None	~$50
Vast.ai	RTX 3090	~$150
Vast.ai	RTX 4090	~$250
Lambda	A10G	~$350
RunPod	A100 40GB	~$800

Option C — buying a home server:

Build	Upfront cost	Monthly (over 3 years)
Used RTX 3090 + basic PC	~$1,200	~$33 + electricity
RTX 4090 + decent PC	~$2,500	~$70 + electricity
2× RTX 4090	~$4,500	~$125 + electricity
Mac Studio M3 Ultra 192GB	~$6,000	~$167 + electricity

Where the lines cross

Cloud Haiku vs. local 7B on existing hardware:

Self-hosted cost is ~$15/month in electricity. Cloud Haiku crosses that at roughly 5 million tokens per month. Below that — and most solo founders sit well below — cloud is cheaper.

Cloud Haiku vs. rented GPU (RTX 3090 at $150/month):

You need to push 50 million tokens per month before the rented server breaks even. That's 1.7 million tokens daily — a serious production workload.

For most indie builders and small teams, cloud API costs less than self-hosting on dedicated hardware. Full stop.

The quality gap

Cost is only half the story. Here's how the models actually perform:

Capability	Cloud (Claude/GPT)	Self-hosted (7B-13B)
Reasoning quality	Excellent	Moderate
Code generation	Excellent	Good for simple tasks
Context window	200K-1M tokens	4K-32K typically
Speed	50-100+ tok/sec	20-40 (GPU), 5-10 (CPU)
Tool use	Native, reliable	Possible, less reliable

Context window — how much text the AI can "see" at once, like its working memory — is the biggest gap. Cloud models handle entire codebases. Local models see a few pages at a time.

Llama 3.1 70B is genuinely impressive and competitive on general tasks. But it needs serious GPU hardware, and there's still no local equivalent to Opus or top-tier GPT for complex reasoning. The gap has narrowed. It hasn't closed.

When self-hosting actually makes sense

1. Privacy and data sovereignty

If your data cannot leave your network — healthcare records, legal documents, financial data, government systems — self-hosting isn't optional. No API terms of service replace "the data never left our building."

# Ollama makes this a 2-minute setup
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.1:8b

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Summarize this patient record..."
}'

No network request. No third-party logging. Full compliance.

2. Offline environments

Edge devices, air-gapped networks, remote sites with no internet. No connection means no API — local is the only option.

3. High-volume simple tasks

Embeddings — numerical fingerprints of text used for search — classification, and short-text summaries. Tasks where a small model is good enough and volume is massive: ⚡

import ollama

def classify_document(text: str) -> str:
    response = ollama.chat(model='llama3.1:8b', messages=[
        {'role': 'user', 'content': f'Classify: invoice, contract, receipt, letter, other.\n\n{text[:500]}'}
    ])
    return response['message']['content']

# 100K documents/day:
# Cloud cost: ~$30/day
# Self-hosted: ~$0.50/day electricity
# Monthly savings: ~$900

4. Latency-sensitive apps

API calls add 100-500ms of network delay. Local inference — the process of the model generating a response — starts instantly:

Cloud:  150-500ms network + 500-2000ms inference = 650-2500ms
Local:  0ms network + 200-1000ms inference = 200-1000ms

For autocomplete, live translation, or interactive tools, that difference is noticeable.

5. Development and experimentation

Testing 50 prompt variations locally costs $0. The same experiment on Claude API runs $5-20. Not huge, but it compounds during intensive R&D.

The practical setup (10 minutes)

If you've decided self-hosting fits your use case:

Install Ollama

curl -fsSL https://ollama.com/install.sh | sh
ollama serve

ollama pull llama3.1:8b          # 4.7 GB, general purpose
ollama pull codellama:13b         # 7.4 GB, code tasks
ollama pull nomic-embed-text      # 274 MB, for embeddings

Use it as a drop-in replacement

Ollama speaks the same language as OpenAI's API. Most code works without changes — just swap the URL:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Explain MCP in 3 sentences"}]
)
print(response.choices[0].message.content)

Develop against local models, deploy with cloud — or the reverse. Same code, different URL.

Performance benchmarks

Hardware	Tokens/sec	500-token response
M2 MacBook Pro 16GB	~35	~14 seconds
RTX 3060 12GB	~40	~12 seconds
RTX 4090 24GB	~80	~6 seconds
CPU only (16 cores)	~8	~60 seconds

CPU-only inference is painful for anything interactive. No GPU or Apple Silicon? Stick with cloud.

The hybrid play (this is the move) 🚀

The smartest setup isn't pure cloud or pure self-hosted. It's routing each task to the right place:

def get_ai_client(task_type: str):
    if task_type in ["embedding", "classification", "simple_summary"]:
        # Local — fast, free, quality is fine
        return OpenAI(base_url="http://localhost:11434/v1", api_key="x")
    elif task_type in ["code_generation", "complex_analysis", "tool_use"]:
        # Cloud — better quality, worth the cost
        return anthropic.Anthropic()
    else:
        return OpenAI(base_url="http://localhost:11434/v1", api_key="x")

Runs locally: embeddings, classification, draft generation, dev/testing. Runs in the cloud: complex reasoning, code generation, tool use, anything customer-facing.

Real cost example for a hybrid setup:

Task	Volume	Where	Monthly cost
Embeddings	50K/day	Local	$0
Classification	10K/day	Local	$0
Code review	30/day	Cloud (Haiku)	$2
Content generation	50/day	Cloud (Sonnet)	$15
Complex analysis	10/day	Cloud (Sonnet)	$5
Total			$22/mo

Pure cloud for the same workload: ~$180/month. The hybrid saves 88%.

Decision cheat sheet

Processing over 5M tokens daily? → Self-host volume tasks, cloud for quality tasks.

Strict data privacy requirements? → Self-host, non-negotiable.

Already own GPU hardware? → Hybrid: local for simple, cloud for complex.

None of the above? → Cloud only. It's cheapest and gives you the best models.

For most solo founders as of March 2026: start with cloud. Claude Haiku at $1/$5 per million tokens is so cheap that self-hosting to save money is like growing your own wheat to save on bread. The hardware costs more than years of API usage at typical founder volumes. 💰

The exception: you have privacy requirements or already own a GPU. Then install Ollama, run Llama 3.1 for bulk tasks, and call Claude for the hard problems. That hybrid cuts costs 80%+ while keeping quality where it counts. Everything else is over-engineering. 🦝