"Just run it locally" is the tech equivalent of "just grow your own food." Sometimes it saves you a fortune. More often it costs more, takes more work, and delivers worse results. But you keep hearing it — on Twitter, on Reddit, from that one friend who built a home server. So let's skip the hot takes and look at actual numbers. 🔍
The real question isn't should I self-host. It's at what volume does self-hosting get cheaper — and do the tradeoffs actually matter for what you're building?
What we're comparing
Cloud AI means you pay per use. Every time your app sends text to Claude, GPT, or Gemini, you pay for the tokens — word-chunks the AI processes, roughly ¾ of an English word each. Think of it like a taxi meter: short rides are cheap, long ones add up.
Self-hosted AI means you run an open-source LLM (large language model — the brain behind tools like ChatGPT) on your own hardware. You pay for the machine and electricity, but every request after that is free. Think of it like buying a car: expensive upfront, but no per-ride fee.
Here are the current cloud prices as of March 2026, per million tokens:
| Provider | Model | Input / Output cost |
|---|---|---|
| Anthropic | Haiku 4.5 | $1 / $5 |
| Anthropic | Sonnet 4.6 | $3 / $15 |
| Anthropic | Opus 4.6 | $5 / $25 |
| OpenAI | GPT-4o mini | $0.15 / $0.60 |
| OpenAI | GPT-4o | $2.50 / $10 |
| Gemini Flash | Free tier (15 req/min) | |
| Gemini Pro | $1.25 / $5 |
And the self-hosted contenders: Ollama running open-source models like Llama 3.1, Mistral, or DeepSeek on your own machine or a rented GPU server.
The fundamental tradeoff: cloud charges per use, self-hosted charges per time. At low usage, cloud wins because you only pay for what you consume. At high usage, self-hosted wins because the hardware cost is fixed. We need to find the crossover point. 💰
The cost math nobody shows you
Cloud costs at scale
Using Claude Haiku 4.5 as the baseline (cheapest quality cloud model), assuming a typical 30% input / 70% output token split:
| Daily tokens | Monthly cost | Annual cost |
|---|---|---|
| 10K | $0.90 | $10.80 |
| 100K | $9 | $108 |
| 500K | $45 | $540 |
| 1M | $90 | $1,080 |
| 5M | $450 | $5,400 |
| 10M | $900 | $10,800 |
Self-hosted costs
Option A — hardware you already own:
If you've got a machine with a GPU (graphics card that accelerates AI math), the only extra cost is electricity:
| Hardware | Models it can run | Monthly electricity |
|---|---|---|
| 16 GB RAM, no GPU | 7B models (slow) | ~$10 |
| RTX 3090 24GB | 13B models (fast) | ~$20 |
| RTX 4090 24GB | 13B-30B models (fast) | ~$25 |
| M2/M3 Mac 32GB+ | 7B-13B (good speed) | ~$5 |
"7B" and "13B" refer to billion parameters — the model's size. Bigger models are smarter but need more memory.
Option B — renting a GPU server:
| Provider | GPU | Monthly cost |
|---|---|---|
| Hetzner (CPU only) | None | ~$50 |
| Vast.ai | RTX 3090 | ~$150 |
| Vast.ai | RTX 4090 | ~$250 |
| Lambda | A10G | ~$350 |
| RunPod | A100 40GB | ~$800 |
Option C — buying a home server:
| Build | Upfront cost | Monthly (over 3 years) |
|---|---|---|
| Used RTX 3090 + basic PC | ~$1,200 | ~$33 + electricity |
| RTX 4090 + decent PC | ~$2,500 | ~$70 + electricity |
| 2× RTX 4090 | ~$4,500 | ~$125 + electricity |
| Mac Studio M3 Ultra 192GB | ~$6,000 | ~$167 + electricity |
Where the lines cross
Cloud Haiku vs. local 7B on existing hardware:
Self-hosted cost is ~$15/month in electricity. Cloud Haiku crosses that at roughly 5 million tokens per month. Below that — and most solo founders sit well below — cloud is cheaper.
Cloud Haiku vs. rented GPU (RTX 3090 at $150/month):
You need to push 50 million tokens per month before the rented server breaks even. That's 1.7 million tokens daily — a serious production workload.
For most indie builders and small teams, cloud API costs less than self-hosting on dedicated hardware. Full stop.
The quality gap
Cost is only half the story. Here's how the models actually perform:
| Capability | Cloud (Claude/GPT) | Self-hosted (7B-13B) |
|---|---|---|
| Reasoning quality | Excellent | Moderate |
| Code generation | Excellent | Good for simple tasks |
| Context window | 200K-1M tokens | 4K-32K typically |
| Speed | 50-100+ tok/sec | 20-40 (GPU), 5-10 (CPU) |
| Tool use | Native, reliable | Possible, less reliable |
Context window — how much text the AI can "see" at once, like its working memory — is the biggest gap. Cloud models handle entire codebases. Local models see a few pages at a time.
Llama 3.1 70B is genuinely impressive and competitive on general tasks. But it needs serious GPU hardware, and there's still no local equivalent to Opus or top-tier GPT for complex reasoning. The gap has narrowed. It hasn't closed.
When self-hosting actually makes sense
1. Privacy and data sovereignty
If your data cannot leave your network — healthcare records, legal documents, financial data, government systems — self-hosting isn't optional. No API terms of service replace "the data never left our building."
# Ollama makes this a 2-minute setup
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.1:8b
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1:8b",
"prompt": "Summarize this patient record..."
}'
No network request. No third-party logging. Full compliance.
2. Offline environments
Edge devices, air-gapped networks, remote sites with no internet. No connection means no API — local is the only option.
3. High-volume simple tasks
Embeddings — numerical fingerprints of text used for search — classification, and short-text summaries. Tasks where a small model is good enough and volume is massive: ⚡
import ollama
def classify_document(text: str) -> str:
response = ollama.chat(model='llama3.1:8b', messages=[
{'role': 'user', 'content': f'Classify: invoice, contract, receipt, letter, other.\n\n{text[:500]}'}
])
return response['message']['content']
# 100K documents/day:
# Cloud cost: ~$30/day
# Self-hosted: ~$0.50/day electricity
# Monthly savings: ~$900
4. Latency-sensitive apps
API calls add 100-500ms of network delay. Local inference — the process of the model generating a response — starts instantly:
Cloud: 150-500ms network + 500-2000ms inference = 650-2500ms
Local: 0ms network + 200-1000ms inference = 200-1000ms
For autocomplete, live translation, or interactive tools, that difference is noticeable.
5. Development and experimentation
Testing 50 prompt variations locally costs $0. The same experiment on Claude API runs $5-20. Not huge, but it compounds during intensive R&D.
The practical setup (10 minutes)
If you've decided self-hosting fits your use case:
Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
ollama serve
ollama pull llama3.1:8b # 4.7 GB, general purpose
ollama pull codellama:13b # 7.4 GB, code tasks
ollama pull nomic-embed-text # 274 MB, for embeddings
Use it as a drop-in replacement
Ollama speaks the same language as OpenAI's API. Most code works without changes — just swap the URL:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="not-needed"
)
response = client.chat.completions.create(
model="llama3.1:8b",
messages=[{"role": "user", "content": "Explain MCP in 3 sentences"}]
)
print(response.choices[0].message.content)
Develop against local models, deploy with cloud — or the reverse. Same code, different URL.
Performance benchmarks
| Hardware | Tokens/sec | 500-token response |
|---|---|---|
| M2 MacBook Pro 16GB | ~35 | ~14 seconds |
| RTX 3060 12GB | ~40 | ~12 seconds |
| RTX 4090 24GB | ~80 | ~6 seconds |
| CPU only (16 cores) | ~8 | ~60 seconds |
CPU-only inference is painful for anything interactive. No GPU or Apple Silicon? Stick with cloud.
The hybrid play (this is the move) 🚀
The smartest setup isn't pure cloud or pure self-hosted. It's routing each task to the right place:
def get_ai_client(task_type: str):
if task_type in ["embedding", "classification", "simple_summary"]:
# Local — fast, free, quality is fine
return OpenAI(base_url="http://localhost:11434/v1", api_key="x")
elif task_type in ["code_generation", "complex_analysis", "tool_use"]:
# Cloud — better quality, worth the cost
return anthropic.Anthropic()
else:
return OpenAI(base_url="http://localhost:11434/v1", api_key="x")
Runs locally: embeddings, classification, draft generation, dev/testing. Runs in the cloud: complex reasoning, code generation, tool use, anything customer-facing.
Real cost example for a hybrid setup:
| Task | Volume | Where | Monthly cost |
|---|---|---|---|
| Embeddings | 50K/day | Local | $0 |
| Classification | 10K/day | Local | $0 |
| Code review | 30/day | Cloud (Haiku) | $2 |
| Content generation | 50/day | Cloud (Sonnet) | $15 |
| Complex analysis | 10/day | Cloud (Sonnet) | $5 |
| Total | $22/mo |
Pure cloud for the same workload: ~$180/month. The hybrid saves 88%.
Decision cheat sheet
Processing over 5M tokens daily? → Self-host volume tasks, cloud for quality tasks.
Strict data privacy requirements? → Self-host, non-negotiable.
Already own GPU hardware? → Hybrid: local for simple, cloud for complex.
None of the above? → Cloud only. It's cheapest and gives you the best models.
For most solo founders as of March 2026: start with cloud. Claude Haiku at $1/$5 per million tokens is so cheap that self-hosting to save money is like growing your own wheat to save on bread. The hardware costs more than years of API usage at typical founder volumes. 💰
The exception: you have privacy requirements or already own a GPU. Then install Ollama, run Llama 3.1 for bulk tasks, and call Claude for the hard problems. That hybrid cuts costs 80%+ while keeping quality where it counts. Everything else is over-engineering. 🦝





