Self-Hosted vs Cloud AI: When Does Local Make Sense?

"Just run it locally" is the tech world's "just grow your own vegetables." Sometimes it really does save money. More often it means more cost, more effort, and worse results. But you keep hearing it: on Twitter, on Reddit, from that one friend who built a server at home. So let's skip the hot takes and look at actual numbers. 🔍

The real question isn't whether you should self-host. It's at what volume self-hosting becomes cheaper, and whether the tradeoffs actually matter for your use case.

What we're comparing

Cloud AI means pay-per-use. Every time your app sends text to Claude, GPT, or Gemini, you pay for tokens: the word fragments the AI processes, each roughly ¾ of an English word. Think of it as an auto-rickshaw meter: short rides are cheap, and the bill keeps climbing on long ones.
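To make the meter concrete, here's a rough sketch of what a single request costs. The ¾-word rule is only a heuristic (real tokenizers vary by model and language), and `estimate_tokens` and `request_cost_usd` are illustrative helper names, not part of any SDK. The prices plugged in are Haiku 4.5's $1 / $5 per million tokens.

```python
def estimate_tokens(text: str, words_per_token: float = 0.75) -> int:
    """Rough token count from a whitespace word count (heuristic, not a tokenizer)."""
    words = len(text.split())
    return round(words / words_per_token)


def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost of one request, given prices in USD per million tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


prompt = "word " * 300                      # stand-in for a 300-word prompt
tokens_in = estimate_tokens(prompt)         # 300 words / 0.75
cost = request_cost_usd(tokens_in, 400, input_price_per_m=1.0, output_price_per_m=5.0)
print(tokens_in, f"${cost:.4f}")
```

A single request is a fraction of a cent. The bill only becomes interesting once you multiply by volume.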

Self-hosted AI means you run an open-source LLM (large language model, the brain behind tools like ChatGPT) on your own hardware. You pay for the machine up front and for electricity as you go; beyond that, individual requests cost nothing extra. Think of it as buying a car: expensive upfront, but no separate fare for each ride.

Cloud prices as of March 2026, per million tokens:

Provider	Model	Input / Output cost
Anthropic	Haiku 4.5	$1 / $5
Anthropic	Sonnet 4.6	$3 / $15
Anthropic	Opus 4.6	$5 / $25
OpenAI	GPT-4o mini	$0.15 / $0.60
OpenAI	GPT-4o	$2.50 / $10
Google	Gemini Flash	Free tier (15 req/min)
Google	Gemini Pro	$1.25 / $5

And the self-hosted contender: Ollama, running open-source models like Llama 3.1, Mistral, or DeepSeek on your own machine or a rented GPU server.

The basic tradeoff: cloud charges per use, self-hosted charges per unit of time. At low usage cloud wins because you pay only for what you use. At high usage self-hosted wins because the hardware cost is fixed. We need to find the crossover point. 💰

The cost math nobody shows you

Cloud costs at scale

Take Claude Haiku 4.5 as the baseline (the cheapest quality cloud model), with an even 50% input / 50% output token split, which blends to $3 per million tokens:

Daily tokens	Monthly cost	Annual cost
10K	$0.90	$10.80
100K	$9	$108
500K	$45	$540
1M	$90	$1,080
5M	$450	$5,400
10M	$900	$10,800
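If you want to rerun this table with your own numbers, the arithmetic is short. A minimal sketch, assuming 30-day months. The figures above work out to a blended rate of $3 per million tokens, which is what $1 / $5 pricing gives at an even input/output split. `input_share` is the knob to change if your workload is more output-heavy (at 30% input the blended rate rises to $3.80). Function names are illustrative.

```python
def blended_price_per_m(input_price: float, output_price: float, input_share: float) -> float:
    """Average USD price per million tokens for a given input/output mix."""
    return input_share * input_price + (1 - input_share) * output_price


def monthly_cost_usd(daily_tokens: int, price_per_m: float, days: int = 30) -> float:
    """Monthly bill for a steady daily token volume."""
    return daily_tokens * days / 1_000_000 * price_per_m


rate = blended_price_per_m(1.0, 5.0, input_share=0.5)   # Haiku 4.5 at a 50/50 split
for daily in (10_000, 100_000, 500_000, 1_000_000, 5_000_000, 10_000_000):
    monthly = monthly_cost_usd(daily, rate)
    print(f"{daily:>10,} tokens/day  ${monthly:>8,.2f}/month  ${monthly * 12:>10,.2f}/year")
```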

Self-hosted costs

Option A: you already have the hardware

If you have a machine with a GPU (the graphics card that accelerates AI math), the only extra cost is electricity:

Hardware	Models it can run	Monthly electricity
16 GB RAM, no GPU	7B models (slow)	~$10
RTX 3090 24GB	13B models (fast)	~$20
RTX 4090 24GB	13B-30B models (fast)	~$25
M2/M3 Mac 32GB+	7B-13B (good speed)	~$5

"7B" and "13B" refer to billions of parameters, which is the model's size. Bigger model = smarter, but it needs more memory.

Option B: rent a GPU server

Provider	GPU	Monthly cost
Hetzner (CPU only)	None	~$50
Vast.ai	RTX 3090	~$150
Vast.ai	RTX 4090	~$250
Lambda	A10G	~$350
RunPod	A100 40GB	~$800

Option C: build a server at home

Build	Upfront cost	Monthly (amortized over 3 years)
Used RTX 3090 + basic PC	~$1,200	~$33 + electricity
RTX 4090 + decent PC	~$2,500	~$70 + electricity
2× RTX 4090	~$4,500	~$125 + electricity
Mac Studio M3 Ultra 192GB	~$6,000	~$167 + electricity
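The monthly column is just the upfront price spread over 36 months. A quick sketch of that arithmetic, with electricity left out since it depends on your local rates and usage (`amortized_monthly` is an illustrative name):

```python
def amortized_monthly(upfront_usd: float, months: int = 36) -> float:
    """Upfront hardware cost spread evenly over its assumed lifetime."""
    return upfront_usd / months


builds = {
    "Used RTX 3090 + basic PC": 1_200,
    "RTX 4090 + decent PC": 2_500,
    "2x RTX 4090": 4_500,
    "Mac Studio M3 Ultra 192GB": 6_000,
}
for name, upfront in builds.items():
    print(f"{name:<28} ${amortized_monthly(upfront):6.2f}/month + electricity")
```

Change `months` to see how sensitive the picture is: a card you replace after two years costs 50% more per month than one you keep for three.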

Where the lines cross

Cloud Haiku vs. local 7B on existing hardware:

Self-hosting costs about $15/month in electricity. Cloud Haiku crosses that at roughly 5 million tokens per month. Below that, and most solo founders are well below it, cloud is cheaper.

Cloud Haiku vs. rented GPU (RTX 3090 at $150/month):

Break-even takes 50 million tokens per month. That's about 1.7 million tokens a day: a proper production workload.
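Both crossover figures come from the same division: fixed monthly cost over the cloud's blended per-token price. A small sketch, assuming the roughly $3 per million blended Haiku rate behind the cost table earlier (the function name is illustrative):

```python
def break_even_tokens_per_month(fixed_monthly_usd: float, cloud_price_per_m: float) -> float:
    """Monthly token volume at which a fixed-cost setup matches the cloud bill."""
    return fixed_monthly_usd / cloud_price_per_m * 1_000_000


HAIKU_BLENDED = 3.0  # USD per million tokens, assumed blended rate

existing_hw = break_even_tokens_per_month(15, HAIKU_BLENDED)    # electricity only
rented_gpu = break_even_tokens_per_month(150, HAIKU_BLENDED)    # RTX 3090 rental

print(f"Existing hardware: {existing_hw:,.0f} tokens/month")
print(f"Rented RTX 3090:   {rented_gpu:,.0f} tokens/month "
      f"({rented_gpu / 30:,.0f} per day)")
```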

For most indie builders and small teams, a cloud API is cheaper than dedicated hardware. Period.

The quality gap

Cost is only half the story. Here's how the models actually perform:

Capability	Cloud (Claude/GPT)	Self-hosted (7B-13B)
Reasoning quality	Excellent	Moderate
Code generation	Excellent	Fine for simple tasks
Context window	200K-1M tokens	Typically 4K-32K tokens
Speed	50-100+ tok/sec	20-40 tok/sec (GPU), 5-10 tok/sec (CPU)
Tool use	Native, reliable	Possible, less reliable

The context window is how much text the AI can "see" at once, like its working memory, and it's the biggest gap. Cloud models handle entire codebases. Local models see only a few pages.

Llama 3.1 70B is genuinely impressive and competitive on general tasks. But it needs serious GPU hardware, and for complex reasoning there's still no local equivalent of Opus or top-tier GPT. The gap has narrowed. It hasn't closed.

When self-hosting actually makes sense

1. Privacy and data sovereignty

If your data simply cannot leave the network (healthcare records, legal documents, financial data, government systems), self-hosting isn't optional. No API terms of service can replace "the data never left our building."

# 2-minute setup with Ollama
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.1:8b

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Summarize this patient record..."
}'

No request leaves your machine. No third-party logging. The data stays inside your network.

2. Offline environments

Edge devices, air-gapped networks, remote sites with no internet. No connection means no API, so local is the only option.

3. High-volume simple tasks

Embeddings (numerical fingerprints of text, used for search), classification, and short-text summaries: anywhere a small model is good enough and the volume is massive. ⚡

import ollama

def classify_document(text: str) -> str:
    """Ask a local model to label a document with one of five categories."""
    # Only the first 500 characters are sent, to keep each request small.
    response = ollama.chat(model='llama3.1:8b', messages=[
        {'role': 'user', 'content': f'Classify: invoice, contract, receipt, letter, other.\n\n{text[:500]}'}
    ])
    return response['message']['content']

# 100K documents/day:
# Cloud cost: ~$30/day
# Self-hosted: ~$0.50/day electricity
# Monthly savings: ~$900

4. Latency-sensitive apps

API calls add 150-500ms of network delay. Local inference (the process of the model generating its response) starts instantly:

Cloud:  150-500ms network + 500-2000ms inference = 650-2500ms
Local:  0ms network + 200-1000ms inference = 200-1000ms

In autocomplete, live translation, or interactive tools, that difference is obvious.
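These ranges are ballpark figures, so it's worth measuring your own setup. A minimal timing sketch, assuming you wrap whatever client call you use in a zero-argument function. `fake_model_call` below is a stand-in that just sleeps, not a real API call, and `measure_latency_ms` is an illustrative helper name.

```python
import time
from statistics import median


def measure_latency_ms(call, runs: int = 5) -> float:
    """Median wall-clock time of `call()` in milliseconds over several runs."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000)
    return median(samples)


def fake_model_call() -> None:
    """Placeholder for a real request; replace with your local or cloud call."""
    time.sleep(0.01)


print(f"median latency: {measure_latency_ms(fake_model_call):.0f} ms")
```

Using the median rather than the mean keeps one slow outlier (a cold model load, a network hiccup) from skewing the comparison.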

5. Development and experimentation

Testing 50 prompt variations locally costs $0. The same experiment on the Claude API would run $5-20. Not a lot, but it compounds in intensive R&D.

Practical setup (10 minutes)

If you've decided self-hosting fits your use case:

Install Ollama

curl -fsSL https://ollama.com/install.sh | sh
# Start the server in its own terminal (skip if Ollama is already running as a service)
ollama serve

ollama pull llama3.1:8b          # 4.7 GB, general purpose
ollama pull codellama:13b         # 7.4 GB, code tasks
ollama pull nomic-embed-text      # 274 MB, for embeddings

Use it as a drop-in replacement

Ollama speaks the same language as OpenAI's API. Most code works with no changes beyond swapping the URL and model name:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Explain MCP in 3 sentences"}]
)
print(response.choices[0].message.content)

Develop on local models and deploy on cloud, or the other way around. Same code, different URL.
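One way to make that switch painless is to keep the URL, key, and model name in a single config function. A sketch under stated assumptions: the `USE_LOCAL_LLM` environment variable and the `llm_config` helper are hypothetical names of mine, and `gpt-4o-mini` is used as the cloud example only because it appears in the price table.

```python
import os


def llm_config(env=None) -> dict:
    """Pick local Ollama or a cloud OpenAI-style endpoint from one switch."""
    env = os.environ if env is None else env
    if env.get("USE_LOCAL_LLM", "1") == "1":
        return {
            "base_url": "http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
            "api_key": "not-needed",                   # Ollama ignores the key
            "model": "llama3.1:8b",
        }
    return {
        "base_url": "https://api.openai.com/v1",
        "api_key": env.get("OPENAI_API_KEY", ""),
        "model": "gpt-4o-mini",
    }


cfg = llm_config({"USE_LOCAL_LLM": "1"})
print(cfg["base_url"], cfg["model"])

# With the openai package installed, the rest of the code stays identical:
#   client = OpenAI(base_url=cfg["base_url"], api_key=cfg["api_key"])
#   client.chat.completions.create(model=cfg["model"], messages=[...])
```

Note that the model name has to change along with the URL, which is why it lives in the same dict.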

Performance benchmarks

Hardware	Tokens/sec	500-token response
M2 MacBook Pro 16GB	~35	~14 seconds
RTX 3060 12GB	~40	~12 seconds
RTX 4090 24GB	~80	~6 seconds
CPU only (16 cores)	~8	~60 seconds

CPU-only inference is painful for anything interactive. No GPU or Apple Silicon? Stay on cloud.
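The last column of that table is simply response length divided by throughput. A quick sketch for estimating other response sizes, ignoring prompt processing and model load time (both add to the real wait); `response_seconds` is an illustrative name:

```python
def response_seconds(tokens: int, tokens_per_sec: float) -> float:
    """Time to generate `tokens` at a steady generation speed."""
    return tokens / tokens_per_sec


throughput = {
    "M2 MacBook Pro 16GB": 35,
    "RTX 3060 12GB": 40,
    "RTX 4090 24GB": 80,
    "CPU only (16 cores)": 8,
}
for hardware, tps in throughput.items():
    print(f"{hardware:<22} 500 tokens in ~{response_seconds(500, tps):.1f}s")
```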

The hybrid approach (this is the real jugaad) 🚀

The smartest setup is neither pure cloud nor pure self-hosted. Route each task to the right place:

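import anthropic
from openai import OpenAI

def get_ai_client(task_type: str):
    """Return a client for the backend that suits the task."""
    # Note: the two client types have different call interfaces, so callers must handle both.
    if task_type in ["embedding", "classification", "simple_summary"]:
        # Local: fast, free, quality is good enough
        return OpenAI(base_url="http://localhost:11434/v1", api_key="x")
    elif task_type in ["code_generation", "complex_analysis", "tool_use"]:
        # Cloud: better quality, worth the cost
        return anthropic.Anthropic()
    else:
        # Unknown task types default to the local model
        return OpenAI(base_url="http://localhost:11434/v1", api_key="x")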

Run locally: embeddings, classification, draft generation, dev/testing. Send to cloud: complex reasoning, code generation, tool use, anything customer-facing.

A real cost example for a hybrid setup:

Task	Volume	Where	Monthly cost
Embeddings	50K/day	Local	$0
Classification	10K/day	Local	$0
Code review	30/day	Cloud (Haiku)	$2
Content generation	50/day	Cloud (Sonnet)	$15
Complex analysis	10/day	Cloud (Sonnet)	$5
Total			$22/mo

The same workload on pure cloud: ~$180/month. Hybrid saves 88%.
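The savings figure is easy to re-derive, and easy to redo with your own task mix. A minimal sketch using the monthly costs from the table and the ~$180 pure-cloud estimate (variable names are illustrative):

```python
# Monthly cost per task in USD, as listed in the hybrid table
hybrid_costs = {
    "Embeddings": 0,
    "Classification": 0,
    "Code review": 2,
    "Content generation": 15,
    "Complex analysis": 5,
}
pure_cloud_monthly = 180  # estimate for the same workload, all on cloud

hybrid_monthly = sum(hybrid_costs.values())
savings = (pure_cloud_monthly - hybrid_monthly) / pure_cloud_monthly

print(f"Hybrid: ${hybrid_monthly}/month, saving {savings:.0%} vs pure cloud")
```

The local rows are counted as $0 here, as in the table. If you rent or buy hardware for them, add that fixed cost to `hybrid_monthly` before comparing.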

Decision cheat sheet

Processing 5M+ tokens a day? → Self-host the volume tasks, keep quality tasks on cloud.

Strict data privacy requirements? → Self-host, no compromises.

Already have GPU hardware? → Hybrid: local for simple tasks, cloud for complex ones.

None of the above? → Cloud only. It's the cheapest option and you get the best models.
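Here's the same cheat sheet as a function, for those who think in code. One assumption of mine: privacy is checked first, since the text treats it as non-negotiable. The 5M tokens/day threshold comes straight from the list above, and `recommend` is an illustrative name.

```python
def recommend(daily_tokens: int, strict_privacy: bool, has_gpu: bool) -> str:
    """Map the cheat-sheet questions to a deployment recommendation."""
    if strict_privacy:
        return "self-host"
    if daily_tokens >= 5_000_000:
        return "hybrid: self-host volume tasks, cloud for quality tasks"
    if has_gpu:
        return "hybrid: local for simple tasks, cloud for complex ones"
    return "cloud only"


print(recommend(daily_tokens=50_000, strict_privacy=False, has_gpu=False))
print(recommend(daily_tokens=50_000, strict_privacy=False, has_gpu=True))
print(recommend(daily_tokens=8_000_000, strict_privacy=False, has_gpu=False))
```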

For most solo founders in March 2026: start with cloud. At $1/$5 per million tokens, Claude Haiku is so cheap that self-hosting to save money is like growing wheat to get cheaper rotis. At typical founder volumes, the hardware costs more than years of API usage. 💰

The exception: you have privacy requirements or already own a GPU. In that case install Ollama, run Llama 3.1 for bulk tasks, and call Claude for the hard problems. This hybrid approach cuts costs by 80%+ and keeps quality where it matters. Everything else is over-engineering. 🦝