The Checklist Manifesto for Tech: Aviation-Style Checklists That Prevent Prod Incidents

You deployed at 5 PM on a Friday. You knew the migration had not been run on staging (staging being the copy of production where you test things before they reach real users). You told yourself, "I've done this a hundred times." At 5:47 PM the database locked up. At 6:12 your phone rang. And you spent Saturday on a fix that a two-minute check would have prevented. 📋

I know this because I have been that person. And because every ops retrospective I have read tells the same story: someone skipped a step they knew existed.

A pilot, a river, and a checklist

On January 15, 2009, Captain Chesley Sullenberger landed US Airways Flight 1549 on the Hudson River. Both engines had failed after the plane struck a flock of birds. All 155 people on board survived. Asked how, he pointed to training and teamwork rather than heroics. While he flew the plane, First Officer Jeffrey Skiles worked through the dual engine failure checklist. Step by step, under maximum pressure.

Aviation has been doing this since 1935, when a test flight of the Boeing Model 299 crashed because the pilot forgot to release a control lock. The plane, a four-engine bomber prototype, was too complex for one person's memory. The answer was not "hire better pilots." It was a simple checklist, short enough to fit on an index card. The accident rate fell. The industry never looked back.

Your production deploy (the process of pushing new code to the servers your real users touch) is no less complex than a 1935 bomber preflight. Your incident response is no less critical than an emergency water landing. Yet you still depend on memory, experience, and "we've always done it this way."

Why smart people skip steps

This is not about intelligence. It is about how the brain works.

According to the landmark WHO study published in the New England Journal of Medicine in 2009, surgical teams using a checklist cut major complications by about 36% and deaths by about 47%. These are people with a decade of medical training. They don't skip steps because they are careless. They skip them because human working memory buckles under sequential pressure.

Atul Gawande, the surgeon who wrote The Checklist Manifesto, describes two kinds of failure:

Ignorance failures: you don't know what to do. These are declining as knowledge spreads online.

Ineptitude failures: you know exactly what to do but fail to execute. These are growing as systems keep getting more complex. The knowledge exists. The execution breaks.

In tech, nearly every production incident I have investigated was an ineptitude failure. Someone knew they should run the database migration (the script that updates the database structure to match the new code) before deploying. Someone knew they should check the rollback plan (the documented steps for returning to the previous working version if everything breaks). Someone knew they should verify that the feature flag (the toggle that keeps new features hidden until you are ready) was off.

They knew. They forgot. At 11 PM, after a 10-hour day, they skipped step 7 of 12 because their brain said, "I've done this a hundred times." 🫶

Three checklists every tech team needs

Here is the system. Three checklists. Each one targets a specific moment where human memory fails most.

Checklist 1: Deploy checklist

This one runs before every production deployment. Every item is binary: yes or no. No judgment calls. If any item is a "no," stop. In aviation, a single "no" grounds the plane. Your deploys deserve the same respect.

## Pre-Deploy

- [ ] All tests pass on staging
- [ ] Database migrations tested on staging
- [ ] Rollback plan documented and tested
- [ ] Feature flags verified (new features default to off)
- [ ] Monitoring dashboards open
- [ ] On-call engineer confirmed available
- [ ] Deploy window confirmed (not Friday at 5 PM)
- [ ] Change announced in the team channel
- [ ] Previous deploy's metrics stable for 24h+

## Post-Deploy Verification

- [ ] Health check endpoints returning 200
      (200 = the server's way of saying "I'm fine")
- [ ] Error rate not elevated above baseline
- [ ] Key user flows tested manually
- [ ] Performance metrics in normal range
- [ ] Deploy recorded in the changelog

My team adopted this checklist 18 months ago. Deploy-related incidents dropped from roughly one every two weeks to one every three months. Not because we got smarter. Because we stopped skipping steps. ⚙️
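The post-deploy half of that list is the easiest part to hand to a script. Here is a minimal sketch in Python, assuming your monitoring already gives you health check status codes and error rates as plain numbers. The function name `post_deploy_failures`, its parameters, and the 10% tolerance are illustrative assumptions, not part of any particular tool.

```python
def post_deploy_failures(health_statuses, error_rate, baseline_error_rate,
                         tolerance=0.10):
    """Return the list of failed checks; an empty list means every item is a 'yes'.

    health_statuses: dict of endpoint -> HTTP status code (hypothetical input).
    error_rate, baseline_error_rate: fraction of failed requests, e.g. 0.002.
    tolerance: allowed relative increase over baseline (assumed value).
    """
    failures = []

    # No results at all is treated as a failure, not as a pass.
    if not health_statuses:
        failures.append("no health check results at all")

    # Every health check endpoint must answer 200.
    for endpoint, status in sorted(health_statuses.items()):
        if status != 200:
            failures.append(f"{endpoint} returned {status}, expected 200")

    # The error rate must not be elevated above baseline.
    if error_rate > baseline_error_rate * (1 + tolerance):
        failures.append(
            f"error rate {error_rate:.4f} is above baseline {baseline_error_rate:.4f}"
        )

    return failures
```

The manual items (key user flows, changelog entry) stay with a human. The script only covers the two checks a machine can answer with a plain yes or no.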

Checklist 2: Incident response checklist

When production breaks, your brain goes into fight-or-flight mode. Adrenaline spikes. You want to fix it right now. This is exactly the moment checklists matter most, because acute stress impairs the prefrontal cortex, the part of the brain you rely on for sequential thinking.

## Minute 0-5: Assess
- [ ] Confirm the incident is real (not a monitoring false alarm)
- [ ] Set severity: S1 (total outage), S2 (partial), S3 (degraded)
- [ ] Assign an incident commander (one person makes the decisions)
- [ ] Open a dedicated incident channel

## Minute 5-15: Communicate
- [ ] Update the status page
- [ ] Notify affected customers (if S1/S2)
- [ ] Notify internal stakeholders
- [ ] Communicate the ETA for the next update
      (even if it is just "we're investigating")

## Minute 15+: Fix
- [ ] Identify the root cause OR trigger escalation
- [ ] Test the fix on staging (if possible)
- [ ] Deploy the fix to production
- [ ] Confirm resolution through monitoring

## Post-Incident
- [ ] Schedule the post-mortem within 48 hours
- [ ] Document the timeline while memories are fresh
- [ ] Assign action items with owners and deadlines
- [ ] Update the checklist if a step was missing

That last item is the quiet engine of the whole system. Every incident becomes feedback for the checklist. Something missing? Add it. Something redundant? Remove it. The checklist is alive: it learns from every failure so that you don't have to repeat them. 🧘
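The severity levels and time windows above are simple enough to encode, so that an incident bot can tell a stressed human which phase they are in and who needs to hear from them. This is a rough sketch. The names `Severity`, `current_phase`, and `notifications_due` are made up for illustration, and treating minute 5 as the start of "communicate" is an assumption, since the checklist leaves the boundary open.

```python
from enum import Enum


class Severity(Enum):
    S1 = "total outage"
    S2 = "partial outage"
    S3 = "degraded"


def current_phase(minutes_since_start):
    """Map elapsed minutes onto the checklist phases (boundaries are assumed)."""
    if minutes_since_start < 0:
        raise ValueError("elapsed minutes cannot be negative")
    if minutes_since_start < 5:
        return "assess"
    if minutes_since_start < 15:
        return "communicate"
    return "fix"


def notifications_due(severity):
    """List who must be updated; customers are only notified for S1 and S2."""
    due = ["status page", "internal stakeholders"]
    if severity in (Severity.S1, Severity.S2):
        due.insert(1, "affected customers")
    return due
```

For example, `notifications_due(Severity.S3)` leaves customers off the list, which matches the "(if S1/S2)" condition in the checklist.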

Checklist 3: Code review checklist

Code review (a teammate reading your code before it goes to production) without a checklist is just reading and hoping. With a checklist, it is systematic verification that specific categories of problems are absent.

## Security
- [ ] No hardcoded credentials, API keys, or tokens
- [ ] User input validated and sanitized
- [ ] Database queries use parameterized statements
      (prevents SQL injection, an attack where someone
       sneaks database commands into a login form)
- [ ] Authentication check on all protected endpoints
      (endpoint = a specific URL your app responds to)

## Reliability
- [ ] Error handling covers the failure cases
- [ ] Timeouts set on external API calls
      (API = the way programs talk to each other)
- [ ] Database queries on large tables have indexes
      (index = a lookup shortcut, like the index of a book)
- [ ] No N+1 queries (fetching related data one row
      at a time instead of in one efficient batch)

## Maintainability
- [ ] Functions do one thing
- [ ] Variable names describe what they hold
- [ ] Complex logic has comments that explain WHY
- [ ] Tests cover the new code paths
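
To make the "parameterized statements" item concrete, here is a small sketch using Python's built-in `sqlite3` module and an in-memory database. The `users` table and the function names are invented for this example. The unsafe version is there only to show what a reviewer should reject.

```python
import sqlite3


def make_demo_db():
    """Build a throwaway in-memory database with two users (demo data only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])
    return conn


def find_user_unsafe(conn, name):
    # BAD: user input is pasted into the SQL text, so it can change the query.
    query = f"SELECT id, name FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, name):
    # GOOD: the ? placeholder sends the input as data, never as SQL.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()
```

Feed both functions the input `x' OR '1'='1`. The unsafe one returns every row in the table, because the input rewrote the WHERE clause. The safe one returns nothing, because no user has that literal name.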

How to make checklists stick

Writing checklists is easy. The hard part, the one nobody talks about, is preventing abandonment. Four rules:

Make it mandatory, not optional. Wire the checklist into your deploy pipeline (the automated sequence of steps that builds and ships your code). The deploy button should stay greyed out until every box is checked. A checklist that is merely "recommended" gets used maybe 80% of the time, which means it fails at exactly the moment it matters most: under pressure, at night, when you are tired.
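
One way to wire this in, sketched under the assumption that the checklist lives in a markdown file using the `- [ ]` / `- [x]` task syntax shown earlier: a small gate that a pipeline step could call before shipping. The names `unchecked_items` and `deploy_allowed` are hypothetical, and how you connect the result to your deploy button depends on your tooling.

```python
import re

# Matches markdown task items such as "- [ ] text" or "- [x] text".
TASK_ITEM = re.compile(r"^\s*-\s\[([ xX])\]\s+(.+)$")


def unchecked_items(markdown_text):
    """Return the text of every task item whose box is still empty."""
    pending = []
    for line in markdown_text.splitlines():
        match = TASK_ITEM.match(line)
        if match and match.group(1) == " ":
            pending.append(match.group(2).strip())
    return pending


def deploy_allowed(markdown_text):
    """Allow the deploy only if there is at least one item and none are unchecked."""
    has_items = any(TASK_ITEM.match(line) for line in markdown_text.splitlines())
    # Fail closed: an empty or missing checklist blocks the deploy too.
    return has_items and not unchecked_items(markdown_text)
```

The fail-closed choice is deliberate. If someone deletes the checklist to get past the gate, the gate should stay shut.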

Keep it short. A rule of thumb from aviation checklist design, which Gawande cites, is five to nine items per checklist; longer lists tend to get skimmed. If yours has 30 items, split it into phases of 5-9 items each. A short checklist that gets followed always beats a long one that gets ignored.
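
Splitting a long list into phases is mechanical. Here is a minimal sketch: the function name `split_into_phases` is made up, and balancing the phase sizes evenly is one design choice among several. In practice you would also want each phase to map onto a natural pause point, which no script can decide for you.

```python
import math


def split_into_phases(items, max_items=9):
    """Split a long checklist into consecutive phases of at most max_items each.

    Sizes are balanced so that no phase ends up with a tiny leftover.
    """
    if max_items < 1:
        raise ValueError("max_items must be at least 1")
    if not items:
        return []

    n_phases = math.ceil(len(items) / max_items)
    base, extra = divmod(len(items), n_phases)

    phases, start = [], 0
    for i in range(n_phases):
        # The first `extra` phases take one additional item.
        size = base + (1 if i < extra else 0)
        phases.append(items[start:start + size])
        start += size
    return phases
```

With 30 items and the default limit, this yields four phases of 8, 8, 7, and 7 items, in the original order.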

Review it quarterly. Remove items that never catch anything. Add items from recent incidents. A stale checklist breeds contempt: when half the items look irrelevant to the current stack, people stop taking it seriously.
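
The quarterly review can start from data instead of opinion. The sketch below assumes you keep a count of how many problems each item caught since the last review, which is a hypothetical tracking habit and not something the checklists above do for you. The function name `quarterly_review` is likewise invented.

```python
def quarterly_review(catch_counts, incident_lessons):
    """Propose an updated checklist plus a list of removal candidates.

    catch_counts: dict of item -> problems caught since last review (assumed data).
    incident_lessons: items suggested by recent post-mortems.
    """
    kept = [item for item, count in catch_counts.items() if count > 0]
    removal_candidates = [item for item, count in catch_counts.items() if count == 0]

    # Add lessons that are not already on the checklist, skipping duplicates.
    added = []
    for lesson in incident_lessons:
        if lesson not in catch_counts and lesson not in added:
            added.append(lesson)

    return kept + added, removal_candidates
```

Treat the second return value as candidates for a human to decide on, not as an automatic delete list. An item such as the rollback plan may catch nothing for a year and still be worth keeping.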

Make it visible. Pin it in the team channel. Put it in the deploy tool's UI. Print a copy and stick it next to the monitors. The best checklist is the one you see without having to look for it.

What it costs

Let's be honest about the tradeoffs. Checklists add friction. The deploy checklist adds 5-10 minutes to every deploy. The incident response checklist feels agonizingly slow when production is on fire. Code review checklists make reviews longer.

That friction is the point. It is deliberate slowness at the moments where speed does damage. A pilot doesn't rush the preflight just because passengers are boarding. You shouldn't rush a deploy just because the PM asked for it by end of day.

The second cost is maintenance. An unmaintained checklist is worse than no checklist: it teaches your team that process is just theater. Someone has to own each checklist, review it quarterly, and update it after every incident. That is real work.

Now you're dangerous

Remember that Friday deploy? The one where you skipped the staging check and spent Saturday rebuilding the database?

Now you have three checklists that prevent exactly that kind of failure. Not through discipline, not through willpower, but through a system that assumes your brain will fail under pressure and catches the failure before it matters.

Checklists are not about discipline. They are about humility: accepting that your brain, however sharp, cannot reliably execute 15 sequential steps under stress without missing one. Pilots know this. Surgeons know this. Astronauts know this.

Tech is one of the few industries where experienced professionals still say, "I don't need a checklist, I've done this a thousand times." That is not confidence. It is the sentence that comes before every preventable incident.

Write the checklist. Follow the checklist. Update the checklist. Your production environment will thank you. So will the person who gets paged at 2 AM, and that person might be you. 🛁
