You deployed on Friday at 5 PM. You knew the migration hadn't run on staging — staging being the copy of production where you test things before real users see them. You told yourself you'd done this a hundred times. At 5:47 PM, the database locked up. At 6:12 PM, your phone rang. You spent Saturday fixing what a two-minute check would have prevented. 📋
I know this because I've been that person. And because every ops retrospective I've ever read tells the same story: someone skipped a step they knew existed.
A pilot, a river, and a checklist
On January 15, 2009, Captain Chesley Sullenberger landed US Airways Flight 1549 on the Hudson River after the plane struck a flock of geese and lost both engines. All 155 people on board survived. When reporters asked how, he didn't say "experience" or "instinct." He said his crew followed their checklists. The dual engine failure checklist. The ditching checklist. Step by step, under maximum pressure.
Aviation has been doing this since 1935, when a Boeing Model 299 test flight crashed because the pilot forgot to release a control lock. The plane — a four-engine bomber prototype — was literally too complex for one person's memory. Boeing's response wasn't "hire better pilots." It was a simple checklist on an index card. The crash rate dropped. The industry never looked back.
Your production deploy — the process of pushing new code to the servers your users actually touch — is not less complex than a 1935 bomber preflight. Your incident response is not less critical than an emergency water landing. But you're still relying on memory, experience, and "we've always done it this way."
Why smart people skip steps
This isn't about intelligence. It's about how brains work.
In a landmark 2009 WHO study published in the New England Journal of Medicine, hospitals that introduced a surgical safety checklist saw major complications fall by about 36% (from 11% to 7% of patients) and deaths by about 47% (from 1.5% to 0.8%). These are people with a decade of medical training. They don't skip steps because they're careless. They skip steps because human working memory buckles under sequential pressure.
Atul Gawande, the surgeon who wrote The Checklist Manifesto, identified two kinds of failure:
Ignorance failures: You don't know what to do. These are shrinking as knowledge spreads online.
Ineptitude failures: You know exactly what to do but fail to execute it. These are growing because systems keep getting more complex. The knowledge exists. The execution crumbles.
In tech, almost every production incident I've investigated was an ineptitude failure. Someone knew they should run the database migration — a script that updates the database structure to match the new code — before deploying. Someone knew they should check the rollback plan — the documented steps for reverting to the previous working version if everything breaks. Someone knew they should verify the feature flag — a toggle that keeps new features hidden until you're ready — was set to off.
They knew. They forgot. At 11 PM, after a 10-hour day, they skipped step 7 of 12 because their brain whispered "I've done this a hundred times." 🫶
Three checklists every tech team needs
Here's the system. Three checklists. Each one targets a specific moment where human memory fails hardest.
Checklist 1: The deploy checklist
This runs before every production deployment. Every item is binary — yes or no. No judgment calls. If any item is "no," you stop. In aviation, a single "no" grounds the plane. Your deploys deserve the same respect.
## Pre-Deploy
- [ ] All tests pass on staging
- [ ] Database migrations tested on staging
- [ ] Rollback plan documented and tested
- [ ] Feature flags verified (new features off by default)
- [ ] Monitoring dashboards open
- [ ] On-call engineer confirmed available
- [ ] Deploy window confirmed (not Friday 5 PM)
- [ ] Change announced in team channel
- [ ] Previous deploy's metrics stable for 24h+
## Post-Deploy Verification
- [ ] Health check endpoints returning 200
      (200 = the server's way of saying "I'm fine")
- [ ] Error rate not elevated vs. baseline
- [ ] Key user flows tested manually
- [ ] Performance metrics within normal range
- [ ] Deploy recorded in changelog
My team adopted this checklist 18 months ago. Deploy-related incidents dropped from roughly one every two weeks to one every three months. Not because we got smarter. Because we stopped skipping steps. ⚙️
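Two of the post-deploy items above lend themselves to automation. What follows is a minimal Python sketch under stated assumptions, not a description of any particular tool: the function names, the 1.5x tolerance for "not elevated," and the sample numbers are all invented for illustration. In practice the status code and error rates would come from your own monitoring.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str


def health_check_ok(status_code: int) -> CheckResult:
    # Binary by design: only a 200 counts as healthy.
    return CheckResult(
        name="Health check endpoints returning 200",
        passed=(status_code == 200),
        detail=f"got HTTP {status_code}",
    )


def error_rate_ok(current: float, baseline: float, tolerance: float = 1.5) -> CheckResult:
    # Assumption: "not elevated" means at most `tolerance` times the baseline.
    # The 1.5 default is an arbitrary illustrative threshold.
    limit = baseline * tolerance
    return CheckResult(
        name="Error rate not elevated vs. baseline",
        passed=(current <= limit),
        detail=f"current={current:.4f}, limit={limit:.4f}",
    )


def verify_deploy(results: list[CheckResult]) -> bool:
    # Print each item as a checkbox; a single failed item fails the whole run.
    for r in results:
        mark = "x" if r.passed else " "
        print(f"- [{mark}] {r.name} ({r.detail})")
    return all(r.passed for r in results)


ok = verify_deploy([
    health_check_ok(200),
    error_rate_ok(current=0.012, baseline=0.010),
])
print("deploy verified" if ok else "STOP: roll back or investigate")
```

The design choice worth copying is that every function returns a plain pass or fail. There is no "mostly fine" result to argue about at 11 PM.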
Checklist 2: The incident response checklist
When production breaks, your brain enters fight-or-flight mode. Adrenaline spikes. You want to fix it NOW. This is exactly when checklists matter most, because acute stress impairs the prefrontal cortex, the part of the brain you rely on for planning and sequential thinking.
## Minute 0-5: Assess
- [ ] Confirm the incident is real (not a monitoring false alarm)
- [ ] Severity assigned: S1 (total outage), S2 (partial outage), or S3 (degraded)
- [ ] Assign incident commander (one person makes decisions)
- [ ] Open dedicated incident channel
## Minute 5-15: Communicate
- [ ] Status page updated
- [ ] Affected customers notified (if S1/S2)
- [ ] Internal stakeholders notified
- [ ] ETA for next update communicated
      (even if it's just "we're investigating")
## Minute 15+: Fix
- [ ] Root cause identified OR escalation triggered
- [ ] Fix tested on staging first (if possible)
- [ ] Fix deployed to production
- [ ] Monitoring confirms resolution
## Post-Incident
- [ ] Post-mortem scheduled within 48 hours
- [ ] Timeline documented while memories are fresh
- [ ] Action items assigned with owners and deadlines
- [ ] Checklist updated if a step was missing
That last item is the quiet engine of the whole system. Every incident becomes feedback for the checklist. Something missing? Add it. Something redundant? Remove it. The checklist is alive — it learns from every failure, so you don't have to repeat them. 🧘
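Some of the incident checklist's rules are mechanical enough to encode, which takes them off the incident commander's plate. The sketch below is one hypothetical way to do that in Python. The `Incident` class, its field names, and the choice to count the 48-hour post-mortem window from the moment the incident opened are assumptions, not part of the checklist itself.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from enum import Enum


class Severity(Enum):
    S1 = "total outage"
    S2 = "partial outage"
    S3 = "degraded"


@dataclass
class Incident:
    severity: Severity
    commander: str  # one person makes decisions
    opened_at: datetime
    timeline: list[tuple[datetime, str]] = field(default_factory=list)

    def notify_customers(self) -> bool:
        # The checklist notifies affected customers for S1 and S2 only.
        return self.severity in (Severity.S1, Severity.S2)

    def postmortem_due(self) -> datetime:
        # Assumption: "within 48 hours" is measured from when the incident opened.
        return self.opened_at + timedelta(hours=48)

    def log(self, when: datetime, note: str) -> None:
        # Record events as they happen, while memories are fresh.
        self.timeline.append((when, note))


opened = datetime(2024, 3, 1, 17, 47, tzinfo=timezone.utc)
incident = Incident(Severity.S2, commander="alex", opened_at=opened)
incident.log(opened, "database locks reported")

print(incident.notify_customers())            # True
print(incident.postmortem_due().isoformat())  # 2024-03-03T17:47:00+00:00
```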
Checklist 3: The code review checklist
Code review is when a teammate reads your code before it goes to production. Without a checklist, it's reading and hoping. With a checklist, it's systematic verification that specific categories of problems don't exist.
## Security
- [ ] No hardcoded credentials, API keys, or tokens
- [ ] User input validated and sanitized
- [ ] Database queries use parameterized statements
      (prevents SQL injection, an attack where someone
      sneaks database commands into a login form)
- [ ] Authentication checked on all protected endpoints
      (endpoint = a specific URL your app responds to)
## Reliability
- [ ] Error handling covers failure cases
- [ ] External API calls have timeouts set
      (API = a way for programs to talk to each other)
- [ ] Queries against large tables are backed by indexes
      (index = a lookup shortcut, like a book's index)
- [ ] No N+1 queries (fetching related data one row
      at a time instead of in one efficient batch)
## Maintainability
- [ ] Functions do one thing
- [ ] Variable names describe what they hold
- [ ] Complex logic has comments explaining WHY
- [ ] Tests cover the new code paths
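Two items on this checklist are easier to spot in review once you've seen them side by side with their fixes. Here is a small, self-contained Python sketch using the standard library's `sqlite3` module and an in-memory database. The tables, sample rows, and function names are made up for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users VALUES (1, 'ada'), (2, 'grace');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5);
""")


def find_user_unsafe(name: str) -> list:
    # Anti-pattern: user input is pasted straight into the SQL text.
    return conn.execute(
        f"SELECT id, name FROM users WHERE name = '{name}'"
    ).fetchall()


def find_user_safe(name: str) -> list:
    # Parameterized: `name` is bound as data and never parsed as SQL.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()


def totals_n_plus_one() -> dict:
    # Anti-pattern: one query for the users, then one more query per user.
    totals = {}
    for user_id, name in conn.execute("SELECT id, name FROM users").fetchall():
        row = conn.execute(
            "SELECT COALESCE(SUM(total), 0) FROM orders WHERE user_id = ?",
            (user_id,),
        ).fetchone()
        totals[name] = row[0]
    return totals


def totals_batched() -> dict:
    # One query: a join with GROUP BY fetches every total in a single pass.
    rows = conn.execute("""
        SELECT u.name, COALESCE(SUM(o.total), 0)
        FROM users u LEFT JOIN orders o ON o.user_id = u.id
        GROUP BY u.id
    """).fetchall()
    return dict(rows)


payload = "x' OR '1'='1"
print(len(find_user_unsafe(payload)))           # 2: the injected OR matches every row
print(len(find_user_safe(payload)))             # 0: the payload is just an odd name
print(totals_n_plus_one() == totals_batched())  # True: same answer, fewer queries
```

With two users the N+1 version costs three queries instead of one. With ten thousand users it costs ten thousand and one, which is why reviewers look for queries inside loops.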
How to make checklists stick
Creating checklists is easy. The hard part — the part nobody talks about — is preventing abandonment. Four rules:
Make them mandatory, not optional. Wire the checklist into your deploy pipeline, the automated sequence of steps that builds and ships your code. The deploy button stays grayed out until every box is checked. A checklist that's merely "recommended" gets used most of the time and skipped precisely when it matters most: under pressure, at night, when you're tired.
Keep them short. The rule of thumb Gawande reports from aviation checklist design is five to nine items per checklist, roughly what working memory can hold. If yours has 30 items, split it into phases. Each phase: 5-9 items. A short checklist that's followed beats a long one that's ignored.
Review them quarterly. Remove items that never catch anything. Add items from recent incidents. A stale checklist breeds contempt — people stop taking it seriously when half the items feel irrelevant to their current stack.
Make them visible. Pin them in the team channel. Put them in the deploy tool's UI. Print one and tape it next to the monitors. The best checklist is the one you see without looking for it.
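The first two rules above can be enforced by a small gate in the pipeline. The following Python sketch assumes the checklist lives in a markdown file with `## ` phase headings and `- [ ]` / `- [x]` items, like the ones in this article. The function names and the decision to treat an over-long phase as a blocking problem are assumptions for illustration, and your deploy tool will have its own way of running such a step.

```python
import re

CHECKBOX = re.compile(r"^\s*- \[( |x|X)\] (.+)$")
MAX_ITEMS_PER_PHASE = 9  # upper end of the 5-9 range suggested above


def parse_checklist(markdown: str) -> dict[str, list[tuple[str, bool]]]:
    """Group checkbox items under their '## ' phase heading."""
    phases: dict[str, list[tuple[str, bool]]] = {}
    current = "(no phase)"
    for line in markdown.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
            phases.setdefault(current, [])
            continue
        match = CHECKBOX.match(line)
        if match:
            checked = match.group(1) != " "
            phases.setdefault(current, []).append((match.group(2), checked))
    return phases


def gate(markdown: str) -> list[str]:
    """Return blocking problems; an empty list means the deploy may proceed."""
    problems = []
    for phase, items in parse_checklist(markdown).items():
        if len(items) > MAX_ITEMS_PER_PHASE:
            problems.append(f"{phase}: {len(items)} items, split into shorter phases")
        for text, checked in items:
            if not checked:
                problems.append(f"{phase}: unchecked -> {text}")
    return problems


sample = """## Pre-Deploy
- [x] All tests pass on staging
- [ ] Rollback plan documented and tested
"""
for problem in gate(sample):
    print(problem)  # Pre-Deploy: unchecked -> Rollback plan documented and tested
```

In a real pipeline the calling script would exit with a non-zero status whenever `gate` returns anything, which is what keeps the deploy button gray.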
What this costs you
Let's be honest about the tradeoffs. Checklists add friction. A deploy checklist adds 5-10 minutes to every deploy. An incident response checklist feels agonizingly slow when production is burning. Code review checklists make reviews take longer.
This friction is the point. It's deliberate slowness in moments where speed causes damage. A pilot doesn't rush preflight because the passengers are boarding. You shouldn't rush a deploy because the PM asked for it by end of day.
The other cost is maintenance. An unmaintained checklist is worse than no checklist — it teaches your team that process is theater. Someone needs to own each checklist, review it quarterly, and update it after every incident. That's real work.
You're dangerous now
Remember that Friday deploy? The one where you skipped the staging check and spent your Saturday recovering a locked-up database?
You now have three checklists that prevent exactly that kind of failure. Not through discipline, not through willpower — through a system that assumes your brain will fail under pressure and catches it before it matters.
Checklists aren't about discipline. They're about humility. An acknowledgment that your brain — sharp as it is — cannot reliably execute 15 sequential steps under stress without missing one. Pilots know this. Surgeons know this. Astronauts know this.
Tech is one of the few high-stakes fields where experienced professionals still say "I don't need a checklist, I've done this a thousand times." That's not confidence. That's the sentence that precedes every preventable incident.
Write the checklist. Follow the checklist. Update the checklist. Your production environment will thank you. So will the person who gets paged at 2 AM — and that person might be you. 🛁





