I tried to take a vacation in March 2024. Lasted four days.

By day two, I was answering "just one quick question" on Slack. By day three, I was sitting in a hotel lobby debugging a production issue, a problem with our live, customer-facing servers. By day four, my partner said: "Just go back to work. This isn't a vacation." They were right.

The problem wasn't my team. They were competent. The problem was that my brain held the only copy of half the company's operations. Deploy procedure? In my head. Client escalation flow? In my head. How to restart the billing service when it freezes? Also in my head. I wasn't indispensable because I was brilliant. I was indispensable because I hadn't written anything down.

Six months later, in September 2024, I took two full weeks off. No laptop. No Slack. No "quick calls." Nothing broke.

Here's what I did in those six months.

Step 1: Find your bus factor

"Bus factor" — a morbid but useful metric: how many people on your team need to disappear before a process stops working entirely? If the answer is "one," and that one is you, you don't have a system. You have a hostage situation.

I listed every recurring task I owned and asked one question about each: "If I vanished tomorrow, who could do this?"

Task                         Bus factor   Who else knows?
Production deploys           1 (me)       Nobody
Client billing issues        1 (me)       Nobody
Server monitoring response   1 (me)       Nobody
Sprint planning              2            Co-lead
Code reviews                 3            Any senior dev
Hiring interviews            2            Co-lead

Three critical processes with a bus factor of 1. Three things that would stop entirely if I caught food poisoning. That's not operations. That's a one-person show wearing a team costume.

The concept comes from open-source software culture, where projects live or die by contributor count. The Wikipedia entry on bus factor lists examples from major tech companies — same problem, bigger scale.
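
If your task list is longer than six rows, the same audit can be done in a few lines of Python. This is only a sketch: the people listed per task are placeholders I made up to match the counts in the table, and the function names are hypothetical.

```python
# Hypothetical audit data: task -> people who can do it unaided today.
# Names are placeholders; only the counts mirror the audit table.
TASKS = {
    "Production deploys": ["me"],
    "Client billing issues": ["me"],
    "Server monitoring response": ["me"],
    "Sprint planning": ["me", "co-lead"],
    "Code reviews": ["me", "senior dev A", "senior dev B"],
    "Hiring interviews": ["me", "co-lead"],
}


def bus_factor(people):
    """Bus factor of one task: the number of distinct people who can do it."""
    return len(set(people))


def single_points_of_failure(tasks):
    """Return the names of tasks only one person can do, sorted alphabetically."""
    return sorted(name for name, people in tasks.items() if bus_factor(people) == 1)


if __name__ == "__main__":
    for name in single_points_of_failure(TASKS):
        print(f"Needs a runbook: {name}")
```

Anything this prints is a candidate for Step 2.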

Step 2: Write runbooks, not documentation

This distinction matters. Documentation explains how something works. A runbook, a step-by-step procedure for handling one specific situation, explains what to do.

Documentation: "The billing service uses Stripe webhooks (automated notifications that Stripe sends when a payment event occurs) to process payments. Events are queued in Redis and processed by the billing worker."

Runbook: "When billing is stuck:
1. Check Redis queue length: redis-cli llen billing_queue
2. If the queue length is over 100, restart the billing worker: systemctl restart billing-worker
3. If the restart doesn't clear the queue within 5 minutes, check the Stripe dashboard for failed webhooks."

See the difference? Documentation requires understanding. A runbook requires following instructions. Anyone who can type commands into a terminal — the text-based interface where you enter commands directly — can follow a runbook. They don't need to understand why Redis (a fast in-memory database often used as a message queue) is involved. They need to know what to type.
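
A runbook that mechanical is also easy to turn into a script later. Here is a rough Python sketch of the three billing steps, not the script we actually run. The two commands come straight from the runbook. Everything else is an assumption: the function names are hypothetical, "clear the queue" is read as "length back to 100 or below", and the sketch does nothing when the queue is already under the threshold, a case the runbook doesn't cover.

```python
import subprocess
import time

QUEUE_LIMIT = 100        # threshold from the runbook
WAIT_SECONDS = 5 * 60    # "within 5 minutes"


def run_command(args):
    """Run a command and return its stdout as text (raises if it fails)."""
    return subprocess.run(args, capture_output=True, text=True, check=True).stdout


def queue_length(run=run_command):
    """Step 1: redis-cli llen billing_queue. Takes the last token of the output."""
    return int(run(["redis-cli", "llen", "billing_queue"]).split()[-1])


def unstick_billing(run=run_command, sleep=time.sleep, wait_seconds=WAIT_SECONDS):
    """Follow the three runbook steps and report what happened."""
    if queue_length(run) <= QUEUE_LIMIT:
        # Assumption: the runbook does not say what to do in this case.
        return "queue ok, no restart needed"
    run(["systemctl", "restart", "billing-worker"])  # step 2
    sleep(wait_seconds)
    if queue_length(run) > QUEUE_LIMIT:
        # Step 3 stays manual: a human looks at the Stripe dashboard.
        return "still stuck: check the Stripe dashboard for failed webhooks"
    return "restart cleared the queue"
```

Calling unstick_billing() with no arguments runs the real commands. Passing in a fake run and sleep lets you rehearse the logic without touching Redis or the billing worker.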

PagerDuty, a company that built its entire business around incident response, published a solid guide on writing runbooks that covers the same principle: optimize for action, not comprehension.

I wrote runbooks for all three bus-factor-1 processes. Each took 30–60 minutes. Here's the format:

Runbook: Production Deploy

When to use:
When merging to main and deploying to production.

Prerequisites:
- SSH access to prod server (ask IT if you don't have it)
- Access to #deploys Slack channel

Steps:
1. Merge PR to main
2. Wait for CI checks to pass (GitHub Actions, ~5 min)
3. SSH to server: ssh [email protected]
4. Run: bash /srv/app/deploy.sh
5. Check health: curl -s http://localhost:8080/health | jq .
6. If health check fails: bash /srv/app/rollback.sh
7. Post result in #deploys

If something goes wrong:
- Deploy script fails → check logs at /var/log/deploys/
- Health check returns 503 → check which subsystem failed in the JSON response
- Can't SSH → contact IT, check VPN
- Rollback fails → call Capitan. Not Slack. Phone.

Escalation:
If stuck for more than 15 minutes, call Capitan. Phone, not Slack.

No theory. No architecture diagrams. Just: "when this happens, do this."
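
Because every runbook uses the same five headings, a draft can be checked for missing sections before anyone tries to follow it. This is a small Python sketch under one assumption: each heading sits on its own line, spelled exactly as in the format above. The function name and the sample draft are made up for illustration.

```python
REQUIRED_SECTIONS = (
    "When to use:",
    "Prerequisites:",
    "Steps:",
    "If something goes wrong:",
    "Escalation:",
)


def missing_sections(runbook_text):
    """Return the required headings that never appear on a line of their own."""
    lines = {line.strip() for line in runbook_text.splitlines()}
    return [section for section in REQUIRED_SECTIONS if section not in lines]


if __name__ == "__main__":
    # Hypothetical unfinished draft, used only to show the check.
    draft = """Runbook: Billing Stuck

When to use:
When payments stop processing.

Steps:
1. Check Redis queue length
"""
    for section in missing_sections(draft):
        print(f"Missing section: {section}")
```

A check like this only catches a missing heading. It can't tell you whether the steps under it are right, so it doesn't replace having a person follow the runbook.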

Step 3: The shadow week

Writing runbooks isn't enough. You need to test them on actual humans.

I ran a "shadow week" — one week where I stayed available but didn't touch any of the three processes myself. Someone else followed the runbook. I watched.

Results:

  • Deploy runbook: Worked on first try. The team member found a typo in step 4 (wrong file path). Fixed it in two minutes.
  • Billing runbook: Failed at step 3. I'd written "check Stripe dashboard" but never explained how to log in. The credentials lived in my personal password manager, not the shared one. Added shared access — problem solved.
  • Monitoring runbook: Partially worked. Steps were correct, but the monitoring tool's UI had changed since I wrote the doc. Updated the screenshots.

Every single runbook needed at least one correction. This is normal. When you write from memory, you skip steps that feel "obvious" because you've done them 500 times. The shadow week exposes those gaps before they matter at 3 AM on a Saturday.

Google's SRE team — the group responsible for keeping Google's infrastructure running — covers this principle in their free Site Reliability Engineering book: documentation that hasn't been tested under real conditions is fiction.

Step 4: The knowledge transfer meeting

For each runbook, I held a 30-minute knowledge transfer session. Not training — transfer. Training teaches skills. Transfer teaches context.

Structure:

  1. Walk through the runbook together (10 min) — the person follows each step while I watch. No helping unless they're stuck.
  2. Explain the "why" behind critical steps (10 min) — not required for execution, but helps with judgment calls. "We restart the billing worker before investigating because downtime costs $X per minute. Speed first, root cause second."
  3. Q&A (10 min) — their questions reveal exactly what I forgot to document.

After the meeting, they own the process. Not "help with the process." Own it. I become the backup, not the primary.

Step 5: The vacation test

Two months after the shadow week, I took a three-day weekend without my laptop. No emergencies. No "quick questions." The runbooks held.

One month later, a full week off. One minor issue: the monitoring runbook didn't cover a specific edge case — disk full on a non-standard partition. The person on call improvised correctly and added the case to the runbook afterward. The system improved itself without me. That's the sign it's working.

One month after that: two full weeks. Zero incidents that needed my involvement. The team sent me a photo of a whiteboard that said: "Day 9: Capitan hasn't called. We think the system works."

The part nobody talks about

Writing yourself out of processes feels like making yourself dispensable. It is. That's the point.

But it triggers something uncomfortable — the part of your identity tied to being "the person who knows everything." The one people call. The one who saves the day.

I'll be honest: it felt strange when deploys happened without me. When billing issues got resolved without a text. When the server alert fired and someone else handled it in 8 minutes flat. Part of me wanted to be needed.

Here's what I got instead: freedom. Not just vacation freedom — daily freedom. I stopped being a bottleneck — the single point where all decisions stacked up, waiting. Work that used to queue behind me now happened in real time. The team moved faster. I focused on decisions that actually required my judgment instead of tasks that just required my memory.

Your action list

As of March 2026, I've repeated this process at three different organizations. The pattern holds every time. If you can't take two weeks off without everything breaking, you don't have a system — you have a habit of being busy.

Here's your checklist:

  1. Audit your bus factor — list every recurring task, mark who else can do it
  2. Write runbooks for everything with bus factor = 1 (budget 30–60 minutes each)
  3. Run a shadow week — someone else executes, you watch and fix the docs
  4. Do knowledge transfer meetings — 30 minutes per process, structured
  5. Test with a real vacation — long weekend first, then a week, then two

Write the runbook. Do the shadow week. Take the vacation.

You'll come back rested, and the team will be more capable than when you left. That's not dispensable. That's good engineering.