3:17 AM. Phone buzzes. The uptime monitor — a service that pings your website every few minutes and alerts you when it stops responding — says your app is dead. No team. No on-call rotation. No SRE (site reliability engineer, the person whose entire job is keeping services alive). Just you, your laptop, and adrenaline.

This isn't hypothetical. On March 1, AWS suffered a cascading failure that started in the UAE region and rippled across US-EAST-1, taking down 38 services and leaving solo founders staring at error dashboards with nobody to call. Then on March 27, Cloudflare Pages broke custom domain management for hours — meaning founders who'd deployed their marketing sites through Pages watched their domains vanish from the internet mid-workday.

I've been here. More times than a capybara should admit. The first few incidents, I panicked and made things worse. Now I have a playbook. It removes panic from the equation and replaces it with steps. Here's the whole thing. 📋

Step 0: Don't fix anything yet

Counterintuitive. Your app is down. Every second costs money, reputation, or both. Your instinct screams: fix it now.

But the most dangerous thing you can do during an incident is act without understanding what broke.

I've watched solo founders SSH into production at 3 AM (SSH is a secure remote terminal connection to your server), run a command from memory, and take down the database alongside the app. One problem became two. The original problem was a 20-minute fix. The database recovery took 6 hours.

Rule zero: before you touch anything, spend 2 minutes understanding the situation. Not 20 minutes. Two. Read the error. Check the logs. Form a hypothesis. Then act.

Step 1: Triage — 2 minutes

Ask three questions:

Is the service completely down or partially degraded? Hit your health endpoint — a special URL that reports whether your app's core systems are functioning, like a built-in heartbeat check. If the app loads but API calls (requests from your frontend to your backend) fail, that's partial. If nothing loads, that's total. This determines urgency.

Are users currently affected? Check your analytics. If it's 3 AM in your timezone and your users are in the same timezone, maybe five people noticed. If your users are global, hundreds might be staring at an error page right now.

When did it start? Check your monitoring dashboard. If it broke 5 minutes ago, it's probably tied to the last thing that changed. If the service has been limping for 3 hours and you only now got the alert, your monitoring has a gap you need to fix tomorrow.

Write the answers down. A notebook, a message to yourself, a text file. This becomes your incident log — the single document tracking everything about this outage. You'll thank yourself in the morning.
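Here's a minimal sketch of that triage as shell functions. It assumes your app exposes a health endpoint (the URL in the usage comment is a placeholder) and that your incident log is a plain text file. The function names and the incident.log filename are mine, not a standard. The labels are rough on purpose: they tell you where to look, not what's wrong.

```shell
# Map an HTTP status code to a rough triage label.
# curl reports 000 when it could not get any HTTP response at all.
classify_status() {
  case "$1" in
    2??|3??) echo "up" ;;
    5??)     echo "erroring" ;;
    000|"")  echo "unreachable" ;;
    *)       echo "check-manually" ;;
  esac
}

# Fetch only the status code and give up after 10 seconds.
probe() {
  curl -s -o /dev/null -m 10 -w '%{http_code}' "$1"
}

# Append a UTC-timestamped line to the incident log.
log_note() {
  printf '%s  %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "$1" >> "${INCIDENT_LOG:-incident.log}"
}

# Usage (hypothetical URL):
#   code=$(probe "https://yoursite.com/health")
#   log_note "health endpoint returned $code: $(classify_status "$code")"
```

"erroring" means the server answers but the app behind it is failing, which is usually the partial case. "unreachable" is the total one.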

Step 2: Communicate — 1 minute

Even if nobody is awake, post a status update. Your status page, your social media, your Discord — wherever your users check. One sentence:

"We're aware of an issue affecting [service]. Investigating now. Will update within 30 minutes."

Silence is scarier than a known outage. Users who see "investigating" wait patiently. Users who see nothing assume the worst and start posting about it. One minute of communication buys you 30 minutes of quiet investigation. ⚙️

Step 3: Check the obvious — 5 minutes

In my experience, roughly 80% of incidents at small companies trace back to one of five causes:

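1. Disk full. Run df -h (shows disk space in human-readable format). If any filesystem reads 100%, that's your culprit. Quick fix: run du -sh /var/log/* to find the oversized log files, then clear them. If a running process still holds a log open, deleting it won't free the space until that process restarts; truncate -s 0 /path/to/that.log empties it in place instead.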

2. Out of memory. Run free -h (shows RAM usage). If available memory is near zero, something is hoarding it. ps aux --sort=-%mem | head -10 lists the top memory consumers — the digital equivalent of finding who left all the lights on. Kill the bloated process, restart the service.

3. Process crashed. Run systemctl status your-app — systemctl is Linux's service manager, the tool that starts, stops, and monitors your applications. If it says "inactive (dead)" or "failed," restart it: systemctl restart your-app. Then check why it crashed: journalctl -u your-app --since "1 hour ago" (journalctl reads the system's event diary).

4. SSL certificate expired. SSL (Secure Sockets Layer; the modern protocol is TLS, but the old name stuck) is what puts the padlock icon in your browser: it means the connection is encrypted. These certificates expire. Let's Encrypt certificates last 90 days. If you forgot auto-renewal, this is a 3 AM problem waiting to happen. Fix: certbot renew && systemctl reload nginx (assuming nginx is your web server). Set up Certbot's automatic renewal this weekend so this never happens again.

5. DNS issue. DNS (Domain Name System) is the internet's phonebook — it converts "yoursite.com" into a server address computers can find. Run dig yoursite.com to check. If it doesn't resolve, your DNS provider might be having issues. Or your domain expired. Yes, domains expire. I've seen it happen to funded startups.
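The disk, certificate, and DNS checks are easy to script so you're not typing from memory at 3 AM. A rough sketch, assuming a Linux box with df, awk, openssl, and dig installed. The function names and the 90% threshold are illustrative choices, and a mount point containing spaces would confuse the df parser.

```shell
# Read `df -P` output on stdin; print mount point and usage for every
# filesystem at or above the given percentage.
full_filesystems() {
  awk -v limit="$1" 'NR > 1 {
    pct = $5
    sub(/%/, "", pct)
    if (pct + 0 >= limit + 0) print $6, pct "%"
  }'
}

# Succeed if the certificate served on port 443 is still valid N days from now.
# Fails if it expires sooner, or if the certificate cannot be fetched at all.
cert_valid_for_days() {
  echo | openssl s_client -servername "$1" -connect "$1:443" 2>/dev/null |
    openssl x509 -noout -checkend $(( $2 * 86400 ))
}

# Succeed if the name resolves to at least one record.
resolves() {
  [ -n "$(dig +short "$1")" ]
}

# Usage:
#   df -P | full_filesystems 90
#   cert_valid_for_days yoursite.com 7 || echo "certificate expiring soon or unreachable"
#   resolves yoursite.com || echo "DNS is not resolving"
```

Keeping the parser separate from the df call is deliberate: you can feed it saved output and check it behaves before you need it.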

If none of these five match, you're in the 20% that needs real debugging. Move to Step 4.

Step 4: The recent change audit — 5 minutes

Something changed. Services don't break spontaneously — like plumbing, they fail because something shifted. Ask:

  • Did I deploy something recently? Deploy means pushing new code to your live server. Check git log --since="24 hours ago" to see recent code changes.
  • Did I change any configuration? Check your config files' modification timestamps.
  • Did a dependency update? A dependency is someone else's code your app relies on — a library, a framework. Check your package lock file for recent changes.
  • Did the hosting provider have an issue? Check their status page.
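The first three questions each map to a command. A minimal sketch, assuming your code lives in a git repository and you know which directory holds your config. The function names are made up for this example, the paths in the usage comment are placeholders, and -mmin needs GNU or BSD find.

```shell
# List files under a directory modified within the last N minutes
# (default: 1440, i.e. 24 hours).
recently_changed() {
  find "$1" -type f -mmin "-${2:-1440}"
}

# One-line summary of commits from the last 24 hours in the current repository.
recent_commits() {
  git log --since="24 hours ago" --oneline
}

# Commits in the same window that touched one specific file,
# such as a package lock file.
recent_changes_to() {
  git log --since="24 hours ago" --oneline -- "$1"
}

# Usage (paths and filenames are examples):
#   recently_changed /etc/nginx
#   recent_commits
#   recent_changes_to package-lock.json
```

Empty output from all three is useful too: it points away from your own changes and toward the hosting provider's status page.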

The most common answer: you deployed something. The fix: roll back — revert to the previous working version. Not debug. Roll back. Get the service running, debug tomorrow.

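# If you tag your releases (version labels like v1.2.3),
# check out the last known-good tag and redeploy:
git checkout v1.2.3
bash deploy.sh

# If you don't tag versions yet (start doing this today),
# undo the most recent commit with a new commit and redeploy.
# --no-edit skips the commit-message editor:
git revert --no-edit HEAD
bash deploy.sh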

Rolling back feels like giving up. It's not. It's the most professional response you can make: prioritize uptime over ego. Fix the code tomorrow with coffee and daylight. 🍵
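If you do tag releases, you can skip looking up the last good version by hand. A sketch, assuming the broken release is the commit you currently have checked out and that deploy.sh sits in the repository root. previous_tag and rollback are hypothetical helper names.

```shell
# Print the most recent tag reachable from the parent of the current commit.
previous_tag() {
  git describe --tags --abbrev=0 'HEAD^' 2>/dev/null
}

# Check out that tag and redeploy. Runs in a subshell so a failure
# here never exits your interactive session.
rollback() (
  target=$(previous_tag) || { echo "no earlier tag found" >&2; exit 1; }
  echo "Rolling back to $target"
  git checkout --quiet "$target" && bash deploy.sh
)

# Usage, from inside the repository:
#   rollback
```

Checking out a tag leaves the repository in a detached HEAD state. That's fine for an emergency deploy; switch back to your branch before you start the real fix.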

Step 5: The 30-minute rule

If you haven't found the root cause — the actual underlying reason something broke, not just the symptom — within 30 minutes, escalate. "But I'm a solo founder. Escalate to whom?"

  • Your hosting provider's support. If you pay for managed hosting, use it. That's literally what it's for.
  • A contractor on retainer. Even $200/month for "I can text you at 3 AM twice a year" is worth it.
  • Your community. A relevant Discord server, Slack group, or forum. Post the error, your logs, what you've tried. Good communities respond fast.
  • An AI assistant. Paste the error into Claude or ChatGPT: "Here's my server error log. The service crashed at 3:17 AM. Here's what I've checked: [list]. What else should I look at?" It won't SSH into your server, but it can suggest diagnostic steps you missed.

The 30-minute rule exists because after half an hour of solo debugging at 3 AM, your judgment deteriorates. You start trying random things. Random changes on a live production server at 3 AM — that's how data disappears permanently.
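Your sense of time is unreliable at 3 AM, so let the shell keep count. A tiny sketch: record a start time when you begin triage, then ask whether the limit has passed. The function names are illustrative, and date +%s (seconds since 1970) is assumed to be available, as it is on Linux and macOS.

```shell
# Whole minutes elapsed between two Unix timestamps.
minutes_between() {
  echo $(( ($2 - $1) / 60 ))
}

# Succeed (exit 0) once the limit has been reached.
# Arguments: start timestamp, optional current timestamp, optional limit in minutes.
should_escalate() (
  start=$1
  now=${2:-$(date +%s)}
  limit=${3:-30}
  [ "$(minutes_between "$start" "$now")" -ge "$limit" ]
)

# Usage:
#   start=$(date +%s)          # run this when the alert wakes you
#   should_escalate "$start" && echo "30 minutes are up. Stop and escalate."
```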

Step 6: The post-incident morning

You survived the crisis. Go back to sleep. Seriously. The postmortem — the structured analysis of what went wrong and how to prevent it — happens tomorrow. With coffee. With a clear head. 🛁

Tomorrow's checklist:

  1. What broke? One sentence.
  2. What was the root cause? Not "the server crashed" but "misconfigured log rotation filled the disk to 100%."
  3. What was the impact? Duration, users affected, revenue lost if measurable.
  4. What prevented faster detection? Fix that monitoring gap.
  5. What prevented faster recovery? Add that step to your playbook.
  6. What prevents this from happening again? Implement it this week. Not "someday." This week.

Write this in a file. incidents/2026-03-27.md. You're building institutional knowledge — a searchable history of what broke before and what fixed it. When the next incident hits, past-you has already left notes.
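A blank page is one more excuse to skip the postmortem, so generate the skeleton. A minimal sketch that writes the six questions into a dated file; new_postmortem is a made-up name, and the incidents directory is the default only because that's the path used above.

```shell
# Create DIR/DATE.md containing the six postmortem questions.
# Refuses to overwrite an existing file and prints the path it created.
new_postmortem() (
  dir=${1:-incidents}
  day=${2:-$(date +%Y-%m-%d)}
  file="$dir/$day.md"
  mkdir -p "$dir"
  if [ -e "$file" ]; then
    echo "$file already exists" >&2
    exit 1
  fi
  cat > "$file" <<EOF
# Incident $day

1. What broke?
2. What was the root cause?
3. What was the impact (duration, users affected, revenue)?
4. What prevented faster detection?
5. What prevented faster recovery?
6. What prevents this from happening again?
EOF
  echo "$file"
)

# Usage:
#   new_postmortem                       # incidents/<today>.md
#   new_postmortem incidents 2026-03-27  # explicit date
```

The overwrite check matters: the second incident on the same day should not erase your notes from the first.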

The pre-incident setup

The best incident response happens before the incident. Here's what to configure this weekend:

  • Uptime monitoring. UptimeRobot offers a free tier: 50 monitors, 5-minute intervals. It pings your site and alerts you when it goes down. Set it up once, forget about it. ✅
  • Log rotation. Configure logrotate — a Linux utility that automatically compresses and deletes old log files — for every log your app produces. Disks don't fill up when logs are managed.
  • SSL auto-renewal. Certbot with a cron job (a scheduled task that runs automatically on a timer). Never manually renew a certificate again.
  • Automated backups. Database dump to S3 (Amazon's cloud storage) or any object storage, every 6 hours. Test the restore process at least once. A backup you've never restored is a hope, not a backup.
  • A rollback script. One command to revert to the previous version. No thinking required at 3 AM.
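The backup and renewal bullets reduce to a couple of scheduled commands. A sketch under loud assumptions: PostgreSQL as the database (pg_dump), the AWS CLI for the upload, and placeholder values in DB_NAME and BUCKET. None of those come from this playbook, so swap in whatever you actually run.

```shell
# Build a sortable, timestamped dump filename.
# Arguments: database name, optional timestamp (defaults to now, in UTC).
backup_name() {
  printf '%s-%s.sql\n' "$1" "${2:-$(date -u +%Y%m%dT%H%M%SZ)}"
}

# Dump, compress, upload, clean up. DB_NAME and BUCKET are placeholders
# you must set; the dump is written first so a failed pg_dump stops the run.
run_backup() (
  sql="${TMPDIR:-/tmp}/$(backup_name "$DB_NAME")"
  pg_dump "$DB_NAME" > "$sql" || exit 1
  gzip -f "$sql" || exit 1                # produces "$sql.gz"
  aws s3 cp "$sql.gz" "s3://$BUCKET/" && rm -f "$sql.gz"
)

# Example crontab entries (script path is a placeholder):
#   0 */6 * * *  /usr/local/bin/backup.sh
#   17 3 * * *   certbot renew --quiet --deploy-hook "systemctl reload nginx"
```

Then do the part everyone skips: pull one dump back down and restore it into a scratch database.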

Total setup: roughly 3 hours on a calm Saturday afternoon. Those 3 hours protect your business the next time something breaks in the dark.

The calmest founders I know aren't calm because nothing breaks. Things break for everyone. They're calm because they have a playbook. They've been through this before. They know what to do next. And they know — deeply, from experience — that panicking has never, not once, fixed a server. 🫶
