On Apr 22, 2018, the Listen Notes website was down for ~4 hours, and I lost 10 hours of production data. This was the biggest outage since I launched Listen Notes as a side project last year.
I’m glad to have this outage now rather than later, because it allows me to take the necessary measures to prevent it from happening again. It would be far more costly to have an outage like this in a few months, when Listen Notes will have more users and I’ll have more responsibility :) I have to publicly document the postmortem to remind myself: be careful!
Timeline
- 2:00pm Apr 22 (PDT): I got an alert from Datadog that ListenNotes.com was down.
- 2:01pm Apr 22 (PDT): I verified that the website had no response. I opened the Datadog dashboard and saw that prod-db1 was under high load. I tried to log into prod-db1 (well, there was only one prod db machine…), but I couldn’t.
- 2:10pm Apr 22 (PDT): I gave up trying to log into prod-db1 and rebooted it from the AWS web console.
- 2:13pm Apr 22 (PDT): I got more alerts, mostly 500s from prod-web. I realized: Oh shit! The database was gone! I’ll explain why a bit later in this post.
- 2:14pm Apr 22 (PDT): I located the database backup and started to restore the prod db. Another bummer: the latest db dump I had was 10 hours old, which meant I’d lost 10 hours of production data. All right, it was still better than nothing. Otherwise, Listen Notes, Inc. could’ve ceased operation right away. Haha~
- 2:30pm Apr 22 (PDT): While waiting for the prod db restore, I manually stopped all crawler processes, web servers, API servers, …
- 4:40pm Apr 22 (PDT): Sent this tweet.
- 5:00pm Apr 22 (PDT): Finally, database recovery was done. It took so long!
- 5:01pm Apr 22 (PDT): Started web servers. Website was back.
- 5:05pm Apr 22 (PDT): I got alerts saying the website was down again.
- 5:07pm Apr 22 (PDT): prod-db1 had high load. But luckily, I could log into prod-db1 this time.
- 5:10pm Apr 22 (PDT): I noticed that Redis took up 10GB of memory. Okay, I know what you’re thinking — yes, I run Redis and Postgres on the same production instance. This is a tiny startup, and I try to save some money.
- 5:15pm Apr 22 (PDT): I managed to kill Redis. Why did Redis take up so much memory? Normally Redis uses 200MB or so for Listen Notes.
- 5:30pm Apr 22 (PDT): Ran a bunch of redis-cli commands to check stats. 80k keys in total, which seemed fine. Downloaded redis-memory-analyzer and found that a bunch of django-page-cache keys took up 99% of the memory (see the sketch right after this timeline). Hmm, this was definitely fishy.
- 5:33pm Apr 22 (PDT): I inspected some django-page-cache values and got huge HTML output… Okay, I knew what was going on — I’ll explain a bit later in this post.
- 5:50pm Apr 22 (PDT): I made some code changes and deployed to production.
- 6:00pm Apr 22 (PDT): I deleted a bunch of django-page-cache keys from Redis. Restarted web servers. Website was back.
- 6:25pm Apr 22 (PDT): Things were under control. I restarted all offline processing tasks and API servers.
- 6:29pm Apr 22 (PDT): Sent this tweet.
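For the curious, here’s a minimal sketch of the kind of inspection I did around 5:30pm, written with redis-py instead of raw redis-cli and redis-memory-analyzer. The connection details and key pattern are assumptions for illustration; "views.decorators.cache" is simply what Django’s per-view cache keys typically contain.

```python
import redis

# Connect to the Redis on prod-db1 (host/port here are placeholders).
r = redis.StrictRedis(host="localhost", port=6379, db=0)

print("total keys:", r.dbsize())  # ~80k in my case

# Walk the keyspace and add up how much memory the Django page-cache keys
# use. The MEMORY USAGE command requires Redis >= 4.0.
total_bytes = 0
page_cache_bytes = 0
for key in r.scan_iter(count=1000):
    size = r.execute_command("MEMORY USAGE", key) or 0
    total_bytes += size
    if b"views.decorators.cache" in key:
        page_cache_bytes += size

print("page cache share: %.1f%%" % (100.0 * page_cache_bytes / max(total_bytes, 1)))
```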
So what happened?
On Apr 21, I made a change to use Django page cache for every podcast / episode page on ListenNotes.com and set the expiration time to 24 hours.
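I don’t have the exact diff handy, but it was essentially Django’s per-view cache backed by Redis, roughly like this. The django-redis backend, URL patterns, and view names below are illustrative placeholders, not the real Listen Notes code:

```python
# settings.py -- a Redis-backed Django cache (django-redis is one common choice).
CACHES = {
    "default": {
        "BACKEND": "django_redis.cache.RedisCache",
        "LOCATION": "redis://prod-db1:6379/0",
    }
}

# urls.py -- cache every rendered podcast / episode page for 24 hours.
from django.urls import path
from django.views.decorators.cache import cache_page

from podcasts import views  # hypothetical app and view names

ONE_DAY = 60 * 60 * 24

urlpatterns = [
    path("podcast/<slug:slug>/", cache_page(ONE_DAY)(views.podcast_detail)),
    path("episode/<slug:slug>/", cache_page(ONE_DAY)(views.episode_detail)),
]
```

Each distinct URL gets its own cache entry holding the full rendered HTML, which matters a lot for the next point.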
Crawlers (Google, ahrefs, Bing…) hit A LOT of podcast / episode pages and each visited page was cached into Redis. That was why Redis memory was bloated.
Because Redis and Postgres are colocated on the same instance, that instance became unresponsive, and the production database became inaccessible.
My prod-db1 is an i3.large instance, and I stored the database data on the instance-store NVMe SSD — it’s blazing fast, but when you reboot the instance, you lose the data. That was why the database was gone after I rebooted prod-db1.
I had daily database backups, but that’s certainly not enough. That was why my latest database dump was 10 hours old.
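The backup itself was just a scheduled dump. Below is a minimal sketch of that kind of daily job; the database name, user, and output path are placeholders. The point is simply that a dump only protects data written before the moment it was taken, which is where the 10-hour gap came from.

```python
import datetime
import subprocess

def dump_database():
    """Run pg_dump once a day (e.g. from cron) and keep a timestamped dump."""
    stamp = datetime.datetime.utcnow().strftime("%Y%m%d-%H%M")
    outfile = "/backups/listennotes-%s.dump" % stamp
    subprocess.check_call([
        "pg_dump",
        "--format=custom",         # compressed, restorable with pg_restore
        "--file", outfile,
        "--username", "postgres",  # placeholder credentials
        "listennotes",             # placeholder database name
    ])
    return outfile

if __name__ == "__main__":
    print(dump_database())
```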
I made a code change to shorten the page cache expiration time from 24 hours to 5 minutes. This keeps Redis happy.
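In terms of the sketch above, the fix was basically just the timeout passed to cache_page:

```python
from django.views.decorators.cache import cache_page

from podcasts import views  # hypothetical, as in the earlier sketch

PAGE_CACHE_SECONDS = 60 * 5  # was 60 * 60 * 24

# Same wiring as before, only the TTL changes.
podcast_detail = cache_page(PAGE_CACHE_SECONDS)(views.podcast_detail)
episode_detail = cache_page(PAGE_CACHE_SECONDS)(views.episode_detail)
```

With a 5-minute TTL, a crawler sweep can still fill the cache with rendered pages, but the entries expire quickly instead of piling up for a full day.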
Follow-up action items
This outage was very close to the doomsday scenario for my little company, Listen Notes, Inc. — I could have lost all production data, which would have been really bad.
On Apr 23, I brought up prod-db2 as a hot standby Postgres db. If something goes wrong with prod-db1, I can promote prod-db2 to be the master db.
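I’ll skip the full streaming-replication setup here, but a quick sanity check I can run against the standby looks roughly like this (host and credentials are placeholders; the actual promotion is done outside Python, e.g. with pg_ctl promote):

```python
import psycopg2

# Connect to the standby (connection details are placeholders).
conn = psycopg2.connect(host="prod-db2", dbname="postgres", user="postgres")
cur = conn.cursor()

# True while prod-db2 is replaying WAL from prod-db1; it flips to False
# once the standby has been promoted to master.
cur.execute("SELECT pg_is_in_recovery();")
print("prod-db2 in recovery:", cur.fetchone()[0])

cur.close()
conn.close()
```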
Redis is still running on prod-db1, but I set up an alert on Datadog to monitor Redis memory usage. I also set maxmemory in redis.conf to cap the memory Redis can use.
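For reference, the cap in redis.conf is equivalent to the runtime commands below; the 2gb limit and eviction policy are illustrative values rather than my exact settings:

```python
import redis

r = redis.StrictRedis(host="localhost", port=6379)

# Equivalent redis.conf lines:
#   maxmemory 2gb
#   maxmemory-policy volatile-lru
#
# volatile-lru evicts least-recently-used keys that have a TTL (which the
# django-page-cache keys do) instead of letting memory grow unbounded.
r.config_set("maxmemory", "2gb")
r.config_set("maxmemory-policy", "volatile-lru")

# The number the Datadog alert watches: current Redis memory usage.
print(r.info("memory")["used_memory_human"])
```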
What doesn’t kill you makes you stronger!