Last night’s downtime

Yesterday’s outage was the worst downtime we’ve ever had for blogs. I designed the architecture to separate the main platform from hosted blogs, so that all the complex moving pieces of the platform — databases, the API, multiple servers — couldn’t affect performance of your own blog. This has worked out so well over the years that it has made me lax about planning how to recover quickly from worst-case scenarios.

So what happened? We noticed some performance issues and glitches, with unusually high CPU usage. On reboot, the disk check failed, and I was unable to repair it. To make matters worse, I could not recover from the latest backup because Linode’s backup service was also down for unscheduled maintenance.

As I wrote in more detail a few months ago, we store important files like photos in multiple places so that if disaster strikes we can still rebuild your blog. With the backups down, I set out to do that kind of rebuild, but I had never done it on quite that scale before. One problem I had overlooked is how long it would take to update HTTPS certificates for custom domain names.

After a few false starts, I scrapped that restoration work when Linode’s backups came back online, and was able to get everything up and running again more quickly. I learned a lot, and Vincent and I will be talking through some next steps to make this more robust. I never want to go through an extended outage like this again. I’m very sorry it happened, and I’m thankful for everyone’s patience.

I’ll be keeping an eye out for any lingering problems. Please reach out if you notice anything that looks wrong.

Manton Reece @manton