HipChat explains what happened with their outage on Monday. Part of it was a reconnection bug in the new Mac client:
“When a large network provider in the SF Bay Area had an issue Monday morning, it caused all of those clients to start reconnecting at once. This saturated our systems and prevented normal usage.”
I could’ve done without the light-hearted mention of cat GIFs and the inline face-palm image, but otherwise it’s a good post. They apologize and succinctly explain the issue. Far too many downtime reports from other companies go on for pages of detailed text, as if to hide the true failure in unnecessary verbosity.
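For what it’s worth, the usual defense against this kind of reconnect storm is exponential backoff with random jitter, so that clients dropped at the same moment don’t all come back at the same moment. Here is a minimal sketch of the idea. I don’t know how HipChat’s client actually works, so the function name, the 1-second base, and the 5-minute cap are all assumptions made up for illustration.

```python
import random


def reconnect_delay(attempt, base=1.0, cap=300.0, rng=random):
    """Return how many seconds to wait before reconnect attempt `attempt`.

    Hypothetical helper: the window doubles with each failed attempt
    (1s, 2s, 4s, ...) up to `cap`, and the actual delay is picked
    uniformly at random inside that window ("full jitter"), which
    spreads clients out instead of having them retry in lockstep.
    """
    # Clamp the exponent so very large attempt counts cannot overflow.
    exponent = min(attempt, 32)
    window = min(cap, base * (2 ** exponent))
    return rng.uniform(0, window)


if __name__ == "__main__":
    # Show the shrinking odds of a collision as the window grows.
    for attempt in range(6):
        print(attempt, round(reconnect_delay(attempt), 2))
```

A client would call `reconnect_delay(attempt)` after each failed connection, sleep for that long, and reset `attempt` to zero once it connects.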
It usually takes a couple of problems hitting at once to cause a major server outage. This happened last week when Tweet Marker’s SSL certificate expired. I have the SSL set to auto-renew, but it still requires manually installing the new certificate, and other problems happened along the way.
First mistake: I didn’t realize it was expiring. Those emails go to an account I don’t check very often, littered with spam. And the email to confirm the renewal went to yet another email address that no longer worked, because when I moved the DNS hosting to Amazon’s Route 53, I neglected to move over the MX records.
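In hindsight, a small scheduled check would have caught this without depending on email at all. The sketch below is not something I had running at the time; it uses Python’s standard `ssl` module, and the function names and the 30-day warning threshold are my own assumptions for illustration.

```python
import socket
import ssl
import time


def fetch_not_after(host, port=443, timeout=10):
    """Return the certificate's 'notAfter' string for host.

    Uses the default verifying context, so a certificate that has
    already expired raises an SSL error during the handshake instead
    of returning a date.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after, now=None):
    """Days left before a 'notAfter' timestamp; negative if already past.

    `not_after` is in the format getpeercert() reports,
    e.g. "Jan  5 09:34:43 2018 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires - now) / 86400


if __name__ == "__main__":
    # example.com is a placeholder; substitute the real hostname.
    remaining = days_until_expiry(fetch_not_after("example.com"))
    if remaining < 30:  # assumed threshold, pick whatever suits you
        print("Certificate expires in %.0f days" % remaining)
```

Run from cron once a day, something like this turns a silent expiration into a warning weeks in advance.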
After fixing all of that, I tried updating the app on Heroku to use the new cert, only to get stalled as Heroku’s new SSL add-on rejected it. Certain I had done something wrong, I fumbled through a dozen Heroku SSL how-to posts before finally reverting to their old SSL add-on. It’s no longer documented and is in fact actively discouraged by Heroku, but it also has the lucky trait of actually working with my certificate. Updating DNS caused another hour-long delay because of the high TTL.
I sent two support requests during this process, so I thought I’d rate how each company did:
- DreamHost: Before I figured out the bad email address, I sent DreamHost a question about why the SSL certificate hadn’t shown up yet. They responded very quickly, and even included a “P.S.” that they were fans of Tweet Marker. Basically they provided excellent support, the best you could ask for.
- Heroku: When the new SSL add-on wasn’t accepting my certificate, I filed a support request with Heroku as well. The response was an automated reply that they don’t do support past 6pm. For a hosting company that charges a premium, this was a disappointing response. (They responded first thing the next morning, though.)
This SSL glitch was the only significant outage Tweet Marker has had in its first year. I learned a few lessons, took the opportunity to check backups and EC2 servers, and now I’m ready to move on. Hoping for an even better year 2.