June’s T-SQL Tuesday is brought to us by Allen Kinsel (Blog|Twitter). Allen lives in an area that’s frequently the target of hurricanes. Since June brings the beginning of a new hurricane season, Allen is thinking of disasters and recovery. He says that this month’s T-SQL Tuesday is open to “Anything you’d like to blog about related to preparing for or recovering from a disaster would be fair game, have a great tip you use to keep backups and recovers running smoothly, a horrific story of recovery gone wrong? or anything else related to keeping your systems online during calamity.”
When I saw this topic, I wondered what I might add to the discussion. I’ve experienced the same unfortunate incidents that most of you have: the air conditioning unit that shut down during a hot summer weekend, causing equipment failure in the server closet; a roof failure during a severe thunderstorm that sent rainwater pouring onto a server rack (the immediate solution was to grab umbrellas); both hard drives in a RAID 1 array failing at the exact same moment (don’t believe anyone who says the probabilities make this event almost impossible). When my father leased a computer for his business over 30 years ago, it caught fire one day and burned a good part of the room it was in. Perhaps that’s when I came to expect that working in technology would involve dealing with mishaps.
One observation I often like to share is the idea of making recovery procedures a part of normal operations wherever possible. This way, when a disaster occurs, at least part (or, depending on the circumstances, possibly all) of the recovery process is familiar. Otherwise you are relying on recovery rehearsals and drills, and that is exactly the kind of thing that falls off the calendar or out of the budget when people get busy, during cost-cutting, and so on.
Along these lines, I recently read about the Chaos Monkey implementation used by Netflix (you’ll want to read that link if you aren’t familiar with the topic). The idea of randomly killing components as a way to force familiarity with failure and reveal weaknesses in recovery planning is interesting. Many DBAs and database developers are comfortable with the myriad issues that typically fall under their purview. However, as a DBA or database developer you could still find yourself dealing with a failure outside that comfort zone, simply because you are the best alternative available.
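To make the Chaos Monkey idea concrete, here is a minimal sketch of its core loop in Python. This is not Netflix’s implementation — the service names and the `kill` callback are hypothetical stand-ins; in a real setup the callback would actually stop a process or instance.

```python
import random

def pick_victim(services, rng=random):
    """Chaos Monkey's core idea: choose one running component at random."""
    return rng.choice(services)

def unleash(services, kill, rng=random):
    """Pick a random service, 'kill' it via the supplied callback,
    and return the victim's name so the event can be logged."""
    victim = pick_victim(services, rng)
    kill(victim)  # e.g. stop the process, detach its network, etc.
    return victim

# Hypothetical example: record which service got "killed".
services = ["message-queue", "report-worker", "cache"]
killed = []
victim = unleash(services, killed.append)
```

The point is not the trivial random choice but the discipline around it: because any component can die at any time, recovering from a dead component becomes routine rather than a rare, unrehearsed emergency.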
Which brings me to my “Human-Sized Disasters” title. Here’s the story.
A firm had developed a message queue solution in C, and the only person familiar with it was the developer who wrote it. The developer had access to production, so no administrators (or other developers) had ever needed to deal with the message queue. The only issues that ever occurred with it happened during business hours, so the developer would fix the problem on production, restart the queue, and everyone would be happy. All was well until the message queue quit working one night around 1:00 a.m. Beepers went off and the system administrators escalated the issue until it was confirmed that only one person, the solution’s developer, could address it. So the dreaded call-a-developer-in-the-middle-of-the-night phone call was needed. Which is where the “Human-Sized Disaster” enters the scene.
The developer’s spouse (a.k.a. the Human-Sized Disaster) answered the phone, refused to wake the developer, and rudely hung up. A redialed call revealed that the spouse had then unplugged the phone from the wall (this was in the olden days of landline phones, when some humans, even software developers, didn’t have cell phones).
The message queue solution controlled various database processing components implemented in Java, mostly created by me. So I was the next developer to be called in the middle of the night, on the theory that I might have some idea of how to deal with the message queue. Of course, the recovery solution consisted of me sitting with a production administrator all night, playing the role of the queue by manually running components while reading through the message queue’s source code in search of a more permanent fix.
As you might imagine, I occasionally recall this event and wonder how it could have been prevented. One action item is to avoid building in-house a solution that’s widely available elsewhere with support. I prefer checking out SourceForge, CodePlex, etc., or finding out whether commercial solutions are available and at what price, before undertaking implementation of any solution. It’s not as much fun as rolling everything yourself, but it can reduce sleepless and stressful nights. Other action items include making sure nothing is known by only one person, and not letting developers work directly on production unless you have appropriate practices in place for allowing it (I qualify this because the devops movement is interesting to me). And of course, document things, because reading through C source code isn’t a pleasant way to learn how a component works when you are under time pressure. Beyond this, the Chaos Monkey concept could be something to try, although I believe it’s going to be a hard sell in most organizations.
I’m looking forward to reading the contributions to this month’s T-SQL Tuesday and finding out what everyone else has to share about disasters and recovery. Thanks to Allen Kinsel for hosting T-SQL Tuesday #19, and thanks again to Adam Machanic (Blog|Twitter) for creating this monthly blog event!