4 ways CIOs can respond to a service outage

Jonathan Hassell | Oct. 1, 2013
Nasdaq and Intermedia are among the latest firms to suffer lengthy -- and public -- service outages. Eventually, the same thing will happen to you. Here are four key lessons IT leaders can learn from others' mistakes.

Clearly, it hasn't been a good few weeks for Nasdaq. First, trading on the exchange halted for more than three hours on Aug. 22. Nasdaq's brief post-mortem statement blames a software bug and a backup system that failed to actually activate when a fault was detected. However, Reuters reports that a person familiar with what happened says connection problems with NYSE Euronext's Arca Exchange triggered the entire event.

Adding insult to injury, Nasdaq suffered a six-minute outage on Wednesday, Sept. 4. Though it involved the same system that was the culprit in the larger outage, a Nasdaq statement says a "hardware memory failure in a back-end server" caused this one.

It also wasn't a great return from the Labor Day holiday for Intermedia, one of the world's largest providers of hosted Microsoft Exchange services. On Sept. 3, the day after a long weekend in the United States, the provider had a five-hour outage that left email messages inaccessible. (Full disclosure: My company hosts its email service with Intermedia.) On top of that, Intermedia's telephone service was hosted in the same data centers that suffered the outage, leaving its help desk unreachable and making this outage much worse than it ordinarily would've been. It also took Intermedia hours to post messages on Twitter explaining the outage and its efforts to resolve it - and those messages pointed customers to a service status page hosted on a customer portal that no one could access because, you guessed it, the platform suffering the outage hosted it, too.

As a saying popular among politicians goes, "Don't ever let a good crisis go to waste." There are lessons IT leaders can learn from these companies' very public problems. Here are four takeaways you would do well to heed.

1. Regularly Test for, and Plan for, Disasters
Disasters happen. People regularly argue that you should be more positive about your operations and your deployments. But you can be positive about this: Stuff will fail and systems will go down. It's not a matter of if - it's a matter of when. Understand what an outage is going to look like for you - and understand what needs to happen when it does.

Much of this disaster planning depends on what type of service you provide. If you're a CIO charged with maintaining email service for 100,000 employees, your disaster plan will look different from that of a technical team that serves 500,000 external customers. Understand how outages will affect different parts of your business.

Know what mitigation costs, as well as what backups cost and what standby systems cost. Investigate how cloud computing services such as Amazon Web Services and Windows Azure can make a tense outage situation a little more bearable, thanks to the ability to spin up services on demand, when you need them, and shut them down once your situation has eased.
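One rough way to put numbers on that comparison is sketched below. It weighs the expected yearly cost of downtime against the yearly cost of a standby system that shortens each outage. Every figure and function name here is a hypothetical placeholder chosen for illustration, not data from Nasdaq, Intermedia or any cloud provider; substitute your own estimates.

```python
def expected_annual_outage_cost(outages_per_year, hours_per_outage, cost_per_hour):
    """Expected yearly downtime cost: frequency x duration x hourly cost."""
    return outages_per_year * hours_per_outage * cost_per_hour


def standby_net_benefit(outages_per_year, cost_per_hour,
                        hours_without_standby, hours_with_standby,
                        standby_cost_per_year):
    """Yearly savings from shorter outages, minus what the standby costs.

    A positive result suggests the standby pays for itself under the
    stated assumptions; a negative one suggests it does not.
    """
    baseline = expected_annual_outage_cost(
        outages_per_year, hours_without_standby, cost_per_hour)
    mitigated = expected_annual_outage_cost(
        outages_per_year, hours_with_standby, cost_per_hour)
    return baseline - mitigated - standby_cost_per_year


# Hypothetical inputs: two outages a year, $20,000 lost per hour of downtime,
# five hours per outage today versus one hour with a standby that costs
# $60,000 a year to keep available.
baseline_cost = expected_annual_outage_cost(2, 5, 20_000)
net_benefit = standby_net_benefit(2, 20_000, 5, 1, 60_000)
print(f"Expected annual outage cost without standby: ${baseline_cost:,}")
print(f"Net annual benefit of the standby system: ${net_benefit:,}")
```

The model is deliberately simple: it ignores reputational damage, contractual penalties and the chance that the standby itself fails to activate, all of which a real plan would need to weigh.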

