Finally, put regular "mock failures" on your calendar. Walk through a given outage scenario with everyone who would be involved, and write down each person's responsibilities. Take the opportunity to engage all stakeholders without the pressure of a real outage. That way, your plan will be well rehearsed when the inevitable does happen.
2. Isolate Your Communications From Your Service Platform
You might think eating your own dog food is a good policy. Putting your telephones, email, instant messaging and real-time communications right there in your super-fast data center, alongside the services you offer, seems to make sense.
Most of the time, it may work out well - but even a first-year junior systems administrator can see the issue with this setup. Once network connectivity is interrupted in that data center, for any reason, you're toast. You can't communicate. Your service is down. Customers get angry. Employees can't work.
If you run an ecommerce site, you can't complete orders or charge credit cards, and revenue evaporates. If customers can't phone in an order either, though, you risk losing not only the order but the customer, too. The losses of an outage simply multiply in this scenario. In the example of the Intermedia outage, CEO Phil Koen notes that, "As our communication systems reside in the same datacenters, our ability to communicate with customers and partners was disrupted."
That's a quick way to watch your customers go elsewhere. It boggles the mind that a company that prides itself on providing fault-tolerant hosted services made such a tremendous error in both its service topology and its ability to handle an outage. Don't make the same mistake.
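One way to catch this kind of shared dependency before an outage does is to keep a simple inventory of where each system lives and check it for overlap. The following Python sketch is illustrative only: the `System` record, the location labels and the sample inventory are all hypothetical, and a real audit would pull this data from your own asset records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class System:
    name: str
    kind: str      # "service" (customer-facing) or "communication"
    location: str  # data center or provider that hosts the system


def shared_locations(systems):
    """Return locations hosting both a service and a communication channel."""
    service_sites = {s.location for s in systems if s.kind == "service"}
    comms_sites = {s.location for s in systems if s.kind == "communication"}
    return sorted(service_sites & comms_sites)


def channels_surviving(systems, failed_location):
    """Return the communication channels still reachable if one location fails."""
    return sorted(
        s.name
        for s in systems
        if s.kind == "communication" and s.location != failed_location
    )


# Hypothetical inventory, for illustration only.
INVENTORY = [
    System("storefront", "service", "dc-east"),
    System("payments", "service", "dc-east"),
    System("email", "communication", "dc-east"),
    System("phones", "communication", "hosted-voip-provider"),
    System("status-page", "communication", "external-host"),
]

if __name__ == "__main__":
    print("Shared locations:", shared_locations(INVENTORY))
    print("Channels left if dc-east fails:", channels_surviving(INVENTORY, "dc-east"))
```

Any location that shows up in the shared list is a place where a single network interruption takes down both the service and the means of telling people about it.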
3. Communicate, Communicate and Communicate
When in doubt, communicate some more. The temptation during an outage is to focus on fixing the problem with just about every resource you can muster to put on the task. Don't forget there are other stakeholders in the issue, depending on whether your outage is internal, external or both.
If you run a service for customers, they expect - and deserve - to know what's going on and to receive an estimated time to service restoration. ("Estimated time to service restoration," by the way, means "half an hour" or "by noon," not "shortly" or "as soon as possible.") Meanwhile, if you experience an outage on an internal system, especially one that happens to be business-critical, then you need to send updates to affected parties both as soon as you understand that there's an issue and then at regular, frequent intervals until the issue is resolved.
Communication can't be an afterthought. It must be a high priority - second only to resolving the outage. Don't make a bad situation worse by creating an information vacuum.
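To make "regular, frequent intervals" concrete, it can help to work out the update times the moment an outage is detected. The Python sketch below is one possible approach, not a prescribed process: the function names, the 30-minute default interval and the message wording are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def update_schedule(detected_at, expected_restore, interval_minutes=30):
    """Return update times: one at detection, then one per interval,
    ending with a final update at the expected restoration time."""
    if interval_minutes <= 0:
        raise ValueError("interval_minutes must be positive")
    step = timedelta(minutes=interval_minutes)
    times = []
    current = detected_at
    while current < expected_restore:
        times.append(current)
        current += step
    times.append(expected_restore)
    return times


def format_update(service, now, expected_restore):
    """Build a status message that states a concrete restoration estimate."""
    remaining = expected_restore - now
    minutes = max(0, int(remaining.total_seconds() // 60))
    return (
        f"{service}: outage under investigation. "
        f"Estimated restoration by {expected_restore:%H:%M} "
        f"(about {minutes} minutes from now)."
    )


if __name__ == "__main__":
    detected = datetime(2024, 1, 1, 9, 0)
    restore = datetime(2024, 1, 1, 10, 30)
    for when in update_schedule(detected, restore):
        print(when.strftime("%H:%M"), format_update("Email", when, restore))
```

The message always names a specific time rather than "shortly," and if the estimate slips, the schedule can simply be regenerated from the new restoration time.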