BlackBerry maker RIM had a single point of failure for most of the countries affected by this week's catastrophic outage, prompting serious business continuity questions.
Analysts have told Computerworld UK that RIM must consider building more data centres if it wants to avoid the same risks in the future.
Many BlackBerry users in the UK started to see their services return to normal late on Wednesday night, with email returning along with full access to BBM instant messaging and general web access, after initial delays.
In a press briefing yesterday evening, RIM chief technology officer David Yach confirmed the technical problem was down to a faulty core switch in the main Slough network operating centre, which routes BlackBerry traffic across Europe, the Middle East, Africa, India and for three operators in South America.
This was a single point of failure, as the failover system did not kick in to switch traffic to another switch at the same site. There was no way RIM could have re-routed the traffic to bypass Slough, analysts said, as it is one of only two main network operating centres RIM runs to serve its worldwide network, the other being in Waterloo, Canada, where the company has its headquarters.
This had a knock-on effect on countries that Slough does not serve: traffic sent from North America and Asia to countries served by Slough backed up and brought down the entire global BlackBerry network. The network became overloaded, much as it would in a denial of service attack, but Yach said there was no evidence of a security incident in relation to the outage.
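The way a short outage can turn into a much longer backlog can be illustrated with a toy queue model. The following is a minimal sketch and not a description of RIM's actual systems: the function name `simulate_backlog` and every figure in it (arrival rate, capacity, outage timing) are hypothetical values chosen purely for illustration.

```python
def simulate_backlog(arrival_rate, capacity, outage_start, outage_end, total_ticks):
    """Return the number of queued messages at the end of each tick.

    All parameters are hypothetical illustration values:
    arrival_rate  - messages arriving per tick
    capacity      - messages the routing centre can forward per tick when healthy
    outage_start  - first tick of the outage (inclusive)
    outage_end    - first tick after the outage (exclusive)
    """
    backlog = 0
    history = []
    for tick in range(total_ticks):
        # New traffic keeps arriving whether or not the centre is working.
        backlog += arrival_rate
        # During the outage nothing is forwarded; otherwise drain up to capacity.
        available = 0 if outage_start <= tick < outage_end else capacity
        backlog -= min(backlog, available)
        history.append(backlog)
    return history


# A 3-tick outage with only 20% spare capacity once service resumes.
history = simulate_backlog(
    arrival_rate=100, capacity=120, outage_start=2, outage_end=5, total_ticks=25
)
peak = max(history)                 # largest queue reached
cleared_at = history.index(0, 5)    # first tick after the outage with an empty queue
print(peak, cleared_at)
```

In this toy model the queue only shrinks by the spare capacity each tick, so the backlog takes several times longer to clear than the outage itself lasted. That is consistent with services returning only after initial delays once the fault was fixed.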
There are now 70 million BlackBerry users worldwide, with RIM driving up the number of users in response to the market threat posed by the iPhone and Android devices.
Ovum analyst Nick Dillon warned RIM might have to consider building extra regional data centres if major outages became a regular occurrence.
He told Computerworld UK: "RIM only has two global NOCs (network operation centres) for its data traffic, and in many ways that's the strength of their system in the efficient push delivery of email messages to users.
"But at the same time they have two fundamental possible points of failure, this has always been an architectural risk for them. With growth in the user base, the bigger the risk to these potential single points of failure."
As a result, said Dillon, "there may be a question as to whether there should be more regional NOCs available to RIM" to enable traffic to be re-routed when outages occur or during traffic spikes.
After the major BlackBerry outage in the US in 2008, which affected 12 million users, RIM opened a new regional data centre in Texas in 2009 and started building another one in Atlanta, Georgia.