
Problems flagged months before HPE hardware failure hit ATO systems

Leon Spencer | June 12, 2017
At least 77 events related to components that failed in the December 2016 outage were logged months earlier.

 

The second outage

The report also revealed that a second outage that hit the ATO's systems on 2 February was caused by further issues associated with the cables.

The second outage followed remedial work by HPE on the SAN fibre optic cables, according to the ATO. The agency was informed that data cards attached to the SAN had been dislodged during one cable replacement exercise.

"This caused the 3PAR SAN to act in a similar way to that noted during the December outage," the ATO said. "This included unsuccessful steps to automatically remediate, followed by a systems shut‑down to preserve data integrity. HPE communicated this Priority 1 incident to us immediately."

As a result, HPE and the ATO monitored the cables around the clock following the outage, until they were comprehensively replaced between 23 and 26 March.

"We have since been advised that SAN alerts ceased completely once the new fibre optic cables were installed," the ATO said in its report.

The report also outlines other issues that arose when the initial outage occurred early in the morning on 12 December 2016, revealing that firmware supporting impacted disk drives in the affected SAN prevented those drives from re-booting.

Despite having met ATO-specified conditions for categorisation as a "Priority 1" incident, service provider logs indicated the incident was not escalated to this level until around 7.00am that morning, almost seven hours after the hardware first struck trouble.

Further, system management, configuration, monitoring, and data recovery systems that relied on the SAN also experienced outages, which extended the recovery process for some applications.

In addition, the impact of pre-incident design and build decisions was material in extending the time to recover data and bring production and supporting systems online.

The SAN was neither designed nor built to cater for greater than single drive failure or single cage failure, the report said.

The storage hardware build also included a "daisy‑chain" cage configuration, which exacerbated the risk of errors spreading across cages, as occurred during the incident.

"Although a viable design option at the time of SAN implementation, no evidence has been presented of subsequent options being explored by HPE to mitigate this risk," the ATO said.

 

No ATO access under HPE deal

Meanwhile, the report revealed that ATO IT staff had no direct access to the new infrastructure operated by HPE. This was a condition of the agreement the ATO struck with its technology partner for the 2015 replacement of the pre-existing EMC Corporation SAN with HPE's own 3PAR SAN2 hardware.

The storage solution provided by HPE to the ATO comprised a primary 3PAR SAN in Sydney with a backup 3PAR SAN in Western Sydney.

Under the deal, the ATO engaged HPE to provide turn-key IT solutions, whereby HPE designed, owned and operated the computing infrastructure and provided services to the "required ATO standard".

 
