Problems flagged months before HPE hardware failure hit ATO systems

Leon Spencer | June 12, 2017
At least 77 events related to components that failed in the December 2016 outage were logged months earlier.

Log data generated by the Hewlett Packard Enterprise (HPE) storage hardware being used by the Australian Taxation Office (ATO) revealed potential issues months before the agency's systems were hit by last year's massive outage.

The trouble struck in December last year, when an "unprecedented" failure of 3PAR storage area network (SAN) hardware, which HPE had upgraded in November 2015, caused widespread outages across many of the ATO's systems.

Now, the ATO has released its much-anticipated report into the outage, revealing that analysis of SAN log data for the six months preceding the incident indicated potential issues with the Sydney SAN similar to those experienced during the December outage.

HPE and fellow integration partner DXC Technology continue to investigate the issues related to the outage. The report reveals that, although HPE had taken some actions in response to the problems flagged by log data, alerts continued to be reported, indicating that the actions did not resolve the potential SAN stability risk.

Specifically, since May 2016, at least 77 events related to components that were observed to fail in the December 2016 outage were logged in the ATO's incident resolution tool, which is managed by IT contractor Leidos.

In addition, at least 159 alerts were recorded in SAN device monitoring and management logs, the ATO report stated.

Some actions had been initiated by Leidos and HPE in response to the indicators, including the collation of incidents by Leidos, and some infrastructure maintenance, including the changing of cables on the Sydney SAN by HPE.

Despite these actions, alerts continued to be reported, indicating that the potential SAN stability risk had not been resolved.

"We were not made fully aware of the significance of the continuing trend of alerts, nor the broader systems impacts that would result from the failure of the 3PAR SAN," the ATO said in the report.

Ultimately, the ATO said the massive outage experienced in December 2016 resulted from the compound impact of several factors, including multiple SAN component failures on the agency's Sydney SAN, which also involved failures associated with stressed fibre optic cabling.

At this stage of the investigation, the ATO considers that stressed fibre optic cabling issues were a major contributor to this outage - regardless of the actions taken by the ATO's external IT partners, which included the replacement of specific cables.

Other factors contributing to the failure include the system's subsequent unsuccessful attempts to auto-recover in response to the component failures. Consequently, the SAN was unable to provide read/write services to the applications it supported.

Meanwhile, control, management and monitoring systems being placed "in-band" also played a part, with these systems relying on the same data pathways as the production systems that were supporting impacted services.

 
