The AWS reboot, however, would be the first true test of Cassandra's reliability. The entire cloud database engineering team was on alert.
In the end, thanks to Chaos Monkey testing, most of the Cassandra nodes remained online. Of the 218 Cassandra nodes that were rebooted, only 22 did not return to a fully operational state, and those were successfully restarted with minimal human intervention.
"Repeatedly and regularly exercising failure, even in the persistence layer, should be part of every company's resilience planning," the blog concluded. "If it wasn't for Cassandra's participation in Chaos Monkey, this story would have ended much differently."