Datacenter outages and stranded passengers



A datacenter seems an alien entity, far removed from the dust and grime of our real lives: rows of gleaming machines somewhere deep underground in a hermetic environment that we never get to see. We also hold a vision of idealized efficiency, in which many failure modes are simply not possible (an inattentive employee tripping over a power cord, or a peeved squirrel chewing through some network cables) and many others, such as the occasional failure of a disk drive or a processor, are handled automatically before they become an irritant or more to us, the general public. So when reality obtrudes on this idealized vision, the effects land on real humans, far from the hermetically sealed environments of the datacenters. In this post, we will discuss the case of stranded and peeved airline passengers, which has been happening off and on and most recently happened with the British Airways (BA) IT outage in late May 2017, which forced the airline to cancel around 800 flights from Gatwick and Heathrow.

Outages that affect passengers carry a real monetary cost for the airline, especially if it operates in EU countries. Most of us are probably aware that passenger protection laws are far stricter for EU carriers than for US carriers. BA would likely have to pay approximately $103M (£80m) in compensation under EU rules, a figure that does not include the cost of reimbursing customers for hotel stays. British Airways scrapped 479 flights, or 59 percent of its timetable, on May 27 when the failure occurred, together with 193 services the following day and some more on May 29.

Let me start with the caveat that the root cause of such outages is often not known with certainty. A lot of guesswork is done by technologists and technology journalists, based on guarded statements made by non-technologists at the affected organizations and on previously known snippets about the technology in use there.

In the case of the BA outage, some technologically informed guesswork suggests that the frailty of the datacenter lies in the human. Didn’t someone say: “To err is human, to forgive divine.” The first part is true and applicable; the second part is true but not applicable, because those affected by the vicissitudes of such outages are only too human themselves and not divine. Hence the characteristic unfolding of the outage: frustrated passengers, a stream of complaints, and a swift drop in the stock price immediately following the incident.

In this case, there was damage to servers and networking equipment caused by a power surge. The arc of the story starts a few minutes earlier, when a technician disconnected a power supply at BA’s datacenter at Heathrow, London. Realizing the oops moment, he may have plugged all the equipment back in at once, violating some non-obvious dependencies among the initialization routines of various software services. A simple (and only partly realistic) example: if a secure web service comes up before the authentication service does, the web service, rather than providing insecure service, will provide no service at all. If it still cannot reach the authentication service after a certain fixed number of attempts, it may make part of its service unavailable, namely the part that is meant to be secure in its interactions with clients.
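
To make the dependency idea concrete, here is a minimal sketch of how a startup order could be derived once the dependencies are written down explicitly. The service names and the dependency map are invented for illustration and say nothing about BA's actual systems; the only real API used is `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical services (not BA's real ones). Each service maps to the
# set of services that must already be running before it starts.
DEPENDENCIES = {
    "database": set(),
    "auth": {"database"},
    "secure_web": {"auth"},
    "booking": {"secure_web", "database"},
}


def startup_order(dependencies):
    """Return a start order in which every service follows its dependencies."""
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # The second element of args holds the cycle that was detected.
        raise ValueError(f"circular dependency: {exc.args[1]}") from exc


order = startup_order(DEPENDENCIES)
print(order)  # ['database', 'auth', 'secure_web', 'booking']
```

With this map, the secure web service is never started before the authentication service. Powering everything on at once gives no such guarantee, and a circular dependency is reported up front rather than discovered during recovery.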

This precise problem received attention from the research community, with some impressive results, about 15 years back. See, for example, this patent from IBM Research in 2002 (some of its authors are old friends and collaborators of mine), a more recent one from Microsoft Research in 2007, and this paper at NOMS 2004, a top venue for practical, industry-driven systems management work. This does seem like one of those technical questions where it would be useful to move some of these solutions from the research domain into practice. Initializing services without taking into account the dependencies among them, which are seldom made explicit in any policy document, has led to problems in the past as well, and it may be worthwhile dusting off the old tomes. Ten years is still old in the field of Computer Science, right?

The series of outages in the IT systems of airlines brings another thought to mind. Airline control systems, which make your flight take off, fly, and land without incident, are remarkably safe: in the US, air travel sees 0.07 fatalities per billion passenger miles and is 100 times safer than driving when compared per mile of transport. However, the same level of reliability does not rub off on the IT systems at the airlines. There may be several contributory factors. First, there is the personnel issue: do your best Computer Science undergraduates want to go work for Delta, United, or American, or for one of the core technology companies? Second, there is the patchwork of outsourcing that is in place at most such organizations. BA, for example, has a company called CBRE operating its datacenter at Heathrow, while some of its IT services were outsourced to Tata Consultancy Services (TCS) of India in mid-2016. Finally, there is the issue of piecemeal updates to the computational infrastructure. For understandable logistical reasons, these are not done in one go; rather, one service is pushed out to the cloud, then a second service, possibly to a different cloud provider, and so on.

One other dimension of such failures is the amplification of the effect of a failure. The power supply problem in this case was likely resolved in a matter of a few tens of minutes. But the user-visible problem persisted for 3 days, hinting that cascades of events magnified the effect. Here again, there has been a flurry of work from the research community on how to deal with cascades of failures, and the practitioner community may do well to take note of it.
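
As a rough illustration of how a single fault fans out, the sketch below walks a dependency graph to find every service that transitively relies on a failed component. The graph is entirely hypothetical, and this is only a reachability toy: it shows how far a fault can spread, not the timing of a real recovery.

```python
from collections import deque

# Hypothetical graph: each service maps to the services it depends on.
DEPENDS_ON = {
    "power_feed": [],
    "network_core": ["power_feed"],
    "messaging": ["network_core"],
    "check_in": ["messaging"],
    "baggage": ["messaging"],
    "crew_rostering": ["messaging"],
    "website": ["network_core"],
}


def affected_by(failed, depends_on):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the edges so we can walk from a component to its dependents.
    dependents = {name: [] for name in depends_on}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(service)

    # Breadth-first search outward from the failed component.
    seen = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for nxt in dependents[current]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(failed)
    return seen


print(sorted(affected_by("power_feed", DEPENDS_ON)))
```

In this toy graph, one failed power feed reaches all six other services, whereas a failed website reaches none. Knowing the blast radius of each component in advance is one small step toward containing a cascade.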

The BA story that we have discussed here is not an isolated incident. Two other recent outages with high price tags hit Delta in August 2016 (a price tag of $150M by the airline’s own admission) and Southwest in July 2016 (a price tag of $177M according to a CNN estimate).

Are such datacenter outages infrequent? Thankfully, yes. Are they infrequent enough that we would not need to worry even if our lives depended on it? No.
