What the IBM Cloud Outage Teaches Us About Resilience Engineering
Last week, IBM Cloud customers were impacted by a 3-hour global networking outage, rendering those customers’ services unavailable for much of Tuesday afternoon and evening. Compared to AWS, Microsoft Azure, and Google Cloud, IBM is a small player in the public cloud, yet this outage still managed to impact about 10% of the services on the internet.
In this blog post, we’ll break down what happened last week, and why this class of outage has been hitting all of the cloud providers lately. Finally, we’ll outline how Arpio makes it easy for AWS customers to avoid this and similar outages when they happen.
Reports of a major outage in the IBM Cloud began surfacing on social media around 4:50 pm CDT on Tuesday, June 9th. As with other cloud outages we’ve seen in the past, the impacted services included IBM’s status website, so much of the conversation centered on the lack of information about what was going on.
Initial reports focused on IBM’s Dallas POP being inaccessible, but pretty quickly the other locations in IBM Cloud were also implicated. This appears to have been a global outage.
IBM released a statement the next day that gives us ample insight into the root cause: “An investigation shows an external network provider flooded the IBM Cloud network with incorrect routing.” This was yet another case of BGP Hijacking.
The internet is a “network of networks” which means that independent computer networks are themselves interconnected. At the highest level, large-scale internet service providers connect to each other so that customers of one ISP can access websites and services that operate through another ISP.
BGP, the Border Gateway Protocol, is the mechanism these providers use to advertise routes to one another: announcements of which destinations are reachable through their networks, and how.
A BGP hijack occurs when one provider erroneously or maliciously publishes bad connectivity information to its peers. Typically, the publishing provider advertises an attractive route to a given destination, encouraging other providers to send traffic in that direction. If the network infrastructure cannot handle all of the traffic it suddenly starts receiving, an outage occurs.
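To see why a bogus advertisement can pull traffic away from the legitimate destination, it helps to know how routers choose among overlapping routes: the most specific (longest) matching prefix wins. Here is a minimal sketch of that selection rule in Python; the prefixes and AS numbers are hypothetical, and real BGP route selection involves many more tie-breakers.

```python
import ipaddress

# Hypothetical routing table: prefix -> advertising AS.
# AS64500 is the legitimate origin of this address space.
routes = {
    ipaddress.ip_network("203.0.113.0/24"): "AS64500 (legitimate)",
}

def best_route(dest, table):
    """Longest-prefix match: the most specific route containing dest wins."""
    addr = ipaddress.ip_address(dest)
    matches = [net for net in table if addr in net]
    if not matches:
        return None
    return table[max(matches, key=lambda net: net.prefixlen)]

print(best_route("203.0.113.10", routes))  # AS64500 (legitimate)

# A hijacker (AS64666) advertises a more specific /25 covering
# part of the same space. Because it is more specific, it wins.
routes[ipaddress.ip_network("203.0.113.0/25")] = "AS64666 (hijacker)"
print(best_route("203.0.113.10", routes))  # AS64666 (hijacker)
```

This is why accidental "route leaks" are so damaging: a peer that re-advertises specific routes it shouldn't will silently attract traffic it may not be able to carry.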
BGP hijacks are sadly common on the internet. In April, suspicious routes published by a Russian telecom provider caused a one-hour outage for customers accessing Amazon, Facebook, and Google from over 200 networks around the world. In 2018, hackers used BGP to hijack Amazon Route53 traffic and direct customers of the MyEtherWallet cryptocurrency service to an imposter site. Last year, CloudFlare publicly shamed Verizon for a BGP hijack that resulted in a multi-hour outage for many of the largest sites on the internet.
Usually, when a provider has an outage, it’s their own fault. With BGP hijacks, that’s not necessarily the case.
The IBM Cloud outage last week apparently resulted from another provider publishing erroneous routes to IBM. IBM probably should have rejected those routes, but the external provider is at least partly to blame.
The recent AWS and CloudFlare outages resulted entirely from external providers exchanging erroneous routes for traffic destined to those platforms; the cloud platforms themselves weren’t even involved. Yet their services, and their customers, paid the price.
There is a movement in the industry to implement technical solutions, such as RPKI route-origin validation, that reduce the risk of BGP hijacks. But stepping back, these events are just more examples of the numerous things that can break systems on the internet.
We can, and should, continue to mitigate the risks of systemic failures by improving the underlying systems we build upon. We need to fix the vulnerabilities in BGP that result in these problems. And we’ll need to fix the new vulnerabilities that arise when we fix these current ones.
But this game of whack-a-mole will never end. There will always be resiliency problems that have not yet been solved. So we need to concurrently invest in solutions that allow our systems to persevere and recover in the face of these unsolved, and often unknown, vulnerabilities.
In computing, this has always been the purview of redundancy. Decades ago, that was about uninterruptible power supplies and redundant arrays of inexpensive disks. These days, the common practice is metropolitan-area geo-redundancy, such as AWS’s availability zones. But to truly be resilient at a global scale, we need to engineer global resilience.
Arpio is a global resilience solution for applications that run in Amazon Web Services. When a BGP hijack (or another event) impairs an AWS region, Arpio makes it quick and easy to continue operating in another part of the world.