Suddenly, at about 16:50 BST on 4th October 2021, Facebook disappeared from the Internet for all of its 3 billion users, no matter where in the world they were. There was no warning, and the experience was the same for the head of a large commercial organisation as it was for a first-year university student using a low-cost Android phone. Users of Instagram and WhatsApp, also owned by Facebook, suffered the same experience. Service returned at about 22:20 BST. The impact was high because Facebook, a single company, is so large.
The “what and why” is gradually emerging. The most surprising thing for me is that it was NOT a cyber attack. There was no malicious software, no ransomware, no DDoS and no hackers or disgruntled former employees. However, by chance, the outage came in the same week that Frances Haugen, a former Facebook employee in the US turned whistleblower, gave testimony to Congress that Facebook prioritised profit over harm to children.
Facebook explained the cause in its 259-word blog post: “Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication”. Many independent sources, including Reuters and Cloudflare, provided an explanation.
The failure that prevented users from accessing Facebook also obstructed Facebook engineers attempting to fix it. Apparently, the systems used by Facebook for physical and logical access to its own buildings were also affected by the same outage.
In simple terms, the error involved two of the internet’s many interconnected sub-systems: the Domain Name System (DNS) and the Border Gateway Protocol (BGP). The DNS converts a domain name like facebook.com to the IP address of a server (one of many around the world) hosting the Facebook application. The BGP is how networks on the Internet tell each other which routes lead to which addresses. In this case, it allows data from one Facebook datacenter in, say, South Africa to find another in Norway.
Like signs on the motorway, the BGP gives drivers directions to their destination. The “configuration change” that went wrong on 4th October meant that suddenly all the motorway signs (the BGP) went blank (and DNS could no longer see Facebook). The drivers could not see how to get to their destination and the traffic came to a halt.
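The motorway analogy can be sketched as a toy model in Python. This is only an illustration, not how Facebook's network or real BGP and DNS software works: the ToyInternet class, the example.com name and the addresses (taken from ranges reserved for documentation) are all assumptions made for the example. It shows the one point that matters here: once the route to the name server is withdrawn, the name can no longer be resolved, even though the application server itself has not moved.

```python
import ipaddress


class ToyInternet:
    """Hypothetical, heavily simplified stand-in for BGP routes plus DNS."""

    def __init__(self):
        self.routes = set()     # address ranges currently announced (the "signs")
        self.dns_servers = {}   # domain -> address of its name server
        self.records = {}       # domain -> address of the application server

    def announce(self, prefix):
        """Put up a sign: this address range is reachable."""
        self.routes.add(ipaddress.ip_network(prefix))

    def withdraw(self, prefix):
        """Take the sign down: the range is no longer advertised."""
        self.routes.discard(ipaddress.ip_network(prefix))

    def reachable(self, address):
        """True if some announced range contains this address."""
        ip = ipaddress.ip_address(address)
        return any(ip in network for network in self.routes)

    def resolve(self, domain):
        """Look up a domain, but only if its name server can be reached."""
        name_server = self.dns_servers[domain]
        if not self.reachable(name_server):
            raise LookupError(f"cannot reach the name server for {domain}")
        return self.records[domain]


# Illustrative setup: one range holds the name server, another the application.
internet = ToyInternet()
internet.announce("192.0.2.0/24")
internet.announce("198.51.100.0/24")
internet.dns_servers["example.com"] = "192.0.2.53"
internet.records["example.com"] = "198.51.100.10"

print(internet.resolve("example.com"))   # works while the route is announced

internet.withdraw("192.0.2.0/24")        # the "configuration change"
try:
    internet.resolve("example.com")
except LookupError as error:
    print("lookup failed:", error)
```

In the sketch only the name server's range is withdrawn, yet every lookup fails. That is the sense in which the signs going blank also meant DNS could no longer see Facebook.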
Although the outage lasted for just 6 hours, it had a huge global impact on individuals, businesses and governments that rely on Facebook for communication, data transfer, payments and education.
Facebook did not explain why this update, something they would have done many times in the past, went awry. It is unclear whether this was a planned or unscheduled update, or why there was no simple rollback mechanism in place for exactly these eventualities.
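To make the idea of a rollback mechanism concrete, here is a minimal Python sketch. It assumes a hypothetical ToyRouter with a made-up "announced_prefixes" setting and a very crude health check; none of this reflects Facebook's actual tooling, which has not been described. The pattern is simply: apply the change, check that the network still looks healthy, and restore the last known-good configuration if it does not.

```python
class ToyRouter:
    """Hypothetical router that just stores its configuration."""

    def __init__(self, config):
        self.config = dict(config)

    def apply(self, config):
        self.config = dict(config)

    def healthy(self):
        # Crude assumption: healthy means still announcing at least one range.
        return bool(self.config.get("announced_prefixes"))


def apply_with_rollback(router, proposed):
    """Apply `proposed`; restore the previous config if the check fails.

    Returns True if the change was kept, False if it was rolled back.
    """
    known_good = dict(router.config)   # remember the last working state
    router.apply(proposed)
    if router.healthy():
        return True
    router.apply(known_good)           # roll back automatically
    return False


router = ToyRouter({"announced_prefixes": ["192.0.2.0/24"]})

# A faulty change that would withdraw every announcement.
kept = apply_with_rollback(router, {"announced_prefixes": []})
print("change kept:", kept)
print("config now:", router.config)
```

A mechanism like this only helps if whatever performs the rollback can still reach the routers. As noted earlier, the same failure also obstructed the engineers trying to fix it, so in practice the recovery path would need to be independent of the network being changed.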
However, independent security specialists cannot rule out the possibility of sabotage or other sinister activity.
This outage was limited to one company, albeit with a huge user base. A similar outage for Google, Amazon or Apple would potentially have a larger impact, affecting many more applications and businesses. The internet was designed and built around TCP/IP (Transmission Control Protocol / Internet Protocol). It has resilience at its core. That resilience still stands. This incident illustrates the age-old problem of too many eggs (users) in a single basket (Facebook).
Downdetector recorded a further Facebook outage starting late on October 8th and running into the early hours of the 9th. This was a far less significant outage that lasted just a couple of hours and probably had a different cause than Monday’s. Here is how CNN reported it.
Facebook has provided a further update explaining the 4th October outage.