Facebook outage 2021: A simple mistake with global consequences

In October 2021, the Internet was rocked by the Facebook outage, which affected dozens of big-name companies as well as millions of brands and businesses that advertise on the Facebook platform. Due to something as simple as a misconfiguration that left Facebook’s Domain Name System (DNS) servers unreachable, devices with Facebook app integrations began flooding recursive DNS resolvers with retries, in effect a DDoS (“Distributed Denial of Service”) against them. This, in turn, overloaded resolvers in countless cases across the board.

You might be thinking, “So what? Some sites were offline for a couple of hours.” But the blackout brought other problems to light. Communication among the very Facebook employees who could have solved the problem was paralyzed. The obstacles went so far that some people were unable to enter the buildings because the physical badge system was itself offline.

So now that the digital dust has started to settle, it’s time for the postmortem. What can we learn from this event, and how can you prevent it from happening to your organization? Because let’s face it: If a massive platform like Facebook can experience a widespread outage, businesses large and small should take notes.

Configuration management issues

Configuration management (CM) is a systems engineering process for establishing and maintaining consistency in a system’s performance, functionality, and physical state. Essentially, CM gives you a programmatic way to ensure things don’t get derailed.

Is a server not responding after an update? Stop updating the rest of the fleet (in this case, the other servers) and notify the appropriate people about the event. Then roll back the update on the unresponsive server to restore service. Basically, by using CM for automation, you can easily build in checks against human error.
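The halt, notify, and roll back sequence described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular CM tool works: the `rolling_update` function, the callback names, and the simulated `web-1` to `web-3` fleet are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class RolloutResult:
    """Outcome of one rollout attempt across the fleet."""
    updated: List[str] = field(default_factory=list)
    rolled_back: List[str] = field(default_factory=list)
    skipped: List[str] = field(default_factory=list)
    alerts: List[str] = field(default_factory=list)


def rolling_update(
    servers: List[str],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    notify: Callable[[str], None],
) -> RolloutResult:
    """Update servers one at a time; on the first failure halt, notify, roll back."""
    result = RolloutResult()
    for index, server in enumerate(servers):
        apply_update(server)
        if is_healthy(server):
            result.updated.append(server)
            continue
        # 1. Stop updating the rest of the fleet.
        result.skipped = list(servers[index + 1:])
        # 2. Notify the appropriate people.
        message = f"{server} unresponsive after update; rollout halted"
        notify(message)
        result.alerts.append(message)
        # 3. Roll back the unresponsive server to resume service.
        rollback(server)
        result.rolled_back.append(server)
        break
    return result


def demo_rollout() -> RolloutResult:
    """Simulated fleet in which a hypothetical v2 release breaks web-2."""
    versions: Dict[str, str] = {"web-1": "v1", "web-2": "v1", "web-3": "v1"}

    def apply_update(server: str) -> None:
        versions[server] = "v2"

    def is_healthy(server: str) -> bool:
        return not (server == "web-2" and versions[server] == "v2")

    def rollback(server: str) -> None:
        versions[server] = "v1"

    return rolling_update(list(versions), apply_update, is_healthy, rollback, print)


if __name__ == "__main__":
    print(demo_rollout())
```

In the simulated run, `web-1` is updated, `web-2` fails its health check and is rolled back, and `web-3` is never touched. In a real setup the callbacks would wrap whatever your CM tooling provides.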

The importance of testing in the pipeline

The advent of DevOps has given us amazing power to automate as many manual processes as possible. It has taken teams from pushing code to production about once per sprint to deploying within minutes of a commit landing in the code repository. That said, a common issue we see is a pipeline with no test stage beyond the basic linter (a static code analysis tool) run via Jenkins on GitHub.

The test that needs to be added to the pipeline is a deployment to staging servers. Essentially, these test or development servers are a sandbox for examining what would happen before the changes hit production. While the Facebook issue came from a configuration change, not purely developer code, the point remains the same.
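As a rough sketch of such a staging gate, the following Python applies a change to a staging copy first and touches production only if every check passes. The `promote` function, the `nameservers_present` check, and the `example.com` records are illustrative assumptions, not a description of Facebook’s pipeline or of any specific CI product.

```python
from typing import Callable, Dict, List

Config = Dict[str, List[str]]
Check = Callable[[Config], List[str]]


def nameservers_present(config: Config) -> List[str]:
    """Hypothetical check: the zone must still list at least two nameservers."""
    count = len(config.get("NS", []))
    return [] if count >= 2 else [f"expected at least 2 NS records, found {count}"]


def promote(
    change: Config,
    staging: Config,
    production: Config,
    checks: List[Check],
) -> List[str]:
    """Apply `change` to staging, run checks, and update production only if all pass.

    Returns the failure messages; an empty list means the change was promoted.
    """
    staging.update(change)
    failures = [message for check in checks for message in check(staging)]
    if not failures:
        production.update(change)
    return failures


def demo_staging_gate() -> Dict[str, object]:
    baseline: Config = {"NS": ["ns1.example.com", "ns2.example.com"]}
    staging, production = dict(baseline), dict(baseline)
    bad_change: Config = {"NS": []}  # a change that would drop every nameserver
    failures = promote(bad_change, staging, production, [nameservers_present])
    return {"failures": failures, "production": production}


if __name__ == "__main__":
    print(demo_staging_gate())
```

The bad change is caught in staging and production keeps its original records. The same gate works for configuration changes and application code alike, as long as the checks exercise what actually changed.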

It is imperative to have that test area so you can be confident that whatever is moving or changing in your production environment will not end up as tomorrow’s cover story in WIRED. Lastly, always match your test servers to production as closely as possible to get the most reliable tests.
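One way to keep an eye on how closely test servers match production is to diff their settings on a schedule. The sketch below assumes each environment can be described as a flat dictionary of settings; the `config_drift` helper, the setting names, and the version strings are made up for illustration.

```python
from typing import Dict, FrozenSet, Tuple

MISSING = "<missing>"


def config_drift(
    staging: Dict[str, str],
    production: Dict[str, str],
    ignore: FrozenSet[str] = frozenset(),
) -> Dict[str, Tuple[str, str]]:
    """Return {key: (staging_value, production_value)} for every setting that differs."""
    drift: Dict[str, Tuple[str, str]] = {}
    for key in sorted(set(staging) | set(production)):
        if key in ignore:
            continue  # settings that are expected to differ, such as hostnames
        staged = staging.get(key, MISSING)
        live = production.get(key, MISSING)
        if staged != live:
            drift[key] = (staged, live)
    return drift


def demo_drift() -> Dict[str, Tuple[str, str]]:
    staging = {"os": "ubuntu-22.04", "nginx": "1.24", "hostname": "stage-1"}
    production = {
        "os": "ubuntu-22.04",
        "nginx": "1.18",
        "hostname": "prod-1",
        "tls": "1.3",
    }
    return config_drift(staging, production, ignore=frozenset({"hostname"}))


if __name__ == "__main__":
    for key, (staged, live) in demo_drift().items():
        print(f"{key}: staging={staged} production={live}")
```

Any non-empty result is a reason to distrust a green test run until the environments are brought back in line.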

Reversion planning and drills are essential

So you’ve done your due diligence and taken all the steps to make sure this simple update goes smoothly, but after pushing the update, you find that the change broke something unrelated. While that’s a case of not following orthogonality in development and design, it happens to all of us, so don’t worry.

Times like these are when the rollback plan mentioned earlier comes into play. This is what you need to do to return to the previous state, before the change or update was submitted. If you’ve taken my CM advice seriously, you should already have this plan in place.

If not, develop a rollback plan, preferably before your next push to production. Once that plan is in place, run a mock scenario to make sure it works in practice and not just in its documentation. This is one of the many good reasons to run periodic cyber wargame scenarios, which are an interactive and genuinely fun way to test your cybersecurity readiness under attack conditions.
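A rollback drill can be as simple as rehearsing the change and the rollback on a copy of the system state and confirming that you end up where you started. The following is a toy sketch under that assumption; `rollback_drill`, the state dictionary, and both rollback procedures are hypothetical stand-ins for your real runbook steps.

```python
import copy
from typing import Callable, Dict

State = Dict[str, str]


def rollback_drill(
    state: State,
    apply_change: Callable[[State], None],
    rollback: Callable[[State, State], None],
) -> Dict[str, bool]:
    """Rehearse a change and its rollback on a copy; report whether the copy was restored."""
    snapshot = copy.deepcopy(state)  # known-good state to return to
    sandbox = copy.deepcopy(state)   # the drill never touches the real state
    apply_change(sandbox)
    change_applied = sandbox != snapshot
    rollback(sandbox, snapshot)
    return {"change_applied": change_applied, "restored": sandbox == snapshot}


def risky_change(state: State) -> None:
    state["version"] = "2.0"
    state["feature_flag"] = "on"


def full_rollback(state: State, snapshot: State) -> None:
    """Restore everything from the snapshot."""
    state.clear()
    state.update(copy.deepcopy(snapshot))


def partial_rollback(state: State, snapshot: State) -> None:
    """A flawed plan: restores the version but forgets the new feature flag."""
    state["version"] = snapshot["version"]


if __name__ == "__main__":
    live = {"version": "1.0"}
    print(rollback_drill(live, risky_change, full_rollback))
    print(rollback_drill(live, risky_change, partial_rollback))
```

The second drill reports that the state was not restored, which is exactly the kind of gap you want to find in a rehearsal rather than during an outage.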

Communication alternatives are a must

With a growing percentage of the workforce moving to remote environments, it is more important than ever to have reliable communications. For a company like Facebook, it makes sense to use a homegrown SaaS, like its proprietary Messenger platform. However, scenarios exactly like this one are also the reason why you should always have a backup.

While I’m sure this sounds like an obvious step, don’t just tuck the plan into your team onboarding materials. First, if you are a smaller business and don’t bring on hundreds of new hires a year, you may not even have a formal, established onboarding process.

And second, the alternative communication method can change at any time. One way around this is to keep up-to-date documentation on the protocol to follow if the primary form of contact is cut off for any reason. Another tip is to have your IT department set up an automated message, sent to employees via email or SMS, that details the backup method whenever this happens.
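The automated backup notice could be driven by something like the sketch below, which tries each channel in priority order and falls through to the next when one is unreachable. The channel names and sender functions are placeholders; real senders would wrap whichever email or SMS provider you use, ideally one that does not depend on the same infrastructure as your primary platform.

```python
from typing import Callable, List, Optional, Tuple

Channel = Tuple[str, Callable[[str], None]]


def send_with_fallback(message: str, channels: List[Channel]) -> Optional[str]:
    """Try each channel in priority order; return the name of the first that delivers."""
    for name, send in channels:
        try:
            send(message)
            return name
        except ConnectionError:
            continue  # this channel is down, move on to the next one
    return None  # nothing got through; time for the phone tree


def demo_fallback() -> Tuple[Optional[str], List[str]]:
    delivered: List[str] = []

    def internal_chat(message: str) -> None:
        # Simulates the primary, in-house platform being offline.
        raise ConnectionError("primary chat platform unreachable")

    def sms(message: str) -> None:
        delivered.append(message)

    notice = "Primary chat is down. Use the backup method in the outage protocol."
    used = send_with_fallback(notice, [("internal_chat", internal_chat), ("sms", sms)])
    return used, delivered


if __name__ == "__main__":
    print(demo_fallback())
```

Here the notice goes out over SMS because the primary chat raises a connection error. A `None` result means every channel failed, so the written protocol still needs a last resort that requires no software at all.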

Depth and redundancy are the key

Now, this is closely related to the last section on communication alternatives, but it applies to all of the previous lessons to some extent. How far you take it depends on your company, but aim for a level of redundancy that would make any doomsday prepper jealous. And if you think it’s excessive, go one step further.

One question you should constantly ask yourself for every possible scenario is: Do you have a backup? This is where the rollback plan comes in as redundancy in the event of a breaking change in production. Likewise, configuration management is redundancy for manual human monitoring. As long as you have at least one layer of backup, the entire livelihood of your business is not at stake. At the end of the day, you can save yourself a lot of headaches (and heartaches) by going back to basics and prioritizing redundancy in all your environments.

Image Credit: Sergei elagin and thinkhubstudio / Shutterstock


Cody Michaels is an Application Security Consultant at nVisium. With over 10 years of secure programming and development experience, Cody has worked with people from entry-level to Fortune 500 companies. He has won hacking events, including Compuware Hack the Museum at the Henry Ford Museum. He is also an Arctic Code Vault contributor for his open source work. Cody is known for speaking at Defcon meetings, various local security talks, and the HackMiami conference.
