AWS has gone down before, as have other providers; Fastly has lessons to share from its own outage


Fastly’s mid-2021 outage took some huge sites offline. Its Chief Product Architect Sean Leach shares why he thinks outages continue to happen, and how to reduce your own risks.



It’s time to reset the “days since last outage” sign at AWS headquarters yet again, with the web hosting giant in the process of dissecting its latest mass outage, which this time took sites like Disney+ and Netflix down with it. 

There are a lot of digital eggs in the AWS basket, and unfortunately major outages have happened with surprising regularity. AWS isn’t alone, though: Edge cloud company Fastly suffered an outage on June 8, 2021, that was similar to AWS’ outages, if for no other reason than it resulted in several major websites going offline. 

The latest AWS outage is still a bit of a mystery. All we know is that on Tuesday, December 7, AWS’ US-East-1 region went offline. That just so happens to be the biggest of AWS’ regions, and the outage affected not only Amazon customers but internal operations as well. Service was restored later in the day, AWS said.

Amazon has yet to go into any detail about the outage, which knocked major websites, IoT devices and other essential online services offline, aside from what CBS News described as “terse technical explanations.” Fastly chief product architect Sean Leach won’t speculate on the cause of the AWS outage, but he does have plenty to say about Fastly’s own June 8 outage and how the lessons Fastly learned from it can be applied to both content delivery services and the clients that make use of them.

Fastly’s outage was caused by a bug introduced by a software deployment the month prior. The bug could only be triggered by “a specific customer configuration under specific circumstances,” said Fastly SVP of engineering and infrastructure, Nick Rockwell. It turns out that a customer meeting those conditions submitted a valid configuration change that triggered the bug and took 85% of Fastly’s network offline. Fastly discovered the error, restored services and deployed a permanent fix the same day.

The internet is a car, and cars need maintenance

Internet outages continue to happen, which raises the question: Why? And, if there’s something fundamentally wrong with the internet, do we need to re-architect it?

No, Leach said, adding that the internet was built just fine in the first place. Rather than thinking of the internet as a mass of disparate servers all vying for authority, think of it as a whole system made of moving parts, like an automobile.

“So you own your car. You’re driving along, making sure you change the oil and other fluids, rotate the tires and the like … Sometimes there’s a rock that flies off the road and shatters your windshield, and now you have to stop and react to that unexpected circumstance,” Leach said.

Leach says there’s no fundamental flaw in the internet’s design. Rather, he describes it as having been “beautifully designed” early in its existence in a fashion that worked far better than anyone thought it would at the time. Yes, things go wrong, but each mistake is a chance to learn and eliminate points of failure. 

What Fastly learned from its own outage

If Fastly learned one big lesson from its outage and the recovery process, said Leach, it was that transparency pays off. “Transparency has always been a key focus area [at Fastly]. We were very transparent in the blog we put out responding to the outage, and our customers have been super supportive of our response,” Leach said.

Transparency, Leach said, doesn’t only benefit the company being open about its mistakes and how it responds to them. It also benefits everyone else in the industry who could face similar circumstances in the future. 

If you’ve been on Tech Twitter for any length of time, you’ve probably heard the term “HugOps,” slang for the empathy that tech professionals have for each other when experiencing similar challenges. Part of HugOps, Leach said, is being able to help. If companies are honest about their outages, HugOps becomes a simple matter of sharing reports that could quickly reduce recovery time for other organizations.

“To quote Mike Tyson, ‘everyone has a plan until they get punched in the face,’” Leach said. Put simply, if we all help each other we can get a lot better at reacting to the punches that our infrastructure will inevitably face.

How to fix the internet …?

Leach said Fastly has been focusing on two big things that it sees as ways to reduce the frequency of internet outages.

First, Fastly has been moving as much of its critical infrastructure as possible to memory-safe technologies like Rust and WebAssembly. “Large cloud infrastructure, the things that are doing terabits of transactions per second … a lot of that’s written in C and C++. Those were great languages early on, but as with anything, we eventually found a better way,” Leach said.

Second, Leach warns that DDoS attacks, which he describes as being cyclical, are on the rise. The response to that is to increase transactional capacity to lessen the impact a DDoS attack can have. “We’re seeing attacks not only get larger, but more complex as well. Keeping up with capacity and threat intelligence is essential to know what attackers are doing,” Leach said. 

As for the companies that may be suffering from these outages, Leach said his biggest message to all of them is to not give up on the cloud.

“Think of all the outages folks have had running their own infrastructure for years and how difficult it is for them to recover from it. Switching to a cloud provider gives you access to a whole lot of experts, both from the infrastructure and the security side, who will react quickly and solve and fix the problem,” Leach said. 

That doesn’t mean you should ignore redundancy. Leach says that it’s important to have geographic fail-overs, but the cloud is still going to be the best option. All the hemming and hawing around cloud stability, he said, comes down to one thing: risk.

“Each organization has to choose their level of risk, just like you do with security. You can choose the level of risk you take in the cloud or you can choose to ignore risks altogether,” Leach said. 
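
The geographic fail-over Leach mentions can be pictured with a short sketch. This is an illustration only, not a description of how Fastly or AWS route traffic: the function name pick_region, the region names and the health check are all hypothetical, and a real deployment would rely on a DNS or load-balancer service rather than an in-process loop.

```python
from typing import Callable, Optional, Sequence


def pick_region(
    regions: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first healthy region in priority order, or None if all are down."""
    for region in regions:
        if is_healthy(region):
            return region
    return None


# Hypothetical scenario: the primary region is down, so traffic moves
# to the next region in the priority list.
down = {"us-east-1"}
priority = ["us-east-1", "us-west-2", "eu-west-1"]
chosen = pick_region(priority, lambda region: region not in down)
print(chosen)
```

How many regions go on that list, and how much it costs to keep them ready, is exactly the kind of risk decision Leach describes.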

Along with understanding your risk, Leach said that there’s one other key thing everyone should do when trying to determine the risks their cloud environment faces: Know its entire surface. Like understanding your attack surface, understanding your cloud surface means knowing things like which APIs are running where, which services are managed by which provider, where servers are located, what programming languages are being used and anything else that could jeopardize your uptime. 
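
One way to picture such a map is as a small inventory that can be queried for weak spots. The sketch below is a hypothetical example, not a tool Leach or Fastly describes: the Asset fields mirror the items listed above (service, provider, location, language), and every name and value in it is made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Asset:
    """One deployed copy of a service (all field values here are hypothetical)."""
    name: str
    provider: str
    region: str
    language: str


def single_region_assets(assets: Iterable[Asset]) -> List[str]:
    """Return the names of services that run in exactly one provider/region pair."""
    locations = defaultdict(set)
    for asset in assets:
        locations[asset.name].add((asset.provider, asset.region))
    return sorted(name for name, locs in locations.items() if len(locs) == 1)


inventory = [
    Asset("checkout-api", "provider-a", "us-east-1", "Go"),
    Asset("checkout-api", "provider-a", "us-west-2", "Go"),
    Asset("auth", "provider-a", "us-east-1", "C++"),
    Asset("image-cache", "provider-b", "eu-west-1", "Rust"),
]

# Services with no second location are candidates for a fail-over plan.
at_risk = single_region_assets(inventory)
print(at_risk)
```

Even a list this simple makes it obvious which services would go dark if a single region failed.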

The usual advice for improving security posture applies to the cloud as well, Leach said. Run drills to simulate outages, take a total inventory of everything in your cloud environment, and otherwise build yourself a map so that you can quickly pinpoint and respond to the inevitable, because at the end of the day outages are just that: as inevitable as a flat tire, chipped windshield or other unexpected disaster.
