US-East-1 outage: Good luck to AWS engineers

By Vijaya Sagar Vinnakota - November 26, 2020

We engineers are unsung, coming into the limelight(?!) only when things break. Some of our tribe have been battling a massive outage in the AWS us-east-1 region non-stop over the last 24 hours, but they finally seem to be coming out on top (as seen from the AWS status page). Here's a shoutout for all the hard work they've put in, and good luck to them.

Many businesses worldwide have taken a hit. We too have seen partial degradation in some of our systems deployed on AWS us-east-1. Our AWS TAM & BDM (Hi Anup & Raghu!) have been proactively supporting us over the last 24 hours, and have shown genuine concern about the impact on us & our customers. It will take a few more days for us to determine the actual extent of the impact and start thinking about how we can be better prepared for such outages in the future.

Against the backdrop of this situation, I would like to share some of the thoughts & conversations we Amagians are having on designing distributed systems that are expected to run reliably in an unpredictable environment.

Distributed Systems and Fault Tolerance

Let's say we have grown past the first law of distributed systems. And say we have designed the distribution with basic hygiene principles such as:

low inter-service coupling
high intra-service cohesion
control plane and data plane separation
standardised protocols for inter-service communication

Those principles are necessary but insufficient to deal with the unpredictable / unreliable environment in which those services are deployed. Uptime is not a problem to be outsourced to DevOps / SRE (!!). Fault tolerance should be built into the design of systems well before they're implemented & deployed. Some considerations for this would be:

Statelessness (or at least unbundling compute & state)
Redundancy (not as simple as deploying 2 copies of a system!)
RTO / RPO, checkpointing
Versioned configuration (say, a GitOps equivalent)
Robust service discovery (saving money by running etcd / consul in a non-cluster mode looks like a brilliant idea, until ...)
Fail-fast & fail-safe
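
To make the last item in the list above concrete, here is a minimal sketch of one common way to combine fail-fast and fail-safe behaviour: a circuit breaker wrapped with a fallback. It is an illustration under stated assumptions, not a description of how any Amagi system is built. The names CircuitBreaker, CircuitOpenError and fetch_with_fallback, and the default thresholds, are all hypothetical.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses a call without trying the dependency."""


class CircuitBreaker:
    """Fail fast after repeated failures; allow a trial call after a cool-down."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures    # assumed threshold, tune per service
        self.reset_after_s = reset_after_s  # assumed cool-down, tune per service
        self.clock = clock                  # injectable so tests can control time
        self.failures = 0
        self.opened_at = None               # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Fail fast: do not spend time on a dependency known to be down.
                raise CircuitOpenError("dependency marked unhealthy")
            # Cool-down elapsed: fall through and let one trial call proceed.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def fetch_with_fallback(breaker, fetch, fallback):
    """Fail safe: return a known-good fallback value when the call fails."""
    try:
        return breaker.call(fetch)
    except Exception:
        return fallback
```

A caller would typically keep one breaker per downstream dependency and pass a cached or default value as the fallback, so that an outage in one service degrades the response instead of stalling it.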

And then some more, from an operational perspective:

Infrastructure as code
"What if" checklists / SOPs
Chaos engineering, fire drills & game days (to verify if the checklists / SOPs can save the day)
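
As a toy illustration of the fire-drill idea above, the sketch below injects failures into a primary dependency and counts how many requests are still served from a replica. Everything here is hypothetical: inject_faults, read_with_failover and run_drill are made-up names, and a real game day would exercise deployed systems, not in-process functions.

```python
import random


def inject_faults(func, failure_rate, rng):
    """Wrap func so that a fraction of calls raise, simulating an outage."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault")
        return func(*args, **kwargs)
    return wrapper


def read_with_failover(fetch_primary, fetch_replica):
    """Hypothetical code under test: try the primary, then fall back to a replica."""
    try:
        return fetch_primary()
    except ConnectionError:
        return fetch_replica()


def run_drill(trials=1000, failure_rate=1.0, seed=42):
    """Return how many requests were served by the replica during the drill."""
    rng = random.Random(seed)  # seeded so that a drill run is repeatable
    primary = inject_faults(lambda: "from-primary", failure_rate, rng)

    def replica():
        return "from-replica"

    served_by_replica = 0
    for _ in range(trials):
        if read_with_failover(primary, replica) == "from-replica":
            served_by_replica += 1
    return served_by_replica
```

With failure_rate set to 1.0 the primary always fails, so every request should come from the replica. If that count is lower than the number of trials, the failover path has a gap the checklist did not anticipate.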

As part of the global engineering community, we too are debating these aspects and learning something new every day at Amagi, in a very open engineering culture that does not penalise mistakes but actively encourages us to learn from our own failures and those of others.
