US-East-1 outage: Good luck to AWS engineers

By - November 26, 2020
We engineers are unsung, coming into the limelight(?!) only when things break. Some of our tribe have been battling a massive outage in the AWS us-east-1 region non-stop over the last 24 hours, but they finally seem to be coming out on top (as seen on the AWS status page). Here's a shoutout to all the hard work they've put in, and wishing them good luck.

Many businesses worldwide have taken a hit. We too have seen partial degradation in some of our systems deployed in AWS us-east-1. Our AWS TAM & BDM (Hi Anup & Raghu!) have been proactively supporting us over the last 24 hours, showing genuine concern about the impact this has on us & our customers. It will take a few more days for us to determine the actual extent of the impact and to start thinking about how we can be better prepared for such outages in the future.

Against the backdrop of this situation, I would like to share some of the thoughts & conversations we Amagians are having on designing distributed systems that are expected to run reliably in an unpredictable environment.

Distributed Systems and Fault Tolerance

Let's say we have grown past the first law of distributed systems, and that we have designed the distribution with basic hygiene principles such as:

  • low inter-service coupling
  • high intra-service cohesion
  • control plane and data plane separation
  • standardised protocols for inter-service communication
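To make the control-plane / data-plane separation concrete, here is a minimal Python sketch (all class and method names are illustrative, not from any real codebase): the data plane serves requests from a local snapshot of configuration, so a control-plane failure degrades to stale config rather than dropped traffic.

```python
import threading

class ConfigStore:
    """Control plane: holds desired state, updated out-of-band."""
    def __init__(self, initial):
        self._lock = threading.Lock()
        self._config = dict(initial)

    def update(self, key, value):
        with self._lock:
            self._config[key] = value

    def snapshot(self):
        with self._lock:
            return dict(self._config)

class RequestHandler:
    """Data plane: serves traffic from a point-in-time snapshot,
    never blocking a request on the control plane."""
    def __init__(self, store):
        self._store = store
        self._config = store.snapshot()

    def refresh(self):
        # Best-effort pull; a control-plane outage must not stop serving.
        try:
            self._config = self._store.snapshot()
        except Exception:
            pass  # keep serving with the last known-good config

    def handle(self, request):
        limit = self._config.get("rate_limit", 100)
        return f"served {request} under limit {limit}"
```

The key property: `handle()` touches only local state, so the two planes can fail (and be scaled, deployed, and secured) independently.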

Those principles are necessary but insufficient to deal with the unpredictable / unreliable environment in which those services are deployed. Uptime is not a problem to be outsourced to DevOps / SRE (!!). Fault tolerance must be built into the design of systems well before they're implemented & deployed. Some considerations for this would be:

  • Statelessness (or at least unbundling compute & state)
  • Redundancy (not as simple as deploying 2 copies of a system!)
  • RTO / RPO, checkpointing
  • Versioned configuration (say, a GitOps equivalent)
  • Robust service discovery (saving money by running etcd / consul in a non-cluster mode looks like a brilliant idea, until ...)
  • Fail-fast & fail-safe
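Fail-fast & fail-safe can be sketched together as a tiny circuit breaker. This is an illustrative Python sketch, not a production library: after repeated failures, calls stop waiting on the broken backend (fail fast) and immediately return a degraded answer (fail safe).

```python
import time

class CircuitBreaker:
    """After `max_failures` consecutive errors, reject calls immediately
    for `reset_after` seconds instead of piling up slow timeouts."""
    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None = circuit closed (healthy)

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()      # fail-safe: degraded answer, fast
            self.opened_at = None      # half-open: probe the backend again
            self.failures = 0
        try:
            result = fn()
            self.failures = 0          # success resets the failure count
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            return fallback()
```

The fallback might be a cached response or a reduced-functionality answer; the point is that a failing dependency costs callers milliseconds, not stacked-up timeouts.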

And then some more, from an operational perspective:

  • Infrastructure as code
  • "What if" checklists / SOPs
  • Chaos engineering, fire drills & game days (to verify if the checklists / SOPs can save the day)
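A game day can start as simply as forcing a failure in code and checking that the SOP's expected behaviour actually holds. A minimal Python sketch of such a drill, with all function names hypothetical:

```python
def fetch_live():
    # The drill forces the primary dependency down.
    raise ConnectionError("simulated us-east-1 outage")

def fetch_fallback():
    # Stand-in for a replica / cache / last-known-good source.
    return {"source": "replica", "data": "last-known-good"}

def get_data():
    # The behaviour the SOP promises: degrade, don't fail.
    try:
        return fetch_live()
    except ConnectionError:
        return fetch_fallback()

def drill_primary_region_down():
    # The drill itself: with the primary down, we must still answer.
    result = get_data()
    assert result["source"] == "replica"
    return result
```

Running such drills on a schedule turns the "what if" checklist from a document into something continuously verified.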

As part of the global engineering community, we too are debating these aspects and learning something new every day at Amagi, in a very open engineering culture that does not penalise mistakes but actively encourages us to learn from our failures and others'.