What’s New Cloud: We Now Know What Happened in US-EAST-1.
AWS released additional information about the service disruption that happened last week.
Welcome back to this new edition of What’s New Cloud Newsletter, where I share my insights on cloud news, feature updates and DevOps Trends.
What really happened?
AWS has since shared more detail about last week's US-EAST-1 disruption. Here's the short version.
Root Cause:
A bug in DynamoDB's DNS automation triggered a rare race condition that deleted the DNS records for DynamoDB's regional endpoint, breaking connectivity for any AWS service or customer workload that depended on it.
AWS restored full service on October 20 after manual recovery efforts.
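To make that failure mode concrete, here is a minimal, self-contained sketch of a check-then-act race in a made-up DNS automation loop. It is purely illustrative: the names (DnsStore, apply_plan, cleanup), the plan model, and the timing are my assumptions, not AWS's actual implementation.

```python
# Illustrative only: a generic "stale plan" race in a toy DNS automation,
# NOT AWS's real system. All names and the plan model are hypothetical.
import threading
import time

class DnsStore:
    """Toy stand-in for a DNS record set managed by automation."""
    def __init__(self):
        self.records = {"db.example.internal": {"plan": 1, "ips": ["10.0.0.1"]}}

    def apply_plan(self, name, plan_id, ips, delay=0.0):
        current = self.records.get(name)
        # Check: only apply if this plan looks newer than what is live.
        if current is None or current["plan"] < plan_id:
            time.sleep(delay)  # a delayed worker: the world moves on meanwhile
            # Act: by now a newer plan may already be live, but we write anyway.
            self.records[name] = {"plan": plan_id, "ips": ips}

    def cleanup(self, name, latest_plan_id):
        # Cleanup trusts that the latest plan is live and removes anything older.
        rec = self.records.get(name)
        if rec and rec["plan"] < latest_plan_id:
            del self.records[name]  # the endpoint now resolves to nothing

store = DnsStore()
# Worker A starts applying plan 2 but stalls; worker B applies plan 3 first.
a = threading.Thread(target=store.apply_plan,
                     args=("db.example.internal", 2, ["10.0.0.2"], 0.2))
b = threading.Thread(target=store.apply_plan,
                     args=("db.example.internal", 3, ["10.0.0.3"], 0.0))
a.start(); b.start(); a.join(); b.join()

# The stale plan 2 won the race, so cleanup (which believes plan 3 is live)
# deletes what it sees as an outdated record, leaving the name empty.
store.cleanup("db.example.internal", latest_plan_id=3)
print(store.records)  # {} -> no DNS record left for the endpoint
```

The general cure for this class of bug is the same everywhere: make the "is this plan still the newest?" check and the write a single atomic step (compare-and-set), and never let cleanup delete a record that is still the one being served.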
Impact:
EC2 couldn’t launch new instances for nearly 14 hours
NLB experienced widespread connection errors
Lambda, ECS/EKS, Fargate, Redshift, IAM, and Connect all saw degraded performance
AWS Follow-Up Actions:
Fixing the DynamoDB DNS race condition
Improving EC2 recovery workflows and throttling mechanisms
Enhancing NLB failover handling and resilience
What I Learned from This Incident
Even the most redundant systems can fail when automation goes wrong.
A small DNS issue can ripple into a full regional outage.
Human operators remain vital in recovery and resilience.
Understanding cross-service dependencies is key to designing fault tolerance (see the client-side sketch after this list).
Reliability is a shared responsibility — across services, teams, and automation layers.
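On that shared-responsibility point: even when the root cause is on AWS's side, your client code decides how it behaves while a dependency is down. Here is a minimal sketch, assuming boto3 and a hypothetical "orders" table, of bounded adaptive retries plus an explicit fallback when the regional endpoint can't be reached, instead of unbounded retries piling onto an already struggling service.

```python
# Defensive-client sketch (assumptions: boto3 installed, an "orders" table
# exists; the fallback behaviour is illustrative, not a prescription).
import boto3
from botocore.config import Config
from botocore.exceptions import EndpointConnectionError, ClientError

# Cap retries and use adaptive mode so a regional brownout doesn't turn
# every caller into a retry storm against a degraded endpoint.
dynamodb = boto3.client(
    "dynamodb",
    region_name="us-east-1",
    config=Config(retries={"max_attempts": 5, "mode": "adaptive"},
                  connect_timeout=3, read_timeout=3),
)

def get_order(order_id: str):
    try:
        resp = dynamodb.get_item(TableName="orders",
                                 Key={"order_id": {"S": order_id}})
        return resp.get("Item")
    except EndpointConnectionError:
        # DNS / connectivity failure to the regional endpoint: degrade
        # gracefully (serve stale cache, queue the work, or fail fast)
        # instead of blocking the whole request path.
        return None
    except ClientError:
        # Throttling and service errors land here once retries are exhausted.
        raise

if __name__ == "__main__":
    print(get_order("example-123"))
```

The exact fallback matters less than the habit: every cross-service call site should have a deliberate answer to "what do we do when this dependency is unreachable?"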
Stay ahead of the cloud curve
Every week, I share AWS news, feature releases, and the most talked-about DevOps trends, all in one place.

