AWS Outage Aftermath

Breakdown of the AWS US-EAST-1 DNS incident—timeline, cost modelling, concentration risk, DORA reporting timelines, and a 7-day engineering plan.
Oct 27, 2025

AWS Outage Aftermath: Root Cause, Real Costs, and Resilience Moves for CISOs

Source: Trescudo Intelligence • Author: Evangeline Smith, MarCom

Last updated: Oct 27, 2025 (Europe/Lisbon)

What happened (and why)

  • What happened: A major failure in US-EAST-1 triggered DNS resolution issues inside AWS (affecting EC2-internal DNS dependencies and load-balancer monitoring). Massive knock-on errors hit thousands of apps globally (Snapchat, Fortnite, Signal, Alexa/Ring, banks, government services). AWS says this was not a cyberattack. Services were largely recovered the same day. (The Verge)

  • Why it matters: The incident exposed concentration risk on a few hyperscalers/regions (esp. US-EAST-1), and fragile failover patterns. Many platforms could not quickly shift traffic/resolve dependencies outside the affected plane. (The Guardian)

  • Cost signals: There’s no authoritative global loss figure yet, but credible analyses describe multi-million-per-hour exposures for large platforms, with aggregate industry-wide costs in the billions once advertising, retail, fintech, and ops penalties are included. Use a quick model: revenue-at-risk/hour × hours disrupted + SLA penalties + support/overtime + churn/brand impact. (The Economic Times)


Timeline (condensed)

  • ~03:11 ET (07:11 UTC) October 20: Widespread failures began, centered on US-EAST-1; DNS resolution went sideways and service errors cascaded. (The Verge)

  • ~06:35 ET: AWS said the issue was fully mitigated; some customers still saw backlogs and elevated errors as systems caught up. (The Washington Post)

  • Later that day: AWS confirmed normal operations were restored (by 15:01 PDT) and directed customers to the Health Dashboard. (Amazon News)

Scope & examples: Snapchat, Fortnite/Epic, Roblox, Signal/WhatsApp (reported impacts), Alexa & Ring (unresponsive/recording gaps), Canva, Airtable, Zapier, McDonald’s app; UK banks & HMRC also reported problems. (The Verge)


Root cause (what we know now)

  • Primary failure mode: DNS resolution inside AWS networking that affected service discovery/addressing (not an external DDoS or cyberattack). AWS and multiple outlets point to the EC2-internal DNS dependency and its interaction with load-balancer monitoring. (The Verge)

  • Security angle: No evidence of malicious activity per AWS. This was an availability/operations incident—still a business continuity and operational resilience event for you. (Amazon News)

Who is responsible?

  • Immediate responsibility: AWS (control plane / internal DNS).

  • Shared-responsibility implications: You own resilience architecture (multi-AZ/region, retries, DNS independence, graceful degradation). Many outages become customer outages because failover and backpressure were not pre-tested. (WIRED)

The DORA angle (EU financial sector)

DORA is live (applicable since Jan 17, 2025). For EU/UK groups operating EU financial entities, this outage is a textbook operational resilience event with reporting and governance implications:

  • Major-incident reporting timelines (harmonised by the ESAs); a deadline-tracking sketch follows this list.
    • Initial notification: within 4 hours of classifying the incident as “major,” and no later than 24 hours from awareness.
    • Intermediate report: within 72 hours of the initial notification.
    • Final report: within 1 month of the latest intermediate report.
    • Weekends/holidays: “noon the next business day” allowances apply to many entities. (esma.europa.eu; European Banking Authority)

  • What regulators expect to see: an internal classification methodology aligned to the RTS thresholds, evidence of supplier notifications, and clean timelines from awareness → classification → reports. (DLA Piper)

  • Third-party & concentration risk. DORA establishes EU oversight of critical ICT third-party providers (CTPPs) to address systemic/concentration risk; the ESAs have published an oversight guide as this regime ramps up. (Cloud providers serving many financial entities may be designated.) (EIOPA; bafin.de)

  • National authority expectations. Examples: the Central Bank of Ireland emphasises major-incident reporting from Jan 17, 2025 and third-party registers; the EBA repealed its PSD2 incident-reporting guidelines in favour of DORA harmonisation. (Central Bank of Ireland)
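
A minimal sketch of how those reporting clocks could be tracked internally, assuming only the harmonised timelines summarised above (4 hours from classification, capped at 24 hours from awareness; 72 hours to the intermediate report; one month to the final report). The class and field names are illustrative, the one-month deadline is approximated as 30 days, and the weekend/holiday allowance is omitted.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class DoraReportingClock:
    """Illustrative tracker for DORA major-incident reporting deadlines."""
    awareness: datetime          # when the entity became aware of the incident
    classified_major: datetime   # when the incident was classified as "major"

    @property
    def initial_due(self) -> datetime:
        # Initial notification: 4h from classification, and no later than 24h from awareness.
        return min(self.classified_major + timedelta(hours=4),
                   self.awareness + timedelta(hours=24))

    @property
    def intermediate_due(self) -> datetime:
        # Intermediate report: 72h after the initial notification (approximated by its deadline).
        return self.initial_due + timedelta(hours=72)

    @property
    def final_due(self) -> datetime:
        # Final report: within 1 month of the latest intermediate report (approximated as 30 days).
        return self.intermediate_due + timedelta(days=30)

# Example: awareness at 07:30 UTC on Oct 20, classified "major" at 09:00 UTC.
clock = DoraReportingClock(datetime(2025, 10, 20, 7, 30), datetime(2025, 10, 20, 9, 0))
print(clock.initial_due, clock.intermediate_due, clock.final_due)
```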

Bottom line for boards: Under DORA, outages at critical suppliers (such as a hyperscaler region) become your risk to classify, report, and evidence, complete with supplier governance, tested recovery, and measurable performance against RTO/RPO.


Cost: What to tell your CFO today (defensible ranges)

There is no authoritative global total yet. You can frame costs with order-of-magnitude references:

  • Media/analyst signals: “millions per hour” for large platforms; aggregate “billions” likely across sectors. (The Economic Times)

  • Business impact examples from reports: ad revenue loss, stalled orders/subscriptions, customer-support spikes, SLA penalties. (CloudZero)

How to estimate your slice (quick model):

  1. Revenue at risk per hour (by product/channel) × hours disrupted.

  2. SLA penalties + make-good ads/credits + Ops overtime.

  3. Churn multiplier for premium users impacted (short, contained blackout → small; long, widespread blackout → higher).

Use logs to bound the period with elevated 5xx/latency and compare to historical conversion/throughput.
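
For a shared CFO/SRE view, the quick model above can be captured in a few lines; every input below is a placeholder to be replaced with your own log-derived figures, not an estimate for any real platform.

```python
def outage_cost_estimate(revenue_at_risk_per_hour: float,
                         hours_disrupted: float,
                         sla_penalties: float,
                         make_goods: float,
                         ops_overtime: float,
                         premium_users_impacted: int,
                         churn_rate: float,
                         value_per_churned_user: float) -> float:
    """Order-of-magnitude outage cost: direct revenue + contractual + ops + churn."""
    direct_revenue = revenue_at_risk_per_hour * hours_disrupted
    contractual = sla_penalties + make_goods
    churn = premium_users_impacted * churn_rate * value_per_churned_user
    return direct_revenue + contractual + ops_overtime + churn

# Placeholder inputs: ~3 hours of elevated 5xx/latency, bounded from your own logs.
print(outage_cost_estimate(
    revenue_at_risk_per_hour=250_000, hours_disrupted=3.0,
    sla_penalties=40_000, make_goods=25_000, ops_overtime=15_000,
    premium_users_impacted=10_000, churn_rate=0.01, value_per_churned_user=120.0,
))
```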


CISO checklist — what to fix this week

1) DNS resilience & independence

  • Add redundant resolvers (provider + self-hosted caches), health-checked CNAME patterns, short but practical TTLs, and backoff/idempotent retries in SDKs. Test DNS-blackhole scenarios in chaos drills. (WIRED)
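
One way to sketch the redundant-resolver idea, assuming the dnspython package is available; the resolver IPs, timeouts, and cache TTL below are placeholder values, and a production setup would normally do this at the infrastructure layer rather than per application.

```python
import time
import dns.resolver  # pip install dnspython (assumed available for this sketch)

# Ordered resolver pools: provider resolvers first, self-hosted caches as fallback (placeholder IPs).
RESOLVER_POOLS = [["10.0.0.2"], ["192.168.53.53", "192.168.53.54"]]
_cache: dict[str, tuple[float, list[str]]] = {}

def resolve_with_fallback(name: str, cache_ttl: float = 30.0) -> list[str]:
    """Resolve A records, trying each resolver pool in turn and caching answers briefly."""
    now = time.monotonic()
    cached = _cache.get(name)
    if cached and now - cached[0] < cache_ttl:
        return cached[1]                    # serve from the short local cache
    last_err = None
    for pool in RESOLVER_POOLS:
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = pool
        resolver.lifetime = 2.0             # fail fast so the next pool gets a chance
        try:
            answers = [rr.address for rr in resolver.resolve(name, "A")]
            _cache[name] = (now, answers)
            return answers
        except Exception as err:            # timeouts, SERVFAIL, NXDOMAIN, ...
            last_err = err
    raise RuntimeError(f"all resolver pools failed for {name}") from last_err
```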

2) Failover that actually fails over

  • For Tier-1 services, require active-active multi-AZ and tested multi-region patterns, plus region-evacuation runbooks (quarterly drills). Validate stateful tiers (DB/queues) for cross-region promotion. (The Verge)
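
A client-side illustration, using only the Python standard library, of preferring the first healthy regional endpoint; the URLs, /healthz path, and timeout are assumptions, and in practice this logic usually lives in a global load balancer or DNS failover policy rather than in application code.

```python
import urllib.request

# Placeholder regional endpoints; in practice these come from configuration or service discovery.
REGIONAL_ENDPOINTS = [
    "https://api.eu-west-1.example.com",
    "https://api.us-east-1.example.com",
]

def is_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """Cheap readiness probe against a regional endpoint."""
    try:
        with urllib.request.urlopen(f"{base_url}/healthz", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # URLError and socket timeouts are OSError subclasses
        return False

def pick_endpoint() -> str:
    """Prefer the first healthy region; fail loudly if none respond."""
    for url in REGIONAL_ENDPOINTS:
        if is_healthy(url):
            return url
    raise RuntimeError("no healthy regional endpoint available")
```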

3) Graceful degradation

  • Feature-flag or shed non-essentials (ads, recommendations, heavy analytics) so core transactions continue. Implement circuit breakers and queue backpressure to prevent cascading failures. (WIRED)
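
A compact circuit-breaker sketch for a non-essential dependency (thresholds and the example call are illustrative); paired with a feature flag, it lets the core transaction path return a degraded response instead of waiting on a failing downstream.

```python
import time

class CircuitBreaker:
    """Opens after `max_failures` consecutive errors; allows a trial call after `reset_after` seconds."""

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, *args, fallback=None, **kwargs):
        if self.failures >= self.max_failures:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback              # circuit open: shed the call, degrade gracefully
            self.failures = 0                # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
            self.failures = 0                # success closes the circuit
            return result
        except Exception:
            self.failures += 1
            self.opened_at = time.monotonic()
            return fallback

# Usage: recommendations are non-essential, so checkout keeps working without them.
recommendations_breaker = CircuitBreaker()
# items = recommendations_breaker.call(fetch_recommendations, user_id, fallback=[])
```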

4) Supplier governance (DORA-ready)

  • Demand 1-hour notification and hourly updates for Sev-1 incidents from critical SaaS/ICT vendors; flow down the incident-reporting data needed for DORA RTS timelines; run joint failover exercises/tabletops (region-out, DNS failure). Keep a third-party register current. (Central Bank of Ireland)

5) Evidence for DORA reports

  • Ensure your SOAR/IR tooling stamps awareness, classification, initial notice, intermediate (72 h), and final (1 month), with supplier artifacts (status pages, notices, RCAs). (esma.europa.eu)
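
One way to make those timestamps and artifacts auditable is a structured evidence record exported alongside the reports; the field names below are illustrative, not a DORA schema.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class IncidentEvidence:
    """Illustrative audit trail for a DORA major-incident report."""
    incident_id: str
    awareness: datetime
    classification: Optional[datetime] = None
    initial_notice: Optional[datetime] = None
    intermediate_report: Optional[datetime] = None
    final_report: Optional[datetime] = None
    supplier_artifacts: list[str] = field(default_factory=list)  # status pages, notices, RCAs

    def to_json(self) -> str:
        return json.dumps(asdict(self), default=str, indent=2)

evidence = IncidentEvidence("INC-2025-1020", awareness=datetime(2025, 10, 20, 7, 30))
evidence.supplier_artifacts.append("vendor-status-page-snapshot-2025-10-20.png")
print(evidence.to_json())
```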


7-day engineering checklist (hand to platform/SRE)

  • Dependency graph: Generate a live map (APM/observability) of cross-region and third-party calls; identify US-EAST-1 hot spots. (WIRED)

  • DNS hardening:

    • Dual resolvers (provider + self-hosted caches), short TTLs for failover, health-checked CNAME patterns.

    • Validate idempotent retries and exponential backoff in all SDKs (see the retry sketch after this list). (The Verge)

  • Traffic management: Global load-balancing with automated region failover; regional circuit-breakers to stop cascading failure. (The Verge)

  • Data plane choices: For stateful tiers (databases/queues), ensure cross-region replication and runbooks for promoting read replicas.

  • Chaos drills: Simulate DNS blackhole and region-out once per quarter; measure MTTR and customer-visible SLOs. (WIRED)
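
For the retry/backoff item above, a minimal decorator sketch using only the standard library; attempt counts, delays, and the example function are illustrative, and it should wrap only idempotent operations.

```python
import random
import time
from functools import wraps

def retry_idempotent(max_attempts: int = 5, base_delay: float = 0.2, max_delay: float = 5.0):
    """Exponential backoff with full jitter; apply only to idempotent calls."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts - 1:
                        raise                               # out of retries: surface the error
                    delay = min(max_delay, base_delay * 2 ** attempt)
                    time.sleep(random.uniform(0, delay))    # full jitter avoids synchronised retry storms
        return wrapper
    return decorator

@retry_idempotent()
def get_order_status(order_id: str) -> dict:
    ...  # placeholder for an idempotent read against a regional API
```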


Risk & governance lens (EU/UK)

  • NIS2/DORA expectations: This is a textbook operational resilience event. Boards should see tested failover evidence, RTO/RPO performance, and supplier governance (SLAs, joint exercises). Outages that trace to known single points without compensating controls invite supervisory pressure. (Context from ENISA/NCSC on systemic risk & hyperscaler concentration.) (The Guardian)


FAQ for execs (use in your internal memo)

  • Was this a hack? No. AWS states it was an internal DNS/infra failure, not a cyberattack. (Amazon News)

  • Why were we down if AWS was “up” by morning? Recovery causes backlogs & retries; if your architecture lacks graceful degradation and multi-region readiness, user-visible issues can persist. (The Washington Post)

  • Can this happen again? Yes—US-EAST-1 is a historic blast center (2020/2021/2023 outages too). Design for it. (The Verge)

  • Is multi-cloud the answer? Sometimes. It reduces concentration risk but adds complexity. At minimum, do multi-region well; choose multi-cloud for truly critical customer-facing functions after cost/complexity review. (Forbes)


Sources & further reading

  • AWS update (normal ops restored; refer to Health Dashboard). (Amazon News)

  • Service impact & root-cause framing (DNS / US-EAST-1): The Verge; Washington Post; AP; Wired; GeekWire. (The Verge)

  • Scope & concentration risk (banks, HMRC, platforms): The Guardian. (The Guardian)

  • Cost signals & commentary: Economic Times (per-hour brand estimates, insurance angle); Forbes (“billions lost” narrative); CloudZero (FinOps cost lenses). (The Economic Times)

  • Live blogs/roundups: TechRadar; Tom’s Guide; Newsweek. (TechRadar)


Slide-ready “What we’re doing now”

  • Today: Dependency map; DNS hardening; status-page comms; vendor attestations.

  • This week: Region failover test; backlog handling tune-ups; DR runbook proof; board brief with measured SLOs.

  • This quarter: Chaos/region-out drill; negotiate supplier IR SLAs & joint failovers; consider multi-region or selective multi-cloud for Tier-1.
