At what point does S3 break?

AWS’ Jeff Barr announced yesterday that their S3 service:

holds more than 449 billion objects and processes up to 290,000 requests per second for them at peak times

I’m a very happy user of S3 and much of the rest of AWS, but seeing figures like this forces me into risk-assessment mode: at what point does S3 (or similar services) break?

Yes, S3 has had outages (2008 was a bad year), but these were fundamentally minor. In the scheme of things, they were the equivalent of someone tripping over the rack’s power cord.

What I’m wondering about is: where do S3 et al. start hitting fundamental limits that amp problems up from oops to ohhh shit?

S3 is presumably the largest system of its kind ever; it’s not clear to me that anyone would really know what its failure modes, thresholds, or weak links might be as it continues to grow. Anything from a breakdown in the Dynamo architecture to hard infrastructure limits to failures in operations and management strike me as plausible. What will we see first: data loss, increases in latency, repeated catastrophic outages, or “other”?

One thought on “At what point does S3 break?”

  1. Good question. Of course, it depends in part on what the pain points in scaling are, which as you note include too many unknown unknowns, but I sense (WAG) they aren’t as hard as e.g. EC2/EBS’s. Which I bring up because the human factor could easily be the greatest issue. From what I read of the not-too-long-ago AWS East multi-availability-zone EBS disaster, some of the technical AWS managers should be figuratively taken out back and shot, for they didn’t guard their multi-zone control plane with sufficient fierceness. (And if an anonymous account posted on, or reported on, Hacker News is to be believed, EBS management kept their people in firefighting mode instead of fixing deep problems in what’s a difficult-to-impossible feature.)

    And evidently the smarter types avoid EBS if they can afford to; e.g. Netflix stayed up at that location in part because they made an early decision not to touch it with a ten-foot pole. Instead, their persistent state is maintained by clusters of three machines … with S3 as the backing store….

    I guess we’ll see what happens; I just hope the above debacle was a general wake-up call that will cause AWS as a whole to up their game and at minimum delay the S3 day of reckoning. I.e., their business should be big enough by now that they have a simulation group that has built models of their systems and pushed them as hard as possible. One can hope.

    To take a related example, in the middle of the 20th century MIT built an electric grid simulator. And whatever they and other forward-looking people did managed to keep our (US) four or so big grids in good shape until the messy 1965 Northeast Blackout. And after that we *really* upped our game (e.g. too many generators depended on having grid power for startup…), including serious study of the problem (e.g. I know of LLNL getting a contract to look at it).
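The pattern the comment describes, hot state on a small cluster of replicas with S3 as the durable backing store, can be sketched in a few lines. This is a toy illustration of the general idea, not Netflix’s actual system: all class and variable names here are invented, a write needs a majority ack from three in-memory replicas, and a plain dict stands in for the S3 bucket.

```python
import json

class Replica:
    """One in-memory copy of the state; may be marked down."""
    def __init__(self):
        self.state = {}
        self.up = True

    def apply(self, key, value):
        if not self.up:
            raise ConnectionError("replica down")
        self.state[key] = value

class TripleCluster:
    """Three replicas; a write succeeds if a majority (2 of 3) accept it."""
    def __init__(self, backing_store):
        self.replicas = [Replica() for _ in range(3)]
        self.backing_store = backing_store  # stand-in for an S3 bucket

    def put(self, key, value):
        acks = 0
        for r in self.replicas:
            try:
                r.apply(key, value)
                acks += 1
            except ConnectionError:
                pass
        if acks < 2:
            raise RuntimeError("write failed: no majority")
        return acks

    def snapshot(self, name):
        # Persist the state of any live replica to the backing store,
        # the way a cluster might periodically upload a snapshot to S3.
        for r in self.replicas:
            if r.up:
                self.backing_store[name] = json.dumps(r.state)
                return
        raise RuntimeError("no live replica to snapshot")

store = {}
cluster = TripleCluster(store)
cluster.put("user:1", "alice")
cluster.replicas[0].up = False        # the cluster survives one dead replica
cluster.put("user:2", "bob")
cluster.snapshot("snap-001")
print(json.loads(store["snap-001"]))  # → {'user:1': 'alice', 'user:2': 'bob'}
```

The point of the design is the one the comment makes: the cluster tolerates individual machine failures on its own, and only the (comparatively simple, append-only) snapshot path depends on the external store, so an EBS-style control-plane meltdown has nothing to take down.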
