Ever wonder how we manage multi-tenancy at AWS? Or why we want you to use the personal health dashboard instead of the AWS status dashboard? Are you pining for a bonus content section with probabilistic math? These slides on Shuffle Sharding are for you!!
O.k., so this is me, 15 years ago, building a data center. That's what I used to do for money. This one was about 30 racks, and I was the project lead. It took me about a year to build it, everything from initial design to labeling cables.
These days I work with the AWS ELB team, and we regularly spin up that same amount of capacity in minutes. Think software upgrades or new region builds. This is insane to me. What took me a year now takes me minutes. That's Cloud.
Our Cloud gives us this agility because we all pool our resources. So a much bigger, and better, team than me can build much bigger, and better, data centers than mine, which we all share. 10 years in, pretty much everyone understands that this is awesome.
BUT ... it comes with a challenge. When I built my own data centers, I was the only tenant and didn't need to worry about problems due to other customers. The core competency of a cloud provider is handling this challenge: efficiency and agility but still a dedicated experience.
It's no good sharing everything if a single "noisy neighbor" can cause everyone to have a bad experience. We want the opposite! At AWS we are super into compartmentalization and isolation, and mature remediation procedures. Shuffle Sharding is one of our best techniques. O.k. ..
So to understand Shuffle Sharding, let's start with a traditional service. Here I have 8 instances, web servers or whatever. It's horizontally scalable, and fault tolerant in the sense that I have more than one instance.
We put the servers behind a load balancer, and each server gets a portion of the incoming requests. LOTS of services are built like this. Super common pattern. You're probably yawning!
And then one day we get a problem request. Maybe the request is very expensive, or maybe it even crashes a server, or maybe it's a run-away client that is retrying super aggressively. AND it takes out a server. This is bad.
What's even WORSE is that this can cascade. The load balancer takes the server out, and so the problem request goes to another server, and so on. Pretty soon, the whole service is down. OUCH. Everyone has a bad day, due to one problem request or customer.
At AWS we have a term for the scope of impact: we call it "Blast Radius". And here the blast radius is basically "all customers". That's about as bad as it gets. But this is really common! Lots of systems are built like this out in the world.
O.k., so what can we do? Well, traditional sharding is something that we can do! We can divide the capacity into shards or cells. Here we go with four shards of size two. This change makes things much better!
Now if we repeat the same event, the impact is much smaller, just 25% of what it was. Or more formally ...
The blast radius is now the number of customers divided by the number of shards. It's a big improvement. Do this! At AWS we call this cellularization, and many of our systems are internally cellular. Our isolated Regions and Availability Zones are a big, famous macro example too.
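To make that concrete, here's a minimal sketch in Python (my own illustration, not the formula from the slide) of the traditional-sharding blast radius, using the 8-node, 4-shard example above:

```python
def traditional_blast_radius(num_shards: int) -> float:
    """Fraction of customers impacted when one shard is taken out:
    everyone on that shard shares fate, everyone else is fine."""
    return 1.0 / num_shards

# 8 nodes split into 4 shards of 2 -> 25% of customers impacted
print(f"{traditional_blast_radius(4):.0%}")
```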
THIS is one reason why the personal health dashboard is so much better than the status page. The status for your cell might be very different than someone else's!
... anyway, back to Shuffle Sharding. With Shuffle Sharding we can do much better again than traditional sharding. It's deceptively simple. All we do is assign each customer to two servers, pretty much at random.
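Here's a minimal sketch of that assignment step (a toy illustration with names of my own; this is not the Infima library's actual algorithm): give each customer a stable, pseudo-random pair of nodes.

```python
import random

NODES = list(range(8))  # 8 nodes, as in the slides
SHARD_SIZE = 2          # each customer is assigned to 2 of them

def assign_shuffle_shard(customer_id: str) -> list[int]:
    """Pick SHARD_SIZE nodes 'pretty much at random', but deterministically
    per customer so repeated lookups agree."""
    rng = random.Random(customer_id)
    return sorted(rng.sample(NODES, SHARD_SIZE))

for customer in ["customer-a", "customer-b", "customer-c"]:
    print(customer, assign_shuffle_shard(customer))
```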
So for example, one customer is assigned to nodes 1 and 4. Suppose we get a problem request from that customer. What happens?
Well ... it could take out nodes 1 and 4. So that customer is having a bad experience now. Amazing devops teams are on it, etc., but it's still not great for them. Not much we can do about that. But what about everyone else?
Well, if we look at that customer's neighbors, they're still fine! As long as their client is fault tolerant, which can be as simple as using retries, they can still get service. One neighbor still gets service from node 2, for example.
O.k. let's PAUSE for a second and appreciate that. Same number of nodes. Same number of nodes for each customer. Same number of customers. Just by using MATH, we've reduced the blast radius to 1 customer! That's INSANE.
The blast radius ends up getting really small. It's roughly proportional to the factorial of the shard size (small) divided by the factorial of the number of nodes (which is big) ... so it can get really really small.
Let's look at an example. For 8 nodes and a shard size of 2, like in these slides, the blast radius ends up being just 3.6%. What that means is that if one customer triggers an issue, only 3.6% of other customers will be impacted. Much better than the 25% we saw earlier.
But that's still way too high for us. For AWS HyperPlane, the system that powers VPC NAT Gateway, Network Load Balancer, PrivateLink, etc ... we design for a hundred nodes and a shard size of five. Let's look at those numbers ...
O.k. now things get really, really small. About 0.0000013% of other customers would share fate in this case. It's so small that, because we have fewer than a million customers per cell anyway (and over 75 million possible shards), there can be zero full overlap.
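One way to reproduce those numbers (a sketch under the assumption that only a customer assigned the exact same shard shares full fate): the chance that another randomly assigned customer lands on the identical set of m nodes out of n is 1 / C(n, m).

```python
from math import comb

def full_overlap_fraction(nodes: int, shard_size: int) -> float:
    """Fraction of other randomly assigned customers whose shard is identical."""
    return 1.0 / comb(nodes, shard_size)

print(f"{full_overlap_fraction(8, 2):.1%}")    # 1/28          ~= 3.6%
print(f"{full_overlap_fraction(100, 5):.9%}")  # 1/75,287,520  ~= 0.0000013%
```

Partial overlaps are possible too, but as above, a fault-tolerant client can retry around a single shared node.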
Again, think about that: we can build a huge multi-tenant system with lots of customers on it, and still guarantee that there is *no* full overlap between those customers. Just using math. This still blows my mind.
If you want to try some numbers out for yourself, here's a Python script that calculates the blast radius: https://gist.github.com/colmmacc/4a39a6416d2a58b6c70bc73027bea4dc. Try it for Route 53's numbers. There are 2048 Route 53 virtual name servers, and each hosted zone is assigned to 4. So n = 2048, and m = 4.
If you want to make your own Shuffle Shard patterns, and make guarantees about non-overlap, we open sourced our approach years ago. It's at: https://github.com/awslabs/route53-infima
Shuffle Sharding is amazing! It's just an application of combinatorics, but it decreases the blast radius by huge, factorial-scale factors. So what does it take to use it in practice?
Well the client has to be fault-tolerant. That's easy, nearly all are. The technique works for servers, queues, and even things like storage. So that's easy too. The big gotcha is that you need a routing mechanism.
You either give each customer resource a DNS name, like we do for S3, CloudFront, and Route 53, and handle it at the DNS layer, or you need a content-aware router that can do Shuffle Sharding. Of course at our scale this makes sense, but it might not for everyone.
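To show what a content-aware router might do, here's a hedged sketch (hypothetical names, not how our routers are actually built): hash the customer identifier to recompute the same shuffle shard on every request, then forward to a healthy node in it.

```python
import hashlib
import random

NODES = [f"node-{i}" for i in range(8)]
SHARD_SIZE = 2

def shard_for(customer_id: str) -> list[str]:
    """Derive the customer's shuffle shard from a stable hash, so every
    router instance computes the same answer with no shared state."""
    digest = hashlib.sha256(customer_id.encode()).digest()
    return random.Random(digest).sample(NODES, SHARD_SIZE)

def route(customer_id: str, healthy: set[str]) -> str:
    """Forward to the first healthy node in the customer's shard."""
    for node in shard_for(customer_id):
        if node in healthy:
            return node
    raise RuntimeError("the customer's whole shard is unhealthy")

print(route("example-customer", healthy=set(NODES)))
```

Because the shard is derived from a hash rather than stored, any router instance can answer for any customer, which is the same stateless property DNS-based routing gives you.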