How to Handle Overload

In this post, I will talk about my experience dealing with overload in a software system. It is not based on the contents of the Google SRE Book, although some ideas will be similar. If you are interested in the topic, I definitely recommend reading chapter 21 of the book, which covers the subject in more detail.

Even if your service has load balancing and autoscaling set up, it will probably still get overloaded at some point. This can happen for several reasons, such as:

Your service getting more popular, i.e. organic growth
Your client running a load test (for backend services)
A DDoS attack
A retry storm (when your service is having trouble and clients retry their requests)

Autoscaling may help with the first cause, but it likely will not help with the others because it takes time to bring up a new instance. This is less of a problem with serverless, but you still do not want to scale up indefinitely and pay a huge infrastructure bill.

So how do you protect your systems from being overloaded?

Rate Limiting (Throttling)

Quotas

With rate limiting, you assign quotas to each client and reject requests when they are exceeded. As an example, you allow each client 100 requests per second (RPS) and reject anything above.
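To make the 100 RPS example concrete, here is a minimal sketch of a per-client fixed-window counter. It is only an illustration, not a production design: the class name `FixedWindowLimiter` and the injectable `clock` parameter are my own assumptions, it keeps state in memory on a single host, and it ignores the distributed and burst concerns discussed later.

```python
import time
from collections import defaultdict
from typing import Callable


class FixedWindowLimiter:
    """Allows each client up to `limit` requests per one-second window."""

    def __init__(self, limit: int, clock: Callable[[], float] = time.monotonic):
        self.limit = limit
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        # client_id -> (window start in whole seconds, requests counted in that window)
        self.windows = defaultdict(lambda: (0, 0))

    def allow(self, client_id: str) -> bool:
        window = int(self.clock())
        start, count = self.windows[client_id]
        if start != window:
            # A new one-second window has started, so reset the counter.
            start, count = window, 0
        if count >= self.limit:
            return False  # over quota: the caller rejects, e.g. with HTTP 429
        self.windows[client_id] = (start, count + 1)
        return True
```

A request handler would call `limiter.allow(client_id)` first and reject the request when it returns `False`.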

I will assume your APIs require authentication, as you need a way to identify each request and attribute it to a client.

Another complication is that a service may have more than one operation (endpoint), with some more expensive than others. The quotas on the expensive operations should be lower than on the cheap ones.

A simple but practical way to set up quotas is per client/operation. So a quota of X RPS on operation A means each client can call A up to X times per second. Things get more complex when some clients need more RPS than others. You will need to configure exception rules for such clients or raise the quota for everyone.
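As a rough sketch of what per client/operation quotas with exception rules could look like, consider the lookup below. The operation names, client IDs and numbers are all made up for illustration, and a real throttling tool will have its own rule format.

```python
from typing import Dict, Tuple

# Default quota in RPS for each operation (hypothetical names and values).
# The expensive operation gets a lower limit than the cheap one.
DEFAULT_QUOTAS: Dict[str, int] = {
    "GetItem": 100,
    "Search": 10,
}

# Exception rules for clients that need more than the default.
OVERRIDES: Dict[Tuple[str, str], int] = {
    ("client-big", "Search"): 50,
}


def quota_for(client_id: str, operation: str) -> int:
    """Return the RPS quota for a client/operation pair, preferring overrides."""
    return OVERRIDES.get((client_id, operation), DEFAULT_QUOTAS[operation])
```

Keeping the overrides as a small, separate table makes it obvious how many exceptions exist, which is a useful signal that the defaults may need to change.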

You may also be tempted to configure a rule for each client based on the quota they ask for. This may work if you are certain you will only serve a small number of clients. Otherwise, the rules will quickly become too hard to maintain, as clients will constantly be asking you to raise their limits.

Implementing throttling is quite complex and you will likely want to use an existing solution. What should you look for when choosing one?

Requirements for Throttling Tools

Rule Configuration
Check how flexible the rules are. Do they support overrides? Is it easy to make changes? I prefer keeping throttling rules in a code repo, which makes them easy to audit and roll back.
The workflow
After a rule is updated, is it easy to deploy it quickly? Do you have to use a clunky UI to do that?
Is it distributed?
Depending on your load-balancing setup, some hosts may get more traffic from a client than others. Does the tool enforce the quota across all hosts, or does each host count requests on its own?
Are bursts supported?
Can the tool allow bursts of traffic from a client before rate limiting kicks in? That is a useful feature to avoid throttling clients on short-term traffic spikes.
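One common way to support bursts is a token bucket, sketched below under some assumptions: the class and parameter names are mine, there is one bucket per client, and the state lives in memory on a single host. Tokens refill at the steady `rate`, and up to `burst` of them can accumulate, so a client that has been quiet can briefly exceed its steady rate.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows `rate` requests per second on average, with bursts up to `burst`."""

    def __init__(self, rate: float, burst: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate        # tokens added per second
        self.burst = burst      # maximum number of tokens the bucket can hold
        self.clock = clock      # injectable so the bucket can be tested without sleeping
        self.tokens = burst     # start full so an initial burst is allowed
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill for the time elapsed since the last call, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1  # each accepted request spends one token
            return True
        return False
```

With `rate=100` and `burst=200`, for example, a client could send 200 requests at once after an idle period, but would then be held to about 100 RPS until the bucket refills.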

When setting up rate limiting, make sure you put it as close to the entry point into your system as possible. You will want to do as little work as possible if the request is going to be rejected anyway.

Also, make sure to exclude your load-testing client from being throttled.

Even if you implement throttling, your system can still be overloaded, e.g. if multiple clients use all their quotas at the same time. Another approach to protecting against overload is load shedding. It can be set up together with throttling or independently, but it is usually a second line of defense.

Load Shedding

Load shedding means rejecting requests when your system is under stress.

To properly set up load shedding, you will need to know the limiting factor of your service. This is essentially the resource that gets exhausted first when the service is overloaded. For many systems it is the CPU, but it can also be memory, disk I/O, network I/O or something else. To find the limiting factor, you run a stress test.

Stress Test

A stress test generates enough traffic to overwhelm the system. This usually manifests as very high error rates, high latency or timeouts because the system becomes unresponsive. Before running the test, check with the owners of the services you depend on to make sure they can support the load.

While the test is running, you collect metrics about the resources the service is using with a monitoring tool such as Datadog (shameless plug).

You then analyze the graphs to find the resource that got exhausted first, which is going to be your limiting factor.
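If you can export the metrics rather than only eyeball the graphs, the same analysis can be sketched in a few lines. This is a toy example: the function name, the 95% saturation cutoff and the data layout (one dictionary of utilization fractions per point in time) are assumptions, not the export format of any particular monitoring tool.

```python
from typing import Dict, List, Optional


def limiting_factor(samples: List[Dict[str, float]],
                    saturation: float = 0.95) -> Optional[str]:
    """Return the first resource whose utilization reaches `saturation`.

    `samples` is ordered by time; each entry maps a resource name to its
    utilization as a fraction between 0.0 and 1.0.
    """
    for sample in samples:
        saturated = [name for name, value in sample.items() if value >= saturation]
        if saturated:
            # If several resources saturate in the same sample, report the most utilized.
            return max(saturated, key=lambda name: sample[name])
    return None  # nothing saturated: the test did not push the service hard enough
```

A result of `None` is a hint to rerun the test with more traffic.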

Unlike rate limiting, the implementation of load shedding is quite simple. I've seen it done with only a few lines of code. It can be a filter that rejects a request if the current CPU usage is above the configured threshold, say 90%.
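A filter like that could look roughly like the sketch below. The names are hypothetical and the handler signature is simplified to keep the example framework-agnostic. The `cpu_usage` callable is an assumption too: it should return current utilization as a fraction between 0.0 and 1.0, however your platform measures it.

```python
from typing import Callable, Tuple

# A simplified handler: takes a request and returns (status code, body).
Handler = Callable[[str], Tuple[int, str]]


def with_load_shedding(handler: Handler,
                       cpu_usage: Callable[[], float],
                       threshold: float = 0.9) -> Handler:
    """Wrap `handler` so requests are rejected while CPU usage is above `threshold`."""

    def wrapped(request: str) -> Tuple[int, str]:
        if cpu_usage() > threshold:
            # Reject early, before doing any real work for this request.
            return 503, "overloaded, try again later"
        return handler(request)

    return wrapped
```

If CPU is not your limiting factor, the same wrapper works with whatever measurement the stress test pointed to.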

Request Priority

It could make sense for some systems to assign different priorities to different requests. This is useful when some of your clients are more important than others. The same goes for operations, e.g. reads could be prioritized over writes or data plane over control plane.
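One simple way to combine priorities with load shedding is to give each priority its own threshold, so that low-priority requests are shed first. The sketch below assumes CPU is the limiting factor; the priority levels and threshold values are invented for illustration.

```python
from enum import IntEnum
from typing import Dict


class Priority(IntEnum):
    LOW = 0
    NORMAL = 1
    CRITICAL = 2


# CPU usage (as a fraction) above which requests of each priority are rejected.
# Hypothetical values: lower priorities are shed earlier.
SHED_THRESHOLDS: Dict[Priority, float] = {
    Priority.LOW: 0.70,
    Priority.NORMAL: 0.85,
    Priority.CRITICAL: 0.95,
}


def should_shed(priority: Priority, cpu_usage: float) -> bool:
    """Return True if a request of this priority should be rejected right now."""
    return cpu_usage > SHED_THRESHOLDS[priority]
```

At 80% CPU, for instance, low-priority requests are rejected while normal and critical ones still go through.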

Conclusion

Most software systems can and will get overloaded. To increase availability and protect your clients, critical systems should implement rate limiting and load shedding. Rate limiting rejects requests from clients that exceed the configured quota. Load shedding rejects requests when the system is under stress.

Resources

Google SRE Book, chapter 21 - Handling Overload