On a related tangent: "HPBN" (High-Performance Browser Networking) is a great book that covers related concepts.
[0](https://www.oreilly.com/library/view/designing-data-intensiv...)
I've also found that AIMD works better than a circuit breaker in a lot of circumstances.
Go library implementing the above: https://github.com/webriots/rate
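For anyone skimming, here's a minimal sketch of the AIMD idea itself (the names and parameters are mine, not the linked library's API): grow the limit a little on every success, cut it multiplicatively on every failure, so it oscillates just under whatever the backend can sustain.

```go
package aimd

// Limiter is a minimal AIMD sketch, not the linked library's API.
type Limiter struct {
	Limit  float64 // current allowed rate or concurrency
	Min    float64 // floor so the limit never collapses to zero
	Max    float64 // ceiling
	Step   float64 // additive increase per success
	Factor float64 // multiplicative decrease per failure, e.g. 0.5
}

// OnSuccess nudges the limit up additively.
func (l *Limiter) OnSuccess() {
	l.Limit += l.Step
	if l.Limit > l.Max {
		l.Limit = l.Max
	}
}

// OnFailure cuts the limit multiplicatively (e.g. on a timeout or 429).
func (l *Limiter) OnFailure() {
	l.Limit *= l.Factor
	if l.Limit < l.Min {
		l.Limit = l.Min
	}
}
```

Unlike a circuit breaker's open/closed flip, this degrades smoothly under sustained pressure, which is part of why it often behaves better.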
In the field of router buffer management, there are algorithms like Stochastic Fair Blue, which does the latter, but it is somewhat hard to apply to HTTP because you'd have to define a success/failure metric for each request (a latency target, for example), and clients would have to tolerate a small probability of a request being rejected even when they're not being rate limited.
In Google’s paper on the Zanzibar authorization system, they give brief mention to rate limiting clients based on a CPU time allocation, but don’t go into any detail since it’s not a paper on rate limiting.
It’s something that matters less today with ubiquitous autoscaling, where the capacity of the system is whatever you need it to be to give each user what they ask for up to their rate limit, but I’m surprised at my inability to find any detailed account of such a thing being attempted.
Furthermore, modern GPU workloads are far less elastic in capacity scaling.
https://aws.amazon.com/builders-library/workload-isolation-u...
Which isn't exactly what you're talking about, but between that and other things in the "Builders' Library" series, you can see that people are doing this, and writing about it.
Netflix has a blog post for their implementation: https://netflixtechblog.medium.com/performance-under-load-3e...
Pretty easy to pair AIMD with a token bucket (e.g. https://github.com/webriots/rate).
For situations where eventual consistency is good enough, you can run a task in a loop that tries every n seconds to update a quantity. But as you say, that can also saturate, so what you really want is for the task to update, then wait m seconds and go again, where m is more than the time you expect the task to complete in (well under a 50% duty cycle). As the cost of the operation climbs, the time lag increases, but the load on the system increases more slowly. If you want to, collect telemetry on how often the task completes and set up an alarm that fires if it hasn't completed for a duration several times longer than your spikiest loads last.
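A sketch of that loop shape (the update function and interval are placeholders; the point is that the sleep starts only after the work finishes):

```go
package updater

import "time"

// RunLoop runs update, then sleeps for interval *after* it completes,
// rather than on a fixed schedule. If the update gets slower, the
// period stretches with it, so the duty cycle stays well under 50%
// as long as interval exceeds the typical update duration.
func RunLoop(update func(), interval time.Duration, stop <-chan struct{}) {
	for {
		update() // however long this takes...
		select {
		case <-stop:
			return
		case <-time.After(interval): // ...the full wait still follows it
		}
	}
}
```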
I don't think voluntary rate limiting on the client side gets enough column inches. Peer-to-peer, you end up footgunning yourself if you bite off more than you can chew, and if you start stalling on responses, you gum up the server as well.
The site defaults to dark mode for me (I'm assuming based on my system preferences), but all the examples are unusable in dark mode. The examples all seem to be within fixed-width divs that cut off part of the content unless you scroll within the example.
- "Working" I wanted to keep things consistent.
- Content getting cut off was a limitation of the iframe. Most blogging platforms don't allow you to embed another page. This was the best I could do given that limitation.
- I do use AI to bounce ideas around, but a lot of effort went into getting the apps working as intended.
Is it supposed to say, "How it works"?
> By following, you'll have instant access to our new posts in your feed.
> Continue with Google
> More options
As soon as I see this in a blog, I close the tab. Why do authors do this to themselves?
- Medium paywalls the article and forces you to sign up just to read it.
- Substack has the same problem: it's great for funneling people to your paid newsletter, but a sign-up banner appears as soon as the page loads.
- Build your own and you miss out on the social aspect, and there's no proof the numbers are real.
Great visualization tools. Next time I have to explain it, I'll reach for these.
Instead of storing the current number of tokens, you store when the bucket will be full. If you take a token from the bucket, you increment the timestamp accordingly by 1/rps. The only complication is that if the filled timestamp is in the past, you have to first update it to the current timestamp to avoid overfilling.
What's even nicer is that it doubles as a throttle implementation rather than just a rate limiter. You know the bucket is empty if empty_at = filled_at - (max_tokens/rps) is still in the future. From that calculation you also know when it will have capacity again, so you can sleep accordingly. If you put a queue in front of the GCRA, it then starts slowing down new connections rather than just dropping them.
You should still have a limit on the queue, but it's nice that it can gracefully turn from a token bucket into a leaky bucket.
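A minimal sketch of that timestamp trick (the names are mine; the clamp and the empty_at arithmetic follow the description above):

```go
package gcra

import "time"

// Bucket stores no token count, only FilledAt: the instant the bucket
// will be full again, given continuous refill at 1 token per PerToken.
type Bucket struct {
	FilledAt  time.Time     // when the bucket will be full
	PerToken  time.Duration // 1/rps: time to refill one token
	MaxTokens int
}

// Take reports whether a token is available at now; if not, it returns
// how long to sleep, which is what makes this double as a throttle.
func (b *Bucket) Take(now time.Time) (ok bool, wait time.Duration) {
	// If the fill time is in the past, the bucket is already full;
	// leaving the stale timestamp in place would over-fill it.
	if b.FilledAt.Before(now) {
		b.FilledAt = now
	}
	// empty_at = filled_at - max_tokens/rps. If it's still in the
	// future, the bucket is empty and empty_at is when capacity returns.
	emptyAt := b.FilledAt.Add(-b.PerToken * time.Duration(b.MaxTokens))
	if emptyAt.After(now) {
		return false, emptyAt.Sub(now)
	}
	b.FilledAt = b.FilledAt.Add(b.PerToken) // consume one token
	return true, 0
}
```

A queue in front would just sleep for the returned wait before admitting the next connection, which is the token-bucket-into-leaky-bucket behavior described above.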
Sliding window might work well with large intervals. If you have something like a 24h window, a fixed window will abruptly cut things off for hours.
I mostly work with 1-minute windows, so it's fixed all the way.
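For illustration, here's a sketch of the common two-counter approximation of a sliding window (names are mine): the previous fixed window's count is weighted by how much of it still overlaps the sliding window, which avoids the abrupt reset at each boundary.

```go
package slidingwindow

import "time"

// Window approximates a sliding window with two fixed-window counters.
type Window struct {
	Size        time.Duration
	Limit       float64
	windowStart time.Time // start of the current fixed window
	curr, prev  float64   // request counts for current/previous windows
}

// Allow counts a request against the weighted sum of both windows.
func (w *Window) Allow(now time.Time) bool {
	// Roll the fixed windows forward.
	if gap := now.Sub(w.windowStart); gap >= 2*w.Size {
		// Idle for over a full window: both counters are stale.
		w.windowStart, w.prev, w.curr = now, 0, 0
	} else if gap >= w.Size {
		w.windowStart = w.windowStart.Add(w.Size)
		w.prev, w.curr = w.curr, 0
	}
	// Weight the previous window by its remaining overlap.
	overlap := 1 - float64(now.Sub(w.windowStart))/float64(w.Size)
	if w.prev*overlap+w.curr+1 > w.Limit {
		return false
	}
	w.curr++
	return true
}
```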