A few fun videos covering this. I first saw Steve Mould's. He links to Up and Atom. Both are fun.
I just call nanosleep(2) based upon the amount of data processed. The sleep time and the amount of data to process before sleeping are set by a parameter file.
In programs I know will execute for a very long time, if the parameter file changes, parameters are adjusted during the run. Plus I will catch cancel signals to create a restart file should the program be cancelled.
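Roughly this shape, in C (the numbers and names here are made up for illustration; in practice the real values come from the parameter file):

    #define _POSIX_C_SOURCE 199309L
    #include <time.h>

    /* Hypothetical parameters -- in the real program these come from a
     * parameter file and can be re-read while a long job is running. */
    static long sleep_ns        = 5 * 1000 * 1000; /* sleep 5 ms...        */
    static long bytes_per_sleep = 1L << 20;        /* ...per MiB processed */

    /* Call after each chunk of work; throttles with nanosleep(2). */
    static void throttle(long bytes_done)
    {
        static long since_sleep = 0;
        since_sleep += bytes_done;
        if (since_sleep >= bytes_per_sleep) {
            struct timespec ts = { sleep_ns / 1000000000L,
                                   sleep_ns % 1000000000L };
            nanosleep(&ts, NULL);
            since_sleep = 0;
        }
    }

    int main(void)
    {
        for (int i = 0; i < 100; i++)
            throttle(64 * 1024);   /* pretend we processed 64 KiB */
        return 0;
    }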
I've mostly heard it in the context of building and construction videos where they are approaching a new skill or technique and have to remind themselves to slow down.
Going slowly and being careful leads to fewer mistakes, which makes for a "smoother" process and ends up taking less time, whereas going too fast and making mistakes means work has to be redone and ultimately takes longer.
On rereading it, I see some parallels: When one is trying to go too fast, and is possibly becoming impatient with their progress, their mental queue fills up and processing suffers. If one accepts a slower pace, one's natural single-tasking capability will work better, and they will make better progress as a result.
And maybe it's just my selection bias working hard to confirm that he actually is talking about what I want him to say!
Common to hear this in auto racing and probably a lot of other fields
There is a saying: “You don’t rise to your level when performing. You fall to your level of practice.”
In a simple, ideal world, your developers can issue the same number of jobs as you have CPUs available. Until you run into jobs that take more memory than is available. Or that access more disk/network IO than is available.
So you set up temporary storage, or in-memory storage, or stagger the jobs so only a couple of them hit the disks at a time, and then you measure performance in groups of 4 or 8 to see when performance falls off, or stand up an external caching server, or whatever else you can come up with to work within your budget and available resources.
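The "only a couple of them hit the disks at a time" part is basically a counting semaphore around the IO-heavy phase. A toy sketch in C, with made-up job counts and a fake job body:

    #define _POSIX_C_SOURCE 200112L
    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>

    #define JOB_COUNT   16   /* one job per CPU, say            */
    #define MAX_DISK_IO  2   /* but only 2 may hit disk at once */

    static sem_t disk_slots;

    static void *job(void *arg)
    {
        long id = (long)arg;

        /* ...CPU-bound phase: all jobs run freely... */

        sem_wait(&disk_slots);                 /* IO-bound phase: staggered */
        printf("job %ld doing disk IO\n", id);
        sem_post(&disk_slots);

        return NULL;
    }

    int main(void)
    {
        pthread_t tid[JOB_COUNT];

        sem_init(&disk_slots, 0, MAX_DISK_IO);
        for (long i = 0; i < JOB_COUNT; i++)
            pthread_create(&tid[i], NULL, job, (void *)i);
        for (int i = 0; i < JOB_COUNT; i++)
            pthread_join(tid[i], NULL);
        sem_destroy(&disk_slots);
        return 0;
    }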
To the author of the article: I stopped reading after the first two sentences. I have no idea what you are talking about.
Imagine everyone in a particular timezone browsing Amazon as they sit down for their 9 to 5; or an outage occurring, and a number of automated systems (re)trying requests just as the service comes back up. These clients are all "acting almost together".
"In a service with capacity mu requests per second and background load lambda_0, the usable headroom is H = mu - lambda_0 > 0"
Subtract the typical, baseline load (lambda_0) from the max capacity (mu), and that gives you how much headroom (H) you have.
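With made-up numbers, just to put units on it:

    #include <stdio.h>

    int main(void)
    {
        /* Made-up numbers, just to put units on H = mu - lambda_0. */
        double mu       = 1000.0;  /* capacity, requests/second      */
        double lambda_0 =  700.0;  /* typical background load, req/s */
        double burst    =  450.0;  /* a synchronized spike, req/s    */

        double headroom = mu - lambda_0;      /* 300 req/s of slack  */
        printf("headroom: %.0f req/s\n", headroom);
        printf("a burst of %.0f req/s %s\n", burst,
               burst <= headroom ? "fits" : "overloads the service");
        return 0;
    }

The "acting almost together" part is about how likely a burst is to blow past that slack all at once.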
The signal processing definition of headroom: the "space" between the normal operating level of a signal and the point at which the system can no longer handle it without distortion or clipping.
So headroom here can be thought of as "wiggle room", if that is a more intuitive term to you.
Or, if possible, make latency a feature (embrace the queue!). For service-to-service internal stuff, e.g. a request to hard delete something, this can always go through a queue.
And obviously you can scale up as the queue backs up.
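In code this is just the classic producer/consumer shape. A toy in-process sketch in C (a real system would use a durable queue, the names are made up, and there's no overflow handling): accept the delete immediately, let a worker drain it at its own pace, and watch queue depth as the scale-up signal.

    #define _POSIX_C_SOURCE 200112L
    #include <pthread.h>
    #include <stdio.h>

    #define QSIZE 1024

    /* Toy queue of "hard delete" requests. The caller gets an immediate
     * ack; the expensive work happens whenever the worker gets to it. */
    static int  q[QSIZE];
    static int  head, tail, depth;          /* depth = scaling/alert signal */
    static pthread_mutex_t lock     = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  nonempty = PTHREAD_COND_INITIALIZER;

    static void enqueue_delete(int id)      /* fast path: accept and return */
    {
        pthread_mutex_lock(&lock);
        q[tail] = id;
        tail = (tail + 1) % QSIZE;
        depth++;
        pthread_cond_signal(&nonempty);
        pthread_mutex_unlock(&lock);
    }

    static void *delete_worker(void *arg)   /* slow path: drain at its pace */
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            while (depth == 0)
                pthread_cond_wait(&nonempty, &lock);
            int id = q[head];
            head = (head + 1) % QSIZE;
            depth--;
            pthread_mutex_unlock(&lock);
            if (id < 0)                     /* sentinel: shut down */
                return NULL;
            printf("hard-deleting record %d\n", id);
        }
    }

    int main(void)
    {
        pthread_t w;
        pthread_create(&w, NULL, delete_worker, NULL);
        for (int id = 1; id <= 100; id++)
            enqueue_delete(id);
        enqueue_delete(-1);
        pthread_join(w, NULL);
        return 0;
    }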
I do love the maths tho!
If you do that, you're likely to have a latency on the order of a millisecond: putting the previous tokens in one end would get you the logits for the next at a rate of, let's say, 1000 tokens per second... impressive at current rates.
You could also take that same array and program in several latches along the way to synchronize data at selected points, enabling pipelining. This might add a slight (10%) increase in latency, so a 10% or so loss in throughput for a single stream. However, it would allow you to have multiple independent streams flowing through the FPGAs. Instead of serving 1 customer at 1000 tokens/second, you might have 10 or more customers each getting 900 tokens/second.
Parallelism and pipelining are the future of compute.
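Back-of-the-envelope, using the numbers above (the rest is made up):

    #include <stdio.h>

    int main(void)
    {
        /* Numbers from the comment above; purely back-of-the-envelope. */
        double single_stream = 1000.0;  /* tokens/s, unpipelined        */
        double latency_hit   = 0.10;    /* ~10% slower per stream       */
        int    streams       = 10;      /* independent pipelined streams */

        double per_stream = single_stream * (1.0 - latency_hit); /* ~900  */
        double aggregate  = per_stream * streams;                /* ~9000 */

        printf("per stream: %.0f tok/s, aggregate: %.0f tok/s\n",
               per_stream, aggregate);
        return 0;
    }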
https://en.wikipedia.org/wiki/Jevons_paradox
I guess it's the same underlying principle for both paradoxii.
In fact, increasing capacity can make the problems worse, because many people all think of the new capacity as available at the same time.
Also, the plural should be quantified when possible: one paradox, two tridox, three quatrodox...
- Wyatt Earp
SpaceManNabs•1d ago
For example, the paragraph with "compute the exact Poisson tail (or use a Chernoff bound)" and the paragraphs around it could be better illustrated with lines of math instead of mostly language.
I think you do need some math if you want to approach this probabilistically, but I agree that might not be the most accessible approach, and a hard threshold calculation is more accessible and maybe just as good.
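For what it's worth, the "lines of math" here are short. This isn't the article's exact derivation, just the standard Poisson tail and the usual Chernoff bound for it:

    #include <math.h>
    #include <stdio.h>

    /* Exact Poisson tail P(X >= k) for X ~ Poisson(lambda):
     * 1 - sum_{i=0}^{k-1} e^{-lambda} lambda^i / i!
     * Fine for moderate lambda; exp(-lambda) underflows for huge lambda. */
    static double poisson_tail(double lambda, int k)
    {
        double term = exp(-lambda), cdf = 0.0;   /* term = P(X = 0) */
        for (int i = 0; i < k; i++) {
            cdf += term;
            term *= lambda / (i + 1);            /* P(X = i+1) from P(X = i) */
        }
        return 1.0 - cdf;
    }

    /* Chernoff bound for the same tail, valid for k > lambda:
     * P(X >= k) <= exp(-lambda) * (e * lambda / k)^k */
    static double chernoff_bound(double lambda, int k)
    {
        return exp(k - lambda + k * log(lambda / k));
    }

    int main(void)
    {
        /* Made-up numbers: expect 300 arrivals in a window, ask how
         * likely it is to see 350 or more. */
        double lambda = 300.0;
        int    k      = 350;

        printf("exact tail:     %.3e\n", poisson_tail(lambda, k));
        printf("chernoff bound: %.3e\n", chernoff_bound(lambda, k));
        return 0;
    }

The bound is looser than the exact tail, but it's a one-liner and it never underestimates the risk.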
cogman10•1d ago
Particularly because distributed computer systems aren't pure math problems to be solved. Load often comes from usage, which is often closer to random inputs than to predictable variables. Further, how load is processed depends on a bunch of things, from the OS scheduler to the current load on the network.
It can be hard to really intuitively understand that a bottlenecked system processes the same load slower than an unbound system.
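The textbook M/M/1 formula, T = 1/(mu - lambda), is one way to make it concrete (not from the article, just the standard result): same kind of requests, but latency explodes as the server gets close to its bottleneck.

    #include <stdio.h>

    int main(void)
    {
        /* Textbook M/M/1 queue: mean time in system T = 1 / (mu - lambda).
         * Same offered load, servers with very different headroom.       */
        double lambda = 900.0;                      /* load, req/s     */
        double mus[]  = { 950.0, 2000.0, 10000.0 }; /* capacity, req/s */

        for (int i = 0; i < 3; i++) {
            double t_ms = 1000.0 / (mus[i] - lambda);  /* seconds -> ms */
            printf("capacity %6.0f req/s -> mean latency %7.2f ms\n",
                   mus[i], t_ms);
        }
        return 0;
    }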
motorest•1d ago
I feel that I'm missing something obvious. Isn't this doc reinventing the wheel in terms of what very basic task queue systems do? It describes task queues and task prioritization, and how it supports tasks that cache user data. What am I missing?