Maybe his incisive polemic, which I greatly enjoy, was in the end little more than pandering to a certain elitist sensibility.
To make manageable programs, you have to trade off execution speed, both on the CPU and in the organization. His rather mathematized prescriptions imply we should hire quarrelsome academics like him to reduce performance and slow down product development [initially…], all in the interest of his rarefied sensibilities of elegance and simplicity.
Sucks to be right when that’s the truth.
NUMA has only gotten more complicated over time. The range of latency differences is more extreme than ever: we've got L1 at nanosecond latency, and at the other end cold tapes that can take a whole day to load. Figuring out which kind of memory/compute to use in a heterogeneous system (CPU/GPU) can also be difficult. Multi-core is likely the most devastating dragon to arrive since this article was written.
Premature optimization might be evil, but it's the only way to efficiently align the software with the memory architecture. E.g., in a Unity application, rewriting from game objects to ECS is basically like starting over.
If you could only focus on one aspect, I would keep the average temperature of L1 in mind constantly. If you can keep it semi-warm, nothing else really matters. There are very few problems that a modern CPU can't chew through ~instantly assuming the working set is in L1 and there is no contention with other threads.
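To make that concrete, here's a minimal C sketch; the sizes are assumptions (e.g. a 32 KiB L1D) and nothing here is measured. The point is just that the same summation loop is dramatically cheaper when the working set stays L1-resident:

    #include <stddef.h>
    #include <stdint.h>

    /* Assumed sizes for illustration only: ~16 KiB fits a typical 32 KiB
     * L1D; 64 MiB does not, and streams from DRAM on every pass. */
    enum { SMALL = 4 * 1024, LARGE = 16 * 1024 * 1024 };

    static int32_t small_buf[SMALL];
    static int32_t large_buf[LARGE];

    static int64_t sum(const int32_t *p, size_t n, int passes)
    {
        int64_t acc = 0;
        for (int r = 0; r < passes; ++r)
            for (size_t i = 0; i < n; ++i)
                acc += p[i];        /* same instruction mix either way */
        return acc;
    }

    /* sum(small_buf, SMALL, 4096) and sum(large_buf, LARGE, 1) touch
     * roughly the same number of elements, but the first stays warm in
     * L1 after the initial pass while the second pays a stream of cache
     * misses every time. */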
This is the same thinking that drives some of us to use SQLite over hosted SQL providers. Thinking in terms of not just the information, but the latency domain of the information, is what can unlock those bananas 1000x+ speedups.
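The arithmetic behind that is simple enough to sketch; the latencies below are rough, assumed orders of magnitude, not benchmarks of any particular provider:

    #include <stdio.h>

    int main(void)
    {
        /* Assumed, back-of-the-envelope latencies per query. */
        const double hosted_rtt_s  = 1e-3; /* ~1 ms network round trip       */
        const double sqlite_call_s = 1e-6; /* ~1 us in-process function call */
        const int    queries       = 500;  /* a hypothetical chatty page     */

        printf("hosted SQL: %.1f ms\n", queries * hosted_rtt_s  * 1e3);
        printf("SQLite    : %.1f ms\n", queries * sqlite_call_s * 1e3);
        return 0;  /* ~500 ms vs ~0.5 ms: the 1000x lives in the latency domain */
    }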
It isn't the same as manually doing the optimizations, but halfway there is better than not at all.
This is easy, as Dijkstra points out. Allocate an array ¾ the size of L1D, and write a subroutine to reverse its contents, but don't use the array otherwise. Call the useless reversal subroutine often enough (at the beginning of every other subroutine and loop body should generally be enough) that the CPU spends most of its time running it. That will ensure that most of your data references are to the array, and most of your instruction references are to the reversal subroutine, both of which are always in cache.
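In the spirit of the joke, a sketch (assuming a 32 KiB L1D, so the array is 24 KiB):

    #include <stddef.h>
    #include <stdint.h>

    /* 3/4 of an assumed 32 KiB L1 data cache. */
    enum { USELESS_BYTES = 24 * 1024 };
    static uint8_t useless[USELESS_BYTES];

    /* The useless reversal subroutine: it touches nothing but its own
     * array, so both its data and its code stay cached forever. */
    static void useless_reverse(void)
    {
        for (size_t i = 0, j = USELESS_BYTES - 1; i < j; ++i, --j) {
            uint8_t tmp = useless[i];
            useless[i]  = useless[j];
            useless[j]  = tmp;
        }
    }

    /* Call useless_reverse() at the top of every other subroutine and
     * loop body, and your cache hit rate will look magnificent. */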
Other things do matter.
I guess there is still something left, here and there, of the concept of a programming language as a tool for top-down shaping and guiding the thinking of its users. Pascal is the classic example; Golang tries to be like that. I get how annoying it can be. I don't know how JS/TypeScript constructs evolve, but I suspect that's more Fortran-style committee planning than trying to "enlighten" people into doing the "right" things. Happy to be corrected on this.
Maybe the hardest point to interpret in hindsight is the claim that in the sixties programming was an overpaid profession, that hardware costs would keep dropping, and that software costs could not stay the same (You cannot expect society to accept this, and therefore we must learn to program an order of magnitude more effectively). Yeah, in some sense, what does paying for software even mean anymore?
But interestingly, the situation now is kind of similar to the very old days: a bunch of mainframe ("cloud") owners paying programmers to program and manage their machines. And maybe the effectiveness really has gone up dramatically: there's relatively little software running compared to the crazy volume of metal machines, even though the programmers at that scale are still paid a lot. It's not like you get a team of 10 guys programming each individual server.
Oh boy does that read VERY true today!
I'm old enough to remember what text editing was like before word processors. I'm old enough to remember trying to reach people before cell phones. I'm old enough to remember trying to find information in a physical library. There are a lot of problems that electronics has solved.
> When these machines were announced and their functional specifications became known, quite a few among us must have become quite miserable; at least I was. It was only reasonable to expect that such machines would flood the computing community, and it was therefore all the more important that their design should be as sound as possible. But the design embodied such serious flaws that I felt that with a single stroke the progress of computing science had been retarded by at least ten years: it was then that I had the blackest week in the whole of my professional life. Perhaps the most saddening thing now is that, even after all those years of frustrating experience, still so many people honestly believe that some law of nature tells us that machines have to be that way. They silence their doubts by observing how many of these machines have been sold, and derive from that observation the false sense of security that, after all, the design cannot have been that bad. But upon closer inspection, that line of defense has the same convincing strength as the argument that cigarette smoking must be healthy because so many people do it.
“The Mythical Man-Month” is not about OS/360 as such, but about project planning and specifically what was learned about project management during the development.
JITs have taken this to an even higher level — people don't just argue that the machine is fast enough to run their convoluted code with countless unnecessary layers, they argue that their code as they've written it won't be run at all: the JIT will reduce it to a simpler form that can be handled efficiently.
But they can't explain why their poor coworkers who have to read and maintain the code don't deserve the same consideration as the machine!
Highly optimized code being convoluted is an extreme case, for rare algorithms or exotic levels of instruction-level efficiency. The first 95% of optimization is simplifying the code, which benefits both the machine and the programmers.
Interestingly, designing a language that requires every loop to carry a variant (a termination measure) proving it terminates is actually possible; Coq, for example, does pretty much exactly this, from what I understand. My understanding is that this means it isn't Turing complete, but I also think Turing completeness isn't quite as necessary, for as many things, as it might otherwise seem.
Dafny implements this at the compiler level (and with a curly-brace syntax!).
Coq uses other methods more tailored towards recursion.
You are right that if every loop must terminate, it is not Turing complete. So some valid programs will not compile.
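For a concrete feel, here is a sketch in Lean 4 (a proof assistant in the same family as Coq; Lean isn't mentioned in the thread, it's just my choice of example). Definitions are only accepted when the checker can see a reason they terminate:

    -- Accepted: structural recursion on a Nat; every call is on a smaller
    -- value, so termination is evident to the checker.
    def sumTo : Nat → Nat
      | 0     => 0
      | n + 1 => (n + 1) + sumTo n

    #eval sumTo 10  -- 55

    -- Rejected: something like `def spin (n : Nat) : Nat := spin n` fails
    -- to compile because no decreasing measure exists, which is exactly
    -- the "some valid programs will not compile" trade-off noted above.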
There are some interesting programs that potentially never terminate (servers, daemons, OSes, games, etc.) to which formal methods can still be applied, for instance to prove that they preserve certain properties, or that they never terminate, or that they terminate only under certain conditions.
I find the theory extremely elegant and pleasurable, but it's obviously not everyone's cup of tea, as shown by its lack of widespread use.
LLMs might create a revival in the coming years, for the following reasons:
1) The cost of formalization goes down.
2) The cost of proving goes down.
3) The cost of programming goes down.
4) Provable code quality becomes a differentiator among a sea of programs.
Absolutely right, with the implication that new capabilities suddenly available to everyone often end up making the playing field more unequal, not less.
> baroque monstrosity
Warnings. Probably beating a dead horse here, but way too many tools, and they keep adding more.
> Nowadays one often encounters the opinion that in the sixties programming has been an overpaid profession, and that in the coming years programmer salaries may be expected to go down. Usually this opinion is expressed in connection with the recession, but it could be a symptom of something different and quite healthy, viz. that perhaps the programmers of the past decade have not done so good a job as they should have done. Society is getting dissatisfied with the performance of programmers and of their products.
> We should recognise the closed subroutines as one of the greatest software inventions; it has survived three generations of computers and it will survive a few more, because it caters for the implementation of one of our basic patterns of abstraction. Regrettably enough, its importance has been underestimated in the design of the third generation computers, in which the great number of explicitly named registers of the arithmetic unit implies a large overhead on the subroutine mechanism.
Presumably what he's saying is that if you have 16 architectural registers instead of four, entering a subroutine involves saving 16 registers, which takes four times as long as saving four registers, which discourages factoring your program into small subroutines. Is there a more reasonable interpretation?
Because I think this interpretation is wrong. If your leaf subroutine requires six registers for its work, and you only have four caller-saved registers, you need to save the values of two callee-saved registers on entry and restore them on exit. And that's true whether your architecture has eight registers, or 16, or 32. Similarly, if a value needs to survive across a subroutine call, you need to use a callee-saved register for it, or store it in memory.
So the cost of invoking a subroutine doesn't depend on the number of architectural registers, I think.
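A small C sketch of that argument, assuming the x86-64 System V ABI purely for concreteness (the machines Dijkstra had in mind obviously differed): what forces a save/restore is a value living across a call, not the size of the register file.

    /* A stand-in external call; per the ABI it may clobber every
     * caller-saved register. */
    long helper(long x) { return x ^ 0x5bd1e995; }

    long survives_a_call(long a, long b)
    {
        /* t must remain live across the call, so the compiler either
         * parks it in a callee-saved register (one push/pop pair in the
         * prologue and epilogue) or spills it to a stack slot.  Neither
         * cost grows just because the ISA exposes more registers. */
        long t = a * 31 + b;
        long h = helper(t);
        return t + h;
    }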
The exception is that, if you don't have enough registers, you need to store variables in memory, and variables in memory don't need to be saved and restored. You could even posit cases where moving variables from registers into memory makes your program faster because you don't have to spend time saving and restoring a register. (Probably not on a single-issue RISC, because you need extra loads and stores, but you could imagine machines whose memory access is fast enough for this.)
But, if that's the case for a particular variable, nothing stops you from using memory for it, perhaps occasionally holding its value in a caller-saved register temporarily. This saves you the cost of saving and restoring the register you could have used for it. It's as if the register doesn't exist.
So I'm skeptical of the thesis that large architectural register files make subroutine calls slower. At most, they fail to speed up subroutine calling by as much as they speed up other parts of your code.
On the other hand, I don't think I know anything relevant that Dijkstra didn't know in 01972. So, am I missing something?
(All this is assuming a constant number of instructions per second, although in fact stack architectures like the MuP21 might be able to hit a higher clock speed than more conventional designs.)
Large register files do slow down preemptive context switching (multithreading), because the operating system must save even registers that the problem-state code is not using, even caller-saved registers. And large sets of callee-saved registers also slow down cooperative context switching.
Apparently the ISO/IEC 1539-1:2023 [1] committee didn't get the memo.
[1] https://www.iso.org/standard/82170.html
To be honest it wasn't that bad back then either.
A good refresher was reading "Modern FORTRAN: Building Efficient Parallel Applications".
https://www.manning.com/books/modern-fortran