Hardware Stockholm Syndrome

https://programmingsimplicity.substack.com/p/hardware-stockholm-syndrome

66•rajiv_abraham•4d ago

Comments

bibanez•4d ago

You see innovation in this space a lot in research. For example, Dalorex [1] or Azul [2] for operations on sparse matrices. Currently a more general version of Azul is being developed at MIT with support for arbitrary matrix/graph algorithms with reconfigurable logic.

[1] https://ieeexplore.ieee.org/document/10071089 [2] https://ieeexplore.ieee.org/document/10764548

summa_tech•4d ago

> Look at a modern CPU die. See those frame pointer registers? That stack management hardware? That’s real estate. Silicon. Transistor budget.

You don't. What you see is caches, lots of caches. Huge vector register files. Massive TLBs.

Get real - learn something about modern chips and stop fighting 1980s battles.

burnt-resistor•3d ago

This. They're shouting at the sky from whatever squirrels are running around their brains without understanding anything about real CPUs. It's somewhere between a shitpost and AI slop. They offered no better alternative except to bitch about how "imperfect" things are that work. I'd like to see their FOSHW RTL design, test bench, and formal verification for a green field multicore, pipelined, superscalar, SIMD RISC design competitive to RISC-V RV64GC or Arm 64.

atq2119•2h ago

You forgot about branch predictors.

shash•2h ago

I mean, you can’t even see a single frame pointer register. The SRAMs dominate everything else.

Even the functional units are relatively cheap.

Animats•4d ago

If you disallow recursion, or put an upper bound on recursion depth, you can statically allocate all "stack based" objects at compile time. Modula I did this, allowing multithreading on machines which did not have enough memory to allow stack growth. Statically analyzing worst case stack depth is still a thing in real time control.

Not clear that this would speed things up much today.

Panzerschrek•4d ago

Statically analyzing the whole program to determine the required stack size is too constraining. You can't have calls via function pointers (or even virtual calls), can't call third-party code without sources available, and can't call any code from a shared library (since analyzing it is impossible). This may work only in very constrained environments, like the embedded world.

Animats•3d ago

Mostly. Analyzing a shared library is possible. All you need is call graph and stack depth. If function pointers are type constrained, the set of callable functions is finite.
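
To make the "call graph and stack depth" point concrete, here is a minimal sketch of the analysis, assuming every function's frame size and callee set are known. The function names and byte counts are invented for illustration. An indirect call would be modeled by listing every type-compatible target as a callee.

```python
from typing import Dict, List, Set


def worst_case_stack_depth(
    frame_bytes: Dict[str, int],
    calls: Dict[str, List[str]],
    entry: str,
) -> int:
    """Return the deepest stack usage in bytes reachable from `entry`.

    Raises ValueError if the call graph contains recursion, because then
    no static bound exists without an explicit depth limit.
    """
    memo: Dict[str, int] = {}
    on_path: Set[str] = set()  # functions on the call chain being explored

    def depth(fn: str) -> int:
        if fn in memo:
            return memo[fn]
        if fn in on_path:
            raise ValueError(f"recursion through {fn!r}: no static bound")
        on_path.add(fn)
        deepest_callee = max((depth(c) for c in calls.get(fn, [])), default=0)
        on_path.remove(fn)
        memo[fn] = frame_bytes[fn] + deepest_callee
        return memo[fn]

    return depth(entry)


# Hypothetical program: frame sizes and call edges are made up.
FRAMES = {"main": 32, "parse": 64, "emit": 16, "log": 48}
CALLS = {"main": ["parse", "emit"], "parse": ["log"], "emit": ["log"]}

print(worst_case_stack_depth(FRAMES, CALLS, "main"))  # 144 (main + parse + log)
```

A self-recursive entry such as `{"f": ["f"]}` raises `ValueError`, which is the case where a compiler would need a declared recursion bound instead.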

bsder•2h ago

> Not clear that this would speed things up much today.

It would likely slow things down, actually.

One advantage that a stack has is that the stack memory is "hot"--it is very likely in cache due to the fact that it is all right next to one another and used over and over and over. Statically allocated memory, by contrast, has no such guarantee and every function call is likely to be "cold" and need to pull in the cache line with its static variables.

mrheosuper•4d ago

>Pong had no software. Zero. It was built entirely from hardware logic chips - flip-flops, counters, comparators. Massively parallel.

And that is what FPGAs are for.

It strikes me that the author lacks hardware knowledge but still tries to write a post about hardware.

shash•1h ago

It’s interesting that not a single hardware concept is actually explored - nothing like a real alternative architecture or how one would program it. Just a lot of complaints about standard (sounds like x86 only at that) techniques without much depth.

wrs•4d ago

Given this is rewinding to the 1970s, I expected a mention of CSP [0], or Transputers [1], or systolic arrays [2], or Connection Machines [3], or... The history wasn't quite as one-dimensional as this makes it seem.

[0] https://en.wikipedia.org/wiki/Communicating_sequential_proce...

[1] https://en.wikipedia.org/wiki/Transputer

[2] https://en.wikipedia.org/wiki/Systolic_array

[3] https://en.wikipedia.org/wiki/Connection_Machine

lukeh•4d ago

XMOS is still very much alive. But CSP comes with its own set of challenges, which ironically can force one back to using shared memory.

flohofwoe•4d ago

Weird article.

Hasn't most code compiled in the last few decades been using the x86 frame pointer register (ebp) as a regular register? And C also worked just fine on CPUs that didn't have a dedicated frame pointer.

AFAIK the concepts of the stack and 'subroutine call instructions' existed before C because those concepts are also useful when writing assembly code (subroutines which return to the caller location are about the only way to isolate and share reusable pieces of code, and a stack is useful for deep call hierarchies - for instance some very old CPUs with a dedicated on-chip call-stack or 'return register' only allowed a limited call-depth or even only a single call-depth).

Also it's not like radical approaches in CPU designs are not tried all the time, they just usually fail because hardware and software are not developed in a vacuum, they heavily depend on each other. How much C played a role in that symbiosis can be argued about, but the internal design of modern CPUs doesn't have much in common with their ISA (which is more or less just a thin compatibility wrapper around the actual CPU and really doesn't take up much CPU space).

tdeck•3d ago

It's because the historical perspective in this article is really lacking, despite that being the premise of the article.

Not only does the author seem to believe that C was the first popular high-level language, but the claim that hardware provided stack based CALL and RETURN instructions was not universally true. Many systems had no call/return stack, or supported only a single level of subroutine calls (essentially useless for a high level language but maybe useful for some hand written machine code).

FORTRAN compilers often worked by dedicating certain statically allocated RAM locations near a function's machine code for the arguments, return value, and return address for that function. So a function call involved writing to those locations and then performing a branch instruction (not a CALL). This worked absolutely fine if you didn't need recursion. The real benefit of a stack is to support recursion.
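
A rough simulation of why that scheme stops working once recursion appears. This is only a sketch: the `STATIC` dictionary stands in for the fixed RAM locations, all names are hypothetical, and Python's own call stack still handles the return addresses, so only the argument slot is modeled.

```python
# One fixed memory cell per function argument, modeling the static
# allocation scheme described above. All names are hypothetical.
STATIC = {"fact_n": 0, "square_x": 0}


def call_square(x: int) -> int:
    STATIC["square_x"] = x          # caller writes the argument slot
    return square_body()            # then "branches" to the function


def square_body() -> int:
    return STATIC["square_x"] * STATIC["square_x"]


def call_fact(n: int) -> int:
    STATIC["fact_n"] = n
    return fact_body()


def fact_body() -> int:
    if STATIC["fact_n"] <= 1:
        return 1
    sub = call_fact(STATIC["fact_n"] - 1)   # inner call overwrites fact_n
    return STATIC["fact_n"] * sub           # reads the clobbered slot


def fact_with_stack(n: int, stack: list) -> int:
    stack.append(n)                          # fresh slot per activation
    result = 1 if n <= 1 else fact_with_stack(n - 1, stack) * stack[-1]
    stack.pop()
    return result


print(call_square(5))            # 25: fine without recursion
print(call_fact(4))              # 1, not 24: the single slot was overwritten
print(fact_with_stack(4, []))    # 24
```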

cyberax•1h ago

C can work just fine in systems without a stack. A very modern example: eBPF.

MichaelRo•1h ago

I recall starting to program in BASIC on CP/M and ZX-Spectrum machines and they didn't have procedures, only GOTO. Just like assembler, you can use all the JMP you want and not use structured programming and procedures but ... it will all become an unmaintainable mess in a short time.

Very likely in a number of alternate futures (if not all of them), given the original set of CPU instructions, people would gravitate naturally to C and not some GOTO spaghetti or message passing or object oriented whatever.

rep_lodsb•3d ago

The prevailing opinion seems to have shifted towards using the frame pointer for its special purpose, in order to improve debuggability/exception handling?

But the article is really quite misinformed. As you say, many mainframe/mini, and some early microprocessor architectures don't have the concept of a stack pointer register at all, neither do "pure" RISCs.

I'd argue there is a real "C Stockholm Syndrome" though, particularly with the idea of needing to use a single "calling convention". x86 had - and still has - a return instruction that can also remove arguments from the stack. But C couldn't use it, because historically (before ANSI C), every single function could take a variable number of arguments, and even nowadays function prototypes are optional and come from header files that are simply textually #included into the code.

So every function call using this convention had to be followed by "add sp,n" to remove the pushed arguments, instead of performing the same operation as part of the return instruction itself. That's 3 extra bytes for every call that simply wouldn't have to be there if the CPU architecture's features were used properly.

And because operating system and critical libraries "must" be written in C, that's just a fundamental physical law or something you see, and we have to interface with them everywhere in our programs, and it's too complicated to keep track of which of the functions we have to call are using this brain-damaged convention, the entire (non-DOS/Windows) ecosystem decided to standardize on it!

Probably as a result, Intel and AMD even neglected optimizing this instruction. So now it's a legacy feature that you shouldn't use anymore if you want your code to run fast. Even though a normal RET isn't "RISC-like" either, and you can bet it's handled as efficiently as possible.

Obviously x86-64 has more registers now, so most of the time we can get away without pushing anything on the stack. This time it's actually Windows (and UEFI) which has the more braindead calling convention, with every function being required to reserve space on its stack for its callees to spill the arguments. Because they might be using varargs and need to access them in memory instead of registers.

And the stack pointer alignment nonsense, that is also there in the SysV ABI. See, C compilers like to emit hundreds of vector instructions instead of a single "rep movsb", since it's a bit faster. Because "everything" is written in C, this removed any incentive to improve this crusty legacy instruction, and even when it finally was improved, the vector instructions were still ahead by a few percent.

To use the fastest vector instructions, everything needs to be aligned to 16 bytes instead of 8. That can be done with a single "and rsp,-16" that you could place in the prologue of any function using these instructions. But because "everything" uses these instructions, why not make it a required part of the calling convention?

So now both SysV and Windows/UEFI mandate that before every call, the stack has to be aligned, so that the call instruction misaligns it, so that the function prologue knows that pushing an odd number of registers (like the frame pointer) will align it again. All to save that single "and rsp,-16" in certain cases.
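
The alignment dance described above is just arithmetic modulo 16, which a few lines can check. This is a sketch only: the starting `rsp` value is hypothetical, and it assumes 8-byte pushes as on x86-64.

```python
WORD = 8  # bytes pushed by CALL (return address) and by each 64-bit PUSH


def after_call(rsp: int) -> int:
    """CALL pushes the 8-byte return address."""
    return rsp - WORD


def after_pushes(rsp: int, count: int) -> int:
    """Each PUSH of a 64-bit register moves rsp down by 8 bytes."""
    return rsp - WORD * count


def align_down_16(rsp: int) -> int:
    """Same effect as `and rsp, -16`: clear the low four bits."""
    return rsp & -16


rsp_before_call = 0x7FFF_FFFF_E000                  # hypothetical, 16-byte aligned
rsp_at_entry = after_call(rsp_before_call)          # misaligned by the return address
rsp_after_prologue = after_pushes(rsp_at_entry, 1)  # push rbp realigns it

print(rsp_before_call % 16)     # 0
print(rsp_at_entry % 16)        # 8
print(rsp_after_prologue % 16)  # 0
print(hex(align_down_16(0x7FFF_FFFF_DFF8)))  # 0x7fffffffdff0
```

The last line shows the alternative the comment argues for: a function that needs alignment can force it locally, whatever value `rsp` had on entry.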

mcdeltat•3d ago

Would some message passing hardware actually be "better" in terms of performance, efficiency, and ease of construction? I thought moving data between hardware subsystems is generally pretty expensive (e.g. atomic memory instructions, there's a significant performance penalty).

Disclaimer that I'm not a hardware engineer though.

kjs3•3d ago

Well, you can check out the iAPX432 to see an architecture with message passing hardware, and see how well it worked out in practice.

cadamsdotcom•3d ago

To say nothing of virtualization!

And we solve the inefficiency with hypervisors!

killerstorm•1h ago

This is a poorly written, nonsensical article.

"In the beginning, CPUs gave you the basics: registers, memory access, CALL and RETURN instructions."

Well, CALL and RETURN need a stack: RETURN would need an address to return to. So there you go.

The concept of a subroutine was definitely not introduced by C. It was an essential part of older languages like Algol and Fortran, and is inherently a good way to organize computation. E.g., the idea is that you can implement a matrix multiplication subroutine just once and then call it every time you need to multiply matrices. That was absolutely a staple of programming back in the day.

Synchronous calls offer a simple memory management convention: caller takes care of data structures passed to callee. If caller's state is not maintained then you need to take care of allocated data in some other way, e.g. introduce GC. So synchronous calls are the simpler, less opinionated option.

nippoo•47m ago

One of the big things this article fails to mention is that TDP/heat budget is way more of a constraint than number of transistors - at small feature size, silicon is (relatively) cheap, power isn't.

There's no way you can use 100% of your CPU - it would instantly overheat. So it suddenly makes even more sense to have optimised hardware units for all sorts of processes (h264 encoding, crypto etc) if you can do a task any more efficiently than basic logic.

xg15•34m ago

You could say the same about JavaScript: JS is "fast" today, even though the language itself is hilariously inefficient - but browser vendors invested an ungodly amount of work into optimizing their engines, solving all kinds of crazy problems that wouldn't have been there in the first place if the language had been designed differently, until execution is now good enough that you can even use it for high-throughput server-side tasks.

