Sounds like a mainframe. Is there any similarity?
I've always felt like Itanium was a great idea but came too soon and was too poorly executed. It seemed like the majority of the commercial failure came down to friction from switching architectures and the inane pricing rather than the merits of the architecture itself. Basically Intel being Intel.
As soon as a system has variable instruction latency, VLIW completely stops working; the entire concept is predicated on the compiler knowing ahead of time how many cycles each instruction will take to retire. With a memory hierarchy and a nondeterministic workload, the system inherently cannot know how many cycles an instruction will take to retire because it doesn't know up front which tier of memory its data dependencies live in.
The advantage of out-of-order execution is that it dynamically adapts to data availability.
This is also why VLIW works well where data availability is _not_ dynamic, for example in DSP applications.
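To make the latency point concrete, here's an illustrative C sketch (my own made-up kernels, not anything from the article): the first loop is the kind of code a VLIW compiler can schedule statically, the second is not.

    /* FIR filter: every load address follows from the loop indices, the trip
     * count is fixed, and each iteration does the same work, so a VLIW
     * compiler can hoist loads far enough ahead of the multiplies to hide
     * their latency at compile time. */
    float fir(const float *x, const float *h, int n, int taps) {
        float acc = 0.0f;
        for (int i = 0; i < taps; i++)
            acc += x[n - i] * h[i];
        return acc;
    }

    /* Linked-list sum: the address of the next load is the result of the
     * previous load, and each node may sit in L1, L2, or DRAM. The compiler
     * cannot know the latency, so a static schedule must either assume the
     * worst case or stall. */
    struct node { int value; struct node *next; };

    int list_sum(const struct node *p) {
        int acc = 0;
        while (p) {
            acc += p->value;   /* latency of this load is unknowable statically */
            p = p->next;
        }
        return acc;
    }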
As for this Electron thing, the linked article is too puffed up to tell what it's actually doing. The first paragraph says something about "no caches" but the block diagram has a bunch of caches in it. It sort of sounds like an FPGA with bigger primitives (configurable instruction tiles rather than gates), which means that synchronization is going to continue to be the problem, and I don't know how they'll solve for variable latency.
Just curious as to how you would rethink the design of caches to solve this problem. Would you need a dedicated cache per execution context?
As an aside, I never looked into the perf numbers, but having adjustable register windows, while cool, probably made for terrible context switching and/or spilling performance.
But "this is a good idea just poorly executed" seems to be the perennial curse of VLIW, and how Itanium ended up shoved onto people in the first place.
I don’t know how similar this Electron is, but the Mill explained how it could be done.
Edit: aha, found them! https://m.youtube.com/playlist?list=PLFls3Q5bBInj_FfNLrV7gGd...
Is this effectively having a bunch of tiny processors on a single chip each with its own storage and compute?
So one instruction gets done.. the output is passed to the next.
Hopefully I've made somebody mad enough to explain why I am wrong.
One of the big questions here is how quickly it can switch between graphs, or if that will be like a context switch from hell. In an embedded context that's likely to become a headache way too fast, so the idea of a magic compiler fixing it so you don't have to know what it's doing sounds like a fantasy honestly.
I assume that like all past attempts at this, it's about 20x more efficient when code fits in the one array (FPGAs get this ratio), but if your code size grows past something very trivial, the grid config needs to switch and that costs tons of time and power.
(And then your IP is thrown away so the next startup also has to get both right...)
There have been many attempts at mapping general-purpose/GPU programming languages to FPGAs, and none of them worked out.
I suspect the first leading claim they make - that this is a general-purpose CPU, capable of executing anything - is false.
CPUs are hard because they have to interact with memory; basically 95% of CPU design complexity comes from interacting with memory and handling other data hazards.
If this was reducible complexity, they'd have done so already.
The Efficient architecture is a CGRA (coarse-grained reconfigurable array), which means that it executes instructions in space instead of time. At compile time, the Efficient compiler looks at a graph made up of all the “unrolled” instructions (and data) in the program, and decides how to map it all spatially onto the hardware units. Of course, the graph may not all fit onto the hardware at once, in which case it must also be split up to run in batches over time. But the key difference is that there’s this sort of spatial unrolling that goes on.
This means that a lot of the work of fetching and decoding instructions and data can be eliminated, which is good. However, it also means that the program must be mostly, if not completely, static, meaning there’s a very limited ability for data-dependent branching, looping, etc. to occur compared to a CPU. So even if the compiler claims to support C++/Rust/etc., it probably does not support, e.g., pointers or dynamically-allocated objects as we usually think of them.
[1] Most modern CPUs don’t actually execute instructions one-at-a-time — that’s just an abstraction to make programming them easier. Under the hood, even in a single-core CPU, there is all sorts of reordering and concurrent execution going on, mostly to hide the fact that memory is much slower to access than on-chip registers and caches.
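If it helps to picture the "instructions in space" idea, here's a toy sketch (my own illustration, not their compiler's actual output; the tile names are invented) of how a three-operation dataflow graph gets laid out:

    /* y = (a + b) * (c - d): on a CPU these are three instructions executed
     * in time; on a CGRA they become three configured tiles wired together.
     * tile0 (add) and tile1 (sub) fire in parallel every cycle; tile2 (mul)
     * consumes their outputs one pipeline stage later. */
    #include <stdio.h>

    int main(void) {
        int a = 3, b = 4, c = 10, d = 6;

        int tile0 = a + b;          /* adder tile       */
        int tile1 = c - d;          /* subtractor tile  */
        int tile2 = tile0 * tile1;  /* multiplier tile  */

        printf("y = %d\n", tile2);  /* (3+4) * (10-6) = 28 */
        return 0;
    }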
Instead of assembly instructions taking time in these architectures, they take space. You will have a capacity of 1000-100000 instructions (including all the branches you might take), and then the chip is full. To get past that limit, you have to store state to RAM and then reconfigure the array to continue computing.
The mapping wouldn't be as efficient as a bespoke compilation, but it should be able to avoid the configuration swap-outs.
Basically a set of configurations that can be used as an interpreter.
Re: pointers, I should clarify that it’s not the indirection per se that causes problems — it’s the fact that, with (traditional) dynamic memory allocation, the data’s physical location isn’t known ahead of time. It could be cached nearby, or way off in main memory. That makes dataflow operator latencies unpredictable, so you either have to 1. leave a lot more slack in your schedule to tolerate misses, or 2. build some more-complicated logic into each CGRA core to handle the asynchronicity. And with 2., you run the risk that the small, lightweight CGRA slices will effectively just turn into CPU cores.
I think this still depends very much on the compiler: whether it can assemble "patches" of direct dependencies to put into each of the little processing units. The edges between patches are either high-latency operations (memory) or inter-patch links resulting from partitioning the overall dataflow graph. I suspect it's the NoC addressing that will be most interesting.
Naively that sounds similar to a GPU. Is it?
Not very useful then if I can't do this very basic thing?
You have many more very small ALU cores, configurable into longer custom pipelines with each step more or less as wide/parallel or narrow as it needs to be for each step.
Instead of streaming instructions over & over to large cores, you use them to set up those custom pipeline circuits, each running until it’s used up its data.
And you also have some opportunity for multiple such pipelines operating in parallel depending on how many operations (tiles) each pipeline needs.
I haven't read much that explains how they do it.
I have been very slowly trying to build a translation layer between Starlark and Vine as a proof of concept of massively parallel computing. If someone better qualified finds a better solution, the market is sure to have demand for you. A translation layer is bound to be cheaper than teaching devs to write in JAX or Triton or whatever comes next.
I'd like to take this opportunity to plug the FlowMap paper, which describes the polynomial-time delay-optimal FPGA LUT-mapping algorithm that cemented Jason Cong's 31337 reputation: https://limsk.ece.gatech.edu/book/papers/flowmap.pdf
Very few people even thought that optimal depth LUT mapping would be in P. Then, like manna from heaven, this paper dropped... It's well worth a read.
Data transfer is slow and power hungry - it's obvious that putting a little bit of compute next to every bit of memory is the way to minimize data transfer distance.
The laws of physics can't be broken, yet people demand more and more performance, so eventually this issue will become worth the difficulty of solving.
Your last paragraph... you're right that, sooner or later, something will have to give. There will be some scale such that, if you create clumps either larger or smaller than that scale, things will only get worse. (But that scale may be problem-dependent...) I agree that sooner or later we will have to do something about it.
Cache hierarchies operate on the principle that the probability of a bit being operated on is inversely proportional to the time since it was last operated on.
Registers can be thought of in this context as just another cache, the memory closest to the compute units for the most frequent operations.
It's possible to have register-less machines (everything expressed as memory-to-memory operations), but it blows up the instruction word length; better to let the compiler do some of the thinking.
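Rough back-of-the-envelope on that blow-up, assuming a 3-operand ALU op, 32 architectural registers, and flat 64-bit addresses (numbers picked purely for illustration):

    register form:          3 operands x 5-bit register fields =  15 bits of operand encoding
    memory-to-memory form:  3 operands x 64-bit addresses      = 192 bits of operand encoding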
Should work great for NNs.
We're building an EEG headband with a bone-conduction speaker, so in terms of power, our speaker/sounder and LEDs are orders of magnitude more expensive than our microcontroller.
In anything with a screen, that screen is going to suck all the juice, then your radios, etc. etc.
I'm sure there are very specific use-cases where a more energy-efficient CPU will make a difference, but I struggle to think of anything that has a human interface where the CPU is the bottleneck, though I could be completely wrong.
Wonder if it could also be a coprocessor, if the fabric has a limited cell count? Do your DSP work on the optimised chip and hand off to the expensive radio softdevice when your code size is known to be large.
However, the examples indicate that if you have a loop that is executed over and over, the setup cost for configuring the fabric could be worth it. Like a continuous audio stream in wake-word detection, a hearing aid, or continuous signals from an EEG.
Instead of running a general-purpose CPU at 1 MHz, the fabric would be used to unroll the loop; you would use (up to) 100 building blocks for all the individual operations. Instead of one instruction after another, you have a pipeline that can execute one operation in each building block in each cycle. The compute thus only needs to run at 1/100 of the clock, e.g. the 10 kHz sampling rate of the incoming data. Each tick of the clock moves data through the pipeline, one step at a time.
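A minimal software model of that pipeline idea, reusing the same made-up numbers (100 blocks, 10 kHz sample rate); this is only to illustrate the throughput argument, not how their fabric actually works:

    #include <stdio.h>

    #define STAGES 100   /* building blocks used by the unrolled loop */

    int main(void) {
        double pipe[STAGES] = {0};

        /* Each tick of the 10 kHz fabric clock, every stage does its one
         * operation on the sample it holds and hands it to the next stage.
         * A sequential CPU doing the same 100 operations per sample would
         * need a 100 * 10 kHz = 1 MHz instruction rate. */
        for (int tick = 0; tick < 1000; tick++) {
            double out = pipe[STAGES - 1];       /* finished sample leaves       */
            for (int s = STAGES - 1; s > 0; s--)
                pipe[s] = pipe[s - 1] + 1.0;     /* stand-in for each block's op */
            pipe[0] = (double)tick;              /* new sample enters            */
            (void)out;
        }

        printf("steady state: one result per tick after %d ticks of fill\n", STAGES);
        return 0;
    }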
I have no insights but can imagine how marketing thinks: "let's build a 10x10 grid of building blocks, if they are all used, the clock can be 1/100... Boom - claim up to 100x more efficient!" I hope their savings estimate is more elaborate though...
Wonder why they do not focus their marketing on this.
This sounds like the most troublesome part of the design to me. It's very hard to do this static scheduling well. You can end up having to hold up everything waiting for some tiny thing to complete so you can proceed forward in lock step. You'll also have situations where the static scheduling works 95% of the time, but in 5% of cases something fiddly happens. Without any ability for dynamic behaviour and data movement, small corner cases dominate how the rest of the system behaves.
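A small made-up example of how the 5% case poisons a static schedule (slow_fixup is a hypothetical expensive corner case, not anything from their toolchain):

    /* Stand-in for an expensive corner case: a long serial dependency chain. */
    static int slow_fixup(int x) {
        int y = -x;
        for (int k = 0; k < 64; k++)
            y = (y * 3 + 1) & 0xffff;
        return y;
    }

    int process(const int *data, int n) {
        int acc = 0;
        for (int i = 0; i < n; i++) {
            if (data[i] >= 0)
                acc += data[i];              /* common case: ~1 cycle of work */
            else
                acc += slow_fixup(data[i]);  /* rare case, maybe 5% of inputs */
        }
        /* A dynamic (out-of-order) machine pays for slow_fixup only when it
         * actually runs; a fully static, lock-step schedule has to reserve
         * its latency in every iteration, or stall the whole fabric when it
         * occurs. */
        return acc;
    }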
Interestingly you see this very problem in hardware design! All paths between logic gates need to be within some maximum delay to reach a target clock frequency. Often you get long fiddly paths relating to corner cases in behaviour that require significant manual effort to resolve and achieve timing closure.
How 2D is it? Compiling to a fabric sounds like it needs lots of difficult routing; 3D would seem like it would make the routing much more compact.
https://www.patentlyapple.com/2021/04/apple-reveals-a-multi-...
Typically these architectures are great for compute. How will it do on scalar tasks with a lot of branching? I doubt well.