A slight tangent, but I really wish Nvidia would release more details on Tile IR. Specifically on what it enables vs PTX.
Is it just about moving towards more MLIR-based infra? Maybe it’s higher level and thus can enable better codegen across hardware generations?
However, this repo was specifically part of their acquisition of CentML.
I know I can ask an LLM or search on Google, but I was hoping someone in the community could explain it in a way I could understand.
In tile languages, the thread of execution is an entire workgroup (or block in CUDA-speak). You typically work with large vector/matrix-sized values. The compiler decides how to distribute those values onto vector registers across waves of the workgroup. (Example: if your program has a value that is a 32x32 matrix of fp32 elements and a workgroup has 8 32-wide waves, the value will be implemented as 4 standard-sized vector registers in each wave of the workgroup.) All control flow affects the entire workgroup equally since the ToE is the entire workgroup, and so the compiler does not have to do implicit masking. Instead, tile languages usually have provisions for explicit masking using boolean vectors/matrices.
Tile languages are a new phenomenon and clearly disagree on what the exact level of abstraction should be. For example, Triton mostly hides the details of shared memory from the programmer and lets the compiler take care of software-pipelined loads, while here in Tilus it looks like the programmer has to manage shared memory explicitly.
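To give a concrete feel for that model, here's a minimal Triton-style sketch (an illustrative kernel I made up, not from the Tilus repo): the kernel body is written as if one workgroup were a single thread of execution, values like `offs` and `mask` are whole tiles, and the ragged edge is handled with an explicit boolean mask rather than implicit per-thread predication.

    # Illustrative Triton kernel: the unit of execution is a whole block,
    # values are tiles, and masking is an explicit boolean vector.
    import torch
    import triton
    import triton.language as tl

    @triton.jit
    def scale_kernel(x_ptr, y_ptr, n, BLOCK: tl.constexpr):
        pid = tl.program_id(0)                    # which block/workgroup this is
        offs = pid * BLOCK + tl.arange(0, BLOCK)  # a whole tile of indices at once
        mask = offs < n                           # explicit boolean mask for the tail
        x = tl.load(x_ptr + offs, mask=mask)      # compiler maps the tile onto registers across warps
        tl.store(y_ptr + offs, x * 2.0, mask=mask)

    x = torch.randn(1000, device="cuda")
    y = torch.empty_like(x)
    scale_kernel[(triton.cdiv(1000, 256),)](x, y, 1000, BLOCK=256)

Note there is no per-thread index and no if-statement for the tail; the compiler decides how the 256-wide tile is spread across the warps' vector registers.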
Copying and pasting your exact words above into an LLM (Gemini/ChatGPT) provided an answer arguably better than any of the human answers at the time of this post.
I follow Google most closely. They design and manufacture their own accelerators. AWS I know manufactures its own CPUs, but I don't know if they're working on or already have an AI accelerator.
Several of the big players are working on OpenXLA, which is designed to abstract and commoditize the GPU layer: https://openxla.org/xla
OpenXLA mentions:
> Alibaba, Amazon Web Services, AMD, Apple, Arm, Google, Intel, Meta, and NVIDIA
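To make the "abstract the GPU layer" point concrete, here's a minimal JAX sketch (JAX lowers through XLA/OpenXLA; the function and constants are just an illustrative GELU approximation): the same jitted code compiles for whichever backend happens to be installed, CPU, CUDA, ROCm, or TPU.

    # Illustrative: JAX traces this function and hands it to XLA, which
    # compiles it for whatever backend is available (CPU, GPU, TPU).
    import jax
    import jax.numpy as jnp

    @jax.jit
    def gelu(x):
        # Plain array math; XLA fuses and lowers it for the local device.
        return 0.5 * x * (1.0 + jnp.tanh(0.7978845608 * (x + 0.044715 * x**3)))

    x = jnp.linspace(-3.0, 3.0, 8)
    print(jax.devices())  # e.g. [CpuDevice(id=0)] or [CudaDevice(id=0)]
    print(gelu(x))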
I believe those are the Inferentia chips: https://aws.amazon.com/ai/machine-learning/inferentia/
> AWS Inferentia chips are designed by AWS to deliver high performance at the lowest cost in Amazon EC2 for your deep learning (DL) and generative AI inference applications
but I don't know this second if they're supported by the major frameworks, or what
I also didn't remember https://aws.amazon.com/ai/machine-learning/trainium/ until I was looking up that page, so it seems they're trying to have a competitor to the TPUs just naming them dumb, because AWS
> AWS Trainium chips are a family of AI chips purpose built by AWS for AI training and inference to deliver high performance while reducing costs.
> have a competitor to the TPUs just naming them dumb, because AWS
I kind of like "trainium", although "inferentia" I could take or leave. At least it's nice that the names tell you the intended use case.
If they were sitting on excess stock, or struggling to sell, sure.
Meanwhile, Nvidia keeps building more and more libraries...
It’s not rocket science. They can identify many key personnel at Nvidia and make them offers that would be significantly better for them. Cycle every three years and repeat; two or three cycles and you will have replicated the most important parts.
If AMD wants to, they can compete...
That really makes three companies that are happy to concede to Nvidia, because Apple could definitely challenge Nvidia if they wanted to.
Note: I'm not saying that AMD sucks, just that their corporate culture prevents them from being very ambitious.
Apple's closest CPU competition is Qualcomm, and they don't win that.
[1]: https://nvidia.github.io/tilus/getting-started/tutorials/mat...
Vulkan is closer, but CUDA still exposes more features.
It is GNU/Linux that Google/Chrome sees as low priority.
Some folks are reaching for WebGPU outside the browser because Vulkan is a pain to program for, and they misuse WebGPU as middleware, although a less capable one, due to its original design and to who is driving its standardisation process.
Regarding SYCL, Intel basically bought the only company that was shipping a good developer experience for it, CodePlay, which used to do specialized compilers for game consoles, and pivoted into GPGPU.
However, despite all of this, there is hardly any Web 3D experience or game at the level of iPhone games from the OpenGL ES 3.0 glory days, like Infinity Blade, from 2011!
All the attempts to attack CUDA fail to understand why most researchers flock to it instead of enduring the pain of the competition's tooling. They tend to focus on a single aspect of CUDA, be it C++ or something else, but never the polyglot support, the libraries, the IDE integration, the graphical debuggers, or the compiler backends that let other developers target CUDA.
How come this paper has become an NVIDIA project?
Great :).