[1] https://github.com/nathan-barry/gpt2-webgl/blob/main/src/gpt...
I haven't heard this one in a long while
* https://www.oreilly.com/library/view/gpgpu-programming-for/9...
[1] https://ui.adsabs.harvard.edu/abs/2004PatRe..37.1311O/abstra...
Even though the start of the deep learning renaissance is typically dated to 2012 with AlexNet, things were in motion well before that. As you point out, GPU training was validated at least 8 years earlier. Concurrently, some very prescient researchers like Li were working hard to build large-scale datasets like ImageNet (CVPR 2009, https://www.image-net.org/static_files/papers/imagenet_cvpr0...). And in 2012 it all came together.
The ImageNet competition had also been around since 2010. So the ingredients were actually all there before.
Good to see GPT-2 done with shader programming. A port of the techniques used in smaller models, like TinyLlama or Gemma-2B, might lead to more experimentation with older or cheaper hardware on-site.
OpenCL has been pretty handy for inference on older cards, but I'd argue its relevance is waning. llama.cpp has a Vulkan compute backend now, which requires a smaller feature set for hardware to support. Many consumer devices skip OpenCL/CUDA altogether and delegate inference to an NPU library.
My JAX models and the baseline PyTorch models were quite easy to set up there, and in practice there was no noticeable perf difference compared to 8x A100s (which I used for prototyping on our university cluster).
Of course it’s just a random anecdote, but I don’t think nvidia is actually that much ahead.
How many sub-$100 cards could train and inference with OpenCL vs Vulkan? Which is the bigger opportunity for poor people right now if doing GPGPU projects on local hardware?
NVidia has the competition to thank for much of why they happen to dominate the market.
I agree the competition mostly did it to themselves. I'll add Nvidia throwing GPUs at academics who built a lot of stuff on them. Classic strategy that goes back to IBM vs Burroughs.
Using 2 tris for this isn't ideal because you will get duplicate fragment invocations along the diagonal seam where the triangles meet. It is slightly more efficient to use one larger triangle extending outside the viewport; the offscreen parts will be clipped and won't generate any additional fragments.
[1]: https://wallisc.github.io/rendering/2021/04/18/Fullscreen-Pa...
A bit of an older article but still very relevant.
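A minimal sketch of that single oversized triangle, assuming a WebGL context named gl and the position attribute at location 0 (not code from the article):

    // One big triangle in clip space; everything outside [-1, 1] is clipped away
    const verts = new Float32Array([-1, -1,  3, -1,  -1, 3]);
    gl.bindBuffer(gl.ARRAY_BUFFER, gl.createBuffer());
    gl.bufferData(gl.ARRAY_BUFFER, verts, gl.STATIC_DRAW);
    gl.enableVertexAttribArray(0);
    gl.vertexAttribPointer(0, 2, gl.FLOAT, false, 0, 0);
    gl.drawArrays(gl.TRIANGLES, 0, 3);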
I've found that with WebGL2 you can also skip the whole upload/binding of the vertex buffer and just emit the vertices/coordinates from the vertex shader.
Less of an impact than cutting it down to one triangle, but if you're just trying to get a fragment pass going, why not use the least amount of data and CPU->GPU upload possible?
// Map gl_VertexID 0,1,2 to corners (0,0), (1,0), (0,1)
ivec2 vert = ivec2(gl_VertexID & 1, gl_VertexID >> 1);
// UVs become (0,0), (2,0), (0,2) ...
out_uv = 2.0 * vec2(vert);
// ... and clip-space positions (-1,-1), (3,-1), (-1,3): one oversized fullscreen triangle
gl_Position = vec4(out_uv * 2.0 - 1.0, 0.0, 1.0);
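With that, the draw call on the JS side is just (assuming a WebGL2 context, since gl_VertexID needs GLSL ES 3.00):

    // No vertex buffers or attributes needed; gl_VertexID generates the triangle
    gl.drawArrays(gl.TRIANGLES, 0, 3);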
At any rate:
> Traditional graphics APIs like OpenGL are centered around a fixed-function pipeline tailored [...] render passes to accomplish multi-stage algorithms.
This whole paragraph is inaccurate. At least point out what version of OpenGL you are referring to. And no need to use the past tense once you do that.
> which lets you treat the GPU as one giant SIMD processor
This analogy is not accurate or useful for the reasons that are already obvious to you. I think it mostly just confuses the kind of reader that does not have the relevant experience.
> and move them back and forth between host and device with explicit copy calls.
Moving to/from host/device in CUDA/OpenCL/Vulkan/etc. does require explicit copy calls (and for good reason, since the data goes over the PCIe bus on discrete architectures).
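For what it's worth, the WebGL-side analog in this article's setting would be something like this (my sketch, not the article's code; assumes a WebGL2 context with EXT_color_buffer_float, and weightTex / weightsF32 / outputF32 are placeholder names):

    // Host -> device: upload a Float32Array of weights into a texture
    gl.bindTexture(gl.TEXTURE_2D, weightTex);
    gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA32F, w, h, 0, gl.RGBA, gl.FLOAT, weightsF32);
    // Device -> host: read the rendered result back after a pass
    gl.readPixels(0, 0, w, h, gl.RGBA, gl.FLOAT, outputF32);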
> User-driven pipeline: You define exactly what happens and when instead of using a predefined fixed sequence of rendering stages.
You can do the same with compute on OpenGL/vulkan/etc. Like above, specify what version of OpenGL you are talking about to avoid confusion.
> In OpenGL, the output of your computation would ultimately be pixels in a framebuffer or values in a texture
Confusing for the same reason, especially because this statement now uses the word "computation", unlike the statements leading to it.
Personally, I would just rewrite this entire section to point out the limitations of pre-compute graphics APIs (whether it's OpenGL, earlier versions of DirectX, or whatever.)
> and other graphics specific concepts I hijacked
What does 'hijacked' mean here? You're not hijacking anything, you are using the old APIs as intended and using the right terms in the description that follows.
> bypassing any interpolation
"Filtering" is a better choice of word. And you're not bypassing it as much as you are simply not doing any filtering (it's not like filtering is part of the FFP or HW and you're "bypassing" it.)
> A Framebuffer Object (FBO) is a lightweight container
Actually, an FBO is a big turd that incurs heavy runtime validation on the driver side if you ever think about changing the targets. You might actually want to point that out since it is relevant to your implementation. I wouldn't use "lightweight" to describe it anyway.
> we “ping-pong” between them
Yeah, you might be better off creating two separate FBOs per my point above. Vendor-specific territory here, though. But I think the OpenGL wiki touches on this if I remember correctly.
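A minimal sketch of the two-FBO variant (my names, and it assumes float color attachments are renderable, i.e. EXT_color_buffer_float on WebGL2):

    // One FBO per texture; attachments are set up once and never change
    function makeTarget(gl, w, h) {
      const tex = gl.createTexture();
      gl.bindTexture(gl.TEXTURE_2D, tex);
      gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA32F, w, h, 0, gl.RGBA, gl.FLOAT, null);
      gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MIN_FILTER, gl.NEAREST);
      gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MAG_FILTER, gl.NEAREST);
      const fbo = gl.createFramebuffer();
      gl.bindFramebuffer(gl.FRAMEBUFFER, fbo);
      gl.framebufferTexture2D(gl.FRAMEBUFFER, gl.COLOR_ATTACHMENT0, gl.TEXTURE_2D, tex, 0);
      return { tex, fbo };
    }
    let src = makeTarget(gl, w, h);
    let dst = makeTarget(gl, w, h);
    // Each pass: sample src.tex, render into dst.fbo, then swap the pair
    [src, dst] = [dst, src];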
> All of this happens entirely on the GPU’s VRAM bus
What is the "this" that this refers to? If you mean rendering to textures, this statement is inaccurate because there are also several layers of cache between the SIMDs and the VRAM. I think you could just say that the rendering stays on device local memory and call it a day without getting into more detail.
> FBOs form the data bus
I find this new term misleading given that you just talked about a "VRAM bus" in the paragraph above. I'd avoid introducing this new "data bus" term altogether. It doesn't seem like a useful abstraction or one that is described in any detail, so it does not add/take away much from the rest of the article.
> Instead of using fragment shaders to shade pixels for display, we hijack them as compute kernels
Just avoid this hijacking analogy altogether. I think it only adds confusion. You are implementing a compute workload or "kernel" per CUDA terminology in a fragment shader; can just call it that.
> each fragment invocation becomes one “thread”
Actually, each fragment invocation _is_ a thread per every HW vendor's own terminology, so no need for the quotes here. Of course, you haven't introduced the term "thread" up until this point (first and only occurrence of the word), which is the real problem here. A brief section on GPU architecture could help.
> Per-pixel work item: Each fragment corresponds to one matrix element (i, j). The GPU runs this loop for every (i, j) in parallel across its shader cores.
What is "this loop" you are referring to? I know which it is, but there is no loop in the shader code (there's the one for the dot product, but that's not the relevant one.) This is confusing to the reader.
> All it does is draw two triangles which covers the entire view port.
Let's draw a single triangle that covers the whole viewport while we're at it. It's more efficient because it avoids double-fragging the diagonal. https://wallisc.github.io/rendering/2021/04/18/Fullscreen-Pa...
> Reusable: Because the vertex work is identical for all operations, we compile it once and reuse it across every matrix multiply, activation, and bias-add pass.
To be clear, you compile the vertex shader once, but you're still going to have to link N programs. I don't think this is worth pointing out because linking is where most of the shit happens.
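In WebGL terms, something like this (a sketch; the source-string names are placeholders):

    // The vertex shader object is compiled exactly once...
    const vs = gl.createShader(gl.VERTEX_SHADER);
    gl.shaderSource(vs, fullscreenVSSource);
    gl.compileShader(vs);
    // ...but every op (matmul, bias add, activation, ...) still pays for its own program link
    for (const fsSource of [matmulFSSource, biasAddFSSource, geluFSSource]) {
      const fs = gl.createShader(gl.FRAGMENT_SHADER);
      gl.shaderSource(fs, fsSource);
      gl.compileShader(fs);
      const prog = gl.createProgram();
      gl.attachShader(prog, vs);
      gl.attachShader(prog, fs);
      gl.linkProgram(prog);
    }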
> While hijacking WebGL
No hijacking.
> There’s no on-chip scratchpad for blocking or data reuse, so you’re limited to element-wise passes.
The on-chip memory is still there; it's just not accessible from fragment shaders in the old APIs.
> Texture size limits: GPUs enforce a maximum 2D texture dimension (e.g. 16 K×16 K).
I haven't checked, but I would bet this is either an API limitation, or a vendor-specific limit. So to say that "GPUs enforce" would be misleading (is it really the HW or the API? And is it all GPUs? some? vendor-specific?)
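FWIW, the number being described is a per-context query, not a universal hardware constant:

    // Implementation/driver-dependent cap; 16384 is common on desktop, older mobile GPUs report less
    const maxSize = gl.getParameter(gl.MAX_TEXTURE_SIZE);
    console.log("max 2D texture dimension:", maxSize);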
I haven't checked the neural net side of things or the shaders in any detail.
Also, I think a better title would be "Running GPT-2 in WebGL 1.0 / OpenGL 2.0", dropping the subtitle. It's more specific, and you might even get more clicks from people looking to implement this stuff on older hardware. No "lost art" that isn't actually lost, and no "rediscovery".
> You're not hijacking anything, you are using the old APIs as intended and using the right terms in the description that follows.
When it comes to the use of the word "hijacking", I use it to refer to the fact that using graphics shaders for general computation wasn't initially intended. When NVIDIA allowed programmable vertex and pixel shaders, they had no idea they would be used for anything other than graphics rendering. So when I say I "hijack" a fragment shader to compute layers of a neural network instead of using it as part of a rendering pipeline, this is what I mean. I don't see a problem with this use of language.
https://github.com/9ballsyndrome/WebGL_Compute_shader/issues...
Now, five years later, how is adoption of those WebGPU compute shaders going?
By the way, if you are running any high-traffic websites you can donate your users' device data to web3dsurvey (with a simple JS snippet). I'm sure it will be appreciated.
What about mega shaders, Vulkan LLMs, etc.?
p.s. Cool!
Repo: https://github.com/nathan-barry/gpt2-webgl
It is going to take at least another year for WebGPU 1.0 to be available in stable versions of the other browsers. Chrome still doesn't have stable WebGPU on GNU/Linux, and it is already far ahead with extensions that most likely won't be in the 1.0 MVP of the other browsers.
Interesting article.