
Running GPT-2 in WebGL: Rediscovering the Lost Art of GPU Shader Programming

https://nathan.rs/posts/gpu-shader-programming/
129•nathan-barry•16h ago

Comments

nathan-barry•16h ago
A few weeks back, I implemented GPT-2 using WebGL and shaders. Here's a write-up on how I made it, covering how I used textures and framebuffer objects to store the weights and intermediate outputs and move them between passes in WebGL.
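To make the texture-as-tensor idea concrete, here is a minimal sketch (not the repo's exact code; the WebGL2 context `gl`, the helper name, and the R32F / EXT_color_buffer_float choices are assumptions):

    // Store a rows x cols matrix in a float texture and attach it to an FBO
    // so later passes can both sample it and render into it.
    // Assumes WebGL2 plus the EXT_color_buffer_float extension.
    function createMatrixTexture(
      gl: WebGL2RenderingContext,
      rows: number,
      cols: number,
      data: Float32Array | null
    ): { texture: WebGLTexture; fbo: WebGLFramebuffer } {
      const texture = gl.createTexture()!;
      gl.bindTexture(gl.TEXTURE_2D, texture);
      // One matrix element per texel, in the red channel (R32F).
      gl.texImage2D(gl.TEXTURE_2D, 0, gl.R32F, cols, rows, 0, gl.RED, gl.FLOAT, data);
      // Exact texel fetches only: no filtering, no wrapping.
      gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MIN_FILTER, gl.NEAREST);
      gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MAG_FILTER, gl.NEAREST);
      gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_WRAP_S, gl.CLAMP_TO_EDGE);
      gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_WRAP_T, gl.CLAMP_TO_EDGE);
      // The FBO lets a fragment shader write its outputs into this texture.
      const fbo = gl.createFramebuffer()!;
      gl.bindFramebuffer(gl.FRAMEBUFFER, fbo);
      gl.framebufferTexture2D(gl.FRAMEBUFFER, gl.COLOR_ATTACHMENT0, gl.TEXTURE_2D, texture, 0);
      gl.bindFramebuffer(gl.FRAMEBUFFER, null);
      return { texture, fbo };
    }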
vessenes•15h ago
Request -- a big bold link to a working web page right at the top? I read the page, I read your GitHub, saw instructions to clone and run a Node site, and was like "...nah." I think GitHub Pages will serve this up for free if you like.

p.s. Cool!

nathan-barry•15h ago
Yeah, that would have been a good thing to set up. The main thing to add would be loading the weights into the browser.
nathan-barry•15h ago
Here's a link to the GitHub repo. At the top of the README there's a demo of GPT-2 running, along with visualizations of the attention matrices and transformer block outputs.

Repo: https://github.com/nathan-barry/gpt2-webgl

pjmlp•13h ago
Kudos for going down the WebGL route rather than the Chrome-only WebGPU approach that some people most likely expected.

It is going to take at least another year for WebGPU 1.0 to be available in stable versions of other browsers. Chrome still doesn't have stable WebGPU on GNU/Linux, and it is already far ahead with extensions that most likely won't be in the 1.0 MVP of the other browsers.

Interesting article.

rezmason•15h ago
Nice writeup! I'm a fan of shader sandwiches like yours. Judging from the stated limitations and conclusion, I bet this would benefit tremendously from a switch to WebGPU. That's not a criticism! If anything, you're effectively demonstrating that WebGL, a sometimes frustrating but mature ubiquitous computing platform, can be a valuable tool in the hands of the ambitious. Regardless, I hope to fork your repo and try a WebGPU port. Good stuff!
nathan-barry•15h ago
Thanks for the comment! I did this as a final project for a graphics class where we mainly used WebGL for all the assignments. It would be cool to see the improvements a WebGPU port would bring!
GloamingNiblets•12h ago
Out of curiosity, what degree and level is this class? Super cool for a final project
nathan-barry•8h ago
Dr. Vouga's Graphics: Honors class at UT Austin, Undergrad CS
summarity•14h ago
For a WebGPU version there's TokenHawk (a pure WebGPU implementation).
flakiness•15h ago
CUDA is better for sure, but the pure functional nature of the traditional shader is conceptually much simpler, and I kind of relish the simplicity. There is no crazy tiling or anything, just per-pixel parallelism [1]. It won't be as fast as those real, highly tuned kernels, but it's still nice to see something simple doing something non-trivial. It reminded me of the early "GPGPU" days (early 2000s?).

[1] https://github.com/nathan-barry/gpt2-webgl/blob/main/src/gpt...
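To illustrate that "one pure function per pixel" flavor, here is a sketch in the same spirit (not the repo's actual shader; it assumes a WebGL2 / GLSL ES 3.00 setup with one value per texel):

    // Element-wise GELU: each fragment reads its own element and writes the
    // activated value. The parallelism is entirely implicit.
    const geluFS = `#version 300 es
    precision highp float;
    uniform sampler2D u_input;   // activations, one value per texel (R channel)
    out vec4 outColor;
    void main() {
      ivec2 ij = ivec2(gl_FragCoord.xy);        // which element am I?
      float x = texelFetch(u_input, ij, 0).r;   // exact fetch, no filtering
      // tanh approximation of GELU, applied independently per element
      float y = 0.5 * x * (1.0 + tanh(0.7978845608 * (x + 0.044715 * x * x * x)));
      outColor = vec4(y, 0.0, 0.0, 1.0);
    }`;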

3abiton•13h ago
> GPGPU

I haven't heard this one in a long while

glkindlmann•8h ago
A decade does feel like an eternity here; I remember this book* when it came out.

* https://www.oreilly.com/library/view/gpgpu-programming-for/9...

sigmoid10•13h ago
Fun fact: Long before the dawn of the GPU deep learning hype (and even before CUDA was a thing), a bunch of CS nerds from Korea managed to train a neural network on an ATI (now AMD) Radeon 9700 Pro using nothing but shaders [1]. They saw an even bigger performance improvement than Hinton and his group did for AlexNet 8 years later using CUDA.

[1] https://ui.adsabs.harvard.edu/abs/2004PatRe..37.1311O/abstra...

math_dandy•13h ago
Cool, I had not heard about this. Adding this paper to my machine learning teaching bibliography.

Even though the start of the deep learning renaissance is typically dated to 2012 with AlexNet, things were in motion well before that. As you point out, GPU training was validated at least 8 years earlier. Concurrently, some very prescient researchers like Li were working hard to generate large-scale datasets like ImageNet (CVPR 2009, https://www.image-net.org/static_files/papers/imagenet_cvpr0...). And in 2012 it all came together.

sigmoid10•12h ago
Since shaders were designed for, well, shading, this early experiment was more of an academic playground exercise than useful research. But AlexNet still wasn't the first deep neural network trained using CUDA. It had already been done three years earlier: https://dl.acm.org/doi/10.1145/1553374.1553486

The ImageNet competition had also been around since 2010. So the ingredients were actually all there before.

nickpsecurity•15h ago
People have complained that Nvidia dominates the market. Many were looking at their older GPUs for cheap experimentation. One idea I had was just using OpenCL, given all the cross-platform support it has. Even some FPGAs support it.

Good to see GPT-2 done with shader programming. A port of these techniques to smaller models, like TinyLlama or Gemma-2B, might lead to more experimentation with older or cheaper hardware [on-site].

bigyabai•14h ago
Nvidia dominates training more so than inference, and mostly because of their hardware efficiency, not entirely because of CUDA. To beat Nvidia, you have to beat their TSMC investments, beat their design chops, and match their software support. The shortlist of companies that can do that reads "Apple" and nobody else, which is exactly the sort of captive market Nvidia wants to conquer.

OpenCL has been pretty handy for inference on older cards, but I'd argue its relevance is waning. llama.cpp has Vulkan compute now, which requires a smaller feature set for hardware to support. Many consumer devices skip OpenCL/CUDA altogether and delegate inference to an NPU library.

tsurba•13h ago
1.5-2 years ago I did some training for an ML paper on 4 AMD MI250X GPUs (each is essentially 2 GPUs, so really 8 in total, each with 64GB of VRAM) on LUMI.

My Jax models and the baseline PyTorch models were quite easy to set up there, and there was not a noticeable perf difference to 8x A100s (which I used for prototyping on our university cluster) in practice.

Of course it’s just a random anecdote, but I don’t think nvidia is actually that much ahead.

nickpsecurity•10h ago
My goal isn't to top Nvidia. It is to allow people with little money or under export restrictions to conduct realistic experiments (or usage) on local GPUs, especially ones on older process nodes that sell for well under $200 on eBay, maybe $50.

How many sub-$100 cards could train and run inference with OpenCL vs. Vulkan? Which is the bigger opportunity for poor people right now when doing GPGPU projects on local hardware?

pjmlp•13h ago
CUDA dominates and almost no one cares about OpenCL, exactly because OpenCL is stuck on a C99 model, never cared for Fortran, and only adopted a bytecode format when the war was already lost. SYCL/DPC++ is mostly an Intel thing, especially after they acquired CodePlay; Intel and AMD have pretty much always failed to deliver good OpenCL tooling; and now they have decided to reboot the whole oneAPI effort as the UXL Foundation, ...

NVidia has the competition to thank for much of why it happens to dominate the market.

nickpsecurity•10h ago
The "no one cares" market is large enough to support the existence of many competitors. Many have OpenCL support as their only thing in common. That might let us target older or cheap hardware in a cross-platform manner.

I agree the competition mostly did it to themselves. I'll add Nvidia throwing GPUs at academics, who built a lot of stuff on them. A classic strategy that goes back to IBM vs. Burroughs.

cmovq•13h ago
> gl.drawArrays(gl.TRIANGLES, 0, 6);

Using 2 tris for this isn't ideal because you will get duplicate fragment invocations along the diagonal seam where the triangles meet. It is slightly more efficient to use one larger triangle extending outside the viewport; the offscreen parts will be clipped and won't generate any additional fragments.

[1]: https://wallisc.github.io/rendering/2021/04/18/Fullscreen-Pa...
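For reference, a sketch of the single-triangle variant (WebGL2 flavor; the vertex shader derives positions from gl_VertexID, so no vertex buffer is needed and there is no interior seam):

    const fullscreenTriangleVS = `#version 300 es
    void main() {
      // Corners (-1,-1), (3,-1), (-1,3): one triangle covering the viewport;
      // the parts outside clip space are clipped and cost nothing extra.
      vec2 pos = vec2(gl_VertexID == 1 ? 3.0 : -1.0,
                      gl_VertexID == 2 ? 3.0 : -1.0);
      gl_Position = vec4(pos, 0.0, 1.0);
    }`;

    // then, instead of gl.drawArrays(gl.TRIANGLES, 0, 6):
    // gl.drawArrays(gl.TRIANGLES, 0, 3);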

nathan-barry•8h ago
That's a great insight, you're right.
lerp-io•11h ago
I think you can use WebGPU compute shaders now.
grg0•9h ago
The lost art? Shader programming is very much relevant to this day. Many of the statements in this post are also either incorrect or inaccurate, no need for the sensationalism. And like somebody else has mentioned below, WebGL 2 adds compute shaders. I think the post would be better if it just focused on the limitations of pre-compute APIs and how to run a network there, without the other statements.
nathan-barry•8h ago
If you couldn't tell, the post was specifically about using shader programming for general-purpose computation. Yes, WebGL adds compute shaders, but the point of the article was to use the graphics pipeline. If there are statements that are incorrect or inaccurate, pointing them out would be very much appreciated :)
grg0•7h ago
Sorry, I didn't mean for my comment to sound too harsh. I just dislike these "lost art" or "what your doctor didn't know about" style headlines. I appreciate that there are people working on graphics/GPU programming and I didn't mean to shut you down or anything.

At any rate:

> Traditional graphics APIs like OpenGL are centered around a fixed-function pipeline tailored [...] render passes to accomplish multi-stage algorithms.

This whole paragraph is inaccurate. At least point out what version of OpenGL you are referring to. And no need to use the past tense once you do that.

> which lets you treat the GPU as one giant SIMD processor

This analogy is not accurate or useful for the reasons that are already obvious to you. I think it mostly just confuses the kind of reader that does not have the relevant experience.

> and move them back and forth between host and device with explicit copy calls.

Moving to/from host/device in cuda/opencl/vulkan/etc does require explicit copy calls (and for good reason since the shit goes over the PCI bus on discrete architectures.)

> User-driven pipeline: You define exactly what happens and when instead of using a predefined fixed sequence of rendering stages.

You can do the same with compute on OpenGL/vulkan/etc. Like above, specify what version of OpenGL you are talking about to avoid confusion.

> In OpenGL, the output of your computation would ultimately be pixels in a framebuffer or values in a texture

Confusing for the same reason, especially because this statement now uses the word "computation", unlike the statements leading to it.

Personally, I would just rewrite this entire section to point out the limitations of pre-compute graphics APIs (whether it's OpenGL, earlier versions of DirectX, or whatever.)

> and other graphics specific concepts I hijacked

What does 'hijacked' mean here? You're not hijacking anything, you are using the old APIs as intended and using the right terms in the description that follows.

> bypassing any interpolation

"Filtering" is a better choice of word. And you're not bypassing it as much as you are simply not doing any filtering (it's not like filtering is part of the FFP or HW and you're "bypassing" it.)

> A Framebuffer Object (FBO) is a lightweight container

Actually, an FBO is a big turd that incurs heavy runtime validation on the driver side if you ever think about changing the targets. You might actually want to point that out since it is relevant to your implementation. I wouldn't use "lightweight" to describe it anyway.

> we “ping-pong” between them

Yeah, you might be better off creating two separate FBOs per my point above. Vendor-specific territory here, though. But I think the OpenGL wiki touches on this if I remember correctly.
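A rough sketch of what that looks like with two pre-created FBOs (the `Buffer` shape and names are made up; viewport setup and any uniforms beyond the input texture are omitted):

    interface Buffer { texture: WebGLTexture; fbo: WebGLFramebuffer; }

    // Run a chain of fullscreen passes, ping-ponging between two buffers so
    // render targets are never re-attached between passes.
    function runPasses(gl: WebGL2RenderingContext, passes: WebGLProgram[], a: Buffer, b: Buffer): Buffer {
      let src = a, dst = b;
      for (const program of passes) {
        gl.useProgram(program);
        gl.bindFramebuffer(gl.FRAMEBUFFER, dst.fbo);   // write side
        gl.activeTexture(gl.TEXTURE0);
        gl.bindTexture(gl.TEXTURE_2D, src.texture);    // read side
        gl.uniform1i(gl.getUniformLocation(program, "u_input"), 0);
        gl.drawArrays(gl.TRIANGLES, 0, 3);             // fullscreen pass
        [src, dst] = [dst, src];                       // swap roles
      }
      return src;                                      // last buffer written
    }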

> All of this happens entirely on the GPU’s VRAM bus

What is the "this" that this refers to? If you mean rendering to textures, this statement is inaccurate because there are also several layers of cache between the SIMDs and the VRAM. I think you could just say that the rendering stays on device local memory and call it a day without getting into more detail.

> FBOs form the data bus

I find this new term misleading given that you just talked about a "VRAM bus" in the paragraph above. I'd avoid introducing this new "data bus" term altogether. It doesn't seem like a useful abstraction or one that is described in any detail, so it does not add/take away much from the rest of the article.

> Instead of using fragment shaders to shade pixels for display, we hijack them as compute kernels

Just avoid this hijacking analogy altogether. I think it only adds confusion. You are implementing a compute workload or "kernel" per CUDA terminology in a fragment shader; can just call it that.

> each fragment invocation becomes one “thread”

Actually, each fragment invocation _is_ a thread per every HW vendor's own terminology, so no need for the quotes here. Of course, you haven't introduced the term "thread" up until this point (first and only occurrence of the word), which is the real problem here. A brief section on GPU architecture could help.

> Per-pixel work item: Each fragment corresponds to one matrix element (i, j). The GPU runs this loop for every (i, j) in parallel across its shader cores.

What is "this loop" you are referring to? I know which it is, but there is no loop in the shader code (there's the one for the dot product, but that's not the relevant one.) This is confusing to the reader.

> All it does is draw two triangles which covers the entire view port.

Let's draw a single triangle that covers the whole viewport while we're at it. It's more efficient because it avoids double-fragging the diagonal. https://wallisc.github.io/rendering/2021/04/18/Fullscreen-Pa...

> Reusable: Because the vertex work is identical for all operations, we compile it once and reuse it across every matrix multiply, activation, and bias-add pass.

To be clear, you compile the vertex shader once, but you're still going to have to link N programs. I don't think this is worth pointing out because linking is where most of the shit happens.
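In code, that split looks roughly like this (a sketch; `buildPassPrograms` is a made-up helper and error checking is omitted):

    // One vertex shader object, compiled once and attached to every program;
    // each fragment shader still needs its own compile and its own link.
    function buildPassPrograms(gl: WebGL2RenderingContext, vsSource: string, fsSources: string[]): WebGLProgram[] {
      const vs = gl.createShader(gl.VERTEX_SHADER)!;
      gl.shaderSource(vs, vsSource);
      gl.compileShader(vs);                 // compiled exactly once
      return fsSources.map((fsSource) => {
        const fs = gl.createShader(gl.FRAGMENT_SHADER)!;
        gl.shaderSource(fs, fsSource);
        gl.compileShader(fs);
        const program = gl.createProgram()!;
        gl.attachShader(program, vs);       // same VS object reused
        gl.attachShader(program, fs);
        gl.linkProgram(program);            // the expensive part, done N times
        return program;
      });
    }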

> While hijacking WebGL

No hijacking.

> There’s no on-chip scratchpad for blocking or data reuse, so you’re limited to element-wise passes.

The on-chip memory is still there; it's just not accessible from fragment shaders in the old APIs.

> Texture size limits: GPUs enforce a maximum 2D texture dimension (e.g. 16 K×16 K).

I haven't checked, but I would bet this is either an API limitation, or a vendor-specific limit. So to say that "GPUs enforce" would be misleading (is it really the HW or the API? And is it all GPUs? some? vendor-specific?)
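Either way, the limit is queryable at runtime rather than a hard constant (a tiny sketch, assuming a WebGL context `gl`):

    // Implementation-reported limit; 16384 is common on desktop, lower on mobile.
    const maxTexSize = gl.getParameter(gl.MAX_TEXTURE_SIZE) as number;
    console.log(`max 2D texture dimension here: ${maxTexSize}`);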

I haven't checked the neural net side of things or the shaders in any detail.

Also, I think a better title would be "Running GPT-2 in WebGL 1.0 / OpenGL 2.0", dropping the subtitle. It's more specific and you might even get more clicks from people looking to implement the stuff on older hardware. No lost art that isn't lost or rediscovery.

nathan-barry•6h ago
Thanks for your input. Lots of good points on the technical side. Will go through and make some edits later tonight or tomorrow.

> You're not hijacking anything, you are using the old APIs as intended and using the right terms in the description that follows.

When it comes to the use of the word "hijacking", I use it to refer to the fact that using graphics shaders for general computation wasn't what they were initially intended for. When NVIDIA introduced programmable vertex and pixel shaders, they had no idea they would be used for anything other than graphics rendering. So when I say I "hijack" a fragment shader to compute layers of a neural network instead of using it as part of a rendering pipeline, this is what I mean. I don't see a problem with this use of language.

umvi•7h ago
Pretty sure WebGL2 doesn't have true compute shader support. It's more of a hack where you write to a texture with a fragment shader to imitate a compute shader. True compute shader support is supposedly in WebGPU.
pjmlp•2h ago
Correct. The Chrome team is responsible for killing the WebGL 2.0 Compute shaders effort, with the reasoning that WebGPU was just around the corner.

https://github.com/9ballsyndrome/WebGL_Compute_shader/issues...

Now, five years later, how is adoption of those WebGPU compute shaders going?

swoorup•5h ago
IMHO, there are JS libraries that go through the traditional rendering-based shader path to emulate general-purpose computation on the GPU, gpu.js for example: https://gpu.rocks/#/
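Going by the gpu.js docs, the usage is roughly like this (an untested sketch; the 512x512 matmul shape is just an example):

    import { GPU } from "gpu.js";

    const gpu = new GPU();
    // gpu.js transpiles this function into a shader under the hood;
    // this.thread.x / this.thread.y index the output element being computed.
    const multiplyMatrices = gpu
      .createKernel(function (this: any, a: number[][], b: number[][]) {
        let sum = 0;
        for (let k = 0; k < 512; k++) {
          sum += a[this.thread.y][k] * b[k][this.thread.x];
        }
        return sum;
      })
      .setOutput([512, 512]);
    // usage: const c = multiplyMatrices(a, b); with two 512x512 arrays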
0points•3h ago
Lost art, eh?

What about mega shaders, Vulkan LLMs, etc.?
