I built a pure WGSL LLM engine to run Llama on my Snapdragon laptop GPU

1•Beledarian•1h ago

Hi HN,

I recently bought a Snapdragon X Elite Copilot+ laptop and realized my integrated Adreno GPU was basically a paperweight for local AI. Standard tools like LM Studio and the massive PyTorch ecosystem didn't support it, forcing everything onto the CPU. I didn't want to wait for the ecosystem to catch up, so I built a from-scratch inference engine to bypass it entirely.

It’s written purely in Rust and WGSL. No CUDA, no Python, no heavy frameworks. Just raw compute shaders dispatching the Transformer forward pass, which makes it portable (it runs on Windows, macOS, and Linux via Vulkan/Metal/DX12). With TinyLlama, I'm currently getting ~33 tok/s on the Snapdragon's Adreno GPU (~25 tok/s with fp16) and 66+ tok/s (fp16 or fp32) on an RTX 3090.

The build process: I actually had a dual motivation here. Beyond solving my hardware gap, I wanted a stress test for my own LLM orchestration tools. A Transformer engine requires exact math, strict buffer layouts (those WebGPU vec3 alignment traps are real), and standalone compute shaders, so there is zero room for AI hallucination. I spent the time developing and validating a strict architectural blueprint up front. Then, using highly specific prompts, strict behavior guidance, and my custom MCP tools to feed the AI the exact WGSL specs, I scaffolded that predefined, human-designed architecture into working code in under 16 hours.
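For anyone who hasn't hit the vec3 trap yet, here is a minimal sketch of it. In WGSL, a `vec3<f32>` is 12 bytes but must start on a 16-byte boundary, so a host struct written the obvious way in Rust does not match what the shader reads. The struct names (`NaiveParams`, `PaddedParams`) and the `wgsl_layout` helper are hypothetical, made up for this illustration; they are not taken from the repo.

```rust
/// (alignment, size) in bytes of one struct field under WGSL's layout rules.
#[derive(Clone, Copy)]
struct Field {
    align: usize,
    size: usize,
}

const F32: Field = Field { align: 4, size: 4 };
// WGSL: vec3<f32> occupies 12 bytes but must start on a 16-byte boundary.
const VEC3_F32: Field = Field { align: 16, size: 12 };

fn round_up(offset: usize, align: usize) -> usize {
    (offset + align - 1) / align * align
}

/// Returns (per-field offsets, total struct size): each field goes at the
/// next multiple of its alignment, and the struct size is rounded up to the
/// largest field alignment.
fn wgsl_layout(fields: &[Field]) -> (Vec<usize>, usize) {
    let mut offsets = Vec::with_capacity(fields.len());
    let mut cursor = 0;
    let mut max_align = 1;
    for f in fields {
        cursor = round_up(cursor, f.align);
        offsets.push(cursor);
        cursor += f.size;
        max_align = max_align.max(f.align);
    }
    (offsets, round_up(cursor, max_align))
}

/// Hypothetical host-side mirror of `struct { scale: f32, dir: vec3<f32> }`,
/// written the "obvious" way. Rust packs `dir` right after `scale`.
#[allow(dead_code)]
#[repr(C)]
struct NaiveParams {
    scale: f32,
    dir: [f32; 3],
}

/// The same data with explicit padding so the bytes line up with WGSL.
#[allow(dead_code)]
#[repr(C)]
struct PaddedParams {
    scale: f32,
    _pad0: [f32; 3], // pushes `dir` to offset 16
    dir: [f32; 3],
    _pad1: f32, // rounds the struct up to 32 bytes
}
```

Calling `wgsl_layout(&[F32, VEC3_F32])` puts the vector at offset 16 with a total size of 32 bytes, while `std::mem::size_of::<NaiveParams>()` is only 16. Swapping the field order (vector first, scalar second) lets the scalar sit in the vector's trailing 4 bytes, which is often the cheapest fix.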

It is very much alpha software. It's decode-only, single-sequence, and currently uses CPU-side sampling. I’d love to hear your thoughts, especially from anyone with deep WGSL/WebGPU experience regarding buffer layouts or optimizing the INT8 GEMM paths (I know I need to move to a tiled implementation to get around the VRAM bandwidth bottleneck).
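To make the tiling question concrete, here is a rough CPU-side reference of what I mean, not the actual shader code. It assumes row-major `i8` inputs with `i32` accumulation; the function names and the `tile` parameter are placeholders for this sketch. On the GPU, each tile would be staged in workgroup memory so that every value loaded from global memory is reused across several outputs.

```rust
/// Naive reference: C[m x n] = A[m x k] * B[k x n], i8 inputs, i32 accumulation.
/// All matrices are row-major.
fn gemm_i8_naive(a: &[i8], b: &[i8], m: usize, k: usize, n: usize) -> Vec<i32> {
    let mut c = vec![0i32; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0i32;
            for p in 0..k {
                acc += a[i * k + p] as i32 * b[p * n + j] as i32;
            }
            c[i * n + j] = acc;
        }
    }
    c
}

/// Tiled variant: the same arithmetic, but the loops walk tile x tile blocks
/// so the block of A and B being worked on stays small enough to keep in
/// fast memory (cache here, workgroup memory in a compute shader).
fn gemm_i8_tiled(a: &[i8], b: &[i8], m: usize, k: usize, n: usize, tile: usize) -> Vec<i32> {
    assert!(tile > 0, "tile size must be positive");
    let mut c = vec![0i32; m * n];
    for i0 in (0..m).step_by(tile) {
        let i_end = (i0 + tile).min(m);
        for j0 in (0..n).step_by(tile) {
            let j_end = (j0 + tile).min(n);
            for p0 in (0..k).step_by(tile) {
                let p_end = (p0 + tile).min(k);
                // Partial products for this (i0, j0, p0) block are added
                // into the output tile.
                for i in i0..i_end {
                    for j in j0..j_end {
                        let mut acc = 0i32;
                        for p in p0..p_end {
                            acc += a[i * k + p] as i32 * b[p * n + j] as i32;
                        }
                        c[i * n + j] += acc;
                    }
                }
            }
        }
    }
    c
}
```

Both functions return identical results for any tile size, so the naive version can double as a correctness check when porting the tiled loop structure to WGSL.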

Happy to answer any questions about the architecture or the build process!

Repo: https://github.com/Beledarian/wgpu-llm
