Show HN: Llama 3.1 70B on a single RTX 3090 via NVMe-to-GPU bypassing the CPU

https://github.com/xaskasdf/ntransformer
51•xaskasdf•3h ago
Hi everyone, I'm kinda involved in some retrogaming, and through some experiments I ran into the following question: "Would it be possible to run transformer models bypassing the CPU/RAM, connecting the GPU directly to the NVMe?"

This is the result of that question and some weekend vibecoding (the linked library repository is in the readme as well). It seems to work, even on consumer GPUs, though it should work better on professional ones.
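For readers unfamiliar with the NVMe-to-GPU idea, here is a minimal, hypothetical sketch of streaming layer weights from disk straight into VRAM using GPUDirect Storage via the kvikio library. This is not necessarily the mechanism ntransformer uses (a commenter below notes it looks closer to custom NVMe P2P DMA along the lines of ssd-gpu-dma/bam), and the layer size, file layout, and apply_layer kernel are made up for illustration:

```python
# Hypothetical sketch: stream transformer layer weights from NVMe directly into
# GPU memory with GPUDirect Storage (kvikio wraps NVIDIA's cuFile API), so host
# RAM is not used as a staging buffer. Sizes and file layout are assumptions.
import cupy as cp   # GPU arrays
import kvikio       # RAPIDS wrapper around cuFile / GPUDirect Storage

LAYER_BYTES = 440_000_000   # assumed: roughly one of 80 layers of a 4-bit 70B model
N_LAYERS = 80               # Llama 3.1 70B has 80 transformer blocks

def apply_layer(raw_weights: cp.ndarray, hidden: cp.ndarray) -> cp.ndarray:
    # Placeholder for dequantize + attention + MLP kernels.
    return hidden

def run_model(weights_path: str, hidden: cp.ndarray) -> cp.ndarray:
    # One reusable device buffer; each layer's weights are DMA'd into it in turn.
    layer_buf = cp.empty(LAYER_BYTES, dtype=cp.uint8)
    f = kvikio.CuFile(weights_path, "r")
    try:
        for i in range(N_LAYERS):
            # Direct NVMe -> VRAM read at this layer's file offset (no host copy).
            f.read(layer_buf, size=LAYER_BYTES, file_offset=i * LAYER_BYTES)
            hidden = apply_layer(layer_buf, hidden)
    finally:
        f.close()
    return hidden
```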

Comments

randomtoast•1h ago
0.2 tok/s is fine for experimentation, but it is not interactive in any meaningful sense. For many use cases, a well-quantized 8B or 13B that stays resident will simply deliver a better latency-quality tradeoff.
tyfon•42m ago
I didn't really understand the performance table until I saw the top ones were 8B models.

But 5 seconds/token is quite slow, yeah. I guess this is for low-RAM machines? I'm pretty sure my 5950X with 128 GB of RAM can run this faster on the CPU, with some layers/prefill on the 3060 GPU I have.

I also see that they claim the process is compute-bound at 2 seconds/token, but that doesn't seem right for a 3090?

tgrowazay•12m ago
LLM speed is roughly <memory_bandwidth> / <model_size> tok/s.

DDR4 tops out at about 27 GB/s

DDR5 can do around 40 GB/s

So for a 70B model at 8-bit quant, you'll get around 0.3-0.5 tokens per second using RAM alone.
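A quick back-of-envelope version of that rule of thumb, using the bandwidth figures above plus a couple of rough ones added for comparison (a 3090's ~936 GB/s GDDR6X and a ~7 GB/s Gen4 NVMe); real numbers vary with channel count, quantization, and batch size:

```python
# Decode speed estimate: tokens/s ~= memory_bandwidth / bytes_read_per_token,
# since every weight is touched roughly once per generated token.
def tok_per_s(bandwidth_gb_s: float, params_b: float, bytes_per_param: float) -> float:
    model_gb = params_b * bytes_per_param   # weights streamed per token, in GB
    return bandwidth_gb_s / model_gb

for name, bw in [("DDR4 (single channel)", 27), ("DDR5 (single channel)", 40),
                 ("RTX 3090 GDDR6X", 936), ("PCIe 4.0 x4 NVMe", 7)]:
    print(f"{name:22s}: {tok_per_s(bw, 70, 1.0):.2f} tok/s (70B @ 8-bit)")

# Roughly: DDR4 ~0.39, DDR5 ~0.57, 3090 VRAM ~13.4, Gen4 NVMe ~0.10 tok/s.
```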

Wuzado•14m ago
I can imagine a couple of scenarios in which a high-quality, large model would be much preferred over lower-latency models, primarily when you really need the quality.
throwaway2027•58m ago
Didn't DirectX add an API for loading assets directly to GPU memory? Would that work?
someguy2026•35m ago
My impression is that it's limited to assets and really needs to fit into the DirectX framework. From what I can tell, the gpu-nvme-direct approach here is mostly similar to https://github.com/enfiskutensykkel/ssd-gpu-dma and https://github.com/ZaidQureshi/bam
jauntywundrkind•40m ago
Could be neat to see what giving the 8B something like 6 GB instead of 10 GB looks like. Something in between, where you still need NVMe, but not the ~3x ratio of the 70B model on 23 GB.

Nice work. PCIe P2P (GPUDirect™) is such great stuff. Cool to see!

rl3•29m ago
Nice. I've been looking at doing something similar, more on the order of running a 1T model with less than half the available VRAM.

One workup indicated it was theoretically possible to modify a piece of SGLang's routing layer to support JIT predict-ahead expert swaps from Gen5 NVMe storage straight into GPU memory.

I'm hoping that proves true. The setup relies on NVIDIA Dynamo, so NIXL primitives are available to support that.

Curious if anyone's tried this already.
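For anyone curious what "predict-ahead expert swaps" could look like in the abstract, here is a hypothetical sketch of an expert cache with prefetching. Nothing below is an SGLang, Dynamo, or NIXL API; load_expert_async and wait stand in for whatever NVMe-to-GPU transfer primitive is actually available:

```python
# Hypothetical predict-ahead expert cache: keep a small pool of expert slots in
# VRAM, and when the router picks experts for the current layer, kick off async
# NVMe->GPU reads for the experts it will most likely want next.
from collections import OrderedDict

class ExpertCache:
    def __init__(self, n_slots: int, load_expert_async, wait):
        self.slots = OrderedDict()                  # expert_id -> VRAM buffer (LRU order)
        self.n_slots = n_slots
        self.load_expert_async = load_expert_async  # (expert_id) -> transfer handle
        self.wait = wait                            # (handle) -> VRAM buffer
        self.pending = {}                           # expert_id -> in-flight handle

    def prefetch(self, expert_ids):
        """Start NVMe->GPU loads for experts predicted to be needed soon."""
        for eid in expert_ids:
            if eid not in self.slots and eid not in self.pending:
                self.pending[eid] = self.load_expert_async(eid)

    def get(self, expert_id):
        """Return the VRAM buffer for an expert, blocking only on a cache miss."""
        if expert_id in self.slots:
            self.slots.move_to_end(expert_id)       # mark as recently used
            return self.slots[expert_id]
        handle = self.pending.pop(expert_id, None)
        if handle is None:
            handle = self.load_expert_async(expert_id)   # miss: start the load now
        buf = self.wait(handle)                     # block until the DMA completes
        if len(self.slots) >= self.n_slots:
            self.slots.popitem(last=False)          # evict least recently used expert
        self.slots[expert_id] = buf
        return buf
```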

Wuzado•5m ago
I wonder - could this be used for multi-tier MoE? Eg. active + most used in VRAM, often used in RAM and less used in NVMe?
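A hypothetical sketch of that tiering idea: rank experts by observed activation frequency (e.g. from a profiling run) and pin the hottest ones in VRAM, the next tier in host RAM, and leave the cold tail on NVMe. The slot counts and the profiling source are assumptions, not anything from the repo:

```python
# Assign each expert to a storage tier by how often it was activated during profiling.
def assign_tiers(activation_counts: dict[int, int],
                 vram_slots: int, ram_slots: int) -> dict[int, str]:
    ranked = sorted(activation_counts, key=activation_counts.get, reverse=True)
    placement = {}
    for rank, expert_id in enumerate(ranked):
        if rank < vram_slots:
            placement[expert_id] = "vram"   # always resident on the GPU
        elif rank < vram_slots + ram_slots:
            placement[expert_id] = "ram"    # one PCIe copy away
        else:
            placement[expert_id] = "nvme"   # streamed on demand
    return placement

# e.g. assign_tiers({0: 900, 1: 850, 2: 40, 3: 5}, vram_slots=2, ram_slots=1)
#      -> {0: 'vram', 1: 'vram', 2: 'ram', 3: 'nvme'}
```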

Show HN: Llama 3.1 70B on a single RTX 3090 via NVMe-to-GPU bypassing the CPU

https://github.com/xaskasdf/ntransformer
55•xaskasdf•3h ago•10 comments

Show HN: Ktop – a themed terminal monitor for GPU, CPU, RAM, temps and OOM kills

https://github.com/brontoguana/ktop
3•brontoguana•27m ago•0 comments

Show HN: Iron-Wolf – Wolfenstein 3D source port in Rust

https://github.com/Ragnaroek/iron-wolf
53•ragnaroekX•8h ago•19 comments

Show HN: AI writes code – humans fix it

https://humansfix.ai
2•stasman•1h ago•1 comment

Show HN: MeMCP – MCP for Personal Profile

https://github.com/nickyreinert/meMCP
2•y42•2h ago•0 comments

Show HN: I scanned 50k radio streams and built an app for the ones that work

https://github.com/meehow/receiver
2•meehow•3h ago•0 comments

Show HN: Nexus – A social platform where your GitHub profile is your identity

https://nexus-fqt4.onrender.com
2•tita-n•3h ago•0 comments

Show HN: Mines.fyi – all the mines in the US in a leaflet visualization

https://mines.fyi/
99•irasigman•1d ago•50 comments

Show HN: Winslop – De-Slop Windows

https://github.com/builtbybel/Winslop
10•guilamu•4h ago•0 comments

Show HN: Formally Verified a Millennium Prize Problem in Coq Yang-Mills Mass Gap

https://github.com/Shariq81/yang-mills-mass-gap
2•shariq81•3h ago•0 comments

Show HN: Cc-md – Zero-cost Obsidian sync across iPhone, Mac, and GitHub

https://github.com/yuukiLike/cc-md
3•YuukiJyoudai•3h ago•1 comment

Show HN: Amux – A tmux-based multiplexer for running parallel Claude Code agents

https://amux.io
2•Beefin•3h ago•0 comments

Show HN: Museum of Handwritten Code (If, While, Binary Search, Merge Sort)

https://museum.codes
3•sgraphics8•3h ago•1 comment

Show HN: DevBind – I made a Rust tool for zero-config local HTTPS and DNS

https://github.com/Its-Satyajit/dev-bind
3•its-satyajit•3h ago•0 comments

Show HN: A native macOS client for Hacker News, built with SwiftUI

https://github.com/IronsideXXVI/Hacker-News
245•IronsideXXVI•1d ago•172 comments

Show HN: See – searchable JSON compression (offline 10-min demo)

https://gitlab.com/kodomonocch1/see_proto
4•Tetsuro•4h ago•0 comments

Show HN: SmartMan – A modern, interactive TUI for Linux man pages

https://github.com/ambaskaryash/smartman-cli
2•ambaskaryash•6h ago•0 comments

Show HN: Rigour – Open-source quality gates for AI coding agents

https://rigour.run
2•erashu212•6h ago•1 comment

Show HN: Ghostty-based terminal with vertical tabs and notifications

https://github.com/manaflow-ai/cmux
182•lawrencechen•2d ago•73 comments

Show HN: Eliezer – Tiny (~7K LOC) Self-Hosted AI Agent (PWA, Self-Editing)

https://www.eliezer.app/
3•dvictor•7h ago•1 comment

Show HN: Micasa – track your house from the terminal

https://micasa.dev
637•cpcloud•2d ago•208 comments

Show HN: MQTT Topic Lab – MQTT client with buttons using command variables

https://github.com/alsoftbv/topic-lab
2•altug•8h ago•0 comments

Show HN: ClaudeUsage – macOS menu bar app to track your Claude Pro usage limits

https://github.com/linuxlewis/claude-usage
5•linuxlewis•8h ago•0 comments

Show HN: A physically-based GPU ray tracer written in Julia

https://makie.org/website/blogposts/raytracing/
195•simondanisch•2d ago•91 comments

Show HN: Mini-Diarium - An encrypted, local, cross-platform journaling app

https://github.com/fjrevoredo/mini-diarium
131•holyknight•2d ago•62 comments

Show HN: A small, simple music theory library in C99

https://github.com/thelowsunoverthemoon/mahler.c
56•lowsun•2d ago•18 comments

Show HN: The Sanguine Box – A 2026 vision for solo-produced comics

https://sanguinebox.com/comics/sanguine/
2•Balvarez•10h ago•0 comments

Show HN: Blindspot – a userscript to block tab-switch detection

https://github.com/gsekulski/blindspot
2•gsekulski•10h ago•0 comments

Show HN: 3mins.news – AI daily news briefing in 17 languages, designed to end

https://3mins.news/en
4•ethan_zhao•10h ago•1 comment

Show HN: GenPPT AI – Turn any idea into professional slides in seconds

https://genppt.ai/
5•polarisminor•15h ago•0 comments