Show HN: I built a tensor library from scratch in C++/CUDA

70•nirw4nna•5h ago

Hi HN,

Over the past few months, I've been building `dsc`, a tensor library from scratch in C++/CUDA. My main focus has been on getting the basics right, prioritizing a clean API, simplicity, and clear observability for running small LLMs locally.

The key features are:

- C++ core with CUDA support written from scratch.
- A familiar, PyTorch-like Python API.
- Runs real models: it's complete enough to load a model like Qwen from HuggingFace and run inference on both CUDA and CPU with a single line change[1].
- Simple, built-in observability for both Python and C++.

Next on the roadmap is adding BF16 support and then I'll be working on visualization for GPU workloads.

The project is still early and I would be incredibly grateful for any feedback, code reviews, or questions from the HN community!

GitHub Repo: https://github.com/nirw4nna/dsc

[1]: https://github.com/nirw4nna/dsc/blob/main/examples/models/qw...

Comments

helltone•4h ago

This is very cool. I'm wondering if some of the templates and switch statements would be nicer if there was an intermediate representation and a compiler-like architecture.

I'm also curious about how this compares to something like JAX.

Also curious about how this compares to zml.

kajecounterhack•3h ago

Cool stuff! Is the goal of this project personal learning, inference performance, or something else?

It would be nice to see how inference speed stacks up against, say, llama.cpp.

liuliu•2h ago

Both use cuBLAS under the hood, so I think prefill performance is similar (of course, this framework is still early and doesn't seem to have FP16/BF16 support for GEMM yet). A hand-rolled GEMV is faster for token generation, hence llama.cpp is better there.
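To make the GEMM/GEMV distinction concrete, here is a minimal pure-Python sketch (not code from dsc or llama.cpp; the names `gemm`, `gemv`, `w`, `prompt`, and `token` are made up for illustration). Prefill multiplies the weights by all prompt tokens at once, which is a matrix-matrix product; generation handles one token at a time, which collapses to a matrix-vector product.

```python
def gemv(w, x):
    """Matrix-vector product: w is (m x k) as a list of rows, x has length k."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def gemm(w, b):
    """Matrix-matrix product: w is (m x k), b is (k x n), both lists of rows."""
    cols = list(zip(*b))  # columns of b, so each output entry is a row-column dot product
    return [[sum(w_ij * b_j for w_ij, b_j in zip(row, col)) for col in cols] for row in w]


# Hypothetical 3x2 weight matrix.
w = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Prefill: three prompt tokens stored as the columns of a 2x3 matrix.
prompt = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]
prefill = gemm(w, prompt)

# Generation: a single token, identical to the last prompt column.
token = [2.0, 3.0]
decode = gemv(w, token)
```

The single-token result matches the last column of the prefill result, so the math is the same; the difference is that a one-column workload is typically memory-bound, which is where a specialized GEMV kernel can pay off.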

aklein•2h ago

I noticed you interface with the native code via ctypes. I think cffi is generally preferred (e.g., https://cffi.readthedocs.io/en/stable/overview.html#api-mode...). You'd have more flexibility, though, if you built your own Python extension module (e.g., using pybind), which would free you from a simple/strict ABI. I'm curious whether this strict separation of C and Python was a deliberate design choice.
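For readers unfamiliar with the ctypes approach, a minimal sketch of what "strict ABI" means in practice: you load a shared library and declare each C signature by hand. This is not dsc's actual binding code; it uses libc's `strlen` as a stand-in for a native function, assumes a POSIX system, and the wrapper name `c_strlen` is hypothetical.

```python
import ctypes
import ctypes.util

# Locate and load the C standard library. If find_library returns None,
# CDLL(None) exposes the symbols already loaded in the process (POSIX only).
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# ctypes cannot read C headers, so the signature is declared manually.
# A wrong declaration here is not caught at load time.
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t


def c_strlen(data: bytes) -> int:
    """Call the C strlen function on a bytes object."""
    return libc.strlen(data)
```

Calling `c_strlen(b"tensor")` crosses into C and returns the length as a Python int. Everything that crosses the boundary has to be expressible as plain C types, which is the constraint an extension module would relax.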

rrhjm53270•1h ago

Do you have any plan for the serialization and deserialization of your tensor and nn library?

amtk2•32m ago

Super n00b question: what kind of laptop do you need to do a project like this? Is a Mac OK, or do you need a dedicated Linux laptop?

kadushka•12m ago

Any laptop with an Nvidia card
