frontpage.


Start all of your commands with a comma (2009)

https://rhodesmill.org/brandon/2009/commands-with-comma/
289•theblazehen•2d ago•95 comments

Software Engineering Is Back

https://blog.alaindichiappari.dev/p/software-engineering-is-back
20•alainrk•1h ago•10 comments

Hoot: Scheme on WebAssembly

https://www.spritely.institute/hoot/
34•AlexeyBrin•1h ago•5 comments

Reinforcement Learning from Human Feedback

https://arxiv.org/abs/2504.12501
14•onurkanbkrc•1h ago•1 comment

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

https://openciv3.org/
715•klaussilveira•16h ago•216 comments

The Waymo World Model

https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-frontier-for-autonomous-driving-simula...
978•xnx•21h ago•562 comments

Vocal Guide – belt sing without killing yourself

https://jesperordrup.github.io/vocal-guide/
94•jesperordrup•6h ago•35 comments

Omarchy First Impressions

https://brianlovin.com/writing/omarchy-first-impressions-CEEstJk
11•tosh•1h ago•8 comments

Making geo joins faster with H3 indexes

https://floedb.ai/blog/how-we-made-geo-joins-400-faster-with-h3-indexes
138•matheusalmeida•2d ago•36 comments

Unseen Footage of Atari Battlezone Arcade Cabinet Production

https://arcadeblogger.com/2026/02/02/unseen-footage-of-atari-battlezone-cabinet-production/
74•videotopia•4d ago•11 comments

Ga68, a GNU Algol 68 Compiler

https://fosdem.org/2026/schedule/event/PEXRTN-ga68-intro/
16•matt_d•3d ago•4 comments

What Is Ruliology?

https://writings.stephenwolfram.com/2026/01/what-is-ruliology/
46•helloplanets•4d ago•46 comments

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

https://github.com/valdanylchuk/breezydemo
242•isitcontent•16h ago•27 comments

Monty: A minimal, secure Python interpreter written in Rust for use by AI

https://github.com/pydantic/monty
242•dmpetrov•16h ago•128 comments

Cross-Region MSK Replication: K2K vs. MirrorMaker2

https://medium.com/lensesio/cross-region-msk-replication-a-comprehensive-performance-comparison-o...
4•andmarios•4d ago•1 comment

Show HN: I spent 4 years building a UI design tool with only the features I use

https://vecti.com
344•vecti•18h ago•153 comments

Hackers (1995) Animated Experience

https://hackers-1995.vercel.app/
510•todsacerdoti•1d ago•248 comments

Sheldon Brown's Bicycle Technical Info

https://www.sheldonbrown.com/
393•ostacke•22h ago•101 comments

Show HN: If you lose your memory, how to regain access to your computer?

https://eljojo.github.io/rememory/
309•eljojo•19h ago•192 comments

Microsoft open-sources LiteBox, a security-focused library OS

https://github.com/microsoft/litebox
361•aktau•22h ago•187 comments

An Update on Heroku

https://www.heroku.com/blog/an-update-on-heroku/
436•lstoll•22h ago•286 comments

The AI boom is causing shortages everywhere else

https://www.washingtonpost.com/technology/2026/02/07/ai-spending-economy-shortages/
31•1vuio0pswjnm7•2h ago•31 comments

PC Floppy Copy Protection: Vault Prolok

https://martypc.blogspot.com/2024/09/pc-floppy-copy-protection-vault-prolok.html
73•kmm•5d ago•11 comments

Was Benoit Mandelbrot a hedgehog or a fox?

https://arxiv.org/abs/2602.01122
26•bikenaga•3d ago•13 comments

Dark Alley Mathematics

https://blog.szczepan.org/blog/three-points/
98•quibono•4d ago•22 comments

How to effectively write quality code with AI

https://heidenstedt.org/posts/2026/how-to-effectively-write-quality-code-with-ai/
277•i5heu•19h ago•227 comments

Female Asian Elephant Calf Born at the Smithsonian National Zoo

https://www.si.edu/newsdesk/releases/female-asian-elephant-calf-born-smithsonians-national-zoo-an...
43•gmays•11h ago•14 comments

I now assume that all ads on Apple news are scams

https://kirkville.com/i-now-assume-that-all-ads-on-apple-news-are-scams/
1088•cdrnsf•1d ago•469 comments

Understanding Neural Network, Visually

https://visualrambling.space/neural-network/
312•surprisetalk•3d ago•45 comments

Delimited Continuations vs. Lwt for Threads

https://mirageos.org/blog/delimcc-vs-lwt
36•romes•4d ago•3 comments

Writing an LLM from scratch, part 22 – training our LLM

https://www.gilesthomas.com/2025/10/llm-from-scratch-22-finally-training-our-llm
254•gpjt•3mo ago

Comments

mettamage•3mo ago
Here's part 1 [1]. Since his archive is organized by date, it's a bit easier to guesstimate which part was written in which month.

[1] https://www.gilesthomas.com/2024/12/llm-from-scratch-1

3abiton•3mo ago
It's impressive: 22 parts in under a year. Seems like a fun, up-to-date project. Karpathy did something very similar with nanochat (following nanoGPT).
ziyunli•3mo ago
Seems like you can filter by tag: https://www.gilesthomas.com/llm-from-scratch
js8•3mo ago
It's based on a book (https://www.manning.com/books/build-a-large-language-model-f...). Is it a good book?
checker659•3mo ago
I had done a little bit of DL stuff (with Keras) before this, and I'm currently on the attention chapter. The book gives you the code, but I feel there is very little in the way of building intuition. Thankfully, there are tons of videos online to help with that.

I think it's a great guide, an extended tutorial if you will (at least up to this point in my reading). Having the code right in front of you also helps a lot. For example, I was under the impression that embedding vectors were static, as in word2vec. It turns out they are learnable parameters too. I wouldn't have been able to tell for sure without the code right in front of me.
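That difference is easy to check in a toy sketch (plain NumPy; the vocabulary size, dimensions, and loss are all illustrative, not from the book): a row of an embedding table moves under gradient updates, unlike a frozen word2vec vector.

```python
import numpy as np

# Toy sketch: embedding rows are ordinary trainable parameters.
# A tiny loss depends on token 2's embedding; SGD moves that row,
# while rows for tokens that never appear stay frozen -- unlike
# precomputed word2vec vectors, which never change during training.
rng = np.random.default_rng(0)
vocab_size, dim = 5, 4
emb = rng.normal(size=(vocab_size, dim))  # the embedding table
target = np.ones(dim)                     # arbitrary training signal

row0_before = emb[0].copy()
lr = 0.1
for _ in range(200):
    grad = 2.0 * (emb[2] - target)  # gradient of ||emb[2] - target||^2
    emb[2] -= lr * grad             # SGD update on the embedding row

print(np.allclose(emb[2], target, atol=1e-3))  # row 2 was learned
print(np.array_equal(emb[0], row0_before))     # unused row untouched
```

In a real model the "training signal" is backpropagated from the language-modeling loss rather than a fixed target, but the mechanics are the same: the lookup table's rows are parameters like any weight matrix.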

dvt•3mo ago
> The book gives you the code, but I feel like there is very little in the way of building intuition.

There isn't really much intuition to begin with, and I don't really think building intuition will be useful, anyway. Even when looking at something as barebones as perceptrons, it's hard to really see "why" they work. Heck, even implementing a Markov chain from scratch (which can be done in an afternoon with no prior knowledge) can feel magical when it starts outputting semi-legible sentences.
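For reference, an afternoon-sized Markov chain of the kind described is only a few lines; a minimal bigram sketch (the corpus and names are illustrative):

```python
import random
from collections import defaultdict

# Minimal bigram Markov text generator: learn word-to-next-word
# transitions from a corpus, then sample a chain of words.
corpus = ("the cat sat on the mat and the dog sat on the rug "
          "and the cat saw the dog").split()

transitions = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    transitions[prev].append(nxt)  # duplicates encode frequency

def generate(start, length, seed=0):
    random.seed(seed)
    words = [start]
    for _ in range(length - 1):
        options = transitions.get(words[-1])
        if not options:            # dead end: no observed successor
            break
        words.append(random.choice(options))
    return " ".join(words)

print(generate("the", 10))
```

With a larger corpus the output starts to look like semi-legible prose, which is the "magical" effect described above despite the model being nothing but a frequency table.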

It's like trying to build intuition when it comes to technical results like the Banach-Tarski paradox or Löb's theorem. Imo, understanding the math (which in the case of LLMs is actually quite simple) is orders of magnitude more valuable than "building intuition," whatever that might mean.

checker659•3mo ago
> Even when looking at something as barebones as perceptrons

I was thinking something like "it is trying to approximate a non-linear function" (which is what it is in the case of MLPs).
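That framing can be made concrete with a one-hidden-layer MLP fit to a non-linear target; a minimal sketch in plain NumPy (the target function, layer sizes, and learning rate are arbitrary choices for illustration):

```python
import numpy as np

# One-hidden-layer MLP fit to a non-linear target (y = x^2) with
# plain gradient descent -- the "approximate a non-linear function"
# view of what an MLP is doing.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 64).reshape(-1, 1)
y = x ** 2

hidden = 16
W1 = rng.normal(scale=0.5, size=(1, hidden))
b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.5, size=(hidden, 1))
b2 = np.zeros(1)

lr = 0.1
for _ in range(5000):
    h = np.tanh(x @ W1 + b1)          # hidden activations
    pred = h @ W2 + b2                # linear readout
    grad_pred = 2 * (pred - y) / len(x)  # d(MSE)/d(pred)
    # backpropagate through the readout and the tanh layer
    gW2 = h.T @ grad_pred
    gb2 = grad_pred.sum(axis=0)
    grad_h = grad_pred @ W2.T * (1 - h ** 2)  # tanh derivative
    gW1 = x.T @ grad_h
    gb1 = grad_h.sum(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

mse = float(np.mean((np.tanh(x @ W1 + b1) @ W2 + b2 - y) ** 2))
print(mse)  # small: the MLP has approximated y = x^2 on [-1, 1]
```

The network never "knows" it is squaring anything; it just bends a sum of tanh units until the error shrinks, which is the intuition being gestured at.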

CamperBob2•3mo ago
> Even when looking at something as barebones as perceptrons, it's hard to really see "why" they work.

Check out the Karpathy "Zero to Hero" videos, and try to follow along by building an MLP implementation in your own language of choice. He does a good job of building intuition because he doesn't skip much of anything.

mrasong•3mo ago
The cost comparison between local RTX 3090 and cloud A100 clusters is useful, but I wonder if the author accounted for hidden overhead—like data transfer time for large datasets or the time spent debugging CUDA compatibility issues on local hardware.
pppoe•3mo ago
Love this. It reminds me of LFS (Linux From Scratch): https://www.linuxfromscratch.org

Feeling nostalgic about the days building LFS in college.

Learning by building won't help you remember all the details, but many things make more sense after going through the process step by step. And it's fun.