At some point I tried to create a step-by-step introduction where people can interact with these concepts and see how to express them in PyTorch:
https://github.com/stared/thinking-in-tensors-writing-in-pyt...
Then it is about being able to work at different levels of abstraction and to find analogies. But at this point, in my understanding, "understanding" is a never-ending well.
I think it went pretty well (I was able to understand most of the logic and maths), and I touched on some of these terms.
For the man in the street, inclined to view "AI" as some kind of artificial brain or sentient thing, the best explanation is that basically it's just matching inputs to training samples and regurgitating continuations. Not totally accurate of course, but for that audience at least it gives a good idea and is something they can understand, and perhaps gives them some insight into what it is, how it works/fails, and that it is NOT some scary sentient computer thingy.
For anyone in the remaining 1% (or much less - people who actually understand ANNs and machine learning), learning about the Transformer architecture and how a trained Transformer works (induction heads etc.) is what they need in order to understand what a (Transformer-based, vs LSTM-based) LLM is and how it works.
Knowing about the "math" of Transformers/ANNs is only relevant to people who are actually implementing them from the ground up, not even to those who might just want to build one using PyTorch or some other framework/library where the math has already been done for you.
Finally, embeddings aren't about math - they are about representation, which is certainly important to understanding how Transformers and other ANNs work, but still a different topic.
* The US population of ~300M has ~1M software developers, of which a large fraction are going to be doing things like web development, and are only at a marginal advantage over someone smart outside of development in terms of learning how ANNs/etc. work.
An LLM is, at the end of the day, a next-word predictor, trying to predict according to training samples. We all understand that it's the depth/sophistication of context pattern matching that makes "stochastic parrot" an inadequate way to describe an LLM, but conceptually it is still more right than wrong, and is the base level of understanding you need before beginning to understand why it is inadequate.
I think it's better for a non-technical person to understand "AI" as a stochastic parrot than to have zero understanding and think of it as a black box, or sentient computer, especially if that makes them afraid of it.
Do we understand the emergent properties of almost-intelligence they appear to present, and what that means about them and us, etc. etc.?
No.
And it happens to do something weirdly useful to our own minds based on the values in the registers.
There's no magic here. Most of people's awestruck reactions are due to our brain's own pattern recognition abilities and our association of language use with intelligence. But there's really no intelligence here at all, just like the "face on Mars" is just a random feature of a desert planet's landscape, not an intelligent life form.
The distinction I want to emphasize is that they don't just predict words statistically. They model the world, understand different concepts and their relationships, can think about them, can plan and act on the plan, and can reason up to a point, in order to generate the next token. They learn all of this via that training scheme. They don't learn just the frequency of word relationships, unlike the old algorithms. Trillions of parameters do much more than that.
I think "The Platonic Representation Hypothesis" is also related: https://phillipi.github.io/prh/
Unfortunately, large LLMs like ChatGPT and Claude are black boxes for researchers. They can't probe what is going on inside those things.
An LLM is a language model, not a world model. It has never once had the opportunity to interact with the real world and see how it responds - to emit some sequence of words (the only type of action it is capable of generating), predict what will happen as a result, and see if it was correct.
During training the LLM will presumably have been exposed to some second-hand accounts (as well as fictional stories) of how the world works, mixed up with sections of Stack Overflow code and Reddit rantings, but even those occasional accounts of real-world interactions (context, action + result) are at best only teaching it about the context that someone else, at that point in their life, saw as relevant to mention as causal to the action's outcome. The LLM isn't even privy to the world model of the raconteur (let alone the actual complete real-world context in which the action was taken, or the detailed manner in which it was performed), so this is a massively impoverished source of 2nd hand experience from which to learn.
It would be like someone who had spent their whole life locked in a windowless room reading randomly ordered paragraphs from other people's diaries of daily experience (also randomly interspersed with chunks of fairy tales and Python code), without ever having actually seen a tree or jumped in a lake, or ever having had the chance to test which parts of the mental model they had built, of what was being described, were actually correct or not, and how it aligned with the real outside world they had never laid eyes on.
When someone builds an AGI capable of continual learning, and sets it loose in the world to interact with it, then it'll be reasonable to say it has its own model of how the world works. But as far as pre-trained language models go, it seems closer to the mark to say that they are indeed just language models, modelling the world of words which is all they know, and the only kind of model for which they had access to feedback (next-word prediction errors) to build.
This sounds way over-blown to me. What we know is that LLMs generate sequences of tokens, and they do this by clever ways of processing the textual output of millions of humans.
You say that, in addition to this, LLMs model the world, understand, plan, think, etc.
I think it can look like that, because LLMs are averaging the behaviours of humans who are actually modelling, understanding, thinking, etc.
Why do you think that this behaviour is more than simply averaging the outputs of millions of humans who understand, think, plan, etc.?
* So we find ourselves over and over again explaining that that might have been true once, but now there are (imperfect, messy, weird) models of large parts of the world inside that neural network.
* At the same time, the vector embedding math is still useful to learn if you want to get into LLMs. It’s just that the conclusions people draw from the architecture are often wrong.
Does it? I don't think so. All the math involved is pretty straightforward.
Locally it's all just linear algebra with an occasional nonlinear function. That is all straightforward. And by straightforward I mean you'd cover it in an undergrad engineering class -- you don't need to be a math major or anything.
Similarly CPUs are composed of simple logic operations that are each easy to understand. I'm willing to believe that designing a CPU requires more math than understanding the operations. Similarly I'd believe that designing an LLM could require more math. Although in practice I haven't seen any difficult math in LLM research papers yet. It's mostly trial and error and the above linear algebra.
The math you would use to, for example, prove that a search algorithm is optimal will generally be harder than the math needed to understand the search algorithm itself.
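To make the "just linear algebra with an occasional nonlinearity" point concrete, here's a minimal PyTorch sketch; the shapes are arbitrary placeholders, not any real model's dimensions:

    import torch

    # One feed-forward sub-block of a transformer is literally two matrix
    # multiplications with a nonlinearity in between (toy sizes, random weights).
    x = torch.randn(4, 512)       # 4 token positions, 512-dim hidden state
    W1 = torch.randn(512, 2048)   # first projection
    W2 = torch.randn(2048, 512)   # second projection

    hidden = torch.relu(x @ W1)   # the "occasional nonlinear function"
    out = hidden @ W2             # back to the model dimension
    print(out.shape)              # torch.Size([4, 512])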
The only thing is that nobody understands why they work so well. There are a few function approximation theorems that apply, but nobody really knows how to make them behave as we would like.
So basically AI research is 5% "maths", 20% data sourcing and engineering, 50% compute power, and 25% trial and error.
The hard technology that makes this all possible is in semiconductor fabrication. Outside of that, math has comparatively little to do with our recent successes.
This is exactly what I have ascertained from several different experts in this field. Interesting that a machine has been constructed that performs better and/or handles more advanced tasks than its inventors expected.
[0] https://www.manning.com/books/build-a-large-language-model-f...
Completely pointless to anyone who is not writing the lowest-level ML libraries (so basically everyone). This does not help anyone understand how LLMs actually work.
This is as if you started explaining how an ICE car works by diving into chemical properties of petrol. Yeah that really is the basis of it all, but no it is not where you start explaining how a car works.
Most people’s education here probably didn’t even involve linear algebra (this is a bold claim, because the assumption is that everyone here is highly educated, no cap).
All that is kind of missing the point though. I think people being curious and sharpening their mental models of technology is generally a good thing. If you didn't know an LLM was a bunch of linear algebra, you might have some distorted views of what it can or can't accomplish.
Also: nobody who wants to run LLMs will write their own matrix multiplications. Nobody doing ML/AI comes close to that stuff... it's all abstracted and not something anyone actually thinks about (except the few people who actually write the underlying libraries, i.e. at Nvidia).
Is the barrier to entry to the ML/AI field really that low? I think no one seasoned would consider fundamental linear algebra 'low level' math.
The barrier to entry is probably epically high, because to be actually useful you need to understand how to actually train a model in practice, how it is actually designed, and how existing practices (i.e. at OpenAI or wherever) can be built upon further... and you need to be cutting edge at all of those things. This is not taught anywhere; you can't read about it in some book. This has absolutely nothing to do with linear algebra... or more accurately, you don't get better at those things by understanding linear algebra (or any math) better than the next guy. It is not as if 'If I were better at math, I would have been a better AI researcher or programmer or whatever' :-). This is just not what these people do or how that process works. Even the foundational research that sparked rapid LLM development (the 'Attention Is All You Need' paper) is not some math-heavy stuff. The whole thing is a conceptual idea that was tested and turned out to be spectacular.
But wouldn't explaining the chemistry actually be acceptable if the title was "The chemistry you need to start understanding Internal Combustion Engines"?
That's analogous to what the author did. The title was "The maths ..." -- and then the body of the article fulfills the title by explaining the math relevant to LLMs.
It seems like you wished the author wrote a different article that doesn't match the title.
You don't need that math to start understanding LLMs. In fact, I'd argue it's harmful to start there unless your goal is 'take me on an epic journey of all the things mankind needed to figure out to make LLMs work from the absolute basics'.
Maybe this is the target group of people who would need particular "maths" to start understanding LLMs.
Also, those people understand LLMs already :-).
While reading through past posts I stumbled on a multi part "Writing an LLM from scratch" series that was an enjoyable read. I hope they keep up writing more fun content.
[0] https://www.coursera.org/specializations/mathematics-for-mac... [1] https://www.manning.com/books/math-and-architectures-of-deep...
There is certainly some hype; a lot of what is on the market is just not viable.
Thanks Andrej for the time and effort you put into your videos.
It was a really exciting time for me, as I had pushed the team to begin looking at vectors beyond language (actions and other predictable parameters we could extract from linguistic vectors).
We had originally invented a lot of this because we were trying to make chat and email easier and faster, and ultimately I had morphed it into predicting UI decisions based on conversation vectors. Back then we could only do pretty simple predictions (continue the vector strictly, reverse the vector strictly, or N vector options on an axis), but we shipped it, and you saw it when we made Hangouts, Gmail and Allo predict your next sentence. Our first incarnation was interesting enough that Eric Schmidt recognized it and took my work to the board as part of his big investment in ML. From there the work in Hangouts became Allo/Gmail etc.
Bizarrely enough, though, under Sundar this became the Google Assistant, but we couldn't get much further without attention layers, so the entire project regressed back to fixed bot pathing.
I argued pretty hard with the executives that this was a tragedy, but Sundar would hear none of it, completely obsessed with Alexa and having a competitor there.
I found some sympathy with the now head of search who gave me some budget to invest in a messaging program that would advance prediction to get to full action prediction across the search surface and UI. We launched and made it a business messaging product but lost the support of executives during the LLM panic.
Sundar cut us and fired the whole team, ironically right when he needed it the most. But he never listened to anyone who worked on the tech and seemed to hold their thoughts in great disdain.
What happened after that is of course well known now, as Sundar ignored some of the most important tech in history due to this attitude.
I don’t think I’ll ever fully understand it.
This all turned out to be mostly irrelevant in my subsequent programming career.
Then LLMs came along and I wanted to learn how they work. Suddenly the physics training is directly useful again! Backprop is one big tensor calculus calculation, minimizing… cross-entropy! Everything is matrix multiplications. Things are actually differentiable, unlike most of the rest of computer science.
It’s fun using this stuff again. All but the tensor calculus on curved spacetime, I haven’t had to reach for that yet.
Graphs - It all starts with computational graphs. These are data structures built from element-wise operations: usually matrix multiplication, addition, activation functions and a loss function. The computations are differentiable, resulting in a smooth continuous space appropriate for continuous optimization (gradient descent), which is covered later.
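As a rough sketch of what that looks like in practice (PyTorch records the graph for you; the shapes here are made up):

    import torch

    W = torch.randn(3, 2, requires_grad=True)   # parameters we want gradients for
    x = torch.randn(5, 3)                       # a small batch of inputs
    y = torch.randn(5, 2)                       # targets

    pred = torch.tanh(x @ W)                    # matmul + element-wise activation
    loss = ((pred - y) ** 2).mean()             # a smooth, differentiable loss

    loss.backward()                             # walk the recorded graph backwards
    print(W.grad.shape)                         # a gradient for every parameter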
Layers - Layers are modules comprised of graphs that apply some computation and store the results in a state, referred to as the learned weights. Each layer learns a deeper, more meaningful representation of the dataset, ultimately learning a latent manifold: a highly structured, lower-dimensional space that interpolates between samples, achieving generalization for predictions.
Different machine learning problems and data types use different layers, e.g. Transformers for sequence to sequence learning and convolutions for computer vision models, etc.
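For instance, in PyTorch (standard built-in modules, with arbitrary sizes; not the layers of any particular production model):

    import torch.nn as nn

    dense = nn.Linear(256, 256)                     # generic fully connected layer
    conv = nn.Conv2d(3, 16, kernel_size=3)          # convolution, e.g. for images
    block = nn.TransformerEncoderLayer(d_model=256, nhead=8)  # transformer block for sequences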
Models - Organize stacks of layers for training. A model includes a loss function that sends a feedback signal to an optimizer to adjust the learned weights during training. Models also include an evaluation metric for accuracy, independent of the loss function.
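A minimal sketch of that structure, assuming a toy 10-class classifier (all sizes are placeholders):

    import torch
    import torch.nn as nn

    model = nn.Sequential(            # a stack of layers
        nn.Linear(20, 64),
        nn.ReLU(),
        nn.Linear(64, 10),
    )
    loss_fn = nn.CrossEntropyLoss()                            # feedback signal
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # adjusts the weights

    def accuracy(logits, labels):
        # evaluation metric, independent of the loss
        return (logits.argmax(dim=-1) == labels).float().mean()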
Forward pass - For training or inference, an input sequence passes through all the network layers, with a geometric transformation applied at each, producing an output.
Backpropagation - During training, after the forward pass, gradients are calculated for each weight with respect to the loss (gradients are just another word for derivatives). The process for calculating the derivatives is called automatic differentiation, which is based on the chain rule of differentiation.
Once the derivatives are calculated, the optimizer intelligently updates the weights with respect to the loss. This is the process called "learning", often referred to as gradient descent.
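Putting the forward pass and backpropagation together, one training step looks roughly like this (reusing the hypothetical model, loss_fn and optimizer sketched above):

    import torch

    x = torch.randn(32, 20)                # a batch of 32 inputs
    labels = torch.randint(0, 10, (32,))   # their class labels

    logits = model(x)                      # forward pass
    loss = loss_fn(logits, labels)         # how wrong were we?

    optimizer.zero_grad()                  # clear stale gradients
    loss.backward()                        # automatic differentiation (chain rule)
    optimizer.step()                       # "learning": adjust weights w.r.t. the loss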
Now for Large Language Models.
Before models are trained for sequence to sequence learning, the corpus of knowledge must be transformed into embeddings.
Embeddings are dense representations of language: a multidimensional space that can capture meaning and context for different combinations of words that are part of sequences.
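In PyTorch terms, a minimal sketch (vocabulary size, dimensions and token ids are all made-up placeholders):

    import torch
    import torch.nn as nn

    vocab_size, embed_dim = 50_000, 768
    embedding = nn.Embedding(vocab_size, embed_dim)   # a learned lookup table

    token_ids = torch.tensor([[101, 2054, 2003, 1037, 17662]])  # one tokenized sentence
    vectors = embedding(token_ids)
    print(vectors.shape)   # torch.Size([1, 5, 768]) -- one dense vector per token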
LLMs use a specific network layer called the Transformer, which includes something called an attention mechanism.
The attention mechanism uses the embeddings to dynamically update the meaning of words when they are brought together in a sequence.
The model uses three different representations of the input sequence, called the key, query and value matrices.
Using dot products, attention scores are computed to identify the meaning of the reference sequence, and then a target sequence is generated.
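A hedged sketch of single-head scaled dot-product attention; in a real transformer Q, K and V come from learned projections of the token embeddings, but here they are random placeholders:

    import math
    import torch

    seq_len, d_k = 5, 64
    Q = torch.randn(seq_len, d_k)   # queries
    K = torch.randn(seq_len, d_k)   # keys
    V = torch.randn(seq_len, d_k)   # values

    scores = Q @ K.T / math.sqrt(d_k)         # dot-product attention scores
    weights = torch.softmax(scores, dim=-1)   # how much each token attends to the others
    updated = weights @ V                     # context-updated representation per token
    print(updated.shape)                      # torch.Size([5, 64])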
The output sequence is predicted one word at a time, based on a sampling distribution of the target sequence, using a softmax function.
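That last step, as a sketch (the logits are random stand-ins for a real model's output scores):

    import torch

    vocab_size = 50_000
    logits = torch.randn(vocab_size)          # one score per word in the vocabulary

    probs = torch.softmax(logits, dim=-1)                  # softmax -> sampling distribution
    next_token = torch.multinomial(probs, num_samples=1)   # predict one word at a time
    print(next_token.item())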
I'm a happy customer and have found it to be one of the best paid resources for learning mathematics in general. Wish I had this when I was a student.
You compute an embedding vector for your documents or chunks of documents. Then you compute the vector for your user's prompt and use the cosine distance to find the most semantically relevant documents to use. There are other tricks, like reranking the documents once you find the top N documents relating to the query, but that's basically it.
Here’s a good explanation
http://wordvec.colorado.edu/website_how_to.html
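And a minimal sketch of that retrieval step; embed() below is a stand-in for whatever real embedding model you'd call, not an actual API:

    import torch
    import torch.nn.functional as F

    def embed(text: str) -> torch.Tensor:
        # fake, deterministic "embedding" just to make the sketch runnable
        torch.manual_seed(abs(hash(text)) % (2**31))
        return torch.randn(384)

    documents = ["how to reset a password", "quarterly sales report", "vacation policy"]
    doc_vectors = torch.stack([embed(d) for d in documents])

    query_vector = embed("I forgot my password")
    scores = F.cosine_similarity(query_vector.unsqueeze(0), doc_vectors)  # semantic closeness
    top = scores.argsort(descending=True)[:2]    # top-N chunks to hand to the LLM
    print([documents[i] for i in top])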