At some point I tried to create a step-by-step introduction where people can interact with these concepts and see how to express them in PyTorch:
https://github.com/stared/thinking-in-tensors-writing-in-pyt...
Then it is about being able to work at different levels of abstraction and to find analogies. But at this point, in my understanding, "understanding" is a never-ending well.
I think it went pretty well (I was able to understand most of the logic and maths), and I touched on some of these terms.
For the man in the street, inclined to view "AI" as some kind of artificial brain or sentient thing, the best explanation is that basically it's just matching inputs to training samples and regurgitating continuations. Not totally accurate of course, but for that audience at least it gives a good idea and is something they can understand, and perhaps gives them some insight into what it is, how it works/fails, and that it is NOT some scary sentient computer thingy.
For anyone in the remaining 1% (or much less - people who actually understand ANNs and machine learning), learning about the Transformer architecture and how a trained Transformer works (induction heads etc.) is what they need in order to understand what a (Transformer-based, vs LSTM-based) LLM is and how it works.
Knowing about the "math" of Transformers/ANNs is only relevant to people who are actually implementing them from the ground up, not even to those who might just want to build one using PyTorch or some other framework/library where the math has already been done for you.
Finally, embeddings aren't about math - they are about representation, which is certainly important to understanding how Transformers and other ANNs work, but still a different topic.
* The US population of ~300M has ~1M software developers, of which a large fraction are going to be doing things like web development, and are only at a marginal advantage over someone smart outside of development in terms of learning how ANNs/etc. work.
An LLM is, at the end of the day, a next-word predictor, trying to predict according to training samples. We all understand that it's the depth/sophistication of context pattern matching that makes "stochastic parrot" an inadequate way to describe an LLM, but conceptually it is still more right than wrong, and is the base level of understanding you need before beginning to understand why it is inadequate.
I think it's better for a non-technical person to understand "AI" as a stochastic parrot than to have zero understanding and think of it as a black box, or sentient computer, especially if that makes them afraid of it.
Do we understand the emergent properties of almost-intelligence they appear to present, and what that means about them and us, etc. etc.?
No.
And it happens to do something weirdly useful to our own minds based on the values in the registers.
There's no magic here. Most of people's awestruck reactions are due to our brain's own pattern recognition abilities and our association of language use with intelligence. But there's really no intelligence here at all, just like the "face on Mars" is just a random feature of a desert planet's landscape, not an intelligent life form.
The distinction I want to emphasize is that they don't just predict words statistically. They model the world, understand different concepts and their relationships, can think about them, can plan and act on the plan, and can reason up to a point, in order to generate the next token. They learn all of this via that training scheme. They don't learn just the frequency of word relationships, unlike the old algorithms. Trillions of parameters do much more than that.
I think "The Platonic Representation Hypothesis" is also related: https://phillipi.github.io/prh/
Unfortunately, large LLMs like ChatGPT and Claude are black boxes for researchers. They can't probe what is going on inside those things.
An LLM is a language model, not a world model. It has never once had the opportunity to interact with the real world and see how it responds - to emit some sequence of words (the only type of action it is capable of generating), predict what will happen as a result, and see if it was correct.
During training the LLM will presumably have been exposed to some second-hand accounts (as well as fictional stories) of how the world works, mixed up with sections of Stack Overflow code and Reddit rantings, but even those occasional accounts of real-world interactions (context, action + result) are at best only teaching it about the context that someone else, at that point in their life, saw as relevant to mention as causal to the action's outcome. The LLM isn't even privy to the world model of the raconteur (let alone the actual complete real-world context in which the action was taken, or the detailed manner in which it was performed), so this is a massively impoverished source of 2nd hand experience from which to learn.
It would be like someone who had spent their whole life locked in a windowless room reading randomly ordered paragraphs from other people's diaries of daily experience (also randomly interspersed with chunks of fairy tales and Python code), without ever having actually seen a tree or jumped in a lake, or ever having had the chance to test which parts of the mental model they had built, of what was being described, were actually correct or not, and how it aligned with the real outside world they had never laid eyes on.
When someone builds an AGI capable of continual learning, and sets it loose in the world to interact with it, then it'll be reasonable to say it has its own model of how the world works. But as far as pre-trained language models go, it seems closer to the mark to say that they are indeed just language models, modelling the world of words which is all they know, and the only kind of model for which they had access to feedback (next-word prediction errors) to build.
This sounds way over-blown to me. What we know is that LLMs generate sequences of tokens, and they do this by clever ways of processing the textual output of millions of humans.
You say that, in addition to this, LLMs model the world, understand, plan, think, etc.
I think it can look like that, because LLMs are averaging the behaviours of humans who are actually modelling, understanding, thinking, etc.
Why do you think that this behaviour is more than simply averaging the outputs of millions of humans who understand, think, plan, etc.?
* So we find ourselves over and over again explaining that that might have been true once, but now there are (imperfect, messy, weird) models of large parts of the world inside that neural network.
* At the same time, the vector embedding math is still useful to learn if you want to get into LLMs. It’s just that the conclusions people draw from the architecture are often wrong.
Does it? I don't think so. All the math involved is pretty straightforward.
Locally it's all just linear algebra with an occasional nonlinear function. That is all straightforward. And by straightforward I mean you'd cover it in an undergrad engineering class -- you don't need to be a math major or anything.
Similarly CPUs are composed of simple logic operations that are each easy to understand. I'm willing to believe that designing a CPU requires more math than understanding the operations. Similarly I'd believe that designing an LLM could require more math. Although in practice I haven't seen any difficult math in LLM research papers yet. It's mostly trial and error and the above linear algebra.
The math you would use to, for example, prove that a search algorithm is optimal will generally be harder than the math needed to understand the search algorithm itself.
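To make the "just linear algebra with an occasional nonlinearity" point concrete, here's a minimal PyTorch sketch; the shapes are arbitrary placeholders, not any real model's dimensions:

    import torch

    # One feed-forward sub-block of a transformer is literally two matrix
    # multiplications with a nonlinearity in between (toy sizes, random weights).
    x = torch.randn(4, 512)       # 4 token positions, 512-dim hidden state
    W1 = torch.randn(512, 2048)   # first projection
    W2 = torch.randn(2048, 512)   # second projection

    hidden = torch.relu(x @ W1)   # the "occasional nonlinear function"
    out = hidden @ W2             # back to the model dimension
    print(out.shape)              # torch.Size([4, 512])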
The only thing is that nobody understands why they work so well. There are a few function approximation theorems that apply, but nobody really knows how to make them behave as we would like.
So basically AI research is 5% "maths", 20% data sourcing and engineering, 50% compute power, and 25% trial and error.
The hard technology that makes this all possible is in semiconductor fabrication. Outside of that, math has comparatively little to do with our recent successes.
This is exactly what I have ascertained from several different experts in this field. Interesting that a machine has been constructed that performs better and/or handles more advanced tasks than its inventors expected.
[0] https://www.manning.com/books/build-a-large-language-model-f...
Completely pointless to anyone who is not writing the lowest-level ML libraries (so basically everyone). This does not help anyone understand how LLMs actually work.
This is as if you started explaining how an ICE car works by diving into chemical properties of petrol. Yeah that really is the basis of it all, but no it is not where you start explaining how a car works.
Most people’s education here probably didn’t even involve linear algebra (this is a bold claim, because the assumption is that everyone here is highly educated, no cap).
All that is kind of missing the point though. I think people being curious and sharpening their mental models of technology is generally a good thing. If you didn't know an LLM was a bunch of linear algebra, you might have some distorted views of what it can or can't accomplish.
Also: nobody who wants to run LLMs will write their own matrix multiplications. Nobody doing ML/AI comes close to that stuff... it's all abstracted and not something anyone actually thinks about (except the few people who actually write the underlying libraries, i.e. at Nvidia).
Is the barrier to entry to the ML/AI field really that low? I think no one seasoned would consider fundamental linear algebra 'low level' math.
The barrier to entry is probably epically high, because to be actually useful you need to understand how to actually train a model in practice, how it is actually designed, and how existing practices (i.e. at OpenAI or wherever) can be built upon further... and you need to be cutting edge at all of those things. This is not taught anywhere; you can't read about it in some book. This has absolutely nothing to do with linear algebra... or more accurately, you don't get better at those things by understanding linear algebra (or any math) better than the next guy. It is not as if 'If I were better at math, I would have been a better AI researcher or programmer or whatever' :-). This is just not what these people do or how that process works. Even the foundational research that sparked rapid LLM development (the 'Attention Is All You Need' paper) is not some math-heavy stuff. The whole thing is a conceptual idea that was tested and turned out to be spectacular.
But wouldn't explaining the chemistry actually be acceptable if the title was "The chemistry you need to start understanding Internal Combustion Engines"?
That's analogous to what the author did. The title was "The maths ..." -- and then the body of the article fulfills the title by explaining the math relevant to LLMs.
It seems like you wished the author wrote a different article that doesn't match the title.
You don't need that math to start understanding LLMs. In fact, I'd argue it's harmful to start there unless your goal is 'take me on an epic journey of all the things mankind needed to figure out to make LLMs work from the absolute basics'.
Maybe this is the target group of people who would need particular "maths" to start understanding LLMs.
Also, those people understand LLMs already :-).
While reading through past posts I stumbled on a multi part "Writing an LLM from scratch" series that was an enjoyable read. I hope they keep up writing more fun content.
[0] https://www.coursera.org/specializations/mathematics-for-mac... [1] https://www.manning.com/books/math-and-architectures-of-deep...
There is certainly some hype; a lot of what is on the market is just not viable.
Thanks Andrej for the time and effort you put into your videos.
It was a really exciting time for me, as I had pushed the team to begin looking at vectors beyond language (actions and other predictable parameters we could extract from linguistic vectors).
We had originally invented a lot of this because we were trying to make chat and email easier and faster, and ultimately I had morphed it into predicting UI decisions based on conversation vectors. Back then we could only do pretty simple predictions (continue the vector strictly, reverse the vector strictly, or N vector options on an axis), but we shipped it, and you saw it when we made Hangouts, Gmail and Allo predict your next sentence. Our first incarnation was interesting enough that Eric Schmidt recognized it and took my work to the board as part of his big investment in ML. From there the work in Hangouts became Allo/Gmail etc.
Bizarrely enough, though, under Sundar this became the Google Assistant, but we couldn't get much further without attention layers, so the entire project regressed back to fixed bot pathing.
I argued pretty hard with the executives that this was a tragedy, but Sundar would hear none of it, completely obsessed with Alexa and having a competitor there.
I found some sympathy with the now head of search who gave me some budget to invest in a messaging program that would advance prediction to get to full action prediction across the search surface and UI. We launched and made it a business messaging product but lost the support of executives during the LLM panic.
Sundar cut us and fired the whole team, ironically right when he needed it the most. But he never listened to anyone who worked on the tech and seemed to hold their thoughts in great disdain.
What happened after that is of course well known now, as Sundar ignored some of the most important tech in history due to this attitude.
I don’t think I’ll ever fully understand it.
This all turned out to be mostly irrelevant in my subsequent programming career.
Then LLMs came along and I wanted to learn how they work. Suddenly the physics training is directly useful again! Backprop is one big tensor calculus calculation, minimizing… cross-entropy! Everything is matrix multiplications. Things are actually differentiable, unlike most of the rest of computer science.
It’s fun using this stuff again. All but the tensor calculus on curved spacetime, I haven’t had to reach for that yet.
Graphs - It all starts with computational graphs. These are data structures built from element-wise operations: usually matrix multiplication, addition, activation functions and a loss function. The computations are differentiable, resulting in a smooth continuous space appropriate for continuous optimization (gradient descent), which is covered later.
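As a rough sketch of what that looks like in practice (PyTorch records the graph for you; the shapes here are made up):

    import torch

    W = torch.randn(3, 2, requires_grad=True)   # parameters we want gradients for
    x = torch.randn(5, 3)                       # a small batch of inputs
    y = torch.randn(5, 2)                       # targets

    pred = torch.tanh(x @ W)                    # matmul + element-wise activation
    loss = ((pred - y) ** 2).mean()             # a smooth, differentiable loss

    loss.backward()                             # walk the recorded graph backwards
    print(W.grad.shape)                         # a gradient for every parameter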
Layers - Layers are modules comprised of graphs that apply some computation and store the results in a state, referred to as the learned weights. Each layer learns a deeper, more meaningful representation of the dataset, ultimately learning a latent manifold: a highly structured, lower-dimensional space that interpolates between samples, achieving generalization for predictions.
Different machine learning problems and data types use different layers, e.g. Transformers for sequence to sequence learning and convolutions for computer vision models, etc.
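For instance, in PyTorch (standard built-in modules, with arbitrary sizes; not the layers of any particular production model):

    import torch.nn as nn

    dense = nn.Linear(256, 256)                     # generic fully connected layer
    conv = nn.Conv2d(3, 16, kernel_size=3)          # convolution, e.g. for images
    block = nn.TransformerEncoderLayer(d_model=256, nhead=8)  # transformer block for sequences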
Models - Organize stacks of layers for training. A model includes a loss function that sends a feedback signal to an optimizer to adjust the learned weights during training. Models also include an evaluation metric for accuracy, independent of the loss function.
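A minimal sketch of that structure, assuming a toy 10-class classifier (all sizes are placeholders):

    import torch
    import torch.nn as nn

    model = nn.Sequential(            # a stack of layers
        nn.Linear(20, 64),
        nn.ReLU(),
        nn.Linear(64, 10),
    )
    loss_fn = nn.CrossEntropyLoss()                            # feedback signal
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # adjusts the weights

    def accuracy(logits, labels):
        # evaluation metric, independent of the loss
        return (logits.argmax(dim=-1) == labels).float().mean()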
Forward pass - For training or inference, an input sequence passes through all the network layers, with a geometric transformation applied at each, producing an output.
Backpropagation - During training, after the forward pass, gradients are calculated for each weight with respect to the loss (gradients are just another word for derivatives). The process for calculating the derivatives is called automatic differentiation, which is based on the chain rule of differentiation.
Once the derivatives are calculated, the optimizer intelligently updates the weights with respect to the loss. This is the process called "learning", often referred to as gradient descent.
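Putting the forward pass and backpropagation together, one training step looks roughly like this (reusing the hypothetical model, loss_fn and optimizer sketched above):

    import torch

    x = torch.randn(32, 20)                # a batch of 32 inputs
    labels = torch.randint(0, 10, (32,))   # their class labels

    logits = model(x)                      # forward pass
    loss = loss_fn(logits, labels)         # how wrong were we?

    optimizer.zero_grad()                  # clear stale gradients
    loss.backward()                        # automatic differentiation (chain rule)
    optimizer.step()                       # "learning": adjust weights w.r.t. the loss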
Now for Large Language Models.
Before models are trained for sequence to sequence learning, the corpus of knowledge must be transformed into embeddings.
Embeddings are dense representations of language: a multidimensional space that can capture meaning and context for different combinations of words that are part of sequences.
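In PyTorch terms, a minimal sketch (vocabulary size, dimensions and token ids are all made-up placeholders):

    import torch
    import torch.nn as nn

    vocab_size, embed_dim = 50_000, 768
    embedding = nn.Embedding(vocab_size, embed_dim)   # a learned lookup table

    token_ids = torch.tensor([[101, 2054, 2003, 1037, 17662]])  # one tokenized sentence
    vectors = embedding(token_ids)
    print(vectors.shape)   # torch.Size([1, 5, 768]) -- one dense vector per token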
LLMs use a specific network layer called the Transformer, which includes something called an attention mechanism.
The attention mechanism uses the embeddings to dynamically update the meaning of words when they are brought together in a sequence.
The model uses three different representations of the input sequence, called the key, query and value matrices.
Using dot products, attention scores are computed to identify the meaning of the reference sequence, and then a target sequence is generated.
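A hedged sketch of single-head scaled dot-product attention; in a real transformer Q, K and V come from learned projections of the token embeddings, but here they are random placeholders:

    import math
    import torch

    seq_len, d_k = 5, 64
    Q = torch.randn(seq_len, d_k)   # queries
    K = torch.randn(seq_len, d_k)   # keys
    V = torch.randn(seq_len, d_k)   # values

    scores = Q @ K.T / math.sqrt(d_k)         # dot-product attention scores
    weights = torch.softmax(scores, dim=-1)   # how much each token attends to the others
    updated = weights @ V                     # context-updated representation per token
    print(updated.shape)                      # torch.Size([5, 64])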
The output sequence is predicted one word at a time, based on a sampling distribution of the target sequence, using a softmax function.
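That last step, as a sketch (the logits are random stand-ins for a real model's output scores):

    import torch

    vocab_size = 50_000
    logits = torch.randn(vocab_size)          # one score per word in the vocabulary

    probs = torch.softmax(logits, dim=-1)                  # softmax -> sampling distribution
    next_token = torch.multinomial(probs, num_samples=1)   # predict one word at a time
    print(next_token.item())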
I'm a happy customer and have found it to be one of the best paid resources for learning mathematics in general. Wish I had this when I was a student.
You compute an embedding vector for your documents or chunks of documents. Then you compute the vector for your user's prompt and use the cosine distance to find the most semantically relevant documents to use. There are other tricks, like reranking the documents once you find the top N documents relating to the query, but that's basically it.
Here’s a good explanation
http://wordvec.colorado.edu/website_how_to.html
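And a minimal sketch of that retrieval step; embed() below is a stand-in for whatever real embedding model you'd call, not an actual API:

    import torch
    import torch.nn.functional as F

    def embed(text: str) -> torch.Tensor:
        # fake, deterministic "embedding" just to make the sketch runnable
        torch.manual_seed(abs(hash(text)) % (2**31))
        return torch.randn(384)

    documents = ["how to reset a password", "quarterly sales report", "vacation policy"]
    doc_vectors = torch.stack([embed(d) for d in documents])

    query_vector = embed("I forgot my password")
    scores = F.cosine_similarity(query_vector.unsqueeze(0), doc_vectors)  # semantic closeness
    top = scores.argsort(descending=True)[:2]    # top-N chunks to hand to the LLM
    print([documents[i] for i in top])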