
Nano Banana Pro

https://blog.google/technology/ai/nano-banana-pro/
548•meetpateltech•4h ago•375 comments

NTSB Preliminary Report – UPS Boeing MD-11F Crash [pdf]

https://www.ntsb.gov/Documents/Prelimiary%20Report%20DCA26MA024.pdf
60•gregsadetsky•1h ago•41 comments

Microsoft makes Zork open-source

https://opensource.microsoft.com/blog/2025/11/20/preserving-code-that-shaped-generations-zork-i-i...
175•tabletcorry•1h ago•61 comments

CoMaps emerges as an Organic Maps fork

https://lwn.net/Articles/1024387/
31•altilunium•1w ago•5 comments

The Lions Operating System

https://lionsos.org
31•plunderer•1h ago•3 comments

Go Cryptography State of the Union

https://words.filippo.io/2025-state/
60•ingve•2h ago•29 comments

Okta's NextJS-0auth troubles

https://joshua.hu/ai-slop-okta-nextjs-0auth-security-vulnerability
110•ramimac•2d ago•30 comments

Launch HN: Poly (YC S22) – Cursor for Files

21•aabhay•2h ago•19 comments

Android and iPhone users can now share files, starting with the Pixel 10

https://blog.google/products/android/quick-share-airdrop/
198•abraham•2h ago•169 comments

Ask HN: How are Markov chains so different from tiny LLMs?

62•JPLeRouzic•2d ago•33 comments

Free interactive tool that shows you how PCIe lanes work on motherboards

https://mobomaps.com
61•tagyro•1d ago•8 comments

Freer Monads, More Extensible Effects (2015) [pdf]

https://okmij.org/ftp/Haskell/extensible/more.pdf
52•todsacerdoti•4h ago•3 comments

Show HN: F32 – An Extremely Small ESP32 Board

https://github.com/PegorK/f32
107•pegor•23h ago•14 comments

What's in a Passenger Name Record (PNR)? (2013)

https://hasbrouck.org/articles/PNR.html
18•rzk•4d ago•1 comment

Interactive World History Atlas Since 3000 BC

http://geacron.com/home-en/
244•not_knuth•9h ago•121 comments

Theft of 'The Weeping Woman' from the National Gallery of Victoria

https://en.wikipedia.org/wiki/Theft_of_The_Weeping_Woman_from_the_National_Gallery_of_Victoria
48•neom•5d ago•30 comments

Two recently found works of J.S. Bach presented in Leipzig [video]

https://www.youtube.com/watch?v=4hXzUGYIL9M#t=15m19s
36•Archelaos•2d ago•22 comments

Red Alert 2 in web browser

https://chronodivide.com/
315•nsoonhui•7h ago•98 comments

Firefox 147 Will Support the XDG Base Directory Specification

https://www.phoronix.com/news/Firefox-147-XDG-Base-Directory
267•bradrn•5h ago•99 comments

50th Anniversary of BitBLT

https://mastodon.sdf.org/@fvzappa/115574872559813280
38•todsacerdoti•17h ago•2 comments

Android/Linux Dual Boot

https://wiki.postmarketos.org/wiki/Dual_Booting/WiP
250•joooscha•3d ago•136 comments

Show HN: My hobby OS that runs Minecraft

https://astral-os.org/posts/2025/10/31/astral-minecraft.html
38•avaliosdev•2d ago•4 comments

The Firefly and the Pulsar

https://www.centauri-dreams.org/2025/11/20/the-firefly-and-the-pulsar/
8•JPLeRouzic•3h ago•0 comments

Adversarial Poetry as a Universal Single-Turn Jailbreak Mechanism in LLMs

https://arxiv.org/abs/2511.15304
183•capgre•7h ago•109 comments

'Calvin and Hobbes' at 40

https://www.npr.org/2025/11/18/nx-s1-5564064/calvin-and-hobbes-bill-watterson-40-years-comic-stri...
310•mooreds•7h ago•112 comments

CUDA Ontology

https://jamesakl.com/posts/cuda-ontology/
227•gugagore•4d ago•37 comments

Typesetting the "Begriffsschrift" by Gottlob Frege in Plain TeX [pdf]

https://www.tug.org/TUGboat/tb36-3/tb114wermuth.pdf
22•perihelions•1w ago•2 comments

IBM Delivers New Quantum Package

https://newsroom.ibm.com/2025-11-12-ibm-delivers-new-quantum-processors,-software,-and-algorithm-...
29•donutloop•1w ago•11 comments

Basalt Woven Textile

https://materialdistrict.com/material/basalt-woven-textile/
186•rbanffy•14h ago•121 comments

Meta Segment Anything Model 3

https://ai.meta.com/sam3/
632•lukeinator42•1d ago•126 comments

Ask HN: How are Markov chains so different from tiny LLMs?

62•JPLeRouzic•2d ago
I polished a Markov chain generator and trained it on an article by Uri Alon et al. (https://pmc.ncbi.nlm.nih.gov/articles/PMC7963340/).

It generates text that seems to me at least on par with that of tiny LLMs, such as those demonstrated by NanoGPT. Here is an example:

  jplr@mypass:~/Documenti/2025/SimpleModels/v3_very_good$ ./SLM10b_train UriAlon.txt 3
  Training model with order 3...
  Skip-gram detection: DISABLED (order < 5)
  Pruning is disabled
  Calculating model size for JSON export...
  Will export 29832 model entries
  Exporting vocabulary (1727 entries)...
  Vocabulary export complete.
  Exporting model entries...
    Processed 12000 contexts, written 28765 entries (96.4%)...
  JSON export complete: 29832 entries written to model.json
  Model trained and saved to model.json
  Vocabulary size: 1727
  jplr@mypass:~/Documenti/2025/SimpleModels/v3_very_good$ ./SLM9_gen model.json
Aging cell model requires comprehensive incidence data. To obtain such a large medical database of the joints are risk factors. Therefore, the theory might be extended to describe the evolution of atherosclerosis and metabolic syndrome. For example, late‐stage type 2 diabetes is associated with collapse of beta‐cell function. This collapse has two parameters: the fraction of the senescent cells are predicted to affect disease threshold . For each individual, one simulates senescent‐cell abundance using the SR model has an approximately exponential incidence curve with a decline at old ages In this section, we simulated a wide range of age‐related incidence curves. The next sections provide examples of classes of diseases, which show improvement upon senolytic treatment tends to qualitatively support such a prediction. model different disease thresholds as values of the disease occurs when a physiological parameter ϕ increases due to the disease. Increasing susceptibility parameter s, which varies about 3‐fold between BMI below 25 (male) and 54 (female) are at least mildly age‐related and 25 (male) and 28 (female) are strongly age‐related, as defined above. Of these, we find that 66 are well described by the model as a wide range of feedback mechanisms that can provide homeostasis to a half‐life of days in young mice, but their removal rate slows down in old mice to a given type of cancer have strong risk factors should increase the removal rates of the joint that bears the most common biological process of aging that governs the onset of pathology in the records of at least 104 people, totaling 877 disease category codes (See SI section 9), increasing the range of 6–8% per year. The two‐parameter model describes well the strongly age‐related ICD9 codes: 90% of the codes show R 2 > 0.9) (Figure 4c). This agreement is similar to that of the previously proposed IMII model for cancer, major fibrotic diseases, and hundreds of other age‐related disease states obtained from 10−4 to lower cancer incidence. A better fit is achieved when allowing to exceed its threshold mechanism for classes of disease, providing putative etiologies for diseases with unknown origin, such as bone marrow and skin. Thus, the sudden collapse of the alveoli at the outer parts of the immune removal capacity of cancer. For example, NK cells remove senescent cells also to other forms of age‐related damage and decline contribute (De Bourcy et al., 2017). There may be described as a first‐passage‐time problem, asking when mutated, impair particle removal by the bronchi and increase damage to alveolar cells (Yang et al., 2019; Xu et al., 2018), and immune therapy that causes T cells to target senescent cells (Amor et al., 2020). Since these treatments are predicted to have an exponential incidence curve that slows at very old ages. Interestingly, the main effects are opposite to the case of cancer growth rate to removal rate We next consider the case of frontline tissues discussed above.
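
The SLM10b/SLM9 source isn't posted, but the underlying technique is standard. Below is a minimal sketch of an order-3 word-level Markov generator in Python; the file name UriAlon.txt follows the transcript, everything else is a generic reimplementation, not the OP's code:

  # Minimal order-3 word-level Markov chain: train on a text, sample from it.
  import random
  from collections import Counter, defaultdict

  ORDER = 3  # context length in words, as in "./SLM10b_train UriAlon.txt 3"

  def train(text, order=ORDER):
      words = text.split()
      model = defaultdict(Counter)  # (w1, w2, w3) -> Counter of next words
      for i in range(len(words) - order):
          model[tuple(words[i:i + order])][words[i + order]] += 1
      return model

  def generate(model, max_words=300):
      context = random.choice(list(model.keys()))
      out = list(context)
      for _ in range(max_words):
          counts = model.get(tuple(out[-ORDER:]))
          if not counts:  # dead end: this context never continued in training
              break
          nxt, weights = zip(*counts.items())
          out.append(random.choices(nxt, weights=weights)[0])
      return " ".join(out)

  model = train(open("UriAlon.txt", encoding="utf-8").read())
  print(generate(model))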

Comments

MarkusQ•2d ago
LLMs include mechanisms (notably, attention) that allow longer-distance correlations than you could get with a similarly-sized Markov chain. If you squint hard enough though, they are Markov chains with this "one weird trick" that makes them much more effective for their size.
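
For the curious, the "weird trick" in its simplest form: a toy single-head scaled dot-product attention in numpy. This is a sketch of the mechanism, not any particular model's implementation:

  import numpy as np

  def attention(Q, K, V):
      # scores[i, j]: how strongly position i attends to position j
      scores = Q @ K.T / np.sqrt(K.shape[-1])
      w = np.exp(scores - scores.max(axis=-1, keepdims=True))
      w /= w.sum(axis=-1, keepdims=True)  # softmax over positions
      return w @ V                        # each output mixes ALL positions

  # 5 tokens with 8-dim embeddings: token 4 can draw on token 0 directly,
  # something an order-n Markov chain only does if n spans the whole gap.
  x = np.random.randn(5, 8)
  print(attention(x, x, x).shape)  # (5, 8)
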
Sohcahtoa82•1h ago
A Markov chain trained on only a single article of text will very likely just regurgitate entire sentences straight from the source material. There just isn't enough variation in sentences.

But then, Markov chains fall apart when the source material is very large. Try training a chain on Wikipedia. You'll find that the resulting output becomes incoherent garbage. Increasing the context length may increase coherence, but at the cost of collapsing into simple regurgitation.

In addition to the "attention" mechanism that another commenter mentioned, it's important to note that Markov chains are discrete in their next-token prediction while an LLM is fuzzier. LLMs have a latent space where the meaning of a word exists as a vector. LLMs will generate token sequences that didn't exist in the source material, whereas Markov chains will ONLY generate sequences that existed in the source.

This is why it's impossible to create a digital assistant, or really anything useful, with a Markov chain. The fact that they only generate sequences that existed in the source means that they will never come up with anything creative.

johnisgood•1h ago
> The fact that they only generate sequences that existed in the source means that they will never come up with anything creative.

I have seen the argument that LLMs can only give you what they've been trained on, i.e. that they will not be "creative" or "revolutionary", that they will not output anything "new", but "only what is in their corpus".

I am quite confused right now. Could you please help me with this?

Somewhat related: I like the work of David Hume, and he explains quite well how we can imagine various creatures, say, a pig with a dragon head, even if we have not seen one ANYWHERE. It is because we can take multiple ideas and combine them. We know what dragons typically look like, and we know what a pig looks like, and so we can imagine (through our creativity and the combination of these two ideas) what a pig with a dragon head would look like. I wonder how this applies to LLMs, if it applies at all.

Edit: to clarify further as to what I want to know: people have been telling me that LLMs cannot solve problems that are not already in their training data. Is this really true or not?

jldugger•52m ago
Well, there's kind of two answers here:

1. To the extent that creativity is randomness, LLM inference samples from the token distribution at each step. It's possible (but unlikely!) for an LLM to complete "pig with" with the token sequence "a dragon head" just by random chance. The temperature settings commonly exposed control how often the system takes the most likely candidate tokens.

2. A Markov chain model will literally have a matrix entry for every possible combination of inputs. So an order-2 chain will have N^2 weights, where N is the number of possible tokens. In that situation "pig with" can never be completed with a brand-new sequence, because unseen continuations have literal 0's in the probability table. In contrast, transformers consider huge context windows, and start with random weights in huge neural-network matrices. What people hope happens is that the NN begins to represent ideas, and connections between them. This gives them a shot at passing "out of distribution" tests, which is a cornerstone of modern AI evaluation.
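
A toy illustration of both points, with made-up numbers: the Markov table has a hard zero for anything unseen, while temperature merely reshapes a distribution that is nonzero everywhere:

  import numpy as np

  # Markov side: counts from a tiny corpus; ("pig", "with") -> "dragon"
  # was never observed, so its probability is exactly 0.
  counts = {("pig", "with"): {"a": 3, "the": 1}}

  # LLM side: logits over the vocabulary are always finite, so softmax
  # assigns every token a nonzero probability.
  logits = np.array([2.0, 0.5, -1.0])  # made-up scores for ["a", "the", "dragon"]

  def softmax(z, temperature=1.0):
      z = z / temperature
      e = np.exp(z - z.max())
      return e / e.sum()

  print(softmax(logits, 1.0))  # "dragon" is unlikely, but possible
  print(softmax(logits, 2.0))  # higher temperature flattens the distribution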

thaumasiotes•51m ago
>> The fact that they only generate sequences that existed in the source

> I am quite confused right now. Could you please help me with this?

This is pretty straightforward. Sohcahtoa82 doesn't know what he's saying.

Sohcahtoa82•45m ago
I'm fully open to being corrected. Just telling me I'm wrong without elaborating does absolutely nothing to foster understanding and learning.
thaumasiotes•43m ago
If you still think there's something left to explain, I recommend you read your other responses. Being restricted to the training data is not a property of Markov output. You'd have to be very, very badly confused to think that it was. (And it should be noted that a Markov chain itself doesn't contain any training data, as is also true of an LLM.)

More generally, since an LLM is a Markov chain, it doesn't make sense to try to answer the question "what's the difference between an LLM and a Markov chain?" Here, the question is "what's the difference between a tiny LLM and a Markov chain?", and assuming "tiny" refers to window size, and the Markov chain has a similarly tiny window size, they are the same thing.

johnisgood•32m ago
He said LLMs are creative, yet people have been telling me that LLMs cannot solve problems that are not in their training data. I want this clarified or elaborated on.
shagie•4m ago
Make up a fanciful problem and ask it to solve it. For example, https://chatgpt.com/s/t_691f6c260d38819193de0374f090925a is unlikely to be found in the training data - I just made it up. Another example of wizards and witches and warriors and summoning... https://chatgpt.com/share/691f6cfe-cfc8-8011-b8ca-70e2c22d36... - I doubt that was in the training data either.

Make up puzzles of your own and see if it is able to solve them or not.

The blanket claim of "cannot solve problems that are not in its training data" seems to be something that can be disproven by making up a puzzle from your own human creativity and seeing whether it can solve it - or, for that matter, how it attempts to solve it.

It appears that there is some ability for it to reason about new things. I believe that much of this "an LLM can't do X" or "an LLM is parroting tokens that it was trained on" comes from trying to claim that all the material it creates was created before by a human, and that any use of an LLM is therefore stealing from some human and unethical.

( ... and maybe if my block world or wizards and warriors and witches puzzle was in the training data somewhere, I'm unconsciously copying something somewhere else and my own use of it is unethical. )

purple_turtle•32m ago
1) being restricted to exact matches in the input is the definition of a Markov chain

2) LLMs are not Markov Chains

koliber•51m ago
Here's how I see it, but I'm not sure how valid my mental model is.

Imagine a source corpus that consists of:

Cows are big. Big animals are happy. Some other big animals include pigs, horses, and whales.

A Markov chain can only return verbatim combinations. So it might return "Cows are big animals" or "Are big animals happy".

An LLM can get a sense of meaning in these words and can return ideas expressed in the input corpus. So in this case it might say "Pigs and horses are happy". It's not limited to responding with verbatim sequences. It can be seen as a bit more creative.

However, LLMs will not be able to represent ideas that it has not encountered before. It won't be able to come up with truly novel concepts, or even ask questions about them. Humans (some at least) have that unbounded creativity that LLMs do not.

marcellus23•21m ago
> A Markov chain can only return verbatim combinations. So it might return "Cows are big animals" or "Are big animals happy".

Just for my own edification, do you mean "Are big animals are happy"? "animals happy" never shows up in the source text so "happy" would not be a possible successor to "animals", correct?
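
A quick check (assuming lowercased word tokens with punctuation stripped) bears this out:

  from collections import defaultdict

  corpus = ("Cows are big. Big animals are happy. "
            "Some other big animals include pigs, horses, and whales.")
  words = corpus.lower().replace(".", "").replace(",", "").split()

  successors = defaultdict(set)
  for a, b in zip(words, words[1:]):
      successors[a].add(b)

  print(sorted(successors["animals"]))  # ['are', 'include'] -- never 'happy'

So a first-order chain on this corpus can emit "are big animals are happy" but not "are big animals happy".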

vidarh•7m ago
> However, LLMs will not be able to represent ideas that it has not encountered before. It won't be able to come up with truly novel concepts, or even ask questions about them. Humans (some at least) have that unbounded creativity that LLMs do not.

There's absolutely no evidence to support this claim. It'd require humans to exceed the Turing computable, and we have no evidence that is possible.

Sohcahtoa82•47m ago
> I have seen the argument that LLMs can only give you what they've been trained on, i.e. that they will not be "creative" or "revolutionary", that they will not output anything "new", but "only what is in their corpus".

LLMs can absolutely create things that are creative, at least for some definition of "creative".

For example, I can ask an LLM to create a speech about cross-site scripting in the style of Donald Trump:

> Okay, folks, we're talking about Cross-Site Scripting, alright? I have to say, it's a bit confusing, but let's try to understand it. They call it XSS, which is a fancy term. I don't really know what it means, but I hear it's a big deal in the tech world. People are talking about it, a lot of people, very smart people. So, Cross-Site Scripting. It's got the word "scripting" in it, which sounds like it's about writing, maybe like a script for a movie or something. But it's on the internet, on these websites, okay? And apparently, it's not good. I don't know exactly why, but it's not good. Bad things happen, they tell me. Maybe it makes the website look different, I don't know. Maybe it makes things pop up where they shouldn't. Could be anything! But here's what I do know. We need to do something about it. We need to get the best people, the smartest people, to look into it. We'll figure it out, folks. We'll make our websites safe, and we'll do it better than anyone else. Trust me, it'll be tremendous. Thank you.

Certainly there's no text out there that contains a speech about XSS from Trump. There are some snippets here and there that likely sound like Trump, but a Markov chain is simply incapable of producing anything like this.

johnisgood•35m ago
Oh, of course; what I want answered did not have much to do with Markov chains, but with LLMs, because I see this argument made often against LLMs.
umanwizard•31m ago
> I have seen the argument that LLMs can only give you what they've been trained on, i.e. that they will not be "creative" or "revolutionary", that they will not output anything "new", but "only what is in their corpus".

People who claim this usually don’t bother to precisely (mathematically) define what they actually mean by those terms, so I doubt you will get a straight answer.

pama•6m ago
LLMs have the ability to learn certain classes of functions or algorithms from their datasets in order to reduce the likelihood of future errors when compressing their pretraining data. If you are technically inclined, read the reference: https://arxiv.org/abs/2208.01066 (and optionally follow-up work) to see how LLMs can pick up or invent complex algorithms from training on random examples that these algorithms would have solved (in one of the cases the LLM is better than anything we know; in the rest it is simply just as good as our best algorithms). These examples would not work with Markov chains at all, at any level of training. The LLMs in this study are tiny. They are not really learning a language, but simply how to perform regression.
thfuran•1h ago
>Markov chains will ONLY generate sequences that existed in the source.

A Markov chain of order N will only generate sequences of length N+1 that were in the training corpus, but it is likely to generate sequences of length N+2 that weren't (unless N was too large for the training corpus and the chain is degenerate).

Isamu•51m ago
Right, you can generate long sentences from a first-order Markov model, and all of the transitions from one word to the next will be in the training set, but the full generated sentence may not be.
Sohcahtoa82•26m ago
Well yeah, but by the time the +2 token is generated, the window has lost the first part of the N.

If you use a context window of 2, then yes, you might know that word C can follow words A and B, and D can follow words B and C, and therefore generate ABCD even if ABCD never existed.

But it could be that ABCD is incoherent.

For example, if A = whales, B = are, C = mammals, D = reptiles.

"Whales are mammals" is fine, "are mammals reptiles" is fine, but "Whales are mammals reptiles" is incoherent.

The longer you allow the chain to get, the more incoherent it becomes.

"Whales are mammals that are reptiles that are vegetables too".

Any 3-word fragment of that sentence is fine. But put it together, and it's an incoherent mess.
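
A sketch of that stitching effect, using two fragments from the comment as the entire training set (order-2, greedy choice for determinism):

  from collections import defaultdict

  fragments = ["whales are mammals", "are mammals reptiles"]

  model = defaultdict(list)
  for frag in fragments:
      w = frag.split()
      for a, b, c in zip(w, w[1:], w[2:]):
          model[(a, b)].append(c)  # (context pair) -> next words seen

  out = ["whales", "are"]
  while tuple(out[-2:]) in model:
      out.append(model[tuple(out[-2:])][0])

  print(" ".join(out))  # whales are mammals reptiles

Every 3-word window of the output was seen in training, but the 4-word whole never was.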

vjerancrnjak•15m ago
If you learn with Baum–Welch you can get nonzero out-of-distribution probabilities.

Something like a Markov random field is much better.

Not sure if anyone has managed to create latent hierarchies from chars to words to concepts. Learning NNs is far more tinkery than the brutality of probabilistic graphical models.

AndrewKemendo•1h ago
Your example is too sparse to draw a conclusion from.

I’d offer an alternative interpretation: LLMs follow the Markov decision modeling properties to encode the problem, but use a very efficient policy solver for the specific token-based action space.

That is to say, they are both within the concept of a “Markovian problem” but have wildly different path solvers. MCMC is a solver for an MDP, as is an attention network.

So same same, but different.

aespinoza•57m ago
Would you be willing to write an article comparing the results? Or share the code you used to test? I am super interested in the results of this experiment.
spencerflem•50m ago
IIRC there was some paper that showed that LLMs could be converted to Markov chains and vice versa, but the size of the chain was much, much larger.
inciampati•47m ago
Markov chains have exponential falloff in correlations between tokens over time. That's dramatically different from real text, which contains extremely long-range correlations. They simply can't model long-range correlations. As such, they can't be guided. They can memorize, but not generalize.
zwaps•39m ago
This is the correct answer
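
For reference, the falloff can be made precise: in a finite-state chain, the dependence between states k steps apart shrinks like |λ₂|^k, where λ₂ is the second-largest eigenvalue of the transition matrix. A two-state sketch:

  import numpy as np

  # P[i, j] = probability of moving from state i to state j
  P = np.array([[0.9, 0.1],
                [0.2, 0.8]])
  lam2 = sorted(np.linalg.eigvals(P), key=abs)[-2]  # here: 0.7

  for k in (1, 5, 10, 20):
      print(k, abs(lam2) ** k)  # 0.7, 0.168, 0.028, 0.0008: geometric decay
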
kleiba•37m ago
Markov chains of order n are essentially n-gram models - and this is what language models were for a very long time. They are quite good. As a matter of fact, they were so good that more sophisticated models often couldn't beat them.

But then came deep-learning models - think transformers. Here, you don't represent your inputs and states discretely; you have a representation in a higher-dimensional space that aims at preserving some sort of "semantics": proximity in that space means proximity in meaning. This allows capturing nuances much more finely than is possible with sequences of symbols from a discrete set.

Take this example: you're given a sequence of n words and are to predict a good word to follow that sequence. That's the thing that LMs do. Now, if you're an n-gram model and have never seen that sequence in training, what are you going to predict? You have no data in your probability tables. So what you do is smoothing: you take away some of the probability mass that you assigned during training to the samples you encountered and give it to samples you have not seen. How? That's the secret sauce, but there are multiple approaches.

With NN-based LLMs, you don't have that exact same issue: even if you have never seen that n-word sequence in training, it will get mapped into your high-dimensional space. And from there you'll get a distribution that tells you which words are good follow-ups. If you have seen sequences of similar meaning (even with different words) in training, these will probably be better predictions.

But if you have seen sequences of similar meaning during the training of your n-gram model, that doesn't really help you all that much.
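
A concrete instance of the smoothing idea: add-one (Laplace) smoothing for a bigram model, about the simplest way to move probability mass to unseen pairs. The corpus and numbers are made up:

  from collections import Counter

  words = "the cat sat on the mat".split()
  V = len(set(words))  # vocabulary size

  bigrams = Counter(zip(words, words[1:]))
  unigrams = Counter(words[:-1])

  def p_laplace(prev, w):
      # unseen bigrams get a small nonzero probability instead of 0
      return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

  print(p_laplace("the", "cat"))  # seen:   2/7 ~ 0.286
  print(p_laplace("the", "sat"))  # unseen: 1/7 ~ 0.143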

esafak•14m ago
https://en.wikipedia.org/wiki/Distributional_semantics
tlarkworthy•36m ago
Markov chains learn a fixed distribution, but transformers learn a distribution of distributions and latch onto what the current distribution is, based on the evidence seen so far. That's where the single-shot learning in transformers comes from. Markov chains can't do that: they will not change the underlying distribution as they read.
thatjoeoverthr•32m ago
Others have mentioned the large context window. This matters.

But also important is embeddings.

Tokens in a classic Markov chain are discrete surrogate keys. “Love”, for example, and “love” are two different tokens. As are “rage” and “fury”.

In a modern model, we start with an embedding model, and build a LUT mapping token identities to vectors.

This does two things for you.

First, it solves the above problem, which is that “different” tokens can be conceptually similar. They’re embedded in a space where they can be compared and contrasted in many dimensions, and it becomes less sensitive to wording.

Second, because the incoming context is now a tensor, it can be used with a differentiable model, backpropagation, and so forth.

I did something with this lately, actually, using a trained BERT model as a reranker for Markov chain emissions. It’s rough but manages multiturn conversation on a consumer GPU.

https://joecooper.me/blog/crosstalk/
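
A toy version of the LUT idea, with made-up 3-dimensional vectors standing in for real learned embeddings:

  import numpy as np

  lut = {  # token -> vector; a real model learns these during training
      "love": np.array([0.90, 0.10, 0.00]),
      "Love": np.array([0.88, 0.12, 0.01]),  # distinct token, nearby vector
      "rage": np.array([-0.70, 0.60, 0.10]),
      "fury": np.array([-0.68, 0.62, 0.12]),
  }

  def cos(a, b):
      return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

  print(cos(lut["rage"], lut["fury"]))  # ~0.999: near-synonyms
  print(cos(lut["love"], lut["Love"]))  # ~1.0: case variants land together
  print(cos(lut["love"], lut["rage"]))  # ~-0.68: far apart in meaning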

yobbo•16m ago
A hidden Markov model (HMM) is theoretically capable of modelling text just as well as any transformer. Typically, HMMs are probability distributions over a hidden discrete state space, but the distribution and state space can be anything. The size of the state space and the transition function determine its capacity. RNNs are effectively HMMs, and recent ones like "Mamba" and so on are considered competent.

Transformers can be interpreted as tricks that recreate the state as a function of the context window.

I don't recall reading about attempts to train very large discrete (million states) HMMs on modern text tokens.

qoez•15m ago
From a core OpenAI insider who has likely trained very large Markov models and large transformers: https://x.com/unixpickle/status/1935011817777942952

Untwittered: A Markov model and a transformer can both achieve the same loss on the training set. But only the transformer is smart enough to be useful for other tasks. This invalidates the claim that "all transformers are doing is memorizing their training data".

currymj•3m ago
Bigram and trigram language models (with some smoothing tricks to allow for out-of-training-set generalization) were state of the art for many years. Ch. 3 of Jurafsky's textbook (which is modern and goes all the way to LLMs, embeddings, etc.) is good on this topic.

https://web.stanford.edu/~jurafsky/slp3/ed3book_aug25.pdf

I don't know the history but I would guess there have been times (like the 90s) when the best neural language models were worse than the best trigram language models.

ssivark•2m ago
[delayed]