I don't know how you get here from "predict the next word."

https://www.grumpy-economist.com/p/refine

62•qsi•1h ago

Comments

pushedx•50m ago

Yes, most people (including myself) do not understand how modern LLMs work (especially if we consider the most recent architectural and training improvements).

There's the 3b1b video series which does a pretty good job, but now we are interfacing with models that probably have parameter counts in each layer larger than the first models that we interacted with.

The novel insights that these models can produce are truly shocking, I would guess even for someone who does understand the latest techniques.

measurablefunc•41m ago

What's the latest novel insight you have encountered?

brookst•2m ago

Not the person you asked, and “novel” is a minefield. What’s the last novel anything, in the sense you can’t trace a precursor or reference?

But... I recently had an LLM suggest an approach to negative mold-making that was novel to me. Long story, but basically isolating the gross geometry and using NURBS booleans for that, plus mesh addition/subtraction for details.

I’m sure there’s prior art out there, but that’s true for pretty much everything.

auraham•30m ago

I highly recommend Build a Large Language Model (From Scratch) [1] by Sebastian Raschka. It provides a clear explanation of the building blocks used in the first versions of ChatGPT (GPT-2, if I recall correctly). The output of the model is a huge vector of n elements, where n is the number of tokens in the vocabulary. We use that huge vector as a probability distribution to sample the next token given an input sequence (i.e., a prompt). Under the hood, the model has several building blocks like tokenization, skip connections, self-attention, masking, etc. The author does a great job explaining all the concepts. It is very useful for understanding how LLMs work.

[1] https://www.manning.com/books/build-a-large-language-model-f...
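To make the "huge vector as a probability distribution" step concrete, here is a minimal sketch of turning a vector of scores into probabilities and sampling one token. The four-word vocabulary and the score values are made up for illustration; a real model produces one score per vocabulary entry (tens of thousands of them), and this is not code from the book.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(logits, vocab, rng):
    """Sample one token, weighting each vocabulary entry by its probability."""
    probs = softmax(logits)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Hypothetical toy vocabulary and model output (assumptions, not real model data).
vocab = ["the", "cat", "sat", "mat"]
logits = [2.0, 1.0, 0.5, -1.0]

rng = random.Random(0)  # seeded so repeated runs behave the same
probs = softmax(logits)
token = sample_next_token(logits, vocab, rng)
print(probs, token)
```

Generation repeats this step: append the sampled token to the input, run the model again, and sample from the new distribution.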

belZaah•36m ago

It’s called emergent behavior. We understand how an LLM works, but do not have even a theory about how the behavior emerges from among the math. We understand ants pretty well, but how exactly does anthill behavior come from ant behavior? It’s a tricky problem in systems engineering, where predicting emergent behavior (such as emergencies) would be lovely.

devmor•27m ago

The good news is that despite being incredibly complex, it’s still a lot simpler than ants because it is at least all statistical linguistics (as far as LLMs are concerned anyways).

themafia•24m ago

> but do not have even a theory about how the behavior emerges

We fully do. There is a significant quality difference between English language output and other languages which lends a huge hint as to what is actually happening behind the scenes.

> but how exactly does anthill behavior come from ant behavior?

You can't smell what ants can. If you could, I'm sure it would be evident.

kristiandupont•21m ago

I am very curious about this significant hint, could you point me to some material?

spiralcoaster•15m ago

Two very big revelations here that I would love to know more about:

1. Can you reveal "what's actually happening behind the scenes" beyond the hint you gave? I can't figure it out.

2. Can you explain how an ant's sense of smell leads to anthills?

jen729w•5m ago

> 2. Can you explain how an ant's sense of smell leads to anthills?

Ant 0: doesn’t seem to be dangerous here. I’ll drop a scent.

Ant 1: oh cool, a safe place. And I didn’t die either. I’ll reinforce that.

Ant 142,857,098,277: cool anthill.

canjobear•12m ago

> There is a significant quality difference between English language output and other languages

floren•9m ago

They're saying LLMs do better when outputting English than other languages, an assertion I'm not really able to test but have heard elsewhere.

bryanrasmussen•6m ago

and this is somehow not related to the size and availability of corpora in English?

fc417fc802•20m ago

> but do not have even a theory about how the behavior emerges from among the math

Actually we have an awful lot of those.

I'm not sure if emergent is quite the right term here. We carefully craft a scenario to produce a usable gradient for a black box optimizer. We fully expect nontrivial predictions of future state to result in increasingly rich world models out of necessity.

It gets back to the age-old observation about any sufficiently accurate model being of equal complexity to the system it models. "Predict the next word" is but a single example of the general principle at play.

netfortius•14m ago

I'd rather go the route of bats [1]

[1] https://en.wikipedia.org/wiki/What_Is_It_Like_to_Be_a_Bat%3F

WD-42•36m ago

This is really hard to judge because by the looks of it, finance papers mostly consist of gobbledygook and extensive filler to begin with.

sp4cemoneky•20m ago

This. Verbalism lands really well with verbalism.

cyanydeez•20m ago

Economics is the attempt to take sociology and add numbers to make it look like a hard science. The fintechbros then seem to think that because they can make numbers go up, this is proof it's a hard science.

Tarq0n•13m ago

That's entirely missing the point. "All models are wrong, but some are useful". You can test hypotheses and learn things even about chaotic or emergent systems.

tolerance•33m ago

It’s interesting to read about the use and leverage of LLMs outside of programming.

I’m not too familiar with the history, but the import of this article is brushing up on my nose hairs in a way that makes me think a sort of neo-Sophistry is on the horizon.

themafia•29m ago

> The comments it offered were on the par of the best comments I’ve received on a paper in my entire academic career.

Sort of the lowest hanging fruit imaginable. Just because it became "fundamental" to the process doesn't mean it gained any quality.

libraryofbabel•28m ago

I have come to think “predict the next token” is not a useful way to explain how LLMs work to people unfamiliar with LLM training and internals. It’s technically correct, but at this point saying that and not talking about things like RLVR training and mechanistic interpretability is about as useful as framing talking with a person as “engaging with a human brain generating tokens” and ignoring psychology.

At least AI-haters don’t seem to be talking about “stochastic parrots” quite so much now. Maybe they finally got the memo.

qsera•14m ago

>“predict the next token” is not a useful way

That is the exact thing to say because that is exactly what it does, despite how it does so.

It is not useful to say it if you are an AI-shill though. You brought up AI-haters, so I think I am entitled to bring up AI-shills.

dylan604•13m ago

I think talking to people unfamiliar with LLM training using words like "RLVR training and mechanistic interpretability" is about as useful as a grave robber in a crematorium.

stephenr•6m ago

> stochastic parrots

I prefer to use the term "spicy autocomplete" myself.

measurablefunc•2m ago

Sampling over a probability distribution is not as catchy as "stochastic parrot" but I have personally stopped telling believers that their imagined event horizon of transistor scale is not going to deliver them to their wished for automated utopia b/c one can not reason w/ people who did not reach their conclusions by reasoning.

wavemode•18m ago

> the kind of analysis the program is able to do is past the point where technology looks like magic. I don’t know how you get here from “predict the next word.”

You're implicitly assuming that what you asked the LLM to do is unrepresented in the training data. That assumption is usually faulty - very few of the ideas and concepts we come up with in our everyday lives are truly new.

All that being said, the refine.ink tool certainly has an interesting approach, which I'm not sure I've seen before. They review a single piece of writing, and it takes up to an hour, and it costs $50. They are probably running the LLM very painstakingly and repeatedly over combinations of sections of your text, allowing it to reason about the things you've written in a lot more detail than you get with a plain run of a long-context model (due to the limitations of sparse attention).
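As a rough illustration of what "running the LLM repeatedly over combinations of sections" could look like, here is a sketch. It follows the guess above and is not a description of how refine.ink actually works. `review_in_passes`, `fake_review`, and the sample sections are all hypothetical names made up for this example; `fake_review` is a stand-in for a real model call.

```python
from itertools import combinations

def review_in_passes(sections, review_fn, group_size=2):
    """Call review_fn once per combination of sections and collect the comments."""
    comments = []
    for group in combinations(range(len(sections)), group_size):
        # Join only the selected sections so each call sees a small, focused input.
        text = "\n\n".join(sections[i] for i in group)
        comments.append((group, review_fn(text)))
    return comments

def fake_review(text):
    # Hypothetical stand-in for a model call; it only reports the input size.
    return f"reviewed {len(text.split())} words"

# Hypothetical paper sections (assumption, for illustration only).
sections = ["Intro claims A.", "Method assumes B.", "Results show C."]
results = review_in_passes(sections, fake_review)
for group, comment in results:
    print(group, comment)
```

The number of calls grows quickly with the number of sections (n choose k), which would be consistent with a review taking up to an hour and costing real money.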

It's neat. I wonder about what other kinds of tasks we could improve AI performance at by scaling time and money (which, in the grand scheme, is usually still a bargain compared to a human worker).

selridge•11m ago

>You're implicitly assuming that what you asked the LLM to do is unrepresented in the training data.

This is just as stuck in a moment in time as "they only do next word prediction." What does this even mean anymore? Are we supposed to believe that a review of this paper, which wasn't written when that model (it's putatively not an "LLM", but IDK enough about it to be pushy there) was trained, is somehow in the training data? Does that even make sense? We're not in the regime of regurgitating training data (if we really ever were). We need to let go of these frames which were barely true when they took hold. Some new shit is afoot.

wavemode•4m ago

Statistical models generalize. If you train a model that f(x) = 5 and f(x+1) = 6, the number 7 doesn't have to exist in the training data for the model to give you a correct answer for f(x+2)
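A minimal sketch of that example, assuming x = 0 for the first point and a straight-line model (both are choices made for this illustration). The fit sees only the outputs 5 and 6, and the value 7 never appears in the training data:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Training data: f(0) = 5 and f(1) = 6 (taking x = 0 is an assumption).
train_x = [0.0, 1.0]
train_y = [5.0, 6.0]

slope, intercept = fit_line(train_x, train_y)
prediction = slope * 2.0 + intercept  # input the model never saw
print(prediction)  # 7.0
```

The extrapolation works only because the model family (a line) matches the pattern in the data. Generalization always rests on assumptions like that one.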

Similarly, if there are millions of academic papers and thousands of peer reviews in the training data, a review of this exact paper doesn't need to be in there for the LLM to write something convincing. (I say "convincing" rather than "correct" since the author himself admits that he doesn't agree with all the LLM's comments.)

I tend to recommend people learn these things from first principles (e.g. build a small neural network, explore deep learning, build a language model) to gain a better intuition. There's really no "magic" at work here.

mnewme•18m ago

Is this an ad? Seems like it. The text is not really what the headline suggests.

callmeal•14m ago

"Predict the next word" is to a current LLM what a "transistor" (or gate) is to a modern CPU. I don't understand LLMs well enough to expand on that comparison, but I can see how layers above that feed the layers below to "predict the next word," and use the output to modify the input, could lead to what we see today. It is turtles all the way down.

brookst•7m ago

It’s a good comparison. It’s about abstraction and layers. Modern LLMs aren’t just models, they’re all the infrastructure around prompting and context management and mixtures of experts.

The next-word bit may be slightly higher than an individual transistor, possibly functional units.

visarga•12m ago

> Nothing you write will matter if it is not quickly adopted to the training dataset.

That is my take too, I was surprised to see how many people object to their works being trained on. It's how you can leave your mark, opening access for AI, and in the last 25 years opening to people (no restrictions on access, being indexed in Google).

mbgerring•5m ago

People who produced the works LLMs are trained on are not compensated for the value they are now producing, and their skills are increasingly less valued in a world with LLMs. The value the LLMs are producing is being captured by employees of AI companies who are driving up rent in the Bay Area, and driving up the cost of electricity and water everywhere else.

Your surprise at people’s objections makes sense if you can’t count.

retrac•3m ago

I know this sounds insane but I've been dwelling on it. Language models are digital Ouija boards. I like the metaphor because it offers multiple parallel and conflicting interpretations which might all be partly true. How does a Ouija board work? The words appear. Where do they come from? It can be explained in physical terms. Or in metaphysical terms. Collective summing of psychomotor activity. Conduits to a non-corporeal facet of existence. Many caution against the Ouija board as a path to self-inflicted madness, others caution against the Ouija board as a vehicle to bring poorly understood inhuman forces into the world.

brookst•1m ago

Ouija boards are just collective negotiation among people.

ChaitanyaSai•3m ago

The whole next word thing is interesting, isn't it? I like to see it through Dennett's "Competence and comprehension" lens. You can predict the next word competently with shallow understanding. But you could also do it well with understanding or comprehension of the full picture. A mental model that allows you to predict better. Are the AIs stumbling into these mental models? Seems like it. However, because these are such black boxes, we do not know how they are stringing these mental models together. Is it a random pick from 10 models built up inside the weights? Is there any system-wide cohesive understanding, whatever that means? Exploring what a model can articulate using self-reflection would be interesting. Can it point to internal cognitive dissonance because it has been fed both evolution and intelligent design, for example? Or do these exist as separate models to invoke depending on the prompt context, because all that matters is being rewarded by the current user?

bdhcuidbebe•2m ago

I can hear OP gurgling AI d*ck all the way to Europe.
