LLM Daydreaming

https://gwern.net/ai-daydreaming

84•nanfinitum•9h ago

Comments

zwaps•5h ago

Wasn't this already implemented in some agents?

I want to remember I heard about it in several podcasts

johnfn•5h ago

It's an interesting premise, but how many people

- are capable of evaluating the LLM's output to the degree that they can identify truly unique insights

- are prompting the LLM in such a way that it could produce truly unique insights

I've prompted an LLM upwards of 1,000 times in the last month, but I doubt more than 10 of my prompts were sophisticated enough to even allow for a unique insight. (I spend a lot of time prompting it to improve React code.) And of those 10 prompts, even if all of the outputs were unique, I don't think I could have identified a single one.

I very much do like the idea of the day-dreaming loop, though! I actually feel like I've had the exact same idea at some point (ironic) - that a lot of great insight is really just combining two ideas that no one has ever thought to combine before.

cantor_S_drug•3h ago

> are capable of evaluating the LLM's output to the degree that they can identify truly unique insights

I noticed one behaviour in myself. I heard about a particular topic, because it was a dominant opinion in the infosphere. Then LLMs confirmed that dominant opinion (because it was heavily represented in the training) and I stopped my search for alternative viewpoints. So in a sense, LLMs are turning out to be another reflective mirror which reinforces existing opinion.

MrScruff•1h ago

Yes, it seems like LLMs are system one thinking taken to the extreme. Reasoning was supposed to introduce some actual logic but you only have to play with these models for a short while to see that the reasoning tokens are a very soft constraint on the models eventual output.

Infact, they're trained to please us and so in general aren't very good at pushing back. It's incredibly easy to 'beat' an LLM in an argument since they often just follow your line of reasoning (it's in the models context after all).

apples_oranges•4h ago

If the breakthrough comes, most if not all links on HN will be to machine generated content. But so far it seems that the I in current AI is https://www.youtube.com/watch?v=uY4cVhXxW64 ..

NitpickLawyer•4h ago

Something I haven't seen explored, but I think could perhaps help is to somehow introduce feedback regarding the generation into the context, based on things that are easily computed w/ other tools (like perplexity). In "thinking" models we see a lot of emerging behaviour like "perhaps I should, but wait, this seems wrong", etc. Perhaps adding some signals at regular? intervals could help in surfacing the correct patterns when they are needed.

There's a podcast I listened to ~1.5 years ago, where a team used GPT2, further trained on a bunch of related papers, and used snippets + perplexity to highlight potential errors. I remember them having some good accuracy when analysed by humans. Perhaps this could work at a larger scale? (a sort of "surprise" factor)

aredox•4h ago

Oh, in the middle of "AI is PhD-level" propaganda (just check Google News to see this is not a strawman argument), some people finally admit in passing "no LLM has ever made a breakthrough".

(See original argument: https://nitter.net/dwarkesh_sp/status/1727004083113128327 )

bonoboTP•2h ago

I agree there's an equivocation going on for "PhD level" between "so smart, it could get a PhD" (as in come up with and publish new research and defend its own thesis) and "it can solve quizzes at the level that PhDs can".

ashdksnndck•4h ago

I’m not sure we can accept the premise that LLMs haven’t made any breakthroughs. What if people aren’t giving the LLM credit when they get a breakthrough from it?

First time I got good code out of a model, I told my friends and coworkers about it. Not anymore. The way I see it, the model is a service I (or my employer) pays for. Everyone knows it’s a tool that I can use, and nobody expects me to apportion credit for whether specific ideas came from the model or me. I tell people I code with LLMs, but I don’t commit a comment saying “wow, this clever bit came from the model!”

If people are getting actual bombshell breakthroughs from LLMs, maybe they are rationally deciding to use those ideas without mentioning the LLM came up with it first.

Anyway, I still think Gwern’s suggestion of a generic idea-lab trying to churn out insights is neat. Given the resources needed to fund such an effort, I could imagine that a trading shop would be a possible place to develop such a system. Instead of looking for insights generally, you’d be looking for profitable trades. Also, I think you’d do a lot better if you have relevant experts to evaluate the promising ideas, which means that more focused efforts would be more manageable. Not comparing everything to everything, but comparing everything to stuff in the expert’s domain.

If a system like that already exists at Jane Street or something, I doubt they are going to tell us about it.

blueflow•4h ago

I have not yet seen AI doing a critical evaluation of data sources. AI willcontradict primary sources if the contradiction is more prevalent in the training data.

Something about the whole approach is bugged.

My pet peeve: "Unix System Resources" as explanation for the /usr directory is a term that did not exist until the turn of the millenium (rumor is that a c't journalist made it up in 1999), but AI will retcon it into the FHS (5 years earlier) or into Ritchie/Thompson/Kernigham (27 years earlier).

_heimdall•52m ago

> Something about the whole approach is bugged.

The bug is that LLMs are fundamentally designed for natural language processing and prediction, not logic or reasoning.

We may get to actual AI eventually, but an LLM architecture either won't be involved at all or it will act as a part of the system mimicking the language center of a brain.

zhangjunphy•3h ago

I also hope we have something like this. But sadly, this is not going to work. The reason is this line from the article, which is so much harder that it looks:

> and a critic model filters the results for genuinely valuable ideas.

In fact, people have tryied this idea. And if you use a LLM or anything similar as the critic, the performance of the model actually degrades in this process. As the LLM tries too hard to satisfy the critic, and the critic itself is far from a good reasoner.

So the reason that we don't hear too much about this idea is not that nobody tried it. But that they tried, and it didn't work, and people are reluctant to publish about something which does not work.

imiric•3h ago

Exactly.

This not only affects a potential critic model, but the entire concept of a "reasoning" model is based on the same flawed idea—that the model can generate intermediate context to improve its final output. If that self-generated context contains hallucinations, baseless assumptions or doubt, the final output can only be an amalgamation of that. I've seen the "thinking" output arrive at a correct solution in the first few steps, but then talk itself out of it later. Or go into logical loops, without actually arriving at anything.

The reason why "reasoning" models tend to perform better is simply due to larger scale and better training data. There's nothing inherently better about them. There's nothing intelligent either, but that's a separate discussion.

yorwba•2h ago

Reasoning models are trained from non-reasoning models of the same scale, and the training data is the output of the same model, filtered through a verifier. Generating intermediate context to improve the final output is not an idea that reasoning models are based on, but an outcome of the training process. Because empirically it does produce answers that pass the verifier more often if it generates the intermediate steps first.

That the model still makes mistakes doesn't mean it's not an improvement: the non-reasoning base model makes even more mistakes when it tries to skip straight to the answer.

danenania•30m ago

> The reason why "reasoning" models tend to perform better is simply due to larger scale and better training data.

Except that we can try the exact same pre-trained model with reasoning enabled vs. disabled and empirically observe that reasoning produces better, more accurate results.

amelius•2h ago

But what if the critic is just hard reality? If you ask an LLM to write a computer program, instead of criticizing it, you can run it and test it. If you ask an LLM to prove a theorem, let it write the proof in a formal logic language so it can be verified. Etcetera.

imtringued•4m ago

That didn't stop actor-critic from becoming one of the most popular deep RL methods.

jumploops•3h ago

How do you critique novelty?

The models are currently trained on a static set of human “knowledge” — even if they “know” what novelty is, they aren’t necessarily incentivized to identify it.

In my experience, LLMs currently struggle with new ideas, doubly true for the reasoning models with search.

What makes novelty difficult, is that the ideas should be nonobvious (see: the patent system). For example, hallucinating a simpler API spec may be “novel” for a single convoluted codebase, but it isn’t novel in the scope of humanity’s information bubble.

I’m curious if we’ll have to train future models on novelty deltas from our own history, essentially creating synthetic time capsules, or if we’ll just have enough human novelty between training runs over the next few years for the model to develop an internal fitness function for future novelty identification.

My best guess? This may just come for free in a yet-to-be-discovered continually evolving model architecture.

In either case, a single discovery by a single model still needs consensus.

Peer review?

n4r9•2h ago

It's a good question. A related question is: "what's an example of something undeniably novel?". Like if you ask an agent out of the blue to prove the Collatz conjecture, and it writes out a proof or counterexample. If that happens with LLMs then I'll be a lot more optimistic about the importance to AGI. Unfortunately, I suspect it will be a lot murkier than that - many of these big open questions will get chipped away at by a combination of computational and human efforts, and it will be impossible to pinpoint where the "novelty" lies.

OtherShrezzing•3h ago

Google's effort with AlphaEvolve shows that the Daydream Factory approach might not be the big unlock we're expecting. They spent an obscene amount of compute to discover a marginal improvement over the state of the art in a very narrow field. Hours after Google published the paper, mathematicians pointed out that their SOTA algorithms underperformed compared to techniques published in the 50 years ago.

Intuitively, it doesn't feel like scaling up to "all things in all fields" is going to produce substantial breakthroughs, if the current best-in-class implementation of the technique by the worlds leading experts returned modest results.

khalic•3h ago

Ugh, again with the anthropomorphizing. LLMs didn't come up with anything new because _they don't have agency_ and _do not reason_...

We're looking at our reflection and asking ourselves why it isn't moving when we don't

yorwba•2h ago

If you look at your reflection in water, it may very well move even though you don't. Similarly, you don't need agency or reasoning to create something new, random selection from a large number of combinations is enough, correct horse battery staple.

Of course random new things are typically bad. The article is essentially proposing to generate lots of them anyway and try to filter for only the best ones.

amelius•1h ago

> anthropomorphizing

Gwern isn't doing that here. They say: "[LLMs] lack some fundamental aspects of human thought", and then investigates that.

cranium•2h ago

I'd be happy to spend my Claude Max tokens during the night so it can "ultrathink" some Pareto improvements to my projects. So far, I've mostly seen lateral moves that rewrites code rather than rearchitecture/design the project.

precompute•2h ago

Variations on increasing compute and filtering results aside, the only way out of this rut is another breakthrough as big, or bigger than transformers. A lot of money is being spent on rebranding practical use-cases as innovation because there's severe lack of innovation in this sphere.

pilooch•2h ago

AlphaEvolve and similar systems based on map-elites + DL/LLM + RL appears to be one of the promising paths.

Setting up the map-elites dimensions may still be problem-specific but this could be learnt unsupervisedly, at least partially.

The way I see LLMs is as a search-spqce within tokens that manipulate broad concepts within a complex and not so smooth manifold. These concepts can be refined within other spaces (pixel -space, physical spaces, ...)

guelo•1h ago

In a recent talk [0] Francois Chollet made it sound like all the frontier models are doing Test-Time Adaptation, which I think is a similar concept to Dynamic evaluation that Gwern says is not being done. Apparently Test-Time Adaptation encompasses several techniques some of which modify model weights and some that don't, but they are all about on-the-fly learning.

[0] https://www.youtube.com/watch?v=5QcCeSsNRks&t=1542s

LourensT•1h ago

Regardless of accusations of anthropomorphizing, continual thinking seems to be a precursor to any sense of agency, simply because agency requires something to be running.

Eventually LLM output degrades when most of the context is its own output. So should there also be an input stream of experience? The proverbial "staring out the window", fed into the model to keep it grounded and give hooks to go off?

amelius•1h ago

Humans daydream about problems when they think a problem is interesting. Can an LLM know when a problem is interesting and thereby prune the daydream graph?

zild3d•49m ago

> The puzzle is why

The feedback loop on novel/genuine breakthroughs is too long and the training data is too small.

Another reason is that there's plenty of incentive to go after the majority of the economy which relies on routine knowledge and maybe judgement, a narrow slice actually requires novel/genuine breakthroughs.

cs702•47m ago

The question is: How do we get LLMs to have "Eureka!" moments, on their own, when their minds are "at rest," so to speak?

The OP's proposed solution is a constant "daydreaming loop" in which an LLM is does the following on its own, "unconsciously," as a background task, without human intervention:

1) The LLM retrieves random facts.

2) The LLM "thinks" (runs a chain-of-thought) on those retrieved facts to see if they are any interesting connections between them.

3) If the LLM finds interesting connections, it promotes them to "consciousness" (a permanent store) and possibly adds them to a dataset used for ongoing incremental training.

It could work.

epcoa•16m ago

The step 3 has been shown to not work over and over again, the “find interesting connections” is the hand wavy magic at this time. LLMs alone don’t seem to be particularly adept at it either.

kookamamie•13m ago

> The puzzle is why

The breakthrough isn't in their datasets.

Jeffrey Epstein 'raw' prison video had nearly 3 minutes cut out, report claims

The Silicon Valley push to breed super-babies

Analysis of the 2025 IMO problems and how AI should be able to do

Can AI run a physical shop?

Making Project Work Better with R

Why pay for an analytics tool when Google Analytics is free?

Show HN: Infinite Canvas for Notes

Solving a Chaucerian mystery, and revealing a medieval preacher's meme

Let Me Cook You a Vulnerability: Exploiting the Thermomix TM5

Show HN: DataRamen, a Fast SQL Explorer with Automatic Joins and Data Navigation

Spain awards Huawei contracts to manage intelligence agency wiretaps

A former OpenAI engineer describes what it's like to work there

Sibling study finds early education boosts brain power

Sending Web Push Notifications from Rails

Show HN: HN v0.2.1 – Native macOS app for reading HN

A Geek's Guide to Photography

Some States Are Pushing Back on Library E-Book Licensing Fees

UK tech minister negotiated nothing with Google. He may get even less than that

Isometric Tutorials

Procedural Content Generation in Games

Intigriti teams with Nvidia to launch bug bounty and VDP

METR's AI productivity study is good

Jeremy Bicha – A Child Sexual Predator Built Your Desktop

America's Anti-Immigrant Fever Is Starting to Break

Fibonacci(50) as a Fractal Sequence Diagram

Show HN: I Recently Built a Project

Show HN: Android SDK easter egg showcase

If [ ] the user wants the latest docs, do not use 2024 in the search query

Streamline Global Client Communication with 20 Secure Accounts

Nvidia Chip in Russian Drone with Autonomous Targeting and Engagement