
LLMs encode how difficult problems are

https://arxiv.org/abs/2510.18147
174•stansApprentice•3mo ago

Comments

jiito•3mo ago
I haven't read this particular paper in-depth, but it reminds me of another one I saw that used a similar approach to find if the model encodes its own certainty of answering correctly. https://arxiv.org/abs/2509.10625
kazinator•3mo ago
It's all very clear when you mentally replace "LLM" with "text completion driven by compressed training data".

E.g.

[Text completion driven by compressed training data] exhibit[s] a puzzling inconsistency: [it] solves complex problems yet frequently fail[s] on seemingly simpler ones.

Some problems are better represented by a locus of texts in the training data, allowing more plausible talk to be generated. When the problem is not well represented, it does not help that the problem is simple.

If you train it on nothing but Scientology documents, and then ask about the Buddhist perspective on a situation, you will probably get some nonsense about body thetans, even if the situation is simple.

th0ma5•3mo ago
Thank you for posting this. I'm struck by how much of this work studies a behavior in isolation from other assumptions, and then describes the individual capability as a new solution or discovered ability that would supposedly work alongside all of those other assumptions. If the goal is to make accurate and reliable models by understanding these techniques, that makes almost all LLM research feel like whack-a-mole. Instead, it's more like seeing faces in cars and buildings: artifacts of patterns, pattern groupings, and the recognition of patterns. Building houses on sand, etc.
lukev•3mo ago
Well, that's what an LLM is. The problem is if one's mental model is built on "AI" instead of "LLM."

The fact that LLMs can abstract concepts and do any amount of out-of-sample reasoning is impressive and interesting, but the null hypothesis for an LLM being "impressive" in any regard is that the data required to answer the question is present in its training set.

XenophileJKO•3mo ago
This is true, but also misleading. We are learning that the models achieve compression by distilling higher-level concepts and deriving generalized, human-like abilities; see, for example, the recent introspection paper from Anthropic.
layoric•3mo ago
I have a hard time conceptualizing lossy text compression, but I've recently started to think of the "reasoning"/output as just a byproduct of lossy compression, with the weights tending towards an average of the information "around" the main topic of the prompt. What I've found easier is thinking about it like lossy image compression: generating more output tokens via "reasoning" is like subdividing nearby pixels and filling in the gaps with values the model has seen there before. Taking the analogy a bit too far, you can also think of the vocabulary as the pixel bit depth.
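To push the image analogy a little further, here's a toy sketch (pure illustration, nothing to do with real model internals) of subdividing samples and filling the gaps with in-between values:

  # Toy illustration of the analogy: "subdivide nearby pixels and fill the
  # gaps with plausible in-between values". Plain 1-d linear interpolation;
  # every inserted sample is a guess derived from what's already there.
  def upsample(pixels):
      out = []
      for a, b in zip(pixels, pixels[1:]):
          out.append(a)
          out.append((a + b) / 2)  # invented "in-between" value
      out.append(pixels[-1])
      return out

  print(upsample([10, 20, 60]))  # [10, 15.0, 20, 40.0, 60]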

I definitely agree that replacing "AI" or "LLMs" with "X driven by compressed training data" makes things a lot clearer, and it's a useful shortcut.

suprjami•3mo ago
You're right about "reasoning". It's just trying to steer the conversation in a more relevant direction in vector space, hopefully to generate more relevant output tokens. I find it easier to conceptualize this in three dimensions. 3blue1brown has a good video series which covers the overall concept of LLM vectors in machine learning: https://youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_...

To give a concrete example, say we're generating the next token from the word "queen". Is this the monarch, the bee, the playing card, the drag entertainer? By adding more relevant tokens (honey, worker, hive, beeswax) we steer the token generation to the place in the "word cloud" where our next token is more likely to exist.
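As a toy sketch of that steering (the 2-d vectors below are invented for illustration, not taken from any real model), averaging in bee-related tokens pulls the context away from the "monarch" sense:

  # Toy sketch: extra context tokens steer an ambiguous word toward one sense.
  # Hypothetical axes: dim 0 ~ "royalty", dim 1 ~ "insects". All numbers invented.
  import math

  def cosine(a, b):
      dot = sum(x * y for x, y in zip(a, b))
      return dot / (math.hypot(*a) * math.hypot(*b))

  emb = {
      "queen":   (0.7, 0.7),   # ambiguous: near both senses
      "monarch": (1.0, 0.1),
      "hive":    (0.1, 1.0),
      "honey":   (0.2, 0.9),
  }

  def context(tokens):  # crude stand-in for attention: just average the embeddings
      return tuple(sum(v) / len(v) for v in zip(*(emb[t] for t in tokens)))

  print(cosine(context(["queen"]), emb["monarch"]))                   # ~0.77
  print(cosine(context(["queen", "hive", "honey"]), emb["monarch"]))  # ~0.45
  print(cosine(context(["queen", "hive", "honey"]), emb["hive"]))     # ~0.96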

I don't see LLMs as "lossy compression" of text. To me that implies retrieval, and Transformers are a prediction device, not a retrieval device. If one needs retrieval then use a database.

Terr_•3mo ago
> You're right about "reasoning". It's just trying to steer the conversation in a more relevant direction in vector space, hopefully to generate more relevant output tokens.

I like to frame it as a theater script cycling through the LLM. The "reasoning" difference is just changing the style so that each character has film-noir monologues. The underlying process hasn't really changed, and the monologue text isn't fundamentally different from dialogue or stage direction... but more data still means more guidance for each improv cycle.

> say we're generating the next token from the word "queen". Is this the monarch, the bee, the playing card, the drag entertainer?

I'd like to point out that this scheme can result in things that look better to humans in the end... even when the "clarifying" choice is entirely arbitrary and irrational.

In other words, we should be alert to the difference between "explaining what you were thinking" versus "picking a firm direction so future improv makes nicer rationalizations."

esafak•3mo ago
It makes sense if you think of the LLM as building a data-aware model that compresses the noisy data by parsimony (the principle that the simplest explanation that fits is best). Typical text compression algorithms are not data-aware and not robust to noise.

In lossy compression the compression itself is the goal. In prediction, compression is the road that leads to parsimonious models.
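A toy flavor of that, with invented numbers: a parsimonious model plus its small residuals is a much shorter description of noisy data than the raw values themselves.

  # Toy "compression by parsimony" (all numbers made up): noisy linear data
  # is described far more briefly by one slope plus residuals than verbatim.
  noise = [0.1, -0.2, 0.0, 0.1, -0.1, 0.2, 0.0, -0.1, 0.1, 0.0]
  data = [(x, 3 * x + e) for x, e in zip(range(10), noise)]

  slope = sum(x * y for x, y in data) / sum(x * x for x, _ in data)  # least squares through origin
  raw = sum(abs(y) for _, y in data)                # crude "cost" of storing raw values
  resid = sum(abs(y - slope * x) for x, y in data)  # "cost" of residuals given the model

  print(f"slope ~ {slope:.2f}")                       # ~3.00
  print(f"raw {raw:.1f} vs residuals {resid:.1f}")    # ~135.1 vs ~1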

astrange•3mo ago
It is not a useful shortcut because you don't know what the training data is, nothing requires it to be an "average" of anything, and post-training arbitrarily re-weights all of its existing distributions anyway.
cruffle_duffle•3mo ago
The way I visualize it is imagining clipping the high frequency details of concepts and facts. These things operate on a different plane of abstraction than simple strings of characters or tokens. They operate on ideas and concepts. To compress, you take out all the deep details and leave only the broad strokes.
kazinator•3mo ago
One day people will say "we used to think the devil is in the details, but now we know it is in their removal".
onraglanroad•3mo ago
> Text completion driven by compressed training data...solves complex problems

Sure it does. Obviously. All we ever needed was some text completion.

Thanks for your valuable insight.

ToValueFunfetti•3mo ago
Why shouldn't you expect a problem's simplicity to correlate strongly with how well it is represented in training data? Every angle I can think of tilts in that direction. Simpler problems are easier to remember and thus repeat, they come up more often, and they require less space/time/effort to record (which also means they are less likely to contain errors).
N_Lens•3mo ago
This is a popular take on HN yet incomplete in its assessment of LLMs and their capabilities.
keeganpoppen•3mo ago
oh man i am pretty tired of the “it’s just autocomplete” armchair warriors… it is an accurate metaphor in only the most pedantic of ways, and has zero explanatory power whatsoever as far as intuition building goes. and i don’t even understand the impulse. “reality is easy, it’s just quantum autocomplete!”
msla•3mo ago
> It's all very clear when you mentally replace "LLM" with "text completion driven by compressed training data".

So you replace a more useful term with a less useful one?

Is that due to political reasons?

msla•3mo ago
> It's all very clear when you mentally replace "LLM" with "text completion driven by compressed training data".

This isn't what LLMs are, of course, but what some political groups insist they are so they can strengthen copyright law by pointing to LLMs as "theft". It's all very pro-Disney, of course.

WhyOhWhyQ•3mo ago
Probably irrelevant, but something funny about Claude Code is that it will routinely say something like "10 week task, very complex", and then one-shot it in 2 minutes. I put off having it create a feature for a while because it kept telling me the feature was way too complicated. None of the open source versions I tried were working, but I finally decided to have it build the feature anyway, and it ended up doing better than the open source projects. So there's something off about how well Claude estimates the difficulty of things for itself, and I wonder whether that makes it perform worse by not attempting things it would do well at.
danielbln•3mo ago
In terms of the time estimates: I've added to my global rules to never give time estimates for tasks, as they're useless and inaccurate.
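For anyone curious, such a rule might look something like the following in the global memory file Claude Code reads (typically ~/.claude/CLAUDE.md); the wording here is illustrative, not a verified prompt:

  # ~/.claude/CLAUDE.md
  - Never give time estimates for tasks.
  - Don't rate difficulty, predict "impact", or forecast performance results.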
bavell•3mo ago
I did the same a few weeks back, along with difficulty estimates, "impact" analysis, and expected performance results: all of it hallucinated garbage not worth wasting tokens on.
cruffle_duffle•3mo ago
Same. I dunno how they got trained to spontaneously provide those estimates either. Like they must have read some weird training data related to the phrase “how difficult is this” or something.
jives•3mo ago
I wonder if it's trying to predict what kind of estimate a human engineer would provide.
EGreg•3mo ago
Considering it’s trained on predicting the next word in stuff humans estimated before AI, wouldn’t that make sense?
kridsdale1•3mo ago
A HUGE amount of the workday artifacts engineers have been forced to produce since the start of the internet era is project-estimation documents for our managers. The training corpus on this stuff is immense and has now all been ingested into these models. The model is doing no thinking at all when it gives you an estimate; it's matching correlated strings that the humans of the past had to write down.

Fun fact, all those human-sourced estimates were hallucinations too.

abdullahkhalids•3mo ago
It would be very surprising if the AI training corpus includes a lot of project estimation documentation, since most of those are confidential and not publicly available.
andai•3mo ago
I think there are two aspects to this.

Firstly, Claude's self-concept is based on humanity's collective self-concept. (Well, the statistical average of all the self-concepts on the internet.)

So it doesn't have a clear understanding of what LLMs' strengths and weaknesses are, and by extension its own. (Neither do we, from what I've gathered. At least, not in a way that's well represented in web scrapes ;)

Secondly, as a programmer I have noticed a similar pattern... stuff that people say is easy turns out to be a pain in the ass, and stuff that they say is impossible turns out to be trivial. (They didn't even try, they just repeated what other people told them was hard, who also didn't try it...)

barren_suricata•3mo ago
Not sure how related this is, but I've noticed it tends to start sentences with inflated optimism. I think the idea is that if it opens with "Aha, I see it now! The problem is...", whatever comes next has a higher chance of being a correct solution than if it hadn't used an overtly positive prefix, even if that leads to a lot of annoying behavior.
AlecSchueler•3mo ago
I've always been taught to slightly overestimate how long something will take so that it reflects better on the team when it's delivered ahead of schedule. There's bound to be a bunch of similar advice and patterns in the training data.
bartwe•3mo ago
Sounds a lot like Kolmogorov complexity.
baxtr•3mo ago
Kolmogorov complexity is the length of the shortest computer program that can produce a specific object as output. It formalizes the idea that simple objects have short descriptions, while complex (random) objects are incompressible.
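In symbols, relative to a fixed universal machine U, it's the length of the shortest program that prints x:

  K_U(x) = \min \{\, |p| \;:\; U(p) = x \,\}

and x is called c-incompressible when K_U(x) \ge |x| - c, i.e. no program meaningfully shorter than x itself produces it.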
dorgo•3mo ago
The complex objects are conceptually similar to prime numbers.
amelius•3mo ago
Compression is a great IQ test, but it's still limited to a small domain.
inavida•3mo ago
My interpretation of the abstract is that humans are pretty good at judging how difficult a problem is while LLMs aren't as reliable, that problem difficulty correlates with activations during inference, and finally that an accurate human judgment of problem difficulty, given as input, leads to better problem solving.

If so, this is a nice training signal for my own neural net, since my view of LLMs is that they are essentially analogy-making machines, and that reasoning is essentially a chain of analogies that ends in a result that aligns somewhat with reality. Or that I'm as crazy as most people seem to think I am.

penguinPhilosop•3mo ago
Umm... isn't the point of analogies to find similarity between things, whereas the point of reasoning is to find causality between them?
inavida•3mo ago
Not sure. I tend to think the "why" of things is always emergent, then applied to analogies.

Honestly, I had no idea what to make of the abstract at first, so I questioned duck.ai's GPT-5 mini to try to understand it in my own words, and according to mini, the first paragraph aligns pretty well with the abstract.

The second paragraph is my own opinion, but according to mini, aligns with at least a subset of cognitive theory in the context of problem solving.

I highly recommend asking an LLM to explore this interesting question you've asked. They're all extremely useful for testing assumptions, and the next time I can't sleep I'll probably do so myself.

Personally I haven't had any luck getting an LLM to solve even simple problems, but I suspect I don't know yet how to ask, and it's possible that the people who are building them are still working it out themselves.

amazingman•3mo ago
> Personally I haven't had any luck getting an LLM to solve even simple problems

How are you defining "problem"?

inavida•3mo ago
I had in mind the datasets of Easy2Hard-Bench that the study tested against: math competitions, math word problems, programming, chess puzzles, science QA, and commonsense reasoning.

The last problem like this that I asked an LLM to solve was to find the tax and base price of items on an invoice, given the total price and tax rates. I couldn't make sense of the answer, but asking the LLM questions made me realize that I had framed the problem badly, and more so that I didn't know how to ask. (Though the process also triggered a surprising ability of my own to dredge up and actually apply basic algebra.) I'm sure it's that I'm still learning what and how to ask.
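For the record, that invoice problem reduces to one rearrangement of total = base * (1 + rate); a minimal sketch with made-up numbers:

  # Given a tax-inclusive total and the tax rate:
  # total = base * (1 + rate)  =>  base = total / (1 + rate)
  def split_price(total, rate):
      base = total / (1 + rate)
      return round(base, 2), round(total - base, 2)

  print(split_price(107.00, 0.07))  # (100.0, 7.0): base price, tax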
