I have experienced that in person many, many times. Jumps in context that seem easy for one person to follow, but very hard for others.
So, assuming the paper is legit (arXiv, you never know...), it's more like something that could be improved than a difference from human beings.
I don't see how that's a problem.
Subjectivity is part of human communication.
Seeing human interactions as computer-like is a side effect of our most recent shiny toy. In the last century, people saw everything as gears and pulleys. All of these perspectives are essentially the same reductionist thinking, recycled over and over again.
We've seen men promising that they would build a gear-man, resurrect the dead with electricity, and all sorts of (now) crazy talk. People believed it for some time.
This, of course, has certain implications as to the wisdom of the idea of “replacing human programmers”, given that one of the hard parts of the trade is turning vague and often confused ideas into precise specifications by interacting with the stakeholders.
IMO the One Weird Trick for LLMs is recognizing that there's no real entity, and that users are being tricked into a suspended-disbelief story.
In most cases you're contributing text lines for a User character in a movie-script document, and the LLM algorithm is periodically triggered to autocomplete the incomplete lines of a Chatbot character.
You can have an interview with a vampire DraculaBot, but that character can only "self-reflect" in the same shallow/fictional way that it can "thirst for blood" or "turn into a cloud of bats."
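To make the movie-script framing concrete, here's a minimal sketch; `llm_complete` is a stand-in for any raw text-completion endpoint, not a real API:

```python
# A "chat" is just one growing text document the model keeps autocompleting.
def llm_complete(document: str) -> str:
    return " I'm whoever the script needs me to be.\n"  # stub continuation

script = "User: Who are you, really?\nChatbot:"
script += llm_complete(script)  # the model writes the Chatbot character's next line
print(script)
```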
This leads us to new questions: How can we characterize and identify real-world documents that fit this frame? How can we determine which features may be significant, and which of those can be easily transplanted to our use case?
I operate LLMs in many conversational modes where they do ask clarifying questions, probing questions, and baseline-determining questions.
It takes at most one sentence in the prompt to get them to act this way.
What is this one sentence you are using?
I am struggling to elicit clarification behavior from LLMs.
We see LLMs introspecting all the time [1].
>Notably, DeepSeek-AI et al. report that the average response length and downstream performance of DeepSeek-R1-Zero increases as training progresses. They further report an “aha moment” during training, which refers to the “emergence” of the model’s ability to reconsider its previously generated content. As we show in Section 3.2, this reconsideration behaviour is often indicated by the generation of phrases such as ‘wait, ...’ or ‘alternatively, ...’
So you can teach a model to sometimes ask for clarification, but will it actually have insight into when it really needs it, or will it just interject for clarification more or less at random? These models have really awful insight into their own capabilities; ChatGPT, for example, insists to me that it can read braille, and then cheerfully generates a pure hallucination.
That doesn't mean much; humans sometimes do the same thing. I recall a fun story about a mathematician with synesthesia multiplying numbers by mixing the colours together. With a bit of training such a person could also pretend to be executing a normal algorithm for the purposes of passing tests.
I can trivially get any of the foundational models to ask me clarifying questions. I've never had one respond with 'I don't know'.
Which IMO is the same as "idk"
> A token-predictor could still be trained to predict the tokens “I’m not sure what you mean because of points x, y, and z; could you elaborate?”
This is entirely true, and the key insight is even right in your sentence but you don't seem to grasp it. “could still be trained”: you can train an LLM into doing whatever you want it to, but you have to train it specifically for that!
In the beginning of LLMs we witnessed this impressive phenomenon where the LLM exhibited emergent capabilities (I'm particularly thinking of LLMs being few-shot learners on material that wasn't in their training corpus). And these emergent capabilities legitimately raised the question of “how intelligent these things are, really”.
But for the past three years, the key lesson is that this kind of emergent effect is too small to be useful, and the focus has been put on creating purpose-built datasets (with tons of “artificial data”) to train the model to explicitly do the things we want it to do. And it works pretty well, as models' capabilities have kept improving at a fast pace (in particular, I don't see why we couldn't overcome the problem highlighted by this paper with more synthetic data specifically designed for multi-turn conversation). But their progress is now strictly limited by their makers' own intelligence. You cannot just scrape the web, throw compute at the problem, and expect emergent intelligence to occur anymore. It's more “simulated intelligence” than “artificial intelligence”, really.
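For illustration only, a purpose-built multi-turn record of the kind such datasets might contain could look something like this (every detail below is invented):

```python
# An invented example of a synthetic training record that explicitly
# teaches the clarify-then-answer pattern:
record = {
    "messages": [
        {"role": "user", "content": "Make the function faster."},
        {"role": "assistant", "content": "Which function, and is the goal "
                                         "lower latency or less memory?"},
        {"role": "user", "content": "parse_log(), latency."},
        {"role": "assistant", "content": "Compile the regex once outside "
                                         "the loop, then stream the file."},
    ]
}
```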
Pre-trained LLMs will ask clarifying questions just fine. So I think this is just another consequence of post-training recipes.
Gemini 2.5 Pro and ChatGPT-o3 have often asked me to provide additional details before doing a requested task. Gemini sometimes comes up with multiple options and requests my input before doing the task.
It also mainly happens when the context is clear that we are collaborating on work that will require multiple iterations of review and feedback, like drafting chapters of a handbook.
I have seen ChatGPT ask questions immediately upfront when it relates to medical issues.
The users are being engineered more than the models are, and this isn't the only example.
In the case of medical questions it needs to know further details to provide a relevant diagnosis. That is how it was trained.
In other cases you can observe its reasoning process to see why it would decide to request further details.
I have never seen an LLM just ask questions for the sake of asking. It is always relevant in the context. I don't use them casually. Just wrote a couple of handbooks (~100 pages in a few days). Generating tens of thousands of tokens per session with Gemini.
- "Should I now give you the complete [result], fulfilling [all your demands]?"
- "Just say [go] and I will do it"
- "Do you want either [A, B, or C]"
- "In [5-15] minutes I will give you the complete result"
...
That's an example of what I'm talking about. Watch the reasoning process produce multiple options. That's what it is trained to do. That is problem solving, not "engagement". It requires more compute, not less. You see that more with the expensive models.
> "In [5-15] minutes I will give you the complete result"
I haven't seen that before and I don't see how it's relevant.
They're great at both tasks, you just have to ask them to do it.
That is, does it actually know when it doesn't know, or are you just making it less confident overall, so it asks questions with no actual insight? Convincing a model to roleplay as someone who doesn't know things vs teaching a model to have insight into when it does and doesn't need clarification seems like a tough one.
https://www.reddit.com/r/comics/comments/1l5tbc/update_to_th...
It dynamically swaps portions of the context in and out. The system is also not based on explicit definitions; it relies on LLMs 'filling the gaps'. The system helps the LLM break problems down into small tasks, which then eventually aggregate into the full task.
May I suggest - put what you have out there in the world, even if it’s barely more than a couple of prompts. If people see it and improve on it, and it’s a good idea, it’ll get picked up & worked on by others - might even take on a life of its own!
https://x.com/zacksiri/status/1922500206127349958
You can see it going from an introduction, to asking me for my name, to answering questions about a topic. There is also another example you can see in the thread.
Behind the scenes, the system prompt is being modified dynamically based on the user's request.
All the information about movies is also being loaded into context dynamically. I'm also working on a technique to unload stuff from context when the subject matter of a given thread has changed dramatically. Imagine having a long-running conversation with a friend in which you 'context switch' multiple times along the way; as time passes, you probably don't even remember what you said to that friend four years ago.
There is a concept of 'main thread' and 'sub threads' involved as well that I'm exploring.
I will be releasing the code base in the coming months. I need to take this demo further than just a few prompt replies.
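In the meantime, a minimal sketch of the dynamic-context idea described above; every name here is hypothetical, and `detect_topic` stands in for whatever classification or retrieval the real system uses:

```python
CONTEXT_CHUNKS = {
    "movies": "Reference facts about movies, loaded on demand...",
    "general": "General assistant guidelines...",
}

def detect_topic(message: str) -> str:
    return "movies" if "movie" in message.lower() else "general"  # stub

def build_messages(history: list, user_msg: str) -> list:
    # Rebuild the system prompt per turn from topic-specific chunks.
    topic = detect_topic(user_msg)
    system = "You are a helpful assistant.\n" + CONTEXT_CHUNKS[topic]
    recent = history[-10:]  # 'unload' older turns once the subject has moved on
    return [{"role": "system", "content": system},
            *recent,
            {"role": "user", "content": user_msg}]
```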
In the future, such a distinction between memory hierarchies will be clearer (a toy sketch follows the list):
- Primary memory in the training data
- Secondary memory in context
- Tertiary memory in RAG
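A toy sketch of those three tiers (everything here is a stand-in, not a real API):

```python
def overlaps(query: str, chunk: str) -> bool:
    return any(w in chunk.lower() for w in query.lower().split())

class RagIndex:  # tertiary memory: an external store queried on demand
    def __init__(self, docs): self.docs = docs
    def search(self, query, k=3):
        return [d for d in self.docs if overlaps(query, d)][:k]

def llm(prompt: str, evidence: list) -> str:
    # Primary memory needs no lookup: it's whatever the weights absorbed in training.
    return f"answer({prompt!r}, using {len(evidence)} retrieved chunk(s))"  # stub

def answer(query: str, context: list, rag: RagIndex) -> str:
    hits = [c for c in context if overlaps(query, c)]  # secondary: in-context
    return llm(query, hits or rag.search(query))       # fall back to tertiary
```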
Stuff like this (a budget-check sketch follows the list):
1. Do: Best practice for X model is to include at max 10k lines of code + task + CONVENTIONS.md + architecture guidance. Only queue tasks for components that are fairly decoupled from the rest of the codebase (e.g. small modules).
2. Don't: Start a project without a clearly defined architecture in this format. Don't ask for tasks that require X amount of reading hops to understand the logic.
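A quick sketch of what enforcing that budget could look like; the 10k-line figure comes from the advice above, while the function and names are my own invention:

```python
MAX_LINES = 10_000  # the per-task code budget suggested above

def build_task_prompt(task: str, conventions: str, files: dict) -> str:
    total = sum(src.count("\n") + 1 for src in files.values())
    if total > MAX_LINES:
        raise ValueError(f"{total} lines exceeds the {MAX_LINES}-line budget")
    parts = [conventions, task]  # CONVENTIONS.md content + the task itself
    parts += [f"# {path}\n{src}" for path, src in files.items()]
    return "\n\n".join(parts)
```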
I find it frustrating that companies release their benchmaxxing without helping developers actually use their models. It's all the more ironic that some people think of these AIs as employees. Employees can work with their boss on the best way to achieve things! With LLMs you don't even know how to communicate with them, and as a result their output is unreliable.
I had 20-something files I wanted it to check and change something in. The first 5 or so it did; then on the sixth it rightly said everything was correct and moved on. It said that for the rest of the 20, the same text over and over.
I checked, and file 6 was the only correct one. It like, learned to just repeat itself after that and did nothing.
+30pp uplift when using GPT-3.5-turbo on a mix of 300 tasks.
Free open framework; check the repo and try it yourself:
https://github.com/AutomationOptimization/tsce_demo
I tested this another 300 times with gpt-4.1 to remove those obtrusive "em-dashes" everyone hates. Tested a single-pass baseline vs TSCE, same exact instructions and prompt "Remove the em-dashes from my linkedin post. . .".
Out of the 300 tests, baseline failed to remove the em-dashes 149/300 times. TSCE failed to remove the em-dashes 18/300 times.
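For readers who don't want to open the repo, this is roughly the shape such a harness takes; it's my sketch with stubbed pipelines, not the actual test script:

```python
def run_baseline(post: str) -> str:
    return post  # stub: a single-pass model call would go here

def run_tsce(post: str) -> str:
    return post.replace("\u2014", "-")  # stub: the two-pass TSCE pipeline would go here

def failures(run, posts: list) -> int:
    return sum(1 for p in posts if "\u2014" in run(p))  # an em-dash survived

posts = ["Great news \u2014 we shipped!"] * 300
print(failures(run_baseline, posts), "/", len(posts))  # baseline misses
print(failures(run_tsce, posts), "/", len(posts))      # TSCE misses
```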
It works, all the data as well as the entire script used for testing is in the repo.
[1] http://ui.adsabs.harvard.edu/abs/2023arXiv230313988H/abstrac...
My conclusion was that context needs to be managed well for LLMs to maintain accuracy in replies. Also, it helps to have a planning process ("graph reasoning") before task execution, because it guardrails the model's thought process.
This also raises a discussion of general-use vs. workflow agent implementations, as in the former it is much more difficult to generalize all the components in structuring effective ReAct patterns.
You can edit responses, sure, but then a bunch of other context is lost.
My flow is basically:
1. plan
2. build
3. branch (into some feature/esoteric dependency issue)
4. goto #2
Prompt pruning/branching should be a first-class tool for any LLM usage.
You develop a knack for how to steer the models or start a new conversation through experience. The system or initial prompt are important, but nothing will save you if you naively keep a conversation going too long.
Is it due to the model's training distribution (mostly single-shot completions), the way context windows are encoded, or an architectural bottleneck?
Feels like there's no dynamic internal state that evolves over the conversation — only a repeated re-parsing of static history. Has anyone seen work on integrating memory/state mechanisms that allow belief revision within a session, not just regurgitation of past tokens?
When you generate future tokens, you're looking at history tokens that are happy.
So how can a model, given sad tokens, generate future happy tokens if it did not learn to do so?
The work you're looking for is already here: it's "thinking". I assume they include sad tokens in the dataset and have the model produce "thinking", which should result in happy tokens coming after the thinking tokens. If the thinking is bad (judged by the happy tokens that follow), it's penalized; if it's good, it's reinforced by gradient descent.
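A toy sketch of that outcome-based selection, with a stub grader standing in for whatever reward signal is really used:

```python
def score(answer: str) -> float:
    return 1.0 if "42" in answer else 0.0  # stub grader

samples = [
    ("wait, let me recheck the arithmetic...", "so the answer is 42"),
    ("alternatively, maybe it's simpler...", "so the answer is 7"),
]
# Keep thinking traces whose final answers score well; drop the rest.
kept = [s for s in samples if score(s[1]) > 0.5]
print(kept)  # `kept` would feed the next fine-tuning round
```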
I've built a Telegram bot, http://t.me/experai_bot, as a universal UI to LLMs (with somewhat reduced functionality), built exactly around the idea that a non-reply message means a new conversation. Wanna keep context? Keep replying to the bot's replies. Non-power users struggle with this idea.
--
Also, I observed that OpenAI models performed worse replying to the same questions (for example, the list of options in a reply got shorter) even with the smallest system message. That was the case with 3.5 and 4o; I don't know how the modern ones behave. That made me decide not to include any system messages by default. Still, I give the option to add them if you need; you can even toggle them to mix and match.
I guess chain of thought should in theory do that, but variations in prompt and context might behave differently?
In many ways this issue could make the Chinese Room thought experiment even more compelling, because it's a very practical and inescapable issue.
[1] - https://en.wikipedia.org/wiki/Chinese_room
This is mentioned in the Wikipedia page too: "Although its proponents originally presented the argument in reaction to statements of artificial intelligence (AI) researchers, it is not an argument against the goals of mainstream AI research because it does not show a limit in the amount of intelligent behavior a machine can display."
A nice middle ground I'm finding is to give Claude an initial conversation starter in its "thinking" mode, and then copy/paste that conversation into LMStudio and have a weaker model like Gemma pick up from where Claude left off.
At the end of the two weeks, I observed that the LLM was much less likely to become distracted. Sometimes I would dump whole forum threads or SO posts into it, and it would say "this is not what we are seeing here, because of [earlier context or finding]". I eliminated all dead ends logically and informed it of this (yes, it can help with the reflection, but I had to make the decisions). In the end, I found the cause of my issues.
This somewhat confirms what a user here on HN said a few days ago: LLMs are good at compressing complex information into simple information, but not at expanding simple ideas into complex ones. As long as my input was larger than the output (in either complexity or length), I was happy with the results.
I could have done this without the LLM. However, it was helpful in that it retained facts from the outset that I had either forgotten or been unable to retrieve quickly in new contexts. It also made it easier to identify time patterns in large log files, which helped me debug my site-to-site connection. I also optimized many other settings along the way, resolving more than just the most problematic issue, so in addition to fixing my problem, I learned quite a bit. The 'state' was only occasionally incorrect about my current parameter settings, and this was always easy to correct. This confirms what others have already seen: if you know where you are going and treat it as a tool, it is helpful. However, don't try to offload decisions to it or let it lead you in the wrong direction.
Overall, 350k Tokens used (about 300k words). Here's a related blog post [1] with my overall path, but not directly corresponding to this specific issue. (please don't recommend wireguard; I am aware of it)
I totally agree that LLMs are great at compressing information; I've set up the docs feature in Cursor to index several entire large documentation websites for major libraries and it's able to distill relevant information very quickly.
Sometimes it is good to start a new chat or switch to Claude.
And it really helps to be very precise with the wording of the specification of what you want to achieve, or to repeat it sometimes with some added request lines.
GIGO in reality :)
https://g.co/gemini/share/7edf8fa373fe
I’m not saying that your approach is wrong. But most LLM workflows are either brute-forcing the solution or settling into a local minimum to get stuck in. It’s like running thousands of experiments on falling objects to figure out gravity while there’s a physics textbook nearby.
[0]: https://datatracker.ietf.org/doc/html/rfc1661
That said, I’m building a product - not a PPP driver - so the quicker I can fix the problem and move on, the better.
[0] https://datatracker.ietf.org/doc/html/rfc1331
There’s no way I could fully read that RFC in an hour. And that’s before you even know which parts to focus your attention on, so you’re just being a worse LLM at that point.
I have really learned to mistrust and double-check every single line those systems produce. Same for writing code. Everything they produce looks nice and reasonable on the surface, but when you dig deeper it falls apart unless it's something very, very basic.
Most of the history would remain; it wouldn't try to summarize exactly, just prune and organize the history relative to the conversation path?
Each time you press enter, you are spinning up a new instance of the LLM and passing in the entire previous chat text plus your new message, and asking it to predict the next tokens. It does this iteratively until the model produces a <stop> token, and then it returns the text to you and the PRODUCT parses it back into separate chat messages and displays it in your UI.
What you are asking the PRODUCT to now do is to edit your and its chat messages in the history of the chat, and then send that as the new history with your latest message. This is the only way to clean the context because the context is nothing more than your messages and its previous responses, plus anything that tools have pulled in. I think it would be sort of a weird feature to add to a chat bot to have the chat bot, each time you send a new message, go back through the entire history of your chat and just start editing the messages to prune out details. You would scroll up and see a different conversation, it would be confusing.
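Roughly, in code; `complete()` is a stand-in for the actual model API, and this list is the only state anywhere:

```python
history = []  # the *only* state, held by the product, not the model

def complete(messages):          # stand-in for the actual model API call
    return "stub reply"          # the model sees only what's in `messages`

def send(user_text):
    history.append({"role": "user", "content": user_text})
    reply = complete(history)    # the full transcript is re-sent every turn
    history.append({"role": "assistant", "content": reply})
    return reply

send("hello")
send("tell me more")             # round two re-sends "hello" and its reply too
```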
IMO, this is just part of prompt engineering skills to keep your context clean or know how to "clean" it by branching/summarizing conversations.
The prompt it uses: https://www.reddit.com/r/ClaudeAI/comments/1jr52qj/here_is_c...
One could argue that the attention mechanism in transformers is already designed to do that.
But you need to train it more specifically with that in mind if you want it to be better at damping attention to parts that are deemed irrelevant by the subsequent evolution of the conversation.
And that requires the black art of ML training.
Doing this as a hack on top of the chat product, by contrast, feels more like engineering, and we're more familiar with that as a field.
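For reference, scaled dot-product attention in miniature (NumPy only); low weights already damp irrelevant tokens, and the training question is whether those weights learn to track conversational relevance:

```python
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # query/key similarity per token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: near-zero = ignored
    return weights @ V                              # weighted mix of token values

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(1, 8)), rng.normal(size=(5, 8)), rng.normal(size=(5, 8))
print(attention(Q, K, V).shape)  # (1, 8)
```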
It exists in Claude as a true branch - you can see the old threads - and in ChatGPT as a branch without the old history.
Edit a previous reply and hit “go” to see it in action.
How often in meetings does everyone maintain a running context of the entire conversation, instead of responding to the last thing that was said with a comment that has an outstanding chance of being forgotten as soon as the next person starts speaking?
Do you have any source on this? System prompts get leaked/extracted all the time, so I'd imagine someone would have noticed this.
Edit: just realized you’re talking about the Grok bot, not Grok the LLM available on X or grok.com. With the bot it’s probably harder to extract its exact instructions since it only replies via tweets. For reference here’s the current Grok the LLM system prompt: https://github.com/asgeirtj/system_prompts_leaks/blob/main/g...
Well, someone did something to it; whether it was training, feature boosting the way Golden Gate Claude [0] was done, adjusting the system prompt, or ensuring that its internet search for contextual information would always return material about that, or some combination of those, is neither obvious nor, if someone had a conjecture as to which one or combination it was, easily falsifiable/verifiable.
[0] https://www.anthropic.com/news/golden-gate-claude
Context poisoning is not a uniquely LLM problem
As merely 3 of over a dozen examples:
https://x.com/DefiantLs/status/1922213073957327219
https://x.com/PPC4Liberty/status/1922650016579018855
https://x.com/News24/status/1920909178236776755
Imagine trying to find a specific output/input that was good in the conversation tree.
You can highlight some text in a chat and fork the chat to talk about that text selection, so the LLM has that context along with the previous chat history, and it responds in a new chat (the entire chat history up to that point gets copied over from the parent chat; basically inspired by the Unix `fork`).
Your text selection from the parent chat would get turned into a hyperlink to the new child chat so you can always get to it again if you're reading the parent chat.
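A minimal sketch of those fork semantics (all names hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class Chat:
    messages: list = field(default_factory=list)
    children: list = field(default_factory=list)

    def fork(self, selection: str) -> "Chat":
        # Copy the full history so far, then focus the child on the selection.
        child = Chat(messages=[*self.messages,
                               {"role": "user", "content": f"About: {selection!r}"}])
        self.children.append(child)  # parent keeps a link, like the hyperlink described
        return child

root = Chat()
root.messages.append({"role": "assistant", "content": "...long answer..."})
branch = root.fork("the part about Unix fork")
```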
But it would indeed be nice to either disable answers (without deleting them) or forking a conversation. It wouldn't be hard to implement; I wonder if there's a market for just this?
The fundamental issue is that LLMs do not currently have real long term memory, and until they do, this is about the best we can do.
https://github.com/actualwitch/experiment
I have made Zed one of my main LLM chat interfaces, even for non-programming tasks, because being able to do that is great.