I have experienced that in person many, many times. Jumps in context that seem easy for one person to follow, but very hard for others.
So, assuming the paper is legit (arXiv, you never know...), it's more like something that could be improved than a fundamental difference from human beings.
I don't see how that's a problem.
Subjectivity is part of human communication.
Seeing human interactions as computer-like is a side effect of our most recent shiny toy. In the last century, people saw everything as gears and pulleys. All of these perspectives are essentially the same reductionist thinking, recycled over and over again.
We've seen men promising that they would build a gear-man, resurrect the dead with electricity, and all sorts of (now) crazy talk. People believed it for some time.
How do we see robots, AI, and helper interactions in film, TV, and games?
A curated list of films for consideration:
Mary Shelley's "Frankenstein; or, The Modern Prometheus" (1818), Metropolis (1927), I, Robot (1940-1950; Three Laws of Robotics, robopsychology), Macy Conferences (1941-1960; Cybernetics), Tobor the Great (1954), Here Comes Tobor (1956), Jetsons' maid's name: Rosie (1962), Lost in Space (1965), 2001: A Space Odyssey (1968), THX 1138 (1971), Star Wars (1977), Terminator (1984), Driving Miss Daisy (1989), Edward Scissorhands (1990), Flubber (1997, 1961), Futurama (TV, 1999-), Star Wars: The Phantom Menace (1999), The Iron Giant (1999), Bicentennial Man (1999), A.I. Artificial Intelligence (2001), Minority Report (2002), I, Robot (2004), Team America: World Police (2004), Wall-E (2008), Iron Man (2008), Eagle Eye (2008), Moon (2009), Surrogates (2009), Tron: Legacy (2010), Hugo (2011), Django Unchained (2012), Her (2013), Transcendence (2014), Chappie (2015), Tomorrowland (2015), The Wild Robot (2016, 2024), Ghost in the Shell (2017),
Giant f robots: Gundam (1979), Transformers (TV: 1984-1987, 2007-), Voltron (1984-1985), MechWarrior (1989), The Matrix Revolutions (2003), Avatar (2009, 2022, 2025), Pacific Rim (2013-), RoboCop (1987, 2014), Edge of Tomorrow (2014),
~AI vehicle: Herbie, The Love Bug (1968-), Knight Rider (TV, 1982-1986), Thunder in Paradise (TV, 1993-95), Heat Vision and Jack (1999), Transformers (2007), Bumblebee (2018)
Games: Portal (2007), LEGO Bricktales (2022), While True: learn() (2018), "NPC" Non-Player Character
Category:Films_about_artificial_intelligence : https://en.wikipedia.org/wiki/Category:Films_about_artificia...
List of artificial intelligence films: https://en.wikipedia.org/wiki/List_of_artificial_intelligenc...
Category:Films_about_robots: https://en.wikipedia.org/wiki/Category:Films_about_robots
Category:American_robot_films: https://en.wikipedia.org/wiki/Category:American_robot_films
This, of course, has certain implications for the wisdom of “replacing human programmers”, given that one of the hard parts of the trade is turning vague and often confused ideas into precise specifications by interacting with the stakeholders.
IMO the One Weird Trick for LLMs is recognizing that there's no real entity, and that users are being tricked into a suspended-disbelief story.
In most cases you're contributing text-lines for a User-character in a movie-script document, and the LLM algorithm is periodically triggered to autocomplete incomplete lines for a Chatbot character.
You can have an interview with a vampire DraculaBot, but that character can only "self-reflect" in the same shallow/fictional way that it can "thirst for blood" or "turn into a cloud of bats."
This leads us to new questions: How can we characterize and identify real-world documents which fit? How can we determine what features may be significant, and which of those can be easily transplanted to our use-case?
I operate LLMs in many conversational modes where they do ask clarifying questions, probing questions, and baseline-establishing questions.
It takes at most one sentence in the prompt to get them to act this way.
What is this one sentence you are using?
I am struggling to elicit clarification behavior from LLMs.
You can ask it to use the Socratic method, but then it is probing you, not its own understanding. Now have it use the Socratic method on itself. You can tell it to have multiple simultaneous minds.
Play with deepseek in thinking and non-thinking mode, give it nebulous prompts and see if you can get it to ask for clarifications.
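For what it's worth, here is a minimal sketch of the kind of single-sentence instruction being described, using the OpenAI Python SDK; the exact wording, model name, and example request are illustrative assumptions, not a recipe:

```python
# Hedged sketch: a one-sentence system instruction that nudges the model to ask
# for clarification on ambiguous requests. Model name and wording are examples.
from openai import OpenAI

client = OpenAI()

messages = [
    {
        "role": "system",
        "content": (
            "Before answering, ask one clarifying question whenever the "
            "request is ambiguous or missing details you need."
        ),
    },
    # A deliberately vague request, to trigger the clarifying behaviour.
    {"role": "user", "content": "Make the report better."},
]

response = client.chat.completions.create(model="gpt-4.1", messages=messages)
print(response.choices[0].message.content)  # typically a question back, not a rewrite
```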
We see LLMs introspecting all the time [1].
> Notably, DeepSeek-AI et al. report that the average response length and downstream performance of DeepSeek-R1-Zero increases as training progresses. They further report an “aha moment” during training, which refers to the “emergence” of the model’s ability to reconsider its previously generated content. As we show in Section 3.2, this reconsideration behaviour is often indicated by the generation of phrases such as ‘wait, ...’ or ‘alternatively, ...’
So you can teach a model to sometimes ask for clarification, but will it actually have insight into when it really needs it, or will it just interject for clarification more or less at random? These models have really awful insight into their own capabilities; ChatGPT, for example, insists to me that it can read braille, and then cheerfully generates a pure hallucination.
That doesn't mean much; humans sometimes do the same thing. I recall a fun story about a mathematician with synesthesia multiplying numbers by mixing the colours together. With a bit of training such a person could also pretend to be executing a normal algorithm for the purposes of passing tests.
I can trivially get any of the foundational models to ask me clarifying questions. I've never had one respond with 'I don't know'.
Which IMO is the same as "idk".
> A token-predictor could still be trained to predict the tokens “I’m not sure what you mean because of points x, y, and z; could you elaborate?”
This is entirely true, and the key insight is even right in your sentence but you don't seem to grasp it. “could still be trained”: you can train an LLM into doing whatever you want it to, but you have to train it specifically for that!
In the early days of LLMs we witnessed this impressive phenomenon where the LLM exhibited emergent capabilities (I'm particularly thinking about LLMs being few-shot learners on stuff that wasn't in their training corpus). And these emergent capabilities legitimately raised the question of “how intelligent these things are, really”.
But for the past three years, the key lesson is that this kind of emergent effect is too small to be useful, and the focus has been put on creating purpose-built datasets (with tons of “artificial data”) to train the model to explicitly do the things we want it to do. And it works pretty well, as models' capabilities have kept improving at a fast pace (and in particular, I don't see why we couldn't overcome the problem highlighted by this paper with more synthetic data specifically designed for multi-turn conversation). But their progress is now strictly limited by their makers' own intelligence. You cannot just scrape the web, throw compute at the problem, and expect emergent intelligence to occur anymore. It's more “simulated intelligence” than “artificial intelligence”, really.
Pre-trained LLMs will ask clarifying questions just fine. So I think this is just another consequence of post-training recipes.
Nonsense, we are already surrounded by mindless algorithms (and their outputs) that "affect the real world" because many of us have full-time jobs ensuring it happens!
When someone uses a SimCity-esque program to generate a spreadsheet used for real-world bus schedules, does that "break key aspects and assumptions of a traffic simulator"? Does the downstream effect elevate it to a microcosm of tiny lives? Nope!
My point about Dracula isn't just that he's fictional, but that he cannot make decisions that have unscripted consequences in the real world, nor can he engage in a novel, interactive conversation. Dracula, as a character, only "acts" or "speaks" as an author (or game designer, etc.) has already written or programmed him to. He has no independent capacity to assess a new situation and generate a novel response that affects anything beyond his fictional context. If I "talk" to Dracula in a game, the game developers have pre-scripted his possible responses. The text of Dracula is immutable.
A LLM, by contrast, performs fresh inference every time it’s prompted: it weighs competing continuations and selects one. That selection is a bona-fide decision (a branch taken at run-time). The “document-simulator” picture collapses that distinction, treating a dynamic decision process as if it were a block of pre-written prose. It's just nonsensical.
Your SimCity example is open loop: the simulation runs, a human inspects the results, and then decides whether to publish new bus schedules. Nothing in the simulator is tasked with interrogating the human, updating its model of their intent, or steering the outcome. In production LLM systems the loop is often closed: the model (often with tool-wrapper code) directly drafts emails, modifies configs, triggers API calls, or at minimum interrogates the user (“What city are we talking about?”) before emitting an answer.
Your argument is tired and semantic because it fails at the most fundamental level: it's not even a good analogy.
I feel you've erected a strawman with this "document simulator" phrase of yours, something you've arbitrarily defined as a strictly one-shot process for creating an immutable document. Yeah, it's boring and "nonsensical" because you made it that way.
In contrast, everybody else here has been busy talking about iterative systems which do permit interaction, because the document is grown via alternate passes of (A) new content from external systems or humans and (B) new content predicted by the LLM.
>You can have an interview with a vampire DraculaBot, but that character can only "self-reflect" in the same shallow/fictional way that it can "thirst for blood" or "turn into a cloud of bats."
The "shallow/fictional way" only exists because of the limited, immutable nature of real scripts. A 'script' that does not have either of these properties would not necessarily produce characters that only reflect in a shallow manner.
Text that's generated on the fly (while interrogating the user, calling tools, and updating its own working context) isn't anything like a screenplay whose pages are fixed in advance.
There's no strawman here. You've decided that an LLM is not something you want to attribute a 'real' entity to and this is your rationalization for that.
You are confused and again attacking an idea nobody else has advanced.
Even in my very first comment starting the thread, I explicitly stated that the "movie-script" is mutable, with alternate phases of "contributing" and "autocompleted" content as it grows.
This is not a hard concept to grasp. I know what you are claiming. It doesn't automatically make your argument sound.
To call something that does not have the properties of a script a script is odd in the first place, but to realize that and still assume behaviors that only result from the very properties you admit are not present in your new 'script' is just bizarre.
I'm not confused. You are.
Badly, and with great difficulty, so while it can just about be done, even then only kinda.
Anyone who actually understands both LLMs and the human brain well enough to make confident claims that they basically work the same really ought to put in the effort to write up a paper and get a Nobel prize or two.
In particular, generally speaking (not claiming that LLMs are a road to AGI, which is something I doubt), it's not a well-defensible philosophical position that the vertebrate brain (and remember that mammalian, bird and cephalopod brains are very different) is uniquely suited to produce what we call "intelligence".
> Anyone who actually understands both LLMs and the human brain well enough to make confident claims that they basically work the same
This is a strawman and not my position.
I don’t think anyone in this discussion has claimed that brains are uniquely suited to producing intelligence. The point was just that we have no idea if there is any interesting correspondence between how LLMs work and how brains work, beyond superficial and obvious analogies.
I also think "would" in the comment I'm replying to is closer to "could" than to "does".
Humans who are wrong are often completely oblivious to being wrong.
Most random humans have nearly constant confidence in their answers regardless of how much they know about a topic.
--
Hmm. Stream of consciousness thoughts here:
Problem is, LLMs are very broad "System 1" thinking, with no "System 2". Other kinds of models that combine an LLM with a more logical module? Perhaps that could work.
And if so then I guess you could use the logical module to create tokens for an LLM to learn from? Eventually, but may be inefficient.
To work, the logic module needs to know what the LLM gets wrong, to create tokens for the LLM to learn from. If you can do that, fine, but fine-tuning is slow and expensive and still errs, so why not keep the logic separate and just have it point out mistakes, have the LLM regenerate answers a few times, and if that doesn't work, say "IDK"?
So saying LLMs have no "baseline for truth" doesn't really mean much one way or the other; they are much smarter and more accurate than 99% of humans.
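A toy sketch of the check-and-regenerate loop floated above; `generate` and `find_mistakes` are hypothetical stand-ins for an LLM call and the separate logic/verification module:

```python
# Toy sketch of "have a checker point out mistakes, regenerate a few times,
# otherwise answer IDK". Both callables are placeholders, not a real API.
def answer_with_checks(question, generate, find_mistakes, max_tries=3):
    feedback = ""
    for _ in range(max_tries):
        draft = generate(question, feedback)        # LLM proposes an answer
        mistakes = find_mistakes(question, draft)   # logic module critiques it
        if not mistakes:
            return draft
        # Feed the verifier's objections back in and try again.
        feedback = "Your previous answer had problems: " + "; ".join(mistakes)
    return "I don't know."
```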
The problem is that any information about any internal processes used to generate a particular token is lost; the LLM is stateless, apart from the generated text. If you ask an LLM-character (which I agree should be held distinct from the LLM itself and exists at a different layer of abstraction) why it said something, the best it can do is a post-hoc guess. The "character", and any internal state we might wish it to have, only exists insofar as it can be derived anew from the text.
In any case, you don't need accurate understanding of how your mind works (hello humans, again!) to be able to converge on
INSUFFICIENT DATA FOR A MEANINGFUL ANSWER
when there's no other uniquely good local optimum in the search space.

Gemini 2.5 Pro and ChatGPT o3 have often asked me to provide additional details before doing a requested task. Gemini sometimes comes up with multiple options and requests my input before doing the task.
It also mainly happens when the context is clear that we are collaborating on work that will require multiple iterations of review and feedback, like drafting chapters of a handbook.
I have seen ChatGPT ask questions immediately upfront when it relates to medical issues.
The users are being engineered more than the models are, and this isn't the only example.
In the case of medical questions it needs to know further details to provide a relevant diagnosis. That is how it was trained.
In other cases you can observe its reasoning process to see why it would decide to request further details.
I have never seen an LLM just ask questions for the sake of asking. It is always relevant in the context. I don't use them casually. Just wrote a couple of handbooks (~100 pages in a few days). Generating tens of thousands of tokens per session with Gemini.
- "Should I now give you the complete [result], fulfilling [all your demands]?"
- "Just say [go] and I will do it"
- "Do you want either [A, B, or C]"
- "In [5-15] minutes I will give you the complete result"
...
That's an example of what I'm talking about. Watch the reasoning process produce multiple options. That's what it is trained to do. That is problem solving, not "engagement". It requires more compute, not less. You see that more with the expensive models.
> "In [5-15] minutes I will give you the complete result"
I haven't seen that before and I don't see how it's relevant.
Fair point. Thanks for standing your ground and arguing so matter-of-factly with me! Appreciate it.
The optional choices happen when it tries to reason out a solution, but then finds it is making too many assumptions of unknown details about the user's system, preferences, goals, and so on. It's just a thought pattern that it has learned to emulate.
People here will argue that LLM's cannot truly "think", but they are good enough at emulating thinking.
But even the dumbest model will call you out if you ask it something like:
"Hey I'm going to fill up my petrol car with diesel to make it faster. What brand of diesel do you recommend?"
They're great at both tasks, you just have to ask them to do it.
That is, does it actually know when it doesn't know, or are you just making it less confident overall, so it asks questions with no actual insight? Convincing a model to roleplay as someone who doesn't know things vs teaching a model to have insight into when it does and doesn't need clarification seems like a tough one.
https://www.reddit.com/r/comics/comments/1l5tbc/update_to_th...
Ironically, working with a junior dev is a lot like this -- setting them on a task, then coming back later with dogs and flashlights to retrieve them from the deep woods they've inevitably lost themselves in by just forging ahead, making assumptions, and asking no questions.
When I read this I feel like I'm witnessing intelligent people get fooled by a better Emacs doctor. It is not reflecting, it is not confident. It is "just" proposing text completions. That is why, once the completion starts being bad, you have to start anew. It does not have any concept of anything, just a huge blob of words and possible follow-ups derived from the texts used to train it.
It dynamically swaps portions of the context in and out. This system is also not based on explicit definitions; it relies on the LLM 'filling the gaps'. The system helps the LLM break down problems into small tasks, which eventually aggregate into the full task.
May I suggest - put what you have out there in the world, even if it’s barely more than a couple of prompts. If people see it and improve on it, and it’s a good idea, it’ll get picked up & worked on by others - might even take on a life of its own!
https://x.com/zacksiri/status/1922500206127349958
You can see it's going from introduction, asking me for my name, and then able to answer question about some topic. There is also another example in the thread you can see.
Behind the scenes, the system prompt is being modified dynamically based on the user's request.
All the information about movies is also being loaded into context dynamically. I'm also working on a technique to unload stuff from context when the subject matter of a given thread has changed dramatically. Imagine having a long thread of conversation with your friend where you 'context switch' multiple times as time progresses; you probably don't even remember what you said to your friend 4 years ago.
There is a concept of 'main thread' and 'sub threads' involved as well that I'm exploring.
I will be releasing the code base in the coming months. I need to take this demo further than just a few prompt replies.
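A minimal sketch of the kind of dynamic context assembly described above; the section names, topic detection, and structure are all hypothetical, the point being only that the system prompt is rebuilt per request rather than fixed:

```python
# Hypothetical sketch: rebuild the system prompt each turn, loading only the
# sections relevant to the current topic and "unloading" the rest.
BASE_PROMPT = "You are a helpful assistant."

SECTIONS = {
    "movies": "Known movies:\n- Example Movie A (2001)\n- Example Movie B (2014)",
    "billing": "Billing policy:\n- Refunds within 30 days.",
}

def detect_topics(text: str) -> set[str]:
    # Stand-in for whatever classifier or retriever the real system uses.
    return {name for name in SECTIONS if name in text.lower()}

def build_system_prompt(recent_messages: list[str], user_message: str) -> str:
    # Keep sections touched by the current message or the last few turns;
    # older, context-switched subject matter simply drops out of the prompt.
    active = detect_topics(user_message) | detect_topics(" ".join(recent_messages[-4:]))
    parts = [BASE_PROMPT] + [SECTIONS[name] for name in sorted(active)]
    return "\n\n".join(parts)
```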
In the future, such a distinction in memory hierarchies will be clearer (a toy sketch follows this list):
- Primary memory in the training data
- Secondary memory in context
- Tertiary memory in RAG
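A toy illustration of those three tiers (names and data are made up; the embedding and chat calls follow the OpenAI Python SDK, but any provider would do): primary memory lives in the weights, secondary memory is whatever gets assembled into the context window, and tertiary memory is a store queried per turn.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# Tertiary memory: a document store queried per turn (RAG).
documents = ["Office hours are 9-5.", "Refunds take 10 business days."]
doc_vecs = embed(documents)

def retrieve(query, k=1):
    q = embed([query])[0]
    scores = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

# Secondary memory: the context assembled for this turn.
# (Primary memory, the training data, is baked into the model weights.)
def answer(question, history):
    notes = "\n".join(retrieve(question))
    messages = (
        [{"role": "system", "content": "Use the retrieved notes:\n" + notes}]
        + history
        + [{"role": "user", "content": question}]
    )
    resp = client.chat.completions.create(model="gpt-4.1", messages=messages)
    return resp.choices[0].message.content
```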
Stuff like this:
1. Do: Best practice for X model is to include at max 10k lines of code + task + CONVENTIONS.md + architecture guidance. Only queue tasks for components that are fairly decoupled from the rest of the codebase (e.g. small modules).
2. Don't: Start a project without a clearly defined architecture in this format. Don't ask for tasks that require X amount of reading hops to understand the logic.
I find it frustrating that companies release their benchmaxxing without helping developers actually use their models. It's more ironic that some people think of these AIs as employees. Employees can work with their boss about the best way to achieve things! With LLMs you don't even know how to communicate with them and as a result their output is unreliable.
I had 20-something files I wanted it to check and change something in. The first 5 or so it did; on the sixth it rightly said everything was correct and moved on. It then said that for the rest of the 20, the same text over and over.
I checked, and file 6 was the only correct one. It like, learned to just repeat itself after that and did nothing.
+30pp uplift when using GPT-3.5-turbo on a mix of 300 tasks.
Free, open framework; check the repo and try it yourself:
https://github.com/AutomationOptimization/tsce_demo
I tested this another 300 times with gpt-4.1 to remove those obtrusive "em-dashes" everyone hates. I tested a single-pass baseline vs. TSCE with the same exact instructions and prompt: "Remove the em-dashes from my LinkedIn post...".
Out of the 300 tests, baseline failed to remove the em-dashes 149/300 times. TSCE failed to remove the em-dashes 18/300 times.
It works, all the data as well as the entire script used for testing is in the repo.
System prompt: "Remove every em-dash (—) from the following text while leaving other characters unchanged.\n\nReturn only the cleaned text."
User prompt: <prompt from tsce_chat.py filled with em dashes>
Temperature: 0.0
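For reference, a rough sketch of what the single-pass baseline side of such a test looks like; the sample post is made up (the real prompt lives in tsce_chat.py in the repo), and only the loop and the pass/fail check are shown, not TSCE itself:

```python
# Hedged sketch of the baseline loop: one call per run, fail if any em-dash survives.
from openai import OpenAI

client = OpenAI()

SYSTEM = ("Remove every em-dash (—) from the following text while leaving "
          "other characters unchanged.\n\nReturn only the cleaned text.")
POST = "Excited to share an update — new role, new city — let's connect!"  # made-up stand-in

failures = 0
N = 300
for _ in range(N):
    resp = client.chat.completions.create(
        model="gpt-4.1",
        temperature=0.0,
        messages=[{"role": "system", "content": SYSTEM},
                  {"role": "user", "content": POST}],
    )
    if "—" in resp.choices[0].message.content:
        failures += 1  # an em-dash survived, so this run counts as a failure

print(f"baseline failures: {failures}/{N}")
```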
If you reran today you’d see the same improved pass rate I’m getting now. That’s the downside of benchmarking against latest model names; behaviour changes quietly unless you pin to a dated snapshot.
For bigger, noisier prompts (or on GPT-3.5-turbo, which hasn’t changed) TSCE still gives a solid uplift, so the framework’s value stands. Appreciate you checking it out!
I don't know where you are getting this information from... The only snapshot of gpt-4.1 is gpt-4.1-2025-04-14 (mid-April), and the gpt-4.1 alias still points to it [1].
Just to be sure, I re-ran my test specifying that particular snapshot and am still getting a 100% pass rate.
Have you heard of text.replace("—", "-") ?
[1] http://ui.adsabs.harvard.edu/abs/2023arXiv230313988H/abstrac...
My conclusion was that context needs to be managed well for the LLMs to maintain accuracy in their replies. Also, it helps to have a planning process ("graph reasoning") before task execution, because it guardrails the model's thought process.
This also raises a discussion of general-use vs. workflow agent implementations, as in the former it is much more difficult to generalize all the components needed to structure effective ReAct patterns.
You can edit responses, sure, but then a bunch of other context is lost.
My flow is basically:
1. plan
2. build
3. branch (into some feature/esoteric dependency issue)
4. goto #2
Prompt pruning/branching should be a first-class tool for any LLM usage.
You develop a knack for how to steer the models or start a new conversation through experience. The system or initial prompt are important, but nothing will save you if you naively keep a conversation going too long.
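In API terms, branching really is that cheap; a sketch (message contents are invented for illustration):

```python
# A branch is just a copy of the message list truncated at the fork point;
# any chat-completion API treats the shorter history as the whole conversation.
def branch(history, keep_until):
    """Fork the conversation, keeping messages [0, keep_until)."""
    return list(history[:keep_until])

main = [
    {"role": "user", "content": "Plan: build a CLI todo app."},
    {"role": "assistant", "content": "Here's a plan..."},
    {"role": "user", "content": "Start on the storage layer."},
    {"role": "assistant", "content": "...long dependency-debugging detour..."},
]

# Step 3 of the flow above: fork before the detour polluted the context,
# then keep building on the clean copy while `main` retains the full record.
feature = branch(main, keep_until=3)
feature.append({"role": "user", "content": "Use a single JSON file for storage."})
```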
Is it due to the model's training distribution (mostly single-shot completions), the way context windows are encoded, or an architectural bottleneck?
Feels like there's no dynamic internal state that evolves over the conversation — only a repeated re-parsing of static history. Has anyone seen work on integrating memory/state mechanisms that allow belief revision within a session, not just regurgitation of past tokens?
When you generate future tokens, you're looking at history tokens that are happy.
So how can a model, given sad tokens, generate future happy tokens if it did not learn to do so?
The work you're looking for is already here: it's "thinking". I assume they include sad tokens in the dataset and produce "thinking", which should result in happy tokens coming after the thinking tokens. If the thinking is bad (judged by the happy tokens that follow), it's punished; if good, it's reinforced by gradient descent.
I've built a Telegram bot http://t.me/experai_bot as a universal UI to LLMs (with somewhat reduced functionality), built exactly around the idea that "a non-reply message means a new conversation". Wanna keep context? Keep replying to the bot's replies. Non-power users struggle with this idea.
--
Also, I observed that OpenAI models performed worse replying to the same questions (for example, the list of options in the reply got shorter) even with the smallest system message. That was the case with 3.5 and 4o; I don't know how modern ones behave. That made me decide not to include any system message by default. Still, I give the option to add one if you need it. You can even toggle them to mix and match.
I guess chain of thought should in theory do that, but variations in prompt and context might behave differently?
We've been working on a lot of data processing and generation tasks, primarily via an API. But sometimes I end up testing data creation in a chat window: I first chat through the requirements for the data analysis/processing, and once I'm done I'd like the whole conversation summarised into basically a one-prompt process so that I can re-use it (because I can't really process new inputs via the chat).
Even when you do manage to get it down to a single prompt you can use in a chat, you can then ask the chat to just keep producing new data (imagine a blog post in a certain style, where the base content is given as input and I'm making, like, 20 of them). Producing these in the chat has notable benefits: if something is wrong with the blog post the chat suggests, you can immediately edit it. The trouble is that the context window becomes so big that the chat starts to forget what the original instruction was, and eventually you have to just create a new chat.
One way to solve this is a chat with selective memory: keep the task in memory, but have the chat forget/not include all the generated data so the context stays clean, only bringing it back in if the user refers to it.
Has anyone else done data processing types of tasks in chats and had issues like this? Are there some other tools to use or tricks to do in chats?
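One way this could look in practice is asking the model itself to distill the conversation into a standalone prompt, then reusing that prompt with fresh inputs instead of continuing the long chat. A sketch under those assumptions (instruction wording and model names are invented):

```python
from openai import OpenAI

client = OpenAI()

def distill_to_prompt(chat_history):
    # Ask the model to turn the whole negotiation into one reusable prompt.
    messages = chat_history + [{
        "role": "user",
        "content": ("Summarise everything we agreed on into a single reusable "
                    "prompt that takes raw input data and produces output in "
                    "the agreed format. Return only the prompt."),
    }]
    resp = client.chat.completions.create(model="gpt-4.1", messages=messages)
    return resp.choices[0].message.content

def run_task(task_prompt, new_input):
    # Fresh, short context each time: the distilled prompt plus only the new data,
    # so generated outputs never pile up in the context window.
    resp = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "system", "content": task_prompt},
                  {"role": "user", "content": new_input}],
    )
    return resp.choices[0].message.content
```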
One of the biggest developments in language models over the last year has been test-time reasoning (aka inference scaling or “thinking”). Most vendors tested offer such a model. It’s plausible it could make a huge difference here, and they did not bother to test it or even mention it?
Things like CoT and planning can really affect this, and those are just a couple of things that happen automatically in more advanced models.
Seems like it wouldn’t have been hard to add this to the experiment, but they could’ve called it out in a “Limitations” or “Future Work” section. Or at least a single sentence like “We did not test chain-of-thought prompting, which may mitigate some of these issues”.
Benjammer•1mo ago
MattGaiser•1mo ago
neom•1mo ago
morsecodist•1mo ago
AstroBen•1mo ago
distances•1mo ago
somenameforme•1mo ago
In many ways this issue could make the Chinese Room thought experiment even more compelling. Because it's a very practical and inescapable issue.
[1] - https://en.wikipedia.org/wiki/Chinese_room
keiferski•1mo ago
jampekka•1mo ago
This is mentioned in the Wikipedia page too: "Although its proponents originally presented the argument in reaction to statements of artificial intelligence (AI) researchers, it is not an argument against the goals of mainstream AI research because it does not show a limit in the amount of intelligent behavior a machine can display."
OtherShrezzing•1mo ago
A nice middle ground I'm finding is to give Claude an initial conversation starter in its "thinking" mode, and then copy/paste that conversation into LM Studio and have a weaker model like Gemma pick up from where Claude left off.
shrewduser•1mo ago
Helmut10001•1mo ago
At the end of the two weeks, I observed that the LLM was much less likely to become distracted. Sometimes I would dump whole forum threads or SO posts into it, and it would say "this is not what we are seeing here, because of [earlier context or finding]". I eliminated all dead ends logically and informed it of this (yes, it can help with the reflection, but I had to make the decisions). In the end, I found the cause of my issues.
This somewhat confirms what a user here on HN said a few days ago: LLMs are good at compressing complex information into simple information, but not at expanding simple ideas into complex ones. As long as my input was larger than the output (in either complexity or length), I was happy with the results.
I could have done this without the LLM. However, it was helpful in that it stored facts from the outset that I had either forgotten or been unable to retrieve quickly in new contexts. It also made it easier to identify time patterns in large log files, which helped me debug my site-to-site connection. I also optimized many other settings along the way, resolving not only the most problematic issue. This meant, in addition to fixing my problem, I learned quite a bit. The 'state' was only occasionally incorrect about my current parameter settings, but this was always easy to correct. This confirms what others already saw: If you know where you are going and treat it as a tool, it is helpful. However, don't try to offload decisions or let it direct you in the wrong direction.
Overall, 350k Tokens used (about 300k words). Here's a related blog post [1] with my overall path, but not directly corresponding to this specific issue. (please don't recommend wireguard; I am aware of it)
Benjammer•1mo ago
I totally agree that LLMs are great at compressing information; I've set up the docs feature in Cursor to index several entire large documentation websites for major libraries and it's able to distill relevant information very quickly.
sixtyj•1mo ago
Sometimes it is good to start a new chat or switch to Claude.
And it really helps to be very precise in wording the specification of what you want to achieve, or to repeat it sometimes with some added request lines.
GIGO in reality :)
johnisgood•1mo ago
diggan•1mo ago
johnisgood•1mo ago
Yeah, they do come across as "overly eager junior devs", good comparison. :D
diggan•1mo ago
Personally I think it's a lot better via the API than ChatGPT. ChatGPT doesn't let you edit the "system prompt" which is really where you wanna put "how to" instructions, so it really follows them. Instructions put in the user message aren't followed as closely as when you use the system prompt, so probably why it still did something, if you were using ChatGPT.
sixtyj•1mo ago
I am giving up on providing code, and on checking whether it is working, because it is very time-consuming. Tell me when it starts working. Good luck.
:)
johnisgood•1mo ago
tough•1mo ago
olalonde•1mo ago
https://g.co/gemini/share/7edf8fa373fe
Helmut10001•1mo ago
skydhash•1mo ago
I'm not saying that your approach is wrong. But most LLM workflows are either brute-forcing the solution, or seeking a local minimum to be stuck in. It's like doing thousands of experiments of objects falling to figure out gravity while there's a physics textbook nearby.
[0]: https://datatracker.ietf.org/doc/html/rfc1661
olalonde•1mo ago
That said, I’m building a product - not a PPP driver - so the quicker I can fix the problem and move on, the better.
[0] https://datatracker.ietf.org/doc/html/rfc1331
wrasee•1mo ago
There’s no way I could fully read that RFC in an hour. And that’s before you even know what reading to focus your attention on, so you’re just being a worse LLM at that point.
Retric•1mo ago
cgriswald•1mo ago
In any case, LLMs are not magical forgetfulness machines.
You can use a calculator to avoid learning arithmetic but using a calculator doesn’t necessitate failing to learn arithmetic.
You can ask a question of a professor or fellow student, but failing to read the textbook to answer that question doesn’t necessitate failing to develop a mental model or incorporate the answer into an existing one.
You can ask an LLM a question and blindly use its answer but using an LLM doesn’t necessitate failing to learn.
Retric•1mo ago
However, even outside of using an LLM, the temptation is always to keep the blinders on, do a deep dive for a very specific bug, and repeat as needed. It's the local minimum of effort, and very slowly you do improve as those deep dives occasionally come up again; but what keeps it from being a global minimum is that these systems aren't suddenly going away. It's not a friend's espresso machine; it's now sitting in your metaphorical kitchen.
As soon as you've dealt with, say, a CSS bug, the odds of seeing another in the future are dramatically higher. Thus, optimizing around diminishing returns means spending a few hours learning the basics of any system or protocol you encounter is just a useful strategy. If you spend 1% of your time on a strategy that makes you 2% more efficient, that's a net win.
skydhash•1mo ago
But that's not learning or even problem solving. It's just a time-saving trick, and one that's not reliable.
And the fact is that there's a lot of information about pretty much anything. But I see people trying to skip the foundation (not glamorous enough, maybe) and go straight for the complicated stuff. And LLMs are good at providing the illusion that this can be the right workflow.
Retric•1mo ago
Well said. You can only spend years digging into the intricacies a handful of systems in your lifetime, but there’s still real rewards from a few hours here and there.
Macuyiko•1mo ago
I do agree with you that an LLM should not always start from scratch.
In a way it is like an animal which we have given the ultimate human instinct.
What has nature given us? Homo erectus was 2 million years ago.
A weird world we live in.
What is context?
wrasee•1mo ago
I'd argue that's a more effective capture of what I would remember anyway.
If wanted to learn more (in a general sense) I can take the manual away with me and study it, which I can do more effectively on its own terms, in a comfy chair with a beer. But right now I have a problem to solve.
Retric•1mo ago
I.e., LLM then RFC takes more time than RFC then solving the issue.
wrasee•1mo ago
Because you should have read RFC 1331.
Even then, your argument assumes that optimising for total time (including your own learning time) is the goal, and not solving the business case as a priority (your actual problem). That assumption may not hold when you have a patch to submit. What you solve at what point in time is the general question; there's no single optimum.
Retric•1mo ago
Having a less skilled worker is a tradeoff for getting one very specific task accomplished sooner; that might be worth it, especially if you plan to quit soon, but it's hardly guaranteed.
wrasee•1mo ago
Whereas it's been all morning and you're still reading the RFC, and it's the wrong RFC anyway.
I know who I'd hire.
Retric•1mo ago
This time it worked, but I've been forced to fire people with this kind of attitude before.
halfadot•1mo ago
Retric•1mo ago
> LLMs allowing optimal time use in certain case
I never said it was slower; I'm asking what the tradeoff is. I've had this same basic conversation with multiple people, and after that failed the only real option was to remove them. E.g.: if you don't quite understand why what you wrote seemingly fixes a bug, don't commit it yet; "seems to work" isn't a solution.
Could be I’m not explaining very well, but ehh fuck em.
tralarpa•1mo ago
I have really learned to mistrust and double-check every single line those systems produce. Same for writing code. Everything they produce looks nice and reasonable on the surface, but when you dig deeper it falls apart, unless it's something very, very basic.
foobarian•1mo ago
tough•1mo ago
daveguy•1mo ago
unshavedyak•1mo ago
Most history would remain; it wouldn't try to summarize exactly, just prune and organize the history relative to the conversation path?
nosefurhairdo•1mo ago
dep_b•1mo ago
Benjammer•1mo ago
Each time you press enter, you are spinning up a new instance of the LLM and passing in the entire previous chat text plus your new message, and asking it to predict the next tokens. It does this iteratively until the model produces a <stop> token, and then it returns the text to you and the PRODUCT parses it back into separate chat messages and displays it in your UI.
What you are asking the PRODUCT to now do is to edit your and its chat messages in the history of the chat, and then send that as the new history with your latest message. This is the only way to clean the context because the context is nothing more than your messages and its previous responses, plus anything that tools have pulled in. I think it would be sort of a weird feature to add to a chat bot to have the chat bot, each time you send a new message, go back through the entire history of your chat and just start editing the messages to prune out details. You would scroll up and see a different conversation, it would be confusing.
IMO, this is just part of prompt engineering skills to keep your context clean or know how to "clean" it by branching/summarizing conversations.
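A bare-bones version of what that product layer does on every turn, to make the statelessness concrete; the model name and the prune helper are illustrative:

```python
# Each turn: append the user message, resend the ENTIRE history, append the reply.
# "Cleaning" the context is nothing more exotic than editing this list.
from openai import OpenAI

client = OpenAI()
history = [{"role": "system", "content": "You are a helpful assistant."}]

def send(user_text):
    history.append({"role": "user", "content": user_text})
    resp = client.chat.completions.create(model="gpt-4.1", messages=history)
    reply = resp.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

def prune(indexes_to_drop):
    # Context cleanup: remove dead-end exchanges so they stop steering the model.
    for i in sorted(indexes_to_drop, reverse=True):
        del history[i]
```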
rrr_oh_man•1mo ago
olalonde•1mo ago
The prompt it uses: https://www.reddit.com/r/ClaudeAI/comments/1jr52qj/here_is_c...
kqr•1mo ago
QuadmasterXLII•1mo ago
ithkuil•1mo ago
One could argue that the attention mechanism in transformers is already designed to do that.
But you need to train it more specifically with that in mind if you want it to be better at damping attention to parts that are deemed irrelevant by the subsequent evolution of the conversation.
And that requires the black art of ML training.
While thinking of doing this as a hack on top of the chat product feels more like engineering and we're more familiar with that as a field.
hobofan•1mo ago
CompoundEyes•1mo ago
mh-•1mo ago
HaZeust•1mo ago
energy123•1mo ago
voidspark•1mo ago
crooked-v•1mo ago
voidspark•1mo ago
drittich•1mo ago
gdudeman•1mo ago
It exists in Claude as a true branch (you can see the old threads) and in ChatGPT, but without the history.
Edit a previous reply and hit “go” to see it in action.
layer8•1mo ago
b800h•1mo ago
cruffle_duffle•1mo ago
wunderwuzzi23•1mo ago
gdudeman•1mo ago
djmips•1mo ago
kfarr•1mo ago
TheOtherHobbes•1mo ago
How often in meetings does everyone maintain a running context of the entire conversation, instead of responding to the last thing that was said with a comment that has an outstanding chance of being forgotten as soon as the next person starts speaking?
djmips•1mo ago
CobrastanJorji•1mo ago
ezst•1mo ago
CobrastanJorji•1mo ago
9dev•1mo ago
M4v3R•1mo ago
Do you have any source on this? System prompts get leaked/extracted all the time so imagine someone would notice this
Edit: just realized you’re talking about the Grok bot, not Grok the LLM available on X or grok.com. With the bot it’s probably harder to extract its exact instructions since it only replies via tweets. For reference here’s the current Grok the LLM system prompt: https://github.com/asgeirtj/system_prompts_leaks/blob/main/g...
dragonwriter•1mo ago
Well, someone did something to it; whether it was training, feature boosting the way Golden Gate Claude [0] was done, adjusting the system prompt, or ensuring that its internet search for contextual information would always return material about that, or some combination of those, is neither obvious nor, if someone had a conjecture as to which one or combination it was, easily falsifiable/verifiable.
[0] https://www.anthropic.com/news/golden-gate-claude
lolinder•1mo ago
Also, let's be honest: in a Musk company they're going to have taken the shortest possible route to accomplishing what he wanted them to.
[0] https://www.cnn.com/2025/05/14/business/grok-ai-chatbot-repl...
CobrastanJorji•1mo ago
stevedonovan•1mo ago
Context poisoning is not a uniquely LLM problem
lenkite•1mo ago
As merely 3 of over a dozen examples:
https://x.com/DefiantLs/status/1922213073957327219
https://x.com/PPC4Liberty/status/1922650016579018855
https://x.com/News24/status/1920909178236776755
micromacrofoot•1mo ago
b800h•1mo ago
anonexpat•1mo ago
granra•1mo ago
stuffoverflow•1mo ago
Garlef•1mo ago
Imagine trying to find a specific output/input that was good in the conversation tree.
layer8•1mo ago
giordanol•1mo ago
a_e_k•1mo ago
m4houk•1mo ago
You can highlight some text in a chat and fork the chat to talk about that text selection, so the LLM has context of that along with the previous chat history and it responds in a new chat (entire chat history up to that point from the parent chat gets copied over - basically inspired by the Unix `fork`).
Your text selection from the parent chat would get turned into a hyperlink to the new child chat so you can always get to it again if you're reading the parent chat.
bambax•1mo ago
But it would indeed be nice to either disable answers (without deleting them) or fork a conversation. It wouldn't be hard to implement; I wonder if there's a market for just this?
lewdwig•1mo ago
The fundamental issue is that LLMs do not currently have real long term memory, and until they do, this is about the best we can do.
actualwitch•1mo ago
https://github.com/actualwitch/experiment
therockhead•1mo ago
veunes•1mo ago
amelius•1mo ago
Adambuilds•1mo ago
freehorse•1mo ago
I have made zed one of my main llm chat interfaces even for non-programming tasks, because being able to do that is great.
jimmySixDOF•1mo ago
This seems to be in flux now due to RL training on multi-turn eval datasets, so while the context window is evergreen every time, there will be some bias towards interpreting each prompt as part of a longer conversation. Multi-turn post-training is not scaled out yet in public, but I think it may be the way to stay on the 'double time spent on goal every 7 months' curve.
bentt•1mo ago
This feels like something that can be fixed with manual instructions which prompt the model to summarize and forget. This might even map appropriately to human psychology. Working Memory vs Narrative/Episodic Memory.
pseudocomposer•1mo ago
If you delete the last message from the LLM (so now, you sent the last message), it would then generate a new response. (This would be particularly useful with high-temperature/more “randomly” configured LLMs.)
If you delete any other message, it just updates the LLM context for any future responses it sends (the real problem at hand, context cleanup).
I think seeing it work this way would also really help end users who think LLMs are “intelligent” to better understand that it’s just a big, complex autocomplete (and that’s still very useful).
Maybe this is standard already, or used in some LLM UI? If not, consider this comment as putting it in the public domain.
Now that I’m thinking about it, it seems like it might be practical to use “sub-contextual LLMs” to manage the context of your main LLM chat. Basically, if an LLM response in your chat/context is very long, you could ask the “sub-contextual LLM” to shorten/summarize that response, thus trimming down/cleaning the context for your overall conversation. (Also, more simply, an “edit message” button could do the same, just with you, the human, editing the context instead of an LLM…)
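A sketch of that "sub-contextual LLM" idea, where a second throwaway call summarises overly long assistant messages before the next turn; the threshold, model names, and wording are assumptions:

```python
from openai import OpenAI

client = OpenAI()

def summarize(text):
    # Throwaway call: condense one long response, keeping any decisions made.
    resp = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[{"role": "user",
                   "content": "Summarise in under 50 words, keeping any decisions:\n" + text}],
    )
    return resp.choices[0].message.content

def compact(history, max_chars=2000):
    # Replace long assistant messages with short summaries to trim the context.
    out = []
    for msg in history:
        if msg["role"] == "assistant" and len(msg["content"]) > max_chars:
            out.append({"role": "assistant",
                        "content": "[summarised] " + summarize(msg["content"])})
        else:
            out.append(msg)
    return out
```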
dr_dshiv•1mo ago
dr_dshiv•1mo ago
diggan•1mo ago
forgotTheLast•1mo ago
cruffle_duffle•1mo ago
They really need to make that edit feature much more prominent. It is such an important way to interact with the model.
yaur•1mo ago
aleksituk•1mo ago
bredren•1mo ago
https://github.com/banagale/FileKitty
When getting software development assistance, relying on LLM products to search code bases etc leaves too much room for error. Throw in what amounts to lossy compression of that context to save the service provider on token costs and the LLM is serving watered down results.
Getting the specific context right up front and updating that context as the conversation unfolds leads to superior results.
Even then, you do need to mind the length of conversations. I have a prompt designed to capture conversational context, and transfer it into a new session. It identifies files that should be included in the new initial prompt, etc.
For a bit more discussion on this, see this thread and its ancestry: https://news.ycombinator.com/item?id=43711216
QuantumGood•1mo ago
oaeirjtlj•1mo ago
Macuyiko•1mo ago
> "Good work so far, now I want to take it to another step (somewhat related but feeling it too hard): <short description>. Do you think we can do it in this conversation or is it better to start fresh? If so, prepare an initial prompt for your next fresh instantiation."
Sometimes the model says that it might be better to start fresh, and prepares a good summary prompt (including a final 'see you later'), whereas in other cases it assures me it can continue.
I have a lot of notebooks with "initial prompts to explore forward". But given the sycophancy going on as well as one-step RL (sigh) post-training [1], it indeed seems AI platforms would like to keep the conversation going.
[1] RL in post-training has little to do with real RL and just uses one shot preference mechanisms with an RL inspired training loop. There is very little work in terms of long-term preferences slash conversations, as that would increase requirements exponentially.
senordevnyc•1mo ago