I have experienced that in person many, many times. Jumps in context that seem easy for one person to follow, but very hard for others.
So, assuming the paper is legit (arXiv, you never know...), it's more like something that could be improved than a difference from human beings.
I don't see how that's a problem.
Subjectivity is part of human communication.
Seeing human interactions as computer-like is a side effect of our most recent shiny toy. In the last century, people saw everything as gears and pulleys. All of these perspectives are essentially the same reductionist thinking, recycled over and over again.
We've seen men promising that they would build a gear-man, resurrect the dead with electricity, and all sorts of (now) crazy talk. People believed it for some time.
This, of course, has certain implications as to the wisdom of the idea of “replacing human programmers”, given that one of the hard parts of the trade is trying to turn vague and often confused ideas into precise specifications by interacting with the stakeholders.
IMO the One Weird Trick for LLMs is recognizing that there's no real entity, and that users are being tricked into a suspended-disbelief story.
In most cases you're contributing text-lines for a User-character in a movie-script document, and the LLM algorithm is periodically triggered to autocomplete incomplete lines for a Chatbot character.
You can have an interview with a vampire DraculaBot, but that character can only "self-reflect" in the same shallow/fictional way that it can "thirst for blood" or "turn into a cloud of bats."
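To make that framing concrete, here is a minimal sketch of the "movie-script" view: the chat is just a growing transcript document, and the model is repeatedly asked to continue the Chatbot character's next line. (`complete()` is a stand-in for any raw text-completion call; this is illustrative, not any vendor's actual implementation.)

    # The chat UI is just a script; each turn asks the model to autocomplete
    # the Chatbot character's unfinished line.

    def complete(prompt: str) -> str:
        """Placeholder for a raw text-completion API call (hypothetical)."""
        return "Good evening. You have entered my castle..."

    transcript = "User: Hello, who am I talking to?\nChatbot:"

    # One turn: the model fills in the Chatbot line, then the script grows.
    reply = complete(transcript)
    transcript += f" {reply}\nUser:"
    print(transcript)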
This leads us to new questions: How can we characterize and identify real-world documents which fit? How can we determine what features may be significant, and which of those can be easily transplanted to our use-case?
I operate LLMs in many conversational modes where they do ask clarifying questions, probing questions, and baseline-determining questions.
It takes at most one sentence in the prompt to get them to act this way.
What is this one sentence you are using?
I am struggling to elicit clarification behavior from LLMs.
So you can teach a model to sometimes ask for clarification, but will it actually have insight into when it really needs it, or will it just interject for clarification more or less at random? These models have really awful insight into their own capabilities; ChatGPT, for example, insists to me that it can read braille, and then cheerfully generates a pure hallucination.
I can trivially get any of the foundational models to ask me clarifying questions. I've never had one respond with 'I don't know'.
Gemini 2.5 Pro and ChatGPT-o3 have often asked me to provide additional details before doing a requested task. Gemini sometimes comes up with multiple options and requests my input before doing the task.
It also mainly happens when the context is clear that we are collaborating on work that will require multiple iterations of review and feedback, like drafting chapters of a handbook.
I have seen ChatGPT ask questions immediately upfront when it relates to medical issues.
They're great at both tasks; you just have to ask them to do it.
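For what it's worth, a single system-prompt sentence along these lines is usually enough. The wording below is my own guess, not any commenter's exact prompt, and it assumes the OpenAI Python SDK with an API key configured:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    messages = [
        {"role": "system",
         "content": "Before answering, ask me clarifying questions whenever my "
                    "request is ambiguous or is missing details you need."},
        {"role": "user", "content": "Write a migration script for my database."},
    ]

    resp = client.chat.completions.create(model="gpt-4o", messages=messages)
    print(resp.choices[0].message.content)  # typically a list of questions first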
That is, does it actually know when it doesn't know, or are you just making it less confident overall, so it asks questions with no actual insight? Convincing a model to roleplay as someone who doesn't know things vs teaching a model to have insight into when it does and doesn't need clarification seems like a tough one.
It dynamically swaps portions of the context in and out. This system is also not based on explicit definitions; it relies on the LLM 'filling the gaps'. The system helps the LLM break problems down into small tasks, which then eventually aggregate into the full task.
May I suggest - put what you have out there in the world, even if it’s barely more than a couple of prompts. If people see it and improve on it, and it’s a good idea, it’ll get picked up & worked on by others - might even take on a life of its own!
https://x.com/zacksiri/status/1922500206127349958
You can see it going from an introduction, to asking me for my name, and then being able to answer questions about some topic. There is also another example in the thread that you can see.
Behind the scenes, the system prompt is being modified dynamically based on the user's request.
All the information about movies is also being loaded into context dynamically. I'm also working on a technique to unload stuff from context when the subject matter of a given thread has changed dramatically. Imagine having a long thread of conversation with your friend: along the way you 'context switch' multiple times as time progresses, and you probably don't even remember what you said to your friend 4 years ago.
There is a concept of 'main thread' and 'sub threads' involved as well that I'm exploring.
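To illustrate the general idea (this is a rough guess at the mechanics, not the actual code base): topic-specific blocks get loaded into the system prompt on demand and dropped again when the subject changes.

    # Hypothetical sketch of dynamic context swapping: load topic blocks into
    # the system prompt on demand, unload them when the subject changes.

    CONTEXT_BLOCKS = {
        "movies": "You have access to the following movie catalog: ...",
        "billing": "Current billing policies: ...",
    }

    def build_system_prompt(active_topics: set[str]) -> str:
        base = "You are a helpful assistant."
        loaded = [CONTEXT_BLOCKS[t] for t in sorted(active_topics) if t in CONTEXT_BLOCKS]
        return "\n\n".join([base, *loaded])

    active: set[str] = set()
    active.add("movies")      # the user starts asking about movies
    print(build_system_prompt(active))

    active.discard("movies")  # subject changed dramatically: unload stale context
    active.add("billing")
    print(build_system_prompt(active))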
I will be releasing the code base in the coming months. I need to take this demo further than just a few prompt replies.
Stuff like this:
1. Do: Best practice for X model is to include at most 10k lines of code + task + CONVENTIONS.md + architecture guidance. Only queue tasks for components that are fairly decoupled from the rest of the codebase (e.g. small modules).
2. Don't: Start a project without a clearly defined architecture in this format. Don't ask for tasks that require X amount of reading hops to understand the logic.
I find it frustrating that companies release their benchmaxxing without helping developers actually use their models. It's all the more ironic that some people think of these AIs as employees. Employees can work with their boss on the best way to achieve things! With LLMs you don't even know how to communicate with them, and as a result their output is unreliable.
I had 20-something files I wanted it to check and change something in. The first 5 or so it did; then for the sixth it rightly said everything was correct and moved on. It said that for the rest of the 20, the same text over and over.
I checked, and file 6 was the only correct one. It, like, learned to just repeat itself after that and did nothing.
+30pp uplift when using GPT-3.5-turbo on a mix of 300 tasks.
Free, open framework; check the repo and try it yourself:
https://github.com/AutomationOptimization/tsce_demo
I tested this another 300 times with GPT-4.1 to remove those obtrusive "em-dashes" everyone hates. I tested a single-pass baseline vs. TSCE, with the same exact instructions and prompt: "Remove the em-dashes from my linkedin post. . .".
Out of the 300 tests, baseline failed to remove the em-dashes 149/300 times. TSCE failed to remove the em-dashes 18/300 times.
It works; all the data, as well as the entire script used for testing, is in the repo.
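If you just want the shape of the experiment without cloning the repo, a rough approximation looks like the following. This is not the tsce_demo harness itself, just a sketch of a single-pass baseline vs. a two-pass variant where a first planning call is fed into the second call:

    from openai import OpenAI

    client = OpenAI()
    TASK = "Remove the em-dashes from my LinkedIn post: ..."  # post text elided

    def single_pass(task: str) -> str:
        r = client.chat.completions.create(
            model="gpt-4.1", messages=[{"role": "user", "content": task}])
        return r.choices[0].message.content

    def two_pass(task: str) -> str:
        # First pass: produce working notes; second pass: answer with them pinned.
        notes = client.chat.completions.create(
            model="gpt-4.1",
            messages=[{"role": "user",
                       "content": f"Think through how to do this task precisely:\n{task}"}],
        ).choices[0].message.content
        r = client.chat.completions.create(
            model="gpt-4.1",
            messages=[{"role": "system", "content": f"Working notes:\n{notes}"},
                      {"role": "user", "content": task}])
        return r.choices[0].message.content

    # Score each variant by whether any em-dash survives in the output.
    print("baseline ok:", "—" not in single_pass(TASK))
    print("two-pass ok:", "—" not in two_pass(TASK))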
[1] http://ui.adsabs.harvard.edu/abs/2023arXiv230313988H/abstrac...
My conclusion was that context needs to be managed well for the LLMs to maintain accuracy in their replies. Also, it helps to have a planning process ("graph reasoning") before task execution, because it guardrails the model's thought process.
This also raises the question of general-use vs. workflow agent implementations, since in the former it is much more difficult to generalize all the components needed to structure effective ReAct patterns.
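As a concrete (and much simplified) illustration of the planning-before-execution point, with hypothetical prompts and a generic chat model:

    from openai import OpenAI

    client = OpenAI()

    def ask(messages):
        r = client.chat.completions.create(model="gpt-4o", messages=messages)
        return r.choices[0].message.content

    goal = "Summarize the error patterns in this log file and propose fixes."

    # Step 1: produce an explicit plan before any execution.
    plan = ask([{"role": "user",
                 "content": f"Produce a short numbered plan for: {goal}"}])

    # Step 2: execute with the plan pinned in context, which guardrails the
    # model's reasoning so it doesn't wander off the agreed structure.
    result = ask([
        {"role": "system", "content": f"Follow this plan exactly:\n{plan}"},
        {"role": "user", "content": goal},
    ])
    print(result)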
You can edit responses, sure, but then a bunch of other context is lost.
My flow is basically:
1. plan
2. build
3. branch (into some feature/esoteric dependency issue)
4. goto #2
Prompt pruning/branching should be a first-class tool for any LLM usage.
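At the message-list level, branching and pruning are cheap. The sketch below uses made-up data and is not any product's API, but it shows what a first-class version of steps 2-4 could look like:

    import copy

    history = [
        {"role": "user", "content": "Plan the feature."},
        {"role": "assistant", "content": "Here is the plan..."},
        {"role": "user", "content": "Build step 1."},
        {"role": "assistant", "content": "Step 1 done."},
    ]

    # Branch: explore an esoteric dependency issue without polluting the main line.
    branch = copy.deepcopy(history)
    branch.append({"role": "user", "content": "Debug the weird linker error."})

    # Prune: once the detour is resolved, keep only its conclusion on the main line.
    history.append({"role": "user",
                    "content": "Note: the linker error was a stale cache; ignore that detour."})

    print(len(history), len(branch))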
somenameforme•25m ago
In many ways this issue could make the Chinese Room thought experiment even more compelling, because it's a very practical and inescapable issue.
[1] - https://en.wikipedia.org/wiki/Chinese_room
Helmut10001•3h ago
At the end of the two weeks, I observed that the LLM was much less likely to become distracted. Sometimes, I would dump whole forum threads or SO posts into it, and it would say "this is not what we are seeing here, because of [earlier context or finding]". I eliminated all dead ends logically and informed it of this (yes, it can help with the reflection, but I had to make the decisions). In the end, I found the cause of my issues.
This somewhat confirms what some user here on HN said a few days ago: LLMs are good at compressing complex information into simpler form, but not at expanding simple ideas into complex ones. As long as my input was larger than the output (in either complexity or length), I was happy with the results.
I could have done this without the LLM. However, it was helpful in that it stored facts from the outset that I had either forgotten or been unable to retrieve quickly in new contexts. It also made it easier to identify time patterns in large log files, which helped me debug my site-to-site connection. I also optimized many other settings along the way, resolving more than just the most problematic issue. This meant that, in addition to fixing my problem, I learned quite a bit. The 'state' was only occasionally incorrect about my current parameter settings, but this was always easy to correct. This confirms what others have already observed: if you know where you are going and treat it as a tool, it is helpful. However, don't try to offload decisions to it or let it direct you in the wrong direction.
Overall, about 350k tokens were used (roughly 300k words). Here's a related blog post [1] with my overall path, though it doesn't directly correspond to this specific issue. (Please don't recommend WireGuard; I am aware of it.)
Benjammer•2h ago
I totally agree that LLMs are great at compressing information; I've set up the docs feature in Cursor to index several entire large documentation websites for major libraries and it's able to distill relevant information very quickly.
olalonde•2h ago
https://g.co/gemini/share/7edf8fa373fe
skydhash•1h ago
I’m not saying that your approach is wrong. But most LLM workflows are either brute-forcing the solution or seeking a local minimum to get stuck in. It’s like doing thousands of experiments of objects falling to figure out gravity while there’s a physics textbook nearby.
[0]: https://datatracker.ietf.org/doc/html/rfc1661
olalonde•1h ago
That said, I’m building a product - not a PPP driver - so the quicker I can fix the problem and move on, the better.
[0] https://datatracker.ietf.org/doc/html/rfc1331
unshavedyak•3h ago
Most history would remain; it wouldn’t try to summarize exactly, just prune and organize the history relative to the conversation path?
Benjammer•2h ago
Each time you press enter, you are spinning up a new instance of the LLM and passing in the entire previous chat text plus your new message, and asking it to predict the next tokens. It does this iteratively until the model produces a <stop> token, and then it returns the text to you and the PRODUCT parses it back into separate chat messages and displays it in your UI.
What you are asking the PRODUCT to do now is to edit your and its chat messages in the history of the chat, and then send that as the new history along with your latest message. This is the only way to clean the context, because the context is nothing more than your messages and its previous responses, plus anything that tools have pulled in. I think it would be a sort of weird feature to have the chatbot, each time you send a new message, go back through the entire history of your chat and just start editing the messages to prune out details. You would scroll up and see a different conversation; it would be confusing.
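Put differently, the "context" is just a list the client re-sends in full on every turn, so cleaning it is nothing more than editing that list before the next call. A bare-bones sketch (assuming the OpenAI SDK; the filtering rule at the end is obviously made up):

    from openai import OpenAI

    client = OpenAI()
    history = [{"role": "system", "content": "You are a helpful assistant."}]

    def send(user_text: str) -> str:
        history.append({"role": "user", "content": user_text})
        # The entire history goes over the wire every single time.
        r = client.chat.completions.create(model="gpt-4o", messages=history)
        reply = r.choices[0].message.content
        history.append({"role": "assistant", "content": reply})
        return reply

    send("Let's debug my Dockerfile.")
    send("Actually, forget that. Help me write release notes.")

    # "Pruning" the context is just dropping messages before the next call;
    # the model never sees anything except what is in `history` right now.
    history[:] = [m for m in history if "Dockerfile" not in m["content"]]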
IMO, this is just part of prompt engineering skills to keep your context clean or know how to "clean" it by branching/summarizing conversations.
olalonde•2h ago
The prompt it uses: https://www.reddit.com/r/ClaudeAI/comments/1jr52qj/here_is_c...
gdudeman•51m ago
It exists in Claude as a true branch (you can see the old threads) and in ChatGPT as a branch without the history.
Edit a previous reply and hit “go” to see it in action.