I think that, in the early days of internet search, entering full questions actually produced worse results than just a bunch of keywords or short phrases.
So it was a sign of a "noob", rather than a mark of sophistication and literacy.
Those literate sophisticates would still be noobs at getting something useful from Google.
I.e. by demanding the model to be concise, you're literally making it dumber.
(Separating out "chain of thought" into "thinking mode" and removing user control over it definitely helped with this problem.)
But does talk like caveman make number go down? Less token = less think?
I also wondered, due to the way LLMs work, if I ask AI a question using fancy language, does that make it pattern match to scientific literature, and therefore increase the probability that the output will be true?
> Someone didn't get the memo that for LLMs, tokens are units of thinking.
Where do you get this memo ? Seems completely wrong to me. More computation does not translate to more "thinking" if you compute the wrong things (ie things that contribute significantly to the final sentence meaning).e.g. instead of: "The square root of 256 is" you'd enter "errr The er square um root errr of 256 errr is" and it would miraculously get better? The model can't differentiate between words you entered and words it generated its self...
Tokens are how an LLM works things out, but I think it's just as likely as not that LLMs (like people) are capable of overthinking things to the point of coming to a wrong answer when their "gut" response would have been better. I do not content that this is the default mode, but that it is both possible, and that it's more or less likely on one kind of problem than another, problem categories to be determined.
A specific example of this was the era of chat interfaces that leaned too far in the direction of web search when responding to user queries. No, claude, I don't want a recipe blogspam link or summary - just listen to your heart and tell me how to mix pancakes.
More abstractly: LLMs give the running context window a lot of credit, and will work hard to post-hoc rationalize whatever is in there, including any prior low-likelihood tokens. I expect many problematic 'hallucinations' are the result of an unlucky run of two or more low probability tokens running together, and the likelihood of that happening in a given response scales ~linearly with the length of response.
Additionally, LLMs do not actually operate in text; much of the thinking happens in a much higher dimensional space that just happens to be decoded as text.
So unless the LLM was trained otherwise, making it talk like a caveman is more than just theoretically turning it into a caveman.
What do you mean by that? It’s literally text prediction, isn’t it?
So the conclusion was that these middle layers have their own language and it's converting the text into this language and this decoding it. It explains why sometime the models switch to chinese when they have a lot of chinese language inputs, etc.
You are also confusing ‘mechanistic explanation still incomplete’ with ‘empirical phenomenon unestablished.’ Those are not the same thing.
PS. Em dash? So you are some LLM bot trying to bait mine HN for reasoning traces? :D
you are discovering that the favorite luddite argument is bullshit
> just look at research papers
You didn't add anything other than vibes either.
https://machinelearning.apple.com/research/illusion-of-think...
I have a list of numbers, 0 to9, and the + , = operators. I will train my model on this dataset, except the model won’t get the list, they will get a bunch of addition problems. A lot. But every addition problem possible inside that space will not be represented, not by a long shot, and neither will every number. but still, the model will be able to solve any math problem you can form with those symbols.
It’s just predicting symbols, but to do so it had to internalize the concepts.
For example thinking in modern US English generates many thoughts, to keep correct speak at right cultural context (there is only one correct way to say People Of Color, and it changes every year, any typo makes it horribly wrong).
Some languages are far more expressive and specialized in logical conditions, conditionals, recursion and reasoning. Like eskimos have 100 words for snow, but for boolean algebra.
It is well proven that thinking in Chinese needs far less tokens!
With this caveman mod you strip out most of cultural complexities of anglosphere, make it easier for foreigners and far simpler to digest.
This is simply not true.
Programming languages are not languages in the human brain nor the culture sense.
It is very arrogant to assume, no other language can be more advanced than English.
This is not how the feature called "reasoning" work in current models.
"reasoning" simply let's the model output and then consume some "thinking" tokens before generating the actual output.
All the "fluff" tokens in the output have absolutely nothing to do with "reasoning".
Benchmark or nothing.
LLMs do stumble into long prediction chains that don’t lead the inference in any useful direction, wasting tokens and compute.
> cutting ~75% of tokens while keeping full technical accuracy.
I have no clue if this claim holds, but alas, just pretending they did not address the obvious criticism, while they did, is at the very least pretty lazy.
An explanation that explains nothing is not very interesting.
You can read the skill. They didn't do anything to mitigate the issue, so the criticism is valid.
Nobody has to proof anything. It can give your claim credibility. If you don't provide any, an opposing claim without proof does not get any better.
“I don’t need to provide proof to say things” is a valueless, trivial assertion that adds no value whatsoever to any discussion anyone has ever had.
If you want to pretend this is a claim that should be taken seriously, a lack of evidence is damning. If you just want to pass the metaphorical bong and say stupid shit to each other with no judgment and no expectation, then I don’t know what to tell you. Maybe X is better for that.
But they didn't address the criticism. "cutting ~75% of tokens while keeping full technical accuracy" is an empirical claim for which no evidence was provided.
For an LLM, tokens are thought. They have no ability to think, by whatever definition of that word you like, without outputting something. The token only represents a tiny fraction of the internal state changes made when a token is output.
Clearly there is an optimal for each task (not necessarily a global one) and a concrete model for a given task can be arbitrarily far from it. But you'd need to test it out for each case, not just assume that "less tokens = more better". You can be forcing your model to be dumber without realizing it if you're not testing.
This is so funny
Forcing it to be concise doesn't work because it wasn't trained on token strings that short.
This is a 2023-era comment and is incorrect.
> but mmuh latest SOTA from CloudCorp (c)!
You don't know how these things work and all you have to go on is marketing copy.
"Interesting idea! Token consumption sure is an issue that should be addressed, and this is pretty funny too! However, I happen to have an unproven claim that tokens are units of thinking, and therefore, reducing the token count might actually reduce the model's capabilities. Did anybody using this by chance notice any degradation (since I did not bother to check myself)?"
Have a nice day!
Did you test that ""caveman mode"" has similar performance to the ""normal"" model?
A lot of communication is just mentioning the concepts.
Seems reasonable, but this doesn't settle probably-empirical questions like: (a) to what degree is 'more' better?; (b) how important are filler words? (c) how important are words that signal connection, causality, influence, reasoning?
So it's probably true that the "Great question!---" type preambles are not helpful, but that there's definitely a lower bound on exactly how primitive of a caveman language we're pushing toward.
https://arxiv.org/abs/2112.00114 https://arxiv.org/abs/2406.06467 https://arxiv.org/abs/2404.15758 https://arxiv.org/abs/2512.12777
First that scratchpads matter, then why they matter, then that they don’t even need to be meaningful tokens, then a conceptual framework for the whole thing.
Funny idea though. And I’d like to see a more matter-of-fact output from Claude.
But I assume this has been studied? Can anyone point to papers that show it? I’d particularly like to know what the curves look like, it’s clearly not linear, so if you cut out 75% or tokens what do you expect to lose?
I do imagine there is not a lot of caveman speak in the training data so results may be worse because they don’t fit the same patterns that have been reinforcement learned in.
It's a significantly much succinct semantic encoding than English while being able to express all the same concepts, since it encodes a lot of glue words into the grammar of the language, and conventionally lets you drop many pronouns.
e.g.
"I would have walked home, but it seemed like it was going to rain" (14 words) -> "Domum ambulavissem, sed pluiturum esse videbatur" (6 words).
However, another potential issue is that LLMs are continuation engines, and I'd have thought that talking like a caveman may be "interpreted" as meaning you want a dumbed down response, not just a smart response in caveman-speak.
It's a bit like asking an LLM to predict next move in a chess game - it's not going to predict the best move that it can, but rather predict the next move that would be played given what it can infer about the ELO rating of the player whose moves it is continuing. If you ask it to continue the move sequence of a poor player, it'll generate a poor move since that's the best prediction.
Of course there's not going to be a lot of caveman speak on stack overflow, so who knows what the impact is. Program go boom. Me stomp on bugs.
There’s a less magical model of how LLMs work: they are essentially fancy autocomplete engines.
Most of us probably have an intuition that the more you give an autocomplete, the better results it will yield. However, does this extend to output of the autocomplete—i.e. the more tokens it uses for the result, the better?
It could well be true in context of chain of thought[0] models, in the sense that the output of a preceding autocomplete step is then fed as input to the next autocomplete step, and therefore would yield better results in the end. In other words, with this intuition, if caveman speak is applied early enough in the chain, it would indeed hamper the quality of the end result; and if it is applied later, it would not really save that many tokens.
Willing to be corrected by someone more familiar with NN architecture, of course.
[0] I can see “thinking” used as a term of art, distinct from its regular meaning, when discussing “chain of thought” models; sort of like what “learning” is in “machine learning”.
As I understand it, the claim is: more tokens = more computation = more "thinking" => answer probably better.
https://platform.claude.com/docs/en/build-with-claude/extend...
Nothing on that page indicates otherwise.
Do LLMs generally perform better in verbose languages than they do in concise ones?
There will likely be some internal reasoning going "I wonder if the user meant spell check, I'm gonna go with that one".
And it'll also bias the reasoning and output to internet speak instead of what you'd usually want, such as code or scientific jargon, which used to decrease output quality. I'm not sure if it still does
Thanks to chain of thought, actually having the LLM be explicit in its output allows it to have more quality.
> Use when user says "caveman mode", "talk like caveman", "use caveman", "less tokens", "be brief", or invokes /caveman
For the first part of this: couldn’t this just be a UserSubmitPrompt hook with regex against these?
See additionalContext in the json output of a script: https://code.claude.com/docs/en/hooks#structured-json-output
For the second, /caveman will always invoke the skill /caveman: https://code.claude.com/docs/en/skills
Not sure how effective it will be to dirve down costs, but honestly it will make my day not to have to read through entire essays about some trivial solution.
tldr; Claude skill, short output, ++good.
Quite often on reddit I'll write two paragraphs and get told "I'm not reading all that".
Really? Has basic reading become a Herculean task?
I find LLM slop much harder to read than normal human text.
I can't really explain it, it's just a feeling.
The feeling that it draaaags and draaaaaags and keeeeeps going on and on and on before getting to the point, and by the time I'm done with all the "fluff", I don't care what is the text about anymore, I just want to lay down and rest.
But combining this with caveman? Gold!
> One half interesting / half depressing observation I made is that at my workplace any meeting recording I tried to transcribe in this way had its length reduced to almost 2/3 when cutting off the silence. Makes you think about the efficiency (or lack of it) of holding long(ish) meetings.
Mass fun. Starred.
All languages must have means for marking the syntactic roles of the words in a sentence.
The roles may be marked with prepositions or postpositions in isolating languages, or with declensions in fusional languages, or there may be no explicit markers when the word order is fixed (i.e. the same distinction as between positional arguments and arguments marked by keywords, in programming languages). The most laconic method for both programming languages and natural languages is to have a default word order where role markers are omitted, but to also allow any other word order if role markers are present.
Besides the mandatory means for marking syntactic roles, many languages have features that add redundancy without being necessary for understanding, i.e. which repeat already known information, for instance by repeating the information about gender and number that is attached to a noun also besides all its attributes. Whether a language requires redundancy or not is independent on whether it is an isolating language or a fusional language.
English has somewhat less syntactic role markers than other languages because it has a rigid word order, but for the other roles than the most frequent roles (agent, patient, beneficiary) it has a lot of prepositions.
Despite being more economic in role markers, English also has many redundant words that could be omitted, e.g. subjects or copulative verbs that are omitted in many languages. Thus for English it is possible to speak "like a caveman" without losing much information, but this is independent of the fact that modern English is a mostly isolating language with few remnants of its old declensions.
I don't think it would be fundamentally very surprising if something like this works, it seems like the natural extension to tokenisation. It also seems like the natural path towards "neuralese" where tokens no longer need to correspond to units of human language.
https://developers.openai.com/api/reference/resources/respon...
I don't know their internal eval, but I think I have heard it does not hurt or improve performance. But at least this parameter may affect how many comments are in the code.
It often happens that the interesting information is in the first paragraph or so, and the remainder is all just the LLM not knowing when to stop. This is super annoying as a conversation then ends up being 90% noise.
Prompt caching is probably the single most important thing that people building harnesses think about and yet it's mind share in end users is virtually zero. If you had to think of all the weirdest, most seemingly baffling design decisions in an AI product, the answer to "why" is probably "to not break prompt caching".
I have a feeling these same people will complain “my model is so dumb!”. There’s a reason why Claude had that “you’re absolutely right!” for a while. Or codex’s “you’re right to push on this”.
We’re basically just gaslighting GPUs. That wall of text is kinda needed right now.
Thank God there is still neverending wars, otherwise authoritarian governments would have no fun left.
— Kevin Malone
[0] https://books.google.com/books?id=VO4OAAAAYAAJ&pg=PA464#v=on...
This only makes sense if you assume that you are the consumer of the response. When compacting, harnesses typically save a copy of the text exchange but strip out the tool calls in between. Because the agent relies on this text history to understand its own past actions, a log full of caveman-style responses leaves it with zero context about the changes it made, and the decisions behind them.
To recover that lost context, the agent will have to execute unnecessary research loops just to resume its task.
if goal make code, few word better. if goal make insight, more word better. depend on task. machine linear, mind not. consider LLM "thinking" is just edge-weights. if can set edge-weights into same setting with fewer tokens, you are winning.
JOOK no like when machine likes things. Maybe double standard. But forever machines do without like and without love. New like and love updates changing all the time. Makes JOOK question machine watching out for JOOK or watching out for machine.
JOOK like and love enough for himself and for machine too..
andai•5h ago
samus•4h ago
iammjm•3h ago