Any time I see an AI SAFETY paper I am reminded of the phrase "Never get high on your own supply". Simply put, these systems are NOT dynamic: they cannot modify themselves based on experience, and they lack reflection. The moment we realize what these systems are (we're NOT on the path to AI or AGI here, folks) and start leaning into what they're good at, rather than trying to make them something else, is the point where we get useful tools and research aimed at building usable products.
The math no one is talking about: if we had to pay full price for these products, no one would use them. Moore's law is dead and IPC has hit a ceiling. Unless we move to exotic cooling, we simply can't push more power into chips.
Hardware advancement is NOT going to save the emerging industry, and I'm not seeing the papers on efficiency or effectiveness at smaller scales come out to make the accounting work.
>Simply put these systems are NOT dynamic, they can not modify based on experience, they lack reflection.
We already have many, many, many attempts to put LLMs towards the task of self-modification - and some of them can be used to extract meaningful capability improvements. I expect more advances to come - online learning is extremely desirable, and a lot of people are working on it.
I wish I could hammer one thing through the skull of every "AI SAFETY ISN'T REAL" moron: if you only start thinking about AI safety after AI becomes capable of causing an extinction-level safety incident, it's going to be a little too late.
It depends a lot on which LLMs you're talking about, and what kind of usage. See e.g. the recent post about how "Anthropic is bleeding out": https://news.ycombinator.com/item?id=44534291
Ignore the hype in the headline, the point is that there's good evidence that inference in many circumstances isn't profitable.
So he's using their API prices as a proxy for token costs, doesn't actually know the real inference costs, and... that's your "good evidence"? That big sentence full of "we don't know"s?
Does this idea upset you for some reason? Other people have analyzed this and come to similar conclusions, I just picked that one because it's the most recent example I've seen.
Feel free to point to a source that explains how LLM inference is mostly profitable at this point, taking training costs into account. But I suspect you might have a hard time finding evidence of that.
Curious what others think about this direction, particularly in terms of practicality
In other words, relying on censoring the CoT risks making the CoT altogether useless.
Basically: https://www.anthropic.com/research/reasoning-models-dont-say...
As far as I know, DeepSeek is one of the few where you get the full chain of thought. OpenAI/Anthropic/Google give you only a summary of the chain of thought.
This is better thought of as another form of context engineering. LLMs have no other short-term memory. Figuring out what belongs in the context is the whole ballgame.
(The paper talks about the risk of training on chain of thought, which changes the model, not monitoring it.)
All of those seem like very reasonable criteria that will naturally be satisfied absent careful design by model creators. We should expect latent deceptiveness in the same way we see reasoning laziness pop up quickly.
Do we know for sure that agents can't display a type of thought while doing something different? Is there something that reliably guarantees that agents are not able to do this?
See: https://arxiv.org/pdf/2305.04388
On a related note, if anyone here is also reading a lot of papers to keep up with AI safety, what tools have been helpful for you? I'm building https://openpaper.ai to help me read papers more effectively without losing accuracy, and looking for more feature tuning. It's also open source :)
We’ve been experimenting with a lightweight alternative I call Micro-Beam:
• At each turn, force the model to generate k clearly different strategy beams (not token samples).
• Map each to an explicit goal vector of user-relevant axes (kid-fun, budget, travel friction, etc.).
• Score numerically (cosine or scalar) and pick the winner.
• Next turn, re-beam against the residual gap (dimensions still unsatisfied), so scores cause different choices.
• Log the whole thing: beams, scores, chosen path. Instant audit trail; easy to diff, replay “what if B instead of A,” or auto-flag when visible reasoning stops moving the score.
This ends up giving you the monitorability the paper wants, in the form of a scorecard per answer slice rather than paragraphs the model can pretty up for the grader. It also primarily produces more adoption-ready answers with less refinement required.
Not claiming a breakthrough—call it “value-guided decoding without a reward net + built-in audit logs.”
Workshop paper is here: https://drive.google.com/file/d/1AvbxGh6K5kTXjjqyH-2Hv6lizz3...
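For concreteness, here's a minimal sketch of what one turn of that loop could look like, assuming the model has already produced k candidate strategies plus a goal vector for each. The axes and helper names are made up for illustration, not taken from the workshop paper:

    # Minimal sketch of one Micro-Beam turn (illustrative, not the paper's code).
    import numpy as np

    GOAL_AXES = ["kid_fun", "budget", "travel_friction"]  # example user-relevant axes

    def score(goal_vec, beam_vec):
        # Cosine similarity between the target goal vector and a beam's vector.
        return float(np.dot(goal_vec, beam_vec) /
                     (np.linalg.norm(goal_vec) * np.linalg.norm(beam_vec) + 1e-9))

    def micro_beam_turn(goal_vec, beams, audit_log):
        # beams: list of (strategy_text, goal_vector) pairs produced by the model.
        scored = sorted(((score(goal_vec, v), t, v) for t, v in beams),
                        key=lambda s: s[0], reverse=True)
        best_score, best_text, best_vec = scored[0]
        # Residual gap: dimensions the chosen beam still leaves unsatisfied;
        # the next turn re-beams against this instead of the original goal.
        residual = np.clip(np.asarray(goal_vec) - np.asarray(best_vec), 0.0, None)
        audit_log.append({"beams": [(t, s) for s, t, _ in scored],  # full scorecard
                          "chosen": best_text, "score": best_score,
                          "residual": residual.tolist()})
        return best_text, residual

The audit_log entries are the per-turn scorecard: diffable, replayable, and easy to flag automatically when the score stops moving.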
Researchers are already pushing in this direction:
https://arxiv.org/abs/2502.05171
"We study a novel language model architecture that is capable of scaling test-time computation by implicitly reasoning in latent space. Our model works by iterating a recurrent block, thereby unrolling to arbitrary depth at test-time. This stands in contrast to mainstream reasoning models that scale up compute by producing more tokens. Unlike approaches based on chain-of-thought, our approach does not require any specialized training data, can work with small context windows, and can capture types of reasoning that are not easily represented in words. We scale a proof-of-concept model to 3.5 billion parameters and 800 billion tokens. We show that the resulting model can improve its performance on reasoning benchmarks, sometimes dramatically, up to a computation load equivalent to 50 billion parameters."
https://arxiv.org/abs/2412.06769
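For intuition, a toy sketch of the recurrent-depth idea from the first abstract: the same block is re-applied to the hidden state to spend more test-time compute without emitting more tokens. The sizes, the single encoder layer, and the iteration counts here are stand-ins, not the paper's actual architecture:

    # Toy illustration of "reasoning in latent space" by iterating a recurrent block.
    import torch
    import torch.nn as nn

    class RecurrentReasoner(nn.Module):
        def __init__(self, vocab=32000, d_model=512, n_heads=8):
            super().__init__()
            self.embed = nn.Embedding(vocab, d_model)
            # One shared block, re-applied at test time to "think longer".
            self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.head = nn.Linear(d_model, vocab)

        def forward(self, token_ids, n_iters=8):
            h = self.embed(token_ids)
            for _ in range(n_iters):   # more iterations = more test-time compute,
                h = self.block(h)      # with no extra tokens produced
            return self.head(h)        # next-token logits

    model = RecurrentReasoner()
    logits = model(torch.randint(0, 32000, (1, 16)), n_iters=32)  # "think" 4x longer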
Can't monitor the chain of thought if it's no longer in a human legible format.
If my understanding is correct, a plain-text token is just a point in latent space (its embedding vector). Any reasoning done in latent space should therefore be human-readable as a sequence of raw tokens, taking the nearest token to each latent state. I'm not sure what that token sequence would look like -- I assume full or partial (mainly English) words, connected by the abstract N-dimensional latent-space concept of each token rather than grammatically.
Something like:
> prompt: add 2 + 2
> reasoning: respond computation mathematics algebra scalar integer lhs 2 rhs 2 op summation
> lhs 2 rhs 2 op summation gives 4
> computation ops remain none result 4
> response: 4
Something like that; probably even less sensical. Regardless, that could be "language translated" to English easily.
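If that intuition holds, the "language translation" could be as crude as projecting each latent step onto its nearest vocabulary embeddings. A rough sketch, where the embedding matrix, latent state, and tokenizer are placeholders for whatever model is doing the latent reasoning:

    # Rough sketch: map one latent "reasoning" vector back to its nearest tokens.
    import torch

    def nearest_tokens(latent_state, embedding_matrix, tokenizer, k=5):
        # Cosine similarity of the latent state against every row of the
        # (vocab x d) embedding matrix; the top-k tokens are a crude English
        # rendering of that latent step.
        sims = torch.nn.functional.cosine_similarity(
            embedding_matrix, latent_state.unsqueeze(0), dim=-1)
        return [tokenizer.decode([i]) for i in sims.topk(k).indices.tolist()]

Decoding each latent step this way might yield streams like the "respond computation mathematics algebra ..." example above, assuming the latent states stay close to token embeddings at all.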
I have not read the paper so this may have been addressed.
AI2027 predicts a future in which LLM performance will increase once we find alternatives to thinking in "human language". At least the video gave me that impression and I think this is what "neuralese" is referring to.
Is that a credible prediction?
Given that anthropic’s interpretability work finds that CoT does not reliably predict the model’s internal reasoning process, I think approaches like the one above are more likely to succeed.
I am a bit confused about what all 40 authors contributed here. The paper seems to make a suggestion: monitor the chain of thought for safety. Is that the novel part? But then, did one person come up with the idea and the 40+ others just agree to it and get put on the author list?
The paper demonstrates that current models are already performing complex reward hacks in production environments, and that attempts to fix this via CoT training make the problem worse, not better.
As for your "40 authors" snark - this is a position paper where researchers from competing labs (OpenAI, Anthropic, DeepMind, government safety institutes) are jointly committing to NOT do something that's locally tempting but globally catastrophic. Getting industry consensus on "don't train away bad thoughts even though it would make your models look safer" is the opposite of trivial.
This reads like someone who saw a medical consensus statement saying "this common treatment kills patients" and responded with "did one person discover medicine exists and everyone else just agreed?"
If CoT improves performance, then CoT improves performance. However, the naively obvious read, "it improves performance because it is 'thinking' the 'thoughts' it tells us it is thinking, for the reasons it gives", is not completely accurate. It may not be completely wrong either, but it's definitely not completely accurate. Given that, I see no reason to believe it would be hard in the slightest to train models that have even more divergence between their "actual" thought processes and what they claim they are.
I can't imagine why anyone who knows even a little about how these models work would believe otherwise.
The "chain of thought" is text generated by the model in response to a prompt, just like any other text it generates. It then consumes that as part of a new prompt, and generates more text. Those "thoughts" are obviously going to have an effect on the generated output, simply by virtue of being present in the prompt. And the evidence shows that it can help improve the quality of output. But there's no reason to expect that the generated "thoughts" would correlate directly or precisely with what's going on inside the model when it's producing text.