This is true for instruction-tuned models, but instruction tuning comes late in the training process.
A bit like assessing a person’s self-awareness based on their high-school knowledge.
> *post-training* installs a self-model with actual, meaningful boundaries, and when processing falls outside those boundaries, the first-person pronoun no longer binds to the content.
But you're right, I could've been more explicit about it.
Detection of errors injected into context is useful but I think it’s a different thing.
It's also the reason why I ran the two tests on open-weights models with unredacted thinking traces. Gemma never flagged anything in its response either, only in its thinking. Without knowing how the summarizer models are prompted, it's impossible to tell whether it was a genuine miss or just something the summarizer decided to omit.
Given the framing that they're similar to nukes and a national security issue, it's likely that the models are post-trained to not answer such questions accurately.
Also the article could be trying to normalize thinking that these are more than matrix multiplication gadgets good at compression.
Mechanistic interpretability research has found plenty of indicators that real, complex, generalized, and reusable circuits develop in models as they are trained and post-trained, particularly as overtraining ratios increase and memorization shifts to generalization. That's not to say that means they must be "conscious," but the overall point is that claiming anything definitive either way is incomplete.
It can be fascinating reading if you can sort through the chaff.
Honestly, for some of us it's less that we think they're "more than matrix multiplication gadgets good at compression" and more that we suspect what our brains are doing is not so dissimilar.
A materialist view of the world could support the idea that intelligence itself may just be a series of predictions from a big compressed multi-modal dataset. That's not to say that LLMs are doing it in a way that is even close to how our brains are doing it, but we also don't understand how different it may be, and how much utility we can get out of them even with the current architecture.
If there is some sort of feedback loop (the model has a reason to look into the mirror), it usually does notice.
The current interfaces to LLMs are heavily biased towards "predict the next token in the context of a user with a helpful assistant", but LLMs are capable of other modes of next-token prediction too.
Before the ChatGPT release, people often measured LLM performance by how well they could produce a coherent story or a poem. That's where Anthropic's model names come from, I'm guessing.
FromTheFirstIn•1h ago
adzm•35m ago
thepasch•31m ago
Should be better now.