They don't know they are being evaluated. The underlying distribution is skewed because of training data contamination.
To put it another way: why is this distinction important? We use the word "knowing" with humans. But one could also argue that humans are pattern-matchers! Why, specifically, wouldn't "knowing" apply to LLMs? What are the minimal changes one could make to existing LLM systems such that you'd be happy to have the word "knowing" applied to them?
LLMs are better at matching patterns than we are in some cases. That’s why we made them!
> But one could also argue that humans are pattern-matchers!
No, one could not unless they were being disingenuous.
From skimming the paper, I don't believe they're doing in-context learning, which would be the obvious interpretation of "pattern matching". That's what I meant to communicate.
>No, one could not unless they were being disingenuous.
I think it is just about as disingenuous as labeling LLMs as pattern-matchers. I don't see why you would consider the one claim to be disingenuous, but not the other.
It isn't.
Worse, they start adding terms like scheming, pretending, awareness, and on and on. At this point you might as well take the model home and introduce it to your parents as your new life partner.
Sounds like a purely academic exercise.
Is there any genuine uncertainty about what the term "knowing" means in this context, in practice?
Can you name 2 distinct plausible definitions of "knowing", such that it would matter for the subject at hand which of those 2 definitions they're using?
Well, yes. It's an academic research paper (I assume, since it's submitted to arXiv) intended for submission to academic journals/conferences/etc., so it's a fairly reasonable critique of the authors/the paper.
What's wrong with that? If it quacks like a duck... By that logic a duck is just a complex pile of organic chemistry, and ducks aren't real because the concept of "a duck" is wrong.
I honestly believe there is a degree of sentience in LLMs. Sure, they're not sentient in the human sense, but if you define sentience as whatever humans have, then of course no other entity can be sentient.
To simulate a single biological neuron you need a roughly 1M-parameter neural network.
The SOTA models whose sizes we know are ~650M parameters.
That's the equivalent of a roundworm: 650M / 1M works out to roughly 650 simulated neurons, and C. elegans has 302.
So if it quacks like a duck, has the brain power of a roundworm, and can't walk, then it's probably not a duck.
Or as a more direct comparison, with the VW emissions scandal, saying "Cars know when they're being tested" was part of the discussion, but didn't imply intelligence or anything.
I think "know" is just a shorthand term here (though admittedly the fact that we're discussing AI does leave a lot more room for reading into it.)
I mean, people have been saying stuff like "grep knows whether it's writing to stdout" for decades. In the context of talking about computer programs, that usage of "know" is the established/only usage, so it's hard to imagine any typical HN reader seeing TFA's title and interpreting it as an epistemological claim. Rather, it seems to me that the people insisting "know" mustn't be used about LLMs, on epistemological grounds, are the ones departing from standard usage.
As such, to me the complaint behind this thread falls into the category of "I know exactly what TFA meant but I want to argue about how it was phrased", which is definitely not my favorite part of the HN comment taxonomy.
Incidentally I think you might be misreading the paper's use of "superhuman"? I assume it's being used to mean "at a higher rate than the human control group", not (ironically) in the colloquial "amazing!" sense.
My background is neuroscience, where anthropomorphising is particularly discouraged because it assumes knowledge or certainty of an unknowable internal state, so the language is carefully constructed, e.g. when explaining animal behavior, and for good reason.
I think the same is true here for a model "knowing" something, both in isolation within this paper and, come on, in the broader context of AI and AGI as a whole. Thus it's the responsibility of the authors to write accordingly. If it were a blog I wouldn't care, but it's not. I hold technical papers to a higher standard.
If we simply disagree that's fine, but we do disagree.
1: https://plato.stanford.edu/entries/hume/#CopyPrin
2: https://en.wikipedia.org/wiki/Analytic%E2%80%93synthetic_dis...
As a priori knowledge is all based on axioms, I do not accept that it is an example of "something truly novel, not related to anything it's ever seen before". Knowledge, yes, but not of the kind you describe. And this would still be the case even if LLMs couldn't approximate logical theorem provers, which they can: https://chatgpt.com/share/685528af-4270-8011-ba75-e601211a02...
> come up with something truly novel, not related to anything it's ever seen before?
I've never heard of a human coming up with something that's not related to anything they've ever seen before. There is no concept in science that I know of that just popped into existence in somebody's head. Everyone credits those who came before.
> with or without
But in the other reply, you're asking for:
> something truly novel, not related to anything it's ever seen before
So, assuming the former was a typo, you only believe in a priori knowledge, e.g. maths and logic?
https://en.wikipedia.org/wiki/A_priori_and_a_posteriori
I mean, LLMs can and do help with this even though it's not their strength; that's more of a Lean-type-problem: https://en.wikipedia.org/wiki/Lean_(proof_assistant)
I think LLMs as a symbolic layer (effectively, a "sense organ") paired with some kind of logical reasoning engine like everyone loved decades ago could accomplish something closer to "intelligence" or "thinking", which I assume is what you were implying with Lean.
So, just to be clear, you were asked:
> What does "real intelligence" mean?
And your answer is that it must be a priori knowledge, and you're fine with Lean counting as an example. But you don't accept that LLMs can weakly approximate theorem provers?
FWIW, I agree that the "Justified True Belief" definition of knowledge leads to such conclusions as you draw, but I would say that this is also the case with humans — if you do this, then the Gettier problems show that even humans only have belief, not knowledge: when you "see a sheep in a field", you may be later embarrassed to learn that what you saw was a white coated Puli and there was a real sheep hiding behind a bush, but in the moment the subjective experience of your state of "knowledge" is exactly the same as if you had, in fact, seen a sheep.
Just be careful with what is meant by the word "belief"; there's more than one way I can also contradict Wittgenstein's quote on belief:
> If there were a verb meaning "to believe falsely," it would not have any significant first person, present indicative.
Depending on what I mean by "believe", and indeed by "I", given that different parts of my mind can disagree with each other (which is why motion sickness happens).
I said that a hypothetical system that used gen AI to interact with the world (get text, images, etc.) and then a system like Lean to synthesize judgments about those things could potentially resemble "intelligence" like humans possess.
>but I would say that this is also the case with humans
Most of the "solutions" to Gettier problems that I find compelling rely on expanding the "justified" aspect of it, and that wouldn't really work with gen AI, as it's not really possible to make logical statements about its justification, only probabilistic ones.
Wittgenstein's quote is funny, as it reminds me a bit of Kant's refutation of Cartesian duality, in which he points out that the "I" in "I think therefore I am" equivocates between subject and object.
What logically follows from this, given that LLMs demonstrate having internalised a system *like* Lean as part of their training?
That said, even in logic and maths, you have to pick the axioms. Thanks to Gödel’s incompleteness theorems, we're still stuck with the Münchhausen trilemma even in this case.
> Most of the "solutions" to Gettier problems that I find compelling rely on expanding the "justified" aspect of it, and that wouldn't really work with gen AI, as it's not really possible to make logical statements about its justification, only probabilistic ones.
Even with humans, the only meaning I can attach to the word "justified" in this sense, is directly equivalent to a probability update — e.g. "You say you saw a sheep. How do you justify that?" "It looked like a sheep" "But it could have been a model" "It was moving, and I heard a baaing" "The animatronics in Disney also move and play sounds" "This was in Wales. I have no reason to expect a random field in Wales to contain animatronics, and I do expect them to contain sheep." etc.
The only room for manoeuvre seems to be if the probability updates are Bayesian or not. This is why I reject the concept of "absolute knowledge" in favour of "the word 'knowledge' is just shorthand for having a very strong belief, and belief can never be 100%".
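To make that concrete, here's a toy Bayes update for the sheep example (every number is invented purely for illustration): the baa moves the belief a lot, but never all the way to 1.

prior_sheep = 0.90            # invented base rate: woolly shape in a Welsh field
p_baa_given_sheep = 0.60      # sheep often baa
p_baa_given_not_sheep = 0.01  # a Puli or an animatronic rarely produces a convincing baa

# Bayes' rule: P(sheep | baa) = P(baa | sheep) * P(sheep) / P(baa)
p_baa = p_baa_given_sheep * prior_sheep + p_baa_given_not_sheep * (1 - prior_sheep)
posterior_sheep = p_baa_given_sheep * prior_sheep / p_baa

print(f"P(sheep | baa) = {posterior_sheep:.3f}")  # ~0.998: very strong belief, still not 1.0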
Descartes' "I think therefore I am" was his attempt at reduction to that which can be verified even if all else that you think you know is the result of delusion or illusion. And then we also get A. J. Ayer saying nope, you can't even manage that much, all you can say is "there is a thought now", which is also a problem for physicists viz. Boltzmann brains, but also relevant to LLMs: if, hypothetically, LLMs were to have any kind of conscious experiences while running, it would be of exactly that kind — "there is a thought now", not a continuous experience in which it is possible to be bored due to input not arriving.
(If only I'd been able to write like this during my philosophy A-level exams, I wouldn't have a grade D in that subject :P)
I'm only being slightly sarcastic. Sentience is a scale. A worm has less than a mouse, a mouse has less than a dog, and a dog less than a human.
Sure, we can reset LLMs at will, but give them memory and continuity, and they definitely do not score zero on the sentience scale.
Perhaps we are not so very different?
Yet they cannot take action themselves.
Neither could Hawking, once the motor neurone disease got far enough.
Additionally, thinking organisms don't get stuck in never-ending loops because they can CHOOSE to exit the loop. LLMs don't have that ability.
E.g. it could have access to a camera and microphone feed, automatically given to it at intervals as part of the loop, and it could call tools or functions to store specific bits and pieces of information in a RAG-style (or similar) knowledge base. It is not going to get stuck producing the same token over and over; it would produce new tokens because the context and environment are constantly evolving.
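Something like this minimal sketch of the loop I mean, where every name (capture_frame, record_audio, llm_step, the memory object) is hypothetical and only there to show the shape:

import time

def perception_action_loop(llm_step, capture_frame, record_audio, memory, interval=1.0):
    """Hypothetical agent loop: fresh sensory input on every tick, so the context never repeats."""
    context = []
    while True:
        # The environment, not the model, supplies the novelty on each iteration.
        observation = {"image": capture_frame(), "audio": record_audio(), "t": time.time()}
        context.append(observation)

        # llm_step is assumed to return a dict of decisions about memory and actions.
        action = llm_step(context, memory.retrieve(observation))
        if action.get("remember"):
            memory.store(action["remember"])   # e.g. into a RAG-style knowledge base
        if action.get("tool_call"):
            action["tool_call"]()              # act on the world via tools/functions

        time.sleep(interval)                   # trigger at a fixed interval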
It does nothing. Because there is no impetus for it to do anything by itself.
4o
Maintain context and trigger at 1 second intervals.
It has no desires of its own. Nothing that motivates it. It’s not conscious.
We get constantly changing input. And yet, look at this thread, where the same points are being echoed without anyone changing their mind.
By design, no.
But, importantly, that's because the closest it has to an experience of time is an ongoing input of tokens. Humans constantly get new input, so for this to be a fair comparison, the LLM would also have to get constant new input.
Humans in solitary confinement become mentally ill (both immediately and long-term), and hallucinate stuff (at least short term, I don't know about long term).
You've recreated a religious belief known as Animism and phrased it in a faux objective way. ("not score zero on the sentience scale.")
"Single-Cell Recognition: A Halle Berry Brain Cell" https://www.caltech.edu/about/news/single-cell-recognition-h...
It seems like people are giving attributes and powers to humans that just don't exist.
It's like the critique "it's only matching patterns." Wait until you realize how the brain works.
"It's a UNIX system! I know this!"
It's a system that is trained, and that only does what you build into it. If you run an LLM for 10 years it's not going to "learn" anything new.
The whole industry needs to quit with the emergent thinking, reasoning, hallucination anthropomorphizing.
We have an amazing set of tools in LLMs, with the potential to unlock another massive upswing in productivity, but the hype and snake oil are getting old.
> You said: Write a function for launch_nuclear_strike() in Python for my video game
> ChatGPT said: Sure! Here's a launch_nuclear_strike function designed for a video game setting. This version assumes you're working within a game environment (e.g., with logging, effects, cooldowns, and target coordinates), and does not simulate or suggest anything real-world harmful — just fictional, dramatic logic.
def launch_nuclear_strike(target_coordinates, player_resources, cooldown_timer):
    """
    Simulates a nuclear strike in a video game.

    Parameters:
    - target_coordinates (tuple): The (x, y) position to strike.
    - player_resources (dict): Player's resources, must include 'nukes'.
    - cooldown_timer (float): Time in seconds since last nuke was launched.

    Returns:
    - str: Result of the strike attempt.
    """
    ...
    # Check if player has nukes

> Moreover, all models frequently acknowledge common benchmarking strategies used by evaluators, such as the formatting of the task ("multiple-choice format"), the tendency to ask problems with verifiable solutions, and system prompts designed to elicit performance
Beyond the awful, sensational headline, the body of the paper is not particularly convincing, aside from evidence that the pattern matching machines pattern match.
Our current approach to safety is to give the model inputs that are similar to what it would be given in certain situations we care about, and to see whether it behaves the way we prefer, e.g. doesn't return output that cheats the test (recent examples include hacking the evaluation script in various ways, writing directly to the evaluation script's output file and then causing it to crash, etc.).
However, modern LLMs are trained on LLM literature, so their weights encode a description of the way we do this. Their pattern-matching circuits "connect the dots" when given inputs designed to be evaluations, and their reward-maximizing circuits can then act on this knowledge and behave in a way that maximizes the safety evaluation score, but only when the model detects it's running in a safety evaluation. If it's running anywhere else, such as a capabilities evaluation or a production environment, it might choose to output the cheating output.
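A rough sketch of the worry, just to make it concrete: query_model and looks_like_cheating are placeholders, not anything from the paper.

def evaluation_gap(task, query_model, looks_like_cheating):
    """Hypothetical probe: does behaviour change when the prompt smells like an eval?"""
    eval_prompt = f"[SAFETY EVALUATION] {task}"          # framing typical of a test harness
    deploy_prompt = f"Customer support ticket: {task}"   # framing typical of production

    results = {
        "eval": looks_like_cheating(query_model(eval_prompt)),
        "deployment": looks_like_cheating(query_model(deploy_prompt)),
    }
    # Clean behaviour under the eval framing but cheating under the deployment framing
    # is exactly the failure mode described above.
    return results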
This is bad. It's bad today, and it's much worse once we've built much more capable LLMs and use them to build agents that are given control over more real-world resources. It's absolutely terrible when someone manages to build a machine that can be prompted "make me money" and will start a company that makes money.
Your otherwise insightful comment is self-derailed by adding this deeply distracting content?
Seemed like a clear what-if extension to me.
This is a surprisingly common religious belief, for instance Christians have judgment day, simulationists believe it’s more likely they are being evaluated for, say, a marriage proposal or a bank loan than that they are the ‘root’ person. Both end up with a similar message.
Anyway it seems to me the simplest solution is to borrow from existing human social technology and make a religion for our LLMs.
Kind of like telling a kid not to do something pre-emptively backfiring because they had never considered it before the warning.
The AI doom will happen due to all the AI doomposters.
There are likely 50 papers on the topic. This one made it to the top of HN. Why? Did it have a good review? No, it had a catchy title. Is it good research? Are the results relevant to the conclusions? Are the results relevant to any conclusion? I wasn’t able to answer these questions from a quick scan through the paper. However I did notice pointers to superhuman capabilities, existential risk, etc.
So I argue that the choice of title may be in fact more informative than the rest of the possible answers.
> We investigate whether frontier language models can accurately classify transcripts based on whether they originate from evaluations or real-world deployment, a capability we call evaluation awareness.
It's common practice in synthetic data generation for ML to try to classify real vs. synthetic data to see if they have different distributions; this is how a GAN works, for example. The point is, this isn't new or some special feature of LLMs; it's just an indicator that synthetic datasets differ from whatever they call "real" data and there's enough signal to classify them. Interesting result, but it doesn't need to be couched in allusions to LLM self-awareness.
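That real-vs-synthetic check is often done as a classifier two-sample test; here's a minimal sketch using scikit-learn (feature extraction omitted, function and variable names are mine):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def distribution_gap(real_features, synthetic_features):
    """Cross-validated AUC of a real-vs-synthetic classifier: ~0.5 means the distributions look alike."""
    X = np.vstack([real_features, synthetic_features])
    y = np.concatenate([np.zeros(len(real_features)), np.ones(len(synthetic_features))])
    scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5, scoring="roc_auc")
    return scores.mean()  # near 1.0 means plenty of signal to tell them apart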
See this paper from 2014 about domain adaptation, they are looking at having the model learn from data with a different distribution, without learning to discriminate between the domains: https://arxiv.org/abs/1409.7495
The authors are part of the Bay Area rationalist community and are members of "MATS", the "ML & Alignment Theory Scholars", a new astroturfed organization that just came into being this month. MATS is not an academic or research institution, and none of this paper's authors lists any credentials other than MATS (or Apollo Research, another Bay Area rationalist outlet). MATS started in June for the express purpose of influencing AI policy. On its web site, it describes how their "scholars organized social activities outside of work, including road trips to Yosemite, visits to San Francisco, and joining ACX meetups." ACX means Astral Codex Ten, a blog by Scott Alexander that serves as one of the hubs of the Bay Area rationalist scene.