In effect, the different response types are measuring how the models respond to a context-free novel environment. I imagine humans would also respond in a variety of ways to this test, none of which are necessarily incorrect from the perspective of intelligence testing.
Many tests of human behavior (e.g., in behavioral economics) create a pretext to avoid biasing the response that is actually being measured. For example, we may invite a participant to a study of color preference, but actually measure how fast they complete the task when the scientist has or hasn't bathed in a week (or whatever).
Likewise, for LLM intelligence testing, you could create pretext tasks and context, and perhaps measure what the model considered along the way, instead of the actual task outcome.
In short: start with a dataset of questions, where each question has been answered by two different LLMs. Ask the model you want to evaluate to choose the better answer for each pair. Then measure how consistently it selects winners. Does it reliably favor some models across the questions, or does it behave close to randomly? This consistency is a strong proxy for the model’s intelligence.
It is not subject to dataset leaks, lets you measure intelligence in many fields where you might not have golden answers, and converges pretty fast, making it really cheap to measure.
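To make "how consistently it selects winners" concrete, here is a minimal sketch in Python. The comment above does not say how consistency is scored, so the metric here is an assumption: for each pair of answering models, take the fraction of the judge's picks that agree with its own majority pick for that pair, then average over pairs. The function name `pairwise_consistency` and the tuple layout are hypothetical too.

```python
from collections import Counter, defaultdict


def pairwise_consistency(judgments):
    """Score how consistently a judge model picks winners.

    judgments: iterable of (model_a, model_b, winner) tuples, one per
    question, where winner is whichever of the two the judge preferred.

    Returns (overall, per_pair). per_pair maps each unordered model pair
    to the fraction of picks that match the judge's majority pick for
    that pair. overall is the unweighted mean of those fractions.
    This metric is an illustrative assumption, not the original method.
    """
    votes = defaultdict(Counter)
    for model_a, model_b, winner in judgments:
        if winner not in (model_a, model_b):
            raise ValueError(f"winner {winner!r} is not one of the pair")
        # Sort so (a, b) and (b, a) count as the same matchup.
        pair = tuple(sorted((model_a, model_b)))
        votes[pair][winner] += 1

    if not votes:
        raise ValueError("no judgments given")

    per_pair = {
        pair: max(counts.values()) / sum(counts.values())
        for pair, counts in votes.items()
    }
    overall = sum(per_pair.values()) / len(per_pair)
    return overall, per_pair


# Hypothetical judgments: the judge prefers model_x in 3 of 4 questions.
example = [
    ("model_x", "model_y", "model_x"),
    ("model_x", "model_y", "model_x"),
    ("model_y", "model_x", "model_x"),
    ("model_x", "model_y", "model_y"),
]
score, per_pair = pairwise_consistency(example)
print(score)  # 0.75
```

A score near 1.0 means the judge almost always sides with the same model in a matchup, and a score near 0.5 looks like coin flipping. With only a few questions per pair, a random judge will land somewhat above 0.5 because the majority pick is chosen after the fact, so a real evaluation would want enough questions per pair, or a comparison against a shuffled baseline.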
It seems to me many models - maybe by design - have a recognizable style which would be much easier to detect than evaluating the factual quality of answers.
The difference is going to be that, instead of starting from pre-existing games and hoping that "generalizes" to intelligence, this time people are going to build gamified simulators of economically valuable stuff. This is feasible now because we can use LLMs to help generate these games much faster than we would have been able to previously.
"The behavior summary"
I tested this locally and got the same result with gpt-oss 120b. But only on the default 'medium' reasoning effort. When I used 'low' I kept getting more playful responses with emojis and when I used 'high' I kept getting more guessing responses.
I had a lot of fun with this and it provided me with more insight than I would have thought.
This is not a new idea. Traditional IQ tests pivoted to them (they weren't originally like that), and no doubt they have great "discriminative power", because being able to figure out what's expected of you, and not getting intimidated by cryptic and obtuse tasks put before you, are certainly extremely valuable skills in, e.g., business and politics.
But I always respected real tasks more. A question on a math test is honest; if it doesn't precisely define what's expected of you, the taskmaster has done a bad job, not you. It still can be extremely demanding.
An implicit task, by comparison, smells more of riddles, gnosticism. Do you know the way? Do you know the genre? (Once you know the genre of implicit tasks typical to IQ tests, you can easily increase your performance by a lot).
For that matter, this idea isn't new to machine learning either. Francois Chollet did it already, and he was IMO just as wrongheaded in thinking implicit tasks are somehow more indicative of "true intelligence" than explicit ones.
vitaelabitur•17h ago
Also, commercial LLMs generally have system instructions baked on top of the core models, which intrinsically prompt them to look for purpose even in random user prompts.
crooked-v•16h ago
wood_spirit•15h ago
globnomulous•11h ago
nomel•8h ago
lubujackson•14h ago
Understanding how LLMs fail differently is becoming more valuable than knowing that they all got 100% on some reasoning test with perfect context.