Not because of any actual increase in “intelligence”, but because the test would be included in the model’s training data, either directly, or indirectly when model developers “tune” their model to give better performance on this particular attention-driving test.
Does this address your concern?
The economic incentives to tweak, tune, or cheat are through the roof.
What? That’s some serious cash for mostly wrong answers.
This isn't to say we shouldn't think critically about the use and performance of models, but "Not Even Bronze..." turned me off to this critique.
(It's specifically trained on formalized math problems, unlike most LLMs, so it's not an apples-to-apples comparison.)
It wasn’t that long ago that the Turing Test was seen as the gold standard of whether a machine was actually intelligent. LLMs blew past that benchmark a year or two ago and people barely noticed. This might be moving the goalposts, but I see it as a realization that thought and language are less inherently connected than we thought.
So yeah, the fact that they even do this well is pretty amazing, but they sound like they should be doing so much better.
It's not an unfamiliar phenomenon in humans. Look at Malcolm Gladwell.
On their website you can see the full answers the LLMs gave ("click cells to see...")
15 years ago, working on AI systems at a FAANG, I would have told you “real” AI probably wasn’t coming in my lifetime. 15 years ago the only engineers I knew who thought AI was coming soon were dreamers and Silicon Valley Kool-Aid drinkers. The rest of us saw that we needed a step-function breakthrough that might not even exist. But it did, and we got there, a couple of years ago.
Now I’m telling people it’s here. We’ve hit a completely different kind of technology, and it’s so clear to people working in the field. The earthquake has happened and the tsunami is coming.
This is why original problems are important: they measure how sensible something is in an open-ended environment, and here the models are completely useless, not just because they fail but because of how they fail. The fact that these LLMs, according to the article, "invent non-existent math theorems", i.e. produce gibberish instead of even knowing what they don't know, is an indication of how limited this still is.
Software engineers understand this better than most - describing a task in general terms, and doing it yourself, can be incredibly easy, even while writing the code to automate the task is difficult or impossible, because of all the devilish details we don't often think about.
* Write a query to link table X to table Y across this schema, returning all the unique entries related to X.id 1234
* Write code add an editable comment list to this UI
* Give me a design to visually manage statuses for this list
* Look at this UI and give me five ideas for improving it
Some of those work better than others, but none of them are guaranteed failures.

They used best-of-32 and used the same model to judge a "tournament" to find the best answer. Seems like something that could be bolted on reasonably easily, e.g. in say WebUI.
edit: forgot to add that I'm curious if this translates to smaller models as well, or if it requires these huge models.
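For what it's worth, here is a minimal sketch of what bolting that on could look like: best-of-n sampling plus a single-elimination tournament in which the same model judges pairs of its own answers. generate_answer and pick_better are hypothetical placeholders for whatever model calls a UI would actually wire up, not the setup used in the article.

    # Rough sketch of best-of-n plus a single-elimination "tournament"
    # where the same model judges pairs of its own answers.
    # generate_answer and pick_better are placeholders, not a real API.
    import random

    def generate_answer(problem: str) -> str:
        # Placeholder: one sampled attempt (temperature > 0 in practice).
        return f"candidate answer for {problem} (seed {random.random():.3f})"

    def pick_better(problem: str, a: str, b: str) -> str:
        # Placeholder: ask the model which answer is more accurate,
        # not just more coherent. Here it picks at random.
        return random.choice([a, b])

    def best_of_n(problem: str, n: int = 32) -> str:
        candidates = [generate_answer(problem) for _ in range(n)]
        while len(candidates) > 1:
            winners = [pick_better(problem, candidates[i], candidates[i + 1])
                       for i in range(0, len(candidates) - 1, 2)]
            if len(candidates) % 2 == 1:  # odd candidate out gets a bye
                winners.append(candidates[-1])
            candidates = winners
        return candidates[0]

    print(best_of_n("IMO 2025, Problem 1"))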
intuition????
I'm a little confused by this. My assumptions (possibly incorrect!): 64k tokens per prompt, they are claiming the model wouldn't need more tokens even for reasoning
Is that right? Would be helpful to see how many tokens the models actually used.
"ok here is my strategy here are the five steps", then requery with a strategy or proof of step 1, 2, 3...
in a dfs
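A rough sketch of that plan-then-prove-each-step loop, assuming hypothetical propose_steps, prove_step, and check_step model queries (none of these are real APIs), with a few retries per step and DFS-style backtracking:

    # Rough sketch: ask for a plan once, then requery for a proof of each
    # step, depth-first with retries and backtracking. All three helpers
    # are stand-ins for separate model queries.
    from typing import Optional

    def propose_steps(problem: str) -> list[str]:
        # "Here is my strategy, here are the five steps."
        return [f"step {i} of a plan for {problem}" for i in range(1, 6)]

    def prove_step(step: str, attempt: int) -> str:
        # One query per step: prove just this step.
        return f"proof attempt {attempt} for {step}"

    def check_step(proof: str) -> bool:
        # Placeholder verifier (could be the model itself or a checker like
        # Lean); here it rejects every first attempt to exercise the retries.
        return "attempt 0" not in proof

    def dfs(steps: list[str], proofs: list[str], max_attempts: int = 3) -> Optional[list[str]]:
        if len(proofs) == len(steps):        # every step has a proof
            return proofs
        step = steps[len(proofs)]
        for attempt in range(max_attempts):  # retry the current step
            proof = prove_step(step, attempt)
            if check_step(proof):
                result = dfs(steps, proofs + [proof], max_attempts)
                if result is not None:
                    return result
        return None                          # give up; caller backtracks

    plan = propose_steps("the problem")
    print(dfs(plan, []))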
Interesting, but I'm not sure this is really due to "minor logical issues". It sounds more like a failure due to a lack of actual understanding (the world model problem). Perhaps the actual answers from the AIs have some hints, but I can't find them.
(EDIT: oops, found the output on the main page of their website. Didn't expect that.)
> Best-of-n is Important ... the models are surprisingly effective at identifying the relative quality of their own outputs during the best-of-n selection process and are able to look past coherence to check for accuracy.
Yes, it's always easier to be a backseat driver.
Any model that can identify the correct answer reliably can arrive at the correct answer given enough time and stochasticity.
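In other words, rejection sampling: a toy sketch of "keep sampling until a reliable verifier accepts", with sample and is_correct as stand-ins for a stochastic model and its self-check.

    # Toy version of "sample until the verifier accepts". Both helpers are
    # placeholders for a stochastic generator and a reliable checker.
    import random

    def sample() -> int:
        return random.randint(0, 99)   # stand-in for a sampled answer

    def is_correct(answer: int) -> bool:
        return answer == 42            # stand-in for a reliable verifier

    attempts = 0
    answer = sample()
    while not is_correct(answer):
        attempts += 1
        answer = sample()
    print(f"found {answer} after {attempts + 1} samples")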
https://www.imo-official.org/year_info.aspx?year=2025 (download page)
They are very difficult.
Meanwhile Noam "well aschtually..."
I love how people are still betting against AI, it's hilarious. Please write more 2000s-esque "The internet is a fad" articles.
AI still has a long way to go, though it has proven to be a useful tool at this point.
The fact that the only formal comparisons for AI systems that are ever done are explicitly based on the highest performing narrowly focused humans, tells me how unprepared society is for what’s happening.
Appreciate that: at the point at which there is an unambiguous demonstration of superhuman-level performance across all human tasks by a machine (and make no mistake, that *is the bar that this blog post and every other post about AI sets*), it’s completely over for the human race, unless someone figures out an entirely new economic system.
The average math major can't get Bronze.
The average everyday human does not have the time to read all available math texts. LLMs do, but they still can't get bronze. What does that say about them?
If I want something done, I'll seek out someone with a skill set that matches the problem.
I don't want AI to be as good as an average person. I want AI to be better than the person I would go to for help. A person can talk with me, understand where I've misunderstood my own problem, can point out faulty assumptions, and may even tell me that the problem isn't even a problem that needs solving. A person can suggest a variety of options and let me decide what trade-offs I want to make.
If I don't trust the AI to do that, then I'm not sure why I'd use it for anything other than things that don't need to be done at all, unless I can justify the chance that maybe it'll be done right, and I can afford the time lost getting it done right without the AI afterwards.
Wow... that's quite a generalization. And not my experience at all.
We don’t ask the average person to do most things; we either find a specialist or provide training beforehand.
??? If you don’t know how to do something you’re really bad at it. I’m not sure what that sentence is even trying to convey.
That doesn't make them bad at reciting The Raven from memory. Being trained to recite The Raven from memory and still being unable to do so would be a proper application of the term. There is an obvious difference between the two states of being and conflating them is specious.
If you want to take seriously the premise that humans are bad at almost everything because most humans haven't been trained at doing almost everything humans can do, then you must apply the same rubric to LLMs, which are only capable of expressions within their specific dataset (and thus not the entire corpus of data on which they haven't been trained) and even then which tend to confabulate far more frequently than human beings at even simple tasks.
edit: never mind, I guess you aren't willing to take this conversation on good faith.
And the average person would do poorly. Not because they couldn't be trained to do it, but because they haven't.
But that isn't the claim I'm objecting to. The claim I'm objecting to is "The average person is bad at literally almost everything," which is not an equivalent claim to "people who aren't trained at math would be bad at math at a competitive level," because it implicitly includes everything that a person is trained in and is expected to be qualified to do.
It was just bad, cynical hyperbole. And it's weird that people are defending it so aggressively.
Nitpicking language doesn't help to move the conversation. One thing most humans are good at is understanding meaning even when the speaker wasn't absolutely precise.
- Does not have the requisite skills and experiences to do X successfully
- Inherently does not have the capacity to do X
I think the former is a reasonable standard to apply in this context. I'd definitely say I would be bad if I tried to play the guitar, but I'm not inherently incapable of doing it. It's just not very useful to say "I could be good at it if I put 1000 hours of practice in."
More than 50% of people employed as software engineers cannot read an academic paper in a field like education, and explain whether the conclusions are sound, based on the experiment description and included data.
More than 50% of people cannot interpret an X-ray.
I know this was meant as a dig, but I’m actually guessing that software engineers score higher on this task than non-engineers who hold M.Ed. degrees.
The only reason I chose software engineers is because I was trying to show that people who can write 'hello world' programs (first example) are not good at all intellectual tasks.
Nothing new really, but there’s nowhere left to go for human labor, and even that concept is being jeered at as a fantasy despite this attitude.
An average human may not be suitable for a given task, but a person with specialized skills will be. More than that, I believe they will continue to outperform LLMs on solving unbounded problems, i.e. those without an obvious, algorithmic solution.
Anything that requires brute force computation can be done by an LLM more quickly, assuming you have humans you trust to validate the output, but that's about the extent of what I'm expecting them to achieve.
We don't allow chess players to access a Syzygy tablebase in a tournament.
We have specialists everywhere.
The whole competition is unfair anyway. An "AI" has access to millions of similar problems, stolen and encoded in the model. Humans would at least need access to a similar database; think of an open-database exam, a nuclear version of an open-book exam.
Meanwhile high schoolers get a piece of paper and 4.5 hours.
This OP claims the publicly available models all failed to get Bronze.
OpenAI tweet claims there is an unreleased model that can get Gold.
OpenAI likely had unlimited tokens, and evaluated "best of N attempts."
>we note that the vast majority of its answers simply stated the final answer without additional justification
While the reasoning steps are obviously important for judging human participants' answers, none of the current big-name providers disclose their actual reasoning tokens. So unless they got direct internal access to these models from the big companies (which seems highly unlikely), this might be yet another flawed study, of which we have seen several in recent months, even from serious parties.
We'll never know how many GPUs and how much other assistance (like custom code paths) this model got.