(b) your comment is miles off-topic, as he is not addressing doom in any sense
> AI technology is now rapidly approaching the point of transition from qualitative to quantitative achievement.
I don't get it. The whole history of deep learning was driven by quantitative achievement on benchmarks.
I guess the rest of the post is about adding emphasis on costs in addition to overall performance. But, I don't see how that is a shift from qualitative to quantitative.
(My interpretation, obviously)
I should write an article on it sometime, but I think the incessant focus on data someone collected from the mystical "real world", over well-designed synthetic data from a properly understood algorithm, is really damaging to proper understanding.
Shameless plug, but I made a simple app for anyone to create their own evals locally:
This is a very valid point. Google and OpenAI announced they got the gold medal with specialized models, but what exactly does that entail? If one of them used a billion dollars in compute and the other a fraction of that, we should know about it. Error rates are equally important. Since there are conflicts of interest here, academia would be best suited for producing reliable benchmarks, but they would need access to closed models.
In short (there is nuance), Google cooperated with the IMO team while OpenAI didn't, which is why OpenAI announced before Google.
"Human teens beat AI at an international math competition Google and OpenAI earned gold medals, but were still out-mathed by students."
> what exactly does that entail
Overfitting on the test set with models that are useless for anything else, that's what.
The tech — despite being sometimes impressive — is objectively inefficient, expensive, and harmful: to the environment (excessive use of energy and water for cooling), to the people located near the data centers (by stochastic leaching of coolants into the water table, IIRC), and economically to the hundreds of millions of people whose data was involuntarily used for training.
We've all seen the bad-faith actors that questioned, for example, studies on the efficacy of wearing masks in reducing the chance of transmission of airborne diseases, because the study combined wearing masks AND washing hands... Those people would gladly wipe without toilet paper to "own the libs" or whatever hate-filled mental gymnastics strokes their ego.
With that in mind, let's call things what they are: there are multiple companies salivating at the prospect of being able to make the working class obsolete. There are trillions to be made, in their minds.
> I would like to see numbers, results, or something of that nature
I would like the same thing! So far, we have seen that a very big company that had pledged, IIRC, to remain not-for-profit for the benefit of humanity sold out at the drop of a hat the moment they were able to hint at Zombocom levels of possibility to investors.
"There are two schools of thought, you see..."
Joking aside, I think that's a very valid point; I'm not sure what the nonreligious term would be for the amorality of "sins of omission"... But, in essence, one can clearly be unethical by ignoring the social responsibility we have to study who is affected by our actions.
Corporations can't really play dumb there, since they have to weigh the impacts of every project they undertake.
Also, side note... It's very telling how little control we (commoners?) have as a global society that — collectively — we're throwing mountains of cash at weapons and AI, which directly move us closer to oblivion and worsen the effects of climate change (despite the majority of people not wanting wars or to be replaced by a chatbot). I would instead favor world peace; ending poverty, famine, and genocide; and preventing further global warming.
If the best you can do is bring up this garbage, then you have nothing of value to say.
[1] https://www.lesswrong.com/posts/8ZgLYwBmB3vLavjKE/some-lesso...
I'm fairly certain this phenomenon is responsible for LLM capabilities on GeoGuessr-type games. They have unreasonably good performance; for example, they can identify obscure locations from featureless/foggy pictures of a bench. GeoGuessr's entire dataset, including GPS metadata, is definitely included in all of the frontier model training datasets, so it should be unsurprising that they have excellent performance in that domain.
No, it is not included; however, there must be quite a lot of pictures on the internet for most cities. GeoGuessr's data is the same as Google's Street View data, and it probably contains billions of 360-degree photos.
This is not uncommon. Bears aren't always tearing people apart; that's a movie trope with little connection to reality. Black bears in particular are smart and social enough to befriend their food sources.
But a hungry bear, or a bear with cubs, that's a different story. Even then bears may surprise you. Once in Alaska, a mama bear got me to babysit her cubs while she went fishing -- link: https://arachnoid.com/alaska2018/bears.html .
It just isn't plausible that anyone has actually done that. I'm sure some people include a small sample of them, though.
Why bother to create a copy, if it can be avoided, right?
This is a good rebuttal when someone quips that we “are about to run out of data”. There’s oh so much more, just not in the form of books and blogs.
They still kicked ass.
It seems like those AIs just have an awful lot of location familiarity. They've seen enough tagged photos to be able to pick up on the patterns, and generalize that to kicking ass at GeoGuessr.
An irony here is that math blogs like Tao's might not be in LLM training data, for the same reason they aren't accessible to screen readers - they're full of math, and the math is rendered as images, so it's nonsense if you can't read the images.
(The images on his blog do have alt text, but it's just the LaTeX code, which isn't much better.)
I wouldn't think an LLM would have an issue with that at all. I can see how a screen reader might, but it seems like the same problem a screen reader faces with any piece of code, not just LaTeX.
I remember running this experiment some time ago in a context where I was certain there was no possibility of tool use to encode/decode. Nowadays, it can be hard to be certain whether there is any tool use or not; in some cases, such as Mistral, the response is quick enough to make it unlikely there's any tool use.
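If you want to reproduce that kind of test, here is a minimal sketch (the proposition is just a placeholder; the only point is that the encoded string itself never appears in plain text, so a correct answer with tools disabled means the decoding happened "in the model's head"):

```python
import base64

# Placeholder proposition; swap in anything the model is unlikely to have
# seen already paired with its base64 encoding.
proposition = "For all sets A and B, A is a subset of the union of A and B."

encoded = base64.b64encode(proposition.encode("utf-8")).decode("ascii")
print(encoded)
# Paste only the encoded string into a chat with tools disabled and ask the
# model to decode it and say whether the statement is true.
```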
It "left out" the A in its decode and still correctly answered the proposition, either out of reflexive familiarity with the form or via metasyntactic reasoning over an implicit anaphor; I believe I recall this to be a formulation of one of the elementary axioms of set theory, though you will excuse me for omitting its name before coffee, which makes the pattern matching possibility seem somewhat more feasible. ('Seem' may work a little too hard there. But a minimally more novel challenge I think would be needed to really see more.)
There's lots of text in lots of languages about using an online base64 decoder, and nearly none at all about decoding the representation "in your head," which for humans would be a party trick akin to that one fellow who could see a city from a helicopter for 30 seconds and then perfectly reproduce it on paper from memory. It makes sense to me that a model trained on the Internet would "invent" the "metaphor" of an online decoder here, I think. What in its "experience" serves better as a description?
LLMs are extremely good at outputting LaTeX; ChatGPT will output LaTeX, which the website renders as such. Why do you think LLMs have trouble understanding it?
But the people writing the web page extraction pipelines also have to handle the alt text properly.
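As a rough illustration of what handling it properly could mean, here is a minimal sketch (assuming BeautifulSoup; the HTML snippet is made up) that keeps the LaTeX source of math images instead of dropping them:

```python
from bs4 import BeautifulSoup

def html_to_text_keeping_alt(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for img in soup.find_all("img"):
        alt = img.get("alt", "")
        # Math images on blogs like Tao's carry the LaTeX source as alt text;
        # keep it inline rather than discarding the image.
        img.replace_with(f"${alt}$" if alt else "")
    return soup.get_text(separator=" ", strip=True)

print(html_to_text_keeping_alt(
    '<p>Suppose <img alt="x \\in A \\cap B" src="latex.php?x-in-a-cap-b"> holds.</p>'
))
```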
I've been working on implementing some E&M simulations with Claude Code, and it's so-so on the C++ and TERRIBLE at the actual math (multiplying a couple of 6x6 matrix differential operators is beyond it).
But I can dash off some notes and tell Claude to TeXify and the output is great.
Let’s say everyone agrees to refer to one hosted copy of a token “cat”, and instead generate a unique vector to represent their reference to “cat”.
Blam. Endless unique vectors which are nice and precise for parsing. No endless copies of arbitrary text like “cat”.
Now make that your globally distributed database to bootstrap AI chips from. The data-driven programming dream, where other machines on the network feed new machines their bootstrap.
The American tech industry is IBM now: stuck on the recent success of web SaaS and way behind on AI.
No, that's actually really easy. What's hard is coming up with original questions of a specific level of difficulty. And that's what you need for a competition.
To elaborate: it's really easy to find lots and lots of elementary, unsolved questions. But it's not clear whether you can actually solve them or how hard solving them is, so it's hard to judge the performance of LLMs on them.
> It is interesting that this rule has gone completely out of the window in the age of LLMs.
No, it hasn't.
And don't get me started on the decline in depth on technical topics and the surge in political discussions. I came to HN for the former, not the latter.
We are humans, so there will never be a perfect forum.
Perfect is in the eye of the moderator.
I am not saying trusting your memory is always false or true. Most of the time it might be true. It's a heuristic.
But if someone comes and denies what you did, the best course of action would be to consider the evidence they have and not assume they are stupid because they believe differently.
Let's be honest: you have not personally gone and verified that the rocks came from the Moon. Nor were you tracking the telemetry data on your computer when the rocket was going to the Moon.
I also believe we went to the Moon.
But all I have is beliefs.
Everyone believed the Earth was flat thousands of years back as well. They had solid evidence.
But humility is accepting that you don't know and that you are believing, and not pretending you are above others who believe the exact opposite.
As you say, you should have the humility to consider the evidence that others provide that you might be wrong. The thing with the various popular conspiracy theories is that the evidence is conspicuously missing when any competent good faith actor would be presenting it front and center.
I think you don't know what evidence means. You want proof and that's for mathematics.
You don't know that you exist. You could be a simulation.
Are they all equally bad? Or equally bad, but in different aspects? E.g., I often read here that X has more disinformation and right-wing propaganda, while Mastodon was called out here on another topic.
Maybe somebody active in different networks can answer that.
But in general, I'd say that the microblogging format as a whole encourages a number of toxic behaviors and interaction patterns.
I like the ARC-AGI approach because it shows both axes, score and price, and places a human benchmark on them.
When considering top tier labs that optimize inference and own the GPUs: the electricity cost of USD 5000 at a data center with 4 cents per kWh (which may be possible to arrange or beat in some counties in the US with special industrial contracts) can produce about 2 trillion tokens for the R1-0528 model using 120kW draw for the B200 NVL72 hardware and the (still to be fully optimized) sglang inference pipeline: https://lmsys.org/blog/2025-06-16-gb200-part-1/
Although 2T tokens is not unreasonable for getting high-precision answers to challenging math questions, such a high token count would strongly suggest there are lots of unknown techniques deployed at these labs.
If one adds the cost of GPU ownership or rental, say 2 USD/h/GPU, then the number of tokens for 5k USD shrinks dramatically to only 66B tokens, which is still high for usual techniques that try to optimize for a best single answer in the end, but perhaps plausible if the vast majority of these are intermediate thinking tokens and a lot of the value comes from LLM-based verification.
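A quick back-of-envelope check of those figures, with every input taken from the assumptions above (the per-rack throughput is just the aggregate rate implied by the 2T estimate, not a measured number):

```python
budget_usd        = 5_000
price_per_kwh     = 0.04       # assumed industrial electricity rate, USD/kWh
rack_power_kw     = 120        # assumed NVL72 rack draw
rack_tokens_per_s = 533_000    # aggregate decode rate implied by the 2T figure

# Electricity only
hours = budget_usd / (price_per_kwh * rack_power_kw)          # ~1042 hours
print(f"{rack_tokens_per_s * hours * 3600:.1e} tokens")       # ~2.0e12

# Renting the 72 GPUs at ~2 USD/h/GPU instead (electricity is now negligible)
hours = budget_usd / (2 * 72)                                 # ~35 hours
print(f"{rack_tokens_per_s * hours * 3600:.1e} tokens")       # ~6.7e10
```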
Interestingly, Tao mentions https://teorth.github.io/equational_theories/, and I believe this is better progress than LLMs doing math. I believe enhancing Lean with more tactics and formalizing those in Lean itself is a more fruitful avenue for AI in math.
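To make "formalizing tactics in Lean itself" concrete, here is a toy sketch in Lean 4 (not taken from the equational_theories project): a new tactic can be a one-liner that combines existing ones.

```lean
-- Toy example: a "new tactic" defined inside Lean itself.
macro "auto_nat" : tactic => `(tactic| first | rfl | decide)

example : 2 + 2 = 4 := by auto_nat   -- closed by rfl
example : 3 < 5 := by auto_nat       -- closed by decide
```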
LLMs, especially in /conjunction/ with Lean for formal validation, are really an exciting new frontier in mathematics and it's a mistake to see that as just "unreliable" versus "reliable" symbolic AI etc. The OP Terence Tao has been pushing the edge here since day one and providing, I think, the most unbiased perspective on where things stand today, strengths as much as limitations.
[1] https://isabelle.in.tum.de/website-Isabelle2009-1/sledgehamm...
NitpickLawyer•1d ago
It's really hard to trust anything public (for obvious reasons of dataset contamination), but also some private ones (for the obvious reasons that providers do get most/all of the questions over time, and they can do sneaky things with them).
The only true tests are the ones you write yourself, never publish, and only work 100% on open models. If you want to test commercial SotA models from time to time you need to consider them "burned", and come up with more tests.
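For what it's worth, such a private harness doesn't need to be fancy. A minimal sketch, assuming a locally hosted open model behind an OpenAI-compatible endpoint (the URL, model name, and questions are placeholders):

```python
import requests

PRIVATE_EVALS = [
    {"prompt": "If 3 pens cost 45 cents, how much do 7 pens cost, in cents?",
     "expect": "105"},
    # ...questions you never publish anywhere...
]

def ask(prompt: str) -> str:
    resp = requests.post(
        "http://localhost:8000/v1/chat/completions",   # placeholder endpoint
        json={"model": "local-open-model",             # placeholder model name
              "messages": [{"role": "user", "content": prompt}],
              "temperature": 0},
        timeout=120,
    )
    return resp.json()["choices"][0]["message"]["content"]

passed = sum(case["expect"] in ask(case["prompt"]) for case in PRIVATE_EVALS)
print(f"{passed}/{len(PRIVATE_EVALS)} private evals passed")
```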
antupis•1d ago
NitpickLawyer•1d ago
This was noticeable with the early Phi models. They were originally trained fully on synthetic data (cool experiment tbh), but the downside was that GPT-3/4 was "distilling" benchmark "hacks" into it. It became apparent when new benchmarks were released after Phi's publication date, and one of them measured "contamination" of about 20+%. Just from distillation.
rachofsunshine•1d ago
One is a measurement problem, a statement about the world as it is: an engineer who can finish such-and-such many steps of this coding task in such-and-such time has such-and-such chance of getting hired. The thing you're measuring isn't running away from you or trying to hide itself, because facts aren't conscious agents with the goal of misleading you. Measurement problems are problems of statistics and optimization, and their goal is a function f: states -> predictions. Your problems are usually problems of inputs, not problems of mathematics.
But the larger you get, and the more valuable gaming your test is, the more you leave that measurement problem and find an adversarial problem. Adversarial problems are at least as difficult as your adversary is intelligent, and they can sometimes be even worse by making your adversary the invisible hand of the market. You don't live in the world of gradient descent anymore, because the landscape is no longer fixed. You now live in the world of game theory, and your goal is a function f: (state) x (time) x (adversarial capability) x (history of your function f) -> predictions.
It's that last, recursive bit that really makes adversarial problems brutal. Very simple functions can rapidly result in extremely deep chaotic dynamics once you allow even the slightest bit of recursion - even very nice functions like f(x) = 3.5x(1-x) become writhing ergodic masses of confusion.
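A minimal sketch of that sensitivity, iterating the quadratic map from two nearly identical starting points (the classic logistic map is still periodic at r = 3.5 and only turns chaotic above roughly r = 3.57, so the chaotic case below uses r = 3.9):

```python
def iterate(r: float, x: float, steps: int) -> float:
    # Repeatedly apply f(x) = r * x * (1 - x)
    for _ in range(steps):
        x = r * x * (1 - x)
    return x

for r in (3.5, 3.9):
    a = iterate(r, 0.2, 60)
    b = iterate(r, 0.2 + 1e-9, 60)   # perturb the start by one part in a billion
    print(f"r={r}: difference after 60 steps = {abs(a - b):.2e}")
# The periodic case keeps the difference tiny; the chaotic case amplifies it
# by many orders of magnitude.
```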
visarga•20h ago
bwfan123•16h ago
pixl97•16h ago
klingon-3•23h ago
Just feed it into an LLM, unintentionally hint at your bias, and voila, it will use research and the latest or generated metrics to prove whatever you’d like.
> The only true tests are the ones you write yourself, never publish, and only work 100% on open models.
This may be good enough, and that’s fine if it is.
But, if you do it in-house in a closet with open models, you will have your own biases.
No tests are valid if all that ever mattered was the argument and perhaps curated evidence.
All tests, private and public, have proved flawed theories historically.
Truth has always been elusive and under siege.
People will always just believe things. Data is just foundation for pre-existing or fabricated beliefs. It’s the best rationale for faith, because in the end, faith is everything. Without it, there is nothing.
mmcnl•22h ago
crocowhile•20h ago
ACCount36•15h ago
Benchmarks are a really good option to have.