(b) your comment is miles off-topic, as he is not addressing doom in any sense
> AI technology is now rapidly approaching the point of transition from qualitative to quantitative achievement.
I don't get it. The whole history of deep learning was driven by quantitative achievement on benchmarks.
I guess the rest of the post is about adding emphasis on costs in addition to overall performance. But, I don't see how that is a shift from qualitative to quantitative.
(My interpretation, obviously)
I should write an article on it sometime, but I think the incessant focus on data someone collected from the mystical "real world", over well-designed synthetic data from a properly understood algorithm, is really damaging to proper understanding.
Shameless plug, but I made a simple app for anyone to create their own evals locally:
This is a very valid point. Google and OpenAI announced they got the gold medal with specialized models, but what exactly does that entail? If one of them used a billion dollars in compute and the other a fraction of that, we should know about it. Error rates are equally important. Since there are conflicts of interest here, academia would be best suited for producing reliable benchmarks, but they would need access to closed models.
In short (there is nuance), Google cooperated with the IMO team while OpenAI didn't, which is why OpenAI announced before Google.
"Human teens beat AI at an international math competition Google and OpenAI earned gold medals, but were still out-mathed by students."
> what exactly does that entail
Overfitting on the test set with models that are useless for anything else, that's what.
The tech — despite being sometimes impressive — is objectively inefficient, expensive, and harmful: to the environment (excessive use of energy and water for cooling), to the people located near the data centers (by stochastic leaching of coolants into the water table, IIRC), and economically to the hundreds of millions of people whose data was involuntarily used for training.
We've all seen the bad-faith actors that questioned, for example, studies on the efficacy of wearing masks in reducing the chance of transmission of airborne diseases, because the study combined wearing masks AND washing hands... Those people would gladly wipe without toilet paper to "own the libs" or whatever hate-filled mental gymnastics strokes their ego.
With that in mind, let's call things what they are: there are multiple companies salivating at the prospect of being able to make the working class obsolete. There are trillions to be made, in their minds.
> I would like to see numbers, results, or something of that nature
I would like the same thing! So far, we have seen that a very big company that had pledged, IIRC, to remain not-for-profit for the benefit of humanity sold out at the drop of a hat the moment they were able to hint at Zombocom levels of possibility to investors.
"There are two schools of thought, you see..."
Joking aside, I think that's a very valid point; I'm not sure what the nonreligious term would be for the amorality of "sins of omission"... But, in essence, one can clearly be unethical by ignoring the social responsibility we have to study who is affected by our actions.
Corporations can't really play dumb there, since they have to weigh the impacts of every project they undertake.
Also, side note... It's very telling how little control we (commoners?) have as a global society that — collectively — we're throwing mountains of cash at weapons and AI, which directly move us closer to oblivion and worsen the effects of climate change (despite the majority of people not wanting wars or to be replaced by a chatbot). I would instead favor world peace; ending poverty, famine, and genocide; and preventing further global warming.
If the best you can do is bring up this garbage, then you have nothing of value to say.
[1] https://www.lesswrong.com/posts/8ZgLYwBmB3vLavjKE/some-lesso...
I'm fairly certain this phenomenon is responsible for LLM capabilities on GeoGuessr-type games. They have unreasonably good performance; for example, they can identify obscure locations from featureless/foggy pictures of a bench. GeoGuessr's entire dataset, including GPS metadata, is definitely included in all of the frontier model training datasets, so it should be unsurprising that they have excellent performance in that domain.
No, it is not included; however, there must be quite a lot of pictures on the internet for most cities. GeoGuessr's data is the same as Google's Street View data, and it probably contains billions of 360-degree photos.
This is not uncommon. Bears aren't always tearing people apart; that's a movie trope with little connection to reality. Black bears in particular are smart and social enough to befriend their food sources.
But a hungry bear, or a bear with cubs, that's a different story. Even then bears may surprise you. Once in Alaska, a mama bear got me to babysit her cubs while she went fishing -- link: https://arachnoid.com/alaska2018/bears.html .
It just isn't plausible that anyone has actually done that. I'm sure some people include a small sample of them, though.
Why bother to create a copy, if it can be avoided, right?
This is a good rebuttal when someone quips that we “are about to run out of data”. There’s oh so much more, just not in the form of books and blogs.
They still kicked ass.
It seems like those AIs just have an awful lot of location familiarity. They've seen enough tagged photos to be able to pick up on the patterns, and generalize that to kicking ass at GeoGuessr.
An irony here is that math blogs like Tao's might not be in LLM training data, for the same reason they aren't accessible to screen readers - they're full of math, and the math is rendered as images, so it's nonsense if you can't read the images.
(The images on his blog do have alt text, but it's just the LaTeX code, which isn't much better.)
I wouldn't think an LLM would have an issue with that at all. I can see how a screen reader might, but it seems like the same problem a screen reader faces with any piece of code, not just LaTeX.
I remember running this experiment some time ago in a context where I was certain there was no possibility of tool use to encode/decode. Nowadays, it can be hard to be certain whether there is any tool use or not; in some cases, such as Mistral, the response is quick enough to make it unlikely there's any tool use.
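If you want to reproduce that kind of test, here is a minimal sketch (the proposition is just a placeholder; the only point is that the encoded string itself never appears in plain text, so a correct answer with tools disabled means the decoding happened "in the model's head"):

```python
import base64

# Placeholder proposition; swap in anything the model is unlikely to have
# seen already paired with its base64 encoding.
proposition = "For all sets A and B, A is a subset of the union of A and B."

encoded = base64.b64encode(proposition.encode("utf-8")).decode("ascii")
print(encoded)
# Paste only the encoded string into a chat with tools disabled and ask the
# model to decode it and say whether the statement is true.
```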
It "left out" the A in its decode and still correctly answered the proposition, either out of reflexive familiarity with the form or via metasyntactic reasoning over an implicit anaphor; I believe I recall this to be a formulation of one of the elementary axioms of set theory, though you will excuse me for omitting its name before coffee, which makes the pattern matching possibility seem somewhat more feasible. ('Seem' may work a little too hard there. But a minimally more novel challenge I think would be needed to really see more.)
There's lots of text in lots of languages about using an online base64 decoder, and nearly none at all about decoding the representation "in your head," which for humans would be a party trick akin to that one fellow who could see a city from a helicopter for 30 seconds and then perfectly reproduce it on paper from memory. It makes sense to me that a model trained on the Internet would "invent" the "metaphor" of an online decoder here, I think. What in its "experience" serves better as a description?
LLMs are extremely good at outputting LaTeX; ChatGPT will output LaTeX, which the website renders as such. Why do you think LLMs have trouble understanding it?
But the people writing the web page extraction pipelines also have to handle the alt text properly.
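As a rough illustration of what handling it properly could mean, here is a minimal sketch (assuming BeautifulSoup; the HTML snippet is made up) that keeps the LaTeX source of math images instead of dropping them:

```python
from bs4 import BeautifulSoup

def html_to_text_keeping_alt(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    for img in soup.find_all("img"):
        alt = img.get("alt", "")
        # Math images on blogs like Tao's carry the LaTeX source as alt text;
        # keep it inline rather than discarding the image.
        img.replace_with(f"${alt}$" if alt else "")
    return soup.get_text(separator=" ", strip=True)

print(html_to_text_keeping_alt(
    '<p>Suppose <img alt="x \\in A \\cap B" src="latex.php?x-in-a-cap-b"> holds.</p>'
))
```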
I've been working on implementing some E&M simulations with Claude Code, and it's so-so on the C++ and TERRIBLE at the actual math (multiplying a couple of 6x6 matrix differential operators is beyond it).
But I can dash off some notes and tell Claude to TeXify and the output is great.
Let’s say everyone agrees to refer to one hosted copy of a token “cat”, and instead generate a unique vector to represent their reference to “cat”.
Blam. Endless unique vectors which are nice and precise for parsing. No endless copies of arbitrary text like “cat”.
Now make that your globally distributed database to bootstrap AI chips from. The data-driven programming dream, where other machines on the network feed new machines their bootstrap.
The American tech industry is IBM now: stuck on the recent success of web SaaS and way behind on AI.
No, that's actually really easy. What's hard is coming up with original questions of a specific level of difficulty. And that's what you need for a competition.
To elaborate: it's really easy to find lots and lots of elementary, unsolved questions. But it's not clear whether you can actually solve them or how hard solving them is, so it's hard to judge the performance of LLMs on them.
> It is interesting that this rule has gone completely out of the window in the age of LLMs.
No, it hasn't.
And don't get me started on the decline in depth on technical topics and the surge in political discussions. I came to HN for the former, not the latter.
We are humans, so there will never be a perfect forum.
Perfect is in the eye of the moderator.
I am not saying trusting your memory is always false or true. Most of the time it might be true. It's a heuristic.
But if someone comes and denies what you did, the best course of action would be to consider the evidence they have and not assume they are stupid because they believe differently.
Let's be honest: you have not personally gone and verified that the rocks came from the Moon. Nor were you tracking the telemetry data on your computer when the rocket was going to the Moon.
I also believe we went to the Moon.
But all I have is beliefs.
Everyone believed the Earth was flat thousands of years back as well. They had solid evidence.
But humility is accepting that you don't know and that you are believing, and not pretending you are above others who believe the exact opposite.
As you say, you should have the humility to consider the evidence that others provide that you might be wrong. The thing with the various popular conspiracy theories is that the evidence is conspicuously missing when any competent good faith actor would be presenting it front and center.
I think you don't know what evidence means. You want proof and that's for mathematics.
You don't know that you exist. You could be a simulation.
Are they all equally bad? Or equally bad, but in different aspects? E.g., I often read here that X has more disinformation and right-wing propaganda, while Mastodon was called out here on another topic.
Maybe somebody active in different networks can answer that.
But in general, I'd say that the microblogging format as a whole encourages a number of toxic behaviors and interaction patterns.
I like the ARC-AGI approach because it shows both axes, score and price, and places a human benchmark on them.
When considering top tier labs that optimize inference and own the GPUs: the electricity cost of USD 5000 at a data center with 4 cents per kWh (which may be possible to arrange or beat in some counties in the US with special industrial contracts) can produce about 2 trillion tokens for the R1-0528 model using 120kW draw for the B200 NVL72 hardware and the (still to be fully optimized) sglang inference pipeline: https://lmsys.org/blog/2025-06-16-gb200-part-1/
Although 2T tokens is not unreasonable for getting high-precision answers to challenging math questions, such a high token count would strongly suggest there are lots of unknown techniques deployed at these labs.
If one adds the cost of GPU ownership or rental, say 2 USD/h/GPU, then the number of tokens for 5k USD shrinks dramatically to only 66B tokens, which is still high for usual techniques that try to optimize for a best single answer in the end, but perhaps plausible if the vast majority of these are intermediate thinking tokens and a lot of the value comes from LLM-based verification.
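A quick back-of-envelope check of those figures, with every input taken from the assumptions above (the per-rack throughput is just the aggregate rate implied by the 2T estimate, not a measured number):

```python
budget_usd        = 5_000
price_per_kwh     = 0.04       # assumed industrial electricity rate, USD/kWh
rack_power_kw     = 120        # assumed NVL72 rack draw
rack_tokens_per_s = 533_000    # aggregate decode rate implied by the 2T figure

# Electricity only
hours = budget_usd / (price_per_kwh * rack_power_kw)          # ~1042 hours
print(f"{rack_tokens_per_s * hours * 3600:.1e} tokens")       # ~2.0e12

# Renting the 72 GPUs at ~2 USD/h/GPU instead (electricity is now negligible)
hours = budget_usd / (2 * 72)                                 # ~35 hours
print(f"{rack_tokens_per_s * hours * 3600:.1e} tokens")       # ~6.7e10
```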
Interestingly, Tao mentions https://teorth.github.io/equational_theories/, and I believe this is better progress than LLMs doing math. I believe enhancing Lean with more tactics and formalizing those in Lean itself is a more fruitful avenue for AI in math.
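To make "formalizing tactics in Lean itself" concrete, here is a toy sketch in Lean 4 (not taken from the equational_theories project): a new tactic can be a one-liner that combines existing ones.

```lean
-- Toy example: a "new tactic" defined inside Lean itself.
macro "auto_nat" : tactic => `(tactic| first | rfl | decide)

example : 2 + 2 = 4 := by auto_nat   -- closed by rfl
example : 3 < 5 := by auto_nat       -- closed by decide
```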
LLMs, especially in /conjunction/ with Lean for formal validation, are really an exciting new frontier in mathematics and it's a mistake to see that as just "unreliable" versus "reliable" symbolic AI etc. The OP Terence Tao has been pushing the edge here since day one and providing, I think, the most unbiased perspective on where things stand today, strengths as much as limitations.
[1] https://isabelle.in.tum.de/website-Isabelle2009-1/sledgehamm...
NitpickLawyer•1d ago
It's really hard to trust anything public (for obvious reasons of dataset contamination), but also some private ones (for the obvious reasons that providers do get most/all of the questions over time, and they can do sneaky things with them).
The only true tests are the ones you write yourself, never publish, and only work 100% on open models. If you want to test commercial SotA models from time to time you need to consider them "burned", and come up with more tests.
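For what it's worth, such a private harness doesn't need to be fancy. A minimal sketch, assuming a locally hosted open model behind an OpenAI-compatible endpoint (the URL, model name, and questions are placeholders):

```python
import requests

PRIVATE_EVALS = [
    {"prompt": "If 3 pens cost 45 cents, how much do 7 pens cost, in cents?",
     "expect": "105"},
    # ...questions you never publish anywhere...
]

def ask(prompt: str) -> str:
    resp = requests.post(
        "http://localhost:8000/v1/chat/completions",   # placeholder endpoint
        json={"model": "local-open-model",             # placeholder model name
              "messages": [{"role": "user", "content": prompt}],
              "temperature": 0},
        timeout=120,
    )
    return resp.json()["choices"][0]["message"]["content"]

passed = sum(case["expect"] in ask(case["prompt"]) for case in PRIVATE_EVALS)
print(f"{passed}/{len(PRIVATE_EVALS)} private evals passed")
```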
antupis•1d ago
NitpickLawyer•1d ago
This was noticeable with the early Phi models. They were originally trained fully on synthetic data (cool experiment tbh), but the downside was that GPT-3/4 was "distilling" benchmark "hacks" into it. It became apparent when new benchmarks were released after Phi's publication date, and one of them measured "contamination" of about 20+%. Just from distillation.
rachofsunshine•1d ago
One is a measurement problem, a statement about the world as it is: an engineer who can finish such-and-such many steps of this coding task in such-and-such time has such-and-such chance of getting hired. The thing you're measuring isn't running away from you or trying to hide itself, because facts aren't conscious agents with the goal of misleading you. Measurement problems are problems of statistics and optimization, and their goal is a function f: states -> predictions. Your problems are usually problems of inputs, not problems of mathematics.
But the larger you get, and the more valuable gaming your test is, the more you leave that measurement problem and find an adversarial problem. Adversarial problems are at least as difficult as your adversary is intelligent, and they can sometimes be even worse by making your adversary the invisible hand of the market. You don't live in the world of gradient descent anymore, because the landscape is no longer fixed. You now live in the world of game theory, and your goal is a function f: (state) x (time) x (adversarial capability) x (history of your function f) -> predictions.
It's that last, recursive bit that really makes adversarial problems brutal. Very simple functions can rapidly result in extremely deep chaotic dynamics once you allow even the slightest bit of recursion - even very nice functions like f(x) = 3.5x(1-x) become writhing ergodic masses of confusion.
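A minimal sketch of that sensitivity, iterating the quadratic map from two nearly identical starting points (the classic logistic map is still periodic at r = 3.5 and only turns chaotic above roughly r = 3.57, so the chaotic case below uses r = 3.9):

```python
def iterate(r: float, x: float, steps: int) -> float:
    # Repeatedly apply f(x) = r * x * (1 - x)
    for _ in range(steps):
        x = r * x * (1 - x)
    return x

for r in (3.5, 3.9):
    a = iterate(r, 0.2, 60)
    b = iterate(r, 0.2 + 1e-9, 60)   # perturb the start by one part in a billion
    print(f"r={r}: difference after 60 steps = {abs(a - b):.2e}")
# The periodic case keeps the difference tiny; the chaotic case amplifies it
# by many orders of magnitude.
```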
visarga•20h ago
bwfan123•16h ago
pixl97•16h ago
klingon-3•23h ago
Just feed it into an LLM, unintentionally hint at your bias, and voila, it will use research and the latest or generated metrics to prove whatever you’d like.
> The only true tests are the ones you write yourself, never publish, and only work 100% on open models.
This may be good enough, and that’s fine if it is.
But, if you do it in-house in a closet with open models, you will have your own biases.
No tests are valid if all that ever mattered was the argument and perhaps curated evidence.
All tests, private and public, have proved flawed theories historically.
Truth has always been elusive and under siege.
People will always just believe things. Data is just foundation for pre-existing or fabricated beliefs. It’s the best rationale for faith, because in the end, faith is everything. Without it, there is nothing.
mmcnl•22h ago
crocowhile•20h ago
ACCount36•15h ago
Benchmarks are a really good option to have.