It's all just vibes. There is no good general benchmark for agents, and I think there can't be one: there are way too many degrees of freedom to measure anything useful. Agents are just a complicated tool for getting things done. It's like trying to write a general-purpose benchmark for a stack of 10 microservices together. It doesn't make sense; it depends entirely on your use case and your own metrics.
There were Yahoo Pipes and web-services frameworks, which rhyme with MCP and the agentic stuff.
All benchmarks are flawed.
Not all benchmarks are useless.
But these benchmarks are.
However the article does seem to be pointing out some fundamental issues. I'm particularly annoyed by using LLMs to evaluate the output of LLMs. Anyone with enough experience to be writing benchmarks of this sort in the first place ought to know that's a no-go. It isn't even just using "AI to evaluate AI" per se: using a judge with the same architecture as the thing being judged maximizes the probability that the benchmark is invalid, because the judge has exactly the same blind spots as the system under test. Since we currently lack a diversity of AI architectures that can play on the same level as LLMs, the only other known intelligence architecture, human brains, simply has to be in the loop for now, however many other difficulties that introduces into the testing procedures.
Tests that a "do nothing" AI can pass aren't intrinsically invalid but they should certainly be only a very small number of the tests. I'd go with low-single-digit percentage, not 38%. But I would say it should be above zero; we do want to test for the AI being excessively biased in the direction of "doing something", which is a valid failure state.
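A cheap sanity check along these lines: run a trivial do-nothing agent through the whole suite and see what it "passes". A minimal sketch, where `tasks` and the `scorer` callback are hypothetical stand-ins for whatever harness the benchmark actually uses:

```python
# Sketch: sanity-check a benchmark against a trivial "do nothing" baseline.
# `tasks` and `scorer` are hypothetical stand-ins for the real harness.

class NoOpAgent:
    """An agent that never takes an action and never produces output."""
    def run(self, task) -> str:
        return ""  # does nothing, says nothing

def noop_pass_rate(tasks, scorer) -> float:
    agent = NoOpAgent()
    passed = sum(1 for task in tasks if scorer(task, agent.run(task)))
    return passed / len(tasks)

# If this comes back anywhere near 38%, the suite is mostly testing
# "did the agent avoid breaking anything", not "did it do the job".
# A low-single-digit rate is what you'd want to see.
```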
+1, and IMO part of a general trend where we're just not serious about making sure this shit works. Higher scores make stonks go up, who cares if it actually leads to reliably working products.
But also, more importantly, it's starting to expose the fact that we haven't solved one of ML's core challenges: data collection and curation. On the training side we have obviated this somewhat (by ingesting the whole internet, for example), but on the eval side it feels like we're increasingly just going "actually constructing rigorous evaluation data, especially at this scale, would be very expensive... so let's not".
I was at a local tech meetup recently where a recruiting firm was proudly showing off the LLM-based system they're using to screen candidates. They... did not evaluate the end-to-end efficacy of their system. At all. This seems like a theme within our industry - we're deploying these systems based purely on vibes without any real quantification of efficacy.
Or in this case, we're quantifying efficacy... poorly.
I suspect quite a lot of the industry is actively _opposed_ to that, because it could be damaging for the "this changes everything" narrative.
Discriminating good answers is easier than generating them. Good evaluations write test sets for the discriminators to show when this is or isn't true. Evaluating the outputs as the user would see them is more representative than having your generator do multiple tasks (e.g. solve a math query and format the output as a multiple-choice answer).
Also, human labels are good but have problems of their own, it isn’t like by using a “different intelligence architecture” we elide all the possible errors. Good instructions to the evaluation model often translate directly to better human results, showing a correlation between these two sources of sampling intelligence.
Fundamentally I'm not disagreeing with the article, but I also think most people who care take the above approach, because if you do care you read samples, find the issues, and patch them so you can hill-climb on something better.
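For what it's worth, "writing test sets for the discriminators" can be as simple as keeping a small human-labeled set of (prompt, output, verdict) triples and measuring how often the LLM judge agrees before you trust it at scale. A rough sketch, with `llm_judge()` as a placeholder for whatever judge call you actually use:

```python
# Sketch: validate an LLM judge against human labels before trusting it.
# `llm_judge(prompt, output) -> bool` is a placeholder for your real judge call.

from dataclasses import dataclass

@dataclass
class LabeledExample:
    prompt: str
    output: str
    human_verdict: bool  # True = humans say this output is acceptable

def judge_agreement(examples: list[LabeledExample], llm_judge) -> float:
    """Fraction of human-labeled examples where the LLM judge agrees."""
    agree = sum(
        1 for ex in examples
        if llm_judge(ex.prompt, ex.output) == ex.human_verdict
    )
    return agree / len(examples)

# If agreement is low, or high only on the easy cases, the judge isn't ready
# to be the thing your headline benchmark number rests on.
```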
I don't think this is true for many fields - especially outside of math/programming. Let's say the task is "find the ten most promising energy startups in Europe." (This is essentially the sort of work I see people frequently talk about using research modes of models for here or on LinkedIn.)
In ye olden days pre-LLM you'd be able to easily filter out a bunch of bad answers from lazy humans since they'd be short, contain no detail, have a bunch of typos, formatting inconsistencies from copy-paste, etc. You can't do that for LLM output.
So unless you're a domain expert on European energy startups you can't check for a good answer without doing a LOT of homework. And if you're using a model that usually only looks at, say, the top two pages of Google results to try to figure this out, how is the validator going to do better than the original generator?
And what about when the top two pages of Google results start turning into model-generated blogspam?
If your benchmark can't evaluate prospective real-world tasks like this, it's of limited use.
A larger issue is that once your benchmark, which used this task as a criterion based on an expert's knowledge, is published, anyone making an AI agent is incredibly incentivized (intentionally or not!) to train specifically on this answer without necessarily getting any better at the fundamental steps in the task.
IMO you can never use an AI agent benchmark that is published on the internet more than once.
This is actually very wrong. Consider, for instance, the fact that the people who grade your tests in school are typically more talented, capable, and trained than the people taking the test. This is true even when an answer key exists.
> Also, human labels are good but have problems of their own,
Granted, but...
> it isn’t like by using a “different intelligence architecture” we elide all the possible errors
nobody is claiming this. We elide the specific, obvious problem that using a system to test itself gives you no reliable information. You need a control.
As a result, we always had a two-step evaluation process. We would use a suite of metrics to guide development progress (validation), but the final evaluation reported in a paper always involved subjective human listening experiments. This was expensive, but the only way to show that the codecs were actually improving.
Similarly, here it seems fine to use LLMs to judge your work in progress, but we should be requiring human evaluation for 'final' results.
They were great for taking to Grateful Dead concerts to record the music directly in front of the Wall of Sound, and to measure the response so you can play back all your Dead tapes with that same front row psychoacoustic perspective. ;)
https://www.grasacoustics.com/industries/kemar/applications-...
https://www.grasacoustics.com/products/accessories/product/4...
There is a simple improvement here: give the agent a "do nothing" button. That way it at least needs to understand the task well enough to know it should press the do nothing button.
Now a default agent that always presses it still shouldn't score 38%, but that's better than a NOP agent scoring 38%.
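Something like: put an explicit no-op in the action space, mark the (few) tasks where it is the correct answer, and make it an automatic fail everywhere else. A minimal sketch, with the task format invented for illustration:

```python
# Sketch: benchmark scoring with an explicit "do nothing" action.
# The task dict and agent interface are made up for illustration.

DO_NOTHING = "DO_NOTHING"

def score(task, agent_action) -> bool:
    if task["expected"] == DO_NOTHING:
        # Credit only an explicit decision to stand down, not silence or timeouts.
        return agent_action == DO_NOTHING
    # Otherwise a no-op is always wrong; the agent must produce the real action.
    return agent_action != DO_NOTHING and task["check"](agent_action)

# An agent that always presses the button scores exactly the share of
# "do nothing is correct" tasks -- which you can then keep deliberately small.
```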
That's what humans do all the time. What's the fundamental difference? Or are you saying that's also broken?
I know LLM evangelists love this "humans make mistakes too" line, but, really, only an _exceptionally_ incompetent human evaluator would fall for that one.
I cite the entire current education system. Substantiating that claim would take more than an HN comment allows, though I think most people can probably get the drift of what I'm talking about, even if we'd disagree about the details. Absolutely humans are not immune to this.
I also cite the entire concept of "fallacies", many of which are things that both human brains tend to produce and then tend to evaluate poorly. An alien species might find some of our fallacies absolutely transparent, and have entirely different fallacies of their own that none of us would find convincing in the slightest, because of fundamentally different brain architectures.
I don't think AIs are ready for this yet and I don't expect LLMs ever will be, but in the future getting an outsider perspective from them in a sort of Mixture of Experts architecture could be valuable for life decisions. (I look to the future AI architectures in which LLMs are just a component but not the whole.)
To some extent the same llm with a new context history and different prompt is sorta like that ... but still is much weaker than using a different system entirely.
This does seem a little crazy on its face, but it is yielding useful and improving tools.
See also a cousin comment of mine observing that human brains are absolutely susceptible to the same effect. We're just so used to it that it is the water we swim through. (And arguably human brains are more diverse than current AI systems functioning at this level. No bet on how long that will be true for, though.)
Such composite systems would still have their own characteristics and certainly wouldn't be guaranteed to be perfect or anything, but at least they would not tend to iteratively magnify their own individual flaws.
Perhaps someday we will have such diverse architectures. We don't today have anything that can evaluate LLMs other than human brains, though.
The first person to make steel made it without steel, didn't they?
Did I miss something?
Edit0: fun tidbit - Wootz steel was made with crucibles of clay with rice husks mixed in (husks would carbonize quickly and introduce air layers to better isolate) and many seemingly random objects (fruits, vegetation) were added to the crucible to control carbon content.
I highly recommend A Collection of Unmitigated Pedantry's series on steel (it's a blog; just search "ACOUP steel").
> Pass
Yeah, this generally feels like about the quality one would expect from the industry.
Of course, there are tasks we could benchmark:
* arithmetic (why would you use an LLM for that?)
* correct JSON syntax, correct command lines etc.
* looking for specific information in a text
* looking for a missing information in a text
* language logic (ifs then elses where we know the answer in advance)
But by Goodhart's Law, LLMs that have been trained to succeed on those benchmarks might lose capability on the other tasks where we really need them (fuzzy inputs, fuzzy outputs).
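To be fair, the first few items on that list really are cheap to score deterministically, no LLM judge required. A minimal sketch (the checks are illustrative, not any particular harness):

```python
# Sketch: deterministic checks for the mechanically verifiable tasks above.
import json

def check_arithmetic(expected: float, answer: str) -> bool:
    try:
        return abs(float(answer.strip()) - expected) < 1e-9
    except ValueError:
        return False

def check_json(answer: str) -> bool:
    try:
        json.loads(answer)
        return True
    except json.JSONDecodeError:
        return False

def check_extraction(expected_fact: str, answer: str) -> bool:
    # Crude containment check for "find this specific information in the text".
    return expected_fact.lower() in answer.lower()

# The fuzzy-input/fuzzy-output tasks are exactly the ones these can't touch.
```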
Because people ask LLMs all of these things, including arithmetic. People were saying the same about the number of r's in strawberry: why ask an LLM that!?!? But the big AI companies want LLMs to be better at these questions, probably because people keep asking them. The big AI companies clearly want this; there is no other explanation for the money poured into RLHF'ing these types of problems.
For people whose purpose is to produce reliably working systems yeah, training a model that calls out to deterministic logic to do things like math makes total sense. It will pretty much always be more reliable than training a text generation model to produce correct arithmetic.
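i.e. the model emits something like a calculator tool call and deterministic code does the actual arithmetic. A toy sketch; the tool-call shape here is invented, not any particular vendor's API:

```python
# Toy sketch of "call out to deterministic logic for math" instead of
# trusting generated digits. The tool-call format is invented.
import ast
import operator

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def safe_eval(expr: str) -> float:
    """Evaluate a plain arithmetic expression without exec/eval."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval").body)

# A model trained to emit {"tool": "calculator", "expr": "1234*5678"} and
# splice the result back in will beat one trained to "know" the answer.
print(safe_eval("1234*5678"))  # 7006652
```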
But it feels like there's another side of the industry that's more concerned with... I dunno, metaphysical aspects of these models? Where the idea that the model is a stochastic ball that isn't conscious, isn't thinking, and does poorly at various tasks is anathema. So the effort continues to try and train and fine-tune these models until... something.
It reminds me of the great Tesla-vs-everyone-else self-driving debates that raged over the past several years. Lots of people unhappy that the best-functioning systems fused many sensor types and a mixture of heuristic and machine-learned systems in a complex architecture. These folks insisted that the "best" architecture was an end-to-end machine-learned system based entirely on visible light cameras. Because it's "most human" or some other such nonsense. As far as I can tell there was never any merit to this position beyond some abstract notion of architectural purity.
Same thing here I suppose.