Does it actually work? Hasn’t AI training so far simply ignored all license and copyright restrictions completely?
I definitely trust the totally private dataset more.
Older, non-instruction tuned models needed post-training on public datasets to even reliably produce meaningful answers.
Now we're testing tasks that are so complex that the LLM should reasonably be expected to answer without additional post-training.
Once you have a public dataset, even feeding those examples to an LLM and producing synthetic variations is enough to let you game the benchmark. And the worst part is you don't need to be unethical to do this: some people would say it's just a good way to expand your training data, even though it incidentally lets you overfit on the task without overfitting on the public dataset.
So everyone's doing stuff like that, and we're getting models that are increasingly overfit to a few narrow tasks.
-
The alternative is just giving detailed plain-English descriptions of the tasks in question. Those can be used to generate synthetic tasks, but won't result in matching the benchmark's "shape" perfectly (as long as the questions stay hidden), and that alone is enough to ensure some level of generalization takes place.
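A rough sketch of what that description-only approach could look like; `call_llm` stands in for whatever model API you use, and the description text is invented for illustration:

    from typing import Callable

    # Invented example of a plain-English task-family description.
    TASK_DESCRIPTION = (
        "Each task shows a short bug report and a code snippet; the answer "
        "is a one-line patch that fixes the described behavior."
    )

    def generate_synthetic_tasks(call_llm: Callable[[str], str], n: int) -> list[str]:
        """Generate n synthetic tasks from the description only, never from hidden questions."""
        tasks = []
        for i in range(n):
            prompt = (
                f"Task family description:\n{TASK_DESCRIPTION}\n\n"
                "Invent one new, self-contained task of this kind, together with "
                f"its answer. Make it distinct from earlier samples (sample {i})."
            )
            tasks.append(call_llm(prompt))
        return tasks

Because the generator only ever sees the description, the synthetic tasks can train the skill without leaking the benchmark's exact items.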
Hope they’re addressing that at the same time.
While I haven’t dug into the details of this benchmark, this absolutely matches my personal experience.
Assuming “semantic correctness” is meant in the sense of Rice’s theorem and runtime behavior.
While syntactic correctness has dramatically improved, security, architectural erosion, and other long-term issues have not.
Unfortunately, Rice’s theorem also applies to finite programs in finite time.
Actually it can apply to total functions in the general case.
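For reference, the usual argument is a reduction from the halting problem; a minimal sketch, where `simulate` is a hypothetical interpreter that runs a program on an input and returns only if that run halts:

    # Deciding the semantic property "g returns 0 on every input" would
    # let us decide halting: g below diverges exactly when `program`
    # fails to halt on `inp`.
    def build_g(program, inp, simulate):
        def g(x):
            simulate(program, inp)  # returns only if `program` halts on `inp`
            return 0
        return g

    # g satisfies the property iff `program` halts on `inp`, so no total
    # decider for that semantic property can exist.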
I am still optimistic that coding agents will provide value long term in some fashion.
But the open-domain frame problem simply reduces to the halting problem, and yes, humans struggle with it too.
But fundamentally, PAC learning has to be reduced to _trivial_ problems with only true/false labels.
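For context, the textbook (realizable) PAC definition is indeed stated over binary labels; roughly:

    % Standard (realizable) PAC learnability over binary labels.
    A concept class $\mathcal{C} \subseteq \{0,1\}^{\mathcal{X}}$ is PAC-learnable
    if there exist an algorithm $A$ and a polynomial $m(\cdot,\cdot)$ such that for
    every target $c \in \mathcal{C}$, every distribution $D$ over $\mathcal{X}$, and
    all $\varepsilon, \delta \in (0,1)$: given $m(1/\varepsilon, 1/\delta)$ i.i.d.
    samples $(x, c(x))$ with $x \sim D$, $A$ returns a hypothesis $h$ satisfying
    \[
      \Pr\bigl[\;\Pr_{x \sim D}[h(x) \neq c(x)] \le \varepsilon\;\bigr] \ge 1 - \delta .
    \]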
We have found clever ways to work within these limitations, but they still exist.
Hopefully we find clever ways to keep humans engaged with the code, while gaining the potential force multiplier that ML may offer.
The long-tailed problems are particularly important, and while human SREs make mistakes and organizations often have constraints that add to the problem, SREs do a lot more to avoid those long-tailed problems than they are given credit for.
IMHO that has always been one of the hardest parts of the industry and a true measure for what makes great team members.
Unfortunately the metrics and incentives often don’t capture that value.
https://github.com/google-deepmind/bbeh?tab=readme-ov-file
https://github.com/google/lmeval
I hesitate to say this lest folks adapt, but does anyone else immediately distrust a repo when it has a bunch of emojis in the README? It is often a giveaway that they were LLM-generated.
siliconc0w•2h ago