I don't think they understand how LLMs work. To truly benchmark a non-deterministic, probabilistic model, they are going to need to run each one about 45 times. LLMs are distributions and behave accordingly.
The better story is how the models behave on the same problem after 5 samples, 15 samples, and 45 samples.
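To make the sample-size point concrete, here is a rough sketch of how much uncertainty remains around a measured pass rate after 5, 15, and 45 runs. It uses the Wilson score interval, which is my choice rather than anything the benchmark does; the function name `wilson_interval` and the 60% pass counts are purely illustrative assumptions.

```python
import math


def wilson_interval(passes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate observed as passes/runs.

    z = 1.96 corresponds to roughly 95% confidence.
    """
    if runs <= 0 or not 0 <= passes <= runs:
        raise ValueError("need runs > 0 and 0 <= passes <= runs")
    p_hat = passes / runs
    denom = 1 + z * z / runs
    center = (p_hat + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / runs + z * z / (4 * runs * runs)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical data: the same model passes 60% of runs at each sample size.
for passes, runs in ((3, 5), (9, 15), (27, 45)):
    low, high = wilson_interval(passes, runs)
    print(f"{passes}/{runs} passed: plausible pass rate {low:.2f} to {high:.2f}")
```

The interval narrows as the number of runs grows, which is why a single run per model says little about where a model actually sits.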
That said, using lambda calculus is a brilliant subject for benchmarking.
The models are reliably incorrect. [0]
It's funny that even one of the strongest research labs in China (DeepSeek) has said there's still a gap to Opus, after releasing a humongous 1.6T model, yet the internet goes crazy and we now have people claiming [1] a 27B dense model is "as good as Opus"...
I'm a huge fan of local models, have been using them regularly ever since devstral1 released, but you really have to adapt to their limitations if you want to do anything productive. Same as with other "cheap" "Opus killers" from China. Some work, some look like they work, but they go haywire at the first contact with a real, non-benchmarked task.
Honestly, I was "happy" with AI from the December 2025 time frame or even earlier. Yes, what's come after has been smarter, faster, cleverer, but the biggest boost in productivity was just the release of Opus 4.5 and GPT 5.2/5.3.
And yes it might be a competitive disadvantage for an engineer not to have access to the SOTA models from Anthropic/OpenAI, but at the same time I feel like the missing piece at this point is improvements in the tooling/harness/review tools, not better-yet models.
They already write more than we can keep up with.
The problem is the overhypers: they overhype small/open models and make it sound like they are close to the SotA. They really aren't. It's one thing to say "this small model is good enough to handle some tasks in production code", and it's a different thing to say "close to Opus". One makes sense; the other just sets the wrong expectations and is obviously false.
I'm probably going to have to make it myself.
Being able to use it as a rubber duck while it can also read the code works quite well.
There are a few APIs at work I have never worked on and the person that wrote them no longer works with us so AI fills that gap well.
Nevertheless, I do not believe that either OpenAI or Anthropic or Google know any secret sauce for better training LLMs. I believe that their current superiority is just due to brute force. This means that their LLMs are bigger and they have been trained on much more data than the other LLM producers have been able to access.
Moreover, for myself, I can extract much more value from an LLM that is not constrained by a metered per-token cost and where I have full control over the harness used to run the model. Even if the OpenAI or Anthropic models had been much better than the competing models, I would still have been able to accomplish more useful work with an open-weights model.
I have already passed once through the transition from fast mainframes and minicomputers that I was accessing remotely by sharing them with other users, to slow personal computers over which I had absolute control. Despite the differences in theoretical performance, I could do much more with a PC and the same is true when I have absolute control over an LLM.
For the OpenAI and Anthropic models, it is clear that they have been run by their owners, but for the other models there are a great number of options for running them, which may run the full models or only quantized variants, with very different performance.
For instance, in the model list there are both "moonshotai/kimi-k2.6" and "kimi-k2.6", with very different results, but there is no information about what the difference is between these two labels, which refer to the same LLM.
Moreover, as others have said, such a benchmark does not prove that a certain cheaper model cannot solve a problem. It happened to not solve it within the benchmark, but running it multiple times, possibly with adjusted prompts, may still solve the problem.
While for commercial models running them many times can be too expensive, when you run an LLM locally you can afford to run it many more times than when you are afraid of the token price or of reaching the subscription limits.
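As a rough illustration of why repeated runs matter, here is a sketch of the commonly used unbiased pass@k estimator: given n recorded runs of which c succeeded, it estimates the chance that at least one of k randomly chosen runs solves the task. The numbers (45 runs, 3 successes) are made up for illustration and are not results from this benchmark.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one success among k runs) from n runs with c successes.

    Computes 1 - C(n - c, k) / C(n, k), i.e. one minus the probability that
    all k runs drawn without replacement are failures.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # fewer than k failures exist, so any k runs include a success
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical model that solved the task in only 3 of 45 runs.
for k in (1, 5, 15):
    print(f"pass@{k} = {pass_at_k(45, 3, k):.3f}")
```

A model that fails most single attempts can still be fairly likely to produce a correct answer somewhere in a batch, provided you have a way to check the answers and can afford the extra runs.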
I think/hope we'll see a 4.2 that looks a lot better, same as 3.2 was quite competitive at the time it launched.
Also, being from Victor Taelin, shouldn't this be benching Interaction Combinators? :)
tromp•3h ago
An example task (writing a lambda calculus evaluator) can be seen at https://github.com/VictorTaelin/lambench/blob/main/tsk/algo_...
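For readers unfamiliar with what a lambda calculus evaluator involves, here is a minimal sketch. It is not the benchmark's task specification (I have not reproduced that here); the de Bruijn-index representation, the normal-order strategy, and every name in it (`Var`, `Lam`, `App`, `normalize`) are assumptions chosen for brevity.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Var:
    index: int  # de Bruijn index: 0 is the nearest enclosing lambda


@dataclass(frozen=True)
class Lam:
    body: "Term"


@dataclass(frozen=True)
class App:
    func: "Term"
    arg: "Term"


Term = Union[Var, Lam, App]


def shift(term: Term, by: int, cutoff: int = 0) -> Term:
    """Add `by` to every free variable (index >= cutoff) in term."""
    if isinstance(term, Var):
        return Var(term.index + by) if term.index >= cutoff else term
    if isinstance(term, Lam):
        return Lam(shift(term.body, by, cutoff + 1))
    return App(shift(term.func, by, cutoff), shift(term.arg, by, cutoff))


def subst(term: Term, value: Term, depth: int = 0) -> Term:
    """Replace variable `depth` with value and lower the variables above it."""
    if isinstance(term, Var):
        if term.index == depth:
            return shift(value, depth)
        return Var(term.index - 1) if term.index > depth else term
    if isinstance(term, Lam):
        return Lam(subst(term.body, value, depth + 1))
    return App(subst(term.func, value, depth), subst(term.arg, value, depth))


def step(term: Term) -> Optional[Term]:
    """One normal-order (leftmost-outermost) reduction step, or None if stuck."""
    if isinstance(term, App):
        if isinstance(term.func, Lam):
            return subst(term.func.body, term.arg)  # beta reduction
        reduced = step(term.func)
        if reduced is not None:
            return App(reduced, term.arg)
        reduced = step(term.arg)
        return App(term.func, reduced) if reduced is not None else None
    if isinstance(term, Lam):
        reduced = step(term.body)
        return Lam(reduced) if reduced is not None else None
    return None


def normalize(term: Term, max_steps: int = 1000) -> Term:
    """Reduce to normal form, giving up after max_steps reductions."""
    for _ in range(max_steps):
        reduced = step(term)
        if reduced is None:
            return term
        term = reduced
    raise RuntimeError("no normal form found within max_steps")


# Church numerals: two = \f.\x. f (f x), succ = \n.\f.\x. f (n f x)
two = Lam(Lam(App(Var(1), App(Var(1), Var(0)))))
three = Lam(Lam(App(Var(1), App(Var(1), App(Var(1), Var(0))))))
succ = Lam(Lam(Lam(App(Var(1), App(App(Var(2), Var(1)), Var(0))))))
print(normalize(App(succ, two)) == three)
```

The step limit is there because some terms have no normal form, so an evaluator that loops forever is a failure mode worth guarding against.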
Curiously, gpt-5.5 is noticeably worse than gpt-5.4, and opus-4.7 is slightly worse than opus-4.6.
lioeters•26m ago