Curious to see some third-party testing of this model. Based on the benchmarks, it currently seems to improve primarily on "general non-coding and visual reasoning".
Also, the confidence interval for such a small dataset is about 3 percentage points, so these differences could just be down to chance.
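For anyone who wants to sanity-check that ~3 point figure, here is a rough sketch. It assumes each benchmark question is an independent pass/fail and uses the plain normal-approximation (Wald) interval at 95%. The benchmark size and score below are made-up placeholders, not numbers from the announcement.

```python
import math

def ci_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation interval for a pass rate p over n items."""
    return z * math.sqrt(p * (1 - p) / n)

def n_for_half_width(p: float, half_width: float, z: float = 1.96) -> int:
    """Smallest n whose interval half-width is at most `half_width` at pass rate p."""
    return math.ceil(z ** 2 * p * (1 - p) / half_width ** 2)

# Hypothetical benchmark: 1000 questions, 50% pass rate (worst case for variance).
print(f"half-width: {ci_half_width(0.5, 1000):.3f}")   # roughly 0.031, i.e. about 3 points
print(f"questions needed for +/-3 points: {n_for_half_width(0.5, 0.03)}")
```

Under these assumptions you need on the order of a thousand questions before a 3-point gap between two models means much, and the interval on the difference between two scores is wider still.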
I have scripted prompts for long-duration automated coding workflows of the fire-and-forget, issue description -> pull request variety. Sonnet 4 does better than you’d expect: it generates high-quality, mergeable code about half the time. Sonnet 4.5 fails literally every time.
Will run extended benchmarks later, let me know if you want to see actual data.
It generated a Blender script that makes the model.
I guess OpenSCAD would be a sweet spot in the middle. Good shout, might experiment.
Results are amazing! 2.5 and 3 seem way, way ahead.
2.5 stands between GPT-5 and GPT-5.1, where GPT-5 is the best of the 3.
In preliminary evals Gemini 3 seems to be way better than all, but I will know when I run extended benchmarks tonight.
Not trying to challenge you, and I'd sincerely love to read your response. People said similar things about previous gen-AI tool announcements that proved over time to be overstated. Is there some reason to put more weight in "what people on HN said" in this case, compared to previous situations?
1. They likely work at the company (and have RSUs that need to go up)
2. Also invested in the company in the open market or have active call options.
3. Trying to sell you their "AI product".
4. All of the above.
Using Anthropic's or OpenAI's models is incredibly straightforward -- pay us per month, here's the button you press, great.
Where do I go to get that for these Google models?
Does that have any relation to the Gemini plan thing: https://one.google.com/explore-plan/gemini-advanced?utm_sour... ?
I haven't seen it in the box yet, and pricing is unknown https://cloud.google.com/blog/products/ai-machine-learning/r...
I absolutely LOVE that Google themselves drew a sharp distinction here.