Yet all of my tests show o3 blows o4-mini out of the water.
What are you classifying as intelligence?
A bit too much praise for a model that's barely ahead of the competition in a subset of benchmarks...
> To be honest, this model not only competes with other AI models but also with humans, making it the first of its kind
I'm out
Where did this number come from? What is "the right tool"? I find this extremely subjective. As most engineers know, there is no right tool, only a compromise where you pick the least-bad tool and decide which risks you're willing to manage.
This is just my speculation, though, as I've never used any version of Grok.
This isn't even science at this point; it's pure cargo cult.
Is this a joke?
Results:
Claude: ~10s, perfect working demo
ChatGPT: ~20s, solid solution
Grok 4: ~1000s, failed completely, gave me a truncated base64 blob
This wasn't some obscure edge case... it was basic data visualization that any decent model should handle. Yet somehow Grok 4 is "competing with humans" and has "99% tool accuracy"...
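For context, the task was on the order of the following; this is a minimal sketch in Python/matplotlib, assuming a simple bar chart, since the actual prompt and data aren't shown in the thread:

    import matplotlib.pyplot as plt

    # Hypothetical stand-in for the "basic data visualization" task;
    # the real prompt and dataset are not in the thread.
    months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
    revenue = [12.4, 15.1, 14.8, 18.2, 21.0, 19.6]

    # Draw a labeled bar chart and display it.
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.bar(months, revenue, color="steelblue")
    ax.set_title("Monthly revenue (hypothetical data)")
    ax.set_ylabel("Revenue ($k)")
    fig.tight_layout()
    plt.show()

Anything at roughly this level of difficulty is what a competent model should one-shot in seconds.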
I don't buy it.
Links:
Claude: https://claude.ai/share/7a413a6a-5c01-44a1-aaed-8b237e5e9e94
ChatGPT: https://chatgpt.com/canvas/shared/687a9f9d4304819187ac7d98d3...
Grok 4: https://grok.com/share/c2hhcmQtMw%3D%3D_20b61291-e1bb-45e5-a...
These benchmarks are either just wrong or measuring something completely divorced from practical utility imo...