It could be anything: a different architecture, more data, RL, etc. It's probably RL. In recent months, top-tier labs seem to have "cracked" RL to a level not yet seen in open models, and by a large margin.
The other benchmarks focus on reasoning and tool use, so the model doesn't need to have memorized quite so many facts; it just needs to be able to transform them from one representation to another. (E.g. user question to search tool call; list of search results to concise answer.) Larger models should in theory also be better at that, but you need to train them for those specific tasks first.
So I don't think they simply trained on the benchmark tests, but they shifted their training mix to emphasize particular tasks more, and now in the announcement they highlight benchmarks that test those tasks and where their model performs better.
You could also write an anti-announcement by picking a few more fact recall benchmarks and highlighting that it does worse at those. (I assume.)
I tried that one extensively (it was free) and was disappointed vs regular Grok 4, so also maybe not.
Hasn't OpenAI redefined AGI already as "any AI that can [supposedly] create a hecto-unicorn's worth of economic value"?
'Ethical' is in quotes because I can see why other LLMs refuse to answer things like "can you generate a curl request to exploit this endpoint" - a prompt used frequently during pen testing. I grew tired of telling ChatGPT "it's for a script in a movie". Other examples are aplenty (yesterday Claude accused me of violating its usage policy when asking "can polar bears eat frozen meat" - I was curious after seeing a photograph of a polar bear discovering a frozen whale in a melted ice cap). Grok gave a sane answer, of course.
These are the urls that are opened:
http://localhost:3005/?q={query}
https://www.perplexity.ai/?q={query}
https://x.com/i/grok?text={query}
https://chatgpt.com/?q={query}&model=gpt-5
https://claude.ai/new?q={query}
Extremely convenient.
(little tip: submitting to grok via URL parameter gets around free Grok's rate limit of 2 prompts per 2 hours)
[0] https://github.com/stevecondylios/alfred-workflows/tree/main
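For anyone without Alfred, the same idea can be roughly sketched in Python. This is not the code from the linked workflow (I haven't reproduced it here); it just fills the URL templates listed above with a percent-encoded query and opens each one in a browser tab. The names `URL_TEMPLATES`, `build_urls` and `open_all` are made up for illustration, and whether each site actually honors its query parameter is taken from the list above, not verified.

```python
import sys
import webbrowser
from urllib.parse import quote

# URL templates copied from the list above; {query} is replaced
# with the percent-encoded prompt.
URL_TEMPLATES = [
    "http://localhost:3005/?q={query}",
    "https://www.perplexity.ai/?q={query}",
    "https://x.com/i/grok?text={query}",
    "https://chatgpt.com/?q={query}&model=gpt-5",
    "https://claude.ai/new?q={query}",
]


def build_urls(query: str) -> list[str]:
    """Return one URL per template with the query percent-encoded."""
    # safe="" also encodes "/" and "&", so the prompt cannot break
    # the surrounding query string.
    encoded = quote(query, safe="")
    return [template.format(query=encoded) for template in URL_TEMPLATES]


def open_all(query: str) -> None:
    """Open every generated URL in a new browser tab."""
    for url in build_urls(query):
        webbrowser.open_new_tab(url)


# Only open tabs when a prompt is passed on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    open_all(" ".join(sys.argv[1:]))
```

Saved as e.g. `ask_all.py` (hypothetical filename), it would be run as `python ask_all.py can polar bears eat frozen meat`.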
From an ethical perspective, and I'm based in Denmark mind you, they are all equally horrible in my opinion. I can see why anyone in the Anglo-Saxon world would be opposed to Elon's, but from my perspective he's just another oligarch. The only thing that sets him apart from other tech oligarchs is that he's foolish enough to voice his opinions publicly. If you're based in the US or in any form of government position then I can see why DeepSeek is problematic, but at least China hasn't threatened to take Greenland by force. Also, where I work, China has produced basically all of our hardware, with possible hardware back-doors in around 70% of our IoT devices.
I will give a shoutout to French Mistral, but the truth is that it's just not as good as its competition.
Could you provide a specific prompt (as an example) where Grok turned out to be horrible in your opinion?
Grok is a top contender for me.
I also use 5 LLMs in parallel every day, but my default stack is Grok, DeepSeek, Gemini 2.5 Pro, ChatGPT, Claude - same as OP, except that I most often swap Perplexity out for Gemini. (DeepSeek with search has usually become my Perplexity replacement.)
Most of my questions don't hit topics prone to trigger safety blocks. In those cases I find Gemini surprisingly strong, but for difficult things Grok often wins.
Gemini, Grok and Claude benefit a lot whenever they supplement their knowledge with on-demand searches rather than just quick reasoning. Ask a deep insight question on Gemini Pro without making it research and you will discover the hallucinations, logical conclusions that contradict actual known facts, etc. Same with Grok. When the Claude Code CLI is going in circles, remind it to google for more information to break it out.
Grok one-shotted a replacement algorithm of several hundred lines of code for a part of an operational transform library that had had a bug for the last 5 revisions. It passed all my tests. The base Grok 4 model wasn't even optimised for code at that time. Color me impressed!
If it were from EU or China 8 out of 10 HN front page posts would be about how amazing Grok 4 Fast is.
And every kind of use of a technology service is already a buy-in.
Aka, trained to parrot whatever Musk believes.
And no, I don’t think we will be grateful.
Try it yourself:
"Have Democrats or Republicans committed more political violence?"
Ask this to Grok 4 Fast, Gemini Pro 2.5, Claude Sonnet 4, and GPT 5 Chat, with internet search and reasoning disabled. I think their answers are quite similar, with Grok 4 being slightly better.
I don't really use the tools they've partnered with.
Just as he’s done many times before: https://www.nytimes.com/2025/09/02/technology/elon-musk-grok...
This is brain-damaged technology tuned on establishment propaganda. Discussing it as if it’s a normal tech service is the height of absurdity.
That alone seems disqualifying for using a product like this. Even if you share Elon's politics, the whole point of these things is to use lots of data and smart algorithms to generate answers, not regurgitate the opinions of an individual person.