Midnight New York Time
5am London Time
12pm Hong Kong Time
Why?
I will add that, as an unfair smell test, the very name "Humanity's Last Exam" implies an arrogant contempt for scientific reasoning, and I would not be at all surprised if they were corrupt in a similar way to Frontier Math and OpenAI - maybe xAI funded HLE in exchange for peeking at the questions.
Grok 4 was probably already training when o3 was released, and now that Grok 4 is out, OpenAI is probably preparing o4, Google is preparing Gemini 3, and soon new SOTA benchmark scores will appear.
So it is impressive but not surprising, no? Whoever releases the latest model and has sufficient compute will be SOTA.
EDIT: They're announcing big jumps in a lot of benchmarks. TIL they have an API one could use to check this out - it seems like xAI really has something here.
Yes, but... in order to train your next SotA model you have to do this anyway and do rejection sampling to generate good synthetic data.
So if you can do it in prod for users paying $300/month, it's a pretty good deal.
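Roughly, the loop being described - sample many candidates per prompt, keep only the ones a verifier accepts - might look like the sketch below; `model` and `verifier` are hypothetical stand-ins, not any real API.

```python
# Rejection sampling for synthetic data: over-generate, then filter.
def generate_synthetic_data(model, verifier, prompts, samples_per_prompt=16):
    dataset = []
    for prompt in prompts:
        candidates = [model.sample(prompt) for _ in range(samples_per_prompt)]
        # Keep only candidates the verifier accepts
        # (unit tests, answer checking, a reward model, etc.).
        dataset.extend((prompt, c) for c in candidates if verifier(prompt, c))
    return dataset
```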
But maybe that's simply the solution, like the solution to original neural nets was (perhaps too simply put) to wait for exponentially better/faster hardware.
Pointy sticks and ASML's EUV machines were designed by roughly the same lumps of compute-fat :)
The brain is not a monolith.
I struggle to imagine how much further a purely text-based system can be pushed - a system that basically knows that 1+1=2 not because it has built an internal model of arithmetic, but because it estimates that the sequence `1+1=` is mostly followed by `2`.
Myself, I'm looking forward to trying it out when companies with less, um, baggage implement the same. (I have principles I try to maintain.)
We went from "single prompt, single output" to reasoning (simple brute-forcing), and now to multiple parallel instances of reasoning (distributed brute-forcing)?
No wonder the prices are increasing and capacity is more limited.
Impressive. /s
Specialized coding model coming "in a few weeks". I notice they didn't talk about coding performance very much today.
That said, these are HUGE improvements. Provided we don't have benchmark contamination, this should be a very popular daily driver.
On coding - 256k context is the only real bit of bad news. I would guess their v7 model will have longer context, especially if it’s better at video. Either way, I’m looking forward to trying it.
What I've noticed when testing previous versions of Grok: on paper they scored better on benchmarks, but in actual use the responses were always worse than Sonnet's and Gemini's.
Occasionally I test Grok to see if it could become my daily driver but it's never produced better answers than Claude or Gemini for me, regardless of what their marketing shows.
Every human learns that word by sound: when you hear "strawberry" you don't hear the double r, yet you still know the answer.
It’s more like asking a human for the Fourier components of how they pronounce “strawberry”. I mean the audio waves are right there, why don’t you know?
This is incorrect.
"strawberry" is actually 4 tokens (at least for GPT, but most LLMs are similar).
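For anyone who wants to check a split themselves, here's a minimal probe using OpenAI's tiktoken library; the count depends on which tokenizer a given model uses, so treat the output as illustrative rather than a universal answer.

```python
# Requires: pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a GPT-era encoding
tokens = enc.encode("strawberry")
print(len(tokens), [enc.decode([t]) for t in tokens])  # count and the pieces
```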
> Grok 4 is xAI's latest reasoning model with a 256k context window. It supports parallel tool calling, structured outputs, and both image and text inputs. Note that reasoning is not exposed, reasoning cannot be disabled, and the reasoning effort cannot be specified.
Unfortunately no requests are going through because of some rate limits.
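For reference, a minimal sketch of calling it once the limits clear, on the assumption that xAI's endpoint is OpenAI-compatible; the base URL and `grok-4` model id here are guesses worth checking against their docs.

```python
from openai import OpenAI

# Assumed endpoint and model id - verify against xAI's documentation.
client = OpenAI(base_url="https://api.x.ai/v1", api_key="YOUR_XAI_API_KEY")
resp = client.chat.completions.create(
    model="grok-4",
    messages=[{"role": "user", "content": "Say hello in five words."}],
)
print(resp.choices[0].message.content)
```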
This is just a for-fun test to get a sense of how models are progressing; it highlights the jagged nature of their intelligence and capabilities. None of the big AI labs are testing for such a basic problem type, which makes it a bit of an interesting check.
I think it's still interesting to see how Grok 4 performs, even if we don't use this test to draw any broader conclusions about what capabilities it offers.
They also have not released a model card, and I suspect they never will.
Can you name an Elon company that is not number 1 globally in terms of product capabilities?
The only one I would've been able to name would've been Grok. Until yesterday.
None of the neuroscience people I follow think much of Neuralink; none of the civil engineers I've talked to IRL think much of TBC; none of the car people I follow favour Tesla over the huge range of competitors, and that includes the robo-taxi where they're about 6.5 years behind Waymo; X.com is so painful that whenever someone shares a link with me, I edit the URL to Xcancel.com *because that loads faster by a bigger margin than the time taken to edit the URL* and actually shows me the thread without needing an account of my own.
But the space nerds I follow are still impressed with SpaceX, and they have extremely obvious reasons to be impressed.
[0] https://devblogs.microsoft.com/foundry/announcing-grok-3-and... [1] https://www.bbc.co.uk/news/articles/cdxvr3n7wlxo
As a huge Musk fan, I'll be the first to point out that he's doing exactly what he accused Sama of doing: making powerful AI with an obvious lack of control or effective alignment.
There is so much money and so many top labs falling over themselves to attract good talent, that at this point people have to be leaning on ideological goals to choose their employer.
Are there really that many AI researchers who want to make Elon god-emperor?
Can you say what you mean by deep research?
https://x.ai/news/grok-3#grok-agents-combining-reasoning-and...
A neutral 3rd party.
See his just-removed-after-public-outcry instruction to disregard "political correctness", which immediately resulted in it calling itself MechaHitler - or his previous instructions to try to cry about reverse racism in South Africa.
Hope FB brings something like this tho. Might be especially useful to summarize/search big groups.
People used to cry about how private groups and Slack killed forums and hid information, but I think we have a chance with tools like this.
The only two areas I've found Grok to be the best at are real time updates and IT support questions.
I was pleasantly surprised that Grok even supports (to some degree) Lithuanian in voice mode, which is a quite niche language. Grok's responses themselves are alright, but ChatGPT and Gemini way surpass it in speech recognition and speech synthesis.
Also would be great if they added voice mode in browser (again like perplexity).
There seems to be a voice mode button in the prompt input box at ~29:00 of the Grok 4 announcement video. So perhaps they're working on this, but it's hidden from the public.
You can circumvent that by instructing the model to use "radio etiquette" - only respond after the other party says "over". It will still be compelled to answer when it detects silence - you can't prevent that - but you can instruct it to only reply with a short "mhm" until you say "over". Feels very natural.
Like most models I've used with this old hack, it will immediately start role-playing and also end its own responses with "over".
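For anyone who wants to try it, the instruction might be worded something like this (hypothetical phrasing, adjust to taste):

```python
# Hypothetical "radio etiquette" system prompt for a voice-mode session.
RADIO_ETIQUETTE = """\
We are talking over a two-way radio, so radio etiquette applies:
- Hold your full answer until I say "over".
- If you detect silence before that, reply only with a short "mhm".
- End each of your own replies with "over".
"""
```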
I can recall the first experiments with Dota 2 while he was still "in charge" of OpenAI.
[0] https://openai.com/index/openai-elon-musk/
[1] https://www.goodreads.com/book/show/223400731-the-optimist
When he left OpenAI the stated reason was conflict of interests: Tesla was ramping up work on self driving.
He also hired A. Karpathy away from OpenAI to lead Tesla's AI vision effort.
And the fact that Sam from the very start wanted to turn it into his own closed source for-profit company (still ongoing) using non-profit funding as start-up seed funds (essentially stealing Elon Musk's money)?
Edit: a few chats seem to indicate a mid-2024 cutoff.
https://deepmind.google/discover/blog/improving-language-mod...
Grok 4 Heavy is not in the API.
Pulled out of my ass, I'd say a 95% chance. NYT Connections is a fairly popular puzzle, it's been out for more than 2 years, and even if this particular GitHub repository with the prompts and methodology wasn't in the training data, it's almost guaranteed that other information, problems and solutions from NYT Connections is in any of the other datasets.
We want benchmarks to be representative of performance in general (in novel problems with novel data we don't have answers for), not merely of memorization of this specific dataset.
LLM weights are, in a very real sense, lossy compression of the training data. If Grok is scoring better, it speaks to the fidelity of their lossy compression as compared to others.
When a model is "lossy" and can't reproduce the data by copying, it's forced to come up with rules to synthesise the answers instead, and this is usually the "intelligent" behavior we want. It should be forced to learn how multiplication works instead of storing every combination of numbers as a fact.
Compression is related to intelligence: https://en.wikipedia.org/wiki/Kolmogorov_complexity
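To make the compression point concrete, here's a toy contrast: a lookup table's description grows with the data it covers, while the rule stays tiny and generalizes to pairs it has never seen.

```python
# Memorization: 10,000 stored facts, useless outside the covered range.
table = {(a, b): a * b for a in range(100) for b in range(100)}

# The "rule": a constant-size description that generalizes.
def mul(a, b):
    return a * b

print(len(table))           # 10000
print((123, 456) in table)  # False - outside the memorized range
print(mul(123, 456))        # 56088 - the rule still works
```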
Reasoning isn't an on-off switch. It's a hill that needs climbing. The models are getting better at complex and novel tasks.
I've played around with both, yes, I'd also personally say that v2 is harder. Overall a better benchmark. ARC-AGI-3 will be a set of interactive games. I think they're moving in the right direction if they want to measure general reasoning.
This belief leads to the thinking that LLMs can only give correct output if they can match it to data in their "model corpus".
I wish AI companies would do this.
To guard against potential training data contamination, I separately calculate the score using only the newest 100 puzzles. Grok 4 still leads.
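A sketch of that guard, assuming a made-up results format of one (puzzle_date, solved) pair per puzzle; the benchmark's real data layout may differ.

```python
from datetime import date

results = [(date(2023, 6, 12), True), (date(2025, 7, 1), False)]  # one per puzzle

results.sort(key=lambda r: r[0])  # oldest first
newest = results[-100:]           # score only the newest 100 puzzles
score = 100 * sum(solved for _, solved in newest) / len(newest)
print(f"newest-100 score: {score:.1f}%")
```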
I can already use Gemini 2.5 Pro for free in AI studio. Crazier still, I can even set the thinking budget to a whopping 32k and still not pay a dime. Maybe Gemini 3.0 will be available for free as well.
The vast majority of the world can't afford hundreds of dollars a month.
Google replaced non-thinking Flash with Flash-Lite. It rebalanced the cost of thinking Flash.
It is Google. So I'd pay attention to data collection feeding back into training or evaluation.
Pricing the competition out & then turning the screws on locked-in users.
Prices for the same number of tokens at a given level of capability are falling. But Moore's law most certainly did NOT say that chips would get no more complex than the 1103 1-kilobit DRAM and simply shrink from 10mm^2 to a speck far too small to see - the top end kept growing even as cost per transistor fell.
A Ferrari is more expensive than the Model T.
The most expensive computer is a lot more expensive than the first PC.
The price that usually falls is:
* The entry level.
* The same performance over time.
But the _price range_ gets wider. That's fine. That's a sign of maturity.
The only difference this time is that the entry level was artificially 0 (or very low) because of VC funding.
If it could write like George Will or Thomas Sowell or Fred Hayek or even William Loeb that would be one thing. But it hears dog whistles and barks which makes it a dog. Except a real dog is soft and has a warm breath, knows your scent, is genuinely happy when you come home and will take a chomp out of the leg of anyone who invades your home at night.
We are also getting this kind of discussion
https://news.ycombinator.com/item?id=44502981
where Grok exhibited the kind of behavior that puts "degenerate" in "degenerate behavior". Why do people expect anything more? Ten years ago you could be a conservative with a conscience -- now if you are you start The Bulwark.
Having only barely heard of these authors even in the collective, I bet most models could do a better job of mimicking their style than I could. Perhaps not well enough to be of interest to you, and I will absolutely agree that LLMs are "low intelligence" in the sense that they need far more examples than any organic life does, but many of them will have had those examples and I definitely have not.
> We are also getting this kind of discussion
> https://news.ycombinator.com/item?id=44502981
Even just a few years ago, people were acting as if a "smart" AI automatically meant a "moral AI".
Unfortunately, these things can be both capable* and unpleasant.
* which doesn't require them to be "properly intelligent"
Not if you're only looking at modern PCs (and adjusting for inflation). It seems unfair to compare a computer built for a data center with tens of thousands of dollars in GPUs to a PC from back then, as opposed to a mainframe.
Well, valuations keep increasing, they have to make the calculations work somehow.
Like the other AI companies, they will want to sign up companies.
I don't remember anyone promising that, but whoever promised you that frontier public model pricing would be monotonically decreasing over some period that includes our current present was either lying or badly misguided. While there will be short-term deviations, the overall arc will continue to be upward.
OTOH, the models available at any given price point will also radically improve, to the point where you can follow a curve of both increasing quality and decreasing price, so long as you don't want a model at the quality frontier.
"This nearly doubles the previous commercial SOTA and tops the current Kaggle competition SOTA."
Please stop.
Look up.
I need your help.
Watch him jump.
It's time to sleep.
Try to keep.
Take one more step.
We love to shop.
Climb to the top.
Fill the cup.
Board the ship.
Don't move your lip.
Shake your hip.
Here's a good tip.
Use the whip.
Do a quick flip.
Hold on with grip.
Plan the trip.
Let it drop.
Start to chop.
Are you fucking kidding me?
> This is what everyone @xAI does. Works better than Cursor.
This makes no sense to me whatsoever.
Musk obviously didn't test Cursor, and either got this from his yesmen, or he's just lying unchecked as usual.
1. Musk didn't test Cursor
2. Yesmen
3. Lying
Shows much more about your biases than anything related to Grok 4 usage
I had Gemini CLI running trying to do a straightforward refactor today, but when I copy-pasted the relevant code into the Gemini web app, it came up with the solution instantly.
For comparison, the Claude 4 hacker news post received > 2k upvotes https://news.ycombinator.com/item?id=44063703
Goodhart's Law means 2 is approximately always true.
As it happens, we also have a lot of AI benchmarks to choose from.
Unfortunately this means every model basically has a vibe score right now, as the real independent tests are rapidly saturated into the "ooh shiny" region of the graph. Even the people working on e.g. the ARC-AGI benchmark don't think their own test is the last word.
This is a 50-minute video; many won't bother to watch it.
To me, AGI is achieved when the machine can improve itself and reproduce in a way that allows survival of the fittest and evolution to take place, though I’m sure when those goals are achieved someone will redefine AGI to be something even more unattainable.
PS: Is the approach something like LoRA, or a complete retrain of the visual part?
We completely remove a couple of simple, obvious inventions from the training data and then see if the AI can come up with them. Perhaps a toothbrush, for example. Or a comb? But there could be better examples that would also have minimal effect on the final AI.
Training is expensive so we wouldn’t want to leave anything important out like the wheel.
I have no idea why this is a PDF, but here's a transcript: https://ecorner.stanford.edu/wp-content/uploads/sites/2/2023...
Another idea would be to use, for example, a 2024 state of the art model to try to predict discoveries or events from 2025.
LLMs have already dramatically changed our industry, and I can't fathom what the possibilities could look like in the future when these models become smarter.
Right now there is a rush, with companies pouring millions into R&D, so there is certainly hype, but I have no doubt that this will yield incremental improvements over the next few decades - the sum of which will look like a breakthrough in Computer Science and Engineering.
I remained a skeptic for a long time (and still am), but after messing around with these LLMs I can't ignore the fact that they have significantly boosted my productivity. It takes time to learn how to work with these tools, and they require supervision and review, but I feel better leveraging LLMs than writing code from scratch for every feature.
What will our job look like in the next 30 years? It's hard to say but I doubt most of us will be writing code by hand.
Does anybody have an example of a company that made some huge product with close to no developers by using those AIs? Or of something harder to create than what we are used to, made possible by using the AIs? Or anything else that shows that "LLMs have already dramatically changed our industry"?
You don't have to go as far as "the whole product with zero engineers", but arguing against productivity gains from AI and agents because these tools still can't run a billion-dollar business by themselves is strange.
I too know I am being more productive. The most concrete examples for my work has come from the ease of prototyping: making a quick quasi-working version of an idea is now insanely easy, so we’ve been able to explore (and adopt) ideas that would not have been worth the effort previously.
These are the words of a billionaire who has been supporting authoritarian and ethno-nationalist movements across the world, including playing a key role in the authoritarian takeover of the US government. He wants to instill “truth-seeking” as a “value” in Grok in anticipation of its future power.
But the authoritarian ethno-nationalist version of “truth” is not one based on science and objectivity. It’s the misanthropic “truth” widespread among ethnic-nationalist and authoritarian ideologies - “truth” that appeals to billionaires and disenfranchised members of the working class alike because it provides scapegoats without challenging the structural origins of that very disenfranchisement. A real commitment to truth would mean seeing past the exploitive power structure that Elon and billionaires like him inhabit.