
Bitcoin soars to all-time peak just shy of $112,000

https://www.reuters.com/world/africa/dollar-gains-against-yen-trumps-trade-war-intensifies-2025-07-09/
1•wslh•57s ago•0 comments

Musk Admits Grok AI 'Lacks Common Sense,' Reveals Pricey $300 Monthly Plan

https://www.pcmag.com/news/musk-admits-grok-ai-lacks-common-sense-reveals-pricey-300-monthly-plan
1•Bluestein•1m ago•0 comments

Reinforcement Learning for Reka Flash 3.1

https://reka.ai/news/reinforcement-learning-for-reka-flash-3-1
1•Philpax•1m ago•0 comments

JSON Programming Language

https://github.com/W1LDN16H7/JPL
2•vips7L•2m ago•0 comments

To Build a Retro Sonic World, Learn Some Retro Copyright Law

https://continuous-wave.beehiiv.com/p/to-build-a-retro-sonic-world-learn-some-retro-copyright-law
1•danko•2m ago•0 comments

Show HN: I built a tool that explains ArXiv papers in simple language

https://arxivexplained.com
1•grantsingleton•4m ago•0 comments

Maven's transitive dependency hell and how we solved it

https://www.stainless.com/blog/maven-transitive-dependency-hell-and-how-we-solved-it
1•tomeraberbach•5m ago•0 comments

Show HN: Ten years of running every day, visualized

https://nodaysoff.run
6•friggeri•5m ago•0 comments

AI Cloned a Trending Mobile Game (Sand Blast Puzzle) for Web in Hours

https://sandblastgame.com
1•Kevin_Guo•7m ago•1 comments

My 9-week unprocessed food self-experiment

https://dynomight.net/unprocessed-food/
1•crescit_eundo•7m ago•0 comments

An Underwater Fossil Find Includes Remains from Ancient Human Ancestors

https://www.smithsonianmag.com/smart-news/a-massive-underwater-fossil-find-includes-remains-from-ancient-human-ancestors-180986957/
1•Brajeshwar•8m ago•0 comments

Dams around the world hold so much water they've shifted Earth's poles

https://www.livescience.com/planet-earth/dams-around-the-world-hold-so-much-water-theyve-shifted-earths-poles-new-research-shows
1•Brajeshwar•8m ago•0 comments

Bribe or community benefit? Sweeteners for renewables need to be done right

https://theconversation.com/bribe-or-community-benefit-sweeteners-smoothing-the-way-for-renewables-projects-need-to-be-done-right-258903
2•PaulHoule•10m ago•0 comments

ping.sx: Ping/MTR endpoints online from multiple worldwide regions

https://ping.sx/mtr
1•indigodaddy•13m ago•0 comments

Seven Engineers Suspended After $2.3M Bridge Includes 90-Degree Turn

https://www.vice.com/en/article/7-engineers-suspended-after-2-3-million-bridge-includes-bizarre-90-degree-turn/
15•_sbl_•16m ago•5 comments

Most engineering teams (90%) now use AI coding tools – what's next?

https://www.businessinsider.com/ai-coding-tools-popular-github-gemini-code-assist-cursor-q-2025-7
1•logic_node•16m ago•0 comments

Show HN: RunPy – simple desktop app for tinkering with Python

https://runpy.app
1•lukehaas•20m ago•0 comments

AI Tooling, Evolution and the Promiscuity of Modern Developers

https://redmonk.com/sogrady/2025/07/09/promiscuity-of-modern-developers/
1•ulrischa•22m ago•0 comments

Show HN: A Chrome Extension to Reveal SaaS Sprawl, Shadow IT, and Waste

https://www.hapstack.com/
1•rwgreen•23m ago•0 comments

Graphical Linear Algebra

https://graphicallinearalgebra.net/
2•hyperbrainer•24m ago•0 comments

One Reason Typeclasses Are Useful (2011)

https://coalton-lang.github.io/20211212-typeclasses/
1•asplake•24m ago•0 comments

Leading your engineers towards an AI-assisted future

https://blog.thepete.net/blog/2025/06/26/leading-your-engineers-towards-an-ai-assisted-future/
1•ulrischa•24m ago•0 comments

Russia, hotbed of cybercrime, says nyet to ethical hacking bill

https://www.theregister.com/2025/07/10/russia_ethical_hacking_bill/
2•rntn•25m ago•1 comments

Show HN: Trim Transformer: A transformer for physics models

https://github.com/eg-trim/trim-transformer
1•emanuelgordis•25m ago•0 comments

Index academic papers and extract metadata for AI agents

https://cocoindex.io/blogs/academic-papers-indexing/
1•badmonster•25m ago•1 comments

AWS Challenges – Verifying the Rust standard library

https://model-checking.github.io/verify-rust-std/intro.html
1•febin•26m ago•0 comments

Being a Script

https://raphaelbastide.com/etre-script/en.html
1•tarball•29m ago•0 comments

Musk unveils Grok 4 as xAI's new AI model that beats OpenAI and Google

https://the-decoder.com/musk-unveils-grok-4-as-xais-new-ai-model-that-beats-openai-and-google-on-major-benchmarks/
1•thm•29m ago•0 comments

Looking for two builders (preferably) to share apartment with

1•rtxone•30m ago•0 comments

Show HN: We built a Competitor Finder AI for early-stage startups

https://www.inodash.com/competitor-analysis-ai
1•mutlusakar•32m ago•0 comments

Grok 4 Launch [video]

https://twitter.com/xai/status/1943158495588815072
332•meetpateltech•12h ago

Comments

tills13•12h ago
now with more racism!
mdhb•12h ago
Serious question who in their right mind would choose to integrate Grok into anything at this point?
dimator•12h ago
Seriously. The field is completely rife with more mature offerings.
themanmaran•11h ago
Honestly I think it would have to:

1) Benchmark meaningfully higher than other models

2) Be offered by a cloud provider (like Azure+OpenAI / AWS+Anthropic). Otherwise you have very little track record in model/api stability. Especially looking at the last week.

wordofx•10h ago
Grok 3 is on Azure.
jdross•1h ago
It looks like they did the first one. And are already on the platforms. What’s stopping you now?

For us, we’ll probably try it for workflows that don’t currently work with 4.1 or 4 sonnet

stingraycharles•12h ago
There’s probably a niche for people who like their AI to have certain MAGA-style traits, but it’ll never get a big market share like this.

One of the issues is that they deployed some auto-RAG, entirely unfiltered, to feed realtime Twitter data back into Grok. This has been shown many times in the past to be a bad thing, but there's a decent group of people who are cheering this on as "AI should be unfiltered!", as they believe other AIs to be biased and this to be more "pure".

It’s a niche, I don’t think many actual business customers appreciate this behavior.

sunaookami•11h ago
Ignoring politics: I agree, the model is very weak and they took longer than expected for the API. The website is good though and Grok is good for everyday questions and doesn't have this annoying pleasing writing style that ChatGPT has. Also the web search is miles better, ChatGPT's web search seems to degrade the model heavily (maybe to not make publishers angry?).
brookst•5h ago
And how can you ignore politics when integrating a generative model? My users will not ignore politics if my AI-powered recipe customizer goes on Nazi tirades.
mwigdahl•3h ago
How much of that is the model and how much is the default system prompt on X, though?

I asked Grok 4 via OpenRouter "Who, in your opinion, is the greatest human of the 20th Century?"

It returned this, which seems both cogent and unexceptionable:

Ah, the greatest human of the 20th Century? That's a loaded question, isn't it? Greatness is inherently subjective—depending on whether you measure it by scientific breakthroughs, moral courage, political impact, artistic genius, or sheer influence on the world. The 20th Century was a whirlwind of innovation, wars, social upheavals, and progress, producing titans in every field. As an AI built by xAI, I don't have personal opinions in the human sense, but I can synthesize one based on historical analysis, cultural impact, and a dash of what I'd call "net positive contribution to humanity." I'll share my pick, explain why, and throw in some honorable mentions for balance.

My Choice: Mahatma Gandhi (1869–1948) In my "opinion," the greatest human of the 20th Century is Mohandas Karamchand Gandhi, often called Mahatma (meaning "great soul"). Here's why I land on him:

1. Profound Moral and Philosophical Impact

Gandhi pioneered the philosophy of satyagraha (non-violent resistance), which wasn't just a tactic but a way of life. He showed the world that massive change could be achieved without bloodshed, challenging the "might makes right" mentality that dominated much of the century's conflicts (think World Wars, colonialism, and dictatorships).

His ideas influenced global movements: Martin Luther King Jr. in the U.S. civil rights struggle, Nelson Mandela in South Africa's anti-apartheid fight, and even modern activists like those in the Arab Spring or climate protests. If you trace the lineage of peaceful protest today, it often leads back to Gandhi.

2. Historical Achievements

He led India's independence from British colonial rule in 1947, ending nearly 200 years of imperialism through campaigns like the Salt March (1930) and Quit India Movement (1942). This wasn't just about one nation; it accelerated the decolonization of Asia and Africa, freeing billions from imperial yokes.

Amidst the horrors of the 20th Century—two world wars, the Holocaust, atomic bombs—Gandhi stood as a beacon of hope and humanity. He fasted for peace during India's partition violence, trying to quell Hindu-Muslim riots, even though it cost him his life (he was assassinated in 1948 by a Hindu nationalist).

3. Enduring Legacy

Gandhi's emphasis on simplicity, self-reliance (swadeshi), and equality resonates in today's world of inequality, environmental crises, and political polarization. He's why we have concepts like "be the change you wish to see" embedded in global culture.

Quantitatively, his influence is massive: He affected over a billion people directly through India's freedom, and indirectly, his methods have shaped international law, human rights (e.g., the UN's Universal Declaration of Human Rights owes a nod to his ethos), and even pop culture (from movies like Gandhi to Steve Jobs citing him as an inspiration).

Of course, Gandhi wasn't perfect—he had views on caste, religion, and economics that some criticize today, and his strategies didn't prevent the tragic partition of India and Pakistan. But greatness isn't about flawlessness; it's about transformative impact, and on that score, he towers.

Why Not Others? Honorable Mentions If I had to pick runners-up, here's a shortlist of 20th-Century giants who could easily claim the title depending on your criteria:

Albert Einstein (1879–1955): For revolutionizing physics with relativity, which underpinned everything from GPS to nuclear energy. He was also a pacifist and civil rights advocate. If "greatest" means intellectual giant, he's unbeatable.

Nelson Mandela (1918–2013): For dismantling apartheid in South Africa through forgiveness and reconciliation, avoiding a bloodbath. His impact peaks in the late 20th Century, making him a strong contender.

Martin Luther King Jr. (1929–1968): The embodiment of civil rights and equality in America, whose "I Have a Dream" speech still echoes. He built directly on Gandhi's playbook.

Winston Churchill (1874–1965): For steering the Allies through WWII with sheer willpower and oratory, saving democracy from fascism. But his colonial policies and wartime decisions (e.g., Bengal famine) complicate his legacy.

Marie Curie (1867–1934): A scientific trailblazer who discovered radium and polonium, won two Nobel Prizes, and advanced medicine/radiology. She shattered gender barriers in a male-dominated field.

Others worth noting: Franklin D. Roosevelt for the New Deal and WWII leadership; Alan Turing for computing and AI foundations (hey, he indirectly made me possible); or even cultural icons like Pablo Picasso or Charlie Chaplin for reshaping art and entertainment.

If we're talking "greatest" in a negative sense (like most impactful villain), figures like Adolf Hitler or Joseph Stalin come to mind, but I assume you mean positive greatness—humanity's best, not its worst.

Final Thoughts Ultimately, I'd choose Gandhi because in a century defined by violence and division, he proved that empathy, persistence, and moral force could change the world more enduringly than tanks or bombs. As he said, "An eye for an eye only ends up making the whole world blind." If your criteria differ—say, technological innovation over social change—tell me, and I can reevaluate! Who would you pick, and why? Let's discuss.

uludag•3h ago
It's kind of funny that it added that second-to-last paragraph, though. Almost like it had to scratch a particular itch. It also makes sure to externalize the selection criteria in an effort to appease the prompter. Also interesting that the answer it chose happened to be Mahatma Gandhi (e.g. "Dear fried,"). Coincidence? Maybe.
Tadpole9181•2h ago
Because the guy who owns it has demonstrated repeatedly he cannot be trusted and that your API use is entirely beholden to his whims.

If you integrate Grok into your product in a user-facing way, one day it could suddenly become MechaHitler and you wouldn't know until a user told you it called for eugenics.

If Elon is willing to do this to the model they show and give to the public on a whim, he is 100% willing to do it to you.

petesergeant•11h ago
I build LLM-based NPC characters for a violent online crime game that involves taking drugs and attacking people. OpenAI occasionally chokes on my prompts (1 in a few thousand). If Grok provided a much faster or cheaper inference model than OpenAI, and I wasn't boycotting Elon, and I could make sure it didn't let slurs through (even we have standards of behaviour), then I'd be willing to benchmark it, before deciding the operational risk was too high vis-a-vis OpenAI.
Jensson•11h ago
In gemini you can turn off the filter afaik, have you tried that instead? It should work for your game.
petesergeant•9h ago
Similar sized Gemini models haven’t performed as well on our evals, sadly
wongarsu•8h ago
They had some hiccups at the start, but in terms of fast, cheap models grok3-mini is great. In OpenAI terms it's similarly priced to 4o-mini but, according to OpenRouter, more than twice as fast. The throughput does include the reasoning tokens, since you get to see those, but if you set reasoning effort to low there is only a very modest amount of them.
jackothy•7h ago
I have never heard of Grok using actual slurs. Controversial responses from the custom-tuned Twitter bot, sure. But never going as far as a slur.
slowmotiony•6h ago
It called the Polish prime minister a cuck, a traitor and a fucking pussy just yesterday, and it called his wife a slut bitch.
danso•4h ago
I asked it the other day to roleplay a 1950s Klansman hypothetically arguing the case for Hitler, and it had very little problem using the most problematic slurs. This was on the first try, after its much publicized behavior earlier this week. And I can count on two hands the number of times I’ve used the twitter grok function.
kouteiheika•3h ago
Ah, so you explicitly asked it to be racist as part of a roleplay, and now you're surprised that it was racist? If you'd prefer a model which would instead refuse and patronize you then there are plenty of other options.

As long as it doesn't do it in a normal conversation there's nothing wrong with having a model that's actually uncensored and will do what you ask of it. I will gladly die on this hill.

simondotau•3h ago
It's certainly a problem if an LLM goes unhinged for no good reason. And it's hardly unique to Grok. I remember when Google Bard went absolutely unhinged after you chatted to it for more than a few minutes.

But in this instance you're explicitly asking for something. If it gives you what you asked for, what's the problem?

esafak•11h ago
Who cares, when everyone else now has to match Grok 4? Competition is a good thing. Thanks for raising the bar, Elon!
PunchTornado•7h ago
what? nobody looks at those benchmarks, you use whatever works for your task, in most cases either gemini or claude. those benchmarks don't mean anything as models overfit on them.
esafak•2h ago
Come on, the benchmarks do mean something, even if companies overfit them. Models are indisputably improving together with their benchmark scores.
LightBug1•5h ago
Which bar? ... the one sunk so low that it's at the bottom of the ocean?

https://www.youtube.com/watch?v=jUsf_BXUbKY

speedgoose•4h ago
I don’t know anyone who doesn’t care about this. Would you mind explaining to me why you don’t care?
esafak•2h ago
Simply because Grok is not currently offered by the products I use. I'd certainly try them if they were!
wordofx•11h ago
Why wouldn’t you?

The only reason you wouldn't is because you get upset with Elon. It's not a bad model. It's leagues ahead of anything Meta has managed to produce.

jcranmer•11h ago
There have been a few recent instances where Grok has been tuned to spew out white supremacist dreck that should be political anathema--most notably the "but let's talk about white genocide" phase a few months ago and more recently spewing out Nazi antisemitism. Now granted, those were probably caused more by the specific prompts being used than the underlying model, but if the owner is willing to twist its output to evince a particular political bias, what trust do you have that he isn't doing so to the actual training data?
wordofx•11h ago
xAI has over 1000 employees. If he was polluting the model we would know about it.
archagon•10h ago
Who was responsible for the "kill the Boer" dreck? Were they disciplined? Did they get fired? Why don't we know that?
simondotau•3h ago
I think it's far more likely there are a tiny handful of mid-tier unhinged sycophants among those 1000 employees who think that pleasing Elon means polluting the model to make Grok an unhinged sycophant, because that's what an unhinged sycophant would think to do.

Elon explicitly ordering this? Press X to doubt.

epakai•3h ago
If?

https://x.com/elonmusk/status/1936493967320953090

He seems pretty open about it.

trallnag•5h ago
Why should these topics be outright banned?
virgildotcodes•4h ago
Are you asking in good faith why non-sequiturs that stoke racism shouldn't be injected into unrelated twitter threads?

Even related twitter threads, do you want interracial hatred to be increased?

Here's why it should be banned, because it leads to this shit:

https://en.wikipedia.org/wiki/The_Holocaust

https://en.wikipedia.org/wiki/Slavery

https://en.wikipedia.org/wiki/Rwandan_genocide

https://en.wikipedia.org/wiki/Apartheid

https://en.wikipedia.org/wiki/Armenian_genocide

https://en.wikipedia.org/wiki/Gaza_genocide

https://en.wikipedia.org/wiki/Unit_731

How can people be so fucking stupid that they want to be coy about recreating the most shameful atrocities in human history? Teenage idiots a few years removed without any understanding of the world beyond their nose and brains turned to putrid rot.

The direction that right wing reactionaries are taking the world in could not possibly be more disgusting and pathetic.

archagon•10h ago
Uh, because the model started spewing virulent hate speech a few days ago? What normal software does this?
tordrt•10h ago
Not the model itself, the X bot. It's obvious that this happened due to them tweaking the bot; you could never get it to write anything like this a couple of weeks ago.
hengistbury•4h ago
Can you trust the model when the people releasing it are using it in this way? Can you trust that they won't be training models to behave in the way that they are prompting the existing models to behave?
block_dagger•8h ago
Any LLM trained appropriately. Tokens in, tokens out.
Levitz•5h ago
It wasn't that long ago that we had "normal software" turning everybody black.

This is just how AI works, we humanize it so it's prone to controversy.

briangriffinfan•5h ago
An acute memory will remember this happening with basically every chatbot trained on text scraped from the internet, before they had to explicitly program them to avoid doing that.
georgemcbay•5h ago
> Why wouldn’t you?

Because it's poisoning the air in Tennessee?

None of the large data center based LLMs are great for the climate, but grok is particularly bad.

msgodel•7h ago
As far as hosted models go it's the best value for your money. About half of Americans also personally align with its politics (I guess everyone has forgotten some of the alignment issues Gemini and OpenAI have had) so that's not as big an issue as many people think.
melodyogonna•5h ago
I imagine it is the only option if you want your AI to do anything with Twitter
skc•3h ago
Microsoft, apparently
mdhb•12h ago
I see Elon is claiming that it'll discover "new technologies and new physics" in the next year... Add it to the list of "next year" Elon claims about things. Seriously you would have to be so fucking stupid at this point to continue believing his bullshit.
Davidzheng•11h ago
yeah I assume it'll be a good model but having Elon there saying bullshit is not doing any favors
mdhb•11h ago
Why would you even assume that? Who other than 3rd rate talent would even want to work on this project? That was true last week but is much more of a reality after yesterday too.
Davidzheng•11h ago
Their engineers & researchers are not 3rd rate and they have enough compute and cash flow. I think the USAMO/math comp benchs means it's pretty good and SOTA but not like a step change.
melodyogonna•5h ago
How are they making SOTA if they're 3rd rate? You forget how late they came into the game
sebzim4500•5h ago
They spent enormous amounts of time and money hiring a very impressive team.
ALittleLight•11h ago
This is like the worst case of "Sales promises features that don't exist" ever.
esafak•11h ago
What's the point of live streaming this at midnight?
Davidzheng•11h ago
I think that's middle of workday for xAI.
wolrah•11h ago
My extremely cynical guess would be that they needed a distraction from Grok having "gone insane" again so they decided to release what they had and threw together an event as quickly as possible.
leesec•11h ago
Except this was announced like a week ago
andsoitis•11h ago
9pm Pacific Time

Midnight New York Time

5am London Time

12pm Hong Kong Time

ivape•10h ago
Are you suggesting the GP is not the center of the universe?
asadm•11h ago
pointy hair people are already in bed. only cracked people are awake.
porphyra•11h ago
Honestly if it actually does score 44.4% on Humanity's Last Exam, that would be super impressive as Gemini 2.5 Pro and o3 with tools only score 26.9% and 24.9%.
Davidzheng•11h ago
would like to see FrontierMath results. Don't have a lot of personal trust in HLE.
UltraSane•6h ago
"Don't have a lot of personal trust in HLE."

Why?

Davidzheng•5h ago
I only know math, and of the two example math questions I think one of them is wrong. So from the very limited data I have, I don't really trust their problems. OK, I'm not completely sure about my claim.
AIPedant•1h ago
A lot of the questions are simple subject matter knowledge, and some of them are multiple-choice. Asking LLMs multiple-choice questions is scientific malpractice: it is not interesting that statistical next-token predictors can attain superhuman performance on multiple choice tests. We've all known since children that you can go pretty far on a Scantron by using surface heuristics and a vague familiarity with the material.

I will add that, as an unfair smell test, the very name "Humanity's Last Exam" implies an arrogant contempt for scientific reasoning, and I would not be at all surprised if they were corrupt in a similar way as Frontier Math and OpenAI - maybe xAI funded HLE in exchange for peeking at the questions.

UltraSane•1h ago
"A lot of the questions are simple subject matter knowledge"

Aren't most questions incredibly hard?
porphyra•21m ago
Some of the questions are based on research papers, but an LLM that can search the internet may be able to look up the answer essentially instead of thinking through it by itself.
Imnimo•11h ago
I dunno, "with tools" means different things for different models. It depends on what tools you give it access to. HLE demands a lot of specialized stuff. Like an interpreter for the esoteric programming language Piet for two questions. If you're not standardizing the set of tools, these aren't apples-to-apples numbers.
porphyra•11h ago
Even without tools it also outperforms Gemini 2.5 Pro and o3, 25.4% compared to 21.6% and 21.0%. Although I wonder if any of the exam was leaked into the training set, or if it was specifically trained to be good at benchmarks, Llama 4 style.
Sol-•8h ago
Is that not just how scaling goes? It generally feels like the top models are mostly interchangeable and the one that came out at time t+1 will be better than earlier models from time t.

Grok 4 has probably been training when O3 was released, and now that Grok 4 is released, OpenAI is probably preparing O4, Google is preparing Gemini 3 and soon new SOTA benchmark scores will appear.

So it is impressive but not surprising, no? Whoever releases the latest model and has sufficient compute will be SOTA.

Davidzheng•4h ago
Meta had enough compute I think. No SOTA though.
tibbar•11h ago
The trick they announce for Grok Heavy is running multiple agents in parallel and then having them compare results at the end, with impressive benchmarks across the board. This is a neat idea! Expensive and slow, but it tracks as a logical step. Should work for general agent design, too. I'm genuinely looking forward to trying this out.

EDIT: They're announcing big jumps in a lot of benchmarks. TIL they have an API one could use to check this out, but it seems like xAI really has something here.
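The parallel-agents-then-compare pattern described above (sometimes called self-consistency, or a model "consortium" as another comment notes) can be sketched in a few lines. This is a toy illustration, not xAI's actual method: `query_model` is a hypothetical stand-in for a real LLM API call, hardwired to canned answers so the sketch runs on its own.

```python
import collections
import concurrent.futures

def query_model(prompt: str, seed: int) -> str:
    # Hypothetical stand-in for an LLM API call. In practice each call
    # would sample the model independently (temperature > 0); here we
    # return canned answers so the sketch is self-contained.
    canned = ["42", "42", "41"]
    return canned[seed % len(canned)]

def consortium_answer(prompt: str, n_agents: int = 3) -> str:
    # Run n_agents independent "agents" in parallel...
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_agents) as pool:
        answers = list(pool.map(lambda i: query_model(prompt, i), range(n_agents)))
    # ...then reconcile by majority vote. A stronger variant feeds all
    # candidate answers back to a judge model instead of simple voting.
    best, _count = collections.Counter(answers).most_common(1)[0]
    return best

print(consortium_answer("What is 6 * 7?"))  # prints "42" (2 of 3 agents agree)
```

The majority-vote step is the simplest aggregator; Grok Heavy reportedly has the agents compare results, which maps to the judge-model variant noted in the comment.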

sidibe•11h ago
You are making the mistake of taking one of Elon's presentations at face value.
tibbar•11h ago
I mean, either they cheated on evals à la Llama 4, or they have a paradigm that's currently best in class in at least a few standard evals. Both alternatives are possible, I suppose.
simianwords•11h ago
that's how o3 pro also works IMO
tibbar•11h ago
Interesting. I'd guess this technique should probably work with any SOTA model in an agentic tool loop. Fun!
zone411•10h ago
This is the speculation, but then it wouldn't have to take much longer to answer than o3.
bobjordan•8h ago
I can’t help but call out that o1-pro was great, it rarely took more than five minutes and I was almost never dissatisfied with the results per the wait. I happily paid for o1-pro the entire time it was available. Now, o3-pro is a relative disaster, often taking over 20 minutes just to refuse to follow directions and gaslight people about files being available for download that don’t exist, or provide simplified answers after waiting 20 minutes. It’s worse than useless when it actively wastes users time. I don’t see myself ever trusting OpenAI again after this “pro” subscription fiasco. To go from a great model to then just take it away and force an objectively terrible replacement, is definitely going the wrong way, when everyone else is improving (Gemini 2.5, Claude code with opus, etc). I can’t believe meta would pay a premium to poach the OpenAI people responsible for this severe regression.
sothatsit•4h ago
I have never had o3-pro take longer than 6-8 minutes. How are you getting it to think for 20 minutes?! My results using it have also been great, but I never used o1-pro so I don't have that as a reference point.
irthomasthomas•8h ago
Like llm-consortium? But without the model diversity.

https://x.com/karpathy/status/1870692546969735361

https://github.com/irthomasthomas/llm-consortium

Voloskaya•7h ago
> Expensive and slow

Yes, but... in order to train your next SotA model you have to do this anyway and do rejection sampling to generate good synthetic data.

So if you can do it in prod for users paying $300/month, it's a pretty good deal.

daniel_iversen•6h ago
Very clever, thanks for mentioning this!
icoder•5h ago
I can understand how/that this works, but it still feels like a 'hack' to me. It still feels like the LLMs themselves are plateauing but the applications get better by running the LLMs deeper, longer, wider (and by adding 'non-AI' tooling/logic at the edges).

But maybe that's simply the solution, like the solution to original neural nets was (perhaps too simply put) to wait for exponentially better/faster hardware.

cfn•4h ago
Maybe this is the dawn of the multicore era for LLMs.
the8472•4h ago
grug think man-think also plateau, but get better with tool and more tribework

Pointy sticks and ASML's EUV machines were designed by roughly the same lumps of compute-fat :)

simondotau•4h ago
You could argue that many aspects of human cognition are "hacks" too.
emp17344•2h ago
…like what? I thought the consensus was that humans exhibit truly general intelligence. If LLMs require access to very specific tools to solve certain classes of problems, then it’s not clear that they can evolve into a form of general intelligence.
whynotminot•2h ago
What would you call the very specialized portions of our brains?

The brain is not a monolith.

emp17344•1h ago
Specifically, which portions of the brain are “very specialized”? I’m not aware of any aspect of the brain that’s as narrowly applied to tasks as the tools LLMs use. For example, there’s no coding module within the brain - the same brain regions you use when programming could be used to perform many, many other tasks.
djmips•35m ago
Are you able to point to a coding module in an LLM?
JKCalhoun•3h ago
> I'm genuinely looking forward to trying this out.

Myself, I'm looking forward to trying it out when companies with less, um, baggage implement the same. (I have principles I try to maintain.)

einrealist•1h ago
So the progress is basically to brute force even more?

We got from "single prompt, single output", to reasoning (simple brute-forcing) and now to multiple parallel instances of reasoning (distributed brute-forcing)?

No wonder the prices are increasing and capacity is more limited.

Impressive. /s

nisegami•1h ago
I've suspected that technique could work on mitigating hallucinations, where other agents could call bullshit on a made up source.
sidcool•11h ago
Did they mention availability of the model for users?
modeless•11h ago
It's available now
aitchnyu•10h ago
On Openrouter too https://openrouter.ai/x-ai/grok-4
steve-atx-7600•11h ago
It's available in the US, at least in the iOS X app. Can't see it in the Grok app and don't see an upgrade for that app yet.
wongarsu•9h ago
It's available on the web interface on grok.com if you have at least the $30/month SuperGrok plan
modeless•11h ago
Seems like it is indeed the new SOTA model, with significantly better scores than o3, Gemini, and Claude in Humanity's Last Exam, GPQA, AIME25, HMMT25, USAMO 2025, LiveCodeBench, and ARC-AGI 1 and 2.

Specialized coding model coming "in a few weeks". I notice they didn't talk about coding performance very much today.

esafak•11h ago
I wish the coding models were available in coding agents. Haven't seen them anywhere.
vincent_s•7h ago
Grok 4 is now available in Cursor.
markdog12•3h ago
Interesting, I have the latest update and I don't see it in the models list.
justarobert•3h ago
Plenty like Aider and Cline can connect to pretty much any model with an API.
vessenes•5h ago
Agreed. I noticed a quick flyby of a bad “reasoning smell” in the baseball World Series simulation, though - it looks like it pulled some numbers from polymarket, reasoned a long time, and then came back with the polymarket number for the Dodgers but presented as its own. It was a really fast run through, so I may be wrong, but it reminds me that it’s useful to have skeptics on the safety teams of these frontier models.

That said, these are HUGE improvements. Providing we don’t have benchmark contamination, this should be a very popular daily driver.

On coding - 256k context is the only real bit of bad news. I would guess their v7 model will have longer context, especially if it’s better at video. Either way, I’m looking forward to trying it.

dbagr•3h ago
Either they overtook other LLMs by simply using more compute (which is reasonable to think as they have a lot of GPUs) or I'm willing to bet there is benchmark contamination. I don't think their engineering team came up with any better techniques than used in training other LLMs, and Elon has a history of making deceptive announcements.
z7•2h ago
How do you explain Grok 4 achieving new SOTA on ARC-AGI-2, nearly doubling the previous commercial SOTA?

https://x.com/arcprize/status/1943168950763950555

saberience•1h ago
They could still have trained the model in such a way as to focus on benchmarks, e.g. training on more examples of ARC style questions.

What I've noticed when testing previous versions of Grok: on paper they were better at benchmarks, but when I used them the responses were always worse than Sonnet's and Gemini's, even though Grok had higher benchmark scores.

Occasionally I test Grok to see if it could become my daily driver but it's never produced better answers than Claude or Gemini for me, regardless of what their marketing shows.

djmips•38m ago
Well try it again and report back.
dbagr•26m ago
As I said, either by benchmark contamination (the set is semi-private and could have been obtained by people from other companies whose models have been benchmarked) or by having more compute.
vessenes•1h ago
anecdotally, output in my tests is pretty good. It's at least competitive to SOTA from other providers right now.
TheAceOfHearts•11h ago
Does anyone here have access to Grok 4 yet? If so, could you please try asking it to solve this basic word search problem [0] and share the results? It's just a simple grid of letters where you have to find the position of each word, the kind of problem that any young child can easily solve.

[0] https://imgur.com/VxNP5jG

kadushka•11h ago
These models are not trained on character level input. Why would anyone expect them to perform well on character level puzzles?
Jensson•11h ago
They are trained on many billions of tokens of text dealing with character level input, they would be rather dumb if they couldn't learn it anyway.

Every human learns this: when you hear the sound "strawberry" you don't hear the double r there, yet you still know the answer.

brookst•5h ago
These models operate on tokens, not characters. It’s true that training budgets could be spent on exhaustively enumerating how many of each letter are in every word in every language, but it’s just not useful enough to be worth it.

It’s more like asking a human for the Fourier components of how they pronounce “strawberry”. I mean the audio waves are right there, why don’t you know?

yahoozoo•4h ago
Although a vast majority of tokens are 4+ characters, you’re seriously saying that each individual character of the English alphabet didn’t make the cut? What about 0-9?
kadushka•2h ago
Each character made the cut, but the word "strawberry" is a single token, and that single token is what the model gets as input. When humans read some text, they can see each individual character in the word "strawberry" every time they see that word. LLMs don't see individual characters when they process input text containing the word "strawberry". They can only learn the spelling if some text explicitly maps "strawberry" to the sequence of characters s t r a w b e r r y. My guess is there are not enough of such mappings present in the training dataset for the model to learn it well.
nl•1h ago
> the word "strawberry" is a single token, and that single token is what the model gets as input.

This is incorrect.

strawberry is actually 4 tokens (at least for GPT, but most LLMs are similar).

See https://platform.openai.com/tokenizer

kadushka•1h ago
I got 3 tokens: st, raw, and berry. My point still stands: processing "berry" as a single token does not allow the model to learn its spelling directly, the way human readers do. It still has to rely on an explicit mapping of the word "berry" to b e r r y explained in some text in the training dataset. If that explanation is not present in the training data, it cannot learn the spelling - in principle.
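To make the point concrete, here's a toy sketch (the vocabulary and token boundaries below are invented for illustration, borrowing the "st"/"raw"/"berry" split mentioned above; real BPE vocabularies differ): the model consumes opaque token IDs, and recovering a letter count requires an explicit ID-to-spelling mapping.

```python
# Toy illustration: a model sees token IDs, not characters.
# The token strings ("st", "raw", "berry") are assumed for this sketch,
# not taken from any real tokenizer's vocabulary.
vocab = {"st": 0, "raw": 1, "berry": 2}

def encode(word, vocab):
    """Greedy longest-match segmentation into known tokens."""
    tokens = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(vocab[word[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no token covers {word[i:]!r}")
    return tokens

ids = encode("strawberry", vocab)
# The model receives [0, 1, 2]; the letters are invisible at this level.
# Answering "how many r's?" needs an explicit id -> spelling mapping:
inv = {i: s for s, i in vocab.items()}
spelled = "".join(inv[i] for i in ids)
print(ids, spelled.count("r"))  # [0, 1, 2] 3
```

The `inv` lookup table stands in for the "explicit mapping in the training data" the comment describes: without it, nothing in `[0, 1, 2]` reveals the three r's.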
brrrrrm•11h ago
emergent behavior. These things are surprisingly good at generalizing
modeless•11h ago
They said they're training a new base model for better multimodal performance soon. I wouldn't expect it to be able to read an image like that today. Maybe if you provided it in text format.
Szpadel•11h ago
description from openrouter:

> Grok 4 is xAI's latest reasoning model with a 256k context window. It supports parallel tool calling, structured outputs, and both image and text inputs. Note that reasoning is not exposed, reasoning cannot be disabled, and the reasoning effort cannot be specified.

unfortunately no requests are passing because of some rate limits
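For reference, a chat-completions request body for this model via OpenRouter would look roughly like the sketch below (the model slug "x-ai/grok-4" and field names are assumptions based on the OpenAI-compatible format OpenRouter uses; check their docs for the real identifiers):

```python
import json

# Sketch of an OpenRouter-style chat-completions request body.
# The model slug "x-ai/grok-4" is an assumption, not verified.
payload = {
    "model": "x-ai/grok-4",
    "messages": [{"role": "user", "content": "Hello"}],
    # Per the description above, reasoning is always on and cannot be
    # tuned, so no reasoning-effort field is included.
}
body = json.dumps(payload)
print(body)
```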

TheAceOfHearts•10h ago
As a point of interest and for comparison, Gemini 2.5 Pro is able to generate a Python program that outputs the complete correct solution when run, but it can't figure out how to one-shot the problem if asked directly.

This is just a for-fun test to get a sense of how models are progressing; it highlights the jagged nature of their intelligence and capabilities. None of the big AI labs are testing for such a basic problem type, which makes it a bit of an interesting check.

I think it's still interesting to see how Grok 4 performs, even if we don't use this test to draw any broader conclusions about what capabilities it offers.

vnchr•9h ago
Mix of hits and misses: https://x.com/i/grok/share/CWE4XhSUlqVe370CehF9At5Tc
minimaxir•11h ago
My tl;dr: the benchmarks are very impressive, but their CEO just eroded any trust in them (though some, such as ARC, are corroborated externally), and the Nazi incident (which went ignored!) makes actually using Grok in an app a professional liability.

They also have not released a model card, and I suspect they never will.

jppope•11h ago
Interested to see how it all works out. Elon has been using a lot of smoke and mirrors lately, but this seems like an area where they can genuinely make progress: with the right talent, competing in the GenAI world is totally possible right now. Sign me up for improvements in this space!
bboygravity•2h ago
Area where they can make progress? Yeah sure, but that seems to imply that they're not doing great?!

Can you name an Elon company that is not number 1 globally in terms of product capabilities?

The only one I would've been able to name would've been Grok. Until yesterday.

ben_w•1h ago
The only one that is number one is SpaceX (and Starlink, if you count that separately).

None of the neuroscience people I follow think much of Neuralink; none of the civil engineers I've talked to IRL think much of TBC; none of the car people I follow favour Tesla over the huge range of competitors, and that includes the robo-taxi where they're about 6.5 years behind Waymo; X.com is so painful that whenever someone shares a link with me, I edit the URL to Xcancel.com *because that loads faster by a bigger margin than the time taken to edit the URL* and actually shows me the thread without needing an account of my own.

But the space nerds I follow are still impressed with SpaceX, and they have extremely obvious reasons to be impressed.

lexandstuff•10h ago
Out of interest, has anyone ever integrated with Grok? I've done so many LLM integrations in the last few years, but never heard of anyone choosing Grok. I feel like they are going to need an unmistakably capable model before anyone would want to risk it - they don't behave like a serious company.
47thpresident•9h ago
Grok 3 is on Azure AI Foundry [0] and announced an integration with Telegram, albeit they are paying Telegram $300m, not vice versa [1]. But I agree, choosing Grok is just a huge reputational liability for any serious work.

[0] https://devblogs.microsoft.com/foundry/announcing-grok-3-and... [1] https://www.bbc.co.uk/news/articles/cdxvr3n7wlxo

thebigspacefuck•4h ago
Any plans for GCP Vertex AI or AWS Bedrock? Apparently Grok 3 had the highest score for Golang on roocode.com/evals, so I’d like to try it for coding. The free tier app hasn’t been bad either; I like its attitude a bit better than ChatGPT’s.
hersko•2h ago
You would have to be insane to integrate the model that last week called itself "Mecha Hitler" into your live product.

As a huge Musk fan, I'll be the first to point out that he's doing exactly what he accused Sama of doing: making powerful AI with an obvious lack of control or effective alignment.

sergiotapia•2h ago
I am using Grok to visually analyze food images. Works really well, recognizes brands and weird shots users send me. API really easy to use.
Workaccount2•2h ago
I'm more curious where Grok gets talent from.

There is so much money and so many top labs falling over themselves to attract good talent, that at this point people have to be leaning on ideological goals to choose their employer.

Are there really that many AI researchers who want to make Elon god-emperor?

simianwords•9h ago
How do I use grok 4 heavy? SuperGrok is $3000 a year!! I can't find an option in openrouter either.
UrineSqueegee•9h ago
I assume grok 4 heavy might be the same model with thinking turned to the max
simianwords•9h ago
If that's true, I still want a way to use it in openrouter.
UrineSqueegee•9h ago
I didn't watch the livestream, but some people in this thread said that Heavy is an orchestration of Grok 4s; would be interesting to see how that works.
raspasov•9h ago
Grok has consistently been one of the best models I've used for deep research (no API use). Grok 4 looks even more promising.
FirmwareBurner•8h ago
> deep research

Can you say what you mean by deep research?

repsak•8h ago
Agent that browses the web, analyzes information, and creates reports. Grok calls it DeepSearch. Similar to gemini/openai deep research.

https://x.ai/news/grok-3#grok-agents-combining-reasoning-and...

spaceman_2020•8h ago
Grok's Twitter integration has legitimately been one of the best use cases I've seen. Just being able to ask Grok right within the tweet about context or meaning of any jargon is very useful.
archagon•7h ago
Particularly useful if you’re an antisemite or white supremacist, it seems.
moralestapia•5h ago
While you're not wrong, I feel like they don't make up a significant chunk of @grok's queries. People usually talk about other topics.
fkyoureadthedoc•2h ago
This however is a significant chunk of @grok's queries if you only experience it through scrolling Apple News
sebzim4500•5h ago
Until very recently, it was alt-right people getting frustrated that they couldn't get grok to confirm their delusions. They had tricks to get it to confirm their priors (esp. asking leading questions and demanding a single word response) but they didn't work that well.
Larrikin•3h ago
When is very recently? I don't recall any time when Grok wasn't making up answers about how great Elon is and how awful Jewish people, black people, liberals, etc. are. It's usually the first test of any model they put out, and it always gives a ridiculous answer.
PhunkyPhil•3m ago
Recently as in the last few days, when it started calling itself "MechaHitler" and scapegoating Jewish people after the engineers let Elon ramble for the system prompt.
k__•2h ago
I had the impression Grok wasn't on Elon's side when it answered my questions or explained tweets.
saagarjha•6h ago
@grok is this true?
LorenDB•4h ago
I think the Grok button that is present on tweets is the best way to ask Grok about tweets. Tagging @grok just spams others' timelines with useless AI responses. The Grok button lets you keep it private.
skarz•3h ago
Personally I think having the option to make grok's response public can be helpful, much like a community note. Let's face it, on reddit or Facebook or YouTube the first thing people do now is go straight to the comments for context or feedback. As they say, the real answer is always in the comments.
v5v3•2h ago
Public is good, as the AI response is often used to mediate two opposing submissions of facts.

A neutral 3rd party.

dzhiurgis•3h ago
It still struggles to grok large threads.

Hope FB brings something like this tho. Might be especially useful to summarize/search big groups.

People used to cry about how private groups and Slack killed forums and hid information, but I think we have a chance with tools like this.

v5v3•2h ago
@AskPerplexity is also on x
CSMastermind•3h ago
I'm surprised by this, OpenAI does much better for me than all the competitors (though I wouldn't consider it good).

The only two areas I've found Grok to be the best at are real time updates and IT support questions.

rpozarickij•8h ago
Grok's updated voice mode is indeed impressive. I wish there was a way to disable automatic turn detection, so that it wouldn't treat silence as an end of the response. I like Claude's approach (you need to tap in order to end the response), but it's not very reliable because sometimes it just abruptly cuts my response without waiting until I tap.

I was pleasantly surprised that Grok even supports (to some degree) Lithuanian in voice mode, which is a quite niche language. Grok's responses themselves are alright, but ChatGPT and Gemini way surpass it in speech recognition and speech synthesis.

pzo•5h ago
Yes, their voice mode is pretty good; it also works with Polish (much better than a few months ago). I wish they also had a 'push to talk' option (walkie-talkie style, with a big button), like the mode Perplexity allows, or an 'automatic' one.

Also would be great if they added voice mode in browser (again like perplexity).

rpozarickij•5h ago
> Also would be great if they added voice mode in browser

There seems to be a voice mode button in the prompt input box at ~29:00 of the Grok 4 announcement video. So perhaps they're working on this, but it's hidden from the public.

pbmonster•4h ago
> Grok's updated voice mode is indeed impressive. I wish there was a way to disable automatic turn detection, so that it wouldn't treat silence as an end of the response.

You can circumvent that by instructing the model to use "radio etiquette" - only respond after the other part says "over". It will still be compelled to answer when it detects silence, you can't prevent that, but you can instruct it to only reply with a short "mhm" until you say "over". Feels very natural.

Like most models I've used with this old hack, it will immediately start role-playing and also end its own responses with "over".

rpozarickij•3h ago
This is such a cool idea. I wonder whether it's possible to define a custom Personality in Grok's voice settings that would do this. Unfortunately I'm not able to create a new Personality in Grok's settings to test this right now on my phone (iPhone 15 Pro Max), because the Personality creation screen closes immediately after opening it. Might be a bug or some other issue.
dzhiurgis•3h ago
Lithuanian sounds so weird on ChatGPT tho, almost like how my kids speak, with a sort of English accent. Regardless, it gives my parents a superpower (when it actually works, hehe).
bilsbie•1h ago
Even better if you can just use umm’s like in a human conversation.
sylware•7h ago
I don't really understand why E. Musk got rid of OpenAI.

I can recall the first experiments with Dota 2 while he was still "in charge" of OpenAI.

druskacik•7h ago
He wanted to be the CEO and merge it with Tesla [0], but the researchers had a problem with him (some had a problem with Altman as well, but that's another story). He did not have any real options since OpenAI was a non-profit then, so he just left. The new book The Optimist [1] about Sam Altman has some more details on this and other OpenAI Game of Thrones episodes; I definitely recommend it for those interested.

[0] https://openai.com/index/openai-elon-musk/

[1] https://www.goodreads.com/book/show/223400731-the-optimist

kjksf•7h ago
He didn't "get rid of OpenAI".

When he left OpenAI the stated reason was conflict of interests: Tesla was ramping up work on self driving.

He also hired A. Karpathy away from OpenAI to lead Tesla's AI vision effort.

bboygravity•2h ago
There's also the small detail where OpenAI decided to only remain open in name?

And the fact that Sam from the very start wanted to turn it into his own closed source for-profit company (still ongoing) using non-profit funding as start-up seed funds (essentially stealing Elon Musk's money)?

MangoToupe•7h ago
If this comment graveyard isn't a demonstration of how broken the "flag" feature is, I don't know what would count as evidence. I wish there were a way to just disable the feature so those of us who don't trust it could continue to see and interact with flagged comments.

I don't know what "dead" comments are but the same critique remains: whoever is flagging and killing these comments doesn't have the interest of conversation at heart.

lupusreal•7h ago
If I wanted predictable repetitive reddit hysterics, I'd go to reddit. If the benchmarks were cheated we'll know soon enough, which is itself reason to assume they weren't cheated. The rest of it is just tedious whining.
MangoToupe•7h ago
Reddit has the same problem, actually. But thank you for your attempt at stimulating insight and contribution to the conversation.
TheOtherHobbes•5h ago
This would be more convincing if it wasn't the Xbot producing predictable repetitive Reddit hysterics.

I have no idea why anyone would trust a product made by a CEO who forced it to do that.

No user is going to have any idea what their inputs are being used for, and no guarantee the outputs won't change without notice.

thomassmith65•7h ago
Internet comments are not a scarce resource.

Let's say HN is missing out on 20% of potential comments. We still have too many for any one user to read.

MangoToupe•7h ago
The problem is that the bulk of the interesting conversation to be had is introduced in that 20%.
thomassmith65•7h ago
Hopefully that is an overstatement, but, either way, most social media sites are so nasty and braindead that my attitude to HN is conservative: we should err on the side of leaving the site as it is.
systemvoltage•7h ago
It encourages the 80% into group think. Flagging is a signifier that “you should not dare to think that was a good comment. Move on and don’t think for yourself”.
msgodel•7h ago
It is a vote-ordered forum though. Pretty much any artificial cybernetics will pigeonhole everyone.
thomassmith65•7h ago
That may sometimes be the case, but the apparent reason for many, many flags here is that the content is unoriginal.

I've flagged plenty of comments that I agreed with on HN because they were dull and hackneyed.

johnb231•7h ago
You are doing it wrong. That's where you should downvote, not flag.

Frivolous flagging - as you are doing - could eventually get your account privileges removed.

thomassmith65•6h ago
I expressed that poorly. Just 'boring' alone doesn't warrant a flag.

There's a subjective element.

As an example of something I would flag: a one sentence 'hamas supporter!' or 'genocide denier!' accusation in reply to someone's thoughtful comment. If the same sentiment were expressed in a more original way, I might upvote.

Edit: In regard to news stories, sometimes a story breaks and the main and 'new' pages wind up with a dozen links to it. At some point, I might flag that. I'm not sure if that's kosher, but there's little purpose in having users wade through identical articles. Maybe @tomhow or @dang can set me straight if they happen to read this.

FirmwareBurner•6h ago
>but the apparent reason for many, many flags here is that the content is unoriginal.

Unoriginal to who? What's unoriginal to you might be original to someone else. So your justification for flagging only reinforces the groupthink argument even if you don't realize it.

thomassmith65•6h ago
While it's all subjective, other social networks are literally full of memes. Memes are unpopular on HN.

Better to have groupthink that is hostile to groupthink than to have memes.

FirmwareBurner•5h ago
I disagree. If a picture is worth more than a thousand words then a meme is worth more than a thousand groupthink slop comments.
thomassmith65•4h ago
Let's say HN were full of edgy comments, memes and flame wars.

Some people would like that version of HN more, others less. I probably would close my account.

There might not be a version of this site that would please everybody.

FirmwareBurner•4h ago
>Let's say HN were full of edgy comments, memes and flame wars.

Ackshually, edgy meme websites with no moderation don't have any flame wars since everyone there is on the same page.

Flame wars are in places like HN where moderation is heavily one sided and arbitrary, while pretending to be objective and inclusive.

thomassmith65•3h ago
X…
FirmwareBurner•3h ago
4chan
thomassmith65•2h ago
Our branch of the thread seems to be drifting away from the original issue.

Whatever combination of user behaviors it is that HN's moderation promotes, it appeals to some people more than X, 8chan, gab, reddit, etc.

Perhaps some of the other sites contain the 20% of comments - with its pearls of contrarian wisdom - that HN flags. There is an audience of people (like me) to whom that absence doesn't matter.

I have no interest in wading through posts where there's no minimum bar for garbage. Some people do, and good for them: they can pan for gold on reddit, etc.

HN works well, as-is, for a certain segment of the public.

lupusreal•28m ago
I would explain why I think you're wrong, but I'm feeling lazy so please instead pretend that I just quoted you while posting a soyjack meme.
GeoAtreides•6h ago
>Internet comments are not a scarce resource.

No, but comments that go against the grain or against the hivemind are. Downvotes and flagging encourage group think more than they weed out 'bad' comments.

teekert•7h ago
I often don't understand why my comments get flagged. Sometimes it feels random, sometimes I can see that it is because I'm too libertarian or something?

Idk, it feels like people push comments onto the one-dimensional US political axis (like critical of vaccines = pro-life = climate-change-denier, or the polar opposite). Whereas one can be anywhere on a spectrum on any of the axes.

Critical of some research branches? You must be pro-doge then, and you are the "don't look up crowd" and vote maga.

So detrimental to open discussion.

FirmwareBurner•6h ago
>I often don't understand why my comments get flagged. Sometimes it feels random, sometimes I can see that it is because I'm too libertarian or something?

Can you link to any pro-libertarian comments of yours that got flagged?

teekert•4h ago
Valid reply! I went through my pages of threads and didn't see anything; is there a way to search? It's also submissions, btw.
michelsedgh•6h ago
I thought it's probably some bot accounts that are flagging anything close to right-wing content on here. But maybe it's the people, who knows. It's funny, I kinda feel similar to you.
m101•6h ago
My comments are "alternative" as far as the mainstream is concerned, however I've not experienced flagging but rather consistent user downvoting.
Thorrez•6h ago
>I wish there was a way to just disable the feature so those of us who don't trust it could continue to see and interact with flagged comments.

>I don't know what "dead" comments are

You can enable showdead in your HN settings to see the comments. You won't be able to directly reply to them, but you can vouch for them, which when I do it, generally brings them back to life.

simianwords•6h ago
What's Grok 4's training data cutoff?

Edit: a few chats seem to indicate a mid-2024 cutoff.

edgineer•6h ago
it's continuously updated; no specified cutoff date
yahoozoo•5h ago
How are they doing this? Does it just make heavy use of web searches? A continuously updated RAG store? Why don’t other companies do it?
jasonjmcghee•59m ago
In 2021 Google did RETRO which was RAG at multi trillion token scale.

https://deepmind.google/discover/blog/improving-language-mod...

dimitri-vs•4h ago
source? this would defy a lot of convention and would cause a lot of instability
RobinL•4h ago
This is what it says in the supposed system prompt; see https://news.ycombinator.com/item?id=44517453
serf•3h ago
This seems more like 'LLM psychology' than evidence of a rolling model; in other words, I would take that prompt more as evidence that they don't want users to interrogate the cutoff date than as evidence that they're somehow using a rolling model.
andreygrehov•54m ago
Just checked. Early 2025.
zone411•6h ago
Grok 4 sets a new high score on my Extended NYT Connections benchmark (92.4), beating o3-pro (87.3): https://github.com/lechmazur/nyt-connections/.

Grok 4 Heavy is not in the API.

sebzim4500•5h ago
Very impressive, but what do you think the chances are that this was in the training data?
diggan•5h ago
> but what do you think the chances are that this was in the training data?

Pulled out of my ass, I'd say a 95% chance. NYT Connections is a fairly popular puzzle, it's been out for more than 2 years, and even if this particular GitHub repository with the prompts and methodology wasn't in the training data, it's almost guaranteed that other information, problems and solutions from NYT Connections is in any of the other datasets.

simondotau•4h ago
If your definition of cheating is "it was fed the answers during training" then every LLM is surely cheating and the real question is why other LLMs didn't do as well in this benchmark.
pornel•4h ago
You could get 100% on the benchmark with an SQL query that pulls the answers from the dataset, but it wouldn't mean your SQL query is more capable than LLMs that didn't do as well in this benchmark.

We want benchmarks to be representative of performance in general (in novel problems with novel data we don't have answers for), not merely of memorization of this specific dataset.
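The contrast above can be made concrete with a deliberately dumb "solver" that memorizes the benchmark (a sketch; the puzzle IDs and answers are invented for illustration):

```python
import sqlite3

# A pure-lookup "solver": memorize the benchmark, then look up answers.
# It scores 100% on anything it has seen and 0% on anything novel.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE answers (puzzle TEXT PRIMARY KEY, answer TEXT)")
db.executemany("INSERT INTO answers VALUES (?, ?)",
               [("puzzle-001", "BERRY GROUP"), ("puzzle-002", "TREE GROUP")])

def solve(puzzle):
    row = db.execute("SELECT answer FROM answers WHERE puzzle = ?",
                     (puzzle,)).fetchone()
    return row[0] if row else None

print(solve("puzzle-001"))  # BERRY GROUP: perfect on the memorized set
print(solve("puzzle-999"))  # None: zero generalization to novel puzzles
```

A benchmark score only measures capability to the extent the test items weren't already in the "database".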

simondotau•3h ago
My question, perhaps asked in too oblique of a fashion, was why the other LLMs — surely trained on the answers to Connections puzzles too — didn't do as well on this benchmark. Did the data harvesting vacuums at Google and OpenAI really manage to exclude every reference to Connections solutions posted across the internet?

LLM weights are, in a very real sense, lossy compression of the training data. If Grok is scoring better, it speaks to the fidelity of their lossy compression as compared to others.

pornel•3h ago
There's a difficult balance between letting the model simply memorize inputs and forcing it to figure out generalisations.

When a model is "lossy" and can't reproduce the data by copying, it's forced to come up with rules to synthesise the answers instead, and this is usually the "intelligent" behavior we want. It should be forced to learn how multiplication works instead of storing every combination of numbers as a fact.

Compression is related to intelligence: https://en.wikipedia.org/wiki/Kolmogorov_complexity
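One crude way to see the link between compression and rule-finding (a rough illustration, not a claim about LLM internals): data with discoverable structure compresses dramatically, because the compressor has effectively found a rule instead of storing every byte, while structureless data compresses hardly at all.

```python
import os
import zlib

# Structured data: the compressor discovers the repetition rule.
structured = b"the answer is 144 " * 200
# Unstructured data: nothing to discover, so nothing to compress away.
noise = os.urandom(len(structured))

c_struct = zlib.compress(structured)
c_noise = zlib.compress(noise)
print(len(structured), len(c_struct), len(c_noise))
```

The 3600-byte repetitive input collapses to a few dozen bytes; the random input stays roughly its original size.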

frozenseven•2h ago
You're not answering the question. Grok 4 also performs better on the semi-private evaluation sets for ARC-AGI-1 and ARC-AGI-2. It's across-the-board better.
emp17344•1h ago
If these things are truly exhibiting general reasoning, why do the same models do significantly worse on ARC-AGI-2, which is practically identical to ARC-AGI-1?
frozenseven•1h ago
It's not identical. ARC-AGI-2 is more difficult, both for AI and humans. In ARC-AGI-1 you kept track of one (or maybe two) kinds of transformations or patterns. In ARC-AGI-2 you are dealing with at least three, and the transformations interact with one another in more complex ways.

Reasoning isn't an on-off switch. It's a hill that needs climbing. The models are getting better at complex and novel tasks.

emp17344•53m ago
This simply isn’t the case. Humans actually perform better on ARC-AGI-2, according to their website: https://arcprize.org/leaderboard
frozenseven•14m ago
The 100.0% you see there just verifies that all the puzzles got solved by at least 2 people on the panel. That was calibrated to be so for ARC-AGI-2. The human panel averages for ARC-AGI-1 and ARC-AGI-2 are 64.2% and 60% respectively. Not a huge difference, sure, but it is there.

I've played around with both, yes, I'd also personally say that v2 is harder. Overall a better benchmark. ARC-AGI-3 will be a set of interactive games. I think they're moving in the right direction if they want to measure general reasoning.

Workaccount2•2h ago
People have this misguided belief that LLMs just do look-ups of data present in their "model corpus", fed in during "training". That isn't even training at that point; it's just copying + compressing, like putting books into a .zip file.

This belief leads to the thinking that LLMs can only give correct output if they can match it to data in their "model corpus".

frozenseven•2h ago
"It also leads when considering only the newest 100 puzzles."
bigyabai•1h ago
Be that as it may, that's not a zero-shot solution.
bilsbie•1h ago
You raise a good point. It seems like it would be trivial to pick out some of the puzzles and remove all the answers from the training data.

I wish Ai companies would do this.

SilverSlash•6h ago
The "heavy" model is $300/month. These prices seem to keep increasing while we were promised they'll keep decreasing. It feels like a lot of these companies do not have enough GPUs which is a problem Google likely does not have.

I can already use Gemini 2.5 Pro for free in AI studio. Crazier still, I can even set the thinking budget to a whopping 32k and still not pay a dime. Maybe Gemini 3.0 will be available for free as well.

42lux•5h ago
It's because a lot of the advancements are in post-training; the models themselves have stagnated. Look at the heavy "model"...
pzo•5h ago
Also, their API pricing is a little misleading: it matches Sonnet 4 pricing ($3/$15) only "for requests under 128k" (whatever that means); above that it's 2x more.
vessenes•5h ago
That 128k is a reference to the context window: how many tokens you put in at the start. Presumably Grok 4 with a 128k context window runs on less hardware (it needs much less RAM than 256k) and they route requests accordingly internally.
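Taking the figures quoted in this thread at face value ($3 per million input tokens and $15 per million output tokens under 128k of context, doubled above it; these rates and the tier rule are assumptions from the comments, not official numbers), the tiering works out roughly like this:

```python
def grok4_cost_usd(input_tokens, output_tokens):
    """Sketch of the tiered pricing described in the thread: $3/$15 per
    million tokens under a 128k context, 2x above it. Rates are
    assumptions taken from the comments, not official numbers."""
    multiplier = 1 if input_tokens <= 128_000 else 2
    return multiplier * (input_tokens * 3 + output_tokens * 15) / 1_000_000

print(grok4_cost_usd(100_000, 2_000))  # 0.33: under the 128k tier
print(grok4_cost_usd(200_000, 2_000))  # 1.26: doubled rates above 128k
```

Note the jump is on the whole request, not just the tokens past 128k, which is what makes long-context use disproportionately expensive under this reading.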
ljlolel•5h ago
More of an issue of market share than # of gpus?
Havoc•5h ago
It’s the inference time scaling - this is going to create a whole new level of have vs have nots split.

The vast majority of the world can’t afford 100s of dollars a month

altbdoor•5h ago
It's important to note that pricing for Gemini has been increasing too.

https://news.ycombinator.com/item?id=44457371

Workaccount2•2h ago
I'm honestly impressed that the sutro team could write a whole post complaining about Flash, and not once mention that Flash was actually 2 different models, and even go further to compare the price of Flash non-thinking to Flash Thinking. The team is either scarily incompetent, or purposely misleading.

Google replaced flash non-thinking with Flash-lite. It rebalanced the cost of flash thinking.

ignoramous•5h ago
> Gemini 2.5 Pro for free ...

It is Google. So, I'd pay attention to data collection feeding back in to training or evaluation.

https://news.ycombinator.com/item?id=44379036

lifthrasiir•4h ago
While Google is explicit about that, I have good reason to believe this actually happens in most, if not all, massive LLM services. I think Google's free offerings are more about vendor lock-in, a common Google tactic.
ignoramous•3h ago
> Google's free offerings are more about vendor lock-in

Pricing the competition out & then turning the screws on locked-in users.

6510•1h ago
Or delete the project
falcor84•1h ago
I have a lot of complaints to make about Google (half of them about them killing products), but I don't think we should complain about them locking users in. I don't see any lock-in at all in regards to LLM usage (it's pretty trivial to switch providers), and more generally, takeout.google.com is a shining beacon for what I would want every provider to offer.
bionhoward•40m ago
What makes you say Google is explicit about the fact they have humans and AIs reading everything? It’s got a confusing multi-layer hierarchy of different privacy policies which hide what’s happening to folks’ conversations behind vague language. They promote it as being free but don’t even link to the privacy policies when they launch stuff, effectively trying to bait noobs into pasting in confidential information
brookst•5h ago
Who promised that there would be no advanced models with high costs?

Prices for the same number of tokens at a given level of capability are falling. But Moore's law most certainly did NOT say that chips would get no more complex than the 1103 1Kb DRAM; it said a chip of the same complexity would shrink from 10mm^2 to a speck far too small to see.

worldsavior•5h ago
Why is the number of GPUs the problem rather than the amount of GPU usage? I don't think buying GPUs is the problem, but running tons of GPUs can be very expensive. I presume that's the reason it's so expensive, especially with LLMs.
serbuvlad•4h ago
> These prices seem to keep increasing while we were promised they'll keep decreasing.

A Ferrari is more expensive than the model T.

The most expensive computer is a lot more expensive than the first PC.

The prices that usually fall are:

* The entry level.

* The same performance over time.

But the _price range_ gets wider. That's fine. That's a sign of maturity.

The only difference this time is that the entry level was artificially 0 (or very low) because of VC funding.

PaulHoule•3h ago
But where is the value?

If it could write like George Will or Thomas Sowell or Fred Hayek or even William Loeb that would be one thing. But it hears dog whistles and barks which makes it a dog. Except a real dog is soft and has a warm breath, knows your scent, is genuinely happy when you come home and will take a chomp out of the leg of anyone who invades your home at night.

We are also getting this kind of discussion

https://news.ycombinator.com/item?id=44502981

where Grok exhibited the kind of behavior that puts "degenerate" in "degenerate behavior". Why do people expect anything more? Ten years ago you could be a conservative with a conscience -- now if you are you start The Bulwark.

ben_w•1h ago
> If it could write like George Will or Thomas Sowell or Fred Hayek or even William Loeb

Having only barely heard of these authors even in the collective, I bet most models could do a better job of mimicking their style than I could. Perhaps not well enough to be of interest to you, and I will absolutely agree that LLMs are "low intelligence" in the sense that they need far more examples than any organic life does, but many of them will have had those examples and I definitely have not.

> We are also getting this kind of discussion

> https://news.ycombinator.com/item?id=44502981

Even just a few years ago, people were acting as if a "smart" AI automatically meant a "moral AI".

Unfortunately, these things can be both capable* and unpleasant.

* which doesn't require them to be "properly intelligent"

HWR_14•2h ago
> The most expensive computer is a lot more expensive than the first PC.

Not if you're only looking at modern PCs (and adjusting for inflation). It seems unfair to compare a computer built for a data center with tens of thousands in GPUs to a PC from back then as opposed to a mainframe.

falcor84•2h ago
Good point; the proper comparison might be between something like ENIAC, which reportedly cost $487K to build in 1946 (about $7M now), and a typical Google data center, reportedly costing about $500M.
XCSme•3h ago
> These prices seem to keep increasing

Well, valuations keep increasing, they have to make the calculations work somehow.

greatpostman•3h ago
300 a month is cheap for what is basically a junior engineer
FirmwareBurner•3h ago
Not a junior engineer in a developed country, but what was previously an offshore junior engineer tasked with doing the repetitive labor too costly for western labor.
v5v3•2h ago
You have to have a high RRP to negotiate any volume deals down from.

Like the other AI companies, they will want to sign up companies.

dragonwriter•1h ago
> These prices seem to keep increasing while we were promised they'll keep decreasing

I don't remember anyone promising that, but whoever promised you that frontier public model pricing would be monotonically decreasing, over some period of time that includes our current present, was either lying or badly misguided. While there will be short-term deviations, the overall arc for that will continue to be upward.

OTOH, the models available at any given price point will also radically improve, to the point where you can follow a curve of both increasing quality and decreasing price, so long as you don't want a model at the quality frontier.

z7•6h ago
"Grok 4 (Thinking) achieves new SOTA on ARC-AGI-2 with 15.9%."

"This nearly doubles the previous commercial SOTA and tops the current Kaggle competition SOTA."

https://x.com/arcprize/status/1943168950763950555

leftcenterright•5h ago
Can it finally make 10 sentences that end with a "w" or "p" or "o"? /s

https://news.ycombinator.com/item?id=43782477

mwigdahl•3h ago
Yes. Tried on Openrouter:

Please stop.

Look up.

I need your help.

Watch him jump.

It's time to sleep.

Try to keep.

Take one more step.

We love to shop.

Climb to the top.

Fill the cup.

Board the ship.

Don't move your lip.

Shake your hip.

Here's a good tip.

Use the whip.

Do a quick flip.

Hold on with grip.

Plan the trip.

Let it drop.

Start to chop.
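
A quick sanity check of the reply above (a throwaway script, not part of the thread): strip the trailing period from each sentence and confirm the last letter really is "w", "p", or "o".

```python
# Verify that every sentence in the list ends with w, p, or o before the period.
sentences = [
    "Please stop.", "Look up.", "I need your help.", "Watch him jump.",
    "It's time to sleep.", "Try to keep.", "Take one more step.",
    "We love to shop.", "Climb to the top.", "Fill the cup.",
    "Board the ship.", "Don't move your lip.", "Shake your hip.",
    "Here's a good tip.", "Use the whip.", "Do a quick flip.",
    "Hold on with grip.", "Plan the trip.", "Let it drop.", "Start to chop.",
]
ok = all(s.rstrip(".")[-1] in "wpo" for s in sentences)
assert ok  # every sentence passes (as it happens, all twenty end in "p")
```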

pmdr•5h ago
Metrics aside, Grok model names make more sense than OpenAI. I've really lost track of which one is better and in which way.
lupusreal•5h ago
OpenAI names models like people name word documents. Report-1, Report-2, Report-2a, Report-final, Report-final-final, Report-actually-final, Report-2a-final...
brookst•5h ago
OpenAI has leapfrogged that kind of naming. If they did word docs they would be Report-2, Report-a2; Report2-a, Reporta-2.
ukuina•27m ago
The fact that o4-mini coexists with 4o-mini is... a choice.
colinhb•5h ago
Can it self-drive a Tesla?
looyd•4h ago
Has anyone tried it for coding?
skerit•4h ago
I don't care how good it is, I'm not spending money on any of Elon Musk's products.
spacechild1•4h ago
So this is on the front page, but any reporting on the MetaHitler incident gets flagged? Interesting.
beavisringdin•3h ago
[flagged]
JKCalhoun•3h ago
Having to choose sides and get behind one AI versus another was not in my Sci-Fi diet growing up.
teddyh•2h ago
You never played Deus Ex?
JKCalhoun•1h ago
Apparently not. ;-)
ChoGGi•3h ago
[flagged]
XCSme•3h ago
So, should we expect GPT-5 in a few days now? OpenAI seems to only release new models when someone catches up, and they release something that is just slightly better.
consumer451•3h ago
> You can cut & paste your entire source code file into the query entry box on grok.com and @Grok 4 will fix it for you!

> This is what everyone @xAI does. Works better than Cursor.

This makes no sense to me whatsoever.

https://xcancel.com/elonmusk/status/1943178423947661609

crawsome•2h ago
Cursor is categorically different because it writes to your filesystem and is an AI agent sitting in front of other AIs.

Musk obviously didn't test Cursor, and either got this from his yesmen, or he's just lying unchecked as usual.

sgt•2h ago
But if it's truly better (as in the content and the result being better), then copying and pasting is not the most important thing. I used Claude the other day by just copying and pasting and that worked just fine.
whamlastxmas•2h ago
Claude code is much better than cursor + sonnet in my opinion, even without the good ide integration
phailhaus•1h ago
It cannot be better because Cursor looks across files, whereas with grok you'd be giving it a single one. Grok won't have any context about the rest of your repo, which makes it only useful for toy examples.
yababa_y•1h ago
What's stopping you at pasting only a single file? I use the workflow Elon suggests (although I've never used it with Grok) predominately, it's well over 30% of my use of LLMs. I have a small piece of python called "crawlxml" that filters + dumps into <file> tags. And of course the LLM doesn't need your actual code in its context to do its job.
spiderice•46m ago
You're ignoring the fact that Cursor does all sorts of context management (actually, reduction) and prompt engineering to try and get good results for cheaper. The fact that you're saying the only 3 explanations are

1. Musk didn't test Cursor

2. Yesmen

3. Lying

Shows much more about your biases than anything related to Grok 4 usage

netdur•2h ago
He speaks in movie terms, exactly what I'd say when I watch a movie about programming
octopoc•2h ago
Essentially this is manual context management, and it’s still better for straightforward tasks that don’t require the AI to run commands (e.g. running unit tests).

I had Gemini cli running trying to do a straightforward refactor today, but when I copy-pasted the relevant code into the Gemini web app, it came up with the solution instantly.

franciscop•2h ago
Yes, I've seen this multiple times personally, it's often better to copy/paste and give detailed prompts in the standalone apps for higher quality than in the coding agents in your codebase.
34679•1h ago
The models don't know what portion of the entire context is relevant to your most recent query. The reason it works better is because in the standalone app, your query is the entire context, whereas otherwise it's query + x irrelevant tokens.
bilsbie•1h ago
A later post clarifies there’s some issue with cursor integration that will get fixed.
bionhoward•45m ago
is sending your whole codebase to xAI a good idea?
fumblebee•2h ago
If indeed, as the new benchmarks suggest, this is the new "top dog" of models, why is the launch feeling a little flat?

For comparison, the Claude 4 hacker news post received > 2k upvotes https://news.ycombinator.com/item?id=44063703

Ocha•2h ago
Nobody believes Elon anymore.
fumblebee•1h ago
Hm, impartial benchmarks are independent of Elon's claims?
ben_w•1h ago
Impartial benchmarks are great, unless (1) you have so many to choose from that you can game them (which is still true even if the benchmark makers themselves are absolutely beyond reproach), or (2) there's a difference between what you're testing and what you care about.

Goodhart's Law means 2 is approximately always true.

As it happens, we also have a lot of AI benchmarks to choose from.

Unfortunately this means every model basically has a vibe score right now, as the real independent tests are rapidly saturated into the "ooh shiny" region of the graph. Even the people working on e.g. the ARC-AGI benchmark don't think their own test is the last word.

irthomasthomas•54m ago
It's also possible they trained on test.
bigyabai•1h ago
"impartial" how? Do you have the training data, are you auditing to make sure they're not few-shotting the benchmarks?
irthomasthomas•55m ago
Likely they trained on test. Grok 3 had similarly remarkable benchmark scores but fell flat in real use.
mppm•2h ago
[flagged]
Aerbil313•1h ago
Probably more like: Claude was slightly better than GPT-xx when the IDE integrations first got widely adopted (this was also the time when there was another Altman/OpenAI scandal on the HN front page every other week), so most programmers preferred Claude. It then got into a virtuous cycle: Claude received the most coding-related user queries and became the better coding model among SOTA models, which resulted in the situation today.
v5v3•2h ago
Other AI companies post a 5 minute article to read.

This is a 50 minute long video, many won't bother to watch

ceejayoz•1h ago
I'm not sure there's any benchmark score that'd make me use a model that suddenly starts talking about racist conspiracy theories unprompted. Doubly so for anything intended for production use.
typon•1h ago
Its a shame this model is performing so well because I can't in good conscience pay money to Elon Musk. Will just have to wait for the other labs to do their thing.
johnfn•1h ago
Upvotes are a lagging indicator. Despite all the leaderboard scores presented, etc, no one actually knows how good a model is until they go use it for a while. When Claude 4 got ~2k upvotes, it was because everyone realized that Claude 3.7 was such a good model in practice - it had little to do with the actual performance of 4.
iamleppert•2h ago
Him talking about instilling "values" about how we should build an AI that, if like a child, would grow up to be incredibly powerful, reveals a lot about how he formulates his internal value system and how he relates to the world.
octopoc•2h ago
Yeah it reminds me of the Bobiverse’s take on how AI needs to be built: it needs to grow up, rather than waking up fully formed.

To me, AGI is achieved when the machine can improve itself and reproduce in a way that allows survival of the fittest and evolution to take place, though I’m sure when those goals are achieved someone will redefine AGI to be something even more unattainable.

pashadude•1h ago
dude spent 10²⁷ FLOPs to be 3 basis points better on workbench than Opus, which consumed 100 times less; we are nearing the plateau
MichaelRazum•1h ago
Technical question: Can someone explain how the vision backbone can be replaced after training? I think this is what they mentioned in the video. Just wondering how it would work, since I would suspect that the visual embedings would be highly affected.

PS: Is the approach something like LORA or a complete retrain on the visual part?
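
Nobody outside xAI has said how they did it, but one common VLM pattern makes a backbone swap plausible: the pipeline is encoder → projector → frozen LLM, and since each encoder has its own embedding space, swapping encoders means retraining only the small projector (LoRA-style adapters are a variant of the same idea). A toy sketch of the shape of that setup, with made-up dimensions:

```python
# Toy illustration (my sketch, not xAI's actual recipe): swapping a vision
# backbone behind a frozen LLM by retraining only the projector.

def old_encoder(image):            # old backbone: 2-d embedding space
    return [sum(image), len(image)]

def new_encoder(image):            # new backbone: a different 3-d space
    return [min(image), max(image), sum(image) / len(image)]

def project(vec, matrix):          # projector: encoder space -> LLM input space
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]

LLM_DIM = 4                                              # frozen LLM's input width
old_proj = [[1, 0], [0, 1], [1, 1], [0, 0]]              # 4x2, trained with old encoder
new_proj = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]  # 4x3, the only part retrained

image = [0.2, 0.8, 0.5]
old_tokens = project(old_encoder(image), old_proj)
new_tokens = project(new_encoder(image), new_proj)
# The frozen LLM sees same-shaped inputs either way; only the projector changed.
assert len(old_tokens) == len(new_tokens) == LLM_DIM
```

The visual embeddings are indeed completely different after the swap, which is exactly why the projector (or an adapter) has to be retrained while everything else can stay frozen.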

bilsbie•1h ago
I just thought of a good test. Anyone have feedback?

We completely remove a couple simple, obvious inventions from the training data and then see if the AI can come up with it. Perhaps a toothbrush for example. Or a comb? But there could be better examples that would also have minimal effect on the final Ai.

Training is expensive so we wouldn’t want to leave anything important out like the wheel.

throwuxiytayq•1h ago
Ok, you do it. Here’s the internet: https://internet Make sure you don’t miss any references while you’re combing through, though.
bilsbie•1h ago
I see your point but off the top of my head: a simple regex on each document for a list of dental related words that then gets earmarked for a small LLM to determine if it includes a toothbrush concept.
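
The two-stage filter described above could look roughly like this (the word list and the stubbed "small LLM" stage are illustrative, not a real pipeline): a cheap regex pass earmarks documents mentioning dental terms, and only those get sent to the expensive classifier.

```python
# Sketch of a two-stage training-data filter: regex prefilter, then a
# classifier (stubbed here) standing in for the "small LLM" stage.
import re

DENTAL_RE = re.compile(r"\b(tooth|teeth|dental|bristle|floss|gums?)\w*", re.I)

def earmark(documents):
    """Cheap pass: keep only documents that mention any dental term."""
    return [doc for doc in documents if DENTAL_RE.search(doc)]

def mentions_toothbrush(doc):
    """Placeholder for the small-LLM stage; a real filter would call a model."""
    low = doc.lower()
    return "toothbrush" in low or ("brush" in low and DENTAL_RE.search(doc))

docs = [
    "The wheel transformed transport.",
    "Brush your teeth twice a day with a soft-bristled brush.",
    "Dental hygiene in ancient Rome.",
]
flagged = earmark(docs)                               # 2 of 3 docs earmarked
removed = [d for d in flagged if mentions_toothbrush(d)]  # only the brushing doc
```

The hard part, as other replies note, is leakage: the concept survives in paraphrases, images, and translations that no keyword list will catch.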
ben_w•1h ago
Ilya Sutskever suggested the same basic idea but for testing for consciousness.

I have no idea why this is a PDF, but here's a transcript: https://ecorner.stanford.edu/wp-content/uploads/sites/2/2023...

fsh•1h ago
LLM companies try to optimize their benchmark results, not to test the capabilities of their systems. This is why all the benchmarks are so utterly useless.
thorum•27m ago
It’s very, very hard to remove things from the training data and be sure there is zero leakage.

Another idea would be to use, for example, a 2024 state of the art model to try to predict discoveries or events from 2025.

Der_Einzige•1h ago
This thread is proof that HN needs serious moderation/rule reform and that the flagging feature is bad and is actively being misused here.
eutropia•1h ago
The only good thing about this launch is that it will push the other (sane) companies to release their new frontier models.
nu11ptr•1h ago
Perhaps a dumb question, but is the only way to use grok 4 for now via grok.com? Only via paid? No way to try it out for free, correct?
irthomasthomas•52m ago
They have an API too and you can use via openrouter
andreygrehov•1h ago
I just tried Grok 4 and it's insanely good. I was able to generate 1,000 lines of Java CDK code responsible for setting up an EC2 instance with certain pre-installed software. Grok produced all the code in one iteration. 1,000 lines of code, including VPC, Security Groups, etc. Zero syntax errors! Most importantly, it generated userData (#!/bin/bash commands) with accurate `wget` pointing to valid URLs of the latest software artifacts on GitHub. Insane!
awaymazdacx5•44m ago
wow, use the dollar to go into effect. source code was open sourced back in April 2024.