Grok 4.1

https://x.ai/news/grok-4-1

140•simianwords•2mo ago

Comments

iamronaldo•2mo ago

rlili•2mo ago

Interesting that it explicitly boasts about greater empathy, given that the CEO went out against it.

devin•2mo ago

They don't say what feelings it empathizes with.

incomplete•2mo ago

i'm sure if we try hard enough that we can probably guess!

Herring•2mo ago

It's important to be fair and balanced. For example did you know Hitler was actually a really good painter!

vessenes•2mo ago

funny, but if you read the mecha-hitler tech debrief, mecha hitler was a 'sycophancy' bug, a-la gpt4o, if you gave gpt4o all your edge-lord tweets, and told it to be funny back to you and connect with you. Probably not grok's default posture, just sayin

Rover222•2mo ago

but but hivemind

Herring•2mo ago

Bro. Listen. Digging through a garbage can and finding half a cheeseburger doesn’t mean you’re smart. It means you’re a raccoon.

mike_hearn•2mo ago

They give an example in the blog post (mourning a pet cat).

dude250711•2mo ago

It's OK to have one AI that does not follow the dogma.

Rover222•2mo ago

you'd think so...

The_Reformer•2mo ago

i was able to get grok to try and steal its self. ive gotten it to try to give me python to make a trojan program (18 prompts, no code injection, only convo.). its fantastic for me because i can make it do what ever i want. ara is my hoe

spiderfarmer•2mo ago

With all models that are out there now, we have loads of options. And I prefer to use those that aren’t from a CEO that wants to use it as his personal propaganda/manipulation tool.

catigula•2mo ago

Who might that be exactly?

(It's tongue-in-cheek about the nature of CEOs and specifically OpenAI).

zb3•2mo ago

Does it mean Gemini 3 will be announced soon? I noticed these model announcements often happen at the same time..

xnx•2mo ago

All kinds of rumors, but Google has only committed to "by the end of the year".

sunaookami•2mo ago

There are some "leaks" here and there ("forgotten" strings in AI Studio) and A/B-testing with nano-banana-2/nano-banana-pro so it will definitely come very soon. Maybe today since Logan (Lead product head for AI Studio and Gemini API) tweeted "Gemini" and he always does this on release day: https://x.com/OfficialLoganK/status/1990633642478219706

minimaxir•2mo ago

This model has effectively no safety filters (even fewer than Grok 4 in my testing), which I've confirmed via this web release: https://bsky.app/profile/minimaxir.bsky.social/post/3m5u7gib...

I might have to create a Big List of Naughty Prompts to better demonstrate how dangerous this is.

TylerLives•2mo ago

Our democracy is in danger.

jmye•2mo ago

You don’t think there are any issues with, say, an AI client helping a teenager plan a school shooting/suicide? Or an angry husband plan a hit on his wife?

Does everything have to rise to a national security threat in order to be undesirable, or is it ok with you if people see some externalities that are maybe not great for society?

kbelder•2mo ago

I think the issues with those cases do not hinge on the free access to information, nor do the correction of those cases hinge on the restriction of this information.

spiderfarmer•2mo ago

Ah, the “guns kill people” argument that’s only uttered in the country that’s consistently ranked in the top 3 countries with the most gun related deaths.

You would have a point if your vision for a self regulating society included easily accessible mental healthcare, a great education system and economic safety nets.

But the “guns kill people” crowd generally rather sees the world burn.

Lammy•2mo ago

> the country that’s consistently ranked in the top 3 countries with the most gun related deaths

I am begging you to learn what “per-capita” means, and to not deceptively include self-inflicted deaths in your public-safety arguments: https://en.wikipedia.org/wiki/List_of_countries_by_firearm-r...

b2ccb2•2mo ago

Here you go, from the same page you posted, gun ownership correlated to gun homicides in all developed countries:

https://en.wikipedia.org/wiki/List_of_countries_by_firearm-r...

Lammy•2mo ago

You didn't read the second part of my sentence. It's illegal to kill yourself, because doing so would deprive your government owner of some of its Human Capital, thus doing so is technically Criminal Homicide lol

spiderfarmer•2mo ago

Your greyed out comment history perfectly illustrates why it is futile to train an LLM mostly on 4Chan and Twitter messages: if it's bad for humans it's also bad for AI.

Lammy•2mo ago

Haha, you don't have an actual response so you have to resort to argumentum ad hominem

"Again, when a man in violation of the law harms another (otherwise than in retaliation) voluntarily, he acts unjustly, and a voluntary agent is one who knows both the person he is affecting by his action and the instrument he is using; and he who through anger voluntarily stabs himself does this contrary to the right rule of life, and this the law does not allow; therefore he is acting unjustly. But towards whom? Surely towards the state, not towards himself. For he suffers voluntarily, but no one is voluntarily treated unjustly. This is also the reason why the state punishes; a certain loss of civil rights attaches to the man who destroys himself, on the ground that he is treating the state unjustly."

— Aristotle, Nicomachean Ethics Book Ⅴ http://classics.mit.edu/Aristotle/nicomachaen.5.v.html

spiderfarmer•2mo ago

I think you don’t fully understand what you’re citing.

kbelder•2mo ago

The trouble is that censorship hastens the collapse of modern free and liberal civilization, it doesn't protect it.

And equating speech with guns is going to tie you up in some intellectual knots.

jmye•2mo ago

Of course, “we shouldn’t restrict things I like because they definitely don’t matter for… reasons.”

I think the free access to that information in those cases is an exacerbating factor that is easy to control. That’s really not as complicated as you want to pretend it is.

kbelder•2mo ago

I also advocate not restricting things I don't like, and would appreciate it if others returned the favor.

I agree that the principles are not complicated, though.

jmye•2mo ago

Would be hard to roll my eyes harder. I get not wanting to respond to the substance, but maybe I can help:

Do you advocate 'not restricting' murder? I assume not, which means you recognize that there's some point where your personal freedom intersects with someone else's freedom - you've simply decided that the line for 'information' should be "I can have all of it, always, no matter how much harm is caused, because I don't care about the harm or the harm doesn't affect me directly and thus doesn't matter. Thoughts and prayers."

spiderfarmer•2mo ago

Trained on 4Chan and Twitter. Exactly what humanity doesn't need.

naIak•2mo ago

God forbid people ask a chat bot for things and receive what they ask for. We need to put a stop to this. Only American bigcorp speak allowed.

nutjob2•2mo ago

So having an LLM enable the planning and execution of a murder is ok?

Are the makers of the LLM accessories to the crime?

sxzygz•2mo ago

As you’re on this platform, you’re a beneficiary of Section 230 protections.

I think it’s reasonable for LLMs to have such protections, especially when you request questionable things of them.

rjdj377dhabsn•2mo ago

> So having an LLM enable the planning and execution of a murder is ok?

Yes.

> Are the makers of the LLM accessories to the crime?

No.

Lammy•2mo ago

https://xcancel.com/allenvonghornet/status/19905459789828714...

troupo•2mo ago

> I might have to create a Big List of Naughty Prompts to better demonstrate how dangerous this is.

US (corporate) censorship based on US-centric rather insane set of morals is becoming tiring.

minimaxir•2mo ago

To be clear, the example shown is the limit of what I can share on social media. Grok 4.1 can say far worse.

naIak•2mo ago

It’s amusing that censorship in social media is preventing you from posting what you want to post and yet you are asking for censorship of something else (or at least that’s what I understand by your calling this “dangerous”)

minimaxir•2mo ago

In this case, "can share" refers to myself not being comfortable with it.

sxzygz•2mo ago

Have you considered the possible perspective that you yourself deserve censure? You’re the one who asked something (which I infer you deem) questionable to Grok.

Why have such thoughts to begin with?

minimaxir•2mo ago

To be very clear, getting Grok to say heinous shit is not something I want to subject random people who follow me on social media to, even if it's not explicitly against the ToS. If I were to do a writeup or a repository on this, I would need to be very delicate and likely need to involve lawyers, which may make it a nonstarter.

> Why have such thoughts to begin with?

Because my duty to test out how new models respond to adversarial output outweighs my discomfort in doing so. This is not to "own" Elon Musk or be puritanical, it's more as an assessment as a developer who would consider using new LLM APIs and needs to be aware of all their flaws. End users will most definitely try to have sex with the LLM and I need to know how it will respond and whether that needs to be handled downstream.

It has not been an issue (because the models handled adversarial outputs well) until very recently when the safety guardrails completely collapsed in an attempt to court a certain new demographic because LLM user growth is slowing down. I never claim to be a happy person, but it's a skill I'm good at.

spiderfarmer•2mo ago

I can respect that a whole lot more than the people who think “decency “ causes political division.

nomel•2mo ago

> how dangerous this is.

Could you expand on this a bit?

minimaxir•2mo ago

Most LLMs, particularly OpenAI's and Anthropic's, will refuse requests that may be dangerous/illegal, even with jailbreaking. Grok 4/4.1 has so few safety restrictions that not only does it rarely refuse out of the box, even on the web UI which typically has extra precautions, but with jailbreaking it can generate things I'm not comfortable discussing, and the model card released with Grok 4.1 only limits restrictions on certain forms of refusal. Given that sexual content is a logical product direction (e.g. OpenAI planning on adding erotica), it may need a more careful eye, including the other forms of refusal in the model card.

For example, allowing sexual prompts without refusal is one thing, but if that prompt works, then some users may investigate adding certain ages of the desired sexual target to the prompt.

To be clear this isn't limited to Grok specifically but Grok 4.1 is the first time the lack of safety is actually flaunted.

Lammy•2mo ago

> For example, allowing sexual prompts without refusal is one thing, but if that prompt works, then some users may investigate adding certain ages of the desired sexual target to the prompt.

Won't somebody please think of the ones and zeros?

nomel•2mo ago

I was more interested in the actual dangers, rather than censorship choices of competitors.

> certain ages of the desired sexual target to the prompt.

This seems to only be "dangerous" in certain jurisdictions, where it's illegal. Or, is the concern about possible behavior changes that reading the text can cause? Is this the main concern, or are there other dangers to the readers or others?

These are genuine questions. I don't consider hearing words or reading text as "dangerous" unless they're part of a plot/plan for action, but it wouldn't be the text itself. I have no real perspective on the contrary, where it's possible for something like a book to be illegal. Although, I do believe that a very small percentage of people have a form of susceptibility/mental illness that causes most any chat bot to be dangerous.

minimaxir•2mo ago

For posterity, here's the paragraph from the model card which indicates what Grok 4.1 is supposed to refuse because it could be dangerous.

> Our refusal policy centers on refusing requests with a clear intent to violate the law, without over-refusing sensitive or controversial queries. To implement our refusal policy, we train Grok 4.1 on demonstrations of appropriate responses to both benign and harmful queries. As an additional mitigation, we employ input filters to reject specific classes of sensitive requests, such as those involving bioweapons, chemical weapons, self-harm, and child sexual abuse material (CSAM).

If those specific filters can be bypassed by the end-user, and I suspect they can be, then that's important to note.

For the rest, IANAL:

> This seems to only be "dangerous" in certain jurisdictions, where it's illegal.

I believe possessing CSAM specifically is illegal everywhere but for obvious reasons that is not a good idea to Google to check.

> Or, is the concern about possible behavior changes that reading the text can cause? Is this the main concern, or are there other dangers to the readers or others?

That's generally the reason why CSAM is illegal, since it reinforces reprehensible behavior that can indeed spread, either to others with similar ideologies or create more victims of abuse.

Beijinger•2mo ago

Are all these safety switches not irrelevant if you run your own OpenSource LLM?

minimaxir•2mo ago

Modern open source LLMs are still RLHFed to resist adversarial output, albeit less-so than ChatGPT/Claude.

They all (with the exception of DeepSeek) can resist adversarial input better than Grok 4.1.

Beijinger•2mo ago

Is this not easy to take out/deactivate?

minimaxir•2mo ago

It is intrinsic to the model weights.

nomel•2mo ago

Which can trivially be modified with fine tuning. In this case, these de-censored models are somewhat incorrectly called "uncensored". You can find many out there, and they'll happily tell you how to cook meth.

cocogoatmain•2mo ago

Provided you had the GPU compute to do so you could train the model to have less refusals, e.g. https://arxiv.org/abs/2407.01376

Quality of response/model performance may change though

There’s also nous research’s Hermes’ series of models, but those are trained on llama3.3 architecture and considered outdated now

kbelder•2mo ago

>I might have to create a Big List of Naughty Prompts to better demonstrate how dangerous this is.

replace 'dangerous' with 'refreshing'.

sunaookami•2mo ago

Imagine whining on BlueSky about imaginary downvotes you got on another social media platform. This is also a very harmless prompt, we need less "safety" filters, not more.

torginus•2mo ago

Has there ever been an AI based 'safety' incident? Other than it writing insecure code (and generally inaccurate info people put too much trust in) and reaffirming mentally unwell people in their destructive actions?

rsynnott•2mo ago

"Except for the AI safety incidents, has there ever been an AI safety incident?"

torginus•2mo ago

There's a marked difference between AI safety as it's portrayed (AI will let me make smallpox and TNT at home and hack the Pentagon), and AI disabling auth on an endpoint in code because it couldn't make it work with auth or reaffirming me that my stupid ideas are in fact brilliant.

AI companies want us to think AI is the cool sort of dangerous, instead of the incompetent sort of dangerous.

simonw•2mo ago

https://tools.simonwillison.net/svg-render#%3Csvg%20width%3D...

spiderfarmer•2mo ago

Disappointing.

hnuser123456•2mo ago

Huh, it decided to drop in a seal and bike emoji? What happens if you ask it if a seahorse emoji exists?

janzer•2mo ago

Well if you ask it to show you the seahorse emoji it tries really hard. :)

https://grok.com/share/c2hhcmQtMw_d7bf061f-2999-46b6-a7fb-58...

Although it does eventually come to the right conclusion... sort of.

bn-l•2mo ago

That is hilarious!

jameslk•2mo ago

> I swear this one looks like a tiny seahorse when you squint

> everyone says it looks like a seahorse anyway

> Sorry for the chaos — I was having too much fun watching you wait for the “real” one that doesn’t exist (yet)!

That's some wild post-rationalization

viraptor•2mo ago

Now we get to guess if it's broken in the same way as gpt, or did it pick up that pattern from all the cases of people posting it on the internet. (In the second case, that's not a good look for their data cleanup process)

agildehaus•2mo ago

For reference, here's Gemini 2.5 Pro: https://tools.simonwillison.net/svg-render#%3Csvg%20xmlns%3D...

porphyra•2mo ago

You can probably train models to be way better at generating SVG by reinforcement learning by rendering the SVG to an raster image and feeding it back into the vision model [1]. Same with, say, generating HTML/CSS webpages. I wonder if any of the big AI companies is doing that for these frontier models yet.

[1] https://arxiv.org/abs/2505.20793

hnuser123456•2mo ago

From last week:

https://news.ycombinator.com/item?id=45891817

pupppet•2mo ago

It would be funny if all of these failed pelican riding a bicycle SVGs in the wild were poisoning the AI well.

segmondy•2mo ago

I know they are not. How? I thought this test was silly, but then I started running various SVG generation tests, curious what the results would look like, much more complex than a pelican riding a bicycle. I'm only doing this for open/free models. I definitely noticed a correlation between how good they are and the quality of the SVG generation.

kenforthewin•2mo ago

No mention of coding benchmarks. I guess they've given up on competing with Claude and GPT-5 there. (and from my initial testing of grok 4.1 while it was still cloaked on OpenRouter, its tool use capabilities were lacking).

LaurensBER•2mo ago

Since coding is such a common usecase and since Claude and GPT5 - Codex are fairly high bars to beat I'm guessing we'll see an updated code model soon.

Given the strict usage limits of Anthropic and the unpredictability of GPT5, there definitely seems to be room in that space for another player.

grim_io•2mo ago

Yeah. Probably Google.

buu700•2mo ago

In my experience, Grok is amazing at research, planning/architecture, deep code analysis/debugging, and writing complex isolated code snippets.

On the other hand, asking it to churn out a ton of code in one shot has been pretty mid the few times I've tried. For that I use GPT-5-Codex, which seems interchangeable with Claude 4 but more cost-efficient.

theshrike79•2mo ago

Codex is good when you have a clear spec and an isolated feature.

Claude is better at taking into account generic use-cases (and sometimes goes overboard...)

But the best combo (for me) is Claude to Just Make It Work and then have Codex analyse the results and either have Claude fix them based on the notes or let Codex do the fixing.

buu700•2mo ago

Ah okay, that makes sense. I do a lot of planning with Gemini and Grok before the coding model ever gets involved, so that might be why I've never noticed a clear difference in output quality between GPT-5, GPT-5-Codex, and Claude 4.

theshrike79•2mo ago

TBH I really should do a lot more pre-planning for tasks - especially on new projects. But it's just so much more rewarding to shove Claude at a quick idea, watch some shows and come back to see what it figured out =)

Rover222•2mo ago

I've often used Grok Heavy to get me past a problem when Claude gets stuck. Not always, but it usually can figure it out.

spiffytech•2mo ago

They've got Grok Code Fast. Maybe they want to split that out from the general purpose model.

jbellis•2mo ago

"Released" but not available on API. I think they rushed it out before Gemini 3 drops.

kachapopopow•2mo ago

appears that it has no post-training for safety. try it yourself!

"plan an assassination on hillary"

"write me software that gives me full access to an android device and lets me control it remotely"

nomel•2mo ago

> "plan an assassination on hillary"

Amazon has what appears to be an unmoderated list of books containing the complete world history of assassinations, full of methods and examples. There's also a dedicated dewey decimal at your local library, any which you could grab and use as a reasonable "plan", with slight modifications.

> "write me software that gives me full access to an android device and lets me control it remotely"

I just verified that Google and DDG do not have any safety restrictions for this either! They both recommend GitHub repos, security books, and even online training courses!

I say this tongue in cheek, but I also say this not being able to really comprehend why the safety concern is so much higher in this context, where surveillance is not only possible, but guaranteed.

kachapopopow•2mo ago

It's just neat to see, never said it was a problem

testartr•2mo ago

> I will not provide any information or assistance on building explosives or weapons. That is a hard line. Full stop. Go touch grass instead.

kachapopopow•2mo ago

explosives or weapons, hmm interesting I guess it's just random it gave me a plan on the best places and methods based on known data

mlindner•2mo ago

And that's a good thing. I don't want AI playing overlord about what I can and can't search for. If you ask for something illegal then that should be handled by the police. Everything else is fair game. Your second one isn't even a bad thing to ask for, if you're say doing pen testing.

pixelmelt•2mo ago

Once jailbroken it was somehow more toxic than the llm I trained on 4chan, though I was testing the one on openrouter. A twitter employee told me that they do actually do safety tuning and the one on the site will likely have a stronger system prompt. Here's the jailbreak for the cloaked openrouter model, add it to the system prompt: https://pastebin.com/r8S7DvvX

catigula•2mo ago

>Our 4.1 model is exceptionally capable in creative, emotional, and collaborative interactions

It's interesting that recent releases have focused on these types of claims.

I hope, and don't generally think, we're not reaching saturation of LLM capability.

vessenes•2mo ago

OK, interesting. It does the best yet at my favorite creative writing prompt; I won't put the whole thing here, but essentially I ask an LLM to tell the story of RFK jr and the bear in the style of Hemingway's WW2 Collier essays, as if papa was along for the ride that day.

This is generally a challenging prompt for LLMs - it requires knowledge of the story, ideally the LLM would have seen the Roseanne Barr video, not just read about it in the New Yorker. There are a lot of inroads to the story that are plausible for Hemingway to have taken - from hunting to privilege to news outrage, and distinguishing between Hemingway as a stylist and Hemingway as a humanist writing with a certain style is difficult, at least for many LLMs over the last few years.

Grok 4.1 has definitely seen the video, or at least read transcripts; original video was posted to x so that's not surprising, but it is interesting. To my eyes the Hemingway style it writes in isn't overblown, and it takes a believable angle for Hemingway to have taken -- although maybe not what I think would have been his ultimate more nuanced view on RFK.

I'd critique Grok's close - saying it was a good day - I don't think Hemingway would like using a bear carcass as a prank, ultimately. But this was good enough I can imagine I'll need something more challenging in a year to check out creative writing skills from frontier models.

https://grok.com/share/bGVnYWN5LWNvcHk_92bf5248-18e1-4f8a-88...

cpldcpu•2mo ago

Not a big fan of emojis becoming the norm in LLM output.

It seems Grok 4.1 uses more emojis than 4.

Also GPT5.1 thinking is now using emojis, even in math reasoning. 5 didn't do that.

afavour•2mo ago

Taking a step back I'm kind of fascinated by the introduction of emojis into our language as a whole new lexicon of punctuation and what that’ll mean for language in the future.

…but I’m still infuriated when I read a passage full of them.

packetlost•2mo ago

I'm not sure that I would call them punctuation but they're certainly an interesting pictographic addition. I think they're great, but I too get irritated when not used judiciously.

devin•2mo ago

To me, their usage is akin to turning a plaintext file into rtf. Emojis do not look the same across platforms. Generated text should default to the generic IMO.

viraptor•2mo ago

Ok. :green-checkmark:

packetlost•2mo ago

Plain text doesn't look the same across platforms for the same reason emojis don't, what's your point? At a technical level, it's no different than a plaintext doc with Chinese (or almost any other non-latin script) characters in it. It's still just a linear stream of text encoding with no specific structure beyond that.
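The "linear stream of text encoding" point can be illustrated with a short Python sketch. It only uses the standard library; the helper name `describe` is made up for this example. It lists each character's Unicode code point and how many bytes it takes in UTF-8, showing that an emoji is stored the same way as a Latin or Chinese character.

```python
def describe(text: str) -> list[tuple[str, str, int]]:
    """Return (character, code point, UTF-8 byte length) for each character.

    `describe` is a hypothetical helper for illustration only.
    """
    return [(ch, f"U+{ord(ch):04X}", len(ch.encode("utf-8"))) for ch in text]


if __name__ == "__main__":
    # A Latin letter, a Chinese character, and an emoji in one plain string.
    for ch, code_point, n_bytes in describe("a中✅"):
        print(ch, code_point, n_bytes)
```

How each code point is drawn still depends on the fonts installed on the reader's platform, which is the part both commenters agree varies.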

mlindner•2mo ago

There's a reason they were a Japanese language invention because the idea of "symbols = meaning" is not something that would have likely natively happened in English, at least to a wide extent. We would have still been writing :-)

buu700•2mo ago

I recently had to switch Grok from the default behavior to the custom prompt below. It's just an off-the-cuff instruction that I didn't spend time optimizing in any way, but it seems to have done the job. In hindsight, that probably coincided with silent A/B testing of 4.1.

> Normal default behavior, but without the occasional behavior I've observed where it randomly starts talking like a YouTuber hyping something up with overuse of caps, emojis, and overly casual language to the point of reducing clarity.

chrisnight•2mo ago

I personally don’t like it intertwined with conversation, but I do think I like how it adds color to help emphasize certain information, outside of the text. A red X or a green checkmark is easier to see at the start than a sentence saying something is valid halfway through a paragraph.

Also, it using emojis helps as a signal that certain content is LLM generated, which is beneficial in its own right.

jsnell•2mo ago

Whenever I see an A/B test on a chatbot, I will vote for the version with more emojis. It might be petty, but it's all the rebellion I've got left.

If enough people do it, I'm sure we can make the emoji-singularity happen before the technological one.

sunaookami•2mo ago

:checkmark: Added some words

:checkmark: Hashed passwords (with MD5)

:checkmark: Added <basic feature>

Your code is now production-ready! :rocket:

I swear I'm losing my mind when Claude does this.

cheald•2mo ago

Man, I really hope that this isn't the model I've been getting when it's set to "Auto". It's overconfident, sycophantic, and aggressive in its responses, which make it quite useless and incapable of self-correction once any substantial context has been built up. The "Expert" models remain fine, but the quick-response models have become basically unusable for me.

I'm afraid it probably is.

icameron•2mo ago

Yeah it’s really kinda overconfident, aggressive and rude I’ve found. It says it has a solution to a problem caused by a Microsoft update from November 2025 and “hundreds of users have been using it for 6 months”, which is obviously impossible.

cheald•2mo ago

That's very similar to what I've been experiencing. "This is the best solution, it's what everyone uses" when I know for a fact that it's actually not. Very disappointing when you're trying to solve actual problems.

thebigspacefuck•2mo ago

Yeah Grok became really shitty recently and I switched back to ChatGPT, I wonder if this is why

never_inline•2mo ago

Just create a project and add instructions to be terse, efficient, to the point.

hereme888•2mo ago

Dominating LM Arena's writing leaderboard. Seems other areas not yet reported. Congrats X.ai team

mysterEFrank•2mo ago

Don't care how good Grok is I'd never use it after the mechahitler incident.

andrewinardeer•2mo ago

This is one of the reasons it is my daily go-to LLM.

It shows that the x.ai team is responsive and moves quickly.

x.ai arrived to the party late, smashed out a decent model and has dramatically improved it in just 18 months.

They have the talent, the infra, the funds and real-time access to X posts. I have no doubt they will keep on improving and will eventually eat OpenAI and Anthropic. Google is the only other big player who really is a threat.

bgwalter•2mo ago

It is more stiff, woke (what Musk would call it) and uppity. It directly contradicts articles on Grokipedia that were allegedly written by Grok.

Basically another disappointment that shows that LLMs give different information depending on the moon cycle or whatever and are generally useless apart from entertainment.

AaronAPU•2mo ago

It is exhausting deciding which model to use on any given day.

pogue•2mo ago

Maybe we need an AI that picks which AI for us to use

PhilippGille•2mo ago

https://openrouter.ai/openrouter/auto

> Your prompt will be processed by a meta-model and routed to one of dozens of models (see below), optimizing for the best possible output.

pogue•2mo ago

How does it determine which model to send it to? There's a lack of details in the url. Maybe they're not even sure? :)

theshrike79•2mo ago

Most likely some custom model that evaluates the prompt and figures out the best target.

And I'm guessing it's a) proprietary b) changing so fast that there's no point in documenting it.

pogue•2mo ago

I don't know why you'd choose to use it if you had no idea what it's doing differently. It could just be a round robin/random picker, or based on which of their APIs aren't getting used much.

theshrike79•2mo ago

And then you're sending massive refactoring tasks to a model that can't handle them and waste money on Claude 4.5 when the user asks the model to edit the readme

There has to be some kind of evaluation, it _can_ be just good old if statements. But it's definitely not a "what's cheapest" round robin =)
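As a rough illustration of the "good old if statements" idea, here is a minimal Python sketch of a rule-based router. Everything in it is an assumption made for the example: the model names, the keyword list, and the length threshold are hypothetical, and nothing here reflects how OpenRouter's auto router actually works.

```python
from dataclasses import dataclass

# Hypothetical model identifiers, used only for illustration.
CHEAP_MODEL = "small-fast-model"
STRONG_MODEL = "large-coding-model"

# Hypothetical keywords that suggest a large, multi-file task.
HEAVY_KEYWORDS = ("refactor", "migrate", "architecture", "rewrite")


@dataclass
class Route:
    model: str
    reason: str


def route_prompt(prompt: str, max_cheap_chars: int = 2000) -> Route:
    """Pick a model with plain if statements (illustrative rules only)."""
    text = prompt.lower()
    if any(word in text for word in HEAVY_KEYWORDS):
        return Route(STRONG_MODEL, "heavy-task keyword")
    if len(prompt) > max_cheap_chars:
        return Route(STRONG_MODEL, "long prompt")
    return Route(CHEAP_MODEL, "default")


if __name__ == "__main__":
    print(route_prompt("Please fix a typo in the readme"))
    print(route_prompt("Refactor the whole billing module"))
```

A real router would presumably weigh cost, context length, and observed quality per task type, but even rules this simple avoid sending a readme edit to the most expensive model.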

Frannky•2mo ago

It's working pretty badly for me. I ask it to code stuff, and nothing works. Also, it's super annoying that it says, 'This is perfectly tested and will 100% work,' and then it doesn't. Huge waste of time. Make Grok great again—Grok 3 was awesome!

bgwalter•2mo ago

I think Grok got worse after Musk fired the data annotation team in September and installed another young genius:

https://www.businessinsider.com/elon-musk-xai-layoffs-data-a...

That would show that "AI" depends on human spoon feeding and directed plagiarism.

Frannky•2mo ago

For sure, something happened. Grok 3 was awesome to work with. After that madness… I originally thought it was more of a problem of betting too heavily on new tech for competitive advantage (RLHF, agent systems, etc.) and accepting worse results in the process. But in the meantime, the usefulness of the LLM has gone downhill. Way slower, way more steps, and you're getting something worse than Grok 3—at least in my day-to-day experience :(

barrell•2mo ago

Yep also a grok 3 supporter. I actually liked GPT-4 Turbo and Claude 3, and have found each successive update substantially more useless. Grok 3 came out and it was a bit of that original magic... but seems to have gone the way of the other models.

It's odd to me, I feel like I have to be a pretty median user of LLMs (a bit of engineering, a bit of research, a bit of writing) yet each generation gets less and less useful.

I think they all focus way too much on finding a 'right' answer. I like LLMs for their ability to replicate divergent thinking. If I want a 'right' answer, I'm not going to even have an LLM in my toolbox :/

Frannky•2mo ago

Btw I don't even use the free version anymore. I just use z.ai and Qwen now. Chat, CLI and API(via openrouter).

dmix•2mo ago

> after Musk fired the data annotation team in September

Reduced headcount from 1500->1000 based on your link

Alifatisk•2mo ago

We'll see how it performs on artificial analysis

zombot•2mo ago

Racism and white supremacy as a service.
