In the end, the model only ran `git bisect reset` (if we're to believe the video, at least), which does nothing useful here: it just ends a bisect session, and none was in progress. Why did it run that? Well, the user asked it to use `git bisect` to find a specific commit, but that doesn't make sense: `git bisect` binary-searches for the commit that introduced a regression, it doesn't look up an arbitrary commit, so what the user is asking for isn't possible.
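(For reference, a normal `git bisect` session looks roughly like the sketch below; the `v1.2.0` tag is a made-up example of a known-good ref. `git bisect reset` is only the cleanup step at the end, which is all the model ran.)

```sh
# Binary-search for the commit that introduced a regression.
git bisect start
git bisect bad HEAD        # the current commit is broken
git bisect good v1.2.0     # hypothetical tag known to work

# git now checks out a midpoint commit; build/test it, then mark it:
git bisect good            # or: git bisect bad
# ...repeat until git prints "<sha> is the first bad commit"...

git bisect reset           # end the session, return to the original branch
```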
Instead of stopping and saying "Hey, that's not the right idea, did you mean ... ?" to make sure the task is actually possible and is what the user wants, the model runs its own race and starts invoking a bunch of other git commands, because that's how you'd actually find the commit the user is looking for.
I think I see the same thing when letting LLMs write code. If you give them work that is actually impossible but worded so it kind of makes sense, they'll produce something, just not what you wanted. I think they're doing exactly the same thing there: bypassing what you clearly instructed so that they at least do something.
I'm not sure if I'm just hallucinating that they act like this, but LLMs doing "the wrong thing" has hit me more than once, and it's easy to imagine something more dangerous than "do a git bisect". To me, that video says Gemini 3 Pro will act exactly the same way, with no improvement on that front.
chis•12m ago
Currently my ranking is
* Cursor composer: impressively fast and able but not tuned to be that agentic, so it's better for one-shot code changes than long-running tasks. Fantastic UI.
* Claude Code: Works great if you can set up a verifiable environment and a clear plan, then set it loose to build something for an hour.
* Grok: Similar to Cursor composer but slower and more agentic. Not currently using it.
* ChatGPT Codex, Gemini: Haven't tried yet.