* give an error
* return the wrong result
* not be internally consistent with the rest of the content
* be logically impossible
* be factually impossible
* have basic errors
It is entirely possible (and quite common) to know something is wrong without knowing what a right answer is.
An example of the opposite is "public policy development", which is why it's simply malicious that various corrupt oligarchs are pushing for these tools to be used for such things.
So, a simple model to help you understand why other people might find these tools useful for some things:
- low stakes - doesn't matter that much if the output isn't Top Quality, either because it's easy to fix or it just doesn't matter
- enormous gap in cost between generation and review - e.g. coding
- review systems exist and are used - I don't care very much if my coworkers use an LLM to write code or not, since all the code gets reviewed by someone else, and if the proposer of the change doesn't even bother to check it themselves then they pay the social cost for it
If your quality threshold is so low that you can tolerate crashes on invalid input, you will certainly cut corners when building software for others. I would dread having you on my team, let alone using a piece of software that you wrote.
> I don't care very much if my coworkers use an LLM to write code or not, since all the code gets reviewed by someone else
Ah, yes, let's kick the can down the road.
> if the proposer of the change doesn't even bother to check it themselves then they pay the social cost for it
... The side effects of shoddy code are not redeemed by "paying a social cost". They negatively impact your users, and thus the bottom line of your company.
In essence, they are only adequate in niche situations (like creative writing, marketing, or placeholder content during iterative design, …) where there's no such social contract and no assumption that people operate in good faith and exercise due diligence not to deceive others.
Pretending otherwise, not pushing back when LLMs are clearly used outside of those contexts, or dressing them up as what they are not (thinking machines, search engines, knowledge archives, …) is doing the work of useful idiots defending tech oligarchs and data thieves against their own interests.
And yeah, I get it, naysayers are annoying. That doesn't mean they are wrong or that their voices shouldn't be heard at a time when the legality and ethics of all this are being debated.
How long do your teams take to write vs review PRs? How long does it take to review a test case and run it vs write the implementation under test? Or to verify that a fix makes a regressed test pass again? How long does it take you to do a "design review" of a rendered webpage vs to create a static webpage? How long does it take to evaluate a performance optimization vs write it?
> How long does it take to review a test case and run it vs write the implementation under test?
If you blindly trust a passing test and don't review it as production code, I have a bridge to sell you.
> How long does it take to evaluate a performance optimization vs write it?
Factoring in the time to review that the optimization didn't introduce a regression, and isn't a hack that will cause other issues later: the difference shouldn't be too large.
Yes, code usually takes more time and effort to write, but if it's not thoroughly read, understood, and reviewed, it can cause havoc someone will have to deal with later.
The idea that LLMs helping you write code quicker will automatically make you or the team more productive is delusional. It's just kicking the can down the road. You can ignore it, but sooner or later someone will have to handle it, and you'd better hope that happens before it impacts your users.
That said, I'd like this quality in a relatively quick tool-using model; I'm not sure what more I'd want before calling it "AGI" at that point.
The only example uses I see written about on HN are basically Substack users asking o3 marketing questions and then writing Substack posts about it, plus a smattering of vague posts about debugging.
Example: Pull together a list of the top 20 startups funded in Germany this year, valuation, founder and business model. Estimate which is most likely to want to take on private equity investment from a lower mid market US PE fund, as well as which would be most suitable taking into consideration their business model, founders and market; write an approach letter in english and in german aimed at getting a meeting. make sure that it's culturally appropriate for german startup founders.
I have no idea what the output of this query would be, by the way, but it's one where I would trust it to get the following right:
* the list of startups
* the letter and its cultural sensitivity
* broad strokes of what the startup is doing
Stuff I'd "trust but verify" would be
* Names of the founders
* Size of company and target market
Stuff I'd double check / keep my own counsel on:
* Suitability and why (note that o3-pro is definitely better at this than o3, which is already not bad; it has some genuinely novel and good ideas, but often misses things)
With no deep research - agreed; it's too recent to believe the info is accurately stored in the model weights.
how do you validate all of that is actually correct?
Like how there's a ton of psychics, tarot and palm readers around Wall St.
If OP had suggested that they were just medium-quality nonsense generators I would have just agreed and not replied.
Then I have it take those matches and try and chase down the hiring manager based on public info.
I did it at first just to see if it was possible, but I am getting direct emails that have been accurate a handful of times, and I never would have gotten those on my own.
Thank you!
With coding, using anything is always hit and miss, so I prefer to have faster models where I can throw away the chat if it turns into an idiot.
Would I wait 15 minutes for a transcription from Python to Rust if I don't know what the result will be? No.
Would I wait 15 minutes if I were a mathematician working on some kind of proof? Probably yes.
It's the progressive JPEG download of 2025. You can short-circuit after the first model that gives a good-enough response.
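To make that concrete, here's a minimal sketch of the short-circuit idea using the OpenAI Python client; the model list, escalation order, and "good enough" check are placeholder assumptions, not anything from the comment above.

```python
from openai import OpenAI

client = OpenAI()

# Placeholder escalation order: cheapest/fastest first, slowest last.
MODELS = ["gpt-4o-mini", "o3"]

def good_enough(answer: str) -> bool:
    # Placeholder acceptance check; in practice this is your own judgment.
    return bool(answer) and len(answer) > 200

def ask_with_short_circuit(question: str) -> str:
    answer = ""
    for model in MODELS:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": question}],
        )
        answer = resp.choices[0].message.content or ""
        if good_enough(answer):
            break  # short-circuit: don't wait on the slower models
    return answer
```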
I feel like we are in an awkward phase of "We know this has severe environmental impact - but we need to know if these tools are actually going to be useful and worth adopting..." - so it seems like keeping the environmental question at the forefront will be important as things progress.
This is a rhetorical question.
Sure, we aren't capturing every last externality, but optimization of large systems should be pushed onto the creators and operators of those systems. Customers shouldn't have to validate environmental impact every time they spend $0.05 to use a machine.
Until then, the choice is being made by the entities funding all of this.
I use AI a lot to double-check my code via a code review. What I've found is:
Gemini - really good at contextual reasoning. Doesn't confabulate bugs that don't exist. Really good at finding issues that depend on large context (this method calls that method, and does so with a value that could be X).
Sonnet/Opus - seems to be the most creative. More likely to confabulate bugs that don't exist, but also most likely to catch a bug o3 and Gemini missed.
o3 - somewhere in the middle.
That's pretty scary.
I put this into Claude.md and need to remind it every other hour. But yeah, you need to jump back in every few hours or so.
My setup is Claude Code in YOLO mode with Playwright MCP + browser MCP (to do stuff in the logged-in Firebase web interface), plus search enabled.
The prototype was developed via Firebase Studio until I reached a dead end there; then I used Claude Code to rip out Firebase Genkit and hooked in google-genai, openai, ...
The whole codebase goes into Google Gemini Studio (because of the million-token window) to write tickets, more tickets, and even more tickets.
Claude Code then has the job of implementing these tickets (creating a detailed task list for each ticket first) and then coding until done. The end of each task list is a working Playwright end-to-end test with verified output (see the sketch below).
And atomic commits.
I hooked AnyDesk up to my computer so I can check in at some point to tell it to continue or to read Claude.md again (the meta-instructions, which basically tell it not to use fallbacks or mock data, or cheat in any other way).
Every fourth ticket is refactoring for simplicity and documentation.
The tickets must be updated before each commit and moved to the done folder only when 100% tested OK.
So yeah, when I wake up in the morning, either magic happened and the tickets are all done, or it got stuck and refactored half the codebase. In that case it's an hour of work to go over all the git commits to find out where it went wrong.
What I need are multiple coding agents that challenge each other at crucial points.
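Purely as an illustration of the kind of ticket-closing check described above, here's a minimal Playwright sketch; the URL, selectors, and expected text are hypothetical placeholders, not details from this setup.

```python
from playwright.sync_api import sync_playwright, expect

def test_ticket_end_to_end():
    """Verify real rendered output in the running app, not mocked data."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto("http://localhost:5173")         # placeholder dev-server URL
        page.fill("#prompt-input", "hello world")  # placeholder selector
        page.click("#generate")                    # placeholder selector
        # Assert on the actual output so the agent can't "cheat" with mocks
        expect(page.locator("#result")).to_contain_text("hello world")
        browser.close()
```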
But I should note that o3-pro has been getting faster for me lately. At first every damn thing, however simple, took 15+ minutes. Today I got a few answers back within 5 minutes.
In fact, o1-preview has given me more consistently correct results than any other model. But it's being sunset next month so I have to move to o3.
Not strict, rational "A+B=C" logic, but nuance.
As far as usage of API for business processes (like document processing) - I can't say.
> The problem with o3-pro is that it is slow.
Well, maybe Arena is not that silly then. Poorly argued/organized article.
But we're still human mate.
Stop discriminating or actually solve the problem. I've had enough of this attitude.
As mainly an AI investor, not an AI user, I think profitability is of great importance. It has been a race to the top so far; soon we will see a race to the bottom.
My primary use case where I am willing to wait 10-20 minutes for an answer from the "big slow" model (o3-pro) is code review of large amounts of code. I have been comparing results on this task from the three models above.
Oddly, I see many cases where each model will surface issues that the other two miss. In previous months when running this test (e.g., Claude 3.7 Sonnet vs o1-pro vs earlier Gemini), that wasn't the case. Back then, the best model (o1-pro) would almost always find all the issues that the other models found. But now it seems they each have their own blindspots (although they are also all better than the previous generation of models).
With that said, I am seeing Claude Opus 4 (w/ extended thinking) be distinctly more likely to miss problems that o3-pro and Gemini find. It seems fairly consistent that Opus will be the worst of the three (despite sometimes noticing things the others do not).
Whether o3-pro or Gemini 2.5 Pro is better is less clear. o3-pro will report more issues, but it also has a tendency to confabulate problems. My workflow involves providing the model with a diff of all changes, plus the full contents of the files that were changed. o3-pro seems to have a tendency to imagine and report problems in the files that were not provided to it. It also has an odd new failure mode, which is very consistent: it gets confused by the fact that I provide both the diff and the full file contents. It "sees" parts of the same code twice and will usually report that there has accidentally been some code duplicated. Base o3 does this as well. None of the other models get confused in that way, and I also do not remember seeing that failure mode with o1-pro.
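As a rough illustration of that workflow (not the commenter's actual tooling; the base branch, paths, and prompt wording are assumptions), assembling such a review input might look like this:

```python
import subprocess
from pathlib import Path

def build_review_prompt(base_ref: str = "main") -> str:
    """Assemble a review prompt from a diff plus the full changed files."""
    # Diff of all changes against the base branch
    diff = subprocess.run(
        ["git", "diff", base_ref, "--", "."],
        capture_output=True, text=True, check=True,
    ).stdout

    # Full contents of every file touched by the diff
    changed = subprocess.run(
        ["git", "diff", "--name-only", base_ref, "--", "."],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()

    files = "\n\n".join(
        f"=== {path} ===\n{Path(path).read_text()}"
        for path in changed if Path(path).exists()
    )

    return (
        "Review the following changes for bugs, regressions, and design issues.\n\n"
        f"## Diff\n{diff}\n\n## Full contents of changed files\n{files}"
    )
```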
Nevertheless, o3-pro seems to find real issues that Gemini 2.5 Pro and Opus 4 miss more often than the reverse.
Back in the o1-pro days, it was fairly straightforward in my testing for this use case that o1-pro was simply better across the board. Now with o3-pro compared particularly with Gemini 2.5 Pro, it's no longer clear whether the bonus of occasionally finding a problem that Gemini misses is worth the trouble of (1) waiting way longer for an answer and (2) sifting through more false positives.
My other common code-related use case is actually writing code. Here, Claude Code (with Opus 4) is amazing and has replaced all my other use of coding models, including Cursor. I now code almost exclusively by pair programming with Claude Code, allowing it to be the code writer while I oversee and review. The OpenAI competitor to Claude Code, called Codex CLI, feels distinctly undercooked. It has a recurring problem where it seems to "forget" that it is an agent that needs to go ahead and edit files, and it will instead start to offer me suggestions about how I can make the change. It also hallucinates running commands on a regular basis (e.g., I tell it to commit the changes we've done, and it outputs that it has done so, but it has not).
So where will I spend my $200 monthly model budget? Answer: Claude, for nearly unlimited use of Claude Code. For highly complex tasks, I switch to Gemini 2.5 Pro, which is still free in AI Studio. If I can wait 10+ minutes, I may hand it to o3-pro. But once my ChatGPT Pro subscription expires this month, I may either stop using o3-pro altogether, or I may occasionally use it as a second opinion by paying on-demand through the API.
I've found the same thing: Claude is more likely to miss a bug than o3 or Gemini, but also more likely to catch something o3 and Gemini missed. If I had to pick one model I'd pick o3 or Gemini, but if I had to pick a second model I'd pick Opus.
It also seems to have a much higher false-positive rate, whereas Gemini seems to have the lowest false-positive rate.
Basically, o3 and Gemini are better, but also more correlated, which gives Opus a lot of value.
iLoveOncall•3h ago
My solution for this has been to use non-reasoning models, and so far in 90% of the situations I have received the exact same results from both.
jasonjmcghee•3h ago
It tends to produce significantly longer and more detailed output, so when you want that kind of thing it works well. Especially if you need up-to-date stuff or want to find related sources.
joshstrange•3h ago
Anytime I do my own "deep" research, I like to then throw the same problem at OpenAI and see how well it fares. Often it misses things or gets things subtly wrong. The results look impressive, so it's easy to fool people. I'm not saying the results are useless (I've absolutely gotten value out of it), but I don't love using it for anything I actually care about.