GPT5 is the best coding LLM because other LLMs admit it?

1•adinhitlore•5mo ago

So I vibe-code a lot these days, and recently I decided to give the same prompt to several LLMs, collect their code, and then give each piece of code to every single one of them and ask which one they think is the most useful, without telling them that they or the other 2 LLMs wrote it. The overall consensus is: gpt5. True, I only compared gpt5 vs claude 4.1 vs qwen 230bn. OSS 120b, gemini and grok 4 were excluded since, well, I don't have the time. And obvious failures like amazon nova or anything from meta weren't even planned. Deepseek (both) seems to underperform a bit. Personally I'd say it's a close call between claude opus 4.1 and both gpt4 and gpt5 (ironically gpt5 sometimes performs worse than 4; I think this has been pointed out by many people already). That's just my personal experience. I know HumanEval or SWE or whatever give varying results, but idk, Musk used the benchmarks as "proof" to hype Grok, and in my experience grok 4 is between LLAMA4 and obviously behind gpt4 or some variations of qwen.

Again, this is coding only: Python and C. For physics, chemistry, scifi novels or whatever, the case may be very different. Another kudos to OSS 120bn btw: it's very generous on tokens... like, it will write a small programming book in one reply if it has to, unless of course you tell it to be more limited. This is a huge plus for me since the code I demand should be complex and not some 20-line nova "pro" joke.
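For anyone who wants to repeat the blind comparison described above, here is a minimal sketch of the voting step. It is only an illustration, not the exact procedure used in the post: `blind_vote`, the `Judge` type and the "Candidate N" labels are made-up names, and each judge is a placeholder callable that you would wire up to whatever model you actually want to ask.

```python
import random
from collections import Counter
from typing import Callable, Dict

# Hypothetical judge interface: receives {label: code} with authorship hidden
# and returns the label of the submission it considers most useful.
Judge = Callable[[Dict[str, str]], str]


def blind_vote(submissions: Dict[str, str], judges: Dict[str, Judge], seed: int = 0) -> Counter:
    """Count how many judges prefer each author's code, without revealing authors."""
    rng = random.Random(seed)
    votes: Counter = Counter()
    for judge in judges.values():
        authors = list(submissions)
        rng.shuffle(authors)  # new label order per judge, so position gives nothing away
        labelled = {f"Candidate {i + 1}": author for i, author in enumerate(authors)}
        anonymized = {label: submissions[author] for label, author in labelled.items()}
        choice = judge(anonymized)
        if choice in labelled:  # ignore replies that do not name a valid label
            votes[labelled[choice]] += 1
    return votes


# Stand-in judge for demonstration only: prefers the longest submission.
def prefers_longest(anonymized: Dict[str, str]) -> str:
    return max(anonymized, key=lambda label: len(anonymized[label]))


if __name__ == "__main__":
    code_by_model = {"model_a": "x", "model_b": "xxxx", "model_c": "xx"}
    judges = {"judge_1": prefers_longest, "judge_2": prefers_longest}
    print(blind_vote(code_by_model, judges).most_common())
```

In a real run each judge would send the anonymized snippets to a model and parse the label out of its reply; that part is left out on purpose since it depends on the provider.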

Comments

incomingpain•5mo ago

All I've done with gpt5 for coding was a major db refactor. I had run out of gemini limit for the day.

It certainly got the job done. I doubt my gpt 20b or ~30b local llm would have been as capable. Overall it was about 2000 lines of code to change, probably only 100,000 tokens of context.

gpt5 didn't one-shot it. There were many steps in between. At the end, after a few hours, I had >50 linter warnings from tripled imports and loads of dead code that would never be reached, and for some reason gpt5 just couldn't fix any of this. It ended up increasing the warnings and adding an error. My expectation is that any of the big guys could immediately fix it. I even restarted with a fresh context and gpt just wasn't having any of it. I'm certain even gpt 20b would have completed it in a minute. Curious.

I went to gemini flash with a very generic prompt about linter warnings and it fixed it in 30 seconds.

Just the kind of weirdness that benchmarks will never be able to catch. It's also going to be very dependent on the language. A rust programmer might have a favourite, whereas a python programmer benefits from another model. There can never be a best.

adinhitlore•5mo ago

I had a similar experience. Usually I'd ignore Gemini, be it flash or pro, but on several occasions it fixed complex errors like it's nothing. Yet when it comes to codegen it is "cheap" on tokens and struggles to output complex logic. As a great bonus, their easy-to-set-up API is freemium, but a generous freemium (google AI studio, I mean). My "ecosystem" atm will be something like: gpt5 and claude 4.1, and if they both fail, try to fix with gemini. I'd skip Grok mostly for privacy issues, not that I completely ignore its capabilities. qwen is good but sometimes 'overengineered'; I don't need 400bn. Given the large param count, maybe it will work for non-coding, like if you ask it some exotic questions about science: Casimir effect, acoustic levitation, ununennium, etc., you name it.

zahlman•5mo ago

> recently i decided to give the same prompt to several llms, then get their codes and later give each code to every single one of them to ask which one they think is the most useful without telling them that they or the other 2 llms wrote it.

The fact that you expect the result of this experiment to be useful is more interesting than the actual result.

adinhitlore•5mo ago

Vibe-coding is the future, drop the conservatism... 'free palestine', I mean, you get the idea: be progressive and open-minded.

pavel_lishin•5mo ago

Those seem like completely orthogonal concepts.

bigyabai•5mo ago

This is a profoundly mentally-ill response to a surface-level criticism you should have been able to refute.

adinhitlore•5mo ago

Well, I'm happy with my response, which is what matters lol. Hedonism > all, well, on this site anyway. I'm not trying to impress anyone or prove anything... random markov chain kind of typing fits it ideally.

slater•5mo ago

Are you ok?
