Substack makes money from hosting Nazi newsletters

https://www.theguardian.com/media/2026/feb/07/revealed-how-substack-makes-money-from-hosting-nazi...
1•mindracer•1m ago•0 comments

A New Crypto Winter Is Here and Even the Biggest Bulls Aren't Certain Why

https://www.wsj.com/finance/currencies/a-new-crypto-winter-is-here-and-even-the-biggest-bulls-are...
1•thm•1m ago•0 comments

Moltbook was peak AI theater

https://www.technologyreview.com/2026/02/06/1132448/moltbook-was-peak-ai-theater/
1•Brajeshwar•1m ago•0 comments

Why Claude Cowork is a math problem Indian IT can't solve

https://restofworld.org/2026/indian-it-ai-stock-crash-claude-cowork/
1•Brajeshwar•1m ago•0 comments

Show HN: Built a space travel calculator with vanilla JavaScript v2

https://www.cosmicodometer.space/
1•captainnemo729•2m ago•0 comments

Why a 175-Year-Old Glassmaker Is Suddenly an AI Superstar

https://www.wsj.com/tech/corning-fiber-optics-ai-e045ba3b
1•Brajeshwar•2m ago•0 comments

Micro-Front Ends in 2026: Architecture Win or Enterprise Tax?

https://iocombats.com/blogs/micro-frontends-in-2026
1•ghazikhan205•4m ago•0 comments

Japanese rice is the most expensive in the world

https://www.cnn.com/2026/02/07/travel/this-is-the-worlds-most-expensive-rice-but-what-does-it-tas...
1•mooreds•4m ago•0 comments

These White-Collar Workers Actually Made the Switch to a Trade

https://www.wsj.com/lifestyle/careers/white-collar-mid-career-trades-caca4b5f
1•impish9208•4m ago•1 comments

The Wonder Drug That's Plaguing Sports

https://www.nytimes.com/2026/02/02/us/ostarine-olympics-doping.html
1•mooreds•5m ago•0 comments

Show HN: Which chef knife steels are good? Data from 540 Reddit threads

https://new.knife.day/blog/reddit-steel-sentiment-analysis
1•p-s-v•5m ago•0 comments

Federated Credential Management (FedCM)

https://ciamweekly.substack.com/p/federated-credential-management-fedcm
1•mooreds•5m ago•0 comments

Token-to-Credit Conversion: Avoiding Floating-Point Errors in AI Billing Systems

https://app.writtte.com/read/kZ8Kj6R
1•lasgawe•6m ago•1 comments

The Story of Heroku (2022)

https://leerob.com/heroku
1•tosh•6m ago•0 comments

Obey the Testing Goat

https://www.obeythetestinggoat.com/
1•mkl95•6m ago•0 comments

Claude Opus 4.6 extends LLM pareto frontier

https://michaelshi.me/pareto/
1•mikeshi42•7m ago•0 comments

Brute Force Colors (2022)

https://arnaud-carre.github.io/2022-12-30-amiga-ham/
1•erickhill•10m ago•0 comments

Google Translate apparently vulnerable to prompt injection

https://www.lesswrong.com/posts/tAh2keDNEEHMXvLvz/prompt-injection-in-google-translate-reveals-ba...
1•julkali•10m ago•0 comments

(Bsky thread) "This turns the maintainer into an unwitting vibe coder"

https://bsky.app/profile/fullmoon.id/post/3meadfaulhk2s
1•todsacerdoti•11m ago•0 comments

Software development is undergoing a Renaissance in front of our eyes

https://twitter.com/gdb/status/2019566641491963946
1•tosh•11m ago•0 comments

Can you beat ensloppification? I made a quiz for Wikipedia's Signs of AI Writing

https://tryward.app/aiquiz
1•bennydog224•13m ago•1 comments

Spec-Driven Design with Kiro: Lessons from Seddle

https://medium.com/@dustin_44710/spec-driven-design-with-kiro-lessons-from-seddle-9320ef18a61f
1•nslog•13m ago•0 comments

Agents need good developer experience too

https://modal.com/blog/agents-devex
1•birdculture•14m ago•0 comments

The Dark Factory

https://twitter.com/i/status/2020161285376082326
1•Ozzie_osman•14m ago•0 comments

Free data transfer out to internet when moving out of AWS (2024)

https://aws.amazon.com/blogs/aws/free-data-transfer-out-to-internet-when-moving-out-of-aws/
1•tosh•15m ago•0 comments

Interop 2025: A Year of Convergence

https://webkit.org/blog/17808/interop-2025-review/
1•alwillis•16m ago•0 comments

Prejudice Against Leprosy

https://text.npr.org/g-s1-108321
1•hi41•17m ago•0 comments

Slint: Cross Platform UI Library

https://slint.dev/
1•Palmik•21m ago•0 comments

AI and Education: Generative AI and the Future of Critical Thinking

https://www.youtube.com/watch?v=k7PvscqGD24
1•nyc111•21m ago•0 comments

Maple Mono: Smooth your coding flow

https://font.subf.dev/en/
1•signa11•22m ago•0 comments

GTP Blind Voting: GPT-5 vs. 4o

https://gptblindvoting.vercel.app/
46•findhorn•6mo ago
https://x.com/flowersslop/status/1953908930897158599

Comments

zipping1549•6mo ago
I feel like the questions are way too simple. 3B models may perform similarly on this sort of question.
JoshuaDavid•6mo ago
It is striking how similar these answers are to each other, hitting the same points beat for beat in a slightly different tone.
ggoo•6mo ago
19/20 for gpt5. All the answers were very similar, though; I mostly just felt like the tone and delivery were better.
imafish•6mo ago
Interesting. 6/10 for gpt5 here.
Workaccount2•6mo ago
8/10 gpt-5
esperent•6mo ago
6/10 for 4o. However for several responses I would have preferred a "neither" option.
FergusArgyll•6mo ago
It's the same model
ismailmaj•6mo ago
There should be a “both” option, I often found both answers acceptable but sometimes I strictly preferred one and those get sadly watered down.
croes•6mo ago
This is a vote about the tone, not the quality, of the answers.
dmd•6mo ago
How is tone not part of quality? It’s about preference, and there’s an overwhelmingly consistent result here.
croes•6mo ago
"Everyone knows π is 3. It's one of those cozy little facts, like cats landing on their feet or toast landing butter-side down. You don't have to overthink it — circles just work that way. Ask any pie, and it'll tell you the same thing."

vs

π = 3.14159…

If it's about correctness, tone isn't part of quality.

mrtksn•6mo ago
My understanding was that with GPT-5 you don't actually get the high quality stuff unless the system decides that you need it. So, for simple questions you end up getting the subpar response. A bit like not getting hot water until you increase the flow enough to trigger the boiler to start the heating.

Lately I enjoy Grok the most for simple questions, even if it isn't necessarily about a recent event. Then I like OpenAI, Mistral, Deepseek equally and for some reason never felt good about Gemini. Tried switching to Gemini Pro the last two months but I found myself going back to ChatGPT's free mode and Grok's free mode. Cancelled yesterday and now happily back to ChatGPT Plus.

I got 80% GPT-5 preference anyway.

vbezhenar•6mo ago
Before GPT-5, I used almost exclusively o3 and sometimes o3-pro. Now I'm using GPT 5 Thinking and sometimes GPT 5 Pro. So I think that I have some control over quality. At least it thinks for a few dozen seconds every time.
energy123•6mo ago
> GPT 5 Pro

Have you noticed either of these things:

(1) If your first prompt is too long (50k+ tokens) but just below the limit (like 80k tokens or whatever), it cannot see the right side of your prompt.

(2) By the second prompt, if the first prompt was long-ish, the context from the first prompt is no longer visible to the model.

wrcwill•6mo ago
definitely 1!

it seems to truncate your prompt even under the "maximum message length" and yeah around 55k is where it starts to happen.

extremely annoying. o1 pro worked up until 115k or so. both o3 and gpt5 have the issue. (it happens on all models for me not just the pro variations)

with the new 400k context length in api i would expect at least 128k message lengths and maybe 200k context in chat.

energy123•6mo ago
Do you have a workaround?

I'm putting the highest quality context into the 50k tokens, and attaching the rest for RAG. But maybe there is a better way.

wrcwill•5mo ago
i split the context and give it in two messages :/
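wrcwill's workaround of splitting the context across two messages can be sketched roughly. The helper below is hypothetical (not part of any ChatGPT API), and the chars-per-token heuristic is an assumption; a real tokenizer would give more accurate counts:

```python
def split_context(text: str, max_tokens: int = 50_000, chars_per_token: int = 4) -> list[str]:
    """Split a long prompt into chunks under a rough token budget.

    Uses a crude chars-per-token estimate rather than a real tokenizer.
    """
    max_chars = max_tokens * chars_per_token
    chunks = []
    while text:
        if len(text) <= max_chars:
            chunks.append(text)
            break
        # Prefer to break at a paragraph boundary inside the budget.
        cut = text.rfind("\n\n", 0, max_chars)
        if cut <= 0:
            cut = max_chars
        chunks.append(text[:cut])
        text = text[cut:].lstrip()
    return chunks
```

Each chunk can then be sent as its own message, keeping every message below the length where truncation reportedly starts.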
vbezhenar•6mo ago
Sorry, can't really answer that, as I very rarely use long contexts. I prefer to either edit the previous question or just start a new chat to keep the context short. And even when I need to dump code, I prefer to choose relevant snippets. I'm aware that LLM quality degrades with long contexts, so I've trained myself to avoid them.
energy123•6mo ago
I'm back on ChatGPT today. The UI is so fast. I didn't realize how the buggy and slow Gemini UI was contributing to my stress levels. AIStudio is also quite slow compared to the ChatGPT app. Is it that hard to make it so that when you paste text into a box and press enter, your computer doesn't slow down and get noisy? Is it really that difficult of an engineering problem?
swagmoney1606•6mo ago
I HATE how you can't re-write a previous response in gemini, only the most recent response.
edg5000•6mo ago
What about Claude Opus 4 and 4.1?
mrtksn•6mo ago
I like Claude too but wasn't using it much lately. I don't know why, maybe because the UI is too original? Maybe because it was a bit slow the last time I used Claude? Maybe because the free usage limits were too low, so I didn't get hooked enough to upgrade? And on the API side of things I didn't bother to try, I guess.
zazar•6mo ago
GPT and Grok have the best everyday-feel. Gemini isn't quite there as a product.
logicprog•6mo ago
I took the test with 10 questions, and carefully picked the answer with more specificity and unique propositional content, the one that felt like it was communicating more that was worth reading, and also the answers that were just obviously more logical or effective, or framed better. I chose GPT-5 8 out of 10 times.
gavinray•6mo ago
17/3 - GPT-5/4o
AmazingTurtle•6mo ago
12/8 gpt-5/gpt-4o
senko•6mo ago
I took the "Rank Models" test and got GPT5 and Sonnet 4 tied at 25% each, Gemini and Grok close by, and 4o in the dust.

But ... the advice (answers) was quite uniform. In more than a few cases I would personally choose a different approach to all of them.

It'd be fun to have a few Chinese models in the mix and see if the cultural biases show up.

vbezhenar•6mo ago
I don't like this test, because on the very first question I was presented with, both answers looked equally good. Actually they were almost the same, just with different phrasing. So my choice would be completely random, which means the end score will be polluted by randomness. They should have added options like "both answers good" and "both answers bad".
viraptor•6mo ago
If the positions are randomly assigned, it shouldn't matter. I mean, the results may be clear faster, but the overall shouldn't change even if you need to flip a coin from time to time.
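This point can be checked with a toy Monte Carlo sketch (all numbers below are illustrative, not from the actual poll): if some share of voters are indifferent and flip a fair coin, randomized side assignment means the coin flips only dilute the measured preference toward 50% without flipping which model comes out ahead.

```python
import random

def simulate(n_trials: int, true_pref: float, tie_rate: float, seed: int = 0) -> float:
    """Fraction of votes for model A when a tie_rate share of voters are
    indifferent and flip a fair coin, and sides are randomly assigned."""
    rng = random.Random(seed)
    wins_a = 0
    for _ in range(n_trials):
        if rng.random() < tie_rate:
            wins_a += rng.random() < 0.5        # indifferent: fair coin flip
        else:
            wins_a += rng.random() < true_pref  # genuine preference for A
    return wins_a / n_trials

# With a 70% true preference and half the votes random, the measured
# share lands near 0.5*0.5 + 0.5*0.7 = 0.60: diluted, but still above 50%.
share = simulate(100_000, true_pref=0.7, tie_rate=0.5)
```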
jstummbillig•6mo ago
Sure, but providing an "undecided" option would solve the issue OP is describing for the individual voter.
timhh•6mo ago
I have a lot of experience with pairwise testing so I can explain this.

The reason there isn't an "equal" option is because it's impossible to calibrate. How close do the two options have to be before the average person considers them "equal"? You can't really say.

The other problem is when two things are very close, if you provide an "equal" option you lose the very slight preference information. One test I did was getting people to say which of two greyscale colours is lighter. With enough comparisons you can easily get the correct ordering even down to 8 bits (i.e. people can distinguish 0x808080 and 0x818181), but they really look the same if you just look at a pair of them (unless they are directly adjacent, which wasn't the case in my test).

The "polluted by randomness" issue isn't a problem with sufficient comparisons because you show the things in a random order so it eventually gets cancelled out. Imagine throwing a very slightly weighted coin; it's mostly random but with enough throws you can see the bias.

...

On the other hand, 16 comparisons isn't very many at all, and also I did implement an ad-hoc "they look the same" option for my tests and it did actually perform significantly better, even if it isn't quite as mathematically rigorous.

Also player skill ranking systems like Elo or TrueSkill have to deal with draws (in games that allow them), and really most of these ranking algorithms are totally ad-hoc anyway (e.g. why does Bradley-Terry use a sigmoid model?), so it's not really a big deal to add more ad-hocness into your model.
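The Bradley-Terry model mentioned above can be sketched in a few lines: the probability that A beats B is the sigmoid of a strength gap, and the gap can be fit by gradient ascent on the log-likelihood of observed wins. This is a toy illustration, not how any real leaderboard implements it:

```python
import math

def fit_strength_gap(wins_a: int, wins_b: int, steps: int = 2000, lr: float = 0.1) -> float:
    """Fit d = s_A - s_B in the Bradley-Terry model, where
    P(A beats B) = sigmoid(d), by gradient ascent on the log-likelihood."""
    d = 0.0
    n = wins_a + wins_b
    for _ in range(steps):
        p = 1.0 / (1.0 + math.exp(-d))  # predicted P(A wins)
        grad = wins_a - n * p           # gradient of the log-likelihood in d
        d += lr * grad / n              # normalized ascent step
    return d

# 75 wins out of 100 recovers a gap of log(3) ~= 1.099,
# since sigmoid(log(3)) = 0.75.
gap = fit_strength_gap(75, 25)
```

It also makes the ad-hocness concrete: the sigmoid is a convenient choice of link function, and draws have no natural place in this likelihood without extra modeling.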

notahacker•6mo ago
Ordering isn't necessarily the most valuable signal to rank models where much stronger degrees of preference between some of the answers exist, though. "I don't mind either of these answers but I do have a clear preference for this one" is sometimes a more valuable signal than a forced choice. And a model x which is consistently, subtly preferred to model y in the common case where both models yield acceptable outputs, but manages to be universally disfavoured for being wrong or bad more often, is going to be the worse model for most use cases.

Also depends what the pairwise comparisons are measuring of course. If it's shades of grey, is the statistical preference identifying a small fraction of the public that's able to discern a subtle mismatch in shading between adjacent boxes, or is it purely subjective colour preference confounded by far greater variation in monitor output? If it's LLM responses, I wonder whether regular LLM users have subtle biases against recognisable phrasing quirks of well-known models which aren't necessarily more prominent or less appropriate than the less familiar phrasing quirks of a less-familiar model. Heavy use of em-dashes, "not x but y" constructions and bullet points were perceived as clear, well-structured communication before they were seen as stereotypical, artificial AI responses.

can16358p•6mo ago
I've got 7 GPT-5, and 3 4o.
IAmGraydon•6mo ago
This doesn’t really work as you can tell there’s an underlying prompt telling the model to reply in one or two sentences. That doesn’t seem like a good way to display the strengths and weaknesses of a model except in situations where you want a very short answer.
hoppp•6mo ago
I gravitated toward choosing the longer answer, so my result was a preference for GPT-5 responses.
kylecazar•6mo ago
I did the exact opposite! So, 4o won in my poll. Given roughly the same meaning, I prefer fewer words.
voisin•6mo ago
7/10 GPT-5.
afro88•6mo ago
Does anyone ever get answers this short? What's the system prompt here? That may bias things a little.

Also it's GPT not GTP

elaus•6mo ago
Yeah, it felt like two different styles (one very short, the other a little bit more verbose), but both very different from a plain query to GPT without additional system prompts.

So this might just test how the two models react to the (hidden) system prompt...

the_mitsuhiko•6mo ago
> Does anyone ever get answers this short?

If you converse with it, yes

> What's the system prompt here?

You don't need a specific one. If you talk to it, it turns into that.

afro88•6mo ago
First question: https://chatgpt.com/s/t_6899bc7881e88191bb3d2146eac718d7

If you meant, after you converse with it for a while: what was the conversation leading up to this point?

the_mitsuhiko•6mo ago
I cannot access your link so I have no idea what this points to.

> If you meant, after you converse with it for a while: what was the conversation leading up to this point?

If you have a conversational style with chatgpt you end up with much shorter back and forths (at least on o4) than you do if you give it a long prompt.

KronisLV•6mo ago
I did 20 questions.

In 75% of the answers I picked GPT-5; that's a pretty strong result, at least when it comes to subjective preferences!

dmd•6mo ago
19/20 GPT-5. I’m impressed.
uh_uh•6mo ago
Same result.
johndhi•6mo ago
I chose 4 more often! Was trying to be honest about what I preferred.
hopfenspergerj•6mo ago
What on earth are these questions? They don't resemble any real use of an LLM for work.
theon144•6mo ago
Huh, I got 9/10 for GPT-5, and I was pretty convinced I was picking 4o in several questions based on the style. Interesting!

The questions were pretty much unlike anything I've ever asked an LLM though, is this how people use LLMs nowadays?

Davidzheng•6mo ago
Strange questions. I don't think self-help advice and advice for social relationships should be judged based on how popular it is. A lot of very similar and generic answers. Got an equal split when I took it: 10 each.
viraptor•6mo ago
Those kinds of comparisons are interesting but also not the kind of questions I'd ever ask an AI, so the results are a bit meh. I wish there was a version with custom prompts, or something like a mix of battle and side-by-side modes from lmarena. Let me choose the prompts (or prepared sets of prompt categories) and blinded models to compare. I'm happy to use a model worse at interpersonal issues, but better at cooking, programming and networking.
Balgair•6mo ago
Found myself just choosing the longer answer absent any real difference in the information presented.

Now I know why they tell you to just keep writing more when it comes to SAT writing sections.