π = 3.14159…
If it’s about correctness, tone isn’t part of quality.
Lately I enjoy Grok the most for simple questions, even when it isn't about a recent event. Then I like OpenAI, Mistral, and DeepSeek equally, and for some reason I never felt good about Gemini. I tried switching to Gemini Pro for the last two months, but I found myself going back to ChatGPT's free mode and Grok's free mode. Cancelled yesterday and now I'm happily back on ChatGPT Plus.
I got an 80% GPT-5 preference anyway.
Have you noticed either of these things:
(1) If your first prompt is too long (50k+ tokens) but just below the limit (like 80k tokens or whatever), it cannot see the right side (i.e. the end) of your prompt.
(2) By the second prompt, if the first prompt was long-ish, the context from the first prompt is no longer visible to the model.
It seems to truncate your prompt even when it's under the "maximum message length", and yeah, around 55k tokens is where it starts to happen.
Extremely annoying. o1 pro worked up until 115k or so. Both o3 and GPT-5 have the issue. (It happens on all models for me, not just the pro variations.)
With the new 400k context length in the API, I would expect at least 128k message lengths and maybe 200k of context in chat.
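You can't script the ChatGPT UI directly, but a rough way to find where input stops being seen is to pad a prompt with numbered markers and ask the model to list the ones it can see. A minimal sketch against the API, assuming tiktoken and the OpenAI Python client; the model name and the marker scheme are placeholders, not anything official:

```python
# Sketch: probe where a long prompt stops being "seen" by the model.
# Assumes tiktoken and the OpenAI Python client are installed.
import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.get_encoding("cl100k_base")

def build_probe(total_tokens: int, stride: int = 5_000) -> str:
    """Pad with filler text and drop a numbered marker every ~`stride` tokens."""
    filler = "lorem ipsum "
    parts, count, marker_id = [], 0, 0
    while count < total_tokens:
        chunk = filler * 200
        parts.append(chunk)
        count += len(enc.encode(chunk))
        if count // stride > marker_id:
            marker_id = count // stride
            parts.append(f"\n[MARKER {marker_id} at ~{count} tokens]\n")
    return "".join(parts)

prompt = build_probe(80_000) + "\nList every MARKER number you can see above."
resp = client.chat.completions.create(
    model="gpt-5",  # assumption: whichever model you want to test
    messages=[{"role": "user", "content": prompt}],
)
# If the reply is missing the high-numbered markers, the tail was truncated.
print(resp.choices[0].message.content)
```

The same trick works by hand in the chat UI: paste the marker-laden text in and see which markers the model can still report back.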
I'm putting the highest-quality context into the 50k tokens and attaching the rest for RAG. But maybe there is a better way.
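For what it's worth, a minimal sketch of that split, assuming a 50k token budget and a hypothetical score_relevance() ranking function; everything past the budget would go to the RAG index or file attachments:

```python
# Sketch: keep the best chunks inline up to a token budget, overflow goes to RAG.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
BUDGET = 50_000  # tokens to spend on directly pasted context

def split_context(chunks: list[str], score_relevance) -> tuple[str, list[str]]:
    ranked = sorted(chunks, key=score_relevance, reverse=True)
    inline, overflow, used = [], [], 0
    for chunk in ranked:
        n = len(enc.encode(chunk))
        if used + n <= BUDGET:
            inline.append(chunk)
            used += n
        else:
            overflow.append(chunk)  # attach / index these for retrieval instead
    return "\n\n".join(inline), overflow
```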
But ... the advice (answers) was quite uniform. In more than a few cases I would personally choose a different approach to all of them.
It'd be fun to have a few Chinese models in the mix and see if the cultural biases show up.
The reason there isn't an "equal" option is because it's impossible to calibrate. How close do the two options have to be before the average person considers them "equal"? You can't really say.
The other problem is when two things are very close, if you provide an "equal" option you lose the very slight preference information. One test I did was getting people to say which of two greyscale colours is lighter. With enough comparisons you can easily get the correct ordering even down to 8 bits (i.e. people can distinguish 0x808080 and 0x818181), but they really look the same if you just look at a pair of them (unless they are directly adjacent, which wasn't the case in my test).
The "polluted by randomness" issue isn't a problem with sufficient comparisons because you show the things in a random order so it eventually gets cancelled out. Imagine throwing a very slightly weighted coin; it's mostly random but with enough throws you can see the bias.
...
On the other hand, 16 comparisons isn't very many at all, and also I did implement an ad-hoc "they look the same" option for my tests and it did actually perform significantly better, even if it isn't quite as mathematically rigorous.
Also player skill ranking systems like Elo or TrueSkill have to deal with draws (in games that allow them), and really most of these ranking algorithms are totally ad-hoc anyway (e.g. why does Bradley-Terry use a sigmoid model?), so it's not really a big deal to add more ad-hocness into your model.
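For concreteness, here is roughly what those two pieces look like: Bradley-Terry's win probability is a sigmoid of the (log-scale) rating gap, and an Elo-style update handles a draw by scoring it as half a win. Just a sketch, not any particular library's implementation:

```python
import math

def bt_win_prob(r_a: float, r_b: float) -> float:
    """P(A beats B) under Bradley-Terry: a sigmoid of the rating difference."""
    return 1.0 / (1.0 + math.exp(r_b - r_a))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """score_a is 1 for an A win, 0 for a loss, and 0.5 for a draw."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a += k * (score_a - expected_a)
    r_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a, r_b

# Two equally rated players draw: ratings stay put, as you'd hope.
print(elo_update(1500, 1500, 0.5))  # -> (1500.0, 1500.0)
```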
Also depends what the pairwise comparisons are measuring of course. If it's shades of grey, is the statistical preference identifying a small fraction of the public that's able to discern a subtle mismatch in shading between adjacent boxes, or is it purely subjective colour preference confounded by far greater variation in monitor output? If it's LLM responses, I wonder whether regular LLM users have subtle biases against recognisable phrasing quirks of well-known models which aren't necessarily more prominent or less appropriate than the less familiar phrasing quirks of a less-familiar model. Heavy use of em-dashes, "not x but y" constructions and bullet points were perceived as clear, well-structured communication before they were seen as stereotypical, artificial AI responses.
Also it's GPT not GTP
So this might just test how the two models react to the (hidden) system prompt...
If you converse with it, yes
> What's the system prompt here?
You don't need a specific one. If you talk to it, it turns into that.
If you meant, after you converse with it for a while: what was the conversation leading up to this point?
> If you meant, after you converse with it for a while: what was the conversation leading up to this point?
If you have a conversational style with ChatGPT, you end up with much shorter back-and-forths (at least on o4) than you do if you give it a long prompt.
In 75% of the answers I picked GPT-5; that's a pretty strong result, at least when it comes to subjective preferences!
The questions were pretty much unlike anything I've ever asked an LLM, though. Is this how people use LLMs nowadays?
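A quick back-of-envelope on how strong 75% actually is: the question count isn't given above, so 16 is assumed purely for illustration. A one-sided binomial test against a 50/50 null:

```python
# P(X >= k) for X ~ Binomial(n, 0.5): how often chance alone would give
# a preference at least this lopsided.
from math import comb

def binom_p_value(k: int, n: int, p: float = 0.5) -> float:
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

print(binom_p_value(12, 16))  # 12/16 = 75%: p ~= 0.038
```

So 75% over a couple of dozen questions is suggestive but not overwhelming; over hundreds of questions it would be decisive.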
Now I know why they tell you to just keep writing more when it comes to SAT writing sections.