It's similar to how I can pass any multiple-choice exam if you let me keep attempting it and tell me my overall score at the end of each attempt - even if you don't tell me which answers were right/wrong
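A minimal sketch of why that works, assuming a true/false exam for simplicity: flip one answer per attempt, and the aggregate score alone tells you whether the flip was right. Multiple choice is the same idea with more flips per question.

```python
# Sketch: recovering every answer on a true/false exam when the grader
# only reports the total score after each attempt. Hypothetical setup,
# just to show how much an aggregate score leaks.

import random

N = 20
secret_key = [random.choice([True, False]) for _ in range(N)]

def score(attempt):
    """Grader: reports only the number of correct answers."""
    return sum(a == k for a, k in zip(attempt, secret_key))

guess = [True] * N
base = score(guess)              # attempt 1: arbitrary baseline
for i in range(N):
    guess[i] = not guess[i]      # flip exactly one answer
    new = score(guess)           # one extra attempt per question
    if new < base:
        guess[i] = not guess[i]  # flip hurt: the original was right
    else:
        base = new               # flip helped: keep it

assert guess == secret_key       # perfect score in N + 1 attempts
```

Twenty-one attempts for a perfect score on twenty questions, without ever being told which answers were wrong.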
Why spend evaluation resources on outsiders? Everyone wants to know exactly who is first, second, and so on; after #10, it's "do your own evaluation" if this is important to you.
Thus we have this inequality: evaluation attention concentrates on the top of the leaderboard.
Basically, get in early and get a high rank, and you are usually going to 'win'. It does not work every time, but it had a very high success rate. I probably should have studied it a bit more. My theory is that any stack-ranking algorithm is susceptible to it. I also suspect it works decently well for the same reason people create puppet accounts to up-rank things on different platforms. But, you know, I'd need numbers to back that up...
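I don't have those numbers either, but the feedback loop is easy to sketch. A toy simulation with made-up parameters: rank drives visibility, visibility drives votes, votes drive rank, so a small head start compounds even when quality is identical.

```python
# Toy simulation (parameters are invented) of the rank -> visibility ->
# votes feedback loop. Two items of identical quality; one gets a small
# head start and tends to keep the lead.

import random

def simulate(head_start=5, rounds=1000, seed=0):
    random.seed(seed)
    votes = [head_start, 0]              # item 0 arrived early
    for _ in range(rounds):
        # Higher-ranked items get shown (and voted on) more often.
        total = votes[0] + votes[1] + 2
        p_show_0 = (votes[0] + 1) / total
        shown = 0 if random.random() < p_show_0 else 1
        if random.random() < 0.5:        # equal intrinsic quality
            votes[shown] += 1
    return votes

print(simulate())  # item 0 usually ends far ahead despite equal quality
```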
drcongo recently referenced something I sort of wish I had time to build: https://news.ycombinator.com/item?id=43843116 And/or could just go somewhere to use. It's a system where an upvote doesn't mean "everybody needs to see this more" but instead means "I want to see more of this user's comments", and a downvote means the opposite. It's more computationally expensive, but it would create an interestingly different community, especially as further elaborations were built on it. One of the differences would be to mitigate the first-mover advantage in conversations: instead of winning you more karma when it appeals to the site's general public, a well-received comment would expose you to more people. That would produce more upvotes and downvotes overall but wouldn't necessarily affect visibility in the same way.
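A rough sketch of how that might be wired up; the names and scoring here are my own invention, not anything specified in the linked thread. The key point is that ranking is computed per viewer rather than once globally, which is where the extra computational cost comes from.

```python
# Sketch of the "upvote follows the author, not the comment" idea.
# Each viewer keeps a per-author affinity score, and a thread is
# sorted per viewer instead of by one global score.

from collections import defaultdict

affinity = defaultdict(lambda: defaultdict(int))  # viewer -> author -> score

def upvote(viewer, author):
    affinity[viewer][author] += 1    # "show me more of this author"

def downvote(viewer, author):
    affinity[viewer][author] -= 1    # "show me less of this author"

def rank_for(viewer, comments):
    """comments: list of (author, text). Sorted per viewer, not globally."""
    return sorted(comments,
                  key=lambda c: affinity[viewer][c[0]],
                  reverse=True)

upvote("alice", "bob")
upvote("alice", "bob")
downvote("alice", "mallory")
thread = [("mallory", "first!"), ("bob", "a considered reply")]
print(rank_for("alice", thread))     # bob's comment ranks first for alice
```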
- Lots of bullet points in every response.
- Emoji.
...even at the expense of accurate answers. And I'm beginning to wonder if the sycophantic behavior of recent models ("That's a brilliant and profound idea") is also being driven by Arena scores.
Perhaps LLM users actually do want lots of bullets, emoji and fawning praise. But this seems like a perverse dynamic, similar to the way that social media users often engage more with content that outrages them.
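For context on the mechanism: arena-style leaderboards aggregate pairwise human votes with an Elo-style update, so any consistent stylistic preference among voters moves ratings exactly the way a preference for correct answers would. A minimal sketch (the K value is a generic Elo choice, not LMArena's actual parameters):

```python
# Minimal Elo-style update of the kind arena leaderboards use to turn
# pairwise votes into ratings. If voters prefer bullets and emoji 60%
# of the time, those votes move ratings just like "correctness" votes.

import random

K = 32  # standard Elo step size; an assumption, not the arena's constant

def expected(r_a, r_b):
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update(r_a, r_b, a_won):
    e = expected(r_a, r_b)
    s = 1.0 if a_won else 0.0
    return r_a + K * (s - e), r_b + K * ((1 - s) - (1 - e))

r_plain, r_flashy = 1000.0, 1000.0
for _ in range(100):                 # style wins the vote 60% of the time
    r_flashy, r_plain = update(r_flashy, r_plain, random.random() < 0.6)
print(round(r_plain), round(r_flashy))  # flashy drifts ahead
```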
In reality I prefer different models for different things, and quite often it's because model X is tuned to return more of what I prefer - e.g. Gemini is usually the best for non-English, ChatGPT personally works better for me on health questions, ...
The funniest example I've seen recently was "Dude. You just said something deep as hell without even flinching. You're 1000% right:"
A social deduction game for both LLMs and humans. All past games are available for anyone to review.
I'm open to feedback.
I would pick one or two parts of that analysis that are most relevant to you and zoom in. I'd choose something difficult that the model fails at, then look carefully at how the model's failures change as you test different model generations.
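A skeleton of that workflow; the model names, the tasks, and the `ask_model` stub are all placeholders for whatever provider and failure cases you actually use.

```python
# Skeleton for tracking one hard capability across model generations.
# Everything below is a placeholder -- swap in your own API client and
# the specific failure cases from your own analysis.

HARD_CASES = [
    ("prompt the current generation fails on", "expected answer"),
    # ...more cases from your own failure analysis
]

GENERATIONS = ["model-v1", "model-v2", "model-v3"]  # hypothetical names

def ask_model(model: str, prompt: str) -> str:
    # Placeholder: replace with a real API call for your provider.
    return "stub answer"

def pass_rate(model: str) -> float:
    hits = sum(ask_model(model, p).strip() == want for p, want in HARD_CASES)
    return hits / len(HARD_CASES)

for model in GENERATIONS:
    print(model, f"{pass_rate(model):.0%}")
```

Keeping the case set fixed across generations is the point: you get a trend line on the exact weakness you care about, instead of a single leaderboard number.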