It’s a problem that models are moving faster than they can be usefully tested. GPT-4.1, 4o, and o1 haven’t been SOTA for some time, and the authors don’t even seem to have included the Anthropic models in their study.
Even if their conclusions were valid at the time they did the work, it says frustratingly little about it today.
sandbags•59m ago
> Even if their conclusions were valid at the time they did the work, it says frustratingly little about it today.
We’re testing implementations, not principles.