In repeated evaluations on the same datasets, my scores are all over the place: sometimes very high, sometimes very poor.
Has anyone experienced something similar?
I’m guessing this may be because “GPT-5.1” can sometimes route the request to a much smaller model, but that makes it unreliable for production use. Roughly what I mean by “repeated evaluations” is sketched below.
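(A minimal sketch, not my actual harness: the dataset and exact-match scorer are placeholder stand-ins, the model name is just what the docs call it, and I pin temperature and seed to rule out ordinary sampling noise.)

```python
# Run the same eval several times and report the spread of scores.
import statistics

from openai import OpenAI  # assumes the official openai Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask(prompt: str) -> str:
    # temperature=0 and a fixed seed reduce (but don't eliminate) sampling noise
    resp = client.chat.completions.create(
        model="gpt-5.1",  # placeholder model name from the post
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        seed=42,
    )
    return resp.choices[0].message.content or ""


def run_eval(dataset: list[tuple[str, str]]) -> float:
    # Hypothetical exact-match scorer over (prompt, expected) pairs.
    correct = sum(
        1 for prompt, expected in dataset if ask(prompt).strip() == expected
    )
    return correct / len(dataset)


dataset = [("2+2=", "4"), ("Capital of France?", "Paris")]  # stand-in data
scores = [run_eval(dataset) for _ in range(5)]
print(f"mean={statistics.mean(scores):.3f} stdev={statistics.stdev(scores):.3f}")
```

Even with everything pinned like this, the stdev across runs is much larger than I'd expect.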
xXSLAYERXx•26m ago