Are there any LLMs in particular that work best with G-Eval?
zlatkov•1h ago
I haven’t come across any research showing that a specific LLM consistently outperforms others as the judge. In practice, G-Eval works best with strong reasoning models that produce consistent outputs.
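For what it’s worth, the core mechanism is cheap to try across whichever judge models you have access to. Here’s a minimal sketch, assuming the OpenAI Python client; the model name, rubric, and 1–5 scale are placeholders, and the probability-weighted scoring follows the G-Eval paper’s idea of averaging over the score token’s logprobs rather than taking the argmax:

```python
# Minimal G-Eval-style judge: CoT evaluation steps + probability-weighted score.
# Model name and rubric below are placeholders, not a fixed recommendation.
import math
from openai import OpenAI

client = OpenAI()

PROMPT = """You are evaluating a summary for coherence (1-5).
Steps:
1. Read the source text and the summary.
2. Check whether the summary's sentences follow logically from the source.
3. Output only a single digit from 1 to 5.

Source: {source}
Summary: {summary}
Score:"""

def g_eval_score(source: str, summary: str, model: str = "gpt-4o") -> float:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": PROMPT.format(source=source, summary=summary)}],
        max_tokens=1,
        logprobs=True,
        top_logprobs=10,
    )
    # Weight each candidate score token by its probability instead of taking
    # the single sampled token -- this smooths out some judge variance.
    top = resp.choices[0].logprobs.content[0].top_logprobs
    weighted, total = 0.0, 0.0
    for cand in top:
        tok = cand.token.strip()
        if tok in {"1", "2", "3", "4", "5"}:
            p = math.exp(cand.logprob)
            weighted += int(tok) * p
            total += p
    return weighted / total if total else float("nan")
```

Swapping the `model` argument is then enough to compare judges on the same rubric.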
lyuata•1h ago
An LLM benchmark leaderboard for common evals sounds like a fun idea to me.
kirchoni•1h ago
Interesting overview, though I still wonder how stable G-Eval really is across different model families. Auto-CoT helps with consistency, but I’ve seen drift even between API versions of the same model.
zlatkov•1h ago
That's true. Even small API or model version updates can shift evaluation behavior. G-Eval helps reduce that variance, but it doesn’t eliminate it completely. I think long-term stability will probably require some combination of fixed reference models and calibration datasets.
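One cheap version of the calibration idea: keep a small, frozen calibration set with human reference scores and re-score it whenever the model or API version changes, flagging the judge if agreement drops. A sketch, reusing the hypothetical g_eval_score function from above; the calibration data and threshold are illustrative:

```python
# Drift check: re-score a frozen calibration set and compare the judge's
# scores against stored human reference scores. Threshold is illustrative.
from statistics import correlation  # Python 3.10+

# Hypothetical calibration set: (source, summary, human_score) triples.
CALIBRATION = [
    ("source text A ...", "summary A ...", 4.0),
    ("source text B ...", "summary B ...", 2.0),
    # ... a few hundred items in practice
]

def check_drift(threshold: float = 0.8) -> bool:
    judge = [g_eval_score(src, summ) for src, summ, _ in CALIBRATION]
    human = [h for _, _, h in CALIBRATION]
    r = correlation(judge, human)  # Pearson's r
    print(f"judge-human correlation: {r:.3f}")
    return r >= threshold  # below threshold: re-calibrate or pin the judge
```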