With that in mind, how do you go about evaluating LLMs these days, short of going by "gut feel"? My best idea so far is to write a set of small "design a program/library" tasks with clear functional requirements and let each model attempt them, probably using OpenCode and OpenRouter as the common components throughout the evaluation.
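For concreteness, here's a minimal sketch of how the per-task scoring step could work, assuming each task ships its functional requirements as an assertion-based check script that imports the model's generated `solution.py`. The `score_submission` helper and the file names are just illustrative, not part of any existing harness:

```python
import pathlib
import subprocess
import sys
import tempfile


def score_submission(code: str, check_code: str) -> bool:
    """Run a task's functional checks against model-generated code.

    Writes the generated code and the check script into a throwaway
    directory and runs the checks in a subprocess, so a crashing or
    hanging submission can't take down the harness. Returns True if
    the check script exits cleanly (all assertions passed).
    """
    with tempfile.TemporaryDirectory() as tmp:
        workdir = pathlib.Path(tmp)
        (workdir / "solution.py").write_text(code)
        (workdir / "check.py").write_text(check_code)
        result = subprocess.run(
            [sys.executable, "check.py"],
            capture_output=True,
            cwd=workdir,
            timeout=60,  # illustrative per-task budget
        )
        return result.returncode == 0
```

The idea is that the harness loops over (model, task) pairs, collects each model's generated code via OpenCode/OpenRouter, and feeds it through something like this, so every model is graded by the exact same functional requirements rather than by eyeballing output.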
But this field moves fast and I may well have missed many better or easier approaches. What would you do?