Interesting experiment, but I'd say aggregating the scores across models is far from ideal. Gemini 1.5 Flash got close-to-perfect scores on most languages (the remaining gap probably boils down to small variances in temperature/top_k and statistical error). Small models are generally quite bad at non-English languages, and they tank the overall performance.
curioussquirrel•1h ago
BTW, newer generations of models seem to have made some real progress in multilingual performance.