In my opinion, it is not particularly useful for comparing models from different companies, since some models are heavily optimized for math or even trained on AIME problems.
However, it is really useful for testing different quantizations of the same model, or the same quantization from different providers.
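
To illustrate that kind of comparison, here is a minimal sketch. It is not the tool's actual code: the `accuracy` helper, the run labels (`fp16`, `q4`), the problem IDs, and all answers are made up for illustration. It scores each run against the same answer key and prints the gap to the baseline.

```python
def accuracy(predicted, expected):
    """Fraction of problems answered correctly; missing answers count as wrong."""
    correct = sum(1 for pid, ans in expected.items() if predicted.get(pid) == ans)
    return correct / len(expected)


# Hypothetical answer key (AIME answers are integers from 0 to 999).
expected = {"p1": 204, "p2": 25, "p3": 113, "p4": 371}

# Hypothetical final answers from two quantizations of the same model.
runs = {
    "fp16": {"p1": 204, "p2": 25, "p3": 113, "p4": 371},
    "q4": {"p1": 204, "p2": 25, "p3": 110},  # one wrong answer, one missing
}

scores = {name: accuracy(answers, expected) for name, answers in runs.items()}
baseline = scores["fp16"]

for name, score in scores.items():
    print(f"{name}: {score:.2%} ({score - baseline:+.2%} vs fp16)")
```

Since both runs use the same base model and the same problems, the difference between them mostly reflects the quantization or the provider's serving setup rather than the training data. With only a handful of problems, though, a small gap can easily be sampling noise, so repeated runs help.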
Let me know what you think about it!
Also check the README to see some examples of the results you will get from it.