For context:
When evaluating structured outputs, you often want composable comparison logic so you can compare meaningfully across different types of output (free text, enums, ints, and all the other JSON types). You also often want to compare arrays as multisets -- order-agnostic, pairwise matching between elements -- as in the sketch below.
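To make the multiset idea concrete, here's a minimal sketch (my illustration, not structeval's actual code): score every pairing of elements between two arrays and keep the alignment that maximizes total similarity. The function names and the brute-force approach are hypothetical; a real implementation would use something like the Hungarian algorithm instead of trying every permutation.

```python
from itertools import permutations

def compare(a, b):
    """Toy element comparator: 1.0 on equality, else 0.0.
    In practice this is where pluggable, type-aware logic goes."""
    return 1.0 if a == b else 0.0

def multiset_score(xs, ys, cmp=compare):
    """Best average pairwise score over all alignments of ys onto xs.
    Brute force (O(n!)) -- fine for small arrays, illustration only."""
    if len(xs) != len(ys):
        return 0.0  # simplistic length-mismatch penalty
    if not xs:
        return 1.0
    best = max(sum(cmp(x, y) for x, y in zip(xs, perm))
               for perm in permutations(ys))
    return best / len(xs)

print(multiset_score([1, 2, 3], [3, 1, 2]))  # 1.0 -- order is ignored
print(multiset_score([1, 2, 3], [3, 1, 9]))  # ~0.67 -- one element differs
```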
What it is: a CLI and Python library (I called it "structeval", not to be confused with the LLM eval framework of the same name -- I may rename it!) that supports order-agnostic pairwise matching, customizable comparison logic, and recursive metric aggregation. It can also be used to compare outputs when sampling from an LLM with N>1, e.g. to measure semantic entropy or to find the "median" result. Since it works as a generic JSON tool without requiring a schema, it could also serve, at least in principle, as a more configurable (and quirkier :) ) alternative to a generic diffing tool like jd.
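As a hedged sketch of what "recursive metric aggregation" means conceptually (again, my illustration rather than the library's real interface): scores bubble up from leaves, objects average over their keys, and lists are aligned order-agnostically as above.

```python
from itertools import permutations

def json_score(expected, actual):
    """Recursively compare two JSON-like values; return a score in [0, 1]."""
    if isinstance(expected, dict) and isinstance(actual, dict):
        keys = set(expected) | set(actual)
        if not keys:
            return 1.0
        # A key missing on one side compares its value against None,
        # scoring 0 unless both sides are None.
        return sum(json_score(expected.get(k), actual.get(k))
                   for k in keys) / len(keys)
    if isinstance(expected, list) and isinstance(actual, list):
        if len(expected) != len(actual):
            return 0.0  # simplistic length-mismatch penalty
        if not expected:
            return 1.0
        # Order-agnostic: best alignment over all permutations (small n only).
        best = max(sum(json_score(x, y) for x, y in zip(expected, perm))
                   for perm in permutations(actual))
        return best / len(expected)
    return 1.0 if expected == actual else 0.0

print(json_score({"tags": ["a", "b"], "n": 3},
                 {"tags": ["b", "a"], "n": 3}))  # 1.0 -- list order ignored
```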
I had struggled with this task in a few contexts and kept rewriting a utility like this, so I figured it might be helpful to others if encapsulated in a little library.
I'm curious to hear any feedback or suggestions!