Booktest is built on a two-decade career in data science. It has been used to support R&D on numerous LLM, ML, NLP, and information retrieval projects, as well as more traditional software engineering.
It was partly inspired by earlier examples (kudos to Ferenc), but mostly by the real pain of doing QA for ML: combining regression testing with transparency and a fast iteration cycle.
In systems where correctness is fuzzy, evaluation is expensive, and changes have non-local effects, a failing test without diagnostics often raises more questions than it answers. Left unsolved, this is a painful combination.
Booktest is now on its 3rd or 4th iteration of the same idea, and as such it addresses the most common needs and problems in this space.
It is a review-driven regression testing approach that captures system behavior as readable artifacts, so humans can see, review, and reason about regressions instead of fighting tooling.
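To make the idea concrete, here is a minimal sketch of review-driven snapshot testing using only the Python standard library. It is not Booktest's actual API: the helper names `review_snapshot` and `accept_snapshot`, and the `.approved.md` / `.received.md` file naming, are assumptions made up for illustration.

```python
import difflib
from pathlib import Path


def review_snapshot(name, lines, root):
    """Write `lines` as a readable artifact and diff it against the approved copy.

    Returns (ok, text): ok is True only when an approved copy exists and matches.
    Hypothetical helper for illustration, not part of Booktest.
    """
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    approved = root / f"{name}.approved.md"
    received = root / f"{name}.received.md"

    new_text = "\n".join(lines) + "\n"
    received.write_text(new_text, encoding="utf-8")

    if not approved.exists():
        # First run: nothing to compare against, so a human has to review it.
        return False, new_text

    old_text = approved.read_text(encoding="utf-8")
    diff = "".join(
        difflib.unified_diff(
            old_text.splitlines(keepends=True),
            new_text.splitlines(keepends=True),
            fromfile=approved.name,
            tofile=received.name,
        )
    )
    return diff == "", diff


def accept_snapshot(name, root):
    """Promote the latest received artifact to the approved baseline."""
    root = Path(root)
    (root / f"{name}.received.md").replace(root / f"{name}.approved.md")
```

In this sketch a test does not assert a single number. It prints whatever is worth reading (metrics, sample outputs, intermediate results), and a change shows up as a diff that a person can either fix or accept as the new baseline.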
This approach has been used in production for testing ML/NLP systems processing large volumes of data, and we’ve now open-sourced it.
I'm curious whether this matches others’ experience, and how people handle this today.
arauhala•1h ago