Tends to be a problem. I've tried to mitigate it either with external harnesses (e.g. GitHub Actions that are "fixed" to a known-good state) or with n witness agents (e.g. Kimi/Qwen/whatever <=> Claude/OpenAI/Google). Either way it generally sucks up more time and energy (and now tokens/$).
That being said, I still have a "fix the code, not the test" line somewhere in here...
benchwright•1h ago