In hindsight, a simple integration test that sends a real (non-mocked) request to the LLM provider would probably have caught this.
One idea is to have a test suite which sends each prompt to the LLM provider and checks whether the response matches the expected schema. This has its own issues since LLMs are inherently nondeterministic and these tests might be flaky, but I’m currently lacking better ideas.
Curious to hear how others approach this.