Ask HN: How do you monitor AI features in production?

2•llmskeptic•1h ago
We shipped a couple of LLM-powered features at work over the past year. Traditional monitoring (Datadog, Sentry) tells us if the API is up and how fast it responds, but nothing about whether the outputs are actually good. Right now we genuinely don't know if users are happy with the AI responses or silently frustrated. No errors get thrown when GPT returns a mediocre answer.

Curious what others are doing. Are you just not monitoring output quality? Built something internal? Using a tool I haven't found?
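For context, the simplest baseline we've considered is attaching a thumbs-up/down widget to each AI response and watching the satisfaction rate over a sliding window. A minimal sketch of what that tracking might look like (all names here are hypothetical, not from any real tool):

```python
from collections import deque

class FeedbackMonitor:
    """Tracks explicit user feedback (thumbs up/down) per AI feature
    over a sliding window and flags features whose satisfaction rate
    drops below a threshold. Purely illustrative, in-memory only."""

    def __init__(self, window=100, alert_below=0.7):
        self.window = window
        self.alert_below = alert_below
        self.ratings = {}  # feature name -> deque of 1 (up) / 0 (down)

    def record(self, feature, thumbs_up):
        # Keep only the most recent `window` ratings per feature.
        dq = self.ratings.setdefault(feature, deque(maxlen=self.window))
        dq.append(1 if thumbs_up else 0)

    def satisfaction(self, feature):
        dq = self.ratings.get(feature)
        if not dq:
            return None  # no signal yet
        return sum(dq) / len(dq)

    def needs_attention(self, feature):
        rate = self.satisfaction(feature)
        return rate is not None and rate < self.alert_below

# Usage: 8 positive and 7 negative ratings -> rate ~0.53, below 0.7.
monitor = FeedbackMonitor(window=50, alert_below=0.7)
for _ in range(8):
    monitor.record("summarizer", thumbs_up=True)
for _ in range(7):
    monitor.record("summarizer", thumbs_up=False)
print(round(monitor.satisfaction("summarizer"), 2))  # 0.53
print(monitor.needs_attention("summarizer"))         # True
```

The obvious weakness is that most users never click the widget, so the sample is small and biased toward the annoyed. That's partly why I'm asking what else people measure.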