Mix them together and you're already deep in make-believe land, so letting AI take over step 1 seems like a perfect fit.
I was hoping to read this article and be surprised by how OpenAI was able to solve the reliability problem, but alas.
We're building something similar and found that no matter how good the agent loop is, you still need "canonical metrics" that are human-curated. Otherwise non-technical users (marketing, product managers) are playing a guessing game with high-stakes decisions, and they can't verify the SQL themselves.
Our approach:

1. We control the data pipeline and work with a discrete set of data sources where schemas are consistent across customers.

2. We benchmark extensively so the agent uses a verified metric when one exists, falls back to raw SQL when it doesn't, and captures those gaps as "opportunities" for human review.
Over time, most queries hit canonical metrics. The agent becomes less of a SQL generator and more of a smart router from user intent -> verified metric.
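Concretely, the router can be little more than a lookup with a logged fallback. A minimal sketch of that idea in Python, where Metric, CANONICAL_METRICS, match_intent, generate_sql, and log_gap are all hypothetical names standing in for the real components:

    # Sketch of the intent -> verified-metric router. Every name here
    # (Metric, CANONICAL_METRICS, match_intent, generate_sql, log_gap)
    # is a hypothetical stand-in, not our actual implementation.
    from dataclasses import dataclass
    from typing import Callable, Optional

    @dataclass
    class Metric:
        name: str
        sql: str  # human-reviewed, versioned SQL

    CANONICAL_METRICS: dict[str, Metric] = {
        "weekly_active_users": Metric(
            name="weekly_active_users",
            sql="SELECT date_trunc('week', ts) AS wk, "
                "COUNT(DISTINCT user_id) FROM events GROUP BY 1",
        ),
    }

    def route(intent: str,
              match_intent: Callable[[str], Optional[str]],
              generate_sql: Callable[[str], str],
              log_gap: Callable[[str], None]) -> tuple[str, bool]:
        """Return (sql, verified). Prefer a canonical metric over raw generation."""
        key = match_intent(intent)          # e.g. keyword or embedding match
        if key is not None and key in CANONICAL_METRICS:
            return CANONICAL_METRICS[key].sql, True
        log_gap(intent)                     # captured as an "opportunity" for review
        return generate_sql(intent), False  # unverified raw-SQL fallback

The point is the asymmetry: the verified path is a plain lookup over human-reviewed SQL, and only the unverified path touches the generator.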
The "Moving fast without breaking trust" section resonates; their eval system with golden SQL is essentially the same insight: you need ground truth to catch drift.
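At its core a golden-SQL eval is just: run the agent's SQL and the golden SQL against the same warehouse and diff the order-normalized results. A sketch, where run_query and agent_sql_for are hypothetical stand-ins:

    # Golden-SQL eval in miniature: both queries run against the same data,
    # and any result mismatch flags drift for human review.
    # run_query and agent_sql_for are hypothetical stand-ins.
    from typing import Callable

    def eval_case(question: str, golden_sql: str,
                  agent_sql_for: Callable[[str], str],
                  run_query: Callable[[str], list[tuple]]) -> bool:
        expected = sorted(run_query(golden_sql))
        actual = sorted(run_query(agent_sql_for(question)))
        return expected == actual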
Wrote about the tradeoffs here: https://www.graphed.com/blog/update-2
0xferruccio•58m ago
Our chief engineer Wade gave an awesome demo to Claire Vo some months back here: https://www.youtube.com/watch?v=9Q9Yrj2RTkg
I use this basically every day, asking all sorts of questions.