Hi HN, I'm a Senior SDET at Plotly. My company just launched Plotly Studio, a new tool that uses AI to build data visualizations and analytics apps.
My job was to answer the big question: does it actually work with real, messy data? When I first started testing it against our collection of 100+ diverse datasets, our success rate was around 30%.
The problem I faced was that you can't just unit-test an AI that generates code for a desktop app. You have to test the full, end-to-end user experience.
So, I led the effort to build an internal benchmark system to validate performance at scale. Every day, our CI (GitHub Actions) kicks off a job that does the following (a rough sketch of the per-app check follows the list):
Generates a full data app from each of our 100+ test datasets
Launches each app in a real browser using Playwright
Asserts that the app loads without any Python or JavaScript errors
Takes screenshots to verify the visual output
Runs each test 3 times to detect "flakiness" (inconsistent results)
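If you're curious what the per-app check looks like in practice, here is a minimal Playwright (Python) sketch of the idea. It is not our actual harness; the URL, the screenshot path, and the traceback heuristic are placeholder assumptions:

    # Minimal sketch of one per-app check, not our actual harness.
    # APP_URL, the screenshot path, and the traceback heuristic are placeholders.
    from playwright.sync_api import sync_playwright

    APP_URL = "http://localhost:8050"  # hypothetical address of the generated app

    def check_app(url: str, screenshot_path: str) -> list[str]:
        errors: list[str] = []
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            # Collect JavaScript console errors and uncaught page errors.
            page.on("console", lambda msg: errors.append(msg.text) if msg.type == "error" else None)
            page.on("pageerror", lambda exc: errors.append(str(exc)))
            page.goto(url, wait_until="networkidle")
            # Crude heuristic for server-side failures: a Python traceback
            # rendered into the page body.
            if "Traceback" in page.inner_text("body"):
                errors.append("Python traceback found in page body")
            # Screenshot for the visual-output check.
            page.screenshot(path=screenshot_path, full_page=True)
            browser.close()
        return errors

    if __name__ == "__main__":
        problems = check_app(APP_URL, "app.png")
        assert not problems, f"app failed smoke test: {problems}"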
This gave me and the rest of the team a clear, actionable metric. The dev team used the failure reports to improve the backend, and we just hit a 100% success rate on our latest test run.
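As an aside on how the 3-run results roll up into that metric, here is a hypothetical sketch (placeholder dataset names and results, not our reporting code) of classifying each dataset as pass, fail, or flaky and computing the headline success rate:

    # Hypothetical sketch of aggregating 3 runs per dataset into a pass/fail/flaky report.
    from collections import Counter

    def classify(runs: list[bool]) -> str:
        # All runs pass -> "pass"; all fail -> "fail"; anything mixed -> "flaky".
        if all(runs):
            return "pass"
        if not any(runs):
            return "fail"
        return "flaky"

    if __name__ == "__main__":
        # Placeholder results: True means the app loaded cleanly on that run.
        results = {
            "sales.csv": [True, True, True],
            "sensor_log.parquet": [True, False, True],   # flaky
            "survey_raw.xlsx": [False, False, False],    # consistent failure
        }
        summary = Counter(classify(r) for r in results.values())
        print(summary, f"success rate: {summary['pass'] / len(results):.0%}")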
I wrote an article about the architecture of this benchmarking system. We're now expanding it with user-donated datasets to make it even more challenging.
I'd love to hear your feedback. You can read my full technical write-up here: https://plotly.com/blog/chasing-nines-on-ai-reliability-benc...