Hi HN, I'm a Senior SDET at Plotly. My company just launched Plotly Studio, a new tool that uses AI to build data visualizations and analytics apps.
My job was to answer the big question: does it actually work with real, messy data? When I first started testing it against our collection of 100+ diverse datasets, our success rate was around 30%.
The problem I faced was that you can't just unit-test an AI that generates code for a desktop app. You have to test the full, end-to-end user experience.
So, I led the effort to build an internal benchmark system to validate performance at scale. Every day, our CI (GitHub Actions) kicks off a job that does the following (a rough sketch of the per-app check follows the list):
Generates a full data app from each of our 100+ test datasets
Launches each app in a real browser using Playwright
Asserts that the app loads without any Python or JavaScript errors
Takes screenshots to verify the visual output
Runs each test 3 times to detect "flakiness" (inconsistent results)
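If you're curious what the per-app check looks like in practice, here is a minimal Playwright (Python) sketch of the idea. It is not our actual harness; the URL, the screenshot path, and the traceback heuristic are placeholder assumptions:

    # Minimal sketch of one per-app check, not our actual harness.
    # APP_URL, the screenshot path, and the traceback heuristic are placeholders.
    from playwright.sync_api import sync_playwright

    APP_URL = "http://localhost:8050"  # hypothetical address of the generated app

    def check_app(url: str, screenshot_path: str) -> list[str]:
        errors: list[str] = []
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            # Collect JavaScript console errors and uncaught page errors.
            page.on("console", lambda msg: errors.append(msg.text) if msg.type == "error" else None)
            page.on("pageerror", lambda exc: errors.append(str(exc)))
            page.goto(url, wait_until="networkidle")
            # Crude heuristic for server-side failures: a Python traceback
            # rendered into the page body.
            if "Traceback" in page.inner_text("body"):
                errors.append("Python traceback found in page body")
            # Screenshot for the visual-output check.
            page.screenshot(path=screenshot_path, full_page=True)
            browser.close()
        return errors

    if __name__ == "__main__":
        problems = check_app(APP_URL, "app.png")
        assert not problems, f"app failed smoke test: {problems}"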
This gave me and the rest of the team a clear, actionable metric. The dev team used the failure reports to improve the backend, and we just hit a 100% success rate on our latest test run.
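As an aside on how the 3-run results roll up into that metric, here is a hypothetical sketch (placeholder dataset names and results, not our reporting code) of classifying each dataset as pass, fail, or flaky and computing the headline success rate:

    # Hypothetical sketch of aggregating 3 runs per dataset into a pass/fail/flaky report.
    from collections import Counter

    def classify(runs: list[bool]) -> str:
        # All runs pass -> "pass"; all fail -> "fail"; anything mixed -> "flaky".
        if all(runs):
            return "pass"
        if not any(runs):
            return "fail"
        return "flaky"

    if __name__ == "__main__":
        # Placeholder results: True means the app loaded cleanly on that run.
        results = {
            "sales.csv": [True, True, True],
            "sensor_log.parquet": [True, False, True],   # flaky
            "survey_raw.xlsx": [False, False, False],    # consistent failure
        }
        summary = Counter(classify(r) for r in results.values())
        print(summary, f"success rate: {summary['pass'] / len(results):.0%}")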
I wrote an article about the architecture of this benchmarking system. We're now expanding it with user-donated datasets to make it even more challenging.
I'd love to hear your feedback. You can read my full technical write-up here: https://plotly.com/blog/chasing-nines-on-ai-reliability-benc...