Show HN: I benchmarked our AI tool from 30% to 100% success

https://plotly.com/blog/chasing-nines-on-ai-reliability-benchmarks/
1•sadafnajam•2h ago
Hi HN, I'm a Senior SDET at Plotly. My company just launched Plotly Studio, a new tool that uses AI to build data visualizations and analytics apps. My job was to answer the big question: does it actually work with real, messy data?

When I first started testing it against our collection of 100+ diverse datasets, our success rate was around 30%. The problem I faced was that you can't just unit-test an AI that generates code for a desktop app. You have to test the full, end-to-end user experience.

So, I led the effort to build our own internal benchmark system to validate performance at scale. Every day, our CI (GitHub Actions) kicks off a job that:

- Generates a full data app from each of our 100+ test datasets
- Launches each app in a real browser using Playwright
- Asserts that the app loads without any Python or JavaScript errors
- Takes screenshots to verify the visual output
- Runs each test 3 times to detect "flakiness" (inconsistent results)

This gave me and the rest of the team a clear, actionable metric. The dev team used the failure reports to improve the backend, and we just hit a 100% success rate on our latest test run.

I wrote an article about the architecture of this benchmarking system. We're now expanding it with user-donated datasets to make it even more challenging. I'd love to hear your feedback.

You can read my full technical write-up here:

Link: https://plotly.com/blog/chasing-nines-on-ai-reliability-benc...
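As a rough illustration of the "run each test 3 times" step described above, here is a minimal sketch of how repeated outcomes could be turned into a pass/fail/flaky label and an overall success rate. This is not Plotly's actual code: the function names, the dataset names, and the run data are all hypothetical, and treating a flaky dataset as a non-success is an assumption made for this example.

```python
from collections import Counter
from typing import Dict, List

RUNS_PER_DATASET = 3  # the post says each test is run 3 times


def classify(outcomes: List[bool]) -> str:
    """Label one dataset from its repeated runs (True = app loaded without errors)."""
    if not outcomes:
        raise ValueError("at least one run outcome is required")
    if all(outcomes):
        return "pass"
    if not any(outcomes):
        return "fail"
    return "flaky"  # identical runs disagreed with each other


def summarize(results: Dict[str, List[bool]]) -> Dict[str, object]:
    """Aggregate per-dataset labels into counts and a success rate."""
    labels = {name: classify(runs) for name, runs in results.items()}
    counts = Counter(labels.values())
    total = len(labels)
    return {
        "labels": labels,
        "counts": dict(counts),
        # Assumption: only datasets that pass every run count as a success.
        "success_rate": counts["pass"] / total if total else 0.0,
    }


# Hypothetical run data, invented purely for illustration.
example = {
    "sales.csv": [True, True, True],
    "sensors.parquet": [True, False, True],
    "survey.xlsx": [False, False, False],
    "weather.json": [True, True, True],
}
report = summarize(example)
print(report["counts"])
print(f"success rate: {report['success_rate']:.0%}")
```

Counting only all-green datasets as successes is the stricter choice: an intermittent failure then lowers the headline number rather than being hidden by a lucky run.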

Comments

mtmail•1h ago
Please use a regular submission for blog posts https://news.ycombinator.com/showhn.html

Oldest Text Suggests Humans Aren't from Earth – Matt LaCroix [video]

https://www.youtube.com/watch?v=b7hdYdxHU1U
1•keepamovin•27s ago•0 comments

What Is Quantum Computer Security?

https://arxiv.org/abs/2510.07334
1•refezs•39s ago•0 comments

'Thank you, Nancy': $531M now rides on tech startup's 'Pelosi Tracker'

https://www.sfgate.com/politics/article/nancy-pelosi-tracker-tech-startup-21155596.php
1•bryan0•2m ago•0 comments

A Big Update from Spectral Compute, the Team Behind Scale

https://scale-lang.com/posts/2025-11-10-Announcement_Venture_Backed
1•mgl•2m ago•0 comments

The Making of Autism Simulator: 60k Visitors in 12 Hours

https://blog.drjoshcsimmons.com/p/the-making-of-autism-simulator-60000
1•joshcsimmons•3m ago•0 comments

(nossl) What do Ursula von der Leyen and Putin have in common?

http://mikhailian.mova.org/node/314
1•sam_lowry_•3m ago•0 comments

Select 1 Touches 5,583 Lines of Postgres Source Code

https://narekg.me/blog/2025/11/pg-select-1-lines/
2•ngalstyan4•4m ago•1 comments

Nested Learning Reproduction

https://github.com/kmccleary3301/nested_learning
1•eamag•6m ago•0 comments

DHS authorized to merge SSA data into SAVE

https://www.propublica.org/article/dhs-social-security-data-voter-citizenship-trump
1•jhpacker•8m ago•0 comments

Functional Networking for Millions of Docker Desktops [video]

https://www.youtube.com/watch?v=j84ocjlj1JA
2•abathologist•9m ago•0 comments

Apple made a $230 crossbody sock

https://www.theverge.com/news/818328/apple-iphone-pocket-crossbody-knitted-sock-bag
1•c5karl•10m ago•0 comments

This is Real...Really Made by a Robot

https://signalminusnoise.substack.com/p/this-is-real-really-made-by-a-robot
1•alxdistill•11m ago•1 comments

Why Account Linking Should Be Pivotal in Your CIAM SSO

https://discovery.cevolution.co.uk/ciam/2025/06/04/why-account-linking-should-be-pivotal-in-your-...
1•mooreds•12m ago•0 comments

The perfect Hacker News launch

https://www.dvsj.in/the-perfect-hackernews-launch
1•mooreds•13m ago•0 comments

Multi-Instance Root Modules

https://newsletter.masterpoint.io/p/multi-instance-root-modules
1•mooreds•13m ago•0 comments

Show HN: Durable cloud hosting for MCP servers

https://github.com/Haniehz1/mcp-agent
1•haniehz•20m ago•0 comments

Thoughts on the Economic Impact of AI

https://www.asad.pw/ai-and-the-economy/
1•asaddhamani•20m ago•0 comments

Rockstar fired developers looks like "union busting" [video]

https://www.youtube.com/watch?v=c9nOwjeznjI
2•ignition•20m ago•0 comments

HTTPS: //mylexon.site/ref/radzilan122 Make me go to level 150

1•radzilani•21m ago•0 comments

Show HN: I have zero dev experience and built a 220k LOC fintech SaaS with AI

https://medium.com/@allester21212/i-have-zero-professional-dev-experience-and-built-a-220k-loc-fi...
1•DepthSight•21m ago•2 comments

Ask HN: Startup Head of Engineering

1•heroicmailman•22m ago•0 comments

Building Zephyr for the Raspberry Pi Pico2 W

https://blog.golioth.io/building-zephyr-for-the-raspberry-pi-pico2-w/
2•hasheddan•22m ago•0 comments

Fusion isn't the holy grail of energy

https://nickmcgreivy.substack.com/p/fusion-isnt-the-holy-grail-of-energy
3•Luc•24m ago•0 comments

C# 14 Language Features in ReSharper and Rider 2025.3

https://blog.jetbrains.com/dotnet/2025/11/11/csharp-14-language-features-in-resharper-and-rider-2...
3•quapster•24m ago•0 comments

Scaling back DEI programmes and the loss of scientific talent

https://www.nature.com/articles/s41556-025-01797-5.epdf?sharing_token=oxrenJgtNXv1m6Fyx7wLzdRgN0j...
1•cratermoon•25m ago•0 comments

.NET 10 Released

https://dotnet.microsoft.com/en-us/download/dotnet/10.0
4•codegeek•25m ago•1 comments

What's Special about Life?

https://writings.stephenwolfram.com/2025/11/whats-special-about-life-bulk-orchestration-and-the-r...
2•surprisetalk•27m ago•0 comments

I Read Sam Bhagwat's AI Agents Bible So You Don't Have to (But Probably Should)

https://kuber.studio/blog/Post-Extended/I-Read-Sam-Bhagwat%27s-AI-Agents-Bible-So-You-Don%27t-Hav...
1•kuberwastaken•27m ago•1 comments

Firefox Expands Fingerprint Protections

https://blog.mozilla.org/en/firefox/fingerprinting-protections/
12•ptrhvns•31m ago•0 comments

Online Safety Act Tracker

https://osatracker.co.uk/domains/browse
1•Jigsy•31m ago•0 comments