SigmaEval – statistical evaluation for GenAI apps

https://github.com/Itura-AI/SigmaEval
1•TarekOraby•2h ago

Comments

TarekOraby•2h ago
Hey HN, I released SigmaEval, a Python framework to evaluate GenAI applications.

The non-deterministic outputs of LLM-based apps don’t fit pass/fail tests, so teams often ship without confidence. SigmaEval aims to solve this with a statistical evaluation approach similar to that used in clinical trials. It supports statements such as: “We are 95% confident that our AI will resolve at least 90% of user issues with a quality score of 8/10 or higher.”

It works in three steps:

- Define “good”: You describe the test scenario and desired outcome in plain English (e.g., “when a new user asks about the bot’s capabilities” -> “then the bot lists its main functions”).

- Simulate: An AI user simulator exercises your app repeatedly, switching styles (polite, impatient, verbose) to build a diverse conversation set.

- Judge & analyze: An AI judge scores each conversation against your definition of success. SigmaEval runs binomial and bootstrap tests to decide whether you meet your quality bar at a chosen confidence level.
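To make the statistics in the last step concrete, here is a minimal sketch of the kind of check it describes. It is not SigmaEval's API: the function names `binomial_p_value`, `meets_quality_bar`, and `bootstrap_lower_bound` are hypothetical, and the sketch assumes a one-sided exact binomial test on pass/fail verdicts plus a percentile bootstrap on judge scores, using only the Python standard library.

```python
import math
import random


def binomial_p_value(successes: int, trials: int, target_rate: float) -> float:
    """One-sided exact binomial p-value.

    Probability of seeing at least `successes` passes in `trials` runs
    if the true pass rate were exactly `target_rate`.
    """
    return sum(
        math.comb(trials, k) * target_rate**k * (1 - target_rate) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def meets_quality_bar(
    successes: int,
    trials: int,
    target_rate: float = 0.90,
    confidence: float = 0.95,
) -> bool:
    """True if the data reject 'pass rate <= target_rate' at the given confidence."""
    return binomial_p_value(successes, trials, target_rate) < 1 - confidence


def bootstrap_lower_bound(
    scores: list[float],
    confidence: float = 0.95,
    resamples: int = 2000,
    seed: int = 0,
) -> float:
    """Percentile-bootstrap lower bound on the mean judge score."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    n = len(scores)
    # Resample the scores with replacement and record each resample's mean.
    means = sorted(sum(rng.choices(scores, k=n)) / n for _ in range(resamples))
    # The (1 - confidence) quantile of those means is the lower bound.
    return means[int((1 - confidence) * resamples)]
```

Under these assumptions, `meets_quality_bar(100, 100)` returns `True`, while `meets_quality_bar(90, 100)` returns `False`: observing exactly 90% passes is not enough evidence that the true rate is at least 90%. That gap between the observed rate and the rate you can claim with confidence is the reason to run a test rather than compare raw percentages.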

SigmaEval is agnostic to both LLM provider and testing framework.

Open source (Apache 2.0).

GitHub: https://github.com/Itura-AI/sigmaeval

PyPI: https://pypi.org/project/sigmaeval-framework/

I’m the creator and happy to answer questions.

Dynamic Levels of Detail in Evolve

https://www.evolvebenchmark.com/blog-posts/dynamic-levels-of-detail-in-evolve
1•evolve_•16s ago•0 comments

If you buy more than two new games a year, you're in the minority

https://www.eurogamer.net/if-you-buy-more-than-two-new-games-a-year-youre-in-the-minority-new-rep...
1•MBCook•2m ago•0 comments

Less Is More: Recursive Reasoning with Tiny Networks

https://github.com/SamsungSAILMontreal/TinyRecursiveModels
1•klaussilveira•2m ago•0 comments

Test your README in a fresh VM

https://shkspr.mobi/blog/2025/10/how-to-actually-test-your-readme/
1•birdculture•2m ago•0 comments

Scientists Find Hidden Switch Controlling Hunger

https://scitechdaily.com/scientists-find-hidden-switch-controlling-hunger/
1•pseudolus•3m ago•0 comments

The unexpected upside of Canada's wildfires

https://www.japantimes.co.jp/environment/2025/09/23/climate-change/canada-wildfire-upsides/
1•PaulHoule•3m ago•0 comments

Trump's Crusaders: Christian Nationalists Are Gaining a Solid Foothold in D.C

https://www.spiegel.de/international/world/trumps-crusaders-christian-nationalists-are-gaining-a-...
1•nabla9•4m ago•0 comments

Show HN: Hacker News as a biological simulation (with time travel)

https://hackernews.life
1•aeonfox•6m ago•0 comments

Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models

https://arxiv.org/abs/2510.04618
1•JnBrymn•7m ago•0 comments

Insurers hesitate at claims faced by OpenAI, Anthropic in AI lawsuits: report

https://seekingalpha.com/news/4502547-insurers-hesitate-at-multibillion-dollar-claims-faced-by-op...
1•1vuio0pswjnm7•8m ago•0 comments

The Cancer Imaging Archive Is Down

https://www.cancerimagingarchive.net/
1•datelligence•8m ago•0 comments

ChatGPT is a great tool for investment backtesting

https://old.reddit.com/r/Daytrading/comments/1j8tjiw/holy_cow_chatgpt_is_a_great_tool_for_backtes...
1•wslh•8m ago•0 comments

Show HN: Searchable compression for JSON (p50≈0.18 ms; 10-min demo)

https://github.com/kodomonocch1/see_proto
1•kodomonocch1•9m ago•0 comments

Why some doctors have started asking patients about their spiritual lives

https://www.npr.org/2025/01/14/nx-s1-5252809/why-some-doctors-have-started-asking-patients-about-...
1•bilsbie•11m ago•1 comments

Show HN: Appcockpit – Version and Maintenance Control for Native Apps

https://appcockpit.dev
1•moritzmoritz21•12m ago•0 comments

PRD-ware – freeware for vibe coding/engineering

https://prdware.github.io/
1•consultutah•13m ago•0 comments

Bypassing Restricted Shell on Uniview Security Camera

https://brownfinesecurity.com/blog/bypassing-restricted-shell-on-uniview-security-camera
1•hasheddan•13m ago•0 comments

Vectrex Mini

https://vectrex.com/vectrex-mini-details/
2•rbanffy•13m ago•0 comments

React Status Issue 447: October 8, 2025

https://react.statuscode.com/issues/447
1•unripe_syntax•13m ago•0 comments

Debugging My Mind: 10 Days Struggling Through a Vipassana Course

https://blog.charlied.org/p/tasting-a-spiritual-psychedelic-mind
1•madpen•14m ago•0 comments

RailsStart: Makefile Helps Rails Developers

https://github.com/the-teacher/rails-start/blob/master/docs/articles/RailsStart_How_Makefile_help...
2•amalinovic•15m ago•0 comments

How I made $300K from an open-source side project using dual licensing

https://www.paritydeals.com/blog/monetize-open-source-dual-licensing/
3•sachinneravath•17m ago•0 comments

Prizes must recognize machine contributions to discovery

https://www.nature.com/articles/d41586-025-03217-y
1•bookofjoe•17m ago•2 comments

2 Hours in Line for a Free Hat

https://edatweets.substack.com/p/2-hours-in-line-for-a-free-hat
2•gmays•17m ago•0 comments

VimWiki: Personal Knowledge Management in Vim

https://mkaz.blog/working-with-vim/vimwiki/
1•marcuskaz•18m ago•0 comments

IBM Introduces the Spyre Accelerator for Commercial Availability

https://www.hpcwire.com/off-the-wire/ibm-introduces-the-spyre-accelerator-for-commercial-availabi...
1•rbanffy•20m ago•0 comments

Designing MCP servers for wide schemas and large result sets

https://axiom.co/blog/designing-mcp-servers-for-wide-events
2•ucirello•21m ago•0 comments

China builds thriving semi industry off US leftovers, export controls be damned

https://www.theregister.com/2025/10/07/china_chip_gear/
2•rntn•21m ago•0 comments

Mantle – free cap table management for early-stage startups

https://withmantle.com
1•sabenamantle•21m ago•1 comments

New in 2025: Linux Patches Enable PCI Support for the Amiga 4000

https://www.phoronix.com/news/Linux-PCI-Amiga-4000
2•rbanffy•21m ago•0 comments