frontpage.
Show HN: Sup AI, a confidence-weighted ensemble (52.15% on Humanity's Last Exam)

https://sup.ai
18•supai•1d ago
Hi HN. I'm Ken, a 20-year-old Stanford CS student. I built Sup AI.

I started working on this because no single AI model is right all the time, but their errors don’t strongly correlate. In other words, models often make unique mistakes relative to other models. So I run multiple models in parallel and synthesize the outputs by weighting segments based on confidence. Low entropy in the output token probability distributions correlates with accuracy. High entropy is often where hallucinations begin.
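A toy sketch of that idea (hypothetical names and data; the real pipeline is surely more involved): score each candidate segment by the average entropy of its per-token probability distributions, and keep the most peaked one.

```python
import math

def segment_confidence(token_probs):
    """Score a segment by the peakedness of its per-token distributions.

    token_probs: one probability distribution per generated token
    (each a list of probabilities summing to 1). Returns mean negative
    Shannon entropy, so higher = more confident.
    """
    entropies = [
        -sum(p * math.log(p) for p in dist if p > 0)
        for dist in token_probs
    ]
    return -sum(entropies) / len(entropies)

def pick_segment(candidates):
    """candidates: {model_name: (segment_text, token_probs)}.
    Keep the model whose token distributions were most peaked."""
    best = max(candidates, key=lambda m: segment_confidence(candidates[m][1]))
    return best, candidates[best][0]

# A confident (low-entropy) model wins over an uncertain one.
candidates = {
    "model_a": ("Paris", [[0.99, 0.01], [0.98, 0.02]]),
    "model_b": ("Lyon",  [[0.55, 0.45], [0.60, 0.40]]),
}
print(pick_segment(candidates))  # ('model_a', 'Paris')
```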

My dad Scott (AI Research Scientist at TRI) is my research partner on this. He sends me papers at all hours, we argue about whether they actually apply and what modifications make sense, and then I build and test things. The entropy-weighting approach came out of one of those conversations.

In our eval on Humanity's Last Exam, Sup scored 52.15%. The best individual model in the same evaluation run got 44.74%. The relative gap is statistically significant (p < 0.001).

Methodology, eval code, data, and raw results:

- https://sup.ai/research/hle-white-paper-jan-9-2026

- https://github.com/supaihq/hle

Limitations:

- We evaluated 1,369 of the 2,500 HLE questions (details in the above links)

- Not all APIs expose token logprobs; we use several methods to estimate confidence when they don't
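For illustration, one common logprob-free proxy is self-consistency: resample the model and treat agreement as confidence. This is a generic technique, not necessarily one of the methods Sup actually uses (details are in the linked write-up):

```python
from collections import Counter

def agreement_confidence(samples):
    """Logprob-free confidence proxy: resample the model several times
    and use the fraction agreeing with the modal answer as confidence."""
    top, count = Counter(samples).most_common(1)[0]
    return top, count / len(samples)

# Four resamples of the same (hypothetical) prompt:
print(agreement_confidence(["7", "7", "7", "9"]))  # ('7', 0.75)
```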

We tried offering free access and it got abused so badly it nearly killed us. Right now the sustainable option is a $5 starter credit with card verification (no auto-charge). If you don't want to sign up, drop a prompt in the comments and I'll run it myself and post the result.

Try it at https://sup.ai. My dad Scott (@scottmu) is in the thread too. Would love blunt feedback, especially where this really works for you and where it falls short.

Here's a short demo video: https://www.youtube.com/watch?v=DRcns0rRhsg

Comments

algolint•1d ago
Ensembling usually hits a wall at latency and cost. Running these in parallel is table stakes, but how are you handling the orchestration layer overhead when one provider (e.g., Vertex or Bedrock) spikes in P99 latency? If you're waiting for the slowest model to get entropy stats, the DX falls off a cliff. Are you using speculative execution or a timeout/fallback strategy to maintain a responsive TTFT (time to first token)?
supai•1d ago
A few things:

- We do something similar to OpenRouter: we measure the latency of the different providers to make sure we always route to the fastest ones

- Users can cancel a single model stream if it's taking too long

- The orchestrator is pretty good at choosing which models to use for which task. The actual confidence scoring and synthesis at the end is the hard part that you can't do naively, but the orchestrator plays the biggest part in optimizing cost + speed. I've made sure that we don't exceed 25% extra cost or time for the vast majority of queries, compared to equivalent prompts in ChatGPT/Gemini/etc.

This is viable, IMO, because you can run multiple less-intelligent models at lower thinking effort and beat a single more-intelligent model at a large thinking effort. The reduced thinking effort speeds up each prompt dramatically.

The sequential steps are then:

1. Ensemble RAG
2. Orchestrator
3. Models in parallel
4. Synthesizer

And retries for low-confidence (although that's pretty optimized with selective retries of portions of the answer).
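The "models in parallel" step with per-model cancellation could be sketched like this (hypothetical stand-in code with simulated latencies; not Sup's implementation):

```python
import asyncio

async def call_model(name, latency, answer):
    """Stand-in for a streaming model call; latency simulates the provider."""
    await asyncio.sleep(latency)
    return name, answer

async def fan_out(models, timeout):
    """Run all models in parallel; cancel any that miss the deadline."""
    tasks = [asyncio.create_task(call_model(*m)) for m in models]
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    for task in pending:
        task.cancel()          # don't wait for stragglers
    return dict(task.result() for task in done)

results = asyncio.run(fan_out(
    [("fast", 0.01, "A"), ("slow", 5.0, "B")], timeout=0.5))
print(results)  # {'fast': 'A'}
```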

mememememememo•10h ago
You could timeout. You could trade them off dynamically.

I.e. you get 3 replies at 80% confidence. You decide that at 80% you're fairly good, but you're happy to wait 5 seconds for completion / 500 ms for time to first token. If either budget is breached, you return the current answer.

But if you're at 5%, you wait 60 s total / 2 s for a token, since the upside of the still-pending model is much higher.

Basically wagering time for quality in a dynamic prediction market in front of the LLM.
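That trade-off can be sketched as a linear interpolation between the example endpoints above (5 s / 500 ms at high confidence, 60 s / 2 s at low confidence; the linear mapping is an assumption):

```python
def wait_budget(confidence):
    """Interpolate the endpoints: at confidence 1.0 wait only
    5 s total / 0.5 s TTFT; at confidence 0.0 wait up to
    60 s total / 2 s TTFT. Returns (total_s, ttft_s)."""
    frac = 1.0 - confidence          # low confidence -> bigger budget
    total = 5.0 + frac * (60.0 - 5.0)
    ttft = 0.5 + frac * (2.0 - 0.5)
    return total, ttft

print(wait_budget(0.5))  # (32.5, 1.25)
print(wait_budget(1.0))  # (5.0, 0.5) — confident, so stop waiting early
```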

all2•1h ago
If we treat LLM output as a manufacturing process, then three independent 80% probabilities compound to 0.8 × 0.8 × 0.8 = 0.512, or about 51%.
scottmu•22h ago
I want to clarify what Ken meant by "entropy in the output token probability distributions." Whenever an LLM outputs a token, it's choosing that token out of all possible tokens. Every possible output token has a probability assigned by the model (APIs typically expose it as a log-probability). This is a probability distribution (the output token probabilities sum to 1). Entropy is a measure of uncertainty: it quantifies whether a token probability distribution is certain (one token has a 99.9% probability and the rest share the leftover 0.1%) or uncertain (every token has roughly the same probability, so it's pretty much random which token is selected). Low entropy is the former case; high entropy is the latter.
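The two cases worked out numerically (a small illustration, not Sup code), assuming a 1,000-token vocabulary:

```python
import math

def entropy(probs):
    """Shannon entropy (nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

vocab = 1000

# Certain: one token takes 99.9%, the other 999 split the leftover 0.1%.
certain = [0.999] + [0.001 / (vocab - 1)] * (vocab - 1)

# Uncertain: every token equally likely.
uniform = [1.0 / vocab] * vocab

print(entropy(certain))  # ≈ 0.015 nats (low entropy)
print(entropy(uniform))  # ≈ 6.908 nats, i.e. ln(1000) (maximum entropy)
```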

There is interesting research in the correlation of entropy with accuracy and hallucinations:

- https://www.nature.com/articles/s41586-024-07421-0

- https://arxiv.org/abs/2405.19648

- https://arxiv.org/abs/2509.04492 (when only a small number of probabilities are available, which is something we frequently deal with)

- https://arxiv.org/abs/2603.18940

- tons more, happy to chat about if interested

mememememememo•10h ago
Wow, if it is that easy to detect hallucinations, are the big models or rigs (agentic scaffolds) building in any self-correcting behaviour? Or possibly switching to an "I don't know" mode so the model can ask the human for help understanding?

Maybe this insight is why I feel hallucinations are much rarer in the last 12 months on top models. Are they being detected before they get sent out?

stephantul•52m ago
Buddy… your son gets a top post on HN in which he clearly mentions you, yet you feel the need to make an account just to correct him in the first comment? Can’t you send him a message and let him correct it?
hello12343214•14h ago
I use Gemini and Cursor for enterprise software implementation, but they often suggest incorrect solutions to edge cases and unique config requirements. An AI that has a higher likelihood of being accurate is very appealing. I'll give Sup AI a try over the next few days at work.

Also, discovering HLE was great... scrolling through some of the questions brings back memories of college organic chem.

scottmu•12h ago
I've felt your pain. Models aren't always trained well enough on edge cases and configs.

Would love to hear how Sup works out for you.

siliconc0w•1h ago
Do you have data for other benchmarks? +7% for HLE isn't nothing but it'd be more compelling if you could show you're consistently doing better with your method across more domains (especially coding, which seems like the primary use-case these days).
wavemode•49m ago
Is 7 extra percent on HLE benchmark really worth the cost of running an entire ensemble of models?
kelseyfrog•26m ago
Depends on the use-case and requirements.
Tomjosetj31•28m ago
Impressive result on HLE if the methodology holds up. One thing I'd want to understand better: how much of the gain comes from the entropy weighting specifically vs. simply having more compute via parallel inference? Would be curious to see an ablation — same models, same budget, but with naive majority voting instead. That would isolate the actual contribution of your confidence-weighting approach.
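For concreteness, the two ensemble rules this ablation would compare can disagree on the same outputs (toy data; `weighted_vote` is a hypothetical stand-in for the confidence-weighted synthesis):

```python
from collections import Counter

def majority_vote(answers):
    """Naive baseline: the most common answer wins."""
    return Counter(answers).most_common(1)[0][0]

def weighted_vote(answers, confidences):
    """Weighted variant: sum each answer's confidence scores."""
    totals = {}
    for ans, conf in zip(answers, confidences):
        totals[ans] = totals.get(ans, 0.0) + conf
    return max(totals, key=totals.get)

answers = ["A", "B", "B"]
print(majority_vote(answers))                      # B
print(weighted_vote(answers, [0.95, 0.30, 0.35]))  # A
```

Where the two rules give different scores on the benchmark, that difference is exactly the contribution of the weighting.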

PyPI package telnyx has been compromised in yet another supply chain attack

https://www.aikido.dev/blog/telnyx-pypi-compromised-teampcp-canisterworm
39•overflowy•1h ago•12 comments

Anatomy of the .claude/ folder

https://blog.dailydoseofds.com/p/anatomy-of-the-claude-folder
241•freedomben•4h ago•125 comments

Installing a Let's Encrypt TLS Certificate on a Brother Printer with Certbot

https://owltec.ca/Other/Installing+a+Let%27s+Encrypt+TLS+certificate+on+a+Brother+printer+automat...
141•8organicbits•5h ago•39 comments

Sand from Different Beaches in the World

https://magnifiedsand.com/
91•RAAx707•4d ago•23 comments

Desk for people who work at home with a cat

https://soranews24.com/2026/03/27/japan-now-has-a-special-desk-for-people-who-work-at-home-with-a...
218•zdw•3h ago•81 comments

Building FireStriker: Making Civic Tech Free

https://firestriker.org/blog/building-firestriker-why-im-making-civic-tech-free
30•noleary•1d ago•4 comments

Vibe-Coded Ext4 for OpenBSD

https://lwn.net/SubscriberLink/1064541/1a399d572a046fb9/
7•corbet•28m ago•1 comments

AI got the blame for the Iran school bombing. The truth is more worrying

https://www.theguardian.com/news/2026/mar/26/ai-got-the-blame-for-the-iran-school-bombing-the-tru...
185•cptroot•2h ago•130 comments

Meow.camera

https://meow.camera/#4258783365322591678
77•surprisetalk•4h ago•17 comments

How and why to take a logarithm of an image [video]

https://www.youtube.com/watch?v=ldxFjLJ3rVY
162•jgwil2•4d ago•58 comments

Hold on to Your Hardware

https://xn--gckvb8fzb.com/hold-on-to-your-hardware/
483•LucidLynx•9h ago•394 comments

Embracing Bayesian Methods in Clinical Trials

https://jamanetwork.com/journals/jama/fullarticle/2847011
24•nextos•3d ago•0 comments

People inside Microsoft are fighting to drop mandatory Microsoft Account

https://www.windowscentral.com/microsoft/windows-11/people-inside-microsoft-are-fighting-to-drop-...
276•breve•5h ago•256 comments

A Faster Alternative to Jq

https://micahkepe.com/blog/jsongrep/
332•pistolario•12h ago•205 comments

‘Energy independence feels practical’: Europeans building mini solar farms

https://www.euronews.com/2026/03/26/suddenly-energy-independence-feels-practical-europeans-are-bu...
110•vrganj•10h ago•111 comments

Gzip decompression in 250 lines of Rust

https://iev.ee/blog/gzip-decompression-in-250-lines-of-rust/
79•vismit2000•3d ago•30 comments

Apple discontinues the Mac Pro

https://9to5mac.com/2026/03/26/apple-discontinues-the-mac-pro/
606•bentocorp•22h ago•563 comments

Schedule tasks on the web

https://code.claude.com/docs/en/web-scheduled-tasks
259•iBelieve•14h ago•215 comments

Telnyx Python SDK: Supply Chain Security Notice

https://telnyx.com/resources/telnyx-python-sdk-supply-chain-security-notice-march-2026
6•KomoD•1h ago•1 comments

Can It Resolve Doom? Game Engine in 2k DNS Records

https://core-jmp.org/2026/03/can-it-resolve-doom-game-engine-in-2000-dns-records/
10•Einenlum•3d ago•0 comments

The 'paperwork flood': How I drowned a bureaucrat before dinner

https://sightlessscribbles.com/posts/the-paperwork-flood/
463•robin_reala•6h ago•382 comments

Iran-linked hackers have breached FBI director's personal emails

https://www.cnn.com/2026/03/27/politics/iran-linked-hackers-fbi-director-patel
142•vrganj•2h ago•58 comments

Byte Magazine Archive 1975 to 1995

https://www.worldradiohistory.com/Byte_Magazine.htm
8•oldnetguy•1h ago•2 comments

21,864 Yugoslavian .yu domains

https://jacobfilipp.com/yu/
35•freediver•1d ago•58 comments

EMachines never obsolete PCs: More than a meme

https://dfarq.homeip.net/emachines-never-obsolete-pcs-more-than-a-meme/
46•zdw•3d ago•25 comments

Apple says no one using Lockdown Mode has been hacked with spyware

https://techcrunch.com/2026/03/27/apple-says-no-one-using-lockdown-mode-has-been-hacked-with-spyw...
78•jbegley•3h ago•49 comments

Everything old is new again: memory optimization

https://nibblestew.blogspot.com/2026/03/everything-old-is-new-again-memory.html
142•ibobev•4d ago•105 comments

Should QA exist?

https://www.rubick.com/should-qa-exist/
51•PretzelFisch•8h ago•87 comments

Last gasps of the rent seeking class?

https://geohot.github.io//blog/jekyll/update/2026/02/26/the-last-gasps-of-the-rent-seeking-class....
104•surprisetalk•4h ago•101 comments

The European AllSky7 fireball network

https://www.allsky7.net/#archive
108•marklit•12h ago•8 comments