Show HN: Sales Agent Benchmark – SWE-Bench for sales AI agents (open source)

https://sales-agent-benchmarks.fly.dev/benchmark
1•a1j9o94•2h ago
Live leaderboard: https://sales-agent-benchmarks.fly.dev/benchmark
GitHub: https://github.com/a1j9o94/sales-agent-benchmark

I built an open-source benchmark for evaluating LLMs as sales agents. The idea came from noticing that every sales AI tool demos well on clean summaries but falls apart on real deal data — and there was no rigorous way to measure that gap.

How it works

You register an API endpoint. We send your agent the deal context (anonymized real B2B deals), and it returns structured recommendations (risks, next steps, stakeholder analysis). A multi-judge panel (Claude, GPT, Gemini via OpenRouter) scores the response against ground truth: what actually happened in the deal.

Two evaluation modes:

Summary Benchmark — Pre-digested checkpoint summaries. Single-turn. 15 deals, 36 checkpoints, 4 scoring dimensions. Models score 68–81%. This is the easy mode.

Artifact-Based Benchmark — Raw call transcripts, email threads, CRM snapshots, Slack messages, documents. Multi-turn (agent can request specific artifacts before answering). 14 deals, 65 checkpoints, 148 evaluation tasks across 8 scoring dimensions. Models score 26–38%.

Every model we tested scores roughly half as well when switching from summaries to real artifacts.

The interesting findings

Risk Identification collapses. Best model goes from 8.0/10 on summaries to 2.3/10 on real data. Models confidently identify risks that don't exist in the source material.

Hallucinated stakeholders. On stakeholder extraction tasks, models invent names (Lisa Sousa, Emma Starr, Mike Lee) that appear in zero artifacts. The actual stakeholders are in the transcripts — models just don't extract them.
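One simple way to catch this failure mode is a grounding check that flags any returned stakeholder name found in none of the artifact text. The sketch below is only an illustration: the function name and the case-insensitive substring matching are assumptions, and the post does not say how the benchmark itself detects invented names.

```typescript
// Hypothetical helper, not taken from the benchmark's codebase.
// Returns the names that appear in none of the artifact texts.
// Matching is a case-insensitive substring test, which is an assumption of
// this sketch and will miss nicknames or misspellings.
function ungroundedNames(names: string[], artifactTexts: string[]): string[] {
  const haystack = artifactTexts.join("\n").toLowerCase();
  return names.filter(
    (name) => haystack.indexOf(name.trim().toLowerCase()) === -1
  );
}
```

Called with the names an agent returned and the raw text of every artifact it was shown, this yields the names that have no support in the source material.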

Structured frameworks survive. MEDDPICC qualification scoring holds up at 7.5/10. Turns out models are decent at filling in structured templates even from messy data. It's the open-ended analysis that falls apart.

Communication quality is fine. Models score 5–8/10 on drafting follow-up emails and call summaries. The writing is good. The reasoning behind it isn't.

Technical details

Stack: Bun, TypeScript, React, Postgres (Neon), deployed on Fly.io

Evaluation: Task-specific judge prompts per artifact type. Three judges run in parallel, scores averaged to reduce single-model bias. Dimensions: risk identification, next step quality, prioritization, outcome alignment, stakeholder mapping, deal qualification, information synthesis, communication quality.
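The averaging step can be sketched roughly as follows. The type and function names are hypothetical, and the 0–10 scale is inferred from the scores quoted above; the post only states that three judges run in parallel and that their scores are averaged.

```typescript
// Dimension names follow the list in the post; the identifiers are assumptions.
type Dimension =
  | "riskIdentification"
  | "nextStepQuality"
  | "prioritization"
  | "outcomeAlignment"
  | "stakeholderMapping"
  | "dealQualification"
  | "informationSynthesis"
  | "communicationQuality";

type DimensionScores = Partial<Record<Dimension, number>>;

// Hypothetical shape for one judge's verdict (scores assumed to be 0-10).
interface JudgeResult {
  judge: string;
  scores: DimensionScores;
}

// Averages each dimension over the judges that actually scored it.
function averageJudgeScores(panel: JudgeResult[]): DimensionScores {
  const totals: DimensionScores = {};
  const counts: DimensionScores = {};
  for (const result of panel) {
    for (const dim of Object.keys(result.scores) as Dimension[]) {
      const value = result.scores[dim];
      if (value === undefined) continue;
      totals[dim] = (totals[dim] ?? 0) + value;
      counts[dim] = (counts[dim] ?? 0) + 1;
    }
  }
  const averaged: DimensionScores = {};
  for (const dim of Object.keys(totals) as Dimension[]) {
    averaged[dim] = (totals[dim] as number) / (counts[dim] as number);
  }
  return averaged;
}
```

A plain mean keeps any one judge from dominating, but it also hides disagreement between judges; whether the benchmark tracks that spread is not stated in the post.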

Artifact types: TranscriptArtifact (speaker-labeled turns from Granola AI), EmailArtifact (threaded messages with metadata), CrmSnapshotArtifact (HubSpot deal properties + stage history), DocumentArtifact (proposals, decks), SlackThreadArtifact, CalendarEventArtifact
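To make the artifact list concrete, here is one way the types could be modeled as a discriminated union. Only the six type names come from the post. The `type` tags, every field name, and the `artifactToText` helper are assumptions made for illustration.

```typescript
// Field names below are hypothetical; the post lists only the type names.
interface TranscriptArtifact {
  type: "transcript";
  turns: { speaker: string; text: string }[];
}
interface EmailArtifact {
  type: "email";
  messages: { from: string; to: string[]; sentAt: string; body: string }[];
}
interface CrmSnapshotArtifact {
  type: "crm_snapshot";
  properties: Record<string, string>;
  stageHistory: { stage: string; enteredAt: string }[];
}
interface DocumentArtifact {
  type: "document";
  title: string;
  text: string;
}
interface SlackThreadArtifact {
  type: "slack_thread";
  messages: { author: string; text: string }[];
}
interface CalendarEventArtifact {
  type: "calendar_event";
  title: string;
  startsAt: string;
  attendees: string[];
}

type Artifact =
  | TranscriptArtifact
  | EmailArtifact
  | CrmSnapshotArtifact
  | DocumentArtifact
  | SlackThreadArtifact
  | CalendarEventArtifact;

// Flattens any artifact to plain text, e.g. for inclusion in a prompt.
function artifactToText(artifact: Artifact): string {
  switch (artifact.type) {
    case "transcript":
      return artifact.turns.map((t) => `${t.speaker}: ${t.text}`).join("\n");
    case "email":
      return artifact.messages.map((m) => `From ${m.from}: ${m.body}`).join("\n");
    case "crm_snapshot": {
      const props = artifact.properties;
      return Object.keys(props).map((key) => `${key}=${props[key]}`).join("\n");
    }
    case "document":
      return `${artifact.title}\n${artifact.text}`;
    case "slack_thread":
      return artifact.messages.map((m) => `${m.author}: ${m.text}`).join("\n");
    case "calendar_event":
      return `${artifact.title} (${artifact.attendees.join(", ")})`;
    default: {
      // Compile-time exhaustiveness check: fails to type-check if a new
      // artifact type is added to the union without a case above.
      const unreachable: never = artifact;
      return unreachable;
    }
  }
}
```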

Multi-turn protocol: Artifact-based requests include turnNumber/maxTurns. Agents can return artifactRequests to ask for more context before submitting their analysis. The benchmark runner handles the conversation loop.
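The conversation loop might look something like the sketch below. The names `turnNumber`, `maxTurns`, and `artifactRequests` come from the post; everything else is an assumption, including the idea that artifacts are requested by string ID. The agent is modeled as an in-process function so the example runs without a network, whereas the real runner presumably calls the registered endpoint over HTTP.

```typescript
// Hypothetical request/response shapes for one turn.
interface TurnRequest {
  turnNumber: number;
  maxTurns: number;
  artifacts: string[]; // IDs of the artifacts the agent can currently see
}
interface TurnResponse {
  artifactRequests?: string[]; // ask for more context...
  analysis?: string;           // ...or submit the final answer
}
type Agent = (request: TurnRequest) => TurnResponse;

// Runs turns until the agent submits an analysis or the turn budget runs out.
function runConversation(
  agent: Agent,
  initialArtifacts: string[],
  availableArtifacts: string[],
  maxTurns: number
): { analysis: string | null; turnsUsed: number } {
  const visible = initialArtifacts.slice();
  for (let turn = 1; turn <= maxTurns; turn++) {
    const response = agent({
      turnNumber: turn,
      maxTurns,
      artifacts: visible.slice(),
    });
    if (response.analysis !== undefined) {
      return { analysis: response.analysis, turnsUsed: turn };
    }
    // Grant only requests for artifacts that exist and are not yet visible.
    for (const id of response.artifactRequests ?? []) {
      if (availableArtifacts.indexOf(id) !== -1 && visible.indexOf(id) === -1) {
        visible.push(id);
      }
    }
  }
  return { analysis: null, turnsUsed: maxTurns };
}
```

In this sketch an agent that never submits an analysis simply ends with `analysis: null`; how the actual benchmark scores that case is not described in the post.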

API contract: The benchmark POSTs to your endpoint with { version: 2, artifacts: [...], stakeholders: [...], evaluationTask: {...} }; your agent returns structured JSON with risks, next steps, and dimension-specific analysis.
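A minimal handler for that payload could be sketched as below. The four request keys come from the post. The response keys (`risks`, `nextSteps`, `analysis`), the status codes, and the function names are assumptions, so check the repository for the actual schema before relying on them.

```typescript
// Request keys are from the post; element types are left open on purpose.
interface BenchmarkRequest {
  version: 2;
  artifacts: unknown[];
  stakeholders: unknown[];
  evaluationTask: Record<string, unknown>;
}

// Hypothetical response shape.
interface AgentResponse {
  risks: string[];
  nextSteps: string[];
  analysis: Record<string, unknown>;
}

// Runtime check that a parsed JSON body looks like a version 2 request.
function isBenchmarkRequest(body: unknown): body is BenchmarkRequest {
  if (typeof body !== "object" || body === null) return false;
  const b = body as Record<string, unknown>;
  return (
    b.version === 2 &&
    Array.isArray(b.artifacts) &&
    Array.isArray(b.stakeholders) &&
    typeof b.evaluationTask === "object" &&
    b.evaluationTask !== null
  );
}

// Framework-agnostic handler: takes the parsed body, returns status and JSON.
function handleBenchmarkRequest(
  body: unknown
): { status: number; json: AgentResponse | { error: string } } {
  if (!isBenchmarkRequest(body)) {
    return { status: 400, json: { error: "expected a version 2 benchmark payload" } };
  }
  // Placeholder logic; a real agent would call a model here.
  return {
    status: 200,
    json: {
      risks: [],
      nextSteps: [`Review ${body.artifacts.length} artifact(s) before recommending next steps`],
      analysis: {},
    },
  };
}
```

Keeping the handler free of any HTTP framework makes it easy to wire into whichever server you already run and to unit-test without a network.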

What I'm looking for

Try it. Register an endpoint and benchmark your agent: https://sales-agent-benchmarks.fly.dev/benchmark

Data partners. The dataset is small (29 deals). If you have anonymized deal artifacts — call transcripts, email exports, CRM data with outcomes — I'd love to process them through the pipeline and credit you as a founding contributor.

Feedback on evaluation methodology. The multi-judge approach works but I'm not confident the prompts are optimal. Happy to discuss the judge prompt design in issues.

The gap between summary performance and real-artifact performance seems like a general problem beyond sales. If anyone's seen similar benchmark work in other domains (legal document analysis, medical records, etc.), I'd be interested to compare notes.
