I built an open-source benchmark for evaluating LLMs as sales agents. The idea came from noticing that every sales AI tool demos well on clean summaries but falls apart on real deal data — and there was no rigorous way to measure that gap.
How it works
You register an API endpoint. We send your agent deal context (anonymized real B2B deals), and it returns structured recommendations (risks, next steps, stakeholder analysis). A multi-judge panel (Claude, GPT, Gemini via OpenRouter) scores the response against ground truth: what actually happened in the deal.
Two evaluation modes:
Summary Benchmark — Pre-digested checkpoint summaries. Single-turn. 15 deals, 36 checkpoints, 4 scoring dimensions. Models score 68–81%. This is the easy mode.
Artifact-Based Benchmark — Raw call transcripts, email threads, CRM snapshots, Slack messages, documents. Multi-turn (agent can request specific artifacts before answering). 14 deals, 65 checkpoints, 148 evaluation tasks across 8 scoring dimensions. Models score 26–38%.
Every model we tested loses roughly half its score when it moves from summaries to real artifacts.
The interesting findings
Risk identification collapses. The best model goes from 8.0/10 on summaries to 2.3/10 on real data. Models confidently identify risks that don't exist in the source material.
Hallucinated stakeholders. On stakeholder extraction tasks, models invent names (Lisa Sousa, Emma Starr, Mike Lee) that appear in zero artifacts. The actual stakeholders are in the transcripts — models just don't extract them.
Structured frameworks survive. MEDDPICC qualification scoring holds up at 7.5/10. Turns out models are decent at filling in structured templates even from messy data. It's the open-ended analysis that falls apart.
Communication quality is fine. Models score 5–8/10 on drafting follow-up emails and call summaries. The writing is good. The reasoning behind it isn't.
Technical details
Stack: Bun, TypeScript, React, Postgres (Neon), deployed on Fly.io
Evaluation: Task-specific judge prompts per artifact type. Three judges run in parallel, scores averaged to reduce single-model bias. Dimensions: risk identification, next step quality, prioritization, outcome alignment, stakeholder mapping, deal qualification, information synthesis, communication quality.
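Roughly what the judge fan-out looks like. This is a sketch, not the production code: the model slugs and the bare-number scoring prompt are illustrative, but the OpenRouter call and the parallel-then-average structure match the description above.

    // Sketch of the three-judge panel. Model slugs are illustrative; the real
    // benchmark uses task-specific judge prompts per artifact type and dimension.
    const JUDGES = [
      "anthropic/claude-3.5-sonnet",
      "openai/gpt-4o",
      "google/gemini-2.0-flash-001",
    ];

    interface JudgeScore {
      model: string;
      score: number; // 0-10 on a single dimension
    }

    async function judgeOnce(model: string, prompt: string): Promise<JudgeScore> {
      const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
        method: "POST",
        headers: {
          Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
          "Content-Type": "application/json",
        },
        body: JSON.stringify({
          model,
          messages: [{ role: "user", content: prompt }],
        }),
      });
      const data = await res.json();
      // Assumes the judge prompt instructs the model to reply with a bare 0-10 number.
      const score = parseFloat(data.choices[0].message.content.trim());
      return { model, score };
    }

    // Run all judges in parallel and average their scores to reduce single-model bias.
    async function scoreDimension(prompt: string): Promise<number> {
      const scores = await Promise.all(JUDGES.map((m) => judgeOnce(m, prompt)));
      return scores.reduce((sum, s) => sum + s.score, 0) / scores.length;
    }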
Artifact types: TranscriptArtifact (speaker-labeled turns from Granola AI), EmailArtifact (threaded messages with metadata), CrmSnapshotArtifact (HubSpot deal properties + stage history), DocumentArtifact (proposals, decks), SlackThreadArtifact, CalendarEventArtifact
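As a rough sketch, the artifact union looks something like this. Only the details called out above (speaker-labeled turns, threaded messages with metadata, HubSpot properties plus stage history) come from the benchmark; the other field names are placeholders.

    // Sketch of the artifact union; field names beyond those described above are assumed.
    interface TranscriptArtifact {
      kind: "transcript";
      turns: { speaker: string; text: string }[]; // speaker-labeled turns (Granola AI)
    }

    interface EmailArtifact {
      kind: "email";
      messages: { from: string; to: string[]; sentAt: string; body: string }[];
    }

    interface CrmSnapshotArtifact {
      kind: "crm_snapshot";
      properties: Record<string, string>;                  // HubSpot deal properties
      stageHistory: { stage: string; enteredAt: string }[];
    }

    interface DocumentArtifact {
      kind: "document";
      title: string; // proposals, decks
      text: string;
    }

    interface SlackThreadArtifact {
      kind: "slack_thread";
      messages: { user: string; text: string; ts: string }[];
    }

    interface CalendarEventArtifact {
      kind: "calendar_event";
      title: string;
      start: string;
      attendees: string[];
    }

    type Artifact =
      | TranscriptArtifact
      | EmailArtifact
      | CrmSnapshotArtifact
      | DocumentArtifact
      | SlackThreadArtifact
      | CalendarEventArtifact;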
Multi-turn protocol: Artifact-based requests include turnNumber/maxTurns. Agents can return artifactRequests to ask for more context before submitting their analysis. The benchmark runner handles the conversation loop.
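A sketch of the runner's side of that loop, under two assumptions: that artifacts are requested by ID and that the final answer lands in an analysis field. Only turnNumber, maxTurns, and artifactRequests are taken from the protocol; the rest is illustrative.

    // Sketch of the benchmark runner's conversation loop.
    interface AgentReply {
      artifactRequests?: string[]; // artifact IDs the agent wants before answering
      analysis?: unknown;          // final structured recommendations
    }

    async function runCheckpoint(
      endpoint: string,
      basePayload: Record<string, unknown>,
      artifactStore: Map<string, Artifact>, // all deal artifacts, keyed by ID (union above)
      maxTurns = 3,
    ): Promise<unknown> {
      let revealed: Artifact[] = []; // artifacts shared with the agent so far

      for (let turn = 1; turn <= maxTurns; turn++) {
        const res = await fetch(endpoint, {
          method: "POST",
          headers: { "Content-Type": "application/json" },
          body: JSON.stringify({
            ...basePayload,
            turnNumber: turn,
            maxTurns,
            artifacts: revealed,
          }),
        });
        const reply: AgentReply = await res.json();

        // If the agent asks for more context and turns remain, reveal the requested
        // artifacts and continue; otherwise take its analysis as final.
        if (reply.artifactRequests?.length && turn < maxTurns) {
          revealed = revealed.concat(
            reply.artifactRequests
              .map((id) => artifactStore.get(id))
              .filter((a): a is Artifact => a !== undefined),
          );
          continue;
        }
        return reply.analysis;
      }
    }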
API contract: we POST to your endpoint with { version: 2, artifacts: [...], stakeholders: [...], evaluationTask: {...} }; you return structured JSON with risks, next steps, and dimension-specific analysis.
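From the agent's side, a stub you could register looks roughly like this. The top-level request fields are the ones in the contract above; the stakeholder, task, and response shapes are assumed for illustration.

    // Agent-side sketch. Replace the stubbed reply with a call to your own model/agent.
    interface BenchmarkRequest {
      version: 2;
      artifacts: Artifact[]; // union sketched earlier
      stakeholders: { name: string; role?: string }[];
      evaluationTask: { id: string; dimension: string; instructions: string };
      turnNumber: number;
      maxTurns: number;
    }

    interface BenchmarkResponse {
      risks: { description: string; severity: "low" | "medium" | "high" }[];
      nextSteps: { action: string; owner?: string }[];
      analysis: Record<string, unknown>; // dimension-specific analysis
      artifactRequests?: string[];       // optional: ask for more context instead
    }

    // Minimal Bun server to register as your benchmark endpoint.
    Bun.serve({
      port: 3000,
      async fetch(req) {
        const body = (await req.json()) as BenchmarkRequest;

        const reply: BenchmarkResponse = {
          risks: [{ description: "TODO: derive from body.artifacts", severity: "medium" }],
          nextSteps: [{ action: "TODO" }],
          analysis: {},
        };
        return Response.json(reply);
      },
    });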
What I'm looking for
Try it. Register an endpoint and benchmark your agent: https://sales-agent-benchmarks.fly.dev/benchmark
Data partners. The dataset is small (29 deals). If you have anonymized deal artifacts — call transcripts, email exports, CRM data with outcomes — I'd love to process them through the pipeline and credit you as a founding contributor.
Feedback on evaluation methodology. The multi-judge approach works but I'm not confident the prompts are optimal. Happy to discuss the judge prompt design in issues.
The gap between summary performance and real-artifact performance seems like a general problem beyond sales. If anyone's seen similar benchmark work in other domains (legal document analysis, medical records, etc.), I'd be interested to compare notes.