I built an open-source benchmark for evaluating LLMs as sales agents. The idea came from noticing that every sales AI tool demos well on clean summaries but falls apart on real deal data — and there was no rigorous way to measure that gap.
How it works
You register an API endpoint. We send your agent deal context (anonymized real B2B deals), and it returns structured recommendations (risks, next steps, stakeholder analysis). A multi-judge panel (Claude, GPT, Gemini via OpenRouter) scores the response against ground truth: what actually happened in the deal.
Two evaluation modes:
Summary Benchmark — Pre-digested checkpoint summaries. Single-turn. 15 deals, 36 checkpoints, 4 scoring dimensions. Models score 68–81%. This is the easy mode.
Artifact-Based Benchmark — Raw call transcripts, email threads, CRM snapshots, Slack messages, documents. Multi-turn (agent can request specific artifacts before answering). 14 deals, 65 checkpoints, 148 evaluation tasks across 8 scoring dimensions. Models score 26–38%.
Every model we tested loses roughly half its score when it moves from summaries to real artifacts.
The interesting findings
Risk identification collapses. The best model goes from 8.0/10 on summaries to 2.3/10 on real data. Models confidently identify risks that don't exist in the source material.
Hallucinated stakeholders. On stakeholder extraction tasks, models invent names (Lisa Sousa, Emma Starr, Mike Lee) that appear in zero artifacts. The actual stakeholders are in the transcripts — models just don't extract them.
Structured frameworks survive. MEDDPICC qualification scoring holds up at 7.5/10. Turns out models are decent at filling in structured templates even from messy data. It's the open-ended analysis that falls apart.
Communication quality is fine. Models score 5–8/10 on drafting follow-up emails and call summaries. The writing is good. The reasoning behind it isn't.
Technical details
Stack: Bun, TypeScript, React, Postgres (Neon), deployed on Fly.io
Evaluation: Task-specific judge prompts per artifact type. Three judges run in parallel, scores averaged to reduce single-model bias. Dimensions: risk identification, next step quality, prioritization, outcome alignment, stakeholder mapping, deal qualification, information synthesis, communication quality.
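Roughly what the judge fan-out looks like. This is a sketch, not the production code: the model slugs and the bare-number scoring prompt are illustrative, but the OpenRouter call and the parallel-then-average structure match the description above.

    // Sketch of the three-judge panel. Model slugs are illustrative; the real
    // benchmark uses task-specific judge prompts per artifact type and dimension.
    const JUDGES = [
      "anthropic/claude-3.5-sonnet",
      "openai/gpt-4o",
      "google/gemini-2.0-flash-001",
    ];

    interface JudgeScore {
      model: string;
      score: number; // 0-10 on a single dimension
    }

    async function judgeOnce(model: string, prompt: string): Promise<JudgeScore> {
      const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
        method: "POST",
        headers: {
          Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
          "Content-Type": "application/json",
        },
        body: JSON.stringify({
          model,
          messages: [{ role: "user", content: prompt }],
        }),
      });
      const data = await res.json();
      // Assumes the judge prompt instructs the model to reply with a bare 0-10 number.
      const score = parseFloat(data.choices[0].message.content.trim());
      return { model, score };
    }

    // Run all judges in parallel and average their scores to reduce single-model bias.
    async function scoreDimension(prompt: string): Promise<number> {
      const scores = await Promise.all(JUDGES.map((m) => judgeOnce(m, prompt)));
      return scores.reduce((sum, s) => sum + s.score, 0) / scores.length;
    }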
Artifact types: TranscriptArtifact (speaker-labeled turns from Granola AI), EmailArtifact (threaded messages with metadata), CrmSnapshotArtifact (HubSpot deal properties + stage history), DocumentArtifact (proposals, decks), SlackThreadArtifact, CalendarEventArtifact
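As a rough sketch, the artifact union looks something like this. Only the details called out above (speaker-labeled turns, threaded messages with metadata, HubSpot properties plus stage history) come from the benchmark; the other field names are placeholders.

    // Sketch of the artifact union; field names beyond those described above are assumed.
    interface TranscriptArtifact {
      kind: "transcript";
      turns: { speaker: string; text: string }[]; // speaker-labeled turns (Granola AI)
    }

    interface EmailArtifact {
      kind: "email";
      messages: { from: string; to: string[]; sentAt: string; body: string }[];
    }

    interface CrmSnapshotArtifact {
      kind: "crm_snapshot";
      properties: Record<string, string>;                  // HubSpot deal properties
      stageHistory: { stage: string; enteredAt: string }[];
    }

    interface DocumentArtifact {
      kind: "document";
      title: string; // proposals, decks
      text: string;
    }

    interface SlackThreadArtifact {
      kind: "slack_thread";
      messages: { user: string; text: string; ts: string }[];
    }

    interface CalendarEventArtifact {
      kind: "calendar_event";
      title: string;
      start: string;
      attendees: string[];
    }

    type Artifact =
      | TranscriptArtifact
      | EmailArtifact
      | CrmSnapshotArtifact
      | DocumentArtifact
      | SlackThreadArtifact
      | CalendarEventArtifact;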
Multi-turn protocol: Artifact-based requests include turnNumber/maxTurns. Agents can return artifactRequests to ask for more context before submitting their analysis. The benchmark runner handles the conversation loop.
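A sketch of the runner's side of that loop, under two assumptions: that artifacts are requested by ID and that the final answer lands in an analysis field. Only turnNumber, maxTurns, and artifactRequests are taken from the protocol; the rest is illustrative.

    // Sketch of the benchmark runner's conversation loop.
    interface AgentReply {
      artifactRequests?: string[]; // artifact IDs the agent wants before answering
      analysis?: unknown;          // final structured recommendations
    }

    async function runCheckpoint(
      endpoint: string,
      basePayload: Record<string, unknown>,
      artifactStore: Map<string, Artifact>, // all deal artifacts, keyed by ID (union above)
      maxTurns = 3,
    ): Promise<unknown> {
      let revealed: Artifact[] = []; // artifacts shared with the agent so far

      for (let turn = 1; turn <= maxTurns; turn++) {
        const res = await fetch(endpoint, {
          method: "POST",
          headers: { "Content-Type": "application/json" },
          body: JSON.stringify({
            ...basePayload,
            turnNumber: turn,
            maxTurns,
            artifacts: revealed,
          }),
        });
        const reply: AgentReply = await res.json();

        // If the agent asks for more context and turns remain, reveal the requested
        // artifacts and continue; otherwise take its analysis as final.
        if (reply.artifactRequests?.length && turn < maxTurns) {
          revealed = revealed.concat(
            reply.artifactRequests
              .map((id) => artifactStore.get(id))
              .filter((a): a is Artifact => a !== undefined),
          );
          continue;
        }
        return reply.analysis;
      }
    }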
API contract: we POST to your endpoint with { version: 2, artifacts: [...], stakeholders: [...], evaluationTask: {...} }; you return structured JSON with risks, next steps, and dimension-specific analysis.
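From the agent's side, a stub you could register looks roughly like this. The top-level request fields are the ones in the contract above; the stakeholder, task, and response shapes are assumed for illustration.

    // Agent-side sketch. Replace the stubbed reply with a call to your own model/agent.
    interface BenchmarkRequest {
      version: 2;
      artifacts: Artifact[]; // union sketched earlier
      stakeholders: { name: string; role?: string }[];
      evaluationTask: { id: string; dimension: string; instructions: string };
      turnNumber: number;
      maxTurns: number;
    }

    interface BenchmarkResponse {
      risks: { description: string; severity: "low" | "medium" | "high" }[];
      nextSteps: { action: string; owner?: string }[];
      analysis: Record<string, unknown>; // dimension-specific analysis
      artifactRequests?: string[];       // optional: ask for more context instead
    }

    // Minimal Bun server to register as your benchmark endpoint.
    Bun.serve({
      port: 3000,
      async fetch(req) {
        const body = (await req.json()) as BenchmarkRequest;

        const reply: BenchmarkResponse = {
          risks: [{ description: "TODO: derive from body.artifacts", severity: "medium" }],
          nextSteps: [{ action: "TODO" }],
          analysis: {},
        };
        return Response.json(reply);
      },
    });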
What I'm looking for
Try it. Register an endpoint and benchmark your agent: https://sales-agent-benchmarks.fly.dev/benchmark
Data partners. The dataset is small (29 deals). If you have anonymized deal artifacts — call transcripts, email exports, CRM data with outcomes — I'd love to process them through the pipeline and credit you as a founding contributor.
Feedback on evaluation methodology. The multi-judge approach works but I'm not confident the prompts are optimal. Happy to discuss the judge prompt design in issues.
The gap between summary performance and real-artifact performance seems like a general problem beyond sales. If anyone's seen similar benchmark work in other domains (legal document analysis, medical records, etc.), I'd be interested to compare notes.