I built a tool to benchmark my AI agent's API costs

https://local001.com/tokens

5•sampleSal•1h ago

Comments

sampleSal•1h ago

We're building AI agents on OpenClaw and are burning $1,100/week on Anthropic API calls.

We had no idea whether our prompting strategy was inefficient or whether everyone pays this much.

Built a quick benchmarking tool: https://local001.com/tokens

Submit your weekly spend + provider + use case → see your percentile + comparisons.
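For anyone curious what "percentile" could mean here, a minimal sketch, assuming the "share of submissions at or below yours" definition within one use case. This is an illustration, not the site's actual code; `percentileRank` and the sample figures are made up.

```typescript
// Hypothetical helper: percentile rank of one weekly spend among prior submissions.
// Uses the "at or below" definition, so ties count toward the submitter's rank.
function percentileRank(submissions: number[], spend: number): number {
  if (submissions.length === 0) {
    throw new Error("no submissions to compare against");
  }
  const atOrBelow = submissions.filter((s) => s <= spend).length;
  return (atOrBelow * 100) / submissions.length;
}

// Made-up weekly spends (USD) for one use case, for illustration only.
const sampleSpends = [200, 450, 700, 1100, 2000];
console.log(percentileRank(sampleSpends, 700)); // 60
```

A rank of 60 here would read as "you spend as much as or more than 60% of submitters in this bucket".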

The dataset is early — it gets more useful the more people submit. But here's why I built this:

Our $1,100/week on Anthropic goes to a mix of coding agents and personal assistant tasks, and I can't tell whether that's normal or insane. Specifically:

Are we overspending by use case? Our coding agent burns ~$700/week and the assistant tasks burn ~$400. But I don't know what "good" looks like. Is $700/week for an agentic coding workflow competitive? Are teams doing similar work at $200? $2,000? There's zero public data on this.

Are we overspending on Anthropic? We're all-in on Claude right now. For coding tasks, maybe that's the right call. But for assistant/chat workflows, should we be routing half of that to GPT-4o or Gemini and cutting costs 60%? I genuinely don't know, and I haven't seen anyone publish real cost comparisons by task type, as opposed to benchmark scores.
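The routing question is mostly arithmetic, so here is a rough sketch of it. It assumes spend scales linearly with the share of traffic routed and that the alternative provider costs a fixed fraction of the current one for the same work. `blendedWeeklyCost` and the 0.4 price ratio are hypothetical, not measured prices.

```typescript
// Hypothetical model: spend scales linearly with the share of traffic routed,
// and the alternative provider costs a fixed fraction of the current one.
function blendedWeeklyCost(
  currentWeeklyCost: number, // e.g. the ~$400/week assistant spend from the post
  routedShare: number,       // fraction of traffic moved to the alternative, 0..1
  priceRatio: number         // assumed: alternative cost / current cost for the same work
): number {
  const kept = currentWeeklyCost * (1 - routedShare);
  const routed = currentWeeklyCost * routedShare * priceRatio;
  return kept + routed;
}

// Route half of $400/week to a provider assumed to be 60% cheaper (priceRatio 0.4).
const blended = blendedWeeklyCost(400, 0.5, 0.4);
console.log(blended); // 280, a 30% cut on the assistant spend
```

Under this model, routing half the traffic can never save more than 50% of that spend, so a 60% cut on the whole assistant bill would need a larger routed share.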

That's what this tool is for. Submit your weekly spend, provider, and use case → see where you land. If 50 teams submit data, we'll finally have a real answer to "is Anthropic worth the premium for X?"

Open questions:

Should we track tokens/$ instead of just $?

Should we separate reasoning models (e.g. o1) from non-reasoning models?

How do you benchmark "efficiency" vs raw spend?
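One way to frame the questions above: tokens per dollar normalizes for price, while cost per completed task is one possible stand-in for "efficiency". A minimal sketch, assuming submitters could also report token counts and a task count. The `WeeklyUsage` shape and its field names are hypothetical; the tool currently asks only for spend, provider, and use case.

```typescript
// Hypothetical shape; the tool described above does not collect these fields today.
interface WeeklyUsage {
  spendUsd: number;
  totalTokens: number;    // input + output tokens billed that week
  tasksCompleted: number; // however the team defines a finished task
}

// Tokens bought per dollar: higher means cheaper tokens, not necessarily better results.
function tokensPerDollar(u: WeeklyUsage): number {
  return u.totalTokens / u.spendUsd;
}

// Dollars per completed task: one possible proxy for efficiency.
function costPerTask(u: WeeklyUsage): number {
  return u.spendUsd / u.tasksCompleted;
}

// Made-up numbers for illustration only.
const week: WeeklyUsage = { spendUsd: 700, totalTokens: 140_000_000, tasksCompleted: 350 };
console.log(tokensPerDollar(week)); // 200000
console.log(costPerTask(week));     // 2
```

Two teams with the same raw spend can look very different on the second metric, which is why raw dollars alone may not say much about efficiency.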

Built with Next.js + Cloudflare Workers + D1. Submissions are anonymous (only hashed IPs are stored).

Long-term goal: use this data to negotiate bulk API rates with Anthropic/OpenAI/Google.

How would you improve this?

