The goal was simple: measure how token usage grows as you introduce more tools and more conversation turns.
THE SETUP
- 6 tools (metrics, alerts, topology, neighbors, etc.)
- gpt-4o-mini
- Token instrumentation across four phases
- No caching, no prompt tricks, no compression
THE FOUR PHASES
Phase 1: Single tool. One LLM call, one tool schema. Baseline.
Phase 2: Six tools. Same query, but the agent exposes six tools. Token growth comes entirely from the additional tool definitions.
Phase 3: Chained calls. Three sequential tool calls, each feeding into the next. No conversation history yet.
Phase 4: Multi-turn conversation. Three turns with full replay of every prior message, tool request, and tool response.
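The measurement itself can be sketched as a tiny harness. Everything here is illustrative: `count_tokens` is a crude ~4-characters-per-token stand-in (a real run would read the `usage` field from the API response or use a proper tokenizer), and the schema and message sizes are placeholder assumptions, not the post's measured values.

```python
# Crude token estimate: roughly 4 characters per token.
def count_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def prompt_tokens(tool_schemas: list[str], messages: list[str]) -> int:
    # Every call replays all tool definitions plus the full message history.
    return sum(count_tokens(s) for s in tool_schemas) + sum(
        count_tokens(m) for m in messages
    )

SCHEMA = "x" * 400  # one placeholder tool definition (~100 tokens)
MSG = "y" * 200     # one placeholder message (~50 tokens)

phase1 = prompt_tokens([SCHEMA], [MSG])      # one tool, one message
phase2 = prompt_tokens([SCHEMA] * 6, [MSG])  # six tools, same single query
```

The phase 2 cost exceeds phase 1 by exactly five extra schema payloads, which is the pattern the experiment isolates: same query, more definitions.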
RESULTS
Phase 1: 590 tokens (baseline)
Phase 2: 1,250 tokens (2.1x baseline)
Phase 3: 4,500 tokens (7.6x baseline)
Phase 4: 7,166 tokens (12.1x baseline)
Two non-obvious findings stood out. First, adding five more tools roughly doubled token cost. Second, adding two more conversation turns tripled it. Conversation depth drove more token growth than tool count.
WHY THIS HAPPENS
LLMs are stateless. Every call must replay the full context: tool definitions, conversation history, and previous tool outputs. Adding tools grows each prompt by a fixed amount, so that cost is linear. Adding conversation turns is worse: each turn resends everything that came before it, so cumulative token cost grows roughly quadratically with turn count.
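That replay cost is easy to model with back-of-the-envelope numbers. The per-schema and per-turn figures below are illustrative assumptions, not the measured results above; the point is the shape of the curve, not the values.

```python
TOOL_DEFS = 6 * 110  # six tool schemas, resent on every single call
PER_TURN = 400       # user msg + assistant reply + tool traffic per turn

def cumulative_prompt_tokens(turns: int) -> int:
    """Total prompt tokens sent over a conversation of `turns` turns,
    when every call replays all tool definitions plus the full history."""
    total = 0
    history = 0
    for _ in range(turns):
        total += TOOL_DEFS + history  # this call's prompt
        history += PER_TURN           # the next call replays this turn too
    return total
```

Doubling the turn count more than doubles cumulative cost, because the history term sums 0 + 1 + 2 + ... per-turn payloads: linear in tools, quadratic in turns.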
IMPLICATIONS Real systems often have dozens of tools across domains, multi-turn conversations during incidents, and power users issuing many queries per day. Token costs don’t scale linearly. They compound. This isn’t a prompt-engineering issue. It’s an architectural issue. If you get the architecture wrong, you pay for it on every query.
NEXT STEPS I’m measuring the effects of parallel tool execution, conversation history truncation, semantic routing, structured output constraints, and OpenAI’s new prompt caching (which claims large cost reductions on cache hits). Each of these targets a different part of the token-growth pattern.
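Of those mitigations, history truncation is the simplest to sketch. The message shape below follows the common chat-completions format, and the keep-last-N policy is an assumed strategy for illustration, not one of the measured results.

```python
def truncate_history(messages: list[dict], max_messages: int = 6) -> list[dict]:
    """Keep the system prompt plus only the most recent messages.
    max_messages is an assumed cutoff, not a tuned value."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_messages:]
```

One caveat worth testing: naive truncation like this can orphan a tool response from the assistant message that requested it, which some APIs reject, so a real implementation would need to cut on turn boundaries rather than raw message counts.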
Happy to share those results as I gather them. Curious how others are managing token expansion in multi-turn, multi-tool agents.