Most caching today is done without hints from the application, but some APIs are starting to take hints or explicit controls for keeping the state associated with specific input tokens in memory, so these costs will come down. In essence you don't reprocess the input tokens at inference time: if you own the hardware, it's trivial to infer one output token at a time at no additional cost. If you have 50k input tokens and generate one output token, you don't have to "re-infer" the 50k input tokens before emitting the second one.
To put it simply, the time it takes to generate the millionth output token is (roughly) the same as the first.
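A minimal sketch of that mechanic, using Hugging Face transformers (gpt2 is just a placeholder model): the prompt is prefilled once, and every later step reuses past_key_values instead of reprocessing the input.

    # Prefill the prompt once, then decode one token at a time off the KV cache.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"  # stand-in; any causal LM works the same way
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    input_ids = tok("a very long prompt ...", return_tensors="pt").input_ids

    with torch.no_grad():
        # Prefill: the only pass that touches all input tokens.
        out = model(input_ids, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)

        for _ in range(10):
            # Decode: each step feeds ONE new token plus the cache, so the
            # cost per output token stays flat instead of scaling with the
            # 50k-token prompt.
            out = model(next_id, past_key_values=past, use_cache=True)
            past = out.past_key_values
            next_id = out.logits[:, -1].argmax(dim=-1, keepdim=True)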
This is relevant in an application I'm working on where I check the logprobs and don't always choose the most likely token (for example, by implementing a custom logit_bias mechanism client-side), which requires inferring one output token at a time. This isn't quite possible with most APIs, but if you control the hardware and use (virtually) zero-cost cached tokens, you can do it.
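Roughly, the loop looks like this sketch. It assumes an OpenAI-style completions endpoint that returns top_logprobs; the bias table is hypothetical. With server-side prompt caching, re-sending the growing prompt each step stays cheap.

    # Client-side logit_bias: fetch one token at a time with logprobs,
    # re-rank the top candidates with our own bias, append the pick.
    from openai import OpenAI

    client = OpenAI()
    bias = {" very": -2.0, " the": 0.5}  # hypothetical per-token adjustments

    prompt = "The weather today is"
    for _ in range(20):
        resp = client.completions.create(
            model="gpt-3.5-turbo-instruct",
            prompt=prompt,
            max_tokens=1,
            logprobs=5,  # top-5 candidates for the generated position
        )
        top = resp.choices[0].logprobs.top_logprobs[0]  # {token: logprob}
        # Apply the bias client-side and take the argmax ourselves,
        # instead of trusting the server's greedy choice.
        token = max(top, key=lambda t: top[t] + bias.get(t, 0.0))
        prompt += token  # cached input tokens keep this loop affordable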
So bottom line: cached input tokens are almost free by nature (unless you hold them in memory for a long time); the price of cached-input APIs is probably due to the lack of API negotiation over which inputs you want cached. As APIs and self-hosted solutions evolve, the cost of cached inputs will likely drop to almost zero. With efficient application programming, the only accounting should be for output tokens and system prompts; your output tokens shouldn't be charged again as inputs, at least not more than once.
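For a sense of scale, a back-of-envelope calculation with hypothetical prices ($3/M uncached input, $0.30/M cached, $15/M output) for a 40-turn agent loop:

    # Hypothetical prices, $ per 1M tokens (assumed, not from the article).
    P_IN, P_CACHED, P_OUT = 3.00, 0.30, 15.00

    context = 50_000   # tokens resent on every turn
    output = 500       # tokens generated per turn
    turns = 40

    uncached = turns * (context * P_IN + output * P_OUT) / 1e6
    cached = (context * P_IN                      # prompt processed once
              + (turns - 1) * context * P_CACHED  # then served from cache
              + turns * output * P_OUT) / 1e6
    print(f"uncached: ${uncached:.2f}  cached: ${cached:.2f}")
    # -> uncached: $6.30  cached: $1.04
    # If cached inputs dropped to ~$0, only output tokens (and the system
    # prompt's first pass) would show up in the bill.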
stuxf•1h ago
disagree with this: IMO the primary reason these still need to exist is for when the agent messes up (e.g. reads a file that is too large, like a bundle file), or when you run a grep command in a large codebase and hit way too many files, overloading the context.
Otherwise, lots of interesting stuff in this article! Having a precise calculator was very useful for getting a sense of how much we should be putting into an agent loop to hit a cost optimum (and not just a performance optimum) for our tasks, which is something that's been pretty underserved.