
Tested OpenAI's prompt caching across models. Found undocumented behavior

5•harsharanga•2mo ago

I've been building an AI agent from scratch to understand token economics, and spent a week on prompt caching. I found something interesting that isn't in OpenAI's docs.

Setup: a network device monitoring chatbot with 10 tools and a ~1,400-token prefix. I tested gpt-4o-mini, gpt-5-mini, and gpt-5, and logged cached_tokens from every response.

Finding 1: Caching works as documented

Once the prefix reaches 1,024 tokens, OpenAI caches it automatically. I saw 80-90% cache hit rates after the first call, and a cost reduction of 47-49% on input tokens. The cache discount is 50% for 4o-mini and 90% for the gpt-5 family.
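To make the relationship between hit rate and discount concrete, here is a minimal sketch of how an input-cost reduction could be estimated from one response's usage numbers. It assumes "hit rate" means cached_tokens divided by prompt_tokens and that cached tokens are billed at (1 - discount) of the normal input price. The function name and the sample numbers are hypothetical, not taken from my logs.

```python
def input_cost_reduction(prompt_tokens: int, cached_tokens: int, cache_discount: float) -> float:
    """Fraction of input cost saved on one request.

    Assumption: cached tokens are billed at (1 - cache_discount) of the
    normal input price, and everything else at full price.
    """
    if prompt_tokens <= 0:
        return 0.0
    hit_rate = cached_tokens / prompt_tokens
    return hit_rate * cache_discount


# Hypothetical request: 1,400 prompt tokens, 1,280 of them served from cache,
# with a 50% discount on cached tokens.
reduction = input_cost_reduction(prompt_tokens=1400, cached_tokens=1280, cache_discount=0.5)
print(f"input cost reduction: {reduction:.1%}")
```

The same function can be run over every logged response and averaged to get a per-model figure.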

Finding 2: Tool schema tokenization is heavily compressed

I added 4 tools to my existing 6. Based on JSON size I expected +400-500 tokens. The actual increase was 56 tokens. OpenAI appears to be doing aggressive compression on function schemas before they are counted as prompt tokens.

Finding 3: Cache is shared across model generations (undocumented)

This is the interesting part.

Test:
1. Call gpt-4o-mini first (cold start).
2. Wait 5 seconds.
3. Call gpt-5-mini with an identical prefix.

Result: gpt-5-mini got a cache hit on its first call.

I tested all permutations. Every time, models 2 and 3 hit the cache from model 1's warmup. The prefix-processing cache is shared across 4o-mini, 5-mini, and 5. I couldn't find this documented anywhere.

Why it matters: If you have many cold starts (separate user sessions, different contexts), you can warm the cache with the cheapest model.

Example: 1,000 cold starts/day, 10K-token prefix, primary model gpt-5.

Without cross-model warming: each session pays 10K tokens at $1.25/1M = $0.0125. Daily: $12.50. Annual: $4,562.

With nano warming first: 10K tokens at $0.05/1M = $0.0005 per warmup. Daily: $0.50. Annual: $182.

Savings: $4,380/year. At gpt-5-pro pricing ($15/1M), the difference is $54K+/year on warmup costs alone.
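The arithmetic above can be checked with a short script. This is a sketch using only the prices and volumes quoted in the example. It also computes one line the example leaves out: after warming, the primary model still reads the prefix at the cached rate, which I'm assuming here is the 90% gpt-5 discount from Finding 1. Variable names are mine.

```python
DAYS_PER_YEAR = 365


def daily_cost(sessions: int, prefix_tokens: int, usd_per_million: float) -> float:
    """Daily USD cost of processing the prefix once per session."""
    return sessions * prefix_tokens * usd_per_million / 1_000_000


sessions = 1_000          # cold starts per day
prefix_tokens = 10_000    # shared prefix length
primary_price = 1.25      # gpt-5 input, USD per 1M tokens (from the example)
warm_price = 0.05         # nano input, USD per 1M tokens (from the example)
cache_discount = 0.90     # assumption: gpt-5 family discount from Finding 1

cold = daily_cost(sessions, prefix_tokens, primary_price)
warmup = daily_cost(sessions, prefix_tokens, warm_price)
# Not counted in the example: the discounted cached read on the primary model.
cached_read = daily_cost(sessions, prefix_tokens, primary_price * (1 - cache_discount))

print(f"cold prefix, primary model: ${cold:.2f}/day, ${cold * DAYS_PER_YEAR:,.2f}/year")
print(f"warmup on cheap model:      ${warmup:.2f}/day, ${warmup * DAYS_PER_YEAR:,.2f}/year")
print(f"cached read, primary model: ${cached_read:.2f}/day, ${cached_read * DAYS_PER_YEAR:,.2f}/year")
print(f"savings, warmup only:       ${(cold - warmup) * DAYS_PER_YEAR:,.2f}/year")
print(f"savings, incl. cached read: ${(cold - warmup - cached_read) * DAYS_PER_YEAR:,.2f}/year")
```

If the cached read is billed as assumed, the net saving is somewhat smaller than the warmup-only figure, but the direction of the result does not change.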

Technical note: My read is that this is prefix-processing cache sharing, not KV-cache sharing. I can't see the internals, but the models would be sharing tokenization and prefix hashing, not attention states. Billing-wise, though, cached tokens are cached tokens.

Reproduction:
1. Create a 1024+ token prefix.
2. Call model A and log cached_tokens.
3. Call model B with the same prefix.
4. Check whether B's first call shows cached tokens.

The field is in response.usage.prompt_tokens_details.cached_tokens. Happy to share test scripts.
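For anyone who wants to try it before I post the full scripts, the steps above can be sketched roughly as follows. This is not my actual test script. The helper names (cached_tokens, cross_model_probe, make_fake_call) are made up for illustration, and the fake caller exists only so the sketch runs offline. openai_call uses the official Python SDK's chat.completions.create and expects OPENAI_API_KEY to be set.

```python
import time
from types import SimpleNamespace


def cached_tokens(response) -> int:
    """Read response.usage.prompt_tokens_details.cached_tokens, defaulting to 0."""
    usage = getattr(response, "usage", None)
    details = getattr(usage, "prompt_tokens_details", None)
    return getattr(details, "cached_tokens", 0) or 0


def cross_model_probe(call_model, warm_model, probe_model, messages, wait_seconds=5.0):
    """Call warm_model, wait, then call probe_model with the same messages."""
    warm = call_model(warm_model, messages)
    time.sleep(wait_seconds)
    probe = call_model(probe_model, messages)
    return {
        "warm_model": warm_model,
        "warm_cached": cached_tokens(warm),
        "probe_model": probe_model,
        "probe_cached": cached_tokens(probe),  # > 0 means a hit on B's first call
    }


def openai_call(model, messages):
    """Real API call; imported lazily so the offline dry run needs no SDK."""
    from openai import OpenAI
    client = OpenAI()
    return client.chat.completions.create(model=model, messages=messages)


def make_fake_call():
    """Offline stand-in: reports 0 cached tokens on the first call, 1024 after."""
    seen = []

    def fake_call(model, messages):
        cached = 1024 if seen else 0
        seen.append(model)
        details = SimpleNamespace(cached_tokens=cached)
        return SimpleNamespace(usage=SimpleNamespace(prompt_tokens_details=details))

    return fake_call


# Dry run with the fake caller; swap in openai_call and a real 1024+ token
# prefix to run the actual experiment.
print(cross_model_probe(make_fake_call(), "gpt-4o-mini", "gpt-5-mini", [], wait_seconds=0))
```

For the real run, use a prefix that has not been sent before (for example, one that starts with a fresh random string), otherwise model A's call may already be warm and the comparison tells you nothing.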
