
Ask HN: How are you scaling AI agents reliably in production?

7•nivedit-jain•5mo ago
I’m looking to learn from people running agents beyond demos. If you have a production setup, would you share what works and what broke?

What I’m most curious about:

- Orchestrator choice and why: LangGraph, Temporal, Airflow, Prefect, custom queues.

- State and checkpointing: where do you persist steps, how do you replay, how do you handle schema changes.

- Concurrency control: parallel tool calls, backpressure, timeouts, idempotency for retries.

- Autoscaling and cost: policies that kept latency and spend sane, spot vs on-demand, GPU sharing.

- Memory and retrieval: vector DB vs KV store, eviction policies, preventing stale context.

- Observability: tracing, metrics, evals that actually predicted incidents.

- Safety and isolation: sandboxing tools, rate limits, abuse filters, PII handling.

- A war story: the incident that taught you a lesson and the fix.

Context (so it’s not a drive-by): small team, Python, k8s, MongoDB for state, Redis for queues, everything custom, experimenting with LangGraph and Temporal. Happy to share configs and trade notes in the comments.

Answer any subset. Even a quick sketch of your stack and one gotcha would help others reading this. Thanks!
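On the idempotency-for-retries point, here is a minimal sketch of the pattern in Python. A plain dict stands in for Redis, and `idempotency_key`/`run_once` are hypothetical names, not from any framework; in production the `store[key] = result` line would be an atomic SETNX-style write.

```python
import hashlib
import json

def idempotency_key(agent_id: str, step: str, payload: dict) -> str:
    """Derive a stable key so a retried step maps to the same record."""
    blob = json.dumps(payload, sort_keys=True)
    return f"{agent_id}:{step}:{hashlib.sha256(blob.encode()).hexdigest()[:16]}"

def run_once(store: dict, key: str, fn):
    """Execute fn at most once per key; retries return the cached result."""
    if key in store:
        return store[key]   # retry path: no side effects re-run
    result = fn()
    store[key] = result     # in Redis this would be an atomic SETNX
    return result

store = {}
key = idempotency_key("agent-1", "fetch_orders", {"day": "2026-02-06"})
first = run_once(store, key, lambda: "fetched")
second = run_once(store, key, lambda: "fetched-again")  # fn is not re-executed
```

The key is derived from the step's inputs, so a retry of the same work lands on the same record while a genuinely new payload gets a new key.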

Comments

prohobo•5mo ago
I'm using LangGraph for my app, which is an AI ecommerce analyst with multiple modes (report builder and chatbot). It consumes API data and visitor sessions to build a giant report, then compresses it back down to actionable insights for online store owners. The report runs for each customer once a day, queued up with BullMQ.

It's not super complex; in fact, that seems to be the only way to get a more or less reliable agent right now. Keep the graph small, the prompts concise, the nodes and tools atomic in function, etc.

* Orchestrator choice and why: LangGraph because it seems the most robust and well established from my research at the time (about 6 months ago). It has decent documentation, and includes community-built graphs and nodes. People complain a lot about LangChain, but the general vibe around LangGraph is that it's a maturely designed framework.

* State and checkpointing: I'm using a memory checkpointer after every state change. Why? Reports can just re-run at negligible cost. For chats, my users just don't need persistent thread storage. Persistence is better managed through RAG entries.
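An in-memory checkpointer of that shape can be sketched in a few lines of stdlib Python (the node names are hypothetical). The point is that it's cheap to lose: if the process dies, the report simply re-runs.

```python
class MemoryCheckpointer:
    """Snapshot graph state after every state change, in process memory only."""

    def __init__(self):
        self.snapshots = []

    def save(self, node: str, state: dict):
        # copy the state so later mutation doesn't rewrite history
        self.snapshots.append((node, dict(state)))

    def latest(self):
        return self.snapshots[-1] if self.snapshots else None

cp = MemoryCheckpointer()
state = {"report": None}
state["report"] = "draft"
cp.save("build_report", state)
state["report"] = "final"
cp.save("compress", state)
```

Swapping this for a database-backed checkpointer later only means changing where `save` writes, which is why starting in-memory is a reasonable default when re-runs are cheap.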

* Concurrency control: I don't use parallel tool calling for most of my agents because it adds too much instability to graph execution. This is actually fine for chatbots and my app's reporting system (which doesn't need many tools), but I can see this being an issue for more complex agents.
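Running tools one at a time with a per-call timeout keeps execution deterministic at the cost of latency. A stdlib sketch (the tool names are hypothetical; note that cancellation here is cooperative, so a truly hung tool still occupies the worker thread):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def run_tools_sequentially(tools, timeout_s=5.0):
    """Execute tool calls one at a time; a slow tool is cut off, not the whole run."""
    results = {}
    with ThreadPoolExecutor(max_workers=1) as pool:
        for name, fn in tools:
            future = pool.submit(fn)
            try:
                results[name] = future.result(timeout=timeout_s)
            except FutureTimeout:
                results[name] = None  # record the failure and keep going
    return results

results = run_tools_sequentially([
    ("get_sales", lambda: {"total": 1200}),
    ("get_traffic", lambda: {"visits": 340}),
])
```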

* Autoscaling and cost: Well I use foundation models, not local ones. I swap out models for various tasks and customer subscription levels (e.g., gpt-5-nano with low reasoning effort for trial users, and gpt-5-mini for paying customers).
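Model routing by subscription tier is usually just a lookup table with a safe default. A sketch (the tier names and the exact model/effort pairs here are stand-ins; the comment above only names gpt-5-nano for trial users and gpt-5-mini for paying ones):

```python
# Hypothetical mapping of subscription tier -> (model, reasoning effort).
MODEL_BY_TIER = {
    "trial": ("gpt-5-nano", "low"),
    "paid": ("gpt-5-mini", "medium"),
}

def pick_model(tier: str):
    """Route a request to a model by subscription level, defaulting to cheapest."""
    return MODEL_BY_TIER.get(tier, MODEL_BY_TIER["trial"])

model, effort = pick_model("paid")
```

Keeping the table in config rather than code makes it easy to re-tier customers or swap models without a deploy.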

* Memory and retrieval: Vector DB for RAG tooling, normal DB for everything else. Sometimes I use the same Postgres database for both vector and normal data, to simplify architecture. I load raw contextual data into prompts (JSON dump). In my app's case, I use a 30-day rolling window of store data so I never keep anything longer than 30 days. I instead keep distilled information as permanent context, which I let the AI control the lifecycle of (create, update, delete).
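The 30-day rolling window amounts to age-based eviction on the raw data only; the distilled insights live in a separate collection and are never evicted by age. A minimal sketch (field names hypothetical):

```python
from datetime import date, timedelta

def evict_old(records, today, window_days=30):
    """Keep only raw store data inside the rolling window.

    Distilled insights are stored elsewhere and managed by the AI's own
    create/update/delete lifecycle, so they never pass through this filter.
    """
    cutoff = today - timedelta(days=window_days)
    return [r for r in records if r["day"] >= cutoff]

today = date(2026, 2, 6)
records = [
    {"day": today - timedelta(days=5), "sales": 100},
    {"day": today - timedelta(days=45), "sales": 80},  # outside the window
]
kept = evict_old(records, today)
```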

* Observability: The only thing I would use evals for is prompts, but I haven't found a good tool for that yet. I use sentiment analysis for chats the AI deems "interesting", just to see if people are complaining about something.

* Safety and isolation: For reports, I filter out PII before giving data to the AI. For chats, memory checkpointing makes threads ephemeral anyway - and I just add a rate limit + message length limit. The sentiment analysis doesn't include their original messages, only a thematic summary by the AI.
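Filtering PII before the data reaches the model can start as simple pattern substitution. A sketch with very rough regexes for illustration only; real PII filtering needs more than two patterns (names, addresses, IDs, locale-specific formats):

```python
import re

# Deliberately crude patterns: good enough to illustrate the shape, not the coverage.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def scrub_pii(text: str) -> str:
    """Replace obvious emails and phone numbers before text reaches the model."""
    text = EMAIL.sub("[email]", text)
    text = PHONE.sub("[phone]", text)
    return text

clean = scrub_pii("Contact jane@example.com or +1 (555) 123-4567 about order 99")
```

Running the scrub at the data-loading boundary (rather than inside each prompt template) means no node can forget to apply it.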

* A war story: I spent weeks trying to fine-tune a prompt for the reporting agent, in which one node was tasked with A) analyzing multiple 30-day ecommerce reports, B) generating findings, C) comparing the findings to existing insights and mutating them, and finally D) creating short and punchy copy for new insights (title, description). I rewrote it like 100 times, and every time I ran it, it would screw up in a new way, or in a way that had last occurred 5 revisions ago. Sometimes it would work perfectly; then the next time it ran it would screw up again, with the same data and temperature set to 0.

This, honestly, is the main problem with modern AI. My solution was to decompose the node into 4 separate ones that each handle a single task - and they still manage to screw it up quite often. It's much better, but not 100% reliable.
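The fix described above, one overloaded node split into four single-task steps, looks roughly like this. The step functions here are hypothetical stand-ins for individual LLM calls; the point is the shape, each step taking the previous step's output and doing exactly one job:

```python
def analyze(reports):
    """Step A: one job only -- pull raw observations out of the reports."""
    return [f"observation from {r}" for r in reports]

def generate_findings(observations):
    """Step B: turn observations into findings."""
    return [o.replace("observation", "finding") for o in observations]

def merge_with_existing(findings, existing):
    """Step C: compare against existing insights; keep only genuinely new ones."""
    return [f for f in findings if f not in existing]

def write_copy(insights):
    """Step D: short, punchy title/description per new insight."""
    return [{"title": i.title(), "description": i} for i in insights]

existing = ["finding from report-jan"]
out = write_copy(
    merge_with_existing(
        generate_findings(analyze(["report-jan", "report-feb"])),
        existing,
    )
)
```

With real LLM calls behind each function, each step can also be retried, checkpointed, and evaluated independently, which is much harder when all four concerns share one prompt.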

nivedit-jain•5mo ago
Thanks for sharing this, truly inspiring. A few questions: (1) What do you like most about LangGraph? Have you tried platforms like AutoGen? (2) Why BullMQ with Node rather than a solution like Temporal? (3) I didn't quite follow your use of the memory checkpointer: if things can re-run at negligible cost, why do we need it at all? (4) For the sentiment analysis of chats, are you using batch inference? Perhaps a loop that keeps "interesting" chats ready for review? (5) Does the 30-day analysis happen in parallel or in a sequential loop? Why not use something like Airflow for it?
Thanks for sharing this, truly inspiring. A few questions: (1) What do you like the most about langgraph, have you tried platforms like autogen? (2) Why using BullMQ with node, why not a solution like Temporal? (3) I didn't got you usecase regarding memory check pointer? if things can re-run at negligible cost why do we need it? (4) For sentimental analysis for chats are you using batch inferencing? Probably a loop keeping ready "interesting" chats for review (5) this 30 days analysis is it happening parallelly or is it a sequential loop? why not using something like Airflow for this?