It's interesting how architectural patterns built at large tech companies (for completely different use-cases than AI) have become so relevant to the AI execution space.
You see a lot of AI startups learning the hard way the value of event sourcing and (eventually) durable execution, but these patterns aren't commonly adopted on Day 1. I blame the AI frameworks.
(disclaimer - currently working on a durable execution platform)
i just broke Claude Code Research Preview, and i've crashed ChatGPT 4.5 Pro Deep Research. and i have the receipts :), so i'm looking for tools that work
A lot of the time the dashboard content doesn't actually matter anyway, it just needs to look pretty...
On a serious note, the systems being built now will eventually be "correct enough most of the time" and that will be good enough (read: cheaper than doing it any other way).
I don’t believe this would work. File a “good enough” tax return one year and enjoy a hefty fine 5 years later. Or constantly deal with customers not understanding why one amount is in the dashboard and another is in their warehouse.
The probability of error increases rapidly when you start layering one probabilistic component onto another. Four 99%-reliable components sequenced one after another have a combined error rate of roughly 4%.
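Quick sanity check of that number, assuming the failures are independent:

    # four independent 99%-reliable steps in sequence
    p_fail = 1 - 0.99 ** 4
    print(f"{p_fail:.1%}")   # 3.9%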
Of course that still doesn't mean you should do that. If you want to maximize the model's performance, offload as much distracting stuff as possible to the code.
And it's not like it changes every day. KPIs etc. stay the same for months. And then you can easily update it in an hour.
So what exactly does llm solve here?
Is this true for all tool calls? Even if the tool returns little data?
Also looking into "fire and forget" tools, to see if that is even possible.
Use grep and edit lines and sequences of lines instead of full files.
This way you can edit files with 50k LOC without issue, while Claude will blow up if you ever try to write out such a file.
Edits happen in the FS. Then the next agent/tool reads them directly.
The issue is the workflow here: you push everything through the model and combine tools you control with tools you don't control (Claude Artifacts). By default I disable EVERYTHING from Claude and use the filesystem. With that I have git diff to check the changes, and can, as I said, do granular changes and edits.
As I said, the issue is in your workflow.
I have web page fetching.
But web page fetching brings a lot of JS / noise.
So for the fetched page I instead use https://jina.ai/reader/. I get Markdown. But is this enough? No, there are still a lot of links and stuff, so I pipe it through another pass that strips things like URLs I usually don't need, since the focus here is on content.
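Roughly this pipeline (a sketch; the r.jina.ai prefix is how the Reader endpoint is typically called, and the regexes are just illustrative):

    import re
    import requests

    def fetch_clean_markdown(url: str) -> str:
        # Jina Reader returns the page as Markdown instead of raw HTML/JS
        md = requests.get("https://r.jina.ai/" + url, timeout=30).text
        # strip inline links but keep the anchor text: [text](http...) -> text
        md = re.sub(r"\[([^\]]*)\]\(http[^)]*\)", r"\1", md)
        # drop bare URLs entirely, since only the content matters here
        md = re.sub(r"https?://\S+", "", md)
        return md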
Most MCPs just replicate the API, returning blobs of data.
1. This uses a lot of input context on JSON formatting and on escaping JSON inside what is already JSON. 2. It contains a lot of irrelevant information that you could save on.
So the issue is the MCP tool. It should instead flatten the data as much as possible, since it's going back through JSON encoding again, and if needed remove some fields.
So MCP SaaS offerings here are mainly API gateways.
That brings this noise! And most of all, they are not optimizing the MCPs.
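What I mean by flattening, roughly (sketch; the example fields are invented):

    def flatten(obj, prefix=""):
        """Turn nested API JSON into flat "path: value" lines so the model
        doesn't pay tokens for braces, quotes and JSON-inside-JSON escaping."""
        if isinstance(obj, dict):
            lines = []
            for k, v in obj.items():
                lines.extend(flatten(v, f"{prefix}{k}."))
            return lines
        if isinstance(obj, list):
            lines = []
            for i, v in enumerate(obj):
                lines.extend(flatten(v, f"{prefix}{i}."))
            return lines
        return [f"{prefix.rstrip('.')}: {obj}"]

    # drop fields the agent never needs before flattening, e.g.:
    # resp.pop("metadata", None)
    # text = "\n".join(flatten(resp))
    # -> "customer.name: Jane", "customer.email: jane@example.com", "status: active"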
Isn't it a model problem that they don't respect complex JSON schemas?
In that scenario, running code on the data with minimal evaluation of the data by the model (e.g. just a schema with an explanation) is a much better approach, and it will scale up to use cases of a certain complexity.
Even this system is not perfect: once your data definitions and orchestration grow too big, you'll face the same problems.
This should allow you to scale to pretty complex problems though, while the naive approach of just embedding API responses in the chat fails quickly (I run into this issue frequently, maintaining a relatively simple system with a few tool calls).
The only proper solution is reproducing the level of granularity of human decisions in code and calling this "decisional system" from an LLM (which would then be reduced to a mere language interface between human language and the internal system). Easier said than done, though.
There's nothing stopping your endpoints from returning data in some other format. LLMs actually seem to excel with XML for instance. But you could just use a template to define some narrative text.
XML seems more text-heavy, more tokens. However, maybe the extra context helps?
But it's also evident to anyone who has used these models. It's also not unique to OpenAI; this bias is prevalent in every model I've ever tested, from GPT-3 to the latest offerings from every single frontier model provider.
As to why I would guess it's because XML bakes semantic meaning into the tags it uses so it's easier for the model to understand the structure of the data. <employee>...</employee> is a lot easier to understand than { "employee": { ... }}.
I would guess that the models are largely ignoring the angular brackets and just focusing on the words which have unique tokens and thus are easier to pair up than the curly braces that are the same throughout JSON. Just speculation on my part though.
And this only applies to the input. Earlier models struggled to reliably output JSON so they've been both fine-tuned and wrapped in specific formatters that reliably force clean JSON outputs.
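To make the comparison concrete, here is the same record in the three shapes being discussed (purely illustrative data):

    record = {"employee": {"name": "Jane Doe", "role": "SRE", "tenure_years": 4}}

    as_json = '{"employee": {"name": "Jane Doe", "role": "SRE", "tenure_years": 4}}'

    as_xml = (
        "<employee>"
        "<name>Jane Doe</name>"
        "<role>SRE</role>"
        "<tenure_years>4</tenure_years>"
        "</employee>"
    )

    # or just a narrative template, often the cheapest for the model to read
    as_text = "{name} is an {role} and has been here {tenure_years} years.".format(
        **record["employee"]
    )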
    def process(param1, param2):
        # hypothetical MCP-provided helpers: fetch the data, then sort it on the given key
        my_data = mcp_get_data(param1)
        sorted_data = mcp_sort(my_data, by=param2)
        return sorted_data
No one has found anything revolutionary yet, but there are some useful applications to be sure.
At this point, it _really_ seems like a desperate, near-the-end-of-the-runway solution in search of a problem.
I really did not even want this, it just happened.
If your tools are calling APIs on behalf of users, it's better to use OAuth flows so users of the app can give explicit consent to the APIs/scopes they want the tools to access. That way, tools use scoped tokens to make calls instead of hard-to-manage, hard-to-maintain API keys (or even client credentials).
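In code the difference looks something like this (sketch of a standard authorization-code exchange; the endpoint URLs and scope names are placeholders):

    import requests

    # 1. the user consents in the browser to exactly these scopes
    AUTH_URL = ("https://auth.example.com/authorize"
                "?client_id=my-agent-app&scope=read:invoices&response_type=code")

    # 2. the tool exchanges the returned code for a short-lived, scoped token
    def exchange_code(code: str) -> str:
        resp = requests.post("https://auth.example.com/token", data={
            "grant_type": "authorization_code",
            "code": code,
            "client_id": "my-agent-app",
            "redirect_uri": "https://myapp.example.com/callback",
        })
        return resp.json()["access_token"]

    # 3. tool calls carry the user's scoped token, not a long-lived API key
    def list_invoices(token: str):
        return requests.get(
            "https://api.example.com/invoices",
            headers={"Authorization": f"Bearer {token}"},
        ).json()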
Slight tangent, but as a long term user of OpenAI models, I was surprised at how well Claude Sonnet 3.7 through the desktop app handles multi-hop problem solving using tools (over MCP). As long as tool descriptions are good, it’s quite capable of chaining and “lateral thinking” without any customisation of the system or user prompts.
For those of you using Sonnet over API: is this behaviour similar there out of the box? If not, does simply pasting the recently exfiltrated[1] “agentic” prompt into the API system prompt get you (most of the way) there?
This would be more reliable than expecting the LLM to generate working code 100% of the time?
Slightly off-topic, on the topic of nesting containers: Have you run gVisor successfully in such a setup? I seem to remember using gVisor to run the child container is not that easy and gVisor might still need some syscalls(?) that the parent container might not allow. I might be misremembering, though.
MCP is literally just a wrapper around an API call, but because it has some LLM buzz sprinkled on top, people expect it to do some magic, when they wouldn't expect the same magic from the underlying API.
Explain how I would do this without an LLM:
https://blog.nawaz.org/posts/2025/May/gemini-figured-out-my-...
Put another way, now that I have the MCP in place, I no longer need to write any programs to do these tasks (unless you consider my prompt to be the program).
I used it to find the name of a different person's kid. And it used a different set of queries than the one I sent you. How are you going to encode all possibilities merely by using an API?
I use the exact same MCP tool to summarize emails, get me links from emails etc. You want me to write a program for each use case when I can do it all with just one program?
Or, if I gave the LLM a list of my users and asked it to filter based on some criteria, the grammar would change to only output user IDs that existed in my list.
I don't know how useful this would be in practice, but at least it would make it impossible for the LLM to hallucinate for these cases.
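e.g. a constraint built from the actual data, rather than hoping the model behaves (sketch; how you feed the schema to the model depends on the provider's structured-output or grammar feature):

    users = [{"id": "u_123", "name": "Ada"}, {"id": "u_456", "name": "Grace"}]

    # schema whose enum is generated from the real user list, so the model
    # physically cannot emit an ID that doesn't exist
    filter_result_schema = {
        "type": "object",
        "properties": {
            "matching_user_ids": {
                "type": "array",
                "items": {"type": "string", "enum": [u["id"] for u in users]},
            }
        },
        "required": ["matching_user_ids"],
    }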
The issue right now is that both (1) function calling and (2) codegen just aren't really very good. The hype train far exceeds capabilities. Giving great demos like fetching some Stripe customers, generating an email or getting the weather work flawlessly. But anything more sophisticated goes off the rails very quickly. It's difficult to get models to reliably call functions with the right parameters, to set up multi-step workflows and more.
Add codegen into the mix and it's hairier. You need a deployment and testing apparatus to make sure the code actually works... and then what is it doing? Does it need secret keys to make web requests to other services? Should we rely on functions for those?
The price / performance curve is a consideration, too. Good models are slow and expensive. Which means their utility has to be higher in order to charge a customer to pay for the costs, but they also take a lot longer to respond to requests which reduces perception of value. Codegen is even slower in this case. So there's a lot of alpha in finding the right "mixture of models" that can plan and execute functions quickly and accurately.
For example, OpenAI's GPT-4.1-nano is the fastest function calling model on the market. But it routinely tries to execute the same function twice in parallel. So if you combine it with another fast model, like Gemini Flash, you can reduce error rates - e.g. 4.1-nano does planning, Flash executes. But this is non-obvious to anybody building these systems until they've tried and failed countless times.
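The split looks roughly like this (sketch; the stub functions stand in for whichever SDKs you actually use, with the model assignments just being the ones mentioned above):

    from dataclasses import dataclass

    @dataclass
    class ToolCall:
        name: str
        arguments: dict

    # hypothetical thin wrappers around the two model APIs
    def plan_with_small_model(task: str) -> list[str]: ...   # e.g. 4.1-nano as planner
    def pick_tool_call(step: str) -> ToolCall: ...           # e.g. Flash as executor

    def run_task(task: str, tools: dict):
        results = []
        for step in plan_with_small_model(task):
            # one call per step keeps the fast model from firing the same
            # function twice in parallel
            call = pick_tool_call(step)
            results.append(tools[call.name](**call.arguments))
        return results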
I hope to see capabilities improve and costs and latency trend downwards, but what you're suggesting isn't quite feasible yet. That said I (and many others) are interested in making it happen!
Rather than asking an agent to internalize its algorithm, you should teach it an API and then ask it to design an algorithm which you can run in user space. There are very few situations where I think it makes sense (for cost or accuracy) for an LLM to internalize its algorithm. It's like asking an engineer to step through a function in their head instead of just running it.
Also, since you're implicitly questioning OP's claim to have been saying this all along, here's a comment from September 2023 where they first said the same quote and said they'd been building agents for 3 months by that point [1]. That's close enough to 2 years in my book.
[0] https://hn.algolia.com/?dateEnd=1685491200&dateRange=custom&...
So in concrete terms I'm imagining:
1. Create a prompt that gives the complete API specification and some general guidance about what role the agent will have.
2. In that prompt ask it to write a function that can be concisely used by the agent, written to be consumed by the agent and from the agent's perspective. The body of that function translates the agent-oriented function definition into an API call (see the sketch after this list).
3. Now the agent can use these modified versions of the API that expose only what's really important from its perspective.
4. But there's no reason APIs and functions have to map 1:1. You can wrap multiple APIs in one function, or break things up however makes the most sense.
5. Now the API-consuming agent is just writing library routines for other agents, and creating a custom environment for those agents.
6. This is all really starting to look like a team of programmers building a platform.
7. You could design the whole thing top-down as well, speculating then creating the functions the agents will likely want, and using whatever capabilities you have to implement those functions. The API calls are just an available set of functionality.
And really you could have multiple APIs being used in one function call, and any number of ways to rephrase the raw capabilities as more targeted and specific capabilities.
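Concretely, step 2 might look something like this (sketch; the endpoint and field names are invented):

    import requests

    API_BASE = "https://api.example.com"  # the raw API the wrapper hides

    def find_overdue_invoices(customer_email: str, min_days_overdue: int = 30) -> list[dict]:
        """Agent-facing function: named and parameterised the way the agent
        thinks about the task, not the way the API happens to be shaped."""
        cust = requests.get(f"{API_BASE}/customers", params={"email": customer_email}).json()[0]
        invoices = requests.get(f"{API_BASE}/customers/{cust['id']}/invoices").json()
        # collapse two API calls plus the filtering the agent would otherwise
        # have to reason about into one concise, purpose-built routine
        return [i for i in invoices if i["days_overdue"] >= min_days_overdue]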
I agree function composition and structured data are essential for keeping complexity in check. In our experience, well-defined structured outputs are the real scalability lever in tool calling. Typed schemas keep both cognitive load and system complexity manageable. We rely on deterministic behavior wherever possible, and reserve LLM processing for cases where schema-less data or ambiguity is involved. Its a great tool for mapping fuzzy user requests to a more structured deterministic system.
That said, finding the right balance between taking complexity out of high entropy input or introducing complexity through chained tool calling is a tradeoff and balance that needs to be struck carefully. In real-world commerce settings, you rarely get away with just one approach. Structured outputs are great until you hit ambiguous intents—then things get messy and you need fallback strategies.
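A minimal version of that pattern (sketch; the schema and the stub functions are placeholders for whatever structured-output mechanism and UI you actually use):

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class OrderIntent:
        sku: str
        quantity: int
        clarification_needed: Optional[str] = None   # set when the request is ambiguous

    # placeholders for the LLM structured-output call and the user-facing fallback
    def classify_intent(text: str) -> OrderIntent: ...
    def ask_user(question: str) -> OrderIntent: ...

    def handle_request(text: str) -> OrderIntent:
        intent = classify_intent(text)           # LLM maps fuzzy text to the typed schema
        if intent.clarification_needed:
            # fallback path: ambiguous input goes back to the user instead of
            # being guessed at by the deterministic downstream system
            return ask_user(intent.clarification_needed)
        return intent                            # deterministic system takes over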
It’s not well thought out. I’ve been building one with the new auth spec and their official code and tooling is really lacking.
It could have been so much simpler and more straightforward by now.
Instead you have 3 different server types, and one (SSE) is already deprecated. It's almost funny.
Think of extracting parts of an email subject. An LLM is great at going through unseen subject lines and telling us what can be extracted. We ask the LLM what it found, and where. For things like dates, times, city, country, etc., we can then deterministically re-run the extraction on new strings.
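i.e. the LLM runs once to propose the pattern, and after that it's plain code on every new subject line (sketch; the regex here is the kind of thing the model would hand back, not anything it actually produced):

    import re

    # pattern the LLM proposed after looking at a batch of unseen subject lines
    DATE_IN_SUBJECT = re.compile(
        r"\b(\d{1,2} (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]* \d{4})\b"
    )

    def extract_date(subject: str):
        m = DATE_IN_SUBJECT.search(subject)
        return m.group(1) if m else None

    # deterministic from here on: no model call per email
    extract_date("Re: Invoice due 3 March 2025")  # -> "3 March 2025"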
avereveard•1mo ago
I guess one could in principle wrap the entire execution block in a distributed transaction, but LLMs try to make code that is robust, which works against this pattern because it makes failures hard to understand.
jngiam1•1mo ago
For example, when the code execution fails mid-way, we really want the model to be able to pick up from where it failed (with the states of the variables at the time of failure) and be able to continue from there.
We've found that the LLM is able to generate correct code that picks up gracefully. The hard part now is building the runtime that makes that possible; we have something that works pretty well in many cases now in production at Lutra.
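One way to get that behaviour is to checkpoint variable state after each step, so a retry resumes instead of rerunning everything (a very rough sketch, not how Lutra actually implements it):

    import os
    import pickle

    CHECKPOINT = "run_state.pkl"

    def run_steps(steps, state=None):
        # reload variables from the last successful step, if any
        if state is None:
            state = pickle.load(open(CHECKPOINT, "rb")) if os.path.exists(CHECKPOINT) else {}
        for name, fn in steps:
            if name in state:
                continue                  # already done in a previous attempt
            state[name] = fn(state)       # may raise; earlier results are preserved
            with open(CHECKPOINT, "wb") as f:
                pickle.dump(state, f)     # persist the variables at this point
        return state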
avereveard•1mo ago
Latency, though, would be unbearable for real time.