It's interesting how architectural patterns built at large tech companies (for completely different use-cases than AI) have become so relevant to the AI execution space.
You see a lot of AI startups learning the hard way the value of event sourcing and (eventually) durable execution, but these patterns aren't commonly adopted on Day 1. I blame the AI frameworks.
(disclaimer - currently working on a durable execution platform)
I just broke Claude Code Research Preview, and I've crashed ChatGPT 4.5 Pro Deep Research, and I have the receipts :). So I'm looking for tools that work.
A lot of the time the dashboard's contents don't actually matter anyway; it just needs to look pretty...
On a serious note, the systems being built now will eventually be "correct enough most of the time" and that will be good enough (read: cheaper than doing it any other way).
Of course that still doesn't mean that you should do that. If you want to maximize the model's performance, offload as much distracting stuff as possible to the code.
Is this true for all tool calls? Even if the tool returns little data?
I'm also looking into "fire and forget" tools, to see if that is even possible.
Use grep & edit lines. and sequences instead of full files.
This way you can edit files with 50kl loc without issue while Claude will blow out if you ever try to write such file.
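A minimal sketch of what that tool surface could look like (function names and the line-range convention here are hypothetical, not any particular framework's API):

import re
from pathlib import Path

def grep_file(path: str, pattern: str, context: int = 2) -> list[str]:
    """Return matching lines (with line numbers) plus a little surrounding context."""
    lines = Path(path).read_text().splitlines()
    hits = []
    for i, line in enumerate(lines):
        if re.search(pattern, line):
            lo, hi = max(0, i - context), min(len(lines), i + context + 1)
            hits.extend(f"{n + 1}: {lines[n]}" for n in range(lo, hi))
    return hits

def edit_lines(path: str, start: int, end: int, replacement: str) -> None:
    """Replace lines start..end (1-indexed, inclusive) instead of rewriting the whole file."""
    p = Path(path)
    lines = p.read_text().splitlines()
    lines[start - 1:end] = replacement.splitlines()
    p.write_text("\n".join(lines) + "\n")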
Most MCPs just replicate an API, returning blobs of data.
1. This burns a lot of input context on formatting: JSON escaped inside JSON that is itself inside JSON. 2. It contains a lot of irrelevant information that you could save on.
So the issue is the MCP tool. It should flatten the data as much as possible, since it goes back through JSON encoding again, and drop fields that aren't needed.
So MCP SaaS offerings here are mainly API gateways.
That brings this noise! And most of all, they are not optimizing their MCPs.
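As a sketch of that idea, the tool can flatten and prune the raw API response before it ever reaches the model (the field names below are made up for illustration):

def flatten(obj, prefix="", keep=None):
    """Flatten nested JSON into dotted keys, dropping fields the model doesn't need."""
    flat = {}
    if isinstance(obj, dict):
        for k, v in obj.items():
            flat.update(flatten(v, f"{prefix}{k}.", keep))
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            flat.update(flatten(v, f"{prefix}{i}.", keep))
    else:
        key = prefix.rstrip(".")
        if keep is None or key.split(".")[-1] in keep:
            flat[key] = obj
    return flat

# Hypothetical raw API blob -> compact, relevant-fields-only lines for the model
raw = {"customer": {"id": "c_1", "email": "a@b.co", "metadata": {"internal_ref": "x9"}}}
compact = flatten(raw, keep={"id", "email"})
print("\n".join(f"{k}: {v}" for k, v in compact.items()))
# customer.id: c_1
# customer.email: a@b.co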
Isn't it a model problem that they don't respect complex JSON schemas?
There's nothing stopping your endpoints from returning data in some other format. LLMs actually seem to excel with XML for instance. But you could just use a template to define some narrative text.
XML seems more text heavy, more tokens. However, maybe more context helps?
But it's also evident to anyone who has used these models. It's also not unique to OpenAI; this bias is prevalent in every model I've ever tested, from GPT-3 to the latest offerings from every single frontier model provider.
As to why I would guess it's because XML bakes semantic meaning into the tags it uses so it's easier for the model to understand the structure of the data. <employee>...</employee> is a lot easier to understand than { "employee": { ... }}.
I would guess that the models are largely ignoring the angular brackets and just focusing on the words which have unique tokens and thus are easier to pair up than the curly braces that are the same throughout JSON. Just speculation on my part though.
And this only applies to the input. Earlier models struggled to reliably output JSON so they've been both fine-tuned and wrapped in specific formatters that reliably force clean JSON outputs.
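Purely to illustrate the shape difference, here's a tiny helper (hypothetical, not any library's API) that renders a record as tagged text for a prompt instead of JSON:

def to_tagged(name, value, indent=0):
    """Render a dict as simple XML-style tags for inclusion in a prompt."""
    pad = "  " * indent
    if isinstance(value, dict):
        inner = "\n".join(to_tagged(k, v, indent + 1) for k, v in value.items())
        return f"{pad}<{name}>\n{inner}\n{pad}</{name}>"
    return f"{pad}<{name}>{value}</{name}>"

employee = {"name": "Ada", "role": "engineer"}
print(to_tagged("employee", employee))
# <employee>
#   <name>Ada</name>
#   <role>engineer</role>
# </employee>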
def process(param1, param2):
    # Fetch data with one MCP tool, then hand it to another to sort.
    my_data = mcp_get_data(param1)
    sorted_data = mcp_sort(my_data, by=param2)
    return sorted_data
No one has found anything revolutionary yet, but there are some useful applications to be sure.
If your tools are calling APIs on behalf of users, it's better to use OAuth flows so users of the app can give explicit consent to the APIs/scopes they want the tools to access. That way, tools use scoped tokens to make calls instead of hard-to-manage, hard-to-maintain API keys (or even client credentials).
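A rough sketch of the shape of this (the endpoint and scope are placeholders, not a specific vendor's API); the point is that the tool receives a per-user, consented access token rather than a long-lived key:

import requests

def list_calendar_events(user_access_token: str) -> list[dict]:
    """Tool call made with the user's scoped OAuth token, not an app-wide API key.

    The token comes from a standard OAuth authorization-code flow where the user
    consented to, e.g., a read-only calendar scope.
    """
    resp = requests.get(
        "https://api.example.com/v1/calendar/events",
        headers={"Authorization": f"Bearer {user_access_token}"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("items", [])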
Slight tangent, but as a long term user of OpenAI models, I was surprised at how well Claude Sonnet 3.7 through the desktop app handles multi-hop problem solving using tools (over MCP). As long as tool descriptions are good, it’s quite capable of chaining and “lateral thinking” without any customisation of the system or user prompts.
For those of you using Sonnet over API: is this behaviour similar there out of the box? If not, does simply pasting the recently exfiltrated[1] “agentic” prompt into the API system prompt get you (most of the way) there?
This would be more reliable than expecting the LLM to generate working code 100% of the time?
MCP is literally just a wrapper around an API call, but because it has some LLM buzz sprinkled on top, people expect it to do some magic, when they wouldn't expect the same magic from the underlying API.
Explain how I would do this without an LLM:
https://blog.nawaz.org/posts/2025/May/gemini-figured-out-my-...
Or, if I gave the LLM a list of my users and asked it to filter based on some criteria, the grammar would change to only output user IDs that existed in my list.
I don't know how useful this would be in practice, but at least it would make it impossible for the LLM to hallucinate for these cases.
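One way to sketch this is to build the decoding constraint from the known IDs; the snippet below only constructs the constraint (a regex alternation over real IDs), since the API for applying it as a grammar varies by inference runtime:

import re

user_ids = ["u_1042", "u_2331", "u_9001"]  # the only IDs that actually exist

# A constraint that can only ever produce one of the known IDs,
# so the model cannot emit an ID outside this set.
id_pattern = "(" + "|".join(re.escape(uid) for uid in user_ids) + ")"
list_pattern = rf"\[{id_pattern}(, {id_pattern})*\]"   # e.g. [u_1042, u_9001]

assert re.fullmatch(list_pattern, "[u_1042, u_9001]")
assert not re.fullmatch(list_pattern, "[u_7777]")      # nonexistent ID is rejected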
The issue right now is that both (1) function calling and (2) codegen just aren't very good yet. The hype train far exceeds capabilities. Great demos like fetching some Stripe customers, generating an email, or getting the weather work flawlessly. But anything more sophisticated goes off the rails very quickly. It's difficult to get models to reliably call functions with the right parameters, to set up multi-step workflows, and more.
Add codegen into the mix and it's hairier. You need a deployment and testing apparatus to make sure the code actually works... and then what is it doing? Does it need secret keys to make web requests to other services? Should we rely on functions for those?
The price/performance curve is a consideration, too. Good models are slow and expensive, which means their utility has to be high enough to justify what the customer pays, but they also take a lot longer to respond to requests, which reduces the perception of value. Codegen is even slower in this case. So there's a lot of alpha in finding the right "mixture of models" that can plan and execute functions quickly and accurately.
For example, OpenAI's GPT-4.1-nano is the fastest function calling model on the market. But it routinely tries to execute the same function twice in parallel. So if you combine it with another fast model, like Gemini Flash, you can reduce error rates - e.g. 4.1-nano does planning, Flash executes. But this is non-obvious to anybody building these systems until they've tried and failed countless times.
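A very rough sketch of that planner/executor split; call_model is a placeholder for whatever client SDK you use, and the model names are stand-ins, not recommendations:

def call_model(model: str, prompt: str) -> str:
    """Placeholder for whatever client SDK you use (OpenAI, Gemini, etc.)."""
    raise NotImplementedError

def run_task(task: str) -> str:
    # The cheap, fast planner model only plans; it never executes tools itself.
    plan = call_model("fast-planner", f"Break this task into numbered tool-call steps:\n{task}")

    # A second fast model executes one step at a time, seeing prior results,
    # instead of trying to do everything in a single call.
    results = []
    for step in plan.splitlines():
        if step.strip():
            results.append(call_model("fast-executor", f"Execute: {step}\nPrior results: {results}"))
    return results[-1] if results else ""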
I hope to see capabilities improve and costs and latency trend downwards, but what you're suggesting isn't quite feasible yet. That said I (and many others) are interested in making it happen!
Rather than asking an agent to internalize its algorithm, you should teach it an API and then ask it to design an algorithm which you can run in user space. There are very few situations where I think it makes sense (for cost or accuracy) for an LLM to internalize its algorithm. It's like asking an engineer to step through a function in their head instead of just running it.
I guess one could in principle wrap the entire execution block in a distributed transaction, but LLMs try to write robust code, which works against this pattern because it makes failures hard to surface and understand.
For example, when the code execution fails mid-way, we really want the model to be able to pick up from where it failed (with the state of the variables at the time of failure) and continue from there.
We've found that the LLM is able to generate correct code that picks up gracefully. The hard part now is building the runtime that makes that possible; we've something that works pretty well in many cases now in production at Lutra.
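A minimal sketch of the runtime idea, assuming a simple pickle-based checkpoint of step results (the actual runtime is surely more involved; this only illustrates resuming from saved state rather than restarting):

import pickle
from pathlib import Path

CHECKPOINT = Path("run_state.pkl")

def checkpoint(state: dict) -> None:
    CHECKPOINT.write_bytes(pickle.dumps(state))

def resume() -> dict:
    return pickle.loads(CHECKPOINT.read_bytes()) if CHECKPOINT.exists() else {}

def run(steps) -> dict:
    """Execute named steps in order, persisting variables after each one.

    If a step raises, the next run() picks up after the last completed step
    with the saved values instead of starting over.
    """
    state = resume()
    for name, fn in steps:
        if name in state:          # already completed on a previous run
            continue
        state[name] = fn(state)    # may raise; earlier results stay checkpointed
        checkpoint(state)
    return state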
Latency, though, would be unbearable for real time.