I've been working on an open source implementation of Programmatic Tool Calling for Agents, based on Cloudflare's Code Mode and a few Anthropic articles. I think it can be very powerful in certain use cases, but there are some challenges I'd love to get your thoughts on.
Instead of a traditional agent that burns tens of thousands of tokens loading all tool definitions upfront and compounds its context with sequential calls, this approach lets the agent discover only the tools it needs from a file tree of TypeScript SDKs, then write code that one-shots the task in a single pass.
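To make that concrete, here's roughly what one of these generated scripts looks like (the "tasks" SDK and its function names are illustrative, not a real server):

  // Hypothetical "tasks" SDK discovered from the file tree, e.g. ./servers/tasks/index.ts
  import { listTasks, closeTask } from "./servers/tasks";

  // One pass: fetch everything, filter locally, and act in a loop,
  // instead of one model turn (and one context round-trip) per tool call.
  async function closeStaleTasks(): Promise<string[]> {
    const tasks = await listTasks({ status: "open" });
    const thirtyDays = 30 * 24 * 60 * 60 * 1000;
    const stale = tasks.filter(
      (t) => Date.now() - new Date(t.updatedAt).getTime() > thirtyDays
    );
    await Promise.all(stale.map((t) => closeTask({ id: t.id })));
    return stale.map((t) => t.id);
  }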
Although having an agent execute code seems ideal, since LLMs are great at writing code, there are a few big challenges I've run into, described below.
The main challenges w/ Programmatic Tool Calling:
- Output Schemas from the Tools
Most MCP servers and tool definitions almost never define output schemas, and without knowing what a tool returns, the model hallucinates property names (think 'task.title' vs 'task.name') and the script fails at runtime because it has to guess the shape of a tool's output. I'm working around this by classifying tools and actually calling them to infer schemas, but it's really hacky: a single sample misses optional fields, and testing write/destructive tools means creating or destroying real data, which is an approach I really dislike and don't think is viable.
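My current workaround is essentially this kind of sampling-based inference (heavily simplified here, and all the names are illustrative), which also shows why a single sample isn't enough:

  // Very rough shape inference from sampled tool outputs. A field that only
  // appears in some responses (an optional field) is silently missed if you
  // only ever sample once.
  type InferredSchema = Record<string, { type: string; optional: boolean }>;

  function inferSchema(samples: Array<Record<string, unknown>>): InferredSchema {
    const schema: InferredSchema = {};
    for (const sample of samples) {
      for (const [key, value] of Object.entries(sample)) {
        schema[key] = { type: typeof value, optional: false };
      }
    }
    // A field absent from any sample gets marked optional, but that only
    // works if you happened to sample a response where it was missing.
    for (const key of Object.keys(schema)) {
      schema[key].optional = samples.some((s) => !(key in s));
    }
    return schema;
  }

  // inferSchema([{ name: "Fix login" }]) claims every task has only `name`,
  // until a later response comes back with `dueDate` and the generated
  // script that never handled it breaks.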
- Tool Outputs Are Often Plain Strings (unstructured data)
Even with perfect schemas and defined shapes, most MCP tools return markdown blobs or plain strings meant for LLM inference: no JSON, no fields to index into, just text. If the majority of your tools return plain strings (even when listing data), the main value of codecall is lost, because you can't write deterministic code against unstructured text. You're forced back into traditional agent behavior where the LLM interprets the text. And if you don't control the server or the tool definitions, there's no fix I can really think of.
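To illustrate the difference with a made-up listTasks tool (both versions are hypothetical, but the second is what most servers I've tried actually look like):

  // What you want: structured output you can compute over deterministically.
  async function openTaskIds(
    listTasks: () => Promise<Array<{ id: string; status: string }>>
  ) {
    const tasks = await listTasks();
    return tasks.filter((t) => t.status === "open").map((t) => t.id);
  }

  // What you usually get: one markdown blob meant for the model to read, e.g.
  //   "## Open tasks\n- Fix login (assigned to sam)\n- Update docs"
  // There's no `status` field to filter on, so the options are a brittle regex
  // or handing the text straight back to the model, i.e. back to traditional
  // tool calling.
  async function openTasksFromBlob(listTasks: () => Promise<string>) {
    const blob = await listTasks();
    return blob; // nothing deterministic to do here except pass the text along
  }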
- Input/Output examples for each Tool (Amplified w/ Programmatic Tool Calling)
The final challenge is that JSON Schema defines structure but not usage patterns. Take the support ticket API example from one of the Anthropic articles: the schema tells you due_date is a string, but not whether it wants "2024-11-06" or "Nov 6, 2024". It says reporter.id is a string, but is that a UUID or "USR-12345"? When should reporter.contact be populated? How do escalation.level and priority interact?
In traditional tool calling, the model can learn these patterns through trial and error across multiple turns: it tries something, gets an error or an unexpected result, and adjusts for the rest of the conversation. But with programmatic tool calling, the model writes a script that might call create_ticket 50 times in a loop for different users. If it misinterprets the date format or ID convention in the first call, all 50 calls fail.
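Here's roughly the failure mode I mean (create_ticket, due_date, and reporter.id come from the Anthropic example; the wrapper and everything else is illustrative):

  // `createTicket` stands in for a generated SDK wrapper around the
  // create_ticket tool from the Anthropic support-ticket example.
  declare function createTicket(input: {
    title: string;
    due_date: string;
    reporter: { id: string };
  }): Promise<{ ticketId: string }>;

  // The schema only says due_date and reporter.id are strings. If the model
  // guesses the wrong conventions once, every call in the batch inherits the guess.
  async function fileTickets(users: Array<{ id: string; issue: string }>) {
    const results: Array<{ ticketId: string }> = [];
    for (const user of users) {
      results.push(
        await createTicket({
          title: user.issue,
          due_date: "Nov 6, 2024",    // API actually wants "2024-11-06"
          reporter: { id: user.id },  // UUID? "USR-12345"? The schema doesn't say.
        })
      );
    }
    return results; // 50 users in, 50 identical validation errors out
  }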
-------------
Although all of these could be fixed by having the user set things up manually, is there a reliable way to get output schemas and generate input/output examples for each tool, without actually calling the tool and without a user manually entering the data?
If anybody is interested, or has any thoughts or ideas on tool calling for agents, please feel free to share!