Tendril – a self-extending agent that builds and registers its own tools

https://github.com/serverless-dna/tendril

33•walmsles•2h ago

Comments

walmsles•2h ago

I built this while working on a coding agent that kept starting cold every session. The deeper problem was that agent frameworks give you what a tool does and how to call it, but no structured answer to when — when should a tool fire autonomously, and when should it stay silent. That judgement is always implicit, scattered across system prompts and tool descriptions.

Tendril is a reference implementation of what I'm calling the Agent Capability pattern. It starts with three bootstrap tools and builds everything else itself. The key constraint: there's no direct code execution. The agent can only run registered capabilities, so every task forces it to write a tool, define its invocation conditions, and register it for future sessions. The registry accumulates across sessions.

I also ran the self-extending loop against five local models — Qwen3-8B, Gemma 4, Mistral Small 3.1, Devstral Small 2, Salesforce xLAM-2. None passed.

The failure modes were distinct enough to be worth writing up separately: https://serverlessdna.com/strands/ai-agents/agents-know-what...

Stack: AWS Strands TypeScript SDK, Bedrock (Claude Sonnet), Deno sandbox, Tauri + React desktop shell.

esafak•1h ago

You can list the uses of the available tools in the AGENTS. I keep my agents on a tight leash, and self-extension runs counter to this. I would not my agent to spontaneously develop the ability to tap my bank account, for example.

dd8601fn•38m ago

I did something that sounds similar for my home assistant.

The agent never executes anything. It has like four tools… search, request execute, request build, request update.

The tool service runs vector search against the tools catalog.

The build generalizes the requested function and runs authoring with review steps, declaring needed credentials and network access.

The adversarial reviewer can reject back to the authoring three times.

After passing, the tool is registered and embeddings are done for search. It’s live for future use.

Credentials are stored encrypted, and only get injected by the tools catalog service during tool execution. The network resources are declared so tool function execution can be better sandboxed (it’s not, yet).

The agent never has access to credentials and cannot do anything without going through vetted functions in the tool service.

Agent, author process, reviewer, embedding… all can be different models running local or remote.

Event bus, agent, tool service… all separate containers.

I have an url if you want to read a bit about what I did: https://dcd.fyi/agent

It’s really just meant for me, but if you’re interested in more details on anything let me know. There’s nothing super special in it.

sockaddr•1h ago

So basically you've built a mechanism for a model to de-compress itself.

gavinray•1h ago

It's really cool to see that other people run into the same issues and arrive at the same conclusions/solution.

At $DAYJOB, we have an LLM-based tool and this issue of "how do we avoid burning tokens solving the same problems over again" was an early obstacle

We wound up building a very similar thing to what you call "tools" (we named them "Saved Programs").

There's a wiki the LLM searches before solving a problem, that links saved programs for past actions to their content entry.

If it finds one, it'll re-use it, otherwise it'll generate a program and offer to save it, if you think it'll be common enough.

afshinmeh•55m ago

> how do we avoid burning tokens solving the same problems over again

Letting the LLM write half baked tools is the recipe for burning more tokens.

> There's a wiki the LLM searches before solving a problem, that links saved programs for past actions to their content entry.

What's the criteria for marking an LLM written tool as useful/correct before publishing it?

gavinray•36m ago

  > Letting the LLM write half baked tools is the recipe for burning more tokens.

It sure is, if the tools are half-baked and your user scale is N=1 rather than N=100 or N=1,000

  > What's the criteria for marking an LLM written tool as useful/correct before publishing it?

It solves the problem the originating user asked it to

afshinmeh•29m ago

> It solves the problem the originating user asked it to

Interesting. And is there a mechanism to go back and "fix" the tools after they are published? What happens if the tool decided to use the "id" attribute to click on buttons and now you have a new website that follows a different pattern to find the right target?

I agree that "correctness" of a tool could have different meaning depending on the context of the problem though (e.g. would you consider OOM a correctness bug even if it addresses the user's ask?)

Edmond•27m ago

it's called workflow automation: https://blog.codesolvent.com/2025/12/workflow-automation-let...

Everyone is just taking a round about way to get there. The workflow/program as "tools" approach is the right one. Agents skills are more or less in that same direction.

nickstinemates•30m ago

We built something similar[1], including integrated memory for debugging. It is very useful to have repeatable artifacts left behind every time you use an agent to accomplish a task.

The main design decision we took was to integrate with your existing agent instead of building a new one. Your harness, swamp, and you're off.

As an aside, building software for agents is incredibly fun.

1: https://swamp.club

weitendorf•28m ago

Get outta my swamp! Just kidding, it’s cool to see other people working on this stuff.

I think right now this is still a bit too fresh out of Claude Code to be usable by anybody but the people developing it. I got to around the same point with my first tempt at building a tool registry (https://github.com/accretional/collector) and then realized I basically needed to start over with much more investment in supporting infrastructure to build the thing I really wanted.

I can go as far into the weeds as anybody would ever care to hear about this, but for the sake of brevity I’ll just say this: reflection and type systems over the network are pretty much the only way to get this stuff to work properly (I mean you could just go full MCP/Skills but then all you really have are giant blobs of markdown and unconstrained json that make integration/discovery/usability a nightmare, and require an agent in the loop to drive/integrate the tools when you really just need to give them the actual APIs and documentation). That ends up getting rather hairy, we ended up actually building a declarative meta-lexer/parser/transpiler (meta basically just meaning it’s generalized across languages and self-hosting/bootstrapped) recently (https://github.com/accretional/gluon) because it turns out building a cross-language distributed type system is rather difficult. But reflection alone gets you halfway there as far as benefits.

mtrifonov•23m ago

I like that you approach the question of "when" in regards to tool calls. I've become frustrated that most agent frameworks don't acknowledge it in their design philosophy.

WHEN is upstream of WHAT and HOW. You can have perfect tool descriptions and perfect call signatures, but if the model can't read the situation to know whether the moment calls for any tool at all, you get either over-firing (agent burns tokens trying to "help") or under-firing (agent waits to be addressed and acts like a chatbot, not an autonomous participant).

I have had a lot of success when I refrain from codifying WHEN as rules. "If X then fire tool Y" is a dumb heuristic with extra steps. Describe the conditions of the moment. What's been tried, what's converged, what state the work is in. Then let the model decide whether to act and which tool fits.

Rules get stale. Situation-reads generalize.

Reading the Tendril README, looks like the registration mechanic is solving a slightly different problem (the "too many tools" / context-bloat problem) by giving the agent three bootstrap tools and a growing registry. The WHEN itself still seems to be codified as rules in the system prompt ("BEFORE acting, call searchCapabilities; IF found, load and execute; IF NOT found, build yourself"). That's exactly the IF-X-THEN-Y pattern your framing seems to want to move past.

Curious whether you see the registry itself as the structured WHEN, or whether the rule-based system prompt is a starting point you intend to evolve toward something more situational.

Microsoft and OpenAI end their exclusive and revenue-sharing deal

"Why not just use Lean?"

The Woes of Sanitizing SVGs

Show HN: OSS Agent I built topped the TerminalBench on Gemini-3-flash-preview

Pgbackrest is no longer being maintained

Apple is dropping AFP/TimeCapsule support in macOS 27

4TB of voice samples just stolen from 40k AI contractors at Mercor

Men who stare at walls

Fully Featured Audio DSP Firmware for the Raspberry Pi Pico

Tendril – a self-extending agent that builds and registers its own tools

FDA Approves First-Ever Gene Therapy for Treatment of Genetic Hearing Loss

Den stora Älgvandringen – The great moose migration (live)

US Supreme Court Reviews Police Use of Cell Location Data to Find Criminals

Flipdiscs

Show HN: Utilyze – an open source GPU monitoring tool more accurate than nvtop

I bought Friendster for $30k – Here's what I'm doing with it

GitHub Copilot is moving to usage-based billing

Managing the Unmanaged Switch

Running local LLMs offline on a ten-hour flight

Dutch central bank ditches AWS and chooses Lidl for European Cloud

Quarkdown – Markdown with Superpowers

AI should elevate your thinking, not replace it

Understanding the short circuit in solid-state batteries

Show HN: A terminal spreadsheet editor with Vim keybindings

Getting my daily news from a dot matrix printer 2024

TurboQuant: A first-principles walkthrough

Self-updating screenshots

The Prompt API

Supreme Court to Hear Arguments in Landmark Roundup Weedkiller Case

Canva apologizes after its AI tool replaces 'Palestine' in designs

Tendril – a self-extending agent that builds and registers its own tools

Comments

Microsoft and OpenAI end their exclusive and revenue-sharing deal

"Why not just use Lean?"

The Woes of Sanitizing SVGs

Show HN: OSS Agent I built topped the TerminalBench on Gemini-3-flash-preview

Pgbackrest is no longer being maintained

Apple is dropping AFP/TimeCapsule support in macOS 27

4TB of voice samples just stolen from 40k AI contractors at Mercor

Men who stare at walls

Fully Featured Audio DSP Firmware for the Raspberry Pi Pico

Tendril – a self-extending agent that builds and registers its own tools

FDA Approves First-Ever Gene Therapy for Treatment of Genetic Hearing Loss

Den stora Älgvandringen – The great moose migration (live)

US Supreme Court Reviews Police Use of Cell Location Data to Find Criminals

Flipdiscs

Show HN: Utilyze – an open source GPU monitoring tool more accurate than nvtop

I bought Friendster for $30k – Here's what I'm doing with it

GitHub Copilot is moving to usage-based billing

Managing the Unmanaged Switch

Running local LLMs offline on a ten-hour flight

Dutch central bank ditches AWS and chooses Lidl for European Cloud

Quarkdown – Markdown with Superpowers

AI should elevate your thinking, not replace it

Understanding the short circuit in solid-state batteries

Show HN: A terminal spreadsheet editor with Vim keybindings

Getting my daily news from a dot matrix printer 2024

TurboQuant: A first-principles walkthrough

Self-updating screenshots

The Prompt API

Supreme Court to Hear Arguments in Landmark Roundup Weedkiller Case

Canva apologizes after its AI tool replaces 'Palestine' in designs