This post has an unusually large number of code blocks without syntax highlighting since they're copy-pasted outputs from the debug tool which isn't in any formal syntax.
Since you released version 0.26 alpha, I’ve been trying to create a plugin to interact with an MCP server, but it’s a bit too challenging for me. So far, I’ve managed to connect and dynamically retrieve and use tools, but I’m not yet able to pass parameters.
I'm a heavy user of the llm tool, so as soon as I saw your post, I started tinkering with MCP.
I’ve just published an alpha version that works with stdio-based MCP servers (tested with @modelcontextprotocol/server-filesystem) - https://github.com/Virtuslab/llm-tools-mcp. Very early stage, so please make sure to use it with the --ta option (manually approve every tool execution).
The code is still messy and there are a couple of TODOs in the README.md, but I plan to work on it full-time until the end of the week.
Some questions:
Where do you think mcp.json should be stored? Also, it might be a bit inconvenient to specify tools one by one with -T. Do you think adding a --all-tools flag or supporting glob patterns like -T name-prefix* in llm would be a good idea?
You're using function-based tools at the moment, which is why you have to register each one individually.
The alternative to doing that is to use what I call a "toolbox", described here: https://llm.datasette.io/en/stable/python-api.html#python-ap...
Those get you two things you need:
1. A single class can have multiple tool methods in it, and you only have to specify it once
2. Toolboxes can take configuration
With a Toolbox, your plugin could work like this:
llm -T 'MCP("path/to/mcp.json")' ...
You might even be able to design it such that you don't need an mcp.json at all, and everything gets passed to that constructor. There's one catch: currently you would have to dynamically create the class with methods for each tool, which is possible in Python but a bit messy. I have an open issue to make that better here: https://github.com/simonw/llm/issues/1111
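For illustration, a rough sketch of that dynamic class creation - fetch_mcp_tools and call_mcp_tool are hypothetical stand-ins for real MCP discovery and invocation, not part of any existing library:

```python
import llm

# Hypothetical stand-ins for real MCP discovery and invocation.
def fetch_mcp_tools(mcp_json_path):
    return [{"name": "read_file"}, {"name": "list_directory"}]

def call_mcp_tool(mcp_json_path, tool_name, argument):
    return f"(would call MCP tool {tool_name!r} with {argument!r})"

def make_mcp_toolbox(mcp_json_path: str) -> type:
    """Build a Toolbox subclass at runtime, one method per discovered MCP tool."""
    def make_method(tool_name):
        def method(self, argument: str) -> str:
            return call_mcp_tool(mcp_json_path, tool_name, argument)
        method.__name__ = tool_name
        method.__doc__ = f"Call the MCP tool {tool_name!r}."
        return method

    methods = {t["name"]: make_method(t["name"]) for t in fetch_mcp_tools(mcp_json_path)}
    # type() assembles the class dynamically - the "possible but a bit messy" part.
    return type("MCP", (llm.Toolbox,), methods)
```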
Ah, I saw "llm.Toolbox" but I thought it was just for plugin developer convenience.
I'll take a look at the issue you posted (#1111). Maybe I can contribute somehow :).
I have an idea to fix that by writing a 'plugins.txt' file somewhere with all of your installed plugins and then re-installing any that go missing - issue for that is here: https://github.com/simonw/llm/issues/575
uv tool install llm --upgrade --with llm-openrouter --with llm-cmd ...
llm install -U llm
instead of
uv tool upgrade llm
(the latter of which is recommended by simonw in the original post)
I'll switch to o4-mini when I'm writing code, but otherwise 4.1-mini usually does a great job.
Fun example from earlier today:
llm -f https://raw.githubusercontent.com/BenjaminAster/CSS-Minecraft/refs/heads/main/main.css \
-s 'explain all the tricks used by this CSS'
That's piping the CSS from that incredible CSS Minecraft demo - https://news.ycombinator.com/item?id=44100148 - into GPT-4.1 mini and asking it for an explanation. The code is clearly written but entirely uncommented: https://github.com/BenjaminAster/CSS-Minecraft/blob/main/mai...
GPT-4.1 mini's explanation is genuinely excellent: https://gist.github.com/simonw/cafd612b3982e3ad463788dd50287... - it correctly identifies "This CSS uses modern CSS features at an expert level to create a 3D interactive voxel-style UI while minimizing or eliminating JavaScript" and explains a bunch of tricks I hadn't figured out.
And it used 3,813 input tokens and 1,291 output tokens - https://www.llm-prices.com/#it=3813&ot=1291&ic=0.4&oc=1.6 - that's 0.3591 cents (around a third of a cent).
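A quick back-of-the-envelope check of that figure, using the per-million-token prices from the linked calculator:

```python
# GPT-4.1 mini: $0.40 per million input tokens, $1.60 per million output tokens.
input_tokens, output_tokens = 3813, 1291
cost_usd = input_tokens / 1_000_000 * 0.40 + output_tokens / 1_000_000 * 1.60
print(f"${cost_usd:.6f} = {cost_usd * 100:.4f} cents")  # $0.003591 = 0.3591 cents
```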
How come it doesn't know for sure?
Though it's worth noting that CSS Minecraft was first released three years ago, so there's a chance it has hints about it in the training data already. This is not a meticulous experiment.
(I've had a search around though and the most detailed explanation I could find of how that code works is the one I posted on my blog yesterday - my hunch is that it figured it out from the code alone.)
Are you aware of any user interfaces that expose some limited ChatGPT-style functionality and internally use llm? This is for my non-techie wife.
I've been meaning to put together a web UI for ages, I think that's the next big project now that tools is out.
It's not using LLM, but right now one of the best UI options out there is https://openwebui.com/ - it works really well with Ollama (and any other OpenAI-compatible endpoint).
The doc [1] warns about prompt injection, but I think a more likely scenario is self-inflicted harm. For instance, you give a tool access to your brokerage account to automate trading. Even without prompt injection, there's nothing preventing the bot from making stupid trades.
https://news.ycombinator.com/item?id=44073456
https://news.ycombinator.com/item?id=44073413
https://news.ycombinator.com/item?id=44070923
https://news.ycombinator.com/item?id=44070514
https://news.ycombinator.com/item?id=44010921
https://news.ycombinator.com/item?id=43970274
If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and taking the intended spirit of the site more to heart, we'd be grateful.
Yeah, it really does.
There are so many ways things can go wrong once you start plugging tools into an LLM, especially if those tool calls are authenticated and can take actions on your behalf.
The MCP world is speed-running this right now, see the GitHub MCP story from yesterday: https://news.ycombinator.com/item?id=44097390
I stuck a big warning in the documentation and I've been careful not to release any initial tool plugins that can cause any damage - hence my QuickJS sandbox one and SQLite plugin being read-only - but it's a dangerous space to be exploring.
(Super fun and fascinating though.)
This is absolutely going to happen at a large scale and then we'll have "cautionary tales" and a lot of "compliance" rules.
Letting the LLM run the tool unsupervised is another thing entirely. We do not understand the choices the machines are making. They are unpredictable and you can't root-cause their decisions.
LLM tool use is a new thing we haven't had before, which means tool misuse is a whole new class of FUBAR waiting to happen.
Let's say you are making an AI-controlled radiation therapy machine. You prompt and train and eval the system very carefully, and you are quite sure it won't overdose any patients. Well, that's not really good enough, it can still screw up. But did you do anything wrong? Not really, you followed best practices and didn't make any mistakes. The LLM just sometimes kills people. You didn't intend that at all.
I make this point because this is already how these systems work today. But instead of giving you a lethal dose of radiation, it uses slurs or promotes genocide or something else. The builders of those bots didn't intend that, and in all likelihood tried very hard to prevent it. It's not very fair to blame them.
Even a year ago I was letting LLMs execute local commands on my laptop. I think it is somewhat risky, but nothing harmful has happened. You also have to consider what you are prompting: when I prompt 'find out where I am and what the weather is going to be', it is possible that it will execute rm -rf /, but very unlikely.
However, speaking of letting an LLM trade stocks without understanding how the LLM will come to a decision... too risky for my taste ;-)
Overall, I found tool use extremely hit-and-miss, to the point where I'm sure I'm doing something wrong (I'm using the OpenAI Agents SDK, FWIW).
Anthropic's system prompt just for their "web_search" tool is over 6,000 tokens long! https://simonwillison.net/2025/May/25/claude-4-system-prompt...
And, this is why I'm very excited about this addition to the llm tool, because it feels like it moves the tool closer to the user and reduces the likelihood of the problem I'm describing.
See also my multi-year obsession with prompt injection and LLM security, which still isn't close to being a solved problem: https://simonwillison.net/tags/prompt-injection/
Yet somehow I can't tear myself away from them. The fact that we can use computers to mostly understand human language (and vision problems as well) is irresistible to me.
I agree that'd be amazing if they did that, but they most certainly do not. I think this is the core of my disagreement here: that you believe this and let it guide you. They don't understand anything; they are matching and synthesizing patterns. I can see how that's enthralling, like watching a Rube Goldberg machine go through its paces, but there is no there there. The idea that there is an emergent something there is at best an unproven theory, is documented as being an illusion, and at worst has become an unfounded messianic belief.
I know they're just statistical models, and that having conversations with them is like having a conversation with a stack of dice.
But if the simulation is good enough to be useful, the fact that they don't genuinely "understand" doesn't really matter to me.
I've had tens of thousands of "conversations" with these things now (I know because I log them all). Whether or not they understand anything they're still providing a ton of value back to me.
More background: https://github.com/simonw/llm/issues/12
(Also check out https://github.com/day50-dev/llmehelp which features a tmux tool I built on top of Simon's llm. I use it every day. Really. It's become indispensable)
I think I want a plugin hook that lets plugins take over the display of content by the tool.
Just filed an issue: https://github.com/simonw/llm/issues/1112
Would love to get your feedback on it, I included a few design options but none of them feel 100% right to me yet.
We have cost, latency, context window and model routing but I haven't seen anything semantic yet. Someone's going to do it, might as well be me.
That's why everybody else either rerenders (such as rich) or relies on the whole buffer (such as glow).
I didn't write Streamdown for fun - there are genuinely no suitable tools that did what I needed.
Also various models have various ideas of what markdown should be and coding against CommonMark doesn't get you there.
Then there are other things. You have to check individual character width and the language family type to do proper word wrap. I've seen a number of interesting tmux and alacritty bugs while doing multi-language support.
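For example, a width-aware wrapper has to count terminal cells rather than characters - a minimal sketch using Python's unicodedata, not Streamdown's actual code:

```python
# Terminal word wrap by display cells, not characters - a rough sketch.
import unicodedata

def cell_width(ch: str) -> int:
    # East Asian wide ("W") and fullwidth ("F") characters occupy two cells.
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1

def wrap(text: str, width: int) -> list:
    lines, line, used = [], "", 0
    for word in text.split():
        w = sum(cell_width(c) for c in word)
        if line and used + 1 + w > width:
            lines.append(line)
            line, used = word, w
        elif line:
            line, used = line + " " + word, used + 1 + w
        else:
            line, used = word, w
    if line:
        lines.append(line)
    return lines

print("\n".join(wrap("Mixed 漢字 and English text wraps by cell width", 16)))
```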
The only real break I do is I render h6 (######) as muted grey.
Compare:
for i in $(seq 1 6); do
  printf "%${i}sh${i}\n\n-----\n" | tr " " "#"
done | pv -bqL 30 | sd -w 30
to swapping out `sd` with `glow`. You'll see glow's lag - waiting for that EOF is annoying. Also try sd -b 0.4 or even -b 0.7,0.8,0.8 for a nice blue. It's a bit easier to configure than the usual catalog of themes that requires a compilation after modification, like with pygments.
| bat --language=markdown --force-colorization ?
simple and works well.
echo "$@" | llm "Provide a brief response to the question, if the question is related to command provide the command and short description" | bat --plain -l md
Launch as: llmquick "why is the sky blue?"
https://github.com/day50-dev/llmehelp/blob/main/Snoopers/wtf
I've thought about redoing it because my needs are things like
$ ls | wtf which endpoints do these things talk to, give me a map and line numbers.
What this will eventually be is "ai-grep" built transparently on https://ast-grep.github.io/ where the llm writes the complicated query (these coding agents all seem to use ripgrep, but this works better). Conceptual grep is what I've wanted my whole life.
Semantic routing, which I alluded to above, could get this to work progressively so you quickly get adequate results which then pareto their way up as the token count increases.
Really you'd like some tempering, like a coreutils timeout(1) but for simplex optimization.
Lmao. Does it work? I hate that it needs to be repeated (in general). ChatGPT couldn't care less about following my instructions; through the API it probably would?
This one is a ZSH plugin that uses zle to translate your English to shell commands with a keystroke.
https://github.com/day50-dev/Zummoner
It's been life changing for me. Here's one I wrote today:
$ git find out if abcdefg is a descendant of hijklmnop
In fact I used it in one of these comments. This:

$ for i in $(seq 1 6); do
  printf "%${i}sh${i}\n\n-----\n" | tr " " "#"
done | pv -bqL 30

was originally:

$ for i in $(seq 1 6); do
  printf "(# $i times)\n\n-----\n"
done | pv (30 bps and quietly)
I did my trusty ctrl-x x and the buffer got sent off through openrouter and got swapped out with the proper syntax in under a second. It's also intelligent about inferring leading zeros without needing to be told with options, e.g. {001..995}.
Looks from the demo like mine's a little less automatic and more iterative than yours.
The conversational context is nice. The ongoing command building is convenient and the # syntax carryover makes a lot of sense!
My next step is recursion and composability. I want to be able to do things contextualized. Stuff like this:
$ echo PUBLIC_KEY=(( get the users public key pertaining to the private key for this repo )) >> .env
or some other contextually complex thing that is actually fairly simple, just tedious to code. Then I want that <as the code> so people collectively program and revise stuff <at that level as the language>. Then you can do this through composability like so:
with ((find the variable store for this repo by looking in the .gitignore)) as m:
  ((write in the format of m))
  SSH_PUBLICKEY=(( get the users public key pertaining to the private key for this repo ))
or even recursively:

((
  ((
    ((rsync, rclone, or similar)) with compression
  ))
  $HOME exclude ((find directories with secrets))
  ((read the backup.md and find the server))
  ((make sure it goes to the right path))
));
it's not a fully formed syntax yet, but then people will be able to do something like:

$ llm-compile --format terraform --context my_infra script.llm > some_code.tf
and compile publicly shared snippets as specific to their context, and you get abstract infra management at a fraction of the complexity. It's basically GCC's RTL but for LLMs.
The point of this approach is that your building blocks remain fairly atomic, simple, dumb things that even a 1B model can reliably handle - kinda like the guarantee of the RTL.
Then if you want to move from terraform to opentofu or whatever, who cares ... your stuff is in the llm metalanguage ... it's just a different compile target.
It's kinda like PHP. You just go along like normal and occasionally break form for the special metalanguage whenever you hit a point of contextual variance.
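To make it concrete, a toy sketch of a single compile pass - the regex, the prompt wording, and the use of the llm Python API are all my assumptions, not a spec for llm-compile:

```python
# Toy "llm-compile" pass: replace each (( ... )) span with an LLM completion.
# Prompt wording, model choice and context handling are illustrative only.
import re
import llm

SPAN = re.compile(r"\(\(([^()]+?)\)\)")  # innermost (( ... )) spans first

def compile_spans(source: str, context: str, model_id: str = "gpt-4.1-mini") -> str:
    model = llm.get_model(model_id)
    while True:
        match = SPAN.search(source)
        if match is None:
            return source
        instruction = match.group(1).strip()
        replacement = model.prompt(
            f"Context:\n{context}\n\nWrite only the code/text for: {instruction}"
        ).text().strip()
        source = source[:match.start()] + replacement + source[match.end():]

script = 'echo PUBLIC_KEY=(( get the users public key for this repo )) >> .env'
print(compile_spans(script, context="my_infra"))
```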
I use fish, but the language change is straightforward https://github.com/viktomas/dotfiles/blob/master/fish/.confi...
I'll use this daily
I am keenly aware this is a major footgun, but it seems that a terminal tool + llm would be a perfect lightweight solution.
Is there a way to have llm get permission for each tool call, the way other "agents" do? ("llm would like to call `rm -rf ./*`, press Y to confirm...")
Would be a decent way to prevent letting an llm run wild on my terminal and still provide some measure of protection.
We've seen problems in the past where plugins with expensive imports (like torch) slow everything down a lot: https://github.com/simonw/llm/issues/949
I'm interested in tracking down the worst offenders and encouraging them to move to lazy imports instead.
sudo uvx py-spy record -o /tmp/profile.svg -- llm --help
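The fix on the plugin side is usually just deferring the heavy import into the code path that needs it - a rough sketch of the pattern (the torch dependency and the model class are purely illustrative):

```python
# Sketch of a lazy import inside an llm model plugin. A module-level
# `import torch` would slow down every `llm` invocation; importing it
# inside execute() defers the cost until the model is actually used.
import llm

class MyLocalModel(llm.Model):
    model_id = "my-local-model"  # illustrative

    def execute(self, prompt, stream, response, conversation):
        import torch  # deferred heavy import, only paid on actual use
        yield f"(would run {prompt.prompt!r} locally with torch {torch.__version__})"

@llm.hookimpl
def register_models(register):
    register(MyLocalModel())
```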
Fortunately this gets me 90% of the way there:
llm -f README.md -f llm.plugin.zsh -f completions/_llm -f https://simonwillison.net/2025/May/27/llm-tools/ "implement tab completions for the new tool plugins feature"
My repo is here:
https://github.com/eliyastein/llm-zsh-plugin
And again, it's a bit of a mess, because I'm trying to get as many options and their flags as I can. I wouldn't mind if anyone has any feedback for me.
It's like if you use English every day but don't bother to learn the language because you have Google Translate (and now AI).
I put a lot of effort into it - it integrates with `llm` command line tool and with your desktop, via a tray icon and nice chat window.
I recently released 3.0.0 with packages for all three major desktop operating systems.
The Unix shell is good at being the glue between programs. We've increased the dimensionality with LLMs.
Some kind of ports based system like named pipes with consumers and producers.
Maybe something like gRPC or NATS (https://github.com/nats-io). MQTT might also work. Network transparent would be great.
At this point I would have expected something MCP- or OpenAPI-based, but it's probably simpler and more flexible this way. Implementing it as a plugin shouldn't be hard, I think.
... OK, I got the second one working!
brew install llama.cpp
llama-server --jinja -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL
Then in another window: llm install llm-llama-server
llm -m llama-server-tools -T llm_time 'what time is it?' --td
Wrote it up here: https://simonwillison.net/2025/May/28/llama-server-tools/

Is there a blog post / article that addresses this?
If you're interested in what I recommend generally that changes a lot, but my most recent piece about that is here: https://simonwillison.net/2025/May/15/building-on-llms/
EDIT: I think I just found what I want. There is no need for the plugin, extra-openai-models.yaml just needs "supports_tools: true" and "can_stream: false".
The key elements I had to write:
- The system prompt
- Tools to pull external data
- Tools to do some calculations
Your library made the core functionality very easy.
Most of the effort for the demo was to get the plumbing working (a nice-looking web UI for the chatbot that would persist the conversation, update nicely if the user refreshed their browser due to a connection issue, and allow the user to start a new chat session).
I didn't know about `after_call=print`. So I'm glad I read this blog post!
Can we stop already? Stop following webdevs' practices.
It's been a very useful tool to test out and prototype various LLM features like multimodal input, schema output, and now tools as well! I specifically like that I can just write a Python function with type annotations and plug it into the LLM.
Things have now come full circle :D
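For anyone curious, that pattern looks roughly like this with the Python API (the model name and example function are mine, not from the comment above):

```python
# A typed Python function registered as a tool.
import llm

def multiply(x: int, y: int) -> int:
    """Multiply two numbers."""
    return x * y

model = llm.get_model("gpt-4.1-mini")
response = model.chain("What is 123 * 456?", tools=[multiply])
print(response.text())
```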
I've wondered how exactly, say, Claude Code knows about and uses tools. Obviously, an LLM can be "told" about tools and how to use them, and the harness can kind of manage that. But I assumed Claude Code has a very specific expectation around the tool call "API" that the harness uses, probably reinforced very heavily by some post-training / fine tuning.
Do you think your 3rd party tool-calling framework using Claude is at any disadvantage to Anthropic's own framework because of this?
Separately, on that other HN post about the GitHub MCP "attack", I made the point that LLMs can be tricked into using up to the full potential of the credential. GitHub has fine-grained auth credentials, and my own company does as well. I would love for someone to take a stab at a credential protocol that the harness can use to generate fine-grained credentials to hand to the LLM. I'm envisioning something where the application (e.g. your `llm` CLI tool) is given a more powerful credential, and the underlying LLM is taught how to "ask for permission" for certain actions/resources, which the user can grant. When that happens the framework gets the scoped credential from the service, which the LLM can then use in tool calls.
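To make that idea concrete, here is one possible shape for the flow - every name here (CredentialBroker, request_scoped_token, the token-exchange step) is hypothetical; no such protocol exists in llm or GitHub today:

```python
# Hypothetical scoped-credential flow: the harness holds the powerful token,
# the model can only *request* a narrower one, and the user approves the scope.
from dataclasses import dataclass

@dataclass
class ScopedToken:
    scopes: tuple
    token: str

class CredentialBroker:
    """Held by the harness; the LLM never sees the powerful credential."""

    def __init__(self, powerful_token, ask_user):
        self._powerful_token = powerful_token
        self._ask_user = ask_user  # callback: scopes -> bool

    def request_scoped_token(self, *scopes):
        """Exposed to the model as a tool: ask for a narrowly scoped credential."""
        if not self._ask_user(scopes):
            raise PermissionError(f"User declined scopes: {scopes}")
        # A real implementation would call the service's token-exchange API
        # with the powerful token; here we just fabricate a placeholder.
        return ScopedToken(scopes=scopes, token="scoped::" + ":".join(scopes))

# The harness registers request_scoped_token as a tool and substitutes the
# returned ScopedToken into later tool calls in place of the real credential.
broker = CredentialBroker("ghp_example", ask_user=lambda s: input(f"Grant {s}? [y/N] ") == "y")
```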
> We evaluate 12 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 10 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%.
simonw•1d ago
The ability to pipe files and other program outputs into an LLM is wildly useful. A few examples:
It can process images too! https://simonwillison.net/2024/Oct/29/llm-multi-modal/

LLM plugins can be a lot of fun. One of my favorites is llm-cmd, which adds the ability to do things like this: it proposes a command to run, you hit enter to run it. I use it for ffmpeg and similar tools all the time now. https://simonwillison.net/2024/Mar/26/llm-cmd/

I'm getting a whole lot of coding done with LLM now too. Here's how I wrote one of my recent plugins:
I wrote about that one here: https://simonwillison.net/2025/Apr/20/llm-fragments-github/

LLM was also used recently in that "How I used o3 to find CVE-2025-37899, a remote zeroday vulnerability in the Linux kernel’s SMB implementation" story - to help automate running 100s of prompts: https://sean.heelan.io/2025/05/22/how-i-used-o3-to-find-cve-...
th0ma5•1d ago
> had I used o3 to find and fix the original vulnerability I would have, in theory [...]
They ran a scenario that they thought could have led to finding it, which is pretty much not what you said. We don't know how much their foreshadowing crept into their LLM context, and even the article says it was also sort of chance. Please be more precise and don't give in to these false beliefs of productivity. Not yet at least.
simonw•1d ago
- The official docs: https://llm.datasette.io/
- The workshop I gave at PyCon a few weeks ago: https://building-with-llms-pycon-2025.readthedocs.io/
- The "New releases of LLM" series on my blog: https://simonwillison.net/series/llm-releases/
- My "llm" tag, which has 195 posts now! https://simonwillison.net/tags/llm/
setheron•1d ago
```
# AI cli
(unstable.python3.withPackages ( ps: with ps; [ llm llm-gemini llm-cmd ] ))
```
looks like most of the plugins are models and most of the functionality you demo'd in the parent comment is baked into the tool itself.
Yea, a live document might be cool -- part of the interesting bit was seeing the "real" types of use cases you use it for.
Anyways, will give it a spin.
furyofantares•1d ago
Most recently I wanted a script that could produce word lists from a dictionary of 180k words given a query, like "is this an animal?" The script breaks the dictionary up into chunks of size N (asking "which of these words is an animal? respond with just the list of words that match, or NONE if none, and nothing else"), makes M parallel "think" queries, and aggregates the results in an output text file.
I had Claude Code do it, and even though I'm _already_ talking to an LLM, it's not a task that I trust an LLM to do without breaking the word list up into much smaller chunks and making loads of requests.
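A minimal sketch of that chunk-and-filter approach using the llm Python library (chunk size, worker count, model and file names are my assumptions):

```python
# Sketch: filter a large word list with chunked, parallel LLM calls.
from concurrent.futures import ThreadPoolExecutor
import llm

CHUNK_SIZE = 500   # N: words per request
MAX_WORKERS = 8    # M: parallel requests

def filter_chunk(model, question, words):
    prompt = (
        f"{question} Respond with just the list of words that match, "
        "or NONE if none, and nothing else.\n\n" + "\n".join(words)
    )
    text = model.prompt(prompt).text().strip()
    return [] if text == "NONE" else text.split()

def filter_wordlist(words, question, model_id="gpt-4.1-mini"):
    model = llm.get_model(model_id)
    chunks = [words[i:i + CHUNK_SIZE] for i in range(0, len(words), CHUNK_SIZE)]
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        results = pool.map(lambda chunk: filter_chunk(model, question, chunk), chunks)
    return sorted({w for chunk in results for w in chunk})

if __name__ == "__main__":
    words = open("dictionary.txt").read().split()        # assumed input file
    matches = filter_wordlist(words, "Which of these words is an animal?")
    open("animals.txt", "w").write("\n".join(matches))   # assumed output file
```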