Overall I try to keep it as thin a wrapper as I can. The better the model, the less wrapper is needed; it's a good way to measure model competence. The code is here: https://github.com/swax/NAISYS and example context logs are here: https://test.naisys.org/logs/
I have agents built with it that do research on the web for content, run Python scripts, update the database, maintain a website, etc. - all running through the CLI; if it calls APIs, it does so with curl. Example agent instructions here: https://github.com/swax/NAISYS/tree/main/agents/scdb/subagen...
lol no. The right way to get a program to interact with another program is through an API
We don't necessarily need to replace the versions humans use - though some of the changes might well make tools better for humans too - but most of the tools I add for my coding agent are attempts at coaxing it to avoid things like the "head" example in the article.
The answer is:
- make non-deterministic black boxes more deterministic and less opaque
- improve tools for humans
https://github.com/jerpint/context-llemur
I’ve actually bootstrapped ctx with ctx and found it very useful!
It basically stops me from having to repeat myself over and over to different agents
So there is probably some room for innovation here
But most of these seem like problems with Claude (and maybe fundamental problems with LLMs), not problems with the CLI interface:
This started a game of whack-a-mole where the LLM would also attempt to change the pre-commit hooks! I had to fix it by adding a deny rule for Edit(.git/hooks/pre-commit) to my project’s .claude/settings.json. I look forward to its next lazy innovation.
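(For reference, the deny rule lives under the "permissions" key of .claude/settings.json - trimmed here to roughly the relevant part:)

    {
      "permissions": {
        "deny": ["Edit(.git/hooks/pre-commit)"]
      }
    }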
If you watch Claude Code, you’ll see that it often uses head -n100 to limit the results a priori. It also gets lost about which directory it’s in, and it will frustratingly flail around trying to run commands in different directories until it finds the right one.
It has helped tremendously to have a dedicated build service that CC can control through MCP, versus running Docker itself, because it can then restart the container and test. Also, the fuzzy search tool and diff editor seem to perform better than the replacement strategy Claude Code uses, most of the time. I continue to work on the editor when I run into issues with it, so I'm happy to help anyone interested in implementing their own file editing (and search) strategy.
You will need to convert these to Claude Code format, but all you need to do is ask CC to do it for you...
> “Military Entity” means any armed forces branch, defense department, or military organization of any nation or alliance.
As written, this only applies to nation states. It excludes many kinds of human organizations that use force to impose their will on others. The word for this is “Terrorist.”
While that term has been applied to many groups for many reasons, it technically means “the use of violence against non-combatants to achieve political or ideological aims.”
Adding Terrorism + Nation-State Militaries should cover nearly everyone you intend here, including organized crime and private military contractors. You could add “financial gain” to the definition if you want to ensure those last two are captured.
For another example of where this is a problem, look at any large company that pays to keep logs in Kibana; the amount of over-logging they pay for is insane.
Approximately a third of my Claude Code tokens are spent parsing CLI output - that is insane! Often Claude doesn't catch what it needs in the massive log output, and I have to go through the logs myself to find the problem and point it out to Claude.
I might give access to a terminal in a locked down VM, I don't know about a shell.
When will people acknowledge that LLMs are stochastic text generators?
This whole blog reads like trying to fit a square peg into a round hole. And frankly, most of the comments in this thread are jumping right on the bandwagon, “what water?”-style [1].
By all means use LLMs for what they can be useful for, but goddammit, when they are not useful, please acknowledge this and stop trying to make everything a nail for the LLM hammer.
LLMs are. not. intelligent. They don’t have a work ethic that says “oh, maybe skipping tests is bad”. If they generate output that skips tests, it’s because a large enough part of the training data contained that kind of text.
[1] fish joke
This is elementary school stuff. Do the assignment, don't cheat. Does useful software get written by people who don't understand this basic fact?
Another idea that I've been thinking about is context hierarchy:
Low -> high utility:
Base (the AI reads the tool description/manpage, etc.) > general human advice ("typically use grep this way", etc.) > specific advice ("for this project/implementation, this is how you use the tool").
Currently the best interface to provide our insights is via MCPs. At https://toolprint.ai/ we're building a human- (or machine-) driven way to supplement Claude/Cursor, etc. with that knowledge around tool use.
A practical way in which we dogfood our own product is with the Linear MCP. If you connect that and ask an agent to create a new issue, it predictably fails because there are no instructions on which Linear project to select or on the correct way to provide a description given Linear's quirks. When we connect the Linear MCP via the Toolprint MCP, it gets pre-primed context around these edge cases to improve tool use.
That is generally how it goes for power users. And people who take the time to RTFM.
But now I see people who don't want to determine their workflows. It's just ad-hoc use, spending their time on local decisions that don't matter that much instead of grasping the big picture and then solving it. Maybe it helps them look busy.
So I don't want an agent for Linear. What I want is maybe a bash alias "isc" (for "issue create"), that pops up nano where I write the issue in git commit format (title + blank line + description). Upon saving, the rest is done automatically, because it can determines the project based on some .trackerrc I put in the root of the project. Or maybe a "linear-issues" emacs command and a transient interface (again, the correct project can be determined automatically).
To clarify, the example that was provided using `command_not_found_handler` - is that possible to implement in bash? Or were you saying this would be nice to have if the functionality existed?
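(For what it's worth, bash does have an equivalent hook; it's spelled `command_not_found_handle` - zsh is the shell that adds the trailing "r". A minimal sketch, with a crude prefix-based suggestion heuristic:)

    # bash invokes this when a command isn't found; 127 is the
    # conventional "command not found" exit status
    command_not_found_handle() {
      echo "command not found: $1" >&2
      echo "similar commands: $(compgen -c "${1:0:3}" | sort -u | head -5 | tr '\n' ' ')" >&2
      return 127
    }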
I was thinking about this just the other day, and there was one from the late 80s that had scores of parameters, but I could not remember its name. I think it was an `ls` type utility.
There may be something more generalizable here.
There’s the saying that computers should adapt to us, rather than the other way around, but now this makes me wonder which side LLMs are on in that picture.
We designed desktop GUIs & web browsers on top of the terminal to allow a type of user to interact without speaking "lower level" commands, but we've also created abstractions to hide complexity for ourselves at this layer. We just so happen to call them CLI apps, scripts, Makefile targets, Taskfile tasks, Justfile recipes, unix tools, etc. Each consists of a pseudo-natural-language short-code name combined with schema-validated options and some context around what each option does (via the --help view). The trick is optimizing so that both human developers and AI agents have access to the same tools, each through the interface that suits it best.
In an experiment to let my agents share the exact same 'tools' that I use for developing in a repository, I gave them direct access to load and self-modify the local project Justfile via MCP: https://github.com/toolprint/just-mcp
Just as (pun intended) I create tools for myself to repeat common tasks with sane defaults and some parameters, my agents immediately gain the same access, and I can restrict permissions to use these instead of ANY bash command (i.e., "Bash(just:*)"). The agent can also assist in creating tools for me or itself to use that would save on time and token usage. I'd love to see the paradigm evolve to the point where it feels more like warp.dev, where you don't have to switch between two text boxes to choose whether you're talking in natural language or instructing to run a known 'tool'.
Workflows are human processes; what we do is name them and identify their parameters. The actual tools that implement those workflows don't matter that much at a human scale, other than for cognitive load. So I don't care much about gcc's various options. What I want is `make debug` or `make release` (or just `make`). And cognitive load is lowered because those are all I have to remember, and they are deterministic.
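A minimal sketch of that idea, assuming a single-file C project (names and flags are illustrative): the gcc options live in one place, and the only names to remember are `make debug` and `make release`.

    # Makefile (recipe lines are tab-indented)
    CFLAGS_DEBUG   = -g -O0 -Wall
    CFLAGS_RELEASE = -O2 -DNDEBUG

    debug:
    	gcc $(CFLAGS_DEBUG) -o app main.c

    release:
    	gcc $(CFLAGS_RELEASE) -o app main.c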
An agent is not a good bridge between humans and tools, because it increases cognitive load, while interfaces have always been about lowering it. There's no "make test" that gives you a nice output of all the lines that were flagged (with integrations like Vim's quickfix, which can quickly bring you to each line). Instead it's typing a lot and praying that it actually does something good.
I agree that if the human is "driving", they should be able to use the tool directly (i.e., make test). If you put an agent in the middle and ask it "please run make test", that's just silly and costs extra for no benefit.
Where you get benefit is if you design tools like "just test" as an MCP tool called "mcp__just-mcp__test" and give a fully-autonomous agent instructions like: "Whenever you feel like you've completed a task, run mcp__just-mcp__test and fix errors and warnings until it passes, then you may commit changes locally". LLMs have 'cognitive load' as well, so why not offload the deterministic logic to tools, the same way we do?
there is everyone else who is supposed to deliver software that works, like always, and they are not fine with built-in flakiness
https://github.com/elitan/lightform/pull/35
One of the advantages of vibe coding CLI tools like this is that it's easy for the AI to debug itself, and it's easy to see where it gets stuck.
And it usually gets stuck because of:
1. Errors
2. Not knowing what command to run
So:
1. Errors must be clear and *actionable*, like: `app_port` ("8080") is a string, expected a number.
2. The command should have simple, actionable, and complete help (`--help`) sections for all commands.
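As a hypothetical transcript of what point 1 looks like in practice (tool and file names are made up):

    $ mytool deploy
    Error: `app_port` ("8080") is a string, expected a number
    Hint: in config.yml, change app_port: "8080" to app_port: 8080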
esafak•3h ago
edit: Mea culpa.
lucianbr•2h ago
Seems pretty clear the article is talking about teaching LLMs how to use 'ls'.
withzombies•1h ago
Definitely not just `ls`.
skydhash•3h ago
In one of the examples provided:
You could just use `pwd`, like most people who put the current directory in their $PS1, to make sure that the agent stays in the correct directory.
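A one-liner version of that, for the record (the exact prompt format is a matter of taste):

    # show the working directory in every prompt so it lands in the transcript
    export PS1='\w \$ '

jacobr1•3h ago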
But for the `$command | head -100` example, the usage is a bit different. I run into this myself on the CLI, and I often end up using `less` in similar contexts.
Two cases:
1) Sometimes I use head to short-circuit a long-running command with streaming output, so I can assess whether it is starting to do the right thing without bearing the time/computational cost of full processing.
2) Sometimes the timing doesn't matter, but the content is too verbose and I need to see some subset of the data. But here head is too limited. I need something like wc & head and maybe grep in one command line, with context. Maybe something like:
$command | contextual-filter -grepn 5 -grep error -head 10
some data ... the first 10 lines ... an error message with 5 lines of surrounding context before and after
Summary: 100000 total lines, 15 printed, exited with code 0
You can do all that already with grep and others, but you need to run multiple commands to get all the context
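A rough approximation of that hypothetical contextual-filter with standard tools, capturing the output once so it can be sliced repeatedly (the error pattern is illustrative):

    out=$(command)                  # capture once instead of re-running
    wc -l <<<"$out"                 # total line count
    head -10 <<<"$out"              # first 10 lines
    grep -n -C5 'error' <<<"$out"   # matches with 5 lines of context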
skydhash•2h ago
2) Again, logs - if action needs to be taken after the command has stopped. For immediate action, you can use `tee`.
Managing context isn't hard. I see more issues with ensuring the right command is run.
psifertex•4h ago
I share your apprehension about the current AI landscape changing so quickly that it causes whiplash, but I don't think a mindset of "it's been fine for 50 years" is going to survive the pace of development made possible by better LLM integration.
skydhash•4h ago
IDEs have not changed that much. They've always been an editor supercharged with tools that all share the same context of a "project". And for development, it's always been about navigation (search and goto), compile/build, and run.
ivape•4h ago
while (true):
  >> User requests something
  << The LLM picks a CLI tool from an index
  << The LLM grabs the manual for the tool to get the list of commands
  << It attempts to fulfill the request
I would not be shocked if engineers have already managed to overcomplicate these agents.