I have used MCP daily for a few months. I'm now down to a single MCP server: terminal (iTerm2). I have OpenAPI specs on hand if I ever need to provide them, but honestly shell commands and curl get you pretty damn far.
LLM output is often coerced back into something more deterministic such as types, or DB primary keys. The value of the LLM is determined by how well your existing code and tools model the data, logic, and actions of your domain.
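For example, a minimal sketch of that coercion step using pydantic (v2 assumed; the Invoice fields are made up):

    from pydantic import BaseModel

    class Invoice(BaseModel):
        customer_id: int    # coerced back toward a DB primary key
        total_cents: int
        currency: str

    # Stand-in for raw model output; in practice this comes from your LLM call.
    raw = '{"customer_id": "42", "total_cents": 1999, "currency": "EUR"}'

    invoice = Invoice.model_validate_json(raw)  # raises ValidationError if it doesn't conform
    print(invoice.customer_id)  # 42, as an int, not a string

The fuzzy output only enters the system once it survives the deterministic boundary your types define.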
In some ways I view LLMs today a bit like 3D printers, both in terms of hype and in terms of utility. They excel at quickly connecting parts, much like rapid prototyping with 3D-printed parts. For reliability and scale, you want either the LLM or an engineer to replace the printed/inferred connector with something durable and deterministic (metal/code) that is cheap and fast to run at scale.
Additionally, there was a minute during the 3D printer Gartner hype cycle when the notion was that we would all just print substantial amounts of consumer goods, when the reality is that the high-utility use cases are much narrower. There is a corollary here to LLM usage. While LLMs are extremely useful, we cannot rely on them to generate or infer our entire operational reality, or even engage meaningfully with it, without some sort of pre-existing digital modeling as an anchor.
The reality is Chinese.
54 minute flight time (47 min hover) for fully unmanned operations.
If you're talking about FPV racing, where tiny drones fly at 140+ mph, then yeah, DJI isn't in that space.
LLMs have already found large-scale usage (deep research, translation), which makes them more ubiquitous today than 3D printers ever will be or could have been.
I don’t want to sound like a hard-core LLM believer. I get your point and it’s fair.
I just wanted to point out that the current usage of chatgpt is a lot broader than that of 3D printers even at the peak hype of it.
And I mean random people, not tech circles.
It's very different from NFTs in that respect.
Everybody under a certain age is using ChatGPT where they once used search and friends' expertise. It's the number 1 app in the App Store. Copilot use in the enterprise is so seamless: you just talk to PowerPoint or Outlook and it formulates what you were supposed to make or write.
It’s not a fad, it is a paradigm change.
People don’t need to understand how it works for it to work.
NFTs are about as close to literally useless as it gets, and that was always obvious; 99% of the serious attention paid to them came from hustlers and speculators.
LLMs, for all their limitations, are already good at some things and useful in some ways. Even in the areas where they are (so far) too unreliable for serious use, they're not pure hype and bullshit; they're doing things that would have seemed like magic 10 years ago.
This isn't (intentionally at least) mere HN pedantry: they really do act like translation tools in a bunch of observable ways.
And while they have recently crossed the threshold into "yeah, I'm always going to have a gptel buffer open now" territory at the extreme high end, their utility outside of the really specific, totally non-generalizing code-lookup-gizmo use case remains a claim unsupported by robust profits.
There is a hole in the ground containing something between $100 billion and a trillion dollars, and so far it has only about $20B in revenue (not profit) going into it annually.
AI is going to be big (it was big ten years ago).
LLMs? As far as the economics are concerned, they look more and more like the Metaverse every day.
This is a concern for me. I'm using claude-code daily and find it very useful, but I'm expecting the price to continue getting jacked up. I do want to support Anthropic, but they might eventually need to cross a price threshold where I bail. We'll see.
I expect at some point the more open models and tools will catch up when the expensive models like ChatGPT plateau (assuming they do plateau). Then we'll find out if these valuations measure up to reality.
Note to the Hypelords: It's not perfect. I need to read every change and intervene often enough. "Vibe coding" is nonsense as expected. It is definitely good though.
You don't stock antibiotics and bullets in a survival compound because you think that's going to keep out a paperclip optimizer gone awry. You do that in the forlorn hope that when the guillotines come out that you'll be able to ride it out until the Nouveau Regime is in a negotiating mood. But they never are.
- Is your demand inelastic at that point, if having claude-code becomes effectively required to sustain your livelihood? Does pricing continue to increase until it's 1%/5%/20%/50% of your salary (because hey, what's the alternative? if you don't pay, then you won't keep up with other engineers and will just lose your job completely)?
- But if tools like claude-code become such a necessity, wouldn't enterprises be the ones paying? Maybe, but maybe like health insurance in America (a uniquely dystopian thing), your employer may pay some portion of the premiums, but they'll also pass some costs to you as the employee... Tech salaries have been cushy for a while now, but we might be entering a "K-shaped" inflection point --> if you are an OpenAI elite researcher, then you might get a $100M+ offer from Meta; but if you are an average dev doing average enterprise CRUD, maybe your wages will be suppressed because the small cabal of LLM providers can raise prices and your company HAS to pay, which means you HAVE to bear the cost (or else what? you can quit and look for another job, but who's hiring?)
This is a pessimistic take of course (and vastly oversimplified / too cynical). A more positive outcome might be that the increasing quality of AI/LLM options leads to a democratization of talent, or a blossoming of "solo unicorns"... personally I have toyed with calling this something like a "techno-Amish utopia", in the sense that Amish people believe in self-sufficiency and are not wholly resistant to technology (it's actually quite clever what sorts of technology they allow for themselves or not), so what if we could take that further?
If there was a version of that Amish mentality of loosely-federated self-sufficient communities (they have newsletters! they travel to each other! but they largely feed themselves, build their own tools, fix their own fences, etc.!), where engineers + their chosen LLM partner could launch companies from home, manage their home automation / security tech, run a high-tech small farm, live off-grid from cheap solar, use excess electricity to Bitcoin mine if they choose to, etc.... maybe there is actually a libertarian world that can arise, where we are no longer as dependent on large institutions to marshal resources, deploy capital, scale production, etc., if some of those things are more in reach for regular people in smaller communities, assisted by AI. This of course assumes that the cabal of LLM model creators can be broken, and that you don't need to pay for Claude if the cheaper open-source-ish Llama-like alternative is good enough.
As usual, the answer is "it depends". I guarantee though that I'll at least start looking at alternatives when there's a huge price hike.
Also, I suspect that a 100x improvement (if even possible) wouldn't just cost 100 times as much, but probably 100,000+ times as much. I also suspect that an improvement of 100x will be hyped as an improvement of 1,000x at least :)
Regardless, AI is really looking like a commodity to me. While I'm thankful for all the investment that got us here, I doubt anyone investing this late in the game at these inflated numbers is going to see a long-term return (other than Ponzi selling).
ChatGPT has 800M+ weekly active users; how is that comparable to the Metaverse in any way?
One could have said the same thing about Google in 2006
I assume there are a massive number of LLM analysis pipelines out there.
I suppose it depends on whether you consider non-deterministic DS/ML pipelines "load-bearing" or not. Most are not using LLMs though.
3D-printed parts are regularly used beyond prototyping, though, since tooling costs for a small company can be higher than just printing metal parts. So I do somewhat agree, but the loss of productivity in software prototyping would be a massive hit if LLMs vanished.
We have a term we use at work, "directionally accurate", for when something is not entirely accurate but headed in the right direction.
Oh god please no, we must stop this initialism. We've gone too far.
I mean, you _probably_ could make most furniture with only a saw, but why?
It's also true for humans. But then we invented functions / libraries / modules.
It's bad writing practice to do this, even if you're assuming your followers are following along.
Especially for a site like Twitter that has a login wall.
Oh wait... hm ;) perhaps the writing nerds had it right when they recommended always writing the full acronym out the first time it's used in an article, no matter how common one presumes it to be.
Otherwise you can, e.g., just give it a folder of preapproved scripts to run and explain usage in a prompt.
Disclaimer: it does not work well enough. But I think it shows great promise.
[1] https://huggingface.co/papers/2402.01030
[2] https://huggingface.co/papers/2401.00812
[3] https://huggingface.co/papers/2411.01747
I am working on a model that goes a step beyond and even makes the distinction between thinking and code execution unnecessary (it is all computation in the end); unfortunately, no link to share yet.
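For what it's worth, the "folder of preapproved scripts" idea above needs very little glue; a minimal sketch (the folder and script names are hypothetical):

    import subprocess
    from pathlib import Path

    APPROVED = Path("approved_scripts").resolve()  # hypothetical folder of vetted scripts

    def run_approved(name: str, *args: str) -> str:
        """Run a script only if it lives inside the preapproved folder."""
        script = (APPROVED / name).resolve()
        if APPROVED not in script.parents:
            raise PermissionError(f"{name} is outside the approved folder")
        result = subprocess.run(["python", str(script), *args],
                                capture_output=True, text=True, timeout=60)
        return result.stdout

The prompt then just lists the folder contents with a one-line usage note per script.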
The job then becomes identifying those problems and figuring out how to configure a sandbox for them, what tools to provide and how to define the success criteria for the model.
That still takes significant skill and experience, but it's at a higher level than chewing through that problem using trial and error by hand.
My assembly Mandelbrot experiment was the thing that made this click for me: https://simonwillison.net/2025/Jul/2/mandelbrot-in-x86-assem...
I treat an LLM the same way I'd treat myself as it relates to context and goals when working with code.
"If I need to do __________ what do I need to know/see?"
I find that traditional tools, as per the OP, have become ever more powerful and useful in the age of LLMs (especially grep).
Furthermore, LLMs are quite good at working with shell tools and functionalities (heredoc, grep, sed, etc.).
With some host data directories mounted read only inside the VM.
This creates some friction though. It feels like a tool that runs the AI agent in a VM but then copies its output to the host machine after some checks would help, so that it would feel like you are running it natively on the host.
Does this require using big models through their APIs and spending a lot of tokens?
Or can this be done either with local models (probably very slow), or with subscriptions like Claude Code with Pro (without hitting the rate/usage limits)?
I saw the Mandelbrot experiment; it was very cool, but still a rather small project, not really comparable to a complex/bigger/older code base for a platform used in production.
I think it's possible we'll see a local model that can do this well within the next few months though - it needs good tool calling, not an encyclopedic knowledge of the world. Might be possible to fit that in a model that runs locally.
It seems like maybe similar approaches could be used for coding tasks, especially with tool calls for reading man pages, info pages, running `tldr`, specifically consulting Stack Overflow, etc. Some of the recent small MoE models from Chinese companies are significantly smarter than models like Qwen 4B but run about as quickly, so maybe on systems with high RAM or high unified memory, even with middling GPUs, they could be genuinely useful for coding if they are made to avoid doing anything without tool use.
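As a sketch of what that could look like, the lookup tools can be thin subprocess wrappers the model calls instead of relying on memorized knowledge (assumes `man` and `tldr` are installed; the TOOLS registry shape is made up, not any particular agent framework):

    import subprocess

    def lookup(cmd: list[str]) -> str:
        """Run a read-only documentation command and return its output to the model."""
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
        return result.stdout or result.stderr

    # Tools exposed to the model; each one is a lookup, never a mutation.
    TOOLS = {
        "man": lambda topic: lookup(["man", topic]),
        "tldr": lambda topic: lookup(["tldr", topic]),
    }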
Personally, I am convinced JSON is a bad format for LLMs and that code orchestration in a Python-ish DSL is the future. But local models are pretty bad at code gen too.
I wonder if there are any groups/companies out there building something like this.
Would love to have models that only know 1 or 2 languages (e.g. Python + JS) but are great at them and at tool calling. Definitely don't need my coding agent to know all of Wikipedia and translate between 10 different languages.
1. A special code dataset
2. A bunch of "unrelated" books
My understanding is that the model trained on just the first will never beat the model trained on both. The Bloomberg model is my favorite example of this.
If you can squirrel away special data, then that special data plus everything else will beat any other model. But that's basically what OpenAI, Google, and Anthropic are all currently doing.
I’ve been thinking about using LLMs for brute forcing problems too.
Like, LLMs kinda suck at TypeScript generics. They're surprisingly bad at them. Probably because it's easy to write generics that look correct but are then screwy in many scenarios. Which is also why generics are hard for humans.
If you could have any LLM actually use tsc, it could run tests, make sure things are inferring correctly, etc. It could just keep trying until it works. I'm not sure this is a way to produce understandable or maintainable generics, but it would be pretty neat.
Also, while typing this I realized that Cursor can see TypeScript errors. All I need are some utility testing types, and I could have Cursor write the tests and then brute-force the problem!
If I ever actually do this I’ll update this comment lol
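The loop itself would be short, for what it's worth; a sketch with a stand-in ask_llm and a hypothetical types.ts (assumes tsc is available via npx):

    import subprocess
    from pathlib import Path

    def ask_llm(prompt: str) -> str:
        raise NotImplementedError  # stand-in for whatever model API you use

    def typecheck() -> str:
        """Return tsc's error output; empty string means the types check."""
        r = subprocess.run(["npx", "tsc", "--noEmit"], capture_output=True, text=True)
        return r.stdout + r.stderr

    attempt = ask_llm("Write a generic type that ...")
    for _ in range(10):  # brute force: regenerate until tsc is happy
        Path("types.ts").write_text(attempt)
        errors = typecheck()
        if not errors:
            break
        attempt = ask_llm(f"Fix these type errors:\n{errors}\n\nCurrent code:\n{attempt}")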
Your test case seems like a quintessential example where you're missing that last step.
Since it is unlikely that you understand the math behind fractals or x86 assembly (apologies if I'm wrong on this), your only means of verifying the accuracy of your solution is a superficial visual inspection, e.g. "Does it look like the Mandelbrot set?"
Ideally, your evaluation criteria would be expressed as a continuous function, but at the very least, it should take the form of a sufficiently diverse quantifiable set of discrete inputs and their expected outputs.
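Concretely, for the Mandelbrot case, even a handful of points with known membership would beat eyeballing the render; a minimal sketch of such discrete test cases:

    def in_mandelbrot(c: complex, max_iter: int = 100) -> bool:
        """Standard escape-time test: c is in the set if z = z*z + c stays bounded."""
        z = 0
        for _ in range(max_iter):
            z = z * z + c
            if abs(z) > 2:
                return False
        return True

    assert in_mandelbrot(0)            # the origin is in the set
    assert in_mandelbrot(-1)           # center of the period-2 bulb
    assert not in_mandelbrot(1)        # escapes quickly (0, 1, 2, 5, ...)
    assert not in_mandelbrot(2 + 2j)   # far outside the set

An oracle like this lets the agent (or you) check the generated renderer's classifications against ground truth instead of squinting at the output.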
1. https://github.com/davidkimai/Context-Engineering/blob/main/...
(the repo is a WIP book, I've only scratched the surface but it seems pretty brilliant to me)
Consider internal company tools or niche APIs with minimal online documentation. Sure, you could dump all the documentation into context for code generation, but that often requires more context than interacting with an MCP tool. More importantly, generated code for unfamiliar APIs is prone to errors, so you'd need robust testing and retry mechanisms built into the process.
With MCP, if the tools are properly designed and receive correct inputs, they work reliably. The LLM doesn't need to figure out API intricacies, authentication flows, or handle edge cases - that's already handled by the MCP server.
So I agree MCP for GitHub is probably overkill but there are many legitimate use cases where pre-built MCP tools make more sense than asking an LLM to reverse-engineer poorly documented or proprietary systems from scratch.
If that's what you wanted, you could have designed your poorly documented internal API differently to begin with. There's zero advantage to MCP in the scenario you describe, aside from convincing people that their original API is too hard to use.
MCP works exactly that way: you dump documentation into the context. That's how the LLM knows how to call your tool. Even for custom stuff, I noticed that giving the LLM things to work with that it already knows (e.g. Python, JavaScript, Bash) beats it using MCP tool calling, and in some ways it wastes less context.
YMMV, but I found the limit of tools available to be <15 with Sonnet 4. That's a super low amount. Basically, the official Playwright MCP alone is enough to fully exhaust your available tool space.
The Playwright MCP alone introduces 25 tools into the context :(
I don’t think the world needs yet another shell scripting language. They’re all pretty mediocre at best. But maybe this is an opportunity to do something interesting.
The Python environment is a clusterfuck, which uv is rapidly bringing into something somewhat sane. Python isn't the ultimate language, but I'd definitely be more interested in "replace yourself with a uv Python script" over "replace yourself with a shell script". Would be nice to see this used as an opportunity to do better than Bash.
I realize this is unpopular. But unpopular doesn’t mean wrong.
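For what it's worth, the "uv Python script" version of a throwaway automation can carry its own dependencies inline via PEP 723 metadata, which `uv run` understands (the script content and file name here are just illustrative):

    # /// script
    # dependencies = ["requests"]
    # ///
    # Run with: uv run check_health.py  (hypothetical file name)
    import requests

    resp = requests.get("https://example.com/health", timeout=10)
    resp.raise_for_status()
    print(resp.json())

No virtualenv management, no requirements.txt; uv resolves the dependencies when it runs the script.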
Tool composition over stdio will get you very, very far. That's what an interface "from the 80s" does for you 45 years later. That same stdio composability easily pipes into/through any number of CLI tools written in any number of languages, compiled and interpreted.
Except for the fact that actually it is not everywhere.
I suppose if one wanted to be pedantically literal, then you are indeed correct. In every other meaningful consideration, the parent comment is. Maybe not Bash specifically, but #!/bin/sh is broadly available on nearly every connected device on the planet, in some capacity. From the perspective of how we could automate nearly anything, you'd be hard-pressed to find something more universal than a shell script.
99.9% of my 20-year career has been spent on Windows. So bash scripts are entirely worthless and dead to me.
Want to get me talking reverentially about the pioneers of our industry? Talk to me about Doug Engelbart, Xerox PARC and the Macintosh team at Apple. There was some brilliant work!
What did Unix win?
Multics would be an example of a more innovative OS than Unix, but its influence on the OSes we use today has been a lot less.
Seems like that would be a potential way to get self-organizing integrations.
However, I couldn't really understand if he's saying that the Playwright MCP is good to use for your own app, or whether he means that for your own app you should just tell the LLM directly to export Playwright code.
And while running the code might be faster, it's unclear whether that approach scales well. Sending an MCP tool command to click the button that says "X" is something a small local LLM can do. Writing complex code after parsing a significant amount of HTML (for correct selectors, for example) probably needs a managed model.
Claude Code shows that the models can excel at using “old” programmatic interfaces (CLIs) to do Real Work™.
MCP is a way to dynamically provide “new” programmatic interfaces to the models.
At some point this will start to converge, or at least appear to do so, as the majority of tools a model needs will be in its pre-training set.
Then we’ll argue about MPPP (model pre-training protocol pipeline), and how to reduce knowledge pollution of all the LLM-generated tools we’re passing to the model.
Eventually we'll publish the Merriam-Webster Model Tool Dictionary (MWMTD), surfacing all of the approved tools hidden in the pre-training set.
Then the kids will come up with Model Context Slang (MCS), in an attempt to use the models to dynamically choose unapproved tools, for much fun and enjoyment.
Ad infinitum.
This is solved trivially by having default initial prompts. All major tools like Claude Code or Gemini CLI have ways to set them up.
> You pass all your tools to an LLM and ask it to filter it down based on the task at hand. So far, there hasn't been much better approaches proposed.
Why is a "better" approach needed? If modern LLMs can properly figure it out? It's not like LLMs don't keep getting better with larger and larger context length. I never had a problem with an LLM struggling to use the appropriate MCP function on it's own.
> But you run into three problems: cost, speed, and general reliability
- cost: They keep getting cheaper and cheaper. It's ridiculously inexpensive for what those tools provide.
- speed: That seems extremely short-sighted. No one is sitting idle looking at Claude Code in their terminal, and you can have more than one working on unrelated topics; watching it defeats the purpose. No matter how long it takes, the time spent is pure bonus. You don't have to spend time in the loop when asking for well-defined tasks.
- reliability: Seems very prompt-correlated ATM. I guess some people don't know what to ask, which is the main issue.
Having LLMs able to complete tedious tasks involving so many external tools at once is simply amazing, thanks to MCP. Anecdotal, but just today it did a task flawlessly involving: Notion pages, a Linear ticket, git, a GitHub PR, and GitHub CI logs. Being in the loop was just submitting one review on the PR. All the while I was busy doing something else. And for what, ~$1?
That only makes it worse. The MCP tools available all add to the initial context. The more tools, the more of the context is populated by MCP tool definitions.
If that's the case: I understand the knee-jerk reaction, but if it works? Also, what theoretically prevents altering the prompt-chaining logic in these tools to expose only a condensed list of MCP servers, not their whole capabilities, and only inject details based on LLM outputs? Doesn't seem like an insurmountable problem.
Not just some, all. That's just how MCP works.
> If that's the case: I understand the knee-jerk reaction but if it works?
I would not be writing about this if it worked well. The data indicates that it works significantly worse than not using MCP, because of the context rot and the low tool utilization.
No they don't[0]; the cost is just still hidden from you, but the freebies will end just like MoviePass and cheap Ubers.
https://bsky.app/profile/edzitron.com/post/3lsw4vatg3k2b
"Cursor released a $200-a-month subscription then made their $20-a-month subscription worse (worse output, slower) - yet it seems even on Max they're rate limiting people!"
It is not MCP: it is autonomous agents that don't get feedback from smart humans.
It's definitely useful, but you have to read everything. I'm working in a type-safe functional compiled language too. I'd be scared to try this flow in a less "correctness enforced" language.
That being said, I do find that it works well. It's not living up to the hype, but most of that hype was obvious nonsense. It continues to surprise me with its grasp of concepts and is definitely saving me some time, and more importantly making some larger tasks more approachable, since I can split my time better.
The effect is that it's far more efficient at editing Clojure code than any purely string-diff-based approach, and if you write a good test suite it can rapidly iterate back and forth just editing files, reloading them, and then re-running the test suite at the REPL -- just like I would. It's pretty incredible to watch.
It can straight up debug your code, eval individual expressions, document return types of functions. It’s amazing.
It actually makes me think that languages with strong REPLs are a better fit for LLMs than those without. Seeing clojure-mcp do its thing is the most impressive AI feat I've seen since I saw GPT-3 in action for the first time.
This is not an environment where you can establish a durable manifesto
It would be better for MCP to deliver function definitions and let the LLM write little scripts in a simple language.
> So maybe we need to look at ways to find a better abstraction for what MCP is great at, and code generation. For that that we might need to build better sandboxes and maybe start looking at how we can expose APIs in ways that allow an agent to do some sort of fan out / fan in for inference. Effectively we want to do as much in generated code as we can, but then use the magic of LLMs after bulk code execution to judge what we did.
Why not both? I ended up writing an OSS MCP server that securely executes LLM-generated JavaScript using a C# JS interpreter (Jint) and handing it a `fetch` analogue as well as `jsonpath-plus`. Also gave it a built-in secrets manager. Give it an objective and the LLM writes its own code and uses the tool iteratively to accomplish the task (as long as you can interact with it via a REST API).
For well known APIs, it does a fine job generating REST API calls.
You can pretty much do anything with this.
MCP servers might be fun to get an idea for what's possible, and good for one-off mashups, but API calls are generally more efficient and stable, when you know what you want.
Here's the agent I ended up writing: https://github.com/pamelafox/personal-linkedin-agent
LinkedIn has a reputation for being notorious about making it hard to build automations on top of it. Did you run into any roadblocks when building your personal LinkedIn agent?
This would work best if a human is the end consumer of this output, or will receive manual vetting eventually. I'm not sure I'd leave such a system running unsupervised in production ("the Automation at Scale" part mentioned by the OP).
Hooks into the agent's execution lifecycle seem more reliable for deterministic behavior and supervision.
This doesn't take away from the utility of MCP when it comes to Claude Desktop and the likes!
Right now I use it for: DRAFTS of prose things -- and the only real killer in my opinion, autotagging thousands of old bookmarks. But again, that's just to have cool stuff to go back and peruse, not something that must be correct.
Consider a Python function signature:

    list_containers(
        show_stopped: bool = False,
        name_pattern: Optional[str] = None,
        sort: Literal["size", "name", "started_at"] = "name",
    )

It doesn't even need docs.
Now convert this to JSON schema, which is a 4x larger input already.
And when generating output, the LLM will produce almost 2x more tokens too, because of JSON; it's easier to get confused.
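For a sense of the blowup, here is roughly what that one signature becomes as a tool definition (a hand-written approximation as a Python dict, not the output of any particular framework):

    list_containers_tool = {
        "name": "list_containers",
        "description": "List containers",
        "inputSchema": {
            "type": "object",
            "properties": {
                "show_stopped": {"type": "boolean", "default": False},
                "name_pattern": {"type": ["string", "null"], "default": None},
                "sort": {
                    "type": "string",
                    "enum": ["size", "name", "started_at"],
                    "default": "name",
                },
            },
        },
    }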
And consider that the flow of calling Python functions and using their output to call other tools is seen 1000x more often in the training data, whereas JSON tool-calling flows are rare and practically only exist in the instruction-tuning phase. And I'm sure instruction tuning also contains far more complex code examples where the model has to execute complex logic.
Then there's the whole issue of composition. To my knowledge there's no way an LLM can do this in one response:
    vehicle = call_func_1()
    if vehicle.type == "car":
        details = lookup_car(vehicle.reg_no)
    elif vehicle.type == "motorcycle":
        details = lookup_motorcycle(vehicle.reg_no)
How is JSON tool calling going to solve this?

But "the" problem with MCP? IMVHO (very humble, non-expert) the half-baked or missing security aspects are more fundamental. I'd love to hear updates about that from people who know what they're talking about.
The LLM can't just be given this function, because it's specialized to just the two options.
You could have it do a feedback loop of rewriting the Python script after running it, but what's the savings at that point? You're wasting tokens talking about cars in Python when you already know it's a ski, and the LLM could ask directly for the ski details without writing a script to do it in between.
It’s 2025 and this is the epitome of progress.
On the positive side, code generation can be solid if you also have, or can generate, easy-to-read validation or tests for the generated code. Easy for you to read, of course.
mritchie712•6h ago
This is spot on. I have a "devops" folder with a CLAUDE.md with bash commands for common tasks (e.g. find prod / staging logs with this integration ID).
When I complete a novel task (e.g. count all the rows that were synced from stripe to duckdb) I tell Claude to update CLAUDE.md with the example. The next time I ask a similar question, Claude one-shots it.
This is the first few lines of the CLAUDE.md
lsaferite•6h ago
Edit: First result when looking for such an MCP Server: https://github.com/inercia/MCPShell
wrs•3h ago
To an LLM there’s not much difference between the list of sample commands above and the list of tool commands it would get from an MCP server. JSON and GNU-style args are very similar in structure. And presumably the command is enforcing constraints even better than the MCP server would.
light_hue_1•4h ago
Because now the capabilities of the model grow over time. And I can ask questions that involve a handful of those snippets. When we get to something new that requires some doing, it becomes another snippet.
I can offload everything I used to know about an API and never have to think about it again.
mritchie712•4h ago
I don't have a snippet for, "find all 500's for the meltano service for duckdb syntax errors", but it'd easily nail that given the existing examples.
dingnuts•3h ago
In the other cases I see what the computer outputs, LEARN, and then the functionality of finding what I need just isn't useful next time. Next time I just type the command.
I don't get it.
loudmax•2h ago
For example, I have a pretty good grasp of regular expressions because I'm an old Perl programmer, but I find processing json using `jq` utterly baffling. LLMs are great at coming up with useful examples, and sometimes they'll even get it perfect the first time. I've learned more about properly using `jq` with the help of LLMs than I ever did on my own. Same goes for `ffmpeg`.
LLMs are not a substitute for learning. When used properly, they're an enhancement to learning.
Likewise, never mind the idiot CEOs of failing companies looking forward to laying off half their workforce and replacing them with AI. When properly used, AI is a tool to help people become more productive, not replace human understanding.
stpedgwdgfhgdd•2h ago
To tackle this, I converted a custom prompt into an application, but there is an interesting trade-off. The application is deterministic. It cannot deal with unknown situations. In contrast to CC, which is way slower, but can try alternative ways of dealing with an unknown situation.
I ended up adding an instruction to the custom command to run the application and fix the application code (TDD) if there is a problem. Self-healing software… who would have thought?