All the simple stuff (creating a repo, pushing, frontend edits, testing, Docker images, deployment, etc.) is automated. For the difficult parts, you can just use free Grok to one-shot small code files. It works great if you force yourself to keep the amount of code minimal and modular. They also make for great UIs: you can create smart programs with just a CLI + MCP servers + MD files. Truly amazing tech.
Sometimes I can't really tell.
Input: $0.28 / 1M tokens (cache miss); Output: $0.42 / 1M tokens
Via synthetic (which otherwise looks cool):
Input: $0.56 / 1M tokens; Output: $1.68 / 1M tokens
So 2-4x better value through https://platform.deepseek.com (2x on input, 4x on output)
(Granted Synthetic gives you way more models to choose from, including ones that don't parrot CPC/PLA propaganda and censor)
Unfortunately it doesn't support local models, but they're too slow for coding anyway.
GLM is maybe slightly weaker on average, but it has also solved problems where both CC and Codex got stuck in endless failure loops, so for the price it's nice to have in my back pocket. I also sometimes see tool-use failures that it always works around, which I'm guessing are due to slight differences from Claude.
Compared to the Anthropic offering it's night and day. Claude gets on with the job and makes me way more productive.
Which model were you using? In my experience Gemini 2.5 Pro is just as good as Claude Sonnet 4 and 4.5. It's literally what I use as a fallback to wrap something up if I hit the 5 hour limit on Claude and want to just push past some incomplete work.
I'm just going to throw this out there. I get good results from a truly trash model like gpt-oss-20b (quantized at 4bits). The reason I can literally use this model is because I know my shit and have spent time learning how much instruction each model I use needs.
Would be curious what you're actually having issues with if you're willing to share.
It's just strange to me that my experience seems to be the polar opposite of yours.
I can one-shot new webapps in Claude and Codex and can't in Gemini Pro.
It's OK for documentation or small tasks, but consistently fails at tasks that both Claude and Codex succeed at.
You can use Git hooks to do that. If you have tests and one fails, spawn an instance of Claude with a prompt: claude -p 'tests/test4.sh failed, look in src/ and try and work out why'
$ claude -p 'hello, just tell me a joke about databases'
A SQL query walks into a bar, walks up to two tables and asks, "Can I JOIN you?"
$
Or, if you use Gogs locally, you can add a Gogs hook to do the same on pre-push:
> An example hook script to verify what is about to be pushed. Called by "git push" after it has checked the remote status, but before anything has been pushed. If this script exits with a non-zero status nothing will be pushed.
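A minimal sketch of what such a hook could look like (assuming a tests/ directory of shell-script tests and the claude CLI on the PATH; the paths and prompt wording are illustrative, not anyone's actual setup):

    #!/bin/sh
    # .git/hooks/pre-push -- run the tests before pushing; on failure,
    # hand the failing test to Claude in non-interactive (-p) mode for a
    # first-pass diagnosis, then abort the push.
    for t in tests/*.sh; do
        if ! sh "$t"; then
            claude -p "$t failed, look in src/ and try and work out why"
            exit 1  # a non-zero exit aborts the push
        fi
    done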
I like this idea. I think I shall get Claude to work out the mechanism itself :)
It is even a suggestion on this Claude cheat sheet:
https://www.howtouselinux.com/post/the-complete-claude-code-...
The only thing I imagine might be a problem is Claude demanding a login token, as that happens quite regularly.
> Three out of three one-shot debugging hits with no help is extremely impressive. Importantly, there is no need to trust the LLM or review its output when its job is just saving me an hour or two by telling me where the bug is, for me to reason about it and fix it.
The approach described here could also be a good way for LLM-skeptics to start exploring how these tools can help them without feeling like they're cheating, ripping off the work of everyone whose code was used to train the model, or taking away the most fun part of their job (writing code).
Have the coding agents do the work of digging around hunting down those frustratingly difficult bugs - don't have them write code on your behalf.
Why? Bug hunting is more challenging and cognitively intensive than writing code.
For the latter, the good news is that you’re free to use LLMs for debugging or completely ignore them.
Which low level code base have you tried this latest tool on? Official Anthropic commercials do not count.
LLMs can generate content but not really write; out of the box they tend to be quite verbose and produce a lot of proforma content. Perhaps with the right kind of prompts and a lot of editing and review you can get them to be good, but at that point it is almost the same as writing it yourself.
It is a hard choice between lower-quality documentation (AI slop?) and leaving things lightly or fully undocumented. The uncanny valley of precision in documentation may be acceptable in some contexts, but it can be dangerous in others, and it is harder to differentiate because the depth of a doc means nothing now.
Over time we find ourselves skipping LLM-generated documentation just like any other AI slop. The value and emphasis placed on reading documentation erodes, finding good documentation becomes harder (like other online content today), and documentation as a whole gets devalued.
And I find that even the auto-generated stuff tends to sit at least a bit higher in abstraction than staring at the code itself, and helps you more like a "SparkNotes" version of the code, so that when you dig in yourself you have an outline/roadmap.
Even worse, the model you let it build in your head of the space it describes can lead to chains of incorrect reasoning that waste time and make debugging Sisyphean.
Like there is some value there, but I wonder how much of it is just (my own) feelings, and whether I'm correctly accounting for the fact that I'm being confidently lied to by a damn computer on a regular basis.
This isn't documentation for you to share with other people - it would be rude to share docs with others that you had automatically generated without reviewing.
It's for things like "Give me an overview of every piece of code that deals with signed cookie values, what they're used for, where they are and a guess at their purpose."
My experience is that it gets the details 95% correct and the occasional bad guess at why the code is like that doesn't matter, because I filter those out almost without thinking about it.
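As a sketch, that kind of question can be fired off non-interactively with the same claude -p print mode shown elsewhere in the thread (the prompt is just the example above):

    claude -p "Give me an overview of every piece of code that deals with signed cookie values: what they're used for, where they are, and a guess at their purpose."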
After reading one of your blog posts recommending it, I decided to specifically give them a try as bug hunters/codebase explainers instead, and I’ve been blown away. Several hard-to-spot production bugs down in two weeks or so that would have all taken me at least a few focused hours to spot all in all.
They're also better at making tests for algorithmic things than for concurrency situations, but can get pretty close. They just usually don't have great out-of-the-box ideas for "how to ensure these two different things run in the desired order."
Everything that I dislike about generating non-greenfield code with LLMs isn't relevant to the "make tests" or "debug something" usage. (Weird/bad choices about when to duplicate code vs refactor things, lack of awareness around desired "shape" of codebase for long-term maintainability, limited depth of search for impact/related existing stuff sometimes, running off the rails and doing almost-but-not-quite stuff that ends up entirely the wrong thing.)
For code modifications in a large codebase, the problem with multi-shot is that it doesn't take too many iterations before I've spent more time on it than doing it myself. At least for tasks where I'm trying to be lazy or save time.
I've found voice input to completely change the balance there.
For stuff that isn't urgent, I can just fire off a hosted codex job by saying what I want done out loud. It's not super often that it completely nails it, but it almost always helps give me some info on where the relevant files might be and a first pass on the change.
Plus it has the nice side effect of being a todo list of quick stuff that I didn't want to get distracted by while working on something else, and often helps me gather my thoughts on a topic.
It's turned out to be a shockingly good workflow for me
I agree that the popular "one shot at all costs / end the chat at the first whiff of a mistake" advice is much too reductive, but unlike with a colleague, after putting in all that effort to develop a shared mental model of the desired outcome, you hit the max context and all that nuanced understanding instantly evaporates. You then have to hope the lossy compression into text instructions will actually steer it where you want next time, but from experience that is unfortunately far from certain.
The end result is these robots bikeshedding. When paired with junior engineers looking at this output and deciding to act on it, it just generates busywork. It doesn't help that everyone and their dog wants to automatically run their agent against PRs now.
I'm trying to use these to some extent when I find myself in a canonical situation where they should work, and I'm not getting the value everyone else seems to get in many cases. Very much an "explaining a thing to a junior engineer takes more time than doing it myself" situation, except at least the junior is a person.
Notably, these walls are never where I expect them to be—despite my best efforts, I can't find any sort of pattern. LLMs can find really tricky bugs and get completely stuck on relatively simple ones.
Practically, though, how would someone become good at just the skills LLMs don't do well? Much of this discussion is about how that's difficult to predict, but even if you were a reliable judge of what sort of coding tasks LLMs would fail at, I'm not sure it's possible to only be good at that without being competent at it all.
This is, in fact, why we teach kids math that calculators could handle!
We don't teach kids how to use an abacus or a slide rule. But we teach positional representations and logarithms.
The goal is to teach the theoretical concepts so you can learn the required skills if necessary. The same will occur with code.
You don't need to memorize the syntax to write a for loop or for each loop, but you should understand when you might use either and be able to look up how to write one in a given language.
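For instance, a quick illustration in bash (just one language where the distinction shows up; the loop bodies are placeholders):

    # Counted for loop: you know up front how many iterations you need.
    for ((i = 0; i < 3; i++)); do echo "attempt $i"; done

    # For-each loop: you iterate over whatever items happen to exist.
    for f in tests/*.sh; do echo "found test: $f"; done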
There are a growing set of problems which feel like using a calculator for basic math to me.
But also school is a whole other thing which I'm much more worried about with LLMs. Because there's no doubt in my mind I would have abused AI every chance I got if it were around when I was a kid, and I wouldn't have learned a damn thing.
And I hated mental math exercises as a kid.
If at the first step I'm already dealing with a robot in the weeds, I will have to spend time getting it out of the weeds, all for uncertain results afterwards.
Now sometimes things are hard and tricky, and you might still save time... but just on an emotional level, it's unsatisfying
Again, worst case all you wasted was your time, and now you've bounded that.
I've found that having local clones of large library repos (or telling it to look in the environment for packages) is far more effective than relying on built-in knowledge or lousy web search. It can also use ast-grep on those. For some reason the agent frameworks are still terrible about looking up references in a sane way (where in an IDE you would simply go to declaration).
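For instance, a hypothetical structural search over a locally cloned dependency (assuming ast-grep is installed; the pattern, language, and path are made up for illustration):

    # Find call sites of request(...) in the cloned library's source,
    # instead of relying on the model's possibly stale built-in knowledge.
    ast-grep run --pattern 'request($$$ARGS)' --lang python ~/src/some-library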
Alternatively, if it is in an area with good test coverage, let it go fix the minor stuff.
EXCEPT…
What did you have for AI three years ago? Jack fucking shit is what.
Why is “wow that’s cool, I wonder what it’ll turn into” a forbidden phrase, but “there are clearly no experts on this topic but let me take a crack at it!!” important for everyone to comment on?
One word: Standby. Maybe that’s two words.
If you find yourself saying the same thing every year and adding 1 to the total...
+1 Juniors can learn over time.
But you literally still are. If you weren't, it should be trivially easy to create these models without using huge swathes of non-public-domain code. Right?
If someone scraped every photo on the internet (along with their captions) and used the data to create a model that was used purely for accessibility purposes - to build tools which described images to people with visual impairments - many people would be OK with that, where they might be justifiably upset at the same scraped data being used to create an image-generation model that competes with the artists whose work it was trained on.
Similarly, many people were OK with Google scraping the entire internet for 20+ years to build a search engine that helps users find their content, but are unhappy about an identical scrape being used to train a generative AI model.
Search engines help website owners, they don't hurt them. Whether the goal of a website is to inform people, build reputation or make money, search engines help with that. (Unless they output an excerpt so large visiting your website is no longer necessary. There have been lawsuits about that.)
LLMs take other people's work and regurgitate a mixed/mangled (verbatim or not does not matter) version without crediting/compensating the original authors and which cannot easily be tracked to any individual authors even if you actively try.
---
LLMs perform no work (creative or otherwise), no original research, have no taste - in fact they have no anchor to the real world except the training data. Literally everything they output is based on the training data which took possibly quadrillions of hours of _human work_ and is now being resold without compensating them.
Human time and natural resources are the only things with inherent value and now human time is being devalued and stolen.
Also, hunting for bugs is often a very good way to get intimately familiar with the architecture of a system which you don't know well, and furthermore it improves your mental model of the cause of bugs, making you a better programmer in the future. I can spot a possible race condition or unsafe alien call at a glance. I can quickly identify a leaky abstraction, and spot mutable state that could be made immutable. All of this because I have spent time fixing bugs that were due to these mistakes. If you don't fix other people's bugs yourself, I fear you will also end up relying on an LLM to make judgements about your own code to make sure that it is bug-free.
1/5 times they get it wrong and I might waste a minute or two confirming what they missed. I can live with those odds.
Before I used Claude, I would be surprised.
I think it works because Claude takes some standard coding issues and systematizes them. The list is long, but Claude doesn't run out of patience like a human being does. Or at least it has some credulity left after trying a few initial failed hypotheses. This being a cryptography problem helps a little bit, in that there are very specific keywords that might hint at a solution, but from my skim of the article, it seems like it was mostly a good old coding error, taking the high bits twice.
The standard issues are just a vague laundry list:
- Are you using the data you think you're using? (Bingo for this one)
- Could it be an overflow?
- Are the types right?
- Are you calling the function you think you're calling? Check internal, then external dependencies
- Is there some parameter you didn't consider?
And a bunch of others. When I ask Claude for a debug, it's always something that makes sense as a checklist item, but I'm often impressed by how it diligently followed the path set by the results of the investigation. It's a great donkey, really takes the drudgery out of my work, even if it sometimes takes just as long.
It very much does! I had a debugging session with Claude Code today, and it was about to give up with the message along the lines of “I am sorry I was not able to help you find the problem”.
It took some gentle cheering (pretty easy, just saying “you are doing an excellent job, don’t give up!”) and encouragement, and a couple of suggestions from me on how to approach the debug process for it to continue and finally “we” (I am using plural here because some information that Claude “volunteered” was essential to my understanding of the problem) were able to figure out the root cause and the fix.
Context Usage • Used: 112K/200K tokens (56%) • Remaining: 88K tokens • Sufficient for continued debugging, but fresh session recommended for clarity
lol. I said ok use a subagent for clarity.
I've flat out had Claude tell me its task was getting tedious, and it will often grasp at straws to use as excuses for stopping a repetitive task and moving on to something else.
Keeping it on task when something keeps moving forward is easy, but when it gets repetitive it takes a lot of effort to make it stick with it.
Some global rules will generally keep it on track, though: telling it to ask me before it simplifies or gives up, and asking it frequently to ask me clarifying questions, which generally also helps keep it chugging in the right direction and uncovers gaps in its understanding.
Quite different if you are not a cryptographer or a domain expert.
If you really want to understand what the limitations are of the current frontier models (and also really learn how to use them), ask the AI first.
By throwing things over the wall to the AI first, you learn what it can do at the same time as you learn how to structure your requests. The newer models are quite capable and in my experience can largely be treated like a co-worker for "most" problems. That being said.. you also need to understand how they fail and build an intuition for why they fail.
Every time a new model generation comes out, I also recommend throwing away your process (outside of things like lint, etc.) and seeing how the model does without it. I work with people who have elaborate context setups they crafted for less capable models; they are largely unnecessary with GPT-5-Codex and Sonnet 4.5.
Unfortunately, it doesn't quite work out that way.
Yes, you will get better at using these tools the more you use them, which is the case with any tool. But you will not learn what they can do as easily, or at all.
The main problem with them is the same one they've had since the beginning. If the user is a domain expert, then they will be able to quickly spot the inaccuracies and hallucinations in the seemingly accurate generated content, and, with some effort, coax the LLM into producing correct output.
Otherwise, the user can be easily misled by the confident and sycophantic tone, and waste potentially hours troubleshooting, without being able to tell if the error is on the LLM side or their own. In most of these situations, they would've probably been better off reading the human-written documentation and code, and doing the work manually. Perhaps with minor assistance from LLMs, but never relying on them entirely.
This is why these tools are most useful to people who are already experts in their field, such as Filippo. For everyone else who isn't, and actually cares about the quality of their work, the experience is very hit or miss.
> That being said.. you also need to understand how they fail and build an intuition for why they fail.
I've been using these tools for years now. The only intuition I have for how and why they fail is when I'm familiar with the domain. But I had that without LLMs as well, whenever someone is talking about a subject I know. It's impossible to build that intuition with domains you have little familiarity with. You can certainly do that by traditional learning, and LLMs can help with that, but most people use them for what you suggest: throwing things over the wall and running with it, which is a shame.
> I work with people who have elaborate context setups they crafted for less capable models; they are largely unnecessary with GPT-5-Codex and Sonnet 4.5.
I haven't used GPT-5-Codex, but have experience with Sonnet 4.5, and it's only marginally better than the previous versions IME. It still often wastes my time, no matter the quality or amount of context I feed it.
As for the building intuition, perhaps I am over-estimating what most people are capable of.
Working with and building systems using LLMs over the last few years has helped me build a pretty good intuition about what is breaking down when the model fails at a task. While having an ML background is useful in some very narrow cases (like: 'why does an LLM suck at ranking...'), I "think" a person can get a pretty good intuition purely based on observational outcomes.
I've been wrong before though. When we first started building LLM products, I thought, "Anyone can prompt, there is no barrier for this skill." That was not the case at all. Most people don't do well trying to quantify ambiguity, specificity, and logical contradiction when writing a process or set of instructions. I was REALLY surprised how I became a "go-to" person to "fix" prompt systems, all based on linguistics and systematic process decomposition. Some of this was understanding how the auto-regressive attention system benefits from breaking the work down into steps, but really most of it was just "don't contradict yourself and be clear".
Working with them extensively has also helped me hone in on how the models get "better" with each release, though most of my expertise is with the OpenAI and Anthropic model families.
I still think most engineers "should" be able to build intuition generally on what works well with LLMs and how to interact with them, but you are probably right. It will be just like most ML engineers, who see something work in a paper and then just paste it onto their model with no intuition about what it structurally changes in the model's dynamics.
No take on the rest of your comment, but it’s the nature of software engineering that we work on a breadth of problems. Nobody can be a domain expert in everything.
For example: I use a configurable editor every day, but I’m not a domain expert in the configuration. An LLM wasted an hour of my day pointing me in “almost the right direction” when after 10 minutes I really needed to RTFM.
I am a domain expert in some programming languages, but now I need to implement a certain algorithm… I’m not an expert in that algorithm. There’s lots of traps for the unwary.
I just wanted to challenge the assumption that we are all domain experts in the things we do daily. We are, but … with limitations.
A typical programmer works within unfamiliar domains all the time. It's not just about being familiar with the programming language or tooling. Every project potentially has new challenges you haven't faced before, new APIs to evaluate and design, new tradeoffs to consider, etc.
The less familiar you are with the domain or API, the less instincts and influence you have to steer the LLM in the right direction, and the more inclined you are to trust the tool over yourself. So when the tool is wrong, as it often still is, you can spend a lot of time fighting with it to produce the correct output.
The example in the article is actually the best case scenario for these tools. It's essentially pattern matching using high quality code, from someone who's deeply familiar with the domain and the code they've written. The experience of someone unfamiliar trying to implement the same algorithm from scratch by relying on LLMs would be vastly different.
I'm constantly reviewing things that I am not a domain expert on. I have to identify what is risky, what I don't know, etc. Throwing to the AI first is no different than throwing to someone else first. I have the same requirements. Now I can choose how much I "trust" the person or LLM. I have had coworkers I trust less than LLMs.. I'll put it that way.
So just like with reviewing a co-worker.. pay attention to areas you are not sure what the right method is and maybe double-check it. This just isn't a "new" thing.
A competent human engineer won't delude you with claims not based in reality, and be confident about it. They can be wrong about practical ways of accomplishing something, but they won't suggest using APIs that don't exist, or go off on wild tangents because a certain word was mentioned. They won't give a different answer whenever you ask them the same question. Most importantly, conversations with humans can be productive in ways that both parties gain a deeper understanding of the topic and respect for each other. Humans can actually think and reason about topics and ideas, they can actually verify their and your claims, and they won't automatically respond with "You're right!" at any counterargument or suggestion.
Furthermore, the marketing around "AI" is strongly based on promoting their superhuman abilities. If we're led to believe that these are superintelligent machines, we're more inclined to trust their output. We have people using them as medical professionals, thinking that they're talking to a god, and being influenced by them. Trusting them to produce software is somewhere on that scale. All of this is highly misleading and potentially dangerous.
Any attempt at anthropomorphizing "AI" is a mistake. You can get much more out of them by using them as what they are: excellent pattern matching probabilistic tools.
It gave me horribly inefficient or long-winded ways of doing it. In the time it took for "prompt tuning" I could have just written the damn code myself. It decreased the confidence for anything else it suggested about things I didn't already know about.
Claude still sometimes insists that iOS 26 isn't out yet. sigh.. I suppose I just have to treat it as an occasional alternative to Google/StackOverflow/Reddit for now. No way would I trust it to write an entire class let alone an app and be able to sleep at night (not that I sleep at night, but that's besides the point)
I think I prefer Xcode's built-in local model approach better, where it just offers sane autocompletions based on your existing code. e.g. if you already wrote a Dog class it can make a Cat class and change `bark()` to `meow()`
How would you imagine an AI system working that didn't make mistakes like that?
iOS 26 came out on September 15th.
LLMs aren't omniscient or constantly updated with new knowledge. Which means we have to figure out how to make use of them despite them not having up-to-the-second knowledge of the world.
I mean, if the user says "Use the latest APIs as of version N" and the AI thinks version N isn't out yet, then it should CHECK on the web first, it's right there, before second guessing the user. I didn't ask it whether 26 was out or not. I told it.
Oh, but I guess AIs aren't allowed to have free use of Google's web search or to scrape other websites, eh?
> iOS 26 came out on September 15th.
It was in beta all year and the APIs were publicly available on Apple's docs website. If I told it to use version 26 APIs then it should just use those instead of gaslighting me.
> LLMs aren't omniscient or constantly updated with new knowledge.
So we shouldn't use them if we want to make apps with the latest tech? Despite what the AI companies want us to believe.
You know, on a more general note, I think all AIs should have a toggle between "Do as I say" (Monkey Paw) and "Do what I mean"
Different harnesses have different search capabilities.
If I'm doing something that benefits from search I tend to switch to ChatGPT because I know it has a really good search feature available to it. I don't trust Claude's as much.
I feel like the article is giving out very bad advice which is going to end up shooting someone in the foot.
AI are very capable heuristics tools. Being able to "sniff test" things blind is their specialty.
i.e. Treat them like an extremely capable gas detector that can tell you there is a leak and where in the plumbing it is, not a plumber who can fix the leak for you.
The author uses an LLM to find bugs and then throw away the fix and instead write the code he would have written anyway. This seems like a rather conservative application of LLMs. Using the 'shooting someone in the foot' analogy - this article is an illustration of professional and responsible firearm handling.
Related, lately I've been getting tons of Anthropic Instagram ads; they must be near a quarter of all the sponsored content I see for the last month or so. Various people vibe coding random apps and whatnot using different incarnations of Claude. Or just direct adverts to "Install Claude Code." I really have no idea why I've been targeted so hard, on Instagram of all places. Their marketing team must be working overtime.
Developers find Claude Code extremely useful (once they figure out how to use it). Many developers subscribe to their $200/month plan. Assuming that's profitable (and I expect it is, since even for that much money it cuts off at a certain point to avoid over-use) Anthropic would be wise to spend a lot of money on marketing to try and grow their paying subscriber base for it.
I suspect that a lot of the “try using Claude code” feedback is just another version of “you’re holding it wrong” by people who have never tried VSC (parent is not in this group however). If you’re bought into a particular model system, of course, it might make more sense to use their own tool.
Edit: I will say that if you’re the YOLO type who wants your bots to be working a bunch of different forks in parallel, VSC isn’t so great for that.
Even if there's some slight immediate performance advantage for Cursor over GHC, the ability to trivially switch models more than makes up for it, IMO.
Also, as a heavy user of both, there are small paper cuts that seriously add up with Copilot. Things that are missing, like subagents. The options and requests for feedback that CC can give (interactive picker style instead of prompt based). Most annoyingly, commands running in a new integrated VS Code terminal instance and immediately, mistakenly "finishing" even though execution has just begun.
It's just a better harness than Copilot. You should give it a shot for a while and see how you like it! I'm not saying it's the best for everybody. At the end of the day these issues turn into something like the old vi/emacs wars.
Not sponsored, just a heavy user of both. Claude code is not allowed at work, so we use copilot. I purchased cc for my side projects and pay for the $125/m plan for now.
It also lacks a lot of the “features” of CC or Codex cli, like hooks, subagents, skills, or whichever flavor of the month you are getting value out of (I am finding skills really useful).
It also has models limited to 128k context - even sonnet - which under claude has (iirc) a million tokens. It can become a bottleneck if you aren’t careful.
We are stuck with VS Code at $job, and so are making it work, but I really fly on personal projects at home using the "Swiss army knife".
There are of course good reasons for some to prefer an IDE as well; it has strengths, like much more permissive limits and predictable cost.
I don’t feel like paying for a max level subscription, but am trying out MCP servers across OpenAI, Anthropic etc so I pay for the access to test them.
When my X hour token allotment runs out on one model I jump to the next closing Codex and opening Claude code or whatever together with altering my prompting a tiny bit to fit the current model.
Being so extremely fungible should by definition be a race to zero margins and about zero profit being made in the long run.
I suppose they can hope to make bank the next 6-12 months but that doesn’t create a longterm sustainable company.
I guess they can try building context to lock me in by increasing the cost to switch - but today I break that every 3-4 prompts by clearing the context, because I know the output will be worse if I keep going.
The challenge is definitely in the competition though. GPT-5-Codex offered _very_ real competition for Claude Sonnet 4 / Opus 4 / Opus 4.1 - for a few weeks OpenAI were getting some of those subscribers back until Sonnet 4.5 landed. I expect that will happen frequently.
LLMs built by trillion dollar companies will do it for me.
Except they regularly come up with "explanations" that are completely bogus and may actually waste an hour or two. Don't get me wrong, LLMs can be incredibly helpful for identifying bugs, but you still have to keep a critical mindset.
I agree though, LLMs can be incredible debugging tools, but they are also incredibly gullible and love to jump to conclusions. The moment you turn your own fleshy brain off is when they go to lala land.
But that's what I meant! Just recently I asked an LLM about a weird backtrace and it pointed me the supposed source of the issue. It sounded reasonable and I spent 1-2 hours researching the issue, only to find out it was a total red herring. Without the LLM I wouldn't have gone down that road in the first place.
(But again, there have been many situations where the LLM did point me to the actual bug.)
I also agree that many more times the LLM is like a blood hound leading me to the right thing (which makes it all the more annoying the few times when it chases a red herring).
You can build this pretty easily: https://github.com/jasonjmcghee/claude-debugs-for-you
qsort•13h ago
> As ever, I wish we had better tooling for using LLMs which didn’t look like chat or autocomplete
I think part of the reason why I was initially more skeptical than I ought to have been is because chat is such a garbage modality. LLMs started to "click" for me with Claude Code/Codex.
A "continuously running" mode that would ping me would be interesting to try.
cmrdporcupine•13h ago
A more Socratic method, and more "augmentic" than "agentic".
Hell, if anybody has investment money and energy and shares this vision I'd love to work on creating this tool with you. I think these models are being misused right now in attempt to automate us out of work when their real amazing latent power is the intuition that we're talking about on this thread.
Misused they have the power to worsen codebases by making developers illiterate about the very thing they're working on because it's all magic behind the scenes. Uncorked they could enhance understanding and help better realize the potential of computing technology.
mccoyb•13h ago
What are your motivations?
Interested in your work: from your public GitHub repos, I'm perhaps most interested in `moor` -- as it shares many design inclinations that I've leaned towards in thinking about this problem.
cmrdporcupine•12h ago
I'm off work right now, between jobs and have been working 10, 12 hours a day on it. That will shortly have to end. I applied for a grant and got turned down.
My motivations come down to making a living doing the things I love. That is increasingly hard.
reachableceo•10h ago
I’ve found that using some high level direction / language and sharing my wants / preferences for workflow and interaction works very well.
I don’t think that you can find an off-the-shelf system to do what you want. I think you have to customize it to your own needs as you go.
Kind of like how you customize emacs as it’s running to your desires.
I’ve often wondered if you could put a mini LLM into emacs or vscode and have it implement customizations :)
imiric•11h ago
But on the other hand, given what I know about these tools and how error-prone they are, I simply refuse to give them access to my system, to run commands, or to do any action for me. Partly due to security concerns, partly due to privacy, but mostly distrust that they will do the right thing. When they screw up in a chat, I can clean up the context and try again. Reverting a removed file or a messed-up Git repo is much more difficult. This is how you get a dropped database during a code freeze...
The idea of giving any of these corporations such privileges is unthinkable for me. It seems that most people either don't care about this, or are willing to accept it as the price of admission.
I experimented with Aider and a self-hosted model a few months ago, and wasn't impressed. I imagine the experience with SOTA hosted models is much better, but I'll probably use a sandbox next time I look into this.
cmrdporcupine•10h ago
If you want open source and want to target something over an API "crush" https://github.com/charmbracelet/crush is excellent
But you should try Claude Code or Codex just to understand them. Can always run them in a container or VM if you fear their idiocy (and it's not a bad idea to fear it)
Like I said in a sibling comment, it's not the right modality. Others agree. I'm a good typist and good at writing, so it doesn't bug me too much, but it does too much without asking or working through it. Sometimes this is brilliant. Other times it's like... c'mon guy, what did you do over there? What Balrog have I disturbed?
It's good to be familiar with these things in any case because they're flooding the industry and you'll be reviewing their code for better or for worse.