All the simple stuff (creating a repo, pushing, frontend edits, testing, Docker images, deployment, etc.) is automated. For the difficult parts, you can just use free Grok to one-shot small code files. It works great if you force yourself to keep the amount of code minimal and modular. They also make for great UIs: you can create smart programs with just a CLI + MCP servers + MD files. Truly amazing tech.
Sometimes I can't really tell.
Input: $0.28 / 1M tokens (cache miss); Output: $0.42 / 1M tokens
Via synthetic (which otherwise looks cool):
Input: $0.56 / 1M tokens; Output: $1.68 / 1M tokens
So 2-4x better value through https://platform.deepseek.com (2x on input, 4x on output)
(Granted Synthetic gives you way more models to choose from, including ones that don't parrot CPC/PLA propaganda and censor)
Unfortunately it doesn't support local models, but they're too slow for coding anyway.
GLM is maybe slightly weaker on average, but it has also solved problems where both CC and Codex got stuck in endless failure loops, so for the price it's nice to have in my back pocket. I also sometimes see tool-use failures that it always works around, which I'm guessing are due to slight differences from Claude.
Compared to the Anthropic offering it's night and day. Claude gets on with the job and makes me way more productive.
Which model were you using? In my experience Gemini 2.5 Pro is just as good as Claude Sonnet 4 and 4.5. It's literally what I use as a fallback to wrap something up if I hit the 5 hour limit on Claude and want to just push past some incomplete work.
I'm just going to throw this out there. I get good results from a truly trash model like gpt-oss-20b (quantized at 4bits). The reason I can literally use this model is because I know my shit and have spent time learning how much instruction each model I use needs.
Would be curious what you're actually having issues with if you're willing to share.
It's just strange to me that my experience seems to be the polar opposite of yours.
I can one-shot new webapps in Claude and Codex and can't in Gemini Pro.
It's OK for documentation or small tasks, but consistently fails at tasks that both Claude and Codex succeed at.
You can use Git hooks to do that. If you have tests and one fails, spawn an instance of Claude with a prompt: claude -p 'tests/test4.sh failed, look in src/ and try and work out why'
$ claude -p 'hello, just tell me a joke about databases'
A SQL query walks into a bar, walks up to two tables and asks, "Can I JOIN you?"
$
Or, if you use Gogs locally, you can add a Gogs hook to do the same on pre-push:
> An example hook script to verify what is about to be pushed. Called by "git push" after it has checked the remote status, but before anything has been pushed. If this script exits with a non-zero status nothing will be pushed.
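A minimal sketch of what such a hook could look like (assuming a tests/ directory of shell-script tests and the claude CLI on the PATH; the paths and prompt wording are illustrative, not anyone's actual setup):

    #!/bin/sh
    # .git/hooks/pre-push -- run the tests before pushing; on failure,
    # hand the failing test to Claude in non-interactive (-p) mode for a
    # first-pass diagnosis, then abort the push.
    for t in tests/*.sh; do
        if ! sh "$t"; then
            claude -p "$t failed, look in src/ and try and work out why"
            exit 1  # a non-zero exit aborts the push
        fi
    done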
I like this idea. I think I shall get Claude to work out the mechanism itself :)
It is even a suggestion on this Claude cheat sheet:
https://www.howtouselinux.com/post/the-complete-claude-code-...
The only thing I imagine might be a problem is Claude demanding a login token, as that happens quite regularly.
> Three out of three one-shot debugging hits with no help is extremely impressive. Importantly, there is no need to trust the LLM or review its output when its job is just saving me an hour or two by telling me where the bug is, for me to reason about it and fix it.
The approach described here could also be a good way for LLM-skeptics to start exploring how these tools can help them without feeling like they're cheating, ripping off the work of everyone whose code was used to train the model, or taking away the most fun part of their job (writing code).
Have the coding agents do the work of digging around hunting down those frustratingly difficult bugs - don't have them write code on your behalf.
Why? Bug hunting is more challenging and cognitively intensive than writing code.
For the latter, the good news is that you’re free to use LLMs for debugging or completely ignore them.
Which low level code base have you tried this latest tool on? Official Anthropic commercials do not count.
LLMs can generate content but not really write; out of the box they tend to be quite verbose and produce a lot of proforma content. Perhaps with the right kind of prompts and a lot of editing and review you can get them to be good, but at that point it is almost the same as writing it yourself.
It is a hard choice between lower-quality documentation (AI slop?) and leaving things lightly or fully undocumented. The uncanny valley of precision in documentation may be acceptable in some contexts, but it can be dangerous in others, and it is harder to differentiate because the depth of a doc means nothing now.
Over time we find ourselves skipping LLM-generated documentation just like any other AI slop. The value and emphasis placed on reading documentation erodes, finding good documentation becomes harder (like other online content today), and documentation as a whole gets devalued.
And I find that even the auto-generated stuff tends to sit at least a bit higher in abstraction than staring at the code itself, and helps you more like a "SparkNotes" version of the code, so that when you dig in yourself you have an outline/roadmap.
Even worse, the model you let it build in your head of the space it describes can lead to chains of incorrect reasoning that waste time and make debugging Sisyphean.
Like there is some value there, but I wonder how much of it is just (my own) feelings, and whether I'm correctly accounting for the fact that I'm being confidently lied to by a damn computer on a regular basis.
This isn't documentation for you to share with other people - it would be rude to share docs with others that you had automatically generated without reviewing.
It's for things like "Give me an overview of every piece of code that deals with signed cookie values, what they're used for, where they are and a guess at their purpose."
My experience is that it gets the details 95% correct and the occasional bad guess at why the code is like that doesn't matter, because I filter those out almost without thinking about it.
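As a sketch, that kind of question can be fired off non-interactively with the same claude -p print mode shown elsewhere in the thread (the prompt is just the example above):

    claude -p "Give me an overview of every piece of code that deals with signed cookie values: what they're used for, where they are, and a guess at their purpose."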
After reading one of your blog posts recommending it, I decided to specifically give them a try as bug hunters/codebase explainers instead, and I’ve been blown away. Several hard-to-spot production bugs down in two weeks or so that would have all taken me at least a few focused hours to spot all in all.
They're also better at making tests for algorithmic things than for concurrency situations, but can get pretty close. They just usually don't have great out-of-the-box ideas for "how to ensure these two different things run in the desired order."
Everything that I dislike about generating non-greenfield code with LLMs isn't relevant to the "make tests" or "debug something" usage. (Weird/bad choices about when to duplicate code vs refactor things, lack of awareness around desired "shape" of codebase for long-term maintainability, limited depth of search for impact/related existing stuff sometimes, running off the rails and doing almost-but-not-quite stuff that ends up entirely the wrong thing.)
For code modifications in a large codebase, the problem with multi-shot is that it doesn't take too many iterations before I've spent more time on it than doing it myself. At least for tasks where I'm trying to be lazy or save time.
I've found voice input to completely change the balance there.
For stuff that isn't urgent, I can just fire off a hosted codex job by saying what I want done out loud. It's not super often that it completely nails it, but it almost always helps give me some info on where the relevant files might be and a first pass on the change.
Plus it has the nice side effect of being a todo list of quick stuff that I didn't want to get distracted by while working on something else, and often helps me gather my thoughts on a topic.
It's turned out to be a shockingly good workflow for me
I agree that the popular "one shot at all costs / end the chat at the first whiff of a mistake" advice is much too reductive, but unlike with a colleague, after putting in all that effort to develop a shared mental model of the desired outcome, you hit the max context and all that nuanced understanding instantly evaporates. You then have to hope the lossy compression into text instructions will actually steer it where you want next time, but from experience that is unfortunately far from certain.
The end result is these robots bikeshedding. When paired with junior engineers looking at this output and deciding to act on it, it just generates busywork. It doesn't help that everyone and their dog wants to automatically run their agent against PRs now.
I'm trying to use these to some extent when I find myself in a canonical situation where they should work, and I'm not getting the value everyone else seems to get in many cases. Very much an "explaining a thing to a junior engineer takes more time than doing it myself" situation, except at least the junior is a person.
Notably, these walls are never where I expect them to be—despite my best efforts, I can't find any sort of pattern. LLMs can find really tricky bugs and get completely stuck on relatively simple ones.
Practically, though, how would someone become good at just the skills LLMs don't do well? Much of this discussion is about how that's difficult to predict, but even if you were a reliable judge of what sort of coding tasks LLMs would fail at, I'm not sure it's possible to only be good at that without being competent at it all.
This is, in fact, why we teach kids math that calculators could handle!
We don't teach kids how to use an abacus or a slide rule. But we teach positional representations and logarithms.
The goal is to teach the theoretical concepts so you can learn the required skills if necessary. The same will occur with code.
You don't need to memorize the syntax to write a for loop or for each loop, but you should understand when you might use either and be able to look up how to write one in a given language.
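For instance, a quick illustration in bash (just one language where the distinction shows up; the loop bodies are placeholders):

    # Counted for loop: you know up front how many iterations you need.
    for ((i = 0; i < 3; i++)); do echo "attempt $i"; done

    # For-each loop: you iterate over whatever items happen to exist.
    for f in tests/*.sh; do echo "found test: $f"; done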
There are a growing set of problems which feel like using a calculator for basic math to me.
But also school is a whole other thing which I'm much more worried about with LLMs. Because there's no doubt in my mind I would have abused AI every chance I got if it were around when I was a kid, and I wouldn't have learned a damn thing.
And I hated mental math exercises as a kid.
If at the first step I'm already dealing with a robot in the weeds, I will have to spend time getting it out of the weeds, all for uncertain results afterwards.
Now sometimes things are hard and tricky, and you might still save time... but just on an emotional level, it's unsatisfying
Again, worst case all you wasted was your time, and now you've bounded that.
I've found that having local clones of large library repos (or telling it to look in the environment for packages) is far more effective than relying on built-in knowledge or lousy web search. It can also use ast-grep on those. For some reason the agent frameworks are still terrible about looking up references in a sane way (where in an IDE you would simply go to declaration).
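For instance, a hypothetical structural search over a locally cloned dependency (assuming ast-grep is installed; the pattern, language, and path are made up for illustration):

    # Find call sites of request(...) in the cloned library's source,
    # instead of relying on the model's possibly stale built-in knowledge.
    ast-grep run --pattern 'request($$$ARGS)' --lang python ~/src/some-library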
Alternatively, if it is in an area with good test coverage, let it go fix the minor stuff.
EXCEPT…
What did you have for AI three years ago? Jack fucking shit is what.
Why is “wow that’s cool, I wonder what it’ll turn into” a forbidden phrase, but “there are clearly no experts on this topic but let me take a crack at it!!” important for everyone to comment on?
One word: Standby. Maybe that’s two words.
If you find yourself saying the same thing every year and adding 1 to the total...
+1 Juniors can learn over time.
But you literally still are. If you weren't, it should be trivially easy to create these models without using huge swathes of non-public-domain code. Right?
If someone scraped every photo on the internet (along with their captions) and used the data to create a model that was used purely for accessibility purposes - to build tools which described images to people with visual impairments - many people would be OK with that, where they might be justifiably upset at the same scraped data being used to create an image-generation model that competes with the artists whose work it was trained on.
Similarly, many people were OK with Google scraping the entire internet for 20+ years to build a search engine that helps users find their content, but are unhappy about an identical scrape being used to train a generative AI model.
Search engines help website owners, they don't hurt them. Whether the goal of a website is to inform people, build reputation or make money, search engines help with that. (Unless they output an excerpt so large visiting your website is no longer necessary. There have been lawsuits about that.)
LLMs take other people's work and regurgitate a mixed/mangled (verbatim or not does not matter) version without crediting/compensating the original authors and which cannot easily be tracked to any individual authors even if you actively try.
---
LLMs perform no work (creative or otherwise), no original research, have no taste - in fact they have no anchor to the real world except the training data. Literally everything they output is based on the training data which took possibly quadrillions of hours of _human work_ and is now being resold without compensating them.
Human time and natural resources are the only things with inherent value and now human time is being devalued and stolen.
Also, hunting for bugs is often a very good way to get intimately familiar with the architecture of a system which you don't know well, and furthermore it improves your mental model of the cause of bugs, making you a better programmer in the future. I can spot a possible race condition or unsafe alien call at a glance. I can quickly identify a leaky abstraction, and spot mutable state that could be made immutable. All of this because I have spent time fixing bugs that were due to these mistakes. If you don't fix other people's bugs yourself, I fear you will also end up relying on an LLM to make judgements about your own code to make sure that it is bug-free.
1/5 times they get it wrong and I might waste a minute or two confirming what they missed. I can live with those odds.
Before I used Claude, I would be surprised.
I think it works because Claude takes some standard coding issues and systematizes them. The list is long, but Claude doesn't run out of patience like a human being does. Or at least it has some credulity left after trying a few initial failed hypotheses. This being a cryptography problem helps a little bit, in that there are very specific keywords that might hint at a solution, but from my skim of the article, it seems like it was mostly a good old coding error, taking the high bits twice.
The standard issues are just a vague laundry list:
- Are you using the data you think you're using? (Bingo for this one)
- Could it be an overflow?
- Are the types right?
- Are you calling the function you think you're calling? Check internal, then external dependencies
- Is there some parameter you didn't consider?
And a bunch of others. When I ask Claude for a debug, it's always something that makes sense as a checklist item, but I'm often impressed by how it diligently followed the path set by the results of the investigation. It's a great donkey, really takes the drudgery out of my work, even if it sometimes takes just as long.
It very much does! I had a debugging session with Claude Code today, and it was about to give up with the message along the lines of “I am sorry I was not able to help you find the problem”.
It took some gentle cheering (pretty easy, just saying “you are doing an excellent job, don’t give up!”) and encouragement, and a couple of suggestions from me on how to approach the debug process for it to continue and finally “we” (I am using plural here because some information that Claude “volunteered” was essential to my understanding of the problem) were able to figure out the root cause and the fix.
Context Usage • Used: 112K/200K tokens (56%) • Remaining: 88K tokens • Sufficient for continued debugging, but fresh session recommended for clarity
lol. I said ok use a subagent for clarity.
I've flat out had Claude tell me its task was getting tedious, and it will often grasp at straws to use as excuses for stopping a repetitive task and moving on to something else.
Keeping it on task when something keeps moving forward is easy, but when it gets repetitive it takes a lot of effort to make it stick with it.
Some global rules will generally keep it on track, though: telling it to ask me before it simplifies or gives up, and asking it frequently to ask me clarifying questions, which generally also helps keep it chugging in the right direction and uncovers gaps in its understanding.
Quite different if you are not a cryptographer or a domain expert.
If you really want to understand what the limitations are of the current frontier models (and also really learn how to use them), ask the AI first.
By throwing things over the wall to the AI first, you learn what it can do at the same time as you learn how to structure your requests. The newer models are quite capable and in my experience can largely be treated like a co-worker for "most" problems. That being said.. you also need to understand how they fail and build an intuition for why they fail.
Every time a new model generation comes out, I also recommend throwing away your process (outside of things like lint, etc.) and seeing how the model does without it. I work with people who have elaborate context setups they crafted for less capable models; they are largely unnecessary with GPT-5-Codex and Sonnet 4.5.
Unfortunately, it doesn't quite work out that way.
Yes, you will get better at using these tools the more you use them, which is the case with any tool. But you will not learn what they can do as easily, or at all.
The main problem with them is the same one they've had since the beginning. If the user is a domain expert, then they will be able to quickly spot the inaccuracies and hallucinations in the seemingly accurate generated content, and, with some effort, coax the LLM into producing correct output.
Otherwise, the user can be easily misled by the confident and sycophantic tone, and waste potentially hours troubleshooting, without being able to tell if the error is on the LLM side or their own. In most of these situations, they would've probably been better off reading the human-written documentation and code, and doing the work manually. Perhaps with minor assistance from LLMs, but never relying on them entirely.
This is why these tools are most useful to people who are already experts in their field, such as Filippo. For everyone else who isn't, and actually cares about the quality of their work, the experience is very hit or miss.
> That being said.. you also need to understand how they fail and build an intuition for why they fail.
I've been using these tools for years now. The only intuition I have for how and why they fail is when I'm familiar with the domain. But I had that without LLMs as well, whenever someone is talking about a subject I know. It's impossible to build that intuition with domains you have little familiarity with. You can certainly do that by traditional learning, and LLMs can help with that, but most people use them for what you suggest: throwing things over the wall and running with it, which is a shame.
> I work with people who have elaborate context setups they crafted for less capable models; they are largely unnecessary with GPT-5-Codex and Sonnet 4.5.
I haven't used GPT-5-Codex, but have experience with Sonnet 4.5, and it's only marginally better than the previous versions IME. It still often wastes my time, no matter the quality or amount of context I feed it.
As for the building intuition, perhaps I am over-estimating what most people are capable of.
Working with and building systems using LLMs over the last few years has helped me build a pretty good intuition about what is breaking down when the model fails at a task. While having an ML background is useful in some very narrow cases (like: 'why does an LLM suck at ranking...'), I "think" a person can get a pretty good intuition purely based on observational outcomes.
I've been wrong before though. When we first started building LLM products, I thought, "Anyone can prompt, there is no barrier for this skill." That was not the case at all. Most people don't do well trying to quantify ambiguity, specificity, and logical contradiction when writing a process or set of instructions. I was REALLY surprised how I became a "go-to" person to "fix" prompt systems, all based on linguistics and systematic process decomposition. Some of this was understanding how the auto-regressive attention system benefits from breaking the work down into steps, but really most of it was just "don't contradict yourself and be clear".
Working with them extensively has also helped me hone in on how the models get "better" with each release, though most of my expertise is with the OpenAI and Anthropic model families.
I still think most engineers "should" be able to build intuition generally on what works well with LLMs and how to interact with them, but you are probably right. It will be just like most ML engineers, who see something work in a paper and then just paste it onto their model with no intuition about what it structurally changes in the model's dynamics.
No take on the rest of your comment, but it’s the nature of software engineering that we work on a breadth of problems. Nobody can be a domain expert in everything.
For example: I use a configurable editor every day, but I’m not a domain expert in the configuration. An LLM wasted an hour of my day pointing me in “almost the right direction” when after 10 minutes I really needed to RTFM.
I am a domain expert in some programming languages, but now I need to implement a certain algorithm… I’m not an expert in that algorithm. There’s lots of traps for the unwary.
I just wanted to challenge the assumption that we are all domain experts in the things we do daily. We are, but … with limitations.
A typical programmer works within unfamiliar domains all the time. It's not just about being familiar with the programming language or tooling. Every project potentially has new challenges you haven't faced before, new APIs to evaluate and design, new tradeoffs to consider, etc.
The less familiar you are with the domain or API, the less instincts and influence you have to steer the LLM in the right direction, and the more inclined you are to trust the tool over yourself. So when the tool is wrong, as it often still is, you can spend a lot of time fighting with it to produce the correct output.
The example in the article is actually the best case scenario for these tools. It's essentially pattern matching using high quality code, from someone who's deeply familiar with the domain and the code they've written. The experience of someone unfamiliar trying to implement the same algorithm from scratch by relying on LLMs would be vastly different.
I'm constantly reviewing things that I am not a domain expert on. I have to identify what is risky, what I don't know, etc. Throwing to the AI first is no different than throwing to someone else first. I have the same requirements. Now I can choose how much I "trust" the person or LLM. I have had coworkers I trust less than LLMs.. I'll put it that way.
So just like with reviewing a co-worker.. pay attention to areas you are not sure what the right method is and maybe double-check it. This just isn't a "new" thing.
A competent human engineer won't delude you with claims not based in reality, and be confident about it. They can be wrong about practical ways of accomplishing something, but they won't suggest using APIs that don't exist, or go off on wild tangents because a certain word was mentioned. They won't give a different answer whenever you ask them the same question. Most importantly, conversations with humans can be productive in ways that both parties gain a deeper understanding of the topic and respect for each other. Humans can actually think and reason about topics and ideas, they can actually verify their and your claims, and they won't automatically respond with "You're right!" at any counterargument or suggestion.
Furthermore, the marketing around "AI" is strongly based on promoting their superhuman abilities. If we're led to believe that these are superintelligent machines, we're more inclined to trust their output. We have people using them as medical professionals, thinking that they're talking to a god, and being influenced by them. Trusting them to produce software is somewhere on that scale. All of this is highly misleading and potentially dangerous.
Any attempt at anthropomorphizing "AI" is a mistake. You can get much more out of them by using them as what they are: excellent pattern matching probabilistic tools.
It gave me horribly inefficient or long-winded ways of doing it. In the time it took for "prompt tuning" I could have just written the damn code myself. It decreased the confidence for anything else it suggested about things I didn't already know about.
Claude still sometimes insists that iOS 26 isn't out yet. sigh.. I suppose I just have to treat it as an occasional alternative to Google/StackOverflow/Reddit for now. No way would I trust it to write an entire class let alone an app and be able to sleep at night (not that I sleep at night, but that's besides the point)
I think I prefer Xcode's built-in local model approach better, where it just offers sane autocompletions based on your existing code. e.g. if you already wrote a Dog class it can make a Cat class and change `bark()` to `meow()`
How would you imagine an AI system working that didn't make mistakes like that?
iOS 26 came out on September 15th.
LLMs aren't omniscient or constantly updated with new knowledge. Which means we have to figure out how to make use of them despite them not having up-to-the-second knowledge of the world.
I mean, if the user says "Use the latest APIs as of version N" and the AI thinks version N isn't out yet, then it should CHECK on the web first, it's right there, before second guessing the user. I didn't ask it whether 26 was out or not. I told it.
Oh, but I guess AIs aren't allowed to have free use of Google's web search or to scrape other websites, eh?
> iOS 26 came out on September 15th.
It was in beta all year and the APIs were publicly available on Apple's docs website. If I told it to use version 26 APIs then it should just use those instead of gaslighting me.
> LLMs aren't omniscient or constantly updated with new knowledge.
So we shouldn't use them if we want to make apps with the latest tech? Despite what the AI companies want us to believe.
You know, on a more general note, I think all AIs should have a toggle between "Do as I say" (Monkey Paw) and "Do what I mean"
Different harnesses have different search capabilities.
If I'm doing something that benefits from search I tend to switch to ChatGPT because I know it has a really good search feature available to it. I don't trust Claude's as much.
I feel like the article is giving out very bad advice which is going to end up shooting someone in the foot.
AI are very capable heuristics tools. Being able to "sniff test" things blind is their specialty.
i.e. Treat them like an extremely capable gas detector that can tell you there is a leak and where in the plumbing it is, not a plumber who can fix the leak for you.
The author uses an LLM to find bugs and then throw away the fix and instead write the code he would have written anyway. This seems like a rather conservative application of LLMs. Using the 'shooting someone in the foot' analogy - this article is an illustration of professional and responsible firearm handling.
Related, lately I've been getting tons of Anthropic Instagram ads; they must be near a quarter of all the sponsored content I see for the last month or so. Various people vibe coding random apps and whatnot using different incarnations of Claude. Or just direct adverts to "Install Claude Code." I really have no idea why I've been targeted so hard, on Instagram of all places. Their marketing team must be working overtime.
Developers find Claude Code extremely useful (once they figure out how to use it). Many developers subscribe to their $200/month plan. Assuming that's profitable (and I expect it is, since even for that much money it cuts off at a certain point to avoid over-use) Anthropic would be wise to spend a lot of money on marketing to try and grow their paying subscriber base for it.
I suspect that a lot of the “try using Claude code” feedback is just another version of “you’re holding it wrong” by people who have never tried VSC (parent is not in this group however). If you’re bought into a particular model system, of course, it might make more sense to use their own tool.
Edit: I will say that if you’re the YOLO type who wants your bots to be working a bunch of different forks in parallel, VSC isn’t so great for that.
Even if there's some slight immediate performance advantage for Cursor over GHC, the ability to trivially switch models more than makes up for it, IMO.
Also, as a heavy user of both, there are small paper cuts that seriously add up with Copilot. Things that are missing, like subagents. The options and requests for feedback that CC can give (interactive picker style instead of prompt based). Most annoyingly, commands running in a new integrated VS Code terminal instance and immediately, mistakenly "finishing" even though execution has just begun.
It's just a better harness than Copilot. You should give it a shot for a while and see how you like it! I'm not saying it's the best for everybody. At the end of the day these issues turn into something like the old vi/emacs wars.
Not sponsored, just a heavy user of both. Claude code is not allowed at work, so we use copilot. I purchased cc for my side projects and pay for the $125/m plan for now.
It also lacks a lot of the “features” of CC or Codex cli, like hooks, subagents, skills, or whichever flavor of the month you are getting value out of (I am finding skills really useful).
It also has models limited to 128k context - even sonnet - which under claude has (iirc) a million tokens. It can become a bottleneck if you aren’t careful.
We are stuck with VS Code at $job, and so are making it work, but I really fly on personal projects at home using the "Swiss army knife".
There are of course good reasons for some to prefer an IDE as well; it has strengths, like much more permissive limits and predictable cost.
I don’t feel like paying for a max level subscription, but am trying out MCP servers across OpenAI, Anthropic etc so I pay for the access to test them.
When my X hour token allotment runs out on one model I jump to the next closing Codex and opening Claude code or whatever together with altering my prompting a tiny bit to fit the current model.
Being so extremely fungible should by definition be a race to zero margins and about zero profit being made in the long run.
I suppose they can hope to make bank the next 6-12 months but that doesn’t create a longterm sustainable company.
I guess they can try building context to lock me in by increasing the cost to switch - but today I break that every 3-4 prompts by clearing the context, because I know the output will be worse if I keep going.
The challenge is definitely in the competition though. GPT-5-Codex offered _very_ real competition for Claude Sonnet 4 / Opus 4 / Opus 4.1 - for a few weeks OpenAI were getting some of those subscribers back until Sonnet 4.5 landed. I expect that will happen frequently.
LLMs built by trillion dollar companies will do it for me.
Except they regularly come up with "explanations" that are completely bogus and may actually waste an hour or two. Don't get me wrong, LLMs can be incredibly helpful for identifying bugs, but you still have to keep a critical mindset.
I agree though, LLMs can be incredible debugging tools, but they are also incredibly gullible and love to jump to conclusions. The moment you turn your own fleshy brain off is when they go to lala land.
But that's what I meant! Just recently I asked an LLM about a weird backtrace and it pointed me the supposed source of the issue. It sounded reasonable and I spent 1-2 hours researching the issue, only to find out it was a total red herring. Without the LLM I wouldn't have gone down that road in the first place.
(But again, there have been many situations where the LLM did point me to the actual bug.)
I also agree that many more times the LLM is like a blood hound leading me to the right thing (which makes it all the more annoying the few times when it chases a red herring).
You can build this pretty easily: https://github.com/jasonjmcghee/claude-debugs-for-you
qsort•13h ago
> As ever, I wish we had better tooling for using LLMs which didn’t look like chat or autocomplete
I think part of the reason why I was initially more skeptical than I ought to have been is because chat is such a garbage modality. LLMs started to "click" for me with Claude Code/Codex.
A "continuously running" mode that would ping me would be interesting to try.
cmrdporcupine•13h ago
A more Socratic method, and more "augmentic" than "agentic".
Hell, if anybody has investment money and energy and shares this vision I'd love to work on creating this tool with you. I think these models are being misused right now in attempt to automate us out of work when their real amazing latent power is the intuition that we're talking about on this thread.
Misused they have the power to worsen codebases by making developers illiterate about the very thing they're working on because it's all magic behind the scenes. Uncorked they could enhance understanding and help better realize the potential of computing technology.
mccoyb•13h ago
What are your motivations?
Interested in your work: from your public GitHub repos, I'm perhaps most interested in `moor` -- as it shares many design inclinations that I've leaned towards in thinking about this problem.
cmrdporcupine•12h ago
I'm off work right now, between jobs and have been working 10, 12 hours a day on it. That will shortly have to end. I applied for a grant and got turned down.
My motivations come down to making a living doing the things I love. That is increasingly hard.
reachableceo•10h ago
I’ve found that using some high level direction / language and sharing my wants / preferences for workflow and interaction works very well.
I don’t think that you can find an off-the-shelf system to do what you want. I think you have to customize it to your own needs as you go.
Kind of like how you customize emacs as it’s running to your desires.
I’ve often wondered if you could put a mini LLM into emacs or vscode and have it implement customizations :)
imiric•11h ago
But on the other hand, given what I know about these tools and how error-prone they are, I simply refuse to give them access to my system, to run commands, or to do any action for me. Partly due to security concerns, partly due to privacy, but mostly distrust that they will do the right thing. When they screw up in a chat, I can clean up the context and try again. Reverting a removed file or a messed-up Git repo is much more difficult. This is how you get a dropped database during a code freeze...
The idea of giving any of these corporations such privileges is unthinkable for me. It seems that most people either don't care about this, or are willing to accept it as the price of admission.
I experimented with Aider and a self-hosted model a few months ago, and wasn't impressed. I imagine the experience with SOTA hosted models is much better, but I'll probably use a sandbox next time I look into this.
cmrdporcupine•10h ago
If you want open source and want to target something over an API "crush" https://github.com/charmbracelet/crush is excellent
But you should try Claude Code or Codex just to understand them. Can always run them in a container or VM if you fear their idiocy (and it's not a bad idea to fear it)
Like I said in a sibling comment, it's not the right modality. Others agree. I'm a good typist and good at writing, so it doesn't bug me too much, but it does too much without asking or working through it. Sometimes this is brilliant. Other times it's like... c'mon guy, what did you do over there? What Balrog have I disturbed?
It's good to be familiar with these things in any case because they're flooding the industry and you'll be reviewing their code for better or for worse.