The highest quality codebase

https://gricha.dev/blog/the-highest-quality-codebase

641•Gricha•2mo ago

Comments

written-beyond•2mo ago

> I like Rust's result-handling system, I don't think it works very well if you try to bring it to the entire ecosystem that already is standardized on error throwing.

I disagree, it's very useful even in languages that have exception throwing conventions. It's good enough for the return type for Promise.allSettled api.

The problem is when I don't have the result type I end up approximating it anyway through other ways. For a quick project I'd stick with exceptions but depending on my codebase I usually use the Go style ok, err tuple (it's usually clunkier in ts though) or a rust style result type ok err enum.
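For illustration, here is a minimal sketch of the Rust-style result type in TypeScript. The names `Result`, `ok`, `err`, `parsePort`, and `tryCatch` are hypothetical and not taken from any library; the discriminant field plays the same role as the `status` field on the entries that `Promise.allSettled` resolves with.

```typescript
// Hypothetical Rust-style result type, modeled as a discriminated union.
type Result<T, E> =
  | { ok: true; value: T }
  | { ok: false; error: E };

const ok = <T>(value: T): Result<T, never> => ({ ok: true, value });
const err = <E>(error: E): Result<never, E> => ({ ok: false, error });

// Illustrative function that returns its failure as a value instead of throwing.
function parsePort(input: string): Result<number, string> {
  const n = Number(input);
  if (!Number.isInteger(n) || n < 1 || n > 65535) {
    return err(`invalid port: ${input}`);
  }
  return ok(n);
}

// Adapter for the boundary with exception-throwing code.
function tryCatch<T>(fn: () => T): Result<T, unknown> {
  try {
    return ok(fn());
  } catch (e) {
    return err(e);
  }
}
```

Callers have to check `ok` before they can touch `value`, so the compiler enforces the error branch; the `tryCatch` adapter is one way to keep throwing third-party code at the edge.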

turboponyy•1mo ago

I have the same disagreement. TypeScript with its structural and pseudo-dependent typing, somewhat-functionally disposed language primitives (e.g. first-class functions as values, currying) and standard library interfaces (filter, reduce, flatMap et al), and ecosystem make propagating information using values extremely ergonomic.

Embracing a functional style in TypeScript is probably the most productive I've felt in any mainstream programming language. It's a shame that the language was defiled with try/catch, classes and other unnecessary cruft so third party libraries are still an annoying boundary you have to worry about, but oh well.

The language is so well-suited for this that you can even model side effects as values, do away with try/catch, if/else and mutation a la Haskell, if you want[1].
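To make "side effects as values" concrete, here is a hand-rolled sketch. It is not the Effect library's API; `Effect`, `succeed`, `map`, `flatMap`, and `writeLine` are hypothetical names, and an effect is assumed to be nothing more than a lazy function that is only run at the edge of the program.

```typescript
// Hypothetical: an effect is a description of work (a thunk), not the work itself.
type Effect<A> = () => A;

const succeed = <A>(a: A): Effect<A> => () => a;

const map = <A, B>(fa: Effect<A>, f: (a: A) => B): Effect<B> =>
  () => f(fa());

const flatMap = <A, B>(fa: Effect<A>, f: (a: A) => Effect<B>): Effect<B> =>
  () => f(fa())();

// An in-memory log stands in for a real side effect such as console output.
const log: string[] = [];
const writeLine = (s: string): Effect<void> => () => {
  log.push(s);
};

// Building the program performs no side effects.
const program: Effect<number> = flatMap(succeed(21), (n) =>
  map(writeLine(`n = ${n}`), () => n * 2),
);

const entriesBeforeRun = log.length; // nothing has been written yet
const result = program(); // the effect runs only here
```

Because `program` is an ordinary value, it can be passed around, composed, or retried before anything actually happens.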

[1] https://effect.website/

kderbyma•1mo ago

Yeah. I noticed Claude suffers when it reaches context overload - it's too opinionated, so it shortens its own context with decisions I would never make, yet I see it telling itself that the shortcuts are a good idea because the project is complex... then it gets into a loop where it second-guesses its own decisions, forgets the context, and continues to spiral uncontrollably into deeper and deeper failures - often missing the obvious glitch and instead looking into imaginary land for answers - constantly diverting the solution from patching to completely rewriting...

I think it suffers from performance anxiety...

----

The only solution I have found is to rewrite the prompt from scratch, change the context myself, clear any "history or memories", and then try again.

I have even gone so far as to open nested folders in separate windows to "lock in" scope better.

As soon as I see the agent say "Wait, that doesn't make sense, let me review the code again", it's cooked.

SV_BubbleTime•1mo ago

I’m keeping Claude’s tasks small and focused, then if I can I clear between.

It’s REAL FUCKING TEMPTING to say “hey Claude, go do this thing that would take me hours and you seconds” because he will happily, and it’ll kinda work. But one way or another you are going to put those hours in.

It’s like programming… is proof of work.

thevillagechief•1mo ago

Yes, this is exactly true. You will put in those hours.

whatshisface•1mo ago

In this vein, one of the biggest time-savers has turned out to be its ability to make me realize I don't want to do something.

SV_BubbleTime•1mo ago

I get that. But I think the AI-deriders are a bit nuts sometimes because while I’m not running around crying about AGI… it’s really damn nice to change the arguments of a function and have it just go everywhere and adjust every invocation of that function to work properly. Something that might take me 10-30 minutes is now seconds and it’s not outside of its reliability spectrum.

Vibe coding though, super deceptive!

someguyiguess•1mo ago

There’s definitely a certain point I reach when using Claude code where I have to make the specifications so specific that it becomes more work than just writing the code myself

embedding-shape•1mo ago

> Yeah. I noticed Claude suffers when it reaches context overload

All LLMs degrade in quality as soon as you go beyond one user message and one assistant response. If you're looking for accuracy and highest possible quality, you need to constantly redo the conversations from scratch, never go beyond one user message.

If the LLM gets it wrong in its first response, instead of saying "No, what I meant was...", you need to edit your first message and regenerate; otherwise the conversation becomes "poisoned" almost immediately, and every token generated after that will suffer.

torginus•1mo ago

Yeah, I used to write some fiction for myself with LLMs as a recreational pastime. It's funny to see how, as the story gets longer, LLMs progressively either get dumber, start repeating themselves, or become unhinged.

flowerthoughts•1mo ago

There's no -c on the command line, so I'm guessing this is starting fresh every iteration, unless claude(1) has changed the default lately.

snarf21•1mo ago

That has been my greatest stumbling block with these AI agents: context. I was trying to have one help vibe code a puzzle game and most of the time I added a new rule it broke 5 existing rules. It also never approached the rules engine with a context of building a reusable abstraction, just Hammer meet Nail.

rtp4me•1mo ago

For me, too many compactions throughout the day eventually lead to a decline in Claude's thinking ability. And, during that time, I have given it so much context to help drive the coding interaction. Thus, restarting Claude requires me to remember the small bits of "nuggets" we discovered during the last session so I find myself repeating the same things every day (my server IP is: xxx, my client IP is: yyy, the code should live in directory: a/b/c). Using the resume feature with Claude simply brings back the same decline in thinking that led me to stop it in the first place. I am sure there is a better way to remember these nuggets between sessions but I have not found it yet.

m101•1mo ago

This is a great example of there being no intelligence under the hood.

xixixao•1mo ago

Would a human perform very differently? A human who must obey orders (like maybe they are paid to follow the prompt). With some "magnitude of work" enforced at each step.

I'm not sure there's much to learn here, besides it's kinda fun, since no real human was forced to suffer through this exercise on the implementor side.

wongarsu•1mo ago

> A human who must obey orders (like maybe they are paid to follow the prompt). With some "magnitude of work" enforced at each step

Which describes a lot of outsourced development. And we all know how well that works

theshrike79•1mo ago

Using outsourced coders is a skill like any other. There are cultural things you need to consider etc.

It's not hard, just different.

Capricorn2481•1mo ago

> Would a human perform very differently?

Yes.

thatwasunusual•1mo ago

No (human) developer would _add_ tests. ^/s

nosianu•1mo ago

> Would a human perform very differently?

How useful is the comparison with the worst human results? Which are often due to process rather than the people involved.

You can improve processes and teach the humans. The junior will become a senior, in time. If the processes and the company are bad, what's the point of using such a context to compare human and AI outputs? The context is too random and unpredictable. Even if you find out AI or some humans are better in such a bad context, what of it? The priority would be to improve the process first for best gains.

Yeask•1mo ago

A human trained with 0.00000001% of the money OpenAI uses to train models will perform better.

A human with no training will perform worse.

Terretta•1mo ago

Just as enterprise software is proof positive of no intelligence under the hood.

I don't mean the code producers, I mean the enterprise itself is not intelligent yet it (the enterprise) is described as developing the software. And it behaves exactly like this, right down to deeply enjoying inflicting bad development/software metrics (aka BD/SM) on itself, inevitably resulting in:

https://github.com/EnterpriseQualityCoding/FizzBuzzEnterpris...

websiteapi•1mo ago

you gotta be strategic about it. so for example for tests, tell it to use equivalence testing and to prove it, e.g. create a graph of permutations of arguments and their equivalences from the underlying code, and then use such thing to generate the tests.

telling it to do better without any feedback obviously is going to go nowhere fast.
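One possible reading of the equivalence-testing suggestion above, sketched under assumptions: split each argument into classes the code should treat identically, take the cross product of those classes, and check that every sample within a combination yields the same output. The function `classify` and the class tables are hypothetical examples, not from the thread.

```typescript
// Hypothetical function under test.
function classify(age: number, member: boolean): string {
  if (age < 0) return "invalid";
  if (age < 18) return "minor";
  return member ? "member" : "guest";
}

// An equivalence class: a label plus representative values that the code
// is expected to handle identically.
type EqClass<T> = { label: string; samples: T[] };

const ageClasses: EqClass<number>[] = [
  { label: "negative", samples: [-5, -1] },
  { label: "under18", samples: [0, 17] },
  { label: "adult", samples: [18, 90] },
];

const memberClasses: EqClass<boolean>[] = [
  { label: "member", samples: [true] },
  { label: "nonMember", samples: [false] },
];

// Walk the cross product of classes. Within one combination every sample
// pair must give the same output; if not, the partition (or the code) is wrong.
function checkPartitions(): string[] {
  const failures: string[] = [];
  for (const a of ageClasses) {
    for (const m of memberClasses) {
      const outputs = new Set<string>();
      for (const age of a.samples) {
        for (const member of m.samples) {
          outputs.add(classify(age, member));
        }
      }
      if (outputs.size !== 1) failures.push(`${a.label} x ${m.label}`);
    }
  }
  return failures;
}
```

The class tables double as the "proof" the comment asks for: they give the model something checkable to produce, rather than an open-ended request for more tests.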

f311a•1mo ago

I like to ask LLMs to find problems or improvements in 1-2 files. They are pretty good at finding bugs, but for general code improvements, 50-60% of the edits are trash. They add completely unnecessary stuff. If you ask them to improve pretty well-written code, they rarely say it's good enough already.

For example, in a functional-style codebase, they will try to rewrite everything to a class. I have to adjust the prompt to list things that I'm not interested in. And some inexperienced people are trying to write better code by learning from such changes of LLMs...

pawelduda•1mo ago

If you just ask it to find problems, it will do its best to find them - like running a while loop with no return condition. That's why I put some breaker in the prompt, which in this case would be "don't make any improvements if the positive impact is marginal". I've mostly seen it do nothing and just summarize why, followed by some suggestions in case I still want to force the issue

f311a•1mo ago

I guess "marginal impact" for them is a pretty random metric, which will be different on each run. Will try it next time.

Another problem is that they try to add handling of different cases that are never present in my data. I have to mention that there is no need to update handling to be more generalized. For example, my code handles PNG files, and they add JPG handling that never happens.

ryandrake•1mo ago

I asked Claude the other day to look at one of my hobby projects that has a client/server architecture and a bespoke network protocol, and brainstorm ideas for converting it over to HTTP, JSON-RPC, or something else standards-based. I specifically told it to "go wild" and really explore the space. It thought for a while and provided a decent number of suggestions (several I was unaware of) with "verdicts". Ultimately, though, it concluded that none of them were ideal, and that the custom wire protocol was fine and appropriate for the project. I was kind of shocked at this conclusion: I expected it to behave like that eager intern persona we all have come to expect--ready to rip up the code and "do things."

maddmann•1mo ago

lol 5000 tests. Agentic code tools have a significant bias to add versus remove/condense. This leads to a lot of bloat and orphaned code. Definitely something that still needs to be solved for by agentic tools.

oofbey•1mo ago

Oh I’ve had agents remove tests plenty of times. Or cripple the tests so they pass but are useless - more common and harder to prompt against.

maddmann•1mo ago

Ah true, that also can happen — in aggregate I think models will tend to expand codebases rather than contract them. Though this is anecdotal, and it's probably something AI labs and coding agent companies are looking at now.

oofbey•1mo ago

It’s the same bias for action which makes them code up a change when you genuinely are just asking a question about something. They really want to write code.

nosianu•1mo ago

> Agentic code tools have a significant bias to add versus remove/condense.

Your point stands uncontested by me, but I just wanted to mention that humans have that bias too.

Random link (has the Nature study link): https://blog.benchsci.com/this-newly-proven-human-bias-cause...

https://en.wikipedia.org/wiki/Additive_bias

maddmann•1mo ago

Great point, interesting how agents somehow pick up the same bias.

xnorswap•1mo ago

Claude is really good at specific analysis, but really terrible at open-ended problems.

"Hey claude, I get this error message: <X>", and it'll often find the root cause quicker than I could.

"Hey claude, anything I could do to improve Y?", and it'll struggle beyond the basics that a linter might suggest.

It suggested enthusiastically a library for <work domain> and it was all "Recommended" about it, but when I pointed out that the library had been considered and rejected because <issue>, it understood and wrote up why that library suffered from that issue and why it was therefore unsuitable.

There's a significant blind-spot in current LLMs related to blue-sky thinking and creative problem solving. It can do structured problems very well, and it can transform unstructured data very well, but it can't deal with unstructured problems very well.

That may well change, so I don't want to embed that thought too deeply into my own priors, because the LLM space seems to evolve rapidly. I wouldn't want to find myself blind to the progress because I write it off from a class of problems.

But right now, the best way to help an LLM is have a deep understanding of the problem domain yourself, and just leverage it to do the grunt-work that you'd find boring.

plufz•1mo ago

I think slash commands are great to help Claude with this. I have many, like /code:dry /code:clean-code etc., that have a semi-long prompt and references to longer docs to review code from a specific perspective. I think it at least improves Claude a bit in this area. Like processes or templates for thinking in broader ways. But yes, I agree it struggles a lot in this area.

airstrike•1mo ago

Somewhat tangential but interestingly I'd hate for Claude to make any changes with the intent of sticking to "DRY" or "Clean Code".

Neither of those are things I follow, and either way design is better informed by the specific problems that need to be solved rather than by such general, prescriptive principles.

SketchySeaBeast•1mo ago

I'm not sure how to interpret someone saying they don't follow DRY. Do you mean not taking it to the zealous extreme, or do you abhor helper functions? Is this a "No True Scotsman" thing?

Pannoniae•1mo ago

Not GP but I can strongly relate to it. Most of the programming I do is related to me making a game.

I follow WET principles (write everything twice at least) because the abstraction penalty is huge, both in terms of performance and design, a bad abstraction causes all subsequent content to be made much slower. Which I can't afford as a small developer.

Same with most other "clean code" principles. My codebase is ~70K LoC right now, and I can keep most of it in my head. I used to try to make more functional, more isolated and encapsulated code, but it was hard to work with and most importantly, hard to modify. I replaced most of it with global variables, shit works so much better.

I do use partial classes pretty heavily though - helps LLMs not go batshit insane from context overload whenever they try to read "the entire file".

Models sometimes try to institute these clean code practices but it almost always just makes things worse.

SketchySeaBeast•1mo ago

OK, I can follow WET before you DRY, to me that's just a non-zealous version of Don't Repeat Yourself.

I think, if you're writing code where you know the entire code base, a lot of the clean principles seem less important, but once you get someone who doesn't, and that can be you coming back to the project in three months, suddenly they have value.

airstrike•1mo ago

I just think DRY is overblown. I just let code grow. When parts of it become obvious to abstract, I refactor them into something self contained. I learned this from an ice wizard.

When I was younger, writing Python rather than Rust, I used to go out of my way to make everything DRY, DRY, DRY everywhere from the outset. Class-based views in Django come to mind.

Today, I just write code, and after it's working I go back and clean things up where applicable. Not because I'm "following a principle", but because it's what makes sense in that specific instance.

sandos•1mo ago

I feel very strongly after 20+ years of development that DRY is a good guideline, but I have also seen many, many times that trying to follow it to the letter is actually detrimental and results in too complex solutions.

SketchySeaBeast•1mo ago

I can agree with that - honestly a constant focus on DRY seems overly zealous to me. I only start DRYing if I see a need.

plufz•1mo ago

I agree, so obviously I direct it with more info and point it to code that I believe needs more of specific principles. But generally I would like Claude to produce more DRY code, it is great at reimplementing the same thing in five places instead of making a shared utility module.

airstrike•1mo ago

I see, and I definitely agree with that last statement. It tends to rewrite stuff. I feel like it should pay me back 10,000 tokens each time it increases the API surface

kccqzy•1mo ago

Not at all my experience. I’ve often tried things like telling Claude this SIMD code I wrote performed poorly and I needed some ideas to make it go faster. Claude usually does a good job rewriting the SIMD to use different and faster operations.

zahlman•1mo ago

That sounds like a pretty "structured" problem to me.

chrneu•1mo ago

that's one of the problems with AI. as it can accomplish more tasks people will overestimate its ability.

what the person you replied to had claude do is relatively simple and structured, but to that person what claude did is "automagic".

People already vastly overestimate AI's capabilities. This contributes to that.

kccqzy•1mo ago

Performance optimization isn’t structured at all. I find it amazing that without access to profilers or anything Claude is able to respond to “anything I can do to improve the speed” with acceptable results.

mainmailman•1mo ago

I'm not a C++ programmer, but wouldn't your example be a fairly structured problem? You wanted to improve performance of a specific part of your code base.

james_marks•1mo ago

This is a key part of the AI love/hate flame war.

Very easy to write it off when it spins out on the open-ended problems, without seeing just how effective it can be once you zoom in.

Of course, zooming in that far gives back some of the promised gains.

Edit: typo

thewebguyd•1mo ago

> without seeing just how effective it can be once you zoom in.

The love/hate flame war continues because the LLM companies aren't selling you on this. The hype is all about "this tech will enable non-experts to do things they couldn't do before" not "this tech will help already existing experts with their specific niche," hence the disconnect between the sales hype and reality.

If OpenAI, Anthropic, Google, etc. were all honest and tempered their own hype and misleading marketing, I doubt there would even be a flame war. The marketing hype is "this will replace employees" without the required fine print of "this tool still needs to be operated by an expert in the field and not your average non technical manager."

hombre_fatal•1mo ago

The amount of GUIs I've vibe-coded works against your claim.

As we speak, my macOS menubar has an iStat Menus replacement, a Wispr Flow replacement (global hotkey for speech-to-text), and a logs visualizer for the `blocky` dns filtering program -- all of which I built without reading code aside from where I was curious.

It was so vibe-coded that there was no reason to use SwiftUI nor set them up in Xcode -- just AppKit Swift files compiled into macOS apps when I nix rebuild.

The only effort it required was the energy to QA the LLM's progress and tell it where to improve, maybe click and drag a screenshot into claude code chat if I'm feeling excessive.

Where do my 20 years of software dev experience fit into this except beyond imparting my aesthetic preferences?

In fact, insisting that you write code yourself is becoming a liability in an interesting way: you're going to make trade-offs for DX that the LLM doesn't have to make, like when you use Python or Electron when the LLM can bypass those abstractions that only exist for human brains.

onethought•1mo ago

Love that you are disagreeing with parent by saying you built software all on your own, and you only had 20 years software experience.

Isn't that the point they are making?

hombre_fatal•1mo ago

Maybe I didn't make it clear, but I didn't build the software in my comment. A clanker did.

Vibe-coding is a claude code <-> QA loop on the end result that anyone can do (the non-experts in his claim).

An example of a cycle looks like "now add an Options tab that lets me customize the global hotkey" where I'm only an end-user.

Once again, where do my 20 years of software experience come up in a process where I don't even read code?

onethought•1mo ago

But anyone didn't do it... you an expert in software development did it.

I would hazard a guess that your knowledge led to better prompts, better approach... heck even understanding how to build a status bar menu on macOS is slightly expert knowledge.

You are illustrating the GP's point, not negating it.

hombre_fatal•1mo ago

> I would hazard a guess that your knowledge led to better prompts, better approach... heck even understanding how to build a status bar menu on macOS is slightly expert knowledge.

You're imagining that I'm giving Claude technical advice, but that is the point I'm trying to make: I am not.

This is what "vibe-coding" tries to specify.

I am only giving Claude UX feedback from using the app it makes. "Add a dropdown that lets me change the girth".

Now, I do have a natural taste for UX as a software user, and through that I can drive Claude to make a pretty good app. But my software engineering skills are not utilized... except for that one time I told Claude to use an AGDT because I fancy them.

ModernMech•1mo ago

My mother wouldn't be able to do what you did. She wouldn't even know where to start despite using LLMs all the time. Half of my CS students wouldn't know where to start either. None of my freshmen would. My grad students can do this but not all of them.

Your 20 years is assisting you in ways you don't know; you're so experienced you don't know what it means to be inexperienced anymore. Now, it's true you probably don't need 20 years to do what you did, but you need some experience. It's not that the task you posed to the LLM is trivial for everyone due to the LLM, it's that it's trivial for you because you have 20 years of experience. For people with experience, the LLM makes moderate tasks trivial, hard tasks moderate, and impossible tasks technically doable.

For example, my MS students can vibe code a UI, but they can't vibe code a complete bytecode compiler. They can use AI to assist them, but it's not a trivial task at all, they will have to spend a lot of time on it, and if they don't have the background knowledge they will end up mired.

hombre_fatal•1mo ago

The person at the top of the thread only made a claim about "non-experts".

Your mom wouldn't vibe-code software that she wants not because she's not a software engineer, but because she doesn't engage with software as a user at the level where she cares to do that.

Consider these two vibe-coded examples of waybar apps in r/omarchy where the OP admits he has zero software experience:

- Weather app: https://www.reddit.com/r/waybar/comments/1p6rv12/an_update_t...

- Activity monitor app: https://www.reddit.com/r/omarchy/comments/1p3hpfq/another_on...

That is a direct refutation of OP's claim. LLM enabled a non-expert to build something they couldn't before.

Unless you too think there exists a necessary expertise in coming up with these prompts:

- "I want a menubar app that shows me the current weather"

- "Now make it show weather in my current location"

- "Color the temperatures based on hot vs cold"

- "It's broken please find out why"

Is "menubar" too much expertise for you? I just asked claude "what is that bar at the top of my screen with all the icons" and it told me that it's macOS' menubar.

bopbopbop7•1mo ago

Your best examples of non-experts are two Linux power users?

ModernMech•1mo ago

I didn't make clear I was responding to your question:

"Where do my 20 years of software dev experience fit into this except beyond imparting my aesthetic preferences?"

Anyway, I think you kind of unintentionally proved my point. These two examples are pretty trivial as far as software goes, and it enabled someone with a little technical experience to implement them where before they couldn't have.

They work well because:

a) the full implementation of these apps doesn't even fill up the AI context window. It's easy to keep the LLM on task.

b) it's a tutorial style-app that people often write as "babby's first UI widget", so there are thousands of examples of exactly this kind of thing online; therefore the LLM has little trouble summoning the correct code in its entirety.

But still, someone with zero technical experience is going to be immediately thwarted by the prompts you provided.

Take the first one "I want a menubar app that shows me the current weather".

https://chatgpt.com/share/693b20ac-dcec-8001-8ca8-50c612b074...

ChatGPT response: "Nice — here's a ready-to-run macOS menubar app you can drop into Xcode..."

She's already out of her depth by word 11. You expect your mom to use Xcode? Mine certainly can't. Even I have trouble with Xcode and I use it for work. Almost every single word in that response would need to be explained to her, it might as well be a foreign language.

Now, the LLM could help explain it to her, and that's what's great about them. But by the time she knows enough to actually find the original response actionable, she would have gained... knowledge and experience enough to operate it just to the level of writing that particular weather app. Though having done that, it's still unreasonable to now believe she could then use the LLM to write a bytecode compiler, because other people who have a Ph.D. in CS can. The LLM doesn't level the playing field, it's still lopsided toward the Ph.D.s / senior devs with 20 years exp.

thewebguyd•1mo ago

> An example of a cycle looks like "now add an Options tab that lets me customize the global hotkey" where I'm only an end-user

Which is a prompt that someone with experience would write. Your average, non-technical person isn't going to prompt something like that, they are going to say "make it so I can change the settings" or something else super vague and struggle. We all know how difficult it is to define software requirements.

Just because an LLM wrote the actual code doesn't mean your prompts weren't more effective because of your experience and expertise in building software.

Sit someone down in front of an LLM with zero development or UI experience at all and they will get very different results. Chances are they won't even specify "macOS menu bar app" in the prompt and the LLM will end up trying to make them a webapp.

Your vibe coding experience just proves my initial point, that these tools are useful for those who already have experience and can lean on that to craft effective prompts. Someone non-technical isn't going to make effective use of an LLM to make software.

ModernMech•1mo ago

Here's how I look at it as a roboticist:

The LLM prompt space is an ND space where you can start at any point, and then the LLM carves a path through the space for so many tokens using the instructions you provided, until it stops and asks for another direction. This frames LLM prompt coding as a sort of navigation task.

The problem is difficult because at every decision point, there's an infinite number of things you could say that could lead to better or worse results in the future.

Think of a robot going down the sidewalk. It controls itself autonomously, but it stops at every intersection and asks "where to next boss?" You can tell it either to cross the street, or drive directly into traffic, or do any number of other things that could cause it to get closer to its destination, further away, or even to obliterate itself.

In the concrete world, it's easy to direct this robot, and to direct it such that it avoids bad outcomes, and to see that it's achieving good outcomes -- it's physically getting closer to the destination.

But when prompting in an abstract sense, it's hard to see where the robot is going unless you're an expert in that abstract field. As an expert, you know the right way to go is across the street. As a novice, you might tell the LLM to just drive into traffic, and it will happily oblige.

The other problem is feedback. When you direct the physical robot to drive into traffic, you witness its demise, its fate is catastrophic, and if you didn't realize it before, you'd see the danger then. The robot also becomes incapacitated, and it can't report falsely about its continued progress.

But in the abstract case, the LLM isn't obliterated, it continues to report on progress that isn't real, and as a non-expert, you can't tell it's been flattened into a pancake. The whole output chain is now completely and thoroughly off the rails, but you can't see the smoldering ruins of your navigation instructions because it's told you "Exactly, you're absolutely right!"

hombre_fatal•1mo ago

Counter point: https://news.ycombinator.com/item?id=46234943

Your original claim:

> The hype is all about "this tech will enable non-experts to do things they couldn't do before"

Are you saying that a prompt like "make a macOS weather app for me" and "make an options menu that lets me set my location" are only something an expert can do?

I need to know what you think their expertise is in.

bopbopbop7•1mo ago

You making a couple of small GUIs that could have been made with a drag and drop editor 10 years ago doesn't work against his claim as much as you think. You're just telling on yourself and your "20 years" of supposed dev experience.

hombre_fatal•1mo ago

Dragging UI components into a WYSIWYG editor is <1% of building an app.

Else Visual Basic and Dreamweaver would have killed software engineering in the 90s.

Also, I didn't make them. A clanker did. I can see this topic brings out the claws. Honestly I used to have the same reaction, and in a large way I still hate it.

bopbopbop7•1mo ago

It's not bringing out claws, it's just causing certain developers to out themselves.

hombre_fatal•1mo ago

Outs me as what, exactly?

I'm not sure you're interacting with a single claim I've made so far.

hombre_fatal•1mo ago

Go one level up:

    claude2() {
      claude "$(claude "Generate a prompt and TODO list that works towards this goal: <goal>$*</goal>" -p)"
    }

    $ claude2 pls give ranked ideas for make code better

fudged71•1mo ago

This tells me that we need to build 1000 more linters of all kinds

xnorswap•1mo ago

Unironically I agree.

One under-discussed lever that senior / principal engineers can pull is the ability to write linters & analyzers that will stop junior engineers ( or LLMs ) from doing something stupid that's specific to your domain.

Let's say you don't want people to make async calls while owning a particular global resource, it only takes a few minutes to write an analyzer that will prevent anyone from doing so.

Avoid hours of back-and-forth over code review by encoding your preferences and taste into your build pipeline and stop it at source.
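As a rough illustration of the kind of domain-specific check described above, here is a toy, line-based sketch. The function names `acquireGlobal` and `releaseGlobal` are hypothetical stand-ins for whatever marks ownership of the global resource, and a real analyzer would more likely work on the syntax tree (for example as a custom lint rule) than on raw text.

```typescript
// Toy analyzer: reports any `await` that appears between a hypothetical
// acquireGlobal() call and the matching releaseGlobal() call.
type Finding = { line: number; text: string };

function findAwaitWhileHolding(source: string): Finding[] {
  const findings: Finding[] = [];
  let holding = false;
  source.split("\n").forEach((text, i) => {
    if (text.includes("acquireGlobal(")) {
      holding = true;
    } else if (text.includes("releaseGlobal(")) {
      holding = false;
    } else if (holding && /\bawait\b/.test(text)) {
      // Line numbers are 1-based to match editor conventions.
      findings.push({ line: i + 1, text: text.trim() });
    }
  });
  return findings;
}

// Sample input: the first await happens while the resource is held,
// the second one after it has been released.
const sample = [
  "acquireGlobal();",
  "const data = await fetchData();",
  "releaseGlobal();",
  "await save(data);",
].join("\n");

const findings = findAwaitWhileHolding(sample);
```

Wired into the build, a non-empty `findings` list would fail the pipeline, which is the point of the comment: the preference is enforced before code review rather than argued during it.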

jmalicki•1mo ago

And for more complex linters I find that it can be easy to get the LLM to write most of it itself!!!

pdntspa•1mo ago

That's why you treat it like a junior dev. You do the fun stuff of supervising the product, overseeing design and implementation, breaking up the work, and reviewing the outputs. It does the boring stuff of actually writing the code.

I am phenomenally productive this way, I am happier at my job, and its quality of work is extremely high as long as I occasionally have it stop and self-review its progress against the style principles articulated in its AGENTS.md file. (As it tends to forget a lot of rules like DRY)

n4r9•1mo ago

I think we have different opinions on what's fun and what's boring!

moffkalast•1mo ago

He's a real straight shooter with upper management written all over him.

wpasc•1mo ago

but what would you say... you do here?

SoftTalker•1mo ago

Ummm, yeah... I’m gonna have to go ahead and sort of disagree with you there.

AStrangeMorrow•1mo ago

I really enjoy writing some of the code. But some is a pain. Never have fun when the HQ team asks for API changes for the 5th time this month. Or for that matter writing the 2000 lines of input and output data validation in the first place. Or refactoring that ugly dictionary passed all over the place to be a proper class/dataclass. Handling config changes. Lots of that piping job.

Some tasks I do enjoy coding. Once in the flow it can be quite relaxing.

But mostly I enjoy the problem solving part: coming up with the right algorithm, a nice architecture, the proper set of metrics to analyze etc

embedding-shape•1mo ago

Some people are into designing software, others like to put the design into implementation, others like cleaning up implementations, yet others like making functional software faster.

There is enough work for all of us to be handsomely paid while having fun doing it :) Just find what you like, and work with others who like other stuff, and you'll get through even the worst of problems.

For me the fun comes not from the action of typing stuff with my sausage fingers and seeing characters end up on the screen, but basically everything before that and after that. So if I can make "translate what's in my head into source on disk something can run" faster, that's a win in my book, but not if the quality degrades too much, so I keep tight control over it while still not having to use my fingers to actually type.

mkehrt•1mo ago

I've found that good AI-based tab completion is the sweet spot for me. I am still writing code, but I don't have to type all of it if it's obvious.

OkayPhysicist•1mo ago

This has been my approach, as well. I've got a neovim setup where I can 1) open up a new buffer, ask a question, and then copy/paste from it and 2) prompt the remainder of the line, function, or class. (the latter two are commands I run, rather than keybinds).

Nemi•1mo ago

You've really hit the crux of the problem and why so many people have differing opinions about AI coding. I also find coding more fun with AI. The reason is that my main goal is to solve a problem, or someone else's problem, in a way that is satisfying. I don't much care about the code itself anymore. I care about the thing that it does when it's done.

Having said that I used to be deep into coding and back then I am quite sure that I would have hated AI coding. I think for me it comes down to – when I was learning about coding and stretching my personal knowledge in the area, the coding part was the fun part because I was learning. Now that I am past that part I really just want to solve problems, and coding is the means to that end. AI is now freeing because where I would have been reluctant to start a project, I am more likely to give it a go.

I think it is similar to when I used to play games a lot. When I would play a game where you would discover new items regularly, I would go at it hard and heavy up until the point where I determined there were either no new items to be found or it was just "more of the same". When I got to that point it was like a switch would flip and I would lose interest in the game almost immediately.

pdntspa•1mo ago

You are hitting the nail on the head. We are not being hired to write code. We are being hired to solve problems. Code is simply the medium.

wahnfrieden•1mo ago

I believe wage work has a significant factor in all this.

Most are not paid for results, they're paid for time at desk and regular responsibilities such as making commits, delivering status updates, code reviews, etc. - the daily activities of work are monitored more closely than the output. Most ESOPs grant so little equity that working harder could never observably drive an increase in its value. Getting a project done faster just means another project to begin sooner.

Naturally workers will begin to prefer the motions of the work they find satisfying more than the result it has for the business's bottom line, from which they're alienated.

Sammi•1mo ago

> Naturally workers will begin to prefer the motions of the work they find satisfying more than the result it has for the business's bottom line, from which they're alienated.

Wow. I've read a lot of hacker news this past decade, but I've never seen this articulated so well before. You really lifted the veil for me here. I see this everywhere, people thinking the work is the point, but I haven't been able to crystallize my thoughts about it like you did just now.

thenewwazoo•1mo ago

Marx had a lot of good ideas, though you wouldn't know it by listening to capitalist-controlled institutions.

https://en.wikipedia.org/wiki/Marx%27s_theory_of_alienation

order-matters•1mo ago

I think it's related. The nature of the wage work likely also self-selects for people who simply enjoy coding and being removed from the bigger picture problems they are solving.

I'm on the side of only enjoying coding to solve problems, and I skipped software engineering and coding for work explicitly because I did not want to participate in that dynamic of being removed from the problems. Instead I went into business analytics, and now that AI is gaining traction I am able to do more of what I love - improving processes and automation - without ever really needing to "pay dues" doing grunt work I never cared to be skilled at in the first place unless it was necessary.

hibikir•1mo ago

It gets worse than that: You can possibly get rewarded based on your manager's goals, or maybe your skip level's, but that doesn't necessarily have to line up all that well with more serious business goals. I am sure I am not the only one that had to help initiatives that I thought would be, at best, just wasteful to the business, or that we could get 80% of the value with 20% of the efforts. But it's ultimately about the person who writes the review.

This gets us to the rule number one of being successful at a job: Make sure your manager likes you. Get 8 layers of people whose priority is just to be sure their manager likes them, and what is getting done is very unlikely to have much to do with shareholder value, customer happiness, or anything like that.

agumonkey•1mo ago

but do you solve the problem if you just slap a prompt and iterate while the LLM gathers diffs ?

eclipxe•1mo ago

Yes?

ben_w•1mo ago

Depends what the problem is.

Sometimes you can, sometimes you have to break the problem apart and get the LLM to do each bit separately, sometimes the LLM goes funny and you need to solve it yourself.

Customers don't want you wasting money doing by hand what can be automated, nor do they want you ripping them off by blindly handing over unchecked LLM output when it can't be automated.

agumonkey•1mo ago

there are other ways: being scammed by lazy devs using AI to produce what devs normally do and not saving any money for the customer. i mentioned it in another thread, i heard first hand people say "i will never report how much time savings i get from gemini, at best i'll say 1 day a month"

Dylan16807•1mo ago

If you produce the same product, then you get to ask for the same pay. That's not a scam.

If enough people can make the product faster, then competition will drive the price down. But the ability to charge less is not at all an obligation to charge less.

agumonkey•1mo ago

it's paradoxical, the llm is not helping consumers, it's not helping the experienced engineer, it's helping a new class of devs that just want the easy way out. and ultimately this wave will make the price go down to the point the skilled dev won't be able to sustain long term growth because learning more and more advanced will not be valued by the economy.. just a thought but i don't see a nice path ahead now

Dylan16807•1mo ago

Why is it not helping the experienced engineer? I don't fully understand your scenario.

If the experienced engineer is already faster than the LLM, their job is not at risk.

If the LLM is faster than the experienced engineer at making some kind of code product, then the experienced engineer can use it to save time. And in the short term they can spend even more time learning! Maybe it's a net negative because it helps the "new class of devs that just want the easy way out" more, but it's still helping the experienced engineer.

And if increased competition drops the price then the LLM's influence is helping customers.

agumonkey•1mo ago

you may have a point, i'm fuzzy in my perception right now but there are non-linear factors imo, here's how i see things

- a market needs a certain kind of product (feature set, complexity, performance)

- good engineer could apply skills to deliver that

- lazy engineers couldn't, but with llm they can, it gives them solutions without understanding much, which is irrelevant for them, they want to ship

- i myself don't enjoy having code spilled out for me, and the time savings from llm won't bring much more joy (unlike the lazy engineer who is happy)

- a llm might help me do more advanced things but the market might not care for it. say the average user wants a dashboard with a bunch of data points and a few actions. the llm answer will match that perfectly. I could ask the llm to produce a more complex dashboard with more customization, more features.. but the user will not want it because it's beyond their needs

so yeah it's a matter of ratio, it seems that lazy devs will get a 10x improvement while a skilled one will only get 1.5 and might be squeezed out of the market

pdntspa•1mo ago

If the client is happy, the code is well-formed, and it solves their problem in a cost-effective manner, what is not to like?

agumonkey•1mo ago

cause the 'dev' didn't solve anything

ultimately i wonder how long people will need devs at all if you can all prompt your wishes

some will be kept to fix the occasional hallucination and that's it

agumonkey•1mo ago

it's true that 'code' doesn't mean much, but the ability to manage different layers, states to produce logic modules was the challenge

getting things solved entirely feels very very numbing to me

even when gemini or chatgpt solves it well, and even beyond what i'd imagine.. i feel a sense of loss

breuleux•1mo ago

I think it ultimately comes down to whether you care more about the what, or more about the how. A lot of coders love the craft: making code that is elegant, terse, extensible, maintainable, efficient and/or provably correct, and so on. These are the kind of people who write programming languages, database engines, web frameworks, operating systems, or small but nifty utilities. They don't want to simply solve a problem, they want to solve a problem in the "best" possible way (sometimes at the expense of the problem itself).

It's typically been productive to care about the how, because it leads to better maintainability and a better ability to adapt or pivot to new problems. I suppose that's getting less true by the minute, though.

doug_durham•1mo ago

Crafting code can be self-indulgent since most common patterns have been implemented multiple times in multiple languages. A lot of time the craft oriented developer will reject an existing implementation because it doesn't match their sensibilities. There is absolutely a role for craft, however the amount of craft truly needed in modern development is not as large as people would like. There are lots of well crafted libraries and frameworks that can be adopted if you are willing to accommodate their world view.

breuleux•1mo ago

As someone who does that a lot... I agree. Self-indulgent is the word. It just feels great when the implementation is a perfect fit for your brain, but sometimes that's just not a good use of your time.

Sometimes, you strike gold, so there's that.

sfn42•1mo ago

I kind of struggle with this. I basically hate everyone else's code, and by that I mean I hate most people's code. A lot of people write awesome code but most people write what I'd call trash code.

And I do think there's more to it than preference. Like there's actual bugs in the code, it's confusing and because it's confusing there's more bugs. It's solving a simple problem but doing so in an unnecessarily convoluted way. I can solve the same problem in a much simpler way. But because everything is like this I can't just fix it, there's layers and layers of this convolution that can't just be fixed and of course there's no proper decoupling etc so a refactor is kind of all or nothing. If you start it's like pulling on a thread and everything just unravels.

This is going to sound pompous and terrible but honestly sometimes I feel like I'm too much better than other developers. I have a hard time collaborating because the only thing I really want to do with other people's code is delete it and rewrite it. I can't fix it because it isn't fixable, it's just trash. I wish they would have talked to me before writing it, I could have helped then.

Obviously in order to function in a professional environment I have to suppress this stuff and just let the code be ass but it really irks me. Especially if I need to build on something someone else made - it's almost always ass, I don't want to build on a crooked foundation. I want to fix the foundation so the rest of the building can be good too. But there's no time and it's exhausting fixing everyone else's messes all the time.

KronisLV•1mo ago

I’ve linked this before, but I feel like this might resonate with you: https://www.stilldrinking.org/programming-sucks

sfn42•1mo ago

I enjoyed that but honestly it kind of doesn't really resonate. Because it's like "This stuff is really complicated and nobody knows how anything works etc and that's why everything is shit".

I'm talking about simple stuff that people just can't do right. Not complex stuff. Like imagine some perfect little example code on the react docs or whatever, good code. Exemplary code. Trivial code that does a simple little thing. Now imagine some idiot wrote code to do exactly the same thing but made it 8 times longer and incredibly convoluted for absolutely no reason and that's basically what most "developers" do. Everyone's a bunch of stupid amateurs who can't do simple stuff right, that's my problem. It's not understandable, it's not justifiable, it's not trading off quality for speed. It's stupidity, ignorance and laziness.

That's why we have coding interviews that are basically "write fizzbuzz while we watch" and when I solve their trivial task easily everyone acts like I'm Jesus because most of my peers can't fucking code. Like literally I have colleagues with years of experience who are barely at a first year CS level. They don't know the basics of the language they've been working with for years. They're amateurs.

KronisLV•1mo ago

Then it’s quite possible that you’re working in an environment that naturally leads to people like that getting hired. If that’s something you see repeatedly, then the environment isn’t a good fit for you and you aren’t a good fit for it. So you’d be better served by finding a place where the standards are as high as you want, from the very first moment in the hiring process.

For example, Oxide Computers has a really interesting approach https://oxide.computer/careers

Obviously that’s easier said than done but there are quite a few orgs out there like that. If everyone around you doesn’t care about something or can’t do it, it’s probably a systemic problem with the environment.

manmal•1mo ago

Yeah a bridge has a plan that it’s built and verified against. It’s the picture book waterfall implementation. The software industry has moved away from this approach because software is not like bridges.

KronisLV•1mo ago

> It’s the picture book waterfall implementation.

One of my better experiences with software development was actually with something waterfall-adjacent. The people I was developing software for produced a 50 page spec ahead of any code being written.

That let me get a complete picture of the business domain. That let me point out parts of the spec that were just wrong in regards to the domain model and also things that could be simplified. Implementation became way more straightforward and I still opted for a more iterative approach than just one deliverable at the end. About 75% of the spec got built and 25% was found to be not necessary, it was a massive success - on time and with fewer bugs than your typical 2 week "we don't know the big picture" slop that's easy to get into with indecisive clients.

Obviously it wasn't "proper" waterfall and it also didn't try to do a bunch of "agile" Scrum ceremonies but borrowed whatever I found useful. Getting a complete spec of the business needs and domain and desired functionality (especially one without prescriptive bullshit like pixel perfect wireframes and API docs written by people that won't write the API) was really good.

skydhash•1mo ago

If you can't get a complete spec, it's better to start with something small that you can get detailed info on, and then iterate upon that. It will involve refactoring, but that is better than badly designing the whole thing from the get go.

gmueckl•1mo ago

I can guarantee you that if you were to write a completely new program and continued to work on it for more than 5 years, you'd feel the same things about your own code eventually. It's just unavoidable at some point. The only thing left then is degrees of badness. And nothing is more humbling than realizing that the only person that got you there is yourself.

jffhn•1mo ago

I can guarantee you that I have been doing just that for 20 years, creating and working on the same codebase, and that it only got better with time (cleaner code and more robust execution), though more complex because the domain itself did. We would have been stuck in the accidental complexity of messy hacks and their buggy side effects if we had not continuously adapted and improved things.

sfn42•1mo ago

No, I wouldn't. I have been working for years on the same codebase, it's not that hard to keep it clean and simple. I just refactor/redesign when necessary instead of adding hacky workarounds on top of hacky workarounds for years until the codebase is nothing but a collection of workarounds.

And most importantly I just design it well from the start, it's not that hard to do. At least for me.

Of course we all make mistakes, there are bugs in my code too. I have made choices I regret. But not on the level that I'm talking about.

pdntspa•1mo ago

I feel this too. And it seems like the very worst code always seems to come from the people that seem the smartest, otherwise. I've worked for a couple of people that are either ACM alum and/or have their own wikipedia page, multiple patents to their name and leaders in business, and beyond anyone else that I have ever worked with, their code has been the worst.

Which is part of what I find so motivating with AI. It is much better at making sense of that muck, and with some guidance it can churn out code very quickly with a high degree of readability.

danielmarkbruce•1mo ago

did you ever consider their code was good and it's you that is the problem?

pdntspa•1mo ago

I did, and that is very much not the case here.

I don't know how a "good" programmer opens the same gig+ file for writing in multiple threads (dozens sometimes) without any kind of concurrency management.

A "good" programmer doesn't give you a 2000+-line python script where every variable has no more than two characters in its name, with 0 comments or explanatory info.

A "good" programmer doesn't write a cluster that checks an "OK" REST endpoint on a regular interval, and then have that same cluster freak the fuck out and check 10-100x as often if that "OK" does not arrive exactly as it should.
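For contrast, a common way to handle a missing "OK" is to check less often as failures pile up, not more. The sketch below shows capped exponential backoff. The function name, parameters, and the 300-second cap are assumptions for illustration, not anything taken from the system described above.

```python
def next_poll_delay(base_seconds, consecutive_failures, max_seconds=300.0):
    """Seconds to wait before the next health check.

    The delay doubles with each consecutive failure and is capped at
    max_seconds, so a struggling endpoint is polled less often, not more.
    """
    delay = base_seconds * (2 ** consecutive_failures)
    return min(delay, max_seconds)
```

With a 10-second base interval, three consecutive failures push the next check out to 80 seconds. Production versions usually add random jitter so that many clients do not retry in lockstep.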

danielmarkbruce•1mo ago

I'll take a guess - you've never spent a minute at a company that is considered world class as far as software engineering goes. Am I right?

purple_turtle•1mo ago

This post seems misplaced.

93n•1mo ago

Yeah, I know this feeling very well.

I usually attribute it to people being lazy, not caring, or not using their brain.

It's quite frustrating when something is *so obviously* wrong, to the point that anyone with a modicum of experience should be able to realize that what was implemented is totally whack. Please, spend at least a few minutes reviewing your work so that I don't have to waste my time on nonsense.

ben_w•1mo ago

> > I think we have different opinions on what's fun and what's boring!

> You've really hit the crux of the problem and why so many people have differing opinions about AI coding.

Part of it perhaps, but there's also a huge variation in model output. I've been getting some surprisingly bad generations from ChatGPT recently, though I'm not sure if that's ChatGPT getting worse or me getting used to a much higher quality of code from Claude Code which seems to test itself before saying "done". I have no idea if my opinion will flip again now 5.2 is out.

And some people are bad communicators, an important skill for LLMs, though few will recognise it because everyone knows what they themselves meant by whatever words they use.

And some people are bad planners, likewise an important skill for breaking apart big tasks that LLMs can't do into small ones they can do.

danielmarkbruce•1mo ago

This isn't just in coding. My goodness the stuff I see people write into an LLM and then say "see! It's stupid!". Some people are naturally good at prompting and some people just are not. The differences in output are dramatic.

libraryofbabel•1mo ago

I like this framing; I think it captures some of the key differences between engineers who are instinctively enthusiastic about AI and those who are not.

Many engineers walk a path where they start out very focussed on programming details, language choice, and elegant or clever solutions. But if you're in the game long enough, and especially if you're working in medium-to-large engineering orgs on big customer-facing projects, you usually kind of move on from it. Early in my career I learned half a dozen programming languages and prided myself on various arcane arts like metaprogramming tricks. But after a while you learn that one person's clever solution is another person's maintainability nightmare, and maybe being as boring and predictable and direct as possible in the code (if slightly more verbose) would have been better. I've maintained some systems written by very brilliant programmers who were just being too clever by half.

You also come to realize that coding skills and language choice don't matter as much as you thought, and the big issues in engineering are 1) are you solving the right problem to begin with 2) people/communication/team dynamics 3) systems architecture, in that order of importance.

And also, programming just gets a little repetitive after a while. Like you say, after a decade or so, it feels a bit like "more of the same." That goes especially for most of the programming most of us are doing most of the time in our day jobs. We don't write a lot of fancy algorithms, maybe once in a blue moon and even then you're usually better off with a library. We do CRUD apps and cookie-cutter React pages and so on and so on.

If AI coding agents fall into your lap once you've reached that particular variation of a mature stage in your engineering career, you probably welcome them as a huge time saver and a means to solve problems you care about faster. After a decade, I still love engineering, but there aren't many coding tasks I particularly relish diving into. I can usually vaguely picture the shape of the solution in my head out the gate, and actually sitting down and doing it feels rather a bore and just a lot of typing and details. Which is why it's so nice when I can kick off a Claude session to do it instead, and review the results to see if they match what I had in mind.

Don't get me wrong. I still love programming if there's just the right kind of compelling puzzle to solve (rarer and rarer these days), and I still pride myself on being able to do it well. Come the holidays I will be working through Advent of Code with no AI assistance whatsoever, just me and vim. But when January rolls around and the day job returns I'll be having Claude do all the heavy lifting once again.

skydhash•1mo ago

I'm guessing, but I'm pretty sure you're dealing with big balls of mud, which have dampened your love of coding. Where implementing something is more about solving accidental complexity and dealing with technical debt than actually doing the job.

libraryofbabel•1mo ago

I've seen some balls of mud, sure, but I don't think that's the essence of it. It's more like:

1) When I already have a rough picture of the solution to some programming task in my head up front, I do not particularly look forward to actually going and doing it. I've done enough programming that many things feel like a variation on something I've done before. Sometimes the task is its own reward because there is a sufficiently hard and novel puzzle to solve. Mostly it is not and it's just a matter of putting in the time. Having Claude do most of the work is perfect in those cases. I don't think this is particularly anything to do with working on a ball of mud: it applies to most kinds of work on clean well-architected projects as well.

2) I have a restless mind and I just don't find doing something that interesting anymore once I have more or less mastered it. I'd prefer to be learning some new field (currently, LLMs) rather than spending a lot of time doing something I already know how to do. This is a matter of temperament: there is nothing wrong with being content in doing a job you've mastered. It's just not me.

skydhash•1mo ago

> 1) When I already have a rough picture of the solution to some programming task in my head up front, I do not particularly look forward to actually going and doing it.

Every time I think I have a rough picture of some solution, there's always something in the implementation that proves me wrong. Then it's reading docs and figuring whatever gotchas I've stepped into. Or where I erred in understanding the specifications. If something is that repetitive, I refactor and try to make it simple.

> I have a restless mind and I just don't find doing something that interesting anymore once I have more or less mastered it.

If I've mastered something (And I don't believe I've done so for pretty much anything), the next step is always about eliminating the tedium of interacting with that thing. Like a code generator for some framework or adding special commands to your editor for faster interaction with a project.

altmanaltman•1mo ago

A few counterpoints:

1. If you don't care about code and only care about the "thing that it does when it's done", how do you solve problems in a way that is satisfying? Because you are not really solving any problem but just using the AI to do it. Is prompting more satisfying than actually solving?

2. You claim you're done "learning about coding and stretching my personal knowledge in the area" but don't you think that's super dangerous? Like how can you just be done with learning when tech is constantly changing and new things come up everyday. In that sense, don't you think AI use is actually making you learn less and you're just justifying it with the whole "I love solving problems, not code" thing?

3. If you don't care about the code, do the people who hire you for it do? And if they do, then how can you claim you don't care about the code when you'll have to go through a review process and at least check the code meaning you have to care about the code itself, right?

pdntspa•1mo ago

Why can't both things be true? You can care about the code even if you don't write it. You can continue learning things by reading said code. And you can very rigidly enforce code quality guidelines and require the AI adhere to them.

altmanaltman•1mo ago

I mean if you're reading it and "rigidly" enforcing code quality guidelines, then you do care about the code, right? But the parent comment said they don't care about the code but what it does. Both of them cannot be true at the same time, since in your example, you do care about the code enough to read it and refactor it based on guidelines and not just "what the code" does.

keeda•1mo ago

Note I'm not saying one is better than the other, but my takes:

1. The problem solving is in figuring out what to prompt, which includes correctly defining the problem, identifying a potential solution, designing an architecture, decomposing it into smaller tasks, and so on.

Giving it a generic prompt like "build a fitness tracker" will result in a fully working product but it will be bland as it would be the average of everything in its training data, and won't provide any new value. Instead, you probably want to build something that nobody else has, because that's where the value is. This will require you to get pretty deep into the problem domain, even if the code itself is abstracted away from you.

Personally, once the shape of the solution and the code is crystallized in my head typing it out is a chore. I'd rather get it out ASAP, get the dopamine hit from seeing it work, and move on to the next task. These days I spend most of my time exploring the problem domain rather than writing code.

2. Learning still exists but at a different level; in fact it will be the only thing we will eventually be doing. E.g. I'm doing stuff today that I had negligible prior background in when I began. Without AI, I would probably require an advanced course to just get up to speed. But now I'm learning by doing while solving new problems, which is a brand new way of learning! Only I'm learning the problem domain rather than the intricacies of code.

3. Statistically speaking, the people who hire us don't really care about the code, they just want business results. (See: the difficulty of funding tech debt cleanup projects!)

Personally, I still care about the code and review everything, whether written by me or the AI. But I can see how even that is rapidly becoming optional.

I will say this: AI is rapidly revolutionizing our field and we need to adapt just as quickly.

altmanaltman•1mo ago

Honestly, I fundamentally disagree with this. Figuring out "what to prompt" is not problem-solving in a true sense imo. And if you're really going too deep into the problem domain, what is the point of having the code abstracted?

My comment was based on you saying you don't care about the code and only what it does. But now you're saying you care about the code and review everything so I'm not sure what to make out of it. And again, I fundamentally disagree that reviewing code will become optional or rather should become optional. But that's my personal take.

keeda•1mo ago

> My comment was based on you saying you don't care about the code and only what it does. But now you're saying you care about the code and review everything so I'm not sure what to make out of it.

I'm not the person you originally replied to, so my take is different, which explains your confusion :-)

However I do increasingly get the niggling sense I'm reviewing code out of habit rather than any specific benefit because I so rarely find something to change...

> And if you're really going too deep into the problem domain, what is the point of having the code abstracted?

Let's take my current work as an example: I'm doing stuff with computer vision (good old-fashioned OpenCV, because ML would be overkill for my case.) So the problem domain is now images and perception and retrieval, which is what I am learning and exploring. The actual code itself does not matter as much as the high-level approach and the component algorithms and data structures -- none of which are individually novel BTW, but I believe I'm the only one combining them this way.

As an example, I give a high-level prompt like "Write a method that accepts a list of bounding boxes, find all overlapping ones, choose the ones with substantial overlap and consolidate them into a single box, and return all consolidated boxes. Write tests for this method." The AI runs off and generates dozens of lines of code -- including a tunable parameter to control "substantial overlap", set to a reasonable default -- the tests pass, and when I plug in the method, 99.9% of the time the code works as expected. And because this is vision-based I can immediately verify by sight if the approach works!
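For illustration only, here is a minimal sketch of what such a prompt might produce. This is not the code from the comment: the names `overlap_ratio`, `consolidate_boxes`, and `min_overlap`, the `(x1, y1, x2, y2)` box layout, and the choice to measure overlap against the smaller box are all assumptions made for the example.

```python
from typing import List, Tuple

# Assumed layout: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[int, int, int, int]


def overlap_ratio(a: Box, b: Box) -> float:
    """Intersection area divided by the area of the smaller box (0.0 if disjoint)."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return (inter_w * inter_h) / min(area_a, area_b)


def consolidate_boxes(boxes: List[Box], min_overlap: float = 0.5) -> List[Box]:
    """Repeatedly merge any two boxes whose overlap ratio reaches min_overlap.

    min_overlap is the hypothetical tunable for "substantial overlap".
    Each merge replaces a pair with its enclosing box, so the loop terminates.
    """
    merged = list(boxes)
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if overlap_ratio(merged[i], merged[j]) >= min_overlap:
                    a, b = merged[i], merged[j]
                    merged[i] = (
                        min(a[0], b[0]),
                        min(a[1], b[1]),
                        max(a[2], b[2]),
                        max(a[3], b[3]),
                    )
                    del merged[j]
                    changed = True
                    break
            if changed:
                break
    return merged


boxes = [(0, 0, 10, 10), (2, 2, 12, 12), (50, 50, 60, 60)]
print(consolidate_boxes(boxes))  # [(0, 0, 12, 12), (50, 50, 60, 60)]
```

The restart-after-merge loop is quadratic or worse, which is fine for a handful of boxes; a real implementation with many boxes would probably want a spatial index or a union-find pass instead.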

To me, the valuable part was coming up with that whole approach based on bounding boxes, which led to that prompt. The actual code in itself is not interesting because it is not a difficult problem, just a cumbersome one to handcode.

To solve the overall problem I have to combine a large number of such sub-problems, so the leverage that AI gives me is enormous.

skydhash•1mo ago

What people are wary of is not solving the problem in the first pass. They are wary of technical debt and unmaintainable code. The cost of change can be enormous. Software engineering is mostly about solving current problems and laying the foundation to adapt for future ones at the same time. Your approach's only focus is current problems which is pretty much the same as people that copypaste from StackOverflow without understanding.

keeda•1mo ago

Technical debt and understanding is exactly why I still review the code.

But as I said, it's getting rare that I need to change anything the AI generates. That's partly because I decompose the problem into small, self-contained tasks that are largely orthogonal and easily tested -- mostly a functional programming style. There's very little that can go wrong because there is little ambiguity in the requirements, which is why a 3 line prompt can reliably turn into dozens of lines of working, tested code.

The main code I deal with manually is the glue that composes these units to solve the larger computer vision problem. Ironically, THAT is where the tech debt is, primarily because I'm experimenting with combinations of dozens of different techniques and tweaks to see what works best. If I knew what was going to work, I'd just prompt the AI to write it for me! ;-)

sho•1mo ago

> Figuring out "what to prompt" is not problem-solving in a true sense

This just sounds like "no true scotsman" to me. You have a problem and a toolkit. If you successfully solve the problem, and the solution is good enough, then you are a problem solver by any definition worth a damn.

The magic and the satisfaction of good prompting is getting to that "good enough", especially architecturally. But when you get good at it - boy, you can code rings around other people or even entire teams. Tell me how that wouldn't be satisfying!

skydhash•1mo ago

> The problem solving is in figuring out what to prompt, which includes correctly defining the problem, identifying a potential solution, designing an architecture, decomposing it into smaller tasks, and so on

Coding is just a formal specification, one that is suited to be automatically executed by a dumb machine. The nice trick is that the basic semantic units from a programming language are versatile enough to give you very powerful abstractions that can fit nicely with the solution you are designing.

> Personally, once the shape of the solution and the code is crystallized in my head typing it out is a chore

I truly believe that everyone that says that typing is a chore once they've got the shape of a solution gets frustrated by the amount of bad assumptions they've made. That ranges from not having a good design in place to not learning the tools they're using and fighting them during the implementation (like using React in an imperative manner). You may have something as extensive as a network protocol RFC, and still get hit by conflicts between the specs and what works.

keeda•1mo ago

I think you would be surprised by how much these AIs can "fill in the blanks" based on the surrounding code and high-level context! Here is an example I posted a few months ago (which is coincidentally, related to the reply I just gave the sibling comment): https://news.ycombinator.com/item?id=44892576

Look at the length of my prompt and the length of the code. And that's not even including the tests I had it generate. It made all the right assumptions, including specifying tunable optional parameters set to reasonable defaults and (redacted) integrating with some proprietary functions at the right places. It's like it read my mind!

Would you really think writing all that code by hand would have been comparable to writing the prompt?

skydhash•1mo ago

I’m not surprised. It would be like being surprised by the fact that computers can generate a human portrait (which has been a thing since before LLMs), but people are still using 3D software because while it takes more time, they have more control over the final result.

keeda•1mo ago

We still have complete control over the code, because after the AI generates it, it's right there to tweak as we want!

But the point is, there were no assumptions or tooling or bad designs that had to be fought. Just an informal, high-level prompt that generated the exact code I wanted in a fraction of the time. At least to me that was pretty surprising -- even if it'd become routine for a while by then -- because I'd expect that level of wavelength-match between colleagues who had been working on the same team for a while.

PaulDavisThe1st•1mo ago

> Coding is just a formal specification

If you really believe this, I'd never want to hire you. I mean, it's not wrong, it's just ... well, it's not even wrong.

snovv_crash•1mo ago

I'd still hire them, in fact I see that level of understanding as a green flag.

Your response and depth of reasoning about why you wouldn't hire them is a red flag though. Not for a manager role and certainly not as an IC.

PaulDavisThe1st•1mo ago

I provided zero depth of reasoning.

Coding is as much a method of investigating and learning about a problem as it is any sort of specification. It is as much play as it is description. Somebody who views code as nothing more than a formal specification that tells a computer what to do is inhibiting their ability to play imaginatively with the problem space, and in the work that I do, that is absolutely critical.

snovv_crash•1mo ago

> zero reasoning

Yes.

> inhibiting play

Strongly disagree. The more abstraction layers you can see across, the bigger your toolbox and the more innovative your solutions to problems can be.

RHSeeger•1mo ago

> I truly believe that everyone that says that typing is a chore once they've got the shape of a solution gets frustrated by the amount of bad assumptions they've made.

To a lot of people (clearly not yourself included), the most interesting part of software development is the problem solving part; the puzzle. Once you know _how_ to solve the puzzle, it's not all that interesting actually doing it.

That being said, you may be using the word "shape" in a much more vague sense than I am. When I know the shape of the solution, I know pretty much everything it takes to actually implement it. That also means I'm very bad at generating LOEs because I need to dig into the code and try things out, to know what works... before I can be sure I have a viable solution plan.

skydhash•1mo ago

I understand your point. But what you should be saying is that you have an idea of the correct solution. But the only correct solution is code or a formal proof that it is in fact correct. It’s all wishes and dreams otherwise. If not, we wouldn’t have all of those buffer overflows, off-by-one errors, and XSS vulnerabilities.

RHSeeger•1mo ago

All we _ever_ have is an idea of the correct solution. There's no point at which we can ever say "this is the correct solution", at least not for any moderately sized software problem.

That being said, we can say

- Given the implementation options we've found, this solution/direction is what we think is the best

- We have enough information now that it is unlikely anything we find out is going to change the solution

- We know enough about the solution that it is extremely unlikely that there are any more real "problems/puzzles" to be solved

At that point, we can consider the solution "found" and actually implementing it is no longer a part of solving it. Could the implemented solution wind up having to deal with an off-by-one error that we need to fix? Sure... but that's not "puzzle solving". And, for a lot of people, it's just not the interesting part.

danielmarkbruce•1mo ago

are you really solving the problem, or is the compiler doing it?

altmanaltman•1mo ago

is the compiler really solving the problem or the electricity flowing through the machine?

ukuina•1mo ago

Is it the electricity, or is it quantum entanglement with Roko's Basilisk?

hellouruguay•1mo ago

Is it the Basilisk, or just a bit flip in the parent simulation?

danielmarkbruce•1mo ago

the parent simulation wouldn't use something so crude as a "bit"...

danielmarkbruce•1mo ago

exactly.....

standarditem•1mo ago

I feel the same way when writing code for work. It's pretty neat to have an AI bot working on the grunt work for me while I review and write high level algorithms. It's quicker and I get less burnt out.

But I still love getting my hands dirty and writing code as a mental puzzle. And the best puzzles tend to happen outside of a work environment anyways. So I continue to work through advent of code problems (for example) as a way of exercising that muscle.

RHSeeger•1mo ago

> I don't much care about the code itself anymore.

I use writing the code as a way to investigate the options and find new ones. By the time I'm sure of the correct way to implement something, half the code is written [1]. At that point, now that I know what and how I'm going to do, it starts to get boring. I think what would work best for me would be able to say "ok, now finish this" to the AI and have it do that boring part.

[1] This also makes my LOEs horrible, because I don't know what I'm going to build until I've completed half of it. And figuring out how long it will take to do something that isn't defined is... inaccurate.

FeteCommuniste•1mo ago

Maybe I'm weird but I enjoy "actually writing the code."

pdntspa•1mo ago

Me writing code is me spending 3/4 of my time wading through documentation and google searches. It's absolutely hell on my ADD. My ability to memorize is absolutely garbage. Throughout my career I've worked in like 10 different languages, and in any given project I'm usually working in at least 3 or 4. There's a lot of "now what is a map operation in this stupid fucking language called again?!"

Claude writing code gets the same output if not better in about 1/10 of the time.

That's where you realize that the writing code bits are just one small part of the overall picture. One that I realize I could do without.

n4r9•1mo ago

May be a domain issue? If you're largely coding within a JS framework (which most software devs are tbf) then that makes total sense. If you're working in something like fintech or games, perhaps less so.

pdntspa•1mo ago

My last job was a mix of Ruby, Python, Bash, SQL, and Javascript (and CSS and HTML). One or two jobs before that it was all those plus a smattering of C. A few jobs before that it was C# and Perl.

tayo42•1mo ago

How do you end up with 3 to 4 languages in one project?

saulpw•1mo ago

Typescript on the frontend, Python on the backend, SQL for the database, bash for CI. This isn't even counting HTML/CSS or the YAML config.

tayo42•1mo ago

I wouldn't call html, yaml or css languages.

Same for sql, do you really context switch between sql and other code that frequently?

Everyone should stop using bash, especially if you have a scripting language you can use already.

pdntspa•1mo ago

Dude have you even written any hardcore SQL? PL/pgSQL is very much a Turing-complete language

wosat•1mo ago

Sorry for being pedantic, but what does the "L" stand for in HTML, YAML, SQL? They may not be "programming languages" or, in the case of SQL, a "general purpose programming language", but they are indeed languages.

tayo42•1mo ago

You don't have to apologize. It's the internet, pedantry is expected

tomgp•1mo ago

HTML, CSS, Javascript?

pdntspa•1mo ago

Oh my sweet summer child...

merely-unlikely•1mo ago

Recently I've been experimenting with using multiple languages in some projects where certain components have a far better ecosystem in one language but the majority of the project is easier to write in a different one.

For example, I often find Python has very mature and comprehensive packages for a specific need I have, but it is a poor language for the larger project (I also just hate writing Python). So I'll often put the component behind a http server and communicate that way. Or in other cases I've used Rust for working with WASAPI and win32 which has some good crates for it, but the ecosystem is a lot less mature elsewhere.
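A rough sketch of the "put the Python component behind an HTTP server" pattern, using only the standard library. Everything here is a placeholder: `summarize` stands in for whatever Python-only package is being wrapped, and the JSON payload shape, handler name, and port are assumptions for the example, not details from the comment.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize(numbers):
    """Hypothetical stand-in for the library call that only exists in Python."""
    mean = sum(numbers) / len(numbers) if numbers else None
    return {"count": len(numbers), "mean": mean}


class ComponentHandler(BaseHTTPRequestHandler):
    """Accepts POSTed JSON like {"numbers": [...]} and replies with JSON."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps(summarize(payload.get("numbers", []))).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Blocks forever; the main project (in any language) POSTs to this port."""
    HTTPServer(("127.0.0.1", port), ComponentHandler).serve_forever()
```

Calling `serve()` starts the component; the rest of the project then only needs an HTTP client and a JSON parser, which is the whole appeal of the approach.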

I used to prefer reinventing the wheel in the primary project language, but I wasted so much time doing that. The tradeoff is the project structure gets a lot more complicated, but it's also a lot faster to iterate.

Plus your usual html/css/js on the frontend and something else on the backend, plus SQL.

zelphirkalt•1mo ago

3 or 4 can very easily accumulate. For example: HTML, CSS as must know, plus some JS/TS (actually that's 2 langs!) for sprinkles of interactivity, backend in any proper backend language. Oh wait, there is a fifth language, SQL, because we need to access the database. Ah and those few shell scripts we need? Someone's gotta write those too. They may not always be full programming languages, but languages they are, and one needs to know them.

jessoteric•1mo ago

i find it's pretty rare to have a project that only consists of one or two languages, over a certain complexity/feature threshold

theshrike79•1mo ago

Go for the backend, something javascripty for the front end. You're already at two. Depending if you count HTML, CSS or SQL as "languages", you're up to a half dozen pretty quick.

skydhash•1mo ago

I would say notetaking would be a much bigger help than Claude at this point. There are a lot of methods to organize information that I believe would help you, better than a hallucination machine.

neoromantique•1mo ago

Notetaking with ADHD is another sort of hell to be honest.

I absolutely can attest to what parent is saying, I have been developing software in Python for nearly a decade now and I still routinely look up the /basics/.

LLMs have been a complete gamechanger for me, being able to reduce the friction of "ok let me google what I need in a very roundabout way my memory spit it out" to a fast and often inline LLM lookup.

skydhash•1mo ago

Looking up documentation is normal. If not, we wouldn't have the manual pages in Unix and such an emphasis on documentation in ecosystems like Lisp, Go, Python, Perl,... We even have cheatsheets and syntax references books because it's just so easy to forget the /basics/.

I said notetaking, but it's more about building your own index. In $WORK projects, I mostly use the browser bookmarks, the ticket system, the PR description and commits to contextually note things. In personal projects, I have an org-mode file (or a basic text file) and a lot of TODO comments.

pdntspa•1mo ago

And all that takes rote mechanical work. Which can quickly lead to fractured focus and now suddenly I'm pulled out of my flow.

Or I can farm that stuff to an LLM, stay in my flow, and iterate at a speed that feels good.

neoromantique•1mo ago

It is very hard to explain the extent of it to a person who did not experience it, really.

I have over a decade of experience, I do this stuff daily, I don't think I can write a 10 line bash/python/js script without looking up the docs at least a couple times.

I understand exactly what I need to write, but exact form eludes my brain, so this Levenshtein-distance-on-drugs machine that can parse my rambling + surrounding context into valid syntax for what I need right at that time is invaluable and I would even go as far as saying life changing.

I understand and hold high level concepts alright, I know where stuff is in my codebase, I understand how it all works down to very low levels, but the minutiae of development are very hard due to how my memory works (and has always worked).

skydhash•1mo ago

What I'm saying is that this is normal. Unless you've worked everyday with the same language and a very small set of functions, you're bound to forget signatures and syntax. What I'm advocating is a faster retrieval of the correct information.

neoromantique•1mo ago

>Unless you've worked everyday with the same language

...I did.

theshrike79•1mo ago

This is the thing. I _know_ what the correct solution looks like.

But figuring out what is the correct way in this particular language is the issue.

Now I can get the assistant to do it, look at it and go "yep, that's how you iterate over an array of strings".

nyadesu•1mo ago

In my case, I enjoy writing code too, but it's helpful to have an assistant I can ask to handle small tasks so I can focus on a specific part that requires attention to detail

FeteCommuniste•1mo ago

Yeah, I sometimes use AI for questions like "is it possible to do [x] using library [y] and if so, how?" and have received mostly solid answers.

nottorp•1mo ago

Just be careful if functionality varies between library y version 2 and library y version 3, or if there is a similarly named library y2 that isn't the same.

You may get possibilities, but not for what you asked for.

pdntspa•1mo ago

If you run to the point where you can execute each idea and examine its outputs, problems like that surface pretty quickly

nottorp•1mo ago

Of course, by that time I could have read the docs for library y the version I'm using...

pdntspa•1mo ago

There are many roads to Rome...

stouset•1mo ago

Or “can you prototype doing A via approaches X, Y, and Z, and show me what each looks like?”

I love to prototype various approaches. Sometimes I just want to see which one feels like the most natural fit. The LLM can do this in a tenth of the time I can, and I just need to get a general idea of how each approach would feel in practice.

skydhash•1mo ago

> Sometimes I just want to see which one feels like the most natural fit.

This sentence alone is a huge red flag in my book. Either you know the problem domain and can argue about which solution is better and why. Or you don't and what you're doing is an experiment to learn the domain.

There's a reason the field is called Software Engineering and not Software Art. Words like "feels" do not belong. It would be like saying which bridge design feels like the most natural fit for the load. Or which material feels like the most natural fit for a brake system.

mjr00•1mo ago

> There's a reason the field is called Software Engineering and not Software Art. Words like "feels" do not belong.

Software development is nowhere near advanced enough for this to be true. Even basic questions like "should this project be built in Go, Python, or Rust?" or "should this project be modeled using OOP and domain-driven design, event-sourcing, or purely functional programming?" are decided largely by the personal preferences of whoever the first developer is.

skydhash•1mo ago

Such questions may be decided by personal preferences, but their impact can easily be demonstrated. Such impacts are what F. Brooks calls accidental complexity and we generally call technical debt. It's just that, unlike other engineering fields, there are not a lot of physical constraints and the decision space has many more dimensions.

mjr00•1mo ago

> Such questions may be decided by personal preferences, but their impact can easily be demonstrated.

I really don't think this is true. What was the demonstrated impact of writing Terraform in Go rather than Rust? Would writing Terraform in Rust have resulted in a better product? Would rewriting it now result in a better product? Even among engineers with 15 years experience you're going to get differing answers on this.

skydhash•1mo ago

The impact is that now, if you want to modify the project in some way, you will need to learn Go. It's like all the codebases in COBOL. Maybe COBOL at that time was the best language for the product, but now, it's not that easy to find someone with the knowledge to maintain the system. As soon as you make a choice, you accept that further down the line, there will be some X cost to keep going in that direction and some Y cost to revert. As a technical lead, more often than not you need to ensure that X and/or Y don't grow to be enormous.

mjr00•1mo ago

> The impact is that now, if you want to modify the project in some way, you will need to learn Go.

That's tautologically true, yes, but your claim was

> Either you know the problem domain and can argue about which solution is better and why. Or you don't and what you're doing is an experiment to learn the domain.

So, assuming the domain of infrastructure-as-code is mostly known now which is a fair statement -- which is a better choice, Go or Rust, and why? Remember, this is objective fact, not art, so no personal preferences are allowed.

skydhash•1mo ago

Neither. Because the solution for IaC is not Go or Rust, just like the solution for composing music is not a piano or a violin.

A solution may be Terraform, another is Ansible,… To implement that solution, you need a programming language, but by then you’re solving accidental complexity, not the essential one attached to the domain. You may be solving implementation speed, hiring costs, code safety,… but you’re not solving IaC.

Dylan16807•1mo ago

> Neither.

> A solution may be Terraform

They're asking about what language you use to write Terraform.

It's not accidental complexity, it's what the question is about.

skydhash•1mo ago

It’s very much accidental complexity. As the sibling comment to my previous comment said, the choice of a language does not depend on Terraform design, but on contextual information like the team skill, business requirements like time delivery and implementation correctness. None of which really impacts the design of Terraform as a solution. Just like SMTP or POSIX tools do not care about the language.

Dylan16807•1mo ago

If you're talking about the topic, it's not accidental, it's mandatory, because you have to write Terraform in something.

The topic is not how you use Terraform or at a high level design its features, it's how you implement Terraform with code.

> the choice of a language does not depend on Terraform design, but on contextual information like the team skill, business requirements like time delivery and implementation correctness

That doesn't make it accidental to the topic. It may be accidental to a different topic (the design of Terraform?) that nobody was discussing, but it's not accidental to this topic (language choice).

That list of factors is how you get closer to making the decision.

skydhash•1mo ago

>> So, assuming the domain of infrastructure-as-code is mostly known now which is a fair statement -- which is a better choice, Go or Rust, and why?

This was the question. And my answer was that Go or Rust have no relevancy in the IaC domain. Ansible is relevant, but Python is not. Chef is relevant, Ruby is not. And I’m pretty sure there is in-house stuff that is just Perl scripts.

The goal is solving some problem in IaC; by the time you are considering language choice, you’ve already left the domain and are looking at implementation problems where each choice is balancing tradeoffs.

Dylan16807•1mo ago

Context. That wasn't the original question. That's a short restatement of the real question which is up in an earlier post:

>> Such questions may be decided by personal preferences, but their impact can easily be demonstrated.

> I really don't think this is true. What was the demonstrated impact of writing Terraform in Go rather than Rust? Would writing Terraform in Rust have resulted in a better product? Would rewriting it now result in a better product? Even among engineers with 15 years experience you're going to get differing answers on this.

skydhash•1mo ago

And I’ve already answered that question. One of the main impacts is that if you want a contributor to the codebase, the person has to learn Go. Even if they have good knowledge of the domain and are proficient in Rust. There would be some cost associated with training that person in Go (it may be small).

Rewriting from Go to another language wouldn’t solve the problem better. Because Go is an implementation choice, not a design choice. There’s nothing in Go that makes Terraform better. It could be in C and a lot of people wouldn’t notice.

Dylan16807•1mo ago

> And I’ve already answered that question.

You somewhat answered it in a way that doesn't really get to why they asked it (you can't make every decision based on "demonstrated impact").

But you did that in a different comment than the one I replied to. The one I replied to was just answering the wrong question entirely. Which is why I replied.

> Rewriting from Go to another language wouldn’t solve the problem better. Because Go is an implementation choice, not a design choice. There’s nothing in Go that makes Terraform better. It could be in C and a lot of people wouldn’t notice.

I'm sorry, are you arguing that using feel to decide how to structure a piece of code is a "huge red flag", but the choice of entire programming language is unimportant?

skydhash•1mo ago

> I'm sorry, are you arguing that using feel to decide how to structure a piece of code is a "huge red flag", but the choice of entire programming language is unimportant?

From my first reply, I've been arguing that using feels to decide things is very much dangerous. There is usually a less ambiguous way to frame the reasons behind a decision. Methodologies like the five whys can help.

And choosing a programming language is orthogonal to designing a solution to a problem. Everything gets turned to opcodes and binary at some point.

KronisLV•1mo ago

> So, assuming the domain of infrastructure-as-code is mostly known now which is a fair statement -- which is a better choice, Go or Rust, and why? Remember, this is objective fact, not art, so no personal preferences are allowed.

I think it’s possible to engage with questions like these head on and try to find an answer.

The problem is that if you want the answer to be close to accurate, you might need a lot of input data about the situation (including who’d be working with and maintaining the software, what their skills and weaknesses are, the business concerns that impact the timeline, the scale at which you’re working, and 1000 other things), and the output of concrete suggestions might be a flowchart so big it’d make people question their sanity.

It’s not impossible, just impractical with a high likelihood of being wrong due to bad or insufficient data or interpretation.

But to humor the question: as an example, if you have a small to mid size team with run of the mill devs that have some traditional OOP experience and have a small to mid infrastructure size and complexity, but also have relatively strict deadlines, limited budget and only average requirements in regards to long term maintainability and correctness (nobody will die if the software doesn’t work correctly every single time), then Go will be closer to an optimal choice.

I know that because I built an environment management solution in Go, trying to do that in Rust in the same set of circumstances wouldn’t have been successful, objectively speaking. I just straight up wouldn’t have iterated fast enough to ship. Of course, I can only give such a concrete answer for that very specific set of example circumstances after the fact. But even initially those factors pushed me towards Go.

If you pull any number of levers in a different direction (higher correctness requirements, higher performance requirements, different team composition), then all of those can influence the outcome towards Rust. Obviously every detail about what a specific system must do also influences that.

Dylan16807•1mo ago

> It’s not impossible, just impractical with a high likelihood of being wrong due to bad or insufficient data or interpretation.

If it's impractical to know, why is using personal preference and intuition a "huge red flag"?

That's the core idea being disagreed with, not the idea that you could theoretically with enough resources get an objective answer.

KronisLV•1mo ago

It might be because depending on one's sensitivity to various factors and how much work they put into discovering the domain, things might feel okay, and yet be the completely wrong choice.

For example, to many people MongoDB felt like a really good option during its hype cycle, before it became clear that there are workloads out there where you will get burnt badly if you pick anything other than a traditional RDBMS with ACID.

Similarly, there are cases where people cargo cult really hard or just become opinionated over time - someone who has worked primarily in Java for 20 years will probably pick that for a wide variety of projects, though this preference might make them blind to the fact that others aren't as good with it on a given team and that they might not iterate fast enough to ship, when compared with, let's say Django or Ruby on Rails or even Laravel.

Feelings can be dangerous, informed choices will generally be better, though I guess with the way we use language, those two kinda blend together. If those feelings are based on good enough data and experience, then those might be pretty valuable too - someone who has been writing code for 20 years will probably be more accurate than someone who has been programming for 2 years, yet if someone has 10x2 years of experience (doing the same thing, not learning, not exploring), then it's a toss up, worse yet if people think that still means seniority.

I kinda get why someone might react to the word "feels" in a seemingly deterministic development context, but my own reaction wouldn't be so strong and with certain people, I'd trust their feelings. At the same time I've seen plenty of people who write what they believe to be good code that is a bit of a mess in my eyes.

fluidcruft•1mo ago

For example sometimes you're faced with choosing between high-quality libraries to adopt and it's not particularly clear whether you picked the wrong one until after you've tried integrating them. I've found it can be pretty helpful to let the LLM try them all and see where the issues ultimately are.

skydhash•1mo ago

> sometimes you're faced with choosing between high-quality libraries to adopt and it's not particularly clear whether you picked the wrong one until after you've tried integrating them.

Maybe I'm lucky, but I've never encountered this situation. It has been mostly about what tradeoffs I'm willing to make. Libraries are more lines of code added to the project, thus they are liabilities. Including one is always a bad decision, so I only do so because the alternative is worse. Having to choose between two is more like between Scylla and Charybdis (known tradeoffs) than deciding to go left or right in a maze (mystery outcome).

fluidcruft•1mo ago

It probably depends on what you're working on. For the most part relying on a high-quality library/module that already implements a solution is less code to maintain. Any problems with the shared code can be fixed upstream with more eyeballs and more coverage than anything I build locally. I prefer to keep my eyeballs on things most related to my domain and not maintain stuff that's both ultimately not terribly important and replaceable (if push comes to shove).

Generally, you are correct that having multiple libraries to choose among is concerning, but it really depends. Mostly it's stylistic choices and it can be hard to tell how it integrates before trying.

doug_durham•1mo ago

Do you develop software? Software is unlike any physical engineering field. The complexity of any project beyond the most trivial is beyond human ability to work with. You have to switch from analytic tools to more probabilistic tools. That's where "feels", "smells", or "looks" come in. Software testing is not a solved problem, unlike bridge testing.

skydhash•1mo ago

So much FOSS software is made and maintained by a single person. Much more is developed by very small teams. Probabilistic tools aren’t needed anywhere.

georgemcbay•1mo ago

> Yeah, I sometimes use AI for questions like "is it possible to do [x] using library [y] and if so, how?" and have received mostly solid answers.

In my experience most LLMs are going to answer this with some form of "Absolutely!" and then propose a square-peg-into-a-round-hole way to do it that is likely suboptimal vs using a different library that is far more suited to your problem if you didn't guess the right fit library to begin with.

The sycophancy problem is still very real even when the topic is entirely technical.

Gemini is (in my experience) the least likely to lead you astray in these situations but it's still a significant problem even there.

jessoteric•1mo ago

IME this has been significantly reduced in newer models like 4.5 Opus and to a lesser extent Sonnet, but agree it's still sort of bad- mainly because the question you're posing is bad.

if you ask a human this the answer can also often be "yes [if we torture the library]", because software development is magic and magic is the realm of imagination.

much better prompt: "is this library designed to solve this problem" or "how can we solve this problem? i am considering using this library to do so, is that realistic?"

loloquwowndueo•1mo ago

“I want my AI to do laundry and dishes so I can code, not for my AI to code so I can do laundry and dishes”

re-thc•1mo ago

Soon you'll realize you're the "AI". We've lost control.

minimaxir•1mo ago

Claude is very good at unfun-but-necessary coding tasks such as writing docstrings and type hints, which is a prominent instance of "laundry and dishes" for a dev.

loloquwowndueo•1mo ago

“Sorry, the autogenerated api documentation was wrong because the ai hallucinated the docstring”

theshrike79•1mo ago

You can't read?

Please don't say you commit AI-generated stuff without checking it first?

loloquwowndueo•1mo ago

I don’t commit ai-generated stuff. Do you?

theshrike79•1mo ago

Of course, but not without review.

It’s exactly like working with another human. PR review is there for a purpose.

mrguyorama•1mo ago

>writing docstrings and type hints

Disagree. Claude makes the same garbage worthless comments as a Freshman CS student. Things like:

// Frobbing the bazz

res = util.frob(bazz);

// If bif is True here then blorg

if (bif){ blorg; }

Like wow, so insightful

And it will ceaselessly try to auto complete your comments with utter nonsense that is mostly grammatically correct.

The most success I have had is using claude to help with Spring Boot annotations and config processing (Because documentation is just not direct enough IMO) and to rubber duck debug with, where claude just barely edges out the rubber duck.

minimaxir•1mo ago

I intentionally said docstrings instead of comments. Comments by default can be verbose on agents but a line in the AGENTS.md does indeed wrangle modern agents to only comment on high signal code blocks that are not tautological.

thewebguyd•1mo ago

This sums up my feelings almost exactly.

I don't want LLMs, AI, and eventually Robots to take over the fun stuff. I want them to do the mundane, physical tasks like laundry and dishes, leave me to the fun creative stuff.

But as we progress right now, the hype machine is pushing AI to take over art, photography, video, coding, etc. All the stuff I would rather be doing. Where's my house cleaning robot?

zelphirkalt•1mo ago

I would like to go even further and say: Those things, art, photography, video, coding ... They are forms of craft, human expression, creativity. They are part of what makes life interesting. So we are in the process of eliminating the interesting and creative parts, in the name of profit and productivity maxing (if any!). Maybe we can create the 100th online platform for the same thing soon 10x faster! Wow!

Of course this is a bit too black&white. There can still be a creative human being introducing nuance and differences, trying to get the automated tools to do things different in the details or some aspects. Question is, losing all those creative jobs (in absolute numbers of people doing them), what will we as society, or we as humanity become? What's the ETA on UBI, so that we can reap the benefits of what we automated away, instead of filling the pockets of a few?

moffkalast•1mo ago

Well it would be funnier if dishwashers, washing machines and dryers didn't automate that ages ago. It's literally one of the first things robots started doing for us.

breuleux•1mo ago

In my case, it really depends what. I enjoy designing systems and domain-specific languages or writing libraries that work the way I think they should work.

On the other hand, if e.g. I need a web interface to do something, the only way I can enjoy myself is by designing my own web framework, which is pretty time-consuming, and then I still need to figure out how to make collapsible sections in CSS and blerghhh. Claude can do that in a few seconds. It's a delightful moment of "oh, thank god, I don't have to do this crap anymore."

There are many coding tasks that are just tedium, including 99% of frontend development and over half of backend development. I think it's fine to throw that stuff to AI. It still leaves a lot of fun on the table.

vitro•1mo ago

I sometimes think of it as a sculptor analogy.

Some famous sculptors had an atelier full of students that helped them with mundane tasks, like carving out a basic shape from a block of stone.

When the basic shape was done, the master came and did the rest. You may want to have the physical exercise of doing the work yourself, but maybe someone sometimes likes to do the fine work and leave the crude one to the AI.

theshrike79•1mo ago

You really get enjoyment writing a full CRUD HTTP API five times, one for each endpoint?

I don't :) Before I had IDE templates and Intellisense. Now I can just get any agentic AI to do it for me in 60 seconds and I can get to the actual work.

skydhash•1mo ago

What do you need a full CRUD HTTP API for? Just loading the data straight from the database? Usually I've already implemented that before, so I just copy-paste the implementation and do some Vim magic. And in frameworks like Rails or Laravel, it may be less than 10 lines of code. More involved business logic? Then I'm spending more time getting a good spec for it than implementing the spec.

alfalfasprout•1mo ago

I really hope you don't actually treat junior devs this way...

mjr00•1mo ago

> That's why you treat it like a junior dev. You do the fun stuff of supervising the product, overseeing design and implementation, breaking up the work, and reviewing the outputs. It does the boring stuff of actually writing the code.

I am so tired of this analogy. Have the people who say this never worked with a junior dev before? If you treat your junior devs as brainless code monkeys who only exist to type out your brilliant senior developer designs and architectures instead of, you know, human beings capable of solving problems, 1) you're wasting your time, because a less experienced dev is still capable of solving problems independently, 2) the juniors working under you will hate it because they get no autonomy, and 3) the juniors working under you will stay junior because they have no opportunity to learn--which means you've failed at one of your most important tasks as a senior developer, which is mentorship.

pdntspa•1mo ago

I have mentored and worked with a junior dev. And the only way to get her to do anything useful and productive was to spell things out. Otherwise she got wrapped around the axle trying to figure out the complex things and was constantly asking for my help with basic design-level tasks. Doing the grunt work is how you learn the higher-level stuff.

When I was a junior, that's how it was for me. The senior gave me something that was structured and architected and asked me to handle smaller tasks that were beneath them.

Giving juniors full autonomy is a great way to end up with an unmaintainable mess that is a nightmare to work with without substantial refactoring. I know this because I have made a career out of fixing exactly this mistake.

mjr00•1mo ago

I have never worked with junior devs as incompetent as you describe, having worked at AWS, Splunk/Cisco, among others. At AWS even interns essentially got assigned a full project for their term and were just told to go build it. Does your company just have an absurdly low hiring bar for juniors?

> Giving juniors full autonomy is a great way to end up with an unmaintainable mess that is a nightmare to work with without substantial refactoring.

Nobody is suggesting they get full autonomy to cowboy code and push unreviewed changes to prod. Everything they build should be getting reviewed by their peers and seniors. But they need opportunities to explore and make mistakes and get feedback.

pdntspa•1mo ago

> AWS, Splunk/Cisco

It's an entirely different world in small businesses that aren't primarily tech.

Philpax•1mo ago

Your experience is the outlier, not the norm. Most people don't work for AWS.

mjr00•1mo ago

Sure but I've worked for other places too, not just AWS. Including small startups and mid-sized companies.

All of them had bad juniors--and bad seniors--who people quickly learned could not be trusted with anything and were either fired or put into a situation where they could do minimal damage.

However none of them had blanket expectations of "juniors are too stupid to figure out how to solve problems and need step-by-step instructions." I'm sure some places like that exist, but I'd never want to work there; after all, if the place I'm working for openly admits it hires idiots, what does that say about me?

kubb•1mo ago

In my experience, juniors were more capable at solving engineering problems than some staff engineers, but that’s just an artifact of a broken ladder system.

rootnod3•1mo ago

Cool cool cool. So if you use LLMs as junior devs, let me ask you how future awesome senior devs like you will come around? From WHAT job experience? From what coding struggle?

eightysixfour•1mo ago

What would you like individual contributors to do about it, exactly? Refuse to use it, even though this person said they're happier and more fulfilled at work?

I'm asking because I legitimately have not figured out an answer to this problem.

bpt3•1mo ago

Why is that a developer's problem? If anything, they are incentivized to avoid creating future competition in the job market.

rootnod3•1mo ago

It's not a problem for the senior dev directly, but maybe down the road. And it definitely is a problem for the company once said senior dev leaves or retires.

Seriously, long term thinking went out the window a long time ago, didn't it?

bpt3•1mo ago

No, long term thinking didn't go out the window.

It is definitely a problem for the company. How is it a problem for the senior dev at any point?

What incentive do they have to aid the company at the expense of their own *long term* career prospects?

fluidcruft•1mo ago

How do you get junior devs if your concept of the LLM is that it's "a principal engineer" that "do[es] not ask [you] any questions"?

Also, I'm pretty sure junior devs can learn from mistakes faster by directing an LLM. Let them play. Soon enough they're going to be better than all of us anyway. The same way widespread access to strong chess computers raised the bar at chess clubs.

rootnod3•1mo ago

I don't think the chess analogy holds here. In chess, you play _against_ the chess computer. Take the same approach and let the chess computer play FOR the player and see how far he gets.

fluidcruft•1mo ago

Maybe. I don't think adversarial vs not is as important as gaining experience. Ultimately both are problem solving tasks and learning instincts about which approaches work best in certain situations.

I'm probably a pretty shitty developer by HN standards but I generally have to build a prototype to fully understand and explore a problem and iterate on designs, and LLMs have been pretty good for me as trainers for learning things I'm not familiar with. I do have a certain skill set, but the non-domain stuff can be really slow and tedious work. I can recognize "good enough" and "clean" and I think the next generation can use that model very well to become native with how to succeed with these tools.

Let me put it this way: people don't have to be hired by the best companies to gain experience using best practices anymore.

pdntspa•1mo ago

My last job there was effectively a gun held to the back of my head, ordering me to use this stuff. And this started about a year ago, when the tooling for agentic dev was absolutely atrocious, because we had a CTO who had the biggest most raging boner for anything that offered even a whiff of "AI".

Unfortunately the bar is being raised on us. If you can't hang with the new order you are out of a job. I promise I was one of the holdouts who resisted this the most. It's probably why I got laid off last spring.

Thankfully, as of this last summer, agentic dev started to really get good, and my opinion made a complete 180. I used the off time to knock out a personal project in a month or two's worth of time, that would have taken me a year+ the old way. I leveraged that experience to get me where I am now.

rootnod3•1mo ago

Ok, now assume you start relying on it and let's assume Cloudflare has another outage. You just go and clock out for the day saying "can't work, agent is down"?

I don't think we'll be out of jobs. Maybe temporarily. But those jobs come back. The energy and money drain that LLMs are, are just not sustainable.

I mean, it's cool that you got the project knocked out in a month or two, but if you sat down now without an LLM and tried to measure the quality of that codebase, would you be 100% content? Speed is not always a good metric. Sure, 1-2 months for a project is nice, but isn't a personal project especially more about the fun of doing the project, learning something from it, and sharpening your skills?

pdntspa•1mo ago

When the POS system goes down at a restaurant they'll revert to pen and paper. Can't imagine it's much different in that case.

platevoltage•1mo ago

There's that long term thinking that the tech industry, and really every other publicly traded company is known for.

AStrangeMorrow•1mo ago

Yeah at this point I basically have to dictate all implementation details: do this, but do it this specific way, handle xyz edge cases by doing that, plug the thing in here using that API. Basically that expands 10 lines into 100-200 lines of code.

However if I just say “I have this goal, implement a solution”, chances are that unless it is a very common task, it will come up with a subpar/incomplete implementation.

What’s funny to me is that complexity has inverted for some tasks: it can ace a 1000-line ML model for a general task I give it, yet will completely fail to come up with a proper solution for a 2D geometric problem that mostly involves high school level maths and can be solved in 100 lines.

tiku•1mo ago

I enjoy finding the problem and then telling Claude to fix it. Specifying the function and the problem. Then going to get a coffee from the breakroom to see it finished when I return. The junior dev has questions when I did that. Claude just fixes it.

order-matters•1mo ago

I wonder if DRY is still a principle worth holding onto in the AI coding era. I mean it probably is, but this feels like enough of a shift in coding design that re-evaluating principles designed for human-only coding might be worth the effort

xnx•1mo ago

> rules like DRY

Principles like DRY

mrsmrtss•1mo ago

> its quality of work is extremely high ...

It may seem decent until you look closer. Just like with a junior dev, you should always review the code very carefully, you can absolutely not trust it. It's not bad at trivial stuff, but fails almost always if things get more complex and unlike a junior dev, it does not tell you, when things get too complex for it.

urxvtcd•1mo ago

A few weeks ago I'd have disagreed with you, but recently I've been struggling with concentration and motivation and now I kind of try to embrace coding with AI. I guide it pretty strictly, try to stick with pure functions, and always read the output thoroughly. In a couple of places requiring some carefulness I coded them in executable pseudocode (Python) and made AI translate it to the more boilerplate-y target language.

I don't know if I'm any faster than I would be if I was motivated, but I'm A LOT more productive in my current state. I still hope for the next AI winter though.

traceroute66•1mo ago

I don't follow.

In the same breath (same paragraph) you state two polar opposites about working with AI:

   - I am phenomenally productive
   - "as long as I occasionally have it stop" and "it tends to forget a lot of rules like DRY"

I don't see how you can claim to be "phenomenally productive" when working with a tool you have to babysit because it forgets your instructions the whole time.

If it was the "junior dev" you also mention, I suspect you would very quickly invite the "junior dev" to find a job elsewhere.

pdntspa•1mo ago

You don't have to follow. I'm still punching way above my weight. Not sure why both things can't be true at once.

cyral•1mo ago

Using the plan mode in cursor (or asking claude to first come up with a plan) makes it pretty good at generic "how can I improve" prompts. It can spend more effort exploring the codebase and thinking before implementing.

asmor•1mo ago

This is it. It doesn't replace the higher level knowledge part very well.

I asked Claude to fix a pet peeve of mine, spawning a second process inside an existing Wine session (pretty hard if you use umu, since it runs in a user namespace). I asked Claude to write me a python server to spawn another process to pass through a file handler "in Proton", and it proceeded into a long loop of trying to find a way to launch into an existing wine session from Linux with tons of environment variables that didn't exist.

Then I specified "server to run in Wine using Windows Python" and it got more things right. Except it tried to use named pipes for IPC. Which, surprise surprise, doesn't work to talk to the Linux piece. Only after I specified "local TCP socket" it started to go right. Had I written all those technical constraints and made the design decisions in the first message it'd have been a one-hit success.

giancarlostoro•1mo ago

> "Hey claude, I get this error message: <X>", and it'll often find the root cause quicker than I could.

This is true. As for "Open Ended", I use Beads with Claude Code: I ask it to identify things based on criteria (even if it's open ended), then I ask it to make tasks, then when it's done I ask it to research and ask clarifying questions for those tasks. This works really well.

mbesto•1mo ago

> There's a significant blind-spot in current LLMs related to blue-sky thinking and creative problem solving. It can do structured problems very well, and it can transform unstructured data very well, but it can't deal with unstructured problems very well.

While this is true in my experience, the opposite is not true. LLMs are very good at helping me go through a structured process of thinking about architectural and structural design and then helping build a corresponding specification.

More specifically the "idea honing" part of this proposed process works REALLY well: https://harper.blog/2025/02/16/my-llm-codegen-workflow-atm/

This: Each question should build on my previous answers, and our end goal is to have a detailed specification I can hand off to a developer. Let’s do this iteratively and dig into every relevant detail. Remember, only one question at a time.

skydhash•1mo ago

I've checked the linked page and there's nothing about even learning the domain or learning the tech platform you're going to use. It's all blind faith, just a small step above copying stuff from GitHub or StackOverflow and pushing it to prod.

mbesto•1mo ago

You completely missed the point of my comment...

cultofmetatron•1mo ago

> There's a significant blind-spot in current LLMs related to blue-sky thinking and creative problem solving.

that's called job security!

ludicrousdispla•1mo ago

>> "Hey claude, I get this error message: <X>", and it'll often find the root cause quicker than I could.

Back in the day, we would just do this with a search engine.

Smaug123•1mo ago

It's not the same. Recently Opus 4.5 diagnosed and fixed a bug in the F# compiler for me, for example (https://github.com/dotnet/fsharp/pull/19123). The root cause is pretty subtle and very non-obvious, and of course the critical snippet of the stack trace `at FSharp.Compiler.Symbols.FSharpExprConvert.GetWitnessArgs` has no hits on Google other than my own bug report. I would have been completely lost fixing it.

andai•1mo ago

The current paradigm is we sorta-kinda got AGI by putting dodgy AI in a loop:

until works { try again }

The stuff is getting so cheap and so fast... a sufficient increment in quantity can produce a phase change in quality.
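
The loop above can be sketched in Python. This is only an illustration, not any particular agent's implementation: `attempt` and `works` are hypothetical callables standing in for "ask the model" and "run the checks", and the attempt cap is an added assumption so the loop always terminates.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_until_works(
    attempt: Callable[[int], T],
    works: Callable[[T], bool],
    max_attempts: int = 5,
) -> Optional[T]:
    """Call `attempt` until `works` accepts a result or the budget runs out."""
    for i in range(max_attempts):
        candidate = attempt(i)  # hypothetical: e.g. ask the model for a patch
        if works(candidate):    # hypothetical: e.g. run the test suite
            return candidate
    return None                 # budget exhausted, nothing passed


# Toy stand-ins: the "model" only produces an acceptable answer on its third try.
result = retry_until_works(lambda i: i * 10, lambda value: value >= 20)
```

The quality of the whole loop rests on `works`: a weak check means the loop happily stops on a wrong answer.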

d-lisp•1mo ago

I remember a problem I had while quickly testing notcurses. I tried ChatGPT, which produced a lot of weird but kinda believable statements about the fact that I had to include wchar and define a specific preprocessor macro, AND I had to place the includes for notcurses, other includes and macros in a specific order.

My sentiment was "that's obviously a weird non-intended hack" but I wanted to test quickly, and well ... it worked. Later, reading the man pages, I realized that I needed to pass specific flags to gcc instead of using the GPT-advised solution.

I think these kinds of value-based judgements are hard for LLMs to emulate; it's hard for them to identify a single source as the most authoritative one in a sea of less authoritative (but numerous) sources.

order-matters•1mo ago

TBH I think its ability to structure unstructured data is what makes it a powerhouse tool and there is so much juice to squeeze there that we can make process improvements for years even if it doesn't get any better at general intelligence.

If I had a pdf printout of a table, the workflow i used to have to use to get that back into a table data structure to use for automation was hard (annoying). dedicated OCR tools with limitations on inputs, multiple models in that tool for the different ways the paper the table was on might be formatted. it took hours for a new input format

now i can take a photo of something with my phone and get a data table in like 30 seconds.

people seem so desperate to outsource their thinking to these models and operate at the limits of their capability, but i have been having a blast using it to cut through so much tedium that wasn't an unsolved problem but required enough specialized tooling and custom config to be left alone unless you really had to

this fits into what you're saying with using it to do the grunt work i find boring i suppose, but feels a little bit more than that - like it has opened a lot of doors to spaces that had grunt work that wasn't worth doing for the end result previously but now it is

ericmcer•1mo ago

Exactly, if you visualize software as a bunch of separate "states" (UI state, app state, DB state) then our job is to mutate states and synchronize those mutations across the system. LLMs are good at mutating a specific state in a specific way. They are trash at designing what data shape a state should be, and they are bad at figuring out how/why to propagate mutations across a system.

dolftax•1mo ago

The structured vs open-ended distinction here applies to code review too. When you ask an LLM to "find issues in this code", it'll happily find something to say, even if the code is fine. And when there are actual security vulnerabilities, it often gets distracted by style nitpicks and misses the real issues.

Static analysis has the opposite problem - very structured, deterministic, but limited to predefined patterns, and it overwhelms you with false positives.

The sweet spot seems to be to give structure to what the LLM should look for, rather than letting it roam free on an open-ended "review this" prompt.

We built Autofix Bot[1] around this idea.

[1] https://autofix.bot (disclosure: founder)

theshrike79•1mo ago

Codex is better for the latter style. It takes its time, mulls about and investigates and sometimes finds a nugget of gold.

Claude is for getting shit done, it's not at its best at long research tasks.

ljm•1mo ago

I am basically rawdogging Claude these days, I don’t use MCPs or anything else, I just lay down all of the requirements and the suggestions and the hints, and let it go to work.

When I see my colleagues use an LLM they are treating it like a mind reader and their prompts are, frankly, dogshit.

It shows that articulating a problem is an important skill.

awesome_dude•1mo ago

My experience with Claude has been that having it "review" my code has produced some helpful feedback and refactoring suggestions, but it also falls short in other areas

mkw5053•1mo ago

I’ve had reasonable success having it ultrathink of every possible X (exhaustively) and their trade-offs and then give me a ranked list and rationale of its top recommendations. I almost always choose the top but just reading the list and then giving it next steps has worked really well for me.

lucideer•1mo ago

> There's a significant blind-spot in current LLMs related to blue-sky thinking and creative problem solving

I'd hesitate to call this a blind spot. LLMs have a lot of actual blind spots - things people developing them overlook or deprioritize. This strikes me more as something they're acutely aware of & failing at, despite significant efforts to solve.

BatteryMountain•1mo ago

It works great in C# (where you have strong typing + strict compiler).

Try this:

Have a look at xyz.cs. Do a full audit of the file and look for any database operations in loops that can be pre-filtered.

Or:

Have a look at folder /folderX/ and add .AsNoTracking() to all read-only database queries. When you are done, run the compiler and fix the errors. Only modify files in /folderX/ and do not go deeper in the call hierarchy. Once you are done, do a full audit of each file and make sure you did not accidentally add .AsNoTracking() to tracked entities. Do not create any new files or make backups, I already created a git branch for you. Do not make any git commits.

Or:

Have a look at the /Controllers/ folder. Do a full audit of each controller file and make sure there are no hard-coded credentials, username, password or tokens.

Or: Have a look at folder /folderX/. Find any repeated hard-coded values, magic values and literals that will make good candidates to extract to Constants.cs. Make sure to add XML comments to the Constants.cs file to document what the value is for. You may create classes within Constants.cs to better group certain values, like AccountingConstants or SystemConstants etc.

These kinds of tasks work amazingly well in Claude Code and can often be one-shotted. Make sure you check your git diffs - you cannot and should not blame AI for shitty code - it's your name next to the commit, make sure it is correct. You can even ask Claude to review the file with you afterwards. I've used this kind of approach to greatly increase our overall code quality & performance tuning - I really don't understand all the negative comments as this approach has chopped down days worth of refactorings to a couple of minutes and hours.

In places where you see your coding assistant is slow or making mistakes, or it is going line by line where you know a simple regex find/replace would work instantly, ask it to help you create a shell script as a tool for itself to call that does task xyz. I've made a couple of scripts using this approach that Claude can call locally to fix certain code patterns in 5 seconds that would've taken it (and me checking it) 30 mins at least, and it won't eat up context or tokens.
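
Such a helper can be very small. The sketch below is written in Python rather than shell and is an illustration, not one of the scripts described above: the function name, the `*.cs` glob, and the example pattern (collapsing a made-up `.ToList().Count()` into `.Count()`) are all assumptions.

```python
import re
import tempfile
from pathlib import Path


def replace_in_tree(root: Path, pattern: str, replacement: str, glob: str = "*.cs") -> int:
    """Apply one regex substitution to every matching file under `root`.

    Returns the number of files that were actually changed.
    """
    regex = re.compile(pattern)
    changed = 0
    for path in sorted(root.rglob(glob)):
        original = path.read_text(encoding="utf-8")
        updated = regex.sub(replacement, original)
        if updated != original:
            path.write_text(updated, encoding="utf-8")
            changed += 1
    return changed


# Demo on a throwaway directory so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "Example.cs"
    sample.write_text("var n = items.ToList().Count();\n", encoding="utf-8")
    files_changed = replace_in_tree(Path(tmp), r"\.ToList\(\)\.Count\(\)", ".Count()")
    rewritten = sample.read_text(encoding="utf-8")
```

Because the substitution is deterministic, the git diff is the only thing left to review, and the assistant spends no context reading each file.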

lazarus01•1mo ago

>> But right now, the best way to help an LLM is have a deep understanding of the problem domain yourself, and just leverage it to do the grunt-work that you'd find boring.

This is exactly how I use it. I prefer Gemini 3 personally.

I try to learn as much as I can about different architectures, usually by reading books or other implementations and coding from first principles to build a mental model. I apply the architecture to the problem and the AI fills in the gaps. I try my best to focus and cover those gaps.

The reason I think it is inconsistent in nailing a variety of tasks is the recipe for training LLMs, which is pre-training + RL. The RL environment sends a training signal to update all the weights in its trajectory for the successful response. Karpathy calls it “sucking supervision through a straw”. This breaks other parts of the model.

charleshn•1mo ago

It's fundamentally because of verifier's law [0].

Current AI, and in particular RL-based AI, has already achieved or will soon achieve superhuman performance on problems that can be - quickly - verified and measured.

So maths, algorithms, etc and well defined bugs fall into that category.

However, architectural decisions, design, and long-term planning, where there is little data, no model allowing synthetic data generation, and long iteration cycles, are not so amenable to it.

[0] https://www.jasonwei.net/blog/asymmetry-of-verification-and-...

postalcoder•1mo ago

One of my favorite personal evals for llms is testing its stability as a reviewer.

The basic gist of it is to give the llm some code to review and have it assign a grade multiple times. How much variance is there in the grade?

Then, prompt the same llm to be a "critical" reviewer with the same code multiple times. How much does that average critical grade change?

A low variance of grades across many generations and a low delta between "review this code" and "review this code with a critical eye" is a major positive signal for quality.

I've found that gpt-5.1 produces remarkably stable evaluations whereas Claude is all over the place. Furthermore, Claude will completely [and comically] change the tenor of its evaluation when asked to be critical whereas gpt-5.1 is directionally the same while tightening the screws.

You could also interpret these results to be a proxy for obsequiousness.

Edit: One major part of the eval i left out is "can an llm converge on an 'A'?" Let's say the llm gives the code a 6/10 (or B-). When you implement its suggestions and then provide the improved code in a new context, does the grade go up? Furthermore, can it eventually give itself an A, and consistently?

It's honestly impressive how good, stable, and convergent gpt-5.1 is. Claude is not great. I have yet to test it on Gemini 3.
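
The two numbers this eval looks at (variance across repeated runs, and the shift when the prompt asks for a critical review) could be computed along these lines. It is a rough sketch with made-up grades on a 10-point scale; in a real run the lists would come from repeated model calls, which are not shown, and `stability_report` is a hypothetical name.

```python
from statistics import mean, pvariance


def stability_report(plain: list[float], critical: list[float]) -> dict[str, float]:
    """Summarize how stable a reviewer's grades are.

    `plain` holds grades from repeated "review this code" runs,
    `critical` holds grades from repeated "review with a critical eye" runs.
    """
    return {
        "plain_variance": pvariance(plain),
        "critical_variance": pvariance(critical),
        # How far the average grade moves when the prompt asks for criticism.
        "critical_delta": mean(plain) - mean(critical),
    }


# Made-up grades, purely for illustration.
steady = stability_report(plain=[7, 7, 8, 7], critical=[7, 6, 7, 7])
jumpy = stability_report(plain=[9, 5, 8, 6], critical=[3, 6, 2, 5])
```

Under this reading, a reviewer like `steady` (small variance, small delta) is the positive signal and one like `jumpy` is the negative one.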

guluarte•1mo ago

my experience reviewing PRs is that sometimes it says the PR is perfect with some nitpicks, and other times it says the same PR is trash and needs a lot of work

adastra22•1mo ago

You mean literally assign a grade, like B+? This is unlikely to work based on how token prediction & temperature work. You're going to get a probability distribution in the end that is reflective of the model runtime parameters, not the intelligence of the model.

postalcoder•1mo ago

the gpt-5 reasoning models do not have a configurable temperature.

There's a reason why reasoning models are bad for creative writing. The thinking constrains the output.

adastra22•1mo ago

Doesn’t matter if it is configurable. It is still there in the inference algorithm.

OsrsNeedsf2P•1mo ago

How is this different than testing the temperature?

smt88•1mo ago

It isn't, and it reflects how deeply LLMs are misunderstood, even by technical people

swid•1mo ago

It surely is different. If you set the temp to 0 and do the test with slightly different wording, there is no guarantee at all the scores would be consistent.

And if an LLM is consistent, even with a high temp, it could give the same PR the same grade while choosing different words to say.

The tokens are still chosen from the distribution, so a higher probability of the same grade will result in the same grade being chosen regardless of the temp set.
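A small sketch of the sampling math behind this, using invented logits for three candidate grade tokens (the numbers are assumptions, not measurements from any model). Dividing logits by the temperature never changes which grade is most likely; it only changes how much probability mass that grade holds.

```python
import math

# Hypothetical logits for the grade token; "B" is the model's preferred grade.
GRADE_LOGITS = {"A": 1.0, "B": 4.0, "C": 2.0}


def grade_distribution(logits, temperature):
    """Softmax over logits after dividing each by the temperature."""
    grades = list(logits)
    scaled = [logits[g] / temperature for g in grades]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return {g: w / total for g, w in zip(grades, weights)}


for t in (0.2, 1.0, 2.0):
    dist = grade_distribution(GRADE_LOGITS, t)
    top = max(dist, key=dist.get)
    print(f"temperature={t}: top grade {top} with p={dist[top]:.3f}")
```

So a model that strongly prefers one grade will keep producing it across a wide range of temperatures, while a model with nearly tied logits will look unstable even at modest temperatures.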

smt88•1mo ago

I think you're restating (in a longer and more accurate way) what I understood the original criticism to be, that this grading test isn't testing what it's supposed to, partly because a grade is too few tokens.

The model could "assess" the code qualitatively the same and still give slightly different letter grades.

stevenhuang•1mo ago

The irony is strong here.

postalcoder•1mo ago

gpt-5* reasoning models do not have an adjustable temperature parameter. It seems like we may have a different understanding of these models.

And, like the other commenter said, the temperature may change the distribution of the next token, but the reasoning tends to reel those things in, which is why reasoning models are notoriously poor at creative writing.

You are free to run these experiments for yourself. Perhaps, with your deeper understanding, you'll shed new light on this behavior.

itishappy•1mo ago

How does temperature explain the variance in response to the inclusion of the word "critical"?

lemming•1mo ago

I agree, I mostly use Claude for writing code, but I always get GPT5 to review it. Like you, I find it astonishingly consistent and useful, especially compared to Claude. I like to reset my context frequently, so I’ll often paste the problems from GPT into Claude, then get it to review those fixes (going around that loop a few times), then reset the context and get it to do a new full review. It’s very reassuring how consistent the results are.

pawelduda•1mo ago

Did it create 200 CODE_QUALITY_IMPROVEMENTS.md files by chance?

dcchuck•1mo ago

I spent some time last night "over iterating" on a plan to do some refactoring in a large codebase.

I created the original plan with a very specific ask - create an abstraction to remove some tight coupling. Small problem that had a big surface area. The planning/brainstorming was great and I like the plan we came up with.

I then tried to use a prompt like OP's to improve it (as I said, large surface area so I wanted to review it) - "Please review PLAN_DOC.md - is it a comprehensive plan for this project?". I'd run it -> get feedback -> give it back to Claude to improve the plan.

I (naively perhaps) expected this process to converge to a "perfect plan". At this point I think of it more like a probability tree where there's a chance of improving the plan, but a non-zero chance of getting off the rails. And once you go off the rails, you only veer further and further from the truth.
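A back-of-the-envelope version of that probability tree, under two simplifying assumptions that are mine: each round derails independently with the same probability, and a derailed plan never recovers. The 5% figure is purely illustrative.

```python
def p_on_rails(p_derail, rounds):
    """Probability the plan is still on the rails after `rounds` iterations."""
    return (1.0 - p_derail) ** rounds


# Assumed 5% chance of derailing on any single review round.
for rounds in (1, 10, 50):
    print(rounds, round(p_on_rails(0.05, rounds), 3))
```

Even a small per-round risk compounds quickly, which matches the feeling that many review rounds tend to end somewhere off the rails.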

There are certainly problems where "throwing compute" at it and continuing to iterate with an LLM will work great. I would expect those to have firm success criteria. Providing definitions of quality would significantly improve the output here as well (or decrease the probability of going off the rails I suppose). Otherwise Claude will confuse quality like we see here.

Shout out OP for sharing their work and moving us forward.

elzbardico•1mo ago

Small errors compound over time.

Gricha•1mo ago

I think I end up doing that with plans inadvertently too. Oftentimes I'll iterate on a plan too many times, and only recognize that it's too far gone and needs a restart with more direction after sinking in 15 minutes into it.

Hammershaft•1mo ago

Impressive that the app still works! Did not expect that.

elzbardico•1mo ago

Probably being a very simple application and starting with an already big testing suite helped.

elzbardico•1mo ago

Funniest part:

> ..oh and the app still works, there's no new features, and just a few new bugs.

bikeshaving•1mo ago

https://github.com/Gricha/macro-photo/blob/highest-quality/l...

The logger library which Claude created is actually pretty simple, highly approachable code, with utilities for logging the timings of async code and the ability to emit automatic performance warnings.

I have been using LogTape (https://logtape.org) for JavaScript logging, and the inherited, category-focused logging with different sinks has been pretty great.

gm678•1mo ago

"Core Functional Utilities: Identity function - returns its input unchanged." is one of my favorites from `lib/functional.ts`.

simonw•1mo ago

The prompt was:

  Ultrathink. You're a principal engineer. Do not ask me any
  questions. We need to improve the quality of this codebase.
  Implement improvements to codebase quality.

I'm a little disappointed that Claude didn't eventually decide to start removing all of the cruft it had added to improve the quality that way instead.

Gricha•1mo ago

Yeah, the best it did on some iterations was claim that the codebase was already in a good state and produce no changes - but that was 1 in many.

hazmazlaz•1mo ago

Well of course it produced bad results... it was given a bad prompt. Imagine how things would have turned out if you had given the same instructions to a skilled but naive contractor who contractually couldn't say no and couldn't question you. Probably pretty similar.

mainmailman•1mo ago

Yeah I don't see the utility in doing this hundreds of times back to back. A few iterations can tell us some things about how Claude optimizes code, but an open ended prompt to endlessly "improve" the code sounds like a bad boss making huge demands. I don't blame the AI for adding BS down the line.

Dilettante_•1mo ago

I don't think the question "will the AI add BS" was what drove this experiment. The very first thing the author references is re-feeding and degrading the same image 100 times, which similarly is not about improving the image.

This was more about seeing in what interesting ways the LLM will "fail", to get a little glimpse into how the black-box "thinks".

krupan•1mo ago

Just the headline sounds like a YouTube brain rot video title:

"I spent 200 days in the woods"

"I Google translated this 200 times"

"I hit myself with this golf club 200 times"

Is this really what hacker news is for now?

jmkni•1mo ago

If you reverse the order this could be a very interesting Youtube series

havkom•1mo ago

There are fundamental differences. Many people expect a positive gradient of quality from AI overhaul of projects. For translating back and forth, it is obvious from the outset that there is a negative gradient of quality (the Chinese whispers game).

krupan•1mo ago

200 times?? No, I don't think anybody expected that to produce something good. It's just another attention-grabbing "I did a thing a ridiculous amount of times!" stunt.

Dilettante_•1mo ago

You're assuming the author did this only for audience engagement(dozens, nay, scores of blog hits!), and not to gratify their own intellectual curiosity? Sometimes you just want to see what happens.

iambateman•1mo ago

The point he’s making - that LLMs aren’t ready for broadly unsupervised software development - is well made.

It still requires an exhausting amount of thought and energy to make the LLM go in the direction I want, which is to say in a direction which considers the code which is outside the current context window.

I suspect that we will not solve the context window problem for a long time. But we will see a tremendous growth in “on demand tooling” for things which do fit into a context window and for which we can let the AI “do whatever it wants.”

For me, my work product needs to conform to existing design standards and I can’t figure out how to get Claude to not just wire up its own button styles.

But it’s remarkable how—despite all of the nonsense—these tools remain an irreplaceable part of my work life.

torginus•1mo ago

Which is why I think agentic software development is not really worth it today. It can solve well-defined problems and work through issues by rote, but if you give it some task and have it work on it for a couple of hours, you then have to come in and fix it up.

I think LLMs are still at the 'advanced autocomplete' stage, where the most productive way to use them is to have a human in the loop.

In this mode, accuracy in following instructions and short feedback time are much more important than semi-decent behavior over long-horizon tasks.

spaceywilly•1mo ago

I feel like I’ve figured out a good workflow with AI coding tools now. I use it in “Planning mode” to describe the feature or whatever I am working on and break it down into phases. I iterate on the planning doc until it matches what I want to build.

Then, I ask it to execute each phase from the doc one at a time. I review all the code it writes or sometimes just write it myself. When it is done it updates the plan with what was accomplished and what needs to be done next.

This has worked for me because:

- it forces the planning part to happen before coding. A lot of Claude’s “wtf” moments can be caught in this phase before it writes a ton of gobbledygook code that I then have to clean up

- the code is written in small chunks, usually one or two functions at a time. It’s small enough that I can review all the code and understand before I click accept. There’s no blindly accepting junk code.

- the only context is the planning doc. Claude captures everything it needs there, and it’s able to pick right up from a new chat and keep working.

- it helps my distraction-prone brain make plans and keep track of what I was doing. Even without Claude writing any code, this alone is a huge productivity boost for me. It’s like having a magic notebook that keeps track of where I was in my projects so I can pick them up again easily.

bulletsvshumans•1mo ago

I think the prompt is a major source of the issue. "We need to improve the quality of this codebase" implicitly indicates that there is something wrong with the codebase. I would be curious to see if it would reach a point of convergence with a prompt that allowed for it. Something like "Improve the quality of this codebase, or tell me that it is already in an optimal state."
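A sketch of what such a convergence check could look like in a driver loop. Everything here is hypothetical: `run_pass` stands in for whatever invokes the agent and reports whether it changed any files, and the `patience` threshold is an arbitrary choice so that a single "nothing to do" pass is not mistaken for convergence.

```python
from typing import Callable

PROMPT = (
    "Improve the quality of this codebase, or tell me that it is "
    "already in an optimal state."
)


def iterate_until_converged(
    run_pass: Callable[[str], bool], max_passes: int = 200, patience: int = 3
) -> int:
    """Run passes until `patience` consecutive passes change nothing.

    Returns the number of passes that were run.
    """
    quiet = 0
    for n in range(1, max_passes + 1):
        changed = run_pass(PROMPT)  # True if the agent modified the codebase
        quiet = 0 if changed else quiet + 1
        if quiet >= patience:
            return n
    return max_passes


def fake_agent(busy_passes: int) -> Callable[[str], bool]:
    """Stand-in agent that makes changes for `busy_passes` passes, then stops."""
    calls = {"n": 0}

    def run_pass(prompt: str) -> bool:
        calls["n"] += 1
        return calls["n"] <= busy_passes

    return run_pass


passes_used = iterate_until_converged(fake_agent(busy_passes=4))
print(passes_used)
```

Whether a real model would ever report "optimal" often enough to trip the exit is exactly the open question; the loop only makes that outcome possible.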

WhitneyLand•1mo ago

It can be difficult to explain to management why in certain scenarios AI can seem to work coding miracles, but this still doesn’t mean it’s always going to speed up development 10x, especially for an established code base.

Tangible examples like this seem like a useful way to show some of the limitations.

stavros•1mo ago

Well, given it can't say "no, I think it's good enough now", you'll just get madness, no?

minimaxir•1mo ago

That's the point. Sometimes madness is interesting.

etamponi•1mo ago

Am I the only one that is surprised that the app still works?!

elzbardico•1mo ago

LLMs have this strong bias towards generating code, because writing code is the default behavior from pre-training.

Removing code, renaming files, condensing, and other edits are mostly post-training, supervised-learning behavior. You have armies of developers across the world making 17 to 35 dollars an hour solving tasks step by step, which are then basically used to generate prompt/response pairs of desired behavior for a lot of common development situations, adding desired output for things like tool calling, which is needed for things like deleting code.

A typical post-training dataset generation task for a human would involve a scenario like: given this Dockerfile for a python application, when we try to run pytest it fails with exception foo not found. The human will notice that package foo is not installed, change the requirements.txt file and write this down, then he will try pip install, and notice that the foo package requires a certain native library to be installed. The final output of this will be a response with the appropriate tool calls in a structured format.

Given that the amount of unsupervised learning is way bigger than the amount spent on fine-tuning for most models, it is no surprise that, given any ambiguous situation, the model will default to what it knows best.

More post-training will usually improve this, but the quality of the human generated dataset probably will be the upper bound of the output quality, not to mention the risk of overfitting if the foundation model labs embrace SFT too enthusiastically.

hackernewds•1mo ago

> Writing code is the default behavior from pre-training

what does this even mean? could you expand on it

bongodongobob•1mo ago

He means that it is heavily biased to write code, not remove, condense, refactor, etc. It wants to generate more stuff, not less.

snet0•1mo ago

I don't see why this would be the case.

bunderbunder•1mo ago

It’s because that’s what most resembles the bulk of the tasks it was being optimized for during pre-training.

elzbardico•1mo ago

Have you tried using a base model from HuggingFace? They can't even answer simple questions. You give a raw base model the input

  What is the capital of the United States?

And there's a fucking big chance it will complete it as

  What is the capital of Canada?

as much as there is a chance it could complete it with an essay about the early American republican history or a sociological essay questioning the idea of Capital cities.

Impressive, but not very useful. A good base model will complete your input with things that generally make sense, usually correct, but a lot of times completely different from what you intended it to generate. They are like a very smart dog, a genius dog that was not trained and most of the time refuses to obey.

So even a simple behavior like acting as a party in a conversation, as a chatbot does, requires fine-tuning (the result being the *-instruct models you find on HuggingFace). In machine learning parlance, this is what we call supervised learning.

But in the case of chatbot behavior, the fine-tuning is not that complex, because we already have a good idea of what conversations look like from our training corpora; we have already encoded a lot of this during the unsupervised learning phase.

Now, let's think about editing code, not simply generating it. Let's do a simple experiment. Go to your project and issue the following command.

  claude -p --output-format stream-json "your prompt here to do some change in your code" | jq -r 'select(.type == "assistant") | .message.content[]? | select(.type? == "text") | .text'

Pay attention to the incredible number of tool use calls that the LLM generates in its output. Now think of this as a whole conversation: does it look even remotely similar to something a model would find in its training corpora?

Editing, deleting, or refactoring existing code is a far more complex operation than just generating a new function or class: it requires the model to read the existing code, generate a plan to identify what needs to be changed and deleted, and generate output with the appropriate tool calls.

Sequences of tokens that simply create new code basically have lower entropy, and are more probable, than complex sequences that lead to editing and refactoring existing code.

tqian•1mo ago

Thank you for this wonderful answer.

elzbardico•1mo ago

Because there are not a lot of high quality examples of code editing in the training corpora other than maybe version control diffs.

Because editing/removing code requires that the model output tokens for tool calls to be intercepted by the coding agent.

Responses like the example below are not emergent behavior, they REQUIRE fine-tuning. Period.

  I need to fix this null pointer issue in the auth module.
  <|tool_call|>
  {"id": "call_abc123", "type": "function", "function": {"name": "edit_file", "arguments": "{\"path\": \"src/auth.py\", \"start_line\": 12, \"end_line\": 14, \"replacement\": \"def authenticate(user):\\n    if user is None:\\n        return False\\n    return verify(user.token)\"}"}}
  <|end_tool_call|>

bongodongobob•1mo ago

I'm not disagreeing with any of this. Feels kind of hostile.

elzbardico•1mo ago

I clicked reply on the wrong level. And even then, I assure you I am not being hostile. English is a second language to me.

joaogui1•1mo ago

During pre-training the model is learning next-token prediction, which is naturally additive. Even if you added DEL as a token it would still be quite hard to change the data so that it can be used in a next-token prediction task. Hope that helps.

6LLvveMx2koXfwn•1mo ago

for all the bad code havoc was most certainly not 'wrecked', it may have been 'wreaked' though . . .

surprisetalk•1mo ago

This reflects my experience with human programmers. So many devs are taught to add layers of complexity in pursuit of "best practices". I think the LLM was trained to behave this way.

In my experience, Claude can actually clean up a repo rather nicely if you ask it to (1) shrink source code size (LOC or total bytes), (2) reduce dependencies, and (3) maintain integration tests.

torginus•1mo ago

I've heard a very apt criticism of the current batch of LLMs:

LLMs are incapable of reducing entropy in a code base

I've always had this nagging feeling, but I think this really captures the essence of it succinctly.

phildougherty•1mo ago

Pasting this whole article in to claude code "improve my codebase taking this article in to account"

minimaxir•1mo ago

You can just give Claude Code/any modern Agent a URL and it'll retrieve it.

mbesto•1mo ago

While there are justifiable comments here about how LLMs behave, I want to point out something else:

There is no consensus on what constitutes a high quality codebase.

Said differently - even if you asked 200 humans to do this same exercise, you would get 200 different outputs.

guluarte•1mo ago

that's my experience with AI, most times it creates an overengineered solution unless told to keep it simple

mvanbaak•1mo ago

`--dangerously-skip-permissions` why?

minimaxir•1mo ago

It's necessary to allow Claude Code to be fully autonomous, otherwise it will stop and ask you to run commands.

mvanbaak•1mo ago

and just letting it do whatever it thinks it should do, without a human intervening, is a good plan?

minimaxir•1mo ago

Discovering that is the entire intent of this experiment, yes.

mvanbaak•1mo ago

fair point. will re-read the whole thing. I'm sorry for my ignorance.

news_hacker•1mo ago

the "best practice" suggestion would be to do this in a sandboxed container

ssl-3•1mo ago

Depending on the breadth (and value) of the sandbox: Sure? Why not?

To extend what may seem like a [prima facie] insane, stupid, or foolhardy idea: Why not send the output of /dev/urandom into /bin/bash? Or even /proc/mem? It probably won't do anything particularly interesting. It will probably just break things and burn power.

And so? It's just a computer; its scope is limited.

keepamovin•1mo ago

This is actually a great idea. It's like those AI resampled this image 10,000 times. Or JPEG iteratively compressed this picture 1 Million times.

Havoc•1mo ago

My current fav improvement strategy is

1) Run multiple code analysis tools over it and have the LLM aggregate it with suggestions

2) ask the LLM to list potential improvements as an open-ended question and pick by hand which I want

And usually repeat the process with a completely different model (ie diff company trained it)

Any more and yeah they end up going in circles

VikingCoder•1mo ago

You need to scroll the windows to see all the numbers. (Why??)

GuB-42•1mo ago

It is something I noticed when talking to LLMs, if they don't get it right the first time, they probably never will, and if you really insist, the quality starts to degrade.

It is not unlike people, the difference being that if you ask someone the same thing 200 times, he will probably tell you to go fuck yourself, or, if unable to, turn to malicious compliance. These AIs will always be diligent. Or, a human may use the opportunity to educate himself, but again, LLMs don't learn by doing; they have a distinct training phase that involves ingesting pretty much everything humanity has produced, and your little conversation will not have a significant effect, if at all.

grvdrm•1mo ago

I use a new chat/etc every time that happens. Try to improve my prompt to get a better result. Sometimes works, but that multiple chat rather than laborious long chat approach annoys me less.

orliesaurus•1mo ago

Ok SRS question: What's the best "Code Review" Skill/Agent/Prompt that I can use these days? Curious to see even paid options if anyone knows?

g947o•1mo ago

When I ask coding agents to add tests, they often come up with something like this:

    const x = new NewClass();
    assert.ok(x instanceof NewClass);

So I am not at all surprised about Claude adding 5x tests, most of which are useless.

It's going to be fun to look back at this and see how much slop these coding agents created.

tracker1•1mo ago

On the Result<TR, TE> responses... I've seen this a few times. I think it works well in Rust or other languages that don't have the ability to "throw" baked in. However, when you bolt it on to a language that implicitly can throw, you're now doing twice the work as you have to handle the explicit error result and integrated errors.

I worked in a C# codebase with Result responses all over the place, and it just really complicated every use case all around. Combined with Promises (TS) it's worse still.

mrsmrtss•1mo ago

The Result pattern also works exceptionally well with C#, provided you ensure that code returning a Result object never throws an exception. Of course, there are still some exceptional things that can throw, but this is essentially the same situation as dealing with Rust panics.

tracker1•1mo ago

IMO, Rust panics should kill the application... C# errors shouldn't. Also, in practice, in C# where I was dealing with Result, there was just as much chance of seeing an actual thrown error, so you always had to deal with both an explicit error result AND thrown errors in practice... it was worse than just error patterns with type specific catch blocks.

mrsmrtss•1mo ago

I think you just had experienced a bad codebase. If you opt for using Result then you can not throw at the same time. If you follow this rule, then it works perfectly.

tracker1•1mo ago

The problem is, the referenced libraries can (and do) throw in practice... which means your own code needs to account for this. Most libraries in C# are written to throw errors, which means interactions will mostly need to account for these at some level, which is a pain. Not to mention, Task<Result<T>> is awkward in and of itself, because a task result is a success or fail, wrapping another type that is a success or fail. And such is the nature of async + result in C#, which is kind of redundant. Which, again, depending on the libraries in use, you have to account for and it is and will get messy.
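The boundary problem is not specific to C#. Here is a minimal sketch in Python (chosen only for brevity; the names `Ok`, `Err`, and `attempt` are made up for illustration and not taken from any library): every call into code that can raise has to be wrapped before the rest of the code can rely on Result values alone.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar, Union

T = TypeVar("T")


@dataclass(frozen=True)
class Ok(Generic[T]):
    value: T


@dataclass(frozen=True)
class Err:
    error: Exception


def attempt(fn: Callable[[], T]) -> Union[Ok[T], Err]:
    """Run a function that may raise and convert the outcome to a Result."""
    try:
        return Ok(fn())
    except Exception as exc:  # the throwing world ends at this boundary
        return Err(exc)


# int() stands in for any third-party call that signals failure by raising.
parsed = attempt(lambda: int("42"))
failed = attempt(lambda: int("not a number"))
print(parsed, failed)
```

If the wrapper is applied consistently at every library boundary, callers only ever see `Ok` or `Err`; if it is skipped anywhere, they are back to handling both styles, which is the double work being described.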

fauigerzigerk•1mo ago

What would happen if you gave the same task to 200 human contractors?

I suspect SLOC growth wouldn't be quite as dramatic but things like converting everything to Rust's error handling approach could easily happen.

samuelknight•1mo ago

This is an interesting experiment that we can summarize as "I gave a smart model a bad objective", with the key result at the end

"...oh and the app still works, there's no new features, and just a few new bugs."

Nobody thinks that doing 200 improvement passes on a functioning codebase is a good idea. The prompt tells the model that it is a principal engineer, then contradicts that role with the imperative "We need to improve the quality of this codebase". Determining when code needs to be improved is a responsibility for the principal engineer but the prompt doesn't tell the model that it can decide the code is good enough. I think we would see a different behavior if the prompt was changed to "Inspect the codebase, determine if we can do anything to improve code quality, then immediately implement it." If the model is smart enough, this will increasingly result in passes where the agent decides there is nothing left to do.

In my experience with CC I get great results when I ask an open-ended question about a large module and instruct it to come back to me with suggestions. Claude generates 5-10 suggestions and ranks them by impact. It's very low-effort from the developer's perspective and it can generate some good ideas.

thald•1mo ago

Interesting experiment. Looking at this, I immediately thought of a similar experiment run by Google: AlphaEvolve. Throwing LLM compute at problems might work if the problem is well defined and the result can be objectively measured.

As for this experiment: What does quality even mean? Most human devs will have different opinions on it. If you asked 200 different devs (Claude starts from 0 after each iteration) to do the same, I doubt the code would look much better.

I am also wondering what would happen if Claude had an option to just walk away from the code if it's "good enough". For each problem, most human devs run a cost-benefit equation in their head; only worthy ideas are realized. Claude does not do this: the cost of writing code is very low on its side and the prompt does not allow any graceful exit :)

minimaxir•1mo ago

About a year ago I wrote a blog post (HN discussion: https://news.ycombinator.com/item?id=42584400) testing whether asking Claude to "write code better" repeatedly would indeed cause it to write better code, determined by speed, as better code implies more efficient algorithms. I found that it did indeed work (at n=5 iterations), but additionally providing a system prompt also explicitly improved it.

Given what I've seen from Claude 4.5 Opus, I suspect the following test would be interesting: attempt to have Claude Code + Haiku/Sonnet/Opus implement and benchmark an algorithm with:

- no CLAUDE.md file

- a basic CLAUDE.md file

- an overly nuanced CLAUDE.md file

And then both test the algorithm speed and number of turns it takes to hit that algorithm speed.

maerF0x0•1mo ago

I would love to see someone do a longitudinal study of the incident/error rate of a canary container in prod that is managed by Claude. Basically a control/experimental group setup to show who does better: the humans or the AI?

Dilettante_•1mo ago

ClauDevOps?

jesse__•1mo ago

> This app is around 4-5 screens. The version "pre improving quality" was already pretty large. We are talking around 20k lines of TS

Fucking yikes dude. When's the last time it took you 4500 lines per screen, 9000 including the JSON data in the repo????? This is already absolute insanity.

I bet I could do this entire app in easily less than half, probably less than a tenth, of that.

jedberg•1mo ago

You know how when you hear how many engineers are working on a product, you think to yourself, "but I could do that with like three people!"? Now you know why they have so many people. Because they did this with their codebase, but with humans.

Or I should say, they kept hiring the humans who needed something to do, and basically did what this AI did.

nadis•1mo ago

20K --> 84K lines of ts for a simple app is bananas. Much madness indeed! But also super interesting, thanks for sharing the experiment.

ttul•1mo ago

Have you tried writing into the AGENTS.md something like, "Always be on the lookout for dead code, copy-pasta, and other opportunities to optimize and trim the codebase in a sensible way."

In my experience, adding this kind of instruction to the context window causes SOTA coding models to actually undertake that kind of optimization while development carries on. You can also periodically chuck your entire codebase into Gemini-3 (with its massive context window) and ask it to write a refactoring plan; then, pass that refactoring plan back into your day-to-day coding environment such as Cursor or Codex and get it to take a few turns working away at the plan.

As with human coders, if you let them run wild "improving" things without specifically instructing them to also pay attention to bloat, bloat is precisely what you will get.

smallpipe•1mo ago

The viewport of this website is quite infuriating. I have to scroll horizontally to see the `cloc` output, but there's 3x the empty space on either side.

lubesGordi•1mo ago

So now you know. You can get claude to write you a ton of unit tests and also improve your static typing situation. Now you can restrict your prompt!

jcalvinowens•1mo ago

This really mirrors my experience trying to get LLMs to clean up kernel driver code, they seem utterly incapable of simplifying things.

barbazoo•1mo ago

> I can sort of respect that the dependency list is pretty small, but at the cost of very unmaintainable 20k+ lines of utilities. I guess it really wanted to avoid supply-chain attacks.

> Some of them are really unnecessary and could be replaced with off the shelf solution

Lots of people would regard this as a good thing. Surely the LLM can't guess which kind you are.

Bombthecat•1mo ago

Story of AI:

For instance - it created a hasMinimalEntropy function meant to "detect obviously fake keys with low character variety". I don't know why.

blobbers•1mo ago

I'm curious if anyone has written a "Principal Engineer" agents.md or CLAUDE.md style file that yields better results than the 'junior dev' results people are seeing here.

I've worked on writing some as a data scientist, and I have gotten the basic claude output to be much better; it makes some saner decisions, it validates and circles back to fix fits, etc.

thomassmith65•1mo ago

With a good programmer, if they do multiple passes of a refactor, each pass makes the code more elegant, and the next pass easier to understand and further improve.

Claude has a bias to add lines of code to a project, rather than make it more concise. Consequently, each refactoring pass becomes more difficult to untangle, and harder to improve.

Ideally, in this experiment, only the first few passes would result in changes - mostly shrinking the project size, and from then on, Claude would change nothing - just like a very good programmer.

This is the biggest problem with developing with Claude, by far. Anthropic should laser focus on fixing it.

layer8•1mo ago

This makes me wonder what the result would be of having an AI turn a code base into literate-programming style, and have it iterate on that to improve the “literacy”.

failuremode•1mo ago

> We went from around 700 to a whooping 5369 tests

> Tons of tests got added, but some tests that mattered the most (maestro e2e tests that validated the app still works) were forgotten.

I've seen many LLM proponents often cite the number of tests as a positive signal.

This smells, to me, like people who tout lines of code.

When you are counting tests in the thousands I think it's a negative signal.

You should be writing property based tests rather than 'assert x=1', 'assert x=2', 'assert x=-1' and on and on.
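For readers who have not seen the style, a hand-rolled sketch of a property-based test using only the standard library. Real projects would more likely reach for a library such as Hypothesis (Python) or fast-check (TypeScript); the `clamp` function and the chosen properties are just an illustration.

```python
import random


def clamp(x, lo, hi):
    """Constrain x to the inclusive range [lo, hi]."""
    return max(lo, min(x, hi))


def check_clamp_properties(trials=1000, seed=0):
    """Assert general properties of clamp over many random inputs."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    for _ in range(trials):
        lo = rng.randint(-100, 100)
        hi = rng.randint(lo, 200)  # guarantees lo <= hi
        x = rng.randint(-1000, 1000)
        y = clamp(x, lo, hi)
        assert lo <= y <= hi  # result is always in range
        assert clamp(y, lo, hi) == y  # clamping twice changes nothing
        if lo <= x <= hi:
            assert y == x  # in-range inputs pass through untouched
    return trials


print(check_clamp_properties())
```

Three properties checked over a thousand random inputs replace a long list of single-value assertions, and the test count stays at one.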

If LLMs are incapable of acknowledging that then add it to the long list of 'failure modes'.

whalesalad•1mo ago

I would love to see an experiment done like this with an arena of principal engineer agents. Give each of them a unique personality: this one likes shiny new objects and is willing to deal with early adopter pain, this one is a neckbeard who uses emacs as pid 1 and sends email via usb thumbdrive, and the third is a pragmatic middle of the road person who can help be the glue between them. All decisions need to reach a quorum before continuing. Better yet: each agent is running on a completely different model from a different provider. 3 can be a knob you dial up to 5, 10, etc. Each of these agents can spawn sub-agents, to reach out to professionals like a CSS export, or a DBA.

I think prompt engineering could help here a bit: add some context on what a quality codebase is, remove everything that is not necessary, consider future maintainability (20->84k lines is a smell). All of these are smells that a simple supervisor agent could have caught.
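The quorum idea could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Proposal` shape, the three reviewer rules and the 200-line threshold are all invented here, and each `review` function is a plain stand-in for what would really be a call to a separate model.

```typescript
type Vote = "approve" | "reject";

// Hypothetical summary of a change an agent wants to make.
interface Proposal {
  description: string;
  linesAdded: number; // negative when the change shrinks the codebase
  newDependencies: number;
}

// In the real arena each reviewer would wrap a different model or provider.
interface Reviewer {
  name: string;
  review: (proposal: Proposal) => Vote;
}

const reviewers: Reviewer[] = [
  // Likes shiny new things: approves everything.
  { name: "early-adopter", review: () => "approve" },
  // Distrusts new dependencies.
  {
    name: "traditionalist",
    review: (p) => (p.newDependencies === 0 ? "approve" : "reject"),
  },
  // Pragmatist: rejects changes that balloon the codebase (made-up limit).
  {
    name: "pragmatist",
    review: (p) => (p.linesAdded <= 200 ? "approve" : "reject"),
  },
];

// A proposal passes when at least `needed` reviewers approve.
// By default a simple majority is required.
function reachesQuorum(
  panel: Reviewer[],
  proposal: Proposal,
  needed: number = Math.floor(panel.length / 2) + 1,
): boolean {
  const approvals = panel.filter((r) => r.review(proposal) === "approve").length;
  return approvals >= needed;
}

const cleanup: Proposal = {
  description: "Delete unused helpers",
  linesAdded: -50,
  newDependencies: 0,
};
const testExplosion: Proposal = {
  description: "Add thousands of generated tests and two libraries",
  linesAdded: 5000,
  newDependencies: 2,
};

const cleanupPasses = reachesQuorum(reviewers, cleanup);
const explosionPasses = reachesQuorum(reviewers, testExplosion);
```

Dialing the panel up from 3 to 5 or 10 only means adding entries to `reviewers`; the majority default scales with the panel size.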

chr15m•1mo ago

It behaved exactly like 99% of developers, introducing unnecessary complexity.

culi•1mo ago

I checked the diffs of the `highest-quality` branch vs `main` and immediately noticed an `as any`

https://github.com/Gricha/macro-photo/compare/main...highest...

Not what I would expect from a prompt like "you're a principal engineer"
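For readers wondering why a single `as any` stands out, here is a small sketch of the difference. The `Meal` type and both parse functions are hypothetical and are not taken from the linked diff; they only illustrate the general pattern.

```typescript
// Hypothetical domain type, invented for this example.
interface Meal {
  name: string;
  calories: number;
}

// `as any` switches the type checker off: whatever the JSON contains is
// returned as if it were a Meal, and mistakes only surface at runtime.
function parseUnsafe(json: string): Meal {
  const value: unknown = JSON.parse(json);
  return value as any;
}

// A type guard checks the shape at runtime and narrows `unknown` to Meal.
function isMeal(value: unknown): value is Meal {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  return typeof record.name === "string" && typeof record.calories === "number";
}

// Returns null instead of a mistyped object when the data is malformed.
function parseSafe(json: string): Meal | null {
  const value: unknown = JSON.parse(json);
  return isMeal(value) ? value : null;
}

const malformed = '{"name": "oats", "calories": "lots"}';
const unsafeMeal = parseUnsafe(malformed); // typed as Meal, but calories is a string
const safeMeal = parseSafe(malformed); // null, so the problem is caught at the boundary
```

The unsafe version compiles cleanly, which is exactly the problem: the type annotation now claims something the data does not guarantee.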

mgrat•1mo ago

[flagged]

credit_guy•1mo ago

I see this sentiment quite often. The Economist chose the "word of the year"; it is "slop". Everybody hates AI slop.

And lots of people who use AI coding assistants go through a phase of pushing AI slop in prod. I know I did that. Some of it still bites me to this day.

But here's the thing: AI coding assistants did not exist two years ago. We are critical of them based on unfounded expectations. They are tools, and they have limitations. They are far, very, very far, from being perfect. They will not replace us for 20 years, at least.

But are they useful? Yes. Can you learn usage patterns so you eliminate as much AI slop as possible? I personally hope I did that; I think quite a lot of people who use AI coding assistants have found ways to tame the beast.

tomhow•1mo ago

Please don't fulminate on HN. We're here for curious conversation, not rage. This question has been debated here for the past couple of years now, and that debate will no doubt continue. This kind of indignant rhetorical question adds little of value to what is an important topic. Please make an effort to observe the guidelines if you want to participate here. https://news.ycombinator.com/newsguidelines.html

keeda•1mo ago

Hilarious! Kinda reinforces the idea that LLMs are like junior engineers with infinite energy.

But just telling an AI it's a principal engineer does not make it a principal engineer. Firstly, that is such a broad, vaguely defined term, and secondly, typically that level of engineering involves dealing with organizational and industry issues rather than just technical ones.

And so absent a clear definition, it will settle on the lowest common denominator of code quality, which would be test coverage -- likely because that is the most common topic in its training data -- and extrapolate from that.

The other thing is, of course, the RL'd sycophancy which compels it to do something, anything, to obey the prompt. I wonder what would happen if you tweaked the prompt just a little bit to say something like "Use your best judgement and feel free to change nothing."

29athrowaway•1mo ago

Don't use cloc in 2025. Use tokei or whatever.

bitwize•1mo ago

There's probably a human manager going "Great! How come I can't get my engineering team to ship this much QUALITY?"

hamasho•1mo ago

One and a half years ago, this method gathered a bit of attention on Japanese Twitter. It's called pawahara prompt (パワハラプロンプト, power harassment prompt) because it's like your asshole boss repeatedly saying "can you improve this more?" without any helpful suggestions until the employees break down. Many people found it could improve the code base to some extent even then; I think it works much better now.

swiftcoder•1mo ago

> The version "pre improving quality" was already pretty large. We are talking around 20k lines of TS

Even before the rest of it, 20k lines of code for an app with 5 screens is... yeah. Reminds me of "enterprise" Java codebases from 15 years ago

v3xro•1mo ago

Would be nice if every article about LLM/AI had that as a tag so you could skip past them...

arconis987•1mo ago

next time, have the LLM alternate between these two steps:

- Do some work
- Critique the work

it will converge better
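A rough sketch of that alternating loop, assuming both steps are supplied as callbacks. `work` and `critique` are hypothetical stand-ins for LLM calls, and the round limit is an arbitrary safety stop added here rather than part of the suggestion.

```typescript
interface Critique {
  issues: string[]; // an empty list means the critic has nothing left to flag
}

// Alternates "do some work" and "critique the work" until the critique
// comes back clean or the round limit is hit.
function workAndCritique<T>(
  initial: T,
  work: (draft: T, feedback: string[]) => T,
  critique: (draft: T) => Critique,
  maxRounds = 5,
): { result: T; rounds: number; converged: boolean } {
  let draft = initial;
  let feedback: string[] = [];
  for (let round = 1; round <= maxRounds; round++) {
    draft = work(draft, feedback); // step 1: do some work, guided by prior feedback
    feedback = critique(draft).issues; // step 2: critique the work
    if (feedback.length === 0) {
      return { result: draft, rounds: round, converged: true };
    }
  }
  return { result: draft, rounds: maxRounds, converged: false };
}

// Toy stand-ins: the worker resolves one TODO per round,
// and the critic complains while any TODO remains.
const outcome = workAndCritique(
  "TODO one, TODO two",
  (draft) => draft.replace("TODO", "done"),
  (draft) => ({ issues: draft.indexOf("TODO") >= 0 ? ["unfinished TODO"] : [] }),
);
```

The important difference from the article's setup is the exit condition: once the critique is empty the loop stops, so "change nothing" is a reachable outcome.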

rvz•1mo ago

> ...oh and the app still works, there's no new features, and just a few new bugs.

There are many apps out there whose developers religiously worship high quality and over-engineer a single app that has fewer than 10 users, or just over 1,000 if they are lucky.

…and all of that and not a single dollar was made. Might as well have donated it to Anthropic.

just6979•1mo ago

'In some iterations, coding agent put on a hat of security engineer. For instance - it created a hasMinimalEntropy function meant to "detect obviously fake keys with low character variety". I don't know why.'

Yes, you do know why. Because somewhere in its training, that functionality was linked to "quality" or "improvement". Remember what these things do at their core: really good auto-complete.

'The prompt, in all its versions, always focuses on us improving the codebase quality. It was disappointing to see how that metric is perceived by AI agent.'

Really? It's disappointing to see how that metric is perceived by humans, and the AIs are trained on things humans made. If people can't agree on "codebase quality", especially the ones who write loudly about it on the internet, it's going to be impossible for AI agents to agree. A better prompt actually specifying what _you_ consider to be improvements would have been so much better: perhaps minimize 3rd party deps, or minimize local utils reimplementing existing 3rd party libs, or add quality typechecks.

'The leading principle was to define a few vanity metrics and push for "more is better".'

Yeah, because this is probably the most common thing it saw in training. Programmers actually making codebase quality improvements are just quietly doing it, while the ones shouting on the internet (hence into the training data) about how their [bad] techniques [appear to] improve quality are also the ones picking vanity metrics and pushing for "more is better".

'I've prompted Claude Code to failure here'

Not really a failure: it did exactly what you asked and improved "codebase quality" according to its training data. If you _required_ a human engineer to do the same thing 200 times, you'd get similar results as they run out of real improvements and start scouring the web for anything that anybody ever considered an "improvement", which very definitely includes vanity metrics and "more is better" regarding test count and coverage. You just showed that these AIs aren't much more than their training data. It's not actually thinking about quality, it's just barfing up things it has seen called "codebase quality improvements", regardless of the actual quality of those improvements.

KronisLV•1mo ago

> In message log, the agent often boasts about the number of tests added, or that code coverage (ugh) is over some arbitrary percentage. We end up with an absolute moloch of unmaintainable code in the name of quality. But hey, the number is going up.

Oh hey, just like real developers!

timtas•1mo ago

This reinforces my standard explanation of Claude Code: Claude is exactly like a junior engineer who is simultaneously brilliant and retarded.

It can do great things but needs close supervision. Claude doesn’t write code, Claude recommends code.

devy•1mo ago

> Read and summarize the project

> Implement a fresh project based off of this description

Genuine question: if we were to ask an AI to do those two steps to generate an entirely different code base from scratch, would it qualify as a "clean room" design, legally speaking?

