Software is mostly all you need

https://softwarefordays.com/post/software-is-mostly-all-you-need/

60•jbmilgrom•1w ago

Comments

wrs•1w ago

In other words, a higher-level JIT compiler, meaning it still dynamically generates code based on runtime observations, but the code is in a higher-level language than assembly, and the observations are of a higher-level context than just runtime data types.

daxfohl•1w ago

I agree this is what the article says, but it's a pretty bad premise. That would only be the case if the primary user interaction with coding agents was "feed in requirements, get a finished product". But we all know it's a more iterative process than that.

jbmilgrom•1w ago

Author here

We are building this at docflowlabs ie a self-healing system that can respond to customer feedback automatically. And youre right that not all customers know what they want or even how to express it when they do, which is why the agent loop we have facing them is way more discovery-focused than the internal one.

And we currently still have humans in the loop for everything (for now!) - e.g, the agent does not move onto implementation until the root cause has been approved

daxfohl•1w ago

Cool, I tried something similar over a couple weeks but the problem I ran into was that beyond a fairly low level of complexity, the English spec became more confusing than the code itself. Even for a simple multi-step KYC workflow, it got very convoluted and hard to make it precise, whereas in code it's a couple loops and if/else blocks with no possibility of misinterpretation. Have you encountered that at all, or have any techniques you've found useful in these situations?

That's why I feel like iterative workflows have won out so far. Each step gets you x% closer, so you close in on your goal exponentially, whereas the one-shot approach closes in much slower, and each iteration starts from scratch. The advantage is that then you have a spec for the whole system, though you can also just generate that from the code if you write the code first.

jbmilgrom•1w ago

that's right, and agents turning specs into software can go in all sorts of directions especially when we don't control the input.

what we've done to mitigate is essentially backing every entrypoint (customer comment, internal ticket, etc) with a remote claude code session with persistent memory - that session essentially becomes the expert in the case. And we've developed checkpoints that work from experience (e.g. the root cause one) where a human has the opportunity to take over the wheel so to speak and drive in a different direction with all the context/history up to that point.

basically, we are creating a assembly line where agents do most of the work and humans increasingly less and less as we continue to optimize the different parts of assembly

as far as techniques, it's all boring engineering

* Temporal workflow for managing the lifecycle of a session

* complete ownership of the data model e2e. we dont use Linear for example; we built our own ticketing system so we could represent Temporal signals, github webhooks and events from the remote claude sessions exactly how we wanted

* incremental automation gains over and over again. We do a lot of the work manually first (like old fashioned hand coding lol) before trying to automate so we become experts in that piece of the assembly line and it becomes obvious how to incrementally automate...rinse and repeat

daxfohl•1w ago

Ooh, it sounds like you've already got most of the groundwork done for something I was wondering about yesterday: I'd love it if there was some way during an incident, for some system to pull all the PRs included in the latest release, check which agents worked on them (i.e. line in the commit message with an identifier that corresponds to the agent's LLM context and any other data at the time of commit), "rehydrate" these agents from the corresponding stored context, feed them the relevant incident data, and ask if it could be related to their changes and what to do about it.

In most cases it might not be much more valuable than just looking through the diffs from scratch with a new agent, but there are probably going to be some cases where a rehydrated agent is like "Doh, I meant to do X but it looks like I hallucinated Y instead. Here's a PR to fix it!"

I know that's just a small piece of what you're doing, but I think it's something that would be valuable on its own, and soon something that is likely to be "standard infrastructure" for any company that does even a little agentic coding (assuming it works). It'd probably even be "required infrastructure" in regulated industries; the fact that all these agent contexts are ephemeral has to be a red flag from a regulatory perspective.

jbmilgrom•1w ago

totally, it's like ai-native github with some linear plus some ability to push the ball forward autonomously. This doesn't exist yet so we had to build a version internally, but also we built it pretty specifically for our needs. The general version might have to be more componentized, not sure. We also as an industry probably need some version control protocol above git that includes all the history around the commit so we don't have to string together root cause documents and conversation history in s3 linked via relational entities in psql.

verdverm•1w ago

Lost me at the claim AI is good at judgement making, this is the exact opposite of my experience, they make both good and bad decisions with reliability

mvc•1w ago

I think it makes better decisions than me provided I give it enough high-level direction and context.

Sometimes I give it __too much__ direction and it finds the solution I had in mind but not the best.

I'm not into it enough that I'm formally running different personas against each other in a co-operative system but I kind of informally do that.

verdverm•1w ago

The type of decision very much matters, coding is one thing. I met a chap at the bar who ChatGPT had verified his crazy theories and he now outsourced all of his major life decisions to it, very proud and enthusiastic about it all. First IRL case of AI psychosis I have encountered. He was keen for my thoughts, as though I was the first person he met IRL that knew more than the layman about Ai. Hope the questions (contradictions) I left him with helped bring him back a bit.

It's going to get a lot worse

rybosworld•1w ago

I think that's also true of people but we are kinder to each other and ourselves when judgement is bad.

How many times have you been in a conversation where you asked the wrong question or stated the wrong thing because you either weren't 100% listening (no one is), or you forgot, or you didn't connect the same dots that others did?

Terr_•1w ago

Treating humans differently makes sense because the "badness" of a judgement isn't just the correctness of an outcome, but also the nature of the process that created it, and humans are a different process.

For example, if two otherwise-identical humans yield the same equally-correct answer, we probably will favor the one that reached it through facts and reasoning, as opposed to the one who literally flipped a coin.

krainboltgreene•1w ago

> I think that's also true of people

Reductionist positions seem to always pop up in these threads.

verdverm•1w ago

That's just humans everywhere, in all of time

It takes effort to be better about it, don't expect perfection from yourself or others

2001zhaozhao•1w ago

> Code is the policy, deployment is the episode, and the bug report is the reward signal

This is a great quote. I think it makes a ton of sense to view a sufficiently-cheap-and-automated agentic SWE system as a machine learning system rather than traditional coding.

* Perhaps the key to transparent/interpretable ML is to just replace the ML model with AI-coded traditional software and decision trees. This way it's still fully autonomously trained but you can easily look at the code to see what is going on.

* I also wonder whether you can use fully-automated agentic SWE/data science in adversarial use-cases where you traditionally have to use ML, such as online moderation. You could set a clear goal to cut down on any undesired content while minimizing false-positives, and the agent would be able to create a self-updating implementation that dynamically responds to adversarial changes. I'm most familiar with video game anti-cheat where I think something like this is very likely possible.

* Perhaps you can use a fully-automated SWE loop, constrained in some way, to develop game enemies and AI opponents which currently requires gruesome amounts of manual work to implement. Those are typically too complex to tackle using traditional ML and you can't naively use RL because the enemies are supposed to be immersive rather than being the best at playing the game by gaming the mechanics. Maybe with a player controller SDK and enough instructions (and live player feedback?), you can get an agent to make a programmatic game AI for you and automatically refine it to be better.

jbmilgrom•1w ago

> Perhaps the key to transparent/interpretable ML is to just replace the ML model with AI-coded traditional software and decision trees. This way it's still fully autonomously trained but you can easily look at the code to see what is going on.

For certain problems I think thats completely right. We still are not going to want that of course for classic ML domains like vision and now coding, etc. But for those domains where software substrate is appropriate, software has a huge interpretability and operability advantage over ML

fooker•1w ago

> We still are not going to want that of course for classic ML domains like vision

It could make sense to decompose one large opaque model into code with decision trees calling out to smaller models having very specific purposes. This is more or less science fiction right now, 'mixture of experts' notwithstanding.

You could potentially get a Turing award by making this work for real ;)

jbmilgrom•1w ago

woah that would be crazy

Terr_•1w ago

Just yesterday I came across a something a sci-fi webcomic author wrote as backstory back in ~2017, where all future AI has auditable logic-chains, due to a disaster in 2061 involving an American AI defense system.

While the overall concept of "turns on its creators" is not new, I still found the "root cause" darkly amusing:

> [...] until the millisecond that Gordon Smith put his hand on a Bible and swore to defend the constitution.

> Thus, when the POTUS changed from Vanderbilt to Smith, a switch flipped. TIARA [Threat Intel Analysis and Response Algorithm] was now aware of an individual with 1) a common surname, 2) a lot of money and resources, 3) the allegiance of thousands of armed soldiers, 4) many alternate aliases (like "POTUS"), 5) frequent travel, 6) bases of operation around the world, 7) mentioned frequently in terrorist chatter, etc, etc, etc.

> And yes, of course, when TIARA launches a drone strike, it notifies a human operator, who can immediately countermand it. This is, unfortunately, not useful when the drone strike mission has a travel time of zero seconds.

> Thousands of intelligent weapons, finding themselves right on top of a known terrorist's assets, immediately did their job and detonated. In less than fifteen minutes, over ten thousand people lost their lives, and the damage was estimated in the trillions of dollars.

[0] https://forwardcomic.com/archive.php?num=200

fooker•1w ago

> Perhaps the key to transparent/interpretable ML is to just replace the ML model with AI-coded traditional software and decision trees

I like this train of thought. Research shows that decision trees are equivalent to 1-bit model weights + larger model.

But critically, we only know some classes of problems that are effectively solved by this approach.

So, I guess we are stuck waiting for new science to see what works here. I suspect we will see a lot more work on these topics after the we hit some hard LLM scalability limits.

isodev•1w ago

> Neural networks excel at judgment

I don’t think they do. I think they excel at outputting echoes of their training data that best fit (rhyme with, contextually) the prompt they were given. If you try using Claude with an obscure language or use case, you will notice that effect even more - it will keep pulling towards things it knows that aren’t at all what’s asked or “the best judgement” for what’s needed.

rybosworld•1w ago

Neural nets have been better at classifying handwriting (MNIST) than the best humans for a long time. This is what the author means by judgement.

They are super-human in their ability to classify.

verdverm•1w ago

Classifiers and LLMs get very different training and objectives, it's a mistake to draw inference from MNIST for coding agents or LLMs more generally.

Even within coding, their capability varies widely between context and even runs with the same context. They are not better at judgement in coding for all cases, def not

kranner•1w ago

A lot of the context is not even explicit, unlike the case for toy problems like MNIST.

PostOnce•1w ago

Tell that to all the OCR fuckups I see in all the ebooks I read.

raincole•1w ago

Your ebooks are made with handwriting recognition...? What do you read, the digital version of Dead Sea Scrolls?

PostOnce•1w ago

Some of them are, most of them are standard typesetting, which you would think would be all the easier to OCR, due to the uniformity.

But because you're curious, there are some fairly famous handwritten books that maintain their handwriting in publication, my favorite being: https://boingboing.net/2020/08/31/getting-started-in-electro...

Old manuscripts are another one, there are a LOT of those. Is that handwriting? Maybe you'd argue it's "hand-printing" because its so meticulous.

esafak•1w ago

They could be OCRs of scanned printed books.

boogrpants•1w ago

> I think they excel at outputting echoes of their training data that best fit (rhyme with, contextually) the prompt they were given.

Just like people who get degrees in economics or engineering and engage in such role-play for decades. They're often pretty bad at anything they are not trained on.

Coincidentally, if you put a single American English speaker on a team of native German language speakers you will notice information transference falls apart.

Very normal physical reality things occurring in two substrates, two mediums. As if there is a shared limitation called the rest of the universe attempting to erode our efforts via entropy.

LLM is a distribution of human generated data sets. Since humans have the same incompleteness problems in society this affords enough statistical wiggle room for LLMs to make shit up; humans do it! Look in their data!

We're massively underestimating realities indifference to human existence.

There is no doing any better until we effectively break physics, by that I really mean come upon a game changing discovery that informs us we had physics all wrong to begin with.

harry8•1w ago

The fact there are a lot of people around who don't think (including me at times!) does mean LLMs doing that are thinking.

Much like LLMs writing text like mindless middle managers, it doesn't mean they're intelligent, more that mindless middle managers aren't.

isodev•1w ago

> Just like people

I understand that having model related vocabulary borrow similar words we use to describe human brains and cognition gets confusing. We are not the same, we don’t “learn” the same we certainly don’t use the knowledge we posses in the same way.

The major difference between an LLM and a human is that as a human, I can look at your examples (which sound solid at first glance) and choose to truly “reason” about them in a way that allows me to judge if they’re correct or even applicable.

perfmode•1w ago

how’s your reasoning different from LLM reasoning?

obirunda•1w ago

What humans are known to do, and apparently there is no limit to what they won't, is anthropomorphizing. I think there's not been a single one of these discussions where someone inevitably says LLM's don't do X as well as a human and someone interjects in cult-like fashion.

isodev•1w ago

slopbros need to keep vibing

boogrpants•1w ago

Obviously. You are not exactly the same as your nearest neighbor but have similar observable traits to outside observers.

But since you end up trying to differentiate yourself from an LLM in vague, conceptual qualifiers, not empirical differences, what it means to "reason" ...I am left uncertain what you mean at all.

An LLM can reject false assertions and generate false positives just like a human.

Within a culture too individual people become pretty copy paste distillations of their generations customs. As a social creature you aren't that different. Really all that sets you apart from other people or a computer is a unique meat suit.

Unfortunately for your meat suit most people don't care it exists and will carry on with their lives never noticing it.

While LLMs have massive valuations right now. Pretty sure the public has spoken when it comes to the differences you fail to illustrate actually mattering.

isodev•1w ago

I think I've read that book... but I distinctly remember the plot was a lot more engaging.

nchagnet•1w ago

> While LLMs have massive valuations right now. Pretty sure the public has spoken when it comes to the differences you fail to illustrate actually mattering.

Are you seriously using market valuation as an indicator of worth?

jauntywundrkind•1w ago

Here here. Code has uniquely an incredible volume of data. And incredibly good ways to assess & test it's weights, to immediately find out of its headed the right way on the gradient.

geraneum•1w ago

> And incredibly good ways to assess & test it's weights

What weights are you referring to? How does [Claude?] code do that

pegasus•1w ago

Look into RLVR (Reinforcement Learning with Verifiable Rewards). It happens during model post-training.

geraneum•5d ago

That’s not happening in run time, when you send the prompts for the next token to be generated. The comment seems to imply a run time phenomenon.

jauntywundrkind•1w ago

The hidden virtual weights in reality.

Which are often complex & multi-faceted, measuring the rest of reality's weighing, to make broad judgement with. Reality's normal context window is a google deep for even the most everyday of circumstance. The weights exist there, but amid too broad a reality with too many factors for that exacting a use, and are too complected to measure out individually easily.

Code is simple. It's context is limited to what it is. To ascertainable viewable realities that mankind has already distilled out, into the form of systems and code.

And like relativity, we can measure the curvature of space around these weights, can envision how space bends and attracts. And now set in motion our own bodies, to orbit on nicely composed courses.

geraneum•5d ago

Imagine hearing this in Alan Watts’ voice.

jbmilgrom•1w ago

Author here

We are building this learned software system at Docflow Labs to solve the integration problem in healthcare at scale ie systems only able to chat with other systems via web portals. RPA historically awful to build and maintain so we've needed to build this to stay above water. Happy to answer any questions!

cadamsdotcom•6d ago

There’s something interesting here around when to write code, when to reuse code, and when to replace the code that was written.

Browser use could be improved by being partly done with code and part genetically.. completing tasks on the web is deceptive because it seems easily codifiable (just have the model write some Playwright code!) while actually being gnarly as hell. What if the page has changed completely since last visit?

It’d be interesting to let the agent build up a library of code that it can reuse if it feels confident that will get the job done, while feeding back any error to let the agent debug.. and that might lead to it writing a new routine to stick in the library, possibly replacing the old one.

Seems like something today’s models could be made to do with a bit of work in the harness. Anyone tried anything like this?

"There must be something like the opposite of suicide "

Ask HN: Why doesn't Netflix add a “Theater Mode” that recreates the worst parts?

Show HN: Engineering Perception with Combinatorial Memetics

Show HN: Steam Daily – A Wordle-like daily puzzle game for Steam fans

The Anthropic Hive Mind

Just Started Using AmpCode

LLM as an Engineer vs. a Founder?

Crosstalk inside cells helps pathogens evade drugs, study finds

Show HN: Design system generator (mood to CSS in <1 second)

Show HN: 26/02/26 – 5 songs in a day

Toroidal Logit Bias – Reduce LLM hallucinations 40% with no fine-tuning

Top AI models fail at >96% of tasks

The Science of the Perfect Second (2023)

Bob Beck (OpenBSD) on why vi should stay vi (2006)

Show HN: a glimpse into the future of eye tracking for multi-agent use

The Optima-l Situation: A deep dive into the classic humanist sans-serif

Barn Owls Know When to Wait

Implementing TCP Echo Server in Rust [video]

LicGen – Offline License Generator (CLI and Web UI)

Service Degradation in West US Region

The Janitor on Mars

Bringing Polars to .NET

Adventures in Guix Packaging

Show HN: We had 20 Claude terminals open, so we built Orcha

Your Best Thinking Is Wasted on the Wrong Decisions

Warcraftcn/UI – UI component library inspired by classic Warcraft III aesthetics

Trump Vodka Becomes Available for Pre-Orders

Velocity of Money

Stop building automations. Start running your business

You can't QA your way to the frontier

Software is mostly all you need

Comments

"There must be something like the opposite of suicide "

Ask HN: Why doesn't Netflix add a “Theater Mode” that recreates the worst parts?

Show HN: Engineering Perception with Combinatorial Memetics

Show HN: Steam Daily – A Wordle-like daily puzzle game for Steam fans

The Anthropic Hive Mind

Just Started Using AmpCode

LLM as an Engineer vs. a Founder?

Crosstalk inside cells helps pathogens evade drugs, study finds

Show HN: Design system generator (mood to CSS in <1 second)

Show HN: 26/02/26 – 5 songs in a day

Toroidal Logit Bias – Reduce LLM hallucinations 40% with no fine-tuning

Top AI models fail at >96% of tasks

The Science of the Perfect Second (2023)

Bob Beck (OpenBSD) on why vi should stay vi (2006)

Show HN: a glimpse into the future of eye tracking for multi-agent use

The Optima-l Situation: A deep dive into the classic humanist sans-serif

Barn Owls Know When to Wait

Implementing TCP Echo Server in Rust [video]

LicGen – Offline License Generator (CLI and Web UI)

Service Degradation in West US Region

The Janitor on Mars

Bringing Polars to .NET

Adventures in Guix Packaging

Show HN: We had 20 Claude terminals open, so we built Orcha

Your Best Thinking Is Wasted on the Wrong Decisions

Warcraftcn/UI – UI component library inspired by classic Warcraft III aesthetics

Trump Vodka Becomes Available for Pre-Orders

Velocity of Money

Stop building automations. Start running your business

You can't QA your way to the frontier