Good luck with that. Users will flood you with complaints if a button moves 5px to the left after a design update. A program that is generated at runtime, with not just a variable UI but also UX and workflows, would get you death threats.
The problem is that outside of that most people want boring and regular interfaces so they can get in and solve the problem and get out - they don't want to "love" it or care if its "sexy" they want it to work and get out of the way.
LLMs transmogrifying your software at ever request assumes people are software architects and creators who love the computer interface, and that just doesn't describe the bulk of the population.
Most people using computers use the to consume things or utilize access to things, not for their own sake, and they certainly don't think "what if I just had code to do x..." unless x is make them a lot of money.
Your comment EXACTLY mirrors my experience. Week 1 was ever expanding prompts, and degrading performance. Week 2 has been all about actually defining the objects precisely (notes, tasks, projects, people etc) and defining methods for performing well defined operations against these objects. The agent surface has, as you rightly point out, shrunk to a translation layer that converts natural language to commands and args that pass the input validator.
Still have yet to see a universal treatment that tackles this well.
However, there are some things that I think need a foundational next-generation improvement of some sort. The way that LLMs sort of smudge away "NEVER DO X" and can even after a lot of work end up seeing that as a bit of a "PLEASE DO X" seems fundamental to how they work. It can be easy to lose track of as we are still in the initial flush of figuring out what they can do (despite all we've already found), but LLMs are not everything we're looking for out of AI.
There should be some sort of architecture that can take a "NEVER DO X" and treat it as a human would. There should be some sort of architecture that instead of having a "context window" has memory hierarchies something like we do, where if two people have sufficiently extended conversations with what was initially the same AI, the resulting two AIs are different not just in their context windows but have actually become two individuals.
I of course have no more idea what this looks like than anyone else. But I don't see any reason to think LLMs are the last word in AI.
0 - https://stripe.dev/blog/minions-stripes-one-shot-end-to-end-...
I've tried doing something similar with AI by running a prompt several times and then have an agent pick the best response. It works fairly well but it burns a lot of tokens.
1. an adversarial agent harness that uses one agent to create a plan and implement it, and another to review the plan and code-review each step.
2. an agentic validation suite -- a more flexible take on e2e testing.
3. some custom skills that explain how to use both of those flows.
With this in place you can formulate ideas in a chat session, produce planning artifacts, then use the adversarial system to implement the plans and the validation layer to get everything working e2e for human review.
There are a lot of tools you can use for these things but I chose to just build the tooling in the repo as I go.
Agents aren't reliable; use workflows instead.
I am finding that the better the quality gates are the lower quality llm you can use for the same result (at a cost of time).
Neywiny•1h ago
tekne•41m ago
Making an unreliable, nondeterministic system give reliable results for a bounded task with well-understood parameters is... like half of engineering, no?
There's a huge difference between "generate this code here's a vague feature description" and "here's a list of criteria, assign this input to one of these buckets" -- the latter is obviously subject to prompt engineering, hallucination, etc -- but so can a human pipeline!
aleksiy123•13m ago
Somewhere in between that I guess is the varying levels of intelligence more likely able to make the “right” decision for anything you throw at it.
pydry•12m ago