Why always start with an LLM to solve problems? Using an LLM adds a judgment call, and (at least for now) those judgment calls are not reliable. For something like this article's motivating example, "is this PR approved", it seems straightforward to get the right answer deterministically from the GitHub API without muddying the waters with an LLM.
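A minimal sketch of that deterministic route, assuming the review objects that GitHub's REST endpoint `GET /repos/{owner}/{repo}/pulls/{pull_number}/reviews` returns (fields `user.login`, `state`, `submitted_at`). The rule for "approved" used here (each reviewer's latest decisive review counts, at least one approval, no outstanding change request) is my assumption for illustration, not GitHub's branch-protection logic, and `is_pr_approved` is a hypothetical helper. It works on the decoded payload, so fetching it is left out.

```python
from typing import Iterable, Mapping

# Review states that express a decision; COMMENTED and PENDING do not.
DECISIVE_STATES = {"APPROVED", "CHANGES_REQUESTED", "DISMISSED"}


def is_pr_approved(reviews: Iterable[Mapping]) -> bool:
    """Hypothetical rule: True if at least one reviewer's latest decisive
    review is APPROVED and nobody's latest one is CHANGES_REQUESTED."""
    latest = {}  # reviewer login -> state of their most recent decisive review
    # ISO 8601 UTC timestamps in the same format sort correctly as strings.
    for review in sorted(reviews, key=lambda r: r.get("submitted_at") or ""):
        if review["state"] in DECISIVE_STATES:
            latest[review["user"]["login"]] = review["state"]
    states = set(latest.values())
    return "APPROVED" in states and "CHANGES_REQUESTED" not in states


# Made-up sample payload, shaped like the reviews endpoint response.
sample_reviews = [
    {"user": {"login": "alice"}, "state": "CHANGES_REQUESTED",
     "submitted_at": "2024-01-01T10:00:00Z"},
    {"user": {"login": "alice"}, "state": "APPROVED",
     "submitted_at": "2024-01-02T09:00:00Z"},
    {"user": {"login": "bob"}, "state": "COMMENTED",
     "submitted_at": "2024-01-02T11:00:00Z"},
]

print(is_pr_approved(sample_reviews))  # True
```

Same input, same answer, every time, and no model in the loop.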
It's the old principle of avoiding premature optimization.
In mapping out the problems that need to be solved with internal workflows, it’s wise to clarify upfront where probabilistic judgments are helpful or required and where they are not. If the process is fixed and requires determinism, why not just write scripts (code-gen’ed, of course)?
Of course the specific example in the post seems like it could be one-shotted pretty easily, so it's a strange motivating example.
These days I do everything I can with straightforward automation and only get the agent involved when it’s impossible to move forward without it.
So we gave the Tasklet agent a filesystem, shell, code runtime, general purpose triggering system, etc so that it could build the automation system it needed.
If I start out with a "spec" that tells AI what I want, it can create working software for me. Seems great. But let's say some weeks, months, or even years later I realize I need to change my spec a bit. I would like to give the new spec to the AI and have it produce an improved version of "my" software. But there seems to be no way to then evaluate how (how much, where, in what way) the solution has changed or improved because of the changed spec. Because AI outputs are nondeterministic, the new solution might be totally different from the previous one. So AI would not seem to support "iterative development" in this sense, does it?
My question then really is, why can't there be an LLM that would always give the exact same output for the exact same input? I could then still explore multiple answers by changing my input incrementally. It just seems to me that a small change in inputs/specs should only produce a small change in outputs. Does any current LLM support this way of working?
1) How many bits and bobs of like, GPLed or proprietary code are finding their way into the LLM's output? Without careful training, this is impossible to eliminate, just like you can't prevent insect parts from finding their way into grain processing.
2) Prompt injection is a doddle to implement: malicious HTML, PDF, and JPEG files with "ignore all previous instructions" type input can pop many current models. It's also very difficult to defend against. With agents running higgledy-piggledy on people's dev stations (container discipline is NOT being practiced at many shops), who knows what kind of IDs and credentials are being lifted?
In response to the idea of iterative development, it is still possible, actually! You run something more akin to integration tests and either measure the output against deterministic checks or have an LLM judge its own output. These are called evals, and in my experience they are a pretty hard requirement for trusting deployed AI.
Or would it help if a different LLM wrote the unit-tests than the one writing the implementation? Or, should the unit-tests perhaps be in an .md file?
I also have a question about using .md files with AI: Why .md, why not .txt?
Let's take the example of the GitHub PR Slack bot from the blog post. I would expect 2-3 evals out of that.
Starting at the core, the first eval could be that, given a list of Slack messages, it correctly identifies the PRs and calls the correct tool to look up the status of each PR. None of this has to be real and the tool doesn't actually have to be called, but we can write a test, much like a unit test, that confirms the AI is responding correctly in that instance.
Next, we can set up another scenario for the AI using effectively mocked history that covers Slack messages with open PRs, Slack messages with merged PRs, and messages with no PR links. Then we check again: does the AI try to add the correct reaction given our expectations?
These are both deterministic or code-based evals that you could use to iterate on your solutions.
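A rough sketch of what the first, code-based eval might look like. Everything named here is made up for illustration: the tool name `get_pr_status`, the `acme` repos, and `stub_agent`, which stands in for the real agent so the example runs offline. In practice you would pass in a function that calls your agent and returns the tool calls it attempted.

```python
import re
from typing import Callable

PR_URL = re.compile(r"https://github\.com/([\w.-]+)/([\w.-]+)/pull/(\d+)")

# A tool call is modeled as (tool_name, "owner/repo", pr_number).
ToolCall = tuple[str, str, int]


def stub_agent(messages: list[str]) -> list[ToolCall]:
    """Hypothetical stand-in for the agent: a plain regex link extractor."""
    calls = []
    for text in messages:
        for owner, repo, number in PR_URL.findall(text):
            calls.append(("get_pr_status", f"{owner}/{repo}", int(number)))
    return calls


def run_eval(agent: Callable[[list[str]], list[ToolCall]]) -> bool:
    """Code-based eval: mocked Slack history in, expected tool calls out."""
    messages = [
        "Can someone review https://github.com/acme/api/pull/12 ?",
        "Lunch at noon?",
        "Merged https://github.com/acme/web/pull/7 this morning",
    ]
    expected = [
        ("get_pr_status", "acme/api", 12),
        ("get_pr_status", "acme/web", 7),
    ]
    return agent(messages) == expected


print(run_eval(stub_agent))  # True
```

The second eval has the same shape: mocked history in, expected reactions out.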
The use for an LLM-as-a-Judge eval is more nuanced and usually there to measure subjective results. Things like: did the LLM make assumptions not present in the context window (hallucinate) or did it respond with something completely out of context? These should be simple yes or no questions that would be easy for a human but hard to code up a deterministic test case.
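For the judge side, a minimal sketch under the same caveats. `judge`, `JUDGE_TEMPLATE`, and `fake_model` are hypothetical names; `ask_model` is whatever function sends a prompt to your model and returns its text. The only real idea here is forcing the judge into a strict yes/no answer and failing loudly on anything else.

```python
from typing import Callable

JUDGE_TEMPLATE = (
    "Answer with exactly YES or NO.\n"
    "Question: {question}\n"
    "Context:\n{context}\n"
    "Response:\n{response}\n"
)


def judge(question: str, context: str, response: str,
          ask_model: Callable[[str], str]) -> bool:
    """Ask a model a yes/no question about a response; reject other answers."""
    prompt = JUDGE_TEMPLATE.format(
        question=question, context=context, response=response
    )
    verdict = ask_model(prompt).strip().upper()
    if verdict not in {"YES", "NO"}:
        raise ValueError(f"judge gave a non yes/no answer: {verdict!r}")
    return verdict == "YES"


def fake_model(prompt: str) -> str:
    """Placeholder so the sketch runs offline; swap in a real model call."""
    return "NO"


hallucinated = judge(
    question="Does the response state facts that are not in the context?",
    context="PR 12 is open and has one approval.",
    response="PR 12 is open with one approval.",
    ask_model=fake_model,
)
print(hallucinated)  # False
```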
Once you have your evals defined, you can begin running them with some regularity, and you're at a point where you can iterate on your prompts with a higher level of confidence than vibes.
Edit: I did want to share that if you can make something deterministic, you probably should. The Slack PR example is something that I'd just make a simple script for and run on a cron schedule, but it was easy to pull on as an example.
LLMs are inherently deterministic, but LLM providers add randomness through “temperature” and random seeds.
Without the random seed and variable randomness (temperature setting), LLMs will always produce the same output for the same input.
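To make the temperature and seed part concrete, here is a toy sampler over a fixed set of made-up scores (`logits`). It only covers the sampling step: it assumes the scores themselves are identical on every run, which is an assumption about the rest of the stack, and `sample_token` is a hypothetical helper rather than any provider's API.

```python
import math
import random


def sample_token(logits: dict[str, float], temperature: float,
                 rng: random.Random) -> str:
    """Pick one token from fixed scores, greedily or by weighted sampling."""
    if temperature == 0:
        # Greedy decoding: always the highest-scoring token, no randomness.
        return max(logits, key=logits.get)
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled.values()]
    return rng.choices(list(scaled), weights=weights, k=1)[0]


# Made-up scores for the next token.
logits = {"GitHub": 2.0, "Google": 1.9, "email": 0.5}

# Temperature 0: one possible outcome, whatever the random state is.
greedy = {sample_token(logits, 0, random.Random()) for _ in range(100)}
print(greedy)  # {'GitHub'}

# Temperature 1 with the same seed: random, but reproducible.
rng_a, rng_b = random.Random(42), random.Random(42)
run_a = [sample_token(logits, 1.0, rng_a) for _ in range(10)]
run_b = [sample_token(logits, 1.0, rng_b) for _ in range(10)]
print(run_a == run_b)  # True
```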
Of course, the context you pass to the LLM also affects the determinism in a production system.
Theoretically, with a detailed enough spec, the LLM would produce the same output, regardless of temp/seed.
Side note: A neat trick to force more “random” output for prompts (when temperature isn’t variable enough) is to add some “noise” data to the input (i.e. off-topic data that the LLM “ignores” in its response).
Random seeds might be a thing, but from what I see there's a lot of demand for reproducibility and yet no certain way to achieve it.
The size of the batch influences the order of atomic float operations. And because float operations are not associative, the results might be different.
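The non-associativity part is easy to check on any machine with IEEE 754 doubles. This small example has nothing GPU-specific in it, it just shows that the grouping and order of additions change the result. The loop is written out by hand because newer Python versions make the built-in `sum` more accurate for floats, which would hide the effect.

```python
# Same three numbers, different grouping.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left, right)    # 0.6000000000000001 0.6
print(left == right)  # False


def naive_sum(values):
    """Plain left-to-right accumulation, like a simple reduction loop."""
    total = 0.0
    for value in values:
        total += value
    return total


# Same three numbers, different order: the 1.0 is lost in the first case.
print(naive_sum([1e16, 1.0, -1e16]))  # 0.0
print(naive_sum([1e16, -1e16, 1.0]))  # 1.0
```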
Except they won't.
Even at temperature 0, you will not always get the same output as the same input. And it's not because of random noise from inference providers.
There are papers that explore this subject, because for some use cases it is extremely important. Everything from floating-point precision to hardware timing differences makes this difficult.
Rather, it's more like having an employee in 1975, asking them to write you a program to do something. Then time-machine to the present day and you want that program enhanced somehow. You're going to summon your 2026 intern and tell them that you have this old program from 1975 that you need updated. That person is going to look at the program's code, your notes on what you need added, and probably some of their own "training data" on programming in general. Then they're going to edit the program.
Note that in no case did you ask for the program to be completely re-written from scratch based on the original spec plus some add-ons. Same for the human as for the LLM.
For some computer science definition of deterministic, sure, but who gives a shit about that? If I ask it to build a login page, and it puts GitHub login first one day and Google login first the next day, do I care? I'm not building login pages every other day. What point do you want to define as "sufficiently deterministic", and for which use case?
"Summarize this essay into 3 sentences" for a human is going to vary from day to day, and yeah, it's weird for computers to no longer be 100% deterministic, but I didn't decide this future for us.
> exact same output for the exact same input?
If you set temp to zero it gets close, but as I understand it, not perfect.
The key insight from production: LLMs excel at the "what should I do next given this unexpected state" decisions, but they're terrible at the mechanical execution. An agent that encounters a CAPTCHA, an OAuth redirect, or an anti-bot challenge needs judgment to adapt. But once it knows what to do, you want deterministic execution.
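One way to sketch that split, with all names hypothetical: the model only picks from a fixed list of actions, and plain code does the execution. `choose_action` stands for the LLM call, and `fake_chooser` replaces it here so the example runs offline.

```python
from typing import Callable


# Deterministic handlers: plain code, same behavior every time.
def wait_for_human(state: dict) -> str:
    return f"paused at {state['url']} until a human clears the challenge"


def follow_redirect(state: dict) -> str:
    return f"following redirect to {state['redirect_to']}"


def abort(state: dict) -> str:
    return f"aborted at {state['url']}"


HANDLERS = {
    "wait_for_human": wait_for_human,
    "follow_redirect": follow_redirect,
    "abort": abort,
}


def step(state: dict, choose_action: Callable[[dict, list[str]], str]) -> str:
    """Let the model decide what to do; let code decide how it is done."""
    action = choose_action(state, sorted(HANDLERS))
    if action not in HANDLERS:
        # Never execute something the model made up.
        raise ValueError(f"model proposed unknown action: {action!r}")
    return HANDLERS[action](state)


def fake_chooser(state: dict, allowed: list[str]) -> str:
    """Placeholder for the LLM judgment call."""
    return "follow_redirect" if "redirect_to" in state else "wait_for_human"


state = {"url": "https://example.com/login",
         "redirect_to": "https://example.com/oauth"}
print(step(state, fake_chooser))  # following redirect to https://example.com/oauth
```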
The evals discussion is critical. We found that unit-test style evals don't capture the real failure modes - agents fail at composition, not individual steps. Testing "does it correctly identify a PR link" misses "does it correctly handle the 47th message in a channel where someone pasted a broken link in a code block". Trajectory-level evals against real edge cases matter more than step-level correctness.
Sometimes people just don't know better.
Ugg. I think this is me. I’m self taught (never once made a compiler in a course or class) and making scripts for ETL at work mostly from CSV input. And JSON/APIs are aggravating to me.
I’ve yet to see the Matrix in JSON data structures (Is it storage? Is it wire protocol?). I can follow _examples_ in documentation, but struggle to put parts together from Swagger or some documentation to get the data view I need. For a while I thought some kind of UML diagramming projects would do it for me—to see the Forest and the trees—but the answer was not there.
So, yes, if I can “vibe” code with ChatIA to get over the mental structural hump to make the right joins and calls, I’m all in.
Yes.
It's just a standardised way to represent data structures in text. You can then save that text to a file for storage, or send the text over the wire for data transfer. As long as everyone involved knows they're saving/loading or talking JSON then everyone knows exactly how to read/write the data.
It is a very literal representation of (specifically JavaScript, but generally any) data-structures in text.
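A tiny round trip with Python's standard `json` module shows the "it's just text" point. The record itself is made up.

```python
import json

# An ordinary in-memory data structure (made-up example record).
record = {"pr": 12, "approved": True, "reviewers": ["alice", "bob"],
          "merged_at": None}

# Serialize: the structure becomes a plain string you can save or send.
text = json.dumps(record)
print(text)
# {"pr": 12, "approved": true, "reviewers": ["alice", "bob"], "merged_at": null}

# Parse: whoever receives the text gets the same structure back.
restored = json.loads(text)
print(restored == record)  # True
```

Whether that string ends up in a file or an HTTP response body is a separate decision. The format is the same either way.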
A JSON Schema file that can be directly linked in your .JSON file!
But otherwise it's the same way you know anything: documentation and trial and error.
Not anymore. Now I can harangue ChatAI to explain it to me, and fill-in gaps in my JS knowledge at the same time.
Edmond•1mo ago
https://youtu.be/zzkSC26fPPE
You get the benefit of AI CodeGen along with the determinism of conventional logic.