I've had some moments recently on my own projects where, while working through some bottlenecks, I took a whole section of a project, told Claude "rewrite in Rust", and got massive speedups from a zero-shot rewrite, most recently some video recovery programs. But I then had an output product I wouldn't feel comfortable vouching for outside of my homelab setup.
The fear that led to the black-and-white thinking expressed in your comment is the real issue.
It's surprising how much even Opus 4.5 still trips itself up on things like off-by-one errors or logic boundaries, so another model (preferably with a fresh session) can be a very effective peer reviewer.
So my checks are typically lint -> test -> other model -> me, and relatively few things get to me for simple code. For contrived logic or maths, though, it needs to be all me.
Yep, a constantly updated spec is the key. Wrote about this here:
https://lukebechtel.com/blog/vibe-speccing
I've also found it's helpful to have it keep an "experiment log" at the bottom of the original spec, or in another document, which it must update whenever things take "a surprising turn"
Some things I've been doing:
- Move as much actual data into YML as possible.
- Use CEL? (a minimal sketch of these first two ideas follows the list)
- Ask Claude to rewrite pseudocode in specs into RFC-style constrained language?
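For the first two items, here's a minimal sketch of the idea, assuming PyYAML; the rule is only shown as a CEL-style expression string (evaluating it would need a CEL engine such as cel-python), and every field name here is made up:

    # Spec data lives in YAML; the agent edits data, not prose.
    import yaml  # PyPI package: PyYAML

    SPEC_YAML = """
    retry_policy:
      max_attempts: 5
      backoff_seconds: 2
    limits:
      max_upload_mb: 100
    rules:
      # A CEL-style rule kept as data rather than buried in a paragraph.
      - name: upload_within_limit
        expr: "request.size_mb <= limits.max_upload_mb"
    """

    spec = yaml.safe_load(SPEC_YAML)
    print(spec["retry_policy"]["max_attempts"])  # 5
    print(spec["rules"][0]["expr"])              # the constraint the agent must honor

The point being that the LLM, your tests, and your tools can all read the same structured values instead of squinting at prose.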
How do you sync your spec and code in both directions? I have some slash commands that do this, but I'm not thrilled with them.
I tend to have to use Gemini for actually juggling the whole spec. Of course it's nice and chunked as much as it can be, but still. There's gonna need to be a whole new way of doing this.
If programming languages can have spooky action at a distance, wait until we get into "but paragraph 7, subsection 5 of section G clearly defines asshole as..."
What does a structured language look like when it doesn't need mechanical sympathy? YML + CEL is really powerful and underexplored but it's still just ... not what I'm actually wanting.
Sharding: Make well-named sub-documents for parts of the work. The LLM will be happy to create these and maintain cross-references for you.
Compaction: Ask the LLM to compact parts of the spec, or changelog, which are over-specified or redundant.
"Make sub-documents with cross-references" is just... recreating the problem of programming languages but worse. Now we have implicit dependencies between prose documents with no tooling to track them, no way to know if a change in document A invalidates assumptions in document B, no refactoring support, no tests for the spec.
To make things specific:
At some level you have to do semantic compression... To your point on non-explicitness: the dependencies between the specs and sub-specs can be explicit (e.g. file:// links).
But your overall point on assumption invalidation remains... Reminds me of a startup a while back that was doing "Automated UX Testing": user personas (e.g. prosumer, average Joe) were created, and goals / implicit UX flows through the UI were described (e.g. "I want to see my dashboard"). Then an LLM could pretend to be each persona and test, each day, whether that user type could still achieve the goals behind their user flow.
This doesn't fully solve your problem, but it hints at a solution perhaps.
Some of what you're looking for is found by adding strict linter / tests. But your repo looks like something in an entirely different paradigm and I'm curious to dig into it more.
1. The post was written before this was common :)
2. If using Cursor (as I usually am), this isn't what it always does by default, though you can invoke something like it using "plan" mode. Its default is to keep todo items in a nice little todo list, but that isn't the same thing as a spec.
3. I've found that Claude Code doesn't always do this, for reasons unknown to me.
4. The prompt is completely fungible! It's really just an example of the idea.
I saw in your other comment that you've made accommodations for the newer generation, and I will confess that in Cursor (with plan mode) I've found an abbreviated form works just as well as the extremely explicit example found in the post.
If you ever had a followup, I imagine it'd be just as well received!
My experience has been that it gets worse with more structure. You misinform it and heavily bias its results in ways you don't intend. Maybe there are AI wizards out there with the perfect system of markdown artifacts, but I found it increased the trouble a lot and made the results worse. It's a non-deterministic system. Knock yourself out trying to micromanage it.
Did you run any benchmarking? I'm curious whether the Python stack is faster or slower than a pure-C vibe-coded inference tool.
People can say what they want about LLMs reducing intelligence/ability; the trend has clearly been that people are beginning to get more organized, document things better, enforce constraints, and think in higher-level patterns. And there's renewed interest in formal verification.
LLMs will force the skilled, employable engineer to chase both maintainability and productivity from the start, in order to maintain a competitive edge with these tools. At least until robots replace us completely.
[0] https://www.atlassian.com/work-management/knowledge-sharing/...
There's going to be a bifurcation; caricaturing it, "operating system kernels" and "disposable code". In the latter case, you don't maintain it; you dispose of it and vibe-code up a new one.
This is amazing, Salvatore! Please spend some more time here and free us from the CUDA toolkit and Python.
The difference is massive because the source material is covered by copyright. So even if the product can't be copyrighted, there is a fair chance that you'll get your ass sued by whoever is able to trace some critical part of that product back to their own work, of which yours is now a derivative work.
Or are you getting at the idea that the works the AI was originally trained on could still be considered an original work the generated code was derived from? Like, if the generated code happens to look like someone's code on GitHub, they could sue? I'm not 100% sure on sources here, but I thought this was already tested in court and ruled not to be infringement.
One suggestion, which I have been trying to do myself, is to include a PROMPTS.md file. Since your purpose is sharing and educating, it helps others see what approaches an experienced developer is using, even if you are just figuring it out.
One can use a Claude hook to maintain this deterministically. I instruct in AGENTS.md that they can read but not write it. It’s also been helpful for jumping between LLMs, to give them some background on what you’ve been doing.
The PROMPTS.md is communication metadata. Indeed, if you fed the same series of prompts freshly, the resultant ground truths might not make sense because of the stochastic nature of LLMs.
Maybe "ground truth" isn't exactly the right word, but it is the consistent, settled basis that formed from past work and will evolve with future work.
But is this "stochastic nature" inherent to the LLM? Can't you make the outputs deterministic by specifying a version of the weights and a seed for the random number generator?
Your vibe coding log (i.e. your source code) may start like this:
fix weights as of 18-1-2026
set rng seed to 42
write a program that prints hello world
Notice that the first two lines may be added automatically by the system, and you don't need to write or even see them. For what you're saying to work, the LLM must adhere consistently to that initial prompt. Different LLMs, and the same LLM on different runs, might have different adherence, and how does it evolve from there? Meaning, at playback of prompt #33, will the ground truth be the same, and the next result the same as in the first attempt?
If this is a local LLM and we control all the context, then we can control that LLM's seeds and thus get consistent output. So I think your idea would work well there.
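As a rough sketch of what "control the seeds" can look like with a local model (the model name is just an example I picked, and, as noted elsewhere in the thread, accelerator kernels can still introduce nondeterminism):

    # Greedy decoding with a pinned checkpoint and a fixed RNG seed, using
    # Hugging Face transformers: the "fix weights / set rng seed" part of the log.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    torch.manual_seed(42)                               # "set rng seed to 42"
    name = "Qwen/Qwen2.5-0.5B-Instruct"                 # stand-in local model
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)  # pinned snapshot = "fix weights"
    inputs = tok("write a program that prints hello world", return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)  # greedy decoding
    print(tok.decode(out[0], skip_special_tokens=True))

With do_sample=False the seed barely matters; it starts to matter again once you turn sampling back on.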
I've not started keeping thinking traces, as I'm mostly interested in how humans are using this tech. But, that could get involved in this as well, helping other LLMs understand what happened with a project up to a state.
At any kind of reasonable scale, yes. CUDA accelerators, like most distributed systems, are nondeterministic, even at zero temperature (which you don't want) with fixed seed.
I only say this as it seems one of your motivations is education. I'm also noting it for others to consider. Much appreciation either way, thanks for sharing what you did.
If the spec and/or tests are sufficiently detailed maybe you can step back and let it churn until it satisfies the spec.
That said, I'm mixed on agentic performance for data science work, but it does a good job if you clearly give it the information it needs to solve the problem (e.g. for SQL, the table schema and example data).
Plus they continue to introduce performance blunders.
Crying wolf: one day maybe there will be a wolf, and I may be the last of us to check whether that's true.
The creator of llama.cpp can hardly be suspected of being reluctant or biased towards GenAI.
I wanted to see if Claude Code could port the HF / MLX implementation to llama.cpp and it was successful -- in my mind that's wild!
I also learned a ton about GPU programming, how omni models work, and refined my approach to planning large projects with automated end to end integration tests.
The PR was mostly to let people know about the code and weights, since there are quite a few comments requesting support:
Nice work getting multimodal in there already.
What you're saying here is that you do not appreciate systems not using the Python stack, which I think is the opposite of what you wanted to say.
-I believe that <inference systems not using the Python stack> (which I do not appreciate) are a way to free open models usage and make AI more accessible.
This reading of the text would lead one to believe they don't appreciate inference systems not written in Python. Given that the inference system produced by the author is also not using the Python stack (it is in C), we can assume this is not the correct reading.
-I believe that inference systems not using the <Python stack> (which I do not appreciate) are a way to free open models usage and make AI more accessible.
This reading says that the author does not like the Python stack for inference, which, given that the author has produced this inference engine in C, would support the statement.
That is, we have to take both readings and think about which one fits the surrounding context. Hopefully this helps :)
FLUX.2 [Klein]: Towards Interactive Visual Intelligence
I don't think it counts as recreating a project "from scratch" if the model that you're using was trained against it. Claude Opus 4.5 is aware of the stable-diffusion.cpp project and can answer some questions about it and its code-base (with mixed accuracy) with web search turned off.
The main advantage is that you don't need the python interpreter to run the program.
While not revolutionary, it is definitely not trivial, and its main purpose is to demonstrate Claude Code's abilities on a low-level, non-trivial task.
Running inference for a model, even when you have all the weights, is not trivial.
EDIT: https://github.com/bodaay/HuggingFaceModelDownloader seems to be making progress.
https://github.com/antirez/gguf-tools
And I have a YouTube channel, mostly about AI (in Italian), where I regularly post videos and often read papers that I then explain on the channel. I have a long-time passion for AI: I wrote my first NN implementation in 2003 (used here, many years ago, as a showcase of Redis modules: https://github.com/antirez/neural-redis), and since then I never stopped implementing, for fun, small GPT models and things like that, using PyTorch or C.
Also, my work on Redis Vector Sets in the last year exposed me more to working with models (especially text embedding models of many kinds, but also other models).
So while Claude was fundamental to getting the implementation done fast, I had the background to have an idea of what was happening at the different stages. I believe it is a very interesting question whether this kind of work can be done with a programming background and near-zero AI background. My feeling is that you need more time and more back and forth, maybe to provide the agent with more examples, but eventually it will produce something working.
[1] https://bfl.ai/blog/flux2-klein-towards-interactive-visual-i...
I'm the maintainer of MFLUX (https://github.com/filipstrand/mflux) which does a similar thing, but at a higher level using the MLX framework optimised for Apple Silicon. I just merged Flux 2 Klein support as well and was happy to see this discussion :)
I started out doing this type of work roughly 1.5 years ago when FLUX.1 was released and have been doing it off and on ever since with newer models, trying to use more and more AI over time.
At one point, I vibe-coded a debugger to help the coding agent along. It worked OK but as models have gotten stronger, this doesn't really seem to be needed anymore. My latest version simply has a SKILL.md that outlines my overall porting strategy (https://github.com/filipstrand/mflux/blob/main/.cursor/skill...). Somewhat surprisingly, this actually works now with Cursor + Codex-5.2, with little human intervention.
> Even if the code was generated using AI, my help in steering towards the right design, implementation choices, and correctness has been vital during the development.
This definitely resonates! Curious to hear more about what worked/didn't for you. A couple of things I've found useful:
- Porting the pipeline backwards: This is the way I did it personally before using any coding models. The typical image generation flow is the following:
1. Text_encodings (+ random_noise_latent)  2. Transformer loop  3. VAE decoding
I found that starting with the VAE first (by feeding it pre-loaded tensors from the reference, extracted at specific locations) was the quickest way to get to an actual generated image. Once the VAE is done and verified, only then proceed backwards up the chain and handle the Transformer, etc. I still prefer to do it this way, and I like to manually intervene between steps 3, 2 and 1, but maybe this won't actually be needed soon?
- Also, with the VAE, if you care about implementing the encoding functionality (e.g. to be used with img2img features), the round-trip test is a very quick way to verify correctness (see the sketch after this list):
image_in -> encode -> decode -> image_out : compare(image_in, image_out)
- Investing in a good foundation for weight handling, especially when doing repeat work across similar models. Earlier coding models would easily get confused about weight assignment, naming conventions etc. A lot of time could be wasted because weight assignment failed (sometimes silently) early on.
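For the round-trip check a couple of bullets up, a minimal sketch; `vae`, `my_vae`, `test_image` and the threshold are all placeholders, not anything from MFLUX:

    # Round-trip sanity check: encode an image, decode it back, compare.
    import numpy as np

    def roundtrip_error(vae, image_in: np.ndarray) -> float:
        latent = vae.encode(image_in)       # image -> latent
        image_out = vae.decode(latent)      # latent -> reconstructed image
        return float(np.mean(np.abs(image_in - image_out)))  # mean absolute error

    # e.g. assert roundtrip_error(my_vae, test_image) < 0.05  # tolerance is model-dependent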
They're actually very good at speed optimization and can iterate very quickly, taking notes on trials, failures, and benchmarks. I've had it write 10 different attempts in around an hour, benchmark them all, then merge them and beat very strong baselines in torch.
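The harness for that loop can be tiny, roughly like this (the names are made up; the agent fills in candidate implementations and keeps notes on the numbers):

    # Time each candidate on the same input; best-of-N after a warm-up run.
    import time

    def bench(fn, args, repeats: int = 5) -> float:
        fn(*args)                                  # warm-up, excluded from timing
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            fn(*args)
            best = min(best, time.perf_counter() - t0)
        return best

    # candidates = {"baseline": impl_v1, "attempt_07": impl_v7}
    # for name, fn in candidates.items():
    #     print(f"{name}: {bench(fn, (data,)):.4f}s")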
1. Now it is much faster, and the Python benchmarks were redone correctly (the benchmark didn't account for model loading and did a warm-up before starting the actual inference, while the C code was tested in exactly the reverse way).
2. Now there is --mmap support to run on Linux with blas target with 16GB of RAM. Inference is viable on my old-ish Dell Latitude i5.
3. Seed now part of the PNG metadata.
4. Many other improvements, check the README.
I expected this C implementation to be notably faster, but my M3 Max (36GB) could barely make it past the first denoising step before OOMing (at 512x512)
Am I doing something wrong? The MLX implementation takes ~1/sec per step with the same model and dimensions: https://x.com/scottinallcaps/status/2013187218718753032
What's wrong with the Python stack? I have never much used it or any other ML stack so I'm genuinely curious.
You had me at image generation!
Pure C with zero external dependencies is just an extra added bonus...
No, actually, pure C with zero external dependencies is quite awesome in its own right!
Well done!
It's almost as if this is the first time many have seen something built in C with zero dependencies, which makes this easily possible.
They're used to languages where the package manager adds 30 packages, pulling in 50-100+ other dependencies, before the project is even able to build.
https://xkcd.com/1425/