The unreasonable effectiveness of fuzzing for porting programs

https://rjp.io/blog/2025-06-17-unreasonable-effectiveness-of-fuzzing

130•Bogdanp•5h ago

Comments

oasisaimlessly•4h ago

Author FYI: The "You can see the session log here." link to [1] is broken.

[1]: https://rjp.io/blog/claude-rust-port-conversation

rjpower9000•25m ago

Fixed, thanks!

nyanpasu64•4h ago

> Most code doesn't express subtle logic paths. If I test if a million inputs are correctly sorted, I've probably implemented the sorter correctly.

I don't know if this was referring to Zopfli's sorter or sorting in general, but I have heard of a subtle sorting bug in Timsort: https://web.archive.org/web/20150316113638/http://envisage-p...

rjpower9000•8m ago

Thanks for sharing, I did not know about that!

Indeed, this is exactly the type of subtle case you'd worry about when porting. Fuzzing would be unlikely to discover a bug that only occurs on giant inputs or needs a special configuration of lists.

In practice I think it works out okay because most of the time the LLM has written correct code, and when it doesn't it's introduced a dumb bug that's quickly fixed.

Of course, if the LLM introduces subtle bugs, that's even harder to deal with...

amw-zero•3h ago

There are 2 main problems in generative testing:

- Input data generation (how do you explore enough of the program's behavior to have confidence that you're test is a good proxy for total correctness)

- Correctness statements (how do you express whether or not the program is correct for an arbitrary input)

When you are porting a program, you have a built in correctness statement: The port should behave exactly as the source program does. This greatly simplifies the testing process.

bluGill•2h ago

Several times I've been involved in porting code. Eventually we reach a time where we are getting a lot of bug reports "didn't work, didn't work with the old system as well" which is to say we ported correctly, but the old system wasn't right either and we just hadn't tested it in that situation until the new system had the budget for exhaustive testing. (normally it worked at one point on the old system and got broke in some other update)

lhmiles•3h ago

Are you the author? You can speed things up and get better results sometimes by retrying the initial generation step many times in parallel, instead of the interactive rewrite thing.

rjpower9000•2m ago

I'm the author. That's a great idea. I didn't explore that for this session but it's worth trying.

I didn't measure consistently, but I would guess 60-70% of the symbols ported easily, with either one-shot or trivial edits, 20% Gemini managed to get there but ended up using most of its attempts, and 10% it just struggled with.

The 20% would be good candidates for multiple generations & certainly consumed more than 20% of the porting time.

rcthompson•3h ago

The author notes that the resulting Rust port is not very "rusty", but I wonder if this could also be solved through further application of the same principle. Something like telling the AI to minimize the use of unsafe etc., while enforcing that the result should compile and produce identical outputs to the original.

DrNosferatu•2h ago

It will be inevitable that this generalizes.

DrNosferatu•2h ago

Why not use the same approach to port the full set of Matlab libraries to Octave?

(or a open source language of your choice)

Matlab manuals are public: it would be clean room reverse engineering.

(and many times, the appropriate bibliography of the underlying definitions of what is being implemented is listed on the manual page)

e28eta•2h ago

> LLMs open up the door to performing radical updates that we'd never really consider in the past. We can port our libraries from one language to another. We can change our APIs to fix issues, and give downstream users an LLM prompt to migrate over to the new version automatically, instead of rewriting their code themselves. We can make massive internal refactorings. These are types of tasks that in the past, rightly, a senior engineer would reject in a project until its the last possibly option. Breaking customers almost never pays off, and its hard to justify refactoring on a "maintenance mode" project.

> But if it’s more about finding the right prompt and letting an LLM do the work, maybe that changes our decision process.

I don’t see much difference between documenting any breaking changes in sufficient detail for your library consumers to understand them vs “writing an LLM prompt for migrating automatically”, but if that’s what it takes for maintainers to communicate the changes, okay!

Just as long as it doesn’t become “use this LLM which we’ve already trained on the changes to the library, and you just need to feed us your codebase and we’ll fix it. PS: sorry, no documentation.”

marxism•40m ago

There's a huge difference between documentation and prompts. Let me give you a concrete example.

I get requests to "make your research code available on Hugging Face for inference" with a link to their integration guide. That guide is 80% marketing copy about Git-based repositories, collaboration features, and TensorBoard integration. The actual implementation details are mixed in through out.

A prompt would be much more compact.

The difference: I can read a prompt in 30 seconds and decide "yes, this is reasonable" or "no, I don't want this change." With documentation, I have to reverse-engineer the narrow bucket which applies to my specific scenario from a one size drowns all ocean.

The person making the request has the clearest picture of what they want to happen. They're closest to the problem and most likely to understand the nuances. They should pack that knowledge densely instead of making me extract it from documentation links and back and forth.

Documentation says "here's everything now possible, you can do it all!" A prompt says "here's the specific facts you need."

Prompts are a shared social convention now. We all have a rough feel for what information you need to provide - you have to be matter-of-fact, specific, can't be vague. When I ask someone to "write me a prompt," that puts them in a completely different mindset than just asking me to "support X".

Everyone has experience writing prompts now. I want to leverage that experience to get cooperative dividends. It's division of labor - you write the initial draft, I edit it with special knowledge about my codebase, then apply it. Now we're sharing the work instead of dumping it entirely on the maintainer.

[1] https://peoplesgrocers.com/en/writing/write-prompts-not-guid...

rjpower9000•13m ago

I was pretty hand-wavy when I made the original comment. I was thinking implicitly to things like the Python sub-interpreter proposal, which had strong pushback from the Numpy engineers at the time (I don't know the current status, whether it's a good idea, etc, just something that came to mind).

https://lwn.net/Articles/820424/

The objections are of course reasonable, but I kept thinking this shouldn't be as big a problem in the future. A lot of times we want to make some changes that aren't _quite_ mechanical, and if they hit a large part of the code base, it's hard to justify. But if we're able to defer these types of cleanups to LLMs, it seems like this could change.

I don't want a world with no API stability of course, and you still have to design for compatibility windows, but it seems like we should be able to do better in the future. (More so in mono-repos, where you can hit everything at once).

Exactly as you write, the idea with prompts is that they're directly actionable. If I want to make a change to API X, I can test the prompt against some projects to validate agents handle it well, even doing direct prompt optimization, and then sharing it with end users.

gaogao•2h ago

Domains where fuzzing is useful are generally good candidates for formal verification, which I'm pretty bullish about in concert with LLMs. This is in part because you can just formal verify by exhaustiveness for many problems, but the enhancement is being able to prove that you don't need to test certain combinations through inductive reasoning and such.

rjpower9000•5m ago

That's an interesting idea. I hadn't thought about it, but it would be interesting to consider doing something similar for the porting task. I don't know enough about the space, could you have an LLM write a formal spec for a C function and the validate the translated function has the same properties?

I guess I worry it would be hard to separate out the "noise", e.g. the C code touches some memory on each call so now the Rust version has to as well.

zie1ony•2h ago

I find it amazing, that the same ideas pop up in the same period of time. For example, I work on tests generation and I went the same path. I tried to find bugs by prompting "Find bugs in this code and implement tests to show it.", but this didn't get me far. Then I switched to property (invariant) testing, like you, but in my case I ask AI: "Based on the whole codebase, make the property tests." and then I fuzz some random actions on the state-full objects and run prop tests over and over again.

At first I also wanted to automate everything, but over time I realized that best is: 10% human to 90% AI of work.

Another idea I'm exploring is AI + Mutation Tests (https://en.wikipedia.org/wiki/Mutation_testing). It should help AI with generation of full coverage.

wahnfrieden•1h ago

An under-explored approach is to collect data on human usage of the app (from production and from internal testers) and feed that to your generative inputs

LAC-Tech•23m ago

I'd have much more confidence in an AI codebase where the human has chosen the property tests, than a human codebase where the AI has chosen the property tests.

Tests are executable specs. That is the last thing you should offload to an LLM.

koakuma-chan•1m ago

How about an LRM?

punnerud•1h ago

Reading that TensorFlow is not used much anymore (besides Google) felt good to read. Had to check Google Trends: https://trends.google.com/trends/explore?date=all&q=%2Fg%2F1...

I started using TensorFlow years ago and switched to PyTorch. Hope ML will make switches like TensorFlow to PyTorch faster and easier, and not just the biggest companies eating the open source community. Like it have been for years.

Ask HN: Has Google Weather forecasting changed?

Address bar shows hp.com. Browser displays scammers' malicious text anyway

Amazon Orders Employees to Relocate to Seattle and Other Hubs

GEO Is the new SEO

Pushing the Envelope: The Effects of Salary Negotiation

A deep-dive explainer on Ink and Switch's BeeKEM protocol

NIST and Partners Use Quantum Mechanics to Make a Factory for Random Numbers

Become More Social as an Engineer – By Gregor Ojstersek

Update to GitHub Copilot consumptive billing experience

Remote MCP Support in Claude Code

Accelerating Collaboration with AI, chat => personal knowledge => train

Core Components of a Profitable AI Billing System

The Missing 11th of the Month

Microsoft planning thousands more job cuts aimed at salespeople

The Python Language Summit 2025: State of Free-Threaded Python

A Python dict that can report which keys you did not use

Understanding and managing requests in Copilot

The Impossible Man: Roger Penrose and the Cost of Genius

Preview app adds Dark Mode toggle for PDFs on macOS Tahoe, iOS and iPadOS 26

Use this free tool if you are not able to be productive

Lego improves maths and spatial ability in the classroom

Claude Code can use AST-grep to improve search efficiency and accuracy

Show HN: universal application where LLM does all computation directly

OpenDeepWiki – the open-source multi-repo AI chat Copilot wishes it were

Addictive Screen Use Trajectories and Suicidal Behaviors in US Youths

Chord: Multiplayer LLM Chats

Ancestra says a lot about the current state of AI-generated videos

Geopolymer-based soil solidifiers using waste siding and glass powders

AI Labs Are Starting to Look Like Sports Teams

Saudi journalist who tweeted against the government executed for 'high treason'