frontpage.

Agentic pelican on a bicycle

https://www.robert-glaser.de/agentic-pelican-on-a-bicycle/
29•todsacerdoti•4h ago

Comments

davesque•2h ago
It feels like it's hard to take much from this without running the trial many times for each model. Then it would be possible to see whether there are consistent themes in each model's solutions. Otherwise, the specific style of each result could be somewhat random. I didn't see any mention of running multiple trials per model.
williamstein•2h ago
> Some models (looking at you, GPT-5-Codex) seemed to mistake “more complex” for “better.”

That's what working with GPT-5-Codex on actual code also feels like.

lubujackson•2h ago
What I take from this is that LLMs are somewhat miraculous in generation but terrible at revision. Especially with images, they are very resistant to adjusting initial approaches.

I wonder if there is a consistent way to force structural revisions. I have found Nano Banana particularly terrible at revisions; even for something like "change the image dimensions to...", it will confidently claim success but do nothing.

halflife•1h ago
I’m not so sure. Adversarial networks work pretty well at image generation.

I think the problem here is that SVG is structured information while an image is an unstructured blob, and translating between them requires planning and understanding. Maybe treating an SVG like a raster image in the prompt is the wrong approach; prompting for the image as code (which SVG basically is) would likely result in better outputs.

This is just my uninformed opinion.
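The SVG-as-code framing above can be made concrete: an SVG is a tree of named elements, so a "revision" can be a targeted edit to one node rather than a regeneration of the whole picture. A minimal sketch using Python's standard library (the element ids and the scene itself are just illustrative):

```python
import xml.etree.ElementTree as ET

# A tiny scene expressed as structured markup rather than pixels.
svg = """<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">
  <circle id="wheel" cx="30" cy="80" r="15" fill="black"/>
  <ellipse id="body" cx="50" cy="50" rx="20" ry="12" fill="white"/>
</svg>"""

tree = ET.fromstring(svg)
ns = "{http://www.w3.org/2000/svg}"

# A structural revision is just an attribute edit on a named node,
# which is far more targeted than re-rendering an unstructured blob.
wheel = tree.find(f"{ns}circle[@id='wheel']")
wheel.set("r", "18")  # enlarge the wheel

print(wheel.get("r"))  # prints "18"
```

The point is that the edit is addressable: "make the wheel bigger" maps onto one attribute of one element, which is the kind of operation a code-oriented prompt can ask for directly.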

pinko•1h ago
I see this all the time when asking Claude or ChatGPT to produce a single-page, two-column PDF summarizing the conclusions of our chat. Literally 99% of the time I get a multi-page, unpredictably formatted mess, even after gently asking, over and over, for specific fixes to the formatting mistakes.

And as you say, they cheerfully assert that they've done the job, for real this time, every time.

tadfisher•43m ago
Tools are still evolving out of the VLM/LLM split [0]. The reason image-to-image tasks are so variable in quality, and vastly inferior to text-to-image tasks, is that an entirely separate model is trained to transform an input image into tokens in the LLM's vector space.

The naive approach that gets you results like ChatGPT is to produce output tokens based on the prompt and generate a new image from the output. It is really difficult to maintain details from the input image with this approach.

A more advanced approach is to generate a stream of "edits" to the input image instead. You see this with Gemini, which sometimes maintains original image details to a fault; e.g. it will preserve human faces at all cost, probably as a result of training.

I think the round-trip through SVG is an extreme challenge to train through and essentially forces the LLM to progressively edit the SVG source, which can result in something like the Gemini approach above.

[0]: https://www.groundlight.ai/blog/how-vlm-works-tokens
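The contrast between the two strategies described above can be sketched with toy stand-ins. Here the "image" is just a dict of named details, and the two functions are hypothetical placeholders for the real pipelines, not any actual API:

```python
# Toy contrast: regenerate-from-description vs. apply-an-edit-stream.
original = {"face": "Alice", "background": "beach", "shirt": "red"}

def regenerate(image: dict, prompt: dict) -> dict:
    # Naive approach: tokenize the input into a lossy description, then
    # synthesize a brand-new image from that description plus the prompt.
    # Details the description misses (here, the face) are not preserved.
    lossy = {k: v for k, v in image.items() if k != "face"}
    return {**lossy, **prompt, "face": "someone"}

def apply_edits(image: dict, edits: dict) -> dict:
    # Edit-stream approach: start from the input image and change only
    # what the edits mention, carrying everything else over verbatim.
    return {**image, **edits}

prompt = {"shirt": "blue"}
print(regenerate(original, prompt)["face"])   # prints "someone" (detail lost)
print(apply_edits(original, prompt)["face"])  # prints "Alice" (detail kept)
```

The second approach also shows why an edit-stream model can preserve details "to a fault": anything the edits don't mention is copied through unchanged.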

Retr0id•32m ago
I almost always get better results from LLMs by going back and editing my prompt and starting again, rather than trying to correct/guide it interactively. Almost as if having mistakes in your context window is an instruction to generate more mistakes! (I'm sure it's not quite that simple)
sorenjan•1h ago
It would be interesting to see if they would get better results if they didn't grade their own work. Feed the output to a different model and let that suggest improvements, almost like a GAN.
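The draft/critique loop suggested above can be sketched in a few lines. The two "models" here are trivial stand-in functions (hypothetical, not a real API); the structure is what matters: one model generates, a different one grades, and the draft is revised until the critic has no complaints:

```python
# Cross-model critique loop: generator drafts, a separate critic grades.
def generator(prompt: str, feedback: str = "") -> str:
    draft = f"drawing of {prompt}"
    if feedback:
        draft += f" (revised: {feedback})"
    return draft

def critic(draft: str) -> str:
    # Returns a suggestion, or an empty string when the draft passes.
    return "add wheels" if "wheels" not in draft else ""

def refine(prompt: str, max_rounds: int = 3) -> str:
    draft = generator(prompt)
    for _ in range(max_rounds):
        feedback = critic(draft)
        if not feedback:
            break
        draft = generator(prompt, feedback)
    return draft

print(refine("a pelican on a bicycle"))
# prints "drawing of a pelican on a bicycle (revised: add wheels)"
```

Unlike a GAN there is no joint training here; the hope is simply that a different model grading the output avoids the self-grading blind spots the parent comment points at.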
ripped_britches•1h ago
I have tried agentic Figma in this way, but with the same result: attempt 1 gets frozen in and no forward progress can be made.
andy99•48m ago

  This wasn’t just “add more details”—it was “make this mechanically coherent.”
The overall text doesn’t appear to be AI written, making this all the more confusing. Is AI making people write this way now on their own? Or is it actually written by an LLM and just doesn’t look like it?
HPMOR•43m ago
Something about the cadence, structure, and staccato nature of the bottom paragraphs also felt very LLMed.
Retr0id•36m ago
I assume this was written by a human and then "improved" by an LLM.
nl•14m ago
I write like that and I'm not an LLM.
simonw•35m ago
I tried an experiment like this a while back (for the GPT-5 launch) and was surprised at how ineffective it was.

This is a better version of what I tried but suffers from the same problem - the models seem to stick close to their original shapes and add new details rather than creating an image from scratch that's a significantly better variant of what they tried originally.

andy99•19m ago
I feel like I’ve seen this with code too, where it’s unlikely to scrap something and try a new approach, and more likely to double down, iterating on a bad approach.

For the SVG generation, it would be an interesting experiment to seed it with increasingly poor initial images and see at what point, if any, the models stop anchoring on the initial image and just try something else.

simonw•15m ago
Yeah, for code I'll often start an entirely new chat and paste in just the bits I liked from the previous attempt.
smusamashah•31m ago
A single run (irrespective of number of iterations) on any model is not a good data point.

If the first output is crappy, the next 3 iterations will improve the same crap.

This was not a good test.

measurablefunc•28m ago
Iterating a Markov chain does not make it any more or less "agentic". This is yet another instance of corporate marketing departments redefining words because they are confused about what exactly they're trying to build and sell.

X5.1 solar flare, G4 geomagnetic storm watch

https://www.spaceweatherlive.com/en/news/view/593/20251111-x5-1-solar-flare-g4-geomagnetic-storm-...
94•sva_•2h ago•26 comments

I didn't reverse-engineer the protocol for my blood pressure monitor in 24 hours

https://james.belchamber.com/articles/blood-pressure-monitor-reverse-engineering/
62•jamesbelchamber•2h ago•27 comments

Laptops adorned with creative stickers

https://stickertop.art/main/
94•z303•1w ago•78 comments

Four strange places to see London's Roman Wall

https://diamondgeezer.blogspot.com/2025/11/odd-places-to-see-londons-roman-wall.html
29•zeristor•1h ago•7 comments

A modern 35mm film scanner for home

https://www.soke.engineering/
103•QiuChuck•3h ago•81 comments

.NET MAUI Is Coming to Linux and the Browser, Powered by Avalonia

https://avaloniaui.net/blog/net-maui-is-coming-to-linux-and-the-browser-powered-by-avalonia
11•vyrotek•52m ago•3 comments

The terminal of the future

https://jyn.dev/the-terminal-of-the-future
78•miguelraz•3h ago•34 comments

A catalog of side effects

https://bernsteinbear.com/blog/compiler-effects/
63•speckx•3h ago•5 comments

Collaboration sucks

https://newsletter.posthog.com/p/collaboration-sucks
246•Kinrany•3h ago•146 comments

The history of Casio watches

https://www.casio.com/us/watches/50th/Heritage/1970s/
131•qainsights•3d ago•77 comments

Meticulous (YC S21) is hiring to redefine software dev

https://jobs.ashbyhq.com/meticulous/3197ae3d-bb26-4750-9ed7-b830f640515e
1•Gabriel_h•2h ago

Scaling HNSWs

https://antirez.com/news/156
134•cyndunlop•9h ago•28 comments

Pikaday: A friendly guide to front-end date pickers

https://pikaday.dbushell.com
109•mnemonet•8h ago•53 comments

Terminal Latency on Windows (2024)

https://chadaustin.me/2024/02/windows-terminal-latency/
79•bariumbitmap•5h ago•66 comments

My fan worked fine, so I gave it WiFi

https://ellis.codes/blog/my-fan-worked-fine-so-i-gave-it-wi-fi/
97•woolywonder•5d ago•38 comments

Adk-go: code-first Go toolkit for building, evaluating, and deploying AI agents

https://github.com/google/adk-go
33•maxloh•3h ago•7 comments

FFmpeg to Google: Fund us or stop sending bugs

https://thenewstack.io/ffmpeg-to-google-fund-us-or-stop-sending-bugs/
392•CrankyBear•5h ago•308 comments

We ran over 600 image generations to compare AI image models

https://latenitesoft.com/blog/evaluating-frontier-ai-image-generation-models/
92•kalleboo•6h ago•57 comments

Xortran - A PDP-11 Neural Network With Backpropagation in Fortran IV

https://github.com/dbrll/Xortran
23•rahen•3h ago•4 comments

AV1 vs. H.264: What Video Codec to Choose for Your App?

https://www.red5.net/blog/av1-vs-h264/
5•mondainx•1w ago•1 comment

Cache-friendly, low-memory Lanczos algorithm in Rust

https://lukefleed.xyz/posts/cache-friendly-low-memory-lanczos/
99•lukefleed•6h ago•17 comments


iPhone Pocket

https://www.apple.com/newsroom/2025/11/introducing-iphone-pocket-a-beautiful-way-to-wear-and-carr...
403•soheilpro•13h ago•1064 comments

How I fell in love with Erlang

https://boragonul.com/post/falling-in-love-with-erlang
358•asabil•1w ago•210 comments

The R47: A new physical RPN calculator

https://www.swissmicros.com/product/model-r47
162•dm319•4d ago•88 comments

Étude in C minor (2020)

https://zserge.com/posts/etude-in-c/
49•etrvic•1w ago•9 comments

The Department of War just shot the accountants and opted for speed

https://steveblank.com/2025/11/11/the-department-of-war-just-shot-the-accountants-and-opted-for-s...
62•ridruejo•9h ago•123 comments

Show HN: Cactoide – Federated RSVP Platform

https://cactoide.org/
49•orbanlevi•6h ago•21 comments

Array-programming the Mandelbrot set

https://jcmorrow.com/mandelbrot/
37•jcmorrow•4d ago•7 comments

Vertical integration is the only thing that matters

https://becca.ooo/blog/vertical-integration/
21•miguelraz•4h ago•15 comments