(Do we say we software engineered something?)
You CREATED something, and I like to think that creating things that I love and enjoy and that others can love and enjoy makes creating things worth it.
No, that simply is not true. If you actually compare the before and after, you can see it still regenerates all the details on the "unchanged" aspects. Texture, lighting, sharpness, even scale: it's all different, even if varyingly similar to the original.
Sure, they're cute for casual edits, but it really pains me when people suggest these things are suitable replacements for actual photo editing. Especially when it comes to people, or details outside their training data, there's a lot of nuance that can be lost as it regenerates them, no matter how you prompt things.
Even if you
I figured that if you write the text in Google Docs and share a screenshot with Nano Banana, it will not make any spelling mistakes.
So something like "can you write my name on this Wimbledon trophy, both images are attached. Use them" will work.
That's on my list of blog-post-worthy things to test, namely text rendering to image in Python directly and passing both input images to the model for compositing.
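In the meantime, a minimal sketch of the first half of that idea, using Pillow to render the text deterministically (the text, font path, and sizes are placeholders):

    from PIL import Image, ImageDraw, ImageFont

    # Render the text in Python so the image model never has to spell it.
    text_img = Image.new("RGB", (1024, 256), "white")
    draw = ImageDraw.Draw(text_img)
    font = ImageFont.truetype("DejaVuSans-Bold.ttf", 96)  # any locally available .ttf
    draw.text((40, 80), "YOUR NAME HERE", font=font, fill="black")
    text_img.save("text_overlay.png")

Both text_overlay.png and the trophy photo would then be passed to the model with a prompt along the lines of "composite the text from the second image onto the trophy in the first image."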
But it is still generating it with a prompt
> Logo: "A simple, modern logo with the letters 'G' and 'A' in a white circle.
My idea was to do it manually so that there are no probabilities involved.
Though your idea of using Python is the same.
It's actually fairly difficult to put to words any specific enough vision such that it becomes understandable outside of your own head. This goes for pretty much anything, too.
For anything, even back in the "classical" search days.
"This got searched verbatim, every time"
W*ldcards were handy
and so on...
Now, you get a 'system prompt' which is a vague promise that no really this bit of text is special you can totally trust us (which inevitably dies, crushed under the weight of an extended context window).
Unfortunately(?), I think this bug/feature has gotta be there. It's the price for the enormous flexibility. Frankly, I'd not be mad if we had less control - my guess is that in not too many years we're going to look back on RLHF and grimace at our draconian methods. Yeah, if you're only trying to build a "get the thing I intend done" machine I guess it's useful, but I think the real power in these models is in their propensity to expose you to new ideas and provide a tireless foil for all the half-baked concepts that would otherwise not get room to grow.
I give it,
Reposition the text bubble to be coming from the middle character.
DO NOT modify the poses or features of the actual characters.
Now sure, specs are hard. Gemini removed the text bubble entirely. Whatever, let's just try again: Place a speech bubble on the image. The "tail" of the bubble should make it appear that the middle (red-headed) girl is talking. The speech bubble should read "Hide the vodka." Use a Comic Sans like font. DO NOT place the bubble on the right.
DO NOT modify the characters in the image.
There's only one red-head in the image; she's the middle character. We get a speech bubble, correctly positioned, but with a sans-serif, Arial-ish font, not Comic Sans. It reads "Hide the vokda" (sic). The facial expression of the middle character has changed.
Yes, specs are hard. Defining a spec is hard. But Gemini struggles to follow the specification given. Whole sessions are like this, an absolute struggle to get basic directions followed.
You can even see here that I & the author have started to learn the SHOUT AT IT rule. I suppose I should try more bulleted lists. Someone might learn, through experimentation "okay, the AI has these hidden idiosyncrasies that I can abuse to get what I want" but … that's not a good thing, that's just an undocumented API with a terrible UX.
(¹because that is what the AI on a previous step generated. No, that's not what was asked for. I am astounded TFA generated an NYT logo for this reason.)
Which is exactly why the current discourse is about 'who does it best' (IMO, the flux series is top dog here. No one else currently strikes the proper balance between following style / composition / text rendering quite as well). That said, even flux is pretty tricky to prompt - it's really, really easy to step on your own toes here - for example, by giving conflicting(ish) prompts "The scene is shot from a high angle. We see the bottom of a passenger jet".
Talking to designers has the same problem. "I want a nice, clean logo of a distressed dog head. It should be sharp with a gritty feel". For the person defining the spec, they actually do have a vision that fits each criteria in some way, but it's unclear which parts apply to what.
Discounting the testing around the character JSON which became extremely expensive due to extreme iteration/my own stupidity, I'd wager it took about $5 total including iteration.
This is a very different fuzzy interface compared to programming languages.
There will be techniques better or worse at interfacing.
This is what the term prompt engineering is alluding to since we don’t have the full suite of language to describe this yet.
That is why I always call technical writers "documentation engineers," why I call diplomats "international engineers," why I call managers "team engineers," and why I call historians "hindsight engineers."
So Prompt Philosopher/Communicator?
Despite needing a lot of knowledge of how a plane's inner workings function, a pilot is still a pilot and not an aircraft engineer.
Just because you know how human psychology works when it comes to making purchase decisions and you are good at applying that to sell things, you're not a sales engineer.
Giving something a fake name, to make it seem more complicated or aspirational than it actually is makes you a bullshit engineer in my opinion.
now you can really use natural language and people want to debate you about how poor they are at articulating shared concepts, amazing
it's like the people are regressing and the AI is improving
>Nano Banana is terrible at style transfer even with prompt engineering shenanigans
My context: I'm kind of fixated on visualizing my neighborhood as it would have appeared in the 18th century. I've been doing it in Sketchup, and then in Twinmotion, but neither of those produce "photorealistic" images... Twinmotion can get pretty close with a lot of work, but that's easier with modern architecture than it is with the more hand-made, brick-by-brick structures I'm modeling out.
As different AI image generators have emerged, I've tried them all in an effort to add the proverbial rough edges to snapshots of the models I've created, and it was not until Nano Banana that I ever saw anything even remotely workable.
Nano Banana manages to maintain the geometry of the scene, while applying new styles to it. Sometimes I do this with my Twinmotion renders, but what's really been cool to see is how well it takes a drawing, or engraving, or watercolor - and with as simple a prompt as "make this into a photo" it generates phenomenal results.
Similarly to the Paladin/Starbucks/Pirate example in the link though, I find that sometimes I need to misdirect a little bit, because if I'm peppering the prompt with details about the 18th century, I sometimes get a painterly image back. Instead, I'll tell it I want it to look like a photograph of a well preserved historic neighborhood, or a scene from a period film set in the 18th century.
As fantastic as the results can be, I'm not abandoning my manual modeling of these buildings and scenes. However, Nano Banana's interpretation of contemporary illustrations has helped me reshape how I think about some of the assumptions I made in my own models.
I added a CLI to it (using Gemini CLI) and submitted a PR, you can run that like so:
GEMINI_API_KEY="..." \
uv run --with https://github.com/minimaxir/gemimg/archive/d6b9d5bbefa1e2ffc3b09086bc0a3ad70ca4ef22.zip \
python -m gemimg "a racoon holding a hand written sign that says I love trash"
Result in this comment: https://github.com/minimaxir/gemimg/pull/7#issuecomment-3529...

I'm exceptionally excited about Chinese editing models. They're getting closer and closer to NanoBanana in terms of robustness, and they're open source. This means you can supply masks and kernels and do advanced image operations, integrate them into visual UIs, etc.
You can even fine tune them and create LoRAs that will do the style transferring tasks that Nano Banana falls flat on.
I don't like how closed the frontier US models are, and I hope the Chinese kick our asses.
That said, I love how easy it'll be to distill Nano Banana into a new model. You can pluck training data right out of it: ((any image, any instruction) -> completion) tuples.
If that's true, it seems worth getting past the 'cumbersome' aspects. This tech may not put Hollywood out of business, but it's clear that the process of filmmaking won't be recognizable in 10 years if amateurs can really do this in their basements today.
Adobe's conference last week points to the future of image gen. Visual tools where you mold images like clay. Hands on.
Comfy appeals to the 0.01% that like toolkits like TouchDesigner, Nannou, and ShaderToy.
https://www.youtube.com/watch?v=YqAAFX1XXY8 - dynamic 3D scene relighting is insane, check out the 3:45 mark.
https://www.youtube.com/watch?v=BLxFn_BFB5c - molding photos like clay in 3D is absolutely wild at the 3:58 mark.
I don't have links to everything. They presented a deluge of really smart editing tools and gave their vision for the future of media creation.
Tangible, moldable, visual, fast, and easy.
For imagegen, agreed. But for textgen, Kimi K2 thinking is by far the best chat model at the moment from my experience so far. Not even "one of the best", the best.
It has frontier level capability and the model was made very tastefully: it's significantly less sycophantic and more willing to disagree in a productive, reasonable way rather than immediately shutting you out. It's also way more funny at shitposting.
I'll keep using Claude a lot for multimodality and artifacts but much of my usage has shifted to K2. Claude's sycophancy in particular is tiresome. I don't use ChatGPT/Gemini because they hide the raw thinking tokens, which is really cringe.
Also, yesterday I asked it a question and after the answer it complained about its poorly written system prompt to me.
They're really torturing their poor models over there.
I agree that a project.scripts would be good but that's a decision for the maintainer to take on separately!
is this just a manual copy/paste into a gist with some html css styling; or do you have a custom tool à la amp-code that does this more easily?
I made a video about building that here: https://simonwillison.net/2025/Oct/23/claude-code-for-web-vi...
It works much better with Claude Code and Codex CLI because they don't mess around with scrolling in the same way as Gemini CLI does.
- make massive, seemingly random edits to images
- adjust image scale
- make very fine grained but pervasive detail changes obvious in an image diff
For instance, I have found that nano-banana will sporadically add a (convincing) fireplace to a room or new garage behind a house. This happens even with explicit "ALL CAPS" instructions not to do so. This happens sporadically, even when the temperature is set to zero, and makes it impossible to build a reliable app.
Has anyone had a better experience?
This looks like it's caused by 99% of the relative directions in image descriptions describing them from the looker's point of view, and by the fact that 99% of the ones that aren't refer to a human and not to a skull-shaped pancake.
To demonstrate this weakness with the same prompts as the article, see the link below, which shows that it is a model weakness and not just a language ambiguity:
For some offline character JSON prompts I ended up adding an additional "any mentions of left and right are from the character's perspective, NOT the camera's perspective" to the prompt, which did seem to improve success.
It’s pretty good, but one conspicuous thing is that most of the blueberries are pointing upwards.
[0] https://www.lux.camera/content/images/size/w1600/2024/09/IMG...
Looks like specific f-stops don't actually make a difference for stable diffusion at least: https://old.reddit.com/r/StableDiffusion/comments/1adgcf3/co...
I didn't expect that. I would have definitely counted that as a "probably real" tally mark if grading an image.
I also created a small editing suite for myself where I can draw bounding boxes on images when they aren’t perfect, and have them fixed. Either just with a prompt or feeding them to Claude as image and then having it write the prompt to fix the issue for me (as a workflow on the api). It’s been quite a lot of fun to figure out what works. I am incredibly impressed by where this is all going.
Once you do have good storyboards. You can easily do start-to-end GenAI video generation (hopping from scene to scene) and bring them to life and build your own small visual animated universes.
Maybe a little mode collapse away from pale ugliness, not quite getting to the hints of unnatural and corpse-like features of a vampire - interesting what the limitations are. You'd probably have to spend quite a lot of time zeroing in, but Google's image models are supposed to have allowed smooth traversal of those feature spaces generally.
I see where you are coming from...
Which is not to say don’t be creative, I applaud all creativity, but also to be very critical of what you are doing.
It's pretty easy to get something decent. It's really hard to get something good. I share my creations with some close friends and some are like "that's hot!" but are too fixated on breasts to realize that the lighting or shadow is off. Other friends do call out the bad lighting.
You may be like "it's just porn, why care about consistent lighting?" and the answer for me is that I'm doing all this to learn how everything works. How to fine tune weights, prompts, using IP Adapter, etc. Once I have a firm understanding of this stuff, then I will probably be able to make stuff that's actually useful to society. Unlike that coke commercial.
But what I understood from parent comment is that they just do it for fun, not necessarily to be a boon to society. And then if it comes with new skills that actually can benefit society, then that's a win.
Granted, the commenter COULD play around with SFW stuff but if they're just doing it for fun then that's still not benefiting society either, so either way it's a wash. We all have fun in our own ways.
But it's impressive that this billion dollar company didn't have one single person say "hey it's shitty, make it better."
AI is shitty in its own new unique ways. And people don't like new. They want they old, polished shittiness they are used to.
It's only a matter of time before we get experienced AI filmmakers. I think we already have them, actually. It's clear that Coke does not employ them though.
Also, since it's new media, nobody knows how to budget time or money to fix the flaws. It could be infinitely expensive.
That's my entire point. Artists were fine with everybody making "art" as long as everybody except them (with their hard fought skill and dedication) achieved toddler level of output quality. As soon as everybody could truly get even close to the level of actual art, not toddler art, suddenly there's a horrible problem with all the amateur artists using the tools that are available to them to make their "toddler" art.
Folks in tech generally have very limited exposure to the art world — fan art communities online, Reddit subs, YouTubers, etc. It’s more representative of internet culture than the art world— no more representative of artists than X politics is representative of voters. People have real grievances here and you are not a victim of the world’s artists. Most artists also don’t care about online art communities or what you think about them. Not even a little bit.
I will be if they manage to slow down development of AI even by a smidgen.
> Most artists also don’t care about online art communities or what you think about them. Not even a little bit.
Fully agree. They care about whether there's going to be anyone willing to buy their stuff from them. And not-toddler art is a real competition for them. So they are super against everybody making it.
You obviously can't un-ring a bell, but finding ways to poison models that try to rip artists off sure is amusing. The real joke is on the people in software who think they're so special that their skills will be worth anything at all, or who believe that this will do anything but transfer wealth out of the paychecks of regular people, straight into the pockets of existing centibillionaires. There are too many developers in the existing market as it is, and so many of them are diligently trying to reduce that demand further across an even larger range of disciplines, especially the in-demand jobs like setting up agents to take people's jobs. Well, play stupid games, win stupid prizes.
You got it the other way around. Many edgelords became AI art boosters. Some became AI art dissers. It's a good topic for edgelords to edge about.
> finding ways to poison models that try to rip artists off sure is amusing
Yes. It is, but for other reasons. It looks like trying to turn a river with a stick. Little bit of water in one spot for a moment goes backwards and that one specific niche of people cheer. You made your song unfindable with Shazam. Good for you! Now I can't find you if I hear your music accidentally, because even if I catch some lyrics they are also heavily copyrighted. Step 2: ?, Step 3: Profit! Let me encounter your output the classic way, by being heavily marketed to for inordinate amounts of advertiser money. The way God intended!
> The real joke is on the people in software that think they’re so special that their skills will be worth anything at all
I fully expect to be completely replaced within a few years. The same way my other skills were replaced by mobile diggers and power tools before I even acquired them.
Some IT people don't believe they will get replaced, and I think they have a fairly strong argument. Software breeds more software. Coding breeds more need for coding. Even if only 5% of coding is done by humans, it still might be more than 100% of what it was a few years ago. Juniors are screwed, though, until we decide to extend college age to 35.
> There are too many developers in the existing market as it is
Junior developers. If you could magically turn them all into AI senior researchers they would all have a job in a month.
> believe that this will do anything but transfer wealth out of the paychecks of regular people, straight into the pockets of existing centibillionaires
This is progressing for 50 years. Sure, it's a flimsy hope that it can be changed, but there are no other hopeful things to look forward to. Next best thing is a WWIII because the previous one turned out to be great societal equalizer.
There’s no point in letting good be the enemy of perfect.
Imagine if you gave everyone a free guitar and people just started posting their electric guitar noodlings on social media after playing for 5 minutes.
It is not a judgement on the guitar. If anything it is a judgement on social media and the stupidity of the social media user who get worked up about someone creating "slop" after playing guitar for 5 minutes.
What did you expect them to sound like, Steve Vai?
It's intentionally hostile and inconsiderate.
But it would be _much_ better if when you hit reply, it gave you a message that you're "posting too fast" before you spend the time to write it up.
Bounding boxes: I actually send an image with a red box around where the requested change is needed. And 8 out of 10 times it works well. But if it doesn't work, I use Claude to make the prompt more refined. The Claude API call that I make can see the image + the prompt, as well as understanding the layering system. This is one of the 3 ways I edit; there is another one where I just send the prompt to Claude without it looking at the image. Right now this all feels like dial-up, with a minimum of $0.035 per image generation ($0.0001 if I just use a LoRA though) and a minimum of 12-14 seconds wait on each edit/generation.
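Not the parent's actual code, but a minimal sketch of the "red box" idea using Pillow; the filenames and coordinates are made up:

    from PIL import Image, ImageDraw

    def mark_region(path, box, out_path="marked.png"):
        # box is (left, top, right, bottom) in pixels around the area to edit
        img = Image.open(path).convert("RGB")
        draw = ImageDraw.Draw(img)
        draw.rectangle(box, outline="red", width=6)
        img.save(out_path)
        return out_path

    # e.g. mark_region("room.png", (420, 310, 760, 580)), then send marked.png
    # to the editing model along with the prompt describing the change.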
Who would have thought that we would reach this uncharted territory with so many opportunities for pioneering and innovation? Back in 2019 it felt like nothing was new under the sun; today it feels like there is a whole new world under the sun for us to explore!
> i was to bring awareness to the dangers of dressing up like a seal while surfboarding (ie. wearing black wetsuites, arms hanging over the board). Create a scene from the perspective of a shark looking up from the bottom of the ocean into a clear blue sky with silhouettes of a seal and a surfer and fishing boat with line dangling in the water and show how the shark contemplates attacking all these objects because they look so similiar.
I haven't found a model yet that can process that description, or any variation of it, into a scene that is usable and makes sense visually to anyone older than a 1st grader. They will never place the seal, surfer, shark, or boat in the correct location to make sense visually. Typically everyone is under water, and the sizing of everything is wrong. You tell them the image is wrong, to place the person on top of the water, and they can't. Can someone please link to a model that is capable, or tell me what I am doing wrong? How can you claim to process words into images in a repeatable way when these systems can't deal with multiple constraints at once?
https://lmarena.ai/c/019a84ec-db09-7f53-89b1-3b901d4dc6be
https://gemini.google.com/share/da93030f131b
Obviously neither are good but it is better.
I think image models could be producing a lot more editable outputs if eg they output multi-layer PSDs.
Are you talking about Automatic1111 / ComfyUI inpainting masks? Because Nano doesn't accept bounding boxes as part of its API unless you just stuffed the literal X/Y coordinates into the raw prompt.
You could do something where you draw a bounding box and, when you get the response back from Nano, mask that section back over the original image, using a decent upscaler as necessary in the event that Nano had to reduce the size of the original image down to ~1MP.
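A minimal sketch of that paste-back step with Pillow (the LANCZOS resize is a naive stand-in for a proper upscaler, and all names are made up):

    from PIL import Image

    def paste_region(original_path, edited_path, box, out_path="composited.png"):
        # Paste only the edited crop back, so untouched pixels stay identical.
        original = Image.open(original_path).convert("RGB")
        edited = Image.open(edited_path).convert("RGB")
        if edited.size != original.size:
            edited = edited.resize(original.size, Image.LANCZOS)
        original.paste(edited.crop(box), box[:2])
        original.save(out_path)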
I also had similar mixed results wrt Nano-banana especially around asking it to “fix/restore” things (a character’s hand was an anatomical mess for example)
It also works well if you draw a bb on the original image, then ask Claude for a meta-prompt to deconstruct the changes into a much more detailed prompt, and then send the original image without the bbs for changes. It really depends on the changes you need, and how long you're willing to wait.
- normal image editing response: 12-14s
- image editing response with Claude meta-prompting: 20-25s
- image editing response with Claude meta-prompting as well as image deconstructing and re-constructing the prompt: 40-60s
(I use Replicate though, so the actual API may be much faster).
This way you can also get new views of a scene by zooming the image in and out on the same aspect-ratio canvas, and asking it to generatively fill the white borders around it. So you can go from a tight inside shot to viewing the same scene from outside a house window, or from inside the car to outside the car.
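A minimal sketch of the "zoom out" half of that trick, assuming Pillow; the padded result then goes back to the model with a prompt like "generatively fill the white border, continuing the scene":

    from PIL import Image

    def zoom_out(path, factor=1.5, out_path="padded.png"):
        # Shrink the original into the centre of a larger white canvas
        # (same aspect ratio), leaving white borders for the model to fill.
        img = Image.open(path).convert("RGB")
        w, h = img.size
        canvas = Image.new("RGB", (int(w * factor), int(h * factor)), "white")
        canvas.paste(img, ((canvas.width - w) // 2, (canvas.height - h) // 2))
        canvas.save(out_path)
        return out_path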
I also have a custom pipeline/software that takes in a given prompt, rewrites it using an LLM into multiple variations, sends it to multiple GenAI models, and then uses a VLM to evaluate them for accuracy. It runs in an automated REPL style, so I can be relatively hands-off, though I do have a "max loop limiter" since I'd rather not spend the equivalent of a small country's GDP.
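For anyone curious what such a loop can look like, here's a rough sketch; rewrite_prompt, generate_image, and score_image are hypothetical wrappers around whatever LLM, image, and VLM APIs you use, not real library calls:

    def generate_best(prompt, models, max_loops=5, threshold=0.9):
        best_img, best_score = None, 0.0
        for _ in range(max_loops):                       # the "max loop limiter"
            for variant in rewrite_prompt(prompt, n=4):  # LLM rewrites the prompt
                for model in models:                     # fan out to several models
                    img = generate_image(model, variant)
                    score = score_image(img, prompt)     # VLM grades accuracy vs. intent
                    if score > best_score:
                        best_img, best_score = img, score
            if best_score >= threshold:                  # good enough, stop early
                break
        return best_img, best_score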
I keep hearing advocates of AI video generation talking at length about how easy the tools are to use and how great the results are, but I've yet to see anyone produce something meaningful that's coherent, consistent, and doesn't look like total slop.
There's a real legal fight that needs to go on right now about these companies stealing style, voices, likeness, etc. But it's really beginning to feel like there's a generation of artists that are hampering their career by saying they are above it instead of using the tools to enhance their art to create things they otherwise couldn't.
I see kids in high school using the tools like how I used Photoshop when I was younger. I see unemployed/under employed designers lamenting what the tools have done.
If at some point I also get very good at it; and the tech, models and tools mature, this will turn into a real avenue; who are they to tell us not to pursue it?
You need talented people to make good stuff, but at this time most of them still fear the new tools.
> Bots in the Hall
* voices don't match the mouth movements
* mouth movements are poorly animated
* hand/body movements are "fuzzy" with weird artifacts
* characters stare in the wrong direction when talking
* characters never move
* no scenes over 3 seconds in length between cuts
> Neural Viz
* animations and backgrounds are dull
* mouth movements are uncanny
* "dead eyes" when showing any emotions
* text and icons are poorly rendered
> The Meat Dept video for Igorrr's ADHD
This one I can excuse a bit since it's a music video, and for the sake of "artistic interpretation", but:
* continuation issues between shots
* inconsistent visual style across shots
* no shots longer than 4 seconds between cuts
* rendered text is illegible/nonsensical
* movement artifacts
Entire narrative-driven AI shows, driven by AI stories and AI characters in AI-generated universes... they are here already, but I can only count those who do it well on two hands (last year, there were 1-2). This is going to accelerate, and if you think it's "slop" now, it just takes a few iterations of artists who you personally resonate with jumping onto this before you stop seeing it as slop. I am jumping on this, because I can see very clearly where this will all lead. You don't have to like it, but it will arrive regardless.
Things like: Convert the people to clay figures similar to what one would see in a claymation.
And it would think it did it, but I could not perceive any change.
After several attempts, I added "Make the person 10 years younger". Suddenly it made a clay figure of the person.
www.brandimagegen.com
if you want a premium account to try out, you can find my email in my bio!!
> Nano Banana supports a context window of 32,768 tokens: orders of magnitude above T5’s 512 tokens and CLIP’s 77 tokens.
In my pipeline for generating highly complicated images (particularly comics [1]), I take advantage of this by sticking a Mistral 7b LLM in-between that takes a given prompt as an input and creates 4 variations of it before sending them all out.
> Surprisingly, Nano Banana is terrible at style transfer even with prompt engineering shenanigans, which is not the case with any other modern image editing model.
This is true - though I find it works better by providing a minimum of two images. The first image is intended to be transformed, and the second image is used as "stylistic aesthetic reference". This doesn't always work since you're still bound by the original training data, but it is sometimes more effective than attempting to type out a long flavor text description of the style.
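For reference, a minimal sketch of that two-image approach, assuming the google-genai Python SDK with GEMINI_API_KEY set; the exact model id and filenames are placeholders:

    from io import BytesIO
    from PIL import Image
    from google import genai

    client = genai.Client()
    base = Image.open("photo_to_transform.png")   # image to be transformed
    style = Image.open("style_reference.png")     # stylistic aesthetic reference

    response = client.models.generate_content(
        model="gemini-2.5-flash-image",  # Nano Banana
        contents=[
            "Redraw the first image in the style of the second image. "
            "Treat the second image purely as a stylistic reference.",
            base,
            style,
        ],
    )
    for part in response.candidates[0].content.parts:
        if part.inline_data:  # the generated image comes back as inline bytes
            Image.open(BytesIO(part.inline_data.data)).save("styled.png")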
I've noticed a lot of this misinformation floating around lately, and I can't help but wonder if it's intentional?
I first extract all the entities from the text, generate characters from an art style, and then start stitching them together into individual illustrations. It works much better with NB than anything else I tried before.
That sounds interesting. Could you share?
"YOU WILL BE PENALIZED FOR USING THEM"
That is disconcerting.
AI can't do that (yet?).
A 1024x1024 image seems to cost about 3ct to generate.
Yet when I ask it some simple tasks, like making a 16:9 picture instead of a square one, it ends up putting a 16:9 image on a white background that still fills a square.
When I ask it to include text, and then on a second request ask it to redo the image while changing just one visual element, it ends up breaking the text it had previously gotten right.
It's getting better at flattering people and telling them how clever and right they are than at actually doing the task.
Not (knowingly) used an llm for a long time. Is the above true?
Very cool post, thanks for sharing!
I had no idea that the context window was so large. I’d been instinctively keeping my prompts small because of experience with other models. I’m going to try much more detailed prompts now!
I was trying to create a simple "mascot logo" for my pet project. I first created an account on Kittl [0] and even paid for one month but it was quite cumbersome to generate images until I figured out I could just use the nano banana api myself.
Took me 4 prompts to ai-slop a small Python script I could run with uv that would generate a specified number of images from a given prompt (where I discovered some of the insights the author shows in their post). The resulting logo [1] was pretty much what I imagined. I manually added some text and played around with hue/saturation in Kittl (since I already paid for it :)) et voilà.
Feeding back the logo to iterate over it worked pretty nicely and it even spat out an "abstract version" [2] of the logo for favicons and stuff without a lot of effort.
All in all this took me 2 hours and around $2 (excluding the 1 month Kittl subscription) and I would've never been able to draw something like that in Illustrator or similar.
[0] https://www.kittl.com/ [1] https://github.com/sidneywidmer/yass/blob/master/client/publ... [2] https://github.com/sidneywidmer/yass/blob/master/client/publ...
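For anyone who wants to try the same approach, a rough sketch of such a uv-runnable script, assuming the google-genai SDK; the prompt and model id are placeholders, not the actual script or logo prompt used above:

    # /// script
    # requires-python = ">=3.10"
    # dependencies = ["google-genai", "pillow"]
    # ///
    # Usage: GEMINI_API_KEY=... uv run gen_logos.py 4
    import sys
    from io import BytesIO
    from PIL import Image
    from google import genai

    PROMPT = "a simple, flat mascot logo of a friendly fox, vector style"  # placeholder
    n = int(sys.argv[1]) if len(sys.argv) > 1 else 4

    client = genai.Client()  # reads GEMINI_API_KEY from the environment
    for i in range(n):
        resp = client.models.generate_content(
            model="gemini-2.5-flash-image",  # exact model id may differ
            contents=PROMPT,
        )
        for part in resp.candidates[0].content.parts:
            if part.inline_data:
                Image.open(BytesIO(part.inline_data.data)).save(f"logo_{i}.png")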
Going further, one thing you can do is give Gemini 2.5 a system prompt like the following:
https://goto.isaac.sh/image-prompt
And then pass Gemini 2.5's output directly to Nano-Banana. Doing this yields very high-quality images. This is also good for style transfer and image combination. For example, if you then give Gemini 2.5 a user prompt that looks something like this:
I would like to perform style transfer. I will provide the image generation model a photograph alongside your generated prompt. Please write a prompt to transfer the following style: {{ brief style description here }}.
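A minimal sketch of that chaining, assuming the google-genai SDK; the system instruction below is just a stand-in for the linked prompt, and the model ids may differ:

    from PIL import Image
    from google import genai
    from google.genai import types

    client = genai.Client()

    # Step 1: a text model expands the request into a detailed image prompt.
    prompt_gen = client.models.generate_content(
        model="gemini-2.5-flash",
        config=types.GenerateContentConfig(
            system_instruction="You write detailed prompts for an image model."
        ),
        contents="I would like to perform style transfer. Please write a prompt "
                 "to transfer the following style: gritty 35mm film noir.",
    )

    # Step 2: pass the generated prompt, plus the photograph, to the image model.
    image_resp = client.models.generate_content(
        model="gemini-2.5-flash-image",
        contents=[prompt_gen.text, Image.open("photograph.png")],
    )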
You can get aesthetic, consistently-styled images, like these:

Unfortunately I have to use ChatGPT for this; for some reason local models don't do well with such tasks. I don't know if it's just the extra prompting sauce that ChatGPT does or just that diffusion models aren't well designed for these kinds of tasks.
Regarding the generated cat image:
> Each and every rule specified is followed.
Not quite; the eye color and heterochromia are followed only so-so.
The black-and-silver cat seems to have no heterochromia; eye color could be interpreted as silver though.
The white-and-gold cat _does_ have heterochromia. The colors can be interpreted as "white" and "gold", though I'd describe them as whitish-blue and orange. What's interesting is that the instructions get adjusted toward biologically more plausible eye colors in the cat that also has the more natural fur colors.
The last cat's fur colors are so "implausible" that the model doesn't seem to have problems taking exactly those colors for the (heterochromatic) eyes too!
It is in fact not at all obvious why you can't.
The American prudishness continues to boggle my mind.
okay, look at imagen 4 ultra:
https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
In this link, Imagen is instructed to render the verbatim text "the result of 4+5", and it shows that text; when not so instructed, it renders "4+5=9".
Is Imagen thinking?
Let's compare to gemini 2.5 flash image (nano banana):
look carefully at the system prompt here: https://aistudio.google.com/app/prompts?state=%7B%22ids%22:%...
Gemini is instructed to reply in images first, and if it thinks, to think using the image thinking tags. It cannot seemingly be prompted to show verbatim the result 4+5 without showing the answer 4+5=9. Of course it can show whatever exact text that you want, the question is, does it prompt rewrite (no) or do something else (yes)?
compare to ideogram, with prompt rewriting: https://ideogram.ai/g/GRuZRTY7TmilGUHnks-Mjg/0
without prompt rewriting: https://ideogram.ai/g/yKV3EwULRKOu6LDCsSvZUg/2
We can do the same exercises with Flux Kontext for editing versus Flash-2.5, if you think that editing is somehow unique in this regard.
Is prompt rewriting "thinking"? My point is, this article can't answer that question without dElViNg into the nuances of what multi-modal models really are.