1. Not seen browsing "ai.dev".
2. The text "Imagen 4 is now generally available!" is spoken, not a comic caption.
3. Invalid second panel.
4. Hallucinates "Meet Imagen 4 fast!"
5. Hallucinates "It offers low..." etc. (this is the second part of a single sentence said by the cat)
6. Hallucinates "You can export images in 2K!" (this sentence is not asked for)
7. Doesn't have the cat and the dog in the fourth panel.
—
Here’s the gpt-image-1 counterpart with the issues I could find:
https://chatgpt.com/share/689f7e4b-01e4-8011-8997-0f37edf8c2...
1. The text "Imagen 4 is now generally available!" is still spoken, not a caption.
2. "low latency" -> "low-laten"
(3. Has that ugly gpt-image-1 trademark yellow filter, which requires work in post to avoid.)
I didn't bring up the "retro comic look" thing. I certainly think it's an issue with Imagen 4's version; it doesn't look very old-school at all. But I can't judge the OpenAI one on that either, since I'm no comic book expert, so I just skipped that point.
I’m no Scott McCloud, but the OpenAI version definitely does a better job with the retro style. The yellow filter you criticised actually helps to sell the illusion. The Imagen version utterly fails in the retro area, that style is very much modern.
But there are other important flaws in the OpenAI version. The fourth panel has a different cat (the head shape and stripes are wrong) and it bleeds into the previous panel. Technically that could be a stylistic choice, except that the floor/table is inconsistent, making it clear it was a mistake.
Repo: https://github.com/google-deepmind/synthid-text
Paper: https://www.nature.com/articles/s41586-024-08025-4
With images and video, it's less clear exactly what they're doing, but it's watermarking on the pixel level. From one of their blog posts:
Videos are composed of individual frames or still images. So we developed a watermarking technique inspired by our SynthID for image tool. This technique embeds a watermark directly into the pixels of every video frame, making it imperceptible to the human eye, but detectable for identification.
https://deepmind.google/discover/blog/watermarking-ai-genera...
ElevenLabs' audio watermarking is trivial to shake off with compression, but Google claims that SynthID is resilient to such manipulation.
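SynthID's actual embedding scheme isn't published, so the sketch below is purely illustrative: it shows the simplest possible pixel-level watermark (keyed least-significant-bit embedding) and why such naive schemes die under lossy compression, which is exactly the manipulation Google claims SynthID survives. All function names here are hypothetical, not anything from the SynthID repo.

```python
import random

def embed_watermark(frame, key):
    """Overwrite each pixel's least significant bit with a keyed
    pseudo-random bit pattern (naive illustration only)."""
    rng = random.Random(key)
    return [(p & ~1) | rng.getrandbits(1) for p in frame]

def detect_watermark(frame, key):
    """Fraction of LSBs matching the keyed pattern.
    ~0.5 means unwatermarked noise; ~1.0 means watermarked."""
    rng = random.Random(key)
    matches = sum((p & 1) == rng.getrandbits(1) for p in frame)
    return matches / len(frame)

def lossy_compress(frame, step=8):
    """Crude stand-in for lossy compression: quantize pixel values,
    which wipes out the low-order bits the watermark lives in."""
    return [min(255, (p // step) * step + step // 2) for p in frame]

frame = [random.randrange(256) for _ in range(10_000)]  # fake 8-bit frame
marked = embed_watermark(frame, key=42)

print(detect_watermark(marked, key=42))                  # ~1.0: detected
print(detect_watermark(lossy_compress(marked), key=42))  # ~0.5: destroyed
```

The point of the toy example is the last line: quantization alone pushes the detection score back to chance level, so a compression-resilient watermark has to live in something sturdier than raw low-order bits (e.g. statistical biases spread across many pixels or frequency coefficients).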
https://console.cloud.google.com/vertex-ai/studio/media/gene...
Also, the prompt specifically asks "Panel 4 should show the cat and dog high-fiving", but the cat is high-fiving ... the cat. Personally, I find this hallucinated plot twist an improvement; it makes the ending a bit better. Although technically it demonstrates the tool failing to follow the instructions in the prompt. Interesting choice of example for an official announcement.
It seems that you may need the "Ultra" version if you want strict prompt adherence.
It's an interesting strategy. Personally, I notice that most of the time I don't actually need strict prompt adherence for image generation. If it looks nice, I'll accept it. If it doesn't, I'll click generate again. For creative tasks, following the prompt too strictly might not be the outcome users want.
An image has to be much worse than that to fail to save you 6 seconds.
That said, this is their own chosen example of what it can do, so I'd have to assume it is much worse than that on average.
And again, if I can't use it because it's totally wrong, then... what are we even doing here?
It will probably save a lot more, but the point is 6 seconds is the threshold at which 2 cents is "worth it".
Good art takes a long time to create.
If this image were representative, errors and all, it would be roughly where you could expect a professional to be after an hour or so, give or take. I've seen professionals work on an icon set for multiple days, and most webcomic artists I follow, even when it's their full-time job and they have a good system going to make their output easy to produce, don't tend to produce something like this should have been more than once per day.
> And again, if I can't use it because it's totally wrong, then... what are we even doing here?
On this, I tend to agree. If you have a specific output in mind, quite often they're just wildly wrong. Repeated generations are just plain bad, and the system just can't seem to get what's being asked for.
Muphry's Law strikes again.
Indeed.
It's and its are backwards. The latter breaks the possessive s rule.
Speaking of, the possessive s should _always_ be added, no reason to sometimes omit it if the name ends in an s.
Ass backwards, all of it.
For those commenting in the latter category, it might be worthwhile to read a bit about the underlying technology and share your insights on why it does not deliver.
If you followed the news during the GAN cycle, you could extrapolate that deep NNs could do this type of thing. It is really cool that these things happened so fast, but we are talking about companies that have the money to deploy thousands of cars around the globe to collect data, so they absolutely know how to gather data.
The problem with it being 2025 is that I have seen thousands of better examples than that landscape. The reflections in the lake are complete trash.
Then I think of Veo 3 that is just incredible. So no, it is not impressive if a still from the video model is vastly better than the static image generator from the same company.
I find it especially annoying because I can't think of another company this would happen at. It is just so Google.
Does the world need yet another AI slop generator?
>The model may output text only. Try asking for image outputs explicitly (e.g. "generate an image", "provide images as you go along", "update the image").
>The model may stop generating partway through. Try again or try a different prompt.
Seriously?
One of the biggest corporations in the world, and they can't proofread the title before posting to catch a typo.
Heads be shakin
It's a typo, it doesn't matter.
Meanwhile, Veo 3 is far better than OpenAI's equivalent. I assume speed is not a priority there; both take their time.
''' A four panel comic strip. Simple black on white. Stick figures for characters. In the first panel there is a stick figure man and a stick figure bird eating bird seed at his feet. He is slightly hunched over to show he is looking at the bird. In the second panel, he is more hunched over, looking more closely at the bird. In the third panel he is even more hunched over, practically with his head to the bird; he is crouched down, knees bent, hands on thighs. In the upper left of the third panel the tip of an enormous beak can be seen, but it's only a few lines so could be anything. In the final panel the beak has gobbled up the man and his arms and legs are flailing outside of the beak while the small bird continues to eat birdseed on the ground. '''
Despite claims that Ultra supports improved strict prompt adherence, we saw no evidence that it scored any better than Imagen 4, and in some cases it seemed to ignore the prompt altogether (see the "Not the Bees" comic). In many cases, it also seemed much less steerable than Imagen 3, requiring many of the prompts to be rewritten.
https://genai-showdown.specr.net?models=IMAGEN_3,IMAGEN_4,IM...
There's some speculation it's Gemini 3's multi-modal output, and other speculation that it's an OpenAI model. Hard to say definitively, since these models tend to hallucinate when interrogated.
gpt-image-1 is in a class all of its own with regards to prompt adherence in the "text to image" category.
Once it hits GA I'll put it through its paces and add it to the site!
Result: https://imgur.com/a/Ri0yb31
This is supposed to be SOTA?