https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/Q...
The best time to judge how good a new image model actually is seems to be about a week after launch. That's when enough pieces have fallen into place that people have had a chance to really mess with it, and third-party write-ups of the pros and cons have started to appear. Looking hopeful for this one, though!
As an aside, I am not sure why the technology for splitting a model across multiple cards is quite mature for LLMs, while for image models, despite also using GGUFs, this has not been the case. Maybe as image models get bigger there will be more of a push to implement it.
40 GB of VRAM? So two GPUs with 24 GB each? That's pretty reasonable compared to the kind of machine needed to run the latest Qwen coder models (which, by the way, are close to SOTA: they even beat proprietary models on several benchmarks).
Training it will also be out of reach for most. I’m sure I’ll be able to handle it on my own 5090 at some point but it’ll be slow going.
Also, for a 20B model you only really need 20 GB of VRAM: FP8 is near-identical to FP16; it's only below FP8 that you start to see dramatic drop-offs in quality. So literally any Mac Studio available for purchase will do, and even a fairly low-end MacBook Pro would work as well. A 5090 should be able to handle it with room to spare too.
Any M3 Ultra Mac Studio, or a midrange-or-better MacBook Pro, would handle FP16 with no issues though. A 5090 would handle FP8 like a champ, and a 4090 could probably squeeze it in as well, although it'd be tight.
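Rough napkin math for the weights alone bears this out (a sketch; activations, the VAE, and the text encoder add a few more GB on top):

# Weight memory for a 20B-parameter model at different precisions
params = 20e9
for name, bytes_per_param in [("FP16/BF16", 2), ("FP8", 1), ("NF4", 0.5)]:
    print(f"{name}: ~{params * bytes_per_param / 1024**3:.0f} GB of weights")
# FP16/BF16: ~37 GB, FP8: ~19 GB, NF4: ~9 GB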
You really don't understand art. At all.
Besides style transfer, object additions and removals, text editing, manipulation of human poses, it also supports object detection, semantic segmentation, depth/edge estimation, super-resolution and novel view synthesis (NVS) i.e. synthesizing new perspectives from a base image. It’s quite a smorgasbord!
Early results indicate to me that gpt-image-1 has a bit better sharpness and clarity, but I'm honestly not sure whether OpenAI simply applies a basic unsharp mask or something as a post-processing step. I've always been suspicious about that, because the sharpness seems oddly uniform even in out-of-focus areas, and sometimes a bit much, even.
Otherwise, yeah this one looks about as good.
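For reference, a basic unsharp-mask post-process is a one-liner with Pillow; this is purely illustrative, not a claim about what OpenAI actually does:

from PIL import Image, ImageFilter

img = Image.open("generated.png")
# Unsharp mask: blur radius, strength in percent, and a threshold below which
# pixels are left alone. Applied globally, it sharpens out-of-focus regions
# just as much as in-focus ones, which would explain the uniform look.
sharpened = img.filter(ImageFilter.UnsharpMask(radius=2, percent=150, threshold=3))
sharpened.save("generated_sharp.png")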
Which is impressive! I thought OpenAI had a lead here, thanks to their unique image generation approach, that would last them at least through this year.
Oh, and Flux Krea lasted four days since its announcement! That is, if this one is truly similar in quality to gpt-image-1.
Flux Kontext was a gamechanger release for image editing and it can do some absurd things, but it's still relatively unknown. Qwen-Image, with its more permissive license, could lead to much more innovation once the editing model is released.
It's more that the novelty just wore off. Mainstream image generation in online services is "good enough" for most casual users - and power users are few, and already knee deep in custom workflows. They aren't about to switch to the shiny new thing unless they see a lot of benefits to it.
There is what's probably better described as a bullying campaign. People tried the same thing when synthesizers and cameras were invented. But nobody takes it seriously unless they're already part of the angry-person fandom.
In practice AI image generation is ubiquitous at this point. AI image editing is also built into all major phones.
Midjourney's images are the only ones that don't make me uncomfortable (most of the time); hopefully they can fix their prompt adherence.
However, you have mistakenly marked some answers as correct in the octopus prompt: only one generated image shows the octopus with sock puppets on all of its tentacles, and you marked that image as incorrect because the socks look more like gloves.
But, obviously you wouldn’t do that. Right? Did you look at the scaling on their graphs?
import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig  # requires a recent diffusers release

model_name = "Qwen/Qwen-Image"  # Hugging Face repo ID
device = "cuda"

# Configure NF4 quantization via bitsandbytes for the transformer and text encoder
quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_compute_dtype": torch.bfloat16,
    },
    components_to_quantize=["transformer", "text_encoder"],
)

# Load the pipeline with NF4 quantization
pipe = DiffusionPipeline.from_pretrained(
    model_name,
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
    use_safetensors=True,
    low_cpu_mem_usage=True,
).to(device)
Seems to use 17 GB of VRAM like this. Update: this doesn't work well; this approach seems to be recommended instead: https://github.com/QwenLM/Qwen-Image/pull/6/files
I ended up building my own tool for that: https://tools.simonwillison.net/huggingface-storage
For PCs, I take it you need one with two PCIe 4.0 x16 (or newer) slots? As in: quite a few consumer motherboards. You then put in two GPUs with 24 GB of VRAM each.
A friend runs this setup (I don't know if they've tried Qwen-Image yet): it's not an "out of this world" machine.
This is not that obvious. Calculating VRAM usage for VLMs/LLMs is something of an arcane art. There are about ten calculators online you can use, and none of them works. Quantization, KV caching, activations, layer counts, etc. all play a role. It's annoying.
But anyway, for this model you need 40+ GB of VRAM. System RAM isn't going to cut it unless it's unified memory on Apple Silicon, and even then memory bandwidth becomes the bottleneck, so inference is much, much slower than on a GPU/TPU.
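As a crude sketch of why those calculators disagree, an LLM-style estimate looks something like the function below; the overhead factor and KV-cache term are assumptions, and real frameworks add fragmentation and activation memory on top:

def estimate_vram_gb(n_params, bytes_per_param, n_layers=0, n_kv_heads=0,
                     head_dim=0, context_len=0, batch_size=1, kv_bytes=2,
                     overhead=1.2):
    # Rough estimate: quantized weights + KV cache, times a fudge factor
    weights = n_params * bytes_per_param
    # K and V per layer, per head, per token, per batch element
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * context_len * batch_size * kv_bytes
    return (weights + kv_cache) * overhead / 1024**3

# For this 20B diffusion model in BF16 there is no KV cache to speak of:
print(estimate_vram_gb(20e9, 2))  # ~45 GB, hence the "40+ GB of VRAM"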
This is a slightly scaled up SD3 Large model (38 layers -> 60 layers).
To me they all seem to suffer from the same artifact: the text looks sort of unnatural and doesn't pick up the correct shadows/reflections from the rest of the image. This applies to all the models I have tried, from OpenAI to Flux. Presumably they are all using the same trick?
Maybe in the future someone will come up with a method for putting realistic text into images so that they can generate data to train a model for putting realistic text into images.
It reminds me of how CivitAI is full of “sexy Emma Watson” LoRAs, presumably because she very notably has said she doesn’t want to be portrayed in ways that objectify her body. There’s a really rotten vein of “anti-consent” pulsing through this community, where people deliberately seek out people who have asked to be left out of this and go “Oh yeah? Well there’s nothing you can do to stop us, here’s several terabytes of exactly what you didn’t want to happen”.
What disappoints me is how aligned the whole community is with its worst exponents. That someone went “Heh heh, I’m gonna spend hours of my day and hundreds/thousands of dollars in compute just to make Miyazaki sad.” and then influencers in the AI art space saw this happen and went “Hell yeah let’s go” and promoted the shit out of it making it one of the few finetunes to actually get used by normies in the mainstream, and then leaders in this field like the Qwen team went “Yeah sure let’s ride the wave” and made a Studio Ghibli style image their first example.
I get that there was no way to physically stop a Studio Ghibli LoRA from existing. I still think the community’s gleeful reaction to it has been gross.
Those behaviors might appear correct in an extremely superficial sense, but it is as if they prompted themselves for "man eating cookies" and ended up with something akin to the early Will Smith pasta GIFs. Whatever they're doing (assuming those are cookies held in hands), they're not eating them.
No he has not. He was talking about an AI model that was shown off for crudely animating 3D people in 2016, in a way that he found creepy. If you watch the actual video, you can see the examples that likely set him off here[0].
Leading by example by not condoning copying artists' styles would be a simple polite gesture.
That, and the weird prudishness of most American people and companies.
I would really like to find a way to do this (either online or locally). Does anyone have tips for giving a model some images of real jewelry, with dimensions (and, if needed, even photographed or generated children), and having the model accurately place the jewelry on the kids?
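One locally runnable starting point, sketched with diffusers' SDXL inpainting pipeline plus an IP-Adapter so a photo of the real piece conditions the fill; the model IDs, file names, and scale here are assumptions, and getting dimensions truly accurate would still need careful manual masking:

import torch
from diffusers import AutoPipelineForInpainting
from diffusers.utils import load_image

pipe = AutoPipelineForInpainting.from_pretrained(
    "diffusers/stable-diffusion-xl-1.0-inpainting-0.1",  # assumed base inpainting model
    torch_dtype=torch.float16,
).to("cuda")
# IP-Adapter lets the reference jewelry photo steer what gets painted in
pipe.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin")
pipe.set_ip_adapter_scale(0.7)

child = load_image("child_portrait.png")           # real or generated portrait
mask = load_image("necklace_region_mask.png")      # white where the jewelry should go
jewelry = load_image("necklace_product_shot.png")  # photo of the actual piece

result = pipe(
    prompt="a child wearing a delicate gold necklace, natural lighting",
    image=child,
    mask_image=mask,
    ip_adapter_image=jewelry,
    strength=0.99,
).images[0]
result.save("composited.png")

Once Qwen-Image's editing checkpoint is out, an instruction-style edit ("place this necklace on the child in the photo") may end up being a simpler route than masking by hand.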
And then you have to do it all over again every few months as the products and the seasons change!
That being said, it still lags pretty far behind OpenAI's gpt-image-1 strictly in terms of prompt adherence for txt2img prompting. However as has already been mentioned elsewhere in the thread, this model can do a lot more around editing, etc.
Nope. The text includes the line "That dawn will bloom" but the render reads "That down will bloom", which is meaningless.
If it's as good as they say, that's one less reason to keep that ChatGPT sub...
Anyone thinking otherwise hasn't attempted implementing it or hasn't thought about it in depth.
(Select "Image Generation" and be sure to use the Qwen3-235B model. I also tried selecting "Coder", but it errors out.)
An entire thread on this subject previously unfolded on HN but I can't find it at this time!
The example further down has "down" not "dawn" in the poem.
For these to be their hero image examples, they're fairly poor; I know it's a significant improvement vs. many of the other current offerings, but it's clear the bar is still being set very low.
There were a few small text mistakes and the image isn't quite as good as I've seen before, but overall it delivers on its promise.
https://cdn.qwenlm.ai/output/wV13g6892e758082439d7000d439ed5...
Go experience AI: https://www.qwenimagen.com/
Why it works:
- Highlights open-source nature and 20-billion-parameter strength
- Emphasizes its superior multilingual, layout-aware text rendering
- Mentions real-world use cases: posters, slides, graphics, image editing, comics/info visuals