https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/Q...
The best time to judge how good a new image model actually is seems to be about a week after launch. That's when enough pieces have fallen into place that people have had a chance to really mess with it and publish third-party pros and cons of the model. Looking hopeful for this one though!
As an aside, I'm not sure why multi-GPU inference is quite mature for LLMs while for image models, despite also using GGUFs, it hasn't been. Maybe as image models get bigger there will be more of a push to implement it.
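For what it's worth, diffusers does offer some coarse multi-GPU placement today; it just distributes whole components rather than sharding layers the way llama.cpp does for LLMs. A minimal sketch, assuming a recent diffusers release and the official "Qwen/Qwen-Image" repo id:

import torch
from diffusers import DiffusionPipeline

# device_map="balanced" places whole components (transformer, text encoder, VAE)
# on different GPUs; it is not tensor-parallel sharding of individual layers.
pipe = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image",           # assumed repo id
    torch_dtype=torch.bfloat16,
    device_map="balanced",
)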
40 GB of VRAM? So two GPUs with 24 GB each? That's pretty reasonable compared to the kind of machine needed to run the latest Qwen coder models (which, btw, are close to SOTA: they also beat proprietary models on several benchmarks).
Also, for a 20B model you only really need 20 GB of VRAM: FP8 is near-identical to FP16, and it's only below FP8 that you start to see dramatic drop-offs in quality. So literally any Mac Studio available for purchase will do, and even a fairly low-end MacBook Pro would work. A 5090 should be able to handle it with room to spare as well.
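Back-of-envelope, counting weights only (ignoring activations, the VAE, the text encoder, and framework overhead), the arithmetic looks like this:

# Rough weights-only VRAM arithmetic for a 20B-parameter model
params = 20e9
gib = 1024**3
for name, bytes_per_param in [("FP16/BF16", 2), ("FP8", 1), ("NF4", 0.5)]:
    print(f"{name}: ~{params * bytes_per_param / gib:.0f} GiB")
# -> FP16/BF16: ~37 GiB, FP8: ~19 GiB, NF4: ~9 GiB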
Besides style transfer, object addition and removal, text editing, and manipulation of human poses, it also supports object detection, semantic segmentation, depth/edge estimation, super-resolution, and novel view synthesis (NVS), i.e. synthesizing new perspectives from a base image. It’s quite a smorgasbord!
Early results indicate to me that gpt-image-1 has a bit better sharpness and clarity, but I’m honestly not sure whether OpenAI simply does some basic unsharp mask or something as a post-processing step. I’ve always been suspicious about that, because the sharpness seems oddly uniform even in out-of-focus areas, and sometimes a bit much.
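(To be concrete about the kind of post-processing I’m imagining: a single uniform sharpening pass, e.g. Pillow’s built-in unsharp mask; the parameters below are arbitrary.)

from PIL import Image, ImageFilter

img = Image.open("generated.png")
# The same unsharp mask applied everywhere would explain uniform sharpness
# even in out-of-focus regions; radius/percent/threshold are arbitrary examples.
sharpened = img.filter(ImageFilter.UnsharpMask(radius=2, percent=150, threshold=3))
sharpened.save("generated_sharpened.png")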
Otherwise, yeah this one looks about as good.
Which is impressive! I thought OpenAI had a lead here from their unique image generation solution that’d last them this year at least.
Oh, and Flux Krea has lasted all of four days since its announcement! That is, if this one truly is similar in quality to gpt-image-1.
Flux Kontext was a gamechanger release for image editing and it can do some absurd things, but it's still relatively unknown. Qwen-Image, with its more permissive license, could lead to much more innovation once the editing model is released.
It's more that the novelty just wore off. Mainstream image generation in online services is "good enough" for most casual users - and power users are few, and already knee deep in custom workflows. They aren't about to switch to the shiny new thing unless they see a lot of benefits to it.
import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig  # available in recent diffusers releases

model_name = "Qwen/Qwen-Image"  # official repo id
device = "cuda"

# Configure NF4 quantization for the transformer and text encoder
quant_config = PipelineQuantizationConfig(
    quant_backend="bitsandbytes_4bit",
    quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
    components_to_quantize=["transformer", "text_encoder"],
)

# Load the pipeline with NF4 quantization
pipe = DiffusionPipeline.from_pretrained(
    model_name,
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
    use_safetensors=True,
    low_cpu_mem_usage=True,
).to(device)
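(For reference, generation afterwards is just the standard diffusers call; the prompt and step count below are arbitrary examples.)

image = pipe(
    prompt="A corgi reading a newspaper, studio lighting",  # arbitrary example prompt
    num_inference_steps=50,
).images[0]
image.save("qwen_image_test.png")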
Seems to use 17 GB of VRAM like this. Update: doesn't work well; this approach seems to be recommended instead: https://github.com/QwenLM/Qwen-Image/pull/6/files
I ended up building my own tool for that: https://tools.simonwillison.net/huggingface-storage
For PCs I take it you'd want one with two PCIe 4.0 (or newer) x16 slots? As in: quite a few consumer motherboards. You then put in two GPUs with 24 GB of VRAM each.
A friend runs such a setup (I don't know if they've tried this Qwen-Image yet): it's not an "out of this world" machine.
This is not that obvious. Calculating VRAM usage for VLMs/LLMs is something of an arcane art. There are about ten calculators online you can use and none of them work. Quantization, KV caching, activations, layer count, etc. all play a role. It's annoying.
But anyway, for this model you need 40+ GB of VRAM. System RAM isn't going to cut it unless it's unified memory on Apple Silicon, and even then memory bandwidth is the bottleneck, so inference is much, much slower than on a GPU/TPU.
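To illustrate why the calculators disagree, even a rough estimate has several terms that depend on model internals and usage; the model shape and context length below are purely hypothetical:

# Rough LLM VRAM estimate: weights + KV cache (activations and framework overhead omitted)
def estimate_vram_gib(params_b, bytes_per_param, n_layers, n_kv_heads, head_dim,
                      seq_len, batch=1, kv_bytes=2):
    weights = params_b * 1e9 * bytes_per_param
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * kv_bytes  # K and V
    return (weights + kv_cache) / 1024**3

# Hypothetical 20B-class decoder at BF16 with a 32k context:
print(round(estimate_vram_gib(20, 2, 60, 8, 128, 32_768), 1))  # -> ~44.8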
This is a slightly scaled up SD3 Large model (38 layers -> 60 layers).
To me they all seem to suffer from the same artifact: the text looks sort of unnatural and doesn't have the correct shadows/reflections relative to the rest of the image. This applies to all the models I have tried, from OpenAI to Flux. Presumably they are all using the same trick?
Maybe in the future someone will come up with a method for putting realistic text into images so that they can generate data to train a model for putting realistic text into images.
It reminds me of how CivitAI is full of “sexy Emma Watson” LoRAs, presumably because she very notably has said she doesn’t want to be portrayed in ways that objectify her body. There’s a really rotten vein of “anti-consent” pulsing through this community, where people deliberately seek out people who have asked to be left out of this and go “Oh yeah? Well there’s nothing you can do to stop us, here’s several terabytes of exactly what you didn’t want to happen”.
What disappoints me is how aligned the whole community is with its worst exponents. That someone went “Heh heh, I’m gonna spend hours of my day and hundreds/thousands of dollars in compute just to make Miyazaki sad.” and then influencers in the AI art space saw this happen and went “Hell yeah let’s go” and promoted the shit out of it making it one of the few finetunes to actually get used by normies in the mainstream, and then leaders in this field like the Qwen team went “Yeah sure let’s ride the wave” and made a Studio Ghibli style image their first example.
I get that there was no way to physically stop a Studio Ghibli LoRA from existing. I still think the community’s gleeful reaction to it has been gross.
That, and the weird prudishness of most American people and companies.
That being said, it still lags pretty far behind OpenAI's gpt-image-1 strictly in terms of prompt adherence for txt2img prompting. However, as has already been mentioned elsewhere in the thread, this model can do a lot more around editing, etc.
Nope. The text includes the line "That dawn will bloom" but the render reads "That down will bloom", which is meaningless.