
Qwen-Image: Crafting with native text rendering

https://qwenlm.github.io/blog/qwen-image/
249•meetpateltech•7h ago
https://huggingface.co/Qwen/Qwen-Image

https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/Q...

Comments

djoldman•6h ago
Check out section 3.2, Data Filtering:

https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Image/Q...

numpad0•2h ago
It's also kind of interesting that no languages other than English and Chinese are named or shown...
nickandbro•6h ago
The fact that it doesn’t change the images like 4o image gen is incredible. Often when I try to tweak someone’s clothing using 4o, it also tweaks their face. This one seems to apply those recognizable AI artifacts only to the elements that need to be edited.
herval•3h ago
You can select the area you want edited on 4o, and it’ll keep the rest unchanged
barefootford•3h ago
gpt doesn't respect masks
icelancer•2h ago
Correct. Have tried this without much success despite OpenAI's claims.
vunderba•3h ago
That's why Flux Kontext was such a huge deal - it gave you the power of img2img inpainting without needing to manually mask the content.

https://mordenstar.com/blog/edits-with-kontext

artninja1988•5h ago
Insane how many good Chinese open source models they've been releasing. This really gives me hope
anon191928•5h ago
It will take years for people to start using these, but Adobe is not alone anymore.
herval•3h ago
Adobe has never been alone. Photoshop’s AI stuff is consistently behind OSS models and workflows. It’s just way more convenient
dvt•3h ago
I think Adobe is also very careful about keeping copyrighted content out of their models, which inherently makes them lower quality.
herval•2h ago
They have a much better and cleaner dataset than Stable Diffusion & others, so I’d expect it to be better with some kinds of images (photos in particular)
doctorpangloss•2h ago
as long as you don't consider the part of the model which understands text as part of the model, and as long as you don't consider copyrighted text content copyrighted :)
yjftsjthsd-h•5h ago
Wow, the text/writing is amazing! Also the editing in general, but the text really stands out
rushingcreek•4h ago
Not sure why this isn’t a bigger deal: it seems like this is the first open-source model to beat gpt-image-1 in all respects while also beating Flux Kontext in terms of editing ability. This seems huge.
zamadatix•3h ago
It's only been a few hours and the demo is constantly erroring out, people need more time to actually play with it before getting excited. Some quantized GGUFs + various comfy workflows will also likely be a big factor for this one since people will want to run it locally but it's pretty large compared to other models. Funnily enough, the main comparison to draw might be between Alibaba and Alibaba. I.e. using Wan 2.2 for image generation has been an extremely popular choice, so most will want to know how big a leap Qwen-Image is from that rather than Flux.

The best time to judge how good a new image model actually is seems to be about a week from launch. That's when enough pieces have fallen into place that people have had a chance to really mess with it and come out with 3rd party pros/cons of the models. Looking hopeful for this one though!

rushingcreek•3h ago
I spun up an H100 on Voltage Park to give it a try in an isolated environment. It's really, really good. The only area where it seems less strong than gpt-image-1 is in generating images of UI (e.g. make me a landing page for Product Hunt in the style of Studio Ghibli), but other than that, I am impressed.
hleszek•3h ago
It's not clear from their page but the editing model is not released yet: https://github.com/QwenLM/Qwen-Image/issues/3#issuecomment-3...
tetraodonpuffer•3h ago
I think the fact that, as far as I understand, it takes 40GB of VRAM to run, is probably dampening some of the enthusiasm.

As an aside, I'm not sure why the technology to split a model across multiple cards is quite mature for LLMs, while for image models, despite also using GGUFs, this hasn't been the case. Maybe as image models become bigger there will be more of a push to implement it.

TacticalCoder•3h ago
> I think the fact that, as far as I understand, it takes 40GB of VRAM to run, is probably dampening some of the enthusiasm.

40 GB of VRAM? So two GPUs with 24 GB each? That's pretty reasonable compared to the kind of machine needed to run the latest Qwen coder models (which, btw, are close to SOTA: they also beat proprietary models on several benchmarks).

cellis•2h ago
A 3090 + 2x Titan XP? Technically I have 48 GB, but I don't think you can "split it" over multiple cards. At least with Flux, it would OOM the Titans and allocate the full 3090.
cma•3h ago
If 40GB you can lightly quantize and fit it on a 5090.
reissbaker•2h ago
40GB is small IMO: you can run it on a mid-tier Macbook Pro... or the smallest M3 Ultra Mac Studio! You don't need Nvidia if you're doing at-home inference, Nvidia only becomes economical at very high throughput: i.e. dedicated inference companies. Apple Silicon is much more cost effective for single-user for the small-to-medium-sized models. The M3 Ultra is ~roughly on par with a 4090 in terms of memory bandwidth, so it won't be much slower, although it won't match a 5090.

Also for a 20B model, you only really need 20GB of VRAM: FP8 is near-identical to FP16, it's only below FP8 that you start to see dramatic drop-offs in quality. So literally any Mac Studio available for purchase will do, and even a fairly low-end Macbook Pro would work as well. And a 5090 should be able to handle it with room to spare as well.
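
Quick back-of-the-envelope (a sketch: just parameter count times bytes per parameter; it ignores activations, the text encoder and other overhead, so treat it as a floor):

    # Rough VRAM floor for a 20B-parameter model at different precisions
    params = 20e9
    for name, bytes_per_param in [("fp16/bf16", 2), ("fp8", 1), ("nf4", 0.5)]:
        print(f"{name}: ~{params * bytes_per_param / 1e9:.0f} GB")
    # fp16/bf16: ~40 GB, fp8: ~20 GB, nf4: ~10 GB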

RossBencina•2h ago
Does M3 Ultra or later have hardware FP8 support on the CPU cores?
jug•3h ago
I think it does way more than gpt-image-1 too?

Besides style transfer, object additions and removals, text editing, manipulation of human poses, it also supports object detection, semantic segmentation, depth/edge estimation, super-resolution and novel view synthesis (NVS) i.e. synthesizing new perspectives from a base image. It’s quite a smorgasbord!

Early results indicate to me that gpt-image-1 has a bit better sharpness and clarity but I’m honestly not sure if OpenAI doesn’t simply do some basic unsharp mask or something as a post-processing step? I’ve always felt suspicious about that, because the sharpness seems oddly uniform even in out-of-focus areas? And sometimes a bit much, even.
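
A minimal sketch of the kind of uniform unsharp-mask post-process I'm imagining (PIL's built-in filter; purely illustrative, not anything OpenAI has documented):

    from PIL import Image, ImageFilter

    # Apply a uniform unsharp mask across the whole frame, the sort of blanket
    # sharpening that would explain oddly even sharpness in out-of-focus areas
    img = Image.open("generated.png")
    sharpened = img.filter(ImageFilter.UnsharpMask(radius=2, percent=150, threshold=3))
    sharpened.save("generated_sharpened.png")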

Otherwise, yeah this one looks about as good.

Which is impressive! I thought OpenAI had a lead here from their unique image generation solution that’d last them this year at least.

Oh, and Flux Krea has lasted four days since announcement! In case this one is truly similar in quality to gpt-image-1.

jacooper•2h ago
Not to mention, flux models are for non-commercial use only.
doctorpangloss•2h ago
the license for flux models is $1,000/mo, hardly an obstacle to any serious commercial usage
liuliu•2h ago
Per 100k images. And it is additionally $0.01 per image. Considering an H100 is $1.5 per hour and you can get 1 image per 5s, we are talking about a bare-metal cost of ~$0.002 per image + $0.01 license cost.
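
Working through that (a quick sketch using the numbers above as assumptions):

    # Bare-metal cost per image at the assumed rates
    h100_per_hour = 1.50      # USD/hour, assumed rental price
    seconds_per_image = 5     # assumed generation time
    license_per_image = 0.01  # Flux commercial license fee per image
    compute = h100_per_hour / 3600 * seconds_per_image
    print(f"compute ~${compute:.4f}/image, total ~${compute + license_per_image:.4f}/image")
    # compute ~$0.0021/image, total ~$0.0121/image
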
minimaxir•2h ago
With the notable exception of gpt-image-1, discussion about AI image generation has become much less popular. I suspect it's a function of a) AI discourse being dominated by AI agents/vibe coding and b) the increasing social stigma of AI image generation.

Flux Kontext was a gamechanger release for image editing and it can do some absurd things, but it's still relatively unknown. Qwen-Image, with its more permissive license, could lead to much more innovation once the editing model is released.

doctorpangloss•2h ago
gpt-image-1 is the League of Legends of image generation. It is a tool in front of like 30 million DAUs...
ACCount36•1h ago
Social stigma? Only if you listen to mentally ill Twitter users.

It's more that the novelty just wore off. Mainstream image generation in online services is "good enough" for most casual users - and power users are few, and already knee deep in custom workflows. They aren't about to switch to the shiny new thing unless they see a lot of benefits to it.

toisanji•36m ago
how can it beat gpt-image-1 if there is no image editor?
dewarrn1•21m ago
Slightly hyperbolic; gpt-image-1 is better on at least a couple of the text metrics.
vunderba•18m ago
I've been playing around with it for the past hour. It's really good but from my preliminary testing it definitely falls short of gpt-image-1 (or even Imagen 3/4) where reasonably complex strict prompt adherence is concerned. Scored around ~50% where gpt-image-1 scored ~75%. Couldn't handle the maze, Schrödinger's equation, etc.

https://genai-showdown.specr.net

rwmj•4h ago
This may be obvious to people who do this regularly, but what kind of machine is required to run this? I downloaded & tried it on my Linux machine that has a 16GB GPU and 64GB of RAM. This machine can run SD easily. But Qwen-image ran out of space both when I tried it on the GPU and on the CPU, so that's obviously not enough. But am I off by a factor of two? An order of magnitude? Do I need some crazy hardware?
zippothrowaway•4h ago
You're probably going to have to wait a couple of days for 4 bit quantized versions to pop up. It's 20B parameters.
pollinations•2h ago

    import torch
    from diffusers import DiffusionPipeline
    from diffusers.quantizers import PipelineQuantizationConfig

    model_name = "Qwen/Qwen-Image"
    device = "cuda"

    # Configure NF4 quantization for the transformer and text encoder
    quant_config = PipelineQuantizationConfig(
        quant_backend="bitsandbytes_4bit",
        quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
        components_to_quantize=["transformer", "text_encoder"],
    )

    # Load the pipeline with NF4 quantization
    pipe = DiffusionPipeline.from_pretrained(
        model_name,
        quantization_config=quant_config,
        torch_dtype=torch.bfloat16,
        use_safetensors=True,
        low_cpu_mem_usage=True,
    ).to(device)
seems to use 17 GB of VRAM like this

update: doesn't work well. this approach seems to be recommended: https://github.com/QwenLM/Qwen-Image/pull/6/files

mortsnort•4h ago
I believe it's roughly the same size as the model files. If you look in the transformers folder you can see there are around nine 5 GB files, so I would expect you need ~45 GB of VRAM on your GPU. Usually quantized versions of models are eventually released/created that can run on much less VRAM, but with some quality loss.
foobarqux•3h ago
Why doesn't huggingface list the aggregate model size?
matcha-video•3h ago
Huggingface is just a git hosting service, like github. You can add up the sizes of all the files in the directory yourself
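
For example, with huggingface_hub (a quick sketch; it sums the per-file sizes the API reports for the repo):

    from huggingface_hub import HfApi

    # Sum the sizes of every file in the repo to approximate the total download
    info = HfApi().model_info("Qwen/Qwen-Image", files_metadata=True)
    total_bytes = sum(f.size or 0 for f in info.siblings)
    print(f"{total_bytes / 1e9:.1f} GB")
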
simonw•2h ago
I've been bugging them about this for a while. There are repos that contain multiple model weights in a single repo which means adding up the file sizes won't work universally, but I'd still find it useful to have a "repo size" indicator somewhere.

I ended up building my own tool for that: https://tools.simonwillison.net/huggingface-storage

halJordan•2h ago
Model size in billions of parameters ≈ file size in GB at fp8, so if this was released at fp16 then 40-ish GB; if it's quantized to fp4 then 10-ish GB.
TacticalCoder•3h ago
> I think the fact that, as far as I understand, it takes 40GB of VRAM to run, is probably dampening some of the enthusiasm.

For PCs, I take it you'd want one with two PCIe 4.0 x16 (or newer) slots? As in: quite a few consumer motherboards. You then put in two GPUs with 24 GB of VRAM each.

A friend runs this setup (don't know if they've tried Qwen-Image yet): it's not an "out of this world" machine.

icelancer•2h ago
> This may be obvious to people who do this regularly

This is not that obvious. Calculating VRAM usage for VLMs/LLMs is something of an arcane art. There are about 10 calculators online you can use and none of them work. Quantization, KV caching, activation, layers, etc all play a role. It's annoying.

But anyway, for this model, you need 40+ GB of VRAM. System RAM isn't going to cut it unless it's unified RAM on Apple Silicon, and even then, memory bandwidth is shot, so inference is much much slower than GPU/TPU.

cellis•2h ago
Also, I think you need a 40 GB "card", not just 40 GB of VRAM. I wrote about this upthread; you're probably going to need one card, and I'd be surprised if you could chain several GPUs together.
rapfaria•1h ago
Not sure what you mean, or if you're new to LLMs, but two RTX 3090s will work for this, and even lower-end cards (RTX 3060) will once it's GGUF'd.
karolist•1h ago
do you mean https://github.com/pollockjj/ComfyUI-MultiGPU? One GPU would do the computation, but others could pool in for VRAM expansion, right? (I've not used this node)
liuliu•2h ago
16GiB RAM with 8-bit quantization.

This is a slightly scaled up SD3 Large model (38 layers -> 60 layers).

philipkiely•46m ago
For prod inference, 1xH100 is working well.
ethan_smith•12m ago
Qwen-Image requires at least 24GB VRAM for the full model, but you can run the 4-bit quantized version with ~8GB VRAM using libraries like AutoGPTQ.
oceanplexian•3h ago
Does anyone know how they actually trained text rendering into these models?

To me they all seem to suffer from the same artifacts, that the text looks sort of unnatural and doesn't have the correct shadows/reflections as the rest of the image. This applies to all the models I have tried, from OpenAI to Flux. Presumably they are all using the same trick?

yorwba•2h ago
It's on page 14 of the technical report. They generate synthetic data by putting text on top of an image, apparently without taking the original lighting into account. So that's the look the model reproduces. Garbage in, garbage out.

Maybe in the future someone will come up with a method for putting realistic text into images so that they can generate data to train a model for putting realistic text into images.
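
Roughly the kind of synthetic pair I mean (a minimal PIL sketch, not the report's actual pipeline):

    from PIL import Image, ImageDraw, ImageFont

    # Paste caption text straight onto a base image, with no regard for the
    # scene's lighting or geometry, which is exactly the look the model learns
    img = Image.open("base.jpg").convert("RGB")
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype("DejaVuSans.ttf", 48)  # any available font file
    draw.text((50, 50), "Hello, world", font=font, fill="white",
              stroke_width=2, stroke_fill="black")
    img.save("synthetic_text_sample.jpg")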

doctorpangloss•2h ago
i'm not sure if that's such garbage as you suggest, surely it is helpful for generalization yes? kind of the point of self-supervised models
halJordan•2h ago
If you think diffusing legible, precise text from pure noise is garbage then wtf are you doing here. The arrogance of the IT crowd can be staggering at times.
Maken•58m ago
Wouldn't it make sense to use rendered images for that?
sampton•2h ago
Short Canva.
esafak•2h ago
Team Qwen: Please stop ripping off Studio Ghibli to demo your product.
Destiner•2h ago
The text rendering is impressive, but I don't understand the value — wouldn't it be easier to add any text that you like in Figma?
doctorpangloss•2h ago
the value is: the absence of text where you expect it, and the presence of garbled text, are dead giveaways of AI generation. i'm not sure why you are being downvoted, compositing text seems like a legitimate alternative.
sipjca•1h ago
it seems like the value is that you don't need another tool to composite the text. especially for users who aren't aware of figma/photoshop nor how to use them (many many many people)
Uehreka•2h ago
I’m interested to see what this model can do, but also kinda annoyed at the use of a Studio Ghibli style image as one of the first examples. Miyazaki has said over and over that he hates AI image generation. Is it really so much to ask that people not deliberately train LoRAs and finetunes specifically on his work and use them in official documentation?

It reminds me of how CivitAI is full of “sexy Emma Watson” LoRAs, presumably because she very notably has said she doesn’t want to be portrayed in ways that objectify her body. There’s a really rotten vein of “anti-consent” pulsing through this community, where people deliberately seek out people who have asked to be left out of this and go “Oh yeah? Well there’s nothing you can do to stop us, here’s several terabytes of exactly what you didn’t want to happen”.

aabhay•2h ago
Seems a bit drastic to compare Ghibli style transfer to revenge porn, but you do you I guess.
Uehreka•2h ago
It’s the anti-consent thing that ties them together. The idea of “You asked us to leave you alone, which is why we’re targeting you.”
littlestymaar•1h ago
Why are you talking about revenge porn here?
topato•2h ago
I mean, did you really expect anything more from the internet? Maybe I'm wrong, but hentai, erotic roleplay, and nudify applications seem to still represent a massive portion of AI use cases. At least in the case of ero RP, perhaps the exploitation of people for pornography might be lessened....
Uehreka•1h ago
I get that if you can imagine something, it exists, and also there is porn of it.

What disappoints me is how aligned the whole community is with its worst exponents. That someone went “Heh heh, I’m gonna spend hours of my day and hundreds/thousands of dollars in compute just to make Miyazaki sad.” and then influencers in the AI art space saw this happen and went “Hell yeah let’s go” and promoted the shit out of it making it one of the few finetunes to actually get used by normies in the mainstream, and then leaders in this field like the Qwen team went “Yeah sure let’s ride the wave” and made a Studio Ghibli style image their first example.

I get that there was no way to physically stop a Studio Ghibli LoRA from existing. I still think the community’s gleeful reaction to it has been gross.

bongodongobob•40m ago
Whatever. "Studio Ghibli style" is so loose of a definition to begin with. You can't own a "style" anyway. Tough cookies.
Zopieux•1h ago
Welcome to the internet, which is for porn (and cat pictures).
artninja1988•2h ago
How censored is it?
Zopieux•1h ago
I love that this is the only thing the community wants to know at every announcement of a new model, but no organization wants to face the crude reality of human nature.

That, and the weird prudishness of most american people and companies.

vunderba•2h ago
Good release! I've added it to the GenAI Showdown site. Overall a pretty good model, scoring around 40%, and it definitely represents SOTA for something that could be reasonably hosted on consumer GPU hardware (even more so when it's quantized).

That being said, it still lags pretty far behind OpenAI's gpt-image-1 strictly in terms of prompt adherence for txt2img prompting. However as has already been mentioned elsewhere in the thread, this model can do a lot more around editing, etc.

https://genai-showdown.specr.net

masfuerte•1h ago
> In this case, the paper is less than one-tenth of the entire image, and the paragraph of text is relatively long, but the model still accurately generates the text on the paper.

Nope. The text includes the line "That dawn will bloom" but the render reads "That down will bloom", which is meaningless.

sciencesama•39m ago
What's the lowest-end graphics card that can run this self-hosted with reasonable output?

Show HN: I spent 6 years building a ridiculous wooden pixel display

https://benholmen.com/blog/kilopixel/
644•benholmen•7h ago•95 comments

Is It FOSS?

https://isitreallyfoss.com/
69•exiguus•2h ago•12 comments

Qwen-Image: Crafting with native text rendering

https://qwenlm.github.io/blog/qwen-image/
250•meetpateltech•7h ago•75 comments

NASA's Curiosity picks up new skills

https://www.jpl.nasa.gov/news/marking-13-years-on-mars-nasas-curiosity-picks-up-new-skills/
78•Bluestein•4h ago•26 comments

How we made JSON.stringify more than twice as fast

https://v8.dev/blog/json-stringify
134•emschwartz•9h ago•25 comments

What Does One Billion Dollars Look Like?

https://whatdoesonebilliondollarslooklike.website/
23•alexrustic•2h ago•18 comments

EconTeen – Financial Literacy Lessons and Tools for Teens

https://econteen.com/
5•Chrisjackson4•27m ago•1 comments

Show HN: I've been building an ERP for manufacturing for the last 3 years

https://github.com/crbnos/carbon
6•barbinbrad•1h ago•0 comments

Indian Sign Painting: A typeface designer's take on the craft

https://bl.ag/indian-sign-painting-a-typeface-designers-take-on-the-craft/
101•detaro•2d ago•16 comments

Content-Aware Spaced Repetition

https://www.giacomoran.com/blog/content-aware-sr/
61•ran3000•4h ago•15 comments

Job-seekers are dodging AI interviewers

https://fortune.com/2025/08/03/ai-interviewers-job-seekers-unemployment-hiring-hr-teams/
475•robtherobber•15h ago•732 comments

Hiroshima (1946)

https://www.newyorker.com/magazine/1946/08/31/hiroshima
25•pseudolus•2d ago•16 comments

OpenIPC: Open IP Camera Firmware

https://openipc.org/
181•zakki•3d ago•105 comments

Cellular Starlink expands to support IoT devices

https://me.pcmag.com/en/networking/31452/spacexs-cellular-starlink-expands-to-support-iot-devices
57•teleforce•3d ago•38 comments

Once a death sentence, cardiac amyloidosis is finally treatable

https://www.nytimes.com/2025/08/04/well/cardiac-amyloidosis.html
76•elektor•3h ago•2 comments

DrawAFish.com Postmortem

https://aldenhallak.com/blog/posts/draw-a-fish-postmortem.html
221•hallak•11h ago•52 comments

How we built Bluey’s world

https://www.itsnicethat.com/features/how-we-built-bluey-s-world-cartoon-background-scenery-art-director-catriona-drummond-animation-090725
299•skrebbel•3d ago•137 comments

Thingino: Open-Source Firmware for IP Cameras

https://thingino.com/
5•zakki•1h ago•2 comments

Perplexity is using stealth, undeclared crawlers to evade no-crawl directives

https://blog.cloudflare.com/perplexity-is-using-stealth-undeclared-crawlers-to-evade-website-no-crawl-directives/
903•rrampage•9h ago•525 comments

AWS European Sovereign Cloud to be operated by EU citizens

https://www.aboutamazon.eu/news/aws/aws-european-sovereign-cloud-to-be-operated-by-eu-citizens
48•pulisse•2h ago•42 comments

A deep dive into Rust and C memory interoperability

https://notashes.me/blog/part-1-memory-management/
127•hyperbrainer•8h ago•58 comments

What Can a Cell Remember?

https://www.quantamagazine.org/what-can-a-cell-remember-20250730/
42•chapulin•4d ago•4 comments

Customizing tmux

https://evgeniipendragon.com/posts/customizing-tmux-and-making-it-less-dreadful/
75•EPendragon•7h ago•71 comments

My Ideal Array Language

https://www.ashermancinelli.com/csblog/2025-7-20-Ideal-Array-Language.html
109•bobajeff•10h ago•50 comments

Show HN: Sidequest.js – Background jobs for Node.js using your database

https://docs.sidequestjs.com/quick-start
42•merencia•7h ago•11 comments

Read your code

https://etsd.tech/posts/rtfc/
156•noeclement•10h ago•90 comments

Century-old stone “tsunami stones” dot Japan's coastline (2015)

https://www.smithsonianmag.com/smart-news/century-old-warnings-against-tsunamis-dot-japans-coastline-180956448/
124•deegles•10h ago•43 comments

Objects should shut up

https://dustri.org/b/objects-should-shut-the-fuck-up.html
263•gm678•9h ago•204 comments

Show HN: Tiny logic and number games I built for my kids

https://quizmathgenius.com/
66•min2bro•8h ago•25 comments

Is the interstellar object 3I/ATLAS alien technology? [pdf]

https://lweb.cfa.harvard.edu/~loeb/HCL25.pdf
72•jackbravo•10h ago•94 comments