Qwen3.5: Towards Native Multimodal Agents

81•danielhanchen•4h ago

Comments

danielhanchen•4h ago

For those interested, made some MXFP4 GGUFs at https://huggingface.co/unsloth/Qwen3.5-397B-A17B-GGUF and a guide to run them: https://unsloth.ai/docs/models/qwen3.5

ggcr•3h ago

From the HuggingFace model card [1] they state:

> "In particular, Qwen3.5-Plus is the hosted version corresponding to Qwen3.5-397B-A17B with more production features, e.g., 1M context length by default, official built-in tools, and adaptive tool use."

Does anyone know more about this? The OSS version seems to have a 262144 context length; I guess for the 1M they'll ask you to use YaRN?

[1] https://huggingface.co/Qwen/Qwen3.5-397B-A17B

danielhanchen•3h ago

Unsure, but yes, most likely they use YaRN, and maybe trained a bit more on long context (or not).

NitpickLawyer•3h ago

Yes, it's described in this section - https://huggingface.co/Qwen/Qwen3.5-397B-A17B#processing-ult...

YaRN, but with some caveats: current implementations might reduce performance on short contexts, so only enable YaRN for long-context tasks.

Interesting that they're serving both on openrouter, and the -plus is a bit cheaper for <256k ctx. So they must have more inference goodies packed in there (proprietary).

We'll see where the 3rd party inference providers will settle wrt cost.
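
For a rough sense of the scaling involved, here is a back-of-the-envelope sketch. It assumes the YaRN factor is simply the target length divided by the native length, rounded up; the `rope_scaling` dict below is an illustrative layout and is not copied from the model card, so check the linked section for the exact settings.

```python
import math

NATIVE_CONTEXT = 262_144    # native context length quoted in this thread
TARGET_CONTEXT = 1_000_000  # "1M" target, taken as a round number (assumption)


def yarn_factor(native: int, target: int) -> float:
    """Ratio by which the native context must be stretched to reach the target."""
    if native <= 0 or target <= 0:
        raise ValueError("context lengths must be positive")
    return max(1.0, target / native)


factor = yarn_factor(NATIVE_CONTEXT, TARGET_CONTEXT)

# Illustrative config only: key names and placement are assumptions,
# not the documented settings for this model.
rope_scaling = {
    "rope_type": "yarn",
    "factor": float(math.ceil(factor)),
    "original_max_position_embeddings": NATIVE_CONTEXT,
}

print(f"raw ratio: {factor:.3f}, rounded factor: {rope_scaling['factor']}")
```

Since the scaling applies regardless of prompt length in a static setup, this also shows why you would leave it off for short tasks.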

ggcr•2h ago

Thanks, I've totally missed that

It's basically the same as with the Qwen2.5 and 3 series, but this time with 1M context and 262k native, yay :)

mynti•2h ago

Does anyone know what kind of RL environments they are talking about? They mention they used 15k environments. I can think of a couple hundred maybe that make sense to me, but what is filling that large number?

yorwba•1h ago

Every interactive system is a potential RL environment. Every CLI, every TUI, every GUI, every API. If you can programmatically take actions to get a result, and the actions are cheap, and the quality of the result can be measured automatically, you can set up an RL training loop and see whether the results get better over time.
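
A toy sketch of that loop, with everything in it made up for illustration: `ArithmeticEnv` is a hypothetical environment whose result can be scored automatically, and an epsilon-greedy bandit over three fixed strategies stands in for a real RL algorithm.

```python
import random


class ArithmeticEnv:
    """Toy environment: each episode asks for a + b and scores the answer automatically."""

    def __init__(self, rng: random.Random):
        self.rng = rng
        self.a = 0
        self.b = 0

    def reset(self):
        # Cheap to generate: a fresh task per episode.
        self.a, self.b = self.rng.randint(1, 9), self.rng.randint(1, 9)
        return self.a, self.b

    def step(self, answer: int) -> float:
        # Automatic verification: reward 1.0 only for the correct sum.
        return 1.0 if answer == self.a + self.b else 0.0


# The "policy" picks one of these strategies; only "add" solves the task reliably.
STRATEGIES = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}


def train(episodes: int = 2000, seed: int = 0, lr: float = 0.1, explore: float = 0.1):
    rng = random.Random(seed)
    env = ArithmeticEnv(rng)
    value = {name: 0.0 for name in STRATEGIES}  # running reward estimate per strategy
    for _ in range(episodes):
        a, b = env.reset()
        if rng.random() < explore:
            name = rng.choice(list(STRATEGIES))  # explore
        else:
            name = max(value, key=value.get)     # exploit the best estimate so far
        reward = env.step(STRATEGIES[name](a, b))
        value[name] += lr * (reward - value[name])
    return value


values = train()
best = max(values, key=values.get)
print(best, values)
```

The three ingredients from the comment map directly onto the code: `reset`/`step` are the programmatic actions, episodes cost almost nothing, and `step` measures the result without a human in the loop.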

robkop•1h ago

Rumours say you do something like:

  Download every github repo
    -> Classify if it could be used as an env, and what types
      -> Issues and PRs are great for coding RL envs
      -> If the software has a UI, awesome, UI env
      -> If the software is a game, awesome, game env
      -> If the software has xyz, awesome, ...
    -> Do more detailed run checks
      -> Can it build
      -> Is it complex and/or distinct enough
      -> Can you verify if it reached some generated goal
      -> Can generated goals even be achieved
      -> Maybe some human review - maybe not
    -> Generate goals
      -> For a coding env, you can imagine having an LLM introduce a new bug and checking that test cases now fail. The goal for the model is then to fix it
    ... Do the rest of the normal RL env stuff
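
The classify-then-check part of the steps above could be sketched like this. It is purely illustrative: the `Repo` fields and the functions `classify`, `passes_checks` and `build_envs` are hypothetical names, and real checks (building, goal verification) would run actual tooling rather than read boolean flags.

```python
from dataclasses import dataclass


@dataclass
class Repo:
    """Hypothetical summary of a repository; every field is an assumption."""
    name: str
    has_issues: bool = False
    has_ui: bool = False
    is_game: bool = False
    builds: bool = False
    has_tests: bool = False


def classify(repo: Repo) -> list[str]:
    """Decide which kinds of environment a repo could become."""
    kinds = []
    if repo.has_issues:
        kinds.append("coding")
    if repo.has_ui:
        kinds.append("ui")
    if repo.is_game:
        kinds.append("game")
    return kinds


def passes_checks(repo: Repo) -> bool:
    """Stand-in for the detailed run checks: it must build and be verifiable."""
    return repo.builds and repo.has_tests


def build_envs(repos: list[Repo]) -> list[tuple[str, list[str]]]:
    """Keep only repos that have at least one env type and pass the checks."""
    envs = []
    for repo in repos:
        kinds = classify(repo)
        if kinds and passes_checks(repo):
            envs.append((repo.name, kinds))
    return envs


candidates = [
    Repo("web-app", has_issues=True, has_ui=True, builds=True, has_tests=True),
    Repo("broken-lib", has_issues=True, builds=False, has_tests=True),
    Repo("notes", builds=True, has_tests=True),
]
envs = build_envs(candidates)
print(envs)
```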

NitpickLawyer•46m ago

The real real fun begins when you consider that with every new generation of models + harnesses they become better at this. Where better can mean better at sorting good / bad repos, better at coming up with good scenarios, better at following instructions, better at navigating the repos, better at solving the actual bugs, better at proposing bugs, etc.

So then the next next version is even better, because it got more data / better data. And it becomes better...

This is mainly why we're seeing so many improvements, so fast (month to month now, versus every 3 months ~6 months ago, and every 6 months ~1 year ago). It becomes a literal "throw money at the problem" type of improvement.

For anything that's "verifiable" this is going to continue. For anything that is not, things can also improve with concepts like "llm as a judge" and "council of llms". Slower, but it can still improve.

alex43578•34m ago

Judgement-based problems are still tough - LLM as a judge might just bake those earlier models' biases even deeper. Imagine if ChatGPT judged photos: anything yellow would win.

NitpickLawyer•5m ago

Agreed. Still tough, but my point was that we're starting to see that combining methods works. The models are now good enough to create rubrics for judgement stuff. Once you have rubrics you have better judgements. The models are also better at taking pages / chapters from books and "judging" based on those (think logic books, etc). The key is that capabilities become additive, and once you unlock something, you can chain that with other stuff that was tried before. That's why test time + longer context -> IMO improvements on stuff like theorem proving. You get to explore more, combine ideas and verify at the end. Something that was very hard before (i.e. very sparse rewards) becomes tractable.

isusmelj•1h ago

Is it just me or is the page barely readable? Lots of text is light grey on white background. I might have "dark" mode on on Chrome + MacOS.

dryarzeg•1h ago

I'm using Firefox on Linux, and I see the white text on dark background.

> I might have "dark" mode on on Chrome + MacOS.

Probably that's the reason.

Jacques2Marais•51m ago

Yes, I also see that (also using dark mode on Chrome without Dark Reader extension). I sometimes use the Dark Reader Chrome extension, which usually breaks sites' colours, but this time it actually fixes the site.

thunfischbrot•44m ago

That seems fine to me. I am more annoyed at the 2.3MB sized PNGs with tabular data. And if you open them at 100% zoom they are extremely blurry.

What kind of workflow led to that?

bertili•1h ago

Last Chinese New Year we would not have predicted a Sonnet 4.5 level model that runs locally and fast on a 2026 M5 Max MacBook Pro, but it's now a real possibility.

lostmsu•52m ago

Will 2026 M5 MacBook come with 390+GB of RAM?

bertili•36m ago

Most certainly not, but the Unsloth MLX fits in 256GB.

embedding-shape•26m ago

Curious what the prefill and token generation speeds are. Apple hardware already seems embarrassingly slow for the prefill step, and OK for token generation, but that's with way smaller models (1/4 the size), so at this size? It might fit, but I'm guessing it might be anything but usable, sadly.

alex43578•36m ago

Quants will push it below 256GB without completely lobotomizing it.
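
Rough arithmetic behind the numbers in this subthread, as a sketch: it counts weights only, assuming 397B total parameters (from the model name) and a uniform bit width. KV cache, activations and runtime overhead are ignored, and real quant formats carry some per-block overhead, so effective bits per weight are typically a bit above the nominal figure.

```python
TOTAL_PARAMS = 397e9  # total parameter count implied by "397B" in the model name


def weight_memory_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return params * bits_per_weight / 8 / 1e9


for bits in (16, 8, 4, 3):
    print(f"{bits:>2} bits/weight -> ~{weight_memory_gb(TOTAL_PARAMS, bits):.1f} GB")
```

At 8 bits per weight this lands at roughly 397 GB, which matches the "390+GB" figure above, while around 4 bits per weight it drops to about 200 GB and so fits under 256GB with some headroom.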

echelon•15m ago

I hope China keeps making big open weights models. I'm not excited about local models. I want to run hosted open weights models on server GPUs.

People can always distill them.

Alifatisk•59m ago

Wow, the Qwen team is pushing out content (models + research + blog posts) at an incredible rate! Looks like omni-modal models are their focus? The benchmarks look intriguing, but I can't stop thinking of the HN comments about Qwen being known for benchmaxing.

trebligdivad•51m ago

Anyone else getting an automatically downloaded PDF 'ai report' when clicking on this link? It's damn annoying!

ddtaylor•47m ago

Does anyone know the SWE bench scores?

simonw•43m ago

Pelican is OK, not a good bicycle: https://gist.github.com/simonw/67c754bbc0bc609a6caedee16fef8...

tarruda•19m ago

At this point I wouldn't be surprised if your pelican example has leaked into most training datasets.

I suggest switching to a new SVG challenge, hopefully one that makes even Gemini 3 Deep Think fail ;D

embedding-shape•15m ago

How many times do you run the generation, and how do you choose which example to ultimately post and share with the public?

canadiantim•7m ago

gunalx•42m ago

Sad not to see smaller distills of this model released alongside the flagship. That has historically been why I liked Qwen releases (lots of different sizes to pick from, right from day one).

woadwarrior01•17m ago

Judging by the code in the HF transformers repo[1], smaller dense versions of this model will most likely be released at some point. Hopefully, soon.

[1]: https://github.com/huggingface/transformers/tree/main/src/tr...

dash2•30m ago

You'll be pleased to know that it chooses "drive the car to the wash" on today's latest embarrassing LLM question.

WithinReason•13m ago

Is that the new pelican test?

tarruda•23m ago

Would love to see a Qwen 3.5 release in the range of 80-110B, which would be perfect for 128GB devices. While Qwen3-Next is 80B, it unfortunately doesn't have a vision encoder.

lollobomb•18m ago

Yes, but does it answer questions about Tiananmen Square?
