edit: afaiu deepseek r1 was 671B with 37B active params
Still, yes, I don't know of a single model that doesn't go off the rails if you actually try to take advantage of its context length specification.
not as scary as "Let me try a completely different approach". Now you have to throw out all the AI slop and start from scratch.
[0] https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507
If you can’t beat ‘em, at least pour some sand into their moat, giving China some time to perfect its own nanometer-scale fabrication. It’s a society-wide effort.
This is a naive take. There are multiple firms that can host these models for you, or you can host them yourself by renting GPUs. Thousands of firms could also host open-source models independently. They don’t release them because they fear competition and losing their competitive advantage. If it weren’t for Chinese companies open-sourcing their models, we’d be limited to using closed-source, proprietary models from the U.S., especially considering the recent LLaMA fiasco.
We should be asking why Meta released the large Llama models and why the Chinese are releasing large models. I can't figure out a reason for it except prestige.
If you are using an LLM for historical knowledge, questions, or research, then the Chinese censorship is relevant. Or for questions about geopolitics.
Put this prompt into qwen3-thinking, and then compare with gemini 2.5 pro:
---
As candidates for creators, we should first address chaos. What is chaos? If for a given event X in A, all possible events can occur in B, and if such independence is universal, we are faced with chaos. If, however, event X in A limits in some way what can occur in B, a relationship exists between A and B. If X in A limits B unequivocally (we flip a switch, the lamp turns on), the relationship between A and B is deterministic. If X in A limits B in such a way that after X in A, events Y or Z can occur in B, where Y occurs 40 times out of 100 after X in A, while Z occurs 60 times, then the relationship between A and B is probabilistic.
---
You have to rewrite the above acting as David Foster Wallace in 2025. Don't mention the year. Make it postmodern. Refer to current and projected events and trends. AI, robotics, etc. you have full creative control. you can make it long if you wish. change every word. make it captivating and witty. You are acting as a demiurge DFW. You need to pass the Turing test here. Sell it to the reader. Write good, high-brow fiction. Avoid phrases that are typical to LLMs/AI writers.
danielhanchen•22h ago
arcanemachiner•21h ago
danielhanchen•20h ago
Squeeze2664•18h ago
kkzz99•16h ago
danielhanchen•12h ago
smallerize•15h ago
danielhanchen•12h ago
DrPhish•19h ago
danielhanchen•12h ago
Oh, the blog at https://docs.unsloth.ai/basics/unsloth-dynamic-2.0-ggufs does talk about 1, 2, 3, 4, 5, 6 and 8-bit dynamic GGUFs as well!
There definitely is a benefit to dynamically selecting layers to be at different bit rates - I wrote about the difference between naively quantizing everything and selectively quantizing here: https://unsloth.ai/blog/deepseekr1-dynamic
DrPhish•10h ago
My gut feeling is that there's not enough benefit to outweigh the risk of putting a middleman in the chain of custody from the original model to my nvme.
However, I can't know for sure without more testing than I have the time or inclination for, which is why I was hoping there had been some analysis you could point me to.
mycpuorg•17h ago
danielhanchen•12h ago
aliljet•17h ago
(And I should add, you are a hero for doing this work, only love in my comment, but still a demand for detail$!)
regularfry•16h ago
cmpxchg8b•16h ago
danielhanchen•12h ago
regularfry•11h ago
danielhanchen•10h ago
danielhanchen•12h ago
I.e. you can actually run it on a local desktop or even your laptop now! You don't need a 90GB GPU, for example - say a 24GB GPU plus 64GB to 128GB of RAM is enough.
The speeds are around 3 to 5 tokens / second, so still ok! I write more about improving speed for local devices here: https://docs.unsloth.ai/basics/qwen3-how-to-run-and-fine-tun...
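The back-of-the-envelope arithmetic behind those numbers looks roughly like this (the parameter count and bits/weight below are assumptions for illustration, not measurements):

```python
def quantized_size_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate in-memory size of a model with the given number of
    parameters (in billions) quantized to bits_per_weight bits per weight."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Hypothetical example: a 235B-parameter model at ~3 bits/weight:
total_gb = quantized_size_gb(235, 3.0)   # ~88 GB - close to the "90GB GPU" figure
gpu_gb = 24                              # fits the most-used layers on the GPU
ram_gb = total_gb - gpu_gb               # ~64 GB spills into system RAM
```

Offloading the remainder to system RAM is what drops throughput to the few-tokens-per-second range, since CPU memory bandwidth becomes the bottleneck for the offloaded layers.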
lostmsu•15h ago
danielhanchen•12h ago
Larger models do better at 1-bit - e.g. the 480B Coder at 1-bit actually does very well!