As this is in trillions, where does this amount of material come from?
wonder at what price
I like Gemini 2.5 Pro a lot bc it's fast af, but it sometimes struggles to use tools effectively and make edits once the context is half used, and it breaks a lot of shit (on Cursor)
Edit: The larger models have 128k context length. The 32k thinking budget comes from the chart, which looks like it's for the 235B, so not the full length.
Out of all the Qwen3 models on Hugging Face, it's the most downloaded/hearted. https://huggingface.co/collections/Qwen/qwen3-67dd247413f0e2...
My concern with these models, though, is that the architectures seem to vary a bit, so I don't know how well it'll work.
We're really getting close to the point where local models are good enough to handle practically every task that most people need to get done.
Again, very rough numbers; there are calculators online.
Of course, there are lots of factors that can change the RAM usage: quantization, context size, KV cache. And this says nothing about whether the model will respond quickly enough to be pleasant to use.
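For the curious, the rough math is: weights at the quantized precision, plus a KV cache that grows with context. A minimal sketch (the architecture numbers below are made-up placeholders, not any particular model's config):

```go
package main

import "fmt"

// Rough VRAM estimate: weights at quantized precision plus the KV cache.
// All architecture numbers here are illustrative placeholders, not a real config.
func main() {
	const (
		params       = 8e9 // 8B dense model
		bitsPerParam = 4   // Q4 quantization
		nLayers      = 36
		nKVHeads     = 8
		headDim      = 128
		ctxLen       = 8192
		kvBytes      = 2 // fp16 KV cache entries
	)
	weightsGB := float64(params) * bitsPerParam / 8 / 1e9
	// K and V each store nKVHeads*headDim values per layer per token.
	kvGB := float64(2*nLayers*nKVHeads*headDim*ctxLen*kvBytes) / 1e9
	fmt.Printf("weights ~%.1f GB + KV cache ~%.1f GB = ~%.1f GB\n",
		weightsGB, kvGB, weightsGB+kvGB)
}
```

Even this overstates the precision: runtime overhead, activation buffers, and mixed quant types all shift the totals a bit.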
After trying to implement a simple assistant/helper with GPT-4.1 and getting some dumb behavior from it, I doubt even proprietary models are good enough for every task.
On the other hand, LLM progress sometimes feels like bullshit: benchmark gaming and other problems have occurred. So either in two years we all hail our AGI/AMI (machine intelligence) overlords, or the bubble bursts.
They do improve on literally all of these, at incredible speed and without much sign of slowing down.
Are you asking for a technical innovation that will just get from 0 to perfect AI? That is just not how reality usually works. I don't see why AI, of all things, should be the exception.
As for other possible technologies, I'm most excited about clone-structured causal graphs[1].
What's very special about them is that they are apparently a 1:1 algorithmic match for what happens in the hippocampus during learning[2]. To my knowledge this is the first time an actual end-to-end algorithm has been replicated from the brain in a field other than vision.
[1] "Clone-structured graph representations enable flexible learning and vicarious evaluation of cognitive maps" https://www.nature.com/articles/s41467-021-22559-5
[2] "Learning produces an orthogonalized state machine in the hippocampus" https://www.nature.com/articles/s41586-024-08548-w
Until now I found that open weight models were either not as good as their proprietary counterparts or too slow to run locally. This looks like a good balance.
I don't know much about the benchmarks, but the two coding ones look similar.
~34 tok/s on a Radeon RX 7900 XTX under today's Debian 13.
https://ollama.com/library/qwen3:235b-a22b-q4_K_M
Is there a smaller one?
that's 3x h100?
E.g. I have a Quadro RTX 4000 with 8GB VRAM, and looking at all the models at https://ollama.com/search in all their different sizes, I'm absolutely at a loss as to which model at which size would be fast enough. There's no point in downloading the latest, biggest model if it will output 1 tok/min, but I also don't want to settle for the smallest model if I don't have to.
Any advice?
A basic thing to remember: any given dense model requires X GB of memory at 8-bit quantization, where X is the number of params in billions (simplifying a little by not counting context size). Quantization is just the 'precision' of the model; 8-bit generally works really well. Generally speaking, it's not worth bothering with models whose param count exceeds your hardware's VRAM. Some people get around that by using a 4-bit quant, trading some precision for half the VRAM. YMMV depending on use case.
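A quick worked version of that rule (weights only, ignoring KV cache and runtime overhead; the model sizes are just examples):

```go
package main

import "fmt"

// Weight-only size: params (in billions) * bits per weight / 8 gives GB.
// Ignores KV cache and runtime overhead; real GGUF quants also mix types.
func weightGB(paramsB, bits float64) float64 { return paramsB * bits / 8 }

func main() {
	for _, b := range []float64{8, 14, 32} {
		fmt.Printf("%2.0fB params: ~%2.0f GB at 8-bit, ~%2.0f GB at 4-bit\n",
			b, weightGB(b, 8), weightGB(b, 4))
	}
}
```

By that math, an 8GB card like the Quadro RTX 4000 above points at roughly an 8B model at 4-bit, which lines up with the qwen3:8b-q4_K_M link further down.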
I know this is crazy to hear, because the big iron folks still debate 16 vs 32, and 8 vs 16 is near verboten in public conversation.
I contribute to llama.cpp and have seen many, many efforts to measure evaluation perf of various quants, and no matter which way it was sliced (ranging from subjective volunteers doing A/B voting on responses over months, to objective perplexity loss), Q4 is indistinguishable from the original.
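For context on the objective side: perplexity is just exp of the mean negative log-likelihood over a held-out text, and the comparison is that number for a Q4 quant vs. the fp16 original. A minimal sketch (the per-token log-probs are made up, not actual llama.cpp output):

```go
package main

import (
	"fmt"
	"math"
)

// Perplexity = exp(mean negative log-likelihood over tokens). Comparing this
// between a Q4 quant and the fp16 original is the usual objective test.
func perplexity(logProbs []float64) float64 {
	var nll float64
	for _, lp := range logProbs {
		nll -= lp
	}
	return math.Exp(nll / float64(len(logProbs)))
}

func main() {
	fp16 := []float64{-2.1, -0.3, -1.7, -0.9} // stand-in per-token log-probs
	q4 := []float64{-2.2, -0.3, -1.8, -0.9}
	fmt.Printf("fp16 ppl %.3f vs Q4 ppl %.3f\n", perplexity(fp16), perplexity(q4))
}
```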
Which may make it sound more complicated than it should be, when it ought to be back-of-the-napkin, but there are just too many nuances to perf.
Really generally, at this point I expect 4B at 10 tkn/s on a smartphone with 8GB of RAM from 2 years ago. I'd expect you'd get somewhat similar results; my guess would be ~6 tkn/s at 4B (assuming the rest of the HW is 2018-era and you'll rely on GPU inference and RAM).
Not simply, no.
But start with a parameter count close to but less than your VRAM, decide whether performance is satisfactory, and move from there. There are various ways to trade quality for fit: quantizing the model, or not loading all of it into VRAM and accepting slower inference.
I've run Llama and Gemma 3 on a base Mac mini and it's pretty decent for text processing. It has 16GB of RAM though, most of which is used by the GPU during inference. You need more juice for image stuff.
My son's gaming box has a 4070, and it was about 25% faster the last time I compared.
The mini is so cheap it’s worth trying out - you always find another use for it. Also the M4 sips power and is silent.
https://ollama.com/library/qwen3:8b-q4_K_M
For fast inference, you want a model that will fit in VRAM, so that none of the layers need to be offloaded to the CPU.
4-bit was "fine", but a smaller 8-bit version beat it in quality at the same speed.
Aside from https://huggingface.co/blog/leonardlin/chinese-llm-censorshi... I haven't seen a great deal of research into this.
Has this turned out to be less of an issue for practical applications than was initially expected? Are the models just not censored in the way that we might expect?
(the steelman here, ofc, is "the screenshots drove buzz which drove usage!", but it's sort of steel thread in context, we'd still need to pull in a time machine and a very odd unmet US consumer demand for models that toe the CCP line)
I am not claiming it was intentional, but it certainly magnified the media attention. Maybe luck and not 4d chess.
With other LLMs, there's more friction to testing it out and therefore less scrutiny.
For a public chatbot service, all Chinese vendors have their own censorship tech (or just use censorship-as-a-service from a cloud; all major clouds in China have one), because you ultimately need it for UGC anyway. So why not censor LLM output with the same stack, too?
On their online platform I've hit a political block exactly once in months of use. I was asking it something about revolutions in various countries and it noped out of that.
I'd prefer a model that doesn't have this issue at all, but if I have a choice between a good Apache-licensed Chinese model and a less good, say, Meta-licensed one, I'll take the Chinese one every time. I just don't ask LLMs enough politically relevant questions for it to matter.
To be fair, maybe that take is the LLM equivalent of "I have nothing to hide" on surveillance.
With that said, they're in a fight for dominance so censoring now would be foolish. If they win and establish a monopoly then the screws will start to turn.
This is what gpt-5 was supposed to have right? How is this implemented under the hood? Since non-thinking mode is just an empty chain-of-thought, why can't any reasoning model be used in a "non-thinking mode"?
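For what it's worth, the usual trick (and roughly what Qwen3's published chat template appears to do; the exact tag layout here is an assumption) is to prefill an empty think block, so the model starts decoding directly at the answer:

```go
package main

import "fmt"

// Soft-switch sketch: thinking mode leaves the <think> block for the model to
// fill; non-thinking mode prefills it empty so decoding starts at the answer.
// Tag layout is assumed from Qwen3's ChatML-style template.
func buildPrompt(user string, thinking bool) string {
	p := "<|im_start|>user\n" + user + "<|im_end|>\n<|im_start|>assistant\n"
	if !thinking {
		p += "<think>\n\n</think>\n\n" // empty chain-of-thought, prefilled
	}
	return p
}

func main() {
	fmt.Println(buildPrompt("What is 2+2? /no_think", false))
}
```

Which also suggests why an arbitrary reasoning model can't just be run this way: unless empty think blocks were in its training mix, the prefill is out of distribution.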
https://developers.googleblog.com/en/start-building-with-gem...
"Write a Golang program that merges huge presorted text files, just like sort -m does". Quite often models need "use heap" as guidance, but this time big model figured it out by itself.
Then I asked Qwen to translate the output, and it went into a loop spouting some horror-movie lines:
"I figured out everything backwards." "This is the official most favorite office." "Net comes to the most official offices." "This float is specific." "Double meanings: What is 'queset à la rispostat'?" "What was reversed in real-time character study is specific." "This is an open date for a piracy group." "This is an open date for a piracy group (this is a period statement)." "Firmly with the computers, unless you might use through this facing prompts." "Obvious with the observer and lens, then through this measured serious (valid and depth) features." "Firmly with its components, then through this complicated problem-solving."
"This is the most intense friend's expectation." "This is the most intense friend's expectation." "Science is implementing friends." "Science is implementing friends." "Science is implementing friends." "Science is the most intense friend." "Science is the most intense friend."
.... and finally I asked DeepSeek V3 to translate it
.... and DeepSeek got stuck in a loop, writing:
Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. Science is friend. S
Here is the reversed text:
"Science is friends. Science is silent friends. Science is implacable friends. Science is most silent friends. This silent summer of friends. Observable and evident with this measured (valid and depth) Server. Simply with the components are possible, then though this moderation. Use might you unless through this facing prompts. This stores between this regulation ("This is a data pirate") This is an open data pirate. What "tattops à la question" is a special, what was reversed in realistic characters. Al is a science of favorite cognition is science. Prov. Connect Nine Nine. Open a data pirate. Dad present trial è Lapad. Dablings: This is float specially. Ten come to the most official summer. Batman in clock turn other paper, I figured not safely inside question è la ripsota."
damn son
If they do open-source something, I expect them to open-source some existing model (maybe something useless like gpt-3.5) rather than provide something new.
natrys•2h ago
Not even going into performance; I need to test first. But what a stellar release, just for the attention to all these peripheral details alone. This should be the standard for major releases, instead of whatever Meta was doing with Llama 4 (hope Meta can surprise us at LlamaCon tomorrow, though).
[1] https://qwen.readthedocs.io/en/latest/
kadushka•2h ago
I’m curious, who are the community quant makers?
natrys•2h ago
[1] https://huggingface.co/unsloth
[2] https://huggingface.co/bartowski
daemonologist•2h ago
The space loads eventually as well; might just be that HF is under a lot of load.
echelon•19m ago
We need an answer to gpt-image-1. Can you please pair Qwen with Wan? That would literally change the art world forever.
gpt-image-1 is an almost wholesale replacement for ComfyUI and SD/Flux ControlNets. I can't overstate how big of a deal it is. As such, OpenAI has leapt ahead and threatens to start capturing more of the market for AI images and video. The expense of designing and training a multimodal model presents challenges to the open source community, and it's unlikely that Black Forest Labs or an open effort can do it. It's really a place where only Alibaba can shine.
If we get an open weights multimodal image gen model that we can fine tune, then it's game over - open models will be 100% the future. If not, then the giants are going to start controlling media creation. It'll be the domain of OpenAI and Google alone. Firing a salvo here will keep media creation highly competitive.
So please, pretty please work on an LLM/Diffusion multimodal image gen model. It would change the world instantly.
And keep up the great work with Wan Video! It's easily going to surpass Kling and Veo. The controllability is already well worth the tradeoffs.