I have Ollama installed (only a small proportion of their clients would have a large enough GPU for this) and have downloaded DeepSeek and played with it, but I still pay for an OpenAI subscription because I want the speed of a hosted model, not to mention luxuries like Codex's diffs/pull request support, agents on new models, deep research, etc. - I use them all at least weekly.
Ah; this definitely makes sense! I do this myself and then paste back only the relevant part of the log so as to limit this. I suspect I am being more conservative than others.
Are you using a proxy to connect Claude code to Kimi?
And how much do you estimate it would cost in a month of daily usage?
Are you using it every day for programming? If so, how much more or less does it cost you per month? More or less than $100?
They are fully trying to be a consumer product, developer services be damned. But they can't just get rid of the API: it's a good incremental source of revenue, and thanks to the Microsoft deal, dropping it would just push all of that revenue into Azure. Maintaining their own API is basically a way to keep a slice of it.
But if they open sourced everything, it might further sour the relationship with Microsoft, who would lose Azure revenue and might be willing to part ways. It would also ensure that they compete on consumer product quality, not (directly) model quality. At this point, they could basically put any decent model in their app and maintain the user base; they don't actually need to develop their own.
Even if it does poorly in all areas (like Llama 4 [0]), there is still a lot the community and industry can learn from even an uncompetitive model.
[0] Llama 4 technically has a massive 10M-token context as a differentiator; however, in my experience it is not reliably usable beyond 100k.
If you asked "What's the best bicycle", most enthusiasts would say one you've tried, that works for your use case, etc.
Benchmarks should be for pruning models you try at the absolute highest level, because at the end of the day it's way too easy to hack them without breaking any rules (post-train on the public benchmark data, generate a ton of synthetic examples, train on those, repeat).
You should also remember that there's no free lunch. If you see models below a certain size fail consistently, don't expect a model that is even smaller to somehow magically succeed, no matter how much pixie dust the developer advertises.
Many small models are supposedly good for controlled tasks, but given a detailed prompt, I can't get any of them to follow simple instructions. They usually just regurgitate the examples in the system prompt. Useless.
i have an m4 studio with a lot of unified memory and i'm still nowhere near running a 120b model. i'm at like 30b
apple or nvidia’s going to have to sell 1.5 tb ram machines before benchmark performance is going to be comparable
Plus, when you use Claude or OpenAI these days, it's performing Google searches etc. that my local model isn't doing.
I'm running a 400B parameter model at FP8 and it still took a lot of post-training to get even somewhat comparable performance.
-
I think a lot of people implicitly bake in some grace because the models are open weights, and that's not unreasonable because of the flexibility... but in terms of raw performance it's not even close.
GPT-3.5 has better world knowledge than some 70B models, and a few even larger ones.
"the hacker news dream" - a house, 2 kids, and a desktop supercomputer that can run a 700B model.
I'm on a 128GB M4 Max, and running models locally is a curiosity at best given the relative performance.
I agree with other comments that there are productive uses for them. Just not on the scale of o4-mini/o3/Claude 4 Sonnet/Opus.
So imo larger open-weight models from big US labs are a big deal! Glad to see it. Gemma models, for example, are great for their size. They're just quite small.
I should try Kimi K2 too.
You get the picture. Sure, even last year's local LLM will do well in capable hands in that scenario.
Now try pushing over 100,000 tokens in a single call, every call, in an automated process. I'm talking about the type of workflow where you push over a million tokens in a few minutes, over several steps.
That's where the moat, no, the chasm, between local setups and a public API lies.
No one who does serious work "chats" with an LLM. They trigger workflows where "agents" chew on a complex problem for several minutes.
That's where local models fold.
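As a rough sketch of what that kind of pipeline looks like (the model name, chunk size, and prompt here are just placeholders, and any OpenAI-compatible client would do):

    # Sketch of an automated, large-context pipeline: each call pushes a
    # chunk on the order of 100k tokens, so a handful of steps adds up to
    # over a million tokens in a few minutes. Model name and prompt are
    # placeholders.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def process_logs(chunks: list[str]) -> list[str]:
        results = []
        for chunk in chunks:  # each chunk is ~100k tokens
            resp = client.chat.completions.create(
                model="gpt-4.1",  # placeholder; swap in whatever the workflow uses
                messages=[
                    {"role": "system", "content": "Analyze the following build log."},
                    {"role": "user", "content": chunk},
                ],
            )
            results.append(resp.choices[0].message.content)
        return results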
Not their proprietary model, but maybe other open source models, or closed source models of their competitors. That way they can first ensure they are the only player on both sides, and then can kneecap their open source models just enough to drive the revenue to their proprietary one.
As far as dense models go, it's larger than many, but Mistral has released multiple 120B dense models, not to mention Llama 3 405B.
I wish they'd released a nano model for local hackers instead.
I’m running a gaming rig and could swap one in right now without having to change anything compared to my 5090, so no $5000 Threadripper or a $1000 HEDT motherboard with a ton of RAM slots, just a 1000 watt PSU and a dream.
It doesn't mean you can grab your work laptop from 5 years ago and run it there.
I will be running the 120B on my 2x4090-48GB, though.
126G /llmzoo/models/Qwen3-235B-InstructQ4
126G /llmzoo/models/Qwen3-235B-ThinkingQ4
189G /llmzoo/models/Qwen3-235B-InstructQ6
219G /llmzoo/models/glm-4.5-air
240G /llmzoo/models/Ernie
257G /llmzoo/models/Qwen3-Coder-480B
276G /llmzoo/models/DeepSeek-R1-0528-UD-Q3_K_XL.b.gguf
276G /llmzoo/models/DeepSeek-TNG
276G /llmzoo/models/DeepSeek-V3-0324-UD-Q3_K_XL.gguf
422G /llmzoo/models/KimiK2
The size in bytes of this 120B model is about 65 GB according to the screenshot, and elsewhere it's said to be trained in FP4, which matches.
That makes this model small enough to run locally on some laptops without reading from SSD.
The Apple M2 Max with 96 GB from January 2023, which is two generations old now, has enough GPU-capable RAM to handle it, albeit slowly. Any PC with 96 GB of RAM can run it on the CPU, probably even more slowly. Even a PC with less than 64 GB of RAM can run it, but it will be much slower due to having to read from the SSD constantly.
If it's an MoE with about 20B active parameters, it will read roughly one fifth of the data per token, making it about 5x faster than a 120B FP4 non-MoE would be, but it still needs all the weights readily available across tokens.
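To make the arithmetic concrete, here is a back-of-the-envelope sketch (the 20B active-parameter figure and the ~400 GB/s memory bandwidth are assumptions; real throughput will be lower than these upper bounds):

    # Rough upper bounds from memory bandwidth alone; all numbers are assumptions.
    PARAMS_TOTAL = 120e9    # total parameters
    PARAMS_ACTIVE = 20e9    # active parameters per token, if it's a ~20B-active MoE
    BYTES_PER_PARAM = 0.5   # FP4 = 4 bits per weight

    weights_gb = PARAMS_TOTAL * BYTES_PER_PARAM / 1e9     # ~60 GB to hold in RAM
    dense_gb_per_tok = PARAMS_TOTAL * BYTES_PER_PARAM / 1e9
    moe_gb_per_tok = PARAMS_ACTIVE * BYTES_PER_PARAM / 1e9

    MEM_BW_GBPS = 400  # assumed memory bandwidth, roughly M2 Max class

    print(f"weights in RAM: ~{weights_gb:.0f} GB")
    print(f"dense 120B:     ~{MEM_BW_GBPS / dense_gb_per_tok:.0f} tok/s upper bound")
    print(f"20B-active MoE: ~{MEM_BW_GBPS / moe_gb_per_tok:.0f} tok/s upper bound")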
Alternatively, someone can distill and/or quantize the model themselves to make a smaller model. These things can be done locally, even on a CPU if necessary, if you don't mind how long it takes to produce the smaller model. Or on a cloud machine rented just long enough to make it, which you can then run locally.
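For example, one common local quantization route is llama.cpp's conversion and quantize tooling; the exact script and binary names below depend on the llama.cpp version you have checked out, so treat them as assumptions:

    # Sketch of quantizing locally with llama.cpp tooling (names vary by version).
    import subprocess

    # 1) Convert the original Hugging Face weights to a GGUF file.
    subprocess.run(
        ["python", "convert_hf_to_gguf.py", "path/to/hf-model",
         "--outfile", "model-f16.gguf"],
        check=True,
    )

    # 2) Re-quantize down to ~4 bits per weight; CPU-only, just slow.
    subprocess.run(
        ["./llama-quantize", "model-f16.gguf", "model-Q4_K_M.gguf", "Q4_K_M"],
        check=True,
    )

Both steps run entirely on the CPU and are mostly disk- and memory-bound, which matches the "slow but doable locally" description above.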
Pretty much give away a Sonnet-level coding model and have it work with GPT-5 for harder tasks / planning.