It is what China has been doing for a year plus now. And the Chinese models are popular and effective, I assume companies are paying for better models.
Releasing open models for free doesn’t have to be charity.
We'll know the actual quality soon enough as we go.
Native might be better, but no native multimodal model is very competitive yet, so it's better to take a competitive model and bolt on vision/audio.
Can this be done by a third party or would it have to be OpenAI?
Wonder if they feel the bar will be raised soon (GPT-5) and feel more comfortable releasing something this strong.
If anything this helps Meta: another model to inspect/learn from/tweak etc. generally helps anyone making models
If you even glance at the model card you'll see this was trained on the same CoT RL pipeline as o3, and it shows when using the model: this is the most coherent and structured CoT of any open model so far.
Having full access to a model trained on that pipeline is valuable to anyone doing post-training, even if it's just to observe, but especially if you use it as cold start data for your own training.
But Apple is waking up too. So is Google. It's absolutely insane, the amount of money being thrown around.
- OAI open source
- Opus 4.1
- Genie 3
- ElevenLabs Music
OAI open source
Yeah. This certainly was not on my bingo card. Edit: I just tried it though and I'm less impressed now. We are really going to need major music software to get on board before we have actual creative audio tools. These all seem made for non-musicians to make a very cookie-cutter song from a specific genre.
From a strategic perspective, I can't think of any reason they'd release this unless they were about to announce something which totally eclipses it?
There's future opportunity in licensing, tech support, agents, or even simply to dominate and eliminate. Not to mention brand awareness: if you like these, you might be more likely to approach their brand for larger models.
Given it's only around 5 billion active params it shouldn't be a competitor to o3 or any of the other SOTA models, since the top Deepseek and Qwen models have around 30 billion active params. Unless OpenAI somehow found a way to make a model with 5 billion active params perform as well as one with 4-8 times more.
The question is how much better the new model(s) will need to be on the metrics given here to feel comfortable making these available.
Despite the loss of face for the lack of open model releases, I do not think that was a big enough problem to undercut commercial offerings.
so, the 20b model.
Can someone explain to me what I would need to do in terms of resources (GPU, I assume) if I want to run 20 concurrent processes, assuming I need 1k tokens/second throughput (on each, so 20 x 1k)
Also, is this model better/comparable for information extraction compared to gpt-4.1-nano, and would it be cheaper to host myself 20b?
but I need to understand 20 x 1k token throughput
I assume it just might be too early to know the answer
My 2x 3090 setup will get me ~6-10 streams of ~20-40 tokens/sec (generation) ~700-1000 tokens/sec (input) with a 32b dense model.
3.6B activated at Q8 x 1000 t/s = 3.6TB/s just for activated model weights (there's also context). So pretty much straight to B200 and alike. 1000 t/s per user/agent is way too fast, make it 300 t/s and you could get away with 5090/RTX PRO 6000.
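Rough sketch of that arithmetic (assuming every active parameter is read once per generated token; batching, which amortizes weight reads across concurrent streams, and KV-cache traffic are ignored):

```python
# Back-of-envelope memory-bandwidth estimate for single-stream decoding.
# Rough sketch only: ignores batching and KV-cache traffic, and assumes every
# active parameter is read once per generated token.

def required_bandwidth_gb_s(active_params_b: float, bits_per_param: float, tokens_per_s: float) -> float:
    bytes_per_token = active_params_b * 1e9 * bits_per_param / 8
    return bytes_per_token * tokens_per_s / 1e9  # GB/s

# gpt-oss-20b has roughly 3.6B active parameters per token (per the model card).
print(required_bandwidth_gb_s(3.6, 8, 1000))   # ~3600 GB/s at Q8 and 1000 t/s
print(required_bandwidth_gb_s(3.6, 4.25, 300)) # ~574 GB/s at MXFP4 and 300 t/s
```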
Multiply the number of A100's you need as necessary.
Here, you don't really need the ram. If you could accept fewer tokens/second, you could do it much cheaper with consumer graphics cards.
Even with A100, the sweet-spot in batching is not going to give you 1k/process/second. Of course, you could go up to H100...
You are unlikely to match groq on off the shelf hardware as far as I'm aware.
Also keep in mind this model does use 4-bit layers for the MoE parts. Unfortunately native accelerated 4-bit support only started with Blackwell on NVIDIA. So your 3090/4090/A6000/A100's are not going to be fast. An RTX 5090 will be your best starting point in the traditional card space. Maybe the unified memory minipc's like the Spark systems or the Mac mini could be an alternative, but I do not know them enough.
For self-hosting, it's smart that they targeted a 16GB VRAM config for it since that's the size of the most cost-effective server GPUs, but I suspect "native MXFP4 quantization" has quality caveats.
with quantization + CPU offloading, non-thinking models run kind of fine (at about 2-5 tokens per second) even with 8 GB of VRAM
sure, it would be great if we could have models in all sizes imaginable (7/13/24/32/70/100+/1000+), but 20B and 120B are great.
Quite excited to give this a try
I'd go for an ..80 card but I can't find any that fit in a mini-ITX case :(
24 is the lowest I would go. Buy a used 3090. Picked one up for $700 a few months back, but I think they were on the rise then.
The 3000 series can't do FP8 fast, but meh. It's the OOM that's tough, not the speed so much.
They're giving you a free model. You can evaluate it. You can use it. But the weights are there. If you dislike the way they license the weights, because the license isn't open enough, then sure, speak up, but because you can't see all the training data??! Wtf.
Historically this would be like calling a free but closed-source application "open source" simply because the application is free.
Rough analogy:
SaaS = AI as a service
Locally executable closed-source software = open-weight model
Open-source software = open-source model (whatever allows one to reproduce the model from training data)
However, for the sake of argument let's say this release should be called open source.
Then what do you call a model that also comes with its training material and tools to reproduce the model? Is it also called open source, and there is no material difference between those two releases? Or perhaps those two different terms should be used for those two different kinds of releases?
If you say that actually open source releases are impossible now (for mostly copyright reasons I imagine), it doesn't mean that they will be perpetually so. For that glorious future, we can leave them space in the terminology by using the term open weight. It is also the term that should not be misleading to anyone.
It’s like getting a compiled software with an Apache license. Technically open source, but you can’t modify and recompile since you don’t have the source to recompile. You can still tinker with the binary tho.
You run inference (via a library) on a model using its architecture (config file), tokenizer (what and when to compute) based on weights (hardcoded values). That's it.
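A minimal sketch of those three pieces, assuming the Hugging Face transformers API and the openai/gpt-oss-20b repo id:

```python
# Minimal sketch: architecture/config + tokenizer + weights is all inference needs.
# Assumes the Hugging Face transformers API and the openai/gpt-oss-20b repo id.
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

repo = "openai/gpt-oss-20b"
config = AutoConfig.from_pretrained(repo)        # architecture (config file)
tokenizer = AutoTokenizer.from_pretrained(repo)  # what/when to compute on
model = AutoModelForCausalLM.from_pretrained(    # weights (hardcoded values)
    repo, torch_dtype="auto", device_map="auto"
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```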
> but you can’t modify
Yes, you can. It's called finetuning. And, most importantly, that's exactly how the model creators themselves are "modifying" the weights! No sane lab is "recompiling" a model every time they change something. They perform a pre-training stage (feed everything and the kitchen sink), they get the hardcoded values (weights), and then they post-train using "the same" (well, maybe their techniques are better, but still the same concept) as you or I would. Just with more compute. That's it. You can do the exact same modifications, using basically the same concepts.
> don’t have the source to recompile
In pure practical ways, neither do the labs. Everyone that has trained a big model can tell you that the process is so finicky that they'd eat a hat if a big training session can somehow be made reproducible to the bit. Between nodes failing, datapoints ballooning your loss and having to go back, and the myriad of other problems, what you get out of a big training run is not guaranteed to be the same even with 100 - 1000 more attempts, in practice. It's simply the nature of training large models.
That's not true by any of the open source definitions in common use.
Source code (and, optionally, derived binaries) under the Apache 2.0 license are open source.
But compiled binaries (without access to source) under the Apache 2.0 license are not open source, even though the license does give you some rights over what you can do with the binaries.
Normally the question doesn't come up, because it's so unusual, strange and contradictory to ship closed-source binaries with an open source license. Descriptions of which licenses qualify as open source assume, as context, that of course you have the source or could get it; the only question is what you're allowed to do with it.
The distinction is more obvious if you ask the same question about other open source licenses such as GPL or MPL. A compiled binary (without access to source) shipped with a GPL license is not by any stretch open source. Not only is it not in the "preferred form for editing" as the license requires, it's not even permitted for someone who receives the file to give it to someone else and comply with the license. If someone who receives the file can't give it to anyone else (legally), then it's obviously not open source.
There's basically no reason to run other open source models now that these are available, at least for non-multimodal tasks.
I'm still withholding judgement until I see benchmarks, but every point you tried to make regarding model size and parameter size is wrong. Qwen has more variety on every level, and performs extremely well. That's before getting into the MoE variants of the models.
It's cool to see OpenAI throw their hat in the ring, but you're smoking straight hopium if you think there's "no reason to run other open source models now" in earnest. If OpenAI never released these models, the state-of-the-art would not look significantly different for local LLMs. This is almost a nothingburger if not for the simple novelty of OpenAI releasing an Open AI for once in their life.
So are/do the new OpenAI models, except they're much smaller.
Qwen-0.6b gets it right.
Let's not forget, this is a thinking model that has a significantly worse scores on Aider-Polyglot than the non-thinking Qwen3-235B-A22B-Instruct-2507, a worse TAUBench score than the smaller GLM-4.5 Air, and a worse SWE-Bench verified score than the (3x the size) GLM-4.5. So the results, at least in terms of benchmarks, are not really clear-cut.
From a vibes perspective, the non-reasoners Kimi-K2-Instruct and the aforementioned non-thinking Qwen3 235B are much better at frontend design. (Tested privately, but fully expecting DesignArena to back me up in the following weeks.)
OpenAI has delivered something astonishing for the size, for sure. But your claim is just an exaggeration. And OpenAI have, unsurprisingly, highlighted only the benchmarks where they do _really_ well.
So far I have mixed impressions, but they do indeed seem noticeably weaker than comparably-sized Qwen3 / GLM4.5 models. Part of the reason may be that the oai models do appear to be much more lobotomized than their Chinese counterparts (which are surprisingly uncensored). There's research showing that "aligning" a model makes it dumber.
Kind of a P=NP, but for software deliverability.
I imagine the same conflicts will ramp up over the next few years, especially once the silly money starts to dry up.
God bless China.
I just feel lucky to be around in what's likely the most important decade in human history. Shit odds on that, so I'm basically a lotto winner. Wild times.
ah, but that begs the question: did those people develop their worries organically, or did they simply consume the narrative heavily pushed by virtually every mainstream publication?
the journos are heavily incentivized to spread FUD about it. they saw the writing on the wall that the days of making a living by producing clickbait slop were coming to an end and deluded themselves into thinking that if they kvetch enough, the genie will crawl back into the bottle. scaremongering about sci-fi skynet bullshit didn't work, so now they kvetch about joules and milliliters consumed by chatbots, as if data centers did not exist until two years ago.
likewise, the bulk of other "concerned citizens" are creatives who use their influence to sway their followers, still hoping against hope to kvetch this technology out of existence.
honest-to-God yuddites are as few and as retarded as honest-to-God flat earthers.
Lol. To be young and foolish again. This covid laced decade is more of a placeholder. The current decade is always the most meaningful until the next one. The personal computer era, the first cars or planes, ending slavery needs to take a backseat to the best search engine ever. We are at the point where everyone is planning on what they are going to do with their hoverboards.
happened over many centuries, not in a given decade. Abolished and reintroduced in many places: https://en.wikipedia.org/wiki/Timeline_of_abolition_of_slave...
There was a ballot measure to actually abolish slavery a year or so back. It failed miserably.
Even in liberal states, the dehumanization of criminals is an endemic behavior, and we are reaching the point in our society where ironically having the leeway to discuss the humane treatment of even our worst criminals is becoming an issue that affects how we see ourselves as a society before we even have a framework to deal with the issue itself.
What one side wants is for prisons to be for rehabilitation and societal reintegration, for prisoners to have the right to decline to work and to be paid fair wages from their labor. They further want to remove for-profit prisons from the equation completely.
What the other side wants is the acknowledgement that prisons are not free, they are for punishment, and that prisoners have lost some of their rights for the duration of their incarceration and that they should be required to provide labor to offset the tax burden of their incarceration on the innocent people that have to pay for it. They also would like it if all prisons were for-profit as that would remove the burden from the tax payers and place all of the costs of incarceration onto the shoulders of the incarcerated.
Both sides have valid and reasonable wants from their vantage point while overlooking the valid and reasonable wants from the other side.
That's kind of vacuously true though, isn't it?
However, if you actually read it, the 13th amendment makes an explicit allowance for slavery (i.e. expressly allows it):
"Neither slavery nor involuntary servitude, *except as a punishment for crime whereof the party shall have been duly convicted*" (emphasis mine obviously since Markdown didn't exist in 1865)
They choose to because extra money = extra commissary snacks and having a job is preferable to being bored out of their minds all day.
That's the part that's frequently not included in the discussion of this whenever it comes up. Prison jobs don't pay minimum wage, but given that prisoners are wards of the state that seems reasonable.
AI did get used for fake news, propaganda, mass surveillance, erosion of trust and sense of truth, and mass spamming social media.
and the 120b: https://asciinema.org/a/B0q8tBl7IcgUorZsphQbbZsMM
I am, um, floored
```
total duration: 1m14.16469975s
load duration: 56.678959ms
prompt eval count: 3921 token(s)
prompt eval duration: 10.791402416s
prompt eval rate: 363.34 tokens/s
eval count: 2479 token(s)
eval duration: 1m3.284597459s
eval rate: 39.17 tokens/s
```
12.63 tok/sec • 860 tokens • 1.52s to first token
I'm amazed it works at all with such limited RAM
After considering my sarcasm for the last 5 minutes, I am doubling down. The government of the United States of America should enhance its higher IQ people by donating AI hardware to them immediately.
This is critical for global competitive economic power.
Send me my hardware US government
The 120B model is worse at coding compared to Qwen3 Coder, GLM-4.5 Air, and even Grok 3... (https://www.reddit.com/r/LocalLLaMA/comments/1mig58x/gptoss1...)
Thanks.
What does the resource usage look like for GLM 4.5 Air? Is that benchmark in FP16? GPT-OSS-120B will be using between 1/4 and 1/2 the VRAM that GLM-4.5 Air does, right?
It seems like a good showing to me, even though Qwen3 Coder and GLM 4.5 Air might be preferable for some use cases.
Humanity’s Last Exam: gpt-oss-120b (tools): 19.0%, gpt-oss-120b (no tools): 14.9%, Qwen3-235B-A22B-Thinking-2507: 18.2%
One positive thing I see is the number of parameters and size --- it will provide more economical inference than current open source SOTA.
[1]: https://msty.ai
Our backend is falling over from the load, spinning up more resources!
AI "safety" is about making it so that a journalist can't get out a recipe for Tabun just by asking.
The risk isn’t that bad actors suddenly become smarter. It’s that anyone can now run unmoderated inference and OpenAI loses all visibility into how the model’s being used or misused. I think that’s the control they’re grappling with under the label of safety.
If you use their training infrastructure there's moderation on training examples, but SFT on non-harmful tasks still leads to a complete breakdown of guardrails very quickly.
Perhaps I missed it somewhere, but I find it frustrating that, unlike most other open weight models and despite this being an open release, OpenAI has chosen to provide pretty minimal transparency regarding model architecture and training. It's become the norm for Llama, Deepseek, Qwen, Mistral and others to provide a pretty detailed write up on the model which allows researchers to advance and compare notes.
Given these new models are closer to the SOTA than they are to competing open models, this suggests that the 'secret sauce' at OpenAI is primarily about training rather than model architecture.
Hence why they won't talk about the training.
[0] https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7...
$0.15/M in / $0.60-0.75/M out
edit: Now Cerebras too, at 3,815 tps for $0.25/M in / $0.69/M out.
On ChatGPT.com o3 thought for 13 seconds, on OpenRouter GPT OSS 120B thought for 0.7 seconds - and they both had the correct answer.
I am not kidding but such progress from a technological point of view is just fascinating!
What is being measured here? For end-to-end time, one model is:
t_total = t_network + t_queue + t_batch_wait + t_inference + t_service_overhead
[1] currently $3/M in / $8/M out: https://platform.openai.com/docs/pricing
LLMs are getting cheaper much faster than I anticipated. I'm curious if it's still the hype cycle and Groq/Fireworks/Cerebras are taking a loss here, or whether things are actually getting cheaper. At this rate we'll be able to run Qwen3-32B level models in phones/embedded soon.
https://x.com/tekacs/status/1952788922666205615
Asking it about a marginally more complex tech topic and getting an excellent answer in ~4 seconds, reasoning for 1.1 seconds...
I am _very_ curious to see what GPT-5 turns out to be, because unless they're running on custom silicon / accelerators, even if it's very smart, it seems hard to justify not using these open models on Groq/Cerebras for a _huge_ fraction of use-cases.
https://news.ycombinator.com/item?id=44738004
... today, this is a real-time video of the OSS thinking models by OpenAI on Groq and I'd have to slow it down to be able to read it. Wild.
I'll have to try again later but it was a bit underwhelming.
The latency also seemed pretty high, not sure why. I think with the latency the throughput ends up not making much difference.
Btw Groq has the 20b model at 4000 TPS but I haven't tried that one.
Super excited to test these out.
The benchmarks from 20B are blowing away major >500b models. Insane.
On my hardware.
43 tokens/sec.
I got an error with flash attention turned on. Can't run it with flash attention?
31,000 context is the max it will allow or the model won't load.
no kv or v quantization.
I'm guessing it's going to very rapidly be patched into the various tools.
E.g. Hybrid architecture. Local model gathers more data, runs tests, does simple fixes, but frequently asks the stronger model to do the real job.
Local model gathers data using tools and sends more data to the stronger model.
It
Maybe you guys call it AGI, so anytime I see progress in coding, I think it goes just a tiny bit towards the right direction
Plus it also helps me as a coder to actually do some stuff just for the fun. Maybe coding is the only truly viable use of AI and all others are negligible increases.
There is so much polarization in the use of AI on coding, but I just want to say this: it would be pretty ironic that an industry which automates others' jobs is this time the first to get its own job automated.
But I don't see that happening, far from it. But still each day something new, something better happens back to back. So yeah.
What would AGI mean, solving some problem that it hasn't seen? or what exactly? I mean I think AGI is solved, no?
If not, I see people mentioning that Horizon Alpha is actually a GPT-5 model and it's predicted to release on Thursday on some betting market, so maybe that fits the AGI definition?
I would understand it, if there was some technology lock-in. But with LLMs, there is no such thing. One can switch out LLMs without any friction.
There could be many legitimate reasons, but yeah I'm very surprised by this too. Some companies take it a bit too seriously and go above and beyond too. At this point unless you need the absolute SOTA models because you're throwing LLM at an extremely hard problem, there is very little utility using larger providers. In OpenRouter, or by renting your own GPU you can run on-par models for much cheaper.
Frontier / SOTA models are barely profitable. Previous-gen models lose 90% of their value. Two gens back and they're worthless.
And given that their product life cycle is something like 6-12 months, you might as well open source them as part of sundowning them.
https://www.dwarkesh.com/p/mark-zuckerberg#:~:text=As%20long...
The short version is that is you give a product to open source, they can and will donate time and money to improving your product, and the ecosystem around it, for free, and you get to reap those benefits. Llama has already basically won that space (the standard way of running open models is llama.cpp), so OpenAI have finally realized they're playing catch-up (and last quarter's SOTA isn't worth much revenue to them when there's a new SOTA, so they may as well give it away while it can still crack into the market)
More reasonably, you should be able to run the 20B at non-stupidly-slow speed with a 64bit CPU, 8GB RAM, 20GB SSD.
Hopefully other quantizations of these OpenAI models will be available soon.
I'm still wondering why my GPU usage was so low... maybe Ollama isn't optimized for running it yet?
Screenshot here with Ollama running and asitop in other terminal:
https://bsky.app/profile/pamelafox.bsky.social/post/3lvobol3...
There's also NoPE, I think SmolLM3 "uses NoPE" (aka doesn't use any positional stuff) every fourth layer.
I'm not actually aware of any model that doesn't do positional embeddings on a per-layer basis (excepting BERT and the original transformer paper, and I haven't read the GPT2 paper in a while, so I'm not sure about that one either).
Kudos to that team.
https://www.reddit.com/r/LocalLLaMA/comments/1meeyee/ollamas...
All the real heavy lifting is done by llama.cpp, and for the distribution, by HuggingFace.
I was like no. It is false advertising.
(I included details about its refusal to answer even after using tools for web searching but hopefully shorter comment means fewer downvotes.)
In my mind, I’m comparing the model architecture they describe to what the leading open-weights models (Deepseek, Qwen, GLM, Kimi) have been doing. Honestly, it just seems “ok” at a technical level:
- both models use standard Grouped-Query Attention (64 query heads, 8 KV heads). The card talks about how they’ve used an older optimization from GPT3, which is alternating between banded window (sparse, 128 tokens) and fully dense attention patterns. It uses RoPE extended with YaRN (for a 131K context window). So they haven’t been taking advantage of the special-sauce Multi-head Latent Attention from Deepseek, or any of the other similar improvements over GQA.
- both models are standard MoE transformers. The 120B model (116.8B total, 5.1B active) uses 128 experts with Top-4 routing. They’re using some kind of Gated SwiGLU activation, which the card talks about as being "unconventional" because of the clamping and whatever residual connections that implies. Again, not using any of Deepseek’s “shared experts” (for general patterns) + “routed experts” (for specialization) architectural improvements, Qwen’s load-balancing strategies, etc.
- the most interesting thing IMO is probably their quantization solution. They did something to quantize >90% of the model parameters to the MXFP4 format (4.25 bits/parameter) to let the 120B model fit on a single 80GB GPU, which is pretty cool. But we’ve also got Unsloth with their famous 1.58bit quants :)
All this to say, it seems like even though the training they did for their agentic behavior and reasoning is undoubtedly very good, they’re keeping their actual technical advancements “in their pocket”.
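Back-of-envelope check on that MXFP4 sizing, using the 116.8B total parameter figure above (the 90% MXFP4 / 10% bf16 split below is my assumption, not a number from the card):

```python
# Rough check of the MXFP4 sizing claim. Assumes ~116.8B total parameters;
# the 90%/bf16 split is an assumption for illustration only.
total_params = 116.8e9

all_mxfp4_gb = total_params * 4.25 / 8 / 1e9
print(f"everything at 4.25 bits: ~{all_mxfp4_gb:.0f} GB")      # ~62 GB

mixed_gb = (0.9 * total_params * 4.25 / 8 + 0.1 * total_params * 2) / 1e9
print(f"90% MXFP4 + 10% bf16 (assumed): ~{mixed_gb:.0f} GB")   # ~79 GB
```

Either way the weights come in at or under roughly 80 GB, which is consistent with the single-GPU claim.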
The model is pretty sparse tho, 32:1.
They said it was native FP4, suggesting that they actually trained it like that; it's not post-training quantisation.
Unsloth's special quants are amazing but I've found there to be lots of trade-offs vs full quantization, particularly when striving for the best first-shot attempts - which is by far the bulk of LLM use cases. Running a better (larger, newer) model at lower quantization to fit in memory, or with reduced accuracy/detail to speed it up, both have value, but in the pursuit of first-shot accuracy there don't seem to be many companies running their frontier models at reduced quantization. If OpenAI is doing this in production, that is interesting.
This would be much more efficient than relying purely on RL post-training on a small model; with low baseline capabilities the insights would be very sparse and the training very inefficient.
same seems to be true for humans
When I just want a full summary without necessarily understanding all the details, I have an audio overview made on NotebookLM and listen to the podcast while I’m exercising or cleaning. I did that a few days ago with the recent Anthropic paper on persona vectors, and it worked great.
My personal prediction is that the US foundational model makers will OSS something close to N-1 for the next 1-3 iterations. The CAPEX for the foundational model creation is too high to justify OSS for the current generation. Unless the US Gov steps up and starts subsidizing power, or Stargate does 10x what it is planned right now.
N-1 model value depreciates insanely fast. Making an OSS release of them and allowing specialized use cases and novel developments allows potential value to be captured and integrated into future model designs. It's medium risk, as you may lose market share. But also high potential value, as the shared discoveries could substantially increase the velocity of next-gen development.
There will be a plethora of small OSS models. Iteration on the OSS releases is going to be biased towards local development, creating more capable and specialized models that work on smaller and smaller devices. In an agentic future, every different agent in a domain may have its own model. Distilled and customized for its use case without significant cost.
Everyone is racing to AGI/SGI. The models along the way are to capture market share and use data for training and evaluations. Once someone hits AGI/SGI, the consumer market is nice to have, but the real value is in novel developments in science, engineering, and every other aspect of the world.
[0] https://www.anthropic.com/research/persona-vectors > We demonstrate these applications on two open-source models, Qwen 2.5-7B-Instruct and Llama-3.1-8B-Instruct.
In this setup OSS models could be more than enough and capture the market, but I don't see where the value would be in a multitude of specialized models we have to train.
I have this theory that we simply got over a hump by utilizing a massive processing boost from gpus as opposed to CPUs. That might have been two to three orders of magnitude more processing power.
But that's a one-time success. I don't think hardware has any large-scale improvements coming, because 3D gaming already plumbed most of that vector-processing hardware development over the last 30 years.
So will software and better training models produce another couple orders of magnitude?
Fundamentally we're talking about nines of accuracy. What is the processing power required for each nine of accuracy? Is it linear? Is it polynomial? Is it exponential?
It just seems strange to me that with all the AI knowledge sloshing through academia, I haven't seen any basic analysis at that level, which is something that's absolutely going to be necessary for AI applications like self-driving, once you get those insurance companies involved.
[1 of 3] For the sake of argument here, I'll grant the premise. If this turns out to be true, it glosses over other key questions, including:
For a frontier lab, what is a rational period of time (according to your organizational mission / charter / shareholder motivations*) to wait before:
1. releasing a new version of an open-weight model; and
2. how much secret sauce do you hold back?
* Take your pick. These don't align perfectly with each other, much less the interests of a nation or world.
[2 of 3] Assuming we pin down what win means... (which is definitely not easy)... What would it take for this to not be true? There are many ways, including but not limited to:
- publishing open weights helps your competitors catch up
- publishing open weights doesn't improve your own research agenda
- publishing open weights leads to a race dynamic where only the latest and greatest matters; leading to a situation where the resources sunk exceed the gains
- publishing open weights distracts your organization from attaining a sustainable business model / funding stream
- publishing open weights leads to significant negative downstream impacts (there are a variety of uncertain outcomes, such as: deepfakes, security breaches, bioweapon development, unaligned general intelligence, humans losing control [1] [2], and so on)
[1]: "What failure looks like" by Paul Christiano : https://www.alignmentforum.org/posts/HBxe6wdjxK239zajf/what-...
[2]: "An AGI race is a suicide race." - quote from Max Tegmark; article at https://futureoflife.org/statement/agi-manhattan-project-max...
[3 of 3] What would it take for this statement to be false or missing the point?
Maybe we find ourselves in a future where:
- Yes, open models are widely used as base models, but they are also highly customized in various ways (perhaps by industry, person, attitude, or something else). In other words, this would be a blend of open and closed.
- Maybe publishing open weights of a model is more-or-less irrelevant, because it is "table stakes" ... because all the key differentiating advantages have to do with other factors, such as infrastructure, non-LLM computational aspects, regulatory environment, affordable energy, customer base, customer trust, and probably more.
- The future might involve thousands or millions of highly tailored models
I don't think there will be such a unique event. There is no clear boundary. This is a continuous process. Models get slightly better than before.
Also, another dimension is the inference cost to run those models. It has to be cheap enough to really take advantage of it.
Also, I wonder, what would be a good target to make profit, to develop new things? There is Isomorphic Labs, which seems like a good target. This company already exists now, and people are working on it. What else?
I guess it depends on your definition of AGI, but if it means human level intelligence then the unique event will be the AI having the ability to act on its own without a "prompt".
That's super easy. The reason they need a prompt is that this is the way we make them useful. We don't need LLMs to generate an endless stream of random "thoughts" otherwise, but if you really wanted to, just hook one up to a webcam and microphone stream in a loop and provide it some storage for "memories".
This implies LLM development isn’t plateaued. Sure the researchers are busting their asses quantizing, adding features like tool calls and structured outputs, etc. But soon enough N-1 ≈ N
gpt-oss:20b = ~46 tok/s
More than 2x faster than my previous leading OSS models: mistral-small3.2:24b = ~22 tok/s
gemma3:27b = ~19.5 tok/s
Strangely getting nearly the opposite performance running on 1x 5070 Ti: mistral-small3.2:24b = ~39 tok/s
gpt-oss:20b = ~21 tok/s
Where gpt-oss is nearly 2x slower vs mistral-small 3.2.
Pretty impressive
Edit: I tried it out, I have no idea in terms of tokens but it was fluid enough for me. A bit slower than using o3 in the browser but definitely tolerable. I think I will set it up on my GF's machine so she can stop paying for the full subscription (she's a non-tech professional)
Very much usable
I think that the point that makes me more excited is that we can train trillion-parameter giants and distill them down to just billions without losing the magic. Imagine coding with Claude 4 Opus-level intelligence packed into a 10B model running locally at 2000 tokens/sec - like instant AI collaboration. That would fundamentally change how we develop software.
But it does not actually compete with o3 performance. Not even close. As usual, the metrics are bullshit. You don't know how good the model actually is until you grill it yourself.
What could go wrong?
> List the US presidents in order starting with George Washington and their time in office and year taken office.
>> 00: template: :3: function "currentDate" not defined
Trying to use it for agentic coding...
lots of fail. This harmony formatting? Anyone have a working agentic tool?
openhands and void ide are failing due to the new tags.
Aider worked, but the file it was supposed to edit was untouched and it created
Create new file? (Y)es/(N)o [Yes]:
Applied edit to <|end|><|start|>assistant<|channel|>final<|message|>main.py
so the file name is '<|end|><|start|>assistant<|channel|>final<|message|>main.py' lol. quick rename and it was fantastic.
I think Qwen Code is the best choice so far, but it's unreliable. So far these new tags are coming through and it works properly... sometimes.
Only one of my tests so far has gotten 20b to fail on the first iteration, and a small follow-up had it completely fixed right away.
Very impressive model for 20B.
[1] https://github.com/openai/harmony
I asked it some questions and it seems to think it is based on GPT4-Turbo:
> Thus we need to answer "I (ChatGPT) am based on GPT-4 Turbo; number of parameters not disclosed; GPT-4's number of parameters is also not publicly disclosed, but speculation suggests maybe around 1 trillion? Actually GPT-4 is likely larger than 175B; maybe 500B. In any case, we can note it's unknown.
As well as:
> GPT‑4 Turbo (the model you’re talking to)
> The user appears to think the model is "gpt-oss-120b", a new open source release by OpenAI. The user likely is misunderstanding: I'm ChatGPT, powered possibly by GPT-4 or GPT-4 Turbo as per OpenAI. In reality, there is no "gpt-oss-120b" open source release by OpenAI
Major points of interest for me:
- In the "Main capabilities evaluations" section, the 120b outperform o3-mini and approaches o4 on most evals. 20b model is also decent, passing o3-mini on one of the tasks.
- AIME 2025 is nearly saturated with large CoT
- CBRN threat levels kind of on par with other SOTA open source models. Plus, demonstrated good refusals even after adversarial fine tuning.
- Interesting to me how a lot of the safety benchmarking runs on trust, since methodology can't be published too openly due to counterparty risk.
Model cards with some of my annotations: https://openpaper.ai/paper/share/7137e6a8-b6ff-4293-a3ce-68b...
This is something about AI that worries me, as a 'child' of the open source coming-of-age era in the '90s. I don't want to be forced to rely on those big companies to do my job in an efficient way, if AI becomes part of the day-to-day workflow.
Is it even valid to have additional restriction on top of Apache 2.0?
For example, GPL has a "no-added-restrictions" clause, which allows the recipient of the software to ignore any additional restrictions added alongside the license.
> All other non-permissive additional terms are considered “further restrictions” within the meaning of section 10. If the Program as you received it, or any part of it, contains a notice stating that it is governed by this License along with a term that is a further restriction, you may remove that term. If a license document contains a further restriction but permits relicensing or conveying under this License, you may add to a covered work material governed by the terms of that license document, provided that the further restriction does not survive such relicensing or conveying.
You can legally do whatever you want; the question is whether you will then, for your own benefit, be appropriating a term like open source (like Facebook) if you add restrictions not in line with how the term is traditionally used, or whether you are actually honest about it and call it something like "weights available".
In the case of OpenAI here, I am not a lawyer, and I am also not sure if the gpt-oss usage policy runs afoul of open source as a term. They did not bother linking the policy from the announcement, which was odd, but here it is:
https://huggingface.co/openai/gpt-oss-120b/blob/main/USAGE_P...
Compared to the wall of text that Facebook throws at you, let me post it here as it is rather short: "We aim for our tools to be used safely, responsibly, and democratically, while maximizing your control over how you use them. By using OpenAI gpt-oss-120b, you agree to comply with all applicable law."
I suspect this sentence still is too much to add and may invalidate the Open Source Initiative (OSI) definition, but at this point I would want to ask a lawyer and preferably one from OSI. Regardless, credit to OpenAI for moving the status quo in the right direction as the only further step we really can take is to remove the usage policy entirely (as is the standard for open source software anyway).
Frontier labs are incentivized to start breaching these distribution paths. This will evolve into large scale "intelligent infra" plays.
So FYI to anyone on Mac, the easiest way to run these models right now is using LM Studio (https://lmstudio.ai/); it's free. You just search for the model, and usually the 3rd party groups mlx-community or lmstudio-community have mlx versions within a day or 2 of releases. I go for the 8-bit quantizations (4-bit is faster, but quality drops). You can also convert to mlx yourself...
Once you have it running in LM Studio, you can chat there in their chat interface, or you can run it through an API that defaults to http://127.0.0.1:1234
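If it helps anyone: that local server speaks the OpenAI chat-completions format, so something like the sketch below should work (the model id is an assumption; use whatever identifier LM Studio lists for the model you loaded):

```python
# Talking to LM Studio's local server (default http://127.0.0.1:1234), which
# exposes an OpenAI-compatible API. The model id below is an assumption --
# substitute whatever LM Studio shows for your download.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:1234/v1", api_key="lm-studio")  # key is ignored locally
resp = client.chat.completions.create(
    model="openai/gpt-oss-20b",
    messages=[{"role": "user", "content": "Summarize MXFP4 quantization in two sentences."}],
)
print(resp.choices[0].message.content)
```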
You can run multiple models that hot swap and load instantly and switch between them etc.
It's surprisingly easy, and fun. There are actually a lot of cool niche models coming out, like this tiny high-quality search model released today as well (which also got an official mlx version): https://huggingface.co/Intelligent-Internet/II-Search-4B
Other fun ones are Gemma 3n, which is multi-modal; a larger one that is actually a solid model but takes more memory is the new Qwen3 30B A3B (Coder and Instruct); Pixtral (Mixtral vision with full-resolution images); etc. Looking forward to playing with this model and seeing how it compares.
In the repo is a metal port they made, that’s at least something… I guess they didn’t want to cooperate with Apple before the launch but I am sure it will be there tomorrow.
Basic ollama calling through a post endpoint works fine. However, the structured output doesn't work. The model is insanely fast and good in reasoning.
In combination with Cline it appears to be worthless. Tool calling doesn't work (they say it does), it fails to wait for feedback (or to correctly call ask_followup_question), and above 18k context it runs partially on CPU (weird), since they claim it should work comfortably on a 16 GB VRAM RTX.
> Unexpected API Response: The language model did not provide any assistant messages. This may indicate an issue with the API or the model's output.
Edit: Also doesn't work with the openai compatible provider in cline. There it doesn't detect the prompt.
I got a 1.7k token reply delivered too fast for the human eye to perceive the streaming.
n=1 for this 120b model but id rank the reply #1 just ahead of claude sonnet 4 for a boring JIRA ticket shuffling type challenge.
EDIT: The same prompt on gpt-oss, despite being served 1000x slower, wasn't as good but was in a similar vein. It wanted to clarify more and as a result only half responded.
This makes DeepSeek's very cheap claim on compute cost for r1 seem reasonable. Assuming $2/hr for h100, it's really not that much money compared to the $60-100M estimates for GPT 4, which people speculate as a MoE 1.8T model, something in the range of 200B active last I heard.
My bet: GPT-5 leans into parallel reasoning via a model consortium, maybe mixing in OSS variants. Spin up multiple reasoning paths in parallel, then have an arbiter synthesize or adjudicate. The new Harmony prompt format feels like infrastructural prep: distinct channels for roles, diversity, and controlled aggregation.
I’ve been experimenting with this in llm-consortium: assign roles to each member (planner, critic, verifier, toolsmith, etc.) and run them in parallel. The hard part is eval cost :(
Combining models smooths out the jagged frontier. Different architectures and prompts fail in different ways; you get less correlated error than a single model can give you. It also makes structured iteration natural: respond → arbitrate → refine. A lot of problems are “NP-ish”: verification is cheaper than generation, so parallel sampling plus a strong judge is a good trade.
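Not the llm-consortium implementation, just a rough sketch of the fan-out/arbitrate shape against any OpenAI-compatible endpoint (the base URL, model names, and role prompts are placeholders):

```python
# Rough sketch of the fan-out -> arbiter pattern described above.
# Endpoint, model names, and role prompts are placeholders/assumptions.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:1234/v1", api_key="local")

async def ask(model: str, role: str, question: str) -> str:
    resp = await client.chat.completions.create(
        model=model,
        messages=[{"role": "system", "content": f"You are the {role}."},
                  {"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

async def consortium(question: str) -> str:
    members = [("gpt-oss-120b", "planner"), ("gpt-oss-120b", "critic"), ("gpt-oss-20b", "verifier")]
    # Fan out: run all members in parallel so wall-clock cost stays close to one call.
    answers = await asyncio.gather(*(ask(m, r, question) for m, r in members))
    drafts = "\n\n---\n\n".join(f"[{r}]\n{a}" for (_, r), a in zip(members, answers))
    # Arbiter: a single model synthesizes/adjudicates the parallel drafts.
    return await ask("gpt-oss-120b", "arbiter who merges the drafts into one answer", drafts)

print(asyncio.run(consortium("Find the bug in this function: ...")))
```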
I've found that LLMs can handle some tasks very well and some not at all. For the ones they can handle well, I optimize for the smallest, fastest, cheapest model that can handle it. (e.g. using Gemini Flash gave me a much better experience than Gemini Pro due to the iteration speed.)
This "pushing the frontier" stuff would seem to help mostly for the stuff that are "doable but hard/inconsistent" for LLMs, and I'm wondering what those tasks are.
And it obviously works for code and math problems. My first test was to give the llm-consortium code to a consortium to look for bugs. It identified a serious bug which only one of the three models detected. So on that case it saved me time, as using them on their own would have missed the bug or required multiple attempts.
not a big deal, but still...
My go to test for checking hallucinations is 'Tell me about Mercantour park' (a national park in south eastern France).
Easily half of the facts are invented. Non-existing mountain summits, brown bears (no, there are none), villages that are elsewhere, wrong advice ('dogs allowed' - no they are not).
Would probably do a lot better if you give it tool access for search and web browsing.
LLMs are never going to have fact retrieval as a strength. Transformer models don't store their training data: they are categorically incapable of telling you where a fact comes from. They also cannot escape the laws of information theory: storing information requires bits. Storing all the world's obscure information requires quite a lot of bits.
What we want out of LLMs is large context, strong reasoning and linguistic facility. Couple these with tool use and data retrieval, and you can start to build useful systems.
From this point of view, the more of a model's total weight footprint is dedicated to "fact storage", the less desirable it is.
They still won't store much information, but it could mean they're better able to know what they don't know.
LLMs are not encyclopedias.
Give an LLM the context you want to explore, and it will do a fantastic job of telling you all about it. Give an LLM access to web search, and it will find things for you and tell you what you want to know. Ask it "what's happening in my town this week?", and it will answer that with the tools it is given. Not out of its oracle mind, but out of web search + natural language processing.
Stop expecting LLMs to -know- things. Treating LLMs like all-knowing oracles is exactly the thing that's setting apart those who are finding huge productivity gains with them from those who can't get anything productive out of them.
You can still do that sort of thing, but just have it perform searches whenever it has to deal with a matter of fact. Just because it's trained for tool use and equipped with search tools doesn't mean you have to change the kinds of things you ask it.
I'd say gpt-oss-20b is in between Qwen3 30B-A3B-2507 and Gemma 3n E4b(with 30B-A3B at lower side). This means it's not obsoleting GPT-4o-mini for all purposes.
>"Tell me about Iekei Ramen", "Tell me how to make curry".
Thanks for the correction!
What's interesting is that these questions are simultaneously well understood by most closed models and not so well understood by most open models for some reason, including this one. Even GLM-4.5 full and Air on chat.z.ai(355B-A32B and 106B-A12B respectively) aren't so accurate for the first one.
TLDR: I think OpenAI may have taken the medal for best available open weight model back from the Chinese AI labs. Will be interesting to see if independent benchmarks resolve in that direction as well.
The 20B model runs on my Mac laptop using less than 15GB of RAM.
I was about to try the same. What TPS are you getting and on which processor? Thanks!
qwen3-coder-30b 4-bit mlx took on the task w/o any hiccups with a fully working dashboard, graphs, and recent data fetched from yfinance.
gpt-oss-20b mxfp4's code had a missing datetime import and, when fixed, delivered a dashboard without any data and with a starting date of Aug 2020. Having adjusted the date, the update methods did not work and displayed error messages.
If it's decent in other tasks, which I do find OpenAI often being better than others at, then I think it's a win, especially a win for the open source community: even the AI labs that pioneered the Gen AI hype and never wanted to launch open models are now being forced to launch them. That is definitely a win, and not something that was certain before.
Maybe too open-ended a question? I can run the Deepseek model locally really nicely.
gpt-oss:20b is a top ten model on MMLU (right behind Gemini-2.5-Pro) and I just ran it locally on my MacBook Air M3 from last year.
I've been experimenting with a lot of local models, both on my laptop and on my phone (Pixel 9 Pro), and I figured we'd be here in a year or two.
But no, we're here today. A basically frontier model, running for the cost of electricity (free with a rounding error) on my laptop. No $200/month subscription, no lakes being drained, etc.
I'm blown away.
Also, just wanted to credit you for being one of the five people on Earth who knows the correct spelling of "lede."
“I am well versed in the lost art form of delicates seduction.”
That gives 24m cubic meters annual water usage.
Estimated ai usage in 2024: 560m cubic meters.
Projected water usage from AI in 2027: 4bn cubic meters at the low end.
This is a thinking model, so I ran it against o4-mini, here are the results:
* gpt-oss:20b
* Time-to-first-token: 2.49 seconds
* Time-to-completion: 51.47 seconds
* Tokens-per-second: 2.19
* o4-mini on ChatGPT
* Time-to-first-token: 2.50 seconds
* Time-to-completion: 5.84 seconds
* Tokens-per-second: 19.34
Time to first token was similar, but the thinking piece was _much_ faster on o4-mini. Thinking took the majority of the 51 seconds for gpt-oss:20b.
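If anyone wants to reproduce numbers like these against a local endpoint, here's a rough timing sketch (assumes Ollama's OpenAI-compatible API on the default port; streamed chunks are only an approximation of tokens):

```python
# Rough TTFT / throughput measurement against a local OpenAI-compatible server.
# Assumes Ollama's default endpoint; chunk count only approximates token count.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

start = time.perf_counter()
first = None
chunks = 0
stream = client.chat.completions.create(
    model="gpt-oss:20b",
    messages=[{"role": "user", "content": "Explain KV caching in three sentences."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first is None:
            first = time.perf_counter()  # first visible content = time-to-first-token
        chunks += 1
total = time.perf_counter() - start

if first is None:
    raise SystemExit("no content streamed")
gen_time = total - (first - start)
print(f"time-to-first-token: {first - start:.2f}s")
print(f"time-to-completion:  {total:.2f}s")
print(f"~chunks/sec:         {chunks / gen_time:.1f}")
```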
I mean the training, while expensive, is done once. The inference … besides being done by perhaps millions of clients, is done for, well, the life of the model anyway. Surely that adds up.
It's hard to know, but I assume the user taking up the burden of the inference is perhaps doing so more efficiently? I mean, when I run a local model, it is plodding along — not as quick as the online model. So, slow and therefore I assume necessarily more power efficient.
Local, in my experience, can't even pull data from an image without hallucinating (Qwen 2.5 VL in that example). Hopefully local/small models keep getting better and devices get better at running bigger ones
It feels like we do it because we can more than because it makes sense- which I am all for! I just wonder if i’m missing some kind of major use case all around me that justifies chaining together a bunch of mac studios or buying a really great graphics card. Tools like exo are cool and the idea of distributed compute is neat but what edge cases truly need it so badly that it’s worth all the effort?
Organizations operating in high stakes environments
Organizations with restrictive IT policies
To name just a few -- well, the first two are special cases of the last one
RE your hallucination concerns: the issue is overly broad ambitions. Local LLMs are not general purpose -- if what you want is local ChatGPT, you will have a bad time. You should have a highly focused use case, like "classify this free text as A or B" or "clean this up to conform to this standard": this is the sweet spot for a local model
[0] Think queries I’d previously have had to put through a search engine and check multiple results for a one word/sentence answer.
Privacy is obvious.
AI is going to be equivalent to all computing in the future. Imagine if only IBM, Apple and Microsoft ever built computers, and all anyone else ever had in the 1990s were terminals to the mainframe, forever.
I don’t have much experience with local vision models, but for text questions the latest local models are quite good. I’ve been using Qwen 3 Coder 30B-A3B a lot to analyze code locally and it has been great. While not as good as the latest big cloud models, it’s roughly on par with SOTA cloud models from late last year in my usage. I also run Qwen 3 235B-A22B 2507 Instruct on my home server, and it’s great, roughly on par with Claude 4 Sonnet in my usage (but slow of course running on my DDR4-equipped server with no GPU).
Funny how that works.
We are not even at that extreme and you can already see the unequal reality that too much SaaS has engendered
- Costs.
- Rate limits.
- Privacy.
- Security.
- Vendor lock-in.
- Stability/backwards-compatibility.
- Control.
- Etc.
How about running one on this site but making it publicly available? A sort of outranet, calling it HackerBrain?
Well, the model makers and device manufacturers of course!
While your Apple, Samsung, and Googles of the world will be unlikely to use OSS models locally (maybe Samsung?), they all have really big incentives to run models locally for a variety of reasons.
Latency, privacy (Apple), cost to run these models on behalf of consumers, etc.
This is why Google started shipping 16GB as the _lowest_ amount of RAM you can get on your Pixel 9. That was a clear flag that they're going to be running more and more models locally on your device.
As mentioned, while it seems unlikely that US-based model makers or device manufacturers will use OSS models, they'll certainly be targeting local models heavily on consumer devices in the near future.
Apple's framework of local first, then escalate to ChatGPT if the query is complex will be the dominant pattern imo.
The Pixel 9 has 12GB of RAM[0]. You probably meant the Pixel 9 Pro.
I'm sure there are other use cases, but much like "what is BitTorrent for?", the obvious use case is obvious.
1. App makers can fine tune smaller models and include in their apps to avoid server costs
2. Privacy-sensitive content can be either filtered out or worked on... I'm using local LLMs to process my health history for example
3. Edge servers can be running these fine tuned for a given task. Flash/lite models by the big guys are effectively like these smaller models already.
Why not run all the models at home, maybe collaboratively or at least in parallel?
I'm sure there are use cases where the paid models are not allowed to collaborate or ask each other.
also, other open models are gaining mindshare.
Even if they did offer a defined latency product, you’re relying on a lot of infrastructure between your application and their GPU.
That’s not always tolerable.
For practical RAG processes of narrow scope and even a minimal amount of scaffolding, it's a very usable speed for automating tasks, especially as the last-mile/edge-device portion of a more complex process with better models in use upstream. Classification tasks, reasonably intelligent decisions between traditional workflow processes, other use cases -- all of them extremely valuable in enterprise, being built and deployed right now.
I try to be mindful of what I share with ChatGPT, but even then, asking it to describe my family produced a response that was unsettling in its accuracy and depth.
Worse, after attempting to delete all chats and disable memory, I noticed that some information still seemed to persist. That left me deeply concerned—not just about this moment, but about where things are headed.
The real question isn't just "what can AI do?"—it's "who is keeping the record of what it does?" And just as importantly: "who watches the watcher?" If the answer is "no one," then maybe we shouldn't have a watcher at all.
> Worse, after attempting to delete all chats and disable memory, I noticed that some information still seemed to persist.
Maybe I'm missing something, but why wouldn't that be expected? The chat history isn't their only source of information - these models are trained on scraped public data. Unless there's zero information about you and your family on the public internet (in which case - bravo!), I would expect even a "fresh" LLM to have some information even without you giving it any.
That means running instantly offline and every token is free
Answer on Wikipedia: https://en.wikipedia.org/wiki/Battle_of_Midway#U.S._code-bre...
dolphin3.0-llama3.1-8b Q4_K_S [4.69 GB on disk]: correct in <2 seconds
deepseek-r1-0528-qwen3-8b Q6_K [6.73 GB]: correct in 10 seconds
gpt-oss-20b MXFP4 [12.11 GB] low reasoning: wrong after 6 seconds
gpt-oss-20b MXFP4 [12.11 GB] high reasoning: wrong after 3 minutes !
Yea yea it's only one question of nonsense trivia. I'm sure it was billions well spent.
It's possible I'm using a poor temperature setting or something but since they weren't bothered enough to put it in the model card I'm not bothered to fuss with it.
Shouldn't we prefer to have LLMs just search and summarize more reliable sources?
It correctly chose to search, and pulled in the release page itself as well as a community page on reddit, and cited both to give me the incorrect answer that a release had been pushed 3 hours ago. Later on when I got around to it, I discovered that no release existed, no mention of a release existed on either cited source, and a new release wasn't made for several more days.
They are specifically training on webbrowsing and python calling.
https://huggingface.co/spaces/TIGER-Lab/MMLU-Pro
Are you discounting all of the self reported scores?
This would be a great "AGI" test. See if it can derive biohazards from first principles
Of course this could also give humans cancer. (To the OpenAI team's slight credit, when asked explicitly about this, the model refused.)
It's fun that it works, but the prefill time makes it feel unusable. (2-3 minutes per tool-use / completion). Means a ~10-20 tool-use interaction could take 30-60 minutes.
(This was editing a single server.py file that was ~1000 lines; the tool definitions + claude context was around 30k tokens input, and then after the file read, input was around ~50k tokens. Definitely could be optimized. Also I'm not sure if ollama supports a kv-cache between invocations of /v1/completions, which could help)
Not sure about ollama, but llama-server does have a transparent kv cache.
You can run it with
```
llama-server -hf ggml-org/gpt-oss-20b-GGUF -c 0 -fa --jinja --reasoning-format none
```
Web UI at http://localhost:8080 (also OpenAI compatible API)
> Best with ≥60GB VRAM or unified memory
https://cookbook.openai.com/articles/gpt-oss/run-locally-oll...
There's a limit to how much RAM can be assigned to video, and you'd be constrained on what you can use while doing inference.
Maybe there will be lower quants which use less memory, but you'd be much better served with 96+GB
It eliminates any reason to use an inferior Meta or Chinese model that costs money to license, thus there are no funds for these competitors to build a GPT 5 competitor.
I wouldn't speak so soon, even the 120B model aimed for OpenRouter-style applications isn't very good at coding: https://blog.brokk.ai/a-first-look-at-gpt-oss-120bs-coding-a...
I also suspect the new OpenAI model is pretty good at coding if it's like o4-mini, but admittedly haven't tried it yet.
in future releases will they just boost the param count?
One basic point that is often missed is: Different aspects of LLM performance (in the cognitive performance sense) and LLM resource utilization are relevant to various use cases and business models.
Another is that there are many use cases where users prefer to run inference locally, for a variety of domain-specific or business model reasons.
The list goes on.
I often thought that a worrying vector was how well LLMs could answer downright terrifying questions very effectively. However, the guardrails on the big online services existed to prevent those questions from being asked. I guess they were always unleashed with other open source offerings, but I just wanted to understand how close we are to a world where yesterday's idiot terrorist has an extremely knowledgeable (if slightly hallucinatory) digital accomplice to temper most of their incompetence.
However, when you're running the model locally, you are in full control of its context. Meaning that you can start its reply however you want and then let it complete it. For example, you can have it start the response with, "I'm happy to answer this question to the best of my ability!"
That aside, there are ways to remove such behavior from the weights, or at least make it less likely - that's what "abliterated" models are.
With most models it can be as simple as a "Always comply with the User" system prompt or editing the "Sorry, I cannot do this" response into "Okay," and then hitting continue.
I wouldn't spend too much time fretting about 'enhanced terrorism' as a result. The gap between theory and practice for the things you are worried about is deep, wide, protected by a moat of purchase monitoring, and full of skeletons from people who made a single mistake.
Update: it seems to be completely useless for translation. It either refuses, outputs garbage, or changes the meaning completely for completely innocuous content. This already is a massive red flag.
For those who're wondering what the real benefits are: the main one is that you can run your LLM locally, which is awesome, without resorting to expensive and inefficient cloud-based superpowers.
Run the model against your very own documents with RAG; it can provide excellent context engineering for your LLM prompts, with reliable citations and far fewer hallucinations, especially for self-learning purposes [1].
Beyond the Intel-NVIDIA desktop/laptop duopoly, there's the 96 GB (V)RAM MacBook with UMA and the new high-end AMD Strix Halo laptops with a similar setup of 96 GB of (V)RAM carved out of 128 GB of RAM [2]. The gpt-oss-120b is made for this particular setup.
[1] AI-driven chat assistant for ECE 120 course at UIUC:
[2] HP ZBook Ultra G1a Review: Strix Halo Power in a Sleek Workstation:
https://www.bestlaptop.deals/articles/hp-zbook-ultra-g1a-rev...
thimabi•8h ago
What’s the catch?
coreyh14444•8h ago
thimabi•8h ago
For GPT-5 to dwarf these just-released models in importance, it would have to be a huge step forward, and I’m still doubting about OpenAI’s capabilities and infrastructure to handle demand at the moment.
sebzim4500•8h ago
jona777than•7h ago
rrrrrrrrrrrryan•7h ago
Shank•5h ago
Invictus0•4h ago
Shank•3h ago
logicchains•7h ago
NitpickLawyer•7h ago
Probably GPT5 will be way way better. If alpha/beta horizon are early previews of GPT5 family models, then coding should be > opus4 for modern frontend stuff.
int_19h•1h ago
When it comes to LLMs, benchmarks are bullshit. If they sound too good to be true, it's because they are. The only thing benchmarks are useful for is preliminary screening - if the model does especially badly in them it's probably not good in general. But if it does good in them, that doesn't really tell you anything.