But I would say the reaction was probably vastly overblown: what DeepSeek really showed was that there are much more efficient ways of doing things (which can also be applied to even larger clusters).
If this checkpoint was trained on non-Nvidia GPUs that would definitely be a much bigger deal, but there don't seem to have been any announcements to that effect.
And then part of the impact was just "woah, if some noname team from China can casually leapfrog major western players on a tiny budget and kill one of their moats in the same move, what other surprises like this are possible?". The event definitely invalidated a lot of assumptions investors had about what is or isn't possible near-term; the stock market reacted to suddenly increased uncertainty.
Not only does DeepSeek use a lot of Nvidia hardware for the training.
But even more so: by releasing an open-weight frontier model, they've ensured that people around the world need more Nvidia chips than ever for inference.
DeepSeek helped "prove" to a lot of execs that "Good" is "Good enough" and that there are viable alternatives with less perceived risk of supply chain disruption - even if the facts may differ from this narrative.
The hardware is great, CANN is not CUDA yet.
"Tech Chip software stocks sink on report Trump ordered halt to China sales" - https://www.cnbc.com/2025/05/28/chip-software-trump-china.ht...
No idea why they can't just wait a bit to coordinate stuff. Bit messy in the news cycle.
it's almost as if they don't care about creating a proper buzz.
Despite constant protestations of hype among the tech crowd, GenAI really is big enough of a deal that new developments don't need to be pushed onto market; people are voluntarily seeking them out.
Not to make people aware of GenAI, but to make sure OpenAI continues to be perceived as the AI company. The company that leads and revolutionizes, with everyone just copying them and trying to match them. That perception is a significant part of their value and probably their biggest moat
Hugging Face has a leaderboard and it seems dominated by models that are fine-tunes of various common open-source models, yet they don't seem to be broadly used:
Out of curiosity, how did you gauge that?
My ignorance is showing here: why is the Pro 05-06 a nerf?
- live benchmarks (livebench, livecodebench, matharena, SWE-rebench, etc)
- benchmarks that do not have a fixed structure, like games or human feedback benches (balrog, videogamebench, arena)
- (to some extent) benchmark without existing/published answers (putnambench, frontiermath). You could argue that someone could hire people to solve those or pay off benchmark dev, but it's much more complicated.
Most of the benchmarks that don't try to tackle future contamination are much less useful, that's true. Unfortunately, HLE kind of ignored it (they plan to add a hidden set to test for contamination, but once the answers are there, it's a lost game IMHO); I really liked the concept.
Edit: it is true that these benchmarks are focusing only on a fairly specific subset of the model capabilities. For everything else vibe check is your best bet.
Of course, some benchmarks are still valid and will remain valid. E.g. we can make the models play chess against each other and score them on how well they do. But those benchmarks are in general fairly narrow. They don't really measure the "broader" intelligence we are after. And often, LLMs perform worse than specialized models. E.g. I don't think there is any LLM out there that can beat a traditional chess program (certainly not using the same computing power).
What is really bad are the QA benchmarks which leak over time into the training data of the models. And sometimes, one can suspect even big labs have an economic incentive to score well on popular benchmarks, which causes them to manipulate the models way beyond what is reasonable.
And taking a bunch of flawed benchmarks, combining them into indexes, and saying this model is 2% better than that model is completely meaningless, but of course it's fun and draws a lot of attention.
So, yes, we are kind of left with vibe checks, but in theory, we could do more; take a bunch of models, double-blind, and have a big enough, representative group of human evaluators score them against each other on meaningful subjects.
Of course, done right, that would be really expensive. And those sponsoring might not like the result.
I think a general model that can
- finish nethack, doom, zelda and civilization,
- solve the hardest codeforces/atcoder problems,
- formally prove Putnam solutions with high probability, without being given the answer
- write a PR to close a random issue on github
is likely to have some broader intelligence. I may be mistaken, since there were tasks in the past that appeared to be unsolvable without human-level intelligence, but in fact weren't.
I agree that such benchmarks are limited to either environments with well-defined feedback and rules (games) or easily verifiable ones (code/math), but I wouldn't say it's super narrow, and there are no non-LLM models that perform significantly better on these (except in some games), though specialized LLMs work better. Finding other examples, I think, is one of the important problems in AI metrology.
> So, yes, we are kind of left with vibe checks, but in theory, we could do more; take a bunch of models, double-blind, and have a big enough, representative group of human evaluators score them against each other on meaningful subjects.
You've invented an arena (which just raised quite a lot of money). Can argue about "representative," of course. However, I think the SNR in the arena is not too high now; it turns out that the average arena user is quite biased, most of their queries are trivial for LLMs, and for non-trivial ones they cannot necessarily figure out which answer is better. MathArena goes in the opposite direction: narrow domain, but expert evaluation. You could imagine a bunch of small arenas, each with its own domain experts. I think it may happen eventually if the flow of money into AI continues.
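To make the aggregation concrete, here's a toy sketch (my own illustration, not arena code) of turning blinded pairwise votes into a ranking with Elo-style updates; real arena leaderboards use more careful statistics, e.g. Bradley-Terry fits with confidence intervals:

    from collections import defaultdict

    def update_elo(ratings, winner, loser, k=32):
        ra, rb = ratings[winner], ratings[loser]
        expected_win = 1 / (1 + 10 ** ((rb - ra) / 400))  # expected score for the winner
        ratings[winner] = ra + k * (1 - expected_win)
        ratings[loser] = rb - k * (1 - expected_win)

    ratings = defaultdict(lambda: 1000.0)
    # Each vote: an evaluator saw two anonymized answers and picked the better one.
    votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
    for winner, loser in votes:
        update_elo(ratings, winner, loser)

    print(sorted(ratings.items(), key=lambda kv: -kv[1]))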
I wasn't trying to invent anything. Just describing what you would obviously have to do if you were to take a "scientific" or "objective" approach: Sound experiments, reproducible, free of financial incentives.
As far as I can tell, no one is doing that at a significant scale. Everything is buried in hype and marketing.
Now for that broad set of benchmarks (PRs to GitHub, Putnam, Zelda). There is something to that, but it depends on the model. A lot of what is out there are "mixtures of experts", either by implicit or explicit design. So there is a mechanism that looks at the problem and then picks the subsystem to delegate it to. Is it a game of chess? Boot up the chess program. Is it poetry? Boot up the poetry generator.
That sort of thing is not showing broad intelligence any more than a person who knows both a chess player and a poet has broad intelligence.
Deepseek is, as far as I can tell, the leading open-source model; and in some way, that makes it the leading model. I don't think you can fairly compare a model that you can run locally with something that is running behind a server-side API - because who knows what is really going on behind the API.
Deepseek being Chinese makes it political and even harder to have a sane conversation about, but I am sure that had it been China that did mostly closed models and the US that did open ones, we would hold that against them, big time.
But how is it different from what arena or matharena does?
> That sort of thing is not showing broad intelligence any more than a person who knows both a chess player and a poet has broad intelligence.
The claim is that these problems require somewhat broad intelligence by themselves, as opposed to specialization in a specific task while being unable to do anything else.
No, that's not actually a good description of the mixture-of-experts methodology. It was poorly named. There is no conscious division of the weights into "This subset is good for poetry, this one is best for programming, this one for math, this one for games, this one for language translation, etc."
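For anyone picturing a topic-based dispatcher, here is a toy sketch of what MoE routing actually looks like (purely illustrative; not DeepSeek's implementation, and all names are made up): a learned gate scores experts per token, per layer, and mixes the top-k expert MLPs. No expert is assigned to "chess" or "poetry"; the split is learned and usually not human-interpretable.

    import torch, torch.nn as nn

    class ToyMoELayer(nn.Module):
        def __init__(self, d_model=64, n_experts=8, k=2):
            super().__init__()
            self.gate = nn.Linear(d_model, n_experts)  # learned router
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, 4 * d_model),
                              nn.GELU(),
                              nn.Linear(4 * d_model, d_model))
                for _ in range(n_experts))
            self.k = k

        def forward(self, x):  # x: (tokens, d_model)
            scores = self.gate(x).softmax(dim=-1)        # (tokens, n_experts)
            weights, idx = scores.topk(self.k, dim=-1)   # top-k experts per token
            out = torch.zeros_like(x)
            for t in range(x.shape[0]):                  # naive loop for clarity
                for w, e in zip(weights[t], idx[t]):
                    out[t] += w * self.experts[int(e)](x[t])
            return out

    # y = ToyMoELayer()(torch.randn(5, 64))  # 5 tokens, each routed independently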
I think you just described SATs and other standardized tests
https://openrouter.ai/deepseek/deepseek-r1-0528/providers
May 28th update to the original DeepSeek R1. Performance on par with OpenAI o1, but open-sourced and with fully open reasoning tokens. It's 671B parameters in size, with 37B active in an inference pass.
Fully open-source model.
I remember there's a project "Open R1" that, last I checked, was working on gathering their own list of training material; it looks active but I'm not sure how far along they've gotten:
Out of curiosity, does anyone do anything "useful" with that knowledge? It's not like people can just randomly train models...
Today we introduce OLMoTrace, a one-of-a-kind feature in the Ai2 Playground that lets you trace the outputs of language models back to their full, multi-trillion-token training data in real time. OLMoTrace is a manifestation of Ai2’s commitment to an open ecosystem – open models, open data, and beyond.
what you're saying is just that it's non-reproducible, which is a completely valid but separate issue
The training process is fully deterministic. It's just an algorithm. Feed the same data in and you'll get the same weights out.
If you're speaking about the computational cost, it used to be that way for compilers too. Give it 20 years and you'll be able to train one of today's models on your phone.
There simply aren't that many sources of non-determinism in a modern computer.
Though I'll grant that if you've engineered your codebase for speed and not for determinism, error can creep in via floating point error, sloppy ordering of operations, etc. These are not unavoidable implementation details, however. CAD kernels and other scientific software do it every day.
When you boil down what's actually happening during training, it's just a bunch of matrix math. And math is highly repeatable. Size of the matrix has nothing to do with it.
I have little doubt that some implementations aren't deterministic, due to software engineering choices as discussed above. But the algorithms absolutely are. Claiming otherwise seems equivalent to claiming that 2 + 2 can sometimes equal 5.
Not some of them; ALL OF THEM. Engineering training pipelines for absolute determinism would be, quite frankly, extremely dumb, so no one does it. When you need millions of dollars worth of compute to train a non-toy model are you going to double or triple your cost just so that the process is deterministic, without actually making the end result perform any better?
The cost of adaptive precision floats can be negligible depending on application. One example I'm familiar with from geometry processing: https://www.cs.cmu.edu/~quake/robust.html
Integer math often carries no performance penalty compared to floating point.
I guess my takeaway from this conversation is that there's a market for fast high-precision math techniques in the AI field.
Training is reproducible only if, besides the pipeline and data, you also start from the same random weights.
It sounds like you're referring to something like simulated annealing. Using that as an example, the fundamental requirement is to introduce arbitrary, uncorrelated steps -- there's no requirement that the steps be random, and the only potential advantage of using a random source is that it provides independence (lack of correlation) inherently; but in exchange, it makes testing and reproduction much harder. Basically every use of simulated annealing or similar I've run into uses pseudorandom numbers for this reason.
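A toy illustration of that last point (my own sketch, not from any training codebase): seed the PRNG and the whole "random" process becomes bit-for-bit repeatable.

    import math, random

    def anneal(seed: int, steps: int = 10_000) -> float:
        rng = random.Random(seed)  # seeded PRNG: fully reproducible runs
        x, temp = rng.uniform(-10, 10), 1.0
        for _ in range(steps):
            candidate = x + rng.gauss(0, temp)
            # always accept downhill moves; accept uphill moves with Boltzmann probability
            if candidate**2 < x**2 or rng.random() < math.exp((x**2 - candidate**2) / temp):
                x = candidate
            temp *= 0.999  # cool slowly toward the minimum of x^2
        return x

    print(anneal(42) == anneal(42))  # True: same seed, same trajectory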
Point at the science that says that, please: Current scientific knowledge doesn't agree with you.
I'd love a citation. So far you haven't even suggested a possible source for this non-determinism you claim exists.
Where would other non-determinism come from?
I'm open to there being another source. I'd just like to know what it would be. I haven't found one yet.
No it is not. The training process is non-deterministic: given exactly the same data, the same code and the same seeds, you'll get different weights. Even the simplest operations like matrix multiplication give slightly different results depending on the hardware (CPU, GPU from vendor #1, GPU from vendor #2, probably different GPUs from the same vendor, different CUDA versions, etc.). The results also depend on the shapes involved (e.g. if you fuse the QKV weights of a modern transformer into a single matrix and do a single multiplication instead of multiplying each separately, you'll get different results). And some algorithms (e.g. the backwards pass of Flash Attention) are explicitly non-deterministic in order to be faster.
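A tiny illustration of where that drift comes from (a toy example of my own, unrelated to any particular training stack): floating-point addition is not associative, so the same reduction computed in a different order can land on a different value, which is exactly what changes across hardware, kernel fusion choices and parallel schedules.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(100_000).astype(np.float32)

    forward = np.float32(0.0)
    for v in x:            # accumulate left to right
        forward += v

    backward = np.float32(0.0)
    for v in x[::-1]:      # same numbers, opposite order
        backward += v

    print(forward, backward, forward == backward)  # typically differs in the low bits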
That has everything to do with implementation, and nothing to do with algorithm. There is an important difference.
Math is deterministic. The way [random chip] implements floating point operations may not be.
Lots of scientific software has the ability to use IEEE-754 floats for speed or to flip a switch for arbitrary precision calculations. The calculation being performed remains the same.
The point is none of these models are trained with pure "math". It doesn't matter that you can describe a theoretical training process using a set of deterministic equations, because in practice it doesn't work that way. Your claim that "the training process is fully deterministic" is objectively wrong in this case because none of the non-toy models use (nor can they practically use) such a deterministic process. There is a training process which is deterministic, but no one uses it (for good reasons).
If you had infinite budget, exactly the same code, the same training data, and even the same hardware you would not be able to reproduce the weights of Deepseek R1, because it wasn't trained using a deterministic process.
    import random
    import numpy as np
    import torch

    torch.manual_seed(0)
    random.seed(0)
    np.random.seed(0)
    torch.use_deterministic_algorithms(True)  # on CUDA, also set CUBLAS_WORKSPACE_CONFIG=:4096:8
But in practice that is too slow, so we use nondeterministic implementations that run fast and loose with memory management and don't necessarily care about the order in which parallel operations return.

E.g. see all the specialized third-party models out there based on Qwen.
"Open-source" is the wrong word here, what they mean is "you can modify and redistribute these weights".
https://chatgpt.com/share/6838c070-705c-8005-9a88-83c9a5550a...
Also, the "redistribute" part is key here.
Fully agree, it isn't. Reverse engineering isn't necessary for modifying compiled program behaviour, so comparing it to finetuning is not applicable. Finetuning applied to program domain would be more like adding plugins or patching in some compiled routines. Reverse-engineering applied to models would be like extracting source documents from weights.
> Finetuning is a standard supported workflow for these models.
Yes, so is adding mods for some games: just put your files in a designated folder and the game automatically picks them up and does the required modifications.
> Also, the "redistribute" part is key here.
It is not. Redistributability and being open source are orthogonal. You can have the source for a program and not be able to redistribute the source or the program, or you can redistribute a compiled program but not have its source (freeware).
Open source has the word "source" in it for a reason, and those models ain't open source and have nothing to do with it.
There's a few efforts at full open data / open weight / open code models, but none of them have gotten to leading-edge performance.
I’d just assume they did because—why scrape again if you want to train a new model? But if you know otherwise, I’m not tied to this idea.
If you (or any human) violate copyright law, legal redress can be sought. The amount of damage you can do is limited because there's only one of you vs the marginal cost of duplicating AI instances.
There are many other differences between humans and AI, in terms of both capabilities and the motivations of the legal persons making decisions.
But enough about whether it should be legal to own a Xerox machine. It's what you do with the machine that matters.
However, once these words are broadcast—once they’re read, and the ideas expressed here enter someone else’s mind—I believe it’s only fair that the person on the receiving end has the right to use, replicate, or create something from them. After all, they lent me their brain—ideas that originated in my mind now live in theirs.
This uses up their mental "meat space," their blood sugar, and their oxygen—resources they provide. So, they have rights too: the right to do as they please with those ideas, including creating any and all data derived from them. Denying them that right feels churlish, as if it isn’t the most natural thing in the world.
(Before people jump on me: Yes, creators need to be compensated—they deserve to make a living from their work. But this doesn’t extend to their grandchildren. Copyright laws should incentivize creation, not provide luxury for the descendants of the original creator a century later.)
What they have released has been distilled into many new models that others have been using for commercial benefit and I appreciate the contributions that they have made.
I also don't expect Microsoft to release their full Windows 11 source code, but that also means it's not open source. And that's okay, because Microsoft doesn't call it open source.
Just because the model produces stuff doesn't mean that's the model's source, just like the binary for a compiler isn't the compiler's source.
> Studies have found ethanol levels in commercial apple juice ranging from 0.06 to 0.66 grams per liter, with an average around 0.26 grams per liter[1]
Even apple juice is an alcoholic drink if you push your criteria to absurdity.
We have numerous artifacts to reason about:
- The model code
- The training code
- The fine tuning code
- The inference code
- The raw training data
- The processed training data (which might vary across various stages of pre-training and potentially fine-tuning!)
- The resultant weights
- The inference outputs (which also need a license)
- The research papers (hopefully it's described in literature!)
- The patents (or lack thereof)
The term "open source" is wholly inadequate here. We need a 10-star grading system for this.
This is not your mamma's C library.
AFAICT, DeepSeek scores 7/10, which is better than OpenAI's 0/10 (they don't even let you train on the outputs).
This is more than enough to distill new models from.
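If you wanted to operationalize that 10-star idea, the scorecard could be as simple as the sketch below (the artifact names come from the list above; the example values are placeholders, not an audited assessment of DeepSeek or any other model):

    # Hypothetical openness scorecard; the True/False values here are illustrative only.
    ARTIFACTS = [
        "model code", "training code", "fine-tuning code", "inference code",
        "raw training data", "processed training data", "weights",
        "permissive license on outputs", "research papers", "no blocking patents",
    ]

    def openness_score(released: dict[str, bool]) -> str:
        stars = sum(released.get(a, False) for a in ARTIFACTS)
        return f"{stars}/{len(ARTIFACTS)} artifacts open"

    example = {a: True for a in ARTIFACTS}
    example["raw training data"] = False       # placeholder values, not a real audit
    example["processed training data"] = False
    example["training code"] = False
    print(openness_score(example))             # -> "7/10 artifacts open"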
Everybody is laundering training data, and it's rife with copyrighted data, PII, and pilfered outputs from other commercial AI systems. Because of that, I don't expect we'll see much legally open training data for some time to come. In fact, the first fully open training data of adequate size (not something like LJSpeech) is likely to be 100% synthetic or robotically-captured.
If you think the data story isn't a complicated beast, then consider:
If you wanted an "open" dataset, would you want it before or after it was processed? There are a lot of cleaning, categorizing, feature extraction steps. The data typically undergoes a lot of analysis, extra annotation, bucketing, and transformation.
If the pre-train was done in stages, and the training process was complicated, how much hand-holding do you need to replicate that process?
Do you need all of the scripts to assist with these processes? All of the infra and MLOps pieces? There's a lot of infrastructure to just move the data around and poke it.
Where are you going to host those terabytes or petabytes of data? Who is going to download it? How often? Do you expect it to be downloaded as frequently as the Linux kernel sources?
Did you scrub it of PII? Are you sure?
And to clarify, we're not even talking about trained models at this point.
The answer is also known. So the reason one would want an open-source model (read: a reproducible model) would be ethics.
And if we wait for the internet to be wholly eaten by AI, if we accept perfect as the enemy of good, then we'll have nothing left to cling to.
> And the question is also pretty clear: did $company steal other peoples work?
Who the hell cares? By the time this is settled - and I'd argue you won't get a definitive agreement - the internet will be won by the hyperscalers.
Accept corporate gifts of AI, and keep pushing them forward. Commoditize. Let there be no moat.
There will be infinite synthetic data available to us in the future anyway. And none of this bickering will have even mattered.
That's my formal argument. The less formal one is that copyright protection is something that smaller artists deserve more than rich conglomerates, and even then, durations shouldn't be "eternity and a day". A huge chunk of what is being "stolen" should be in the commons anyway.
The companies that create these models can't answer that question! Models get jailbroken all the time to ignore alignment instructions. The robust refusal logic normally sits on top of the model, i.e. looking at the responses and flagging anything they don't want to show to users.
The best tool we have for understanding if a model is refusing to answer a problem or actually doesn't know is mechanistic interp, which you only need the weights for.
This whole debate is weird; even with traditional open-source code you can't tell the intent of a programmer, what sources they used to write that code, etc.
> ollama run deepseek-r1
edit: most providers are offering a quantized version...
This whole “building moats” and buying competitors fascination in the US has gotten boring, obvious and dull. The world benefits when companies struggle to be the best.
[0] https://xcancel.com/glitchphoton/status/1927682018772672950
DeepSeek-coder-v2 is fine for this. I occasionally use a smaller Qwen3 (I forget exactly which at the moment... set and forget) for some larger queries about code; given my fairly light use cases and pretty small contexts it works well enough for me.
Having said that, I'm paranoid too. But if I wasn't they'd have got me by now.
Sending a document with a social security number to OpenAI is just a dumb idea. As an example.
Whilst the Chinese intelligence agency will not have much power over you.
You can for instance use them to extract some information such as postal codes from strings, or to translate and standardize country names written in various languages (e.g. Spanish, Italian and French to English), etc.
I'm sure people will have more advanced use cases, but I've found them useful for that.
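For a concrete picture of that kind of "boring" use, here's roughly what such a call looks like against an OpenAI-compatible endpoint (OpenRouter is shown purely as an example host; the model slug, prompt and API key are placeholders, and real use would add batching and error handling):

    import requests

    def standardize_country(raw: str, api_key: str) -> str:
        resp = requests.post(
            "https://openrouter.ai/api/v1/chat/completions",
            headers={"Authorization": f"Bearer {api_key}"},
            json={
                "model": "deepseek/deepseek-r1-0528",
                "messages": [
                    {"role": "system",
                     "content": "Reply with only the English name of the country mentioned."},
                    {"role": "user", "content": raw},
                ],
            },
            timeout=60,
        )
        return resp.json()["choices"][0]["message"]["content"].strip()

    # standardize_country("République française", api_key="...")  # -> "France"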
wongarsu•1d ago
For the hobby version you would presumably buy a used server and a used GPU. DDR4 ECC RAM can be had for a little over $1/GB, so you could probably build the whole thing for around $2k.
phonon•1d ago
https://github.com/kvcache-ai/ktransformers
JKCalhoun•1d ago
Mobo was some kind of mining rig from AliExpress for less than $100. GPU is an inexpensive NVIDIA TESLA card that I 3D printed a shroud for (added fans). Power supply a cheap 2000 Watt Dell server PS off eBay....
[1] https://bsky.app/profile/engineersneedart.com/post/3lmg4kiz4...
diggan•1d ago
Not affiliated, just a (mostly) happy user, although don't trust the bandwidth numbers, lots of variance (not surprising though, it is a user-to-user marketplace).
qingcharles•9h ago
I don't even know how these Vast servers make money because there is no way you can ever pay off your hardware from the pennies you're getting.
omneity•1d ago
An alternative is to use serverless GPU or LLM providers which abstract some of this for you, albeit at a higher cost and slow starts when you first use your model for some time.
zackangelo•20h ago
Not for the GPU poor, to be sure.
hu3•1d ago
There's already a 685B parameter DeepSeek V3 for free there.
https://openrouter.ai/deepseek/deepseek-chat-v3-0324:free
dist-epoch•1d ago
For example you could use it to summarize a public article.
rahimnathwani•1d ago
This guy ran a 4-bit quantized version with 768GB RAM: https://news.ycombinator.com/item?id=42897205
SkyPuncher•1d ago
Truthfully, it's just not worth it. You either run these things so slowly that you're wasting your time or you have to buy 4- or 5-figures of hardware that's going to sit, mostly unused.
danielhanchen•1d ago
I'm working on the new one!
behnamoh•1d ago
Of course we can run any model if we quantize it enough, but I think the OP was talking about the unquantized version.
danielhanchen•1d ago
You can do it via `-ot ".ffn_.*_exps.=CPU"`
threeducks•1d ago
If speed is truly not an issue, you can run Deepseek on pretty much any PC with a large enough swap file, at a speed of about one token every 10 minutes assuming a plain old HDD.
Something more reasonable would be a used server CPU with as many memory channels as possible and DDR4 ram for less than $2000.
But before spending big, it might be a good idea to rent a server to get a feel for it.
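A rough back-of-envelope for why memory bandwidth, not just capacity, sets the token rate on these CPU builds (my own assumed numbers, using the 671B-total / 37B-active figures mentioned upthread and a ~4-bit quant):

    # The whole model must fit in RAM, but only the active experts are read per token.
    total_params = 671e9
    active_params = 37e9
    bytes_per_w = 0.5  # ~4-bit quantization

    model_size_gb = total_params * bytes_per_w / 1e9    # ~335 GB to hold in memory
    read_per_token = active_params * bytes_per_w / 1e9  # ~18.5 GB streamed per token

    for name, bw_gbs in [("dual-channel DDR4", 50),
                         ("8-channel DDR4 server", 200),
                         ("Mac Studio unified memory", 800)]:
        print(f"{name}: <= {bw_gbs / read_per_token:.0f} tokens/s "
              f"(model ~{model_size_gb:.0f} GB)")

Those are upper bounds; the single-digit token rates people report elsewhere in the thread are consistent with them.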
z2•1d ago
Software: client of choice to https://openrouter.ai/deepseek/deepseek-r1-0528
Sorry I'm being cheeky here, but realistically unless you want to shell out 10k for the equivalent of a Mac Studio with 512GB of RAM, you are best using other services or a small distilled model based on this one.
jazzyjackson•1d ago
There's a couple of guides for setting it up "manually" on ec2 instances so you're not paying the Bedrock per-token prices; here's one [1] that calls for four g6e.48xlarge instances (each with 192 vCPUs, 1536 GB RAM, and 8x L40S Tensor Core GPUs with 48 GB of memory per GPU).
Quick google tells me that g6e.48xlarge is something like 22k USD per month?
[0] https://aws.amazon.com/bedrock/deepseek/
[1] https://community.aws/content/2w2T9a1HOICvNCVKVRyVXUxuKff/de...
whynotmaybe•5h ago
With an average of 3.6 tokens/sec, answers usually take 150-200 seconds.