This leads me to believe that the training data won't be made publicly available in full, but will merely be "reproducible". That might mean they'll provide references, like a list of URLs of the pages they trained on, but not the contents themselves.
We'll find out in September whether that's true.
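If that's how it pans out, "reproducing" the data would mean re-crawling it yourself, roughly like this (a minimal sketch; the one-URL-per-line file format is just my assumption, not anything they've announced):

```python
# Hypothetical: rebuild a corpus from a published list of training URLs.
# Assumes a plain-text file with one URL per line; the real release
# format (if there is one) could look quite different.
import os
import requests

def refetch(url_list_path: str, out_dir: str = "corpus") -> None:
    os.makedirs(out_dir, exist_ok=True)
    with open(url_list_path) as f:
        urls = [line.strip() for line in f if line.strip()]
    for i, url in enumerate(urls):
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
        except requests.RequestException:
            continue  # pages move or vanish, so the rebuild is lossy
        with open(os.path.join(out_dir, f"{i:08d}.html"), "w", encoding="utf-8") as out:
            out.write(resp.text)

# refetch("training_urls.txt")
```

The obvious catch is link rot: whatever has changed or disappeared since their crawl is simply gone, so the rebuilt corpus can only ever approximate the original.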
“Open LLMs are increasingly viewed as credible alternatives to commercial systems, most of which are developed behind closed doors in the United States or China”
It is obvious that the companies producing big LLMs today have an incentive to enshittify them: pushing subscriptions while also working in product placement, ads, etc. Worse, some already promote political biases.
It would be wonderful if a partnership between academia and government in Europe could build public-good search and AI that endeavours to serve the user over the company.
They missed an opportunity though. They should have called their machine the AIps (AI Petaflops Supercomputer).
OLMo is fully open
Ai2 believes in the power of openness to build a future where AI is accessible to all. Open weights alone aren’t enough – true openness requires models to be trained in the open with fully open access to data, models, and code.
Disclaimer: I’m Swiss and studied at ETH. We’ve got the brainpower, but not much large-scale training experience yet. And IMHO, a lot of the “magic” in LLMs is infrastructure-driven.
I agree with everything you say about getting the experience; the infrastructure is very important and is probably the most critical part of a sovereign LLM supply chain. I would hope there will also be enough focus on the data, early on, that the model ends up being useful.
But it's good to have more and more players in this space.
Source: I'm part of the training team
Can you comment on how the filtering impacted language coverage? E.g. FineWeb2 has 1800+ languages, but some with very little actual representation, while FineWeb2-HQ has just 20, but each with a substantial dataset.
(I'm personally most interested in covering the 24 official EU languages.)
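For anyone who wants to poke at per-language coverage themselves, something like this works with the datasets library (a rough sketch; the dataset id, config names like "fra_Latn", and the "text" column are my assumptions about how FineWeb2 is laid out on the Hub, and the HQ variant may differ):

```python
# Rough sketch: stream a single language subset of FineWeb2 and peek at
# a few rows, without downloading the whole corpus.
from itertools import islice

from datasets import load_dataset

ds = load_dataset(
    "HuggingFaceFW/fineweb-2",  # assumed dataset id on the Hub
    name="fra_Latn",            # assumed per-language config name
    split="train",
    streaming=True,             # avoids pulling the full subset
)

for row in islice(ds, 3):
    print(row["text"][:200])    # assumes a "text" column
```

Streaming a handful of rows per language is usually enough to get a feel for how thin the long-tail languages really are.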
Good luck though, very needed project!
70B feels like the best balance between usable locally and decent for regular use.
maybe not SOTA, but a great first step.
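Back-of-the-envelope on why 70B sits at that boundary (just the weights; KV cache, activations and runtime overhead come on top):

```python
# Rough weight-memory footprint of a 70B-parameter model at common
# precisions. Ignores KV cache, activations and framework overhead.
params = 70e9

for label, bytes_per_param in [("fp16/bf16", 2.0), ("int8", 1.0), ("4-bit", 0.5)]:
    gib = params * bytes_per_param / 2**30
    print(f"{label:>9}: ~{gib:.0f} GiB")

# fp16/bf16: ~130 GiB, int8: ~65 GiB, 4-bit: ~33 GiB
```

So a 4-bit quant just about fits in 48 GB of VRAM (or a pair of 24 GB cards), which is roughly the upper end of what "usable locally" means for most people.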
In that regard it's absolutely not a waste of public infra, just like this car was not a waste.
LLMs do seem to favor general relativity, but at the time they probably would've favored classical mechanics, given the training corpora.
Not-yet unified: Quantum gravity, QFT, "A unified model must: " https://news.ycombinator.com/item?id=44289148
Will be interested to see how this model responds to currently unresolved issues in physics. Does it take an open-world or a closed-world mentality, and/or give a conditioned disclaimer that encourages progress?
What are the current benchmarks?
From https://news.ycombinator.com/item?id=42899805 re: "Large Language Models for Mathematicians" (2023):
> Benchmarks for math and physics LLMs: FrontierMath, TheoremQA, Multi SWE-bench: https://news.ycombinator.com/item?id=42097683
Multi-SWE-bench: A Multi-Lingual and Multi-Modal GitHub Issue Resolving Benchmark: https://multi-swe-bench.github.io/
Add'l LLM benchmarks and awesome lists: https://news.ycombinator.com/item?id=44485226
Microsoft has a new datacenter that you don't have to keep adding water to, which spares the aquifers.
How could this LLM be used to solve the energy and sustainability problems that all LLMs exacerbate? Solutions for the Global Goals, hopefully.
Is the performance or accuracy of this model better on FrontierMath or Multi-SWE-bench, given the training on 1,000 languages?
I just read in the Colab release notes that models uploaded to Hugging Face can be opened in Colab via the "Open in Colab" button on Hugging Face.
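Which is handy, because once the weights land on the Hub, the usual transformers snippet is all that notebook needs (a sketch; the model id below is a placeholder, since the checkpoint isn't released yet):

```python
# Minimal sketch for trying a Hub-hosted checkpoint in Colab.
# "some-org/some-open-llm" is a placeholder, not the real model id.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some-org/some-open-llm"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

inputs = tokenizer("The Alps are", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```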
Cf. https://medium.com/@biswanai92/understanding-token-fertility...
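For context, "fertility" there is just the average number of tokens a tokenizer emits per word, which is why multilingual coverage matters: high-fertility languages get shorter effective context and cost more per query. It's easy to measure yourself (a rough sketch, using GPT-2's tokenizer purely as an example):

```python
# Token fertility = tokens produced per whitespace-separated word.
# Higher fertility for a language means shorter effective context
# windows and higher inference cost in that language.
from transformers import AutoTokenizer

def fertility(tokenizer, text: str) -> float:
    words = text.split()
    tokens = tokenizer.encode(text, add_special_tokens=False)
    return len(tokens) / len(words)

tok = AutoTokenizer.from_pretrained("gpt2")  # example tokenizer only
print(fertility(tok, "The quick brown fox jumps over the lazy dog"))
print(fertility(tok, "Eidgenössische Technische Hochschule Zürich"))
```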
What does anyone get out of this when we have open-weight models already?
Are they going to do very innovative AI research that companies wouldn't dare try or fund? Seems unlikely.
Is it a huge moonshot project that no single company could fund? Not that either.
If it's just a fun way to train the next generation of LLM researchers, then you might as well build a small-scale toy instead of using up a supercomputer center.
Including how it was trained, what data was used, how training data was synthesized, how other models were used, etc. All the stuff that is kept secret in the case of Llama, DeepSeek, etc.
[1] https://ethz.ch/en/news-and-events/eth-news/news/2023/09/fro...
Obviously I don't know if it's the university or the people there, since I haven't been there, but I keep hearing about ETH Zurich in different areas, and that means something.
[1] https://www.bluewin.ch/en/news/usa-restricts-swiss-access-to...
Also, I'm curious whether there was any reason to put out such a PR without actually releasing the model (due in summer). What's the delay? Or rather, what was the motivation for the PR?
k__•7mo ago
Great to read that!
esafak•7mo ago
How are you going to serve users if website owners decide to wall their content? You can't ignore one side of the market.
diggan•7mo ago
It is a fair point, but how strong a point it is remains to be seen. Some architectures are better than others even with the same training data, so it's not impossible that we could at some point see innovative architectures beating the current proprietary ones. It would probably be short-lived, though, as the proprietary ones would obviously improve in their next release after that.
tharant•7mo ago
[0] the ultimate, of course, being profit.
jowea•7mo ago
I'm not sure we're thinking of the same field of AI development. I think I'm talking about super-autocomplete with an integrated copy of all digitalized human knowledge, while you're talking about trying to do (proto-)AGI. Is that it?
heavenlyblue•6mo ago
You just listed the possible options in order of their relative probability. A human would attempt to use them in exactly that order.
Dylan16807•7mo ago
Don't focus too much on a single variable, especially when all the variables have diminishing returns.
JKCalhoun•7mo ago
I understand the web is a dynamic thing, but it would still seem to be useful on some level.
lllllm•7mo ago
[1] https://arxiv.org/abs/2504.06219