> What makes this vast private data uniquely valuable is its quality and real-world grounding. This data includes electronic health records, financial transactions, industrial sensor readings, proprietary research data, customer/population databases, supply chain information, and other structured, verified datasets that organizations use for operational decisions and to gain competitive advantages. Unlike web-scraped data, these datasets are continuously validated for accuracy because organizations depend on them, creating natural quality controls that make even a small fraction of this massive pool extraordinarily valuable for specialized AI applications.
Will there be a data exchange where one can buy and sell data, or even commodity data markets, where one can hedge/speculate on futures?
Asking for a friend.
> Despite what their name might suggest, so-called “large language models” (LLMs) are trained on relatively small datasets.[1][2][3] For starters, all the aforementioned measurements are described in terms of terabytes (TBs), which is not typically a unit of measurement one uses when referring to “big data.” Big data is measured in petabytes (1,000 times larger than a terabyte), exabytes (1,000,000 times larger), and sometimes zettabytes (1,000,000,000 times larger).
How valuable is 70 petabytes of temperature sensor readings to a commercial LLM? It is in fact negative. You don't want to be training the LLM on that data: you've only got so much room in those neurons, and you don't want it consumed with trying to predict temperature time series.
We don't need "more data", we need "more data of the specific types we're training on". That is not so readily available.
Not that it really matters, though. The ideas in the document are utterly impractical. Nobody is going to label the world's data with a super-complex permission scheme, any more than the world created the Semantic Web by labeling its data with rich metadata and cross-linking. Especially since it would be of negative value to AI training anyhow.
But to your point, a crucial question in AI right now is: how much quality data is still out there?
As far as the impracticality goes, it's a great point. I disagree and have spent about 10 years working in the area, but that can be a post for another day. I understand and appreciate the skepticism.
Why? Intelligence and compression might just be two sides of the same coin, and given that, I'd actually be very surprised if a future ASI couldn't make do with a fraction of that.
Just because current LLMs need tons of data doesn't mean that that's somehow an inherent requirement. Biological lifeforms seem to be able to train/develop general intelligence from much, much less.
"Biological lifeforms seem to be able to train/develop general intelligence from much, much less."
This statement is hard to defend. The brain takes in roughly 125 MB/second, and over an 80-year lifetime that works out to 300+ petabytes.
But that's not the real kicker. It's pretty unfair to say that humans learn everything they know from birth -> death. A lot of that learning bias was worked out through evolution... which takes that 300+ petabytes and multiplies it by... many lifetimes.
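For what it's worth, the 125 MB/s sensory-bandwidth number is itself a rough estimate, but the 300+ PB figure above does follow from it. A quick back-of-envelope check:

```python
# Back-of-envelope check of the "300+ petabytes per lifetime" figure above.
# The 125 MB/s sensory-bandwidth number is taken as given (it's a rough estimate).
SECONDS_PER_YEAR = 365.25 * 24 * 3600       # ~3.16e7 seconds

bandwidth_mb_per_s = 125                    # claimed total sensory input rate
lifetime_years = 80

total_mb = bandwidth_mb_per_s * SECONDS_PER_YEAR * lifetime_years
total_pb = total_mb / 1e9                   # 1 PB = 1e9 MB (decimal units)
print(f"~{total_pb:.0f} PB over a lifetime")  # ~316 PB
```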
That also seems several orders of magnitude off. Would you suspect that a human that only experiences life through H.264-compressing glasses, MP3-recompressing headphones etc. does not develop a coherent world model?
What about a human only experiencing a high fidelity 3D rendering of the world based on an accurate physics simulation?
The claim that humans need petabytes of data to develop their mind seems completely indefensible to me.
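To put a rough number on that thought experiment (the bitrates here are my own assumptions: roughly 5 Mbit/s for a 1080p H.264 stream plus ~128 kbit/s for audio):

```python
# Rough estimate of a lifetime of *compressed* audiovisual input, using assumed
# bitrates rather than the raw 125 MB/s figure from the comment above.
SECONDS_PER_YEAR = 365.25 * 24 * 3600

video_mbit_s = 5.0        # assumed 1080p H.264 bitrate
audio_mbit_s = 0.128      # assumed MP3-ish audio bitrate
mb_per_s = (video_mbit_s + audio_mbit_s) / 8   # megabits -> megabytes

total_pb = mb_per_s * SECONDS_PER_YEAR * 80 / 1e9
print(f"~{total_pb:.1f} PB over 80 years")      # ~1.6 PB, vs ~300 PB raw
```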
> A lot of that learning bias was worked out through evolution... which takes that 300+ petabytes and multiplies it by... many lifetimes.
Isn't that like saying that you only need the right data? In which case I'd completely agree :)
I don't think a bigger hand wave has ever been made. Homomorphic encryption increases computational load by orders of magnitude. And I'm not aware of anyone trying to use this (very interesting) technology for much of anything, let alone GPU ML algorithms.
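For anyone who hasn't seen it in action, here's a toy textbook Paillier scheme (additively homomorphic only, tiny insecure key, nothing like the fully homomorphic schemes ML training would need), just to show what the overhead looks like: one encrypted addition costs several big-integer modular exponentiations instead of a single CPU add.

```python
# Toy Paillier cryptosystem (textbook construction, illustrative only: tiny key,
# no security). Adding two encrypted numbers requires big-integer modular
# exponentiations, versus one machine instruction for plaintext addition.
import math, random

p, q = 104723, 104729            # small demo primes; real keys are 2048+ bits
n, n_sq = p * q, (p * q) ** 2
g = n + 1
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)      # lcm(p-1, q-1)
mu = pow((pow(g, lam, n_sq) - 1) // n, -1, n)          # inverse of L(g^lam mod n^2) mod n

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:                         # r must be in Z*_n
        r = random.randrange(1, n)
    return (pow(g, m, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(c):
    return ((pow(c, lam, n_sq) - 1) // n) * mu % n

m1, m2 = 42, 58
c_sum = (encrypt(m1) * encrypt(m2)) % n_sq   # homomorphic add = ciphertext multiply
assert decrypt(c_sum) == m1 + m2
print("decrypted sum:", decrypt(c_sum))      # 100
```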
I have a better idea: let's just cut the middlemen out and send every bit of data every computer generates to OpenAI. Sorry, to be fair, they want this to be a government-led operation... I'm sure that'll be fine too.
I am going to make a blank model, train it homomorphically to predict someone's name based on their butt cancer status, then prompt it to generate a list of people's names who have butt cancer, and blackmail them by threatening to send the list to their employers.
Pay them.
Otherwise, why on Earth should I care about "contributing to AI"? It's just another commercial venture that is trying to get something of high value for no money. A protocol that doesn't involve royalty payments is a non-starter.
This is a bold assumption. After Enron (financial transactions), Lehman Brothers (customer/population databases, financial transactions), Theranos (electronic health records), Nikola (proprietary research data), Juicero (I don't even know what this is), WeWork (umm ... everything), FTX (everything and we know they didn't mind lying to themselves) I'm pretty sure we can all say for certain that "real world grounding" isn't a guarantee with regards to anything where money or ego is involved.
Not to mention that at this point we're actively dealing with processes being run (improperly) by AI (see the lawsuits against Cigna and United Health Care [1]), leading to self-training loops without revealing the "self" aspect of it.
[1]: https://www.afslaw.com/perspectives/health-care-counsel-blog...
FTFY
If you get copies of the same data, it doesn't help. In a similar fashion, going from 100 TB of data scraped from the internet to 200 TB of data scraped from the internet... does it tell you much more? Unclear.
But there are large categories of data which aren't represented at all in LLMs. Most of the world's data just isn't on the internet. AI for Health is perhaps the most obvious example.
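On the duplication point above: even a trivial exact-match dedup pass (real pipelines go further with near-duplicate detection like MinHash) shows why doubling a raw scrape doesn't double the useful data.

```python
# Minimal exact-deduplication sketch: hash each document, keep only first copies.
# (The corpus here is a stand-in; real pipelines also do near-duplicate detection,
# which removes far more redundancy than exact matching.)
import hashlib

def dedupe(docs):
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc

corpus = ["same boilerplate page"] * 3 + ["an actually new page"]
unique = list(dedupe(corpus))
print(f"{len(corpus)} scraped docs -> {len(unique)} unique")   # 4 -> 2
```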
I have to note that taking the "bitter lesson" position as a claim that more data will result in better LLMs is a wild misinterpretation (or perhaps a "telephone" version) of the original bitter lesson article, which says only that general, scalable algorithms do better than knowledge-carrying, problem-specific algorithms. And the last I heard, it was the "scaling hypothesis" that hardly had consensus among those in the field.
If any more scaling does happen, it will happen in the mid-training (using agentic/reasoning outputs from previous model versions) and RL training stages.
Recent progress on useful LLMs seems to involve slimming them down.[1] Does your customer-support LLM really need a petabyte of training data? Yes, now it can discuss everything from Kant to the latest Taylor Swift concert lineup. It probably just needs enough of that to make small talk, plus comprehensive data on your own products.
The future of business LLMs probably fits in a 1U server.
[1] https://mljourney.com/top-10-smallest-llm-to-run-locally/
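A minimal sketch of that "small model, local hardware" setup, assuming the Hugging Face transformers library; the model name and the Acme prompt below are just placeholders, not recommendations from the linked list.

```python
# Sketch: serve a narrow support use case from a small local model.
# Model name is an example assumption; pick whatever small checkpoint fits your
# hardware, latency, and licensing constraints.
from transformers import pipeline

generate = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-0.5B-Instruct",   # example ~0.5B-parameter instruct model
)

prompt = (
    "You are a support assistant for Acme widgets.\n"
    "Customer: How do I reset my widget to factory settings?\n"
    "Assistant:"
)
result = generate(prompt, max_new_tokens=80, do_sample=False)
print(result[0]["generated_text"])
```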
All the top models are moving towards synthetic data, not because they want more data but because they want quality data that is structured to train useful capabilities.
Having zettabytes of “invisible” data is effectively pointless. You can’t train on all of it because there is so much of it; it’s way more expensive to train on per byte because of the homomorphic magic (if that’s even possible); and most importantly, it’s not quality training data!