> What makes this vast private data uniquely valuable is its quality and real-world grounding. This data includes electronic health records, financial transactions, industrial sensor readings, proprietary research data, customer/population databases, supply chain information, and other structured, verified datasets that organizations use for operational decisions and to gain competitive advantages. Unlike web-scraped data, these datasets are continuously validated for accuracy because organizations depend on them, creating natural quality controls that make even a small fraction of this massive pool extraordinarily valuable for specialized AI applications.
Will there be a data exchange where one can buy and sell data, or even commodity data markets, where one can hedge/speculate on futures?
Asking for a friend.
> Despite what their name might suggest, so-called “large language models” (LLMs) are trained on relatively small datasets.[1][2][3] For starters, all the aforementioned measurements are described in terms of terabytes (TBs), which is not typically a unit of measurement one uses when referring to “big data.” Big data is measured in petabytes (1,000 times larger than a terabyte), exabytes (1,000,000 times larger), and sometimes zettabytes (1,000,000,000 times larger).
How valuable is 70 petabytes of temperature sensor readings to a commercial LLM? It is in fact negative. You don't want to be training the LLM on that data: you've only got so much room in those neurons, and you don't want it consumed with trying to predict temperature time series.
We don't need "more data", we need "more data of the specific types we're training on". That is not so readily available.
Not that it really matters, though. The ideas in the document are utterly impractical. Nobody is going to label the world's data with a super-complex permission scheme, any more than the world created the Semantic Web by labeling its data with rich metadata and cross-linking. Especially since it would be of negative value to AI training anyhow.
But to your point, a crucial question in AI right now is: how much quality data is still out there?
As far as the impracticality goes, it's a great point. I disagree and have spent about 10 years working in the area, but that can be a post for another day. I understand and appreciate the skepticism.
Why? Intelligence and compression might just be two sides of the same coin, and given that, I'd actually be very surprised if a future ASI couldn't make do with a fraction of that.
Just because current LLMs need tons of data doesn't mean that that's somehow an inherent requirement. Biological lifeforms seem to be able to train/develop general intelligence from much, much less.
"Biological lifeforms seem to be able to train/develop general intelligence from much, much less."
This statement is hard to defend. The brain takes in roughly 125 MB/second, and over an 80-year lifetime that works out to 300+ petabytes.
But that's not the real kicker. It's pretty unfair to say that humans learn everything they know from birth -> death. A lot of that learning bias was worked out through evolution... which takes that 300+ petabytes and multiplies it by... many lifetimes.
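For what it's worth, the 125 MB/s sensory-bandwidth number is itself a rough estimate, but the 300+ PB figure above does follow from it. A quick back-of-envelope check:

```python
# Back-of-envelope check of the "300+ petabytes per lifetime" figure above.
# The 125 MB/s sensory-bandwidth number is taken as given (it's a rough estimate).
SECONDS_PER_YEAR = 365.25 * 24 * 3600       # ~3.16e7 seconds

bandwidth_mb_per_s = 125                    # claimed total sensory input rate
lifetime_years = 80

total_mb = bandwidth_mb_per_s * SECONDS_PER_YEAR * lifetime_years
total_pb = total_mb / 1e9                   # 1 PB = 1e9 MB (decimal units)
print(f"~{total_pb:.0f} PB over a lifetime")  # ~316 PB
```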
That also seems several orders of magnitude off. Would you suspect that a human that only experiences life through H.264-compressing glasses, MP3-recompressing headphones etc. does not develop a coherent world model?
What about a human only experiencing a high fidelity 3D rendering of the world based on an accurate physics simulation?
The claim that humans need petabytes of data to develop their mind seems completely indefensible to me.
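To put a rough number on that thought experiment (the bitrates here are my own assumptions: roughly 5 Mbit/s for a 1080p H.264 stream plus ~128 kbit/s for audio):

```python
# Rough estimate of a lifetime of *compressed* audiovisual input, using assumed
# bitrates rather than the raw 125 MB/s figure from the comment above.
SECONDS_PER_YEAR = 365.25 * 24 * 3600

video_mbit_s = 5.0        # assumed 1080p H.264 bitrate
audio_mbit_s = 0.128      # assumed MP3-ish audio bitrate
mb_per_s = (video_mbit_s + audio_mbit_s) / 8   # megabits -> megabytes

total_pb = mb_per_s * SECONDS_PER_YEAR * 80 / 1e9
print(f"~{total_pb:.1f} PB over 80 years")      # ~1.6 PB, vs ~300 PB raw
```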
> A lot of that learning bias was worked out through evolution... which takes that 300+ petabytes and multiplies it by... many lifetimes.
Isn't that like saying that you only need the right data? In which case I'd completely agree :)
I don't think a bigger hand wave has ever been made. Homomorphic encryption increases computational load by orders of magnitude. And I'm not aware of anyone trying to use this (very interesting) technology for much of anything, let alone GPU ML algorithms.
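For anyone who hasn't seen it in action, here's a toy textbook Paillier scheme (additively homomorphic only, tiny insecure key, nothing like the fully homomorphic schemes ML training would need), just to show what the overhead looks like: one encrypted addition costs several big-integer modular exponentiations instead of a single CPU add.

```python
# Toy Paillier cryptosystem (textbook construction, illustrative only: tiny key,
# no security). Adding two encrypted numbers requires big-integer modular
# exponentiations, versus one machine instruction for plaintext addition.
import math, random

p, q = 104723, 104729            # small demo primes; real keys are 2048+ bits
n, n_sq = p * q, (p * q) ** 2
g = n + 1
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)      # lcm(p-1, q-1)
mu = pow((pow(g, lam, n_sq) - 1) // n, -1, n)          # inverse of L(g^lam mod n^2) mod n

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:                         # r must be in Z*_n
        r = random.randrange(1, n)
    return (pow(g, m, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(c):
    return ((pow(c, lam, n_sq) - 1) // n) * mu % n

m1, m2 = 42, 58
c_sum = (encrypt(m1) * encrypt(m2)) % n_sq   # homomorphic add = ciphertext multiply
assert decrypt(c_sum) == m1 + m2
print("decrypted sum:", decrypt(c_sum))      # 100
```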
I have a better idea: let's just cut the middlemen out and send every bit of data every computer generates to OpenAI. Sorry, to be fair, they want this to be a government-led operation... I'm sure that'll be fine too.
I am going to make a blank model, train it homomorphically to predict someone's name based on their butt cancer status, then prompt it to generate a list of people's names who have butt cancer, and blackmail them by threatening to send the list to their employers.
Pay them.
Otherwise, why on Earth should I care about "contributing to AI"? It's just another commercial venture that is trying to get something of high value for no money. A protocol that doesn't involve royalty payments is a non-starter.
This is a bold assumption. After Enron (financial transactions), Lehman Brothers (customer/population databases, financial transactions), Theranos (electronic health records), Nikola (proprietary research data), Juicero (I don't even know what this is), WeWork (umm ... everything), FTX (everything and we know they didn't mind lying to themselves) I'm pretty sure we can all say for certain that "real world grounding" isn't a guarantee with regards to anything where money or ego is involved.
Not to mention that at this point we're actively dealing with processes being run (improperly) by AI (see the lawsuits against Cigna and United Health Care [1]), leading to self-training loops without revealing the "self" aspect of it.
[1]: https://www.afslaw.com/perspectives/health-care-counsel-blog...
FTFY
If you get copies of the same data, it doesn't help. In a similar fashion, going from 100 TB of data scraped from the internet to 200 TB of data scraped from the internet... does it tell you much more? Unclear.
But there are large categories of data which aren't represented at all in LLMs. Most of the world's data just isn't on the internet. AI for Health is perhaps the most obvious example.
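On the duplication point above: even a trivial exact-match dedup pass (real pipelines go further with near-duplicate detection like MinHash) shows why doubling a raw scrape doesn't double the useful data.

```python
# Minimal exact-deduplication sketch: hash each document, keep only first copies.
# (The corpus here is a stand-in; real pipelines also do near-duplicate detection,
# which removes far more redundancy than exact matching.)
import hashlib

def dedupe(docs):
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc

corpus = ["same boilerplate page"] * 3 + ["an actually new page"]
unique = list(dedupe(corpus))
print(f"{len(corpus)} scraped docs -> {len(unique)} unique")   # 4 -> 2
```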
I have to note that taking the "bitter lesson" position as a claim that more data will result in better LLMs is a wild misinterpretation (or perhaps a "telephone" version) of the original bitter lesson article, which says only that general, scalable algorithms do better than knowledge-carrying, problem-specific algorithms. And the last I heard, it was the "scaling hypothesis" that hardly had consensus among those in the field.
If any more scaling does happen, it will happen in the mid-training (using agentic/reasoning outputs from previous model versions) and RL training stages.
Recent progress on useful LLMs seems to involve slimming them down.[1] Does your customer-support LLM really need a petabyte of training data? Yes, now it can discuss everything from Kant to the latest Taylor Swift concert lineup. It probably just needs enough of that to make small talk, plus comprehensive data on your own products.
The future of business LLMs probably fits in a 1U server.
[1] https://mljourney.com/top-10-smallest-llm-to-run-locally/
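A minimal sketch of that "small model, local hardware" setup, assuming the Hugging Face transformers library; the model name and the Acme prompt below are just placeholders, not recommendations from the linked list.

```python
# Sketch: serve a narrow support use case from a small local model.
# Model name is an example assumption; pick whatever small checkpoint fits your
# hardware, latency, and licensing constraints.
from transformers import pipeline

generate = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-0.5B-Instruct",   # example ~0.5B-parameter instruct model
)

prompt = (
    "You are a support assistant for Acme widgets.\n"
    "Customer: How do I reset my widget to factory settings?\n"
    "Assistant:"
)
result = generate(prompt, max_new_tokens=80, do_sample=False)
print(result[0]["generated_text"])
```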
All the top models are moving towards synthetic data, not because they want more data but because they want quality data that is structured to train useful capabilities.
Having zettabytes of “invisible” data is effectively pointless. You can’t train on all of it because there is so much of it; it’s way more expensive to train on per byte because of the homomorphic magic (if that’s even possible); and most importantly, it’s not quality training data!