But isn't that a breach of GDPR?
Also, people who have given their consent before need to be able to revoke it at any point.
I don't know, but how can we do that in a GDPR-compliant way?
Edit: that last bit is probably catastrophic thinking. Enforcement has always been set at precisely the level needed to produce compliance rather than withdrawal from the market.
You can’t steal something and avoid punishment just because you don’t sell in the country where the theft happened.
Tit for tat.
NK isn’t really a business partner in the world.
Edit: After more reading: Clearview AI did exactly this; they ignored all the EU rulings, and the UK refused to enforce them. They were fined tens of millions and paid nothing. Stability is now also a UK company that used PII images for training; given their financial situation, it seems quite likely they will try to walk that same path. Meta is facing so many fines and lawsuits that who knows what it will do. Everyone else will call it the cost of doing business while fighting it every step of the way.
Also note that AI is not just generative models, and generative models don't need to be trained with personal data.
A normal industry would've figured out how to deal with this problem before going public, but AI people don't seem to be all that interested.
I'm sure they'll all cry foul if one of them gets hit with a fine and an order to figure out how to fix the mess they've created, but this is what you get when you don't teach ethics to computer scientists.
China is already dominating AI, you are asking the few companies in the West to stop completely.
The regulation is anti-growth and anti-technology - the GDPR, DSA, Cybersecurity Act and AI Act (and future Chat Control / Online Safety Act equivalent) must be repealed if Europe is to have any hope of a future tech industry.
They have to be able to ask whether their data is being used, how much of it, and how.
Rethinking Machine Unlearning for Large Language Models
Unfortunately they don't provide information regarding their training sets (https://help.mistral.ai/en/articles/347390-does-mistral-ai-c...) but I think it's safe to assume it includes DataComp CommonPool.
China must be laughing.
Who is to blame for internet commerce?
Our legislators. Maybe specifically we can blame Al Gore, the man who invented the internet. If we had put warning labels on the internet like we did with NWA and 2 Live Crew (Gore's second-best achievement), we wouldn't be a failed democracy right now.
They probably can't be redeemed and we should recognise that, but that doesn't mean they can't spend the rest of their life being forced to be useful to society in a constructive way. Any sort of future offense (violence, theft, assault, anything really) should mean we give up on them. Then they should be humanely put down.
The victim of ID theft is the person whose ID was stolen. The damage to banks or other large entities pales in comparison to the damage to those people.
Because AFAIK everything they collected was from the public web. So now researchers are being lambasted for having data in their sets that others released.
That said, masking obvious numbers like SSNs is low-hanging fruit. Trying to scrub every piece of public information about a person that could identify them is insane.
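To illustrate that low-hanging fruit, here's a minimal sketch, assuming US-style SSNs, GNU sed, and a hypothetical plain-text dump called dataset.txt:

  # redact anything shaped like a US SSN (ddd-dd-dddd)
  sed -E 's/\b[0-9]{3}-[0-9]{2}-[0-9]{4}\b/XXX-XX-XXXX/g' dataset.txt > dataset_masked.txt

Fixed-shape patterns like this are cheap to filter out; it's the open-ended identifying details that are hopeless.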
If you post something publicly, you can't complain that it's public.
no.
> but ...
no.
Depends. In most cases this is forbidden by law and you can claim actual damages.
Daughter's school posted pictures of her online without an opt-out, but she's also on Facebook from family members and it's just kind of... well beyond the point of trying to suppress. Probably just best to accept people can imagine you naked, at any age, doing any thing. What's your neighbor doing with the images saved from his Ring camera pointed at the sidewalk? :shrug:
IMO, it being posted online to a publicly accessible site is the same. Don't post anything you don't want right-click-saved.
Decorum and respect expectations don't disappear the moment it's technically feasible to be an asshole.
Based on what ordinary people have been saying, I don't think this is true. Or, maybe it's true now that the cat is out of the bag, but I don't think most people expected this before.
Most tech-oriented people did, of course, but we're a small minority. And even amongst our subculture, a lot of people didn't see this abuse coming. I didn't, or I would have removed all of my websites from the public web years earlier than I did.
In fact it's the opposite. People who aren't into tech think Instagram is listening to them 24/7 to tailor their feed and ads. There was even a hoax in my area among elderly groups that WhatsApp was using profile photos in illegal activity, and at one point many people removed their photos.
> I didn't, or I would have removed all of my websites from the public web years earlier than I did.
Your comment is public information. In fact, posting anything on HN is a surefire way to give your content to AI training.
True, but that's worlds apart from thinking that your data will be used to train genAI.
> In fact, posting anything on HN is a surefire way to give your content to AI training.
Indeed so, but HN seems to be a bad habit I just can't kick. However, my comments here are the entirety of what I put up on the open web and I intentionally keep them relatively shallow. I no longer do long-form blogging or make any of my code available on the open web.
However, you're right. Leaving HN is something that I need to do.
Or that teenager who signed up for Facebook should know that the embarrassing things they're posting are going to train AI and are, as you called it, public?
What about the blog I started 25 years ago and then took down, which lives on in the GeoCities archive? Was I supposed to know it'd go to an AI overlord corporation when I was in middle school writing about dragon photos I found on Google?
And we're not even getting into data breaches, or something that was uploaded as private and then sold when the corporation changed their privacy policy decades after it was uploaded.
It's not a bad analogy when you don't give all the graces to corporations and none to the exploited.
So if you are asking me, I would have to say yes. I cannot speak for the original poster.
I'm not sure what you mean here? In context I suspect you mean 'because ads were chosen from a perspective of knowledge about you'? But that's really the opposite of my experience (UK).
Ads now go hard on brainwashing: the same advert over and over, almost never anything I want to buy.
YouTube suggestions are pretty much in line with my previous viewing, though.
My ISP has a list of every domain I connect to, my streaming providers know every video we watch, the supermarkets and credit card companies know every item we buy at the shops, but still the brainwashing attempts continue for things we'd simply never buy.
We need to better educate people on the risks of posting private information online.
But that does not absolve these corporations of criticism of how they are handling data and "protecting" people's privacy.
Especially not when those companies are using dark patterns to convince people to share more and more information with them.
Literally yes? Is this sarcasm? Are we in 2025 supposed to implicitly trust multi-billion dollar multi-national corporations that have decades' worth of abuses to look back on? As if we couldn't have seen this coming?
It's been part of every social media platform's ToS for many years that they get a license to do whatever they want with what you upload. People have warned others about this for years and nothing happened. Those platforms have already used that data for image classification, identification, and the like. But nothing happened. What's different now?
Those same modern companies: Look, if our users inadvertently upload sensitive or private information then we can't really help them. The heuristics for detecting those kinds of things are just too difficult to implement.
So you are basically saying you have no sympathy for young people who happen not to have been taught about this, or who weren't guided by someone articulate enough to explain it.
Is it taught in schools yet? If it’s not, then why assume everyone should have a good working understanding of this (actually nuanced) topic?
For example I encounter people who believe that Google literally sells databases, lists of user data, when the actual situation (that they sell gated access to targeted eyeballs at a given moment and that this sort of slowly leaks identifying information) is more nuanced and complicated.
Of course privacy law doesn't necessarily agree with the idea that you can just scrape private data, but good luck getting that enforced anywhere.
It's important to know that generally this distinction is not relevant when it comes to data subject rights like GDPR's right to erasure: If your company is processing any kind of personal data, including publicly available data, it must comply with data protection regulations.
Eventually, it will catch up. Whether the punishment offsets the abuse is yet to be seen (I'm not holding my breath).
>Internet data is public and the government is incapable of changing this.
Incapable or unwilling (paid for by those who want to grab more data)?
I would claim incapable, but it doesn't really matter; the outcome is the same.
GDPR won't protect you nor will data privacy laws. Most of the world simply doesn't care enough. I wish it were different.
In that case, it's not a 'hidden camera'... users uploaded this data and made it public, right? I'm sure some of it was due to misconfiguration or whatever (like we saw with Tea), but it seems like most of this was uploaded by the users to the clear web. I'm all for "don't blame the victims", but if you upload your CC to Imgur, I think you deserve to have to get a new card.
Per the article "CommonPool ... draws on the same data source: web scraping done by the nonprofit Common Crawl between 2014 and 2022."
A more apt analogy would be someone recording you in public, or an outside camera pointed at your wide-open bedroom window.
People who've put data on LinkedIn had some expectation of privacy at a certain point. But this is exactly why I deleted everything from LinkedIn, other than a bare minimum representation that links external to my personal site, after they were acquired.
Microsoft, Google, Meta, OpenAI... none of them should be trusted by anyone at this point. They've all lied and stolen user data. People have taken their own lives over legal retaliation for doing far less than these people hiding behind corporate logos, who suck up any and all information because they've been entitled to never face consequences.
They've all broken their own ToS under an air of: OK for me, not for thee. So, yes, the hidden camera is a great analogy. All of these companies, and the people running them, are cancers in and on society.
I don’t know that that is useful advice for the average person. For instance, you can access your bank account via the internet, yet there are very strong privacy guarantees.
I concur that what you say is a safe default assumption, but then you need a way to keep people from mistrusting all internet services just because everything is considered public.
Edit: to clarify, in the first two examples I'm referring to web applications that the exposed person uses but does not control.
So my choice in society is to either not have a job or interviews, or to accept that I have no privacy in the modern world, being mined for profit by companies that lay off their workers anyway.
By the way, I was also recommended to make and show off a website portfolio to get interviews... sigh.
It contains links to personal data.
The title is like saying that sending a magnet link to a copyrighted torrent file is distributing copyrighted material. Folks can argue whether that's true, but the discussion should at least be transparent.
That the data set aggregator doesn't directly host the images themselves matters when you want to issue a takedown (targeting the original image host might be more effective) but for the question "Does that mean a model was trained on my images?" it's immaterial.
* Assuming the users regularly check the images are still being hosted (probably something that should be regulated)
As with almost any URL, it is not in and of itself an image.
As an aside, this presents a problem for researchers because the links can resolve to different resources, or no resource at all, depending on when they are accessed.
Therefore this is not a static dataset on which a machine learning model can be trained in a guaranteed reproducible fashion.
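As a rough sketch of what re-validating such a link-based dataset would look like (assuming a hypothetical dataset_urls.txt with a URL and its SHA-256 recorded at collection time on each line):

  # re-fetch each URL and flag entries whose content changed or vanished
  while read -r url expected; do
    actual=$(curl -sL "$url" | sha256sum | cut -d' ' -f1)
    [ "$actual" = "$expected" ] || echo "changed or gone: $url"
  done < dataset_urls.txt

Any mismatch means the dataset you can fetch today is not the one a model was trained on.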
The issue in question is that many/most large generative AI models were trained with personal data.
“It’s not his actual money, it’s just his bank account and routing number.”
A name, Jon Smith, is technically PII but not very specific. If I have a link to a specific Jon Smith’s facebook page or his HN profile, it’s even more personally identifiable than knowing his name is Jon Smith.
And if a link to PII is PII, then a link to a link to PII is PII, and thus all links are PII unless they link to the dark (unlinked) web.
That seems like a pretty big difference to me.
Secondly, privacy and copyright are different. Privacy is more of a concern with how information is used than getting credit and monetization for being the author.
Upthread it was mentioned that the training data representation contained links to material; magnet links were mentioned in passing as an example of something supposedly not violating copyright. It wasn't stated that training data contained magnet links. (Did it?)
I read the article as being about AI being trained on personal data. That is a major breach of many countries' legislation.
And AI is 100% being trained on copyrighted data too, breaking another, different set of laws.
That shows how much big tech is just breaking the law and using money and influence to get away with it.
It wouldn’t be bank robbery.
One alternative to archive.is for this website is to disable JavaScript and CSS.
Another alternative is the website's RSS feed.
It works anywhere without CSS or JavaScript, without CAPTCHAs, without tracking pixels.
For example,
  # fetch the feed and keep only dates and paragraph/div content
  curl -s https://web.archive.org/web/20250721104402if_/https://www.technologyreview.com/feed/ |
    (echo "<meta charset=utf-8>"; grep -E "<pubDate>|<p>|<div") > 1.htm
  firefox ./1.htm
To retrieve only the entry about DataComp CommonPool:

  # keep only the lines between the two <post-id> markers that bracket the entry
  curl -s https://web.archive.org/web/20250721104402if_/https://www.technologyreview.com/feed/ |
    sed -n '\|>1120522</post-id>|,\|>1120466</post-id>|p' |
    (echo "<meta charset=utf-8>"; grep -E "<pubDate>|<p>|<div") > 1.htm
  firefox ./1.htm