But isn't that a breach of GDPR?
Also, people who have given their consent before need to be able to revoke it at any point.
I don't know, but how can we do that in a GDPR-compliant way?
Edit: that last bit is probably catastrophic thinking. Enforcement has always been set at precisely the level needed to produce compliance rather than withdrawal from the market.
You can’t steal something and avoid punishment just because you don’t sell in the country where the theft happened.
Tit for tat.
NK isn’t really a business partner in the world.
Edit: After more reading: Clearview AI did exactly this; they ignored all the EU rulings, and the UK refused to enforce them. They were fined tens of millions and paid nothing. Stability is now also a UK company that used PII images for training; given their financial situation, it seems quite likely they will try to walk that same path. Meta is facing so many fines and lawsuits that who knows what it will do. Everyone else will call it the cost of doing business while fighting it every step of the way.
Also note that AI is not just generative models, and generative models don't need to be trained with personal data.
A normal industry would've figured out how to deal with this problem before going public, but AI people don't seem to be all that interested.
I'm sure they'll all cry foul if one of them gets hit with a fine and an order to figure out how to fix the mess they've created, but this is what you get when you don't teach ethics to computer scientists.
China is already dominating AI, you are asking the few companies in the West to stop completely.
The regulation is anti-growth and anti-technology - the GDPR, DSA, Cybersecurity Act and AI Act (and future Chat Control / Online Safety Act equivalent) must be repealed if Europe is to have any hope of a future tech industry.
They have to be able to ask whether their data is being used, how much of it, and how.
Rethinking Machine Unlearning for Large Language Models
Unfortunately they don't provide information regarding their training sets (https://help.mistral.ai/en/articles/347390-does-mistral-ai-c...) but I think it's safe to assume it includes DataComp CommonPool.
China must be laughing.
Who is to blame for internet commerce?
Our legislators. Maybe specifically we can blame Al Gore, the man who invented the internet. If we had put warning labels on the internet like we did with NWA and 2 Live Crew (Gore's second-best achievement), we wouldn't be a failed democracy right now.
They probably can't be redeemed and we should recognise that, but that doesn't mean they can't spend the rest of their life being forced to be useful to society in a constructive way. Any sort of future offense (violence, theft, assault, anything really) should mean we give up on them. Then they should be humanely put down.
The victim of ID theft is the person whose ID was stolen. The damage to banks or other large entities pales in comparison to the damage to those people.
Because AFAIK everything they collected was from the public web. So now researchers are being lambasted for having data in their sets that others released.
That said, masking obvious numbers like SSNs is low-hanging fruit. Trying to scrub every piece of public information about a person that could identify them is insane.
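To illustrate that low-hanging fruit, here's a minimal sketch, assuming US-style SSNs, GNU sed, and a hypothetical plain-text dump called dataset.txt:

  # redact anything shaped like a US SSN (ddd-dd-dddd)
  sed -E 's/\b[0-9]{3}-[0-9]{2}-[0-9]{4}\b/XXX-XX-XXXX/g' dataset.txt > dataset_masked.txt

Fixed-shape patterns like this are cheap to filter out; it's the open-ended identifying details that are hopeless.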
If you post something publicly, you can't complain that it's public.
no.
> but ...
no.
Depends. In most cases this is forbidden by law and you can claim actual damages.
Daughter's school posted pictures of her online without an opt-out, but she's also on Facebook from family members and it's just kind of... well beyond the point of trying to suppress. Probably just best to accept people can imagine you naked, at any age, doing any thing. What's your neighbor doing with the images saved from his Ring camera pointed at the sidewalk? :shrug:
IMO, it being posted online to a publicly accessible site is the same. Don't post anything you don't want right-click-saved.
Decorum and respect expectations don't disappear the moment it's technically feasible to be an asshole.
Based on what ordinary people have been saying, I don't think this is true. Or, maybe it's true now that the cat is out of the bag, but I don't think most people expected this before.
Most tech-oriented people did, of course, but we're a small minority. And even amongst our subculture, a lot of people didn't see this abuse coming. I didn't, or I would have removed all of my websites from the public web years earlier than I did.
In fact it's the opposite. People who aren't into tech think Instagram is listening to them 24/7 to tailor their feed and ads. There was even a hoax in my area among elderly groups that WhatsApp was using profile photos in illegal activity, and at one point many people removed their photos.
> I didn't, or I would have removed all of my websites from the public web years earlier than I did.
Your comment is public information. In fact, posting anything on HN is a surefire way to give your content to AI training.
True, but that's worlds apart from thinking that your data will be used to train genAI.
> In fact, posting anything on HN is a surefire way to give your content to AI training.
Indeed so, but HN seems to be a bad habit I just can't kick. However, my comments here are the entirety of what I put up on the open web and I intentionally keep them relatively shallow. I no longer do long-form blogging or make any of my code available on the open web.
However, you're right. Leaving HN is something that I need to do.
Or that teenager who signed up for Facebook should know that the embarrassing things they're posting are going to train AI and are, as you called it, public?
What about the blog I started 25 years ago and then took down, which lives on in the GeoCities archive? Was I supposed to know it'd go to an AI overlord corporation when I was in middle school writing about dragon photos I found on Google?
And we're not even getting into data breaches, or something that was uploaded as private and then sold when the corporation changed their privacy policy decades after it was uploaded.
It's not a bad analogy when you don't give all the graces to corporations and none to the exploited.
So if you are asking me, I would have to say yes. I cannot speak for the original poster.
I'm not sure what you mean here? In context I suspect you mean 'because ads were chosen from a perspective of knowledge about you'? But that's really the opposite of my experience (UK).
Ads now go hard on brainwashing: the same advert over and over, almost never anything I want to buy.
YouTube suggestions are pretty much in line with my previous viewing, though.
My ISP has a list of every domain I connect to, my streaming providers know every video we watch, the supermarkets and credit card companies know every item we buy at the shops, but still the brainwashing attempts continue for things we'd simply never buy.
We need to better educate people on the risks of posting private information online.
But that does not absolve these corporations of criticism of how they are handling data and "protecting" people's privacy.
Especially not when those companies are using dark patterns to convince people to share more and more information with them.
Literally yes? Is this sarcasm? Are we in 2025 supposed to implicitly trust multi-billion dollar multi-national corporations that have decades' worth of abuses to look back on? As if we couldn't have seen this coming?
It's been part of every social media platform's ToS for many years that they get a license to do whatever they want with what you upload. People have warned others about this for years and nothing happened. Those platforms have already used that data for image classification, identification, and the like. But nothing happened. What's different now?
Those same modern companies: Look, if our users inadvertently upload sensitive or private information then we can't really help them. The heuristics for detecting those kinds of things are just too difficult to implement.
So you are basically saying you have no sympathy for young people who happen not to have been taught about this, or who weren't guided by someone articulate enough to explain it.
Is it taught in schools yet? If it’s not, then why assume everyone should have a good working understanding of this (actually nuanced) topic?
For example I encounter people who believe that Google literally sells databases, lists of user data, when the actual situation (that they sell gated access to targeted eyeballs at a given moment and that this sort of slowly leaks identifying information) is more nuanced and complicated.
Of course privacy law doesn't necessarily agree with the idea that you can just scrape private data, but good luck getting that enforced anywhere.
It's important to know that generally this distinction is not relevant when it comes to data subject rights like GDPR's right to erasure: If your company is processing any kind of personal data, including publicly available data, it must comply with data protection regulations.
Eventually, it will catch up. Whether the punishment offsets the abuse is yet to be seen (I'm not holding my breath).
>Internet data is public and the government is incapable of changing this.
Incapable or unwilling (paid for by those who want to grab more data)?
I would claim incapable, but it doesn't really matter; the outcome is the same.
GDPR won't protect you nor will data privacy laws. Most of the world simply doesn't care enough. I wish it were different.
In that case, it's not a 'hidden camera'... users uploaded this data and made it public, right? I'm sure some of it was due to misconfiguration or whatever (like we saw with Tea), but it seems like most of this was uploaded by the users to the clear web. I'm all for "don't blame the victims", but if you upload your CC to Imgur, I think you deserve to have to get a new card.
Per the article "CommonPool ... draws on the same data source: web scraping done by the nonprofit Common Crawl between 2014 and 2022."
A more apt analogy would be someone recording you in public, or an outside camera pointed at your wide-open bedroom window.
People who've put data on LinkedIn had some expectation of privacy at a certain point. But this is exactly why I deleted everything from LinkedIn, other than a bare minimum representation that links external to my personal site, after they were acquired.
Microsoft, Google, Meta, OpenAI... none of them should be trusted by anyone at this point. They've all lied and stolen user data. People have taken their own lives over legal retaliation for doing far less than these people hiding behind corporate logos, who suck up any and all information because they've been entitled to never face consequences.
They've all broken their own ToS under an air of: OK for me, not for thee. So, yes, the hidden camera is a great analogy. All of these companies, and the people running them, are cancers in and on society.
I don’t know that that is useful advice for the average person. For instance, you can access your bank account via the internet, yet there are very strong privacy guarantees.
I concur that what you say is a safe default assumption, but then you need a way to keep people from mistrusting all internet services just because everything is considered public.
Edit: to clarify, in the first two examples I'm referring to web applications that the exposed person uses but does not control.
So my choice in society is to either not have a job or interviews, or to accept that I have no privacy in the modern world, being mined for profit by companies that lay off their workers anyway.
By the way, I was also recommended to make and show off a website portfolio to get interviews... sigh.
It contains links to personal data.
The title is like saying that sending a magnet link to a copyrighted torrent file is distributing copyrighted material. Folks can argue whether that's true, but the discussion should at least be transparent.
That the data set aggregator doesn't directly host the images themselves matters when you want to issue a takedown (targeting the original image host might be more effective) but for the question "Does that mean a model was trained on my images?" it's immaterial.
* Assuming the users regularly check the images are still being hosted (probably something that should be regulated)
As with almost any URL, it is not in and of itself an image.
As an aside, this presents a problem for researchers because the links can resolve to different resources, or no resource at all, depending on when they are accessed.
Therefore this is not a static dataset on which a machine learning model can be trained in a guaranteed reproducible fashion.
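As a rough sketch of what re-validating such a link-based dataset would look like (assuming a hypothetical dataset_urls.txt with a URL and its SHA-256 recorded at collection time on each line):

  # re-fetch each URL and flag entries whose content changed or vanished
  while read -r url expected; do
    actual=$(curl -sL "$url" | sha256sum | cut -d' ' -f1)
    [ "$actual" = "$expected" ] || echo "changed or gone: $url"
  done < dataset_urls.txt

Any mismatch means the dataset you can fetch today is not the one a model was trained on.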
The issue in question is that many/most large generative AI models were trained with personal data.
“It’s not his actual money, it’s just his bank account and routing number.”
A name, Jon Smith, is technically PII but not very specific. If I have a link to a specific Jon Smith’s facebook page or his HN profile, it’s even more personally identifiable than knowing his name is Jon Smith.
And if a link to PII is PII, then a link to a link to PII is PII, and thus all links are PII unless they link to the dark (unlinked) web.
That seems like a pretty big difference to me.
Secondly, privacy and copyright are different. Privacy is more of a concern with how information is used than getting credit and monetization for being the author.
Upthread it was mentioned that the training data representation contained links to material; magnet links were mentioned in passing as an example of something supposedly not violating copyright. It wasn't stated that training data contained magnet links. (Did it?)
I read the article as being about AI being trained on personal data. That is a major breach of many countries' legislation.
And AI is 100% being trained on copyrighted data too, breaking another, different set of laws.
That shows how much big tech is just breaking the law and using money and influence to get away with it.
It wouldn’t be bank robbery.
One alternative to archive.is for this website is to disable JavaScript and CSS.
Another alternative is the website's RSS feed.
It works anywhere without CSS or JavaScript, without CAPTCHAs, without tracking pixels.
For example,
  # fetch the feed and keep only dates and paragraph/div content
  curl -s https://web.archive.org/web/20250721104402if_/https://www.technologyreview.com/feed/ |
    (echo "<meta charset=utf-8>"; grep -E "<pubDate>|<p>|<div") > 1.htm
  firefox ./1.htm
To retrieve only the entry about DataComp CommonPool:

  # keep only the lines between the two <post-id> markers that bracket the entry
  curl -s https://web.archive.org/web/20250721104402if_/https://www.technologyreview.com/feed/ |
    sed -n '\|>1120522</post-id>|,\|>1120466</post-id>|p' |
    (echo "<meta charset=utf-8>"; grep -E "<pubDate>|<p>|<div") > 1.htm
  firefox ./1.htm