but isn't that a breach of GDPR?
Also people who have given their consent before need to be able to revoke it at any point.
idk, but how can we do that while staying GDPR-compliant?
Edit: that last bit is probably catastrophic thinking. Enforcement has always been precisely enough to cause compliance vs withdrawal from the market.
You can’t steal something and avoid punishment just because you don’t sell in the country where the theft happened.
Tit for tat.
NK isn’t really a business partner in the world.
Edit: After more reading. Clearview AI did exactly this: they ignored all the EU rulings and the UK refused to enforce them. They were fined tens of millions and paid nothing. Stability is now also a UK company that used PII images for training; it seems quite likely they will try to walk that same path given their financial situation. Meta is facing so many fines and lawsuits, who knows what it will do. Everyone else will call it the cost of doing business while fighting it every step of the way.
Also note that AI is not just generative models, and generative models don't need to be trained with personal data.
A normal industry would've figured out how to deal with this problem before going public, but AI people don't seem to be all that interested.
I'm sure they'll all cry foul if one of them gets hit with a fine and an order to figure out how to fix the mess they've created, but this is what you get when you don't teach ethics to computer scientists.
They have to be able to ask whether their data is being used, how much of it, and how.
Rethinking Machine Unlearning for Large Language Models
Unfortunately they don't provide information regarding their training sets (https://help.mistral.ai/en/articles/347390-does-mistral-ai-c...) but I think it's safe to assume it includes DataComp CommonPool.
Who is to blame for internet commerce?
Our legislators. Maybe specifically we can blame Al Gore, the man who invented the internet. If we had put warning labels on the internet like we did with NWA and 2 live crew, Gore’s second best achievement, we wouldn’t be a failed democracy right now.
They probably can't be redeemed and we should recognise that, but that doesn't mean they can't spend the rest of their life being forced to be useful to society in a constructive way. Any sort of future offense (violence, theft, assault, anything really) should mean we give up on them. Then they should be humanely put down.
The victim of ID theft is the person whose ID was stolen. The damage to banks or other large entities pales in comparison to the damage to those people.
Because afaik everything they collected was public web. So now researchers are being lambasted for having data in their sets that others released.
That said, masking obvious numbers like SSN is low hanging fruit. Trying to obviate every piece of public information about a person that can identify them is insane.
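As a rough illustration of how low-hanging that fruit is, a pattern-based scrub for SSN-shaped numbers is a few lines of Python. This is a hypothetical sketch, not anything the dataset authors actually ran, and a regex like this obviously catches only the obvious format, not every piece of identifying information:

```python
import re

# Matches the common AAA-GG-SSSS layout; purely illustrative,
# not an exhaustive PII filter.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def mask_ssns(text: str) -> str:
    """Replace SSN-shaped substrings with a redaction marker."""
    return SSN_RE.sub("[REDACTED-SSN]", text)

print(mask_ssns("Applicant SSN 123-45-6789 on file."))
# → Applicant SSN [REDACTED-SSN] on file.
```

The point being: this level of masking is cheap, while "obviate every piece of public information that can identify a person" is an open research problem.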
If you post something publicly, you can't complain that it is public.
no.
> but ...
no.
Depends. In most cases, this is forbidden by law and you can claim actual damages.
Daughter's school posted pictures of her online without an opt-out, but she's also on Facebook from family members and it's just kind of... well beyond the point of trying to suppress. Probably just best to accept people can imagine you naked, at any age, doing any thing. What's your neighbor doing with the images saved from his Ring camera pointed at the sidewalk? :shrug:
IMO, it being posted online to a publicly accessible site is the same. Don't post anything you don't want right-click-saved.
Decorum and respect expectations don't disappear the moment it's technically feasible to be an asshole
Based on what ordinary people have been saying, I don't think this is true. Or, maybe it's true now that the cat is out of the bag, but I don't think most people expected this before.
Most tech-oriented people did, of course, but we're a small minority. And even amongst our subculture, a lot of people didn't see this abuse coming. I didn't, or I would have removed all of my websites from the public web years earlier than I did.
In fact it's the opposite. People who aren't into tech think Instagram is listening to them 24/7 to tailor their feed and ads. There was even a hoax in my area among elderly groups that WhatsApp was using profile photos in illegal activities, and many people removed their photos at one point.
> I didn't, or I would have removed all of my websites from the public web years earlier than I did.
Your comment is public information. In fact, posting anything on HN is a surefire way of handing your content over for AI training.
True, but that's a world different than thinking that your data will be used to train genAI.
> In fact, posting anything on HN is a surefire way of handing your content over for AI training.
Indeed so, but HN seems to be a bad habit I just can't kick. However, my comments here are the entirety of what I put up on the open web and I intentionally keep them relatively shallow. I no longer do long-form blogging or make any of my code available on the open web.
However, you're right. Leaving HN is something that I need to do.
Or that teenager who signed up for Facebook should know that the embarrassing things they're posting are going to train AI and are, as you called it, public?
What about the blog I started 25 years ago and then took down, which lives on in the Geocities archive? Was I supposed to know it'd go to an AI overlord corporation when I was in middle school writing about dragon photos I found on Google?
And we're not even getting into data breaches, or something that was uploaded as private and then sold when the corporation changed their privacy policy decades after it was uploaded.
It's not a bad analogy when you don't give all the graces to corporations and none to the exploited.
We need to better educate people on the risks of posting private information online.
But that does not absolve these corporations of criticism of how they are handling data and "protecting" people's privacy.
Especially not when those companies are using dark patterns to convince people to share more and more information with them.
Literally yes? Is this sarcasm? Are we in 2025 supposed to implicitly trust multi-billion dollar multi-national corporations that have decades' worth of abuses to look back on? As if we couldn't have seen this coming?
It's been part of every social media platform's ToS for many years that they get a license to do whatever they want with what you upload. People have warned others about this for years and nothing happened. Those platforms have already used that data for image classification, identification, and the like. But nothing happened. What's different now?
Those same modern companies: Look, if our users inadvertently upload sensitive or private information then we can't really help them. The heuristics for detecting those kinds of things are just too difficult to implement.
So you are basically saying you have no sympathy for young people who happen not to have been taught about this, or guided by someone articulate enough to explain it.
Is it taught in schools yet? If it’s not, then why assume everyone should have a good working understanding of this (actually nuanced) topic?
For example I encounter people who believe that Google literally sells databases, lists of user data, when the actual situation (that they sell gated access to targeted eyeballs at a given moment and that this sort of slowly leaks identifying information) is more nuanced and complicated.
Of course privacy law doesn't necessarily agree with the idea that you can just scrape private data, but good luck getting that enforced anywhere.
It's important to know that generally this distinction is not relevant when it comes to data subject rights like GDPR's right to erasure: If your company is processing any kind of personal data, including publicly available data, it must comply with data protection regulations.
In that case, it's not a "hidden camera"... users uploaded this data and made it public, right? I'm sure some of it was due to misconfiguration or whatever (like we saw with Tea), but it seems like most of this was uploaded by the user to the clear web. I'm all for "Don't blame the victims", but if you upload your CC to Imgur, I think you deserve to have to get a new card.
Per the article "CommonPool ... draws on the same data source: web scraping done by the nonprofit Common Crawl between 2014 and 2022."
A more apt analogy would be someone recording you in public, or an outside camera pointed at your wide-open bedroom window.
I don’t know that that is useful advice for the average person. For instance, you can access your bank account via the internet, yet there are very strong privacy guarantees.
I concur that what you say is a safe default assumption, but then you need a way for people not to start mistrusting all internet services because everything is considered public.
It contains links to personal data.
The title is like saying that sending a magnet link to a copyrighted torrent file is distributing copyright material. Folks can argue if that's true but the discussion should at least be transparent.
That the data set aggregator doesn't directly host the images themselves matters when you want to issue a takedown (targeting the original image host might be more effective) but for the question "Does that mean a model was trained on my images?" it's immaterial.
“It’s not his actual money, it’s just his bank account and routing number.”
A name, Jon Smith, is technically PII but not very specific. If I have a link to a specific Jon Smith’s facebook page or his HN profile, it’s even more personally identifiable than knowing his name is Jon Smith.
Secondly, privacy and copyright are different. Privacy is more of a concern with how information is used than getting credit and monetization for being the author.
Upthread it was mentioned that the training data representation contained links to material; magnet links were mentioned in passing as an example of something supposedly not violating copyright. It wasn't stated that training data contained magnet links. (Did it?)
One alternative to archive.is for this website is to disable Javascript and CSS
Another alternative is the website's RSS feed
It works anywhere without CSS or Javascript, without CAPTCHAs, without tracking pixels.
For example (note the backslash line continuations, and quoting of the patterns for the shell):

  curl -s https://web.archive.org/web/20250721104402if_/https://www.technologyreview.com/feed/ \
  | (echo "<meta charset=utf-8>"; grep -E "<pubDate>|<p>|<div") > 1.htm
  firefox ./1.htm

To retrieve only the entry about DataComp CommonPool, restrict the feed to the range between the relevant post-id markers with sed (the slashes inside the patterns must be escaped):

  curl -s https://web.archive.org/web/20250721104402if_/https://www.technologyreview.com/feed/ \
  | sed -n '/>1120522<\/post-id>/,/>1120466<\/post-id>/p' \
  | (echo "<meta charset=utf-8>"; grep -E "<pubDate>|<p>|<div") > 1.htm
  firefox ./1.htm