While this likely has no legal weight (except under the EU's TDM exception for commercial use, where the law does take opt-outs into account), they are betting on services like Cloudflare and Fastly to enforce it.
Is there even one example of a “tech mega corp” that has grown to control more than 1/5 of its market without this circling back to hurt people in some way? A single example?
I wouldn't be quite so sure about that. The AI industry has relied entirely on 'move fast and break things' and 'old fart judges who don't understand the tech' as its legal strategy.
The idea that AI training is fair use isn't so obvious, and frankly it's ridiculous in a world where AI companies pay for data. If it's not fair use to take Reddit's data, it's not fair use to take mine either.
On a technological level the difference from prior ML is straightforward: a classical classifier is simply incapable of emitting any copyrighted work it was trained on. The very architecture of the system guarantees that it produces new information derived from the training data rather than the training data itself.
LLMs and similar generative AI lack that safeguard. To be practically useful they have to be capable of emitting facts from their training data, but they have no architectural mechanism for separating facts from expressions. For them to be capable of emitting facts they must also be capable of emitting expressions, and thus of copyright violation.
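A minimal sketch of that architectural difference, using toy models made up purely for illustration: a classifier's output space is a fixed label set, so it cannot reproduce its training text no matter what it learned, while even a trivial generative model's output space is token sequences, which includes the training text itself.

```python
from collections import defaultdict

LABELS = ["pos", "neg"]

def classify(text: str) -> str:
    # Whatever a classifier learned from copyrighted text, the only
    # things it can ever emit are members of LABELS.
    return LABELS[len(text) % 2]

# "Train" a bigram language model on one sentence by counting
# next-token frequencies; its outputs are drawn from the training
# vocabulary itself, not from a separate label set.
training = "reddit users wrote this sentence once".split()
counts = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(training, training[1:]):
    counts[prev][nxt] += 1

def generate(start: str, steps: int) -> list[str]:
    out = [start]
    for _ in range(steps):
        dist = counts.get(out[-1])
        if not dist:
            break
        out.append(max(dist, key=dist.get))  # greedy next-token pick
    return out

assert classify("reddit users wrote this sentence once") in LABELS
assert generate("reddit", 5) == training  # verbatim regurgitation
```

Even this toy bigram model replays its training sentence verbatim under greedy decoding; nothing in its architecture distinguishes "a fact from training" from "an expression from training".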
Add in how GenAI tends to directly compete with the market of the works used as training data in ways that prior "fair use" systems did not and things become sketchy quickly.
Every major AI company knows this, which is why they rushed to implement copyright filtering systems once people started pointing out instances of copyrighted expressions being reproduced by AI systems. (There are technical reasons why this isn't a very good way to curtail copyright infringement by AI.)
Observe how all the major courtroom wins for AI companies amount to judges dismissing cases on the grounds of "well, you don't have an example specific to your work" rather than addressing whether such uses are acceptable as a collective whole.
A “classical” classifier can regurgitate its training data as well. It’s just that Reddit never seemed to care about people training e.g. sentiment classifiers on their data before.
In fact a “decoder” is simply autoregressive token classification.
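The claim that a decoder is just autoregressive token classification can be sketched as a loop where each step runs a softmax "classifier" over the vocabulary and picks the argmax. The scoring table below is a made-up stand-in for a trained network, purely for illustration.

```python
import math

VOCAB = ["hello", "world", "<eos>"]

def softmax(logits):
    # Standard numerically-stable softmax over a list of scores.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def next_token_logits(context):
    # Hypothetical per-context scores standing in for a trained
    # network's output head (one logit per vocabulary token).
    table = {
        (): [2.0, 0.1, 0.0],
        ("hello",): [0.0, 2.0, 0.1],
        ("hello", "world"): [0.0, 0.0, 2.0],
    }
    return table[tuple(context)]

def decode():
    out = []
    while True:
        probs = softmax(next_token_logits(out))
        tok = VOCAB[probs.index(max(probs))]  # argmax = one classification step
        if tok == "<eos>":
            return out
        out.append(tok)

assert decode() == ["hello", "world"]
```

Each iteration is exactly a k-way classification over the vocabulary; "generation" is just running that classifier repeatedly on its own output.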
If this is intended to refer to Judge Alsup, it is extremely wrong.
The licensing standard they're talking about will achieve nothing.
Anti-bot companies selling scraping protection will run out of runway: there's a limited set of signals, and none of them are robust. As the signals get used, they get burned. And it's politically impossible to extend the web platform with robust counter-abuse capabilities.
Putting the content behind a login wall can work for large sites, but not small ones.
The free-for-all will not end until adversarial scraping becomes illegal.
Syndication is the answer. Small artists are on Spotify, small video makers are on YouTube.
See the problem?
Those things were afterthoughts because, for the most part, the experimental methods sucked compared to the real thing. In mid-2016, if your LSTM was barely stringing together coherent sentences, it was a curiosity, not a serious competitor to StackOverflow.
I say this not because I don’t think law/ethics are important in the abstract, but because they only became relevant after significant technological improvement.