Even if I agree entirely with the premise, this is not something Open Source projects can use, just like every other restriction on use.
Open Source is a Schelling point ( https://en.wikipedia.org/wiki/Focal_point_(game_theory) ). It's not perfect, but it has the advantage that people can agree upon what it means and what does and doesn't qualify. Once use restrictions like these start cropping up, any non-trivial project would become a maze of restrictions, all different.
And in losing Open Source, we'd gain absolutely nothing. AI training already ignores all Open Source licenses, proprietary licenses, and the complete absence of a license. What makes you think this one will be respected when every other Open Source license isn't?
brisky•1h ago
Does any current Open Source license address the question of AI/LLM training at all? Some OSS developers have a clear sentiment against it, but currently they cannot even pick a standard OSS license that aligns with their worldview.
josephcsible•1h ago
One of these things is true:
1. Training AI on copyrighted works is fair use, so it's allowed no matter what the license says.
2. Training AI on copyrighted works is not fair use, so since pretty much every open source license requires attribution (even ones as lax as MIT do; it's only ones that are pretty much PD-equivalent like CC0, WTFPL, and Unlicense that don't) and AI doesn't give attribution, it's already disallowed by all of them.
So in either case, having a license mention AI explicitly wouldn't do any good, and would only make the license fail to comply with the OSD.
TomOwens•25m ago
Point 2 misses the distinction between AI models and their outputs.
Let's assume for a moment that training AI (or, in other words, creating an AI model) is not fair use. That means that all of the license restrictions must be adhered to.
For the MIT license, the requirement is to include the copyright notice and permission notice "in all copies or substantial portions of the Software". If we're going to argue that the model is a substantial portion of the software, then only the model would need to carry the notices. And it's already settled that accessing software over a server doesn't trigger these clauses.
Something like the AGPL is more interesting. Again, if we accept that the model is a derivative work of the content it was trained on, then the AGPL's viral nature would require that the model be released under an appropriate license. However, it still says nothing about the output. In fact, the GPL family licenses don't require the output of software under one of those licenses to be open, so I suspect that would also be true for content.
So far, though, in the US, it seems courts are beginning to recognize AI model training as fair use. Honestly, I'm not surprised, given that it was seen as fair use to build a searchable database of copyright-protected text. The AI model is an even more transformative use, since (from my understanding) you can't reverse engineer the training data out of a model.
But there is still the ethical question of disclosing the training material. Plagiarism still exists, even for content in the public domain. So attributing the complete set of training material would probably fall into this form of ethical question, rather than the legal questions around intellectual property and licensing agreements. How you go about obtaining the training material is also a relevant discussion, since even fair use doesn't allow you to pirate material, and you must still legally obtain it - fair use only allows you to use it once you've obtained it.
There are still questions for output, but those are, in my opinion, less interesting. If you have a searchable copy of your training material, you can do a fuzzy search of that material to return potential cases where the model returned something close to the original content. GitHub already does something similar with GitHub Copilot and finding public code that matches AI responses, but there are still questions there, too. They're mostly about matches that may not be in the training data, or about how much duplicated code needs to be attributed. But once you find the original content, working with licensing becomes easier. There are also questions about guardrails and how much is necessary to prevent exact reproduction of copyright-protected material that, even if licensed for training, isn't licensed for redistribution.
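To make the fuzzy-search idea concrete, here's a minimal sketch using only Python's standard library. It is not how GitHub Copilot does its matching (I don't know the details of that); the function names, the corpus layout (a dict of source id to text), and the 0.8 threshold are all assumptions for illustration. A real system would need an index rather than a linear scan.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting differences don't hide a match."""
    return " ".join(text.lower().split())


def find_near_matches(output: str, corpus: dict[str, str], threshold: float = 0.8):
    """Return (source_id, score) pairs for corpus entries similar to `output`.

    `corpus` maps a hypothetical source id (e.g. a file path) to its text.
    `threshold` is an arbitrary cutoff chosen for this sketch.
    """
    needle = normalize(output)
    hits = []
    for source_id, text in corpus.items():
        # ratio() is 2*M/T: matched characters over total characters in both strings.
        score = SequenceMatcher(None, needle, normalize(text), autojunk=False).ratio()
        if score >= threshold:
            hits.append((source_id, score))
    # Best matches first, so the most likely source to attribute is at the top.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Hypothetical training corpus and model output.
corpus = {
    "repo_a/add.py": "def add(a, b):\n    return a + b",
    "repo_b/greet.py": "def greet(name):\n    print('hello', name)",
}
model_output = "def add(a,  b):\n  return a + b"

for source_id, score in find_near_matches(model_output, corpus):
    print(f"{source_id}: {score:.2f}")
```

Once a hit comes back, you know which source (and therefore which license) to look at. The threshold is where the "how much duplicated code needs to be attributed" question shows up in practice.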
apatheticonion•41m ago
I'd love a copy-left form of this.
I don't have an issue with LLM enhanced coding, but if you use my projects as training data, give me royalties.
JoshTriplett•1h ago