MIT-Human License Proposal

https://github.com/tautvilas/MIT-Human/blob/main/LICENSE

2•brisky•1w ago

Comments

JoshTriplett•1w ago

Even if I agree entirely with the premise, this is not something Open Source projects can use, just like every other restriction on use.

Open Source is a Schelling point ( https://en.wikipedia.org/wiki/Focal_point_(game_theory) ). It's not perfect, but it has the advantage that people can agree upon what it means and what does and doesn't qualify. Once use restrictions like these start cropping up, any non-trivial project would become a maze of restrictions, all different.

And in losing Open Source, we'd gain absolutely nothing. AI training already ignores all Open Source licenses, and proprietary licenses, and complete lacks of licenses. What makes you think this will be respected where every other Open Source license isn't?

brisky•1w ago

Does any current Open Source license address the question of AI/LLM training at all? Some OSS developers have a clear sentiment against it, but currently they cannot even pick a standard OSS license that aligns with their worldview.

josephcsible•1w ago

One of these things is true:

1. Training AI on copyrighted works is fair use, so it's allowed no matter what the license says.

2. Training AI on copyrighted works is not fair use, so since pretty much every open source license requires attribution (even ones as lax as MIT do; it's only ones that are pretty much PD-equivalent like CC0, WTFPL, and Unlicense that don't) and AI doesn't give attribution, it's already disallowed by all of them.

So in either case, having a license mention AI explicitly wouldn't do any good, and would only make the license fail to comply with the OSD.

TomOwens•1w ago

Point 2 misses the distinction between AI models and their outputs.

Let's assume for a moment that training AI (or, in other words, creating an AI model) is not fair use. That means that all of the license restrictions must be adhered to.

For the MIT license, the requirement is to include the copyright notice and permission notice "in all copies or substantial portions of the Software". If we're going to argue that the model is a substantial portion of the software, then only the model would need to carry the notices. And it's already settled that accessing software over a server doesn't trigger these clauses.

Something like the AGPL is more interesting. Again, if we accept that the model is a derivative work of the content it was trained on, then the AGPL's viral nature would require that the model be released under an appropriate license. However, it still says nothing about the output. In fact, the GPL family licenses don't require the output of software under one of those licenses to be open, so I suspect that would also be true for content.

So far, though, in the US, it seems courts are beginning to recognize AI model training as fair use. Honestly, I'm not surprised, given that it was seen as fair use to build a searchable database of copyright-protected text. The AI model is an even more transformative use, since (from my understanding) you can't reverse engineer the training data out of a model.

But there is still the ethical question of disclosing the training material. Plagiarism still exists, even for content in the public domain. So attributing the complete set of training material would probably fall into this form of ethical question, rather than the legal questions around intellectual property and licensing agreements. How you go about obtaining the training material is also a relevant discussion, since even fair use doesn't allow you to pirate material, and you must still legally obtain it - fair use only allows you to use it once you've obtained it.

There are still questions for output, but those are, in my opinion, less interesting. If you have a searchable copy of your training material, you can do a fuzzy search of that material to return potential cases where the model returned something close to the original content. GitHub already does something similar with GitHub Copilot and finding public code that matches AI responses, but there are still questions there, too. It's more around matches that may not be in the training data or how much duplicated code needs to be attributed. But once you find the original content, working with licensing becomes easier. There are also questions about guardrails and how much is necessary to prevent exact reproduction of copyright protected material that, even if licensed for training, isn't licensed for redistribution.
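As a rough illustration of the fuzzy-search idea above, here is a minimal sketch using Python's standard `difflib`. It is not how GitHub Copilot or any other service does its matching; the corpus contents, the `find_near_matches` name, and the 0.8 threshold are all assumptions made up for this example. A real system would need an index rather than a linear scan over every document.

```python
from difflib import SequenceMatcher


def find_near_matches(output, corpus, threshold=0.8):
    """Return (doc_id, ratio) pairs for corpus entries similar to `output`.

    `corpus` maps a hypothetical document id to its text. The threshold
    is an arbitrary illustrative value, not a legal or technical standard.
    """
    matches = []
    for doc_id, text in corpus.items():
        # ratio() is 2*M/T: M matched characters, T total characters.
        ratio = SequenceMatcher(None, output, text).ratio()
        if ratio >= threshold:
            matches.append((doc_id, ratio))
    # Most similar documents first.
    return sorted(matches, key=lambda m: m[1], reverse=True)


# Hypothetical training material and a hypothetical model output.
corpus = {
    "repo-a/util.py": "def add(a, b):\n    return a + b\n",
    "repo-b/greet.py": "def greet(name):\n    print('hello', name)\n",
}
generated = "def add(x, y):\n    return x + y\n"

hits = find_near_matches(generated, corpus, threshold=0.8)
for doc_id, ratio in hits:
    print(f"{doc_id}: {ratio:.3f}")
```

Once a candidate source like `repo-a/util.py` is identified, its license can be looked up. The sketch says nothing about whether a snippet this small is protectable at all.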

JoshTriplett•1w ago

> The AI model is an even more transformative use, since (from my understanding) you can't reverse engineer the training data out of a model.

You absolutely can; the model is quite capable of reproducing works it was trained on, if not perfectly then at least close enough to infringe copyright. The only thing stopping it from doing so is filters put in place by services to attempt to dodge the question.

> In fact, the GPL family licenses don't require the output of software under one of those licenses to be open, so I suspect that would also be true for content.

It does if the software copies portions of itself into the output, which seems close enough to what LLMs do. The neuron weights are essentially derived from all the training data.

> There are also questions about guardrails and how much is necessary to prevent exact reproduction of copyright protected material that, even if licensed for training, isn't licensed for redistribution.

That's not something you can handle via guardrails. If you read a piece of code, and then produce something substantially similar in expression (not just in algorithm and comparable functional details), you've still created a derivative work. There is no well-defined threshold for "how similar", the fundamental question is whether you derived from the other code or not.

The only way to not violate the license on the training data is to treat all output as potentially derived from all training data.

TomOwens•1w ago

> You absolutely can; the model is quite capable of reproducing works it was trained on, if not perfectly then at least close enough to infringe copyright. The only thing stopping it from doing so is filters put in place by services to attempt to dodge the question.

The model doesn't reproduce anything. It's a mathematical representation of the training data. Software that uses the model generates the output. The same model can be used across multiple software applications for different purposes. If I were to go to https://huggingface.co/deepseek-ai/DeepSeek-V3.2/tree/main (for example) and download those files, I wouldn't be able to reverse-engineer the training data without building more software.

Compare that to a search database, which needs the full text in an indexable format, directly associated with the document it came from. Although you can encrypt the database, at some point, it needs to have the text mapped to documents, which would make it much easier to reconstruct the complete original documents.

> That's not something you can handle via guardrails. If you read a piece of code, and then produce something substantially similar in expression (not just in algorithm and comparable functional details), you've still created a derivative work. There is no well-defined threshold for "how similar", the fundamental question is whether you derived from the other code or not.

The threshold of originality defines whether something can be protected by copyright. There are plenty of small snippets of code that can't be protected. But there are still questions about these small snippets that were consumed in the context of a larger, protected work, especially when there are only so many ways to express the same concept in a given language. It's definitely easier in written text than code to reason about.

JoshTriplett•1w ago

> The model doesn't reproduce anything. It's a mathematical representation of the training data. Software that uses the model generates the output.

By that argument, a compressed copy of the Internet doesn't reproduce the Internet, the decompression software does. That's not a useful semantic distinction; the compressed file is the derived work, not the decompression software.

apatheticonion•1w ago

I'd love a copy-left form of this.

I don't have an issue with LLM-enhanced coding, but if you use my projects as training data, give me royalties.
