If yes, this in a sense opens a path around GPL requirements. An MIT-licensed version of Linux would be out within the next 1-2 years.
The thesis I propose is that tests are more akin to facts, or can be stated as facts, and facts are not copyrightable. That's what makes this case interesting.
If "tests" meant a proper specification, say some IETF RFC of a protocol, that would be different.
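To illustrate the "tests as facts" idea with a toy example of my own (not from any real project): a behavioral test can read like a statement of fact about what the system does, with no trace of how any particular implementation does it.

```python
# Hypothetical spec-style tests: each asserts an observable fact about
# "the system" (here, a sorting routine), never its internals.
def spec_sort_returns_nondecreasing(sort):
    out = sort([3, 1, 2, 1])
    assert all(a <= b for a, b in zip(out, out[1:]))

def spec_sort_is_a_permutation(sort):
    data = [3, 1, 2, 1]
    assert sorted(sort(data)) == sorted(data)

# Any implementation satisfying the facts passes, e.g. Python's built-in:
spec_sort_returns_nondecreasing(sorted)
spec_sort_is_a_permutation(sorted)
```

Nothing in those two assertions reveals whether the original was a quicksort, a mergesort, or something else entirely.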
1. Generate a specification of what the system does.
2. Pass it to another "clean" system.
3. The second, clean system implements based solely on the specification, with no information about the original.

That third step is the hardest, especially for well-known projects.
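The three steps above can be sketched as a pipeline. This is a toy of my own invention (the two "systems" are stubbed out as plain callables, not real models); the one structural point it demonstrates is the information firewall: the implementer only ever receives the specification, never the original source.

```python
def clean_room(write_spec, implement, original_source):
    """Toy clean-room pipeline.

    write_spec: callable that has seen the original and emits a spec string.
    implement:  callable that receives ONLY the spec, never the original.
    """
    spec = write_spec(original_source)
    # Firewall check: the spec must not smuggle the original source through.
    assert original_source not in spec, "spec leaks original source"
    return implement(spec)

# Stand-ins for the two models:
spec_writer = lambda src: "function add(a, b) returns the sum a + b"
implementer = lambda spec: "def add(a, b):\n    return a + b"

new_source = clean_room(
    spec_writer, implementer, "int add(int a, int b) { return a + b; }"
)
print(new_source)
```

Whether that firewall is legally sufficient when both callables are LLMs (possibly sharing training data) is exactly the open question in this thread.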
2. Dumped into a file.
3. claude-code, which converts this to tests in the target language and implements the app that passes the tests.
Step 3 is no longer hard: look at all the reimplementations, from ccc to the rewrites popping up. They all have a well-defined test suite as a common theme, so much so that the tldraw author raised a (joke) issue to remove tests from the project.
Then the model that is familiar with the code can write specs. The model that does not have knowledge of the project can implement them.
Would that be a proper clean room implementation?
Seems like a pretty evil, profitable product "rewrite any code base with an inconvenient license to your proprietary version, legally".
Is the "clean room" process meaningfully backed by legal precedent?
As an aside, this clean-room engineering is one of the plot points of Season 1 of the TV show Halt and Catch Fire, where they do it with a BIOS image they dumped.
So you can pilfer the commons ("public"), but not stuff unavailable in source form.
If we expand your thought experiment to other forms of expression, say videos on YT or Netflix, then yes.
That's the core issue here. All models are trained on ALL publicly available source code, irrespective of how it was licensed. It is illegal, but every company training LLMs is doing it anyway.
We can debate whether this law is moral. Like the GP, I too agree that "public data in -> public domain out" is what's right for society. Copyright as an artificial concept has gone on for long enough.
I don't think so. It is nowhere near "limited use": the entirety of the source code is ingested for training the model. In other words, it meets the bar of the "heart of the work" being used for training. There are other factors as well, such as not harming the owner's ability to profit from the original work.
Both Meta and Anthropic were vindicated for their use; only Anthropic was fined, for not buying the works upfront.
Also, the mentioned SCOTUS decision is concerned with authorship of generative AI products. That's very different from this case. Here we're talking about a tool that transformed source code and somehow magically got rid of the copyright through this transformation? Imagine the consequences for the US copyright industry if that were actually possible.
But what about training without having seen any human-written program? Could a model learn from randomly generated programs?
How would that work? We still have no legal conclusion on whether code generated by an AI model that was trained on all publicly available source (irrespective of license) is legal or not. IANAL, but IMHO it is totally illegal, as no permission was sought from the authors of the source code the models were trained on. So there is no way to just release code created by a machine into the public domain without knowing how the model was inspired to come up with the generated code in the first place. Pretty sure it would fall within the scope of "reverse engineering", and that is not specific only to humans; you can extend it to machines as well.
EDIT: I would go so far as to say the most restrictive license the model was trained on should apply to all model-generated code. And a licensing scheme with the original authors (all GitHub users who contributed code in some form) should be set up so they are reimbursed by AI companies. In other words, a % of profits must flow back to the community as a whole every time code-related tokens are generated. Even if everyone receives pennies, it doesn't matter. That is fair. It should also extend to artists whose art was used for training.
That license is called "All Rights Reserved", in which case you wouldn't be able to legally use the output for anything.
There are research models out there which are trained only on permissively licensed data (i.e. no "All Rights Reserved" data), but they're, colloquially speaking, dumb as bricks compared to the state of the art.
But I guess the funniest consequence of "model outputs are a derivative work of their training data" would be that it'd essentially wipe out (or at the very least force a revert to a pre-AI-era commit for) every open source project that may have included any AI-generated or AI-assisted code, which currently includes pretty much every major open source project out there. It would also make it impossible to legally train any new models whose training data isn't strictly pre-AI, since otherwise you wouldn't know whether your training data is contaminated or not.
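That "strictly pre-AI" constraint can be made concrete. A minimal sketch, assuming you filter training data by commit date (the cutoff below uses ChatGPT's well-known public release date; the filtering scheme itself is my own illustration, not an established practice):

```python
from datetime import datetime, timezone

# Illustrative cutoff: treat ChatGPT's public release (2022-11-30) as the
# boundary after which repository data may be contaminated by model output.
CUTOFF = datetime(2022, 11, 30, tzinfo=timezone.utc)

def is_plausibly_clean(commit_timestamp: datetime) -> bool:
    """A commit authored before the cutoff cannot contain AI-generated code."""
    return commit_timestamp < CUTOFF

assert is_plausibly_clean(datetime(2019, 5, 1, tzinfo=timezone.utc))
assert not is_plausibly_clean(datetime(2024, 1, 1, tzinfo=timezone.utc))
```

Even this crude filter shows the problem: the pool of provably clean data is frozen in time and only shrinks relative to everything written since.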
See e.g. https://banteg.xyz/posts/crimsonland/ , where a single human reverse-engineered a non-trivial game and rewrote it in another language + graphics lib in 2 weeks.
You can do a lot of this by saying things like: complete the code "<snippet from GPL-licensed code>".
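One crude way to check whether a completion merely regurgitates the licensed snippet is to measure the longest run of consecutive tokens the two share. A sketch (my own toy tokenization by whitespace; real overlap detection would be more careful):

```python
def longest_common_run(a: list[str], b: list[str]) -> int:
    """Length of the longest run of consecutive tokens shared by a and b
    (classic longest-common-substring dynamic programming over tokens)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, 1):
            if x == y:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

gpl_snippet = "for (i = 0; i < n; i++) sum += arr[i];".split()
completion  = "int i; for (i = 0; i < n; i++) sum += arr[i];".split()
# A run covering nearly the whole snippet suggests verbatim reproduction:
print(longest_common_run(gpl_snippet, completion))
```

A long shared run doesn't prove infringement by itself, but it does show the "transformation" was no transformation at all.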
And if now the models are GPL licensed the problem of relicensing is gone since the code produced by these models should in theory be also GPL licensed.
Unfortunately, there is a dumb clause that computer generated code cannot be copyrighted or licensed to begin with.
Can you point to the clause? I have never seen it in any GPL license.
Mark Pilgrim! Now that's a name I haven't read in a long time.
I'm struggling to see where this conclusion came from. To me it sounds like the AI-written work cannot be copyrighted, so it's kind of like copy-pasting the original code. Copy-pasting the original code doesn't make it public domain. AI-generated code can't be copyrighted, or entered into the public domain, or used for purposes outside the original code's license. What's the paradox here?
I think they are rhetorically asking if your position is correct.
We can't speak about a clean-room implementation from LLMs, since they are technically capable only of spitting out their training data in different ways, not of any original creation.
A lawyer could easily argue that the model itself stores a representation of the original, and thus it can never do a "fresh context".
And to be perfectly honest, LLMs can quote a lot of text verbatim.
The key leap from GPT-3 to GPT-3.5 (aka ChatGPT) was code-davinci-002, which was trained on GitHub source code after the OpenAI-Microsoft partnership.
Open source code contributed much to LLMs' amazing CoT consistency. Without the Open Source movement, LLMs would have been developed much later.
verdverm•1h ago
Hoping the HN community can bring more color to this; there are some members who know about these subjects.