If the code is different but API compatible, the Google v. Oracle Java case shows that if the implementation is different enough, it can be considered a new implementation. Clean room or not.
I don't think this is a precedent either, plenty of projects changed licenses lol.
I keep kind of mixing them up, but the GPL licenses keep popping up in occasional horror stories. Maybe the license is just poorly written by today's standards?
What if I decide to make a JS or Rust implementation of this project and use it as inspiration? Does that mean I'm no longer doing a "clean room" implementation and my project is contaminated by LGPL too?
Generally relicensing is done in good faith for a good reason, so pretty much everyone ok's it.
Trickiness can turn up when code contributors aren't contactable (i.e. dead, missing, etc.), and I'm unsure of the legally sound approach to that.
I rewrite it, my head full of my own, original, new ideas. The results turn out great. There's a few if and while loops that look the same, and some public interfaces stayed the same. But all the guts are brand new, shiny, my own.
Do I have no rights to this code?
But code that is any kind of derivative of code before it contains a complex mix of other people's rights. It can be relicensed, but only if all authors large and small agree to the terms.
I understand you need to publish the source code of your modifications, if you distribute them outside of your company.
They usually did that with approval from existing license holders (except when they didn't, those were the bad cases for sure).
Be really careful who you give your projects keys to, folks!
One of their engineers was able to recreate their platform by letting Claude Code reverse engineer their Apps and the Web-Frontend, creating an API-compatible backend that is functionally identical.
Took him a week after work. It's not as stable, the unit-tests need more work, the code has some unnecessary duplication, hosting isn't fully figured out, but the end-to-end test-harness is even more stable than their own.
"How do we protect ourselves against a competitor doing this?"
Noodling on this at the moment.
https://en.wikipedia.org/wiki/Google_LLC_v._Oracle_America,_....
AI can't be the author of the work. The human driving the AI can, unless they zero-shotted the solution with no creative input.
I think we haven't even begun to consider all the implications of this, and while people ran with that one case where someone couldn't copyright a generated image, it's not that easy for code. I think there needs to be way more litigation before we can confidently say it's settled.
If "generated" code is not copyrightable, where do we draw the line on what generated means? Do macros count? Does code that generates other code count? Protobuf?
If it's the tool that generates the code, again where do we draw the line? Is it just using 3rd party tools? Would training your own count? Would a "random" code gen and pick the winners (by whatever means) count? Would brute-forcing the whole space (silly example but hey we're in silly space here) count?
Is it just "AI" adjacent that isn't copyrightable? If so how do you define AI? Does autocomplete count? Intellisense? Smarter intellisense?
Are we gonna have to have a trial where there's at least one lawyer making silly comparisons between LLMs and power plugs? Or maybe counting abacuses (abaci?)... "But your honour, it's just random numbers / matrix multiplications..."
But for this type of copyright laundering, it doesn't really matter. The goal isn't really about licensing it, it's about avoiding the existing licence. The idea that the code ends up as public domain isn't really an issue for them.
It's also US-only. Other countries will differ. This means you can only rely on this ruling at all for something you are distributing only in the US. Might be OK for art, definitely not for most software. Very definitely not OK for a software library.
For example UK law specifically says "In the case of a literary, dramatic, musical or artistic work which is computer-generated, the author shall be taken to be the person by whom the arrangements necessary for the creation of the work are undertaken."
This seems extremely vague. One could argue that any part of the pipeline counts as an "arrangement necessary for the creation of the work", so who is the author? The prompter, the creator of the model, or the creator of the training data?
How do our competitors protect themselves against us doing this?
They do something very similar for some of their work. It’s hard to use external services so they replicate them, and the cost of doing so has come down from “don’t be daft, we can’t reimplement Slack and Google Drive this sprint just to make testing faster” to realistic. They run the SDKs against the live services and their own implementations until they don’t see behaviour differences. Now they have a fast Slack and Drive and more (that do everything they need for their testing) accelerating other work. I’m dramatically shifting my concept of what’s expensive and not for development. What you’re describing could have been done by someone before, but the difficulty of building that backend has dropped enormously. Even if the application was closed you could probably either now or soon start to do the same thing, starting with building back to core user stories and building the app as well.
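To make the "run both until you don't see behaviour differences" loop concrete, here is a minimal sketch of that kind of differential test. Everything in it is an assumption for illustration: `reference_service` stands in for the live service reached through its SDK, `replica_service` for the local reimplementation, and the request shape is made up. The real setup described above is not public.

```python
from typing import Any, Callable, Iterable

Handler = Callable[[dict], Any]


def reference_service(request: dict) -> Any:
    """Hypothetical stand-in for the live service reached via its SDK."""
    return {"status": "ok", "echo": request["text"].strip().lower()}


def replica_service(request: dict) -> Any:
    """Hypothetical local reimplementation; it forgets to strip whitespace."""
    return {"status": "ok", "echo": request["text"].lower()}


def find_differences(
    reference: Handler, replica: Handler, requests: Iterable[dict]
) -> list[tuple[dict, Any, Any]]:
    """Send every request to both handlers and collect mismatching responses."""
    differences = []
    for request in requests:
        expected = reference(request)
        actual = replica(request)
        if expected != actual:
            differences.append((request, expected, actual))
    return differences


# Illustrative inputs; a real harness would replay recorded or generated traffic.
sample_requests = [{"text": "Hello"}, {"text": "  Padded  "}, {"text": "MiXeD"}]
diffs = find_differences(reference_service, replica_service, sample_requests)

for request, expected, actual in diffs:
    print(f"mismatch for {request!r}: expected {expected!r}, got {actual!r}")
```

You keep fixing the replica and rerunning until `diffs` comes back empty for the inputs you care about. That only shows equivalence on the behaviour you actually exercised, which is the "everything they need for their testing" caveat.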
You can view some of this as having things like the application as a very precise specification.
Really fascinating moment of change.
If the platform is so trivial that it can be reverse engineered by an AI agent from a dumb frontend, what's there to protect against? One has to assume that their moat is not that part of the backend but something else entirely about how the service is being provided.
I know it's a provocative question but that answers why a competitor is not a competitor.
DMCA. The EULA likely prohibits reverse engineering. If a competitor does that, hit 'em with lawyers.
Or, if you want to be able to sleep at night, recognize this as an opportunity instead of a threat.
As engineers, we often think only about code, but code has never been what makes a business succeed. If your client thinks that their business's primary value is in the mobile app code they wrote, 1) why is it even open source? 2) the business is doomed.
Realistically, though, this is inconsequential, and any time spent worrying about this is wasted time. You don't protect yourself from your competitor by worrying about them copying your mobile app.
They did not copy the mobile app. They copied the service.
Question: if they had built one using AI teams in both “rooms”, one writing a spec the other implementing, would that be fine? You’d need to verify spec doesn’t include source code, but that’s easy enough.
It seems to mostly follow the IBM-era precedent. However, since the model probably had the original code in its training data, maybe not? Maybe valid for closed source project but not open-source? Interesting question.
No, I don't think so. I hate comparing LLMs with humans, but for a human, being familiar with the original code might disqualify them from writing a differently-licensed version.
Anyway, LLMs are not human, so, as many courts have confirmed, their output is not copyrightable at all, under any license.
If true, it would mean most commercial code being developed today, since it's increasingly AI-generated, would actually be copyright-free. I don't think most Western courts would uphold that position.
If that were the case, nobody would bother with clean-room rewrites.
While it feels unlikely that a simple "write this spec from this code" + "write this code from this spec" loop would actually trigger this kind of hiding behaviour, an LLM trained to accurately reproduce code from such a loop definitely would be capable of hiding code details within the spec - and you can't reasonably prove that the frontier LLMs have not been trained to do so.
Also, it's weird that it's okay apparently to use pirated materials to teach an LLM, but maybe not to disseminate what the LLM then tells you.
It doesn't matter how they structure the agents. Since chardet is in the LLM training set, you can't claim any AI implementation thereof is clean room.
Might still be valid for closed source projects (probably is).
I think courts would need to weigh in on the open source side. There’s legal precedent that you can use a derived work to generate a new unique work (the spec derived from the copyrighted code is very much a derived work). There are rulings that LLMs are transformative works, not just copies of training data.
LLMs can’t reproduce their entire training set. But this thinking is also ripe for misuse. I could always train or fine-tune a model on the original work so that it can reproduce the original. We quickly get into statistical arguments here.
It’s a really interesting question.
If you wish to be able to claim in court that it is a "clean room" implementation, yes.
Clean room implementations are specifically where a company firewalls the implementing team off from any knowledge of the original implementation, in order to be able to swear in court that their implementation does not make any use of the original code (which they are in such a case likely not licensed to use).
Edit: this is wrong
What is this recent (clanker-fueled?) obsession with giving everything fancy computer-y names with high numbers?
It's not a '12 stage pipeline', it's just an algorithm.
Do you know this kind of area and are commenting on the code?
All AI generated code is tainted with GPL/LGPL because the LLMs might have been taught with it.
That is however stricter than what's actually legally necessary. It's just that the actual legal standard would require a court ruling to determine if you passed it, and everyone wants to avoid that. As a consequence there also aren't a lot of court cases to draw similarities to.
However, the copyright system has always been a sham to protect US capital interests. So I would be very surprised if this is actually ruled/enforced. And in any case American legislators can just change the law.
This is actually a harder standard than some people think.
The absolute clean room approaches in the USA are there because they help short-circuit a long lawsuit that a bigger corp can drag out forever until you're broken.
Like, "we don't like copyright, but since you insist on enforcing it and we can't do anything against it, we will invent a clever way to use your own rules against you".
They are literally stealing from open source, but it's the original license that is the issue?
That is just the easiest way to disambiguate the legal situation (i.e. the most reliable approach to prevent it from being considered a derivative work by a court).
I'm curious how this is gonna go.
“chardet 7.0 is a ground-up, MIT-licensed rewrite of chardet. Same package name, same public API — drop-in replacement for chardet 5.x/6.x”
Do people not write anymore?
As Freud famously said, sometimes an em dash is just an em dash.
2,305 files changed
+0 -546871 lines changed
https://github.com/chardet/chardet/commit/7e25bf40bb4ae68848...
Why did this new project need to replace the original like that, in this dishonourable way? The proper way would have been to create a proper new project.
Note: even Python's own pip drags this in as a dependency it seems (hopefully they'll stick to a proper version)
"Insider Knowledge" is not relevant for copyright law. That is more in the space of patent law than copyright law.
Or else an artist having seen a picture of a sunset over an empty ocean wouldn't be allowed to paint another sunset over an empty ocean, as people could claim copyright violation.
Though what is a violation is if you place the code side by side and try to circumvent copyright law by just rephrasing the exact same code.
This also means that if you give an AI access to a code base and tell it to produce a new code base doing the same (or similar), it will most likely be ruled a copyright violation, as it's pretty much a side by side rewriting.
But you very much can rewrite a project under a new license even if you have in-depth knowledge. IFF you don't have the old project open/look at it while doing so. Rewrite it from scratch. And don't just rewrite the same code from memory, but instead write fully new code producing the same/similar outputs.
Though while doing so is not per se illegal, it is legally very attackable, as you will have a hard time defending such a rewrite from copyright claims (except if it's internally so completely different that it stops any claims of "being a copy", e.g. you use completely different algorithms, architecture, etc. to produce the same results in a different way).
In the end while technically "legally hard to defend" != "illegal", for companies it's most times best to treat it the same.
mytailorisrich•2h ago
I don't think that the second sentence is a valid claim per se, it depends on what this "rewritten code" actually looks like (IANAL).
Edit: my understanding of "clean room implementation" is that it is a good defence to a copyright infringement claim because there cannot be infringement if you don't know the original work. However it does not mean that NOT "clean room implementation" implies infringement, it's just that it is potentially harder to defend against a claim if the original work was known.
klustregrif•2h ago
mytailorisrich•1h ago
This is not a good analogy.
A "rewrite" in context here is not a reproduction of the original work but a different work that is functionally equivalent, or at least that is the claim.
IanCal•1h ago
jerven•2h ago
Radle•2h ago
Especially now that AI can do this for any kind of intellectual property, like images, books or source code. If judges allowed an AI rewrite to count as an original creation, copyright as we know it would completely end worldwide.
Instead what's more likely is that no one is gonna buy that shit.
charcircuit•2h ago
The change log says the implementation is completely different, not a copy paste. Is that wrong?
>Internal architecture is completely different (probers replaced by pipeline stages). Only the public API is preserved.
fzeroracer•1h ago
charcircuit•1h ago
fzeroracer•1h ago
Ukv•1h ago
Only after that would the burden be on the defendants, such as to give a defense that their usage is sufficiently transformative to qualify as fair use.
_ache_•2h ago
spacedcowboy•2h ago
I’m not sure that “a total rewrite” wouldn’t, in fact, pass muster - depending on how much of a rewrite it was of course. The ‘clean room’ approach was just invented as a plausible-sounding story to head off gratuitous lawsuits. This doesn’t look as defensible against the threat of a lawsuit, but it doesn’t mean it wouldn’t win that lawsuit (I’m not saying it would, I haven’t read or compared the code vs its original). Google copied the entire API of the Java language, and got away with it when Oracle sued. Things in a courtroom can often go in surprising ways…
[edit: negative votes, huh, that’s a first for a while… looks like Reddit/Slashdot-style “downvote if you don’t like what is being said” is alive and well on HN]
actionfromafar•1h ago
toyg•1h ago
duskdozer•27m ago
[1] https://github.com/chardet/chardet/compare/6.0.0.post1...7.0...
bo1024•1h ago
As the LGPL says:
> A "work based on the Library" means either the Library or any derivative work under copyright law: that is to say, a work containing the Library or a portion of it, either verbatim or with modifications and/or translated straightforwardly into another language. (Hereinafter, translation is included without limitation in the term "modification".)
Is v7.0.0 a [derivative work](https://en.wikipedia.org/wiki/Derivative_work)? It seems to depend on the details of the source code (implementing the same API is not copyright infringement).