I'm sure the scripts of Star Wars would be similarly ignored if they were used.
https://web.archive.org/web/20260105115129/https://devblogs....
Nevertheless, it's a pretty egregious oversight (incompetence?) and something that shouldn't have been published.
If it comes from a site claiming it was under a licence when it was not, the misdeed is done by the person who provided the version carrying the licence.
https://www.kaggle.com/datasets/shubhammaindola/harry-potter...
More than just using the data, it seems that linking to a copy that claims the dataset is public domain would be problematic copyright-wise.
Also interesting: this blog post has been up since November 2024. It's very surprising to me that Microsoft hasn't taken it down yet.
Would it? Sounds to me like the blame lies with the person who uploaded the dataset under that license, unless some reasonable-person standard applies here, like 'everyone knows Harry Potter, and thus they should know it is obviously not CC0'.
Why wouldn't that apply?
Yes there's an expectation that you put in some minimum amount of effort. The license issue here is not subtle, the Kaggle page says they just downloaded the eBooks and converted them to txt. The author is clearly familiar enough with HP to know that it's not old enough to be public domain, and the Kaggle page makes it pretty clear that they didn't get some kind of special permission.
If you want to get more specific on the legal side, copyright infringement does not require that you _knew_ you were infringing on the copyright. It's still infringement either way, and you can be made to pay damages. It's entirely on you to verify the license.
Everyone should torrent and rip off those books, anyway.
I guess the question for leadership is this: two of the three pillars, namely security and quality, are at odds with the third pillar, AI innovation. Which side do you pick?
(I know you mean well and I love you, Scott Hanselman, but please don't answer this yourself. Please pass this on to the leadership.)
Why do you assume that reviewing docs is a lower bar than reviewing code, and that if docs aren't being reviewed it's somehow less likely that code is being reviewed?
There's a formal process for reviewing code because bugs can break things in massive ways. There may not be the same degree of rigor for reviewing documentation, because it's not going to stop the software from working.
But one doesn't necessarily say anything about the other.
If I write an article on training an LLM on the leaked Windows XP source code, blithely mark the source code repo as in 'the public domain', but use Azure resources for the how-to steps, would that make it OK, Microsoft? You know, your Azure division might get some money...
Seriously, this is just so...blatant. It's like we've all collectively decided that copyright just doesn't matter anymore. Just reading this article, I feel like I'm taking crazy pills.
andsoitis•1h ago
I'm surprised that JKR's people haven't come down like a tonne of bricks on Kaggle / Microsoft.
Does anyone know whether there is some special reason why this has lasted so long without being taken down?
anonymous908213•1h ago
[1] https://news.ycombinator.com/item?id=47057829, "Microsoft morged my diagram". It was in a discussion there that someone pointed out this article linking to full downloads of the Harry Potter novels, which I thought deserved more visibility.
zythyx•39m ago
selridge•23m ago
anonymous908213•20m ago
ryandrake•8m ago