Microsoft offers guide to pirating Harry Potter series for LLM training

https://devblogs.microsoft.com/azure-sql/langchain-with-sqlvectorstore-example/
117•anonymous908213•1h ago

Comments

andsoitis•1h ago
This article is from 2024 and points to Kaggle, which hosts the data set.

I'm surprised that JKR's people haven't come down like a tonne of bricks on Kaggle / Microsoft.

Does anyone know whether there is some special reason why this has lasted so long without being taken down?

anonymous908213•1h ago
My best guess is that it flew under the radar. The Kaggle dataset has 'only' 10,000 downloads, and the article itself probably doesn't have that many views. Still, this seems pretty far beyond the pale. Given the other case of AI-related plagiarism by Microsoft that was on the front page[1], it seems whatever review process they have for content that is published by their employees, if there is any review process at all, is deeply flawed.

[1] https://news.ycombinator.com/item?id=47057829, "Microsoft morged my diagram". It was in a discussion there that someone pointed out this article linking to full downloads of the Harry Potter novels, which I thought deserved more visibility.

zythyx•41m ago
Also, I imagine that most of those 10k downloads are probably from AI trainers that are just speed running through Kaggle to obtain absolutely anything to train their AI. There are definitely other, more 'known' ways to obtain these books without finding them as random text files in an AI dataset operation
selridge•25m ago
Why did you think that?
anonymous908213•22m ago
It rubs me the wrong way that corporations get a free pass on copyright infringement, while the rest of us are prosecuted as harshly as possible if caught. I think this, together with the morging plagiarism, also indicates a pattern of behaviour from Microsoft that should be reformed. I would prefer if Microsoft were not able to produce AI slop degradations of other people's work and claim it as their own.
ryandrake•10m ago
In general, if you want to get away with a crime, just do it as a corporation or as a billionaire.
arkensaw•1h ago
My guess is HP makes such an enormous amount of money already from movies, games, toys, and other tie-ins, that they can't be bothered to chase down the odd digital infringement of a plain text copy of the original books.

I'm sure the scripts of Star Wars would be similarly ignored if they were used.

amanzi•35m ago
That doesn't justify what's going on here. Why is Microsoft endorsing the use of pirated materials?
outside1234•22m ago
The dataset is actually at Kaggle tho, but agree, they shouldn't use it as an example.
crtasm•8m ago
The file being hosted by another company doesn't change the fact that Microsoft is encouraging us to download and use it.
conartist6•1h ago
What in the absolute fuck
dom96•58m ago
How soon before someone will be able to make an online library which generates the original books using LLMs? Surely popular titles like Harry Potter may end up so well represented in the training that we'll get the full books out of the LLM with a close to 100% accuracy?
anonymous908213•56m ago
This is already possible for Harry Potter specifically. There was a study demonstrating that Sonnet 3.7, among other models tested, could reproduce the first Harry Potter book 95.8% verbatim[1].

[1] https://arxiv.org/abs/2601.02671

dom96•46m ago
Thanks for linking! I've been thinking about trying something like this myself.
Legend2440•26m ago
...only if you deliberately attempt to extract it by repeatedly prompting it to complete fragments of the book. They had to do quite a bit of work to make this happen.
dom96•7m ago
So? It demonstrates that LLMs retain the copyrighted material in their weights. This is an important thing to consider about LLMs and shows that there need to be better protections for the creative industry.
cadamsdotcom•51m ago
The word original is doing a lot of heavy lifting there! ;)
thrKan•57m ago
In case the page disappears:

https://archive.is/7WLho

boznz•42m ago
More like when the page disappears
agluszak•21m ago
It disappeared already
freitasm•21m ago
And the original is gone.
rlabnm•14m ago
For redundancy in case archive.is is down:

https://web.archive.org/web/20260105115129/https://devblogs....

crtasm•4m ago
The superior link; no Google captcha.
fxwin•56m ago
I feel like the title is a bit misleading, unless the person who put all HP books on Kaggle as a (supposedly) CC0-licensed data set did so as a Microsoft employee.

Nevertheless pretty egregious oversight (incompetence?) and something that shouldn't have been published.

robrain•54m ago
The original title was "LangChain Integration for Vector Support for SQL-based AI applications"
ASalazarMX•51m ago
For some reason I really like this.
blt•52m ago
What makes this different from linking to a random zip file somewhere?
zythyx•45m ago
Microsoft could have used any dataset for their blog; they could even have chosen to use actual public domain novels. Instead, they opted to use copyrighted works that JK hasn't released into the public domain (unless user "Shubham Maindola" is JK's alter ego).
Lerc•44m ago
The licence?

If it comes from a site claiming it was under a licence when it was not, the misdeed is done by the person who provided the version carrying the licence.

slopinthebag•26m ago
Oh come on. The licence was obviously incorrect and you can't escape culpability because of that.
fxwin•44m ago
The licensing: if I steal something and tell you it's free and yours for the taking, that feels different from a fence (knowingly) buying stolen goods. It's obviously semantics and there should have been some better judgement from MS, but downloading a dataset (stated as public domain) from Kaggle feels spiritually different from piracy. E.g., if someone uploads a lesser-known, copyrighted data set to Kaggle/Hugging Face under an incorrect license, are tutorials that use this data set a 'guide to pirating' this data set? To me, that feels like a wrong use of the term.
ThrowawayTestr•52m ago
Absolutely shameless
robrain•51m ago
Original title: "LangChain Integration for Vector Support for SQL-based AI applications"
anonymous908213•43m ago
I don't believe that title conveys the actual significance of the article that makes it worthy of attention, so I hope HN may forgive me for coming up with an alternative title!
beached_whale•46m ago
The AI generated thumbnail, https://devblogs.microsoft.com/azure-sql/wp-content/uploads/..., is that of young Harry and friend with a prominent MS logo. Wow
camkego•40m ago
The real cherry on top, is that the Microsoft link from the blog post by the Microsoft senior product manager goes to a Kaggle dataset page claiming the dataset is CC0: Public Domain.

https://www.kaggle.com/datasets/shubhammaindola/harry-potter...

More than just using the data, it seems linking to a copy that claims the dataset is public domain, would be problematic copyright-wise.

Also interesting, this blog post has been up since November of 2024, very surprising to me that Microsoft hasn't taken it down yet.

fxwin•32m ago
> it seems linking to a copy that claims the dataset is public domain, would be problematic copyright-wise.

Would it? Sounds to me like the blame lies on the person uploading the dataset under that license, unless there is some reasonable person standard applied here like 'everyone knows Harry Potter, and thus they should know it is obviously not CC0'

Retr0id•30m ago
> unless there is some reasonable person standard applied here like 'everyone knows Harry Potter, and thus they should know it is obviously not CC0'

Why wouldn't that apply?

xmprt•18m ago
I'm not a copyright expert and if you told me that Harry Potter was in the public domain then I'd probably be a bit surprised but wouldn't think it's crazy. The first book came out 30 years ago after all. On further research the copyright laws are way more aggressive than that (a bit too much if you ask me) but 30 years doesn't seem quick. Patents expire after 20 years.
jacquesm•16m ago
It would be incredibly naive to assume that a moneymaker like that is PD.
DSMan195276•11m ago
> unless there is some reasonable person standard applied here like 'everyone knows Harry Potter, and thus they should know it is obviously not CC0'

Yes there's an expectation that you put in some minimum amount of effort. The license issue here is not subtle, the Kaggle page says they just downloaded the eBooks and converted them to txt. The author is clearly familiar enough with HP to know that it's not old enough to be public domain, and the Kaggle page makes it pretty clear that they didn't get some kind of special permission.

If you want to get more specific on the legal side then copyright infringement does not require that you _knew_ you were infringing on the copyright, it's still infringement either way and you can be made to pay damages. It's entirely on you to verify the license.

electronsoup•29m ago
I guess the end of copyright is near if this is fine to put on a corporate website
larodi•12m ago
the end of reason and thought at corporations littered with fakers these days.
WillMorr•28m ago
Since IP law is apparently dead, does anyone want to invest in my AI-generated novel startup where it just spits out Harry Potter verbatim but uses a bunch of power to do so?
Kapura•28m ago
only if you tell me that it's a necessary step to creating robot slaves
Den_VR•6m ago
Are they an ethical alternative to the human version?
Pfeil•6m ago
Robot slaves is a funny phrase if you consider that the origin of the word robot literally is a term that meant slave or "forced work". Language doing circles.
wewewedxfgdf•27m ago
Refreshingly honest.
selridge•27m ago
Someone forgot the national no snitching rules, and in service of Jo, no less.

Everyone should torrent and rip off those books, anyway.

thehamkercat•23m ago
It's taken down lmao, in 1 hour
actionfromafar•17m ago
No? I can see it
anpat•11m ago
+1, I can still access the page from US.
thehamkercat•7m ago
Probably some kind of cache, but it's taken down, I'm getting 404, while some of my friends are still able to see it
outside1234•23m ago
I mean they are also offering up the code you are writing in your private repos to LLMs to regenerate in my repo, so let's just go nuts.
mcny•22m ago
You guys are talking about copyright but I think a bigger takeaway is there is a process breakdown at Microsoft. Nobody is reading or reviewing this documentation so what hope is there that anybody is reading or reviewing their new code?

I guess the question to leadership is that two of the three pillars, namely security and quality, are at odds with the third pillar: AI innovation. Which side do you pick?

(I know you mean well and I love you, Scott Hanselman but please don't answer this yourself. Please pass this on to the leadership.)

crazygringo•19m ago
> Nobody is reading or reviewing these documentation so what hope is there that anybody is reading or reviewing their new code?

Why do you assume that reviewing docs is a lower bar than reviewing code, and that if docs aren't being reviewed it's somehow less likely that code is being reviewed?

There's a formal process for reviewing code because bugs can break things in massive ways. While there may not be the same degree of rigor for reviewing documentation because it's not going to stop the software from working.

But one doesn't necessarily say anything about the other.

jacquesm•17m ago
If they have the documentation... With Microsoft probably the answer to that is yes, but more often than not documentation is simply absent. And in cases like this not being too aware of where the lines are is probably a great way to advance your career.
bryan_w•21m ago
I guess legal was a part of the layoffs these past few years. Too bad we can't get a bounty from the RIAA of books, whatever that is
pbrum•19m ago
Update: Microsoft has taken the page down. But posterity being what it is...

https://archive.is/D9vEN

lukeinator42•14m ago
it's still up for me
ed_mercer•57s ago
But the article is from 2024! So someone at MS saw this thread?
rfc2324•12m ago
Jupyter notebook version here for the curious: https://github.com/Azure-Samples/azure-sql-db-vector-search/...
miffy900•11m ago
I recall the source code for Windows XP was leaked some years ago; not just isolated parts of the code base, like with the earlier Windows NT4/2000 source code leak, but a completely buildable repository.

If I write an article on training an LLM on the leaked Windows XP source code, blithely mark the source code repo as in 'the public domain', but use Azure resources for the how-to steps, would that make it OK, Microsoft? You know, your Azure division might get some money...

Seriously, this is just so...blatant. It's like we've all collectively decided that copyright just doesn't matter anymore. Just reading this article, I feel like I'm taking crazy pills.

til_something•1m ago
I can still get to the article on the site; perhaps it’s cached in the CDN somewhere. Also, reviewing the repo, the entire article is there, promoting the same silly things, and the commit timestamp is from two years ago even though the article was posted to the site in 2024. https://github.com/Azure-Samples/azure-sql-db-vector-search/...
