Querying OSM objects by their shapes

https://www.openstreetmap.org/user/rphyrin/diary/408263
1•altilunium•21s ago•0 comments

The History of Sushi

https://en.wikipedia.org/wiki/History_of_sushi
1•BiraIgnacio•48s ago•1 comments

The Worst-Case Future for White-Collar Workers

https://www.theatlantic.com/ideas/2026/02/ai-white-collar-jobs/686031/
2•petethomas•3m ago•0 comments

Do the people building Claude understand what they've created?

https://www.npr.org/2026/02/18/nx-s1-5717561/do-the-people-building-the-ai-chatbot-claude-underst...
2•geox•3m ago•0 comments

Show HN: What We See. An AI generated art exhibition

https://www.whatwesee.space/
1•sarreph•6m ago•0 comments

Model collapse – how LLMs become worse when trained on their own output

https://www.ibm.com/think/topics/model-collapse
2•daymos•6m ago•0 comments

Conversations with an AI That Argues Back

https://luisfernandoyt.makestudio.app/blog/878-conversations-with-ai
1•lout332•6m ago•0 comments

Zuckerberg testimony: Company consulted stakeholders about beauty filters

https://www.cnbc.com/2026/02/18/meta-mark-zuckerberg-social-media-safety-trial.html
1•samaysharma•7m ago•0 comments

The Only "Good" Cloud: Is a Google Cloud

https://blog.dijit.sh/gcp-the-only-good-cloud/
2•dijit•8m ago•0 comments

I made $15K/month at 13. Built a YC startup at 20. Still looking for my person

2•HNMaxHN•9m ago•1 comments

Hacking conference Def Con bans three people linked to Epstein

https://techcrunch.com/2026/02/18/hacking-conference-def-con-bans-three-people-linked-to-epstein/
2•donutshop•9m ago•0 comments

S3lite – A SQLite-like database engine with S3-compatible storage back end

https://github.com/sjcotto/s3lite
2•sjcotto•21m ago•0 comments

A Thick-Skulled Troodontid Theropod from the Late Cretaceous of Mexico

https://www.mdpi.com/1424-2818/18/1/38
1•PaulHoule•21m ago•0 comments

Cloud and AWS cost consultant Duckbill expands to software, raises $7.75M

https://www.geekwire.com/2026/cloud-and-aws-cost-consultant-duckbill-expands-to-software-raises-7...
2•mooreds•23m ago•0 comments

DBML: DSL for easily creating ER diagrams

https://dbml.dbdiagram.io/home/
1•todsacerdoti•23m ago•0 comments

How AI is affecting productivity and jobs in Europe

https://cepr.org/voxeu/columns/how-ai-affecting-productivity-and-jobs-europe
2•pseudolus•25m ago•0 comments

8086 Agentic AI Assembler Tool

https://github.com/cookertron/agent86
1•cookertron•25m ago•0 comments

Apollo Seeks to Reassure Clients About Rowan's Epstein Ties

https://www.bloomberg.com/news/articles/2026-02-18/apollo-seeks-to-reassure-clients-about-executi...
2•petethomas•31m ago•1 comments

China Is Killing the Fish

https://www.noahpinion.blog/p/china-is-killing-the-fish
1•paulpauper•31m ago•0 comments

Gemini JiTOR Jailbreak: Unredacted Methodology

https://recursion.wtf/posts/jitor_unredacted/
1•tomjakubowski•32m ago•0 comments

Dwarkesh Patel's 2026 Podcast with Elon Musk and Other Recent Elon Musk

https://thezvi.substack.com/p/on-dwarkesh-patels-2026-podcast-with-850
1•paulpauper•33m ago•0 comments

Things you should never do (Part 1)

https://www.joelonsoftware.com/2000/04/06/things-you-should-never-do-part-i/
3•nedwin•34m ago•3 comments

Chief: Delightfully Simple Agentic Loops

https://www.geocod.io/code-and-coordinates/2026-02-18-introducing-chief/
1•mooreds•35m ago•0 comments

Show HN: How well can you remember these colors?

https://dialed.gg
2•sss111•35m ago•0 comments

Replacing Humans With AI Completely BACKFIRED [video][22m]

https://www.youtube.com/watch?v=TYe9DSPuCaE
2•Bender•36m ago•0 comments

Show HN: Devly – 50 developer tools in a native macOS menu bar

https://apps.apple.com/us/app/devly/id6759269801?mt=12
1•aarush-prakash•38m ago•1 comments

Tourists no longer allowed to take JLPT in Japan from 2026

https://www.japantimes.co.jp/news/2026/02/18/japan/jlpt-tourist-ban/
2•mikhael•38m ago•1 comments

Grandson of Reese's Peanut Butter Cups inventor says Hershey is cutting corners

https://apnews.com/article/reeses-peanut-butter-cups-hershey-chocolate-1a66ec75247fd146888b7a747a...
7•petethomas•40m ago•2 comments

A secure dotenv – from the creator of dotenv

https://dotenvx.com/
2•handfuloflight•41m ago•0 comments

Show HN: Sanna – Enforce AI agent constitutions with cryptographic receipts

https://github.com/nicallen-exd/sanna
1•nicallen•43m ago•1 comments

Microsoft offers guide to pirating Harry Potter series for LLM training

https://devblogs.microsoft.com/azure-sql/langchain-with-sqlvectorstore-example/
115•anonymous908213•1h ago

Comments

andsoitis•1h ago
This article is from 2024 and points to Kaggle, which hosts the data set.

I'm surprised that JKR's people haven't come down like a tonne of bricks on Kaggle / Microsoft.

Does anyone know whether there is some special reason why this has lasted so long without being taken down?

anonymous908213•1h ago
My best guess is that it flew under the radar. The Kaggle dataset has 'only' 10,000 downloads, and the article itself probably doesn't have that many views. Still, this seems pretty far beyond the pale. Given the other case of AI-related plagiarism by Microsoft that was on the front page[1], it seems whatever review process they have for content that is published by their employees, if there is any review process at all, is deeply flawed.

[1] https://news.ycombinator.com/item?id=47057829, "Microsoft morged my diagram". It was in a discussion there that someone pointed out this article linking to full downloads of the Harry Potter novels, which I thought deserved more visibility.

zythyx•39m ago
Also, I imagine that most of those 10k downloads are probably from AI trainers that are just speed running through Kaggle to obtain absolutely anything to train their AI. There are definitely other, more 'known' ways to obtain these books without finding them as random text files in an AI dataset operation.
selridge•23m ago
Why did you think that?
anonymous908213•20m ago
It rubs me the wrong way that corporations get a free pass on copyright infringement, while the rest of us are prosecuted as harshly as possible if caught. I think this, together with the morging plagiarism, also indicates a pattern of behaviour from Microsoft that should be reformed. I would prefer if Microsoft were not able to produce AI slop degradations of other people's work and claim it as their own.
ryandrake•8m ago
In general, if you want to get away with a crime, just do it as a corporation or as a billionaire.
arkensaw•1h ago
My guess is HP makes such an enormous amount of money already from movies, games, toys, and other tie-ins, that they can't be bothered to chase down the odd digital infringement of a plain text copy of the original books.

I'm sure the scripts of Star Wars would be similarly ignored if they were used.

amanzi•32m ago
That doesn't justify what's going on here. Why is Microsoft endorsing the use of pirated materials?
outside1234•20m ago
The dataset is actually at Kaggle tho, but agree, they shouldn't use it as an example.
crtasm•5m ago
The file being hosted by another company doesn't change the fact that Microsoft is encouraging us to download and use it.
conartist6•59m ago
What in the absolute fuck
dom96•56m ago
How soon before someone will be able to make an online library which generates the original books using LLMs? Surely popular titles like Harry Potter may end up so well represented in the training that we'll get the full books out of the LLM with a close to 100% accuracy?
anonymous908213•54m ago
This is already possible for Harry Potter specifically. There was a study demonstrating that Sonnet 3.7, among other models tested, could reproduce the first Harry Potter book 95.8% verbatim[1].

[1] https://arxiv.org/abs/2601.02671

dom96•44m ago
Thanks for linking! I've been thinking about trying something like this myself.
Legend2440•24m ago
...only if you deliberately attempt to extract it by repeatedly prompting it to complete fragments of the book. They had to do quite a bit of work to make this happen.
dom96•5m ago
So? It demonstrates that LLMs retain the copyrighted material in their weights. This is an important thing to consider about LLMs and shows that there need to be better protections for the creative industry.
cadamsdotcom•49m ago
The word original is doing a lot of heavy lifting there! ;)
thrKan•55m ago
In case the page disappears:

https://archive.is/7WLho

boznz•40m ago
More like when the page disappears
agluszak•19m ago
It disappeared already
freitasm•19m ago
And the original is gone.
rlabnm•11m ago
For redundancy in case archive.is is down:

https://web.archive.org/web/20260105115129/https://devblogs....

crtasm•2m ago
The superior link; no Google captcha.
fxwin•54m ago
I feel like the title is a bit misleading, unless the person who put all HP books on Kaggle as a (supposedly) CC0-licensed data set did so as a Microsoft employee.

Nevertheless pretty egregious oversight (incompetence?) and something that shouldn't have been published.

robrain•51m ago
The original title was "LangChain Integration for Vector Support for SQL-based AI applications"
ASalazarMX•49m ago
For some reason I really like this.
blt•50m ago
What makes this different from linking to a random zip file somewhere?
zythyx•43m ago
Microsoft could have used any dataset for their blog; they could have even chosen to use actual public domain novels. Instead, they opted to use copyrighted works that JK hasn't released into the public domain (unless user "Shubham Maindola" is JK's alter ego).
Lerc•42m ago
The licence?

If it comes from a site claiming it was under a licence when it was not, the misdeed is done by the person who provided the version carrying the licence.

slopinthebag•24m ago
Oh come on. The licence was obviously incorrect and you can't escape culpability because of that.
fxwin•41m ago
The licensing: If I steal something and tell you it's free and yours for the taking, that feels different than a fence (knowingly) buying stolen goods. It's obviously semantics and there should have been some better judgment from MS, but downloading a dataset (stated as public domain) from Kaggle feels spiritually different from piracy. E.g., if someone uploads a lesser-known, copyrighted data set to Kaggle/Hugging Face under an incorrect license, are tutorials that use this data set a 'guide to pirating' this data set? To me, that feels like a wrong use of the term.
ThrowawayTestr•50m ago
Absolutely shameless
robrain•49m ago
Original title: "LangChain Integration for Vector Support for SQL-based AI applications"
anonymous908213•41m ago
I don't believe that title conveys the actual significance of the article that makes it worthy of attention, so I hope HN may forgive me for coming up with an alternative title!
beached_whale•44m ago
The AI generated thumbnail, https://devblogs.microsoft.com/azure-sql/wp-content/uploads/..., is that of young Harry and friend with a prominent MS logo. Wow
camkego•37m ago
The real cherry on top is that the Microsoft link from the blog post by the Microsoft senior product manager goes to a Kaggle dataset page claiming the dataset is CC0: Public Domain.

https://www.kaggle.com/datasets/shubhammaindola/harry-potter...

More than just using the data, it seems linking to a copy that claims the dataset is public domain, would be problematic copyright-wise.

Also interesting, this blog post has been up since November of 2024, very surprising to me that Microsoft hasn't taken it down yet.

fxwin•30m ago
> it seems linking to a copy that claims the dataset is public domain, would be problematic copyright-wise.

Would it? Sounds to me like the blame lies on the person uploading the dataset under that license, unless there is some reasonable person standard applied here like 'everyone knows Harry Potter, and thus they should know it is obviously not CC0'

Retr0id•28m ago
> unless there is some reasonable person standard applied here like 'everyone knows Harry Potter, and thus they should know it is obviously not CC0'

Why wouldn't that apply?

xmprt•15m ago
I'm not a copyright expert, and if you told me that Harry Potter was in the public domain then I'd probably be a bit surprised but wouldn't think it's crazy. The first book came out 30 years ago, after all. On further research, the copyright laws are way more aggressive than that (a bit too much if you ask me), but 30 years doesn't seem quick. Patents expire after 20 years.
jacquesm•14m ago
It would be incredibly naive to assume that a moneymaker like that is PD.
DSMan195276•8m ago
> unless there is some reasonable person standard applied here like 'everyone knows Harry Potter, and thus they should know it is obviously not CC0'

Yes, there's an expectation that you put in some minimum amount of effort. The license issue here is not subtle; the Kaggle page says they just downloaded the eBooks and converted them to txt. The author is clearly familiar enough with HP to know that it's not old enough to be public domain, and the Kaggle page makes it pretty clear that they didn't get some kind of special permission.

If you want to get more specific on the legal side then copyright infringement does not require that you _knew_ you were infringing on the copyright, it's still infringement either way and you can be made to pay damages. It's entirely on you to verify the license.

electronsoup•27m ago
I guess the end of copyright is near if this is fine to put on a corporate website
larodi•9m ago
The end of reason and thought at corporations littered with fakers these days.
WillMorr•26m ago
Since IP law is apparently dead, does anyone want to invest in my AI-generated novel startup where it just spits out Harry Potter verbatim but uses a bunch of power to do so?
Kapura•25m ago
only if you tell me that it's a necessary step to creating robot slaves
Den_VR•4m ago
Are they an ethical alternative to the human version?
Pfeil•3m ago
Robot slaves is a funny phrase if you consider that the origin of the word robot literally is a term that meant slave or "forced work". Language doing circles.
wewewedxfgdf•25m ago
Refreshingly honest.
selridge•24m ago
Someone forgot the national no snitching rules, and in service of Jo, no less.

Everyone should torrent and rip off those books, anyway.

thehamkercat•21m ago
It's taken down lmao, in 1 hour
actionfromafar•15m ago
No? I can see it
anpat•9m ago
+1, I can still access the page from US.
thehamkercat•5m ago
Probably some kind of cache, but it's taken down, I'm getting 404, while some of my friends are still able to see it
outside1234•20m ago
I mean they are also offering up the code you are writing in your private repos to LLMs to regenerate in my repo, so let's just go nuts.
mcny•20m ago
You guys are talking about copyright, but I think a bigger takeaway is that there is a process breakdown at Microsoft. Nobody is reading or reviewing this documentation, so what hope is there that anybody is reading or reviewing their new code?

I guess the question to leadership is that two of the three pillars, namely security and quality, are at odds with the third pillar: AI innovation. Which side do you pick?

(I know you mean well and I love you, Scott Hanselman, but please don't answer this yourself. Please pass this on to the leadership.)

crazygringo•17m ago
> Nobody is reading or reviewing this documentation, so what hope is there that anybody is reading or reviewing their new code?

Why do you assume that reviewing docs is a lower bar than reviewing code, and that if docs aren't being reviewed it's somehow less likely that code is being reviewed?

There's a formal process for reviewing code because bugs can break things in massive ways, while there may not be the same degree of rigor for reviewing documentation because it's not going to stop the software from working.

But one doesn't necessarily say anything about the other.

jacquesm•15m ago
If they have the documentation... With Microsoft probably the answer to that is yes, but more often than not documentation is simply absent. And in cases like this not being too aware of where the lines are is probably a great way to advance your career.
bryan_w•19m ago
I guess legal was a part of the layoff these past few years. Too bad we can't get a bounty from the RIAA of books, whatever that is
pbrum•17m ago
Update: Microsoft has taken the page down. But posterity being what it is...

https://archive.is/D9vEN

lukeinator42•11m ago
it's still up for me
rfc2324•10m ago
Jupyter notebook version here for the curious: https://github.com/Azure-Samples/azure-sql-db-vector-search/...
miffy900•9m ago
I recall the source code for Windows XP was leaked some years ago; not just isolated parts of the code base, like with the earlier Windows NT4/2000 source code leak, but a completely buildable repository.

If I write an article on training an LLM on the leaked Windows XP source code, blithely mark the source code repo as in 'the public domain', but use Azure resources for the how-to steps, would that make it OK, Microsoft? You know, your Azure division might get some money...

Seriously, this is just so...blatant. It's like we've all collectively decided that copyright just doesn't matter anymore. Just reading this article, I feel like I'm taking crazy pills.