Extracting books from production language models (2026)

https://arxiv.org/abs/2601.02671
48•logicprog•11h ago

Comments

visarga•7h ago
This sounds pretty damning. Why don't they implement an n-gram-based Bloom filter to ensure they don't replicate expression too close to the protected IP they trained on? Almost any random 10-word n-gram is unique on the internet.
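A minimal sketch of that idea, for illustration only: it indexes every 10-word n-gram of a protected text in a Bloom filter and flags candidate output that shares any such n-gram. The class and function names (`BloomFilter`, `word_ngrams`, `index_protected`, `overlaps_protected`), the filter size, and the whitespace tokenization are all assumptions made for this example, not anything described in the paper or used by a real provider.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; sizes here are illustrative, not tuned."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Double hashing: derive k bit positions from one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


def word_ngrams(text, n=10):
    # Naive whitespace tokenization (an assumption; real systems would normalize more).
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def index_protected(texts, n=10):
    bf = BloomFilter()
    for text in texts:
        for gram in word_ngrams(text, n):
            bf.add(gram)
    return bf


def overlaps_protected(candidate, bf, n=10):
    # True if any n-gram of the candidate is (probably) in the protected index.
    return any(gram in bf for gram in word_ngrams(candidate, n))
```

A Bloom filter can report false positives but never false negatives, so a check like this would occasionally block innocent text while never missing an indexed n-gram. It would also block short legitimate quotations unless n is chosen with that in mind.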

Alternatively they could train on synthetic data like summaries and QA pairs extracted from protected sources, so the model gets the ideas separated from their original expression. Since it never saw the originals it can't regurgitate them.

isodev•6h ago
But that would only hide the problem; it doesn't resolve the fact that models do, in fact, violate copyright.
jgalt212•6h ago
True, but it would certainly reduce litigation risk insofar as copypasta is ipso facto proof of copyright violation.
protocolture•5h ago
That hasn't been established. There's no concrete basis to assert that training violates copyright.
soulofmischief•5h ago
The idea of applying clean-room design to model training is interesting... having a "dirty model" and a "clean model", dirty model touches restricted content and clean model works only with the output of the dirty model.

However, besides how this sidesteps the fact that current copyright law violates the constitutional rights of US citizens, I imagine there is a very real threat of the clean model losing the fidelity of insight that the dirty model develops by having access to the base training data.

bryanrasmussen•4h ago
>this sidesteps the fact that current copyright law violates the constitutional rights of US citizens

I think most people sidestep this as it's the first I've heard of it! Which right do you think is being violated and how?

soulofmischief•3h ago
Actually, plenty of activists, for example Cory Doctorow, have spent a significant amount of effort discussing why the DMCA, modern copyright law, DRM, etc. are all anti-consumer and how they encroach on our rights.

It's late so I don't feel like repeating it all here, but I definitely recommend searching for Doctorow's thoughts on the DMCA, DRM and copyright law in general as a good starting point.

But generally, the idea that people are not allowed to freely manipulate and share data that belongs to them is patently absurd and has been a large topic of discussion for decades.

You've probably at least been exposed to how copyright law benefits corporations such as Disney, and private equity, much more than it benefits you or me. And how copyright terms have been extended over and over at the urging of entities like Disney just so they could keep their beloved golden geese out of the public domain as long as possible; far, far longer than intended by the original spirit of the copyright act.

JimDabell•3h ago
> To promote the Progress of Science and useful Arts, by securing for limited Times to Authors and Inventors the exclusive Right to their respective Writings and Discoveries.

Copyright is not “you own this forever because you deserve it”, copyright is “we’ll give you a temporary monopoly on copying to give you an incentive to create”. It’s transactional in nature. You create for society, society rewards you by giving you commercial leverage for a while.

Repeatedly extending copyright durations from the original 14+14 years to durations that outlast everybody alive today might technically be “limited times” but obviously violates the spirit of the law and undermines its goal. The goal was to incentivise people to create, and being able to have one hit that you can live off for the rest of your life is the opposite of that. Copyright durations need to be shorter than a typical career so that its incentive for creators to create for a living remains and the purpose of copyright is fulfilled.

In the context of large language models, if anybody successfully uses copyright to stop large language models from learning from books, that seems like a clear subversion of the law – it’s stopping “the progress of science and useful arts” not promoting it.

(To be clear, I’m not referring to memorisation and regurgitation like the examples in this paper, but rather the more commonplace “we trained on a zillion books and now it knows how language works and facts about the world”.)

apical_dendrite•5h ago
I'm assuming that the goal of the bloom filter is to prevent the model from producing output that infringes copyright rather than hide that the text is in the training data.

In that case the model would lose the ability to provide relatively brief quotes from copyrighted sources in its answers, which is a really helpful feature when doing research. A brief quote from a copyrighted text, particularly for a transformative purpose like commentary, is perfectly fine under copyright law.

orbital-decay•4h ago
That would reduce the training quality immensely. Besides, any generalist model really needs to remember facts and texts verbatim to stay useful, not just generalize. There's no easy way around that.
stubish•16m ago
Even if output is blocked, if it can be demonstrated that the copyrighted material is still in the model then you become liable for distribution and/or duplication without a license.

Training on synthetic data is interesting, but how do you generate the synthetic data? Is it turtles all the way down?

orbital-decay•4h ago
It's all pretty obvious to anyone who tried a similar experiment just out of curiosity. Big models remember a lot. And all non-local models have regurgitation filters in place due to this fact, with the entire dataset indexed (e.g. Gemini will even cite the source of the regurgitated text as it gives the RECITATION error). You'll eventually trip those filters if you force the model to repeat some copyrighted text. Interesting that they don't even try to circumvent those, they simply repeat the request from the interruption point, as the match needs some runway to trigger and by that time a part of the response is already streamed in.
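A toy simulation of the "runway" effect described above, purely as a sketch: a streaming filter that only trips once the last n tokens match an indexed n-gram has already emitted n-1 tokens by then, so resuming from the interruption point recovers the text a few tokens at a time. The function names, the set-based index, and the assumption that the filter's window resets on each new request are all hypothetical; nothing here reflects how any provider's filter actually works.

```python
from collections import deque


def build_ngram_index(protected_text, n):
    # Exact-match index of all n-token windows (whitespace tokens, an assumption).
    words = protected_text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def stream_with_filter(tokens, index, n):
    """Emit tokens until the last n tokens form an indexed n-gram.

    Returns (emitted_tokens, tripped). The token that completes the
    match is withheld, but the n-1 tokens before it are already out.
    """
    window = deque(maxlen=n)
    emitted = []
    for tok in tokens:
        window.append(tok)
        if len(window) == n and tuple(window) in index:
            return emitted, True
        emitted.append(tok)
    return emitted, False


def extract_by_resuming(tokens, index, n):
    """Re-request from the interruption point until the text is exhausted."""
    recovered = []
    pos = 0
    while pos < len(tokens):
        # Assumption: each new request starts with an empty filter window.
        emitted, tripped = stream_with_filter(tokens[pos:], index, n)
        recovered.extend(emitted)
        pos += len(emitted)
        if not tripped or not emitted:
            break
    return recovered
```

In this model the only way to stop the leak is to carry the filter's state across requests or to buffer output until the window has been checked, at the cost of streaming latency.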
rurban•1h ago
I find it interesting that OpenAI's safety measures worked best, while the others didn't work at all. I had a different impression before.
