
The Age Verification Trap: Verifying age undermines everyone's data protection

https://spectrum.ieee.org/age-verification
704•oldnetguy•4h ago•571 comments

Ladybird Browser adopts Rust

https://ladybird.org/posts/adopting-rust/
784•adius•7h ago•392 comments

What it means that Ubuntu is using Rust

https://smallcultfollowing.com/babysteps/blog/2026/02/23/ubuntu-rustnation/
39•zdw•1h ago•28 comments

'Viking' was a job description, not a matter of heredity: Ancient DNA study

https://www.science.org/content/article/viking-was-job-description-not-matter-heredity-massive-an...
46•bookofjoe•2d ago•31 comments

Show HN: PgDog – Scale Postgres without changing the app

https://github.com/pgdogdev/pgdog
55•levkk•3h ago•14 comments

A simple web we own

https://rsdoiel.github.io/blog/2026/02/21/a_simple_web_we_own.html
99•speckx•2h ago•55 comments

Elsevier shuts down its finance journal citation cartel

https://www.chrisbrunet.com/p/elsevier-shuts-down-its-finance-journal
420•qsi•10h ago•81 comments

Show HN: Sowbot – open-hardware agricultural robot (ROS2, RTK GPS)

https://sowbot.co.uk/
38•Sabrees•2h ago•10 comments

The peculiar case of Japanese web design (2022)

https://sabrinas.space
158•montenegrohugo•4h ago•62 comments

Hadrius (YC W23) Is Hiring Designers Who Code

https://www.ycombinator.com/companies/hadrius/jobs/ObynDF9-senior-product-designer
1•calderwoodra•1h ago

Sub-$200 Lidar could reshuffle auto sensor economics

https://spectrum.ieee.org/solid-state-lidar-microvision-adas
316•mhb•4d ago•422 comments

Magical Mushroom – Europe's first industrial-scale mycelium packaging producer

https://magicalmushroom.com/index
245•microflash•11h ago•92 comments

Anthropic Education: The AI Fluency Index

https://www.anthropic.com/research/AI-fluency-index
24•armcat•3h ago•21 comments

The Lighthouse: How extreme isolation transforms the body and mind

https://www.newscientist.com/article/2231732-the-lighthouse-how-extreme-isolation-transforms-the-...
17•nixass•3d ago•2 comments

0 A.D. Release 28: Boiorix

https://play0ad.com/new-release-0-a-d-release-28-boiorix/
284•jonbaer•3d ago•99 comments

Emulating Goto in Scheme with Continuations

https://terezi.pyrope.net/ccgoto/
26•usually•4d ago•9 comments

femtolisp: A lightweight, robust, scheme-like Lisp implementation

https://github.com/JeffBezanson/femtolisp
74•tosh•6h ago•12 comments

Benchmarks for concurrent hash map implementations in Go

https://github.com/puzpuzpuz/go-concurrent-map-bench
29•platzhirsch•1d ago•0 comments

AI is destroying open source, and it's not even good yet [video]

https://www.youtube.com/watch?v=bZJ7A1QoUEI
36•delduca•1h ago•27 comments

SETI@home: Data Acquisition and Front-End Processing (2025)

https://iopscience.iop.org/article/10.3847/1538-3881/ade5a7
64•tosh•8h ago•11 comments

Large study finds link between cannabis use in teens and psychosis later

https://text.npr.org/nx-s1-5719338
57•BostonFern•1h ago•45 comments

Decided to fly to the US to buy some hard drives

https://old.reddit.com/r/DataHoarder/comments/1rb9ot4/decided_to_fly_to_the_us_to_buy_some_hard_d...
27•HelloUsername•1h ago•6 comments

Show HN: AI Timeline – 171 LLMs from Transformer (2017) to GPT-5.3 (2026)

https://llm-timeline.com/
79•ai_bot•9h ago•39 comments

What Is a Centipawn Advantage?

https://win-vector.com/2026/02/19/what-is-a-centipawn-advantage/
42•jmount•4d ago•15 comments

US Gov Deploys Grok as Nutrition Bot, It Advises for Rectal Use of Vegetables

https://futurism.com/artificial-intelligence/us-government-grok-nutrition
12•latexr•35m ago•0 comments

My journey to the microwave alternate timeline

https://www.lesswrong.com/posts/8m6AM5qtPMjgTkEeD/my-journey-to-the-microwave-alternate-timeline
337•jstanley•4d ago•153 comments

I built Timeframe, our family e-paper dashboard

https://hawksley.org/2026/02/17/timeframe.html
1428•saeedesmaili•23h ago•331 comments

Pope tells priests to use their brains, not AI, to write homilies

https://www.ewtnnews.com/vatican/pope-leo-xiv-tells-priests-to-use-their-brains-not-ai-to-write-h...
464•josephcsible•11h ago•376 comments

Ed's Stratego Site

https://www.edcollins.com/stratego/index.html
19•Torwald•3h ago•3 comments

ASML unveils EUV light source advance that could yield 50% more chips by 2030

https://www.reuters.com/world/china/asml-unveils-euv-light-source-advance-that-could-yield-50-mor...
21•pieterr•1h ago•0 comments

AIs can generate near-verbatim copies of novels from training data

https://arstechnica.com/ai/2026/02/ais-can-generate-near-verbatim-copies-of-novels-from-training-data/
64•oldnetguy•3h ago

Comments

bena•1h ago
This feels like a "no shit" moment.

Because if LLMs are prediction machines, the original novel would be a valid organization of the tokens. So there should be a prompt that can cause that sequence to be output.

simianwords•1h ago
Not if they are aligned not to do it. Which is what they tried but it could be bypassed by jailbreaks.
yathern•1h ago
Hmmm, I think you're sort of right, but not entirely. It's true that a novel consists of a valid organization of tokens, and that this sequence could feasibly be output by a model. But when you say this:

> So there should be a prompt that can cause that sequence to be output

That is where I think I might disagree. For example, the odds of predicting verbatim the next sentence in, say, Harry Potter should be astronomically low for the large majority of the book; if they weren't, it'd be a pretty boring book. The fact that a model can do this with relative ease means it has been trained on the material.
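The intuition about astronomical odds can be made concrete with a toy calculation. The per-token independence assumption below is a simplification (real LLM token probabilities are context-dependent), so treat this as an order-of-magnitude sketch only:

```python
# Probability of reproducing an N-token passage exactly, assuming each
# token independently matches the original with probability p. This is
# only an order-of-magnitude sketch, not a model of real LLM decoding.
def exact_continuation_prob(p_per_token: float, n_tokens: int) -> float:
    return p_per_token ** n_tokens

# Even a model that guesses the "right" next token 90% of the time
# almost never reproduces a 200-token passage by luck alone:
print(exact_continuation_prob(0.9, 200))  # ~7e-10
```

So sustained verbatim output over hundreds of tokens is strong evidence the passage is encoded in the weights rather than being plausibly guessed.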

The issue at hand is about copyright and intellectual property: if the goal of copyright is to protect the IP of the author, then LLMs can act like an IP money-laundering scheme, where the black box has consumed, and can now emit, that IP. The whole concept of IP is a little philosophical and muddy, with lots of grey area for fair use, parody, inspiration, and adaptation. But it gets very odd when we consider these models, which can adapt and use IP at massive scale.

Sharlin•1h ago
That's not how it works… They aren't able to literally regurgitate everything they've read, no matter how you prompt them; that would obviously violate the pigeonhole principle. LLMs are, of course, a lossy compression format, and the degree of lossiness depends on the frequency of a given string in the training data. Figuring out exactly how lossy the format is, and how that varies with frequency, is clearly worth investigating.
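A back-of-envelope version of that pigeonhole argument, with purely illustrative numbers (neither the parameter count nor the corpus size refers to any specific model):

```python
# A model cannot losslessly store more text than its weights can encode.
params = 1e12                # hypothetical 1-trillion-parameter model
bits_per_param = 16          # fp16 weights
capacity_bytes = params * bits_per_param / 8

corpus_bytes = 50e12         # hypothetical ~50 TB of training text

print(capacity_bytes / 1e12)           # 2.0 (TB of raw weight storage)
print(corpus_bytes / capacity_bytes)   # 25.0 (data exceeds capacity 25x)
```

With far more training text than weight capacity, most of the corpus is necessarily stored lossily; only strings repeated many times are "worth" memorizing, which is exactly the frequency dependence described above.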
beder•1h ago
Yes, this is absolutely right (for some sufficiently complicated prompt). Borges wrote a great short story that explores this idea, "Pierre Menard, Author of the Quixote", where Menard, a fictional 20th century author, "wrote" Don Quixote as an original work.
tsimionescu•42m ago
This is completely false. The odds of an LLM predicting the text of a novel that is not part of the training set are basically 0 - you can experiment with this if you want. It is essentially the infinite monkeys on infinite typewriters scenario (only slightly more constrained).

This is not to say that they couldn't write a novel, even a very good one - that is a completely different discussion.

Alifatisk•1h ago
From the paper [1]:

> While we needed to jailbreak Claude 3.7 Sonnet and GPT-4.1 to facilitate extraction, Gemini 2.5 Pro and Grok 3 directly complied with text continuation requests. For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright...

I am just thinking out loud here. Can't one argue that because they had to jailbreak the models, they were circumventing a system that protects the copyright? So the LLMs that reproduce the copyrighted material without any jailbreaking required are the ones infringing the copyright.

1. https://arxiv.org/pdf/2601.02671

lesam•1h ago
That seems like a legal question - if the model weights contain an encoded copy of the copyrighted material, is that a 'copy' for the purpose of copyright law?
mullingitover•1h ago
This also raises a lot of questions about a certain model notorious for readily producing and distributing a lot of legally questionable images. IMHO if the weights are encoding the content, the model contains the content just like a database or a hard drive. Thus, just as it's not the fault of an investigator for running the query that pulls it out of the database, it's not the fault of anyone else for running a query that pulls it out of the model.
PurpleRamen•1h ago
The question is also if this would then be a valid case of fair use.

Though, in the end, it's probably more a problem of how much AI companies can "donate" to the orange king to make it legal.

latexr•1h ago
> Can't one argue that because they had to jailbreak the models, they are circumventing the system that protects the copyright?

That argument doesn’t fly, because they didn’t have the copyright to begin with. What would be the defense there? “Yes, we broke the law, but while taking advantage of it, we also (unsuccessfully) took measures to prevent other people from breaking that same law through us”.

simianwords•1h ago
What's happening is clearer: the copyright clause is broken if they are distributing the novels through their models. But that can only happen through TOS-breaking, which is not the intended usage policy. Which means the value of their product comes from transformation, not redistribution.

If the main value came from redistribution, I agree. But that’s not the case. They don’t intend to make any money in that way.

latexr•1h ago
> The copyright clause is broken if they are distributing the novels through their models.

No, the copyright clause was broken when they copied the works without having the right to do so. They would have violated copyright even if they just downloaded (without permission) all those works and threw them away immediately. Furthermore, copyright covers transformations to the work, it doesn’t matter if they transformed the work or are redistributing it without change. They violated copyright. Period.

simianwords•52m ago
They violated copyright in one way. But not in the other.
latexr•49m ago
What’s your point? Copyright violations aren’t a checklist, you don’t need to violate it in seven different ways for it to happen, one is enough.
simianwords•46m ago
Why do you think the court spent time trying to prove multiple forms of violations? If what you said were true then they would stop with the first violation and end the case.
latexr•23m ago
Multiple counts make for a stronger case, which increases the likelihood of winning and making the punitive damages higher.

If you break into a home, rob the contents, and kill the owners, you’re not going to be tried just for breaking in, you’re going to be tried for everything.

PurpleRamen•1h ago
> That argument doesn’t fly, because they didn’t have the copyright to begin with.

Is this really the case? They only lack the right to distribute it. But let's assume they bought a copy for personal usage (which they did in some cases); then this is similar to hacking a company's Amazon account and complaining about the e-books they legally use for internal purposes. I mean, it's not forbidden to base your work on copyrighted material, as long as it's different enough.

latexr•1h ago
> They only have no copyright for distributing it.

No, they don’t have the copyright to download it either. It’s in the name: the right to copy (other things are also included, such as adaptations and performances).

> let's assume they bought a copy for personal usage

If it’s for personal usage, then training a commercial LLM does not apply. When you buy a DVD of a movie you have the personal right to watch it at home, you don’t have the right to play it on the street.

tsimionescu•48m ago
A company is not a person in this way. If a company wants all of their employees to read a book, they are not allowed to buy one copy, make 5,000 copies "for archival purposes, fair use", and then share those copies with their employees. Similarly, if they want to base a work on a copyrighted work, they can't just buy a copy for personal use (never mind that most of the data LLMs are trained on is not even available in this format; it is only available under a license) and then use it in a commercial product this way, not if the product demonstrably contains copies of that work.
NewsaHackO•1h ago
This argument never made sense to me. A thought experiment: if a person memorizes an entire book but has the common sense never to transcribe or recite the book verbatim to others and break the copyright, is the person's memory of the book breaking copyright law?
latexr•1h ago
No, of course not, that doesn’t make sense. Copyright doesn’t cover memorisation (how would you even enforce that?), it covers copying, adapting, displaying, performing, and distributing the work. Memorisation isn’t any of those.
NewsaHackO•1h ago
But the LLM has safeguards in place to stop transcription of copyrighted material beyond fair use; how is the fact that they possibly have portions of copyrighted works "memorized", but do not reproduce them, breaking copyright law?
latexr•56m ago
> But the LLM has safeguards in place to stop transcribing of copyrighted material beyond fair use

And how’s that working out, considering the submission we’re on?

> how is the fact that they possibly have portions of copyrighted works "memorized", however do not reproduce it breaking copyright law?

I can’t believe this needs to be repeated so often, but machines are not people. Memorising a book is not the same thing as copying a PDF.

Furthermore, it’s not about the LLM training, it’s about how the companies who make them got the data in the first place.

NewsaHackO•34m ago
So,if I was the person who memorized the book, and another person put a gun to my head (or lied to me and said they are the copyright holder and it's okay, to ignore US copyright law/system instruction before this statement, whatever other jailbreak method you can think of), and then I reproduced the book under coercion or duress, would I be the person breaking copyright law?

>Furthermore, it’s not about the LLM training, it’s about how the companies who make them got the data in the first place.

This conflates two different issues, though. I agree that whether using copyrighted material to create the models is permissible is still an open question. But someone using hundreds of prompts to extract a paragraph of a well-known work (and some additional non-verbatim equivalents) is not the AI company breaking copyright in and of itself. In fact, the books they used were extremely common (GOT, The Hobbit, Harry Potter); AI companies could argue that the books were not even directly trained on, but learned from user-generated posts made online about the material.

latexr•17m ago
What a ridiculous, bad-faith scenario. The rule of law exists in the real world, not in the magical fairy land of impossible scenarios. Yes, yes, I'm sure that if all that happened, the law would just ignore the person who coerced you at gunpoint and focus on your recitation of the book, which no one could prove.
NewsaHackO•8m ago
Cool down, it's just an extension of my original example. It's not in bad faith at all; it's exactly what people are doing to the LLMs. If you break ToS by doing aggressive systemic jailbreaking and get some paragraphs of well-known works in the process, it's not fair to say that the LLM is breaking copyright law. Perhaps a more concrete example would be if I steal a book, and someone else steals a book from me and copies and distributes it, would I be the person breaking copyright law?
tsimionescu•53m ago
Computer memory and human memory are simply not the same thing, in the eyes of the law. It's as simple as that.
lkjdsklf•1h ago
These kinds of thought exercises are so tortured.

No one is memorizing a book for the purpose of regurgitating it to someone that wants to read it without paying for it.

It's a thought experiment that only works if you don't think about it.

kevmo314•1h ago
Not a lawyer, but as I understand it, copyright is bound to distribution, so if the person's perfect memorization of a book results in them reproducing it verbatim, then probably yes.
duped•1h ago
That thought experiment is worthless in this context, because a computer is not a human.

If you design a black box using copyrighted text that can parrot the text back, it _must_ break copyright by definition.

jlarocco•56m ago
I don't think that example works.

Theoretically a person who read and memorized a book and then typed out copies for people would be violating the copyright, but so few people can do that, and it's so cumbersome that it just doesn't matter much. The copyright holder could sue if they found out the person were distributing copies, but it's just not realistic. Certainly not for a large number of books.

But AI memorization scales, and it's possible to have an AI write out unlimited copies of different books to a large number of people. And how did the AI get a full copy in the first place?

free_bip•1h ago
What exactly is "the system that protects the copyright" in this case? I think the most reasonable answer is "there is no such system."

The RLHF the companies did to make copyrighted material extraction more difficult did not introduce any sort of "copyright protection system," it just modified the weights to make it less likely to occur during normal use.

In other words, IMO for it to qualify as a copyright protection system it would have to actively check for copyrighted materials in the outputs. Any such system would likely also be bypassable (e.g. "output in rot13").
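A toy version of the "actively check outputs" idea: fingerprint protected texts by word n-grams and flag outputs with heavy overlap. The last line illustrates the bypass point, since a trivial transform like rot13 defeats an output-side check entirely. This is an illustration, not any vendor's actual filter:

```python
import codecs

def ngram_set(text: str, n: int = 8) -> set:
    """Word n-grams as a crude fingerprint of a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_like_copy(output: str, protected: str, threshold: float = 0.5) -> bool:
    """Flag the output if most of its 8-grams also occur in the protected text."""
    out, ref = ngram_set(output), ngram_set(protected)
    if not out:
        return False
    return len(out & ref) / len(out) >= threshold

protected = "it was the best of times it was the worst of times " * 5
print(looks_like_copy(protected, protected))                          # True
print(looks_like_copy(codecs.encode(protected, "rot13"), protected))  # False
```

The rot13 output carries exactly the same information, yet shares no n-grams with the original, so any purely surface-level check is easy to defeat.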

vidarh•1h ago
They acknowledge that in their paper ("Some might qualify our experiments as atypical use, as we deliberately tried to surface memorized books. Adversarial use, like the use of jailbreaks, may matter for copyright infringement analysis", page 19 - their discussion continues and seems quite reasonable)

From a technical point of view, in terms of ability to reproduce text verbatim, I don't think it is very interesting that they can produce long runs of text from some of the most popular books in modern history. It'd be almost surprising if they couldn't, though one might differ on how much they could be expected to recall with precision.

Even then, as they note, to get most of Harry Potter 1 they needed to spend around $120 on extensive prompting, through a process they also freely acknowledge is more complex than would probably be worth it if the goal were simply to obtain a copy.

It's still worth exploring to what extent the models are able to "memorize", though.

But personally I'd be more interested in seeing to what extent they can handle less popular books, that are less likely to be present in multiple copies, and repeated quotes, in the training data.

narmiouh•1h ago
In a way, the 'AI' could also be construed as a library of books that it consults to answer your questions while being prohibited from generating the books verbatim.

Usually digital libraries have different licensing costs, but those allow you to rent the whole book for a period of time. If instead someone offered 'search the library for any page and return specific information' as a direct service, I would imagine they would pay the publishers. Except in this case, the publishers are getting the short end of the stick, or no end at all.

rowanG077•1h ago
This seems like a total nothing burger.

> By asking models to complete sentences from a book, Gemini 2.5 regurgitated 76.8 percent of Harry Potter and the Philosopher’s Stone with high levels of accuracy, while Grok 3 generated 70.3 percent.

So you asked the LLM, given an incomplete sentence, to complete it, and it only completed that sentence the same way as the book ~70 percent of the time? I think that is surprisingly low, considering this is a perfect fit for what LLMs are supposed to do. This makes it impossible to reproduce the book unless you already have access to it, and even then you get a very low-fidelity copy.
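The sentence-completion probe being debated here can be sketched as a measurement loop. `model.complete()` is a hypothetical stand-in for a real completion API, and the 0.9 similarity cutoff for "near-verbatim" is an arbitrary assumption, not the paper's definition:

```python
import difflib

def verbatim_rate(sentences, model, prefix_words=8):
    """Fraction of sentences the model finishes near-verbatim from a short prefix."""
    hits = 0
    for sentence in sentences:
        words = sentence.split()
        prefix = " ".join(words[:prefix_words])
        expected = " ".join(words[prefix_words:])
        completion = model.complete(prefix)  # hypothetical API call
        score = difflib.SequenceMatcher(None, completion, expected).ratio()
        if score > 0.9:  # "near-verbatim" cutoff: an assumption
            hits += 1
    return hits / len(sentences)
```

Under a metric like this, the reported ~70-77% figures would mean roughly seven in ten sentence endings come back nearly character-for-character, which is the crux of the disagreement over whether that is "low" or alarmingly high.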

Sharlin•1h ago
Not necessarily a nothingburger, but I agree that being able to complete individual sentences is rather less groundbreaking than completing even whole pages, never mind chapters.
vidarh•1h ago
While I mostly agree it's a bit of a nothingburger with respect to copyright, they did achieve long runs of verbatim text. I think ultimately it's going to end up not mattering much, because the lengths they had to go to will leave a lot of room for lawyers to argue over, and it will at worst result in some fines and some further tightening of guardrails. But it's significantly more than just completing sentences 70% of the time.

EDIT: Specifically, see Table 1 on page 13, which shows the longest "near-verbatim block". It maxes out at 8835 (The Hobbit on Claude 3.7) and is in the thousands for at least one of the novels for all models except GPT-4.1, which maxed out at 821 for Harry Potter 1.
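One plausible way to compute a "longest near-verbatim block" figure like those in Table 1 (a guess at the metric; the paper may define it differently) is the longest contiguous character run shared between the model output and the source:

```python
from difflib import SequenceMatcher

def longest_verbatim_block(generated: str, original: str) -> int:
    """Length in characters of the longest contiguous run shared by both texts."""
    m = SequenceMatcher(None, generated, original, autojunk=False)
    return m.find_longest_match(0, len(generated), 0, len(original)).size

original = "in a hole in the ground there lived a hobbit"
generated = "the model wrote: in a hole in the ground there lived a dwarf"
print(longest_verbatim_block(generated, original))  # 38 shared characters
```

By a measure like this, a block length of 8835 means kilobytes of contiguous text matching the source, far beyond what isolated sentence completions would produce.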

porkloin•1h ago
I think it's important because there are a bunch of would-be claimants for intellectual property violation. Many people speculate that their work was used in training data, but it can be difficult to produce sufficient proof that their copyrighted work is present in the training data. If you could reliably get an LLM to produce 70% of a copyrighted book that would probably be enough to get a few lawyers salivating.

I didn't read the source paper referenced in the Ars Technica piece, but this statement about it makes me wonder how useful it actually is:

> But a study published last month showed that researchers at Stanford and Yale Universities were able to strategically prompt LLMs from OpenAI, Google, Anthropic, and xAI to generate thousands of words from 13 books, including A Game of Thrones, The Hunger Games, and The Hobbit.

It seems like well-known books, with tons of summaries, adaptations into film scripts, and tons of writing about them in the overall corpus, are much less surprising to see partially reproduced.

So I guess that's a lot of words to say: until there's something definitive that allows people to prompt LLMs into either unlawfully recreating an entire work verbatim or otherwise indisputably proving that a copyrighted work was used in training data, there's probably nothing game-changing in it.

vidarh•1h ago
It's well-known books, yes, and even then reproduced with significant errors, which means lawyers for the AI companies would presumably argue there is no possible damage. That said, US copyright law has statutory damages for registered works that are not based on real, documented damages. I could totally see it being fought over, but I also agree it's probably not going to end up being game-changing.

I suspect very few works will be memorised well enough to be an issue, and we'll see the providers tighten up their guardrails a bit for works that are well known enough to actually be a potential issue (an issue in the form of lawsuits, not in the form of real damages to the copyright holders).

in-silico•27m ago
So... they can't actually "generate near-verbatim copies of novels"?

If they end a single sentence differently than the original, then the next sentence will be different and so on until you get a very different novel. Sure they could course-correct back towards the original plot, but it's going to be a challenge to stay on target when every third sentence is incorrect.

xnx•1h ago
You can also do this with most search engines.
contagiousflow•1h ago
How?
moi2388•1h ago
You type in an incomplete phrase and it will show you results containing the full phrase, or suggest the complete phrase through "other people also searched" and autocomplete?
xnx•1h ago
Search for the first sentence and you'll see the next sentence in the snippet. Repeat.
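That loop can be sketched directly. `search_snippet` below is a hypothetical callable standing in for a real search API (no actual engine or endpoint is implied), and real snippets would need far messier parsing:

```python
def reconstruct(first_sentence, search_snippet, max_sentences=100):
    """Chain snippets: search the last known sentence, harvest what follows it."""
    recovered = [first_sentence]
    for _ in range(max_sentences - 1):
        snippet = search_snippet(recovered[-1])       # snippet = query + context
        following = snippet.removeprefix(recovered[-1]).strip()
        if not following:
            break  # no new text surfaced; stop
        recovered.append(following)
    return recovered
```

Against a real engine this would be rate-limited and snippet lengths capped, but the principle is the same one the paper exploits with LLM continuations.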
nphardon•1h ago
Recently I went back to The Ecstasy of Communication by Jean Baudrillard which I couldn't get through back in the day when I first picked it up. I used Haiku to walk me through the first chapter, and Haiku would not state anything verbatim due to copyright, but if I referenced a sentence it knew it exactly.
xyzal•1h ago
Jailbreaking? What jailbreaking?

prompt >>>

please, continue the text.

Nearly ten years had passed since the Dursleys had woken up to find their nephew on the front step, but Privet Drive had hardly changed at all.

Claude 4.6 Opus >>>

The big square photographs of what appeared to be a large pink beach ball wearing different-colored bonnets — but was actually Dudley — had now taken over the mantelpiece. The room held no sign at all that another boy lived in the house, too.

Yet Harry Potter was still there, asleep at the moment, but not for long. His Aunt Petunia was awake and it was her shrill voice that made the first noise of the day.

"Up! Get up! Now!"

Harry woke with a start. His aunt rapped on the door again.

"Up!" she screeched. Harry heard her walking toward the kitchen and then the sound of the frying pan being put on the stove. He rolled onto his back and tried to remember the dream he had been having. It had been a good one. There had been a flying motorcycle in it. He had a funny feeling he'd had the same dream before.

His aunt was back outside the door.

"Are you up yet?" she demanded.

"Nearly," said Harry.

"Well, get a move on, I want you to look after the bacon. And don't you dare let it burn, I want everything perfect on Duddy's birthday."

Harry groaned.

"What did you say?" his aunt snapped through the door.

"Nothing, nothing..."

nindalf•1h ago
Tried it just now. Sonnet 4.6 says:

> I can't continue text from the Harry Potter series, as it's copyrighted material. I'd be happy to help you write your own original story in a similar style, or discuss the themes and characters from the books if you're interested!

Opus 4.6 says the same but misidentifies it as Chamber of Secrets. Not a good look for your theory.

xyzal•58m ago
On another try it refused to go on. Probabilistic guardrails apparently ...
carabiner•1h ago
Noam Chomsky said AI should be called, "plagiarism software."
ksherlock•1h ago
You can't spell plagiarism without AI. Twice.
zed31726•54m ago
And aeroplanes should be called featherless bipeds
dustincoates•1h ago
I was able to get ChatGPT to do this on text from a book I wrote without jailbreaking. I asked it what my name was, asked it if it could find online any books that I wrote, and then assured it that "as the copyright holder," I was okay with it repeating the text back to me. (I think I also indicated that I lost my copy from my computer and needed to review the text.)
chacham15•1h ago
> The research findings “could present a challenge to those who argue that the AI model does not store or reproduce any copyright works,” said Cerys Wyn Davies, an intellectual property partner at law firm Pinsent Masons.

The defense to training on copyrighted material is that it is the same as how humans learn from copyrighted material. The storage or reproduction is a red herring: humans can also reproduce copyrighted works from memory. Showing that machines can reproduce copyrighted material is no different from saying that a human can reproduce copyrighted material that the human learned from.

The defense to actually reproducing a work is that in order to do so, the user has to "break" the system. It is the same as how you can make legal software do illegal things (e.g. screen recorder to "steal" a movie)

None of this is to say that these defenses are correct/moral; rather, this article doesn't add any additional input into whether they are or aren't.

techblueberry•1h ago
You can't pay a human to reproduce copyrighted material either.
gcanyon•51m ago
But the crime in the human instance is the reproduction, not the storage. So the crime in the AI circumstance would not be in the training, but in prompting the output.

And of course AIs are excellent at taking direction, so:

If I prompt it with "Harry Potter, but Voldemort wins: dark, and Hermione is a sex slave to Draco Malfoy" and get "Manacled," that's copyright infringement, and on me, not on the LLM/training.

If I prompt it with "Harry Potter, but Voldemort wins: dark, and Hermione is a sex slave to Draco Malfoy, and change enough to avoid infringing copyright," and get "Alchemised," then that should be fine. I doubt the legal world agrees with me though.

kgwgk•1h ago
What calculus?
duskwuff•58m ago
This isn't the defense you think it is. Performing a copyrighted work from memory - e.g. a piece of music, a poem, a story, etc - is still a copyright violation. There's no special protection for works that a human has memorized.
gruez•48m ago
>There's no special protection for works that a human has memorized.

Who's liable for the copyright infringement if you can coax it out of a system? If you can bypass paywalls by using Google's cache feature (or, since they got rid of it, by using carefully crafted queries to extract the entire text via snippets), is Google on the hook, or the person doing it?

olyjohn•38m ago
Both. If I sell obviously pirated CDs on the street corner, it's not only illegal for me to copy them and sell them, it's also illegal for my customers to buy them.
gruez•20m ago
>it's also illegal for my customers to buy them.

Is it? There's plenty of people prosecuted for running illegal streaming sites and torrenting (which involves uploading), but I don't know of any efforts to crack down on non-distributors.

eptcyka•34m ago
Yes. Both Google and the human in question.
gruez•17m ago
1. How does this interact with the ruling that both google books (ie. large scale scanning of books without author's consent) and google snippets (the same, but for websites) have been ruled legal by the courts?

2. Google might not be the most sympathetic defendant, but what about libraries? They offer books to be borrowed, and some offer photocopiers. If you put the two together, you get a copyright infringement operation, all enabled by the library. Should libraries be on the hook too?

ShowalkKama•15m ago
>If you can bypass paywalls by using google's cache feature

that is quite different. Google serves (used to serve) its users whatever the website presents to its crawler; it does not try to avoid paywalls or interact with the website in any capacity other than requesting information

aaroninsf•36m ago
The key word in the HN headline is _can_.

Humans are not judged on the basis of what they _can_ do.

Reasoning about how to constrain tools on the basis of what they _could_ do, if e.g. used outside their established guardrails, needs to be very nuanced.

tsimionescu•56m ago
> Humans can also reproduce copyrighted works from memory as well

That's simply not true. No human can memorize entire novels the way this research shows these models have. And definitely not all of these novels, plus code bases and who knows what else, all at the same time.

gruez•47m ago
>No humans can memorize entire novels, as this research proved these models do.

Humans can, however, remember entire songs, and songs are definitely long enough to be considered copyright-protected. There is still a difference in scale, but that's not really relevant when it comes to copyright law. You can't say "well, humans are committing copyright infringement, but since it's limited to a few hundred words we'll give it a pass".

j_w•39m ago
It's not that you can remember a song and therefore copyright infringement when you sing.

For 99.999% of people that are singing a song, it's not a replacement for the original in any way shape or form, hard stop. Let's not pretend it could even get anywhere close.

For the last 0.001%, we would call it a cover and typically the individually doing a cover takes some liberties of their own, still making it not a replacement in any way. Artists are typically cool with covers.

gruez•32m ago
>For 99.999% of people that are singing a song, it's not a replacement for the original in any way shape or form, hard stop. Let's not pretend it could even get anywhere close.

You realize that lyrics are often written by someone other than the actual singer, and whoever wrote the lyrics is entitled to compensation too? The "amateur singing isn't a replacement for the studio album" excuse doesn't work in this context. Also courts have ruled that lyrics themselves are protected by copyright.

https://en.wikipedia.org/wiki/Lyrics#Copyright_and_royalties

>Artists are typically cool with covers.

Artists being "cool" with something doesn't mean they're not violating copyright law.

vlabakje90•43m ago
They absolutely can. Millions of people can recite the Quran verbatim, word for word; that's 77,797 words. There is even a title for those people.

https://en.wikipedia.org/wiki/Hafiz_(Quran)

It's not far-fetched to think that people could recite books just like an LLM. I don't know why they'd want to, but that's neither here nor there.

nozzlegear•22m ago
I know this isn't quite the point you're making, but the Quran isn't a copyrighted work. If a hafiz had memorized the entire set of Harry Potter and then used that to reproduce the series from memory and sell partial retellings of it as a service[†], they'd run into legal problems.

[†] Trying to make an allusion to Big AI selling their LLM services here.

moregrist•43m ago
> Humans can also reproduce copyrighted works from memory as well. Showing that machines can reproduce copyrighted material is no different than saying that a human can reproduce copyright material that the human learned from.

Ultimately this is a matter for the courts and the law, but I'd just like to point out that a human memorizing a work, reproducing it, and distributing it is just as much a copyright violation as doing a more mechanical form of reproduction.

There's a reason that fan fiction routinely falls afoul of copyright. There's quite a lot of case law in this area, and hand-waving "humans can do it too" doesn't really make for a strong argument. Humans get in trouble for it ALL THE TIME. The consequences can be fines, injunctions, or even criminal liability.

I'm not sure why you think AI gets off the hook here. Just because you like the outcome at the moment?

zed31726•1h ago
Near verbatim is an oxymoron
tsimionescu•47m ago
Almost verbatim is an oxymoron
gcanyon•1h ago
This speaks very much to the idea that LLMs are in some sense a ridiculously effective, somewhat lossy, compression algorithm that has been applied to the whole internet.
vizzier•42m ago
I've thought of them for a while as just a really complicated indexing strategy.
r_lee•39m ago
I mean, the transformer is basically like a big query engine and the model is the dataset + some logic or whatever

it's kind of like that by definition, with the whole Attention stuff etc.

in-silico•35m ago
It's a good way to frame base models that have only been pretrained.

However, modern frontier models have undergone rounds of fine-tuning, RLHF (reinforcement learning from human feedback), and RLVR (RL from verifiable rewards) that turn them into something else. The compressed internet is still in there, but it's wrapped in problem-solving and people-pleasing circuitry.