Writing an operating system kernel from scratch

https://popovicu.com/posts/writing-an-operating-system-kernel-from-scratch/
100•Bogdanp•2h ago•16 comments

Models of European metro stations

http://stations.albertguillaumes.cat/
559•tcumulus•11h ago•112 comments

Why We Spiral

https://behavioralscientist.org/why-we-spiral/
61•gmays•3h ago•20 comments

You're a Slow Thinker. Now What?

https://chillphysicsenjoyer.substack.com/p/youre-a-slow-thinker-now-what
43•sebg•3d ago•9 comments

Bank of Thailand freezes 3M accounts, sets daily transfer limits to curb fraud

https://www.thaienquirer.com/57752/bot-freezes-3-million-accounts-sets-daily-transfer-limits-of-5...
132•walterbell•3h ago•106 comments

Observable Notebooks Data Loaders

https://observablehq.com/notebook-kit/data-loaders
40•mbostock•4d ago•6 comments

Nicu's test website made with SVG (2007)

https://svg.nicubunu.ro/
101•caminanteblanco•3h ago•66 comments

CorentinJ: Real-Time Voice Cloning (2021)

https://github.com/CorentinJ/Real-Time-Voice-Cloning
64•redbell•7h ago•16 comments

Repetitive negative thinking associated with cognitive decline in older adults

https://bmcpsychiatry.biomedcentral.com/articles/10.1186/s12888-025-06815-2
91•redbell•6h ago•59 comments

Geedge and MESA leak: Analyzing the great firewall’s largest document leak

https://gfw.report/blog/geedge_and_mesa_leak/en/
334•yourapostasy•1d ago•95 comments

Read to Forget

https://mo42.bearblog.dev/read-to-forget/
47•diymaker•5h ago•16 comments

SpikingBrain 7B – More efficient than classic LLMs

https://github.com/BICLab/SpikingBrain-7B
109•somethingsome•12h ago•30 comments

Fukushima Insects Tested for Cognition

https://news.cnrs.fr/articles/fukushima-insects-tested-for-cognition
81•nis0s•7h ago•52 comments

A single, 'naked' black hole confounds theories of the young cosmos

https://www.quantamagazine.org/a-single-naked-black-hole-rewrites-the-history-of-the-universe-202...
138•pykello•14h ago•55 comments

macOS Tahoe is certified Unix 03 [pdf]

https://www.opengroup.org/openbrand/certificates/1223p.pdf
138•john_alan•7h ago•132 comments

Refurb Weekend: Silicon Graphics Indigo² Impact 10000

http://oldvcr.blogspot.com/2025/09/refurb-weekend-silicon-graphics-indigo.html
132•Bogdanp•12h ago•47 comments

Show HN: A store that generates products from anything you type in search

https://anycrap.shop/
992•kafked•1d ago•296 comments

Two Slice, a font that's only 2px tall

https://joefatula.com/twoslice.html
441•JdeBP•18h ago•108 comments

Introduction to GrapheneOS

https://dataswamp.org/~solene/2025-01-12-intro-to-grapheneos.html
76•renehsz•4d ago•55 comments

The PC was never a true 'IBMer'

https://thechipletter.substack.com/p/the-pc-was-never-a-true-ibmer
50•klelatti•9h ago•38 comments

MIT-MC CP/M archive files, 1979-1984

https://github.com/MITDDC/cpmarchive-1979-1984
46•elvis70•2d ago•1 comments

High Altitude Living – 8,000 ft and above (2021)

https://studioq.com/blog/2021/5/30/high-altitude-living-8000-ft-and-above-2450-meters
65•walterbell•15h ago•51 comments

Pass: Unix Password Manager

https://www.passwordstore.org/
284•Bogdanp•19h ago•148 comments

Gemini (2023)

https://geminiquickst.art/
59•jhanschoo•9h ago•26 comments

Dynamic Bird Migration Map

https://explorer.audubon.org/explore/species?sidebar=expand
71•skadamat•4d ago•9 comments

The Socratic Journal Method: A Simple Journaling Method That Works

https://mindthenerd.com/the-socratic-journal-method-a-simple-journaling-method-that-actually-works/
159•surprisetalk•4d ago•68 comments

Will AI be the basis of many future industrial fortunes, or a net loser?

https://joincolossus.com/article/ai-will-not-make-you-rich/
195•saucymew•20h ago•282 comments

The unreasonable effectiveness of modern sort algorithms

https://github.com/Voultapher/sort-research-rs/blob/main/writeup/unreasonable/text.md
116•Voultapher•3d ago•34 comments

How the restoration of ancient Babylon is drawing tourists back to Iraq

https://www.theartnewspaper.com/2025/09/12/how-the-restoration-of-ancient-babylon-is-helping-to-d...
101•leoh•17h ago•50 comments

AMD’s RDNA4 GPU architecture

https://chipsandcheese.com/p/amds-rdna4-gpu-architecture-at-hot
150•rbanffy•21h ago•35 comments

The AI-Scraping Free-for-All Is Coming to an End

https://nymag.com/intelligencer/article/ai-scraping-free-for-all-by-openai-google-meta-ending.html
34•geox•3h ago

Comments

WaltPurvis•3h ago
http://archive.today/SqPCL
jmkni•1h ago
It is a bit ironic that a paywalled article like this will have a top level comment with the archive link, which can then be easily scraped by AI (along with the comments)
tenuousemphasis•1h ago
It's not ironic at all. The only reason the anti-paywall sites work is that the news companies in fact want some scrapers reading the full article.
mschuster91•1h ago
Actually, the team behind archive dot today has premium accounts on at least spiegel.de, I presume bought with anonymous credit cards.

You can see artifacts when their servers are at queue load and you see the URLs, a few resources have the JWT with the account details in the URL. IIRC the clearname of the account in the token is Masha Rabinovich, with an email account masha@dns.li, an identity that has cropped up in various investigations [1][2].

[1] https://gyrovague.com/2023/08/05/archive-today-on-the-trail-...

[2] https://webapps.stackexchange.com/questions/145817/who-owns-...

ec109685•1h ago
Also interesting how sites like this are mainstream whereas a link to a site hosting an mp3 of pirated music wouldn’t be tolerated in discussion forums like this.

I think a big difference is that there are no microtransactions or compulsory licensing for content, so it always feels patently unfair to buy a subscription to read one article.

1gn15•3h ago
Biased TL;DR: Reddit (notable for having a high stock value from their "selling data" business [1]), Medium, Quora, and Cloudflare competitor Fastly created a standard to restrict what the reader can do with the data users created, called Really Simple Licensing (RSL). Basically robots.txt but with more details, notably with details on how much you should pay Reddit/Medium/Quora.

While this likely has no legal weight (except for EU TDM for commercial use, where the law does take into account opt-outs), they are betting on using services like Cloudflare and Fastly to enforce this.

[1] https://www.investors.com/research/the-new-america/reddit-st...
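For readers unfamiliar with the robots.txt baseline that RSL is being compared to here, a minimal sketch using Python's standard `urllib.robotparser`. The bot name `ExampleAIBot`, the paths, and the `example.com` URLs are made up for illustration, and this shows plain robots.txt only, not RSL syntax. Nothing in it enforces anything: a crawler has to choose to run a check like this.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the bot name and paths are assumptions for this sketch.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text directly instead of fetching it over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A well-behaved crawler asks before fetching; a badly behaved one never calls this.
ai_bot_allowed = parser.can_fetch("ExampleAIBot", "https://example.com/post/1")
browser_allowed = parser.can_fetch("SomeBrowser", "https://example.com/post/1")
private_allowed = parser.can_fetch("SomeBrowser", "https://example.com/private/x")

print(ai_bot_allowed, browser_allowed, private_allowed)
```

The only decision available in this format is allow or disallow per user agent and path, which is why a proposal that also wants to state payment terms needs more than robots.txt.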

isodev•2h ago
In other words, a lightweight form of DRM. Here come the reasons why we shouldn’t all deploy Cloudflare and similar as gatekeepers to the web.

Is there even one example of a “tech mega corp” that has grown to control more than 1/5 of its market without this circling back to hurt people in some way? A single example?

PhantomHour•2h ago
> While this likely has no legal weight

I wouldn't be quite so sure about that. The AI industry has entirely relied on 'move fast and break things' and 'old fart judges who don't understand the tech' as their legal strategy.

The idea that AI training is fair use isn't so obvious, and quite frankly is entirely ridiculous in a world where AI companies pay for the data. If it's not fair use to take reddit's data, it's not fair use to take mine either.

On a technological level the difference to prior ML is straightforward: A classical classifier system is simply incapable of emitting any copyrighted work it was trained on. The very architecture of the system guarantees it to produce new information derived from the training data rather than the training data itself.

LLMs and similar generative AI do not have that safeguard. To be practically useful they have to be capable of emitting facts from training data, but have no architectural mechanism to separate facts from expressions. For them to be capable of emitting facts they must also be capable of emitting expressions, and thus, copyright violation.

Add in how GenAI tends to directly compete with the market of the works used as training data in ways that prior "fair use" systems did not and things become sketchy quickly.

Every major AI company knows this, as they have rushed to implement copyright filtering systems once people started pointing out instances of copyrighted expressions being reproduced by AI systems. (There are technical reasons why this isn't a very good solution to curtail copyright infringement by AI)

Observe how all the major copyright victories amount to judges dismissing cases on grounds of "Well you don't have an example specific to your work" rather than addressing whether such uses are acceptable as a collective whole.

janalsncm•1h ago
> The very architecture of the system guarantees it to produce new information derived from the training data rather than the training data itself

A “classical” classifier can regurgitate its training data as well. It’s just that Reddit never seemed to care about people training e.g. sentiment classifiers on their data before.

In fact a “decoder” is simply autoregressive token classification.
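A toy sketch of that last point, under loud assumptions: a bigram counter stands in for a neural next-token classifier, and the one-sentence corpus is invented. It is not how any production LLM is built; it only shows that "decoding" can be written as classification applied in a loop, and that such a loop can hand back its training text verbatim.

```python
from collections import Counter, defaultdict


def train_bigram_classifier(corpus):
    """Count next-token frequencies per previous token (a toy 'classifier')."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def classify_next(counts, context):
    """Classify the context into one next-token class (most frequent successor)."""
    candidates = counts[context[-1]]
    if not candidates:
        return "</s>"  # unseen context: stop rather than guess
    return candidates.most_common(1)[0][0]


def decode(counts, max_len=20):
    """Autoregressive decoding: feed each predicted token back in as context."""
    tokens = ["<s>"]
    while len(tokens) < max_len:
        nxt = classify_next(counts, tokens)
        if nxt == "</s>":
            break
        tokens.append(nxt)
    return " ".join(tokens[1:])


training_text = "the quick brown fox jumps"
model = train_bigram_classifier([training_text])
generated = decode(model)
print(generated)
```

With a single training sentence the greedy loop reproduces it exactly; with more data the output becomes a blend, but the mechanism is still one classification per step.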

orangecat•1h ago
> 'old fart judges who don't understand the tech'

If this is intended to refer to Judge Alsup, it is extremely wrong.

visarga•1h ago
> but have no architectural mechanism to separate facts from expressions

Sure they do. Every time a bot searches, reads your site and formulates an answer, it does not replicate your expression. First of all, it compares across 20 to 100 sources. Second, it only reports what is related to the user query. And third, it uses its own expression. It's more like asking a friend who read those articles and getting an answer.

LLMs' ability to separate facts from expression is quite well developed, maybe their strongest skill. They can translate, paraphrase, summarize, or reword forever.

luckylion•2h ago
Does that have any implications for liability for content? They're no longer just a provider, they are re-licensing and marketing content. Are they losing protection?
ec109685•1h ago
It’s surprising Reddit doesn’t get pushback for reselling their users’ content.

The right thing would be for the end users to receive the compensation Reddit is getting from AI companies.

deadbabe•2h ago
Just ladder kicking at this point.
jsnell•2h ago
The headline seems pretty aspirational.

The licensing standard they're talking about will achieve nothing.

Anti-bot companies selling scraping protections will run out of runway: there's a limited set of signals, and none of them are robust. As the signals get used, they're also getting burned. And it's politically impossible to expand the web platform to have robust counter-abuse capabilities.

Putting the content behind a login wall can work for large sites, but not small ones.

The free-for-all will not end until adversarial scraping becomes illegal.

carlosjobim•2h ago
> Putting the content behind a login wall can work for large sites, but not small ones.

Syndication is the answer. Small artists are on Spotify, small video makers are on YouTube.

salawat•1h ago
Yes. Conglomeration and centralization. More, more, more!

See the problem?

atm3ga•1h ago
As AI companies like Perplexity introduce AI-enabled browsers like Comet, they will scrape websites through the interaction of end users with whatever site they are using. Therefore, anti-bot companies are indeed running out of runway.
thelittleone•1h ago
Wow, hadn't even considered this... so say I have a members-only section of my site where I share high-value content, one of the members browses using Comet, and that scrapes the private content and sends it to Perplexity?
lupire•1h ago
This also happens with covert botnets running secretly on user machines.
ec109685•1h ago
The way Comet browses the web is weird enough that it’s easily detectable.
gdulli•53m ago
Did you stop getting non-compliant spam when that became illegal?
aaaggg•1h ago
L - wish they'd stop posting articles that are paywalled...
janalsncm•1h ago
> There was for years an experimental period, when ethical and legal considerations about where and how to acquire training data for hungry experimental models were treated as afterthoughts.

Those things were afterthoughts because for the most part the experimental methods sucked compared to the real thing. If we were in mid 2016 and your LSTM was barely stringing together coherent sentences, it was a curiosity but not a serious competitor to StackOverflow.

I say this not because I don’t think law/ethics are important in the abstract, but because they only became relevant after significant technological improvement.

Zigurd•49m ago
Sites containing original content will adopt active measures against LLM scraper bots. Unlike search indexing bots, there's much less upside to allowing scraping for LLM training material. Openly adversarial actions like serving up poisoned text that would induce LLMs to hallucinate are much more defensible.