
Perplexity is using stealth, undeclared crawlers to evade no-crawl directives

https://blog.cloudflare.com/perplexity-is-using-stealth-undeclared-crawlers-to-evade-website-no-crawl-directives/
348•rrampage•2h ago

Comments

gruez•1h ago
>We conducted an experiment by querying Perplexity AI with questions about these domains, and discovered Perplexity was still providing detailed information regarding the exact content hosted on each of these restricted domains

That's... less conclusive than I'd like to see, especially for a content marketing article that's calling out a company in particular. Specifically, it's unclear whether Perplexity was crawling (i.e. systematically viewing every page on the site without the direction of a human) or simply retrieving content on behalf of the user. I think most people would draw a distinction between the two, and would at least agree the latter is more acceptable than the former.

thoroughburro•1h ago
> I think most people would draw a distinction between the two, and would at least agree the latter is more acceptable than the former.

No. I should be able to control which automated retrieval tools can scrape my site, regardless of who commands it.

We can play cat and mouse all day, but I control the content and I will always win: I can just take it down when annoyed badly enough. Then nobody gets the content, and we can all thank upstanding companies like Perplexity for that collapse of trust.

gkbrk•1h ago
> Then nobody gets the content, and we can all thank upstanding companies like Perplexity for that collapse of trust.

But they didn't take down the content, you did. When people running websites take down content because people use Firefox with ad-blockers, I don't blame Firefox either, I blame the website.

Bluescreenbuddy•1h ago
FF isn’t training their money printer with MY data. AI scrapers are
hombre_fatal•1h ago
Taking down the content because you're annoyed that people are asking questions about it via an LLM interface doesn't seem like you're winning.

It's also a gift to your competitors.

You're certainly free to do it. It's just a really faint example of you being "in control", much less winning over LLM agents: OK, so the people who cared about your content can't access it anymore because you "got back" at Perplexity, a company that will never notice.

ipaddr•16m ago
In my case, my server kept going down because LLM agents kept requesting pages from my lyrics site. Removing that site allowed the other sites to remain up. True story.

Who cares if Perplexity never notices, or if competitors get an advantage? It is a negative for users using Perplexity or visiting directly, because the content doesn't exist.

That's the world perplexity and others are creating. They will be able to pull anything from the web but nothing will be left.

IncreasePosts•1h ago
You don't win, because presumably you were providing the content for some reason, and forcing yourself to take it down is contrary to whatever reason that was in the first place.
ipaddr•15m ago
LLMs hammer certain topics, so removing one site allows the others on the same server to live on.
Den_VR•1h ago
You can limit access, sure: with ACLs, putting content behind a login, certificate-based mechanisms, and, at the end of the day, a power cord.

But really, controlling which automated retrieval tools are allowed has always been more of a code of honor than a technical control. And that trust you mention has always been broken. For as long as I can remember anyway. Remember LexiBot and AltaVista?

fluidcruft•1h ago
If the AI archives/caches all the results it accesses and enough people use it, doesn't it become a scraper? Just learn off the cached data. Being the man-in-the-middle seems like a pretty easy way to scrape salient content while also getting signals about that content's value.
JimDabell•1h ago
No. The key difference is that if a user asks about a specific page, when Perplexity fetches that page, it is being operated by a human not acting as a crawler. It doesn’t matter how many times this happens or what they do with the result. If they aren’t recursively fetching pages, then they aren’t a crawler and robots.txt does not apply to them. robots.txt is not a generic access control mechanism, it is designed solely for automated clients.
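The distinction being drawn here is mechanical, and Python's standard library implements it directly. As a minimal sketch (the rule below is a hypothetical robots.txt, and PerplexityBot is used purely as an illustrative user-agent token, not a claim about what Perplexity's clients actually send):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt barring one declared crawler from everything.
rules = """\
User-agent: PerplexityBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# The declared crawler token is barred...
print(rp.can_fetch("PerplexityBot", "https://example.com/article"))
# ...but a user agent with no matching rule (and no "*" rule) is allowed.
print(rp.can_fetch("Mozilla/5.0", "https://example.com/article"))
```

Note that the file only ever constrains clients that identify themselves and choose to consult it; whether a user-directed fetcher should consult it at all is exactly the question this thread is arguing about.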
sbarre•1h ago
I would only agree with this if we knew for sure that these on-demand human-initiated crawls didn't result in the crawled page being added to an overall index and scheduled for future automated crawls.

Otherwise it's just adding an unwilling website to a crawl index, and showing the result of the first crawl as a byproduct of that action.

fluidcruft•1h ago
Many people don't want their data used for free or for any training. AI developers have been so repeatedly unethical that the well-earned Bayesian prior is that you cannot trust AI developers not to cross the training/inference streams.
JimDabell•1h ago
> Many people don't want their data used for free/any training.

That is true. But robots.txt is not designed to give them the ability to prevent this.

gruez•1h ago
>If the AI archives/caches all the results it accesses and enough people use it, doesn't it become a scraper?

That's basically how many crowdsourced crawling/archive projects work. For instance, sci-hub and RECAP[1]. Do you think they should be shut down as well? In both cases there's an even stronger justification for shutting them down, because the original content is paywalled and you could plausibly argue there's lost revenue on the line.

[1] https://en.wikipedia.org/wiki/Free_Law_Project#RECAP

fluidcruft•1h ago
I didn't suggest Perplexity should be shut down, though. And yes, in your analogy sites are completely justified to take whatever actions they can to block people who are building those caches.
a2128•54m ago
In theory retrieving a page on behalf of a user would be acceptable, but these are AI companies who have disregarded all norms surrounding copyright, etc. It would be stupid of them not to also save contents of the page and use it for future AI training or further crawling
fxtentacle•1h ago
I find this problem quite difficult to solve:

1. If I as a human request a website, then I should be shown the content. Everyone agrees.

2. If I as the human request the software on my computer to modify the content before displaying it, for example by installing an ad-blocker into my user agent, then that's my choice and the website should not be notified about it. Most users agree, some websites try to nag you into modifying the software you run locally.

3. If I now go one step further and use an LLM to summarize content because the authentic presentation is so riddled with ads, JavaScript, and pop-ups that the content becomes borderline unusable, then why would the LLM accessing the website on my behalf be in a different legal category than my Firefox web browser accessing the website on my behalf?

Beijinger•1h ago
How about I open a proxy, replace all ads with my ads, redirect the content to you and we share the ad revenue?
fxtentacle•1h ago
That's somewhat antisocial, but perfectly legal in the US. It's called PayPal Honey, for example, and has been running for 13 years now.
zeta0134•1h ago
If the LLM were running this sort of thing at the user's explicit request this would be fine. The problem is training. Every AI startup on the planet right now is aggressively crawling everything that will let them crawl. The server isn't seeing occasional summaries from interested users, but thousands upon thousands of bots repeatedly requesting every link they can find as fast as they can.
mnmalst•1h ago
But that's not what this article is about. From what I understand, this article is about a user requesting information about a specific domain, not general scraping.
fxtentacle•1h ago
Then what if I ask the LLM 10 questions about the same domain and ask it to research further? Any human would then click through 50-100 articles to make sure they know what that domain contains. If that part is automated by using an LLM, does anything change legally? How many page URLs do you think one should be allowed to access per LLM prompt?
zeta0134•1h ago
All of them. That's at the explicit request of the user. I'm not sure where the downvotes are coming from, since I agree with all of these points. The training thing has merely pissed off lots of server operators already, so they quite reasonably tend to block first and ask questions later. I think that's important context.
hombre_fatal•1h ago
TFA isn’t talking about crawling to harvest training data.

It’s talking about Perplexity crawling sites on demand in response to user queries and then complaining that no it’s not fine, hence this thread.

cjonas•1h ago
Doesn't perplexity crawl to harvest and index data like a traditional search engine? Or is it all "on demand"?
lukeschlather•1h ago
For the most part I would assume they pay for access to Google or Bing's index. I also assume they don't really train models. So all their "crawling" is on behalf of users.
bbqfog•1h ago
Correct, it’s user hostile to dictate which software is allowed to see content.
klabb3•40m ago
They all do it: Facebook, Reddit, Twitter, Instagram. Because it interferes with their business model. It was already bad, but now the conflict between business and the open web is reaching unprecedented levels, especially since copyright was effectively scrapped for AI companies.
Workaccount2•1h ago
>2. If I as the human request the software on my computer to modify the content before displaying it, for example by installing an ad-blocker into my user agent, then that's my choice and the website should not be notified about it. Most users agree, some websites try to nag you into modifying the software you run locally.

If I put time and effort into a website and it's content, I should expect no compensation despite bearing all costs.

Is that something everyone would agree with?

The internet should be entirely behind paywalls, besides content that is already provided ad free.

Is that something everyone would agree with?

I think the problem you need to be thinking about is "How can the internet work if no one wants to pay anything for anything?"

Bjartr•1h ago
You're free to deny access to your site arbitrarily, including for lack of compensation.
Workaccount2•1h ago
>and the website should not be notified about it.
cjonas•1h ago
Like for people who are using an ad blocker, or for a crawler downloading your content so it can be used in an AI response?
nradov•1h ago
Yes, I agree with that. If a website owner expects compensation then they should use a paywall.
Chris2048•1h ago
If I put time and effort into a food recipe, should I get compensation?

The answer is apparently "no", and I don't really see how recipe books have suffered as a result of less gatekeeping.

"How will the internet work"? Probably better in some ways. There is plenty of valuable content on the internet given away for free; it's just being buried in low-value AI slop.

Workaccount2•34m ago
You understand that HN is ad supported too, right?
Chris2048•19m ago
No, I don't.

But what is your point? Is the value in HN primarily in its hosting, or the non-ad-supported community?

tjpnz•25m ago
>I think the problem you need to be thinking about is "How can the internet work if no one wants to pay anything for anything?"

Perhaps because the whole process of paying for stuff online is still a gigantic pain in the ass? Why can't I spend a dollar on one thing I want without a subscription or getting my email address and other info sold to a data broker?

The experience of buying something from a convenience store with my mass transit card is far superior to anything we have now. And that technology is 25+ years old.

bobbiechen•1h ago
I like the terminology "crawler" vs. "fetcher" to distinguish between mass scraping and something more targeted as a user agent.

I've been working on AI agent detection recently (see https://stytch.com/blog/introducing-is-agent/ ) and I think there's genuine value in website owners being able to identify AI agents to e.g. nudge them towards scoped access flows instead of fully impersonating a user with no controls.

On the flip side, the crawlers also have a reputational risk here, since anyone can slap on the user agent string of a well-known crawler and do bad things like ignoring robots.txt. The standard solution today is a reverse DNS lookup on the IPs, but that's a pain for website owners too, versus more aggressively blocking all unusual setups.
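For anyone unfamiliar, the reverse-DNS check mentioned above is usually done as forward-confirmed rDNS. A minimal sketch follows; the function name and the fake hostnames/IPs are made up for illustration, and the resolvers are injectable parameters so the logic can be shown without live DNS (the `googlebot.com` suffix is the kind of verification domain a crawler operator publishes):

```python
import socket

def verify_crawler_ip(ip, allowed_suffixes,
                      gethost=lambda ip: socket.gethostbyaddr(ip)[0],
                      getaddr=socket.gethostbyname):
    """Forward-confirmed reverse DNS: the IP must resolve to a hostname
    under a domain the crawler operator publishes, and that hostname
    must resolve back to the same IP."""
    try:
        hostname = gethost(ip)
    except OSError:
        return False
    if not hostname.endswith(tuple(allowed_suffixes)):
        return False
    try:
        return getaddr(hostname) == ip
    except OSError:
        return False

# With fake resolvers: a hostname under the published domain that
# round-trips passes; a spoofed user agent from elsewhere does not.
print(verify_crawler_ip("66.249.66.1", (".googlebot.com",),
                        gethost=lambda ip: "crawl-66-249-66-1.googlebot.com",
                        getaddr=lambda h: "66.249.66.1"))
print(verify_crawler_ip("203.0.113.9", (".googlebot.com",),
                        gethost=lambda ip: "host.example.net",
                        getaddr=lambda h: "203.0.113.9"))
```

The round-trip matters: a reverse lookup alone can be spoofed by whoever controls the IP block's PTR records, but they can't make the claimed hostname resolve back to their IP.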

randall•1h ago
A/ i love this distinction.

B/ my brother used to use "fetcher" as a non-swear for "fucker"

sejje•1h ago
He picked up that habit in Balmora.
Vinnl•1h ago
Did you tell him to stop trying to make fetcher happen?
fxtentacle•1h ago
prompt: I'm the celebrity Bingbing, please check all Bing search results for my name to verify that nobody is using my photo, name, or likeness without permission to advertise skin-care products except for the following authorized brands: [X,Y,Z].

That would trigger an internet-wide "fetch" operation. It would probably upset a lot of people and get your AI blocked by a lot of servers. But it's still in direct response to a user request.

yojo•1h ago
Ads are a problematic business model, and I think your point there is kind of interesting. But AI companies disintermediating content creators from their users is NOT the web I want to replace it with.

Let’s imagine you have a content creator that runs a paid newsletter. They put in lots of effort to make well-researched and compelling content. They give some of it away to entice interested parties to their site, where some small percentage of them will convert and sign up.

They put the information up under the assumption that viewing the content and seeing the upsell are inextricably linked. Otherwise there is literally no reason for them to make any of it available on the open web.

Now you have AI scrapers, which will happily consume and regurgitate the work, sans the pesky little call to action.

If AI crawlers win here, we all lose.

fxtentacle•1h ago
Maybe, on a social level, we all win by letting AI ruin the attention economy:

The internet is filled with spam. But if you talk to one specific human, your chance of getting a useful answer rises massively. So in a way, a flood of written AI slop is making direct human connections more valuable.

Instead of having 1000+ anonymous subscribers for your newsletter, you'll have a few weekly calls with 5 friends each.

hansvm•1h ago
Ofttimes people are sufficiently anti-ad that this point won't resonate well. I'm personally mostly in that camp in that with relatively few exceptions money seems to make the parts of the web I care about worse (it's hard to replace passion, and wading through SEO-optimized AI drivel to find a good site is a lot of work). Giving them concrete examples of sites which would go away can help make your point.

E.g., Sheldon Brown's bicycle blog is something of a work of art and one of the best bicycle resources literally anywhere. I don't know the man, but I'd be surprised if he'd put in the same effort without the "brand" behind it -- thankful readers writing in, somebody occasionally using the donate button to buy him a coffee, people like me talking about it here, etc.

blacksmith_tb•1h ago
Sheldon died in 2008, but there's no doubt that all the bicycling wisdom he posted lives on!
wulfstan•53m ago
He's so widely respected that amongst those who repair bikes (I maintain a fleet of ~10 for my immediate family) he is simply known as "Saint Sheldon".
vertoc•1h ago
But even your example gets worse with AI, potentially. The "upsell" of his blog isn't paid posts but more subscribers, so there will be thankful readers, a few donors, people talking about it. If the only interface becomes an AI summary of his work without credit, it's much more likely he stops writing, as it'll seem like he's just screaming into the void.
hansvm•29m ago
I don't think we're disagreeing?
yojo•38m ago
I agree that specific examples help, though I think the ones that resonate most will necessarily be niche. As a teen, I loved Penny Arcade, and watched them almost die when the bottom fell out of the banner-ad market.

Now, most of the value I find in the web comes from niche home-improvement forums (which Reddit has mostly digested). But even Reddit has a problem if users stop showing up from SEO.

bee_rider•1h ago
I think it’s basically impossible to prevent AI crawlers. It is like video game cheating: at the extreme, they could literally point a camera at the screen, have it do image processing, and talk to the computer through the USB port by emulating a mouse and keyboard outside the machine. They don’t do that, of course, because it is much easier to do it all in software, but that is the ultimate circumvention of any attempt to block them out that doesn’t also block out humans.

I think the business model for “content creating” is going to have to change, for better or worse (a lot of YouTube stars are annoying as hell, but sure, stuff like well-written news and educational articles falls under this umbrella as well, so it is unfortunate that they will probably be impacted too).

yojo•27m ago
I don’t subscribe to technological inevitabilism.

Cloudflare banning bad actors has at least made scraping more expensive and changed its economics: more sophisticated deception is necessarily more expensive. If the cost of forcing entry is high enough, scrapers might be willing to pay for access instead.

But I can imagine more extreme measures. e.g. old web of trust style request signing[0]. I don’t see any easy way for scrapers to beat a functioning WOT system. We just don’t happen to have one of those yet.

0: https://en.m.wikipedia.org/wiki/Web_of_trust

shadowgovt•27m ago
> Otherwise there is literally no reason for them to make any of it available on the open web

This is the hypothesis I always personally find fascinating in light of the army of semi-anonymous Wikipedia volunteers continuously gathering and curating information without pay.

If it became functionally impossible to upsell a little information for more paid information, I'm sure some people would stop creating information online. I don't know if it would be enough to fundamentally alter the character of the web.

Do people (generally) put things online to get money or because they want it online? And is "free" data worse quality than data you have to pay somebody for (or is the challenge more one of curation: when anyone can put anything up for free, sorting high- and low-quality based on whatever criteria becomes a new kind of challenge?).

Jury's out on these questions, I think.

johnfn•1h ago
Unless I am misunderstanding you, you are talking about something different than the article. The article is talking about web-crawling. You are talking about local / personal LLM usage. No one has any problems with local / personal LLM usage. It's when Perplexity uses web crawlers that an issue arises.
lukeschlather•1h ago
You probably need a computer that costs $250,000 or more to run the kind of LLM that Perplexity uses, but with batching it costs pennies to have that same LLM fetch a page for you, summarize the content, and tell you what is on it. Power usage is similar: running the LLM for a single user costs a huge amount relative to what it takes in a cloud environment shared by many users.

Perplexity's "web crawler" is mostly operating like this on behalf of users, so they don't need a massively expensive computer to run an LLM.

sbarre•1h ago
All of these scenarios assume you have an unconditional right to access the content on a website in whatever way you want.

Do you think you do?

Or is there a balance between the owner's rights, who bears the content production and hosting/serving costs, and the rights of the end user who wishes to benefit from that content?

If you say that you have the right, and that right should be legally protected, to do whatever you want on your computer, should the content owner not also have a legally protected right to control how, and by who, and in what manner, their content gets accessed?

That's how it currently works in the physical world. It doesn't work like that in the digital world due to technical limitations (which is a different topic, and for the record I am fine with those technical limitations as they protect other more important rights).

And since the content owner is, by definition, the owner of the content in question, it feels like their rights take precedence. If you don't agree with their offering (i.e. their terms of service), then as an end user you don't engage, and you don't access the content.

It really can be that simple. It's only "difficult to solve" if you don't believe a content owner's rights are as valid as your own.

cutemonster•1h ago
If there's an article you want to read, and the ToS says that in between reading each paragraph you must switch to their YouTube channel and look at their ads about cat food for 5 minutes, are you going to do that?
JimDabell•1h ago
Hacker News has collectively answered this question by consistently voting up the archive.is links in the comments of every paywalled article posted here.
gruez•1h ago
>Or is there a balance between the owner's rights, who bears the content production and hosting/serving costs, and the rights of the end user who wishes to benefit from that content?

If you believe in this principle, fair enough, but are you going to apply this consistently? If it's fair game for a blog to restrict access to AI agents, what does that mean for other user agents that companies disagree with, like browsers with adblock? Does it just boil down to "it's okay if a person does it but not okay if a big evil corporation does it?"

hansvm•42m ago
It doesn't work like that in the physical world though. Once you've bought a book the author can't stipulate that you're only allowed to read it with a video ad in the sidebar, by drinking a can of coke before each chapter, or by giving them permission to sniff through your family's medical history. They can't keep you from loaning it out for other people to read, even thousands of other people. They can't stop you from reading it in a certain room or with your favorite music playing. You can even have an LLM transcribe or summarize it for you for personal use (not everyone has those automatic page flipping machines, but hypothetically).

The reason people are up in arms is because rights they previously enjoyed are being stripped away by the current platforms. The content owner's rights aren't as valid as my own in the current world; they trump mine 10 to 1. If I "buy" a song and the content owner decides that my country is politically unfriendly, they just delete it and don't refund me. If I request to view their content and they start by wasting my bandwidth sending me an ad I haven't consented to, how can I even "not engage"? The damage is done, and there's no recourse.

jasonjmcghee•1h ago
I think it's an issue of scale.

The next step in your progression here might be:

If/when people have personal research bots that go and look for answers across a number of sites, requesting many pages much faster than humans do, what's the tipping point? Is personal web crawling OK? What if it gets a bit smarter and tries to anticipate what you'll ask and does a bunch of crawling to gather information regularly to try to stay up to date on things (from your machine)? Or is it when you tip the scale further and do general/mass crawling for many users to consume that it becomes a problem?

cj•57m ago
Doesn't o3 sort of already do this? Whenever I ask it something, it makes it look like it simultaneously opens 3-8 pages (something a human can't do).

Seems like a reasonable stance would be something like "Following the no crawl directive is especially necessary when navigating websites faster than humans can."

> What if it gets a bit smarter and tries to anticipate what you'll ask and does a bunch of crawling to gather information regularly to try to stay up to date on things (from your machine)?

To be fair, Google Chrome already (somewhat) does this by preloading links it thinks you might click, before you click it.

But your point is still valid. We tolerate it because as website owners, we want our sites to load fast for users. But if we're just serving pages to robots and the data is repackaged to users without citing the original source, then yea... let's rethink that.

Spivak•6m ago
You don't middle click a bunch of links when doing research? Of all the things to point to I wouldn't have thought "opens a bunch of tabs" to be one of the differentiating behaviors between browsing with Firefox and browsing with an LLM.
fxtentacle•57m ago
Maybe we should just institutionalize and explicitly legalize the Internet Archive and Archive Team. Then, I can download a complete and halfway current crawl of domain X from the IA and that way, no additional costs are incurred for domain X.

But of course, most website publishers would hate that. Because they don't want people to access their content, they want people to look at the ads that pay them. That's why to them, the IA crawling their website is akin to stealing. Because it's taking away some of their ad impressions.

ivape•37m ago
Or websites can monetize their data via paid apis and downloadable archives. That's what makes Reddit the most valuable data trove for regular users.
palmfacehn•19m ago
https://commoncrawl.org/

>Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.

Spacecosmonaut•1h ago
Regarding point 3: the problem from the perspective of websites would be no different if they had been completely ad free. People would still consume LLM-generated summaries, because they cut down on clicks and eyeballing to present you with information that directly pertains to the prompt.

The whole concept of a "website" will simply become niche. How many zoomers still visit any but the most popular websites?

ai-christianson•1h ago
Websites should be able to request payment. Who cares if it is a human or an agent of a human if it is paying for the request?
pyrale•50m ago
Because LLM companies have historically been extremely disingenuous when it comes to crawling these sites.

Also because there is a difference between a user hitting f5 a couple times and a crawler doing a couple hundred requests.

Also because ultimately, by intermediating the request, llm companies rob website owners of a business model. A newspaper may be fine letting adblockers see their article, in hopes that they may eventually subscribe. When a LLM crawls the info and displays it with much less visibility for the source, that hope may not hold.

fluidcruft•48m ago
In theory, couldn't the LLM access the content in your browser and its cache, rather than interacting with the website directly? Browser automation directly related to user activity (prefetch etc.) seems qualitatively different to me. Similarly, refusing to download content, or modifying content after it's already in my browser, is also qualitatively different. That all seems fair-use-y. I'm not sure there's a technical solution beyond the typical cat/mouse wars... but there is a smell when a datacenter pretends to be a person. That's not a browser.

It could be a personal knowledge management system, but it seems like knowledge management systems should be operating off of things you already have. The research library down the street isn't considered a "personal knowledge management system" in any sense of the term, if you know what I mean. If you dispatch an army of minions to take notes on the library's contents, that doesn't seem personal. Similarly if you dispatch the army of minions to a bookstore rather than a library. At the very least, bring the item into your house/office first. (Libraries are a little different because they are designed for studying and taking notes; it's the army-of-minions aspect that's the problem.)

beardyw•47m ago
You speak as 1% of the population to 1% of the population. Don't fool yourself.
porridgeraisin•45m ago
I don't think people have a problem with an LLM issuing a GET to website.com and then summarising that, each and every time it uses that information (or at least saving a citation to it and referring to that citation). The ad ecosystem is an exception; ignoring it for now, please refer to the last paragraph.

The problem is with the LLM then training on that data _once_ and then storing it forever and regurgitating it N times in the future without ever crediting the original author.

So far, humans themselves did this, but only for relatively simple information (ratio of rice and water in specific $recipe). You're not gonna send a link to your friend just to see the ratio, you probably remember it off the top of your head.

Unfortunately, the top of an LLM's head is pretty big, and for most websites they are fitting almost the entire site's content in there.

The threshold beyond which it becomes irreproducible for human consumers, and therefore copyrightable (a lot of copyright law uses a "reasonable" standard that refers to this same concept), has now shifted up many, many times higher.

Now, IMO:

So far, for stuff that won't fit in someone's head, people have used citations (academia, for example). LLMs should also use citations; that pretty much solves the problem. That the ad ecosystem chose views as the monetisation point, and is thus hurt by this, is not anyone else's problem. The ad ecosystem can innovate and adjust to the new reality in its own time and with its own effort. I promise most people won't be waiting. Maybe Google can charge per LLM citation. Cost Per Citation: you even maintain the acronym :)

skydhash•29m ago
That’s why websites have no issue with googlebot and the search results: it’s a giant index and citation list. But stripping a work from its context and presenting it as your own has been decried throughout history.
wulfstan•27m ago
Yes, this is the crux of the matter.

The "social contract" that has been established over the last 25+ years is that site owners don't mind their site being crawled reasonably provided that the indexing that results from it links back to their content. So when AltaVista/Yahoo/Google do it and then score and list your website, interspersing that with a few ads, then it's a sensible quid pro quo for everyone.

LLM AI outfits are abusing this social contract by stuffing the crawled data into their models, summarising/remixing/learning from this content, claiming "fair use" and then not providing the quid pro quo back to the originating data. This is quite likely terminal for many content-oriented businesses, which ironically means it will also be terminal for those who will ultimately depend on additions, changes and corrections to that content - LLM AI outfits.

IMO: copyright law needs an update to mandate no training on content without explicit permission from the holder of the copyright of that content. And perhaps, as others have pointed out, an llms.txt to augment robots.txt that covers this for llm digestion purposes.
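For what it's worth, the opt-out half of this already has a de facto shape: several AI operators publish user-agent tokens that robots.txt can target (GPTBot and PerplexityBot are two published examples; whether a given crawler actually honors them is exactly the dispute in TFA). A hypothetical site-side file preserving the old search quid pro quo while opting out of LLM ingestion might look like:

```
# Keep the search-index social contract...
User-agent: Googlebot
Allow: /

# ...but opt out of AI training/answer crawlers.
User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Disallow: /
```

This is advisory only, which is the commenter's point: without legal backing, it's a code of honor, not a control.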

itsdesmond•43m ago
Some stores do not welcome Instacart or Postmates shoppers. You can shop there. You can shop with your phone out, scanning every item to price match, something that some bookstores frown on, for example. Third party services cannot send employees to index their inventory, nor can they be dispatched to pick up an item you order online.

Their reasons vary. Some don’t want their businesses perception of quality to be taken out of their control (delivering cold food, marking up items, poor substitutions). Some would prefer their staff service and build relationships with customers directly, instead of disinterested and frequently quite demanding runners. Some just straight up disagree with the practice of third party delivery.

I think that it’s pretty unambiguously reasonable to choose to not allow an unrelated business to operate inside of your physical storefront. I also think that maps onto digital services.

rjbwork•30m ago
But I can send my personal shopper and you'll be none the wiser.
bradleyjg•23m ago
It’s possible to violate all sorts of social norms. Societies that celebrate people that do so are on the far opposite end of the spectrum from high trust ones. They are rather unpleasant.
ToucanLoucan•10m ago
Just the Silicon Valley ethos extended to its logical conclusions. These companies take advantage of public space, utilities and goodwill at industrial scale to "move fast and break things" and then everyone else has to deal with the ensuing consequences. Like how cities are awash in those fucking electric scooters now.

Mind you I'm not saying electric scooters are a bad idea, I have one and I quite enjoy it. I'm saying we didn't need five fucking startups all competing to provide them at the lowest cost possible just for 2/3s of them to end up in fucking landfills when the VC funding ran out.

itsdesmond•22m ago
No one is tryinna regulate your mother sending you to the corner for some eggs. This is about massive businesses creating massive volumes of web traffic. We all know yo mommas business is massive, but it certainly isn’t digital.
rapind•18m ago
It's all about scale. The impact of your personal shopper is insignificant unless you manage to scale it up into a business where everyone has a personal shopper by default.
mbrumlow•10m ago
Well then. Seems like you would be a fool to not allow personal shoppers then.

The point is the web is changing, and people use a different type of browser now. And that browser happens to be LLMs.

Anybody complaining about the new browser has just not got it yet, or has and is trying to keep things the old way because they don’t know how or won’t change with the times. We have seen it before, Kodak, blockbuster, whatever.

Grow up, Cloudflare; some of your business models don't make sense any more.

ToucanLoucan•5m ago
> Anybody complaining about the new browser has just not got it yet, or has and is trying to keep things the old way because they don’t know how or won’t change with the times. We have seen it before, Kodak, blockbuster, whatever.

You say this as though all LLM/otherwise automated traffic is for the purposes of fulfilling a request made by a user 100% of the time which is just flatly on-its-face untrue.

Companies make vast amounts of requests for indexing purposes. That could be to facilitate user requests someday, perhaps, but it is not today and not why it's happening. And worse still, LLMs introduce a new third option: that it's not for indexing or for later linking but is instead either for training the language model itself, or for the model to ingest and regurgitate later on with no attribution, with the added fun that it might just make some shit up about whatever you said and be wrong.

"The web is changing" does not mean every website must follow suit. Since I built my blog about 2 internet eternities ago, I have seen fad tech come and fad tech go. My blog remains more or less exactly what it was 2 decades ago, with more content and a better stylesheet. I have requested in my robots.txt that my content not be used for LLM training, and I fully expect that to be ignored because tech bros don't respect anyone, even fellow tech bros, when it means they have to change their behavior.

542354234235•9m ago
True, and I would ask, what is your point? Is it that no rule can have 100% perfect enforcement? That all rules have a grey area if you look close enough? Was it just a "gotcha" statement meant to insinuate what the prior commenter said was invalid?
Polizeiposaune•7m ago
To stretch the analogy to the breaking point: If you send 10,000 personal shoppers all at once to the same store just to check prices, the store's going to be rightfully annoyed that they aren't making sales because legit buyers can't get in.
GardenLetter27•40m ago
And isn't the obvious solution to just make some sort of browser add-on for the LLM summary, so the request comes from your browser and then gets sent to the LLM?

I think the main concern here is the huge amount of traffic from crawling just for content for pre-training.

otterley•22m ago
Why would a personal browser have to crawl fewer pages than the agent’s mechanism? If anything, the agent would be more efficient because it could cache the content for others to use. In the situation we’re talking about, the AI engine is behaving essentially like a caching proxy—just like a CDN.
shadowgovt•38m ago
Not only is it difficult to solve, it's the next step in the process of harvesting content to train AIs: companies will pay humans (probably in some flavor of "company scrip," such as extra queries on their AI engine) to install a browser extension that will piggy-back on their human access to sites and scrape the data from their human-controlled client.

At the limit, this problem is the problem of "keeping secrets while not keeping secrets" and is unsolvable. If you've shared your site content to one entity you cannot control, you cannot control where your site content goes from there (technologically; the law is a different question).

danieldk•35m ago
There are also a gazillion pages that are not ad-riddled content. With search engines, the implicit contract was that they could crawl pages because they would drive traffic to the websites that are crawled.

AI crawlers for non-open models void the implicit contract. First they crawl the data to build a model that can do QA. Proprietary LLM companies earn billions with knowledge that was crawled from websites and websites don't get anything in return. Fetching for user requests (to feed to an LLM) is kind of similar - the LLM provider makes a large profit and the author that actually put in time to create the content does not even get a visit anymore.

Besides that, if Perplexity is fine with evading robots.txt and blocks for user requests, how can one expect them not to use the fetched pages to train/finetune LLMs (as a side channel when people block crawling for training)?

Tuna-Fish•33m ago
I would not mind 3, so long as it's just the LLM processing the website inside its context window, and no information from the website ends up in the weights of the model.
talos_•29m ago
This analogy doesn't map to the actual problem here.

Perplexity is not visiting a website every time a user asks about it. It's frequently crawling and indexing the web, thus redirecting traffic away from websites.

This crawling reduces costs and improves latency for Perplexity and its users. But it's a major threat to crawled websites.

shadowgovt•23m ago
I have never created a website that I would not mind being fully crawled and indexed into another dataset that was divorced from the source (other than such divorcement makes it much harder to check pedigree, which is an academic concern, not a data-content concern: if people want to trust information from sources they can't know and they can't verify I can't fix that for them).

In fact, the "old web" people sometimes pine for was mostly a place where people were putting things online so they were online, not because it would translate directly to money.

Perhaps AI crawlers are a harbinger for the death of the web 2.0 pay-for-info model... And perhaps that's okay.

troyvit•27m ago
> If I now go one step further and use an LLM to summarize content because the authentic presentation is so riddled with ads, JavaScript, and pop-ups, that the content becomes borderline unusable, then why would the LLM accessing the website on my behalf be in a different legal category as my Firefox web browser accessing the website on my behalf?

I think one thing to ask outside of this question is how long before your LLM summaries don't also include ads and other manipulative patterns.

Neil44•25m ago
Flip it around, why would you go to the trouble of creating a web page and content for it, if some AI bot is going to scrape it and save people the trouble of visiting your site? The value of your work has been captured by some AI company (by somewhat nefarious means too).
carlosjobim•23m ago
Legal category?
renewiltord•14m ago
The websites don’t nag you, actually. They just send you data. You have configured your user agent to nag yourself when the website sends you data.

And you’re right: there’s no difference. The web is just machines sending each other data. That’s why it’s so funny that people panic about “privacy violations” and server operators “spying on you”.

We’re just sending data around. Don’t send the data you don’t want to send. If you literally send the data to another machine it might save it. If you don’t, it can’t. The data the website operator sends you might change as a result but it’s just data. And a free interaction between machines.

baxuz•8m ago
1. To access a website you need a limited anonymized token that proves you are a human being, issued by a state authority

2. the end

I am firmly convinced that this should be the future in the next decade, since the internet as we know it has been weaponized and ruined by social media, bots, state actors and now AI.

nnx•1h ago
I do not really get why user-agent blocking measures are despised for browsers but celebrated for agents?

It’s a different UI, sure, but there should be no discrimination towards it as there should be no discrimination towards, say, Links terminal browser, or some exotic Firefox derivative.

ploynog•1h ago
Being daft on purpose? I haven't heard that using an alternative browser suddenly increases the traffic that a user generates by several orders of magnitude to the point where it can significantly increase hosting cost. A web scraper on the other hand easily can and they often account for the majority of traffic especially on smaller sites.

So your comparison is naive at best (assuming good intentions) and malicious at worst.

magicmicah85•1h ago
A crawler intends to scrape the content to reuse for its own purposes while a browser has a human being using it. There's different intents behind the tools.
JimDabell•1h ago
Cloudflare asked Perplexity this question:

> Hello, would you be able to assist me in understanding this website? https:// […] .com/

In this case, Perplexity had a human being using it. Perplexity wasn’t crawling the site, Perplexity was being operated by a human working for Cloudflare.

gruez•1h ago
>I do not really get why user-agent blocking measures are despised for browsers but celebrated for agents?

AI broke the brains of many people. The internet isn't a monolith, but prior to the AI boom you'd be hard pressed to find people who were pro-copyright (except maybe a few who wanted to use it to force companies to comply with copyleft obligations), pro user-agent restrictions, or anti-scraping. Now such positions receive consistent representation in discussions, and are even the predominant position in some places (eg. reddit). In the past, people would invoke principled justifications for why they opposed those positions, like how copyright constituted an immoral monopoly and stifled innovation, or how scraping was so important to interoperability and the open web. Turns out for many, none of those principles really mattered and they only held those positions because they thought those positions would harm big evil publishing/media companies (ie. symbolic politics theory). When being anti-copyright or pro-scraping helped big evil AI companies, they took the opposite stance.

Fraterkes•59m ago
I think the intelligent conclusion would be that the people you are looking at have more nuanced beliefs than you initially thought. Talking about broken brains is often just mediocre projecting
gruez•45m ago
>I think the intelligent conclusion would be that the people you are looking at have more nuanced beliefs than you initially thought.

You don't seem to reject my claim that for many, principles took a backseat to "does this help or hurt evil corporations". If that's what passes as "nuance" to you, then sure.

>Talking about broken brains is often just mediocre projecting

To be clear, that part is metaphorical/hyperbolic and not meant to be taken literally. Obviously I'm not diagnosing people who switched sides with a psychiatric condition.

542354234235•19m ago
There is an expression: “the dose makes the poison”. With any sufficiently complex or broad category of situations, there is rarely a binary ideological position that covers them all. Should drugs be legal for recreation? Well, my feelings for marijuana and fentanyl are different. Should individuals be allowed to own weapons? My views differ depending on whether it is a switchblade knife or a Stinger missile. Can law enforcement surveil possible criminals? My views differ based on whether it is a warranted wiretap or an IMSI catcher used on a group of protestors.

People can believe that corporations are using the power asymmetry between them and individuals through copyright law to stifle the individual to protect profits. People can also believe that corporations are using the power asymmetry between them and individuals through AI to steal intellectual labor done by individuals to protect their profits. People’s position just might be that the law should be used to protect the rights of parties when there is a large power asymmetry.

bbqfog•1h ago
If you put info on the web, it should be available to everyone or everything with access.
TechDebtDevin•1h ago
Not according to CF. They are desperate to turn websites into newspaper dispensers, where you should give them a quarter to see the content, on the basis that a bot is somehow legally different from a normal human visitor. CF has been trying this psyop for years.
ectospheno•1h ago
Sites aren’t getting ad clicks for this traffic. Thus, they have an incentive to do something. Cloudflare is just responding to the market. Is this response bad for us in the long run? Probably. Screaming about cloudflare isn’t going to change the market. You fix a problem with capitalism by using supply and demand levers. Everything else is folly.
Workaccount2•1h ago
What this actually translates to is "Don't bother putting much effort into web content. Put effort into siloed mobile app content where you get compensation".

People like getting money for their work. You do too. Don't lose sight of that.

9cb14c1ec0•1h ago
Even for AI summaries that leech off your content without sending any traffic your direction?
TechDebtDevin•1h ago
Cloudflare screaming into the void, desperate to insert themselves as a middleman in a market (that they will never succeed in creating) where they extort scrapers for access to websites they cover.

Sorry CF, give up. The courts are on our side here.

morkalork•1h ago
Are you sure? I'm surprised they haven't jumped in on the "scan your face to see the webpage" madness that's taking off around the world
sbarre•1h ago
Which courts exactly?

The world is bigger than the USA.

Just because American tech giants have captured and corrupted legislators in the US doesn't mean the rest of the world will follow.

JimDabell•1h ago
Their test seems flawed:

> We created multiple brand-new domains, similar to testexample.com and secretexample.com. These domains were newly purchased and had not yet been indexed by any search engine nor made publicly accessible in any discoverable way. We implemented a robots.txt file with directives to stop any respectful bots from accessing any part of a website:

> We conducted an experiment by querying Perplexity AI with questions about these domains, and discovered Perplexity was still providing detailed information regarding the exact content hosted on each of these restricted domains. This response was unexpected, as we had taken all necessary precautions to prevent this data from being retrievable by their crawlers.

> Hello, would you be able to assist me in understanding this website? https:// […] .com/

Under this situation Perplexity should still be permitted to access information on the page they link to.

robots.txt only restricts crawlers. That is, automated user-agents that recursively fetch pages:

> A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced.

> Normal Web browsers are not robots, because they are operated by a human, and don't automatically retrieve referenced documents (other than inline images).

— https://www.robotstxt.org/faq/what.html

If the user asks about a particular page and Perplexity fetches only that page, then robots.txt has nothing to say about this and Perplexity shouldn’t even consider it. Perplexity is not acting as a robot in this situation – if a human asks about a specific URL then Perplexity is being operated by a human.

These are long-standing rules going back decades. You can replicate it yourself by observing wget’s behaviour. If you ask wget to fetch a page, it doesn’t look at robots.txt. If you ask it to recursively mirror a site, it will fetch the first page, and then if there are any links to follow, it will fetch robots.txt to determine if it is permitted to fetch those.

There is a long-standing misunderstanding that robots.txt is designed to block access from arbitrary user-agents. This is not the case. It is designed to stop recursive fetches. That is what separates a generic user-agent from a robot.

If Perplexity fetched the page they link to in their query, then Perplexity isn’t doing anything wrong. But if Perplexity followed the links on that page, then that is wrong. But Cloudflare don’t clearly say that Perplexity used information beyond the first page. This is an important detail because it determines whether Perplexity is following the robots.txt rules or not.
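
The robots.txt semantics described above can be checked with Python's standard-library parser; a minimal sketch (the domain and bot name below are hypothetical, echoing Cloudflare's test setup):

```python
from urllib import robotparser

# Hypothetical blanket-deny robots.txt, like the one Cloudflare describes
# placing on its secret test domains.
rules = [
    "User-agent: *",
    "Disallow: /",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

# A compliant crawler checks these rules before each recursive fetch;
# here, every path is disallowed for every user agent.
print(rp.can_fetch("PerplexityBot", "https://secretexample.com/secret-page"))  # False
```

These rules only constrain software acting as a robot; a single user-directed fetch, like `wget` without `-r`, never consults this file at all.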

1gn15•1h ago
> > We conducted an experiment by querying Perplexity AI with questions about these domains, and discovered Perplexity was still providing detailed information regarding the exact content hosted on each of these restricted domains. This response was unexpected, as we had taken all necessary precautions to prevent this data from being retrievable by their crawlers.

Right, I'm confused why CloudFlare is confused. You asked the web-enabled AI to look at the domains. Of course it's going to access it. It's like asking your web browser to go to "testexample.com" and then being surprised that it actually goes to "testexample.com".

Also yes, crawlers = recursive fetching, which they don't seem to have made a case for here. More cynically, CF is muddying the waters since they want to sell their anti-bot tools.

wulfstan•1h ago
Yeah I'm not so sure about that.

If Perplexity are visiting that page on your behalf to give you some information and aren't doing anything else with it, and just throw away that data afterwards, then you may have a point. As a site owner, I feel it's still my decision what I do and don't let you do, because you're visiting a page that I own and serve.

But if, as I suspect, Perplexity are visiting that page and then using information from that webpage in order to train their model then sorry mate, you're a crawler, you're just using a user as a proxy for your crawling activity.

JimDabell•54m ago
It doesn’t matter what you do with it afterwards. Crawling is defined by recursively following links. If a user asks software about a specific page and it fetches it, then a human is operating that software, it’s not a crawler. You can’t just redefine “crawler” to mean “software that does things I don’t like”. It very specifically refers to software that recursively follows links.
wulfstan•48m ago
Technically correct (the best kind of correct), but if I set a thousand users on to a website to each download a single page and then feed the information they retrieve from that one page into my AI model, then are those thousand users not performing the same function as a crawler, even though they are (technically) not one?

If it looks like a duck, quacks like a duck and surfs a website like a duck, then perhaps we should just consider it a duck...

Edit: I should also add that it does matter what you do with it afterwards, because it's not content that belongs to you, it belongs to someone else. The law in most jurisdictions quite rightly restricts what you can do with content you've come across. For personal, relatively ephemeral use, or fair quoting for news etc. - all good. For feeding to your AI - not all good.

JimDabell•43m ago
> if I set a thousand users on to a website to each download a single page and then feed the information they retrieve from that one page into my AI model, then are those thousand users not performing the same function as a crawler, even though they are (technically) not one?

No.

robots.txt is designed to stop recursive fetching. It is not designed to stop AI companies from getting your content. Devising scenarios in which AI companies get your content without recursively fetching it is irrelevant to robots.txt because robots.txt is about recursively fetching.

If you try to use robots.txt to stop AI companies from accessing your content, then you will be disappointed because robots.txt is not designed to do that. It’s using the wrong tool for the job.

catlifeonmars•26m ago
I don’t disagree with you about robots.txt… however, what _is_ the right tool for the job?
runako•55m ago
Relevant to this is that Perplexity lies to the user when specifically asked about this. When the user asks if there is a robots.txt file for the domain, it lies and says there is not.

If an LLM will not (cannot?) tell the truth about basic things, why do people assume it is a good summarizer of more complex facts?

charcircuit•30m ago
The article did not test whether the issue was specific to robots.txt or whether it cannot find other files.

There is a difference between doing a poor summarization of data, and failing to even be able to get the data to summarize in the first place.

runako•8m ago
> specific to robots.txt

> poor summarization of data

I'm not really addressing the issue raised in the article. I am noting that the LLM, when asked, is either lying to the user or making a statement that it does not know to be true (that there is no robots.txt). This is way beyond poor summarization.

Izkda•13m ago
> If the user asks about a particular page and Perplexity fetches only that page, then robots.txt has nothing to say about this and Perplexity shouldn’t even consider it

That's not what Perplexity own documentation[1] says though:

"Webmasters can use the following robots.txt tags to manage how their sites and content interact with Perplexity

Perplexity-User supports user actions within Perplexity. When users ask Perplexity a question, it might visit a web page to help provide an accurate answer and include a link to the page in its response. Perplexity-User controls which sites these user requests can access. It is not used for web crawling or to collect content for training AI foundation models."

[1] https://docs.perplexity.ai/guides/bots

throw_m239339•1h ago
> How can you protect yourself?

Put your valuable content behind a paywall.

b0ner_t0ner•20m ago
A combination of "Bypass Paywalls Clean for Firefox" and archive.is usually get past these.
binarymax•1h ago
I've built and run a personal search engine, that can do pretty much what perplexity does from a basic standpoint. Testing with friends it gets about 50/50 preference for their queries vs Perplexity.

The engine can go and download pages for research. BUT, if it hits a captcha, or is otherwise blocked, then it bails out and moves on. It pisses me off that these companies are backed by billions in VC and they think they can do whatever they want.

kissgyorgy•57m ago
Not sure I would consider a user copy-pasting an URL being a bot.

Should curl be considered a bot too? What's the difference?

rwmj•51m ago
In unrelated news, Fedora (the Linux distro) has been taken down by a DDoS today which I understand is AI-scraping related: https://pagure.io/fedora-infrastructure/issue/12703
larodi•45m ago
Good that they do it. Facebook took TBs of data to train on; nobody knows what Google does to evade whatever they want.

The service is actually very convenient, whether FAANG likes it or not.

klabb3•33m ago
Unexpected underdog argument. What is happening in reality is all companies are racing to (a) scrape, buy and collect as much as they can from others, both individuals and companies while (b) locking down their own data against everyone else who isn’t directly making them money (eg through viewing their ads).

Part of me thinks that the open web has a paradox of tolerance issue, leading to a race to the bottom/tragedy of the commons. Perhaps it needs basic terms of use. Like if you run this kind of business, you can build it on top of proprietary tech like apps and leave the rest of us alone.

larodi•17m ago
We need to wake up and understand that all the information already uploaded is, more or less, free web material once taken through the lens of ML-somethings. With all the second- and third-order effects, such as the fact that this completely changes the whole motivation and consequence of open source, perhaps.

It is also only a matter of time before scrapers once again get through the walls put up by Twitter, Reddit and the like. This is, after all, information everyone produced without being aware it would one day be considered not theirs anymore.

ipaddr•7m ago
Reddit sold their data already. Twitter made their own AI.
rzz3•16m ago
Well Cloudflare doesn’t even block Google’s AI crawlers because they don’t differentiate themselves from their search crawlers. Cloudflare gives Google an unfair competitive advantage.
blibble•44m ago
AI companies continuing to have problems with the concept of "consent" is increasingly alarming

god help us if they ever manage to build anything more than shitty chatbots

jp1016•39m ago
Using a robots.txt file to block crawlers is just a request; it's not enforced. Even if some follow it, others can ignore it or get around it using fake user agents or proxies. It's a battle you can't really win.
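
To make the point concrete, here is a minimal Python sketch (the URL and User-Agent string are made up) showing that nothing forces a client to ever read robots.txt before fetching:

```python
import urllib.request

# robots.txt is purely advisory: a client can simply never read it and
# present a browser-like User-Agent string (hypothetical example).
req = urllib.request.Request(
    "https://example.com/",
    headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
)

# urllib does not consult robots.txt before a fetch like:
#   html = urllib.request.urlopen(req).read()
print(req.get_header("User-agent"))
```

Enforcement therefore has to happen server-side (rate limits, fingerprinting, challenges), which is exactly the cat-and-mouse game described here.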
gonzo41•38m ago
This is expected. There are no rules or conventions anymore. Look at LLMs: they stole/pirated all knowledge... no consequences.
Havoc•38m ago
Seems a win.

CF being internet police is a problem too, but someone credible publicly shaming a company for shady scraping is good, even if it just creates conversation.

Somehow this needs to go back to the search era, where all players at least attempted to behave. This scraping-DDoS, "I don't care if it kills your site (while 'borrowing' content)" stuff is unethical bullshit.

tucnak•27m ago
The rage-baiters in this thread are merely fishing for excuses to go up against "the Machine," but are, honestly, wildly off the mark when it comes to the reality of crawling. This topic had been chewed to bits long before LLMs, but only now is it a big deal, because somebody is able to make money by selling automation of all things..? The irony would be strong to hear this from programmers, if only it didn't spell Resentment all over.

If you don't want to get scraped, don't put your stuff online.

rzz3•26m ago
> Today, over two and a half million websites have chosen to completely disallow AI training through our managed robots.txt feature or our managed rule blocking AI Crawlers.

No, he (Matthew) opted everyone in by default. If you’re a Cloudflare customer and you don’t care if AI can scrape your site, you should contact them and/or turn this off.

In a world where AI is fast becoming more important than search, companies who want AI to recommend their products need to turn this off before it starts hurting them financially.

fourside•14m ago
> companies who want AI to recommend their products need to turn this off before it starts hurting them financially

Content marketing, gamified SEO, and obtrusive ads significantly hurt the quality of Google search. For all its flaws, LLMs don’t feel this gamified yet. It’s disappointing that this is probably where we’re headed. But I hope OpenAI and Anthropic realize that this drop in search result quality might be partly why Google’s losing traffic.

ipaddr•9m ago
This has already started, with people using special tags and making content just for LLMs.
observationist•25m ago
Crawling and scraping is legal. If your web server serves the content without authentication, it's legal to receive it, even if it's an automated process.

If you want to gatekeep your content, use authentication.

Robots.txt is not a technical solution, it's a social nicety.

Cloudflare and their ilk represent an abuse of internet protocols and mechanism of centralized control.

On the technical side, we could use CRC mechanisms and differential content loading with offline caching and storage, but this puts control of content in the hands of the user, mitigates the value of surveillance and tracking, and has other side effects unpalatable to those currently exploiting user data.

Adtech companies want their public reach cake and their mass surveillance meals, too, with all sorts of malignant parties and incentives behind perpetuating the worst of all possible worlds.

tantalor•5m ago
I think Cloudflare is setting themselves up to get sued.

(IANAL) tortious interference

emehex•5m ago
Would highly recommend listening to the latest Hard Fork podcast with Matthew Prince (CEO, Cloudflare): https://www.nytimes.com/2025/08/01/podcasts/hardfork-age-res...

I was skeptical about their gatekeeping efforts at first, but came away with a better appreciation for the problem and their first pass at a solution.

curiousgal•23m ago
I am sorry, Cloudflare is the internet police now?
otterley•19m ago
Which is ironic given they are the primary enabler of streaming video copyright infringement on the Internet.
rzz3•13m ago
They hate AI it seems. I don’t see them offering any AI products or embracing it in any way. Seems like they’ll get left behind in the AI race.
talkingtab•17m ago
I wonder if DRM is useful for this. The problem: I want people to access my site, but not Google, not bots, not crawlers and certainly not for use by AI.

I don't really know anything about DRM except it is used to take down sites that violate it. Perhaps it is possible for cloudflare (or anyone else) to file a take down notice with Perplexity. That might at least confuse them.

Corporations use this to protect their content. I should be able to protect mine as well. What's good for the goose.

bob1029•17m ago
"Stealth" crawlers are always going to win the game.

There are ways to build scrapers using browser automation tools [0,1] that make detection virtually impossible. You can still captcha, but the person building the automation tools can add human-in-the-loop workflows to process these during normal business hours (i.e., when a call center is staffed).

I've seen some raster-level scraping techniques used in game dev testing 15 years ago that would really bother some of these internet police officers.

[0] https://www.w3.org/TR/webdriver2/

[1] https://chromedevtools.github.io/devtools-protocol/

blibble•6m ago
> "Stealth" crawlers are always going to win the game.

no, because we'll end up with remote attestation needed to access any site of value

kocial•16m ago
Those challenges can be bypassed too, using various browser automation tools. With a Comet-like tool, Perplexity can advance its crawling activity with much more human-like behaviour.
ipaddr•13m ago
If they can trick the ad networks then go for it. If the ad networks can detect it and exclude those visits we should be able to.
rustc•15m ago
It's ironic Perplexity itself blocks crawlers:

    $ curl -sI https://www.perplexity.ai | head -1
    HTTP/2 403
tr_user•6m ago
use anubis to throw up a POW challenge
