Not blocking violent, bad-actor scrapers is dumb. Letting through bad-actor scrapers because a bunch of rich people want to make it the norm is dumb.
LLMs are not directing traffic to the sites, and that is the tradeoff that site owners allow with Googlebot. Even if Perplexity or Claude provides a source, the LLM user is most likely not asking for it or clicking on it 99% of the time.
We just had the article about how AI search is leading to fewer clicks, so where is that supposed "pipeline"?
It also completely ignores that you may not want your information misrepresented (basically lied about) to the user, with a helpful link telling them where the source is that they may never click through. Worse, if they realize that the information given to them is wrong, they may assume it was because your site was wrong and trust you less, all without ever clicking that link.
That is not my job nor is it my goal. These companies are taking my work, repurposing it, and selling it under the assumption that because they can access it they can sell it.
Maybe the OP should leave their house door open so people can come in and use their couch. The new game in town is to let other people use your couch.
The mental gymnastics in this post qualify for the Special Olympics.
[1]: https://www.pewresearch.org/short-reads/2025/07/22/google-us...
I know that Medium, Substack and the other "publication" platforms (like LinkedIn) are trying to commodify even the act of writing into purely a form of marketing (either for a product, or for your personal brand), but not everyone gave up just yet.
Would removing your website from google search results cause people to go directly to your website?
-Someone, somewhere, eventually
Yes, as a user I'd like everything served to me on a silver platter, for free, on demand, and completely and 100% aligned with my interests exclusively with no thought given to anybody else... but that's not a realistic world. In the real world, if the content providers have no reason to provide content, they won't.
I kind of hate the connotations of "content provider", that neutral term that implies that it is all "content" that can just be measured in megabytes or something, but I mean the full richness here of the term, individual producers, small businesses, big business, everybody. Even my personal site, if I'm not getting something out of it, however intangible it may be, I wouldn't do it. I'd be mighty pissed if I lose a job someday because I get accused of just spewing out LLM content that the LLM can only spew out because of my own original ideas/formulation of ideas being on the internet.
If there was a paid-only search engine with dubious ethics practices that was overwhelming my site with traffic in order to resell search trained off of (among other things) my personally generated content, I would absolutely block it.
LLMs are not search engines, and I'm not gaining any followers or customers in any meaningful way because an LLM indexes my site.
> it also cuts you off from the fastest-growing distribution channel on the web.
I haven't seen the needle tip at all in my acquisition channels from LLMs. Unless you're a household name or very large, LLMs aren't going to shill for your business.
> most LLMs have an agentic web-search component that will actively generate links
Totally. Which is why I don't care if the LLMs index it. Let web content search be good, and lead LLMs to good content; product placement in LLM weights ain't what I'm gonna optimize for, or even permit, if it comes at a cost to me and my infra.
^^^^
This
For the moment, and for the foreseeable future, you are just giving your content for free (and have to pay the hosting bill).
Counterpoint: my wife owns an accounting firm and publishes a lot of highly valuable informational content on their website's blog. Stuff like sales tax policies and rates in certain states, accounting/payroll best practices articles, etc. I guess you could call it "content marketing".
Lately they have been getting highly qualified leads coming from LLMs that cite her website's content when answering questions like "What is the sales tax nexus policy in California?". Users presumably follow the citation and then engage with the website, eventually becoming a very warm lead.
So LLMs are obviously not search engines in the conventional sense, but it doesn't mean they are not useful at generating valuable traffic to your marketing website.
Friends of mine run a service company, and they already see a significant number of customers reach out because they found them using ChatGPT (et al), not Google. By significant I mean ~20% or so.
Also, for e-commerce, Deep Research from OpenAI is way better at doing product recommendations than Google. That's my go-to place to find most stuff nowadays (e.g. I purchased dancing shoes, pants, air cleaners, an air conditioner, supplements and a ton of other things using the recommendations of DR; no search engine comes even close to it).
Interesting, that’s not my experience and I’d be the first to replace Google if I could. I’ll have to try again.
For me, the main place where it fails is specific links to the stores and specific prices / opportunities. But when I want to find an item that fits a need (e.g. "quietest mobile AC" or "best ultra short throw projector for my specific use case", "collagen supplement that has clinical confirmation of the quality...") it works way better than Google. And I tried many product categories.
Let’s compare Google with OpenAI:
Paid-only: neither check; both have free tiers, eventually supported by ads. (Google took 10+ years before it got littered with ads. I promise OpenAI will make the ad experience even stinkier, because they keep you on the site, as opposed to Google, who only have you for a few seconds. The ads will be blinky, and they will be nested into the content.)
Dubious ethics: both check.
Overwhelming bot traffic: both check.
Make money on your content: both check.
> LLMs are not search engines, and I'm not gaining any followers or customers in any meaningful way because an LLM indexes my site.
So paywall it?
The Anubis PoW captcha is an option, too. Then you will block trainers and allow agents.
How about the fact that Google (ideally) sends users to you rather than sharing your work unattributed?
Not to mention LLMs still spew a lot of badly wrong results (no I will not anthropomorphize the models, they're not ready yet).
This is one heck of a poison chalice. Mr. Wang, are you willing to drink this cup of crane wine?
But how many of you wouldn’t hook up your website to Google?
Me. https://ashwinsundar.com/robots.txt
Your computer doesn't have the right to scrape what I say or do anything with it.
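As a rough illustration of what an opt-out like this looks like in practice, the sketch below parses a sample robots.txt with Python's standard `urllib.robotparser` and checks which crawlers it would turn away. The rules are hypothetical and are not the contents of the file linked above; the user-agent tokens (GPTBot, ClaudeBot, CCBot) are the ones their operators publish as far as I know, so treat the list as an assumption. Note that robots.txt is only advisory: it works only for crawlers that choose to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: disallow some AI crawlers, allow everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(user_agent: str, url: str = "https://example.com/post") -> bool:
    """Return True if the sample rules permit `user_agent` to fetch `url`."""
    return parser.can_fetch(user_agent, url)


for agent in ("GPTBot", "ClaudeBot", "CCBot", "Mozilla/5.0"):
    print(agent, "allowed" if allowed(agent) else "blocked")
```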
I know one of the primary reasons that I do anything online is to provide an outlet for someone else to see it. If I didn’t want someone else to see it, I’d write it down on my notebook, not on the public web.
Sounds like the same spiel from the anti-privacy advocates who think that we should all expose everything we're doing because "you should have nothing to hide". This article was written for Wired by Moxie Marlinspike in 2013, who later went on to develop the Signal protocol.
I don't want my thoughts or ideas spread across the web promiscuously. The things I say publicly are curated and full of context. That's why I have my own website, and don't post elsewhere.
I'm not playing the same game you are, which appears to be to post liberally and have loose thoughts to maximize "reach".
> You can flag submissions that you think don't belong on Hacker News.
Well, this is a very low quality submission in my eyes. A tiny read with an unsubstantiated, purely contrarian take that completely misses the point of the debate. Just to be clear, I think anyone is free to post anything on their blogs, that's what they're for, but I don't think posts like these contribute to HN having a good atmosphere for discussion; if I were to write something like this, I'd be ok with it being unsuitable for HN.
BTW I hadn't flagged this before reading your comment. I've done so after reading the submission though.
I guess that’s the problem - search being only a component.
Is the possible search traffic worth having your content become part of an LLM’s training set and possibly used elsewhere?
I guess the answer depends on the content and the website’s business model.
As an amateur blogger, I would not like LLMs to "steal" my content and show users the pieces they are looking for while leaving me with zero visitors. The reason I write is to convey a particular message, and its meaning gets lost, or worse, communicated wrongly, when it passes through an LLM.
As an online business owner, I do see both ChatGPT and Perplexity as referrers to my business, meaning that potential customers ask an LLM a question or for a service recommendation, and the LLM directs them to my service. I would not like to lose this vertical of organic customer acquisition.
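If you want to check this for your own site, one rough approach is to bucket the Referer values from your access logs. The sketch below is only that: the host lists are assumptions (which hostnames a given assistant actually sends, if any, may differ or change), and the function names are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlparse

# Assumed referrer hostnames; adjust to whatever shows up in your own logs.
LLM_HOSTS = {"chatgpt.com", "chat.openai.com", "perplexity.ai", "www.perplexity.ai"}
SEARCH_HOSTS = {"google.com", "www.google.com", "www.bing.com", "duckduckgo.com"}


def classify(referrer: str) -> str:
    """Bucket one Referer header value as llm, search, other, or direct."""
    host = (urlparse(referrer).hostname or "").lower()
    if not host:
        return "direct"  # empty or unparseable referrer
    if host in LLM_HOSTS:
        return "llm"
    if host in SEARCH_HOSTS:
        return "search"
    return "other"


def referral_share(referrers: list[str]) -> dict[str, float]:
    """Return each bucket's fraction of the total number of visits."""
    counts = Counter(classify(r) for r in referrers)
    total = sum(counts.values())
    return {bucket: n / total for bucket, n in counts.items()} if total else {}


sample = [
    "https://chatgpt.com/",
    "https://www.google.com/search?q=example",
    "",
    "https://example.org/some-page",
]
share = referral_share(sample)
```

Many clients strip or omit the referrer, so numbers like these are a lower bound on LLM-driven visits rather than an exact count.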
---
On a completely different note, Medium should die as a platform, together with Substack. The amount of intrusive popups, "install our app" bars, and paywalls is just insane. Bloggers, especially technically savvy ones, should be able to host their own blog.
LLMs are being blocked by standard bot detection - and the use cases are very much the same. People want smarter bots for the same shitty use cases.
Why would anyone choose to anonymously and freely provide content to LLMs? Actually the only use case for that is deliberately seeding misinformation. Which is likely already happening and will soon be the majority of the content accessible to LLMs regardless of what blocking legitimate content providers choose to use.
That's narrow. Perplexity and other LLM agent services do perform a regular web search to gain context before generating their output. How else would they have access to recent data when the underlying LLM's knowledge cutoff is usually at least a few weeks old?
> RAG crawlers
Very few people know what "RAG" is, so it makes little sense to mention it to anyone other than a technical audience.
> not LLM training
There's an issue of trust, because once content is scraped, it can also be used to train future models. That's really what ought to be emphasized IMO.
You write content so that you get paid, usually through ads and clicks. If people aren't seeing your content because an LLM has consumed it and is regurgitating it and taking your ad clicks, then there's no benefit for you, only for the LLM. You're doing the work of Sam Altman and helping him attain his multibillionaire status, and you get nothing in return.
The fact LLM companies constantly keep getting dinged for ignoring every barrier we throw up to stop their scraping short of something like Anubis shows what their real goal is: theft, monopolization, and reality authoring.
https://www.pewresearch.org/short-reads/2025/07/22/google-us...
The training done on the content does not provide citable references with current models. The agentic search and summary done post-training does.
A lot of the heavy traffic is for training, though, because AI companies are in competition for large amounts of training data.
This is just yet another person running an AI company telling me why I should provide free data and labor to the LLMs that power their company. These AI companies are acting as middlemen between the end-user and the content creator; it's the latest iteration of an age-old business model which works out great for the middlemen. Meanwhile, people on either side are taken advantage of.
If the "next-generation" of search is accessed mostly through an LLM, then there's no incentive to participate in it unless you're directly selling a product or service... and then you have to hope and pray the LLM doesn't lie and misrepresent you. Otherwise, if you're making a website to share information or show off your own work, there's zero incentive to participate.
If AI companies want to pay me cold hard cash every time they query my site, then we can negotiate.
If you do you're just feeding the AI monster for free.
riffraff•6mo ago
[citation needed]