When you're asking AI chatbots for answers, they're data-mining you

https://www.theregister.com/2025/08/18/opinion_column_ai_surveillance/

77•rntn•2h ago

Comments

roscas•2h ago

Always good to remember people of this.

But not just AI bots or interfaces. Everything is saved and never deleted.

Remember Facebook? "We will never delete anything": that is their business.

So anything that you put on those "services" is out of your hands. But we still have an option: stop using these ad companies and let them die.

Back to AI, there are loads of offline models we can use. Tools like Ollama will even download the model for you. Install Ollama, find a model name on the Ollama site, run "ollama run model-name", and you can use it.

OK, it is not as good as ChatGPT 5, but it can help you so much that you might not even need ChatGPT.
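For anyone who would rather script a local model than type into the terminal, here is a minimal sketch. It assumes the Ollama server is running on its default local port (11434) and that a model has already been pulled; "model-name" stays a placeholder, and the helper names are made up for illustration.

```python
import json
import urllib.request

# Assumption: Ollama's local HTTP server is listening on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_local_model(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    request = build_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    # With "stream": False the reply is a single JSON object whose
    # "response" field holds the generated text.
    return body["response"]
```

Calling `ask_local_model("model-name", "Why is the sky blue?")` only talks to localhost, so the prompt never leaves the machine.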

notpushkin•1h ago

> Always good to remember people of this.

You mean “remind”?

Phemist•1h ago

Indeed, and asking Facebook to delete the data or to not use it for AI training is just another data point indicating you care about it. Your preferences will eventually be stripped through redesigns, refactors, careless usage or Facebook's crooked idea of consent. The data will remain and be used again.

lowwave•21m ago

It is better to NOT delete facebook, but spam your profile with other data and just leave it.

lm28469•1h ago

That's why you should use multiple accounts and bullshit about 30% of what you post. LLMs are a godsend for that; they poison their own well.

SoftTalker•1h ago

I assume that companies like Facebook know pretty well which accounts are really the same person. Even if you are careful about keeping cookies in separate browser profiles, your machine can be fingerprinted, your posting habits and writing style can be fingerprinted, and Facebook/Google have the resources to do it.
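To give a rough feel for how writing style alone can link accounts, here is a toy sketch that compares two texts by their character trigram frequencies. This is an illustration only: the function names are invented, and nothing here is claimed to be what Facebook or Google actually run; real systems would presumably combine far more signals.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams after normalizing case and whitespace."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def style_similarity(text_a: str, text_b: str, n: int = 3) -> float:
    """Score how alike two texts look at the character n-gram level."""
    return cosine_similarity(char_ngrams(text_a, n), char_ngrams(text_b, n))
```

Posts from two "separate" accounts that score consistently high against each other are a hint, not proof, that one person is behind both.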

mgh2•53m ago

The risk is the externalities to actual users who don't know the difference and are affected by your 30% BS.

BolexNOLA•13m ago

I recently set up LM Studio and have run OpenAI's 20b model locally using an AMD 9070 + 9800x3d. I honestly assumed it would be way more work than it was to set it up. It has limitations, but given it took me all of 5min and I can easily attach docs for it to reference as it all runs locally... it's fantastic. I've got a Claude model I've been messing with too.

andrepd•1h ago

What can you do online these days without being data mined? Browsing gemini?

em3rgent0rdr•1h ago

Download stuff in bulk (for instance the entire Wikipedia torrent) and then peruse it on your own computer.

Squeeeez•46m ago

If you are not using an OS which has something like Windows Recall enabled, or that weird StarDict with automatic online lookup on select which came up recently.

I wonder how far back this has been going on. Did ICQ, IRC server hosters, BBSes do similar things?

reactordev•35m ago

No, back then storage was at a premium, so everything aside from config, accounts, and billing was ephemeral. It really wasn’t until the cloud came along that storage got cheap enough to keep everything. About the time of the social media boom.

It wasn’t until around 2014 that I stopped building routes that did:

    DELETE FROM <table> WHERE id = ?;  -- children removed via ON DELETE CASCADE on their foreign keys
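In standard SQL, ON DELETE CASCADE is declared on the foreign key in the schema, and the DELETE statement itself stays plain. A minimal sketch of such a hard-delete route using Python's built-in sqlite3 follows; the `users` and `posts` tables and the function names are invented for illustration.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a parent table and a cascading child table."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys (and cascades) when this pragma is on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
        CREATE TABLE posts (
            id INTEGER PRIMARY KEY,
            user_id INTEGER NOT NULL REFERENCES users(id) ON DELETE CASCADE,
            body TEXT NOT NULL
        );
        INSERT INTO users (id, name) VALUES (1, 'alice'), (2, 'bob');
        INSERT INTO posts (user_id, body) VALUES (1, 'hello'), (1, 'again'), (2, 'hi');
        """
    )
    return conn


def hard_delete_user(conn: sqlite3.Connection, user_id: int) -> int:
    """Delete a user row; the database removes that user's posts via the cascade."""
    with conn:  # commits on success, rolls back on error
        cursor = conn.execute("DELETE FROM users WHERE id = ?", (user_id,))
    return cursor.rowcount  # counts only the directly deleted user rows
```

After `hard_delete_user(conn, 1)`, nothing belonging to user 1 is left in either table, which is the "ephemeral by default" behaviour described above.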

y0eswddl•21m ago

Start w/

https://ssd.eff.org

https://privacyguides.net

panny•1h ago

I would expect this, but it doesn't seem to be the case.

If I ask search.brave.com to give me a list of Gini coefficients for the top ten countries by GDP, it can't do it. However, if I tell it the data is available on the CIA World Factbook, it can then spit that info out promptly. However, if I close the context and ask again, it hasn't learned this information and once again is unable to provide the list.

It didn't datamine me. It had no better idea where to find this information the second time I asked. This is the experience others have stated with other AIs as well. It does not seem special to Brave.

Etheryte•1h ago

Data mining doesn't mean the model is instantly updated; that would be prohibitively expensive at scale. It's way easier to batch your data together with a bunch of other data and use it later on. That doesn't even mean it will know where to find the information eventually, since models are not one to one with their inputs, because again, size and cost.

panny•1h ago

>Data mining doesn't mean the model is instantly updated

I'm not expecting instant. Even next week it won't be there. It's like how AI never learned to count how many times the letter r appears in strawberry. Like sure, now if you ask Brave, it will tell you three, but that is only because that question went viral. It didn't "learn" anything, it was just hard coded for that particular answer. Ask it how many times the letter l appears in smallville and it will get it wrong again.

qwertytyyuu•1h ago

Every week is still way too expensive to do at scale; at best they'll update training data with each model iteration.

simgt•1h ago

I didn't think for a second you could be right, so I tried with Claude. L in smallville was correct, then it suggests it'd have gotten l in parallel wrong by answering 3 instead of 2 (but gets it right in a new chat). Then it suggests it'd get n in millennium wrong by giving the right answer, and gets it wrong in a new chat. https://claude.ai/share/93b46c3b-23a7-40ad-8a2b-ec2ed6c34a19

Thanks, that was enlightening.

t0md4n•1h ago

It wouldn’t be instant, next week or even next month. Pre-training doesn’t happen that frequently and varies between model providers. As for the strawberry test, this is a tokenization issue that is fundamental to LLMs. However, most models can now solve this type of question using thinking/code/tools to count the letters.

https://imgur.com/a/NqIJEx6
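For reference, the kind of code a model can run as a tool to answer these questions is tiny. A minimal sketch (the function name is made up; the words are the ones mentioned in this thread):

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    if len(letter) != 1:
        raise ValueError("letter must be a single character")
    return word.lower().count(letter.lower())


examples = [
    ("strawberry", "r"),   # 3
    ("smallville", "l"),   # 4
    ("parallel", "l"),     # 3
    ("millennium", "n"),   # 2
]

for word, letter in examples:
    print(f"{letter!r} in {word!r}: {count_letter(word, letter)}")
```

The code operates on individual characters, while the model itself sees multi-character tokens, which is why handing the count to a tool sidesteps the problem.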

ordersofmag•1h ago

LLMs aren't retrained and released on a weekly time-scale. The data mining may only be reflected in the training of the next generation of the model.

Etheryte•31m ago

Both OpenAI and Anthropic average roughly one flagship release a year, and these are some of the best funded companies in the space. The bigger your model, the more expensive it is to train, so you want to do it as rarely as reasonably possible. Every other company will work with smaller models and/or train even more rarely, aside from fine-tunes and customizations they put on top.

add-sub-mul-div•1h ago

Brave isn't data mining you for your benefit, they're doing it for their benefit.

Kim_Bruning•1h ago

Earlier discussion on the "ChatGPT chats in google" angle:

https://news.ycombinator.com/item?id=44778764

Interesting how much traction

     "[x] Make this chat discoverable (allows it to be shown in web searches)"

gets in news articles.

People don't seem to have the same intuition for the web that they used to!

falcor84•1h ago

> So, kids, let's not be asking any AI chatbot whether you should divorce your husband, how to cheat on your taxes, or if you should try to get your boss fired. That information will be kept, it may be revealed in a security breach, and, if so, it will come back to bite you in the buns.

Just as a PSA - there's nothing unique to AIs here - whenever you ask a question of anyone, in any way, they then have the memory of you having asked it. A lot of sitcoms and comedic plays build their plot on such a question eventually reaching (either accurately or inaccurately) the person it was being hidden from.

And as someone who's into spy stories, I know that a big part of tradecraft is formulating your questions in a way that divulges the least about your actual intentions and current information.

If anything, LLM-driven AIs are the first technology that in principle allows you to ask a complex question that would be immediately forgotten. The thing is that you need to be running the AI yourself; if you ask an AI controlled by another entity, then you're trusting that entity with your question, regardless of whether there's an AI along the way.

frakt0x90•56m ago

Books are also a technology that allows you to answer complex questions without recording the question.

Jalad•10m ago

Not necessarily though, it depends on where you got the book from (Amazon, the library?), and what your question is

y0eswddl•22m ago

The questions and info you ask friends don't end up in a massive data profile on you stored in somebody's cloud to be used for future manipulation/marketing/profiling...

smjburton•1h ago

> The more data you give any of the AI services, the more that information can potentially be used against you.

It may seem obvious, but Sam Altman also recently emphasized that the information you share with ChatGPT is not confidential, and could potentially be used against you in court.

[1] https://www.pcmag.com/news/altman-your-chatgpt-conversations...

[2] https://techcrunch.com/2025/07/25/sam-altman-warns-theres-no...

Jalad•12m ago

This is always true though. Any data that a cloud company has on you can be subpoenaed.

It would be weird for him not to be transparent about that

nottorp•40m ago

> "How to Use a Microwave Without Summoning Satan,"

Oh, nice idea. We should all ask that.

mystraline•27m ago

Wait, you can summon Satan with a microwave?!

Lemee ask ShatGPT how to do that!

ceroxylon•38m ago

What about the people who did not opt to share or index their chats, and the companies that claim to not train on user chats?

https://privacy.anthropic.com/en/articles/10023555-how-do-yo...

> We do not actively set out to collect personal data to train our models

The 'snarky tech guy' tone of the article is a bit like nails on a chalkboard.

hazKu4•31m ago

(At least to me) that language doesn’t feel particularly reassuring… especially given the duplicitous nature of data collection - i.e. “we don’t sell your data” translates to “we create a sophisticated advertising profile about you, and monetize that”

boesboes•7m ago

That line is about data they find on the internet. Soooo completely not relevant.

unethical_ban•37m ago

Duck.ai claims to anonymize AI chats and says its conversations are not used for training. It is my go-to for casual usage.

Otherwise, I use local models for complex or potentially controversial questions.

glitchc•15m ago

Everyone knows this. Every layperson I talk to is aware that these companies are siphoning their information. When free email was introduced over two decades ago, the behaviour was the same. Everyone knew Microsoft and Google could read your emails. Then, like now, people think it's worth it. It is too useful a tool to have and the price is palatable.

What people don't want to do is sign up for yet another subscription. There's immense subscription fatigue among the general population, especially in tough economic times such as now.

boesboes•7m ago

What a terrible, utter bullshit article. Full of half truths and fear mongering. smh.
