The first isn't worth arguing against: it's the idea that LLM vendors ignore your robots.txt file even when they clearly state that they'll obey it: https://platform.openai.com/docs/bots
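For reference, opting out per that page takes a couple of lines in robots.txt. GPTBot is the user agent OpenAI documents for training crawls; other vendors publish their own tokens, so this snippet is just an illustration of the mechanism, not a complete block list:

    User-agent: GPTBot
    Disallow: /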
Since LLM skeptics frequently characterize all LLM vendors as dishonest, mustache-twirling cartoon villains, there's little point trying to convince them that companies sometimes actually do what they say they're doing.
The bigger misconception, though, is the idea that LLM training involves indiscriminately hoovering up every scrap of text the lab can get hold of, quality be damned. As far as I can tell that hasn't been true since the GPT-3 era.
Building a great LLM is entirely about building a high quality training set. That's the whole game! Filtering out garbage articles full of spelling mistakes is one of many steps a vendor will take in curating that training data.
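As a toy illustration of that kind of filtering step (purely hypothetical heuristics, nothing like any vendor's actual pipeline), a crude document-level quality gate might look like:

    import re

    # Tiny stand-in vocabulary; a real filter would use classifiers,
    # perplexity scores, dedup, and much more.
    COMMON_WORDS = {"the", "and", "is", "of", "to", "a", "in", "that", "it", "for"}

    def rough_quality_score(text: str) -> float:
        # Fraction of tokens that are ordinary English words.
        words = re.findall(r"[a-z']+", text.lower())
        if not words:
            return 0.0
        return sum(w in COMMON_WORDS for w in words) / len(words)

    def keep_document(text: str, threshold: float = 0.2) -> bool:
        # Drop documents whose score falls below the threshold.
        return rough_quality_score(text) >= threshold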
bboygravity•16m ago
Like why? Don't you want people to read your content? Does it really matter whether meat bags find out about your message to the world through your own website or through an LLM?
Meanwhile, the rest of the world is trying to figure out how to deliberately get their stuff INTO as many LLMs as fast as possible.
LtWorf•12m ago
Every single person who has written a book is happy if others read it. They might be less enthusiastic about printing a million copies and shipping them to random people at their own expense.