1. False positives: phrases like "on the other hand" or "not only x but y" are definitely used by humans. You can't accuse someone of using AI just because certain phrases appear in their text (a toy illustration of why follows after this list). After all, AI is itself trained on text written by humans, so the reason it uses those phrases is that they are common in its training set.
2. By publishing a list of what looks like AI writing, they give people the opportunity to simply tell the AI which phrases not to use. Anyone prompting an AI can use the list to make its output read more like a human wrote it, which is, ironically, exactly what Wikipedia was trying to stop.
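To make the false-positive point in item 1 concrete, here is a minimal sketch of the kind of naive phrase matcher being criticized. The phrase list and threshold are made up for illustration and are not taken from the Wikipedia page; the point is only that ordinary human prose trips it.

```python
# Hypothetical, naive phrase-based "AI detector" of the kind criticized above.
# Both the phrase list and the threshold are invented for this sketch.

SUSPECT_PHRASES = [
    "on the other hand",
    "not only",
    "in conclusion",
    "it is important to note",
]

def naive_ai_score(text: str, threshold: int = 2) -> bool:
    """Flag text as 'AI-like' if it contains at least `threshold` suspect phrases."""
    lowered = text.lower()
    hits = sum(1 for phrase in SUSPECT_PHRASES if phrase in lowered)
    return hits >= threshold

# A perfectly plausible human-written sentence gets flagged anyway:
human_text = (
    "On the other hand, the committee argued that the proposal was not only "
    "expensive but also, it is important to note, legally questionable."
)
print(naive_ai_score(human_text))  # True -- a false positive
```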
Doing what, exactly? This is a descriptive, informational page, not a policy.
Technically speaking, if an LLM can write WP-style prose and source it correctly, that wouldn't be a problem (imo)
The advertisement wave of the future will be similar to when Nike and Virgil Abloh were putting out sneakers that said "SHOE" on them. Or something like that.
The working title of this trend is "Bruxism".
constantcrying•6mo ago
You cannot detect AI writing by its language and tone. All LLMs are trained and prompted to write in a very particular style, but you can just tell them to write in a different style and they will. What is worse is that the default LLM writing style is actually quite common: if you read through that list, you will also see that many of these are very much human errors.
Trying to detect what is and isn't LLM-generated text will only lead to people chasing ghosts, either accusing innocent people or putting faith in text that is the result of more careful prompting.
rgoulter•6mo ago
I'm guessing the priorities are to have contributions which stick to Wikipedia's guidelines. The LLM tendencies cited are in violation of those.
I don't think the game is strictly "we only want human contributions"; you can imagine a sophisticated LLM user crafting a reasonable contribution that doesn't get rejected.
The "accidental disclosure" section indicates that some of these bad contributions are just very low effort.
supriyo-biswas•6mo ago
The issue with LLMs is that they try to insert a lot of judgement about the subject matter without quantification or comparison. A lot of this is already covered by Wikipedia's other rules, such as those about weasel words, verifiability, and so on, but it is useful to have rules that specifically detect AI content and, by proxy, also take out all the bad human writing along with it.
For example, when asked about a person X who discovered a method to do Y, an LLM may try to write "As a testament to X's ingenuity, he also discovered method Y, which helps achieve Z in a rapid and effective manner"; it doesn't really matter whether this was written by an LLM, because the style is unsuited for Wikipedia either way. Instead, one would have to quantify it by writing "He/she discovered method Y, a method to do Z, which was regarded as an improvement over historical methods such as P and Q", with references to X discovering Y and to research that documents that improvement.
LLMs could adopt that latter writing style and cite references, but the issue there is that a large market wants to use them simply to decompress documents, satisfying the intricacies of the social structures people are embedded in. For example, someone may want to prove to their manager that they produced a well-researched report; since the manager would have to conduct that research themselves to know whether it meets the bar, they instead use document length as a proxy. LLMs serve a lot of such use cases, and it would be difficult to take away this "feature".
nunez•6mo ago
There are small things that LLM-generated content will almost always do. The em dash used to be one of them; transition-word overuse is another; being overly verbose by default is yet another.
That said, I posit that it will get increasingly difficult to keep this page up to date as models get smarter about how they write.