It's Not Just X. It's Y

https://mail.cyberneticforests.com/its-not-just-data-its-post-training/

57•mooreds•1h ago

Comments

Retr0id•55m ago

> RLVR is weirder, and I suspect it's why we see "It's not X, it's Y" so often.

This feels like an easy enough hypothesis to verify, for anyone in the business of training LLMs - does the not-X-but-Y rate increase after RLVR?
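A rough sketch of how that check might look, assuming you have text samples from a checkpoint before and after RLVR. The regex and the function name `pattern_rate` are illustrative assumptions, not an established metric, and the pattern will miss many phrasings of the construction:

```python
import re

# Hypothetical, deliberately narrow pattern for "it's not (just) X. It's Y"
# and "it's not X, it's Y". Real text would need a broader detector.
NOT_X_BUT_Y = re.compile(
    r"\b(?:it|this|that)(?:'s|\u2019s| is) not (?:just |only |merely )?"
    r"[^.,;!?\n]{1,60}[.,;] ?(?:it|this|that)(?:'s|\u2019s| is) ",
    re.IGNORECASE,
)


def pattern_rate(texts):
    """Return pattern matches per 1,000 whitespace-separated words."""
    matches = sum(len(NOT_X_BUT_Y.findall(t)) for t in texts)
    words = sum(len(t.split()) for t in texts)
    return 1000.0 * matches / words if words else 0.0


# Placeholder samples; in practice these would be model outputs
# generated from the same prompts at each checkpoint.
before_rlvr = ["The cache is stale, so the build fails."]
after_rlvr = ["It's not just a stale cache. It's a broken build."]

print(pattern_rate(before_rlvr), pattern_rate(after_rlvr))
```

Comparing the two rates on outputs generated from identical prompts would give a first-pass answer, though a regex count says nothing about why the rate changed.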

andy99•40m ago

It’s unlikely this is true. LLMs are way more mad-libs / templates than we like to admit. That’s (ironically) not a judgement about their capability; it’s primarily just an observation. But it’s also what plain old SFT, which I believe is the primary culprit, ends up imparting.

huflungdung•53m ago

You’re absolutely right. This is the smoking gun. This changes everything.

Starlevel004•40m ago

This is the real unlock. Here's the key takeaways.

H8crilA•36m ago

It's not just an unlock. It's a major discovery.

matheusmoreira•28m ago

Now I see the full picture.

flexagoon•17m ago

I'm zeroing in on the main culprit.

rzzzt•6m ago

Wait, there could be more things to consider.

rvz•49m ago

Another bunch of dead giveaways in codebases with READMEs is the repetitive:

- "No X, No Y, No Z." pattern

- "Here is X - it makes Y"

The worst and most obvious one is the constant overuse of emoji ticks and crosses.

Retr0id•43m ago

For calibration purposes, I offer you a pre-LLM README I wrote that includes an em-dash* followed by "No X, No Y, No Z": https://github.com/DavidBuchanan314/stelf-loader

*actually a hyphen but it's functioning as an em dash.

zamadatix•31m ago

"Hyphen functioning as an em dash" is an expected human thing as it's what's easy to type. It's specifically an actual em dash which got bulldozed, much to the dismay of those who bothered to put the unicode character in.

edbaskerville•27m ago

If you read The Mac is Not a Typewriter in 1992—thus burning Option-Shift-hyphen into your typing patterns for life, along with a dogmatic love for serif body fonts—you're the real victim here.

zamadatix•12m ago

Or those of us that use a full featured editor when writing the readme.md!

This reminds me of another em dash + AI related topic: I've noticed LLMs have an extreme bias towards spaces around the dash, while people can go either way with it.
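One way to put a number on that impression, as a minimal sketch: count em dashes with whitespace on both sides versus none on either side. The function name `em_dash_spacing` is made up for illustration, and mixed cases (a space on one side only) are simply left out of both counts:

```python
import re

EM_DASH = "\u2014"  # the Unicode em dash character


def em_dash_spacing(text):
    """Count spaced and unspaced em dashes in a piece of text."""
    spaced = len(re.findall(r"\s" + EM_DASH + r"\s", text))
    tight = len(re.findall(r"\S" + EM_DASH + r"\S", text))
    return {"spaced": spaced, "tight": tight, "total": text.count(EM_DASH)}


sample = "one \u2014 two, three\u2014four"
print(em_dash_spacing(sample))
```

Running this over a pile of human-written and model-written text would show whether the spaced form really dominates in the latter.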

Baader-Meinhof•45m ago

I like that these AI idioms exist. They're like watermarks for text. It's worth the cost of humans avoiding them. Companies will eventually train their models to be undetectable, but society would be better if they didn't.

chipotle_coyote•21m ago

Except that the entire point of the article is that they're not AI idioms. They're not "watermarks for text." They're legitimate language constructions that LLMs tend to overuse, but that real humans also use. Real humans do, in fact, say "align with" all the time, just as often as "corresponds."

And you can pry my em dashes from my cold, dead hands.

ohyoutravel•18m ago

Well, reading between the lines, I don’t think they’re saying all of those uses are AI. They’re legitimate constructs, like the em-dash, en-dash, and hyphen, all of which I used to use regularly. But now they’re AI tells, so I use them sparingly.

card_zero•11m ago

Sociolinguistic register happened.

Maxatar•17m ago

The article is not God; just because it claims something doesn't mean we have to accept it.

For better or worse (and pretty much for worse), these usages have become AI idioms. Language evolves over time, things that used to be harmless become offensive, certain terms end up taking on the complete opposite meaning than their original meaning, and we are watching certain language patterns and idioms become watermarks for AI and while it sucks, it doesn't make it false.

wrs•43m ago

This is how early forms of "reasoning" in LLMs worked: just literally inserting words like "Wait...", "Hmm...", "Let me reconsider...", "But is it really..." into the token stream.

flexagoon•18m ago

Is this not how current forms of reasoning work? It seems like the open models still output things like that, and the closed ones all just summarize their thinking instead to avoid distillation, but probably do the same thing internally.

wrs•12m ago

I think the basic idea is the same (not being a frontier lab researcher I couldn’t say for sure), but there are different techniques, such as “reasoning tokens” that aren’t literally words, and more interesting structures than just sticking them into the stream.

adt•30m ago

https://en.wikipedia.org/wiki/Wikipedia:Signs_of_AI_writing#...

downbad_•20m ago

Signs? Those are normal ways of writing? What the hell? Is everything AI now?

HarHarVeryFunny•28m ago

> In the end, shaming people for writing that gets flagged as AI can lead people to sidestep structures the model has learned from us

It's interesting why LLMs generate constructions like this more frequently than they presumably exist in the training set. I wonder if this is some sort of mode collapse caused by post training, and/or maybe because they are training on synthetic data so these things become self-perpetuating and self-amplifying (a feedback loop)?

The lesson for humans worried about being falsely identified as AI is to just learn to write better! It doesn't matter where your repertoire of phrasing comes from (copying AI or not), but one of the basic rules of writing is not to repeat yourself unless you are doing so deliberately for a purpose. Go ahead and use "It's not just X. It's Y" if you want to, but if you use it multiple times in the same short piece of writing, then you may deserve to be called out for poor style, if not for being an AI.

Maxatar•11m ago

It's not model collapse, nor does it have anything to do with training data frequency. It's simply RLHF, where the humans hired to tune the conversational style of these LLMs preferred certain idioms over others, and so the reward function for these LLMs gravitated toward using them.

If LLMs generated text based on training data frequency they'd likely be some of the most vulgar and hostile things ever created. The internet is full of insults, profanity, and low effort content. The repeated phrases are a side effect of reward optimization rather than some kind of model collapse.

busssard•24m ago

nice article, but i think as a non-native english speaker, i always use the model in english for reasoning and then translate the output to my language. most of these considerations do not apply, because the translation step is taking out a lot of these language artifacts

phildenhoff•15m ago

Do you manually translate or translate with an LLM? While reading, I was wondering how common these kinds of written tics are in languages outside English.

rq1•20m ago

You’re absolutely right to push back on this.

Sometimes it’s not just about the Ys but also the Qs.

coldtea•20m ago

>Recent overuse by language models has led many to declare it bad writing. I'm not so sure.

It is bad writing.

verbify•13m ago

Always? There's never a place for it?

chrisweekly•12m ago

I'd say it's average writing.

karim79•12m ago

"So, if we publicly shame people whose text looks like it might have been written by a machine – because it mimics the language used for human reasoning – and people stop writing in ways that they internalize as "AI writing" out of fear of false detection, it sends a signal that your language for reasoning must be policed, or you too could be held up to public scrutiny."

This is honestly both terrifying and well articulated.

High praise to the blog author.

card_zero•7m ago

There are plenty of idioms, find a different idiom, tough titties.

ai_slop_hater•12m ago

> Because if Pangram's AI system found me guilty, that's the end of my career. That's literally extortion.

How is this different from humans? When I went to high school, my teachers extorted me too. Especially in subjects like English where, unlike Math, evaluation is 100% subjective.

amarant•8m ago

Clearly humans always type "it's not merely X, but also Y"
