Why do AI models use so many em-dashes?

https://www.seangoedecke.com/em-dashes/

98•ahamez•3mo ago

Comments

iansteyn•3mo ago

It’s a real pity to me that em-dashes are becoming so disliked for their association with AI. I have long had a personal soft spot for them because I just like them aesthetically and functionally. I prided myself on searching for and correctly using em, en, and regular dashes, had a Google docs shortcut for turning `- - -` into `—` and more recently created an Obsidian auto-replacement shortcut that turns `-em` into `—`. Guess I’ll just have to use it sparingly and keep my prose otherwise human.

jasonvorhe•3mo ago

Don't change your behaviour because some corporations made questionable decisions.

Your readers won't care about the dashes as long as the texts read like they had human origins and you have something to tell.

keiferski•3mo ago

Unfortunately a lot of contests etc. are anti-AI usage without having a formal system for detecting it. In practice that means anyone using a lot of em-dashes will be flagged by a reviewer as AI-likely.

iamdamian•3mo ago

I would say that using a lot of em-dashes was always bad writing. You want to use them sparingly if you want them to have impact.

That said, yes, keep using them (and using them well!).

iansteyn•3mo ago

This is true. I actually originally started using em-dashes because a high school English teacher called out my overuse of regular dashes (where they technically should have been em-dashes, but he didn’t know that) in the place of other types of transitions… which prompted me to research the punctuation properly and consider other ways to transition thoughts too.

jasonvorhe•3mo ago

Then these contests are run by lazy people and they aren't worth the effort of submitting your work to.

whynotmakealt•3mo ago

If I found em-dashes and other patterns like its just not X but Y and all the other things we correlate with AI, I might call a person using it.

I don't understand the purpose of using LLM's to write articles unless someone wants to be the middleman of slop and if that's the case, I'd rather cut middlemans and get slop directly from the AI models, instead of pasting the output of what chatgpt generated, give me the prompt and maybe temperature/other settings if need be to make it more reproducible but the prompt itself could be enough smh

I am not saying you should change your writing style, but at the same time, you have to understand, if someone writes like AI, Chances are that we are too tired of looking too deep into it to find if its written by AI or not, we are tired of it & so you must understand our or anybody's frustration if they call out someone's writing as AI.

For those using AI to write articles/etc. : If you are passionate about something, write about it, write what you want, how you want and you will be proud. But if you use LLM, you will constantly be called upon and frankly, it reduces the purpose of writing.

For code, there is a debate that code is just an means to an end (which is to do stuff like scripts etc.) but there is no end to writing, for what? for more views/etc., there is no point in getting such attention or anything considering it would just be negative attention if I or anyone found AI writing.

Not sure why people use AI text generation for articles etc. Idk.

This is my alt but when I had first started out on HN, I thought my english was fine but then somebody pointed it out and I try to fix my grammar and now its second nature to me writing.

I would be curious to know the reasons as to why people write text stuff with AI in the first place. It doesn't make sense to me since the other side would use their slop to counter your slop, at that point just create a tldr post, why strech an article in more words than unnecessary (I feel like I also write a lot of filler words / yap personally but alright, atleast you know a human is writing this), I don't get the point of writing longer if you aren't even writing it, is it to get SEO or, is the end goal money like all things?

DonHopkins•3mo ago

Well if you write paragraphs of redundant repetitive parenthetical text yourself (like I tend to do), that meanders around and repeats the point again and again (oh there I go again), like both you and I obviously do (and I'm doing now), then LLMs can be useful for condensing and sharpening it.

For example, your post could have been just one paragraph and said the same thing. Do you purposefully write so verbosely as a virtue signal of authenticity?

And no, readers can't just ask the LLM to reproduce the same slop, because they don't have the verbose, redundant (there I go again) original source text that it's condensing. And even if they did, they would not bother reading it, because it's tl;dr and full of typos.

Nobody wants to read pages of repetitive human generated slop, either.

PS:

>I thought my english was fine but then somebody pointed it out and I try to fix my grammar and now its second nature to me writing.

Since you asked for somebody to point it out:

Use it's when it's a contraction for "it is" or "it has," and use its when it's a possessive pronoun showing ownership. A helpful trick is to try replacing the word with "it is" or "it has" in the sentence; if it still makes sense, use "it's".

Full disclosure, in case you can't tell: the paragraph above was LLM generated. Did you find it helpful, was it tl;dr, or did you dislike "its" style?

whynotmakealt•3mo ago

Yes, I can agree that the words I use can be redundant sometimes, I am a human and I have its flaws, I really like to type long essays, Yeah.

To be really honest, I can understand your view-point even if it conflicts with mine if you aren't being offensive, since personally, I think that there is no point of this offense-defense thing.

Off topic but How's your day going man?

Listen, I will tell you why I write in the way I write, You might say this authenticity but I believe it being honest, I want to give someone access to the thoughts I am thinking the way they come, so I would consider it raw.

Usually its not for them but for me, for knowing how far or backwards I would go in life. I write this because I anticipate reading it in the future but I sure haven't read as much in other places.

HN does feel like a place where someone could write 3 paragraphs to some basic question and still feel accepted or read , to be really honest.

> Nobody wants to read pages of repetitive human generated slop, either.

You raise a good point, I have been a bit selfish in writing these posts, I write it for myself and not for the other person, I thought the other person would appreciate my honesty of typing what I write but the point of it being human slop might make sense too lol

> And no, readers can't just ask the LLM to reproduce the same slop, because they don't have the verbose, redundant (there I go again) original source text that it's condensing. And even if they did, they would not bother reading it, because it's tl;dr and full of typos.

Sorry but either its me but I can't seem to understand what you mean by this? Like, do you mean nobody would bother pasting human slop to get tldr's ?

I don't know what to say since I am not getting what you are trying to tell me here but I am curious for sure.

> Since you asked for somebody to point it out:

I didn't really ask but sure, I will take it. I guess I make less mistakes overall though so that's nice. I am not perfect and I am comfortable knowing that yet I think that my writing could be sharpened, yes. There is no denying in that.

> Full disclosure, in case you can't tell: the paragraph above was LLM generated. Did you find it helpful, was it tl;dr, or did you dislike "its" style?

I would still would've preferred your real message even if it could've been choppy to be really honest.

I feel like If I might not leave an imprint on the world, might as well leave the fingerprint saying I was there and this is a way which helps me feel that way. It is (theraptic?) even to write long sentences, they sooth me. I do it for myself. I wasn't trying to virtue signal though, personally its more that even after anything, I still feel like using LLM's in article formations etc. is just a cheap shortcut to what?? , to me its the fact that I can point to this article and be decently comfortable knowing that I wrote it and not an LLM.

I just can't trust LLM texts that much and the only reason I am giving yours so much is because I would appreciate the opportunity to grow and I am willing to read any criticism you provide me if I can meaningfully work on.

eastbound•3mo ago

Cmd + “-“ = –

Cmd + Shift + “-“ = —

Let’s spread the word until everyone fancy uses them, and then those who criticize text for coming from LLMs will be ridiculed by our ridiculous skills.

Etheryte•3mo ago

That's interesting, for me those shortcuts are with option, not command. On my laptop, the first shortcut you wrote down is used to zoom out.

latexr•3mo ago

It’s ⌥ instead of ⌘, and those exact shortcuts depend on keyboard layout. You posted the US version, but others reverse the em and en dashes.

withinboredom•3mo ago

Or on Linux with the compose key, it is also different.

iansteyn•3mo ago

I tried some of these today, unfortunately it seems they’re not universal across programs.

nandomrumber•3mo ago

I agree, parentheses are not only used incorrectly in a lot of online writing, they’re also ugly.

lm28469•3mo ago

While you're automated out of your dashes people are automated out of their jobs, relax you'll be ok

topaz0•3mo ago

Part of it is the guilt-by-association with the other bad writing habits of LLMs, but I think a lot of it is just that LLMs genuinely overuse them, and that homogeneity is grating just like it's grating when you notice a text reuses a particular noticeable word or whatever. As a fellow em-dash user, I have sometimes noticed myself overusing them too, and revised accordingly, starting well before the proliferation of this particular cancer.

So I think you can keep using em-dashes without being associated with LLMs as long as you reserve them for particularly effective/tasteful occasions.

krzrak•3mo ago

I feel you... For 30+ years of my life I prided myself for writing without typos and other mistakes (without autocorrect), using lots of bullet points, dashes, and words such as "delve into" or "underscore".

Now I find myself intentionally adding typos and other msitakes, and using less sophisticated language, just to not be accused of using AI.

matsemann•3mo ago

I don't mind that in a "proper" text where it's actually useful and fun to read something with a flair. But maybe it has always irked people in short form (forum comments etc), but they've never just called it out until now? I do sometimes read something that gives me an "iamverysmart" feeling, as if the author used a thesaurus to find a synonym for half the words to sound clever but it just makes the whole thing incomprehensible.

TheOtherHobbes•3mo ago

Americans famously have a median 6th grade reading age, so words like "delve" and "perspicacity" aren't going to win friends and influence people.

Ironically, AI writing is too literate. It reads like clunky pastiche to literate readers, but it's still using words and constructions less literate readers haven't seen before.

hdgvhicv•3mo ago

It’s been about 30 years since prose editors like Word started underlining spelling mistakes in red. I don’t get typos when writing formal text on a keyboard. One handed on a touch screen phone with “auto correct” causing issues is another thing, but not for published articles.

topaz0•3mo ago

The distinctiveness of LLM language comes from overuse of specific words, not because it has a particularly sophisticated vocabulary. Some of the words it overuses may be considered sophisticated by some people, but that's not what makes it identifiable (or what makes it grating). It's still not hard to distinguish your voice from LLMs by being thoughtful about style at all.

(Edit: corrected (unintentional) typo)

TheOtherHobbes•3mo ago

It's not just [thing], it's [more dramatic thing.]

You can customise the default style over an impressive range. Most people don't, so most AI writing is distilled essence of Failed LinkedIn Marketer, even when that style conflicts hilariously with the content.

Mawr•3mo ago

Try out semicolons instead; they're never used but fun to play with too!

Xorakios•3mo ago

semicolons seem to more accurately separate follow-up thoughts than em-dashes to my meathead, and I asked Perplexity/Comet this morning: what is easiest to process a whole list of options to save processing power and give most accurate results.

line breaks was first; semi-colons was second.

(and yep, I goofed around with both those ;)

avazhi•3mo ago

The em dash is just one of a group of traits that make something obviously written by a bot. If you use em dashes in conjunction with good writing then nobody will give a shit.

damnesian•3mo ago

In my mind, their rightful place is transcription of written speech where the speaker pauses, and either inserts an island idea, or changes course. The comma doesn't suffice, because it's bridging an initial idea with expounding on the same idea. But so many times in written text I see it abused, lazily employed, because the author used a sentence fragment for effect, or wanted to amp up the pause and drama when a comma or, hell, even a semi-colon would have served the purpose better.

The advent of the generic AI writing style has had one good effect on my own work: making me take an unflinching look at my own laziness in writing. Now I tend to clean things up while at the same time try to inject some personality in order to NOT be dismissed as AI.

Fricken•3mo ago

Historically I would see far more em-dashes in capital "L" literature than I would in more casual contexts. LLMs assign more weight to literature than to things like reddit comments or Daily Mail articles.

Gigachad•3mo ago

I think this is most of it. The most obvious sign of AI slop is mismatched style with the medium. People are posting generated text to Reddit which reads like a school essay or linkedin inspirational post. Something no one did before. So even though the style is not unprecedented, it’s taken out of its original context.

Etheryte•3mo ago

Another reason that I think contributes to it, at least partially, is that other languages use em-dashes. Most people use LLMs in English, but that's not the only language they know, and many other languages have pretty specific rules and uses for em-dashes. For example, I see em-dashes regularly in local European newspapers, and I would expect those to be written by a human for the most part, simply because LLM output is not good enough in smaller languages.

throwaway81523•3mo ago

I always figured it was because of training on Wikipedia. I used to hate the style zealots (MOStafarians in humorous wiki-jargon) who obsessively enforced typographic conventions like that. Well I still hate them, but I'm sort of thankful that they inadvertently created an AI-detection marker. I've been expecting the AI slop generators to catch on and revert to hyphens though.

lordnacho•3mo ago

My pet theory is similar to the training set hypothesis: em-dashes appear often in prestige publications. The Atlantic, The New Yorker, The Economist, and a few others that are considered good writing. Being magazines, there's a lot of articles over time, reinforcing the style. They're also the sort of thing a RLHF person will think is good, not because of the em-dash but because the general style is polished.

One thing I wondered is whether high prestige writing is encoded into the models, but it doesn't seem far fetched that there's various linkages inside the data to say "this kind of thing should be weighted highly."

kubb•3mo ago

It also seems that LLMs are using them correctly — as a pause or replacement for a comma (yes, I know this is an imprecise description of when to use them).

Thanks to LLMs I learned that using the short binding dash everywhere is incorrect, and I can improve my writing because of it.

number6•3mo ago

Before the rise of the llms there was a post here on hn where someone explained how to use all the dashes — sadly llms took them from us

cornonthecobra•3mo ago

This is mine as well, with the addition of books. If someone wanted to train a bot to sound more human, they would select data that is verifiably human-made.

The approachable tone of popular print media also preselects for the casual, highly-readable style I suspect users would want from a bot.

tim333•3mo ago

That kind of fits with Altman saying they put them in because users liked them (https://www.linkedin.com/posts/curtwoodward_chatgpt-em-dash-...)

I guess in the past if you'd shown me a passage with em dashes I'd say it looks good because I associate it with the New Yorker and Economist, both of which I read. Now I'd be a bit more meh due to LLMs.

mailarchis•3mo ago

pg uses em-dashes too. I found it interesting to see em-dashes in his essays from way back in the early 2000s.

lunias•3mo ago

I think you're correct. The first time I encountered (and recognized) an em-dash in someone's writing was in middle school, and the person that wrote it was someone that I considered to be academically superior to myself. I noticed though, that a lot of people in the same "smart kids" group would use them; almost as if they had worked together on their papers. Maybe they were just reading different material, but it definitely came across as: this will make my writing "look smart".

spidersouris•3mo ago

What we also learned after GPT-3.5 is that, to circumvent the need for new training data, we could simply resort to existing LLMs to generate new, synthetic data. I would not be surprised if the em dash is the product of synthetically generated data (perhaps forced to be present in this data) used for the training of newer models.

IshKebab•3mo ago

The conclusion is really a guess unfortunately.

kristopolous•3mo ago

Are people surprised that training biases a distinct style? I'd think it's kind of expected

iddan•3mo ago

I’m now reading Pride and Prejudice (first edition released in 1813) and indeed there are many em dashes. It also includes language patterns the models didn’t pick up (vocabulary, to morrow instead of tomorrow)

moffkalast•3mo ago

I'm gonna start calling it yes terday.

hdgvhicv•3mo ago

Yesterday’s yes terday is today’s yes today.

keiferski•3mo ago

Yester-day feels plausible and kind of elegant.

hshdhdhehd•3mo ago

Yes. Turd day.

DonHopkins•3mo ago

All my trou bles were so far away.

sixhobbits•3mo ago

I would think the most obvious explanation is that they are used as part of the watermark to help OpenAI identify text - i.e. the model isn't doing it at all but a final-pass process is adding in statistical patterns on top of what the model actually generates (along with words like 'delve' and other famous GPT signatures)

I don't have evidence that that's true, but it's what I assume and I'm surprised it's not even mentioned as a possibility.

When I studied author profiling, I built models that could identify specific authors just by how often they used very boring words like 'of' and 'and', given enough text. So I'm assuming that OpenAI plays around with some variables like that, which would be much harder for humans to spot, but probably uses several layers of watermarking to make it harder to strip, which results in some 'obvious' ones too.
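For illustration only, here is a minimal Python sketch of the function-word idea described above. It is not the models from that study: the short word list, the names `function_word_profile` and `profile_distance`, and the simple absolute-difference distance are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, very short list of "boring" function words.
# Real stylometry work typically uses a much longer list.
FUNCTION_WORDS = ("of", "and", "the", "to", "in")


def function_word_profile(text):
    """Return the relative frequency of each function word in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    counts = Counter(words)
    # Guard against empty input so we never divide by zero.
    return {w: (counts[w] / total if total else 0.0) for w in FUNCTION_WORDS}


def profile_distance(a, b):
    """Sum of absolute frequency differences between two profiles."""
    return sum(abs(a[w] - b[w]) for w in FUNCTION_WORDS)


if __name__ == "__main__":
    known = function_word_profile("the cat and the dog sat in the sun")
    unknown = function_word_profile("of mice and men and of other things")
    print(profile_distance(known, unknown))
```

A smaller distance suggests a closer match in this toy setup, though telling real authors apart would need far more text and more features than five words.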

xandrius•3mo ago

Honestly the most obvious explanation is that the training set has a lot of them, not some sort of watermarking conspiracy. Occam's razor at its best.

constantius•3mo ago

Obvious watermarking that consistently gets a lot of hate from vocal minorities (devs, journalists, etc.) would probably be simply removed for the benefit of those other layers you mention.

But the idea of watermarking layers is fascinating (and they are extremely likely to exist), thanks!

spuz•3mo ago

According to the CEO of Medium, the reason is because their founder, Ev Williams, was a fan of typography and asked that their software automatically convert two hyphens (--) into a single em-dash. Then since Medium was used as a source for high-quality writing, he believes AI picked up a preference for em-dashes based on this writing.

https://youtu.be/1d4JOKOpzqU?si=xXDqGEXiawLtWo5e&t=569
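The kind of auto-conversion described above can be sketched in a few lines of Python. This is a guess at the general technique, not Medium's actual code: the function name `convert_double_hyphens` is made up for the example, and leaving runs of three or more hyphens untouched is an assumption.

```python
import re

EM_DASH = "\u2014"  # Unicode code point U+2014


def convert_double_hyphens(text):
    """Replace each run of exactly two hyphens with an em dash.

    The lookbehind and lookahead keep longer runs (such as '---')
    and single hyphens (as in 'well-known') unchanged.
    """
    return re.sub(r"(?<!-)--(?!-)", EM_DASH, text)


if __name__ == "__main__":
    print(convert_double_hyphens("wait--what"))
```

Any editor that applies a rule like this to everything typed into it would end up publishing em dashes even for writers who never typed one directly.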

bazoom42•3mo ago

Isn’t the two hyphens just a traditional way to emulate m-dash in ascii? I believe Word does the same.

ifh-hn•3mo ago

I thought 2 hyphens is en-dash and 3 was em-dash.

don_neufeld•3mo ago

[Founding CTO of Medium here]

It wasn’t just Ev - I can confirm that many of us were typography nuts ;)

Marcin for example - did some really crazy stuff.

https://medium.design/crafting-link-underlines-on-medium-7c0...

trvz•3mo ago

[flagged]

don_neufeld•3mo ago

Oh we definitely were, I don’t know too many of the folks there these days, it’s been 12 years since I left.

Hostile? That’s definitely a take. Curious what you’re thinking there.

aapoalas•3mo ago

I guess one possible avenue of thought is that when I opened the linked article, I had a few seconds to start reading before I got one full screen modal dialog, followed by another 1/5th height popup dialog on top of that to click away.

Not that most websites are any better. My favourites are basically the ones that just show a default "sorry but this content is blocked in your region" text.

Silhouette•3mo ago

Probably the most annoying thing on the web lately is Cloudflare and all the "mysteriously verifying that you're a real human" junk.

Probably the second most annoying thing on the web today is when you click a link that looks interesting but the page you land on almost immediately says you have to do or pay something to actually read the thing the referring page implied. I don't even start reading a Medium article now if I can see that pop-up below - it's just an instinctive reaction to close the tab. I wish people wouldn't link to articles in walled gardens and search engines would remove those articles from their index - or if that's not reliable then exclude entire sites. Those walls break the whole cross-linking model that made the web the success it is and they waste people's time on a global scale.

I recognise that my position may be somewhat hypocritical because I'd rank AI slop as #3 and maybe #1 and #2 are making some kind of attempt to avoid supporting AI slop. But then I'd propose a more draconian solution to that problem as well - one involving punitive penalties for AI companies that scrape others' content without permission to train their models and possibly for anyone else using models that are tainted.

don_neufeld•3mo ago

“Probably the second most annoying thing on the web today is when you click a link that looks interesting but the page you land on almost immediately says you have to do or pay something to actually read the thing the referring page implied.”

If you feel you’re entitled to everyone else’s labor - I dunno what to tell you.

On the other hand, if you value your own time so little that the only amount you're willing to invest in the quality of what you read is $0 - I also don’t know what to tell you.

Either way, I hope you figure it out.

Medium (at least what it is today) tries to bring down the friction of making valuable content available at a reasonable price.

The alternative solutions the web has come up with are to take the valuable content and lock it up in hundreds of silos (Substack, etc), leave residual low value content marketing available, and then cover most everything else with a browser melting level of “adtech”

amanaplanacanal•3mo ago

I remember a time when Google search would downrank you if you showed different content to the user than you showed to Google. I wish we had that functionality back.

Kye•3mo ago

Those were called doorway pages.

https://en.wikipedia.org/wiki/Doorway_page

Silhouette•3mo ago

> If you feel you’re entitled to everyone else’s labor - I dunno what to tell you.

You're perfectly entitled to keep your content commercial if you want. Just don't put it in the same place as the freely available material that everyone else was working with and then complain when people find you irritating. Some of us are content to share our own work for free on the web and to enjoy work that is offered freely by others. We're all doing it right now on HN and many of us run non-commercial blogs of our own too. And we made the web an interesting and useful place long before sites like Medium came along and tried to centralise and commercialise it.

tipiirai•3mo ago

I'm pretty sure you know what "hostile" means in this context — and what has happened to Twitter after Elon bought it.

don_neufeld•3mo ago

I really don’t, no.

I’m also not affiliated with twitter or Elon at all, so not sure what the rest is about.

dkersten•3mo ago

Just now I opened a medium site and before I could even start reading I was hit with a popup to download the mobile app, some other popup that I ignored (cookies I guess), and within a second or two, a full screen modal asking me to subscribe. Often I also get a pay wall. All within seconds of opening the site. If that’s not hostile, I don’t know what is.

Needless to say, I closed the tab. No content is worth dealing with that over.

Sure plenty of other sites do it too but “other people do it” doesn’t mean it’s not hostile nor does it excuse the behavior. Medium is and has always been one of my most hated sites because a lot of tech people post there, a lot of medium links are submitted to HN, yet it’s a horrible place for the reader.

dang•3mo ago

Please don't be a jerk on HN. You can make your substantive points without that.

https://news.ycombinator.com/newsguidelines.html

nicwolff•3mo ago

He fixed underlines on Medium 11 years ago – and someone un-fixed them since then?

hshdhdhehd•3mo ago

If Medium was a source, why don't AI models stop halfway through their output and ask for a subscription and/or payment?

spuz•3mo ago

The whole interview goes into that and talks about the benefits and costs of allowing search and AI crawlers access to Medium articles.

scrollaway•3mo ago

Give OpenAI a few more months :)

steve1977•3mo ago

> since Medium was used as a source for high-quality writing

That explains a lot…

dagmx•3mo ago

That’s not just a Medium thing, lots of text systems do exactly that.

Apple has done it across their systems for ages. Microsoft did it in Word for a long time too.

It was more or less standard on any tool that was geared towards writers long before Medium was a thing.

xg15•3mo ago

The "book scanning" hypothesis doesn't sound so bad — but couldn't it simply be OCR bias? I imagine it's pretty easy for OCR software to misrecognize hyphens or other kinds of dashes as em-dashes if the only distinction is some subtle differences in line length.

flowerthoughts•3mo ago

You'd think context-less OCR would prefer interpreting it as a simple hyphen, since that's the most common dash. Seems unlikely any bias would go the other way.

neuroelectron•3mo ago

I wonder what happens to all that 18th-century book scanning data. I imagine it stays proprietary, and I've heard a lot of the books they scan are destroyed afterwards.

stonecharioteer•3mo ago

I've been using em-dashes in my own writing for years and it's annoying when I get accused of using AI in my posts. I've since switched to using commas, even though it's not the same.

manuelmoreale•3mo ago

You should tell the people that are accusing you to go fuck themselves and you should keep writing the way you like. You were here before AI, don't let it dictate how you behave.

0xbadc0de5•3mo ago

My first thought was watermarking. Same for its affinity for using emojis in bullet lists.

keiferski•3mo ago

I am no grammarian, but I feel like em-dashes are an easy way to tie together two different concepts without rewriting the entire sentence to flow more elegantly. (Not to say that em-dashes are inelegant, I like them a lot myself.)

And so AI models are prone to using them because they require less computation than rewriting a sentence.

bitshiftfaced•3mo ago

This is sort of my thinking too. It's finding next token once the previous ones have been generated. Dashes are an efficient way to continue a thought once you've already written a nearly complete sentence, but it doesn't create a run-on sentence. They're efficient in the sense that they allow more future grammatically correct options even when you've committed to previous tokens.

byyoung3•3mo ago

Because Sam Altman said so

DonHopkins•3mo ago

Then I prefer Sam Altman's pesky em-dashes to Elon Musk's relentless white supremacist propaganda.

byyoung3•3mo ago

Cool story bro

numpad0•3mo ago

I think the more correct question is why humans don't use em dashes in the first place while LLMs do all the time. And the short answer to that is, because it's Unicode stuff.

Regular computers for human use still only support ASCII in the US or ISO-8859-1 in the EU to this day, and Unicode-reliant East Asian users turn off Unicode input modes before typing English words, leaving the Asian part mostly in pure Unicode and the alphanumeric part in pure ASCII. So Unicode-ASCII mixed text is just odd by itself. This in turn makes use of em dashes odd.

Same with emojis. LLMs generate Unicode-mapped tokens directly, so they can vocalize any characters within full Unicode ranges. Humans with keyboards (physical or touchscreen) can mostly only produce what's on them.
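As a quick check of the encoding side of this argument, the Python sketch below tests which encodings can represent an em dash (U+2014). The helper name `can_encode` is made up for the example, and it only shows that the character sits outside ASCII and Latin-1 (ISO-8859-1); it says nothing about what any particular keyboard or input method can produce.

```python
EM_DASH = "\u2014"  # the em dash, Unicode code point U+2014


def can_encode(char, encoding):
    """Return True if `char` can be represented in `encoding`."""
    try:
        char.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for name in ("ascii", "latin-1", "utf-8"):
        print(name, can_encode(EM_DASH, name))
    # The plain hyphen-minus, by contrast, is ordinary ASCII.
    print("hyphen in ascii:", can_encode("-", "ascii"))
```

The em dash needs a Unicode-capable encoding such as UTF-8, where it takes three bytes, while the hyphen on every keyboard is a single ASCII byte.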

adi_kurian•3mo ago

I have always found this complaint quite odd. Em-dashes are great. I use them all the time.

Never spent too much time thinking about em-dashes. Writers I like probably use them all the time—again, never really thought about it.

There are many other language model artifacts that are genuinely shite and are worth criticizing. Though, come to think of it, they have been getting stamped out with each iteration in model. Used to spend a lot of time trying to get models to refrain from words like "crucial".

What I do find strange is how the latest SOTA models appear to write with contractions by default, which began sometime in the past year. Anthropic models, in particular.

SecretDreams•3mo ago

LLMs partially ruined em dashes for me, but I still use them.

danielodievich•3mo ago

In written Russian, quoted speech is prefixed with an em-dash instead of being wrapped in double quotes as it would be in a typical English book:

Instead of

"The time has come," the Walrus said,

"To talk of many things:"

... it would be spelled as

— The time has come, — the Walrus said,

— To talk of many things:

I wonder how much Russian-language content was in the training data.

atoav•3mo ago

As someone who used em-dashes extensively before LLMs I can only hope (?) some of myself is in there. I really liked em-dashes, but now I have to actively avoid them, because many people use them as a marker to recognize text that has been invented by the stochastic machine.

kentbrew•3mo ago

Robert A. Heinlein used a lot of em-dashes and much of the Internet was created by Heinlein fanboys?

shadowvoxing•3mo ago

This episode of Big Technology Podcast goes into the reason why: https://pca.st/episode/4090833a-2abd-42b2-a31d-ebb2b4348007

AbstractH24•3mo ago

My question is given their satirical association with AI, why haven’t the models been manually optimized not to use them?

iainctduncan•3mo ago

This has always seemed intuitively obvious to me. I use a lot of em dashes... because I read a lot. Including a lot of older, academic, or more formally written books. And the amount used in AI prose has never struck me as odd for the same reason. (Ditto for semi colons).

The truth is ... most people don't read much. So it's not too surprising they think it looks weird if all they read is posts on the internet, where the average writer has never even learned how to make one on the keyboard.

Delve on the other hand, that shit looks weird. That is waaay over-represented.

redheadednomad•3mo ago

"If AI labs wanted to go beyond that, they’d have to go and buy older books, which would probably have more em-dashes."

Actually, they wouldn't have to go and buy these old books: The texts are already available copyright free, due to legislation stating that copyright expires 70 years after the author's death (any book published in the USA before 1923 is also reproducible without adherence to copyright laws), making the full texts of old books much easier to find on the internet!

mrandish•3mo ago

> real humans who like em-dashes have stopped using them out of fear of being confused with AI.

Yeah, this is me. I've always liked good type and typography. 5 or 6 years ago I added an em-dash to my keyboard configs to make typing it convenient - mostly because I think it just looks nicer. But lately I don't use it much because... AI.

However, in recent weeks someone accused an HN post of mine of being from a bot, despite the fact I used a plain old hyphen and not an em-dash. There was nothing in the post which seemed AI-like except possibly that hyphen. At the time, I realized that person probably just couldn't tell a hyphen from a real em-dash. So maybe that means I have to not use any dash at all.

batterylake•3mo ago

Very interesting topic. I also wonder why other signs of AI writing, such as negative parallelism ("It's not just X, it's Y"), are preferred by the models.

Also, I wrote a small extension that automatically replaces em dashes in ChatGPT responses with alternative punctuation marks: https://github.com/nckclrk/rm-em-dashes

qubex•3mo ago

I’m amongst those who used to use em-dashes and now seek to actively avoid them.
