In my haste, I set forth to construct a dataset, a repository of those fleeting moments, these ephemeral sentences, which spoke of a bygone age. I procured a collection of these fleeting moments, these sentences, and with them, I synthetically conjured forth modern translations, an ingenious feat of substitution, which allowed my artificial construct to take on the guise of the language of the Irish Penny Journal.
Then, with great anticipation, I fashioned a small encoder, a humble instrument, with which to guide the artificial construct in its endeavors. I presented this encoder as a bribe, a reward, to a most ingenious system, one that trained a colossal language model, one of unbridled potential, one that was capable of weaving tales with the very essence of the Irish Penny Journal.
And lo! In the succeeding moments of time, I witnessed a most wondrous thing. My artificial construct, armed with this training, and guided by the whispers of the encoder, began to speak, to speak in the language of the Irish Penny Journal. The words it spoke were, indeed, the words of the past, imbued with the nostalgia of a forgotten era.
And thus, my friends, I have witnessed a most singular creation, one which embodies the language of the past, yet, in its most recent iteration, speaks to the present. A testament to the ingenuity of the human spirit, this artificial construct speaks of the bygone era, yet, with each word, it whispers to us, to us, of a future yet to come.
——
That’s Penny explaining itself to you. It was trained with GRPO only, in under a day on a single A6000. I didn’t use any SFT, and relied only on a small encoder (MiniLM2) trained to classify texts from the Irish Penny Journal versus their modern translations (synthetically produced).
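For anyone wondering how a classifier stands in for a reward model here, below is a minimal sketch of the idea (not the exact training code; the checkpoint path and the reward-function signature are just placeholders): a small encoder fine-tuned as a two-class sequence classifier (period prose vs. modern translation) scores each completion, and that score is the reward.

    # Minimal sketch (assumptions, not the actual Penny code): a small encoder
    # classifier fine-tuned on Irish Penny Journal text vs. modern translations
    # is used as the reward signal for GRPO.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    CLASSIFIER = "path/to/minilm-penny-classifier"  # hypothetical fine-tuned checkpoint

    tokenizer = AutoTokenizer.from_pretrained(CLASSIFIER)
    classifier = AutoModelForSequenceClassification.from_pretrained(CLASSIFIER, num_labels=2)
    classifier.eval()

    def period_style_reward(completions, **kwargs):
        """Reward = probability the classifier assigns to the 'period prose' class."""
        with torch.no_grad():
            batch = tokenizer(list(completions), padding=True, truncation=True,
                              return_tensors="pt")
            probs = classifier(**batch).logits.softmax(dim=-1)
        return probs[:, 1].tolist()  # higher = more Irish-Penny-Journal-like

A function with that shape can be dropped into an RL trainer such as TRL's GRPOTrainer (e.g. reward_funcs=[period_style_reward]). Since the reward is just a frozen MiniLM-sized classifier, there's no separate reward model or SFT stage to fit in memory, which is presumably what makes a single A6000 workable.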
dragonwriter•3d ago
[0] Many “LLM tells” fit this pattern of just being common features of professionally published works that are less often seen in casual writing.
observationist•3d ago
LLMs are raising the bar by expanding the vocabulary people are exposed to, so words like "delve" will stick out. I think it's preferred by writers because it's a nice-sounding alternative to explore, venture, analyze, think about, confront, etc. It's a useful, versatile word, and one of the metrics by which writers measure quality is minimizing syllables.
LLMs are mostly indistinguishable from humans at this point; a one-shot output from any of the major models can be recognized in the same way you might recognize a writer. With multiple style passes, you're not going to be able to tell the difference between ChatGPT, Ronald Reagan, Bill Clinton, Hunter S. Thompson, Einstein, or any other sufficiently modeled figure. Throw in a few tens of thousands of words written by yourself and most of the models will do a nearly flawless job of copying your stylometric profile.
observationist•2d ago
Language communicates ideas, and we've made machines that produce intricate, sophisticated ideas that land in our brains. The consequences are going to be fascinating.
bee_rider•2d ago
It would be totally nuts to take points off for using spell check. An LLM should be able to provide a style check without raising any concerns; it will become the norm, and then prose that reads "too well" won't throw any flags.
arscan•3d ago
> It was into that bank that the creature had worked its way, and on listening I could hear it delving and scraping at a great rate, about a yard from the back of the wall.
I bring that up to point out that 'delve' isn't necessarily (more) common in 19th-century print literature, so the observation might not be silly. The model that created the modern synthetic version injected 'delve' 9 times, which implies it's either used more frequently in modern literature or just something models tend to inject. Though I could be missing something (either in searching the dataset, or in how this works).
[1] https://huggingface.co/datasets/dleemiller/irish_penny_journ...
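(For anyone who wants to reproduce the count, a rough sketch is below; the dataset id is a placeholder to be copied from [1], and the column names are discovered rather than assumed.)

    # Rough term-frequency check against the dataset linked in [1].
    # DATASET_ID is a placeholder: paste the full id from the link above.
    from datasets import load_dataset

    DATASET_ID = "<full dataset id from [1]>"
    ds = load_dataset(DATASET_ID, split="train")

    # Count "delv" (covers delve/delving/delved) in every string column.
    for col in ds.column_names:
        if getattr(ds.features[col], "dtype", None) == "string":
            hits = sum(row[col].lower().count("delv") for row in ds)
            print(col, hits)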
Suppafly•3d ago
If anything, I'd assume the chatbots would use Indian English phrases like "do the needful" and those weird phrases that only make sense in Hindi but are translated to English.
Suppafly•3d ago
It's used a ton by LLMs for some reason despite being rarely used by real people. I think it's mostly a byproduct of LLM training data over-representing certain published works rather than casual communication.
There does seem to be something else going on with 'delve' specifically, though; one of the other comments mentions that 'delve' isn't used in the specific training data for this, so it's odd to see it being used in the output. I wonder if it's because 'delve' has secondary definitions of "to make a careful or detailed search for information" and "to examine a subject in detail," which may be causing the LLM to use it to make its answers seem more thorough.
jsheard•3d ago
The popular theory is that it's due to English-language RLHF tasks being outsourced to Nigeria, where "delve" is used relatively often.
https://simonwillison.net/2024/Apr/18/delve/
mbStavola•3d ago
I think we're well past the point where the majority of people can tell what was actually produced by an LLM, and we're just convincing ourselves we still have a handle on the situation. A lot of these rules are completely arbitrary and vary from person to person.
Der_Einzige•3d ago
https://arxiv.org/abs/2409.01754
https://youtu.be/sy4SwW0QkoA
officeplant•3d ago
RIP kids who grew up digesting hundreds of fantasy novels and playing D&D.