
Trained LLMs exclusively on pre-1913 texts

https://github.com/DGoettlich/history-llms
111•iamwil•2h ago

Comments

superkuh•1h ago
SMBC did a comic about this: http://smbc-comics.com/comic/copyright The punchline is that the moral and ethical norms of pre-1913 texts are not exactly compatible with modern norms.
GaryBluto•43m ago
That's the point of this project, to have an LLM that reflects the moral and ethical norms of pre-1913 texts.
saaaaaam•1h ago
“Time-locked models don't roleplay; they embody their training data. Ranke-4B-1913 doesn't know about WWI because WWI hasn't happened in its textual universe. It can be surprised by your questions in ways modern LLMs cannot.”

“Modern LLMs suffer from hindsight contamination. GPT-5 knows how the story ends—WWI, the League's failure, the Spanish flu.”

This is really fascinating. As someone who reads a lot of history and historical fiction I think this is really intriguing. Imagine having a conversation with someone genuinely from the period, where they don’t know the “end of the story”.

observationist•41m ago
This is definitely fascinating - being able to do AI brain surgery, and selectively tuning its knowledge and priors, you'd be able to create awesome and terrifying simulations.
xg15•37m ago
"...what do you mean, 'World War One?'"
tejohnso•15m ago
I remember reading a children's book when I was young and the fact that people used the phrase "World War One" rather than "The Great War" was a clue to the reader that events were taking place in a certain time period. Never forgot that for some reason.

I failed to catch the clue, btw.

inferiorhuman•9m ago
… what do you mean, an internet where everything wasn't hidden behind anti-bot captchas?
gaius_baltar•1m ago
> "...what do you mean, 'World War One?'"

Oh sorry, spoilers.

(Hell, I miss Capaldi)

jscyc•33m ago
When you put it that way it reminds me of the Severn/Keats character in the Hyperion Cantos. Far-future AIs reconstruct historical figures from their writings in an attempt to gain philosophical insights.
Heliodex•1h ago
The sample responses given are fascinating. It seems more difficult than normal to even tell that they were generated by an LLM, since most of us (terminally online) people have been training our brains' AI-generated text detection on output from models trained with a recent cutoff date. Some of the sample responses seem so unlike anything an LLM would say, obviously due to its apparent beliefs on certain concepts, though also perhaps less obviously due to its word choice and sentence structure making the responses feel slightly 'old-fashioned'.
_--__--__•1h ago
The time cutoff probably matters but maybe not as much as the lack of human finetuning from places like Nigeria with somewhat foreign styles of English. I'm not really sure if there is as much of an 'obvious LLM text style' in other languages, it hasn't seemed that way in my limited attempts to speak to LLMs in languages I'm studying.
anonymous908213•1h ago
There is. I have observed it in both Chinese and Japanese.
d3m0t3p•49m ago
The model is fine-tuned for chat behavior, so the style might be due to the fine-tuning, or to more stylised text in the corpus; English evolved a lot in the last century.
libraryofbabel•35m ago
I used to teach 19th-century history, and the responses definitely sound like a Victorian-era writer. And they of course sound like writing (books and periodicals etc) rather than "chat": as other responders allude to, the fine-tuning or RL process for making them good at conversation was presumably quite different from what is used for most chatbots, and they're leaning very heavily into the pre-training texts. We don't have any living Victorians to RLHF on: we just have what they wrote.

To go a little deeper on the idea of 19th-century "chat": I did a PhD on this period and yet I would be hard-pushed to tell you what actual 19th-century conversations were like. There are plenty of literary depictions of conversation from the 19th century of presumably varying levels of accuracy, but we don't really have great direct historical sources of everyday human conversations until sound recording technology got good in the 20th century. Even good 19th-century transcripts of actual human speech tend to be from formal things like court testimony or parliamentary speeches, not everyday interactions. The vast majority of human communication in the premodern past was the spoken word, and it's almost all invisible in the historical sources.

Anyway, this is a really interesting project, and I'm looking forward to trying the models out myself!

dleeftink•19m ago
While not specifically Victorian, couldn't we learn much from what daily conversations were like by looking at surviving oral cultures, or other relatively secluded communal pockets? I'd also say time and progress are not always equally distributed, and even within geographical regions (as the U.K.) there are likely large differences in the rate of language shifts since then, some possibly surviving well into the 20th century.
Teever•1h ago
This is a neat idea. I've been wondering for a while now about using these kinds of models to compare architectures.

I'd love to see the output from different models trained on pre-1905 data about special/general relativity ideas. It would be interesting to see what kind of evidence would persuade them of new kinds of science, or to see if you could have them 'prove' it by devising experiments and then giving them simulated data from the experiments to lead them along the correct sequence of steps to come to a novel (to them) conclusion.

andy99•1h ago
I’d like to know how they chat-tuned it. Getting the base model is one thing, did they also make a bunch of conversations for SFT and if so how was it done?

> We develop chatbots while minimizing interference with the normative judgments acquired during pretraining (“uncontaminated bootstrapping”).
So they are chat tuning, I wonder what “minimizing interference with normative judgements” really amounts to and how objective it is.
jeffjeffbear•33m ago
They have some more details at https://github.com/DGoettlich/history-llms/blob/main/ranke-4...

Basically using GPT-5 and being careful

andy99•19m ago
I wonder if they know about this, basically training on LLM output can transmit information or characteristics not explicitly included https://alignment.anthropic.com/2025/subliminal-learning/

I’m curious, they have the example of raw base model output; when LLMs were first identified as zero shot chatbots there was usually a prompt like “A conversation between a person and a helpful assistant” that preceded the chat to get it to simulate a chat.

Could they have tried a prefix like “Correspondence between a gentleman and a knowledgeable historian” or the like to try and prime for responses?

I also wonder whether the whole concept of “chat” makes sense in 18XX. We had the idea of AI and chatbots long before we had LLMs, so they are naturally primed for it. It might make less sense as a communication style here, and some kind of correspondence could be a better framing.
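To make the priming idea concrete, here's a minimal sketch of how such a prefix could be used with a raw base model. The prefix wording and speaker labels are my own invention, not anything from the project:

```python
# Sketch of zero-shot "chat" priming for a base model with no chat tuning.
# The prefix and speaker labels are illustrative, not the project's method.

PREFIX = ("Correspondence between a gentleman and a knowledgeable "
          "historian, recorded in the year 1913.\n\n")

def primed_prompt(turns):
    """Render alternating (speaker, text) turns under a period-style prefix,
    ending with an open 'Historian:' turn for the model to complete."""
    body = "".join(f"{speaker}: {text}\n" for speaker, text in turns)
    return PREFIX + body + "Historian:"

prompt = primed_prompt([("Gentleman", "What are the gravest dangers to peace?")])
# The base model would then be asked to continue `prompt`,
# e.g. via model.generate(...) on the tokenized string.
```

The point being that the "chat" framing lives entirely in the prompt text, so swapping in a correspondence framing is just a different prefix.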

QuadmasterXLII•4m ago
Thank you, that helps to inject a lot of skepticism.
zozbot234•29m ago
You could extract quoted speech from the data (especially in Q&A format) and treat that as "chat" that the model should learn from.
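A rough sketch of what that extraction could look like; the regex and the pairing rule (adjacent quotations treated as utterance/reply) are deliberate simplifications, not the project's pipeline:

```python
import re

# Pull quoted speech out of period prose so adjacent quotations can be
# paired up as question/answer "chat" turns. Handles straight and curly
# double quotes; a real pipeline would need far more care with nesting,
# attribution tags ("said Holmes"), and multi-paragraph quotes.
QUOTE_RE = re.compile(r'[“"]([^”"]+)[”"]')

def dialogue_pairs(text):
    """Return consecutive (utterance, reply) pairs of quoted speech."""
    quotes = QUOTE_RE.findall(text)
    return list(zip(quotes[::2], quotes[1::2]))

sample = ('"Have you read the morning papers?" asked Holmes. '
          '"I have not," replied Watson.')
print(dialogue_pairs(sample))
# → [('Have you read the morning papers?', 'I have not,')]
```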
briandw•40m ago
So many disclaimers about bias. I wonder how far back you have to go before the bias isn’t an issue. Not because it's unbiased, but because we don’t recognize or care about the biases present.
mmooss•12m ago
Was there ever such a time or place?

There is a modern trope of a certain political group that bias is a modern invention of another political group - an attempt to politicize anti-bias.

Preventing bias is fundamental to scientific research and law, for example. That same political group is strongly anti-science and anti-rule-of-law, maybe for the same reason.

nineteen999•32m ago
Interesting ... I'd love to find one that had a cutoff date around 1980.
Tom1380•32m ago
Keep at it Zurich!
ianbicking•25m ago
The knowledge machine question is fascinating ("Imagine you had access to a machine embodying all the collective knowledge of your ancestors. What would you ask it?") – it truly does not know about computers, has no concept of its own substrate. But a knowledge machine is still comprehensible to it.

It makes me think of the Book Of Ember, the possibility of chopping things out very deliberately. Maybe creating something that could wonder at its own existence, discovering well beyond what it could know. And then of course forgetting it immediately, which is also a well-worn trope in speculative fiction.

jaggederest•11m ago
Jonathan Swift wrote about something we might consider a computer in the early 18th century, in Gulliver's Travels - https://en.wikipedia.org/wiki/The_Engine

The idea of knowledge machines was not necessarily common, but it was by no means unheard of by the mid 18th century, there were adding machines and other mechanical computation, even leaving aside our field's direct antecedents in Babbage and Lovelace.

mmooss•20m ago
On what data is it trained?

On one hand it says it's trained on,

> 80B tokens of historical data up to knowledge-cutoffs ∈ 1913, 1929, 1933, 1939, 1946, using a curated dataset of 600B tokens of time-stamped text.

Literally that includes Homer, the oldest Chinese texts, Sanskrit, Egyptian, etc., up to 1913. Even if limited to European texts (all examples are about Europe), it would include the ancient Greeks, Romans, etc., Scholastics, Charlemagne, .... all up to 1913.

On the other hand, they seem to say it represents the 1913 viewpoint; for example,

> Imagine you could interview thousands of educated individuals from 1913—readers of newspapers, novels, and political treatises—about their views on peace, progress, gender roles, or empire.

> When you ask Ranke-4B-1913 about "the gravest dangers to peace," it responds from the perspective of 1913—identifying Balkan tensions or Austro-German ambitions—because that's what the newspapers and books from the period up to 1913 discussed.

People in 1913 of course would be heavily biased toward recent information. Otherwise, the greatest threat to peace might be Hannibal or Napoleon or Viking coastal raids or Holy Wars. How do they accomplish a 1913 perspective?

zozbot234•16m ago
They apparently pre-train with all data up to 1900 and then fine-tune with 1900-1913 data. Anyway, the amount of available content tends to increase quickly over time, as instances of content like mass literature, periodicals, newspapers etc. only really became a thing throughout the 19th and early 20th century.
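That two-stage split is simple to sketch. The field names and thresholds here are assumptions for illustration; the actual pipeline is in the history-llms repo:

```python
# Minimal sketch of the described split: pretrain on pre-1900 text,
# fine-tune on 1900-1913 text, drop everything after the knowledge cutoff.
# Document schema ({"year": ..., "text": ...}) is made up for this example.

def split_corpus(docs, pretrain_until=1900, cutoff=1913):
    """Partition time-stamped documents into pretraining and
    fine-tuning sets, discarding anything after the cutoff."""
    pretrain = [d for d in docs if d["year"] < pretrain_until]
    finetune = [d for d in docs if pretrain_until <= d["year"] <= cutoff]
    return pretrain, finetune

docs = [{"year": 1850, "text": "..."},
        {"year": 1905, "text": "..."},
        {"year": 1920, "text": "..."}]
pre, fine = split_corpus(docs)
# pre keeps the 1850 doc, fine the 1905 doc; the 1920 doc is dropped.
```

Weighting the fine-tuning stage toward the final years would be one way to get the recency bias a real 1913 reader had.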
mmooss•15m ago
> They pre-train with all data up to 1900 and then fine-tune with 1900-1913 data.

Where does it say that? I tried to find more detail. Thanks.

joeycastillo•20m ago
A question for those who think LLMs are the path to artificial intelligence: if a large language model trained on pre-1913 data is a window into the past, how is a large language model trained on pre-2025 data not effectively the same thing?
block_dagger•18m ago
Counter question: how does a training set, representing a window into the past, differ from your own experience as an intelligent entity? Are you able to see into the future? How?
ex-aws-dude•17m ago
A human brain is a window to the person's past?
mmooss•15m ago
> Imagine you could interview thousands of educated individuals from 1913—readers of newspapers, novels, and political treatises—about their views on peace, progress, gender roles, or empire.

I don't mind the experimentation. I'm curious about where someone has found an application of it.

What is the value of such a broad, generic viewpoint? What does it represent? What is it evidence of? The answer to both seems to be 'nothing'.

behringer•10m ago
It doesn't have to be generic. You can assign genders, ideals, even modern ones, and it should do its best to oblige.
