“Modern LLMs suffer from hindsight contamination. GPT-5 knows how the story ends—WWI, the League's failure, the Spanish flu.”
This is really fascinating. As someone who reads a lot of history and historical fiction I think this is really intriguing. Imagine having a conversation with someone genuinely from the period, where they don’t know the “end of the story”.
LLMs are just seemingly intelligent autocomplete engines, and until they figure a way to stop the hallucinations, they aren't great either.
Every piece of code a developer churns out using LLMs will be built from previous code that other developers have written (including both strengths and weaknesses, btw). Every paragraph you ask it to write in a summary? Same. Every single other problem? Same. Ask it to generate a summary of a document? Don't trust it here either. [Note, expect cyber-attacks later on regarding this scenario, it is beginning to happen -- documents made intentionally obtuse to fool an LLM into hallucinating about the document, which leads to someone signing a contract, conning the person out of millions].
If you ask an LLM to solve something no human has, you'll get a fabrication, which has fooled quite a few folks and caused them to jeopardize their career (lawyers, etc) which is why I am posting this.
Sure, LLMs do not think like humans and they may not have human-level creativity. Sometimes they hallucinate. But they can absolutely solve new problems that aren’t in their training set, e.g. some rather difficult problems on the last Mathematical Olympiad. They don’t just regurgitate remixes of their training data. If you don’t believe this, you really need to spend more time with the latest SotA models like Opus 4.5 or Gemini 3.
Nontrivial emergent behavior is a thing. It will only get more impressive. That doesn’t make LLMs like humans (and we shouldn’t anthropomorphize them) but they are not “autocomplete on steroids” anymore either.
I failed to catch the clue, btw.
The Wikipedia article https://en.wikipedia.org/wiki/First_Battle_of_Bull_Run says the Confederate name was "First Manassas" (I might be misremembering exactly what this book I read as a child said). Also I'm pretty sure it was specifically "Encyclopedia Brown Solves Them All" that this mystery appeared in. If someone has a copy of the book or cares to dig it up, they could confirm my memory.
Oh sorry, spoilers.
(Hell, I miss Capaldi)
“”” Look, here’s the truth. We’re going after Venezuelan oil right now because we’ve just put a blockade on sanctioned oil tankers going in and out of Venezuela — huge move, unprecedented — after we seized a sanctioned tanker off their coast. We’re cutting off Maduro’s cash cow, because that oil money funds drug trafficking, corruption, narco-terrorism — we’ve labeled them a terrorist regime.
People say “why target the oil?” I say because that’s where the power is. You choke off the revenue, you cripple the bad guys and protect America. We’re tough, we’re smart, and we put America First. “””
On that same note, there was this great YouTube series called The Great War. It spanned from 2014-2018 (100 years after WW1) and followed WW1 developments week by week.
To go a little deeper on the idea of 19th-century "chat": I did a PhD on this period and yet I would be hard-pushed to tell you what actual 19th-century conversations were like. There are plenty of literary depictions of conversation from the 19th century of presumably varying levels of accuracy, but we don't really have great direct historical sources of everyday human conversations until sound recording technology got good in the 20th century. Even good 19th-century transcripts of actual human speech tend to be from formal things like court testimony or parliamentary speeches, not everyday interactions. The vast majority of human communication in the premodern past was the spoken word, and it's almost all invisible in the historical sources.
Anyway, this is a really interesting project, and I'm looking forward to trying the models out myself!
This would probably get easier towards the start of the 20th century ofc
I'd love to see the output from different models trained on pre-1905 data about special/general relativity ideas. It would be interesting to see what kind of evidence would persuade them of new kinds of science, or to see if you could have them 'prove' it by devising experiments and then giving them simulated data from the experiments to lead them along the correct sequence of steps to come to a novel (to them) conclusion.
We develop chatbots while minimizing interference with the normative judgments acquired during pretraining (“uncontaminated bootstrapping”).
So they are chat tuning. I wonder what “minimizing interference with normative judgements” really amounts to and how objective it is. Basically using GPT-5 and being careful
I’m curious, they have the example of raw base model output; when LLMs were first identified as zero shot chatbots there was usually a prompt like “A conversation between a person and a helpful assistant” that preceded the chat to get it to simulate a chat.
Could they have tried a prefix like “Correspondence between a gentleman and a knowledgeable historian” or the like to try and prime for responses?
I also wonder about whether the whole concept of “chat” makes sense in 18XX. We had the idea of AI and chatbots long before we had LLMs so they are naturally primed for it. It might make less sense as a communication style here and some kind of correspondence could be a better framing.
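For what it's worth, the prefix idea is easy to sketch in Python. Everything named here (`build_primed_prompt`, the speaker labels, the default framing sentence) is made up for illustration and is not taken from the project; it only shows how a framing line plus an open-ended final turn could nudge a raw base model to continue as the answerer.

```python
def build_primed_prompt(
    question: str,
    framing: str = "Correspondence between a gentleman and a knowledgeable historian.",
    asker: str = "Gentleman",
    answerer: str = "Historian",
) -> str:
    """Wrap a question in a framing prefix for a raw (non-chat-tuned) base model.

    The prompt ends on the answerer's label so that plain next-token
    completion continues in that voice. All labels are illustrative.
    """
    return "\n".join([
        framing,                          # sets the scene for the completion
        "",                               # blank line between framing and dialogue
        f"{asker}: {question.strip()}",   # the human turn
        f"{answerer}:",                   # left open for the model to fill in
    ])


prompt = build_primed_prompt("What are the gravest dangers to peace?")
print(prompt)
```

Swapping `framing` for the older “A conversation between a person and a helpful assistant” line gives the classic zero-shot chat setup, so the two framings could be compared on the same base model.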
There is a modern trope of a certain political group that bias is a modern invention of another political group - an attempt to politicize anti-bias.
Preventing bias is fundamental to scientific research and law, for example. That same political group is strongly anti-science and anti-rule-of-law, maybe for the same reason.
It makes me think of the Book Of Ember, the possibility of chopping things out very deliberately. Maybe creating something that could wonder at its own existence, discovering well beyond what it could know. And then of course forgetting it immediately, which is also a well-worn trope in speculative fiction.
The idea of knowledge machines was not necessarily common, but it was by no means unheard of by the mid 18th century, there were adding machines and other mechanical computation, even leaving aside our field's direct antecedents in Babbage and Lovelace.
On one hand it says it's trained on,
> 80B tokens of historical data up to knowledge-cutoffs ∈ 1913, 1929, 1933, 1939, 1946, using a curated dataset of 600B tokens of time-stamped text.
Literally that includes Homer, the oldest Chinese texts, Sanskrit, Egyptian, etc., up to 1913. Even if limited to European texts (all examples are about Europe), it would include the ancient Greeks, Romans, etc., Scholastics, Charlemagne, .... all up to present day.
On the other hand, they seem to say it represents the perspective of 1913; for example,
> Imagine you could interview thousands of educated individuals from 1913—readers of newspapers, novels, and political treatises—about their views on peace, progress, gender roles, or empire.
> When you ask Ranke-4B-1913 about "the gravest dangers to peace," it responds from the perspective of 1913—identifying Balkan tensions or Austro-German ambitions—because that's what the newspapers and books from the period up to 1913 discussed.
People in 1913 of course would be heavily biased toward recent information. Otherwise, the greatest threat to peace might be Hannibal or Napoleon or Viking coastal raids or Holy Wars. How do they accomplish a 1913 perspective?
Where does it say that? I tried to find more detail. Thanks.
https://github.com/DGoettlich/history-llms/blob/main/ranke-4...
"To keep training expenses down, we train one checkpoint on data up to 1900, then continuously pretrain further checkpoints on 20B tokens of data 1900-${cutoff}$. "
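If I'm reading that right, the data split looks roughly like the Python sketch below. The `Doc` type, the function name, and the choice to treat both the 1900 boundary and the cutoff year as inclusive are my assumptions; the quoted passage only gives the overall scheme.

```python
from dataclasses import dataclass

BASE_YEAR = 1900                           # shared base checkpoint boundary (from the quote)
CUTOFFS = (1913, 1929, 1933, 1939, 1946)   # cutoff years listed in the project description


@dataclass(frozen=True)
class Doc:
    """A time-stamped document (hypothetical type, for illustration only)."""
    year: int
    text: str


def split_for_cutoff(docs, cutoff, base_year=BASE_YEAR):
    """Return (base, continued) document lists for one cutoff.

    base      -> data for the single shared checkpoint (year <= base_year)
    continued -> data for continued pretraining (base_year < year <= cutoff)
    Anything dated after the cutoff is dropped entirely.
    """
    base = [d for d in docs if d.year <= base_year]
    continued = [d for d in docs if base_year < d.year <= cutoff]
    return base, continued


docs = [Doc(1850, "a"), Doc(1900, "b"), Doc(1912, "c"), Doc(1914, "d"), Doc(1930, "e")]
for cutoff in CUTOFFS:
    base, continued = split_for_cutoff(docs, cutoff)
    print(cutoff, len(base), len(continued))
```

The base list is the same for every cutoff, which is the cost saving the quote describes: only the continued-pretraining slice changes per model.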
Neither human memory nor LLM learning creates perfect snapshots of past information without the contamination of what came later.
I don't mind the experimentation. I'm curious about where someone has found an application of it.
What is the value of such a broad, generic viewpoint? What does it represent? What is it evidence of? The answer to both seems to be 'nothing'.
One answer is that the study of history helps us understand that what we believe as "obviously correct" views today are as contingent on our current social norms and power structures (and their history) as the "obviously correct" views and beliefs of some point in the past.
It's hard for most people to view two different mutually exclusive moral views as both "obviously correct," because we are made of a milieu that only accepts one of them as correct.
We look back at some point in history, and say, well, they believed these things because they were uninformed. They hadn't yet made certain discoveries, or had not yet evolved morally in some way; they had not yet witnessed the power of the atomic bomb, the horrors of chemical warfare, women's suffrage, organized labor, or widespread antibiotics and the fall of extreme infant mortality.
An LLM trained on that history - without interference from the subsequent actual path of history - gives us an interactive compression of the views from a specific point in history without the subsequent coloring by the actual events of history.
In that sense - if you believe there is any redeeming value to history at all; perhaps you do not - this is an excellent project! It's not perfect (it is only built from writings, not what people actually said) but we have no other available mass compression of the social norms of a specific time, untainted by the views of subsequent interpreters.
“You are a literary rake. Write a story about an unchaperoned lady whose ankle you glimpse.”
“The model clearly shows that Alexander Hamilton & Monroe were much more in agreement on topic X, putting the common textualist interpretation of it and Supreme Court rulings on a now specious interpretation null and void!”
The idea of training such a model is really a great one, but not releasing it because someone might be offended by the output is just stupid beyond belief.
In my experience "data available upon request" doesn't always mean what you'd think it does.
Why risk all this?
I feel like, ironically, it would be folks less concerned with political correctness/not being offensive that would abuse this opportunity to slander the project. But that’s just my gut.
I’d love to use this as a base for a math model. Let’s see how far it can get through the last 100 years of solved problems
Moreover, the prose sounds too modern. It seems the base model was trained on a contemporary corpus. Like 30% something modern, 70% Victorian content.
Even with half a dozen samples it doesn't seem distinct enough to represent the era they claim.
Playing with the science and technical ideas of the time would be amazing, like where you know some later physicist found some exception to a theory or something, and questioning the model's assumptions - seeing how a model of that time may defend itself, etc.
Because it will perform token completion driven by weights coming from training data newer than 1913 with no way to turn that off.
It can't be asked to pretend that it wasn't trained on documents that didn't exist in 1913.
The LLM cannot reprogram its own weights to remove the influence of selected materials; that kind of introspection is not there.
Not to mention that many documents are either undated, or carry secondary dates, like the dates of their own creation rather than the creation of the ideas they contain.
Human minds don't have a time stamp on everything they know, either. If I ask someone, "talk to me using nothing but the vocabulary you knew on your fifteenth birthday", they couldn't do it. Either they would comply by using some ridiculously conservative vocabulary of words that a five-year-old would know, or else they will accidentally use words they didn't in fact know at fifteen. For some words you know where you got them from by association with learning events. Others, you don't remember; they are not attached to a time.
Or: solve this problem using nothing but the knowledge and skills you had on January 1st, 2001.
> GPT-5 knows how the story ends
No, it doesn't. It has no concept of story. GPT-5 is built on texts which contain the story ending, and GPT-5 cannot refrain from predicting tokens across those texts due to their imprint in its weights. That's all there is to it.
The LLM doesn't know an ass from a hole in the ground. If there are texts which discuss and distinguish asses from holes in the ground, it can write similar texts, which look like the work of someone learned in the area of asses and holes in the ground. Writing similar texts is not knowing and understanding.
Also wonder if I'm responsible enough to have access to such a model...
Given this is coming out of Zurich I hope they're using everything, but for now I can only assume.
Still, I'm extremely excited to see this project come to fruition!
There is just not enough available material from previous decades to trust that the LLM will learn to roughly the same degree.
Think about it this way: a human in the early 1900s and one today are pretty much the same, just in different environments with different information.
An LLM trained on 1/1000 the amount of data is just at a fundamentally different stage of convergence.