Britannica11.org – a structured edition of the 1911 Encyclopædia Britannica

78•ahaspel•1h ago

Comments

ahaspel•1h ago

I rebuilt the 1911 Encyclopædia Britannica into a clean, structured, navigable site: https://britannica11.org

What it does:

– ~37k articles reconstructed from the original volumes
– section-level structure (contents are clickable within articles)
– cross-references extracted and linked
– contributors indexed and searchable
– original volume + page references preserved and shown while reading
– links to the original scans for each page
– ancillary material included (prefaces, abbreviations, etc.)
– topic index reproduced and cross-linked
– full-text search with article metadata (length, volume, etc.)

Most of the work was in parsing and reconstruction: headings, multi-page articles, tables, math, languages, footnotes, plates, and all the small edge cases that come up in a work like this.

The goal was to make something that feels like the original, but is actually usable.

I’d especially appreciate feedback on:

– search quality
– navigation (sections, cross-references)
– anything that looks structurally off

Happy to answer questions about the pipeline or data model.

logicallee•1h ago

Thanks so much for sharing this. It looks fantastic. A couple of questions, if you don't mind: what license are you releasing this under, if any? Is there any way to download it? The reason someone might want to download it is for use as training data.

ahaspel•1h ago

Thanks!

The underlying text (1911 edition) is public domain, but the structured version here — the parsing, reconstruction, and linking — is something I put together for this site. Right now there isn’t a bulk download available. I’m considering exposing structured access (API or dataset) in some form, but haven’t decided exactly how that will work yet.

If you have a specific use case in mind (especially for training), I’d be interested to hear more.

logicallee•1h ago

Regarding the specific use case, I was thinking this: I had Gemma 4 (a small but highly capable offline model released by Google) make a public domain cc0 encyclopedia of some core science and technology concepts[1]. I thought it was pretty good.

Separately, I've fine-tuned the Gemma 4 model[2]. It was very quick (just 90 seconds), so I think it could be interesting to train it to talk like the 1911 Encyclopædia Britannica.

I would use the entries as training data and train it to talk in the same style. There isn't a specific use case for why, I just think it would be interesting. For example, I could see how it writes about modern concepts in the style of 1911 Britannica.

[1] https://stateofutopia.com/encyclopedia/

[2] To talk like a pirate! https://www.youtube.com/live/WuCxWJhrkIM

hallole•10m ago

I've wanted to do something like this for The Encyclopédie, a hugely relevant text to the Enlightenment. If you ever get around to adding a rough "How I (generally) Made This" section, that'd be appreciated! Site looks great :)

realityfactchex•1h ago

> Is there any way to download it? The reason someone might want to download it is for use as training data.

Another reason would be to be able to keep running/using it even if the main site were to eventually go down for whatever reason; or to operate a mirror of it, for redundancy (linking back to the original, of course).

zozbot234•8m ago

Wikisource has the original scans available in the public domain, and their enriched text under CC-BY-SA: https://en.wikisource.org/wiki/EB1911

gnerd00•1h ago

A legal terms question here also: several major world economies are operating under very different rules regarding datasets and publication rights. I am in the USA (California). Will there be terms for me, given that I am not a giant deep-pocketed FAANG, just a book person? Commercial-use terms at "small business" scale?

TremendousJudge•1h ago

I guess such an old edition is in the public domain

ahaspel•1h ago

The 1911 text itself is public domain, so anyone is free to use it.

What I’ve built here is a structured edition — the parsing, reconstruction, linking, indexing, etc. I haven’t published a formal license for that yet.

For casual or small-scale use there’s no issue at all. For bulk use (e.g. dataset / training / redistribution), I’d prefer people get in touch so I can figure out a sensible way to support that.

robin_reala•1h ago

A seriously trivial bug report, but the font you’ve chosen doesn’t support ℔, making articles like https://britannica11.org/article/22-0688-s2/putting_the_shot look odd. Potentially might be worth rewriting ℔ to a more normal (these days) lb?

ahaspel•1h ago

Good catch, thanks. That’s a font coverage issue. I’ll either swap in a fallback font for missing glyphs or normalize those cases. This only sounds trivial; this project is full of items like that.
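For the normalization route, a minimal sketch might look like the following. It assumes the fix is applied as a plain text substitution before rendering; `GLYPH_FALLBACKS` and `normalize_glyphs` are hypothetical names, and the only mapping taken from this thread is ℔ (U+2114) to "lb".

```python
# Hypothetical fallback table: characters the chosen font may not cover,
# mapped to a plain-text replacement. Only U+2114 comes from the thread.
GLYPH_FALLBACKS = {
    "\u2114": "lb",  # ℔ (L B BAR SYMBOL)
}


def normalize_glyphs(text: str, table: dict = GLYPH_FALLBACKS) -> str:
    """Replace every character listed in `table` with its fallback string."""
    # str.maketrans accepts a dict of single characters -> replacement strings.
    return text.translate(str.maketrans(table))


if __name__ == "__main__":
    print(normalize_glyphs("a shot weighing 16 \u2114"))
```

A fallback font keeps the original glyph on screen, while a substitution like this changes the text itself, so the two options trade fidelity for legibility in different ways.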

yodon•1h ago

The most important entry I found in my physical copy of the 1911 Britannica is for Eavesdropping[0], detailing the historical origins of the term and how it was thought about just before our modern era.

> Though the offence of eavesdropping still exists at common law, there is no modern instance of a prosecution or indictment.

Thanks for posting this resource, I've often wanted to share a link to this and other entries.

[0]https://britannica11.org/article/08-0867-eavesdrip/eavesdrip...

ahaspel•1h ago

That’s exactly the use case I had in mind. The 11th is full of gems like that, but they’ve never been easy to point people to.

keane•1h ago

Beautiful work! This is an amazing resource to have online. Reminds me a little of greensdictofslang.com or of Webster’s 1913, a perennial HN favorite: https://news.ycombinator.com/item?id=29733648

ahaspel•1h ago

That’s high praise. Those are both great projects and this one is definitely in the same spirit.

timciep•37m ago

These projects came to mind for me, as well.

I actually took a recent crack at making a more modern website for Webster's 1913: https://websters1913.timcieplowski.com/

indigodaddy•1h ago

Just as a random data point, I searched for Genghis and nothing came up. Was there not much knowledge on Genghis Khan in 1911 I wonder?

ahaspel•1h ago

Try Jenghiz Khan. That's how they used to spell it then. Or just plain Khan and scroll the results.

ks2048•1h ago

Yes. Here is the article,

https://britannica11.org/article/15-0341-jenghiz-khan/jenghi...

indigodaddy•1h ago

Interesting! Thanks

rustyhancock•1h ago

I spent ages trying to work out if it would be possible to find a copy of the 2021 Encarta or Britannica.

Pre-LLM and post-COVID: perhaps the best we can hope for before AI taints all the info.

One of my prized possessions as a child was a CD-ROM-based encyclopedia (well before the internet was common). I don't know why I liked it so much, but on a rainy afternoon I'd pull up some of my favourite articles and read them to learn more.

ahaspel•1h ago

I know exactly what you mean — I had the same experience with CD-ROM encyclopedias. There’s something about just browsing and falling into articles that’s hard to replicate.

Part of the motivation here was to bring that kind of exploration back, but with the original 1911 text and structure.

pawsocks•47m ago

Do you happen to use a language model to translate or format your comments?

ahaspel•33m ago

Just me. I spent a lot of time thinking about this, so I like talking about it.

tezza•1h ago

2004: https://archive.org/details/britannica-2004

2009: https://archive.org/details/britannica-multimedia-dvd-2009-d...

2012: https://archive.org/details/britannica-dvd_20230709

2013: https://archive.org/details/encyclopedia-britannica-dvd-2013

realityfactchex•1h ago

Very, very cool. Hats off. I've considered attempting a more limited form of this for years.

For those who don't know, the 1911 Britannica is heralded for several reasons (and rightly criticized for regrettable others), but the most well-known is that it was the last encyclopedia before The Great War, and hence had a good amount of steam/optimism coming from the first and second industrial revolutions and the "Progressive Era", not sullied yet by thoughts of "the war to end all wars".

Trying https://britannica11.org specifically, it quickly found and displayed the article I searched for, chosen (to search for) at random: Portuguese East Africa, at https://britannica11.org/article/22-0177-portuguese-east-afr...

A question/idea for nice-to-haves, most respectfully. I don't know if it would be feasible. It's probably perfect as it is, simply linking to the image-page in unobtrusive text for each section. But I would love an option (emphasis on option) to see the text side by side with the page images. That parallel view would load all of the page images on the same page as the full article text. That way, I could "confirm" or "fact check" the faithfulness of the OCR, and also see the beautiful printing, at once, without opening each page separately and managing the images/windows myself. Most likely, I would use the site to jump to the articles, and read them mainly as images, only switching to the text form to verify what something said, or to copy-paste cleanly, etc. (As it is, initially, I thought I read the original images were available, but had to visit the page three (3!) times before finding where the side-links to them were.) Maybe thumbnails could be a middle-ground option (again, optional) for salience.

Very, very well done. And it's fast!

ahaspel•1h ago

Thanks — really appreciate that, and glad it worked well for a random article.

That’s a great suggestion. A side-by-side text + page view would be very nice for exactly the reasons you mention (verifying the text and seeing the original layout). I haven’t built that yet, but I’ve considered it.

Also helpful to hear that the links to the scans weren’t immediately obvious — I should probably make them a bit clearer. This may also not be obvious, but you can click the vol:page links in the left margin and go directly to the scan of whatever page you're reading.

Thanks again.

entrepy123•1h ago

Bravo. People who like the 1911 Encyclopedia Britannica might like https://OldEncyc.com to dig into the volumes (by letter range) of 22 editions of old encyclopedias dated 1728-1926 (though not searchable like the OP).

ahaspel•47m ago

I hadn’t seen that before, it’s a great collection. I like the breadth across editions.

Aardwolf•1h ago

Very neat!

Some bugs I noticed:

Searching for Zurich allows you to go to the article for the canton of Zurich, not the city. Clicking the link "Zürich (city)" inside this article opens the same article about the canton again, rather than the actual article for the city.

When viewing an article, the search for articles (leftmost search box) doesn't seem to work at all for me (in Firefox). On the main page, it does work.

There's a small clickable 'home' button on the right, but muscle memory from how other websites work makes me expect that clicking the big title "Encyclopædia Britannica, 11th Edition" on the top left also goes to home

ahaspel•52m ago

Excellent points. There are indeed two Zurich articles. One way to get to the city is to search for Zurich and open the second one, which goes to the city directly. The xref in Zurich (canton) is indeed a disambiguation bug (identically named articles); thanks for catching that.

I haven't tested the article search box on the article viewer in Firefox. I'll look into that as well.

Making the title linkable is a great idea and it will be implemented shortly. Thanks for catching all of this.
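For anyone curious how a cross-reference between identically named articles could be resolved, here is a minimal sketch. It is not the site's code: `TitleIndex`, the qualifier strings, and the article ids are all hypothetical, and it assumes each article carries a title plus an optional qualifier such as "city" or "canton".

```python
import unicodedata
from collections import defaultdict


def normalize_title(title: str) -> str:
    """Fold case and strip diacritics so 'Zürich' and 'Zurich' compare equal."""
    decomposed = unicodedata.normalize("NFKD", title)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold().strip()


class TitleIndex:
    """Maps a normalized title to every article sharing it (hypothetical)."""

    def __init__(self):
        self._by_title = defaultdict(list)

    def add(self, article_id: str, title: str, qualifier: str = "") -> None:
        key = normalize_title(title)
        self._by_title[key].append((qualifier.casefold(), article_id))

    def resolve(self, title: str, qualifier: str = ""):
        """Return the single matching article id, or None if ambiguous."""
        candidates = self._by_title.get(normalize_title(title), [])
        if len(candidates) == 1:
            return candidates[0][1]
        wanted = qualifier.casefold()
        matches = [aid for q, aid in candidates if q == wanted]
        return matches[0] if len(matches) == 1 else None


if __name__ == "__main__":
    index = TitleIndex()
    index.add("zurich-canton", "Zürich", "canton")  # illustrative ids
    index.add("zurich-city", "Zürich", "city")
    print(index.resolve("Zurich", "city"))
    print(index.resolve("Zurich"))
```

Returning `None` for an ambiguous title, rather than the first hit, is a deliberate choice in this sketch: it surfaces cases like the Zürich link instead of silently pointing at the wrong article.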

shantara•53m ago

Interesting how different both the tone and the structure of the articles are compared to the modern texts.

Take the article about Copenhagen as an example: https://britannica11.org/article/07-0111-copenhagen/copenhag... The geography and key points of interest are described very accurately, but the authors aren’t shy about inserting emotionally charged adjectives and personal opinions on what they consider interesting or curious. Also, the huge portion about the Battle of Copenhagen at the bottom is a complete departure and shifts the genre from a geographical description to a shot-by-shot narration of a naval battle.

ahaspel•44m ago

Yes, that’s one of the things I like most about it. The articles have a personal tone and are less homogenized.

You get that mix of geography, history, and sometimes quite opinionated description all in one place, which makes them much more readable, in my view. My introduction to this version discusses this and other related matters: https://britannica11.org/about.html

neonscribe•36m ago

You can discover beliefs that are shocking today, such as this excerpt from the article "Adolescence":

"In the case of girls, let them run, leap and climb with their brothers for the first twelve years or so of life. But as puberty approaches, with all the change, stress and strain dependent thereon, their lives should be appropriately modified. Rest should be enforced during the menstrual periods of these earlier years, and milder, more graduated exercise taken at other times. In the same way all mental strain should be diminished. Instead of pressure being put on a girl’s intellectual education at about this time, as is too often the case, the time devoted to school and books should be diminished. Education should be on broader, more fundamental lines, and much time should be passed in the open air."

ahaspel•28m ago

No doubt. That’s one of the reasons I find the 1911 edition interesting — the authors have more license to express their own opinions, which naturally reflect those current at the time.

ahmedfromtunis•32m ago

No entry on the Great War? Really?!!!

Just kidding, of course. This is incredible and surprisingly nostalgic. Reading some of the entries took me right back to being a kid huddled in my room for hours poring over an encyclopedia or even the dictionary.

And I still vividly remember the rush of installing Encarta for the first time on the family PC.

I couldn't believe that I, a mere kid, now had access to iconic historical footage that I could watch anytime I felt like it. I can't describe how amazingly cool that felt at the time! It still gives me a hit of endorphins when I remember it today.

ahaspel•6m ago

I feel exactly the same way about encyclopedias and dictionaries. And Encarta really was amazing. You'd be surprised how much modern criticism of the 11th amounts to "no entry on the Great War", except in earnest.

spudlyo•31m ago

I'm curious how the information is structured under the hood. I just recently learned about how folks in the digital humanities use the XML-TEI format for semantic markup of works like this. I've recently been exploring the Latin-English Lewis & Short dictionary encoded in XML-TEI.

I've had a ton of fun learning about BaseX and XQuery to ask questions like "Which classical authors are responsible for writing words that appear only once in the entire corpus (hapax legomena)?" or "What are the longest hapax words?" (usually the funniest ones) and that kind of thing. Shout out to Tufts University for making this available!

I would love to be able to load the 1911 Britannica into BaseX and see what interesting things I could learn about it via XQuery!
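The hapax questions can also be sketched outside BaseX. The following is a rough Python illustration, not an XQuery equivalent: it assumes the corpus is available as plain text and uses a naive tokenizer (runs of letters, lowercased), and the function names are made up for this example.

```python
import re
from collections import Counter


def hapax_legomena(text: str) -> list:
    """Return the words that occur exactly once in `text`, sorted A-Z."""
    # Naive tokenization: runs of Unicode letters, lowercased.
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(words)
    return sorted(word for word, n in counts.items() if n == 1)


def longest_hapax(text: str, k: int = 3) -> list:
    """Return the `k` longest hapax words, longest first."""
    return sorted(hapax_legomena(text), key=lambda w: (-len(w), w))[:k]


if __name__ == "__main__":
    sample = "the offence of eavesdropping still exists at the common law"
    print(hapax_legomena(sample))
    print(longest_hapax(sample, k=1))
```

Attributing each hapax to an author, as in the Lewis & Short queries, would additionally need per-citation markup, which plain text does not carry.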

ahaspel•21m ago

Under the hood it’s not XML-TEI — it’s a relational/data-pipeline approach, with article boundaries, sections, contributors, cross-references, and source-page provenance all reconstructed into structured records. The text itself is public domain, but I haven’t released a bulk structured export yet.

People asking for dataset access has definitely been one of the themes of this thread. I’m taking that seriously. If I do expose it, I’d want to do it in a form that preserves the structure and doesn't just dump plain text.
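To make "structured records" concrete, here is one possible shape for them. This is purely illustrative and assumes nothing about the real schema: every class name, field name, and sample value below is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True, order=True)
class PageRef:
    """Source-page provenance: a volume and a page within it."""
    volume: int
    page: int


@dataclass
class Section:
    heading: str
    text: str
    pages: list = field(default_factory=list)  # list of PageRef


@dataclass
class CrossReference:
    source_article_id: str
    target_article_id: str
    anchor_text: str


@dataclass
class Article:
    article_id: str
    title: str
    contributors: list = field(default_factory=list)  # contributor names
    sections: list = field(default_factory=list)      # list of Section
    xrefs: list = field(default_factory=list)         # list of CrossReference

    def page_span(self) -> Optional[tuple]:
        """First and last source page covered by the article, if any."""
        pages = sorted({p for s in self.sections for p in s.pages})
        return (pages[0], pages[-1]) if pages else None


if __name__ == "__main__":
    article = Article(
        article_id="example-article",  # illustrative id, not a real one
        title="Example",
        sections=[
            Section("Early life", "...", [PageRef(1, 10), PageRef(1, 11)]),
            Section("Later life", "...", [PageRef(1, 11), PageRef(1, 12)]),
        ],
    )
    print(article.page_span())
```

In a relational store each class would plausibly become a table keyed by `article_id`, which is what would let an export preserve structure rather than flatten to text.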

peterldowns•16m ago

I've been meaning to build ~exactly this experience, but for the 1952 Encyclopædia Britannica Great Books of the Western World collection and its experimental index, the Syntopicon [0]. Would love to know more about how you OCR'd or otherwise ingested and parsed the raw material. I have a physical copy of the books, and I found some samizdat raw-image scans and started working on a custom OCR pipeline, but wondering if maybe I could learn from your approach...

[0] https://en.wikipedia.org/wiki/A_Syntopicon

ahaspel•10m ago

I'm familiar with the Syntopicon, which would be fun to structure.

I didn’t do OCR myself, except for the topic index and to fill in a few gaps. I started from existing Wikisource text and then built a pipeline around that: cleaning (headers, hyphenation, etc.), detecting article boundaries, reconstructing sections, and linking things back to the original page images. Most of the effort went into rendering the complex layouts and handling the cross-linking, not the initial ingestion.

Glad to go into more detail if you’re interested, but that’s the gist of it.
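As a rough idea of what the cleaning step could involve, here is a small sketch covering only headers and hyphenation. It is an assumption-laden illustration rather than the actual pipeline: the running-header pattern (an all-caps line with an optional page number) and the function name `clean_page` are guesses made for this example.

```python
import re

# Hypothetical running-header shape: an all-caps line, optionally with a
# page number before or after it, e.g. "341 JENGHIZ KHAN".
RUNNING_HEADER = re.compile(r"^\s*\d*\s*[A-Z][A-Z .,'-]+\s*\d*\s*$")

# A word split across a line break: "recon-\nstructed" -> "reconstructed".
LINE_BREAK_HYPHEN = re.compile(r"(\w)-\n([a-z])")


def clean_page(text: str) -> str:
    """Drop running headers, then rejoin words hyphenated at line ends."""
    lines = [ln for ln in text.splitlines() if not RUNNING_HEADER.match(ln)]
    return LINE_BREAK_HYPHEN.sub(r"\1\2", "\n".join(lines))


if __name__ == "__main__":
    page = "341 JENGHIZ KHAN\nThe text was recon-\nstructed here."
    print(clean_page(page))
```

Both rules are lossy as written: a genuine compound split at a line end (say "cross-" / "reference") would lose its hyphen, and an all-caps line of body text would be dropped, so a real pipeline would presumably need a word list and positional checks.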

zozbot234•6m ago

That collection is not in the public domain, AIUI? You might be able to do it for the Harvard Classics, which has a nice collection-wide index of terms. https://en.wikisource.org/wiki/The_Harvard_Classics has links to the scans.
