Thoughts on the Word Spec in Rust

53•piker•4mo ago

Comments

olivermuty•4mo ago

would be cool if they published this as oss

maverwa•4mo ago

or - if feasible - extend the existing crate instead of creating a new one.

piker•4mo ago

I thought about that as well, but it's really just so core to every aspect of the product that Tritium needs to own it 100%. We just don't have the capacity to take a tradeoff that is favorable to the broader use case. I highlighted this issue in particular but there were other places where Tritium's needs diverged from the docx_rs approach. (e.g., dealing with references)

piker•4mo ago

Thanks for the kind words, and I have given it serious thought.

It's definitely not impossible in the future.

I just don't think there is enough interest right now in contributing to the underlying tech without generalizing it so much that it basically becomes an inferior LibreOffice.

Instead, the business model for Tritium is to give away the niche legal product for free to the community, but charge commercial users who need more granular control over its network activity, etc. This gives smaller start-ups, law offices and in-house shops a chance to benefit from the niche features, while leaving the advanced features for more demanding organizations that express an interest in them.

IshKebab•4mo ago

I think you should just charge everyone. I can't imagine there are many people in the community who would have a use for it but aren't professionals who could pay money for it.

You could make a special exemption for non-profits and public defenders.

Giving it away for free just creates potential for freeloaders.

Great product idea by the way! Hard to believe lawyers have gone without this for so long.

olivermuty•3mo ago

By «this» I meant the docx parts, not the rest ;)

joachimma•4mo ago

I wonder why round-tripping is such a small concern for people implementing serializers/deserializers of various kinds. I usually throw in an "Unknown" node type, which stores things unaltered until I can understand things again. The parsers I usually write are very small, so I haven't seen what issues come up at scale; maybe there are dragons lurking.
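
For illustration, the "Unknown" node idea can be sketched roughly as below. This is a toy model, not how Tritium or docx-rs actually do it: the `Node` enum, `parse_fragment`, and `serialize` are hypothetical names, and the "parser" only matches two hard-coded fragments rather than real XML.

```rust
/// A tiny document model. Known fragments get typed variants;
/// anything else is kept verbatim so it can be written back unchanged.
#[derive(Debug, Clone, PartialEq)]
enum Node {
    /// `<t>...</t>`, holding the inner text as-is (no unescaping).
    Text(String),
    /// `<br/>`
    Break,
    /// Any fragment this toy parser does not understand, stored raw.
    Unknown(String),
}

/// Classify one already-split fragment of markup (hypothetical helper).
fn parse_fragment(raw: &str) -> Node {
    if raw == "<br/>" {
        Node::Break
    } else if let Some(inner) = raw
        .strip_prefix("<t>")
        .and_then(|rest| rest.strip_suffix("</t>"))
    {
        Node::Text(inner.to_string())
    } else {
        // Not understood: keep the bytes untouched for a lossless round trip.
        Node::Unknown(raw.to_string())
    }
}

/// Write a node back out. Unknown nodes are emitted exactly as read.
fn serialize(node: &Node) -> String {
    match node {
        Node::Text(inner) => format!("<t>{}</t>", inner),
        Node::Break => "<br/>".to_string(),
        Node::Unknown(raw) => raw.clone(),
    }
}
```

The cost is that every unknown fragment owns its own `String`, which is where the memory concern mentioned in this thread comes from.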

piker•4mo ago

This is the solution for that particular issue that Tritium uses.

[NOTE: one dragon would be the memory consumption alluded to in the article.]

robmccoll•4mo ago

Could you intern strings? Seems like you're likely to see the same tags and attributes over and over.
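
A minimal sketch of what interning could look like, assuming tag and attribute names are the strings being deduplicated. `Symbol` and `Interner` are hypothetical names, not part of any crate mentioned here, and for simplicity this version stores each distinct name twice (once as the map key, once in the lookup table).

```rust
use std::collections::HashMap;

/// A small, copyable handle standing in for an interned string.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
struct Symbol(u32);

/// Maps each distinct string to one `Symbol` and back again.
#[derive(Default)]
struct Interner {
    ids: HashMap<String, Symbol>,
    names: Vec<String>,
}

impl Interner {
    /// Return the existing symbol for `name`, or allocate a new one.
    fn intern(&mut self, name: &str) -> Symbol {
        if let Some(&sym) = self.ids.get(name) {
            return sym;
        }
        let sym = Symbol(self.names.len() as u32);
        self.names.push(name.to_string());
        self.ids.insert(name.to_string(), sym);
        sym
    }

    /// Look up the string behind a symbol produced by this interner.
    fn resolve(&self, sym: Symbol) -> &str {
        &self.names[sym.0 as usize]
    }

    /// Number of distinct strings stored so far.
    fn len(&self) -> usize {
        self.names.len()
    }
}
```

With this, a node that repeats the same tag name thousands of times only carries a 4-byte `Symbol` per occurrence instead of a fresh heap-allocated string.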

piker•4mo ago

Yes, and there are probably a lot of other clever ideas. But the better solution is probably just to implement more of the spec. Once you get through maybe 80% of the tags, you've eliminated 99.9% of the memory issue given their frequency distribution.

iyn•4mo ago

I see the author is here — I wonder if you also handle PDFs? A quick look at the site suggests yes, but could you tell us more about it? Do you also have a custom parser/serializer? Do you allow for PDF editing?

The reason I ask is that I've had a shower thought of building a custom PDF/doc reader for myself that would allow me to easily take notes and integrate with Anki. I've been doing that in Obsidian with the PDF plugin, but it's too slow. At the same time, I've heard that the PDF spec is not easy to work with, so I'm curious about your experience on that front.

piker•4mo ago

Yes, it does render PDFs.

There's actually an example PDF in the bundle if you click "Fetch Example" from the web preview at: https://tritium.legal/preview.

Under the hood, Tritium is using PDFium[1]. That's the same library used by Chrome, for example. The PDF spec is another animal that will be tackled in due course, but most legal users only need to view and comment on PDFs.

Try to find a binding to PDFium from your language of choice and start at that layer. PDFs are complex beasts, and you may not need to tackle most of that complexity in the first instance.

[1] https://pdfium.googlesource.com/pdfium/

amelius•4mo ago

The original Word format was a literal dump of a part of the data segment of the Word process. Basically like an mmapped file. Super fast. It is a pity that modern languages and their runtimes do not allow data structures to be saved like that.

johngossman•4mo ago

Your mileage may be different. I didn't work on Word (though I talked to those guys about their format) but I worked on two other apps that used the same strategy in the same era. One, on load you had to fix up runtime data that landed in that part of the data segment. Two, the in memory representation was actually somewhat sparse. This meant that a serializer actually read and wrote less to disk than mapping the file. So documents were smaller and there was actually less i/o and faster loads.

The reason I hated it though was because it was very hard to version. I know the Word team had that problem, especially when the mandate came down for older versions to be able to read newer versions. Hard enough to organize the disk format so old versions can ignore stuff, but now you're putting the same requirements on the in-memory representation. Maybe Word did it better.

maxerickson•4mo ago

There's all kinds of discussions of recovering text from corrupted files that just kind of went away when they moved over to the explicit serialization in docx.

OskarS•4mo ago

You can absolutely save data like that, it's just that it's a terrible idea. There are obvious portability concerns: little-endian vs. big-endian, 32-bit vs. 64-bit, struct padding, etc.

Essentially, this system works great if you know the exact hardware and compiler toolchain, and you never expect to upgrade it with things that might break memory layout. Obviously this does not hold for Word: it was written originally in a 32-bit world and now we live in a 64-bit one, MSVC has been upgraded many times, etc. There's also the address-space concern: if you embed your pointers, are you SURE that you're always going to be able to load them in the same place in the address space?

The overhead of deserialization is very small with a properly written file format, it's nowhere near worth the sacrifice in portability. This is not why Word is slow.
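
As a rough sketch of what an explicit, portable layout means in practice: the struct below is written field by field in little-endian order with fixed widths, so the bytes on disk do not depend on the host's endianness, pointer size, or struct padding. The `Header` type and its two fields are made up for illustration and have nothing to do with the actual Word format.

```rust
/// A hypothetical file header with a fixed 6-byte on-disk layout:
/// bytes 0..2 = version (u16, little-endian),
/// bytes 2..6 = paragraph_count (u32, little-endian).
#[derive(Debug, PartialEq)]
struct Header {
    version: u16,
    paragraph_count: u32,
}

impl Header {
    /// Encode each field explicitly instead of dumping the struct's memory.
    fn to_bytes(&self) -> [u8; 6] {
        let mut out = [0u8; 6];
        out[..2].copy_from_slice(&self.version.to_le_bytes());
        out[2..].copy_from_slice(&self.paragraph_count.to_le_bytes());
        out
    }

    /// Decode from the fixed layout; returns `None` if the input is too short.
    fn from_bytes(bytes: &[u8]) -> Option<Header> {
        if bytes.len() < 6 {
            return None;
        }
        let version = u16::from_le_bytes([bytes[0], bytes[1]]);
        let paragraph_count =
            u32::from_le_bytes([bytes[2], bytes[3], bytes[4], bytes[5]]);
        Some(Header { version, paragraph_count })
    }
}
```

Decoding costs a handful of byte copies per field, and the in-memory `Header` is free to change layout between compiler versions without affecting existing files.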

skywal_l•4mo ago

Andrew Kelley (author of Zig) has a nice talk about programming without pointers, allowing ultra-fast serialization/deserialization. [0]

And then you have things like Cap'n Proto if you want to control your memory layout. [1]

But for "productivity" files, you are essentially right. Portability and simplicity of the format is probably what matters.

[0]: https://www.hytradboi.com/2025/05c72e39-c07e-41bc-ac40-85e83...

[1]: https://capnproto.org/

OskarS•4mo ago

That is true, Cap’n Proto and FlatBuffers are excellent realizations of this basic concept. But that’s a very different thing from what the commenter is talking about Word doing in the 90s, of just memory-mapping the internal data structures and being done with it.

actionfromafar•4mo ago

Smalltalk is something like that.

amelius•4mo ago

It's only a terrible idea because our tools are terrible.

That's exactly the point!

(For example, if Rust detected a version change, it could rewrite the data into a compatible format, etc.)

johngossman•4mo ago

At which point you're not just memory mapping the file. And if the new version changes the size of the object, it doesn't pack in the same place in memory, so you have to repack before saving. Even serializing with versioning is very hard. Memory mapping is much worse. Several other comments indicate that I am not the only one with bad experiences here.

cbm-vic-20•4mo ago

The article links to a classic Joel on Software article "In Defense of Not-Invented-Here Syndrome" written in 2001.

It's interesting to see how this has played out 24 years later with "vibe coding" and how Amazon does business.

> Indeed during the recent dotcom mania a bunch of quack business writers suggested that the company of the future would be totally virtual — just a trendy couple sipping Chardonnay in their living room outsourcing everything. What these hyperventilating “visionaries” overlooked is that the market pays for value added. Two yuppies in a living room buying an e-commerce engine from company A and selling merchandise made by company B and warehoused and shipped by company C, with customer service from company D, isn’t honestly adding much value.

https://www.joelonsoftware.com/2001/10/14/in-defense-of-not-...

cbondurant•4mo ago

It could easily be the case that it's just outside of the goals or scope of docx-rs, but I wonder. It would probably be pretty reasonable to add some kind of a catch-all "unknown" variant that backs itself up by storing the names of tags as interned strings?

This is justified by the idea that unexpected tags should be uncommon by the very fact that they are unexpected (if it's common, you should have expected it), and can be relegated to a less-performant cold path as a result.

It would probably mean not having the most fun time ever for the developer depending on docx-rs if an explicit requirement is interacting with and modifying a tag that ends up in the "whatever" bucket, but at least you could make sure that you (de)serialize losslessly.
