Is OOXML Artificially Complex?

https://hsu.cy/2025/09/is-ooxml-artificially-complex/

148•firexcy•3d ago

Comments

3cats-in-a-coat•20h ago

Microsoft just took what they had and directly translated it to XML. It's not intentionally messy, it's just a big corporation with an old product acting like it.

gitonup•19h ago

This is the God's honest.

I worked on the MS Word core team for a little over three years from 2010-2014, and de-facto owned a significant part of implementing ODF / OOXML Strict support.

The binary format was a liability for Microsoft to begin with, because of decades of cruft lining up with actual memory alignment. During my tenure there I ran into code that my GM had written as an intern and that was still intact -- he had 20+ years of tenure (mostly on Word) when I joined the team.

The translation of the file format to XML involved a significant amount of performance degradation if you weren't careful. Hundreds of millions of people use the app monthly, and MS still tries to maintain backwards compatibility. Given that open APIs were a relatively late development for the app, I really don't think in the current reality of what's expected by boards of directors for the companies they oversee that _anyone_ would take years to:

a) define a spec that maintained that backwards compatibility

b) reach whatever nebulous simplicity metric today's HN article wants

c) not get whoever greenlit the project fired for taking that many engineering hours for a and b

piker•20h ago

Dead on.

Microsoft is just dominant and exporting its 40 year old legacy codebase as a spec. LibreOffice team is frustrated that the for-profit model is beating the OSS model and crying foul over mostly necessary complexity. If LibreOffice started from scratch they’d probably appreciate how much Microsoft serializes because a sufficiently complicated document saved to .docx basically provides a reference implementation.

We do need for-profit alternatives to Word, and I’m working on one in legal.

[edit: I hope to put some real thoughts on this down soon, but most of the wonkiness emanates from evolving functionality and varying trends in best practices over the decades. I’ve implemented a fair bit of the spec here: https://tritium.legal, but most of the hard part is providing for bidi language support, fonts, real-time editing and re-rendering, UI and annotations like spellchecking and grammar, not conforming to the markup spec. Spec conformance is just polish and testing. A performant modern word processor of any spec, however, is a technological achievement on the order of a web browser.]

trelane•18h ago

LibreOffice has versions that you pay for, with support. The most prominent is Collabora, which is a (if not the) biggest contributor to LibreOffice.

croes•18h ago

Where does the article say it’s a necessary complexity?

> Thus, the primary goal for this new format wasn’t to be elegant, universal, or easy to implement; it was to placate regulators while preserving Microsoft’s technological and commercial advantages.

That sounds quite anti-competitive to me

Gigachad•17h ago

I feel like LibreOffice became largely irrelevant the day Google Docs came out. People put up with LO wonkiness because it was free and Office was expensive.

Google completely flipped the game and then cloud collaboration became everything.

toast0•15h ago

I mean, multiplayer features are useful, but Google Docs is wonkier than LO. At least when LO loads a document, it's fully loaded.

taftster•16h ago

> We do need for-profit alternatives to Word, and I’m working on one in legal.

Wow, big undertaking!

What we really need, though, is a for-profit alternative to Excel, that's not Google. I think Excel is more of the Killer App than Word has ever been.

qcnguy•11h ago

That's Apple Numbers.

mschuster91•8h ago

... which comes tied to macOS and with it Apple hardware. Neither plays well in a shop that uses x86 Windows-only software, and Apple's switch to ARM hasn't made that easier.

qcnguy•4h ago

So? The person I was replying to didn't impose any conditions beyond it being not Google.

unscaled•15h ago

This may be nitpicking, but the complexity in OOXML is not "necessary", at least not in the sense of what Fred Brooks would call essential complexity. As OP clearly demonstrates, the complexity in OOXML is not artificial: there was never some grand conspiracy by Microsoft to create a format that competitors will find it hard to implement.

But very little of this complexity is necessary for a standard interoperable document file format. The background was that the EU started pushing for a standardized document exchange format, and several governments started implementing regulations requiring the use of this format. Microsoft now had some very big customers that urgently needed a feature: a standard document file format. Microsoft _could_ have implemented and submitted a new format that doesn't slavishly reflect their in-memory object graph and legacy issues. Or they even could have just adopted ODF (shudder). But they chose the easy way, because, frankly, they probably just didn't have the time. They took the accidental complexity that was the hot mess of Microsoft Office internals (like a buggy date format) and serialized it to disk. It was never an ideal solution, but it was quick to implement.

That's just a classic case of technical debt: Microsoft needed to deliver a feature fast, and they were willing to make compromises. The crazy political shenanigans Microsoft had executed to standardize their technical debt are ironically just another form of accidental complexity.
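To make the "buggy date format" concrete: the comment does not name the bug, so this sketch assumes it means the well-known 1900 leap-year quirk, where the spreadsheet's 1900 date system counts a 29 February 1900 that never existed. The function name `excel_serial_to_date` is made up for illustration, and only whole-day serials in the 1900 system are handled.

```python
from datetime import date, timedelta

def excel_serial_to_date(serial):
    """Convert a 1900-system serial day number to a real calendar date."""
    if serial < 1:
        raise ValueError("serial must be >= 1")
    if serial == 60:
        # Serial 60 is the phantom 29 Feb 1900; there is no real date for it.
        raise ValueError("serial 60 maps to 29 Feb 1900, which does not exist")
    # Serials after the phantom day are off by one, so shift the epoch for them.
    epoch = date(1899, 12, 31) if serial < 60 else date(1899, 12, 30)
    return epoch + timedelta(days=serial)

print(excel_serial_to_date(59), excel_serial_to_date(61))
```

Every reader of the format has to carry a special case like this one, which is the kind of legacy behavior that ended up serialized into the spec.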

simoncion•5h ago

> [T]here was never some grand conspiracy by Microsoft to create a format that competitors will find it hard to implement.

No, it's just an ordinary conspiracy. Everywhere in the spec you see shit that says "Do it like Word95 does" or "Do it like Word97 does" is an intentional aspect of the standard that makes it unreasonably difficult for anyone who wishes to faithfully read or write documents in this format to do so.

It is inappropriate for an open standard to define behavior in terms of an undocumented proprietary black box. The primary reason for an open standard to exist is to permit interoperability. Anyone who has read nontrivial portions of the standard would argue that ISO shouldn't have standardized OOXML as it was. It's a damn shame that Microsoft acted in bad faith to exploit ISO's rules [0] in order to ram a very poorly-specified standard through. It's always sad when people and organizations that should be acting pro-socially choose to do the opposite.

[0] By paying money to stack the organization with a bunch of entities whose only interest was to vote "yes" for the ratification of this standard, natch. IIRC, ISO had to modify their rules again after the OOXML vote because they couldn't get quorum due to those one-issue voters refusing to show up for future business.

like_any_other•15h ago

> LibreOffice team is frustrated that the for-profit model is beating the OSS model

Let's take a look at this "for-profit model" - is it just higher price outweighed by better product? lol:

Microsoft, after getting beat up in the press for making propietary extensions to the Kerberos protocol, has released the specifications on the web -- but in order to get it, you have to run a Windows .exe file which forces you agree to a click-through license agreement where you agree to treat it as a trade secret, before it will give you the .pdf file. Who would have thought that you could publish a trade secret on the web? - https://slashdot.org/story/00/05/02/158204/kerberos-pacs-and...

Back in 2001, Be, Inc. managed to get BeOS pre-installed on one computer model from Hitachi. Just one. On the entire PC market. Microsoft forced Hitachi to drop the bootloader entry to hide BeOS from customers buying it. They enforced their monopoly over the only possible niche BeOS could find on the PC market, crushing Be, Inc. in the process. - https://www.haiku-os.org/blog/mmu_man/2021-10-04_ok_lenovo_w...

So why aren't there any dual-boot computers for sale? The answer lies in the nature of the relationship Microsoft maintains with hardware vendors. More specifically, in the "Windows License" agreed to by hardware vendors who want to include Windows on the computers they sell. This is not the license you pretend to read and click "I Accept" to when installing Windows. This license is not available online. This is a confidential license, seen only by Microsoft and computer vendors. You and I can't read the license because Microsoft classifies it as a "trade secret." The license specifies that any machine which includes a Microsoft operating system must not also offer a non-Microsoft operating system as a boot option. In other words, a computer that offers to boot into Windows upon startup cannot also offer to boot into BeOS or Linux. The hardware vendor does not get to choose which OSes to install on the machines they sell -- Microsoft does. - https://birdhouse.org/beos/byte/30-bootloader/

haskellshill•14h ago

> If LibreOffice started from scratch

What do you mean though? LibreOffice wrote their application from scratch, did they not? And they managed to implement a superior serialization format, did they not? And they managed to get that format standardized without bribing and cheating, did they not?

What you're saying is akin to "those residents of banana republics are just frustrated capitalism (and a little help from the CIA) is beating democracy"

> We do need for-profit alternatives to Word

Why does it have to be for profit?

smaudet•14h ago

I think this is definitely some weird attempt to justify a terrible piece of technological junk....

For all the hate people gave CSS, it was/is fantastic at its job. Word documents are an example of how you don't design a document, and how when a for profit org designs a thing (instead of standards and market pressures), you get a technological monstrosity...

To be clear, I don't think LibreOffice is great. Part of their issue is that they were built as a way to "not pay" for Office, and it turns out that no, volunteers don't really do a better job at implementing 1000 pages of nonsense than the people who came up with that spaghetti code in the first place...

We don't need that software anymore, though. If you use it, know we are looking at you like you are pulling out a physical paper phonebook to store your numbers in, or, less hurtfully but just as topically, a record or CD player...it is dinosaur technology that pretty much has no place in today's world...

So, they have a point, I don't disagree with them, however it probably would be better just to "admit defeat", get MS to open source their code for compat reasons, and work on something new that's not trying to write viruses on your computer better than paragraphs...

etothepii•20h ago

I spent a lot of time last year replicating every valid Excel number format. I've really struggled to find good documentation on the Excel format when you really get into the weeds.

The use of namespaces is also incredibly annoying: as far as I can tell, none of the XML libraries I can find support them well for that "human" readable component.

When you crack open the file it feels like you are going to be able to find everything you need with an xpath like //w:t but none of the xml parsers I've found cope well with the namespaces.

rhdunn•19h ago

What language?

In Python, the `find`, `findall`, etc. methods take a namespace dictionary. E.g.

    # A leading ".//" searches all descendants and works on both trees and elements.
    result = doc.findall(".//w:t", namespaces={"w": "..."})

In C# you can do:

    // Needs: using System.Xml; using System.Xml.Linq; using System.Xml.XPath;
    var navigator = doc.Root!.CreateNavigator();
    var nsManager = new XmlNamespaceManager(navigator.NameTable);
    nsManager.AddNamespace("w", "...");
    var results = doc.Root!.XPathSelectElements("//w:t", nsManager);

In Java you need to enable a namespace-aware flag in the settings to get namespaces to work. I can't recall off-hand how to do that.
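For the Python case, here is a minimal self-contained sketch using only the standard library's `xml.etree.ElementTree`. The URI is the WordprocessingML main namespace; the sample fragment and the helper name `run_texts` are made up for illustration, and a real .docx would first need `word/document.xml` pulled out of the ZIP container.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, bound here to the conventional "w" prefix.
W_NS = {"w": "http://schemas.openxmlformats.org/wordprocessingml/2006/main"}

# Hypothetical fragment: one paragraph split into a bold run and a plain run.
SAMPLE = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body><w:p>"
    "<w:r><w:rPr><w:b/></w:rPr><w:t>To be</w:t></w:r>"
    '<w:r><w:t xml:space="preserve">, or not to be</w:t></w:r>'
    "</w:p></w:body></w:document>"
)

def run_texts(xml_text):
    """Return the text of every w:t element, in document order."""
    root = ET.fromstring(xml_text)
    # ".//" searches all descendants; a bare "//" is rejected on an Element.
    return [t.text or "" for t in root.findall(".//w:t", W_NS)]

print("".join(run_texts(SAMPLE)))
```

The prefix in the query only has to match the key in the dictionary passed to `findall`, not the prefix used inside the file, which is often the part that trips people up.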

Joker_vD•20h ago

Sigh. Just because it was not deliberately engineered to be prohibitively expensive to support does not mean that it cannot be used to deliberately obstruct interoperability. It's really not that difficult a concept: if you want others to suffer, you can take a sad artifact of well-meant historical accidents, and say "welp, now it's a standard, you gotta support it!" There is nothing contradictory or conspiratorial.

piker•20h ago

I think we take issue with requiring the leap to Microsoft “deliberately” obstructing interoperability. Microsoft just isn’t incentivized to make it simple to implement, but it’s probably less complicated than the various web standards.

Joker_vD•19h ago

An engineering team in Microsoft decides to switch from binary format to XML to save effort in the long run; even though it'll take some effort now, they have the competency, and can afford it. They are absolutely correct!

But then their manager needs to sell this project to the higher-ups, who have read BillG's memo about how "One thing we have got to change in our strategy – allowing Office documents to be rendered very well by other people's browsers is one of the most destructive things we could do to the company. We have to stop putting any effort into this and make sure that Office documents very well depend on proprietary IE capabilities. Anything else is suicide for our platform. This is a case where Office has to avoid doing something to destroy Windows." and took it to heart. So what does he do? Why, he spins a tale that since it's XML, they'll be able to standardize it, and everyone else will still be forced to interoperate with MS Office anyhow, because it will be the de-facto reference implementation (by the virtue of being there first, and widely deployed), and the spec is going to be an absolute PITA to implement decently — and that manager too will be absolutely correct!

piker•19h ago

It’s not actually that bad.

to11mtm•19h ago

IME there at least used to be a difference between 'fresh OO doc' and 'oo doc upsaved from legacy' as far as parsing.

I know when I had to deal with a LOT of excel in 2008-2013, somewhere in that range I gave up on trying to parse the XML (admittedly with the then-rudimentary tools, to say nothing of nascent state of nuget at the time) and just learned how to do VSTO (Visual Studio Tools for Office) as we all had excel installed anyway, and it led to less overall code for the tasks we had to do that involved Excel...

taeric•19h ago

Agreed. I'm... not entirely clear I get the distinction the article is trying to make?

If you take the idea that it is "artificially complex, because they actively added complexity", then I can see how that isn't quite right. But "artificially complex" can also allow for "because they actively avoided the effort to remove complexity." In which case, we are back to the same spot? But in agreement this time?

cyberax•20h ago

The answer: no.

OOXML is an extremely detailed spec that lists minute details of the Office documents, with uncountable features. While it could have used some "standard" features, there weren't that many usable standards when OOXML was being developed.

In comparison, OASIS OpenDocument spec is horribly ambiguous and has all the same issues (like units not being used consistently). It got better over the years, but it's still not at all great. And its size is now comparable to OOXML, when all the referenced specs are incorporated.

rhdunn•19h ago

There are places where it says the equivalent of "Works the same as Word 95" [3], but does not specify in the specification what that means.

It's essentially a serialization of the binary format to XML.

ODF 1.4 is around 1,100 pages across all 4 parts whereas OOXML is over 6,000.

[1] https://stephesblog.blogs.com/my_weblog/2007/08/microsofts-f...

[2] https://ooxmlisdefectivebydesign.blogspot.com/2007/08/micros...

[3] https://www.robweir.com/blog/2007/01/how-to-hire-guillaume-p...

cyberax•19h ago

> There are places where it says the equivalent of "Works the same as Word 95" [3], but does not specify in the specification what that means.

Yeah, sure, whatever. You'll never see these kinds of documents in real life. And the specified quirks were minor. If you don't implement them, you'll get subtle formatting issues in documents imported directly from Word97.

MS could have just put them into a "vendor-specific" extension and not documented them at all.

> ODF 1.4 is around 1,100 pages across all 4 parts whereas OOXML is over 6,000.

LOL, no. SVG spec alone is 800 pages. ODF formula spec is 200 pages alone, and is still underspecified.

xeeeeeeeeeeenu•14h ago

They improved this in later revisions of the standard. The behaviour of autoSpaceLikeWord95 is now actually described and there's an example.

You can see it for yourself here (in Part 4): https://ecma-international.org/publications-and-standards/st...

theanonymousone•19h ago

Duplicate: https://news.ycombinator.com/item?id=45147639

fsflover•8h ago

It's not a duplicate if there's no discussion.

theanonymousone•3h ago

How come the same link was accepted twice, in the first place?

fsflover•2h ago

Resubmissions are acceptable, if the discussion hasn't started.

charlieyu1•19h ago

I once dug through the 5000-page specification. There was a lot of useless stuff that only old Microsoft Word supported, like WordArt items.

bawolff•15h ago

Does office no longer support word art?

When I was a kid, making cool WordArt headers for school projects was like 50% of what we used Office for.

lblume•14h ago

Office does still support word art. [0]

[0]: https://support.microsoft.com/en-us/office/insert-wordart-c5...

bjoli•12h ago

How else would terminally uncool church youth groups advertise in their local church?

It might be a Swedish thing, but I always laugh when I see them. Not nearly as common today as ten years ago, but I see them a couple of times a year.

RcouF1uZ4gsC•19h ago

> Faced with demands for openness, Microsoft could have produced a clean, modern spec and keep the mass pile of legacy inside the application.

Very, very few people care about openness. Maybe a few hundred. Tens of millions care about docx capturing exactly what their doc files had.

Microsoft made the correct choice.

stuzenz•19h ago

My theory (from anecdotal use) is that the OOXML complexity also explains why M365 office implementation is lacking in so many features and is just not very good at all when compared to the Google office suite.

I do have strong memories of OOXML and the scandals that were with it when it became a standard through MS allegedly buying/stacking/influencing votes:

https://chatgpt.com/share/68bf5e11-4e10-8003-ac9d-d4d10f7951...

tracker1•19h ago

I think the last part is probably the biggest thing holding them back IMO... I tend not to install MS Office products on my personal devices, I haven't run Windows on a personal device in a few years. I've mostly maintained just my resume in word or libre-office format for well over a decade. I can't tell you how many times the LO format lost formatting, or just messed up between version upgrades. Same goes for opening a word version in LO.

That doesn't count the various times where it behaved weird, inconsistently had fields/tables that were impossible to edit, etc. I've had to completely recreate everything a couple times over the years. That's just one document, for one guy that I don't really touch that often.

Say what you will about Firefox vs Chrome in terms of usability, compared to MS Word using LibreOffice is worse than early betas of Netscape Navigator 4.0. It's both impressive and upsetting. OnlyOffice at least looks nicer, even if it doesn't really function any better. MS's online version of Word in the browser operates more consistently than either.

abhinavk•16h ago

Have you tried setting MSWord's default save format to Strict OOXML for inter-operation with LibreOffice?

tracker1•8h ago

I've generated the docs in LibreOffice and tried to only use that. My intent wasn't to use MS Word at all.

But I have used Word for work.

tannhaeuser•19h ago

Worth keeping in mind that the native MSO formats were using "structured storage", a horrible binary chunked serialization and metadata format from an era where binary embedding of document streams in other application documents via "Object linking and embedding" (OLE, see also Apple's OpenDoc format) was deemed desirable, with zero consideration given to third-party apps and segment formats tied to C++ data structures. Compared to that, OOXML is still huge progress, and while it's complex I wouldn't say it's maliciously so.

The Shakespeare example is a good one where the sentence is split into multiple spans to apply style rules yet the bare text content could be extracted by just removing all XML tags. Whereas the ODF variant is actually less recommendable as it relies on an unnecessarily complex formatting and text addressing language on top of XML.

The article says

> Even at a glance [ODF's markup] is more intelligible. Strip the text: namespaces and it’s nearly valid HTML. The only thing that needs explaining is that ODF doesn’t wrap To be with a dedicated “bold” tag. Instead, it applies an auto-style named T1 to a <text:span>, an act of separating content and presentation that mirrors established web practices.

but this definitely makes things more complex for data exchange compared to OOXML.

quotemstr•14h ago

Can you explain what's wrong with the concept of a container format that allows embedding subdocuments of different types?

> zero consideration given to third-party apps and segment formats

The reality is the opposite. COM serialization was specifically built to allow for composing components (and serializations thereof) that didn't know about each other into a single document. That's why it leans so heavily on GUIDs for names: they avoid collisions without needing coordination. That's a laudable goal, not pointless bloat. And the COM people implemented it pretty efficiently too!

> C++ data structures

What gives you that idea? Yes, the OLE stream thing was a binary format, but so is DER for ASN.1. Every webpage you load goes over a binary tagged object format not too different from OLE/COM's.

But due to a persistence of myths from the 90s, people still think of the Office binary format as "horrible" when it's actually quite elegant, especially considering the problems the authors had to solve and their constraints in doing so.

In many ways, we've regressed.

> Markup

The author of the article nails it when he says ODF is meant to be a markup language and OOXML is the serialization of an object graph. So what? Do people write ODF by hand? There are countless JSON formats just as inscrutable as MSO's legacy streams.

Anyway, the idea that the MSO binary format was crap because it was binary, lazy, and represented a "memory dump" is an old myth that just won't die. It wasn't a memory dump, it wasn't lazy, and it wasn't crap. Yes, there are real problems with some of the things people put inside the OLE container, but it's facile and wrong to blame the container or the OLE stream composition model for the problem.

tannhaeuser•25m ago

> Can you explain what's wrong with the concept of a container format that allows embedding subdocuments of different types?

A system managing opaque streams with handler apps registered via GUIDs is pretty much antithetical to open formats for data exchange.

quotemstr•17m ago

Are you suggesting that a document format should be a monolith controlled by one organization, and that this is actually more open than one that allows multiple entities to contribute to an artifact without permission or coordination? You can think that if you want, but you have to actually argue for it, not just assert it, because to me, that's closed, not open.

Or is it that you just really hate UUIDs? Me too, man. Should have gone with reverse DNS. It's a technical and aesthetic quibble though.

mschuster91•8h ago

IIRC Adobe's PSD file format is similar, which made it very very complex to reverse engineer on one side - and vulnerable to exploits on the other side.

pessimizer•19h ago

I have no idea what this article is intending to express. It is artificially complex to dump the exact implementations of your legacy products into a giant data structure and call it a standard. Nobody can implement that. Which is why they had to bribe, stuff committees and bully people to get it done.

I don't think anyone cares about debating the word "artificial," I don't think that was anyone's point. It's just not a standard. It was, as is made clear here, a way to head off a standard that would be possible for competitors to implement with a fake standard that Microsoft couldn't even implement.

I also don't think that it is "a counterproductive reflex that’s common in open-source circles: scolding users for accepting proprietary tech." I don't even know wtf that's supposed to mean. People are stuck with it because of corruption, they're not being scolded for using it.

edit: "LibreOffice itself, as ODF’s flagship, still suffers from rough edges in design, interaction, and performance. As a result, even as Office hobble itself with bloat, most people still find it easier."

Yeah, it'd be a lot easier if they didn't ever have to deal with OOXML and could just work on their own product.

lorenzohess•19h ago

> In my view, OOXML is indeed complex, convoluted, and obscure. But that’s likely less about a plot to block third-party compatibility and more about a self-interested negligence: Microsoft prioritized the convenience of its own implementation and neglected the qualities of clarity, simplicity, and universality that a general-purpose standard should have.

The author only provides arguments for "self-interested negligence". He provides no counterarguments to the claim that OOXML complexity was "a plot to block third-party compatibility". Therefore, he cannot compare "negligence" and "a plot". Therefore, his claim that "negligence" is a better explanation for OOXML complexity than "a plot" cannot follow.

To restate:

> If we dig into the context of OOXML’s creation, it can be argued that harming competitors was not Microsoft’s primary aim.

The author provides no evidence to support this claim. At most, the evidence provided in this section supports the claim that "negligence" played a role in OOXML complexity. From this evidence alone, no conclusions can be drawn about the "primariness" of "negligence" vs "harming competitors".

to11mtm•19h ago

I mean sometimes you gotta ship a product (and remember back then, that meant masters for CDs,) and it's perfectly possible that whatever team was in charge of handling 'conversion' stuff for old format (remember that old excel formats have OLE type cruft going on, the sorts of things that led to VBA viruses, imagine what other functionality needs to be implemented) just plain had to take shortcuts in uglifying the spec to support all the jank.

unscaled•15h ago

Unless we ever get the full archive of Microsoft emails, meeting minutes and recordings from all the secret microphones they didn't have in their meeting rooms, I don't think you can ever disprove this claim. It's generally impossible to conclusively disprove conspiracy theories, because you could always claim you're only showing there are no documents proving the conspiracy, but there are no documents disproving it.

The author is just implicitly appealing to Occam's razor here, as people often do in the face of accusations of a plot. They can show that Microsoft has backed the ANSI accreditation of ODF[1] and eventually implemented support for ODF import and export in Office, but that's not enough to prove there was no conspiracy.

Instead, the article just provides a very plausible explanation for the complexity in OOXML. Does this explanation thoroughly disprove the accusations of a plot? Clearly not. Is it more plausible than a great plot to crush a bunch of competitors that had no market share and kill a better standard document format that Microsoft did end up implementing in Office? Yes. This is probably as far as we can get.

[1] https://news.microsoft.com/source/2007/05/16/microsoft-votes...

airstrike•14h ago

Both things can be true. It had a genuine purpose, but the fact that Microsoft will go out of its way to not implement anything better and less temperamental is an indication it's not really open. There's plenty of evidence of Microsoft dragging their feet at playing nice with the rest of the office ecosystem.

I'm not saying they shouldn't do that as a company maximizing shareholder value. But we should all collectively groan every time the topic comes up, not applaud them.

PaulHoule•18h ago

People who were developing "office" programs in the early 1990s were thinking about the problem of serializing arbitrary object graphs into documents to support technologies like

https://en.wikipedia.org/wiki/Object_Linking_and_Embedding

where you could embed an Excel spreadsheet inside a Word document or actually embed any of a large range of COM objects into a Word document, which on one hand is a really appealing vision but on the other hand means you have to have and be able to run all the binaries for all the objects that live in a document, which ties the whole thing to Windows.

PDF is a different sort of document format which privileges viewing over editing but it is also really about serializing an object graph when it comes down to it and then having various sorts of filters and transformations and a range of objects defined in the spec as opposed to open ended access to an object library.

This kind of system has a lot of overlap with the serdes problem you get with RPC frameworks that used to be filed under "Sun RPC sucks", "DCOM Sucks", "CORBA Sucks" and "WS-* Sucks". Those things are mostly forgotten these days because well... they sucked, and now the usual complaint is "protobuf sucks" but you rarely hear "JSON sucks" because it gave up on graphs for trees, if you don't have a type system people can't say the type system sucks, and the only thing that really sucks about it is that people won't just use ISO 8601 dates but you can always rise above that by just using ISO 8601 dates without asking for permission. But we all agree YAML sucks.

That points to any flexible document format sucking, but OOXML also sucks because it has lots of poorly specified and obscure features that amount to "format this the same way Word 95 formatted it if you used a certain obscure option".

From a glass is half empty perspective it sucks because it's close to impossible to make a Microsoft Office replacement that renders 100% of documents 100% correctly.

From a glass is half full perspective it rules because if you want to make a Python script that writes an Excel spreadsheet with formulas it is easy. If you want to extract the images out of a Word document it is easy because a Word document is just a ZIP file. If you want to do anything with an OOXML document short of writing an Office replacement it's actually a pretty good situation.
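As a rough illustration of the image case, the sketch below uses only Python's standard `zipfile` module. It assumes the embedded images sit under `word/media/` inside the archive, which is where Word conventionally puts them but is not something a reader should rely on for every producer. The helper names `extract_media` and `make_fake_docx` are hypothetical, and the demo archive is a stand-in, not a valid Word file.

```python
import io
import zipfile

def extract_media(docx_bytes):
    """Return {archive_name: bytes} for every file under word/media/."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return {
            name: archive.read(name)
            for name in archive.namelist()
            if name.startswith("word/media/") and not name.endswith("/")
        }

def make_fake_docx():
    """Build a tiny stand-in archive (not a valid Word file) for the demo."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", "<w:document/>")
        archive.writestr("word/media/image1.png", b"\x89PNG fake image bytes")
    return buffer.getvalue()

images = extract_media(make_fake_docx())
print(sorted(images))
```

With a real file, the bytes would come from `open(path, "rb").read()` instead of the stand-in builder.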

com2kid•16h ago

> but you rarely hear "JSON sucks" because it gave up on graphs for trees

Except it also spawned a thousand custom formats that include $ref support of some type, so we are right back to having graphs. :-D

Lammy•18h ago

I love this screen that shows you exactly why they named it “Office Open” XML: https://i.imgur.com/hnj3sdv.png

It was a pretty big deal when OpenOffice.org's 2.0 release came with OpenDocument as the default file format. Very easy for someone to misread this MSOffice screen and click on OOXML expecting it to mean OO.o.

zamadatix•14h ago

Oh wow. I must have clicked through that page dozens of times, selecting "Keep Current" after a quick scan and thinking the 2nd option was talking about Open Office.

Lammy•2h ago

> after a quick scan

I have to wonder what sort of psychologists they employ who come up with ideas like aligning the “Word, Excel, PowerPoint” word column in the first selection with “Open” in the second selection so you read that word first and backtrack left to “Office”. Or maybe it's just a happy accident lol

croes•18h ago

That sounds exactly like an anti-competitive format.

Keeping your own advantage pretty much sums up all anti-competitive behavior.

eirikbakke•18h ago

Microsoft Office has many features. Each feature must be reflected in the file format somehow.

(I wonder what the specification-pages-to-man-years ratio is...)

freeopinion•17h ago

This is talking about OOXML the proprietary MS format, right? Not ISO/IEC 29500?

ISO/IEC 29500 should be open to evolution, no? Just like all the open collaboration on it before it was confirmed as a standard.

mxmilkiib•17h ago

https://m.slashdot.org/story/78708 (2007)

themerone•17h ago

It's as complex as it needs to be to losslessly convert old binary office files.

A better format would have made us geeks a lot happier, but the average user just wants things to work the way they always have.

Gigachad•17h ago

My possibly incomplete understanding was that the original office file format was basically just raw dumps of the internal C data structures. Not designed or specified in any way.

The XML version likely carries a lot of baggage having to be compatible with that.

lmkg•16h ago

They weren't "just" raw dumps of internal C structures. It takes careful design work to dump raw memory in a usable fashion. Consider: You can't just write a pointer to disk and then read it back next week.

Binary MS Office format is a phenomenal piece of engineering to achieve a goal that's no longer relevant: fast save/load on late-80's hard drives. Other programs took minutes to save a spreadsheet, Excel took seconds. It did this by making sure its in-memory data structures for a document could be dumped straight to disk without transformation.

But yes, this approach carries a shitton of baggage. And that achievement is no longer relevant in a world where consumer hardware can parse XML documents on the fly.

I have heard it argued, though, that the "baggage" isn't the file format. It's actually the full historical featureset of Excel. Being backwards-compatible means being able to faithfully represent the features of old Excel, and the essential complexity of that far outweighs the incidental complexity of how those features were encoded.
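
The pointer remark above can be sketched in a few lines. This is only an illustration of the general trick (store positions within the file instead of memory addresses), not Microsoft's actual layout; the `Node` type and the record format are invented for the example, and the list is assumed to be acyclic.

```python
import struct
from dataclasses import dataclass
from typing import Optional

# One fixed-size record per node: (value, index of the next record or -1).
RECORD = struct.Struct("<ii")


@dataclass
class Node:
    value: int
    next: Optional["Node"] = None


def dump(head: Optional[Node]) -> bytes:
    """Flatten a linked list, replacing each reference with a record index."""
    nodes = []
    node = head
    while node is not None:
        nodes.append(node)
        node = node.next
    out = bytearray()
    for i, n in enumerate(nodes):
        next_index = i + 1 if n.next is not None else -1
        out += RECORD.pack(n.value, next_index)
    return bytes(out)


def load(blob: bytes) -> Optional[Node]:
    """Rebuild the list by turning record indices back into references."""
    if not blob:
        return None
    records = [RECORD.unpack_from(blob, off)
               for off in range(0, len(blob), RECORD.size)]
    nodes = [Node(value) for value, _ in records]
    for node, (_, next_index) in zip(nodes, records):
        if next_index != -1:
            node.next = nodes[next_index]
    return nodes[0]


if __name__ == "__main__":
    restored = load(dump(Node(3, Node(1, Node(4)))))
    while restored is not None:
        print(restored.value)
        restored = restored.next
```

The indices stay valid across runs because they are relative to the blob, which is the property a raw memory address does not have.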

taspeotis•16h ago

Off topic sorry but with all the comments discussing Office's size and age and technical baggage ... does anyone know how they pivoted from X million lines of code for a desktop application to running it on the web with all those collaboration features?

nashashmi•15h ago

OOXML carries bloat from a full legacy doc file into a docx file. Readability was not the mission of the developers of the open format. Openness was the mission of the developers of the format. And they made it open enough.

s20n•14h ago

> Why Microsoft’s Motive Wasn’t Deliberate Sabotage

I absolutely do not agree.

Not only is the standard overly complex, Microsoft also indulged in all sorts of unscrupulous activities to corrupt various National Standards Organisations to get it approved through the ISO <https://en.wikipedia.org/wiki/Standardization_of_Office_Open...>, which is clear evidence of malicious intent.

This is a quote from Richard Stallman:

> The specifications document was so long that it would be difficult for anyone else to implement it properly. When the proposed standard was submitted through the usual track, experienced evaluators rejected it for many good reasons. Microsoft responded using a special override procedure in which its money bought the support of many of the voting countries, thus bypassing proper evaluation and demonstrating that ISO can be bought.

monocasa•14h ago

Specifically what I heard on the grapevine was that Microsoft sponsored a collection of small island nations into the ISO process, in exchange for their vote on OOXML.

miohtama•8h ago

Not only small island nations. For example, in Finland Microsoft partners invaded the local working group to get the standard passed in the voting process.

Yizahi•7h ago

That is not on MS though. That is the fault of those in charge of ISO, who assign the same vote weight to enormous empires and to microstates. Votes should be proportional to the population, full stop. Then no one would be able to abuse the system by simply playing by the rules.

MereInterest•6h ago

So, if I'm understanding your argument correctly, failure to stop a bad actor from taking a hostile action absolves the bad actor of all responsibility for that hostile action? Because that seems to be what you're saying.

quotemstr•14h ago

Some myths just won't die.

OOXML is complex because it has to be. It has to losslessly round trip through an open format every single feature of Office. That's a lot of features.

Yes, it's complex. Should Microsoft have cut features of Office just to make OOXML simpler? That's ridiculous. What about users who relied on those cut features?

It was fair to ask Microsoft to open the file format. It wasn't fair to expect them to cut features and compatibility. The complaints about complexity from RMS and others represent outsiders seeing the sausage factory and realizing that the sausage making is complicated and needs a lot of moving parts. Maybe life wasn't as simple as the Slashdot "Micro$oft" narrative would suggest. Maybe the complexity of the product was downstream of the shit ton of complexity and sweat and thought that had gone into it.

But admitting that would have been hard. Easier to come up with conspiracy theories.

dullcrisp•13h ago

The…sausage has a lot of moving parts?

user3939382•13h ago

So you put extensions in the spec; you don’t make it impossible for anyone else to implement. They knew open source suites were competing with them; they did it on purpose.

quotemstr•12h ago

> So you put extensions in the spec

... which are either public, in which case people complain that the spec+extensions is too long instead of that the spec is too long, or

... which aren't public, in which case people complain that there's no interoperability.

You can't win.

> impossible for anyone else to implement

Except for all the people who did implement it?

fsflover•11h ago

> Except for all the people who did implement it?

It was never fully implemented. LibreOffice has been trying since then and there are always problems.

troupo•12h ago

> OOXML is complex because it has to be.

What it didn't have to be is sections upon sections of "this behaviour is as seen in Word 95", "this behaviour is as seen in Word 97" without any further specification or context.

The main struggle for independent implementors was reverse engineering all the implicit and explicit assumptions and inner workings of MS Office software.

> But admitting that would have been hard. Easier to come up with conspiracy theories.

I actually read through a lot of that spec at the time. A lot of it was just lip service to open standards at a time when MS was under a lot of regulatory pressure.

qcnguy•12h ago

That stuff happens because Microsoft don't know what the behavior is. It's just a bit which forks Word down some ancient code path that nobody understands and isn't properly documented. Given the huge effort that would have gone into producing this thousand plus page specification, it is understandable why the spec writers would have given up at times.

I expect most people posting on Hacker News would not be able to write a satisfactory specification for their own software if they are working on a large legacy code base.

lozenge•9h ago

This is correct and there's no point fixing the bugs because it means the layout of the document will change.

troupo•9h ago

> That stuff happens because Microsoft don't know what the behavior is.

They do. Or they did at the time. They literally had things like "save as Word 95" in their office suite.

> Given the huge effort that would have gone into producing this thousand plus page specification, it is understandable why the spec writers would have given up at times.

Given the huge effort to produce it on the unreasonable timeline they forced themselves into due to regulatory pressure, sure.

The whole OOXML came about only because some large governments said "well, we don't want to be beholden to black box document formats, and we might want a selection of vendors in the future, so ODF looks like a nice proposition compared to Word, actually".

So it was literally rushed through Ecma. MS submitted 2000 pages in December 2005, the spec grew to 6000 pages over the course of the year, and got standardised in December 2006. So, only a year to significantly increase the spec and standardize it.

And then it was rushed through the ISO standards track which included things like "Swedish vote declared invalid, accusing MS of manipulating votes" https://www.linux-magazine.com/Online/News/Swedish-OpenXML-V... or "Netherlands automatically abstains from voting due to Microsoft" https://archive.ph/20120711220944/http://isoc.nl/michiel/nod... or "near unanimous 'No with comments' turned into 'Abstain' from Malaysia" https://web.archive.org/web/20090726171905/http://www.openma... or...

Google said it best: https://www.csun.edu/~hcmth008/odf/google_ooxml.pdf

--- start quote ---

In developing standards, as in other engineering processes, it is a bad idea to reinvent the wheel. The OOXML standard document is 6546 pages long. The ODF standard, which achieves the same goal, is only 867 pages. The reason for this is that ODF references other existing ISO standards for such things as date specifications, math formula markup and many other needs of an office document format standard. OOXML invents its own versions of these existing standards, which is unnecessary and complicates the final standard.

If ISO were to give OOXML with its 6546 pages the same level of review that other standards have seen, it would take 18 years (6576 days for 6546 pages) to achieve comparable levels of review to the existing ODF standard (871 days for 867 pages) which achieves the same purpose and is thus a good comparison.

Considering that OOXML has only received about 5.5% of the review that comparable standards have undergone, reports about inconsistencies, contradictions and missing information are hardly surprising.

--- end quote ---
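
For what it's worth, the 18-year figure in that quote is plain proportionality on the numbers Google gives (days of review per page of ODF, scaled to OOXML's page count). A quick check, using only the figures from the quote plus an assumed 365.25 days per year:

```python
# Figures taken from the quoted Google document.
ODF_PAGES = 867
ODF_REVIEW_DAYS = 871
OOXML_PAGES = 6546

# Review effort per page for ODF, then scaled up to OOXML's length.
days_per_page = ODF_REVIEW_DAYS / ODF_PAGES
ooxml_days = OOXML_PAGES * days_per_page
ooxml_years = ooxml_days / 365.25  # assumption: average year length

print(round(ooxml_days), "days, about", round(ooxml_years), "years")
```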

Do not for a second assume that anything about OOXML was done in good faith. Well, apart from the thankless work that people assembling the standard did.

qcnguy•4h ago

> They literally had things like "save as Word 95" in their office suite.

And what do you think that setting did? Forked execution down an alternative no longer maintained codepath instead of the rewritten version that wasn't quite compatible.

happymellon•3h ago

Which shouldn't be in an open spec...

troupo•1h ago

Ah yes. All the changes made to Word just continued to work magically when "forked down an alternative codepath no one knew about".

And if that's the case, why was that specified in OOXML?

mmis1000•1h ago

> "this behaviour is as seen in Word 95", "this behaviour is as seen in Word 97"

Office relies on behaviour in Windows itself "a lot". Even Office for Mac or Office on the web, which they made themselves, isn't a 1:1 replica of Office on Windows.

Let alone describe it as a standard.

"this behaviour is as seen in Word 95" sounds sloppy, but it is indeed the closest they can get.

Or what else can you do? You can't just also ship installation media for Word 95 and Windows with the ISO standard, right?

troupo•1h ago

> You can't just also ship installation media for Word 95 and Windows with the ISO standard, right?

That's what they almost literally did. The spec is littered with "behavior of this program that has no specification and to see it you need to install it and run it"

And that's on top of re-inventing a bunch of specs in an MS-only, MS-specific manner (like dates, for example)

clort•11h ago

You are wrong. Microsoft was not asked to open the file format. There was an open file format already accepted as an ISO standard, so now they needed to make their product compliant with an ISO standard because companies around the world were going to prioritise that in their purchases. They did everything they could to ensure that their format was both an ISO standard, and impossible for somebody else to implement.

hdjrudni•11h ago

From the article,

> First, OOXML was, in material part, a defensive posture under intensifying antitrust and “open standards” pressure. Microsoft announced OOXML in late 2005 while appealing an adverse European Commission judgment centered on interoperability disclosures. Thus, it was only a matter of time before Office file compatibility came under the regulatory microscope. (The Commission indeed opened a probe in 2008.)

> Meanwhile, the rival ODF matured and became an ISO standard in May 2006. Governments, especially in Europe, began to mandate open standards in public procurement. If Microsoft did nothing, Office risked exclusion from government deals.

So... maybe they weren't directly asked to open their file format, but what then? Adopt ODF which is surely incompatible with their feature set, and... just corrupt every .doc file when converting into the new format? And also have to reimplement all their apps?

jesus_666•9h ago

Work with OpenDocument to get the necessary features into the next version of ODF while keeping national bodies informed about the status of that effort. In the meanwhile, allow Office to save (with reduced functionality) to ODF in order to fulfill the requirements of existing standards-oriented procurement processes. (Fun fact: They did the latter pretty quickly.)

Here's what they shouldn't have done: Undermine ISO's credibility by ramming a hastily-constructed, not-yet-implemented spec through a fast-track process intended for mature specs by stuffing national bodies. I see no reason to place Microsoft's short term profits over the integrity of international standards bodies, nor do I see one to excuse Microsoft for doing so.

jeroenhd•4h ago

>Work with OpenDocument

Why on earth would they want to do that? Because they hate having money? Because they suddenly decided that opening the market to competition would be more important than the billions they stood to lose?

These standards determine the tools people use to communicate with tax offices and other government institutions. Thanks to their efforts (supported by as much corruption as necessary), Microsoft didn't have to invent a new file format and would let people just use the file format everyone was already using for official business.

Office allows saving as ODF already and has supported it for ages. It was never about supporting open standards. This is all about corporate interests.

I can't think of a single "open" format designed by a large corporation that isn't "open" as a way to make more money.

devnonymous•6h ago

Small change to emphasize the intent:

> because companies and governments around the world were going to prioritise that in their purchases.

Governments are the largest revenue stream of pretty much every large software company, from IBM/Xerox to OpenAI. MS is well known to indulge in all sorts of legally grey practices to win such contracts.

ranger_danger•30m ago

What companies around the world were prioritizing open standard file formats?

MattPalmer1086•10h ago

But they did define two variants to get their standard approved in the fast track process.

The Transitional variant which is entirely backwards compatible is not fully defined in a way that others can implement without reverse engineering how Microsoft Office does things.

The Strict variant isn't totally compatible with all older binary formats but is fully defined.

Guess which one is the standard file format?

gregopet•13h ago

My wife worked in one of the national standardization organizations. She was urgently called into her boss' office: "Please be on this meeting with me, I think they will try to bribe me if I'm alone". It only happened once while my wife worked there and it was right before the vote where Microsoft tried to fast track their office format.

shaky-carrousel•6h ago

It's what we old farts have been saying for decades. Do not ever trust Microsoft, they are corrupt beyond hope, they're evil.

But people got blindsided by the new Microsoft propaganda.

bayindirh•4h ago

> But people got blindsided by the new Microsoft propaganda.

This is sadly true. I tried to warn many young folks about VSCode, Copilot and whatnot, and they all laughed at me.

Now they're not laughing anymore.

CorrectHorseBat•12h ago

Both can be true at once.

They didn't want a standard other people could adapt easily nor do the work to make Word adhere to one and it had to happen fast. By doing it the way they did they got everything they wanted and only needed to buy ISO.

fsflover•11h ago

Sounds exactly like deliberate sabotage to me.

thomasfl•9h ago

I worked for the Norwegian standards organization at the time. After seeing with my own eyes how Microsoft was able to get OOXML approved, I quit doing standards. The OOXML standard is a joke. Three different ways to store basically the exact same thing. Like dates.

fsflover•8h ago

What if they could do it, because people like you had quit?

thomasfl•8h ago

I was new in the standards business. Believed this was common. Understand now that it wasn't.

izacus•8h ago

Standards committees being completely divorced from reality of software engineering is why most of the standards are useless.

So the question is whether it was actually a loss.

pjmlp•7h ago

Like POSIX, OpenGL, OpenCL, Vulkan, C, C++, JavaScript, TCP/IP,....

rglullis•6h ago

Is there any example on your list where the standard came before the implementation?

pjmlp•4h ago

Yes, Vulkan (Mantle was the idea), C (since C89), C++ (since C++98), OpenCL (after Apple gave it to Khronos).

rglullis•3h ago

So, no. None of your examples are equivalent to OOXML. The implementations were first opened up and then standardized.

OOXML was the other way around: Microsoft had a standard and tried to enshrine it into a standard and force others to waste time and resources to be compatible.

pjmlp•1h ago

Only if you ignore what was standardised in PDF form and only later made available on existing implementations.

That is why I explicitly made references to specific versions as turning points, as I expected the usual FOSS advocacy replies.

perching_aix•2h ago

What if then? Should they have just bodied the thing for the love of the game? So that people uncaring for their wellbeing then wouldn't have appreciated it as a sacrifice anyhow?

Quite often I find that if people stopped holding fundamentally broken dynamics together and just let the thing fail and fail hard, the overall long term outcome would be better off. Much to the opposite of your suggestion.

It's just that turns out, things being properly bodied or properly broken take coordinated action. People deciding one by one, one way or the other, is what actually enables and sustains pathological dynamics like this.

But then how does one single out any specific decision? Well, nohow, not with any rigor for sure.

sevensor•5h ago

Indeed. “The bad standard is the result of negligence rather than malice” is a total non sequitur. It in no way excuses pushing a bad standard on everybody to say, “they didn’t mean to make it bad.” It was still bad in obvious ways, and they still did power moves and underhanded things to get it signed off over legitimate technical objections. The reasons it was bad are irrelevant to the fact that it was bad and they promoted it.

itsthecourier•4h ago

wow. thanks for sharing

hoistbypetard•8h ago

The format wasn't the act of sabotage. The way they drove it through the standardization process was. It couldn't have been standardized through the normal process. Similarly, pointing to it, afterwards, as if it were just as implementable as any other standardized format, was an act of deliberate sabotage.

fsflover•8h ago

> The format wasn't the act of sabotage. The way they drove it through the standardization process was.

Why not both? You didn't provide any arguments against it.

hoistbypetard•3h ago

What I mean is that it is just the internal binary format they were using before, converted to XML. I don't believe the file format was developed as an act of sabotage; it was just some internal shit they were using because that's how the product had evolved.

Standardizing it as if it were an actual designed, open standard, was, however, very much an act of sabotage.

That's my read, anyway.

amiga386•4h ago

The format itself is an act of sabotage. The format is basically Microsoft's internal formats for Office, with all their bugs and flags and features, for which Microsoft already owned the only implementation that works correctly.

They completely rejected what standardisation processes _do_, which is to subject the format to scrutiny, criticism and change, to make it universally useful and implementable.

Microsoft absolutely did not do that. They rammed through their proprietary bullshit and slapped an "open standards!" label on it.

https://www.consortiuminfo.org/opendocument-and-ooxml/the-co...

> 2.15.3.26 footnoteLayoutLikeWW8 (Emulate Word 6.x/95/97 Footnote Placement)

> This element specifies that applications shall emulate the behavior of a previously existing word processing application (Microsoft Word 6.x/95/97) when determining the placement of the contents of footnotes relative to the page on which the footnote reference occurs. This emulation typically involves some and/or all of the footnote being inappropriately placed on the page following the footnote reference.

> [Guidance: To faithfully replicate this behavior, applications must imitate the behavior of that application, which involves many possible behaviors and cannot be faithfully placed into narrative for this Office Open XML Standard. If applications wish to match this behavior, they must utilize and duplicate the output of those applications. It is recommended that applications not intentionally replicate this behavior as it was deprecated due to issues with its output, and is maintained only for compatibility with existing documents from that application. end guidance]

> Typically, applications shall not perform this compatibility. This element, when present with a val attribute value of true (or equivalent), specifies that applications shall attempt to mimic that existing word processing application in this regard.

The format was _written_ to include specifics that only matter for one product - Microsoft Office - and don't even reveal in that format how those specifics should be interpreted faithfully. This is of ZERO use to anyone looking to make interoperable software that can make use of this standard. And that's the point - it's NOT an open standard, it's quite deliberately Microsoft's proprietary and closed bullshit with "open" shat on top of it, and a paid-for endorsement by a standards body that completely detonated its own credibility by approving it.

hoistbypetard•3h ago

> Microsoft absolutely did not do that. They rammed through their proprietary bullshit and slapped an "open standards!" label on it.

We are agreeing, I think. I was saying that the format was not developed as an act of sabotage. Ramming that format through standardization (without, as you note, doing any of the things standardization *should do*) so it could plausibly be labeled an open standard was the act of sabotage, IMO.

Ygg2•2h ago

Malicious compliance is still sabotage.

conartist6•7h ago

I seriously don't see the author's purpose in trying to establish a distinction. If I act like I hate you: say I come and light your house on fire and salt the earth where you raise your crops and poison the water from your family's drinking well, would it not be reasonable to say that I hate you?

If you can then confirm that I have no intention to stop (even though I know I'm hurting you) and all I have to say in my defense is, "Actually I do exactly whatever I want and just don't care about you at all," what is the difference to me at that point?

In practice total indifference is even more toxic than hate, because it denies engagement. I owe no extra charitability for callous indifference being the root cause of the actions taken. Company or person, society reserves the right to judge you on the effects your force of will brings forth on others. They used theirs to kill their competitors.

me-vs-cat•5h ago

In other words: "the purpose of a system is what it does". Duck-typed ethics?

jrochkind1•5h ago

One difference it might make is a warning for those who aren't deliberately trying to make something complex, and think they might thus be immune from the result. In fact, you can wind up with something terribly complex as a result of other pressures, incentives, history, and context, without it having been deliberate. It takes active investment and skill to avoid complexity, even if not intended.

devnonymous•6h ago

From the same wikipedia article:

> An Ars Technica article sources Groklaw stating that at Portugal's national body TC meeting, "representatives from Microsoft attempted to argue that Sun Microsystems, the creators and supporters of the competing OpenDocument format (ODF), could not be given a seat at the conference table because there was a lack of chairs."[55]

Sure, yeah, that's not deliberate Sabotage /s

devnonymous•6h ago

Also this:

> Google stated that "the ODF standard, which achieves the same goal, is only 867 pages" and that

    If ISO were to give OOXML with its 6546 pages the same level of review that other standards have seen, it would take 18 years (6576 days for 6546 pages) to achieve comparable levels of review to the existing ODF standard (871 days for 867 pages) which achieves the same purpose and is thus a good comparison.

    Considering that OOXML has only received about 5.5% of the review that comparable standards have undergone, reports about inconsistencies, contradictions and missing information are hardly surprising.[118]

dfox•4h ago

> The specifications document was so long that it would be difficult for anyone else to implement it properly.

In contrast to ODF specification that is long, complex and written in such a terse way that it really does only specify what is a valid ODF file and not in any way what it means. Good luck implementing that without just copying whatever LibreOffice does.

happymellon•3h ago

I can't find it now, but I'm pretty sure that Pages was "corrupting" docx files because Apple followed the spec to a tee, and it turned out that Office didn't actually follow the spec that they had published.

layer8•2h ago

Not only is the standard overly complex, it’s also missing clarifications to the extent that it has to be considered underspecified. Moreover, MS Word in particular doesn’t precisely implement the OOXML standard (or any reasonable reading of it), but a buggy and subtly different variation of it.

Complexity alone would just make it laborious to implement, but the underspecification and subtle deviations of Microsoft’s implementation make it virtually impossible to achieve full compatibility.

nneonneo•14h ago

Microsoft seems to have known that they could ram basically anything through a standards body, so they presumably didn't bother to actually try and simplify the standard. Instead, it's basically an XML serialization of their older binary formats, complete with all of the quirks and bugs that have to be emulated for 100% compatibility.

To be fair, we're talking about a product line with over 35 years of history here. Cruft in the format builds up but can never be removed, so long as you commit to strong backwards compatibility - which Microsoft has always done.

Fun trivia: many of the old binary formats use a meta-format called OLE2 (Object Linking and Embedding). The file format is a FAT12 filesystem packed into a single file, with a FAT filesystem chain, file blocks aligned to a specific power-of-two size, etc. This made saving files very fast, but raised the possibility of internal fragmentation (where individual sub-files are scattered over many non-contiguous blocks); hence, users were recommended to "Save As..." periodically for large/complex files to optimize the internal storage.
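
If you want to poke at one of these containers yourself, the fixed header is easy to read. The sketch below assumes the header layout published in Microsoft's MS-CFB open specification (8-byte signature, then version, byte order and sector-shift fields at fixed offsets); the function names are invented here, and the demo header is synthetic rather than taken from a real `.doc` or `.xls`.

```python
import struct

# Magic bytes at the start of every compound file.
CFB_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def read_cfb_header(data: bytes) -> dict:
    """Parse a few fixed-offset fields from the 512-byte compound file header."""
    if len(data) < 512 or data[:8] != CFB_SIGNATURE:
        raise ValueError("not a compound file (bad signature or truncated header)")
    # Offsets 24..33: minor version, major version, byte order, sector shifts.
    minor, major, byte_order, sector_shift, mini_shift = struct.unpack_from(
        "<5H", data, 24)
    # Offsets 44..51: number of FAT sectors, first directory sector.
    fat_sectors, first_dir_sector = struct.unpack_from("<II", data, 44)
    return {
        "major_version": major,
        "minor_version": minor,
        "byte_order": byte_order,
        "sector_size": 1 << sector_shift,        # the power-of-two block size
        "mini_sector_size": 1 << mini_shift,
        "fat_sectors": fat_sectors,
        "first_dir_sector": first_dir_sector,
    }


def make_demo_header() -> bytes:
    """Build a synthetic version-3 style header for the demo (no real content)."""
    header = CFB_SIGNATURE + bytes(16)                      # signature + CLSID
    header += struct.pack("<5H", 0x003E, 3, 0xFFFE, 9, 6)   # versions, shifts
    header += bytes(6)                                      # reserved
    header += struct.pack("<III", 0, 1, 1)                  # dir/FAT counts
    return header.ljust(512, b"\x00")


if __name__ == "__main__":
    print(read_cfb_header(make_demo_header()))
```

Walking the actual sector chains and the directory is more work; this only shows where the "blocks aligned to a power-of-two size" part comes from.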

flomo•13h ago

Officially now MS-CFB (I think). OLE2 generally refers to a predecessor to COM, and not just the file format.

https://learn.microsoft.com/en-us/openspecs/windows_protocol...

masfuerte•6h ago

Being pedantic, OLE1 was the predecessor. OLE2 used COM for its plumbing.

Wikipedia has an article on the file format [1]. It was quite nice. It works like an uncompressed zip file with transactional updates.

Earlier Word document formats were much worse. They were a dump of Word's memory contents. Saving and loading was very quick though!

[1]: https://en.wikipedia.org/wiki/Compound_File_Binary_Format

rtpg•11h ago

"You have to standardize the format"

"OK we will standardize our serialization format"

It's... I guess malicious compliance, though also if you don't care about interop you're not going to try to abstract away your internal application structures, are you!

I appreciate the standard existing rather than it not existing. Trying to have the standard exist in this way has always felt like an uphill battle, and at least now there's _something_.

You will just have a better time if you emulate how Office does things. But you have a bit more documentation to go along with it.

Mikhail_Edoshin•13h ago

I remember Spreadsheet ML, an older format compatible with Excel. It had a subset of features, I think, but it was a rather powerful subset: formatting, formulae, multiple sheets. And it was rather simple. (Had a silly design mistake: for some reason MS gave the attributes a namespace, which is only necessary for rather specific purposes).

Another XML standard from MS that also seems relatively simple is XPS, a PDF alternative. But it uses Open Packaging and that is somewhat hard to read.

CobrastanJorji•12h ago

The OOXML fight is near and dear to my heart because, when it happened, I was a baby developer, and I cared about the issue for some reason I can barely recall, and I found an expert on the issue on Twitter. That guy would regularly tweet about everything that was going on and the problems with the spec and the shenanigans, and I was one of the, like, 20 people who were hanging on his every word. And sometimes he'd talk about beekeeping instead. It was my first introduction to Twitter at its best. You got these unfiltered whole views of the lives and concerns of real people who were, in part, experts at what you cared about. So sometimes you had to listen to them talk about other random stuff they thought was neat. And that's great!

fsflover•8h ago

> and I cared [...] and I found an expert

So did you somehow contribute to it in the end?

rjsw•7h ago

Another fake standard is ISO 14306.

Freak_NL•7h ago

If you ever write some HTTP endpoint where tabulated data is returned, you could quite reasonably return RFC 4180 style CSV.

However, if your API ever interfaces with users in a corporate environment, parsing simple comma-separated UTF-8 CSV is suddenly quite beyond the reach of whoever is nibbling at your endpoint, so why not code up a simple little reusable bit of code where you can write any simple tabular data (string, numbers, and dates, in one or more sheets of data made up of rows and columns) that lets you choose the output format? A zip-archive of CSV-files (one per sheet), JSON, ODS, or XLSX; pick your poison.

I did just that, and while it is perfectly doable, any low-level, low-resources, low-dependency approach will mean actually touching the XML in LibreOffice's ODS (fine), and Microsoft's OOXML (…).
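For the simplest of those output formats, a minimal sketch in Python could look like the following. It is not the code I actually wrote; the function name `sheets_to_csv_zip` and the shape of the `sheets` argument (a mapping from sheet name to a list of rows) are assumptions made for illustration.

```python
import csv
import io
import zipfile


def sheets_to_csv_zip(sheets):
    """Return ZIP bytes holding one RFC 4180 style CSV file per sheet.

    `sheets` maps a sheet name to a list of rows (lists of cell values).
    """
    archive = io.BytesIO()
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, rows in sheets.items():
            text = io.StringIO(newline="")
            # csv.writer defaults to CRLF line endings and quotes only when needed.
            csv.writer(text).writerows(rows)
            zf.writestr(f"{name}.csv", text.getvalue().encode("utf-8"))
    return archive.getvalue()
```

Dates would still need an explicit formatter before they reach the writer; this sketch just writes whatever `str()` gives for each cell.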

This is how you write a date in a cell in both.

ODS:

            <table:table-row table:style-name="ro1">
                <table:table-cell office:value-type="date" office:date-value="2021-04-10T12:34:56" calcext:value-type="date">
                    <text:p>10/4/2021, 12:34</text:p>
                </table:table-cell>
            </table:table-row>

OK, a bit verbose, but trivial to implement. Format the date however you like — you'll probably use two different formatters on the same datetime instant.

XLSX (OOXML):

            <row r="1" ht="12.8">
                <c r="A1" s="1" t="n">
                    <v>39448.5</v>
                </c>
            </row>

Obviously, as you can all plainly see, the date here is 2008-01-01T12:00:00…
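For anyone who cannot plainly see it, here is a rough Python sketch of the decoding. It assumes the default 1900 date system (not the 1904 one some workbooks use) and a serial after February 1900; `serial_to_datetime` is just a name picked for the example.

```python
from datetime import datetime, timedelta

# Day 0 of the serial scale for dates after February 1900. It is two days
# before 1900-01-01 because day 1 is 1900-01-01 and Excel also counts a
# 1900-02-29 that never existed.
EXCEL_EPOCH = datetime(1899, 12, 30)


def serial_to_datetime(serial):
    """Convert a serial day number (with fractional time of day) to a datetime."""
    return EXCEL_EPOCH + timedelta(days=serial)


print(serial_to_datetime(39448.5))  # 2008-01-01 12:00:00
```

Note that nothing in the cell itself says it is a date; that comes from the style referenced by `s="1"`.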

And of course it makes perfect sense to hardcode the cell coordinate there. It's not like you would dynamically generate a bunch of cells (…).
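Generating those coordinates yourself is not hard, just one more thing to get right. A small sketch, assuming 0-based row and column indexes on the caller's side (`column_letters` and `cell_ref` are made-up helper names):

```python
def column_letters(index):
    """Convert a 0-based column index to spreadsheet letters (0 -> A, 26 -> AA)."""
    letters = ""
    index += 1  # the letters form a 1-based "bijective base 26" number
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def cell_ref(row, col):
    """Build an A1-style reference from 0-based row and column indexes."""
    return f"{column_letters(col)}{row + 1}"


print(cell_ref(0, 0))   # A1
print(cell_ref(9, 27))  # AB10
```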

boricj•7h ago

> However, if your API ever interfaces with users in a corporate environment, parsing simple comma-separated UTF-8 CSV is suddenly quite beyond the reach of whoever is nibbling at your endpoint

Excel can directly ingest a CSV file served over a URL as a data source, with the Accept header manually set to text/csv.

I wrote a backend once that supported this feature so that management could pull whatever data they wanted off an internal application without pestering me. They could literally take the URL of a page and pull it as a CSV file as-is.

Anybody who knows a bit of Excel can pull that data themselves by following a set of simple instructions.

Freak_NL•6h ago

> Anybody who knows a bit of Excel can pull that data themselves by following a set of simple instructions.

That is very much possible. It is also completely impossible when you live in a country where Microsoft decreed that the C in CSV stands for semicolon; as far as Excel is concerned (no, seriously). Welcome to the Netherlands!

Now whether or not Excel can open a CSV file depends on the locale of the user, which will inevitably vary, and of course, whether they are using Excel at all.

So yes, you could offer just CSV, but not if your user is a spreadsheet jockey and you would like to stay on good terms with your support staff.

1718627440•4h ago

I don't use MS Office, but does Excel seriously not simply show a dialog that lets the user select this and maybe even auto-detects it? That's what I'm used to from the "subpar clone" (/s) from the Document Foundation.

kiicia•31m ago

It mostly did that until they changed it a few versions ago; now it sometimes does, sometimes does not, and sometimes falls flat on its face depending on the exact context of your action…

noAnswer•5h ago

For all my life, whenever I File > Opened a CSV file with Excel, all its content ended up in column A. I always have to work via Import Data and specify the file encoding and what the C in CSV stands for.
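On the consuming side, guessing what the C stands for is at least scriptable. A rough sketch in Python, which says nothing about what Excel does internally: try each candidate delimiter and keep the one that splits every row into the same number of columns, preferring more columns. The name `guess_delimiter` and the candidate list are assumptions for the example.

```python
import csv
import io


def guess_delimiter(text, candidates=(",", ";", "\t")):
    """Pick the delimiter that yields a consistent column count above one."""
    best, best_cols = candidates[0], 1
    for delim in candidates:
        rows = list(csv.reader(io.StringIO(text), delimiter=delim))
        widths = {len(row) for row in rows if row}
        # Only accept a delimiter if every non-empty row has the same width.
        if len(widths) == 1:
            cols = widths.pop()
            if cols > best_cols:
                best, best_cols = delim, cols
    return best


dutch = "name;amount\nAlice;1,5\nBob;2,25\n"
print(guess_delimiter(dutch))  # ;
```

This falls back to the first candidate when nothing splits cleanly, so it does not help with files that are inconsistent to begin with.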

eythian•4h ago

I was not aware there was an RFC for CSV, but the concept of "simple comma-separated UTF-8 CSV" is, in my experience, not something that exists. In a previous job, a chunk of my work was taking CSV files that were given to us and writing tooling to process them into a structured form for import elsewhere (typically we'd do a few test runs, and finally do a cut-over with final data, so it had to be scripted.)

During this I saw just about every variant of CSV and character encoding known to man, often inside the same file. Once I had a file that had UTF-8, MARC-8, Latin1, and (yes really) VT100 control codes. All in one file.

All in all, I'd prefer something that actually could be validated for some sort of correctness (this said, another time I got an XML export from some software that was invalid XML, so...)

jeroenhd•4h ago

Storing dates as numbers in a spreadsheet has been a thing since the first spreadsheet program I know of. Microsoft picked "days since 1900". If you're on UNIX, you may prefer using "1199188800" instead.

Other than that, the difference is pretty minor. ODS is very verbose and stores the content of the cell twice for some reason, but the XML trees are essentially the same.

The best way for corporate interaction is to export to whatever the hell Microsoft Excel accepts as an external data source, because .xlsx files can natively import remote data that way. Hope your customers' computers are all configured for en_US mode, though, because CSVs aren't as universal as people pretend they are.

Freak_NL•2h ago

I'm somewhat fine with the unix epoch timestamp, but this thing is not just seconds or microseconds since x, it is days since x, plus the rest of the time bits as a fraction. One second to midnight on January 1st 2008 is… 39448.9999884259.

Oh, and while 39448.5 is fine, 39448.0 makes Excel throw an error and refuse the whole document. Midnight January 1st 2008 is just 39448. The parser cannot handle 39448.0.
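So the writer ends up looking roughly like this Python sketch. It assumes the 1900 date system and takes my observation about the trailing ".0" at face value; `datetime_to_serial` and `serial_text` are names invented for the example.

```python
from datetime import datetime

# Day 0 of the serial scale for dates after February 1900.
EXCEL_EPOCH = datetime(1899, 12, 30)


def datetime_to_serial(moment):
    """Days since the epoch, with the time of day as a fraction of a day."""
    delta = moment - EXCEL_EPOCH
    return delta.days + delta.seconds / 86400


def serial_text(serial):
    """Format a serial for a <v> element, dropping trailing zeros and any bare '.'."""
    return f"{serial:.10f}".rstrip("0").rstrip(".")


print(serial_text(datetime_to_serial(datetime(2008, 1, 1))))              # 39448
print(serial_text(datetime_to_serial(datetime(2008, 1, 1, 12))))          # 39448.5
print(serial_text(datetime_to_serial(datetime(2008, 1, 1, 23, 59, 59))))  # 39448.9999884259
```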

donatj•6h ago

If you want to see something "fun", check out Apple's Numbers XML format from before they switched to protobuf. A simple 1,2,3 resulted in over 1 megabyte of XML.

At a previous job I'd been tasked with developing support for importing Numbers files alongside our existing Excel and CSV support. After a couple days we rightly gave up as the tiny fraction of people who actually wanted to import Numbers files was outweighed by its massive complexity.

We ended up just adding instructions for Numbers users to export to CSV

jonathaneunice•6h ago

Yes, artificially complex.

No one genuinely interested in document openness—be it for document workflows, publishing automation, content archiving, or future-proofed documents—would have done it that way.

Maybe it was as simple as dumping a simplistic re-encoding of its legacy binary format into XML and ramming it through standards organizations. But yes, there was malice aforethought and a classic Microsoft playbook in motion: embrace, extend, extinguish.

noAnswer•5h ago

I guess OP wasn't around back then when Microsoft openly bribed African countries to join ISO just to vote on this topic.

This says it all: https://cdn.imgpile.com/p/RppGj1l

fantyoon•3h ago

> The voting that followed was among ISO’s most contentious: several national bodies abruptly swelled with new members, many Microsoft partners, who then voted in favor. Sweden’s initial approval was voided after incentives linked to support came to light.

Direct quote from the article.

EdwardCoffin•5h ago

I remember reading an early criticism of the spreadsheet side of OOXML, where a simple spreadsheet with three cells was created: A1 containing '1', B1 containing '2', and C1 containing the formula A1+B1. That spreadsheet was saved, the file opened in an editor which showed the values of the cells, and A1 changed to something else, say 3. This broke the spreadsheet, as there were all sorts of knock-on effects contained in the virtually opaque mess that followed the cell contents.

I've probably got the details wrong, but that was the gist of it. I'd love to rediscover the analysis, but my searches have not yielded it.

itsthecourier•4h ago

exactly, the author ignored the specs and tried to draw conclusions about a system just by doing a Hello World

EdwardCoffin•4h ago

I think the point of the criticism I read was that the edit should have worked. There is no reason why the opaque mess following what was obviously a definition of the contents of the spreadsheet should even be there let alone be dependent on the original contents of the cells.

jeroenhd•4h ago

The obvious reason to have other stuff depend on the value of the cell would be to store a cache alongside the formulae. In a 300MiB XLSX, you don't want to evaluate every formula every time the spreadsheet is opened.

EdwardCoffin•3h ago

If caches have a place in a file storage format, they should at least be optional and separate from mandatory content, and I got the impression from the critique that they were neither.

itsthecourier•5h ago

they should have used some examples with tables, table of contents, macros, styling, columns, page size, etc. the blog post is artificially shallow

bayindirh•4h ago

If a standard which allows embedding closed-spec binary blobs even Microsoft can't implement perfectly from version to version is not a deliberate sabotage attempt, then I don't know what is.

This is Microsoft. Don't get distracted.
