* https://miller.readthedocs.io/
vis/unvis are fairly important tools for those text tables, too.
Also, FediVerse discussion: https://social.pollux.casa/@adele/statuses/01K1VA9NQSST4KDZP...
CSV files hide their meaning in external documentation or someone’s head, are extremely unclear in many cases (is this a number or a string? A date?) and are extremely fragile when it comes to people editing them in text editors. They entirely lack checks and verification at the most basic level and, worse still, they’re often but not always perfectly line based. Many tools then work fine until they completely break your file and you won’t even know. Until I get the file and tell you, I guess.
I’ve spent years fixing issues introduced by people editing them like they’re text.
If you’ve got to use tools to not completely bugger them then you might as well use a good format.
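For what it’s worth, here is a minimal Python sketch of the “not quite line based” failure mode; the data is made up:

```python
import csv
import io

# A perfectly valid CSV record whose quoted field contains a newline.
data = 'id,comment\n1,"line one\nline two"\n2,ok\n'

# A proper CSV parser sees two data records...
rows = list(csv.reader(io.StringIO(data)))
print(rows)  # [['id', 'comment'], ['1', 'line one\nline two'], ['2', 'ok']]

# ...but any "one record per line" tool (grep, sed, an editor macro) sees four
# lines and will happily cut the quoted field in half.
print(data.count("\n"))  # 4
```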
Maybe you need a database or an app rather than flat files.
It’s not about the stupidity of the humans, and if it was then planning for “no stupid people” is even stupider than those messing up the files.
> Maybe you need a database or an app rather than flat files.
Flat files are great. What’s needed are good file formats.
What's the problem?
> It's very verbose.
This is his example: https://github.com/crdoconnor/strictyaml/blob/master/hitch/s...
I think you shouldn't use YAML or TOML for this.
> TOML's hierarchies are difficult to infer from syntax alone
True! The point of TOML is to flatten the hierarchical structures. I would argue your configuration files shouldn't have much nesting anyway.
> Overcomplication: Like YAML, TOML has too many features
Basically TOML has a date type and all associated problems and advantages. I think it's a reasonable thing to include.
> Syntax typing
I think this is a good thing. I want to know whether something is a string or a number.
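A quick illustration of what syntax typing buys you, using Python's stdlib tomllib (3.11+); the keys are invented for the example:

```python
import tomllib  # stdlib since Python 3.11

doc = tomllib.loads("""
port = 8080            # bare value: parsed as an integer
version = "8080"       # quoted value: parsed as a string
released = 2024-06-01  # TOML's built-in date type
""")

print(type(doc["port"]))      # <class 'int'>
print(type(doc["version"]))   # <class 'str'>
print(type(doc["released"]))  # <class 'datetime.date'>
```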
From that article:
“This memo […] does not specify an Internet standard of any kind”
and
“Interoperability considerations:
Due to lack of a single specification, there are considerable differences among implementations. Implementors should "be conservative in what you do, be liberal in what you accept from others" (RFC 793 [8]) when processing CSV files”
Also, you're quoting me to myself: https://news.ycombinator.com/item?id=44837879
If I time traveled back to 1985 and told corporate to adopt CSV because it’d be useful in 50 years when unearthing old customer records I’d be laughed out of the cigar lounge.
https://en.wikipedia.org/wiki/Unix_philosophy
There already exist a bazillion binary serialization formats: protobufs, thrift, msgpack, capnproto, etc., but these all suffer from human inaccessibility. Generally, they should be used only when performance becomes a severe limiting factor, and not before; otherwise it's likely a sign of premature optimization.
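As a rough stdlib-only illustration of the trade-off (using struct as a stand-in for those formats, not their actual wire encodings):

```python
import json
import struct

record = {"id": 42, "temp_c": 21.5}

text = json.dumps(record).encode()     # b'{"id": 42, "temp_c": 21.5}'
binary = struct.pack("<id", 42, 21.5)  # 12 opaque bytes: little-endian int32 + float64

print(len(text), text)      # readable and self-describing, but larger
print(len(binary), binary)  # compact and fast to decode, meaningless without the schema ("<id")
```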
CSV was definitely in wide use back then.
Text formats are compressible.
I'd also question whether HTTP/1 can be treated as a pure text format, since it requires \r\n as the end-of-line marker even on systems that only use \n. A strict byte-level requirement like that shouldn't be needed if it were really a text protocol.
When I write I write text. I can transform text using various tools to provide various presentations consumable through various products. The focus is on content, not presentation, tools, or product.
I prefer human-readable file formats, and that has only been reinforced over more than 4 decades as a computer professional.
It took me many hours and a few backtracks to get to a point where I am satisfied with it and where errors are caught early. I would just suggest that anyone starting now enable --strict --pedantic on ledger-cli from day one, and write asserts for your accounts as well, e.g. to check that closed accounts don’t get new entries.
I really miss data entry being easier and not as prone to free-form text editing errors (most common are typos on the amount or copying the wrong source/dest account), but I am confident it matches reality much better than my spreadsheets did.
I have had to work with large 1 GB+ JSON files, and it is not fun. Amazing projects exist, such as jsoncons for streaming JSON and simdjson for parsing JSON with SIMD, but as far as I know, the latter still does not support streaming and even has an open issue for files larger than 4 GiB. So you cannot have streaming for memory efficiency and SIMD parsing for computational efficiency at the same time. You want streaming because holding the whole JSON document in memory is wasteful and sometimes not even possible. JSONL tries to fix that by changing the format, but now you have another format that you need to support.
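For what it's worth, JSONL really is just "one complete JSON document per line", so streaming it needs nothing special; a hedged sketch with a made-up filename:

```python
import json

def iter_records(path):
    # Yield one record at a time instead of loading a multi-gigabyte array into memory.
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

for rec in iter_records("events.jsonl"):  # hypothetical file
    ...
```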
I was also contemplating the mentioned formats for another project, but they are hardly usable when you need to store binary data, such as images, compressed data, or simply arbitrary data. Storing binary data as base64 strings seems wasteful. Random access into these files is also an issue, depending on the use case. Sometimes it would be a nice feature to jump over some data, but for JSON, you cannot do that without parsing everything in search of the closing bracket or quotes, accounting for escaped brackets and quotes, and nesting.
That is not the hallmark of a space-efficient file format.
Between repeated string keys and frequently repeated string values, which are often quite large due to being "human readable", it adds up fast.
"I was also contemplating the mentioned formats for another project, but they are hardly usable when you need to store binary data, such as images, compressed data, or simply arbitrary data."
One trick you can use is to prefix a file with some JSON or other readable value, then dump the binary afterwards. The JSON can have offsets into the binary as necessary for identifying things or labeling whether or not it is compressed or whatever. This often largely mitigates the inefficiency concerns because if you've got a big pile of binary data the JSON bloat by percent tends to be much smaller than the payload; if it isn't, then of course I don't recommend this.
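A minimal sketch of that trick in Python, with made-up field names (a length-prefixed JSON header, then the raw blobs):

```python
import json
import struct

def write_mixed(path, meta, blobs):
    # Layout: [4-byte header length][JSON header][raw binary blobs].
    # The header records the offset and size of each blob in the binary region.
    offsets, cursor = [], 0
    for b in blobs:
        offsets.append({"offset": cursor, "size": len(b)})
        cursor += len(b)
    header = json.dumps({"meta": meta, "blobs": offsets}).encode()
    with open(path, "wb") as f:
        f.write(struct.pack("<I", len(header)))
        f.write(header)
        for b in blobs:
            f.write(b)

def read_blob(path, index):
    with open(path, "rb") as f:
        (hlen,) = struct.unpack("<I", f.read(4))
        header = json.loads(f.read(hlen))
        entry = header["blobs"][index]
        f.seek(4 + hlen + entry["offset"])
        return f.read(entry["size"])
```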
The trick for the mix of JSON with binary is a good reminder. That's how the ASAR file archive format works. That could indeed be usable for what I was working on: a new file format for random seek indexes. Although the gztool index format seems to suffice for now.
In bioinformatics, basically all of the file formats are human-readable/text based. And file sizes range between 1-2 MB and 1 TB. I regularly encounter 300-600 GB files.
In this context, human-readable files are ridiculously inefficient, on every axis you can think of (space, parsing, searching, processing, etc.). It's a GD crime against efficiency.
And at that scale, "readable" has no value, since it would take you longer to read the file than 10 lifetimes.
In some cases human readable data is for interchange and it should be processed and queried in other forms - e.g. CSV files to move data between databases.
An awful lot of data is small - and these days I think you can say small is quite a bit bigger than 10 MB.
Quite a lot of data that is extracted from a large system would be small at that point, and would benefit from being human readable.
The benefit of data being human readable is not necessarily that you will read it all, but that it is easier to read bits that matter when you are debugging.
In bioinformatics, most large text files are gzip'd. Decompression is a few times slower than proper file parsing in C/C++/Rust. Some pure Python parsers can be "ridiculously inefficient", but that is not the fault of human-readability. Binary files are compressed with existing libraries. Compressed binary files are not noticeably faster to parse than compressed text files. Binary formats can indeed be smaller, but space-efficient formats take years to develop and tend to have more compatibility issues. You can't skip the text format phase.
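E.g., the usual pattern is just streaming through the gzip layer; a small sketch with a made-up filename:

```python
import gzip

# Stream a gzip'd text file without holding the decompressed data in memory.
n_seqs = 0
with gzip.open("reads.fasta.gz", "rt") as f:  # hypothetical file
    for line in f:
        if line.startswith(">"):
            n_seqs += 1
print(n_seqs)
```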
> And at that scale, "readable" has no value, since it would take you longer to read the file than 10 lifetimes.
You can't read the whole file by eye, but you can (and should often) eyeball small sections in a huge file. For that, you need a human-readable file format. A problem with this field IMHO is that not many people are literally looking at the data by eye.
Similarly, NEXUS files are human-readable, but it'd be tough to discern the shape of an inlined 200-node Newick tree.
When I was asking people who do actual bioinformatics (well, genomics) what some of their annoyances were when working with bioinformatics software, having to do a bunch of busywork on files in between pipeline steps (compressing/uncompressing, indexing) was one of the complaints mentioned.
I think there's a place in bioinformatics for a unified binary format which can take care of compression, indexing, and metadata. But with that list of requirements it'd have to be binary. Data analysis moved from CSVs and Excel files to Parquet, and I think there's a similar transition waiting to happen here
Further, when the file is unindexed it's even harder to read it as a human because you can't easily skip to a particular section. I have this trouble often where my code can efficiently access the data once it's loaded, but a human-eye check is tedious/impossible because you have to scroll through gigabytes to find what you want.
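A hedged sketch of what that Parquet-style transition looks like in practice, assuming pyarrow is installed (the data and filename are invented):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Columnar, compressed, and carries its own schema and row-group structure.
table = pa.table({
    "chrom": ["chr1", "chr1", "chr2"],
    "pos": [100, 250, 42],
    "qual": [30.0, 12.5, 60.0],
})
pq.write_table(table, "variants.parquet", compression="zstd")

# Read back only the columns you need; row groups let readers skip irrelevant chunks.
subset = pq.read_table("variants.parquet", columns=["chrom", "pos"])
```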
Unless someone just decided to shove random stuff in binary mode and call it a day?
Binary formats are binary for a reason. Speed of interpretation is one reason. Usage of memory is another reason. Directly mapping it and using it, is another reason. Binary formats can make assumptions about system memory page size. They can store internal offsets to make incremental reading faster. None of this is offered by text formats.
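For example, here is a hedged sketch of a made-up fixed-record binary layout read through mmap, where record i is reached by arithmetic rather than by parsing everything before it:

```python
import mmap
import struct

# Hypothetical layout: an 8-byte header holding a record count,
# followed by fixed-size records of (uint32 id, float64 value).
RECORD = struct.Struct("<Id")

def read_record(path, i):
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        (count,) = struct.unpack_from("<Q", m, 0)
        assert i < count
        # Jump straight to record i; no scanning, no parsing of earlier records.
        return RECORD.unpack_from(m, 8 + i * RECORD.size)
```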
Also, the supposed freedom to modify text formats doesn't always hold: nothing can be changed once we introduce checksums inside a text format, and if we digitally sign it, nothing can be changed either, despite it being a text format.
Also, comparing CSV files to an internal database binary format? It's like comparing a book cover to the ERP system of a library. Meaning, it's comparing two completely different things.
Most of the arguments presented in TFA are about openness, which can still be achieved with standard binary formats and a schema. Hence the problem left to solve is accessibility.
I’m thinking of something like Parquet, protobuf or SQLite. Despite their popularity, they still aren’t trivial for anyone to edit.
Google uses it a lot for data dumps for tests or config that can be put into source control.
It saves you from escaping stuff inside of multiline-strings by using meaningful whitespace.
What I did not like so much about CCL is that it leaves a bunch of stuff underspecified. You can make lists and comments with it, but YOU have to decide how.
If a binary file has a well-known format and tools available to view/edit it, I see zero problems with it.
Given an arbitrary stream of bytes, readability only means the human can inspect the file. We say "text is readable" but that's really only because all our tooling for the last sixty years speaks ASCII and we're very US-centric. Pick up a text file from 1982 and it could be unreadable (EBCDIC, say). Time to break out dd and cross your fingers.
Comprehension breaks down very quickly beyond a few thousand words. No geneticist is loading up a gig of CTAGT... and keeping that in their head as they whiz up and down a genome. Humans have a working set size.
Short term retrieval is excellent for text and a PITA for everything else. Raise your hand if you've gotten a stream of bytes, thrown file(1) at it, then strings(1), and then resorted to od or picking through the bytes.
Long term retrieval sucks for everyone. Even textfiles. After all, a string of bytes has no intrinsic meaning except what the operating system and the application give it. So who knows if people in 2075 will recognise "48 65 6C 6C 6F 20 48 4E 21"?
Edit: I can't count. H and I are consecutive in the alphabet, and it actually says "Hello HN!". I think my general point is valid, though.
I think the author is thinking about a very narrow set of files.
There was a published study, Wrangling Messy CSV Files by Detecting Row and Type Patterns by Gerrit J. J. van den Burg, Alfredo Nazábal, and Charles Sutton (Data Mining and Knowledge Discovery, 2019), that showed many pitfalls in parsing CSV files found on GitHub; their approach achieved 97% accuracy. It's easy to write code that slings out some text fields separated by commas with the objective of producing a human-readable, portable format; parsing it back reliably is the hard part.
You can learn even more by letting autofuzz loose on your nice simple code for parsing human-readable files.
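A toy version of the dialect problem that paper studies (the data is made up): the same bytes parse into different tables depending on which dialect the reader guesses.

```python
import csv
import io

data = 'name;note\nalice;"1;2;3"\n'

# Parsed with the dialect the writer intended (semicolon-delimited):
print(list(csv.reader(io.StringIO(data), delimiter=";")))
# [['name', 'note'], ['alice', '1;2;3']]

# Parsed with the default comma dialect a naive reader would assume:
print(list(csv.reader(io.StringIO(data))))
# [['name;note'], ['alice;"1;2;3"']]
```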