> A cell starts with | and ends with one or more tabs.
|one\t|two|three
How many cells is this? Seems like just one, with garbage at the end, since there are no closing tabs after the first cell? Should this line count as a valid row?
> A line that starts with a cell is a row. Any other lines are ignored.
Well, I guess it counts. Either way, how should one encode a value containing a tab followed by a pipe?
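To make the question concrete, here is a rough sketch of one possible reading of the quoted rules, not the format's reference parser. It assumes that a run of tabs followed by `|` is what separates cells, and the helper name `split_cells` is made up for illustration:

```python
import re


def split_cells(line: str) -> list[str]:
    """Split one row into cells, ASSUMING `<tabs>|` is the cell boundary."""
    if not line.startswith("|"):
        return []  # per the quoted rule, other lines are ignored
    # Drop the leading pipe and the closing tabs of the last cell.
    body = line[1:].rstrip("\t")
    return re.split(r"\t+\|", body)


print(split_cells("|one\t|two|three"))
print(split_cells("|x\t|y\t"))
```

Under this reading the example line has two cells, `one` and `two|three`. The second call shows the encoding problem: a single cell whose value is `x<tab>|y` splits exactly like two cells `x` and `y`, so the value cannot be written without some extension or escape.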
This is why you should almost always use text formats.
The binary form has a lot of benefits over the plain-text form when editing. For example, when you change a UInt8 value from 0 to 100, you just replace one byte at a known position instead of rewriting the whole document.
That was also my idea for an operating system design: it would have a binary format used for most things, similar to DER but different in some ways (including which types are available), intended to be interoperable among most of the programs on the system.
My very next step is OS development too, but I'm not sure where to learn the OS at the opcode level. I thought I'd get started with the Intel docs for my CPU.
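A minimal sketch of that in-place edit, assuming a made-up fixed-width record layout (a little-endian UInt32 id followed by a UInt8 level); the names `RECORD`, `LEVEL_OFFSET`, and `set_level` are hypothetical:

```python
import io
import struct

# Hypothetical record layout (an assumption for illustration):
# little-endian UInt32 id followed by a UInt8 level, 5 bytes per record.
RECORD = struct.Struct("<IB")
LEVEL_OFFSET = 4  # the UInt8 sits right after the 4-byte id


def set_level(buf: io.BytesIO, record_index: int, level: int) -> None:
    """Overwrite one UInt8 in place; no other byte moves."""
    buf.seek(record_index * RECORD.size + LEVEL_OFFSET)
    buf.write(struct.pack("<B", level))


binary = io.BytesIO(b"".join(RECORD.pack(i, 0) for i in range(3)))
size_before = len(binary.getvalue())
set_level(binary, 1, 100)
records = list(RECORD.iter_unpack(binary.getvalue()))
print(records, len(binary.getvalue()) == size_before)

# The same change in a text encoding makes the document two characters
# longer, so everything after the edit has to shift.
text = "0,0\n1,0\n2,0\n"
edited = text.replace("1,0\n", "1,100\n")
print(len(edited) - len(text))
```

This only works because every record has the same width; a binary format with variable-length fields would need rewriting or an index too.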
We have them, they're used where appropriate.
> Throw away the text formats.
I would argue that _most_ of the time, when tsv or csv is used, it's because either:
a) it's the lowest common denominator for interchange. Oh, you don't have _my_ specific version of $program? How about I give you the data in csv? _Everything_ can read that...
b) a human is expected to inspect/view/adjust the data, and they'd be using a bin -> text tool anyway. The move to binary log formats (`journald`) is still somewhat controversial. It would have been a non-starter if the tooling to make the binary "readable" weren't on par with (or, in a few cases, better than!) the contemporary text-based tooling we've been used to for the prior 30+ years.
More recently though, consider that LLMs are terrible at emitting binary files, but amazing at emitting text. I can have a GPT spit out a nice diagram in Mermaid, or create calendar entries from a photo of an event program in ical format.
2. Generated TPSV would look like an unreadable hard to edit mess. I doubt any tool would calculate max column length to adjust tab count for all cells. It basically kills any streaming.
> 1. It's quite easy to miss a tab and use only `|`.
Any format is hard to edit manually if you don't follow the requirements of the format (which are very simple in this case).
> 2. Generated TPSV would look like an unreadable hard to edit mess.
CSVs are much less readable than this, but still entirely possible to edit.
Ummm, how do you figure out what row has too many cells? Can all the rows before this one have too few cells?
> 3. The first row defines the number of columns
Also it doesn't seem to say anything about the header row?
ASCII (and through it, Unicode) has these values specifically for this purpose.
Notepad++ handles the display and entry of these characters fairly easily. I think they're nowhere near as unergonomic as people say they are.
If RS and US were in common use, there would be a need to have a visible representation for them in the terminal, and a way to enter RS on the keyboard. Pretty soon, strings that contain RS would become much more common in the wild.
Then, one day somebody would need to store one of those strings in a table, and there would be no way to do so without escaping.
I do think that having RS display in the terminal (like a newline followed by some graphic?) and using it would be an improvement over TSV's use of newline for this purpose, but considering that it's not a perfect solution, I can understand why people are not overly motivated to make this happen. The time for this may have been 40+ years ago when a standard for how to display or type it would be feasible to agree upon.
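A small sketch of the escaping point above, assuming the simplest possible scheme (US between cells, RS between rows, no escaping); `encode` and `decode` are hypothetical helpers, not part of any standard:

```python
US = "\x1f"  # ASCII Unit Separator, used here between cells
RS = "\x1e"  # ASCII Record Separator, used here between rows


def encode(rows: list[list[str]]) -> str:
    return RS.join(US.join(row) for row in rows)


def decode(text: str) -> list[list[str]]:
    return [record.split(US) for record in text.split(RS)]


# Newlines, tabs, commas and pipes in values survive without any escaping.
rows = [["name", "note"], ["Anhinga", "line one\nline two,\twith | too"]]
print(decode(encode(rows)) == rows)

# A value that itself contains RS does not survive the round trip.
bad = [["a" + RS + "b", "c"]]
print(decode(encode(bad)))
```

The second case comes back as two rows, `["a"]` and `["b", "c"]`, which is the situation where escaping becomes unavoidable.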
Both already possible, they have official symbols representing them
> Then, one day somebody would need to store one of those strings in a table, and there would be no way to do so without escaping.
Why? But also, yes, escaping also exists, just like in the alternative formats
I'm not sure what you mean. For an illustration, my terminal does not print anything for them.
$ printf "qq\36\37text\n"
qqtext
*Update/Aside:* "My terminal", in this case, was `tmux`. Ghostty, OTOH, prints spaces instead of RS or US.
Unicode does have some symbols for every non-printable ASCII character, which you can see as follows with https://github.com/sharkdp/bat (assuming your font has the right characters, which it probably does):
$ printf "qq\36\37text\n" | bat -A --decorations never
qq␞␟text␊
Here, `␞` is https://www.compart.com/en/unicode/U+241E, one of the symbols for non-printable characters that Unicode has; different fonts display it differently. See also https://www.compart.com/en/unicode/block/U+2400.
Is there some better representation it has?
If you meant the default should always be symbolic, not sure, like newline separator isn't displayed in the terminal as a symbol, but maybe that's just a matter of extra terminal config
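For anyone without `bat`, the symbolic display shown above can be approximated in a few lines. This is a sketch that relies on the layout of the Unicode Control Pictures block (U+2400 plus the control code, and U+2421 for DEL); the function name `visible` is made up:

```python
def visible(text: str) -> str:
    """Replace ASCII control characters with their Control Pictures symbols."""
    out = []
    for ch in text:
        code = ord(ch)
        if code < 0x20:
            out.append(chr(0x2400 + code))  # e.g. RS (0x1E) -> U+241E
        elif code == 0x7F:
            out.append("\u2421")  # DEL has its own symbol
        else:
            out.append(ch)
    return "".join(out)


print(visible("qq\x1e\x1ftext\n"))  # qq␞␟text␊
```

Whether a terminal should do this by default is a separate question; this only shows that the mapping itself is mechanical.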
It instinctively feels horrible, but it’s easy to create and parse in basically every language, easy to fully specify, recovers well from one broken line in large datasets, chops up and concatenates easily.
We have YAML but it's too complex. JSON is rather verbose with all the repeated keys and quoting, XML even moreso. I'd also like to see a 'schema tree' corresponding to a header row in TSV/CSV. I'd even be fine with a binary format with standard decoding to see the plain-text contents. Something for XML like what MessagePack does for JSON would work, since we already have schema specifications.
But CSV represented as JSON is usually accomplished like so:
{
"headers": ["name", "habitat", "food"],
"data": [
["Acorn Woodpecker", "forest", "grain"],
["American Goldfinch", "grassland", "grain"],
["Anhinga", "wetland", "fish"],
["Australian Reed Warbler", "wetland", "grub"],
["Black Vulture", "forest", null]
]
}
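A short sketch of how a consumer might turn that shape back into one record per row; it assumes the `headers` and `data` keys exactly as in the example above:

```python
import json

doc = json.loads("""
{
  "headers": ["name", "habitat", "food"],
  "data": [
    ["Acorn Woodpecker", "forest", "grain"],
    ["Black Vulture", "forest", null]
  ]
}
""")

# Pair each row with the header names to get one dict per record.
records = [dict(zip(doc["headers"], row)) for row in doc["data"]]
for record in records:
    print(record)
```

The column names appear once instead of once per row, and JSON `null` comes through as `None`, which plain CSV has no standard way to express.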
Every textual data format that is not originally S-expressions eventually devolves into an informally-specified, bug-ridden, slow implementation of half of S-expressions.
https://en.wikipedia.org/wiki/ASCII#Character_groups
https://en.wikipedia.org/wiki/C0_and_C1_control_codes#C0_con...
Dec Octal Hex Binary
028 034 01C 00011100 FS (File Separator)
029 035 01D 00011101 GS (Group Separator)
030 036 01E 00011110 RS (Record Separator)
031 037 01F 00011111 US (Unit Separator)
I think it can encode anything except for something matching the regex `(\t+\|)+` at the end of cells (*Update:* Maybe `\n?(\t+\|)+`, but that doesn't change my point much) including newlines and even newlines followed by `\` (with the newline extension, of course).
For a cell containing `cell<newline>\`, you'd have:
|cell<tab>|
\\<tab >|
(where `<tab >` represents a single tab character regardless of the number of spaces)
Moreover, if you really needed it, you could add another extension to specify tabs or pipes at the end of cells. For a POC, two cells with contents `a<tab>|` and `b<tab>|` could be represented as:
|a<tab ><tab>|b
~tab pipe<tab>|tab pipe
(with literal words "tab" and "pipe"). Something nicer might also be possible.
*Update:* Though, if the focus is on humans reading it, it might also make sense to allow a single row of the table to wrap and span multiple lines in the file, perhaps as another extension.
I personally use it to write tabular data manually, which we use to define our data model. Because this format is editor-agnostic, colleagues can easily read and edit it as well. So in my case the focus is on human read/write and machine read.
It has some nice properties: 1) it’s many fewer tokens than JSON. 2) it’s easier to edit prompts and examples in something like Google sheets, where the default format of a copied group of cells is in TSV. 3) have I mentioned how many fewer tokens it is? It’s faster, cheaper, and less brittle than a format that requires the redefinition of every column name for every row.
Obviously this breaks down for nested object hierarchies or other data that is not easily represented as a 2d table, but otherwise we’ve been quite happy. I think this format solves some other things I’ve wanted, including header comments, inline comments, better alignment, and markdown support.
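A rough way to see the size difference described above, using character counts as a stand-in for tokens (real token counts depend on the tokenizer, so treat this only as an approximation); the sample rows are borrowed from the bird example elsewhere in the thread:

```python
import json

headers = ["name", "habitat", "food"]
rows = [
    ["Acorn Woodpecker", "forest", "grain"],
    ["American Goldfinch", "grassland", "grain"],
    ["Anhinga", "wetland", "fish"],
]

# JSON as a list of objects repeats every column name in every row.
as_json = json.dumps([dict(zip(headers, row)) for row in rows])

# TSV states the column names once, in the header row.
as_tsv = "\n".join("\t".join(cells) for cells in [headers] + rows)

print(len(as_json), len(as_tsv))
```

The gap widens with more rows and longer column names, since the per-row overhead of the keys is constant in JSON and zero in TSV.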