A love letter to the CSV format (2024)

https://medialab.sciencespo.fr/en/news/a-love-letter-to-the-csv-format/

109•jordigh•5mo ago

Comments

ayhanfuat•5mo ago

Previously: A love letter to the CSV format (https://github.com/medialab/xan/blob/master/docs/LOVE_LETTER...)

708 points | 5 months ago | 698 comments (https://news.ycombinator.com/item?id=43484382)

klinch•5mo ago

Hot take: I prefer xlsx over CSV

I used to work on payment infrastructure and whenever a vendor offered us the choice between CSV and some other format we always opted for that other format (often xlsx). This sounds a bit weird, but when using xlsx and a good library for handling it, you never have to worry about encoding, escaping and number formatting.

This is one of these things that sound absolutely wrong from an engineering standpoint (xlsx is abhorrently complex on the inside), but works robustly in practice.

Slightly related: This was a German company, with EU and US payment providers. Also note that Microsoft Excel (and therefore a lot of other tools) produces "semicolon-separated values" files when started on a computer with the locale set to German...

n4r9•5mo ago

Works okay until someone opens the file in Excel, writes "2-9" into a cell, and saves it without realising it's been changed to "02/09/2025" behind the scenes.

chungy•5mo ago

Wait until you find out that "02/09/2025" is actually 45697 behind the scenes ;)

cluckindan•5mo ago

And thus, with some semantic leeway, -7 = 45697

porridgeraisin•4mo ago

Why?

personalityson•5mo ago

It should have been semicolon from the start

IanCal•5mo ago

There's ascii characters for field and record delimiters which would be perfect.

I tried using them once after what felt like an aeon of quoting issues, and the first customer file I had had them randomly appearing in their fields.

olive-n•5mo ago

I'll take csv over xlsx any time.

I work a lot with time series data, and excel does not support datetimes with timezones, so I have to figure out the timezone every time to align with other sources of data.

Reading and writing them is much slower than csv, which is annoying when datasets get larger.

And most importantly, xlsx are way more often fucked up in some way than any other format. Usually, because somebody manually did something to them and sometimes because the library used to write them had some hiccup.

So I guess a hot take indeed.

imtringued•5mo ago

This is the correct take. I've never had any significant problems with xlsx. You may call it abhorrently complex, but for me it is just a standardized way to serialize tabular data via XML.

porker•5mo ago

> when using xlsx and a good library for handling it

Which good libraries did you find? That's been my pain point when dealing with xlsx.

klinch•4mo ago

Apache POI (Java) + a light in-house abstraction on top of it

hiAndrewQuinn•5mo ago

I like CSV because its simplicity and ubiquity make it an easy Schelling point in the wide world of corporate communication. Even very non-technical people can, with some effort, figure out how to save a CSV from Excel, and figure out how to open a CSV with Notepad if absolutely necessary.

On the technical side libraries like pandas have undergone extreme selection pressure to be able to read in Excel's weird CSV choices without breaking. At that point we have the luxury of writing them out as "proper" CSV, or as a SQLite database, or as whatever else we care about. It's just a reasonable crossing-over point.

heresie-dabord•5mo ago

CSV is a flexible solution that is as simple as possible. The next step is JSONL.

https://jsonlines.org/

https://medium.com/@ManueleCaddeo/understanding-jsonl-bc8922...

mcdonje•5mo ago

>Excel hates CSV. It clearly means CSV must be doing something right.

Use tabs as a delimiter and excel interops with the format as if natively.

tacker2000•5mo ago

The problem is that nobody in the real world uses tabs.

Everyone uses , or ; as delimiters and then uses either . or , for decimals, depending on the source.

It shouldnt be so hard to auto-detect these different formats, but somehow in 2025, Excel still cannot do it.

sfn42•5mo ago

You don't need to auto-detect the format. The delimiter can be declared at the top of the file as for example sep=;

yrro•5mo ago

But now that's not CSV. It's CSV with some kind of ad-hoc header...

sfn42•5mo ago

It may not be part of any official CSV spec but Excel accepts it. I found that Excel would read my files much more reliably using the sep declaration, which is great when the target audience is less technical types.

IAmBroom•5mo ago

I love you forever and a day. Thank you.

sfn42•5mo ago

Happy to help! :)

tacker2000•5mo ago

Ok thats a nice tip, but to be fair when i download some CSV report off some website, i dont wont to open it to check the delimiter, then edit it and resave it. Often I am downloading dozens of such files at a time.

sfn42•5mo ago

The idea is that the program that creates the file adds that line at the top. If you're downloading CSV files from websites then ideally they should already have that line.

If they don't then what you could do is create a simple script that just adds that line, and Excel will open the files without you having to hassle with making sure Excel interprets them correctly. Of course that's a bit more challenging if they use different separators, but you might be able to find an easy adaptation for your usecase like making a decision about which delimiter to declare based on the filename. Or you could try to analyze the header row to figure out which delimiter to use based on that.

pragmatic•5mo ago

Pipe enters the chat.

For whatever reason, pipe seems to be support common in health care data.

gentooflux•5mo ago

Use tabs as a delimiter and it's not CSV anymore, that's TSV.

mcdonje•5mo ago

They're essentially the same format. Same with PSV. They're all DSVs.

Most arguments for or against one apply to all.

https://en.m.wikipedia.org/wiki/Delimiter-separated_values

roelschroeven•5mo ago

It still can't properly deal with CSVs that use different decimal separators than the UI setting in Excel / Windows. It's still too stupid to understand that UI localization and interoperation should never be mixed.

IAmBroom•5mo ago

Yes, again: Excel is not the ideal CSV tool.

It is A CSV tool, readily available in the business world, that often works quite well.

And your argument about comma separators is wrong; the string

1,234

in a CSV file SHOULD mean "two values: 1 and 234", regardless of the local decimal separator. The number one thousand two hundred thirty-four is represented as

"1,234"

roelschroeven•5mo ago

"And your argument about comma separators is wrong; the string

1,234

in a CSV file SHOULD mean "two values: 1 and 234", regardless of the local decimal separator."

Yes, I agree, it SHOULD mean that, but that is NOT what Excel does when the decimal separator is set to "," in the regional settings. Excel wrongly thinks it should apply the comma as decimal separator, and reads that number as 1 unit and 234 thousands.

Locale MUST NOT be used for data formats, but Excel does it anyway.

This problem doesn't manifest itself when you're using a locale which matches the CSV's separators. Consider yourself lucky if you're in that situation.

sevensor•5mo ago

I was writing a program a little while ago that put data on the clipboard for you to paste into Excel. I tried all manner of delimiters before I figured out that Excel really loves HTML tables. If you wrap everything in <tr> and <td>, and put it up on the clipboard, it pastes flawlessly.

kelvinjps10•5mo ago

Isn't that tsv then?

mcdonje•5mo ago

Answered: https://news.ycombinator.com/item?id=45195713

kelvinjps10•5mo ago

Thanks for explanation

vim-guru•5mo ago

Excel hates CSV

It clearly means CSV must be doing something right.

wkat4242•5mo ago

Especially in Europe because we use the comma as a decimal point. So every csv file opened in Excel is screwed up.

adithyassekhar•4mo ago

Someone should write an article on this

guzik•5mo ago

I am glad that we decided to pick CSV as our default format for health data (even for heavy stuff like raw ECG). Yeah, files were bigger, but clients loved that they could just download them, open in Excel, make a quick chart. Meanwhile other software was insisting on EDF (lighter, sure) but not everything could handle it.

IAmBroom•5mo ago

And at this point, "lighter" is immaterial.

"Hi, I'm sending you a two-line statement in a Word document. It's 10kB."

"Thanks, I took a screenshot of it and forwarded it. It's now 10MB."

"Great! That's handy!"

untrimmed•5mo ago

This is a great defense, but I feel like it misses the single biggest reason CSV will never die: your boss can open it. We can talk about streaming and Parquet all day, but if the marketing team can't double-click the file, it's useless.

imtringued•5mo ago

With what software? LibreOffice? Excel doesn't support opening CSV files with a double click. It lets you import CSV files into a spreadsheet, but that requires reading unreasonably complicated instructions.

jowea•5mo ago

How is that not opening?

imtringued•5mo ago

You are creating a new spreadsheet that you can save as an xlsx. What you are looking at is not the CSV file itself.

NoboruWataya•5mo ago

This is a distinction that does not matter to most non-technical people.

tokai•5mo ago

You are missing the point so hard its hilarious.

john_the_writer•5mo ago

Well I mean unless you're inspecting it with a hex editor, you're not looking at the csv file itself. Even then, I suppose you could say that's not even the file itself. An electron microscope perhaps? But then you've got the whole Heisenberg issue, so there's that.

eviks•5mo ago

That's not true either, try it yourself with a simple csv file, open it, add a row and save - you'll see the original update

(there are some limitations)

john_the_writer•5mo ago

What are you talking about? Excel opens csv with zero issue. In windows, and mac. Mac you right click and "open with". Or you open excel, and click file/open and find the csv. I do the first one a dozen times a day.

1wd•5mo ago

Only if the Windows Regional Settings List Separators happens to be "comma", which is not the case in most of Europe (even in regions that use the decimal point) so only CSV files with SEP=, as the first line work reliably with Excel.

john_the_writer•5mo ago

Literally did this all day today. Took a csv file, parsed it in elixir, processed it and created a new csv file, then opened that in excel, to confirm the changes. At least 100 times today.

curioussquirrel•4mo ago

This, plus the parser in Excel gets thrown off by some multiline edge cases very easily. Also, the file has to be UTF-8-BOM, not just UTF-8.

efitz•5mo ago

Excel absolutely can open csv files with a double click if you associate the file type extension.

boshomi•5mo ago

You should never blindly trust Excel when using CSV files. Try this csv file:

    COL1,COL2,COL3 
    5,"+A2&C1","+A2*8&B1"

IAmBroom•5mo ago

True, but not the point here. "You can" and "You should as a general rule" are not the same.

0x3444ac53•4mo ago

Hesitant to actually try to open this, what does it do?

ertgbnm•5mo ago

On windows, csv's automatically open in Excel through the file explorer. Almost all normal businesses use windows so the OPs claim is pretty reasonable.

tommica•5mo ago

Depends on the country/locale - I just generate them with semicolons to enable easy opening

IAmBroom•5mo ago

We're discussing CSVs. You are discussing SemicolonSVs.

tommica•4mo ago

I do wish SCSV was thing

delta_p_delta_x•5mo ago

> Excel doesn't support opening CSV files with a double click

Yes, it does. When Excel is installed, it installs a file type association for CSV and Explorer sets Excel as the default handler.

kelvinjps10•5mo ago

Those programs support opening csv with double click

femto•5mo ago

CSV is good for debugging C/C++ real-time signal processing data paths.

Add cout or printf lines, which on each iteration print out relevant intermediate values separated by commas, with the first cell being a constant tag. Provided you don't overdo it, the software will typically still run in real-time. Pipe stdout to a file.

After the fact, you can then use grep to filter tags to select which intermediate results you want to analyse. This filtered data can be loaded into a spreadsheet, or read into a higher level script for analysis/debugging/plotting/... In this way you can reproducibly visualise internal operation over a long period of time and see infrequent or subtle deviations from expected behaviour.

matt_daemon•5mo ago

Agree this is the main use for it

joz1-k•5mo ago

Except that the "comma" was a poor choice for a separator, the CSV is just a plain text that can be trivially parsed from any language or platform. That's its biggest value. There is essentially no format, library, or platform lock-in. JSON comes close to this level of openness and ease, but YAML is already too complicated as a file format.

jstanley•5mo ago

JSON has the major annoyance that grep doesn't work well on it. You need tooling to work with JSON.

theknarf•5mo ago

grep is a tool. jq is a good tool for json.

kergonath•5mo ago

grep is POSIX and you can count on it being installed pretty much anywhere. That’s not the case for jq.

whizzter•5mo ago

Do people contain themselvs to a POSIX conformant grep subset in practice, or do you mean GNU grep that probably doesn't behave according to spec unless POSIXLY_CORRECT is set?

IAmBroom•5mo ago

"Anywhere" does not include Windows environments, which are over half the work computers out there.

krogenx•4mo ago

If a workstation has Git installed on it, which I’d think would be the case for substantial number of engineers out there (…not just software engineers), grep is there due to Git BASH.

re•5mo ago

As soon as you encounter any CSVs where field values may contain double quotes, commas, or newlines, you need tooling to work with CSV as well.

(TSV FTW)

spicybbq•5mo ago

One of my first tasks as a junior dev was replacing an incorrect/incomplete "roll your own" CSV parsing regex (which broke in production) with a library.

IAmBroom•5mo ago

TSV is superior to CSVs, and it still angers me that Excel doesn't offer it as a standard input option, but your examples are fairly easily handled by eye in a text file.

Tools definitely make it faster and more reliable.

euroderf•4mo ago

ASCII FS GS RS US ... just make decent font entries for them.

jstanley•4mo ago

And keys on the keyboard.

euroderf•4mo ago

Yes! But nobody ever came up with decent font entries that would look snappy on keys. Not even IBM (or Data General or Burroughs or whoever) I guess.

rogue7•4mo ago

For this I use gron [0]. It's very convenient.

[0]: https://github.com/tomnomnom/gron

john_the_writer•5mo ago

100%.. xml also worked here too..

YAML is a pain because it has every so slightly different versions, that sometimes don't play nice.

csv or TSV's are almost always portable.

humanfromearth9•5mo ago

And the best thing about CSV is that it is a text file with a standardized, well known, universally shared encoding, so you don't have to guess it when opening a CSV file. Exactly in the same way as any other text file. The next best thing with CSV is that separators are also standardized and never positional, you never have to guess.

whizzter•5mo ago

Almost missed the sarcasm :)

nradov•5mo ago

Technically there is a CSV standard in IETF RFC 4180, although compliance isn't required and of course many implementations are broken.

https://www.ietf.org/rfc/rfc4180.txt

thw_9a83c•5mo ago

The notion of a "platform" caught my attention. Funny story: About five years ago, I got a little nostalgic and wanted to retrieve some data from my Atari XL computer (8-bit) from my preteen years. Back then, I created primitive software that showed a map of my village with notable places, like my friends' homes. I was able to transform all the BASIC files (stored on cassette tape) from the Atari computer to my PC using a special "SIO2PC" cable, but the mapping data were still locked in the BASIC files. So I had the idea to create a simple BASIC program that would run in an Atari 8-bit PC emulator, linearize all the coordinates and metadata, and export them as a CSV file. The funny thing is that the 8-bit Atari didn't even use ASCII, but an unusual ATASCII encoding. But it's the same for letters, numbers, and even for the comma. Finally, I got the data and wrote a little Python script to create an SVG map. So yes, CSV forever! :)

conception•5mo ago

|| separated for life

keeperofdakeys•5mo ago

Arguably, "comma as a separator" is close enough to comma's usage in (many) written languages that it makes it easier for less technical users to interact with CSV.

wlesieutre•5mo ago

Easier as long as they don't try to put any of those written languages in the CSV

Commas and quotation marks suddenly make it complicated

dirkt•4mo ago

Try exporting things from Excel to CSV on a Mac with non-us locale.

Some genius at Microsoft decided the exporting to CSV should follow the locale convention. Which means I get a "semicolon-separated value" file instead of a comma-separated one, unless I change my local to us.

Line breaks are also fun...

freetinker•4mo ago

The comma makes it more human-readable. What separator would you suggest?

joz1-k•4mo ago

The comma is too prevalent in the data to be a suitable separator. A semicolon would be a better choice.

snthpy•4mo ago

So ASCII actually had dedicated characters for this, 0x1C-0x1F. The problem is that they are non-printing.

Unicode has rendered analogs, U+241C-U+241F, but they take more bytes to encode, which can significantly increase file size in large USV files.

So my ideal would be to use ASV files rendered as USV in editors.

https://github.com/SixArm/usv

snthpy•4mo ago

The benefits are that ASV / USV files are trivial to parse with simple string splitting since you don't have to worry about nesting and quoting.

Here's an example of what a USV looks like:

Folio1␟␞ Sheet1␟␞ a␟b␟␞ c␟d␟␞ ␝ Sheet2␟␞ e␟f␟␞ g␟h␟␞ ␝␜ Folio2␟␞ Sheet3␟␞ a␟b␟␞ c␟d␟␞ ␝ Sheet4␟␞ e␟f␟␞ g␟h␟␞ ␝␜

r721•4mo ago

"|" looks pretty good (and is relatively rarely-used).

renox•4mo ago

I'd say that is not its biggest issue. The way to escape things is by far its biggest issue, a passwd like \, \", \\ would have been far easier.

talles•4mo ago

What separator would be better?

IanCal•5mo ago

Counterpoint - CSV is absolutely atrocious and should be cast into the Sun.

It's unkillable, like many eldritch horrors.

> The specification of CSV holds in its title: "comma separated values". Okay, it's a lie, but still, the specification holds in a tweet and can be explained to anybody in seconds: commas separate values, new lines separate rows. Now quote values containing commas and line breaks, double your quotes, and that's it. This is so simple you might even invent it yourself without knowing it already exists while learning how to program.

Except that's just one way people do it. It's not universal and so you cannot take arbitrary CSV files in and parse them like this. You can't take a CSV file constructed like this and pass it into any CSV accepting program - many will totally break.

> Of course it does not mean you should not use a dedicated CSV parser/writer because you will mess something up.

Yes, implementers often have.

> No one owns CSV. It has no real specification

Yep. So all these monstrosities in the real world are all... maybe valid? Lots of totally broken CSV files can be parsed as CSV but the result is wrong. Sometimes subtly.

> This means, by extension, that it can both be read and edited by humans directly, somehow.

One of the very common ways they get completely fucked up, yes. Someone goes and sorts some rows and boom broken, often unrecoverable data loss. Someone doesn't correctly add or remove a comma. Someone mixes two files that actually have differently encoded text.

> CSV can be read row by row very easily without requiring more memory than what is needed to fit a single row.

CSV must be parsed row by row.

> By comparison, column-oriented data formats such as parquet are not able to stream files row by row without requiring you to jump here and there in the file or to buffer the memory cleverly so you don't tank read performance.

Sort of? Yes if you're building your own parser but who is doing that? It's also not hard with things like parquet.

> But of course, CSV is terrible if you are only interested in specific columns because you will indeed need to read all of a row only to access the part you are interested in.

Or if you're interested in a specific row, because you're going to have to be careful about parsing out every row until you get there.

CSV does not have a row separator. Or rather it does but it also lets you have that row separator appear and not mean "separate these rows" so you can't simply trust it.

> But critics of CSV coming from this set of pratices tend to only care about use-cases where everything is expected to fit into memory.

Parquet uses row groups which means you can stream chunks easily, those chunks contain metadata so you can easily filter rows you don't need too.

I much more often need to keep the whole thing in memory working with CSV than parquet. With parquet I don't even need to be able to fit all the rows on disk I can read the chunk I want remotely.

> CSV can be appended to

Yeah that's easier. Row groups means you can still do this though, but granted it's not as easy. *However* I will point out that absolutely nothing stops someone completely borking things by appending something that's not exactly the right format.

> CSV is dynamically typed

Not really. Everything is strings. You can do that with anything else if you want to. JSON can have numbers of any size if you just store them as strings.

> CSV is succinct

Yes, more so than jsonl, but not really more than (you guessed it) parquet. Also it's horrific for compression.

> Reverse CSV is still valid CSV

Get a file format that doesn't absolutely suck and you can parse things in reverse if you want. More usefully you can parse just sections you actually care about!

> Excel hates CSV

Helpfully this just means that the most popular way of working with tabular data in the world doesn't play that nicely with it.

cluckindan•5mo ago

Or just forget about the quotes which open a new can of worms, and use TSV while escaping newlines and tabs in values.

cogman10•5mo ago

I agree with all this and will just add. CSV is slow. Like really slow.

Partially due to the bloat, but also partially because the format doesn't allow for speed.

And because CSV is untyped, you have to either trust the producer or put in mountains of guards to ensure you can handle the weird garbage that might come through.

My company deals with a lot of CSV and we literally built tools and hire multiple full time employees whose entire job is handling CSV sucking in new and interesting ways.

Parquet literally eliminates 1/2 of our our data ingestion pipeline simply by being typed, consistent, and fast to query.

One example of a problem we constantly run into is that nobody likes to format numbers the same way. Scientific notation, no scientific notation, commas or periods, sometimes mixed formats (scientific notation when a number is big enough, for example).

Escaping is also all over the board.

CSV SEEMS simple, but the lack of a real standard means it's anything but.

I'd take xml over CSV.

IanCal•5mo ago

> My company deals with a lot of CSV and we literally built tools and hire multiple full time employees whose entire job is handling CSV sucking in new and interesting ways.

Truly shocking how many ways people manage to construct these files. I don't think people really get this if they've mostly been moving files from one system to another and not dealing with basically the union of horrors all the various systems that write CSV can make.

> Parquet literally eliminates 1/2 of our our data ingestion pipeline simply by being typed, consistent, and fast to query.

Parquet has been a huge step forwards. It's not perfect, but it is really good. Most improvements I'd like actually are served well stepping up from that to groups of them in larger tables.

> One example of a problem we constantly run into is that nobody likes to format numbers the same way. Scientific notation, no scientific notation, commas or periods, sometimes mixed formats (scientific notation when a number is big enough, for example).

That's a new one on me, but makes loads of sense. Dates were my issue - you hit 13/2/2003 and 5/16/2001 in the same file. What date is 1/2/2003?

For anyone that's never dealt with this before, let me paint a picture -

You have a programming language you're working in. You import new packages every single day, written by people you start to consider adversarial after a few weeks in the job.

You need to keep your system running, with new imports added every day.

There are only string types. Nothing more. You must interpret them correctly.

This is an understatement for what CSV files coming from random customers actually means. I did it for a decade and was constantly shocked at what new and inventive ways people had to mess up a simple file.

> I'd take xml over CSV.

Not to rag on xml but because it feels like you're in or have been in the same boat as me and it's nice to share horror stories - I've had to manually dig through a multi gig xml file to deal with parsing issues as somewhere somehow someone managed to combine files of different encodings and we had control characters embedded in. Just a random ^Z here and there. It's been years so I don't remember the details of exactly how we reconstructed what had happened but there was something due to encodings and mixing things together that messed it up.

This isn't xmls fault, and was a smaller example but since then I've had a strong mistrust of anything that lets humans manipulate files outside of something that can validate them as being parsable.

Also would take XML over CSV.

cogman10•5mo ago

Yeah, the xml thing wasn't about how good xml is, just how terrible CSV is :).

Particularly for tabular data, parquet is really good. Even a SQLite database isn't a terrible way send that sort of data.

At least with XML, all the problems of escaping are effectively solved. And since it's (usually) tool generated it's likely valid. That means I can feed it into my favorite xml parser and pound the data out of it (usually). There's still a lot of the encoding issues I mentioned with CSV.

bsghirt•5mo ago

Can you provide a reproducible example of how sorting rows can lead to unrecoverable data loss?

Also, commas in quoted strings are quite mainstream csv, but csvs with quoted strings containing unescaped newlines are extremely baroque. Criticism of csv based on the assumption that strings will contain newlines is not realistic.

IanCal•5mo ago

> Can you provide a reproducible example of how sorting rows can lead to unrecoverable data loss?

This was in the context of having it in a place humans can edit it directly so the case here is sorting rows by sorting lines. CSV has this wonderful property when editing - anything that doesn't parse it in then out to ensure that it is a valid file lets you write out a broken file if you mess it up - and in addition has the property that the record delimiter is an exceptionally common element in text.

So to answer your question, sure - take a csv file with newlines in some entries and sort the lines. You can restore it if you don't have two entries with newlines in the same field, and then only if you know it was exactly valid to start with, extra commas anywhere etc.

> csvs with quoted strings containing unescaped newlines are extremely baroque

No, they're all over the place. If you don't think so I don't believe you've worked with lots of real world csvs. Also, how do you know? How do you know your file doesn't contain them? Here's a fun fact - you can get to the point very easily where you *cannot programmatically tell*.

> Criticism of csv based on the assumption that strings will contain newlines is not realistic.

It's a very common thing to happen though.

Let's imagine something. CSV doesn't exist. I'm proposing it to you. I tell you that the bytes used to split records is a very commonly occurring thing in text. But don't worry! You can escape this by putting in another character commonly used. Oh and to escape that use two of them :)

Would you tell me to use something else? That you could foresee this causing problems?

bsghirt•5mo ago

I would tell you to escape the newlines. Then you would know as much about CSV with multiline text in it as most other people in the world.

IanCal•5mo ago

This is about dealing with csv files in the wild not about whether you can craft the perfect csv file. I have had years dealing with actual csv files from all corners of the world and all corners of sanity.

bsghirt•4mo ago

Are the CSVs with literal newlines in string fields in the room with us right now?

IanCal•4mo ago

They're definitely in the room I'm in, yes.

efitz•5mo ago

I don’t think I ever heard anyone say “csv is dead”.

Smart people (that have been burned once too many times) put quotes around fields in csv if they aren’t 100% positive the field will be comma-free, and escape quotes in such fields.

eviks•5mo ago

> Okay, it's a lie,

Indeed, a lie only a lover would believe,,,

xbmcuser•5mo ago

I did not care for CSV format much till I started using them with llm and python scripts.

jcattle•5mo ago

If you don't care that much about the accuracy of your data (like only caring about a few decimals of accuracy in your floats), you don't generate huge amounts of data, you do not need to work with it across different tools and pass it back and forth, then yes CSV CAN be nice.

I wouldn't write it a love letter though. There's a reason that parquet exists.

christophilus•5mo ago

CSV is just a string serialization, so you can represent floats with any accuracy you choose. It’s streamable and compressible, so large files are fine, though maybe not “huge” depending on how you define “huge”. It works fine passing back and forth between various tools, so…

Without more specifics, I disagree with your take.

jcattle•5mo ago

It's only fine passing between various tools if you tell each tool exactly how they should serialize your values. Each tool will interpret column values in some way and if you want to work with those values in any meaningful way it will convert to their representation of the data type that is likely present in a column.

Going from tool to tool will leave you with widely different representations of the original data you've put in. Because as you said yourself. All of this data does not have any meaning. It's just strings. The csv and the tools do not care if one column was ms-epoch and another was mathematical notation floating point. It'll all just go through that specific tools deserialization - serialization mangle and you'll have completely different data on the other end.

bsghirt•5mo ago

How would you deserialise the entity "0.4288"?

lan321•5mo ago

I hate parsing CSV. There are so many different implementations it's a constant cat and mouse.. Literally any symbol can be the separator, then the ordering starts getting changed, then since you have to guess what's where you go by type but strings, for example, are sometimes in quotations, other times not, then you have some decimal split with a comma when the values are also separated with commas so you have to track what's a split and what's a decimal comma.. Then you get some line with only 2 elements when you expect 7 and have no clue what to do because there's no documentation for the output and hence what that line means..

If the CSV is not written by me it's always been an exercise in making things as difficult as possible. It might be a tad smaller as a format but I find the parsing to be so ass you need really good reason to use it.

Edit: Oh yeah, and some have a header, others don't and CSV seems to always come from some machine where the techs can come over to do an update, and just reorder everything because fuck your parsing and then you either get lucky and the parser dies, or since you don't really have much info the types just align and you start saving garbage data to your database until a domain expert notices something isn't quite right so you have to find when was the last time someone touched the machines and rollback/reparse everything..

roland35•5mo ago

To people saying that "your boss can open it" being an benefit of csv, well I have a funny story!

Back in the early 2000s I designed and built a custom data collector for a air force project. It saved data at 100 Hz on an SD card. The project manager loved it! He could pop the SD card out or use the handy USB mass storage mode to grab the csv files.

The only problem... Why did the data cut off after about 10 minutes?? I couldn't see the actual data collected since it was secret, but I had no issue on my end, assuming there was space on the card and battery life was good.

Turns out, I learned he was using excel 2003 to open the csv file. There is a 65,536 row limit (does that number look familiar?). That took a while to figure out!!

Dilettante_•5mo ago

"Plz fix. No look! Just fix!!"[1] must be one of the circles of programmer hell.

[1]https://i.kym-cdn.com/entries/icons/facebook/000/027/691/tum...

IanCal•5mo ago

Love it.

The first data release I did excel couldn't open the CSV file, because it started with a capital I (first column ID). Excel looks at this, looks at this file with a comma in the header and text and the ending "csv" and says

I KNOW WHAT THIS IS

THIS IS A SYLK FILE

BECAUSE IT STARTS WITH "I"

NO OTHER POSSIBLE FILE COULD START WITH THE LETTER "I"

then reads some more and says

THIS SYLK FILE LOOKS WRONG

IT MUST BE BROKEN

ERROR

https://en.wikipedia.org/wiki/Symbolic_Link_(SYLK)

IAmBroom•5mo ago

Yes! That was the STUPIDEST file encoding detection ever developed! OMG....

1vuio0pswjnm7•5mo ago

Certainly I love CSV, too and agree with most of the reasons put forth in the blog post

But I tried the group's "CSV magician" program and was not impressed

https://github.com/medialab/xan

The table output seems very similar to sqlite3

It's a 20MB musl dynamically-linked executable (cf. 1.7MB statically-linked sqlite3 executable)

Most of the "suite" of subcommands seemed to be aimed at accomplishing the same stuff I can do with sqlite3, and, if necessary, flex

There seemed no obvious way too disable color

The last straw was it messed up the console

First freezing it and the pid was not visible with ps so I had to kill the shell

Then leaving me with no up/down or tab action

Yes I can fix this but I should not have to

I really want to keep an open mind and to believe in these rust console programs

But every time I try one it is a huge executable, an assault of superfluous color and generally no functionality that cannot be achieved with existing non-rust software

Even assuming I could get used to the annoying design, I like software that is reliable

Compared to the software I normally use, I cannot say these rust programs are equally as reliable

I also like software where I can easily edit the source to change what I do not like

These rust programs would require more resources and time to compile, messing around with a package manager and heaps of dependencies, plus a new language, with no clear benefit... b/c TBH I am not losing any sleep worrying that the small text processing programs I use are written in C; after all, the operating systems I use are written in C and that is not changing anytime soon

Tiny C Compiler

SectorC: A C Compiler in 512 bytes

Speed up responses with fast mode

Brookhaven Lab's RHIC concludes 25-year run with final collisions

Software factories and the agentic moment

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

Stories from 25 Years of Software Development

Hoot: Scheme on WebAssembly

FDA intends to take action against non-FDA-approved GLP-1 drugs

First Proof

Vocal Guide – belt sing without killing yourself

Show HN: Craftplan – Elixir-based micro-ERP for small-scale manufacturers

Show HN: A luma dependent chroma compression algorithm (image compression)

Show HN: I saw this cool navigation reveal, so I made a simple HTML+CSS version

Eigen: Building a Workspace

Al Lowe on model trains, funny deaths and working with Disney

Start all of your commands with a comma (2009)

Microsoft account bugs locked me out of Notepad – Are thin clients ruining PCs?

I write games in C (yes, C) (2016)

The F Word

Selection rather than prediction

The AI boom is causing shortages everywhere else

Reinforcement Learning from Human Feedback

Unseen Footage of Atari Battlezone Arcade Cabinet Production

Where did all the starships go?

Learning from context is harder than we thought

A Fresh Look at IBM 3270 Information Display System

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

72M Points of Interest

Hackers (1995) Animated Experience

Tiny C Compiler

SectorC: A C Compiler in 512 bytes

Speed up responses with fast mode

Brookhaven Lab's RHIC concludes 25-year run with final collisions

Software factories and the agentic moment

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

Stories from 25 Years of Software Development

Hoot: Scheme on WebAssembly

FDA intends to take action against non-FDA-approved GLP-1 drugs

First Proof

Vocal Guide – belt sing without killing yourself

Show HN: Craftplan – Elixir-based micro-ERP for small-scale manufacturers

Show HN: A luma dependent chroma compression algorithm (image compression)

Show HN: I saw this cool navigation reveal, so I made a simple HTML+CSS version

Eigen: Building a Workspace

Al Lowe on model trains, funny deaths and working with Disney

Start all of your commands with a comma (2009)

Microsoft account bugs locked me out of Notepad – Are thin clients ruining PCs?

I write games in C (yes, C) (2016)

The F Word

Selection rather than prediction

The AI boom is causing shortages everywhere else

Reinforcement Learning from Human Feedback

Unseen Footage of Atari Battlezone Arcade Cabinet Production

Where did all the starships go?

Learning from context is harder than we thought

A Fresh Look at IBM 3270 Information Display System

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

72M Points of Interest

Hackers (1995) Animated Experience

A love letter to the CSV format (2024)

Comments