zsv was built because I needed a CSV parsing library to integrate with my application, and other parsers had one or more of a variety of limitations: they couldn't handle "real-world" CSV or malformed UTF-8, were too slow, degraded on very large files, couldn't compile to WebAssembly, or couldn't handle multi-row headers (which, as far as I can tell, basically no other CSV parser supports). More details are in the repo README. The closest existing solution was xsv, but it wasn't designed as an API, and I still needed a lot of flexibility that wasn't already built into it.
My first inclination was to use flex/bison, but that approach yielded surprisingly slow performance. SIMD had just been shown to deliver unprecedented performance gains for JSON parsing, so a friend and I took a page from that approach to create what is, as far as I know, the fastest (and most customizable) CSV parser that properly handles "real-world" CSV.
When I say "real-world CSV": if you've worked with CSV in the wild, you probably know what I mean, but feel free to check out the README for a more technical explanation.
With the parser built, I found that some of my use cases were generic, so I wrapped them up in a CLI. Most of the CLI commands are run-of-the-mill: echo, select, count, sql, pretty, 2tsv, stack. Others are harder to find in other utilities: compare (cell-level comparison with customizable numerical tolerance, useful when, for example, comparing CSV against data from a deconstructed XLSX, where the values may look the same but technically differ by less than 0.000001), serialize/flatten, and 2json (with a choice of several JSON output schemas). A few aren't directly CSV-related but dovetail with the others, such as 2db, which converts 2json output to sqlite3 with indexing options, so you can run e.g. `zsv 2json my.csv --unique-index mycolumn | zsv 2db -t mytable -o my.db`.
I've been using zsv for years in commercial software, both running on bare metal and in the browser (see e.g. https://liquidaty.github.io/zsv/), so I finally got around to tagging v1.0.1 as the first production-ready release.
I'd love for you to try it out and would welcome any feedback, bug reports, or questions.