Someone•1h ago
FTA: “Now let's come back to our jumping thought experiment: the issue here is that, if you jump to a random byte of a CSV file, you cannot know whether you landed in a quoted cell or not. So, if you read ahead and find a line break, is it delineating a CSV row, or is it just allowed there because we are standing in a quoted cell? And if you find a double quote, are you opening a quoted cell or closing one?
[…]
Real-life CSV data is usually consistent. What I mean is that tabular data often has a fixed number of columns. Indeed, rows suddenly demonstrating an inconsistent number of columns are typically frowned upon. What's more, columns often hold homogeneous data types: integers, floating point numbers, raw text, dates etc. Finally, rows tend to have a comparable size in number of bytes. We would be fools not to leverage this consistency.
So now, before doing any reckless jumping, let's start by analyzing the beginning of our CSV file to record some statistics that will be useful down the line.
[…]
Anyway, we now have what we need to be able to jump safely”
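To make the ambiguity concrete, here is a minimal Python sketch (mine, not the article's): the very same trailing bytes parse as two rows or as one, depending solely on whether a quote was opened earlier in the file — which is exactly what you cannot know after a blind jump.

```python
import csv
import io

# The same byte sequence, seen after a jump.
tail = 'world\nx,y\n'

# Case 1: the newline in `tail` really does separate two rows.
doc1 = 'a,hello ' + tail

# Case 2: the identical newline sits inside a quoted cell.
doc2 = 'a,"hello ' + tail + '"'

rows1 = list(csv.reader(io.StringIO(doc1)))  # two rows
rows2 = list(csv.reader(io.StringIO(doc2)))  # one row with an embedded newline
```

Here `rows1` is two rows while `rows2` is a single row whose second cell contains the line break, even though both files end with the same bytes.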
‘Safely’. An attacker who has control over a row in that file can easily embed data that satisfies the statistical checks, thus injecting data.
The author also concedes as much, saying: “This technique is reasonably robust and will let you jump safely”.
I agree with “reasonably robust”, but not with “will let you jump safely”.
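For concreteness, the kind of statistical check under discussion could be sketched like this (a hypothetical illustration, not the author's actual implementation; `collect_stats` and `plausible_row` are made-up names): learn the expected field count and typical row length from the head of the file, then use them to vet a candidate row found after a jump. As noted above, an attacker who controls a row can craft data that passes these checks.

```python
import csv
import io
import statistics

def collect_stats(head_text, max_rows=64):
    # Learn the file's "shape" from its first rows: the set of
    # observed field counts and the mean row length in characters.
    rows = []
    for i, row in enumerate(csv.reader(io.StringIO(head_text))):
        if i >= max_rows:
            break
        rows.append(row)
    field_counts = {len(r) for r in rows}
    mean_len = statistics.mean(
        sum(len(cell) for cell in r) + len(r) for r in rows
    )
    return field_counts, mean_len

def plausible_row(line, field_counts, mean_len, slack=4.0):
    # After jumping and scanning to the next line break, check the
    # candidate line against the learned shape. This is a heuristic:
    # adversarial input can be crafted to satisfy both conditions.
    try:
        fields = next(csv.reader(io.StringIO(line)))
    except StopIteration:
        return False
    return len(fields) in field_counts and len(line) <= slack * mean_len
```

A candidate line with the wrong field count, or one wildly longer than the rows seen so far, is rejected and the scan continues to the next line break.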
starlita•1h ago
"robust" in the same sentence as "CSV" makes me laugh anyway ;)
Yomguithereal•57m ago
> ‘Safely’. An attacker who has control over a row in that file can easily embed data that satisfies the statistical checks, thus injecting data.
This is clearly not the sort of thing you should expose to just anyone; it is an optimization technique. In the same way, you would not use a fast but DoS-able hash function for your hashmap.
Isn't this more robust, though? I feel like using line breaks to detect the next row is very flimsy. I usually deal with CSVs containing full press articles, and I am quite sure the CSV.Chunks method would fail without the correct hyperparameter. This method seems more, I dunno, "adaptive".