Someone•1h ago
FTA: “Now let's come back to our jumping thought experiment: the issue here is that, if you jump to a random byte of a CSV file, you cannot know whether you landed in a quoted cell or not. So, if you read ahead and find a line break, is it delineating a CSV row, or is it just allowed there because we are standing in a quoted cell? And if you find a double quote, are you opening a quoted cell or closing one?
[…]
Real-life CSV data is usually consistent. What I mean is that tabular data often has a fixed number of columns. Indeed, rows suddenly demonstrating an inconsistent number of columns are typically frowned upon. What's more, columns often hold homogeneous data types: integers, floating point numbers, raw text, dates etc. Finally, rows tend to have a comparable size in number of bytes. We would be fools not to leverage this consistency.
So now, before doing any reckless jumping, let's start by analyzing the beginning of our CSV file to record some statistics that will be useful down the line.
[…]
Anyway, we now have what we need to be able to jump safely”
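To make the ambiguity concrete, here is a minimal Python sketch (mine, not the article's): the very same trailing bytes parse as two rows or as one, depending solely on whether a quote was opened earlier in the file — which is exactly what you cannot know after a blind jump.

```python
import csv
import io

# The same byte sequence, seen after a jump.
tail = 'world\nx,y\n'

# Case 1: the newline in `tail` really does separate two rows.
doc1 = 'a,hello ' + tail

# Case 2: the identical newline sits inside a quoted cell.
doc2 = 'a,"hello ' + tail + '"'

rows1 = list(csv.reader(io.StringIO(doc1)))  # two rows
rows2 = list(csv.reader(io.StringIO(doc2)))  # one row with an embedded newline
```

Here `rows1` is two rows while `rows2` is a single row whose second cell contains the line break, even though both files end with the same bytes.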
‘Safely’. An attacker who has control over a row in that file can easily embed data that satisfies the statistical checks, thus injecting data.
The author also concedes as much, saying: “This technique is reasonably robust and will let you jump safely”.
I agree with “reasonably robust”, but not with “will let you jump safely”.
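For concreteness, the kind of statistical check under discussion could be sketched like this (a hypothetical illustration, not the author's actual implementation; `collect_stats` and `plausible_row` are made-up names): learn the expected field count and typical row length from the head of the file, then use them to vet a candidate row found after a jump. As noted above, an attacker who controls a row can craft data that passes these checks.

```python
import csv
import io
import statistics

def collect_stats(head_text, max_rows=64):
    # Learn the file's "shape" from its first rows: the set of
    # observed field counts and the mean row length in characters.
    rows = []
    for i, row in enumerate(csv.reader(io.StringIO(head_text))):
        if i >= max_rows:
            break
        rows.append(row)
    field_counts = {len(r) for r in rows}
    mean_len = statistics.mean(
        sum(len(cell) for cell in r) + len(r) for r in rows
    )
    return field_counts, mean_len

def plausible_row(line, field_counts, mean_len, slack=4.0):
    # After jumping and scanning to the next line break, check the
    # candidate line against the learned shape. This is a heuristic:
    # adversarial input can be crafted to satisfy both conditions.
    try:
        fields = next(csv.reader(io.StringIO(line)))
    except StopIteration:
        return False
    return len(fields) in field_counts and len(line) <= slack * mean_len
```

A candidate line with the wrong field count, or one wildly longer than the rows seen so far, is rejected and the scan continues to the next line break.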
starlita•1h ago
"robust" in the same sentence as "CSV" makes me laugh anyway ;)
Yomguithereal•57m ago
> ‘Safely’. An attacker who has control over a row in that file can easily embed data that satisfies the statistical checks, thus injecting data.
This is clearly not the sort of thing you should expose to just anyone; it is an optimization technique. In the same way, you would not use a fast but DoS-able hash function for your hashmap.
Isn't this more robust, though? I feel like using line breaks to detect the next row is very flimsy. I usually deal with CSVs containing full press articles, and I am quite sure the CSV.Chunks method would fail without the correct hyperparameter. This method seems more, I dunno, "adaptive".