
mjw_byrne

303 Karma - Created 2016-02-17

Recent Activity

  • It's good for a delimiter to be uncommon in the data, so that you don't have to use your escaping mechanism too much.

    This is a different thing entirely from using "disallowed" control characters, which is an attempt to avoid escaping altogether - an attempt that, I was arguing, is doomed to fail.
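
    To make that concrete, here's a rough sketch of the kind of escaping I mean (illustrative Python, not any particular library's API): the delimiter, the record separator and the escape character itself all get backslash-escaped, so every value round-trips, and picking an uncommon delimiter just keeps the escapes rare.

        def escape_field(s: str) -> str:
            # Escape the escape char first, then the delimiter and the record separator.
            return s.replace("\\", "\\\\").replace(",", "\\,").replace("\n", "\\n")

        def unescape_field(s: str) -> str:
            out, chars = [], iter(s)
            for ch in chars:
                if ch == "\\":
                    nxt = next(chars)  # well-formed input never ends in a lone backslash
                    out.append("\n" if nxt == "n" else nxt)
                else:
                    out.append(ch)
            return "".join(out)

        # Round trip: commas, newlines and backslashes all survive.
        assert unescape_field(escape_field('a,b\nc\\d')) == 'a,b\nc\\d'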

  • > I'm not clear why quotes prevent parallel processing?

    Because of the "non-local" effect of quotes, you can't just jump into the middle of a file and start reading it: there's no way to tell whether you're inside a quoted section or not. If (big if) you know something about the structure of the data, you might be able to guess. That's why I said "tricky" rather than "impossible".

    Contrast that with my escaping-only strategy, where you can jump into the middle of a file and fully recover your context by looking one char on either side (there's a rough sketch at the end of this comment).

    > Do you really have a use case where reading itself is the performance bottleneck and you need to parallelize reading by starting at different file offsets? I know that multiple processes can read faster from certain high-end SSD's than just one process, but that's a level of performance optimization that is pretty extraordinary. I'm kind of curious what it is!

    I used to be a data analyst at a management consultancy. A very common scenario would be that I'm handed a multi-gigabyte CSV and told to "import the data". No spec, no schema, no nothing. Data loss or corruption is totally unacceptable, because we were highly risk-sensitive. So step 1 is to go through the whole thing trying to determine field types by testing them. Does column 3 always parse as a timestamp? Great, we'll call it a timestamp. That kind of thing. In that case, it's great to be able to parallelise reading.

    > And data corruption is data corruption

    Agreed, but I prefer data corruption which messes up one field, not data corruption which makes my importer sit there for 5 minutes thinking the whole file is a 10GB string value and then throw "EOF in quoted field".
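
    In case it helps, here's a rough sketch of what "jump into the middle" looks like with that escaping-only format (illustrative Python, not the actual tooling I used): pick an arbitrary byte offset, scan forward to the first newline that isn't escaped, and that's the start of the next complete record. Each parallel worker does this once at its chunk boundary and then parses its slice independently.

        def next_record_start(buf: bytes, offset: int, esc: int = ord("\\")) -> int:
            # Index just past the first *unescaped* newline at or after `offset`,
            # i.e. the start of the next complete record. An odd-length run of
            # escape bytes before a newline means the newline itself is data.
            i = offset
            while i < len(buf):
                if buf[i] == ord("\n"):
                    j = i
                    while j > 0 and buf[j - 1] == esc:
                        j -= 1
                    if (i - j) % 2 == 0:
                        return i + 1
                i += 1
            return len(buf)

        def chunk_boundaries(buf: bytes, workers: int) -> list[int]:
            # Naive equal-sized splits, each nudged forward to a record boundary.
            return sorted({0, *(next_record_start(buf, k * len(buf) // workers)
                                for k in range(1, workers))})

    Each worker then parses the slice between consecutive boundaries with no coordination needed between workers.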

  • I agree that it's best to pick "unlikely" delimiters so that you don't have to pepper your data with escape chars.

    But some people (plenty in this thread) really do think that "pick a delimiter that won't be in the data" - with no quoting or escaping on top - is a viable solution.

  • Yep, we had a constant tug of war between techies who wanted to use open-source tools that actually work (Linux, Postgres, Python, Go, etc.) and bigwigs who wanted impressive-sounding things in PowerPoint decks and were trying to force "enterprise" platforms like Palantir and IBM BigInsights on us.

    Any time we were allowed to actually test one of the "enterprise" platforms, we'd break it in a few minutes. And I don't mean by being pathologically abusive; I mean stuff like "let's see if it can correctly handle a UTF-8 BOM... oh no, it can't".
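
    For what it's worth, the BOM case is a one-liner in the open-source stack: Python's "utf-8-sig" codec strips a leading UTF-8 BOM if present and behaves like plain UTF-8 otherwise. A minimal sketch (the filename is just a placeholder):

        import csv

        with open("export.csv", encoding="utf-8-sig", newline="") as f:
            for row in csv.reader(f):
                ...  # the first header cell no longer starts with "\ufeff"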

  • Right, but the original point I was responding to was that control characters are disallowed in the data and therefore don't need to be escaped. If you're going to have an escaping mechanism anyway, you can use "normal" characters like the comma as delimiters, which is better because they can be read and written normally.

HackerNews