A love letter to the CSV format

Or why people pretending CSV is dead are wrong

Every month or so, a new blog article declaring the near demise of CSV in favor of some "obviously superior" format (parquet, newline-delimited JSON, MessagePack records, etc.) finds its way to readers' eyes. Sadly, those articles often offer a very narrow and biased comparison and fail to understand what makes CSV a seemingly unkillable staple of data serialization.

It is therefore my intention, through this article, to write a love letter to this data format, one that is often criticized for the wrong reasons, even more so now that it is somehow deemed "cool" to hate on it. My point is not, far from it, to say that CSV is a silver bullet, but rather to shine a light on some of the format's sometimes overlooked strengths.

The specification of CSV fits in its title: "comma-separated values". Okay, that's a bit of a lie, but still, the whole specification fits in a tweet and can be explained to anybody in seconds: commas separate values, new lines separate rows. Now quote values containing commas and line breaks, double the quotes inside them, and that's it. This is so simple you might even invent it yourself, without knowing it already exists, while learning how to program.
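
For illustration, here is a tiny hypothetical file exercising all of those rules: an unquoted field, a quoted field containing a comma, and a quoted field containing both a doubled quote and a line break.

    name,quote,age
    Ada,"Hello, world",36
    Bob,"She said ""hi""
    and left",41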

Of course this does not mean you should skip a dedicated CSV parser/writer, because rolling your own ad hoc one is a sure way to mess something up.

No one owns CSV. It has no real specification (yes, I know about the controversial ex-post RFC 4180), just a set of rules everyone kinda agrees to respect implicitly. It is, and will forever remain, an open and free collective idea.

Like JSON, YAML or XML, CSV is just plain text that you are free to encode however you like. CSV is not a binary format: it can be opened with any text editor and does not require any specialized program to be read. This means, by extension, that it can be both read and edited by humans directly, somehow.

CSV can be read row by row very easily without requiring more memory than what is needed to fit a single row. This also means that a trivial program that anyone can write is able to read gigabytes of CSV data with only some kilobytes of RAM.
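
As a minimal sketch of this streaming property (assuming a hypothetical people.csv file and Python's standard csv module), the following reads a file of arbitrary size while holding only one row in memory at a time:

    import csv

    total = 0
    with open("people.csv", newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)   # the first row holds the column names
        for row in reader:      # rows are yielded lazily, one at a time
            total += 1          # process `row` (a list of strings) here
    print(total, "data rows")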

By comparison, column-oriented data formats such as parquet are not able to stream files row by row without making you jump here and there in the file, or buffer memory cleverly so you don't tank read performance.

But of course, CSV is terrible if you are only interested in specific columns because you will indeed need to read all of a row only to access the part you are interested in.

Column-oriented data formats are of course a very good fit for the dataframe mindset of R, pandas and such. But critics of CSV coming from this set of practices tend to only care about use cases where everything is expected to fit into memory.

It is trivial to add new rows at the end of a CSV file and it is very efficient to do so. Just open the file in append mode (a+) and get going.
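
A minimal sketch of such an append, again with Python's csv module and a hypothetical people.csv file (note that "a" alone suffices for appending; "a+" additionally allows reading):

    import csv

    with open("people.csv", "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Carol", "likes, commas", 29])  # quoting is handled for us
    # only the new bytes are written at the end; the rest of the file is untouched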

Once again, column-oriented data formats cannot do this, or at least not in a straightforward manner. They can actually be regarded as on-disk dataframes, and like with dataframes, adding a column is very efficient while adding a new row really isn't.

Please don't flee. Let me explain why this is sometimes a good thing. Sometimes when dealing with data, you might like to have some flexibility, especially across programming languages, when parsing serialized data.

Consider JavaScript, for instance, which cannot exactly represent 64-bit integers. Or what different languages, frameworks and libraries consider as null values (don't get me started on pandas and null values). CSV lets you parse values as you see fit and is in fact dynamically typed. This is as much a strength as it is a potential footgun if you are not careful.
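
As a hedged sketch of what that flexibility looks like in practice (the file and column names here are hypothetical), it is the reader, not the format, that decides how each value is typed:

    import csv

    with open("events.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            user_id = row["user_id"]        # kept as text: may not fit a JS number
            amount = float(row["amount"])   # parsed as a float because we chose to
            note = row["note"] or None      # we decide what an empty cell means
            print(user_id, amount, note)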

Note also, though this might be hard to do with higher-level languages such as python and JavaScript, that you are not even required to decode the text to process CSV cell values: you can work directly on the binary representation of the text for performance reasons.
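
A hedged illustration of that byte-level idea in Python (hypothetical file and column index, and a deliberately naive split that assumes no quoted commas or line breaks):

    count = 0
    with open("events.csv", "rb") as f:               # raw bytes, no text decoding
        next(f)                                       # skip the header row
        for line in f:
            cells = line.rstrip(b"\r\n").split(b",")  # naive: ignores quoting
            if cells[2] == b"error":                  # compare bytes directly
                count += 1
    print(count, "error rows")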

Having the headers written only once at the beginning of the file means the amount of formal repetition of the format is naturally very low. Consider a list of objects in JSON or the equivalent in XML and you will quickly see the cost of repeating keys everywhere. That does not mean JSON and XML will not compress very well, but few formats exhibit this level of natural conciseness.

What's more, strings are often already optimally represented and the overhead of the format itself (some commas and quotes here and there) is kept to a minimum. Of course, statically-typed numbers could be represented more concisely, but you will not save an order of magnitude there either.

This one is not often realized: a reversed (byte by byte) CSV file is still valid CSV. This is only made possible by the genius idea of escaping quotes by doubling them, which makes the escaping scheme a palindrome. It would not work if CSV used a backslash-based escaping scheme, as is most common when representing string literals.

But why should you care? Well, this means you can read the last rows of a CSV file very easily and very efficiently. Just feed the bytes of your file in reverse order to a CSV parser, then reverse the yielded rows and their cells' bytes and you are done (you may want to read the header row beforehand, though).
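
Here is a hedged sketch of that trick in Python (hypothetical file path; for clarity it reverses the whole file in memory, whereas a real implementation would buffer bytes from the end of the file, and it assumes Unix-style \n line endings):

    import csv, io

    def last_rows(path, n=1, encoding="utf-8"):
        with open(path, encoding=encoding, newline="") as f:
            reversed_text = f.read()[::-1]     # a reversed CSV is still CSV
        out = []
        for row in csv.reader(io.StringIO(reversed_text)):
            if not row:                        # blank line from the trailing newline
                continue
            # cells come out back-to-front and mirrored: un-reverse both
            out.append([cell[::-1] for cell in reversed(row)])
            if len(out) == n:
                break
        return out                             # the last n records, last one first

Calling last_rows("people.csv", 3), for instance, would yield the three final records, most recent first.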

This means you can very well use a CSV output as a way to efficiently resume an aborted process. You can indeed read and parse the last rows of a CSV file in constant time, since you don't need to read the whole file but only to position yourself at its end, buffer the bytes in reverse and feed them to the parser.

It clearly means CSV must be doing something right.


Read the original article

Comments

  • By jwr 2025-03-2712:5113 reply

    I so hate CSV.

    I am on the receiving end: I have to parse CSV generated by various (very expensive, very complicated) eCAD software packages. And it's often garbage. Those expensive software packages trip on things like escaping quotes. There is no way to recover a CSV line that has an unescaped double quote.

    I can't point to a strict spec and say "you are doing this wrong", because there is no strict spec.

    Then there are the TSV and semicolon-Separated V variants.

    Did I mention that field quoting was optional?

    And then there are banks, which take this to another level. My bank (mBank), which is known for levels of programmer incompetence never seen before (just try the mobile app) generates CSVs that are supposed to "look" like paper documents. So, the first 10 or so rows will be a "letterhead", with addresses and stuff in various random columns. Then there will be your data, but they will format currency values as prettified strings, for example "34 593,12 USD", instead of producing one column with a number and another with currency.

    • By mjw_byrne 2025-03-2713:233 reply

      I used to be a data analyst at a Big 4 management consultancy, so I've seen an awful lot of this kind of thing. One thing I never understood is the inverse correlation between "cost of product" and "ability to do serialisation properly".

      Free database like Postgres? Perfect every time.

      Big complex 6-figure e-discovery system? Apparently written by someone who has never heard of quoting, escaping or the difference between \n and \r and who thinks it's clever to use 0xFF as a delimiter, because in the Windows-1252 code page it looks like a weird rune and therefore "it won't be in the data".

      • By recursive 2025-03-2719:361 reply

        "Enterprise software" has been defined as software that is purchased based on the decisions of people that will not use it. I think that explains a lot.

        • By mjw_byrne 2025-03-2811:551 reply

          Yep, we had a constant tug of war between techies who wanted to use open-source tools that actually work (Linux, Postgres, Python, Go etc.) and bigwigs who wanted impressive-sounding things in Powerpoint decks and were trying to force "enterprise" platforms like Palantir and IBM BigInsights on us.

          Any time we were allowed to actually test one of the "enterprise" platforms, we'd break it in a few minutes. And I don't mean by being pathologically abusive, I mean stuff like "let's see if it can correctly handle a UTF-8 BOM...oh no, it can't".

      • By ethbr1 2025-03-2714:001 reply

        > Big complex 6-figure e-discovery system? Apparently written by someone who has never heard of quoting...

It's because above a certain size, system projects are captured by the large consultancy shops, who eat the majority of the price in profit and management overhead...

        ... and then send the coding work to a lowest-cost someone who has never heard of quoting, etc.

        And it's a vicious cycle, because the developers in those shops that do learn and mature quickly leave for better pay and management.

        (Yes, there's usually a shit hot tiger team somewhere in these orgs, but they spend all their time bailing out dumpster fires or landing T10 customers. The average customer isn't getting them.)

        • By deepsun 2025-03-2717:191 reply

Just a nitpick about consultancy shops -- I've had the chance of working in one in Eastern Europe and noticed that its approach to quality was way better than the client's. It also helped that the client paid by the hour, so the consultancy company was incentivized to spend more time on refactorings, improvements and testing (with constant pushback from the client).

          So I don't buy the consultancy company sentiment, it always boils down to engineers and incentives.

          • By ethbr1 2025-03-2720:491 reply

            How big was the one you worked for?

            In my experience, smaller ones tend to align incentives better.

            Once they grow past a certain size though, it's a labor arbitrage game. Bill client X, staff with resources costing Y (and over-represented), profit = X-Y, minimize Y to maximize profit.

            PwC / IBM Global Services wasn't offering the best and brightest. (Outside of aforementioned tiger teams)

            • By deepsun 2025-03-2921:58

I agree with you in general, although my case was the other way around. My company was 10k+ people. But my client was probably the most technically advanced company at that time, with famously hard interviews for their own employees. My employer also didn't want to lose the client (it was the beginning of the collaboration), and since everyone wanted to work there (and move to US+California), my shop applied a pretty strong filter to its own heads, even before sending them to the client's vendor interview.

And the client was very, very happy with the quality, and with the fact that we didn't fight for promotions and could maintain very important but promotion-poor projects. Up to the point that the client trusted us enough to hand a couple of projects over completely to my shop. When you don't need to fight for promotions, code quality also improves.

    • By raxxorraxor 2025-03-2713:594 reply

      Try to live in a country where "," is the decimal point. Of course this causes numerous interoperability issues or hidden mistakes in various data sets.

      There would have been many better separators... but good idea to bring formatting into it as well...

      • By Moru 2025-03-2714:112 reply

        There was a long period of my life that I thought .csv meant cemicolon separated because all I saw was cemicolon separated files and I had no idea of the pain.

        • By JadeNB 2025-03-2716:111 reply

          Although it is spelled "semicolon," so that doesn't quite fit.

          • By achierius 2025-03-2722:29

            CMYK -- Cyan, Magenta, Yellow, blacK :)

            (of course it originally stood for "key", but you don't see that much anymore)

      • By mrweasel 2025-03-2714:461 reply

Not sure if they still do this, but Klarna would send us ", " separated files. If there wasn't a space after the comma then it was to be read as a decimal point. Most CSV parsers don't/didn't allow you to specify multi-character separators. In the end I just accepted that we had one field for krona and one for öre, and that most fields would need to have a leading space removed.

        • By raxxorraxor 2025-03-2715:451 reply

Microsoft did this very extensively. Many non-English versions of Excel save CSV files with a semicolon as a separator, and it probably was handled differently too in normal Excel files. But it goes even further: it affected their scripting languages even to this day, with newer languages like their BI script (forgot the name of the language). For example, parameters of function calls aren't separated by ',' anymore and ';' is used instead. But only in the localized versions.

          That of course means that you have to translate these scripts depending on the locale set in your office suite, otherwise they are full of syntax errors...

          • By oddmiral 2025-03-284:39

            Many English languages use ';' to end statements instead of '.'.

            Many European languages use '.' to end statements, Prolog (France) for example, but use ';' to separate arguments.

      • By zaxomi 2025-03-285:29

        There are better separators included in ASCII, but not used as often: 28 File Separator, 29 Group Separator, 30 Record Separator and 31 Unit Separator.

      • By Sami_Lehtinen 2025-03-2716:15

        TSV should do it for you. Been there done that.

    • By danso 2025-03-2713:331 reply

      > Then there will be your data, but they will format currency values as prettified strings, for example "34 593,12 USD", instead of producing one column with a number and another with currency.

      To be fair, that's not a problem with CSV but with the provider's lack of data literacy.

      • By remram 2025-03-2714:411 reply

        Yeah, you can also use Parquet/JSON/protobuf/XLSX and store numbers as strings in this format. CSV is just a container.

        • By ahoka 2025-03-2717:301 reply

          But somehow CSV is the PHP of serialization formats, attracts the wrong kind of developers and projects.

          • By remram 2025-03-2720:38

            I definitely wouldn't say that. I saw a lot of weird stuff in Excel files, and there's the whole crowd only giving you data as PDFs.

    • By TRiG_Ireland 2025-03-2717:19

      I worked in a web shop which had to produce spreadsheets which people wanted to look at in Excel. I gave them so many options, and told each client to experiment and choose the option which worked for them. In the end, we had (a) UTF-8 CSV, (b) UTF-8 CSV with BOM, (c) UTF-16 TSV, (d) UTF-8 HTML table with a .xlsx file extension and a lying Content-Type header which claimed it was an Excel spreadsheet.

      Option a worked fine so long as none of the names in the spreadsheet had any non-ASCII characters.

      Option d was by some measures the worst (and was definitely the largest file size), but it did seem to consistently work in Excel and Libre Office. In fact, they all worked without any issue in Libre Office.

    • By recursive 2025-03-2719:32

      > I can't point to a strict spec and say "you are doing this wrong", because there is no strict spec.

      Have you tried RFC 4180?

      https://www.ietf.org/rfc/rfc4180.txt

    • By hermitcrab 2025-03-2716:40

      I've written a commercial, point and click, data wrangling tool (Easy Data Transform) that can deal with a lot of these issues:

      -different delimiters (comma, semi-colon, tab, pipe etc)

      -different encodings (UTF8, UTF16 etc)

      -different line ending (CR, LF, CR+LF)

      -ragged rows

      -splitting and merging columns

      And much more besides.

      However, if you have either:

      -line feeds and/or carriage returns in data values, but no quoting

      or

      -quoting, but quotes in data values aren't properly handled

      Then you are totally screwed and you have my sympathies!

    • By imtringued 2025-03-2714:552 reply

      I agree and as a result I have completely abandoned CSV.

      I use the industry standard that everyone understands: ECMA-376, ISO/IEC 29500 aka .xlsx.

      Nobody has any problems producing or ingesting .xlsx files. The only real problem is the confusion between numbers and numeric text that happens when people use excel manually. For machine to machine communication .xlsx has never failed me.

      • By Hackbraten 2025-03-2716:52

        Off the top of my head:

        https://learn.microsoft.com/en-us/office/troubleshoot/excel/...

        Now you might argue that ECMA-376 accounts for this, because it has a `date1904` flag, which has to be 0 for 1900-based dates and 1 for 1904-based dates. But what does that really accomplish if you can’t be sure that vendors understand subtleties like that if they produce or consume it? Last time I checked (maybe 8 years ago), spreadsheets created on Windows and opened on Mac still shifted dates by four years, and the bug was already over twenty years old at that time.

        And the year-1904 issue is just the one example that I happen to know.

        I have absolutely zero confidence in anything that has touched, or might have touched, MS Excel with anything short of a ten-foot pole.

      • By Gormo 2025-03-2718:161 reply

        Parsing Excel files in simple data interchange use cases that don't involve anyone manually using spreadsheets is an instance of unnecessary complexity. There are plenty of alternatives to CSV that remain plaintext, have much broader support, and are more rigorous than Excel in ensuring data consistency. You can use JSON, XML, ProtoBuf, among many other options.

        • By eternauta3k 2025-03-2718:351 reply

          But everyone already has a GUI installed for editing xlsx files...

          • By Gormo 2025-03-2720:12

            Which introduces even more problems when manually editing files is out of scope.

    • By byyll 2025-03-2718:04

      I was recently writing a parser for a weird CSV. It had multiple header column rows in it as well as other header rows indicating a folder.

      • By mcpeepants 2025-03-2716:59

        The RFC explicitly does not define a standard

    • By pwenzel 2025-03-2714:271 reply

I hate CSV too. If I have to use it, I'll live with TSV or some other special-character delimited format.

      • By NikkiA 2025-03-2719:17

        I'd much rather it be something that is neither used in normal* text/numbers, nor whitespace, thus non-printable delimiters wins for me.

* Don't mind me extending 'normal' here to include human-written numbers with thousand separators.

    • By SoftTalker 2025-03-2716:172 reply

      If only there were character codes specifically meant to separate fields and records.... we wouldn't have to worry so much about quoted commas or quoted quotes.

      • By craftkiller 2025-03-2716:231 reply

        That isn't solving anything, just changing the problem. If I want to store a string containing 0x1C - 0x1F in one of the columns then we're back in the exact same situation while also losing the human readable/manually typeable aspect people seem to love about CSV. The real solution is a strict spec with mandatory escaping.

        • By SoftTalker 2025-03-2716:264 reply

          Not for text data. Those values are not text characters like , or " are, and have only one meaning. It would be like arguing that 0x41 isn't always the letter "A".

          For binary files, yeah but you don't see CSV used there anyway.

          • By mjw_byrne 2025-03-2716:482 reply

            The idea that binary data doesn't go in CSVs is debatable; people do all sorts of weird stuff. Part of the robustness of a format is coping with abuse.

            But putting that aside, if the control chars are not text, then you sacrifice human-readability and human-writability. In which case, you may as well just use a binary format.

            • By SoftTalker 2025-03-2720:51

              True, but very few people compose or edit CSV data in Notepad. You can, but it's very error-prone. Most people will use a spreadsheet and save as CSV, so field and record separator characters are not anything they would ever deal with.

            • By Gormo 2025-03-2718:18

              I've dealt with a few cases of CSVs including base64-encoded binary data. It's an unusual scenario, but the tools for working with CSVs are robust enough that it was never an issue.

          • By craftkiller 2025-03-2716:461 reply

            So in addition to losing human readability, we are also throwing away the ability to nest (pseudo-)CSVs? With comma delimiters, I can take an entire CSV document and put it in 1 column, but with 0x1C-0x1F delimiters and banning non-text valid utf-8 in columns I no longer can. This continues to be a step backwards.

          • By kragen 2025-03-2717:02

            There are lots of 8-bit mostly-ASCII character sets that assign printable glyphs to some or all of the codepoints that ASCII assigns to control characters. TeX defined one, and the IBM PC's "code page 437" defined another.

          • By Hackbraten 2025-03-2716:391 reply

            There are several ways how a control character might inadvertently end up inside a text corpus. Given enough millions of lines, it’s bound to happen, and you absolutely don’t want it to trip up your whole export because of that one occurrence. So yes, you have to account for it in text data, too.

            • By yellowapple 2025-03-283:41

              Sure, but that's orders of magnitude less likely to happen than a comma ending up inside a text corpus.

      • By mjw_byrne 2025-03-2716:221 reply

        There's just no such thing as a delimiter which won't find its way into the data. Quoting and escaping really are the only robust way.

        • By saulpw 2025-03-2717:161 reply

          You can disallow all control characters (ASCII < 32) other than CR/LF/TAB, which is reasonable. I don't know of any data besides binary blobs which uses those. I've never heard of anyone inlining a binary file (like an image) into a "CSV" anyway.

          • By mjw_byrne 2025-03-2718:402 reply

            If you disallow control characters so that you can use them as delimiters, then CSV itself becomes a "binary" data format - or to put it another way, you lose the ability to nest CSV.

            It isn't good enough to say "but people don't/won't/shouldn't do that", because it will just happen regardless. I've seen nested CSV in real-life data.

            Compare to the zero-terminated strings used by C, one legacy of which is that PostgreSQL doesn't quite support UTF-8 properly, because it can't handle a 0 byte in a string, because 0 is "special" in C.

            • By saulpw 2025-03-2719:19

              Nested CSVs as you've seen in real-life data are a good counterexample, thanks for providing it.

            • By thayne 2025-03-280:441 reply

              So have a way to escape those control characters.

              • By mjw_byrne 2025-03-2811:511 reply

                Right, but the original point I was responding to is that control characters are disallowed in the data and therefore don't need to be escaped. If you're going to have an escaping mechanism then you can use "normal" characters like comma as delimiters, which is better because they can be read and written normally.

                • By thayne 2025-03-2818:031 reply

                  But a comma is much more likely to need to be escaped.

                  • By mjw_byrne 2025-03-2818:25

                    It's good for a delimiter to be uncommon in the data, so that you don't have to use your escaping mechanism too much.

                    This is a different thing altogether from using "disallowed" control characters, which is an attempt to avoid escaping altogether - an attempt which I was arguing is doomed to fail.

    • By j45 2025-03-2718:271 reply

      CSV is a data-exchange format.

      • By thayne 2025-03-281:391 reply

        But it is terrible at that because there is no widely adhered to standard[1], the sender and receiver often disagree on the details of what exactly a CSV is.

        [1]: yes, I know about RFC 4180. But csvs in the wild often don't follow it.

        • By j45 2025-03-298:54

Agreed that CSV construction isn't consistent even with standards.

          That footprint seems to be dozens of variations to work with to find a library for?

          CSV are universal though, text kind of like markdown, and that is my intended main point.

    • By INTPenis 2025-03-2714:28

      The only times I hated CSV was when it came from another system I had no control over. For example Windows and their encodings, or some other proprietary BS.

      But CSV under controlled circumstances is very simple.

      And speaking of Wintendo, the bonus is often that you can go straight from CSV to Excel presentation for the middle management.

  • By jll29 2025-03-273:0416 reply

The post should at least mention in passing the major problem with CSV: it is a "no spec" family of de-facto formats, not a single thing (it is an example of "historically grown"). And omission of that means I'm going to have to call this out for its bias (but then it is a love letter, and love makes blind...).

Unlike XML or JSON, there isn't a document defining the grammar of well-formed or valid CSV files, and there are many flavours that are incompatible with each other, in the sense that a reader for one flavour would not be suitable for reading the other and vice versa. Quoting, escaping and UTF-8 support are particular problem areas, but also that you cannot tell programmatically whether line 1 contains column header names or already data (you will have to make an educated guess, but there are ambiguities in it that cannot be resolved by machine).

Having worked extensively with SGML for linguistic corpora, with XML for Web development and recently with JSON, I would say that programmatically JSON is the most convenient to use regarding client code, but its lack of types also makes it less broadly useful than SGML, which is rightly used by e.g. airlines for technical documentation and by digital humanities researchers to encode/annotate historic documents, for which it is very suitable, but programmatically puts more burden on developers. You can't have it all...

XML is simpler than SGML, has perhaps the broadest scope and a good software support stack (mostly FOSS), but it has been abused a lot (nod to Java coders: Eclipse, Apache UIMA), though I guess a format is not responsible for how people use or abuse it. As usual, the best developers know the pros and cons and make good-taste judgments about what to use each time, but some people go ideological.

    (Waiting for someone to write a love letter to the infamous Windows INI file format...)

    • By sramsay64 2025-03-273:155 reply

      In fairness there are also several ambiguities with JSON. How do you handle multiple copies of the same key? Does the order of keys have semantic meaning?

      jq supports several pseudo-JSON formats that are quite useful like record separator separated JSON, newline separated JSON. These are obviously out of spec, but useful enough that I've used them and sometimes piped them into a .json file for storage.

      Also, encoding things like IEEE NaN/Infinity, and raw byte arrays has to be in proprietary ways.

      • By d0mine 2025-03-274:081 reply

JSON Lines is not JSON. It is built on top of it. The .jsonl extension can be used to make that clear: https://jsonlines.org/

        • By joquarky 2025-03-2719:38

          Back in my day it was called NDJSON.

          The industry is so chaotic now we keep giving the same patterns different names, adding to the chaos.

      • By thiht 2025-03-279:021 reply

        > How do you handle multiple copies of the same key

        That’s unambiguously allowed by the JSON spec, because it’s just a grammar. The semantics are up to the implementation.

        • By sbergot 2025-03-279:531 reply

          interestingly other people are answering the opposite in this thread.

          • By thiht 2025-03-2710:141 reply

            They're wrong.

            From ECMA-404[1] in section 6:

            > The JSON syntax does not impose any restrictions on the strings used as names, does not require that name strings be unique, and does not assign any significance to the ordering of name/value pairs.

            That IS unambiguous.

            And for more justification:

            > Meaningful data interchange requires agreement between a producer and consumer on the semantics attached to a particular use of the JSON syntax. What JSON does provide is the syntactic framework to which such semantics can be attached

            > JSON is agnostic about the semantics of numbers. In any programming language, there can be a variety of number types of various capacities and complements, fixed or floating, binary or decimal.

            > It is expected that other standards will refer to this one, strictly adhering to the JSON syntax, while imposing semantics interpretation and restrictions on various encoding details. Such standards may require specific behaviours. JSON itself specifies no behaviour.

            It all makes sense when you understand JSON is just a specification for a grammar, not for behaviours.

            [1]: https://ecma-international.org/wp-content/uploads/ECMA-404_2...

            • By kevincox 2025-03-2712:283 reply

              > and does not assign any significance to the ordering of name/value pairs.

              I think this is outdated? I believe that the order is preserved when parsing into a JavaScript Object. (Yes, Objects have a well-defined key order. Please don't actually rely on this...)

              • By hajile 2025-03-2713:38

                In the JS spec, you'd be looking for 25.5.1

                If I'm not mistaken, this is the primary point:

                > Valid JSON text is a subset of the ECMAScript PrimaryExpression syntax. Step 2 verifies that jsonString conforms to that subset, and step 10 asserts that that parsing and evaluation returns a value of an appropriate type.

                And in the algorithm

                    c. Else,
                      i. Let keys be ? EnumerableOwnProperties(val, KEY).
                      ii. For each String P of keys, do
                        1. Let newElement be ? InternalizeJSONProperty(val, P, reviver).
                        2. If newElement is undefined, then
                          a. Perform ? val.[[Delete]](P).
                        3. Else,
                          a. Perform ? CreateDataProperty(val, P, newElement).
                
                If you theoretically (not practically) parse a JSON file into a normal JS AST then loop over it this way, because JS preserves key order, it seems like this would also wind up preserving key order. And because it would add those keys to the final JS object in that same order, the order would be preserved in the output.

> (Yes, Objects have a well-defined key order. Please don't actually rely on this...)

                JS added this in 2009 (ES5) because browsers already did it and loads of code depended on it (accidentally or not).

                There is theoretically a performance hit to using ordered hashtables. That doesn't seem like such a big deal with hidden classes except that `{a:1, b:2}` is a different inline cache entry than `{b:2, a:1}` which makes it easier to accidentally make your function polymorphic.

                In any case, you are paying for it, you might as well use it if (IMO) it makes things easier. For example, `let copy = {...obj, updatedKey: 123}` is relying on the insertion order of `obj` to keep the same hidden class.

              • By thiht 2025-03-2712:501 reply

                In JS maybe (I don't know tbh), but that's irrelevant to the JSON spec. Other implementations could make a different decision.

                • By kevincox 2025-03-2713:20

                  Ah, I thought the quote was from the JS spec. I didn't realize that ECMA published their own copy of the JSON spec.

      • By yrro 2025-03-279:251 reply

Internet JSON (RFC 7493) forbids objects to have members with duplicate names.

        • By _flux 2025-03-2712:131 reply

          As it says:

          I-JSON (short for "Internet JSON") is a restricted profile of JSON designed to maximize interoperability and increase confidence that software can process it successfully with predictable results.

          So it's not JSON, but a restricted version of it.

          I wonder if use of these restrictions is popular. I had never heard of I-JSON.

          • By rcxdude 2025-03-2713:49

I think it's rare for them to be explicitly stated, but common for them to be present in practice. I-JSON is just an explicit list of these common implicit limits. For any given tool/service that describes itself as accepting JSON I would expect I-JSON documents to be more likely to work as expected than non-I-JSON.

      • By zzo38computer 2025-03-282:08

        > How do you handle multiple copies of the same key? Does the order of keys have semantic meaning?

This is also an issue, due to the way the order of keys works in JavaScript, too.

        > record separator separated JSON, newline separated JSON.

        There is also JSON with no separators, although that will not work very well if any of the top-level values are numbers.

        > Also, encoding things like IEEE NaN/Infinity, and raw byte arrays has to be in proprietary ways.

        Yes, as well as non-Unicode text (including (but not limited to) file names on some systems), and (depending on the implementation) 64-bit integers and big integers. Possibly also date/time.

        I think DER avoids these problems. You can specify whether or not the order matters, you can store Unicode and non-Unicode text, NaN and Infinity, raw byte arrays, big integers, and date/time. (It avoids some other problems as well, including canonization (DER is already in canonical form) and other issues. Although, I have a variant of DER that avoids some of the excessive date/time types and adds a few additional types, but this does not affect the framing, which can still be parsed in the same way.)

        A variant called "Multi-DER" could be made up, which is simply concatenating any number of DER files together. Converting Multi-DER to BER is easy just by adding a constant prefix and suffix. Converting Multi-DER to DER is almost as easy; you will need the length (in bytes) of the Multi-DER file and then add a prefix to specify the length. (In none of these cases does it require parsing or inspecting or modifying the data at all. However, converting the JSON variants into ordinary JSON does require inspecting the data in order to figure out where to add the commas.)

      • By diekhans 2025-03-273:184 reply

        Plus the 64-bit integer problem, really 52-bit integers, due to JS not having integers.

        • By d0mine 2025-03-274:24

JSON itself is limited to neither 52- nor 64-bit integers.

              integer = -? (digit | onenine digit+)
              
          https://json.org/

        • By 0cf8612b2e1e 2025-03-273:45

          That’s a JavaScript problem, not JSON.

        • By dtech 2025-03-275:431 reply

          Most good parsers have an option to parse to integers or arbitrary precision decimals.

          • By VMG 2025-03-2710:021 reply

            Agreed. Which means that Javascript does not have a good parser.

            • By exogen 2025-03-2710:363 reply

              `JSON.parse` actually does give you that option via the `reviver` parameter, which gives you access to the original string of digits (to pass to `BigInt` or the number type of your choosing) – so per this conversation fits the "good parser" criteria.

              • By hajile 2025-03-2713:54

                To be specific (if anyone was curious), you can force BigInt with something like this:

                    //MAX_SAFE_INTEGER is actually 9007199254740991 which is 16 digits
                    //you can instead check if exactly 16 and compare size one string digit at a time if absolute precision is desired.
                    const bigIntReviver = (key, value, context) => typeof value === 'number' && Math.floor(value) === value && context.source.length > 15 ? BigInt(context.source) : value
                      
                
                    const jsonWithBigInt = x => JSON.parse(x, bigIntReviver)
                
                Generally, I'd rather throw if a number is unexpectedly too big otherwise you will mess up the types throughout the system (the field may not be monomorphic) and will outright fail if you try to use math functions not available to BigInts.

              • By whizzter 2025-03-2713:442 reply

                Sadly the reviver parameter is a new invention only recently available in FF and Node, not at all in Safari.

                Naturally not that hard to write a custom JSON parser but the need itself is a bad thing.

                • By arnorhs 2025-03-2719:051 reply

                  No it's been there for ages. Finalized as part of ecmascript 5

                  What you are probably thinking of is the context parameter of the reviver callback. That is relatively recent and mostly a qol improvement

                  • By whizzter 2025-03-2720:59

                    Sorry yes, i was thinking of the context object with source parameter.

                    The issue it solves is a big one though, since without it the JSON.parse functionality cannot parse numbers that are larger than 64bit float numbers (f.ex. bigints).

        • By tobyhinloopen 2025-03-2710:03

          bigint exists

    • By realitysballs 2025-03-2711:202 reply

      They do specifically mention this:

      “No one owns CSV. It has no real specification (yes, I know about the controversial ex-post RFC 4180), just a set of rules everyone kinda agrees to respect implicitly. It is, and will forever remain, an open and free collective idea.”

      • By thayne 2025-03-281:49

They even seem to think it is a good thing. But I don't see how having a bunch of implementations that can't agree on the specifics of a file/interchange format is a good thing. And being free and open is completely orthogonal. There are many proprietary formats that don't have a spec, and many open formats that do have a spec (like, say, json).

      • By Gormo 2025-03-2718:21

That's true of the vast majority of protocols that people use in real life to exchange data. We're using one of them right now, in fact.

    • By rkagerer 2025-03-274:523 reply

      Waiting for someone to write a love letter to the infamous Windows INI file format

      I actually miss that. It was nice when settings were stored right alongside your software, instead of being left behind all over a bloated registry. And the format was elegant, if crude.

      I wrote my own library for encoding/writing/reading various datatypes and structure into ini's, in a couple different languages, and it served me well for years.

      • By isoprophlex 2025-03-275:25

        TOML is nice like that... elegant like INI, only with lists.

      • By eddythompson80 2025-03-2717:23

        > instead of being left behind all over a bloated registry

Really? I think the idea of a central, generic, key-value pair database for all the settings on a system is probably the most elegant reasonable implementation there could be.

The initial implementation of Windows Registry wasn't good. It was overly simplistic and pretty slow. Though the "bloat" (whatever that means) of the registry hasn't been an actual issue in over 20 years. The only people invested in convincing you "it's an issue" are CCleaner type software that promise to "speed up your computer" if you just pay $6.99.

        How many rows do you need in a sqlite database for it to be "bloated"?

      • By xp84 2025-03-274:595 reply

        I feel like YAML is a spiritual successor to the .ini, since it shares a notable ideal of simple human readability/writability.

        • By estebank 2025-03-275:141 reply

          Whenever I ask myself "should I use YAML?" I answer myself "Norway".

        • By lelanthran 2025-03-275:53

          > I feel like YAML is a spiritual successor to the .ini, since it shares a notable ideal of simple human readability/writability.

          It doesn't feel that way to me: it's neither simple to read nor to write. I suppose that that's a builtin problem due to tree representation, which is something that INI files were never expected to represent.

          TBH, I actually prefer the various tree representation workarounds used by INI files: using whitespace to indicate child nodes stops being readable once you have more than a screenful of children in a node.

        • By ElectricalUnion 2025-03-275:30

Given how YAML does magic and sometimes accidental type conversions of potentially nested objects, I think TOML is the well-defined successor to .ini

        • By consp 2025-03-278:232 reply

          YAML is readable? No way as there are too many ways to do the same thing and nested structures are unclear to the non trained eye (what is a list? What is nested?), let alone indentation in large files is an issue especially with the default 2 space unreadable standard so many people adhere to.

YAML simple? Its spec is larger than XML's... Parsing of numbers and strings is ambiguous, leading zeros are not strings but octal (implicit conversion...). Lists as keys? Oh ffs, and you said readable. And do not get me started about "Yes" being a boolean, which reminds me of the MS Access localizations that had other decimal values for true and [local variant of true] (1 vs -1).

          Writable? Even worse. I think I have never been able to write a YAML file without errors. But that might just be me, XML is fine though while unreadable.

          • By xp84 2025-03-3123:35

            I agree that one could make wild YAML if you get into advanced stuff, but I make YAML files that look like this:

              things:
                - title: "A thing"
                  item_id: "deadbeef-feb1-4e8c-b61c-dd9a7a9fffff"
                  is_active: true
                  favorite_formats:
                    - yml
                    - ini
                - title: "Another thing"
                  item_id: "deadbeef-feb1-3333-4444-dd9a7a9fffff"
                  is_active: false
                  favorite_formats:
                    - mp3
                    - wav
            
            Just because you can use it to create a monstrosity doesn't prevent it from being useful for simple configuration. Basically, it's just prettier JSON.

          • By HelloNurse 2025-03-278:541 reply

            Say "no" to YAML. As a string, if you can.

            • By Ygg2 2025-03-279:531 reply

              You can. YAML 1.2 is only 16 years old. Just old enough to drive. Norway problem has been solved for only 16 years.

              • By HelloNurse 2025-03-2710:341 reply

                YAML 1.2 leaves data types ambiguous, merely making the "Norway problem" optional and at the mercy of the application rather than, in the words of https://yaml.org/type/ (which has not been marked as deprecated), "strongly recommended".

                • By Ygg2 2025-03-2714:27

                  Those schemas aren't part of the core schema, and you may interpret them if you are aiming for full 1.1 compatibility. If you're aiming for 1.1 compatibility, then you accept the Norway problem.

                  I've been looking in the specs and I can't find the link to the https://yaml.org/type/

        • By afiori 2025-03-278:39

          I think GRON[1] would fit the bill better

          [1] https://github.com/tomnomnom/gron

    • By otabdeveloper4 2025-03-277:092 reply

      People who say that CSV is "simpler" are talking about whatever format Excel exports.

      Also these people have only ever had to deal with the American Excel localization.

      So yeah, with the caveat of "only ever use Excel and only ever the American edition" CSV is pretty nice.

      • By mbnielsen 2025-03-2712:05

        As someone living in a country where , is used as the decimal separator, I cannot begin to describe the number of times CSV data has caused me grief. This becomes especially common in an office environment where Excel is the de facto only data handling tool that most people can and will use. Here the behavior of loading data becomes specific to the individual machine and changes over time (e.g. when IT suddenly forces a reset of MS Office application languages to the local one).

        That said, I don't really know of any alternative that won't be handled even worse by my colleagues...

      • By cgio 2025-03-278:501 reply

        Also keeping in mind all the locales where comma is the decimal point…tsv for the world.

        • By matwood 2025-03-2713:52

          And all the 'simple' formats start failing when dealing with blocks of text.

    • By lelanthran 2025-03-275:492 reply

      To be honest, I'm wondering why you are rating JSON higher than CSV.

      > Unlike XML or JSON, there isn't a document defining the grammar of well-formed or valid CSV files,

      There is, actually, RFC 4180 IIRC.

      > there are many flavours that are incompatible with each other in the sense that a reader for one flavour would not be suitable for reading the other and vice versa.

      "There are many flavours that deviate from the spec" is a JSON problem too.

      > you cannot tell programmatically whether line 1 contains column header names or already data (you will have to make an educated guess but there ambiguities in it that cannot be resolved by machine).

      Also a problem in JSON

      > Quoting, escaping, UTF-8 support are particular problem areas,

Sure, but they are no more and no less of a problem in JSON.

      • By IanCal 2025-03-276:285 reply

        Have you had to work with csv files from the wild much? I'm not being snarky but what you're talking about is night and day to what I've experienced over the years.

        There aren't vast numbers of different JSON formats. There's practically one and realistically maybe two.

        Headers are in each line, utf8 has never been an issue for me and quoting and escaping are well defined and obeyed.

        This is because for datasets, almost exclusively, the file is machine written and rarely messed with.

        Csv files have all kinds of separators, quote characters, some parsers don't accept multi lines and some do, people sort files which mostly works until there's a multi line. All kinds of line endings, encodings and mixed encodings where people have combined files.

        I tried using ASCII record separators after dealing with so many issues with commas, semicolons, pipes, tabs etc and still data in the wild had these jammed into random fields.

        Lots of these things don't break when you hit the issue either, the parsers happily churn on with garbage data, leading to further broken datasets.

        Also they're broken for clients if the first character is a capital I.

        • By dspillett 2025-03-2711:191 reply

          WRT JSON:

          > Headers are in each line

This might be my old "space and network cost savings" reflex, which is a lot less necessary these days, kicking in, but that feels inefficient. It also gives rise to not knowing the whole schema until you read the whole dataset (which might be multiple files), unless some form of external schema definition is provided.

          Having said that, I accept that JSON has advantages over CSV, even if all that is done is translating a data-table into an array of objects representing one row each.

          > utf8 has never been an issue for me

          The main problem with UTF8 isn't with CSV generally, it is usually, much like the “first column is called ID” issue, due to Excel. Unfortunately a lot of people interact with CSVs primarily with Excel, so it gets tarred with that brush by association. Unless Excel sees the BOM sequence at the start of a CSV file, which the Unicode standards recommend against for UTF8, it assumes its characters are using the Win1252 encoding (almost, but not quite, ISO-8859-1).

          > Csv files have all kinds of separators

          I've taken to calling them Character Separated Value files, rather than Comma, for this reason.

          • By IanCal 2025-03-2711:292 reply

            Yes, it's not great. Space is annoying, though compression pretty much removes that as a concern (zstd is good for this, you can even have a custom dictionary). And yes, missing keys is annoying.

            JSONL is handy, JSON that's in the form {data: [...hundred megs of lines]} is annoying for various parsers.

            I'm quite a fan of parquet, but never expect to receive that from a client (alas).

            • By Izkata 2025-03-2714:29

              > JSON that's in the form {data: [...hundred megs of lines]} is annoying for various parsers.

              One reason this became common was a simple protection against json hijacking: https://haacked.com/archive/2009/06/25/json-hijacking.aspx/

            • By cogman10 2025-03-2712:59

              Parquet should get the praise. It's simply awesome.

              It's what I'd pick for tabular data exchange.

              A recent problem I solved with it and duckdb allowed me to query and share a 3M record dataset. The size? 50M. And my queries all ran subsecond. You just aren't going to get that sort of compression and query-ability with a csv.

        • By SkyBelow 2025-03-2717:591 reply

          I wonder if CSV is the trivial format, so you have many people picking it because they want the easiest, and still getting it wrong. JSON is harder, so very few people are going to roll their own serializer/deserializer, and those who do are more likely to focus on getting it right (or at least catching the really obvious bugs).

          I've dealt with incorrect CSVs numerous times, never with incorrect JSON, but, of the times I know what was happening on the other system, each time the CSV was from some in house (or similar) implementation of dumping a SQL output (or similar) into a text file as an MVP. JSON was always using some library.

          If so, that's all the more reason to love CSV as it stands guard for JSON. If CSV didn't exist, we would instead have broken JSON implementations. (JSON and XML would likely then share a similar relationship.)

          • By Gormo 2025-03-2718:291 reply

            Sometimes people interpret the term too generically and actually implement a high degree of non-trivial, very idiosyncratic complexity, while still calling it "CSV".

            One project I worked on involved a vendor promising to send us data dumps in "CSV format". When we finally received their "CSV" we had to figure out how to deal with (a) global fields being defined in special rows above the header row, and (b) a two-level hierarchy of semicolon-delimited values nested within comma-delimited columns. We had to write a custom parser to complete the import.

            • By ttyprintk 2025-03-284:07

              Hi,

              Yes, we chose ARFF format, which is idiosyncratic yet well-defined back in the old data mining days.

        • By lelanthran 2025-03-277:195 reply

          Sure, I get your arguments and we're probably mostly in agreement, but in practice I see very few problems arising with using CSV.

          I mean, right now, the data interchange format between multiple working systems is CSV; think payment systems, inter-bank data interchange, ERP systems, CRM systems, billing systems ... the list goes on.

          I just recently had a coffee with a buddy who's a salesman for some enterprise system: of the most common enterprise systems we recently worked with (SAP type things, but on smaller scales), every single one of them had CSV as the standard way to get data between themselves and other systems.

          And yet, they work.

          The number of people uploading excel files to be processed or downloading excel files for local visualistation and processing would floor you. It's done multiple times a day, on multiple systems, in multiple companies.

          And yet, they work.

I get your argument though - a JSON array of arrays can represent everything that CSV can, and is preferable to CSV, and is what I would choose when given the choice, but the issues with using that are not going to be fewer than the issues with CSV using RFC 4180.

          • By fauigerzigerk 2025-03-2710:121 reply

            >but in practice I see very few problems arising with using CSV

            That is not my experience at all. I've been processing CSV files from financial institutions for many years. The likelihood of brokenness must be around 40%. It's unbelievable.

            The main reason for this is not necessarily the CSV format as such. I believe the reason is that it is often the least experienced developers who are tasked with writing export code. And many inexperienced developers seem to think that they can generate CSV without using a library because the format is supposedly so simple.

            JSON is better but it doesn't help with things like getting dates right. XML can help with that but it has complexities that people get wrong all the time (such as entities), so I think JSON is the best compromise.

            • By kragen 2025-03-2717:122 reply

              > And many inexperienced developers seem to think that they can generate CSV without using a library because the format is supposedly so simple.

              Can't they?

                  def excel_csv_of(rows):
                    for row in rows:
                      for i, field in enumerate(row):
                        if i:
                          yield ','
                        yield '"'
                        for c in field:
                          yield '""' if c == '"' else c
                        yield '"'
                      yield '\n'
              
              I haven't tested this, even to see if the code parses. What did I screw up?

              • By fauigerzigerk 2025-03-2720:52

                >Can't they?

                If my experience reflects a relevant sample then the answer is that most can but a very significant minority fails at the job (under the given working conditions).

                Whether or not _you_ can is a separate question. I don't see anything wrong with your code. It does of course assume that whatever is contained in rows is correct. It also assumes that the result is correctly written to a file without making any encoding mistakes or forgetting to flush the stream.

                Not using name value pairs makes CSV more prone to mistakes such as incorrect ordering or number of values in some rows, a header row that doesn't correspond with the data rows, etc. Some export files are merged from multiple sources or go through many iterations over many years, which makes such mistakes far more likely.

                I have also seen files that end abruptly somewhere in the middle. This isn't specific to CSV but it is specific to not using libraries and not using libraries appears to be more prevalent when people generate CSV.

                You'd be surprised how many CSV files are out there where the developer tried to guess incorrectly whether or not a column would ever have to be escaped. Maybe they were right initially and it didn't have to be escaped but then years later something causes a change in number formats (internationalisation) and bang, silent data corruption.

                Prioritising correctness and robustness over efficiency as you have done is the best choice in most situations. Using a well tested library is another option to get the same result.

              • By gthompson512 2025-03-2719:231 reply

                This forces each field to be quoted, and it assumes that each row has the same fields in the same order. A library can handle the quoting issues and fields more reliably. Not sure why you went with a generator for this either.

                Most people expect something like `12,,213,3` instead of `"12","","213","3"`, which is what yours would give.

                https://en.wikipedia.org/wiki/Comma-separated_values#Basic_r...

                • By kragen 2025-03-2719:34

                  Forcing each field to be quoted is always correct, isn't it? How could something be "more reliable" than something that is always correct?

                  With respect to "the same fields in the same order", no, although you may or may not feed the CSV to an application that has such an expectation. But if you apply it to data like [("Points",),(),("x","y"),("3","4"),("6","8","10")] it will successfully preserve that wonky structure in a file Excel can ingest reliably. (As reliably as Excel can ingest anything, anyway, since Excel has its own Norway problem.)

                  It's true that it's possible to produce more optimized output, but I didn't claim that the output was optimal, just correct.

                  Using generators is necessary to be able to correctly output individual fields that are many times larger than physical memory.

          • By IanCal 2025-03-2710:16

            I'll preface this that I think we are mostly in agreement, so that's the friendly tone of reply, part of this is just having flashbacks.

            It's massively used, but the lack of adherence to a proper spec causes huge issues. If you have two systems that happen to talk properly to each other, great, but if you are as I was an entrypoint for all kinds of user generated files it's a nightmare.

            CSV is the standard, sure, but it's easy to write code that produces it that looks right at first glance but breaks with some edge case. Or someone has just chosen a different separator, or quote, so you need to try and detect those before parsing (I had a list that I'd go through, then look for the most commonly appearing non-letter character).
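
            (For illustration, a rough Python sketch of that kind of delimiter guessing; csv.Sniffer is the standard library's take on the same idea, and the candidate list and file name are assumptions for the example, not anything prescribed:)

                import csv

                CANDIDATES = ",;\t|"  # the usual suspects to try first

                def guess_delimiter(sample: str) -> str:
                    # Try the stdlib heuristic first...
                    try:
                        return csv.Sniffer().sniff(sample, delimiters=CANDIDATES).delimiter
                    except csv.Error:
                        pass
                    # ...then fall back to the most common non-letter, non-quote character.
                    counts = {}
                    for c in sample:
                        if not c.isalnum() and not c.isspace() and c not in "\"'":
                            counts[c] = counts.get(c, 0) + 1
                    return max(counts, key=counts.get) if counts else ","

                with open("incoming.csv", newline="") as f:   # hypothetical input file
                    delim = guess_delimiter(f.read(8192))
                    f.seek(0)
                    rows = list(csv.reader(f, delimiter=delim))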

            The big problem is that the resulting semantically broken CSV files often look pretty OK, both to someone scanning them and to permissive parsers. So one system reads it in, splits something on lines and assumes missing columns are blank, and suddenly you have the wrong number of rows, then it exports it. Worse if it's been sorted before the export.

            Of course then there's also the issues around a lack of types, so numbers and strings are not automatically distinguishable, leading to breakage where you do want leading zeros. Again often not identified until later. Or auto type detection in a system breaking because it sees a lot of number-like things and assumes it's a number column. Without types there's no verification either.

            So even properly formatted CSV files need a second place for metadata about what types there are in the file.

            JSON has some of these problems too, it lacks dates, but far fewer.

            > but the issues with using that are not going to be fewer than issues with CSV using RFC 4180.

            My only disagreement here is that I've had to deal with many ingest endpoints that don't properly support that.

            Fundamentally I think nobody uses CSV files because they're a good format. They're big, slow to parse, lack proper typing, lack columnar reading, lack fast jumping to a particular place, etc.

            They are ubiquitous, just not good, and they're very easy to screw up in hard to identify or fix ways.

            Finally, lots of this comes up because RFC4180 is only from *2005*.

            Oh, and if I'm reading the spec correctly, RFC4180 doesn't support UTF8. There was a proposed update maybe in 2022 but I can't see it being accepted as an RFC.

          • By watwut 2025-03-2715:40

            > I mean, right now, the data interchange format between multiple working systems is CSV; think payment systems, inter-bank data interchange, ERP systems, CRM systems, billing systems ... the list goes on.

            And there are constant issues arising from that. You basically need a small team to deal with them in every institution that is processing them.

            > I just recently had a coffee with a buddy who's a salesman for some enterprise system: of the most common enterprise systems we recently worked with (SAP type things, but on smaller scales), every single one of them had CSV as the standard way to get data between themselves and other systems.

            Salesmen of enterprise systems do not care about the issues programmers and clients have. They care about what they can sell to other businessmen. That teams on both sides then waste time and money on troubleshooting is no concern to the salesman. And I am saying that as someone who worked on an enterprise system that consumed a lot of CSV. It does not work, and the process of handling the files sometimes literally involved phone calls to admins of other systems. More often than would be sane.

            > The number of people uploading excel files to be processed or downloading excel files for local visualistation and processing would floor you.

            That is perfectly fine as long as it is a manager downloading data so that he can manually analyze them. It is pretty horrible when those files are then uploaded to other systems.

          • By flanked-evergl 2025-03-277:481 reply

            In practice, I have never ever received CSV to process that complied with RFC 4180, and in most cases it was completely incoherent and needed incredibly special handling to handle all the various problems like lack of escaping.

            SAP has been by far the worst. I never managed to get data out of it that was not complete garbage and did not need hand-crafted parsers.

            • By consp 2025-03-278:131 reply

              SAP only has to be SAP and MS Excel compatible. The rest is not needed so in their eyes it is probably to spec.

          • By tikhonj 2025-03-2712:47

            > And yet, they work.

            Through a lot of often-painful manual intervention. I've seen it first-hand.

            If an organization really needs something to work, it's going to work somehow—or the organization wouldn't be around any more—but that is a low bar.

            In a past role, I switched some internal systems from using CSV/TSV to using Parquet and the difference was amazing both in performance and stability. But hey, the CSV version worked too! It just wasted a ton of people's time and attention. The Parquet version was far better operationally, even given the fact that you had to use parquet-tools instead of just opening files in a text editor.

        • By recursive 2025-03-2719:452 reply

          > There aren't vast numbers of different JSON formats.

          Independent variations I have seen:

          * Trailing commas allowed or not
          * Comments allowed or not
          * Multiple kinds of date serialization conventions
          * Divergent conventions about distinguishing floating point types from integers
          * Duplicated key names tolerated or not
          * Different string escaping policies, such as, but not limited to, "\n" vs "\x0a"

          There are bazillions of JSON variations.

          • By thayne 2025-03-284:111 reply

            > Trailing commas allowed or not

            The JSON spec does not allow trailing commas, although there are JSON supersets that do.

            > Comments allowed or not

            The JSON spec does not allow comments, although there are JSON supersets that do.

            > Multiple kinds of date serialization conventions

            The JSON spec doesn't say anything about dates. That is dependent on your application schema.

            > Divergent conventions about distinguishing floating point types from integers

            This is largely due to divergent ways different programming languages handle numbers. I won't say JSON handles this the best, but any file format used across multiple languages will run into problems with differences in how numbers are represented. At least there is a well-defined difference between a number and a string, unlike CSV.

            > Duplicated key names tolerated or not

            According to the spec, they are tolerated, although the semantics of such keys is implementation defined.

            > Different string escaping policies, such as, but not limited to "\n" vs "\x0a"

            Both of those are interpreted as the same thing, at least according to the spec. That is an implementation detail of the serializer, not a different language.

            • By recursive 2025-03-2816:422 reply

              And CSV parsers and serializers compliant with RFC 4180 are similarly reliable.

              • By thayne 2025-03-2818:02

                But many, perhaps most, parsers and serializers for CSV are not compliant with RFC 4180.

                RFC 4180 is not an official standard. The text of the RFC itself states:

                > This memo provides information for the Internet community. It does not specify an Internet standard of any kind.

                CSVs existed long before that RFC was written, and it is more a description of CSVs that are somewhat portable, not a definitive specification.

              • By IanCal 2025-03-2818:40

                That RFC doesn't even support utf8.

                It is, and accepts it is, codifying best practices rather than defining an authoritative standard.

          • By IanCal 2025-03-2812:47

            There are always many, but in comparison to CSV I've run into almost no differences. JSON issues were rare, but with CSV it was common to have a brand new issue per client.

            Typically the big difference is that there are different parsers that are less tolerant of in-spec values. ClickHouse had a more restrictive parser, and recently I've dealt with Matrix.

            Maybe I've been lucky for json and unlucky for csv.

        • By gpvos 2025-03-277:363 reply

          What's the problem with capital I?

          • By Someone 2025-03-279:181 reply

            https://superuser.com/questions/210027/why-does-excel-think-... says it's not capital I but “ID”.

            Basically, Excel uses the equivalent of ‘file’ (https://man7.org/linux/man-pages/man1/file.1.html), sees the magic “ID”, and decides it is a SYLK file, even though .csv files starting with “ID” have outnumbered .SYLK files by millions for decades.

            • By gpvos 2025-03-2714:19

              Thanks. So I guess the easy compatible solution is to always quote the first item on the first line when writing CSV. Good to know. (Checking if the item starts with ID is more work. Possibly quote all items on the first line for simplicity.) (Reading SYLK is obviously irrelevant, so accepting unquoted ID when reading is the smarter way to go and will actually improve compatibility with writers that are not Excel. Also it takes no work.)
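
              (A minimal sketch of that workaround with Python's csv module; the helper name, file name and data are made up for illustration. The idea is to quote the whole header row so a leading ID can never look like a SYLK magic number, and leave the data rows alone:)

                  import csv

                  def write_excel_safe_csv(path, header, rows):
                      with open(path, "w", newline="") as f:
                          # Quote every header cell so the file never starts with a bare "ID".
                          csv.writer(f, quoting=csv.QUOTE_ALL).writerow(header)
                          csv.writer(f).writerows(rows)

                  write_excel_safe_csv("export.csv", ["ID", "name"], [[1, "alice"], [2, "bob"]])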

          • By IanCal 2025-03-279:051 reply

            The byte for a capital I is the same as the start of an odd file format, SYLK maybe? Excel has for years (or did, if they finally fixed it) decided this was enough to assume the file (called .csv) cannot possibly be CSV but must actually be SYLK. It then parses it as such, and is shocked to find your SYLK file is totally broken!

            • By boogheta 2025-03-279:242 reply

              It sounds to me like, as so often, the problem here is Excel, not CSV.

              • By pasc1878 2025-03-2711:091 reply

                Yes but in practice CSV is defined by what Excel does.

                There is no standard to which Excel conforms, as it predates the standards, and there would be an outcry if Excel started rejecting files that had worked for years.

                • By imtringued 2025-03-2715:101 reply

                  There is a common misconception here. You can import CSV files into an Excel sheet. You cannot open a CSV file with Excel. That is a nonsense operation.

                  • By melagonster 2025-03-287:14

                    Excel does not ask the user whether they want to import the file; it just tells the user their file is broken.

              • By IanCal 2025-03-279:47

                Clients don't particularly make the distinction, and in a way nor should they - they can't open your file.

          • By n_plus_1_acc 2025-03-278:191 reply

            Probably referring to the "Turkish i problem"

            • By gpvos 2025-03-2714:21

              Not an unreasonable guess, but it turned out to be something different.

      • By Someone 2025-03-279:12

        > There is, actually, RFC 4180 IIRC.

        Does any software fully follow that spec (https://www.rfc-editor.org/rfc/rfc4180)? Some requirements that I doubt are commonly followed:

        - “Each record is located on a separate line, delimited by a line break (CRLF)” ⇒ editing .csv files using the typical Unix text editor is complicated.

        - “Spaces are considered part of a field and should not be ignored”

        - “Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes” ⇒ fields containing lone carriage returns or new lines need not be enclosed in double quotes.

    • By pcwalton 2025-03-274:31

      > Unlike XML or JSON, there isn't a document defining the grammar of well-formed or valid CSV files

      There is such a document: RFC 4180. It may not be a good document, but it does exist.

    • By no_wizard 2025-03-2711:03

      INI was for a long time the seemingly preferred format for configuration in the Python community, as I recall.

      Haven’t been a full-time Python dev in some time though; it seems TOML has supplanted that, but I remember thinking how interesting it was that Python had a built-in INI parser and serializer

    • By d0mine 2025-03-274:011 reply

      I lived through the SOAP/WSDL horror, with its numerous standards and the lack of compatibility between stacks in different programming languages. Having seen both XML and CSV abused, CSV is preferable over XML. Human-readability matters. Relative simplicity matters.

      Although JSON may also be interpreted differently by different tools, it is a good default choice for communicating between programs

      • By ElectricalUnion 2025-03-275:44

        > lack of compatibility between stacks in different programming languages

        Well, that sure beats OpenAPI's lack of compatibility between stacks in the same programming language.

        I think the fact that one can't randomly concatenate strings and call it "valid XML" is a huge bonus over the very common "join strings with comma and \r\n", non-RFC-4180-compliant (therefore mostly unparseable without human/LLM interaction) garbage people often pretend is CSV.

    • By cess11 2025-03-278:311 reply

      While CSV isn't exactly grammared or standardised like XML, I think of it as more schema'd than JSON. There might be data corruption or consistency issues, but there is implicitly a schema: every line is exactly n fields, and the first line might contain field names.

      When a JSON API turns out to have optional fields it usually shows through trial and error, and unlike CSV it's typically not considered a bug you can expect the API owner to fix. In CSV 'missing data' is an empty string rather than nulls or their cousins because missing fields aren't allowed, which is nice.

      I also like that I can write my own ad hoc CSV encoder in most programming languages that can do string concatenation, and probably also a suitable decoder. It helps a lot in some ETL tasks and debugging. Decent CSV also maps straight to RDBMS tables, if the database for some reason fails at immediate import (e.g. too strict expectations) into a newly created table it's almost trivial to write an importer that does it.

      • By cgio 2025-03-278:461 reply

        JSON is not schema’d per se and intentionally so. There’s jsonschema which has better expressiveness than inference of a tabular schema, as it can reflect relationships.

        • By cess11 2025-03-279:31

          Sure. I have yet to come across a data source with JSON Schema, I'll develop an opinion of it when I do.

    • By jajko 2025-03-279:091 reply

      There is no file format that works out of the box under all extreme corner cases.

      You would think that, say, an XML-defined WSDL with an XSD schema is well battle-proven. Two years ago I encountered (and am still dealing with) a WSDL from a major banking vendor that is technically valid, but no open source library in Java (of all languages) was able to parse it successfully or generate binding classes out of the box.

      Heck, flat files can end up with extreme cases, just work enough with legacy banking or regulatory systems and you will see some proper shit.

      The thing is, any sort of critical integration needs to be battle tested and continuously maintained, otherwise it will eventually go bad, even a decade after implementation and regular use without issues.

      • By hajile 2025-03-2714:00

        What about S-expressions? Where do they break?

    • By immibis 2025-03-2710:45

      XML is a pretty good markup language. Using XML to store structured data is an abuse of it. All the features that are useful for markup are not useful for structured data and only add to the confusion.

    • By sgarland 2025-03-2713:472 reply

      > Waiting for someone to write a love letter to the infamous Windows INI file format...

      Honestly, it’s fine. TOML is better if you can use it, but otherwise for simple applications, it’s fine. PgBouncer still uses INI, though that in particular makes me twitch a bit, due to discovering that if it fails to parse its config, it logs the failed line (reasonable), which can include passwords if it’s a DSN string.

      • By marcosdumay 2025-03-2714:39

        Well, once you get over the fact that information in a TOML file can be out of order in any place, written in any mix of 3 different key syntaxes, and broken down in any random way... then yes, the rest of TOML is good.

      • By flkenosad 2025-03-2713:561 reply

        I should write a love letter to JSON.

        • By recursive 2025-03-2719:45

          For as good as JSON is or is not, it's definitely not under-rated.

    • By golly_ned 2025-03-273:11

      It does mention this. Point 2.

    • By y42 2025-03-2719:46

      Why hate CSV, and not the program that is not able to properly create CSV?

  • By mjw_byrne 2025-03-2617:3614 reply

    CSV is ever so elegant but it has one fatal flaw - quoting has "non-local" effects, i.e. an extra or missing quote at byte 1 can change the meaning of a comma at byte 1000000. This has (at least) two annoying consequences:

    1. It's tricky to parallelise processing of CSV.

    2. A small amount of data corruption can have a big impact on the readability of a file (one missing or extra quote can bugger the whole thing up).

    So these days for serialisation of simple tabular data I prefer plain escaping, e.g. comma, newline and \ are all \-escaped. It's as easy to serialise and deserialise as CSV but without the above drawbacks.
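
    A minimal sketch of that escaping scheme (the helper names are mine, and handling of a malformed trailing backslash is omitted): comma, newline and backslash are the only special characters, each escaped with a single backslash, so any byte can be interpreted without looking at the rest of the file.

        def escape_field(s: str) -> str:
            # Escape the escape character first, then the two delimiters.
            return (s.replace("\\", "\\\\")
                     .replace(",", "\\,")
                     .replace("\n", "\\n"))

        def write_row(fields) -> str:
            return ",".join(escape_field(f) for f in fields) + "\n"

        def parse_row(line: str):
            line = line.rstrip("\n")      # a raw newline only ever terminates a row
            fields, cur, i = [], [], 0
            while i < len(line):
                c = line[i]
                if c == "\\":             # escaped character: take the next one literally
                    nxt = line[i + 1]
                    cur.append("\n" if nxt == "n" else nxt)
                    i += 2
                elif c == ",":            # unescaped comma: field boundary
                    fields.append("".join(cur))
                    cur = []
                    i += 1
                else:
                    cur.append(c)
                    i += 1
            fields.append("".join(cur))
            return fields

        assert parse_row(write_row(["a,b", "c\nd", "e\\f"])) == ["a,b", "c\nd", "e\\f"]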

    • By koolba 2025-03-2618:148 reply

      JSON serialized without extra white space with one line per record is superior to CSV.

      If you want CSV-ish, enforce an array of strings for each record. Or go further with actual objects and non-string types.

      You can even jump to an arbitrary point and then seek till you see an actual new line as it’s always a record boundary.
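
      (A small sketch of that seek-and-resync trick on a one-record-per-line JSON file; the file name and offset are illustrative:)

          import json

          def records_from_offset(path, offset):
              with open(path, "rb") as f:
                  f.seek(offset)
                  if offset:
                      f.readline()            # discard the partial record we landed in
                  for line in f:
                      yield json.loads(line)  # every remaining line is a complete record

          # e.g. start reading roughly in the middle of a big file:
          # for rec in records_from_offset("events.jsonl", 10_000_000): ...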

      It’s not that CSV is an invalid format. It’s that libraries and tools to parse CSV tend to suck. Whereas JSON is the lingua franca of data.

      • By derriz 2025-03-2620:147 reply

        > It’s that libraries and tools to parse CSV tend to suck. Whereas JSON is the lingua franca of data.

        This isn't the case. An incredible amount of effort and ingenuity has gone into CSV parsing because of its ubiquity. Despite the lack of any sort of specification, it's easily the most widely supported data format in existence in terms of tools and language support.

        • By nukem222 2025-03-271:02

          > An incredible amount of effort and ingenuity has gone into CSV parsing because of its ubiquity.

          Yea, and it's still a partially-parseable shit show with guessed values. But we could have, and should have, done better by simply defining a format to use.

        • By otikik 2025-03-2620:491 reply

          Meanwhile, Excel exports to CSV as “semicolon separated values” depending on your OS locale

          • By matthewmacleod 2025-03-2620:561 reply

            Albeit for fairly justifiable reasons

            • By notpushkin 2025-03-2621:002 reply

              Justifiable how?

              • By matthewmacleod 2025-03-2623:50

                Well, Excel has a lot of common use-cases around processing numeric (and particularly financial) data. Since some locales use commas as decimal separators, using a character that's frequently present as a piece of data as a delimiter is a bit silly; it would be hard to think of a _worse_ character to use.

                So, that means that Excel in those locales uses semicolons as separators rather than the more-frequently-used-in-data commas. Probably not the decision I'd make in retrospect, but not completely stupid.

              • By LoganDark 2025-03-2621:032 reply

                Decimal separators being commas in some locales?

                • By otikik 2025-03-2621:291 reply

                  They could have just ignored the locale altogether though. Put dots in the numbers when writing CSV, and assume dots when importing.

                  • By notpushkin 2025-03-276:091 reply

                    This exactly. Numbers in XLS(X) are (hopefully) not locale-specific – why should they be in CSV?

                    • By afiori 2025-03-278:47

                      CSV -> text/csv

                      Microsoft Excel -> application/vnd.ms-excel

                      CSV is a text format, xls[x], json, and (mostly) xml are not.

                • By ideamotor 2025-03-2621:233 reply

                  Commas are commonly used in text, too.

                  • By susam 2025-03-2623:142 reply

                    Clearly they should have gone with BEL as the delimiter.

                      printf "alice\007london\007uk\nbob\007paris\007france\n" > data.bsv
                    
                    I'm hoping no reasonable person would ever use BEL as punctuation or decimal separator.

                    • By dessimus 2025-03-2710:251 reply

                      If one was going to use a non-printable character as a delimiter, why wouldn't they use the literal record separator "\030"?

                      • By susam 2025-03-2723:50

                        Every time you cat a BSV file, your terminal beeps like it's throwing a tantrum. A record separator (RS) based file would be missing this feature! In other words, my previous comment was just a joke! :)

                        By the way, RS is decimal 30 (not octal '\030'). In octal, RS is '\036'. For example:

                          $ printf '\036' | xxd -p
                          1e
                          $ printf '\x1e' | xxd -p
                          1e
                        
                        See also https://en.cppreference.com/w/cpp/language/ascii for confirmation.

                    • By kmoser 2025-03-273:122 reply

                      On the off chance you're not being facetious, why not ASCII 0 as a delimiter? (This is a rhetorical question.)

                      • By mastax 2025-03-275:261 reply

                        ASCII has characters more or less designed for this

                        0x1C - File Separator

                        0x1D - Group Separator

                        0x1E - Record Separator

                        0x1F - Unit Separator

                        So I guess 1F would be the "comma" and 1E would be the "newline."
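
                        (A toy sketch of exactly that, assuming the data itself never contains the separator bytes, which is the whole bet:)

                            US, RS = "\x1f", "\x1e"   # unit and record separators

                            def dump_table(rows):
                                return RS.join(US.join(fields) for fields in rows) + RS

                            def load_table(blob):
                                return [rec.split(US) for rec in blob.split(RS) if rec]

                            table = [["name", "city"], ["alice", "london"]]
                            assert load_table(dump_table(table)) == table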

                        • By afiori 2025-03-278:541 reply

                          https://stackoverflow.com/questions/8695118/what-are-the-fil...

                          I am pretty sure you shifted the meaning: the decimal separator is part of the atomic data; it does not need a control character.

                          You would use 1F instead of the comma/semicolon/tab and 1E to split lines (record means line just like in SQL).

                          You could then use 1D to store multiple CSV tables in a single file.

                          • By pasc1878 2025-03-2711:201 reply

                            Yes but then the text is not human readable or editable in a plain text editor.

                            This would confuse most users of CSVs; they are not programmers, and at most they use text editors and Excel.

                            • By afiori 2025-03-2716:18

                              I am not proposing to do this, but if you were to use ascii separators you would do it this way

                      • By defrost 2025-03-273:17

                        There are some decent arguments for BEL over NUL, however given you posed that as a rhetorical question I feel I can say little other than

                        ding! ding! ding! winner winner, chicken dinner!

                        Although BEL would drive me up the wall if I broke out any of my old TTY hardware.

                  • By kevindamm 2025-03-272:09

                    ...and excel macros

                  • By LoganDark 2025-03-276:46

                    Sure, let's put quotation marks around all number values.

                    Oh wait.

                    lol

        • By hajile 2025-03-2620:412 reply

          Can you point me to a language with any significant number of users that does NOT have a JSON library?

          I went looking at some of the more niche languages like Prolog, COBOL, RPG, APL, Eiffel, Maple, MATLAB, tcl, and a few others. All of these and more had JSON libraries (most had one baked into the standard library).

          The exceptions I found (though I didn't look too far) were: Bash (use jq with it), J (an APL variant), Scratch (not exposed to users, but scratch code itself is encoded in JSON), and Forth (I could find implementations, but it's very hard to pin down forth dialects).

          • By derriz 2025-03-2620:582 reply

            I made no claim about JSON libraries. I contested the claim that "CSV libraries and tools suck". They do not.

            • By FridgeSeal 2025-03-2621:462 reply

              CSV tooling has had to invest enormous amounts of effort to make a fragile, under-specified format half-useful. I would call it ubiquitous, I would call the tooling that we’ve built around it “impressive” but I would by no means call any of it “good”.

              I do not miss dealing with csv files in the slightest.

              • By freehorse 2025-03-2623:481 reply

                > CSV tooling has had [...] to make a fragile, under-specified format half-useful

                You have this backwards. Tabular structured data is ubiquitous. Text as a file format is also ubiquitous because it is accessible. The only actual decisions are about whether to encode your variables as rows or columns, what the delimiter is, and other rules such as escaping etc. Vars as columns makes sense because it makes appending easier. There is a bunch of stuff that can be used for delimiters, commas being the most common; none is perfect. But from this point onwards, decisions do not really matter, and "CSV" basically covers everything from now on. "CSV" is basically what comes naturally when you have tabular datasets and want to store them in text. CSV tooling is developed because there is a need for this way of formatting data. Whether CSV is "good" or "ugly" or whatever is irrelevant; handling data is complicated as much as the world itself is. The alternatives are either not structuring/storing the data in a tabular manner, or non-text (e.g. binary) formats. These alternatives exist and are useful in their own right, but don't solve the same problems.

                • By ddulaney 2025-03-272:401 reply

                  I think the issue is that CSV parsing is really easy to screw up. You mentioned delimiter choice and escaping, and I’d add header presence/absence to that list.

                  There are at least 3 knobs to turn every time you want to parse a CSV file. There’s reasonably good tooling around this (for example, Python’s CSV module has 8 parser parameters that let you select stuff), but the fact that you have to worry about these details is itself a problem.

                  You said “handling data is complicated as much as the world itself is”, and I 100% agree. But the really hard part is understanding what the data means, what it describes. Every second spent on figuring out which CSV parsing option I have to change could be better spent actually thinking about the data.
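
                  (To make that concrete, this is the kind of knob-turning meant here, sketched with Python's csv module; the file name, encoding and parameter values are per-supplier guesses, not recommended defaults:)

                      import csv

                      with open("supplier_a.csv", newline="", encoding="latin-1") as f:
                          reader = csv.reader(
                              f,
                              delimiter=";",      # or ",", or "\t"...
                              quotechar='"',
                              escapechar="\\",
                              doublequote=False,
                          )
                          header = next(reader)   # ...assuming this file even has a header row
                          rows = list(reader)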

                  • By ozim 2025-03-276:39

                    I am kind of amazed how people nag about having to parse practically a random file.

                    Having header or not should be specified up front and one should not parse some unknown file because that will always end up with failure.

                    If you have your own serialization and your own parsing working yeah this will simply work.

                    But then not pushing some errors back to the user, and trying to deal with everything, is going to be frustrating, because the number of edge cases is almost infinite.

                    Handling random data is hard, saying it is a CSV and trying to support everything that comes with it is hard.

              • By dylan604 2025-03-2621:592 reply

                Microsoft Windows has had to invest enormous amounts...

                Apple macOS has had to invest enormous amounts...

                Pick your distro of Linux has had to invest enormous amounts...

                None of them is perfect, and any number of valid complaints can be made about any of them. None of the complaints make any of the things useless. Everyone has workarounds.

                Hell, JSON has had to invest enormous amounts of effort...

                • By FridgeSeal 2025-03-2623:471 reply

                  I guess the point is that I can take a generic json parser and point it at just about any JSON I get my hands on, and have close to no issues parsing it.

                  Want to do the same with csv? Good luck. Delimiter? Configurable. Encoding? Configurable. Misplaced comma? No parse in JSON, in csv: might still parse, but is now semantically incorrect and you possibly won’t know until it’s too late, depending on your parser. The list goes on.

                • By nukem222 2025-03-271:45

                  [flagged]

            • By jstanley 2025-03-2621:371 reply

              You claimed that CSV is "easily the most widely supported data format in existence in terms of tools and language support", which is a claim that CSV is better supported than JSON, which is a claim that JSON support is lacking.

              • By shawabawa3 2025-03-2621:481 reply

                Can you import .jsonl files into Google sheets or excel natively?

                • By freehorse 2025-03-2622:063 reply

                  Importing csvs in excel can be a huge pain due to how excel handles localisation. It can basically alter your data if you are not mindful about that, and I have seen it happening too many times.

                  • By xnx 2025-03-2623:342 reply

                    Excel dropping leading zeros (as in ZIP codes) was a crazy design decision that has certainly cost many lifetimes of person-hours.

                    • By disgruntledphd2 2025-03-277:24

                      And forcing 16+ digits to be floats, destroying information.

                    • By freehorse 2025-03-2716:59

                      Yeah have had similar struggles with social security numbers.

                  • By codetrotter 2025-03-2623:011 reply

                    For example:

                    Scientists rename genes because Microsoft Excel reads them as dates (2020)

                    https://www.reddit.com/r/programming/comments/i57czq/scienti...

                    • By TRiG_Ireland 2025-03-2722:46

                      I was so glad of that story. It gave me something to point to to get my boss off my back.

                  • By phkahler 2025-03-2623:281 reply

                    But it handles it better than JSON.

                    • By freehorse 2025-03-2623:511 reply

                      Depends on what you mean by "better". I would rather software not handle a piece of data at all, than handle it erroneously, changing the data without me realising and thus causing all sorts of issues afterwards.

                      • By d0mine 2025-03-274:321 reply

                        In practice, web browsers accept the tag soup that is sometimes called HTML, and strict XML-based formats failed.

                        • By afiori 2025-03-278:44

                          The browser is not a database (unlike Excel). Modifying data before showing it is reversible; modifying it before storing it is not.

          • By niccl 2025-03-2623:281 reply

            Excel.

            Before you dismiss it as 'not a language', people have argued that it is. And you can definitely program stuff in it, and so that surely makes it a language

            • By squeaky-clean 2025-03-270:17

              Excel can import and parse JSON, it's under the "Get Data" header. It doesn't have a direct GUI way to export to JSON, but it takes just a few lines in Office Scripts. You can even use embedded TypeScript to call JSON.stringify.

        • By kentm 2025-03-2620:54

          I’ve found that the number of parsers that don’t handle multiline records is pretty high though.

        • By consteval 2025-03-2623:24

          It's widely, but inconsistently, supported. The behavior of importers varies a lot, which is generally not the case for JSON.

        • By recursive 2025-03-2719:46

          > Despite the lack of any sort of specification

          People keep saying this but RFC 4180 exists.

        • By autoexec 2025-03-270:18

          > it's easily the most widely supported data format in existence in terms of tools and language support.

          Even better, the majority of the time I write/read CSV these days I don't need to use a library or tools at all. It'd be overkill. CSV libraries are best saved for when you're dealing with random CSV files (especially from multiple sources) since the library will handle the minor differences/issues that can pop up in the wild.

      • By benwilber0 2025-03-2619:062 reply

        JSON is a textual encoding no different than CSV.

        It's just that people tend to use specialized tools for encoding and decoding it instead of like ",".join(row) and row.split(",")

        I have seen people try to build up JSON strings like that too, and then you have all the same problems.

        So there is no problem with CSV except that maybe it's too deceptively simple. We also see people trying to build things like URLs and query strings without using a proper library.

        • By int_19h 2025-03-2619:183 reply

          The problem with CSV is that there's no clear standard, so even if you do reach for a library to parse it, that doesn't ensure compatibility.

          • By gopher_space 2025-03-2620:37

            There is a clear standard and it's usually written on an old Word '97 doc in a local file server. Using CSV means that you are the compatibility layer, and this is useful if you need firm control or understanding of your data.

            If that sounds like a lot of edge-case work keep in mind that people have been doing this for more than half a century. Lots of examples and notes you can steal.

            • By pasc1878 2025-03-2711:211 reply

              And does Excel fully comply, and more importantly, does it tell you when the CSV file is wrong?

              • By recursive 2025-03-2719:47

                No. Excel's fault. Not CSV. There are plenty of busted CSV parsers (and serializers) too.

          • By rcbdev 2025-03-2619:422 reply

            Same for JSON though. What Python considers a valid JSON might not be that if you ask a Java library.

            • By hajile 2025-03-2620:053 reply

              JSON has clearly-defined standards: ISO/IEC 21778:2017, IETF RFC 7159, and ECMA-404. Additionally, Crockford has had a spec available on json.org since its creation in 2001.

              Do you have any examples of Python, Java, or any of the other Tiobe top 40 languages breaking the JSON spec in their standard library?

              In contrast, for the few of those that have CSV libraries, how many of those libraries will simply fail to parse a large number of the .csv variations out there?

              • By whizzter 2025-03-2620:161 reply

                Not to mention that stuff like Excel loves to export CSV files in "locale specific ways".

                Sometimes commas as the delimiter, sometimes semicolons; floating point values might have dots or commas to separate the fraction digits.

                Not to mention text encodings: ASCII, Western European character sets, or maybe UTF-8 or whatever...

                It's a bloody mess.

                • By ohgr 2025-03-2623:33

                  That’s why we just email the sheets around like it’s 1999 :)

              • By dwattttt 2025-03-2620:30

                You need more than a standard; that standard has to be complete and unambiguous. What you're looking for is https://github.com/nst/JSONTestSuite

                EDIT: The readme's results are from 2016, but there are more recent results (last updated 5 years ago). Of the 54 parsers/versions tested, 7 always gave the expected result per the spec (disregarding cases where the spec does not define a result).

              • By noitpmeder 2025-03-2620:581 reply

                It falls down under very large integers -- think large valid uint64_t values.

                • By hajile 2025-03-2623:151 reply

                  JSON doesn't fail for very large values because they are sent over the wire as strings. Only parsers may fail if they or their backing language doesn't account for BigInts or floats larger than f64, but these problems exist when parsing any string to a number.

                  • By girvo 2025-03-2623:29

                    And indeed applies to CSV as well: it's just strings at the end of the day, its up to the parser to make sense of it into the data types one wants. There is nothing inherently stopping you from parsing a JSON string into a uint64: I've done so plenty!

            • By zeroimpl 2025-03-2623:492 reply

              Example? I know there's some ambiguity over whether literals like false are valid JSON, but I can't think of anything else.

              • By tubthumper8 2025-03-271:41

                That _shouldn't_ be ambiguous, `false` is a valid JSON document according to specification, but not all parsers are compliant.

                There's some interesting examples of ambiguities here: https://seriot.ch/projects/parsing_json.html

              • By recursive 2025-03-2719:481 reply

                Trailing commas, comments, duplicate key names, for a few examples.

                • By int_19h 2025-03-2722:101 reply

                  Trailing commas and comments are plainly not standard JSON under any definition. There are standards that include them which extend JSON, sure, but I'm not aware of any JSON library that emits this kind of stuff by default.

                  • By recursive 2025-03-2722:16

                    I'm not aware of any CSV library that doesn't follow RFC4180 by default, and yet... this whole thread.

        • By gthompson512 2025-03-2620:492 reply

          > It's just that people tend to use specialized tools for encoding and decoding it instead of like ",".join(row) and row.split(",")

          You really super can't just split on commas for csv. You need to handle the string encodings since records can have commas occur in a string, and you need to handle quoting since you need to know when a string ends and that string may have internal quote characters. For either format unless you know your data super well you need to use a library.

      • By magicalhippo 2025-03-275:05

        > JSON [...] with one line per record

        Couple of standards that I know of that do this, primarily intended for logging:

        https://jsonlines.org/

        https://clef-json.org/

        Really easy to work with in my experience.

        Sure some space is usually wasted on keys but compression takes care of that.

      • By kec 2025-03-2713:10

        Until you have a large amount of data & need either random access or to work on multiple full columns at once. Duplicated keys names mean it's very easy for data in jsonlines format to be orders of magnitude larger than the same data as CSV, which is incredibly annoying if your processing for it isn't amenable to streaming.

      • By klysm 2025-03-2621:441 reply

        You serialize the keys on every row which is a bit inefficient but it’s a text format anyway

        • By zeroimpl 2025-03-270:46

          Space-wise, as long as you compress it, it's not going to make any difference. I suspect a JSON parser is a bit slower than a CSV parser, but the slight extra CPU usage is probably worth the benefits that come with JSON.

      • By nmz 2025-03-274:22

        This is simply not true; parsing JSON vs CSV is a difference of thousands of lines.

      • By packetlost 2025-03-2619:131 reply

        Eh, it really isn't. The format does not lend itself to tabular data, instead the most natural way of representing data involves duplicating the keys N times for each record.

        • By koolba 2025-03-2619:163 reply

          You can easily represent it as an array:

              ["foo","bar",123]
          
          That’s as tabular as CSV but you now have optional types. You can even have lists of lists. Lists of objects. Lists of lists of objects…

          • By simonw 2025-03-2619:433 reply

            Right - the JSON-newline equivalent of CSV can look like this:

                ["id", "species", "nickname"]
                [1, "Chicken", "Chunky cheesecakes"]
                [2, "Dog", "Wagging wonders"]
                [3, "Bunny", "Hopping heroes"]
                [4, "Bat", "Soaring shadows"]

            • By nomel 2025-03-2620:251 reply

              Remove the [] characters and you've invented CSV with excel style quoting.

              • By simonw 2025-03-2623:231 reply

                Almost, except the way Excel-style quoting works with newlines sucks - you end up with rows that span multiple lines, so you can't split on newline to get individual rows.

                With JSON those new lines are \n characters which are much easier to work with.

                • By magicalhippo 2025-03-274:55

                  I ended up parsing the XML format instead of the CSV format when handling paste from Excel due to the newlines issue.

                  CSV seemed so simple but after numerous issues, a cell with both newline and " made me realize I should keep the little hair I had left and put in the work to parse the XML.

                  It's not great either, with all its weird tags, but at least it's possible to parse reliably.

            • By collinmanderson 2025-03-2714:30

              This is the way. jsonl where each row is a json list. It has well-defined standard quoting.

              Just like csv you don't actually need the header row either, as long as there's convention about field ordering. Similar to proto bufs, where the field names are not included in the file itself.

            • By freehorse 2025-03-2620:531 reply

              This misses the point of standardization imo because it’s not possible to know a priori that the first line represents the variable names, that all the rows are supposed to have the same number of elements and in general that this is supposed to represent a table. An arbitrary parser or person wouldn’t know to guess since it's not standard or expected. Of course it would be parsed fine but the default result would be a kind of structure or multi-array rather than tabular.

              • By afiori 2025-03-279:10

                application/jsonl+table

          • By lgas 2025-03-2620:412 reply

            Typing isn't optional in JSON, every value has a concrete type, always.

            • By lelandbatey 2025-03-2719:30

              Types at the type layer are not the same as types at the semantic layer. Sure, every value at the JSON level has a "strong type", but the semantic meaning of the contents of e.g. a string is usually not expressible in pure JSON. So it is with CSV; you can think of every cell in CSV as containing a string (series of bytes) with it being up to you to enforce the semantics atop those bytes. JSON gives you a couple extra types, and if you can fit things into those types well, then that's great, but for most concrete, semantically meaningful data you won't be able to do that and you'll end up in a similar world to CSVs.

            • By gghffguhvc 2025-03-2621:031 reply

                [
                 ["header1","header2"],
                 ["1.1", ""],
                 [7.4, "2022-01-04"]
                ]

              • By Sohcahtoa82 2025-03-2715:59

                ...and?

                I see an array of arrays. The first and second arrays have two strings each, the last one has a float and a string. All those types are concrete.

                Let's say those "1.1" and 7.4 values are supposed to be version strings. If your code is only sometimes putting quotes around the version string, the bug is in your code. You're outputting a float sometimes, but a string in others. Fix your shit. It's not your serialization format that's the problem.

                If you have "7.4" as a string, and your serialization library is saying "Huh, that looks like a float, I'm going to make it a float", then get a new library, because it has a bug.

          • By packetlost 2025-03-2619:414 reply

            You're missing my point: basically nothing spits out data in that format because it's not ergonomic to do so. JSON is designed to represent object hierarchies, not tabular data.

            • By hajile 2025-03-2620:184 reply

              CSV is lists of lists of fixed length.

              JSON is lists of lists of any length and groups of key/value pairs (basically lisp S-expressions with lots of unnecessary syntax). This makes it a superset of CSV's capabilities.

              JSON fundamentally IS made to represent tabular data, but it's made to represent key-value groups too.

              Why make it able to represent tabular data if that's not an intended use?

              • By packetlost 2025-03-2623:19

                > JSON is lists of lists of any length and groups of key/value pairs

                The "top-level" structure of JSON is usually an object, but it can be a list.

                > JSON fundamentally IS made to represent tabular data

                No, it's really not. It's made to represent objects consisting of a few primitive types and exactly two aggregate types: lists and objects. It's a textual representation of the JavaScript data model and even has "Object" in the name.

                > Why make it able to represent tabular data if that's not an intended use?

                It's mostly a question of specialization and ergonomics, which was my original point. You can represent tabular data using JSON (as you can in JavaScript), but it was not made for it. Anything that can represent """data""" and at least 2 nesting levels of arbitrary-length sequences can represent tabular data, which is basically every data format ever regardless of how awkward actually working with it may be.

              • By freehorse 2025-03-2621:122 reply

                The fact that json can represent a superset of tabular data structures that csv is specifically designed to represent can be rephrased into that csv is more specialised than json in representing tabular data. The fact that json can also represent tabular data does not mean it is a better or more efficient way to represent that data instead of a format like csv.

                In the same way, there are hierarchically structured datasets that can be represented by both JSON in hierarchical form and CSV in tabular form by repeating certain variables, but if using CSV would require repeating them too many times, it would be a bad idea to choose that instead of JSON. The fact that you can do something does not always make it a good idea to do it. The question imo is about which way would be more natural, easy or efficient.

                • By Kinrany 2025-03-2623:33

                  > The fact that json can represent a superset of tabular data structures that csv is specifically designed to represent can be rephrased into that csv is more specialised than json in representing tabular data. The fact that json can also represent tabular data does not mean it is a better or more efficient way to represent that data instead of a format like csv.

                  The reverse is true as well: being more specialized is a description of goals, not advantages.

                • By hajile 2025-03-2623:29

                  It's hardly a bad idea to do a list of lists in JSON...

                  The big advantage of JSON is that it's standardized and you can reuse the JSON infrastructure for more than just tabular data.

              • By meepmorp 2025-03-2620:533 reply

                > CSV is lists of lists of fixed length.

                I'd definitely put that in my list of falsehoods programmers believe about CSV files.

                • By hajile 2025-03-2623:131 reply

                  It seems to be indicated by RFC 4180, which says

                  > This header will contain names corresponding to the fields in the file and should contain the same number of fields as the records in the rest of the file

                  But of course, CSV is the wild west and there's no guarantee that any two encoders will do the same thing (sometimes, there's not even a guarantee that the same encoder will do the same thing with two different inputs).

                  [0] https://www.ietf.org/rfc/rfc4180.txt

                  • By HelloNurse 2025-03-279:041 reply

                    You should know that "should" isn't very binding.

                    Headers should have as many rows as possible that contain data items for their column and data items in a row should have a header for the respective columns, but real CSV files should be assumed to have incomplete or variable length lines.

                    • By hajile 2025-03-2714:08

                      NOTHING is very binding about the CSV spec and that's the biggest problem with CSV.

                • By timacles 2025-03-2622:481 reply

                  CSV is a text file that might have commas in it

                • By miningape 2025-03-270:55

                  yeah if I had a cookie for every time I've had to deal with this I'd have maybe 10 cookies - it's not a lot but it's more than it should be.

            • By kazinator 2025-03-2623:481 reply

              A format consisting of newline-terminated records, each containing comma-separated JSON strings would be superior to CSV.

              It could use backslash escapes to denote control characters and Unicode points.

              Everyone would agree exactly on what the format is, in contrast to the zoo of CSV variants.

              It wouldn't have pitfalls in it, like spaces that defeat quotes:

                RFC CSV         JSON strings
                a,"b c"         "a", "b c"
                a, "b c"        "a", " \" b c\""
              
              oops; add an innocuous-looking space, and the quotes are now literal.

            • By koolba 2025-03-2620:16

              JSON was designed to represent any data. There are plenty of systems that spit out data in exactly that format because it's the natural way to represent tabular data using JSON serialization. And clearly if you're the one building the system you can choose to use it.

            • By kccqzy 2025-03-2621:18

              JSON is designed to represent JavaScript objects with literal notation. Guess what, an array of strings or an array of numbers or even an array of mixed strings and numbers is a commonly encountered format in JavaScript.

      • By juliansimioni 2025-03-2619:034 reply

        What happens when you need to encode the newline character in your data? That makes splitting _either_ CSV or LDJSON files difficult.

        • By koolba 2025-03-2619:12

          The new line character in a JSON string would always be \n. The new line in the record itself as whitespace would not be acceptable as that breaks the one line record contract.

          Remember that this does not allow arbitrary representation of serialized JSON data. But it allows for any and all JSON data as you can always roundtrip valid JSON to a compact one line representation without extra whitespace.
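
          (In Python terms, that roundtrip looks roughly like this; the record is made up:)

              import json

              record = {"id": 1, "note": "line one\nline two"}
              line = json.dumps(record, separators=(",", ":"))
              # line == '{"id":1,"note":"line one\\nline two"}'  (one physical line;
              # the embedded newline is serialized as the two characters \ and n)
              assert json.loads(line) == record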

        • By afiori 2025-03-279:091 reply

          Actually, even whitespace-separated JSON would be a valid format, and if you forbid JSON documents from being a single integer or float, then even just concatenating JSON gives a valid format, as JSON is a prefix-free language.

          That is[0], if a string s of length n is valid JSON, then there is no prefix s[0..i] for i < n that is valid JSON.

          So you could just consume as many bytes you need to produce a json and then start a new one when that one is complete. To handle malformed data you just need to throw out the partial data on syntax error and start from the following byte (and likely throw away data a few more times if the error was in the middle of a document)

          That is, [][]""[][]""[] is unambiguous to parse[1]

          [0] again assuming that we restrict ourselves to string, null, boolean, array and objects at the root

          [1] still this is not a good format as a single missing " can destroy the entire document.
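
          (A sketch of consuming such a concatenated stream with Python's standard-library incremental decoder; raw_decode reports how many characters it consumed, and the except branch is the skip-a-byte resync described above:)

              import json

              def parse_concatenated(text):
                  decoder, i, out = json.JSONDecoder(), 0, []
                  while i < len(text):
                      try:
                          value, end = decoder.raw_decode(text, i)
                          out.append(value)
                          i = end
                      except json.JSONDecodeError:
                          i += 1          # malformed or whitespace byte: skip it and retry
                  return out

              print(parse_concatenated('[][]""[][]""[]'))   # [[], [], '', [], [], '', []]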

          • By boogheta 2025-03-279:571 reply

            « a single missing " can destroy the entire document » This is basically true for any data format, so really the worst argument ever...

            • By afiori 2025-03-2715:22

              In JSONL a modified chunk will lose you at most the removed lines and the two adjacent ones (unless the noise is randomly valid JSON); in particular, a single byte edit can destroy at most 2 lines.

              UTF-8 is similarly self-correcting, and so are HTML and many media formats.

              My point was that in my made-up concatenated json format

              []"""[][][][][][][][][][][]"""[]

              and

              []""[][][][][][][][][][][]""[]

              are both valid and differ by only 2 bytes, but have entirely different structures.

              Also it is a made-up format nobody uses (if somebody were to want this they would likely disallow strings at the root level).

        • By mananaysiempre 2025-03-2619:11

          When you need to encode the newline character in your data, you say \n in the JSON. Unlike (the RFC dialect of) CSV, JSON has an escape sequence denoting a newline and in fact requires its use. The only reason to introduce newlines into JSON data is prettyprinting.

        • By nmz 2025-03-274:13

          It's tricky, but simple enough: the RFC states that " must be used for quoting, and inserting a " is done with "". This makes knowing where a record ends difficult, since you must keep a variable that accumulates the entire string.

          How do you do this simply? You read each line, and if there's an odd number of ", you have an incomplete record, and you keep appending lines until the total number of " is even. After having the full record, parsing the fields correctly is harder, but you can do it with regex or PEGs or a disgusting state machine.
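
          (A sketch of that quote-counting loop in Python; the file name is illustrative, and doubled quotes inside fields count twice, so the parity check still works:)

              def records(lines):
                  buf = ""
                  for line in lines:
                      buf += line
                      if buf.count('"') % 2 == 0:   # quotes balanced: record is complete
                          yield buf
                          buf = ""
                  if buf:
                      yield buf                     # trailing unterminated record, if any

              with open("data.csv", newline="") as f:
                  for rec in records(f):
                      ...  # now split rec into fields (regex, PEG, or a state machine)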

    • By andrepd 2025-03-2617:488 reply

      That would be solved by using the ASCII control chars Record Separator / Unit Separator! I don't get how this is not widely used as standard.

      • By sundarurfriend 2025-03-2618:423 reply

        I remembered seeing a comment like this before, and...

        comment: https://news.ycombinator.com/item?id=26305052

        comment: https://news.ycombinator.com/item?id=39679662

        "ASCII Delimited Text – Not CSV or Tab Delimited Text" post [2014]: https://news.ycombinator.com/item?id=7474600

        same post [2024]: https://news.ycombinator.com/item?id=42100499

        comment: https://news.ycombinator.com/item?id=15440801

        (...and many more.) "This comes up every single time someone mentions CSV. Without fail." - top reply from burntsushi in that last link, and it remains as true today as in 2017 :D

        You're not wrong though, we just need some major text editor to get the ball rolling and start making some attempts to understand these characters, and the rest will follow suit. We're kinda stuck at a local optimum which is clearly not ideal but also not troublesome enough to easily drum up wide support for ADSV (ASCII Delimiter Separated Values).

        • By scythe 2025-03-2621:591 reply

          >we just need some major text editor to get the ball rolling and start making some attempts to understand these characters

          Many text editors offer extension APIs, including Vim, Emacs and Notepad++. But the ideal behavior would be to auto-align unit separators (so fields line up in columns) and treat record separators as a special kind of newline. That would allow the file to actually look like a table within the text editor. Input a unit separator as shift+space and a record separator as shift+enter.

          • By hermitcrab 2025-03-2622:15

            I think it would be enough for:

            1. the field separator to be shown as a special character

            2. the row separator to (optionally) be interpreted as a linefeed

            IIRC 1) is true for Notepad++, but not 2).

        • By nukem222 2025-03-271:19

          Excel needs to default its export to this. Unfortunately excel is proprietary software and therefore fucked.

        • By ryandrake 2025-03-2619:06

          Hahah, I came here to make the comment about ASCII's control characters, so I'm glad someone else beat me to it, and also that someone further pointed out that this topic comes up every time someone mentions CSV!

      • By mjevans 2025-03-2617:524 reply

        The _entire_ point of a CSV file is that it's fully human readable and writable.

        The characters you mention could be used in a custom delimiter variant of the format, but at that point it's back to a binary machine format.

        • By mbreese 2025-03-2618:13

          And that’s why I tend to use tab delimited files more… when viewed with invisible characters shown, it’s pretty clear to read/write separate fields and have an easier to parse format.

          This, of course, assumes that your input doesn’t include tabs or newlines… because then you’re still stuck with the same problem, just with a different delimiter.

        • By mikepurvis 2025-03-2618:042 reply

          As soon as you give those characters magic meanings then suddenly people will have reason to want to use them— it'll be a CSV containing localization strings for tooltips that contain that character and bam, we'll be back to escaping.

          Except the usages of that character will be rare and so potentially way more scary. At least with quotes and commas, the breakages are everywhere so you confront them sooner rather than later.

          • By immibis 2025-03-2710:491 reply

            Graphical representations of the control characters begin at U+2400 in the "Control Pictures" Unicode block. Instead of the actual U+001E Record Separator, you put the U+241E Symbol for Record Separator in the help text.

            • By mikepurvis 2025-03-2713:151 reply

              .... with a note underneath urging readers not to copy and paste the character because it's only the graphical representation of it, not the thing itself.

              Perhaps a more salient example might be CSV nested in CSV. This happens all the time with XML (hello junit) and even JSON— when you plug a USB drive into my LG TV, it creates a metadata file on it that contains {"INFO":"{ \"thing\": true, <etc> }"}

              • By immibis 2025-03-2810:20

                You wouldn't use this format for that. It's not a universal format, but an application-specific one. It works in some applications, not in others.

          • By BeFlatXIII 2025-03-2623:55

            Best to add deliberate breakage to the spec, then.

        • By kevmo314 2025-03-2617:532 reply

          What do you mean? I just push the Record Separator key on my keyboard.

          /s in case :)

          • By corysama 2025-03-2619:274 reply

            The entire argument against ASCII Delimited Text boils down to "No one bothered to support it in popular editors back in 1984. Because I grew up without it, it is impossible to imagine supporting it today."

            You need 4 new keyboard shortcuts: use ctrl+, ctrl+. ctrl+[ and ctrl+]. You need 4 new character symbols. You need a few new formatting rules, pretty much page breaks decorated with the new symbols. It's really not that hard.

            But, like many problems in tech, the popular advice is "Everyone recognizes the problem and the solution. But, the problematic way is already widely used and the solution is not. Therefore everyone doing anything new should invest in continuing to support the problem forever."

            • By LegionMammal978 2025-03-2620:311 reply

              > The entire argument against ASCII Delimited Text boils down to "No one bothered to support it in popular editors back in 1984. Because I grew up without it, it is impossible to imagine supporting it today."

              There's also the argument of "Now you have two byte values that cannot be allowed to appear in a record under any circumstances. (E.g., incoming data from uncontrolled sources MUST be sanitized to reject or replace those bytes.)" Unless you add an escaping mechanism, in which case the argument shifts to "Why switch from CSV/TSV if the alternative still needs an escaping mechanism?"
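
              For illustration, a minimal Python sketch of such a sanitization step for the two in-band delimiter bytes; deciding between rejecting and replacing is exactly the policy question raised above:

                RS, US = "\x1e", "\x1f"   # ASCII record and unit separators

                def sanitize(field: str, strict: bool = True) -> str:
                    """Prepare one field for an 'ASCII delimited' file."""
                    if strict:
                        if RS in field or US in field:
                            raise ValueError("field contains a reserved separator byte")
                        return field
                    # lenient policy: silently replace the reserved bytes
                    return field.replace(RS, " ").replace(US, " ")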

              • By zzo38computer 2025-03-2621:312 reply

                One benefit of binary formats is not needing the escaping.

                • By LegionMammal978 2025-03-270:482 reply

                  Length-delimited binary formats do not need escaping. But the usual "ASCII Delimited Text" proposal just uses two unprintable bytes as record and line separators, and the signalling is all in-band.

                  This means that records must not contain either of those two bytes, or else the format of the table will be corrupted. And unless you're producing the data yourself, this means you have to sanitize the data before adding it, and have a policy for how to respond to invalid data. But maintaining a proper sanitization layer has historically been finicky: just look at all the XSS vulnerabilities out there.

                  If you're creating a binary format, you can easily design it to hold arbitrary data without escaping. But just taking a text format and swapping out the delimiters does not achieve this goal.

                  • By zzo38computer 2025-03-2715:51

                    I did mean length-delimited binary formats (rather than ASCII formats).

                  • By immibis 2025-03-2710:53

                    At least you don't need these values in your data, unlike the comma, which shows up in human-written text.

                    If you do need these values in your data, then don't use them as delimiters.

                    Something the industry has stopped doing, but maybe should do again, is restricting characters that can appear in data. "The first name must not contain a record separator" is a quite reasonable restriction. Even Elon Musk's next kid won't be able to violate that restriction.

                • By oever 2025-03-2710:39

                  Hear hear! Why is all editing done with text-based editors where humans can make syntax errors? Is it about job security?

            • By WorldMaker 2025-03-2620:111 reply

              In Windows (and DOS EDIT.COM and a few other similarly ancient tools) there have existed Alt+028, Alt+029, Alt+030, and Alt+031 for a long time. I vaguely recall some file format I was working with in QBASIC used some or all of them and I was editing those files for some reason. That was not quite as far back as 1984, but sometime in the early 1990s for sure. I believe EDIT.COM had basic glyphs for them too, but I don't recall what they were, might have been random Wingdings like the playing card suits.

              Having keyboard shortcuts doesn't necessarily solve why people don't want to use that format, either.

              • By zzo38computer 2025-03-2621:30

                > I believe EDIT.COM had basic glyphs for them too, but I don't recall what they were, might have been random Wingdings like the playing card suits.

                That is not specific to EDIT.COM; they are the PC characters with the same codes as the corresponding control characters, so they appear as graphic characters. (They can be used in any program that can use PC character set.)

                However, in EDIT.COM and QBASIC you can also prefix a control character with CTRL+P in order to enter it directly into the file (and they appear as graphic characters, since I think the only control characters they will handle as control characters are tabs and line breaks).

                Suits are PC characters 3 to 6; these are PC characters 28 to 31 which are other shapes.

            • By zzo38computer 2025-03-2619:33

              The keys would be something other than those, though. They would be: CTRL+\ for file separator, CTRL+] for group separator, CTRL+^ for record separator, CTRL+_ for unit separator. Other than that, it would work like you described, I think.

              > But, like many problems in tech, the popular advice is "Everyone recognizes the problem and the solution. But, the problematic way is already widely used and the solution is not..."

              This is unfortunately common. However, what else happens too, is disagreement about what is the problem and the solution.

            • By wat10000 2025-03-2622:01

              It's more like, "because the industry grew up without it, other approaches gained critical mass."

              Path dependence is a thing. Things that experience network effects don't get changed unless the alternative is far superior, and ASCII Delimited Text is not that superior.

              Ignoring that and pushing for it anyway will at most achieve an xkcd 927.

          • By mbreese 2025-03-2618:07

            I’m pretty sure those used to exist.

            But when looking for a picture to back up my (likely flawed) memory, Google helpfully told me that you can get a record separator character by hitting Ctrl-^ (caret). Who knew?

        • By EGreg 2025-03-2618:061 reply

          Can't there be some magic sequence (like two unescaped newlines) to start a new record?

          • By macintux 2025-03-2620:55

            You’ll still find that sequence in data; it’ll just be rare enough that it won’t rear its ugly head until your solution has been in production for a while.

      • By jandrese 2025-03-2618:241 reply

        If there were visible, well-known characters that could be printed for those, and keys on a keyboard for inputting them, we would probably have RSV files. Because they are buried down in the nonprintable section of the ASCII chart, they are a pain for people to deal with. All it would have taken is one more key on the keyboard, maybe splitting the tab key in half.

        • By thesuitonym 2025-03-2619:151 reply

          > If there were visible well known characters that could be printed...

          ...There would be datasets that include those characters, and so they wouldn't be as useful for record separators. Look into your heart and know it to be true.

          • By jandrese 2025-03-2621:22

            I wouldn't feel too bad about blindly scrubbing those characters out of inputs unlike commas, tabs, and quotes.

      • By andrewflnr 2025-03-2617:54

        Can't type them on a keyboard I guess, or generally work with them in the usual text-oriented tools? Part of the appeal of CSV is you can just open it up in Notepad or something if you need to. Maybe that's more a critique of text tools than it is of ASCII record separator characters.

      • By marxisttemp 2025-03-2623:38

        I was really excited when I learned of these characters, but ultimately if it’s in ASCII then it’s in-band and will eventually require escaping leading to the same problem.

      • By orthoxerox 2025-03-2620:311 reply

        But what if one of your columns contains arbitrary binary data?

        • By ndsipa_pomu 2025-03-2620:46

          You'd likely need to uuencode it or similar as CSV isn't designed for binary data.

      • By zoover2020 2025-03-2617:53

        Perhaps the popularity, or lack thereof? More often than not, the bad standard wins the long-term market.

      • By masklinn 2025-03-2621:03

        > I don't get how this is not widely used as standard.

        It requires bespoke tools for editing, and while CSV is absolute garbage, it can be ingested and produced by most spreadsheet software, as well as databases.

    • By taeric 2025-03-2618:192 reply

      Reminds me of a fatal flaw of YAML: truncating a YAML file doesn't necessarily make it invalid, which can lead to some rather non-obvious failures.

      • By nextts 2025-03-2621:152 reply

        What is the failure mode where a yaml file gets truncated? They are normally config files in Git. Or uploaded to S3 or Kubernetes etc.

        CSV has the same failure mode. As does HTML. (But not XML)

        • By taeric 2025-03-2713:56

          I couldn't find the story on it, but there was an instance of a config for some major service getting truncated, and since it was YAML it was more difficult to figure out that that was what had happened. I think it was at AWS, but I can't find the write-up, so I can't really remember.

          And fully fair that you can have similar issues in other formats. I think the complaint here was that it was a bit harder, specifically because it did not trip up any of the loading code. With a big lesson learned that configs should probably either go pascal string style, where they have an expected number of items as the first part of the data, or xml style, where they have a closing tag.

          Really, it is always amusing to find how many of the annoying parts of XML turned out to be somewhat more well thought out than people want to admit.

        • By bobmcnamara 2025-03-270:34

          Bad merges.

      • By pasc1878 2025-03-2711:281 reply

        Same is true of CSV/TSV.

        • By taeric 2025-03-2713:521 reply

          I think you are a bit more likely to notice in a CSV/TSV, as it is unlikely to truncate at a newline?

          Still, fair point. And is part of why I said it is a flaw, not the flaw. Plenty of other reasons to not like YAML, to me. :D

          • By pasc1878 2025-03-2714:481 reply

            Not if it is split at a line e.g. if the source or target can only deal with a fixed number of lines.

            • By taeric 2025-03-2715:291 reply

              Right, that is what I meant about that being unlikely? Most instances of truncated files that I have seen were because of size, not lines.

              Still, a fair point.

              • By ziml77 2025-03-2717:32

                Really depends on how the CSV is generated/transferred. If the output of the faulting software is line-buffered then it's quite likely that a failure would terminate the file at a line break.

    • By fragmede 2025-03-2618:093 reply

      I want to push SQLite as a data interchange format! It has the benefit of being well defined, and it can store binary data, like images for product pictures, inside the database. Not a good idea if you're trying to serve users behind a web app, but as interchange it's better than a zip file with filenames that have to be "relinked".

      • By Someone1234 2025-03-2619:024 reply

        For context: I have a LOT of experience of interchange formats, like "full time job, every day, all day, hundreds of formats, for 20-years" experience.

        Based on that experience I have come to one key, but maybe, counter-intuitive truth about interchange formats:

        - Too much freedom is bad.

        Why? Generating interchange data is cheaper than consuming it, because the creator only needs to consider the stuff they want to include, whereas the consumer needs to consider every single possible edge case and/or scenario the format itself can support.

        This is why XML is WAY more costly to ingest than CSV, because in XML someone is going to use: attributes, CDATA, namespaces, comments, different declarations, includes, et al. In CSV they're going to use rows, a field separator, and quotes (with or without escaping). That's it. That's all it supports.

        SQLite as an interchange format is a HORRIFYING suggestion, because every single feature SQLite supports may need to be supported by consumers. Even if you curtailed SQLite's vast feature set, you've still created something vastly more expensive to consume than XML, which is itself obnoxious.

        My favorite interchange formats are, in order:

        - CSV, JSON (inc. NDJSON), YAML, XML, BSON (due to type system), MessagePack, Protobuf, [Giant Gap] SQLite, Excel (xlsx, et al)

        More features mean more cost, more edge cases, more failures, more complex consumers. Keep in mind, this is ONLY about interchange formats between two parties, I have wildly different opinions about what I would use for my own application where I am only ever the creator/consumer, I actually love Sqlite for THAT.

        • By nomel 2025-03-2620:333 reply

          Oh, this is interesting. Are you tying different systems together? If so, do you use some preferred intermediate format? Do you have a giant library of * -> intermediate -> * converters that you sprinkle between everything? Or maybe the intermediate format is in memory?

          What about Parquet and the like?

          • By gopher_space 2025-03-2621:04

            Not the person you were replying to, but from my experience CSV is a good place to define data types coming into a system and a safe way to dump data as long as I write everything down.

            So I might do things like have every step in a pipeline begin development by reading from and writing to CSV. This helps with parallel dev work and debugging, and is easy to load into any intermediate format.

            > do you use some preferred intermediate format?

            This is usually dictated by speed vs money calculations, weird context issues, and familiarity. I think it's useful to look at both "why isn't this a file" and "why isn't this all in memory" perspectives.

          • By theLiminator 2025-03-2621:29

            For tabular/time-series greater than 100k rows I personally feel like parquet cannot be beat. It's self-describing, strongly-typed, relatively compact, supports a bunch of io/decode skipping, and is quite fast.

            Orc also looks good, but isn't well supported. I think parquet is optimal for now for most analytical use-cases that don't require human readability.
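
            As a rough illustration of that kind of column pruning (assuming the pyarrow package is installed; the file name and columns are made up):

              import pyarrow as pa
              import pyarrow.parquet as pq

              # Write a small, strongly typed table.
              table = pa.table({"ts": [1, 2, 3], "price": [9.5, 9.7, 9.4], "note": ["a", "b", "c"]})
              pq.write_table(table, "ticks.parquet")

              # Read back only the columns we care about; the other column chunks are skipped.
              subset = pq.read_table("ticks.parquet", columns=["ts", "price"])
              print(subset.schema)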

          • By Someone1234 2025-03-2621:59

            It is an interchange format, so it is inter-system by virtue of that. If I am a self-creator/consumer the format I use can be literally anything even binary memory dumps.

        • By sadcodemonkey 2025-03-2620:40

          I love the wisdom in this comment!

        • By ttyprintk 2025-03-284:29

          I’m not sure you need to support every SQLite feature. I’m unconvinced of binary formats, but the .dump output is text and simple SQL.

        • By fragmede 2025-03-2619:28

          Interesting! I've dealt with file interchange between closed-source (and a couple of open-source) programs, but that was a while ago. I've also had to deal with CSVs and XLSX files between SaaS vendors for import/export of customers' data. I've done a bunch of reverse engineering of proprietary formats so we could import the vendor's files, which had more information than they were willing to export in an interchange format. Sometimes they're encrypted and you have to break it.

          What you say is fair. CSV is underspecified though: there's no company called CSV that's gonna sue for trademark enforcement, and there's no official CSV standard library that everyone uses. (Some exist, but there are also many naive implementations written from first principles, because how hard could it be? Just output records and use a comma and a newline (of which there are three possible variants).)

          How often do you deal with multiple CSV files representing the multiple tables that vendors actually use internally, vs one giant flattened CSV with hundreds of columns and lots of empty cells? I don't have your level of experience with CSVs, but I've dealt with them being a mess, where the other side implements whatever they think is reasonable given the name "comma separated values".

          With SQLite, we're in the Internet age, and so I presume this hypothetical developer would use the SQLite library and not implement their own library from scratch for funsies. This then leads to types, database normalization, multiple tables. I hear you that too many choices can be bad, and XML is a great example of this, but SQLite isn't XML and isn't CSV.

          It's hard to have this discussion in the abstract, so I'll be forthcoming about where I'm coming from, which is CSV import/export between vendors for stores, think Doordash to UberEATS. The biggest problem we have is images of the items, and how to deal with those. It's an ongoing issue how to get them, but the failure mode, which does happen, is that when moving vendor, they just have to redo a lot of work that they shouldn't have to.

          Ultimately the North Star I want to push towards is moving beyond CSVs, because it would let the people who currently have to hand-edit the CSV so that every row imports properly stop doing that. Problems would still exist, but they'd instead be the kind you see with XML files, which has its shortcomings, as you mention, but at least once you understand how a vendor is using it, individual records are generally understandable.

          I've since been moved, so I don't deal with import/export currently, but it's because SQLite is so nice to work with on personal projects, where it's appropriate, that I want to push the notion of moving to SQLite over CSVs.

      • By 0cf8612b2e1e 2025-03-2618:16

        One very minor problem is that you max out storing blobs of 2GB(? I think, maybe 4GB). Granted few will hit this, but this limit did kill one previous data transfer idea of mine.

      • By ThatPlayer 2025-03-278:38

        > not a good idea if you're trying to serve users behind a web app

        I use SQLite for a static site! Generating that data out to individual static pages would involve millions of individual files. So instead I serve up an SQLite database over HTTP, and use an SQLite WASM driver [0] to load (database) pages as needed. Good indexing cuts down on the number of pages it grabs, and I can even get full text search!

        The only feature I'm missing is compression, which is complicated because popular extensions like sqlite-zstd are written in Rust.

        [0] https://github.com/mmomtchev/sqlite-wasm-http

    • By dietr1ch 2025-03-2622:274 reply

      I don't understand why CSV became a thing when TSV existed, or when a format could have used the nowadays-weird ASCII control characters like start/end of text, start of heading, horizontal/vertical tab, or the file/group/record/unit separators.

      It seems many possible designs would've avoided the quoting chaos and made parsing sort of trivial.

      • By niccl 2025-03-2623:346 reply

        Any time you have a character with a special meaning you have to handle that character turning up in the data you're encoding. It's inevitable. No matter what obscure character you choose, you'll have to deal with it

        • By oever 2025-03-2710:371 reply

          It's evitable by stating the number of bytes in a field and then the field. No escaping needed and faster parsing.
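
          A minimal Python sketch of that idea (a made-up length-prefixed layout, not an existing standard): each record stores a field count, then each field as a 4-byte big-endian length followed by the raw bytes, so no byte value ever needs escaping:

            import struct

            def write_record(out, fields):
                """Write one record as: field count, then (length, bytes) per field."""
                out.write(struct.pack(">I", len(fields)))
                for field in fields:
                    data = field.encode("utf-8")
                    out.write(struct.pack(">I", len(data)))
                    out.write(data)

            def read_record(inp):
                """Read one record written by write_record; returns None at EOF."""
                header = inp.read(4)
                if not header:
                    return None
                (count,) = struct.unpack(">I", header)
                fields = []
                for _ in range(count):
                    (length,) = struct.unpack(">I", inp.read(4))
                    fields.append(inp.read(length).decode("utf-8"))
                return fields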

          • By pasc1878 2025-03-2711:261 reply

            But not human editable/readable

            • By kevincox 2025-03-2712:321 reply

              I understand this argument in general. But basically everyone has some sort of spreadsheet application installed that can read CSV.

              In some alternate world where this "binary" format caught on, it would be a very minor issue that it isn't human readable, because everyone would have a tool that is better at reading it than humans are. (See the above-mentioned non-local property of quotes, where you may think you are reading rows but are actually inside a single cell.)

              It also makes me wonder whether, if something like CBOR had caught on early enough, we would just be used to using something like `jq` to read it.

        • By wvenable 2025-03-272:072 reply

          Except we have all these low ASCII characters specifically for this purpose that don't turn up in the data at all. But there is, of course, also an escape character specifically for escaping them if necessary.

          • By kapep 2025-03-279:30

            Even if you find a character that really is never in the data - your encoded data will contain it. And it's inevitable that someone encodes the encoded data again. Like putting CSV in a CSV value.

          • By Brian_K_White 2025-03-272:36

            You can't type any of those on a typewriter, or see them in old or simple editors, or with no editor at all, like when just catting a file to a tty.

            If you say those are contrived examples that don't matter any more, then you have missed the point, will probably never acknowledge the point, and there is no purpose in continuing to try to communicate.

            One can only ever remember and type out just so many examples, and one can always contrive some response to any single or finite number of examples, but they are actually infinite, open-ended.

            Having a least common denominator that is extremely low that works in all the infinite situations you never even thought of, vs just pretty low and pretty easy to meet in most common situations, is all the difference in the world.

        • By noosphr 2025-03-271:45

          The difference is that the comma and newline characters are much more common in text than 0x1F and 0x1E, which, if you restrict your data to alphanumeric characters (which you really should), will never appear anywhere else.

        • By mjw_byrne 2025-03-2711:491 reply

          Exactly. "Use a delimiter that's not in the data" is not real serialisation, it's fingers-crossed-hope-for-the-best stuff.

          I have in the past done data extractions from systems which really can't serialise properly, where the only option is to concat all the fields with some "unlikely" string like @#~!$ as a separator, then pick it apart later. Ugh.

          • By dietr1ch 2025-03-2717:171 reply

            > Exactly. "Use a delimiter that's not in the data" is not real serialisation, it's fingers-crossed-hope-for-the-best stuff.

            It's not doing just this, you pick something that's likely not in the data, and then escape things properly. When writing strings you can write a double quote within double quotes with \", and if you mean to type the designated escape character you just write it twice, \\.

            The only reason you go for something likely not in the data is to keep things short and readable, but it's not impossible to deal with.

            • By mjw_byrne 2025-03-2811:57

              I agree that it's best to pick "unlikely" delimiters so that you don't have to pepper your data with escape chars.

              But some people (plenty in this thread) really do think "pick a delimiter that won't be in the data" - and then forget quoting and/or escaping - is a viable solution.

        • By dietr1ch 2025-03-271:46

          The characters would likely be unique, maybe even by the spec.

          Even if you wanted them, we use backslashes to escape strings in most common programming languages just fine; the problem with CSV is that commas aren't easy to recognize, because they might be within a single- or double-quoted string, or might just be a separator.

          Can strings in CSV have newlines? I bet parsers disagree since there's no spec really.

        • By solidsnack9000 2025-03-283:23

          In TSV as commonly implemented (for example, the default output format of Postgres and MySQL), tab and newline are escaped, not quoted. This makes processing the data much easier. For example, you can skip to a certain record or field just by skipping literal newlines or tabs.

      • By wodenokoto 2025-03-274:49

        It's a lot easier to type a comma than a control character, and it's a lot easier to see a comma than a tab (which might look like a space).

        For automated serialization, plain text formats won out, because they are easy to implement a minimal working solution (both import and export) and more importantly, almost all systems agree on what plain text is.

        We don't really have Apple-formatted text that will show up as binary on Windows. Especially if you are just transferring IDs and numbers: those will fall within ASCII, and that will work even if you are expecting Unicode.

      • By pasc1878 2025-03-2711:261 reply

        The ASCII control characters do not display well, and are not easily editable, in a plain text editor.

        I did always use TSV and I think the original use of CSV could have used that.

        But TSV would still have many issues.

        • By solidsnack9000 2025-03-283:24

          What issues would TSV have? As commonly implemented (for example, the default output format of Postgres and MySQL), in TSV, tab and newline are escaped, not quoted.

      • By hajile 2025-03-2714:11

        I don't understand why CSV became a thing in the 70s when S-expressions existed since at least the 50s and are better in practically every way.

    • By msla 2025-03-272:13

      CSV's actual problem is that there's no single CSV, and you don't know what type you have (or even if you have single consistent type through the whole file) without trying to parse the whole file and seeing what breaks. Is there quoting? Is that quoting used consistently? Do you have five-digit ZIP codes, or have the East Coast ones been truncated to four digits because they began with zero? Spin the wheel!

    • By lelanthran 2025-03-275:571 reply

      > So these days for serialisation of simple tabular data I prefer plain escaping, e.g. comma, newline and \ are all \-escaped. It's as easy to serialise and deserialise as CSV but without the above drawbacks.

      For my own parser, I made everything `\` escaped: outside of a quote or double-quote delimited string, any character prefixed with a `\` is read verbatim. There are no special exceptions of the kind where `\,` produces a comma but `\a` produces a literal `\a`; under this rule `\x` always produces `x`. This makes it a good rule, because it is only one rule with no exceptions.
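
      A minimal Python sketch of that single rule (not the parent's actual parser; quote handling is deliberately omitted to keep it short):

        def split_fields(line, sep=","):
            """Split one record using the 'backslash makes the next char verbatim' rule."""
            fields, current, i = [], [], 0
            while i < len(line):
                ch = line[i]
                if ch == "\\" and i + 1 < len(line):
                    current.append(line[i + 1])   # next character taken verbatim, whatever it is
                    i += 2
                elif ch == sep:
                    fields.append("".join(current))
                    current = []
                    i += 1
                else:
                    current.append(ch)
                    i += 1
            fields.append("".join(current))
            return fields

        print(split_fields(r"a\,b,c\\d"))   # ['a,b', 'c\\d']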

      • By mjw_byrne 2025-03-2712:18

        I considered this but then went the other way - a \ before anything other than a \, newline or comma is treated as an error. This leaves room for adding features, e.g. \N to signify a SQL NULL.

        Regarding quoting and escaping, there are two options that make sense to me - either use quoting, in which case quotes are self-escaped and that's that; or use escaping, in which case quotes aren't necessary at all.

    • By Yomguithereal 2025-03-2621:221 reply

      A good way to parallelize CSV processing is to split datasets into multiple files, kinda like manual sharding. xan has a parallel command able to perform a wide variety of map-reduce tasks on the split files.

      https://github.com/medialab/xan
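
      As a rough illustration of the same manual-sharding idea using only the Python standard library (this is not how xan works internally; the shard file names and the "status" column are made up):

        import csv
        import glob
        from collections import Counter
        from multiprocessing import Pool

        def count_column(path, column="status"):
            """Map step: count the values of one column within a single shard."""
            with open(path, newline="") as f:
                return Counter(row[column] for row in csv.DictReader(f))

        if __name__ == "__main__":
            shards = glob.glob("data-shard-*.csv")   # pre-split pieces of the dataset
            with Pool() as pool:
                partials = pool.map(count_column, shards)
            total = sum(partials, Counter())         # reduce step: merge the partial counts
            print(total.most_common(10))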

      • By jgord 2025-03-272:221 reply

        Nice. xsv is also very handy for wrangling CSV files generally.

    • By LPisGood 2025-03-2619:021 reply

      I always treat CSVs as comma separated values with new line delimiters. If it’s a new line, it’s a new row.

      • By criddell 2025-03-2619:052 reply

        Do you ever have CSV data that has newlines within a string?

        • By thesuitonym 2025-03-2619:122 reply

          I don't. If I ever have a dataset that requires newlines in a string, I use another method to store it.

          I don't know why so many people think every solution needs to be a perfect fit for every problem in order to be viable. CSV is good at certain things, so use it for those things! And for anything it's not good at, use something else!

          • By criddell 2025-03-2619:211 reply

            > use something else

            You don't always get to pick the format in which data is provided to you.

            • By thesuitonym 2025-03-2619:381 reply

              True, but in that case I'm not the one choosing how to store it, until I ingest the data, and then I will store it in whatever format makes sense to me.

          • By kittoes 2025-03-2619:31

            I don't think we do? It's more that a bunch of companies already have their data in CSV format and aren't willing to invest any effort in moving to a new format. Doesn't matter how much one extolls all the benefits, they know right? They're paying someone else to deal with it.

        • By LPisGood 2025-03-2619:171 reply

          No - that’s what I’m trying to say. If I have newlines I use something else.

          • By 0x073 2025-03-2619:231 reply

            Wouldn't work if CSV is used as an exchange format with external companies.

            • By LPisGood 2025-03-2620:03

              Of course. I’m not saying I roll my own parser for every project that uses a CSV file, I’m just describing my criteria for using CSV vs some other format when I have the option.

    • By da_chicken 2025-03-2617:53

      Eh, all you're really saying is "I'm not using CSV. Instead I'm using my CSV." Except that's all that anybody does.

      CSV can just as easily support escaping as any other format, but there is no agreement for a CSV format.

      After all, a missed escape can just as easily destroy a JSON or XML structure. And parallel processing of text is already a little sketchy simply because UTF-8 exists.

    • By widforss 2025-03-277:091 reply

      How is this not true for every format that includes quote marks?

      • By mjw_byrne 2025-03-2720:01

        It is true for everything that uses quoting, I didn't mean to imply otherwise.

    • By crazygringo 2025-03-272:452 reply

      I'm not clear why quotes prevent parallel processing?

      I mean, you don't usually parallelize reading a file in the first place, only processing what you've already read and parsed. So read each record in one process and then add it to a multiprocessing queue for multiple processes to handle.

      And data corruption is data corruption. If a movie I'm watching has a corrupted bit I don't mind a visual glitch and I want it to keep playing. But with a CSV I want to fix the problem, not ignore a record.

      Do you really have a use case where reading itself is the performance bottleneck and you need to parallelize reading by starting at different file offsets? I know that multiple processes can read faster from certain high-end SSD's than just one process, but that's a level of performance optimization that is pretty extraordinary. I'm kind of curious what it is!

      • By mjw_byrne 2025-03-2812:08

        > I'm not clear why quotes prevent parallel processing?

        Because of the "non-local" effect of quotes, you can't just jump into the middle of a file and start reading it, because you can't tell whether you're inside a quoted section or not. If (big if) you know something about the structure of the data, you might be able to guess. So that's why I said "tricky" instead of "impossible".

        Contrast to my escaping-only strategy, where you can jump into the middle of a file and fully understand your context by looking one char on either side.

        > Do you really have a use case where reading itself is the performance bottleneck and you need to parallelize reading by starting at different file offsets? I know that multiple processes can read faster from certain high-end SSD's than just one process, but that's a level of performance optimization that is pretty extraordinary. I'm kind of curious what it is!

        I used to be a data analyst at a management consultancy. A very common scenario would be that I'm handed a multi-gigabyte CSV and told to "import the data". No spec, no schema, no nothing. Data loss or corruption is totally unacceptable, because we were highly risk-sensitive. So step 1 is to go through the whole thing trying to determine field types by testing them. Does column 3 always parse as a timestamp? Great, we'll call it a timestamp. That kind of thing. In that case, it's great to be able to parallelise reading.
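
        A minimal Python sketch of that kind of column sniffing (the candidate types and the ISO-timestamp test are just an example of the approach):

          import csv
          from datetime import datetime

          def is_int(value):
              try:
                  int(value)
                  return True
              except ValueError:
                  return False

          def is_timestamp(value):
              try:
                  datetime.fromisoformat(value)
                  return True
              except ValueError:
                  return False

          def sniff_column_types(path):
              """Call a column 'int' or 'timestamp' only if every single value parses as such."""
              with open(path, newline="") as f:
                  reader = csv.reader(f)
                  header = next(reader)
                  candidates = {name: {"int", "timestamp"} for name in header}
                  for row in reader:
                      for name, value in zip(header, row):
                          if not is_int(value):
                              candidates[name].discard("int")
                          if not is_timestamp(value):
                              candidates[name].discard("timestamp")
              return {name: ("int" if "int" in kinds else "timestamp" if kinds else "text")
                      for name, kinds in candidates.items()}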

        > And data corruption is data corruption

        Agreed, but I prefer data corruption which messes up one field, not data corruption which makes my importer sit there for 5 minutes thinking the whole file is a 10GB string value and then throw "EOF in quoted field".

      • By Eridrus 2025-03-273:041 reply

        Doing sequential reading into a queue for workers to read is a lot more complicated than having a file format that supports parallel reading.

        And the fix to allow parallel reading is pretty trivial: escape newlines, so that you can just keep reading until the first unescaped newline and start at that record (see the sketch below).

        It is particularly helpful if you are distributing work across machines, but even in the single machine case, it's simpler to tell a bunch of workers their offset/limit in a file.
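
        A minimal Python sketch of that resynchronisation step, assuming the made-up convention that literal newlines inside fields are escaped (e.g. written as the two characters \ and n), so every raw newline byte in the file ends a record:

          def start_of_next_record(f, offset):
              """Return the position of the first record starting at or after `offset`."""
              if offset == 0:
                  f.seek(0)
                  return 0
              f.seek(offset - 1)
              f.readline()          # finish the record that straddles the boundary
              return f.tell()

          def process_chunk(path, start, end):
              """Handle exactly the records whose first byte lies in [start, end)."""
              with open(path, "rb") as f:
                  pos = start_of_next_record(f, start)
                  while pos < end:
                      record = f.readline()
                      if not record:
                          break
                      # ... unescape `record` and do the real work here ...
                      pos = f.tell()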

        • By akritid 2025-03-274:251 reply

          The practical solution is to generate several CSV files and distribute work at the granularity of files

          • By Eridrus 2025-03-2715:211 reply

            Sure, now you need to do this statically ahead of time.

            It's not unsolvable, but now you have a more complicated system.

            A better file format would not have this problem.

            The fix is also trivial (escape newlines into \n or similar) and would also make the files easier to view with a text editor.

            • By ttyprintk 2025-03-284:35

              But in practice, you’ll receive a bag of similar-format CSVs.

    • By solidsnack9000 2025-03-2621:17

      Tab-Separated Value, as implemented by many databases, solves these problems, because tab, newline and other control characters are escaped. For example, the default text serialization format of Postgres (`COPY <table> TO '<file>'` without any options) is this way.
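
      A minimal Python sketch of that escaping convention, modelled loosely on the PostgreSQL text COPY format (it only handles backslash, tab and newline, not the format's other escapes such as \N for NULL):

        def escape_field(value: str) -> str:
            """Escape backslash, tab and newline so they never appear literally in a field."""
            return (value.replace("\\", "\\\\")
                         .replace("\t", "\\t")
                         .replace("\n", "\\n"))

        def unescape_field(value: str) -> str:
            out, i = [], 0
            while i < len(value):
                if value[i] == "\\" and i + 1 < len(value):
                    out.append({"t": "\t", "n": "\n", "\\": "\\"}.get(value[i + 1], value[i + 1]))
                    i += 2
                else:
                    out.append(value[i])
                    i += 1
            return "".join(out)

        row = ["id\t42", "a note\nwith a newline", "C:\\temp"]
        line = "\t".join(escape_field(f) for f in row)     # now safe to split on literal tabs
        assert [unescape_field(f) for f in line.split("\t")] == row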
