16

So typically a CSV file uses a comma and the return character as its field and line delimiters.

This runs into obvious problems with text fields, which can contain both of these characters.

Obviously there are options there (escaping them), but how do people handle this? Use different characters - pipes or tildes? Escape them? Not use delimited files at all - after all, it's 2010 and we have XML now?

I'm looking for the least-effort approach that gives a decent chance of not running into problems.

(Just to be clear, this is a question out of curiosity rather than anything more solid. It's something I've come up against time and time again when playing with data; I've always got round it, but it normally feels a bit, well, dirty, and I wondered what other people's experience was.)

Jon Hopkins
  • 22,734
  • 11
  • 90
  • 137
  • Think carefully about using CSV - it's nice and easy to deal with (see answers for common escaping rules), but it's not nearly as interoperable as it should be - if you're just communicating with your own programs it's fine, but if you want to import elsewhere it gets a bit odd, because different programs obey different escaping rules. – Michael Kohne Dec 14 '10 at 12:15
  • @Michael - Absolutely. The issue though is that it's so omnipresent that you'll almost always come up with times when it's a very tempting option, and in the case of many older systems it's the only option. – Jon Hopkins Dec 14 '10 at 12:18
  • Mature libraries exist in many languages (certainly the common ones) for reading and writing character delimited files. They will handle most any situation. Writing one's own CSV parser seems to be a common sort of anti-pattern. – quentin-starin Dec 14 '10 at 16:44

7 Answers

13

According to Wikipedia:

Fields with embedded commas must be enclosed within double-quote characters.

And furthermore:

Fields with embedded double-quote characters must be enclosed within double-quote characters, and each of the embedded double-quote characters must be represented by a pair of double-quote characters.

I don't know who invented that, but it effectively shows that eventually you have to escape. It's the only solid solution. Everything else is just duct tape on top of duct tape: it may work for now, but eventually you'll bump into a case where you need an exception to the exception of an exception, and it doesn't take long before your mudball of rules is far more complex than a simple escape-character solution would have been.

It seems that CSV's creators first tried to avoid escaping commas by coming up with the double-quoted special syntax, which allowed commas to be stored, but then someone wanted to store double-quote characters too, so they had to escape at that point - funnily enough, using the double-quote itself as the escape character. Had they decided to escape properly in the first place, the syntax would be simpler now.
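
To see the two rules in action, here's a minimal round-trip sketch using Python's built-in csv module, which applies this quoting scheme by default:

    import csv, io

    # Write one row whose fields contain the delimiter, the quote character, and a newline.
    buf = io.StringIO()
    csv.writer(buf).writerow(['plain', 'a,b', 'say "hi"', 'two\nlines'])
    print(buf.getvalue())
    # plain,"a,b","say ""hi""","two
    # lines"

    # Reading it back recovers the original fields unchanged.
    print(next(csv.reader(io.StringIO(buf.getvalue()))))
    # ['plain', 'a,b', 'say "hi"', 'two\nlines']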

Joonas Pulakka
  • 23,534
  • 9
  • 64
  • 93
  • 3
    What should be, and what is .. often differs :) – Tim Post Dec 14 '10 at 12:19
  • I think the solution is quite OK. For simple data, CSV works fine; for complex data, quoting is necessary, and escaping " with "" traces back to BASIC. – Ernelli Dec 14 '10 at 12:30
  • 1
    @Ernelli: Now that I think about it more, it may actually be a reasonable compromise between human-readability and simplicity. The problem with escaping is that it looks ugly to *humans*, even though it's trivial for the *computer* to parse. Thus, reserving escapes for rare cases only ("fields with embedded double-quote characters") produces output that *usually* looks quite human-readable. This is a good solution, assuming that commas in field values are more common than double-quotes. – Joonas Pulakka Dec 14 '10 at 12:35
2

I'm assuming you have something like this:

Foo,Baz,,,"Foo,Baz"

If strings that contain the delimiter aren't quoted or escaped, you have no real reliable way of parsing the file.

You can, however, examine the data to be parsed and draw conclusions like:

  • Comma-separated floats should be treated as a single string
  • If the line before or after this one contains fewer delimiters, skip parsing this line and log it
  • Treat ' like "

You do have to write a parser to handle stuff like that, but it doesn't have to be complicated.
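
A rough sketch of that sort of sanity check, assuming you know how many fields a well-formed record should have (tolerant_rows and expected_fields are made-up names for illustration):

    import csv, logging

    def tolerant_rows(path, expected_fields):
        """Yield rows with a plausible field count; log and skip the rest for review."""
        with open(path, newline='') as f:
            for recno, row in enumerate(csv.reader(f), start=1):
                if len(row) != expected_fields:
                    logging.warning("record %d: expected %d fields, got %d - skipped",
                                    recno, expected_fields, len(row))
                    continue
                yield row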

In my experience, importing massive dumps from something like Excel always results in having to go back and review some oddballs. Your challenge is to give your program just enough common sense regarding the data so that it doesn't do a crazy insert. Then review what was logged and wash/rinse/repeat.

I once handled an internal FAQ for a small company that used all Ubuntu workstations. Part of the FAQ gave 'shell shortcuts', and it came to me pipe-delimited. Well, the answers were also typically pipe-delimited (e.g. grep foo | something) and not quoted or escaped. I feel that pain :)

Tim Post
  • 18,757
  • 2
  • 57
  • 101
2

Nothing wrong with CSV up to a point

CSV works well for rigidly defined data that is unlikely to change format and doesn't spring many surprises on the recipient parser.

Here's a handy list of the big gotchas:

  1. Escaping ""s within ""s (field contains field delimiter)
  2. ""s containing CRLFs (field contains line delimiter)
  3. Unicode (underlying text format may be insufficient)
  4. Different line terminators for different OSes (is it CR, CRLF, LF, or NUL?)
  5. Inline comments (line prefixed with #, //, --, ; etc)
  6. Version management (the latest version of the file contains more or less fields)
  7. Differentiating between NULL and empty data (,"", is empty but ,, is null?)

You could approach this with a metadata header that describes how the fields should be parsed, but then you may as well just use XML - it's because of this sort of freeform CSV mess that XML was invented in the first place. Still, the XML approach seems too heavyweight for what could, on the face of it, be a simple problem.

A popular alternative is the "weird character delimiter" strategy. This gets around a lot of the escaping issues above because you use something like a | (pipe) character for field delimiting, and a CRLF for record termination. This doesn't get around the multi-line field issue (unless you use a field counter) but you do get nicely formatted lines for humans.
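
Most CSV libraries let you swap in a different delimiter, so the pipe variant costs nothing extra to parse; a quick sketch with Python's csv module:

    import csv, io

    data = 'id|name|notes\r\n1|Widget|has, commas and "quotes"\r\n'
    for row in csv.reader(io.StringIO(data), delimiter='|'):
        print(row)
    # ['id', 'name', 'notes']
    # ['1', 'Widget', 'has, commas and "quotes"']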

Overall, if you're just looking for a simple way of handling this kind of file then, in the Java world, you could simply throw OpenCSV at it. That way you abstract away all of the problems into an established framework.

Gary
  • 24,420
  • 9
  • 63
  • 108
2

CSV is still a valid format in many situations, especially since it's still just about the easiest way for a customer to write data that needs to be imported into your application. Few of our customers like to deal with XML, perhaps because it's very verbose and has all of those "scary" angle brackets. It's much simpler for them to wrap their brains around a list of items separated by an agreed-upon character, with the further agreement that the same character won't be allowed in the contents of a field.

That said, you still have to handle the input correctly and check for situations where they use invalid characters. I've started to use FileHelpers for my CSV parsing needs.
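
If the agreement is simply that the delimiter never appears in a field, the up-front check is tiny; a sketch (check_row and DELIM are made-up names, and the delimiter is whatever you agreed on):

    DELIM = ','

    def check_row(row):
        """Reject input early if any field contains the agreed-upon delimiter."""
        bad = [field for field in row if DELIM in field]
        if bad:
            raise ValueError("fields may not contain %r: %r" % (DELIM, bad))
        return row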

Dave
  • 171
  • 3
1

I usually stick to the standard and escape them. In most programming languages there is good built-in support or a good library available.

Which format gets used depends on the situation, but CSV is a reasonable format for exchanging simple data structures.

Salandur
  • 111
  • 4
0

Forget CSV, use JSON. Easy to write, easy to parse. XML is so 2005.
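
For what it's worth, JSON's escaping rules are pinned down in the spec, so a round trip just works; a quick sketch:

    import json

    rows = [['plain', 'a,b', 'say "hi"', 'two\nlines']]
    text = json.dumps(rows)
    print(text)  # [["plain", "a,b", "say \"hi\"", "two\nlines"]]
    assert json.loads(text) == rows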

user281377
  • 28,352
  • 5
  • 75
  • 130
  • 6
    and has the same problem when you want to use a character that's part of the JSON format (like { or ,) – Salandur Dec 14 '10 at 11:59
  • Salandur: Not at all! There are exact rules for how to escape! And { and , don't even need to be escaped, because inside a string they are not ambiguous! – user281377 Dec 14 '10 at 12:05
  • 1
    Well and good, but I don't recall excel having an "Export to JSON" feature :) There are times when you have to parse odd things, if only to get them into a more agreeable format. – Tim Post Dec 14 '10 at 12:25
  • 1
    And JSON is just so totally brilliant for passing around a million objects of the same shape. Oh, wait. – Frank Shearar Dec 14 '10 at 12:27
  • 1
    JSON offers no improvement over CSV in regarding to this question and crucially lacks interoperability with many applications (as has been mentioned, can't import or export from Office, SQL DBs etc). JSON is great for internal, light-weight client-side operations but XML is much better for passing data between applications. – Dan Diplo Dec 14 '10 at 12:37
  • @Frank: If you pass them as an array of arrays (instead of an array of objects, where field names have to be repeated for every line), the overhead compared to CSV is small. – user281377 Dec 14 '10 at 12:46
  • @Dan: Implementing JSON export and import is so easy that I would be very surprised if that aspect doesn't get better very soon. – user281377 Dec 14 '10 at 12:47
  • @ammoQ Unless I misunderstand the JSON grammar, sending all that data as an array of arrays _is_ basically CSV, with a few []s thrown in. I mean, add a [ at the start of the file, ] at the end, and wrap every line in [ and ], right? Can't say I see the point of using JSON in this particular case, really. – Frank Shearar Dec 14 '10 at 15:45
  • @Frank: In the most simple case, yes. But JSON makes it easy to add a header that contains meta-information about the array of arrays to follow. The syntax of JSON is by far more precisely specified than that of CSV. Quote from the wikipedia article about CSV: "Each field value of a record is separated from the next by a character (typically a comma, but some European countries use a semi-colon as a value separator instead)." By using JSON, such problems are generally avoided. Even without the meta-info I propose, you can at least syntactically distinguish between strings, numbers and bools. – user281377 Dec 14 '10 at 17:28
0

Usually, what I find myself doing is getting a TSV (tab-separated values) file rather than a CSV file, pulling it into Emacs, seeing which of a few uncommon characters it NEVER uses ($ is usually a good choice around here), and then converting all the tabs to $.

From there, GNU AWK can be told to use $ as the field separator, and Bob's your uncle.
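
The same trick sketched in Python rather than AWK (dump.tsv is a made-up filename), with a guard to confirm the stand-in character really never appears:

    raw = open('dump.tsv').read()   # 'dump.tsv' is a hypothetical input file
    assert '$' not in raw           # confirm the stand-in character never appears
    rows = [line.split('$') for line in raw.replace('\t', '$').splitlines()]

(Python could split on the tab directly, of course; the conversion matters when the file is handed on to AWK, as in the original workflow.)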

John R. Strohm
  • 18,043
  • 5
  • 46
  • 56