27

A colleague and I have recently argued over whether a pure regex is capable of fully encapsulating the CSV format, such that it can parse all files for any given escape char, quote char, and separator char.

The regex need not be capable of changing these chars after creation, but it must not fail on any other edge case.

I have argued that this is impossible for just a tokenizer. The only regex that might be able to do this is a very complex PCRE style that moves beyond just tokenizing.

I am looking for something along the lines of:

... the csv format is a context free grammar and as such, it is impossible to parse with regex alone ...

Or am I wrong? Is it possible to parse csv with just a POSIX regex?

For example, if both the escape char and the quote char are ", then these two lines are valid csv:

"""this is a test.""",""
"and he said,""What will be, will be."", to which I replied, ""Surely not!""","moving on to the next field here..."
Spencer Rathbun
  • it's not a CFG as there's no nesting anywhere (IIRC) – ratchet freak Sep 27 '12 at 16:41
  • but what are the edge cases? Maybe there is more in CSV than I ever thought? – c69 Sep 27 '12 at 16:44
  • @c69 How about escape and quote char are both `"`. Then the following is valid: `"""this is a test.""",""` – Spencer Rathbun Sep 27 '12 at 17:00
  • Did you try regexp from [here](http://www.kimgentes.com/worshiptech-web-tools-page/2008/10/14/regex-pattern-for-parsing-csv-files-with-embedded-commas-dou.html)? – Sergey Kalinichenko Sep 27 '12 at 17:09
  • You do need to watch out for edge cases, but a regex should be able to tokenize csv as you have described it. The regex does not need to count up an arbitrary number of quotes -- it only needs to count to 3, which regular expressions can do. As others have mentioned, you should try to write down a well-defined representation of what you expect a csv token to be... – comingstorm Sep 27 '12 at 17:53
  • @c69 Edge cases are illustrated here http://tools.ietf.org/html/rfc4180. – Evan Plaice Oct 22 '13 at 18:56

5 Answers

26

Nice in theory, terrible in practice

By CSV I'm going to assume you mean the convention as described in RFC 4180.

While matching basic CSV data is trivial:

"data", "more data"

Note: BTW, it's a lot more efficient to use a .split('\n').split('"') function for very simple and well-structured data like this. Regular expressions work as an NDFSM (Non-Deterministic Finite State Machine) that wastes a lot of time backtracking once you start adding edge cases like escape chars.
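As a rough illustration of that point, here is what the naive split-based approach looks like, sketched in Python rather than JavaScript (the variable names are illustrative; it falls apart as soon as a field contains a comma, a quote, or a newline):

# Naive split-based parse: fine for trivially structured data, wrong in general.
simple = '"data", "more data"\n"a", "b"'

rows = [
    [field.strip().strip('"') for field in line.split(',')]
    for line in simple.split('\n')
]
print(rows)  # [['data', 'more data'], ['a', 'b']]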

For example, here's the most comprehensive regular expression for matching a CSV string that I've found:

re_valid = r"""
# Validate a CSV string having single, double or un-quoted values.
^                                   # Anchor to start of string.
\s*                                 # Allow whitespace before value.
(?:                                 # Group for value alternatives.
  '[^'\\]*(?:\\[\S\s][^'\\]*)*'     # Either Single quoted string,
| "[^"\\]*(?:\\[\S\s][^"\\]*)*"     # or Double quoted string,
| [^,'"\s\\]*(?:\s+[^,'"\s\\]+)*    # or Non-comma, non-quote stuff.
)                                   # End group of value alternatives.
\s*                                 # Allow whitespace after value.
(?:                                 # Zero or more additional values
  ,                                 # Values separated by a comma.
  \s*                               # Allow whitespace before value.
  (?:                               # Group for value alternatives.
    '[^'\\]*(?:\\[\S\s][^'\\]*)*'   # Either Single quoted string,
  | "[^"\\]*(?:\\[\S\s][^"\\]*)*"   # or Double quoted string,
  | [^,'"\s\\]*(?:\s+[^,'"\s\\]+)*  # or Non-comma, non-quote stuff.
  )                                 # End group of value alternatives.
  \s*                               # Allow whitespace after value.
)*                                  # Zero or more additional values
$                                   # Anchor to end of string.
"""

It reasonably handles single- and double-quoted values, but not newlines in values, RFC-style doubled-quote escapes (""), etc.

Source: Stack Overflow - How can I parse a string with JavaScript
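The pattern above is written in Python's verbose regex syntax, so a minimal sketch of applying it would look like the following (this assumes the re_valid string defined above; it only validates a line, it does not extract the fields):

import re

# Compile the verbose pattern above and check a few lines against it.
valid_line = re.compile(re_valid, re.VERBOSE)

print(bool(valid_line.match('"data", "more data"')))    # True
print(bool(valid_line.match("'single', unquoted, x")))  # True
print(bool(valid_line.match('"a ""quoted"" word",x')))  # False - RFC-style "" escapes are not covered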

It becomes a nightmare once the common edge cases are introduced, like...

"such as ""escaped""","data"
"values that contain /n newline chars",""
"escaped, commas, like",",these"
"un-delimited data like", this
"","empty values"
"empty trailing values",        // <- this is completely valid
                                // <- trailing newline, may or may not be included

The newline-as-value edge case alone is enough to break 99.9999% of the regex-based parsers found in the wild. The only 'reasonable' alternative is to use regex matching for basic control/non-control character (i.e. terminal vs non-terminal) tokenization, paired with a state machine used for higher-level analysis.

Source: Experience otherwise known as extensive pain and suffering.
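To make that split between regex tokenization and a state machine concrete, here is a minimal sketch in Python (not the jquery-csv implementation; it assumes RFC 4180 conventions - comma separator, double-quote quoting, "" as the escaped quote - and ignores anything fancier):

import re

# The regex only tokenizes: a quote, a comma, a line break, or a run of
# ordinary characters. The loop below is the state machine.
TOKEN = re.compile(r'"|,|\r\n|\r|\n|[^",\r\n]+')

def parse_csv(text):
    rows, row, field = [], [], []
    in_quotes = False
    tokens = TOKEN.findall(text)
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if in_quotes:
            if tok == '"':
                if i + 1 < len(tokens) and tokens[i + 1] == '"':
                    field.append('"')   # "" inside a quoted field is an escaped quote
                    i += 1
                else:
                    in_quotes = False   # closing quote
            else:
                field.append(tok)       # commas and newlines are plain data here
        else:
            if tok == '"':
                in_quotes = True
            elif tok == ',':
                row.append(''.join(field)); field = []
            elif tok in ('\r\n', '\r', '\n'):
                row.append(''.join(field)); field = []
                rows.append(row); row = []
            else:
                field.append(tok)
        i += 1
    if field or row:                    # input without a trailing newline
        row.append(''.join(field))
        rows.append(row)
    return rows

print(parse_csv('"and he said,""What will be, will be.""",next\r\n"",last'))
# [['and he said,"What will be, will be."', 'next'], ['', 'last']]

The point is the division of labour: the regex never needs to know whether it is inside quotes; all of that bookkeeping lives in the state machine.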

I am the author of jquery-CSV, the only javascript based, fully RFC-compliant, CSV parser in the world. I have spent months tackling this problem, speaking with many intelligent people, and trying a ton of different implementations including 3 full rewrites of the core parser engine.

tl;dr - Moral of the story, PCRE alone sucks for parsing anything but the most simple and strict regular (i.e. Type-3) grammars. That said, it's useful for tokenizing terminal and non-terminal strings.

Evan Plaice
  • Yup, that's been my experience as well. Any attempt to fully encapsulate more than a very simple CSV pattern runs into these things, and then you ram up against both the efficiency problems and complexity problems of a massive regex. Have you looked at the [node-csv](http://www.adaltas.com/projects/node-csv/) library? It seems to validate this theory as well. Every non trivial implementation uses a parser internally. – Spencer Rathbun Oct 22 '13 at 12:22
  • @SpencerRathbun Yep. I'm sure I've taken a look at the node-csv source before. It appears to use a typical character tokenization state machine for processing. The jquery-csv parser works on the same fundamental concept except I use regex for terminal/non-terminal tokenization. Instead of evaluating and concatenating on a char-by-char basis, regex is able to match multiple non-terminal characters at a time and return them as a group (ie string). This minimizes unnecessary concatenation and 'should' increase efficiency. – Evan Plaice Oct 22 '13 at 18:31
21

Regex can parse any regular language, but cannot parse fancy things like recursive grammars. CSV, however, seems to be pretty regular, and so parseable with a regex.

Let's work from the definition: allowed are sequence, choice from alternatives (|), and repetition (Kleene star, the *); a sketch assembling these pieces follows the list below.

  • An unquoted value is regular: [^,]* # any char but comma
  • A quoted value is regular: "([^\"]|\\\\|\\")*" # sequence of anything but quote " or escaped quote \" or escaped escape \\
    • Some forms may include escaping quotes with quotes, which adds a variant ("")*" to the expression above.
  • An allowed value is regular: <unquoted-value>|<quoted-value>
  • A single CSV line is regular: <value>(,<value>)*
  • A sequence of lines separated by \n is also obviously regular.
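Here is the sketch referred to above, assembling those pieces for the doubled-quote variant with Python's re module (illustrative only; it validates a single line and ignores embedded line breaks):

import re

QUOTED = r'"(?:[^"]|"")*"'    # anything but a quote, or "" as an escaped quote
UNQUOTED = r'[^,"\r\n]*'      # no comma, quote or line break
VALUE = f'(?:{QUOTED}|{UNQUOTED})'
LINE = re.compile(f'^{VALUE}(?:,{VALUE})*$')

print(bool(LINE.match('"""this is a test.""",""')))   # True
print(bool(LINE.match('plain,fields,here')))          # True
print(bool(LINE.match('"unbalanced,quote')))          # False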

I did not meticulously test each of these expressions, and never defined capture groups. I also glossed over some technicalities, like the variants of characters which can be used instead of ,, ", or line separators: these do not break the regularity, you just get several slightly different languages.

If you can spot a problem in this proof, please comment! :)

But despite this, practical parsing of CSV files by pure regular expressions may be problematic. You need to know which of the variants is being fed to the parser, and there's no standard for it. You can try several parsers against each line until one succeeds, or somehow divine the format from comments. But this may require means other than regular expressions to do efficiently, or at all.

9000
  • Absolutely a +1 for the practical point. I'm sure that somewhere, deep down, there is an example of a (contrived) value that would break the quoted-value version; I just don't know what it is. The 'fun' with multiple parsers would be "these two work, but give different answers" –  Sep 27 '12 at 17:55
  • You will obviously need different regexes for backslash-escaped-quotes vs. doubled-quote-escaped-quotes. A regex for the former type of csv field should be something like `[^,"]*|"(\\(\\|")|[^\\"])*"`, and the latter should be something like `[^,"]*|"(""|[^"])*"`. (Beware, as I have not tested either of these!) – comingstorm Sep 27 '12 at 18:10
  • Hunting for [something](http://tools.ietf.org/html/rfc4180) that might be a standard, there is a case that is missed - a value with a record delimiter enclosed. This also makes the practical parsing even more fun when there are multiple different ways to handle that –  Sep 27 '12 at 18:11
  • Nice answer, but if I run `perl -pi -e 's/"([^\"]|\\\\|\\")*"/yay/'` and pipe in `"I have here an item,\" that is a test\""` then the result is `yay that is a test\""`. Methinks your regex is flawed. – Spencer Rathbun Sep 27 '12 at 18:22
  • @SpencerRathbun: when I have more time I'll actually test the regexes and probably even paste some proof-of-concept code that passes tests. Sorry, work day is going on. – 9000 Sep 27 '12 at 18:38
  • No problem. I will be curious what you come up with. – Spencer Rathbun Sep 27 '12 at 19:01
  • @SpencerRathbun: That's because `[^\"]` is equivalent to `[^"]`. It needs to be `[^\\"]`. Another problem, by the way, is the definition of "unquoted value" as `[^,]*`, since that allows `"foo,bar"` to be seen as containing the two unquoted values `"foo` and `bar"`. – ruakh Sep 27 '12 at 20:09
  • @ruakh even with that change, the cell `This, is a test\",` becomes `"This, is a test \"","` when exported as a csv from libre office calc. The regex does not match the final comma as part of the quoted string. (the code: `perl -e 'while(<>) { chomp; if(m/"((?:[^\\"]|\\\\|\\")*)"/) { print "$1\n"; } }' test.csv`) –  Sep 28 '12 at 13:50
  • @MichaelT: Sorry, but I don't see what Libre Office Calc has to do with anything, since it escapes `"` as `""`, and this answer describes a regex for a CSV format where `"` is escaped as `\"`. Naturally a regex designed for the latter case will not automatically handle the former. – ruakh Sep 28 '12 at 13:54
  • @ruakh it was a way of reproducing the data that I had entered and show what the expected result. Escaping the text with double quotes instead of a backslashed quote `"This, is a test \"""","` also misses the final comma in string. Whatever the case, the regular expression needs to be able to handle text with the escaped form other csv data. –  Sep 28 '12 at 14:11
  • @MichaelT: "Whatever the case, the regular expression needs to be able to handle text with the escaped form other csv data": No, it doesn't. The OP's formulation of the problem explicitly specified that the "escape char, quote char, and separator char" were givens. – ruakh Sep 28 '12 at 14:53
  • @ruakh Ok. With the double quote form for escaping an embedded quote within a string, the string `"This, is a test \"""","` as one value is not properly recognized by the regular expression `/"((?:[^\\"]|\\\\|\\")*)"/` as tested by `perl -e 'while(<>) { chomp; if(m/"((?:[^\\"]|\\\\|\\")*)"/) { print "$1\n"; } }' test.csv` -- I am of the opinion, that for any regular expression that you create, I can construct a valid quoted string that will be improperly matched. Chances are, this string will contain what appears to be another set of csv data with an alternate system of encoding embedded. –  Sep 28 '12 at 16:06
  • @MichaelT: You're still not making sense. That regular expression is designed for the case that the escape character is `\ `. When the escape character is `\ `, the string `"This, is a test \"""","` is not valid CSV, so is not *supposed* to be "properly recognized"! – ruakh Sep 28 '12 at 16:28
  • Don't forget that, as a practical matter, what you do with your tokens *after* they are recognized (or fail to be recognized) is an important part of the process. While malformed data sources are so endlessly ingenious that you can't guarantee automated handling of every possible case, you can be much more robust by using some common-sense handling of the results of your regex matching. E.g., if your separator fails to match `\w*,\w*`, try an alternate quote format for the preceding field. – comingstorm Oct 02 '12 at 22:21
  • How about a line containing a trailing empty non-terminal, a line containing escaped newline char(s), multiple variations of newline chars (i.e. \r, \n, \r\n), etc... There is a common standard for CSV data but it's not possible to parse using regular expressions alone. Even if it was, the implementation would be extremely inefficient. – Evan Plaice Oct 21 '13 at 22:20
  • @EvanPlaice: I still think that a reasonably complete regex-based CSV parser is possible, but completely agree that it will probably have poor performance and most definitely be unmaintainable. – 9000 Oct 22 '13 at 07:53
  • @9000 See my answer. I used to think the same and invested a lot of time in trying to prove so but -- in the end -- chose an alternative approach. If CSV was strictly a machine-readable format with a very concise context-free format it would be completely do-able with regex alone. In practice, CSV is used as a human-readable format with a lot of (completely valid and acceptable) quirks. – Evan Plaice Oct 22 '13 at 18:32
6

Simple answer - probably not.

The first problem is a lack of a standard. While one may describe their CSV in a way that is strictly defined, one cannot expect to get strictly defined CSV files. "Be conservative in what you do, be liberal in what you accept from others" – Jon Postel

Assuming that one does have a standard-esque definition that is acceptable, there is the question of escape characters and whether these need to be balanced.

A string in many csv formats is defined as string value 1,string value 2. However, if that string contains a comma it is now "string, value 1",string value 2. If it contains a quote it becomes "string, ""value 1""",string value 2.

At this point I believe it is impossible. The problem is that you need to determine how many quotes you have read and whether a comma is inside or outside of the double-quoted mode of the value. Balancing parentheses is an impossible regex problem. Some extended regular expression engines (PCRE) can deal with it, but it isn't a regular expression then.

You might find https://stackoverflow.com/questions/8629763/csv-parsing-with-a-context-free-grammar useful.


Amended:

I have been looking at formats for escape characters and haven't found any that need arbitrary counting - so that is probably not the issue.

However, there are issues of what the escape character and record delimiter are (to start with); a parameterized sketch follows the list below. http://www.csvreader.com/csv_format.php is a good read on the different formats in the wild.

  • The rules for the quoted string (if it is a single quoted string or a double quoted string) differ.
    • 'This, is a value' vs "This, is a value"
  • The rules for escape characters
    • "This ""is a value""" vs "This \"is a value\""
  • The handling of embedded record delimiter ({rd})
    • (raw embedded) "This {rd}is a value" vs (escaped) "This \{rd}is a value" vs (translated) "This {0x1C}is a value"
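Here is the parameterized sketch mentioned above: for one fixed choice of separator, quote char and escape char, the line pattern can at least be generated from those characters (Python, illustrative only; embedded record delimiters and the ambiguities discussed below are deliberately out of scope):

import re

def line_pattern(sep, quote, esc):
    # Build a whole-line pattern for one concrete choice of separator,
    # quote char and escape char (sketch only).
    s, q, e = map(re.escape, (sep, quote, esc))
    if quote == esc:
        quoted = f'{q}(?:[^{q}]|{q}{q})*{q}'    # doubled-quote style: "he said ""hi"""
    else:
        quoted = f'{q}(?:[^{q}{e}]|{e}.)*{q}'   # escape-char style: "he said \"hi\""
    unquoted = f'[^{s}{q}\\r\\n]*'
    field = f'(?:{quoted}|{unquoted})'
    return re.compile(f'^{field}(?:{s}{field})*$')

rfc_line = line_pattern(',', '"', '"')
print(bool(rfc_line.match('"""this is a test.""",""')))                              # True
print(bool(rfc_line.match('"and he said,""What will be, will be.""","moving on"')))  # True

That covers the mechanical part of the question; the "multiple valid interpretations" problem below is what it cannot solve, because the pattern has to be told which convention the file uses.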

The key thing here is that it is possible to have a string that will always have multiple valid interpretations.

The related question (for edge cases) is: "is it possible to have an invalid string that is accepted?"

I still strongly doubt that there is a regular expression that can match every valid CSV that is created by some application and reject every csv that cannot be parsed.

  • Quotes inside quotes need not be balanced. Instead, there must be an even number of quotes before an embedded quote, which is obviously regular: `("")*"`. If the quotes _inside_ the value are off balance, it's already not our business. – 9000 Sep 27 '12 at 17:39
  • This is my position, having run into these horrible excuses for "data transfer" in the past. The only thing that properly handled them was a parser, pure regex broke every few weeks. – Spencer Rathbun Sep 27 '12 at 18:23
5

This regexp can tokenize normal CSV, as described in the RFC:

/("(?:[^"]|"")*"|[^,"\n\r]*)(,|\r?\n|\r)/

Explanation:

  • ("(?:[^"]|"")*"|[^,"\n\r]*) - a CSV field, quoted or not
    • "(?:[^"]|"")*" - a quoted field;
      • [^"]|"" - each character is either not ", or " escaped as ""
    • [^,"\n\r]* - an unquoted field, which may not contain , " \n \r
  • (,|\r?\n|\r) - the following separator, either , or a newline
    • \r?\n|\r - a newline, one of \r\n \n \r

An entire CSV file can be matched and validated by using this regexp repeatedly. It is then necessary to fix the quoted fields, and split it into rows based on the separators.

Here is code for a CSV parser in Javascript, based on the regexp:

var csv_tokens_rx = /("(?:[^"]|"")*"|[^,"\n\r]*)(,|\r?\n|\r)/y;  // sticky: each match starts exactly at lastIndex
var csv_unescape_quote_rx = /""/g;
function csv_parse(s) {
    if (s && s.slice(-1) != '\n')
        s += '\n';                          // ensure the input ends with a newline so the last field terminates
    var ok;
    var rows = [];
    var row = [];
    csv_tokens_rx.lastIndex = 0;
    while (true) {
        ok = csv_tokens_rx.lastIndex == s.length;  // true once previous matches have consumed the whole input
        var m = s.match(csv_tokens_rx);
        if (!m)
            break;
        var v = m[1], d = m[2];             // v = field value, d = the separator that followed it
        if (v[0] == '"') {
            v = v.slice(1, -1);             // strip the surrounding quotes
            v = v.replace(csv_unescape_quote_rx, '"');  // un-double the embedded quotes
        }
        if (d == ',' || v)
            row.push(v);                    // skip an empty value right before a line break (trailing newline case)
        if (d != ',') {
            rows.push(row);                 // a newline separator ends the row
            row = [];
        }
    }
    return ok ? rows : null;                // null means the input did not tokenize completely
}

Whether this answer helps to settle your argument is for you to decide; I am just happy to have a small, simple and correct CSV parser.

In my opinion a lex program is more or less a large regular expression, and those can tokenize much more complex formats, such as the C programming language.

With reference to the RFC 4180 definitions:

  1. line break (CRLF) - The regexp is more flexible, allowing CRLF, LF or CR.
  2. The last record in the file may or may not have an ending line break - The regexp as it is requires a final line break, but the parser adjusts for that.
  3. There may be an optional header line - This does not impact the parser.
  4. Each line should contain the same number of fields throughout the file - not enforced
    • Spaces are considered part of a field and should not be ignored - okay
    • The last field in the record must not be followed by a comma - not enforced
  5. Each field may or may not be enclosed in double quotes ... - okay
  6. Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes - okay
  7. a double-quote appearing inside a field must be escaped by preceding it with another double quote - okay

The regexp itself satisfies most of the RFC 4180 requirements. I don't agree with the others, but it is easy to adjust the parser to implement them.

Sam Watkins
  • this looks more like self-promotion than addressing the question asked, see [answer] – gnat Mar 22 '18 at 19:32
  • @gnat, I edited my answer to give more explanation, check the regexp against RFC 4180, and to make it less self-promoting. I believe that this answer has value, as it contains a tested regexp which can tokenize the most common form of CSV as used by Excel and other spreadsheets. I think that this settles the question. The little CSV parser demonstrates that it is easy to parse CSV using this regexp. – Sam Watkins Mar 22 '18 at 23:16
  • Without wishing to promote myself excessively, here are my complete little csv and tsv libraries which I am using as part of a little spreadsheet app (Google sheets feels too heavy for me). This is open source / public domain / CC0 code like all the stuff I publish. I hope this can be useful for someone else. http://sam.aiki.info/code/js/ – Sam Watkins Mar 28 '18 at 23:59
3

First define the grammar for your CSV (are the field delimiters escaped or encoded somehow if they appear in text?), and then it can be determined if it is parsable with regex. Grammar first, parser second: http://www.boyet.com/articles/csvparser.html It should be noted that this method uses a tokenizer - but I can't construct a POSIX regex that would match all edge cases. If your usage of CSV formats is non-regular and context-free ... then your answer is in your question. Good overview here: http://nikic.github.com/2012/06/15/The-true-power-of-regular-expressions.html

iivel