
I'm working on parsers that not only process delimited content, but also handle escape sequences within certain portions of that content. I'm weighing the efficiency of several approaches to tokenization and would appreciate thoughts on the following.

To use JSON as but one example, there are a few ways to process the following:

["Foo\n"]

Consider that the input stream might be multiple megabytes long, with a varied mix of escaped and un-escaped sequences.

Approaches:

By Character

The first option is simply to tokenize by character:

[array string #"F" #"o" #"o" #"^/" /string /array]

Pro: uniform, quick to implement, works well with streamed content

Con: not at all efficient, as you are invoking the token handler for nearly as many tokens as there are characters in the input stream.
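
For illustration, a minimal sketch of such a per-character rule (assuming a hypothetical emit function that hands each token to the handler, and treating \n as the only escape of interest) might be:

plain: complement charset {"\}  ; any character other than the delimiter or escape lead-in

[
      "[" (emit 'array)
    | "]" (emit /array)
    | {"} (emit 'string)
      any [
          set ch plain (emit ch)  ; one handler invocation per character
        | "\n" (emit #"^/")       ; each escape resolves to a single character
          ; ...every other escape would need its own alternative here
      ]
      {"} (emit /string)
]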

By Escaped/Un-Escaped Tokens

A somewhat more efficient tokenizer might result in:

[array string "Foo" "^/" /string /array]

Pro: Somewhat more efficient, quick to implement

Con: There are still a lot of tokens for heavily escaped content, and the consumer can't tell whether two adjacent string tokens represent one item or two.
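
A sketch of a rule producing tokens in that shape (same assumptions: a hypothetical emit handler and \n as the only escape) might be:

plain: complement charset {"\}  ; characters that need no translation

[
      "[" (emit 'array)
    | "]" (emit /array)
    | {"} (emit 'string)
      any [
          copy text some plain (emit text)  ; a whole run of plain characters as one token
        | "\n" (emit "^/")                  ; each escape becomes its own resolved token
      ]
      {"} (emit /string)
]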

By Whole Tokens

A minimal tokenizer might produce the following:

[array "Foo^/" /array]

Pro: Far fewer tokens to handle

Con: This raises many questions, chief amongst them: how is the string "Foo^/" created? To break this down, I'll consider two sub-approaches:

Match the sequence, then resolve escapes:

This might be handled thus:

[
      "[" (emit 'array)
    | "]" (emit /array)
    | {"} copy value string-sequence {"} (emit de-escape value)
]

Pros: Quickly identify matches, use and modify a single string

Cons: This is effectively a two-pass process. There may be two separate rules for matching escape sequences, one in string-sequence and one in de-escape, and there's the extra effort of ensuring they stay consistent.
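
To make the duplication concrete, a sketch with hypothetical string-sequence and de-escape definitions might look like:

plain: complement charset {"\}

; first pass: finds the extent of the string content, escapes included
string-sequence: [any [some plain | "\" skip]]

; second pass: resolves escapes in the copied value; the escape list appears
; again here and must be kept in step with whatever string-sequence accepts
de-escape: func [value [string!]] [
    replace/all value "\n" "^/"
    ; ...one replacement (or rule) per supported escape, mirroring the first pass
    value
]

Note that a replacement-based second pass like this also has to be careful about ordering (e.g. an input containing \\n), which is yet another detail the two passes must agree on.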

Match portions of the string and append to a buffer:

This might look like:

[
      "[" (emit 'array)
    | "]" (emit /array)
    | {"} (buffer: make string! "") some [
          copy part unescaped-sequence (append buffer part)
        | "\n" (append buffer "^/")
    ] {"} (emit buffer)
]

Pros: One pass

Cons: Now we're back to handling chunks much as in the 'By Escaped/Un-Escaped Tokens' method, plus managing an additional value buffer.
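
For completeness, unescaped-sequence above is left as a placeholder; one plausible definition is a run of characters that are neither the closing quote nor the start of an escape:

plain: complement charset {"\}   ; neither delimiter nor escape lead-in
unescaped-sequence: [some plain]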

rgchris
  • What's the problem? Are you having trouble recognizing the string token, or converting the string with escape sequences into the desired string value? The first is easily done with a regular expression; the second problem is similar, but a hand-written finite-state machine will be fine. These techniques are maximally efficient. Just be careful not to make unneeded intermediate copies of the input text. – kevin cline Sep 10 '17 at 18:59
  • @kevin-cline The problem is deciding which of these approaches to go with. They all have their own complexities. – rgchris Sep 10 '17 at 19:03
  • @kevincline Note the rebol tag: there is no regex, and Parse can act like one big finite state machine. Making copies of portions of the input string is one consideration, as is over-processing each instance of an escape sequence, a danger each of the approaches above carries in its own way. – rgchris Sep 10 '17 at 19:10
  • Suggest putting rebol in the question instead of leaving it for the tag. – kevin cline Sep 11 '17 at 03:14
  • @kevincline It's less about *how* and more about *where* in the tokenization process you recognize and resolve escape sequences (see first answer). It isn't exclusively a Rebol question; I just use Rebol notation to illustrate the different approaches, and it is a frequently encountered topic in Rebol parsing. – rgchris Sep 11 '17 at 05:36
  • The universe seems to have a deep crack in it, in that so many things break down to two irreconcilable cases! Duality. –  Mar 23 '18 at 12:24

1 Answer


Most parsers decode string escapes during tokenization. Note that the tokenizer needs to be aware of escapes anyway in order to determine the end of a string, since the string delimiter " itself may be escaped. That means it is not fundamentally more complicated to decode escapes in that same pass as well.
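
As a sketch in the question's notation: even a rule that does nothing more than locate the closing quote already has to recognize escape lead-ins, otherwise an embedded \" would terminate the string early.

plain: complement charset {"\}

; skip over the string body without decoding anything; the escape lead-in must
; still be recognized so that \" is not mistaken for the closing delimiter
string-span: [{"} any [some plain | "\" skip] {"}]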

It is not worse to first recognize the string literal and then process escapes. In low-level languages (like C) a two-phase approach would allow you to know the exact size of the decoded string before you process escapes. That argument does not apply in your case. It may still be a good idea if this approach turns out simpler, but your de-escape function will amount to something like the single-pass tokenizer anyway.

A tokenizer should produce one token per literal, because that makes the token stream much easier for its consumers to handle. In particular, a string literal split into multiple tokens would be concatenated into a single string anyway at some point, so we ought to do that right away.

A notable scenario where that is not the case is complicated languages with string “literals” that allow interpolation – tokenizing those elegantly is challenging, and would likely lead to something like your “By Escaped/Un-Escaped Tokens” suggestion. As another example, the XML document model may contain multiple consecutive text nodes. XML entities are tricky because a tokenizer will have to parse the DTD before entities can be expanded, and because eagerly expanding entities opens the program to a denial-of-service attack.

TL;DR: Prefer your “By Whole Tokens” approach, ideally implemented via your “match portions of the string and append to a buffer” idea.

amon
  • That is a great articulation of the merits of that method, and one I would say I'm inclined toward. I guess where I've been butting up against a problem is where I try, in the first pass, to identify an escape sequence that appears valid yet would resolve to an error/warning (in order to bind the error/warning to its location in the source). Using JSON again as an example: `["\uD800"]`. In order to catch that in the first pass, I'd run the same resolution code as in the second, thereby repeating what could potentially be a costly process. – rgchris Sep 10 '17 at 20:14
  • @rgchris Handling surrogate pairs is the most difficult part of JSON. It would be feasible to handle Unicode in a second pass after escape processing, though as you've noted that makes good error messages difficult. Alternatively, you can check whether an escape is part of a surrogate pair, then expect (as part of your grammar) that it must be followed immediately by the second part of the pair, as another `\uXXXX` escape. Everything else can be treated as a syntax error. – amon Sep 10 '17 at 21:15
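
A sketch of that surrogate-pair check expressed as a grammar rule, in the question's notation and with hypothetical charset names, might be:

hex: charset [#"0" - #"9" #"a" - #"f" #"A" - #"F"]
hex-not-d: charset [#"0" - #"9" #"a" - #"c" #"e" - #"f" #"A" - #"C" #"E" - #"F"]
high-mark: charset "89abAB"                    ; \uD8xx-\uDBxx: leading surrogate
low-mark: charset [#"c" - #"f" #"C" - #"F"]    ; \uDCxx-\uDFxx: trailing surrogate
bmp-mark: charset [#"0" - #"7"]                ; \uD0xx-\uD7xx: not a surrogate at all

; string parsing is case-insensitive by default, so "\ud" also matches "\uD"
unicode-escape: [
      "\ud" [
          high-mark 2 hex "\ud" low-mark 2 hex  ; a leading surrogate only matches
                                                ; when its trailing half follows
        | bmp-mark 2 hex                        ; \uD000-\uD7FF is an ordinary code point
      ]
    | "\u" hex-not-d 3 hex                      ; every other \uXXXX escape
]

A lone or mismatched surrogate simply fails the rule, so it can be reported as a syntax error at that position rather than being resolved twice.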