I'm working on parsers that process not only delimited content, but also escape sequences within certain portions of that content. I'm contemplating the efficiency of several approaches to tokenization and would appreciate thoughts on the following.
To use JSON as but one example, there are a few ways to process the following:
["Foo\n"]
Consider that the input stream might be multi-megabyte, with lots of varied escaped/un-escaped sequences.
Approaches:
By Character
The first option is simply to tokenize by character:
[array string #"F" #"o" #"o" #"^/" /string /array]
Pro: uniform, quick to implement, works well with streamed content
Con: not at all efficient, as the token handler is invoked nearly as many times as there are characters in the input stream.
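
For concreteness, a minimal sketch of how such a character-level rule might look; emit, tokens and plain-char are names assumed here for illustration, and only the \n escape is handled:

tokens: copy []
emit: func [token][append tokens token]

plain-char: complement charset {\"}    ; any character other than backslash or quote

char-level: [
    some [
        "[" (emit 'array)
        | "]" (emit /array)
        | {"} (emit 'string)
            any [
                set c plain-char (emit c)
                | "\n" (emit #"^/")    ; remaining escapes omitted for brevity
            ]
          {"} (emit /string)
    ]
]

parse {["Foo\n"]} char-level
; tokens is now [array string #"F" #"o" #"o" #"^/" /string /array]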
By Escaped/Un-Escaped Tokens
A somewhat more efficient tokenizer might result in:
[array string "Foo" "^/" /string /array]
Pro: Somewhat more efficient, quick to implement
Con: There are still a lot of tokens for heavily escaped content, and the handler can't assume that two adjacent tokens represent one value or two
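
A sketch of the string portion of such a rule, reusing the emit function from the sketch above; unescaped is again an assumed charset, and only the \n escape is handled (the array brackets are tokenized as before):

unescaped: complement charset {\"}    ; a run of these needs no translation

string-chunks: [
    {"} (emit 'string)
    any [
        copy part some unescaped (emit part)
        | "\n" (emit "^/")    ; remaining escapes omitted for brevity
    ]
    {"} (emit /string)
]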
By Whole Tokens
A minimal tokenizer might produce the following:
[array "Foo^/" /array]
Pro: Far fewer tokens to handle
Con: This raises many questions, chief amongst them: how is the string "Foo^/" created? To break this down, I'll consider two sub-approaches:
Match the sequence, then resolve escapes:
This might be handled thus:
[
    "[" (emit 'array)
    | "]" (emit /array)
    | {"} copy value string-sequence {"} (emit de-escape value)
]
Pros: Quickly identify matches, use and modify a single string
Cons: This is effectively a two-pass process. There may be two separate rules that match escape sequences (one in string-sequence and one in de-escape), so there's extra effort in ensuring they stay consistent.
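
To illustrate that duplication, a minimal sketch of what the two pieces might look like (the unescaped charset is an assumption, and only the \n escape is covered); note that knowledge of the \n mapping appears in both places:

unescaped: complement charset {\"}

; pass 1: recognise the span of the string without translating it
string-sequence: [any [some unescaped | "\n"]]

; pass 2: translate the escapes in the copied span
de-escape: func [value][replace/all value "\n" "^/"]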
Match portions of the string and append to a buffer:
This might look like:
[
    "[" (emit 'array)
    | "]" (emit /array)
    | {"} (buffer: make string! "") some [
        copy part unescaped-sequence (append buffer part)
        | "\n" (append buffer "^/")
    ] {"} (emit buffer)
]
Pros: One pass
Cons: Now we're back to handling chunks much as in the 'By Escaped/Un-Escaped Tokens' approach, with the added overhead of managing a separate value buffer.
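
For reference, a sketch of the buffered variant fleshed out into something runnable; emit and unescaped-sequence are filled in with assumed definitions, and only the \n escape is handled:

tokens: copy []
emit: func [token][append tokens token]

not-special: complement charset {\"}
unescaped-sequence: [some not-special]

buffered: [
    some [
        "[" (emit 'array)
        | "]" (emit /array)
        | {"} (buffer: copy "") some [
            copy part unescaped-sequence (append buffer part)
            | "\n" (append buffer "^/")
        ] {"} (emit buffer)
    ]
]

parse {["Foo\n"]} buffered    ; tokens is now [array "Foo^/" /array]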