Is there a good single-byte delimiter for use with UTF-8 strings that isn't a null terminator?

Question

I'm looking for a quick way to split strings containing individual JSON payloads. Currently, I'm using newlines and searching for the newline ASCII character, but I figure if I start using utf-8 this could easily break.

Is there a single-byte character, other than a null terminator, that I can split strings on and that won't be thrown off by UTF-8 or appear in the JSON payload?

"if I start using utf-8 this could easily break" – How? I don't see how that could happen. — Jörg W Mittag, Feb 21 '17 at 19:12
So the newline ASCII character will never accidentally appear in UTF-8, you're saying? — Mikey A. Leonetti, Feb 21 '17 at 19:13
Yes. Characters in the ASCII range will be encoded exactly the same as in ASCII, characters outside of the ASCII range are encoded as a multi-octet sequence of 2-4 octets, all of which have their high-order bit (i.e. the 8th bit) set; since ASCII only uses 7 bits and the 8th is always 0, there is no possible confusion. — Jörg W Mittag, Feb 21 '17 at 19:15
Might be interested in [this answer](http://stackoverflow.com/questions/5847982/utf-8-string-delimiter) on another StackExchange. If you choose to use newline, you might want to add code to treat CR+LF, CR, and LF all the same way; this will address a number of issues that could come up if you work with Unix or Macintosh systems. — John Wu, Feb 21 '17 at 19:16
@JörgWMittag What you wrote actually answers my question in the way that I misunderstood things wonderfully! It would make for a fantastic answer. — Mikey A. Leonetti, Feb 21 '17 at 19:19
Note that, according to Wikipedia, the bytes 0xC0 and 0xC1, as well as the bytes 0xF5 and higher, are never allowed in UTF-8: https://en.wikipedia.org/wiki/UTF-8#Codepage_layout — Tanner Swett, Feb 21 '17 at 20:28

score 9 · Accepted Answer · answered Feb 21 '17 at 19:43

UTF-8 was specifically designed to be forwards- and backwards-compatible with ASCII. In particular, it has these two properties:

- the encoding of characters within the ASCII character set is the same in UTF-8 as it is in ASCII
- all other code points are encoded as a sequence of 2-4 octets (the original design allowed up to 6; RFC 3629 limits it to 4), all of which have their high-order bit (8th bit) set; since ASCII uses only 7 bits and always leaves the 8th bit unset, a single-octet ASCII character can never be mistaken for part of a multi-octet sequence, and vice versa

So, assuming that newlines work reliably for you using ASCII, they will also work reliably using UTF-8.
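
The two properties above can be checked empirically. This is only a small sketch: the helper name `utf8_bytes_are_unambiguous` and the sample string are made up for illustration, and checking a few characters is of course not a proof.

```python
def utf8_bytes_are_unambiguous(text: str) -> bool:
    """Return True if every character in `text` encodes as described above."""
    for ch in text:
        encoded = ch.encode("utf-8")
        if ord(ch) < 0x80:
            # ASCII: exactly one byte, identical to the ASCII code (high bit clear).
            if encoded != bytes([ord(ch)]):
                return False
        else:
            # Non-ASCII: several bytes, each one with the high bit set.
            if len(encoded) < 2 or not all(b & 0x80 for b in encoded):
                return False
    return True


sample = 'naïve "日本語" 😀\n'
assert utf8_bytes_are_unambiguous(sample)
# The only 0x0A byte in the encoded form is the real line feed at the end.
assert sample.encode("utf-8").count(b"\n") == 1
```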

You will have to deal with the different newline conventions of different operating systems, either by accepting all of \r\n (DOS, Windows), \r (classic Mac OS), and \n (Unix), or by specifying one and only one (the Internet Standards all use \r\n, because it is treated as a newline by all OSs, with maybe some additional garbage attached). And this is not even taking into account the various non-ASCII newline characters defined in Unicode.
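
A minimal sketch of the "accept all three" option, assuming the payloads arrive as one `bytes` buffer. The function name `split_records` is hypothetical. Python's `bytes.splitlines()` breaks only at the ASCII line boundaries (\n, \r, and \r\n), so the non-ASCII Unicode newline characters are deliberately not treated as separators here.

```python
def split_records(data: bytes) -> list[bytes]:
    """Split a buffer into records, accepting \\n, \\r and \\r\\n as separators."""
    # splitlines() treats \r\n as a single boundary, so it yields no
    # spurious empty record between the \r and the \n.
    return [line for line in data.splitlines() if line]


buffer = b'{"a": 1}\r\n{"b": 2}\r{"c": 3}\n'
print(split_records(buffer))
```

Blank lines are dropped, which also makes the function tolerant of a trailing separator.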

However, there is a problem: newlines are valid characters in JSON. They can appear between any two tokens, where they are ignored as whitespace.

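AFAICS, none of the usual whitespace characters is guaranteed not to appear in JSON. The spec (RFC 8259) allows exactly four of them between tokens: space, horizontal tab, line feed, and carriage return. Inside strings, control characters must be escaped, so a raw newline can only occur between tokens, but any producer that pretty-prints its output will put one there.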
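
To illustrate with Python's standard `json` module (used here purely as an example producer; the sample document is made up): the default compact output contains no raw newline even when a string value does, whereas pretty-printed output has newlines between tokens.

```python
import json

doc = {"text": "line one\nline two", "items": [1, 2]}

compact = json.dumps(doc)           # one line; the newline in the string is escaped
pretty = json.dumps(doc, indent=2)  # raw newlines inserted between tokens

assert "\n" not in compact
assert "\n" in pretty
# Both forms decode to the same value.
assert json.loads(compact) == json.loads(pretty) == doc
```

So newline delimiting is only safe if every producer is known to emit each document on a single line.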

One way to get around this is to enclose the JSON documents in a JSON array, making each JSON object an element of the outer array.
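
A rough sketch of the array approach, again using Python's `json` module; `pack_documents` and `unpack_documents` are hypothetical helper names. The trade-off is that the receiver must have the complete array before a standard parser can decode it.

```python
import json


def pack_documents(docs: list) -> str:
    """Serialize several JSON documents as elements of one outer array."""
    return json.dumps(docs)


def unpack_documents(stream: str) -> list:
    """Recover the individual documents; no delimiter scanning is needed."""
    return json.loads(stream)


payloads = [{"id": 1}, {"id": 2, "note": "multi\nline"}]
assert unpack_documents(pack_documents(payloads)) == payloads
```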

Another way would be to switch to a different language: as of version 1.2, YAML is a proper superset of JSON, meaning that every valid JSON document is also a valid YAML document. One feature that YAML has and JSON doesn't is a document end marker (a line containing only three dots), which allows you to put multiple documents into the same bytestream. So, if you just insert a YAML document end marker in between your JSON documents, you have a valid stream consisting of multiple YAML documents.

Enclosing the JSON documents into a JSON list is the most sane option. I was going to answer that, but you covered it. — Mike Nakis, Feb 21 '17 at 20:56
I might prefer a JSON dictionary, because then I don't need to worry about the order of items, I can leave out things that are not needed, and I can add things with backward compatibility. — gnasher729, Feb 22 '17 at 14:59

score 4 · Answer 2 · answered Feb 21 '17 at 19:23

If it doesn't appear in your payload, any single-byte ASCII character is a valid separator: the ASCII code points 0-127 are each encoded as a unique single byte, and no byte of a multi-byte sequence will ever match one of those values.

See Wikipedia on UTF-8.

Single-byte (ASCII) code points will always be encoded as 0xxxxxxx bits, whereas all bytes of multi-byte sequences will be encoded as 1xxxxxxx bits.

So, your line break byte (0x0A / dec 10 / bin 00001010) can only appear if you actually put a line feed there.
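
In practice this means the raw byte stream can be split on 0x0A before any UTF-8 decoding happens. A minimal sketch, assuming each payload was written on a single line; `parse_stream` is a hypothetical name.

```python
import json


def parse_stream(raw: bytes) -> list:
    """Split on the line feed byte first, then decode each chunk as UTF-8 JSON."""
    records = []
    for chunk in raw.split(b"\n"):
        if chunk.strip():  # skip empty chunks, e.g. after a trailing newline
            records.append(json.loads(chunk.decode("utf-8")))
    return records


# ensure_ascii=False keeps the non-ASCII characters as real multi-byte UTF-8.
lines = [json.dumps(d, ensure_ascii=False) for d in ({"name": "Jörg"}, {"city": "東京"})]
raw = ("\n".join(lines) + "\n").encode("utf-8")
assert parse_stream(raw) == [{"name": "Jörg"}, {"city": "東京"}]
```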
