14

I am designing a file format and I want to do it right. Since it is a binary format, the very first byte (or bytes) of the file should not form valid textual characters (just like the PNG file header¹). This allows tools that do not recognize the format to see, from the first few bytes alone, that it's not a text file.

Any byte value above 0x7F is invalid US-ASCII, so that's easy. But for Unicode it's a whole different story. Apart from valid Unicode characters there are private-use characters, noncharacters and sentinels, as I found in the Unicode Private-Use Characters, Noncharacters & Sentinels FAQ.

What would be a sentinel sequence of bytes that I can use at the start of the file that would result in invalid US-ASCII, UTF-8, UTF-16LE and UTF-16BE?

  • Obviously the first byte cannot have a value below 0x80, as that would be a valid US-ASCII (control) character; so 0x00 cannot be used.
  • Also, since private-use characters are valid Unicode characters, I can't use those codepoints either.
  • Since it must work with both little-endian and big-endian UTF-16, a noncharacter such as 0xFFFE is also not possible, as its byte-swapped form 0xFEFF is a valid Unicode character (the byte order mark).
  • The above-mentioned FAQ suggests not using any of the noncharacters, as that would still result in a valid Unicode sequence; so something like 0xFFFF is also out of the picture.

What would be the future-proof sentinel values that are left for me to use?
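To make these constraints concrete, a candidate can be run through each decoder. The sketch below is mine (the helper name `is_valid_in` is made up), using Python's strict codecs, which raise on invalid input; it confirms, for example, that the noncharacter 0xFF 0xFF still decodes as UTF-16, so noncharacters are indeed out.

```python
# A candidate sentinel must be rejected by every decoder below.
# Python's strict codecs raise UnicodeDecodeError on invalid input.
def is_valid_in(data: bytes, encoding: str) -> bool:
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

candidate = b"\xff\xff"  # the noncharacter U+FFFF, as a test case
for enc in ("ascii", "utf-8", "utf-16-le", "utf-16-be"):
    print(enc, "accepts" if is_valid_in(candidate, enc) else "rejects", candidate)
```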


1) The PNG format has as its very first byte the non-ASCII value 0x89, followed by the string PNG. A tool that reads the first few bytes of a PNG can determine it is a binary file, since it cannot interpret 0x89 as text. A GIF file, on the other hand, starts directly with the valid and readable ASCII string GIF, followed by three more valid ASCII characters, so a tool might determine it is a readable text file. This is what I consider wrong; the idea of starting the file with a non-textual byte sequence came from Designing File Formats by Andy McFadden.

Daniel A.A. Pelsmaeker
  • 3
    `Since it is a binary format, the first bytes of the file should not form valid textual characters` - You should look at the magic file (/usr/share/magic, or /etc/magic on many unix systems) that shows how the `file` utility identifies file types. A PNG file starts out with `\x89PNG\x0d\x0a\x1a\x0a` -- note the "PNG" in there, that's a raw string. The sequences `\x89` and the like are non-printable bytes. –  Mar 13 '13 at 15:29
  • @MichaelT Yes, since PNG is a binary format, the first byte does not form a valid textual character. That's what I meant. I fail to see your point? – Daniel A.A. Pelsmaeker Mar 13 '13 at 15:36
  • Off the top of my head I believe Gifs have "GIF" in their first few bytes too. – Rig Mar 13 '13 at 15:47
  • 7
    That was an example. A .gif starts out with `GIF8`. A SGI movi file starts out with `MOVI`. One style of zip archive file starts out with `ZZ`, the more popular pkzip format starts out with `PK`. The constraint that the first byte be an invalid text character does not seem to match what is found in the wild. I am curious why this is a requirement. –  Mar 13 '13 at 15:51
  • @MichaelT : I added a response to your concerns at the bottom of my post. – Daniel A.A. Pelsmaeker Mar 13 '13 at 16:03
  • 3
    Do you really care how other programs behave when they see an unknown file? To me, a signature sequence (like PNG's) is much more useful than a sentinel sequence - when the content is sent through a simple stream protocol, the receiver can immediately decide how to handle the following bytes. An omni-sentinel sequence is next to no sequence at all once everyone starts using it to identify their own format. – Codism Mar 13 '13 at 16:51
  • @Codism After the sentinel bytes my file format, too, will have a short ASCII string identifying the file. The sentinel bytes are simply part of the magic signature, just like in PNG files. But PNG files start with the `0x89` sentinel value, which is a valid UTF-8 value, so I'm looking for a different sentinel. – Daniel A.A. Pelsmaeker Mar 13 '13 at 17:10
  • @Virtlink `0x89` is a continuation byte; a file that starts with `0x89` is never valid UTF-8. – Esailija Mar 13 '13 at 17:37
  • I fail to see how using an ASCII character in the header is 'wrong' - especially since many, many programs do this without a problem. Why does it matter how other tools handle the file? Ultimately, it's a binary file; it's going to have non-textual data in there... open it up in a text editor, and it's going to look like gibberish regardless of what the header looks like. – GrandmasterB Mar 13 '13 at 19:09
  • 1
    @GrandmasterB What are the _great_ benefits of _not_ doing it? Is it in any way better? Is it so much effort to do it? In a new file format one can pick any magic bytes, so why not pick some that may have at least some benefits, no matter how small? – Daniel A.A. Pelsmaeker Mar 13 '13 at 19:17
  • 2
    @Virtlink, I don't particularly care what bytes you use in your file format. But you made an assertion that it's 'wrong' to use ASCII characters... yet I've not seen anything here that supports that claim, and there's plenty of empirical experience that shows it really doesn't matter (i.e., the countless file formats that have been using ASCII characters without a problem for decades) – GrandmasterB Mar 13 '13 at 19:49
  • 1
    @GrandmasterB Well, I don't know if it's wrong. I got the idea [from here](http://www.fadden.com/techmisc/file-formats.htm) and I liked it. In general there is a lot that is 'wrong' about many file formats (such as lack of extensibility and future-proofing). That _everyone starts with ASCII characters so I should start with ASCII characters_ does not convince me. – Daniel A.A. Pelsmaeker Mar 14 '13 at 00:10

3 Answers

16

0xDC 0xDC

  • Obviously invalid UTF-8 and ASCII
  • Unpaired trail surrogate in lead position regardless of endianness in UTF-16. It doesn't get more invalid UTF-16 than that.
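A quick check of these claims (my sketch, not part of the original answer), using Python's strict decoders:

```python
# 0xDC 0xDC: in UTF-16, either endianness, the code unit 0xDCDC is a lone
# trail surrogate, so strict decoders reject it; it also fails ASCII and UTF-8.
sentinel = b"\xdc\xdc"
for enc in ("ascii", "utf-8", "utf-16-le", "utf-16-be"):
    try:
        sentinel.decode(enc)
        print(enc, "accepts (unexpected)")
    except UnicodeDecodeError:
        print(enc, "rejects")
```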
Esailija
  • But perfectly reasonable ISO-8859-1, and probably reasonable in any other character set that uses an 8-bit encoding. – parsifal Mar 13 '13 at 23:06
  • 4
    +1 OP didn't ask for ISO 8859-1, just US-ASCII and UTF-*. – Ross Patterson Mar 14 '13 at 00:00
  • @RossPatterson - true, but I suspect that's mostly because the OP hasn't really thought through the problem. Without any statistics to back me up, I'm willing to bet that a random "is this text" algorithm is more likely to give preference to ISO-8859-1 than UTF-16, simply because there's an enormous amount of 8-bit text in the world. – parsifal Mar 14 '13 at 12:57
  • 3
    @parsifal Any binary is valid ISO-8859-1 so it doesn't need to be considered simply because it's impossible to make invalid ISO-8859-1. – Esailija Mar 14 '13 at 13:05
  • @Esailija - valid, yes, but "text" files don't usually contain control characters (outside of the limited set of whitespace characters). – parsifal Mar 14 '13 at 14:40
  • 1
    @parsifal true and if that was the requirement you could just use `0x00` or whatever, but op didn't want that. – Esailija Mar 14 '13 at 14:42
5
  • In UTF-8, the bytes C0, C1, and F5 - FF are illegal. The first byte must either be ASCII or a byte in the range C2 - F4; any other starting byte is not valid UTF-8.

  • In UTF-16, the file normally starts with the Byte Order Mark (U+FEFF), otherwise applications have to guess at the byte order. Code units in the range D800-DBFF are the leading halves of a surrogate pair, and DC00-DFFF are the trailing halves; a trailing surrogate that is not preceded by a leading surrogate is invalid.

Thus, I'd use the byte combination 0xF5 0xDC. These two bytes are:

  • Not ASCII
  • Not valid UTF-8
  • In UTF-16LE, read as the code unit 0xDCF5, a trailing surrogate without a leading surrogate (not legal); in UTF-16BE, read as the codepoint U+F5DC, which is a private-use character, but only by applications that stubbornly try to interpret this as UTF-16 even without a BOM.

If you need more options, F5DD through to F5DF all have the same 3 properties, as do F6DC - F6DF, F7DC - F7DF and F8DC - F8DF, for a total of 16 different byte combos to pick from.
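As a sketch of these claims (mine, using Python's strict decoders): only the BOM-less UTF-16BE reading succeeds, yielding the private-use codepoint U+F5DC, exactly as described above.

```python
magic = b"\xf5\xdc"
for enc in ("ascii", "utf-8", "utf-16-le", "utf-16-be"):
    try:
        decoded = magic.decode(enc)
        # only utf-16-be succeeds, yielding U+F5DC (a private-use character)
        print(enc, "accepts:", ascii(decoded))
    except UnicodeDecodeError:
        print(enc, "rejects")
```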

Martijn Pieters
  • So, by [Esailija's suggestion](http://programmers.stackexchange.com/a/190417/48260) to use U+DCDC, `0xDC` would be valid UTF-8? – Daniel A.A. Pelsmaeker Mar 13 '13 at 16:38
  • 2
    @Virtlink `0xDC` is a UTF-8 lead byte for a 2-byte sequence. It must be followed by a `10xxxxxx` continuation byte for it to be valid. `0xDC` is not a valid continuation byte, so `0xDC 0xDC` is not valid UTF-8. – Esailija Mar 13 '13 at 16:40
  • @Virtlink: No, because the second byte is not valid, it would have to be in the range `80` - `BF`. – Martijn Pieters Mar 13 '13 at 16:45
2

If you're trying to use a non-printable character to indicate "not text," then you'll find it hard to beat 0x89:

  • It's outside the US-ASCII range
  • In ISO-8859-1 it's a non-printable character ("CHARACTER TABULATION WITH JUSTIFICATION"). Likewise with Shift-JIS, which I believe is still in common use. Other 8-bit encodings may, however, treat this as a valid character.
  • In UTF-8 it's an invalid first byte for a multi-byte sequence (top bits are 10, which are reserved for bytes 2..N of a multi-byte sequence)

Generally, when you form magic numbers, "non-text" is a minor point. I'll have to look up the reference, but one of the standard graphics formats (TIFF, I think) has something like six different pieces of useful information from its magic number.
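For illustration (my sketch, not the answerer's): the full PNG signature fails strict UTF-8 decoding at the very first byte, because 0x89 can only appear as a continuation byte, never at the start of a sequence.

```python
header = b"\x89PNG\r\n\x1a\n"  # the real 8-byte PNG signature
try:
    header.decode("utf-8")
except UnicodeDecodeError as err:
    # 0x89 has top bits 10, so it cannot start a UTF-8 sequence
    print("not UTF-8:", err.reason)
```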

parsifal