
I am developing against a file spec that lists the data type for certain fields as

CHAR(<length>)

The spec is for a fixed width flat file. In most cases, possible values to populate the fields are obvious (either delineated in a list of choices, or simply alpha/numeric by design). One of the fields stores an error message, which is populated by the other party according to their system's behavior; it could be specified in a somewhat "freehand" manner.

I would like to make the assumption that this data will be stored as ASCII text (for the purpose of setting up data structures). Is there a way that I can ensure the file will not contain Unicode data in this field, based on the field being specified as simply CHAR(100)?

Note: I am coming from a SQL Server background, where I would use CHAR for ASCII and NCHAR for Unicode - I'm wondering if there is a similar way to specify the difference when creating an interface document.

mathewb
  • Is the file spec simply for a flat file? The normal way is to ask the provider of the spec to explain, as it is not clear. – mmmmmm Aug 22 '18 at 17:40
  • This may be relevant: https://stackoverflow.com/questions/3825390/effective-way-to-find-any-files-encoding – NoChance Aug 22 '18 at 18:16
  • @Mark Updated the question to reflect that - the spec is for a fixed width flat file. Also, your suggestion is exactly what I ended up doing, so I'll wait to hear back from them with clarification, and hopefully a more detailed spec. – mathewb Aug 22 '18 at 19:42
  • @mathewb You should not assume anything; if the spec is not clear, you should ask for clarification. You can ensure whatever you want, but if the other parties do not expect that, it's not going to work well. – Stop harming Monica Aug 22 '18 at 20:00
  • Your question is unclear. ASCII and Unicode are different things; you cannot contrast them in the way you do. ASCII is both a character set (i.e. literally an unordered collection of unique characters) and a character encoding (i.e. it assigns a single unambiguous bit pattern to each character from the character set so that there is a bijective function between characters and bit patterns). Unicode is also a character set (like ASCII), but it is *not* a concrete encoding like ASCII, it is only an abstract encoding. Unicode doesn't assign bit patterns to characters, only an abstract integer. – Jörg W Mittag Aug 22 '18 at 20:36
  • Therefore, it is simply *impossible* for a file to be "encoded in Unicode", since Unicode doesn't *have* an encoding. There are many, many, many different encodings for Unicode, like UTF-9, UTF-18, UTF-7,5, UTF-7, UTF-32LE, UTF-32BE, UTF-16LE, UTF-16BE, and UTF-8, to name just a few. Actually, the ASCII character set is a subset of Unicode, which makes the ASCII encoding a Unicode encoding for that particular Unicode subset. So, in some sense, any ASCII file *is* a "Unicode file", so distinguishing between ASCII files and Unicode files also doesn't make sense from that perspective. – Jörg W Mittag Aug 22 '18 at 20:39
  • @JörgWMittag Very helpful information. With this, and what others have said, I think that where I stand is a) go back to the provider and ask them to clarify, b) find out what encoding they are using to store information in the file, and c) request that they explicitly note in the spec that the field be populated only with ASCII characters (not ASCII encoding, as I've been referencing with confused terminology). – mathewb Aug 23 '18 at 02:38
  • And if that's the case, I'm not sure how I can improve this question without stating the obvious, so I probably should just close/delete it. – mathewb Aug 23 '18 at 02:40
  • If they only specify ASCII characters but not what encoding, then this doesn't help you one bit. They could be encoded as UTF-32, for example, or UTF-16. Or something completely different. How are you going to parse the file if you don't know how it is encoded? – Jörg W Mittag Aug 23 '18 at 02:41
  • @JörgWMittag Agreed, that's what I meant by part b) of my multi-part approach. – mathewb Aug 23 '18 at 02:44

1 Answer

When discussing a file format, it is important to be clear about the encoding:

  • Are we describing the file format in terms of bytes (octets) or in terms of decoded text?
  • For a binary format, what is the byte order for multi-byte numbers or Unicode encodings with multi-byte code units?
  • For a textual format, how is the file encoding communicated?

A “character” can mean anything from a byte (octet) to a 7-bit ASCII character, a UTF-16 code unit, a Unicode code point, or sometimes even a grapheme cluster.

  • For a binary format, the interpretation as an octet/byte is probably appropriate. An octet may represent non-ASCII characters. It may not be possible to decode those octets meaningfully unless their encoding is specified somewhere.

  • For a textual format, it all depends on the encoding. Usually this is a single-byte encoding like CP-1252 or a variable-length Unicode encoding like UTF-8. In either case, this may represent non-ASCII characters.
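To see why the encoding matters for a fixed-width spec, here is a small Python sketch (the sample error message is hypothetical) showing that the same text occupies a different number of bytes, with different values, under different encodings. A `CHAR(100)` field that means "100 bytes" and one that means "100 characters" are not the same thing once non-ASCII characters appear:

```python
# The same text produces different byte sequences and byte lengths
# depending on the encoding. Sample error message is hypothetical.
text = "Fehler: ungültige Eingabe"  # 25 characters, one of them non-ASCII

for encoding in ("cp1252", "utf-8", "utf-16-le"):
    data = text.encode(encoding)
    # cp1252: 25 bytes; utf-8: 26 bytes (ü takes two); utf-16-le: 50 bytes
    print(f"{encoding:10} {len(data):3} bytes  {data[:12]!r}...")
```

So until the provider states the encoding, the byte layout of the field is simply not determined by `CHAR(100)` alone.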

So in either case, it is not safe to assume that you will never see non-ASCII characters in that field.
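If you still want your own parser to enforce an ASCII-only assumption rather than silently mis-decode, a defensive check is straightforward. A minimal sketch in Python (the field offsets and function name are hypothetical; the real spec would define the offsets):

```python
def read_error_field(record: bytes, start: int = 0, length: int = 100) -> str:
    """Extract a fixed-width CHAR field, rejecting anything outside ASCII.

    Offsets are hypothetical placeholders for whatever the spec defines.
    """
    raw = record[start:start + length]
    if not raw.isascii():  # bytes.isascii() requires Python 3.7+
        raise ValueError(f"non-ASCII byte in field at offset {start}")
    # CHAR fields in fixed-width files are typically space-padded on the right
    return raw.decode("ascii").rstrip(" ")

# Usage with a fabricated 100-byte record:
record = b"Timeout contacting host".ljust(100, b" ")
print(read_error_field(record))  # -> Timeout contacting host
```

This fails loudly instead of guessing, which is usually preferable while the encoding question is still open with the provider.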

amon
  • I think I'm making sense of this. 1) What I really need to establish with the provider is what encoding (or for me coming from SQL Server, what collation) the text in the file is using. 2) When I say ASCII in my question, I am probably being unnecessarily narrow, and perhaps not even accurate, and what I really should be saying is single-byte encoding. (And this gets answered by answering the first issue.) – mathewb Aug 22 '18 at 19:50
  • @mathewb Collation is not the same as encoding. The encoding is how the bytes are converted into characters. – Stop harming Monica Aug 22 '18 at 19:56