163

On my filesystem (Windows 7) I have some text files (these are SQL script files, if that matters).

When opened with Notepad++, in the "Encoding" menu some of them are reported to have an encoding of "UCS-2 Little Endian" and some of "UTF-8 without BOM".

What is the difference here? They all seem to be perfectly valid scripts. How could I tell what encoding the files have without Notepad++?

Marcel
  • 3,092
  • 3
  • 18
  • 19
  • 10
    There is a pretty simple way using Firefox. Open your file using Firefox, then View > Character Encoding. Detailed [here](http://codeftw.blogspot.ch/2009/07/how-to-find-character-encoding-of-text.html). – Catherine Gasnier Apr 15 '14 at 16:26
  • use heuristics. checkout `enca` and `chardet` for POSIX systems. – Janus Troelsen Nov 03 '14 at 21:00
  • 5
    I think an alternative answer is TRIAL and ERROR. `iconv` in particular is useful for this purpose. Essentially you run the corrupted strings/text through different encodings to see which one works. You win when the characters are no longer corrupted. I'd love to answer here with a programmatic example, but it's unfortunately a protected question. – Brandon Bertelsen Dec 07 '15 at 19:03
  • FF is using [Mozilla Charset Detectors](https://www-archive.mozilla.org/projects/intl/chardet.html). Another simple way is opening the file with MS word, it'll guess the files quite correctly even for various ancient Chinese and Japanese codepages – phuclv May 14 '18 at 03:39
  • If `chardet` or `chardetect` is not available on your system, then you can install the package via your package manager (e.g. `apt search chardet` — on ubuntu/debian the package is usually called `python-chardet` or `python3-chardet`) or via *pip* with `pip install chardet` (or `pip install cchardet` for the faster c-optimized version). – ccpizza Mar 27 '19 at 17:12
  • Have you tried Emacs? It looks like that you can see the current encoding with `C-h` `v` `buffer-file-coding-system` and Return key (ref: https://stackoverflow.com/a/10500912/5595995) – Cloud Cho Aug 09 '21 at 19:25
  • On Mac: `file --mime-type --mime-encoding config/checkstyle/checkstyle.xml` – 0x8BADF00D Aug 22 '23 at 20:00

5 Answers

142

Files generally indicate their encoding with a file header. There are many examples here. However, even reading the header you can never be sure what encoding a file is really using.

For example, a file with the first three bytes 0xEF,0xBB,0xBF is probably a UTF-8 encoded file. However, it might be an ISO-8859-1 file which happens to start with the characters ï»¿. Or it might be a different file type entirely.

Notepad++ does its best to guess what encoding a file is using, and most of the time it gets it right. Sometimes it does get it wrong though - that's why that 'Encoding' menu is there, so you can override its best guess.

For the two encodings you mention:

  • The "UCS-2 Little Endian" files are UTF-16 files (based on what I understand from the info here) so probably start with 0xFF,0xFE as the first 2 bytes. From what I can tell, Notepad++ describes them as "UCS-2" since it doesn't support certain facets of UTF-16.
  • The "UTF-8 without BOM" files don't have any header bytes. That's what the "without BOM" bit means.
vaughandroid
  • 7,569
  • 4
  • 27
  • 37
  • 3
    BOMs: http://msdn.microsoft.com/en-us/library/windows/desktop/dd374101%28v=vs.85%29.aspx – Jan Doggen Feb 15 '13 at 10:34
  • 3
    Why would a file that starts with a BOM be auto-detected as "UTF-8 without BOM"? – Michael Borgwardt Feb 15 '13 at 10:36
  • 3
    And if a file started with 0xFF,0xFE it should be auto-detected as UTF-16, not UCS-2. UCS-2 is probably guessed because it contains mainly ASCII characters and thus every other byte is null. – Michael Borgwardt Feb 15 '13 at 10:39
  • @MichaelBorgwardt You are definitely right on the UTF-2. The UCS-2/UTF-16 is a bit less clear. Will update my answer. – vaughandroid Feb 15 '13 at 10:48
  • Gah, meant to say "UTF-8" not "UTF-2" in my previous comment. – vaughandroid Feb 15 '13 at 11:01
  • 2
    With experience, alas, metadata (“headers”) can also be wrong. The database holding the information could be corrupted, or the original uploader could have got this wrong. (This has been a significant problem for us in the past few months; some data was uploaded as “UTF-8” except it was “really ISO8859-1, since they're the same really?!” Bah! Scientists should be kept away from origination of metadata; they just get it wrong…) – Donal Fellows Dec 08 '13 at 19:39
  • 4
    Actually I think it's "funny" that the encoding problem still persists in 2014, since no file in the world will begin with "ï»¿", and I'm very surprised when I see an HTML page which has been loaded with the wrong encoding.. It's a matter of probability. It's unthinkable to choose the wrong encoding if another encoding would avoid strange chars.. Looking for the encoding which avoids strange chars would work in 99,9999% of cases I guess. But still there are errors.. Also it's a very confusing message to use ASCII instead of UTF-8 to save space.. it's confusing junior developers this idea of perform.. – Revious Oct 18 '14 at 18:47
  • Floppy disks became obsolete. Encodings are still all there.. :o – Revious Oct 18 '14 at 18:48
  • 1
    "no file in the world" sounds to me like "no-one would ever do that". – bytepusher Aug 30 '15 at 17:56
  • as *you can never be sure what encoding a file is really using*, it can be used for malicious purposes ["When the browser isn't told what the character encoding of a text is, it has to guess: and sometimes the guess is wrong. Hackers can manipulate this guess in order to slip XSS past filters and then fool the browser into executing it as active code. A great example of this is the Google UTF-7 exploit"](http://htmlpurifier.org/docs/enduser-utf8.html#fixcharset-none) – phuclv May 14 '18 at 03:43
  • 1k like for "Notepad++ does its best to guess what encoding a file is using". – Abdollah Mar 29 '20 at 16:31
30

You cannot. If you could do that, there would not be so many web sites or text files with “random gibberish” out there. That's why the encoding is usually sent along with the payload as meta data.

In case it's not, all you can do is a “smart guess” but the result is often ambiguous since the same byte sequence might be valid in several encodings.
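
To make the ambiguity concrete, here is a small Python sketch (my own illustration): the same byte sequence decodes without error in several single-byte encodings, each giving different text, and only UTF-8 actually rejects it:

```python
# One byte sequence, several "valid" interpretations.
data = b"caf\xe9"   # 0xE9 is 'é' in Latin-1/Windows-1252, but 'Θ' in CP437

print(data.decode("latin-1"))   # café
print(data.decode("cp1252"))    # café
print(data.decode("cp437"))     # cafΘ  -- also decodes fine, just means something else

try:
    data.decode("utf-8")        # a lone 0xE9 byte is not well-formed UTF-8
except UnicodeDecodeError as err:
    print("not UTF-8:", err)
```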

Marco
  • 409
  • 3
  • 5
  • 2
    OK, then, does the Windows OS store that information (meta data) actually somewhere? In the registry probably? – Marcel Feb 15 '13 at 10:18
  • You're wrong. That is codepages- not quite the same. There are algorithms to guess at the Unicode encoding. – DeadMG Feb 15 '13 at 10:24
  • 7
    @Marcel: No. That's why "text files" are so problematic for anything except pure ASCII. – Michael Borgwardt Feb 15 '13 at 10:37
  • 1
    well notepad++ can do this, it can tell you if text file is utf-8 encoded or not – user25 Jan 01 '18 at 14:31
  • 1
    From the tools I tried, [this one](https://fix-encoding.com/) was the only that gave precise results, tried Cyrillic and non-standard Japanese. It uses [chardet](https://github.com/chardet/chardet) under the hood. Wish I could post it as an answer ;c – Klesun Jun 10 '21 at 23:14
1

The character encoding can generally not be determined completely. However, there are many hints:

  1. ASCII contains only byte values of 0x7F and below; it is originally a 7-bit encoding, but the byte values are simply zero-padded so the most significant bit is always zero;
  2. UTF-8 contains ASCII plus additional bytes for which the highest bit is set, e.g. the three most significant bits of a lead byte may be set to 110 to indicate that two bytes are used instead of one to encode a character. UTF-8 may contain a Byte Order Mark (BOM), but usually it doesn't. The encoding is identical to ASCII if only ASCII characters are present.
  3. UTF-16 is usually prefixed with a BOM, as it is a 16-bit encoding that may use either big- or little-endian byte order (the order of the bytes, not of the bits inside the bytes). As it is a 16-bit encoding where only the lowest 7 bits encode ASCII, it is usually easy for humans to recognize, and easily distinguished using statistical heuristics as well.
  4. There are many, many 8-bit encoding schemes, such as Windows-1252, also known as CP-1252. This encoding extends ISO-8859-1, which encodes the Latin-1 character set. This is by itself a form of extended ASCII.

UTF-16 is generally easy to recognize due to the common BOM and many bytes set to zero - at least for Western languages that use Latin-1. UTF-8 usually doesn't have a BOM, but the encoding scheme for additional characters is relatively easy to recognize.
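
Put together, those hints translate into a rough heuristic like the following Python sketch (my illustration, certainly not a complete detector - real tools such as `chardet` add statistical analysis on top; the file name is a placeholder):

```python
def guess_encoding(raw: bytes) -> str:
    """Very rough guess based on BOMs, the ASCII range and UTF-8 well-formedness."""
    if raw.startswith(b"\xff\xfe") or raw.startswith(b"\xfe\xff"):
        return "UTF-16 (BOM present)"
    if raw.startswith(b"\xef\xbb\xbf"):
        return "UTF-8 with BOM"
    if all(b < 0x80 for b in raw):
        return "ASCII (also valid UTF-8, Latin-1, Windows-1252, ...)"
    try:
        raw.decode("utf-8")
        return "UTF-8 without BOM (multi-byte sequences are well formed)"
    except UnicodeDecodeError:
        return "some 8-bit code page (Windows-1252, CP437, ...) - cannot tell which"

with open("example.txt", "rb") as f:   # placeholder file name
    print(guess_encoding(f.read()))
```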

A text editor that sees only ASCII bytes will usually interpret them as UTF-8 (now more and more the default) or Windows-1252. Some applications and languages simply keep to the system default, but nowadays many instead default to UTF-8 for all text; it has been the common default on Linux and Android for a long time now.


Older systems usually used a system-specific code page. One of the more recognizable ones - at least for Westerners - is IBM code page 437, as it was used for text-based windowing systems and a lot of ANSI art (sometimes incorrectly called ASCII art), going back to the days of DOS. However, these code pages are often not easily recognizable, which is why such art often looks wrong when the file is opened: it quite often falls back to the system default, such as Windows-1252.

It is extremely uncommon, but sometimes other character encodings are used. Some of those are "dialects" of ASCII such as IA5 that are just slightly different. More commonly though they would be text files using a code page for another country, where the first 128 codes are ASCII compatible.

If you come across such an encoding then you could convert to UTF-8, which is generally easy to recognize and can represent all the characters of the various code pages.

Maarten Bodewes
  • 337
  • 2
  • 14
0

Assume you have a file that is given to you just as a sequence of bytes, with no indication of the encoding, and you want either to determine an encoding consistent with the bytes or to reject the file.

You can first check whether the bytes are consistent with an encoding. For example, many byte sequences are not valid ASCII, or valid UTF-8, or valid UTF-16 or UTF-32.

And then you can check whether your data looks reasonable in some encoding. For example, lots of data might be valid in some Chinese encoding but look like complete nonsense. That has to be done carefully: for example, base-85 encoded data looks like nonsense even though it is valid ASCII or UTF-8.

Note that UTF-16 is interesting: in practice you can almost always detect that a file is UTF-16. For example, the bytes of a file containing “hello” in UTF-16 consist of the 5 ASCII bytes, each preceded or followed by a zero byte (depending on endianness). But most pairs of ASCII bytes are valid UTF-16, so many files with an even number of ASCII bytes could be UTF-16 with some very, very strange contents.
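
As a sketch of that zero-byte pattern (my example; the 40% threshold is arbitrary):

```python
# ASCII text encoded as UTF-16 interleaves each character with a NUL byte.
text = "hello"
print(text.encode("utf-16-le"))   # b'h\x00e\x00l\x00l\x00o\x00'
print(text.encode("utf-16-be"))   # b'\x00h\x00e\x00l\x00l\x00o'

# Crude check: many NULs in alternating positions suggests UTF-16 Latin-script text.
def looks_like_utf16(raw: bytes) -> bool:
    if len(raw) < 2:
        return False
    nul_even = sum(1 for b in raw[0::2] if b == 0)
    nul_odd = sum(1 for b in raw[1::2] if b == 0)
    return max(nul_even, nul_odd) > 0.4 * (len(raw) // 2)

print(looks_like_utf16("hello".encode("utf-16-le")))  # True
print(looks_like_utf16(b"plain ASCII text"))          # False
```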

gnasher729
  • 42,090
  • 4
  • 59
  • 119
-1

I tried to identify the encoding of three files that actually turned out to be encrypted, with no header, footer or checksum. chardet was no good, and hexadecimal comparison and string extraction produced nothing; what did work was assessing the software that created the files.

For your situation, though, you would want a hex dump and a visual inspection. As Windows 7 has none built in, a batch script would work; you could perhaps use copy or pipe tools. Here's one someone else made: https://stackoverflow.com/questions/27575910/how-to-convert-binary-to-hex-in-batch

I think this batch script complements the other answers, as it's an actual technique that would likely work on Windows 7 (I haven't tried it, though!)
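
If Python happens to be installed, a few lines give the same kind of hex-plus-ASCII view without a batch script (just a sketch; `somefile.sql` is a placeholder name):

```python
# Dump the first 64 bytes of a file in hex and ASCII, for eyeballing
# BOMs, interleaved NUL bytes, or other tell-tale patterns.
with open("somefile.sql", "rb") as f:
    head = f.read(64)

for offset in range(0, len(head), 16):
    chunk = head[offset:offset + 16]
    hex_part = " ".join(f"{b:02x}" for b in chunk)
    ascii_part = "".join(chr(b) if 0x20 <= b < 0x7f else "." for b in chunk)
    print(f"{offset:08x}  {hex_part:<47}  {ascii_part}")
```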