
I have the Wikipedia data dump and am trying to decode special characters in the page titles, but a lot of the characters don't match the "standard" ASCII encoding (referencing from here).

As an example, in Wikipedia ë and ã are given as:

ë = %C3%AB

ã = %C3%A3

Is there a defined key anywhere I can pull from?

1 Answer

It's UTF-8.
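
As a minimal sketch of what that means in practice, the titles can be percent-decoded with Python's standard library, telling it to read the escaped bytes as UTF-8. The helper name `decode_title` is made up for illustration, and the sketch assumes the titles contain only well-formed UTF-8 escapes:

```python
from urllib.parse import unquote


def decode_title(raw: str) -> str:
    """Percent-decode a title, reading the escaped bytes as UTF-8.

    `decode_title` is a hypothetical helper name, not part of any library.
    errors="strict" raises UnicodeDecodeError on bytes that are not valid
    UTF-8 instead of silently substituting a replacement character.
    """
    return unquote(raw, encoding="utf-8", errors="strict")


print(decode_title("%C3%AB"))  # the two bytes C3 AB decode to one character
print(decode_title("%C3%A3"))
```

Each of these characters takes two bytes in UTF-8, which is why a single character shows up as two `%XX` escapes.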

Besides, neither character is in ASCII. They're in various "extended ASCII" character sets, but these encodings are not ASCII, they're remnants of a wild west age of character encodings. Treat them as legacy encodings that civilized people like us may have to decode but ideally should never produce. At least for ASCII there is a single table which almost the entire western world can agree on (and the rest of the world if they use UTF-8), while "extended" character sets are so numerous that it's anyone's guess what any given byte above 127 means.
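
To illustrate how ambiguous a byte above 127 is, here is a small sketch that decodes the same single byte under a few legacy code pages. The selection of codecs is arbitrary and only meant as an example:

```python
# One byte above 127, decoded under several legacy 8-bit encodings.
raw = bytes([0xEB])

# An arbitrary handful of code pages, chosen for illustration only.
CODECS = ("latin-1", "cp1252", "cp437", "iso8859-5")

decoded = {codec: raw.decode(codec) for codec in CODECS}

for codec, char in decoded.items():
    print(f"{codec:>10}: {char!r}")

# The same byte is not even valid on its own in UTF-8,
# where ë is the two-byte sequence C3 AB.
print("ë".encode("utf-8"))
```

Some of these code pages agree with each other and some do not, which is the whole problem: the byte alone does not say which table to use.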

The page you're linking to tacitly assumes one of these many "extended" character sets and (if a quick search didn't betray me) fails to mention which. Now, in English texts it's often safe to assume some variant of Latin-1 (or ISO-whatsthenumber etc.) is implied, but it's still sloppy. Furthermore, as far as I am aware, percent-encoding only escapes bytes; there is by no means a universal standard for which character encoding those bytes should be interpreted as. Again, Latin-1 etc. are common but far from universal, even in English-language text. You should really get better sources.
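
A short sketch of why the assumed encoding matters: the same escapes from the question, decoded once as UTF-8 and once as Latin-1 (the assumption a table like the linked one would be making):

```python
from urllib.parse import unquote

escaped = "%C3%AB"  # the escape sequence from the question

# Same bytes, two different assumptions about what they encode.
as_utf8 = unquote(escaped, encoding="utf-8")
as_latin1 = unquote(escaped, encoding="latin-1")

print(repr(as_utf8), len(as_utf8))      # one character
print(repr(as_latin1), len(as_latin1))  # two unrelated characters
```

Under Latin-1 every byte becomes its own character, so a two-byte UTF-8 sequence turns into two characters that have nothing to do with the original title.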

  • "..nor part of the civilized world?" O_o – Robert Harvey Mar 22 '16 at 18:02
  • @RobertHarvey clearer now? –  Mar 22 '16 at 18:13
  • Certainly more politically correct, yes. – Robert Harvey Mar 22 '16 at 18:46
  • @RobertHarvey Now I'm curious: Did you read the earlier version as referring to the characters ë and ã? In retrospect I could see it being ambiguous, which was really unfortunate because it's the opposite of what I meant. I hate these 8-bit encodings to a large degree because they give English-only speakers who want to write Æ an excuse to not go straight to Unicode. Or is it politically incorrect to call antiquated inadequate character encodings "uncivilized" now? :P –  Mar 22 '16 at 18:55
  • If it wasn't already clear, it is now. **tl;dr:** *Go straight to Unicode.* – Robert Harvey Mar 22 '16 at 19:00