Why exactly can't PHP have full unicode support?

Question

Everybody knows that PHP has problems with Unicode. Version 6 was effectively abandoned because of difficulties implementing Unicode. But does anyone know the exact reasons? Architecture or design problems, performance concerns, community problems (I bet not), something else?

Kornel · Answer 1 · 2015-08-24T22:53:12.657

17

PHP as a language definitely can have it, but I think the problem is with compatibility with existing programs. Unicode support can break them in subtle ways, which is the most annoying kind of bug to have.

Currently most string-processing functions in PHP are "binary-safe", which means you can use them to process any file in any encoding as well as binary formats like image data, etc.

With the addition of Unicode strings you'd have to be very careful not to mix Unicode strings with binary strings (pretty hard when your strings come from different sources and you never had to worry about it before). And you couldn't be ignorant about encodings any more (and lots of scripts are ignorant about this!).
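
To illustrate the kind of breakage described above, here is a minimal sketch in Python 3 (not PHP), which keeps text (`str`) and binary data (`bytes`) as separate types. The helper `try_concat` is a made-up name for this example; it is only meant to show what happens when the two kinds of string meet without an explicit encoding.

```python
# Illustration in Python 3, not PHP: text and binary strings are distinct types.
text = "na\u00efve"            # Unicode string: 5 code points ("naive" with i-diaeresis)
raw = text.encode("utf-8")     # binary string: 6 bytes, the accented letter takes 2


def try_concat(a, b):
    """Hypothetical helper: concatenate, or return the error name if the types do not mix."""
    try:
        return a + b
    except TypeError as exc:
        return type(exc).__name__


mixed = try_concat(text, raw)            # str + bytes is refused
explicit = text + raw.decode("utf-8")    # works once the encoding is stated

print(len(text), len(raw), mixed, explicit)
```

A script that never cared about encodings would hit the `TypeError` path at every point where data from a file, socket, or database meets a text literal, which is roughly the compatibility cost the answer is talking about.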

Another hard but solvable problem is random access in Unicode strings. The implementation of $string[$offset] changes from trivial to either very slow, or a little slow and very complex.
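
As a rough sketch of the "very slow" variant, the following Python function (an illustration, not PHP's implementation; `codepoint_at` is a hypothetical name) finds the n-th code point in UTF-8 bytes by scanning from the start. It assumes the input is valid UTF-8.

```python
def codepoint_at(utf8: bytes, index: int) -> str:
    """Return the code point at `index` by scanning the bytes from the start: O(n)."""
    if index < 0:
        raise IndexError(index)
    pos = 0    # current byte offset
    seen = 0   # code points passed so far
    while pos < len(utf8):
        lead = utf8[pos]
        # The lead byte tells how many bytes this code point occupies.
        if lead < 0x80:
            width = 1
        elif lead >> 5 == 0b110:
            width = 2
        elif lead >> 4 == 0b1110:
            width = 3
        elif lead >> 3 == 0b11110:
            width = 4
        else:
            raise ValueError("invalid UTF-8 lead byte")
        if seen == index:
            return utf8[pos:pos + width].decode("utf-8")
        pos += width
        seen += 1
    raise IndexError(index)


data = "h\u00e9llo \u20ac".encode("utf-8")   # 7 code points stored in 10 bytes

print(hex(data[1]))            # byte offset 1 is only the first half of the accented e
print(codepoint_at(data, 1))   # code point 1 is the whole accented e
print(codepoint_at(data, 6))   # the euro sign, which starts at byte 7, not byte 6
```

The "little slow and very complex" alternative would add some kind of offset cache or index on top of this scan, which is where the complexity comes from.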

Also, I think it was a mistake to choose UTF-16 as the internal encoding for PHP. It has the same problem as UTF-8 (variable width, because of surrogate pairs) and the inefficiency of UCS-2. Maybe they should scrap that and start again with UTF-8?
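
The variable-width point can be checked with a short Python sketch (an illustration only; `utf16_units` is a hypothetical helper). A character inside the Basic Multilingual Plane takes one 16-bit unit in UTF-16, while one outside it takes two, a surrogate pair.

```python
import struct


def utf16_units(s: str) -> int:
    """Hypothetical helper: number of 16-bit code units needed to store `s` in UTF-16."""
    return len(s.encode("utf-16-le")) // 2


bmp = "\u20ac"          # U+20AC EURO SIGN, inside the Basic Multilingual Plane
astral = "\U0001F600"   # U+1F600, outside the BMP

# Unpack the two 16-bit units that encode the single astral code point.
high, low = struct.unpack("<2H", astral.encode("utf-16-le"))

print(utf16_units(bmp), utf16_units(astral))   # one unit versus two
print(hex(high), hex(low))                     # the surrogate pair
print(len(bmp.encode("utf-8")), len(astral.encode("utf-8")))
```

So a UTF-16 string is only fixed-width as long as no character outside the BMP appears in it, which is the UCS-2 assumption.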

</speculation>

edited Aug 24 '15 at 22:53

answered Dec 27 '10 at 14:06

Kornel


totally agree with switching to utf8. – GrandmasterB Dec 27 '10 at 20:01
you think that UTF-16 is, apart from data chunk size, worse than UTF-8? – ts01 Dec 28 '10 at 14:39
UTF-16 isn't a variable width encoding in the same way that UTF-8 is. Surrogate pairs can be handled just like combining characters in almost every situation, so UTF-16 is not that much "harder" than UTF-32. But I certainly agree with the rest of your answer. – Dean Harding Dec 28 '10 at 19:24
@Dean Harding: I meant that you can't simply address code points by index, like you could with UCS-2 and UTF-32. – Kornel Dec 28 '10 at 21:22
@porneL: But you can, that's what the majority (all?) of UTF-16 implementations I've seen do (for example, C# does this, as does Win32). Things like cursor insertion, character deletion, etc. all work the same for surrogates as they do for combining characters (i.e. you can't place the cursor in between a combining character and its base character; that's the same for UTF-8, UTF-16 and UTF-32) – Dean Harding Dec 29 '10 at 02:41
@Dean Harding: I'm not saying that it's impossible to work with UTF-16 at all, only that *random access* (in *O(1)*) is not possible. UTF-16 doesn't guarantee that the 100th code point will start at the 200th byte, so to access the 100th code point you have to linearly scan all the previous ones (and a good implementation would cache the result, of course). In this regard it's similar to UTF-8 (i.e. access to the n-th character/code point is *O(n)*, not *O(1)*). – Kornel Dec 29 '10 at 10:55
@Dean: Things like collation or conversions between UTF-16 and UTF-8 most certainly do *not* work the same for surrogates as they do for combining characters. – dan04 Mar 04 '11 at 05:01
An excellent summary of the reasons to choose UTF-8 over UTF-16 (or any other encoding) can be found at http://utf8everywhere.org/. – Joachim Sauer Jan 23 '13 at 12:50

Paulo Scardine · Answer 2 · 2013-01-23T02:26:09.030

TLDR: many PHP libraries are just a thin layer over native C libraries that don't support unicode, or support it in ways that are incompatible with each other. Rectifying this situation is likely to introduce backward incompatible changes.

DISCLAIMER: as I switched from PHP to Python (never to look back) a few years ago, my opinion is clearly biased.

I think PHP is a nice and clever hack. As a hack, it started out unpretentious and grew somewhat chaotically from a bunch of sparse libraries, lacking a well-thought-out and unified design (from the computer language theory perspective).

As said by Machiavelli, "he who has not first laid his foundations may be able with great ability to lay them afterwards, but they will be laid with trouble to the architect and danger to the building".

For a programming language, the more popular it is, the harder it is to change. That is why languages like C change once every 10 years. For example, Python 3 made many backward-incompatible changes, and it was not pretty. The Unicode support in previous Python incarnations was already considered superior to the current state of affairs in PHP, but guess what: the most controversial changes in Python 3 are related to Unicode handling. This rant from Armin Ronacher summarizes the frustration of a huge share of the Python community.

PHP being "the" ubiquitous web platform makes it a victim of its own success. Bringing unified Unicode support to PHP is inevitable, but it will require a lot of blood, sweat and tears.

well, everyone agrees here, I suppose. But I was asking about the details ;) – ts01 Dec 28 '10 at 08:25
The problem is that many underlying libraries do not handle Unicode well, and that is a very hard problem to solve without starting from scratch. – Paulo Scardine Dec 29 '10 at 14:00
(fyi, "since a few years ago", PHP got better and Python worse) – ZJR Jan 23 '13 at 13:07
@ZJR: Nice to know, thanks. Would you be kind enough to point me to some reference material about this change? – Paulo Scardine Jan 23 '13 at 19:20

score 6 · Answer 3 · answered Jan 23 '13 at 03:20

One of the primary reasons the old PHP 6 work was stopped was the internal complexity it brought and the amount of work left to do, which barely anybody fully understood.

A bit of history: PHP 6's Unicode implementation was driven by the needs of a large PHP user and tried to do Unicode "right". After some evaluation, the primary designer of PHP's to-be Unicode support chose to add a new string type, internally UTF-16, and to allow different encodings to be used in different places. So the code might be written in one encoding, the output might use a different encoding, and "runtime operations" yet another. The reason for choosing UTF-16 was that the work was to be based on the ICU library, which uses UTF-16, and it was found that this encoding makes common string operations fast while conversion between UTF-8 and UTF-16 is relatively cheap. So far so good.

The first and foremost consequence of this is the introduction of a new string type. Until then PHP's internal type system had only a few types (NULL, bool, int/long, float/double, string, array, resource, object), and lots of code assumed that this was the case. Besides such assumptions, every function operating on strings, and there are a lot of those, had to be evaluated individually to decide how it should handle encodings. Should it work on binary strings or Unicode strings? If a conversion is required, which encoding should be used? This is a lot of work, and in some cases quite complicated to do right. Additionally, the internal APIs became quite complicated, as most key APIs in PHP got a version for binary strings (the old one), often a version for "runtime encoded" strings, and a version for UTF-16 strings, creating quite a mess there ...

In the process, many developers stumbled over the complexity, became annoyed by UTF-16, and didn't like the fact that it would more than double memory usage and spend lots of time converting strings, while breaking most existing applications. So, PHP being driven by volunteers, fewer and fewer developers worked on it, other things piled up, contributors became unhappy, and in the end it had to be abandoned.
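
The memory concern is easy to see for typical web content, which is mostly ASCII markup. A small Python sketch (an illustration, not a measurement of PHP itself; the sample string is made up) compares the payload size of the same text in both encodings. It counts only the encoded bytes, so any per-string overhead of a real runtime comes on top of this.

```python
# Made-up sample: 20 ASCII characters of markup, repeated 100 times.
sample = "<p>Hello, world!</p>" * 100

utf8_size = len(sample.encode("utf-8"))       # 1 byte per ASCII character
utf16_size = len(sample.encode("utf-16-le"))  # 2 bytes per ASCII character

print(utf8_size, utf16_size, utf16_size / utf8_size)
```

On top of the doubled payload, every string arriving from the outside world as UTF-8 has to be converted on the way in and converted back on the way out, which is the conversion time mentioned above.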

Now what might the future bring? There is a slow evolution in which more and more things in PHP are built around UTF-8, though not in a strong way with a custom type that forces everything, and currently the developers aren't motivated to touch this hot iron. One can hope that somebody comes up with a good proposal to make it work nicely, but currently "everybody" will run away as soon as they hear the word. :)

score 1 · Answer 4 · answered Jan 19 '11 at 14:30

I guess the actual reason is that the PHP development team lacks a clear roadmap for PHP development (just consider the pretty heated discussion when someone on php-internals decided to start a PHP 5.4 branch without first agreeing on what features 5.4 should contain). I like this language very much, but the way it's being developed makes me a bit worried.

Mchl


I left PHP for Python in 2006 after using it for 5 solid years -- Python has an incredible development process and good leadership -- plus the language is so much more terse, powerful, and consistent than PHP. The main challenge is to find the right web framework. We rolled our own -- AppStruct. – gahooa Jan 23 '13 at 02:55
Well, we had a roadmap for PHP 6. Didn't help ;) One of the roadmap issues is that PHP is driven by volunteers who appear (and if they have "good ideas" we want to keep them and add their features soon) and then suddenly disappear (getting married, changing jobs, ...) – johannes Jan 23 '13 at 02:55
Happily PHP 7 is a success. – Melroy van den Berg May 12 '16 at 14:03
5 years later and still with no 'full unicode support' :) – Mchl May 13 '16 at 08:12
