utfnormal is a library that contains Unicode normalization routines. It includes pure PHP implementations, and automatically uses the php-intl extension if installed.

The main function to care about is UtfNormal\Validator::cleanUp(). It strips illegal UTF-8 sequences and characters that are illegal in XML and, if necessary, converts the string to Normalization Form C (NFC). See also "Unicode equivalence" on Wikipedia.
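
A minimal sketch of calling cleanUp() on untrusted input. It assumes the package was installed with Composer and that the autoloader sits at vendor/autoload.php next to the script; the sample strings and variable names are illustrative only.

```php
<?php
// Assumption: Composer's autoloader is at vendor/autoload.php next to this script.
require_once __DIR__ . '/vendor/autoload.php';

use UtfNormal\Validator;

// "e" followed by U+0301 COMBINING ACUTE ACCENT: valid UTF-8, but not in NFC.
$decomposed = "Cafe\u{0301}";

// A stray 0xFF byte can never appear in well-formed UTF-8.
$corrupt = "abc\xFFdef";

$clean1 = Validator::cleanUp( $decomposed );
$clean2 = Validator::cleanUp( $corrupt );

// The decomposed pair is composed into the single code point U+00E9.
var_dump( $clean1 === "Caf\u{00E9}" );

// preg_match() with the /u modifier only succeeds on well-formed UTF-8,
// so this checks that the cleaned string is safe to pass along.
var_dump( preg_match( '//u', $clean2 ) === 1 );
```

Because cleanUp() validates as well as normalizes, it is the call to reach for when the origin of the string is unknown, such as form input or uploaded files.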

If you know the string is already valid UTF-8, you can directly call:

  • UtfNormal\Validator::toNFC(),
  • UtfNormal\Validator::toNFK(),
  • or UtfNormal\Validator::toNFKC()

Each of these converts a given UTF-8 string to Normalization Form C, K, or KC if it is not already in that form. These functions assume that the input string is already valid UTF-8; if it contains corrupt characters, they may produce erroneous results.
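
Since these functions trust their input, one option is to check validity before calling them. The sketch below does that with a helper named toNfcStrict(), which is hypothetical and not part of the library; the autoloader path is likewise an assumption.

```php
<?php
// Assumption: Composer's autoloader is at vendor/autoload.php next to this script.
require_once __DIR__ . '/vendor/autoload.php';

use UtfNormal\Validator;

/**
 * Hypothetical helper (not part of utfnormal): refuse malformed input
 * instead of handing it to toNFC(), which assumes valid UTF-8.
 */
function toNfcStrict( string $text ): string {
    // preg_match() with the /u modifier fails on malformed UTF-8.
    if ( preg_match( '//u', $text ) !== 1 ) {
        throw new InvalidArgumentException( 'Input is not valid UTF-8' );
    }
    return Validator::toNFC( $text );
}

// "A" followed by U+030A COMBINING RING ABOVE composes to U+00C5.
$composed = toNfcStrict( "A\u{030A}" );
var_dump( $composed === "\u{00C5}" );

// Compatibility normalization also folds the "fi" ligature U+FB01
// into the two plain letters "f" and "i".
$folded = Validator::toNFKC( "\u{FB01}" );
var_dump( $folded === 'fi' );
```

Throwing on bad input is only one possible design; falling back to cleanUp() is another, though cleanUp() also removes characters that are illegal in XML, so the two paths are not equivalent.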

Performance is kind of stinky in absolute terms, though it should be speedy on pure ASCII text. ;) Text that can quickly be determined to already be in NFC is not too awful, but other input can quickly get uncomfortably slow, particularly Korean text (the Hangul decomposition/composition code is extra slow).
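
To see how this plays out on your own data, a rough timing loop is usually enough. The sketch below is only an illustration: timeCleanUp() is a hypothetical helper, the sample texts are made up, and the numbers will vary with the PHP version and with whether php-intl is installed.

```php
<?php
// Assumption: Composer's autoloader is at vendor/autoload.php next to this script.
require_once __DIR__ . '/vendor/autoload.php';

use UtfNormal\Validator;

/**
 * Hypothetical helper (not part of utfnormal): average wall-clock
 * seconds per cleanUp() call over $runs repetitions.
 */
function timeCleanUp( string $text, int $runs = 100 ): float {
    $start = microtime( true );
    for ( $i = 0; $i < $runs; $i++ ) {
        Validator::cleanUp( $text );
    }
    return ( microtime( true ) - $start ) / $runs;
}

// Made-up sample inputs: plain ASCII versus Hangul syllables.
$samples = [
    'ascii'  => str_repeat( 'The quick brown fox. ', 500 ),
    'hangul' => str_repeat( '한국어 ', 500 ),
];

foreach ( $samples as $label => $text ) {
    printf( "%-7s %.6f s per call\n", $label, timeCleanUp( $text ) );
}
```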

Bugs should be filed in Wikimedia's Phabricator under the "utfnormal" project.

To use it in your project, run composer require wikimedia/utfnormal.

This library was first introduced in MediaWiki 1.3 (rev:4965). It was split out of the MediaWiki codebase and published as an independent library during the MediaWiki 1.25 development cycle.
