22

Unicode has maybe 50 spaces

[\u0009\u000A-\u000D\u0020\u0085\u00A0\u1680\u180E\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]

and 6 line breaks

not only CRLF, LF, CR, but also NEL (U+0085), PS (U+2029) and LS (U+2028).

Maybe I could understand most of the spaces and PS ("Paragraph separator"), but what are "Next Line" and "Line separator" good for?

It all looks like it was invented by a very big committee where everybody wanted their own space and the leaders were granted one line break each. But seriously, how do you deal with it when your programming language doesn't support it (or does it wrong, as Java does, for example)?

maaartinus
  • 1
    How does Java do it "wrongly"? – Billy ONeal Jan 30 '11 at 01:19
  • Nearly completely, see http://stackoverflow.com/questions/4304928/unicode-equivalents-for-w-and-b-in-java-regular-expressions – maaartinus Jan 30 '11 at 01:23
  • 2
    @maaartinus: (I can't believe I'm defending Java of all things) Java's character classes are documented to apply to a specific set of characters. Unicode supplies more characters which look like they fit into these character classes, but Unicode does not define regular expression languages; only character encodings. Java behaves completely correctly according to its spec -- that is, to match typical whitespace. If you want it to match everything in the Unicode standard that might be seen as empty space, then you'll have to write that yourself. – Billy ONeal Jan 30 '11 at 01:29
  • @Billy: You're wrong -- Unicode defines everything: http://unicode.org/reports/tr18/ – maaartinus Feb 09 '11 at 12:26
  • @maaartinus: That's not correct. Unicode is not defining a regular expression language, they're defining in general what they think a "Unicode regular expression engine" should have. Java does not implement, nor has it ever claimed to implement, Unicode regular expressions. Finally, Java complies with the Unicode standard just fine -- what you posted is not from the Unicode standard. From the document itself: "A Unicode Technical Standard (UTS) is an independent specification. Conformance to the Unicode Standard does not imply conformance to any UTS." – Billy ONeal Feb 09 '11 at 13:38
  • Moreover, the indicated document post-dates Java by more than 13 years. Oracle is not free to *make* Java's default regular expression engine implement "Unicode Regular Expressions" because doing so would break existing code. So unless you have a time machine and can go give the document published in 2008 to the people designing Java in 1995, you can't really use that as justification. – Billy ONeal Feb 09 '11 at 13:40
  • 2
    Thx for the info. However, they're free to create a `Pattern.compile2010` method returning regexes working according to last year's definition. They're also free to create a method `Pattern.compileLatestUTS` which would explicitly state that the meaning would change according to the new specification. – maaartinus Feb 09 '11 at 14:51
  • 2
    Looks like Java eventually _did_ fix/modernize their regex implementation, using an opt-in flag to prevent backwards compatibility problems: http://stackoverflow.com/a/4307261/1172352 – peterflynn May 19 '16 at 00:24
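
The opt-in flag mentioned in the last comment can be sketched as follows. This is a minimal illustration, assuming Java 7 or later, where `Pattern.UNICODE_CHARACTER_CLASS` makes `\s` follow the Unicode White_Space property. The class and method names (`UnicodeWhitespaceDemo`, `isAsciiSpace`, `isUnicodeSpace`) are made up for this example and do not come from the thread.

```java
import java.util.regex.Pattern;

class UnicodeWhitespaceDemo {
    // Default behaviour: \s only matches the ASCII whitespace set [ \t\n\x0B\f\r].
    static final Pattern ASCII_WS = Pattern.compile("\\s");

    // Opt-in behaviour (Java 7+): \s follows the Unicode White_Space property.
    static final Pattern UNICODE_WS =
            Pattern.compile("\\s", Pattern.UNICODE_CHARACTER_CLASS);

    // True if the whole string is a single character matched by the default \s.
    static boolean isAsciiSpace(String s) {
        return ASCII_WS.matcher(s).matches();
    }

    // True if the whole string is a single character matched by the Unicode-aware \s.
    static boolean isUnicodeSpace(String s) {
        return UNICODE_WS.matcher(s).matches();
    }

    public static void main(String[] args) {
        String emSpace = "\u2003"; // EM SPACE, one of the U+2000..U+200A block
        System.out.println("default \\s matches EM SPACE: " + isAsciiSpace(emSpace));
        System.out.println("Unicode \\s matches EM SPACE: " + isUnicodeSpace(emSpace));
    }
}
```

Because the flag is opt-in, existing patterns compiled without it keep their old ASCII-only meaning, which is the backwards-compatibility concern raised earlier in the comments.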

1 Answer

16

Maybe I could understand most of the spaces and PS ("Paragraph separator"), but what are "Next Line" and "Line separator" good for?

NEXT LINE (U+0085) is often used as the newline character on EBCDIC systems (as 0x15). It's like CR+LF, but as one character.

LINE SEPARATOR (U+2028) and PARAGRAPH SEPARATOR (U+2029) are explained in section 5.8 of the Unicode standard, which describes them as a plain-text version of HTML <br> and <p>, to disambiguate these functions of "newline". But in practice, these characters don't get used much.
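
As for dealing with these characters when the language's defaults ignore them, one option is to spell out the line breaks yourself. The following is a rough sketch in Java, assuming you want exactly the six breaks named in the question (CRLF, LF, CR, NEL, LS, PS). The names `UnicodeLineSplit`, `LINE_BREAK` and `splitLines` are hypothetical helpers for this example, not an existing API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class UnicodeLineSplit {
    // CRLF is listed first so it is consumed as one break rather than two.
    // The class then covers LF, CR, NEL (U+0085), LS (U+2028) and PS (U+2029).
    static final Pattern LINE_BREAK =
            Pattern.compile("\\r\\n|[\\n\\r\\u0085\\u2028\\u2029]");

    // Splits text at every recognised line break; limit -1 keeps trailing empty lines.
    static List<String> splitLines(String text) {
        return Arrays.asList(LINE_BREAK.split(text, -1));
    }

    public static void main(String[] args) {
        String text = "one\r\ntwo\u0085three\u2028four\u2029five";
        for (String line : splitLines(text)) {
            System.out.println("[" + line + "]");
        }
    }
}
```

On Java 8 and later, the built-in `\R` linebreak matcher covers the same breaks plus vertical tab and form feed, so the hand-written pattern is mainly useful on older versions or when those two extra characters should not count as breaks.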

dan04