pcre case insensitivity: handle escape sequences as raw bytes?

Question

Consider the following regex (no unicode):

Example:\x04\x05\x41

Suppose you search with this regex case-insensitively. Would you expect the final \x41 to be matched case-sensitively? The people I ask do expect that behavior, and that expectation is wrong.

Perldoc says:

Note that a character expressed as one of these escapes is considered a character without special meaning by the regex engine, and will match "as is".

This basically means that if the escaped character is a printable, non-special character, the escape is equivalent to just typing that character, and that is how it actually behaves. However, if the regex is case-insensitive, alphabetic characters do have a special meaning of "this character or the same character in the opposite case", and that special meaning is not ignored in \x sequences.

Escape sequences have "raw byte" semantics, but not always. Does it seem to you like a design awkwardness?

Is there a way to cause pcre to treat \x sequences as raw bytes that ignore case sensitivity? Perhaps as a flag? As a non-standard patch? If not, how, if at all, can the PCRE engine code be modified to provide such functionality?

There are many implementations of PCRE. What language would you be expecting the patch/extension in (one might be able to make an alternate to java.util.regex - languages that have regex as part of the core language would be a bit more difficult)? Lastly, and most importantly - why do you want this functionality - what is the problem you are trying to solve? — , Jun 23 '13 at 16:56
Regular expressions in the project tend to be like the one in original question and there are many of them. Doing [Ee][Xx][Aa]... every time is awkward. Using C pcre library — Muxecoid, Jun 23 '13 at 17:23

score 2 · Answer 1 · answered Jun 24 '13 at 12:56

However, if the regex is case-insensitive, alphabetic characters do have a special meaning of "this character or the same character in the opposite case", and that special meaning is not ignored in \x sequences.

You're reading too much into what constitutes "special meaning." In this case, it refers to single characters that direct interpretation of the input such as . and *. Escaping those characters in any form treats them as literals. In other words, . means match any character but \. or \x2e means match a period. (There are exceptions, such as \d, which is fine because there's never a need to escape a lowercase d.)
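To make the "special meaning" point concrete, here is a small sketch. It uses Python's re module as a stand-in, on the assumption that it treats these particular escapes the same way Perl and PCRE do; the variable names are just for illustration.

```python
import re

# An unescaped dot is a metacharacter: it matches any single character.
dot_matches_a = re.fullmatch(r".", "a") is not None

# Escaped in either form, it is a literal period and no longer matches "a".
backslash_dot_matches_a = re.fullmatch(r"\.", "a") is not None
hex_dot_matches_a = re.fullmatch(r"\x2e", "a") is not None

# The hex form still matches an actual period.
hex_dot_matches_period = re.fullmatch(r"\x2e", ".") is not None

print(dot_matches_a, backslash_dot_matches_a, hex_dot_matches_a, hex_dot_matches_period)
# prints: True False False True
```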

Special meaning does not include behavior brought on by using modifiers. Using i tells the engine to use a different algorithm for comparing single characters, which happens long after the expression itself has been parsed. The \x escape will have been applied when the expression itself was processed, meaning that \x41 will already have been interpreted as A.
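A quick way to see this for yourself is to run the pattern from the question with the case-insensitive flag. This sketch uses Python's re module rather than PCRE, assuming the two agree on this point (which matches what the question reports):

```python
import re

# The pattern from the question; \x41 is parsed into the literal "A"
# before any matching happens.
pattern = r"Example:\x04\x05\x41"
ci = re.compile(pattern, re.IGNORECASE)

# With the flag set, the escaped "A" matches both cases.
upper_hit = ci.search("example:\x04\x05A") is not None
lower_hit = ci.search("example:\x04\x05a") is not None

# Without the flag, only the exact byte value matches.
cs = re.compile(pattern)
cs_lower_hit = cs.search("Example:\x04\x05a") is not None

print(upper_hit, lower_hit, cs_lower_hit)
# prints: True True False
```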

Escape sequences have "raw byte" semantics, but not always. Does it seem to you like a design awkwardness?

Escapes originally came to exist as a way to embed characters in files that could not otherwise be represented using printable characters. If embedded as-is, these characters might be interpreted as control sequences or would simply be invisible when looking at them in an editor or on a printed page. Regular expressions co-opted this concept and added all sorts of additional escapes that had semantic rather than literal meaning, such as the \(...\) construct with which you're probably familiar.

How awkward that is would be a matter of opinion, but there really isn't any way to denote special meaning in strings without selecting one of the otherwise-valid characters to do it.

Is there a way to cause pcre to treat \x sequences as raw bytes that ignore case sensitivity?

No, there isn't. Perl doesn't allow it, so PCRE won't, either.

Perl regular expressions do have a way to enable or disable modifiers within spans of characters (see the Extended Patterns section of perlre):

/(?i:case-insensitive)case-sensitive/
/(?-i:case-sensitive)case-insensitive/i
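The two forms above can be tried out with a short sketch. Python's re module (3.6 or later) accepts the same scoped-modifier syntax, so it is used here as a stand-in for Perl; the variable names are illustrative only.

```python
import re

# (?i:...) turns case-insensitivity on for just the first span.
inner_ci = re.compile(r"(?i:case-insensitive)case-sensitive")

# (?-i:...) turns it off for the first span while the rest of the
# pattern stays case-insensitive via the global flag.
inner_cs = re.compile(r"(?-i:case-sensitive)case-insensitive", re.IGNORECASE)

print(bool(inner_ci.fullmatch("CASE-INSENSITIVEcase-sensitive")))   # True
print(bool(inner_ci.fullmatch("case-insensitiveCASE-SENSITIVE")))   # False
print(bool(inner_cs.fullmatch("case-sensitiveCASE-INSENSITIVE")))   # True
print(bool(inner_cs.fullmatch("CASE-SENSITIVEcase-insensitive")))   # False
```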

If you have your regular expression available as a string, you could develop a function that does the equivalent of s/((?<!\\)\\x[[:xdigit:]]{2})/(?-i:$1)/g, which would spare you having to modify PCRE.
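A minimal sketch of such a preprocessing function, written in Python for illustration (the name protect_hex_escapes is hypothetical). It is a variation on the substitution above rather than a literal port: instead of the lookbehind, it consumes backslash escapes left to right, so an escaped backslash directly before \x is not misread. It does not understand character classes, so a pattern like [\x41] would be broken by the rewrite.

```python
import re

# Match either a two-digit \x escape (captured) or any other
# backslash-escaped character (consumed so it cannot be misread).
_ESCAPE = re.compile(r"(\\x[0-9A-Fa-f]{2})|\\.", re.DOTALL)

def protect_hex_escapes(pattern: str) -> str:
    """Wrap each \\xHH escape in (?-i:...) so it stays case-sensitive."""
    def repl(m: "re.Match[str]") -> str:
        if m.group(1) is not None:
            return "(?-i:" + m.group(1) + ")"
        return m.group(0)  # some other escape: leave untouched
    return _ESCAPE.sub(repl, pattern)

rewritten = protect_hex_escapes(r"Example:\x04\x05\x41")
print(rewritten)  # Example:(?-i:\x04)(?-i:\x05)(?-i:\x41)

rx = re.compile(rewritten, re.IGNORECASE)
print(bool(rx.search("EXAMPLE:\x04\x05A")))  # True
print(bool(rx.search("example:\x04\x05a")))  # False
```

The rewritten string could then be handed to the PCRE compile call in place of the original, since PCRE accepts the same (?-i:...) syntax.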

As a non-standard patch? If not, how, if at all, can the PCRE engine code be modified to provide such functionality?

Anything's possible, but I'm not going to tear into PCRE and figure it out for you. You'd have to alter the RE parser so that it stores literals specified with \x as a case-sensitive span instead of a plain literal. You would run the risk of breaking expressions that depend on the standard behavior, so you'd also have to add a modifier that explicitly enables it.

By the way, if you're simply searching for one string inside another, using m// is overkill. Calling index() would be simpler and much more efficient.
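For comparison, a plain substring search looks like this in Python, where str.find plays roughly the role of Perl's index(). Note that it is an exact, case-sensitive search, so it only applies when you don't need the case-insensitive part at all.

```python
haystack = "prefix Example:\x04\x05A suffix"
needle = "Example:\x04\x05A"

# find() returns the offset of the first occurrence, or -1 if absent.
hit = haystack.find(needle)
miss = haystack.find("Example:\x04\x05a")

print(hit, miss)  # 7 -1
```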
