Typography is another frequent source of character substitution. Certain two-letter combinations may be replaced by ligatures during printing (e.g. ae might become æ), fractions expressed in text might be replaced with single characters, and symbols that are often mimicked in text may be replaced with dedicated characters (e.g. (c) becomes ©).
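Some of these substitutions can be undone mechanically. As a sketch, Unicode's NFKC "compatibility" normalization folds ligatures and fraction characters back into plain sequences, though it deliberately leaves characters like æ and © alone, since those are considered distinct characters rather than presentation forms:

```python
import unicodedata

# The "fi" ligature (U+FB01) and the vulgar fraction 1/2 (U+00BD) are
# typical typographic substitutions. NFKC normalization maps them back
# to plain character sequences.
print(unicodedata.normalize("NFKC", "\ufb01le"))  # the ligature becomes "fi"
print(unicodedata.normalize("NFKC", "\u00bd"))    # "1" + U+2044 FRACTION SLASH + "2"

# NFKC does NOT touch æ or ©, so those need project-specific handling.
print(unicodedata.normalize("NFKC", "æ ©"))       # unchanged
```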
Such transformations are often performed when documents are being prepared for printing, but printing doesn't necessarily mean toner on paper. PDF files are typically produced by a printing-like process, so if PDF files are an input to your process, you may need to consider alternative forms for your data. For example, if you're searching for text in a PDF file, you need to be aware of the ways that text may have been transformed on its way in.
One way to manage this sort of thing is to convert all data to a canonical form before using it. Choose a form that makes sense for your project. You might decide to exclude ligatures but keep accents and symbols where possible. Then create a set of filters that implement your rules and use them to canonicalize all data coming into the system. This lets you do a single search for words like naïve instead of having to search for both naïve and naive.
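A minimal canonicalization filter along these lines might combine NFKC normalization with a small project-specific replacement table for characters NFKC leaves alone. The table below is illustrative, not exhaustive; note that this filter keeps accents, so naïve stays naïve:

```python
import unicodedata

# Hypothetical replacement table for characters NFKC does not decompose.
EXTRA = str.maketrans({"©": "(c)", "®": "(r)", "æ": "ae", "Æ": "AE"})

def canonicalize(text: str) -> str:
    # Fold ligatures and fraction characters with NFKC, then apply
    # the project-specific replacements.
    return unicodedata.normalize("NFKC", text).translate(EXTRA)

print(canonicalize("\ufb01nd the ©opy"))  # → find the (c)opy
print(canonicalize("naïve"))              # accents kept → naïve
```

Running every input through one such function means the rest of the system only ever sees the canonical forms.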
The other way to manage it is to use unmodified data and try to deal with all the possible variations that might occur. If you don't have control over the data (e.g. you're searching through data on someone else's server) this may be the only reasonable option. Regular expressions can help here: you can search for na[iï]ve.
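The character-class approach above can be sketched like this; the pattern matches either spelling in place:

```python
import re

# A character class covering both the plain and the accented spelling.
pattern = re.compile(r"na[iï]ve")

for line in ["a naive approach", "a naïve approach", "a knave"]:
    print(bool(pattern.search(line)))  # True, True, False
```

The cost of this approach is that every variation has to be anticipated at search time, for every search, rather than being dealt with once at the door.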