I'm having trouble finding an algorithm that matches (or fails to match) a substring of a string to a suffix from a list of suffixes. The hits I'm finding are for suffix trees and suffix arrays, but they are not quite what I am looking for.
My particular problem is a fully qualified hostname - possibly malformed - and trying to match the suffix provided by the banned suffix list published by Mozilla. As a concrete example, given w.x.y.z.example.com
, I need to identify it as hostname w
, domain example.com
, and know that example.com
is valid domain.
As a degenerative example, I may need to take example.com
and know that its a domain with no host, and that matching it to *.com
would be wrong. And the negative cases only get worse when you factor in suffixes like pvt.k12.ma.us
, nasushiobara.tochigi.jp
, 公司.cn
and السعودیة
.
The eventual application will be DNS hostname matching in X509 certificates. Hence the reason there may be malformed input from miscreants.
I think this algorithm should run in m log n. log n for each lookup into the [sorted] list of suffixes, and m for each DNS label.
I think it may be a case of not seeing the forest through the trees. Does anyone know of an efficient algorithm for matching substrings from a list of suffixes?