
I am working on a small plugin for a DNS server. I have a static list of domain names (sometimes subdomains too):

gaming.xyz.com
facebook.com
mail.example.com
blog.example.com

I want to check if a subdomain has a match in the above list. For example:

abcd.gaming.xyz.com and def.gaming.xyz.com => gaming.xyz.com
a.b.c.facebook.com and z.facebook.com => facebook.com

My approach is to use a trie with the domains inverted. In our example, com will be the root; xyz, facebook and example will be children of com; gaming will be a child of xyz, and mail and blog will be children of example.
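Roughly, I imagine each node of the inverted trie looking something like this (the field names and the child/sibling layout are just placeholders, not existing code):

```c
/* One node of the inverted trie: holds a single label ("com", "xyz",
 * "gaming", ...) and links to the labels one step further to the left
 * in the original domain name. */
struct trie_node {
    const char       *label;        /* label stored at this level          */
    int               is_terminal;  /* 1 if a listed domain ends here      */
    struct trie_node *children;     /* first child of this node            */
    struct trie_node *next_sibling; /* next child of the same parent       */
};
```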

Is there a better approach?

psy

1 Answer


A trie is perfectly fine, but its complexity is not necessary. You can insert all your domains into a set data structure (e.g. an ordered array for binary search, or a hash set) and check whether the given domain or any of its parent domains is in that set. E.g. when given abcd.gaming.xyz.com you would try:

   is abcd.gaming.xyz.com in the set?
or is      gaming.xyz.com in the set?
or is             xyz.com in the set?
or is                 com in the set?
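
A minimal sketch of that loop in C, assuming a hypothetical set_contains() provided by whatever set implementation you end up using:

```c
#include <stdbool.h>
#include <string.h>

/* Assumed to exist: membership test of your chosen set structure
 * (sorted array + binary search, hash set, ...). */
bool set_contains(const char *domain);

/* Returns the matching suffix of `domain`, or NULL if none matches. */
const char *match_domain(const char *domain)
{
    const char *suffix = domain;
    for (;;) {
        if (set_contains(suffix))
            return suffix;        /* e.g. "gaming.xyz.com" */
        const char *dot = strchr(suffix, '.');
        if (dot == NULL)
            return NULL;          /* ran out of parent domains */
        suffix = dot + 1;         /* drop the leftmost label */
    }
}
```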

Your trie approach is generally equivalent, just more complex to implement. The exact running time complexities will depend on the exact data structure you use. E.g. for a domain with k subdomains and n domains in the set, a hash set would give you expected O(k) running time, and a binary search in an array would give you O(k · log n) running time. The trie cannot improve over the hash set approach, but a trie-like tree has a theoretical advantage over binary search because the “n” items are split up into smaller searches at each level of the trie: roughly n = mᵏ, giving us O(k² · log m) (assuming a log m lookup at each level). The exact value depends on the distribution of the data, and may in practice be more similar to a simple O(k · log n) search. But again, this helps nothing if you could use a hash table instead.

Note that a simple binary search in an array can benefit from tricks similar to a trie's if we match reversed domain names. For this purpose, either moc.zyx.gnimag.dcba or com.xyz.gaming.abcd would be fine. While the first is easier to implement, I'll use the latter here for clarity. During a binary search we can keep track of the range of entries that contain our current key as a prefix. So when we search for com first, we get a range of reversed domain names with com as prefix – for free. If that key isn't found, we can continue to search for com.xyz within that range, rather than restarting with the whole array. And so on until we have a match.
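
To make that concrete, here is a rough sketch in C. The array contents, the bound() helper and the match_reversed() name are only for illustration, and the query is assumed to have been converted to the reversed com.xyz.gaming.abcd form already:

```c
#include <stddef.h>
#include <string.h>

/* Listed domains with their labels reversed ("gaming.xyz.com" ->
 * "com.xyz.gaming"), sorted with strcmp. Building this array once at
 * startup is assumed to happen elsewhere. */
static const char *reversed_set[] = {
    "com.example.blog",
    "com.example.mail",
    "com.facebook",
    "com.xyz.gaming",
};
static const size_t set_len = sizeof reversed_set / sizeof reversed_set[0];

/* First index in [lo, hi) whose entry compares >= (or, if want_upper,
 * strictly >) the first `len` characters of `key`. */
static size_t bound(const char *key, size_t len,
                    size_t lo, size_t hi, int want_upper)
{
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int cmp = strncmp(reversed_set[mid], key, len);
        if (cmp < 0 || (want_upper && cmp == 0))
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;
}

/* `query` is in reversed-label form, e.g. "com.xyz.gaming.abcd".
 * Returns 1 if the query or one of its parent domains is in the set. */
int match_reversed(const char *query)
{
    size_t lo = 0, hi = set_len;
    size_t len = 0;               /* length of the prefix tried so far */

    for (;;) {
        /* Extend the prefix by one more label: "com", "com.xyz", ... */
        const char *dot = strchr(query + len + (len ? 1 : 0), '.');
        len = dot ? (size_t)(dot - query) : strlen(query);

        /* Narrow [lo, hi) to the entries that start with this prefix,
         * reusing the range found for the previous (shorter) prefix. */
        lo = bound(query, len, lo, hi, 0);
        hi = bound(query, len, lo, hi, 1);
        if (lo == hi)
            return 0;             /* nothing shares this prefix */

        /* An exact match is the shortest entry sharing the prefix,
         * so it can only sit at index lo. */
        if (reversed_set[lo][len] == '\0')
            return 1;

        if (dot == NULL)
            return 0;             /* query exhausted without a match */
    }
}
```

Reversing the query's labels beforehand (or, for the moc.zyx.gnimag variant, simply comparing the strings back to front) is left out here.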

The choice of data structure largely depends on whether a data structure implementation is already available (not necessarily a given in C), whether the set will change at run time, and whether you have any memory constraints. E.g. a trie will effectively compress the data, and tree data structures can be updated more efficiently than a sorted list. But trees also tend to be more complex and possibly have worse memory locality than ordered arrays or hash sets.

amon
  • The list of domains won't change at run time. The application will run on Raspberry Pi. Not a lot of memory. Yeah, I would have to implement my own hashing function in C if I were to choose the hash set approach. – psy Oct 07 '18 at 15:31
  • 1
    @psy [FNV-1a](http://www.isthe.com/chongo/tech/comp/fnv/index.html#FNV-1a) is a good hash function and very easy to implement. See also: [Which hashing function is best for uniqueness and speed?](https://softwareengineering.stackexchange.com/q/49550). For your hash table, it should be sufficient to resolve collisions through open addressing rather than malloc()ing memory for a linked list. – amon Oct 07 '18 at 16:14
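
For reference, a minimal 32-bit FNV-1a sketch in C (the constants are the published offset basis and prime from the page linked above):

```c
#include <stdint.h>

/* 32-bit FNV-1a: XOR in each byte, then multiply by the FNV prime. */
static uint32_t fnv1a(const char *s)
{
    uint32_t hash = 2166136261u;      /* FNV offset basis */
    while (*s != '\0') {
        hash ^= (unsigned char)*s++;
        hash *= 16777619u;            /* FNV prime */
    }
    return hash;
}
```

The table slot would then be fnv1a(domain) % table_size, probing forward on collisions as suggested in the comment above.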