Algorithm to search very long blacklist in another very long data set

Question

I have two data sets. The first data set has approx. 50.000 movie and song titles and the second one have 20.000 blacklist strings. I am looking for the best algorithm to detect movie/song title which contains blacklisted word(s).

Example: Dataset #1

The Lord Of The Rings
E.T.
Star Wars
...
(50k items)

Blacklist Data set

Lord
Home Alone
Matrix
ar
...
(20k items)

Items in these data sets may be a character or a few words. String search algorithms like Boyer-Moore is not helping me with this since I have more than 1 needle to search in the haystack. I (probably) need to find an algorithm to find all combinations efficiently and later make a string search (regex maybe?) for each combination.

Well, after seeing many algorithm questions/discussions on softwareengineering.stackexchange.com , I thought it's in the scope of this website and decided to ask this question here. But got a downvote after a few seconds of my question submission :-) — Eray, Feb 19 '20 at 19:36
Is the blacklist composed of words (in your example it's always full words) ? Or can it be partial strings like `ord` or `me Al`? And is it case sensitive or not ? — Christophe, Feb 19 '20 at 19:40
Yes, it can be. @Christophe thanks for your comment, I will update the question. — Eray, Feb 19 '20 at 19:41
No treatment of this subject would be complete without mentioning the [Scunthorpe problem](https://en.wikipedia.org/wiki/Scunthorpe_problem). And if you're trying to save children, they already know more than we do. Automated word blacklisting is a fool's errand, in my opinion, propagated by delicate flowers who are easily offended. — Robert Harvey, Feb 19 '20 at 19:58
Is this a one-shot problem or are you maintaining a long-term solution? I think you need to lay out your use case a bit more. — JimmyJames, Feb 19 '20 at 20:04
(hopefully) it will be a long-term solution. We will use the algorithm/method as a part of our automation tool. The problem is, movie / song list is fixed but the blacklist is changing for each of our users. Every user can configure their own blacklisted strings. We will list all movies and songs to users after filtering out blacklisted ones. — Eray, Feb 19 '20 at 21:25
There are several such algorithms, e.g. https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm, the main gimmick is that you spend a lot of time preprocessing the dataset. Which for you is great because your dataset is fixed so you can just reuse the preprocessing, updating it every once in a while. — Countingstuff, Feb 19 '20 at 22:46
Do you have any evidence that simply cramming those pattern into a disjunctive pattern (or several) and using the standard regex engine of your environment is *not* efficient enough? — Kilian Foth, Feb 20 '20 at 07:21
Kilian has the idea. You need to compose your blacklist into a Non-Backtracking Regex. Then apply a triple matching, or a standard FSM. See [Russ Cox](https://swtch.com/~rsc/regexp/) for some good writeups on this. — Kain0_0, Feb 20 '20 at 22:53

Algorithm to search very long blacklist in another very long data set

0 Answers0