Finding and counting equal substrings in a set of strings

Question

I'm thinking about a way of finding similar parts in Strings. I have a set of strings of varying length i.e:

The quick brown fox jumps
fox force five
the bunny is much quicker than the fox
is

First, i thought of just tokenizing the strings and count the tokens but in the case of "quick" here I also need to match "quicker".

So the output should be something like this (a mapping of tokens to the count, if the count is 1 it is omitted):

{
  "the": 3,
  "fox": 3,
  "quick": 2,
  "is": 2
}

The use case is as follows:

The user groups together Strings to categories, the goal is to make suggestions for a new string to which category it might belong. So the idea was to search all current strings in a category for keywords like this.

related (possibly a duplicate): [What's the algorithm should I use for seeing how well 2 strings match?](http://programmers.stackexchange.com/questions/276278/whats-the-algorithm-should-i-use-for-seeing-how-well-2-strings-match) — gnat, Jul 07 '16 at 07:05
Reducing a word like quicker to quick is called 'stemming'. While you could implement your own algorithm it would be much easier to use a full text search engine like Elasticsearch, which would allow you to easily add a few more features to make sure you find all relevant words that are somewhat similar. For example you can relative easily drop in lists of synonyms if this helps to get better results (like quick: fast, swift, hasty...) — thorsten müller, Jul 07 '16 at 07:06
No its neither related nor a duplicate. I don't care how well the strings match. I need to count the appearance of substrings. I evaluated lots of string matching algorithms they don't extract the information i need. — Chris, Jul 07 '16 at 07:09
It's really important that you specify whether you need *string* similarity or *language* similarity. String similarity means that 'quick' and 'ick' are quite close, language similarity doesn't. — Kilian Foth, Jul 07 '16 at 07:11
I need string equality. I even don't want 'quick' and 'ick" to be counted together. And i also don't need language similarity. 'quick' is quick and fast is fast. I edited the question title hoping to make it more clear. — Chris, Jul 07 '16 at 07:14
So, you want “quick” and “quicker” to be coalesced, but not “quick” and “ick.” Sounds to me like you do want what Kilian calls “language similarity.” Do you want “the” and “theater” to be counted as the same thing? — Christopher Creutzig, Jul 07 '16 at 07:26
-1 for not answering Christophers question, and so leaving this post as an occasion for guessing games. Please clarify what you meant, otherwise don't expect useful answers. — Doc Brown, May 15 '17 at 15:00
@ChristopherCreutzig no i need string similarity, not language similarity. Nothing as complex as stemming. — Chris, Jul 10 '17 at 12:20
In which sense then is “quick” closer to “quicker” than to “ick”? It's just removing two characters in both cases. — Christopher Creutzig, Jul 11 '17 at 12:48
I think after some researching i found that what i actually need is a bayes classifier that works on strings. The problem is to find the attributes to do so. — Chris, Jul 11 '17 at 19:02

score 1 · Answer 1 · answered May 15 '17 at 14:49

As counting identical strings is trivial, I'll address the portion of your question about making "quicker" count as "quick". As mentioned in the comments this is called stemming. Note that stemming is language-specific so you'll have to know what language your terms are in.

I will post an example for Java as you didn't specify a programming language. The lucene-analyzers-common library has stemmers in many languages. Assuming English (other languages are similar), you'd want to do something like:

SnowballProgram stemmer = new EnglishStemmer();
// for each term:
stemmer.setCurrent(term);
stemmer.stem();
String stemmedTerm = stemmer.getCurrent();
// count terms

Note that creating an EnglishStemmer is not so fast so you want to reuse the same stemmer if performance matters.

score -1 · Answer 2 · answered Aug 26 '16 at 19:43

This solution assumes that your words will be separated by spaces. In order to consider any punctuation marks you would also att them to the string input to the temp.split() call. Obviously if you want to work on multiple strings then it would be best to move the split and loop into a method and call it for every input string that you would like to be included in the count.

Hashmap<String,Integer> wordFrequency = new Hashmap<String,Integer>();
String temp = "whatever string thing is being checked"
String[] stringList = temp.split(" ");
for (string s : stringList){
    if (!wordFrequency.keySet().contains(s)){
        wordFrequency.put(s,1);
    }
    else{
        wordFrequency.put(s,wordFrequency.get(s)++);
    }
}
System.out.println(wordFrequency.entrySet()); 
//I believe that Set has a well defined toString() implementation

Finding and counting equal substrings in a set of strings

2 Answers2