Architect Solution to generate set of Default Phrases from a large list of specific phrases

Question

We have a large set of existing key-value pairs (e.g. product specs). A made-up example might be:

Prod 1, LOGO, The standard 1990 Audi logo is located in the center of the grille.
Prod 2, LOGO, The standard 2010 Volkswagen logo is located in the upper right corner of the trunk
Prod 3, LOGO, The standard 2005 Porsche logo is located in the center of the hood, near the front.
Prod 4, ENGINE, The 2016 Volvo comes standard with a 2.8 litre v6 engine.
Prod 1, ENGINE, The 2016 Audi comes standard with a 1.8 litre v6 engine.
Prod 1, OTHER, blah blah this is a one-off spec that isn't found nearly as commonly as the others.

There would be many thousands of these, covering many different specs (not just LOGO), but you can see from the example list that the "value" strings for a given spec share common parts.

Some of the specs, e.g. LOGO and ENGINE, would be found for nearly every product, and the verbiage of each of these is generally consistent. Others, like OTHER, would be uncommon.

I am trying to design a process that:

- looks at the existing list of key-value pairs (e.g. product specs)
- finds the most common verbiage for each spec, based on the value string with the variable parts masked

The result (based on the example set above) would look something like:

LOGO, The standard _____ logo is located in the _____ of the _____. (occurs 3 times in the list above)
ENGINE, The _____ comes standard with a _____ engine. (occurs 2 times)
OTHER, blah blah this is a one-off spec that isn't found nearly as commonly as the others. (occurs 1 time; this one wouldn't have any words replaced with blanks since it only occurs once or just a few times)

There are no default value masks existing now, so there is nothing I can use to know which words/positions have variable values (like make, model, engine type, etc from the example).

What approach or logic could I apply to the existing set of spec values to find the common parts of the phrasing for each spec?

The end goal is to derive a set of common default phrases for each spec "key" based on the thousands of entries which already exist. E.g., in the end I should have a list showing me that "the most common verbiage pattern for LOGO is The standard _____ logo is located in the _____ of the _____." etc.

Some other helpful info:

- Performance isn't a concern
- This would be a one-off process
- It would be run on a dedicated machine, with no worry about using up too much memory or other resources

Hopefully I have explained it well enough. Let me know if it needs more detail.

Possible duplicate of [Choosing the right Design Pattern](https://softwareengineering.stackexchange.com/questions/227868/choosing-the-right-design-pattern) — gnat, Mar 21 '18 at 12:59
Not a duplicate. That post is about generically _how_ to choose a design pattern, and the answers explain what design patterns, in general, are. It contains nothing useful to me as far as I can see in terms of helping me come up with an approach to my problem. If I missed it, please help me see the points relevant to the problem space I described. — GWR, Mar 21 '18 at 13:53
@gnat This doesn't seem to be a question about design patterns at all. It's an algorithm question. — JimmyJames, Mar 21 '18 at 14:47
Yeah, this isn't an architectural question. Architecture is about organizing your project's algorithms, not creating the algorithms themselves. — Robert Harvey, Mar 21 '18 at 16:26

JimmyJames · Accepted Answer · 2018-03-21T16:24:52.733

There might be other, better pattern-matching algorithms for this, but a good place to start is something similar to the Levenshtein distance algorithm for strings. In this case, though, you would substitute tokens for characters. That is, instead of saying CAT and CUT are a distance of 1 apart because one letter is different, you would say "LOGO, The standard 1990 Audi logo" is a distance of 2 from "LOGO, The standard 2010 Volkswagen logo" because two tokens are different.
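As a minimal sketch of that idea (the function name `token_distance` and the whitespace tokenization are assumptions made for illustration, not part of the original suggestion), the classic Levenshtein recurrence can be applied to lists of tokens instead of strings of characters:

```python
def token_distance(a: list[str], b: list[str]) -> int:
    """Levenshtein distance where the unit of comparison is a token, not a character."""
    # prev[j] holds the distance between the first i-1 tokens of a and the first j tokens of b.
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]  # distance from the first i tokens of a to an empty sequence
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(
                prev[j] + 1,         # delete tok_a
                curr[j - 1] + 1,     # insert tok_b
                prev[j - 1] + cost,  # substitute (or keep) the token
            ))
        prev = curr
    return prev[-1]


# Assumption: naive whitespace tokenization, just for the example.
first = "The standard 1990 Audi logo".split()
second = "The standard 2010 Volkswagen logo".split()
print(token_distance(first, second))             # two tokens differ
print(token_distance(list("CAT"), list("CUT")))  # same code, character level
```

Because the function only compares elements for equality, the same code works on characters or tokens; only the tokenization step decides the granularity.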

Based on your examples, I would also ignore position changes. This should simplify the implementation a bit and give you clearer results.

Once you have this built, you can compare all the phrases to each other. This could take a while but you say that isn't an issue. You should then find groupings of phrases that are similar to each other. You can represent this as a graph and find the clusters.
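One way this could look, as a rough sketch: phrases become nodes, an edge joins any two phrases whose distance is under a threshold, and the clusters are the connected components. The position-insensitive `bag_distance`, the tokenizer, and the threshold value are all assumptions chosen for illustration; a token-level edit distance could be dropped in instead.

```python
import re
from collections import Counter
from itertools import combinations


def tokenize(phrase: str) -> list[str]:
    # Assumption: lowercase word tokens, keeping decimals such as "2.8" together.
    return re.findall(r"\w+(?:\.\w+)*", phrase.lower())


def bag_distance(a: Counter, b: Counter) -> int:
    # Assumption: a position-insensitive distance, counting unmatched tokens on the larger side.
    return max(sum((a - b).values()), sum((b - a).values()))


def cluster(phrases: list[str], threshold: int) -> list[list[int]]:
    """Group phrase indices into connected components of the 'similar enough' graph."""
    bags = [Counter(tokenize(p)) for p in phrases]
    neighbours = {i: set() for i in range(len(phrases))}
    for i, j in combinations(range(len(phrases)), 2):  # compare every pair
        if bag_distance(bags[i], bags[j]) <= threshold:
            neighbours[i].add(j)
            neighbours[j].add(i)

    seen, clusters = set(), []
    for start in range(len(phrases)):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:  # depth-first walk over one component
            node = stack.pop()
            component.append(node)
            for nxt in neighbours[node] - seen:
                seen.add(nxt)
                stack.append(nxt)
        clusters.append(sorted(component))
    return clusters


phrases = [
    "The standard 1990 Audi logo is located in the center of the grille.",
    "The standard 2010 Volkswagen logo is located in the upper right corner of the trunk",
    "The standard 2005 Porsche logo is located in the center of the hood, near the front.",
    "The 2016 Volvo comes standard with a 2.8 litre v6 engine.",
    "The 2016 Audi comes standard with a 1.8 litre v6 engine.",
]
print(cluster(phrases, threshold=6))
```

The threshold here was picked by eye for the five sample phrases; on real data it would likely need tuning, possibly per key or relative to phrase length.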

Then within each cluster, you should be able to determine which parts of the phrases are identical. That's the mask. The parameters then fall out easily from this.
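A possible sketch of the mask step, assuming the cluster has already been found: fold the phrases together pairwise, keep the tokens that align as equal, and collapse everything else into a blank. Using `difflib.SequenceMatcher` for the alignment is a choice made here for brevity (it is not an exact longest-common-subsequence algorithm), and `BLANK`, `merge`, and `build_mask` are hypothetical names.

```python
import re
from difflib import SequenceMatcher

BLANK = "_____"  # assumption: this marker never occurs as a real token


def tokenize(phrase: str) -> list[str]:
    # Assumption: lowercase word tokens, punctuation dropped, decimals kept together.
    return re.findall(r"\w+(?:\.\w+)*", phrase.lower())


def merge(template: list[str], tokens: list[str]) -> list[str]:
    """Keep tokens that align between template and phrase; blank out the rest."""
    matcher = SequenceMatcher(a=template, b=tokens, autojunk=False)
    merged = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            merged.extend(template[i1:i2])
        elif not merged or merged[-1] != BLANK:
            merged.append(BLANK)  # collapse any run of differing tokens into one blank
    return merged


def build_mask(cluster_phrases: list[str]) -> str:
    """Fold all phrases of one cluster into a single masked template."""
    template = tokenize(cluster_phrases[0])
    for phrase in cluster_phrases[1:]:
        template = merge(template, tokenize(phrase))
    return " ".join(template)


logo_cluster = [
    "The standard 1990 Audi logo is located in the center of the grille.",
    "The standard 2010 Volkswagen logo is located in the upper right corner of the trunk",
    "The standard 2005 Porsche logo is located in the center of the hood, near the front.",
]
print(build_mask(logo_cluster))
```

With only a handful of samples, tokens that happen to be constant (for example a year shared by every phrase in the cluster) stay in the mask, so larger clusters should give cleaner templates.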

One complication you mention in your comment on this answer is that the phrasing, spelling, punctuation, and/or capitalization are not perfectly consistent. One option is to use a standard distance algorithm such as Levenshtein or Jaro-Winkler between the tokens. This will slow it down a little more but probably not significantly and could make the results less clear-cut. You'll want to set some sort of threshold on how different the tokens can be. Ignoring case and punctuation will simplify things a good bit.
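To illustrate the thresholded token comparison, here is a small sketch that treats two tokens as equal when their character-level Levenshtein distance, divided by the longer token's length, stays under a cutoff. The helper names and the default cutoff of 0.2 are assumptions for illustration only; Jaro-Winkler would need a third-party library or a separate implementation.

```python
import re


def char_distance(a: str, b: str) -> int:
    """Plain character-level Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalize(token: str) -> str:
    # Ignore case and punctuation, as suggested above.
    return re.sub(r"[^\w]", "", token.lower())


def tokens_match(a: str, b: str, max_ratio: float = 0.2) -> bool:
    """Treat two tokens as equal if they differ by at most max_ratio of their length."""
    a, b = normalize(a), normalize(b)
    longest = max(len(a), len(b))
    if longest == 0:
        return True  # both tokens were pure punctuation
    return char_distance(a, b) / longest <= max_ratio


print(tokens_match("Porche", "Porsche"))  # one missing letter in a seven-letter word
print(tokens_match("Logo,", "logo"))      # case and punctuation are ignored
print(tokens_match("Audi", "Volvo"))      # genuinely different tokens
```

A predicate like this could replace the exact `==` comparison inside a token-level distance. Short tokens and numbers (say, "1990" versus "1995") are where a ratio-based cutoff is most likely to misfire, so they may deserve exact matching.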

You might also need to build a thesaurus of synonyms for the matching process. But I think you should probably combine an algorithm such as the one described here with a neural network that you can find between your ears. That is, when you run through this, you should be able to see the clusters fairly easily in the data. If you get them down to a reasonable number, you should also be able to see that some of the clusters that look different are really equivalent in purpose and meaning. Trying to automate that is likely not worth it for a one-time effort.

Yes, this is the issue: the phrasing, spelling, punctuation, and/or capitalization are not perfectly consistent, but the core words exist in most instances of a given key. One good thing is that the key set is relatively small (between 50 and 100 different keys), so even if I can get some of them down to a few common variations it would be great. — GWR, Mar 21 '18 at 15:29
