We have a large set of existing key-value pairs (e.g. product specs). A made-up example might be:
Prod 1, LOGO, The standard 1990 Audi logo is located in the center of the grille.
Prod 2, LOGO, The standard 2010 Volkswagen logo is located in the upper right corner of the trunk.
Prod 3, LOGO, The standard 2005 Porsche logo is located in the center of the hood, near the front.
Prod 4, ENGINE, The 2016 Volvo comes standard with a 2.8 litre v6 engine.
Prod 1, ENGINE, The 2016 Audi comes standard with a 1.8 litre v6 engine.
Prod 1, OTHER, blah blah this is a one-off spec that isn't found nearly as commonly as the others.
There would be many thousands of these, covering many different specs (not just LOGO), but as the example list shows, the "value" strings for a given spec share common parts.
Some of the specs, e.g. LOGO and ENGINE, would be found for nearly every product, and the verbiage of each of these is generally consistent. Others, like OTHER, would be uncommon.
I am trying to design a process that:
- looks at the existing list of key-value pairs (e.g. product specs)
- finds the most common verbiage for each spec, based on the value string with the variable parts masked
The result (based on the example set above) would look something like:
LOGO, The standard _____ logo is located in the _____ of the _____. (occurs 3 times in the list above)
ENGINE, The _____ comes standard with a _____ engine. (occurs 2 times)
OTHER, blah blah this is a one-off spec that isn't found nearly as commonly as the others. (occurs 1 time; this one wouldn't have any words replaced with blanks, since it occurs only once or just a few times)
There are currently no default value masks, so I have nothing that tells me which words or positions hold variable values (like make, model, engine type, etc. from the example).
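For illustration, one possible direction could be sketched as follows. This is only a rough sketch under stated assumptions, not a known-good solution: it assumes the variable parts can be found by aligning the token sequences of values that share a key (here with Python's standard `difflib.SequenceMatcher`), keeping the tokens they have in common and masking the rest. The names `tokenize`, `merge`, `render`, `mine_templates`, the `MASK` placeholder, and the `MIN_RATIO` threshold of 0.5 are all made up for this example, and the sample rows are adapted from the list above.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher

MASK = "_____"   # placeholder for a variable part (arbitrary choice)
MIN_RATIO = 0.5  # assumed similarity threshold; would need tuning on real data


def tokenize(text):
    # Words (keeping "2.8", "one-off", "isn't" whole) and punctuation marks.
    return re.findall(r"\w+(?:[.'-]\w+)*|[^\w\s]", text)


def render(tokens):
    # Join tokens, then drop the space that the join puts before punctuation.
    return re.sub(r"\s+([.,;:!?])", r"\1", " ".join(tokens))


def merge(template, tokens):
    """Keep tokens common to both sequences, in order; mask everything else."""
    matcher = SequenceMatcher(a=template, b=tokens, autojunk=False)
    merged, a_end, b_end = [], 0, 0
    for a, b, size in matcher.get_matching_blocks():
        # Unmatched tokens on either side before this block are a variable part.
        if (a > a_end or b > b_end) and (not merged or merged[-1] != MASK):
            merged.append(MASK)
        merged.extend(template[a:a + size])
        a_end, b_end = a + size, b + size
    return merged


def mine_templates(rows):
    """rows: iterable of (key, value). Returns {key: [(template, count), ...]}."""
    clusters = defaultdict(list)  # key -> list of [template_tokens, count]
    for key, value in rows:
        tokens = tokenize(value)
        best, best_ratio = None, MIN_RATIO
        # Greedy: attach the value to the most similar existing template, if any.
        for cluster in clusters[key]:
            ratio = SequenceMatcher(a=cluster[0], b=tokens, autojunk=False).ratio()
            if ratio >= best_ratio:
                best, best_ratio = cluster, ratio
        if best is None:
            clusters[key].append([tokens, 1])  # start a new phrasing for this key
        else:
            best[0] = merge(best[0], tokens)
            best[1] += 1
    return {
        key: sorted(((render(t), n) for t, n in found), key=lambda p: -p[1])
        for key, found in clusters.items()
    }


rows = [
    ("LOGO", "The standard 1990 Audi logo is located in the center of the grille."),
    ("LOGO", "The standard 2010 Volkswagen logo is located in the upper right corner of the trunk."),
    ("LOGO", "The standard 2005 Porsche logo is located in the center of the hood, near the front."),
    ("ENGINE", "The 2016 Volvo comes standard with a 2.8 litre v6 engine."),
    ("ENGINE", "The 2016 Audi comes standard with a 1.8 litre v6 engine."),
    ("OTHER", "blah blah this is a one-off spec that isn't found nearly as commonly as the others."),
]

for key, templates in mine_templates(rows).items():
    for template, count in templates:
        print(f"{key}, {template} (occurs {count} times)")
```

On these sample rows the LOGO values collapse into a single masked phrase, and OTHER is left untouched because it has nothing to align against. The ENGINE phrase keeps "2016" and "litre v6" as literal text, since both sample rows happen to share them; only more varied data would turn those into blanks. The grouping is greedy and depends on input order, so it should be read as a starting point rather than a finished method.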
What approach or logic could I apply to the existing set of spec values to find the common parts of the phrasing for each spec?
The end goal is to derive a set of common default phrases for each spec "key" based on the thousands of entries which already exist. E.g. in the end I should have a list that would show me that "the most common verbiage pattern for LOGO is The standard _____ logo is located in the _____ of the _____." etc.
Some other helpful info:
- Performance isn't a concern
- This would be a one-off process
- It would be run on a dedicated machine, with no worry about using up too much memory or other resources
Hopefully I have explained it well enough. Let me know if it needs more detail.