I have a problem that I'm hoping to get some advice on.
I have a lot of text as input (about 20 GB: not massive, but big enough). It is unstructured free text.
I also have a 'category list'. I want to process the text, cross-reference it against the items in the category list, and output the matching items grouped by category, e.g.
Input text:
The quick brown fox ran over the lazy dog.
Category lookup:
Colour: Red | Brown | Green
Speed: Slow | Quick | Lazy | Fast
Expected output:
Colour - Brown
Speed - Quick, Lazy
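To make the expected behaviour concrete, here is a minimal Python sketch of the exact-match case only. The names `CATEGORIES`, `build_index` and `categorise` are made up for illustration, and splitting on letters is an assumption about how the text would be tokenised.

```python
import re

# Category lookup from the example above: category name -> list of items.
CATEGORIES = {
    "Colour": ["Red", "Brown", "Green"],
    "Speed": ["Slow", "Quick", "Lazy", "Fast"],
}

def build_index(categories):
    """Invert the lookup: lowercased item -> (category, item as listed)."""
    return {
        item.lower(): (category, item)
        for category, items in categories.items()
        for item in items
    }

def categorise(text, categories):
    """Return {category: [matched items]} using case-insensitive exact matching."""
    index = build_index(categories)
    matches = {}
    # Assumption: words are runs of letters; real text may need a better tokeniser.
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in index:
            category, item = index[word]
            found = matches.setdefault(category, [])
            if item not in found:
                found.append(item)
    return matches

result = categorise("The quick brown fox ran over the lazy dog.", CATEGORIES)
for category in CATEGORIES:
    if category in result:
        print(f"{category} - {', '.join(result[category])}")
# Prints:
# Colour - Brown
# Speed - Quick, Lazy
```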
To add to the complexity, the source text probably won't match the category items exactly, i.e. some sort of fuzzy-matching algorithm will have to be applied.
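As a rough idea of what I mean by fuzzy matching, here is a small Python sketch that uses the standard library's `difflib.get_close_matches` as a stand-in. It is not what Lucene would do internally. The misspelled input, the `cutoff` of 0.8 and the function name `fuzzy_categorise` are all assumptions picked for illustration.

```python
import difflib
import re

# Category lookup from the example: category name -> list of items.
CATEGORIES = {
    "Colour": ["Red", "Brown", "Green"],
    "Speed": ["Slow", "Quick", "Lazy", "Fast"],
}

def fuzzy_categorise(text, categories, cutoff=0.8):
    """Return {category: [matched items]}, tolerating small spelling differences."""
    lookup = {
        item.lower(): (category, item)
        for category, items in categories.items()
        for item in items
    }
    candidates = list(lookup)
    matches = {}
    for word in re.findall(r"[a-z]+", text.lower()):
        # Best single candidate whose similarity ratio is at least `cutoff`.
        close = difflib.get_close_matches(word, candidates, n=1, cutoff=cutoff)
        if close:
            category, item = lookup[close[0]]
            found = matches.setdefault(category, [])
            if item not in found:
                found.append(item)
    return matches

# Hypothetical noisy input: "quck", "browne" and "lazzy" are deliberate misspellings.
result = fuzzy_categorise("The quck browne fox ran over the lazzy dog.", CATEGORIES)
for category in CATEGORIES:
    if category in result:
        print(f"{category} - {', '.join(result[category])}")
```

This compares every word against every category item, so it only illustrates the matching behaviour; it says nothing about how to do this efficiently at 20 GB.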
I want to use 'big data' tech to solve this. Whether or not the problem truly needs big data isn't the question: using that tech is a secondary objective in itself.
My thinking is to use Hadoop MapReduce, with Lucene doing the fuzzy matching.
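To show the shape of the job I have in mind, here is a minimal Python sketch that simulates the map, shuffle and reduce steps in a single process. It does not use Hadoop or Lucene at all: `mapper`, `shuffle` and `reducer` are illustrative names, the matching is exact rather than fuzzy, and the second input line is invented for the example.

```python
import re
from collections import defaultdict

CATEGORIES = {
    "Colour": ["Red", "Brown", "Green"],
    "Speed": ["Slow", "Quick", "Lazy", "Fast"],
}
# Lowercased item -> (category, item as listed).
INDEX = {
    item.lower(): (category, item)
    for category, items in CATEGORIES.items()
    for item in items
}

def mapper(line):
    """Map step: emit one (category, item) pair per matching word in a line."""
    for word in re.findall(r"[a-z]+", line.lower()):
        if word in INDEX:
            yield INDEX[word]

def shuffle(pairs):
    """Stand-in for the framework's shuffle phase: group values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(category, items):
    """Reduce step: keep each item once, in first-seen order."""
    return category, list(dict.fromkeys(items))

lines = [
    "The quick brown fox ran over the lazy dog.",
    "A slow green turtle watched the quick fox.",  # invented second record
]
pairs = [pair for line in lines for pair in mapper(line)]
results = dict(reducer(cat, items) for cat, items in shuffle(pairs).items())
for category in sorted(results):
    print(f"{category} - {', '.join(results[category])}")
# Prints:
# Colour - Brown, Green
# Speed - Quick, Lazy, Slow
```

In a real job the category list would have to be available to every mapper, and the fuzzy matching would replace the `word in INDEX` check.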
What do you think? Am I way off base?
Thanks a lot - ANY advice appreciated!!
Duncan