
I have a problem I was hoping I could get some advice on!

I have a LOT of text as input (about 20GB worth, not MASSIVE but big enough). This is just free text, unstructured.

I have a 'category list'. I want to process the text, cross-reference it against the items in the category list, and output the categories for each match, e.g.

Input text

The quick brown fox ran over the lazy dog.

Category lookup

Colour: Red | Brown | Green

Speed: Slow | Quick | Lazy | Fast

Expected Output

Colour - Brown

Speed - Quick, Lazy

To add to the complexity of the problem, the source text probably won't match the category terms exactly, so some sort of fuzzy-matching algorithm will have to be applied here.

I want to use 'Big data' tech to solve this (whether or not it TRULY NEEDS big data isn't the question - it's a secondary objective).

My thoughts are to utilize Hadoop Map/Reduce with Lucene to do the fuzzy-matching.
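
To make the fuzzy-matching part concrete, this is roughly what I have in mind with plain Lucene (just a sketch - the class and field names are placeholders, and the exact constructors vary a bit between Lucene versions; this assumes a fairly recent release with an in-memory directory):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.FuzzyQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class CategoryMatcherSketch {
        public static void main(String[] args) throws Exception {
            // Index each category term as a tiny document (category, term).
            Directory dir = new ByteBuffersDirectory();
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                addTerm(writer, "Colour", "brown");
                addTerm(writer, "Colour", "red");
                addTerm(writer, "Speed", "quick");
                addTerm(writer, "Speed", "lazy");
            }

            // Fuzzy-query the category index with each token of the input text.
            try (IndexReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                for (String token : "the quikc brown fox ran over the lazy dog".split("\\s+")) {
                    Query q = new FuzzyQuery(new Term("term", token), 1); // allow 1 edit, e.g. "quikc" -> "quick"
                    for (ScoreDoc hit : searcher.search(q, 10).scoreDocs) {
                        Document doc = searcher.doc(hit.doc);
                        System.out.println(doc.get("category") + " - " + doc.get("term"));
                    }
                }
            }
        }

        private static void addTerm(IndexWriter writer, String category, String term) throws Exception {
            Document doc = new Document();
            doc.add(new StringField("category", category, Field.Store.YES));
            doc.add(new StringField("term", term, Field.Store.YES));
            writer.addDocument(doc);
        }
    }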

What do you think? Am I way off base?

Thanks a lot - ANY advice appreciated!!

Duncan

  • Update: Thanks to all for the feedback! I've come across Elasticsearch; it seems to get positive reviews, so I'll look at it. – Duncan Jul 10 '13 at 09:52

2 Answers


I would recommend starting with Solr, then doing your machine learning with Mahout and Hadoop. Solr will give you basic text analysis through word stemming, normalization (lower-casing), and tokenization. If you enable term vectors in the schema, you can feed those directly into Mahout and experiment with the different algorithms there. A lot (maybe most) of Mahout's algorithms will work in a distributed manner on Hadoop, as well as in pseudo-distributed mode locally while you're working.
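
To get a feel for what that analysis chain produces, here is a minimal sketch that runs Lucene's EnglishAnalyzer directly (Solr wires up an equivalent chain via its schema; this assumes a recent Lucene where analyzers have no-arg constructors):

    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class AnalysisSketch {
        public static void main(String[] args) throws Exception {
            try (EnglishAnalyzer analyzer = new EnglishAnalyzer();
                 TokenStream ts = analyzer.tokenStream("body",
                         "The quick brown foxes ran over the lazy dogs.")) {
                CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
                ts.reset();
                while (ts.incrementToken()) {
                    // Prints lower-cased, stop-word-filtered, stemmed tokens,
                    // roughly: quick, brown, fox, ran, over, lazi, dog
                    System.out.println(term.toString());
                }
                ts.end();
            }
        }
    }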

Once you've got Mahout picking out the right features of your text you can then add them to the docs already in Solr and then do facet queries over them.
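
Once a category field exists on the documents, a facet query with SolrJ looks roughly like this (the base URL, collection name, and field name here are placeholders):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.FacetField;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class FacetSketch {
        public static void main(String[] args) throws Exception {
            try (SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr").build()) {
                SolrQuery query = new SolrQuery("*:*");
                query.setFacet(true);
                query.addFacetField("category");          // hypothetical field holding the derived labels
                QueryResponse rsp = client.query("documents", query);
                for (FacetField.Count count : rsp.getFacetField("category").getValues()) {
                    System.out.println(count.getName() + ": " + count.getCount());
                }
            }
        }
    }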

  • Thanks, I should have mentioned that I don't really need to worry about stemming etc.; however, I have given Solr a look, and Elasticsearch has caught my eye! – Duncan Jul 10 '13 at 09:44

It sounds like the biggest problem you have left is to actually develop the categories and flesh them out.

Given a set of categories, you'll need a set of 'marker' words for each category. You'll also want to read about stemming (turning vegetables -> vegetable) and stop-word removal (skipping common, meaningless words like 'the'). Good stemmer implementations to look for are Porter2 or Krovetz, but this is language-dependent.

Ultimately, this depends upon having categories - in machine learning these are labels - and a set of features that you use to turn the data into inputs to a function that returns labels. It sounds like you want your features to be membership in a particular word set, with your labels corresponding 1:1 with features, and with multiple labels allowed on each sentence.

That being said, streaming through 20GB of data will go quickly, particularly if you partition it across the cores of a fairly common 4-core machine. I'd recommend you take a chunk of 100 or so sentences and build up your training and test data, then move it to Hadoop if you're interested. When data gets big, make sure your experiments start tiny.

map: (sentence_id, sentence) -> [(sentence_id, category), ...]

reduce: (sentence_id, [category, category, ...])
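
As a rough Hadoop sketch of that shape (the matchCategories() helper is a placeholder for whatever fuzzy matcher you settle on, and the job is assumed to read sentence_id / sentence pairs, e.g. via KeyValueTextInputFormat):

    import java.io.IOException;
    import java.util.Collections;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class CategoryTaggerSketch {

        public static class TagMapper extends Mapper<Text, Text, Text, Text> {
            @Override
            protected void map(Text sentenceId, Text sentence, Context ctx)
                    throws IOException, InterruptedException {
                // Emit one (sentence_id, category) pair per category the sentence matches.
                for (String category : matchCategories(sentence.toString())) {
                    ctx.write(sentenceId, new Text(category));
                }
            }

            // Placeholder: plug in the fuzzy lookup against the category word lists here.
            private Iterable<String> matchCategories(String sentence) {
                return Collections.emptyList();
            }
        }

        public static class TagReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text sentenceId, Iterable<Text> categories, Context ctx)
                    throws IOException, InterruptedException {
                // Collect all categories seen for this sentence into one comma-separated value.
                StringBuilder joined = new StringBuilder();
                for (Text category : categories) {
                    if (joined.length() > 0) joined.append(", ");
                    joined.append(category.toString());
                }
                ctx.write(sentenceId, new Text(joined.toString()));
            }
        }
    }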

John Foley
  • The good news is that I don't have to worry about verbs! Forgot to mention that in my original post. I was also going to get rid of the common words too, so I feel good that I was on the right track there. What did you think about the tech stack? Is Lucene a good choice? – Duncan Jul 05 '13 at 21:43