Software design strategy for a machine learning tool that outputs a subset of the text input (Information Extraction)?

Question

Let's say I have thousands of PDFs, each about 30k words of conversational English. Each PDF contains the name(s) of one or more people who snowboard, along with many other names. I need to extract the name(s) of the snowboarder(s) from any future PDFs. What tools or methods could I use to approach this problem?

I just started learning about Natural Language Processing and Machine Learning a couple of weeks ago. I have been using Python's NLTK to filter my data, and I have used scikit-learn for classification and multilabel classification on other questions I want to answer about the same data set, but this snowboarder example is not a classification problem. I know I could use a purely NLP solution, but I want to try having an ML model recognize the patterns in the text, because all the documents are formatted similarly, I have a lot of documents to train with, and I am willing to label them manually.

I had some success training a word2vec model on each individual document. I then computed the similarity (model.wv.similarity(HUMAN_NAME, 'snowboard')) between the word 'snowboard' and each name in a list of human names, and took the most similar name as my answer. I know there has to be a more elegant solution. Sequence-to-sequence models and topic modeling might be my next steps. Can someone point me in the right direction if they have a better idea?
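For reference, the ranking step described above can be sketched roughly as follows. The helper name `rank_names` and the toy scores are made up for illustration; with a trained gensim model, the `similarity` argument would presumably be something like `lambda n: model.wv.similarity(n, "snowboard")`, which raises `KeyError` for names that are not in the vocabulary.

```python
from typing import Callable, Iterable, List, Tuple


def rank_names(
    names: Iterable[str],
    similarity: Callable[[str], float],
) -> List[Tuple[str, float]]:
    """Score each candidate name and return (name, score) pairs, best first.

    `similarity` is any function mapping a name to a similarity score
    against the target word (hypothetical hook, not a library API).
    """
    scored = []
    for name in names:
        try:
            scored.append((name, float(similarity(name))))
        except KeyError:
            # The name never made it into the model vocabulary; skip it.
            continue
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy stand-in for the per-document word2vec similarities (invented numbers).
toy_scores = {"alice": 0.12, "bob": 0.81, "carol": 0.35}

ranking = rank_names(
    ["alice", "bob", "carol", "dave"],   # "dave" is missing from toy_scores
    lambda name: toy_scores[name],
)
best_name = ranking[0][0]
print(best_name)
```

Keeping the full ranking rather than only the top name makes it easier to handle documents with more than one snowboarder, for example by applying a score threshold.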

It dawned on me today (I don't know why I didn't think of this earlier) that I could pair each input document with each human name I extract from it and have the model output whether or not that person snowboards. E.g., Document A in my training set has human names "a", "b", and "c", where "b" is the snowboarder. I can create the inputs Document A + "a", Document A + "b", and Document A + "c", which would output "no", "yes", and "no" respectively. I will post an answer to this question when I get a full solution. - Hiding, Feb 05 '18 at 15:42
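A minimal sketch of how the (document, name) training pairs from this comment might be built. Two things here are assumptions rather than part of the comment: names are treated as single tokens, and instead of attaching the whole 30k-word document, each name is paired only with the words surrounding its mentions. The function names and the sample document are hypothetical.

```python
import re
from typing import List, Set, Tuple


def context_windows(tokens: List[str], name: str, width: int) -> List[str]:
    """Return the text around every occurrence of `name` in `tokens`."""
    windows = []
    for i, tok in enumerate(tokens):
        if tok == name:
            lo = max(0, i - width)
            windows.append(" ".join(tokens[lo:i + width + 1]))
    return windows


def build_examples(
    doc_text: str,
    candidates: List[str],
    snowboarders: Set[str],
    width: int = 5,
) -> List[Tuple[str, str, str]]:
    """Create one (name, context, label) example per candidate name."""
    tokens = re.findall(r"[a-z']+", doc_text.lower())
    examples = []
    for name in candidates:
        context = " | ".join(context_windows(tokens, name.lower(), width))
        label = "yes" if name in snowboarders else "no"
        examples.append((name, context, label))
    return examples


# Invented stand-in for "Document A" with three names, one snowboarder.
doc_a = "Ann met Cal for lunch. Later Ben took his snowboard up the mountain."
examples = build_examples(doc_a, ["Ann", "Ben", "Cal"], {"Ben"}, width=3)
labels = [label for _, _, label in examples]
print(labels)
```

Each example can then be fed to an ordinary binary text classifier, which turns the extraction task back into the kind of classification problem already solved on this data set.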


0 Answers