Let's say I have thousands of PDFs, each about 30k words of conversational English. Each PDF contains the name(s) of one or more people who snowboard, along with many other names. I need to extract the snowboarders' names from any future PDFs. What tools or methods could I use to approach this problem?
I just started learning about Natural Language Processing and Machine Learning a couple of weeks ago. I have been using Python's NLTK to filter my data and scikit-learn for classification and multilabel classification on other questions I want to answer about the same data set, but this snowboarder example is not a classification problem. I know I could use a purely NLP solution, but I want to try having an ML model recognize the patterns in the text, because all the documents are formatted similarly, I have a lot of documents to train with, and I am willing to label them manually.
I had some success training a word2vec neural net on each individual document. I then checked the model similarity (model.wv.similarity(HUMAN_NAME, 'snowboard')) between each name in a list of human names and the word 'snowboard', and took the most similar name as my answer. I know there has to be a more elegant solution. I know sequence-to-sequence models and topic modeling might be my next steps. Can someone point me in the right direction if they have a better idea?
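For reference, the per-document similarity approach described above looks roughly like the sketch below. Only the `model.wv.similarity` call comes from my actual code. The function names (`rank_names`, `rank_names_in_document`), the hyperparameters, and the use of the gensim 4.x `Word2Vec` constructor are assumptions for illustration, and the tokenization step is left out.

```python
from typing import Callable, Iterable, List, Tuple


def rank_names(
    names: Iterable[str],
    similarity: Callable[[str, str], float],
    target: str = "snowboard",
) -> List[Tuple[str, float]]:
    """Score each candidate name against `target`, highest similarity first."""
    scored = []
    for name in names:
        try:
            scored.append((name, float(similarity(name, target))))
        except KeyError:
            # The name (or the target word) is not in this document's vocabulary.
            continue
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def rank_names_in_document(sentences, names, target="snowboard"):
    """Train a Word2Vec model on one document and rank `names` against `target`.

    `sentences` is a list of token lists; `names` must be tokenized and cased
    the same way as the tokens. Hyperparameters here are placeholder guesses.
    """
    # Imported lazily so the ranking helper above works without gensim installed.
    from gensim.models import Word2Vec  # assumes gensim 4.x

    model = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=1, seed=0)
    return rank_names(names, model.wv.similarity, target)
```

The first element of the returned list is the name I currently report as the snowboarder. Keeping the ranking separate from the model makes it possible to swap in a different similarity function later without touching the rest.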