
I have millions of documents (close to 100 million); each document has fields such as skills, hobbies, certification, and education. I want to find the similarity between each pair of documents, along with a score.

Below is an example of data.

skills  hobbies        certification    education
Java    fishing        PMP              MS
Python  reading novel  SCM              BS
C#      video game     PMP              B.Tech.
C++     fishing        PMP              MS

So what I want is the similarity between the first row and all other rows, the similarity between the second row and all other rows, and so on. Every document should be compared against every other document to get the similarity scores.

The purpose is that I query my database to get people based on skills. In addition to that, I now want people who, even though they do not have the skills, are somewhat similar to the people with the specific skills. For example, if I wanted to get data for people who have Java skills, the first row would appear, and the last row would also appear, since it is close to the first row based on similarity score.

Challenge: My primary challenge is to compute a similarity score for each document against every other document, as you can see from the pseudocode below. How can I do this faster? Is there a different way to do this than this pseudocode, or is there some other computational (hardware/algorithm) approach to make it faster?

documents = all_documents_in_db()
for i in documents:
    for j in documents:
        if i != j:
            compute_similarity(i, j)
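To make the pseudocode concrete, `compute_similarity` could be something as simple as a Jaccard similarity over the field values (a sketch only; the actual metric is up to you):

```python
def compute_similarity(doc_a, doc_b):
    """Jaccard similarity: shared field values / all distinct field values."""
    a = set(doc_a.values())
    b = set(doc_b.values())
    return len(a & b) / len(a | b)

# First and last rows from the example data above
doc1 = {"skills": "Java", "hobbies": "fishing", "certification": "PMP", "education": "MS"}
doc4 = {"skills": "C++",  "hobbies": "fishing", "certification": "PMP", "education": "MS"}

print(compute_similarity(doc1, doc4))  # 3 shared of 5 distinct values -> 0.6
```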

PS: I use Python.

StatguyUser
    Possible duplicate of [If I wanted to build a search engine, how would I start?](https://softwareengineering.stackexchange.com/questions/47360/if-i-wanted-to-build-a-search-engine-how-would-i-start) – gnat Jul 27 '17 at 05:05
  • Checking each *x* with every other *x* is O(n^2) time and there's no getting around that. If this is a prerequisite for your question, then no improved algorithm is possible. – Neil Jul 27 '17 at 06:28
  • @gnat I am not building this as a search engine. It's for a marketing lead database. – StatguyUser Jul 27 '17 at 06:40
  • it's a search engine for a marketing lead database – gnat Jul 27 '17 at 06:48

1 Answer


I think you are taking the wrong approach by comparing every document with every other document. You should define an N-dimensional space to contain your documents, then use a nearest-neighbour algorithm to find the best matches. This will be many times faster than doing 100,000,000 x 100,000,000 document comparisons.

There are a number of algorithms you can consider. If you are not familiar with them, I suggest you start by trying a k-d tree and see how well it works for your specific problem.
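A minimal sketch of this idea, assuming scikit-learn is available: one-hot encode the categorical fields so that each distinct column value becomes a dimension, then let a k-d tree answer nearest-neighbour queries instead of comparing all pairs.

```python
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import OneHotEncoder

# The four example rows from the question
docs = [
    ["Java",   "fishing",       "PMP", "MS"],
    ["Python", "reading novel", "SCM", "BS"],
    ["C#",     "video game",    "PMP", "B.Tech."],
    ["C++",    "fishing",       "PMP", "MS"],
]

# Each distinct value in each column becomes one dimension of the space
X = OneHotEncoder().fit_transform(docs).toarray()

# Build a k-d tree index; querying it avoids the all-pairs loop
nn = NearestNeighbors(n_neighbors=2, algorithm="kd_tree").fit(X)
distances, indices = nn.kneighbors(X[:1])  # neighbours of the first document

print(indices)  # row 0 itself first, then its closest match (row 3)
```

Note that k-d trees degrade in very high-dimensional spaces, so with many distinct field values you may need to compare against other index types before committing to one.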

bikeman868
  • Can you elaborate on this please? It sounds like a solution. Thanks! – StatguyUser Jul 27 '17 at 06:40
  • How do I define an N-dimensional space? Did you mean something like TF-IDF? And how might k-NN be applied? I am basically unable to get the intuition. Also, any code snippets or links which have examples will be helpful. Thanks!! – StatguyUser Jul 27 '17 at 06:53
  • 1
    @Enthusiast Each column in your original post is a dimension. – Frank Hileman Jul 27 '17 at 17:53
  • What other algorithms might be used apart from k-d trees for measuring similarity? – StatguyUser Jul 28 '17 at 01:24
  • 1
    You seem to be a beginner in this field. I highly suggest that you try a k-d tree first; it is easy to understand, very well documented, and there are many example implementations on the Internet. – bikeman868 Jul 28 '17 at 01:41
  • You're right, I am a beginner in information retrieval. My background is in data mining. Thanks for your help thus far. I have a question though: how do I put this in production? Can you guide me on how this kind of analysis can be integrated with databases? My organization uses MySQL and Elasticsearch, although I have a limited understanding of Elasticsearch; it is one of my seniors who works on that. Thanks in advance!! – StatguyUser Jul 28 '17 at 09:24
  • You can't write this as a MySQL stored routine because the language does not support this type of algorithmic programming. It might be possible to add this functionality directly into a PostgreSQL database but most developers would not go that route. This kind of functionality normally lives inside the application code and would be written in Java, C# or another application programming language. – bikeman868 Jul 28 '17 at 11:43