I have millions of documents(close to 100 million), each document has fields such as skills, hobbies, certification and education. I want to find similarity between each document along with a score.
Below is an example of data.
skills hobbies certification education
Java fishing PMP MS
Python reading novel SCM BS
C# video game PMP B.Tech.
C++ fishing PMP MS
so what i want is similarity between first row and all other rows, similarity between second row and all other rows and so on. So, every document should be compared against every other document. to get the similarity scores.
Purpose is that i query my database to get people based on skills. In addition to that, i now want people who even though do not have the skills, but are somewhat matching with the people with the specific skills. For example, if i wanted to get data for people who have JAVA skills, first row will appear and again, last row will appear as it is same with first row based on similarity score.
Challenge: My primary challenge is to compute some similarity score for each document against every other document as you can see from below pseudo code. How can i do this faster? Is there any different way to do this with this pseudo code or is there any other computational(hardware/algorithm) approach to do this faster?
document = all_document_in_db
For i in document:
for j in document:
if i != j :
compute_similarity(i,j)
PS: I use python