4

I have to design and implement an algorithm for my university project that searches a given set of documents based on the keywords/query given. Assume that each document contain few sentences and these documents can be stored in a suitable data structure. When a query is made I have to display the documents that contain the keywords. A query can contain simple logical operators such as “AND” and “OR”.

For example assume that there are 3 documents named Doc1, Doc2, Doc3 with this content:

  • Doc1: This is my University.
  • Doc2: My University is situated at Delhi.
  • Doc3: I like My University.

Here are the answers to some queries:

  1. "University": Doc1, Doc2, Doc3
  2. "my AND University": Doc1, Doc2
  3. "like OR Delhi": Doc2, Doc3

Currently what I have developed reads each file and put its contents into separate binary trees, and I've developed a function for searching one word from the binary trees. How can I extend my search algorithm for search with logical operators?

doppelgreener
  • 1,274
  • 17
  • 26
  • Your are not looking for third party tool? You need to build this from the ground up? Lucene is Open Source. – paparazzo Oct 11 '16 at 15:43
  • yeah. I don't want to use any 3rd party tool and request to provide any possible approach to implement this. – SHdinesh Madushanka Oct 11 '16 at 15:45
  • If I understand correctly, the answers to some queries are *theoretical expected results,* not actual results you're already getting, is that right? – doppelgreener Oct 11 '16 at 15:47
  • 2
    Why does query 2 "my AND University" not return all three documents? Or, if this is case-sensitive, why _does_ it return Doc2? – jscs Oct 11 '16 at 16:56
  • if you know to find one word then break the string into tokens and ineach step check two: the word + operator. if or and found word true, if "and" and not found... false else move to the next two query words. – Asaf Oct 11 '16 at 18:36

2 Answers2

3

Hash the words to a key value pair of (word, set of documents). When you do the search, insert the sets found into the hashtable. Then do a union on the sets for OR and intersections for AND

Skorpius
  • 139
  • 2
1

You could go with .

Dictionary<string, HashSet<Int>>  se
           word            docID

But I thought you had to build from scratch

I don't know java
It may be called a HashTable in java

var docs =  se["my"].Interset(se["university"]);

var docs =  se["my"].Union(se["Delhi"]);
paparazzo
  • 1,937
  • 1
  • 14
  • 23