Can somebody point me in the right direction? I don't really have experience writing code, but I'm a fast learner. I'm not saying it will be easy, but I hope somebody can help me...
-
If you don't have any experience writing code, how about learning how to code first? Building a search engine is kind of overkill for someone new to programming. – Terence Ponce Feb 13 '11 at 08:40
-
As Terence says, it is not something for those new to programming. Any search engine that is any good at all tends to be a fairly complicated piece of work. There is a reason Google is so massive: they do a very hard task fairly well. Even seasoned programmers tend to prefer using something pre-built. If you give some idea as to what you want it for, maybe we can point you in the right direction. – Orbling Feb 13 '11 at 08:45
-
Start with a smaller project. Building a search engine involves a massive amount of code. And how do you gather all the data for your search engine? You'll need a lot more code and a massive infrastructure. Really, please start on smaller projects and work up, or you will just get frustrated. – James Feb 13 '11 at 11:09
-
This book helped me understand what search engines are and how they are built: http://nlp.stanford.edu/IR-book/ – devnull Feb 14 '12 at 07:27
4 Answers
I wrote this for a blog I used to have way back when. It's no longer on the web, so here it is:
How to write a search engine
Darren Rowse over at problogger.net is holding a Group Writing Project on anything "How to". This is one of the few blogs that I read regularly, so I figured why not write something worth reading for a change, rather than my standard violent rant where I'll end up threatening to stab Hugo Chavez in the throat.
I decided to write "How to write a search engine". I chose this topic for two reasons:
- There is not much good info on this on the web.
- I am currently writing one for one of my clients.
My client is an online retailer of significant size, so it's not searching the entire web, just their site, and more specifically just the products for sale on their site. Nonetheless, the same techniques can be used for writing a more complex one for searching the internet. I know this is not a tech blog, so I won't go too deep into the technicalities, nor will I be discussing hardware/processing power requirements or web crawling.
I'm using a fairly simple technique. I have a table (tblKeywords) with three fields:
- ItemId (if you are doing a web search, this would be the URL)
- KeyWord (the indexed keyword)
- Weight (a numeric value from 1 to 100; the higher the number, the more significance (weight) the keyword carries)

The primary key is ItemId + KeyWord.
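As a rough sketch of the table described above, here is one way it could be created with Python's built-in sqlite3 module. The answer does not say which database engine was used, so SQLite, the in-memory database, and the column types are all assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite database, used here only for illustration (assumption:
# the original answer does not name a database engine).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE tblKeywords (
        ItemId  TEXT    NOT NULL,   -- would be a URL for a web search
        KeyWord TEXT    NOT NULL,   -- the indexed keyword
        Weight  INTEGER NOT NULL,   -- higher means more significant
        PRIMARY KEY (ItemId, KeyWord)
    )
    """
)

# The composite primary key allows only one row per (item, keyword) pair.
conn.execute("INSERT INTO tblKeywords VALUES ('12345', 'shirt', 35)")
try:
    conn.execute("INSERT INTO tblKeywords VALUES ('12345', 'shirt', 5)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM tblKeywords").fetchone()[0]
```

Because of the composite key, repeated occurrences of a word for the same item have to be folded into a single Weight value before the row is stored.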
The first thing I do is collect individual words from any place that is relevant. For my client, I pull words from the products table, specifically from the fields ItemId, ItemName, ItemShortDescription, ItemLongDescription, Manufacturer, ManufacturerSKU, Category1, Category2, Category3, etc. If you are indexing web pages, you can pull data from the page text, the page title, the URL, or links on other pages that link back to the page being indexed.
The weight value is determined by where the keyword came from. For example, in my case the item's manufacturer's SKU would get a weight of 100, while a word from the item name may get a weight of 25. A word from the ItemLongDescription may get a weight of 5. If you are indexing web pages, a word from the page title may get a weight of 75, while a word in bold from the page text may get a weight of 10. If a word is repeated more than once and/or appears in more than one place, you add up the weight for each time it occurs. For example, if the word "shirt" comes from two places for ItemId=12345, once in the ItemName (weight of 25) and twice in the ItemLongDescription (weight of 5 x 2 = 10), the word "shirt" would have a total weight of 35 for ItemId=12345.
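The weighting rule can be sketched in a few lines of Python. The three field weights are the ones given in the answer; the function name `keyword_weights`, the whitespace-only word splitting, and the sample product text are assumptions made up for this illustration.

```python
from collections import Counter

# Field weights taken from the answer's examples.
FIELD_WEIGHTS = {
    "ManufacturerSKU": 100,
    "ItemName": 25,
    "ItemLongDescription": 5,
}

def keyword_weights(item):
    """Add the field's weight once for every occurrence of each word."""
    totals = Counter()
    for field, weight in FIELD_WEIGHTS.items():
        # Naive tokenization: lowercase and split on whitespace only.
        for word in item.get(field, "").lower().split():
            totals[word] += weight
    return totals

# Hypothetical product, chosen so "shirt" appears once in the name
# and twice in the long description.
item_12345 = {
    "ItemName": "Pink Shirt",
    "ItemLongDescription": "this cotton shirt is a slim fit shirt",
}

weights = keyword_weights(item_12345)
shirt_total = weights["shirt"]  # 25 + 5 + 5
```

Each (ItemId, word, total) triple produced this way would become one row in tblKeywords.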
If someone searches for "pink shirt", I search my table for all instances of the words "pink" or "shirt" and total the weights, showing the items with the highest total weight on top.
SQL:
-- Filter rows by keyword first (WHERE), then total the weights per item.
SELECT ItemId, SUM(Weight) AS totWeight FROM tblKeywords
WHERE KeyWord IN ('pink', 'shirt')
GROUP BY ItemId ORDER BY totWeight DESC
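To see the ranking idea end to end, here is a small self-contained sketch using Python's sqlite3 module. It filters on the keyword with WHERE before grouping, since a keyword filter has to be applied to individual rows rather than to the per-item groups. The `search` helper, the SQLite backend, and the sample rows are assumptions for illustration, not part of the original system.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tblKeywords (ItemId TEXT, KeyWord TEXT, Weight INTEGER, "
    "PRIMARY KEY (ItemId, KeyWord))"
)
# Made-up sample data: three items with precomputed keyword weights.
conn.executemany(
    "INSERT INTO tblKeywords VALUES (?, ?, ?)",
    [
        ("A", "pink", 25), ("A", "shirt", 35),
        ("B", "shirt", 25),
        ("C", "pink", 5), ("C", "hat", 25),
    ],
)

def search(conn, terms):
    """Return (ItemId, total weight) pairs, highest total first."""
    if not terms:
        return []
    # One "?" placeholder per term keeps user input out of the SQL text.
    placeholders = ", ".join("?" for _ in terms)
    sql = (
        "SELECT ItemId, SUM(Weight) AS totWeight FROM tblKeywords "
        f"WHERE KeyWord IN ({placeholders}) "
        "GROUP BY ItemId ORDER BY totWeight DESC"
    )
    return conn.execute(sql, [t.lower() for t in terms]).fetchall()

results = search(conn, ["Pink", "shirt"])
```

Item A matches both words and so ranks first; item C matches only "pink" from a low-weight field and ranks last.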
So there you have it: a basic (and fast) search engine. Of course there is more to do, such as stripping out punctuation, HTML code, and worthless keywords such as "and", "if", and "or". This doesn't address searching for key phrases, but you can use a similar system for phrases if you can figure out where they start and end.
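The cleanup step might look something like the sketch below. The stop-word set contains only the three words named above, the tag-stripping regular expression is deliberately naive (a real HTML parser would be more robust), and the `tokenize` name is an assumption for this example.

```python
import re

# Only the three "worthless" keywords named in the answer; a real list
# would be much longer.
STOP_WORDS = {"and", "if", "or"}

def tokenize(text):
    """Lowercase the text, drop HTML tags, punctuation, and stop words."""
    text = re.sub(r"<[^>]+>", " ", text)            # naive HTML tag removal
    words = re.findall(r"[a-z0-9]+", text.lower())  # keeps letters and digits only
    return [w for w in words if w not in STOP_WORDS]

tokens = tokenize("<b>Pink</b> shirt, and hat!")
```

The same function would need to be applied both when indexing and when parsing the user's query, so that the two sides produce matching keywords.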

-
Thanks, nice answer. What would you also suggest for grouping products? I mean, say you have 1 million crawled product pages from different websites and you want to group the same products across different websites: http://programmers.stackexchange.com/questions/134292/product-classifying-algorithm-text-classification-c-algorithm-suggestions – Furkan Gözükara Feb 10 '12 at 03:25
-
You can also buy (rent) this data from companies like [Etilize](http://www.etilize.com/index.htm) – Morons Feb 10 '12 at 13:51
-
Can you explain it a bit more? I did not get it. Thank you. – Furkan Gözükara Feb 11 '12 at 00:15
-
An "ok" practical answer. However, this ignores the entire field of study known as "Information Retrieval". In terms of "pointing in the right direction", IR would have been better in my mind. – Darknight Feb 14 '12 at 10:17
The freely distributed draft of Introduction to Information Retrieval is going to be your prime reference material. It handles search (information retrieval) from basic to advanced level.

Search engines are built upon web crawlers. You will need to figure out how to build one of these suckers before you can develop a website to display its results (you'll need a fast, efficient database to go with it).

-
A search engine may consume the output of a web crawler, but they otherwise have very little to do with each other. Moreover, a fast, efficient database of the SQL variety is unlikely to help much in this endeavour. Search engines are generally built using inverted file indexing schemes, which don't fit the SQL mold at all. – Marcelo Cantos Feb 13 '11 at 09:01
-
@Marcelo Cantos: Inverted file indexing sounds complicated :-0 - Thanks for helping to clarify my answer! – palbakulich Feb 13 '11 at 10:42
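For anyone curious what the inverted file indexing mentioned in the comments boils down to, here is a toy in-memory sketch: a mapping from each word to the set of documents that contain it. The function names and the three sample documents are made up for illustration, and real engines add compression, ranking, and on-disk storage on top of this idea.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def lookup_all(index, terms):
    """Return the ids of documents that contain every query term."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

# Hypothetical documents keyed by id.
docs = {1: "pink shirt", 2: "blue shirt", 3: "pink hat"}
index = build_inverted_index(docs)
matches = lookup_all(index, ["pink", "shirt"])
```

A lookup touches only the entries for the query words, which is why this structure scales better than scanning every document.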
This is an introductory course to CS that's going to start on the 20th. I suggest you check it out; it's offered free of charge.
