Such applications face two important challenges:
- Gathering the data, and
- Making sense of the data.
Gathering the data
A variety of techniques may be employed, depending on what's supported by each source site:
Simple web crawler
Anything from a simple curl script to a full-fledged custom web crawler (see the sketch after this list).
Web services
The optimal way, if the source site supports them. Any kind of web service will do, from RSS to SOAP; that's their whole point: to provide a method of communication between two sites over the web.
Black magic stuff
Depending on the relation between the source and target sites, this can be anything, really. I've seen one site that literally sent SQL dumps of its database to another site via FTP. That's definitely not recommended, but it's a good example of a weird, counter-intuitive approach you never thought existed, until you see it with your own eyes.
By hand
Obviously not recommended. But under very specific circumstances and for very small sets of data it might prove to be the optimal solution.
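To make the crawler end of that spectrum concrete, here is a minimal sketch in Python: fetch one page from a source site and pull product names out of it. The URL and the markup pattern are assumptions made for illustration; a real crawler would use a proper HTML parser and respect robots.txt and rate limits.

# Minimal "simple curl script" equivalent: fetch a page, extract product names.
# SOURCE_URL and the <h2 class="product-name"> pattern are hypothetical.
import re
import urllib.request

SOURCE_URL = "https://vendor.example.com/products"  # hypothetical source site

def fetch_product_names(url):
    with urllib.request.urlopen(url, timeout=30) as response:
        html = response.read().decode("utf-8", errors="replace")
    # Assumes product names sit in <h2 class="product-name">...</h2> elements.
    return re.findall(r'<h2 class="product-name">(.*?)</h2>', html, re.S)

if __name__ == "__main__":
    for name in fetch_product_names(SOURCE_URL):
        print(name.strip())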
The automated techniques commonly require some time-based background process, and for websites the obvious example is a cron script. Scheduling plays an important part in the process as you need to schedule the gathering process(es) according to:
1. How often products are modified (created / deleted)
2. How often product information is modified
3. Bandwidth restrictions (on your site and each source site)
I've separated 2 from 1, as I've seen them treated as separate processes, with different workflows and scheduling. For example, I've worked on a similar site that had a vendor who added new products only on Mondays, so we only checked for new products on Tuesdays, at 00:00.
Making sense of the data
This is where the fun begins! Heterogeneous data is always a blast to work with! Every bit of data you gather will most probably follow a unique structure. If you are lucky enough to gather the data via a web service, chances are the structures will make sense and you'll be able to build a combined index relatively easily.
But if there's at least one site you have to crawl, or its web service structure doesn't really make sense, then your index is not going to be that easy to build. The most common problem is the one you've identified:
For example, in case of iPhone, some websites name Apple iPhone, while some name it iPhone 4S. How can one make sure that the sync is maintained.
There's no single answer to that; it depends on data size, volume, complexity, etc. It can be as simple as building and maintaining an associative index by hand:
tokenA1 == tokenB1 == token1
tokenA2 == tokenB5 == token2
tokenA5 == tokenB2 == token3
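As a minimal sketch of that by-hand approach in Python (the alias table below just reuses the iPhone naming from the question, plus an invented Galaxy entry):

# Hand-maintained associative index: every spelling a source site uses
# is mapped to one canonical token. Entries are illustrative.
ALIASES = {
    "Apple iPhone": "iphone-4s",
    "iPhone 4S": "iphone-4s",
    "Samsung Galaxy S II": "galaxy-s2",
    "Galaxy S2": "galaxy-s2",
}

def canonical(name):
    """Map a source site's product name to the canonical token, if known."""
    return ALIASES.get(name.strip(), name.strip())

print(canonical("Apple iPhone"))  # -> iphone-4s
print(canonical("iPhone 4S"))     # -> iphone-4s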
or as complex as building a custom algorithmic solution to discover the associations automatically. Some relevant buzzwords are pattern recognition, classification, and nearest neighbor search.
One algorithm (or a variation of it1) I've seen used in a somewhat comparable scenario is the k-nearest neighbor algorithm:
In pattern recognition, the k-nearest neighbor algorithm (k-NN) is a method for classifying objects based on closest training examples in the feature space.
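A minimal 1-NN sketch in Python, assuming you already have a handful of labelled names per product and using difflib's string similarity as a stand-in for a real feature-space distance (both the labelled examples and the similarity measure are assumptions for illustration):

# Classify an incoming product name by the most similar already-labelled name.
from difflib import SequenceMatcher

LABELLED = [
    ("Apple iPhone", "iphone-4s"),
    ("iPhone 4S", "iphone-4s"),
    ("Samsung Galaxy S II", "galaxy-s2"),
]

def nearest_label(name):
    """Return the canonical label of the closest known product name."""
    best_label, best_score = None, -1.0
    for known, label in LABELLED:
        score = SequenceMatcher(None, name.lower(), known.lower()).ratio()
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(nearest_label("apple iphone 4s 16gb"))  # -> iphone-4s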
Normally it won't be that hard; my intention is more to illustrate the spectrum of possible solutions than to pinpoint an optimal one. There isn't a single optimal solution: it will be different for each target site, and in almost every scenario you'll need to mix and match.
A possible solution of average difficulty that would work for a wide range of source sites could be built around a text search engine / information retrieval library, like Apache Lucene, or even just a few well-written regular expressions.
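For instance, a couple of regular expressions can often normalise product names enough for exact comparison. A rough Python sketch, where the brand list and the rules are assumptions rather than a general solution:

# Normalise product names so "Apple iPhone 4S" and "iPhone 4S" collapse
# to the same key before comparing.
import re

BRANDS = r"\b(apple|samsung|nokia)\b"  # hypothetical brand prefixes to drop

def normalise(name):
    key = name.lower()
    key = re.sub(BRANDS, " ", key)         # drop the brand name
    key = re.sub(r"[^a-z0-9]+", " ", key)  # keep only letters and digits
    return " ".join(key.split())           # collapse whitespace

print(normalise("Apple iPhone 4S"))  # -> "iphone 4s"
print(normalise("iPhone 4S"))        # -> "iphone 4s"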
And of course, to make matters worse, there isn't a single workflow. The simplest I can think of are:
- Gather the data first, then analyse,
- Analyse the data while gathering.
Which one fits depends largely on your data storage, and it might even depend on your hardware. It will also depend on how you gather the data: multiple requests for lots of data pose the risk of hurting the target site, so the workflow may even be different per site.
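A rough sketch of those two workflows, reusing the hypothetical fetch_product_names() and normalise() helpers from the earlier sketches:

def gather_then_analyse(urls):
    """Workflow 1: pull everything first, then analyse in a separate pass."""
    raw = {url: fetch_product_names(url) for url in urls}   # gathering pass
    return {url: [normalise(name) for name in names]        # analysis pass
            for url, names in raw.items()}

def analyse_while_gathering(urls):
    """Workflow 2: analyse each batch as soon as it arrives."""
    index = {}
    for url in urls:
        for name in fetch_product_names(url):
            index.setdefault(normalise(name), []).append(url)
    return index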
It's a grand mess! Make sure you ask for a lot of money if ever involved in such a project...
1 To be perfectly honest, it might have been something different entirely; at the time I was told it was a naive nearest neighbor search, but I was too young and impressionable to investigate deeply. It worked, though, and it looked like NNS.