Such applications face two important challenges:
- Gathering the data, and
- Making sense of the data.
Gathering the data
A variety of techniques may be employed, depending on what's supported by each source site:
Simple web crawler
Anything from a simple curl script to a full-fledged custom web crawler (see the sketch after this list).
Web services
The optimal way, if the source site supports them. Any kind of web service will do, from RSS to SOAP; that's their whole point: to provide a method of communication between two sites over the web.
Black magic stuff
Depending on the relation between the source and target sites, this can be anything, really. I've seen one site that literally sent SQL dumps of its database to another site via FTP. That's definitely not recommended, but it's a good example of a weird, counter-intuitive approach you never thought existed, until you see it with your own eyes.
By hand
Obviously not recommended. But under very specific circumstances and for very small sets of data it might prove to be the optimal solution.
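To make the crawler end of that spectrum concrete, here is a minimal sketch in Python: fetch one page from a source site and pull product names out of it. The URL and the markup pattern are assumptions made for illustration; a real crawler would use a proper HTML parser and respect robots.txt and rate limits.

# Minimal "simple curl script" equivalent: fetch a page, extract product names.
# SOURCE_URL and the <h2 class="product-name"> pattern are hypothetical.
import re
import urllib.request

SOURCE_URL = "https://vendor.example.com/products"  # hypothetical source site

def fetch_product_names(url):
    with urllib.request.urlopen(url, timeout=30) as response:
        html = response.read().decode("utf-8", errors="replace")
    # Assumes product names sit in <h2 class="product-name">...</h2> elements.
    return re.findall(r'<h2 class="product-name">(.*?)</h2>', html, re.S)

if __name__ == "__main__":
    for name in fetch_product_names(SOURCE_URL):
        print(name.strip())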
The automated techniques commonly require some time-based background process, and for websites the obvious example is a cron script. Scheduling plays an important part in the process as you need to schedule the gathering process(es) according to:
1. How often products are modified (created / deleted)
2. How often product information is modified
3. Bandwidth restrictions (on your site and each source site)
I've separated 2 from 1, as I've seen them treated as separate processes, with different workflows and scheduling. For example, I've worked on a similar site that had a vendor who added new products only on Mondays, so we only checked for new products on Tuesdays, at 00:00.
Making sense of the data
This is where the fun begins! Heterogeneous data is always a blast to work with! Every bit of data you gather will most probably follow a unique structure. If you are lucky enough to gather the data via a web service, chances are the structures will make sense and you'll be able to build a combined index relatively easily.
But if there's at least one site you have to crawl, or its web service structure doesn't really make sense, then your index is not going to be that easy to build. The most common problem is the one you've identified:
For example, in case of iPhone, some websites name Apple iPhone, while some name it iPhone 4S. How can one make sure that the sync is maintained.
There's no single answer to that; it depends on data size, volume, complexity, etc. It can be as simple as building and maintaining an associative index by hand:
tokenA1 == tokenB1 == token1
tokenA2 == tokenB5 == token2
tokenA5 == tokenB2 == token3
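As a minimal sketch of that by-hand approach in Python (the alias table below just reuses the iPhone naming from the question, plus an invented Galaxy entry):

# Hand-maintained associative index: every spelling a source site uses
# is mapped to one canonical token. Entries are illustrative.
ALIASES = {
    "Apple iPhone": "iphone-4s",
    "iPhone 4S": "iphone-4s",
    "Samsung Galaxy S II": "galaxy-s2",
    "Galaxy S2": "galaxy-s2",
}

def canonical(name):
    """Map a source site's product name to the canonical token, if known."""
    return ALIASES.get(name.strip(), name.strip())

print(canonical("Apple iPhone"))  # -> iphone-4s
print(canonical("iPhone 4S"))     # -> iphone-4s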
or as complex as building a custom algorithmic solution to discover the associations automatically. Some relevant buzzwords are pattern recognition, classification, and nearest neighbor search.
One algorithm (or a variation of it1) I've seen used in a somewhat comparable scenario is the k-nearest neighbor algorithm:
In pattern recognition, the k-nearest neighbor algorithm (k-NN) is a method for classifying objects based on closest training examples in the feature space.
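A minimal 1-NN sketch in Python, assuming you already have a handful of labelled names per product and using difflib's string similarity as a stand-in for a real feature-space distance (both the labelled examples and the similarity measure are assumptions for illustration):

# Classify an incoming product name by the most similar already-labelled name.
from difflib import SequenceMatcher

LABELLED = [
    ("Apple iPhone", "iphone-4s"),
    ("iPhone 4S", "iphone-4s"),
    ("Samsung Galaxy S II", "galaxy-s2"),
]

def nearest_label(name):
    """Return the canonical label of the closest known product name."""
    best_label, best_score = None, -1.0
    for known, label in LABELLED:
        score = SequenceMatcher(None, name.lower(), known.lower()).ratio()
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(nearest_label("apple iphone 4s 16gb"))  # -> iphone-4s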
Normally it won't be that hard; my intention is more to illustrate the spectrum of possible solutions than to pinpoint an optimal one. There isn't a single optimal solution: it will be different for each target site, and in almost every scenario you'll need to mix and match.
A possible solution of average difficulty that would work for a wide range of source sites could be built around a text search engine / information retrieval library, like Apache Lucene, or even just a few well-written regular expressions.
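For instance, a couple of regular expressions can often normalise product names enough for exact comparison. A rough Python sketch, where the brand list and the rules are assumptions rather than a general solution:

# Normalise product names so "Apple iPhone 4S" and "iPhone 4S" collapse
# to the same key before comparing.
import re

BRANDS = r"\b(apple|samsung|nokia)\b"  # hypothetical brand prefixes to drop

def normalise(name):
    key = name.lower()
    key = re.sub(BRANDS, " ", key)         # drop the brand name
    key = re.sub(r"[^a-z0-9]+", " ", key)  # keep only letters and digits
    return " ".join(key.split())           # collapse whitespace

print(normalise("Apple iPhone 4S"))  # -> "iphone 4s"
print(normalise("iPhone 4S"))        # -> "iphone 4s"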
And of course, to make matters worse, there isn't a single workflow. The simplest I can think of are:
- Gather the data first, then analyse,
- Analyse the data while gathering.
Which one fits depends largely on your data storage, and it might even depend on your hardware. It will also depend on how you gather the data: multiple requests for lots of data pose the risk of hurting the target site, so the workflow may even be different per site.
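A rough sketch of those two workflows, reusing the hypothetical fetch_product_names() and normalise() helpers from the earlier sketches:

def gather_then_analyse(urls):
    """Workflow 1: pull everything first, then analyse in a separate pass."""
    raw = {url: fetch_product_names(url) for url in urls}   # gathering pass
    return {url: [normalise(name) for name in names]        # analysis pass
            for url, names in raw.items()}

def analyse_while_gathering(urls):
    """Workflow 2: analyse each batch as soon as it arrives."""
    index = {}
    for url in urls:
        for name in fetch_product_names(url):
            index.setdefault(normalise(name), []).append(url)
    return index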
It's a grand mess! Make sure you ask for a lot of money if ever involved in such a project...
1 To be perfectly honest, it might have been something different entirely; at the time I was told it was a naive nearest neighbor search, but I was too young and impressionable to investigate deeply. It worked, though, and it looked like NNS.