2

I have a web crawler, and I'm looking for hints that will help me automatically detect a website's country of origin.

And by country of origin I generally mean the country the website is targeting. For example:

I know there's no foolproof way of doing it, so I will likely rely on a scoring system based on parameters such as:

  • The domain name;
  • The content language;
  • The server's IP address;
  • Whois information;

What additional parameters would you use?

For the previous examples a combination of domain and content language will do, but many websites have a .com domain and a language spoken in more than one country...

  • 1
    I've added the `I'm building a web crawler` in your question, please revise if that's not what you're doing. If that's so, please don't rely on tags to convey information about what you're doing, we prefer to read about it, tags are only useful for quickly categorizing the question (and oh so many people don't tag appropriately, you can never be sure)... – yannis Jan 20 '12 at 10:50
  • 1
    This assumes that there is a single country of origin. Multi-hosted servers (e.g. www.google.com) fail this test. – MSalters Jan 20 '12 at 10:58
  • What about this case: I am in the Netherlands, but have a site in English, hosted somewhere in the world. The host location can change as the hosting provider sees fit. The same site may also host pages in Dutch, but not necessarily. Whois is not going to give you anything; many sites are registered "privately" to avoid as much spam as possible through Whois lookups. Looking for contact information may be harder in a crawler, but should be more reliable. And of course you still need to deal with multinationals. – Marjan Venema Jan 20 '12 at 11:11
  • I'm aware that no single parameter is truly reliable, and that there's no solution for all websites... I'm not looking for 100% accuracy. – Filipe Miguel Fonseca Jan 20 '12 at 11:14
  • 1
    When you say Country of Origin, do you mean the country that the website is associated with, the country that the servers are located in, or the country that the business is incorporated in? – briddums Jan 20 '12 at 15:00

5 Answers

2

Country selection information from the content and headers.

S.Lott
  • 2
    What's _the actual user_ in the context of a website? As I read it, he wants to know where a website server is located, not where a website user is located. – MSalters Jan 20 '12 at 10:52
  • Often, it's in the content as well as the headers. The language and internationalization settings are sometimes present in headers. – S.Lott Jan 20 '12 at 10:53
2

Try this approach. Start with your initial list of parameters, then start collecting the data. After you have hundreds of sites in your database look for ones that don't seem to make sense. Then start looking for other clues to solve those outliers.

In other words, use the scientific method.
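The outlier review described above could start from something like the following. This is only a minimal sketch: the `sites` structure, the signal names, and the `find_outliers` helper are assumptions made for illustration, not part of any existing crawler.

```python
def find_outliers(sites):
    """Return URLs whose informative signals point to more than one country.

    `sites` is a hypothetical structure: URL -> {signal name: country code or None}.
    """
    outliers = []
    for url, signals in sites.items():
        # Ignore signals that gave no answer (None); keep the distinct countries.
        countries = {c for c in signals.values() if c is not None}
        if len(countries) > 1:
            outliers.append(url)
    return outliers


collected = {
    "http://example.fr": {"tld": "FR", "language": "FR", "server_ip": "FR"},
    "http://example.com": {"tld": None, "language": "PT", "server_ip": "US"},
}
print(find_outliers(collected))
```

The flagged sites are the ones worth inspecting by hand for additional clues.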

mhoran_psprep
1

The four elements you have identified are also the only ones I can think of.

The domain name, or rather, the TLD which is probably what you mean, can be used for a positive identification, but not for a negative one, because it may be .com, in which case it tells you nothing.
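As a rough sketch of "positive but never negative" TLD evidence, the lookup below returns a country only for a known country-code TLD and `None` otherwise. The `CCTLD_TO_COUNTRY` table is a tiny illustrative subset and `country_from_tld` is a hypothetical helper name.

```python
from urllib.parse import urlsplit

# Illustrative subset only; a real crawler would load a complete ccTLD list.
CCTLD_TO_COUNTRY = {"fr": "FR", "de": "DE", "pt": "PT", "br": "BR", "uk": "GB"}


def country_from_tld(url):
    """Return the country hinted at by the TLD, or None if it tells us nothing."""
    host = urlsplit(url).hostname or ""
    tld = host.rstrip(".").rsplit(".", 1)[-1].lower()
    # Generic TLDs such as .com are simply absent from the table, so they
    # yield None (no evidence) rather than evidence against any country.
    return CCTLD_TO_COUNTRY.get(tld)
```

Note that a table like this would still need manual exceptions for ccTLDs that are widely used outside their nominal country.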

The content language (as specified in the headers) is the best indicator. That's also what your crawler probably wants to know: what language is the content that it crawls, not what country the operator of the web site happens to reside in.
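A language tag sometimes carries a region as well (for example `pt-BR`), which is more useful than the bare language. The sketch below assumes the crawler has already extracted the `Content-Language` header value (or an HTML `lang` attribute) as a string; the function names are hypothetical and the parsing is deliberately simplified compared with the full language-tag grammar.

```python
def parse_language_tag(tag):
    """Split a tag such as 'pt-BR' into (language, region); region may be None."""
    parts = tag.strip().replace("_", "-").split("-")
    language = parts[0].lower() or None
    region = None
    for part in parts[1:]:
        if len(part) == 1:
            # Extension or private-use singleton: nothing useful follows.
            break
        if len(part) == 2 and part.isalpha():
            # Two-letter subtag after the language: treat it as the region.
            region = part.upper()
            break
    return language, region


def languages_from_header(content_language):
    """Parse a Content-Language value, which may list several tags."""
    return [parse_language_tag(t) for t in content_language.split(",") if t.strip()]
```

A tag with no region, such as plain `en`, yields `None` for the region and so adds no country evidence by itself.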

The server's IP address is probably next to useless, but if all else fails, you could use it as an additional means of positive-only identification. Never use it for negative identification though, because lots of people host their web sites in the USA, regardless of where they come from or who their audience is.

Whois information is also next to useless, because it will usually tell you where the registrar is located, which really means nothing. And in any case, it is probably going to be quite hard to get and parse any meaningful data out of it.
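Putting these assessments together, a scoring system might weight each signal and count only positive evidence. The weights and signal names below are made-up assumptions chosen to mirror the ranking in this answer; they would need tuning against real data.

```python
from collections import defaultdict

# Hypothetical weights: strong for TLD and language region, weak for IP and Whois.
WEIGHTS = {"tld": 3.0, "language_region": 3.0, "server_ip": 0.5, "whois": 0.5}


def score_countries(signals):
    """`signals` maps a signal name to a country code, or None if uninformative."""
    scores = defaultdict(float)
    for name, country in signals.items():
        if country is not None:
            # Positive evidence only: a missing signal never counts against a country.
            scores[country] += WEIGHTS.get(name, 0.0)
    return dict(scores)


def best_guess(signals):
    """Return the highest-scoring country, or None when there is no evidence."""
    scores = score_countries(signals)
    return max(scores, key=scores.get) if scores else None
```

For a `.com` site in Brazilian Portuguese hosted in the USA, the language region would outweigh the hosting location under these weights.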

Mike Nakis
  • Lots of people and companies based in Sweden (even some government agencies!) use `.nu` domains either as an alternative or as the only address. This is because the Swedish word "nu" means "now", leading to many nice-sounding domain names. And it's not uncommon to host outside of Sweden. – user Jan 20 '12 at 14:08
1

You should first define what you mean by "country of origin", and then base your detection on that. It could mean any of:

  • country (or countries) where the physical server is located
  • country from where the content is uploaded and maintained (physical location of authors)
  • country where the domain name is registered
  • country at which the top-level domain hints (e.g., .fr sites are supposedly "from France")
  • country that the website targets (e.g., lots of .nu sites target a Dutch audience)
  • country associated with the primary language or languages used in the site's content

Each meaning calls for different detection methods, some accurate, others more a matter of guesswork.

tdammers
0

Remember this? xkcd - Map of Online Communities 2, in 3D

The Internet is about user groups, not their real-world political or geographical relations. Thus, you will not gain much for any site larger than a school page or a yellow-pages listing. The only options available are "filter by hosting IP", "filter by page language" and "detect region-specific language patterns".

However, if you still need more detailed info, well... think of imageshack.us, tinypic/imgur or any other picture hosting service. They know both the client IPs and the referring addresses of sites that hotlink to the hosted images, so they can tell how many visitors from which country have seen an image hosted on a given page.

So if you want that kind of tracking: start an image hosting service, seed the site under study with picture postings, links and/or hidden 1x1 transparent GIF/PNG images hotlinked via unique URIs, and voilà, you have that visitor map.

kagali-san