I have a web crawler, and I'm looking for hints that will help me automatically detect a website's country of origin.
By country of origin I generally mean the country the website is targeting. For example:
- http://www.spiegel.de/ -> Germany
- http://www.lemonde.fr/ -> France
- http://publico.pt/ -> Portugal
- http://www.elpais.es/ -> Spain
I know there's no foolproof way of doing it, so I will likely rely on a scoring system. The parameters I'm considering so far:
- The domain name;
- The content language;
- The server's IP address;
- Whois information;
What additional parameters would you use?
For the previous examples a combination of domain and content language will do, but many websites have a .com domain and use a language spoken in more than one country...
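
To make the scoring idea concrete, here is a rough sketch of how such a system might combine the signals listed above. Everything in it is an assumption for illustration: the weights are made-up and untuned, the lookup tables are tiny placeholders, and the names `country_from_tld` and `score_countries` are hypothetical. The IP and Whois countries are passed in as plain arguments, since how they are obtained depends on the geolocation or Whois source used.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical weights: how much each signal is trusted (assumed, not tuned).
WEIGHTS = {"tld": 3.0, "language": 2.0, "ip": 1.0, "whois": 1.0}

# Tiny illustrative tables; a real crawler would need complete ones.
CCTLD_TO_COUNTRY = {"de": "DE", "fr": "FR", "pt": "PT", "es": "ES"}
LANGUAGE_TO_COUNTRIES = {
    "de": ["DE", "AT", "CH"],
    "fr": ["FR", "BE", "CH", "CA"],
    "pt": ["PT", "BR"],
    "es": ["ES", "MX", "AR"],
}


def country_from_tld(url):
    """Return a country code if the host ends in a known ccTLD, else None."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    return CCTLD_TO_COUNTRY.get(tld)


def score_countries(url, language=None, ip_country=None, whois_country=None):
    """Return (country, score) pairs sorted from most to least likely."""
    scores = defaultdict(float)

    tld_country = country_from_tld(url)
    if tld_country:
        scores[tld_country] += WEIGHTS["tld"]

    # A language shared by several countries splits its weight between them.
    candidates = LANGUAGE_TO_COUNTRIES.get(language, [])
    for country in candidates:
        scores[country] += WEIGHTS["language"] / len(candidates)

    if ip_country:
        scores[ip_country] += WEIGHTS["ip"]
    if whois_country:
        scores[whois_country] += WEIGHTS["whois"]

    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    print(score_countries("http://www.spiegel.de/", language="de", ip_country="DE"))
    print(score_countries("http://www.example.com/", language="pt",
                          ip_country="US", whois_country="BR"))
```

With a .com domain the TLD contributes nothing, so the weaker signals (language, IP, Whois) decide the ranking, which is exactly the case where additional parameters would help.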