
Recently I've learned that using a regex to parse the HTML of a website to get the data you need isn't the best course of action.

So my question is simple: What then, is the best / most efficient and a generally stable way to get this data?

I should note that:

  • There are no APIs
  • There is no other source where I can get the data from (no databases, feeds, and such)
  • There is no access to the source files (the data comes from public websites)
  • Let's say the data is normal text, displayed in a table on an HTML page

I'm currently using Python for my project, but language-independent solutions/tips would be nice.

As a side question: How would you go about it when the webpage is constructed by Ajax calls?

EDIT:

In the case of HTML parsing, I know that there is no actual stable way to get the data. As soon as the page changes, your parser is done for. What I mean with stable in this case is: an efficient way to parse the page, that always hands me the same results (for the same set of data obviously) provided that the page does not change.

Mike
  • There is no stable way; no matter how you implement your scraping, it can easily break with a simple change to the webpage. The most stable way to get your data is to contact the authors of the data and broker a deal for you to get it in a sane format. Sometimes that doesn't even cost money. – Joachim Sauer Jun 06 '12 at 08:13
  • @JoachimSauer - Question could still be answered with the 'best' method. – Anonymous Jun 06 '12 at 08:18
  • Since most websites are dynamic and store their data in databases, the best way is to get the database from the website. If the website has an API, you can use it. In case you want to scrape static pages, the built-in Python urllib and HTMLParser modules work well. A few packages for scraping HTML are also available on PyPI. – Ubermensch Jun 06 '12 at 08:21
  • Site scraping is skeezy business. There's really no stable way to do this because site owners don't want you to, and the industry as a whole is trying to stop people from doing it. – Steven Evers Jun 06 '12 at 14:14
  • Maybe embed a web browser such as WebKit and then use DOM scripting to get information from the rendered page? Almost every platform can do that, but here's how you'd do it in Qt: http://doc.qt.nokia.com/4.7-snapshot/qtwebkit-bridge.html – user16764 Jun 06 '12 at 14:15
  • @JoachimSauer Sometimes the owners of the data are not competent enough in 'this whole web-thing' to provide another meaningful way to access the data. Think small store owner, who paid his neighbor's teenager to put a webcart for his store online. – K.Steff Jun 06 '12 at 20:27

5 Answers


In my experience with the .NET environment, you can take advantage of the HTML Agility Pack.

If the page is formatted as XHTML, you can also use a regular XML parser. There are plenty of those out there for any environment you can imagine.
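For the asker's Python setup, here is a minimal sketch of that XML-parser route using only the standard library; it assumes the page really is well-formed XHTML, and the URL and table layout are invented for illustration.

```python
# Assumed URL and XHTML table layout, for illustration only.
import urllib.request
import xml.etree.ElementTree as ET

NS = {"x": "http://www.w3.org/1999/xhtml"}  # XHTML namespace

with urllib.request.urlopen("http://example.com/data.xhtml") as resp:
    root = ET.fromstring(resp.read())

# Print the text of every table cell, row by row.
for row in root.iterfind(".//x:tr", NS):
    print([cell.text for cell in row.iterfind("x:td", NS)])
```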

For the side question about AJAX, you can use regular HTTP networking code to get data and parse it.

Again, if your AJAX stack returns XML, you'll have a lot of choices. If it returns JSON, consider a library that allows you to map the stream to typed objects. In .NET I suggest Newtonsoft.Json.
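A rough Python analogue of the JSON-to-typed-objects idea, using only the standard library; the endpoint and field names below are invented.

```python
# Hypothetical endpoint and field names.
import json
import urllib.request
from dataclasses import dataclass

@dataclass
class Row:
    name: str
    value: float

with urllib.request.urlopen("http://example.com/ajax/data") as resp:
    payload = json.load(resp)  # assumes the endpoint returns a JSON list of objects

rows = [Row(**item) for item in payload]
print(rows[0].name, rows[0].value)
```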

gsscoder
  • And by 'HTTP networking code' you mean capturing the server's response when a request is made? Thanks for the suggestions, I'll be sure to look into them. +1 – Mike Jun 07 '12 at 07:38
  • Exactly. In .NET you can use System.Net.WebClient or a library like RestSharp | http://restsharp.org/. I've used it also on Mono for Droid. – gsscoder Jun 08 '12 at 14:15

Parsing HTML is not a completely trivial task, since one has to deal with possibly incorrect markup (tag soup). Over the years, browsers have implemented more or less the same strategy to deal with errors, and that algorithm has been codified in the HTML5 specification (yes, the HTML5 specification specifies what to do with things that are not HTML5).

There are libraries for parsing HTML in all major languages, for instance this one.
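For example, a minimal sketch in Python with Beautiful Soup (one such library); the URL and table structure are placeholders.

```python
# Placeholder URL and table structure.
import urllib.request
from bs4 import BeautifulSoup

html = urllib.request.urlopen("http://example.com/page.html").read()
soup = BeautifulSoup(html, "html.parser")

# Print the text of each cell in the first table, row by row.
for row in soup.find("table").find_all("tr"):
    print([cell.get_text(strip=True) for cell in row.find_all("td")])
```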

In any case, what you will get is not stable in any sense. Each time the webpage format changes, you have to adapt your scraper.

Andrea
  • Thanks, I've been using [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/) to get the job done. I know it will not be stable; I should probably clarify that in my question. +1 for you :) – Mike Jun 07 '12 at 07:42

As a side question: How would you go about it when the webpage is constructed by Ajax calls?

If AJAX calls are being made, then it's very likely either a POST or a GET to some URL with some variables.

I would examine the JavaScript to find out what the endpoints and parameters are. After that, it's very likely that the data returned is JSON/XML/plain text, or perhaps partial HTML.

Once you know the above information, you simply make a GET or POST request to that endpoint, and parse the returned data.
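In Python that boils down to something like the sketch below; the URL, parameters, and X-Requested-With header are assumptions, to be replaced with whatever your browser's network tab actually shows for the site.

```python
# Endpoint, parameters and header are assumptions; use the real ones you
# observe in the browser's network tab.
import json
import urllib.parse
import urllib.request

query = urllib.parse.urlencode({"page": 1, "category": "books"})
req = urllib.request.Request(
    "http://example.com/ajax/list?" + query,
    headers={"X-Requested-With": "XMLHttpRequest"},  # some servers check for this
)
with urllib.request.urlopen(req) as resp:
    data = json.load(resp)  # or an XML/HTML parser, depending on what the site returns
print(data)
```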

Darknight
  • Worth noting that many services inspect the HTTP headers to ensure `HTTP_X_REQUESTED_WITH` is `XMLHttpRequest`. Good ones will also implement some kind of XSRF protection for POST requests, so you'll need that magic cookie as well. Tickling AJAX endpoints not deliberately exposed by some public API feels a little icky to me, and your scraper is just as prone to breakage if the output (or request policy) changes. – Tim Post Jun 06 '12 at 08:52
  • @TimPost you are 100% correct. I agree it's "icky" indeed :) but in the absence of any public API, needs must... – Darknight Jun 06 '12 at 09:23
  • I could use this on my own AJAX-powered application (and by 'own' I don't mean I wrote it, but the setup is mine), but it wouldn't feel right to try and bypass another server's system, so I must agree with @TimPost, it feels kind of 'icky'. It is a good idea however, thanks! +1! – Mike Jun 07 '12 at 07:34

Well, here are my 2 cents:

If there is no AJAX involved, or it can be worked around easily, 'fix' the HTML to XHTML (using HTML Tidy, for example), then use XPath instead of regular expressions to extract the information.
In a well-structured web page, the logically separated entities of information are in different <div>s, or whatever other tags, which means you can easily find the right information with a simple XPath expression. This is great also because you can test it in, say, Chrome's console or Firefox's developer console and verify that it works before writing even one line of other code.
This approach also has a very high signal-to-noise ratio, since the expressions that select the relevant information are usually one-liners. They are also far easier to read than regular expressions and were designed for exactly that purpose.
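One way to follow this approach in Python is lxml, which copes with most tag soup on its own and exposes XPath directly, so a separate tidy step is often unnecessary; the URL and expression below are examples only.

```python
# Placeholder URL and XPath expression.
import urllib.request
from lxml import html

page = urllib.request.urlopen("http://example.com/products.html").read()
tree = html.fromstring(page)

# One-liner that pulls the second column of a table with id="prices".
prices = tree.xpath('//table[@id="prices"]//tr/td[2]/text()')
print(prices)
```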

If there is AJAX and serious JavaScripting involved in the page, embed a browser component in the application and use its DOM to trigger the events you need, and XPath to extract the information. There are plenty of good embeddable browser components out there, most of which use real-world browser engines under the hood, which is a good thing, since a web page might be incorrect (X)HTML but still render well in all major browsers (actually, most pages eventually get that way).
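A rough Python equivalent of that workflow (driving a real browser with Selenium rather than embedding a component, but the same render-then-XPath idea); the URL, the wait, and the XPath expression are placeholders.

```python
# Placeholder URL and XPath expression; requires Selenium and a Firefox driver.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
try:
    driver.get("http://example.com/ajax-heavy-page")
    driver.implicitly_wait(10)  # give the page's AJAX calls time to finish
    cells = driver.find_elements(By.XPATH, '//div[@id="results"]//table//td')
    print([cell.text for cell in cells])
finally:
    driver.quit()
```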

K.Steff
  • Thanks, I'll certainly take a look at XPath some more. I'm not used to working with it, so it'll be a nice thing to learn. +1 :) – Mike Jun 07 '12 at 07:24

There is no stable or better way to do this; HTML web pages were not made to be manipulated by computers. They are made for human users, but if you need to do it, I suggest using a browser and some JavaScript. At my work I was involved in a project that needed to extract some information from a third-party site. The application was developed as a Chrome extension. The application logic is written in JavaScript that is injected into the site after the page load is complete. The extracted data is sent to a database through an HTTP server. It is not the best approach, but it works. PS: the site owner authorized us to do this.

nohros
  • I know that HTML pages weren't supposed to be parsed by computers but sometimes there simply is no other option. Also, I'm using publicly available information for a personal project which is not commercial in any way, I don't think I need explicit authorization, do I? Thanks for your input! +1 for you too ;) – Mike Jun 07 '12 at 07:27
  • @MikeHeremans To know whether you're authorized to get information from a web site, read the ToS and robots.txt. If neither denies you the right to scrape information automatically, you should probably be OK legally in most cases. Of course, IANAL... – K.Steff Jun 07 '12 at 15:43
  • If you'd like to see the code of the mentioned project: http://code.google.com/p/acao-toolkit/source/browse/#svn%2Ftrunk%2Fsrc%2Fcreditors%2Fwebcrawler%2Fchrome_plugins%2Fplataforma_itau. Check content_script.js; it is the code that is injected into the page. – nohros Jun 08 '12 at 23:41