Modern approaches to retrieve useful content from a web page?

Question

What are the modern ways to effectively determine which parts of a page contain useful content (text, data tables, etc.) and which do not (e.g. ads, navigation, etc.)?

What valuable research, results, or papers have been published in this field in recent years?

Thank you in advance!

There's no money in filtering ads, unless you're writing ad filtering software. The companies that do that are not likely to reveal their secrets. — Robert Harvey, Jun 29 '11 at 18:43

score 3 · Answer 1 · answered Sep 08 '11 at 10:01

Semantic Web

"a web of data that can be processed directly and indirectly by machines."

The Semantic Web is a system that lets machines understand human data.

Tim Berners-Lee originally expressed the vision of the semantic web as follows:

"I have a dream for the Web [in which computers] become capable of analyzing all the data on the Web – the content, links, and transactions between people and computers. A ‘Semantic Web’, which should make this possible, has yet to emerge, but when it does, the day-to-day mechanisms of trade, bureaucracy and our daily lives will be handled by machines talking to machines. The ‘intelligent agents’ people have touted for ages will finally materialize."

To do this, the Semantic Web relies on languages designed to represent data in a machine-readable form, such as RDF, OWL and XML.
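To give a feel for what "machine-readable" means here, this is a minimal sketch that reads a small RDF/XML fragment into (subject, predicate, object) triples using only Python's standard library. The sample document, the `simple_triples` helper and the example.org URL are made up for illustration, and the code handles only flat `rdf:Description` blocks, not the full RDF/XML syntax. A real project would more likely use a dedicated RDF library.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical sample: two Dublin Core statements about one page.
SAMPLE_RDF = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/page">
    <dc:title>Example page</dc:title>
    <dc:creator>Jane Doe</dc:creator>
  </rdf:Description>
</rdf:RDF>
"""


def simple_triples(rdf_xml):
    """Return (subject, predicate, object) tuples for flat rdf:Description blocks."""
    root = ET.fromstring(rdf_xml)
    triples = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")
        for prop in desc:
            # ElementTree names tags as "{namespace}local"; join them into one URI.
            predicate = prop.tag[1:].replace("}", "")
            # The object is either a linked resource or the literal text.
            resource = prop.get(f"{{{RDF_NS}}}resource")
            obj = resource if resource is not None else (prop.text or "").strip()
            triples.append((subject, predicate, obj))
    return triples


if __name__ == "__main__":
    for triple in simple_triples(SAMPLE_RDF):
        print(triple)
```

The point is that nothing has to be guessed from the layout: the title and creator are labeled as such in the data itself.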

Microformat

A web-based approach to semantic markup that reuses existing HTML/XHTML tags and attributes to convey metadata in documents.

This approach allows software to process information intended for end-users (such as contact information, geographic coordinates, calendar events, and the like) automatically.

As of 2010, microformats allow the encoding and extraction of events, contact information, social relationships and so on. More are being developed.
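As a rough illustration of how software can pull such data out of a page, here is a minimal sketch that extracts hCard contact details (the classic microformat for contact information) with Python's standard `html.parser`. The `HCardParser` class, the `extract_hcards` helper, the sample HTML and the small set of property names are assumptions chosen for the example. It does not handle nested cards or the full hCard vocabulary.

```python
from html.parser import HTMLParser

# A small illustrative subset of hCard property class names.
HCARD_PROPERTIES = {"fn", "org", "tel", "email", "locality"}
# Elements without a closing tag; they are ignored so the open-tag stack stays balanced.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class HCardParser(HTMLParser):
    """Collects the text of hCard properties found inside class="vcard" elements."""

    def __init__(self):
        super().__init__()
        self.cards = []   # one dict per vcard found
        self._open = []   # stack of (tag, matched_properties, starts_card)

    def _in_card(self):
        return any(starts_card for _, _, starts_card in self._open)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = set((dict(attrs).get("class") or "").split())
        starts_card = "vcard" in classes
        if starts_card:
            self.cards.append({})
        in_card = starts_card or self._in_card()
        props = classes & HCARD_PROPERTIES if in_card else set()
        self._open.append((tag, props, starts_card))

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # Pop back to the nearest matching open tag, tolerating sloppy markup.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i][0] == tag:
                del self._open[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text or not self._in_card():
            return  # text outside a vcard (navigation, ads, ...) is skipped
        card = self.cards[-1]
        for _, props, _ in self._open:
            for prop in props:
                card[prop] = (card.get(prop, "") + " " + text).strip()


def extract_hcards(html):
    parser = HCardParser()
    parser.feed(html)
    parser.close()
    return parser.cards


SAMPLE_HTML = """
<div class="nav"><a href="/">Home</a></div>
<div class="vcard">
  <span class="fn">Jane Doe</span>,
  <span class="org">Example Corp</span>
  <a class="email" href="mailto:jane@example.com">jane@example.com</a>
</div>
"""

if __name__ == "__main__":
    print(extract_hcards(SAMPLE_HTML))
```

On the sample, the navigation link is skipped and only the fields marked up inside the `vcard` element are returned. This works only for pages whose authors added the markup, which is the main limitation of this approach for general content extraction.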
