What's a topic?

We define a topic as the Wikidata item that a wikilink points to, where the wikilink is extracted from a given piece of Wikipedia content.

General architecture

How it works

The data pipeline is implemented as an Airflow job and breaks down into the following steps:

  1. two sensors that give the green light as soon as fresh data is available in the Data Lake;
  2. one Python Spark task that takes Wikipedia wikitext and Wikidata item page links as input, and outputs the section topics dataset.
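
The gating logic of these two steps can be mimicked in plain Python. This is a minimal sketch and not the actual Airflow job: the `AVAILABLE` dictionary stands in for the Data Lake, and the dataset names, function names, and output path are all hypothetical. A real Airflow sensor would also keep polling until the data shows up, rather than checking once.

```python
from typing import Callable, Dict, Optional, Set

# Hypothetical stand-in for the Data Lake: snapshots available per input dataset.
AVAILABLE: Dict[str, Set[str]] = {
    "wikitext": {"2023-01-16"},
    "wikidata_item_page_link": {"2023-01-16"},
}


def make_sensor(dataset: str, snapshot: str) -> Callable[[], bool]:
    """Build a check that is True once the snapshot of a dataset is available."""
    def sensor() -> bool:
        return snapshot in AVAILABLE.get(dataset, set())
    return sensor


def section_topics_task(snapshot: str) -> str:
    """Placeholder for the Spark task: only returns a made-up output location."""
    return f"section_topics/snapshot={snapshot}"


def run_pipeline(snapshot: str) -> Optional[str]:
    """Run the task only if both sensors give the green light."""
    sensors = [
        make_sensor("wikitext", snapshot),
        make_sensor("wikidata_item_page_link", snapshot),
    ]
    if all(sensor() for sensor in sensors):
        return section_topics_task(snapshot)
    return None
```

Calling `run_pipeline` with a snapshot that is missing from either input returns `None`, which corresponds to the task waiting for fresh data.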

A look at the data

Here is what a row of data looks like (manually hyperlinked, in case the reader wishes to check it):

snapshot         2023-01-16
wiki_db          enwiki
page_namespace   0
revision_id      1127523670
page_qid         Q36724
page_id          841
page_title       Attila
section_index    5
section_title    Solitary kingship
topic_qid        Q3623581
topic_title      Arnegisclus
topic_score      1.13
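
To make the column boundaries explicit, the sample row can be written as a typed record. This is only an illustrative sketch: the class name `SectionTopicRow` and the Python types are assumptions, and the field values are read off the row above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SectionTopicRow:
    """One row of the section topics dataset (types are assumed, not confirmed)."""
    snapshot: str
    wiki_db: str
    page_namespace: int
    revision_id: int
    page_qid: str
    page_id: int
    page_title: str
    section_index: int
    section_title: str
    topic_qid: str
    topic_title: str
    topic_score: float


# The sample row, split into its twelve columns.
example = SectionTopicRow(
    snapshot="2023-01-16",
    wiki_db="enwiki",
    page_namespace=0,
    revision_id=1127523670,
    page_qid="Q36724",
    page_id=841,
    page_title="Attila",
    section_index=5,
    section_title="Solitary kingship",
    topic_qid="Q3623581",
    topic_title="Arnegisclus",
    topic_score=1.13,
)
```

Read this way, the row says that the topic Arnegisclus (Q3623581) appears in the "Solitary kingship" section of the English Wikipedia article Attila, with a relevance score of 1.13.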

Data processing flow

  1. Gather the content of top-level sections, lead section included;
  2. filter out sections that don't convey relevant topics, such as External links. See phab:T318092 and phab:T323504 for more details;
  3. extract Wikidata items from wikilinks: the so-called section topics;
  4. filter out noisy topics, such as dates. See phab:T323597 and phab:T323036 for more details;
  5. compute the relevance score of each topic.

Note that we:

  • resolve redirect pages;
  • optionally separate media links from the main dataset.
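
Steps 1 to 4 and the two notes above can be sketched in plain Python on a toy page. This is a simplified illustration, not the pipeline's Spark code: the regular expressions, the `SKIPPED_SECTIONS` set, the function names, and the lookup dictionaries are all hypothetical, and every QID except Q3623581 is a placeholder.

```python
import re

# Hypothetical list of section titles that don't convey relevant topics.
SKIPPED_SECTIONS = {"external links"}
# Level-2 headings only, e.g. "== Title =="; deeper headings stay in their parent.
HEADING = re.compile(r"^==[ \t]*([^=\n].*?)[ \t]*==[ \t]*$", re.MULTILINE)
# [[Target]], [[Target|label]], or [[Target#anchor|label]]; group 1 is the target.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")
MEDIA_PREFIXES = ("File:", "Image:")


def split_top_level_sections(wikitext):
    """Return (index, title, body) for the lead (index 0) and each level-2 section."""
    matches = list(HEADING.finditer(wikitext))
    lead_end = matches[0].start() if matches else len(wikitext)
    sections = [(0, "", wikitext[:lead_end])]
    for i, match in enumerate(matches, start=1):
        end = matches[i].start() if i < len(matches) else len(wikitext)
        sections.append((i, match.group(1), wikitext[match.end():end]))
    return sections


def extract_section_topics(wikitext, title_to_qid, redirects, noisy_qids):
    """Return (topics, media): topic rows and the media links set aside."""
    topics, media = [], []
    for index, title, body in split_top_level_sections(wikitext):
        if title.lower() in SKIPPED_SECTIONS:  # step 2: drop irrelevant sections
            continue
        for match in WIKILINK.finditer(body):
            target = match.group(1).strip()
            if target.startswith(MEDIA_PREFIXES):  # separate media links
                media.append(target)
                continue
            target = redirects.get(target, target)  # resolve redirect pages
            qid = title_to_qid.get(target)  # step 3: wikilink -> Wikidata item
            if qid is not None and qid not in noisy_qids:  # step 4: drop noise
                topics.append((index, title, qid, target))
    return topics, media


# Toy input; Q0000001 and Q0000002 are placeholders, not real Wikidata items.
TOY_WIKITEXT = """[[File:Example.jpg|thumb]]
'''Attila''' was a ruler of the [[Hun]]s.

== Solitary kingship ==
In [[447]], Attila defeated [[Arnegisclus]].

== External links ==
* [[Huns]]
"""
topics, media = extract_section_topics(
    TOY_WIKITEXT,
    title_to_qid={"Huns": "Q0000001", "447": "Q0000002", "Arnegisclus": "Q3623581"},
    redirects={"Hun": "Huns"},
    noisy_qids={"Q0000002"},  # the date link is treated as a noisy topic
)
```

On this toy page, the lead yields the Huns topic through a redirect, the "Solitary kingship" section yields Arnegisclus, the date link is filtered out as noise, and the External links section is skipped entirely.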

Relevance score

We define relevance as a score that measures to what extent a given topic helps summarize and understand a given piece of Wikipedia content. This enables topic ranking and is computed as a term frequency-inverse document frequency (TF-IDF) weight based on the distribution of topics.

We must distinguish between article-level and section-level relevance, which summarize a Wikipedia article and a Wikipedia article section respectively. They follow slightly different implementations:

  • the former is a custom weight, where the TF component is computed across Wikipedias by leveraging the language-agnostic nature of Wikidata items;
  • the latter is a classic one, i.e., computed within the same Wikipedia;
  • both compute the IDF component within the same Wikipedia.

As a result, we expect article-level relevance to be much more meaningful than section-level relevance, because a much larger number of topics contributes to the computation. Moreover, TF-IDF doesn't perform well on short content, which is likely to degrade the relevance scores of short sections with few topics.
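
The difference between the two weights can be illustrated with a small sketch. The exact formulas used by the pipeline are not given here, so this assumes the textbook variant: TF as a relative frequency and IDF as log(N / df). The function names, the tuple layout `(wiki_db, page_qid, section_index, topic_qid)`, and the `P1`/`T1` identifiers are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tf(topics):
    """Relative frequency of each topic within one document."""
    counts = Counter(topics)
    return {topic: count / len(topics) for topic, count in counts.items()}


def idf(docs):
    """log(N / df) per topic, where docs maps a document id to its topic list."""
    df = Counter(topic for topics in docs.values() for topic in set(topics))
    return {topic: math.log(len(docs) / n) for topic, n in df.items()}


def section_scores(rows, wiki):
    """Section-level: both TF and IDF are computed within the same Wikipedia."""
    docs = defaultdict(list)
    for wiki_db, page, section, topic in rows:
        if wiki_db == wiki:
            docs[(page, section)].append(topic)
    weights = idf(docs)
    return {doc: {t: f * weights[t] for t, f in tf(topics).items()}
            for doc, topics in docs.items()}


def article_scores(rows, wiki):
    """Article-level: TF across all Wikipedias, IDF within the same Wikipedia."""
    local, cross = defaultdict(list), defaultdict(list)
    for wiki_db, page, _, topic in rows:
        cross[page].append(topic)  # same Wikidata item, any language
        if wiki_db == wiki:
            local[page].append(topic)
    weights = idf(local)
    scores = {}
    for page, topics in local.items():
        frequencies = tf(cross[page])
        scores[page] = {t: frequencies[t] * weights[t] for t in set(topics)}
    return scores


# Toy rows: (wiki_db, page_qid, section_index, topic_qid), placeholder ids.
rows = [
    ("enwiki", "P1", 0, "T1"),
    ("enwiki", "P1", 1, "T1"),
    ("enwiki", "P1", 1, "T2"),
    ("enwiki", "P2", 0, "T1"),
    ("frwiki", "P1", 0, "T2"),
    ("frwiki", "P1", 0, "T2"),
]
by_section = section_scores(rows, "enwiki")
by_article = article_scores(rows, "enwiki")
```

In this toy data, `T1` occurs in every English document, so its IDF and therefore its score is zero at both levels. `T2` is boosted at article level by its two extra occurrences in the French version of the same article, which is the effect of computing TF across Wikipedias.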

Code base

See also
