Towards a Large Corpus of Richly Annotated Web Tables
for Knowledge Base Population
Authors: Basil Ell, Sherzod Hakimov, Philipp Braukmann, Lorenzo Cazzoli, Fabian Kaupmann, Amerigo Mancino, Junaid Altaf Memon, Kai Rother, Abhishek Saini, and Philipp Cimiano
About
This page provides information in addition to our submission with the title Towards a Large Corpus of Richly Annotated Web Tables for Knowledge Base Population to the LD4IE workshop at ISWC 2017. This work was supported by the Cluster of Excellence Cognitive Interaction Technology 'CITEC' (EXC 277) at Bielefeld University, which is funded by the German Research Foundation (DFG). The work was partially created within the Intelligent Systems Master students project Information Extraction from Web Tables under the supervision of Basil Ell and Sherzod Hakimov.
Abstract of the Paper
Web Table Understanding in the context of Knowledge Base Population and the Semantic Web is the task of i) linking the content of tables retrieved from the Web to an RDF knowledge base, ii) building hypotheses about the tables' structures and contents, iii) extracting novel information from these tables, and iv) adding this new information to a knowledge base. Knowledge Base Population has gained increasing interest in recent years due to the growing demand for large knowledge graphs, which have become relevant for Artificial Intelligence applications such as Question Answering and Semantic Search. In this paper we describe a set of basic tasks that are relevant for Web Table Understanding in this context. These tasks incrementally enrich a table with hypotheses about the table's content. When multiple interpretations exist, we avoid as far as possible selecting one interpretation and thereby deciding against the others. By postponing these decisions, we enable learning approaches that gain an understanding of the tables' contents to decide by themselves, thus increasing the usability of the annotated table data. We present statistics from analyzing and annotating 1,000,000 tables from the Web Table Corpus 2015 and make this dataset available online.
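The idea of postponing decisions between competing interpretations can be illustrated with a minimal sketch. It is not the data model used in the paper or in the published dataset: the class names `Hypothesis` and `AnnotatedCell`, the score scale, and the `example.org` URIs are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Hypothesis:
    """One candidate interpretation of a cell (hypothetical structure)."""
    entity_uri: str  # candidate knowledge base resource (illustrative URI)
    score: float     # confidence from some annotator (assumed scale 0..1)


@dataclass
class AnnotatedCell:
    """A table cell together with every hypothesis collected for it."""
    text: str
    hypotheses: List[Hypothesis] = field(default_factory=list)

    def add(self, entity_uri: str, score: float) -> None:
        # Append rather than overwrite, so competing interpretations survive.
        self.hypotheses.append(Hypothesis(entity_uri, score))

    def ranked(self) -> List[Hypothesis]:
        # Sorting is a convenience for consumers; no hypothesis is discarded.
        return sorted(self.hypotheses, key=lambda h: h.score, reverse=True)


# Two readings of the same cell text are both kept.
cell = AnnotatedCell("Paris")
cell.add("http://example.org/resource/Paris_Texas", 0.2)
cell.add("http://example.org/resource/Paris_France", 0.8)
```

A downstream learner could consume `cell.ranked()` and weigh all candidates itself, instead of receiving only a single interpretation chosen in advance.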