Technical Report MSC-2018-23

TR#:MSC-2018-23
Class:MSC
Title: Relational Framework for Information Extraction
Authors: Yoav Nahshon
Supervisors: Benny Kimelfeld
PDF: Currently accessible only within the Technion network
Abstract: Unstructured textual data conceals within itself structured data, and oftentimes it is accompanied by metadata. However, relational databases, which are highly suitable for storing structured data, typically treat text as a black box, as they lack the means for handling it sufficiently. In the context of text analytics, an essential component of many applications is Information Extraction (IE), the task of extracting (in a structured format) valuable knowledge from textual data. Typically, modern IE pipelines are constructed by (1) loading textual data from a database into a special-purpose application, (2) applying to the text a myriad of text-analytics functions that produce a structured relational table, and (3) storing this table in a database. This approach is prone to laborious development processes, complex and tangled programs, and inefficient control flows. These deficiencies have given rise to declarative solutions that automate significant parts of the manual work. However, such frameworks typically stitch together various programming components and technologies, and may lack an all-binding theory. In this thesis we embark on an effort to lay the foundations of general-purpose, text-centric database management systems. Concretely, we introduce a novel formal framework, called Spannerlog, in which we extend the relational model by incorporating into it the theory of document spanners, and define a Datalog-like query language for this model. Our main contribution is a uniform framework for textual data management w.r.t. unstructured data (text), structured data (extracted information and metadata like identifiers and timestamps), and functions that carry out transformations from the former to the latter.
The formal foundations on which we built our framework provide new capabilities and opportunities to be explored: (1) a better understanding of the system through theoretical studies; here we report on initial results concerning the expressive power of Spannerlog programs. (2) Diminished software complexity; within a single framework, developers can write IE programs and query the extracted information in a concise and readable manner. (3) New optimization opportunities due to static program analysis on top of Spannerlog's formalism; to illustrate these opportunities we present the notion of split correctness, which enables the construction of parallel execution plans based on data splitting, while providing provable correctness. We believe that the formalism of Spannerlog will have a substantial impact on the way systems manage and query textual data.
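
Illustration (not part of the report): the following is a minimal Python sketch of two ideas named in the abstract, written under stated assumptions. It is not Spannerlog syntax and does not reproduce the thesis's definitions. A Python regular expression with capture groups stands in for a document spanner, that is, a function mapping a document to a relation whose cells are spans (character intervals) of that document. The names `extract_emails`, `split_lines`, and `extract_by_splitting` are hypothetical. The last part checks, on one example document only, the property that split correctness is meant to guarantee in general: running the extractor on each segment and shifting the offsets gives the same relation as running it on the whole document.

```python
import re
from typing import List, Tuple

# A span is a half-open character interval [start, end) of the document.
Span = Tuple[int, int]
Row = Tuple[Span, Span]


def extract_emails(doc: str, offset: int = 0) -> List[Row]:
    """Toy extractor (assumption: a regex stands in for a spanner).

    Returns a relation of (user, domain) spans. `offset` shifts the spans
    so that results computed on a segment refer to the full document.
    """
    pattern = re.compile(r"(\w+)@(\w+(?:\.\w+)+)")
    return [
        (
            (m.start(1) + offset, m.end(1) + offset),
            (m.start(2) + offset, m.end(2) + offset),
        )
        for m in pattern.finditer(doc)
    ]


def split_lines(doc: str) -> List[Span]:
    """Toy splitter: spans of the maximal newline-free segments."""
    return [(m.start(), m.end()) for m in re.finditer(r"[^\n]+", doc)]


def extract_by_splitting(doc: str) -> List[Row]:
    """Run the extractor on each segment independently, then merge.

    Each segment could be handled by a separate worker; here the loop is
    sequential to keep the sketch short.
    """
    rows: List[Row] = []
    for start, end in split_lines(doc):
        rows.extend(extract_emails(doc[start:end], offset=start))
    return rows


doc = "Contact alice@example.org today.\nOr write to bob@cs.example.edu instead."
whole = extract_emails(doc)
parts = extract_by_splitting(doc)

# For this extractor and splitter no match can cross a newline,
# so both evaluation strategies yield the same relation.
assert whole == parts
```

The equality holds here because no character the pattern can match is a newline, so every match lies inside a single segment. A splitter that cuts at periods would break it, since domain names contain periods. Establishing such a guarantee by static analysis, rather than by testing individual documents, is the kind of question the abstract attributes to split correctness.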

Copyright: The above paper is copyrighted by the Technion, the author(s), or others. Please contact the author(s) for more information.

Remark: Any link to this technical report should be to this page (http://www.cs.technion.ac.il/users/wwwb/cgi-bin/tr-info.cgi/2018/MSC/MSC-2018-23), rather than to the URL of the PDF files directly. The latter URLs may change without notice.


Computer science department, Technion