Technical Report PHD-2019-10

Title: The Complexity of Relational Queries over Extractions from Text
Authors: Liat Peterfreund
Supervisors: Benny Kimelfeld
PDFCurrently accessibly only within the Technion network
Abstract: Large amounts of textual data with high potential value are prevalent nowadays. Incorporating textual data within traditional database systems usually consists of two stages (a) extracting relations from the text using primitive extractors and (b) manipulating the extracted relations with relational algebra operations. Document spanners is a formal framework proposed by Fagin, Kimelfeld, Reiss, and Vansummeren, that captures the relational philosophy of commercial database systems for text analytics.

A document spanner (or spanner, for short) is a function that maps a string (document) into a relation over spans which are interval positions of the string. Evaluating spanners is a computational problem that involves both the atomic extractors, the relational algebra expression, and the document. We investigate the complexity of this problem from various angles, each regarding some components as fixed and regarding the rest as input. We focus on regular atomic extractor specified by means of either regular expressions or finite state automata. We present lower bounds and devise different restrictions and approaches to establish efficient upper bounds. We study the expressive power of spanners that involve recursion and the implications of extensions that support partial extractions in the flavor of SPARQL.

CopyrightThe above paper is copyright by the Technion, Author(s), or others. Please contact the author(s) for more information

Remark: Any link to this technical report should be to this page (, rather than to the URL of the PDF files directly. The latter URLs may change without notice.

To the list of the PHD technical reports of 2019
To the main CS technical reports page

Computer science department, Technion