Liat Peterfreund, Ph.D. Thesis Seminar
Information Extraction (IE) is the task of extracting structured information from textual data. We explore a programming paradigm that is supported by several IE systems where relations extracted by atomic extractors undergo a relational manipulation. In our efforts toward achieving a better understanding of IE systems, we study queries in the framework of document spanners that was introduced by Fagin et al. A spanner is a function that extracts from a document (string) a relation over text intervals, called spans, using either atomic extractors or a relational algebra query on top of these extractors. Evaluating IE queries is a computational problem that involves both the atomic extractors, the relational algebra expression and the document. We investigate the complexity of this problem from various angles, each keeping some components fixed and regarding the rest as input. In addition, we study the expressive power of IE queries that involve recursion ands the implications of supporting partial answers in the flavor of SPARQL.