Technical Report PHD-2013-05

Title: Combined Search Over Heterogeneous Repositories
Authors: Mirit Shalem
Supervisors: Yaron Kanza and Ziv Bar-Yossef
Abstract: In complex search tasks that utilize information from several data sources, it is often required to pose several basic search queries, join the answers to these queries, where each answer is given as a ranked list of items, and return a ranked list of combinations.

Example: A tourist Alice wants to find a festive event that is related to classical music, a good Italian restaurant and an underground station that contains an elevator, such that all three will be located within a short walking distance. A combined search should return relevant triples (event, restaurant, station), ordered according to the combined scores of the items.

Evaluation of such complex queries consists of several parts. The first part requires extracting, for each basic query independently, the relevant information from a suitable repository. Next, the results of the basic queries should be integrated, i.e., joined according to the join conditions of the complex query. Finally, the join result should be filtered and ranked, to provide the user with a manageable sized answer that has a high probability of satisfying her.

This thesis studies several aspects of complex query evaluation. We begin with the question how to extract data efficiently from XML documents. Current algorithms incur high memory costs on queries that involve child-axis nodes. In this work we provide an analytical explanation for this phenomenon. We present a large-scale study of the space complexity of evaluating twig queries (a fragment of XPath) over indexed XML documents.

The next part of the complex search is to join the ranked results of the basic queries, and return the "best" results. We discuss two problems of the top-k join of ranked lists. First, a join is a lossy operation, and over heterogeneous data sources some highly-ranked items may not appear in any combination, although they may be relevant to the user. We introduce top-k maximal join, which solves this problem by allowing maximal (incomplete) combinations in the answer, i.e., combinations that cannot be extended. We present novel algorithms for computing the top-k maximal combinations. One of the algorithms is instance optimal w.r.t. algorithms that compute a $\theta$-approximate answer. A second algorithm is much more efficient than adaptations of existing algorithms for top-k maximal join.

Second, the top-k join result may include too many repetitions of items, and repetitions may reduce user satisfaction. The main task is to choose a subset of the join result that would maximize the probability to satisfy the user. To that end, we propose two measures for estimating the quality of result sets, namely, coverage and optimality-ratio. We present new semantics for complex search queries, aiming at providing high coverage and high optimality ratio. We show empirically that our semantics outperform top-k and other existing semantics, w.r.t. user satisfaction.

CopyrightThe above paper is copyright by the Technion, Author(s), or others. Please contact the author(s) for more information

Remark: Any link to this technical report should be to this page (, rather than to the URL of the PDF files directly. The latter URLs may change without notice.

To the list of the PHD technical reports of 2013
To the main CS technical reports page

Computer science department, Technion