Mirit Shalem, Ph.D. Thesis Seminar
In complex search tasks that utilize information from several data sources, it is often required to pose several basic search queries, join the answers to these queries, where each answer is given as a ranked list of items, and return a ranked list of combinations.
Example: A tourist Alice wants to find a festive event that is related to classical music, a good Italian restaurant and an underground station that contains an elevator, such that all three will be located within a short walking distance. A combined search should return relevant triples (event, restaurant, station), ordered according to the combined scores of the items.
In this talk we discuss two problems of the top-k join of ranked lists. First, a join is a lossy operation, and over heterogeneous data sources some highly-ranked items may not appear in any combination, although they may be relevant to the user. We introduce top-k maximal join, which solves this problem by allowing maximal (incomplete) combinations in the answer, i.e., combinations that cannot be extended. We present novel algorithms for computing the top-k maximal combinations. One of the algorithms is instance optimal w.r.t. algorithms that compute a \theta-approximate answer. A second algorithm is much more efficient than adaptations of existing algorithms for top-k maximal join.
Second, the top-k join result may include too many repetitions of items, and repetitions may reduce user satisfaction. The main task is to choose a subset of the join result that would maximize the probability to satisfy the user. To that end, we propose two measures for estimating the quality of result sets, namely, coverage and optimality-ratio. We present new semantics for complex search queries, aiming at providing high coverage and high optimality ratio. We show empirically that our semantics outperform top-k and other existing semantics, w.r.t. user satisfaction.