Abstract:
We revisit a problem introduced by Bharat and Broder almost a
decade ago: how to sample random pages from a search engine's
index using only the search engine's public interface? Such a
primitive is particularly useful in creating objective
benchmarks for search engines.
The technique of Bharat and Broder suffers from two well
recorded biases: It favors long documents and highly ranked
documents. We introduce two novel sampling techniques: a
lexicon-based technique and a random walk technique. Our methods
produce biased sample documents, but each sample is accompanied
by a corresponding ``weight'', which represents the probability
of this document to be selected in the sample. The samples, in
conjunction with the weights, are then used to simulate
near-uniform samples. To this end, we resort to four well known
Monte Carlo simulation methods: rejection sampling, importance
sampling, the Metropolis-Hastings algorithm, and the maximum
degree method.
We analyze our methods rigorously and prove that under plausible
assumptions, our techniques are guaranteed to produce
near-uniform samples from the search engine's index. Experiments
on a corpus of 2.4 million documents substantiate our analytical
findings and show that our algorithms do not have significant
bias towards long or highly ranked documents. We use our
algorithms to collect fresh data about the relative sizes of
Google, MSN Search, and Yahoo!.
Joint work with Maxim Gurevich.