Pixel Club: Reporting Neighbors in High-Dimensional Euclidean Space

Dror Aiger (Google)

Tuesday, 29.5.2012, 11:30

EE Meyer Building 1061

We consider the following problem, which arises in many database and
web-based applications: Given a set $P$ of $n$ points in a high-dimensional space
$\mathbb{R}^d$ and a distance $r$, we want to report all pairs of points of $P$
at Euclidean distance at most $r$. We present two randomized algorithms, one
based on randomly shifted grids, and the other on randomly shifted and rotated
grids.

The expected performance of both algorithms is of the form $C(d)(n+k)\log n$, where $k$ is the output size and $C(d)$ is a constant that depends on the dimension $d$. The $\log n$ factor is needed to guarantee, with high probability, that all neighbor pairs are reported, and can be dropped if it suffices to report some (arbitrarily large) fraction of the pairs. When only translations are used, $C(d)$ is of the form $(a\sqrt{d})^d$, for some (small) absolute constant $a$; this bound is worst-case tight, up to an exponential factor. When both rotations and translations are used, $C(d)$ can be improved to roughly $6.65^d$, getting rid of the superexponential factor $\sqrt{d}^d$. When the input set (lies in a subset of $d$-space that) has low \emph{doubling dimension} $\delta$, the performance of the first algorithm improves to $C'(d,\delta)(n+k)\log n$ (or to $C'(d,\delta)(n+k)$), where $C'(d,\delta)\approx \sqrt{d}^\delta e^{O(\sqrt{d})}$. Finally, our analysis also leads to improved algorithms for approximate nearest-neighbor (ANN) search and locality-sensitive hashing (LSH).
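To make the randomly-shifted-grid idea concrete, here is a minimal illustrative sketch (not the paper's algorithm; the cell size, constants, and analysis there differ). With a grid of side $s = 2r\sqrt{d}$ and a uniformly random shift, a pair at distance at most $r$ lands in the same cell with probability at least $1/2$, so repeating $O(\log n)$ independent shifts reports all such pairs with high probability. The function name and parameters below are hypothetical.

```python
import math
import random
from collections import defaultdict
from itertools import combinations

def close_pairs_grid(points, r, trials=None, seed=None):
    """Report all pairs of points at Euclidean distance <= r using
    randomly shifted grids (illustrative sketch, not the paper's exact
    algorithm)."""
    rng = random.Random(seed)
    n = len(points)
    if n < 2:
        return set()
    d = len(points[0])
    # Cell side 2*r*sqrt(d): since the L1 distance of a pair at L2
    # distance <= r is at most r*sqrt(d), the pair shares a cell with
    # probability >= 1 - r*sqrt(d)/s = 1/2 over the random shift.
    s = 2.0 * r * math.sqrt(d)
    if trials is None:
        # Each trial misses a given pair with probability <= 1/2, so
        # O(log n) trials suffice to catch all pairs w.h.p.
        trials = max(1, math.ceil(2 * math.log2(max(n, 2))))
    found = set()
    for _ in range(trials):
        shift = [rng.uniform(0.0, s) for _ in range(d)]
        buckets = defaultdict(list)
        for i, p in enumerate(points):
            key = tuple(int(math.floor((p[j] + shift[j]) / s))
                        for j in range(d))
            buckets[key].append(i)
        # Verify candidates within each cell by an exact distance test.
        for idxs in buckets.values():
            for i, j in combinations(idxs, 2):
                dist2 = sum((points[i][k] - points[j][k]) ** 2
                            for k in range(d))
                if dist2 <= r * r:
                    found.add((min(i, j), max(i, j)))
    return found
```

Each trial costs time proportional to $n$ plus the number of intra-cell candidate pairs; the dimension-dependent constant $C(d)$ in the abstract accounts for how many candidates a cell of this size can contain.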
