Abstract:
Suppose you are studying the occurrence of a disease and need
to uncover any salient statistical properties that might hold.
For example, it would be important to know if the probability
of contracting the disease decreases with distance of your house
from a given nuclear plant, whether the distribution on zipcodes
of patients is close to the distribution for another disease, or
whether a person's likelihood of contracting it is correlated
with their profession. Of course, you wish to notice such trends
given as few samples as possible.
We survey several works regarding the complexity of testing
various global properties of distributions, when given access to only
a few samples from the distributions. We will focus on properties
that give some indication on the independence of an underlying
joint distribution. We will describe algorithms whose sample
complexities are *sublinear* in the size of the support for many such
testing problems. In contrast, classical techniques, such as the
Chi-squared test or the straightforward use of Chernoff bounds,
have sample complexities that are at least linear in the size
of the support of the underlying discrete probability distributions.