Netanel Raviv (California Institute of Technology)
Sunday, 15.12.2019, 14:30
Data is one of the most important assets of the information age, and its societal impact is undisputed. Yet, rigorous methods for assessing the quality of data are lacking. In this talk, we propose a formal definition for the quality of a given static dataset. We assess a dataset's quality by a quantity we call the expected diameter, which measures the expected disagreement between two randomly chosen hypotheses that explain it, and which has recently found applications in active learning. We employ a combination of Fourier-analytic, algebraic, and probabilistic methods to derive theoretical guarantees and practical solutions for computing the expected diameter, and study its behaviour on algebraically structured datasets. We conduct experiments that validate this notion of quality and demonstrate the feasibility of our techniques.
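To make the notion concrete, here is a minimal illustrative sketch (not the talk's actual construction) of the expected diameter in a toy setting: hypotheses are Boolean functions on a 3-bit domain, a dataset is a set of labeled points, and the expected diameter is estimated by sampling pairs of hypotheses consistent with the dataset and averaging their disagreement. All names and the hypothesis class are assumptions made for illustration.

```python
import itertools
import random

# Toy setting (hypothetical illustration): hypotheses are all Boolean
# functions on a 3-bit domain, represented as truth tables of length 8;
# a dataset is a list of (point, label) pairs.
DOMAIN = list(itertools.product([0, 1], repeat=3))

def consistent_hypotheses(dataset):
    """All truth tables (tuples of 8 bits) that agree with the dataset."""
    hyps = []
    for table in itertools.product([0, 1], repeat=len(DOMAIN)):
        if all(table[DOMAIN.index(x)] == y for x, y in dataset):
            hyps.append(table)
    return hyps

def disagreement(h1, h2):
    """Fraction of domain points on which two hypotheses disagree
    (normalized Hamming distance between truth tables)."""
    return sum(a != b for a, b in zip(h1, h2)) / len(h1)

def expected_diameter(dataset, samples=10000, seed=0):
    """Monte Carlo estimate of the expected disagreement between two
    hypotheses drawn uniformly at random from the consistent set."""
    rng = random.Random(seed)
    hyps = consistent_hypotheses(dataset)
    total = 0.0
    for _ in range(samples):
        total += disagreement(rng.choice(hyps), rng.choice(hyps))
    return total / samples

# More labeled points pin down more of the truth table, so the
# expected diameter shrinks as the dataset grows: each of the k
# unconstrained bits disagrees with probability 1/2, giving k/16 here.
small = [((0, 0, 0), 1)]
large = small + [((0, 1, 0), 0), ((1, 0, 1), 1), ((1, 1, 1), 0)]
print(expected_diameter(small), expected_diameter(large))
```

In this toy class the estimate can be checked analytically (k unconstrained bits give an expected diameter of k/16), and a smaller expected diameter corresponds to a dataset that constrains the hypothesis class more tightly, matching the intuition of higher quality.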