Technical Report PHD-2011-01

Title: Privacy Preserving Data Mining
Authors: Arie Friedman
Supervisors: Assaf Schuster
Abstract: In recent years, privacy preserving data mining has emerged as a very active research ‎area. This field of research studies how knowledge can be extracted from large data ‎stores while maintaining commercial or legislative privacy constraints. Quite often, ‎these constraints pertain to individuals represented in the data stores. While data ‎collectors strive to derive new insights that would allow them to improve customer ‎service and increase their sales, consumers are concerned about the vast quantities of ‎information collected about them and how this information is put to use. Privacy ‎preserving data mining aims to settle these conflicting interests. The question how ‎these two contrasting goals, mining new knowledge while protecting individuals' ‎privacy, can be reconciled, is the focus of this research. We seek ways to improve the ‎tradeoff between privacy and utility when mining data.‎

We address the privacy/utility tradeoff problem by considering the privacy and ‎algorithmic requirements simultaneously. We study this problem in the context of two ‎privacy models that attracted a lot of attention in recent years, k-anonymity and ‎differential privacy. Our analysis and experimental evaluations confirm that ‎algorithmic decisions made with privacy considerations in mind may have a profound ‎impact on the accuracy of the resulting data mining models. In chapter 1 we study this ‎tradeoff in the context of k-anonymity. Since k-anonymity was originally defined in ‎the context of relational tables, we begin with extending the definition of k-anonymity ‎with definitions of our own, which can then be used to prove that a given data mining ‎model is k-anonymous. We demonstrate how the definitions can be incorporated ‎within a data mining algorithm to guarantee k-anonymous output. In chapter 2 we ‎focus on decision tree induction and demonstrate that our technique can be used to ‎induce decision trees which are more accurate than those acquired by anonymizing the ‎data first and inducing the decision tree later. ‎ In chapters 3 and 4 we study the privacy/accuracy tradeoff in the context of ‎differential privacy. We show that when applying differential privacy, the method ‎chosen by the data miner to build a classifier could make the difference between an ‎accurate classifier and a useless one, even when the same choice without privacy ‎constraints would have no such effect. Moreover, an improved algorithm can achieve ‎the same level of accuracy and privacy as the naive implementation but with an order ‎of magnitude fewer learning samples.‎

CopyrightThe above paper is copyright by the Technion, Author(s), or others. Please contact the author(s) for more information

Remark: Any link to this technical report should be to this page (, rather than to the URL of the PDF files directly. The latter URLs may change without notice.

To the list of the PHD technical reports of 2011
To the main CS technical reports page

Computer science department, Technion