Improving machine learning data utility and optimal K-anonymization
Machine learning models are being rapidly adopted for many predictive analytics purposes, particularly in areas related to human health and behaviour. However, publishing trained models and data sets generated by human activities is often hindered by privacy protection requirements (e.g., data anonymization) and by the concern of disclosing too much personal information. Finding the right balance between the privacy of individuals and the utility of large data sets is a challenge that designers need to address prior to model training. Mathematical optimization is a promising approach to this problem: the utility of the data set is defined as the objective function to be maximized, while the requirements for anonymity are modelled as constraints on that function. In this talk we explore how the general data anonymization problem can be modelled as a mixed-integer linear optimization problem. We then investigate the complexity of such models and discuss a number of strategies to reduce that complexity so that the problem can be partially solved.
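To make the idea of "utility as objective, anonymity as constraint" concrete, the following is a minimal sketch and not the formulation presented in the talk. It replaces the mixed-integer linear model with an exhaustive search over a tiny grid of generalization levels, which is only feasible for toy inputs. The records, the generalization hierarchies, and the utility function are all hypothetical and chosen purely for illustration.

```python
from collections import Counter
from itertools import product

# Hypothetical toy records: (age, postal code prefix). Not real data.
RECORDS = [
    (34, "L8S"), (36, "L8S"), (35, "L8P"),
    (52, "M5V"), (58, "M5T"), (55, "M5V"),
]

AGE_LEVELS = 3     # 0 = exact age, 1 = decade, 2 = suppressed
POSTAL_LEVELS = 4  # 0 = full prefix, then one more character masked per level


def generalize_age(age, level):
    """Coarsen an age according to the assumed hierarchy."""
    if level == 0:
        return str(age)
    if level == 1:
        low = age // 10 * 10
        return f"{low}-{low + 9}"
    return "*"


def generalize_postal(code, level):
    """Mask trailing characters of a postal prefix; full masking gives '*'."""
    keep = len(code) - level
    return code[:keep] + "*" * level if keep > 0 else "*"


def is_k_anonymous(rows, k):
    """Constraint: every combination of quasi-identifiers occurs at least k times."""
    return all(count >= k for count in Counter(rows).values())


def utility(age_level, postal_level):
    """Assumed objective: fewer generalization steps means higher utility."""
    return -(age_level + postal_level)


def best_generalization(records, k):
    """Return the feasible (age_level, postal_level) with maximal utility, or None."""
    best = None
    for a, p in product(range(AGE_LEVELS), range(POSTAL_LEVELS)):
        rows = [(generalize_age(age, a), generalize_postal(pc, p))
                for age, pc in records]
        if not is_k_anonymous(rows, k):
            continue  # violates the anonymity constraint
        if best is None or utility(a, p) > utility(*best):
            best = (a, p)
    return best


if __name__ == "__main__":
    print(best_generalization(RECORDS, k=3))
```

The search space here has only 12 candidates. With more attributes and per-record decisions it grows combinatorially, which is why a solver-based formulation and complexity-reduction strategies of the kind the abstract mentions become necessary.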
Bio: Reza Samavi is currently an assistant professor in the Department of Computing and Software at McMaster University, where he is the department lead for the eHealth graduate program. Reza received his PhD and Master's degrees from the University of Toronto, and his main research interests are in the fields of information security, privacy, and health data analytics. Reza is a faculty affiliate with the Vector Institute for Artificial Intelligence and a collaborating researcher with SOSCIP, working on developing an active classifier for radiation risk with medical imaging. For his research on information privacy, Reza has received the Privacy Technologies Research Award from the IBM Center for Advanced Studies and the Privacy By Design Research Award from the Information and Privacy Commissioner of Ontario.