Protecting privacy in the age of big data
As technology has advanced, its capacity to store vast amounts of data has been accompanied by a growing set of tools, used by social scientists, researchers, and companies, to process that data. Recent privacy breaches have shown that these tools can be penetrated and personal information leaked to the public, fueling an ever-growing need for privacy-preserving data analysis. In the information age, how can we guarantee that the information we share with governments, companies, or researchers remains private?
Wassim Marrakchi, A.B. ’21, a computer science and mathematics joint concentrator at the Harvard John A. Paulson School of Engineering and Applied Sciences, is working to solve that problem as part of the Harvard Privacy Tools Project. The Privacy Tools Project is a multidisciplinary group that designs and integrates computational, statistical, legal, and policy tools to address privacy issues in a variety of contexts.
“Social databases are of enormous value in providing different information and statistics about ourselves,” said Marrakchi, a native of Tunisia. “In this age of big data, the privacy of many individuals has been challenged, and researchers and companies are increasingly worried about the data they collect because of the ethical obligation they have to preserve the privacy of their data subjects.”
This old problem was reignited by Latanya Sweeney, a professor in the Department of Government, explained Marrakchi. Sweeney showed that more than half of the U.S. population could be uniquely identified from 1990 U.S. Census data using only their date of birth, gender, and zip code. This demonstrated that privacy is about more than de-identification, encryption, or access control; the problem is more complex than that. The theory of differential privacy is one answer to the questions raised by Sweeney’s work.
“Emanating from cryptography, differential privacy is a formal mathematical theory of privacy preservation,” explained Marrakchi. “Its purpose is to provide the means to maximize the accuracy of queries from statistical databases while minimizing, and accounting for, the privacy impact on data subjects. Differentially private algorithms guarantee that any released statistical result does not reveal information about any single individual.”
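That guarantee has a precise formal statement (the standard definition from the differential privacy literature, not quoted in the article): a randomized algorithm M is ε-differentially private if, for every pair of datasets D and D′ that differ in one individual’s record, and for every set S of possible outputs,

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S]
```

Smaller values of ε mean stronger privacy: no single person’s data can substantially change the distribution of any released result.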
In essence, differential privacy entails designing algorithms that make it virtually impossible to determine whether any individual’s data was included in a database, even after examining aggregate statistics computed from it. By injecting a precisely calibrated quantity of noise, these algorithms mask each individual’s contribution so that no one’s presence can be inferred from any combination of queries or model results. The goal of the Harvard University Privacy Tools Project is to make differential privacy accessible to everyone by creating a Private data Sharing Interface (PSI) that allows researchers to upload sensitive datasets to secure data repositories, decide which statistics about the data they would like to release, and release exclusively privacy-preserving versions of them. Implementing and governing such an interface raises further practical challenges.
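To make the noise-injection idea concrete, here is a minimal sketch of the classic Laplace mechanism in Python. It illustrates the general technique, not PSI’s actual code; the function name dp_count and the example data are hypothetical.

```python
import numpy as np

def dp_count(records, predicate, epsilon):
    """Release an epsilon-differentially private count of matching records.

    Adding or removing one record changes the true count by at most 1
    (sensitivity 1), so Laplace noise with scale 1/epsilon suffices for
    epsilon-differential privacy.
    """
    true_count = sum(1 for r in records if predicate(r))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Example: a private count of survey respondents over 40,
# spending a privacy budget of epsilon = 0.5.
ages = [23, 45, 31, 67, 52, 38, 71]
print(dp_count(ages, lambda age: age > 40, epsilon=0.5))
```

The smaller the privacy budget epsilon, the larger the noise, and the less any one record can influence what an analyst sees.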
“In an ideal world, all the data would be moved to these data repositories. However, practically, it can be cumbersome or impossible to move some data sets. Currently, these data silos don’t achieve their full privacy-preserving potential, so it would be very useful if, given an arbitrary R script from a data analyst, we could certify it meets the requirements of differential privacy and package it up in a container to have it run by the data owner where the data is stored,” he said. “I’m looking into ways to do that by enforcing that the arbitrary R code exclusively uses the Privacy Tools Project’s PSI library of differentially private algorithms for any information leakage.”
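The certification Marrakchi describes targets R scripts and the PSI library, and the article does not detail how it works. As a rough analogue, the sketch below statically inspects a submitted Python script and rejects any function call outside a vetted allowlist; the allowlist names are hypothetical stand-ins for a differentially private API.

```python
import ast

# Hypothetical allowlist of differentially private release functions.
ALLOWED_CALLS = {"dp_count", "dp_mean", "dp_histogram", "print"}

def uses_only_allowed_calls(source: str) -> bool:
    """Statically check that a script only calls approved functions.

    Conservative by design: any call that is not a plain name on the
    allowlist (including attribute calls such as os.system) is rejected.
    """
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            if not (isinstance(node.func, ast.Name)
                    and node.func.id in ALLOWED_CALLS):
                return False
    return True

print(uses_only_allowed_calls("print(dp_count(data, 0.5))"))  # True
print(uses_only_allowed_calls("open('/etc/passwd')"))         # False
```

A real certifier would need far more, such as privacy-budget accounting and sandboxed execution, but the allowlist idea captures the core constraint: untrusted analysis code may only release information through vetted differentially private channels.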
Through the project, Marrakchi has been able to interact with leading computer scientists in the field, as well as with professors and students from many other disciplines.
“Through Harvard, my interest in data privacy issues led me to interact with the people who started or revolutionized entire fields,” said Marrakchi, referring specifically to his academic advisor, Cynthia Dwork, Gordon McKay Professor of Computer Science. “The multidisciplinary work of the Privacy Tools Project is not something you can find in every school. Having lawyers, computer scientists, and social scientists sit at the same table and discuss data privacy issues is a unique thing and, as a student, you just like to sit and watch them interact.”
Marrakchi plans to build on the multidisciplinary nature of his research by exploring opportunities both within and beyond computer science. To that end, he is spending the rest of the summer as a teaching assistant for AddisCoder, a computer science summer program for high school students in Ethiopia.