Anonymizing Data: Why Your Business Should Be Doing It More
Data privacy and utility are fundamentally at odds, but with the recent advancements in privacy technology, the question of anonymizing data is shifting from why to why not.
Data teams face many choices when dealing with personally identifiable information (PII). But these choices largely boil down to two options: satisfy regulatory and compliance requirements for using PII; or remove PII from your data. As compliance tasks pile up with the expanding list of global regulations, users are increasingly favoring PII removal for things like analytics, machine learning, and dev/test environments.
Anonymizing data properly, without adversely affecting its utility, is challenging. Anonymization, by nature, removes information from data. Remove too much information and your data is no longer useful. Remove too little and you risk unintentionally leaking sensitive personal information.
How much privacy is enough?
Data “privacy” is not binary. Aside from direct identifiers, most consumer data contains some amount of information about its subject. With enough external knowledge about the subject, it is rather trivial to connect them to the data you considered to be anonymized.
Privacy Dynamics does not believe in privacy guarantees. There is always some level of risk that a bad actor armed with enough external knowledge can re-identify records. Consider, for example, your closest family member. You know the date of their last visit to the ER, their date of birth, and their gender. If these bits of information, also known as indirect identifiers, are present in a dataset that contains “anonymized” personal health records, you — the attacker — could re-identify them and gain access to protected health information about them in the dataset. Research shows that 87% of Americans can be uniquely identified using nothing more than their date of birth, zip code, and gender.
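The attack described above can be sketched in a few lines of Python. The records and field names below are made up for illustration; the point is simply that a unique match on indirect identifiers is enough to expose everything else in the row.

```python
# Hypothetical "anonymized" health records: direct identifiers (names, SSNs)
# are gone, but indirect identifiers remain.
anonymized_records = [
    {"birth_date": "1984-03-12", "zip": "98101", "gender": "F", "diagnosis": "asthma"},
    {"birth_date": "1984-03-12", "zip": "98101", "gender": "M", "diagnosis": "flu"},
    {"birth_date": "1991-07-30", "zip": "98102", "gender": "F", "diagnosis": "diabetes"},
]

# External knowledge the attacker has about one subject (e.g., a family member):
known = {"birth_date": "1984-03-12", "zip": "98101", "gender": "F"}

# Filter for records matching everything the attacker knows.
matches = [r for r in anonymized_records
           if all(r[key] == value for key, value in known.items())]

if len(matches) == 1:
    # A unique match re-identifies the subject and exposes their health data.
    print("Re-identified! Diagnosis:", matches[0]["diagnosis"])
```

Here the attacker's three known attributes narrow the table to exactly one record, and the “anonymized” diagnosis is no longer private.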
Understanding re-identification risk
With an unlimited combination of categorical, descriptive, or behavioral data, how can we possibly understand when a record containing unknown values is at risk of being linked to its rightful owner? The answer lies in a method, first published in 1998, known as k-anonymity. K-based approaches have long served as the basis for de-identifying health data under the Expert Determination process. In essence, a table in which every record is indistinguishable from at least one other record has 2-anonymity; more generally, k-anonymity means every record is indistinguishable from at least k−1 others. In order to harden this metric for modern data sets and address the known limitations of k-anonymity, Privacy Dynamics simulates attacks (with varying levels of external knowledge) on your data to determine re-identification risk on a row-by-row basis. Using this simulated risk, we increase the level of k-anonymity in the dataset by carefully micro-clustering similar records until your desired privacy risk target is achieved. As soon as that target is reached, transformations stop, preserving the maximum amount of original information.
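Measuring k-anonymity itself is straightforward: group records by their combination of indirect (quasi-) identifier values and take the size of the smallest group. The sketch below uses hypothetical field names and a toy table; it illustrates the metric, not Privacy Dynamics' attack-simulation or micro-clustering machinery.

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return k for a table: the size of the smallest group of records
    sharing identical quasi-identifier values. Higher k = more privacy."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

# Toy table after generalizing birth dates to years and zips to 3-digit prefixes:
table = [
    {"birth_year": 1984, "zip3": "981", "gender": "F"},
    {"birth_year": 1984, "zip3": "981", "gender": "F"},
    {"birth_year": 1991, "zip3": "981", "gender": "M"},
    {"birth_year": 1991, "zip3": "981", "gender": "M"},
]

print(k_anonymity(table, ["birth_year", "zip3", "gender"]))  # 2, i.e. 2-anonymity
```

Each record above shares its quasi-identifier values with exactly one other record, so k = 2. A table containing any record with a unique combination would score k = 1, meaning that record can be singled out.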
Can anonymized data deliver the results I need?
Historically, anonymized data has not been useful for deep analytical and research projects. But this is a symptom of crude legacy approaches, where anonymizers used a sledgehammer to perturb data column by column. We designed Privacy Dynamics as a precision anonymizer, making privacy-preserving data more suitable for things like analytics, machine learning, dev/test, and publishing. Our anonymizer generates a detailed transformation report summarizing observed distortion and the resulting impact on field-to-field relationships.
Try anonymized data first
How can you know whether PII is directly relevant and necessary to a project if you never attempt the project without it? Historically, generating and testing anonymized data has been a project in itself, causing many to simply avoid it. But with Privacy Dynamics, you can create anonymized versions of any table in your data warehouse within minutes. If the anonymized data serves the needs of the project, downstream users suddenly have unbridled access to the data they need. Should you find the anonymized data is insufficient, you’re left with a convincing argument (and a handy report!) to defend your use of PII to your compliance team.