Privacy best practices aren't built into analytics engineering workflows in most organizations today. It's time for that to change.
We live in a digital world: everything we do happens online or through a digital device, giving organizations the ability to collect an unprecedented amount of data. While this data presents tremendous opportunity, much of the richest data includes customer information: purchases, locations, and personally identifiable information (PII). Privacy regulation and the desire to limit liability often put data anonymization at the center of this conversation.
For the analytics engineer working to utilize this information, the goal is for this data to inform better decisions. And while data privacy is a concern, it often presents as merely a roadblock to achieving the highest quality results.
Privacy Dynamics’s mission is to remove data privacy roadblocks in order to empower innovative and ethical data teams. We are a team of engineers that intimately know the roadblock anonymization presents, and are passionate about finding a solution for analytics engineers and consumers alike.
Let’s dive into how we make that happen.
Anonymized data is primarily needed for three core business functions: development, testing, and production analytics.
For each use case, businesses are expected to obtain explicit consent from consumers to use their personal information. The challenge is that most data is collected before a project is fully defined, leaving organizations with a difficult choice: go back and ask for consent, or risk the consequences.
While often unintentional, the improper use of sensitive, personal data can have damaging effects on millions of people. A recent Robinhood data breach exposed five million email addresses and two million names.
Often, in fact, it is the data that seems anonymous that is most at risk for violating consumer privacy. Latanya Sweeney’s research found that the combination of gender, date of birth, and zip code can uniquely identify up to 87% of Americans. These same fields can often be important predictors for data analysis.
Data anonymization preserves the ability to extract important business insights from this data, without risking the unintentional disclosure of personal information.
Data breaches are on the rise, up 38% in Q2 2021. The natural response is for organizations to lock down their data, making it less accessible to malicious attackers but also harder for authorized parties to use. Instead, the data should be dissected to understand the component parts that need to be de-identified. To use personal data under the premise that it has been de-identified, both direct and indirect identifiers need to be addressed.
Direct identifiers are the most obvious to classify and treat, using techniques such as masking, tokenization, or redaction. The most common examples include a Social Security number, credit card number, or phone number. In total, HIPAA's Safe Harbor standard enumerates 18 such direct identifiers.
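As a minimal sketch of what treating direct identifiers can look like, the snippet below tokenizes an SSN and redacts a phone number. The field names, salt, and record are hypothetical, and real systems would manage salts/keys far more carefully:

```python
import hashlib

def tokenize(value: str, salt: str = "example-salt") -> str:
    """Replace a direct identifier with a stable, irreversible token."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:12]

def redact_phone(phone: str) -> str:
    """Keep only the last four digits of a phone number."""
    digits = [c for c in phone if c.isdigit()]
    return "***-***-" + "".join(digits[-4:])

record = {"ssn": "123-45-6789", "phone": "555-867-5309", "zip": "98101"}
treated = {
    "ssn": tokenize(record["ssn"]),          # tokenized: same input -> same token
    "phone": redact_phone(record["phone"]),  # redacted: only last 4 digits remain
    "zip": record["zip"],                    # indirect identifier, handled separately
}
print(treated["phone"])  # ***-***-5309
```

Note that the zip code passes through untouched: as the next paragraph explains, indirect identifiers like this are the harder part of the problem.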
Indirect identifiers are harder to locate and remove. They can be used to re-identify data subjects when combined with other external knowledge, and it’s impossible to truly know how much external information an attacker has access to.
The primary responsibilities of an analytics engineer are writing production ELT code and building tools that serve downstream data workers such as data scientists or analysts. Rarely do you see privacy expertise in a job requisition for an analytics engineer; in fact, privacy engineering and analytics engineering are quite different career pathways. Emerging tools and services can help bridge the gap, safely tapping into knowledge trapped in a database, but doing this correctly remains a complex implementation.
Scaling custom data cleaning first requires analytics engineering teams to consider the needs of every team and process that relies on the data. For many teams, this means building and supporting infrastructure end-to-end. Analytics engineers collaborate with data engineers to build out the best solutions while advocating for broader business stakeholders.
The goal of data anonymization is to strip datasets of all personally identifiable information. The process of doing so, though, is a balancing act between the intentional removal of information to preserve privacy and the unintentional loss of information necessary for quality analytics. With haphazard anonymization, analysts risk deriving inaccurate insights and conclusions.
Rigorous anonymization practices also minimize the threat of re-identification attacks, which occur when data is joined with external information, allowing a record to be re-linked to its human originator. Let's look at an example.
| Customer ID | Age | Gender | Zip Code | Income |
|---|---|---|---|---|
While there doesn’t appear to be anything particularly vulnerable in this dataset, that narrative changes when combined with the information from another set of data:
| Customer Name | Age | Zip Code | # of Orders |
|---|---|---|---|
The combination of data from both datasets opens up the possibility of identifying Jerome in what’s known as a re-identification attack. Someone could look at his unique age and zip code before pairing him with personal information they do not have consent to use like email address or income. Preventing this threat is a key function of anonymization.
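The join behind such an attack takes only a few lines. The sketch below uses entirely hypothetical records (the names, ages, zips, and incomes are invented for illustration) to show how matching on shared quasi-identifiers re-links names to an "anonymous" table:

```python
# Hypothetical "anonymized" dataset: direct identifiers removed.
orders = [
    {"customer_id": 101, "age": 34, "zip": "98101", "income": 72000},
    {"customer_id": 102, "age": 87, "zip": "98102", "income": 54000},
]

# Hypothetical external dataset that still carries names.
loyalty = [
    {"name": "Ana", "age": 34, "zip": "98101"},
    {"name": "Jerome", "age": 87, "zip": "98102"},
]

# Joining on the shared quasi-identifiers (age + zip) re-links
# each name to its "anonymous" income record.
by_quasi = {(r["age"], r["zip"]): r for r in orders}
reidentified = {
    p["name"]: by_quasi[(p["age"], p["zip"])]["income"]
    for p in loyalty
    if (p["age"], p["zip"]) in by_quasi
}
print(reidentified)  # {'Ana': 72000, 'Jerome': 54000}
```

An unusual age-and-zip combination, like the 87-year-old above, makes this linkage trivially reliable, which is exactly the risk rigorous anonymization is meant to remove.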
Traditionally, there are two approaches to anonymizing data: manually cleaning it with custom scripts, or generating synthetic data.
Given that anonymizing data is not an analytics engineer's primary job function, let's evaluate the pros and cons of each of these traditional methods.
By definition, data anonymization is simply removing units or clusters of information that may lead to identification. In reality, organizations set out to achieve this goal, and adhere to policy regulations, in many different ways. The challenge is staying within a maximum tolerable information loss while still meeting the required privacy threshold.
First, let’s clarify what we mean by “manually” anonymizing data. The manual approach is the process by which an analytics engineer would take advantage of their programming knowledge to write a script to scrub the data set of PII and other personal information.
To get a sense for commonly used tools and methods, check out the 1,000+ repositories in the #data-cleaning topic on GitHub.
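One common manual transformation generalizes quasi-identifiers, for example binning ages into decades and truncating zip codes. This is a hedged sketch of that pattern; the field names and values are hypothetical:

```python
def generalize(record):
    """Coarsen quasi-identifiers so records blend into larger groups."""
    decade = (record["age"] // 10) * 10
    return {
        "age_band": f"{decade}-{decade + 9}",   # 34 -> "30-39"
        "zip3": record["zip"][:3] + "**",       # "98101" -> "981**"
        "income": record["income"],             # analytical value preserved
    }

print(generalize({"age": 34, "zip": "98101", "income": 72000}))
# {'age_band': '30-39', 'zip3': '981**', 'income': 72000}
```

Each such rule trades a little analytical precision for a little privacy, and a production script ends up chaining many of them.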
As the sample dataset showed, simple transformations of this kind still leave privacy risk. A script that anonymizes 100% of the data often depends on hundreds of such transformations. District Data Labs has an excellent practical guide to anonymizing data using Python.
While this method does not require a line-by-line analysis, it is still manual due to the time required to write a robust transformation script. An analytics engineer must fully understand the specific dataset, write and test the script, and then maintain it in infrastructure. A unique transformation script then needs to be updated for each specific use case.
Like any code, there’s potential for bugs and human errors. Today’s script may not work with tomorrow’s data. And there’s still no guarantee that the manual approach will prevent the re-identification attacks described earlier.
The appeal of synthetic data makes sense; if one isn’t using the actual data, then there’s no data at risk. Synthetic data can be generated to match properties of a real dataset without exposing the source data.
Engineers and Data Scientists generate synthetic data with a scripting language like Python or using a synthetic data generation tool. The core idea is that statistical patterns in a real dataset can be abstracted out (e.g. a distribution of order totals), and then used to generate an entirely novel set of data for actionable analysis.
Unfortunately, it can be hard to know how well the synthetic version matches the real data. Synthetic data is created to replicate the attributes of the real dataset, but ultimately it is not the same. No matter how much effort is put into building a synthetic data model, there will be some loss in both accuracy and utility.
Synthetic data providers offer large modeling toolkits, akin to building a massive model out of Lego pieces. And while these patchwork models can come close, many teams will find they are still missing valuable insights. The only way to know for sure is to spend countless hours configuring the synthetic data and testing it.
Using synthetic data can lead to missing out on crucial insights only discoverable in real-world data. Edge cases, anomalies, subtle patterns, all are at risk of not being visible in the synthetic data.
There are situations where manual cleaning and/or synthetic data creation are viable solutions to the privacy problem. However, many times, both of these methods use a sledgehammer to solve a problem where a chisel is better suited for the job.
Column-wise anonymization approaches the problem of data privacy with the same mindset, but the end result is vastly different from a precise cell-based approach: it generalizes an entire column of data when changing a value or two within the column would have sufficed.
When worried about international data laws, it may feel safer to just eliminate the data entirely. But if your data's statistical power relies on a high sample size, generalizing the whole column shrinks the very sample being analyzed. Even more so, if a column contains sensitive direct or quasi-identifiers, suppressing or generalizing the entire column is overkill even by GDPR standards. Once the data has been de-identified, it is no longer in scope of the GDPR, and you are welcome to keep your data's statistical power.
So while executed with the best of intentions, broad generalization and suppression may be unnecessary to de-identification goals.
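To make the contrast concrete, here is a hedged sketch of cell-level suppression on hypothetical records: instead of blanking an entire zip-code column, it suppresses only the values whose group falls below a minimum size:

```python
from collections import Counter

# Hypothetical records with one quasi-identifier column.
rows = [
    {"zip": "98101", "income": 72000},
    {"zip": "98101", "income": 64000},
    {"zip": "98101", "income": 70000},
    {"zip": "99999", "income": 91000},  # a rare zip: the risky cell
]

K = 2  # each zip value must be shared by at least K rows
counts = Counter(r["zip"] for r in rows)

# Column-wise suppression would blank every zip; cell-level treatment
# only suppresses the values that actually break the K threshold.
for r in rows:
    if counts[r["zip"]] < K:
        r["zip"] = "*"

print(rows)  # only the rare "99999" zip is suppressed
```

Three of the four zip codes, and all of their statistical utility, survive untouched.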
For many analytics engineering projects, the existing data anonymization methods are less-than-ideal solutions due to the delicate trade-off between privacy and utility. Brute-forcing the anonymization step often yields uninspiring results.
The most precious resource for any Analytics Engineer is time. Analytics engineers are at their best when they can use their knowledge to diagnose what needs to happen to the dataset, and then use tools to improve their efficiency. Data transformations are a cost center, and managing them efficiently remains a top priority for data teams — as indicated by the radical growth dbt has experienced this year.
Privacy Dynamics automatically pre-assesses and scores the risk of each dataset before it is treated. The anonymization treatment can be applied with a single click or scheduled to run in the background. After it’s run, users can see results on the Privacy Dynamics datasets page, and can easily track all anonymization projects in a single view.
The dashboard presents users with visualizations of how the data has changed based on the current level of anonymization. It provides an overview of which information has changed, with the option to delve into specifics. Statistical visualization is particularly useful when deciding how much anonymity is appropriate for a dataset. It enables a side-by-side comparison of distributions and relationships.
To provide analytics engineers with the ability to track multiple projects in one view, Privacy Dynamics acts as a central hub to monitor connected datasets and facilitate sharing within the organization. The Projects screen offers a bird's-eye view of risk across all active projects, along with their owners and operational status.
Privacy Dynamics provides users with the ability to adjust the amount of anonymity by changing how the clustering algorithms handle individual data points. Dialing the k-anonymity threshold up adjusts individual data points to be more in line with one another, blending individuals into the crowd. A release of data is said to have k-anonymity if the information for any individual cannot be distinguished from that of at least k − 1 other individuals who also appear in the release.
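Measuring k for a release reduces to finding the smallest group of records that share the same quasi-identifier values. This sketch is illustrative only (it is not Privacy Dynamics' algorithm), and the records are hypothetical:

```python
from collections import Counter

def k_anonymity(records, quasi_identifiers):
    """Return k: the size of the smallest group of records sharing the
    same quasi-identifier values. Every individual in the release is then
    indistinguishable from at least k - 1 others."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

# Hypothetical released dataset.
records = [
    {"age_band": "30-39", "zip3": "981**"},
    {"age_band": "30-39", "zip3": "981**"},
    {"age_band": "80-89", "zip3": "982**"},  # a group of one: k = 1
]
print(k_anonymity(records, ["age_band", "zip3"]))  # 1 -> still re-identifiable
```

Raising the threshold means further generalizing or suppressing values until every such group reaches the desired size k.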
By anonymizing data more intelligently, without destroying the data itself, Privacy Dynamics lets analytics engineers effortlessly fine-tune to the desired level of privacy while retaining the patterns and relationships that matter for analysis. This puts control back into the analytics engineers' hands, while also addressing the company's privacy and regulatory compliance concerns.
We know your organization wants to protect your customers’ data, but we also know it can be a burdensome task for you to strike the right balance between privacy and utility. We minimized that burden on you, making the choice to work with anonymized data more about what’s right for everyone than about what’s required for you.
We became the privacy experts, so you don’t have to — when you have the right tools, insightful data anonymization doesn’t have to be manual, nor daunting. Learn more about Privacy Dynamics and how our technology connects to your existing stack and enables your data teams to create safe, anonymized data for all of your development, testing, and production analytics.