In this post, I'll share why teams see value in deploying de-identified datasets alongside masked ones, and how large organizations use both to achieve data minimization without burdensome governance schemes.
At Privacy Dynamics, I speak to a lot of customers who have implemented Dynamic Data Masking to protect the personal information in their data warehouse or data lake. This is great! Dynamic masking is a fantastic tool for limiting access to sensitive data. Modern implementations, like Snowflake's, are flexible, powerful, and performant. There is a lot to like.
However, dynamic masking can't do everything: it doesn't prevent re-identification, and in large deployments, it is easy for masking policies to become unusably complex and inconsistently applied. Our customers find that de-identified (or anonymized) datasets provide a valuable, complementary tool that data and risk teams can use to achieve their goals.
After a brief introduction to dynamic masking, I'll highlight a few examples of how our customers use de-identified datasets alongside data masking.
Data masking is the process of selectively obfuscating data to make it less sensitive. There are many types of masks for many types of data. If you have ever seen your phone or social security number represented by just its last four digits, you have encountered data masking!
Dynamic data masking is the process of applying a mask to data at query time (or when the data is accessed), rather than masking the data when it is stored. This is desirable because you can create policies to mask or unmask data depending on the context of the query. Typically, this means allowing access to the masked data by default, and unmasking data only for privileged users (or roles).
Since dynamic data masking happens at query-time, it can only be performed by the database, or by a proxy that is between the database and end-user. There are many implementations; for the rest of this article, we'll use Snowflake's Dynamic Data Masking, since it's well-documented and SQL-based.
In Snowflake, you first create a masking policy, and then alter a column on a table to apply that policy. When Snowflake receives a query that includes a masked column, it dynamically rewrites the query to insert the masking logic from the masking policy everywhere that the column is used in the query, including in joins and predicates.
A dynamic mask on a phone number column in Snowflake could look like this:
```sql
create or replace masking policy phone_mask as (val string) returns string ->
  case
    when val is null then null
    when current_role() in ('SUPERADMIN') then val
    when current_role() in ('SUPPORT_REP') then '********' || right(val, 4)
    else '************'
  end;

alter table if exists users modify column phone set masking policy phone_mask;
```
With this mask, most users who attempt to query `users.phone` will only see `************`. However, customer support reps will have access to the last four digits of the phone number, and our database super admins will have access to the full phone number.
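To see how the policy's branching behaves, here is a minimal Python model of the same CASE logic; `mask_phone` is a hypothetical stand-in for the policy body, with the role names mirroring the example above:

```python
def mask_phone(val, current_role):
    """Mimic the phone_mask policy: full value for SUPERADMIN,
    last four digits for SUPPORT_REP, fully masked otherwise."""
    if val is None:
        return None
    if current_role == "SUPERADMIN":
        return val
    if current_role == "SUPPORT_REP":
        return "********" + val[-4:]
    return "************"

print(mask_phone("555-867-5309", "SUPPORT_REP"))  # ********5309
print(mask_phone("555-867-5309", "ANALYST"))      # ************
```

The key point is that the branch is chosen per query, based on the caller's role, so the same column yields different results for different users.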
A big part of why dynamic masking is so powerful is that it is non-destructive. You can always apply a masking policy to a column, and you'll never have any regrets. You can always grant people access if they need it by granting the user an additional role or modifying the masking policy.
However, the non-destructive aspect of dynamic masking also makes it risky. The underlying data is still there, and may be present in snapshots, exports, and other backups (depending on the RDBMS and masking implementation). It will also be present in downstream datasets that are created by a privileged user, and that user must remember to apply masking policies to those new datasets.
It also means that dynamic masking can't be used to satisfy data retention policies, comply with data subject access requests (DSAR) for deletion, or simplify compliance with privacy laws, since dynamically masked personal data is still personal data.
In contrast, Privacy Dynamics creates a de-identified dataset with all personal information permanently removed. This is an intentionally irreversible action (although you can choose to leave primary keys intact), so that de-identified data no longer constitutes regulated personal information. We run simulated attacks on your de-identified data to ensure the probability of re-identification is very low.
This means that de-identified datasets are not subject to DSAR or records retention policies, making them immutable artifacts that are perfect for model training and analytics. By building analytical pipelines on top of de-identified data, our customers streamline their overall risk and compliance posture by limiting DSAR and retention concerns only to their core, sensitive databases.
Since masking is non-destructive, it doesn't stop the proliferation of PII across your data ecosystem. This means your masking policies must proliferate as well, and with a deployment at scale, this can lead to unusable complexity.
In our toy example above, we have just one field and two groups of users. But what about hundreds or thousands of fields, and tens or hundreds of roles? Implementing both column and row-level policies (e.g., limiting support reps to a single region) adds further complications. Granular roles and policies make user onboarding and provisioning equally complex, and similar processes need to be implemented to de-provision access when it is no longer needed. Ultimately, new tools are usually required to automate the application of policies, lint the code for rule violations, manage user permissions, and scan your database for unmasked PII.
In contrast, because de-identified data is safe for broad distribution, the governance of de-identified data can be kept exceptionally simple, even at large organizations: just give people access.
Some datasets need to contain PII, and those should absolutely be dynamically masked to restrict access as much as possible. But for the 80% of use cases that can be served by de-identified data, it should be the default, which will dramatically simplify governance.
Development and test environments are among the largest sources of data leaks. Best practice (and data minimization regulations) demands that dev + test environments do not contain production data, but that can make these environments less useful, especially in a data engineering context, where edge cases in production data are a big source of bugs.
Dynamic masking really isn't appropriate for creating safe datasets for dev + test. Masked values typically don't preserve the original format, and using a UDF to fake values dynamically is too slow. This can cause tests on masked data to fail, and it limits the usefulness of masked datasets for testing data transformations and migrations.
Again, de-identified data is perfect for this use case. You can use your production data (at its full scale and complexity) for dev + test, without the risk. If desired, identifiers can be faked at storage time, before the dataset is queried, which makes it fast. And since de-identified data can be loaded from a snapshot or queried in a live database, it can support both local and cloud-based dev workflows, CI/CD, preview environments, and more.
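To illustrate why faking at storage time can preserve format consistency, here is a sketch of one common approach (a simplified illustration, not Privacy Dynamics' actual implementation): derive a deterministic, format-consistent fake phone number from a keyed hash, so the same input always maps to the same output and joins across tables still work.

```python
import hashlib

def fake_phone(real_phone: str, secret_key: str) -> str:
    """Deterministically map a phone number to a fake one with the
    same NNN-NNN-NNNN shape. A keyed hash makes the mapping stable
    (preserving referential integrity) but hard to reverse."""
    digest = hashlib.sha256((secret_key + real_phone).encode()).hexdigest()
    digits = "".join(str(int(c, 16) % 10) for c in digest[:10])
    return f"{digits[:3]}-{digits[3:6]}-{digits[6:10]}"

fake = fake_phone("555-867-5309", "my-secret")
# Same shape as the input, so format-sensitive tests still pass.
```

Because the fake values are computed once, at write time, queries against the de-identified copy run at full native speed, with no per-row UDF overhead.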
In most deployments (partially to limit the complexity we discussed earlier), dynamic masking is limited to direct identifiers, like names, email addresses, social security numbers, and precise locations. However, redacting direct identifiers makes data pseudonymized, not anonymized: it can be simple to re-identify individuals based on other traits that make them unique, like birthday, zip code, and gender.
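A toy illustration of how little it takes to single someone out, using made-up records whose direct identifiers have already been masked:

```python
from collections import Counter

# Hypothetical records: names and emails are masked, but the
# quasi-identifiers (birthday, zip, gender) remain in the clear.
records = [
    {"birthday": "1985-03-12", "zip": "94107", "gender": "F"},
    {"birthday": "1985-03-12", "zip": "94107", "gender": "M"},
    {"birthday": "1990-07-01", "zip": "94107", "gender": "F"},
    {"birthday": "1990-07-01", "zip": "10001", "gender": "F"},
]

combos = Counter((r["birthday"], r["zip"], r["gender"]) for r in records)
unique = [c for c, n in combos.items() if n == 1]
print(len(unique))  # 4 -- every record is unique on these three traits
```

Anyone holding an outside dataset that links those three traits to names (a voter roll, a loyalty program) can re-identify every row, even though the "identifying" columns were masked.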
These unique traits (called quasi-identifiers) are also often very valuable to business users: for example, marketers may want to create customer segments and improve ad targeting by using a customer's birthday, zip code, and gender. This typically means that the business needs these fields unmasked for many, if not most, consumers of the data, which limits the effectiveness of masking as a risk-reduction tool.
De-identified data can provide a more flexible solution. By slightly perturbing the values of quasi-identifiers, either by adding noise or through micro-aggregation, treated datasets can protect individuals' privacy while maintaining their analytical utility. Some of our customers use our software to treat quasi-identifiers while leaving direct identifiers intact; they then apply masking policies to the direct identifiers in the treated dataset, which keeps those policies to a minimal number of columns and restricts access to those columns as much as possible.
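To make micro-aggregation concrete, here is a simplified one-dimensional sketch (real treatments, including ours, cluster on multiple quasi-identifiers at once): records are grouped into clusters of at least k, and each value is replaced with its cluster mean, so no individual's exact value survives while aggregates stay accurate.

```python
def microaggregate(values, k):
    """Sort values, partition into groups of at least k, and replace
    each value with its group mean (simple 1-D micro-aggregation).
    Note: output is in sorted order, not the input order."""
    ordered = sorted(values)
    out = []
    for i in range(0, len(ordered), k):
        group = ordered[i:i + k]
        if len(group) < k and out:
            # Fold a too-small tail into the previous group.
            group = prev + group
            out = out[:-len(prev)]
        mean = sum(group) / len(group)
        out.extend([mean] * len(group))
        prev = group
    return out

ages = [23, 25, 24, 41, 40, 39, 39]
print(microaggregate(ages, 3))
# [24.0, 24.0, 24.0, 39.75, 39.75, 39.75, 39.75]
```

Every output value is shared by at least k records, yet the column's overall mean is unchanged, which is why segment-level analytics on the treated data remain useful.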
Previously, creating de-identified datasets was a manual and expensive process. Privacy Dynamics has completely automated the process of de-identifying data to a HIPAA-, GDPR-, and CPRA-compliant standard. You can connect your data store (databases, data warehouses, and data lakes are supported) to our SaaS, or deploy Privacy Dynamics inside your VPC, and start anonymizing in just a few minutes.
If you would like to learn more about Privacy Dynamics, you can try it for free or contact us and we will help you design a solution for minimizing risk while maximizing the impact of data at your organization. You can also book a demo.