Data Minimization in Analytics Using dbt and Privacy Dynamics

Ted Conbeer

07.26.22 · 12 min read

Data minimization is an important security, privacy, and compliance strategy, but applying it in an analytics context is hard. In this post, we show you how you can minimize the use of personal information in the Modern Data Stack, using dbt, Snowflake, and Privacy Dynamics.

What is Data Minimization?

Data minimization has long been a tenet of responsible data practice, and more recently it has been codified in data privacy regulations like GDPR and HIPAA and in other widespread standards like SOC 2. Essentially, data minimization is the principle that an organization should only collect, store, and process personal data that is directly relevant to its specific business goals. An organization should not collect personal data because it might become useful some day: unless the data is directly relevant and necessary to accomplish a specific purpose, collecting and storing it may be unlawful.

Furthermore, data minimization applies to access and sharing of data, even within organizations. HIPAA explicitly covers this with its Minimum Necessary standard, which requires companies "to take reasonable steps to limit the use or disclosure of, and requests for, protected health information to the minimum necessary to accomplish the intended purpose."

Data Minimization Tactics for Data Teams

Data minimization should be applied at every step of the data "value chain." Many teams seek to "shift privacy to the left," or farther upstream, since it can be simpler to manage sensitive information at its source, before it proliferates through numerous models and tools. With that in mind, there are five independent tactics that data teams can use to achieve data minimization:

A slide outlining the five tactics of data minimization: minimize collection, minimize access/consumption, minimize sharing, minimize retention, minimize data in lower environments

Minimize collection. Data teams should work with their Product and Engineering teams to ensure that only necessary personal information is being collected by their product. By bringing business context into these conversations, analysts can ask questions like "Do we really need our customers' birthdays?" or "Who will be using the IP addresses we're planning on collecting?" When designing product tracking plans, analysts should limit the use of PII (like name and email) in event properties, and encourage the use of user traits instead.
Minimize access and consumption. Role-based access control (RBAC) is a critical component of any data minimization effort. Data teams need to understand the various personas, or access needs, of their stakeholders, create roles in their data warehouses, BI tools, and identity providers that map to those personas, and define policies for each role that limit access to sensitive data that is not required by that persona. Policies may limit access to entire datasets or to subsets of data, like individual fields or records. The implementation will depend on the data platform, and may take advantage of vendor-specific features, like Snowflake's Dynamic Data Masking. Alternatively, sequestering PII in a "privacy vault" and tokenizing data that is stored in most systems can help centralize the RBAC problem.

Sophisticated teams may wish to go further than limiting access by limiting consumption. This may require time-boxing role authorization, reviewing or auditing queries made by authorized users, and documenting reasons for every query of sensitive data.
Minimize sharing. When partnering with Marketing teams, data team members should engage in audience-building and attribution efforts and ensure data sharing with third parties is minimized, and where possible, data processing by third parties is restricted.
Minimize retention. The complement of collection is retention. Data that is old, stale, or no longer relevant should be deleted or made inaccessible to most end-users. Being realistic about the value of old data is important. Analysts should be mindful of changing business contexts (like product features, marketing budgets, and market sizes) and be willing to part with data from an earlier era.
Minimize data in lower environments. Data engineers, analytics engineers, data scientists, and data analysts often interact with sensitive data in contexts where it is not required, like in the development of data pipelines, data models, ML models, and even dashboards or operational reports. Even if these same individuals are trusted with this data in other contexts, and even if sensitive data will flow through the production asset, it violates the tenets of minimization to use that same data in a development context. Data teams must develop tactics for minimizing personal data in development environments, which could include sampling or subsetting records; pseudonymizing data by masking, hashing, or faking direct identifiers; or manufacturing test or synthetic data.
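
As a rough illustration of the field-level policies mentioned under "Minimize access and consumption," the sketch below uses Snowflake's Dynamic Data Masking. The policy name, the list of privileged roles, and the raw.app.users table with its email column are all assumptions made for the example; adapt them to your own schema and personas.

```sql
-- Hypothetical policy: only privileged roles see the real email address.
create masking policy email_mask as (val string) returns string ->
    case
        when current_role() in ('SYSADMIN', 'TRANSFORMER') then val
        else '***MASKED***'
    end;

-- Attach the policy to an assumed column. Every other role that
-- queries this column now receives the masked value instead.
alter table raw.app.users modify column email set masking policy email_mask;
```

Because the policy is attached to the column rather than to a query, it applies no matter which tool or view reads the data, which is one reason to define masking as far upstream as possible.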

If that sounds like a lot of work, that's because it is! Data minimization can be a burden for data teams of all sizes. And while new tooling can help with automation, a lot of the complexity of data minimization (and RBAC in particular) is driven by organizational and even political factors that are unlikely to be automated away.

Anonymization as an Alternative to Minimization

Under most regulations, standards, and ethical frameworks, data minimization is only important for personal data. While every regulation has its own definition of "personal data," nearly all limit it to (from GDPR) "information relating to an identified or identifiable natural person."

Accordingly, anonymization (or de-identification) provides an alternative to data minimization. If your data is truly anonymous, then it is safe to store, process, and use however you like.

However, anonymization is hard because it requires treating both direct identifiers (like name and email address) and quasi-identifiers: traits like age, gender, and zip code that can be combined and linked with external data to re-identify individuals. Historically, anonymization has required redacting or badly distorting sensitive data, which limited its usefulness for analytics. Modern methods of anonymization, like those employed by Privacy Dynamics, minimize distortion in treated data, which enables even complex analytics on anonymized data.

Designing a Solution for Data Minimization in the Modern Data Stack

The "Modern Data Stack" is a loose category of tools that integrate with cloud data warehouses, typically in an ELT (extract-load-transform) and batch-processing paradigm. At the center of the Modern Data Stack is dbt, an open-source tool that automates data transformations in a data warehouse. dbt has adapter plug-ins that allow it to integrate with nearly any database, data warehouse, or data lake.

In the rest of this post, we'll share a design for using dbt and Privacy Dynamics with Snowflake to minimize the use of personal data while minimizing operational complexity and maximizing the data's analytical value.

Loading Minimal Data

In a typical dbt project, raw data is loaded into the warehouse before it is transformed. Many third-party tools, like Fivetran, Stitch, and Airbyte, provide a huge number of "connectors" to nearly any database, object store, or SaaS API.

Some of these tools support data minimization directly by making it easy to configure what tables and fields should be replicated. Some go farther, and provide basic privacy transformations before loading your data; for example, Fivetran can hash fields before loading.

In Snowflake, we recommend loading all raw data into a database called RAW. Each data source gets its own schema in the RAW database.

Diagram showing data loaded from four sources into four schemas in Snowflake

Access to the RAW database should be strictly limited. Many data team members will need the usage privilege on the RAW database, but ideally select should be limited just to administrators (the SYSADMIN role) and service accounts for production pipelines. Grants for such a scheme would look like:

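-- loader: service role used by the ingestion tool to create schemas in RAW
create role loader;
grant usage on database RAW to loader;
grant create schema on database RAW to loader;
-- developer: can see that RAW exists, but gets no select on its contents
create role developer;
grant usage on database RAW to developer;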

Anonymizing Raw Data

We will use anonymized data for all model development. We will achieve this by maintaining a schema-preserving copy of the RAW database in a new database, called RAW_SAFE.

The simplest way to create a RAW_SAFE database is to use Privacy Dynamics. After connecting to Snowflake, we can use the UI to select tables that need to be anonymized ("treated") and those that can simply be replicated without anonymization ("passed through").

Screenshot showing simple Privacy Dynamics config

After optionally fine-tuning the anonymization, we can set up a schedule to run the anonymizer every hour. On each run, the anonymizer will detect any schema changes to the source data, and if new tables are added to the RAW database, it will automatically replicate those with the default anonymization settings.

Screenshot showing Privacy Dynamics project config with a schedule and with Automatically Create New Datasets checked

After repeating that process for each source, we are finished! We now have a RAW_SAFE database for our team to use.

Diagram showing Privacy Dynamics creating a RAW_SAFE database

If a large amount of data is being "passed through," we should consider other approaches that avoid storing duplicate copies of our raw data. Those could include:

Using a dedicated dbt project (with stricter access controls than our main project) to create views in RAW_SAFE that simply select * from each table in RAW that does not require anonymization. This code could be generated programmatically by the dbt-codegen package or a bespoke macro that introspects the INFORMATION_SCHEMA. This dbt project would only have to run when new tables are added to RAW.
Using Snowflake's zero-copy clones for the same purpose. Unlike views, clones would not be populated with new data, so the dbt project or cron job that creates the clones would have to run more frequently.
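
A minimal sketch of those two pass-through options follows. The products and plans tables are hypothetical stand-ins for tables in the app schema that contain no personal data; in practice these statements would be generated by dbt or a macro rather than written by hand.

```sql
-- Option 1: a pass-through view. It always reflects the current
-- contents of RAW and stores no duplicate data.
create or replace view raw_safe.app.products as
select * from raw.app.products;

-- Option 2: a zero-copy clone. It is a point-in-time copy, so it
-- must be re-created on a schedule to pick up new rows.
create or replace table raw_safe.app.plans clone raw.app.plans;
```

The view is simpler to keep fresh, while the clone keeps RAW_SAFE independent of RAW at query time; which trade-off matters more will depend on your access model.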

Without using Privacy Dynamics, it is difficult to treat quasi-identifiers and truly anonymize data. However, pseudonymizing data (removing direct identifiers) is a form of data minimization and is far better than doing nothing. We would again use a dedicated dbt project for the transformation from RAW to RAW_SAFE, and use a dbt package like dbt-privacy or dbt-snow-mask to manually mask and hash direct identifiers.
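
To make the pseudonymization approach concrete, here is a hand-written sketch of what such a transformation might compile to. The column names and the salt placeholder are assumptions, and the dbt packages named above provide their own macros for this; the example only shows the underlying idea of hashing or redacting direct identifiers while passing other columns through.

```sql
-- Hypothetical pseudonymized copy of a users table.
create or replace view raw_safe.app.users as
select
    id,
    -- A salted hash keeps the column usable as a join key
    -- without exposing the address itself.
    sha2(concat(lower(trim(email)), '<secret salt>'), 256) as email,
    -- Free-text identifiers with no analytical value are simply redacted.
    '[redacted]' as name,
    created_at
from raw.app.users;
```

Note that this treats only direct identifiers. Quasi-identifiers such as age or zip code pass through untouched, so the result is pseudonymized rather than anonymized, and anyone who can read the view definition can also read the salt.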

With any of these approaches, appropriate care and oversight (through code reviews or similar) is necessary to keep the data in RAW_SAFE truly safe. However, any efforts here will pay for themselves with a simplified governance model downstream (this is one form of "shifting privacy to the left").

Using Anonymized Data for Development and Testing

One of the most powerful features of dbt is its support for multiple environments, or "targets." A dbt user can run the same code locally (on their laptop) as in production, but by default a runtime parameter called target will cause the local run to write data to a different database or schema than the production run.

Targets are configured using a profiles.yml file and can be given arbitrary names, but we'll use dev to represent the development case, and prod to represent the automated, production runs of dbt. A simple profile with dev and prod targets follows:

my-analytics-profile:
  # the target key sets the default target at runtime
  target: dev
  # each key under outputs becomes a selectable target
  outputs:
    dev:
      type: snowflake
      account: <accountid>
      user: <dev username>
      password: <password>
      role: DEVELOPER
      warehouse: DEVELOPING
      database: DEV
    prod:
      type: snowflake
      account: <accountid>
      user: DBT_PROD
      password: <password>
      role: TRANSFORMER
      warehouse: TRANSFORMING
      database: ANALYTICS

dbt uses an abstraction called "sources" for database relations that are inputs to a dbt project, and that are created by a process outside of dbt. Sources are defined in a YAML file. To build a dbt project on our anonymized data, we will create a source for each schema in our RAW_SAFE database (note that by default on Snowflake, database identifiers are not case-sensitive):

sources:
  - name: app
    database: raw_safe
    tables:
      - name: users
      - name: orders
  - name: stripe
    database: raw_safe
    tables: ...

While we believe most teams can use anonymized data for production analytics, teams seeking to only use anonymized data for development can configure their sources to select from a different database, depending on the target. We can do this because dbt supports jinja templating in its .sql and, to a limited extent, its .yml files. To only use anonymized data for our dev target, we can substitute a jinja expression for the database name we used above:

sources:
  - name: app
    database: "{{ 'raw' if target.name == 'prod' else 'raw_safe' }}"
    tables:
      - name: users
      - name: orders
  - name: stripe
    database: "{{ 'raw' if target.name == 'prod' else 'raw_safe' }}"
    tables: ...

Now, when run against the dev target, dbt will select from the anonymized RAW_SAFE database and write to the DEV database. When run against the prod target, dbt will read from the RAW database and write to the ANALYTICS database.
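
The practical benefit is that model code does not change between environments. As a sketch, a staging model (the file name here is hypothetical) refers to the source by name only:

```sql
-- models/staging/stg_app__users.sql (hypothetical file name)
-- source() resolves to whichever database the sources file
-- selects for the active target.
select *
from {{ source('app', 'users') }}
```

With the profile above, a plain dbt run uses the default dev target and reads from RAW_SAFE, while dbt run --target prod reads from RAW. Developers never need to reference either database directly.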

Diagram showing Privacy Dynamics and dbt working together to create anonymized data assets

With this setup, you could lock down access to identifiable personal data by granting different privileges to the DEVELOPER and TRANSFORMER roles. For example, the DEVELOPER role gets nearly zero privileges on RAW:

create role if not exists developer;
-- need usage on RAW to use views in RAW_SAFE that select from RAW
grant usage on database RAW to developer;
grant usage on all schemas in database RAW to developer;
grant usage on database RAW_SAFE to developer;
grant usage on all schemas in database RAW_SAFE to developer;
grant usage on future schemas in database RAW_SAFE to developer;
grant select on all tables in database RAW_SAFE to developer;
grant select on future tables in database RAW_SAFE to developer;
grant select on all views in database RAW_SAFE to developer;
grant select on future views in database RAW_SAFE to developer;
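-- developers build their models in their own schemas in DEV
grant usage on database DEV to developer;
grant create schema on database DEV to developer;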

While the production TRANSFORMER role gets nothing on RAW_SAFE (to reduce the opportunity for dbt config mistakes):

create role transformer;
grant usage on database RAW to transformer;
grant usage on all schemas in database RAW to transformer;
grant usage on future schemas in database RAW to transformer;
grant select on all tables in database RAW to transformer;
grant select on future tables in database RAW to transformer;
grant select on all views in database RAW to transformer;
grant select on future views in database RAW to transformer;
grant ownership on database analytics to transformer;

If your team uses automated testing in continuous integration ("CI") and the CI runners (and target databases) are suitably secure, you could run your CI tests using either the anonymized or untreated data (or both!).

Anonymizing Modeled Data

If you preserve foreign key relationships, anonymizing raw data may not be sufficient to fully protect the subjects of your analysis from re-identification. When tables are joined, it is possible for k-anonymized quasi-identifiers in multiple tables to become unique and enable linkage attacks.

If your quasi-identifiers are spread between many source tables or systems, you should consider anonymizing modeled data, in addition to anonymizing the raw data. You should also anonymize any data asset that will be shared with a wider audience, either internally at your company or externally, for example, with marketing partners or even the broader public. Truly public datasets should be anonymized to a higher standard (a larger k value) to protect subjects as much as possible.
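
One way to check whether a joined model still meets a given k is to count the size of each group of quasi-identifier values. The query below is a sketch only: the analytics.core.customers table, its columns, and the threshold of 5 are assumptions chosen for illustration.

```sql
-- Count how many subjects share each combination of quasi-identifiers.
-- Any group smaller than k (here, 5) is at higher risk of re-identification.
select
    age_bracket,
    gender,
    zip_code,
    count(*) as group_size
from analytics.core.customers   -- hypothetical modeled table
group by age_bracket, gender, zip_code
having count(*) < 5
order by group_size;
```

If the query returns no rows, every combination of those three columns is shared by at least five records. Any rows it does return point to the groups that need further treatment before the model is shared.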

If you use Privacy Dynamics, the setup is largely the same as anonymizing your raw data. Simply select your production ANALYTICS database as a source, and choose which tables to anonymize and store in a PUBLIC database. Privacy Dynamics can run on any schedule and be configured to run immediately after your dbt project completes.

Anonymize, Don't Delete

Data retention is an important component of data minimization, but most teams hesitate before deleting any old data, believing it may be useful for future analysis or training an especially data-hungry ML model.

Because anonymized data is not personal data, anonymization is an alternative to deletion in order to satisfy data retention policies. There are a few different ways to operationalize this approach:

For append-only datasets, old data can be deleted at the source if desired. The RAW_SAFE database becomes the only source for old data (and should be backed up and managed accordingly).
Row access policies can be used to strictly limit access to raw data before a certain date (or a dynamic date, defined with an interval).
Views can be created to filter out "old" data from the identifiable datasets and union them with old data from the anonymized datasets.
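
The last two options could be sketched as follows. The two-year cutoff, the orders table, its created_at column (assumed to be a timestamp_ntz), and the policy and view names are all hypothetical. The union also assumes that RAW_SAFE preserves the column order of RAW.

```sql
-- Option A: a row access policy that hides rows older than two years
-- from every role except an administrator.
create or replace row access policy recent_rows_only
    as (created_at timestamp_ntz) returns boolean ->
        current_role() = 'SYSADMIN'
        or created_at >= dateadd(year, -2, current_timestamp()::timestamp_ntz);

alter table raw.app.orders
    add row access policy recent_rows_only on (created_at);

-- Option B: a view that takes recent rows from the identifiable data
-- and older rows from the anonymized copy.
create or replace view analytics.core.orders_all_time as
select * from raw.app.orders
where created_at >= dateadd(year, -2, current_timestamp()::timestamp_ntz)
union all
select * from raw_safe.app.orders
where created_at < dateadd(year, -2, current_timestamp()::timestamp_ntz);
```

Because the cutoff is computed at query time, rows age out of the identifiable branch automatically, without a scheduled job to move them.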

Pulling It All Together

In this post, we showed how you can use Privacy Dynamics and dbt for personal data minimization in any analytics environment. We covered tactics for minimizing access and consumption, retention, sharing, and managing lower environments.

If you would like to start minimizing personal data in your analytics stack, reach out and we will get you started today on a free trial of Privacy Dynamics.