# Configuring Datasets

Privacy Dynamics automatically detects PII and applies conservative default settings for Dataset treatment. Depending on your use case, however, you may want to configure one or more Datasets to customize the way we treat your data. To understand these options, it helps to first have some background on anonymization. To learn how our anonymizer applies your configuration to your Dataset, see How It Works.

# Treatment Settings

# Categorization

We automatically detect and classify the fields in your database by their semantic meaning, including the category of PII and the type of identifier. You can adjust this classification by selecting a new category from the drop-down menu. The category you choose may affect the treatment: depending on the semantic category, direct IDs will be given different realistic values, and quasi-IDs will use a different clustering distance function.

# Lock

Locking a field excludes it from treatment. Any field that is locked will be "passed through" to the Destination data exactly as it was in the Source data.

To maintain referential integrity in your Destination data, you may want to Lock primary and (optionally) foreign keys.

Numerical fields are Locked by default, unless our classifier considers them to be encodings of categorical variables; categories are Anonymized by default.

If you are preparing a Dataset for a specific analysis, you may want to Lock your response variables and only Anonymize your features.

# Mask

Masking a field obscures its values but retains the field in the Destination data. Masking is a good choice when the Destination data must have the same schema (or shape) as the Source, and either the field's category does not support Realistic suppression or Realistic-looking data is not required for the desired use case. Masking may also be faster than faking Realistic data. Finally, Masked data cannot be confused with real data, so end users will know that the data has been treated (which may not be obvious with Realistic values).
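As a rough illustration of the idea (not the product's implementation), the sketch below masks one field in a list of rows. The function name `mask_field` and the mask string `"*****"` are assumptions made for this example; the actual mask format may differ.

```python
def mask_field(rows, field, mask="*****"):
    """Return new rows with `field` replaced by a constant mask string.

    Hypothetical helper for illustration only. The field stays in the
    output, so the schema of the Destination matches the Source.
    """
    return [{**row, field: mask} for row in rows]


source = [
    {"id": 1, "email": "ada@example.com"},
    {"id": 2, "email": "alan@example.com"},
]
destination = mask_field(source, "email")
```

Every row keeps its `email` key, but the value is obviously a placeholder, which is the property that distinguishes Mask from Realistic.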

# Realistic

Realistic suppression replaces the values in your data with randomly generated, fake values that preserve the format of the Source data. Faking Realistic data can be very powerful, since it allows you to test data migrations and other data transformations on anonymized data. This makes it a very popular choice for Development and Testing use cases.
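To make "preserve the format" concrete, here is a minimal sketch that swaps each character for a random one of the same kind. This is an assumption-laden toy: the helper `fake_like` is hypothetical, and it only preserves character classes and punctuation, whereas Realistic values are chosen per semantic category.

```python
import random
import string


def fake_like(value, rng):
    """Return a random string with the same shape as `value`.

    Hypothetical helper: digits become random digits, letters become
    random ASCII letters of the same case, and everything else
    (dashes, spaces, and so on) is kept as-is.
    """
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isupper():
            out.append(rng.choice(string.ascii_uppercase))
        elif ch.islower():
            out.append(rng.choice(string.ascii_lowercase))
        else:
            out.append(ch)
    return "".join(out)


rng = random.Random(0)  # seeded only so the example is repeatable
original = "555-0100 Ada"
fake = fake_like(original, rng)
```

Because the length and separators are unchanged, code that parses or validates the Source format should also accept the fake value.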

A full list of categories that can be suppressed with Realistic values is available here.

# Redact

Redacting is the simplest and fastest form of suppression: we simply drop the column from the Destination data. This makes it a great choice for Analytics and data publishing, but it may not be the best choice if systems expect the Source and Destination data to have the same schema.
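The schema change is easy to see in a small sketch. The helper name `redact_field` is an assumption for this example and is not part of the product.

```python
def redact_field(rows, field):
    """Return new rows with `field` removed entirely (hypothetical helper)."""
    return [{k: v for k, v in row.items() if k != field} for row in rows]


source = [
    {"id": 1, "ssn": "000-00-0000", "plan": "basic"},
    {"id": 2, "ssn": "000-00-0001", "plan": "pro"},
]
destination = redact_field(source, "ssn")
```

The Destination rows no longer have an `ssn` key at all, so any downstream system that selects that column by name would need to be adjusted.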

# Anonymize

Mask, Realistic, and Redact all configure the anonymizer to completely suppress the data in that field. With quasi-identifiers (like birth date or zip code), suppression destroys a lot of potentially useful data. Our k-anonymity-based micro-aggregation approach provides an alternative: we can slightly perturb values to provide anonymity while preserving data utility.
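The general idea of micro-aggregation can be sketched for a single numeric field: sort the values, form groups of at least k neighbors, and replace each value with its group mean. This is a simplified, one-dimensional illustration under our own assumptions (the function name `microaggregate` is hypothetical). It is not the anonymizer's algorithm, which clusters across fields using category-specific distance functions.

```python
def microaggregate(values, k):
    """Replace each value with the mean of a group of >= k sorted neighbors.

    Simplified univariate sketch for illustration only.
    """
    n = len(values)
    if k < 1 or n < k:
        raise ValueError("need k >= 1 and at least k values")
    # Indices of the values in ascending order, so results keep row order.
    order = sorted(range(n), key=lambda i: values[i])
    result = [None] * n
    start = 0
    while start < n:
        end = start + k
        # Fold a short tail into this group so no group has fewer than k members.
        if n - end < k:
            end = n
        group = order[start:end]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            result[i] = mean
        start = end
    return result


ages = [34, 35, 61, 36, 60, 62, 90]
treated = microaggregate(ages, k=3)
# treated == [35.0, 35.0, 68.25, 35.0, 68.25, 68.25, 68.25]
```

Each treated value stays close to the original, yet no value appears fewer than three times, so it no longer singles out one person.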

Any field that could be linked with external data to re-identify someone should be treated with Anonymize. This includes known quasi-identifiers like age, zip code, and gender, but also applies to most other data, especially demographic and categorical data.

The alternative to Anonymize is Lock, but any field that you Lock instead of Anonymize increases the risk of re-identification through linkage attacks. There are many examples of seemingly benign traits being used to re-identify individuals in anonymous datasets. It is best not to take this risk: only Lock a field after testing shows that Anonymize causes an unacceptable level of distortion in your Destination data.

# k-Anonymity

The Anonymize treatment is a group-based approach, where we "hide" individuals in uniform groups that share most of their traits. More specifically, our micro-aggregation step provides k-anonymity by ensuring that no tuple of quasi-identifiers appears in the dataset fewer than k times.
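The definition above can be checked directly by counting how often each combination of quasi-identifier values occurs. The sketch below is a generic check written for this page; the function names and the sample field names (`age_band`, `zip3`) are assumptions.

```python
from collections import Counter


def group_sizes(rows, quasi_identifiers):
    """Count rows for each distinct tuple of quasi-identifier values."""
    return Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every quasi-identifier tuple appears at least k times."""
    return all(n >= k for n in group_sizes(rows, quasi_identifiers).values())


rows = [
    {"age_band": "30-39", "zip3": "981", "plan": "basic"},
    {"age_band": "30-39", "zip3": "981", "plan": "pro"},
    {"age_band": "60-69", "zip3": "100", "plan": "pro"},
    {"age_band": "60-69", "zip3": "100", "plan": "pro"},
    {"age_band": "60-69", "zip3": "100", "plan": "basic"},
]
quasi = ["age_band", "zip3"]
```

Here the smallest group has two rows, so the data is 2-anonymous with respect to `age_band` and `zip3`, but not 3-anonymous.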

k is configurable for each dataset. The default is 2, which is sufficient in many cases to give any individual "plausible deniability" and therefore anonymity.

Keeping k low maximizes the utility of the treated data by minimizing distortion. But in some circumstances, a low k can lead to re-identification risk, or a related risk of sensitive attribute disclosure.

You should consider a higher value of k when:

  • Your treated (Destination) data will be published to a large or untrusted audience that may contain bad actors who will attempt to re-identify individuals.
  • Your data contains sensitive attributes, like medical diagnoses or financial information.
  • Minimizing your risk is more important than minimizing distortion in the Destination data (e.g., in developer environments).
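The sensitive attribute disclosure risk mentioned above arises when everyone in a group shares the same sensitive value: an attacker who can place a person in that group learns the value without identifying the exact row. A minimal sketch of how to spot such groups follows; the helper `homogeneous_groups` and the sample fields are assumptions for illustration.

```python
from collections import defaultdict


def homogeneous_groups(rows, quasi_identifiers, sensitive):
    """Return quasi-identifier tuples whose rows all share one sensitive value."""
    seen = defaultdict(set)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        seen[key].add(row[sensitive])
    return [key for key, values in seen.items() if len(values) == 1]


rows = [
    {"age_band": "30-39", "zip3": "981", "diagnosis": "asthma"},
    {"age_band": "30-39", "zip3": "981", "diagnosis": "asthma"},
    {"age_band": "60-69", "zip3": "100", "diagnosis": "asthma"},
    {"age_band": "60-69", "zip3": "100", "diagnosis": "diabetes"},
]
at_risk = homogeneous_groups(rows, ["age_band", "zip3"], "diagnosis")
```

Both groups satisfy k = 2, but the first one reveals its members' diagnosis. Larger groups tend to contain more varied sensitive values, which is one reason to raise k for this kind of data.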

# Common Use Cases

Continue reading for specific guidance for the following use cases:

Last Updated: 11/9/2022, 7:10:06 PM
