The anonymizer's treatment plan is designed to protect against re-identification through the use of crowd-based privacy methods. With our proprietary process, we can provide k-anonymity for any chosen value of k on a treated dataset, while maintaining original data types and minimizing distortion. We refer to this process as k-member micro-aggregation.
Forming k-member Clusters
We cluster individual records into groups of no fewer than k records. In practice, k is often small (for example, two or five), so this results in a very large number of clusters with only a few records each. Forming clusters intelligently is key to minimizing distortion: the closer the records in a cluster are to one another, the less each value has to change. We therefore take care to choose appropriate distance functions for different semantic types of data.
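As an illustration only, the sketch below shows one simple greedy way to form clusters of at least k records. It is not the proprietary clustering process described here: the function name `k_member_clusters`, the use of Euclidean distance, and the seed-selection rule are all assumptions made for the example.

```python
from math import dist  # Euclidean distance, Python 3.8+


def k_member_clusters(points, k):
    """Greedily group record indices into clusters of at least k members.

    Hypothetical sketch: `points` is a list of equal-length numeric tuples,
    and Euclidean distance stands in for a type-specific distance function.
    """
    if k < 1 or len(points) < k:
        raise ValueError("need at least k records")
    remaining = list(range(len(points)))
    clusters = []
    while len(remaining) >= k:
        seed = remaining.pop(0)
        # Order the unassigned records by distance to the seed and
        # take the k - 1 nearest ones to complete the cluster.
        remaining.sort(key=lambda i: dist(points[seed], points[i]))
        clusters.append([seed] + remaining[:k - 1])
        remaining = remaining[k - 1:]
    # Fewer than k records are left over: attach each one to the cluster
    # whose seed is nearest, so no cluster ends up smaller than k.
    for i in remaining:
        nearest = min(clusters, key=lambda c: dist(points[c[0]], points[i]))
        nearest.append(i)
    return clusters


ages = [(25,), (27,), (26,), (60,), (62,), (61,), (90,)]
clusters = k_member_clusters(ages, k=3)
```

With seven records and k = 3, this yields two clusters: one of exactly three records and one of four, because the leftover record is absorbed rather than left in an undersized group.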
Once clusters have been formed, to achieve k-anonymity, we must ensure that every record in a cluster carries the same tuple of quasi-identifier values. Each such tuple is then shared by at least k records.
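The k-anonymity condition itself is easy to state in code: every combination of quasi-identifier values that appears in the dataset must appear at least k times. The following is a minimal sketch; the function name `is_k_anonymous`, the dictionary record layout, and the sample field names are assumptions for illustration.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every quasi-identifier tuple occurs at least k times."""
    counts = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return all(n >= k for n in counts.values())


# Hypothetical sample data: "age" and "zip" are the quasi-identifiers.
treated = [
    {"age": 26, "zip": "02139", "diagnosis": "A"},
    {"age": 26, "zip": "02139", "diagnosis": "B"},
    {"age": 61, "zip": "73301", "diagnosis": "A"},
    {"age": 61, "zip": "73301", "diagnosis": "C"},
]
raw = treated[:3] + [{"age": 63, "zip": "73301", "diagnosis": "C"}]

treated_ok = is_k_anonymous(treated, ["age", "zip"], k=2)
raw_ok = is_k_anonymous(raw, ["age", "zip"], k=2)
```

Here `treated_ok` is true because each (age, zip) pair is shared by two records, while `raw_ok` is false because two of the pairs are unique.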
Traditionally, this has been achieved using generalization, that is, by making the quasi-identifiers less specific and therefore less unique. For example, ages can be replaced by a range of ages (47 replaced with 45-50). However, this approach has significant drawbacks, especially for analysts and data scientists. First, it changes the data type of a field: a numeric age becomes a range. Second, minimizing distortion requires a large number of generalized values (ranges), which may overlap and are difficult to handle algorithmically.
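The type change is easy to see in a small example. The sketch below generalizes an integer age into a fixed-width range; the function name `generalize_age` and the five-year bucket width are assumptions chosen to match the example above.

```python
def generalize_age(age, width=5):
    """Replace an integer age with a string label for its range.

    The bucket is half-open: it starts at `low` and ends before `low + width`.
    """
    low = (age // width) * width
    return f"{low}-{low + width}"


original = 47
generalized = generalize_age(original)  # "45-50"

# The field was an integer and is now a string, so numeric operations
# such as averaging or comparison no longer work without parsing.
type_changed = type(original) is not type(generalized)
```

Any downstream code that expects a number in this field must now parse or special-case the range label.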
Another option is perturbation, which maintains types but changes the values in a field. We prefer a form of perturbation called aggregation, which applies an aggregate function to the clustered values and replaces each individual's data with the cluster's aggregate. Because our clusters are typically very small (often of size k or slightly larger), we call this k-member micro-aggregation. This technique extends well to many different semantic types of data; each semantic type gets its own aggregation function (the simplest of which is the mode).
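A minimal sketch of the replacement step follows, assuming the clusters are already available as lists of record indices. The function name `micro_aggregate` and the choice of a rounded mean for ages are illustrative assumptions; only the mode is named in the text as an aggregation function.

```python
from statistics import mean, mode


def micro_aggregate(values, clusters, aggregate):
    """Replace each value with the aggregate of its cluster.

    `clusters` is a list of lists of indices into `values`;
    `aggregate` maps a list of values to a single value of the same type.
    """
    treated = list(values)
    for cluster in clusters:
        aggregated = aggregate([values[i] for i in cluster])
        for i in cluster:
            treated[i] = aggregated
    return treated


clusters = [[0, 1, 2], [3, 4, 5, 6]]

# Numeric field: a rounded mean keeps the values as integers (an assumption).
ages = [25, 27, 26, 60, 62, 61, 65]
treated_ages = micro_aggregate(ages, clusters, lambda vs: round(mean(vs)))

# Categorical field: the mode, the simplest aggregate mentioned above.
cities = ["Boston", "Boston", "Salem", "Austin", "Austin", "Dallas", "Austin"]
treated_cities = micro_aggregate(cities, clusters, mode)
```

After treatment, every record in a cluster holds the same value for the field, and the field keeps its original type: ages remain integers and cities remain strings.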
After direct identifiers have been removed, and quasi-identifiers have been replaced by their micro-aggregates, the Dataset has been effectively anonymized. As a final step, we compute the Risk Score and Distortion Metrics and produce the Dataset Report.