Survival analysis is a powerful tool for understanding the time between any two events, but typically it requires rich data that can re-identify individuals. In this post, we demonstrate how to use anonymized data for survival analysis without degrading the utility of the analysis.
Survival analysis is the area of statistics that studies the expected amount of time between two events, such as the birth and death of an individual or the acquisition and churn of a subscription customer. It is especially useful when the observations are censored, meaning the dataset you would use to analyze behavior is missing data (typically because some individuals in the data have not yet died, churned, etc.).
Survival analysis was created to quite literally study the survival of patients in clinical trials. While it is possible to compute the "average lifetime" of a patient by waiting until everyone in your trial dies, you can conclude your analysis much earlier if you can create a model that predicts the average lifetime based on early observations. This is the insight at the heart of survival analysis. Let's take an example:
In the example above, we can eyeball the results and see that the treatment seems to be helping. But by how much? And how certain can we be? For this, we need statistics.
Fortunately for those who like Python, there is an excellent library called lifelines that makes it easy to get started with survival analysis. The lifelines docs even have an excellent introduction to the topic.
Lifelines provides several different models for survival analysis. Each model comes with its own assumptions and limitations, but a consistent API across the models makes it easy to work with many of them.
At a high level, when using lifelines, you will need to:

1. Shape your data into lifelines' standard structure (one row per subject, with a duration and an event indicator).
2. Choose and instantiate a model, such as the KaplanMeierFitter or the CoxPHFitter.
3. Fit the model on your dataframe.
4. Inspect, plot, or predict from the fitted model.
Lifelines insists on a standard data structure to train all of its models. Each record (row) in your dataframe will be a single subject (a person, patient, subscription, etc.). That record must contain a field for the measured duration, and another field for whether the event was observed or not. For our toy example above, a subject is a patient, the duration is the number of periods that elapsed after they started the clinical trial, and the event is their death. We can also include additional features about the subject, like their name and whether or not they were part of the treatment group:
Note that for subjects where the event is observed, the duration is the time until the event was observed. For subjects that are censored, the duration is the time to the most recent "proof of life." In our toy example, we assume that we checked in with all of our patients at t=10, but that does not have to be the case (if some subjects started the trial later, we could have an alive subject with a duration less than 10).
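The toy example above can be encoded in the standard lifelines structure with a small pandas dataframe (the names and values here are hypothetical stand-ins for the figure above):

```python
import pandas as pd

# Each row is one subject. DURATION is the number of periods observed;
# OBSERVED marks whether the event (death) occurred or the record is censored.
df = pd.DataFrame(
    {
        "NAME": ["Alice", "Bob", "Carol", "Dan"],
        "TREATED": [True, True, False, False],
        "DURATION": [10, 7, 4, 10],
        "OBSERVED": [False, True, True, False],  # False = still alive at last check-in
    }
)
```

Subjects with OBSERVED=False and DURATION=10 are the ones who were still alive at the t=10 check-in.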
If your data is in a different format, like start/stop datetimes or an event stream, lifelines provides utilities for transforming your data.
In our toy example, it would be easy to anonymize our subjects: we simply remove their names. But in a real study, we would have many features about our patients, like their age, sex, weight, dietary and exercise habits, genetic test results, and more. Those features could collectively act as quasi-identifiers that could allow analysts or others to re-identify the subjects of our study, even if we remove direct identifiers, like names.
Historically, this has posed a privacy vs. utility dilemma. However, with modern anonymization techniques like k-member microaggregation, we can effectively anonymize subjects and use all of their traits for analysis.
In the tutorial, we use the dd dataset provided by lifelines. The dataset has 1,808 records; each record represents a single political regime in one of 202 countries between 1946 and 2008. We would like to study the average lifetime of a political regime and the factors that influence that lifetime, like when it started, what country it ruled, and what type of regime it was. (Note: this dataset has additional records randomly censored; a real dataset would have OBSERVED=1 for all regimes except those that had not ended by 2008, when the dataset was created.)
This dataset makes re-identification through linkage attacks especially easy (for example, you can probably name the U.S. President whose START_YEAR is 2001). So can we analyze this data while protecting the privacy of its subjects?
With Privacy Dynamics, we can.
After loading the data into a data warehouse or data lake (as a CSV or Parquet file), we can create a new dataset in Privacy Dynamics. Only minimal configuration is required: we simply choose to lock the OBSERVED field and leave the rest on Auto. Given the small size of this dataset, DURATION alone could also be a quasi-identifier. If we leave that field on Auto, its values will be perturbed, but the subjects will be completely protected from re-identification.
The resulting data contains no unique combinations of Country, Continent, Democracy, Regime, and Start Year, which makes it impossible to definitively re-identify any subjects:
To achieve this, the Privacy Dynamics anonymizer clustered similar records together and replaced unique values with an aggregate from the cluster, effectively hiding individuals in groups that share most of their characteristics. Because of advanced clustering techniques, this anonymization minimizes the distortion in the treated data, so it is still suitable for the same types of analysis we would use the raw data for.
You can use the drop-down menu in the Hex App to choose whether to load raw or anonymized data into our dataframe. After that, all of the steps to analyze either dataset are identical, since Privacy Dynamics preserves the schema and types of treated data (except for the LEADER_NAME field, which we chose to redact). This means we can follow the commands in the lifelines tutorial exactly and complete all of the same analysis on either dataset.
For example, in the Hex App, we import the KaplanMeierFitter, instantiate it, and fit it on our dataframe before plotting the survival curve. We follow the same workflow for the CoxPHFitter, but we first need to prepare our features for regression, including one-hot encoding the categorical features.
Other methods of anonymization can badly distort the data, but the Privacy Dynamics approach maximizes the utility of the preserved data.
As a result, we see that our survival analysis is virtually unchanged by anonymization. Compare the Kaplan-Meier (KM) survival curve estimates for democratic regimes using the raw data to those from the anonymized data:
Even digging in deeper, we see very little distortion in the results, with any changes mostly staying within the confidence interval provided by Kaplan-Meier:
We can stress-test this better by running a regression on the anonymized features using the Cox Proportional Hazards (CPH) model. The data in these fields were in some cases perturbed by the anonymizer to prevent unique records from being present in the anonymized dataset, but we see that these perturbations only minimally impacted the coefficients of the regression:
And finally, using the CPH model, we can generate predictions for any combinations of features. Visualizing these predictions again shows almost no distortion in the model trained on anonymized data:
It's worth noting that this dataset is small, so the perturbations required for anonymization are quite large. For a real dataset of millions or billions of records, the distortion is typically much smaller.
If you are interested in anonymizing the data you use for analytics or ML, please reach out today, and we can get you started with Privacy Dynamics for free.