Entity Recognition
Proper anonymization requires a semantic understanding of the data. It is not enough to know the types or statistical qualities of a dataset; we must infer the meaning of the data in order to treat it appropriately. We call this process of semantic enrichment Entity Recognition.
There are two categories of data that are especially important to our algorithms: PII and numerical data.
Detecting PII
Personally Identifiable Information ("PII") can take many forms. HIPAA enumerates 18 families of attributes that are considered personal identifiers. They include names, email addresses, and phone numbers, but also dates, geographies, account numbers, biometrics, and more.
Detecting PII is a non-trivial problem, even in structured data. Privacy Dynamics combines several approaches, drawing on academic research and open-source libraries, that range from rules-based heuristics to machine-learning models. After detecting PII, our anonymizer creates a treatment plan to redact direct identifiers and treat quasi-identifiers.
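To make the rules-based side of detection concrete, here is a minimal Python sketch. It is not Privacy Dynamics' implementation: the rule set, the labels, the `detect_column` helper, and the 0.8 threshold are all hypothetical, chosen only to show how pattern checks and a checksum can be combined to label a column.

```python
import re

# Illustrative patterns only; real-world detection needs far more care.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
SSN_RE = re.compile(r"\d{3}-\d{2}-\d{4}")
CARD_RE = re.compile(r"[0-9]{13,19}")


def luhn_valid(digits: str) -> bool:
    """Return True if a string of ASCII digits passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def looks_like_card(value: str) -> bool:
    """Strip common separators, then check length and checksum."""
    stripped = value.replace(" ", "").replace("-", "")
    return bool(CARD_RE.fullmatch(stripped)) and luhn_valid(stripped)


# Hypothetical label -> rule mapping (an assumption for this sketch).
RULES = {
    "email_address": lambda v: bool(EMAIL_RE.fullmatch(v)),
    "us_ssn": lambda v: bool(SSN_RE.fullmatch(v)),
    "credit_card_number": looks_like_card,
}


def detect_column(values, threshold=0.8):
    """Label a column if enough of its non-empty values satisfy one rule."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return None
    best_label, best_share = None, 0.0
    for label, rule in RULES.items():
        share = sum(rule(v) for v in non_empty) / len(non_empty)
        if share > best_share:
            best_label, best_share = label, share
    return best_label if best_share >= threshold else None
```

Scoring the whole column rather than individual cells lets a few malformed or missing values pass without changing the label. A heuristic like this would only be one signal among several; free-text names and addresses, for example, are not well served by patterns alone.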
Privacy Dynamics can automatically detect the following types of direct and quasi-identifiers:
Identifier | Category |
---|---|
Person Name (Full, First, Middle, Last) | Direct |
Email Address | Direct |
Phone Number | Direct |
Mailing Address | Direct |
ABA Routing Number | Direct |
US Bank Account Number | Direct |
Credit Card Number | Direct |
Crypto Address | Direct |
US Passport Number | Direct |
US Driver's License Number | Direct |
US Individual Taxpayer Identification Number (ITIN) | Direct |
US Social Security Number (SSN) | Direct |
Other US Tax IDs (EIN, PTIN) | Direct |
IP Address | Direct |
Advertising ID | Direct |
Electronic Serial Number (ESN) in Decimal or Hex | Direct |
International Mobile Equipment Identity (IMEI) Number | Direct |
MAC Address | Direct |
Mobile Equipment Identifier (MEID) | Direct |
Age | Quasi |
Date of Birth | Quasi |
Date of Death | Quasi |
Other Dates and Times | Quasi |
Gender | Quasi |
Sex | Quasi |
Coordinates (Full or Partial) | Quasi |
City | Quasi |
US State | Quasi |
US ZIP Code | Quasi |
Nationality | Quasi |
To handle edge cases, PII treatment for your Dataset is configurable. For more information, see Configuring a Dataset.
Classifying Numerical Variables
Fields with numerical types (such as integers or floats) do not always hold quantities; they can also encode categories, as when a status code is stored as an integer. When anonymizing data, it is essential to know whether a numerical field is semantically discrete, continuous, or categorical.
We first attempt to detect numerical PII (such as phone, account, and Social Security numbers). We then profile each numerical field using a range of statistical methods. That profile is fed into a machine-learning model that estimates the probability that the field is categorical. This classification is an important input into the anonymizer's treatment plan.
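The profiling step can be illustrated with a small Python sketch. The statistics chosen here, the `NumericProfile` structure, and the `categorical_score` function are assumptions made for illustration; in particular, the hand-written score stands in for the trained model described above and is not a calibrated probability.

```python
import math
from dataclasses import dataclass


@dataclass
class NumericProfile:
    n: int                # number of finite values profiled
    n_unique: int         # number of distinct values
    unique_ratio: float   # n_unique / n
    integer_share: float  # share of values with no fractional part


def profile_numeric(values) -> NumericProfile:
    """Compute a few summary statistics for one numerical field."""
    clean = [float(v) for v in values if v is not None]
    clean = [v for v in clean if math.isfinite(v)]
    if not clean:
        raise ValueError("no finite numeric values to profile")
    n = len(clean)
    n_unique = len(set(clean))
    integer_share = sum(v.is_integer() for v in clean) / n
    return NumericProfile(n, n_unique, n_unique / n, integer_share)


def categorical_score(profile: NumericProfile) -> float:
    """Hypothetical stand-in for a trained classifier; returns a score in [0, 1].

    Fields with any fractional values are treated as non-categorical.
    For all-integer fields, fewer distinct values per row means a higher score.
    """
    if profile.integer_share < 1.0:
        return 0.0
    return 1.0 - profile.unique_ratio
```

Under these assumptions, a field of repeated integer codes such as `[1, 2, 3] * 4` scores high, while a field of measurements with fractional parts, or a column of unique integer IDs, scores zero. A real model would weigh many more features than these.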