Entity Recognition

Proper anonymization requires a semantic understanding of the data. Knowing the types or statistical properties of a dataset is not enough; we must infer what the data means in order to treat it appropriately. We call this process of semantic enrichment Entity Recognition.

There are two categories of data that are especially important to our algorithms: PII and numerical data.

Detecting PII

Personally Identifiable Information ("PII") can take many forms. The HIPAA Safe Harbor de-identification standard, for example, enumerates 18 categories of attributes that are considered personal identifiers. They include names, email addresses, and phone numbers, but also dates, geographies, account numbers, biometrics, and more.

Detecting PII is a non-trivial problem, even in structured data. Privacy Dynamics combines several approaches, drawing on academic research and open-source libraries, that range from rule-based heuristics to machine-learning models. After detecting PII, our anonymizer creates a treatment plan to redact direct identifiers and treat quasi-identifiers.
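To give a feel for what a rule-based heuristic can look like, here is a minimal Python sketch. It is not Privacy Dynamics' implementation: the regular expressions, the `RULES` table, the `detect_column` function, and the 0.8 match threshold are all assumptions chosen for illustration. It labels a column as an email address or a credit card number when most of its non-empty values pass a pattern check (plus a Luhn checksum for card numbers).

```python
import re

# Illustrative patterns only; real-world detectors are considerably more thorough.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
CARD_RE = re.compile(r"^\d{13,19}$")


def luhn_valid(digits: str) -> bool:
    """Return True if a string of digits passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def looks_like_email(value: str) -> bool:
    return EMAIL_RE.match(value.strip()) is not None


def looks_like_card(value: str) -> bool:
    # Drop common separators before checking length and checksum.
    digits = re.sub(r"[ -]", "", value.strip())
    return CARD_RE.match(digits) is not None and luhn_valid(digits)


# Hypothetical rule table: identifier label -> per-value predicate.
RULES = {
    "Email Address": looks_like_email,
    "Credit Card Number": looks_like_card,
}


def detect_column(values, threshold=0.8):
    """Return the first label whose rule matches at least `threshold`
    of the non-empty values, or None if no rule matches often enough."""
    non_empty = [str(v) for v in values if v is not None and str(v).strip()]
    if not non_empty:
        return None
    for label, rule in RULES.items():
        hits = sum(1 for v in non_empty if rule(v))
        if hits / len(non_empty) >= threshold:
            return label
    return None
```

Scoring a whole column against a threshold, rather than requiring every value to match, tolerates a few malformed or missing entries. Heuristics like this are only one ingredient; free-form identifiers such as names generally need model-based detection.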

Privacy Dynamics can automatically detect the following types of direct and quasi-identifiers:

Identifier	Category
Person Name (Full, First, Middle, Last)	Direct
Email Address	Direct
Phone Number	Direct
Mailing Address	Direct
ABA Routing Number	Direct
US Bank Account Number	Direct
Credit Card Number	Direct
Crypto Address	Direct
US Passport Number	Direct
US Driver's License Number	Direct
US Individual Taxpayer Identification Number (ITIN)	Direct
US Social Security Number (SSN)	Direct
Other US Tax IDs (EIN, PTIN)	Direct
IP Address	Direct
Advertising ID	Direct
Equipment Serial Number (ESN) in Decimal or Hex	Direct
International Mobile Equipment Identity (IMEI) Number	Direct
MAC Address	Direct
Mobile Equipment Identifier (MEID)	Direct
Age	Quasi
Date of Birth	Quasi
Date of Death	Quasi
Other Dates and Times	Quasi
Gender	Quasi
Sex	Quasi
Coordinates (Full or Partial)	Quasi
City	Quasi
US State	Quasi
US ZIP Code	Quasi
Nationality	Quasi
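The Direct and Quasi categories above determine how a detected field is handled: direct identifiers are redacted and quasi-identifiers are treated. The following sketch shows one way such a mapping could be expressed. The `plan_treatment` function, the action names, and the abbreviated category sets are hypothetical, and the `"unclassified"` fallback for fields with no detected identifier is an assumption rather than documented product behavior.

```python
# A small subset of the identifier table, for illustration only.
DIRECT = {"Email Address", "Phone Number", "US Social Security Number (SSN)"}
QUASI = {"Age", "Date of Birth", "US ZIP Code"}


def plan_treatment(detected):
    """Map each column name to an action based on its detected identifier label.

    `detected` is a dict of column name -> identifier label (or None).
    """
    plan = {}
    for column, label in detected.items():
        if label in DIRECT:
            plan[column] = "redact"
        elif label in QUASI:
            plan[column] = "treat_as_quasi_identifier"
        else:
            # Assumption: fields with no detected identifier get a placeholder action.
            plan[column] = "unclassified"
    return plan


example = plan_treatment({
    "email": "Email Address",
    "birth_date": "Date of Birth",
    "order_total": None,
})
```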

To handle edge cases, PII treatment for your Dataset is configurable. For more information, see Configuring a Dataset.

Classifying Numerical Variables

Fields with numerical types (like integers or floats) do not always hold quantities; they can also encode categories. When anonymizing data, it is essential to know whether a numerical field is semantically discrete, continuous, or categorical.

We first attempt to detect PII in each numerical field (such as phone, account, or Social Security numbers). We then profile the field using a range of statistical methods and feed that profile into a machine-learning model that estimates the probability that the field is categorical. This classification is an important input into the anonymizer's treatment plan.
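The profile-then-classify flow can be sketched in Python as follows. The specific statistics in `profile_numeric` are assumptions, since the actual profile features are not documented here, and `categorical_score` is a toy logistic function with hand-picked weights standing in for a trained model. It shows only the shape of the pipeline, not the real classifier.

```python
import math


def profile_numeric(values):
    """Compute a few summary statistics for a numerical field.

    The choice of features is illustrative, not the product's actual profile.
    """
    xs = [float(v) for v in values if v is not None]
    n = len(xs)
    if n == 0:
        raise ValueError("no non-null values to profile")
    distinct = set(xs)
    return {
        "count": n,
        "distinct_count": len(distinct),
        "distinct_ratio": len(distinct) / n,
        "integer_fraction": sum(1 for x in xs if x.is_integer()) / n,
        "min": min(xs),
        "max": max(xs),
    }


def categorical_score(profile):
    """Toy stand-in for a trained model: map a profile to a score in (0, 1).

    The weights are hand-picked assumptions. Few distinct values and
    all-integer data push the score toward 1 (categorical).
    """
    z = 4.0 * profile["integer_fraction"] - 8.0 * profile["distinct_ratio"] - 1.0
    return 1.0 / (1.0 + math.exp(-z))


# HTTP-style status codes: numeric type, but only three distinct values.
status_codes = [200, 404, 200, 500, 200, 404] * 50
# Evenly spaced measurements: every value is distinct.
measurements = [i * 0.37 for i in range(300)]

status_score = categorical_score(profile_numeric(status_codes))
measurement_score = categorical_score(profile_numeric(measurements))
```

With these illustrative weights, the status codes score well above 0.5 and the measurements well below it. A trained model would learn such a decision boundary from labeled fields rather than from fixed weights.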