# Entity Recognition

Proper anonymization requires a semantic understanding of the data. It is insufficient to know the types or statistical qualities of a dataset -- we must infer the meaning of data in order to treat it appropriately. We call this process of semantic enrichment Entity Recognition.

There are two categories of data that are especially important to our algorithms: PII and numerical data.

# Detecting PII

Personally Identifiable Information ("PII") can take many forms. HIPAA enumerates 18 families of attributes that are considered personal identifiers. They include names, email addresses, and phone numbers, but also dates, geographies, account numbers, biometrics, and more.

Detecting PII is a non-trivial problem, even in structured data. Privacy Dynamics combines several approaches, including those based on academic research and open-source libraries, encompassing rules-based heuristics and machine learning models. After detecting PII, our anonymizer creates a treatment plan to redact direct identifiers and treat quasi-identifiers.
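
As a rough illustration of what a rules-based heuristic can look like, the sketch below guesses an identifier type for a column from a sample of its values. It is not Privacy Dynamics' implementation: the patterns, the `guess_identifier` and `luhn_valid` names, and the 0.8 match threshold are all assumptions made for this example, and a real detector would need far broader coverage.

```python
import re

# Hypothetical, deliberately narrow patterns used only for illustration.
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")
SSN_RE = re.compile(r"^\d{3}-\d{2}-\d{4}$")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum.

    Non-digit characters (spaces, dashes) are ignored.
    """
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    # Double every second digit, counting from the right.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def guess_identifier(values, threshold=0.8):
    """Guess a column's identifier type from a sample of string values.

    Returns a label when at least `threshold` of the non-empty values
    match one rule, otherwise None. The threshold is an assumed value.
    """
    sample = [v.strip() for v in values if v and v.strip()]
    if not sample:
        return None
    rules = {
        "Email Address": lambda v: bool(EMAIL_RE.match(v)),
        "US Social Security Number (SSN)": lambda v: bool(SSN_RE.match(v)),
        "Credit Card Number": luhn_valid,
    }
    for label, rule in rules.items():
        hits = sum(1 for v in sample if rule(v))
        if hits / len(sample) >= threshold:
            return label
    return None


print(guess_identifier(["a@example.com", "b@example.org"]))
print(guess_identifier(["4111 1111 1111 1111"]))
```

Scoring a whole column rather than a single value lets a rule tolerate a few malformed entries, which is one reason a match threshold is used here instead of requiring every value to match.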

Privacy Dynamics can automatically detect the following types of direct and quasi-identifiers:

| Identifier | Category |
| --- | --- |
| Person Name (Full, First, Middle, Last) | Direct |
| Email Address | Direct |
| Phone Number | Direct |
| Mailing Address | Direct |
| ABA Routing Number | Direct |
| US Bank Account Number | Direct |
| Credit Card Number | Direct |
| Crypto Address | Direct |
| US Passport Number | Direct |
| US Driver's License Number | Direct |
| US Individual Taxpayer Identification Number (ITIN) | Direct |
| US Social Security Number (SSN) | Direct |
| Other US Tax IDs (EIN, PTIN) | Direct |
| IP Address | Direct |
| Advertising ID | Direct |
| Equipment Serial Number (ESN) in Decimal or Hex | Direct |
| International Mobile Equipment Identity (IMEI) Number | Direct |
| MAC Address | Direct |
| Mobile Equipment Identifier (MEID) | Direct |
| Age | Quasi |
| Date of Birth | Quasi |
| Date of Death | Quasi |
| Other Dates and Times | Quasi |
| Gender | Quasi |
| Sex | Quasi |
| Coordinates (Full or Partial) | Quasi |
| City | Quasi |
| US State | Quasi |
| US ZIP Code | Quasi |
| Nationality | Quasi |
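
To make the link between detection and treatment concrete, here is a minimal sketch of how detected identifiers might be turned into a per-column plan. The `CATEGORY` lookup covers only a handful of the identifiers listed above, and the `build_treatment_plan` function and the action names (`"redact"`, `"treat"`, `"keep"`) are hypothetical, not part of the product's API. In particular, keeping columns with no detected identifier unchanged is an assumption of this example.

```python
# Category lookup for a few of the identifiers listed above (illustrative subset).
CATEGORY = {
    "Email Address": "direct",
    "Phone Number": "direct",
    "US Social Security Number (SSN)": "direct",
    "Date of Birth": "quasi",
    "US ZIP Code": "quasi",
    "Gender": "quasi",
}


def build_treatment_plan(detected):
    """Map each column to an action based on its detected identifier.

    `detected` maps a column name to an identifier label, or to None
    when nothing was detected. Action names are hypothetical.
    """
    plan = {}
    for column, identifier in detected.items():
        category = CATEGORY.get(identifier)
        if category == "direct":
            plan[column] = "redact"  # direct identifiers are redacted
        elif category == "quasi":
            plan[column] = "treat"   # quasi-identifiers are treated
        else:
            plan[column] = "keep"    # assumption: no detection, no change
    return plan


plan = build_treatment_plan(
    {"email": "Email Address", "zip": "US ZIP Code", "notes": None}
)
print(plan)
```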

To handle edge cases, PII treatment for your Dataset is configurable. For more information, see Configuring a Dataset.

# Classifying Numerical Variables

Fields that have numerical types (like integers or floats) can also be used to encode categories. When anonymizing data, it is essential to know whether a numerical field is semantically discrete, continuous, or categorical.

We first attempt to detect PII in each numerical field (such as phone, account, or Social Security numbers). We then profile each numerical field using a range of statistical methods. That profile is fed into a machine-learning model that estimates the probability that the field is categorical. This categorical classification is an important input into the anonymizer's treatment plan.
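
The profile-then-classify flow can be sketched as follows. This is only an illustration: the statistics in `profile_numeric` are assumed examples of what a profile might contain, and `categorical_score` is a hand-written heuristic standing in for the machine-learning model, whose actual features and form are not described here.

```python
import math


def profile_numeric(values):
    """Summarize a numeric column with a few simple statistics.

    The chosen statistics are assumptions made for this example.
    """
    clean = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    n = len(clean)
    if n == 0:
        raise ValueError("column has no non-null values")
    distinct = len(set(clean))
    return {
        "count": n,
        "distinct": distinct,
        # Share of values that are unique; low for repeated codes.
        "distinct_ratio": distinct / n,
        # Share of values with no fractional part.
        "integer_fraction": sum(1 for v in clean if float(v).is_integer()) / n,
    }


def categorical_score(profile):
    """Heuristic stand-in for the model: a score in [0, 1].

    Whole-number columns with few distinct values score close to 1.
    """
    score = profile["integer_fraction"] * (1.0 - profile["distinct_ratio"])
    return max(0.0, min(1.0, score))


status_codes = [1, 2, 3] * 100                          # repeated integer codes
measurements = [i * 0.37 + 0.5 for i in range(300)]     # all values distinct

print(categorical_score(profile_numeric(status_codes)))
print(categorical_score(profile_numeric(measurements)))
```

A column of repeated integer codes scores near 1 under this heuristic, while a column of distinct measurements scores near 0, which mirrors the distinction between categorical and continuous fields drawn above.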

Last Updated: 11/7/2022, 7:43:43 PM