# Entity Recognition

Proper anonymization requires a semantic understanding of the data. It is insufficient to know the types or statistical qualities of a dataset -- we must infer the meaning of data in order to treat it appropriately. We call this process of semantic enrichment Entity Recognition.

There are two categories of data that are especially important to our algorithms: PII and numerical data.

# Detecting PII

Personally Identifiable Information ("PII") can take many forms. HIPAA enumerates 18 families of attributes that are considered personal identifiers. They include names, email addresses, and phone numbers, but also dates, geographies, account numbers, biometrics, and more.

Detecting PII is a non-trivial problem, even in structured data. Privacy Dynamics combines several approaches, including those based on academic research and open-source libraries, encompassing rules-based heuristics and machine learning models. After detecting PII, our anonymizer creates a treatment plan to redact direct identifiers and treat quasi-identifiers.
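To make the rules-based side of this concrete, here is a minimal sketch of a column-level heuristic. It is an illustration only, not Privacy Dynamics' implementation: the `PII_PATTERNS` table, the `detect_pii` function, and the 0.8 threshold are all hypothetical, and the patterns cover only a few US-centric formats.

```python
import re

# Hypothetical patterns for illustration; a real detector needs far broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "us_phone": re.compile(r"(\+1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}"),
    "us_ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
}


def detect_pii(values, threshold=0.8):
    """Return the best-matching PII label for a column, or None.

    A label is reported only when at least `threshold` of the non-empty
    values fully match its pattern (the threshold is an assumed default).
    """
    # Drop nulls and blank strings so they do not dilute the match share.
    cleaned = [str(v).strip() for v in values if v is not None and str(v).strip()]
    if not cleaned:
        return None

    best_label, best_share = None, 0.0
    for label, pattern in PII_PATTERNS.items():
        share = sum(1 for v in cleaned if pattern.fullmatch(v)) / len(cleaned)
        if share > best_share:
            best_label, best_share = label, share

    return best_label if best_share >= threshold else None
```

Scoring the whole column rather than individual cells tolerates a few malformed values, which is one reason a heuristic like this is usually paired with other signals such as column names or a trained model.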

To handle edge cases, PII treatment for your Dataset is configurable. For more information, see Configuring a Dataset.

# Classifying Numerical Variables

Fields that have numerical types (like integers or floats) can also be used to encode categories. When anonymizing data, it is essential to know whether a numerical field is semantically discrete, continuous, or categorical.

We first check each numerical field for PII (such as phone, account, and Social Security numbers). We then profile the field using a range of statistical methods and feed that profile into a machine-learning model that estimates the probability that the field is categorical. This categorical classification is an important input into the anonymizer's treatment plan.
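As a rough sketch of what profiling a numerical field might involve, the example below computes a few simple statistics and applies a fixed rule to them. The feature names, the `profile_numeric` and `looks_categorical` functions, and the cutoffs are assumptions made for illustration. In particular, the fixed rule is only a stand-in for the machine-learning model described above, which outputs a probability rather than a yes/no answer.

```python
def profile_numeric(values):
    """Compute a few simple statistics for a numerical column (hypothetical feature set)."""
    nums = [float(v) for v in values if v is not None]
    n = len(nums)
    if n == 0:
        raise ValueError("column has no non-null values")
    distinct = len(set(nums))
    return {
        "count": n,
        "distinct": distinct,
        "distinct_ratio": distinct / n,
        "all_integers": all(x.is_integer() for x in nums),
    }


def looks_categorical(profile, max_distinct=20, max_distinct_ratio=0.05):
    """Stand-in rule (an assumption, not a trained model).

    Treats a field as categorical when it is integer-valued and has few
    distinct values, both in absolute terms and relative to the row count.
    """
    return (
        profile["all_integers"]
        and profile["distinct"] <= max_distinct
        and profile["distinct_ratio"] <= max_distinct_ratio
    )
```

For example, a column of status codes that cycles through 1, 2, and 3 over 300 rows would pass this rule, while a column of 300 unique row IDs or of fractional measurements would not.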

Last Updated: 5/2/2022, 5:32:57 PM
