Ted Conbeer

 ·  5 min read

Exploring Snowflake's New Data Classification Tools

Data classification, also called entity recognition or PII detection, can now be automated with a new feature in Snowflake. In this post, we're going to explore that feature and discuss its design, performance, and limitations.

What is Data Classification?

Data classification, also called entity recognition or PII detection, is the process of labeling data with its semantic type after inferring the meaning of the data. For example, you may have a table named customers with a field named email: after data classification, that field could be labeled with a semantic category (email address) or a privacy category (direct identifier).

Data Classification is also the name of a new Snowflake feature, released in public preview for Enterprise Edition customers in early 2022. In this post, we're going to explore that feature and discuss its design, performance, and limitations.

Snowflake's Implementation

Data Classification is implemented through a single new function, EXTRACT_SEMANTIC_CATEGORIES, and a new stored procedure, ASSOCIATE_SEMANTIC_CATEGORY_TAGS. The function takes an object (a table, view, etc.) and analyzes up to 10,000 cells in each field before returning a single JSON object with the classification results and additional result metadata. The stored procedure parses the JSON object returned by the function and creates a tag with the semantic and privacy category on the column in the original object.

Snowflake's documentation also provides the code for JavaScript stored procedures to:

  1. Run EXTRACT_SEMANTIC_CATEGORIES against all objects in a schema (classify_schema)
  2. Run classify_schema against all schemas in a database (classify_database)

However, these stored procedures must be created by the user, and are not built in; they merely loop through each object (one at a time) and run the workflow described above.

Finally, the tags created can be queried to limit access or apply masking policies to specific fields. Unfortunately, this is a two-step process; it is not possible to directly create access or masking policies that are automatically applied when fields are tagged.

Using Snowflake's Data Classification: An Example

Let's try it out. We have a synthetic dataset in Snowflake that mimics healthcare data. Our synthetic dataset has 100k rows and 15 fields that run the gamut from direct identifiers to quasi-identifiers and sensitive attributes.

PATIENT_ID NAME GENDER AGE DOB DOD PHONE CITY STATE ZIP LANGUAGE RELIGION MARITAL_STATUS ETHNICITY HISPANIC
33 Audra Ferrara Female 40 1960 2016-07-17 (864)314-6727 Brooklyn New York 11226 English Unspecified Single White False
35 Mitchel Christensen Male 70 1930 null (803)586-3966 Marrero Louisiana 70072 English Catholic Divorced White False

To use Data Classification, you select EXTRACT_SEMANTIC_CATEGORIES like any other SQL function, passing the fully-qualified object name as the first argument:

select extract_semantic_categories('health_demo_db.public.patient_questionnaire_view');

After running for 19 seconds (on an XS warehouse), the function returns a JSON object:

{
    "AGE":{
        "extra_info":{
            "alternates":[],
            "probability":"1.00"
        },
        "privacy_category":"QUASI_IDENTIFIER",
        "semantic_category":"AGE"
    },
    ...
}

If you like structured data, you can flatten this JSON object into a table:

select
    t.key::varchar as column_name,
    t.value:"privacy_category"::varchar as privacy_category,
    t.value:"semantic_category"::varchar as semantic_category,
    t.value:"extra_info":"probability"::number(10,2) as probability,
    t.value:"extra_info":"alternates"::variant as alternates
from table(
        flatten(
            extract_semantic_categories(
                'health_demo_db.public.patient_questionnaire_view'
            )::variant
        )
    ) as t;
COLUMN_NAME PRIVACY_CATEGORY SEMANTIC_CATEGORY PROBABILITY ALTERNATES
AGE QUASI_IDENTIFIER AGE 1 []
CITY QUASI_IDENTIFIER US_CITY 1 []
DOB []
DOD []
ETHNICITY [{"privacy_category": "IDENTIFIER", "probability": "0.64", "semantic_category": "NAME"}]
GENDER QUASI_IDENTIFIER GENDER 1 []
HISPANIC []
LANGUAGE QUASI_IDENTIFIER ETHNICITY 0.99 []
MARITAL_STATUS QUASI_IDENTIFIER MARITAL_STATUS 1 []
NAME IDENTIFIER NAME 1 []
PATIENT_ID []
PHONE IDENTIFIER PHONE_NUMBER 0.83 []
RELIGION []
STATE QUASI_IDENTIFIER US_STATE_OR_TERRITORY 1 []
ZIP QUASI_IDENTIFIER US_POSTAL_CODE 1 []

So how did it do? Comparing the actual results against the full list of tags, we see an overall accuracy of 53% (8/15), although the average precision for values with classified categories is much higher, at 89% (8/9).

Field Labeled Category Ground Truth
AGE AGE AGE
CITY US_CITY US_CITY
DOB YEAR_OF_BIRTH
DOD DATE_OF_BIRTH*
ETHNICITY ETHNICITY
GENDER GENDER GENDER
HISPANIC ETHNICITY
LANGUAGE ETHNICITY
MARITAL_STATUS MARITAL_STATUS MARITAL_STATUS
NAME NAME NAME
PATIENT_ID
PHONE PHONE_NUMBER PHONE_NUMBER
RELIGION
STATE US_STATE_OR_TERRITORY US_STATE_OR_TERRITORY
ZIP US_POSTAL_CODE US_POSTAL_CODE
  • Note: DOD is a "date of death" not "date of birth" field, but from a semantic and privacy point of view, those categories should be treated identically

Snowflake recommends a workflow where classification results are reviewed and maintained manually, before applying tags. While cumbersome, we would adopt that workflow given the accuracy in our sample dataset. To do this, we would create a new table, health_demo_db.public_classification.patient_questionnaire_view, and insert the data from our flattened query above. Then we would write queries to update the values in the new table to the ground truth labels. Next, you can use the stored procedure to convert the values in our table to tags on columns in the source table:

call associate_semantic_category_tags(
    'health_demo_db.public.patient_questionnaire_view',
    (select * from health_demo_db.public_classification.patient_questionnaire_view)
);

However, we're still not finished. We're probably classifying data so that we can better govern it. However, there are no automated ways provided for setting access or masking policies based on semantic tags. To implement such policies, we have to first wait 120 minutes for the tag_references system view to update, then query that view for any relevant tags:

select *
from snowflake.account_usage.tag_references
where tag_name = 'PRIVACY_CATEGORY'

We can then manually take the columns returned by that query and write access or masking policies for each individual column:

alter table health_demo_db.public.patient_questionnaire_view
    modify column phone set masking policy identifier_mask;

Alternatively, we could write a stored procedure (or dbt macro) that wraps the logic above to apply policies to tagged columns.

Discussion

As teams gather more data and face more scrutiny from customers and regulators, we think there is a lot of utility in automating data classification (also called entity recognition or PII detection). However, Snowflake's current design likely does not live up to that promise for many customers. In designing an automated entity recognition solution, we think the following features are important:

  1. Automated scanning of entire schemas and databases. While Snowflake provides a stored procedure for looping through all objects, a more useful system would automatically detect and classify new objects, only run on new or changed objects, and provide diffs, alerts, or automated treatment of sensitive data. Humans are quite good at entity recognition, but human attention doesn't scale — an entity recognition service needs to solve the problem of scale, which a single manually-triggered function does not.
  2. Simpler policy management for semantic categories. Automated classification is of limited value unless automated actions can be taken based on that classification. A more effective system would automatically redact or mask detected PII, rather than requiring manual policy management.
  3. A better UI for managing categories. Snowflake's only interface for Data Classification is SQL. We love SQL! But we do not think it provides the best interface for reviewing and managing (or overriding) semantic categories.
  4. Improved performance. EXTRACT_SEMANTIC_CATEGORIES runs for tens of seconds per table it analyzes; in practice, this means it is too slow to run on a frequent schedule across an entire database. Its accuracy is also quite poor; from our dataset, it appears not to use contextual clues from field names, and relies exclusively on the data itself (e.g., not classifying ETHNICITY or HISPANIC fields with the ETHNICITY category; HISPANIC only contains boolean values)

Privacy Dynamics has fully-automated and easily-configured entity recognition built into the core of the product, so we can de-identify your data and make it safe to share for a variety of use cases. If you would like to learn more, you can read our docs or start your free trial.