Exploring Snowflake's New Data Classification Tools

Ted Conbeer

05.12.22 · 5 min read

Exploring Snowflake's New Data Classification Tools

Data classification, also called entity recognition or PII detection, can now be automated with a new feature in Snowflake. In this post, we're going to explore that feature and discuss its design, performance, and limitations.

What is Data Classification?

Data classification, also called entity recognition or PII detection, is the process of labeling data with its semantic type after inferring the meaning of the data. For example, you may have a table named customers with a field named email: after data classification, that field could be labeled with a semantic category (email address) or a privacy category (direct identifier).

Data Classification is also the name of a new Snowflake feature, released in public preview for Enterprise Edition customers in early 2022. In this post, we're going to explore that feature and discuss its design, performance, and limitations.

Snowflake's Implementation

Data Classification is implemented through a single new function, EXTRACT_SEMANTIC_CATEGORIES, and a new stored procedure, ASSOCIATE_SEMANTIC_CATEGORY_TAGS. The function takes an object (a table, view, etc.) and analyzes up to 10,000 cells in each field before returning a single JSON object with the classification results and additional result metadata. The stored procedure parses the JSON object returned by the function and creates a tag with the semantic and privacy category on the column in the original object.

Snowflake's documentation also provides the code for JavaScript stored procedures to:

Run EXTRACT_SEMANTIC_CATEGORIES against all objects in a schema (classify_schema)
Run classify_schema against all schemas in a database (classify_database)

However, these stored procedures must be created by the user, and are not built in; they merely loop through each object (one at a time) and run the workflow described above.

Finally, the tags created can be queried to limit access or apply masking policies to specific fields. Unfortunately, this is a two-step process; it is not possible to directly create access or masking policies that are automatically applied when fields are tagged.

Using Snowflake's Data Classification: An Example

Let's try it out. We have a synthetic dataset in Snowflake that mimics healthcare data. Our synthetic dataset has 100k rows and 15 fields that run the gamut from direct identifiers to quasi-identifiers and sensitive attributes.

PATIENT_ID	NAME	GENDER	AGE	DOB	DOD	PHONE	CITY	STATE	ZIP	LANGUAGE	RELIGION	MARITAL_STATUS	ETHNICITY	HISPANIC
33	Audra Ferrara	Female	40	1960	2016-07-17	(864)314-6727	Brooklyn	New York	11226	English	Unspecified	Single	White	False
35	Mitchel Christensen	Male	70	1930	null	(803)586-3966	Marrero	Louisiana	70072	English	Catholic	Divorced	White	False

To use Data Classification, you select EXTRACT_SEMANTIC_CATEGORIES like any other SQL function, passing the fully-qualified object name as the first argument:

select extract_semantic_categories('health_demo_db.public.patient_questionnaire_view');

After running for 19 seconds (on an XS warehouse), the function returns a JSON object:

{
    "AGE":{
        "extra_info":{
            "alternates":[],
            "probability":"1.00"
        },
        "privacy_category":"QUASI_IDENTIFIER",
        "semantic_category":"AGE"
    },
    ...
}

If you like structured data, you can flatten this JSON object into a table:

select
    t.key::varchar as column_name,
    t.value:"privacy_category"::varchar as privacy_category,
    t.value:"semantic_category"::varchar as semantic_category,
    t.value:"extra_info":"probability"::number(10,2) as probability,
    t.value:"extra_info":"alternates"::variant as alternates
from table(
        flatten(
            extract_semantic_categories(
                'health_demo_db.public.patient_questionnaire_view'
            )::variant
        )
    ) as t;

COLUMN_NAME	PRIVACY_CATEGORY	SEMANTIC_CATEGORY	PROBABILITY	ALTERNATES
AGE	QUASI_IDENTIFIER	AGE	1	[]
CITY	QUASI_IDENTIFIER	US_CITY	1	[]
DOB				[]
DOD				[]
ETHNICITY				[{"privacy_category": "IDENTIFIER", "probability": "0.64", "semantic_category": "NAME"}]
GENDER	QUASI_IDENTIFIER	GENDER	1	[]
HISPANIC				[]
LANGUAGE	QUASI_IDENTIFIER	ETHNICITY	0.99	[]
MARITAL_STATUS	QUASI_IDENTIFIER	MARITAL_STATUS	1	[]
NAME	IDENTIFIER	NAME	1	[]
PATIENT_ID				[]
PHONE	IDENTIFIER	PHONE_NUMBER	0.83	[]
RELIGION				[]
STATE	QUASI_IDENTIFIER	US_STATE_OR_TERRITORY	1	[]
ZIP	QUASI_IDENTIFIER	US_POSTAL_CODE	1	[]

So how did it do? Comparing the actual results against the full list of tags, we see an overall accuracy of 53% (8/15), although the average precision for values with classified categories is much higher, at 89% (8/9).

Field	Labeled Category	Ground Truth
AGE	AGE	AGE
CITY	US_CITY	US_CITY
DOB		YEAR_OF_BIRTH
DOD		DATE_OF_BIRTH*
ETHNICITY		ETHNICITY
GENDER	GENDER	GENDER
HISPANIC		ETHNICITY
LANGUAGE	ETHNICITY
MARITAL_STATUS	MARITAL_STATUS	MARITAL_STATUS
NAME	NAME	NAME
PATIENT_ID
PHONE	PHONE_NUMBER	PHONE_NUMBER
RELIGION
STATE	US_STATE_OR_TERRITORY	US_STATE_OR_TERRITORY
ZIP	US_POSTAL_CODE	US_POSTAL_CODE

Note: DOD is a "date of death" not "date of birth" field, but from a semantic and privacy point of view, those categories should be treated identically

Snowflake recommends a workflow where classification results are reviewed and maintained manually, before applying tags. While cumbersome, we would adopt that workflow given the accuracy in our sample dataset. To do this, we would create a new table, health_demo_db.public_classification.patient_questionnaire_view, and insert the data from our flattened query above. Then we would write queries to update the values in the new table to the ground truth labels. Next, you can use the stored procedure to convert the values in our table to tags on columns in the source table:

call associate_semantic_category_tags(
    'health_demo_db.public.patient_questionnaire_view',
    (select * from health_demo_db.public_classification.patient_questionnaire_view)
);

However, we're still not finished. We're probably classifying data so that we can better govern it. However, there are no automated ways provided for setting access or masking policies based on semantic tags. To implement such policies, we have to first wait 120 minutes for the tag_references system view to update, then query that view for any relevant tags:

select *
from snowflake.account_usage.tag_references
where tag_name = 'PRIVACY_CATEGORY'

We can then manually take the columns returned by that query and write access or masking policies for each individual column:

alter table health_demo_db.public.patient_questionnaire_view
    modify column phone set masking policy identifier_mask;

Alternatively, we could write a stored procedure (or dbt macro) that wraps the logic above to apply policies to tagged columns.

Discussion

As teams gather more data and face more scrutiny from customers and regulators, we think there is a lot of utility in automating data classification (also called entity recognition or PII detection). However, Snowflake's current design likely does not live up to that promise for many customers. In designing an automated entity recognition solution, we think the following features are important:

Automated scanning of entire schemas and databases. While Snowflake provides a stored procedure for looping through all objects, a more useful system would automatically detect and classify new objects, only run on new or changed objects, and provide diffs, alerts, or automated treatment of sensitive data. Humans are quite good at entity recognition, but human attention doesn't scale — an entity recognition service needs to solve the problem of scale, which a single manually-triggered function does not.
Simpler policy management for semantic categories. Automated classification is of limited value unless automated actions can be taken based on that classification. A more effective system would automatically redact or mask detected PII, rather than requiring manual policy management.
A better UI for managing categories. Snowflake's only interface for Data Classification is SQL. We love SQL! But we do not think it provides the best interface for reviewing and managing (or overriding) semantic categories.
Improved performance. EXTRACT_SEMANTIC_CATEGORIES runs for tens of seconds per table it analyzes; in practice, this means it is too slow to run on a frequent schedule across an entire database. Its accuracy is also quite poor; from our dataset, it appears not to use contextual clues from field names, and relies exclusively on the data itself (e.g., not classifying ETHNICITY or HISPANIC fields with the ETHNICITY category; HISPANIC only contains boolean values)

Privacy Dynamics has fully-automated and easily-configured entity recognition built into the core of the product, so we can de-identify your data and make it safe to share for a variety of use cases. If you would like to learn more, you can read our docs or start your free trial.