Replication How-to: Anonymizing Production Data for Better Dev, Test & Build

John Craft

01.26.24 · 6 min read

Privacy Dynamics’ GitHub Action plugs de-identification directly into CI/CD processes quickly and easily. Here’s how it works.

A lot has changed in software development over the past two decades: Better tools, improved methods, more powerful frameworks, languages purpose-built for a variety of needs and outcomes. It’s a magical time to be a developer.

One thing that hasn’t changed, however, is the software engineer’s insatiable need for data. Real data. Current data. The AppDev space has long wrestled with how to balance this need for realistic information with the risk of sharing production databases with developers for the express purpose of poking, prodding, and testing their wares.

Working with real data offers real benefits to software makers, particularly those engaged in the practice of Continuous Integration/Continuous Delivery (CI/CD). Developing and testing apps with data that is consistently faithful to the content and structure of production databases lets developers gather immediate, real-world feedback on app behavior as they work to resolve bugs, improve performance, speed deployment, and boost overall quality.

All terrific reasons for sharing production databases with the dev team, right? But there’s good reason why granting access to real production data runs against software development best practices. A production database is one of the organization’s digital crown jewels. Unfettered access can corrupt or break data stores through accidental deletion or modification. Coding errors, misconfigurations, even incorrect deployment procedures can threaten the integrity and availability of these critical assets.

If that weren’t bad enough, sharing a production database also sharply heightens security and privacy risk. Most production data is riddled with proprietary and sensitive personal information that cannot be exposed without grave financial and legal consequences to the business.

Hence the need for data that replicates the production database continuously and at scale while obfuscating the sensitive information within. The Privacy Dynamics solution to this conundrum for development and test environments is to securely de-identify production data—algorithmically removing all sensitive PII—then make that de-identified data available to developers within the context of their development platform for use in building, testing, and shipping new apps and features.

That process has never been easier thanks to Privacy Dynamics’ recently announced anonymize-project GitHub Action, available now in the GitHub Marketplace. This GitHub Action allows users to leverage state-of-the-art algorithms to anonymize sensitive production data for use in dev, test, and preview CI/CD environments. Using it is simple and straightforward. To get started, you’ll need:

An active Privacy Dynamics account. (Don't have one? Start a 14-day free trial here).
Source data stored in a supported database, accessible to the Privacy Dynamics service.
A configured Project within your Privacy Dynamics account using your data source as an “Origin Connection.”
Machine-to-machine credentials for your account. (We’ll generate those for you).

Step One: Create the Source and Destination Data Connections

Privacy Dynamics serves fundamentally as an Extract, Transform, and Load (ETL) tool. It moves data from the source to the destination, removing all sensitive information in the process while maintaining data integrity and utility. To use it, you’ll first need to create connections to both a source database, from which the production data is drawn, and a destination database, which stores the sanitized and de-identified data for development use.

From within Privacy Dynamics go to the CONNECTIONS page, select ADD CONNECTION, choose POSTGRES and click on NEXT.

Enter the connection details for your source production database:

Name - a name for you to identify the connection.
Host - the endpoint, without the port or database name.
Port - the port we use to connect to your database. (Default is 5432).
Username - the username of the database account used for the connection (we recommend creating a dedicated service account and naming it svc_pvcy).
Password - the password for the service account user.
Database - the name of the database you would like to use.

Select TEST CONNECTION to verify the credentials, then select ADD CONNECTION. The connection is saved if there are no errors.

Repeat this process to create a connection to an intermediate database to store the anonymized copy of the production database. This will serve as the destination data connection.

Step Two: Set Up the Privacy Dynamics Project

The Project in Privacy Dynamics serves as a logical container for the source and destination database information. This is how Privacy Dynamics knows where to read data from and where to write it to. Projects also contain treatment settings, such as how to handle Direct Identifiers (DIDs) and whether or not to treat Quasi-identifiers (QIDs), which is critical if you’re dealing with healthcare data, for example.

Step Three: Configure the GitHub Action

You’ll now create an API token, then pass its client ID and secret to GitHub so that the Action can talk to the Privacy Dynamics API. Create new secrets in GitHub to store the client ID and secret.

Once you’ve created your API token, you’re ready for the main act: creating a GitHub workflow to anonymize data and get down to work.

To use the Privacy Dynamics GitHub Action in the repository, create a GitHub workflow file (for example: .github/workflows/anonymize-prod-db.yml), then populate that workflow with the following:

name: Anonymize Data for Testing
on:
  pull_request:
    branches: [ main ]
jobs:
  anonymize:
    runs-on: ubuntu-latest
    steps:
      - name: Anonymize Data
        uses: pvcy/anonymize-project@latest
        with:
          # All values below are placeholders; replace them with your own.
          project-id: 4e1213f4-my-project-uuid-0242ac120002
          # PostgreSQL engine that receives the anonymized data
          db-host: my_postgres_host.host.com
          db-port: 5432
          db-username: postgres_username
          db-password: postgres_password
          # Machine-to-machine credentials from your Privacy Dynamics account
          client-id: ClientIDFromYourPDAccount
          client-secret: ClientSecretFromYourPDAccount

You can customize the behavior of the action with the following configuration parameters:

project-id (required): The ID of the Privacy Dynamics Project (configured in Step Two) used to anonymize data for this instance of the GitHub Action.
db-host (required, default: localhost): Hostname of the PostgreSQL engine the anonymized data is being sent to (configured in Step One). It must be accessible to the GitHub Actions runner.
db-port (required, default: 5432): Port of the PostgreSQL engine the anonymized data is being sent to.
db-username (required): Username for the PostgreSQL engine the anonymized data is being sent to.
db-password (required): Password for the PostgreSQL engine the anonymized data is being sent to.
client-id (required): Machine-to-machine (M2M) client ID provided by your Privacy Dynamics account.
client-secret (required): Client secret provided by your Privacy Dynamics account.
api-url (optional): API URL for Privacy Dynamics. Defaults to the SaaS instance. This is only required if you are running the On-Premises version of Privacy Dynamics.
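
Because Step Three stores the client ID and secret as GitHub secrets, the credential inputs would normally be read from the secrets context rather than typed into the workflow file. The following is a minimal sketch of that variant, not an official template. The secret names (PVCY_CLIENT_ID, PVCY_CLIENT_SECRET, PVCY_DB_USERNAME, PVCY_DB_PASSWORD) are assumptions chosen for illustration, and the project ID and host are the same placeholders used in the workflow above.

```yaml
name: Anonymize Data for Testing
on:
  pull_request:
    branches: [ main ]
jobs:
  anonymize:
    runs-on: ubuntu-latest
    steps:
      - name: Anonymize Data
        uses: pvcy/anonymize-project@latest
        with:
          # Placeholder project ID and host; replace with your own values.
          project-id: 4e1213f4-my-project-uuid-0242ac120002
          db-host: my_postgres_host.host.com
          db-port: 5432
          # Hypothetical secret names; create them under the repository's
          # Actions secrets and rename to match your own conventions.
          db-username: ${{ secrets.PVCY_DB_USERNAME }}
          db-password: ${{ secrets.PVCY_DB_PASSWORD }}
          client-id: ${{ secrets.PVCY_CLIENT_ID }}
          client-secret: ${{ secrets.PVCY_CLIENT_SECRET }}
```

GitHub masks secret values in workflow logs, so this form keeps the database password and the M2M credentials out of both the repository history and the run output.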

The action is now ready to be incorporated directly into CI/CD workflows.

The Privacy Dynamics GitHub Action in Action

This GitHub Action streamlines anonymization, helping you protect sensitive production data such as PII in dev, test, and preview environments. It gives developers a powerful tool to overcome common software development challenges and allows teams new latitude for:

Using production data for automated testing. Production data can be anonymized on-demand and written to ephemeral environments for integration and functional tests.
Verifying database migrations. Migrations can be run first against anonymized copies in test environments to confirm they will work on production data.
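
As a rough sketch of both use cases, the workflow below starts a throwaway PostgreSQL service container on the runner, points the action at it, and then runs migrations and tests against the anonymized copy. It assumes the action can reach a service container on localhost, which the db-host default suggests but this article does not confirm. The project ID is a placeholder, the secret names are illustrative, and the two scripts in the last step are hypothetical stand-ins for your own migration and test commands.

```yaml
name: Test Against Anonymized Data
on:
  pull_request:
    branches: [ main ]
  workflow_dispatch:        # also allow on-demand runs from the Actions tab
jobs:
  integration-tests:
    runs-on: ubuntu-latest
    services:
      postgres:             # ephemeral database, discarded when the job ends
        image: postgres:16
        env:
          POSTGRES_USER: postgres
          POSTGRES_PASSWORD: postgres
        ports:
          - 5432:5432
        # Wait until PostgreSQL accepts connections before running steps.
        options: >-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    steps:
      - uses: actions/checkout@v4
      - name: Anonymize Data
        uses: pvcy/anonymize-project@latest
        with:
          project-id: 4e1213f4-my-project-uuid-0242ac120002   # placeholder
          db-host: localhost
          db-port: 5432
          db-username: postgres
          db-password: postgres
          client-id: ${{ secrets.PVCY_CLIENT_ID }}            # assumed secret name
          client-secret: ${{ secrets.PVCY_CLIENT_SECRET }}    # assumed secret name
      - name: Run migrations and tests
        run: |
          # Hypothetical scripts; substitute your own commands.
          ./scripts/migrate.sh
          ./scripts/run-integration-tests.sh
```

The container password is written in plain text here only because the database is created and destroyed within a single job. Any long-lived destination should take its credentials from secrets instead.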

Tell us what you think

If you’ve experienced the problem of getting quality data into development and testing environments, we want to hear from you. At Privacy Dynamics, we’re focused on making the data de-identification process as easy as possible for engineers. We know you want to be responsible with your customers’ personal information, and you need to satisfy the requirements of your security team. Does this GitHub Action make it easier to get the data you need while shifting your development and testing processes to the left? Let us know at support@privacydynamics.io.