Brett Westover

How to Use Pre-Configured Development Environments with Okteto

Development teams use pre-configured environments to increase efficiency and minimize inconsistencies, but that alone doesn't solve the problem of getting representative, useful data into those environments. Privacy Dynamics, together with modern developer tools like Okteto, can safely bring production data into your development and test environments.

Safely using production data in development

Using anonymized data for development with Privacy Dynamics can be set up in a variety of ways, but most setups share a few common requirements:

  • We need a source of data (typically a production database or cloud data warehouse)
  • We need a destination that development teams can access
  • We need a way to periodically refresh the data, and effectively share or copy the data into dev environments

In this post we'll demonstrate how to do this with a PostgreSQL database using the Okteto open source movies example application. The Movies App is a microservices demonstration of a movies rental service. It has a service for reading a catalog of movies, services to respond to requests for rentals and move the data through a queue, and of course a frontend application that the users will interact with. It also has an "admin" interface that shows the list of users who are signed up. We'll focus on the user data which lives in the PostgreSQL instance, since it contains sensitive information that needs to be anonymized.

Create our data source and destination environments

Using Okteto's CLI we'll create a pair of environments with a few simple commands. From a local checkout of the movies repository:

  • Create a deployment of the movies app to use as a source for the data. When this comes up, the user data is loaded automatically. We'll pretend this is production data for this demo:

    • Create a namespace: okteto namespace create movies-source
    • Deploy the application: okteto deploy
  • Create a deployment of the movies app to use as a target for the anonymized data. In this case we'll skip loading the data on startup since we're going to populate the database with anonymized data from Privacy Dynamics. We can do that by passing an environment variable:

    • Create a namespace: okteto namespace create movies-target
    • Deploy the application: okteto deploy --var API_LOAD_DATA=false
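The steps above can be collected into a small script. This is only a sketch: the `run` and `create_env` helpers and the `DRY_RUN` switch are illustrative names, not part of the Okteto CLI, and it assumes (as the steps above imply) that `okteto deploy` targets the namespace that was just created. It must be run from the local checkout of the movies repository.

```shell
#!/bin/sh
# Sketch: create the source and target environments from the movies checkout.
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"

# run CMD [ARGS...]: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# create_env NAMESPACE [extra "okteto deploy" arguments...]
create_env() {
  ns="$1"
  shift
  run okteto namespace create "$ns"
  run okteto deploy "$@"
}

# "Production" source: the user data is loaded on startup.
create_env movies-source

# Target: skip the startup data load, the anonymized data is written here later.
create_env movies-target --var API_LOAD_DATA=false
```

Printing the commands first makes it easy to review what will be created before anything touches the cluster.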

Load Balancer setup

To use the cloud version of Privacy Dynamics we need a way to access the source and target databases from the internet. We'll use an AWS load balancer here, but you can do this with the NGINX Ingress Controller or any TCP-capable load balancer in your cluster.

Create a service for the database in the source and target namespaces and use the AWS Load Balancer Controller annotations to make it accessible via the internet.

postgres-public-target.yml:

apiVersion: v1
kind: Service
metadata:
  labels:
    app: postgresql
  name: postgres-public-target
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
    service.beta.kubernetes.io/aws-load-balancer-type: external
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
spec:
  ports:
    - port: 19876
      targetPort: 5432
  selector:
    app.kubernetes.io/component: primary
    app.kubernetes.io/instance: postgresql
    app.kubernetes.io/name: postgresql

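Apply this config to both namespaces. The same manifest is reused, so the Service is named postgres-public-target in movies-source as well: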

  • kubectl -n movies-source apply -f postgres-public-target.yml
  • kubectl -n movies-target apply -f postgres-public-target.yml
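Once the Services exist, the connection details for Privacy Dynamics can be read back from the cluster. The helper below is a hedged sketch: `lb_host` and `db_endpoint` are hypothetical function names, and it assumes the AWS load balancer reports a hostname (rather than an IP address) in the Service status once it has been provisioned.

```shell
#!/bin/sh
# lb_host NAMESPACE: print the public hostname of the database load balancer.
# Assumes the Service is named postgres-public-target, as in the manifest above.
lb_host() {
  kubectl -n "$1" get service postgres-public-target \
    -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
}

# db_endpoint NAMESPACE: print host:port, or fail if no hostname is assigned yet.
db_endpoint() {
  host="$(lb_host "$1")"
  if [ -z "$host" ]; then
    echo "load balancer for $1 is not ready yet" >&2
    return 1
  fi
  # 19876 is the public port exposed by the Service above.
  echo "$host:19876"
}
```

Running `db_endpoint movies-source` and `db_endpoint movies-target` should then give the host and port to enter when creating the two connections.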

Let's anonymize that data!

Now that we have our "production" environment running, we can see the user data. It clearly contains privacy-sensitive information (or it would in a real application; the data in this demo is fake).

Admin screen showing production user data with PII

To anonymize this data we'll create a connection to the source of the data in the Privacy Dynamics application. We'll also create a connection to the target environment, which will be the destination for our anonymized data.

Next we'll create a project. There are just a few choices we need to make:

  • Which source and destination connections to use - we'll use the ones we just created that point to our Okteto movies-source and movies-target databases.
  • Which tables to treat - There are only two in this demo: "rentals" and "users." We can go ahead and treat them both, but we're focused on "users" here.
  • By default Privacy Dynamics will treat all identifiers. We'll choose to use "Realistic" data for direct identifiers (like the name and phone numbers) and to "Anonymize" the indirect identifiers (like City, State, and Zip Code).

Modal window showing configuration options for treating data

We can also choose to set up a scheduled job. This is a great option for development environments since you'll likely want to periodically refresh with updated anonymized data from production. For our demo we'll skip that, and simply run the project.

This job only takes a few seconds to anonymize 10,000 users. We can see that it detected the direct identifiers automatically, replacing them with fake data. Privacy Dynamics treated the indirect identifiers using our surgical process to maximize utility while minimizing privacy risk. We provide a report that shows statistics on the distribution of values before and after treatment.

Modal window showing treatment report

Taking a peek at our users in the anonymized target environment, we can see that the names and phone numbers have been changed completely. The other fields have changed slightly in some cases; where there was no detectable risk to privacy, they have not changed at all. Privacy Dynamics carefully avoids large changes to the distributions of indirect identifier data, allowing for development environments that are far more representative of production.

Admin screen showing anonymized user data
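Beyond the admin screen, a quick row count on both databases is one way to confirm that every user made it across. This is a sketch under assumptions: `count_users` is a hypothetical helper, the table is assumed to be reachable as `users` on the default search path, and credentials and database name are expected in the standard `PGUSER`, `PGPASSWORD`, and `PGDATABASE` environment variables that `psql` reads.

```shell
#!/bin/sh
# count_users HOST: print the number of rows in the users table.
# -t prints tuples only and -A disables alignment, so the output is a bare number.
# Port 19876 is the public port of the load balancer Service defined earlier.
count_users() {
  psql -h "$1" -p 19876 -t -A -c 'SELECT count(*) FROM users;'
}
```

Calling `count_users` with the source hostname and again with the target hostname should print the same number if the job copied every row.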

Volume Snapshots for faster and repeatable builds

Now that we have this anonymized data in our movies-target environment, we can use the Volume Snapshots feature in our Self-Hosted Okteto cluster to create a snapshot. Then we can use the data repeatedly in development or preview environments.

After setting up the cluster to support Kubernetes VolumeSnapshots and enabling the feature, let's create a snapshot of the PostgreSQL data volume in our movies-target environment.

Apply a bit of YAML to create the snapshot, specifying a source namespace, and the name of the PersistentVolumeClaim for the database:

snapshot.yml:

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  namespace: movies-target
  name: dbdata-snapshot
spec:
  volumeSnapshotClassName: okteto-snapshot-class
  source:
    persistentVolumeClaimName: data-postgresql-0

Apply this with kubectl -n movies-target apply -f snapshot.yml

It might take a minute or two for the snapshot to be created. It's ready when the READYTOUSE column shows true:

$ kubectl -n movies-target get volumesnapshot
NAME              READYTOUSE   SOURCEPVC           SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS           SNAPSHOTCONTENT                                    CREATIONTIME   AGE
dbdata-snapshot   true         data-postgresql-0                           5Gi           okteto-snapshot-class   snapcontent-37d54cf5-a990-460c-81da-c4b5aa6a7e9e   60s            61s
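In a scripted pipeline, checking by hand can be replaced with a blocking wait. The function below is a sketch, not part of the original setup: `wait_for_snapshot` is a hypothetical name, the 180s default timeout is arbitrary, and it relies on `kubectl wait --for=jsonpath`, which needs kubectl 1.23 or later.

```shell
#!/bin/sh
# wait_for_snapshot NAMESPACE NAME [TIMEOUT]
# Blocks until the VolumeSnapshot reports status.readyToUse=true, or times out.
wait_for_snapshot() {
  ns="$1"
  name="$2"
  timeout="${3:-180s}"
  kubectl -n "$ns" wait "volumesnapshot/$name" \
    --for=jsonpath='{.status.readyToUse}'=true \
    --timeout="$timeout"
}
```

For the snapshot created above, the call would be `wait_for_snapshot movies-target dbdata-snapshot`.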

Now this snapshot is ready for use in Okteto development or preview environments.

Preview environments using the anonymized snapshot

Following the docs in the movies repository, we'll make a small change to its GitHub workflow so that the Okteto GitHub Action uses our new snapshot.

        name: pr-${{ github.event.number }}-cindylopez
        scope: global
-
+        file: "okteto-with-volumes.yaml"
+        variables: "API_LOAD_DATA=false,DB_SNAPSHOT_NAME=dbdata-snapshot,DB_SNAPSHOT_NAMESPACE=movies-target"

Once this change is in place we'll get preview environments on all PRs with the anonymized snapshot that we specified.

Privacy Dynamics + Okteto = Awesome

You can sign up for a free trial of Privacy Dynamics at https://www.privacydynamics.io/ and start anonymizing data in minutes. Okteto offers a powerful and fast development tool to supercharge your development process; check out a free trial at https://www.okteto.com/. Together, we are excited to offer an awesome option for using anonymized prod data in your development environment.