Data Pseudonymization in Django

Pseudonymization for Privacy and Compliance

Pseudonymization refers to the obfuscation of personal data using a placeholder, or pseudonym. There are many reasons to pseudonymize personal data, including enhanced security, user privacy, and compliance with regulations like the European Union's General Data Protection Regulation (GDPR) and the Health Insurance Portability and Accountability Act (HIPAA) in the United States.

GDPR recommends pseudonymization as one way to limit the exposure of personal data in the event of a data breach. By storing pseudonyms rather than real data, you're limiting the identifiability of that data while still allowing for its use within the application/organization. Learn more about the basics and benefits of pseudonymization.

Pseudonymization in Django

Right out of the box, Django enables compliance with a number of GDPR's prescribed user rights; for example, the right to rectification can be addressed through Django's admin management tool, and the right to data portability is enabled by Django's serialization. The framework gives us a head start in implementing pseudonymization, too. Still, we'll need to set up a few things, including a custom user model, to create a strong pseudonymization pattern for our application.

Using the Django auth app, we already have a separate user model containing a user's personal data; this model is referenced within the system via primary key (or other identifier), which is an example of basic tokenization (a form of pseudonymization wherein a token stands in for the original data).

From there, we might opt for a profile-type model with a one-to-one link with our default auth user model, further isolating any personal data. It's also important to think about securing personal data in transit, as well as at rest; splitting the more sensitive personal data into a separate model like this enables us to pass around the profile data without concerning ourselves so much with the availability of data stored in the auth user model.

To follow along with both examples, check out the project on GitHub.

Dependency Prerequisites
This example uses Django 2.0, which supports only Python 3. To run the code examples, you'll need:

Python 3.6 or greater (the snippets use f-strings)
PostgreSQL
pipenv

A Note About Our Mask/Unmask Methods
The simple mask and unmask functions we use for masking and unmasking values are stand-ins in both examples, since the mechanism for this is largely unimportant to the rest of the implementation and, in actual use, will depend heavily on the specifics of an application and its data. Our placeholders are extremely easy to reverse engineer, however, so please do not use them in a production environment.
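For orientation, a placeholder pair along these lines might look like the following. This is a minimal sketch and not necessarily the functions used in the example project: the base64 encoding is our assumption, chosen only because it is trivially reversible, which is exactly why it must not be used in production.

```python
import base64


def mask(value):
    """Return a reversible placeholder for value (NOT secure; illustration only)."""
    if value is None:
        return None
    return base64.urlsafe_b64encode(str(value).encode("utf-8")).decode("ascii")


def unmask(value):
    """Reverse mask(); always returns text, so typed values need re-parsing."""
    if value is None:
        return None
    return base64.urlsafe_b64decode(value.encode("ascii")).decode("utf-8")
```

Note that unmask always hands back a string here, so a date or IP address would need to be converted back to its original type by the caller.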

For a more robust method for encrypting and decrypting personal data fields, refer to Will Hardy's DjangoCon EU 2018 talk. In his talk, Will outlines a per-user approach which may also help facilitate further compliance, such as fulfilling a user's right to erasure encompassing personal data in database backups.

Example #1 - Data Masking via Properties in Django

At its core, this first and most basic example of pseudonymization takes a per-field approach using setter methods to mask data on storage and getter methods to unmask data on retrieval.

Although it illustrates the concepts behind our implementation and offers a lot of control over how we handle fields at each step in processing, this approach has a number of shortcomings. As a result, it serves more as a stepping stone to example #2 than as a fully-realized solution in its own right.

This approach involves:

creating a custom User class,
altering the model fields used for storage of the masked values,
adding getter/setter methods for interacting with the unmasked values,
creating custom manager and queryset classes with custom _filter_or_exclude method for basic filtering, and
creating custom admin and admin form classes for admin management.

Some caveats with this approach:

Since there is very little abstraction, getters and setters must be added separately for each personal data attribute.
In addition to the model and fields themselves, we also have to create/customize our model's manager, queryset, admin, and admin form classes for the example to work.
We're only making changes to the _filter_or_exclude QuerySet method, which handles filter, get, and exclude, but does not cover other useful QuerySet methods like all.
This approach does not work if the masking method uses per-user keys. If each user has their own key, the first user named "Alex" might be masked to "Uamd" while the second is masked to "Zqcn", making it impossible to provide a single search term that finds all users named "Alex."
Consideration must be paid to any data that may need to be searched. For example, you may lose the ability to search for a range of dates rather than a single specific date, or find users based on partial name matches.
The masked value from the database is retrieved and retained when the User object is instantiated; even if we alter our queryset to defer the fields, they will be loaded the first time any of the properties is accessed.
We don't address serialization, as the particular needs of each controller/processor (data roles described under GDPR) should dictate in what form data is available or transmitted.
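The per-user key caveat above can be made concrete with a small, framework-free sketch. Everything here is hypothetical: mask_with_key, unmask_with_key, and the toy XOR keystream are our own stand-ins, not part of the example project and not a secure cipher. The point is only that the same name masks to the same pseudonym under a shared key, and to different pseudonyms under per-user keys.

```python
import hashlib


def _keystream(key, length):
    """Stretch key into `length` bytes by hashing key + counter (toy construction)."""
    stream = b""
    counter = 0
    while len(stream) < length:
        stream += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return stream[:length]


def mask_with_key(value, key):
    """XOR the UTF-8 bytes of value with a key-derived stream; return hex text."""
    data = value.encode("utf-8")
    return bytes(a ^ b for a, b in zip(data, _keystream(key, len(data)))).hex()


def unmask_with_key(masked, key):
    """Reverse mask_with_key() using the same key."""
    data = bytes.fromhex(masked)
    return bytes(a ^ b for a, b in zip(data, _keystream(key, len(data)))).decode("utf-8")


shared_key = b"application-wide key"

# With one shared key, every "Alex" gets the same pseudonym, so a masked
# search term can match all of them.
shared_key_matches = mask_with_key("Alex", shared_key) == mask_with_key("Alex", shared_key)

# With per-user keys, the pseudonyms differ, so no single masked search
# term can find both users.
per_user_differs = (
    mask_with_key("Alex", b"key for user 1") != mask_with_key("Alex", b"key for user 2")
)
```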

Implementation

Starting Point

This approach assumes a custom User model, whether via an additional linked model or by extending AbstractUser or AbstractBaseUser. Setting this up is beyond the scope of this post, but excellent references can be found here and here.

We will begin by extending our Django User model with fields for name, phone, date of birth, and IP address, which are all personal data that we wish to pseudonymize.

User Model

To start pseudonymizing our personal data fields, we will alter each of the model fields to indicate that they are private (by convention). Then each field will be accessed via:

a getter method, which will retrieve and unmask the value from the stored value via the private attribute, and
a setter method, which will mask the value and assign it to the private attribute for storage.

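from django.contrib.auth.models import AbstractUser
from django.db import models

from app.utils import mask, unmask


class User(AbstractUser):
    # Properties whose lookups the custom queryset redirects to the masked
    # "_<name>" model fields; extend this as more properties are added.
    MASKING_FIELDS = ('name',)

    # rename original "name" field to "_name" to allow property setter and
    # getter to mask/unmask the value for us by default
    _name = models.CharField(max_length=128, blank=True)

    @property
    def name(self):
        # unmask the stored pseudonym on retrieval
        return unmask(self._name)

    @name.setter
    def name(self, value):
        # mask the original value before it is stored
        self._name = mask(value)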

From this point, interacting with the name property, for example, allows us access to the unmasked data, whereas _name will store the masked data for us; this way, we can ensure that we are able to work with our original data values where needed in our application while actually storing and accessing their pseudonyms.
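To see the getter/setter behavior in isolation, here is a plain-Python sketch with no database involved. The Person class and the string-reversing mask/unmask functions are hypothetical stand-ins for the User model and the project's utilities.

```python
def mask(value):
    """Hypothetical placeholder: reverse the text (NOT secure)."""
    return value[::-1]


def unmask(value):
    """Reverse of the placeholder mask above."""
    return value[::-1]


class Person:
    """Plain-Python stand-in for the User model."""

    def __init__(self, name):
        self.name = name  # goes through the setter below

    @property
    def name(self):
        return unmask(self._name)

    @name.setter
    def name(self, value):
        self._name = mask(value)


person = Person("Alex")
stored = person._name   # the pseudonym that would be written to the database
visible = person.name   # the original value the application works with
```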

User Manager

Next, we will create custom QuerySet and Manager classes for our User class and alter the _filter_or_exclude QuerySet method to meet our needs.

For simplicity, we will list the fields to be intercepted in MASKING_FIELDS in the User class. Each time a filtering method is called, we will check provided kwargs against that list; if found, we will remove the (unmasked) property from kwargs, replace it with its corresponding (masked) model field name, and mask the value for the purposes of the query. We then call the parent _filter_or_exclude method to continue with the swapped out field and masked value in kwargs.

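from django.db import models

from app.utils import mask


class UserQuerySet(models.QuerySet):

    def _filter_or_exclude(self, negate, *args, **kwargs):
        # Swap each unmasked property lookup for its masked model field.
        for field in self.model.MASKING_FIELDS:
            if field in kwargs:
                value = kwargs.pop(field)
                # Leave None alone so that filter(name=None) still works.
                kwargs[f'_{field}'] = value if value is None else mask(value)

        # Signature as in Django 2.0; this is a private method and may change.
        return super()._filter_or_exclude(negate, *args, **kwargs)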

Now we can call get or filter methods with our added properties, passing unmasked values, and have it run against the stored model fields and masked values automatically.
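The keyword swap itself can be tried outside of Django. In this sketch, rewrite_lookups is a hypothetical helper that mirrors what the queryset override does to kwargs, and the string-reversing mask is again just a placeholder.

```python
MASKING_FIELDS = ("name", "phone")


def mask(value):
    """Hypothetical placeholder: reverse the text (NOT secure)."""
    return value[::-1]


def rewrite_lookups(kwargs, masking_fields=MASKING_FIELDS):
    """Return a copy of kwargs with property lookups redirected to masked fields."""
    rewritten = dict(kwargs)
    for field in masking_fields:
        if field in rewritten:
            rewritten[f"_{field}"] = mask(rewritten.pop(field))
    return rewritten


lookups = rewrite_lookups({"name": "Alex", "is_active": True})
partial = rewrite_lookups({"name__icontains": "Al"})
```

Only exact keyword matches are rewritten. A lookup such as name__icontains passes through untouched, and Django would then reject it because name is a property rather than a model field.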

It is again important to note here that pseudonymized fields may lose some ability to be queried. For example, if you are masking a date field, the stored, masked dates will be scrambled such that they are no longer in the same chronological order relative to each other. Applying the same masking transformation on the search terms, then, will not provide the same results as the original search terms would on the original data.

* It is important to keep such considerations in mind when planning your data architecture, deciding which fields to pseudonymize, and designing your application around your ultimate business goals.

* There are potential workarounds that maintain the queryability of the data (e.g., storing a subset of unmasked user data with an identifying token in a secured Elasticsearch instance, searching against the unmasked data, then retrieving the related user model(s) via the stored id/token), but the implementation and added security implications of these are beyond the scope of this post.
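The loss of ordering is easy to reproduce. The sketch below uses a hypothetical string-reversing mask on ISO-formatted dates and runs the same range query against the original and the masked values. The dates and the range are made up for illustration.

```python
def mask(value):
    """Hypothetical placeholder: reverse the text (NOT secure)."""
    return value[::-1]


def unmask(value):
    """Reverse of the placeholder mask above."""
    return value[::-1]


birth_dates = ["2018-01-15", "2018-01-31", "2018-02-01", "2018-03-10"]
start, end = "2018-01-20", "2018-02-15"

# Range query on the original values: ISO dates sort chronologically as text.
plain_hits = [d for d in birth_dates if start <= d <= end]

# The same query with masked bounds against the masked, stored values.
stored = [mask(d) for d in birth_dates]
masked_hits = [unmask(m) for m in stored if mask(start) <= m <= mask(end)]
```

Here plain_hits contains the two dates that fall inside the range, while masked_hits also picks up 2018-01-15, which lies outside it.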

User Admin

Django has specialized classes for the User Admin/Form, so we will start by importing and inheriting those.

Ultimately, we want to fully replace access to the database-backed, masked fields with new fields for the model properties we've created in their stead. We also want to ensure we do so in such a way that the new property fields look and behave just as the original model fields do, including field types and validations.

Our first step is to add our fields to the list_display in our UserAdmin class. This change is sufficient to get the values to appear unmasked in our User admin list.

Next, we will add our fields to the form itself. This can be accomplished by

defining the fields in our UserForm class,
assigning fields in our UserForm Meta, and
assigning fieldsets in our UserAdmin class.

When we save the form, we want to validate our property fields with the same validators we established for our original fields; we can accomplish this by copying each field's validators to the corresponding property field in the class constructor, after the default initialization takes place.

We will further extend the form's __init__ method to initialize the field for each property via its getter, so that the saved value is initially unmasked and loaded into the form. We will also tweak the date of birth field to use the appropriate date widget.

Finally, we will extend the form's clean method to call each setter with the submitted data, triggering the masking of each value and its assignment to the appropriate model field after each value has been cleaned and validated.

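from django.contrib.auth.forms import UserChangeForm as AuthUserChangeForm


class UserChangeForm(AuthUserChangeForm):
    # fields/Meta

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)

        model = self._meta.model
        for field in model.MASKING_FIELDS:
            # Load the unmasked value into the form via the property getter.
            self.fields[field].initial = getattr(self.instance, field)
            # Reuse the validators declared on the masked model field.
            self.fields[field].validators = model._meta.get_field(
                f'_{field}'
            ).validators

    def clean(self):
        cleaned_data = super().clean()

        # Call each property setter so the cleaned value is masked and
        # assigned to the underlying model field; skip fields that failed
        # validation and are therefore missing from cleaned_data.
        for field in self._meta.model.MASKING_FIELDS:
            if field in self.cleaned_data:
                setattr(self.instance, field, self.cleaned_data[field])

        return cleaned_data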
Now we have a working admin form with the appropriate initial field values and validations applied to our getter/setter properties.

Where to Go from Here

This example is sufficient to get us started with some basic pseudonymization for a few personal data fields, but there are a number of pitfalls in using it as-is.

In general, improvements could be made to the user querysets to apply the masking method to all database queries. The queryset could also be updated to strip personal data from returned records by default, with separate queries for retrieving and returning records with personal data intact. This way, any access to personal data would necessarily be intentional, and it would be easier to pinpoint for logging and auditing that processing action.

Additionally, you may need to consider applying a pseudonymization technique to data you are gathering in your logs, depending on whether you are collecting any personal data.

The next example builds on the first approach by abstracting much of the same functionality into custom Field classes.

Example #2 - Data Masking via Custom Fields in Django

This example improves significantly upon the previous one, using a custom Field class to automatically mask values on their way into the database and unmask them on their way out. With this approach, we no longer require the getters/setters in the model, the custom queryset and corresponding user manager, or the bulk of our changes to the user admin form. The implementation involves:

creating a custom User class,
adding a custom Field class to automatically mask/unmask field values, and
altering the model fields to use our custom field.

Implementation

Starting Point

We'll begin from the same starting point as our first example -- a custom user model inheriting from Django auth's AbstractUser class, with fields added for name, phone, date of birth, and IP address. We'll also maintain some of the admin changes from the previous example; these are not necessary for the pseudonymization implementation itself, but they do make it easier to interact with users in the example.

Custom Fields

Our custom Field class is the key piece of this approach, and we will need to override a few methods for it to do everything we want.

First, we will set up the class's __init__ and deconstruct methods to accept and assign a field_type (e.g., CharField) that will be used to define the internal type of the field. We'll also have __init__ accept a tuple of functions for masking and unmasking the data, so that we can use different functions for different fields if we need to. Note that the deconstruct method has to mirror any argument changes we make in __init__, since Django uses it to recreate the field in migrations.

We will also override the get_internal_type method, which specifies the internal type of the field (e.g., for creating the corresponding database column with the appropriate type); we'll have it pull the internal type from whatever field type the PseudonymizedField was initialized with.

The core of our implementation is in our get_prep_value and from_db_value methods. The get_prep_value method is called prior to interacting with the database, so we'll use that opportunity to mask values before they are saved, as well as to mask values for query purposes. The from_db_value method is called when a value is being pulled out of the database and into our Python object, so we'll unmask our values there.

By extending the Field class and the methods that handle moving values in and out of the database, we are able to ensure that the pseudonyms will be available to the database and the original values will be available to the application, but not vice versa.

User Model

All that's left for us to do is set up our User model to use our new PseudonymizedField class. We'll import our mask/unmask functions to pass into each field, then replace each field with our custom field class (e.g., name = models.CharField( … ) becomes name = PseudonymizedField(models.CharField, (mask, unmask), …)).

And that's it! We can interact with our pseudonymized fields with unmasked data at the application level, but it will all be automatically masked prior to interaction with the database. Whether that means saving a value or making queries against the table, the data will be masked automatically.

Where to Go from Here

This example accepts as an argument the field type that should ultimately be used. A more complete solution might involve extending each available Django model field and inheriting from our PseudonymizedField class. This way, we could create fields on our models for each type directly, rather than having to pass the desired class for field creation. Additionally, this would allow us to handle processing values for masking/unmasking differently for different field types, if desired.

Conclusion

This post outlined a few different ways to achieve pseudonymization for user privacy, security, and compliance with regulations like GDPR using the Django framework. We hope to provide helpful examples and a base for your organization to build upon. While there's no one path towards compliance, the techniques outlined here can bring your organization closer to that goal.

If you're looking for more hands-on help implementing the methods described here or building GDPR-compliant applications, reach out to one of the experts at Cuttlesoft.
