Security and privacy in Machine Learning projects on Google Cloud Platform
How to ensure the security and privacy of data on Google Cloud Platform, including anonymization processes
One of the most important topics when working on a Machine Learning project is how to ensure the security and privacy of data. First, we will analyze the security and privacy challenges separately - each has different approaches and scopes - and we will end by discussing how to handle both in a real scenario.
Firstly, we have the data security challenge. Google provides a Security Overview: Google stores information encrypted and ensures data is only accessed by people who have been granted permission. Our focus is mainly on how we secure access to the data and on controlling who can access it.
In a Machine Learning project we usually have a place where data is stored - ex: Google Cloud Storage - and, from the security point of view, we need to ensure both:
- The data is encrypted at rest
- The data is encrypted in transit
Google Cloud encrypts all stored data by default, using a symmetric algorithm (AES-256). The encryption keys can be provided and managed by Google or by the user.
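As a sketch of the user-managed option, assuming hypothetical project, key ring, key, and bucket names, a customer-managed key can be created in Cloud KMS and set as a bucket's default encryption key:

```sh
# Hypothetical names: my-project, my-keyring, my-key, my-raw-data
gcloud kms keyrings create my-keyring --location europe-west3
gcloud kms keys create my-key --location europe-west3 \
    --keyring my-keyring --purpose encryption

# Use the key as the bucket's default encryption key (CMEK)
gsutil kms encryption -k \
    projects/my-project/locations/europe-west3/keyRings/my-keyring/cryptoKeys/my-key \
    gs://my-raw-data
```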
Talking about data transport, all data should be transported using TLS - otherwise, the transport will not be secure and we will be exposing the data in transit from the origin to our Google Cloud Storage bucket. In GCP, all transport is encrypted.
On the other hand, we also have to focus on the data privacy challenge, which is as important as (or more than) the security challenge. Why is it so important? Because it may vary depending on the problem or the country, and it requires a full study of each case. This is not true of the security challenge, where the procedure is very similar regardless of the circumstances.
In a Machine Learning project, when we talk about privacy, we focus on:
- What data should be collected?
- What are the permissible uses?
- With whom might it be shared (users, apps, etc)?
- What granular access control model is appropriate?
What data should be collected?
The easiest option is to collect as much data as you can, but first you need to know what type of data you need; otherwise, you will have unnecessary data that will not be useful in any case. The first approach is always the most difficult one because you have to study the problem, store the data, and secure it across the whole privacy chain. It may seem a little tedious at first, but in the future it will decrease costs.
The key is to first study the data we can collect, then decide what data we are going to collect and how we are going to do it. In some cases, you first store all the data and then, after an initial analysis, decide which parts to keep. However, storing all the data before analyzing it is a very common but not very practical approach.
What are the permissible uses?
The permissible uses have to be clear enough at the beginning of the study. For example, if a company whose business is hotel bookings provides us a dataset with all of its bookings, with the aim of finding metrics to support the Marketing team in preparing campaigns, then that is the permissible use: we cannot use that data to analyze anything else. Most importantly, users usually grant permission for their data to be analyzed with a specific aim (e.g., business studies), so we cannot use the data for another purpose.
This use should be clear before working on a Machine Learning project, because using the data in an unauthorized way could lead to legal problems.
With whom might it be shared?
This point is important because the data belongs to the end users: they are the owners of their data, and only they should have access to it. Users are always allowed to revoke our systems' access to that data. Of course, if only they have access, how are we going to analyze the data? By anonymizing it. We therefore need at least one user - or group of users - with access to the raw data whose only task is transforming and anonymizing it so we can work with it. Data anonymization reduces the risk of unintended disclosure when sharing data between countries, industries, and even departments within the same company.
To detect sensitive data and avoid sharing it un-anonymized, we should use the Cloud Data Loss Prevention (DLP) API, which includes many detectors for common information types (credit card numbers, phone numbers, emails, etc.) and features to anonymize the information. The tool also lets us create identifiers for content that shouldn't be shared in plain text, so we can define our own detectors depending on our case.
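As a rough sketch, following the JSON shape the DLP REST API expects and using a hypothetical custom "BOOKING_ID" detector as an example, an inspection configuration mixing built-in and custom infoTypes could look like:

```python
# Sketch of a DLP inspect configuration as the JSON body the API expects.
# The custom "BOOKING_ID" detector and its regex are hypothetical examples.
def build_inspect_config():
    return {
        "infoTypes": [  # built-in detectors
            {"name": "CREDIT_CARD_NUMBER"},
            {"name": "PHONE_NUMBER"},
            {"name": "EMAIL_ADDRESS"},
        ],
        "customInfoTypes": [
            {
                "infoType": {"name": "BOOKING_ID"},  # our own detector
                "regex": {"pattern": r"BK-\d{8}"},
            }
        ],
        "minLikelihood": "POSSIBLE",  # tune this to reduce false positives
    }
```

Raising `minLikelihood` is one of the levers for cutting down false positives from the pre-defined detectors.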
Once the data is anonymized, it can be shared with the apps or users that are allowed to work with it. The best way to ensure this is by controlling access to the data. In Google Cloud Storage you can handle it:
- By using IAM permissions: This solution works fine, but it is more oriented to managing roles and granting them to people. If we need services to access some data, we handle it with service accounts.
- By using service accounts: This solution works when we need services to access the data - e.g., BigQuery importing information from Storage. It is perfect when combined with IAM permissions, because in IAM you define the roles and assign them to each service account. Also, if a service doesn't need to access the data anymore, you can easily revoke its grants.
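For illustration, assuming hypothetical project, service account, and bucket names, granting a service account read-only access to a bucket (and revoking it later) looks like:

```sh
# Hypothetical names: my-project, bq-importer, my-anonymized-data
gcloud iam service-accounts create bq-importer \
    --display-name "BigQuery import from Storage"

# Grant read-only access to the anonymized bucket only
gsutil iam ch \
    serviceAccount:bq-importer@my-project.iam.gserviceaccount.com:roles/storage.objectViewer \
    gs://my-anonymized-data

# Revoke the grant when the service no longer needs it
gsutil iam ch -d \
    serviceAccount:bq-importer@my-project.iam.gserviceaccount.com:roles/storage.objectViewer \
    gs://my-anonymized-data
```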
There are also cases where a service only needs access to the data for a short period of time. In this case, we strongly recommend using short-lived service account credentials, which are service account credentials with an expiration time - so your data stays protected even if you forget to revoke that service account's permissions.
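One way to obtain such credentials, assuming a hypothetical service account name, is to mint a short-lived access token through impersonation instead of downloading a long-lived key file:

```sh
# Token expires automatically (roughly one hour); account name is hypothetical
gcloud auth print-access-token \
    --impersonate-service-account=bq-importer@my-project.iam.gserviceaccount.com
```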
An example in a real scenario
We have talked about how to handle security and privacy; now we will apply it to a real project. First of all, this project had an extra requirement: it was for a German company, so we needed to handle data in a GDPR-compliant way (data shouldn't be used outside the European Union). One of the things we had to bear in mind was that they didn't want their data outside Germany, so we stored and secured all their information in the Google Cloud data center located in Germany (europe-west3).
First of all, we need to create a bucket to store the raw data. This bucket will collect the raw information which, after being de-identified, will be stored in a second bucket that the services will access. Both buckets will be Regional (we need to retrieve data frequently) and located in europe-west3.
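With hypothetical bucket names, creating both Regional buckets in europe-west3 with gsutil would look like:

```sh
# Hypothetical bucket names; both Regional, located in europe-west3
gsutil mb -l europe-west3 -c regional gs://my-raw-data
gsutil mb -l europe-west3 -c regional gs://my-anonymized-data
```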
The first bucket keeps the raw data. We then have to analyze the data using the Data Loss Prevention API, looking for sensitive information to anonymize; how often this runs depends on the frequency of raw data updates. Take into account that when you use the pre-defined infoType detectors to find sensitive data, you can get false positives (as we had in this project, where the detected driver's license IDs were actually from Canada).
Once the real sensitive data is found, you can automate the de-identification using Cloud Functions - the company allowed the data center in Belgium for the transformations. If that were not possible, you could approach it with a Compute Engine instance in europe-west3, exposing the functionality as a Flask service. In the next image you can see the most important parts:
Triggering the Cloud Function from the bucket
The Cloud Function is triggered when a new element is created in the bucket.
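A minimal sketch of such a background function, with hypothetical bucket names and the DLP call itself left as a placeholder:

```python
# Sketch of a background Cloud Function triggered by object creation in the
# raw bucket. Bucket names and the de-identification step are placeholders.
RAW_BUCKET = "my-raw-data"          # hypothetical
ANON_BUCKET = "my-anonymized-data"  # hypothetical

def target_path(event):
    """Map the newly created object to its destination in the anonymized bucket."""
    return f"gs://{ANON_BUCKET}/{event['name']}"

def deidentify_object(event, context):
    """Entry point: runs once for every new object in the raw bucket."""
    source = f"gs://{event['bucket']}/{event['name']}"
    destination = target_path(event)
    # Here we would read the object, call the DLP de-identify API on its
    # content, and write the result to `destination`.
    print(f"De-identifying {source} -> {destination}")
    return destination
```

The `event` dict carries the `bucket` and `name` of the created object, which is all this sketch relies on.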
There are different ways to de-identify data. In this case, the data has been encrypted using a key, because the customer needs to be able to recover the original values after the analysis.
Which data will be de-identified
After the analysis classifies which data needs to be de-identified, the fields to encrypt are the ones set in that variable. If we need more, we only need to add more elements.
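As a sketch of what such a configuration might look like in the DLP REST API's JSON shape - the field list and KMS key path are hypothetical - a record transformation that encrypts the chosen fields with a KMS-wrapped key (so values can be re-identified later) could be built as:

```python
# Sketch of a DLP de-identify configuration that encrypts specific fields
# with a wrapped key, so the customer can re-identify them later.
# The field list and the KMS key path are hypothetical.
FIELDS_TO_ENCRYPT = ["email", "phone_number", "booking_id"]

def build_deidentify_config(wrapped_key_b64, kms_key_name):
    return {
        "recordTransformations": {
            "fieldTransformations": [
                {
                    "fields": [{"name": f} for f in FIELDS_TO_ENCRYPT],
                    "primitiveTransformation": {
                        "cryptoDeterministicConfig": {
                            "cryptoKey": {
                                "kmsWrapped": {
                                    "wrappedKey": wrapped_key_b64,
                                    "cryptoKeyName": kms_key_name,
                                }
                            }
                        }
                    },
                }
            ]
        }
    }
```

Adding more fields to de-identify then only requires extending `FIELDS_TO_ENCRYPT`.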
Once the data is anonymized and stored in the second bucket, we will be able to work safely. This transformation is essential: since we will have to analyze the complete data set - for example, to implement a custom Estimator in TensorFlow - we must ensure the protection of the original data. From this moment on, the original bucket must not be accessible, while the anonymized bucket is the one used by the services that need the data.