5 (Noticeable) Business Benefits of Secure Database Provisioning

“Quality is never an accident; it is always the result of high intention, sincere effort, intelligent direction and skillful execution; it represents the wise choice of many alternatives.”
William A. Foster

I know what you’re thinking.

Chris. Your title looks like it was written to be a corporate whitepaper that I see ads for when I’m browsing social media; it should have a CLICK HERE button, a boilerplate photo of a smiling person holding a pen and it should say something like ‘executives hate them, find out their secret here!’

But something has become immediately obvious to me in the last few months, I still speak to people daily who are forced to:

  • Work in shared development models
  • Work on empty (schema-only) / heavily subset databases
  • Work on old, out of date and/or irrelevant data
  • Make decisions without knowing enough about their data or what they hold

When speaking to them though it becomes immediately obvious that the reason there is no dedicated option available for developers is actually not related to the “traditional” problems that one would expect. You would naturally assume that the reason for not refreshing these environments is because of the large amounts of space or time taken to refresh often enough, or even that ‘we simply cannot due to sensitive PII and regulatory concerns‘.

No. In fact it comes down to, as all things do, time and money.

paid make it rain GIF by Thalia de Jong

In the most recent State of Database DevOps report (2020 that is), a whopping 70% of 2000+ respondents replied that they were using a shared development database and this comes with a whole heap of associated problems, like poor code quality, looser controls around sensitive data and defective deployments. Just these figures alone already point to the solution being to spin up copies for developers on demand and it’s not like we can’t do that. There is SO much technology in the world, across almost all database platforms, that will allow us to virtualize, containerize, sanitize… (effectively all of the ‘izes‘) our databases so that we can have full, safe, realistic copies as frequently as we like. So what is stopping us?

From experience, it’s justification*. People going to senior stakeholders and saying “we need this technology” and hearing a cacophony of classic business challenges back: “but is it broken?”, “do we really NEED it?“, “it costs HOW much!?!“, “how much time will it take to implement?” etc… It’s dev and test hygiene, not a sexy major modernization project like using Azure Arc, using Blockchain or creating Artificial Intelligence. Who cares that developers have to share a database? We’ve got bigger Tofish to fry!

*Sometimes, but much less frequently, it’s down to complexity of implementation – but we’ll leave that one aside for now!

As you will know from my last post on why now is the time to adopt better working practices, it’s important for us to highlight the gains that can be made from newer, updated practices, and why now is not the time to be closing our minds off to a better way of life. It’s not going to be easy to sum this up in 5 points, and there are many other benefits to solid database provisioning but these are in my opinion, the ones that will revolutionize the way you develop.

Very important side note for this blog post: there are lots of subjective key practices, processes and tools that can form part of the “database provisioning process” specifically and they will vary wildly by experience, opinion and company – so for the purposes of the below I will be describing the benefits of a process that involves 3 primary components / steps, given these are the three I tackle most often:

  • Data Identification and Classification / Cataloging
  • Data De-Identification i.e. Data Masking
  • Data Provisioning i.e. Real Time Database Cloning / Provisioning

1 – Increase developer happiness / contentedness

Developers are employed to do 1 thing: innovate. It’s even in the name! Developers are on the cutting edge and are focused on providing value to end users as quickly and efficiently as possible, with shortened release cycles, incremental stories and optimized workflows they can produce this innovation. But a big part of the story is the setup.

Even if you’re working to a more agile methodology it is hard to deliver and test changes which are, in development environments, fundamentally destructive and experimental if you are sharing a workspace with multiple colleagues. Writing on shared Word documents can be frustrating at the best of times, so how can developers be expected to produce high-quality, rigorously tested, game-changing code when at any minute another developer can take the environment down, cause it to run slowly, or overwrite those changes with their own? When you cannot produce changes in an isolated, sandbox environment where they can be individually assessed, re-worked and improved then you have no guarantee that it should be promoted.

All of these sound like arguments that are focused around the production of code, but in fact these issues can all have a huge impact on something that is widely under regarded and scrutinized: developer happiness.

Developers are the people who make stuff go, and without them feeling content and valued in their roles, we can’t expect our productivity and product quality to reflect that – so when developers witness the poor management of their code, something they have worked so hard on as it goes sliding down the priority list or gets rolled back or overwritten etc. they don’t feel motivated to continue doing the best that they can do.

With dedicated environments for dev and test, for different branches, pull requests etc. developers can finally work on innovative and exciting projects, and optimize the code that goes out the door to end users.

2 – Develop a common language about data & make better decisions

It’s very hard to speak about things when you use different language to describe the same thing. That much is obvious. In the United Kingdom alone we have many different words for bread rolls. So when someone comes into a sandwich shop in London and asks for a “Stotty”, can you guarantee that the person serving will know exactly what they mean, exactly when they say it?

The Office Reaction GIF

No. There will be a gap where some translation will be required: some “down-time“, if you will. Now imagine taking something as simple as a bread roll and applying it to an enterprise data estate… you’re going to have a very bad time.

As I talked about in my blog posts here (importance of database classification) and here (classifications role in DevOps) before you can really make a fully informed decision about your data, you must know 2 simple things:

  1. What data you hold
  2. Where your data is

I should hurriedly add that I don’t just mean sensitive data now – all data deserves to be classified because whether you’re a full stack developer adding a column to a table you’ve never used before, an auditor trying to carry out a Data Protection Impact Assessment (DPIA) and trying desperately to include the database, or you’re a BI developer setting up some new reports or processes, you’re going to need to know about the data. This is where people have questions, and this is where you shouldn’t have to reply on anecdotal knowledge or being pushed around from one person to another at the company who supposedly “might be able to help“.

Better insights into data leads to better practices, less waiting (waste reduction) and greater insight. When we then act on this insight we move faster and deliver greater value in our pipelines.

Have you picked up on the trend yet? How all of these are going to end? Well don’t spoil the ending for those who haven’t, they’ll have to wait fort he conclusion!

3 – Move faster and better enable the DevOps pipeline

It’s apt that I’m listening to an amazing EDM remix of the Green Hill Zone from Sonic when writing this section, but isn’t this just what we need as a business? We want to be able to move faster, or to put it in more ‘agile’ terms, we need to be able to pivot and adapt with only a moments notice. Until now, the database has been a monolithic and difficult to steer behemoth, and it shows in our processes.

Yank Tug Of War GIF by BEERLAND

A tangible example of what I mean when I say “move faster”, is branching. It’s fairly commonplace now for a developer to be able to clone a repository and checkout a specific branch, create new branches etc. without fear of switching between those branches and what it might entail. On a dev environment, especially when one is working database-first with your changes (it does make sense to know how the changes will impact the database first – that’s all I’m saying) it is, without a reasonable process in place, exceptionally difficult to easily switch between branches and keep work separate.

This often forces developers to stick to one environment when changes are all made in tandem and can play havoc when it comes to capturing those changes in the right place – a manual state-based comparison of a dev database with multiple branches of work on it to a target upstream could be disastrous.

This is why taking advantage of something like database virtualization, allowing you to spin up copies of databases in seconds, could be the answer. You can automate the provisioning of environments as githooks, during Pull Request automation or as release candidates and the experience will be exactly the same across the board – boom *code base*, fresh and ready to go. When developers can move fast, value comes through a whole lot faster.

4 – Minimize space constraints on new copies, on premise or in the cloud

Space is always a big player in these conversations, and for some it’s enough to boil it down to “well just how much space can we save??” and that’s enough to put a dollar value on the ROI, and people storm ahead with a solution (that’s not always right for them).

But space is a very real problem, much as we (as technology professionals) like to believe that in these modern times of cloud-native solutions, easily scaled serverless-compute VMs and Big Data Clusters, we know there are still a LOT of people out there firefighting legacy, necessary technology and wrestling with what they CAN get out of backups or their SAN tech.

Even using cloud providers costs money, data egress and ingress costs $, BLOB storage costs $, additional security measures cost $. So it’s really not ideal when our databases, for historical reasons or by virtue of the sheer AMOUNT of data we hold and process, are 5, 10, 50, 100TB+, because we’re going to be struggling with this Dev/Test issue still for years to come.

As before with point 3, database virtualization has come of age and has now we have a lot of different solutions from containerization through DBaaS that can aid us in minimizing the amount of space that we ACTUALLY require, meaning we have less money that we need to pour into maintaining large, unwieldy Dev/Test environments or paying a large bill for the privilege of doing so in the cloud (and when developers will be using their dev machines anyway it just makes sense to see what we can do to leverage this existing hardware).

Whilst this one doesn’t directly add specific value to the end of the pipeline, or speed up this delivery, it can help reduce overhead costs associated with the infrastructure needed when providing this value.

5 – Work on realistic data without worrying about data breaches

This is probably one of the most obvious reasons given that I tend to blog about data regulations and compliance ALL THE TIME but I feel like I need to keep saying this.

If you remove all of the data from development and test database copies, this will not help with development and developers will have nothing meaningful to go on, nor any testing that isn’t limited to pre-defined values.

If you leave all of the data in development and test database copies, all you’re doing is duplicating your attack surface area and creating a lot of potential risks for that data to be surfaced where it shouldn’t be – on the internet, in screenshots, emails and of course, hacked.

So there needs to be a happy medium where we can have both the useful data that gives us insight and intelligence of a full data set, the business logic, trends, demographics etc. that we need during testing or analytics – but it should also be sanitized so that data subjects contained therein cannot be re-identified. Static masking, applied to lower environments allows us to retain the data with none of the data.

The Next Generation Data GIF by Star Trek

Protective measures can be built into the DevOps process from the very beginning as you’ve already seen right here on my blog; so as long as it is a part of the process, and we have multiple controls (or guard rails) that allow us to operate safely and quickly without fearing that same speed will cause us to release any sensitive information, allowing us to focus on one thing, value.

Conclusion

As you’ve seen above, it all comes down to time and money but there are many ways to save and speed up within a DevOps process by means of a good, solid database provisioning process. Whilst none of these reasons comes with a fixed ROI (unless you have ALL of your pre-prod database storage costs to hand) they contribute to something far better than that:

The ease of delivering value.

In a world where we can be concerned about everything, and where it’s hard to keep up with emerging technologies – it makes sense to start pruning away blockers to the process, the problems that are stopping us from delivering value faster – THAT is the theme and point of this blog post; our end users. We’re already delivering excellent value to them, we trust our developers and teams, but what’s stopping them from moving faster with database changes? Adopting a good provisioning process will mean you start to notice all of the above become true of your database development lifecycle.

Deterministic data masking – the who, who and who? (and how?)

“Security is always excessive until it’s not enough.”
– Robbie Sinclair

You may not already know this about me, but I kinda like data masking.
Scratch that, I LOVE data masking.

Increasingly both around Redgate and in general I seem to be getting a bit of a reputation as “the data masking guy” but for good reason – to include yet another quote, from Joe Kaeser this time: “Data is the oil, some say the gold, of the 21st century…”, more and more I hear stories about people leaving their oil/gold out for everyone to see, opening up the widest attack surface area by doing things like copying backups down into non-Production environments or exposing test systems to the internet – the list goes on.

This means that people turn to all of the protective methods they have available to them: encryption (TDE, row and column level etc.), static and dynamic data masking, access control… and many combinations of them.

One of the big points I always have to cover when it comes to static data masking though, is something called “deterministic masking”, so let’s start with 2 definitions of my own to make sure we’re on the same page:

Static data masking is the process of de-identifying sensitive data-at-rest within the tables of your Database. It is typically used to provide realistic, Production-like data into non-Production environments like Dev and Test, and even sets that are given to 3rd parties. This relies on retaining non-sensitive business specific fields within rows and taking anything considered PII (Personally Identifiable Information) or PHI (Protected Health Information) and either scrambling or replacing it with similar but ultimately false data.

Deterministic data masking is the process of masking data with values in a repeatable way, such that it will give the same value when masked in any and all future runs on any value that matches and will create a new record for values which have not been previously masked. An example of this would be if you were to mask “Chris Unwin” to “Brad Pitt”, it should appear as “Brad Pitt” not only in our (for example) dbo.Contacts table but also all associated tables (regardless of PKFK relationships at the DB level) and every single run should provide the same output. This is useful for building up familiarity with the data and utilizing for future test runs.

Now. I should caveat this blog post (you’ll find I’m always caveating my posts) with the fact that deterministic masking does have it’s benefits but the very idea that one thing should always become another, in my eyes, is inherently less secure than something that always gets de-identified to a different value. As such, I will always recommend that where possible, masking runs should produce differing values. Deterministic masking is also compute heavy because instead of simply randomizing in values, it will have to check up front the value to be replaced and then replace it with the corresponding value, another potential downside if speed is a key driver in the process you’re trying to put in place.

Most masking tools either support deterministic masking directly (*cough* dbatools.io *cough*) or require a little bit of configuration to get started, but it is a workflow that should be catered for, for those who need it. So here is a quick getting started guide for deterministic masking, if you’re writing your own scripts or (as I will in this example) you’re using Data Masker for SQL Server from Redgate.

Step 0 – Figure out where you’re going to do the masking

There are lots of different ways to move data around and get things masked before exposing it to development and testing teams (or handing it to 3rd parties). In this example, I’m going to assume that the only thing I have available to me is a backup and restore process for moving data around, nothing fancy like SQL Clone to help.

So for this I will assume I have a staging instance somewhere that I can use as part of this process, lets call it WIN2019 and the process I’m going to follow is:

  • Restore a full .bak file to WIN2019
  • Carry out the data masking process on this instance
  • Backup the masked DB
  • Move the new .bak into lower environments (restore / make available to Devs)

Step 1 – Map it for multiple future runs

The first problem you have to contend with is needing to maintain a record of how we want to mask things. If we always want to mask the credit card number 3422-567157-24797 to be 3488-904546-46471 then we need to have a place this is stored. The question to ask ourselves here though is WHAT needs to be recorded. There is a huge difference between:

CreditCardBeforeCreditCardAfter
3422-567157-24797 3488-904546-46471

And

CustomerIDCreditCardAfter
1000000001 3488-904546-46471

The latter is obviously far preferable because it does not contain any sensitive PII – it is purely the masked value and a non-sensitive CustomerID which only really makes sense within the company or is a system identifier.

So we should get a mapping location set up on WIN2019, I don’t want to make my tables too wide so I’m going to keep this fairly small and atomic – we’ll create a Database on the server with a mapping table for the credit cards:

CreditCardMap table in SSMS - CustomerID as INT (PK) and CreditCard as nvarchar(60)

This is going to be the basis for our repeated masking. The reason for having this as a separate DB/Table though?

1) The mapping should persist – if it exists in the same DB then we will just overwrite it every time, rendering the mapping useless.
2) Devs/Testers don’t need the mapping – just the end result.

Step 2 – Set up the masking to cover the tables, regardless if there is a mapped value

One of the most important phrases in the GDPR is “Data Protection by Design and Default“, and it’s one of my favorites. In this context I am going to interpret this in a very specific way, and that is: “we must mask everything, before trying to map it back to a value that exists, just in case the link to the MaskingMapper DB were to fail for any reason.

I first restore a copy of the Database I’m going to mask (the DMDatabase) and then setup a Data Masker substitution rule to process the DM_Customer table, de-identifying the credit card numbers:

Data Masker substitution rule to mask Credit Card Numbers with invalid AMEX CC numbers, also (fake) customer credit card Nos. displayed in SSMS

Step 3 – Copy the distinct values across into the mapping table

This step is going to be as simple as writing a single tSQL statement to copy the values across – in Data Masker I will wrap this into a Command Rule and make it dependent on the previous substitution rule:

INSERT INTO MaskingMapper.dbo.CreditCardMap
(
    CustomerID
  , CreditCard
)
SELECT DISTINCT
       customer_id
     , customer_credit_card_number
FROM dbo.DM_CUSTOMER
WHERE (
          customer_credit_card_number <> ''
          AND customer_credit_card_number IS NOT NULL
		  AND customer_id NOT IN (SELECT customer_id FROM MaskingMapper.dbo.CreditCardMap)
      );

Step 4 – Sync everything back together

Finally – we need to bring any information back from the table if it had values written to it in previous runs. In tSQL we could write an UPDATE with an appropriate WHERE clause but I’m going to use an additional controller and Table-to-Table Sync rule in Data Masker to handle this:

Rules in Data Masker - Substitution to mask data, Command rule to update the mapping table and a Table to Table rule to sync back into the table

Result

If we now run this we will have achieved deterministic masking, because we have the following before and afters – first for the DM_Customer table:

DM_Customer Credit cards before
DM_Customer credit cards after masking

and for the CreditCardMap table:

Mapping table prior to masking
mapping table after masking

The mapping table now has 77 rows and if we repeat the masking step by step without changing anything we can see that the credit card numbers change in the first instance, but then synchronize back to the values that should persist, the images below represent just running the first two steps in isolation (i.e. masking everything regardless – left) and then the synchronization job restoring the predetermined values (right) and the mapping table still has 77 rows.

Now if in the next run one of the NULL/blank fields has a real credit card number, or we add any additional customer IDs (i.e. with a more recent backup with fresh data) they can be masked, accounted for and persisted between each run.

Conclusion

Deterministic masking is hard, but it is possible. You can use a number of methods to achieve it, such as the above, but the first question you need to ask yourself (after “do I feel lucky”) is:

“Do I NEED deterministically masked data, or is it a nice to have?”

9 times out of 10, I’m pretty sure the answer will be that it is not essential, and therefore you should focus on making sure the masking of the data is random, static and fast. Adding compute to this process will only slow it down and at the end of the day, we just need to make sure our customers data is protected.