5 (Noticeable) Business Benefits of Secure Database Provisioning

“Quality is never an accident; it is always the result of high intention, sincere effort, intelligent direction and skillful execution; it represents the wise choice of many alternatives.”
William A. Foster

I know what you’re thinking.

Chris. Your title looks like it was written to be a corporate whitepaper that I see ads for when I’m browsing social media; it should have a CLICK HERE button, a boilerplate photo of a smiling person holding a pen and it should say something like ‘executives hate them, find out their secret here!’

But something has become immediately obvious to me in the last few months, I still speak to people daily who are forced to:

  • Work in shared development models
  • Work on empty (schema-only) / heavily subset databases
  • Work on old, out of date and/or irrelevant data
  • Make decisions without knowing enough about their data or what they hold

When speaking to them though it becomes immediately obvious that the reason there is no dedicated option available for developers is actually not related to the “traditional” problems that one would expect. You would naturally assume that the reason for not refreshing these environments is because of the large amounts of space or time taken to refresh often enough, or even that ‘we simply cannot due to sensitive PII and regulatory concerns‘.

No. In fact it comes down to, as all things do, time and money.

paid make it rain GIF by Thalia de Jong

In the most recent State of Database DevOps report (2020 that is), a whopping 70% of 2000+ respondents replied that they were using a shared development database and this comes with a whole heap of associated problems, like poor code quality, looser controls around sensitive data and defective deployments. Just these figures alone already point to the solution being to spin up copies for developers on demand and it’s not like we can’t do that. There is SO much technology in the world, across almost all database platforms, that will allow us to virtualize, containerize, sanitize… (effectively all of the ‘izes‘) our databases so that we can have full, safe, realistic copies as frequently as we like. So what is stopping us?

From experience, it’s justification*. People going to senior stakeholders and saying “we need this technology” and hearing a cacophony of classic business challenges back: “but is it broken?”, “do we really NEED it?“, “it costs HOW much!?!“, “how much time will it take to implement?” etc… It’s dev and test hygiene, not a sexy major modernization project like using Azure Arc, using Blockchain or creating Artificial Intelligence. Who cares that developers have to share a database? We’ve got bigger Tofish to fry!

*Sometimes, but much less frequently, it’s down to complexity of implementation – but we’ll leave that one aside for now!

As you will know from my last post on why now is the time to adopt better working practices, it’s important for us to highlight the gains that can be made from newer, updated practices, and why now is not the time to be closing our minds off to a better way of life. It’s not going to be easy to sum this up in 5 points, and there are many other benefits to solid database provisioning but these are in my opinion, the ones that will revolutionize the way you develop.

Very important side note for this blog post: there are lots of subjective key practices, processes and tools that can form part of the “database provisioning process” specifically and they will vary wildly by experience, opinion and company – so for the purposes of the below I will be describing the benefits of a process that involves 3 primary components / steps, given these are the three I tackle most often:

  • Data Identification and Classification / Cataloging
  • Data De-Identification i.e. Data Masking
  • Data Provisioning i.e. Real Time Database Cloning / Provisioning

1 – Increase developer happiness / contentedness

Developers are employed to do 1 thing: innovate. It’s even in the name! Developers are on the cutting edge and are focused on providing value to end users as quickly and efficiently as possible, with shortened release cycles, incremental stories and optimized workflows they can produce this innovation. But a big part of the story is the setup.

Even if you’re working to a more agile methodology it is hard to deliver and test changes which are, in development environments, fundamentally destructive and experimental if you are sharing a workspace with multiple colleagues. Writing on shared Word documents can be frustrating at the best of times, so how can developers be expected to produce high-quality, rigorously tested, game-changing code when at any minute another developer can take the environment down, cause it to run slowly, or overwrite those changes with their own? When you cannot produce changes in an isolated, sandbox environment where they can be individually assessed, re-worked and improved then you have no guarantee that it should be promoted.

All of these sound like arguments that are focused around the production of code, but in fact these issues can all have a huge impact on something that is widely under regarded and scrutinized: developer happiness.

Developers are the people who make stuff go, and without them feeling content and valued in their roles, we can’t expect our productivity and product quality to reflect that – so when developers witness the poor management of their code, something they have worked so hard on as it goes sliding down the priority list or gets rolled back or overwritten etc. they don’t feel motivated to continue doing the best that they can do.

With dedicated environments for dev and test, for different branches, pull requests etc. developers can finally work on innovative and exciting projects, and optimize the code that goes out the door to end users.

2 – Develop a common language about data & make better decisions

It’s very hard to speak about things when you use different language to describe the same thing. That much is obvious. In the United Kingdom alone we have many different words for bread rolls. So when someone comes into a sandwich shop in London and asks for a “Stotty”, can you guarantee that the person serving will know exactly what they mean, exactly when they say it?

The Office Reaction GIF

No. There will be a gap where some translation will be required: some “down-time“, if you will. Now imagine taking something as simple as a bread roll and applying it to an enterprise data estate… you’re going to have a very bad time.

As I talked about in my blog posts here (importance of database classification) and here (classifications role in DevOps) before you can really make a fully informed decision about your data, you must know 2 simple things:

  1. What data you hold
  2. Where your data is

I should hurriedly add that I don’t just mean sensitive data now – all data deserves to be classified because whether you’re a full stack developer adding a column to a table you’ve never used before, an auditor trying to carry out a Data Protection Impact Assessment (DPIA) and trying desperately to include the database, or you’re a BI developer setting up some new reports or processes, you’re going to need to know about the data. This is where people have questions, and this is where you shouldn’t have to reply on anecdotal knowledge or being pushed around from one person to another at the company who supposedly “might be able to help“.

Better insights into data leads to better practices, less waiting (waste reduction) and greater insight. When we then act on this insight we move faster and deliver greater value in our pipelines.

Have you picked up on the trend yet? How all of these are going to end? Well don’t spoil the ending for those who haven’t, they’ll have to wait fort he conclusion!

3 – Move faster and better enable the DevOps pipeline

It’s apt that I’m listening to an amazing EDM remix of the Green Hill Zone from Sonic when writing this section, but isn’t this just what we need as a business? We want to be able to move faster, or to put it in more ‘agile’ terms, we need to be able to pivot and adapt with only a moments notice. Until now, the database has been a monolithic and difficult to steer behemoth, and it shows in our processes.

Yank Tug Of War GIF by BEERLAND

A tangible example of what I mean when I say “move faster”, is branching. It’s fairly commonplace now for a developer to be able to clone a repository and checkout a specific branch, create new branches etc. without fear of switching between those branches and what it might entail. On a dev environment, especially when one is working database-first with your changes (it does make sense to know how the changes will impact the database first – that’s all I’m saying) it is, without a reasonable process in place, exceptionally difficult to easily switch between branches and keep work separate.

This often forces developers to stick to one environment when changes are all made in tandem and can play havoc when it comes to capturing those changes in the right place – a manual state-based comparison of a dev database with multiple branches of work on it to a target upstream could be disastrous.

This is why taking advantage of something like database virtualization, allowing you to spin up copies of databases in seconds, could be the answer. You can automate the provisioning of environments as githooks, during Pull Request automation or as release candidates and the experience will be exactly the same across the board – boom *code base*, fresh and ready to go. When developers can move fast, value comes through a whole lot faster.

4 – Minimize space constraints on new copies, on premise or in the cloud

Space is always a big player in these conversations, and for some it’s enough to boil it down to “well just how much space can we save??” and that’s enough to put a dollar value on the ROI, and people storm ahead with a solution (that’s not always right for them).

But space is a very real problem, much as we (as technology professionals) like to believe that in these modern times of cloud-native solutions, easily scaled serverless-compute VMs and Big Data Clusters, we know there are still a LOT of people out there firefighting legacy, necessary technology and wrestling with what they CAN get out of backups or their SAN tech.

Even using cloud providers costs money, data egress and ingress costs $, BLOB storage costs $, additional security measures cost $. So it’s really not ideal when our databases, for historical reasons or by virtue of the sheer AMOUNT of data we hold and process, are 5, 10, 50, 100TB+, because we’re going to be struggling with this Dev/Test issue still for years to come.

As before with point 3, database virtualization has come of age and has now we have a lot of different solutions from containerization through DBaaS that can aid us in minimizing the amount of space that we ACTUALLY require, meaning we have less money that we need to pour into maintaining large, unwieldy Dev/Test environments or paying a large bill for the privilege of doing so in the cloud (and when developers will be using their dev machines anyway it just makes sense to see what we can do to leverage this existing hardware).

Whilst this one doesn’t directly add specific value to the end of the pipeline, or speed up this delivery, it can help reduce overhead costs associated with the infrastructure needed when providing this value.

5 – Work on realistic data without worrying about data breaches

This is probably one of the most obvious reasons given that I tend to blog about data regulations and compliance ALL THE TIME but I feel like I need to keep saying this.

If you remove all of the data from development and test database copies, this will not help with development and developers will have nothing meaningful to go on, nor any testing that isn’t limited to pre-defined values.

If you leave all of the data in development and test database copies, all you’re doing is duplicating your attack surface area and creating a lot of potential risks for that data to be surfaced where it shouldn’t be – on the internet, in screenshots, emails and of course, hacked.

So there needs to be a happy medium where we can have both the useful data that gives us insight and intelligence of a full data set, the business logic, trends, demographics etc. that we need during testing or analytics – but it should also be sanitized so that data subjects contained therein cannot be re-identified. Static masking, applied to lower environments allows us to retain the data with none of the data.

The Next Generation Data GIF by Star Trek

Protective measures can be built into the DevOps process from the very beginning as you’ve already seen right here on my blog; so as long as it is a part of the process, and we have multiple controls (or guard rails) that allow us to operate safely and quickly without fearing that same speed will cause us to release any sensitive information, allowing us to focus on one thing, value.

Conclusion

As you’ve seen above, it all comes down to time and money but there are many ways to save and speed up within a DevOps process by means of a good, solid database provisioning process. Whilst none of these reasons comes with a fixed ROI (unless you have ALL of your pre-prod database storage costs to hand) they contribute to something far better than that:

The ease of delivering value.

In a world where we can be concerned about everything, and where it’s hard to keep up with emerging technologies – it makes sense to start pruning away blockers to the process, the problems that are stopping us from delivering value faster – THAT is the theme and point of this blog post; our end users. We’re already delivering excellent value to them, we trust our developers and teams, but what’s stopping them from moving faster with database changes? Adopting a good provisioning process will mean you start to notice all of the above become true of your database development lifecycle.

Classification by design: Including data cataloging in a DevOps process

“The most powerful tool we have as developers is automation.”
Scott Hanselman

It is no secret that I love to talk about data protection, specifically from the perspective of structured data. When we talk about database development practices, we often find ourselves talking about 3 things most often:

  • Source Control
  • Continuous Integration and Continuous Delivery/Deployment (CI/CD)
  • Estate Monitoring

Some people refer to this as “DataOps“, others refer to it as “DevDataOps” but in reality, it’s all DevOps guys. This may be an unpopular opinion (and if it clashes with yours please forgive me, it’s just my opinion) but just because a certain niche area hasn’t been specifically called out within a subset of DevOps doesn’t mean you have to invent your own term for it!

Now this leads me on to DevSecOps, or as I like to call it… More secure DevOps.

rimshot GIF

No but seriously this is a slightly different case – DevSecOps is like DevOps but fortified with security from the ground up. There’s a fantastic article and diagram of this on Plutora from Mark Robinson of how this looks (below) and if you haven’t read his article I would definitely go and give it a read!

DevSecOps Diagram

Good DevOps practice is a combination of different things working together, bringing the right mentality, the principles, processes and amazing tools at our disposal like automation but this all includes security from the ground up too. DevOps is about putting those principles and practices in place to strengthen the pipeline, so why don’t we treat security in the same way?

Take, for example, 3 pieces of legislation that have been very much in the spotlight:

The controller shall implement appropriate technical and organisational measures for ensuring that, by default, only personal data which are necessary for each specific purpose of the processing are processed.
GDPR (Europe) Art. 25 “Data protection by design and by default”

Processing agents shall adopt security, technical and administrative
measures
able to protect personal data from unauthorized accesses and accidental or unlawful situations of destruction, loss, alteration, communication or any type of improper or unlawful processing.

– LGPD (Brazil) Chapter VII, Art. 46 “Security and Secrecy of Data”

“A Controller or Processor is required to implement appropriate technical and organisational measures to demonstrate that Processing is performed in accordance with this Law…”
DIFC LAW NO. 5 OF 2020 (Dubai) Part 2D, Art. 14 (2) “Accountability and notification”

There’s a common running theme here and although lots of global legislation will either allude to, or directly tell you ways you can be compliant and what some of these “organizational” and “technical” measures are, it’s still pretty blurry.

How do we know what we can do? How do we know what “default” and “design” mean in this context? Well, we build it into the DevOps process.

Now I could sit here forever and talk about why transforming your database development, deployment and provisioning processes allows us to be more secure, but that’s a lot of material and it might have to come in chunks! So what we’re going to focus on today is as the title suggests: Data Classification and Cataloging.

Why is Cataloging important?

Cataloging structured data is incredibly important because it can be one of the first steps we take to securing sensitive Personally Identifiable Information (PII) or Protected Health Information (PHI) wherever it exists across our database environments. It allows us to make strengthened, contextual decisions about the data we hold including how we treat it in pre-Production, how long we retain it for and which systems and processes consume it.

But the most important part of this is simply: it tells us where the risk is.

Read through any of the most recent data protection laws and you will notice that a few things come up quite a lot including “Data Protection Impact Assessment“, or DPIA. Effectively if you can assess the risk of processing activities you can more readily answer the data protection questions and challenges you may face.

Knowing where your data resides can be the first step to helping you assess this risk, and to more readily answer your own data questions. If you want to read more about Cataloging specifically and why it is useful, you can read more about it on my previous blog here.

Where does Cataloging fit into DevOps?

This one is simple to answer. Once you have fully classified your entire estate, you’re not done. No, if you’re a development house or indeed even a single developer – if you are making any schema changes to the tables holding that sensitive data, you’re never done.

The reason for this is that Cataloging is an evergreen activity – if you update tables by removing columns, adding columns, splitting tables, adding tables… anything! Well then you need to be ready to make sure that you are:

a) Prepared and equipped with knowledge of the tables you’re working on and if this is a high risk activity.

b) Updating classification information to reflect the new “truth”, i.e. if you’re adding a column that will collect people’s Twitter handles, then that column should be classified as sensitive, and this should be reflected the moment it is deployed to Production.

So it is important to have the correct people working on this, with the right knowledge, preparation and processes and using the correct tools ensuring that those updates are persisted properly and securely through your deployment pipelines.

Huh… people, processes and tools… That sounds familiar!

The Process: SQL Data Catalog, SQL Change Automation and Azure DevOps

For this little experiment of mine I used Redgate’s SQL Change Automation (Migrations First approach in SQL Server Management Studio) and SQL Data Catalog to both develop & deploy and classify/categorize respectively, and for simple version control and orchestration of this pipeline I opted for Azure DevOps (with SQL Change Automation CI/CD plugins):

NOTE: Heads up, all of the code I used for this can be found right here in my GitHub, feel free to have at it!

Step 1) Setup SQL Change Automation project with an Azure DevOps Git Repo and then create the YAML file to build it, and create a deployment process to Acceptance and the Production.

Ok – that’s a bit of a mouthful and a massive ask. There’s a lot of things there you have to be familiar with, but we don’t have time to go into right here. BUT fortunately if you checkout Redgate University right here, my blog post on using Change Automation with Azure DB and Redgate’s Product Learning section, you’ll be a Database DevOps ‘Whizz-Kid’ in no time!

Fast forward a little and I had my example databases, VCS and pipeline all up and running:

Step 2)The “Theory”: This is where things get interesting. So we have an example pipeline set up and we are able to completely deploy all the way through to “Production” so let’s talk theory.

In SQL Data Catalog I have covered both my Production and Acceptance Databases:

Now, in development we don’t make changes directly to Production, so why should Classification be any different? Now how you adapt the above code is up to you, feel free to split it, move it around, incorporate it into Pull Requests if you want to… But I’m going with a bit more of a simple situation.

Situation: Developer makes a change in Development, which gets committed, reviewed and merged o the main branch, resulting in a build and a deployment, in this case to Acceptance and then it is later deployed to Production.

Now, by Acceptance we should only have the “good work”, i.e. all of our testing is shifted left within DevOps so Acceptance is basically the last stop before Production. Therefore we should classify the work we have done on Acceptance, crucially, before it gets to Production and starts gathering sensitive data, and then copy this classification up on deployment.

Ideal: We should have no columns on Production that have not been classified.

Step 3) – In Practice: Fortunately it’s very easy to automate a lot of these steps with SQL Data Catalog utilizing it’s PowerShell cmdlets and REST API. The cmdlets are fully documented and very easy to use (docs here). This allows us to easily scan, classify and copy classifications up to other databases, but we’ll also need to do some checks and report if there are discrepancies, as part of the deployment pipeline that can be investigated.

  • Are there any columns on Acceptance that aren’t classified but have been deployed to Production? (failure to comply with process)
  • Are there any columns on Production that have not been classified? (classification drift)
  • Are there any unclassified columns on Acceptance that have not yet been deployed to Production (for pipeline hygiene purposes)

The other part of this ‘fun’ is reporting what has been changed in the same process. Now fortunately SQL Change Automation spits out a Changes.json file with its Release Artifacts and we can steal that away and find out how many tables have been created or changed in this release and report that back so we can correlate what has been done and what is missing:

So actually getting this up and running is just going to require 3 things:

  1. The PowerShell script from GitHub (or your own personalized variant) as a step in your production deployment
  2. Data Catalog available and pointed at Acceptance and Production (or your versions of these environments)
  3. Variables set in Azure DevOps to fill the gaps (e.g. Where is Data Catalog? Whats my PowerShell Auth token? What are my Acceptance and PROD DBs called? etc.)

3 is the last step there so you’ll need something like this to run the script:

  • DatabaseDeploymentJSON – where the JSON file will be with the latest changes in the Prod release
  • DataCatalogAuthToken – Your PowerShell Auth token from Settings in Data Catalog
  • DataCatalogUrl – The full URL to your Data Catalog installation, missing the “\” at the end (ending :15156)
  • ExportPath (Optional) – I specified the path for my Database Deployment Resources to save typing it out in the Redgate plugins
  • ProdDB / StageDB – As you would expect, the Production and Acceptance/Staging DBs you’re deploying to/from
  • ProdInstance / StageInstance – As above, except the instance the Database are located on

In the variables above the Instance and DB names are purely used within Data Catalog, so there’s no need to worry about anything happening to the actual databases themselves!

Once you’ve run through the deployment pipeline a couple of times and the changes.json file is being produced, you can go ahead and copy the script into an inline PowerShell script step in your release and you should find it will fire to life! I simulated an example by modifying my Contacts table and my Articles table, adding 1 column each and deploying both to Acceptance. I then classified just 1 of these in Acceptance in Data Catalog:

and then approved the deployment to Production and tada!

Ok you probably can’t make all that out, but it effectively says:

(Information) Table dbo.Articles was modified in this deployment.
(Information) Table dbo.Contacts was modified in this deployment.

That much we knew!

1 column(s) with classifications were discovered on VoiceOfTheDBA Acceptance that are not classified in VoiceOfTheDBA Production:
dbo.Articles.TestingPineapple

Excellent, we classified that one so it gets copied up and we can verify that in data catalog against Production:

and finally, we get a warning about Production now containing unclassified columns:

(Alert) The following columns have been discovered on VoiceOfTheDBA Production that require classification:

dbo.Contacts.TestingPineapple

You should classify these columns in VoiceOfTheDBA Acceptance prior to the next deployment.

Just as we expected. Success!

Happy Tom And Jerry GIF

Conclusion

Classification and categorization belongs as part of DevOps, if you expect the context for your business decisions around data to remain evergreen and informed then it cannot sit on the shoulders of one or two people to support it, and it cannot live in a manually updated Excel sheet or document.

By including it within the DevOps process, not only do you add an additional layer of security but you also make it an automated, team activity that can be audited, checked and easily kept up to date.

Is this DevSecOps? Well… not really no. Is this a more secure approach to Database DevOps? Absolutely! Happy DevOpsing!

Shall we begin? (With data classification)

“If I had an hour to solve a problem I’d spend 55 minutes thinking about the problem and 5 minutes thinking about solutions.”
Albert Einstein

So I just spent about 20 minutes trying to come up with a suitable title for this blog post, and then it struck me – one of my favorite movies of all time (which I will soon be ensuring my wife and I watch again) is Star Trek into Darkness, featuring the magnificent Benedict Cumberbatch, in which he masterfully growls the phrase “shall we begin?”.

shall we begin benedict cumberbatch GIF

This sums up perfectly where people find themselves at the beginning of their classification activities. “Where do I start?” is usually the first question I get asked – regardless if you’re using an excel sheet, Azure Data Catalog or Redgate SQL Data Catalog to carry out this process, you will find yourself in the same place, asking the same question.

Classifying your SQL Server Tables and Columns is not straight forward, I’ll say that up front – you have to be prepared for whatever may come your way – but give yourself a fighting chance! Whether you’re looking to better understand your data, protect it, or you’re just hoping to prepare your business to be more able to deal with things such as Subject Access Requests (SARs), the Right to be Forgotten *cough* I’m looking at _you_ GDPR *cough* – or even just key development in systems containing sensitive information; this is the ultimate starting point. As per my blog post on data masking here, you can’t protect what you don’t know you have.

This is my effort to give you the best possible start with your classification process, whether this feeds into a wider data lineage process, data retention or, of course, data masking. So… shall we begin?

Get a taxonomy set up

This is perhaps the most crucial part of your success. Even if you have the best classification process in the world it really means nothing if you’ve basically described your data in one of say, 3 possible ways. The thing to bear in mind before getting started is that the data cataloging process is not specific to one job.

You may think at this point in time that you’re going to use it to highlight what you want to mask for Dev/Test environments, or maybe it’s your hit list for implementing TDE or column level encryption – but this _thing_ you’re building is going to be useful for everyone.

  • DBAs will be able to use this to help them prioritize systems they look after and being more proactive when it comes to security checks or updates, backups etc.
  • Developers will be able to use this to better understand the tables and environments they are working on, helping them contextualize their work and therefore engage and work with any other teams or individuals who may be affected or who may need to be involved.
  • Governance teams and auditors will be able to use this to better understand what information is held by the business, who is responsible for keeping it up to date and how it is classified and protected.

The list goes on.

So all of the above will need to be engaged in a first run to help actually describe the data you’re working with. What do you actually care about? What do you want to know about data at a first glance? Below is the standard taxonomy that comes out of the box with Redgate’s Data Catalog:

Some of my favorites are in here, which I would encourage you to include as well! If nothing else, having Classification Scope as a category is an absolute must – but I’ll come to this soon. You can see though, how being able to include tags such as who owns the data (and is therefore in charge of keeping it up to date), what regulation(s) it falls under and even what our treatment policy is in line with any of those regulations is, gives us so much more to go on. We can be sure we are appropriately building out our defensible position.

Having a robust Taxonomy will enable you to not only know more about your data but to easily communicate and collaborate with others on the data you hold and the structure of your tables.

Decide who is in charge

This seems like an odd one, but actually one of the most common questions I get is about who will be carrying out the classification process, and this is where the true nature of collaboration within a company is going to be absolutely critical.

Some people believe that a DBA or a couple of developers will suffice but as you’ll see later on, this is not a simple process that only 1 or 2 people can handle by themselves. Be prepared to spend hours on this and actually the implementation of classification means by nature you are going to need a team effort in the first instance.

You will need representation from people who know the database structure, people who know the function of the various tables and people who know the business and how data should be protected. You will require representation on this team and the collaboration between Dev, DBAs, Testers, Governance and DevOps, and you will need someone central to coordinate this effort. When you have key representation from these teams, it will make it easier to identify and collaborate on hot spots of data, so ensure you have this knowledge up front.

Get rid of what doesn’t matter

You may be surprised that the next step is technically an execution step, but it is an important point nonetheless and will absolutely help with the classification effort. This is where the Classification Scope category comes in, and this is why it’s my favorite.

One of the biggest problems that people face when actually executing on their classification is the sheer enormity of the task. There is no “average” measure we can rely on unfortunately but even small schemas can be not insubstantial – recently, some work I did with a customer meant they provided me with just ONE of their database schemas which had well in advance of 1800 columns across dozens of tables. When you scale that same amount to potentially hundreds of databases, it will become rapidly clear that going over every single column is going to be unmanageable.

To start then, the knowledge brought by the team mentioned above will be invaluable because we’re going to need to “de-scope” everything that is not relevant to this process. It is very rare to find a company with more than 50% of columns per database which contain PII/PHI and even if you are one of those companies, this process can help you too.

There could be many reasons that something shouldn’t be included in this process. Perhaps it is an empty table that exists as part of a 3rd party database schema, such as in an ERP or CRM solution. It could be a purely system specific table that holds static/reference data or gathers application specific information. Regardless what the table is, use the knowledge the team has to quickly identify these and then assign them all with the necessary “Out of Scope” tag.

This will not only help you reduce the number of columns you’re going to need to process significantly, but will give you greater focus on what does need to be processed. One of the greatest quotes I’ve heard about this process comes from @DataMacas (a full on genius, wonderful person and someone who over the years I have aspired to learn as much from as possible) who referred to it as “moving from a battleships style approach to one more akin to minesweeper“. Which is just so incredibly accurate.

In my example database below with only 150 odd columns, using the “Empty Tables” filter, and then filtering down to system tables I know about, I was able to de-scope just under half of the database, just as a starting point:

Figure out patterns and speed up

Many of the people carrying out this process will already have some anecdotal knowledge of the database as I’ve already mentioned, but now it’s time to turn this from what _isn’t_ important, to what is.

The fastest way to do this is to build up some examples of column naming conventions you already have in place across multiple databases – there will likely be columns with names that contain things like Name, Email or SSN in some format. Both SQL Data Catalog from Redgate and Microsoft’s Azure Data Catalog have suggestions out of the box that will look at your column names and make suggestions as to what might be sensitive for you to check and accept the classification tags.

Now these suggestions are incredibly helpful but they do both have a reduced scope because they’re just matching against common types of PII, so it’s important to customize them to better reflect your own environments. You can do this fairly easily, one or both of the following ways:

1 – Customize the suggestions

In Redgate’s SQL Data Catalog you can actually, prior to even looking at the suggestions and accepting them, customize the regular expressions that are being run over the column naming convention to actually check that they are more indicative of your own schemas – either by editing existing rules or by creating your own rules and choosing which tags should be associated with the columns as a result:

You can then go through and accept these suggestions if you so wish, obviously making sure to give them a sense check first:

2 – POWER ALL THE SHELL

In both of the aforementioned solutions you can call the PowerShell API to actually carry out mass classification against columns with known formats – this will allow you to rapidly hit any known targets to further reduce the amount of time spent looking directly at columns, an example of the SQL Data Catalog PowerShell in action is the below, which will classify any columns it finds where the name is like Email but not like ID (as Primary and Foreign keys may, in most cases, fall under the de-scoping work we did above) with a tag for sensitivity and information type (full worked example here):

Finally – get classifying

This is the last stage, or the “hunt” stage. It’s time for us to get going with classifying what’s left i.e. anything that wasn’t de-scoped and wasn’t caught by your default suggestions or PowerShell rules.

You can obviously start going through each column one by one, but it makes the most sense to start by filtering down by tables which have the highest concentration of columns (i.e. the widest tables) or the columns that are specifically known as containing sensitive information (anecdotally or by # of tags) and classifying those as in or out of scope and what information they hold, who owns it and what the treatment intent is at the very least.

The approach i take in this instance is to use filtering to it’s utmost – in SQL Data Catalog we can filter by table and column names but also by Data Type. Common offenders can be found with certain NVARCHAR, XML or VARBINARY types, such as NVARCHAR(MAX) – to me, that sounds like an XML, JSON, document or free-text field which will likely contain some kind of difficult to identify but ultimately sensitive information.

Following the NVARCHAR classification I move on and look at DATETIME and INT/DECIMAL fields for any key candidates like dates when someone attended an event or even a Date of Birth field. This helps especially when the naming conventions don’t necessarily reflect the information that is stored in the column.

Finally, one thing to add is that you will need access to the databases or tables at some point. You can’t truly carry out a full-on data classification process purely against the schema, especially for the reason above. Data will often exist in places you weren’t aware of, and knowing the contents, format and sensitivity of the data can only reasonably be found and tagged if you have the support of the data team to do so.

Conclusion?

This is not a one time thing. The initial classification is going to be the hardest part but that can be mitigated if you follow some of the processes in this post and ultimately work as a team.

Classification though, is ongoing. It is there to be an evergreen solution, something that provides context for any data governance or DevOps processes that you have in place and therefore should be maintained and treated as what it is. The place that everyone is able to use to gather information about what they will be using, and what the company might have at risk.