Data masking involves replacing real-world sensitive data with synthetic replacements. If your sensitive data is scattered across several tables containing thousands, millions or even billions of rows, replacing all of those values requires a lot of changes.
Customers often ask if it is possible to reduce the time and resources needed to mask data. You can reduce elapsed time by running masking jobs in parallel, but this increases overall resource consumption. The only way to reduce both time and resources is to do less work.
One of the benefits of a relational database (RDBMS) is that when data is normalised (de-duplicated), updates are very efficient because you only have to update a given value once. In this blog article I suggest that all tables containing sensitive data should be normalised by extracting the sensitive data into lookup tables and replacing the original values with references. There are several benefits to this model:
- Provisioning masked clones is faster and more resource efficient
- The Production data is easier to secure
- The Production database and its clones become more performant
A Simple Example
Let’s take a simple example of a CUSTOMER table. Querying the first few rows of this table gives:
CUST_ID CUST_FIRSTNAME CUST_LASTNAME  CUST_SSN
------- -------------- -------------- ---------
      1 Mary           Smith          PL898474B
      2 Patricia       Johnson        RD993299B
      3 Linda          Williams       LU785983C
      4 Barbara        Jones          XK295759D
      5 Elizabeth      Brown          QA006325A
      6 Jennifer       Davis          KQ616191C
      7 Maria          Miller         EG308056D
      8 Susan          Wilson         XP917622A
      9 Margaret       Moore          ZH212274D
     10 Dorothy        Taylor         TX804130B
To make this table secure, build a secure version of the table which replaces all embedded sensitive literal values with references, resolved by lookups on tables that contain the actual values.
Our new secure version of the table now looks like this:
CUST_ID CUST_FIRSTNAME_ID CUST_LASTNAME_ID CUST_SSN_ID
------- ----------------- ---------------- -----------
      1              1070             2083        3186
      2              1078             2044        3210
      3              1060             2095        3144
      4              1010             2045        3279
      5              1030             2010        3192
      6              1043             2021        3135
      7              1066             2054        3057
      8              1093             2096        3282
      9              1065             2056        3294
     10              1029             2085        3240
That looks a lot more secure. The lookup (or dimension) tables look as follows:
Sample from FIRSTNAME_DIM table
FIRSTNAME_ID FIRSTNAME
------------ ---------
        1007 Anne
        1013 Bonnie
        1036 Helen
        1045 Joan
        1056 Kelly
        1073 Mildred
        1078 Patricia
Sample from LASTNAME_DIM table
LASTNAME_ID LASTNAME
----------- ---------
       2023 Edwards
       2025 Flores
       2055 Mitchell
       2063 Perez
       2068 Price
       2077 Ross
       2088 Torres
       2089 Turner
Sample from SSN_DIM table
    SSN_ID SSN
---------- ---------
      3012 AW876543D
      3033 CC929313D
      3057 EG308056D
      3135 KQ616191C
      3147 MK525013B
      3192 QA006325A
      3207 QO371839A
We have now built a cluster of tables: the CUSTOMER_SEC table of references plus one dimension table per sensitive column.
To expand our secure table and display the data as it was stored in the original, we need a database view which pulls the data back together.
create or replace view CUSTOMER_VIEW as
select csd.CUST_ID,
       fnd.FIRSTNAME CUST_FIRSTNAME,
       lnd.LASTNAME  CUST_LASTNAME,
       snd.SSN       CUST_SSN
from   CUSTOMER_SEC csd
join   FIRSTNAME_DIM fnd on csd.CUST_FIRSTNAME_ID = fnd.FIRSTNAME_ID
join   LASTNAME_DIM  lnd on csd.CUST_LASTNAME_ID  = lnd.LASTNAME_ID
join   SSN_DIM       snd on csd.CUST_SSN_ID       = snd.SSN_ID;
When we query from this view we get the fully expanded version.
SQL> select * from CUSTOMER_VIEW sample(10);

CUST_ID CUST_FIRSTNAME CUST_LASTNAME  CUST_SSN
------- -------------- -------------- ---------
     32 Amy            Lopez          LH062496C
     18 Carol          Garcia         NS733964A
     39 Debra          Nelson         TL389654A
     10 Dorothy        Taylor         TX804130B
     57 Evelyn         Morgan         NQ870196D
     86 Jacqueline     Long           GT430822C
     58 Jean           Bell           HG462950A
      6 Jennifer       Davis          KQ616191C
     36 Kathleen       Adams          MP278347B
     98 Lillian        Griffin        OT351086A
      3 Linda          Williams       LU785983C
     37 Pamela         Baker          GF089913C
Steps to create a secure CUSTOMER table cluster
- Rename table CUSTOMER to CUSTOMER_ORIG
- For each column containing sensitive data, create a dimension (lookup) table holding a unique list of the column's values, each with a lookup ID. Create a primary key on the ID and a single unique composite index on the value and ID columns (so lookups can be satisfied from the index alone).
- Build a secure version of table CUSTOMER, called CUSTOMER_SEC, replacing literal sensitive values with IDs from their corresponding dimension tables. You can add foreign keys between CUSTOMER_SEC and the lookup tables.
- Create a database view which presents the data from CUSTOMER_SEC as it would have looked when querying the original CUSTOMER table.
- Drop table CUSTOMER_ORIG
- Create a synonym called CUSTOMER on view CUSTOMER_VIEW
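Assuming an Oracle database, the steps above might look like the following sketch. Table, index and constraint names are illustrative, and only the SSN column is shown; FIRSTNAME and LASTNAME follow the same pattern.

```sql
-- 1. Keep the original data to one side
ALTER TABLE CUSTOMER RENAME TO CUSTOMER_ORIG;

-- 2. One dimension table per sensitive column, built from the distinct values
CREATE TABLE SSN_DIM (
  SSN_ID NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  SSN    VARCHAR2(9) NOT NULL
);
CREATE UNIQUE INDEX SSN_DIM_UX ON SSN_DIM (SSN, SSN_ID);
INSERT INTO SSN_DIM (SSN) SELECT DISTINCT CUST_SSN FROM CUSTOMER_ORIG;

-- 3. Secure fact table: literal values replaced by dimension IDs
CREATE TABLE CUSTOMER_SEC AS
SELECT c.CUST_ID,
       fnd.FIRSTNAME_ID AS CUST_FIRSTNAME_ID,
       lnd.LASTNAME_ID  AS CUST_LASTNAME_ID,
       snd.SSN_ID       AS CUST_SSN_ID
FROM   CUSTOMER_ORIG c
JOIN   FIRSTNAME_DIM fnd ON c.CUST_FIRSTNAME = fnd.FIRSTNAME
JOIN   LASTNAME_DIM  lnd ON c.CUST_LASTNAME  = lnd.LASTNAME
JOIN   SSN_DIM       snd ON c.CUST_SSN       = snd.SSN;

ALTER TABLE CUSTOMER_SEC ADD CONSTRAINT CUST_SSN_FK
  FOREIGN KEY (CUST_SSN_ID) REFERENCES SSN_DIM (SSN_ID);

-- 4/5/6. Create CUSTOMER_VIEW as shown earlier, then tidy up
DROP TABLE CUSTOMER_ORIG;
CREATE SYNONYM CUSTOMER FOR CUSTOMER_VIEW;
```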
That takes care of querying. For DML you will need to create INSTEAD OF triggers on the view that translate INSERT, UPDATE and DELETE requests into the corresponding DML against the cluster of tables.
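As a sketch, an insert trigger might look like this in Oracle PL/SQL; only the SSN column is resolved here, and the other sensitive columns would follow the same pattern.

```sql
CREATE OR REPLACE TRIGGER CUSTOMER_VIEW_INS
INSTEAD OF INSERT ON CUSTOMER_VIEW
FOR EACH ROW
DECLARE
  v_ssn_id SSN_DIM.SSN_ID%TYPE;
BEGIN
  -- Reuse an existing dimension row if the value is already known,
  -- otherwise add a new one
  BEGIN
    SELECT SSN_ID INTO v_ssn_id FROM SSN_DIM WHERE SSN = :NEW.CUST_SSN;
  EXCEPTION WHEN NO_DATA_FOUND THEN
    INSERT INTO SSN_DIM (SSN) VALUES (:NEW.CUST_SSN)
      RETURNING SSN_ID INTO v_ssn_id;
  END;
  -- ... resolve CUST_FIRSTNAME and CUST_LASTNAME the same way ...
  INSERT INTO CUSTOMER_SEC (CUST_ID, CUST_SSN_ID /* , ... */)
  VALUES (:NEW.CUST_ID, v_ssn_id /* , ... */);
END;
/
```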
Increase Production Database Security
At this stage it looks like we have just made it more complex to access the data that was originally held in the CUSTOMER table. Well, we have, but that is a good thing. We can now more easily improve production database security by doing the following:
- Encrypt the dimension/lookup tables
There is no need to encrypt any other data in the database, since we have isolated the sensitive data. This avoids encrypting the entire database simply because we do not know where the sensitive data resides. It also has performance benefits, as the majority of data accesses will not require decryption.
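With Oracle Transparent Data Encryption, for example, the isolation means only the dimension value columns need encrypting. A sketch, assuming a TDE keystore is already configured:

```sql
-- Column-level TDE on the sensitive value columns only
ALTER TABLE FIRSTNAME_DIM MODIFY (FIRSTNAME ENCRYPT USING 'AES256');
ALTER TABLE LASTNAME_DIM  MODIFY (LASTNAME  ENCRYPT USING 'AES256');
ALTER TABLE SSN_DIM       MODIFY (SSN       ENCRYPT USING 'AES256');
```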
- Enable auditing on access to those tables
Auditing on tables is an overhead, but by isolating the sensitive data we have minimised the audit activity.
- Apply Redaction policies to all value columns in the dimension/lookup tables
Again, by isolating the sensitive data we know exactly which columns need Redaction policies applied. Redaction hides sensitive information from production database users based on role-based access.
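In Oracle, for instance, a full redaction policy on the SSN dimension might be declared as follows; the schema, policy name and role check are illustrative.

```sql
BEGIN
  DBMS_REDACT.ADD_POLICY(
    object_schema => 'APP',          -- illustrative schema name
    object_name   => 'SSN_DIM',
    policy_name   => 'REDACT_SSN',
    column_name   => 'SSN',
    function_type => DBMS_REDACT.FULL,
    -- redact for everyone without the DBA role (example condition)
    expression    => 'SYS_CONTEXT(''SYS_SESSION_ROLES'',''DBA'') = ''FALSE''');
END;
/
```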
Near-Instant Data Masking
To produce a masked version of a database with the above changes in place, we only need to provide masked versions of the dimension/lookup tables and drop the originals. Even if we build these replacement tables from scratch, the process will still be faster than masking the original format: there is less data to scan (shorter rows) and, thanks to de-duplication, fewer rows to mask.
Incremental masking would also be easier if we used versioning of masked tables and masked only the necessary deltas.
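Masking then reduces to rewriting only the dimension tables. Two possible approaches, sketched below; MASK_SSN and SSN_DIM_MASKED are hypothetical names for a masking function and a pre-built masked copy.

```sql
-- Option 1: mask the de-duplicated values in place
-- (MASK_SSN is a hypothetical masking function)
UPDATE SSN_DIM SET SSN = MASK_SSN(SSN);

-- Option 2: swap in a masked replacement built offline, keeping
-- the same SSN_ID values so CUSTOMER_SEC is untouched
RENAME SSN_DIM TO SSN_DIM_REAL;
RENAME SSN_DIM_MASKED TO SSN_DIM;
DROP TABLE SSN_DIM_REAL;
```

Either way, CUSTOMER_SEC, by far the largest table in the cluster, is never rewritten.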
Objection … overruled?
The obvious concern is that this approach requires schema changes and potentially application changes. There is no escaping that and it means that this approach to security is likely only applicable to custom applications where the schema is under your control.
The good news though is that usually only a small fraction of tables in a schema contain sensitive data and so the amount of effort may be far less than anticipated.
The benefits are immense: more secure production databases and faster, potentially near-instant, provisioning of secure masked databases. Both production and masked databases will also likely perform better for scan-based operations, due to the implicit compression afforded by replacing literal values with numeric identifiers.
This approach improves data security and performance while using fewer resources.