Delphix enabled Logical Data Warehouse

Image courtesy of Business Cloud News

What is a Logical Data Warehouse?

A Logical Data Warehouse uses federated queries to query source databases and other data sources directly, avoiding the need to copy data into a dedicated data warehouse. The key benefits of this approach over traditional data warehouse solutions are:

  1. Access the latest version of the data, not a copy from the night before or last weekend.
  2. Add new data sources, and adjust data models on existing data sources, much more quickly.
  3. Reduced infrastructure, less server capacity and storage is required because you leverage the source environments.

A Logical Data Warehouse is often called a Virtual Data Warehouse or a Federated Data Warehouse.
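
To make the federated query idea concrete, here is a minimal sketch using an Oracle database link; the link, table and column names (sales_vdb, customers, orders) are hypothetical, and a dedicated LDW product such as Denodo achieves the same result through its own connectors rather than database links.

-- Hypothetical example: a database link pointing at a remote source system
CREATE DATABASE LINK sales_vdb
  CONNECT TO report_user IDENTIFIED BY report_pwd
  USING 'SALES_VDB_TNS';

-- Join local reference data to remote transaction data in a single query,
-- without copying the remote table into a warehouse
SELECT c.customer_name,
       SUM(o.order_total) AS total_spend
  FROM customers c
  JOIN orders@sales_vdb o
    ON o.customer_id = c.customer_id
 GROUP BY c.customer_name;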

To Copy or not to Copy, that is the question

As mentioned, there are two main approaches to providing access to data for analytics. Either you copy the data to an analytics or data warehouse platform, or you keep the data in place, implement a unified data model and use a set of connectors to query the data directly at source.

There is a variation of the latter, mainly OLAP solutions, where pre-calculated query results are materialized into views or multi-dimensional cubes to improve response times for predicted queries.

Solutions that require their own copy of the data include Exadata, Teradata, Netezza, Hadoop, Snowflake, Amazon Redshift, Azure SQL Data Warehouse and Google BigQuery.

Logical (or Virtual) Data Warehouse vendors include Denodo, Cisco / Composite, Red Hat JBoss Data Virtualisation, DataVirtuality, and SAP Hana.

Just as Delphix avoids the costs and delays of provisioning copies of data, a Logical Data Warehouse (LDW) avoids the costs and delays of copying data, but it still needs continuous access to the source data.  With Delphix the opportunity exists to build a Logical Data Warehouse on top of Virtual Databases.  Why would you want to do that?

 

Virtual Data Sources for Logical Data Warehouses

Customers considering a logical data warehouse may be concerned about the performance hit on mission-critical source systems that are exposed directly to logical data warehouse queries.

Connecting your logical data warehouse to Delphix Virtual Databases has the following advantages:

  1. Protect Production, Delphix offloads production by provisioning virtual data copies on separate database servers using shared storage managed by Delphix.
  2. Enhance Performance, run as many virtual databases as required to spread the load with no storage overhead.
  3. Enhance Security, Delphix can provision original or secured (consistently masked) virtual database copies avoiding the need and cost of masking an entire warehouse data copy.
  4. Time Travel, Delphix can provision consistent database sets from any point in time within the configurable retention period, enabling much simpler period-to-period analytics over much smaller data sets.
  5. Reduced Infrastructure, leverage under-utilized but licensed, non-production servers.

You probably already have sufficient licensed database servers to accommodate a data warehouse solution without building or renting additional infrastructure or purchasing additional licenses.

  6. Ground or Cloud, Delphix can run in your data center or on AWS or Azure. Delphix replication can synchronize your data into the cloud, allowing you to run your Logical Data Warehouse against co-located database copies.

 

To Summarize

Delphix is an outstanding complement to Logical Data Warehouse solutions, in particular for provisioning high performance, secure, virtual copies of your most important data.

 

 

Data Masking: Locating Sensitive Data

Data Masking is a Sensitive Subject

One of the primary concerns with Data Masking is identifying the locations of sensitive data within a database. For custom applications the development team will likely know where all sensitive data is stored; with third-party applications, however, this is a bigger challenge, especially considering that some database schemas can have tens of thousands of tables.

Delphix Masking has a Profiling tool which uses regular expressions to locate potentially sensitive data based on the column name or a sample of the data itself. This is easily extended to add patterns for any custom formatted sensitive data and the profiling tool can then be used to find any additional occurrences of that data in any supported database or file type.
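
To illustrate the column-name side of profiling outside the tool, a rough data dictionary query can approximate it; the schema name and the patterns below are illustrative examples only, not the expressions shipped with Delphix Masking.

-- Illustrative only: find columns whose names suggest sensitive content
-- ('APP_OWNER' is a hypothetical schema owner)
SELECT owner, table_name, column_name, data_type
  FROM all_tab_columns
 WHERE owner = 'APP_OWNER'
   AND REGEXP_LIKE(column_name,
         '(FIRST|LAST|SUR)_?NAME|EMAIL|PHONE|TELEPHONE|ADDRESS|POSTCODE|ZIP|SSN|CARD', 'i');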

The profiling tool cannot always locate all sensitive data so an additional technique may help.

Customers are always right and sometimes Innovative

Discussing this issue recently with David Battiston from Worldpay, I suggested we could create a Delphix VDB copy of the database and run functional tests covering sensitive transactions such as customer account creation, modification and ordering, or, for a web-based application, any screen using HTTPS. We would use sentinel data in the tests so we could easily spot it in the database. I proposed using a text index on the database, but David suggested using database auditing.

What a great idea, I decided to give it a try.

Test Process

I downloaded the Community Edition of an eCommerce application called Konakart.
I created an Oracle Virtual Database as the database to underpin Konakart and created database user KKART to own the Konakart schema.

-- Create database user KKART to own Konakart schema

CREATE USER kkart IDENTIFIED BY kkart
  DEFAULT TABLESPACE users
  TEMPORARY TABLESPACE temp
  QUOTA UNLIMITED ON users;

GRANT connect, resource TO kkart;

 

I then installed the Konakart application which populated the VDB with the Konakart schema and some demo data. The Konakart schema has around 128 tables.

I then enabled database auditing in the Oracle VDB.

-- Enable extended database auditing

ALTER SYSTEM SET audit_trail=db_extended SCOPE=SPFILE;
-- Note: on Oracle 11g and later the equivalent value is written db,extended

-- Restart the database so the static parameter change takes effect
SHUTDOWN IMMEDIATE
STARTUP

-- Enable audit for database schema owner KKART
AUDIT ALL BY kkart BY ACCESS;
AUDIT SELECT TABLE, UPDATE TABLE, INSERT TABLE,
      DELETE TABLE BY kkart BY ACCESS;
AUDIT EXECUTE PROCEDURE BY kkart BY ACCESS;
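
As a quick sanity check, the recorded statement audit options can be confirmed from the data dictionary:

-- Confirm the statement audit options recorded for KKART
SELECT user_name, audit_option, success, failure
  FROM dba_stmt_audit_opts
 WHERE user_name = 'KKART';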

 

Insert Sentinel Data

Using the Konakart application I added new customer details and placed an order.

Below is a summary of the details.

[Screenshot: the new Konakart customer and order details, Donald Trump of 123 Washington Road, Washington, telephone 555-666-1234, email djtrump@whitehouse.com, with an order for a printer.]

Looks like Donald has just purchased a printer!

Locate Sentinel Data

Now we just have to query the audit tables to locate where these details have been recorded.

select SQL_TEXT, SQL_BIND
  from dba_common_audit_trail
 where OBJECT_SCHEMA = 'KKART'
   and TRANSACTIONID is not null /* exclude failed statements */
   and SQL_BIND like '%123 Washington Road%'; /* sentinel data */

And the result was two tables, ADDRESS_BOOK and ORDERS, plus quite a few columns.

INSERT INTO address_book
(address_book_id, entry_email_address, entry_company, entry_city,  entry_lastname, entry_firstname, entry_gender, entry_telephone,
 entry_telephone_1, entry_suburb, entry_zone_id, entry_country_id, entry_street_address_1, entry_postcode, entry_street_address,customers_id)
 VALUES (:1 ,:2 ,:3 ,:4 ,:5 ,:6 ,:7 ,:8 ,:9 ,:10 ,:11 ,:12 ,:13 ,:14 ,:15 ,:16 )

#4 entry_city = Washington
#5 entry_lastname = Trump
#6 entry_firstname = Donald
#8 entry_telephone = 555-666-1234
#14 entry_postcode = 666
#15 entry_street_address = 123 Washington Road

 

INSERT INTO orders
(billing_telephone, customers_telephone,customers_locale,
 orders_status, lifecycle_id, delivery_country,
 filterdecimal1, points_reservation_id, orders_id,
 billing_name, customers_name, billing_street_address,
 billing_country, delivery_name, billing_city, date_purchased,
 billing_state, customers_city, points_used, currency,
 parent_id, delivery_addr_id, billing_postcode,
 delivery_postcode, payment_module_code,
 billing_address_format_id, payment_method, delivery_city,
 customers_country, customers_state, customers_id,
 customer_addr_id,points_awarded, customers_postcode,
 last_modified, customers_email_address, currency_value,
 delivery_telephone, customers_street_address, delivery_state,
 delivery_address_format_id, delivery_street_address,
 shipping_module_code, billing_addr_id, orders_number,
 customers_address_format_id) VALUES (:1 ,:2 ,:3 ,:4 ,:5 ,:6 ,
  :7 ,:8 ,:9 ,:10 ,:11 ,:12 ,:13 ,:14 ,:15 ,:16 ,:17 ,:18 ,:19 ,
  :20 ,:21 ,:22 ,:23 ,:24 ,:25 ,:26 ,:27 ,:28 ,:29 ,:30 ,:31 ,
  :32 ,:33 ,:34 ,:35 ,:36 ,:37 ,:38 ,:39 ,:40 ,:41 ,:42 ,:43 ,
  :44 ,:45 ,:46 )


#1 billing_telephone = 555-666-1234
#2 customers_telephone = 555-666-1234
#6 delivery_country = United States
#10 billing_name = Donald Trump
#11 customers_name = Donald Trump
#12 billing_street_address = 123 Washington Road
#13 billing_country = United States
#14 delivery_name = Donald Trump
#15 billing_city = Washington
#17 billing_state = Washington
#18 customers_city = Washington
#23 billing_postcode = 666
#24 delivery_postcode = 666
#28 delivery_city = Washington
#29 customers_country = United States
#30 customers_state = Washington
#34 customers_postcode = 666
#36 customers_email_address = djtrump@whitehouse.com
#38 delivery_telephone = 555-666-1234
#39 customers_street_address = 123 Washington Road
#40 delivery_state = Washington
#42 delivery_street_address = 123 Washington Road

  

Konakart is unusual in using clear table and column names, but that is not the case for many third-party applications. Using this technique we can semi-automate the detection of sensitive data regardless of the schema's complexity or naming conventions, for example by summarising the audit trail as sketched below.
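
A variation of the earlier query can turn the audit trail into a shortlist of objects for the masking inventory; extracting the individual column names from SQL_TEXT is then a small manual or scripted step.

-- Summarise which objects captured the sentinel data
SELECT object_schema, object_name, COUNT(*) AS hits
  FROM dba_common_audit_trail
 WHERE object_schema = 'KKART'
   AND transactionid IS NOT NULL
   AND sql_bind LIKE '%123 Washington Road%'
 GROUP BY object_schema, object_name
 ORDER BY hits DESC;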

To Summarize

The first challenge with data masking is getting hold of a full copy of the database to be masked. A Delphix Virtual Database gives you a personal pop-up full copy in minutes.

The second challenge with data masking is deciding which columns to mask. If you do not already know, you could use a technique like the one above, combining sentinel data with database auditing.  Either way, once you have identified the location of any sensitive data you can use the Delphix Masking profiling tool to locate further occurrences in the same or different databases and files.

 

Delphix Cloud Economics for on Premise

Image courtesy of vyopta.com

Cloud Economics

Many customers are examining how they can leverage public clouds for reduced costs and increased agility.

Public Clouds save costs through automation, higher server utilisation, lower storage costs and consumption based charging.

Public clouds are seen by many as a great way to secure expanded capacity to handle demand spikes for seasonal or overflow activities, allowing you to keep your on-premise IT resources trimmed to handle only Business as Usual (BAU) activity.

What about running BAU on Cloud? What is the cost per database on Cloud versus on premise, factoring in database server, network, storage and license costs?   Some customers have done this calculation and the answer may surprise you.

Delphix Economics

Let’s examine how Delphix enables the same core cost saving measures as Cloud solutions.

1. Automation

Delphix automates the entire end-to-end process, from data synchronisation to provisioning. These policy-driven, automated workflows can also include Data Masking and Replication.  End users can manage their own full-copy data sets but are limited to pre-allocated server resources to avoid consumption creep.

2. Server Utilisation

The two primary methods to increase server utilisation are density (more concurrent instances per server) and hoteling (drop-in environments).

Delphix enables increased density by allowing you to run more database instances on the same database server. Customers have reported 30% increases in concurrent instances per server.   Often a server is maxed out on memory but Delphix allows you to reduce the memory allocation per instance without compromising performance due to the caching effect of the Delphix shared cache.

Delphix enables hoteling by allowing an environment to be provisioned and de-provisioned quickly and automatically. This allows customers to pool existing siloed and implicitly underutilised database servers and share these servers between teams and across time zones. Test servers can be kept busier by supporting tests from different projects and sprints, simply by lifting the constraint on data provisioning.

Take a look at your current test server utilisation, imagine if you could increase that by 30% or better.

3. Storage

The best way to save storage costs is to avoid using storage altogether.   Delphix reduces non-production storage consumption by at least 70%.

Here’s the math:

Delphix compressed source copy ≈ 50% of the production database size
Non-production copies required = 4, which as physical copies would consume 400%

Storage consumption with Delphix for all 4 copies = 50% / 400% = 1/8 = 12.5% of the physical approach
Storage saving = 100% - 12.5% = 87.5%

As with most of these calculations, it depends.  In reality the savings are generally so substantial that you can use the highest-performing storage available, because you don’t have to buy much of it.  You then get the additional benefit of increased test throughput, allowing more tests for increased quality and shorter release cycles.

The Nirvana

The best solution for most customers will be an on premise private cloud for BAU activity linked to a public cloud, or two, for burst activity.

Delphix will make your private cloud as economical as a public cloud with the added advantage of supporting true agile development for database based applications, i.e. the really important ones.

For public cloud, Delphix supports AWS and Azure.  Delphix will seamlessly replicate data between on-premise and cloud, cloud to cloud, and cloud to on-premise, allowing you to switch workloads between clouds without worrying about moving data.

 

Delphix Secure Data Subset Replication

VLDB Data Subsets

Many Delphix customers are virtualising enormous Very Large Databases (VLDBs), with some tables running into tens of billions of rows.   Occasionally these databases grow because there is no archiving and purging process in place, but increasingly the current active working set of data is growing due to rising transaction levels and the consolidation of databases of record.   The transactional databases of today are now substantially larger than the data warehouses of yesterday.

In general with Delphix there is no requirement to provision data subsets since Delphix virtual databases can be provisioned in a few minutes regardless of size and with minimal storage overhead. However for some customers there is a requirement to provision only the latest few days, weeks or months of data for faster functional testing.

Delphix Selective Data Replication

Delphix Replication is used to provide resilience, scalability, data migration and data segregation.   With Delphix Replication you can choose to replicate any set of virtual databases and this is used to ensure only secure, masked, data is replicated to a second ‘replica’ Delphix Engine.

This same capability allows us to replicate a virtual database which has been trimmed down to a more manageable size, a subset if you will. For some databases, such as Oracle, dropping tables and deleting data does not release space, so you usually have to drop tablespaces and their corresponding datafiles, which is of course much easier if you have partitioned or transportable tablespaces.
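
As a rough sketch of that trim step on Oracle, assuming historical data has already been isolated into its own tablespace (the tablespace name below is hypothetical):

-- Check that nothing still needed lives in the historical tablespace
SELECT owner, segment_name, segment_type
  FROM dba_segments
 WHERE tablespace_name = 'HISTORY_2015';

-- Drop it and release the space, including the underlying datafiles
DROP TABLESPACE history_2015 INCLUDING CONTENTS AND DATAFILES CASCADE CONSTRAINTS;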

What’s the Recipe?

The steps are relatively straightforward.

  1. Create a VDB on the primary Delphix Engine using a configure clone hook to run a script to trim the VDB.
  2. Configure a masking job to execute after the trim process.
  3. Bounce the VDB once provisioned.
  4. Create a Replication job with Selective Data Distribution enabled.
  5. Choose the trimmed and masked VDB to be replicated.
  6. Run the replication job.

Additional advantages of this approach are that you need even less storage assigned to the replica Delphix Engine, and you only have to mask the trimmed database, which makes the masking process run proportionately faster in line with the reduction in size.

To Summarize

If you have substantial databases you can create full virtual copies of these in production for production support, reporting and ETL.  You can replicate these full virtual copies to a replica engine for full volume functionality and performance testing.

With Delphix Selective Data Distribution Replication you also have the option to replicate secure data subsets for faster functionality testing.

Since a Delphix Replication Engine can be located on-premise or on-cloud you have the choice of what to replicate, how much to replicate and where to replicate.

 

Delphix and Data Subsets

Image courtesy of tpc.com

Who needs Data Subsets?

With Delphix there is generally no need for data subsets.  If the reason for creating data subsets is to save storage and improve provisioning times, Delphix Virtual Databases (VDBs) remove that need.  Occasionally customers want to use data subsets to improve the performance of tests involving table scans, and hence want to use smaller tables; the most common examples are batch jobs or reports.   For these tests one option is to use more restrictive WHERE clauses or partitioned tables, and these changes can be hidden from the application by using database views, as sketched below.
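
A minimal sketch of the view approach, with hypothetical table, view and schema names; the application keeps querying the same name while the underlying object is restricted to recent data.

-- Expose only the last three months of data to the batch job or report
CREATE OR REPLACE VIEW orders_recent AS
  SELECT *
    FROM orders
   WHERE order_date >= ADD_MONTHS(TRUNC(SYSDATE), -3);

-- Optionally point the test user at the restricted view via a synonym
CREATE OR REPLACE SYNONYM batch_user.orders FOR orders_recent;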

However if you still want to proceed with data subsets Delphix can help with that too.

Agile Data Subsets

Almost all data subsetting involves creating a standalone, relationally intact set of data, which means you need to understand the relationships between all the tables needed to support your tests.   For many this is a huge stumbling block, and it is why the cost of creating subsets is high and hence the frequency of generation is low.

The good news is that it is much easier to create a data subset in a VDB, since you do not need to subset all the tables: all the tables are already present and none of them incur a storage overhead.   You only need to subset the larger tables, usually the transaction and history tables that are involved in full table scans or other slow scans.

For partitioned tables you can just drop the unwanted partitions.  For non-partitioned tables you can either delete rows, or more efficiently create a replacement table with the desired rows.
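
A hedged sketch of both options, with hypothetical table and partition names; note that the replacement-table route means recreating any indexes, constraints and grants afterwards.

-- Partitioned table: simply drop the partitions you do not need
ALTER TABLE orders DROP PARTITION orders_2015 UPDATE GLOBAL INDEXES;

-- Non-partitioned table: build a replacement containing only the desired rows
CREATE TABLE orders_subset AS
  SELECT *
    FROM orders
   WHERE order_date >= ADD_MONTHS(TRUNC(SYSDATE), -3);

DROP TABLE orders;
ALTER TABLE orders_subset RENAME TO orders;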

The only occasion when you need to factor in table relationships is when you are subsetting two tables where one has a dependency on the other.    That is easiest done using foreign keys, and if they do not already exist you can create them for the purposes of subsetting, as in the sketch below. Since this is a VDB, a personal full read-write copy, you can manipulate it however you require.   You do not have to factor in any other relationships.
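
A hypothetical example of keeping a child table consistent with its already-trimmed parent; the foreign key is created purely to document the relationship if it is missing, and NOVALIDATE tolerates the orphan rows until they are removed.

-- Record the parent/child relationship if it does not already exist
-- (orders.order_id must be its primary or a unique key)
ALTER TABLE order_lines
  ADD CONSTRAINT fk_order_lines_orders
  FOREIGN KEY (order_id) REFERENCES orders (order_id)
  ENABLE NOVALIDATE;

-- Remove child rows whose parent rows were removed by the subset
DELETE FROM order_lines ol
 WHERE NOT EXISTS (SELECT 1
                     FROM orders o
                    WHERE o.order_id = ol.order_id);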

Secure Data Subsets

If you started your data subsetting process using a pre-masked VDB then your data subset is already secured.  If not, you can run the same Delphix Masking job against your data subset VDB.

To cap it all

Having streamlined your data subsetting process you can now create as many copies on demand as required using your data subset VDB as a parent VDB from which child VDBs can be created.

Once you have done all your tests against data subsets you can then run a full volume test against a copy of the original VDB.

Delphix for Performance Testing

Image courtesy of juliandontcheff.wordpress.com

There will be Change

Many customers ask whether it is possible to use Delphix for performance tuning and testing, and the answer is a definitive yes.   It is a sensible question, because database files presented via NFS over Ethernet from a compressed, de-duplicated shared copy are quite different from a dedicated physical copy presented directly from a SAN over a Fibre Channel network.

Inevitably the performance will be different.   The performance is almost always different between production and non-production environments even in pre-production environments which are often configured using the same deployment architecture and components as production.

Consistency is the key to Quality

For all performance testing you must first perform a baseline test on the test infrastructure and you then must ensure you can reproduce those results consistently when no changes are applied.  The infrastructure must have a guaranteed quality of service and the testing tool must be able to present the workload consistently.  Consistent workloads are usually achieved using sequential regression test suites, batch jobs or automated testing tools which can present an identical workload repeatedly.

Tuning requires many Iterations

So we have established that you cannot measure the effect of a change unless you have a consistent environment on which to test that change in isolation.  In addition, performance tuning is a highly iterative process: you should only test one change at a time, and each change needs to be validated not just for the intended improvement but also for any undesirable side effects.  That’s a lot of test runs, generally requiring automation of the environment set-up and an automated restore of the data to a known point in time.

Delphix for Performance Tuning

Many customers use Delphix for performance tuning, but almost all are initially a little skeptical about the baseline performance, assuming it will be worse than their current non-production physical storage.  This is not the case.

On numerous occasions we have proven that if you deploy Delphix with the recommended storage characteristics (IOPS and latency) and network characteristics (throughput and latency), the performance of Delphix virtual databases will be the same as, and often better than, your physical storage.

So now the baseline performance of a virtual database will be similar to that of a physical database.  That’s a good start, let the iterations begin.

Oracle customers generally use AWR comparison reports to analyse the results of any tuning change, comparing the baseline AWR with the AWR from the tuned run, as sketched below.  Once satisfied, they then run a performance regression test to ensure the candidate change does not regress performance for other code paths.
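
Assuming the baseline and tuned runs are each bracketed by AWR snapshots, a comparison report can be generated along these lines; the snapshot IDs and dbid are placeholders, and the interactive awrddrpt.sql script under $ORACLE_HOME/rdbms/admin produces the same report.

-- Compare the baseline run (snapshots 100-101) with the tuned run (snapshots 110-111)
SELECT output
  FROM TABLE(DBMS_WORKLOAD_REPOSITORY.awr_diff_report_text(
         1234567890, 1, 100, 101,    -- dbid, instance, begin/end snapshot of the baseline run
         1234567890, 1, 110, 111));  -- dbid, instance, begin/end snapshot of the tuned run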

With Delphix, restoring the data to a known state and bookmarking the data changes to pinpoint schema evolution are trivial and near-instantaneous.  Automating the entire environment and data provisioning for testing also enables higher-frequency testing earlier in the release cycle and higher test server utilisation.

To cap it all

Finally, once your tuning changes appear good to go, you will likely want to run a final live-like test in your pre-production environment.   Delphix can provision a physical copy of a database directly from the candidate virtual copy to your physical storage.    This is particularly useful if you have made substantial changes to the test database, such as data masking or an SAP refresh, and do not want to run that entire process again.

Delphix and Machine Intelligence

Image courtesy of anticorruptiondigest.com

There are several examples of semi-automated workflows which lend themselves to using machine intelligence techniques to improve the precision of the automated components and reduce the final stage manual effort common in these workflows.

For example, the process of detecting money laundering requires anomaly detection: looking for peculiar or unusual activity which goes against the patterns of normal behaviour.  Anti-Money Laundering software scans recent transactions to look for candidate anomalies based on rules that are built and maintained by specialists.  Any candidate anomalies identified then have to be investigated in person, and each of these investigations incurs a cost.
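
As a purely illustrative sketch of such a rule (the table and column names are invented), a specialist-defined check might flag accounts whose daily total is far above their own recent average:

-- Illustrative rule only: flag accounts whose total for today exceeds
-- five times their average daily total over the previous 90 days
WITH daily AS (
  SELECT account_id,
         TRUNC(txn_date) AS txn_day,
         SUM(amount)     AS day_total
    FROM transactions
   GROUP BY account_id, TRUNC(txn_date)
),
history AS (
  SELECT account_id,
         AVG(day_total) AS avg_day_total
    FROM daily
   WHERE txn_day >= TRUNC(SYSDATE) - 90
     AND txn_day <  TRUNC(SYSDATE)
   GROUP BY account_id
)
SELECT d.account_id, d.txn_day, d.day_total, h.avg_day_total
  FROM daily d
  JOIN history h ON h.account_id = d.account_id
 WHERE d.txn_day = TRUNC(SYSDATE)
   AND d.day_total > 5 * h.avg_day_total;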

The goal of the workflow is to identify all suspect anomalies whilst minimising false positives, which incur unnecessary costs.   The good news is these systems have been in play for some time and have built a valuable library of data, including the original set of transactions for analysis, the subset of transactions previously identified as candidate anomalies, and the results of the manual investigations that identified actual money laundering activity.

This library of data is ideal for use by machine learning algorithms which can be trained based on input and output data.   In this case we have two input and output stages.

Automatically provisioning each set of data from the library is a non-trivial task, in particular for the original set of transactions, which may number several million.   Delphix is ideally suited to this task, as Delphix can be used to automatically collect, locate and provision sets of data on demand in minutes.   For machine intelligence this means that any newly refined algorithm can be validated against the full history of past analysis, to ensure that the candidate algorithm performs as well as or better than its predecessor on all known sets of data.