The Cost of (not) using Snowflake

Introduction

Understandably, customers want to know how much Snowflake costs, and like many computing-related questions the initial answer is “it depends”.  There are no fixed costs with Snowflake because we use a consumption-based model, but the Snowflake architecture has economy built in as part of our design philosophy.  Our goal is to make sure you only pay for what you use and do not waste money on idle or unnecessary resources.

Storage

For storage, the cost per terabyte is much the same as that of the underlying cloud platform’s native storage; however, since we compress all data ingested into Snowflake, the effective cost is reduced by the compression ratio.  This inevitably varies with the nature of the source data, but a reasonable guideline is somewhere between 3 and 5 times.  As an example, one customer quoted “we tripled the amount of data we’re storing, for about half the cost” when comparing with a cloud vendor’s database.  Snowflake does not charge storage costs for the results cache, which retains all query results for up to 24 hours, nor do we charge for the Virtual Warehouse storage caches.

Another method that Snowflake uses to reduce storage cost is our zero copy cloning.  With Snowflake you can clone databases, schemas or tables with almost no storage overhead.  These clones are created within seconds and mean you can have as many development or test environments as you need with minimal storage costs.  Snowflake cloning can also be used for point-in-time reporting environments and backups.
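As a minimal sketch (the database and table names here are hypothetical), a clone is a single statement and completes without copying data:

-- Clone an entire production database for development; storage is only
-- consumed for blocks that subsequently change on either side.
CREATE DATABASE dev_sales CLONE sales;

-- Clones can also be taken at table level and at a point in time (Time Travel):
CREATE TABLE sales.public.orders_snapshot
  CLONE sales.public.orders AT (OFFSET => -60*60);  -- as of one hour ago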

Snowflake data sharing allows you to share any portion of your data with another Snowflake user or third party.  Again there are no copies or extracts of the data to generate, maintain or pay storage for; Snowflake data sharing shares the underlying data blocks.
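A share is set up with a few grants; the object and account names below are purely illustrative:

-- Share a table with another Snowflake account without copying any data.
CREATE SHARE sales_share;
GRANT USAGE ON DATABASE sales TO SHARE sales_share;
GRANT USAGE ON SCHEMA sales.public TO SHARE sales_share;
GRANT SELECT ON TABLE sales.public.orders TO SHARE sales_share;
ALTER SHARE sales_share ADD ACCOUNTS = partner_account;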

Storage costs are particularly important because they are a constant cost.  You pay for storage 24×7, with the only options to trim the cost being to compress data, delete data or find a better rate.

Compute

Next come compute costs, and this is where Snowflake gives you the most control.  Snowflake compute is paid for using credits, and billing is per second, so every saved second counts.  Our smallest Virtual Data Warehouse (VDW) consumes a single credit per hour, and each successive size from XS up to 4XL consumes credits at double the rate of the size below it.  Because we tend to get linear scalability, the cost of running a workload on the next size up is roughly the same as on the smaller size: the consumption rate doubles but the run time halves.  Snowflake VDWs can be quiesced or suspended at will, either explicitly or via policy, and when a VDW is not running it consumes no credits.  A suspended VDW will resume on query within seconds, again policy permitting (99% of VDWs have this enabled).  The ability to start, resume or resize a VDW within seconds gives customers the confidence to suspend VDWs and trim their compute costs down to the minimum run time.
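As an illustrative sketch (the warehouse name and thresholds are arbitrary), the suspend and resume policies are simple warehouse properties:

-- A warehouse that suspends itself after 60 seconds of inactivity and
-- resumes automatically when the next query arrives.
CREATE WAREHOUSE reporting_wh
  WAREHOUSE_SIZE      = 'XSMALL'
  AUTO_SUSPEND        = 60
  AUTO_RESUME         = TRUE
  INITIALLY_SUSPENDED = TRUE;

-- Resize up for a heavy batch window, then trim back down.
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'LARGE';
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'XSMALL';

-- Or suspend explicitly when the work is done.
ALTER WAREHOUSE reporting_wh SUSPEND;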

The ability to trim back compute, whether by scaling down a VDW or ‘scaling in’ nodes in a VDW cluster, is all aimed at reducing VDW run time and therefore reducing or stopping compute consumption.

Administration

Finally, Snowflake requires near-zero administration.  With Snowflake there is no installation, no upgrades, no patches, no indexes, no partitions, no table statistics, no need to rebalance, re-stripe or replicate data, and almost no tuning.  This level of automation substantially reduces administration costs.

In summary

  1. Storage costs lower than native cloud storage due to compression
  2. No storage costs for on disk caches
  3. Near zero storage overhead for data cloning or data sharing
  4. Compute costs minimised by scaling down, scaling in and suspend operations on VDWs
  5. Compute costs minimised by efficient query execution plans, requiring zero tuning, and results caches
  6. Near zero administration

At the end of the day it comes down to product philosophy.  Snowflake wants to help organisations become data driven by providing a cloud native database which can support all data warehouse and data lake workloads.   We want you to spend your time analysing data not cloud vendor invoices.


Disclaimer: The opinions expressed in my articles are my own, and will not necessarily reflect those of my employer (past or present).

Data Driven Product Management using Snowflake Warehouse

Data Driven Product Management

Product Management is a crucial capability for all organisations, and those that are able to collect data from their deployed applications or appliances are poised to gain real-world insight to influence product strategy, marketing, sales and support.

Increasingly we are seeing customers and prospects capturing usage and diagnostic instrumentation, typically published as semi-structured data, often in JSON format.  The instrumentation changes over time as the product evolves, so JSON is an ideal format for encapsulating it.

One of our customers sells their solutions as a software appliance, which makes them easy to deploy on premise and cloud and also makes them inherently robust, as no 3rd-party software, patches or configuration changes are necessary.  The appliances are self-contained and are famous for their detailed level of instrumentation.  These appliances capture information about usage, performance and capacity, and this information can be viewed via GUI or API or published to enterprise monitoring tools such as Splunk.

In addition to the customer visible data the appliances also collect detailed diagnostics as an aid to offline analysis for maintenance and debugging support cases.

Data Science for Product Management

Our customer’s Product Management team started an initiative over a year ago to improve insight into customers’ success and adoption levels with their solutions.  Their customers can opt in to a feature whereby their appliances periodically upload data captured on usage, capacity and performance rather than diagnostics.  This telemetry data has been captured from the outset, with a custom application used to extract data from the telemetry JSON bundles so that internal users can query it via a portal using a mixture of forms-based and SQL-like queries.  The portal automatically aggregates data as you query and lets you drill down to the details contained in a single telemetry bundle.

The new initiative was to broaden the scope of data for analysis and allow Product Management, Engineering, Sales Operations and Finance to glean insight by correlating data across several data sources.  In addition to the telemetry data, the team are keen to capture and correlate Segment (click analysis) data and data from Salesforce, Jira and Zendesk.

Something for Everyone

Whilst each data source can provide valuable insight independently, combining the data sources allows for much wider analysis.

The Snowflake Warehouse allows the product management team to validate assertions and reduce risk.  For example, simply by being able to quantify usage, the product management team can easily validate which features are being used the most or not at all, which are growing or tailing off in their usage, and which features and versions could be candidates for prioritised enhancement or even deprecation.  Feature usage analysis can validate how many customers are actually using the company’s unique differentiators.  It can also help predict how many customers might churn because they are using the solution in a very basic way and only gaining a fraction of its potential value.

Insight afforded by the Snowflake Warehouse can provide ‘best next action’ at a product and customer level.

Correlating renewals information with usage and adoption level data can identify key patterns that indicate if some action needs to be taken proactively and this can form the basis of Predictive Analytics.

The product management team get inundated with questions from the field looking for evidence of which other customers are using their solution for certain use cases, deployment patterns, or with given third-party applications or specific versions.  The questions are almost endless.

Another potential use case is to identify which patches are relevant to which customers.  Any software issue that leads to a patch is only required by those customers using the feature which utilises that code path.  This can enable forensic patching and can also help prioritise fixes, as the true scope of impact can be more accurately assessed from the empirical evidence of feature usage patterns.

For Marketing, collating all this data together allows for 20,000-foot aggregations which can provide effective Marketing insight.  For example, headline numbers such as the largest deployments, fastest adoption levels, fastest ROI, greatest savings, pretty much all the superlative metrics.  The growth in adoption levels is a strong indication of customers leveraging the power of the solution to materially improve their business.  Trends in cloud versus on premise deployments can help steer campaigns and identify customer references.

Primary Operational Data Sources

Telemetry Service

The telemetry bundles each contain several megabytes of JSON data, covering product and feature usage and configuration information.  All sensitive data is fully redacted before being uploaded.

User-click Analytics

Their customers can also opt to capture browser-based, user-click analytics data. If enabled, this service automatically sends a stream of anonymous, non-personal metadata describing user interaction with the product’s user interface.  This data provides a better understanding of product usage, engagement, and user behaviour, and helps improve product and customer experience. Again, no sensitive data is collected by this service.

Appliance Telemetry

One of the first tasks of the Warehouse initiative was to get data pipelines in place to synchronise data with Snowflake.  The JSON telemetry bundles are uploaded via S3 storage using Snowpipe, and the click-analytics data collated by Segment is loaded using a Snowflake API.
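A sketch of the telemetry side of the pipeline in Snowflake SQL; the stage, table and pipe names (and the S3 location and credentials) are hypothetical:

-- Land the raw JSON bundles in a VARIANT column.
CREATE TABLE telemetry_raw (bundle VARIANT);

-- External stage pointing at the S3 location the appliances upload to.
CREATE STAGE telemetry_stage
  URL = 's3://example-bucket/telemetry/'
  CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...');

-- Snowpipe auto-ingests new files as they arrive on the stage.
CREATE PIPE telemetry_pipe AUTO_INGEST = TRUE AS
  COPY INTO telemetry_raw
  FROM @telemetry_stage
  FILE_FORMAT = (TYPE = 'JSON');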

Why Snowflake

A number of alternative solutions were considered, including Domo and Redshift; however, one of the key differentiators for Snowflake was its VARIANT data type, which allows the customer to load JSON data directly and query it using standard SQL.  The customer was keen to ingest all data intact with minimal preparation, on the basis that you never know what data you might need, and the sooner you ingest the data the faster you can analyse it and decide what is useful.
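For example, a usage roll-up straight over the raw JSON might look like the following (the bundle structure and field names are invented for illustration):

-- Count how many customers report using each feature, directly from the
-- ingested JSON, with no up-front schema design.
SELECT f.value:name::string                          AS feature,
       COUNT(DISTINCT t.bundle:customer_id::string)  AS customers_using
FROM   telemetry_raw t,
       LATERAL FLATTEN(input => t.bundle:features) f
GROUP  BY feature
ORDER  BY customers_using DESC;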

In addition, with separate lines of business wanting to analyse copies of the same data, Snowflake’s separation of compute and storage allows each LoB to have its own Virtual Data Warehouse, which ensures quality of service through isolation of independent workloads.

Disclaimer: The opinions expressed in my articles are my own, and will not necessarily reflect those of my employer (past or present).

Ultra Lightweight, Ultra Fast Database Clones

Introduction

A common use case for read only database clones is to offload production using read farms for reporting, analytics or ETL.  Very often the transactional (OLTP) database schema is not necessarily organised in an optimal fashion for these types of queries.  Sometimes a read replica is created with a different schema and maintained using logical replication.  With Delphix Virtual Databases (clones) and database Materialized Views you can achieve fabulous performance with very little maintenance effort.

Materialized Views

From Wikipedia: “In computing, a materialized view is a database object that contains the results of a query. For example, it may be a local copy of data located remotely or may be a subset of the rows and/or columns of a table or join result or may be a summary using an aggregate function.”.  In essence a materialized view is a set of pre-computed results which can be referenced immediately rather than running an entire query or slow portion of a query.

Materialized Views are one of the best ways to increase performance for long running, resource hungry queries, typically queries scanning large volumes of data, performing aggregations and sorting.  These are all resource hungry workloads for any database system, for relational databases we could also add large volume joins.
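As a generic illustration using the Sales History (SH) schema (not necessarily one of the views used in the test below), an aggregate materialized view pre-computes the expensive scan, join and GROUP BY once:

-- Pre-compute revenue by product and calendar month so that query rewrite
-- can answer matching aggregate queries without scanning SALES.
CREATE MATERIALIZED VIEW sales_by_product_month
  BUILD IMMEDIATE
  REFRESH COMPLETE ON DEMAND
  ENABLE QUERY REWRITE
AS
SELECT p.prod_name,
       t.calendar_month_desc,
       SUM(s.amount_sold) AS total_amount_sold,
       COUNT(*)           AS sales_count
FROM   sales    s
JOIN   products p ON p.prod_id = s.prod_id
JOIN   times    t ON t.time_id = s.time_id
GROUP  BY p.prod_name, t.calendar_month_desc;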

Often Materialized Views are not used in production OLTP environments for a variety of reasons.

  1. Materialized Views are difficult to define
  2. Materialized Views take up more storage space
  3. Refreshing Materialized Views is time consuming
  4. Using Fast Refresh Materialized Views has an overhead on DML to maintain the Materialized View Logs

All of these are correct, although Oracle 12c now has real time Materialized Views.  For read only clones built on Delphix Virtual Databases, the maintenance of Materialized Views and the storage overhead issues are mitigated.

Testing the Theory

I used the excellent Swingbench Oracle Database benchmarking tool with its built-in Sales History benchmark.  I admit this is not an OLTP workload, but it illustrates the benefit of Materialized Views for query workloads.  The Sales History benchmark includes 14 queries, illustrated below.  I used the default load ratios.

Queries and Materialized View Mapping

[Screenshot: mapping of the Sales History queries to materialized views]

Table and Materialized View Sizing

[Screenshots: table and materialized view sizes]

Vanilla Run

Running the Sales History benchmark against a vanilla installation of the Sales History schema, with no materialized views, gave the following results.

[Screenshot: Swingbench results, vanilla run]

The results are a bit difficult to read so refer to the summary comparison table coming up below.

Note that I/O wait time is shown in the CPU statistics section, however CPUs do not wait on I/O. This really represents a process wait time on I/O, i.e. the process cannot proceed until an I/O operation completes.

Materialized View Run

Adding Materialized Views for 9 of the 14 queries as illustrated in the Query and MV table mapping above, yielded the following results with all other factors being the same.

[Screenshot: Swingbench results, materialized view run]

Comparing the results from the two scenarios we get

[Screenshot: comparison of the vanilla and materialized view runs]

And we get all this for the one-time effort of defining and creating the materialized views and only 14K database blocks, representing a 3% overhead.

Even Better on Cloud

One of the particular benefits of Cloud computing is that tuning activity is instantly rewarded.  Tuning on premise is usually only performed reactively to meet query SLAs, since the only other benefit is that your servers sit idle for longer.  On Cloud, however, the nickel meter runs slower if you consume less resource.  Coupled with the scale-up ability of Cloud resources, this makes Cloud deployments the perfect hosts for pop-up, read-only, materialized-view-enabled clones.

Summary 

With Delphix your DBAs have more time to concentrate on higher-value activities than cloning and refreshing non-production database copies.  Tuning databases is a high-value activity which can improve SLAs and data accessibility.  On Cloud there is the additional benefit that the cost savings from reduced CPU and I/O consumption are instantly realized.

 

Instant Data Masking in RDBMS

Introduction

Data Masking involves replacing real-world sensitive data with synthetic replacements.  If your sensitive data is scattered across several tables consisting of thousands, millions or even billions of rows, that requires a lot of changes to replace all those values.

Customers often ask if it is possible to reduce the time and resources needed to mask data.  You can reduce the elapsed time by running masking jobs in parallel, but this increases overall resource consumption.  The only way to reduce both time and resource is to do less work.

One of the benefits of a relational database (RDBMS) is that when data is normalised (de-duplicated), updates are very efficient because you only have to update a given value once.  In this blog article I suggest that tables containing sensitive data should be normalised by extracting the sensitive data into lookup tables and replacing the original values with references.  There are several benefits to this model:

  1. Provisioning masked clones is faster and more resource efficient
  2. The Production data is easier to secure
  3. The Production database and its clones become more performant

A Simple Example

Let’s take a simple example of a CUSTOMER table which has data of the form:

[Screenshot: CUSTOMER table definition]

Let’s query the first few rows of this table

CUST_ID CUST_FIRSTNAME CUST_LASTNAME  CUST_SSN
------- -------------- -------------- ---------
      1 Mary           Smith          PL898474B
      2 Patricia       Johnson        RD993299B
      3 Linda          Williams       LU785983C
      4 Barbara        Jones          XK295759D
      5 Elizabeth      Brown          QA006325A
      6 Jennifer       Davis          KQ616191C
      7 Maria          Miller         EG308056D
      8 Susan          Wilson         XP917622A
      9 Margaret       Moore          ZH212274D
     10 Dorothy        Taylor         TX804130B

To make this table secure you should build a secure version of it which replaces all embedded sensitive literal values with references, which are resolved by lookups on tables containing the actual values.

Our new secure version of the table now looks like

[Screenshot: CUSTOMER_SEC table definition]

CUST_ID CUST_FIRSTNAME_ID CUST_LASTNAME_ID CUST_SSN_ID
------- ----------------- ---------------- -----------
      1             1070             2083        3186
      2             1078             2044        3210
      3             1060             2095        3144
      4             1010             2045        3279
      5             1030             2010        3192
      6             1043             2021        3135
      7             1066             2054        3057
      8             1093             2096        3282
      9             1065             2056        3294
     10             1029             2085        3240

That looks a lot more secure.  The lookup or dimension tables look as follows:


Sample from FIRSTNAME_DIM table

FIRSTNAME_ID FIRSTNAME
------------ ---------
        1007 Anne
        1013 Bonnie
        1036 Helen
        1045 Joan
        1056 Kelly
        1073 Mildred
        1078 Patricia


Sample from LASTNAME_DIM table

LASTNAME_ID LASTNAME
----------- ---------
       2023 Edwards
       2025 Flores
       2055 Mitchell
       2063 Perez
       2068 Price
       2077 Ross
       2088 Torres
       2089 Turner


Sample from SSN_DIM table

    SSN_ID SSN
---------- ---------
      3012 AW876543D
      3033 CC929313D
      3057 EG308056D
      3135 KQ616191C
      3147 MK525013B
      3192 QA006325A
      3207 QO371839A

We have now built a cluster of tables which looks like:

[Diagram: secure CUSTOMER table cluster]

To expand our secure table to display the data as stored in the original we need a database view which pulls the data together.

create or replace view CUSTOMER_VIEW
as
select csd.CUST_ID, fnd.FIRSTNAME CUST_FIRSTNAME, lnd.LASTNAME CUST_LASTNAME, snd.SSN CUST_SSN
from CUSTOMER_SEC  csd
join FIRSTNAME_DIM fnd on csd.CUST_FIRSTNAME_ID = fnd.FIRSTNAME_ID
join LASTNAME_DIM  lnd on csd.CUST_LASTNAME_ID = lnd.LASTNAME_ID
join SSN_DIM       snd on csd.CUST_SSN_ID = snd.SSN_ID;

When we query from this view we get the fully expanded version.

SQL> select * from CUSTOMER_VIEW sample(10);

CUST_ID CUST_FIRSTNAME CUST_LASTNAME  CUST_SSN
------- -------------- -------------- ---------
     32 Amy            Lopez          LH062496C
     18 Carol          Garcia         NS733964A
     39 Debra          Nelson         TL389654A
     10 Dorothy        Taylor         TX804130B
     57 Evelyn         Morgan         NQ870196D
     86 Jacqueline     Long           GT430822C
     58 Jean           Bell           HG462950A
      6 Jennifer       Davis          KQ616191C
     36 Kathleen       Adams          MP278347B
     98 Lillian        Griffin        OT351086A
      3 Linda          Williams       LU785983C
     37 Pamela         Baker          GF089913C

Steps to create a secure CUSTOMER table cluster

  1. Rename table CUSTOMER to CUSTOMER_ORIG
  2. For each column containing secure data create dimension (lookup) tables with a unique list of values for the column and a lookup id. You should create a primary key on the ID and a single unique index on the value and ID columns.
  3. Build a secure version of table CUSTOMER, called CUSTOMER_SEC replacing literal secure values with IDs from their corresponding dimension tables. You can add foreign keys between CUSTOMER_SEC and the lookup tables.
  4. Create a database view which presents the data from CUSTOMER_SEC as it would have looked when querying the original CUSTOMER table.
  5. Drop table CUSTOMER_ORIG
  6. Create a synonym called CUSTOMER on view CUSTOMER_VIEW
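A minimal sketch of these steps in Oracle SQL, using the example names above (the ID offsets are cosmetic and column types are inherited from the original table):

-- 1. Keep the original table out of harm's way.
RENAME customer TO customer_orig;

-- 2. One dimension table per sensitive column, built from the distinct values.
CREATE TABLE firstname_dim AS
SELECT 1000 + ROW_NUMBER() OVER (ORDER BY cust_firstname) AS firstname_id,
       cust_firstname AS firstname
FROM   (SELECT DISTINCT cust_firstname FROM customer_orig);

CREATE TABLE lastname_dim AS
SELECT 2000 + ROW_NUMBER() OVER (ORDER BY cust_lastname) AS lastname_id,
       cust_lastname AS lastname
FROM   (SELECT DISTINCT cust_lastname FROM customer_orig);

CREATE TABLE ssn_dim AS
SELECT 3000 + ROW_NUMBER() OVER (ORDER BY cust_ssn) AS ssn_id,
       cust_ssn AS ssn
FROM   (SELECT DISTINCT cust_ssn FROM customer_orig);

ALTER TABLE firstname_dim ADD CONSTRAINT firstname_dim_pk PRIMARY KEY (firstname_id);
ALTER TABLE lastname_dim  ADD CONSTRAINT lastname_dim_pk  PRIMARY KEY (lastname_id);
ALTER TABLE ssn_dim       ADD CONSTRAINT ssn_dim_pk       PRIMARY KEY (ssn_id);

CREATE UNIQUE INDEX firstname_dim_uk ON firstname_dim (firstname, firstname_id);
CREATE UNIQUE INDEX lastname_dim_uk  ON lastname_dim  (lastname, lastname_id);
CREATE UNIQUE INDEX ssn_dim_uk       ON ssn_dim       (ssn, ssn_id);

-- 3. The secure table: literals replaced by dimension IDs.
CREATE TABLE customer_sec AS
SELECT c.cust_id,
       f.firstname_id AS cust_firstname_id,
       l.lastname_id  AS cust_lastname_id,
       s.ssn_id       AS cust_ssn_id
FROM   customer_orig c
JOIN   firstname_dim f ON f.firstname = c.cust_firstname
JOIN   lastname_dim  l ON l.lastname  = c.cust_lastname
JOIN   ssn_dim       s ON s.ssn       = c.cust_ssn;

-- 4. CUSTOMER_VIEW is created as shown earlier.

-- 5. Remove the original once everything is verified.
DROP TABLE customer_orig PURGE;

-- 6. Existing code keeps querying "CUSTOMER" unchanged.
CREATE OR REPLACE SYNONYM customer FOR customer_view;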

That takes care of querying. For DML you will need to create triggers that will translate INSERT, DELETE and UPDATE requests into the corresponding DML for the cluster of tables.
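For example, an INSTEAD OF INSERT trigger might look like the sketch below; it assumes the dimension rows already exist (a fuller version would insert missing values using sequences), and the UPDATE and DELETE triggers follow the same pattern:

CREATE OR REPLACE TRIGGER customer_view_ioi
INSTEAD OF INSERT ON customer_view
FOR EACH ROW
DECLARE
  v_fn_id  firstname_dim.firstname_id%TYPE;
  v_ln_id  lastname_dim.lastname_id%TYPE;
  v_ssn_id ssn_dim.ssn_id%TYPE;
BEGIN
  -- Resolve each literal to its dimension ID.
  SELECT firstname_id INTO v_fn_id  FROM firstname_dim WHERE firstname = :NEW.cust_firstname;
  SELECT lastname_id  INTO v_ln_id  FROM lastname_dim  WHERE lastname  = :NEW.cust_lastname;
  SELECT ssn_id       INTO v_ssn_id FROM ssn_dim       WHERE ssn       = :NEW.cust_ssn;

  -- Store only the references in the secure table.
  INSERT INTO customer_sec (cust_id, cust_firstname_id, cust_lastname_id, cust_ssn_id)
  VALUES (:NEW.cust_id, v_fn_id, v_ln_id, v_ssn_id);
END;
/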

 

Increase Production Database Security

At this stage it looks like we have just made it more complex to access the data that was originally held in the CUSTOMER table.  Well, we have, but that is a good thing.  We can now more easily improve production database security by doing the following:

  1. Encrypt the dimension/lookup tables

There is no need to encrypt any other data in the database since we have isolated the secure data.  This avoids encrypting the entire database simply because we do not know where the secure data resides.  This also has some performance benefits as the majority of data accesses will not require decryption.
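With Oracle, for example, Transparent Data Encryption can be applied to just the value columns in the lookup tables (this sketch assumes a TDE keystore is already configured):

-- Encrypt only the sensitive values; the reference IDs elsewhere stay clear.
-- NO SALT is required because these columns are indexed.
ALTER TABLE firstname_dim MODIFY (firstname ENCRYPT USING 'AES256' NO SALT);
ALTER TABLE lastname_dim  MODIFY (lastname  ENCRYPT USING 'AES256' NO SALT);
ALTER TABLE ssn_dim       MODIFY (ssn       ENCRYPT USING 'AES256' NO SALT);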

  2. Enable auditing on access to those tables

Auditing on tables is an overhead but by isolating the secure data we have minimised the audit activity.

  3. Ensure Redaction policies are applied to all value data in the dimension/lookup tables

Again by isolating the secure data we now know exactly which data needs to have Redaction policies applied.  Redaction hides sensitive information from production database users based on role based access.
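In Oracle, for example, a full redaction policy on the SSN lookup table is a single call (the schema and policy names are illustrative):

-- Redact SSN values for all users except those exempted from redaction.
BEGIN
  DBMS_REDACT.ADD_POLICY(
    object_schema => 'APP',
    object_name   => 'SSN_DIM',
    policy_name   => 'redact_ssn',
    column_name   => 'SSN',
    function_type => DBMS_REDACT.FULL,
    expression    => '1=1');
END;
/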

 

Near Instant data masking

To produce a masked version of a database with the above changes, we only need to provide masked versions of the dimension/lookup tables and drop the originals.  We could build these replacement tables from scratch; the process will still be faster than masking the original format, as there is less data to scan (shorter rows) and fewer rows to mask thanks to de-duplication.

Incremental masking would also be easier if we used versioning of the masked tables and masked only the necessary deltas.

 

Objection … overruled?

The obvious concern is that this approach requires schema changes and potentially application changes. There is no escaping that and it means that this approach to security is likely only applicable to custom applications where the schema is under your control.

The good news though is that usually only a small fraction of tables in a schema contain sensitive data and so the amount of effort may be far less than anticipated.

 

Summary

The benefits are immense: more secure production databases and faster, potentially near-instant, provisioning of secure masked databases.  Both production and masked databases will also likely perform faster for any scan-based operations, due to the implicit compression afforded by replacing literal values with numeric identifiers.

This approach improves data security, improves performance and uses fewer resources.

Delphix Data Transport Shuttle

 Move, Copy or Replicate Data – sounds easy?

In IT, it often transpires that tasks which appear difficult turn out to be quite easy, though clearly not often enough.  Conversely, tasks that appear easy all too often turn out to be rather difficult.  Moving, copying or replicating data between data centres, including public clouds, turns out to be more challenging than it first appears.

Of course, there are many point solutions.  It is easy to transfer or copy files from server to server, databases have physical and logical replication, you can use ETL to logically synchronise data between two data repositories, and storage replication is commonly used between data centres that have the same storage, with high availability as the primary use case.  You can also physically transport a backup copy of your data and restore it at the target end.

The problem is that few of these solutions are fully automated: they require custom configuration for each source, storage replication needs the same storage at each end, and physical transportation is a manual process, suitable only for discrete copies and not for continuous replication or synchronisation.

Delphix’ best kept Secret:  Delphix Replication

Delphix Replication allows you to easily transport and synchronise any data between two Delphix Engines regardless of where the Delphix Engines are deployed.   This could be on-premise to on-premise, on-premise to cloud (Azure, AWS, IBM Cloud), cloud to cloud or cloud to on-premise.

Delphix Replication is also bi-directional.

Did you say any data?

Yes.  Delphix supports a variety of Unix, Linux and Windows platforms.  Another of our open secrets is that you can push any data (documents, files, folders, backup images, database files, almost anything running on any of the Delphix supported platforms) directly into Delphix, and that data can then leverage Delphix Data Platform capabilities implicitly.  The Delphix data copy (dSource) is compressed, de-duplicated and versioned regardless of how the data is ingested.  Copies can then be provisioned from any point in time, included in a Data Pod with full self-service, or replicated to another Delphix Engine.

Does it have to be Virtualised?

No.  Once a Delphix Data Copy (dSource) is created, any point in time copy can be created as either a virtual copy or a physical copy.

A single fully automated data transport solution

Delphix Replication can act as a data shuttle.   Delphix can automatically ingest the data or you can push data into Delphix on your own schedule.  Once Delphix has a copy it can automatically transport any selected data between data centres or clouds and enable self-service provisioning of physical or virtual copies locally or remotely.

You can put all the data you need to transport between data centres or clouds into Delphix.

Does it require high bandwidth?

No, you can use what our Support team refer to as a “wet piece of string”.  For example, we have a customer using Delphix Replication between Hong Kong and China over a 50 Mbps link, and another replicating out of Argentina over a 30 Mbps link.  It’s a matter of physics: the higher the bandwidth and the lower the latency, the better the throughput.

The data is already compressed and is encrypted when transported using the Delphix Session Protocol (DSP).

It’s not an option

What I mean is, it is not an extra cost option, Delphix Replication is free with any two or more Delphix Engines.  It is available with a single Delphix Engine and does work but is not nearly as useful in such a deployment.

To Summarize

Replicating data between data centres and clouds is currently plagued by a variety of point solutions.  It is a non-trivial task and is increasingly required as customers start spreading workloads between data centres and clouds, not just for high availability but for scaling workloads, enriching data by tapping into remote services and improving vendor support.

Delphix simplifies data transportation and as a Data Platform makes the entire process consistent, simple, safe and easy to manage with out of the box services including compression, de-duplication, replication, encryption and virtualisation.

If you have data on any of Delphix’ supported Operating Systems, Linux (RHEL, CentOS, OEL, SUSE), Solaris, HP-UX, IBM AIX or Windows, that you want to easily send or synchronise to another data centre or cloud, try Delphix Replication.


Delphix Data Virtualisation enabled Data Masking


How Do You Solve A Problem Like… Masking?

Well, as in the film (for you youngsters, “The Sound of Music”), we could burst into song, but most customers are more inclined to burst into tears.

Providing secure data for non-production use is an expensive exercise whether you synthesize or mask (obfuscate) data.  Encryption is much easier but not really an option for non-production use since it is reversible.

In addition, GDPR has raised the priority of securing non-production data with more binding legal requirements and stiffer penalties for non-compliance.

Delphix can help with self-funded, simplified masking and the ability to easily leverage lower cost cloud resources using masked data.  In this article, I outline how Delphix Data Virtualisation not only streamlines data masking but can also help fund it as well.

Challenge 1: I need a database copy to mask

The first challenge for a data masking exercise is that you need a copy of the data to mask.  If your database is multi-terabyte, where are you going to place that copy?  It has to be in the production zone, because until it is masked it contains sensitive data.  And if you have space for a full copy but need to refresh it, you will likely have to overwrite the previous copy and deny access to existing users for the duration of the copy and mask process.

Solution 1: Virtual Data Provisioning and Versioning

Delphix Data Virtualisation allows you to create a full read-write copy of an entire database with minimal storage, with almost zero overhead for the initial copy.  Better still, it allows you to create a new version without impacting previous versions, enabling continuous production of new masked versions and uninterrupted access to previous ones.

Challenge 2:  I need to refresh several environments with the latest copy

With physical copies you need to copy the entire newly masked database to each environment, and as with the original provisioning you will likely need to overwrite the previous copy and deny access to existing users for the duration of the copy process.

 

[Diagram: Secure Data Generation Pipeline and Consumption]

Solution 2: Virtual Data Child Copies, Refresh, Branching and Versioning

Once you have masked your virtual data copy you can then create as many virtual child copies as required, all now sharing the set of blocks that were modified as a result of the data masking process, typically between 10-15% of data blocks.

Delphix Data Virtualisation allows you to refresh an environment on demand.  Even after refreshing your database with the latest masked copy, the previous versions are still accessible and you can switch between versions on demand.

Challenge 3:  I need to provision several masked databases at a consistent point in time

The first challenge is that you need a masking solution which masks consistently across databases, even those from different vendors.  You also need to be able to provision the initial set of databases for masking at a consistent point in time.

Solution 3: Virtual Data Point in Time Provisioning and Data Pods

Delphix Data Virtualisation allows you to provision multiple databases from the same or different vendors to the same point in time.  These virtual copies can be grouped into Data Pods which are assigned to projects or individuals.  All self-service operations on the Data Pods, including start, stop, rewind, refresh, bookmark, share and switch version, automatically apply to all the components in the Data Pod.


 

Challenge 4:  I need to get my masked data onto a Cloud

One of the commercial benefits of data masking is that it enables customers to leverage lower-cost cloud resources using secured data.  Unfortunately, it is not always that easy to upload data to the cloud and keep the cloud copy updated with refreshed versions.  Often this requires setting up some form of replication or backup-and-restore process for each database.

Solution 4: Delphix Replication

Delphix Replication allows you to replicate only masked database copies between Delphix Engines, which can be either on premise or on cloud.  The replication process is completely automated, and new database versions only require the changed, compressed blocks to be sent to keep the target engine in sync.

Challenge 5:  Cost of non-production copies and data masking

Data Masking adds additional license, processing and operating cost.

Solution 5: Delphix Data Virtualisation

Delphix Data Virtualisation will typically reduce your non-production storage footprint by 70-90%.   In addition, Delphix turnkey, end to end automation reduces operational costs and accelerates development and test cycles.   This combination of savings will typically fund your Delphix Virtualisation and Data Masking investments in 6-9 months.

To Summarize

Data Masking is a great example of how Delphix Data Virtualisation can streamline and accelerate the flow of data for use in non-production, whether on premise or on cloud.  In this article, we have not even looked at the benefits of Delphix Data Masking itself.

Any customer performing Data Masking using any Data Masking solution will benefit from Delphix Data Virtualisation.


Delphix enabled Logical Data Warehouse


Image courtesy of Business Cloud News

What is a Logical Data Warehouse?

A Logical Data Warehouse uses federated queries to query source databases, and other data sources, directly, avoiding the need to copy data into a dedicated data warehouse. The key benefits to this approach over traditional data warehouse solutions are:

  1. Access the latest version of the data, not a copy from the night before or last weekend
  2. Easily add new Data Sources and adjust data models on existing data sources much more quickly
  3. Reduced Infrastructure, decreases the amount of server capacity and storage required because you leverage the source environments.

A Logical Data Warehouse is often called a Virtual Data Warehouse or a Federated Data Warehouse.
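As a crude illustration of the federated idea only (real LDW products use their own connectors and optimisers), a query can span sources in place, here via Oracle database links with made-up names:

-- "Warehouse" query that unions two regional OLTP databases in place;
-- no data is copied into a separate warehouse.
SELECT region, SUM(order_total) AS revenue
FROM (
      SELECT 'EAST' AS region, order_total FROM orders@oltp_east
      UNION ALL
      SELECT 'WEST' AS region, order_total FROM orders@oltp_west
     )
GROUP BY region;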

To Copy or not to Copy, that is the question

As mentioned, there are two main approaches to providing access to data for analytics purposes: either you copy the data to an analytics or data warehouse platform, or you keep the data in place, implement a unified data model and use a set of connectors to query the data directly from its sources.

There is a variation of the latter, mainly OLAP solutions, where pre-calculated query results are materialized into views or multi-dimensional cubes for improving response times for predicted queries.

Solutions that require their own copy of the data, include Exadata, Teradata, Netezza, Hadoop, Snowflake, Amazon Redshift, Azure SQL Data Warehouse and Google BigQuery.

Logical (or Virtual) Data Warehouse vendors include Denodo, Cisco / Composite, Red Hat JBoss Data Virtualisation, DataVirtuality, and SAP Hana.

Similar to Delphix, which avoids all the costs and delays of provisioning copies of data, a Logical Data Warehouse (LDW) avoids all the costs and delays of copying data, but it still needs continuous access to the source data.  With Delphix, the opportunity exists to build a Logical Data Warehouse on top of Virtual Databases.  Why would you want to do that?

 

Virtual Data Sources for Logical Data Warehouses

When using a logical data warehouse, customers may be concerned about the performance impact of exposing mission-critical source systems directly to logical data warehouse queries.

Connecting your logical data warehouse to Delphix Virtual Databases has the following advantages:

  1. Protect Production, Delphix offloads production by provisioning virtual data copies on separate database servers using shared storage managed by Delphix.
  2. Enhance Performance, run as many virtual databases as required to spread the load with no storage overhead.
  3. Enhance Security, Delphix can provision original or secured (consistently masked) virtual database copies avoiding the need and cost of masking an entire warehouse data copy.
  4. Time Travel, Delphix can provision consistent database sets from any point in time within the configurable retention period, enabling much simpler period to period based analytics over much smaller data sets.
  5. Reduced Infrastructure, leverage under-utilized but licensed, non-production servers.

You probably already have sufficient licensed database servers to accommodate a data warehouse solution without building or renting additional infrastructure or purchasing additional licenses.

  6. Ground or Cloud, Delphix can run in your data center or on AWS or Azure. Delphix replication can allow you to synchronize your data in the cloud allowing you to run your Logical Data Warehouse against co-located database copies.

 

To Summarize

Delphix is an outstanding complement to Logical Data Warehouse solutions, in particular for provisioning high performance, secure, virtual copies of your most important data.


Data Masking: Locating Sensitive Data

Data Masking is a Sensitive Subject

One of the primary concerns with Data Masking is identifying the locations of sensitive data within a database.  For custom applications the development team will likely know where all the sensitive data is stored, but with third-party applications this is a bigger challenge, especially considering that some database schemas can have tens of thousands of tables.

Delphix Masking has a Profiling tool which uses regular expressions to locate potentially sensitive data based on the column name or a sample of the data itself.  This is easily extended with patterns for any custom-formatted sensitive data, and the profiling tool can then be used to find any additional occurrences of that data in any supported database or file type.
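As an illustration of the idea only (this is plain SQL, not the Delphix profiler itself), the NI-number-style values used elsewhere in this blog could be spotted in a sampled column with a simple pattern; the table and column names are hypothetical:

-- Look for values shaped like two letters, six digits and a trailing letter
-- in a 1% sample of a hypothetical CUSTOMER table.
SELECT COUNT(*) AS suspect_values
FROM   customer SAMPLE (1)
WHERE  REGEXP_LIKE(cust_ssn, '^[A-Z]{2}[0-9]{6}[A-Z]$');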

The profiling tool cannot always locate all sensitive data so an additional technique may help.

Customers are always right and sometimes Innovative

Discussing this issue recently with David Battiston from Worldpay, I suggested we could create a Delphix VDB database copy and run functional tests covering sensitive transactions such as customer account creation, modification and ordering, or, for a web-based application, any screen using https. We would use sentinel data in the tests so we could easily spot it in the database. I proposed using a text index on the database, but David suggested using database auditing.

What a great idea, I decided to give it a try.

Test Process

I downloaded the Community Edition of an eCommerce application called Konakart.
I created an Oracle Virtual Database as the database to underpin Konakart and created database user KKART to own the Konakart schema.

-- Create database user KKART to own Konakart schema

CREATE USER kkart IDENTIFIED BY kkart
    DEFAULT TABLESPACE users
  TEMPORARY TABLESPACE temp
    QUOTA UNLIMITED ON users;

GRANT connect,resource TO kkart;

 

I then installed the Konakart application which populated the VDB with the Konakart schema and some demo data. The Konakart schema has around 128 tables.

I then enabled database auditing in the Oracle VDB.

-- Enable extended database auditing

ALTER SYSTEM SET audit_trail=db_extended SCOPE=SPFILE;
shutdown
startup

-- Enable audit for database schema owner KKART
AUDIT ALL BY kkart BY ACCESS;
AUDIT SELECT TABLE, UPDATE TABLE, INSERT TABLE,
      DELETE TABLE BY kkart BY ACCESS;
AUDIT EXECUTE PROCEDURE BY kkart BY ACCESS;

 

Insert Sentinel Data

Using the Konakart application I added new customer details and placed an order.

Below is a summary of the details.

[Screenshot: new customer details and order summary]

Looks like Donald has just purchased a printer!

Locate Sentinel Data

Now we just have to query the audit tables to locate where these details have been recorded.

select SQL_TEXT, SQL_BIND
  from dba_common_audit_trail
 where OBJECT_SCHEMA = 'KKART'
   and TRANSACTIONID is not null /* failed */
   and SQL_BIND like '%123 Washington Road%'; /* sentinel data */

And the result was two tables, ADDRESS_BOOK and ORDERS and quite a few columns.

INSERT INTO address_book
(address_book_id, entry_email_address, entry_company, entry_city,  entry_lastname, entry_firstname, entry_gender, entry_telephone,
 entry_telephone_1, entry_suburb, entry_zone_id, entry_country_id, entry_street_address_1, entry_postcode, entry_street_address,customers_id)
 VALUES (:1 ,:2 ,:3 ,:4 ,:5 ,:6 ,:7 ,:8 ,:9 ,:10 ,:11 ,:12 ,:13 ,:14 ,:15 ,:16 )

#4 entry_city = Washington
#5 entry_lastname = Trump
#6 entry_firstname = Donald
#8 entry_telephone = 555-666-1234
#14 entry_postcode = 666
#15 entry_street_address = 123 Washington Road

 

INSERT INTO orders
(billing_telephone, customers_telephone,customers_locale,
 orders_status, lifecycle_id, delivery_country,
 filterdecimal1, points_reservation_id, orders_id,
 billing_name, customers_name, billing_street_address,
 billing_country, delivery_name, billing_city, date_purchased,
 billing_state, customers_city, points_used, currency,
 parent_id, delivery_addr_id, billing_postcode,
 delivery_postcode, payment_module_code,
 billing_address_format_id, payment_method, delivery_city,
 customers_country, customers_state, customers_id,
 customer_addr_id,points_awarded, customers_postcode,
 last_modified, customers_email_address, currency_value,
 delivery_telephone, customers_street_address, delivery_state,
 delivery_address_format_id, delivery_street_address,
 shipping_module_code, billing_addr_id, orders_number,
 customers_address_format_id) VALUES (:1 ,:2 ,:3 ,:4 ,:5 ,:6 ,
  :7 ,:8 ,:9 ,:10 ,:11 ,:12 ,:13 ,:14 ,:15 ,:16 ,:17 ,:18 ,:19 ,
  :20 ,:21 ,:22 ,:23 ,:24 ,:25 ,:26 ,:27 ,:28 ,:29 ,:30 ,:31 ,
  :32 ,:33 ,:34 ,:35 ,:36 ,:37 ,:38 ,:39 ,:40 ,:41 ,:42 ,:43 ,
  :44 ,:45 ,:46 )


#1 billing_telephone = 555-666-1234
#2 customers_telephone = 555-666-1234
#6 delivery_country = United States
#10 billing_name = Donald Trump
#11 customers_name = Donald Trump
#12 billing_street_address = 123 Washington Road
#13 billing_country = United States
#14 delivery_name = Donald Trump
#15 billing_city = Washington
#17 billing_state = Washington
#18 customers_city = Washington
#23 billing_postcode = 666
#24 delivery_postcode = 666
#28 delivery_city = Washington
#29 customers_country = United States
#30 customers_state = Washington
#34 customers_postcode = 666
#36 customers_email_address = djtrump@whitehouse.com
#38 delivery_telephone = 555-666-1234
#39 customers_street_address = 123 Washington Road
#40 delivery_state = Washington
#42 delivery_street_address = 123 Washington Road

  

So Konakart is unusual in using clear column and table names, but that is not the case for many third-party applications. Using this technique, we can semi-automate the detection of sensitive data regardless of the database schema's complexity or naming conventions.

To Summarize

The first challenge with data masking is getting hold of a full copy of the database to be masked. A Delphix Virtual Database gives you a personal pop-up full copy in minutes.

The second challenge with data masking is knowing which columns to mask. If you do not already know, you could use a technique like the one above, using sentinel data and database auditing.  Either way, once you have identified the location of any sensitive data you can use the Delphix Masking profiling tool to locate further occurrences in the same or different databases and files.

 

Delphix Cloud Economics for on Premise


Image courtesy of vyopta.com

Cloud Economics

Many customers are examining how they can leverage public clouds for reduced costs and increased agility.

Public Clouds save costs through automation, higher server utilisation, lower storage costs and consumption based charging.

Public clouds are seen by many as a great way to secure expanded capacity to handle demand spikes for seasonal or overflow activities, allowing you to keep your on-premise IT resources trimmed to handle only Business as Usual (BAU) activity.

What about running BAU on Cloud? What is the cost per database on Cloud versus on premise, factoring in database server, network, storage and license costs?   Some customers have done this calculation and the answer may surprise you.

Delphix Economics

Let’s examine how Delphix enables the same core cost saving measures as Cloud solutions.

1. Automation

Delphix automates the entire end-to-end process from data synchronisation to provisioning. These policy-driven, automated workflows can also include Data Masking and Replication.   End users can manage their own full-copy data sets but are limited to pre-allocated server resources to avoid consumption creep.

2. Server Utilisation

The two primary methods to increase server utilisation are density (increased concurrent instances) and hoteling (drop in environments).

Delphix enables increased density by allowing you to run more database instances on the same database server. Customers have reported 30% increases in concurrent instances per server.   Often a server is maxed out on memory but Delphix allows you to reduce the memory allocation per instance without compromising performance due to the caching effect of the Delphix shared cache.

Delphix enables hoteling by allowing an environment to be provisioned and de-provisioned quickly and automatically. This allows customers to pool existing siloed and implicitly underutilised database servers and share these servers between teams and across time zones. Test servers can be kept busier by supporting tests from different projects and sprints, simply by lifting the constraint on data provisioning.

Take a look at your current test server utilisation, imagine if you could increase that by 30% or better.

3. Storage

The best way to save storage costs is to avoid using storage altogether.   Delphix reduces non-production storage consumption by at least 70%.

Here’s the math:

Compressed Delphix copy of the source (dSource) = ~50% of production size
Non-production copies required = 4

Storage for 4 physical copies = 400% of production size
Storage with Delphix = ~50% (the 4 virtual copies share the dSource blocks)

Relative consumption = 50% / 400% = 1/8 = 12.5%
Storage saving = 100% - 12.5% = 87.5%

As with most of these calculations it depends.  In reality the savings are generally so substantial that you can use the highest performing storage available because you don’t have to buy much of it.  You then get the additional benefit of increased test throughput allowing more tests for increased quality and reduced release cycles.

The Nirvana

The best solution for most customers will be an on premise private cloud for BAU activity linked to a public cloud, or two, for burst activity.

Delphix will make your private cloud as economical as a public cloud with the added advantage of supporting true agile development for database based applications, i.e. the really important ones.

For public cloud, Delphix supports AWS and Azure.  Delphix will seamlessly replicate data between on-premise and cloud, cloud to cloud, and cloud to on-premise, allowing you to switch workloads between clouds without worrying about moving data.

 

Delphix Secure Data Subset Replication


VLDB Data Subsets

Many Delphix customers are virtualising enormous Very Large Databases (VLDBs), with some tables running into tens of billions of rows.   Occasionally these databases grow because there is no archiving and purging process in place, but increasingly the current active working set of data is growing due to rising transaction levels and the consolidation of databases of record.   The transactional databases of today are now substantially larger than the data warehouses of yesterday.

In general with Delphix there is no requirement to provision data subsets since Delphix virtual databases can be provisioned in a few minutes regardless of size and with minimal storage overhead. However for some customers there is a requirement to provision only the latest few days, weeks or months of data for faster functional testing.

Delphix Selective Data Replication

Delphix Replication is used to provide resilience, scalability, data migration and data segregation.   With Delphix Replication you can choose to replicate any set of virtual databases and this is used to ensure only secure, masked, data is replicated to a second ‘replica’ Delphix Engine.

This same capability allows us to replicate a virtual database which has been trimmed down to a more manageable size, a subset if you will. For some databases, such as Oracle, dropping tables and deleting data does not release space, so you usually have to drop tablespaces and their corresponding datafiles, which is of course much easier if you have partitioned or transportable tablespaces.
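For example, the trim script run from the configure clone hook might look like the sketch below, assuming the big tables are partitioned by date and mapped to yearly tablespaces (all object names are hypothetical):

-- Drop the partitions holding historical data, then release the space by
-- dropping the tablespace and its datafiles.
ALTER TABLE orders DROP PARTITION orders_2016_q1 UPDATE GLOBAL INDEXES;
ALTER TABLE orders DROP PARTITION orders_2016_q2 UPDATE GLOBAL INDEXES;

DROP TABLESPACE orders_2016 INCLUDING CONTENTS AND DATAFILES CASCADE CONSTRAINTS;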

What’s the Recipe?

The steps are relatively straightforward.

  1. Create a VDB on the primary Delphix Engine using a configure clone hook to run a script to trim the VDB
  2. Configure a masking job to execute after the trim process.
  3. Bounce the VDB once provisioned
  4. Create a Replication job with Selective Data Distribution enabled.
  5. Choose the trimmed and masked VDB to be replicated.
  6. Run the replication job.

An additional advantage of this approach is that you need even less storage assigned to the replica Delphix Engine, and you only have to mask the trimmed database, which makes the masking process proportionately faster in line with the reduction in size.

To Summarize

If you have substantial databases you can create full virtual copies of these in production for production support, reporting and ETL.  You can replicate these full virtual copies to a replica engine for full volume functionality and performance testing.

With Delphix Selective Data Distribution Replication you also have the option to replicate secure data subsets for faster functionality testing.

Since a Delphix Replication Engine can be located on-premise or on-cloud you have the choice of what to replicate, how much to replicate and where to replicate.