
WHITEPAPER

A Modern Approach to
Test Data Management
Building a comprehensive solution to solve today’s
biggest application development challenges
Executive Summary
Speed is a critical business imperative for all organizations, regardless of industry. The pace at which enterprises can bring new products and services to market determines their ability to differentiate from competitors and retain market share. Now more than ever, applications are at the center of this race. As enterprises look to deliver high-quality applications at the lowest possible cost, they need to build out a more agile application infrastructure—and that includes a robust and comprehensive test data management (TDM) strategy. Once viewed as a back-office function, TDM is now a critical business enabler for enterprise agility, security, and cost efficiency.

The increasing pace of software development presents new challenges. With the proliferation of DevOps, a heightened focus on automation, and requirements to secure data across global teams of employees and contractors, IT organizations must expand the charter of traditional TDM to meet the needs of today's development and testing teams. This white paper explores the top challenges that IT organizations face when managing test data, and highlights the top evaluative criteria to consider when implementing new technology solutions as part of a TDM strategy.

Understanding Test Data Challenges


Historically, application teams manufactured data for development and testing in a siloed, unstructured fashion. As the volume of
application projects increased, many large IT organizations recognized the opportunity to gain economies of scale by consolidating
TDM functions into a single group or department—enabling them to take advantage of innovative tools to create test data. As increasing
centralization began to yield large efficiency gains, the scope of TDM was expanded to include the use of subsetting, and most
recently, the use of masking to manipulate production data. However, the rise of development methodologies demanding fast,
iterative release cycles has led to a new set of challenges.

Environment Provisioning Is A Slow, Manual, And High-Touch Process

Most IT organizations still use a request-fulfill model, in which developers and testers often find their requests queuing behind others. Because it takes significant time and effort to create a copy of test data, environments often take days or weeks to provision. Not only does this place a strain on operations teams, it also creates time sinks during test cycles, slowing the pace of application delivery and undermining competitive advantage. In the payments industry, for example, nimble technology companies with optimized processes can release applications weeks or months faster than traditional banks with slow IT ticketing systems.

Software Development Teams Lack High-Quality Data

Software development teams often lack access to the right test data. For example, depending on the release version being tested, developers require data sets as of a specific point in time. But all too often, they must work with stale copies of production data due to the complexity of setting up a new test bed. This can result in lost productivity due to time spent resolving data-related issues. According to recent research, developers spend more than 20% of their time on test-data-related activities.¹

Data Masking Is Increasingly Important, But Adds Friction To Release Cycles

For many applications, such as those processing credit card numbers, patient records, or other sensitive information, data masking is critical to ensuring regulatory compliance and protecting against data breach. According to the Ponemon Institute, the cost of a data breach—including the costs of remediation, customer churn, and other losses—averages $3.6 million.² However, masking sensitive data often adds operational overhead; an end-to-end masking process may take an entire week, which can prolong test cycles.

Test Data Requirements And Storage Costs Are Continually Rising

IT organizations create multiple, redundant copies of test data, resulting in inefficient use of storage. To meet concurrent demands within the confines of storage, operations teams must coordinate test data availability across multiple teams, applications, and release versions. As a result, development teams often contend for limited, shared environments, resulting in the serialization of critical application projects.

Top Considerations for a Test Data Management Solution

To address these challenges, IT organizations need to adopt the right tools and processes to efficiently make the right test data available to project teams. A comprehensive approach should seek to improve TDM in each of the following areas:

• DATA DISTRIBUTION: reducing the time to operationalize test data

• DATA QUALITY: fulfilling requirements for high-fidelity test data

• DATA SECURITY: minimizing security risks without compromising agility

• INFRASTRUCTURE COSTS: lowering the costs of storing and archiving test data

The following sections highlight the top evaluative criteria in each of these four areas.

Data Distribution

Making a copy of production data available to a downstream testing environment is often a time-consuming, labor-intensive process involving multiple handoffs between teams. The end-to-end process usually lags demand; at a typical IT organization, delivering a new copy of production data to a non-production environment takes days, weeks, or, in some cases, months.

Organizations looking to improve TDM must build a solution that streamlines this process and creates a path towards fast, repeatable data delivery. Specifically, test data managers should look for solutions that feature:

• AUTOMATION: modern software toolsets already include technologies to automate build processes, source code management, and regression testing. However, organizations often lack equivalent tools for delivering copies of test data with the same level of effortlessness. A streamlined TDM approach must eliminate manual processes—for example, target database initialization, configuration steps, and validation checks—providing a low-touch approach to standing up new data environments.

• TOOLSET INTEGRATION: an efficient TDM approach unites the heterogeneous set of technologies that interact with test datasets along the delivery pipeline, including masking, subsetting, and synthetic data creation. This requires both compatibility across tools and exposed APIs or other clear integration mechanisms. A factory-like approach to TDM that combines tools into a cohesive unit allows for greater levels of automation and eliminates handoffs between different teams.

• SELF SERVICE: by putting sufficient levels of automation and toolset integration in place, test data delivery can be executed via self service, directly by end users. Instead of relying on IT ticketing systems, end users should take advantage of interfaces purpose-built for their needs. Self-service capabilities should extend not just to data delivery, but also to control over test data. For example, developers or testers should be able to bookmark and reset, archive, or share copies of test data without involving operations teams (a minimal sketch follows Figure 1).
A well-orchestrated approach to TDM has the potential to transform the overall application development process. Slashing the wait
time for data means testers can execute more test cases earlier in the software development lifecycle (SDLC), enabling them to
identify defects when they are easier and less expensive to fix.

Figure 1: Testing in a traditional scenario (A) vs. a scenario with an optimized TDM approach (B).
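To make the self-service and bookmarking capabilities described above more concrete, the sketch below shows what a developer-facing workflow could look like if a TDM platform exposed a simple REST API. It is a minimal sketch only: the base URL, endpoint paths, payload fields, and helper names are assumptions for illustration, not the interface of any particular product.

```python
# Minimal sketch of self-service test data operations against a hypothetical
# TDM platform REST API. The base URL, endpoint paths, and payload fields are
# illustrative assumptions, not the API of any specific product.
import requests

BASE_URL = "https://tdm.example.com/api"   # hypothetical endpoint
session = requests.Session()
session.headers.update({"Authorization": "Bearer <token>"})


def provision(source: str, target_env: str, point_in_time: str) -> str:
    """Request a copy of a data source for a target environment, as of a timestamp."""
    resp = session.post(f"{BASE_URL}/datasets", json={
        "source": source,
        "target": target_env,
        "pointInTime": point_in_time,
    })
    resp.raise_for_status()
    return resp.json()["datasetId"]


def bookmark(dataset_id: str, label: str) -> str:
    """Capture the current state of a dataset so it can be reset to or shared later."""
    resp = session.post(f"{BASE_URL}/datasets/{dataset_id}/bookmarks",
                        json={"label": label})
    resp.raise_for_status()
    return resp.json()["bookmarkId"]


def reset(dataset_id: str, bookmark_id: str) -> None:
    """Roll a dataset back to a bookmarked state after a destructive test run."""
    session.post(f"{BASE_URL}/datasets/{dataset_id}/reset",
                 json={"bookmarkId": bookmark_id}).raise_for_status()


if __name__ == "__main__":
    ds = provision("billing_prod", "qa-env-07", "2018-03-01T00:00:00Z")
    bm = bookmark(ds, "pre-regression-baseline")
    # ... run a destructive test cycle, then restore the baseline in minutes ...
    reset(ds, bm)
```

A workflow along these lines lets a tester provision data once, bookmark a known-good baseline, and reset to it between runs without opening a ticket.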

Data Quality

TDM teams go through great efforts to make the right types of test data—such as masked production data or synthetic datasets—available to software development teams. As TDM teams balance requirements for different types of test data, they must also ensure data quality is preserved across three key dimensions:

• DATA AGE: due to the time and effort required to prepare test data, operations teams are often unable to meet a number of ticket requests. As a result, data often becomes stale in non-production, which can impact the quality of testing and result in costly, late-stage errors. A TDM approach should aim to reduce the time it takes to refresh from a gold copy, making the latest test data more accessible. In addition, the latest production data should be readily available in minutes in the event that it is needed for triage.

• DATA ACCURACY: a TDM process can become challenging when multiple datasets are required as of a specific point in time for systems integration testing. For instance, testing a procure-to-pay process might require that data is federated across CRM, inventory management, and financial applications. A TDM approach should allow for multiple datasets to be provisioned to the same point in time and simultaneously reset to quickly validate complicated end-to-end functional testing scenarios (see the sketch below).

• DATA SIZE: due to storage constraints, developers must often work with subsets of data, which aren't likely to satisfy all functional testing requirements. The use of subsets can result in missed test case outliers, which can paradoxically increase rather than decrease project costs due to data-related errors. In an optimized strategy, full-size test data copies can be provisioned in a fraction of the space of subsets by sharing common data blocks across copies. As a result, TDM teams can reduce the operational costs of subsetting—both in terms of data preparation and error resolution—by reducing the need to subset data as frequently.

The end-to-end process of manipulating and operationalizing test data introduces complexities that often undermine data quality. As demonstrated in the following case study, implementing the proper TDM tools can optimize this process, resulting in massive improvements in data quality across all three dimensions.
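The point-in-time requirement described under DATA ACCURACY can be illustrated with a short sketch: if test data can be provisioned as of an arbitrary timestamp, an integration test bed is simply every participating source provisioned to the same instant. The source names below are hypothetical, and provision() is a stand-in for the self-service call shown in the earlier sketch.

```python
# Sketch: provisioning several sources to one shared point in time for systems
# integration testing. Source names are illustrative; provision() is a stand-in
# for the hypothetical self-service call shown in the earlier sketch.
from datetime import datetime, timezone

# Every federated source must be provisioned "as of" the same instant.
CUTOFF = datetime(2018, 3, 1, tzinfo=timezone.utc).isoformat()

SOURCES = ["crm_prod", "inventory_prod", "financials_prod"]  # hypothetical names


def provision(source: str, target_env: str, point_in_time: str) -> str:
    """Placeholder for the hypothetical provisioning call from the earlier sketch."""
    print(f"provisioning {source} -> {target_env} as of {point_in_time}")
    return f"{source}@{point_in_time}"


# Provision all three applications' data to the same instant so that related
# records (orders, shipments, invoices) line up across systems during testing.
datasets = {src: provision(src, "sit-env-01", CUTOFF) for src in SOURCES}
```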
Case Study
Consider the following TDM process at a large health plan provider:

1. First, a developer requests a copy of masked production data.

2. Once approved, a DBA creates a backup of production and then transfers and imports the backup copy to a staging server.

3. A test data manager validates the row count, any new PHI fields, and data structures.

4. The test data manager updates the subsetting artifacts, creates a subset, and executes the masking process.

5. The test data manager validates the row count, sends out an email update, and performs unit testing.

6. The DBA exports the masked data, transfers it to non-production, and updates the gold copy.

7. The DBA performs a backup and restore operation into the target Dev environment.

8. A second backup and restore process is performed to load data into a QA environment.

Figure 2: Example of a test data management process at a large health plan provider.
End-to-end, this process takes seven days. In an optimized process leveraging an integrated test data management platform, masked
data can be prepared in less than two days:

1. A test data platform automatically and non-disruptively remains in sync with production, providing continuous access to the latest
data and eliminating the need to perform a backup.

2. An admin restores data to the clean room in minutes.

3. An admin profiles the data and automatically assigns repeatable masking algorithms. If required, the admin subsets the data beforehand.

4. After masking is complete, the admin tests changes and validates referential integrity, with the ability to quickly roll back to the initial unmasked state.

5. An admin efficiently and securely replicates masked data to non-production.

6. Instead of being updated or replaced, the existing gold copy remains as an archive in a centralized repository, where it is compressed to a fraction of the size of production.

7. Developers access masked data via self service in minutes instead of performing a manual backup and restore process.

8. QA engineers branch their own copies of development and begin testing in minutes.

Figure 3: Example of an optimized test data management process.


Data Security

Masking tools have emerged as the de facto standard for protecting test data. By irreversibly replacing sensitive data with fictitious yet realistic values, masking can ensure regulatory compliance and completely neutralize the risk of data breach in test environments. But to make masking practical and effective, organizations must consider the following requirements:

• END-TO-END REPEATABILITY: many organizations fail to adequately mask test data because added process overhead deters them from applying masking everywhere they should. However, solutions with out-of-the-box capabilities to orchestrate a complete masking process—identifying sensitive data, applying masking to that data, and then auditing the resulting test dataset—can minimize coordination and configuration efforts.

• NO NEED FOR DEVELOPMENT EXPERTISE: organizations should look for lightweight masking tools that can be set up without scripting or specialized development expertise. Tools with fast, predefined masking algorithms, for example, can dramatically reduce the complexity and resource requirements that stand in the way of consistently applying masking.

• INTEGRATED MASKING AND DISTRIBUTION: masking processes should be tightly coupled with a data delivery mechanism. Instead of relying on separate workflows for masked data and unmasked data, an integrated, platform-based approach lends itself to greater standardization of masking as a security precaution, and helps ensure that masked data can be delivered wherever it's needed. For example, many organizations will benefit from an approach that allows them to mask data in a secure zone and then easily deliver that secure data to targets in non-production environments, including those in offsite data centers or in private or public clouds.

While data masking tools eliminate the risk of exposing sensitive information in testing environments—which can represent the majority of the surface area of risk—organizations must integrate them into TDM workflows without compromising speed and simplicity objectives.
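As an illustration of what a repeatable masking algorithm means in practice, the sketch below uses a keyed hash so that the same real value always maps to the same fictitious value, which is what keeps joins between masked tables intact. It is a standalone sketch under stated assumptions, not the algorithm of any particular masking tool.

```python
# Minimal sketch of a repeatable (deterministic) masking rule as a standalone
# script, not the algorithm of any particular masking product. The same input
# always maps to the same fictitious value, so joins between masked tables
# still match after masking.
import hashlib
import hmac

SECRET_KEY = b"rotate-me-and-keep-out-of-source-control"  # illustrative only

FIRST_NAMES = ["Alex", "Blake", "Casey", "Drew", "Emery", "Finley"]
LAST_NAMES = ["Anders", "Brooks", "Carver", "Dalton", "Ellis", "Foster"]


def _bucket(value: str, size: int) -> int:
    """Map a value to a stable bucket using a keyed hash: consistent across
    runs and tables, and not reversible without the key."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big") % size


def mask_name(real_name: str) -> str:
    """Replace a real name with a fictitious but realistic one, deterministically."""
    first = FIRST_NAMES[_bucket(real_name, len(FIRST_NAMES))]
    last = LAST_NAMES[_bucket(real_name[::-1], len(LAST_NAMES))]
    return f"{first} {last}"


# The same customer masks identically wherever they appear, so a CRM row and a
# billing row that reference the same person still join after masking.
assert mask_name("Jane Example") == mask_name("Jane Example")
```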
Infrastructure Costs

With the rapid proliferation of test data, TDM teams must build a toolset that maximizes the efficient use of infrastructure resources. Specifically, a TDM toolset should meet the following criteria:

• DATA CONSOLIDATION: it is not uncommon for organizations to maintain non-production environments in which 90% of the data is redundant. A TDM approach should aim to curb storage costs by sharing common data across environments—including those used not only for testing, but also development, reporting, production support, and other use cases.

• DATA ARCHIVING: according to Bloor Research, as many as 6 versions of a gold copy should be archived for testing different versions of an application.³ A TDM approach should make it feasible to maintain libraries of test data by optimizing storage use and enabling fast retrieval. Data libraries should also be automatically version-controlled, much as tools like Git version-control code.

• ENVIRONMENT UTILIZATION: at most IT organizations, projects are serialized due to contention for environments. Paradoxically, at the same time, environments are often underutilized due to the time it takes to populate an environment with new test data. A TDM solution should decouple data from blocks of computing resources through intelligent use of "bookmarking." Bookmarked datasets can be loaded into environments on demand, making it easier for developers and testers to effectively timeshare environments. As a result, an optimized TDM strategy can eliminate contention while achieving up to 50% higher utilization of environments.

By optimizing storage and improving the elasticity of test data, TDM teams can move to a model where infrastructure is no longer a limiting cost factor in application development.
Building a Comprehensive Toolset
No single technology exists that fulfills all TDM requirements. Rather, teams must build an integrated solution that provides all the data types required to meet a diverse set of testing needs. Once test data requirements have been identified, a successful TDM approach should aim to improve the distribution, quality, security, and cost of the various types of test data. Table 1 examines these criteria across the most common types of test data.

• PRODUCTION DATA provides the most complete test coverage, but it usually comes at the expense of agility and storage costs. For some applications, it can also mean exposing sensitive data.

• SUBSETS OF PRODUCTION DATA are significantly more agile than full copies. They can provide some savings on hardware, CPU, and licensing costs, but it can be difficult to achieve sufficient test coverage. Sensitive data is often still exposed using subsets.

• MASKED PRODUCTION DATA (either full sets or subsets) makes it possible for development teams to use real data without introducing unsafe levels of risk. However, masking processes can drag on data delivery. Also, masking requires staging environments with additional storage and staff to ensure referential integrity after data is transformed.

• SYNTHETIC DATA circumvents security issues, but the space savings are limited. While synthetic data might be required to test new features, this represents only a relatively small percentage of test cases. If performed manually, creating test data is also prone to human error and requires an in-depth understanding of data relationships, both those within the database schema or file system and those implicit in the data itself (a minimal sketch follows this list).
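Synthetic data is typically generated with scripted rules rather than by hand, which reduces the human-error risk noted above but still cannot capture every relationship implicit in real production data. A minimal sketch follows, assuming the open-source Faker library and a hypothetical patient schema.

```python
# Sketch of rule-based synthetic data generation, assuming the open-source
# Faker library (pip install faker). The "patients" schema is hypothetical.
import csv

from faker import Faker

fake = Faker()
Faker.seed(42)  # seed so the generated fixture is reproducible across runs

with open("synthetic_patients.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["patient_id", "name", "date_of_birth", "email", "city"])
    for patient_id in range(1, 101):
        writer.writerow([
            patient_id,
            fake.name(),
            fake.date_of_birth(minimum_age=18, maximum_age=90).isoformat(),
            fake.email(),
            fake.city(),
        ])
```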

Test Data Type vs. Criteria | Distribution | Quality | Security | Cost
Production Data | Slow, manual access | Good test coverage | Sensitive data at risk | High consumption of storage, CPU, and licenses
Subset of Production | Less time to provision than full copies | Missed test case outliers | Sensitive data at risk | Some storage, CPU, and license savings
Masked Data (Full or Subset) | Extended SLAs for masked data | Must ensure referential integrity | Improved data privacy and security | Requires masking software or custom scripting and a staging server
Synthetic Data | Manual process | Limited to a small percentage of test requirements | Data de-identification not required | Limited storage savings

Legend: Fails to meet criteria / Partially meets criteria / Meets criteria

Table 1: Evaluating Common Required Test Data Types.


One technology that can improve the distribution, quality, security, and cost of all of the above types of test data is data virtualization.
Virtual data—which can be a copy of any data stored in a relational database or file system—shares common data blocks across copies,
enabling space savings of up to 90%. Virtual data also enables rapid provisioning of test data, as of any point in time, via self service.
As a result, TDM teams can accelerate the rate at which they can not only manipulate test data, but also operationalize the rollout of
test data to software development teams.
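The block-sharing idea behind those space savings can be shown with a toy model: a gold snapshot stores the data once, and each virtual copy records only the blocks it changes. The sketch below is a conceptual illustration of copy-on-write, not the internals of any specific data virtualization product.

```python
# Conceptual sketch of copy-on-write block sharing, the idea behind virtual
# data. This is an illustration only, not the internals of any specific
# data virtualization product.

# A "gold" snapshot of 1,000 data blocks (integers stand in for fixed-size
# storage blocks).
gold_snapshot = {block_id: 0 for block_id in range(1000)}


class VirtualCopy:
    """A full-size logical copy that physically stores only the blocks it changes."""

    def __init__(self, base: dict):
        self.base = base
        self.delta = {}  # block_id -> new contents, private to this copy

    def write(self, block_id: int, contents: int) -> None:
        self.delta[block_id] = contents  # copy-on-write: only changed blocks use space

    def read(self, block_id: int) -> int:
        return self.delta.get(block_id, self.base[block_id])


# Ten developers each get a full-size "copy" but modify only about 1% of blocks.
copies = [VirtualCopy(gold_snapshot) for _ in range(10)]
for i, copy in enumerate(copies):
    for block_id in range(i * 10, i * 10 + 10):
        copy.write(block_id, i + 1)

physical_blocks = len(gold_snapshot) + sum(len(c.delta) for c in copies)
logical_blocks = len(gold_snapshot) * (1 + len(copies))
print(f"physical storage used: {physical_blocks / logical_blocks:.0%} of full copies")
# -> roughly 10%, i.e. about 90% space savings versus eleven full physical copies
```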

Case Study
One Fortune 500 financial services institution investigated the use of virtual data for the development and testing of an online platform
that provides market insights to clients and enables them to make smarter financial decisions. The investigation was triggered by
massive platform growth: over the span of a few years, financial data had doubled, usage had tripled, and feature development effort
had quadrupled. IT was struggling to keep pace with the exploding storage costs and missed several releases due to slow environment
provisioning. Moreover, a large percentage of bugs was discovered in late stage user-acceptance testing, which risked impacting
customer experience.

For both Oracle and MS SQL production data sources, the firm’s IT organization implemented a data virtualization technology with
built-in masking. The solution was deployed in a matter of weeks, and the results were immediate. For instance, rather than waiting a full
day for a DBA team to restore an environment after a 20-minute test run, QA engineers leveraged secure virtual data to initiate a
10-minute reset process that brings the environment back to a bookmarked state. Less waiting enabled QA teams to execute more
test cycles earlier in the SDLC—a “shift left” in testing. Ultimately, this led QA teams to discover and resolve errors when they were
easier and less expensive to fix. The firm estimated that they reduced overall defect rates by 60 percent and improved productivity
across 800+ developers and testers by 25 percent. They also dramatically reduced storage requirements by almost 200 TB, enabling
them to accommodate massive platform growth without expanding their existing infrastructure.

1. “Transforming Test Data Management for Increased Business Value.” Cognizant 20-20 Insights, March 2013.
2. “Exploring Successful Approaches to Test Data Management.” Bloor Research, July 2012.
3. “Exploring Successful Approaches to Test Data Management.” Bloor Research, July 2012.

ABOUT DELPHIX
Delphix’s mission is to free companies from data friction and accelerate innovation. Fortune 100 companies use the Delphix Dynamic Data Platform
to connect, virtualize, secure and manage data in the cloud and in on-premise environments. For more information visit www.delphix.com.

© 2018 Delphix Corp. All rights reserved. | delphix.com | [email protected] | +1.650.494.1645


3-2018
