DATA PRIVACY
Principles and Practice

Nataraj Venkataramanan
Ashwin Shriram

A Chapman & Hall Book
www.crcpress.com

The digital revolution has taken its toll on our data's privacy. Every agency we transact with has captured our identity. Our personal data is available so abundantly and in the custody of so many handlers that it appears we are just one disgruntled employee's vengeful act away from losing our privacy. External adversaries are using publicly available information to infer our identity. Outsourcing of services is compromising our data across borders. Big Data Analytics are nudging us to share our private data on various devices and social networking platforms, while analysts accumulate our behavioral nuances to take advantage of them. With this ocean of our personally identifiable information (PII) available publicly, how do we remain anonymous?

Data handlers are primarily the ones to answer this question while also complying with various privacy regulations. Are they skilled at handling this? This book provides a complete reference for personal data identification, classification, anonymization, and usage in various application contexts. Privacy principles in the book guide practitioners to create an anonymization design specific for each data type and application. Examples drawn from healthcare, banking, financial services, and retail demonstrate how these principles can lead to effective practice and result in a compliant business that respects customer privacy.

These topics will be useful for students of privacy from academia as well as industry.
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has
not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmit-
ted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented,
including photocopying, microfilming, and recording, or in any information storage or retrieval system,
without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.
com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood
Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and
registration for a variety of users. For organizations that have been granted a photocopy license by the CCC,
a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.
Contents

Preface
Acknowledgments
Authors
List of Abbreviations
Preface
The ubiquity of computers and smart devices on the Internet has led to vast
amounts of data being collected from users, shared, and analyzed. Most of
these data are collected from customers by enterprises dealing with bank-
ing, insurance, healthcare, financial services, retail, e-commerce, manufac-
turing, and social networks. These data consist of transactional data, search
data, and health and financial data, almost always including a lot of personal
data. On the one hand, these data are considered valuable for the enterprise
as they can be used for a variety of purposes, such as knowledge discov-
ery, software application development, and application testing. At the same
time, these data are considered to be sensitive as they contain their custom-
ers’ personal information. Therefore, sharing, offshoring, or outsourcing of
such data for purposes like data mining or application testing should ensure
that customers’ personal information is not compromised in any way. Many
countries have stringent data protection acts, such as the Health Insurance
Portability and Accountability Act of 1996 (HIPAA) in the United States, the
EU Data Protection Act, and the Swiss Data Protection Act, which mandate
high standards of data privacy. In this context, data privacy, which pertains
to protecting the identity of the customers, is a high priority for all enterprises
as any data privacy loss would result in legal issues with hefty fines, erosion
of customer confidence, and customer attrition. To this effect, data privacy as
a subject of study is being introduced in some universities at the postgradu-
ate level. Various certifications like PrivacyTrust and CIPP also exist, which
endorse an individual’s knowledge of privacy laws.
Data privacy as a subject of study and practice is relatively young; in par-
ticular, many of the techniques used for data protection are still evolving.
Currently, many companies are adopting these data privacy techniques as
they try to leverage the opportunities provided by offshoring and outsourc-
ing while at the same time complying with regulatory requirements. This
book provides comprehensive guidance on the implementation of many of
the data privacy techniques in a variety of applications. An enterprise’s data
architecture consists of a wide variety of data structures like multidimen-
sional data, also known as relational data, which are the most widely used
data structure, and complex data structures like transaction data, longitudinal
data, time series data, graph data, and spatiotemporal data. Multidimensional
data are simple in structure, and a rich set of anonymization algorithms
are in use currently for such data, but anonymization techniques for other
complex data structures are still evolving. Chapters 2 through 4 attempt to
provide detailed coverage of the various anonymization approaches for these data structures.
1 Introduction to Data Privacy

1.1 Introduction
Organizations dealing with banking, insurance, retail, healthcare, and
manufacturing across the globe collect large amounts of data about their
customers. This is a valuable asset to the organizations as these data can
be mined to extract a lot of insights about their customers. For example,
mining these data can throw light on customers' spending and buying behavior, credit card usage, investment patterns, and health issues, to name a few. This
information is used by companies to provide value-added services to their
customers, which in turn results in higher revenue and profit. But these data
might contain customers’ personal identification information, and when in
the hands of a data snooper, they can be exploited.
Large companies across the globe outsource their IT and business pro-
cess work to service providers in countries like India, China, Brazil, etc. The
outsourced work may involve application maintenance and development,
testing, data mining/analysis, statistical analysis, etc. Business applications
contain sensitive information, such as personal or financial and health-
related data. Sharing such data can potentially violate individual privacy and
lead to financial loss for the company. Serious concerns have been expressed by the general public about exposing person-specific information. Data leakage, the intentional or accidental exposure of sensitive information, is becoming a major security concern.
An IDC survey [19] claims that data leakage is the number one threat,
ranked higher than viruses, Trojan horses, and worms. To address the privacy
of an individual’s data, governments across the globe have mandated regula-
tions that companies have to adhere to: HIPAA (Health Insurance Portability and Accountability Act) in the United States, FIPPA (Freedom of Information and Protection of Privacy Act) in Canada, the Sarbanes–Oxley Act, the Video Privacy Protection Act, the Universal Declaration of Human Rights, and the EU's Data Protection Directive are just a few examples. Companies need to look into methods and tools to anonymize sensitive data. Data anonymization techniques have been the subject of intense investigation in recent years for many kinds of structured data, including tabular, transactional, and graph data.
FIGURE 1.1
Data privacy—stakeholders in the organization. (The figure shows the customer/record owner; the company (banks, insurance, healthcare, or retail) with its production databases and anonymized databases; the data anonymizer; data analysts; testers; business operations employees; the adversary/data snooper; and the government, which imposes compliance and regulations.)
Data anonymizer: A person who anonymizes and provides data for analysis
or as test data.
Data analyst: This person uses the anonymized data to carry out data mining
activities like prediction, knowledge discovery, and so on. Following govern-
ment regulations, such as the Data Moratorium Act, only anonymized data
can be used for data mining. Therefore, it is important that the provisioned
data support data mining functionalities.
Tester: Outsourcing of software testing is common among many companies.
High-quality testing requires high-quality test data, which is present in pro-
duction systems and contains customer-sensitive information. In order to test
the software system, the tester needs data to be extracted from production
systems, anonymized, and provisioned for testing. Since test data contain
customer-sensitive data, it is mandatory to adhere to regulatory compliance
in that region/country.
Business operations employee: Data analysts and software testers use anonymized data that are at rest or static, whereas business operations employees access production data because they need to support customers' business requirements. Business operations are generally outsourced to BPO (business process outsourcing) companies. In this case, too, there is a requirement to protect customer-sensitive data, but as this operation is carried out at run-time, a different set of data protection techniques is required to protect data from business operations employees.
Adversary/data snooper: Data are precious and their theft is very common.
An adversary can be internal or external to the organization. The anonymiza-
tion design should be such that it can thwart an adversary’s effort to identify
a record owner in the database.
Companies spend millions of dollars to protect the privacy of customer
data. Why is it so important? What constitutes personal information? Personal
information consists of name, identifiers like social security number, geo-
graphic and demographic information, and general sensitive information, for
example, financial status, health issues, shopping patterns, and location data.
Loss of this information means loss of privacy—one’s right to freedom from
intrusion by others. As we will see, protecting one’s privacy is nontrivial.
Would you want others to know your health issues or financial status? All these are sensitive data and should be well protected, as they could fall into the wrong hands and be exploited. Let us look at a sample bank customer table and an account table. The customer table taken as such contains nothing confidential, as most of the information in it is also available in the public voters database and on social networking sites like Facebook. Sensitivity arises when the customer table is combined with the account table. A logical representation of Tables 1.1 and 1.2 is shown in Table 1.3.
Data set D in the tables contains four disjoint subsets of data:

1. Explicit identifiers (EI)
2. Quasi-identifiers (QI)
3. Sensitive data (SD)
4. Nonsensitive data
TABLE 1.1
Customer Table
(Explicit identifiers: ID, First Name. Quasi-identifiers: DOB, Gender, Address, Zip Code, Phone.)

ID   First Name   DOB    Gender   Address         Zip Code   Phone
1    Ravi         1970   Male     Fourth Street   66001      92345-67567
2    Hari         1975   Male     Queen Street    66011      98769-66610
3    John         1978   Male     Penn Street     66003      97867-00055
4    Amy          1980   Female   Ben Street      66066      98123-98765
TABLE 1.2
Account Table
(Sensitive data: Account Number, Account Type, Account Balance, Credit Limit. The Nonsensitive Data column carries no values in this example.)

ID   Account Number   Account Type   Account Balance   Credit Limit
1    12345            Savings        10,000            20,000
2    23456            Checking        5,000            15,000
3    45678            Savings        15,000            30,000
4    76543            Savings        17,000            25,000
TABLE 1.3
Logical Representation of Customer and Account Tables
(Explicit identifiers: ID, Name. Quasi-identifiers: DOB, Gender, Address, Zip Code. Sensitive data: Account Number, Account Type, Account Balance, Credit Limit.)

ID   Name   DOB    Gender   Address         Zip Code   Account Number   Account Type   Account Balance   Credit Limit
1    Ravi   1970   Male     Fourth Street   66001      12345            Savings        10,000            20,000
2    Hari   1975   Male     Queen Street    66011      23456            Checking        5,000            15,000
3    John   1978   Male     Penn Street     66003      45678            Savings        15,000            30,000
4    Amy    1980   Female   Ben Street      66066      76543            Savings        17,000            25,000
The first two data sets, the EI and QI, uniquely identify a record owner, and when combined with sensitive data they become sensitive or confidential. The data set D is considered as a matrix of m rows and n columns. Matrix D is a vector space where each row and column is a vector:

D = [D_EI][D_QI][D_SD]    (1.1)

Each of the data sets EI, QI, and SD is a matrix with m rows and i, j, and k columns, respectively. We need to keep an eye on the index j (representing QI), which plays a major role in keeping the data confidential.
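To make this notation concrete, the following sketch partitions a toy customer–account table modeled on Table 1.3 into its EI, QI, and SD column groups. It is a minimal illustration only; the column names, values, and the use of the pandas library are assumptions made for this example.

```python
# A minimal sketch, assuming pandas and a column layout modeled on Table 1.3.
import pandas as pd

# Toy data set D with m = 4 rows (values are illustrative).
D = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "Name": ["Ravi", "Hari", "John", "Amy"],
    "DOB": [1970, 1975, 1978, 1980],
    "Gender": ["Male", "Male", "Male", "Female"],
    "ZipCode": ["66001", "66011", "66003", "66066"],
    "AccountNumber": ["12345", "23456", "45678", "76543"],
    "AccountBalance": [10000, 5000, 15000, 17000],
})

# Disjoint column groups: explicit identifiers, quasi-identifiers, sensitive data.
EI_COLS = ["ID", "Name"]
QI_COLS = ["DOB", "Gender", "ZipCode"]           # the index j in the text refers to these columns
SD_COLS = ["AccountNumber", "AccountBalance"]

D_EI, D_QI, D_SD = D[EI_COLS], D[QI_COLS], D[SD_COLS]
print(D_QI.shape)   # (m rows, j columns) -> (4, 3)
```

Here the QI block D_QI has m = 4 rows and j = 3 columns, matching the notation used in the text.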
Apart from assuring their customers’ privacy, organizations also have
to comply with various regulations in that region/country, as mentioned
earlier. Most countries have strong privacy laws to protect citizens’ per-
sonal data. Organizations that fail to protect the privacy of their customers
or do not comply with the regulations face stiff financial penalties, loss of
reputation, loss of customers, and legal issues. This is the primary reason
organizations pay so much attention to data privacy. They find themselves
in a Catch-22 as they have huge amounts of customer data, and there is a
compelling need to share these data with specialized data analysis com-
panies. Most often, data protection techniques, such as cryptography and
anonymization, are used prior to sharing data. In this book, we focus only
on anonymization.
Anonymization is a process of logically separating the identifying
information (PII) from sensitive data. Referring to Table 1.3, the anony-
mization approach ensures that EI and QI are logically separated from
SD. As a result, an adversary will not be able to easily identify the record
owner from his sensitive data. This is easier said than done. How to
effectively anonymize the data? This is the question we explore through-
out this book.
TABLE 1.4
Example of Anonymity
(Personal identity: SSN, Name, DOB, Gender, Address, Zip Code. Sensitive data: Account Number, Account Type, Account Balance, Credit Limit.)

SSN   Name   DOB   Gender   Address   Zip Code   Account Number   Account Type   Account Balance   Credit Limit
X     X      X     X        X         X          –                –              –                 –
X     X      X     X        X         X          –                –              –                 –
X     X      X     X        X         X          –                –              –                 –
X     X      X     X        X         X          –                –              –                 –

Note: X, identity is protected.
TABLE 1.5
Example of Privacy
(Personal identity: SSN, Name, DOB, Gender, Address, Zip Code. Sensitive data: Account Number, Account Type, Account Balance, Credit Limit.)

SSN   Name   DOB   Gender   Address   Zip Code   Account Number   Account Type   Account Balance   Credit Limit
–     –      –     –        –         –          X                X              X                 X
–     –      –     –        –         –          X                X              X                 X
–     –      –     –        –         –          X                X              X                 X
–     –      –     –        –         –          X                X              X                 X

Note: X, sensitive data are protected.
Quasi-identifiers such as date of birth, gender, and zip code have the capacity to uniquely identify individuals.
viduals. Combine that with SD, such as income, and a Warren Buffet or Bill
Gates is easily identified in the data set. By de-identifying, the values of QI
are modified carefully so that the relationship is till maintained by identities
cannot be inferred.
In Equation 1.1, the original data set is D, which is anonymized, resulting in data set D′ = T(D) = T([D_EI][D_QI][D_SD]), where T is the transformation function. As a first step in the anonymization process, EI is completely masked and is no longer relevant in D′. As mentioned earlier, no transformation is applied to SD, and it is left in its original form. This results in D′ = T([D_QI]), which means that the transformation is applied only to QI, as EI is masked and not considered part of D′ and SD is left in its original form. D′ can be shared because, with QI transformed and SD in its original form, it is very difficult to identify the record owner. Coming up with the transformation function is key to the success of anonymization design, and this is nontrivial. We spend a lot of time on anonymization design, which is generally applied to static data or data at rest.
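As a rough illustration of such a transformation function, the sketch below masks EI, generalizes QI, and leaves SD untouched. The specific generalization rules (zip code truncation, bucketing of DOB into decades) are illustrative assumptions, not the design recommended by later chapters.

```python
# A minimal, illustrative sketch of D' = T(D): mask EI, generalize QI, keep SD as-is.
# The generalization rules below are assumptions chosen for this example only.
import pandas as pd

def anonymize(D: pd.DataFrame) -> pd.DataFrame:
    D_prime = D.copy()
    # Step 1: explicit identifiers are completely masked.
    for col in ["ID", "Name"]:
        D_prime[col] = "XXXX"
    # Step 2: quasi-identifiers are generalized.
    D_prime["ZipCode"] = D_prime["ZipCode"].str[:3] + "**"          # 66001 -> 660**
    D_prime["DOB"] = (D_prime["DOB"] // 10 * 10).astype(str) + "s"  # 1975 -> 1970s
    # Step 3: sensitive data (the account columns) are left in their original form.
    return D_prime

# D_prime = anonymize(D)   # e.g., using the toy D from the earlier sketch
```

Applied to the toy data set from the previous sketch, this produces a D′ in which no row carries a name or exact QI values, while the account balances remain usable.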
The other scenario is protecting SD, as shown in Table 1.5, which is applied
on data in motion. The implementation of this is also very challenging.
There is a dichotomy here: organizations take utmost care in protecting the privacy of their customers' data, but the same customers provide a whole lot of personal information when they register on social network sites like Facebook (of course, many of the fields are not mandatory, but most people do provide sufficient personal information), including address, phone numbers, date of birth (DOB), details of education and qualifications, work experience, etc. Sweeney [2] reports that zip code, DOB, and gender are sufficient to uniquely identify 83% of the population in the United States. With the amount of PII available on social networking sites, a data snooper with some background knowledge could use the publicly available information to re-identify customers in corporate databases.
In the era of social networks, de-identification becomes highly challenging.
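The following sketch illustrates such a linkage attack in its simplest form: a hypothetical "public" table (for example, profile data scraped from a social network or a voter roll) is joined with a released table on the QI columns zip code, DOB, and gender. All names and values are invented for the example.

```python
# A minimal sketch of a linkage (re-identification) attack, assuming pandas.
# Both tables and all their values are hypothetical.
import pandas as pd

# Released data: EI removed, QI left intact, SD in the clear.
released = pd.DataFrame({
    "ZipCode": ["66001", "66011"],
    "DOB": [1970, 1975],
    "Gender": ["Male", "Male"],
    "AccountBalance": [10000, 5000],
})

# Publicly available data (e.g., a voter roll or a social-network profile dump).
public = pd.DataFrame({
    "Name": ["Ravi", "Hari"],
    "ZipCode": ["66001", "66011"],
    "DOB": [1970, 1975],
    "Gender": ["Male", "Male"],
})

# Joining on the quasi-identifiers re-attaches names to sensitive data.
reidentified = public.merge(released, on=["ZipCode", "DOB", "Gender"])
print(reidentified[["Name", "AccountBalance"]])
```

Even though the released table contains no explicit identifiers, the join re-attaches names to account balances for every record whose QI combination is unique.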
FIGURE 1.2
Sensitive data in the enterprise. (The figure depicts sensitive data in the enterprise being shared for use cases such as business process outsourcing (BPO, Use Case 3), with external auditors, and as historic data.)
Figure 1.2 sets the context for this book, illustrating the importance of enterprise-wide sensitive data protection. PII is found in all layers of the architecture: business processes, applications, operational data stores (ODSs), data warehouses for historic data, time series databases, file servers for unstructured data,
and reports. Whether data are shared with internal departments or external
vendors, it is critical to protect the privacy of customer data. It is important to
have an enterprise-wide privacy preservation design strategy to address the
heterogeneity in data sources, data structures, and usage scenarios.
What are the relevant use cases for which data are shared by organizations?
Let us now explore for what purposes the data are shared and how.
The first two use cases fall under the first category and the rest under the
second category. One important aspect to note here—without privacy pres-
ervation none of these use cases is feasible. Figure 1.3 illustrates this concept.
A brief coverage of some of the use cases is provided here, and they are dealt
with in detail in dedicated chapters in the book.
FIGURE 1.3
Use cases for privacy preservation. (Enterprise data pass through a privacy preservation layer before being used for association mining, classification, clustering, test data, and business operations.)
FIGURE 1.4
Privacy preserving test data manufacturing. (A subset of test data is extracted from the database, anonymized, and supplied to the software application.)
But access to a customer's trade account during these processes would expose a lot of sensitive information; this is not acceptable to the customer, and the regulations in the country will not permit such access. Therefore, these data have to be protected. But the questions are how, and what data are to be protected? A technique known as tokenization is used here, wherein the sensitive data that should not be seen by the BPO employee are replaced with a token. This token has no relationship with the original data, and outside the context of the application the token has no meaning at all. All of this is executed at run-time. Like privacy preserving data mining and test data management, protection of sensitive data is a subject in itself and is covered in depth in this book.
Tokenization is a technique that replaces the original sensitive data with nonsensitive placeholders referred to as tokens. The fundamental difference between tokenization and the other techniques is that in tokenization, the original data are completely replaced by a surrogate that has no connection to the original data. Tokens have the same format as the original data. As tokens are not derived from the original data, they exhibit very powerful data protection features. Another interesting property of tokens is that, although a token is usable within its native application environment, it is completely useless elsewhere. Therefore, tokenization is ideal for protecting sensitive identifying information.
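The sketch below captures the core idea with a toy in-memory token vault: each sensitive value is replaced by a random surrogate of the same length and character class, and only the vault, which stays inside the application's trust boundary, can map a token back to the original value. Real tokenization products are far more elaborate; this is only a conceptual illustration.

```python
# A minimal, illustrative token vault: random, format-preserving surrogates
# with no mathematical relationship to the original values.
import secrets

class TokenVault:
    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        if value in self._value_to_token:            # reuse the token for repeated values
            return self._value_to_token[value]
        # Generate a random digit string of the same length as the original value.
        while True:
            token = "".join(secrets.choice("0123456789") for _ in value)
            if token not in self._token_to_value:
                break
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        return self._token_to_value[token]

vault = TokenVault()
t = vault.tokenize("4111111111111111")   # e.g., a card or account number (hypothetical)
print(t)                                  # same format, no relation to the original
print(vault.detokenize(t))                # only the vault can reverse the mapping
```

Because tokens are generated randomly rather than derived from the values, a token observed outside the application context reveals nothing about the original data.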
FIGURE 1.5
Privacy versus utility map. (Privacy is plotted against utility, each on a scale from 0 to 1; optimum privacy and optimum utility lie at opposite ends of the map.)
TABLE 1.6
Original Table with Strong Correlation between QI and SD
Name Zip Code Gender Income
Chen 56001 Male 25K
Jenny 56015 Female 8K
Alice 56001 Female 30K
Ram 56011 Male 5K
Table 1.6 shows four individuals. Although many rows have not been shown
here, let us assume that the ZIP CODE and INCOME are correlated, in that
the ZIP CODE 56001 primarily consists of high-income individuals. Table 1.7
is a modified version of Table 1.6. Let us not worry about the techniques used
to anonymize data, but focus just on the results. We can see that the names
have been changed, the original ZIP CODES have been replaced with differ-
ent values and INCOME values are unchanged.
Let us assess the gains and losses for this anonymization design.
Privacy gain: Names are substituted (hence protected), financial standing is not attributed to another zip code, and geographical location is anonymized.
Utility loss: Very little; gender information is preserved, names are substituted while preserving demographic clues, and the correlation is preserved even though the zip code is different.
Another design could use just "XXXX" for all names, 56001 for all zip codes, and "Male" for all gender values. We can agree that this anonymization design scores well in terms of privacy, but its utility is pathetic. Privacy gain: Names are completely suppressed, financial standing cannot be inferred, and geographical location is not compromised. Utility loss: The presence of females in the population is lost, meaningless names lose demographic clues, and the flat zip code value annuls the correlation.
This shows that anonymization design drives the extent of privacy and
utility, which are always opposed to each other. The two designs also show
that privacy or utility need not be 0 and 1 as in encryption; rather, both are
shades of gray as stated earlier. A good design can achieve a balance between
them and achieve both goals to a reasonable extent.
One way to quantify privacy is on the basis of how much information an
adversary can obtain about the SD of an individual from different dimen-
sions in the data set [5–8]. These references state that SD fields can be iden-
tified (or estimated/deduced) using QI fields. This is a very simple way
to quantify privacy. In fact, this model does not capture many important
dimensions, such as background knowledge of the adversary, adversary’s
knowledge of some of the sensitive data, the complexity of the data structure,
etc. We discuss this in sufficient detail in Chapter 4.
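One simple proxy for this notion of privacy is the size of the equivalence class an individual falls into when the data set is grouped by its QI columns: the smaller the class, the more confidently an adversary can tie the SD to a person. The sketch below computes a naive per-record re-identification risk as 1 divided by the class size; it is an illustrative simplification and ignores the dimensions (background knowledge, complexity of the data structure) mentioned above.

```python
# A minimal sketch: quasi-identifier equivalence-class sizes as a naive privacy measure.
import pandas as pd

def qi_risk(df: pd.DataFrame, qi_cols: list) -> pd.Series:
    """Return, per record, 1 / (size of its QI equivalence class)."""
    group_sizes = df.groupby(qi_cols)[qi_cols[0]].transform("size")
    return 1.0 / group_sizes

# Hypothetical generalized data.
data = pd.DataFrame({
    "ZipCode": ["660**", "660**", "660**", "661**"],
    "DOB": ["1970s", "1970s", "1970s", "1980s"],
    "Income": [25000, 30000, 8000, 5000],
})
print(qi_risk(data, ["ZipCode", "DOB"]))  # a record alone in its class has risk 1.0
```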
The utility loss of a particular anonymization technique is measured
against the utility provided by the original data set. A measure of utility
TABLE 1.7
Anonymized Table with Generalized Values—Correlation
between QI and SD Is Broken
Name Zip Code Gender Income
Yang 56000 Male 25K
Emma 56010 Female 8K
Olivia 56000 Female 30K
Krishna 56010 Male 5K
TABLE 1.8
Background Knowledge of the Adversary
about the Distribution of SD Fields
Name Zip Code Gender Disease
John Smith 46001 Male Hypertension
Tom Henry 46005 Male Gastritis
Alice Williams 46001 Female Cancer
Little Wood 46011 Male Asthma
FIGURE 1.6
Drivers for anonymization design. (The drivers shown are the business domain, utility, application requirements, and privacy requirements.)
have very little utility. In this context, irrespective of which tool an organi-
zation uses, there is a need for a mechanism to monitor privacy versus util-
ity for various privacy requirements. Unfortunately, quantifying privacy
and utility is nontrivial. Therefore, it is critical to provide assurance of high-quality data anonymization during the initial phase of the anonymization life cycle. To support this, we felt it necessary to define a set of design principles. These principles provide the required guidelines for the data anonymizer to adopt the correct design for a given anonymization requirement.
As software architects, we start the architecting process by following a
set of architecture principles that will guide us to come up with the correct
design for the system. We base our work here on a similar approach. In [12],
the authors classify principles into two broad types—scientific and norma-
tive. Scientific principles are laws of nature and form the fundamental truths
that one can build upon. Normative principles act as a guide and need to be
enforced. Similarly, a data anonymizer needs guidance, and the anonymiza-
tion design principles should be enforced to ensure proper anonymization
design. These principles are fundamental in nature and are applicable to all
aspects of anonymization. They connect the high-level privacy and utility
requirements to low-level implementation.
Each anonymization design principle in this book is documented using the following elements:
• Principle Name
• Rationale
• Implications
TABLE 1.9
Sample Sparse High-Dimensional Transaction Database
in a Supermarket
Name P1 P2 P3 P4 P5 P6 Pn
Hari 1 1
Nancy 1 1
Jim 1 1
Transaction data such as those in Table 1.9 pose three challenges for anonymization (see the sketch following the list):
1. High dimensionality.
2. Sparsity.
3. Conventional privacy preservation techniques used for relational tables that have a fixed schema are not applicable to transaction data.
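A small sketch makes the first two challenges tangible: with thousands of possible items and only a handful of purchases per customer, the data form an extremely sparse binary matrix, and column-by-column generalization of the kind used for relational tables becomes impractical. The item counts below are assumptions chosen for illustration.

```python
# A minimal sketch of a sparse, high-dimensional transaction matrix (cf. Table 1.9).
# The item universe size and the purchases are illustrative assumptions.
import numpy as np
from scipy.sparse import lil_matrix

n_customers, n_items = 3, 10_000             # thousands of product columns
purchases = lil_matrix((n_customers, n_items), dtype=np.int8)

# Each customer buys only a couple of items, so almost every cell is 0.
purchases[0, [12, 4071]] = 1                 # e.g., "Hari"
purchases[1, [233, 9876]] = 1                # e.g., "Nancy"
purchases[2, [12, 501]] = 1                  # e.g., "Jim"

density = purchases.nnz / (n_customers * n_items)
print(f"non-zero cells: {purchases.nnz}, density: {density:.5%}")
# Fixed-schema generalization over 10,000 columns is impractical at this sparsity.
```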
TABLE 1.10
Sample Longitudinal Data Set in the Healthcare Domain

ID   Name    DOB    ZIP     Service Date   Diseases       Systolic (mmHg)   Diastolic (mmHg)
1    Bob     1976   56711   30/05/2012     Hypertension   180               95
2    Bob     1976   56711   31/05/2012     Hypertension   160               90
3    Bob     1976   56711   01/06/2012     Hypertension   140               85
4    Bob     1976   56711   02/06/2012     Hypertension   130               90
5    Bob     1976   56711   03/06/2012     Hypertension   125               85
6    Bob     1976   56711   04/06/2012     Hypertension   120               80
7    Alice   1969   56812   31/03/2012     Hypertension   160               90
Consider the longitudinal data set D, which has three disjoint sets of data (EI, QI, and SD). EI are completely masked to prevent identification. QI are anonymized using generalization and suppression to prevent identity disclosure. In the case of longitudinal data, anonymizing identity attributes alone is not sufficient to prevent an adversary from re-identifying the patient. An adversary can still link some of the sensitive attributes to publicly available data, that is, medical records. Hence the need to prevent attribute disclosure as well. For longitudinal data, an anonymization design that prevents both identity and attribute disclosure is required [16]. There are a number of techniques to prevent identity disclosure, such as perturbative and nonperturbative techniques. Effective anonymization techniques are also required to prevent attribute disclosure, but these techniques should ensure that they preserve the characteristics of longitudinal data.
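As a simplified illustration of the first step, the sketch below masks EI and generalizes QI in a table shaped like Table 1.10 while leaving each patient's sequence of visits, and therefore the longitudinal trend, intact. The generalization rules are assumptions for the example, and, as noted above, this alone does not prevent attribute disclosure.

```python
# A minimal sketch for longitudinal data (cf. Table 1.10): mask EI, generalize QI,
# and keep each patient's sequence of visits intact. Rules are illustrative only.
import pandas as pd

visits = pd.DataFrame({
    "Name": ["Bob"] * 3 + ["Alice"],
    "DOB": [1976, 1976, 1976, 1969],
    "ZIP": ["56711", "56711", "56711", "56812"],
    "ServiceDate": ["30/05/2012", "31/05/2012", "01/06/2012", "31/03/2012"],
    "Systolic": [180, 160, 140, 160],
})

anon = visits.copy()
anon["Name"] = "XXXX"                                      # EI masked
anon["DOB"] = (anon["DOB"] // 10 * 10).astype(str) + "s"   # 1976 -> 1970s
anon["ZIP"] = anon["ZIP"].str[:3] + "**"                   # 56711 -> 567**
# The per-patient sequence of readings (the longitudinal trend) is unchanged,
# but identical QI values across visits can still enable linkage attacks.
print(anon)
```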
FIGURE 1.7
Graph network with original data. (Nodes: Hari, Ram, Jane, Jack, Bob, and Alice.)

FIGURE 1.8
Modified graph network. (The same network with node identities replaced by the labels A through F.)
Privacy breaches in graph data fall into three categories (see the sketch following the list):
1. Identity disclosure
2. Link disclosure
3. Content/attribute disclosure
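A naive defense against the first breach, and the one depicted in Figures 1.7 and 1.8, is to replace node identities with neutral labels while keeping the edge structure. The sketch below does this for a small, invented friendship graph without any graph library; as the preserved structure suggests, it does not by itself rule out link or attribute disclosure.

```python
# A minimal sketch of naive graph anonymization (cf. Figures 1.7 and 1.8):
# replace node identities with labels while preserving the edge structure.
# The edge list itself is an illustrative assumption.
friendships = {
    "Hari": ["Ram", "Jane"],
    "Ram": ["Hari", "Jack"],
    "Jane": ["Hari", "Bob"],
    "Jack": ["Ram", "Alice"],
    "Bob": ["Jane", "Alice"],
    "Alice": ["Jack", "Bob"],
}

# Map each person to an opaque label (A, B, C, ...).
labels = {name: chr(ord("A") + i) for i, name in enumerate(friendships)}
anonymized = {labels[p]: sorted(labels[q] for q in nbrs) for p, nbrs in friendships.items()}
print(anonymized)
# Identity disclosure is harder, but node degrees and neighborhood shapes survive,
# so an adversary with structural background knowledge may still re-identify nodes.
```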
TABLE 1.11
Sample Time Series Data Table Showing Weekly Sales of Companies

ID   Company Name   Address               Week 1   Week 2   Week 3   Week 4   Week 5
1    ABC            Park Street, 56001    10,000   12,000   17,000    8,000   11,000
2    ACME           Kings Street, 56003   15,000   17,000   18,000   20,000   21,000
3    XYZ            Main Street, 56022    20,000   23,000   25,000   26,000   30,000
4    PQR            Queen Street, 56021   14,000   18,000   19,000   19,500   21,000
Anonymizing time series data of this kind involves several challenges (see the sketch following the list):
• High dimensionality
• Retaining the statistical properties of the original time series data, like mean, variance, and so on
• Supporting various types of queries, like range queries or pattern matching queries
• Preventing identity disclosure and linkage attacks
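To illustrate the second item in the list, the sketch below perturbs one company's weekly sales with zero-mean random noise, which hides the exact values while approximately preserving the series mean; the noise scale is an assumption, and perturbation of this kind addresses only part of the challenge.

```python
# A minimal sketch: additive zero-mean noise on a weekly sales series (cf. Table 1.11).
# The noise scale is an illustrative assumption; this addresses only one challenge above.
import numpy as np

rng = np.random.default_rng(42)
sales = np.array([10_000, 12_000, 17_000, 8_000, 11_000], dtype=float)  # company "ABC"

noise = rng.normal(loc=0.0, scale=500.0, size=sales.shape)   # zero-mean perturbation
perturbed = sales + noise

print("original mean:", sales.mean())
print("perturbed mean:", perturbed.mean())     # close to the original mean
print("perturbed values:", np.round(perturbed))
```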
References
1. J.M. Skopek, Anonymity: The production of goods and institutional design,
Fordham Law Review, 82(4), 1751–1809, 2014, http://ir.lawnet.fordham.edu/flr/
vol82/iss4/4/.
2. L. Sweeney, k-Anonymity: A model for protecting privacy, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5), 557–570, 2002.