
Computer Science & Engineering

DATA PRIVACY
Principles and Practice

Nataraj Venkataramanan
Ashwin Shriram

The digital revolution has taken its toll on our data’s privacy. Every agency we transact with has captured our identity. Our personal data is available so abundantly and in the custody of so many handlers that it appears we are just one disgruntled employee’s vengeful act away from losing our privacy. External adversaries are using publicly available information to infer our identity. Outsourcing of services is compromising our data across borders. Big Data Analytics are nudging us to share our private data on various devices and social networking platforms, while analysts accumulate our behavioral nuances to take advantage of them. With this ocean of our personally identifiable information (PII) available publicly, how do we remain anonymous?

Data handlers are primarily the ones to answer this question while also complying with various privacy regulations. Are they skilled at handling this? This book provides a complete reference for personal data identification, classification, anonymization, and usage in various application contexts. Privacy principles in the book guide practitioners to create an anonymization design specific for each data type and application. Examples drawn from healthcare, banking, financial services, and retail demonstrate how these principles can lead to effective practice and result in a compliant business that respects customer privacy.

Data privacy in this book is covered from many perspectives, including

• Anonymization design principles
• Privacy vs. Utility: the trade-offs
• Privacy preserving data mining (PPDM)
• Privacy preserving test data manufacturing (PPTDM)
• Dynamic data protection
• Synthetic data generation
• Threat modeling
• Privacy regulations

These topics will be useful for students of privacy from academia as well as industry.

K25563
www.crcpress.com
A CHAPMAN & HALL BOOK


DATA PRIVACY
Principles and Practice
Nataraj Venkataramanan
Ashwin Shriram
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2017 by Taylor & Francis Group, LLC


CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

Printed on acid-free paper


Version Date: 20160628

International Standard Book Number-13: 978-1-4987-2104-2 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has
not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmit-
ted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented,
including photocopying, microfilming, and recording, or in any information storage or retrieval system,
without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.
com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood
Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and
registration for a variety of users. For organizations that have been granted a photocopy license by the CCC,
a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.

Library of Congress Cataloging‑in‑Publication Data

Names: Venkataramanan, Nataraj, author.


Title: Data privacy : principles and practice / Nataraj Venkataramanan and
Ashwin Shriram.
Description: Boca Raton, FL : CRC Press, 2017. | Includes bibliographical
references and index.
Identifiers: LCCN 2016009691 | ISBN 9781498721042 (alk. paper)
Subjects: LCSH: Data protection. | Management information systems--Security
measures. | Computer networks--Security measures. | Privacy, Right of. |
Customer relations.
Classification: LCC HF5548.37 .V46 2017 | DDC 005.8--dc23
LC record available at https://lccn.loc.gov/2016009691

Visit the Taylor & Francis Web site at


http://www.taylorandfrancis.com

and the CRC Press Web site at


http://www.crcpress.com
In memory of my parents

V. Sulochana and E. Venkataramanan

and my Guru

Sri Chandrasekharendra Saraswathi

Nataraj Venkataramanan

Seeking God’s blessings, I dedicate this book to my

wonderful parents, my lovely wife, and daughter,

who have sacrificed so much to make this happen.

Ashwin Shriram
Contents

Preface................................................................................................................. xiii
Acknowledgments.................................................................................................xv
Authors................................................................................................................. xvii
List of Abbreviations........................................................................................... xix

1. Introduction to Data Privacy.........................................................................1


1.1 Introduction............................................................................................1
1.2 What Is Data Privacy and Why Is It Important?...............................3
1.2.1 Protecting Sensitive Data.........................................................5
1.2.2 Privacy and Anonymity: Two Sides of the Same Coin........8
1.3 Use Cases: Need for Sharing Data.......................................................9
1.3.1 Data Mining and Analysis.................................................... 12
1.3.2 Software Application Testing................................................ 13
1.3.3 Business Operations............................................................... 13
1.4 Methods of Protecting Data................................................................ 14
1.5 Importance of Balancing Data Privacy and Utility......................... 15
1.5.1 Measuring Privacy of Anonymized Data........................... 18
1.5.2 Measuring Utility of Anonymized Data............................. 19
1.6 Introduction to Anonymization Design Principles........................ 19
1.7 Nature of Data in the Enterprise....................................................... 21
1.7.1 Multidimensional Data.......................................................... 21
1.7.1.1 Challenges in Privacy Preservation
of Multidimensional Data......................................22
1.7.2 Transaction Data.....................................................................22
1.7.2.1 Challenges in Privacy Preservation
of Transaction Data................................................. 23
1.7.3 Longitudinal Data.................................................................. 23
1.7.3.1 Challenges in Anonymizing Longitudinal
Data........................................................................... 24
1.7.4 Graph Data............................................................................... 24
1.7.4.1 Challenges in Anonymizing Graph Data............ 26
1.7.5 Time Series Data..................................................................... 26
1.7.5.1 Challenges in Privacy Preservation of Time
Series Data................................................................ 27
References........................................................................................................ 27


2. Static Data Anonymization Part I: Multidimensional Data................. 29


2.1 Introduction.......................................................................................... 29
2.2 Classification of Privacy Preserving Methods................................. 29
2.3 Classification of Data in a Multidimensional Data Set................... 31
2.3.1 Protecting Explicit Identifiers............................................... 31
2.3.2 Protecting Quasi-Identifiers..................................................34
2.3.2.1 Challenges in Protecting QI................................... 35
2.3.3 Protecting Sensitive Data (SD).............................................. 38
2.4 Group-Based Anonymization............................................................42
2.4.1 k-Anonymity............................................................................42
2.4.1.1 Why k-Anonymization?.........................................42
2.4.1.2 How to Generalize Data?....................................... 47
2.4.1.3 Implementing k-Anonymization.......................... 52
2.4.1.4 How Do You Select the Value of k?.......................54
2.4.1.5 Challenges in Implementing
k-Anonymization.................................................55
2.4.1.6 What Are the Drawbacks of
k-Anonymization?................................................ 57
2.4.2 l-Diversity................................................................................. 58
2.4.2.1 Drawbacks of l-Diversity........................................ 59
2.4.3 t-Closeness............................................................................... 60
2.4.3.1 What Is t-Closeness?............................................... 60
2.4.4 Algorithm Comparison.......................................................... 60
2.5 Summary............................................................................................... 62
References........................................................................................................63

3. Static Data Anonymization Part II: Complex Data Structures............65


3.1 Introduction..........................................................................................65
3.2 Privacy Preserving Graph Data......................................................... 66
3.2.1 Structure of Graph Data......................................................... 66
3.2.2 Privacy Model for Graph Data.............................................. 67
3.2.2.1 Identity Protection.................................................. 68
3.2.2.2 Content Protection.................................................. 69
3.2.2.3 Link Protection........................................................ 69
3.2.2.4 Graph Metrics.......................................................... 71
3.3 Privacy Preserving Time Series Data................................................ 71
3.3.1 Challenges in Privacy Preservation of Time
Series Data...........................................................................73
3.3.1.1 High Dimensionality.............................................. 73
3.3.1.2 Background Knowledge of the Adversary.......... 74
3.3.1.3 Pattern Preservation............................................... 74
3.3.1.4 Preservation of Statistical Properties................... 74
3.3.1.5 Preservation of Frequency-Domain
Properties..............................................................75

3.3.2 Time Series Data Protection Methods................................. 75


3.3.2.1 Additive Random Noise......................................... 76
3.3.2.2 Perturbation of Time Series Data Using
Generalization: k-Anonymization........................ 78
3.4 Privacy Preservation of Longitudinal Data..................................... 79
3.4.1 Characteristics of Longitudinal Data...................................80
3.4.1.1 Challenges in Anonymizing Longitudinal
Data...........................................................................80
3.5 Privacy Preservation of Transaction Data........................................ 81
3.6 Summary...............................................................................................83
References........................................................................................................83

4. Static Data Anonymization Part III: Threats to Anonymized Data.....85


4.1 Threats to Anonymized Data.............................................................85
4.2 Threats to Data Structures.................................................................. 89
4.2.1 Multidimensional Data.......................................................... 92
4.2.2 Longitudinal Data.................................................................. 92
4.2.3 Graph Data............................................................................... 93
4.2.4 Time Series Data..................................................................... 93
4.2.5 Transaction Data..................................................................... 94
4.3 Threats by Anonymization Techniques........................................... 95
4.3.1 Randomization (Additive)..................................................... 96
4.3.2 k-Anonymization.................................................................... 96
4.3.3 l-Diversity................................................................................. 96
4.3.4 t-Closeness............................................................................... 96
4.4 Summary............................................................................................... 96
References........................................................................................................ 97

5. Privacy Preserving Data Mining................................................................ 99


5.1 Introduction.......................................................................................... 99
5.2 Data Mining: Key Functional Areas of
Multidimensional Data.............................................................. 100
5.2.1 Association Rule Mining..................................................... 100
5.2.1.1 Privacy Preserving of Association Rule
Mining: Random Perturbation............................ 102
5.2.2 Clustering............................................................................... 104
5.2.2.1 A Brief Survey of Privacy Preserving
Clustering Algorithms.......................................... 106
5.3 Summary............................................................................................. 108
References...................................................................................................... 108

6. Privacy Preserving Test Data Manufacturing....................................... 109


6.1 Introduction........................................................................................ 109
6.2 Related Work....................................................................................... 110

6.3 Test Data Fundamentals.................................................................... 110


6.3.1 Testing.................................................................................... 111
6.3.1.1 Functional Testing: System and Integration
Testing..................................................................... 111
6.3.1.2 Nonfunctional Testing.......................................... 111
6.3.2 Test Data................................................................................. 111
6.3.2.1 Test Data and Reliability...................................... 112
6.3.2.2 How Are Test Data Created Today?................... 114
6.3.3 A Note on Subsets................................................................. 115
6.4 Utility of Test Data: Test Coverage................................................... 115
6.4.1 Privacy versus Utility........................................................... 117
6.4.2 Outliers................................................................................... 118
6.4.3 Measuring Test Coverage against Privacy........................ 119
6.5 Privacy Preservation of Test Data.................................................... 119
6.5.1 Protecting Explicit Identifiers............................................. 119
6.5.1.1 Essentials of Protecting EI................................... 120
6.5.1.2 What Do Tools Offer?........................................... 121
6.5.1.3 How Do Masking Techniques Affect
Testing?................................................................ 121
6.5.2 Protecting Quasi-Identifiers................................................ 124
6.5.2.1 Essentials of Protecting QI................................... 124
6.5.2.2 Tool Offerings to Anonymize QI........................ 125
6.5.2.3 How Does QI Anonymization Affect Test
Coverage?................................................................ 126
6.5.3 Protecting Sensitive Data (SD)............................................ 130
6.6 Quality of Test Data........................................................................... 130
6.6.1 Lines of Code Covered......................................................... 131
6.6.2 Query Ability........................................................................ 132
6.6.3 Time for Testing.................................................................... 133
6.6.3.1 Test Completion Criteria...................................... 133
6.6.3.2 Time Factor............................................................. 134
6.6.4 Defect Detection.................................................................... 135
6.7 Anonymization Design for PPTDM................................................ 135
6.8 Insufficiencies of Anonymized Test Data....................................... 137
6.8.1 Negative Testing.................................................................... 137
6.8.2 Sensitive Domains................................................................ 137
6.8.3 Nonfunctional Testing......................................................... 138
6.8.4 Regression Testing................................................................ 138
6.8.5 Trust Deficit........................................................................... 138
6.9 Summary............................................................................................. 138
References...................................................................................................... 139

7. Synthetic Data Generation........................................................................ 141


7.1 Introduction........................................................................................ 141
7.2 Related Work....................................................................................... 141

7.3 Synthetic Data and Their Use........................................................... 142


7.4 Privacy and Utility in Synthetic Data............................................. 144
7.4.1 Explicit Identifiers................................................................. 144
7.4.1.1 Privacy.................................................................... 144
7.4.1.2 Utility...................................................................... 145
7.4.1.3 Generation Algorithms......................................... 145
7.4.2 Quasi-Identifiers.................................................................... 145
7.4.2.1 Privacy.................................................................... 146
7.4.2.2 Utility...................................................................... 146
7.4.2.3 Generation Algorithms......................................... 147
7.4.3 Sensitive Data........................................................................ 148
7.4.3.1 Privacy.................................................................... 148
7.4.3.2 Utility...................................................................... 148
7.5 How Safe Are Synthetic Data?......................................................... 151
7.5.1 Testing.................................................................................... 151
7.5.1.1 Error and Exception Data..................................... 152
7.5.1.2 Scaling..................................................................... 152
7.5.1.3 Regression Testing................................................ 152
7.5.2 Data Mining........................................................................... 152
7.5.3 Public Data............................................................................. 152
7.6 Summary............................................................................................. 153
References...................................................................................................... 153

8. Dynamic Data Protection: Tokenization................................................ 155


8.1 Introduction........................................................................................ 155
8.2 Revisiting the Definitions of Anonymization and Privacy......... 155
8.3 Understanding Tokenization............................................................ 157
8.3.1 Dependent Tokenization...................................................... 157
8.3.2 Independent Tokenization................................................... 159
8.4 Use Cases for Dynamic Data Protection........................................ 159
8.4.1 Business Operations............................................................. 160
8.4.2 Ad Hoc Reports for Regulatory Compliance.................... 161
8.5 Benefits of Tokenization Compared to Other Methods................ 161
8.6 Components for Tokenization.......................................................... 162
8.6.1 Data Store............................................................................... 162
8.6.2 Tokenization Server.............................................................. 163
8.7 Summary............................................................................................. 163
Reference........................................................................................................ 163

9. Privacy Regulations.................................................................................... 165


9.1 Introduction........................................................................................ 165
9.2 UK Data Protection Act 1998............................................................ 167
9.2.1 Definitions............................................................................. 167
9.2.2 Problems in DPA................................................................... 168

9.3 Federal Act of Data Protection of Switzerland 1992...................... 171


9.3.1 Storing Patients’ Records in the Cloud.............................. 171
9.3.2 Health Questionnaires for Job Applicants........................ 171
9.3.3 Transferring Pseudonymized Bank Customer Data
Outside Switzerland............................................................. 172
9.4 Payment Card Industry Data Security Standard (PCI DSS)........ 172
9.5 The Health Insurance Portability and Accountability Act
of 1996 (HIPAA).................................................................................. 174
9.5.1 Effects of Protection.............................................................. 176
9.5.2 Anonymization Considerations.......................................... 176
9.5.2.1 Record Owner........................................................ 177
9.5.2.2 Business Associate................................................ 178
9.5.3 Anonymization Design for HIPAA.................................... 178
9.5.4 Notes on EIs, QIs, and SD.................................................... 181
9.5.4.1 Explicit Identifiers................................................. 181
9.5.4.2 Quasi-Identifiers.................................................... 182
9.5.4.3 Sensitive Data......................................................... 182
9.6 Anonymization Design Checklist................................................... 182
9.7 Summary............................................................................................. 185
9.8 Points to Ponder................................................................................. 185
References...................................................................................................... 185

Appendix A: Anonymization Design Principles for Multidimensional Data.............. 189
Appendix B: PPTDM Manifesto..................................................................... 207
Index...................................................................................................................... 209
Preface

The ubiquity of computers and smart devices on the Internet has led to vast
amounts of data being collected from users, shared, and analyzed. Most of
these data are collected from customers by enterprises dealing with bank-
ing, insurance, healthcare, financial services, retail, e-commerce, manufac-
turing, and social networks. These data consist of transactional data, search
data, and health and financial data, almost always including a lot of personal
data. On the one hand, these data are considered valuable for the enterprise
as they can be used for a variety of purposes, such as knowledge discov-
ery, software application development, and application testing. At the same
time, these data are considered to be sensitive as they contain their custom-
ers’ personal information. Therefore, sharing, offshoring, or outsourcing of
such data for purposes like data mining or application testing should ensure
that customers’ personal information is not compromised in any way. Many
countries have stringent data protection acts, such as the Health Insurance
Portability and Accountability Act of 1996 (HIPAA) in the United States, the
EU Data Protection Act, and the Swiss Data Protection Act, which mandate
high standards of data privacy. In this context, data privacy, which pertains
to protecting the identity of the customers, is high priority for all enterprises
as any data privacy loss would result in legal issues with hefty fines, erosion
of customer confidence, and customer attrition. To this effect, data privacy as
a subject of study is being introduced in some universities at the postgradu-
ate level. Various certifications like PrivacyTrust and CIPP also exist, which
endorse an individual’s knowledge of privacy laws.
Data privacy as a subject of study and practice is relatively young; in par-
ticular, many of the techniques used for data protection are still evolving.
Currently, many companies are adopting these data privacy techniques as
they try to leverage the opportunities provided by offshoring and outsourc-
ing while at the same time complying with regulatory requirements. This
book provides comprehensive guidance on the implementation of many of
the data privacy techniques in a variety of applications. An enterprise’s data
architecture consists of a wide variety of data structures like multidimen-
sional data, also known as relational data, which are the most widely used
data structure, and complex data structures like transaction data, longitudinal
data, time series data, graph data, and spatiotemporal data. Multidimensional
data are simple in structure, and a rich set of anonymization algorithms
are in use currently for such data, but anonymization techniques for other
complex data structures are still evolving. Chapters 2 through 4 attempt to
provide a detailed coverage on the various anonymization approaches for


both multidimensional and complex data structures. Chapters 5 and 6 focus on applications such as privacy preserving data mining and privacy pre-
serving test data management. Chapter 7 focuses on synthetic data that are
used as an alternative to anonymization. There is a requirement for protec-
tion of data during run-time—dynamic data protection—which is covered
in Chapter 8. Here, the techniques of one-way and two-way tokenization
are covered. Appendix A provides a detailed set of anonymization design
principles, which serve as guidelines to the practitioner. The salient features
of this book are

• Static data anonymization techniques for multidimensional data
• Clear-cut guidelines on anonymization design
• Analysis of anonymization algorithms
• Static data anonymization techniques for complex data structures
like transaction, longitudinal, time series, and graph data
• Emphasis on privacy versus utility in the techniques
• Applications of anonymization techniques on a variety of domains
like privacy preserving data mining and privacy preserving test
data management
• Use of synthetic data in place of anonymized data
• Use of tokenization algorithms for dynamic data protection
Acknowledgments

Writing a book is never easy. It requires hard work, research, collaboration, and support from various quarters. Both of us thank our management at HCL Technologies Ltd. for providing a wonderful environment conducive to learning and innovation. We thank our colleague, Harikrishnan Janardhanan,
for his selfless support during this endeavor.
We thank N. Krishna and Dr. K. R. Gunasekar of the Indian Institute of
Science, Bangalore, India, for helping us with our research in this field. We
sincerely thank Aastha Sharma and the members of the editorial group at
CRC Press/Taylor & Francis Group for their support in completing this work.
Writing a book is hard and one needs to spend a lot of time on it. This
reduces the amount of time one gets to spend with one’s family. Nataraj owes
special thanks to his wife, Kalyani, and his two sons, Balaji and Srihari, for
providing constant encouragement throughout this period. Ashwin thanks
his parents for their blessings and support. He appreciates his wife, Deepa,
for taking over the entire responsibility of raising their little daughter, Srishti,
during this time.
Finally, we collectively thank God for giving us the courage, perseverance,
and diligence to contribute to an area that we firmly believe is extremely
important in today’s world.

Authors

Nataraj Venkataramanan is currently an associate vice president at HCL Technologies Ltd., India. He has previously worked in some of India’s
major information technology (IT) companies and has over two decades
of experience in computing. He has worked across different domains such
as banking, financial services, insurance, government, oil and gas, retail,
and manufacturing. His main research interests are in large-scale software
architecture, quality attributes of software architecture, data privacy, pri-
vacy preserving data mining, data analytics, pattern recognition, and learn-
ing systems. He has published refereed technical papers in journals and
conferences. He is a member of IEEE and ACM. Nataraj can be reached at
[email protected].

Ashwin Shriram works for HCL Technologies as a solution architect. As an engineer in computer science, he has a strong technical background in
data management. At HCL, Ashwin is a senior member of the Test Data
Management Center of Excellence. His current research interests include data
privacy, data analytics, pattern recognition, and big data privacy. Prior to join-
ing HCL, he was working in the United States for customers in public as well
as in private sectors. Ashwin can be reached at [email protected].

List of Abbreviations

AG-TS Attribute generalization–tuple suppression
API Application programming interface
BPO Business process outsourcing
CCN Credit card number
CIPP Certified information privacy professional
CM Classification metric
DGH Domain generalization hierarchy
DMV Department of motor vehicles
DOB Date of birth
DPA Data Protection Act (UK)
EC Equivalence class
ECG Electrocardiogram
EI Explicit identifier
FADP Federal Act on Data Protection
FERPA Family Educational Rights and Privacy Act
FIPPA Freedom of Information and Protection of Privacy Act
HIPAA Health Insurance Portability and Accountability Act
IDAS Information discovery and analysis systems
IDSG Information discovery scenario generator
IOT Internet of things
ISI International Statistical Institute
LDAP Lightweight Directory Access Protocol
LM Information loss metric
LOC Lines of code
MBA Market basket analysis
MTBF Mean time between failures
MTTR Mean time to recovery
NFR Nonfunctional requirements
NSD Nonsensitive data
ODS Operational data store
OECD Organization for Economic Cooperation and Development
PAN Permanent account number
PCI Payment card industry
PCI DSS Payment Card Industry Data Security Standard
PHI Protected health information
PII Personally identifiable information
PPDM Privacy preserving data mining
PPTDM Privacy preserving test data manufacturing
QI Quasi-identifier
SD Sensitive data


SDC Statistical disclosure control
SDDL Synthetic data description language
SDG Synthetic data generation
SDLC Systems development life cycle
SMC/MPC Secure multiparty computation/multiparty computation
SSN Social security number
TDM Test data manufacturing/test data management
TSG Threat stream generator
VIN Vehicle identification number
ZKP Zero-knowledge proof
1
Introduction to Data Privacy

1.1 Introduction
Organizations dealing with banking, insurance, retail, healthcare, and
manufacturing across the globe collect large amounts of data about their
customers. This is a valuable asset to the organizations as these data can
be mined to extract a lot of insights about their customers. For example,
mining these data can throw light on customers’ spending/buying, credit
card usage, and investment patterns and health issues, to name a few. This
information is used by companies to provide value-added services to their
customers, which in turn results in higher revenue and profit. But these data
might contain customers’ personal identification information, and when in
the hands of a data snooper, they can be exploited.
Large companies across the globe outsource their IT and business pro-
cess work to service providers in countries like India, China, Brazil, etc. The
outsourced work may involve application maintenance and development,
testing, data mining/analysis, statistical analysis, etc. Business applications
contain sensitive information, such as personal or financial and health-
related data. Sharing such data can potentially violate individual privacy and
lead to financial loss to the company. Serious concerns have been expressed
by general public about exposing person-specific information. The issue of
data leakage, either intentional or accidental exposure of sensitive informa-
tion, is becoming a major security issue.
An IDC survey [19] claims that data leakage is the number one threat,
ranked higher than viruses, Trojan horses, and worms. To address the privacy
of an individual’s data, governments across the globe have mandated regula-
tions that companies have to adhere to: HIPAA (Health Insurance Portability
and Accountability Act) in the United States, FIPPA (Freedom of Information
and Protection of Privacy Act) in Canada, Sarbanes–Oxley Act, Video Privacy
Protection, U.S. Declaration of Human Rights, and the EU’s Data Protection
Directive are just a few examples. Companies need to look into methods and
tools to anonymize sensitive data. Data anonymization techniques have been
the subject of intense investigation in recent years for many kinds of struc-
tured data, including tabular, transactional data, and graph data.


In Chapter 2, Static Data Anonymization, we discuss relational data, also known as multidimensional data, which are the most widely found data
structure in enterprises currently. This chapter focuses on privacy pres-
ervation methods for multidimensional data. Multidimensional data are
simple in structure, and a rich set of data protection algorithms, such as
randomization, generalization, k-anonymization, l-diversity, and t-closeness,
is described.
Anonymization techniques for multidimensional data are simple in
structure and very commonly found across enterprises. Apart from mul-
tidimensional data, other types of data structures, such as graph, longi-
tudinal data, sparse high-dimensional transaction data, time series data,
spatiotemporal data, semistructured XML data, and big data, are also pres-
ent across enterprises. These data structures are complex, contain sensitive
customer information, and should therefore be protected. There are unique
challenges in designing anonymization techniques for these complex data
structures, though. Anonymization techniques used for the protection of
multidimensional data are not directly applicable to these complex data
structures. Chapter 3 discusses some of the anonymization techniques for
complex data structures.
Any anonymization design is a function of many inputs. One of the impor-
tant inputs is, “who are we protecting these data from?” The answer builds a
profile of adversaries who are expected to attack the data. Chapter 4 explains
various profiles of adversaries, their techniques, and what safeguards can be
implemented against such threats.
Data mining is the first application of data privacy that we discuss. We
explore two areas of mining: association rule mining and clustering. Each of
these areas works with the goal of knowledge discovery. However, ensuring
privacy is important for users/customers/patients to willingly share their
data for analysis. In Chapter 5, we explore a few prominent privacy preserva-
tion algorithms and conclude with a discussion on their impact on the utility
of data.
The second application of privacy, test data, is increasingly becoming an
area for privacy preservation. High-quality testing requires high-quality test
data. Outsourcing of testing has brought data privacy concerns to the fore.
Hence, in Chapter 6, we discuss the need for privacy and current trends and
list appropriate algorithms for each class of data. Measuring the utility of
test data and also ascertaining the overall quality of test data are important
to understand if a balance between privacy and utility has been achieved.
Finally, the chapter also highlights some problems with the current anony-
mization options available.
In data mining and testing, there are times when a need arises to create
external supplementary data. Classifiers require training data, which are
not available initially. While testing too, special processes like error data
handling, performance benchmarking, etc., require data not available in
the original source data. Synthetic data generation solves both these issues.

In Chapter 7, we visit each class of personally identifiable information (PII) and explore the techniques available to generate synthetic data. The safety
aspects of synthetic data are also covered in this chapter.
Run-time preservation of privacy is needed in cases where changes to the
data itself cannot be made. Another challenge is that different roles of users
require different levels of protection. In such cases, tokenization is a good
solution. Tokens preserve the formats of sensitive data, which make them
look just like the original data. Chapter 8 covers use cases, implementation
and examples.
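As a preview of the idea, the sketch below implements a very naive form of two-way, format-preserving tokenization with an in-memory token vault. It is only an assumption-laden illustration in Python, not the tokenization architecture described in Chapter 8.

    import random
    import string

    _vault = {}  # token -> original value (a stand-in for a real tokenization server's store)

    def tokenize(value):
        """Replace digits with random digits and letters with random letters,
        keeping separators, so the token has the same format as the original."""
        token = "".join(
            random.choice(string.digits) if ch.isdigit()
            else random.choice(string.ascii_uppercase) if ch.isalpha()
            else ch
            for ch in value
        )
        _vault[token] = value  # kept so that an authorized caller can detokenize
        return token

    def detokenize(token):
        return _vault[token]

    card = "4111-1111-1111-1111"
    tok = tokenize(card)        # e.g. "7302-9481-5527-0164": looks like a real card number
    assert detokenize(tok) == card

Because the token keeps the original format, a run-time application can display or process it without code changes, while only the tokenization component can map it back to the real value.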
Chapter 9, the final chapter, explores the compliance side of data privacy.
Most privacy implementations are direct results of compliance mandates at
the regional or organizational level. In this chapter, we explain the rules and
definitions in some of the relevant privacy regulations. Most of this chap-
ter is dedicated to HIPAA, which is the definitive privacy law in the United
States for healthcare data.
Appendix A lists the principles of anonymization that are referred
throughout this book. These principles are applicable across domains, thus
providing a concrete guideline for data privacy implementations.
Appendix B (PPTDM Manifesto) summarizes the best practices to be applied while preserving privacy in a test data setting.

1.2 What Is Data Privacy and Why Is It Important?


Thousands of ambulance service staff and housing benefits claimants have
had their personal information accidentally leaked in the latest UK data breach
blunder (January 4, 2014; news in www.infosecurity-magazine.com/news/
thousands-of-personal/-details.)
National healthcare chain Community Health Systems (CHS) says that
about 4.5 million pieces of “non-medical patient identification data related
to our physician practice” have been stolen …. August 18, 2014/News
www.infosecurity-magazine.com/news/45-million-records-stolen-from/
NASDAQ-listed outsourcing firm EXL Service has lost a key client due to
the breach of confidential client data by some of its employees.

• Economic Times (India) on November 6, 2013

There are numerous such incidents where customers’ confidential personal information has been attacked by or lost to a data snooper. When such
untoward incidents occur, organizations face legal suits, financial loss, loss
of image, and, importantly, the loss of their customers.
There are many stakeholders of data privacy in an organization; these are
shown in Figure 1.1. Let us define these stakeholders.

FIGURE 1.1
Data privacy—stakeholders in the organization. (Diagram showing the stakeholders around the production and anonymized databases: the customer/record owner; the company (banks, insurance, healthcare, or retail); the government (compliance and regulations); the adversary/data snooper; the data anonymizer; data analysts; testers; and business operations employees.)

Company: Any organization like a bank, an insurance company, or an e-commerce, retail, healthcare, or social networking company that holds
large amounts of customer-specific data. They are the custodians of cus-
tomer data, which are considered very sensitive, and have the responsibility
of protecting the data at all costs. Any loss of these sensitive data will result
in the company facing legal suits, financial penalties, and loss of reputation.
Customer/record owner: An organization’s customer could be an individual or
another organization who shares their data with the company. For example,
an individual shares his personal information, also known as PII, such as
his name, address, gender, date of birth, phone numbers, e-mail address,
and income with a bank. PII is considered sensitive as any disclosure or loss
could lead to undesired identification of the customer or record owner. It has
been shown that gender, age, and zip code are sufficient to identify a large
portion of the population in the United States.
Government: The government defines the data protection regulations that
companies should comply with. Examples of such regulations are HIPAA,
the EU Data Protection Act, and the Swiss Data Protection Act.
It is mandatory for companies to follow government regulations on data
protection.

Data anonymizer: A person who anonymizes and provides data for analysis
or as test data.
Data analyst: This person uses the anonymized data to carry out data mining
activities like prediction, knowledge discovery, and so on. Following govern-
ment regulations, such as the Data Moratorium Act, only anonymized data
can be used for data mining. Therefore, it is important that the provisioned
data support data mining functionalities.
Tester: Outsourcing of software testing is common among many companies.
High-quality testing requires high-quality test data, which is present in pro-
duction systems and contains customer-sensitive information. In order to test
the software system, the tester needs data to be extracted from production
systems, anonymized, and provisioned for testing. Since test data contain
customer-sensitive data, it is mandatory to adhere to regulatory compliance
in that region/country.
Business operations employee: Data analysts and software testers use anony-
mized data that are at rest or static, whereas business operations employees
access production data because they need to support customers’ business
requirements. Business operations are generally outsourced to BPO (busi-
ness process outsourcing) companies. In this case too, there is a requirement
to protect customer-sensitive data, but as this operation is carried out during
run-time, a different set of data protection techniques is required to protect
data from business operations employees.
Adversary/data snooper: Data are precious and their theft is very common.
An adversary can be internal or external to the organization. The anonymiza-
tion design should be such that it can thwart an adversary’s effort to identify
a record owner in the database.
Companies spend millions of dollars to protect the privacy of customer
data. Why is it so important? What constitutes personal information? Personal
information consists of name, identifiers like social security number, geo-
graphic and demographic information, and general sensitive information, for
example, financial status, health issues, shopping patterns, and location data.
Loss of this information means loss of privacy—one’s right to freedom from
intrusion by others. As we will see, protecting one’s privacy is nontrivial.

1.2.1 Protecting Sensitive Data


“I know where you were yesterday!” Google knows your location when
you use Google Maps. Google Maps can track you wherever you go
when you use it on a smart phone. Mobile companies know your exact loca-
tion when you use a mobile phone. You have no place to hide. You have lost
your privacy. This is the flip side of using devices like smart phones, Global
positioning systems (GPS), and radio frequency identification (RFID). Why
should others know where you were yesterday? Similarly, why should

others know your health issues or financial status? All these are sensitive
data and should be well protected as they could fall into the wrong hands
and be exploited. Let us look at a sample bank customer and an account
table. The customer table taken as such has nothing confidential as most of
the information contained in it is also available in the public voters data-
base and on social networking sites like Facebook. Sensitiveness comes in
when the customer table is combined with an accounts table. A logical rep-
resentation of Tables 1.1 and 1.2 is shown in Table 1.3.
Data set D in the tables contains four disjoint data sets:

1. Explicit identifiers (EI): Attributes that identify a customer (also called
record owner) directly. These include attributes like social security
number (SSN), insurance ID, and name.
2. Quasi-identifiers (QI): Attributes that include geographic and demo-
graphic information, phone numbers, and e-mail IDs. Quasi-
identifiers are also defined as those attributes that are publicly
available, for example, a voters database.
3. Sensitive data (SD): Attributes that contain confidential information
about the record owner, such as health issues, financial status, and
salary, which cannot be compromised at any cost.
4. Nonsensitive data (NSD): Data that are not sensitive for the given context.

TABLE 1.1
Customer Table
     Explicit Identifiers       Quasi-Identifiers
ID   First Name   DOB    Gender   Address         Zip Code   Phone
1    Ravi         1970   Male     Fourth Street   66001      92345-67567
2    Hari         1975   Male     Queen Street    66011      98769-66610
3    John         1978   Male     Penn Street     66003      97867-00055
4    Amy          1980   Female   Ben Street      66066      98123-98765

TABLE 1.2
Account Table
                      Sensitive Data
ID   Account Number   Account Type   Account Balance   Credit Limit   Nonsensitive Data
1    12345            Savings        10,000            20,000
2    23456            Checking       5,000             15,000
3    45678            Savings        15,000            30,000
4    76543            Savings        17,000            25,000

TABLE 1.3
Logical Representation of Customer and Account Tables
     Explicit Identifiers    Quasi-Identifiers                        Sensitive Data
ID   Name   DOB    Gender   Address         Zip Code   Account Number   Account Type   Account Balance   Credit Limit
1    Ravi   1970   Male     Fourth Street   66001      12345            Savings        10,000            20,000
2    Hari   1975   Male     Queen Street    66011      23456            Checking       5,000             15,000
3    John   1978   Male     Penn Street     66003      45678            Savings        15,000            30,000
4    Amy    1980   Female   Ben Street      66066      76543            Savings        17,000            25,000

The first two data sets, the EI and QI, uniquely identify a record owner and
when combined with sensitive data become sensitive or confidential. The
data set D is considered as a matrix of m rows and n columns. Matrix D is a
vector space where each row and column is a vector

D = [D_EI] [D_QI] [D_SD]    (1.1)

Each of the data sets EI, QI, and SD is a matrix with m rows and i, j, and
k columns, respectively. We need to keep an eye on the index j (represent-
ing QI), which plays a major role in keeping the data confidential.
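To make this partitioning concrete, the short Python sketch below (not taken from this book) splits a toy record set into the D_EI, D_QI, and D_SD projections of Equation 1.1. The column names and their EI/QI/SD classification are assumptions made purely for illustration.

    # A minimal sketch of splitting a data set D into its EI, QI, and SD projections.
    # The attribute classification below is an illustrative assumption; a real design
    # would derive it from data discovery and the application context.
    ATTRIBUTE_CLASSES = {
        "ssn": "EI", "name": "EI",
        "dob": "QI", "gender": "QI", "zip_code": "QI",
        "account_balance": "SD", "credit_limit": "SD",
    }

    def partition(records):
        """Return (D_EI, D_QI, D_SD): one projection of the rows per attribute class."""
        projections = {"EI": [], "QI": [], "SD": []}
        for row in records:
            for cls, rows in projections.items():
                rows.append({k: v for k, v in row.items()
                             if ATTRIBUTE_CLASSES.get(k) == cls})
        return projections["EI"], projections["QI"], projections["SD"]

    d_ei, d_qi, d_sd = partition([
        {"ssn": "123-45-6789", "name": "Ravi", "dob": 1970, "gender": "Male",
         "zip_code": "66001", "account_balance": 10000, "credit_limit": 20000},
    ])
    print(d_qi)  # [{'dob': 1970, 'gender': 'Male', 'zip_code': '66001'}]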
Apart from assuring their customers’ privacy, organizations also have
to comply with various regulations in that region/country, as mentioned
earlier. Most countries have strong privacy laws to protect citizens’ per-
sonal data. Organizations that fail to protect the privacy of their customers
or do not comply with the regulations face stiff financial penalties, loss of
reputation, loss of customers, and legal issues. This is the primary reason
organizations pay so much attention to data privacy. They find themselves
in a Catch-22 as they have huge amounts of customer data, and there is a
compelling need to share these data with specialized data analysis com-
panies. Most often, data protection techniques, such as cryptography and
anonymization, are used prior to sharing data. In this book, we focus only
on anonymization.
Anonymization is a process of logically separating the identifying
information (PII) from sensitive data. Referring to Table 1.3, the anony-
mization approach ensures that EI and QI are logically separated from
SD. As a result, an adversary will not be able to easily identify the record
owner from his sensitive data. This is easier said than done. How to
effectively anonymize the data? This is the question we explore through-
out this book.

1.2.2 Privacy and Anonymity: Two Sides of the Same Coin


This brings up the interesting definition of privacy and anonymity. According
to Skopek [1], under the condition of privacy, we have knowledge of a person’s
identity, but not of an associated personal fact, whereas under the condition
of anonymity, we have knowledge of a personal fact, but not of the associated
person’s identity. In this sense, privacy and anonymity are flip sides of the
same coin. Tables 1.4 and 1.5 illustrate the fundamental differences between
privacy and anonymity.
There is a subtle difference between privacy and anonymity. The word pri-
vacy is also used in a generic way to mean anonymity, and there are specific
use cases for both of them. Table 1.4 illustrates an anonymized table where
PII is protected and sensitive data are left in their original form. Sensitive
data should be in original form so that the data can be used to mine useful
knowledge.
Anonymization is a two-step process: data masking and de-identification.
Data masking is a technique applied to systematically substitute, suppress,
or scramble data that call out an individual, such as names, IDs, account
numbers, SSNs, etc. Masking techniques are simple techniques that perturb
original data. De-identification is applied on QI fields. QI fields such as date

TABLE 1.4
Example of Anonymity
Columns: SSN, Name, DOB, Gender, Address, Zip Code (personal identity); Account Number, Account Type, Account Balance, Credit Limit (sensitive data). In each row, every personal identity value is replaced with X, while the sensitive data values remain in their original form.
Note: X, identity is protected.

TABLE 1.5
Example of Privacy
Columns: SSN, Name, DOB, Gender, Address, Zip Code (personal identity); Account Number, Account Type, Account Balance, Credit Limit (sensitive data). In each row, every sensitive data value is replaced with X, while the personal identity values remain in their original form.
Note: X, sensitive data are protected.

of birth, gender, and zip code have the capacity to uniquely identify indi-
viduals. Combine that with SD, such as income, and a Warren Buffet or Bill
Gates is easily identified in the data set. By de-identifying, the values of QI
are modified carefully so that the relationships are still maintained but identities cannot be inferred.
In Equation 1.1, the original data set is D, which is anonymized, resulting in data set D′ = T(D) or T([D_EI][D_QI][D_SD]), where T is the transformation function. As a first step in the anonymization process, EI is completely masked and no longer relevant in D′. As mentioned earlier, no transformation is applied on SD, and it is left in its original form. This results in D′ = T([D_QI]), which means that transformation is applied only on QI, as EI is masked and not considered part of D′ and SD is left in its original form. D′ can be shared because QI is transformed and SD is in its original form, yet it is very difficult to identify the record owner. Coming up with the transformation function is key to the success of anonymization design, and this is nontrivial. We spend a lot of time on anonymization design, which is generally applied on static data, or data at rest.
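A minimal sketch of such a transformation T is shown below, assuming a single flat record and deliberately simple rules: the explicit identifier is masked with a fixed token, the quasi-identifiers are generalized (year of birth to a decade, zip code to a prefix), and the sensitive data are passed through untouched. The specific rules are illustrative assumptions, not the anonymization designs developed later in the book.

    def transform(record):
        """Toy transformation T: mask EI, generalize QI, leave SD in its original form."""
        return {
            "name": "XXXX",                                  # EI: masked out
            "dob": str((record["dob"] // 10) * 10) + "s",    # QI: 1975 -> "1970s"
            "zip_code": record["zip_code"][:3] + "**",       # QI: "66011" -> "660**"
            "gender": record["gender"],                      # QI: left as is in this sketch
            "account_balance": record["account_balance"],    # SD: untouched, stays useful
        }

    print(transform({"name": "Hari", "dob": 1975, "zip_code": "66011",
                     "gender": "Male", "account_balance": 5000}))
    # {'name': 'XXXX', 'dob': '1970s', 'zip_code': '660**', 'gender': 'Male',
    #  'account_balance': 5000}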
The other scenario is protecting SD, as shown in Table 1.5, which is applied
on data in motion. The implementation of this is also very challenging.
It is dichotomous as organizations take utmost care in protecting the
privacy of their customers’ data, but the same customers provide a whole
lot of personal information when they register on social network sites like
Facebook (of course, many of the fields are not mandatory but most peo-
ple do provide sufficient personal information), including address, phone
numbers, date of birth (DOB), details of education and qualification, work
experience, etc. Sweeney [2] reports that zip code, DOB, and gender are
sufficient to uniquely identify 83% of the population in the United States. With
the amount of PII available on social networking sites, a data snooper with
some background knowledge could use the publicly available information
to re-identify customers in corporate databases.
In the era of social networks, de-identification becomes highly challenging.
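The kind of linking attack this makes possible can be sketched in a few lines of Python: an adversary joins publicly available records with an "anonymized" release on the quasi-identifier values alone. All of the data below are fabricated purely for illustration.

    # Hypothetical public records (e.g., scraped from a voter list or a social profile).
    public = [{"name": "Amy", "zip_code": "66066", "dob": 1980, "gender": "Female"}]

    # Hypothetical release in which EI was removed but QI was left untouched next to SD.
    released = [{"zip_code": "66066", "dob": 1980, "gender": "Female",
                 "account_balance": 17000}]

    QI = ("zip_code", "dob", "gender")

    def link(public_rows, released_rows):
        """Re-identify released records by matching them to public rows on the QI values."""
        return [(p["name"], r["account_balance"])
                for p in public_rows
                for r in released_rows
                if all(p[a] == r[a] for a in QI)]

    print(link(public, released))  # [('Amy', 17000)]: identity re-linked to sensitive data

This is why removing explicit identifiers alone is not enough; the quasi-identifiers themselves must be transformed.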

1.3 Use Cases: Need for Sharing Data


Organizations tend to share customer data as there is much insight to be
gained from customer-sensitive data. For example, a healthcare provider’s
database could contain how patients have reacted to a particular drug or
treatment. This information would be useful to a pharmaceutical company.
However, these sensitive data cannot be shared or released due to legal, finan-
cial, compliance, and moral issues. But for the benefit of the organization and
the customer, there is a need to share these data responsibly, which means
the data are shared without revealing the PII of the customer. Figure 1.2 sets
FIGURE 1.2
Sensitive data in the enterprise. (Diagram showing enterprise-wide sensitive data protection: PII flows through business process management, enterprise application integration, applications, operational data stores (ODSs), production data, the secure data warehouse for historic data, time series databases, unstructured data on Hadoop file servers, and reporting, and feeds five use cases: Use Case 1, data mining and analysis; Use Case 2, test data manufacturing; Use Case 3, business process outsourcing (BPO); Use Case 4, application support; and Use Case 5, audit for regulatory compliance by external auditors.)

the context for this book, which illustrates the importance of enterprise-wide
sensitive data protection. PII is found in all layers of architecture like busi-
ness processes, applications, operational data stores (ODSs), data warehouse
for historic data, time series databases, file servers for unstructured data,
and reports. Whether data are shared with internal departments or external
vendors, it is critical to protect the privacy of customer data. It is important to
have an enterprise-wide privacy preservation design strategy to address the
heterogeneity in data sources, data structures, and usage scenarios.
What are the relevant use cases for which data are shared by organizations?
Let us now explore for what purposes the data are shared and how.

• Data mining and analysis


• Application testing
• Business operation
• Application support
• Auditing and reporting for regulatory compliance

These use cases can be classified under two categories:

1. Privacy protection of sensitive data at rest


2. Privacy protection of sensitive data in motion (at run-time)

The first two use cases fall under the first category and the rest under the
second category. One important aspect to note here—without privacy pres-
ervation none of these use cases is feasible. Figure 1.3 illustrates this concept.
A brief coverage of some of the use cases is provided here, and they are dealt
with in detail in dedicated chapters in the book.

FIGURE 1.3
Use cases for privacy preservation. (Enterprise data are provisioned, through a privacy preservation layer, to use cases such as classification, clustering, association mining, test data, and business operations.)

1.3.1 Data Mining and Analysis


Banks would want to understand their customers’ credit card usage pat-
terns, a retail company would like to study customers’ buying habits, and
a healthcare company would like to understand the effectiveness of a drug
or treatment provided to their patients. All these patterns are hidden in the
massive amounts of data they hold. The intention of the company is to gather
comprehensive knowledge from these hidden patterns. Data mining is the
technique that is used to gather knowledge and predict outcomes from large
quantities of data. The goal of data mining is to extract knowledge, discover
patterns, predict, learn, and so on. The key functional blocks in most data
mining applications are classification, clustering, and association pattern
mining.
Classification is defined as categorizing observations into various classes or
groups; it is an approach for predicting a qualitative, discrete response
variable, such as a YES or NO outcome. Many techniques are used for
classification: decision trees, random forests, support vector machines,
Bayesian classifiers, etc. An example could be that of customers defaulting on
mortgage payments, where the intention is to understand which income
brackets are likely to default on payments. For this to work, the classifier
needs to be trained on training data so that it can come up with a model to be
used for prediction. Training data contain sensitive customer information, so
all of this needs to be carried out in a way that preserves privacy. When the
training data are not directly available, synthetic data need to be generated to
train the model, which will later need testing using some valid test data.
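As a rough illustration of the classification step itself (not of any particular privacy preserving scheme), the following sketch trains a decision tree on a small, entirely made-up income/default table using scikit-learn; in a privacy preserving setting, the training records would first be anonymized or synthesized.

# Illustrative only: predict mortgage default from income bracket.
# The training rows are synthetic stand-ins for anonymized customer data.
from sklearn.tree import DecisionTreeClassifier

X = [[20], [25], [32], [48], [55], [70], [90], [120]]   # annual income in thousands
y = [1,    1,    1,    0,    0,    0,    0,    0]       # 1 = defaulted, 0 = did not

model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X, y)

print(model.predict([[30]]))   # -> [1]: this toy model flags a 30K income as likely to default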
While classification comes under supervised learning where we are inter-
ested in predicting an outcome, clustering is an unsupervised learning
technique for finding subgroups in the data set. The goal of clustering is
exploratory learning rather than prediction. In order to protect the confi-
dentiality of customer data, anonymization techniques are used to prevent
identity and attribute disclosure, but the important aspect that needs to be
considered is “what is the correct approach to anonymize in cluster analy-
sis?” For example, are perturbative techniques more suitable than nonper-
turbative techniques for cluster analysis?
Association pattern mining is very similar to clustering and is defined in the
context of sparse binary transaction databases where data entries are either
0 or 1. In this context, the task is to determine subsets of columns whose values
frequently take the value 1 together. For example, when a customer
buys a product A, he will very likely also buy product B, which shows a
strong association between A and B. Customers who buy bread may invariably
buy butter. Association rule mining is extensively used in retail supermar-
kets (for positioning their products in store), healthcare, pharmaceuticals, and
e-commerce companies. When such data are mined, customers’ transactions
and their personal information could get compromised. Therefore, it is impor-
tant to mine association rules or patterns in a way that preserves privacy.
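As a simple numerical illustration of what a rule such as bread implies butter means, the sketch below computes support and confidence over a toy transaction set; it shows only the plain mining step that privacy preserving association rule mining must protect, and the transactions are invented.

# Toy support/confidence computation for the rule {bread} -> {butter}.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "eggs"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "butter"}))       # 0.5: half of all baskets contain both
print(confidence({"bread"}, {"butter"}))  # ~0.67: two of the three bread buyers also buy butter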

1.3.2 Software Application Testing


A number of companies across the globe outsource their application testing.
Outsourcing of testing is growing at about 20% every year. Application
testing comprises functional requirements and nonfunctional require-
ments testing. Successful application testing requires high-quality test data.
High-quality test data are present in production systems, and these are sensi-
tive customer data, which must be anonymized before sharing with testers.
A high-level process of test data manufacturing is shown in Figure 1.4.
How to ensure the privacy of test data and at the same time make it useful
for testing is examined in detail in Chapter 6.

1.3.3 Business Operations


Many large companies across the globe outsource their business operations to
business process outsourcing (BPO) companies in countries like India, China,
and Brazil. For example, a financial services company outsources its business
operations to a BPO company in India. Then that BPO company will assist
customers of the financial services company in their business operations
such as securities trading, managing their financial transactions, and so on.

FIGURE 1.4
Privacy preserving test data manufacturing. (A subset of test data is extracted from the production database, anonymized, and loaded as privacy preserved test data for the software application under test.)

But access to a customer's trade account during these processes would expose
a lot of sensitive information; this is not acceptable to the customer,
and the regulations in the country will not permit such access. Therefore, these
data have to be protected. But how should they be protected, and which data
need protection? A technique known as tokenization is used here, wherein the
sensitive data that should not be seen by the BPO employee are replaced with
a token. This token has no relationship with the original data, and outside
the context of the application the token has no meaning at all. All of this is
executed at run-time. Like privacy preserving data mining and test data
management, run-time protection of sensitive data is a subject in itself and is
covered in depth in this book.

1.4 Methods of Protecting Data


One of the most daunting tasks in information security is protecting sensitive
data in enterprise applications, which are often complex and distributed. What
methods are available to protect sensitive data? Some of the methods avail-
able are cryptography, anonymization, and tokenization, which are briefly dis-
cussed in this section, and a detailed coverage is provided in the other chapters
in the book. Of course, there are other one-way functions like hashing.
Cryptographic techniques are among the oldest known techniques for data
protection. When done right, they are also among the safest techniques for
protecting data in motion and at rest. Encrypted data are well protected,
but they are not readable, so how can we use such data? Another
issue associated with cryptography is key management. Any compromise
of the key means complete loss of privacy. For the use cases discussed in this
book, cryptographic techniques are not used widely. Of course, there are
techniques like secure multiparty computation (MPC) and zero-knowledge
proof (ZKP), which are discussed in detail in Duan and Canny [3].
Anonymization is a set of techniques used to modify the original data in
such a manner that it does not resemble the original value but maintains the
semantics and syntax. Regulatory compliance and ethical issues drive the
need for anonymization. The intent is that anonymized data can be shared
freely with other parties, who can perform their own analysis on the data.
Anonymization is an optimization problem, in that when the original data are
modified they lose some of their utility. But modification of the data is required
to protect them. An anonymization design is a balancing act between data pri-
vacy and utility. Privacy goals are set by the data owners, and utility goals are
set by data users. Now, is it really possible to optimally achieve this balance
between privacy and utility? We will explore this throughout this book.
Tokenization is a data protection technique that has been extensively used in
the credit card industry but is currently being adopted in other domains as well.

Tokenization is a technique that replaces the original sensitive data with non-
sensitive placeholders referred to as tokens. The fundamental difference
between tokenization and the other techniques is that in tokenization, the origi-
nal data are completely replaced by a surrogate that has no connection to the
original data. Tokens have the same format as the original data. As tokens are
not derived from the original data, they exhibit very powerful data protection
features. Another interesting property of tokens is that, although a token is usable
within its native application environment, it is completely useless elsewhere.
Therefore, tokenization is ideal to protect sensitive identifying information.
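The following is a minimal sketch of how a token vault might behave; the format preserving scheme (a random digit string of the same length) and the in-memory vault are assumptions made for brevity, and a production tokenization service would add collision handling, access control, and a hardened vault.

import secrets

vault = {}   # toy token vault: original value -> token

def tokenize(value):
    # The token has the same format (digits, same length) but is not derived
    # from the original value in any way.
    if value not in vault:
        vault[value] = "".join(secrets.choice("0123456789") for _ in value)
    return vault[value]

def detokenize(token):
    # Only the holder of the vault can map a token back to the original value.
    return next(v for v, t in vault.items() if t == token)

card = "4111111111111111"
token = tokenize(card)
print(token)                      # e.g., 8302957614407923: same format, no relation to the card
print(detokenize(token) == card)  # True, but only inside the tokenization service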

1.5 Importance of Balancing Data Privacy and Utility


In the introductory section, we looked at how an enterprise provisions data
for purposes like application testing, knowledge discovery, and data analy-
sis. We also emphasized the importance of privacy preservation of customer
data before publishing them. Privacy preservation should also ensure util-
ity of data. In other words, the provisioned data should protect the indi-
vidual’s privacy and at the same time ensure that the anonymized data are
useful for knowledge discovery. By anonymizing the data, EI are completely
masked out, QI is de-identified by applying a transformation function, and
SD is left in its original form. There is a strong correlation between QI and SD
fields. So, as part of privacy preservation, this correlation between QI fields
and SD fields should not be lost. If the correlation is lost, then the resulting
data set is not useful for any purpose.
As a transformation function is applied on QI, it is obvious that the cor-
relation between QI fields and SD fields is affected or weakened, and this
indicates how useful the transformed data are for the given purpose. Let
us take an example from the healthcare domain to illustrate this important
relationship between privacy and utility. HIPAA states that if any of the data
elements it lists is associated with health information, that information becomes
personally identifiable. HIPAA defines 18 attributes as PII, including name,
SSN, geographic information, demographic information, telephone number,
admission date, etc. [4]. Therefore, in any privacy preserving analysis of
health data, it should be ensured that each of these 18 attributes, if present,
is completely anonymized. If so much information is stripped off,
then how can the remaining data be useful for analysis? Let us take the
example of a patient getting admitted to a hospital. According to the HIPAA
privacy rules, the admission date is part of the patient's PII and therefore
should be anonymized. The healthcare provider can share the patient's
medical data with external partners for analysis, but it will be impossible to
analyze the efficacy of the treatment because the date of admission is anonymized
as per the HIPAA privacy rules. HIPAA's intention is to protect patient privacy,
but it impacts medical research in the process. Therefore, it is extremely
important to ensure the utility of the data while preserving privacy. In other
words, there needs to be a balance between privacy and utility of anony-
mized data. Figure 1.5 provides a map of privacy versus utility.
In the previous section, we looked at different mechanisms to protect data.
A cryptographic mechanism provides low utility (0) and high privacy (1) when
data are encrypted, and high utility (1) and low privacy (0) when data are
decrypted. The privacy or utility in a cryptographic mechanism is therefore
either black (0) or white (1), whereas in anonymization methods it is "shades
of gray," meaning that it is possible to control the levels of privacy and utility.
Anonymization can be viewed as constrained optimization: produce a data set
with the smallest distortion that also satisfies the given set of privacy requirements.
But how do you balance the two contrasting features, privacy and utility?
Anonymized data are utilized in many areas of an organization, such as data
mining, analysis, or creating test data. An important point to remember here is
that each type of requirement or analysis warrants a different anonymization design.
This means that there is no single privacy versus utility measure. To understand
the privacy versus utility trade-off, let us take the original data given in Table 1.6.

1. Original data table with no privacy but high utility


2. High correlation between the QI and SD attribute fields

FIGURE 1.5
Privacy versus utility map. (Privacy and utility are each plotted on a scale from 0 to 1; the curve marks the region of optimum privacy and optimum utility.)

TABLE 1.6
Original Table with Strong Correlation between QI and SD
Name Zip Code Gender Income
Chen 56001 Male 25K
Jenny 56015 Female 8K
Alice 56001 Female 30K
Ram 56011 Male 5K

Table 1.6 shows four individuals. Although many rows have not been shown
here, let us assume that the ZIP CODE and INCOME are correlated, in that
the ZIP CODE 56001 primarily consists of high-income individuals. Table 1.7
is a modified version of Table 1.6. Let us not worry about the techniques used
to anonymize data, but focus just on the results. We can see that the names
have been changed, the original ZIP CODES have been replaced with differ-
ent values and INCOME values are unchanged.
Let us assess the gains and losses for this anonymization design.
Privacy gain: Names are substituted (hence protected), financial standing is
not attributed to another zip code, and the geographical location is anonymized.
Utility loss: Small. Gender information is preserved, names are substituted while
preserving demographic clues, and the correlation is preserved even though the
zip codes are different.
Another design could have just "XXXX" for all names, 56001 for all zip codes,
and "Male" for all gender values. We can agree that this anonymization
design scores well in terms of privacy, but its utility is poor. Privacy gain:
Names are completely suppressed, financial standing cannot be inferred,
and geographical location is not compromised. Utility loss: Severe. The presence
of females in the population is hidden, the meaningless names lose demographic
clues, and the flat value of the zip code annuls the correlation.
This shows that anonymization design drives the extent of privacy and
utility, which are always opposed to each other. The two designs also show
that privacy or utility need not be 0 and 1 as in encryption; rather, both are
shades of gray as stated earlier. A good design can achieve a balance between
them and achieve both goals to a reasonable extent.
One way to quantify privacy is on the basis of how much information an
adversary can obtain about the SD of an individual from different dimen-
sions in the data set [5–8]. These references state that SD fields can be iden-
tified (or estimated/deduced) using QI fields. This is a very simple way
to quantify privacy. In fact, this model does not capture many important
dimensions, such as background knowledge of the adversary, adversary’s
knowledge of some of the sensitive data, the complexity of the data structure,
etc. We discuss this in sufficient detail in Chapter 4.
The utility loss of a particular anonymization technique is measured
against the utility provided by the original data set. A measure of utility

TABLE 1.7
Anonymized Table with Generalized Values—Correlation
between QI and SD Is Broken
Name Zip Code Gender Income
Yang 56000 Male 25K
Emma 56010 Female 8K
Olivia 56000 Female 30K
Krishna 56010 Male 5K

is also how well the correlation between QI and SD is preserved in the anonymized
data. There are many anonymization techniques in use today, which can be
broadly classified into perturbative and nonperturbative techniques. Each
of these techniques provides its own privacy versus utility model. The core
goals of these anonymization techniques are (1) to prevent an adversary
from identifying SD fields and (2) to ensure minimal utility loss in the ano-
nymized data set by ensuring high correlation between the QI and SD fields.
This is easier said than done. These are extremely difficult goals to meet.
To address this complex set of problem patterns, we have defined a rich set
of anonymization design principles in Appendix A.

1.5.1 Measuring Privacy of Anonymized Data


Given a data set D, a data anonymizer can create different anonymized
data sets D1′, D2′,…, Dn′ based on different anonymization algorithm com-
binations for each attribute. Each of these anonymized data sets will have
different privacy versus utility trade-offs. Privacy is a relative measure.
This means that the privacy of D1′ is measured against another anony-
mized data set D2′. There are multiple ways to measure the difference in
privacy. These approaches are broadly classified into statistical and proba-
bilistic methods. Some statistical approaches measure privacy in terms of
the difference or variation in perturbed variables. The larger the variance,
the better the privacy of the perturbed data. This technique is generally
used for statistical databases.
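As a rough numerical illustration of the variance-based view (one common way such measures are computed, assumed here for illustration rather than taken from a specific reference), the sketch below perturbs an income column with additive Gaussian noise and reports the variance of the perturbation; the larger the noise variance, the more uncertain an adversary is about the original values.

import random
import statistics

random.seed(0)
incomes = [25_000, 8_000, 30_000, 5_000, 42_000, 18_000]

def perturb(values, noise_sd):
    # Additive zero-mean Gaussian noise perturbation (illustrative).
    return [v + random.gauss(0, noise_sd) for v in values]

for noise_sd in (500, 5_000):
    perturbed = perturb(incomes, noise_sd)
    diffs = [p - o for p, o in zip(perturbed, incomes)]
    # Larger variance of the perturbation implies better privacy for these values.
    print(noise_sd, round(statistics.variance(diffs), 1))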
Probabilistic methods measure privacy loss when an adversary has knowl-
edge of the distribution of the data in the original data set and background
information about some tuples in the data set. For example, consider the
simple example in Table 1.8.
Bob is the adversary and has some background information about Alice as
she is his neighbor. Bob knows that Alice smokes heavily but does not really
know what disease she is suffering from. However, he has knowledge about
the distribution of the sensitive fields in a table containing medical records
of a hospital that he has noticed Alice visiting. Bob then uses the knowledge
of the distribution of SD fields and background information about Alice to
identify her illness, which is cancer.

TABLE 1.8
Background Knowledge of the Adversary
about the Distribution of SD Fields
Name Zip Code Gender Disease
John Smith 46001 Male Hypertension
Tom Henry 46005 Male Gastritis
Alice Williams 46001 Female Cancer
Little Wood 46011 Male Asthma

1.5.2 Measuring Utility of Anonymized Data


Assume that in the original data set D, the QI and SD are highly correlated.
An example could be the correlation between demographic and geographic
information, such as year of birth, country of birth, locality code, and income
[10]. Data set D contains the truth about the relationship between demo-
graphic and geographic information and income. While anonymizing D, the
truth should be preserved for the data to be useful. When D is anonymized
to D′ using a transformation function T, D′ = T(D), the QI fields are distorted
to some extent in D′. Now, how true is D′? Does the correlation between
QI and SD fields in D′ still exist? Each anonymization function will provide
different levels of distortion. If Q is the distribution of QI fields in D and Q′
is the distribution of QI fields in D′, then the statistical distance measure of Q
and Q′ provides an indication of the utility of D′ [11]. This reference provides
a number of approaches to measure utility.
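The sketch below shows one hypothetical way such a distance could be computed, using the total variation distance between the zip code distributions of D and D′; the choice of distance and the toy data are assumptions, and [11] discusses several alternative utility measures.

from collections import Counter

def distribution(values):
    n = len(values)
    return {v: c / n for v, c in Counter(values).items()}

def total_variation(p, q):
    # Total variation distance between two discrete distributions (0 = identical, 1 = disjoint).
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0) - q.get(k, 0)) for k in keys)

zips_D       = ["56001", "56015", "56001", "56011"]   # QI values in D
zips_swapped = ["56011", "56001", "56015", "56001"]   # values swapped among records
zips_flat    = ["56000", "56000", "56000", "56000"]   # everything generalized to one value

p = distribution(zips_D)
print(total_variation(p, distribution(zips_swapped)))  # 0.0 -> distribution fully preserved
print(total_variation(p, distribution(zips_flat)))     # 1.0 -> distribution destroyed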
In Chapter 6, we show that the quality of anonymized test data sets is
one of the drivers for test coverage. The higher the quality of test data, the
higher will be the test coverage. High-quality test data are present in produc-
tion systems and contain PII. Privacy preservation results in poor test data
quality or utility. Reduced utility reflects lower test coverage. We examine
different anonymization approaches and resulting utility.

1.6 Introduction to Anonymization Design Principles


Anonymization design is not straightforward. As we saw in Section 1.5,
achieving a balance between privacy and utility has many dependencies. So,
what are the drivers for anonymization design? Factors that drive anony-
mization design for a given requirement are illustrated in Figure 1.6.
When there is a need for data privacy, organizations generally use either
a commercial or a home-grown product for anonymizing data. It is critical
to ensure that an organization’s data anonymization program is not limited
by the features of the product. Many organizations fail to maintain a bal-
ance between privacy and utility. It is generally difficult to determine how
much anonymization is required, which results in either loss of informa-
tion or the anonymized data set becoming unusable. Even with adoption
of the best of breed data anonymization products, an organization’s anony-
mization program may not be successful. In addition to this, the pressures
of regulatory compliance force many organizations to be very defensive
and adopt very high privacy standards that will render the data unusable
for any research. Take, for example, HIPAA or the Swiss Data Protection Act,
which are highly restrictive, with the intention of protecting the privacy of the
individual. If enough care is not taken, then the anonymized data could have
very little utility.

FIGURE 1.6
Drivers for anonymization design. (The business domain, data semantics, data type and structure, environment, regulations, application requirements, privacy requirements, and utility requirements together drive the anonymization design.)

In this context, irrespective of which tool an organi-
zation uses, there is a need for a mechanism to monitor privacy versus util-
ity for various privacy requirements. Unfortunately, quantifying privacy
and utility is nontrivial. Therefore, it is critical to provide assurance of high
quality of data anonymization during the initial phase of the anonymiza-
tion life cycle. To support this, we felt it is necessary to define a set of design
principles. These principles will provide the required guidelines for the
data anonymizer to adopt the correct design for a given anonymization
requirement.
As software architects, we start the architecting process by following a
set of architecture principles that will guide us to come up with the correct
design for the system. We base our work here on a similar approach. In [12],
the authors classify principles into two broad types—scientific and norma-
tive. Scientific principles are laws of nature and form the fundamental truths
that one can build upon. Normative principles act as a guide and need to be
enforced. Similarly, a data anonymizer needs guidance, and the anonymiza-
tion design principles should be enforced to ensure proper anonymization
design. These principles are fundamental in nature and are applicable to all
aspects of anonymization. They connect the high-level privacy and utility
requirements to low-level implementation.

In this book, all principles are explained in the following form:

• Principle Name
• Rationale
• Implications

These anonymization principles can be found in Appendix A.

1.7 Nature of Data in the Enterprise


1.7.1 Multidimensional Data
Multidimensional data, also referred to as relational data, are the most com-
mon format of data available today in many enterprises. In a relational
table, each row is a vector that represents an entity. The columns repre-
sent the attributes of the entity. As relational data are the most common
data format, a lot of attention has been paid to privacy preservation of
relational data [2,13,14]. As described earlier, a row of data in a relational
table is classified into explicit identifiers, quasi-identifiers, sensitive data,
and nonsensitive data. Both perturbative and nonperturbative techniques
could be used to protect the data. As a rule, EI are completely masked out,
QI are anonymized, and SD are left in their original form. (These terms
are explained in detail in Chapter 2.) Depending on the sensitivity of data,
appropriate data protection techniques can be applied. The fundamental
differences between anonymizing multidimensional data and other data
structures are as follows:

• In a multidimensional data table, each record or row is independent
of others; therefore, anonymizing a few of the records will not affect
other records.
• Anonymizing a tuple in a record will not affect other tuples in the
record.

Other complex data structures, such as graph, longitudinal, or time series
data, cannot be viewed in this way. Privacy preservation for multidimen-
sional data can be classified into (1) random perturbation methods and
(2) group anonymization techniques, such as k-anonymity or l-diversity.
These techniques are used to prevent identity disclosure and attribute
disclosure.
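As a toy illustration of the group anonymization idea behind k-anonymity [2], the sketch below generalizes the QI of a small relational table and checks the size of the smallest QI group; the generalization rules (zip truncation and age banding) are assumptions chosen only for brevity.

from collections import Counter

# Toy records: (zip, age, income); zip and age are QI, income is SD.
records = [("56001", 34, "25K"), ("56003", 36, "30K"),
           ("56011", 52, "8K"),  ("56015", 57, "5K")]

def generalize(rec):
    zip_code, age, income = rec
    decade = (age // 10) * 10
    # Generalize QI: truncate the zip to 3 digits and band the age into decades.
    return (zip_code[:3] + "**", f"{decade}-{decade + 9}", income)

generalized = [generalize(r) for r in records]
groups = Counter((z, a) for z, a, _ in generalized)   # group records by generalized QI

k = min(groups.values())
print(generalized)
print(f"The generalized table is {k}-anonymous with respect to (zip, age)")   # k = 2 here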

1.7.1.1 Challenges in Privacy Preservation of Multidimensional Data


The challenges in this kind of data preservation are as follows:

1. Difficulty in identifying the boundary between QI and SD in the
presence of background knowledge of the adversary
2. High dimensionality of data poses a big challenge to privacy
preservation
3. Clusters in sensitive data set
4. Difficulty in achieving realistic balance between privacy and utility

1.7.2 Transaction Data


Transaction data are a classic example of sparse high-dimensional data.
A transaction database holds transactions of a customer at a supermarket or
it can be used to hold the diagnosis codes of a patient in a hospital. Privacy
of transaction data is very critical as an adversary who has access to this
database can obtain the shopping preferences of customers and exploit that
information. But the problem with a transaction database is that it is of very
high dimensionality and is sparsely filled. A supermarket will have thou-
sands of products contributing to the high dimensionality of the transaction
database. Moreover, the transactional data contained in the database are
binary—either 0 or 1. An event of a transaction is represented by 1; other-
wise, it would be a 0 (Table 1.9).
In this table, P1–Pn represent the products in the supermarket. The cus-
tomer Hari has made a transaction on P3 and P6, which means his shopping
cart contains products P3 and P6, say bread and cheese. There is nothing sen-
sitive about a customer buying bread and cheese. But if the product hap-
pens to be a blood glucose or blood pressure monitor, then that transaction
is sensitive from the customer’s perspective as he would not want others to
know that he is diabetic. It is the sensitivity of the transaction that needs to be
protected. Privacy preservation techniques used in the case of relational data
table will not be applicable here.

TABLE 1.9
Sample Sparse High-Dimensional Transaction Database
in a Supermarket
Name P1 P2 P3 P4 P5 P6 Pn
Hari 1 1
Nancy 1 1
Jim 1 1
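Because the matrix is almost entirely zeros, transaction data are usually held as sparse item sets rather than as a full binary matrix. The sketch below shows this representation and flags transactions containing an item the record owner would consider sensitive; the item identifiers and the sensitive-item list are hypothetical.

# Sparse representation: each transaction is just the set of purchased item IDs.
transactions = {
    "Hari":  {"P3", "P6"},   # e.g., bread and cheese
    "Nancy": {"P2", "P5"},   # hypothetical items
    "Jim":   {"P4", "P9"},   # P9: a blood glucose monitor (sensitive)
}

SENSITIVE_ITEMS = {"P9", "P17"}   # hypothetical list of items deemed sensitive

for customer, items in transactions.items():
    hits = items & SENSITIVE_ITEMS
    if hits:
        print(f"{customer}'s transaction exposes sensitive items: {sorted(hits)}")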

1.7.2.1 Challenges in Privacy Preservation of Transaction Data


Some of the challenges in privacy preservation of transaction data are
as follows:

1. High dimensionality.
2. Sparsity.
3. Conventional privacy preservation techniques used for relational
tables that have fixed schema are not applicable on transaction data.

1.7.3 Longitudinal Data


Longitudinal studies are carried out extensively in the healthcare domain.
An example would be the study of the effects of a treatment or medicine
on an individual over a period of time. The measurement of the effects is
repeatedly taken over that period of time on the same individual. The goal
of longitudinal study is to characterize the response of the individual to the
treatment. Longitudinal studies also help in understanding the factors that
influence the changes in response. Consider the following table that illus-
trates the effect of treatment for hypertension in a patient (Table 1.10).
The table contains a longitudinal set D, which has three disjoint sets of data—
EI, QI, and SD. A few important characteristics of the data set D that must be
considered while designing an anonymization approach are as follows:

• Data are clustered—composed of repeated measurements obtained
from a single individual at different points in time.
• The data within the cluster are correlated.
• The data within the cluster have a temporal order, which means the
first measurement will be followed by the second and so on [15].

TABLE 1.10
Sample Longitudinal Data Set in the Healthcare Domain
ID Name DOB ZIP Service Date Diseases Systolic (mmHg) Diastolic (mmHg)
1 Bob 1976 56711 30/05/2012 Hypertension 180 95
2 Bob 1976 56711 31/05/2012 Hypertension 160 90
3 Bob 1976 56711 01/06/2012 Hypertension 140 85
4 Bob 1976 56711 02/06/2012 Hypertension 130 90
5 Bob 1976 56711 03/06/2012 Hypertension 125 85
6 Bob 1976 56711 04/06/2012 Hypertension 120 80
7 Alice 1969 56812 31/03/2012 Hypertension 160 90

These are to be noted because the anonymization design should ensure
that these characteristics of D are preserved in the anonymized data set D′;
otherwise, the truth in the data will be lost.

1.7.3.1 Challenges in Anonymizing Longitudinal Data


Anonymization design for longitudinal data should consider two aspects:

1. The characteristics of longitudinal data in the anonymized data set
D′ should be maintained.
2. Anonymization designs aim to prevent identity and attribute
disclosure.

Consider the longitudinal data set D, which has three disjoint sets of data
(EI, QI, and SD). EI are completely masked to prevent identification. QI
are anonymized using generalization and suppression to prevent iden-
tity disclosure. In the case of longitudinal data, anonymizing identity attri-
butes alone is not sufficient to prevent an adversary from re-identifying the
patient. An adversary can still link some of the sensitive attributes to the
publicly available data, that is, medical records. Hence the need to pre-
vent attribute disclosure. For longitudinal data, an anonymization design
that prevents both identity and attribute disclosure is required [16]. There
are a number of techniques to prevent identity disclosure, such as perturba-
tive and nonperturbative techniques. Effective anonymization techniques
are required to prevent attribute disclosure, but these techniques should
also ensure that they preserve the characteristics of longitudinal data.

1.7.4 Graph Data


Graph data are interesting and found in many domains like social networks,
electronics, transportation, software, and telecom. A graph G = (V,E)
consists of a set of vertices V together with a set of vertex pairs, or edges, E.
Graphs are interesting as they can model almost any relationship. This is
especially relevant in modeling networks like financial networks and also
social networks like Facebook, LinkedIn, and Twitter. It is in these types
of applications that we see the need for privacy preservation of graph data.
Social networks have many users and contain a lot of personal informa-
tion, such as network of friends and personal preferences. Social network
data analytics is a rich source of information for many companies that
want to understand how their products are received by the customers. For
example, a bank would like to get feedback from their customers about the
various financial products and services they offer. The bank can have its
own page on, say, Facebook where its customers provide their views and
feedback. Publishing these data for mining and analysis will compromise

the privacy of the customers. Therefore, it is required to anonymize the
data before provisioning it for analytics. However, graph data are complex
in nature. The more complex the data structure, the easier it is to identify
entities [17].
Due to the complexity of graph data, the anonymization design applied
is different compared to that applied to relational data tables. In the case
of relational data tables, each row is treated as an entity and its QI are ano-
nymized to prevent identity disclosure. The attack model in relational data
is also straightforward. An adversary will use an external data source to
identify an individual via the QI. Graph data are more complex, and because
of their complexity they provide more avenues for re-identification. Consider
the network shown in Figures 1.7 and 1.8.
Figure 1.7 depicts a network with original data of users. The same has
been anonymized in Figure 1.8. Will this anonymization be enough to
thwart an adversary’s attempt to re-identify the users? The simple answer
is no. The many challenges in anonymizing graph data are discussed next.

FIGURE 1.7
Graph network with original data (nodes labeled Hari, Ram, Jane, Jack, Bob, and Alice).

FIGURE 1.8
Modified graph network (the same structure with the node labels replaced by A through F).
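To see why simply relabeling the nodes, as in Figure 1.8, is not enough, consider the sketch below; it uses the networkx library as a convenience, the edges are illustrative (the original figure shows only the node names), and it demonstrates that structural fingerprints such as the degree sequence survive relabeling unchanged, which is exactly what a structure-aware adversary exploits.

import networkx as nx

# Original friendship graph (the edges are illustrative assumptions).
G = nx.Graph()
G.add_edges_from([("Hari", "Ram"), ("Hari", "Jane"), ("Jane", "Jack"),
                  ("Jane", "Bob"), ("Bob", "Alice"), ("Jack", "Alice")])

# "Naive" anonymization: replace names with meaningless labels.
mapping = dict(zip(sorted(G.nodes()), "ABCDEF"))
G_anon = nx.relabel_nodes(G, mapping)

# The structure is untouched, so structural fingerprints are preserved.
print(sorted(d for _, d in G.degree()))        # [1, 2, 2, 2, 2, 3]
print(sorted(d for _, d in G_anon.degree()))   # identical degree sequence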

1.7.4.1 Challenges in Anonymizing Graph Data


Privacy of graph data can be classified into three categories [18]:

1. Identity disclosure
2. Link disclosure
3. Content/attribute disclosure

Apart from these, graph metrics such as betweenness, closeness, reachability,
path length, and centrality can also be used to breach the privacy of target
individuals in a graph network.
Identity disclosure: Identity disclosure occurs when it is possible to identify
the users in the network.
Content disclosure: Just as in a relational table, sensitive content is associated
with each node (entity). This sensitive content is classified into explicit iden-
tifiers like name and SSN; QI such as demographics, gender, and date of birth;
and other sensitive data such as preferences and relationships.
Link disclosure: Links between users are highly sensitive and can be used to
identify relationships between users.
It is very challenging to anonymize graph data as it is difficult to devise
an anonymization design. For the graph to maintain privacy and utility after
anonymization, the techniques need to alter the graph, but just enough to
prevent identity, content, and link disclosure. At the same time, changing
the nodes or labels could affect other nodes in the network and could poten-
tially alter the structure of the network itself. Meanwhile, there are multiple
avenues to identify a user in a graph network since it is difficult to model
the adversary’s background knowledge. How to provision graph data in a
privacy preserving way for mining and analysis? What anonymization tech-
niques are available for graph data?

1.7.5 Time Series Data


Time series data result from taking measurements at regular intervals of
time from a process. An example of this could be temperature measurement
from a sensor or daily values of a stock, the net asset value of a fund taken on
a daily basis, or blood pressure measurements of a patient taken on a weekly
basis. We looked at longitudinal data where we considered the response of
a patient to blood pressure medication. The measurements have a temporal
order. So, what is the difference between longitudinal data and time series
data? Longitudinal data are extensively used in the healthcare domain, espe-
cially in clinical trials. Longitudinal data represent repeated measurements
taken on a person. These measurements are responses to a treatment or drug,
and there is a strong correlation among these measurements. A univariate
time series is a long sequence of measurements of a single variable taken at regular

TABLE 1.11
Sample Time Series Data Table Showing Weekly Sales of Companies
ID Company Name Address Week 1 Week 2 Week 3 Week 4 Week 5
1 ABC Park Street, 56001 10,000 12,000 17,000 8,000 11,000
2 ACME Kings Street, 56003 15,000 17,000 18,000 20,000 21,000
3 XYZ Main Street, 56022 20,000 23,000 25,000 26,000 30,000
4 PQR Queen Street, 56021 14,000 18,000 19,000 19,500 21,000

intervals, say, blood pressure measurements of a patient taken over a period
of time. These measurements need not necessarily be a response to a drug
or treatment. Longitudinal data have very small dimensions compared to
time series data, which are high dimensional and keep growing. Sample time
series data showing the weekly sales of several companies are given in Table 1.11.
It can be observed from the table that each record has three disjoint sets
of data—EI, QI, and SD. This is very similar to the structure of multidi-
mensional data. But that is where the similarity ends. In multidimensional
data, each record is independent of the others and can be anonymized with-
out affecting other records. The tuples in each record can be anonymized
without affecting other tuples in the record. But this approach cannot be
used with time series data because of its large size, high dimensionality, and
pattern. This makes privacy preservation rather challenging.
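One simple way to see the tension is additive noise perturbation, sketched below under the assumption of independent zero-mean Gaussian noise: the exact values are hidden and an aggregate statistic such as the mean is roughly retained, but the series is long, so the noise must be tuned carefully and the temporal pattern still has to be protected.

import random
import statistics

random.seed(1)
weekly_sales = [10_000, 12_000, 17_000, 8_000, 11_000]   # one company's series from Table 1.11

# Zero-mean additive noise: individual points change, the mean is approximately preserved.
noisy = [v + random.gauss(0, 1_000) for v in weekly_sales]

print(round(statistics.mean(weekly_sales)))   # 11600
print(round(statistics.mean(noisy)))          # close to 11600, but each point differs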

1.7.5.1 Challenges in Privacy Preservation of Time Series Data


Some of the challenges in privacy preservation of the time series data are as
follows:

• High dimensionality
• Retaining the statistical properties of the original time series data
like mean, variance, and so on
• Supporting various types of queries like range query or pattern
matching query
• Preventing identity disclosure and linkage attacks

References
1. J.M. Skopek, Anonymity: The production of goods and institutional design,
Fordham Law Review, 82(4), 1751–1809, 2014, http://ir.lawnet.fordham.edu/flr/
vol82/iss4/4/.
2. L. Sweeney, k-Anonymity: A model for protecting privacy, International Journal
of Uncertainty, Fuzziness and Knowledge Based Systems, 10 (5), 557–570, 2002.

3. Y. Duan and J. Canny, Practical distributed privacy-preserving data analysis at
large scale, in Large Scale Data Analytics, A. Gkoulalas-Divanis, A. Labbi (eds.),
Springer, 2014.
4. Summary of the HIPAA Privacy Rule, http://www.hhs.gov/ocr/privacy/
hipaa/understanding/summary/index.html. Accessed October, 2015
5. A. Machanavajjhala et al., l-Diversity: Privacy beyond k-Anonymity, in
Proceedings of the 22nd International Conference Data Engineering (ICDE 2006),
Atlanta, GA, 2006, p. 24.
6. T. Li and N. Li, On the trade-off between privacy and utility in data pub-
lishing, in Proceedings of the 15th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, ACM, Paris, France, June 28–July 1, 2009,
pp. 517–525.
7. N. Li, T. Li, and S. Venkatasubramanian, t-closeness: Privacy beyond
k-­anonymity and l-diversity, in Proceedings of the IEEE International Conference
Data Engineering (ICDE 2007), 2007, Istanbul, Turkey, pp. 106–115.
8. R.C.-W. Wong, J. Li, A.W.-C. Fu, and K. Wang, (α, k)-anonymity: An enhanced
k-anonymity model for privacy preserving data publishing, in Proceedings of
the 12th International Conference on Knowledge Discovery and Data Mining, 2006,
Philadelphia, PA, pp. 754–759.
9. X. Xiao and Y. Tao, Anatomy: Simple and effective privacy preservation, in
Proceedings of the 32nd International Conference on Very Large Data Bases (VLDB
2006), Seoul, South Korea, 2006, pp. 139–150.
10. N. Shlomo, Accessing microdata via the internet, Joint UN/ECE and Eurostat
work session on statistical data confidentiality, Luxembourg, April 7–9, 2003.
11. D. Kifer and J. Gehrke, Injecting utility into anonymized datasets, in Proceedings
of the ACM SIGMOD International Conference on Management of Data (SIGMOD
2006), Chicago, IL, June 27–29, 2006.
12. D. Greefhorst and E. Proper, Architecture Principles: The Cornerstone of Enterprise
Architecture, Springer, Berlin, Heidelberg, 2011.
13. P. Samarati, Protecting respondents' identities in microdata release, IEEE
Transactions on Knowledge and Data Engineering, 13(6), 1010–1027, 2001.
14. C.C. Aggarwal, On k-anonymity and curse of dimensionality, in Proceedings of
the Very Large Data Base, Trondheim, Norway, 2005, pp. 901–909.
15. G.M. Fitzmaurice, N.M. Laird, and J.H. Ware, Applied Longitudinal Analysis, 2nd
edn., Wiley, August 30, 2011.
16. M. Sehatkar and S. Matwin, HALT: Hybrid anonymization of longitudinal
transaction, in 11th Annual Conference on Privacy, Security and Trust (PST), IEEE,
2013, Tarragona, Spain, pp. 127–137.
17. M.L. Maag, L. Denoyer, and P. Gallinari, Graph anonymization using machine
learning, in The 28th IEEE International Conference on Advanced Information
Networking Application, 2014, Victoria, BC, Canada, pp. 1111–1118.
18. E. Zheleva and L. Getoor, Preserving the privacy of sensitive relationships in
graph data, in Proceedings of the First ACM SIGKDD International Workshop in
Privacy, Security and Trust, 2007, San Jose, CA.
19. B.E. Burke, Information protection and control survey: Data loss prevention
and encryption trends. IDC Special Study, Doc. # 211109, 2008.
References

1 1. Introduction to Data Privacy

1. J.M. Skopek, Anonymity: The production of goods and


institutional design, Fordham Law Review, 82(4),
1751–1809, 2014, http://ir.lawnet.fordham.edu/flr/
vol82/iss4/4/.

TABLE 1.11

Sample Time Series Data Table Showing Weekly Sales of


Companies

ID Company Name Address Week 1 Week 2 Week 3 Week 4 Week 5

1 ABC Park Street, 56001 10,000 12,000 17,000 8,000 11,000

2 ACME Kings Street, 56003 15,000 17,000 18,000 20,000


21,000

3 XYZ Main Street, 56022 20,000 23,000 25,000 26,000 30,000

4 PQR Queen Street, 56021 14,000 18,000 19,000 19,500 21,000

3. Y. Duan and J. Canny, Practical distributed


privacy-preserving data analysis at large scale, in Large
Scale Data Analytics, A. Gkoulalas-Divanis, A. Labbi
(eds.), 2014 Springer.

4. Summary of the HIPAA Privacy Rule,


http://www.hhs.gov/ocr/privacy/
hipaa/understanding/summary/index.html. Accessed October,
2015

5. A. Machanavajjhala et al., l-Diversity: Privacy beyond


k-Anonymity, in Proceedings of the 22nd International
Conference Data Engineering (ICDE 2006), Atlanta, GA,
2006, p. 24.

6. T. Li and N. Li, On the trade-off between privacy and


utility in data publishing, in Proceedings of the 15th ACM
SIGKDD International Conference on Knowledge Discovery and
Data Mining, ACM, Paris, France, June 28–July 1, 2009, pp.
517–525.

7. N. L i, T. Li, and S. Venkatasubramanian, t-closeness:


Privacy beyond k anonymity and l-diversity, in Proceedings
of the IEEE International Conference Data Engineering
(ICDE 2007), 2007, Istanbul, Turkey, pp. 106–115.
8. R.C.-W. W ong, J. Li, A.W.-C. Fu, and K. Wang, (α,
k)-anonymity: An enhanced k-anonymity model for privacy
preserving data publishing, in Proceedings of the 12th
International Conference on Knowledge Discovery and Data
Mining, 2006, Philadelphia, PA, pp. 754–759.

9. X. Xiao and Y. Tao, Anatomy: Simple and effective


privacy preservation, in Proceedings of the 32nd
International Conference on Very Large Data Bases (VLDB
2006), Seoul, South Korea, 2006, pp. 139–150.

10. N. Shlomo, Accessing microdata via the internet, Joint


UN/ECE and Eurostat work session on statistical data
confidentiality, Luxembourg, April 7–9, 2003.

11. D. Kifer and J. Gehrke, Injecting utility into


anonymized datasets, in Proceedings of the ACM SIGMOD
International Conference on Management of Data (SIGMOD
2006), Chicago, IL, June 27–29, 2006.

12. D. Greefhorst and E. Proper, Architecture Principles:


The Cornerstone of Enterprise Architecture, Springer,
Berlin, Heidelberg, 2011.

13. P. Samarati, Protecting respondents identities in micro


data release, in IEEE Transactions on Knowledge Data
Engineering, 2001, pp. 1010–1027.

14. C. C. Aggarwal, On k-anonymity and curse of


dimensionality, in Proceedings of the Very Large Data
Base, Trondheim, Norway, 2005, pp. 901–909.

15. G.M. Fitzmaurice, N.M. Laird, and J.H. Ware, Applied


Longitudinal Analysis, 2nd edn., Wiley, August 30, 2011.

16. M. Sehatkar and S. Matwin, HALT: Hybrid anonymization


of longitudinal transaction, in 11th Annual Conference on
Privacy, Security and Trust (PST), IEEE, 2013, Tarragona,
Spain, pp. 127–137.

17. M.L. Maag, L. Denoyer, and P. Gallinari, Graph


anonymization using machine learning, in The 28th IEEE
International Conference on Advanced Information
Networking Application, 2014, Victoria, BC, Canada, pp.
1111–1118.

18. E. Z heleva and L. Getoor, Preserving the privacy of


sensitive relationships in graph data, in Proceedings of
the First ACM SIGKDD International Workshop in Privacy,
Security and Trust, 2007, San Jose, CA.

19. B. E. Burke, Information protection and control survey:


Data loss prevention and encryption trends. IDC Special
Study, Doc. # 211109, 2008.
2 2. Static Data Anonymization Part I:
Multidimensional Data

1. C.C. Aggarwal, Privacy and the dimensionality curse,


Springer US, pp. 433–460.

2. C.C. Aggarwal and P.S. Yu (ed.), Privacy Preserving Data


Mining: Models and Algorithms, Springer US, 2008.

3. L. Sweeney, Achieving k-anonymity privacy protection


using generalization and suppression, International
Journal on Uncertainty, Fuzzy and Knowledge Based Systems,
10(5), 571–588, 2002.

4. L. Sweeney, Guaranteeing anonymity when sharing medical


data, The datafly system, in Proceedings of the Journal of
American Medical Informatics Association, Hanley and
Belfus Inc, Nashville, TN, 1997.

5. P. Samaraty, Protecting respondents’ identifies in


microdata release, IEEE Transactions on Knowledge and Data
Engineering, 13(6), 1010–1027, 2001.

6. K. LeFevre, D. DeWitt, and R. Ramakrishnan, Incognito:


Efficient full domain k-anonymity, in Proceedings of the
2005 ACM SIGMOD International Conference on Management of
Data, Baltimore, MD, June 13–16, 2005.

7. V. Iyengar, Transforming data to satisfy privacy


constraints, in Proceedings of the Eighth ACM SIGKDD
International Conference on Knowledge Discovery and Data
Mining, Edmonton, AB, 2002.

8. R. Bayardo and R. Agarwal, Data privacy through optimal


k-anonymization, in Proceedings of the 21st International
Conference in Data Engineering, Tokyo, Japan, 2005.

9. M. Ercan Nergiz and C. Clifton, Thoughts on


k-anonymization, in ICDEW’06: Proceedings of the 22nd
International Conference on Data Engineering Workshops,
IEEE Computer Society, Atlanta, GA, 2006, p. 96.

10. R. Dewri, I. Roy, and D. Whitley, On the optimal


section of k in the k-anonymity problem, in ICDE, IEEE,
Cancun, Mexico, 2008.

11. M. Hua and J. Pei, A survey of Utility based Privacy


preserving data transformation methods, in
Privacy-Preserving Data Mining, Springer US, 2008,
pp. 207–237.
12. A. Machanavajjhala, D. Kifer, J. Gehrke, and M.
Venkitasubramaniam, l-Diversity: Privacy beyond
k-anonymity, ACM Transactions on Knowledge Discovery from
Data, 1(1), Article 3, Publication date: March 2007.

13. N. Li, T. Li, and S. Venkatasubramanian, t-Closeness


Privacy beyond k- anonymity and l-diversity, in ICDE Data
Engineering, Istanbul, Turkey, 2007.
3 3. Static Data Anonymization Part II:
Complex Data Structures

1. E. Zheleva and L. Getoor, Preserving the privacy of


sensitive relationships in graph data, in Proceedings of
the First ACM SIGKDD Workshop on Privacy, Security and
Trust in KDD, (PinKDD 2007), Springer-Verlag Berlin,
Heidelberg, 2007, pp. 153–171.

2. B. Zhou, J. Pei, and W.-S. Luk, A brief survey on


anonymization techniques for

3. K. Liu and E. Terzi, Towards identity anonymization on


graphs, in Proceedings of the 2008 ACM SIGMOD
International Conference on Management of Data
(SIGMOD’08), ACM Press, New York, 2008, pp. 93–106.

4. M. Hay, Anonymizing social networks, Technical Report


07–19, University of Massachusetts Amherst, Amherst, MA,
2007.

5. G. Loukides, A.G. Koulalas-Divanis, and B. Malen,


Anonymization of electronic medical records for validating
genome-wide associateion studies, Proceedings of the
National Academy Sciences, 107(17), 7898–7903, 2010.

6. A. Tamerroy, G. Loukides, M. E. Nergiz, Y. Saygin, and


B. Malin, Anonymization of longitudinal electronic medical
records, IEEE Transactions on Information Technology in
Biomedicine, 16, 413–423, 2012.

7. M. Sehatkar and S. Matwin, HALT: Hybrid anonymization of


longitudinal transactions, in 11th Annual Conference on
Privacy, Security and Trust (PSI), IEEE, 2013, Tarragona,
Spain, pp. 127–134.

8. Y. Xu, K. Wang, A.W.C. Fu, and P.S. Yu, Anonymizing


transaction databases for publication, in Proceedings of
the 14th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining (KDD’08), ACM, Las Vegas, NV,
August 2008.

9. G. Ghinita, Y. Tao, and P. Kalnis, On the anonymization


of sparse high dimensional data, in 2008 IEEE 24th
International Conference on Data Engineering, Cancun,
Mexico, April 7–12, 2008, pp. 715–724.

10. M. H ay, G. Miklau, D. Jensen, D. Towsley, and P. Weis,


Resisting structural re-identification anonymized social
networks, Proceedings of VLDB Endowment, 1, 102–114, 2008,
Computer Science Department Faculty Publication Series,
Paper 179.
http://scholarworks.umass.edu/cs_faculty_pubs/179.

11. S.P. Borgatti, Centrality and network flow, in Sunbelt


International Social Networks Conference, New Orleans, LA,
2002.
4 4. Static Data Anonymization Part III:
Threats to Anonymized Data

1. B.-C. Chen, K. LeFevre, and R. Ramakrishnan, Privacy


Skyline—Privacy with multidimensional adversarial
knowledge, in Proceedings of 33rd International Conference
on Very Large Data Bases (VLDB 2007), July 2007.

2. M.L. Maug, L. Denoyer, and P. Gallinari, Graph


anonymization using machine learning, in IEEE 28th
International Conference on Advanced Information Networking
and Application, 2014, pp. 1111–1118.

3. N. Li, T. Li, and S. Venkatasubramanian, t-closeness:


Privacy beyond k- anonymity and l-diversity, in
Proceedings of the 21st International Conference on Data
Engineering (ICDE 2007), 2007.
5 5. Privacy Preserving Data Mining

1. J. Rizvi and J.R. Haritsa, Maintaining data privacy in


association rule mining, in Proceedings of 28th
International Conference on Very Large Databases (VLDB),
Hong Kong, China, August 2002.

2. A. Evfimievski, R. Srikant, R. Agarwal, and J. Gehrke,


Privacy preserving mining of association rules, in
Proceedings of Eighth ACM SIGKDD International Conference
on Knowledge, Discovery, and Data Mining, Edmonton, AB,
Canada, 2002.

3. J.R. Haritsa, Mining association rules under privacy


constraints, in Privacy Preserving Data Mining: Models and
Algorithms, C. Aggarwal and P.S. Yu (eds.), Springer, New
York, 2008.

4. R. Agrawal and R. Srikant, Privacy preserving data


mining, in Proceedings of the 2000 ACM SIGMOD
International Conference on Management of Data, May 2000,
Dallas, TX, pp. 439–450.

5. X. Li, Z. Yu, and P. Zhang, A review of privacy


preserving data mining, in IEEE International Conference
on Computer and Information Technology, Xi’an, China, 2014.

6. W. Johnson and J. Lindenstrauss, Extension of Lipshitz


mapping into Hilbert space, Contemporary Mathematics, 26,
189–206, 1984.

7. S.R.M. Oliveira and O. Zaiane, Privacy preserving


clustering by data transformation, in Proceedings of 18th
Brazilian Symposium for Databases, Brazil, October 2003,
pp. 304–318.
6 6. Privacy Preserving Test Data
Manufacturing

1. G.J. Myers, The Art of Software Testing, 3rd edition,


October 2011, Wiley Publications.

2. Information Commissioner Office, Mission and vision as


published at https://

3. K. Taneja, M. Grechanik, R. Ghani, and T. Xie, Testing


software in age of data privacy—A balancing act, in
ESEC/FSE’11, Szeged, Hungary, September 5–9, 2011.

4. C.C. Aggarwal and P.S. Yu, A condensation approach to


privacy preserving data mining, in Proceedings of EDBT,
Heraklion, Crete, Greece, 2004, pp. 183–199.

5. K. Chen and L. Liu, Privacy preserving data


classification with rotation perturbation, in Proceedings
of the International Conference on Data Mining, New
Orleans, LA, 2005, pp. 589–592.

6. R. Agrawal and R. Srikant, Privacy-preserving data


mining, in Proceedings of SIGMOD, Dallas, TX, 2000, pp.
439–450.

7. J. Metsa, M. Katara, and T. Mikkonen, Testing


non-functional requirements with aspects: An Industrial
Case Study, in Proceedings of QSIC, Portland, OR, 2007.

9. L. Lucia, D. Lo, L. Jiang and A. Budi, kbe-Anonymity:


Test data anonymization for evolving programs, in ASE’12,
Essen, Germany, September 3–7, 2012.

10. G.-A. Klutke, P.C. Kiessler, and M.A. Wortman, A


critical look at the bathtub curve, IEEE Transactions on
Reliability, 52(1), March 2003.

11. J. Feiman and B. Lowans, Magic quadrant for data


masking technology, December 22, 2015, available at

12. M. Kessis, Y. Ledru, and G. Vandome, Experiences in


coverage testing of a Java middleware, in Proceedings SEM
2005, ACM, Lisbon, Portugal, 2005, pp. 39–45.

13. C.C. Aggarwal, Privacy and the dimensionality curse,


Springer, 2008, pp. 433–460.

14. C.C. Aggarwal and P.S. Yu (ed.), Privacy Preserving


Data Mining: Models and Algorithms, Springer, New York,
2008.

15. J. Held, Towards measuring test data quality, in


Proceedings of the 2012 EDBT/ ICDT Workshops, Berlin,
Germany, March 26–30, 2012.
7 7. Synthetic Data Generation

1. J. Dreschler, Synthetic datasets for statistical


disclosure control—Theory and implementation, Lecture
notes in Statistics 201, pp. 7–10, 13–20, 41–42, 56, 2011.

2. D.B. Rubin, Multiple imputations in sample surveys, in


Proceedings of the Section

3. D.B. Rubin, Multiple Imputation for Non-Response in


Surveys, John Wiley & Sons, New York, 1987.

4. D.B. Rubin, The design of a general and flexible system


for handling nonresponse in sample surveys, The American
Statistician, 58, 298–302, 2004.

5. Grid Tools Datamaker,


https://www.grid-tools.com/datamaker/.

6. J.H. Lee, Y. Kim, and M. O’Keefe, On


regression-tree-based synthetic data methods for business
data, Journal of Privacy and Confidentiality, 5(1),
107–135, 2013.

7. M.E. Nergiz, M. Atzori, Y. Saygin, and B. Guc, Towards


trajectory anonymization: A generalization-based approach,
Transactions on Data Privacy, 2, 47–75, 2009.

8. P.J. Lin et al., Development of a synthetic data set


generator for building and testing information discovery
systems in Proceedings of the Third International
Conference on Information Technology: New Generations
(ITNG’06), Las Vegas, NV, 2006.

9. M. S hams, D. Krishnamurthy, and B. Far, A model-based


approach for testing the performance of web applications,
in Proceedings of the Third International Workshop on
Software Quality Assurance (SOQUA’06), Portland, OR, 2006.

10. X. Wu, Y. Wang, and Y. Zheng, Privacy preserving


database application testing, in Proceedings of the 2003
ACM Workshop on Privacy in the Electronic Society
(WPES’03), Washington, DC, October 30, 2003.

11. M. Whiting, J. Haack, and C. Varley, Creating


realistic, scenario-based synthetic data for test and
evaluation of information analytics software, in
Proceedings of the Beyond Time and Errors: Novel
Evaluation Methods for Information Visualization
(BELIV’08), ACM, Florence, Italy, April 5, 2008.
12. D. Ferrari, On the foundation of artificial workload
design, in Proceedings of ACM SIGMETRICS, 1984, New York,
NY, pp. 8–14.

13. J.E. Hoag and C.W. Thompson, A parallel general-purpose


synthetic data ge nerator, SIGMOD Record, 36(1), 19–24,
March 2007.

14. J. Domingo-Ferrer, V. Torra, J.M. Mateo-Sanz, and F.


Sebe, Re-identification and synthetic data generators: A
case study, in Fourth Joint UNECE/EUROSTAT Work Session on
Statistical Data Confidentiality, Geneva, Switzerland,
November 9–11, 2005.

15. T. Blakely and C. Salmond, Probabilistic record linkage


and a method to calculate the positive predictive value,
International Journal of Epidemiology, 31, 1246–1252,
2002.
9 9. Privacy Regulations

1. Summary of the HIPAA privacy rule, available at


http://www.hhs.gov/sites/ default/files/privacysummary.pdf
by US Department of Health and Human Services. Prepared by
Office of Civil Rights on 05/03. Accessed April 19, 2016.

2. Declaration on Professional Ethics, adopted by ISI


Council in July 2010, TABLE 9.15 Steps for Anonymization
Design S NO Checklist Item 1 What is the intended use of a
data set? 2 What is the domain, context, and application
scenario? 3 Who are the users of data and what are their
profiles? 4 Have adversaries, threats, and attack models
been identified? 5 What are the possible correlated data
sets available that can be matched with the source?

3. Commission proposes a comprehensive reform of the data


protection rules, available at

4. Regulation (EC) No 45/2001 of the European Parliament and of the Council of 18 December 2000. Published in the Official Journal of the European Communities on 12.1.2001. Accessed April 19, 2016.

5. Data Protection Act, Switzerland, available at https://www.admin.ch/opc/en/classified-compilation/19920153/index.html, published by the Federal Council, the Portal of the Swiss Government, on June 19, 1992 (status as of January 1, 2014). Accessed April 19, 2016.

6. Payment Card Industry (PCI) Data Security Standard: Requirements and Security Assessment Procedures, v3.1, p. 5, April 2015. Published by PCI Security Standards Council, LLC, 2006–2015. Accessed April 19, 2016.

7. Protecting consumer privacy in an era of rapid change: Recommendations for businesses and policymakers, Fed. Trade Comm’n, March 2012, available at

8. APEC Privacy Framework of 2004, available at http://www.apec.org/Groups/

9. OECD Guidelines governing the protection of Privacy and Transborder flows of Personal Data-11, available at

10. R. Wu, G.J. Ahn, and H. Hu, Towards HIPAA-compliant healthcare systems, in IHI’12: Proceedings of the Second ACM SIGHIT International Health Informatics Symposium, Miami, FL, January 28–30, 2012.
11. K. Benitez and B. Malin, Evaluating re-identification risks with respect to the HIPAA privacy rule, Journal of the American Medical Informatics Association, 2009, p. 170. Accepted December 14, 2009. Accessed April 19, 2016.

12. L. Sweeney, k-anonymity: A model for protecting privacy, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5), 557–570, 2002.

13. A. Evfimievski et al., Privacy preserving mining of association rules, in Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, 2002.

14. L. Motiwalla and X. Li, Value added privacy services for healthcare data, in IEEE Sixth World Congress on Services, Miami, FL, 2010.

15. Income Tax Department – Permanent Account Number, available at http://

16. License Plates – Department of Justice of Montana, available at https://dojmt.gov/driving/license-plates/. Prepared by the Motor Vehicle Division, State of Montana.

17. Data Protection Act 1998, Schedule 1, Part 1, Principles, available at http://www.legislation.gov.uk/ukpga/1998/29/schedule/1/part/I. Delivered by The National Archives. Accessed April 19, 2016.

18. Data Protection Act 1998, original available at http://www.legislation.gov.uk/ukpga/1998/29/pdfs/ukpga_19980029_en.pdf. Published by TSO (The Stationery Office), printed in 1998, reprinted incorporating corrections in 2005. Accessed April 19, 2016.

19. Coroners and Justice Act 2009, available at http://www.legislation.gov.uk/ukpga/2009/25/contents. Delivered by The National Archives. Accessed April 19, 2016.

20. Standards Catalogue ISO, available at http://www.iso.org/iso/home/store/catalogue_ics/catalogue_detail_ics.htm?csnumber=66011. Published on 2015-07-01 by the Registration Authority, c/o RA/MA (ABA), Washington, DC. Accessed April 19, 2016.
21. 22nd Annual Report 2014/2015 of the Federal Data Protection and Information Commissioner (FDPIC), available at
Appendix A: Anonymization Design Principles for Multidimensional Data

1. N. Shlomo, Accessing microdata via the Internet, Joint UN/ECE and Eurostat Work Session on Statistical Data Confidentiality, Working Paper No. 6, Luxembourg, April 7–9, 2003.

2. B.-C. Chen, K. LeFevre, and R. Ramakrishnan, Privacy skyline: Privacy with multidimensional adversarial knowledge, Technical Report #1596, University of Wisconsin, Madison, WI, July 2007.

3. G. Ghinita, Y. Tao, and P. Kalnis, On the anonymization of sparse high-dimensional data, in Proceedings of the 24th IEEE International Conference on Data Engineering (ICDE), Cancun, Mexico, April 2008, pp. 715–724.

4. Y. Xu et al., Anonymizing transaction databases for publication, in ACM KDD’08, Las Vegas, NV, August 24–27, 2008.

5. M. Terrovitis et al., Privacy-preserving anonymization of set-valued data, Proceedings of the VLDB Endowment, 1(1), 115–125, Auckland, New Zealand, August 2008.

6. C.C. Aggarwal and P.S. Yu (eds.), Privacy-Preserving Data Mining: Models and Algorithms, Springer, New York, 2008.
