Data Preparation
(Data pre-processing)
Why Prepare Data?
• Data need to be formatted for a given software tool
Major Tasks in Data Preparation
• Data discretization
  • Part of data reduction, with particular importance for numerical data
• Data cleaning
  • Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
• Data integration
  • Integration of multiple databases, data cubes, or files
• Data transformation
  • Normalization and aggregation
• Data reduction
  • Obtains a reduced representation of the data that is much smaller in volume but produces the same or similar analytical results
Data Preparation as a Step in the Knowledge Discovery Process
[Figure: the knowledge discovery pipeline: data from databases (DB) undergo cleaning and integration into a data warehouse (DW), then selection and transformation, then data mining, and finally evaluation and presentation of the resulting knowledge.]
CRISP-DM
CRISP-DM is a comprehensive data mining methodology and process model that provides anyone, from novices to data mining experts, with a complete blueprint for conducting a data mining project.
A methodology enumerates the steps needed to reproduce success.
CRISP-DM Phases and Tasks
Business Understanding: Assess Situation; Determine Data Mining Goals; Produce Project Plan
Data Understanding: Describe Data; Explore Data; Verify Data Quality
Data Preparation: Clean Data; Construct Data; Integrate Data; Format Data
Modelling: Generate Test Design; Build Model; Assess Model
Evaluation: Review Process; Determine Next Steps; Review Project
Deployment: Plan Monitoring & Maintenance; Produce Final Report
CRISP-DM: Data Understanding
• Collect data
• Describe data
• Explore data
CRISP-DM: Data Preparation
• Construct data
  • Derived attributes
  • Background knowledge
  • How can missing attributes be constructed or imputed?
• Integrate data
Types of Measurements
• Nominal scale
• Interval scale (quantitative)
• Ratio scale (quantitative)
Quantitative measurements can be discrete or continuous.
Types of Measurements: Examples
• Nominal:
  • Examples:
    • US State Code (50 values)
    • Profession Code (7,000 values, but only a few are frequent)
• Ignore ID-like fields whose values are unique for each record
DISCRETIZATION OF CONTINUOUS VARIABLES
Discretization
• Divide the range of a continuous attribute into intervals
Equal-width Binning
• Divides the range into N intervals of equal size (a uniform grid)
• If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B - A)/N
Example temperature values:
64 65 68 69 70 71 72 72 75 75 80 81 83 85
With N = 7 (so W = 3), the bin counts are 2, 2, 4, 2, 0, 2, 2.
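The equal-width rule above can be sketched in a few lines of Python (a minimal illustration, not from the slides; the function name is invented). It reproduces the slide's temperature example:

```python
# Equal-width binning sketch: W = (B - A) / N, values assigned to N uniform bins.
def equal_width_bins(values, n_bins):
    a, b = min(values), max(values)
    width = (b - a) / n_bins
    counts = [0] * n_bins
    for v in values:
        # the maximum value is clamped into the last bin
        idx = min(int((v - a) / width), n_bins - 1)
        counts[idx] += 1
    return width, counts

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
width, counts = equal_width_bins(temps, 7)
print(width, counts)   # 3.0 [2, 2, 4, 2, 0, 2, 2]
```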
Equal-width Binning
[Figure: salaries in a corporation binned with equal width. Almost all observations fall in the first bin, [0 – 200,000), while the outlying values stretch the range out to the last bin, [1,800,000 – 2,000,000].]
Advantages
(a) simple and easy to implement
(b) produces a reasonable abstraction of the data
Disadvantages
(a) unsupervised
(b) where does N come from?
(c) sensitive to outliers
Equal-depth (or height) Binning
• Divides the range into N intervals, each containing approximately the same number of samples
• Additional considerations:
Example temperature values:
64 65 68 69 70 71 72 72 75 75 80 81 83 85
With N = 4, the bin counts are 4, 4, 4, 2.
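Equal-depth binning can likewise be sketched in Python (illustrative only; the function name is invented). It reproduces the counts 4, 4, 4, 2 from the temperature example:

```python
# Equal-depth (equal-frequency) binning sketch: sort the values, then cut
# them into n_bins groups of (approximately) equal size.
def equal_depth_bins(values, n_bins):
    vals = sorted(values)
    size = -(-len(vals) // n_bins)          # ceiling division: elements per bin
    return [vals[i:i + size] for i in range(0, len(vals), size)]

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
bins = equal_depth_bins(temps, 4)
print([len(b) for b in bins])               # [4, 4, 4, 2]
```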
Discretization considerations
• Class-independent methods
Method 1R
• After sorting the data, the range of continuous values is divided into a number of disjoint intervals, and the boundaries of those intervals are adjusted based on the class labels associated with the values of the feature.
• The adjustment of a boundary continues until the next value belongs to a class different from the majority class in the adjacent interval.
1R Example
Each interval must contain at least 6 elements. Adjustment of a boundary continues until the next value belongs to a class different from the majority class in the adjacent interval.

Sorted values (Var) with their classes, grouped into the initial intervals:

Interval 1: Var 65 78 79 79 81 81       Class 2 1 2 2 2 1        majority 2
Interval 2: Var 82 82 82 82 82 82 83    Class 1 2 1 2 2 2 2      majority 2
Interval 3: Var 83 83 83 83 84 84 84    Class 1 2 2 2 1 2 2      majority 2
Interval 4: Var 84 84 84 84 84 84 84    Class 1 1 2 2 1 1 1      majority 1
Interval 5: Var 85 85 85 85             Class 2 2 2 2            majority 2

Merging adjacent intervals with the same majority class gives three final intervals: positions 1–20 (new class 1), positions 21–27 (new class 2), and positions 28–31 (new class 3).

Comment: the original method description does not mention the criterion of keeping the same value of Var in the same interval (although that seems reasonable). The results above are given by the method available in the R package Dprep.
See the following papers for more detail:
• Very Simple Classification Rules Perform Well on Most Commonly Used Datasets, by Robert C. Holte
• The Development of Holte's 1R Classifier, by Craig Nevill-Manning, Geoffrey Holmes and Ian H. Witten
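The interval-building step of the example can be sketched as follows (a minimal reading of the slides, not Holte's reference implementation; the function name is invented). It takes at least 6 values per interval, extends the interval while the next value still matches its majority class, then merges adjacent intervals with equal majorities:

```python
from collections import Counter

def one_r_intervals(classes, min_size=6):
    """1R-style discretization sketch over class labels of pre-sorted values."""
    intervals, i, n = [], 0, len(classes)
    while i < n:
        j = min(i + min_size, n)
        majority = Counter(classes[i:j]).most_common(1)[0][0]
        while j < n and classes[j] == majority:
            j += 1                      # extend while the class still matches
        intervals.append((i, j, majority))
        i = j
    merged = [intervals[0]]             # merge adjacent intervals, same majority
    for s, e, m in intervals[1:]:
        ps, pe, pm = merged[-1]
        merged[-1] = (ps, e, m) if m == pm else merged[-1]
        if m != pm:
            merged.append((s, e, m))
    return merged

# class labels from the slide's example (values already sorted by Var)
cls = [2,1,2,2,2,1, 1,2,1,2,2,2,2, 1,2,2,2,1,2,2, 1,1,2,2,1,1,1, 2,2,2,2]
result = one_r_intervals(cls)
print([(e - s) for s, e, _ in result])  # [20, 7, 4]
```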
Exercise
• 13, 15, 16, 16, 19, 20, 21, 22, 22, 25, 30, 33, 35, 35, 36, 40, 45
Entropy Based Discretization
• Class-dependent (classification)
Entropy
For two classes (maximum entropy log2(2) = 1):

 p     1-p   Ent
 0.2   0.8   0.72
 0.4   0.6   0.97
 0.5   0.5   1
 0.6   0.4   0.97
 0.8   0.2   0.72

For three classes (maximum entropy log2(3) ≈ 1.58):

 p1    p2    p3    Ent
 0.1   0.1   0.8   0.92
 0.2   0.2   0.6   1.37
 0.1   0.45  0.45  1.37
 0.2   0.4   0.4   1.52
 0.3   0.3   0.4   1.57
 0.33  0.33  0.33  1.58
Entropy/Impurity
• S: training set; C1, ..., CN: classes
• pc: proportion of Cc in S

  Impurity(S) = - Σ (c = 1..N) pc · log2(pc)
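The impurity formula can be checked against the tables above with a few lines of Python (a minimal sketch; the function name is ours):

```python
import math

def entropy(probs):
    """Impurity(S) = -sum_c p_c * log2(p_c); zero-probability terms are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(round(entropy([0.2, 0.8]), 2))        # 0.72
print(entropy([0.5, 0.5]))                  # 1.0
print(round(entropy([0.2, 0.2, 0.6]), 2))   # 1.37
```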
Impurity
[Figure: two class distributions: a mixed, near-uniform set has high entropy; a pure, single-class set has null entropy.]
An example of entropy disc.

Temp.  Play?
 64    Yes
 65    No
 68    Yes
 69    Yes
 70    Yes
 71    No
 72    No
 72    Yes
 75    Yes
 75    Yes
 80    No
 81    Yes
 83    Yes
 85    No

Test split temp < 71.5:
          yes  no
 < 71.5    4    2    (4 yes, 2 no)
 > 71.5    5    3    (5 yes, 3 no)

 Ent(split 71.5) = (6/14)·[-(4/6)·log2(4/6) - (2/6)·log2(2/6)]
                 + (8/14)·[-(5/8)·log2(5/8) - (3/8)·log2(3/8)] = 0.939

Test split temp < 77:
          yes  no
 < 77      7    3    (7 yes, 3 no)
 > 77      2    2    (2 yes, 2 no)

 Ent(split 77) = (10/14)·[-(7/10)·log2(7/10) - (3/10)·log2(3/10)]
               + (4/14)·[-(2/4)·log2(2/4) - (2/4)·log2(2/4)] = 0.915
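The two split entropies can be verified in Python (illustrative sketch; the Play column is transcribed from the table, and the function names are ours):

```python
from collections import Counter
import math

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_entropy(values, labels, threshold):
    # weighted average of the entropies of the two sides of the split
    left = [l for v, l in zip(values, labels) if v < threshold]
    right = [l for v, l in zip(values, labels) if v >= threshold]
    n = len(labels)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
play = ['Y', 'N', 'Y', 'Y', 'Y', 'N', 'N', 'Y', 'Y', 'Y', 'N', 'Y', 'Y', 'N']
print(round(split_entropy(temps, play, 71.5), 3))  # 0.939
print(round(split_entropy(temps, play, 77), 3))    # 0.915
```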
An example (cont.)
[Figure: the sorted Temp./Play? table with the six candidate split points (1st through 6th) marked between consecutive values.]
The method tests all split possibilities and chooses the split with smallest entropy.
In the first iteration a split at 84 is chosen.
The two resulting branches are processed recursively.
The fact that recursion only occurs in the first interval in this example is an artifact. In general both intervals have to be split.
The stopping criterion
The previous slide did not take into account the stopping criterion. A split T of a set S of N instances is accepted only if

  Ent(S) - E(T, S) > log2(N - 1)/N + Δ(T, S)/N

where Δ(T, S) = log2(3^c - 2) - [c·Ent(S) - c1·Ent(S1) - c2·Ent(S2)],
c is the number of classes in S,
c1 is the number of classes in S1, and
c2 is the number of classes in S2.
This is called the Minimum Description Length Principle (MDLP).
Exercise
• Compute the gain of splitting this data in half
Humidity play
65 Yes
70 No
70 Yes
70 Yes
75 Yes
80 Yes
80 Yes
85 No
86 Yes
90 No
90 Yes
91 No
95 No
96 Yes
OUTLIERS
Outliers
• Outliers are values thought to be out of range.
• “An outlier is an observation that deviates so much from other
observations as to arouse suspicion that it was generated by a
different mechanism”
• Approaches:
• do nothing
• enforce upper and lower bounds
• let binning handle the problem
Outlier detection
• Univariate
  • Compute the mean x̄ and standard deviation s. For k = 2 or 3, x is an outlier if it falls outside the interval (x̄ - k·s, x̄ + k·s) (a normal distribution is assumed)
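The mean ± k·s rule can be sketched in Python (illustrative only; the function name is invented, and the age column from the normalization example later in the deck is reused as data):

```python
import statistics

def sigma_outliers(xs, k=2):
    """Flag x as an outlier if it falls outside (mean - k*s, mean + k*s).
    A roughly normal distribution is assumed."""
    m = statistics.mean(xs)
    s = statistics.stdev(xs)       # sample standard deviation
    return [x for x in xs if abs(x - m) > k * s]

ages = [44, 35, 34, 34, 39, 41, 42, 31, 28, 30,
        38, 36, 42, 35, 33, 45, 34, 65, 66, 38]
print(sigma_outliers(ages, k=2))   # [65, 66]
```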
Outlier detection
• Univariate
  • Boxplot: an observation is an extreme outlier if it falls outside the interval (Q1 - 3·IQR, Q3 + 3·IQR), where IQR = Q3 - Q1 (IQR = Inter Quartile Range)
  • Observations outside (Q1 - 1.5·IQR, Q3 + 1.5·IQR), but not extreme, are mild outliers
http://www.physics.csbsju.edu/stats/box2.html
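The boxplot rule can be sketched with the standard library (a minimal illustration; the function name is invented, and the same age data is reused):

```python
import statistics

def iqr_outliers(xs, k=1.5):
    """Boxplot rule: flag values outside (Q1 - k*IQR, Q3 + k*IQR).
    k = 1.5 gives mild outliers, k = 3 gives extreme outliers."""
    q1, _, q3 = statistics.quantiles(xs, n=4)   # method='exclusive' by default
    iqr = q3 - q1
    return [x for x in xs if x < q1 - k * iqr or x > q3 + k * iqr]

ages = [44, 35, 34, 34, 39, 41, 42, 31, 28, 30,
        38, 36, 42, 35, 33, 45, 34, 65, 66, 38]
print(iqr_outliers(ages))          # mild outliers: [65, 66]
```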
Outlier detection
• Multivariate
  • Clustering: very small clusters are outliers
http://www.ibm.com/developerworks/data/library/techarticle/dm-0811wurst/
Outlier detection
• Multivariate
  • Distance based: an instance with very few neighbours within distance D is regarded as an outlier (kNN algorithm)
[Figure: a bi-dimensional outlier that is not an outlier in either of its projections.]
Recommended reading
DATA TRANSFORMATION
Normalization
• min-max normalization

  v' = ((v - min_v) / (max_v - min_v)) · (new_max_v - new_min_v) + new_min_v

• z-score normalization (does not eliminate outliers)

  v' = (v - mean_v) / stddev_v

• normalization by decimal scaling

  v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
Age min‐max (0‐1) z‐score dec. scaling
44 0.421 0.450 0.44
35 0.184 ‐0.450 0.35
34 0.158 ‐0.550 0.34
34 0.158 ‐0.550 0.34
39 0.289 ‐0.050 0.39
41 0.342 0.150 0.41
42 0.368 0.250 0.42
31 0.079 ‐0.849 0.31
28 0.000 ‐1.149 0.28
30 0.053 ‐0.949 0.3
38 0.263 ‐0.150 0.38
36 0.211 ‐0.350 0.36
42 0.368 0.250 0.42
35 0.184 ‐0.450 0.35
33 0.132 ‐0.649 0.33
45 0.447 0.550 0.45
34 0.158 ‐0.550 0.34
65 0.974 2.548 0.65
66 1.000 2.648 0.66
38 0.263 ‐0.150 0.38
28     minimum
66     maximum
39.50  average
10.01  standard deviation
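The three normalizations in the table can be reproduced in Python (a sketch on the Age column; variable names are ours, and the sample standard deviation is used, matching the 10.01 in the table):

```python
import statistics

ages = [44, 35, 34, 34, 39, 41, 42, 31, 28, 30,
        38, 36, 42, 35, 33, 45, 34, 65, 66, 38]
lo, hi = min(ages), max(ages)
mean, sd = statistics.mean(ages), statistics.stdev(ages)

min_max = [(v - lo) / (hi - lo) for v in ages]   # maps onto [0, 1]
z_score = [(v - mean) / sd for v in ages]
dec_scaled = [v / 100 for v in ages]             # 10^2 is the smallest power with max|v'| < 1

# first row of the table: Age 44
print(round(min_max[0], 3), round(z_score[0], 3), dec_scaled[0])  # 0.421 0.45 0.44
```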
MISSING DATA
Missing Data
• Data is not always available
  • E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
• Missing values (MVs) may have an impact on modelling; in fact, they can destroy it!
• Some tools ignore missing values, others use some metric to fill in replacements
How to Handle Missing Data?
• Use only features (attributes) with all values (may leave out important features)
• Fill in the missing values manually (tedious + infeasible?)
How to Handle Missing Data?
• Use the attribute mean for all samples belonging to the same class to fill in the missing value
How to Handle Missing Data?
• Nearest-neighbour estimator
  • Find the k neighbours nearest to the point and fill in the most frequent value or the average value
  • Finding neighbours in a large dataset may be slow
Nearest-Neighbour
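Nearest-neighbour imputation can be sketched as follows (a toy illustration, not from the slides; the function name, data, and distance choice are assumptions). It finds the k rows closest on the complete attributes and averages their values for the missing one:

```python
import math

def knn_impute(rows, target_idx, query, k=3):
    """Fill a missing numeric value: average the target attribute over the
    k rows nearest to `query` on the remaining (complete) attributes."""
    def dist(row):
        return math.dist([row[i] for i in range(len(row)) if i != target_idx],
                         [query[i] for i in range(len(query)) if i != target_idx])
    neighbours = sorted(rows, key=dist)[:k]
    return sum(r[target_idx] for r in neighbours) / k

# toy data: (height, weight); weight is missing in the query row
data = [(160, 55), (165, 60), (170, 68), (180, 80), (185, 90)]
print(knn_impute(data, target_idx=1, query=(168, None), k=3))   # 61.0
```

For a categorical attribute the average would be replaced by the most frequent value among the neighbours.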
HANDLING REDUNDANCY
Handling Redundancy in Data Integration
• Redundant attributes may be detected by correlation analysis:

  r(A, B) = Σ (a - Ā)(b - B̄) / ((n - 1)·σ_A·σ_B)

Scatter Matrix
[Figure: a matrix of pairwise scatter plots, used to spot correlated (redundant) attribute pairs.]
SAMPLING AND
UNBALANCED DATASETS
Sampling
• The cost of sampling is proportional to the sample size, not to the original dataset size; therefore, a mining algorithm's complexity is potentially sub-linear in the size of the data
Unbalanced Target Distribution
Handling Unbalanced Data
• With two classes: let positive targets be a minority
• Separate a raw held-aside set (e.g. 30% of the data) from the raw train set
  • Put aside the raw held-aside set and don't use it until the final model
• Join the positive targets with an equal number of negative targets from the raw train set, and randomly sort the result
Building Balanced Train Sets
[Figure: all targets (Y) are joined with a simple random sample (SRS) of non-targets (N) of equal size, forming a balanced set with the same % of Y and N. A 70/30 split of the balanced set gives a balanced train set and a balanced test set, while the raw held-aside set is kept for estimating the accuracy of the final model.]
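The scheme in the figure can be sketched in Python (names, fractions, and the toy data are illustrative assumptions, not from the slides):

```python
import random

def build_balanced_sets(rows, is_target, holdout_frac=0.3, train_frac=0.7, seed=42):
    """Set aside a raw held-aside set, join all minority targets with an
    equal-size simple random sample of non-targets, shuffle, split 70/30."""
    rng = random.Random(seed)
    rows = rows[:]
    rng.shuffle(rows)
    n_hold = int(len(rows) * holdout_frac)
    raw_held_aside, raw_train = rows[:n_hold], rows[n_hold:]

    targets = [r for r in raw_train if is_target(r)]
    non_targets = [r for r in raw_train if not is_target(r)]
    balanced = targets + rng.sample(non_targets, len(targets))  # SRS of negatives
    rng.shuffle(balanced)

    cut = int(len(balanced) * train_frac)
    return balanced[:cut], balanced[cut:], raw_held_aside

# toy example: 1000 rows of (id, label) with 10% positive targets
data = [(i, 1 if i % 10 == 0 else 0) for i in range(1000)]
bal_train, bal_test, raw_held = build_balanced_sets(data, is_target=lambda r: r[1] == 1)
```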
Summary
References