ED317 Statistical Machine Learning
Dr Manish Kumar Pandey, Centre for Quantitative Economics & Data Science, Birla Institute of Technology, Mesra, Ranchi, Jharkhand-835215
I. Concept Learning
• Determine:
• A hypothesis h in H such that h(x) = c(x) for all x in X.
[Figure: the instance space X (left, from specific to general) and the hypothesis space H (right).]
x1 = <Sunny, Warm, High, Strong, Cool, Same>      h1 = <Sunny, ?, ?, Strong, ?, ?>
x2 = <Sunny, Warm, High, Light, Warm, Same>       h2 = <Sunny, ?, ?, ?, ?, ?>
                                                  h3 = <Sunny, ?, ?, ?, Cool, ?>
The box on the left represents the set X of all instances; the box on the right, the set H of all hypotheses. Each hypothesis corresponds to some subset of X: the subset of instances that it classifies positive. The arrows connecting hypotheses represent the more-general-than relation, with the arrow pointing toward the less general hypothesis. Note that the subset of instances characterized by h2 subsumes the subset characterized by h1; hence h2 is more-general-than h1.
[Figure: FIND-S search of the hypothesis space — instances (left) and hypotheses ordered from specific h0 to general h4 (right).]
x1 = <Sunny, Warm, Normal, Strong, Warm, Same>, +      h0 = <∅, ∅, ∅, ∅, ∅, ∅>
x2 = <Sunny, Warm, High, Strong, Warm, Same>, +        h1 = <Sunny, Warm, Normal, Strong, Warm, Same>
x3 = <Rainy, Cold, High, Strong, Warm, Change>, −      h2 = h3 = <Sunny, Warm, ?, Strong, Warm, Same>
x4 = <Sunny, Warm, High, Strong, Cool, Change>, +      h4 = <Sunny, Warm, ?, Strong, ?, ?>
The search begins (h0) with the most specific hypothesis in H, then considers increasingly general hypotheses (h1 through h4) as mandated by the training examples. In the instance-space diagram, positive training examples are denoted by "+", negative by "−", and instances that have not been presented as training examples are denoted by a solid circle.
In a nutshell
• The key property of the FIND-S algorithm is that, for hypothesis spaces described by conjunctions of attribute constraints (such as H for the EnjoySport task), FIND-S is guaranteed to output the most specific hypothesis within H that is consistent with the positive training examples.
• Its final hypothesis will also be consistent with the negative examples, provided the correct target concept is contained in H and provided the training examples are correct.
• Limitations:
  • Can't tell whether it has learned the concept
  • Can't tell when the training data are inconsistent
  • Picks a maximally specific h (why?)
  • Depending on H, there might be several!
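The FIND-S behaviour described above can be illustrated with a minimal Python sketch (not part of the original slides; the function name find_s and the tuple encoding of instances are assumptions made for this example).

```python
# Minimal FIND-S sketch: start from the most specific hypothesis and
# generalize it just enough to cover each positive example.

def find_s(examples):
    """examples: list of (instance_tuple, label) with label '+' or '-'."""
    n = len(examples[0][0])
    h = [None] * n                      # h0 = <∅, ∅, ..., ∅> (most specific)
    for x, label in examples:
        if label != '+':                # negative examples are ignored by FIND-S
            continue
        for i, value in enumerate(x):
            if h[i] is None:            # first positive example: copy its values
                h[i] = value
            elif h[i] != value:         # conflicting value: generalize to '?'
                h[i] = '?'
    return h

# EnjoySport training examples from the slides
data = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"),   '+'),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"),   '+'),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), '-'),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), '+'),
]

print(find_s(data))   # ['Sunny', 'Warm', '?', 'Strong', '?', '?']  (= h4)
```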
[Figure: a version space with its general and specific boundary sets. The version space includes all six hypotheses shown here but can be represented more simply by S and G. Arrows indicate instances of the more-general-than relation.]
G: {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>}
Intermediate hypotheses: <Sunny, ?, ?, Strong, ?, ?>, <Sunny, Warm, ?, ?, ?, ?>, <?, Warm, ?, Strong, ?, ?>
Representing Version Spaces
The target concept is exactly learned when the S and G boundary sets
converge to a single, identical, hypothesis.
• The problem here is that with this very expressive hypothesis representation, the S boundary will always be
simply the disjunction of the observed positive examples, while the G boundary will always be the negated
disjunction of the observed negative examples.
• Therefore, the only examples that will be unambiguously classified by S and G are the observed training examples themselves.
• Since P(X) is constant for all classes, only P(Ci | X) ∝ P(X | Ci) P(Ci) needs to be maximized.
Naïve Bayes Classifier
• A simplifying assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):
  P(X | Ci) = Π_{k=1..n} P(xk | Ci) = P(x1 | Ci) × P(x2 | Ci) × ... × P(xn | Ci)
• This greatly reduces the computation cost: Only counts the class distribution
• If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk for Ak
divided by |Ci, D| (# of tuples of Ci in D)
• If Ak is continuous-valued, P(xk|Ci) is usually computed based on Gaussian
distribution with a mean μ and standard deviation σ
  g(x, μ, σ) = (1 / (√(2π) σ)) · e^( −(x − μ)² / (2σ²) )
  and P(xk | Ci) = g(xk, μCi, σCi)
Naïve Bayes Classifier: Training Dataset

Classes: C1: buys_computer = 'yes'; C2: buys_computer = 'no'
Data to be classified: X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income   student  credit_rating  buys_computer
<=30    high     no       fair           no
<=30    high     no       excellent      no
31…40   high     no       fair           yes
>40     medium   no       fair           yes
>40     low      yes      fair           yes
>40     low      yes      excellent      no
31…40   low      yes      excellent      yes
<=30    medium   no       fair           no
<=30    low      yes      fair           yes
>40     medium   yes      fair           yes
<=30    medium   yes      excellent      yes
31…40   medium   no       excellent      yes
31…40   high     yes      fair           yes
>40     medium   no       excellent      no
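As an illustration of the computation described above, the following sketch (not from the original slides) evaluates P(X | Ci) P(Ci) for both classes on the buys_computer table; the row encoding and function name are assumptions made for this example.

```python
# A small hand-rolled Naive Bayes sketch over the buys_computer table above.
from collections import Counter

rows = [
    ("<=30",  "high",   "no",  "fair",      "no"),
    ("<=30",  "high",   "no",  "excellent", "no"),
    ("31…40", "high",   "no",  "fair",      "yes"),
    (">40",   "medium", "no",  "fair",      "yes"),
    (">40",   "low",    "yes", "fair",      "yes"),
    (">40",   "low",    "yes", "excellent", "no"),
    ("31…40", "low",    "yes", "excellent", "yes"),
    ("<=30",  "medium", "no",  "fair",      "no"),
    ("<=30",  "low",    "yes", "fair",      "yes"),
    (">40",   "medium", "yes", "fair",      "yes"),
    ("<=30",  "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no",  "excellent", "yes"),
    ("31…40", "high",   "yes", "fair",      "yes"),
    (">40",   "medium", "no",  "excellent", "no"),
]

def naive_bayes(x):
    labels = Counter(r[-1] for r in rows)              # class counts for P(Ci)
    scores = {}
    for c, n_c in labels.items():
        score = n_c / len(rows)                        # prior P(Ci)
        for k, value in enumerate(x):                  # product of P(xk | Ci)
            n_match = sum(1 for r in rows if r[-1] == c and r[k] == value)
            score *= n_match / n_c
        scores[c] = score
    return scores

X = ("<=30", "medium", "yes", "fair")
print(naive_bayes(X))   # 'yes' ≈ 0.028 > 'no' ≈ 0.007  ->  buys_computer = yes
```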
SVM applications include, for example, speaker identification and benchmarking time-series prediction tests.
SVM—General Philosophy
Let the data D be (X1, y1), …, (X|D|, y|D|), where each Xi is a training tuple with associated class label yi
There are infinite lines (hyperplanes) separating the two classes but we want to find the
best one (the one that minimizes classification error on unseen data)
SVM searches for the hyperplane with the largest margin, i.e., maximum marginal
hyperplane (MMH)
SVM—Linearly Separable
A separating hyperplane can be written as
W●X+b=0
where W={w1, w2, …, wn} is a weight vector and b a scalar (bias)
For 2-D it can be written as
w0 + w 1 x1 + w 2 x2 = 0
The hyperplane defining the sides of the margin:
H1: w 0 + w 1 x1 + w 2 x2 ≥ 1 for yi = +1, and
H2: w0 + w1 x1 + w2 x2 ≤ – 1 for yi = –1
Any training tuples that fall on hyperplanes H1 or H2 (i.e., the
sides defining the margin) are support vectors
This becomes a constrained (convex) quadratic optimization problem: a quadratic objective function with linear constraints, solved by quadratic programming (QP) using Lagrangian multipliers
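A hedged sketch of fitting a maximum-marginal hyperplane with scikit-learn's linear SVM; the toy points are invented for illustration, and a large C is used to approximate the hard-margin case.

```python
# Sketch: fitting a maximum-margin (linear) SVM on a toy 2-D problem.
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data (invented for illustration).
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],    # class -1
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])   # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6)   # large C ~ hard-margin behaviour
clf.fit(X, y)

print("w =", clf.coef_[0], " b =", clf.intercept_[0])    # hyperplane W·X + b = 0
print("support vectors:\n", clf.support_vectors_)        # tuples lying on H1 / H2
print("prediction for (3, 3):", clf.predict([[3.0, 3.0]]))   # -> [1]
```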
Why Is SVM Effective on High Dimensional Data?
SVM can also be used for classifying multiple (> 2) classes and
for regression analysis (with additional parameters)
Distance-weighted discrimination
• Instance-based learning:
• Store training examples and delay the
processing (“lazy evaluation”) until a new
instance must be classified
• Typical approaches
• k-nearest neighbor approach
• Instances represented as points in a Euclidean
space.
• Locally weighted regression
• Constructs local approximation
• Case-based reasoning
• Uses symbolic representations and knowledge-
based inference
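A minimal k-nearest-neighbor sketch (not from the slides), with instances as points in Euclidean space and a majority vote among the k closest stored examples; the data and helper name are assumptions for illustration.

```python
# Minimal k-nearest-neighbor sketch (lazy learning): store the training
# examples and classify a query by majority vote among the k closest points.
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """train: list of (point_tuple, label); query: point_tuple."""
    nearest = sorted(train,
                     key=lambda p: math.dist(p[0], query))   # Euclidean distance
    top_k = [label for _, label in nearest[:k]]
    return Counter(top_k).most_common(1)[0][0]

train = [((1, 1), '-'), ((1, 2), '-'), ((2, 1), '-'),
         ((5, 5), '+'), ((6, 5), '+'), ((5, 6), '+')]
print(knn_classify(train, (5.5, 5.2), k=3))   # '+'
```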
Similarity and Dissimilarity
• Similarity
• Numerical measure of how alike two data objects
are
• Value is higher when objects are more alike
• Often falls in the range [0,1]
• Dissimilarity (e.g., distance)
• Numerical measure of how different two data
objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
• Proximity refers to a similarity or dissimilarity
• Proximity measure for nominal attributes:
  • Method 1: Simple matching — m: # of matches, p: total # of variables
    d(i, j) = (p − m) / p
  • Method 2: Use a large number of binary attributes, creating a new binary attribute for each of the M nominal states
Dissimilarity between Binary Variables
• Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
• Gender is a symmetric attribute
• The remaining attributes are asymmetric binary
• Let the values Y and P be 1, and the value N 0
d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(jack, jim) = (1 + 1) / (1 + 1 + 1) = 0.67
d(jim, mary) = (1 + 2) / (1 + 1 + 2) = 0.75
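The asymmetric binary dissimilarity values above can be reproduced with a short sketch (an illustration, not course code); the 0/1 encoding of Y/P/N follows the convention stated above.

```python
# Asymmetric binary dissimilarity sketch: d(i, j) = (r + s) / (q + r + s),
# where q = #(1,1), r = #(1,0), s = #(0,1); 0-0 matches are ignored.

def asym_binary_dissim(a, b):
    q = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    r = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    s = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    return (r + s) / (q + r + s)

# Fever, Cough, Test-1..Test-4 encoded with Y/P = 1 and N = 0 (Gender excluded).
jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
jim  = (1, 1, 0, 0, 0, 0)

print(round(asym_binary_dissim(jack, mary), 2))  # 0.33
print(round(asym_binary_dissim(jack, jim), 2))   # 0.67
print(round(asym_binary_dissim(jim, mary), 2))   # 0.75
```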
Standardizing Numeric Data
• Z-score: z = (x − μ) / σ
• X: raw score to be standardized, μ: mean of the population, σ: standard
deviation
• the distance between the raw score and the population mean in units of
the standard deviation
• negative when the raw score is below the mean, “+” when above
• An alternative way: Calculate the mean absolute deviation
  sf = (1/n)(|x1f − mf| + |x2f − mf| + ... + |xnf − mf|),  where mf = (1/n)(x1f + x2f + ... + xnf)
• standardized measure (z-score): zif = (xif − mf) / sf
• Using mean absolute deviation is more robust than using standard deviation
Example: Data Matrix and Dissimilarity Matrix

Data Matrix
point   attribute1   attribute2
x1      1            2
x2      3            5
x3      2            0
x4      4            5

Dissimilarity Matrix (with Euclidean distance)
      x1     x2     x3     x4
x1    0
x2    3.61   0
x3    2.24   5.1    0
x4    4.24   1      5.39   0
Distance on Numeric Data: Minkowski Distance
d(i, j) = ( |xi1 − xj1|^h + |xi2 − xj2|^h + ... + |xip − xjp|^h )^(1/h)
where h is the order: h = 1 gives the Manhattan (L1) distance, h = 2 the Euclidean (L2) distance, and h → ∞ the supremum (L∞) distance.
Example: Minkowski Distance
Dissimilarity Matrices

point   attribute 1   attribute 2
x1      1             2
x2      3             5
x3      2             0
x4      4             5

Manhattan (L1)
L1    x1    x2    x3    x4
x1    0
x2    5     0
x3    3     6     0
x4    6     1     7     0

Euclidean (L2)
L2    x1     x2    x3     x4
x1    0
x2    3.61   0
x3    2.24   5.1   0
x4    4.24   1     5.39   0

Supremum (L∞)
L∞    x1    x2    x3    x4
x1    0
x2    3     0
x3    2     5     0
x4    3     1     5     0
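A sketch (not from the slides) reproducing the three dissimilarity matrices with NumPy; np.linalg.norm with ord = 1, 2, or ∞ gives the Manhattan, Euclidean, and supremum distances.

```python
# Sketch: reproducing the L1, L2 and supremum (L-infinity) dissimilarity
# matrices above with NumPy.
import numpy as np

pts = np.array([[1, 2], [3, 5], [2, 0], [4, 5]], dtype=float)   # x1..x4

def dissim_matrix(points, order):
    """order = 1 (Manhattan), 2 (Euclidean), np.inf (supremum)."""
    n = len(points)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            d[i, j] = np.linalg.norm(points[i] - points[j], ord=order)
    return d

for name, h in [("Manhattan (L1)", 1), ("Euclidean (L2)", 2), ("Supremum (Linf)", np.inf)]:
    print(name)
    print(np.round(dissim_matrix(pts, h), 2))
```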
Cosine Similarity
• A document can be represented by thousands of attributes, each
recording the frequency of a particular word (such as keywords) or
phrase in the document.
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
cos(d1, d2) = (d1 • d2) / (||d1|| × ||d2||)
d1 • d2 = 5×3 + 0×0 + 3×2 + 0×0 + 2×1 + 0×1 + 0×0 + 2×1 + 0×0 + 0×1 = 25
||d1|| = (5² + 0² + 3² + 0² + 2² + 0² + 0² + 2² + 0² + 0²)^0.5 = (42)^0.5 = 6.481
||d2|| = (3² + 0² + 2² + 0² + 1² + 1² + 0² + 1² + 0² + 1²)^0.5 = (17)^0.5 = 4.12
cos(d1, d2) = 25 / (6.481 × 4.12) = 0.94
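The same numbers can be checked with a short NumPy sketch (illustrative only).

```python
# Sketch: cosine similarity for the two document vectors d1 and d2 above.
import numpy as np

d1 = np.array([5, 0, 3, 0, 2, 0, 0, 2, 0, 0], dtype=float)
d2 = np.array([3, 0, 2, 0, 1, 1, 0, 1, 0, 1], dtype=float)

cos = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))   # 25 / (6.481 * 4.12)
print(round(cos, 2))   # 0.94
```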
[Figure: a neural network with a 30×32-unit sensor input retina, 4 hidden units, and 30 output units.]
Perceptron
Inputs x1, …, xn with weights w1, …, wn, plus a constant input x0 = 1 with bias weight w0.
o(x1, …, xn) = 1 if Σ_{i=0..n} wi xi > 0, and −1 otherwise
Decision Surface of a Perceptron
[Figure: two panels, (a) and (b), plotting training examples as "+" and "−" in the (x1, x2) plane; a separating line is drawn in panel (a).]
The decision surface represented by a two-input perceptron. (a) A set of training examples and the
decision surface of a perceptron that classifies them correctly. (b) A set of training examples that is not
linearly separable (i.e., that cannot be correctly classified by any straight line). x1 and x2 are the
perceptron inputs. Positive examples are indicated by "+", negative by "-".
These two algorithms (the perceptron training rule and the delta rule) are guaranteed to converge to somewhat different acceptable hypotheses, under somewhat different conditions.
Begin with random weights, then iteratively apply the
perceptron to each training example, modifying the perceptron
weights whenever it misclassifies an example.
This process is repeated, iterating through the training
examples as many times as needed until the perceptron
classifies all training examples correctly.
Weights are modified at each step according to the perceptron training rule, which revises the weight wi associated with input xi according to the rule wi ← wi + Δwi, where Δwi = η(t − o)xi; here t is the target output, o is the perceptron's output, and η is the learning rate.
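A minimal sketch of the perceptron training rule just described (illustrative only; the toy AND-style data and function name are assumptions). It loops over the examples, updating wi ← wi + η(t − o)xi on each misclassification until every example is classified correctly.

```python
# Perceptron training rule sketch: w_i <- w_i + eta * (t - o) * x_i,
# iterating over the examples until all are classified correctly.
# (Converges here because the toy data are linearly separable.)
import random

def train_perceptron(data, eta=0.1, max_epochs=100):
    n = len(data[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n + 1)]   # w[0] is the bias weight
    for _ in range(max_epochs):
        errors = 0
        for x, t in data:                                     # target t in {+1, -1}
            s = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            o = 1 if s > 0 else -1                            # thresholded output
            if o != t:
                errors += 1
                w[0] += eta * (t - o)                         # update for x0 = 1
                for i, xi in enumerate(x, start=1):
                    w[i] += eta * (t - o) * xi
        if errors == 0:                                       # all examples correct
            break
    return w

# AND-like toy problem (linearly separable)
data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
print(train_perceptron(data))
```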
Gradient Descent and the Delta Rule
• The perceptron training rule converges provided the training examples are linearly separable and a sufficiently small η is used; if the data are not linearly separable, convergence is not assured.
• If the training examples are not linearly separable, the delta rule converges toward a best-fit approximation to the target concept.
• The key idea behind the delta rule is to use gradient descent to search
the hypothesis space of possible weight vectors to find the weights
that best fit the training examples.
• This rule is important because gradient descent provides the basis for
the BACKPROPAGATION algorithm, which can learn networks with
many interconnected units.
Gradient Descent and the Delta Rule..
• The delta training rule is best understood by considering the task of training an unthresholded perceptron, that is, a linear unit for which the output o is given by o(x) = w · x.
To derive a weight learning rule for linear units, let us begin by specifying a measure for the training error
of a hypothesis (weight vector), relative to the training examples.
• At each step, the weight vector is altered in the direction that produces the steepest descent along the
error surface. This process continues until the global minimum error is reached.
Gradient Descent
To understand gradient descent, consider the simpler linear unit, where o = w0 + w1 x1 + ... + wn xn, and measure the training error of a weight vector as E(w) = ½ Σ_{d∈D} (td − od)², where D is the set of training examples, td is the target output for example d, and od is the linear unit's output for d.
When interpreted as a vector in weight space, the gradient specifies the direction that
produces the steepest increase in E.
GRADIENT DESCENT algorithm for training
a linear unit.
• Benefits
• the hypothesis space contains continuously parameterized hypotheses (e.g., the weights in a linear unit), and
• the error can be differentiated with respect to these hypothesis parameters.
• Challenges
• converging to a local minimum can sometimes be quite slow (i.e., it can require many thousands of gradient descent
steps), and
• if there are multiple local minima in the error surface, then there is no guarantee that the procedure will find the global
minimum.
• A common remedy is incremental gradient descent, also called stochastic gradient descent.
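A hedged sketch of batch gradient descent for a linear unit, minimizing E(w) = ½ Σ (td − od)²; the toy data are invented for illustration. The incremental (stochastic) variant mentioned above would apply the same update one example at a time.

```python
# Batch gradient-descent sketch for a linear unit o = w . x, minimizing
# E(w) = 1/2 * sum_d (t_d - o_d)^2 via w <- w + eta * sum_d (t_d - o_d) * x_d.
import numpy as np

def gradient_descent(X, t, eta=0.01, epochs=500):
    X = np.hstack([np.ones((len(X), 1)), X])      # prepend x0 = 1 for the bias weight
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = X @ w                                 # linear-unit outputs
        w += eta * X.T @ (t - o)                  # step along the negative gradient of E
    return w

# Toy data generated from t = 2*x + 1 with a little noise (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 1))
t = 2 * X[:, 0] + 1 + rng.normal(0, 0.05, size=50)
print(gradient_descent(X, t))                     # approximately [1.0, 2.0]
```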
Sigmoid unit:
net = Σ_{i=0..n} wi xi
o = σ(net) = 1 / (1 + e^(−net))
• Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 tuples in D2: {high}. Then
  gini_income∈{low,medium}(D) = (10/14) Gini(D1) + (4/14) Gini(D2)
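A short sketch (not from the slides) that reproduces this weighted Gini value, with the yes/no counts read off the buys_computer table shown earlier.

```python
# Sketch: Gini index and the weighted Gini of the income split described above.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

# Class labels of the 10 tuples with income in {low, medium} (D1)
# and the 4 tuples with income = high (D2), read off the training table.
d1 = ["yes"] * 7 + ["no"] * 3    # low/medium: 7 yes, 3 no
d2 = ["yes"] * 2 + ["no"] * 2    # high: 2 yes, 2 no

gini_split = (10 / 14) * gini(d1) + (4 / 14) * gini(d2)
print(round(gini_split, 3))      # ~0.443
```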
• Ensemble methods
• Use a combination of models to increase
accuracy
• Combine a series of k learned models, M1, M2, …,
Mk, with the aim of creating an improved model
M*
• Popular ensemble methods
• Bagging: averaging the prediction over a
collection of classifiers
  • Boosting: weighted vote with a collection of classifiers
Bagging: Bootstrap Aggregation
• Analogy: Diagnosis based on multiple doctors’ majority vote
• Training
• Given a set D of d tuples, at each iteration i, a training set
Di of d tuples is sampled with replacement from D (i.e.,
bootstrap)
• A classifier model Mi is learned for each training set Di
• Classification: classify an unknown sample X
• Each classifier Mi returns its class prediction
• The bagged classifier M* counts the votes and assigns the
class with the most votes to X
• Prediction: can be applied to the prediction of continuous
values by taking the average value of each prediction for a
given test tuple
• Accuracy
  • Often significantly better than a single classifier derived from D
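A hedged manual sketch of bagging with decision trees as base learners (scikit-learn also provides BaggingClassifier); the toy data are invented for illustration.

```python
# Bagging sketch: train k classifiers on bootstrap samples of D and combine
# them by majority vote.
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k=10, seed=0):
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(k):
        idx = rng.integers(0, len(X), size=len(X))   # sample d tuples with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, x):
    votes = [m.predict([x])[0] for m in models]      # each Mi returns its class prediction
    return Counter(votes).most_common(1)[0][0]       # class with the most votes wins

# Toy usage (invented data)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [2, 2], [2, 3], [3, 2], [3, 3]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
models = bagging_fit(X, y)
print(bagging_predict(models, [2.5, 2.5]))           # 1
```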
Boosting
• Analogy: Consult several doctors, based on a combination of
weighted diagnoses—weight assigned based on the previous
diagnosis accuracy
• How boosting works:
• Weights are assigned to each training tuple
• A series of k classifiers is iteratively learned
• After a classifier Mi is learned, the weights are updated to
allow the subsequent classifier, M i+1, to pay more
attention to the training tuples that were
misclassified by Mi
• The final M* combines the votes of each individual
classifier, where the weight of each classifier's vote is a
function of its accuracy
• Boosting algorithm can be extended for numeric prediction
• Comparing with bagging: boosting tends to have greater accuracy, but it also risks overfitting the model to misclassified data
Adaboost (Freund and Schapire, 1997)
• Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
• Initially, all the weights of tuples are set the same (1/d)
• Generate k classifiers in k rounds. At round i,
• Tuples from D are sampled (with replacement) to form a training set D i
of the same size
• Each tuple’s chance of being selected is based on its weight
• A classification model Mi is derived from Di
• Its error rate is calculated using Di as a test set
• If a tuple is misclassified, its weight is increased, o.w. it is decreased
• Error rate: err(Xj) is the misclassification error of tuple Xj. Classifier Mi's error rate is the sum of the weights of the misclassified tuples:
  error(Mi) = Σ_{j=1..d} wj × err(Xj)
• The weight of classifier Mi's vote is
  log( (1 − error(Mi)) / error(Mi) )
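A small sketch (illustrative, with made-up weights) of the two quantities above: the weighted error of Mi and the weight of its vote; scikit-learn's AdaBoostClassifier implements the full procedure.

```python
# Sketch of one AdaBoost round's bookkeeping: the weighted error of Mi and
# the weight of Mi's vote, as in the formulas above. (Hypothetical values.)
import math

def classifier_vote_weight(weights, misclassified):
    """weights: tuple weights w_j; misclassified: booleans err(X_j)."""
    error = sum(w for w, bad in zip(weights, misclassified) if bad)
    return math.log((1 - error) / error)

w = [0.1] * 10                               # initially all tuple weights = 1/d
miss = [False] * 8 + [True] * 2              # Mi misclassifies 2 of the 10 tuples
print(classifier_vote_weight(w, miss))       # log(0.8 / 0.2) ≈ 1.386
```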
Random Forest (Breiman 2001)
• Random Forest:
• Each classifier in the ensemble is a decision tree classifier
and is generated using a random selection of attributes at
each node to determine the split
• During classification, each tree votes and the most popular
class is returned
• Two Methods to construct Random Forest:
• Forest-RI (random input selection): Randomly select, at
each node, F attributes as candidates for the split at the
node. The CART methodology is used to grow the trees to
maximum size
• Forest-RC (random linear combinations): Creates new
attributes (or features) that are a linear combination of the
existing attributes (reduces the correlation between
individual classifiers)
• Comparable in accuracy to Adaboost, but more robust to
errors and outliers
• Insensitive to the number of attributes selected
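A hedged usage sketch of a Forest-RI-style random forest via scikit-learn, where max_features controls the number of attributes considered at each split; the data are invented for illustration.

```python
# Sketch: random forest with random attribute selection at each node,
# combining the trees' votes at prediction time.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)       # toy target (invented)

forest = RandomForestClassifier(n_estimators=100,    # number of trees in the ensemble
                                max_features="sqrt", # attributes tried at each split
                                random_state=0)
forest.fit(X, y)
print(forest.predict(X[:5]))        # majority vote of the trees
print(forest.score(X, y))           # training accuracy
```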
Unsupervised Classification
Evaluating Classifier Accuracy:
Holdout & Cross-Validation Methods
• Holdout method
• Given data is randomly partitioned into two
independent sets
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation
• Random sampling: a variation of holdout
• Repeat holdout k times, accuracy = avg. of the accuracies
obtained
• Cross-validation (k-fold, where k = 10 is most popular)
• Randomly partition the data into k mutually exclusive
subsets, each approximately equal size
• At i-th iteration, use Di as test set and others as training
set
• Leave-one-out: k folds where k = # of tuples, for small
sized data
• Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data
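A sketch (not from the slides) of stratified 10-fold cross-validation with scikit-learn, averaging the per-fold accuracies; the data and base classifier are placeholders.

```python
# Sketch: stratified k-fold cross-validation (k = 10), accuracy averaged
# over the folds.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] > 0).astype(int)                 # toy labels (invented)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print(scores.mean())   # estimated accuracy = average over the 10 folds
```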
Evaluating Classifier Accuracy: Bootstrap
• Bootstrap
• Works well with small data sets
• Samples the given training tuples uniformly with replacement
• i.e., each time a tuple is selected, it is equally likely to be selected
again and re-added to the training set
• Several bootstrap methods exist; a common one is the .632 bootstrap
  • A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set form the test set. About 63.2% of the original data end up in the bootstrap sample, and the remaining 36.8% form the test set (since (1 − 1/d)^d ≈ e^(−1) = 0.368)
  • Repeat the sampling procedure k times; the overall accuracy of the model is
    Acc(M) = Σ_{i=1..k} ( 0.632 × Acc(Mi)_test_set + 0.368 × Acc(Mi)_train_set )
Estimating Confidence Intervals:
Classifier Models M1 vs. M2
• These mean error rates are just estimates of error on the true population of future data cases
• Symmetric
• Significance level: e.g., sig = 0.05 or 5% means M1 and M2 are significantly different for 95% of the population
• Confidence limit: z = sig/2
Comparing Classifiers and Predictors: Other Evaluation Criteria
• Accuracy
• classifier accuracy: predicting class label
• predictor accuracy: guessing value of predicted attributes
• Speed
• time to construct the model (training time)
• time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
• understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
Predictor Error Measures
• Measure predictor accuracy: measure how far off the predicted value is from the
actual known value
• Loss function: measures the error between yi and the predicted value yi’
• Absolute error: | yi – yi’|
• Squared error: (yi – yi’)2
• Test error (generalization error): the average loss over the test set
• Mean absolute error: (Σ_{i=1..d} |yi − yi'|) / d
• Mean squared error: (Σ_{i=1..d} (yi − yi')²) / d
• Relative absolute error: (Σ_{i=1..d} |yi − yi'|) / (Σ_{i=1..d} |yi − ȳ|)
• Relative squared error: (Σ_{i=1..d} (yi − yi')²) / (Σ_{i=1..d} (yi − ȳ)²), where ȳ is the mean of the actual values yi
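A short NumPy sketch (with invented values) computing the four error measures above.

```python
# Sketch: predictor error measures on a tiny set of actual/predicted values.
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])    # actual values y_i (invented)
y_pred = np.array([2.5,  0.0, 2.0, 8.0])    # predicted values y_i'

mae = np.mean(np.abs(y_true - y_pred))                   # mean absolute error
mse = np.mean((y_true - y_pred) ** 2)                    # mean squared error
rae = np.sum(np.abs(y_true - y_pred)) / np.sum(np.abs(y_true - y_true.mean()))   # relative absolute error
rse = np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)     # relative squared error
print(mae, mse, rae, rse)
```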
• Advantages
• Prediction accuracy is generally high
• As compared to Bayesian methods – in general
• Robust, works when training examples contain errors
• Fast evaluation of the learned target function
• Bayesian networks are normally slow
• Criticism
• Long training time
• Difficult to understand the learned function (weights)
• Bayesian networks can be used easily for pattern discovery
• Not easy to incorporate domain knowledge
• Easy in the form of priors on the data or distributions