ED317 Statistical Machine Learning

The document discusses concept learning in statistical machine learning, focusing on the process of inferring general functions from specific training examples. It introduces key concepts such as hypotheses, instances, and the inductive learning hypothesis, as well as algorithms like Find-S and Candidate-Elimination for hypothesis space search. The document emphasizes the importance of efficiently searching through hypothesis spaces to identify the best-fitting hypotheses based on training data.

Statistical Machine Learning

Dr Manish Kumar Pandey, Centre for Quantitative Economics & Data Science,
04/27/2025 1
Birla Institute of Technology, Mesra, Ranchi, Jharkhand-835215
I. Concept Learning

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 2


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Concept Learning
• The problem of inducing general functions from specific training examples is central to learning.
• Much of learning involves acquiring general concepts from specific training examples.

Concept learning: inferring a boolean-valued function from training examples of its input and output.

Example  Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
1        Sunny  Warm     Normal    Strong  Warm   Same      Yes
2        Sunny  Warm     High      Strong  Warm   Same      Yes
3        Rainy  Cold     High      Strong  Warm   Change    No
4        Sunny  Warm     High      Strong  Cool   Change    Yes

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 3


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Concept learning Task
• Given:
• Instances X: Possible days, each described by the attributes
• Sky (with possible values Sunny, Cloudy, and Rainy),
• AirTemp (with values Warm and Cold),
• Humidity (with values Normal and High),
• Wind (with values Strong and Weak),
• Water (with values Warm and Cool), and
• Forecast (with values Same and Change).
• Hypotheses H: Each hypothesis is described by a conjunction of constraints on the attributes Sky, AirTemp, Humidity, Wind, Water, and Forecast. Each constraint may be "?" (any value is acceptable), "∅" (no value is acceptable), or a specific value.
• Target concept c: EnjoySport : X → {0, 1}
• Training examples D: Positive and negative examples of the target function.

• Determine:
• A hypothesis h in H such that h(x) = c(x) for all x in X.

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 4


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Concept learning Task..
• Instances for which c(x) = 1 are called positive examples, or members of the target concept.
• Instances for which c(x) = 0 are called negative examples, or nonmembers of the target concept.
• H denotes the set of all possible hypotheses that the learner may consider regarding the identity of the
target concept.

The inductive learning hypothesis.


Any hypothesis found to approximate the target function well over a sufficiently
large set of training examples will also approximate the target function well
over other unobserved examples.

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 5


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Concept Learning As Search
• Concept learning can be viewed as the task of searching through a large space of hypotheses
implicitly defined by the hypothesis representation.
• The goal of this search is to find the hypothesis that best fits the training examples.
• The instance space X contains exactly 3 · 2 · 2 · 2 · 2 · 2 = 96 distinct instances.
• There are 5 · 4 · 4 · 4 · 4 · 4 = 5120 syntactically distinct hypotheses within H.
• However, every hypothesis containing one or more "∅" symbols represents the empty set of instances; that is, it classifies every instance as negative.
• Therefore, the number of semantically distinct hypotheses is only 1 + (4 · 3 · 3 · 3 · 3 · 3) = 973.
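These counts can be checked directly; a minimal Python sketch (not part of the original slides) that reproduces the arithmetic:

```python
# Attribute value counts for the EnjoySport task:
# Sky has 3 values, the other five attributes have 2 values each.
value_counts = [3, 2, 2, 2, 2, 2]

# Distinct instances: product of the value counts.
instances = 1
for v in value_counts:
    instances *= v                      # 3*2*2*2*2*2 = 96

# Syntactically distinct hypotheses: each attribute may also take "?" or "∅",
# i.e. v + 2 choices per attribute.
syntactic = 1
for v in value_counts:
    syntactic *= v + 2                  # 5*4*4*4*4*4 = 5120

# Semantically distinct hypotheses: one hypothesis for "always negative" (any "∅"),
# plus every combination of a specific value or "?" per attribute (v + 1 choices).
semantic = 1
for v in value_counts:
    semantic *= v + 1                   # 4*3*3*3*3*3
semantic += 1                           # = 973

print(instances, syntactic, semantic)   # 96 5120 973
```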

The goal is to design algorithms capable of efficiently searching very large or infinite hypothesis spaces to find the hypotheses that best fit the training data.

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 6


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
General-to-Specific Ordering of Hypotheses
For general-to-specific ordering, consider the two hypotheses
• h1 = <Sunny, ?, ?, Strong, ?, ?>
• h2 = <Sunny, ?, ?, ?, ?, ?>
• Consider the sets of instances that are classified positive by h1 and by h2.
• Because h2 imposes fewer constraints on the instance, it classifies more instances as positive.
• In fact, any instance classified positive by h1 will also be classified positive by h2.
• Therefore, we say that h2 is more general than h1.

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 7


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Instances, hypotheses, and the more-general-than relation
[Figure: the instance space X (left) and the hypothesis space H (right), with arrows between hypotheses running from specific to general]

x1 = <Sunny, Warm, High, Strong, Cool, Same>     h1 = <Sunny, ?, ?, Strong, ?, ?>
x2 = <Sunny, Warm, High, Light, Warm, Same>      h2 = <Sunny, ?, ?, ?, ?, ?>
                                                 h3 = <Sunny, ?, ?, ?, Cool, ?>

The box on the left represents the set X of all instances, the box on the right the set H of all hypotheses. Each hypothesis corresponds to some subset of X: the subset of instances that it classifies positive. The arrows connecting hypotheses represent the more-general-than relation, with the arrow pointing toward the less general hypothesis. Note that the subset of instances characterized by h2 subsumes the subset characterized by h1; hence h2 is more-general-than h1.

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 8


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Find-S: Finding a Maximally Specific Hypothesis

1. Initialize h to the most specific hypothesis in H
2. For each positive training instance x
   • For each attribute constraint ai in h:
     If the constraint ai in h is satisfied by x, then do nothing;
     else replace ai in h by the next more general constraint that is satisfied by x
3. Output hypothesis h

Trace on the EnjoySport examples (a Python sketch follows the table below):
h ← <∅, ∅, ∅, ∅, ∅, ∅>
h ← <Sunny, Warm, Normal, Strong, Warm, Same>
h ← <Sunny, Warm, ?, Strong, Warm, Same>
h ← <Sunny, Warm, ?, Strong, ?, ?>

Example  Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
1        Sunny  Warm     Normal    Strong  Warm   Same      Yes
2        Sunny  Warm     High      Strong  Warm   Same      Yes
3        Rainy  Cold     High      Strong  Warm   Change    No
4        Sunny  Warm     High      Strong  Cool   Change    Yes
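A minimal Python sketch of Find-S for this conjunctive representation, using the four EnjoySport examples above (None stands for the "∅" constraint and "?" for the wildcard; the names are illustrative):

```python
# Find-S on the EnjoySport examples above.
# A hypothesis is a list of constraints: None means "no value accepted" (most specific),
# "?" means "any value accepted", otherwise a specific attribute value.

examples = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"), True),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), True),
]

def find_s(training_examples, n_attributes=6):
    h = [None] * n_attributes                 # most specific hypothesis in H
    for x, positive in training_examples:
        if not positive:                      # Find-S ignores negative examples
            continue
        for i, a in enumerate(x):
            if h[i] is None:                  # first positive example: copy its values
                h[i] = a
            elif h[i] != a:                   # constraint violated: generalize to "?"
                h[i] = "?"
    return h

print(find_s(examples))   # ['Sunny', 'Warm', '?', 'Strong', '?', '?']
```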

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 9


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Hypothesis Space Search by Find-S
[Figure: the instance space X (left) and the hypothesis space H (right), with the search moving from the most specific hypothesis h0 toward the more general h4]

x1 = <Sunny, Warm, Normal, Strong, Warm, Same>, +      h0 = <∅, ∅, ∅, ∅, ∅, ∅>
x2 = <Sunny, Warm, High, Strong, Warm, Same>, +        h1 = <Sunny, Warm, Normal, Strong, Warm, Same>
x3 = <Rainy, Cold, High, Strong, Warm, Change>, -      h2 = <Sunny, Warm, ?, Strong, Warm, Same>
x4 = <Sunny, Warm, High, Strong, Cool, Change>, +      h3 = <Sunny, Warm, ?, Strong, Warm, Same>
                                                       h4 = <Sunny, Warm, ?, Strong, ?, ?>

The search begins (h0) with the most specific hypothesis in H, then considers increasingly general hypotheses (h1 through h4) as mandated by the training examples. In the instance space diagram, positive training examples are denoted by "+", negative by "-", and instances that have not been presented as training examples are denoted by a solid circle.
In a nutshell
• The key property of the Find-S algorithm is that, for hypothesis spaces described by conjunctions of attribute constraints (such as H for the EnjoySport task), FIND-S is guaranteed to output the most specific hypothesis within H that is consistent with the positive training examples.
• Its final hypothesis will also be consistent with the negative examples, provided the correct target concept is contained in H, and provided the training examples are correct.

Remaining questions:
• Has the learner converged to the correct target concept? (It cannot tell whether it has learned the concept.)
• Why prefer the most specific hypothesis?
• Are the training examples consistent? (It cannot tell when the training data are inconsistent.)
• What if there are several maximally specific consistent hypotheses? (Find-S picks one maximally specific h; depending on H, there might be several.)

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 11


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Version Spaces And The Candidate-
elimination Algorithm
• The CANDIDATE-ELIMINATION algorithm finds all describable
hypotheses that are consistent with the observed training examples.
Definition: A hypothesis h is consistent with a set of training examples D if
and only if h(x) = c(x) for each example (x,c(x)) in D.
Consistent (h, D) = (∀ (x, c(x)) ∈ D) h(x) = c(x)
• An example x is said to satisfy hypothesis h when h(x) = 1, regardless of whether x is a positive or negative
example of the target concept.
• However, whether such an example is consistent with h depends on the target concept, and in particular, whether
h(x) = c(x). The CANDIDATE-ELIMINATION algorithm represents the set of all hypotheses consistent with the
observed training examples.
This subset of all hypotheses is called the version space with respect to the hypothesis space
H and the training examples D, because it contains all plausible versions of the target concept.

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 12


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Version Space
Definition: The version space, denoted VSH,D, with respect to
hypothesis space H and training examples D, is the subset of
hypotheses from H consistent with the training examples in D.
VSH,D ={h ∈ H|Consistent(h, D)}

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 13


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
The LIST-THEN-ELIMINATE Algorithm
1. VersionSpace ← a list containing every hypothesis in H
2. For each training example <x, c(x)>:
   remove from VersionSpace any hypothesis h for which h(x) ≠ c(x)
3. Output the list of hypotheses in VersionSpace
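A brute-force Python sketch of LIST-THEN-ELIMINATE under the same conjunctive representation (it enumerates the 973 semantically distinct hypotheses, so it is only feasible for tiny spaces such as EnjoySport):

```python
from itertools import product

def satisfies(h, x):
    """True if instance x satisfies conjunctive hypothesis h (None = empty, '?' = any)."""
    return None not in h and all(c == "?" or c == a for c, a in zip(h, x))

def list_then_eliminate(hypotheses, training_examples):
    version_space = list(hypotheses)                     # 1. every hypothesis in H
    for x, label in training_examples:                   # 2. for each <x, c(x)>
        version_space = [h for h in version_space
                         if satisfies(h, x) == label]    #    keep only consistent h
    return version_space                                 # 3. output VersionSpace

# Enumerate H for EnjoySport: each attribute is a value or "?", plus the single all-empty hypothesis.
domains = [["Sunny", "Cloudy", "Rainy"], ["Warm", "Cold"], ["Normal", "High"],
           ["Strong", "Weak"], ["Warm", "Cool"], ["Same", "Change"]]
H = [list(h) for h in product(*[d + ["?"] for d in domains])] + [[None] * 6]

examples = [(("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), True),
            (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"), True),
            (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), False),
            (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), True)]

vs = list_then_eliminate(H, examples)
print(len(vs))   # 6 hypotheses, matching the version space shown in the figure below
```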

S: { <Sunny, Warm, ?, Strong, ?, ?> }

       <Sunny, ?, ?, Strong, ?, ?>    <Sunny, Warm, ?, ?, ?, ?>    <?, Warm, ?, Strong, ?, ?>

G: { <Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?> }

A version space with its general and specific boundary sets. The version space includes all six hypotheses shown here but can be represented more simply by S and G. Arrows indicate instances of the more-general-than relation.
04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 14
& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Representing Version Spaces

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 15


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
The Candidate-elimination Algorithm

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 16


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Trace 1

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 17


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Trace 2

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 18


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Trace 3

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 19


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Final Version

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 20


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Will the CANDIDATE-ELIMINATION
Algorithm Converge to the Correct Hypothesis?
The algorithm will converge toward the target concept provided that:
• there are no errors in the training examples, and
• there is some hypothesis in H that correctly describes the target concept.

The target concept is exactly learned when the S and G boundary sets converge to a single, identical hypothesis.

What will happen if the training data contain errors?

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 21


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
What Training Example Should the Learner
Request Next?
• By requesting well-chosen queries, the learner will succeed in learning more about the true identity of the target concept.
• The optimal query strategy for a concept learner is to generate instances that satisfy exactly half the hypotheses in the current version space.

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 22


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
How Can Partially Learned Concepts Be
Used?
Instance  Sky    AirTemp  Humidity  Wind    Water  Forecast  EnjoySport
A         Sunny  Warm     Normal    Strong  Cool   Change    ?
B         Rainy  Cold     Normal    Light   Warm   Same      ?
C         Sunny  Warm     Normal    Light   Warm   Same      ?
D         Sunny  Cold     Normal    Strong  Warm   Same      ?

The proportion of hypotheses voting positive can be interpreted as the probability that this instance is positive given the training data.

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 23


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
An UN-Biased Learner
• Idea: Choose H that expresses every teachable concept (i.e., H is the power set of X)
• Consider H = disjunctions, conjunctions, and negations over the previous H, e.g.,
  <Sunny, ?, ?, ?, ?, ?> ∨ <Cloudy, ?, ?, ?, ?, ?>
While this hypothesis space eliminates any problems of expressibility, it unfortunately raises a new, equally difficult problem: our concept learning algorithm is now completely unable to generalize beyond the observed examples!
E.g., suppose we give three positive examples (x1, x2, x3) and two negative examples (x4, x5) to the learner.

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 24


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
An UN-Biased Learner..
• The S boundary of the version space will contain the hypothesis that is just the disjunction of the positive examples,
  S: { (x1 ∨ x2 ∨ x3) }
  because this is the most specific possible hypothesis that covers these three examples. Similarly, the G boundary will consist of the hypothesis that rules out only the observed negative examples,
  G: { ¬(x4 ∨ x5) }
• The problem here is that with this very expressive hypothesis representation, the S boundary will always be simply the disjunction of the observed positive examples, while the G boundary will always be the negated disjunction of the observed negative examples.
• Therefore, the only examples that will be unambiguously classified by S and G are the observed training examples themselves.

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 25


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
The Futility of Bias-Free Learning
A learner that makes no a priori assumptions regarding the identity of the target concept has no rational basis for classifying any unseen instances.

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 26


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Modeling inductive systems by equivalent
deductive systems.
• The input-output behavior of the CANDIDATE-ELIMINATION algorithm using a hypothesis space H is identical to that of a deductive theorem prover utilizing the assertion "H contains the target concept."
• This assertion is therefore called the inductive bias of the CANDIDATE-ELIMINATION algorithm. Characterizing inductive systems by their inductive bias allows modeling them by their equivalent deductive systems.
• This provides a way to compare inductive systems according to their policies for generalizing beyond the observed training data.

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 27


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Bayes’ Theorem

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 28


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Bayes’ Theorem: Basics
• Total probability theorem: P(B) = Σ_{i=1}^{M} P(B | A_i) P(A_i)

• Bayes' theorem: P(H | X) = P(X | H) P(H) / P(X)
• Let X be a data sample (“evidence”): class label is unknown
• Let H be a hypothesis that X belongs to class C
• Classification is to determine P(H|X), (i.e., posteriori probability): the
probability that the hypothesis holds given the observed data sample X
• P(H) (prior probability): the initial probability
• E.g., X will buy computer, regardless of age, income, …
• P(X): probability that sample data is observed
• P(X|H) (likelihood): the probability of observing the sample X, given that the
hypothesis holds
• E.g., Given that X will buy computer, the prob. that X is 31..40, medium
income
04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 29
& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Prediction Based on Bayes’ Theorem
• Given training data X, posteriori probability of a hypothesis H, P(H|X),
follows the Bayes’ theorem

P(H | X) P(X | H ) P(H ) P(X | H )P(H ) / P(X)


P(X)
• Informally, this can be viewed as
posteriori = likelihood x prior/evidence
• Predicts X belongs to Ci iff the probability P(Ci|X) is the highest
among all the P(Ck|X) for all the k classes
• Practical difficulty: It requires initial knowledge of many probabilities,
involving significant computational cost
04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 30
& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Classification Is to Derive the Maximum Posteriori
• Let D be a training set of tuples and their associated class labels, and each tuple is
represented by an n-D attribute vector X = (x1, x2, …, xn)
• Suppose there are m classes C1, C2, …, Cm.
• Classification is to derive the maximum posteriori, i.e., the maximal P(Ci | X)
• This can be derived from Bayes' theorem:
  P(Ci | X) = P(X | Ci) P(Ci) / P(X)
• Since P(X) is constant for all classes, only P(X | Ci) P(Ci) needs to be maximized

31
Naïve Bayes Classifier
• A simplified assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):
  P(X | Ci) = Π_{k=1}^{n} P(x_k | Ci) = P(x_1 | Ci) × P(x_2 | Ci) × ... × P(x_n | Ci)
• This greatly reduces the computation cost: only count the class distribution
• If Ak is categorical, P(xk | Ci) is the # of tuples in Ci having value xk for Ak, divided by |Ci,D| (# of tuples of Ci in D)
• If Ak is continuous-valued, P(xk | Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:
  g(x, μ, σ) = (1 / (√(2π) σ)) · exp(−(x − μ)² / (2σ²))
  and P(xk | Ci) = g(xk, μ_Ci, σ_Ci)
32
Naïve Bayes Classifier: Training Dataset
Class:
  C1: buys_computer = 'yes'
  C2: buys_computer = 'no'

Data to be classified:
  X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31…40   high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31…40   low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31…40   medium  no       excellent      yes
31…40   high    yes      fair           yes
>40     medium  no       excellent      no
04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 33


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Naïve Bayes Classifier: An Example
• P(Ci):
  P(buys_computer = "yes") = 9/14 = 0.643
  P(buys_computer = "no") = 5/14 = 0.357
• Compute P(X|Ci) for each class (using the training data set on the previous slide):
  P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
  P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
  P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
  P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
  P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
  P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
  P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
  P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
• X = (age <= 30, income = medium, student = yes, credit_rating = fair)
  P(X|Ci):
  P(X | buys_computer = "yes") = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
  P(X | buys_computer = "no") = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
  P(X|Ci) × P(Ci):
  P(X | buys_computer = "yes") × P(buys_computer = "yes") = 0.028
  P(X | buys_computer = "no") × P(buys_computer = "no") = 0.007
Therefore, X belongs to class "buys_computer = yes".
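The arithmetic above can be reproduced with a short Python sketch over the 14 training tuples (a check of this worked example, not a general-purpose classifier):

```python
from collections import Counter

# (age, income, student, credit_rating, buys_computer) for the 14 training tuples
data = [
    ("<=30", "high", "no", "fair", "no"),        ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"),     (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),        (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"),("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),       (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"),    (">40", "medium", "no", "excellent", "no"),
]
X = ("<=30", "medium", "yes", "fair")          # tuple to classify

class_counts = Counter(row[-1] for row in data)
scores = {}
for c, n_c in class_counts.items():
    likelihood = 1.0
    for j, value in enumerate(X):              # conditional independence assumption
        n_match = sum(1 for row in data if row[-1] == c and row[j] == value)
        likelihood *= n_match / n_c            # P(x_j | C_i)
    scores[c] = likelihood * n_c / len(data)   # P(X | C_i) * P(C_i)

print(scores)                        # roughly {'yes': 0.028, 'no': 0.007}
print(max(scores, key=scores.get))   # 'yes'
```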

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 34


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Naïve Bayes Classifier: Comments
• Advantages
• Easy to implement
• Good results obtained in most of the cases
• Disadvantages
• Assumption: class conditional independence,
therefore loss of accuracy
• Practically, dependencies exist among variables
• E.g., hospitals: patients: Profile: age, family history, etc.
Symptoms: fever, cough etc., Disease: lung
cancer, diabetes, etc.
• Dependencies among these cannot be modeled by Naïve
Bayes Classifier
• How to deal with these dependencies? Bayesian Belief
Networks
04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 35
& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Bayesian Belief Networks
• Bayesian belief networks (also known as Bayesian networks, probabilistic
networks): allow class conditional independencies between subsets of variables
• A (directed acyclic) graphical model of causal relationships
• Represents dependency among the variables
• Gives a specification of joint probability distribution
• Nodes: random variables
• Links: dependency
• Example: X and Y are the parents of Z, and Y is the parent of P; there is no dependency between Z and P
• The graph has no loops or cycles

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 36


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J 36
Bayesian Belief Network: An Example

[Figure: network with nodes FamilyHistory (FH), Smoker (S), LungCancer (LC), Emphysema, PositiveXRay, and Dyspnea; FH and S are the parents of LC]

CPT: Conditional Probability Table for variable LungCancer:

       (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
LC     0.8      0.5       0.7       0.1
~LC    0.2      0.5       0.3       0.9

The CPT shows the conditional probability for each possible combination of values of its parents.

Derivation of the probability of a particular combination of values of X, from the CPT:
P(x1, ..., xn) = Π_{i=1}^{n} P(xi | Parents(Yi))
04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 37
& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Training Bayesian Networks: Several Scenarios
• Scenario 1: Given both the network structure and all variables observable:
compute only the CPT entries
• Scenario 2: Network structure known, some variables hidden: gradient
descent (greedy hill-climbing) method, i.e., search for a solution along the
steepest descent of a criterion function
• Weights are initialized to random probability values
• At each iteration, it moves towards what appears to be the best solution
at the moment, w.o. backtracking
• Weights are updated at each iteration & converge to local optimum
• Scenario 3: Network structure unknown, all variables observable: search
through the model space to reconstruct network topology
• Scenario 4: Unknown structure, all hidden variables: No good algorithms
known for this purpose.

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 38


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Naive Bayes algorithms for learning and classifying text

Assumption: the probability of a word occurring is independent of its position within the text.
Avoiding the Zero-Probability Problem
• Naïve Bayesian prediction requires each conditional prob. be non-zero.
Otherwise, the predicted prob. will be zero
  P(X | Ci) = Π_{k=1}^{n} P(x_k | Ci)
• Ex. Suppose a dataset with 1000 tuples, income=low (0), income=
medium (990), and income = high (10)
• Use Laplacian correction (or Laplacian estimator)
• Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
• The “corrected” prob. estimates are close to their “uncorrected”
counterparts
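A one-screen sketch of the Laplacian correction for the income example above (add 1 to each of the three value counts and 3 to the denominator):

```python
# Counts of income values among the 1000 tuples of the class
counts = {"low": 0, "medium": 990, "high": 10}
n = sum(counts.values())                 # 1000
k = len(counts)                          # 3 distinct values

corrected = {v: (c + 1) / (n + k) for v, c in counts.items()}
print(corrected)   # low: 1/1003 ~ 0.001, medium: 991/1003 ~ 0.988, high: 11/1003 ~ 0.011
```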
40
Choosing Hypotheses
Generally we want the most probable hypothesis given the training data: the maximum a posteriori (MAP) hypothesis

  h_MAP = argmax_{h ∈ H} P(h | D) = argmax_{h ∈ H} P(D | h) P(h)

If we assume P(h_i) = P(h_j) for all i, j, then we can further simplify and choose the maximum likelihood hypothesis

  h_ML = argmax_{h ∈ H} P(D | h)
04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 41


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Brute Force MAP Hypothesis Learner

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 42


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Relation to Concept Learning

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 43


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Relation to Concept Learning

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 44


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Relation to Concept Learning

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 45


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Evolution of Posterior
Probabilities

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 46


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Characterizing Learning Algorithms by
Equivalent MAP Learners

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 47


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Learning A Real Valued Function

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 48


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Learning A
Real
Valued
Function

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 49


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Learning to
Predict
Probabilities

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 50


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Minimum
Description
Length
Principle

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 51


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Minimum
Description
Length
Principle

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 52


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Most Probable Classification of New Instances

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 53


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Bayes
Optimal
Classifier

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 54


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Gibbs Classifier

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 55


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
II. Learning through Linear
Methods of Classification

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 56


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Fisher's discriminant
function

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 57


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Linear discriminant analysis

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 58


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Logistic regression

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 59


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Support Vector Machines

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 60


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
SVM—Support Vector Machines
• A relatively new classification method for both
linear and nonlinear data
• It uses a nonlinear mapping to transform the
original training data into a higher dimension
• With the new dimension, it searches for the linear
optimal separating hyperplane (i.e., “decision
boundary”)
• With an appropriate nonlinear mapping to a
sufficiently high dimension, data from two classes
can always be separated by a hyperplane
• SVM finds this hyperplane using support vectors ("essential" training tuples) and margins (defined by the support vectors)
SVM—History and Applications
• Vapnik and colleagues (1992)—groundwork from
Vapnik & Chervonenkis’ statistical learning theory
in 1960s
• Features: training can be slow but accuracy is high
owing to their ability to model complex nonlinear
decision boundaries (margin maximization)
• Used for: classification and numeric prediction
• Applications:
  • handwritten digit recognition, object recognition, speaker identification, benchmarking time-series prediction tests
SVM—General Philosophy

[Figure: two linear separators for the same data, one with a small margin and one with a large margin; the support vectors are the tuples lying on the margin]

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 63


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
SVM—Margins and Support Vectors

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 64


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
SVM—When Data Is Linearly Separable

Let data D be (X1, y1), …, (X|D|, y|D|), where Xi is the set of training tuples associated
with the class labels yi
There are infinite lines (hyperplanes) separating the two classes but we want to find the
best one (the one that minimizes classification error on unseen data)
SVM searches for the hyperplane with the largest margin, i.e., maximum marginal
hyperplane (MMH)
04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 65
& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
SVM—Linearly Separable
• A separating hyperplane can be written as
  W · X + b = 0
  where W = {w1, w2, …, wn} is a weight vector and b a scalar (bias)
• For 2-D it can be written as
  w0 + w1 x1 + w2 x2 = 0
• The hyperplanes defining the sides of the margin:
  H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
  H2: w0 + w1 x1 + w2 x2 ≤ −1 for yi = −1
• Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are support vectors
• This becomes a constrained (convex) quadratic optimization problem: a quadratic objective function and linear constraints → Quadratic Programming (QP) → Lagrangian multipliers
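A hedged sketch of fitting such a maximum-margin linear SVM with scikit-learn, assuming that library is available; the 2-D data are made up for illustration, and a large C approximates the hard-margin formulation:

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D, linearly separable data (illustrative only).
X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],      # class -1
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])     # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

# Linear kernel; a very large C approximates the hard-margin SVM.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

print(clf.coef_, clf.intercept_)     # W and b of the separating hyperplane W.X + b = 0
print(clf.support_vectors_)          # the tuples that lie on the margin hyperplanes H1, H2
print(clf.predict([[3.0, 3.0]]))
```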
04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 66
& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Why Is SVM Effective on High Dimensional Data?

• The complexity of the trained classifier is characterized by the number of support vectors rather than the dimensionality of the data
• The support vectors are the essential or critical training examples: they lie closest to the decision boundary (MMH)
• If all other training examples were removed and the training repeated, the same separating hyperplane would be found
• The number of support vectors found can be used to compute an (upper) bound on the expected error rate of the SVM classifier, which is independent of the data dimensionality
• Thus, an SVM with a small number of support vectors can have good generalization, even when the dimensionality of the data is high
04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 67
& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
SVM—Linearly Inseparable

[Figure: data plotted on attributes A1 and A2 that are not linearly separable in the original space]

• Transform the original input data into a higher dimensional space
• Search for a linear separating hyperplane in the new space


04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 68
& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
SVM: Different Kernel functions
• Instead of computing the dot product on the transformed data, it is mathematically equivalent to apply a kernel function K(Xi, Xj) to the original data, i.e., K(Xi, Xj) = Φ(Xi) · Φ(Xj)
• Typical kernel functions include the polynomial kernel, the Gaussian radial basis function (RBF) kernel, and the sigmoid kernel
• SVM can also be used for classifying multiple (> 2) classes and for regression analysis (with additional parameters)
04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 69
& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Distance-weighted
discrimination

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 70


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Classification using kernel
Regression.

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 71


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
III. Learning through Other
Methods of Classification

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 72


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Quadratic discriminant
analysis

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 73


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Regularized discriminant
analysis

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 74


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Nonlinear SVM

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 75


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Kernel density estimation

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 76


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Kernel discriminant analysis

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 77


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
IV. Instance Learning,
Statistical Models & Neural
Networks

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 78


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Instance Learning

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 79


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Lazy vs. Eager Learning
• Lazy vs. eager learning
• Lazy learning (e.g., instance-based learning): Simply
stores training data (or only minor processing) and
waits until it is given a test tuple
• Eager learning (the above discussed methods):
Given a set of training tuples, constructs a
classification model before receiving new (e.g., test)
data to classify
• Lazy: less time in training but more time in predicting
• Accuracy
• Lazy method effectively uses a richer hypothesis
space since it uses many local linear functions to form
an implicit global approximation to the target function
• Eager: must commit to a single hypothesis that
covers the entire instance space
04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 80
& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Lazy Learner: Instance-Based Methods

• Instance-based learning:
• Store training examples and delay the
processing (“lazy evaluation”) until a new
instance must be classified
• Typical approaches
• k-nearest neighbor approach
• Instances represented as points in a Euclidean
space.
• Locally weighted regression
• Constructs local approximation
• Case-based reasoning
• Uses symbolic representations and knowledge-
based inference

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 81


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
k-Nearest Neighbor

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 82


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
The k-Nearest Neighbor Algorithm

• All instances correspond to points in the n-D space
• The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2)
• The target function could be discrete- or real-valued
• For discrete-valued target functions, k-NN returns the most common value among the k training examples nearest to xq
• Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples

[Figure: query point xq among "+" and "-" training examples, with the Voronoi decision surface induced by 1-NN]
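A minimal Python sketch of the k-NN classifier just described (Euclidean distance, majority vote for a discrete-valued target; the training points are made up):

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_classify(xq, training_data, k=3):
    """training_data: list of (feature_tuple, label). Returns the majority label
    among the k training examples nearest to the query point xq."""
    neighbors = sorted(training_data, key=lambda item: euclidean(item[0], xq))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Example usage with made-up 2-D points
train = [((1, 1), "-"), ((1, 2), "-"), ((2, 1), "-"),
         ((5, 5), "+"), ((6, 5), "+"), ((5, 6), "+")]
print(knn_classify((4.5, 4.5), train, k=3))   # '+'
```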
Similarity and Dissimilarity
• Similarity
• Numerical measure of how alike two data objects
are
• Value is higher when objects are more alike
• Often falls in the range [0,1]
• Dissimilarity (e.g., distance)
• Numerical measure of how different two data
objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
• Proximity refers to a similarity or dissimilarity

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 84


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Data Matrix and Dissimilarity Matrix
• Data matrix
  • n data points with p dimensions
  • Two modes

    [ x11  ...  x1f  ...  x1p ]
    [ ...  ...  ...  ...  ... ]
    [ xi1  ...  xif  ...  xip ]
    [ ...  ...  ...  ...  ... ]
    [ xn1  ...  xnf  ...  xnp ]

• Dissimilarity matrix
  • n data points, but registers only the distance
  • A triangular matrix
  • Single mode

    [ 0                             ]
    [ d(2,1)  0                     ]
    [ d(3,1)  d(3,2)  0             ]
    [ :       :       :             ]
    [ d(n,1)  d(n,2)  ...  ...  0   ]

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 85


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Proximity Measure for Nominal Attributes

• Can take 2 or more states, e.g., red, yellow,


blue, green (generalization of a binary attribute)
• Method 1: Simple matching
  • m: # of matches, p: total # of variables
    d(i, j) = (p − m) / p
• Method 2: Use a large number of binary attributes
  • creating a new binary attribute for each of the M nominal states

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 86


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Proximity Measure for Binary Attributes
Object j

• A contingency table for binary data


Object i

• Distance measure for symmetric binary


variables:

• Distance measure for asymmetric binary


variables:

• Jaccard coefficient (similarity measure


for asymmetric binary variables):

• Note: The Jaccard coefficient is the same as "coherence".

87
Dissimilarity between Binary Variables

• Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
• Gender is a symmetric attribute
• The remaining attributes are asymmetric binary
• Let the values Y and P be 1, and the value N be 0

  d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
  d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
  d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 88
& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Standardizing Numeric Data
• Z-score: z = (x − μ) / σ
  • x: raw score to be standardized, μ: mean of the population, σ: standard deviation of the population
  • the distance between the raw score and the population mean, in units of the standard deviation
  • negative when the raw score is below the mean, positive when above
• An alternative way: calculate the mean absolute deviation
  s_f = (1/n) (|x_1f − m_f| + |x_2f − m_f| + ... + |x_nf − m_f|), where m_f = (1/n) (x_1f + x_2f + ... + x_nf)
  • standardized measure (z-score): z_if = (x_if − m_f) / s_f
• Using mean absolute deviation is more robust than using standard deviation

89
Example:
Data Matrix and Dissimilarity Matrix

Data Matrix

point  attribute1  attribute2
x1     1           2
x2     3           5
x3     2           0
x4     4           5

[Figure: scatter plot of the four points x1(1,2), x2(3,5), x3(2,0), x4(4,5)]

Dissimilarity Matrix (with Euclidean distance)

      x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.1   0
x4    4.24  1     5.39  0
04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 90
& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Distance on Numeric Data: Minkowski Distance

• Minkowski distance: a popular distance measure
  d(i, j) = (|x_i1 − x_j1|^h + |x_i2 − x_j2|^h + ... + |x_ip − x_jp|^h)^(1/h)
  where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and h is the order (the distance so defined is also called the L-h norm)
• Properties
  • d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (positive definiteness)
  • d(i, j) = d(j, i) (symmetry)
  • d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)
• A distance that satisfies these properties is a metric
Special Cases of Minkowski Distance

• h = 1: Manhattan (city block, L1 norm) distance
  • E.g., the Hamming distance: the number of bits that are different between two binary vectors
  d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + ... + |x_ip − x_jp|

• h = 2: Euclidean (L2 norm) distance
  d(i, j) = sqrt(|x_i1 − x_j1|² + |x_i2 − x_j2|² + ... + |x_ip − x_jp|²)

• h → ∞: "supremum" (Lmax norm, L∞ norm) distance
  • This is the maximum difference between any component (attribute) of the vectors
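A short Python sketch of the Minkowski family with the three special cases above; the printed values can be checked against the dissimilarity matrices on the next slide:

```python
def minkowski(x, y, h):
    """L-h norm distance between two equal-length numeric vectors."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1.0 / h)

def manhattan(x, y):                       # h = 1
    return minkowski(x, y, 1)

def euclidean(x, y):                       # h = 2
    return minkowski(x, y, 2)

def supremum(x, y):                        # h -> infinity
    return max(abs(a - b) for a, b in zip(x, y))

x1, x2 = (1, 2), (3, 5)
print(manhattan(x1, x2), round(euclidean(x1, x2), 2), supremum(x1, x2))   # 5.0  3.61  3
```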

92
Example: Minkowski Distance
Dissimilarity Matrices

point  attribute1  attribute2
x1     1           2
x2     3           5
x3     2           0
x4     4           5

Manhattan (L1)
      x1   x2   x3   x4
x1    0
x2    5    0
x3    3    6    0
x4    6    1    7    0

Euclidean (L2)
      x1    x2    x3    x4
x1    0
x2    3.61  0
x3    2.24  5.1   0
x4    4.24  1     5.39  0

Supremum (L∞)
      x1   x2   x3   x4
x1    0
x2    3    0
x3    2    5    0
x4    3    1    5    0
Cosine Similarity
• A document can be represented by thousands of attributes, each
recording the frequency of a particular word (such as keywords) or
phrase in the document.

• Other vector objects: gene features in micro-arrays, …


• Applications: information retrieval, biologic taxonomy, gene feature
mapping, ...
• Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then
  cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||)
  where · indicates the vector dot product and ||d|| is the length of vector d
04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 94
& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Example: Cosine Similarity
• cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||), where · indicates the vector dot product and ||d|| is the length of vector d

• Ex: Find the similarity between documents 1 and 2.

  d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
  d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

  d1 · d2 = 5*3 + 0*0 + 3*2 + 0*0 + 2*1 + 0*1 + 0*0 + 2*1 + 0*0 + 0*1 = 25
  ||d1|| = (5*5 + 0*0 + 3*3 + 0*0 + 2*2 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481
  ||d2|| = (3*3 + 0*0 + 2*2 + 0*0 + 1*1 + 1*1 + 0*0 + 1*1 + 0*0 + 1*1)^0.5 = (17)^0.5 = 4.123
  cos(d1, d2) = 25 / (6.481 × 4.123) ≈ 0.94
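The same computation in a few lines of Python (a sketch that simply re-does the arithmetic above):

```python
import math

def cosine_similarity(d1, d2):
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
print(round(cosine_similarity(d1, d2), 2))   # 0.94
```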

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 95


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Mahalanobis distance
• In the Euclidean distance, the different dimensions have equal weights on the overall distance between two data tuples, and their effects are considered independently.
• Therefore, if a certain dimension has a larger value range than others, or if different dimensions are correlated with each other, Euclidean distance might lead to suboptimal classification performance.

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 96


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Mahalanobis distance
Mahalanobis distance naturally
1) assigns different weights (through the diagonal elements of matrix M) to different dimensions, and
2) incorporates the interaction effect of different dimensions (through the off-diagonal elements of matrix M) in measuring the distance between the two input tuples.

The basic idea of distance metric learning is that we want to find the optimal matrix M (hence the optimal Mahalanobis distance) so that (1) the distance between any pair of similar tuples in S is small and (2) the distance between any pair of dissimilar tuples in D is large.
Mathematically, we can formulate this as an optimization problem subject to the constraint M ⪰ 0, which means that matrix M is positive semidefinite. The optimization problem so defined is convex.
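A small numpy sketch of the Mahalanobis distance d_M(x, y) = sqrt((x − y)ᵀ M (x − y)). Here M is taken as the inverse of the sample covariance matrix, the classical choice when no learned metric is available; the data are made up for illustration:

```python
import numpy as np

def mahalanobis(x, y, M):
    """Mahalanobis distance between x and y under positive semidefinite matrix M."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(d @ M @ d))

# Illustrative 2-D data with correlated, differently scaled dimensions.
X = np.array([[1.0, 10.0], [2.0, 21.0], [3.0, 29.0], [4.0, 42.0], [5.0, 48.0]])
M = np.linalg.inv(np.cov(X, rowvar=False))     # classical (non-learned) choice of M

print(mahalanobis(X[0], X[1], M))              # metric-aware distance
print(np.linalg.norm(X[0] - X[1]))             # Euclidean distance, for comparison
```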

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 97


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Discussion on the k-NN Algorithm
• k-NN for real-valued prediction for a given unknown
tuple
• Returns the mean values of the k nearest neighbors
• Distance-weighted nearest neighbor algorithm
  • Weight the contribution of each of the k neighbors according to their distance to the query point xq:
    w = 1 / d(xq, xi)²
  • Give greater weight to closer neighbors
• Robust to noisy data by averaging the k nearest neighbors
• Curse of dimensionality: the distance between neighbors could be dominated by irrelevant attributes
  • To overcome it, stretch the axes or eliminate the least relevant attributes
Case-Based Reasoning (CBR)
• CBR: Uses a database of problem solutions to solve new problems
• Store symbolic description (tuples or cases)—not points in a Euclidean
space
• Applications: Customer-service (product-related diagnosis), legal ruling
• Methodology
• Instances represented by rich symbolic descriptions (e.g., function
graphs)
• Search for similar cases, multiple retrieved cases may be combined
• Tight coupling between case retrieval, knowledge-based reasoning,
and problem solving
• Challenges
• Find a good similarity metric
• Indexing based on syntactic similarity measure, and when failure,
backtracking, and adapting to additional cases
04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 99
& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Additive Models

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 100


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Projection Pursuit

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 101


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Neural Network

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 102


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Connectionist Models
Consider humans:
• Neuron switching time ~ 0.001 second
• Number of neurons ~ 10^10
• Connections per neuron ~ 10^4 to 10^5
• Scene recognition time ~ 0.1 second
• 100 inference steps doesn't seem like enough ➔ much parallel computation

Properties of artificial neural nets (ANNs):
• Many neuron-like threshold switching units
• Many weighted interconnections among units
• Highly parallel, distributed process
• Emphasis on tuning weights automatically

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 103


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
When to Consider Neural Networks
• Instances are represented by many attribute-value pairs.
• Input is high-dimensional discrete or real-valued (e.g. raw sensor input)
• Output is discrete or real valued
• Output is a vector of values
• Possibly noisy data
• Form of target function is unknown
• Human readability of result is unimportant
• Long training times are acceptable.
• Fast evaluation of the learned target function may be required.
Examples:
• Speech phoneme recognition [Waibel]
• Image classification [Kanade, Baluja, Rowley]
• Financial prediction

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 104


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
ALVINN drives 70 mph on highways

[Figure: the ALVINN network: a 30x32 sensor input retina feeds 4 hidden units, which feed 30 output units ranging from Sharp Left through Straight Ahead to Sharp Right]
Perceptron

[Figure: perceptron unit with inputs x1, …, xn, weights w1, …, wn, and bias weight w0 with fixed input x0 = 1]

o = 1 if Σ_{i=0}^{n} wi xi > 0, and −1 otherwise

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics & Data 106
Science, Birla Institute of Technology, Mesra, Ranchi, Jharkhand-835215
Decision Surface of a Perceptron
[Figure: (a) a set of "+" and "-" examples in the x1, x2 plane separated by a line; (b) a set of examples that no straight line can separate]

The decision surface represented by a two-input perceptron. (a) A set of training examples and the
decision surface of a perceptron that classifies them correctly. (b) A set of training examples that is not
linearly separable (i.e., that cannot be correctly classified by any straight line). x1 and x2 are the
perceptron inputs. Positive examples are indicated by "+", negative by "-".

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 107


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
The Perceptron Training Rule
• The precise learning problem is to determine a weight vector that causes the perceptron to produce the correct ±1 output for each of the given training examples.
• Two algorithms are available for this problem: the perceptron rule and the delta rule. These two algorithms are guaranteed to converge to somewhat different acceptable hypotheses, under somewhat different conditions.
• The perceptron rule: begin with random weights, then iteratively apply the perceptron to each training example, modifying the perceptron weights whenever it misclassifies an example. This process is repeated, iterating through the training examples as many times as needed, until the perceptron classifies all training examples correctly.
• Weights are modified at each step according to the perceptron training rule, which revises the weight wi associated with input xi according to the rule
  wi ← wi + Δwi, where Δwi = η (t − o) xi
  (t is the target output, o is the perceptron's output, and η is the learning rate).
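A minimal Python sketch of the perceptron training rule wi ← wi + η (t − o) xi, cycling through the training set until every example is classified correctly (only sensible when the data are linearly separable, as discussed on the following slides):

```python
def perceptron_output(w, x):
    """Thresholded unit: +1 if w0 + w1*x1 + ... + wn*xn > 0, else -1 (x excludes x0)."""
    s = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if s > 0 else -1

def train_perceptron(examples, eta=0.1, max_epochs=1000):
    """examples: list of (x, t) with t in {+1, -1}. Returns the learned weights."""
    n = len(examples[0][0])
    w = [0.0] * (n + 1)                       # w[0] is the bias weight (x0 = 1)
    for _ in range(max_epochs):
        errors = 0
        for x, t in examples:
            o = perceptron_output(w, x)
            if o != t:                        # misclassified: apply the training rule
                w[0] += eta * (t - o) * 1.0
                for i, xi in enumerate(x, start=1):
                    w[i] += eta * (t - o) * xi
                errors += 1
        if errors == 0:                       # all examples classified correctly
            break
    return w

# Example: learning the (linearly separable) OR function with inputs in {-1, +1}
data = [((-1, -1), -1), ((-1, 1), 1), ((1, -1), 1), ((1, 1), 1)]
w = train_perceptron(data)
print([perceptron_output(w, x) for x, _ in data])   # [-1, 1, 1, 1]
```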
Dr Manish Kumar Pandey, Centre for Quantitative Economics
04/27/2025 & Data Science, Birla Institute of Technology, Mesra, Ranchi, 108
Jharkhand-835215
Gradient Descent and the Delta Rule
• If the data are not linearly separable, convergence of the perceptron rule is not assured.
• Provided a sufficiently small η is used, the delta rule converges toward a best-fit approximation to the target concept even when the training examples are not linearly separable.
• The key idea behind the delta rule is to use gradient descent to search the hypothesis space of possible weight vectors to find the weights that best fit the training examples.
• This rule is important because gradient descent provides the basis for the BACKPROPAGATION algorithm, which can learn networks with many interconnected units.
04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 109
& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Gradient Descent and the Delta Rule..

• The delta training rule is best understood by considering the task of training an unthresholded perceptron, that is, a linear unit for which the output o is given by o(x) = w · x.

To derive a weight learning rule for linear units, let us begin by specifying a measure for the training error of a hypothesis (weight vector), relative to the training examples.

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 110


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Visualizing The Hypothesis Space

Error of different hypotheses. For a linear unit with two weights, the hypothesis space H is the w0, w1 plane. The vertical axis indicates the error of the corresponding weight vector hypothesis, relative to a fixed set of training examples. The arrow shows the negated gradient at one particular point, indicating the direction in the w0, w1 plane producing steepest descent along the error surface.

• Determine a weight vector that minimizes E by starting with an arbitrary initial weight vector, then repeatedly modifying it in small steps.
• At each step, the weight vector is altered in the direction that produces the steepest descent along the error surface. This process continues until the global minimum error is reached.
04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 111
& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Gradient Descent
To understand gradient descent, consider the simpler linear unit, whose output is o = w0 + w1 x1 + ... + wn xn.

Let's learn the wi's that minimize the squared error

  E(w) ≡ (1/2) Σ_{d ∈ D} (t_d − o_d)²

where D is the set of training examples, t_d is the target output for example d, and o_d is the linear unit's output for example d.
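A minimal Python sketch of batch gradient descent for such a linear unit under this squared-error measure; the update Δwi = η Σ_d (t_d − o_d) x_id follows from differentiating E, and the toy data are made up:

```python
def linear_output(w, x):
    """Unthresholded linear unit: o = w0 + w1*x1 + ... + wn*xn."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

def gradient_descent(examples, eta=0.05, epochs=500):
    """examples: list of (x, t) pairs with real-valued targets t."""
    n = len(examples[0][0])
    w = [0.0] * (n + 1)
    for _ in range(epochs):
        delta = [0.0] * (n + 1)                       # accumulate gradient over all of D
        for x, t in examples:
            err = t - linear_output(w, x)
            delta[0] += eta * err * 1.0               # x0 = 1
            for i, xi in enumerate(x, start=1):
                delta[i] += eta * err * xi
        w = [wi + dwi for wi, dwi in zip(w, delta)]   # one batch update per epoch
    return w

# Toy data generated (approximately) by t = 1 + 2*x1
data = [((0.0,), 1.0), ((1.0,), 3.1), ((2.0,), 4.9), ((3.0,), 7.0)]
print(gradient_descent(data))   # weights close to [1.0, 2.0]
```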

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 112


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Gradient Descent
How can we calculate the direction of steepest descent along the error surface?

When interpreted as a vector in weight space, the gradient specifies the direction that
produces the steepest increase in E.
04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 113
& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
GRADIENT DESCENT algorithm for training
a linear unit.

To implement the stochastic approximation to gradient descent, the step that accumulates Δwi over all training examples is deleted, and the weight update wi ← wi + η (t − o) xi is instead applied immediately after each training example is processed.
04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 114
& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Stochastic Approximation To Gradient Descent

• Benefits
• the hypothesis space contains continuously parameterized hypotheses (e.g., the weights in a linear unit), and
• the error can be differentiated with respect to these hypothesis parameters.

• Challenges
• converging to a local minimum can sometimes be quite slow (i.e., it can require many thousands of gradient descent
steps), and
• if there are multiple local minima in the error surface, then there is no guarantee that the procedure will find the global
minimum.
• Solution is Incremental gradient Descent, alternatively called stochastic gradient descent.

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 115


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Stochastic Approximation To Gradient Descent

04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 116


& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
Comparison
• In standard gradient descent, the error is summed over all examples before
updating weights, whereas in stochastic gradient descent weights are updated
upon examining each training example.
• Summing over multiple examples in standard gradient descent requires more
computation per weight update step. On the other hand, because it uses the
true gradient, standard gradient descent is often used with a larger step size per
weight update than stochastic gradient descent.
• In cases where there are multiple local minima with respect to E(w), stochastic gradient descent can sometimes avoid falling into these local minima because it uses the various ∇E_d(w) rather than ∇E(w) to guide its search.

This stochastic gradient version of the delta rule is also known as the LMS (Least Mean Square) rule, the Adaline rule, or the Widrow-Hoff rule.
04/27/2025 Dr Manish Kumar Pandey, Centre for Quantitative Economics 117
& Data Science, Birla Institute of Technology, Mesra, Ranchi, J
In a nutshell
• The perceptron training rule converges after a finite number of iterations to a hypothesis that perfectly classifies the training data, provided the training examples are linearly separable.
• The delta rule converges only asymptotically toward the minimum-error hypothesis, possibly requiring unbounded time, but it converges regardless of whether the training data are linearly separable.
• A third possible algorithm for learning the weight vector is linear programming. Linear programming is a general, efficient method for solving sets of linear inequalities. Notice that each training example corresponds to an inequality of the form w · x > 0 or w · x ≤ 0, and their solution is the desired weight vector.
Summary of choices in
designing the checkers
learning program

Multilayer Networks of Sigmoid Units

(Figure: decision regions of a multilayer network of sigmoid units trained to distinguish the vowel sounds "head", "hid", "who'd", and "hood" from the two formant-frequency inputs F1 and F2.)

Sigmoid Unit
(Figure: a sigmoid unit — inputs x1, …, xn with weights w1, …, wn, plus a bias weight w0 on the constant input x0 = 1.)

    net = Σ_{i=0}^{n} wi·xi
    o = σ(net) = 1 / (1 + e^(−net))
Error Gradient for a Sigmoid Unit
Differentiating the squared error E(w) = ½ Σ_{d∈D} (t_d − o_d)² for a sigmoid unit gives
    ∂E/∂wi = −Σ_{d∈D} (t_d − o_d) · o_d (1 − o_d) · x_id
where o_d(1 − o_d) is the derivative of the sigmoid function evaluated at net_d.
Backpropagation
Algorithm

The Paradox
• One major difference in the case of multilayer networks is that the error surface can have multiple local minima, in contrast to the single-minimum parabolic error surface of a single linear unit.
• Unfortunately, this means that gradient descent is guaranteed only to
converge toward some local minimum, and not necessarily the global
minimum error.

Adding Momentum

The weight-update rule is altered so that the update in the nth iteration depends partially on the update performed in the (n−1)th iteration:
    Δw_ji(n) = η·δj·x_ji + α·Δw_ji(n−1)
• Here, 0 ≤ α < 1 is a constant known as the momentum.

Derivation of Backpropagation rule

Training Rule for Output Unit Weights
For each output unit k, the error term is δk = ok(1 − ok)(tk − ok), and each incoming weight is updated as w_kj ← w_kj + η·δk·x_kj.
Training Rule for Hidden Unit Weights
For each hidden unit h, the error term is δh = oh(1 − oh) · Σ_{k ∈ outputs} w_kh·δk, and each incoming weight is updated as w_hj ← w_hj + η·δh·x_hj.
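Putting the backpropagation pieces together, here is a minimal sketch (my own illustration, not slide code) of stochastic backpropagation for a network with one hidden layer of sigmoid units, including the momentum term from the earlier slide. The network size, learning rate η, momentum α, epoch count, and the XOR data are assumptions chosen only for demonstration.

import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_backprop(examples, n_in, n_hidden, n_out, eta=0.3, alpha=0.9, epochs=10000, seed=0):
    """examples: list of (x, t) with len(x) == n_in and len(t) == n_out, targets in (0, 1)."""
    rnd = random.Random(seed)
    # w_h[j][i]: weight from input i (index 0 = bias) to hidden unit j; w_o likewise for outputs
    w_h = [[rnd.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_o = [[rnd.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(n_out)]
    dw_h = [[0.0] * (n_in + 1) for _ in range(n_hidden)]      # previous updates, for momentum
    dw_o = [[0.0] * (n_hidden + 1) for _ in range(n_out)]

    for _ in range(epochs):
        for x, t in examples:
            xi = [1.0] + list(x)                               # prepend the constant bias input
            h  = [sigmoid(sum(w * v for w, v in zip(w_h[j], xi))) for j in range(n_hidden)]
            hi = [1.0] + h
            o  = [sigmoid(sum(w * v for w, v in zip(w_o[k], hi))) for k in range(n_out)]

            # error terms:  δk = ok(1 − ok)(tk − ok),   δh = oh(1 − oh) Σk w_kh δk
            d_o = [o[k] * (1 - o[k]) * (t[k] - o[k]) for k in range(n_out)]
            d_h = [h[j] * (1 - h[j]) * sum(w_o[k][j + 1] * d_o[k] for k in range(n_out))
                   for j in range(n_hidden)]

            # weight updates with momentum:  Δw(n) = η·δ·x + α·Δw(n−1)
            for k in range(n_out):
                for j in range(n_hidden + 1):
                    dw_o[k][j] = eta * d_o[k] * hi[j] + alpha * dw_o[k][j]
                    w_o[k][j] += dw_o[k][j]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    dw_h[j][i] = eta * d_h[j] * xi[i] + alpha * dw_h[j][i]
                    w_h[j][i] += dw_h[j][i]
    return w_h, w_o

# Illustrative usage: learn XOR (a different seed or hidden-layer size may be needed,
# since backpropagation can get stuck in a local minimum).
xor = [((0, 0), (0,)), ((0, 1), (1,)), ((1, 0), (1,)), ((1, 1), (0,))]
train_backprop(xor, n_in=2, n_hidden=3, n_out=1)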
Deep Learning

Principal component
Analysis (PCA)

V. Tree Based Models
& Clustering Paradigm

Basics of classification tree
and random forests

Decision Tree

Decision Tree Induction: An Example
 Training data set: Buys_computer
 The data set follows an example of Quinlan's ID3 (Playing Tennis)

   age     income   student   credit_rating   buys_computer
   <=30    high     no        fair            no
   <=30    high     no        excellent       no
   31…40   high     no        fair            yes
   >40     medium   no        fair            yes
   >40     low      yes       fair            yes
   >40     low      yes       excellent       no
   31…40   low      yes       excellent       yes
   <=30    medium   no        fair            no
   <=30    low      yes       fair            yes
   >40     medium   yes       fair            yes
   <=30    medium   yes       excellent       yes
   31…40   medium   no        excellent       yes
   31…40   high     yes       fair            yes
   >40     medium   no        excellent       no

 Resulting tree:
   age?
     <=30   → student?        (no → no, yes → yes)
     31..40 → yes
     >40    → credit rating?  (excellent → no, fair → yes)
Algorithm for Decision Tree Induction
• Basic algorithm (a greedy algorithm)
• Tree is constructed in a top-down recursive divide-and-
conquer manner
• At start, all the training examples are at the root
• Attributes are categorical (if continuous-valued, they are
discretized in advance)
• Examples are partitioned recursively based on selected
attributes
• Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
• Conditions for stopping partitioning
• All samples for a given node belong to the same class
• There are no remaining attributes for further partitioning
– majority voting is employed for classifying the leaf
• There are no samples left
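The loop above can be written compactly. Below is a hedged sketch of ID3-style induction with information gain; the attribute-index/label representation and function names are my own, not something the slides prescribe.

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Expected reduction in entropy obtained by splitting on attribute index attr."""
    remainder = 0.0
    for value in set(r[attr] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, attrs):
    if len(set(labels)) == 1:                        # all samples belong to the same class
        return labels[0]
    if not attrs:                                    # no attributes left: majority voting
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    tree = {}
    for value in set(r[best] for r in rows):         # partition on the selected attribute
        idx = [i for i, r in enumerate(rows) if r[best] == value]
        tree[(best, value)] = build_tree([rows[i] for i in idx],
                                         [labels[i] for i in idx],
                                         [a for a in attrs if a != best])
    return tree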
Basic algorithm for inducing a decision tree from training tuples.

Brief Review of Entropy

Attribute Selection Measure: Information Gain
(ID3/C4.5)
 Select the attribute with the highest information gain
 Let pi be the probability that an arbitrary tuple in D belongs to
class Ci, estimated by |Ci, D|/|D|
 Expected information (entropy) needed to classify a tuple in D:
    Info(D) = − Σ_{i=1}^{m} pi · log2(pi)
 Information needed (after using A to split D into v partitions) to classify D:
    Info_A(D) = Σ_{j=1}^{v} (|Dj| / |D|) · Info(Dj)
 Information gained by branching on attribute A:
    Gain(A) = Info(D) − Info_A(D)
Attribute Selection: Information Gain
 Class P: buys_computer = "yes" (9 tuples);  Class N: buys_computer = "no" (5 tuples)

    Info(D) = I(9,5) = −(9/14)·log2(9/14) − (5/14)·log2(5/14) = 0.940

 Splitting on age (counts taken from the training data shown earlier):

    age     pi   ni   I(pi, ni)
    <=30    2    3    0.971
    31…40   4    0    0
    >40     3    2    0.971

    Info_age(D) = (5/14)·I(2,3) + (4/14)·I(4,0) + (5/14)·I(3,2) = 0.694

    Here (5/14)·I(2,3) means "age <=30" has 5 out of 14 samples, with 2 yes'es and 3 no's. Hence

    Gain(age) = Info(D) − Info_age(D) = 0.246

 Similarly,
    Gain(income) = 0.029
    Gain(student) = 0.151
    Gain(credit_rating) = 0.048
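The gains above can be reproduced with a short script; the only assumption is the tuple-of-strings encoding of the 14 training rows transcribed from the earlier slide.

import math
from collections import Counter

# (age, income, student, credit_rating, buys_computer)
D = [("<=30","high","no","fair","no"), ("<=30","high","no","excellent","no"),
     ("31..40","high","no","fair","yes"), (">40","medium","no","fair","yes"),
     (">40","low","yes","fair","yes"), (">40","low","yes","excellent","no"),
     ("31..40","low","yes","excellent","yes"), ("<=30","medium","no","fair","no"),
     ("<=30","low","yes","fair","yes"), (">40","medium","yes","fair","yes"),
     ("<=30","medium","yes","excellent","yes"), ("31..40","medium","no","excellent","yes"),
     ("31..40","high","yes","fair","yes"), (">40","medium","no","excellent","no")]

def info(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(attr_index):
    labels = [row[-1] for row in D]
    expected = 0.0
    for v in set(row[attr_index] for row in D):
        part = [row[-1] for row in D if row[attr_index] == v]
        expected += len(part) / len(D) * info(part)   # Info_A(D)
    return info(labels) - expected                    # Gain(A) = Info(D) − Info_A(D)

for name, i in [("age", 0), ("income", 1), ("student", 2), ("credit_rating", 3)]:
    print(name, round(gain(i), 3))    # ≈ 0.246, 0.029, 0.151, 0.048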
Computing Information-Gain for Continuous-
Valued Attributes
• Let attribute A be a continuous-valued attribute
• Must determine the best split point for A
• Sort the value A in increasing order
• Typically, the midpoint between each pair of adjacent values is
considered as a possible split point
• (ai+ai+1)/2 is the midpoint between the values of ai and ai+1
• The point with the minimum expected information requirement
for A is selected as the split-point for A
• Split:
• D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the
set of tuples in D satisfying A > split-point
Gain Ratio for Attribute Selection (C4.5)
• Information gain measure is biased towards attributes with a
large number of values
• C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (normalization to information gain)
    SplitInfo_A(D) = − Σ_{j=1}^{v} (|Dj| / |D|) · log2(|Dj| / |D|)
• GainRatio(A) = Gain(A)/SplitInfo(A)
• Ex. SplitInfo_income(D) = −(4/14)·log2(4/14) − (6/14)·log2(6/14) − (4/14)·log2(4/14) = 1.557
  • gain_ratio(income) = Gain(income) / SplitInfo_income(D) = 0.029/1.557 = 0.019


• The attribute with the maximum gain ratio is selected as the
splitting attribute
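A small sketch of that computation, assuming the Gain(income) = 0.029 value above and the 4 low / 6 medium / 4 high split of the 14 training tuples.

import math

def split_info(partition_sizes):
    total = sum(partition_sizes)
    return -sum(s / total * math.log2(s / total) for s in partition_sizes)

si = split_info([4, 6, 4])        # SplitInfo_income(D) ≈ 1.557
print(round(0.029 / si, 3))       # gain_ratio(income) ≈ 0.019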
Gini Index (CART, IBM IntelligentMiner)
• If a data set D contains examples from n classes, the gini index gini(D) is defined as
    gini(D) = 1 − Σ_{j=1}^{n} pj²
  where pj is the relative frequency of class j in D
• If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as
    gini_A(D) = (|D1| / |D|)·gini(D1) + (|D2| / |D|)·gini(D2)
• Reduction in impurity:
    Δgini(A) = gini(D) − gini_A(D)
• The attribute that provides the smallest gini_split(D) (or, equivalently, the largest reduction in impurity) is chosen to split the node (need to enumerate all the possible splitting points for each attribute)
Computation of Gini Index
• Ex. D has 9 tuples in buys_computer = "yes" and 5 in "no":
    gini(D) = 1 − (9/14)² − (5/14)² = 0.459
• Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 tuples in D2: {high}:
    gini_{income ∈ {low,medium}}(D) = (10/14)·Gini(D1) + (4/14)·Gini(D2)
  Gini_{low,high} is 0.458 and Gini_{medium,high} is 0.450. Thus, split on {low,medium} (and {high}) since it has the lowest Gini index
• All attributes are assumed continuous-valued
• May need other tools, e.g., clustering, to get the possible split values
• Can be modified for categorical attributes
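A minimal sketch reproducing the numbers above; the per-partition class counts (7 yes / 3 no for {low, medium}, 2 yes / 2 no for {high}) are read off the buys_computer training table.

def gini(class_counts):
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

print(round(gini([9, 5]), 3))                          # gini(D) ≈ 0.459
g = 10/14 * gini([7, 3]) + 4/14 * gini([2, 2])         # income ∈ {low, medium} vs {high}
print(round(g, 3))                                     # ≈ 0.443, lower than 0.458 and 0.450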
Comparing Attribute Selection Measures

• The three measures, in general, return good results, but:
• Information gain:
• biased towards multivalued attributes
• Gain ratio:
• tends to prefer unbalanced splits in which one partition
is much smaller than the others
• Gini index:
• biased to multivalued attributes
• has difficulty when # of classes is large
• tends to favor tests that result in equal-sized partitions
and purity in both partitions
Ensemble Methods: Increasing the Accuracy

• Ensemble methods
• Use a combination of models to increase
accuracy
• Combine a series of k learned models, M1, M2, …,
Mk, with the aim of creating an improved model
M*
• Popular ensemble methods
• Bagging: averaging the prediction over a
collection of classifiers
• Boosting: weighted vote with a collection of classifiers
Bagging: Bootstrap Aggregation
• Analogy: Diagnosis based on multiple doctors’ majority vote
• Training
• Given a set D of d tuples, at each iteration i, a training set
Di of d tuples is sampled with replacement from D (i.e.,
bootstrap)
• A classifier model Mi is learned for each training set Di
• Classification: classify an unknown sample X
• Each classifier Mi returns its class prediction
• The bagged classifier M* counts the votes and assigns the
class with the most votes to X
• Prediction: can be applied to the prediction of continuous
values by taking the average value of each prediction for a
given test tuple
• Accuracy
• Often significantly better than a single classifier derived from D
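A hedged sketch of bagging; the choice of scikit-learn decision trees as base classifiers, the numpy arrays, and all parameter values are my assumptions, not something the slides prescribe.

import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k=25, seed=0):
    """Train k trees, each on a bootstrap sample of (X, y); X and y are numpy arrays."""
    rng = np.random.default_rng(seed)
    n = len(X)
    models = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)             # sample d tuples with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Each classifier votes; the class with the most votes is assigned."""
    votes = np.array([m.predict(X) for m in models])
    return np.array([Counter(col).most_common(1)[0][0] for col in votes.T])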
Boosting
• Analogy: Consult several doctors, based on a combination of
weighted diagnoses—weight assigned based on the previous
diagnosis accuracy
• How boosting works?
• Weights are assigned to each training tuple
• A series of k classifiers is iteratively learned
• After a classifier Mi is learned, the weights are updated to
allow the subsequent classifier, Mi+1, to pay more
attention to the training tuples that were
misclassified by Mi
• The final M* combines the votes of each individual
classifier, where the weight of each classifier's vote is a
function of its accuracy
• Boosting algorithm can be extended for numeric prediction
• Comparing with bagging: Boosting tends to have greater accuracy, but it also risks overfitting the model to misclassified data
Adaboost (Freund and Schapire, 1997)
• Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
• Initially, all the weights of tuples are set the same (1/d)
• Generate k classifiers in k rounds. At round i,
• Tuples from D are sampled (with replacement) to form a training set D i
of the same size
• Each tuple’s chance of being selected is based on its weight
• A classification model Mi is derived from Di
• Its error rate is calculated using Di as a test set
• If a tuple is misclassified, its weight is increased, o.w. it is decreased
• Error rate: err(Xj) is the misclassification error of tuple Xj. Classifier Mi's error rate is the weighted sum over the misclassified tuples:
    error(Mi) = Σ_{j=1}^{d} wj × err(Xj)
• The weight of classifier Mi's vote is
    log( (1 − error(Mi)) / error(Mi) )
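A simplified AdaBoost-style sketch following the description above, using sample weights instead of literal re-sampling; the scikit-learn decision stump, the ±1 label encoding, and the early-stopping guard are illustrative assumptions.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, k=10):
    """X, y: numpy arrays; y must be encoded as +1 / -1. Returns (models, vote_weights)."""
    n = len(X)
    w = np.full(n, 1.0 / n)                         # initially all tuple weights are 1/d
    models, alphas = [], []
    for _ in range(k):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)            # weighted fit instead of re-sampling
        pred = stump.predict(X)
        miss = (pred != y).astype(float)
        error = float(np.dot(w, miss))              # error(Mi) = Σ wj · err(Xj)
        if error <= 0.0 or error >= 0.5:            # perfect or too-weak classifier: stop
            models.append(stump)
            alphas.append(1.0 if error <= 0.0 else 0.0)
            break
        alpha = np.log((1 - error) / error)         # weight of classifier Mi's vote
        w *= np.exp(alpha * miss)                   # increase weights of misclassified tuples
        w /= w.sum()                                # re-normalize
        models.append(stump)
        alphas.append(alpha)
    return models, alphas

def adaboost_predict(models, alphas, X):
    score = sum(a * m.predict(X) for m, a in zip(models, alphas))
    return np.sign(score)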
Random Forest (Breiman 2001)
• Random Forest:
• Each classifier in the ensemble is a decision tree classifier
and is generated using a random selection of attributes at
each node to determine the split
• During classification, each tree votes and the most popular
class is returned
• Two Methods to construct Random Forest:
• Forest-RI (random input selection): Randomly select, at
each node, F attributes as candidates for the split at the
node. The CART methodology is used to grow the trees to
maximum size
• Forest-RC (random linear combinations): Creates new
attributes (or features) that are a linear combination of the
existing attributes (reduces the correlation between
individual classifiers)
• Comparable in accuracy to Adaboost, but more robust to
errors and outliers
• Insensitive to the number of attributes selected for consideration at each split
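In practice a library implementation of Forest-RI is normally used; a hedged scikit-learn usage sketch (the synthetic data and parameter values are assumptions):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# n_estimators = number of trees; max_features = F attributes considered at each split
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X_tr, y_tr)
print(rf.score(X_te, y_te))     # accuracy on the held-out test set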
Unsupervised Classification

K-means, k-medoids and
linkage methods

Gap statistics

Clustering using Gaussian
mixtures

Model Evaluation and
Selection

Model Evaluation and Selection
• Evaluation metrics: How can we measure accuracy? Other metrics to
consider?
• Use validation test set of class-labeled tuples instead of training set when
assessing accuracy
• Methods for estimating a classifier’s accuracy:
• Holdout method, random subsampling
• Cross-validation
• Bootstrap
• Comparing classifiers:
• Confidence intervals
• Cost-benefit analysis and ROC Curves
Classifier Evaluation Metrics: Confusion
Matrix
Confusion Matrix:
Actual class\Predicted class C1 ¬ C1
C1 True Positives (TP) False Negatives (FN)
¬ C1 False Positives (FP) True Negatives (TN)

Example of Confusion Matrix:


Actual class\Predicted buy_computer buy_computer Total
class = yes = no
buy_computer = yes 6954 46 7000
buy_computer = no 412 2588 3000
Total 7366 2634 10000

• Given m classes, an entry CMi,j in a confusion matrix indicates the # of tuples in class i that were labeled by the classifier as class j
• May have extra rows/columns to provide totals
Classifier Evaluation Metrics: Accuracy, Error
Rate, Sensitivity and Specificity
   A\P    C     ¬C
   C      TP    FN     P
   ¬C     FP    TN     N
          P'    N'     All

• Classifier Accuracy, or recognition rate: percentage of test set tuples that are correctly classified
    Accuracy = (TP + TN) / All
• Error rate: 1 − accuracy, or
    Error rate = (FP + FN) / All

 Class Imbalance Problem:
   One class may be rare, e.g. fraud, or HIV-positive
   Significant majority of the negative class and minority of the positive class
   Sensitivity: True Positive recognition rate = TP/P
   Specificity: True Negative recognition rate = TN/N
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
• Precision: exactness – what % of tuples that the classifier labeled as positive are actually positive
    Precision = TP / (TP + FP)
• Recall: completeness – what % of positive tuples did the classifier label as positive?
    Recall = TP / (TP + FN)
• Perfect score is 1.0
• Inverse relationship between precision & recall
• F measure (F1 or F-score): harmonic mean of precision and recall,
    F = 2 × precision × recall / (precision + recall)
• Fß: weighted measure of precision and recall
  • assigns ß times as much weight to recall as to precision
    Fß = (1 + ß²) × precision × recall / (ß² × precision + recall)

Classifier Evaluation Metrics: Example

Actual class\Predicted class   cancer = yes   cancer = no   Total   Recognition (%)
cancer = yes                   90             210           300     30.00 (sensitivity)
cancer = no                    140            9560          9700    98.56 (specificity)
Total                          230            9770          10000   96.50 (accuracy)
• Precision = 90/230 = 39.13% Recall = 90/300 = 30.00%
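These numbers follow directly from the confusion-matrix counts; a small sketch (not slide code):

TP, FN = 90, 210          # cancer = yes row
FP, TN = 140, 9560        # cancer = no row
P, N = TP + FN, FP + TN

precision   = TP / (TP + FP)               # 90/230    ≈ 0.3913
recall      = TP / (TP + FN)               # 90/300    = 0.30   (= sensitivity)
specificity = TN / N                       # 9560/9700 ≈ 0.9856
accuracy    = (TP + TN) / (P + N)          # 9650/10000 = 0.965
f1          = 2 * precision * recall / (precision + recall)    # ≈ 0.34
print(precision, recall, specificity, accuracy, f1)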

Evaluating Classifier Accuracy:
Holdout & Cross-Validation Methods
• Holdout method
• Given data is randomly partitioned into two
independent sets
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation
• Random sampling: a variation of holdout
• Repeat holdout k times, accuracy = avg. of the accuracies
obtained
• Cross-validation (k-fold, where k = 10 is most popular)
• Randomly partition the data into k mutually exclusive
subsets, each approximately equal size
• At i-th iteration, use Di as test set and others as training
set
• Leave-one-out: k folds where k = # of tuples, for small
sized data
• *Stratified cross-validation*: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data
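A hedged sketch of plain (non-stratified) k-fold cross-validation; the fit/predict classifier interface follows scikit-learn conventions and is an assumption.

import numpy as np

def k_fold_accuracy(model_factory, X, y, k=10, seed=0):
    """Estimate accuracy by k-fold cross-validation.
    model_factory() must return a fresh classifier whose fit() returns the fitted model."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)     # k mutually exclusive subsets
    accs = []
    for i in range(k):
        test = folds[i]                                    # fold i is the test set
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = model_factory().fit(X[train], y[train])
        accs.append(np.mean(model.predict(X[test]) == y[test]))
    return float(np.mean(accs))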
Evaluating Classifier Accuracy: Bootstrap
• Bootstrap
• Works well with small data sets
• Samples the given training tuples uniformly with replacement
• i.e., each time a tuple is selected, it is equally likely to be selected
again and re-added to the training set
• Several bootstrap methods exist, and a common one is the .632 bootstrap
• A data set with d tuples is sampled d times, with replacement, resulting in a
training set of d samples. The data tuples that did not make it into the
training set end up forming the test set. About 63.2% of the original data
end up in the bootstrap, and the remaining 36.8% form the test set (since (1 − 1/d)^d ≈ e^(−1) = 0.368)
• Repeat the sampling procedure k times; the overall accuracy of the model is
    Acc(M) = Σ_{i=1}^{k} ( 0.632 × Acc(Mi)_test_set + 0.368 × Acc(Mi)_train_set )
Estimating Confidence Intervals:
Classifier Models M1 vs. M2

• Suppose we have 2 classifiers, M1 and M2, which one is better?

• Use 10-fold cross-validation to obtain the mean error rates err(M1) and err(M2)

• These mean error rates are just estimates of error on the true population
of future data cases

• What if the difference between the 2 error rates is just attributed to


chance?
• Use a test of statistical significance

• Obtain confidence limits for our error estimates


Estimating Confidence Intervals:
Null Hypothesis
• Perform 10-fold cross-validation
• Assume samples follow a t distribution with k–1 degrees of freedom
(here, k=10)
• Use t-test (or Student’s t-test)

• Null Hypothesis: M1 & M2 are the same

• If we can reject null hypothesis, then


• we conclude that the difference between M 1 & M2 is statistically
significant
• Chose model with lower error rate
Estimating Confidence Intervals: t-test

• If only 1 test set available: pairwise comparison
  • For the ith round of 10-fold cross-validation, the same cross partitioning is used to obtain err(M1)i and err(M2)i
  • Average over the 10 rounds to get the mean error rates err(M1) and err(M2)
  • The t-test computes the t-statistic with k−1 degrees of freedom:
        t = ( err(M1) − err(M2) ) / sqrt( var(M1 − M2) / k )
    where
        var(M1 − M2) = (1/k) · Σ_{i=1}^{k} [ err(M1)i − err(M2)i − ( err(M1) − err(M2) ) ]²
• If two test sets available: use the non-paired t-test
        t = ( err(M1) − err(M2) ) / sqrt( var(M1)/k1 + var(M2)/k2 )
  where k1 & k2 are the # of cross-validation samples used for M1 & M2, respectively.
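A minimal sketch of the paired computation above; err_m1 and err_m2 hold the per-fold error rates obtained from the same 10-fold partitioning.

import math

def paired_t_statistic(err_m1, err_m2):
    """t-statistic with k−1 degrees of freedom for the paired comparison of two models."""
    k = len(err_m1)
    diffs = [e1 - e2 for e1, e2 in zip(err_m1, err_m2)]
    mean_diff = sum(diffs) / k
    var_diff = sum((d - mean_diff) ** 2 for d in diffs) / k    # var(M1 − M2)
    return mean_diff / math.sqrt(var_diff / k)                 # undefined if var_diff == 0

# Compare |t| against the t-table value for k−1 = 9 degrees of freedom at z = sig/2.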
Estimating Confidence Intervals:
Table for t-distribution

• Symmetric
• Significance level, e.g., sig = 0.05 or 5%, means M1 & M2 are significantly different for 95% of the population
• Confidence limit, z = sig/2
Estimating Confidence Intervals:
Statistical Significance

• Are M1 & M2 significantly different?


• Compute t. Select significance level (e.g. sig = 5%)
• Consult table for t-distribution: Find t value corresponding to k-
1 degrees of freedom (here, 9)
• t-distribution is symmetric: typically, upper % points of
distribution shown → look up value for confidence limit
z=sig/2 (here, 0.025)
• If t > z or t < -z, then t value lies in rejection region:
• Reject null hypothesis that mean error rates of M1 & M2 are same
• Conclude: statistically significant difference between M1 & M2
• Otherwise, conclude that any difference is chance

Model Selection: ROC Curves
• ROC (Receiver Operating Characteristics) curves: for visual comparison of classification models
• Originated from signal detection theory
• Shows the trade-off between the true positive rate and the false positive rate
• The area under the ROC curve is a measure of the accuracy of the model
• Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list
• The closer to the diagonal line (i.e., the closer the area is to 0.5), the less accurate is the model
 Vertical axis represents the true positive rate
 Horizontal axis represents the false positive rate
 The plot also shows a diagonal line
 A model with perfect accuracy will have an area of 1.0
Issues Affecting Model Selection
• Accuracy
• classifier accuracy: predicting class label
• Speed
• time to construct the model (training time)
• time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
• understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules
Techniques to Improve
Classification Accuracy:
Ensemble Methods

Classification of Class-Imbalanced Data Sets

• Class-imbalance problem: rare positive examples but numerous negative ones, e.g., medical diagnosis, fraud, oil-spill, fault, etc.
• Traditional methods assume a balanced distribution of
classes and equal error costs: not suitable for class-
imbalanced data
• Typical methods for imbalance data in 2-class classification:
• Oversampling: re-sampling of data from positive class
• Under-sampling: randomly eliminate tuples from
negative class
• Threshold-moving: moves the decision threshold, t, so
that the rare class tuples are easier to classify, and
hence, less chance of costly false negative errors
• Ensemble techniques: Ensemble multiple classifiers
introduced above
• Still difficult for class imbalance problem on multiclass tasks
Issues: Evaluating Classification Methods

• Accuracy
• classifier accuracy: predicting class label
• predictor accuracy: guessing value of predicted attributes
• Speed
• time to construct the model (training time)
• time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
• understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
Predictor Error Measures
• Measure predictor accuracy: measure how far off the predicted value is from the
actual known value
• Loss function: measures the error between yi and the predicted value yi’
• Absolute error: | yi – yi’|
• Squared error: (yi – yi')²
• Test error (generalization error): the average loss over the test set
  • Mean absolute error:  MAE = (1/d) · Σ_{i=1}^{d} | yi − yi' |
  • Mean squared error:   MSE = (1/d) · Σ_{i=1}^{d} ( yi − yi' )²
  • Relative absolute error:  RAE = Σ_{i=1}^{d} | yi − yi' |  /  Σ_{i=1}^{d} | yi − ȳ |
  • Relative squared error:   RSE = Σ_{i=1}^{d} ( yi − yi' )²  /  Σ_{i=1}^{d} ( yi − ȳ )²
• The mean squared error exaggerates the presence of outliers
• Popularly use the (square) root mean squared error and, similarly, the root relative squared error
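A short sketch computing the measures above for parallel lists of actual and predicted values (an illustrative helper, not slide code):

import math

def error_measures(y_true, y_pred):
    d = len(y_true)
    y_bar = sum(y_true) / d
    abs_err = [abs(a - p) for a, p in zip(y_true, y_pred)]
    sq_err  = [(a - p) ** 2 for a, p in zip(y_true, y_pred)]
    mae = sum(abs_err) / d
    mse = sum(sq_err) / d
    rae = sum(abs_err) / sum(abs(a - y_bar) for a in y_true)     # relative absolute error
    rse = sum(sq_err)  / sum((a - y_bar) ** 2 for a in y_true)   # relative squared error
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse), "RAE": rae, "RSE": rse}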
Classification: A Mathematical Mapping
• Classification: predicts categorical class labels
• E.g., Personal homepage classification
• xi = (x1, x2, x3, …), yi = +1 or –1
  • x1: # of word "homepage"
  • x2: # of word "welcome"
• Mathematically, x ∈ X = ℝⁿ, y ∈ Y = {+1, –1}
• We want to derive a function f: X → Y
• Linear Classification
  • Binary Classification problem
  • Data above the red line (in the accompanying scatter plot) belongs to class 'x'
  • Data below the red line belongs to class 'o'
  • Examples: SVM, Perceptron, Probabilistic Classifiers
Discriminative Classifiers

• Advantages
• Prediction accuracy is generally high
• As compared to Bayesian methods – in general
• Robust, works when training examples contain errors
• Fast evaluation of the learned target function
• Bayesian networks are normally slow
• Criticism
• Long training time
• Difficult to understand the learned function (weights)
• Bayesian networks can be used easily for pattern discovery
• Not easy to incorporate domain knowledge
• Easy in the form of priors on the data or distributions

