LDA 01 Linear Discriminant Analysis

Linear discriminant analysis (LDA) is a technique used for classification and feature selection. It works by finding the linear combination of features that best separates two or more classes of objects. LDA finds directions, called discriminant functions, that maximize the separation between the classes. It assumes normally distributed data and equal class covariance matrices. LDA is commonly used for binary classification problems and extends naturally to problems with more than two classes. It is generally preferred over logistic regression when the normality assumption holds.

Linear discriminant analysis

• Need for a classification model
• An introduction to the classification setting through a case study
• Formal definition of LDA
• LDA for feature selection – using R
• LDA for classification – using R
• Mahalanobis distance, the linear discriminant function & Bayes theorem
• LDA vs PCA
• Industrial usage of LDA
• Handling special situations
Need for a classification model
BUSINESS SCENARIO
Business scenario – need for a model?
• Say there are 100,000 prospects
• Say 1,000 of them take up the product
• Business is unhappy with such a poor response rate
• Think of it – if $2 is the cost of a mailer, then mailing all 100,000 prospects costs $200,000 for 1,000 new customers, i.e. $200 per customer acquisition, right?
• Can we find a base where, by working on a smaller number of prospects (say 20,000), we can still get almost all the responders (say 900)?
• Note – there is no possibility of an exact match in real-life scenarios
• There is also only a very rare possibility of getting all the responders by working on part of the population
• The target is to get almost all the responders by working on only a small portion of the population
So the target is …
• To get almost all the responders by working on only part of the population
• Split the population (N, containing K responders) into two parts:
  – X% of the population N, containing Y% of all responders K, with Y > X
  – the remaining (1 − X)% of the population, containing (1 − Y)% of the responders
• Note the RGB concept:
  ✓ Green – the benchmark response rate
  ✓ Red – a higher response rate
  ✓ Blue – a lower response rate
• Work on the red / blue sections – the higher-response / lower-response parts of the population
Purpose of
LDA
Two purposes
• One: independent-variable selection (also called feature selection)
  – At times there are 1000s of independent variables, and one needs a quick selection of variables to make the problem manageable
  – A highly recommended technique for numeric variables, as it is computationally very efficient
  – One can then proceed to other techniques – like logistic regression / decision trees
• Two: classification of data
  – From k predictor variables, can we say what the value of the dependent variable will be?
  – A very useful technique for a binary dependent variable
  – A preferred technique for more than two classes
A classical case
OWNERS VS NON-OWNERS
Income and lot size

• A riding-mower manufacturer is interested in knowing who is likely to buy a riding mower and who is not
• The data shows income and lot size for owners and non-owners
Income and lot size

• Is it possible to find a line here that separates owners from non-owners?
• What is the use of the line?
• One can say: if the equation gives a value higher than the threshold line, then owner, else non-owner.
Income and lot size

• What is the misclassification here?
• What is the misclassification error rate?
• 3/24
Income and lot size

• What will be the quality of the best line that separates the two types?
• The one that gives the least misclassification error
• How can you simplify the process of checking "higher than the threshold value"?
Usage simplification
• Generate a score for each category
• The category with the highest score gives the predicted category
• Can be extended to more than two classes
• E.g. equipment buyers, entertainment buyers, and both-buyer customers, as in the sketch below

[Figure: bar chart of class scores per person – Person 01: both; Person 02: equipment; Person 03: entertainment; Person 04: equipment]
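A minimal sketch of this decision rule in R, with purely hypothetical score values for one customer:

    scores <- c(Equipment = 2.1, Entertainment = 1.4, Both = 3.0)  # hypothetical class scores
    names(which.max(scores))  # assign the class with the highest score -> "Both"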
Formal definition of
LINEAR DISCRIMINANT ANALYSIS
Linear discriminant analysis
• Is all about using numeric data to decide on a categorical outcome.
• It is very useful when it is reasonable to believe that the independent variables are normally distributed.
• It assumes that the correlation structure among predictors within a class is the same across all classes
  – So x1, x2, x3, …, xn have the same correlations when Y = 0 or Y = 1
  – Extended case: x1, x2, x3, …, xn have the same correlations when Y = "High Risk", Y = "Medium Risk" or Y = "Low Risk"
• It can easily be extended to more than two categories.
• Logistic regression and probit regression are more similar to LDA than ANOVA is, as they also explain a categorical variable by the values of continuous independent variables.
• When it is unreasonable to expect normality of the independent variables, prefer logistic regression over LDA.
Linear discriminant analysis
• If you think about it, from historical data we know
  – The probability of the dependent variable (Y) taking a particular class
  – And the conditional probability of observing X, for a given Y
• Example – suppose you know the distribution of height for each gender of Masters students in a particular zone

[Figure: overlapping height distributions, one per gender]

• Now, if you know the height, can you predict the gender?
• For extreme cases it is easy, but in the overlap region one goes for the group that gives the least misclassification.
• With new data, X is known. It is all about finding the probability of Y given X. This is obtained using Bayes theorem.
[Figure: two candidate directions D1 and D2; P1 and P2 mark the projected group means along D1]

• Consider various directions, such as directions D1 and D2 shown in the figure.
• One way to identify a good linear discriminant function is to choose, among all possible directions, the one with the property that when we project the observations (drop perpendicular lines from them) onto a line in the chosen direction, the group means of the projections (the feet of the perpendiculars, e.g. P1 and P2 in direction D1) are separated by the maximum possible distance.
[Figure: the same directions D1 and D2, plus a threshold line D3]

• Among all possible directions, we choose the one for which the projected group means (e.g. P1 and P2 in direction D1) are separated by the maximum possible distance.
• Which one appears to do the job better, D1 or D2?
• Once you have found the line, you can always use a line perpendicular to it to define the threshold. D3 is such a threshold line here.
When to apply
WHICH TECHNIQUE
Techniques at a glance

• Categorical independent variable, categorical dependent variable:
  chi-square for contingency table / classification tree (a type of decision tree)
• Categorical independent variable, numeric dependent variable:
  ANOVA / dummy-variable regression / regression tree
• Numeric independent variable, categorical dependent variable:
  logistic regression / linear discriminant analysis / probit regression / classification tree / support vector machine / artificial neural network
• Numeric independent variable, numeric dependent variable:
  linear regression / regression tree (a type of decision tree)
LDA for
FEATURE SELECTION
Fisher's ratio

[Figure: all data split into non-responders (Y=0) and responders (Y=1), with a mean and a variance computed for each group]

• Calculate the mean of the variable for each group
• Calculate the variance of the variable for each group
Fisher's ratio
Fisher's ratio is a measure of the (linear) discriminating power of a variable:

  F = (m1 − m2)² / (V1 + V2)

with m1 and m2 being the means of class 1 and class 2, and V1 and V2 the variances.
• Fisher's ratio will always be positive (why?)
• The greater the difference between m1 and m2, the greater the value of Fisher's ratio
• i.e. greater between-group variance means a greater Fisher's ratio
• The smaller V1 and V2 are, the greater the value of Fisher's ratio
• i.e. smaller within-group variance means a greater Fisher's ratio
Fisher's ratio

  F = (m1 − m2)² / (V1 + V2)

with m1 and m2 being the means of class 1 and class 2, and V1 and V2 the variances.

Quiz – which variable / scenario will have the higher Fisher's ratio?

[Figure: pairs of class distributions with different separations and spreads]
Rationale for Fisher's ratio
• What is the rationale behind this ratio? Let's see graphically.

[Figure: two pairs of class distributions with different within-group variances]

✓ Which one shows a clearer distinction between the two groups?
✓ So the smaller the sum of the variances, the clearer the distinction
Rationale for Fisher's ratio

[Figure: two pairs of class distributions with different mean separations]

• The greater the difference between the group means, the clearer the distinction between the two sets
Calculate
FISHER'S RATIO
Fisher's ratio
✓ One can select the variables having the higher Fisher's ratios
✓ This is standardization in some sense (why?)
✓ Steps to calculate Fisher's ratio for data with predictors X1, X2, X3, … and binary Y:
  1. Split the data into the Y=0 and Y=1 subsets
  2. For each variable, compute the group means and variances, then the ratio:

  Var   Mean (Y=0)   Var (Y=0)   Mean (Y=1)   Var (Y=1)   Fisher's ratio
  X1    MX10         VX10        MX11         VX11        (MX11 − MX10)² / (VX10 + VX11)
  X2    MX20         VX20        MX21         VX21        (MX21 − MX20)² / (VX20 + VX21)
  X3    …            …           …            …           …

Demo using R and Excel (a sketch follows below)
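A minimal R sketch of these steps, assuming a data frame df whose columns are numeric predictors plus a 0/1 response column Y (the names df and Y are assumptions):

    fisher_ratio <- function(x, y) {
      m0 <- mean(x[y == 0]); m1 <- mean(x[y == 1])  # group means
      v0 <- var(x[y == 0]);  v1 <- var(x[y == 1])   # group variances
      (m1 - m0)^2 / (v0 + v1)                       # (MX1 - MX0)^2 / (VX0 + VX1)
    }
    ratios <- sapply(df[setdiff(names(df), "Y")], fisher_ratio, y = df$Y)
    sort(ratios, decreasing = TRUE)                 # select the top-ranked variables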
Feel the formula and method

• Avoid this method when more than 20% of observations have missing values for the numeric independent variable, in either the dependent-variable = 0 or = 1 population
• Picking variables with higher Fisher's ratios will
  – maximize the between-group variance
  – minimize the within-group variance
• It is computationally undemanding
  – You are just creating two datasets, for responders and non-responders
  – And then calculating means and variances
  – Statistical procedures are optimized for exactly this
LDA for
CLASSIFICATION
Intuitively

[Figure: scatter of non-responders (Y=0) and responders (Y=1), with a new object to classify]

• Calculate the distance of the new object from the mean of each population
• Assign it to the group to which it is closest
Step by step
1. Calculate the distance of the new object from each center
2. Use a function of the distance (rather than the direct distance) that minimizes the misclassification error
3. Generate a score for each class
4. The score is like the probability of belonging to the class
5. Assign the object to the group for which it has the highest score
How do we calculate
DISTANCE
Euclidean distance

• The distance between objects i and j is given by the Euclidean distance:

  d(i, j) = sqrt(|x_i1 − x_j1|² + |x_i2 − x_j2|² + … + |x_ip − x_jp|²)

• Properties
  – d(i, j) ≥ 0
  – d(i, i) = 0
  – d(i, j) = d(j, i)
  – d(i, j) ≤ d(i, k) + d(k, j)

• In two dimensions, for points (U1, V1) and (U2, V2):

  D = ((U1 − U2)² + (V1 − V2)²)^(1/2)
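A quick check of the formula in R, using the first two records of the table on the next slide (income in $ 000's, lot size in 000's sq ft):

    pts <- rbind(c(75.0, 19.6), c(52.8, 20.8))
    dist(pts)                                  # built-in Euclidean distance
    sqrt((75.0 - 52.8)^2 + (19.6 - 20.8)^2)    # the same value from the formula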
Euclidean distance issue

The same data in three different unit systems:

  A: Income ($ 000's) / Lot size (000's sq ft):
     75.0 / 19.6, 52.8 / 20.8, 64.8 / 17.2, 43.2 / 20.4, 84.0 / 17.6, 49.2 / 17.6
  B: Income ($) / Lot size (000's sq ft):
     75,000 / 19.6, 52,800 / 20.8, 64,800 / 17.2, 43,200 / 20.4, 84,000 / 17.6, 49,200 / 17.6
  C: Income ($ 000's) / Lot size (sq ft):
     75.0 / 19,600, 52.8 / 20,800, 64.8 / 17,200, 43.2 / 20,400, 84.0 / 17,600, 49.2 / 17,600

• Do you think the three tables will give the same Euclidean distances between the objects?
• Table B will compute distances primarily on the basis of income
• Table C will compute distances primarily on the basis of lot size
• So Euclidean distance is affected by scale
• What is the way out?
• Standardization: z = (x − x_mean) / x_std_deviation
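A sketch of the fix in R, computing distances on the raw and the z-standardized versions of table C (the unit mix where lot size dominates):

    d <- data.frame(Income  = c(75.0, 52.8, 64.8, 43.2, 84.0, 49.2),
                    LotSize = c(19600, 20800, 17200, 20400, 17600, 17600))
    dist(d)         # dominated by lot size, the larger-scale variable
    dist(scale(d))  # z-standardization puts both variables on an equal footing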
Euclidean distance issue
• But even after standardization, one more issue remains
• Take a look at the table below:

  Income ($ 000's)   Lot size (000's sq ft)   Saving ($ 000's)
  75.0               20                       27.9
  52.8               21                       23.3
  64.8               17                       28.6
  43.2               20                       19.3
  84.0               18                       34.8
  49.2               18                       23.0

• Do you suspect the variables are a little correlated here?
• If variables are correlated, then aren't you counting the same impact multiple times?
• You need a better method – one that removes the scaling impact as well as the collinearity impact of the variables.
• Mahalanobis distance is that technique.
What is
MAHALANOBIS DISTANCE
Mahalanobis distance
• The Mahalanobis distance measure does the following:
  – it transforms the variables into uncorrelated variables
  – and makes their variances equal to 1,
  – and then calculates the simple Euclidean distance.
• The formula, with x the observation vector, m the mean vector, and S the variance-covariance matrix of the variables (X1, X2, …, Xn), is:

  D² = (x − m)ᵀ S⁻¹ (x − m)
Mahalanobis distance
• Intuitively: take the vector of variables (x1, x2, x3, …, xn), measure its distance from the mean vector, and divide by the covariance matrix.
• Isn't it like multivariate standardization?
  – Univariate standardization: z = (x − m) / s
  – Multivariate standardization: D² = (x − m)ᵀ (var-cov matrix)⁻¹ (x − m)
• Please note the formula measures D², and that is why you divide by the variance, not the standard deviation.
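A minimal sketch using the built-in stats::mahalanobis() on the income / lot size / saving table above; note it returns the squared distance D², matching the remark about dividing by the variance:

    d <- data.frame(Income  = c(75.0, 52.8, 64.8, 43.2, 84.0, 49.2),
                    LotSize = c(20, 21, 17, 20, 18, 18),
                    Saving  = c(27.9, 23.3, 28.6, 19.3, 34.8, 23.0))
    mahalanobis(d, center = colMeans(d), cov = cov(d))  # D^2 of each row from the mean vector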
Impact

[Figure: two charts illustrating the impact]
An example
• Can be seen at
• http://www.jennessent.com/arcview/mahalanobis_description.htm
• The example is about bivariate data with X and Y as variables, where:

  Variable   Mean   Std deviation
  X          500    79.32
  Y          500    79.25

• For an observation with X = 410 and Y = 400, it shows the calculation of the Mahalanobis distance
• Let's see the Mahalanobis calculation
A screen grab

[Screen grab from the linked page, showing the step-by-step Mahalanobis calculation]
Next steps
DISCRIMINANT FUNCTION
Steps
1. Calculate the distance of the new object from each center
2. Use a function of the distance (rather than the direct distance) that minimizes the misclassification error
3. Generate a score for each class
4. The score is like the probability of belonging to the class
5. Assign the object to the group for which it has the highest score
6. We will use R to do all of this

• Discriminant analysis finds a set of linear combinations of the variables whose values are
  ✓ as close as possible within groups and
  ✓ as far apart as possible between groups.
• The linear combinations are called discriminant functions
• For k classes, we need k − 1 discriminant functions (why?)
• Discriminant functions are obtained using Bayes theorem.
Linear discriminant function
• Is expected to reduce overlap
• Which results in a lower misclassification error
• The two main approaches for doing so are:
  – reduce the within-group variance
  – maximize the difference between the means
One such example
http://www.ismll.uni-hildesheim.de/lehre/ml-08w/skript/classification1.pdf

[Figure: two projections of the same data]

• Left (projection onto the thick green line joining the centers of the two populations): almost uniform distribution, higher overlap
• Right (the line joining the means is rotated to a better direction): close to normal distribution, low overlap
Bayes theorem
BY EXAMPLE
A scenario
• A firm has found that the overall probability of a gas leakage is 2% (so 98% no leakage).
• They install a sensor to detect leakage.
• The sensor sounds the alarm 99% of the time when there is a leakage (1% no alarm).
• However, it also sounds the alarm 5% of the time when there is no leakage (95% no alarm).
• If the alarm has sounded, then what is the probability of a leakage?
Calculate
• P(leakage) = 2%, P(alarm | leakage) = 99%, P(alarm | no leakage) = 5%
• Probability of alarm = 0.02 × 0.99 + 0.98 × 0.05 = 0.0693
• Probability of leakage given the alarm has sounded =
  – 0.02 × 0.99 / (0.02 × 0.99 + 0.98 × 0.05)
  – = 0.0198 / 0.0693 = 1/3.5 (or 28.6%)
• So out of every seven alarms, how many will be false alarms?
• Let's discuss intuitively
• Prior & posterior probability – the probability of an outcome before and after the event. Here, prior = 2%, posterior = 28.6%.
• What is the event here?
• What were the prior probability and the posterior probability after an alarm?
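The same arithmetic as a few lines of R, just to verify the numbers:

    p_leak        <- 0.02  # prior probability of leakage
    p_alarm_leak  <- 0.99  # P(alarm | leakage)
    p_alarm_clear <- 0.05  # P(alarm | no leakage)
    p_alarm <- p_leak * p_alarm_leak + (1 - p_leak) * p_alarm_clear  # 0.0693
    p_leak * p_alarm_leak / p_alarm                                  # posterior ~ 0.286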
In equation format
• P(leakage) = 2%, P(alarm | leakage) = 99%, P(alarm | no leakage) = 5%
• Probability of alarm = 0.02 × 0.99 + 0.98 × 0.05 = 0.0693
• Probability of leakage given the alarm:

  P(leakage (L) | alarm (A)): P(L|A) = P(A|L) · P(L) / (P(A|L) · P(L) + P(A|NL) · P(NL))
In equation format

  P(L|A) = P(A|L) · P(L) / (P(A|L) · P(L) + P(A|NL) · P(NL))

More generally, suppose that B1, B2, B3, …, Bn partition the outcomes of an experiment and that A is another event. For any number k with 1 <= k <= n, we have the formula:

  P(Bk | A) = P(A | Bk) · P(Bk) / Σ_{i=1..n} P(A | Bi) · P(Bi)

[Figure: event A overlapping a partition B1, B2, B3, B4]
Demo for two-class LDA
USING R
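A minimal sketch of such a demo with MASS::lda (the standard R function); the data frame mowers with columns Income, LotSize and a factor Ownership is an assumed stand-in for the riding-mower data:

    library(MASS)
    fit  <- lda(Ownership ~ Income + LotSize, data = mowers)  # fit the discriminant function
    pred <- predict(fit, mowers)
    table(Predicted = pred$class, Actual = mowers$Ownership)  # misclassification table
    head(pred$posterior)                                      # per-class scores (posterior probabilities)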
What is the difference between
LDA AND PCA
Quiz
• What is the difference between LDA & PCA?
  – PCA works on the Xs alone
  – LDA works on the Xs with respect to Y

http://stackoverflow.com/questions/33576963/dimensions-reduction-in-matlab-using-pca
Quiz
• What is the difference between LDA & PCA?

  LDA                                         PCA
  Discovers the relationship between the      Discovers relationships among the
  dependent & independent variables           independent variables
  Used for variable reduction based on the    Used for reducing variables based on the
  strength of the relationship between the    collinearity of the independent variables
  independent and dependent variables
  Used for prediction of classes              –
  Finds the direction that maximizes the      Finds the direction that maximizes the
  difference between the two classes          variance in the data
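A side-by-side sketch on the built-in iris data: prcomp() never sees the class labels, while MASS::lda() is driven by them:

    library(MASS)
    pca <- prcomp(iris[, 1:4], scale. = TRUE)  # directions of maximum variance (Species unused)
    lda_fit <- lda(Species ~ ., data = iris)   # directions of maximum class separation
    pca$rotation[, 1]                          # loadings of the first principal component
    lda_fit$scaling[, 1]                       # coefficients of the first linear discriminant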
Extension of LDA
TO MORE THAN TWO CLASSES
IRIS data
Data details

  setosa   versicolor   virginica
  50       50           50

[Figure: scatter plot of my_iris$Petal.Width vs my_iris$Petal.Length, by species]

  Species      mean_pet_length   mean_pet_Width
  1 setosa     1.462             0.246
  2 versicolor 4.260             1.326
  3 virginica  5.552             2.026
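A sketch of the three-class extension on this data; with k = 3 species, lda() returns k − 1 = 2 discriminant functions:

    library(MASS)
    fit <- lda(Species ~ Petal.Length + Petal.Width, data = iris)
    fit$scaling                                            # two discriminant functions (LD1, LD2)
    pred <- predict(fit)
    table(Predicted = pred$class, Actual = iris$Species)   # confusion matrix for the 3 classes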


Industrial use of LDA
CLASSIFICATION – SAME APPLIES TO CA, LOGISTIC ETC.
Analytical tool for reducing attrition
• LDA / classification tree / logistic regression is a very handy tool in this respect
• Note: the objective function is binary here, and the tool is generic in nature
• It can help the business know which profiles have a high probability of attrition
• So that efforts can be made to keep those customers
Analytical tool for cross-sell
• LDA / classification tree / logistic regression is a very handy tool in this respect
• Same chart, but the objective function is different
• The tool is generic and has wide usage
• It can help the business know which profiles have a high probability of taking the cross-sell product
• So that effort can be optimized for better gains
Campaigns
• Campaigns are an important marketing tool
• Here you give a targeted, i.e. customer-specific, offer
• And track the results to know where you gained a good response
• Classification methods can be used to predict the profiles where we can get the best response.
Handling special cases
IN LDA
For biased sampling
• If the frequency of the dependent variable in the sample does not reflect reality,
• then one needs to incorporate the prior (real) probabilities of class membership:
  – Add log(pj) to the classification function for class j
  – where pj is the probability that a case belongs to class j
• Example
  – Say the sample data contains 50% owners and 50% non-owners
  – However, the actual population has 20% owners and 80% non-owners; then adjust the scores with ln(0.8) and ln(0.2):

    posterior.N   posterior.Y   adjusted score.N   adjusted score.Y
    0.152         0.848         ln(0.8) + 0.152    ln(0.2) + 0.848

  – If needed, one can derive new_posterior.N = posterior.N / (posterior.N + posterior.Y) so the adjusted values again sum to 1
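In R, the cleanest route is the prior argument of MASS::lda, which performs exactly this adjustment; a sketch reusing the assumed mowers data frame (priors must follow the order of the factor levels, here taken as non-owner then owner):

    library(MASS)
    # trained on a 50/50 sample, but scored with the real 80/20 class priors
    fit <- lda(Ownership ~ Income + LotSize, data = mowers, prior = c(0.8, 0.2))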
For unequal misclassification costs
• At times, misclassification errors have different costs
• Example – better to double-check a possible cancer patient than to let one go when in doubt
• One needs to incorporate the costs of classification errors:
  – Add log(Cj) to the classification function for class j
  – where Cj is the misclassification cost for class j
  – If the Cj are not known, then add an assumed ln(C1/C2) to one group and ln(1) = 0 to the other
• Example
  – If you have even a little suspicion of cancer, you would like to be doubly sure
  – Say you want to be 50 times more sure when saying "no cancer, good health" than when saying "yes, may have cancer, a worrisome situation"; then adjust the scores with ln(50):

    posterior.N   posterior.Y   adjusted score.N   adjusted score.Y
    0.88          0.12          0.88               ln(50) + 0.12

  – Here again, new_posterior.N & new_posterior.Y can be derived to make the sum = 1
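A hedged sketch of the cost adjustment, following the slide's recipe; the score values and class names are illustrative only:

    score     <- c(no_cancer = 0.88, cancer = 0.12)           # unadjusted class scores
    score_adj <- score + c(no_cancer = 0, cancer = log(50))   # add ln(50) to the costly class
    names(which.max(score_adj))                               # cost-adjusted classification
    score_adj / sum(score_adj)                                # rescale so the values sum to 1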
