
Unit-2 Feature Engineering

• Concept of Feature, Preprocessing of data: Normalization


and Scaling, Standardization, Managing
• missing values, Introduction to Dimensionality Reduction,
Principal Component Analysis (PCA),
• Feature Extraction: Kernel PCA, Local Binary Pattern.
• Introduction to various Feature Selection Techniques,
Sequential Forward Selection, Sequential Backward Selection.
• Statistical feature engineering: count-based, Length, Mean,
Median, Mode etc. based feature vector creation.
• Multidimensional Scaling, Matrix Factorization Techniques.

1
Concept of Feature Preprocessing of data:
• Feature Engineering is the process of
creating predictive features that can
potentially help Machine Learning models
achieve a desired performance. In most
cases, features will be measurements with
different units and ranges of values.
• Feature scaling also helps decrease the time needed to find
support vectors for SVMs.

2
Concept of Feature Preprocessing of data:
• Having features varying in scale and range
could be an issue when the model we are
trying to build uses distance measures
such as Euclidean Distance. Such models
include K-Nearest Neighbours and K-Means
Clustering. (Euclidean distances are
sensitive to feature magnitude.)
• Principal Component Analysis (PCA) is
also a good example of when feature
scaling is important. PCA requires the
features to be centered at 0.
3
Apple & Strawberry
Concept of Feature Preprocessing of data:
• If we want to get the best-mixed juice, we need to mix all fruits
not by their size but in their right proportion.
• We just need to remember that an apple and a strawberry are
not the same unless we make them comparable in some
context before comparing their attributes.
• Similarly, in many machine learning
algorithms, to bring all features to the
same standing, we need to do scaling so
that one feature doesn't dominate the
model just because of its large magnitude.
5
Concept of Feature Preprocessing of data:
• Some examples of algorithms where feature scaling
matters are:
• K-nearest neighbors (KNN) with a Euclidean
distance measure is sensitive to magnitudes
and hence should be scaled for all features to
weigh in equally.
• K-Means clustering also uses the Euclidean distance
measure, so feature scaling matters here as well.
• Scaling is critical while performing Principal
Component Analysis (PCA). PCA tries to find the
directions of maximum variance, and variance is
high for high-magnitude features, which skews
PCA towards those features.
• (Skew here means the principal directions are pulled
towards the high-magnitude features.)
Concept of Feature Preprocessing of data:
• Thus feature scaling is needed to bring every
feature onto the same footing without giving any
of them upfront importance.
• In the weight/price example discussed later, if we convert
the weight to "Kg," then "Price" becomes dominant instead.
• E.g. gradient descent in neural networks converges
much faster with feature scaling than without it.

7
Concept of Feature Preprocessing of data:
• Feature scaling in machine learning is one
of the most critical steps during the
pre-processing of data before creating a
machine learning model. Scaling can make
the difference between a weak machine
learning model and a strong one.
• The most common techniques of feature
scaling are
• Normalization and Standardization.
8
Normalization
• Normalization is used when we want to
bound our values between two
numbers, typically between [0,1] or
[-1,1].
• It helps us to change the values of
numeric columns in the dataset to a
common scale.
• It is required only when the features of a
machine learning model have
different ranges.
9
Normalization
• Mathematically, we can calculate normalization
with the below formula:
• Xn = (X - Xminimum) / ( Xmaximum - Xminimum)
• Xn = Value of Normalization
• Xmaximum = Maximum value of a feature
• Xminimum = Minimum value of a feature
• To normalize the data, values are shifted and
rescaled so that their range varies between
0 and 1. This technique is also known
as Min-Max scaling.
• In this scaling technique, the feature values
change as follows:
10
Normalization
• Case 1- If the value of X is the minimum value,
the numerator is 0, so the value of
normalization Xn = 0.
• Case 2- If the value of X is the maximum value,
the numerator equals the denominator, so
Xn = 1.
11
Normalization
• Case 3- On the other hand, if the value of
X is neither the maximum nor the minimum,
then the value of normalization will also lie
between 0 and 1.
• Hence, Normalization can be defined as a
scaling method where values are shifted
and rescaled so that they range
between 0 and 1; in other words, it can
be referred to as the Min-Max scaling
technique.
12
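A minimal sketch of Min-Max scaling, shown both by applying the formula above directly with NumPy and with scikit-learn's MinMaxScaler; the small two-feature array is made up for illustration.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrix: two features with very different ranges
X = np.array([[50.0, 1000.0],
              [20.0, 4000.0],
              [35.0, 2500.0]])

# Min-Max scaling by the formula: Xn = (X - Xmin) / (Xmax - Xmin)
X_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# The same transformation with scikit-learn (default feature_range=(0, 1))
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

print(X_manual)
print(X_scaled)   # both results lie in [0, 1], column by column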
Normalization
• Normalization techniques in Machine Learning:
• Min-Max Scaling: This technique is also simply
referred to as scaling; the values end up ranging
between 0 and 1.
• Standardization scaling:
• Standardization scaling is also known as Z-score
normalization, in which values are
centered around the mean with a unit
standard deviation. This means the mean of the
attribute becomes zero and the resultant distribution
has a unit standard deviation.
13
Standardising a variable does not
normalize the distribution of the data.
• centers the mean at 0
• scales the variance at 1
• preserves the shape of the original
distribution
• the minimum and maximum values of the
different variables may vary
• preserves outliers
Scaling to Minimum and Maximum
values — MinMaxScaling
• does not center the mean at 0
• variance varies across variables
• may not preserve the shape of the original
distribution
• the minimum and maximum values are 0 and
1.
• sensitive to outliers
• While Standardization transforms
the data to have zero mean and a
variance of 1, both techniques make our
data unitless.
• A machine learning algorithm works on
numbers and does not know what those
numbers represent. A weight of 10 grams and
a price of 10 dollars represent two completely
different things; this is a no-brainer
for humans, but a model treats both
features the same.
Concept of Feature Preprocessing of data:
• Suppose we have two features, weight
and price, as in the table below. The
"Weight" values cannot be meaningfully
compared with the "Price" values. So the
algorithm implicitly assumes that since
"Weight" > "Price," "Weight" is
more important than "Price."

18
Standardization scaling
• Standard Deviation Definitions
• The standard deviation (SD) is a quantification
that measures the distribution (dispersion) of
the data set relative to its mean.
• It is calculated as the square root of the
variance.
• It is denoted by the lower Greek letter σ
(sigma).
• The greater the standard deviation, the greater the
dispersion.
• The smaller the standard deviation, the greater the
uniformity of the data.
19
Standardization
• Hence, standardization can be expressed
as follows:
• X' = (X - µ) / σ
• Here, µ represents the mean of the feature
values, and σ represents the standard
deviation of the feature values.
• However, unlike Min-Max scaling
technique, feature values are not
restricted to a specific range in the
standardization technique.
20
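A minimal sketch of standardization (Z-score scaling), computed both by the formula X' = (X - µ) / σ and with scikit-learn's StandardScaler; the sample data is the same hypothetical array used above.

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[50.0, 1000.0],
              [20.0, 4000.0],
              [35.0, 2500.0]])

# Z-score by the formula: X' = (X - mean) / std
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

# The same transformation with scikit-learn
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

print(X_std.mean(axis=0))   # approximately 0 for each feature
print(X_std.std(axis=0))    # approximately 1 for each feature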
Standardization scaling
• The standard deviation is the measure of the
variations of all values from the mean.
• Standard deviation is the square root of the sum
of squared deviation from the mean divided by
the number of observations.
• It is the square root of the variance.
• Variance
• It defines how much a random variable differs
from its expected value.
• It is the average of the squares of the differences
between expected and individual value.
• It can never have a negative value.
• It is denoted by σ2.
21
Difference between Normalization and Standardization
• Normalization (Min-Max scaling) rescales values to a fixed
range, typically [0, 1]; it does not center the mean at 0, may
change the shape of the original distribution, and is sensitive
to outliers.
• Standardization (Z-score scaling) centers the mean at 0 with
unit variance; it has no bounding range, preserves the shape
of the original distribution, and preserves outliers.

22
When to use Normalization or
Standardization?
• Normalization is useful when the feature
distribution of the data does not follow a
Gaussian (bell curve) distribution.
Normalization has a bounded range, so if you
have outliers in the data, they will be
affected by normalization.
• It is also useful for distance- and scale-sensitive
techniques such as KNN and artificial
neural networks.
• Hence, it is used when you cannot make
assumptions about the distribution of the data.

23
When to use Normalization or
Standardization?
• Standardization in the machine learning
model is useful when your data follows a
Gaussian distribution. However, this does not
have to be necessarily true.
• Standardization does not necessarily have a
bounding range, so if you have outliers in
your data, they will not be affected by
Standardization.
• Further, it is also useful when the data has
features with different dimensions, and for
techniques such as linear regression, logistic
regression, and linear discriminant analysis.
24
Missing Data Values And How To Handle It

25
Managing Missing Values
Missing Completely At Random (MCAR)-
• Missing values are completely independent of other
data. There is no pattern.
• In the case of MCAR, the data could be missing due to
human error, some system/equipment failure, loss of
sample, or some unsatisfactory technicalities while
recording the values.
• there is no relationship between the missing data and
any other values observed or unobserved
• For Example, suppose in a library there are some
overdue books. Some values of overdue books in the
computer system are missing. The reason might be a
human error like the librarian forgot to type in the
values. So, the missing values of overdue books are not
related to any other variable/data in the system.
26
Managing Missing Values
Missing At Random (MAR)-
• Missing at random (MAR) means there is
some relationship between the missing
data and other values/data.
• Eg. Suppose a poll is taken for overdue
books of a library. Gender and the number
of overdue books are asked in the poll.
Assume that most of the females answer
the poll and men are less likely to answer.
So the reason the data is missing can be
explained by another observed factor, namely gender.
27
Managing Missing Values
• Missing Not At Random (MNAR)
• Missing values depend on the unobserved
data.
• If there is some structure/pattern in missing
data and other observed data can not
explain it, then it is Missing Not At Random
(MNAR).
• For example, suppose the name and the
number of overdue books are asked in the
poll for a library. So most of the people having
no overdue books are likely to answer the poll.
People having more overdue books are less
likely to answer the poll.
28
How To Handle The Missing Data
• Deleting the Missing values- Row/Column
• Imputing the Missing Values- Replacing
With Arbitrary Value, Mean Value, Mode
Value, Median Value
• Impute the Most Frequent Value.

29
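A minimal sketch of these handling strategies with pandas and scikit-learn's SimpleImputer; the small DataFrame and its column names are made up for illustration.

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical dataset with missing values
df = pd.DataFrame({"age": [25, np.nan, 40, 31],
                   "city": ["Pune", "Mumbai", np.nan, "Pune"]})

# 1. Deleting the missing values (rows or columns)
df_drop_rows = df.dropna(axis=0)
df_drop_cols = df.dropna(axis=1)

# 2. Imputing numeric values with the mean (median/constant work the same way)
num_imputer = SimpleImputer(strategy="mean")
df["age"] = num_imputer.fit_transform(df[["age"]]).ravel()

# 3. Imputing categorical values with the most frequent value (mode)
cat_imputer = SimpleImputer(strategy="most_frequent")
df["city"] = cat_imputer.fit_transform(df[["city"]]).ravel()

print(df)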
Introduction to Dimensionality Reduction
• What is Dimensionality Reduction?
• The number of input features, variables,
or columns present in a given dataset is
known as Dimensionality, and the process
to reduce these features is called
dimensionality reduction.
• "It is a way of converting the higher
dimensions dataset into lesser dimensions
dataset ensuring that it provides similar
information."

30
Introduction to Dimensionality Reduction
• It is used in solving classification and
regression problems.
• It is commonly used in the fields that deal
with high-dimensional data, such
as speech recognition, signal processing,
bioinformatics, etc. It can also be used for
data visualization, noise reduction,
cluster analysis, etc.
• A 3-D classification problem can be hard to visualize, whereas a 2-D one can be mapped to a
simple 2 dimensional space, and a 1-D problem to a simple line. The below figure illustrates this
concept, where a 3-D feature space is split into two 1-D feature spaces, and later, if found to be
correlated, the number of features can be reduced even further.

31
32
• There are two components of dimensionality
reduction:
• Feature selection: In this, we try to find a
subset of the original set of variables, or
features, to get a smaller subset which can be
used to model the problem. It usually involves
three ways:
– Filter
– Wrapper
– Embedded
• Feature extraction: This reduces the data in a
high dimensional space to a lower dimensional
space, i.e. a space with a smaller number of dimensions.
34
Methods of Dimensionality Reduction
• The various methods used for dimensionality
reduction include:
• Principal Component Analysis (PCA)
• Linear Discriminant Analysis (LDA)
• Generalized Discriminant Analysis (GDA)
• Dimensionality reduction may be either linear
or non-linear, depending upon the method
used.
• The prime linear method, called Principal
Component Analysis, or PCA, is discussed
below.
35
Benefits of applying Dimensionality Reduction
• By reducing the dimensions of the features,
the space required to store the dataset
also gets reduced.
• Less computation and training time is
required for reduced dimensions of
features.
• Reduced dimensions of features of the
dataset help in visualizing the data quickly.
• It removes the redundant features (if
present) by taking care of multicollinearity.
36
Disadvantages of dimensionality Reduction
• Some data may be lost due to
dimensionality reduction.
• In the PCA dimensionality reduction
technique, the number of principal
components to retain is sometimes
unknown.

37
Feature Selection
• Feature selection is the process of selecting the
subset of the relevant features and leaving out the
irrelevant features present in a dataset to build a
model of high accuracy.
• Three methods are used for the feature selection:
1. Filters Methods
• In this method, the dataset is filtered, and a subset
that contains only the relevant features is taken.
• Some common techniques of filters method are:
• Correlation
• Chi-Square Test
• ANOVA
• Information Gain, etc.

38
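A minimal sketch of a filter-style selection using scikit-learn's SelectKBest with the chi-square criterion listed above; the dataset is scikit-learn's built-in iris data, used purely for illustration.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest chi-square score w.r.t. the target
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)

print(X.shape, "->", X_new.shape)           # (150, 4) -> (150, 2)
print(selector.get_support(indices=True))   # indices of the selected features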
Feature Selection
2. Wrappers Methods
• In this method, a subset of features is fed to the ML
model, and its performance is evaluated.
• The performance decides whether to add or remove
features to increase the accuracy of
the model. This method is more accurate than the
filter method but more complex and computationally
expensive. (A sketch of sequential forward and backward
selection follows this list.)
• Some common techniques of wrapper methods
are:
• Forward Selection
• Backward Selection
• Bi-directional Elimination
39
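A minimal sketch of sequential forward and backward selection using scikit-learn's SequentialFeatureSelector (available from scikit-learn 0.24 onwards); the KNN estimator and iris dataset are illustrative choices.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)

# Sequential Forward Selection: start with no features, greedily add the best one
sfs = SequentialFeatureSelector(knn, n_features_to_select=2,
                                direction="forward", cv=5)
sfs.fit(X, y)
print("Forward selection kept:", sfs.get_support(indices=True))

# Sequential Backward Selection: start with all features, greedily remove the worst one
sbs = SequentialFeatureSelector(knn, n_features_to_select=2,
                                direction="backward", cv=5)
sbs.fit(X, y)
print("Backward selection kept:", sbs.get_support(indices=True))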
Feature Selection
• 3.Embedded Methods: Embedded
methods check the different training
iterations of the machine learning model
and evaluate the importance of each
feature.
• Some common techniques of Embedded
methods are:
• LASSO
• Elastic Net
• Ridge Regression, etc.
40
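A minimal sketch of an embedded approach, using LASSO inside scikit-learn's SelectFromModel so that features whose L1-penalized coefficients shrink to zero are dropped; the alpha value and the diabetes dataset are illustrative choices.

from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

# L1 regularization drives some coefficients exactly to zero
lasso = Lasso(alpha=0.1)
selector = SelectFromModel(lasso)   # keeps features with non-zero coefficients
X_selected = selector.fit_transform(X, y)

print(X.shape, "->", X_selected.shape)
print("Selected feature indices:", selector.get_support(indices=True))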
Feature Extraction:
• Feature extraction is the process of
transforming the space containing many
dimensions into space with fewer dimensions.
• This approach is useful when we want to keep
the whole information but use fewer
resources while processing the information.
• Some common feature extraction techniques
are:
• Principal Component Analysis
• Linear Discriminant Analysis
• Kernel PCA
• Quadratic Discriminant Analysis
41
Common techniques of Dimensionality Reduction
• Principal Component Analysis
• Backward Elimination
• Forward Selection
• Score comparison
• Missing Value Ratio
• Low Variance Filter
• High Correlation Filter
• Random Forest
• Factor Analysis
• Auto-Encoder
42
Principal Component Analysis
• This method was introduced by Karl Pearson. It
works on a condition that while the data in a
higher dimensional space is mapped to data in
a lower dimension space, the variance of the
data in the lower dimensional space should be
maximum.

43
Principal Component Analysis
• Principal Component Analysis is a
statistical process that converts the
observations of correlated features into a
set of linearly uncorrelated features with
the help of orthogonal transformation.
These new transformed features are
called the Principal Components.
• It is one of the popular tools that is used
for exploratory data analysis and
predictive modeling.
44
Principal Component Analysis
• PCA works by considering the variance of
each attribute, because an attribute with
high variance shows a good split between
the classes; in this way it reduces the dimensionality.
• Some real-world applications of PCA
are image processing, movie
recommendation system, optimizing the
power allocation in various
communication channels.

45
Principal Component Analysis
• The PCA algorithm is based on some mathematical
concepts such as:
• Variance and Covariance
• Eigenvalues and Eigenvectors

46
Principal Component Analysis
• Variance refers to the spread of a data set
around its mean value.
• It is calculated by finding the probability-
weighted average of squared deviations from
the expected value.
• While performing market research, variance is
particularly useful when calculating
probabilities of future events.
• Variance is a great way to find all of the
possible values and likelihoods that a random
variable can take within a given range.
47
Principal Component Analysis
• A variance value of zero represents that all of
the values within a data set are identical,
while all variances that are not equal to zero
will come in the form of positive numbers.
• The larger the variance, the more spread in
the data set.
• A large variance means that the numbers in a
set are far from the mean and each other
• A small variance means that the numbers are
closer together in value.
Principal Component Analysis
• The population variance is given by σ² = Σ(X - µ)² / N, where:
• X represents an individual data point,
• µ represents the mean of the data points,
• N represents the total number of data points.
• Note that while calculating a sample variance in order
to estimate a population variance, the denominator of
the variance equation becomes N – 1.
• This removes bias from the estimation, as it prohibits
the researcher from underestimating the population
variance.

49
Principal Component Analysis
An Advantage of Variance
• One of the primary advantages of
variance is that it treats all deviations
from the mean of the data set in the same
way, regardless of direction.
• A Disadvantage of Variance
• it gives added weight to numbers that are
far from the mean, or outliers.
• Squaring these numbers can at times
result in skewed interpretations of the
data set as a whole.
50
Principal Component Analysis
• Covariance refers to the measure of the
directional relationship between two
random variable.
• A positive covariance means that the two
variables at hand are positively related,
and they move in the same direction.
• A negative covariance means that the
variables are inversely related, or that
they move in opposite directions.
51
Principal Component Analysis
• Covariance is given by cov(X, Y) = Σ(X - x̄)(Y - ȳ) / N
(with N - 1 in the denominator for a sample estimate), where:
• X represents the independent variable,
• Y represents the dependent variable,
• N represents the number of data points in the sample,
• x̄ (x-bar) represents the mean of X, and
• ȳ (y-bar) represents the mean of the dependent variable Y.

52
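A small numeric sketch of these quantities with NumPy; the two toy variables are made up, and ddof=1 is passed where a sample (N - 1) estimate is wanted.

import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 5.0, 9.0])

# Population variance (divide by N) and sample variance (divide by N - 1)
print(np.var(x))            # ddof=0 by default
print(np.var(x, ddof=1))

# Standard deviation is the square root of the variance
print(np.std(x, ddof=1), np.sqrt(np.var(x, ddof=1)))

# Covariance matrix; np.cov uses the sample (N - 1) convention by default
print(np.cov(x, y))         # off-diagonal entries are cov(x, y)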
Principal Component Analysis
• Covariance is used to measure variables
that have different units of measurement.
So covariance does not use one
standardized unit of measurement.
• Correlation, on the other hand,
standardizes the measure of
interdependence between two variables
and informs researchers as to how closely
the two variables move together.

53
Principal Component Analysis
• Eigenvalue - the eigenvalue is a scalar that is
used to transform the eigenvector.
• The basic equation is Ax = λx
• The number or scalar value “λ” is an eigenvalue
of A.

54
Principal Component Analysis
• The characteristic matrix can be written as A - λI,
and the characteristic equation is |A - λI| = 0.
• Solving |A - λI| = 0 for the example matrix gives
the two eigenvalues λ = 0 and λ = 4.
55
Principal Component Analysis
• Eigenvector- An eigenvector of a square
matrix is a non-zero vector that, when the
matrix is multiplied by it, is only scaled
(not changed in direction).
• If A is an n×n matrix and λ is an
eigenvalue associated with it,
• then the eigenvector v is defined by the
following relation: Av = λv

56
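A minimal sketch of computing eigenvalues and eigenvectors with NumPy. The original slide's matrix is not shown here, so a made-up 2×2 matrix with the same eigenvalues (0 and 4) quoted above is used for illustration.

import numpy as np

# Illustrative 2x2 matrix whose eigenvalues are 0 and 4
A = np.array([[2.0, 2.0],
              [2.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)   # e.g. [4., 0.] (order may vary)

# Verify the defining relation Av = lambda * v for each eigenpair
for lam, v in zip(eigenvalues, eigenvectors.T):
    print(np.allclose(A @ v, lam * v))   # True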
Principal Component Analysis
Advantages of Dimensionality Reduction
• It helps in data compression, and hence reduced
storage space.
• It reduces computation time.
• It also helps remove redundant features, if any.
Disadvantages of Dimensionality Reduction
• It may lead to some amount of data loss.
• PCA tends to find linear correlations between
variables, which is sometimes undesirable.
• PCA fails in cases where mean and covariance are
not enough to define datasets.
• We may not know how many principal components
to keep; in practice, some rules of thumb are applied.
63
• Some common terms used in PCA algorithm:
• Dimensionality: It is the number of features or variables
present in the given dataset. More easily, it is the number
of columns present in the dataset.
• Correlation: It signifies how strongly two variables are
related to each other, i.e., if one changes, the other
variable also changes. The correlation value ranges
from -1 to +1. Here, -1 occurs if the variables are inversely
proportional to each other, and +1 indicates that the variables
are directly proportional to each other.
• Orthogonal: It means that the variables are not correlated
with each other, and hence the correlation between the pair of
variables is zero.
• Eigenvectors: If there is a square matrix A and a non-zero
vector v, then v is an eigenvector of A if Av is a
scalar multiple of v.
• Covariance Matrix: A matrix containing the covariance
between the pair of variables is called the Covariance
Matrix.
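A minimal sketch of PCA with scikit-learn on the built-in iris dataset; note the standardization step beforehand, since, as discussed earlier, PCA requires the features to be centered and is sensitive to their scale.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

# Standardize first so no feature dominates due to its magnitude
X_std = StandardScaler().fit_transform(X)

# Keep the 2 principal components with the largest variance
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)

print(X.shape, "->", X_pca.shape)      # (150, 4) -> (150, 2)
print(pca.explained_variance_ratio_)   # share of variance captured by each component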
Kernel PCA
• PCA is a linear method, so it works best when the
structure in the dataset is linear (for example, when the
classes are linearly separable). It does an excellent job on
such datasets.
• But if we apply it to non-linear datasets, we might get a
result which is not the optimal dimensionality
reduction.
• Kernel PCA uses a kernel function to project the dataset into a
higher-dimensional feature space, where it becomes linearly
separable. It is similar to the idea behind Support Vector
Machines.
• There are various kernel functions, such as linear, polynomial,
and Gaussian (RBF).
• In the kernel space the two classes are linearly separable,
so ordinary PCA can then be applied there.
Kernel PCA
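A minimal sketch of Kernel PCA with an RBF (Gaussian) kernel on scikit-learn's make_circles data, a classic non-linearly separable example; the gamma value is an illustrative choice.

from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA, PCA

# Two concentric circles: not linearly separable in the original 2-D space
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Ordinary (linear) PCA cannot untangle the circles
X_lin = PCA(n_components=2).fit_transform(X)

# Kernel PCA with an RBF kernel projects the data so the classes become
# (approximately) linearly separable
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10)
X_kpca = kpca.fit_transform(X)

print(X_kpca.shape)   # (400, 2)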
Local Binary Pattern
• The strength of the Local Binary Pattern (LBP) descriptor
is in its ability to pick up tiny differences
in texture and topography, identifying key
features with which we can then differentiate
between images of the same type, with no
painstaking labeling required.
• The goal of LBP is to encode geometric
features of an image by detecting edges,
corners, raised or flat areas and hard lines;
allowing us to generate a feature vector
representation of an image, or group of
images.
Local Binary Pattern
• we can determine the level of similarity
between our target representation and an
unseen image and can calculate the
probability that the image presented is of
the same variety or type as the target image.
• LBP can be split into 4 key steps:
• · Simplification
• · Binarisation
• · PDF (probability density function) calculation
• · Comparison (of the above functions)
• Simplification
Local Binary Pattern
• This is our data preprocessing step. In essence,
this is our first step in dimensionality reduction,
which allows our algorithm to focus purely on the
local differences in luminance, rather than
worrying about any other potential features.
• Therefore, we first convert our image into a single
channel (typically greyscale) representation
• Binarisation
• Next, we calculate the relative local luminance
changes. This allows us to create a local, low
dimensional, binary representation of each pixel
based on luminance.
Local Binary Pattern
• For each comparison, we output a binary value of 0
or 1, dependent on whether the central pixel’s
intensity (scalar value) is greater or less
(respectively) than the comparison pixel.
• This forms a k-bit binary value, which can then be
converted to a base 10 number; forming a new
intensity for that given pixel.
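A minimal sketch of the LBP pipeline using scikit-image: convert to greyscale (simplification), compute the local binary codes (binarisation), then build a normalized histogram (the PDF) that can be compared between images. The image is scikit-image's bundled sample photo, used only for illustration.

import numpy as np
from skimage import color, data
from skimage.feature import local_binary_pattern
from skimage.util import img_as_ubyte

# Simplification: single-channel (greyscale) image
image = img_as_ubyte(color.rgb2gray(data.astronaut()))

# Binarisation: compare each pixel with its P neighbours at radius R
P, R = 8, 1
lbp = local_binary_pattern(image, P, R, method="uniform")

# PDF calculation: normalized histogram of LBP codes = the feature vector
n_bins = int(lbp.max() + 1)
hist, _ = np.histogram(lbp, bins=n_bins, range=(0, n_bins), density=True)

# Comparison step: e.g. Euclidean or chi-square distance between two such histograms
print(hist)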
Statistical feature engineering:
• Feature engineering refers to a process of
selecting & transforming variables/features in
your dataset when creating a predictive
model using machine learning.
• Therefore you have to extract the features from
the raw dataset you have collected before training
your data in machine learning algorithms.
• Feature engineering has two goals:
• Preparing the proper input dataset, compatible
with the machine learning algorithm requirements.
• Improving the performance of machine learning
models.
79
Statistical feature engineering:
• Mean: The "average" number; found by
adding all data points and dividing by the
number of data points.
• Example: The mean of 4, 1,
and 7 is (4 + 1 + 7) / 3 = 12 / 3 = 4.

81
Statistical feature engineering:
• Median: The middle number; found by
ordering all data points and picking out the
one in the middle (or if there are two
middle numbers, taking the mean of those
two numbers).
• Example: The median of 4, 1,
and 7 is 4, because when the numbers
are put in order (1, 4, 7), the
number 4 is in the middle.
82
Statistical feature engineering:
• Mode: The most frequent number—that is,
the number that occurs the highest
number of times.
• Example: The mode of {4, 2, 4, 3, 2, 2}
is 2, because it occurs three times,
which is more than any other number.

83
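A minimal sketch of creating count-, mean-, median- and mode-based features with pandas by aggregating per entity; the transaction-style DataFrame and its column names are made up for illustration.

import pandas as pd

# Hypothetical raw data: one row per transaction
df = pd.DataFrame({
    "customer": ["A", "A", "B", "B", "B", "C"],
    "amount":   [10.0, 30.0, 5.0, 5.0, 20.0, 7.0],
    "category": ["food", "food", "books", "food", "books", "food"],
})

# One feature vector per customer, built from simple statistics
features = df.groupby("customer").agg(
    n_transactions=("amount", "count"),                       # count-based feature
    mean_amount=("amount", "mean"),                           # mean-based feature
    median_amount=("amount", "median"),                       # median-based feature
    mode_category=("category", lambda s: s.mode().iloc[0]),   # mode-based feature
)

print(features)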
Feature Vector Creation.
• A vector is a series of numbers, like
a matrix with one column but multiple
rows, that can often be represented
spatially.
• A feature is a numerical or symbolic
property of an aspect of an object.
• A feature vector is a vector containing
multiple elements about an object.
Putting feature vectors for objects
together can make up a feature space.
84
Statistical feature engineering:
• The features may represent, as a whole, one mere pixel
or an entire image. The granularity depends on what
someone is trying to learn or represent about the
object. You could describe a 3-dimensional shape with
a feature vector indicating its height, width, depth, etc.

85
Statistical feature engineering:
• One simple way to compare the feature
vectors of two objects is to take
the Euclidean distance.
• In image processing, features can be gradient
magnitude, color, grayscale intensity, edges,
areas, and more.
• Feature vectors are particularly popular for
analyses in image processing because of the
convenient way attributes about an image,
like the examples listed, can be compared
numerically once put into feature vectors.
• In speech recognition, features can be sound
lengths, noise level, noise ratios, and more.
86
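A minimal sketch of comparing two feature vectors with the Euclidean distance using NumPy; the two vectors are made-up examples of the height/width/depth description mentioned earlier.

import numpy as np

# Hypothetical feature vectors, e.g. [height, width, depth] of two objects
a = np.array([10.0, 4.0, 2.0])
b = np.array([12.0, 3.0, 2.5])

euclidean = np.linalg.norm(a - b)   # square root of the sum of squared differences
print(euclidean)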
• In spam-fighting initiatives, features are abundant.
They can be IP location, text structure, frequency of
certain words, or certain email headers.
• Feature vectors are used in classification problems,
artificial neural networks, and k-nearest
neighbors algorithms in machine learning.

• A feature vector is an n-dimensional vector of
numerical features that describes some object in
pattern recognition in machine learning.
What is Multidimensional Scaling?
• Multidimensional scaling is a visual
representation of the distances or dissimilarities
between objects in high-dimensional data.
• The “multidimensional” part is due to the
fact that you aren’t limited to two
dimensional graphs or data. Three-
dimensional, four-dimensional and higher
plots are possible.
• Its use in geostatistics helps visually assess
and understand multivariate data in a lower
dimension.
• By reducing the dimensionality of the data
one can observe patterns, gradients, and
clusters that may be helpful in exploratory
data analysis.
• MDS does this by projecting the multivariate
distances between entities to a best-fit
configuration in lower dimensions.
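A minimal sketch of metric MDS with scikit-learn, projecting the 4-dimensional iris measurements to 2 dimensions from their pairwise distances; dissimilarity="euclidean" lets MDS compute those distances itself, and the dataset is an illustrative choice.

from sklearn.datasets import load_iris
from sklearn.manifold import MDS

X, _ = load_iris(return_X_y=True)

# Find a 2-D configuration whose pairwise distances best match the originals
mds = MDS(n_components=2, dissimilarity="euclidean", random_state=0)
X_2d = mds.fit_transform(X)

print(X.shape, "->", X_2d.shape)   # (150, 4) -> (150, 2)
print(mds.stress_)                 # lower stress = better fit of the distances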
