
PRESIDENCY UNIVERSITY

Bengaluru

Module 2
Supervised machine learning algorithms
CONTENT
• Introduction to the Machine Learning (ML) Framework
• Types of ML,
• Types of variables/features used in ML algorithms,
• One-hot encoding,
• Simple Linear Regression,
• Multiple Linear Regression,
• Evaluation metrics for regression model

CONTENT
• Classification models
• Decision Tree algorithms using Entropy and Gini Index as measures
of node impurity,
• Model evaluation metrics for classification algorithms,
• Multi-class classification
• Class Imbalance problem.
• Naïve Bayes Classifiers
• Naive Bayes model for sentiment classification

What is Machine Learning

 Definition
 A Machine Learning system learns from historical data, builds the prediction
models, and whenever it receives new data, predicts the output for it.
 The accuracy of the predicted output depends upon the amount and quality of
the data: a huge amount of accurate data helps to build a better model, which
predicts the output more accurately.
 Importance of machine learning
 Finding hidden patterns and extracting useful information from data
 Solving complex problems and decision making in many fields
(applications)
 Key feature of ML: it is a data-driven technology
 similar to data mining, as it also deals with huge amounts of data
 uses data to detect various patterns in a given dataset
 learns from past data and improves automatically

Applications of machine learning

Lifecycle of machine learning
1. Gathering Data
2. Data preparation
• Data exploration
• Data pre-processing
3. Data Wrangling
4. Data Analysis
5. Train Model
6. Test Model
7. Deployment

Lifecycle of machine learning
Gathering Data
 Identify the different data sources; data can be collected from various
sources such as files, databases, or the internet.
 The quantity and quality of the collected data will determine the accuracy of
the prediction and efficiency of the output.
 This step includes the below tasks:
 Identify various data sources
 Collect data
 Integrate the data obtained from different sources – This coherent set of data
is called dataset

Lifecycle of machine learning
Data preparation: This step can be further divided into two processes:

 Data exploration: To understand the characteristics, format, and
quality of data, and to find correlations, general trends, and outliers for an
effective outcome.
 Data pre-processing: Cleaning of data is required to address the
quality issues: missing values, duplicate data, invalid data, and noise,
which can be solved using filtering techniques.
Data Wrangling
 Reorganizing, mapping and transforming raw, unstructured data into a
usable format.
 This step involves data aggregation and data visualization.

Lifecycle of machine learning
Data Analysis
 The aim of this step is to build a machine learning model to analyze the data and
review the outcome.
Train Model
 Datasets are used to train the model using various machine learning algorithms
– to understand various patterns, rules, and, features.
Test Model
 Tests accuracy of the model with respect to the requirements of project or
problem.
Deployment
 The model's performance is checked with the available data; if it performs well,
it is deployed. This step is similar to making the final report for a project.

Machine learning - dataset

 A dataset is a collection of data in which data is arranged in some order. A
dataset can contain anything from a simple array to a database table.
 Types of data in datasets
o Numerical data: Such as house price, temperature, etc.
o Categorical data: Such as Yes/No, True/False, Blue/green, etc.
o Ordinal data: These data are similar to categorical data but can be
measured on the basis of comparison.
 Types of Datasets
o Training Dataset: This dataset is used to train the model, i.e., it is used to
update the weights of the model.

Contd…
o Validation Dataset
 It is used to verify that an increase in accuracy on the training dataset actually
holds when the model is tested on data that was not used in training.
 If the accuracy over the training dataset increases while the accuracy over the
validation dataset decreases, this is a case of high variance, i.e.,
overfitting.
o Test Dataset
 Often, when we make changes to the model based on the output of the
validation set, we unintentionally let the model peek into the validation set;
as a result, the model might overfit the validation set as well.
 To overcome this issue, we have a test dataset that is only used to test the final
output of the model in order to confirm the accuracy.

Contd…
How to get the datasets / Popular sources for ML dataset
 Kaggle Dataset
 UCI Machine Learning Repository
 Datasets via AWS
 Google's Dataset Search Engine
 Microsoft Dataset
 Awesome Public Dataset Collection
 Government Datasets
 Computer Vision Datasets
 Scikit-learn dataset

Machine learning-
data preprocessing
 Definition: Data pre-processing is a process of preparing the raw data and
making it suitable for a machine learning model.
 Significance
 A real-world data contains noises, missing values, and maybe in an
unusable format which cannot be directly used for machine learning
models.
 Data pre-processing is required tasks for cleaning the data and making it
suitable for a machine learning model.
 Steps
Getting the dataset
 The data is usually stored in a CSV ("Comma-Separated Values") file, a
format for saving tabular data such as spreadsheets. It works well for huge
datasets, and these datasets are easy to use in programs.

Contd…
Importing libraries
o NumPy: used for any type of mathematical operation in the code.
o Matplotlib: used to plot any type of chart in Python.
o Pandas: used for importing and managing the datasets. It is an open-source
data manipulation and analysis library.
Importing datasets
 read_csv() function: used to read a CSV file.
 To separate the matrix of features (independent variables) from the dependent
variable in the dataset, the iloc[] method is used to extract the required rows and
columns.
 To extract the dependent variable, we again use the Pandas iloc[] method.
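A minimal sketch of these steps in Python (the file name data.csv and the layout with the target in the last column are hypothetical, for illustration only):

import pandas as pd

dataset = pd.read_csv("data.csv")    # read the CSV file
X = dataset.iloc[:, :-1].values      # matrix of features: all columns except the last
y = dataset.iloc[:, -1].values       # dependent variable: the last column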

Contd…

 Handling Missing Data
o By deleting the particular row: delete the specific row or column which
contains null values. This way is not very efficient, and removing data may
lead to loss of information, which will not give an accurate output.
o By calculating the mean: calculate the mean of the column or row which
contains missing values and put it in place of each missing value. This strategy
is useful for features which have numeric data such as age, salary, year, etc.
Here, we will use this approach.
o To handle missing values, we will use the Scikit-learn library in our code, which
contains various tools for building machine learning models. Here we will
use the Imputer class of the sklearn.preprocessing library.
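A minimal sketch of mean imputation on toy data (note: recent scikit-learn versions expose this as the SimpleImputer class in sklearn.impute, which replaced the older Imputer):

import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[25.0, 50000.0],
              [30.0, np.nan],
              [np.nan, 60000.0]])    # toy data with missing values

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X = imputer.fit_transform(X)         # each NaN replaced by its column mean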

Contd…
 Encoding Categorical Data
o The LabelEncoder() class from the preprocessing library is used for encoding the
variables into digits.
o Categorical variables usually have strings for their values. Many machine
learning algorithms do not support string values for the input variables.
Therefore, we need to replace these string values with numbers. This process is
called categorical variable encoding.
o Types of encoding:
 One-hot encoding
 Dummy encoding

 One-hot encoding
• In one-hot encoding, we create a new set of dummy (binary) variables equal in
number to the number of categories (k) in the variable.

Contd…
 For example, let’s say we have a categorical variable Color with three categories
called “Red”, “Green” and “Blue”, we need to use three dummy variables to
encode this variable using one-hot encoding. A dummy (binary) variable just
takes the value 0 or 1 to indicate the exclusion or inclusion of a category.

• In one-hot encoding,
 “Red” color is encoded as [1 0 0] vector of size 3.
 “Green” color is encoded as [0 1 0] vector of size 3.
 “Blue” color is encoded as [0 0 1] vector of size 3.

Contd…
 Dummy encoding
 Dummy encoding also uses dummy (binary) variables. Instead of creating a
number of dummy variables that is equal to the number of categories (k) in the
variable, dummy encoding uses k-1 dummy variables.
 To encode the same Color variable with three categories using the dummy
encoding, we need to use only two dummy variables

 In dummy encoding, “Red” color is encoded as a [1 0] vector of size 2, “Green”
color is encoded as [0 1], and “Blue” color is encoded as [0 0].
 Dummy encoding removes the redundant category present in one-hot encoding.
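A minimal sketch of both encodings with pandas (note that get_dummies orders the columns alphabetically, so the bit positions may differ from the example above):

import pandas as pd

df = pd.DataFrame({"Color": ["Red", "Green", "Blue"]})

one_hot = pd.get_dummies(df["Color"])                  # k = 3 dummy variables
dummy = pd.get_dummies(df["Color"], drop_first=True)   # k - 1 = 2 dummy variables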

Contd…
 Splitting dataset into training, validation and test set

 Feature scaling
 Feature scaling is the final step of data pre-processing in machine learning.
 It is a technique to standardize the independent variables of the dataset to a
specific range.
 In feature scaling, we put our variables in the same range and on the same scale
so that no variable dominates another.
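A minimal sketch of splitting and scaling on synthetic data (the 60/20/20 split proportions are illustrative; the scaler is fitted on the training set only, then applied to the other sets):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(40, dtype=float).reshape(20, 2)   # toy features
y = np.arange(20)                               # toy target

# 20% test, then 25% of the remainder as validation -> 60/20/20
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)   # fit on training data only
X_val = scaler.transform(X_val)
X_test = scaler.transform(X_test)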

Feature selection techniques in
Machine Learning

Feature selection
• A feature is an attribute that has an impact on a problem or is useful for the
problem, and choosing the important features for the model is known as feature
selection.
• Definition: Feature selection is a way of selecting the subset of the most relevant
features from the original features set by removing the redundant, irrelevant, or
noisy features.
• Significance of Feature Selection:
 It helps in avoiding the curse of dimensionality.
 It helps in the simplification of the model so that it can be easily interpreted
by the researchers.
 It reduces the training time.
 It reduces overfitting and hence enhances generalization.

feature selection techniques

Supervised feature selection
techniques
• Wrapper Methods
 In wrapper methodology, selection of features is done by considering it as a
search problem, in which different combinations are made, evaluated, and
compared with other combinations.
 It trains the algorithm by using the subset of features iteratively.
 On the basis of the output of the model, features are added or removed,
and the model is trained again with the new feature set.

Contd…
• Filter Methods
 In the filter method, features are selected on the basis of statistical measures.
This method does not depend on the learning algorithm and chooses the
features as a pre-processing step.
 The filter method filters out the irrelevant features and redundant columns
from the model by using different ranking metrics.
 The advantage of filter methods is that they need little computational
time and do not overfit the data.

Contd…
• Embedded Methods
 Embedded methods combine the advantages of both filter and wrapper
methods by considering the interaction of features along with low
computational cost. They are fast processing methods similar to the filter
method, but more accurate.
 These methods are also iterative: they evaluate each training iteration and
optimally find the most important features that contribute the most to
that iteration.

Feature engineering for
machine learning
• Feature engineering is the pre-processing step of machine learning, which
extracts features from raw data.
• Feature engineering in ML contains mainly four processes:
 Feature Creation: finding the most useful variables to be used in a
predictive model.
 Transformations: This step of feature engineering involves adjusting the
predictor variable to improve the accuracy and performance of the model.
 Feature Extraction: Is an automated feature engineering process that
generates new variables by extracting them from the raw data
 Feature Selection: Is a way of selecting the subset of the most relevant
features from the original features set by removing the redundant,
irrelevant, or noisy features

Feature engineering
techniques for ML
• Imputation: responsible for handling irregularities within the dataset, such as
missing values.
• Handling Outliers: standard deviation or Z-score can be used to detect outliers.
• Log Transform: helps in handling skewed data, and it makes the distribution
more approximately normal after transformation.
• Binning: used to normalize the noisy data.
• Feature Split: the process of splitting a feature into two or more parts
to create new features.
• One hot encoding: It is a technique that converts the categorical data in a form
so that they can be easily understood by machine learning algorithms and hence
can make a good prediction.

Machine learning - types

Supervised learning
• Supervised learning is a type of machine learning in which machines are
trained using well-"labelled" training data, and on the basis of that data, machines
predict the output. Labelled data means input data that is already tagged
with the correct output.

Types of Supervised learning
Regression Algorithms
• Are used if there is a relationship between the input variable and the output
variable. Example: Weather forecasting, Market Trends, etc.
• Regression algorithms under supervised learning: Linear Regression, Non-Linear
Regression, Polynomial Regression, Ridge Regression and Lasso Regression.

Classification Algorithms
• Classification algorithms are used when the output variable is categorical,
i.e., takes class values such as Yes-No, Male-Female, True-False, etc.
Example: Spam Filtering.
• Classification algorithms under supervised learning: Random Forest, Decision
Trees, Logistic Regression, Support vector Machines

Important Terminologies
 Dependent Variable: The main factor in Regression analysis which we want to
predict or understand is called the dependent variable. It is also called target
variable.
 Independent Variable: The factors which affect the dependent variables or
which are used to predict the values of the dependent variables are called
independent variable, also called as a predictor.
 Outliers: Outlier is an observation which contains either very low value or very
high value in comparison to other observed values. An outlier may hamper the
result, so it should be avoided.
 Multicollinearity: If the independent variables are highly correlated with each
other, then such a condition is called multicollinearity.
 Overfitting: If our algorithm works well with the training dataset but not well
with test dataset, then such problem is called Overfitting.
 Underfitting: If our algorithm does not perform well even with training
dataset, then such problem is called underfitting.

Unsupervised Machine Learning
 Unsupervised learning is a type of machine learning in which models are
trained using unlabeled dataset and are allowed to act on that data without any
supervision.

 The goal of unsupervised learning is to find the underlying structure of the dataset,
group the data according to similarities, and represent the dataset in a
compressed format.

Unsupervised Machine Learning -
Types
 Types:
 Clustering: Clustering is a method of grouping objects into clusters such
that objects with the most similarities remain in one group and have few or no
similarities with the objects of another group.
 Association: An association rule is an unsupervised learning method which
is used for finding the relationships between variables in the large database.
It determines the set of items that occurs together in the dataset.
 Unsupervised learning algorithms: K-means clustering, Hierarchical clustering,
Anomaly detection, Neural Networks, Principal Component Analysis, Apriori
algorithm
 Advantage of Unsupervised Learning: preferable in many settings, as it is easier
to get unlabeled data than labeled data.
 Disadvantages of Unsupervised Learning: The result might be less accurate as
input data is not labeled, and algorithms do not know the exact output in
advance.

Semi-Supervised Learning
• Semi-Supervised learning is a type of Machine Learning algorithm that lies
between Supervised and Unsupervised machine learning.
• The main aim of semi-supervised learning is to effectively use all the available
data, rather than only labelled data like in supervised learning.
• Advantages: It is highly efficient and is used to solve drawbacks of Supervised
and Unsupervised Learning algorithms.
• Disadvantages
 Iteration results may not be stable.
 We cannot apply these algorithms to network-level data.
 Accuracy is low.

Reinforcement Learning
• Reinforcement learning works on a feedback-based process, in which an AI agent
(a software component) automatically explores its surroundings by hit and trial,
taking actions, learning from experiences, and improving its performance.
• The agent gets rewarded for each good action and punished for each bad action;
hence the goal of a reinforcement learning agent is to maximize the rewards.
• In reinforcement learning, there is no labelled data like supervised learning, and
agents learn from their experiences only.
• Due to its way of working, reinforcement learning is employed in different fields
such as Game Theory, Operations Research, Information Theory, and multi-agent
systems.
• A reinforcement learning problem can be formalized using Markov Decision
Process(MDP).

Reinforcement Learning
• Categories of Reinforcement Learning
 Positive Reinforcement Learning: Specifies increasing the tendency that the
required behavior would occur again by adding something.
 Negative Reinforcement Learning: It increases the tendency that the specific
behavior would occur again by avoiding the negative condition.
• Applications: Robotics, Text Mining, Resource Management, Video Games.
• Advantages
 The learning model of RL is similar to the learning of human beings; hence most
accurate results can be found.
 Helps in achieving long term results.
• Disadvantages
 RL algorithms require huge data and computations.
 Too much reinforcement learning can lead to an overload of states which can
weaken the results.

Linear Regression analysis
• It is a statistical method that is used for predictive analysis.
• Linear regression makes predictions for continuous/real or numeric variables
such as sales, salary, age, product price, etc.
• The linear regression algorithm shows a linear relationship between a dependent (y)
variable and one or more independent (x) variables, hence it is called linear regression.
• Mathematically, we can represent a linear regression as:
y = a0 + a1x + ε

Linear Regression analysis
• Types of Linear Regression
 Simple Linear Regression: If a single independent variable is used to
predict the value of a numerical dependent variable, then such a Linear
Regression algorithm is called Simple Linear Regression.
 Multiple Linear regression: If more than one independent variable is used
to predict the value of a numerical dependent variable, then such a Linear
Regression algorithm is called Multiple Linear Regression.
• Model Performance: R-squared method:
 R-squared is a statistical method that determines the goodness of fit.
 A high value of R-squared indicates a small difference between the
predicted values and actual values, and hence represents a good model.
 It can be calculated from the formula:
R² = 1 - (sum of squared residuals) / (total sum of squares)
   = 1 - Σ(yi - ŷi)² / Σ(yi - ȳ)²

Simple Linear regression
• Models the relationship between a dependent variable and a single independent
variable. The relationship shown by a Simple Linear Regression model is linear or a
sloped straight line.
• Simple Linear regression algorithm has mainly two objectives:
 Model the relationship between the two variables. Eg: Income and expenditure,
experience and Salary, etc.
 Forecasting new observations. Such as Weather forecasting according to
temperature, Revenue of a company according to the investments in a year, etc.
• The Simple Linear Regression model can be represented using the below equation:
y = a0 + a1x + ε
a0 = the intercept of the regression line (can be obtained by putting x = 0)
a1 = the slope of the regression line, which is either increasing or decreasing
ε = the error term (for a good model it will be negligible)
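A minimal sketch of fitting such a model with scikit-learn (the experience/salary numbers are invented for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])             # years of experience
y = np.array([30000, 35000, 41000, 44000, 52000])   # salary

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_[0])   # a0 (intercept) and a1 (slope)
print(model.predict([[6]]))               # forecast a new observation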

Multiple linear regression
• Multiple Linear Regression is one of the important regression algorithms which
models the linear relationship between a single dependent continuous variable
and more than one independent variable.
• For MLR, the dependent or target variable (Y) must be continuous/real, but
the predictor or independent variables may be continuous or categorical.
• Each feature variable must have a linear relationship with the dependent
variable.
• MLR tries to fit a regression line through a multidimensional space of data
points.
• Example: Prediction of CO2 emission based on engine size and number of
cylinders in a car.

Multiple linear regression
• MLR equation:
o In Multiple Linear Regression, the target variable(Y) is a linear combination
of multiple predictor variables x1, x2, x3, ...,xn.
Y = b0 + b1x1 + b2x2 + b3x3 + ... + bnxn
where Y = output/response variable, b0, b1, b2, ..., bn = coefficients of the model,
and x1, x2, x3, ..., xn = the independent/feature variables.

• Assumptions for Multiple Linear Regression:
 A linear relationship should exist between the target and predictor
variables.
 The regression residuals must be normally distributed.
 MLR assumes little or no multicollinearity (correlation between the
independent variables) in the data.
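A minimal sketch of the CO2-emission example (the engine-size, cylinder, and emission values are invented for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[2.0, 4], [2.4, 4], [1.5, 4], [3.5, 6], [3.6, 6]])  # engine size, cylinders
y = np.array([196, 221, 136, 255, 244])                           # CO2 emission

mlr = LinearRegression().fit(X, y)
print(mlr.intercept_, mlr.coef_)   # b0 and [b1, b2]
print(mlr.predict([[3.0, 6]]))     # predict for a new car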

Evaluation metrics for
regression model
• In regression problems, the prediction error is used to define the model
performance. The prediction error is also referred to as residuals and it is defined
as the difference between the actual and predicted values.
• Residuals are important when determining the quality of a model.
• Residual = actual value — predicted value
error(e) = y — ŷ
• We can technically inspect all residuals to judge the model’s accuracy, but this
does not scale if we have thousands or millions of data points. That’s why we
have summary measurements that take our collection of residuals and condense
them into a single value representing our model's predictive ability.

Evaluation metrics for regression
model - Mean Absolute Error (MAE)
• It is the average of the absolute differences between the actual value and the
model’s predicted value:
MAE = (1/N) Σ |Yi - Ŷi|
where N = total number of data points, Yi = actual value, Ŷi = predicted value.


• A small MAE suggests the model is great at prediction, while a large MAE
suggests that your model may have trouble in certain areas. MAE of 0 means that
your model is a perfect predictor of the outputs.
• Advantages of MAE: It is the most robust to outliers.
• Disadvantages of MAE: The MAE loss is not differentiable everywhere, so we
have to apply optimizers, such as sub-gradient variants of gradient descent,
that can handle this.

Evaluation metrics for regression
model – Mean Squared Error
• It is the average of the squared differences between the actual and the predicted
values. The lower the value, the better the regression model:
MSE = (1/n) Σ (yi - ŷi)²
where n = total number of data points, yi = actual value, ŷi = predicted value.


• If you have outliers in the dataset then it penalizes the outliers most and the
calculated MSE is bigger.
• Advantages of MSE - The graph of MSE is differentiable, so you can easily use it
as a loss function.
• Disadvantages of MSE - As noted above, it penalizes outliers heavily; in short,
it is not robust to outliers, which was an advantage of MAE.

Evaluation metrics for regression
model – Root Mean Squared Error
• It is the square root of the average squared difference between the real value and
the predicted value.
• The lower the RMSE value, the better the model is with its predictions.
• A higher RMSE indicates that there are large deviations between the predicted
and actual values:
RMSE = √( (1/n) Σ (yj - ŷj)² )
where n = total number of data points, yj = actual value, ŷj = predicted value.
• Advantages of RMSE: The output value is in the same unit as the required output
variable which makes interpretation of loss easy.
• Disadvantages of RMSE: It is not that robust to outliers as compared to MAE.
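A minimal sketch computing all three metrics with scikit-learn on toy values:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.0])

mae = mean_absolute_error(y_true, y_pred)   # 0.5
mse = mean_squared_error(y_true, y_pred)    # 0.375
rmse = np.sqrt(mse)                         # same unit as the target
print(mae, mse, rmse)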

Evaluation metrics for regression
model – R Squared
• The R2 score is a metric that tells the performance of your model relative to a
baseline, not the loss in an absolute sense.
• With the help of R squared, we have a baseline model to compare against, which
none of the other metrics provides; this is similar to the fixed threshold of 0.5
we use in classification problems.
• Basically, R2 calculates how much better the regression line is than a
simple mean line.

Evaluation metrics for regression
model – R Squared
• Now, how do we interpret the R2 score? If the R2 score is zero, then the
regression line's error divided by the mean line's error equals 1, and 1 - 1 is zero.
• So, in this case, the two lines overlap, meaning the model performance is worst;
it is not capable of explaining any variance of the output column.
• The second case is when the R2 score is 1. This means the division term is
zero, which happens when the regression line does not make any mistake; it is
perfect. In the real world, this is not possible.
• So we can conclude that as our regression line moves towards perfection, the R2
score moves towards one, and the model performance improves.
• The normal case is when the R2 score is between zero and one, like 0.8, which
means your model is capable of explaining 80 per cent of the variance of the data.

Evaluation metrics for regression
model – Adjusted R Squared
• The disadvantage of the R2 score is that when new features are added to the data,
the R2 score either increases or remains constant; it never decreases, because it
assumes that adding more features can only explain more of the variance of the data.
• The problem is that when we add an irrelevant feature to the dataset, R2
sometimes still increases, which is misleading.
• Hence, to control this situation, Adjusted R Squared came into existence.

Evaluation metrics for regression
model – Adjusted R Squared
• Adjusted R2 = 1 - [(1 - R2)(n - 1) / (n - k - 1)], where n is the number of
samples and k the number of features.
• As k increases by adding features, the denominator n - k - 1 decreases, while
n - 1 remains constant.
• If the added feature is irrelevant, the R2 score remains constant or increases only
slightly, so the term (1 - R2)(n - 1)/(n - k - 1) increases, and when we subtract
it from one, the resultant score decreases.
• So this is the case when we add an irrelevant feature to the dataset.
• And if we add a relevant feature, then the R2 score increases and 1 - R2
decreases heavily, outweighing the smaller denominator, so the complete term
decreases, and on subtracting it from one, the score increases.

Thank you

PRESIDENCY UNIVERSITY
Bengaluru

Module 2
Classification models
CONTENT
• Classification models
• Decision Tree algorithms using Entropy and Gini Index as measures of
node impurity,
• Model evaluation metrics for classification algorithms,
• Logistic regression.
• Multi-class classification
• Class Imbalance problem.
• Naïve Bayes Classifiers
• Naive Bayes model for sentiment classification – An Introduction

classification models

What is classification algorithm?
• Classification algorithm is a Supervised Learning technique in which a
program learns from the given dataset or observations and then classifies
new observations into a number of classes or groups, such as Yes or No, 0
or 1, Spam or Not Spam, cat or dog, etc.
• Classes can be called as targets/labels or categories.
• Unlike regression, the output variable of Classification is a category, not a
value, such as "Green or Blue", "fruit or animal”.
• Since Classification algorithm is a Supervised learning technique, it takes
labeled input data, which means it contains input with the corresponding
output.
• In classification algorithm, a discrete output function(y) is mapped to input
variable(x), i.e., y=f(x), where y = categorical output

Types of classifications

• The algorithm which implements the classification on a dataset is known as a
classifier.
• There are two types of Classifications:
⮚ Binary Classifier: If the classification problem has only two possible
outcomes, then it is called as Binary Classifier. Examples: YES or NO,
MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.

⮚ Multi-class Classifier: If a classification problem has more than two possible
outcomes, then it is called a Multi-class Classifier. Examples: classification
of types of crops, classification of types of music.

Learners in classification
problems
1. Lazy Learners
• Stores the training dataset and wait until it receives the test dataset.
• In this case, classification is done on the basis of the most related data stored
in the training dataset.
• It takes less time in training but more time for predictions.
• Example: K-NN algorithm, Case-based reasoning

2. Eager Learners
• Eager Learners develop a classification model based on a training dataset
before receiving a test dataset.
• Unlike Lazy learners, Eager Learner takes more time in learning, and less time
in prediction.
• Example: Decision Trees, Naïve Bayes, ANN.

Types of classification algorithms

• Linear Models
• Logistic Regression
• Support Vector Machines

• Non-linear Models
• K-Nearest Neighbours
• Kernel SVM
• Naïve Bayes
• Decision Tree Classification
• Random Forest Classification

Uses cases of classification
algorithms
• Email Spam Detection
• Speech Recognition
• Identifications of Cancer tumor cells.
• Drugs Classification
• Biometric Identification, etc.

Decision Tree algorithms using
Entropy and Gini Index

What is decision tree?
• A Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems.
• It contains two types of nodes: decision nodes and leaf nodes.
• Decision nodes are used to make decisions and have multiple branches, whereas
leaf nodes are the outputs of those decisions and do not contain any further
branches.
• It is a tree-structured classifier, where internal nodes represent the features of a
dataset, branches represent the decision rules and each leaf node represents
the outcome.
• The decisions or the test are performed on the basis of features of the given dataset.
• It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.

CONTD…
• In order to build a tree, we use the CART algorithm, which stands
for Classification and Regression Tree algorithm.
• A decision tree simply asks a question and, based on the answer (Yes/No), further
splits the tree into subtrees.

Significance of decision tree

• Decision trees usually mimic human thinking ability while making a
decision, so they are easy to understand.
• The logic behind a decision tree can be easily understood because it
shows a tree-like structure.

Decision Tree Terminologies
❖ Root Node: Node from where the decision tree starts. It represents the entire dataset,
which further gets divided into two or more homogeneous sets.

❖ Leaf Node: Final output node and the tree cannot be segregated further after getting a
leaf node.

❖ Splitting: Process of dividing the decision node/root node into sub-nodes according to
the given conditions.

❖ Branch/Sub Tree: A tree formed by splitting the tree.

❖ Pruning: Pruning is the process of removing unwanted branches from the tree.

❖ Parent & Child node: The root node of the tree is called the parent node, and other
nodes are called the child nodes.

Decision tree algorithm
working
• Step-1: Begin the tree with the root node, say S, which contains the complete
dataset.
• Step-2: Find the best attribute in the dataset using an Attribute Selection Measure
(ASM).
• Step-3: Divide S into subsets that contain the possible values of the best
attribute.
• Step-4: Generate the decision tree node which contains the best attribute.
• Step-5: Recursively make new decision trees using the subsets of the dataset created
in Step-3. Continue this process until a stage is reached where the nodes cannot be
classified further; such final nodes are called leaf nodes.

Illustrative Example

Attribute Selection Measures:
Entropy
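• Entropy measures the impurity of a node. In standard form, for a set S in which
class i occurs with proportion pi:
Entropy(S) = - Σi pi log2(pi)
• A completely pure node has entropy 0, while a 50/50 binary split has the
maximum entropy of 1 (for two classes).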

Attribute Selection Measures:
Information Gain

• Information gain is the measurement of the change in entropy after the segmentation
of a dataset based on an attribute.
• It calculates how much information a feature provides us about a class.
• According to the value of information gain, we split the node and build the decision
tree.
• A decision tree algorithm always tries to maximize the value of information gain,
and a node/attribute having the highest information gain is split first. It can be
calculated using the below formula:
Information Gain = Entropy(S) - [Weighted Avg × Entropy(each feature)]

Attribute Selection Measures: Gini
Index

• Gini index is a measure of impurity or purity used while creating a decision tree in
the CART(Classification and Regression Tree) algorithm.

• An attribute with a low Gini index should be preferred over one with a high
Gini index.

• It only creates binary splits, and the CART algorithm uses the Gini index to create
binary splits.

• Gini index can be calculated using the below formula:
Gini Index = 1 - Σj pj²
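A minimal sketch showing both impurity criteria in scikit-learn's CART implementation (the built-in Iris dataset is used only as a convenient example):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

tree_gini = DecisionTreeClassifier(criterion="gini").fit(X, y)        # Gini index splits
tree_entropy = DecisionTreeClassifier(criterion="entropy").fit(X, y)  # entropy/IG splits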

Numerical example –
Decision Tree (Entropy, Gini Impurity
& Information Gain)

Algorithm for construction of Decision tree using
entropy and Information Gain

Information gain:
Outlook: 0.246
Humidity: 0.151
Wind: 0.048
Temperature: 0.029

→ Split on Outlook

Outlook = Sunny:    examples 1,2,8,9,11  (2+,3-)  → impure, split further (?)
Outlook = Overcast: examples 3,7,12,13   (4+,0-)  → pure leaf: Yes
Outlook = Rain:     examples 4,5,6,10,14 (3+,2-)  → impure, split further (?)
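A minimal sketch reproducing the information-gain figure for Outlook from the class counts above:

import math

def entropy(pos, neg):
    total = pos + neg
    e = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:
            e -= p * math.log2(p)
    return e

e_s = entropy(9, 5)   # full set: 9 positive, 5 negative examples

# Outlook partitions: Sunny (2+,3-), Overcast (4+,0-), Rain (3+,2-)
branches = [(2, 3), (4, 0), (3, 2)]
remainder = sum((p + n) / 14 * entropy(p, n) for p, n in branches)
print(e_s - remainder)   # ≈ 0.247, matching the slide's 0.246 up to rounding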

Computation time is reduced with Gini impurity, as it does not use the
logarithmic function.

Algorithm for construction of Decision tree using Gini Index

Advantages & Disadvantages of the
Decision Tree

Advantages of the Decision Tree
• It is simple to understand, as it follows the same process which a human follows
while making any decision in real life.
• It can be very useful for solving decision-related problems and for generating the
possible outcomes of a problem.
• There is less requirement for data cleaning compared to other algorithms.

Disadvantages of the Decision Tree
• The decision tree can contain many layers, which makes it complex.
• It has an overfitting issue, which can be resolved using the Random Forest algorithm.
• For more class labels, the computational complexity of the decision tree may
increase.

Methods for Evaluating a
classification model
Log Loss or Cross-Entropy Loss
• It is used for evaluating the performance of a classifier whose output is a
probability value between 0 and 1.
• For a good binary classification model, the value of log loss should be near 0.
• The value of log loss increases if the predicted value deviates from the actual
value.
• The lower log loss represents the higher accuracy of the model.
• For binary classification, cross-entropy can be calculated as
-( y log(p) + (1 - y) log(1 - p) )
• where y = actual output, p = predicted probability.
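A minimal numeric sketch with scikit-learn's log_loss on toy probabilities:

from sklearn.metrics import log_loss

y_true = [1, 0, 1, 1]
p_pred = [0.9, 0.2, 0.7, 0.6]   # predicted probability of class 1

print(log_loss(y_true, p_pred))  # lower is better; 0 would be a perfect model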

Methods for Evaluating a
classification model
Confusion Matrix
• The confusion matrix provides us a matrix/table as output and describes the
performance of the model.
• It is also known as the error matrix.
• The matrix presents the prediction results in a summarized form, showing the
total numbers of correct and incorrect predictions. The matrix looks like the
table below:
                   Predicted Positive    Predicted Negative
Actual Positive    True Positive (TP)    False Negative (FN)
Actual Negative    False Positive (FP)   True Negative (TN)
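A minimal sketch extracting the four cells with scikit-learn's confusion_matrix (toy labels):

from sklearn.metrics import confusion_matrix, accuracy_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)                   # 3 3 1 1
print(accuracy_score(y_true, y_pred))   # (TP + TN) / total = 0.75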

Methods for Evaluating a
classification model
AUC-ROC curve

• ROC curve stands for Receiver Operating Characteristics Curve and AUC stands
for Area Under the Curve.
• It is a graph that shows the performance of the classification model at different
thresholds.
• To visualize the performance of the multi-class classification model, we use the
AUC-ROC Curve.
• The ROC curve is plotted with TPR and FPR, where TPR (True Positive Rate) on Y-
axis and FPR(False Positive Rate) on X-axis.
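A minimal sketch with toy scores (a classic four-point example; the resulting AUC is 0.75):

import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])   # predicted probabilities of class 1

fpr, tpr, thresholds = roc_curve(y_true, scores)   # points of the ROC curve
print(roc_auc_score(y_true, scores))               # 0.75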

LOGISTIC REGRESSION

Multi-class
classification

What is Multi-class
classification?
• In machine learning and statistical classification, multiclass
classification or multinomial classification is the problem of classifying
instances into one of three or more classes (classifying instances into one of two
classes is called binary classification).
• While many classification algorithms (notably multinomial logistic regression)
naturally permit the use of more than two classes, some are by
nature binary algorithms; these can, however, be turned into multinomial classifiers
by a variety of strategies.
• Multiclass classification should not be confused with multi-label classification,
where multiple labels are to be predicted for each instance.
• The existing multi-class classification techniques can be categorized into (i)
transformation to binary (ii) extension from binary and (iii) hierarchical
classification

Multi-class Classification
Technique
• Also called problem transformation techniques; these are strategies for reducing the
problem of multiclass classification to multiple binary classification problems.
• Transformation to binary is categorized into one vs. rest and one vs. one:
1. One vs. rest: one binary classifier is trained per class, separating that class
from all the remaining classes.
2. One vs. one: one binary classifier is trained for each pair of classes.
• Extension from binary: existing binary classifiers can also be extended to solve
multi-class classification problems directly, for example:
1. neural networks
2. decision trees
3. k-nearest neighbors
4. Naive Bayes.
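A minimal sketch of the one-vs-rest strategy using scikit-learn's wrapper around a binary classifier (Iris is used only as a convenient three-class dataset):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)   # 3 classes

# One binary model is trained per class, each separating it from the rest.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(ovr.predict(X[:3]))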

Class Imbalance problem

What is the Class Imbalance
Problem?

• It is the problem in machine learning where the total number of examples of one
class of data (positive) is far less than the total number of examples of another
class of data (negative).
• This problem is extremely common in practice and can be observed in
various disciplines including fraud detection, anomaly detection, medical
diagnosis, oil spillage detection, facial recognition, etc.

Why is it a problem?
• Most machine learning algorithms work best when the number of instances of each
class is roughly equal. When the number of instances of one class far exceeds the
other, problems arise.
• Example: Given a dataset of transaction data, we would like to find out which
transactions are fraudulent and which are genuine. Consider a dataset consisting of
10000 genuine and 10 fraudulent transactions, where the classifier classifies
fraudulent transactions as genuine transactions.
• The reason for this can be easily explained by the numbers.
• Suppose the machine learning algorithm has two possible outputs as follows:
1. Model 1 classified 7 out of 10 fraudulent transactions as genuine transactions
and 10 out of 10000 genuine transactions as fraudulent transactions.
2. Model 2 classified 2 out of 10 fraudulent transactions as genuine transactions
and 100 out of 10000 genuine transactions as fraudulent transactions.

Why is it a problem?
• If the classifier’s performance is determined by the number of mistakes, then clearly
Model 1 is better as it makes only a total of 17 mistakes while Model 2 made 102
mistakes.
• However, as we want to minimize the number of fraudulent transactions happening,
we should pick Model 2 instead, which only made 2 mistakes classifying the
fraudulent transactions.
• But, this could come at the expense of more genuine transactions being classified as
fraudulent transactions, and a general machine learning algorithm will just pick
Model 1 instead of Model 2, which is a problem.
• In practice, this means we will let a lot of fraudulent transactions go through
although we could have stopped them by using Model 2.
• This translates to unhappy customers and money lost for the company.

How to tell a ML algorithm which is
a better solution?
• To tell the machine learning algorithm (or the researcher) that Model 2 is better than
Model 1, we need a way to demonstrate that. For that, we
will need better metrics than just counting the number of mistakes made.
• We introduce the concept of True Positive, True Negative, False Positive and False
Negative:
⮚ True Positive (TP) – An example that is positive and is classified correctly
as positive
⮚ True Negative (TN) – An example that is negative and is classified correctly
as negative
⮚ False Positive (FP) – An example that is negative but is classified wrongly
as positive
⮚ False Negative (FN) – An example that is positive but is classified wrongly
as negative
⮚ Based on the above, we also define the True Positive Rate, True
Negative Rate, False Positive Rate, and False Negative Rate, as given below:
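In standard form, the rates are:

True Positive Rate (TPR) = TP / (TP + FN)
True Negative Rate (TNR) = TN / (TN + FP)
False Positive Rate (FPR) = FP / (FP + TN)
False Negative Rate (FNR) = FN / (FN + TP)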

Contd…

With these new metrics, let’s compare them with the conventional metric of counting
the number of mistakes made in the example above. First, using the old metric, the
error is simply mistakes/total: Model 1 makes 17/10010 ≈ 0.1% error, while
Model 2 makes 102/10010 ≈ 1.0% error.

Contd…

• As illustrated above, Model 1 looks like it has a lower error (0.1%) than Model
2 (1.0%), but we know that Model 2 is the better one, as it makes fewer false
negatives (FN), i.e., it maximizes true positives (TP).
• Now let’s see what the performance of Model 1 and Model 2 are like with the new
metrics:

Contd…

• Now, we can see that the false negative rate of Model 1 is at 70% while the false
negative rate of Model 2 is just at 20%, which is clearly a better classifier.
• This is the metric we should teach the machine learning algorithm (or ourselves)
to use in order to pick the better model.

Sampling based approach to mitigate
class imbalance problem
• This can be roughly classified into three categories:
⮚ Oversampling
⮚ Undersampling
⮚ Hybrid, a mix of oversampling and Undersampling
• Oversampling
⮚ In oversampling, just duplicating the minority class examples could lead the
classifier to overfit to a few examples, which can be illustrated as follows:

Contd…
⮚ In the illustration, the left-hand side is before oversampling, whereas on the
right-hand side oversampling has been applied; the thick positive signs
indicate multiple repeated copies of that data instance.
⮚ The machine learning algorithm then sees these cases many times and thus
tends to overfit to these examples specifically, resulting in a distorted (blue
line) decision boundary.
• Undersampling
⮚ In undersampling, we risk removing some of the majority class
instances which are more representative, thus discarding useful information.
This can be illustrated as follows:

Contd…
⮚ Here the green line is the ideal decision boundary we would like to have, and
blue is the actual result.
⮚ On the left side is the result of just applying a general machine learning
algorithm without using Undersampling.
⮚ On the right, we undersampled the negative class but removed some
informative negative examples, causing the blue decision boundary to be
slanted and some negative instances to be wrongly classified as positive.

• Hybrid approach
⮚ By combining Undersampling and oversampling approaches, we get the
advantages but also drawbacks of both approaches as illustrated above, which
is still a tradeoff.
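A minimal sketch of random over- and undersampling with NumPy (dedicated libraries such as imbalanced-learn offer these plus smarter variants like SMOTE; the data here is synthetic):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1010, 2))
y = np.array([0] * 1000 + [1] * 10)   # heavily imbalanced labels

minority = np.where(y == 1)[0]
majority = np.where(y == 0)[0]

# Oversampling: duplicate minority rows (with replacement) until balanced.
extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
X_over = np.vstack([X, X[extra]])
y_over = np.concatenate([y, y[extra]])

# Undersampling: keep only as many majority rows as there are minority rows.
keep = rng.choice(majority, size=len(minority), replace=False)
X_under = np.vstack([X[keep], X[minority]])
y_under = np.concatenate([y[keep], y[minority]])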

Naïve Bayes Classifiers

Naïve Bayes algorithm?
• Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes
theorem and used for solving classification problems.
• Naïve Bayes Classifier is one of the simple and most effective Classification algorithms
which helps in building the fast machine learning models that can make quick predictions.
• It is a probabilistic classifier, which means it predicts on the basis of the probability of an
object.
• The Naïve Bayes algorithm comprises two words, Naïve and Bayes, which can be
described as:
⮚ Naïve: It is called Naïve because it assumes that the occurrence of a certain feature is
independent of the occurrence of other features. For example, if a fruit is identified on
the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized
as an apple; each feature individually contributes to identifying it as an apple
without depending on the others.
⮚ Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.

Bayes’ Theorem
• Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to
determine the probability of a hypothesis with prior knowledge. It depends on the
conditional probability.
• The formula for Bayes' theorem is given as:

P(A|B) = P(B|A) * P(A) / P(B)

Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given that the
probability of a hypothesis is true.
P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
P(B) is Marginal Probability: Probability of Evidence.

Working of Naïve Bayes’
Classifier
• Working of Naïve Bayes' Classifier can be understood with the help of the below
example:
• Suppose we have a dataset of weather conditions and a corresponding target variable
"Play". Using this dataset, we need to decide whether we should play or not
on a particular day according to the weather conditions.
• So to solve this problem, we need to follow the below steps:
1. Convert the given dataset into frequency tables.
2. Generate Likelihood table by finding the probabilities of given features.
3. Now, use Bayes theorem to calculate the posterior probability.

CONTD…

Problem: If the weather is sunny, should the player play or not?
Solution: To solve this, first consider the given dataset:

     Outlook     Play
0    Rainy       Yes
1    Sunny       Yes
2    Overcast    Yes
3    Overcast    Yes
4    Sunny       No
5    Rainy       Yes
6    Sunny       Yes
7    Overcast    Yes
8    Rainy       No
9    Sunny       No
10   Sunny       Yes
11   Rainy       No
12   Overcast    Yes
13   Overcast    Yes

CONTD…

Frequency table for the Weather Conditions:

Weather     Yes   No
Overcast    5     0
Rainy       2     2
Sunny       3     2
Total       10    5

CONTD…

Likelihood table for the weather conditions:

Weather     No            Yes           P(Weather)
Overcast    0             5             5/14 = 0.35
Rainy       2             2             4/14 = 0.29
Sunny       2             3             5/14 = 0.35
All         4/14 = 0.29   10/14 = 0.71

Applying Bayes’ Theorem
• P(Yes|Sunny)= P(Sunny|Yes)*P(Yes)/P(Sunny)
P(Sunny|Yes)= 3/10= 0.3
P(Sunny)= 0.35, P(Yes)=0.71
So, P(Yes|Sunny) = 0.3*0.71/0.35= 0.60
• P(No|Sunny)= P(Sunny|No)*P(No)/P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 0.29, P(Sunny) = 0.35
So P(No|Sunny) = 0.5*0.29/0.35 = 0.41
As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).
Hence, on a sunny day, the player can play the game.
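A minimal sketch reproducing the calculation from the tables above (exact fractions are used, so the final values differ slightly from the slide's rounded intermediates):

p_yes, p_no = 10 / 14, 4 / 14    # priors P(Yes), P(No)
p_sunny = 5 / 14                 # evidence P(Sunny)
p_sunny_given_yes = 3 / 10       # likelihood P(Sunny|Yes)
p_sunny_given_no = 2 / 4         # likelihood P(Sunny|No)

print(p_sunny_given_yes * p_yes / p_sunny)   # P(Yes|Sunny) = 0.60
print(p_sunny_given_no * p_no / p_sunny)     # P(No|Sunny)  = 0.40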

Advantages, Disadvantages
and Application
• Advantages
⮚ It can be used for Binary as well as Multi-class Classifications and performs
well in Multi-class predictions as compared to the other Algorithms.
⮚ It is widely used for text classification problems.
• Disadvantages: Naive Bayes assumes that all features are independent or
unrelated, so it cannot learn the relationship between features.

• Application
⮚ Credit Scoring
⮚ Medical data classification.
⮚ In real-time predictions because Naïve Bayes Classifier is an eager learner.
⮚ Used in Text classification such as Spam filtering and Sentiment analysis.

Naive Bayes model for sentiment
classification – An Introduction
• In Naïve Bayes, probabilities are assigned to words or phrases, segregating them
into different labels. Consider the following example:

• Here, the model will try to learn how these sentiments are classified using the
corresponding text. Example: it will see that a sentence containing the word "good"
has a high probability of being a positive sentiment.
• Using such probabilistic values, a total probability of a test sentence being
positive/negative can be assigned.
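A minimal sketch of such a classifier with scikit-learn (the tiny corpus is invented for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["good movie", "very good acting", "bad plot", "terrible and bad"]
labels = ["positive", "positive", "negative", "negative"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)              # word counts -> per-class word probabilities
print(model.predict(["good plot"]))   # likely 'positive', driven by the word 'good'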

Thank you

