ML 2
Consider a supervised learning task of predicting house prices based on features such as
square footage, number of bedrooms, and location. The dataset consists of these features
along with the corresponding house prices (labels).
1. Data Collection: Collect data on various houses, including their features and prices.
2. Data Preprocessing: Normalize the feature values and handle any missing data.
3. Train-Test Split: Split the data into a training set (e.g., 80%) and a test set (e.g.,
20%).
4. Model Selection: Choose a regression algorithm, such as linear regression.
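A minimal sketch of steps 1–4 in Python with scikit-learn, assuming a tiny made-up feature matrix (square footage, bedrooms, and a numeric location score) with prices in thousands of dollars; all values are illustrative only:

```python
# Sketch of the house-price workflow on hypothetical data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# 1. Data collection (illustrative values): [sqft, bedrooms, location_score]
X = np.array([[1400, 3, 7], [1600, 3, 8], [1700, 4, 6], [1875, 4, 9],
              [1100, 2, 5], [1550, 3, 7], [2350, 4, 9], [2450, 5, 8],
              [1425, 3, 6], [1700, 3, 8]])
y = np.array([245, 312, 279, 308, 199, 219, 405, 324, 319, 255])  # price in $1000s

# 3. Train-test split (80% / 20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Preprocessing: normalize the features (fit the scaler on the training set only)
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# 4. Model selection and training: linear regression
model = LinearRegression().fit(X_train_s, y_train)
print("Test MSE:", mean_squared_error(y_test, model.predict(X_test_s)))
```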
Example
• One-vs-One (OvO)
• One-vs-Rest (OvR)
Example (OvO): For three classes (A, B, C), three pairwise classifiers are trained:
• Classifier 1: A vs. B
• Classifier 2: A vs. C
• Classifier 3: B vs. C
(Under OvR, three classifiers would instead be trained: A vs. rest, B vs. rest, C vs. rest.)
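As a rough sketch of how the two strategies can be set up with scikit-learn's wrappers (the three-class data here is generated synthetically for illustration):

```python
# One-vs-One vs. One-vs-Rest wrappers around a binary classifier (sketch).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = make_classification(n_samples=300, n_features=5, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=0)

# OvO trains one classifier per pair of classes: A-B, A-C, B-C
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
# OvR trains one classifier per class against the rest: A-rest, B-rest, C-rest
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print(len(ovo.estimators_), "OvO classifiers,", len(ovr.estimators_), "OvR classifiers")
```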
Evaluation Metrics
Evaluating the performance of a multi-class classification model involves metrics that provide insight into different aspects of the model's predictive capabilities.
In the above image, the dependent variable (salary) is on the Y-axis and the independent variable (experience) is on the X-axis.
The regression line can be written as y = a0 + a1x + ε, or equivalently y = mx + c,
where a1 (or m) is the slope (how much y changes for a unit change in x), a0 (or c) is the intercept (the value of y when x = 0), and ε is the error term.
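A small numeric sketch of fitting the line y = a0 + a1x by least squares with NumPy; the experience/salary values are made up for illustration:

```python
# Fit a simple regression line y = a0 + a1*x by least squares (illustrative data).
import numpy as np

experience = np.array([1, 2, 3, 4, 5, 6, 7, 8])       # x: years of experience
salary = np.array([30, 35, 41, 44, 52, 56, 62, 66])   # y: salary in $1000s

a1, a0 = np.polyfit(experience, salary, deg=1)          # slope (a1) and intercept (a0)
print(f"y = {a0:.2f} + {a1:.2f}x")
print("Predicted salary at 10 years:", round(a0 + a1 * 10, 1))
```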
Example 1
Example 2
Multilinear Regression
Multilinear Regression (also known as Multiple Linear Regression) extends linear regression by
modeling the relationship between a dependent variable and multiple independent variables.
Multiple Independent Variables: The equation involves more than one independent variable:
Y = β0 + β1X1 + β2X2 + ⋯ + βnXn + ε
where:
Y is the dependent variable,
X1, X2, …, Xn are the independent variables,
β0 is the y-intercept,
β1, β2, …, βn are the coefficients (slopes) for each independent variable, and
ε is the error term.
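A brief sketch of fitting Y = β0 + β1X1 + β2X2 + ε with scikit-learn, using two made-up predictors (square footage and bedrooms):

```python
# Multiple linear regression with two independent variables (illustrative data).
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1400, 3], [1600, 3], [1700, 4], [1875, 4],
              [1100, 2], [1550, 3], [2350, 4], [2450, 5]])  # X1 = sqft, X2 = bedrooms
y = np.array([245, 312, 279, 308, 199, 219, 405, 324])      # Y = price in $1000s

model = LinearRegression().fit(X, y)
print("Intercept (beta_0):", round(model.intercept_, 2))
print("Coefficients (beta_1, beta_2):", model.coef_.round(3))
```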
Multilinear Regression
Goal: Similar to simple linear regression, the goal is to find the coefficients (𝛽s) that
minimize the sum of the squared differences between the observed and predicted values.
Assumptions: The same as for simple linear regression, but applied to a multidimensional
space:
• Linearity: The relationship between the dependent variable and each independent
variable is linear.
• Independence: The residuals are independent.
• Homoscedasticity: Constant variance of residuals.
• Normality: The residuals are normally distributed.
• No multicollinearity: Independent variables should not be highly correlated with each
other.
Differences Between Linear and Multilinear Regression
• Number of Independent Variables:
• Linear Regression: Involves one independent variable.
• Multilinear Regression: Involves multiple independent variables.
• Complexity:
• Linear Regression: Simpler, easier to visualize and interpret.
• Multilinear Regression: More complex, requires more data to estimate multiple parameters.
• Use Cases:
• Linear Regression: Suitable for simple scenarios where the outcome is influenced by a single
factor.
• Multilinear Regression: Suitable for more complex scenarios where the outcome is
influenced by multiple factors.
Example 1
Applications
Naïve: It is called naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features. For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple; each feature individually contributes to identifying it as an apple, without depending on the others.
Bayes' theorem: P(A|B) = P(B|A) · P(A) / P(B)
Where,
P(A|B) is the Posterior probability: the probability of hypothesis A given the observed event B.
P(B|A) is the Likelihood: the probability of the evidence B given that hypothesis A is true.
P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
P(B) is Marginal Probability: Probability of Evidence.
Working of Naïve Bayes' Classifier:
The working of the Naïve Bayes classifier can be understood with the help of the example below:
Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset, we need to decide whether we should play on a particular day according to the weather conditions. To solve this problem, we follow the steps below:
1. Convert the given dataset into frequency tables.
2. Generate Likelihood table by finding the probabilities of given features.
3. Now, use Bayes theorem to calculate the posterior probability.
Problem: If the weather is sunny, then the Player should play or not?
Working of Naïve Bayes' Classifier:
Solution: To solve this, first consider the given dataset:

      Outlook     Play
 0    Rainy       Yes
 1    Sunny       Yes
 2    Overcast    Yes
 3    Overcast    Yes
 4    Sunny       No
 5    Rainy       Yes
 6    Sunny       Yes
 7    Overcast    Yes
 8    Rainy       No
 9    Sunny       No
10    Sunny       Yes
11    Rainy       No
12    Overcast    Yes
13    Overcast    Yes
Working of Naïve Bayes' Classifier:
Frequency table for the Weather Conditions:

Weather      Yes   No
Overcast      5     0
Rainy         2     2
Sunny         3     2
Total        10     4

Likelihood table of the Weather Conditions:

Weather      No            Yes
Overcast     0             5             5/14 = 0.35
Rainy        2             2             4/14 = 0.29
Sunny        2             3             5/14 = 0.35
All          4/14 = 0.29   10/14 = 0.71
Working of Naïve Bayes' Classifier:
Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.3
P(Sunny) = 5/14 = 0.35
P(Yes) = 10/14 = 0.71
So P(Yes|Sunny) = 0.3 * 0.71 / 0.35 = 0.60
P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 4/14 = 0.29
P(Sunny) = 0.35
So P(No|Sunny) = 0.5 * 0.29 / 0.35 = 0.41
As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).
Hence, on a Sunny day, the player can play the game.
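The same calculation can be reproduced programmatically; this sketch rebuilds the counts from the 14-row dataset above and applies Bayes' theorem for Outlook = Sunny (the exact posteriors are 0.60 and 0.40; the 0.41 above comes from rounding the intermediate values):

```python
# Reproduce the Play / Don't-Play posteriors for a Sunny day from the table above.
from collections import Counter

data = [("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"), ("Overcast", "Yes"),
        ("Sunny", "No"), ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
        ("Rainy", "No"), ("Sunny", "No"), ("Sunny", "Yes"), ("Rainy", "No"),
        ("Overcast", "Yes"), ("Overcast", "Yes")]

n = len(data)
play = Counter(label for _, label in data)                   # {'Yes': 10, 'No': 4}
sunny_labels = [label for outlook, label in data if outlook == "Sunny"]
p_sunny = len(sunny_labels) / n                              # P(Sunny) = 5/14

for label in ("Yes", "No"):
    likelihood = sunny_labels.count(label) / play[label]     # P(Sunny|label)
    prior = play[label] / n                                  # P(label)
    posterior = likelihood * prior / p_sunny                 # Bayes' theorem
    print(f"P({label}|Sunny) = {posterior:.2f}")
```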
Example 2
Example 3
Example 4
Advantages of Naïve Bayes Classifier:
Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
It can be used for binary as well as multi-class classification.
It performs well in multi-class predictions compared to many other algorithms.
It is a popular choice for text classification problems.
Disadvantages of Naïve Bayes Classifier:
Naive Bayes assumes that all features are independent or unrelated, so it
cannot learn the relationship between features.
Applications of Naïve Bayes Classifier
It is used for Credit Scoring.
It is used in medical data classification.
It can be used in real-time predictions because Naïve Bayes Classifier is an eager learner.
It is used in Text classification such as Spam filtering and Sentiment analysis.
Types of Naïve Bayes Model:
There are three types of Naive Bayes Model, which are given below:
Gaussian: The Gaussian model assumes that features follow a normal distribution. This means
if predictors take continuous values instead of discrete, then the model assumes that these
values are sampled from the Gaussian distribution.
Multinomial: The Multinomial Naïve Bayes classifier is used when the data is multinomially distributed. It is primarily used for document classification problems, i.e., determining which category a particular document belongs to, such as Sports, Politics, or Education. The classifier uses the frequency of words as the predictors.
Bernoulli: The Bernoulli classifier works similarly to the Multinomial classifier, but the predictor variables are independent Boolean variables, such as whether a particular word is present in a document or not. This model is also well known for document classification tasks.
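A short sketch of the three variants in scikit-learn, using small synthetic inputs purely to show which kind of feature each model expects:

```python
# Gaussian, Multinomial, and Bernoulli Naive Bayes on toy data (sketch).
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

rng = np.random.RandomState(0)
y = np.array([0, 0, 1, 1, 1, 0])

X_cont = rng.randn(6, 3)                      # continuous features  -> GaussianNB
X_counts = rng.randint(0, 5, size=(6, 3))     # word counts          -> MultinomialNB
X_bool = (X_counts > 0).astype(int)           # word present/absent  -> BernoulliNB

print(GaussianNB().fit(X_cont, y).predict(X_cont[:2]))
print(MultinomialNB().fit(X_counts, y).predict(X_counts[:2]))
print(BernoulliNB().fit(X_bool, y).predict(X_bool[:2]))
```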
Decision Tree Classification Algorithm:
• Decision Tree is a Supervised learning technique that can be used for both classification and
Regression problems, but mostly it is preferred for solving Classification problems.
• It is a tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
• In a decision tree, there are two types of nodes: the Decision Node and the Leaf Node.
• Decision nodes are used to make decisions and have multiple branches.
• Leaf nodes are the outputs of those decisions and do not contain any further branches.
• The decisions or tests are performed on the basis of the features of the given dataset.
It is a graphical representation for getting all the possible solutions to a problem/decision based
on given conditions.
Decision Tree Classification Algorithm:
• It is called a decision tree because its structure resembles a tree.
• It starts with the root node, which expands into further branches and constructs a tree-like structure.
• To build a tree, we use algorithms such as ID3 (Iterative Dichotomiser 3) and CART (Classification and Regression Tree).
• A decision tree simply asks a question and, based on the answer (Yes/No), further splits the tree into subtrees.
Decision Tree Classification Algorithm:
Below diagram explains the general structure of a decision tree:
Why use Decision Trees?:
There are various algorithms in machine learning, so choosing the best algorithm for the given dataset and problem is a key consideration when creating a machine learning model. Below are two reasons for using a decision tree:
Decision trees usually mimic the way humans think while making a decision, so they are easy to understand.
The logic behind a decision tree can be easily understood because it shows a tree-like structure.
Decision Tree Terminologies:
Root Node: Root node is from where the decision tree starts. It represents the entire dataset,
which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output nodes; the tree cannot be segregated further after reaching a leaf node.
Splitting: Splitting is the process of dividing a decision node/root node into sub-nodes according to the given conditions.
Branch/Sub-Tree: A subtree formed by splitting the tree.
Pruning: Pruning is the process of removing unwanted branches from the tree.
Parent/Child node: A node that splits into sub-nodes is called the parent node, and the sub-nodes are called its child nodes.
How does the Decision Tree algorithm Work?:
In a decision tree, to predict the class of a given record, the algorithm starts from the root node of the tree. It compares the value of the root attribute with the corresponding attribute of the record (real dataset) and, based on the comparison, follows the branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the values of the other sub-nodes and moves further. It continues the process until it reaches a leaf node of the tree.
How does the Decision Tree algorithm Work?:
The complete process can be better understood using the algorithm below:
Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
Step-3: Divide S into subsets that contain the possible values of the best attribute.
Step-4: Generate the decision tree node, which contains the best attribute.
Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where the nodes cannot be classified further; the final nodes are called leaf nodes.
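A minimal illustration of these steps using scikit-learn's decision tree (a CART implementation); here the Iris dataset stands in for the dataset S, and criterion="entropy" is chosen so that attribute selection is based on information gain:

```python
# Grow a decision tree: ASM = information gain (criterion="entropy"), split recursively.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0).fit(X, y)

# Each internal node tests the best attribute; each leaf holds a predicted class.
print(export_text(tree, feature_names=load_iris().feature_names))
```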
How does the Decision Tree algorithm Work?:
Example :
Suppose there is a candidate who has a job offer and wants to decide whether he should accept the offer or not. To solve this problem, the decision tree starts with the root node (the Salary attribute, selected by ASM).
The root node splits further into the next decision node (Distance from the office) and one leaf node, based on the corresponding labels.
The next decision node further splits into one decision node (Cab facility) and one leaf node.
Finally, the decision node splits into two leaf nodes (Accepted offer and Declined offer).
Consider the below diagram:
How does the Decision Tree algorithm Work?:
Example 2:
Attribute Selection Measures
While implementing a decision tree, the main issue is how to select the best attribute for the root node and for the sub-nodes. To solve such problems there is a technique called the Attribute Selection Measure (ASM). With this measure, we can easily select the best attribute for the nodes of the tree. There are two popular ASM techniques:
Information Gain
Gini Index
Attribute Selection Measures
1. Information Gain:
• Information gain is the measurement of changes in entropy after the segmentation of a dataset
based on an attribute.
• It calculates how much information a feature provides us about a class.
• According to the value of information gain, we split the node and build the decision tree.
• A decision tree algorithm always tries to maximize the value of information gain, and a
node/attribute having the highest information gain is split first.
• It can be calculated using the formula below:
Information Gain = Entropy(S) − Σv (|Sv| / |S|) × Entropy(Sv)
where S is the full set of examples and Sv is the subset of S for which the attribute takes value v.
Attribute Selection Measures
Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies the randomness in the data. Entropy can be calculated as:
• Entropy
– It is a measure of uncertainty, purity, and information content.
• Consider a sample of training examples S:
• p1 is the proportion of positive examples in S
• p2 is the proportion of negative examples in S
• Entropy(S) = p1(−log2 p1) + p2(−log2 p2) = −p1 log2(p1) − p2 log2(p2)
• Information Gain
– When a node is split, the reduction in entropy is referred to as the information gain.
• For splitting, the attribute with the highest information gain is selected.
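A small sketch computing Entropy(S) and the information gain of one attribute, reusing the Outlook/Play data from the Naïve Bayes example above as a convenient stand-in:

```python
# Entropy(S) = -p1*log2(p1) - p2*log2(p2);  Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(values, labels):
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [lab for val, lab in zip(values, labels) if val == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

outlook = ["Rainy", "Sunny", "Overcast", "Overcast", "Sunny", "Rainy", "Sunny",
           "Overcast", "Rainy", "Sunny", "Sunny", "Rainy", "Overcast", "Overcast"]
play = ["Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes", "Yes", "No", "No", "Yes", "No", "Yes", "Yes"]

print("Entropy(S)       =", round(entropy(play), 3))                      # 10 Yes / 4 No -> ~0.863
print("Gain(S, Outlook) =", round(information_gain(outlook, play), 3))    # ~0.23
```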
Example – ID3
Solution – ID3: worked on the slides by computing information gain for Age (<=30, 31..40, >40); within the Age <=30 branch the Income, Student, and credit_rating attributes are evaluated, and within the Age > 40 branch the Income and credit_rating attributes are evaluated.
Decision Tree – ID3
Classification rules from decision tree
Limitations:
1. Tendency to overfit if not properly regularized.
2. Difficulty handling irrelevant features.
3. Lack of smoothness in the decision boundaries.
Real-World Applications CART
1. Credit Scoring: Predicting creditworthiness based on customer attributes.
2. Disease Diagnosis: Identifying diseases based on symptoms and patient characteristics.
3. Customer Churn Prediction: Predicting whether a customer is likely to cancel a subscription
or leave a service.
4. Stock Market Analysis: Forecasting stock prices based on historical data.
Error Bounds
• Error bounds are estimates that quantify the uncertainty or potential error
in the predictions or estimates made by a model. They provide a range
within which the true value is expected to lie, giving a measure of the
reliability and accuracy of the model. Error bounds are crucial in both
theoretical and practical applications of statistics and machine learning.
Types of Error Bounds
• Confidence Intervals
• Prediction Intervals
• Chebyshev's Inequality
• Hoeffding's Inequality
• PAC (Probably Approximately Correct) Bounds
Confidence Intervals
• Definition: A confidence interval provides a range of values, derived from the sample
data, that is likely to contain the true value of an unknown population parameter.
• Example: If a 95% confidence interval for a population mean is (5, 10), we are 95%
confident that the true mean lies between 5 and 10.
• Calculation: For a sample mean x̄ with standard deviation s, the confidence interval is given by:
x̄ ± z · (s / √n)
where z is the z-score corresponding to the desired confidence level and n is the sample size.
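A quick numeric sketch of this formula on a made-up sample, using z ≈ 1.96 for 95% confidence:

```python
# 95% confidence interval for a mean: x_bar +/- z * s / sqrt(n)  (illustrative sample)
import numpy as np

sample = np.array([7.2, 6.8, 7.9, 8.1, 6.5, 7.4, 7.7, 8.3, 6.9, 7.5])
x_bar, s, n = sample.mean(), sample.std(ddof=1), len(sample)
z = 1.96                                  # z-score for a 95% confidence level
half_width = z * s / np.sqrt(n)
print(f"95% CI: ({x_bar - half_width:.2f}, {x_bar + half_width:.2f})")
```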
Prediction Intervals
• Definition: A prediction interval gives a range within which a single future observation is expected to fall with a given probability; it is wider than a confidence interval because it accounts for the variability of individual observations as well as uncertainty in the mean.
Chebyshev's Inequality
• Definition: For any distribution with finite mean and variance, at least 1 − 1/k² of the values lie within k standard deviations of the mean (for example, at least 75% lie within 2 standard deviations).
Hoeffding's Inequality
• Definition: Provides a bound on the probability that the sum of bounded independent random variables deviates from its expected value.
• Example: For independent random variables X1, X2, …, Xn, each bounded by the interval [ai, bi], the inequality states that
P(|Sn − E[Sn]| ≥ t) ≤ 2·exp(−2t² / Σi (bi − ai)²), where Sn = X1 + X2 + ⋯ + Xn.
• Applied to learning, the bound gives P(|error(h) − error^(h)| ≥ ε) ≤ 2·exp(−2nε²) ≤ δ,
where error(h) is the true error, error^(h) is the empirical error, ε is the error margin, and δ is the confidence parameter.
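A small numeric sketch of this learning-theory form: solving 2·exp(−2nε²) = δ for ε gives the error margin guaranteed with probability at least 1 − δ (the function name is illustrative):

```python
# Hoeffding-style generalization bound: with probability >= 1 - delta,
# |true error - empirical error| <= sqrt(ln(2/delta) / (2n)) for a 0/1 loss.
import math

def hoeffding_margin(n, delta):
    return math.sqrt(math.log(2 / delta) / (2 * n))

for n in (100, 1000, 10000):
    print(f"n = {n:6d}  ->  epsilon = {hoeffding_margin(n, delta=0.05):.4f}")
```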
Importance of Error Bounds
1. Uncertainty Quantification: Error bounds provide a way to quantify the uncertainty in
model predictions, allowing for more informed decision-making.
2. Model Evaluation: They help in evaluating the performance and reliability of models,
ensuring that predictions are within acceptable limits.
3. Risk Management: In critical applications like finance or healthcare, error bounds help
manage risks by providing worst-case scenarios.
4. Research and Development: In research, error bounds help validate the findings and
ensure that the results are not due to random chance.
Thanks