MLE
o Learning refers to the process by which machines improve their performance based
on experience. It allows systems to make predictions or decisions without being
explicitly programmed for every task.
o Machine Learning (ML) is a subset of AI that enables systems to learn from data. It
uses algorithms to detect patterns and make decisions based on data, instead of
following a set of hard-coded instructions.
o Example: A recommendation system (like Netflix) learns from your viewing history to
suggest shows or movies.
o In machine learning, the system learns from the data and adapts to make decisions
or predictions without explicit programming.
• Learning Process:
o The learning process in ML involves feeding data to a model, training it, and
adjusting the model’s parameters to make accurate predictions. The model learns
patterns from the data and can then generalize these patterns to new, unseen data.
• Types of Data:
o Structured Data: Data that is organized in tables or databases (e.g., rows and
columns in a spreadsheet).
o Unstructured Data: Data that doesn’t have a clear structure, like images, text, and
audio.
o Semi-structured Data: Data that has some structure, like JSON or XML files, but is
not entirely organized in a strict format.
o Evaluation: The process of assessing the model’s performance using metrics like
accuracy, precision, recall, etc.
o Probability refers to the likelihood of an event occurring. For instance, the probability
of rolling a 6 on a fair die is 1/6.
o Distribution describes how values are spread across a dataset. Normal distribution is
a common type, where data is symmetrically distributed around a central mean.
o Euclidean Distance is the straight-line distance between two points in a space (think
of it like a straight ruler distance).
o Manhattan Distance measures the distance between two points along axes at right
angles (like a grid layout, where you can only move along rows and columns).
o Regression is used to predict continuous values, like predicting the price of a house
based on its features (size, location, etc.).
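A quick sketch of the two distance measures defined above, assuming NumPy is available:

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Euclidean distance: straight-line distance between the two points
euclidean = np.sqrt(np.sum((a - b) ** 2))   # sqrt(3^2 + 4^2) = 5.0

# Manhattan distance: sum of absolute differences along each axis
manhattan = np.sum(np.abs(a - b))           # 3 + 4 = 7.0

print(euclidean, manhattan)
```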
• Hypothesis Testing:
o Hypothesis testing is a statistical method to test if a hypothesis about a population is
true or false. For example, testing whether a new drug is more effective than an old
one.
o Dataset creation involves gathering and structuring the data relevant to the problem
you want to solve. You can create a dataset by collecting data through surveys,
sensors, or even by scraping data from websites.
o Handling missing values: filling them with the mean, median, or a value estimated by other methods.
o The dataset is typically split into two parts: a training set to train the model and a
test set to evaluate the model's performance. A common split is 70% for training and
30% for testing, but this can vary.
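A short illustrative sketch of both preparation steps (filling missing values and a 70/30 split), assuming pandas and scikit-learn are available; the column names and values are invented:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset with a missing value in the "size" column
df = pd.DataFrame({
    "size":  [120.0, 80.0, None, 150.0, 95.0],
    "rooms": [3, 2, 2, 4, 3],
    "price": [300, 200, 210, 400, 250],
})

# Fill the missing value with the column mean (median works the same way)
df["size"] = df["size"].fillna(df["size"].mean())

# 70% training / 30% test split
X = df[["size", "rooms"]]
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

print(X_train.shape, X_test.shape)
```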
• Feature Scaling:
o Feature scaling involves standardizing or normalizing the features so that they have a
similar scale, which helps improve the performance of algorithms that are sensitive
to feature magnitude, like k-nearest neighbors (KNN) and gradient descent-based
algorithms.
o Example: If one feature is in the range of 1 to 100 and another is in the range of 0 to
1, scaling ensures both features contribute equally to the model.
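A minimal sketch of the two common scaling approaches with scikit-learn (assumed available); the numbers mirror the 1-100 versus 0-1 example above:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Two features on very different scales (roughly 1-100 and 0-1)
X = np.array([[10, 0.2],
              [50, 0.5],
              [90, 0.9]])

# Standardization: zero mean, unit variance per feature
X_std = StandardScaler().fit_transform(X)

# Normalization: rescale each feature into the range [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

print(X_std)
print(X_minmax)
```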
**********************************************************************************
1. Supervised Learning:
Supervised learning uses labeled data, meaning the input comes with the correct output or
answer. The model learns by finding patterns between the inputs and their corresponding
outputs. It is like teaching a child math by showing examples with solutions. Examples
include predicting house prices or classifying emails as spam. Common algorithms are linear
regression, decision trees, and support vector machines.
2. Unsupervised Learning:
Unsupervised learning works with unlabeled data, finding hidden patterns or structures
without explicit answers. The model groups or clusters data based on similarities. It’s like
organizing books on a shelf by size and color without knowing their genres. Examples include
customer segmentation and market basket analysis. Common techniques are clustering (e.g.,
K-means) and dimensionality reduction (e.g., PCA).
3. Semi-Supervised Learning:
Semi-supervised learning combines a small amount of labeled data with a large amount of
unlabeled data. It helps the model learn better with minimal supervision. For example, in
language translation, some sentences are translated (labeled), while others are not
(unlabeled). This method is useful when labeling data is expensive or time-consuming.
Algorithms often adapt supervised techniques to leverage both types of data.
1. Bias:
Bias occurs when a model makes overly simple assumptions, leading to errors in predictions.
High bias means the model underfits the data, failing to capture key patterns. For example,
fitting a straight line to curved data results in high bias. This leads to poor performance on
both training and test data. Reducing bias often involves using more complex models.
2. Variance:
Variance refers to how sensitive the model is to the training data. High variance means the
model learns noise along with the actual patterns, causing overfitting. For example, a model
that memorizes training data but fails on new data has high variance. This results in good
training accuracy but poor generalization to unseen data. Regularization and simpler models
help reduce variance.
3. Underfitting:
Underfitting happens when the model is too simple to capture the data's complexity. It leads
to poor performance on both training and test datasets. For example, predicting stock prices
with only one feature like the day of the week underfits the data. This is caused by high bias
and low variance. Using more features and a better algorithm can address underfitting.
4. Overfitting:
Overfitting occurs when a model is too complex and memorizes the training data, including
noise. This leads to great performance on training data but poor accuracy on test data. For
example, a decision tree that grows too deep might overfit. Regularization, pruning, or cross-
validation helps avoid overfitting. Balancing model complexity is key to better generalization.
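The bias-variance trade-off can be seen in a small synthetic experiment; this sketch assumes scikit-learn and NumPy, and uses polynomial degree as the complexity knob:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Noisy sine-shaped data
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 40))[:, None]
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.1, 40)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for degree in (1, 4, 15):   # too simple, about right, too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(degree,
          mean_squared_error(y_tr, model.predict(X_tr)),   # training error
          mean_squared_error(y_te, model.predict(X_te)))   # test error
```

Degree 1 shows high error on both sets (underfitting), while a very high degree tends to show low training error but a worse test error (overfitting).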
2. Design Cycle:
The design cycle is about planning the machine learning process. It starts with defining the
problem clearly, choosing the right model, and gathering appropriate data. After that, the
model is trained, parameters are fine-tuned, and the final system is tested. Feedback from
tests is used to redesign or improve the model. This cycle ensures an efficient and effective
learning process.
1. Accuracy:
Accuracy measures the percentage of correct predictions made by the model. It is calculated
as the ratio of correctly predicted instances to the total instances. For example, if a model
predicts 80 out of 100 results correctly, its accuracy is 80%. However, accuracy may not be
ideal for imbalanced datasets. Alternative metrics like precision and recall may give better
insights.
2. Scalability:
Scalability evaluates how well a model performs as the dataset grows. A scalable model
maintains good performance and efficiency even with a significant increase in data. For
instance, algorithms like linear regression scale well for larger datasets. Scalability ensures
the model remains practical in real-world, data-intensive scenarios. It’s a critical factor for
choosing algorithms in big data applications.
3. Squared Error:
Squared error (usually reported as the mean squared error, MSE) averages the squared differences between predicted and actual values, which emphasizes larger errors. It is commonly used in regression tasks to evaluate
model performance. Lower squared error means the model predicts closer to the actual
values. Minimizing this metric is a key goal during training. Techniques like gradient descent
optimize models to reduce squared error.
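A tiny sketch of the calculation, assuming NumPy:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.5, 5.5, 4.0])

# Mean squared error: average of the squared differences
mse = np.mean((y_true - y_pred) ** 2)   # (0.25 + 0.25 + 4.0) / 3 = 1.5
print(mse)
```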
o Precision: Measures the proportion of correctly predicted positive cases out of all
predicted positives. It focuses on accuracy for specific outcomes, like identifying
spam emails.
o Recall: Measures the proportion of actual positive cases the model correctly
identified. It ensures the model doesn't miss critical positive cases. Together, these
metrics provide a balanced evaluation.
5. Posterior Probability:
Posterior probability updates the likelihood of an event happening after new evidence is
observed. It’s based on Bayes' theorem and adjusts the prior probability using new data. For
example, diagnosing a disease may update probabilities after observing test results. It’s
widely used in probabilistic models and Bayesian machine learning. Posterior probabilities
help make more informed decisions.
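A small worked sketch of the disease-test example using Bayes' theorem; the prevalence and test-accuracy numbers are purely illustrative assumptions:

```python
# Assumed illustrative numbers, not real medical data
p_disease = 0.01            # prior: 1% of people have the disease
p_pos_given_disease = 0.95  # test sensitivity
p_pos_given_healthy = 0.05  # false-positive rate

# Total probability of a positive test result
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior: P(disease | positive test) via Bayes' theorem
posterior = p_pos_given_disease * p_disease / p_pos
print(round(posterior, 3))  # about 0.161; the prior of 0.01 is updated upward
```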
2.5 Classification Accuracy and Performance
1. Classification Accuracy:
Classification accuracy measures the proportion of correct classifications out of total
instances. For example, a model predicting spam correctly for 90 out of 100 emails has 90%
accuracy. However, it may not always reflect true performance, especially for imbalanced
datasets. Other metrics like precision, recall, and F1-score are often used alongside. Accuracy
is a starting point for evaluating classifiers.
2. Performance Metrics:
To evaluate model performance, tools like a confusion matrix are used. It shows true
positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). These values
help calculate precision, recall, F1-score, and more. For example, F1-score balances precision
and recall to give an overall performance measure. Evaluating performance ensures the
model meets the task's requirements.
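A brief sketch of these metrics with scikit-learn (assumed available), on made-up spam labels where 1 = spam:

```python
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual labels (1 = spam)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # model predictions

# Confusion matrix entries for a binary problem
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, tn, fp, fn)

print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(recall_score(y_true, y_pred))      # TP / (TP + FN)
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall
```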
**********************************************************************************
3. Polynomial Regression:
Polynomial regression is used when the relationship between the independent and
dependent variables is not a straight line but can be modeled as a curve. It fits a polynomial
(a higher degree equation) instead of a line.
Example: Predicting sales based on time where the sales growth follows a curvy pattern over
time. The model uses a quadratic or cubic equation to fit the data.
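A minimal sketch with scikit-learn (assumed available), fitting a quadratic curve to invented sales-over-time data:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Hypothetical data: sales grow along a curve over time
months = np.arange(1, 11).reshape(-1, 1)
sales = 2 * months.ravel() ** 2 + 5 + np.random.RandomState(0).normal(0, 3, 10)

# Degree-2 (quadratic) polynomial regression
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(months, sales)

print(model.predict([[11]]))   # predicted sales for month 11
```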
**********************************************************************************
• K-Nearest Neighbors (KNN) is a simple classification algorithm. It works by finding the "K"
closest data points to a new data point and assigning the majority class of those neighbors to
the new point.
• The distance between points is usually measured using Euclidean distance (straight-line
distance).
• Example: If you want to classify a new email as spam or not spam, KNN looks at the closest
emails in the training data and classifies it based on the majority class (spam or not spam).
• Advantages: Simple to understand and implement, works well with smaller datasets.
• Disadvantages: Can be slow for large datasets and sensitive to irrelevant features.
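A short KNN sketch with scikit-learn (assumed available); the email features and labels are made up:

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical email features: [number of links, exclamation marks]
X = [[1, 0], [8, 5], [0, 1], [7, 6], [9, 4], [2, 1]]
y = [0, 1, 0, 1, 1, 0]   # 1 = spam, 0 = not spam

# K = 3 nearest neighbours, Euclidean distance by default
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

print(knn.predict([[6, 5]]))   # class decided by majority vote of the 3 neighbours
```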
• Logistic Regression is a statistical model used for binary classification tasks (two classes). It
predicts the probability that a data point belongs to one of the two classes using the logistic
function (sigmoid curve).
• It’s called "regression" because it uses a linear equation, but it outputs probabilities, which
are then converted into class labels (0 or 1).
• Example: Predicting whether a customer will buy a product (yes/no) based on features like
age, income, and previous purchases.
• Advantages: Easy to implement, interpretable results, and works well for linearly separable
data.
• Disadvantages: Assumes a linear relationship between features and the log odds of the
outcome, so it may not work well for complex, non-linear data.
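A minimal logistic-regression sketch with scikit-learn; the customer features are invented:

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical customer features: [age, income in thousands]
X = [[25, 30], [40, 80], [35, 60], [22, 20], [50, 95], [30, 40]]
y = [0, 1, 1, 0, 1, 0]   # 1 = bought the product

model = LogisticRegression()
model.fit(X, y)

print(model.predict_proba([[45, 70]]))  # probability for each class
print(model.predict([[45, 70]]))        # thresholded class label (0 or 1)
```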
• Naive Bayes is based on Bayes' Theorem and assumes that the features are independent
(naive assumption). It calculates the probability of a data point belonging to each class and
chooses the class with the highest probability.
• It is particularly effective for text classification tasks like spam detection and sentiment
analysis.
• Example: Classifying an email as spam or not spam by calculating the probability of words
appearing in each class (spam or not spam).
• Advantages: Fast, simple, works well with high-dimensional data (e.g., text).
• Disadvantages: The independence assumption often doesn’t hold true, which can limit its
performance.
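A small Naive Bayes sketch for text classification, assuming scikit-learn; the tiny corpus is made up:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up corpus; 1 = spam, 0 = not spam
texts = ["win money now", "meeting at noon", "free money offer", "lunch tomorrow"]
labels = [1, 0, 1, 0]

# Turn each email into word counts, then fit a multinomial Naive Bayes model
vec = CountVectorizer()
X = vec.fit_transform(texts)
model = MultinomialNB().fit(X, labels)

print(model.predict(vec.transform(["free offer now"])))   # likely spam (1)
```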
• Support Vector Machine (SVM) is a powerful classifier that finds the best hyperplane (line or
surface) that separates data points of different classes with the largest margin.
• It’s particularly useful for binary classification, but it can be extended to multi-class
problems.
• Example: Classifying emails as spam or not spam by finding the optimal line that separates
spam from non-spam emails in a feature space.
• Advantages: Works well in high-dimensional spaces and is effective for both linear and non-
linear problems using kernel functions.
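A minimal SVM sketch with scikit-learn, reusing the invented spam features from the KNN example:

```python
from sklearn.svm import SVC

# Hypothetical email features: [number of links, spammy-word count]
X = [[1, 0], [8, 5], [0, 1], [7, 6], [9, 4], [2, 1]]
y = [0, 1, 0, 1, 1, 0]

# A linear kernel looks for the separating hyperplane with the largest margin;
# kernel="rbf" would handle non-linear boundaries instead
clf = SVC(kernel="linear")
clf.fit(X, y)

print(clf.predict([[6, 5]]))
```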
• Random Forest is an ensemble classifier that builds many decision trees on random subsets of the data and features and combines their predictions, typically by majority vote.
• Example: Classifying whether a customer will churn or not by combining the predictions of many decision trees, each considering different aspects of the customer’s behavior.
• Advantages: Reduces overfitting, works well with large datasets, and handles both
classification and regression tasks.
• Disadvantages: Can be computationally intensive and less interpretable compared to a single
decision tree.
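A short Random Forest sketch with scikit-learn; the churn features are invented:

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical customer features: [monthly charges, support calls, tenure in months]
X = [[70, 5, 3], [30, 0, 40], [90, 7, 2], [40, 1, 30], [85, 6, 4], [25, 0, 50]]
y = [1, 0, 1, 0, 1, 0]   # 1 = churned

# 100 decision trees, each trained on a random sample of rows and features
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

print(forest.predict([[80, 4, 5]]))   # majority vote of the individual trees
```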
• Random Tree Classification is similar to Random Forest but typically builds a single tree (or far fewer trees), choosing a random subset of features to consider at each split. It is a variation of decision trees that reduces variance by introducing more randomness into the training process.
• Example: Classifying types of fruits based on features like color, weight, and texture using
random trees.
• Advantages: Fast and efficient for large datasets, less prone to overfitting compared to
traditional decision trees.
• Disadvantages: While it is faster than random forests, the performance may be slightly lower
since fewer trees are used.
**********************************************************************************
• K-means is a popular clustering algorithm that partitions data into "K" distinct clusters based
on similarity. The algorithm works by selecting "K" initial cluster centroids and then assigning
each data point to the nearest centroid. After that, the centroids are recalculated as the
mean of the points in each cluster, and the process is repeated until the centroids no longer
change.
• Example: Grouping customers into clusters based on their purchasing behavior, where each
group (cluster) might represent customers with similar interests or spending habits.
• Disadvantages: It requires the number of clusters ("K") to be specified in advance and may
not work well with non-spherical clusters.
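A minimal K-means sketch with scikit-learn; the customer numbers are made up:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical customer data: [annual spend, number of purchases]
X = np.array([[500, 5], [520, 6], [80, 1], [90, 2], [1000, 20], [980, 18]])

# Partition the customers into K = 3 clusters
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(labels)                   # cluster index assigned to each customer
print(kmeans.cluster_centers_)  # the final centroids
```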
o Agglomerative (bottom-up):
▪ Starts with each data point as its own cluster and then merges the closest clusters step-by-step.
o Divisive (top-down):
▪ Example: Starting with all customers in one group and progressively dividing them into smaller groups based on their purchasing patterns.
• Dendrogram:
o A dendrogram is a tree diagram that records the order in which clusters are merged (or split) and the distance at which each merge happens.
o Example: In the dendrogram, if two clusters are very close to each other on the tree, they are merged early. The height of the merge indicates how similar the clusters are.
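A short sketch of agglomerative clustering and its dendrogram using SciPy and Matplotlib (both assumed available), on the same invented customer data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Hypothetical customer data: [annual spend, number of purchases]
X = np.array([[500, 5], [520, 6], [80, 1], [90, 2], [1000, 20], [980, 18]])

# Agglomerative linkage: repeatedly merge the two closest clusters
Z = linkage(X, method="ward")

# The height of each merge in the dendrogram shows how dissimilar the clusters were
dendrogram(Z)
plt.show()
```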
5.3 Selecting Optimal Number of Clusters Using WCSS and Elbow Method
• Elbow Method:
o The Elbow Method helps find the optimal number of clusters ("K") by plotting the WCSS (within-cluster sum of squares: the sum of squared distances between each point and its cluster centroid) for different values of K. As K increases, WCSS generally decreases. However, after a certain point, the decrease becomes smaller, forming an "elbow" in the graph. The "elbow" point indicates the optimal K because adding more clusters beyond that doesn't significantly improve the compactness.
o Example: If you plot WCSS for K=1 to K=10 and see a sharp drop in WCSS up to K=3,
and then a much slower decrease, K=3 would be considered the optimal number of
clusters.
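A brief sketch of the Elbow Method, assuming scikit-learn and Matplotlib are available (scikit-learn exposes the WCSS of a fitted model as inertia_):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Same invented customer data as above
X = np.array([[500, 5], [520, 6], [80, 1], [90, 2], [1000, 20], [980, 18]])

# WCSS for K = 1..5
wcss = []
for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)

# Look for the "elbow" where the curve flattens out
plt.plot(range(1, 6), wcss, marker="o")
plt.xlabel("K")
plt.ylabel("WCSS")
plt.show()
```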
**********************************************************************************
1. Support:
o Support measures how frequently an itemset appears in the dataset: the number of transactions containing the itemset divided by the total number of transactions.
o Example: If you are analyzing a grocery store's transactions and you find that 200 out of 1000 transactions contain both bread and butter, the support for the combination of bread and butter is 200/1000 = 0.2 (20% of transactions).
2. Confidence:
o Confidence measures how often the rule A → B holds: the number of transactions containing both A and B divided by the number of transactions containing A.
o Example: If 80 transactions contain both bread and butter, and 100 transactions contain bread, the confidence of "bread → butter" is 80/100 = 0.8 (80% confidence that when bread is bought, butter is also bought).
3. Lift:
o Lift measures the strength of a rule by comparing the observed support of the rule
with the expected support if A and B were independent. Lift values greater than 1
indicate that the items are more likely to be bought together than by chance.
o Example: If the lift of the rule "bread → butter" is 1.2, this means that the
occurrence of both bread and butter together is 1.2 times more likely than if the
items were bought independently.
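A small pure-Python sketch of the three measures on invented transactions:

```python
# Hypothetical transactions for a small grocery store
transactions = [
    {"bread", "butter"}, {"bread", "butter", "jam"}, {"bread"},
    {"butter", "jam"}, {"bread", "butter"}, {"milk"},
]
n = len(transactions)

count_bread = sum("bread" in t for t in transactions)
count_butter = sum("butter" in t for t in transactions)
count_both = sum({"bread", "butter"} <= t for t in transactions)

support = count_both / n                # P(bread and butter)
confidence = count_both / count_bread   # P(butter | bread)
lift = confidence / (count_butter / n)  # confidence compared with chance

print(support, confidence, lift)        # 0.5, 0.75, 1.125
```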
• The Apriori Algorithm is a classic algorithm used to find association rules in a dataset. It
works by iteratively identifying frequent itemsets (combinations of items that appear
together frequently) and then generating rules based on those frequent itemsets. The key
idea is that if an itemset is frequent, all its subsets must also be frequent.
o Step 1: Identify the individual items that are frequently purchased (those with high
support).
o Step 2: Generate candidate itemsets of size 2 (pairs of items) and calculate their
support. Keep only the itemsets that meet the minimum support threshold.
o Step 3: Repeat the process for itemsets of increasing size (3 items, 4 items, etc.) until
no more frequent itemsets can be found.
o Step 4: Once frequent itemsets are found, generate association rules from these
itemsets. For example, if {bread, butter} is a frequent itemset, you can generate a
rule like {bread} → {butter} and calculate its confidence and lift.
2. Example:
o If we have a dataset of transactions with items like bread, butter, and jam, the
Apriori algorithm will identify frequent itemsets such as {bread, butter}, {butter,
jam}, and {bread, jam}, based on the support threshold. It will then generate rules
such as {bread} → {butter}, and calculate confidence and lift for each rule.
3. Advantages:
o It’s easy to implement and widely used for mining association rules in market basket
analysis, where you want to find patterns in customer purchases.
4. Disadvantages:
o The Apriori algorithm can be computationally expensive, especially when the dataset
contains many items or the minimum support threshold is low. It also requires
multiple passes over the data, which can be inefficient for large datasets.
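A compact sketch of the whole Apriori pipeline described above, using the third-party mlxtend library (assumed installed; its API may differ slightly between versions), with invented transactions:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [["bread", "butter"], ["bread", "butter", "jam"],
                ["butter", "jam"], ["bread", "jam"], ["bread", "butter"]]

# One-hot encode the transactions into a boolean DataFrame
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions),
                  columns=te.columns_)

# Frequent itemsets above the minimum support threshold
itemsets = apriori(df, min_support=0.4, use_colnames=True)

# Turn frequent itemsets into rules and report confidence and lift
rules = association_rules(itemsets, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```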
**********************************************************************************
• Upper Confidence Bound (UCB) is an algorithm for the multi-armed bandit problem that balances exploration and exploitation by favoring actions whose reward estimates are both high and uncertain.
• How it works:
o For each action (arm), UCB computes a confidence interval for the expected reward
based on previous actions and rewards. The agent then selects the action with the
highest upper bound (the action with the most potential for a high reward).
o This encourages the agent to explore less tried actions, while also exploiting those
that seem to give good rewards.
• Example:
o Imagine you are playing a slot machine with multiple arms (each with a different
probability of winning). The UCB algorithm would select the arm with the highest
potential reward based on the past outcomes, encouraging you to try new arms
when necessary but favoring those that have already provided good rewards.
• Advantages:
• Disadvantages:
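A compact sketch of the UCB selection rule described above, on a simulated three-armed bandit with invented win probabilities:

```python
import math
import random

# Hypothetical slot machines with unknown win probabilities
true_probs = [0.2, 0.5, 0.7]
counts = [0] * 3     # times each arm was pulled
rewards = [0.0] * 3  # total reward collected per arm

for t in range(1, 1001):
    # UCB score: estimated mean reward + confidence bonus for rarely tried arms
    ucb = [
        float("inf") if counts[a] == 0
        else rewards[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a])
        for a in range(3)
    ]
    arm = ucb.index(max(ucb))
    reward = 1.0 if random.random() < true_probs[arm] else 0.0
    counts[arm] += 1
    rewards[arm] += reward

print(counts)   # the arm with the highest true probability should be pulled most often
```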
• Thompson Sampling is another method used to solve the multi-armed bandit problem,
aiming to maximize cumulative rewards by balancing exploration and exploitation. It uses a
probabilistic approach to select actions based on a model of uncertainty (prior distributions)
for each action’s reward.
• How it works:
o For each arm, the algorithm maintains a probability distribution over the possible
rewards. It then samples from this distribution and chooses the arm with the highest
sampled value. This allows the agent to explore actions that are uncertain and
exploit actions that are more likely to yield a high reward.
• Example:
• Advantages:
o It is more natural and efficient for balancing exploration and exploitation compared
to other methods like UCB.
• Disadvantages:
o It can require maintaining complex distributions, and the sampling process can be
slower for high-dimensional problems.
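A small Thompson Sampling sketch on the same kind of simulated bandit, using Beta distributions as the model of uncertainty for Bernoulli (win/lose) rewards:

```python
import random

# Hypothetical slot machines with unknown win probabilities
true_probs = [0.2, 0.5, 0.7]
successes = [0] * 3  # Beta posterior parameters (start from Beta(1, 1))
failures = [0] * 3

for _ in range(1000):
    # Sample a plausible win rate for each arm from its Beta posterior
    samples = [random.betavariate(successes[a] + 1, failures[a] + 1)
               for a in range(3)]
    arm = samples.index(max(samples))
    if random.random() < true_probs[arm]:
        successes[arm] += 1
    else:
        failures[arm] += 1

print([successes[a] + failures[a] for a in range(3)])  # pulls per arm; the best arm should dominate
```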
7.3 Q-Learning
• Q-Learning is a model-free reinforcement learning algorithm in which an agent learns a Q-value for each state-action pair, estimating the total reward expected from taking that action in that state.
• How it works:
o The agent interacts with the environment and updates its Q-values based on the
rewards received after each action. The Q-value of a state-action pair is updated
using the Bellman equation, which is a recursive formula that combines immediate
rewards and the estimated future rewards from subsequent states.
o Over time, the agent learns the best action to take in each state to maximize the
total reward.
• Example:
o In a maze-solving task, Q-learning helps the agent learn which paths to take by
updating its knowledge of the best possible moves as it navigates the maze,
gradually learning the optimal strategy for reaching the goal.
• Advantages:
o Q-Learning is flexible and model-free: it is simple to understand, applies to many different problems, and with function approximation it can be extended beyond small, discrete state spaces.
• Disadvantages:
o It can require a lot of data and computational time to converge, especially for large
state spaces or environments with continuous actions.
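A minimal tabular Q-learning sketch on a tiny made-up corridor environment, showing the Bellman-style update:

```python
import random

# Tiny corridor environment: states 0..3, state 3 is the goal
n_states, n_actions = 4, 2          # actions: 0 = left, 1 = right
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.1, 0.9, 0.1

for _ in range(2000):
    s = 0
    while s != 3:
        # Epsilon-greedy action selection
        a = random.randrange(n_actions) if random.random() < epsilon \
            else Q[s].index(max(Q[s]))
        s_next = max(0, s - 1) if a == 0 else s + 1
        r = 1.0 if s_next == 3 else 0.0
        # Bellman update: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
        s = s_next

print(Q)   # in each non-goal state, the "right" action should end up with the higher value
```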
**********************************************************************************
• Artificial Neural Network (ANN) is a model made up of layers of interconnected nodes ("neurons"), loosely inspired by the structure of the brain, that learns by adjusting the weights of the connections between them.
• How it works:
o ANNs typically consist of three types of layers: an input layer, one or more hidden layers, and an output layer. Each neuron in a layer receives inputs, processes them with an activation function,
and passes the result to the next layer. The weights of connections between neurons
are adjusted during training to minimize errors and improve predictions.
• Example:
o In image recognition, the input layer receives pixel values, and the network learns to
identify patterns (e.g., edges, shapes) through the hidden layers, eventually
outputting the classification (e.g., “cat” or “dog”).
• Advantages:
o Highly flexible and powerful for complex tasks, like image and speech recognition.
• Disadvantages:
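A minimal sketch using scikit-learn's MLPClassifier as one possible ANN implementation; the two-feature "cat vs dog" data is invented:

```python
from sklearn.neural_network import MLPClassifier

# Hypothetical image-like inputs reduced to 2 features; labels: 0 = cat, 1 = dog
X = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.3], [0.15, 0.85], [0.85, 0.2]]
y = [0, 0, 1, 1, 0, 1]

# One hidden layer with 8 neurons; training adjusts the connection weights
# (via backpropagation) to minimise the prediction error
net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
net.fit(X, y)

print(net.predict([[0.9, 0.2]]))   # most likely class 1 ("dog")
```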
• Convolutional Neural Network (CNN) is a specialized type of ANN designed for processing
grid-like data, such as images. CNNs automatically detect important features in images
without needing manual feature extraction.
• How it works:
o CNNs use layers called convolutional layers to apply filters (kernels) to input images,
detecting patterns like edges and textures. The output from these layers is pooled
(reduced) using pooling layers to focus on the most important features. Finally, fully
connected layers make the final predictions based on the learned features.
• Example:
o In facial recognition, a CNN might first detect edges, then combine those edges into
more complex features like eyes, nose, and mouth, and finally classify the image as a
particular person.
• Advantages:
o Excellent for tasks involving images and videos, automatically detecting relevant
features.
• Disadvantages:
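A short CNN sketch assuming TensorFlow/Keras is installed; the "images" here are random stand-in data, so it only illustrates the layer structure:

```python
import numpy as np
from tensorflow.keras import layers, models

# Stand-in data: 100 random 28x28 grayscale "images" with 10 fake class labels
X = np.random.rand(100, 28, 28, 1).astype("float32")
y = np.random.randint(0, 10, size=100)

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(16, (3, 3), activation="relu"),   # filters detect edges/textures
    layers.MaxPooling2D((2, 2)),                    # pooling keeps the strongest responses
    layers.Flatten(),
    layers.Dense(32, activation="relu"),            # fully connected layer
    layers.Dense(10, activation="softmax"),         # one probability per class
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=2, verbose=0)

print(model.predict(X[:1]).shape)   # (1, 10): class probabilities for one image
```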
• Recurrent Neural Network (RNN) is a type of neural network designed to handle sequential
data, where the current input depends on previous inputs. RNNs have loops that allow
information to persist, making them suitable for tasks like language modeling and time-series
prediction.
• How it works:
o In an RNN, the output from one step is fed back as input to the next step, allowing
the network to have "memory" of previous inputs. This is useful for tasks like
predicting the next word in a sentence, where each word depends on the context of
the previous words.
• Example:
o In language translation, an RNN can be used to predict the next word in a sentence based on the previous words, such as translating "I love you" to "je t'aime" in French.
• Advantages:
o Good for sequential data like time series, speech, and text.
• Disadvantages:
o Can struggle with long sequences due to the vanishing gradient problem, where
earlier information gets "forgotten" as the sequence length increases.
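A short RNN sketch, again assuming TensorFlow/Keras is installed, predicting the next value of a sine wave from the previous ten time steps:

```python
import numpy as np
from tensorflow.keras import layers, models

# Toy sequence task: predict the next value of a sine wave from the last 10 steps
series = np.sin(np.linspace(0, 20 * np.pi, 1000))
X = np.array([series[i:i + 10] for i in range(990)])[..., None]  # shape (990, 10, 1)
y = series[10:]                                                  # the next value

model = models.Sequential([
    layers.Input(shape=(10, 1)),
    layers.SimpleRNN(16),   # loops over the 10 time steps, keeping a hidden state
    layers.Dense(1),        # predict the next value
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=2, verbose=0)

print(model.predict(X[:1]))   # prediction for the first window
```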
• As previously explained, Convolutional Neural Networks (CNNs) are designed for tasks involving spatial data like images: they use convolutional and pooling layers to automatically detect relevant features, and fully connected layers to make predictions. Recurrent Neural Networks (RNNs), by contrast, are suited to sequential data.
o Example: Predicting stock prices with an RNN, where each price depends on past prices and trends.