Machine Learning Notes (1)
Definition
Machine learning (ML) is a branch of artificial intelligence that enables systems to automatically learn and improve from experience without being explicitly programmed. It involves algorithms that can analyze data, identify patterns, and make decisions or predictions. The goal of ML is to create systems that can adapt and evolve as they are exposed to new data.
History
The history of machine learning dates back to the mid-20th century. In 1959, Arthur
Samuel coined the term "machine learning" while working on a program that could play
checkers. Over the decades, machine learning has evolved from basic algorithms to
complex deep learning models, thanks to advances in computing power, big data, and
mathematical theory. In recent years, it has become a core part of many modern technologies and applications.
Need
Machine learning is needed to cope with the growing volume, variety, and complexity of data. It is especially useful in areas where explicit rules are hard to define and insights must be drawn from very large datasets.
Features of Machine Learning
Key features of machine learning include the ability to learn from past experiences, automation of analytical model building, and continuous improvement over time. It relies on statistical methods and is capable of handling both structured and unstructured data.
Classification of Machine Learning
Machine learning is commonly divided into three broad types:
● Supervised Learning: In this type, the model is trained on labeled data. The
algorithm learns to map input data to known output labels. Examples include
classification and regression tasks.
● Unsupervised Learning: Here, the model is trained on unlabeled data and
attempts to find hidden patterns or structures. Common techniques include
clustering and dimensionality reduction.
● Reinforcement Learning: This is a type of learning where an agent interacts with
an environment and learns to take actions to maximize a reward signal. It is
commonly used in robotics and game-playing AI.
ML Lifecycle
The machine learning lifecycle covers defining the problem, collecting and preparing the data, selecting and training the model, evaluating it, and then deploying the model and monitoring its performance in the real world. This cycle is iterative and involves continuous feedback.
Applications
Machine learning is applied in many modern domains, such as autonomous vehicles, and it has transformed how businesses operate and how decisions are made.
Parametric vs Non-parametric Models
Parametric models assume a fixed functional form with a fixed number of parameters (e.g., linear regression). These models are generally simpler and faster to train. Non-parametric models, like decision trees or k-nearest neighbors, do not assume a fixed form and can grow more complex with data. They are more flexible but often require more data and more computation.
Bias-Variance Tradeoff
Bias refers to errors from overly simplistic models that fail to capture the underlying patterns (underfitting), while
variance refers to models that are too complex and sensitive to training data
(overfitting). A good model balances bias and variance to generalize well on unseen
data.
Underfitting
Underfitting occurs when a model is too simple to learn the underlying structure of the
data. It leads to poor performance on both training and test datasets. This can happen when the model lacks sufficient complexity or when important features are missing from the data.
Overfitting
Overfitting happens when a model learns the training data too well, including its noise
and outliers, and performs poorly on new data. It occurs when the model is too complex
or is trained for too long without proper regularization. Techniques like pruning, cross-validation, and early stopping can help prevent it.
Machine Learning vs. Statistical Modelling
Statistical models are often interpretable, whereas ML models, especially deep learning models, often behave as black boxes that trade interpretability for predictive power.
Loss Functions
Loss functions measure how well a model’s predictions match actual outcomes.
Common loss functions include Mean Squared Error (MSE) for regression and
Cross-Entropy Loss for classification. The choice of loss function affects how the model
learns and is optimized.
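As a rough illustration of these two losses, here is a small NumPy sketch (the arrays are invented):

import numpy as np

# Regression: Mean Squared Error between actual and predicted values
y_true = np.array([3.0, 5.0, 2.5])
y_pred = np.array([2.8, 5.4, 2.0])
mse = np.mean((y_true - y_pred) ** 2)

# Binary classification: cross-entropy between labels and predicted probabilities
labels = np.array([1, 0, 1])
probs = np.array([0.9, 0.2, 0.7])
cross_entropy = -np.mean(labels * np.log(probs) + (1 - labels) * np.log(1 - probs))

print(mse, cross_entropy)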
Model Tuning
Model tuning should stop when improvements are minimal and not worth the added complexity. Over-tuning can
lead to overfitting. Using techniques like early stopping can help automate this process.
Cross-Validation
Cross-validation is a technique used to assess the performance of a model more
reliably by dividing data into multiple subsets (folds). One popular method is k-fold
cross-validation, where the data is split into k parts and the model is trained and tested
k times, each time with a different fold used as the test set. It helps reduce the variability of performance estimates and gives a more reliable picture of how the model generalizes.
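A minimal scikit-learn sketch of 5-fold cross-validation (the iris dataset and logistic regression are just stand-ins for any data and model):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)              # example dataset
model = LogisticRegression(max_iter=1000)      # any estimator works here
scores = cross_val_score(model, X, y, cv=5)    # one score per fold
print(scores.mean(), scores.std())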
Grid Search
Grid Search is a systematic way to find the best combination of hyperparameters for a
machine learning model. It involves specifying a set of possible values for each hyperparameter, training and evaluating the model on every combination (often with cross-validation), and selecting the best-performing one.
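For instance, a hedged sketch with scikit-learn's GridSearchCV (the parameter grid below is illustrative, not prescriptive):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
param_grid = {"max_depth": [2, 4, 6], "min_samples_split": [2, 5, 10]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, cv=5)
search.fit(X, y)                                 # tries every combination with 5-fold CV
print(search.best_params_, search.best_score_)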
Dimensionality Reduction
Dimensionality reduction is a process used in machine learning to reduce the number of input
variables or features in a dataset. It helps in simplifying models, reducing computation time, and
removing noise or redundancy in data. By reducing dimensions, the data becomes easier to
visualize and interpret while retaining as much relevant information as possible. This technique
is especially useful when dealing with high-dimensional datasets, which can suffer from the
"curse of dimensionality."
Row and Column Vectors
In the context of linear algebra and machine learning, a row vector is a 1 × n matrix (a single
row with multiple columns), and a column vector is an n × 1 matrix (a single column with
multiple rows). Each vector can represent a set of features or data points. For example, in a
dataset, a row vector can represent one observation across multiple features, while a column
vector can represent one feature across all observations.
Data Matrix
A dataset can be represented as a data matrix, where rows represent individual samples (data
points), and columns represent features. If there are m samples and n features, the dataset
becomes an m × n matrix. This matrix form is useful because many machine learning
algorithms, especially those involving linear algebra (like PCA), are designed to operate on
matrices.
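For example (values invented), a dataset with m = 3 samples and n = 2 features is just a 3 × 2 array:

import numpy as np

# rows = samples, columns = features
X = np.array([[170.0, 65.0],
              [160.0, 55.0],
              [180.0, 80.0]])
print(X.shape)    # (3, 2): an m x n data matrix
print(X[0])       # first sample (a row vector)
print(X[:, 1])    # second feature across all samples (a column)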
Data Preprocessing
Data preprocessing is a crucial step in the ML pipeline where raw data is cleaned and
transformed into a usable format. It includes tasks such as handling missing values, encoding
categorical variables, and normalizing numerical values. Good preprocessing ensures that the
data fed into the model is accurate, consistent, and suitable for learning.
Feature Normalization
Feature normalization is the process of scaling the values of features so they fall within a
specific range (typically 0 to 1 or -1 to 1). This is important because different features may have
different scales, and unnormalized data can bias machine learning algorithms. Normalization
ensures that each feature contributes equally to the learning process, especially in algorithms
that use distance metrics.
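A small sketch of min-max scaling to the 0–1 range with NumPy (the feature matrix is made up):

import numpy as np

X = np.array([[25.0, 30000.0],
              [32.0, 45000.0],
              [40.0, 90000.0]])
X_min = X.min(axis=0)
X_max = X.max(axis=0)
X_norm = (X - X_min) / (X_max - X_min)   # every column now lies in [0, 1]
print(X_norm)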
Mean of a Data Matrix
The mean of a data matrix is calculated feature-wise, by averaging each column. This gives a
vector containing the average value of each feature across all samples. Subtracting the mean
from each element (centering the data) is a standard step in many preprocessing techniques,
such as PCA, to ensure that the data has zero mean.
Column Standardization
Column standardization (also called z-score normalization) transforms each feature in the
dataset so that it has a mean of 0 and a standard deviation of 1. This is done by subtracting the
column mean from each element and dividing by the column’s standard deviation.
Standardization is especially important for algorithms like PCA and k-means, which are sensitive
to the scale of features.
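Both steps, the column-wise mean/centering described above and z-score standardization, can be sketched directly in NumPy (same kind of invented matrix):

import numpy as np

X = np.array([[25.0, 30000.0],
              [32.0, 45000.0],
              [40.0, 90000.0]])
mean = X.mean(axis=0)                         # feature-wise (column) means
X_centered = X - mean                         # zero-mean data, as used in PCA
X_standardized = X_centered / X.std(axis=0)   # mean 0, std 1 per column
print(X_standardized.mean(axis=0), X_standardized.std(axis=0))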
Covariance Matrix
The covariance matrix captures how much the features vary with respect to each other. For an
n-dimensional dataset, the covariance matrix is an n × n matrix, where each element (i, j)
indicates the covariance between feature i and feature j. A high positive value means the
features increase together, a negative value means they vary inversely, and zero means no
linear relationship.
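With NumPy, the covariance matrix of a data matrix whose rows are samples can be computed as follows (sketch):

import numpy as np

X = np.array([[170.0, 65.0],
              [160.0, 55.0],
              [180.0, 80.0]])
cov = np.cov(X, rowvar=False)   # rowvar=False: treat columns as features
print(cov)                      # 2 x 2 covariance matrix for 2 features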
Principal Component Analysis (PCA)
PCA is one of the most widely used techniques for dimensionality reduction. It transforms the
original features into a new set of uncorrelated features called principal components, ordered
by the amount of variance they explain. PCA identifies the directions (components) in which the
data varies the most and projects the data onto those directions. By keeping only the top few
principal components, we can reduce the dimensionality of the data while retaining most of its
important information. PCA helps in visualization, noise reduction, and improving model
efficiency.
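A minimal scikit-learn sketch that standardizes the data and keeps the top two principal components (the iris dataset is only an example):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # column standardization first
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape)                   # (150, 2)
print(pca.explained_variance_ratio_)     # variance explained by each component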
Supervised Learning
Supervised learning is a type of machine learning where the algorithm is trained on a labeled
dataset. This means that each training example is paired with an output label. The goal of
supervised learning is to learn a mapping from inputs to outputs so that the model can predict
the output for new, unseen data. It is called "supervised" because the learning process is guided
by the correct answers provided during training.
Supervised learning begins with a dataset that includes both input features (independent
variables) and corresponding output labels (dependent variables). The algorithm analyzes the
training data and learns a function that maps the inputs to the correct outputs. This function is
then used to predict outputs for new inputs. The accuracy of the predictions is evaluated using
performance metrics such as accuracy, precision, recall, or mean squared error, depending on
the problem type (classification or regression). The model improves over time through
optimization techniques that minimize the difference between predicted and actual values.
Supervised learning can be broadly categorized into two types based on the type of output:
● Classification: When the output variable is categorical (e.g., spam or not spam).
● Regression: When the output variable is continuous (e.g., predicting house prices).
k-Nearest Neighbors (k-NN)
k-NN is a simple and intuitive classification algorithm. It classifies a new data point based on the
majority label among its k closest neighbors in the training data. The closeness is usually
measured using distance metrics like Euclidean distance. It is non-parametric and lazy, meaning
it doesn’t learn a model during training but rather memorizes the training data and classifies only
at prediction time. While k-NN is easy to implement, it can be slow with large datasets.
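A short sketch with scikit-learn's KNeighborsClassifier and k = 3 (example dataset):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
knn = KNeighborsClassifier(n_neighbors=3)   # k = 3 nearest neighbors
knn.fit(X_train, y_train)                   # "training" just stores the data
print(knn.score(X_test, y_test))            # accuracy on unseen data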
Naïve Bayes
Naïve Bayes is a probabilistic classifier based on Bayes’ Theorem. It assumes that all features
are independent of each other given the class label, which is rarely true in practice but works
surprisingly well. It calculates the probability of each class given the input features and selects
the class with the highest probability. It is especially effective for text classification problems like
spam detection and sentiment analysis due to its simplicity and speed.
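A hedged sketch of a text classifier with scikit-learn's MultinomialNB (the tiny corpus and labels are invented):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win a free lottery now", "meeting at noon tomorrow",
         "free prize win cash", "project update attached"]
labels = [1, 0, 1, 0]                    # 1 = spam, 0 = not spam
vec = CountVectorizer()
X = vec.fit_transform(texts)             # word-count features
clf = MultinomialNB().fit(X, labels)
print(clf.predict(vec.transform(["free lottery win"])))   # likely [1], i.e. spam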
Decision Trees
Decision Trees are tree-like models where each internal node represents a decision on a
feature, each branch represents an outcome of that decision, and each leaf node represents a
final class label or value. They work by recursively splitting the data based on feature values to
maximize information gain or reduce impurity (e.g., using Gini index or entropy). Decision Trees
are easy to interpret and visualize, but they can overfit the data if not properly pruned.
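A small scikit-learn sketch (Gini impurity is the default splitting criterion; limiting the depth is one simple way to curb overfitting):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)
print(export_text(tree))   # human-readable view of the learned splits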
Linear Regression
Linear regression is a supervised learning algorithm for predicting a continuous output as a weighted sum of the input features plus a bias term. The weights are typically learned by minimizing the Mean Squared Error between predicted and actual values.
Logistic Regression
Despite its name, logistic regression is a classification algorithm. It passes a linear combination of the features through the sigmoid function to produce a probability between 0 and 1, and assigns the class by thresholding that probability.
Support Vector Machines (SVM)
SVM is a powerful classification algorithm that finds the best boundary (called the hyperplane)
that separates data points of different classes. It tries to maximize the margin between the two
classes. SVMs are effective in high-dimensional spaces and work well when there is a clear
margin of separation between classes. They can also handle non-linear classification using
kernel tricks, which map the input data into higher dimensions.
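A minimal sketch of an SVM with the RBF kernel in scikit-learn (standardizing features first, since SVMs are sensitive to feature scale):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))  # kernel trick for non-linear boundaries
model.fit(X_train, y_train)
print(model.score(X_test, y_test))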
Unsupervised Learning
Unsupervised learning is a type of machine learning where the model learns from data that has
no labels. The algorithm tries to find hidden patterns or structures within the data on its own. It is
commonly used for clustering, anomaly detection, and dimensionality reduction. The key goal is
to group similar data points or reduce the complexity of data while preserving important
relationships.
Clustering: K-means
K-means is a popular unsupervised learning algorithm used for clustering data into k groups. It
works by randomly selecting k centroids (initial cluster centers), assigning each data point to the
nearest centroid, and then updating the centroids based on the average position of the assigned
points. This process is repeated until the centroids stabilize. K-means is simple and efficient but
can be sensitive to the initial selection of centroids and does not work well with non-spherical
clusters.
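A quick scikit-learn sketch on invented 2-D points that form two rough groups:

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.0], [8.5, 7.8], [7.9, 8.3]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(labels)                   # cluster index assigned to each point
print(kmeans.cluster_centers_)  # final centroids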
Ensemble Methods
Ensemble methods combine multiple models to improve overall performance. The idea is that a
group of weak learners can come together to form a strong learner. Common ensemble
methods include:
● Boosting: Boosting builds models sequentially, where each new model tries to correct
the errors made by the previous ones. Algorithms like AdaBoost and Gradient Boosting
are examples. Boosting focuses on difficult examples and often results in high accuracy
but may overfit if not regularized properly.
● Random Forests: A Random Forest is an ensemble of decision trees, where each tree
is trained on a different subset of the data and a random subset of features. The final
prediction is made by aggregating the outputs of all trees (e.g., majority vote for
classification). Random Forests are robust, accurate, and can handle missing data well.
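As a rough sketch, these two ensemble styles can be compared with scikit-learn on an example dataset (the dataset choice is arbitrary):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=0)   # bagging-style ensemble of trees
gb = GradientBoostingClassifier(random_state=0)                 # sequential boosting
print(cross_val_score(rf, X, y, cv=5).mean())
print(cross_val_score(gb, X, y, cv=5).mean())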
Dimensionality Reduction Techniques
Common techniques for reducing the number of features include:
● Principal Component Analysis (PCA): PCA transforms data into new coordinates that maximize variance, selecting the top components to reduce dimensions while preserving most of the information.
● Linear Discriminant Analysis (LDA): LDA is used primarily for classification tasks. Unlike PCA, which is unsupervised, LDA is supervised and tries to find feature combinations (linear discriminants) that best separate the classes.
● Singular Value Decomposition (SVD): SVD decomposes a matrix into three other
matrices and is widely used in dimensionality reduction, especially in text processing and
recommendation systems. It helps in reducing noise and compressing data.
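A minimal NumPy sketch of SVD on an invented matrix, including a best rank-1 reconstruction:

import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0],
              [10.0, 11.0, 12.0]])
U, S, Vt = np.linalg.svd(A, full_matrices=False)
print(S)                                     # singular values, largest first
A_rank1 = S[0] * np.outer(U[:, 0], Vt[0])    # keep only the top singular value
print(np.round(A_rank1, 2))                  # compressed (rank-1) approximation of A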
Accuracy
Accuracy measures the proportion of correct predictions out of total predictions. It is simple and
intuitive but may be misleading in imbalanced datasets (where one class dominates).
Confusion Matrix
A confusion matrix is a table that shows the number of true positives (TP), true negatives (TN),
false positives (FP), and false negatives (FN). It provides a detailed breakdown of model
performance beyond just accuracy.
Precision and Recall
● Precision = TP / (TP + FP): It measures how many predicted positives are actually
correct.
● Recall = TP / (TP + FN): It measures how many actual positives were correctly
predicted.
Precision is useful when false positives are costly; recall is critical when missing
positives is more dangerous.
F1-score
The F1-score is the harmonic mean of precision and recall. It balances the two and is especially
useful when there is an uneven class distribution.
F1 = 2 * (Precision * Recall) / (Precision + Recall)
ROC Curve and AUC
● The Receiver Operating Characteristic (ROC) curve plots the true positive rate (recall)
against the false positive rate.
● AUC (Area Under the Curve) measures the area under the ROC curve. A model with an
AUC close to 1 performs well, while 0.5 indicates random guessing.
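A tiny scikit-learn sketch of AUC on made-up scores (0.5 would mean random guessing):

from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]               # actual labels
y_scores = [0.1, 0.4, 0.35, 0.8]    # predicted probabilities for the positive class
print(roc_auc_score(y_true, y_scores))   # 0.75 here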
Median Absolute Deviation (MAD)
MAD is a robust measure of variability. It is the median of the absolute differences between
each data point and the median of the dataset. Unlike standard deviation, MAD is less affected
by outliers, making it valuable in robust regression and error analysis.
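A tiny NumPy sketch showing how MAD resists an outlier that inflates the standard deviation (numbers invented):

import numpy as np

errors = np.array([1.0, 1.2, 0.8, 1.1, 15.0])   # one large outlier
median = np.median(errors)
mad = np.median(np.abs(errors - median))
print(mad)               # 0.1: barely affected by the outlier
print(np.std(errors))    # much larger, pulled up by the outlier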
Distribution of Errors
The distribution of errors refers to the analysis of how model prediction errors (difference
between predicted and actual values) are spread. Ideally, errors should be randomly distributed
with a mean close to zero. Analyzing the distribution helps identify patterns such as underfitting,
overfitting, or bias in the model.
✅ UNIT-I:
Definition (Example)
A spam filter in your email inbox learns from thousands of labeled emails ("spam" or "not spam")
to automatically classify new emails.
History (Example)
In 1959, Arthur Samuel developed a program that played checkers and improved by playing
against itself – one of the first ML applications.
Need (Example)
Netflix uses ML to recommend shows based on what you've already watched, improving user
experience and engagement.
Features of ML (Example)
Self-learning: Google Translate improves over time by learning from translations worldwide.
Classification of ML (Example)
● Supervised Learning: Predicting house prices using labeled data with features like size, location, and past prices.
● Unsupervised Learning: Grouping customers into segments based on purchasing behavior, without any labels.
● Reinforcement Learning: A robot learning to walk by trial and error, receiving rewards
for each successful step.
ML Lifecycle (Example)
Building a spam filter end to end: collect and label emails, train a classifier, evaluate it, deploy it in the inbox, and keep monitoring and retraining it as new email arrives.
Applications (Examples)
Email spam filtering, Netflix-style recommendations, and autonomous vehicles are everyday applications of ML.
Parametric vs Non-parametric (Example)
● Parametric: Linear regression, which assumes a fixed functional form with a fixed number of parameters.
● Non-parametric: k-NN, which adapts its complexity to the dataset and doesn’t assume a specific function form.
Bias and Variance (Example)
● High bias (Underfitting): Predicting all house prices as ₹50L – too simple.
● High variance (Overfitting): Memorizing the exact prices for training data but failing on
new data.
Loss Functions (Example)
● MSE (Mean Squared Error): Measures how far off predictions are in regression problems.
Cross-validation (Example)
K-fold cross-validation: Divides data into 5 parts, trains on 4 and tests on 1, repeating 5 times
for better accuracy estimate.
Grid Search (Example)
Testing various combinations of hyperparameters, such as learning rate and tree depth in a gradient-boosted tree model, to find the best performing one.
✅ UNIT-II:
Row & Column Vector (Example)
● Column vector:
[ 20 ]
[ 5.6 ]
[ 60 ]
● Row vectors (one observation per row across its features):
[ 25, 30000, 5.9 ]
[ 32, 45000, 6.1 ]
Feature Normalization (Example)
Income: ₹10K to ₹1L → Normalize to 0–1 range so it doesn't overpower smaller features like
age.
Column Standardization (Example)
Standardize test scores so each subject (column) has mean 0 and std deviation 1 before
applying algorithms.
Covariance Matrix (Example)
If height and weight have high positive covariance, taller people tend to weigh more.
PCA (Example)
In a dataset of 10 features, PCA might reduce it to 2 components that explain 95% of the variance, making it easier to visualize.
✅ UNIT-III:
k-NN (Example)
To classify a new fruit as an apple or orange, k-NN looks at its 3 nearest neighbors (based on
size, color) and picks the majority label.
Naïve Bayes (Example)
In spam filtering, if an email has the words "free", "win", and "lottery", Naïve Bayes uses the probabilities of each word appearing in spam to predict whether the email is spam.
Decision Tree (Example)
In an approval decision, each internal node asks a yes/no question about the applicant, and the branches of one such split lead to leaf decisions:
■ No → Approve
■ Yes → Decline
Logistic Regression (Example)
Predict if a student will pass (yes/no) based on study hours, using the sigmoid function to output probabilities.
SVM (Example)
Classifying emails as spam or not by finding the best dividing line (hyperplane) that separates
the two classes with maximum margin.
✅ UNIT-IV:
K-Means Clustering (Example)
Customer segmentation: Grouping people into clusters based on their spending habits without
any labels.
Boosting (Example)
A model first predicts wrongly that a person won't default on a loan. The next model focuses
more on this error, improving the final combined output.
Bagging (Example)
Random Forest creates different decision trees using random subsets of training data and
averages the results to reduce overfitting.
Random Forest (Example)
Predicting diabetes risk using multiple decision trees built from different patient samples,
combining their predictions.
PCA (Example)
Reduce a 100-feature image dataset to 10 features while retaining the most important
information.
LDA (Example)
Used in face recognition: separates images of different people by finding the directions that best
distinguish classes.
ICA (Example)
Separating mixed audio signals into individual speakers in a recording (blind source separation).
SVD (Example)
Used in recommender systems like Netflix: reduces movie rating matrix to uncover patterns in
user preferences.
Confusion Matrix (Example)
TP: 50, FP: 5
FN: 10, TN: 35
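Plugging these counts into the metric formulas above gives accuracy 0.85, precision ≈ 0.91, recall ≈ 0.83, and F1 ≈ 0.87, for example:

TP, FP, FN, TN = 50, 5, 10, 35
accuracy = (TP + TN) / (TP + FP + FN + TN)          # 85 / 100 = 0.85
precision = TP / (TP + FP)                          # 50 / 55 ≈ 0.91
recall = TP / (TP + FN)                             # 50 / 60 ≈ 0.83
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.87
print(accuracy, precision, recall, f1)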