Machine Learning Notes

The document provides a comprehensive overview of machine learning, covering its definition, history, and the necessity for ML in handling complex data. It details various types of machine learning, including supervised, unsupervised, and reinforcement learning, along with their applications and algorithms. Additionally, it discusses key concepts such as dimensionality reduction, model evaluation, and the importance of data preprocessing in the machine learning pipeline.

UNIT-I: Machine Learning:

Machine Learning: Definition


Machine Learning (ML) is a branch of artificial intelligence that allows systems to automatically learn and improve from experience without being explicitly programmed. It involves algorithms that can analyze data, identify patterns, and make decisions or predictions. The goal of ML is to create systems that can adapt and evolve as they are exposed to new data.

History
The history of machine learning dates back to the mid-20th century. In 1959, Arthur Samuel coined the term "machine learning" while working on a program that could play checkers. Over the decades, machine learning has evolved from basic algorithms to complex deep learning models, thanks to advances in computing power, big data, and mathematical theory. In recent years, it has become a core part of many modern technologies, including search engines, recommendation systems, and self-driving cars.

Need for Machine Learning


Machine learning is essential because traditional programming cannot handle the growing volume, variety, and complexity of data. It is especially useful in areas where designing explicit rules is too difficult or impossible. ML helps in automating tasks, improving accuracy, enabling personalization, and finding hidden patterns in large datasets.

Features of Machine Learning


Key features of machine learning include data-driven decision-making, the ability to learn from past experiences, automation of analytical model building, and continuous improvement over time. It relies on statistical methods and is capable of handling both structured and unstructured data.

Classification of Machine Learning


Machine learning can be classified into three main types:

●​ Supervised Learning: In this type, the model is trained on labeled data. The
algorithm learns to map input data to known output labels. Examples include
classification and regression tasks.
●​ Unsupervised Learning: Here, the model is trained on unlabeled data and
attempts to find hidden patterns or structures. Common techniques include
clustering and dimensionality reduction.
●​ Reinforcement Learning: This is a type of learning where an agent interacts with
an environment and learns to take actions to maximize a reward signal. It is
commonly used in robotics and game-playing AI.

Machine Learning Life Cycle


The machine learning life cycle consists of several stages: understanding the problem, collecting and preparing the data, selecting and training the model, evaluating performance, tuning hyperparameters, deploying the model, and monitoring its performance in the real world. This cycle is iterative and involves continuous feedback.

Applications of Machine Learning


ML is widely used across industries. Some common applications include spam detection in emails, product recommendations on e-commerce platforms, facial recognition, medical diagnosis, financial forecasting, language translation, and autonomous vehicles. It has transformed how businesses operate and how decisions are made.

Parametric vs. Non-Parametric Models


Parametric models make assumptions about the data and use a fixed number of parameters (e.g., linear regression). These models are generally simpler and faster to train. Non-parametric models, like decision trees or k-nearest neighbors, do not assume a fixed form and can grow more complex with data. They are more flexible but often require more data to perform well.

Learning Theory – Bias/Variance Tradeoff


The bias-variance tradeoff is a fundamental concept in ML. Bias refers to errors due to overly simplistic models that fail to capture the underlying patterns (underfitting), while variance refers to errors from models that are too complex and too sensitive to the training data (overfitting). A good model balances bias and variance so that it generalizes well to unseen data.

Underfitting
Underfitting occurs when a model is too simple to learn the underlying structure of the data. It leads to poor performance on both training and test datasets. This can happen due to insufficient training, overly simplistic algorithms, or lack of relevant features.

Overfitting
Overfitting happens when a model learns the training data too well, including its noise and outliers, and performs poorly on new data. It occurs when the model is too complex or is trained for too long without proper regularization. Techniques like pruning, regularization, and using more data can help prevent overfitting.

Major Differences Between Statistical Modelling and Machine Learning

While both approaches deal with data, statistical modeling is more focused on inference—understanding relationships between variables and drawing conclusions. Machine learning emphasizes prediction and performance on new, unseen data. Statistical models are often interpretable, whereas ML models, especially deep learning, may act like “black boxes.”

Steps in Machine Learning Model Development


Developing a machine learning model involves several steps:

1. Define the problem and goals.
2. Collect and preprocess data.
3.​ Choose an appropriate algorithm.
4.​ Train the model using training data.
5.​ Validate and tune the model using validation data.
6.​ Evaluate the model with test data.
7.​ Deploy the model and monitor performance.

Machine Learning Losses

Loss functions measure how well a model’s predictions match actual outcomes.
Common loss functions include Mean Squared Error (MSE) for regression and
Cross-Entropy Loss for classification. The choice of loss function affects how the model
learns and is optimized.
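
As an illustration of how these two losses are computed, here is a minimal NumPy sketch with made-up values (not tied to any particular model):

import numpy as np

# Mean Squared Error for a regression example (hypothetical values)
y_true = np.array([3.0, 5.0, 2.5])
y_pred = np.array([2.8, 5.4, 2.1])
mse = np.mean((y_true - y_pred) ** 2)          # average of squared errors

# Binary cross-entropy for a classification example (hypothetical probabilities)
y_label = np.array([1, 0, 1])
p = np.array([0.9, 0.2, 0.7])
cross_entropy = -np.mean(y_label * np.log(p) + (1 - y_label) * np.log(1 - p))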

When to Stop Tuning Machine Learning Models


Tuning should stop when performance on the validation set no longer improves, or when improvements are minimal and not worth the added complexity. Over-tuning can lead to overfitting. Using techniques like early stopping can help automate this process.

Train, Validation, and Test Data

● Training data is used to teach the model.
● Validation data is used to fine-tune the model and select hyperparameters.
● Test data is used to assess the model’s performance on unseen data.

Proper separation of these sets ensures reliable and unbiased evaluation.

Cross-Validation
Cross-validation is a technique used to assess the performance of a model more reliably by dividing data into multiple subsets (folds). One popular method is k-fold cross-validation, where the data is split into k parts and the model is trained and tested k times, each time with a different fold used as the test set. It helps reduce variability and ensures robust evaluation.
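
A minimal sketch of 5-fold cross-validation with scikit-learn (X and y are placeholders for a feature matrix and label vector assumed to be already loaded):

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)
# cv=5 splits the data into 5 folds; each fold serves once as the test set
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean(), scores.std())   # average accuracy and its variability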

Grid Search
Grid Search is a systematic way to find the best combination of hyperparameters for a machine learning model. It involves specifying a set of possible values for each hyperparameter and training models for every combination. Though time-consuming, it is useful for optimizing model performance.
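
A minimal GridSearchCV sketch with scikit-learn (the parameter grid values are arbitrary examples; X and y are assumed to be loaded):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5)   # tries every combination with 5-fold CV
search.fit(X, y)
print(search.best_params_, search.best_score_)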



UNIT-II: Dimensionality Reduction and Data Representation in Machine Learning:

Dimensionality Reduction: Definition

Dimensionality reduction is a process used in machine learning to reduce the number of input
variables or features in a dataset. It helps in simplifying models, reducing computation time, and
removing noise or redundancy in data. By reducing dimensions, the data becomes easier to
visualize and interpret while retaining as much relevant information as possible. This technique
is especially useful when dealing with high-dimensional datasets, which can suffer from the
"curse of dimensionality."

Row Vector and Column Vector

In the context of linear algebra and machine learning, a row vector is a 1 × n matrix (a single
row with multiple columns), and a column vector is an n × 1 matrix (a single column with
multiple rows). Each vector can represent a set of features or data points. For example, in a
dataset, a row vector can represent one observation across multiple features, while a column
vector can represent one feature across all observations.
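
For instance, written with NumPy (a small illustrative sketch):

import numpy as np

row = np.array([[20, 5.6, 60]])      # 1 x 3 row vector: one observation, three features
col = np.array([[20], [5.6], [60]])  # 3 x 1 column vector: one feature across three samples
print(row.shape, col.shape)          # (1, 3) (3, 1)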

How to Represent a Dataset

A dataset in machine learning is typically a collection of samples or observations, each described by a set of features. It can be thought of as a table where rows correspond to different samples (like different people or transactions), and columns correspond to features (like age, income, etc.). Each cell in the table contains a value of a feature for a particular sample.

How to Represent a Dataset as a Matrix

A dataset can be represented as a data matrix, where rows represent individual samples (data
points), and columns represent features. If there are m samples and n features, the dataset
becomes an m × n matrix. This matrix form is useful because many machine learning
algorithms, especially those involving linear algebra (like PCA), are designed to operate on
matrices.

Data Preprocessing in Machine Learning

Data preprocessing is a crucial step in the ML pipeline where raw data is cleaned and
transformed into a usable format. It includes tasks such as handling missing values, encoding
categorical variables, and normalizing numerical values. Good preprocessing ensures that the
data fed into the model is accurate, consistent, and suitable for learning.

Feature Normalization

Feature normalization is the process of scaling the values of features so they fall within a
specific range (typically 0 to 1 or -1 to 1). This is important because different features may have
different scales, and unnormalized data can bias machine learning algorithms. Normalization
ensures that each feature contributes equally to the learning process, especially in algorithms
that use distance metrics.
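
A minimal min-max scaling sketch with scikit-learn (the small age/income matrix is invented purely for illustration):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[25, 30000], [32, 45000], [41, 52000]])  # columns: age, income
X_scaled = MinMaxScaler().fit_transform(X)             # each column rescaled to [0, 1]
print(X_scaled)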

Mean of a Data Matrix

The mean of a data matrix is calculated feature-wise—by averaging each column. This gives a
vector containing the average value of each feature across all samples. Subtracting the mean
from each element (centering the data) is a standard step in many preprocessing techniques,
such as PCA, to ensure that the data has zero mean.

Column Standardization

Column standardization (also called z-score normalization) transforms each feature in the
dataset so that it has a mean of 0 and a standard deviation of 1. This is done by subtracting the
column mean from each element and dividing by the column’s standard deviation.
Standardization is especially important for algorithms like PCA and k-means, which are sensitive
to the scale of features.
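
The same transformation written directly in NumPy (an illustrative sketch; scikit-learn's StandardScaler performs the equivalent operation):

import numpy as np

X = np.array([[25, 30000], [32, 45000], [41, 52000]], dtype=float)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)   # subtract column mean, divide by column std
print(X_std.mean(axis=0), X_std.std(axis=0))   # approximately 0 and 1 per column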

Covariance of a Data Matrix

The covariance matrix captures how much the features vary with respect to each other. For an
n-dimensional dataset, the covariance matrix is an n × n matrix, where each element (i, j)
indicates the covariance between feature i and feature j. A high positive value means the
features increase together, a negative value means they vary inversely, and zero means no
linear relationship.
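
A minimal NumPy sketch (illustrative height/weight values):

import numpy as np

X = np.array([[160, 55], [170, 65], [180, 80]], dtype=float)  # rows = people, columns = height, weight
cov = np.cov(X, rowvar=False)   # rowvar=False treats columns as features
print(cov)                      # 2 x 2 covariance matrix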

Principal Component Analysis (PCA) for Dimensionality Reduction

PCA is one of the most widely used techniques for dimensionality reduction. It transforms the
original features into a new set of uncorrelated features called principal components, ordered
by the amount of variance they explain. PCA identifies the directions (components) in which the
data varies the most and projects the data onto those directions. By keeping only the top few
principal components, we can reduce the dimensionality of the data while retaining most of its
important information. PCA helps in visualization, noise reduction, and improving model
efficiency.
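
A minimal PCA sketch with scikit-learn (X is a placeholder for an m × n data matrix assumed to be loaded and standardized):

from sklearn.decomposition import PCA

pca = PCA(n_components=2)             # keep the top 2 principal components
X_reduced = pca.fit_transform(X)      # project the data onto those components
print(pca.explained_variance_ratio_)  # fraction of variance each component explains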

UNIT-III: Supervised Learning in Machine Learning:

Supervised Learning: Definition

Supervised learning is a type of machine learning where the algorithm is trained on a labeled
dataset. This means that each training example is paired with an output label. The goal of
supervised learning is to learn a mapping from inputs to outputs so that the model can predict
the output for new, unseen data. It is called "supervised" because the learning process is guided
by the correct answers provided during training.

How Supervised Learning Works

Supervised learning begins with a dataset that includes both input features (independent
variables) and corresponding output labels (dependent variables). The algorithm analyzes the
training data and learns a function that maps the inputs to the correct outputs. This function is
then used to predict outputs for new inputs. The accuracy of the predictions is evaluated using
performance metrics such as accuracy, precision, recall, or mean squared error, depending on
the problem type (classification or regression). The model improves over time through
optimization techniques that minimize the difference between predicted and actual values.

Types of Supervised Learning Algorithms

Supervised learning can be broadly categorized into two types based on the type of output:

●​ Classification: When the output variable is categorical (e.g., spam or not spam).​

●​ Regression: When the output variable is continuous (e.g., predicting house prices).​

Several popular supervised learning algorithms fall under these categories:

k-Nearest Neighbours (k-NN)

k-NN is a simple and intuitive classification algorithm. It classifies a new data point based on the
majority label among its k closest neighbors in the training data. The closeness is usually
measured using distance metrics like Euclidean distance. It is non-parametric and lazy, meaning
it doesn’t learn a model during training but rather memorizes the training data and classifies only
at prediction time. While k-NN is easy to implement, it can be slow with large datasets.
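
A minimal scikit-learn sketch (X_train, y_train, X_test are placeholders for data assumed to be loaded):

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=3)   # k = 3, Euclidean distance by default
knn.fit(X_train, y_train)                   # "training" just stores the data
predictions = knn.predict(X_test)           # majority vote among the 3 nearest neighbours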

Naïve Bayes

Naïve Bayes is a probabilistic classifier based on Bayes’ Theorem. It assumes that all features
are independent of each other given the class label, which is rarely true in practice but works
surprisingly well. It calculates the probability of each class given the input features and selects
the class with the highest probability. It is especially effective for text classification problems like
spam detection and sentiment analysis due to its simplicity and speed.
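
A minimal text-classification sketch with scikit-learn (the tiny two-message corpus is invented purely for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win a free lottery now", "meeting agenda for monday"]
labels = [1, 0]                              # 1 = spam, 0 = not spam
X = CountVectorizer().fit_transform(texts)   # word-count features
model = MultinomialNB().fit(X, labels)       # learns per-class word probabilities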

Decision Trees

Decision Trees are tree-like models where each internal node represents a decision on a
feature, each branch represents an outcome of that decision, and each leaf node represents a
final class label or value. They work by recursively splitting the data based on feature values to
maximize information gain or reduce impurity (e.g., using Gini index or entropy). Decision Trees
are easy to interpret and visualize, but they can overfit the data if not properly pruned.
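
A minimal scikit-learn sketch (X_train, y_train, X_test are placeholders; limiting max_depth is one simple way to reduce overfitting):

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(criterion="gini", max_depth=3)  # split on Gini impurity, limited depth
tree.fit(X_train, y_train)
predictions = tree.predict(X_test)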

Linear Regression

Linear Regression is a regression algorithm used to model the relationship between a dependent variable and one or more independent variables by fitting a straight line (in the form y = mx + c). It assumes a linear relationship between the variables. It is simple and effective for continuous outcome prediction but does not work well when the relationship is non-linear or when there are many correlated features.
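
A minimal scikit-learn sketch (the size/price numbers are invented for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

sizes = np.array([[800], [1000], [1200]])             # house size in square feet
prices = np.array([4_200_000, 5_000_000, 5_900_000])  # hypothetical prices
model = LinearRegression().fit(sizes, prices)          # fits price = m * size + c
print(model.coef_, model.intercept_)                   # slope m and intercept c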

Logistic Regression

Logistic Regression is a classification algorithm used to predict binary or multi-class outcomes. It estimates the probability that a given input belongs to a particular class using the logistic (sigmoid) function. Despite the name, it is used for classification tasks. It is widely used for binary classification problems like predicting whether a customer will buy a product or not.
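
A minimal sketch with hypothetical study-hours data (scikit-learn applies the sigmoid internally):

import numpy as np
from sklearn.linear_model import LogisticRegression

hours = np.array([[1], [2], [3], [5], [6], [8]])   # hours studied
passed = np.array([0, 0, 0, 1, 1, 1])              # 0 = fail, 1 = pass
clf = LogisticRegression().fit(hours, passed)
print(clf.predict_proba([[4]]))   # probability of fail/pass for 4 hours of study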

Support Vector Machines (SVM)

SVM is a powerful classification algorithm that finds the best boundary (called the hyperplane)
that separates data points of different classes. It tries to maximize the margin between the two
classes. SVMs are effective in high-dimensional spaces and work well when there is a clear
margin of separation between classes. They can also handle non-linear classification using
kernel tricks, which map the input data into higher dimensions.
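
A minimal scikit-learn sketch (X_train, y_train, X_test are placeholders; the RBF kernel handles non-linear boundaries):

from sklearn.svm import SVC

svm = SVC(kernel="rbf", C=1.0)   # kernel trick for non-linear separation; C controls margin softness
svm.fit(X_train, y_train)
predictions = svm.predict(X_test)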

UNIT-IV: Unsupervised Learning, Ensemble Methods, Dimensionality Reduction, and Model Evaluation in Machine Learning:

Unsupervised Learning: Definition

Unsupervised learning is a type of machine learning where the model learns from data that has
no labels. The algorithm tries to find hidden patterns or structures within the data on its own. It is
commonly used for clustering, anomaly detection, and dimensionality reduction. The key goal is
to group similar data points or reduce the complexity of data while preserving important
relationships.

Clustering: K-means

K-means is a popular unsupervised learning algorithm used for clustering data into k groups. It
works by randomly selecting k centroids (initial cluster centers), assigning each data point to the
nearest centroid, and then updating the centroids based on the average position of the assigned
points. This process is repeated until the centroids stabilize. K-means is simple and efficient but
can be sensitive to the initial selection of centroids and does not work well with non-spherical
clusters.
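
A minimal scikit-learn sketch (X is a placeholder feature matrix, e.g. customer spending data assumed to be loaded):

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)  # k = 3 clusters
labels = kmeans.fit_predict(X)        # cluster index assigned to each data point
centroids = kmeans.cluster_centers_   # final centroid positions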

Ensemble Methods

Ensemble methods combine multiple models to improve overall performance. The idea is that a
group of weak learners can come together to form a strong learner. Common ensemble
methods include:

●​ Boosting: Boosting builds models sequentially, where each new model tries to correct
the errors made by the previous ones. Algorithms like AdaBoost and Gradient Boosting
are examples. Boosting focuses on difficult examples and often results in high accuracy
but may overfit if not regularized properly.​

● Bagging: Bagging (Bootstrap Aggregating) trains multiple models independently on different random subsets of the training data (with replacement). Their outputs are then combined, often by averaging or majority voting. Random Forests are a common bagging technique. Bagging reduces variance and helps avoid overfitting.

●​ Random Forests: A Random Forest is an ensemble of decision trees, where each tree
is trained on a different subset of the data and a random subset of features. The final
prediction is made by aggregating the outputs of all trees (e.g., majority vote for
classification). Random Forests are robust, accurate, and can handle missing data well.​
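
A minimal Random Forest sketch with scikit-learn (X_train, y_train, X_test are placeholders; 100 trees is a common default):

from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(n_estimators=100, random_state=42)  # 100 bagged decision trees
forest.fit(X_train, y_train)
predictions = forest.predict(X_test)   # majority vote across all trees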

Dimensionality Reduction Techniques

Apart from PCA, other important dimensionality reduction techniques include:

●​ Principal Component Analysis (PCA): PCA transforms data into new coordinates that
maximize variance, selecting top components to reduce dimensions while preserving
most of the information.​

● Linear Discriminant Analysis (LDA): LDA is used primarily for classification tasks. Unlike PCA, which is unsupervised, LDA is supervised and tries to find feature combinations that best separate different classes.

● Independent Component Analysis (ICA): ICA separates a multivariate signal into additive, statistically independent components. It is used in applications like blind source separation (e.g., separating audio signals from a mixed recording).

●​ Singular Value Decomposition (SVD): SVD decomposes a matrix into three other
matrices and is widely used in dimensionality reduction, especially in text processing and
recommendation systems. It helps in reducing noise and compressing data.​

Evaluation: Performance Measurement of Models

Accuracy

Accuracy measures the proportion of correct predictions out of total predictions. It is simple and
intuitive but may be misleading in imbalanced datasets (where one class dominates).

Confusion Matrix

A confusion matrix is a table that shows the number of true positives (TP), true negatives (TN),
false positives (FP), and false negatives (FN). It provides a detailed breakdown of model
performance beyond just accuracy.

Precision and Recall

●​ Precision = TP / (TP + FP): It measures how many predicted positives are actually
correct.​

●​ Recall = TP / (TP + FN): It measures how many actual positives were correctly
predicted.​
Precision is useful when false positives are costly; recall is critical when missing
positives is more dangerous.​

F1-score

The F1-score is the harmonic mean of precision and recall. It balances the two and is especially
useful when there is an uneven class distribution.​
F1 = 2 * (Precision * Recall) / (Precision + Recall)
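
A minimal sketch computing these metrics with scikit-learn (y_true and y_pred are placeholder arrays of actual and predicted binary labels):

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

print(confusion_matrix(y_true, y_pred))   # [[TN, FP], [FN, TP]] for binary labels
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)
print(f1_score(y_true, y_pred))           # harmonic mean of precision and recall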

ROC Curve and AUC



●​ The Receiver Operating Characteristic (ROC) curve plots the true positive rate (recall)
against the false positive rate.​

●​ AUC (Area Under the Curve) measures the area under the ROC curve. A model with an
AUC close to 1 performs well, while 0.5 indicates random guessing.​

Median Absolute Deviation (MAD)

MAD is a robust measure of variability. It is the median of the absolute differences between
each data point and the median of the dataset. Unlike standard deviation, MAD is less affected
by outliers, making it valuable in robust regression and error analysis.
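
A minimal NumPy sketch with illustrative error values:

import numpy as np

errors = np.array([1.0, 2.0, 2.5, 3.0, 50.0])        # one large outlier
mad = np.median(np.abs(errors - np.median(errors)))  # median absolute deviation
print(mad)   # far less affected by the outlier than the standard deviation would be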

Distribution of Errors

The distribution of errors refers to the analysis of how model prediction errors (difference
between predicted and actual values) are spread. Ideally, errors should be randomly distributed
with a mean close to zero. Analyzing the distribution helps identify patterns such as underfitting,
overfitting, or bias in the model.

UNIT-I: Introduction to Machine Learning


Machine Learning Definition (Example)

A spam filter in your email inbox learns from thousands of labeled emails ("spam" or "not spam")
to automatically classify new emails.

History (Example)

In 1959, Arthur Samuel developed a program that played checkers and improved by playing
against itself – one of the first ML applications.

Need (Example)

Netflix uses ML to recommend shows based on what you've already watched, improving user
experience and engagement.

Features of ML (Example)

Self-learning: Google Translate improves over time by learning from translations worldwide.

Classification of ML

●​ Supervised Learning: Predicting house prices using labeled data with features like size,
location, and past prices.​

● Unsupervised Learning: Grouping customers into clusters based on their shopping behavior without predefined labels.

●​ Reinforcement Learning: A robot learning to walk by trial and error, receiving rewards
for each successful step.​

ML Lifecycle (Example)

Building a voice assistant: data collection → preprocessing → model training → evaluation → deployment → monitoring.

Applications (Examples)

●​ Healthcare: Predicting diseases​

●​ E-commerce: Recommending products​

●​ Agriculture: Forecasting crop yields​

●​ Finance: Detecting fraud​

Parametric vs. Non-parametric Models (Example)

●​ Parametric: Linear regression with a fixed number of parameters (line equation).​

●​ Non-parametric: k-NN, which adapts its complexity to the dataset and doesn’t assume a
specific function form.​

Bias-Variance Tradeoff (Example)

●​ High bias (Underfitting): Predicting all house prices as ₹50L – too simple.​

●​ High variance (Overfitting): Memorizing the exact prices for training data but failing on
new data.​

Statistical Modeling vs. ML (Example)

●​ Statistical: Assumes a model form like normal distribution.​

●​ ML: Focuses more on prediction accuracy with fewer assumptions.​

Steps in ML Model Development (Example)

1.​ Define problem – predict sales​

2.​ Collect data – past sales data​

3.​ Preprocess – clean and normalize​

4.​ Train – use algorithm​

5.​ Test – evaluate accuracy​

6.​ Deploy – put into real use​

Loss Functions (Example)

●​ MSE (Mean Squared Error): Measures how far off predictions are in regression
problems.​

When to Stop Tuning (Example)

When validation error stops improving or starts increasing, indicating overfitting.

Train, Validation, Test (Example)

In a face recognition app:

●​ Train set – teaches the model with known faces​

●​ Validation set – tunes hyperparameters​

●​ Test set – evaluates on unseen faces​



Cross-validation (Example)

K-fold cross-validation: Divides data into 5 parts, trains on 4 and tests on 1, repeating 5 times
for better accuracy estimate.

Grid Search (Example)

Testing various combinations of learning rates and tree depths in a decision tree to find the best
performing one.

✅ UNIT-II: Dimensionality Reduction and Data Representation
Row & Column Vector (Example)

●​ Row vector: [20, 5.6, 60] → 1 student’s age, height, weight​

●​ Column vector:​

20
5.6
60

Representing Dataset as Matrix (Example)

Each row = one person, each column = age, income, height

[ 25, 30000, 5.9 ]
[ 32, 45000, 6.1 ]

Feature Normalization (Example)

Income: ₹10K to ₹1L → Normalize to 0–1 range so it doesn't overpower smaller features like
age.

Mean of Matrix (Example)

Average height across all people (column average).

Column Standardization (Example)

Standardize test scores so each subject (column) has mean 0 and std deviation 1 before
applying algorithms.

Covariance Matrix (Example)

If height and weight have high positive covariance, taller people tend to weigh more.

Principal Component Analysis (PCA) (Example)

In a dataset of 10 features, PCA might reduce it to 2 components that explain 95% of the data,
making it easier to visualize.

✅ UNIT-III: Supervised Learning Algorithms


k-NN (Example)

To classify a new fruit as an apple or orange, k-NN looks at its 3 nearest neighbors (based on
size, color) and picks the majority label.

Naïve Bayes (Example)

In spam filtering, if an email has the words "free", "win", and "lottery", Naïve Bayes uses the
probabilities of each word being in spam to predict.

Decision Trees (Example)

For credit approval:

●​ Is income > ₹40K?​

○​ Yes → Has existing loan?​

■​ No → Approve​

■​ Yes → Decline​

Linear Regression (Example)

Predict house price based on size:
Price = 5000 × Size + 2,00,000

Logistic Regression (Example)

Predict if a student will pass (yes/no) based on study hours using sigmoid function to output
probabilities.

Support Vector Machines (SVM) (Example)

Classifying emails as spam or not by finding the best dividing line (hyperplane) that separates
the two classes with maximum margin.

✅ UNIT-IV: Unsupervised Learning, Ensemble Methods, Dimensionality Reduction, and Model Evaluation
K-Means Clustering (Example)

Customer segmentation: Grouping people into clusters based on their spending habits without
any labels.

Boosting (Example)

A model first predicts wrongly that a person won't default on a loan. The next model focuses
more on this error, improving the final combined output.

Bagging (Example)

Random Forest creates different decision trees using random subsets of training data and
averages the results to reduce overfitting.

Random Forest (Example)

Predicting diabetes risk using multiple decision trees built from different patient samples,
combining their predictions.

PCA (Example)

Reduce a 100-feature image dataset to 10 features while retaining the most important
information.

LDA (Example)

Used in face recognition: separates images of different people by finding the directions that best
distinguish classes.

ICA (Example)

Separating mixed audio signals into individual speakers in a recording (blind source separation).

SVD (Example)

Used in recommender systems like Netflix: reduces movie rating matrix to uncover patterns in
user preferences.

Evaluation Metrics (Examples)

●​ Accuracy: 90 out of 100 predictions were correct → 90%​

Confusion Matrix:​

TP: 50, FP: 5
FN: 10, TN: 35

● Precision: 50 / (50 + 5) = 0.91
● Recall: 50 / (50 + 10) = 0.83
●​ F1-score: 2 * (0.91 * 0.83) / (0.91 + 0.83) ≈ 0.87
●​ ROC Curve (Example): A graph to visualize how well the model distinguishes between
classes.
●​ AUC: Closer to 1 is better (e.g., AUC = 0.98 is excellent).
●​ MAD (Example): Measures median of errors in house price prediction; robust to outliers.
●​ Distribution of Errors (Example): A histogram showing how far predictions are from
actual values – ideally symmetric and centered at zero.​
