
Unit 1: Introduction to Machine Learning

What is Machine Learning?


Machine Learning (ML) is a branch of Artificial Intelligence (AI) that allows computers to learn from
data and improve their performance over time without being explicitly programmed.
• Key Idea: Instead of hardcoding rules, ML algorithms learn patterns and
relationships in data to make predictions or decisions.
• Tom Mitchell's Definition:
"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
o Experience (E): Historical data or observations.
o Task (T): The activity the model needs to perform (e.g., classifying images).
o Performance Measure (P): How the success of the model is measured
(e.g., accuracy).
Arthur Samuel, an early American leader in the field of computer gaming and artificial intelligence, coined the term "Machine Learning" in 1959 while at IBM. He defined machine learning as "the field of study that gives computers the ability to learn without being explicitly programmed." However, there is no universally accepted definition of machine learning; different authors define the term differently.
Differences: Machine Learning, Artificial Intelligence, and Deep Learning
• Definition:
o Machine Learning (ML): Focuses on learning patterns from data for predictions.
o Artificial Intelligence (AI): Encompasses all intelligent machine behaviors.
o Deep Learning (DL): A subset of ML using neural networks with many layers.
• Core Objective:
o ML: Analyze data to make predictions/decisions.
o AI: Mimic human intelligence to solve broader problems.
o DL: Process large, complex data like images and speech.
• Example Techniques:
o ML: Regression, clustering, reinforcement learning.
o AI: Robotics, natural language understanding.
o DL: Convolutional Neural Networks (CNNs), RNNs.

How Does Machine Learning Work?


ML can be seen as mimicking human learning processes, broken into three main steps:
1. Data Input:
o Historical data, like sales records or patient symptoms, is collected.
o This data acts as the machine's "experience" and is vital for training the model.
2. Abstraction (Training):
o The machine analyzes patterns and relationships in the data to create a model.
o For example, in image classification, it learns the distinguishing features of objects like size, shape, and color.
3. Generalization:
o Once trained, the model applies its understanding to new, unseen data.

o Example: A trained spam detection model filters new emails even though it has never seen those specific messages before.

What Makes a Problem Suitable for Machine Learning?


Not all problems are ideal for ML. A good ML problem typically meets these criteria:
1. Clear Task (T):
o Example: Identify spam emails.
2. Relevant Experience (E):
o Historical data must be available and relevant. For spam detection, this could
be a dataset of emails labeled as spam or not spam.
3. Measurable Performance (P):
o There must be a way to evaluate success, such as accuracy, precision, or recall.
Example:
• For predicting house prices:
o Task: Predict house prices.
o Experience: Historical data on house sales, including features like size,
location, and price.
o Performance: How close predicted prices are to actual prices.

Challenges in Machine Learning


1. Ambiguity in Problem Definition:
o If the problem isn’t clearly defined, ML cannot provide meaningful results.
o Example: Predicting "success" without defining it (e.g., is it revenue, sales,
or customer satisfaction?).
2. Insufficient Data:
o ML models require large datasets to learn effectively. Small or incomplete datasets can lead to poor performance.
o Example: A medical diagnosis model trained on only 50 patients may not generalize to other cases.
3. Overfitting and Underfitting:
o Overfitting: The model memorizes the training data but performs poorly
on new data.
o Underfitting: The model fails to learn important patterns in the training
data.
4. Ethical and Privacy Concerns:
o Using personal data without consent can breach privacy laws.
o Bias in training data can lead to unfair predictions (e.g., biased hiring algorithms).

Learning Types in Machine Learning


Machine Learning methods are grouped based on how they learn from data. These types include
Supervised Learning, Unsupervised Learning, Reinforcement Learning, and two modern approaches:
Semi-supervised and Self-supervised Learning. Let’s break them down step by step.

1. Supervised Learning
What is it?
This type of learning happens when the algorithm is trained using data that has both inputs (features)
and outputs (labels). Think of it like a teacher showing you examples and then asking you to solve
similar problems.
How does it work?
• Step 1: Feed the algorithm labeled data.
• Step 2: The algorithm learns the relationship between input and output.
• Step 3: Use the trained model to make predictions on new data.
Examples:
• Email spam detection: The model learns from emails labeled as “spam” or “not
spam.”

• Predicting house prices: The algorithm uses data like size, location, and age to
predict prices.
Popular Algorithms:
• Linear Regression for predicting numbers.
• Decision Trees for classification tasks like diagnosing diseases.
• Support Vector Machines (SVM) for separating data into categories.
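
As a rough sketch, the three steps might look like this in Python with scikit-learn (assumed installed). The Iris dataset and the Decision Tree are illustrative choices, not part of the notes above:

```python
# Minimal supervised-learning sketch: train on labeled data, predict on unseen data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)          # features (inputs) and labels (outputs)

# Step 1: split the labeled data so performance can be checked on unseen examples
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 2: the algorithm learns the input-output relationship from the labeled data
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Step 3: use the trained model to make predictions on new data
predictions = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))
```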

2. Unsupervised Learning
What is it?
Here, there are no labels or predefined outcomes. The algorithm learns patterns or groupings directly
from the data. It’s like exploring a new city without a guide—you figure out neighborhoods and
landmarks on your own.
How does it work?
• Step 1: The data is given without any labels.
• Step 2: The algorithm identifies hidden structures, such as clusters or patterns.
Examples:
• Grouping customers with similar buying habits for targeted marketing.
• Reducing large datasets into smaller, meaningful dimensions (e.g., image
compression).
Popular Algorithms:
• k-Means Clustering: Groups similar data points into clusters.
• DBSCAN: Finds clusters of data points based on density.
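
A minimal k-Means sketch in Python (scikit-learn assumed installed); the two-feature "customer" data below is invented purely for illustration:

```python
# Unsupervised learning: k-Means groups unlabeled points into clusters.
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled data: [annual_spend, visits_per_month] for ten hypothetical customers
X = np.array([[200, 2], [220, 3], [250, 2],             # low spenders
              [900, 8], [950, 9], [1000, 10],           # frequent high spenders
              [500, 5], [520, 4], [480, 6], [510, 5]])  # middle group

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print("Cluster labels:", kmeans.labels_)         # which cluster each point fell into
print("Cluster centers:\n", kmeans.cluster_centers_)
```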

3. Reinforcement Learning (RL)


What is it?
Reinforcement learning is like learning through rewards and penalties. Imagine teaching a dog tricks:
when it performs correctly, you give it a treat. When it doesn’t, there’s no treat. Over time, it learns to
maximize treats.
How does it work?
• An Agent (learner) interacts with an Environment (world).
• Actions are taken, and feedback is given as rewards or penalties.
• The agent learns to take actions that maximize its rewards.
Examples:
• Self-driving cars: Learning how to navigate roads and avoid obstacles.
• Robotics: Teaching robots to pick up objects or walk.
Key Terms in RL:
• Agent: The decision-maker (e.g., the car).
• Environment: The context (e.g., the road).
• Reward: Feedback for good actions (e.g., avoiding an accident).
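
The reward-driven loop can be sketched with plain-Python tabular Q-learning. The 5-cell corridor environment, its rewards, and the hyperparameters below are illustrative assumptions, not a standard benchmark:

```python
# Tabular Q-learning sketch: the agent starts at cell 0 of a 5-cell corridor and
# earns +1 for reaching cell 4; every other step gives 0 reward.
import random

n_states, n_actions = 5, 2             # actions: 0 = move left, 1 = move right
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.1, 0.9, 0.2  # learning rate, discount, exploration rate

for episode in range(500):
    state = 0
    while state != 4:                  # run until the goal cell is reached
        # epsilon-greedy: mostly exploit the best-known action, sometimes explore
        if random.random() < epsilon:
            action = random.randrange(n_actions)
        else:
            action = max(range(n_actions), key=lambda a: Q[state][a])
        next_state = max(0, state - 1) if action == 0 else min(4, state + 1)
        reward = 1.0 if next_state == 4 else 0.0
        # Q-update: nudge the estimate toward reward + discounted future value
        Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
        state = next_state

print("Learned Q-values:", [[round(q, 2) for q in row] for row in Q])
```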


Datasets and Preprocessing

1. Structure of Datasets
A dataset is like a table where each row and column has specific meanings:
1. Features (Attributes):
o Features are like the columns in a table, and each feature describes a specific property of the data.
o Example: In a dataset about students, features could be Name, Age, Marks, and Grade.
o Features are also called variables or attributes.

2. Labels (Target Values):
o Labels are the answers or outcomes you want to predict (used in supervised learning).
o Example: In a house price dataset, the label is the price of the house.
o Not all datasets have labels; only supervised learning datasets include them.
3. Records (Rows):
o Each record is a single row in the dataset, representing one instance of data.
o Example: In a dataset of students, a record could be: Name: John, Age: 15, Marks: 85, Grade: A.

2. Handling Missing Data


When data is incomplete, it can cause errors or reduce the accuracy of machine learning models. Here’s
how to handle it:
1. Imputation (Filling Missing Data):
o Replace missing values with estimates:
▪ Mean/Median: Use the average or middle value for numerical data.
▪ Mode: Use the most common value for categorical data.
▪ Example: If students' marks are missing, replace them with the average marks of the class.
▪ Advanced Methods: Algorithms like k-Nearest Neighbors can estimate the missing values based on similar data points.
2. Deletion (Removing Data):
o If only a small number of rows or columns have missing data, you can delete them.
o Example: If 5 out of 100 rows have missing values, you might delete those rows to avoid bias.
o Avoid deletion if too much data is missing, as it can lead to loss of important information.
3. Noise Filtering (Fixing Errors):
o Correct or remove incorrect data entries (like extreme outliers).
o Example: A student's marks recorded as 999 are likely a mistake and need correction.
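
A minimal pandas sketch of these three techniques; the student records below are invented for illustration:

```python
import pandas as pd

# Invented data: one missing mark, one impossible mark (999), one missing city
df = pd.DataFrame({
    "Name":  ["John", "Asha", "Lee", "Mia"],
    "Marks": [85.0, None, 78.0, 999.0],
    "City":  ["Delhi", "Pune", None, "Pune"],
})

# Noise filtering: a mark above 100 is treated as an entry error and blanked out
df.loc[df["Marks"] > 100, "Marks"] = None

# Imputation: mean for numerical data, mode for categorical data
df["Marks"] = df["Marks"].fillna(df["Marks"].mean())
df["City"] = df["City"].fillna(df["City"].mode()[0])

# Deletion (the alternative): df.dropna() would simply remove incomplete rows
print(df)
```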

3. Feature Scaling
Feature scaling ensures that all numerical values are treated equally by the algorithm, especially when
features have different ranges (e.g., Age vs. Salary).
1. Normalization:
o Converts all values to a range between 0 and 1.
o Formula: New Value = (Old Value − Minimum Value) / (Maximum Value − Minimum Value)
o Example: If salaries range from $10,000 to $100,000, normalization maps $10,000 to 0 and $100,000 to 1.
o Suitable for algorithms like k-Nearest Neighbors or Neural Networks.
2. Standardization:
o Centers data around 0 and scales it based on standard deviation.
o Formula: New Value = (Old Value − Mean) / Standard Deviation
o Example: If test scores have a mean of 70 and a standard deviation of 10, a score of 80 becomes (80 − 70) / 10 = 1.0.
o Works well with models like logistic regression or SVM.
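
Both scalings might look like this with scikit-learn's MinMaxScaler and StandardScaler; the numbers echo the examples above:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Normalization: minimum maps to 0, maximum maps to 1
salaries = np.array([[10_000.0], [40_000.0], [100_000.0]])
print(MinMaxScaler().fit_transform(salaries).ravel())   # [0.0, 0.333..., 1.0]

# Standardization: these two scores have mean 70 and (population) std 10,
# so 80 becomes (80 - 70) / 10 = 1.0
scores = np.array([[60.0], [80.0]])
print(StandardScaler().fit_transform(scores).ravel())   # [-1.0, 1.0]
```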

4. Encoding Techniques
Sometimes, data has text categories (like colors or cities) that need to be converted into numbers for the
model to understand.
1. Label Encoding:
o Assigns a unique number to each category.
o Example: Colors: Red → 0, Green → 1, Blue → 2.
o Works well for ordered categories like grades (A > B > C).
2. One-Hot Encoding:
o Creates a separate column for each category and assigns binary values (0 or 1).
o Example: Colors: Red → [1, 0, 0], Green → [0, 1, 0], Blue → [0, 0, 1].
o Avoids giving unnecessary importance to the numbers, especially for non-ordered categories like colors.
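
A minimal sketch of both encodings in Python; note that scikit-learn's LabelEncoder assigns codes alphabetically, so the exact numbers differ from the Red → 0 example above:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

colors = pd.DataFrame({"color": ["Red", "Green", "Blue", "Green"]})

# Label encoding: one integer per category (alphabetical: Blue=0, Green=1, Red=2)
print(LabelEncoder().fit_transform(colors["color"]))    # [2 1 0 1]

# One-hot encoding: one binary column per category
print(pd.get_dummies(colors, columns=["color"]))
```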

5. Dataset Properties
1. Dimensionality:
o Refers to the number of features (columns) in a dataset.
o Example: A student dataset with Name, Age, Marks, and Grade has 4 dimensions.
o High Dimensionality Problems:
▪ Harder to process.
▪ Models may overfit because of irrelevant features.
o Solution: Use dimensionality reduction methods like Principal Component Analysis (PCA).
2. Sparsity:
o Occurs when most values in a dataset are zero.
o Example: In a dataset tracking items bought in a store, a row for a customer who bought only 1 item will have zeros for all other items.
o Sparse datasets are common in text analysis (e.g., bag-of-words) and can lead to inefficiencies.
o Solution: Use techniques like compressing the dataset or specialized algorithms for sparse data (a brief sketch of both remedies follows).
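
A brief sketch of both remedies, assuming scikit-learn and SciPy are available; the data is random/invented:

```python
import numpy as np
from sklearn.decomposition import PCA
from scipy.sparse import csr_matrix

# Dimensionality reduction: project 4-feature data down to 2 principal components
X = np.random.default_rng(42).normal(size=(100, 4))
X_reduced = PCA(n_components=2).fit_transform(X)
print(X_reduced.shape)                      # (100, 2)

# Sparsity: a purchase matrix where most entries are zero, stored compactly
purchases = csr_matrix([[0, 0, 1, 0], [2, 0, 0, 0], [0, 0, 0, 3]])
print(purchases.nnz, "non-zero entries out of", purchases.shape[0] * purchases.shape[1])
```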
Dataset Division

1. Splitting Strategies: Training, Validation, and Test Sets


Splitting datasets is a fundamental step in building machine learning models. Proper splitting ensures the
model learns effectively, avoids overfitting, and generalizes well to unseen data.
Training Set
• Purpose:
o This subset is used by the algorithm to learn patterns, relationships, and features in the data.
o The model uses this data to optimize its parameters (e.g., weights in a neural network).
• Size:
o Typically comprises 70-80% of the dataset.
o For very large datasets, a smaller proportion (e.g., 60%) can suffice, since it still provides plenty of training data.
• Example:
o In a housing price prediction problem, the training set might include historical data of house sizes, locations, and prices.

Validation Set
• Purpose:
o Helps tune the model by adjusting hyperparameters (e.g., learning rate, number of layers in a neural network).
o Prevents the model from overfitting to the training set.
• Key Characteristics:
o It acts like a "practice test" for the model during training.
o The model's performance on the validation set helps decide which model version to keep.
• Size:
o Typically 10-15% of the dataset.
o May not be required if cross-validation is used.
• Example:
o If training a model to predict student grades, the validation set might be used to compare the performance of different algorithms (e.g., Decision Trees vs. Random Forest).

Test Set
• Purpose:
o Evaluates the final model's performance on completely unseen data.
o It is the ultimate measure of how well the model generalizes to real-world scenarios.
• Key Characteristics:
o The test set should never be used during model training or hyperparameter tuning.
o It provides unbiased performance metrics (e.g., accuracy, precision, recall).
• Size:
o Generally 10-30% of the dataset, depending on its overall size.
• Example:
o In fraud detection, the test set might include transaction data the model has not seen, ensuring it can detect fraud reliably.
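
One common way to obtain the three subsets is two successive calls to scikit-learn's train_test_split; the 70/15/15 proportions below follow the guidance above (Iris is an illustrative dataset):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First split off the training set (70%) ...
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
# ... then split the remainder evenly into validation (15%) and test (15%)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

print(len(X_train), len(X_val), len(X_test))   # roughly 105 / 22 / 23 for 150 rows
```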


2. Cross-Validation Techniques
Cross-validation is used to assess how well a model generalizes to unseen data. It helps overcome issues
with limited data availability and ensures robust model evaluation.

K-Fold Cross-Validation
• How It Works:
o Divides the dataset into k equally-sized folds.
o Each fold takes a turn as the test set, while the remaining k−1 folds form the
training set.
o The process repeats k times, and the results are averaged for final evaluation.
• Typical Values for k:
o k = 10 is a common choice for a good balance between computational cost and accuracy.
o Higher k values provide a more thorough evaluation but are computationally expensive.
• Advantages:
o Reduces bias, since every data point is used for both training and testing.
o Suitable for small datasets where separate train-test splits are not feasible.
• Example:
o If a dataset has 100 records and k = 5, each fold will have 20 records. The model is trained on 80 records and tested on 20 in each iteration.
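
A minimal k-fold sketch using scikit-learn's cross_val_score, which handles the splitting, training, and scoring internally (Iris and logistic regression are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# cv=5 runs five train/test rotations and returns one accuracy per fold
scores = cross_val_score(model, X, y, cv=5)
print("Fold accuracies:", scores.round(3), "mean:", scores.mean().round(3))
```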

Stratified K-Fold Cross-Validation


• How It Works:
o Similar to k-fold, but ensures that the distribution of target labels is consistent across all folds.
o Particularly useful for imbalanced datasets (e.g., 90% non-fraudulent transactions, 10% fraudulent).
• Advantages:
o Prevents the model from performing poorly on minority classes by maintaining the class ratio.
• Example:
o In a medical dataset with 90 healthy and 10 sick patients, stratified k-fold ensures that each fold contains approximately 9 healthy and 1 sick patient.
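
A quick sketch showing StratifiedKFold preserving the 90/10 class ratio from the example above (the labels are synthetic):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)            # 90% healthy, 10% sick

# Each of the 10 test folds contains about 9 healthy and 1 sick patient
for fold, (train_idx, test_idx) in enumerate(StratifiedKFold(n_splits=10).split(X, y)):
    print(f"Fold {fold}: test label counts -> {np.bincount(y[test_idx])}")   # [9 1]
```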

Leave-One-Out Cross-Validation (LOOCV)
• How It Works:
o Each data point is treated as a test set once, while the rest of the data forms the training set.
o Repeats n times (where n is the total number of records).
• Advantages:
o Utilizes the maximum amount of training data in each iteration.
o Suitable for very small datasets.
• Disadvantages:
o Computationally expensive, especially for large datasets.
• Example:
o For a dataset of 20 records, LOOCV trains the model 20 times, each time using 19 records for training and 1 for testing.
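
A tiny sketch confirming the iteration count with scikit-learn's LeaveOneOut:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.arange(20).reshape(-1, 1)             # 20 records
n_iterations = sum(1 for _ in LeaveOneOut().split(X))
print(n_iterations)                          # 20 - one train/test round per record
```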

3. Practical Considerations for Real-World Datasets


Real-world datasets come with challenges that must be addressed to ensure meaningful results.

Data Quality
• Missing Values:
o Handle missing data using imputation or removal to avoid biases in model performance.
• Outliers and Noise:
o Remove or correct extreme values that could distort model training.
o Example: Salary data with unrealistic entries like $1 or $1 million.

Reproducibility
• Random Seed:
o Always set a fixed seed when splitting data to ensure the splits remain consistent across runs.
o Example: Use random_state=42 in Python's scikit-learn.
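
A short sketch showing that a fixed random_state reproduces the same split (the data is illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(10)
a, b = train_test_split(X, test_size=0.3, random_state=42)
c, d = train_test_split(X, test_size=0.3, random_state=42)
print(np.array_equal(a, c) and np.array_equal(b, d))   # True: identical splits
```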

Data Leakage
• What It Is:
o Data leakage occurs when information from the test set influences the training process, leading to overly optimistic performance.
• How to Avoid It:
o Ensure features derived from future data or test data are excluded from the training phase.
• Example:
o In fraud detection, using transaction outcomes (fraudulent or not) as features in training data creates leakage.

Dataset Size
• Small Datasets:
o Use techniques like cross-validation or data augmentation to maximize learning opportunities.
• Large Datasets:
o With abundant data, splitting into training, validation, and test sets is straightforward.

Domain-Specific Splitting
• Time-Series Data:
o Use chronological splits to ensure the model is evaluated on future data it has not seen during training.
• Spatial Data:
o Split by geographic regions to test how well the model generalizes across different areas.
• Example:
o In predicting sales trends, training on data from January–June and testing on July–December ensures chronological relevance.
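
A minimal chronological-split sketch with pandas; the monthly sales figures are invented for illustration:

```python
import pandas as pd

sales = pd.DataFrame({
    "month": pd.date_range("2024-01-01", periods=12, freq="MS"),
    "units": [100, 110, 95, 120, 130, 125, 140, 150, 145, 160, 170, 165],
})

# Train on the earlier months, test on the later ones - never the reverse
cutoff = pd.Timestamp("2024-07-01")
train = sales[sales["month"] < cutoff]     # January-June
test = sales[sales["month"] >= cutoff]     # July-December
print(len(train), "train months,", len(test), "test months")
```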

Applications and Workflow of Machine Learning

1. Typical ML Pipeline: From Problem Definition to Deployment

The ML pipeline is a structured sequence of steps that guides the development of a machine learning model from conceptualization to deployment. Here is a detailed breakdown:

1.1. Problem Definition


• The first and most critical step is to clearly understand and define the problem that
needs solving.
• This involves identifying the goal (e.g., classification, prediction, or clustering) and
understanding the domain.
o Example: Predict whether a tumor is benign or malignant based on medical
test results.
• Consider the feasibility of the problem:
o Is there enough data available?
o How will the solution be used in practice?

1.2. Data Collection


• Gather all necessary data from multiple sources like databases, APIs, sensors, or
manual inputs.
• Ensure the data is representative of real-world scenarios to avoid biases.
o Example: In a fraud detection system, data should include both fraudulent
and non-fraudulent transactions.
• The more diverse and comprehensive the dataset, the better the model generalizes.
1.3. Data Preprocessing
• Raw data is often messy and requires cleaning and transformation.
• Steps in Preprocessing:
1. Cleaning: Remove duplicates, correct inconsistencies, and handle missing
values.
2. Normalization/Standardization: Scale numerical data to bring all
features to a uniform range.
3. Encoding: Convert categorical data (like "male/female") into numerical
values (e.g., one-hot encoding).
4. Outlier Handling: Detect and address extreme values that may skew
results.
o Example: Normalize age and income when predicting loan approvals since
these variables have vastly different scales.

1.4. Feature Engineering


• Features are the input variables the model uses to make predictions.
• Techniques:
o Feature Selection: Identify the most relevant features (e.g., eliminate
irrelevant ones like customer ID).
o Feature Extraction: Derive new features from existing ones (e.g., calculate
BMI from weight and height).
• Example: In predicting heart disease, relevant features might include blood
pressure, cholesterol levels, and age.

1.5. Model Selection


• Choose an appropriate algorithm based on the problem type:
o Classification: Algorithms like Decision Trees or Support Vector Machines (SVM) for predicting categories.
o Regression: Linear Regression for predicting continuous values like house prices.
o Clustering: K-Means for grouping similar data points.
o Deep Learning: Neural Networks for image or speech recognition.
• Use the dataset's size and complexity to guide your choice.

1.6. Model Training


• During training, the model learns patterns from the data by optimizing its
parameters.
• Steps:
1. Divide the dataset into training, validation, and test sets.
2. Feed the training data into the model and adjust parameters to minimize
errors using optimization techniques like gradient descent.

3. Use validation data to fine-tune hyperparameters (e.g., learning rate or tree
depth).
• Example: A spam filter learns to classify emails as spam or not by analyzing
labeled data.
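
The parameter-adjustment idea can be sketched with plain-NumPy gradient descent on a one-variable linear regression; the data, learning rate, and step count below are illustrative assumptions:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])         # true relationship: y = 2x + 1

w, b, lr = 0.0, 0.0, 0.05                  # start with bad parameters
for step in range(2000):
    error = (w * x + b) - y                # prediction error on the training data
    # gradients of mean squared error with respect to w and b
    w -= lr * 2 * (error * x).mean()
    b -= lr * 2 * error.mean()

print(round(w, 2), round(b, 2))            # approaches w = 2.0, b = 1.0
```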

1.7. Model Evaluation


• After training, evaluate the model’s performance on unseen test data.
• Common evaluation metrics:
o Accuracy: Percentage of correct predictions.
o Precision & Recall: Measures for imbalanced datasets (e.g., detecting rare
diseases).
o F1 Score: Harmonic mean of precision and recall for balanced performance.
• Use cross-validation to ensure the model performs well on various subsets of the
data.
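
Each of these metrics is one function call in scikit-learn; the true/predicted labels below are made up for illustration:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]          # actual labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]          # model predictions

print("Accuracy :", accuracy_score(y_true, y_pred))    # fraction correct
print("Precision:", precision_score(y_true, y_pred))   # correct among predicted 1s
print("Recall   :", recall_score(y_true, y_pred))      # found among actual 1s
print("F1 score :", f1_score(y_true, y_pred))          # harmonic mean of the two
```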

1.8. Deployment
• The final step is integrating the model into a real-world application or production
environment.
• Deployment Methods:
o APIs: Embed the model in an API that other applications can use.
o Real-Time Systems: Use the model for real-time predictions (e.g., fraud
detection systems).
• Monitoring: Continuously track model performance, as data patterns may change
over time. This process is called model retraining.

2. Applications of Machine Learning in Key Sectors


ML is transforming various industries by automating tasks, improving accuracy, and making data-driven
decisions. Let’s explore its impact in healthcare, finance, and transportation.

2.1. Healthcare
1. Disease Prediction and Diagnosis:
• ML models analyze patient data (e.g., symptoms, test results, genetic information)
to predict diseases.
• Example: Algorithms like Logistic Regression predict whether a tumor is
malignant or benign.
2. Medical Imaging:
• ML and Computer Vision detect anomalies in medical images like X-rays, MRIs,
and CT scans.

• Example: Convolutional Neural Networks (CNNs) identify fractures, tumors, or
other conditions.
3. Wearable Devices:
• Smartwatches and fitness trackers monitor heart rate, oxygen levels, and activity
patterns in real time.
• ML models analyze this data to alert users or medical professionals about potential
health issues.
• Example: Detecting atrial fibrillation from irregular heart rhythms.
4. Personalized Treatment:
• Predict the best treatment plan based on the patient’s medical history and genetic
profile.
• Example: Recommending cancer therapies using precision medicine.

2.2. Finance
1. Fraud Detection:
• ML models monitor transactions for unusual patterns or activities to flag potential
fraud.
• Example: Identifying a sudden large transaction in a low-spending account.
2. Risk Assessment:
• Predict a customer’s likelihood of defaulting on loans using credit history and
income data.
• Example: Logistic Regression estimates default risk for mortgage loans.
3. Algorithmic Trading:
• Analyze stock market data to make trading decisions at high speeds.
• Example: Predicting stock price trends using historical data and ML models like
Recurrent Neural Networks (RNNs).
4. Customer Retention:
• Identify customers likely to leave and offer incentives to retain them.
• Example: Predicting churn for a bank's customers and offering personalized loan
rates.

2.3. Transportation
1. Autonomous Vehicles:
• Self-driving cars use ML to detect objects, make decisions, and navigate routes.
• Example: Tesla’s Autopilot uses neural networks to identify lanes, obstacles, and
traffic signs.
2. Route Optimization:
• ML suggests the most efficient delivery routes, saving time and fuel.
• Example: Apps like Google Maps predict traffic congestion and recommend
alternative routes.
3. Predictive Maintenance:
• Monitor vehicle performance to predict and prevent failures before they happen.
• Example: Aircraft systems use ML to predict engine failures based on sensor data.
4. Smart Traffic Systems:
• Optimize traffic light timings and manage congestion using real-time traffic data.
• Example: Smart cities use ML to reduce traffic delays and improve commuter
flow.
