AI Unit 1
By kuber kr jha
o Example: A trained spam detection model filters new emails even without being explicitly reprogrammed for each new message.
o Performance is typically measured with metrics such as accuracy, precision, and recall.
Example:
• For predicting house prices:
o Task: Predict house prices.
o Experience: Historical data on house sales, including features like size,
location, and price.
o Performance: How close predicted prices are to actual prices.
1. Supervised Learning
What is it?
This type of learning happens when the algorithm is trained using data that has both inputs (features)
and outputs (labels). Think of it like a teacher showing you examples and then asking you to solve
similar problems.
How does it work?
• Step 1: Feed the algorithm labeled data.
• Step 2: The algorithm learns the relationship between input and output.
• Step 3: Use the trained model to make predictions on new data.
Examples:
• Email spam detection: The model learns from emails labeled as “spam” or “not
spam.”
• Predicting house prices: The algorithm uses data like size, location, and age to
predict prices.
Popular Algorithms:
• Linear Regression for predicting numbers.
• Decision Trees for classification tasks like diagnosing diseases.
• Support Vector Machines (SVM) for separating data into categories.
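The three steps above can be sketched end to end. This is a minimal supervised-learning example: a one-feature linear regression fit by closed-form least squares, predicting house price from size. The numbers are made up for illustration, and a real project would use a library such as scikit-learn's LinearRegression.

```python
# Minimal supervised-learning sketch: fit a one-feature linear
# regression (price vs. house size) with closed-form least squares.
# The toy numbers below are illustrative, not real market data.

def fit_line(xs, ys):
    """Return slope and intercept minimizing squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
            / sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Step 1: feed the algorithm labeled data (size in sq. ft -> price in $1000s).
sizes  = [1000, 1500, 2000, 2500]
prices = [150, 225, 300, 375]

# Step 2: learn the input-output relationship.
slope, intercept = fit_line(sizes, prices)

# Step 3: predict on new, unseen data.
predicted = slope * 1800 + intercept
print(round(predicted))  # predicted price (in $1000s) for an 1800 sq. ft house
```
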
2. Unsupervised Learning
What is it?
Here, there are no labels or predefined outcomes. The algorithm learns patterns or groupings directly
from the data. It’s like exploring a new city without a guide—you figure out neighborhoods and
landmarks on your own.
How does it work?
• Step 1: The data is given without any labels.
• Step 2: The algorithm identifies hidden structures, such as clusters or patterns.
Examples:
• Grouping customers with similar buying habits for targeted marketing.
• Reducing large datasets into smaller, meaningful dimensions (e.g., image
compression).
Popular Algorithms:
• k-Means Clustering: Groups similar data points into clusters.
• DBSCAN: Finds clusters of data points based on density.
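The assign-and-update loop behind k-Means can be sketched in a few lines. This is a toy 1-D version with k = 2 on made-up customer-spend values; in practice one would use scikit-learn's KMeans on multi-dimensional data.

```python
# Minimal k-means sketch in pure Python (1-D data, k = 2).
# The spend values and starting centers are illustrative only.

def kmeans_1d(points, centers, iters=10):
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[idx].append(p)
        # Update step: move each center to its cluster's mean.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Two obvious groups of customer-spend values, no labels given.
spend = [10, 12, 11, 90, 95, 92]
centers, clusters = kmeans_1d(spend, centers=[10, 90])
print(centers)  # centers settle near the two group means
```
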
1. Structure of Datasets
A dataset is like a table where each row and column has specific meanings:
1. Features (Attributes):
o Features are like the columns in a table, and each feature describes a specific property of the data.
o Example: In a dataset about students, features could be Name, Age, Marks, and Grade.
o Features are also called variables or attributes.
3. Feature Scaling
Feature scaling ensures that all numerical values are treated equally by the algorithm, especially when
features have different ranges (e.g., Age vs. Salary).
1. Normalization:
o Converts all values to a range between 0 and 1.
o Formula: New Value = (Old Value − Minimum Value) / (Maximum Value − Minimum Value)
o Example: If salaries range from $10,000 to $100,000, normalization maps $10,000 to 0, $55,000 to 0.5, and $100,000 to 1.
o Suitable for algorithms like k-Nearest Neighbors or Neural Networks.
2. Standardization:
o Centers data around 0 and scales it based on standard deviation.
o Formula: New Value = (Old Value − Mean) / Standard Deviation
o Example: If test scores have a mean of 70 and standard deviation of 10, a score of 80 becomes (80 − 70) / 10 = 1.0.
o Works well with models like logistic regression or SVM.
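Both formulas above can be sketched directly in plain Python. The salary and score values are illustrative; scikit-learn's MinMaxScaler and StandardScaler apply the same transformations per column.

```python
# Sketch of both scaling formulas from this section.

def normalize(values):
    """Min-max normalization: map values to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Z-score standardization: center on 0, scale by std deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

salaries = [10_000, 40_000, 100_000]
print(normalize(salaries))     # minimum maps to 0.0, maximum to 1.0

scores = [60, 70, 80]          # mean 70, population std ~8.16
print(standardize(scores)[2])  # score 80 expressed in standard units
```
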
4. Encoding Techniques
Sometimes, data has text categories (like colors or cities) that need to be converted into numbers for the
model to understand.
1. Label Encoding:
o Assigns a unique number to each category.
o Example:
▪ Colors: Red → 0, Green → 1, Blue → 2.
o Works well for ordered categories like grades (A > B > C).
2. One-Hot Encoding:
o Creates separate columns for each category and assigns binary values (0 or 1).
o Example:
▪ Colors: Red → [1, 0, 0], Green → [0, 1, 0], Blue → [0, 0, 1].
o Avoids giving unnecessary importance to numbers, especially for non-ordered categories like colors.
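Both encodings from this section can be sketched on the same color example. Libraries such as pandas (get_dummies) or scikit-learn (OneHotEncoder) normally handle this; here it is spelled out in plain Python.

```python
# Label and one-hot encoding of the color example from this section.

colors = ["Red", "Green", "Blue", "Green"]        # raw text categories
categories = ["Red", "Green", "Blue"]             # fixed category order

# Label encoding: one integer per category (Red=0, Green=1, Blue=2).
label_map = {c: i for i, c in enumerate(categories)}
labels = [label_map[c] for c in colors]
print(labels)      # [0, 1, 2, 1]

# One-hot encoding: one binary column per category.
one_hot = [[1 if c == cat else 0 for cat in categories] for c in colors]
print(one_hot[0])  # Red -> [1, 0, 0]
```
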
5. Dataset Properties
1. Dimensionality:
o Refers to the number of features (columns) in a dataset.
o Example: A student dataset with Name, Age, Marks, and Grade has 4 dimensions.
o High Dimensionality Problems:
▪ Harder to process.
▪ Models may overfit because of irrelevant features.
o Solution: Use dimensionality reduction methods like Principal Component Analysis (PCA).
2. Sparsity:
o Occurs when most values in a dataset are zero.
o Example: In a dataset tracking items bought in a store, a row showing a customer who bought only 1 item will have many zeros for other items.
o Sparse datasets are common in text analysis (e.g., bag-of-words) and can lead to inefficiencies.
o Solution: Use techniques like compressing the dataset or specialized algorithms for sparse data.
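The sparsity idea can be sketched by storing only the non-zero entries of a mostly-zero purchase row. The dict-of-nonzeros shown is an illustrative stand-in for real sparse-matrix formats such as those in scipy.sparse.

```python
# Sparsity sketch: keep only the non-zero entries of a purchase row
# as a {column_index: value} dict instead of a full dense list.

dense_row = [0, 0, 3, 0, 0, 0, 1, 0, 0, 0]   # one customer, 10 items

sparse_row = {i: v for i, v in enumerate(dense_row) if v != 0}
print(sparse_row)            # 2 stored entries instead of 10

# Recover any dense value on demand; missing keys are implicit zeros.
print(sparse_row.get(5, 0))  # 0
```
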
Dataset Division
Validation Set
• Purpose:
o Helps tune the model by adjusting hyperparameters (e.g., learning rate,
number of layers in a neural network).
o Prevents the model from overfitting to the training set.
• Key Characteristics:
o It acts like a “practice test” for the model during training.
o The model’s performance on the validation set helps decide which model version to keep.
• Size:
o Typically 10-15% of the dataset.
o May not be required if cross-validation is used.
• Example:
o If training a model to predict student grades, the validation set might be used
to test the model’s performance with different algorithms (e.g., Decision Trees
vs. Random Forest).
Test Set
• Purpose:
o Evaluates the final model’s performance on completely unseen data.
o It is the ultimate measure of how well the model generalizes to real-world scenarios.
• Key Characteristics:
o The test set should never be used during model training or hyperparameter tuning.
o It provides unbiased performance metrics (e.g., accuracy, precision, recall).
• Size:
o Generally 10-30% of the dataset, depending on its overall size.
• Example:
o In fraud detection, the test set might include transaction data the model has
not seen, ensuring it can detect fraud reliably.
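The three-way division described above can be sketched with the standard library alone. The 70/15/15 proportions below are one choice within the typical ranges given in this section, applied to 100 stand-in records.

```python
# Sketch of a 70/15/15 train/validation/test split.

import random

records = list(range(100))   # stand-in ids for 100 dataset rows
random.seed(42)              # fixed seed -> the split is reproducible
random.shuffle(records)      # shuffle so each subset is a random sample

train = records[:70]         # used to fit the model
validation = records[70:85]  # used to tune hyperparameters
test = records[85:]          # held out for the final, unbiased evaluation

print(len(train), len(validation), len(test))  # 70 15 15
```
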
2. Cross-Validation Techniques
Cross-validation is used to assess how well a model generalizes to unseen data. It helps overcome issues
with limited data availability and ensures robust model evaluation.
K-Fold Cross-Validation
• How It Works:
o Divides the dataset into k equally-sized folds.
o Each fold takes a turn as the test set, while the remaining k−1 folds form the
training set.
o The process repeats k times, and the results are averaged for final evaluation.
• Typical Values for k:
o k = 10 is a common choice for a good balance between computational cost and accuracy.
o Higher k values provide a more thorough evaluation but are computationally expensive.
• Advantages:
o Reduces bias since every data point is used for both training and testing.
o Suitable for small datasets where separate train-test splits are not feasible.
• Example:
o If a dataset has 100 records and k = 5, each fold will have 20 records. The model is trained on 80 records and tested on 20 in each iteration.
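The 100-record, k = 5 example above can be sketched as index generation. This simplified version uses contiguous folds and assumes n is divisible by k; scikit-learn's KFold also supports shuffling and uneven folds.

```python
# K-fold sketch: each fold serves once as the test set while the
# remaining k-1 folds form the training set.

def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds.
    Simplification: assumes n is divisible by k."""
    fold = n // k
    for i in range(k):
        test_idx = list(range(i * fold, (i + 1) * fold))
        test_set = set(test_idx)
        train_idx = [j for j in range(n) if j not in test_set]
        yield train_idx, test_idx

splits = list(k_fold_indices(100, 5))
print(len(splits))                           # 5 iterations
print(len(splits[0][0]), len(splits[0][1]))  # 80 train, 20 test per fold
```
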
Leave-One-Out Cross-Validation (LOOCV)
• How It Works:
o Each data point is treated as a test set once, while the rest of the data forms
the training set.
o Repeats n times (where n is the total number of records).
• Advantages:
o Utilizes the maximum amount of training data in each iteration.
o Suitable for very small datasets.
• Disadvantages:
o Computationally expensive, especially for large datasets.
• Example:
o For a dataset of 20 records, LOOCV trains the model 20 times, each time
using 19 records for training and 1 for testing.
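LOOCV is the extreme case of k-fold with k = n, so the 20-record example above reduces to listing the n leave-one-out rounds:

```python
# LOOCV sketch: for 20 records, 20 rounds, each training on 19
# records and testing on the single record left out.

n = 20
rounds = [([j for j in range(n) if j != i], [i]) for i in range(n)]

print(len(rounds))                           # 20 train/test rounds
print(len(rounds[0][0]), len(rounds[0][1]))  # 19 train, 1 test
```
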
Data Quality
• Missing Values:
o Handle missing data using imputation or removal to avoid biases in model
performance.
• Outliers and Noise:
o Remove or correct extreme values that could distort model training.
o Example: Salary data with unrealistic entries like $1 or $1 million.
Reproducibility
• Random Seed:
o Always set a fixed seed when splitting data to ensure the splits remain
consistent across runs.
o Example: Use random_state=42 in Python’s scikit-learn.
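The effect of a fixed seed can be sketched with the standard library: a seeded generator produces the same shuffle every run, so train/test membership never drifts. (scikit-learn's random_state=42 argument plays the same role inside train_test_split.)

```python
# Reproducibility sketch: same seed -> identical split on every run.

import random

def split_ids(seed):
    """Shuffle 10 record ids with a fixed seed, then take a 70/30 split."""
    ids = list(range(10))
    random.Random(seed).shuffle(ids)  # seeded generator -> repeatable order
    return ids[:7], ids[7:]

train_a, test_a = split_ids(42)
train_b, test_b = split_ids(42)
print(train_a == train_b)  # True -- the split is reproducible
```
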
Data Leakage
• What It Is:
o Data leakage occurs when information from the test set influences the training process, leading to overly optimistic performance.
• How to Avoid It:
o Ensure features derived from future data or test data are excluded from the
training phase.
• Example:
o In fraud detection, using transaction outcomes (fraudulent or not) as features
in training data creates leakage.
Dataset Size
• Small Datasets:
o Use techniques like cross-validation or data augmentation to maximize learning opportunities.
• Large Datasets:
o With abundant data, splitting into training, validation, and test sets becomes
straightforward.
Domain-Specific Splitting
• Time-Series Data:
o Use chronological splits to ensure the model is evaluated on future data it has
not seen during training.
• Spatial Data:
o Split by geographic regions to test how well the model generalizes across
different areas.
• Example:
o In predicting sales trends, training on data from January–June and testing on
July–December ensures chronological relevance.
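The January–June / July–December example can be sketched as a simple cutoff on time-ordered rows: the split point replaces random shuffling, so no future data leaks into training. The (month, sales) pairs are illustrative.

```python
# Chronological split sketch for time-series data: train on the first
# half of the year, test on the second, never shuffling across time.

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
sales = list(zip(months, range(12)))   # illustrative (month, sales) rows

cutoff = 6                             # train: Jan-Jun, test: Jul-Dec
train, test = sales[:cutoff], sales[cutoff:]

print(train[-1][0], test[0][0])  # Jun Jul -- training ends before testing begins
```
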
3. Use validation data to fine-tune hyperparameters (e.g., learning rate or tree
depth).
• Example: A spam filter learns to classify emails as spam or not by analyzing
labeled data.
1.8. Deployment
• The final step is integrating the model into a real-world application or production
environment.
• Deployment Methods:
o APIs: Embed the model in an API that other applications can use.
o Real-Time Systems: Use the model for real-time predictions (e.g., fraud
detection systems).
• Monitoring: Continuously track model performance, as data patterns may change
over time. This process is called model retraining.
2.1. Healthcare
1. Disease Prediction and Diagnosis:
• ML models analyze patient data (e.g., symptoms, test results, genetic information)
to predict diseases.
• Example: Algorithms like Logistic Regression predict whether a tumor is
malignant or benign.
2. Medical Imaging:
• ML and Computer Vision detect anomalies in medical images like X-rays, MRIs,
and CT scans.
• Example: Convolutional Neural Networks (CNNs) identify fractures, tumors, or
other conditions.
3. Wearable Devices:
• Smartwatches and fitness trackers monitor heart rate, oxygen levels, and activity
patterns in real time.
• ML models analyze this data to alert users or medical professionals about potential
health issues.
• Example: Detecting atrial fibrillation from irregular heart rhythms.
4. Personalized Treatment:
• Predict the best treatment plan based on the patient’s medical history and genetic
profile.
• Example: Recommending cancer therapies using precision medicine.
2.2. Finance
1. Fraud Detection:
• ML models monitor transactions for unusual patterns or activities to flag potential
fraud.
• Example: Identifying a sudden large transaction in a low-spending account.
2. Risk Assessment:
• Predict a customer’s likelihood of defaulting on loans using credit history and
income data.
• Example: Logistic Regression estimates default risk for mortgage loans.
3. Algorithmic Trading:
• Analyze stock market data to make trading decisions at high speeds.
• Example: Predicting stock price trends using historical data and ML models like
Recurrent Neural Networks (RNNs).
4. Customer Retention:
• Identify customers likely to leave and offer incentives to retain them.
• Example: Predicting churn for a bank's customers and offering personalized loan
rates.
2.3. Transportation
1. Autonomous Vehicles:
• Self-driving cars use ML to detect objects, make decisions, and navigate routes.
• Example: Tesla’s Autopilot uses neural networks to identify lanes, obstacles, and
traffic signs.
2. Route Optimization:
• ML suggests the most efficient delivery routes, saving time and fuel.
• Example: Apps like Google Maps predict traffic congestion and recommend
alternative routes.
3. Predictive Maintenance:
• Monitor vehicle performance to predict and prevent failures before they happen.
• Example: Aircraft systems use ML to predict engine failures based on sensor data.
4. Smart Traffic Systems:
• Optimize traffic light timings and manage congestion using real-time traffic data.
• Example: Smart cities use ML to reduce traffic delays and improve commuter
flow.