Unit 4_Question Bank and answers
1. Regression - If the predicted value is continuous, the task falls under the regression type of
problem in machine learning. For example, given features such as area name and size of land,
predict the expected cost of the land (a small code sketch of this appears after this list).
3. Clustering - Grouping a set of points into a given number of clusters for an unlabeled dataset
(unsupervised learning).
4. Ranking - Constructs a ranker from a set of labelled examples. This example set
consists of instance groups that can be scored with a given criterion. The ranking labels are { 0,
1, 2, 3, 4 } for each instance. The ranker is trained to rank new instance groups with
unknown scores for each instance.
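A minimal sketch of the land-price regression idea, assuming scikit-learn and a tiny made-up
dataset of land sizes and prices (all numbers below are purely illustrative):

```python
# Hedged sketch: fit a simple regression model on made-up land-size/price data.
from sklearn.linear_model import LinearRegression

# Feature: size of land (sq. ft). Target: expected cost (illustrative values).
X = [[1000], [1500], [2000], [2500]]
y = [50_000, 72_000, 95_000, 120_000]

model = LinearRegression()
model.fit(X, y)

# Predict the cost of an unseen plot size.
print(model.predict([[1800]]))
```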
Q2 Explain the steps involved in the development of the ML (Classification or Regression) model.
The following steps are typically involved in developing a classification or regression model.
2. Collect Data
Gather data from sources like files, databases, APIs, or web scraping.
The quality and quantity of data are key to model success.
6. Choose a Model
Select an algorithm suited to the problem (e.g., linear regression, decision tree, SVM).
7. Train the Model
Fit the model to the training data using .fit() in most libraries like Scikit-learn.
8. Evaluate the Model
Use the test set to assess how well your model generalizes.
Regression Metrics:
o Mean Absolute Error (MAE)
o Mean Squared Error (MSE)
o R-squared (R²)
Classification Metrics:
o Accuracy
o Precision, Recall, F1 Score
o Confusion Matrix
o ROC-AUC
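The metric names listed above map directly to functions in sklearn.metrics. Below is a small,
hedged sketch with made-up label arrays, just to show how the calls look:

```python
# Hedged sketch: computing common regression and classification metrics with scikit-learn.
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             confusion_matrix, mean_absolute_error, mean_squared_error, r2_score)

# Classification metrics on toy labels
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))

# Regression metrics on toy values
y_true_r = [3.0, 5.0, 2.5]
y_pred_r = [2.8, 5.4, 2.0]
print(mean_absolute_error(y_true_r, y_pred_r))
print(mean_squared_error(y_true_r, y_pred_r))
print(r2_score(y_true_r, y_pred_r))
```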
9. Tune Hyperparameters
Use Grid Search or Randomized Search with cross-validation to find the best parameters.
Improves performance and reduces overfitting.
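As a rough sketch, hyperparameter tuning with Grid Search and cross-validation might look like
the following; the RandomForestClassifier and the parameter grid are illustrative assumptions,
not prescribed choices:

```python
# Hedged sketch: Grid Search with 5-fold cross-validation over a small parameter grid.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, random_state=0)

param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, search.best_score_)
```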
Q3 Explain the data preprocessing steps used in preparing a dataset for Machine Learning.
1. Data Cleaning
o Handling Missing Values:
Remove rows/columns with missing values.
Impute with mean, median, mode, or use interpolation.
o Handling Outliers:
Detect using statistical methods (e.g., Z-score, IQR).
Remove or cap/floor extreme values.
2. Data Transformation
o Normalization (Min-Max Scaling): Scales data to a range (usually 0 to 1).
8. Data Discretization
o Converts continuous data into categorical bins (e.g., age into "young", "adult",
"senior").
Q4 Explain the difference between training data and testing data in a dataset. How is it useful
in a Machine Learning Model?
Training Data vs Testing Data
Feature          | Training Data                                        | Testing Data
Purpose          | To teach (train) the model to learn patterns.        | To evaluate how well the model performs.
Usage            | Used to fit the model (the model learns from this).  | Used to test the model's predictions.
Seen by model?   | Yes, during training.                                | No, kept hidden during training.
Size (typically) | 70–80% of the dataset.                               | 20–30% of the dataset.
Role             | Helps in building the model.                         | Checks the model's generalization ability.
Training Data
This is where the model learns relationships between inputs and outputs.
For example, in a regression task, the model uses training data to learn the best-fit line.
In classification, it learns to assign labels to features.
Testing Data
Acts like a final exam for the model.
It helps evaluate how well the model will perform on new, unseen data in the real world.
It helps detect issues like overfitting (model memorizes training data but fails on new data).
Example:
Imagine building a spam email classifier:
Training data: Emails labeled as spam or not-spam that the model uses to learn.
Testing data: New emails the model hasn't seen, used to test if it correctly classifies them
as spam or not.
Best Practice: Keep the test set completely separate from training; never evaluate the model on
data it has already seen (a split sketch is shown below).
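A minimal sketch of an 80/20 train/test split with scikit-learn; the synthetic dataset and the
LogisticRegression model are illustrative assumptions:

```python
# Hedged sketch: split data 80/20, train on the training set, evaluate on the unseen test set.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=100, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)                            # the model only ever sees the training data
print(accuracy_score(y_test, model.predict(X_test)))   # evaluated on unseen test data
```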
Q5 What is training data, labeled data and unlabeled data? What are key steps involved in
developing training data?
Unlabeled data has no output labels attached; for example, given raw reviews with no sentiment
tags, the model tries to group or understand structure without knowing the actual sentiment.
2. K-Fold Cross-Validation
Divide the dataset into K equal parts (folds).
Train the model on K-1 folds and test it on the remaining one.
Repeat this K times, each time changing the test fold.
Average the results for final evaluation.
📌 Commonly used K = 5 or 10.
Example (K = 5):
Run 1: [Train | Train | Train | Train | Test ]
Run 2: [Train | Train | Train | Test  | Train]
Run 3: [Train | Train | Test  | Train | Train]
Run 4: [Train | Test  | Train | Train | Train]
Run 5: [Test  | Train | Train | Train | Train]
5. Leave-P-Out Cross-Validation
Similar to LOOCV, but instead of 1, P data points are used for testing.
Repeat this for every possible combination.
📌 Very rarely used due to high computational cost.
6. Time Series Split (Forward Chaining)
Training folds always come from time periods earlier than the test fold.
📌 Use When: Working with time-dependent data (e.g., stock prices, weather, logs).
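A small sketch of time-ordered splitting using scikit-learn's TimeSeriesSplit (the ten "time
steps" below are placeholder data):

```python
# Hedged sketch: TimeSeriesSplit keeps training folds strictly earlier than the test fold.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)   # e.g., 10 consecutive time steps
tscv = TimeSeriesSplit(n_splits=3)

for train_idx, test_idx in tscv.split(X):
    print("train:", train_idx, "test:", test_idx)
```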
The classifier model can be designed/trained and its performance evaluated using K-fold
cross-validation mode, training mode, or test mode.
The main idea behind K-fold cross-validation is that each sample in the dataset gets the
opportunity to be tested. In each round, we split the dataset into k parts: one part is used
for validation, and the remaining k-1 parts are merged into a training subset for fitting the
model.
• Computation time is reduced, since the process is repeated only k times (e.g., 10 times when k is 10).
• The bias of the performance estimate is reduced.
• Every data point is tested exactly once and is used in training k-1 times.
• The variance of the resulting estimate is reduced as k increases.
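Putting the above together, a hedged sketch of 5-fold cross-validation with scikit-learn; the
iris dataset and LogisticRegression are illustrative choices:

```python
# Hedged sketch: 5-fold cross-validation, reporting the score on each fold and the average.
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)

print(scores)         # accuracy on each of the 5 folds
print(scores.mean())  # averaged result for the final evaluation
```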
Q8 Explain the use of a Confusion Matrix in a Machine Learning Model with a suitable example.
Confusion Matrix
A Confusion matrix is an N x N matrix used for evaluating the performance of a
classification model, where N is the number of target classes. The matrix compares the
actual target values with those predicted by the machine learning model.
A confusion matrix is a table that describes the performance of a classification model by
comparing the actual values (true labels) with the predicted values.
It gives a detailed breakdown of how well the model is classifying each class.
Structure of a Confusion Matrix (Binary Classification)
                       Predicted: Positive (1)   Predicted: Negative (0)
Actual: Positive (1)   ✅ True Positive (TP)      ❌ False Negative (FN)
Actual: Negative (0)   ❌ False Positive (FP)     ✅ True Negative (TN)
The diagonal values of the confusion matrix are the correctly classified cases.
There are two possible predicted classes: "yes" and "no". If we were predicting the
presence of a disease, for example, "yes" would mean they have the disease, and "no" would
mean they don't have the disease.
The classifier made a total of 165 predictions (e.g., 165 patients were being tested for
the presence of that disease).
Out of those 165 cases, the classifier predicted "yes" 110 times, and "no" 55 times.
In reality, 105 patients in the sample have the disease, and 60 patients do not.
True positives (TP): these are cases in which we predicted yes (they have the disease), and
they do have the disease.
True negatives (TN): we predicted no, and they don't have the disease.
False positives (FP): we predicted yes, but they don't actually
have the disease. (Also known as a "type I error.")
False negatives (FN): we predicted no, but they actually do have the disease. (Also known
as a "type II error.")
3. Precision: Precision explains how many of the predicted positives are actually positive.
Precision = TP/predicted yes
4. True Positive Rate (TP Rate), Recall, or Sensitivity: When it's actually yes, how often
does it predict yes?
TP Rate = TP/actual yes
5. False Positive Rate (FP Rate): When it's actually no, how often does it predict yes?
FP Rate = FP/actual no
6. True Negative Rate (TN Rate) or Specificity: When it's actually no, how often does it predict no?
TN Rate = TN/actual no
7. False Negative Rate (FN Rate): When it's actually yes, how often does it predict no?
FN Rate = FN/actual yes
8. F1 Score: The harmonic mean of precision and recall; useful when you want a balance between the two.
F1 = 2 x (Precision x Recall) / (Precision + Recall)
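Using the same assumed cell counts as before, these rates can be computed directly from the
four confusion-matrix cells:

```python
# Hedged sketch: compute the rates above from assumed TP/FN/FP/TN counts.
TP, FN, FP, TN = 100, 5, 10, 50

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)            # of predicted positives, how many are truly positive
recall    = TP / (TP + FN)            # TP rate / sensitivity
fp_rate   = FP / (FP + TN)            # actually "no", predicted "yes"
tn_rate   = TN / (TN + FP)            # specificity
fn_rate   = FN / (FN + TP)            # actually "yes", predicted "no"
f1        = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, fp_rate, tn_rate, fn_rate, f1)
```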
9. Cohen's Kappa: A very useful metric when you want to measure how much agreement exists
between two raters or classifiers beyond what could occur by chance. Cohen's Kappa (κ) measures
the agreement between two sets of categorical labels (e.g., actual vs. predicted), correcting
for the agreement that could happen by chance. It is commonly used to evaluate inter-rater
reliability and classifier agreement.
κ = (Po - Pe) / (1 - Pe)
Where:
Po = Observed agreement (how often the raters agree)
Pe = Expected agreement by chance
Interpretation scale of Cohen's Kappa value:
≤ 0           None or Poor
0.01 – 0.20   Slight
0.21 – 0.40   Fair
0.41 – 0.60   Moderate
0.61 – 0.80   Substantial
0.81 – 1.00   Almost Perfect
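A short sketch of Cohen's Kappa via scikit-learn's cohen_kappa_score; the two label sequences
are made up for illustration:

```python
# Hedged sketch: kappa between two label sequences, i.e. (Po - Pe) / (1 - Pe).
from sklearn.metrics import cohen_kappa_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print(cohen_kappa_score(y_true, y_pred))
```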
10. ROC Curve: A key tool in evaluating the performance of classification models, especially
binary classifiers. ROC stands for Receiver Operating Characteristic. It is a graph that shows
the performance of a classification model across different threshold values by plotting the
True Positive Rate (TPR) against the False Positive Rate (FPR). How the curve is built:
1. Your model returns probabilities instead of just binary predictions (like 0 or 1).
2. You sweep through thresholds (e.g., 0.0 to 1.0).
3. For each threshold, calculate TPR and FPR.
4. Plot TPR vs. FPR.
1. A perfect model: ROC curve passes through the top-left corner (TPR = 1, FPR = 0).
2. A random model: ROC curve is a diagonal line from (0,0) to (1,1).
3. Better model: The closer the curve follows the top-left border, the better.
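A hedged sketch of producing an ROC curve: train any probabilistic classifier, sweep thresholds
over its predicted probabilities, and plot TPR vs. FPR. The dataset, LogisticRegression, and
matplotlib plotting are assumptions for illustration:

```python
# Hedged sketch: ROC curve and AUC for a toy binary classifier.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

X, y = make_classification(n_samples=300, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]   # probabilities, not hard 0/1 predictions

fpr, tpr, thresholds = roc_curve(y_test, probs)
print("AUC:", roc_auc_score(y_test, probs))

plt.plot(fpr, tpr, label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="random classifier")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()
```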
Q10 From the confusion matrix below, determine accuracy, recall, precision, F1-score, True
Positive Rate (TPR), False Positive Rate (FPR), True Negative Rate (TNR), and False Negative
Rate (FNR). Interpret the result:
Confusion matrix (Actual Values: 1, 0):
Examples of Hyperparameters:
Algorithm | Hyperparameter Example