Week 12 Intro to DS and ML
12/3/2024
MAIN DATA SCIENCE STAGES
1. Problem definition
2. Data collection
3. Data preprocessing
4. Exploratory analysis/visualization
5. Experimentation, prediction, and evaluation
Conventional programming: Data + Rules → computer → Output
Machine Learning: Data + Output → computer → Rules
WHAT DOES LEARNING MEAN?
o Imagine teaching a child the difference between dogs and cats by using flashcards.
o As the child practices, his performance improves.
o Human cognition has built-in classification mechanisms.
o Once the child is proficient with the flashcards, he will be able to classify not only the images on the flashcards, but also any other cat or dog image.
o This ability to generalize, to apply knowledge gained through training to new, unseen examples, is a key characteristic of both human and machine learning.
TYPES OF ML ALGORITHMS
1. Supervised learning
2. Unsupervised learning
3. Semi-supervised learning
4. Reinforcement learning
1. SUPERVISED LEARNING
Supervised learning is where you have both the input variable X and the
output variable Y, and you use an algorithm to learn the mapping function from
the input to the output: Y = f(X).
[Figure: scatter plot of house price (in $1000's) against house size (in feet²)]
REGRESSION: HOUSE PRICE PREDICTION
[Figure: scatter plot of house price (in $1000's) against house size (in feet²)]
Let's say a friend wants to know the price of their 750 square foot house. How can the
learning algorithm help you?
- Fitting a straight line isn't the only learning algorithm you can use.
- There are others that could work better for this application.
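As a minimal sketch of the straight-line approach, the code below fits a line by ordinary least squares and uses it to estimate the price of a 750 square foot house. The data points are made up for illustration, and least squares is an assumption here; the slides do not say which fitting method is meant.

```python
# Hypothetical training data (not from the slides):
# house size in square feet and price in $1000's.
sizes = [500, 1000, 1500, 2000, 2500]
prices = [100, 150, 200, 250, 300]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# "Training": learn the mapping Y = f(X) from labeled examples.
slope, intercept = fit_line(sizes, prices)

# "Prediction": apply the learned line to a size that is not in the data.
predicted_price = slope * 750 + intercept
print(f"Estimated price for 750 sq ft: about ${predicted_price:.0f}k")
```

With different data, or a curve instead of a line, the estimate for the same house would change, which is the point of the second bullet above.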
CLASSIFICATION: CANCER DETECTION
[Figure: tumors plotted by tumor size x (diameter in cm), each labeled benign or malignant]
[Figure: tumor diameter from 0 cm to 10 cm, with three classes: benign, malignant type 1, and malignant type 2]
[Figure: tumors plotted by age and tumor size, with a decision boundary separating the classes]
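A small sketch of classification with two input features (age and tumor size) follows. It uses a nearest-centroid classifier, chosen here only because it is short; the slides do not name a specific algorithm. The training examples are hypothetical, and the features are left unscaled to keep the example brief.

```python
# Hypothetical labeled examples: (age in years, tumor size in cm) -> class.
training = [
    ((35, 1.0), "benign"),
    ((42, 1.5), "benign"),
    ((50, 2.0), "benign"),
    ((55, 6.0), "malignant"),
    ((63, 7.5), "malignant"),
    ((70, 8.0), "malignant"),
]


def fit_centroids(examples):
    """Compute the mean feature vector (centroid) of each class."""
    sums = {}
    counts = {}
    for features, label in examples:
        total = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            total[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def predict(centroids, features):
    """Return the class whose centroid is closest to the given features."""
    def squared_distance(center):
        return sum((a - b) ** 2 for a, b in zip(features, center))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


centroids = fit_centroids(training)
print(predict(centroids, (45, 1.2)))
print(predict(centroids, (60, 7.0)))
```

For this classifier the decision boundary is the set of points equally distant from the two centroids, which is a straight line in the age and tumor-size plane.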
SUPERVISED LEARNING (RECAP)
2. UNSUPERVISED LEARNING
Supervised learning: learn from data labeled with the “right answers”.
Unsupervised learning: find something interesting in unlabeled data.
The data comes only with inputs x, not output labels y, so the algorithm has to find structure in the data.
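One common way to find structure in unlabeled data is clustering. The sketch below runs a bare-bones k-means on one-dimensional data; the age values and the initial centers are hypothetical, and k-means is used only as a familiar example of a clustering algorithm.

```python
def k_means(points, centers, iterations=10):
    """Cluster 1-D points around the given initial centers (simple k-means)."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical unlabeled inputs: ages only, with no output labels.
ages = [22, 25, 27, 61, 64, 68]
centers, clusters = k_means(ages, centers=[20.0, 70.0])
print(centers)
print(clusters)
```

No labels are supplied anywhere; the two groups emerge only from how close the values are to each other.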
UNSUPERVISED LEARNING (CLUSTERING: GOOGLE NEWS)
UNSUPERVISED LEARNING (CLUSTERING: DNA MICROARRAY)
Unsupervised learning algorithms can analyze genetic data to identify patterns and
relationships, leading to insights in personalized medicine and genetic research.
[Figure: DNA microarray data, with genes as rows and individuals as columns]
3. SEMI-SUPERVISED LEARNING
➢ Semi-supervised learning is a type of machine learning that falls between supervised and
unsupervised learning.
➢ It uses a small amount of labeled data and a large amount of unlabeled data
to train a model.
➢ First stage: train the model on the small labeled dataset to learn a function that can accurately
predict the output variable from the input variables, as in supervised learning.
➢ Second stage: make use of the unlabeled data, for example through:
1. Self-training: the model trained on the labeled data is used to predict labels for the
unlabeled data.
2. Clustering: the small labeled dataset is used to inform and guide pattern discovery in the
unlabeled data.
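The two stages can be sketched with a single round of self-training. This is an illustration under stated assumptions: the data is made up, the base model is a simple nearest-centroid classifier on one feature, and every pseudo-label is accepted without a confidence check, which a practical system would usually add.

```python
def fit_centroids(labeled):
    """Return the mean feature value of each class."""
    groups = {}
    for x, label in labeled:
        groups.setdefault(label, []).append(x)
    return {label: sum(xs) / len(xs) for label, xs in groups.items()}


def predict(centroids, x):
    """Return the class whose mean is closest to x."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


# Hypothetical data: tumor diameter in cm. Only four examples are labeled.
labeled = [(1.0, "benign"), (2.0, "benign"), (8.0, "malignant"), (9.0, "malignant")]
unlabeled = [1.5, 2.5, 3.0, 7.0, 7.5, 8.5]

# First stage: train on the small labeled dataset.
centroids = fit_centroids(labeled)

# Second stage (self-training): predict labels for the unlabeled data ...
pseudo_labeled = [(x, predict(centroids, x)) for x in unlabeled]

# ... then retrain on the labeled and pseudo-labeled examples together.
centroids = fit_centroids(labeled + pseudo_labeled)
print(pseudo_labeled)
print(centroids)
```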
3. SEMI-SUPERVISED LEARNING
Training set (70%): Features (X), Label / Target (Y)
Test set (30%): Features (X), Label / Target (Y)
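A 70/30 split like this can be sketched with the standard library alone. The function name `train_test_split`, the fixed seed, and the toy dataset are assumptions made for illustration.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=0):
    """Shuffle the rows and split them into (training set, test set)."""
    rows = list(rows)                   # copy so the caller's data is untouched
    random.Random(seed).shuffle(rows)   # fixed seed keeps the split reproducible
    n_test = round(len(rows) * test_fraction)
    return rows[n_test:], rows[:n_test]


# Hypothetical dataset of (feature X, label Y) pairs.
dataset = [(x, 2 * x) for x in range(10)]
train_set, test_set = train_test_split(dataset)
print(len(train_set), len(test_set))
```

Shuffling before splitting avoids a test set made up only of the last rows, which matters when the data is stored in some meaningful order.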
TERMINOLOGIES OF MACHINE LEARNING
Model: A model is a specific representation learned from data by applying some
machine learning algorithm.
Training: The idea is to give a set of inputs (training set) and its expected outputs
(labels), so after training, we will have a model that will then map new data to one of
the categories trained on.
Prediction: Once our model is ready, it can be fed a set of inputs (test set) to which it
will provide a predicted output (label).
MEASURING THE PERFORMANCE OF ML CLASSIFICATION
1. Confusion Matrix
A table summarizing the performance of a classification model by
showing the actual vs predicted classifications. It includes:
True Positives (TP): Correctly predicted positive instances
True Negatives (TN): Correctly predicted negative instances
False Positives (FP): Incorrectly predicted positive instances
False Negatives (FN): Incorrectly predicted negative instances

                     Actual value
                     1      0
Predicted value  1   TP     FP
                 0   FN     TN
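As a sketch, the four cells can be counted directly from paired lists of actual and predicted labels. The label lists below are hypothetical, with 1 as the positive class and 0 as the negative class.

```python
def confusion_counts(actual, predicted):
    """Return (TP, FP, FN, TN) for binary labels, where 1 is the positive class."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            tp += 1   # predicted positive, actually positive
        elif p == 1 and a == 0:
            fp += 1   # predicted positive, actually negative
        elif p == 0 and a == 1:
            fn += 1   # predicted negative, actually positive
        else:
            tn += 1   # predicted negative, actually negative
    return tp, fp, fn, tn


# Hypothetical labels for eight instances.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fp, fn, tn = confusion_counts(actual, predicted)
print("             actual 1  actual 0")
print(f"predicted 1  {tp:8d}  {fp:8d}")
print(f"predicted 0  {fn:8d}  {tn:8d}")
```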
2. Accuracy
The proportion of correctly classified instances over the total instances.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
3. Precision
The proportion of true positive predictions out of all the instances that
were predicted as positive.
$$\text{Precision} = \frac{TP}{TP + FP}$$
4. Recall (Sensitivity)
The proportion of true positive predictions out of all the actual positive
instances.
$$\text{Recall} = \frac{TP}{TP + FN}$$
5. F1-Score
The harmonic mean of precision and recall. It balances the trade-off between
precision and recall.
$$F_1\text{-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
Example: Email spam (Classification)
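The slide's own figures for this example are not available in the text, so the following is a stand-in sketch with made-up counts: 100 emails, with spam treated as the positive class. It applies the accuracy, precision, recall, and F1 formulas to those counts.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts.

    Assumes at least one predicted positive and one actual positive,
    so that no denominator is zero.
    """
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical spam-filter results on 100 emails (not from the slides):
# 30 spam caught, 10 normal emails wrongly flagged,
# 20 spam missed, 40 normal emails correctly passed.
metrics = classification_metrics(tp=30, fp=10, fn=20, tn=40)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With these counts precision is higher than recall: most flagged emails really are spam, but a sizeable share of the spam still gets through.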
DATA SCIENCE ROLES
1. Data engineer
2. Data analyst
3. Data scientist
4. Machine learning scientist/engineer
1. DATA ENGINEER
➢ Data engineers control the flow of data: they build custom data pipelines and
storage systems.
➢ They design infrastructure so that data is not only collected, but also easy to
obtain and process.
➢ Within the data science workflow, they focus on the first stage: data collection
and storage.
➢ Data engineering tools: SQL, Java, Scala, or Python to process data, and
cloud computing to ingest and store large amounts of data.
2. DATA ANALYST
➢ Data analysts describe the present via data.
➢ They do this by exploring the data and creating visualizations and
dashboards.
➢ To do these tasks, they often have to clean the data first.
➢ Within the workflow, they focus on the middle two stages: data preparation,
and exploration and visualization.
➢ Data analyst tools: SQL, spreadsheets, and Business Intelligence (BI) tools such as
Tableau, Power BI, or Looker, to create dashboards and share their analyses.
3. DATA SCIENTIST
➢ Data scientists have a strong background in statistics, enabling them to find
new insights from data, rather than solely describing data.
➢ They also use traditional machine learning for prediction and forecasting.
➢ Within the workflow, they focus on the last three stages: data preparation,
exploration and visualization, and experimentation and prediction.
➢ Data scientist tools: SQL, Python, and R.
4. MACHINE LEARNING SCIENTIST/ENGINEER
➢ Machine learning scientists are similar to data scientists but with a machine
learning specialization.
➢ They focus on developing, training, and deploying machine learning models in
production environments.
➢ They go beyond traditional machine learning with deep learning.
➢ Within the workflow, they do the last three stages with a strong focus on
prediction.
➢ Machine learning tools: Python or R to create their predictive models.