Week 12 Intro to DS and ML

Uploaded by

laplluve
Copyright
© © All Rights Reserved
INTRODUCTION TO DATA SCIENCE AND MACHINE LEARNING

OUTLINE
➢Introduction
➢What can data do for us?
➢What is Data Science?
➢Main Data Science stages
➢What is Machine Learning (ML)?
➢Types of ML Algorithms
➢Measuring the performance of ML
➢Data science is not machine learning!
➢Job Description related to DS and ML
INTRODUCTION
Data is being collected all around us. Every like, click, email,
credit card swipe, or tweet is a new piece of data that can be
used to better describe the present or predict the future.
Types of Data:
❖Structured: Organized into rows and columns (e.g., databases,
spreadsheets).
❖Unstructured: Unorganized, often textual or multimedia (e.g.,
emails, videos).
❖Semi-structured: Falls between structured and unstructured
(e.g., JSON, XML).
INTRODUCTION
Big Data refers to extremely large datasets that are too complex or
voluminous for traditional data processing tools to handle efficiently. These
datasets can come from a variety of sources such as social media, sensors,
transactions, and more. Big Data typically involves three main characteristics,
often referred to as the "Three Vs":
1. Volume: The sheer amount of data being generated and stored.
2. Velocity: The speed at which data is generated, processed, and analyzed.
3. Variety: The different types and formats of data, including structured
(databases), semi-structured (XML, JSON), and unstructured (text, images,
videos).
DATA EXPLOSION
The world generates 2.5 quintillion bytes of data daily, enough to fill millions of DVDs!
WHAT CAN DATA DO FOR US?
1. Describe our current state, such as our energy consumption.
2. Diagnose the causes of observed events and behaviors.
3. Detect anomalous events, such as fraudulent purchases.
4. Predict future events.
WHAT IS DATA SCIENCE?
“At a high level, data science is a set of fundamental principles that guide the
extraction of knowledge from data.”
•Principles can be statistical, computational, algorithmic, visual, etc.
•DS is a set of methodologies for taking in thousands of forms of data and then using them to
draw meaningful conclusions.
•Data science is interdisciplinary because its goal is to aid discoveries and decision making;
it draws on Statistics and Mathematics, Computer Science, Domain Expertise, etc.
•Applicable to many domains (e.g., sciences, finance, healthcare).
WHY IS DATA SCIENCE IMPORTANT?
• Data Science unlocks the potential of data for solving societal
challenges and large-scale complex problems across
domains, from business, technology, science, engineering,
and healthcare to government, and many more.
• As data continues to grow in volume, velocity and
complexity, there is a strong demand for data science
talent to help design the best solutions.
MODELLING PROCESS IN DATA SCIENCE

12/3/2024
MAIN DATA SCIENCE STAGES
Problem definition → Data collection → Data preprocessing → Exploratory/visualization → Experimentation, prediction, and evaluation → Insight / Policy decisions


1. DATA COLLECTION
➢Gathering data from various sources, such as sensors, databases, APIs, or surveys. This is the first
step to begin the data science process.
➢There are two general types of data:
1. Quantitative data can be expressed in numbers. For example, the fridge is 60 inches tall, has two
apples in the basket, and costs 1000 dollars.
2. Qualitative data are things that can be observed but not measured. For example, the fridge is
red, was built in Italy, and might need to be cleaned out because it smells like fish.
➢Beyond traditional quantitative and qualitative data, there are other data types such as image data,
text data, geospatial data, network data, and many more.
➢To select the storage type, we need to determine: where we want to store the data, what kind of
data we are storing, and how we can retrieve our data from storage.
➢The storage could be a single computer, parallel storage, cloud-based, etc.
2. DATA PREPROCESSING
A. Data Cleaning
oHandle missing values (e.g., imputation, deletion).
oCorrect errors (e.g., typos, outliers).
oStandardize formats (e.g., dates, currencies).
oHandle outliers: Remove or adjust anomalous data points.
oExample: Filling missing "age" with the median or mean.

B. Data Integration
oCombine data from multiple sources into a single dataset.
oResolve inconsistencies between datasets.
oExample: Joining tables using a common key.

C. Data Transformation
oNormalize or scale features to bring them into a uniform range (e.g., Min-Max Scaling, Z-score).
oEncode categorical data (e.g., one-hot encoding, label encoding).
oCreate new features (feature engineering) to enhance predictive power.
oExample: Converting "Date of Birth" to "Age."

D. Data Reduction
oRemove irrelevant or redundant features.
oReduce dimensionality using techniques like PCA.
oExample: Dropping columns with low variance.
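A few of the preprocessing steps listed above can be sketched in plain Python. This is only an illustrative sketch: the `ages` and `colors` columns are hypothetical toy data, and in practice a library would usually handle these steps.

```python
from statistics import mean, median, pstdev

# Hypothetical toy records; None marks a missing "age" value.
ages = [22, None, 35, 41, None, 29]
colors = ["red", "blue", "red", "green", "blue", "red"]

# Data cleaning: fill missing ages with the median of the observed ages.
observed = [a for a in ages if a is not None]
age_median = median(observed)
ages_filled = [a if a is not None else age_median for a in ages]

# Data transformation: Min-Max scaling maps the values into the range [0, 1].
lowest, highest = min(ages_filled), max(ages_filled)
ages_minmax = [(a - lowest) / (highest - lowest) for a in ages_filled]

# Data transformation: Z-score scaling gives mean 0 and standard deviation 1.
mu, sigma = mean(ages_filled), pstdev(ages_filled)
ages_zscore = [(a - mu) / sigma for a in ages_filled]

# Data transformation: one-hot encode a categorical column.
categories = sorted(set(colors))
one_hot = [[1 if c == cat else 0 for cat in categories] for c in colors]

print(ages_filled)
print(categories, one_hot[0])
```

Median imputation is used here because it is less sensitive to outliers than the mean; either choice matches the example given in the slide.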
3. EXPLORATION / VISUALIZATION
Analyzing the data through visualizations, like graphs
and charts, to understand patterns and relationships
within the data. This step helps to reveal trends or
anomalies.
Exploratory Data Analysis (EDA) consists of exploring
the data, assessing its main characteristics, and
formulating hypotheses about it, with a strong emphasis
on visualization.
EDA happens after data preparation, but EDA can
reveal new things that need cleaning.
Histograms: Show the distribution of a single variable,
useful for understanding the frequency of values in
numerical data.
3. EXPLORATION / VISUALIZATION
Box Plots: Help identify the spread and outliers in a dataset.
Bar Charts: Useful for comparing quantities across different categories in categorical data.
3. EXPLORATION / VISUALIZATION
Scatter Plots: Show relationships between two continuous variables, helping to identify trends, correlations, or clusters.
Heatmaps: Represent data values using colors. They can be used to show the correlation matrix between different numerical variables to identify which variables are correlated.
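The numbers behind two of these charts can be computed by hand. The sketch below assumes hypothetical house `sizes` and `prices`; it counts values per bin (what a histogram draws) and computes the Pearson correlation (one cell of the correlation matrix a heatmap would color).

```python
from math import sqrt

def histogram_counts(values, bin_edges):
    """Count values per [left, right) bin; the last bin also includes its right edge."""
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        for i in range(len(counts)):
            left, right = bin_edges[i], bin_edges[i + 1]
            is_last = i == len(counts) - 1
            if left <= v < right or (is_last and v == right):
                counts[i] += 1
                break
    return counts

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical data: house sizes (m^2) and prices (in $1000's).
sizes = [50, 70, 80, 100, 120]
prices = [150, 200, 230, 290, 340]

edges = [50, 75, 100, 125]
counts = histogram_counts(sizes, edges)
for left, count in zip(edges, counts):
    print(f"{left:>3}+ | {'#' * count}")  # a text-only histogram

r = pearson(sizes, prices)
print(f"correlation between size and price: {r:.3f}")
```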
4. EXPERIMENTATION, PREDICTION AND EVALUATION

What is Machine Learning (ML)?

“Field of study that gives computers the ability to learn without being explicitly programmed.”

Arthur Samuel (1959)

[Figure: A checkers game between a human player and an electronic player]
WHAT IS MACHINE LEARNING (ML)?
Definition by Tom Mitchell (1998)
Machine Learning is the study of algorithms that:
• improve their performance P
• at some task T
• with experience E
A well-defined learning task is given by <P, T, E>.
WHAT IS MACHINE LEARNING (ML)?
❑ Task T: Playing checkers
❑ Performance P: Percentage of games won against an arbitrary opponent
❑ Training Experience E: Playing practice games against itself
WHAT IS MACHINE LEARNING (ML)?
Handwriting recognition learning problem
❑ Task T: Recognizing and classifying handwritten words within images
❑ Performance P: Percent of words correctly classified
❑ Training experience E: A dataset of handwritten words with given classifications
WHAT IS MACHINE LEARNING (ML)?
A robot driving learning problem
❑ Task T: Driving on highways using vision sensors
❑ Performance P: Average distance travelled before an error
❑ Training experience E: A sequence of images and steering commands recorded while observing a human driver
TRADITIONAL PROGRAMMING VS ML
Traditional Programming: Data + Rules → computer → Output
Machine Learning: Data + Output → computer → Rules
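The contrast between the two diagrams can be illustrated with a toy example. In this sketch the temperatures, labels, and function names are all hypothetical: the first function has its rule written by the programmer, while the second derives the rule (a threshold) from data plus known outputs. It assumes the two classes do not overlap.

```python
# Traditional programming: data + rules -> output.
def is_hot_rule(temp_c):
    """The rule (30 degrees) is chosen by the programmer."""
    return temp_c >= 30

# Machine learning: data + output -> rules.
def learn_threshold(temps, labels):
    """Derive a threshold halfway between the warmest 'not hot' example
    and the coolest 'hot' example (assumes the classes do not overlap)."""
    hottest_cold = max(t for t, y in zip(temps, labels) if y == 0)
    coldest_hot = min(t for t, y in zip(temps, labels) if y == 1)
    return (hottest_cold + coldest_hot) / 2

# Hypothetical training data: temperatures with labels (1 = hot, 0 = not hot).
temps = [12, 18, 24, 27, 31, 35, 38]
labels = [0, 0, 0, 0, 1, 1, 1]

threshold = learn_threshold(temps, labels)

def is_hot_learned(temp_c):
    """The rule now comes from the data rather than from the programmer."""
    return temp_c >= threshold

print(threshold, is_hot_rule(33), is_hot_learned(33))
```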
WHAT DOES LEARNING MEAN?
oImagine teaching a child the difference between dogs and cats by
using flashcards.
oAs the child practices, their performance improves.
oHuman cognition has built-in classification mechanisms.
oAfter the child is proficient with the flashcards, they will be able to classify
not only the images on the flashcards, but also any cat or dog image.
oThis ability to generalize, to apply knowledge gained through
training to new unseen examples, is a key characteristic of both human
and machine learning.
TYPES OF ML ALGORITHMS
1. Supervised learning
2. Unsupervised learning
3. Semi-supervised learning
4. Reinforcement learning
1. SUPERVISED LEARNING

[Figure: inputs paired with output labels]

Supervised learning is where you have both the input variable x and the
output variable y, and you use an algorithm to learn the mapping function from
the input to the output: y = f(x).

It learns from being given the “right answers”.


TYPES OF SUPERVISED LEARNING
1. Regression:
Used to predict a continuous numerical output.
For example, a regression algorithm could be used to predict the price of
a house based on its size, location, and other features.
2. Classification:
Used to predict a categorical output.
For example, a classification algorithm could be used to predict
whether an email is spam or not.
REGRESSION: HOUSE PRICE PREDICTION
[Figure: scatter plot of price (in $1000's) against house size (in square feet)]
Let's say a friend wants to know the price of their 750 square foot house. How can the
learning algorithm help you?
REGRESSION: HOUSE PRICE PREDICTION
- Fitting a straight line isn't the only 400
learning algorithm you can use.
300
-There are others that could work better Price in
for this application. $1000's 200

- For example, you might decide that it's 100


better to fit a curve, a function that's 0
slightly more complicated than a straight 0 500 1000 2000 2500
line. 1500
House
House size in feet2
size in
feet2
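The straight-line fit mentioned above can be sketched with ordinary least squares. The house data below are hypothetical and were chosen to lie on a line, so this shows the mechanics rather than real prices; `fit_line` is an illustrative helper, not a library function.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single input: returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical training data: house size in square feet, price in $1000's.
sizes = [500, 1000, 1500, 2000, 2500]
prices = [100, 150, 200, 250, 300]

slope, intercept = fit_line(sizes, prices)

# Predict the price of the friend's 750 square foot house.
predicted_price = slope * 750 + intercept
print(f"predicted price: about ${predicted_price:.0f}k")
```

With these toy numbers the fitted line is roughly price = 0.1 × size + 50, so the 750 square foot house comes out at about $125k.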
REGRESSION: HOUSE PRICE PREDICTION
•This was an example of supervised learning because we gave the algorithm a
dataset in which the label (the correct price y) is given for every house on the plot.
•The task of the learning algorithm is to predict the likely price for other
houses, like your friend's house.
•This housing price prediction is a particular type of supervised learning called
regression.
•In regression, we are trying to predict a number from infinitely many possible
numbers, such as the house prices in our example, which could be 150,000 or 70,000
or 183,000 or any other number in between.
CLASSIFICATION: CANCER DETECTION
An ML system to diagnose a lump as malignant or benign.
The dataset has tumors of different sizes, each labeled either 0 (benign) or 1 (malignant).
Now we plot the data, with the tumor size on the x-axis and the type of the tumor
on the y-axis (0 for benign, 1 for malignant).
[Figure: tumor type (benign or malignant) against tumor size x (diameter in cm)]
CLASSIFICATION: CANCER DETECTION
[Figure: tumors along a diameter axis from 0 cm to 10 cm, marked as benign, malignant type 1, or malignant type 2]
Remember: classification predicts categories.
In classification we may have more than two classes.
CLASSIFICATION: CANCER DETECTION
We can use two or more inputs (not just the tumor size, but also the age of the patient).
[Figure: age against tumor size, with a decision boundary drawn between the classes]
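A classifier with two inputs can be sketched with a nearest-centroid rule, which is one simple way to obtain a straight decision boundary. This is a swapped-in technique for illustration, not necessarily the one behind the slide's figure, and the patient records are hypothetical.

```python
def centroid(points):
    """Average point of a list of (tumor_size, age) pairs."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(2))

def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

# Hypothetical patients: ((tumor size in cm, age in years), label),
# where label 0 = benign and 1 = malignant.
training = [((1.0, 30), 0), ((1.5, 42), 0), ((2.0, 35), 0),
            ((4.0, 60), 1), ((5.0, 55), 1), ((4.5, 68), 1)]

# Training: compute one centroid per class.
centroids = {label: centroid([x for x, y in training if y == label])
             for label in (0, 1)}

def classify(point):
    """Predict the class whose centroid is closest to the point."""
    return min(centroids, key=lambda label: squared_distance(point, centroids[label]))

print(classify((1.2, 33)), classify((4.8, 63)))
```

The boundary between the two predictions is the line of points equally far from both centroids. Because age spans a much larger numeric range than tumor size here, scaling the features first (as in the preprocessing stage) would usually be advisable.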
SUPERVISED LEARNING (RECAP)
2. UNSUPERVISED LEARNING
Supervised learning: learn from data labeled with the “right answers”.
Unsupervised learning: find something interesting in unlabeled data.
[Figure: two plots of age against tumor size, one for each setting]

UNSUPERVISED LEARNING
➢You're given data on patients and their tumor size and the
patient's age.
➢But not whether the tumor was benign or malignant.
➢We're not asked to diagnose whether the tumor is benign
or malignant, because we're not given any labels.
➢Our job is to find some structure or some pattern or just
find something interesting in the data.
UNSUPERVISED LEARNING

The data comes only with inputs x, not output labels y. The algorithm has to find structure in the data.
UNSUPERVISED LEARNING (CLUSTERING: GOOGLE NEWS)
UNSUPERVISED LEARNING (CLUSTERING: DNA MICROARRAY)
Unsupervised learning algorithms can analyze genetic data to identify patterns and
relationships, leading to insights in personalized medicine and genetic research.
[Figure: DNA microarray data; each row is a gene and each column is an individual]
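Clustering can be sketched with a small k-means loop, one common clustering algorithm. The points and the choice of starting centroids below are hypothetical, and real implementations usually pick starting centroids more carefully.

```python
def squared_distance(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def mean_point(points):
    """Coordinate-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean_point(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical unlabeled data: no "right answers" are given.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5),
          (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]

# Assumption: start from two of the data points as initial centroids.
final_centroids, final_clusters = kmeans(points, [points[0], points[3]])
print(final_centroids)
```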
3. SEMI-SUPERVISED LEARNING
➢Semi-supervised learning is a type of machine learning that falls in between supervised and
unsupervised learning.
➢It is a method that uses a small amount of labeled data and a large amount of unlabeled data
to train a model.
➢First stage: train the model on the small labeled dataset to learn a function that can accurately
predict the output variable based on the input variables, similar to supervised learning.
➢Second stage: make use of the unlabeled data, for example through:
1. Self-training: the model trained on the labeled data is used to predict labels for the
unlabeled data.
2. Clustering: use the small labeled dataset to inform and guide the pattern discovery in the
unlabeled data.
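The two stages, with self-training as the second stage, can be sketched on one-dimensional toy data. Everything here is hypothetical: the data, the simple threshold model, and the assumption that class 1 has the larger values. Practical self-training usually keeps only the predictions the model is confident about, which this sketch omits.

```python
def train_threshold(xs, ys):
    """A very simple 1-D classifier: the threshold halfway between the class means.
    Assumes class 1 tends to have larger values than class 0."""
    class0 = [x for x, y in zip(xs, ys) if y == 0]
    class1 = [x for x, y in zip(xs, ys) if y == 1]
    return (sum(class0) / len(class0) + sum(class1) / len(class1)) / 2

def predict(threshold, x):
    return 1 if x >= threshold else 0

# A small labeled dataset and a larger unlabeled one (both hypothetical).
labeled_x, labeled_y = [1.0, 5.0], [0, 1]
unlabeled_x = [0.5, 1.5, 2.0, 4.0, 4.5, 6.0]

# First stage: train on the labeled data only.
threshold = train_threshold(labeled_x, labeled_y)

# Second stage (self-training): predict pseudo-labels for the unlabeled data,
# then retrain on the labeled and pseudo-labeled data together.
pseudo_labels = [predict(threshold, x) for x in unlabeled_x]
threshold_retrained = train_threshold(labeled_x + unlabeled_x,
                                      labeled_y + pseudo_labels)

print(threshold, pseudo_labels, threshold_retrained)
```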
3. SEMI-SUPERVISED LEARNING
➢Why use semi-supervised learning?
1. Cost of Labeling Data
2. Better Performance with Less Labeling
➢It is used in text classification, speech recognition, image
classification, etc.
4. REINFORCEMENT LEARNING
➢Reinforcement learning is the problem of getting
an agent to take actions that maximize reward in
a particular situation
➢A learner is not told what actions to take as in
most forms of machine learning but instead must
discover which actions yield the most reward by
trying them.
➢For example, consider teaching a dog a new
trick: we cannot tell it what to do or what not to do,
but we can reward or punish it when it does the
right or wrong thing.
4. REINFORCEMENT LEARNING
The problem is as follows: We have an agent (a robot) and a
reward, with many hurdles in between. The agent is
supposed to find the best possible path to reach the
reward.
The robot learns by trying all the possible paths and
then choosing the path that reaches the reward with
the fewest hurdles.
Each right step gives the robot a reward, and each
wrong step subtracts from the robot's reward.
The total reward is calculated when the robot reaches the
final reward, which is the diamond.
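The description above can be sketched as an exhaustive search over paths on a tiny grid. This mirrors the "try all possible paths" wording only; it is not a real reinforcement learning algorithm such as Q-learning. The grid layout, the reward values (+1, -1, +10), and the right/down-only moves are all assumptions made for illustration.

```python
GRID_SIZE = 3
START, DIAMOND = (0, 0), (2, 2)
HURDLES = {(0, 1), (1, 1)}  # hypothetical hurdle cells, as (row, column)

def step_reward(cell):
    """+10 for reaching the diamond, -1 for stepping on a hurdle, +1 otherwise."""
    if cell == DIAMOND:
        return 10
    return -1 if cell in HURDLES else 1

def all_paths(cell):
    """Yield every path from cell to the diamond using only right/down moves."""
    if cell == DIAMOND:
        yield [cell]
        return
    row, col = cell
    for nxt in ((row, col + 1), (row + 1, col)):
        if nxt[0] < GRID_SIZE and nxt[1] < GRID_SIZE:
            for rest in all_paths(nxt):
                yield [cell] + rest

def total_reward(path):
    """Sum the rewards of every step taken (the start cell itself earns nothing)."""
    return sum(step_reward(cell) for cell in path[1:])

best_path = max(all_paths(START), key=total_reward)
print(best_path, total_reward(best_path))
```

On larger problems, trying every path quickly becomes infeasible, which is why practical reinforcement learning methods estimate the value of actions from sampled experience instead.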
TYPES OF ML ALGORITHMS
1. Supervised learning: Models learn from labeled data (input-output pairs)
Given: training data + desired outputs (labels)
Examples: Logistic regression, linear regression, random forest, decision trees, and
Naive Bayes
2. Unsupervised learning: Models find hidden patterns in unlabeled data
Given: training data (without desired outputs)
Examples: K-means clustering and principal component analysis (PCA)
TYPES OF ML ALGORITHMS
3. Semi-supervised learning:
Given: training data + a few desired outputs
4. Reinforcement learning: Models learn by interacting with the environment and
receiving rewards.
Example: Gaming (AlphaGo), robotics.
What type of ML was Arthur Samuel's checkers program?
DS & ML
Training set (70%): Features (X), Label / Target (Y)
Test set (30%): Features (X), Label / Target (Y)
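The 70% / 30% split can be sketched as follows. The function is a hand-written illustration (its name only echoes the similarly named helper found in common ML libraries), and the features and labels are hypothetical.

```python
import random

def train_test_split(features, labels, test_fraction=0.3, seed=0):
    """Shuffle the row indices, then hold out test_fraction of the rows for testing."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)  # fixed seed for a repeatable split
    n_test = round(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [features[i] for i in train_idx]
    x_test = [features[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test

# Hypothetical dataset: 10 rows of features (X) with one label (Y) each.
X = [[i, i * 2] for i in range(10)]
Y = [i % 2 for i in range(10)]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y)
print(len(X_train), len(X_test))
```

Shuffling before splitting avoids a test set that only contains the last rows of an ordered dataset.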
TERMINOLOGIES OF MACHINE LEARNING
Model: A model is a specific representation learned from data by applying some
machine learning algorithm.

Training: The idea is to give a set of inputs (training set) and its expected outputs
(labels), so after training, we will have a model that will then map new data to one of
the categories trained on.

Prediction: Once our model is ready, it can be fed a set of inputs (test set) to which it
will provide a predicted output (label).
MEASURING THE PERFORMANCE OF ML
CLASSIFICATION
1. Confusion Matrix
A table summarizing the performance of a classification model by
showing the actual vs predicted classifications. It includes:
True Positives (TP): Correctly predicted positive instances
True Negatives (TN): Correctly predicted negative instances
False Positives (FP): Incorrectly predicted positive instances
False Negatives (FN): Incorrectly predicted negative instances

                    Actual value 1    Actual value 0
Predicted value 1   TP                FP
Predicted value 0   FN                TN
MEASURING THE PERFORMANCE OF ML
CLASSIFICATION
2. Accuracy
The proportion of correctly classified instances over the total instances.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
MEASURING THE PERFORMANCE OF ML
CLASSIFICATION
3. Precision
The proportion of true positive predictions out of all the instances that
were predicted as positive.
$$\text{Precision} = \frac{TP}{TP + FP}$$
MEASURING THE PERFORMANCE OF ML
CLASSIFICATION
4. Recall (Sensitivity)
The proportion of true positive predictions out of all the actual positive
instances.
$$\text{Recall} = \frac{TP}{TP + FN}$$
MEASURING THE PERFORMANCE OF ML
CLASSIFICATION
5. F1-Score
The harmonic mean of precision and recall. It balances the trade-off between
precision and recall.
$$F_1\text{-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
MEASURING THE PERFORMANCE OF ML
CLASSIFICATION
Example: Email spam (Classification)

Example   Target y   Prediction ŷ
1         1          1
2         1          0
3         0          1
4         0          0
5         1          1

Accuracy: 3/5 = 0.6 = 60%
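The same five examples can be used to work through all the classification metrics defined earlier. This is a minimal sketch that counts the confusion matrix cells directly; the variable names are illustrative.

```python
# Targets and predictions from the email spam example (1 = spam, 0 = not spam).
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam correctly flagged
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-spam correctly passed
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # non-spam wrongly flagged
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam that was missed

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

Here TP = 2, TN = 1, FP = 1, and FN = 1, so accuracy is 3/5 while precision, recall, and F1 each come out to 2/3.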
MEASURING THE PERFORMANCE OF ML
REGRESSION
1. Mean Absolute Error (MAE)
The average of the absolute differences between the predicted values and the actual values.
$$MAE = \frac{1}{n} \sum_{i=1}^{n} \left| y_{true,i} - y_{pred,i} \right|$$
2. Mean Squared Error (MSE)
The average of the squared differences between the predicted values and the actual values. It
penalizes larger errors more heavily than MAE because the differences are squared.
$$MSE = \frac{1}{n} \sum_{i=1}^{n} \left( y_{true,i} - y_{pred,i} \right)^2$$


MEASURING THE PERFORMANCE OF ML
REGRESSION
Example: house pricing

Example   Target y   Prediction ŷ
1         1          0.8
2         2          1.9
3         3          2.9
4         4          4.1
5         5          5.2

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{1}{5} \left[ (1-0.8)^2 + (2-1.9)^2 + (3-2.9)^2 + (4-4.1)^2 + (5-5.2)^2 \right] = 0.022$$
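Both regression metrics can be checked on the table above with a few lines of Python. This is a minimal sketch; the function names are illustrative.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Targets and predictions from the house pricing example.
y_true = [1, 2, 3, 4, 5]
y_pred = [0.8, 1.9, 2.9, 4.1, 5.2]

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
print(f"MAE = {mae:.3f}, MSE = {mse:.3f}")
```

The MSE agrees with the hand calculation (0.022), and the MAE for the same predictions is 0.14.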
DATA SCIENCE IS NOT MACHINE LEARNING!
Even though DS and ML are closely related, they are not the same!
Machine learning focuses heavily on complex algorithms and involves
computation and statistics.
These algorithms need clean, ready-to-use datasets, which the DS field
provides, before they can be tested.
But sometimes the best way to solve a problem is simply to visualize the data.
JOB DESCRIPTIONS

1. Data engineer
2. Data analyst
3. Data scientist
4. Machine learning scientist/engineer
1. DATA ENGINEER
➢ Data engineers control the flow of data, build custom data pipelines and
storage systems.
➢They design infrastructure so that data is not only collected, but easy to
obtain and process.
➢Within the data science workflow, they focus on the first stage: data collection
and storage.
➢Data engineering tools: SQL, Java, Scala, or Python to process data, and
cloud computing to ingest and store large amounts of data.
2. DATA ANALYST
➢Data analysts describe the present via data.
➢They do this by exploring the data and creating visualizations and
dashboards.
➢To do these tasks, they often have to clean data first.
➢Within the workflow, they focus on the middle two stages: data preparation
and exploration and visualization.
➢Data analyst tools: SQL, spreadsheets, Business Intelligence (BI) tools such as
Tableau, Power BI, or Looker, to create dashboards and share their analyses.
3. DATA SCIENTIST
➢Data Scientists have a strong background in statistics, enabling them to find
new insights from data, rather than solely describing data.
➢They also use traditional machine learning for prediction and forecasting.
➢Within the workflow, they focus on the last three stages: data preparation,
exploration and visualization, and experimentation and prediction.
➢Data scientist tools: SQL, Python, and R.
4. MACHINE LEARNING SCIENTIST/ ENGINEER
➢Machine learning scientists are similar to data scientists but with a machine
learning specialization.
➢They focus on developing, training, and deploying machine learning models in
production environments.
➢They go beyond traditional machine learning with deep learning.
➢Within the workflow, they do the last three stages with a strong focus on
prediction.
➢Machine learning tools: Python or R to create their predictive models.