CSC 492 Lecture Notes_19.06.2024

The document provides an overview of data science and machine learning, emphasizing the DIKW pyramid that illustrates the relationship between data, information, knowledge, and wisdom. It outlines the CRISP-DM process, which consists of six stages for data-driven decision-making, and explains the differences between supervised and unsupervised learning in machine learning. Examples of supervised learning applications, such as predicting house prices and classifying images, are also discussed, along with the challenges of unsupervised learning, particularly in clustering and measuring similarity.

CSC492

DATA SCIENCE
MACHINE LEARNING
Data Science
• Data science encompasses a set of principles, problem definitions,
algorithms, and processes for extracting nonobvious and useful
patterns from large data sets.
• The goal of data science is to use data to get insight and
understanding.
• The standard model of the structural relationships between wisdom,
knowledge, information, and data is known as the DIKW pyramid.
• In the DIKW pyramid, data precedes information, which precedes
knowledge, which precedes wisdom.
• Data are created through abstractions or measurements taken from the world.
• Information is data that have been processed, structured, or contextualized so that it is meaningful to humans.
• Knowledge is information that has been interpreted and understood by a human so that she can act on it if required.
• Wisdom is acting on knowledge in an appropriate way.
The DIKW pyramid
Data Science Activities
• The activities in the data science process can also be represented using a similar pyramid hierarchy, where the width of the pyramid represents the amount of data being processed at each level, and the higher the layer in the pyramid, the more informative the results of the activities are for decision making.
• The hierarchy of data science activities goes from data capture and generation, through data preprocessing and aggregation, data understanding and exploration, and pattern discovery and model creation using ML, to decision support using data-driven models deployed in the business context.
• Cross Industry Standard Process for Data Mining (CRISP-DM)
• It is designed to be independent of any software, vendor, or data analysis technique.

The CRISP-DM Process


• The CRISP-DM life cycle consists of six stages: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.
• Data are at the center of all data science activities.
• The arrows between the stages indicate the typical direction of the process.
• The process is semistructured, which means that a data scientist doesn’t always move through these six stages in a linear fashion.
• Depending on the outcome of a particular stage, a data scientist may go back to one of the previous stages, redo the current stage, or move on to the next stage.
The CRISP-DM Life Cycle
The CRISP-DM Life Cycle
• The first two stages involve iteration, typically targeted at identifying a
business problem and then exploring whether the appropriate data are available
to develop a data-driven solution to the problem.
• Next phase: data preparation, the creation of a data set that can be
used for the data analysis.
• This involves integrating data sources from a number of databases.
• Once a data set has been created, the quality of the data is checked and
fixed, e.g. outliers and missing values.
• Errors in the data can have a serious effect on the performance of the data-
analysis algorithms.
• Next: Modeling. This is the stage where automatic algorithms are used to
extract useful patterns from the data and to create models that encode
these patterns
• ML is focused on the design of these algorithms.
The CRISP-DM Life Cycle
• Modeling phase:
• a data scientist uses a number of different ML algorithms to train a number of
different models on the data set.
• A model is trained on a data set by running an ML algorithm on the data set so as
to identify useful patterns in the data and to return a model that encodes these
patterns.
• In most cases, the ML model is ultimately the software that is deployed by an organization to
help it make decisions.
• The last two stages of the CRISP-DM process, evaluation and
deployment, are focused on how the models fit the business and its
processes.
• The tests run during the modeling stage are focused purely on the accuracy of the
models for the data set.
• The evaluation phase involves assessing the models in the broader context
defined by the business needs
MACHINE LEARNING
Machine Learning
• Data science is best understood as a partnership
between a data scientist and a computer.
• computer brings the ability to process data and search
for patterns in the data.
• ML is the field of study that develops the algorithms
that the computers follow in order to identify and
extract patterns from data.
• ML algorithms and techniques are applied primarily
during the modeling stage of CRISP-DM.
Machine Learning

• ML involves a two-step process
i. First, an ML algorithm is applied to a data set to identify useful patterns in the data.
o These patterns can be represented in a number of different ways: decision trees, regression models, and neural networks.
o These representations of patterns are known as “models,” which is why this stage of the CRISP-DM life cycle is known as the “modeling stage.”
o ML algorithms create models from data, and each algorithm is designed to create models using a particular representation (neural network, decision tree, or other).
Machine Learning

• ML involves a two-step process
ii. Second, once a model has been created, it is used for analysis. In some cases, the structure of the model is what is important.
• A model structure can reveal what the important attributes are in a domain.
• For example, in a medical domain we might apply an ML algorithm to a data set of stroke patients and use the structure of the model to identify the factors that have a strong association with stroke.
• In other cases, a model is used to label or classify new examples. For instance, the primary purpose of a spam-filter model is to label new emails as either spam or not spam rather than to reveal the defining attributes of spam email.
Supervised versus Unsupervised Learning

• Like humans, machines are capable of learning in different ways.
• The majority of ML algorithms/strategies are classified as:
i. supervised learning,
ii. unsupervised learning, or
iii. reinforcement learning.
Supervised learning: As the name indicates, supervised learning involves machine
learning algorithms that learn under the presence of a supervisor.
Supervised versus Unsupervised Learning
• Goal: to learn a function that maps from the values of the attributes
describing an instance to the value of another attribute, known as the
target attribute, of that instance.
• For example, when supervised learning is used to train a spam filter, the
algorithm attempts to learn a function that maps from the attributes
describing an email to a value (spam/not spam) for the target attribute;
• the function the algorithm learns is the spam-filter model returned by the
algorithm.
• So, in this context the pattern that the algorithm is looking for in the data
is the function that maps from the values of the input attributes to the
values of the target attribute, and the model that the algorithm returns is
a computer program that implements this function
Supervised Learning (SL)
• SL refers to a category of methods in which we teach or train a
machine learning algorithm using data, while guiding the algorithm
model with labels associated with the data.
• Supervised learning works by searching through lots of different
functions to find the function that best maps between the inputs and
output.
• However, for any data set of reasonable complexity there are so many combinations
of inputs and possible mappings to outputs that an algorithm cannot try all possible
functions. As a consequence, each ML algorithm is designed to look at or prefer
certain types of functions during its search.
• These preferences are known as the algorithm’s learning bias
Supervised Learning (SL)
• If the ML algorithm gives a correct answer, then there is nothing for us to do. Our
job is to correct the model when the output of the model is wrong.
• If this is the case, we need to make sure that the model makes necessary
updates so that the next time an input (say a cat image) is shown to the model,
it can correctly identify the image.
• The formal supervised learning process involves input variables, which we call
(X), and an output variable, which we call (Y). We use an algorithm to learn the
mapping function from the input to the output.
• In simple mathematics, the output (Y) is a dependent variable of input (X) as
illustrated by:
Y = f(X)

• Here, our end goal is to try to approximate the mapping function (f), so that
we can predict the output variables (Y) when we have new input data (X).
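A minimal sketch of this idea in Python, with made-up data points: supervised learning approximates the mapping f by fitting a straight line f(x) = a*x + b to labelled examples, then uses it to predict Y for a new X. The data and the choice of a linear form are assumptions for illustration only.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of f(x) = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return lambda x: a * x + b

# The "supervisor": every training instance carries both the input X
# and the correct output label Y
xs, ys = [1, 2, 3, 4], [3, 5, 7, 9]   # underlying rule is y = 2x + 1
f = fit_line(xs, ys)
print(f(10))   # 21.0 -- the prediction for an unseen input
```

The fitted function f is the "model" returned by the algorithm; predicting on new inputs is exactly the Y = f(X) approximation described above.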
Supervised Learning (SL)
• The real challenge in using ML is to find the algorithm whose learning bias is the
best match for a particular data set.
• Generally, this task involves experiments with a number of different algorithms to find
out which one works best on that data set
Supervised Learning
• Supervised learning is “supervised” because each of the instances
in the data set lists both the input values and the output (target)
value for each instance.
• So, the learning algorithm can guide its search for the best function by checking how
each function it tries matches with the data set, and at the same time the data set
acts as a supervisor for the learning process by providing feedback.
• for supervised learning to take place, each instance in the data set must be
labeled with the value of the target attribute.
• Often, however, the reason a target attribute is interesting is that it is not easy to
directly measure, and therefore it is not possible to easily create a data set of labeled
instances. In such scenarios, a great deal of time and effort is required to create a
data set with the target values before a model can be trained using supervised
learning
Examples of Supervised Learning
Recall: goal of SL is to try to approximate the mapping function (f), so
that we can predict the output variables (Y) when we have new input
data (X).
• Here, the machine learning model learns to fit mapping between
examples of input features with their associated labels. When
models are trained with these examples, we can use them to make
new predictions on unseen data.
• The predicted labels can be either numbers or categories. For
instance, if we are predicting house prices, then the output is a
number. In this case, the model is a regression model. If we are
predicting if an email is spam or not, the output is a category and the
model is a classification model.
Examples of Supervised Learning
Example: House prices
• First, we need data about the houses: square footage, number of
rooms, features, whether a house has a garden or not, and so on.
• We then need to know the prices of these houses, i.e. the
corresponding labels.
• By leveraging data coming from thousands of houses, their features
and prices, we can now train a supervised machine learning model
to predict a new house’s price based on the examples observed by
the model.
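One way to sketch the house-price example (the features, prices, and the choice of a k-nearest-neighbour predictor are all invented for illustration): predict a new house's price as the average price of the k most similar houses already observed.

```python
import math

def predict_price(features, houses, k=2):
    """houses: list of (feature_vector, price) pairs.
    Predict the price as the mean price of the k nearest houses."""
    nearest = sorted(houses, key=lambda h: math.dist(h[0], features))[:k]
    return sum(price for _, price in nearest) / k

# (square metres, rooms, has_garden) -> observed sale price
houses = [
    ((120, 3, 1), 250_000),
    ((80, 2, 0), 150_000),
    ((200, 5, 1), 420_000),
    ((85, 2, 1), 170_000),
]
print(predict_price((90, 2, 1), houses))   # averages the two closest houses
```

In practice the features would be normalized first so that square metres do not dominate the distance, and far more examples would be needed.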
Examples of Supervised Learning
Example: Is it a cat or a dog?
• Image classification is a popular problem in the computer vision
field.
• Here, the goal is to predict what class an image belongs to.
• In this set of problems, we are interested in finding the class label of
an image.
• More precisely: is the image of a car or a plane? A cat or a dog?
Examples of Supervised Learning
Example: How’s the weather today?
• One particularly interesting problem which requires considering a lot of
different parameters is predicting weather conditions in a particular
location.
• To make correct predictions for the weather, we need to take into account
various parameters, including historical temperature data, precipitation,
wind, humidity, and so on.
• This particularly interesting and challenging problem may require
developing complex supervised models that include multiple tasks.
• Predicting today’s temperature is a regression problem, where the output
labels are continuous variables. By contrast, predicting whether it is going
to snow or not tomorrow is a binary classification problem.
Examples of Supervised Learning
Example: Who are the unhappy customers?
• Another great example of supervised learning is text classification
problems. In this set of problems, the goal is to predict the class
label of a given piece of text.
• One particularly popular topic in text classification is to predict the
sentiment of a piece of text, like a tweet or a product review.
• This is widely used in the e-commerce industry to help companies to
determine negative comments made by customers.
Closing words on supervised learning
Supervised learning is a way that we can teach computers to do things
by showing them examples and telling them the right answer. For
example, let’s say we want to teach a computer to recognize pictures
of dogs. We can show it pictures of different breeds of dogs and tell it
the name of each breed. Then the computer will try to figure out which
characteristics typically go with each breed of dog.

Once the computer has learned enough about different breeds of
dogs, we can test it by showing it a picture of a dog it has never seen
before. The computer will use what it has learned to try to guess which
breed of dog it is. If it guesses correctly, we can say that the computer
did a good job of learning about dogs. If it doesn’t guess correctly, we
can give it more examples to help it learn even better.
Unsupervised Learning
• In unsupervised learning, there is no target attribute.
• As a consequence, unsupervised-learning algorithms can be used
without investing the time and effort in labeling the instances of the
data set with a target attribute.
• However, not having a target attribute also means that learning
becomes more difficult: instead of the specific problem of searching
for a mapping from inputs to output that matches the data, the
algorithm has the more general task of looking for regularities in the
data.
• In unsupervised learning, even though we do not have any labels for data points,
we do have the actual data points. This means we can draw references from
observations in the input data.
Unsupervised Learning
• Scenario to explain unsupervised learning
• Imagine you are in a foreign country and you are visiting a food market, for
example. You see a stall selling a fruit that you cannot identify. You don’t
know the name of this fruit. However, you have your observations to rely
on, and you can use these as a reference. In this case, you can easily tell the
fruit apart from nearby vegetables or other food by identifying its various
features like its shape, color, or size.

• This is roughly how unsupervised learning happens. We use the data
points as references to find meaningful structure and patterns in the
observations. Unsupervised learning is commonly used for finding
meaningful patterns and groupings inherent in data, extracting
generative features, and exploratory purposes.
Unsupervised Learning
• The most common type of unsupervised learning is cluster
analysis, where the algorithm looks for clusters of instances that are
more similar to each other than they are to other instances in the
data.
• These clustering algorithms often begin by guessing a set of clusters
and then iteratively updating the clusters (dropping instances from
one cluster and adding them to another) so as to increase both the
within-cluster similarity and the diversity across clusters
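The iterative update described above can be sketched as a simplified one-dimensional k-means. The points and the initial centroid guesses are invented, and fixed here so the run is deterministic; real implementations handle many dimensions and random restarts.

```python
def kmeans_1d(points, centroids, iterations=10):
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)), key=lambda c: abs(p - centroids[c]))
            clusters[i].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid)
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids, clusters

points = [1, 2, 3, 10, 11, 12]
centroids, clusters = kmeans_1d(points, centroids=[0, 5])
print(centroids)   # [2.0, 11.0]
print(clusters)    # [[1, 2, 3], [10, 11, 12]]
```

Each pass "drops instances from one cluster and adds them to another" exactly as the bullet describes, until the clusters stop changing.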
Unsupervised Learning
• A challenge for clustering is figuring out how to measure similarity.
• If all the attributes in a data set are numeric and have similar ranges,
then it probably makes sense just to calculate the Euclidean
distance (better known as the straight-line distance) between the
instances (or rows).
• Rows that are close together in the Euclidean space are then treated
as similar.
Unsupervised Learning
• Factors that make calculation of similarities between rows
complex
• A number of factors:
i. In some data sets, different numeric attributes have different ranges, with
the result that a variation in row values in one attribute may not be as
significant as a variation of a similar magnitude in another attribute.
• In these cases, the attributes should be normalized so that they all have the same range.
ii. Things can be deemed similar in many different ways. Some attributes are
sometimes more important than others, so it might make sense to weight
some attributes in the distance calculations; or the data set may include
nonnumeric data. These more complex scenarios may require the design of
bespoke (tailor-made) similarity metrics for the clustering algorithm to use.
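The normalization point in (i) can be sketched as min-max scaling (the age/income rows are invented): after scaling, every attribute spans [0, 1], so the Euclidean distance is no longer dominated by the attribute with the larger range.

```python
import math

def min_max_scale(rows):
    """Rescale each column of a numeric table to the range [0, 1].
    Assumes each column has at least two distinct values."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) for v, l, h in zip(row, lo, hi)]
            for row in rows]

# (age in years, income in dollars): unscaled, income would swamp age
rows = [[25, 30_000], [45, 90_000], [65, 60_000]]
scaled = min_max_scale(rows)
print(scaled)                           # [[0.0, 0.0], [0.5, 1.0], [1.0, 0.5]]
print(math.dist(scaled[0], scaled[1]))  # straight-line distance after scaling
```

Attribute weighting, mentioned in (ii), amounts to multiplying each scaled column by a chosen importance factor before computing the distance.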
Unsupervised Learning
• Example to illustrate unsupervised learning
• Analyzing the causes of Type 2 diabetes in white American adult males

Some attributes can be added (e.g. age) while some irrelevant ones can be removed (e.g. shoe size).
Unsupervised Learning
• An unsupervised clustering algorithm will look for groups of rows that are
more similar to each other than they are to the other rows in the data.
• Each of these groups of similar rows defines a cluster of similar instances.
• For instance, an algorithm can identify causes of a disease or disease
comorbidities (diseases that occur together) by looking for attribute values
that are relatively frequent within a cluster.
• The simple idea of looking for clusters of similar rows is very powerful
and has applications across many areas of life.
• Another application of clustering rows is making product
recommendations to customers.
• If a customer liked a book, song, or movie, then he may enjoy another book, song,
or movie from the same cluster
Difference between SL and UL
• Data: supervised learning uses well-labelled data sets; unsupervised learning uses unlabelled data sets.
• Learning: supervised algorithms learn from the data set by making multiple predictions and adjusting for the correct output; unsupervised algorithms learn by themselves, discovering patterns in unlabelled data on their own.
• Human intervention: supervised learning requires human intervention to label the data; unsupervised learning requires minimal human intervention, needed only to verify whether the output makes sense.
• Applications: supervised learning is useful for weather predictions, detecting human sentiments, and pricing predictions; unsupervised learning is useful for recommendation engines, customer personas, medical imaging, and anomaly detection.
• Complexity: supervised learning is comparatively simple, using languages such as Python or R; unsupervised learning uses powerful tools to analyse large volumes of data and is computationally complex because of the vast amount of data it processes to predict the desired outcome.
• Accuracy: supervised machine learning is a more accurate method; unsupervised learning is comparatively less accurate.
• Output: supervised learning makes predictions for new data sets where the user already knows the expected output; unsupervised learning provides insights based on large amounts of new data.
Advantages and disadvantages of SL
Advantages:
• You can collect data or produce output by using your previous experience.
• The model allows you to optimise performance criteria by using experience.
• You are completely aware of the number of classes in a training data set.
• It allows you to understand how the machine is learning to predict the output.
• It helps to solve different real-world computation problems.
• After training is complete, it is not mandatory to store the training data set in memory; instead, you may keep the decision boundary as a mathematical formula.

Disadvantages:
• To train the classifier, you may be required to choose a lot of examples from every class; otherwise, the accuracy of the output is impacted.
• Classifying a large amount of data is a challenge.
• Training in supervised ML takes high computation time, which sometimes also tests the machine's efficiency.
• Supervised learning cannot classify or cluster data by itself like unsupervised learning can.
• It is not always possible to provide supervision for large data, so the machine may need to learn by itself from training data.
• Supervised learning's capability is limited: it cannot handle certain complex tasks in ML.
• Supervised machine learning models need a lot of time to train and require expertise to create labelled data.
Challenges Faced In Supervised Machine Learning

• It is a challenge to pre-process the data and prepare the data for the input.
• If incomplete, unlikely, or impossible values are sent as input, the accuracy
of the supervised learning model may decrease.
• If the input is irrelevant to the model, it may give inaccurate output.
• To label the data, an expert is important, but in the absence of one, the
results could be inaccurate.
• Since there is human intervention in supervised learning, there are
chances of human error in datasets which may lead to incorrect learning
of the algorithms.
