CSC 492 Lecture Notes_19.06.2024
DATA SCIENCE
MACHINE LEARNING
Data Science
• Data science encompasses a set of principles, problem definitions,
algorithms, and processes for extracting nonobvious and useful
patterns from large data sets.
• The goal of data science is to use data to get insight and
understanding.
• The standard model of the structural relationships between wisdom,
knowledge, information, and data is known as the DIKW pyramid.
• In the DIKW pyramid, data precedes information, which precedes
knowledge, which precedes wisdom.
• Data are created through
abstractions or measurements
taken from the world.
• Information is data that have
been processed, structured, or
contextualized so that they are
meaningful to humans.
• Knowledge is information that
has been interpreted and
understood by a human so that
she can act on it if required.
• Wisdom is acting on
knowledge in an appropriate
way.
The DIKW pyramid
Data Science Activities
• The activities in the data science process can also be represented
using a similar pyramid hierarchy where the width of the pyramid
represents the amount of data being processed at each level and
where the higher the layer in the pyramid, the more informative the
results of the activities are for decision making.
• The hierarchy of data science activities goes from data capture and
generation through data preprocessing and aggregation, data
understanding and exploration, pattern discovery and model
creation using ML, and decision support using data-driven models
deployed in the business context.
• Cross Industry Standard
Process for Data Mining
(CRISP-DM)
• It is designed to be
independent of any
software, vendor, or
data analysis
technique.
o These representations of patterns are known as “models,” which is why this stage of
the CRISP-DM life cycle is known as the “modeling stage.”
o ML algorithms create models from data, and each algorithm is designed to create
models using a particular representation (a neural network, a decision tree, or another form).
Machine Learning
• In other cases, a model is used to label or classify new examples. For instance, the
primary purpose of a spam-filter model is to label new emails as either spam or not spam
rather than to reveal the defining attributes of spam email.
Supervised versus Unsupervised Learning
• Here, our end goal is to try to approximate the mapping function (f), so that
we can predict the output variables (Y) when we have new input data (X).
Supervised Learning (SL)
• The real challenge in using ML is to find the algorithm whose learning bias is the
best match for a particular data set.
• Generally, this task involves experimenting with a number of different algorithms to find
out which one works best on that data set.
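The experiment described above can be sketched in a few lines. This is a minimal illustration, not a prescribed method: the three learners (`fit_mean`, `fit_line`, `fit_nearest`), the data values, and the use of mean absolute error on held-out instances are all assumptions made for the example.

```python
from statistics import mean

def fit_mean(train):
    """Baseline learner: always predict the average training target."""
    avg = mean(y for _, y in train)
    return lambda x: avg

def fit_line(train):
    """Least-squares straight line y = a*x + b."""
    xs = [x for x, _ in train]
    ys = [y for _, y in train]
    x_bar, y_bar = mean(xs), mean(ys)
    a = sum((x - x_bar) * (y - y_bar) for x, y in train) / sum((x - x_bar) ** 2 for x in xs)
    b = y_bar - a * x_bar
    return lambda x: a * x + b

def fit_nearest(train):
    """1-nearest-neighbour: copy the target of the closest training input."""
    return lambda x: min(train, key=lambda pair: abs(pair[0] - x))[1]

def mean_abs_error(model, data):
    """Average absolute gap between predictions and true targets."""
    return mean(abs(model(x) - y) for x, y in data)

# Invented data that roughly follows y = 2x + 1.
train = [(1, 3.1), (2, 4.9), (3, 7.2), (4, 9.0), (5, 10.8)]
held_out = [(6, 13.1), (7, 14.9)]  # instances the learners never saw

learners = {"mean": fit_mean, "line": fit_line, "nearest": fit_nearest}
scores = {name: mean_abs_error(fit(train), held_out) for name, fit in learners.items()}
best = min(scores, key=scores.get)
print(scores, "-> best:", best)
```

On this data the straight line wins because its learning bias (a linear relationship) matches how the data were generated; on a differently shaped data set another learner could come out ahead.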
Supervised Learning
• Supervised learning is “supervised” because each of the instances
in the data set lists both the input values and the output (target)
value for each instance.
• The learning algorithm can therefore guide its search for the best function by checking how
well each function it tries matches the data set. In this way the data set
acts as a supervisor for the learning process by providing feedback.
• For supervised learning to take place, each instance in the data set must be
labeled with the value of the target attribute.
• Often, however, the reason a target attribute is interesting is that it is not easy to
directly measure, and therefore it is not possible to easily create a data set of labeled
instances. In such scenarios, a great deal of time and effort is required to create a
data set with the target values before a model can be trained using supervised
learning.
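The idea of the data set acting as a supervisor can be shown with a deliberately naive search. The sketch below assumes a hypothetical family of candidate functions y = w * x and invented labelled instances; real algorithms search far more efficiently, but the feedback loop is the same.

```python
# Labelled instances: (input, target). Invented data, roughly target = 3 * input.
dataset = [(1, 3.2), (2, 5.9), (3, 9.1), (4, 11.8)]

def squared_error(w, data):
    """Feedback from the 'supervisor': how badly y = w * x matches the labels."""
    return sum((w * x - y) ** 2 for x, y in data)

# Candidate functions y = w * x for w = 0.0, 0.1, ..., 5.0.
candidates = [i / 10 for i in range(51)]

# Keep the candidate whose predictions disagree least with the labelled targets.
best_w = min(candidates, key=lambda w: squared_error(w, dataset))
print("best candidate: y =", best_w, "* x")
```

Without the target values in `dataset` there would be nothing to compute the error against, which is why every instance must be labeled before supervised learning can start.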
Examples of Supervised Learning
Recall: goal of SL is to try to approximate the mapping function (f), so
that we can predict the output variables (Y) when we have new input
data (X).
• Here, the machine learning model learns a mapping from
examples of input features to their associated labels. Once
models are trained with these examples, we can use them to make
new predictions on unseen data.
• The predicted labels can be either numbers or categories. For
instance, if we are predicting house prices, then the output is a
number. In this case, the model is a regression model. If we are
predicting if an email is spam or not, the output is a category and the
model is a classification model.
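The difference between the two kinds of output can be made concrete with two toy learners. Both are simplified assumptions for illustration: `fit_regression` fits a line through the origin, and `fit_classifier` places a threshold halfway between the two class means (which presumes the "spam" class has the larger mean). All data values are invented.

```python
from statistics import mean

def fit_regression(examples):
    """Learn y = w * x by least squares; the model outputs a number."""
    w = sum(x * y for x, y in examples) / sum(x * x for x, _ in examples)
    return lambda x: w * x

def fit_classifier(examples, positive="spam", negative="not spam"):
    """Learn a threshold halfway between the class means; the model outputs a category."""
    pos = mean(x for x, label in examples if label == positive)
    neg = mean(x for x, label in examples if label == negative)
    threshold = (pos + neg) / 2
    return lambda x: positive if x > threshold else negative

# Regression examples: floor area in square metres -> price.
house_examples = [(50, 100_000), (100, 200_000), (150, 300_000)]
# Classification examples: number of suspicious words -> class label.
email_examples = [(0, "not spam"), (1, "not spam"), (6, "spam"), (9, "spam")]

price_model = fit_regression(house_examples)
spam_model = fit_classifier(email_examples)

predicted_price = price_model(80)   # a number
predicted_label = spam_model(7)     # a category
print(predicted_price, predicted_label)
```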
Example: House prices
• First, we need data about the houses: square footage, number of
rooms, features, whether a house has a garden or not, and so on.
• We then need to know the prices of these houses, i.e. the
corresponding labels.
• By leveraging data coming from thousands of houses, their features
and prices, we can now train a supervised machine learning model
to predict a new house’s price based on the examples observed by
the model.
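As a rough sketch of the house price example, the code below uses k-nearest-neighbour regression, chosen here only because it is short; the notes do not prescribe an algorithm. The five houses, the feature scales in `SCALES`, and the choice k = 2 are all invented for illustration, whereas a realistic model would be trained on thousands of examples.

```python
import math

# (square_metres, rooms, has_garden) -> price. All values are invented.
houses = [
    ((50, 2, 0), 100_000),
    ((60, 2, 1), 130_000),
    ((90, 3, 1), 190_000),
    ((120, 4, 0), 230_000),
    ((150, 5, 1), 310_000),
]

# Rough per-feature scales so that square metres do not dominate the distance.
SCALES = (100, 5, 1)

def distance(a, b):
    """Scaled Euclidean distance between two feature tuples."""
    return math.sqrt(sum(((p - q) / s) ** 2 for p, q, s in zip(a, b, SCALES)))

def predict_price(features, k=2):
    """Average the prices (labels) of the k most similar houses seen in training."""
    nearest = sorted(houses, key=lambda h: distance(h[0], features))[:k]
    return sum(price for _, price in nearest) / k

new_house = (95, 3, 1)
estimate = predict_price(new_house)
print("estimated price:", estimate)
```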
Example: Is it a cat or a dog?
• Image classification is a popular problem in the computer vision
field.
• Here, the goal is to predict what class an image belongs to.
• In this set of problems, we are interested in finding the class label of
an image.
• More precisely: is the image of a car or a plane? A cat or a dog?
Example: How’s the weather today?
• One particularly interesting problem which requires considering a lot of
different parameters is predicting weather conditions in a particular
location.
• To make correct predictions for the weather, we need to take into account
various parameters, including historical temperature data, precipitation,
wind, humidity, and so on.
• This challenging problem may require developing complex
supervised models that handle multiple tasks.
• Predicting today’s temperature is a regression problem, where the output
labels are continuous variables. By contrast, predicting whether it is going
to snow or not tomorrow is a binary classification problem.
Example: Who are the unhappy customers?
• Another example of supervised learning is text classification.
In this set of problems, the goal is to predict the class
label of a given piece of text.
• One particularly popular topic in text classification is predicting the
sentiment of a piece of text, like a tweet or a product review.
• This is widely used in the e-commerce industry to help companies
identify negative comments made by customers.
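A sentiment classifier of this kind can be sketched with a small Naive Bayes model over word counts. Naive Bayes is one common choice and is an assumption here, as are the four labelled reviews and the function names `train` and `classify`.

```python
import math
from collections import Counter, defaultdict

# Tiny invented training set of labelled reviews.
labelled_reviews = [
    ("great product works perfectly", "positive"),
    ("love it great value", "positive"),
    ("terrible quality broke quickly", "negative"),
    ("awful service never again", "negative"),
]

def train(examples):
    """Count words per class and documents per class."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocab

def classify(text, model):
    """Return the class with the highest log-probability score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        total_words = sum(counts.values())
        score = math.log(doc_counts[label] / total_docs)
        for word in text.split():
            # Add-one (Laplace) smoothing so unseen words do not zero the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

model = train(labelled_reviews)
print(classify("terrible service", model))
print(classify("great value", model))
```

A company could run such a classifier over incoming reviews and route those labelled negative to customer support.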
Closing words on supervised learning
Supervised learning is a way that we can teach computers to do things
by showing them examples and telling them the right answer. For
example, let’s say we want to teach a computer to recognize pictures
of dogs. We can show it pictures of different breeds of dogs and tell it
the name of each breed. Then the computer will try to figure out which
characteristics typically go with each breed of dog.
Some attributes can be added (e.g. age), while irrelevant ones can be removed (e.g. shoe size).
Unsupervised Learning
• An unsupervised clustering algorithm will look for groups of rows that are
more similar to each other than they are to the other rows in the data.
• Each of these groups of similar rows defines a cluster of similar instances.
• For instance, an algorithm can suggest possible causes of a disease or disease
comorbidities (diseases that occur together) by looking for attribute values
that are relatively frequent within a cluster.
• The simple idea of looking for clusters of similar rows is very powerful
and has applications across many areas of life.
• Another application of clustering rows is making product
recommendations to customers.
• If a customer liked a book, song, or movie, then they may enjoy another book, song,
or movie from the same cluster.
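The clustering and recommendation ideas above can be sketched with k-means, one widely used clustering algorithm (the notes do not name a specific one). The titles, the two feature scores per row, and the shortcut of using the first k rows as initial centroids are all assumptions made for this example.

```python
import math

def kmeans(points, k, iterations=10):
    """Plain k-means; the first k points serve as initial centroids (a simplification)."""
    centroids = [points[i] for i in range(k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each row to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Move each centroid to the mean of the rows assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment

# Invented rows: (action score, romance score) for each title. No labels are given.
titles = ["Space War", "Summer Love", "Robot Uprising", "Paris Letters", "Star Raiders"]
points = [(9, 1), (1, 9), (8, 2), (2, 8), (9, 2)]

clusters = kmeans(points, k=2)

def recommend(liked_title):
    """Suggest the other titles that fall in the same cluster as the liked one."""
    liked_cluster = clusters[titles.index(liked_title)]
    return [t for t, c in zip(titles, clusters) if c == liked_cluster and t != liked_title]

print(recommend("Space War"))
```

Note that the algorithm is never told which titles are "action" or "romance"; the groups emerge only from the similarity of the rows.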
Difference between SL and UL
Supervised Learning | Unsupervised Learning
Uses well-labelled data sets. | Uses unlabelled data sets.
ML algorithms learn from the dataset by making multiple predictions and making adjustments for correct output. | Unsupervised learning algorithms learn by themselves and discover any pattern of unlabelled data by themselves.
Requires human intervention to make the data learn. | Requires minimal human intervention. The only time they are required to use humans is when they verify if the output is making sense.
Useful to make weather predictions, detect human sentiments and make pricing predictions. | Useful for recommendation engines, customer personas, medical imaging or anomaly detection.
Comparatively simple, as it uses Python or R programming language. | Uses powerful tools to analyse a large volume of data. Also, unsupervised learning is computationally complex because of the vast amount of data it uses to predict the desired outcome.
Supervised machine learning is a more accurate method. | Unsupervised learning is a comparatively less accurate method.
Does the prediction for new data sets. The user is already aware of the expected output. | Insights are provided based on a huge amount of new data.
Advantages and disadvantages of SL
Advantages | Disadvantages
You can collect data or produce output by using your previous experience. | To train the classifier, you may be required to choose a lot of examples from every class, otherwise, the accuracy of the output is impacted.
This model allows you to optimise performance criteria by using experience. | To classify a large amount of data is a challenge.
You are completely aware of the number of classes in a training data set. | To train the data in SML takes high computation time, which sometimes also tests the machine's efficiency.
It allows you to understand the process of how the machine is learning to predict the output. | Supervised learning cannot classify or cluster data by itself like unsupervised learning can.
It helps to solve different real-world computation problems. | It is not always possible to provide supervision for large data, so the machine may require to learn itself through training data.
After completing training, it is not mandatory to store the training dataset in the memory; instead, you may maintain the decision boundary as a mathematical formula. | Supervised learning's capability is limited in the sense that it is not capable of handling certain complex tasks in ML.
 | Supervised machine learning models need a lot of time to train the data and require expertise to create labelled data.
Challenges Faced In Supervised Machine Learning
• It is a challenge to pre-process the data and prepare it as input.
• If incomplete, unlikely, or impossible values are supplied as
input, the accuracy of the supervised learning model may decrease.
• If the input is irrelevant to the model, it may give inaccurate output.
• Labelling the data usually requires an expert; in the absence of one, the
results could be inaccurate.
• Since there is human intervention in supervised learning, there are
chances of human error in datasets which may lead to incorrect learning
of the algorithms.
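Part of the pre-processing challenge can be illustrated with a simple filter that drops rows with incomplete, unlikely, or impossible values before training. The field names, the sample rows, and the plausibility limits in `VALID_RANGES` are hypothetical; real limits depend on the domain and should be agreed with an expert.

```python
# Invented raw rows, some of which should not reach the learning algorithm.
raw_rows = [
    {"age": 34, "height_cm": 172, "label": "healthy"},
    {"age": None, "height_cm": 165, "label": "healthy"},  # incomplete value
    {"age": -5, "height_cm": 180, "label": "ill"},         # impossible value
    {"age": 51, "height_cm": 999, "label": "ill"},         # unlikely value
    {"age": 29, "height_cm": 158, "label": None},          # missing label
]

# Assumed plausibility limits for each input attribute.
VALID_RANGES = {"age": (0, 120), "height_cm": (50, 250)}

def is_clean(row):
    """Keep a row only if it is labelled and every attribute is present and plausible."""
    if row["label"] is None:
        return False
    for field, (low, high) in VALID_RANGES.items():
        value = row[field]
        if value is None or not (low <= value <= high):
            return False
    return True

clean_rows = [row for row in raw_rows if is_clean(row)]
print(len(clean_rows), "of", len(raw_rows), "rows kept")
```

Dropping rows is only one option; missing values can also be filled in, but either choice changes what the model learns and should be made deliberately.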