
Classification of Machine Learning

Classification of ML
Supervised Learning
-Classification
-Regression

Unsupervised Learning
-Clustering
-Association

Arthur Samuel coined the term “Machine Learning” in 1959 and defined it as a “Field of study that gives computers the capability to learn without being explicitly programmed”.
Supervised Machine Learning
• Machines are trained using well-labelled training data,
– on the basis of that data, machines predict the output.
– Labelled data means the input data is already tagged
with the correct output.
• The training data provided to the machines works as the supervisor
that teaches the machines to predict the output correctly.
– It applies the same concept as a student learning under the
supervision of a teacher.
• Supervised learning is a process of providing input data as well as
correct output data to the machine learning model.
• Aim: find a mapping function to map the input variable (x) to
the output variable (y).
• Applications: risk assessment, image classification, fraud detection,
spam filtering, etc.
How Supervised Learning Works?
• The model is trained with a labelled dataset.
• The model is then tested on test data
– a held-out subset of the original dataset
– on which it predicts the output.
Steps Involved in Supervised Learning
 Determine the type of training dataset
 Collect/Gather the labelled training data
 Split the dataset: training set, test set, & validation set
 Determine the input features of the training dataset
 Determine the suitable algorithm: SVM, DT, etc.
 Execute the algorithm on the training dataset.
 A validation set (a subset of the training data) is sometimes needed to
tune the control parameters.
 Evaluate the accuracy of the model by providing the test set.
 If the model predicts the correct output, the model is
accurate. (A minimal end-to-end sketch of these steps follows this list.)
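A minimal sketch of this workflow in Python, assuming scikit-learn is available and using its bundled Iris dataset as the labelled data; the SVM choice and 80/20 split are illustrative assumptions:

```python
# Minimal supervised-learning workflow sketch (assumptions: scikit-learn
# installed, Iris as the labelled dataset, SVM as the chosen algorithm).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)              # labelled data: inputs X, correct outputs y

# Split the dataset into a training set and a test set (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = SVC()                                  # chosen algorithm (SVM)
model.fit(X_train, y_train)                    # execute the algorithm on the training set

y_pred = model.predict(X_test)                 # predict outputs for unseen test data
print("Accuracy:", accuracy_score(y_test, y_pred))
```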
Types of supervised ML Algorithms
Regression
• Regression algorithms are used when there is a
relationship between the input variable & the
output variable.
• It is used for the prediction of continuous
variables, such as weather forecasting, market
trends, etc. (A small sketch follows the list below.)
– Linear Regression
– Regression Trees
– Non-Linear Regression
– Bayesian Linear Regression
– Polynomial Regression
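A small regression sketch, assuming scikit-learn and a tiny synthetic dataset (not from the slides), showing a continuous output being predicted:

```python
# Minimal linear-regression sketch on assumed synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])     # input variable (x)
y = np.array([2.1, 4.0, 6.2, 8.1])             # continuous output variable (y)

reg = LinearRegression().fit(X, y)             # learn the mapping y ≈ coef*x + intercept
print(reg.coef_, reg.intercept_)
print(reg.predict([[5.0]]))                    # predict a continuous value for a new input
```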
Classification
• Classification algorithms are used when the
output variable is categorical, i.e. the output falls into
discrete classes such as Yes/No, Male/Female,
True/False, etc.
• Example application: spam filtering.
• Common algorithms:
– Random Forest
– Decision Trees
– Logistic Regression
– Support Vector Machines
Pros & Cons
• Pros:
– the model can predict the output on the basis of prior experiences.
– can have an exact idea about the classes of objects.
– helps us to solve various real-world problems such as fraud detection,
spam filtering, etc.
• Cons:
– not suitable for handling complex tasks.
– cannot predict the correct output if the test data is different from the
training dataset.
– Training requires a lot of computation time.
– needs enough knowledge about the classes of objects.
Unsupervised learning
• is the training of a machine using information
that is neither classified nor labelled &
allowing the algorithm to act on that
information without guidance/supervision.
• the task of the machine is to group unsorted
information according to similarities, patterns,
and differences without any prior training of
data.
 It allows the model to work on its own to discover patterns &
information that was previously undetected.
 It mainly deals with unlabelled data.
 The algorithm is never trained upon the given dataset, which
means it does not have any idea about the features of the dataset.
 The task of the unsupervised learning algorithm is to identify
the image features on its own.
 The unsupervised learning algorithm will perform this task by
clustering the image dataset into groups according to
similarities between images.
Why use Unsupervised Learning?
 helpful for finding useful insights from the data.
 much like how a human learns to think from their own
experiences, which makes it closer to
real AI.
 works on unlabelled & uncategorized data, which
makes unsupervised learning more important.
 In the real world, we do not always have input data
with the corresponding output; to solve such
cases, we need unsupervised learning.
Working of Unsupervised Learning
 It interprets the raw data to find the hidden patterns in the
data,
 then applies suitable algorithms such as k-means clustering,
Decision tree, etc.
 The algorithm divides the data objects into groups according
to the similarities & differences between the objects.
Types of Unsupervised Learning Algorithm
 Clustering: Clustering is a method of grouping the objects into clusters such
that objects with most similarities remain in a group and have less or no
similarities with the objects of another group. (A small clustering sketch
follows the algorithm list below.)
 Cluster analysis finds the commonalities between the data objects and
categorizes them as per the presence and absence of those commonalities.
 Association: An association rule is used for finding the relationships
between variables in a large database.
 It determines the set of items that occur together in the dataset.
 Association rules make marketing strategy more effective,
such as: people who buy item X (suppose bread) also tend to purchase
item Y (butter/jam).
 K-means clustering
 KNN (k-nearest neighbours)
 Hierarchical clustering
 Neural Networks
 Principal Component Analysis
 Independent Component Analysis
 Apriori algorithm
 Singular value decomposition
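A minimal clustering sketch using K-means from the list above, assuming scikit-learn and a small synthetic, unlabelled dataset (the data and the choice of 2 clusters are illustrative assumptions):

```python
# Minimal K-means clustering sketch on assumed synthetic, unlabelled data.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],          # only input data, no labels
              [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)                          # group assigned to each point
print(kmeans.cluster_centers_)                 # centroids discovered without supervision
```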
 Pros:
• used for more complex tasks as compared to supervised
learning because, in unsupervised learning, we don't have
labelled input data.
• preferable as it is easy to get unlabelled data in comparison to
labelled data.
 Cons:
• intrinsically more difficult than supervised learning as it does
not have corresponding output.
• The result of the unsupervised algorithm might be less accurate
as input data is not labelled, and algorithms do not know the
exact output in advance.
Difference between Supervised and
Unsupervised Learning
• Supervised learning algorithms are trained using labeled data; unsupervised
learning algorithms are trained using unlabeled data.
• A supervised learning model takes direct feedback to check if it is predicting
the correct output or not; an unsupervised learning model does not take any feedback.
• A supervised learning model predicts the output; an unsupervised learning
model finds the hidden patterns in data.
• In supervised learning, input data is provided to the model along with the
output; in unsupervised learning, only input data is provided to the model.
• Goal of supervised learning: train the model so that it can predict the output
when given new data. Goal of unsupervised learning: find the hidden patterns &
useful insights from the unknown dataset.
• Supervised learning needs supervision to train the model; unsupervised
learning does not need any supervision.
• A supervised learning model produces an accurate result; an unsupervised
learning model may give a less accurate result as compared to supervised learning.
• Supervised learning includes algorithms such as LR, SVM, Multi-class
Classification, DT, etc.; unsupervised learning includes algorithms such as
Clustering, KNN, and the Apriori algorithm.
ML Life Cycle
• ML has given computer systems the
ability to automatically learn without being
explicitly programmed.
– How does a ML system work?
• ML life cycle: a cyclic process to build an
efficient ML project
– The main purpose is to find a solution to the
problem or project
07 Major Steps
A. Gathering Data
• One of the most important steps!
• Goal: identify & obtain all data-related problems.
• Identify the different data sources: files, databases, the internet, or mobile
devices.
• The quantity & quality of the collected data determine the
efficiency of the output.
• The more data there is, the more accurate the
prediction will be.
• Tasks Involved:
– Identify various data sources
– Collect data
– Integrate the data obtained from different sources
• Develop a dataset or corpus
B. Data Preparation
• Put data into a suitable place & prepare it for use in ML
training.
• Place all data together, & then randomize the ordering
of data.
• Two Steps:
1. Data exploration: used to understand the nature of
data.
– understand the characteristics, format, & quality of data.
A better understanding of data leads to an effective outcome.
• Find Correlations, general trends, & outliers
2. Data pre-processing: processing the data so it is ready for
better analysis.
C. Data Wrangling
• Process of cleaning & converting raw data into a usable
format.
– cleaning the data, selecting the variable to use, & transforming the
data in a proper format to make it more suitable for analysis in the
next step.
• Cleaning of data is required to address the quality issues.
– Missing Values
– Duplicate data
– Invalid data
– Noise
• Use various filtering techniques to clean the data
• It is mandatory to detect & remove all such issues due to their
negative effect on the quality of the outcome
D. Data Analysis
• Involves:
– Selection of analytical techniques
– Building models
– Review the result
• Aim: build a ML model to analyze the data using
various analytical techniques & review the outcome.
• Determination of type of problems
– Classification, Regression, Cluster analysis, Association, etc.
• Then build the model using prepared data, & evaluate
the model.
E. Train Model
• Train model to improve its performance for
better outcome of the problem
– Use datasets to train the model using various ML
algorithms.
• Training a model is required so that it can
understand the various patterns, rules, &
features
F. Test Model
• Test the model using unknown/unlabelled
data.
• Check for the accuracy/efficiency of the model
G. Deployment
• Deploy the model in the real-world system
• If the prepared model is producing an
accurate result as per requirement with an
acceptable speed,
– then deploy the model in the real system
• Before deploying the project, should check
whether it is improving its performance using
available data or not.
Data in Machine Learning
• DATA: It can be any unprocessed fact, value,
text, sound, or picture that has not been
interpreted & analyzed.
• Data is the most important part of all Data
Analytics, ML, AI.
• Without data, ML model cannot be trained and
all modern research/automation will go in vain.
• Big enterprises spend lots of money just
to gather as much data as possible.
• Example: Why did Facebook acquire
WhatsApp by paying a huge price of $19
billion?
Answer is very simple & logical – it is to have
access to the users’ information that Facebook
may not have but WhatsApp will have.
This information of their users is of paramount
importance to Facebook as it will facilitate the task
of improvement in their services.
• INFORMATION: Data that has been
interpreted & manipulated with some
meaningful inference for the users.
• KNOWLEDGE: Combination of inferred
information, experiences, learning, & insights.
– Results in awareness or concept building for an
individual or organization.
How we split data in Machine Learning?
• Training Data: The part of the data used to train the model. This is
the data that the ML model actually sees (input + output) & learns
from.
• Validation Data: The part of data that is used to do a frequent
evaluation of the model, fit on the training dataset along with
improving involved hyperparameters (initially set parameters
before the model begins learning).
– This data plays its part when the model is actually training.
• Testing Data: Once the model is completely trained, testing
data provides an unbiased evaluation.
• The model will predict some values (without seeing actual
output).
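A minimal sketch of such a three-way split, assuming scikit-learn and its Iris dataset; the 60/20/20 proportions are an illustrative assumption:

```python
# Minimal train/validation/test split sketch (assumed 60/20/20 proportions).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First hold out 40% of the data, then split that held-out part half-and-half
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # training / validation / testing sizes
```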
Consider an example:
• There’s a Shopping Mart Owner who conducted a survey for which he has
a long list of questions & answers that he had asked from the customers,
this list of questions & answers is DATA.
• Every time he wants to infer anything, he can't just go through
each and every question of thousands of customers to find something
relevant, as it would be time-consuming & not helpful.
• In order to reduce this overhead & time wastage and to make work easier,
data is manipulated through software, calculations, graphs, etc. as per own
convenience, this inference from manipulated data is Information.
– Data is a must for Information.
• Knowledge has its role in differentiating between two individuals having
the same information.
– Knowledge is actually not technical content but is linked to the human thought
process.
Properties of Data
• Volume: Scale of Data. With the growing world
population and technology at exposure, huge data
is being generated each and every millisecond.
• Variety: Different forms of data – healthcare,
images, videos, audio clippings.
• Velocity: Rate of data streaming and generation.
• Value: Meaningfulness of data in terms of
information that researchers can infer from it.
• Veracity: Certainty & correctness in data.
Data Preprocessing
• Pre-processing refers to the transformations
applied to our data before feeding it to the
algorithm.
• Data Preprocessing is a technique that is used
to convert the raw data into a clean data set.
• In other words, whenever the data is gathered
from different sources it is collected in raw
format which is not feasible for the analysis
Need of Data Preprocessing
 For achieving better results from the applied model in ML
projects, the format of the data has to be proper.
 Some ML models need information in a specified
format.
 The RF algorithm does not support null values; therefore, to execute
the RF algorithm, null values have to be managed in the original
raw data set.
 The data set should be formatted in such a way that more than one ML &
DL algorithm can be executed on one data set, and the best of them
chosen.
3 popular data preprocessing techniques
1. Rescale Data
• When data is comprised of attributes with varying
scales, many ML algorithms can benefit from rescaling
the attributes so they all have the same scale.
• This is useful for optimization algorithms used in
the core of ML algorithms, like gradient descent.
• It is also useful for algorithms that weight inputs like
regression & NN and algorithms that use distance
measures like K-NN.
– rescale using scikit-learn: MinMaxScaler class.
• After rescaling, all of the values are in
the range between 0 and 1.
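A minimal rescaling sketch with scikit-learn's MinMaxScaler; the small Age/Salary array below is an assumed example, not data from the slides:

```python
# Minimal MinMaxScaler sketch on assumed example data (Age, Salary columns).
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[50, 30000.0],
              [20, 90000.0],
              [35, 60000.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
X_rescaled = scaler.fit_transform(X)
print(X_rescaled)                              # every value now lies between 0 and 1
```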
2. Binarize Data (Make Binary)
• transform data using a binary threshold.
– All values above the threshold are marked 1 & all equal
to or below are marked as 0.
• This is called binarizing your data or thresholding your
data.
– It can be useful when you have probabilities that you
want to turn into crisp values.
– It is also useful in feature engineering, when you want to
add new features that indicate something meaningful.
• scikit-learn: Binarizer class.
• All values equal to or less than 0 are marked 0 &
all of those above 0 are marked 1.
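A minimal binarization sketch with scikit-learn's Binarizer and the threshold of 0 described above; the input array is an assumed example:

```python
# Minimal Binarizer sketch on assumed example data, threshold = 0.0.
import numpy as np
from sklearn.preprocessing import Binarizer

X = np.array([[0.2, -1.5, 0.0],
              [3.1,  0.4, -0.7]])

binarizer = Binarizer(threshold=0.0)
X_binary = binarizer.fit_transform(X)
print(X_binary)                                # values > 0 become 1, values <= 0 become 0
```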
3. Standardize Data
• Standardization is a useful technique to
transform attributes with a Gaussian
distribution &
– differing means & standard deviations to a
standard Gaussian distribution with a mean of 0 &
a standard deviation of 1.
• standardize data using scikit-learn:
StandardScaler class.
• The values for each attribute have a mean
value of 0 & a standard deviation of 1.
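A minimal standardization sketch with scikit-learn's StandardScaler; the input array is an assumed example:

```python
# Minimal StandardScaler sketch on assumed example data.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[50, 30000.0],
              [20, 90000.0],
              [35, 60000.0]])

X_std = StandardScaler().fit_transform(X)
print(X_std.mean(axis=0))                      # ~0 for each attribute
print(X_std.std(axis=0))                       # ~1 for each attribute
```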
Data Cleaning
 It surely isn’t the fanciest part of machine learning and at the same
time, there aren’t any hidden tricks or secrets to uncover.
 However, the success or failure of a project relies on proper data
cleaning.
 Professional data scientists usually invest a very large portion of their
time in this step because of the belief that “Better data beats fancier
algorithms”
 If we have a well-cleaned dataset, there are chances that we can
achieve good results with simple algorithms too, which can prove
very beneficial at times, especially in terms of computation when the
dataset size is large.
– Different types of data require different types of cleaning.
– However, this systematic approach can always serve as a good starting point
Steps involved in Data Cleaning:
1. Removal of unwanted observations
• Includes deleting duplicate/redundant or irrelevant values
from your dataset
• Duplicate observations most frequently arise during data
collection & Irrelevant observations are those that don’t
actually fit the specific problem that you’re trying to solve.
– Redundant observations alter the efficiency by a great extent as
the data repeats & may add towards the correct side or towards
the incorrect side, thereby producing unfaithful results.
– Irrelevant observations: any data that is of no use
• can be removed directly.
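A small sketch of removing duplicate and irrelevant observations with pandas; the DataFrame and its 'notes' column are assumed examples:

```python
# Minimal sketch: drop duplicate rows and an assumed irrelevant column.
import pandas as pd

df = pd.DataFrame({"age": [25, 25, 40],
                   "salary": [50000, 50000, 80000],
                   "notes": ["a", "a", "b"]})

df = df.drop_duplicates()                      # remove duplicate observations
df = df.drop(columns=["notes"])                # drop an irrelevant feature
print(df)
```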
2. Fixing Structural Errors
• Errors arise during measurement, transfer of data, or other
similar situations
• Structural errors include typos in the name of features, the
same attribute with a different name, mislabeled classes,
i.e. separate classes that should really be the same, or
inconsistent capitalization.
– Example: the model will treat "America" & "america" as different
classes or values, though they represent the same value; or red,
yellow, & red-yellow as different classes or attributes, though
one class can be included in the other two classes.
– So, these are some structural errors that make our model
inefficient & give poor-quality results.
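A small sketch of fixing inconsistent capitalization and a typo in a class label with pandas; the 'country' column and its values are assumed examples:

```python
# Minimal sketch: normalise class labels so 'America', 'america', 'AMERICA '
# and a typo all collapse into one class.
import pandas as pd

df = pd.DataFrame({"country": ["America", "america", "AMERICA ", "Amercia"]})

df["country"] = df["country"].str.strip().str.lower()          # consistent capitalization
df["country"] = df["country"].replace({"amercia": "america"})  # fix a mislabeled/typo class
print(df["country"].unique())                                  # a single class remains
```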
3. Managing Unwanted Outliers
• Outliers can cause problems with certain types of models.
• For example, linear regression models are less robust to
outliers than DT models.
• Generally, we should not remove outliers until we have a
legitimate reason to remove them.
• Sometimes, removing them improves performance,
sometimes not.
• So, one must have a good reason to remove the outlier:
suspicious measurements that are unlikely to be part of
real data.
4. Handling Missing Data
• Missing data is a deceptively tricky issue in ML.
• You cannot just ignore or remove missing observations.
• They must be handled carefully, as they can be an indication of
something important.
• The 2 most common ways:
– Dropping observations with missing values.
• The fact that the value was missing may be informative in itself.
• Plus, in the real world, you often need to make predictions on new data even if
some of the features are missing!
– Imputing the missing values from past observations.
• Again, “missingness” is almost always informative in itself, and you should tell
your algorithm if a value was missing.
• Even if you build a model to impute your values, you’re not adding any real
information. You’re just reinforcing the patterns already provided by other
features.
• Missing data is like missing a puzzle piece.
– If you drop it, that’s like pretending the puzzle slot isn’t there.
– If you impute it, that’s like trying to squeeze in a piece from
somewhere else in the puzzle.
• So, missing data is always informative & an indication
of something important.
• Make the algorithm aware of missing data by
flagging it.
• By using this technique of flagging & filling, you are
essentially allowing the algorithm to estimate the optimal
constant for missingness, instead of just filling it in with
the mean.
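A minimal flag-and-fill sketch with pandas, following the idea above; the 'salary' column and the constant fill value are assumed examples:

```python
# Minimal flag-and-fill sketch on an assumed 'salary' column.
import numpy as np
import pandas as pd

df = pd.DataFrame({"salary": [50000, np.nan, 80000, np.nan]})

df["salary_missing"] = df["salary"].isna().astype(int)   # flag: 1 where the value was missing
df["salary"] = df["salary"].fillna(0)                    # fill with a constant, not the mean
print(df)
# The model can now use the flag to estimate its own optimal constant for
# "missingness" instead of treating the filled value as a real observation.
```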
Some Data Cleansing Tools
• Openrefine
• Trifacta Wrangler
• TIBCO Clarity
• Cloudingo
• IBM Infosphere Quality Stage
Feature Scaling
 Feature Scaling is a technique to standardize the
independent features present in the data in a fixed range
– It is performed during the data pre-processing
 Working: Consider a dataset with the features Age, Salary, & BHK
Apartment, with a data size of 5000 people, each having
these independent data features.
• Each data point is labeled as:
– Class1-YES (means with the given Age, Salary, BHK
Apartment feature value one can buy the property)
– Class2-NO (means with the given Age, Salary, BHK
Apartment feature value one can’t buy the property)
• Using a dataset to train
the model, one aims to
build a model that can
predict whether one can
buy a property or not
with given feature
values.
• Once the model is
trained, an N-dimensional
(where N is the no. of
features present in the
dataset) graph with data
points from the given
dataset, can be created.
• Star: Class1 – Yes
• Circles: Class2 – No
• A new data point (diamond)
is given & it has different
independent values for the
3 features (Age, Salary, BHK
Apartment)
• The model has to predict
whether this data point
belongs to Yes or No
Prediction of the class of new data points:
The model calculates the distance of this data
point from the centroid of each class group.
• Finally, this data point will belong to that class,
which will have a minimum centroid distance
from it.
The distance between the centroid & a data point can be calculated as follows.

• Euclidean Distance: the square root of the sum of squares of differences
between the coordinates (feature values – Age, Salary, BHK Apartment) of
the data point & the centroid of each class:

  $d(x, y) = \sqrt{\sum_{i=1}^{k} (x_i - y_i)^2}$

  where x is the data point value, y is the centroid value & k is the no. of
  feature values; e.g., the given data set has k = 3.

• Manhattan Distance: calculated as the sum of absolute differences between
the coordinates (feature values) of the data point & the centroid of each
class:

  $d(x, y) = \sum_{i=1}^{k} |x_i - y_i|$

• Minkowski Distance: a generalization of the above two methods:

  $d(x, y) = \left( \sum_{i=1}^{k} |x_i - y_i|^p \right)^{1/p}$
Need of Feature Scaling
• The given data set contains 3 features:
Age, Salary, BHK Apartment.
• Consider a range of 10- 60 for Age, 1 Lac- 40
Lacs for Salary, 1- 5 for BHK of Flat. All these
features are independent of each other.
• Suppose the centroid of class 1 is [40, 22 Lacs,
3] and the data point to be predicted is [57, 33
Lacs, 2].
Using the Manhattan method:

  d = |40 − 57| + |22,00,000 − 33,00,000| + |3 − 2| = 17 + 11,00,000 + 1

• It can be seen that the Salary feature will dominate all
other features while predicting the class of the given data
point, since all the features are independent of each
other, i.e. a person's salary has no relation with his/her
age or what requirement of the flat he/she has.
• This means that the model will always predict wrongly.
• So, the simple solution to this problem is Feature Scaling.
• Feature Scaling algorithms will scale Age, Salary, BHK into a
fixed range, say [-1, 1] or [0, 1].
• Then no feature can dominate the others. (A small numeric sketch
follows below.)
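A small numeric sketch of the problem and the fix, using the ranges from the slides (Age 10-60, Salary 1-40 Lacs, BHK 1-5); the min-max scaling here is done by hand for illustration:

```python
# Manhattan distance before and after min-max scaling (values from the slides).
import numpy as np

centroid = np.array([40, 22_00_000, 3])        # class-1 centroid [Age, Salary, BHK]
point    = np.array([57, 33_00_000, 2])        # new data point

print(np.abs(centroid - point).sum())          # 1,100,018: the Salary term dominates

lo = np.array([10, 1_00_000, 1])               # feature minimums (assumed from the ranges)
hi = np.array([60, 40_00_000, 5])              # feature maximums
c_scaled = (centroid - lo) / (hi - lo)         # min-max scaling to [0, 1]
p_scaled = (point - lo) / (hi - lo)

print(np.abs(c_scaled - p_scaled).sum())       # ~0.87: each feature now contributes comparably
```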
Why and Where to Apply Feature Scaling?
• Real-world datasets contain features that vary greatly in
magnitudes, units, & range.
• Normalization should be performed when the scale of a
feature is irrelevant or misleading, and should not be performed
when the scale is meaningful.
• Algorithms that use Euclidean distance measures are
sensitive to magnitudes. Here, feature scaling helps to weigh
all the features equally.
• Formally, if a feature in the dataset is big in scale compared to
the others, then in algorithms where Euclidean distance is
measured, this big-scaled feature becomes dominating & needs
to be normalized.
Examples of Algorithms where Feature
Scaling matters

• 1. K-Means uses the Euclidean distance measure; here feature
scaling matters.
• 2. K-Nearest-Neighbours also requires feature scaling.
• 3. Principal Component Analysis (PCA): tries to get the
feature with maximum variance; here too, feature scaling is
required.
• 4. Gradient Descent: calculation speed increases, as Theta
calculation becomes faster after feature scaling.
 Note: Naive Bayes, Linear Discriminant Analysis, & tree-based
models are not affected by feature scaling.
 In short, any algorithm which is not distance-based is not affected
by feature scaling.
Handling Imbalanced Data with SMOTE &
Near Miss Algorithm

• Self Learning
Some facts about Data
• As compared to 2005, 300 times more data, i.e. 40 Zettabytes
(1 ZB = 10^21 bytes), will be generated by 2020.
• By 2011, the healthcare sector had about 161 billion GB of data.
• 400 M tweets are sent by about 200 M active users per day
• Each month, more than 4B hours of video streaming is done
by the users.
• 30B different types of content are shared every month by
the user.
• It is reported that about 27% of data is inaccurate and so 1
in 3 business idealists or leaders don’t trust the information
on which they are making decisions.
Best Python libraries for ML
• In the older days, people used to perform ML tasks
by manually coding all the algorithms/mathematical
or statistical formula
– time consuming, tedious & inefficient.
• It has become much easier & more efficient thanks to
various Python libraries, frameworks, & modules.
• Python is one of the most popular programming
languages for this task and it has replaced many
languages in the industry
• Reason: vast collection of libraries.
Python ML Libraries
• Numpy
• Scipy
• Scikit-learn
• Theano
• TensorFlow
• Keras
• PyTorch
• Pandas
• Matplotlib
NumPy
• Popular python library for large multi-
dimensional array & matrix processing, with the
help of a large collection of high-level
mathematical functions.
– It is very useful for fundamental scientific
computations in Machine Learning.
– It is particularly useful for linear algebra, Fourier
transform, & random number capabilities.
• High-end libraries like TensorFlow use NumPy
internally for manipulation of tensors.
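A tiny NumPy sketch of the kind of fundamental computation described above (the arrays are assumed examples):

```python
# Minimal NumPy sketch: arrays, linear algebra, Fourier transform, randomness.
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([1.0, 0.0])

print(A @ b)                                   # matrix-vector product
print(np.linalg.inv(A))                        # linear algebra: matrix inverse
print(np.fft.fft(b))                           # Fourier transform capability
print(np.random.rand(3))                       # random number capability
```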
SciPy
• Popular library among ML enthusiasts as it
contains different modules for optimization,
linear algebra, integration & statistics.
• There is a difference between the SciPy library
& SciPy stack.
– The SciPy library is one of the core packages that make
up the SciPy stack.
– SciPy is also very useful for image manipulation.
Scikit-learn
• Most popular for classical ML algorithms.
• It is built on top of two basic Python libraries,
viz., NumPy & SciPy.
• Scikit-learn supports most of the supervised &
unsupervised learning algorithms.
– Scikit-learn can also be used for data-mining &
data-analysis
Theano
• Popular python library that is used to define, evaluate &
optimize mathematical expressions involving multi-
dimensional arrays in an efficient manner.
• It is achieved by optimizing the utilization of CPU & GPU.
• It is extensively used for unit-testing & self-verification to
detect & diagnose different types of errors.
• Theano is a very powerful library that has been used in
large-scale computationally intensive scientific projects for
a long time
– but is simple & approachable enough to be used by individuals
for their own projects.
TensorFlow
• Very popular open-source library for high
performance numerical computation [Google
Brain team].
• Tensorflow is a framework that involves
defining & running computations involving
tensors.
• It can train & run DNN
– Widely used in the field of DL research &
application
Keras
• It is a high-level NN API capable of running on
top of TensorFlow, CNTK, or Theano.
• It can run seamlessly on both CPU & GPU
• Keras makes it really easy for ML beginners to build
& design a NN.
– allows for easy & fast prototyping
PyTorch
• Based on Torch, which is an open-source ML
library implemented in C with a wrapper in
Lua.
• It has an extensive choice of tools & libraries
that support Computer Vision, NLP &
many more ML programs.
• It allows developers to perform computations
on Tensors with GPU acceleration
– helps in creating computational graphs
Pandas
• Popular for data analysis
• Pandas comes handy as it was developed
specifically for data extraction & preparation.
• It provides high-level data structures & wide
variety tools for data analysis.
• It provides many in-built methods for grouping,
combining & filtering data.
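A tiny pandas sketch of grouping and filtering; the DataFrame is an assumed example:

```python
# Minimal pandas sketch: grouping and filtering an assumed DataFrame.
import pandas as pd

df = pd.DataFrame({"dept": ["A", "A", "B"], "salary": [50, 60, 70]})

print(df.groupby("dept")["salary"].mean())     # grouping
print(df[df["salary"] > 55])                   # filtering
```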
Matplotlib
• Popular for data visualization.
• Like Pandas, not directly related to ML.
• It particularly comes in handy when a programmer wants
to visualize the patterns in the data.
• It is a 2D plotting library used for creating 2D graphs &
plots.
• A module named pyplot makes it easy for programmers for
plotting as it provides features to control line styles, font
properties, formatting axes, etc.
• It provides various kinds of graphs & plots for data
visualization, viz., histograms, error charts, bar charts, etc.
Popular Sources for ML Datasets
• Kaggle Datasets: https://www.kaggle.com/datasets
• UCI ML Repository: https://archive.ics.uci.edu/ml/index.php
• Datasets via AWS: https://registry.opendata.aws/
• Google's Dataset Search Engine: https://toolbox.google.com/datasetsearch
• Microsoft Datasets: https://msropendata.com/
• Awesome Public Dataset Collection:
https://github.com/awesomedata/awesome-public-datasets
• Government Datasets:
– Indian Government dataset
– US Government Dataset
– Northern Ireland Public Sector Datasets
– European Union Open Data Portal
• Computer Vision Datasets: https://www.visualdata.io/
• Scikit-learn Dataset: https://scikit-learn.org/stable/datasets/index.html
Assignment
• Explain with your own logic, relevant examples
and experiences: “AI is the superset of ML i.e.
all ML is AI but not all AI is ML”
• Special instructions:
– Should be individualized
– Don’t copy-paste
– Try to write by yourselves
– Similarity score will be measured by the Turnitin software
