Unit - II MLT
10211AM223 - Machine Learning Techniques
A. Course Objectives
•Course Outcomes
Upon the successful completion of the course, students will be able to:
CO No’s | Course Outcomes | K-Level
CO1 | Examine the basic concepts of data mining and machine learning. | K2
CO2 | Design and evaluate dimensionality reduction algorithms using real-world datasets. | K3
CO3 | Apply various algorithms of Classification and Association. | K3
CO4 | Demonstrate experiments to evaluate and compare different unsupervised learning algorithms. | K3
CO5 | Use the concept of neural networks for learning linear and non-linear activation functions. | K3
Knowledge Level (Based on revised Bloom’s Taxonomy)
K1-Remember  K2-Understand  K3-Apply  K4-Analyze  K5-Evaluate  K6-Create
Unit II Syllabus
Data Pre-processing- Needs Pre-processing the Data- Data Cleaning, Data Integration and Transformation,
Data Reduction, Discretization and Concept Hierarchy Generation- Dimensionality Reduction – Feature
Extraction- Variable Selection- Variable ranking- Linear Discriminant Analysis – Principal Component
Analysis – Factor Analysis –Cross Validation –Resampling methods
Introduction to Dimensionality Reduction
Dimensionality reduction is the process of reducing the number of inputs (features) used to represent the data.
The complexity of any classifier or regressor depends on the number of inputs: this determines both the time and space
complexity and the number of training examples needed to train such a classifier or regressor.
Feature selection and Feature extraction
➢ Feature selection involves selecting a subset of the original features that
are most relevant to the problem at hand.
➢ The goal is to reduce the dimensionality of the dataset while retaining the
most important features.
➢ There are several methods for feature selection, including filter methods,
wrapper methods, and embedded methods.
➢ Filter methods rank the features based on their relevance to the target variable, wrapper methods use model performance as the criterion for selecting features, and embedded methods combine feature selection with the model training process (see the sketch below).
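A minimal scikit-learn sketch of the three approaches on a synthetic dataset; the estimator and the choice of k = 5 features are illustrative assumptions, not part of the slide material:
# Hypothetical illustration of filter, wrapper, and embedded selection
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Filter method: rank features by a statistical score (ANOVA F-value) against the target
X_filter = SelectKBest(score_func=f_classif, k=5).fit_transform(X, y)

# Wrapper method: recursive feature elimination uses model performance to pick features
X_wrapper = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit_transform(X, y)

# Embedded method: selection happens as part of model training (L1-penalised coefficients)
embedded = SelectFromModel(LogisticRegression(penalty='l1', solver='liblinear', C=1.0))
X_embedded = embedded.fit_transform(X, y)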
Feature Extraction
➢ Feature extraction involves transforming the original features into a new, smaller set of features.
➢ The goal is to create a set of features that captures the essence of the original data in a lower-dimensional space.
❖ This can be done to reduce the complexity of a model, improve the performance of a learning
algorithm, or make it easier to visualize the data.
❖ Each technique projects the data onto a lower-dimensional space while preserving important information (see the sketch after this list).
❖ It is important to note that dimensionality reduction can also discard useful information, so
care must be taken when applying these techniques.
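As a small illustration, PCA (discussed later in this unit) is one such projection technique; a sketch on random data for shape intuition only:
# Hypothetical sketch: project a 10-dimensional dataset onto 2 extracted components
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)            # 100 samples, 10 original features
pca = PCA(n_components=2)              # keep the 2 directions of largest variance
X_reduced = pca.fit_transform(X)       # new features are combinations of the originals
print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # how much variance each component preserves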
Data Pre-processing
▪ Data preprocessing is the process of preparing raw data and making it suitable for a machine learning model. It is the first and crucial step while creating a machine learning model.
▪ When creating a machine learning project, it is not always the case that we come across clean and formatted data.
Need for Data Pre-processing
Preprocessing data is an important step for data analysis. Two of its main activities are briefly explained below:
•Data validation: This is the process where businesses analyze and assess the raw data for a project to determine whether it is complete and accurate enough to achieve the best results.
•Data imputation: Data imputation is where you fill in missing values and rectify data errors found during validation, either manually or through programming, such as business process automation.
Steps involved in preprocessing
Steps involved:
• Getting the dataset
• Importing libraries
• Importing datasets
• Handling missing data
• Encoding categorical data
• Splitting the dataset into training and test sets
• Feature scaling
1.Get the Dataset
➢ To create a machine learning model, the first thing we require is a dataset, as a machine learning model works entirely on data.
➢ The collected data for a particular problem, arranged in a proper format, is known as the dataset.
➢ Datasets come in different formats for different purposes; for example, the dataset needed for a business model will differ from the dataset required for a liver-patient model.
➢ So each dataset is different from another dataset. To use the dataset in our code, we usually put it into a CSV file.
➢ CSV stands for "Comma-Separated Values"; it is a file format which allows us to save tabular data, such as spreadsheets. It is useful for huge datasets, and these datasets can be used directly in programs.
2. Importing Libraries
Numpy: The Numpy Python library is used for including any type of mathematical operation in the code.
It is the fundamental package for scientific calculation in Python. It also supports large, multidimensional arrays and matrices. In Python, we can import it as:
import numpy as nm
2. Importing Libraries (…contd)
Matplotlib: The second library is matplotlib, which is a Python 2D plotting library, and with this library we need to import its sub-library pyplot.
This library is used to plot any type of chart in Python. It will be imported as below.
Pandas: The last library is the Pandas library, which is one of the most famous Python libraries and is used for importing and managing datasets. It is an open-source data manipulation and analysis library. It will be imported as below.
Here, we have used pd as a short name for this library. Consider the snippet below:
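The import statements described above would look like this; plt is a common alias choice for pyplot, and pd is the short name mentioned here:
import matplotlib.pyplot as plt   # pyplot sub-library used for plotting charts
import pandas as pd               # pandas, imported with the short name pd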
3) Importing the Datasets
read_csv() function:
Now to import the dataset, we will use the read_csv() function of the pandas library, which is used to read a CSV file and perform various operations on it. Using this function, we can read a CSV file locally as well as through a URL.
We can use the read_csv function as below:
data_set= pd.read_csv('Dataset.csv')
To extract the independent variables, we will use the iloc[ ] method of the Pandas library. It is used to extract the required rows and columns from the dataset.
x= data_set.iloc[:,:-1].values   # all rows, every column except the last
3) Importing the Datasets (contd…)
y= data_set.iloc[:,3].values   # all rows, the fourth (last) column only
Here we have taken all the rows with the last column only. It will give the array of the dependent variable.
4) Handling Missing data
The next step of data preprocessing is to handle missing data in the datasets.
If our dataset contains some missing data, then it may create a huge problem for our
machine learning model.
There are mainly two ways to handle missing data, which are:
By deleting the particular row: The first way is commonly used to deal with null values. Here we simply delete the specific row or column which contains null values. However, this approach is not very efficient, and removing data may lead to a loss of information, which will not give accurate output.
By calculating the mean: Here we calculate the mean of the column or row that contains the missing value and put it in place of the missing value. This strategy is useful for features with numeric data such as age, salary, year, etc.
Sample Code to handle missing values
To handle missing values, we will use the Scikit-learn library in our code, which contains various modules for building machine learning models. Here we will use the Imputer class of the sklearn.preprocessing library. Below is the code for it:
#handling missing data (Replacing missing data with the mean value)
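Note that recent scikit-learn versions replace the older Imputer class with SimpleImputer in sklearn.impute; a minimal sketch with the current API, assuming the numeric columns (e.g. Age and Salary) sit at positions 1 and 2 of x:
import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')   # replace NaN with the column mean
imputer = imputer.fit(x[:, 1:3])           # learn the mean of the numeric columns
x[:, 1:3] = imputer.transform(x[:, 1:3])   # fill in the missing entries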
5) Encoding Categorical Data
Categorical data is data which has some categories; in our dataset there are two categorical variables, Country and Purchased.
Firstly, we will encode the Country variable into numeric codes. To do this, we will use the LabelEncoder() class from the preprocessing library.
#Categorical data
from sklearn.preprocessing import LabelEncoder
label_encoder_x= LabelEncoder()
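Applying the encoder is then a one-liner; this sketch assumes the Country variable is the first column (index 0) of x:
x[:, 0]= label_encoder_x.fit_transform(x[:, 0])   # country names become integer codes 0, 1, 2, ...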
6) Splitting the Dataset into the Training set and Test set
➢ In machine learning data preprocessing, we divide our dataset into a training set and test set.
➢ This is one of the crucial steps of data preprocessing as by doing this, we can enhance the performance of our
machine learning model.
Training Set: A subset of the dataset used to train the machine learning model; we already know the output for it.
Test set: A subset of the dataset used to test the machine learning model; using the test set, the model predicts the output.
For splitting the dataset, we will use the below lines of code:
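A sketch of the two lines described here; the 80/20 split and the random_state value are illustrative choices:
from sklearn.model_selection import train_test_split   # splits arrays into random train and test subsets
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size=0.2, random_state=0)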
In the above code, the first line imports train_test_split, which is used for splitting arrays of the dataset into random train and test subsets.
In the second line, we have used four variables for the output: x_train, x_test, y_train, and y_test.
In the train_test_split() function, we have passed four parameters: the first two are the arrays of data, test_size specifies the size of the test set, and random_state fixes the shuffling so that the split is reproducible. The test_size may be 0.5, 0.3, or 0.2, which gives the dividing ratio between the training and testing sets.
7) Feature Scaling
➢ In feature scaling, we put our variables in the same range and on the same scale so that no variable dominates the others.
Feature Scaling Methods
Standardization
Normalization
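As a quick reference (standard definitions, where μ and σ are a feature's mean and standard deviation, and x_min, x_max its minimum and maximum):
Standardization (Z-score): x' = (x − μ) / σ
Normalization (Min–Max): x' = (x − x_min) / (x_max − x_min), which rescales the feature into the range [0, 1]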
Feature Scaling Methods (contd…)
Here, we will use the standardization method for our dataset.
For feature scaling, we will import the StandardScaler class of the sklearn.preprocessing library as:
from sklearn.preprocessing import StandardScaler
Now, we will create an object of the StandardScaler class for the independent variables (features), and then we will fit and transform the training dataset.
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
For the test dataset, we will directly apply the transform() function instead of fit_transform(), because the scaler has already been fitted on the training set.
x_test= st_x.transform(x_test)
Data Cleaning
➢ Data cleaning is a crucial step in the machine learning (ML) pipeline, as it involves
identifying and removing any missing, duplicate, or irrelevant data.
➢ The goal of data cleaning is to ensure that the data is accurate, consistent, and free of
errors, as incorrect or inconsistent data can negatively impact the performance of the
ML model.
Data Cleaning Steps
The most common steps involved in data cleaning:
1. Data inspection and exploration
2. Removal of unwanted observations
3. Handling missing data
4. Handling outliers
5. Data transformation
Data Cleaning Steps (contd…)
1. Data inspection and exploration:
This step involves understanding the data by inspecting its structure and identifying missing values,
outliers, and inconsistencies.
import pandas as pd
import numpy as np
a1 = pd.Series([1, 2, 3])
a1.describe()
Output:
count 3.0
mean 2.0
std 1.0
min 1.0
25% 1.5
50% 2.0
75% 2.5
max 3.0
dtype: float64
Data Cleaning Steps (contd…)
2.Removal of unwanted observations
This includes deleting duplicate/ redundant or irrelevant values from your dataset.
Duplicate observations most frequently arise during data collection, and irrelevant observations are those that don’t actually fit the specific problem you’re trying to solve.
• Redundant observations greatly reduce efficiency, since the repeated data can bias the result towards either the correct or the incorrect side, producing unfaithful results.
• Irrelevant observations are any type of data that is of no use to us and can be removed
directly.
Data Cleaning Steps (contd…)
2.Removal of unwanted observations
df.duplicated()
subset: Takes a column or a list of column labels; its default value is None. After passing columns, only those columns are considered for identifying duplicates.
keep: Controls how duplicate values are treated. It has only three distinct values, and the default is ‘first’.
–> If ‘first’, it considers the first value as unique and the rest of the same values as duplicates.
–> If ‘last’, it considers the last value as unique and the rest of the same values as duplicates.
–> If False, it considers all of the same values as duplicates.
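A short sketch of both calls on a hypothetical DataFrame df:
import pandas as pd

df = pd.DataFrame({'name': ['A', 'B', 'A'], 'city': ['X', 'Y', 'X']})
print(df.duplicated(subset=['name'], keep='first'))            # marks the second 'A' row as a duplicate
df_clean = df.drop_duplicates(subset=['name'], keep='first')   # keeps only the first occurrence of each name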
Data Cleaning Steps (contd…)
3. Handling missing data:
Missing data is a common issue in real-world datasets, and it can occur due to various reasons such as
human errors, system failures, or data collection issues.
Various techniques can be used to handle missing data, such as imputation, deletion, or substitution.
❖ Deleting rows with missing values
❖ Imputing missing values for continuous variables
❖ Imputing missing values for categorical variables
❖ Other imputation methods
❖ Using algorithms that support missing values
In Pandas, missing data is represented by two values:
•None: None is a Python singleton object that is often used for missing data in Python code.
•NaN: NaN (an acronym for Not a Number) is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation.
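A compact pandas sketch of the first three options, using a hypothetical DataFrame df with a numeric 'age' column and a categorical 'city' column:
import pandas as pd
import numpy as np

df = pd.DataFrame({'age': [25, np.nan, 40], 'city': ['X', None, 'Y']})

df_drop = df.dropna()                                 # delete rows with any missing value
df['age'] = df['age'].fillna(df['age'].mean())        # impute a continuous variable with the mean
df['city'] = df['city'].fillna(df['city'].mode()[0])  # impute a categorical variable with the mode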
4. Handling outliers:
Outliers in the data can be detected using certain statistical plots; the most common plots are the Box Plot and the Scatter Plot.
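One common rule of thumb, the same one a box plot visualises, flags values outside 1.5×IQR from the quartiles; a sketch on a small numeric pandas Series s:
import pandas as pd

s = pd.Series([10, 12, 11, 13, 95])         # 95 is an obvious outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1                               # interquartile range
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers)                             # -> 95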
5. Data transformation:
Data transformation involves converting the data from one form to another to make it more suitable for analysis. Techniques such as normalization, scaling, or encoding can be used to transform the data.
Scaling
• Scaling involves transforming the values of features to a specific range. It maintains the shape of
the original distribution while changing the scale.
• Scaling is particularly useful when features have different scales, and certain algorithms are
sensitive to the magnitude of the features.
• Common scaling methods include Min-Max scaling and Standardization (Z-score scaling).
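A minimal scikit-learn sketch of both scaling options; the tiny two-column array is only for illustration:
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_minmax = MinMaxScaler().fit_transform(X)   # each column rescaled into [0, 1]
X_std = StandardScaler().fit_transform(X)    # each column centred to mean 0, unit variance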
Advantages of Data Cleaning in Machine Learning:
❖ Improved model performance: Data cleaning helps improve the performance of the ML model by removing errors,
inconsistencies, and irrelevant data, which can help the model to better learn from the data.
❖ Increased accuracy: Data cleaning helps ensure that the data is accurate, consistent, and free of errors, which can
help improve the accuracy of the ML model.
❖ Better representation of the data: Data cleaning allows the data to be transformed into a format that better
represents the underlying relationships and patterns in the data, making it easier for the ML model to learn from the
data.
❖ Improved data quality: Data cleaning helps to improve the quality of the data, making it more reliable and accurate.
This ensures that the machine learning models are trained on high-quality data, which can lead to better predictions
and outcomes.
❖ Improved data security: Data cleaning can help to identify and remove sensitive or confidential information that
could compromise data security. By eliminating this information, data cleaning can help to ensure that only the
necessary and relevant data is used for machine learning.
Disadvantages of Data Cleaning in Machine Learning:
❖ Time-consuming: Data cleaning can be a time-consuming task, especially for large and complex datasets.
❖ Error-prone: Data cleaning can be error-prone, as it involves transforming and cleaning the data, which can result in
the loss of important information or the introduction of new errors.
❖ Limited understanding of the data: Data cleaning can lead to a limited understanding of the data, as the transformed
data may not be representative of the underlying relationships and patterns in the data.
❖ Data loss: Data cleaning can result in the loss of important information that may be valuable for machine learning
analysis. In some cases, data cleaning may result in the removal of data that appears to be irrelevant or inconsistent, but
which may contain valuable insights or patterns.
❖ Cost and resource-intensive: Data cleaning can be a resource-intensive process that requires significant time, effort,
and expertise. It can also require the use of specialized software tools, which can add to the cost and complexity of data
cleaning.
❖ Overfitting: Data cleaning can inadvertently contribute to overfitting by removing too much data, leading to a loss of
information that could be important for model training and performance.
Data Integration and Transformation
Data integration is one of the steps of data pre-processing that involves combining data
residing in different sources and providing users with a unified view of these data.
Data Integration and Transformation (contd…)
There are mainly 2 major approaches for data integration - commonly known as "tight
coupling approach" and "loose coupling approach".
Tight Coupling
❑Here data is pulled over from different sources into a single physical location through the
process of ETL - Extraction, Transformation and Loading.
❑The single physical location provides a uniform interface for querying the data.
❑ETL layer helps to map the data from the sources so as to provide a uniform data
warehouse.
❑This approach is called tight coupling since in this approach the data is tightly coupled
with the physical repository at the time of query.
Data Integration and Transformation (contd…)
Loose Coupling
• Here a virtual mediated schema provides an interface that takes the query from the
user, transforms it in a way the source database can understand and then sends the
query directly to the source databases to obtain the result.
• In this approach, the data only remains in the actual source databases.
Data Integration and Transformation (contd…)
Loose Coupling
Advantages:
❖ Higher Agility (when a new source system comes or existing source system changes - only the
corresponding adapter is created or changed - largely not affecting the other parts of the system)
❖ Less costly (a lot of infrastructure cost can be saved since data localization is not required)
Disadvantages:
❖ Semantic conflicts
Issues in Data Integration
There are several issues that can arise when integrating data from multiple sources, including:
1.Data Quality: Inconsistencies and errors in the data can make it difficult to combine and analyze.
2.Data Semantics: Different sources may use different terms or definitions for the same data, making it
difficult to combine and understand the data.
3.Data Heterogeneity: Different sources may use different data formats, structures, or schemas, making it
difficult to combine and analyze the data.
4.Data Privacy and Security: Protecting sensitive information and maintaining security can be difficult when
integrating data from multiple sources.
5.Scalability: Integrating large amounts of data from multiple sources can be computationally expensive and
time-consuming.
6.Data Governance: Managing and maintaining the integration of data from multiple sources can be difficult,
especially when it comes to ensuring data accuracy, consistency, and timeliness.
7.Performance: Integrating data from multiple sources can also affect the performance of the system.
8.Integration with existing systems: Integrating new data sources with existing systems can be a complex
task, requiring significant effort and resources.
9.Complexity: The complexity of integrating data from multiple sources can be high, requiring specialized
skills and knowledge.
Data Transformation
In data mining pre-processing, and especially in metadata and data warehousing, we use data transformation to convert data from a source data format into the destination data format.
The goal of the data transformation process is to extract data from a source, convert it into a usable
format, and deliver it to a destination.
During the extraction phase, data is identified and pulled from many different locations or sources into a
single repository.
45
Steps in Data Transformation
Data discovery. The first step in the data transformation process consists of identifying and
understanding the data in its source format. This is usually accomplished with the help of a data
profiling tool. This step helps you decide what needs to happen to the data in order to get it into
the desired format.
Data mapping. During this phase, the actual transformation process is planned.
Generating code. In order for the transformation process to be completed, code must be created to run the transformation job. This code is often generated with the help of a data transformation tool or platform.
Executing the code. The data transformation process that has been planned and coded is now put
into motion, and the data is converted to the desired output.
Review. Transformed data is checked to make sure it has been formatted correctly.
Steps in Data Transformation (contd…)
In addition to these basic steps, other customized operations may occur. For example,
➢ Enriching (e.g. splitting Full Name into First Name, Middle Name, Last Name).
Types of data transformation
There are many different types of data transformation, depending on what kind of data you have and what you want to do
with it. Some common types include:
• Data cleaning
• Feature extraction
• Feature creation
• Data normalization
• Data aggregation/disaggregation
• Sampling
Types of data transformation (contd…)
Data cleaning
Data cleaning is the process of removing incorrect or incomplete information from your dataset,
adding or fixing missing values, dealing with outliers, and so on. It’s an important step in any data
transformation process, and it’s often the most time-consuming.
Feature extraction
Feature extraction is the process of reducing a large amount of information down to a smaller set of
more useful variables. It’s a common data transformation technique, and it’s often used when
working with images or videos.
Feature extraction is a type of data reduction, and it’s a common technique in machine learning.
Types of data transformation (contd…)
Feature creation
Feature creation is the process of adding extra information to your dataset where none existed
previously. It’s a useful data transformation technique when you want to make use of data that’s not
in a standard format.
Data normalization
Data normalization is the process of making sure all values in your dataset are on the same scale.
It’s a common data transformation technique, and it’s often used when working with numerical
data.
Data aggregation/disaggregation
Data aggregation is the process of combining multiple datasets into one. It’s a common data
transformation technique, and it’s often used when working with data from different sources.
Data disaggregation is the opposite of data aggregation. It’s the process of splitting one large dataset
into several smaller ones.
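A small pandas sketch of aggregation, using a hypothetical sales table:
import pandas as pd

sales = pd.DataFrame({'region': ['N', 'N', 'S'], 'amount': [100, 150, 80]})
by_region = sales.groupby('region')['amount'].sum()   # aggregate: one total per region
print(by_region)                                       # N -> 250, S -> 80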
Types of data transformation (contd…)
Sampling
Sampling is the process of using only part of your dataset rather than all of it. It’s a common data
transformation technique, and it’s often used when working with very large datasets which cannot
be stored fully on the computer being used.
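In pandas, this can be done directly with sample(); the 10% fraction and random_state below are illustrative choices:
import pandas as pd

df = pd.DataFrame({'value': range(1000)})
sample_df = df.sample(frac=0.1, random_state=42)   # keep a random 10% of the rows
print(len(sample_df))                              # -> 100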
Data Reduction
Data reduction is a technique used to optimize the capacity required to store data. Data reduction can increase
storage performance and reduce storage costs.
• Deduplication: Removing duplicate data. This can range from simply removing duplicated records to deleting records that, while not strictly identical, represent the same information or event.
• Compression: Compression processes apply algorithms to transform information so that it takes up less storage space. Compression algorithms can be (and often are) applied to data as it is moved into storage, but some can also be applied to data-at-rest to improve space gains even more.
• Thin Provisioning: Thin provisioning is an approach to storage where space is partitioned and used as needed rather than pre-allocating storage to users or processes. While more computationally intensive, this approach can significantly reduce inefficiencies like disk fragmentation.
Techniques for Data Reduction
What Techniques Are Used for Data Reduction?
▪ The data is processed and reduced to a smaller but equally useful form of information that still represents trends in the relevant information for analytics.
• Clustering: This practice uses data attributes to build a set of clusters into which the data is split. Similarities and dissimilarities between the data objects result in different placements and distances between the clusters and the objects as a whole.
Discretization and Concept Hierarchy Generation
Data discretization refers to a method of converting a huge number of data values into smaller
ones so that the evaluation and management of data become easy.
There are two forms of data discretization: supervised discretization and unsupervised discretization.
Unsupervised discretization is classified by the direction in which the operation proceeds: it can follow a top-down splitting strategy or a bottom-up merging strategy.
Famous techniques of data discretization
Histogram analysis
Histogram refers to a plot used to represent the underlying frequency distribution of a continuous data set.
A histogram assists in inspecting the data distribution, for example for outliers, skewness, or how close the data is to a normal distribution.
Histograms are graphs that display the distribution of your continuous data.
Famous techniques of data discretization(contd..)
Binning
Binning refers to a data smoothing technique that helps to group a huge number of continuous values into a smaller number of bins. This technique can also be used for data discretization and the development of concept hierarchies (see the sketch below).
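A short pandas sketch of binning a continuous attribute; the choice of 3 bins is illustrative:
import pandas as pd

ages = pd.Series([22, 25, 31, 38, 45, 52, 63, 70])
equal_width = pd.cut(ages, bins=3)    # 3 intervals of equal width
equal_freq = pd.qcut(ages, q=3)       # 3 intervals with roughly equal numbers of values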
Data discretization using correlation analysis
Discretizing data by linear regression technique, you can get the best neighboring interval, and then the large intervals
are combined to develop a larger overlap to form the final 20 overlapping intervals. It is a supervised procedure.
Cluster Analysis
Cluster analysis is a form of data discretization. A clustering algorithm is executed by dividing the values of x numbers
into clusters to isolate a computational feature of x.
Data discretization using decision tree analysis: a top-down slicing technique is used, and it is a supervised procedure. To discretize a numeric attribute, first select the attribute that has the least entropy, and then run the process recursively.
Concept Hierarchy Generation
The term hierarchy represents an organizational structure or
mapping in which items are ranked according to their levels of
importance.
There are two types of hierarchy mapping: top-down mapping and bottom-up mapping.
Top-down mapping
Top-down mapping generally starts at the top with some general information and ends at the bottom with the specialized information.
Bottom-up mapping
Bottom-up mapping starts at the bottom with the specialized information and ends at the top with the general information.
What is a concept?
A group of records that have been assigned a label.
Concept Hierarchy Generation Example
Case Study:
Suppose a user selects a set of location-oriented attributes —
street, country, province or state, and city — from the
AllElectronics database, but does not specify the hierarchical
ordering among the attributes.
First, sort the attributes in ascending order based on the number
of distinct values in each attribute. This results in the following
(where the number of distinct values per attribute is shown in
parentheses):
country (15),
province or state (365),
city (3567),
and street (674,339).
Second, generate the hierarchy from the top down according to
the sorted order, with the first attribute at the top level and the
last attribute at the bottom level. Finally, the user can examine
the generated hierarchy, and when necessary, modify it to reflect
desired semantic relationships among the attributes.
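The first step (sorting attributes by their number of distinct values) maps directly onto pandas; a sketch with a hypothetical location table:
import pandas as pd

locations = pd.DataFrame({'country': ['IN', 'IN', 'US'],
                          'state':   ['TN', 'KA', 'CA'],
                          'city':    ['Chennai', 'Bengaluru', 'San Jose']})
# fewer distinct values => higher level in the generated concept hierarchy
print(locations.nunique().sort_values())   # country (2) < state (3), city (3)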
Dimensionality Reduction – Feature Extraction
Feature Selection
Filter method
Wrapper Method
Embedded Method
https://colab.research.google.com/drive/1k0ol1z3-bZNseuNBYw1YF7AICgnl9dzg#scrollTo=4w0xj10jFYYw
Benefits of Feature Selection methods
Feature Extraction
Feature Extraction (contd…)
❖ Factor Analysis