Unit - II MLT
10211AM223 - Machine Learning Techniques
A. Course Objectives
•Course Outcomes
Upon the successful completion of the course, students will be able to:
CO No’s | Course Outcomes | K-Level
CO1 | Examine the basic concepts of data mining and machine learning. | K2
CO2 | Design and evaluate dimensionality reduction algorithms using real-world datasets. | K3
CO3 | Apply various algorithms of Classification and Association. | K3
CO4 | Demonstrate experiments to evaluate and compare different unsupervised learning algorithms. | K3
CO5 | Use the concept of neural networks for learning linear and non-linear activation functions. | K3
Knowledge Level (Based on revised Bloom’s Taxonomy)
K1-Remember  K2-Understand  K3-Apply  K4-Analyze  K5-Evaluate  K6-Create
Unit II Syllabus
Data Pre-processing- Needs Pre-processing the Data- Data Cleaning, Data Integration and Transformation,
Data Reduction, Discretization and Concept Hierarchy Generation- Dimensionality Reduction – Feature
Extraction- Variable Selection- Variable ranking- Linear Discriminant Analysis – Principal Component
Analysis – Factor Analysis –Cross Validation –Resampling methods
Introduction to Dimensionality Reduction
Dimensionality reduction is the process of reducing the number of inputs (features) used to represent the data.
The complexity of any classifier or regressor depends on the number of inputs: this determines both the time and space
complexity and the number of training examples needed to train such a classifier or regressor.
Feature selection and Feature extraction
➢ Feature selection involves selecting a subset of the original features that
are most relevant to the problem at hand.
➢ The goal is to reduce the dimensionality of the dataset while retaining the
most important features.
➢ There are several methods for feature selection, including filter methods,
wrapper methods, and embedded methods.
➢ Filter methods rank the features based on their relevance to the target variable, wrapper methods use model performance as the criterion for selecting features, and embedded methods combine feature selection with the model training process (see the sketch below).
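A minimal scikit-learn sketch of the three approaches on a synthetic dataset; the estimator and the choice of k = 5 features are illustrative assumptions, not part of the slide material:
# Hypothetical illustration of filter, wrapper, and embedded selection
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Filter method: rank features by a statistical score (ANOVA F-value) against the target
X_filter = SelectKBest(score_func=f_classif, k=5).fit_transform(X, y)

# Wrapper method: recursive feature elimination uses model performance to pick features
X_wrapper = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5).fit_transform(X, y)

# Embedded method: selection happens as part of model training (L1-penalised coefficients)
embedded = SelectFromModel(LogisticRegression(penalty='l1', solver='liblinear', C=1.0))
X_embedded = embedded.fit_transform(X, y)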
Feature Extraction
➢ Feature extraction involves transforming the original features into a new, smaller set of features.
➢ The goal is to create a set of features that captures the essence of the original data in a lower-dimensional space.
❖ This can be done to reduce the complexity of a model, improve the performance of a learning
algorithm, or make it easier to visualize the data.
❖ Each technique projects the data onto a lower-dimensional space while preserving important information (see the sketch after this list).
❖ It is important to note that dimensionality reduction can also discard useful information, so
care must be taken when applying these techniques.
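As a small illustration, PCA (discussed later in this unit) is one such projection technique; a sketch on random data for shape intuition only:
# Hypothetical sketch: project a 10-dimensional dataset onto 2 extracted components
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)            # 100 samples, 10 original features
pca = PCA(n_components=2)              # keep the 2 directions of largest variance
X_reduced = pca.fit_transform(X)       # new features are combinations of the originals
print(X_reduced.shape)                 # (100, 2)
print(pca.explained_variance_ratio_)   # how much variance each component preserves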
Data Pre-processing
▪ Data preprocessing is the process of preparing raw data and making it suitable for a machine learning model. It is the first and crucial step while creating a machine learning model.
▪ When creating a machine learning project, it is not always the case that we come across clean and formatted data.
Need for Data Pre-processing
Preprocessing data is an important step for data analysis. Two of its main activities are briefly explained below:
•Data validation: This is the process where businesses analyze and assess the raw data for a project to determine whether it is complete and accurate enough to achieve the best results.
•Data imputation: Data imputation is where you fill in missing values and rectify data errors found during validation, either manually or through programming, such as business process automation.
Steps involved in preprocessing
Steps involved:
• Getting the dataset
• Importing libraries
• Importing datasets
• Handling missing data
• Encoding categorical data
• Splitting the dataset into training and test sets
• Feature scaling
1.Get the Dataset
➢ To create a machine learning model, the first thing we require is a dataset, as a machine learning model works entirely on data.
➢ The collected data for a particular problem, arranged in a proper format, is known as the dataset.
➢ Datasets come in different formats for different purposes; for example, the dataset needed for a business model will differ from the dataset required for a liver-patient model.
➢ So each dataset is different from another dataset. To use the dataset in our code, we usually put it into a CSV file.
➢ CSV stands for "Comma-Separated Values"; it is a file format which allows us to save tabular data, such as spreadsheets. It is useful for huge datasets, and these datasets can be used directly in programs.
2. Importing Libraries
Numpy: The Numpy Python library is used for including any type of mathematical operation in the code.
It is the fundamental package for scientific calculation in Python. It also supports large, multidimensional arrays and matrices. In Python, we can import it as:
import numpy as nm
2. Importing Libraries (…contd)
Matplotlib: The second library is matplotlib, which is a Python 2D plotting library, and with this library we need to import its sub-library pyplot.
This library is used to plot any type of chart in Python. It will be imported as below.
Pandas: The last library is the Pandas library, which is one of the most famous Python libraries and is used for importing and managing datasets. It is an open-source data manipulation and analysis library. It will be imported as below.
Here, we have used pd as a short name for this library. Consider the snippet below:
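The import statements described above would look like this; plt is a common alias choice for pyplot, and pd is the short name mentioned here:
import matplotlib.pyplot as plt   # pyplot sub-library used for plotting charts
import pandas as pd               # pandas, imported with the short name pd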
3) Importing the Datasets
read_csv() function:
Now to import the dataset, we will use the read_csv() function of the pandas library, which is used to read a CSV file and perform various operations on it. Using this function, we can read a CSV file locally as well as through a URL.
We can use the read_csv function as below:
data_set= pd.read_csv('Dataset.csv')
To extract the independent variables, we will use the iloc[ ] method of the Pandas library. It is used to extract the required rows and columns from the dataset.
x= data_set.iloc[:,:-1].values   # all rows, every column except the last
3) Importing the Datasets (contd…)
y= data_set.iloc[:,3].values   # all rows, the fourth (last) column only
Here we have taken all the rows with the last column only. It will give the array of the dependent variable.
4) Handling Missing data
The next step of data preprocessing is to handle missing data in the datasets.
If our dataset contains some missing data, then it may create a huge problem for our
machine learning model.
There are mainly two ways to handle missing data, which are:
By deleting the particular row: The first way is commonly used to deal with null values. Here we simply delete the specific row or column which contains null values. However, this approach is not very efficient, and removing data may lead to a loss of information, which will not give accurate output.
By calculating the mean: Here we calculate the mean of the column or row that contains the missing value and put it in place of the missing value. This strategy is useful for features with numeric data such as age, salary, year, etc.
Sample Code to handle missing values
To handle missing values, we will use the Scikit-learn library in our code, which contains various modules for building machine learning models. Here we will use the Imputer class of the sklearn.preprocessing library. Below is the code for it:
#handling missing data (Replacing missing data with the mean value)
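Note that recent scikit-learn versions replace the older Imputer class with SimpleImputer in sklearn.impute; a minimal sketch with the current API, assuming the numeric columns (e.g. Age and Salary) sit at positions 1 and 2 of x:
import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')   # replace NaN with the column mean
imputer = imputer.fit(x[:, 1:3])           # learn the mean of the numeric columns
x[:, 1:3] = imputer.transform(x[:, 1:3])   # fill in the missing entries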
5) Encoding Categorical Data
Categorical data is data which has some categories; in our dataset there are two categorical variables, Country and Purchased.
Firstly, we will encode the Country variable into numeric codes. To do this, we will use the LabelEncoder() class from the preprocessing library.
#Categorical data
from sklearn.preprocessing import LabelEncoder
label_encoder_x= LabelEncoder()
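Applying the encoder is then a one-liner; this sketch assumes the Country variable is the first column (index 0) of x:
x[:, 0]= label_encoder_x.fit_transform(x[:, 0])   # country names become integer codes 0, 1, 2, ...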
6) Splitting the Dataset into the Training set and Test set
➢ In machine learning data preprocessing, we divide our dataset into a training set and test set.
➢ This is one of the crucial steps of data preprocessing as by doing this, we can enhance the performance of our
machine learning model.
Training Set: A subset of the dataset used to train the machine learning model; we already know the output for it.
Test set: A subset of the dataset used to test the machine learning model; using the test set, the model predicts the output.
For splitting the dataset, we will use the below lines of code:
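A sketch of the two lines described here; the 80/20 split and the random_state value are illustrative choices:
from sklearn.model_selection import train_test_split   # splits arrays into random train and test subsets
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size=0.2, random_state=0)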
In the above code, the first line imports train_test_split, which is used for splitting arrays of the dataset into random train and test subsets.
In the second line, we have used four variables for the output: x_train, x_test, y_train, and y_test.
In the train_test_split() function, we have passed four parameters: the first two are the arrays of data, test_size specifies the size of the test set, and random_state fixes the shuffling so that the split is reproducible. The test_size may be 0.5, 0.3, or 0.2, which gives the dividing ratio between the training and testing sets.
7) Feature Scaling
➢ In feature scaling, we put our variables in the same range and on the same scale so that no variable dominates the others.
Feature Scaling Methods
Standardization
Normalization
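As a quick reference (standard definitions, where μ and σ are a feature's mean and standard deviation, and x_min, x_max its minimum and maximum):
Standardization (Z-score): x' = (x − μ) / σ
Normalization (Min–Max): x' = (x − x_min) / (x_max − x_min), which rescales the feature into the range [0, 1]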
Feature Scaling Methods (contd…)
Here, we will use the standardization method for our dataset.
For feature scaling, we will import the StandardScaler class of the sklearn.preprocessing library as:
from sklearn.preprocessing import StandardScaler
Now, we will create an object of the StandardScaler class for the independent variables (features), and then we will fit and transform the training dataset.
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
For the test dataset, we will directly apply the transform() function instead of fit_transform(), because the scaler has already been fitted on the training set.
x_test= st_x.transform(x_test)
Data Cleaning
➢ Data cleaning is a crucial step in the machine learning (ML) pipeline, as it involves
identifying and removing any missing, duplicate, or irrelevant data.
➢ The goal of data cleaning is to ensure that the data is accurate, consistent, and free of
errors, as incorrect or inconsistent data can negatively impact the performance of the
ML model.
Data Cleaning Steps
The most common steps involved in data cleaning:
1. Data inspection and exploration
2. Removal of unwanted observations
3. Handling missing data
4. Handling outliers
5. Data transformation
Data Cleaning Steps (contd…)
1. Data inspection and exploration:
This step involves understanding the data by inspecting its structure and identifying missing values,
outliers, and inconsistencies.
import pandas as pd
import numpy as np
a1 = pd.Series([1, 2, 3])
a1.describe()
Output:
count 3.0
mean 2.0
std 1.0
min 1.0
25% 1.5
50% 2.0
75% 2.5
max 3.0
dtype: float64
Data Cleaning Steps (contd…)
2.Removal of unwanted observations
This includes deleting duplicate/ redundant or irrelevant values from your dataset.
Duplicate observations most frequently arise during data collection, and irrelevant observations are those that don’t actually fit the specific problem you’re trying to solve.
• Redundant observations greatly reduce efficiency, since the repeated data can bias the result towards either the correct or the incorrect side, producing unfaithful results.
• Irrelevant observations are any type of data that is of no use to us and can be removed
directly.
Data Cleaning Steps (contd…)
2.Removal of unwanted observations
df.duplicated()
subset: Takes a column or a list of column labels; its default value is None. After passing columns, only those columns are considered for identifying duplicates.
keep: Controls how duplicate values are treated. It has only three distinct values, and the default is ‘first’.
–> If ‘first’, it considers the first value as unique and the rest of the same values as duplicates.
–> If ‘last’, it considers the last value as unique and the rest of the same values as duplicates.
–> If False, it considers all of the same values as duplicates.
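A short sketch of both calls on a hypothetical DataFrame df:
import pandas as pd

df = pd.DataFrame({'name': ['A', 'B', 'A'], 'city': ['X', 'Y', 'X']})
print(df.duplicated(subset=['name'], keep='first'))            # marks the second 'A' row as a duplicate
df_clean = df.drop_duplicates(subset=['name'], keep='first')   # keeps only the first occurrence of each name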
Data Cleaning Steps (contd…)
3. Handling missing data:
Missing data is a common issue in real-world datasets, and it can occur due to various reasons such as
human errors, system failures, or data collection issues.
Various techniques can be used to handle missing data, such as imputation, deletion, or substitution.
❖ Deleting rows with missing values
❖ Imputing missing values for continuous variables
❖ Imputing missing values for categorical variables
❖ Other imputation methods
❖ Using algorithms that support missing values
In Pandas, missing data is represented by two values:
•None: None is a Python singleton object that is often used for missing data in Python code.
•NaN: NaN (an acronym for Not a Number) is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation.
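A compact pandas sketch of the first three options, using a hypothetical DataFrame df with a numeric 'age' column and a categorical 'city' column:
import pandas as pd
import numpy as np

df = pd.DataFrame({'age': [25, np.nan, 40], 'city': ['X', None, 'Y']})

df_drop = df.dropna()                                 # delete rows with any missing value
df['age'] = df['age'].fillna(df['age'].mean())        # impute a continuous variable with the mean
df['city'] = df['city'].fillna(df['city'].mode()[0])  # impute a categorical variable with the mode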
4. Handling outliers:
Outliers in the data can be detected using certain statistical plots; the most common plots are the Box Plot and the Scatter Plot.
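One common rule of thumb, the same one a box plot visualises, flags values outside 1.5×IQR from the quartiles; a sketch on a small numeric pandas Series s:
import pandas as pd

s = pd.Series([10, 12, 11, 13, 95])         # 95 is an obvious outlier
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1                               # interquartile range
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers)                             # -> 95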
5. Data transformation:
Data transformation involves converting the data from one form to another to make it more suitable for analysis. Techniques such as normalization, scaling, or encoding can be used to transform the data.
Scaling
• Scaling involves transforming the values of features to a specific range. It maintains the shape of
the original distribution while changing the scale.
• Scaling is particularly useful when features have different scales, and certain algorithms are
sensitive to the magnitude of the features.
• Common scaling methods include Min-Max scaling and Standardization (Z-score scaling).
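A minimal scikit-learn sketch of both scaling options; the tiny two-column array is only for illustration:
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_minmax = MinMaxScaler().fit_transform(X)   # each column rescaled into [0, 1]
X_std = StandardScaler().fit_transform(X)    # each column centred to mean 0, unit variance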
Advantages of Data Cleaning in Machine Learning:
❖ Improved model performance: Data cleaning helps improve the performance of the ML model by removing errors,
inconsistencies, and irrelevant data, which can help the model to better learn from the data.
❖ Increased accuracy: Data cleaning helps ensure that the data is accurate, consistent, and free of errors, which can
help improve the accuracy of the ML model.
❖ Better representation of the data: Data cleaning allows the data to be transformed into a format that better
represents the underlying relationships and patterns in the data, making it easier for the ML model to learn from the
data.
❖ Improved data quality: Data cleaning helps to improve the quality of the data, making it more reliable and accurate.
This ensures that the machine learning models are trained on high-quality data, which can lead to better predictions
and outcomes.
❖ Improved data security: Data cleaning can help to identify and remove sensitive or confidential information that
could compromise data security. By eliminating this information, data cleaning can help to ensure that only the
necessary and relevant data is used for machine learning.
Disadvantages of Data Cleaning in Machine Learning:
❖ Time-consuming: Data cleaning can be a time-consuming task, especially for large and complex datasets.
❖ Error-prone: Data cleaning can be error-prone, as it involves transforming and cleaning the data, which can result in
the loss of important information or the introduction of new errors.
❖ Limited understanding of the data: Data cleaning can lead to a limited understanding of the data, as the transformed
data may not be representative of the underlying relationships and patterns in the data.
❖ Data loss: Data cleaning can result in the loss of important information that may be valuable for machine learning
analysis. In some cases, data cleaning may result in the removal of data that appears to be irrelevant or inconsistent, but
which may contain valuable insights or patterns.
❖ Cost and resource-intensive: Data cleaning can be a resource-intensive process that requires significant time, effort,
and expertise. It can also require the use of specialized software tools, which can add to the cost and complexity of data
cleaning.
❖ Overfitting: Data cleaning can inadvertently contribute to overfitting by removing too much data, leading to a loss of
information that could be important for model training and performance.
Data Integration and Transformation
Data integration is one of the steps of data pre-processing that involves combining data
residing in different sources and providing users with a unified view of these data.
Data Integration and Transformation (contd…)
There are mainly 2 major approaches for data integration - commonly known as "tight
coupling approach" and "loose coupling approach".
Tight Coupling
❑Here data is pulled over from different sources into a single physical location through the
process of ETL - Extraction, Transformation and Loading.
❑The single physical location provides a uniform interface for querying the data.
❑ETL layer helps to map the data from the sources so as to provide a uniform data
warehouse.
❑This approach is called tight coupling since in this approach the data is tightly coupled
with the physical repository at the time of query.
Data Integration and Transformation (contd…)
Loose Coupling
• Here a virtual mediated schema provides an interface that takes the query from the
user, transforms it in a way the source database can understand and then sends the
query directly to the source databases to obtain the result.
• In this approach, the data only remains in the actual source databases.
Data Integration and Transformation (contd…)
Loose Coupling
Advantages:
❖ Higher Agility (when a new source system comes or existing source system changes - only the
corresponding adapter is created or changed - largely not affecting the other parts of the system)
❖ Less costly (a lot of infrastructure cost can be saved since data localization is not required)
Disadvantages:
❖ Semantic conflicts
Issues in Data Integration
There are several issues that can arise when integrating data from multiple sources, including:
1.Data Quality: Inconsistencies and errors in the data can make it difficult to combine and analyze.
2.Data Semantics: Different sources may use different terms or definitions for the same data, making it
difficult to combine and understand the data.
3.Data Heterogeneity: Different sources may use different data formats, structures, or schemas, making it
difficult to combine and analyze the data.
4.Data Privacy and Security: Protecting sensitive information and maintaining security can be difficult when
integrating data from multiple sources.
5.Scalability: Integrating large amounts of data from multiple sources can be computationally expensive and
time-consuming.
6.Data Governance: Managing and maintaining the integration of data from multiple sources can be difficult,
especially when it comes to ensuring data accuracy, consistency, and timeliness.
7.Performance: Integrating data from multiple sources can also affect the performance of the system.
8.Integration with existing systems: Integrating new data sources with existing systems can be a complex
task, requiring significant effort and resources.
9.Complexity: The complexity of integrating data from multiple sources can be high, requiring specialized
skills and knowledge.
Data Transformation
In data mining pre-processing, and especially in metadata and data warehousing, we use data transformation to convert data from a source data format into the destination data format.
The goal of the data transformation process is to extract data from a source, convert it into a usable
format, and deliver it to a destination.
During the extraction phase, data is identified and pulled from many different locations or sources into a
single repository.
45
Steps in Data Transformation
Data discovery. The first step in the data transformation process consists of identifying and
understanding the data in its source format. This is usually accomplished with the help of a data
profiling tool. This step helps you decide what needs to happen to the data in order to get it into
the desired format.
Data mapping. During this phase, the actual transformation process is planned.
Generating code. In order for the transformation process to be completed, code must be created to run the transformation job. This code is often generated with the help of a data transformation tool or platform.
Executing the code. The data transformation process that has been planned and coded is now put
into motion, and the data is converted to the desired output.
Review. Transformed data is checked to make sure it has been formatted correctly.
Steps in Data Transformation (contd…)
In addition to these basic steps, other customized operations may occur. For example,
➢ Enriching (e.g. splitting Full Name into First Name, Middle Name, Last Name).
Types of data transformation
There are many different types of data transformation, depending on what kind of data you have and what you want to do
with it. Some common types include:
• Data cleaning
• Feature extraction
• Feature creation
• Data normalization
• Data aggregation/disaggregation
• Sampling
Types of data transformation (contd…)
Data cleaning
Data cleaning is the process of removing incorrect or incomplete information from your dataset,
adding or fixing missing values, dealing with outliers, and so on. It’s an important step in any data
transformation process, and it’s often the most time-consuming.
Feature extraction
Feature extraction is the process of reducing a large amount of information down to a smaller set of
more useful variables. It’s a common data transformation technique, and it’s often used when
working with images or videos.
Feature extraction is a type of data reduction, and it’s a common technique in machine learning.
Types of data transformation (contd…)
Feature creation
Feature creation is the process of adding extra information to your dataset where none existed
previously. It’s a useful data transformation technique when you want to make use of data that’s not
in a standard format.
Data normalization
Data normalization is the process of making sure all values in your dataset are on the same scale.
It’s a common data transformation technique, and it’s often used when working with numerical
data.
Data aggregation/disaggregation
Data aggregation is the process of combining multiple datasets into one. It’s a common data
transformation technique, and it’s often used when working with data from different sources.
Data disaggregation is the opposite of data aggregation. It’s the process of splitting one large dataset
into several smaller ones.
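A small pandas sketch of aggregation, using a hypothetical sales table:
import pandas as pd

sales = pd.DataFrame({'region': ['N', 'N', 'S'], 'amount': [100, 150, 80]})
by_region = sales.groupby('region')['amount'].sum()   # aggregate: one total per region
print(by_region)                                       # N -> 250, S -> 80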
Types of data transformation (contd…)
Sampling
Sampling is the process of using only part of your dataset rather than all of it. It’s a common data
transformation technique, and it’s often used when working with very large datasets which cannot
be stored fully on the computer being used.
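In pandas, this can be done directly with sample(); the 10% fraction and random_state below are illustrative choices:
import pandas as pd

df = pd.DataFrame({'value': range(1000)})
sample_df = df.sample(frac=0.1, random_state=42)   # keep a random 10% of the rows
print(len(sample_df))                              # -> 100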
Data Reduction
Data reduction is a technique used to optimize the capacity required to store data. Data reduction can increase
storage performance and reduce storage costs.
• Deduplication: Removing duplicate data. This can range from simply removing duplicated records to deleting records that, while not strictly identical, represent the same information or event.
• Compression: Compression processes apply algorithms to transform information so that it takes up less storage space. Compression algorithms can be (and often are) applied to data as it is moved into storage, but some can also be applied to data-at-rest to improve space gains even more.
• Thin Provisioning: Thin provisioning is an approach to storage where space is partitioned and used as needed rather than pre-allocating storage to users or processes. While more computationally intensive, this approach can significantly reduce inefficiencies like disk fragmentation.
Techniques for Data Reduction
What Techniques Are Used for Data Reduction?
▪ The data is processed and reduced to a smaller but equally useful form of information that still represents trends in the relevant information for analytics.
• Clustering: This practice uses data attributes to build a set of clusters into which the data is split. Similarities and dissimilarities between the data objects result in different placements and distances between the clusters and the objects as a whole.
Discretization and Concept Hierarchy Generation
Data discretization refers to a method of converting a huge number of data values into smaller
ones so that the evaluation and management of data become easy.
There are two forms of data discretization: supervised discretization and unsupervised discretization.
Unsupervised discretization is classified by the direction in which the operation proceeds: it can follow a top-down splitting strategy or a bottom-up merging strategy.
Famous techniques of data discretization
Histogram analysis
Histogram refers to a plot used to represent the underlying frequency distribution of a continuous data set.
A histogram assists in inspecting the data distribution, for example for outliers, skewness, or how close the data is to a normal distribution.
Histograms are graphs that display the distribution of your continuous data.
Famous techniques of data discretization(contd..)
Binning
Binning refers to a data smoothing technique that helps to group a huge number of continuous values into a smaller number of bins. This technique can also be used for data discretization and the development of concept hierarchies (see the sketch below).
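A short pandas sketch of binning a continuous attribute; the choice of 3 bins is illustrative:
import pandas as pd

ages = pd.Series([22, 25, 31, 38, 45, 52, 63, 70])
equal_width = pd.cut(ages, bins=3)    # 3 intervals of equal width
equal_freq = pd.qcut(ages, q=3)       # 3 intervals with roughly equal numbers of values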
Data discretization using correlation analysis
Discretizing data by linear regression technique, you can get the best neighboring interval, and then the large intervals
are combined to develop a larger overlap to form the final 20 overlapping intervals. It is a supervised procedure.
Cluster Analysis
Cluster analysis is a form of data discretization. A clustering algorithm is executed by dividing the values of x numbers
into clusters to isolate a computational feature of x.
Data discretization using decision tree analysis: a top-down slicing technique is used, and it is a supervised procedure. To discretize a numeric attribute, first select the attribute that has the least entropy, and then run the process recursively.
Concept Hierarchy Generation
The term hierarchy represents an organizational structure or
mapping in which items are ranked according to their levels of
importance.
There are two types of hierarchy mapping: top-down mapping and bottom-up mapping.
Top-down mapping
Top-down mapping generally starts at the top with some general information and ends at the bottom with the specialized information.
Bottom-up mapping
Bottom-up mapping starts at the bottom with the specialized information and ends at the top with the general information.
What is a concept?
A group of records that have been assigned a label.
Concept Hierarchy Generation Example
Case Study:
Suppose a user selects a set of location-oriented attributes —
street, country, province or state, and city — from the
AllElectronics database, but does not specify the hierarchical
ordering among the attributes.
First, sort the attributes in ascending order based on the number
of distinct values in each attribute. This results in the following
(where the number of distinct values per attribute is shown in
parentheses):
country (15),
province or state (365),
city (3567),
and street (674,339).
Second, generate the hierarchy from the top down according to
the sorted order, with the first attribute at the top level and the
last attribute at the bottom level. Finally, the user can examine
the generated hierarchy, and when necessary, modify it to reflect
desired semantic relationships among the attributes.
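The first step (sorting attributes by their number of distinct values) maps directly onto pandas; a sketch with a hypothetical location table:
import pandas as pd

locations = pd.DataFrame({'country': ['IN', 'IN', 'US'],
                          'state':   ['TN', 'KA', 'CA'],
                          'city':    ['Chennai', 'Bengaluru', 'San Jose']})
# fewer distinct values => higher level in the generated concept hierarchy
print(locations.nunique().sort_values())   # country (2) < state (3), city (3)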
Dimensionality Reduction – Feature Extraction
Feature Selection
Filter method
Wrapper Method
Embedded Method
https://colab.research.google.com/drive/1k0ol1z3-bZNseuNBYw1YF7AICgnl9dzg#scrollTo=4w0xj10jFYYw
Benefits of Feature Selection methods
Feature Extraction
Feature Extraction (contd…)
❖ Factor Analysis