
Unit - II

10211AM223 - Machine Learning Techniques


A. Course Objectives

Students are exposed to


• Apply the concepts of supervised and unsupervised learning algorithms to real-time applications
• Execute decision tree algorithms and probabilistic models to overcome the problem of overfitting
• Analyse and suggest appropriate machine learning approaches for various types of problems
• Demonstrate the aspects of computational biology


B. Course Outcomes
Upon the successful completion of the course, students will be able to:

CO1 (K2): Examine the basic concepts of data mining and machine learning.
CO2 (K3): Design and evaluate dimensionality reduction algorithms using real-world datasets.
CO3 (K3): Apply various algorithms of Classification and Association.
CO4 (K3): Demonstrate experiments to evaluate and compare different unsupervised learning algorithms.
CO5 (K3): Use the concept of neural networks for learning linear and non-linear activation functions.

Knowledge Level (based on revised Bloom's Taxonomy):
K1 - Remember, K2 - Understand, K3 - Apply, K4 - Analyze, K5 - Evaluate, K6 - Create
Unit II Syllabus

Unit 2: Dimensionality Reduction (L: 9 Hours)

Data Pre-processing: Need for Pre-processing the Data, Data Cleaning, Data Integration and Transformation,
Data Reduction, Discretization and Concept Hierarchy Generation. Dimensionality Reduction: Feature
Extraction, Variable Selection, Variable Ranking, Linear Discriminant Analysis, Principal Component
Analysis, Factor Analysis, Cross Validation, Resampling Methods.
Introduction to Dimensionality Reduction

Dimensionality reduction

The complexity of any classifier or regressor depends on the number of inputs. This determines both the time and space
complexity and the necessary number of training examples to train such a classifier or regressor.

Introduction to Dimensionality Reduction

❖ Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally close to its intrinsic dimension.

❖ Dimensionality reduction is a technique used to reduce the number of features in a dataset while retaining as much of the important information as possible.

❖ There are two main approaches to dimensionality reduction: feature selection and feature extraction.
Feature selection and Feature extraction
➢ Feature selection involves selecting a subset of the original features that
are most relevant to the problem at hand.

➢ The goal is to reduce the dimensionality of the dataset while retaining the
most important features.

➢ There are several methods for feature selection, including filter methods,
wrapper methods, and embedded methods.

➢ Filter methods rank the features based on their relevance to the target variable; wrapper methods use the model's performance as the criterion for selecting features; and embedded methods combine feature selection with the model training process. (A minimal filter-method sketch follows.)
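A minimal sketch of the filter approach, not taken from the slides: rank features with a univariate statistical test and keep the top k, using scikit-learn's SelectKBest. The Iris dataset and the ANOVA F-test are stand-in choices for illustration.

# Filter-method sketch: score each feature independently and keep the best k.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)                  # 150 samples, 4 features

selector = SelectKBest(score_func=f_classif, k=2)  # keep the 2 highest-scoring features
X_reduced = selector.fit_transform(X, y)

print(selector.scores_)        # ANOVA F-score of each original feature
print(selector.get_support())  # boolean mask of the selected features
print(X_reduced.shape)         # (150, 2)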
Feature Extraction

➢ Feature extraction involves creating new features by combining or transforming the original features.

➢ The goal is to create a set of features that captures the essence of the original data in a lower-dimensional space.

➢ There are several methods for feature extraction, including principal component analysis (PCA), linear discriminant analysis (LDA), and t-distributed stochastic neighbor embedding (t-SNE).

➢ PCA is a popular technique that projects the original features onto a lower-dimensional space while preserving as much of the variance as possible. (A minimal PCA sketch follows.)
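A minimal PCA sketch, illustrative rather than from the slides: standardize the features (PCA is scale-sensitive) and project them onto two principal components. The Iris dataset is again only a stand-in.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)

X_scaled = StandardScaler().fit_transform(X)   # zero mean, unit variance per feature
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print(X_pca.shape)                    # (150, 2)
print(pca.explained_variance_ratio_)  # share of variance retained by each component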
Important Points of DR
❖ Dimensionality reduction is the process of reducing the number of features in a dataset while
retaining as much information as possible.

❖ This can be done to reduce the complexity of a model, improve the performance of a learning
algorithm, or make it easier to visualize the data.

❖ Techniques for dimensionality reduction include principal component analysis (PCA), singular value decomposition (SVD), and linear discriminant analysis (LDA).

❖ Each technique projects the data onto a lower-dimensional space while preserving important information.

❖ Dimensionality reduction is performed during the pre-processing stage, before building a model, to improve performance.

❖ It is important to note that dimensionality reduction can also discard useful information, so care must be taken when applying these techniques.
Data Pre-processing
▪ Data preprocessing is the process of preparing raw data and making it suitable for a machine learning model. It is the first and crucial step when creating a machine learning model.

▪ When creating a machine learning project, it is not always the case that we come across clean and formatted data.
Need for Data Pre-processing
Preprocessing data is an important step for data analysis. The following are some benefits
of preprocessing data:

➢ It improves accuracy and reliability.

➢ It makes data consistent.

➢ It makes the data easier for algorithms to read and process.

The following are two key activities, with a brief explanation:

• Data validation: The process of analyzing and assessing the raw data for a project to determine whether it is complete and accurate enough to achieve the best results.

• Data imputation: Filling in missing values and rectifying data errors found during validation, either manually or through programming, such as business process automation.
Steps involved in preprocessing

Steps involved:

• Getting the dataset

• Importing libraries

• Importing datasets

• Finding Missing Data

• Encoding Categorical Data

• Splitting dataset into training and test set

• Feature scaling

1.Get the Dataset

➢ To create a machine learning model, the first thing we require is a dataset, since a machine learning model works entirely on data.
➢ The data collected for a particular problem, in a proper format, is known as the dataset.

➢ Datasets come in different formats for different purposes; for example, the dataset for a business problem will be different from the dataset required for, say, a liver-patient model.

➢ So each dataset is different from another dataset. To use the dataset in our code, we usually put it into a CSV file.
➢ CSV stands for "Comma-Separated Values"; it is a file format that allows us to save tabular data, such as spreadsheets. It works well even for huge datasets, which can then be used in programs.

➢ However, sometimes we may also need to use an HTML or XLSX file.
2. Importing Libraries
In order to perform data preprocessing using Python, we need to import some predefined Python libraries.

These libraries are used to perform specific jobs.

There are three specific libraries that we will use for data preprocessing:

NumPy: The NumPy library is used for including any kind of mathematical operation in the code. It is the fundamental package for scientific computing in Python, and it provides large, multi-dimensional arrays and matrices. In Python, we can import it as:

import numpy as np
2. Importing Libraries (…contd)

Matplotlib: The second library is matplotlib, a Python 2D plotting library, from which we need to import the sub-library pyplot. It is used to plot any kind of chart in Python. It can be imported as below:

import matplotlib.pyplot as plt

Pandas: The last library is Pandas, one of the most popular Python libraries, used for importing and managing datasets. It is an open-source data manipulation and analysis library. It can be imported as below:

import pandas as pd

Here, we have used pd as a short name for this library.
3) Importing the Datasets

read_csv() function:
To import the dataset, we will use the read_csv() function of the pandas library, which reads a CSV file and lets us perform various operations on it. Using this function, we can read a CSV file locally as well as through a URL.
We can use the read_csv() function as below:

data_set = pd.read_csv('Dataset.csv')

Extracting the independent variables:

To extract the independent variables, we will use the iloc[] method of the pandas library. It is used to extract the required rows and columns from the dataset.

x = data_set.iloc[:, :-1].values

Here we have taken all the rows and all the columns except the last one. This gives the array of independent variables (features).
3) Importing the Datasets (contd…)

Extracting the dependent variable:

To extract the dependent variable, we will again use the pandas .iloc[] method.

y = data_set.iloc[:, 3].values

Here we have taken all the rows of the last column only (column index 3 in this dataset). It gives the array of dependent-variable values.
4) Handling Missing data

The next step of data preprocessing is to handle missing data in the dataset.

If our dataset contains missing data, it may create a huge problem for our machine learning model.

There are mainly two ways to handle missing data:

By deleting the particular row: This is the most common way to deal with null values: we simply delete the specific row or column that contains null values. However, this approach is not very efficient, and removing data may lead to a loss of information, which will not give accurate output.

By calculating the mean: In this approach, we calculate the mean of the column (or row) that contains the missing value and put it in place of the missing value. This strategy is useful for features that hold numeric data, such as age, salary, or year.
Sample Code to handle missing values
To handle missing values, we will use the scikit-learn library, which contains various tools for building machine learning models. Here we will use the SimpleImputer class of the sklearn.impute module (the older Imputer class of sklearn.preprocessing has been removed in recent scikit-learn versions). Below is the code for it:

#handling missing data (replacing missing data with the mean value)

import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

#Fitting the imputer object to the independent variables x (columns 1 and 2)

imputer = imputer.fit(x[:, 1:3])

#Replacing missing data with the calculated mean value

x[:, 1:3] = imputer.transform(x[:, 1:3])
5) Encoding Categorical data

Categorical data is data that takes a limited set of categories; in our dataset, there are two categorical variables, Country and Purchased.

Example: For the Country variable:

First, we will convert the Country variable into numeric codes. To do this, we will use the LabelEncoder() class from the sklearn.preprocessing library.

#Categorical data

#for Country variable

from sklearn.preprocessing import LabelEncoder

label_encoder_x = LabelEncoder()

x[:, 0] = label_encoder_x.fit_transform(x[:, 0])
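LabelEncoder assigns arbitrary integer codes to the categories, which can suggest a false ordering to some models. A common follow-up, sketched here with pandas get_dummies as an assumption (it is not shown on the slide), is one-hot encoding:

# Hedged sketch of one-hot encoding a categorical column; the country values are illustrative only.
import pandas as pd

df = pd.DataFrame({'Country': ['France', 'Spain', 'Germany', 'Spain']})

dummies = pd.get_dummies(df['Country'], prefix='Country')
print(dummies)   # one binary column per category: Country_France, Country_Germany, Country_Spain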
6) Splitting the Dataset into the Training set and Test set
➢ In machine learning data preprocessing, we divide our dataset into a training set and test set.

➢ This is one of the crucial steps of data preprocessing as by doing this, we can enhance the performance of our
machine learning model.

Training Set: A subset of dataset to train the machine learning model, and we already know the output.

Test set: A subset of dataset to test the machine learning model, and by using the test set, model predicts the
output.

For splitting the dataset, we will use the below lines of code:

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)


6) Splitting the Dataset into the Training set and Test set(contd…)
Explanation:

In the above code, the first line imports train_test_split, which splits arrays of the dataset into random train and test subsets.

In the second line, we have used four variables to hold the output:

x_train: features for the training data

x_test: features for the testing data

y_train: dependent variable for the training data

y_test: dependent variable for the testing data

In the train_test_split() function, we have passed four parameters, of which the first two are the arrays of data, and test_size specifies the size of the test set. A test_size of 0.5, 0.3, or 0.2 gives the ratio of the test set to the whole dataset.
7) Feature Scaling

➢ Feature scaling is the final step of data preprocessing in machine learning.

➢ It is a technique to standardize the independent variables of the dataset in a specific


range.

➢ In feature scaling, we put our variables in the same range and on the same scale so that no single variable dominates the others.
Feature Scaling Example

(Sample data table with Age and Salary columns.)

As we can see, the Age and Salary column values are not on the same scale.
Feature Scaling Methods

Standardization

Normalization
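For reference, the standard definitions (not spelled out on the slide) are:

Standardization (Z-score scaling): x' = (x − μ) / σ, where μ is the feature's mean and σ its standard deviation.
Normalization (Min-Max scaling): x' = (x − x_min) / (x_max − x_min), which rescales the values to the range [0, 1].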

Feature Scaling Methods (contd…)
Here, we will use the standardization method for our dataset.

For feature scaling, we will import the StandardScaler class of the sklearn.preprocessing library as:

from sklearn.preprocessing import StandardScaler

Now, we will create an object of the StandardScaler class for the independent variables (features) and then fit and transform the training dataset:

st_x = StandardScaler()

x_train = st_x.fit_transform(x_train)

For the test dataset, we directly apply the transform() function instead of fit_transform(), because the scaler has already been fitted on the training set:

x_test = st_x.transform(x_test)
Data Cleaning

➢ Data cleaning is a crucial step in the machine learning (ML) pipeline, as it involves
identifying and removing any missing, duplicate, or irrelevant data.

➢ The goal of data cleaning is to ensure that the data is accurate, consistent, and free of
errors, as incorrect or inconsistent data can negatively impact the performance of the
ML model.

Data Cleaning Steps
The most common steps involved in data cleaning are:

1. Data inspection and exploration

2. Removal of unwanted observations

3. Handling missing data

4. Handling outliers

5. Data transformation

Data Cleaning Steps (contd…)
1. Data inspection and exploration:

This step involves understanding the data by inspecting its structure and identifying missing values,
outliers, and inconsistencies.

import pandas as pd
import numpy as np
a1 = pd.Series([1, 2, 3])
a1.describe()
Output:
count 3.0
mean 2.0
std 1.0
min 1.0
25% 1.5
50% 2.0
75% 2.5
max 3.0
dtype: float64

Data Cleaning Steps (contd…)
2.Removal of unwanted observations

This includes deleting duplicate/redundant or irrelevant values from your dataset. Duplicate observations most frequently arise during data collection, while irrelevant observations are those that do not actually fit the specific problem you are trying to solve.

• Redundant observations hurt efficiency because the repeated data can tip results towards the correct or the incorrect side, producing unreliable results.

• Irrelevant observations are any type of data that is of no use to us and can be removed directly.
Data Cleaning Steps (contd…)
2.Removal of unwanted observations

Example: Check for duplicate rows.

df.duplicated()

Signature: DataFrame.duplicated(subset=None, keep='first')

subset: Takes a column or list of column labels. Its default value is None. After passing columns, only those columns are considered for identifying duplicates.
keep: Controls how duplicate values are treated. It has only three distinct values, and the default is 'first'.
-> If 'first', it considers the first value as unique and the rest of the same values as duplicates.
-> If 'last', it considers the last value as unique and the rest of the same values as duplicates.
-> If False, it considers all of the same values as duplicates.
(A small sketch follows.)
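A small illustrative sketch (the column names and values are hypothetical) of flagging duplicate rows and then dropping them:

import pandas as pd

df = pd.DataFrame({
    'name': ['Asha', 'Ravi', 'Asha', 'Meena'],
    'city': ['Delhi', 'Pune', 'Delhi', 'Chennai'],
})

print(df.duplicated())                       # True for the second 'Asha, Delhi' row
df_clean = df.drop_duplicates(keep='first')  # remove the duplicate observation
print(df_clean)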
Data Cleaning Steps (contd…)
3. Handling missing data:

Missing data is a common issue in real-world datasets, and it can occur due to various reasons such as
human errors, system failures, or data collection issues.

Various techniques can be used to handle missing data, such as imputation, deletion, or substitution.

The common ways to handle missing values in the dataset (a minimal imputation sketch follows):

❖ Deleting rows with missing values
❖ Imputing missing values for continuous variables
❖ Imputing missing values for categorical variables
❖ Other imputation methods
❖ Using algorithms that support missing values
❖ Prediction of missing values

In Pandas, missing data is represented by two values:
• None: None is a Python singleton object that is often used for missing data in Python code.
• NaN: NaN (an acronym for Not a Number) is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation.
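A hedged sketch of simple imputation with pandas; the 'Age' and 'Country' columns are hypothetical examples of a continuous and a categorical variable.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'Age': [25, np.nan, 31, 40],
    'Country': ['India', 'Spain', None, 'Spain'],
})

# Continuous variable: impute with the column mean
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Categorical variable: impute with the most frequent value (mode)
df['Country'] = df['Country'].fillna(df['Country'].mode()[0])

print(df)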
Data Cleaning Steps (contd…)
4. Handling outliers:
• Outliers are extreme values that deviate significantly from the majority of the data.
• They can negatively impact the analysis and model performance.
• Techniques such as clustering, interpolation, or transformation can be used to handle outliers.
Here are a few common causes of outliers in a data set:
➢ Data entry errors: These are caused by human errors during data collection, recording, or entry.
➢ Measurement errors or instrument errors: This is the most common cause of outliers; such errors occur when the measuring instrument becomes faulty.
➢ Sampling errors: Consider an example where we have to measure the weight of athletes, but by mistake we also include some wrestlers in the sample; this inclusion is very likely to cause outliers in the dataset.
➢ Data processing errors: When data is extracted from multiple sources during data mining, manipulation or extraction errors can introduce outliers into the dataset.
➢ Natural novelties in data: Outliers that are not caused by any error are called natural outliers.
How to detect outliers?

The outliers of the data can be detected using certain statistical plots, the most common
plots are Box Plot and Scatter Plot.

(Figures: scatter plot and histogram.)
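A hedged sketch of the interquartile-range (IQR) rule that underlies a box plot; the 'salary' values are illustrative only.

import pandas as pd

df = pd.DataFrame({'salary': [42000, 45000, 47000, 50000, 52000, 250000]})

q1 = df['salary'].quantile(0.25)
q3 = df['salary'].quantile(0.75)
iqr = q3 - q1

lower = q1 - 1.5 * iqr   # anything below this bound is flagged
upper = q3 + 1.5 * iqr   # anything above this bound is flagged

outliers = df[(df['salary'] < lower) | (df['salary'] > upper)]
print(outliers)          # flags the 250000 entry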
Data Transformation
Data transformation

Data transformation involves converting the data from one form to another to make it more suitable for
analysis. Techniques such as normalization, scaling, or encoding can be used to transform the data.

Scaling

• Scaling involves transforming the values of features to a specific range. It maintains the shape of
the original distribution while changing the scale.

• Scaling is particularly useful when features have different scales, and certain algorithms are
sensitive to the magnitude of the features.

• Common scaling methods include Min-Max scaling and Standardization (Z-score scaling).

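A small comparative sketch, not from the slides, of the two scaling methods mentioned above applied to a toy Age/Salary array:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[25, 40000],
              [32, 52000],
              [47, 90000]], dtype=float)

print(MinMaxScaler().fit_transform(X))    # each column rescaled to [0, 1]
print(StandardScaler().fit_transform(X))  # each column to zero mean, unit variance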
Advantages of Data Cleaning in Machine Learning:
❖ Improved model performance: Data cleaning helps improve the performance of the ML model by removing errors,
inconsistencies, and irrelevant data, which can help the model to better learn from the data.

❖ Increased accuracy: Data cleaning helps ensure that the data is accurate, consistent, and free of errors, which can
help improve the accuracy of the ML model.

❖ Better representation of the data: Data cleaning allows the data to be transformed into a format that better
represents the underlying relationships and patterns in the data, making it easier for the ML model to learn from the
data.

❖ Improved data quality: Data cleaning helps to improve the quality of the data, making it more reliable and accurate.
This ensures that the machine learning models are trained on high-quality data, which can lead to better predictions
and outcomes.

❖ Improved data security: Data cleaning can help to identify and remove sensitive or confidential information that
could compromise data security. By eliminating this information, data cleaning can help to ensure that only the
necessary and relevant data is used for machine learning.
Disadvantages of Data Cleaning in Machine Learning:
❖ Time-consuming: Data cleaning can be a time-consuming task, especially for large and complex datasets.

❖ Error-prone: Data cleaning can be error-prone, as it involves transforming and cleaning the data, which can result in
the loss of important information or the introduction of new errors.

❖ Limited understanding of the data: Data cleaning can lead to a limited understanding of the data, as the transformed
data may not be representative of the underlying relationships and patterns in the data.

❖ Data loss: Data cleaning can result in the loss of important information that may be valuable for machine learning
analysis. In some cases, data cleaning may result in the removal of data that appears to be irrelevant or inconsistent, but
which may contain valuable insights or patterns.

❖ Cost and resource-intensive: Data cleaning can be a resource-intensive process that requires significant time, effort,
and expertise. It can also require the use of specialized software tools, which can add to the cost and complexity of data
cleaning.

❖ Overfitting: Data cleaning can inadvertently contribute to overfitting by removing too much data, leading to a loss of
information that could be important for model training and performance.
Data Integration and Transformation

Data integration is one of the steps of data pre-processing that involves combining data
residing in different sources and providing users with a unified view of these data.

• It merges the data from multiple data stores (data sources)

• It includes multiple databases, data cubes or flat files.

Data Integration and Transformation (contd…)
There are mainly 2 major approaches for data integration - commonly known as "tight
coupling approach" and "loose coupling approach".
Tight Coupling

❑Here data is pulled over from different sources into a single physical location through the
process of ETL - Extraction, Transformation and Loading.

❑The single physical location provides a uniform interface for querying the data.

❑ETL layer helps to map the data from the sources so as to provide a uniform data
warehouse.

❑This approach is called tight coupling since in this approach the data is tightly coupled
with the physical repository at the time of query.
Data Integration and Transformation (contd…)
Advantages:

1. Independence (less dependency on source systems, since data is physically copied over)

2. Faster query processing

3. Complex query processing

4. Advanced data summarization and storage possible

5. High-volume data processing

Disadvantages:

1. Latency (since data needs to be loaded using ETL)

2. Costlier (data localization, infrastructure, security)
Data Integration and Transformation (contd…)
Loose Coupling

• Here a virtual mediated schema provides an interface that takes the query from the
user, transforms it in a way the source database can understand and then sends the
query directly to the source databases to obtain the result.

• In this approach, the data only remains in the actual source databases.

Data Integration and Transformation (contd…)
Loose Coupling
Advantages:

❖ Data Freshness (low latency - almost real time)

❖ Higher Agility (when a new source system comes or existing source system changes - only the
corresponding adapter is created or changed - largely not affecting the other parts of the system)

❖ Less costly (a lot of infrastructure cost can be saved, since data localization is not required)

Disadvantages:

❖ Semantic conflicts

❖ Slower query response

❖ Heavy dependency on the data sources
Issues in Data Integration
There are several issues that can arise when integrating data from multiple sources, including:
1. Data Quality: Inconsistencies and errors in the data can make it difficult to combine and analyze.
2. Data Semantics: Different sources may use different terms or definitions for the same data, making it difficult to combine and understand the data.
3. Data Heterogeneity: Different sources may use different data formats, structures, or schemas, making it difficult to combine and analyze the data.
4. Data Privacy and Security: Protecting sensitive information and maintaining security can be difficult when integrating data from multiple sources.
5. Scalability: Integrating large amounts of data from multiple sources can be computationally expensive and time-consuming.
6. Data Governance: Managing and maintaining the integration of data from multiple sources can be difficult, especially when it comes to ensuring data accuracy, consistency, and timeliness.
7. Performance: Integrating data from multiple sources can also affect the performance of the system.
8. Integration with existing systems: Integrating new data sources with existing systems can be a complex task, requiring significant effort and resources.
9. Complexity: The complexity of integrating data from multiple sources can be high, requiring specialized skills and knowledge.
Data Transformation

In data mining pre-processing, and especially in metadata and data warehousing, we use data transformation to convert data from a source data format into the destination format.

How Data Transformation Works?

The goal of the data transformation process is to extract data from a source, convert it into a usable format, and deliver it to a destination.

This entire process is known as ETL (Extract, Transform, Load).

During the extraction phase, data is identified and pulled from many different locations or sources into a single repository.
Steps in Data Transformation
Data discovery. The first step in the data transformation process consists of identifying and understanding the data in its source format. This is usually accomplished with the help of a data profiling tool. This step helps you decide what needs to happen to the data in order to get it into the desired format.

Data mapping. During this phase, the actual transformation process is planned.

Generating code. For the transformation process to be completed, code must be created to run the transformation job. This code is often generated with the help of a data transformation tool or platform.

Executing the code. The data transformation process that has been planned and coded is now put into motion, and the data is converted to the desired output.

Review. Transformed data is checked to make sure it has been formatted correctly.
Steps in Data Transformation (contd…)

In addition to these basic steps, other customized operations may occur. For example,

➢ Filtering (e.g. Selecting only certain columns to load).

➢ Enriching (e.g. Full Name to First Name, Middle Name, Last Name).

➢ Splitting a column into multiple columns and vice versa.

➢ Joining together data from multiple sources.

➢ Removing duplicate data.

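A hedged sketch of a few of the operations listed above (filtering, enriching/splitting a name column, joining, removing duplicates); all table and column names here are hypothetical.

import pandas as pd

customers = pd.DataFrame({
    'id': [1, 2, 2],
    'full_name': ['Asha Rao', 'Ravi Kumar', 'Ravi Kumar'],
})
orders = pd.DataFrame({'id': [1, 2], 'amount': [250, 400]})

customers = customers.drop_duplicates()                      # removing duplicate data
customers[['first_name', 'last_name']] = (
    customers['full_name'].str.split(' ', n=1, expand=True)  # enriching / splitting a column
)
merged = customers.merge(orders, on='id')                    # joining data from multiple sources
print(merged[['first_name', 'last_name', 'amount']])         # filtering: selecting only certain columns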
Benefits of Data Transformation

➢ Getting maximum value from data

➢ Managing data more effectively

➢ Performing faster queries

➢ Enhancing data quality

Types of data transformation

There are many different types of data transformation, depending on what kind of data you have and what you want to do
with it. Some common types include:

• Data cleaning

• Feature extraction

• Feature creation

• Data normalization

• Data aggregation/disaggregation

• Sampling

Types of data transformation (contd…)
Data cleaning

Data cleaning is the process of removing incorrect or incomplete information from your dataset,
adding or fixing missing values, dealing with outliers, and so on. It’s an important step in any data
transformation process, and it’s often the most time-consuming.

Feature extraction

Feature extraction is the process of reducing a large amount of information down to a smaller set of
more useful variables. It’s a common data transformation technique, and it’s often used when
working with images or videos.
Feature extraction is a type of data reduction, and it’s a common technique in machine learning.

Types of data transformation (contd…)
Feature creation
Feature creation is the process of adding extra information to your dataset where none existed
previously. It’s a useful data transformation technique when you want to make use of data that’s not
in a standard format.
Data normalization
Data normalization is the process of making sure all values in your dataset are on the same scale.
It’s a common data transformation technique, and it’s often used when working with numerical
data.
Data aggregation/disaggregation
Data aggregation is the process of combining multiple datasets into one. It’s a common data
transformation technique, and it’s often used when working with data from different sources.
Data disaggregation is the opposite of data aggregation. It’s the process of splitting one large dataset
into several smaller ones.

Types of data transformation (contd…)
Sampling

Sampling is the process of using only part of your dataset rather than all of it. It’s a common data
transformation technique, and it’s often used when working with very large datasets which cannot
be stored fully on the computer being used.

Data Reduction

Data reduction is a technique used to optimize the capacity required to store data. Data reduction can increase
storage performance and reduce storage costs.

Some of the most common types of data reduction are:

• Deduplication: Removing duplicate data. This can range from simply removing duplicated records to deleting records that, while not strictly identical, represent the same information or event.

• Compression: Compression processes apply algorithms to transform information so that it takes up less storage space. Compression algorithms can be (and often are) applied to data as it is moved into storage, but some can also be applied to data at rest to improve space gains even more.

• Thin Provisioning: Thin provisioning is an approach to storage where space is partitioned and used as needed rather than pre-allocating storage to users or processes. While more computationally intensive, this approach can significantly reduce inefficiencies like disk fragmentation.
Techniques for Data Reduction
What Techniques Are Used for Data Reduction?

• Dimensionality Reduction: This approach attempts to reduce the number of "dimensions," or aspects/variables, from a data set.
Techniques for Data Reduction(contd…)
What Techniques Are Used for Data Reduction?

• Data Cube Aggregation: This technique aggregates multidimensional data at various levels to create a "data cube" (or multidimensional data object).

▪ This simply means that the data is processed and reduced to a smaller but equally useful form of information that still represents trends in relevant information for analytics.
Techniques for Data Reduction(contd…)
What Techniques Are Used for Data Reduction?

• Numerosity Reduction: Numerosity reduction, as the name suggests, replaces the original data with a smaller representation that reproduces the original with more or less fidelity. This common compression technique is used for various data types, including audio, video, and image.

• Clustering: This practice uses data attributes to build a set of clusters into which the data is split. Similarities and dissimilarities between the data objects result in different placements and distances between the clusters and the objects as a whole.
Discretization and Concept Hierarchy Generation

Data discretization refers to a method of converting a huge number of data values into smaller
ones so that the evaluation and management of data become easy.

In other words, data discretization is a method of converting attributes values of continuous


data into a finite set of intervals with minimum data loss.

There are two forms of data discretization first is supervised discretization, and the second is
unsupervised discretization.

Supervised discretization refers to a method in which the class information is used.

Unsupervised discretization refers to a method that does not use class information; it proceeds by a top-down splitting strategy or a bottom-up merging strategy.
Famous techniques of data discretization
Histogram analysis

Histogram refers to a plot used to represent the underlying frequency distribution of a continuous data set.
Histogram assists the data inspection for data distribution. For example, Outliers, skewness representation, normal
distribution representation, etc.

Histograms are graphs that display the distribution of your continuous data.

Famous techniques of data discretization(contd..)
Binning

Binning refers to a data smoothing technique that groups a large number of continuous values into a smaller number of bins. It can also be used for data discretization and for building concept hierarchies.
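A hedged binning sketch using pandas; the bin count and labels are illustrative choices, not part of the slides.

import pandas as pd

ages = pd.Series([22, 25, 31, 38, 45, 52, 61, 70])

# Equal-width binning into three intervals
print(pd.cut(ages, bins=3, labels=['young', 'middle', 'senior']))

# Equal-frequency (quantile) binning into four intervals
print(pd.qcut(ages, q=4))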
Data discretization using correlation analysis

Discretization using correlation analysis finds the best neighboring intervals and then combines adjacent intervals step by step to form the final, larger intervals. It is a supervised procedure.
Cluster Analysis

Cluster analysis is a form of data discretization: a clustering algorithm partitions the values of a numeric attribute x into clusters, and each cluster becomes one interval of x.

Data discretization using decision tree analysis

Decision tree analysis discretizes data using a top-down splitting technique in a supervised procedure. For a numeric attribute, the split with the least entropy is selected first, and the process is then applied recursively.
Concept Hierarchy Generation
The term hierarchy represents an organizational structure or
mapping in which items are ranked according to their levels of
importance.
There are two types of hierarchy: top-down mapping and the
second one is bottom-up mapping.
Top-down mapping

Top-down mapping generally starts at the top with general information and ends at the bottom with specialized information.

Bottom-up mapping

Bottom-up mapping generally starts at the bottom with specialized information and ends at the top with generalized information.
Concept Hierarchy Generation(contd…)
(Figure: concept hierarchy for the attribute price in $.)

What is a concept?
A group of records that have been assigned a label.

What is a concept hierarchy?

A hierarchical ordering generated among concepts.
Concept Hierarchy Generation Example
Case Study:
Suppose a user selects a set of location-oriented attributes —
street, country, province or state, and city — from the
AllElectronics database, but does not specify the hierarchical
ordering among the attributes.
First, sort the attributes in ascending order based on the number
of distinct values in each attribute. This results in the following
(where the number of distinct values per attribute is shown in
parentheses):
country (15),
province or state (365),
city (3567),
and street (674,339).
Second, generate the hierarchy from the top down according to
the sorted order, with the first attribute at the top level and the
last attribute at the bottom level. Finally, the user can examine
the generated hierarchy, and when necessary, modify it to reflect
desired semantic relationships among the attributes.
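A hedged sketch of the heuristic described above: order the location attributes by their number of distinct values, fewest at the top of the hierarchy. The tiny DataFrame below is an illustrative stand-in for the AllElectronics data, not the actual dataset.

import pandas as pd

df = pd.DataFrame({
    'country': ['USA', 'USA', 'USA', 'Canada', 'Canada'],
    'state':   ['NY', 'CA', 'CA', 'ON', 'ON'],
    'city':    ['New York', 'Los Angeles', 'San Diego', 'Toronto', 'Toronto'],
    'street':  ['5th Ave', 'Sunset Blvd', 'Harbor Dr', 'Yonge St', 'Bay St'],
})

hierarchy = df.nunique().sort_values()   # ascending count of distinct values
print(list(hierarchy.index))             # ['country', 'state', 'city', 'street'] -> top to bottom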

Dimensionality Reduction – Feature Extraction

Feature Selection

Filter method

Wrapper Method
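Since the slide content here is a figure, a minimal wrapper-method sketch is included as an illustration (not the author's example): recursive feature elimination (RFE) repeatedly fits a model and drops the weakest feature. The Iris dataset and logistic regression are stand-in choices.

from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
rfe.fit(X, y)

print(rfe.support_)   # boolean mask of the 2 retained features
print(rfe.ranking_)   # rank 1 = selected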

Embedded Method

https://colab.research.google.com/drive/1k0ol1z3-bZNseuNBYw1YF7AICgnl9dzg#scrollTo=4w0xj10jFYYw
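As a hedged illustration of the embedded approach (separate from the linked notebook): feature selection that falls out of the model training itself, here via random-forest feature importances with SelectFromModel. The dataset and estimator are stand-in assumptions.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_iris(return_X_y=True)

selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0))
X_selected = selector.fit_transform(X, y)

print(selector.get_support())  # features whose importance exceeds the mean importance
print(X_selected.shape)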
Benefits of Feature Selection methods

Feature Extraction

Feature Extraction(contd…)

Feature Extraction Methods:

❖ Principal Component Analysis

❖ Factor Analysis

❖ Singular Value Decomposition
