
LAB WORKBOOK

18CS3159 DATA WAREHOUSING AND MINING

III B.TECH 2019-20 ODD SEMESTER


18CS3159 DATA WAREHOUSING AND MINING

LABORATORY WORKBOOK

STUDENT NAME:
REG. NO:
YEAR:
SEMESTER:
SECTION:
FACULTY:


Table of Contents
Organization of the STUDENT LAB WORKBOOK
Lab #1: Basic Statistical Descriptions
Lab #2: To implement data pre-processing techniques
Lab #3: To implement principal component analysis
Lab #4: Classification using Decision Trees
Lab #5: Classification using K Nearest Neighbour
Lab #6: Classification using Bayesian Classifiers
Lab #7: Classification using Backpropagation
Lab #8: Association Rule Mining - Apriori
Lab #9: Implementation of K-Means Clustering
Lab #10: Implementation of Fuzzy c-means clustering
Lab #11: Classification: Support Vector Machine (SVM)
Lab #12: Rule Based Classification
Lab #13: Hierarchical Clustering
Lab #14: Outliers detection


Organization of the STUDENT LAB WORKBOOK

The laboratory framework includes a creative element but shifts the time-intensive aspects
outside of the two-hour closed laboratory period. Within this structure, each laboratory
includes three parts: Prelab, In-lab, and Post-lab.
a. Pre-Lab
The Prelab exercise is a homework assignment that links the lecture with the laboratory
period and typically takes 2 hours to complete. The goal is for students to synthesize the
information they learn in lecture with material from their textbook to produce a working piece
of software. Students attending a two-hour closed laboratory are expected to make a good-faith
effort to complete the Prelab exercise before coming to the lab. Their work need not be
perfect, but their effort must be real (roughly 80 percent correct).
b. In-Lab
The In-lab section takes place during the actual laboratory period. The first hour of the
laboratory period can be used to resolve any problems the students might have
experienced in completing the Prelab exercises. The intent is to give constructive
feedback so that students leave the lab with working Prelab software - a significant
accomplishment on their part. During the second hour, students complete the In-lab
exercise to reinforce the concepts learned in the Prelab. Students leave the lab having
received feedback on their Prelab and In-lab work.
c. Post-Lab
The last phase of each laboratory is a homework assignment that is done following the
laboratory period. In the Post-lab, students analyse the efficiency or utility of a given
technique or algorithm. Each Post-lab exercise should take roughly 120 minutes to complete.


Lab #1: Basic Statistical Descriptions


Date of the Session: ___/___/___ Time of the Session: _____to______

Pre-lab

BASIC STATISTICAL DESCRIPTIONS OF DATA


Basic statistical descriptions provide the analytical foundation for data
preprocessing. They can be used to identify properties of the data and highlight which data values
should be treated as noise or outliers.

Answer the following Questions

1. What are the various ways to measure the central tendency of data?

The three most common measures of central tendency are the mean, median, and mode.

2. What are the several ways of measuring the dispersion of data?

Range, Quartiles, Variance, Standard Deviation, and Interquartile Range are the measures of dispersion.

3. What is IQR (inter quartile range)?

The distance between the first and third quartiles is a simple measure of spread that gives the range
covered by the middle half of the data. This distance is called the interquartile range (IQR).

4. Observe the following diagrams and identify the quantile plot and the q-q plot. How is a q-q plot
different from a quantile plot?


A quantile plot is a simple and effective way to have a first look at a univariate
data distribution.
A quantile–quantile plot, or q-q plot, graphs the quantiles of one univariate distribution
against the corresponding quantiles of another.

5. What are the items involved in a five number summary?

A five number summary consists of these five statistics:


1. The minimum value
2. The first quartile
3. The median
4. The third quartile
5. The maximum value
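As a quick sanity check of these definitions, a minimal sketch using numpy (the sample values below are illustrative only, not from the workbook datasets):

import numpy as np

x = np.array([13, 15, 16, 19, 20, 22, 25, 30, 35])        # illustrative sample
q1, median, q3 = np.percentile(x, [25, 50, 75])
five_number = (x.min(), q1, median, q3, x.max())           # minimum, Q1, median, Q3, maximum
iqr = q3 - q1                                              # interquartile range (question 3)
print(five_number, iqr)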

6. Identify the symmetric data, positively skewed data and negatively skewed data from the
below graphs?

a) b) c)

A) positively skewed data


B) symmetric data
C) negatively skewed data

In-lab
1. Given a data set “cars” for analysis; it includes the variables speed and distance.
(Download the dataset from lms)

a) What are the average speed and the distance of the cars?
b) What is the median and midrange of the data?
c) Find the mode of the data and comment on the data modality (i.e., unimodal or bimodal)?
d) What are the variance and the standard deviation of the data?
e) Find the five number summary of the data?
f) Show the histogram and box plot of the data?

Writing space of the Problem:(For Student’s use only)


import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from numpy import percentile
import statistics
data = pd.read_csv(r"Desktop\cars.csv")

1) data.mean()
speed 10.684211
dist 25.526316
dtype: float64

data.median()

speed 11.0
dist 24.0
dtype: float64

data.mode()

speed dist
0 10 26.0
1 12 NaN
d1=(data.max()+data.min())/2
d1

speed 10.0
dist 31.0
dtype: float64

statistics.stdev(data['speed'])
3.4165062007675386

statistics.stdev(data['dist'])
16.540752438388957

statistics.variance(data['speed'])
11.67251461988304

statistics.variance(data['dist'])
273.5964912280702

data.min()
speed 4
dist 2
dtype: int64

data.max()
speed 16
dist 60
dtype: int64

quartiles=percentile(data,[25,50,75])
quartiles[0]
10.0

quartiles[1]
14.0
quartiles[2]
23.5
data.boxplot()
<matplotlib.axes._subplots.AxesSubplot at 0x1c7bec5ff60>

data.plot.hist()
<matplotlib.axes._subplots.AxesSubplot at 0x1c7c0719cc0>

Post-lab
1. Suppose that a hospital tested the age and body fat data for 18 randomly selected adults with
the following results. (Download the dataset from lms)

a) Find the maximum and minimum of the fat percentage and the age of the adults who
visited the hospital.
b) Calculate mean, median and midrange of the age.
c) Find the first quartile and third quartile of the data.
d) Draw a scatter plot and q-q plot based on these two variables.

Writing space of the Problem:(For Student’s use only)


a)
import pandas as pd
import numpy as np
df = pd.read_csv(r"Desktop\df.csv")   # age / %fat data downloaded from LMS

df.min()
age 23.0
%fat 7.8
dtype: float64

df.max()
age 61.0
%fat 42.5
dtype: float64
b)
df.mean()
age 46.444444
%fat 28.783333
dtype: float64

df.median()
age 51.0
%fat 30.7
dtype: float64

midrange
(df.min()+df.max())/2
age 42.00
%fat 25.15
dtype: float64

c)
quartiles=np.percentile(df,[25,50,75])
quartiles[0]
27.15

quartiles[1]
34.35

quartiles[2]
50.5

d)
import pandas as pd


import matplotlib.pyplot as plt


import scipy.stats as stats
import statsmodels.api as sm
df=pd.read_csv(r"Desktop\df.csv")
plt.scatter(df.iloc[:,0],df.iloc[:,1])

sm.qqplot(df, stats.t, fit=True, line='45')


plt.show()


Viva Voce:-
1. Difference between symmetric data and skewed data.
2. What are the most widely used forms of quartiles?
3. Variance and Standard deviation fall under what category of measuring data?
4. What do low and high standard deviations indicate?
5. Based on what condition, two variables are said to be correlated?

(For Evaluator’s use only)


Comment of the Evaluator (if Any) Evaluator’s Observation
Marks Secured: _______ out of ________

Full Name of the Evaluator:

Signature of the Evaluator Date of Evaluation:

Lab #2: To implement data pre-processing techniques.
Date of the Session: ___/___/___ Time of the Session: _____to______

Pre-lab

DATA PREPROCESSING:
Databases are highly susceptible to noisy, missing, and inconsistent data because of
their typically enormous size (often several gigabytes or more). Low-quality data
will lead to low-quality mining results. Pre-processing helps to obtain quality data. The steps
involved in data pre-processing are data cleaning, data integration, data reduction, and data
transformation.

Match the following:


i. Data cleaning a. Reduced representation of data
ii. Data integration b. xold/xmax
iii. Data reduction c. deal with missing values and noisy data
iv. Normalization d. works to remove noisy data
v. Data transformation e. (xold-xmin)/ (xmax-xmin)
vi. Decimal scaling f. merging of data from multiple data stores
vii. Min max normalization g. scale the data values in specified range
viii. Z score normalization h. convert data into appropriate forms
ix. Smoothing i. (xold-mean)/standard deviation

ANS:- 1.c 2.f 3.a 4.g 5.h 6.b 7.e 8.i 9.d

1. Mention any two methods that deal with missing values and noisy data.
a. Use a measure of central tendency for the attribute (e.g., the mean or median) to
fill in the missing value.
b. drop the tuple
c. use binning technique to deal with noisy data

2. Mention two techniques that are applied to obtain a reduced data set.
The techniques that are applied to obtain a reduced data set are dimensionality reduction,
numerosity reduction.

3.Using min-max normalization, transform the value 35 onto the range [0.0, 1.0].
import numpy as np
import math
from statsmodels import robust
df1=[200,300,400,600,1000]
df = np.asarray(df1)
normalized_df=(35-df.min())/(df.max()-df.min())
normalized_df
o/p: -0.20625

4. Using z-score normalization, transform the value 35, where the standard deviation is 12.94
years.
normalized_df=(35-df.mean())/12.94
normalized_df
o/p: -35.93508500772798

5. Using normalization by decimal scaling, transform the value 35.

normalized_df=(35)/df.max()
normalized_df
o/p: 0.035

In-lab
1. Given a data set “data” for analysis; it includes the attributes nation, purchased item, age, and
salary. (Download the data set from lms)

a. Identify number of missing values in a given data set


b. Drop the tuples that have missing values in the attributes.
c. Check the data type of age, if it is not an integer then convert into integer.
d. Normalize the salary using simple feature scaling.
e. Categorize the salary into low, high, medium bins.
f. Turn the categorical values into numerical.
Writing space of the Problem:(For Student’s use only)

1.
import math
import numpy as np
import pandas as pd
data = pd.read_csv(r"Desktop\nation.csv")

NUMBER OF MISSING VALUES


data.isnull().sum()
nation 0
purchased item 0
age 3
salary 3

Missing values-dropna
df=data.dropna()
df
nation purchased item age salary
0 india no 26.0 23456.0
2 america no 57.0 45676.0
5 india no 45.0 566678.0
6 japan yes 39.0 677644.0
9 india yes 20.0 45678.0

DATA FORMATING
type(data['age']) is int
False

df['age']=df['age'].astype("int")
df['age']

0 26
2 57
5 45
6 39
9 20
Name: age, dtype: int32

DATA NORMALIZATION
USING SIMPLE FEATURE SCALING
d1=df['salary']/df['salary'].max()
d1
0 0.034614
2 0.067404
5 0.836247
6 1.000000
9 0.067407
Name: salary, dtype: float64

BINNING
bins=np.linspace(min(df["salary"]),max(df["salary"]),4)
groups=["low","medium","high"]
df["salbin"]=pd.cut(df["salary"],bins,labels=groups,include_lowest=True)
print(df["salbin"])
0 low
2 low
5 high
6 high
9 low
Name: salbin, dtype: category
Categories (3, object): [low < medium < high]

TURNING CATEGORICAL VALUES INTO NUMERIC VALUES


pd.get_dummies(df["purchased item"])

2. Suppose John is working as a manager at the Nuclear Power Corporation of India and has been
charged with analyzing the nuclear power station construction data. He carefully inspects the
company’s database, identifying and selecting the attributes (cost, date, t1, t2 and cap) to be
included in the analysis. (Download the dataset from lms)
a. He noticed that several values of the attributes for various tuples have no recorded
value.
b. He observed that data type of year is recorded in float instead of integer type.
c. He wants to normalize all the data (variables) in equal weights.
d. Finally, he wants to know if there are any outliers present in cost of the construction.
You immediately set out to perform this task.
(Hint: missing values can be handled by replacing them with the mean.)
Writing space of the Problem:(For Student’s use only)
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt   # needed for the scatter plot below
data = pd.read_csv("nuclear.csv")
df = data[['cost','date','t1','t2','cap']]
df

dm=df.fillna(df.mean())
dm

dm['date']=dm['date'].astype("int")
dm

d1=dm/dm.max()
d1

plt.scatter(df.iloc[:,0],df.iloc[:,1])
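For part (d), the scatter plot only gives a visual hint of outliers. A minimal follow-up sketch, assuming the cleaned frame dm built above, flags cost values outside 1.5 × IQR (the usual rule-of-thumb threshold, not one prescribed by the workbook):

q1, q3 = dm['cost'].quantile([0.25, 0.75])                 # first and third quartiles of cost
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = dm[(dm['cost'] < lower) | (dm['cost'] > upper)] # rows whose cost falls outside the fences
print(outliers[['cost']])
dm.boxplot(column='cost')                                  # boxplot shows the same points beyond the whiskers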

Post-lab
1. Data (13,5,16,16,19,20,20,21,22,22,25,25,25,25,30,33,33,35,35,35,35,36,40,45,46,52,70)
Use smoothing by bin means to smooth the above data, using a bin depth of 3. Illustrate your
steps.
Comment on the effect of this technique for the given data. Also Plot a histogram.

Writing space of the Problem:(For Student’s use only)

1.
import numpy as np
import math
from itertools import groupby
data = [13,5,16,16,19,20,20,21,22,22,25,25,25,25,30,33,33,35,35,35,35,36,40,45,46,52,70]
b = data
b = np.sort(b)
bin1 = np.zeros((9,3))
for i in range(0, len(data), 3):
    k = int(i/3)
    mean = (b[i] + b[i+1] + b[i+2])/3
    for j in range(3):
        bin1[k,j] = mean
print("Bin Mean: \n", bin1)

histogram
import numpy as np
import math
import matplotlib.pyplot as plt

from itertools import groupby


age = [13,5,16,16,19,20,20,21,22,22,25,25,25,25,30,33,33,35,35,35,35,36,40,45,46,52,70]
plt.hist(age)
plt.show()


2. Use the methods below to normalize the following group of data: 200, 300, 400, 600 and 1000.
a. min-max normalization by setting min = 0 and max = 1
b. z-score normalization
c. z-score normalization using the mean absolute deviation instead of the standard deviation
d. Normalization by simple feature scaling.

Writing space of the Problem:(For Student’s use only)


import numpy as np
import math
from statsmodels import robust
df1=[200,300,400,600,1000]
df = np.asarray(df1)

a. normalized_df=(df-df.min())/(df.max()-df.min())
normalized_df
o/p: array([0. , 0.125, 0.25 , 0.5 , 1. ])

b. normalized_df=(df-df.mean())/df.std()
normalized_df
o/p: array([-1.06066017, -0.70710678, -0.35355339, 0.35355339, 1.76776695])

c. normalized_df=(df-df.mean())/(robust.mad(df1))
normalized_df

o/p: array([-1.01173463, -0.67448975, -0.33724488, 0.33724488, 1.68622438])

d)normalized_df=(df)/df.max()
normalized_df
o/p: array([0.2, 0.3, 0.4, 0.6, 1. ])


3. a. Normalize the two variables(age, fat) based on z-score normalization


b. Calculate the correlation matrix. Are these two variables positively or negatively
correlated?

Writing space of the Problem:(For Student’s use only)


import pandas as pd
df=pd.read_csv(r"Desktop\df.csv")
df

A.
normalized_df=(df-df.mean())/df.std()
normalized_df

B.
CORRELATION COEFFICIENT
import numpy as np
np.corrcoef(df['age'], df['%fat'])
array([[1. , 0.71029574],
[0.71029574, 1. ]])

Since the correlation coefficient 0.71 is greater than 0, the two variables are positively correlated.



Viva Voce:-
1. What are the factors that comprising data quality?
2. What do you mean by noise in the dataset?
3. What are outliers in the dataset?
4. What is discretization?
5. What is the difference between lossy and lossless in data reduction?

(For Evaluator’s use only)


Comment of the Evaluator (if Any) Evaluator’s Observation
Marks Secured: _______ out of ________

Full Name of the Evaluator:

Signature of the Evaluator Date of Evaluation:


Lab #3: To implement principal component analysis


Date of the Session: ___/___/___ Time of the Session: _____to______

Pre-lab:-

Principal Component Analysis:


Principal Component Analysis is a method of extracting important variables from a
large set of variables available in a dataset. Suppose that the data to be reduced consist of tuples
or data vectors described by n attributes or dimensions. Principal components analysis (PCA; also
called the Karhunen-Loeve, or K-L, method) searches for k n-dimensional orthogonal vectors that
can best be used to represent the data, where k ≤ n. The original data are thus projected onto a
much smaller space, resulting in dimensionality reduction.

1. What are principal components?


A principal component is a normalized linear combination of the original predictors in a data set.

2. Mention the steps to construct principal components?

Step 1: From the dataset, standardize the variables so that all variables are represented on a single scale.
Step 2: Construct the variance-covariance matrix of those variables.
Step 3: Calculate the eigenvectors and eigenvalues of the covariance matrix. The eigenvectors represent
the components of the dataset.
Step 4: Reorder the matrix by eigenvalues, highest to lowest. This gives the components in order of
significance.
Step 5: Keep the top n components which together explain 75%-80% of the variability of the dataset.
Step 6: Create a feature vector by taking the eigenvectors that are kept in step 5 and forming a matrix
with these eigenvectors in the columns.
Step 7: Take the transpose of the feature vector and multiply it on the left of the original data set,
transposed. The values obtained are the principal scores.
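A compact sketch of these steps with numpy, on an illustrative two-attribute matrix (the data and variable names are my own, not from the workbook):

import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])  # illustrative data
Z = (X - X.mean(axis=0)) / X.std(axis=0)      # step 1: standardize the variables
C = np.cov(Z.T)                               # step 2: variance-covariance matrix
values, vectors = np.linalg.eig(C)            # step 3: eigenvalues and eigenvectors
order = np.argsort(values)[::-1]              # step 4: reorder by eigenvalue, highest first
W = vectors[:, order[:1]]                     # steps 5-6: keep the top component(s) as the feature vector
scores = Z.dot(W)                             # step 7: project the data to obtain the principal scores
print(values[order])
print(scores)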

In-lab:-

1. Suppose that you are given a small 3x2 matrix, you have to
calculate Principal Component Analysis without using pca() function?
Matrix: ([3, 5], [4, 2], [1, 6])

WITH OUT USING PCA FUNCTION

from numpy import array
from numpy import mean
from numpy import cov
from numpy.linalg import eig
A = array([[20, 50], [70, 40], [50, 80]])
M = mean(A.T, axis=1)
C=A-M
V = cov(C.T)
values, vectors = eig(V)
P = vectors.T.dot(C.T)

print(A)
[[20 50]
[70 40]
[50 80]]

M
array([46.66666667, 56.66666667])

C
array([[-26.66666667, -6.66666667],
[ 23.33333333, -16.66666667],
[ 3.33333333, 23.33333333]])

V
array([[633.33333333, -66.66666667],
[-66.66666667, 433.33333333]])

vectors
array([[ 0.95709203, 0.28978415],
[-0.28978415, 0.95709203]])
values
array([653.51837585, 413.14829082])
P.T

array([[-23.59055972, -14.10819081],
[ 27.1618831 , -9.18990364],
[ -3.57132338, 23.29809445]])


2. Calculate the principal component analysis for the matrix given in Q1 using PCA?
USING PCA

from numpy import array
from sklearn.decomposition import PCA
# define a matrix
A = array([[20, 50], [70, 40], [50, 80]])
print(A)

pca = PCA(2)
pca.fit(A)
print(pca.components_)
print(pca.explained_variance_)
B = pca.transform(A)
print(B)

[[20 50]
[70 40]
[50 80]]
[[ 0.95709203 -0.28978415]
[ 0.28978415 0.95709203]]
[653.51837585 413.14829082]
[[-23.59055972 -14.10819081]
[ 27.1618831 -9.18990364]
[ -3.57132338 23.29809445]]

3. Pollution has been a concern since industrialization due to its effects on human lives and the planet.
According to the WHO, air pollution contributes to about 7 million premature deaths per annum. A
report is generated on the quality of air over 5 months, and it is found that the data within the reported
dataset are correlated. So, perform a strategic method to reconstruct the dataset with 2 components.
Also visualize a graph between the two components.
(Download the dataset from lms)

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

%matplotlib inline
df = pd.read_csv("airquality.csv")
df=df.dropna()
scaler = StandardScaler()
scaler.fit(df)
scaled_data=scaler.transform(df)

from sklearn.decomposition import PCA


pca=PCA(n_components=2)
pca.fit(scaled_data)
x_pca = pca.transform(scaled_data)
scaled_data.shape

o/p: (111, 7)

scaled_data

x_pca.shape

(111, 2)

plt.figure(figsize=(8,6))
plt.scatter(x_pca[:,0],x_pca[:,1],c=df['Day'])
plt.xlabel('First principle component')
plt.ylabel('Second principle component')


Post-lab:-

1. Pollution has been a concern since industrialization due to its effects on human lives and the
planet. According to the WHO, air pollution contributes to about 7 million premature deaths per
annum. A report is generated on the quality of air over 5 months, and it is found that the data within
the reported dataset are correlated. So, perform a strategic method to reconstruct the dataset
with 2 components. Also visualize a graph between the two components.
(Download the dataset from lms)

Writing space of the Problem:(For Student’s use only)

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

%matplotlib inline
d=pd.read_csv(r"Desktop\death.csv")
d.fillna(d.mean(),inplace=True)
df=d.iloc[:,5:10]
scaler = StandardScaler()
scaler.fit(df)
scaled_data=scaler.transform(df)

from sklearn.decomposition import PCA


pca=PCA(n_components=2)
pca.fit(scaled_data)
x_pca = pca.transform(scaled_data)
scaled_data.shape

o/p: (151, 5)

scaled_data

array([[-1.69447075e-01, 1.10178460e-01, 0.00000000e+00,


-2.16912323e-01, 2.06136314e-16],
[-1.15056656e-02, -9.20033520e-01, -9.11518258e-01,
0.00000000e+00, -9.82609964e-01],
[-2.48417780e-01, -8.37616562e-01, -7.11840006e-01,
6.80655909e-01, -8.66565536e-01],
[-2.48417780e-01, -4.25531770e-01, -2.12644373e-01,
1.57822414e+00, -7.50521107e-01],
[-2.48417780e-01, 2.77615018e-02, 2.86551259e-01,
1.57822414e+00, -5.42545379e-02],
[-2.48417780e-01, 0.00000000e+00, -7.11840006e-01,
0.00000000e+00, -8.66565536e-01],
[-2.48417780e-01, 4.81054773e-01, 0.00000000e+00,
0.00000000e+00, 2.06136314e-16],
[ 4.62318563e-01, 0.00000000e+00, -1.11119651e+00,
0.00000000e+00, 2.06136314e-16],


x_pca.shape
(151, 2)

x_pca
array([[ 2.19650844e-02, -2.72925429e-01],
[-1.44914240e+00, -4.65084297e-02],
[-9.34861034e-01, 1.46993451e-01],
[-5.88842978e-02, 5.51733526e-01],

Viva Voce:-

1. Why PCA is preferable, mention the two primary reasons?


2. Is there any loss of data if we use PCA?
3. PCA is an unsupervised technique, will you agree with it? Why?
4. What are the applications of PCA?
5. Define covariance matrix?

(For Evaluator’s use only)


Comment of the Evaluator (if Any) Evaluator’s Observation
Marks Secured: _______ out of ________

Full Name of the Evaluator:

Signature of the Evaluator Date of Evaluation:


Lab #4: Classification using Decision Trees.


Date of the Session: ___/___/___ Time of the Session: _____to______

Pre-lab:-

1. What are the attribute selection measures in modeling a decision tree and write the       
respective equations for each of them.  

Attribute Selection Measures

- Information Gain
  Information Gain = entropy(parent) - [weighted average entropy(children)]

- Gain Ratio
  Gain Ratio = Gain / SplitINFO

- Gini index
  Gini index = 1 - Σ (pi)^2, where pi is the fraction of tuples belonging to class i

2. What do you mean by entropy in a decision tree? How is it calculated? 

Entropy is a measure of impurity, disorder or uncertainty in a set of examples. Entropy controls how a
decision tree decides to split the data; it effectively determines where the tree draws its boundaries.

Entropy = - Σ p(x) log2 p(x)

where p(x) is the fraction of examples belonging to class x.
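As a small worked example (the counts are my own, not from the workbook), for a node containing 4 tuples of one class and 6 of the other, the two impurity measures can be computed directly:

import math

p = [4/10, 6/10]                                   # class proportions at the node
entropy = -sum(pi * math.log2(pi) for pi in p)     # about 0.971 bits
gini = 1 - sum(pi ** 2 for pi in p)                # 0.48
print(entropy, gini)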

3. What is Information gain and how does is matter in a Decision Tree? 

1) The decision tree algorithm always tries to maximize information gain.

2) The attribute with the highest information gain is tested/split first.

4. List out the parameters involved in DecisionTreeClassifier and export_graphviz and try


to understand the role of each parameter.

Some of the parameters are


criterion : string, optional (default=”gini”)
The function to measure the quality of a split. Supported criteria are “gini” for the Gini impurity and
“entropy” for the information gain.
splitter : string, optional (default=”best”)

The strategy used to choose the split at each node. Supported strategies are “best” to choose the best split
and “random” to choose the best random split.
max_depth : int or None, optional (default=None)

The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all
leaves contain less than min_samples_split samples.
min_samples_split : int, float, optional (default=2)

The minimum number of samples required to split an internal node:

- If int, then consider min_samples_split as the minimum number.
- If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum
  number of samples for each split.

For more parameters please refer to
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
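A brief usage sketch showing how these parameters are passed to the constructor (the particular values are arbitrary, chosen only for illustration):

from sklearn.tree import DecisionTreeClassifier

# criterion, splitter, max_depth and min_samples_split are the parameters described above
clf = DecisionTreeClassifier(criterion="entropy", splitter="best",
                             max_depth=3, min_samples_split=4)
print(clf)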

Match the following: 


1. ID3  a. GAIN RATIO  
2. CART  b. INFORMATION GAIN 
3. C4.5  c. GINI INDEX 

ANS:- 1-b, 2-c, 3-a


In-lab:-

1. Implement the decision tree algorithm on the given data which has weight and smoothness as
the segregating criteria for the fruit apple and orange. Apple is represented by the number ‘1’
and orange by ‘0’. Construct a decision tree and apply the prediction measures for
the given data to obtain the types of fruits. 
Weight Smooth Fruit
180 7 ?
140 8 ?
150 5 ?

Fruit dataset
https://drive.google.com/file/d/1qoMDjozHHELVn5tFAJxp8mMw0Ggt-BVX/view?usp=sharing

Convert the trained decision tree classifier into graphviz object. Later, we use


the converted graphviz object for visualization. To visualize the decision tree, you just need to
open the .txt file and copy the contents of the file to paste in the graphviz web portal.
Graphviz web portal address: http://webgraphviz.com

Writing space of the Problem:(For Student’s use only)

# Required Python Packages
import pandas as pd
import numpy as np
from sklearn import tree
# creating dataset for modeling Apple / Orange classification
fruit_data_set = pd.DataFrame()
fruit_data_set["fruit"] = np.array([1, 1, 1, 1, 1, # 1 for apple

0, 0, 0, 0, 0]) # 0 for orange


fruit_data_set["weight"] = np.array([170, 175, 180, 178, 182,
130, 120, 130, 138, 145])
fruit_data_set["smooth"] = np.array([9, 10, 8, 8, 7,
3, 4, 2, 5, 6])
fruit_classifier = tree.DecisionTreeClassifier()
fruit_classifier.fit(fruit_data_set[["weight", "smooth"]], fruit_data_set["fruit"])

print(">>>>> Trained fruit_classifier <<<<<")


print(fruit_classifier)
# fruit data set 1st observation

test_features_1_fruit = fruit_classifier.predict([[180,7]])
print("Actual fruit type: {act_fruit} , Fruit classifier predicted: {predicted_fruit}".format(
act_fruit=fruit_data_set["fruit"][0], predicted_fruit=test_features_1_fruit))

# fruit data set 3rd observation

test_features_3_fruit = fruit_classifier.predict([[140,8]])
print("Actual fruit type: {act_fruit} , Fruit classifier predicted: {predicted_fruit}".format(
act_fruit=fruit_data_set["fruit"][2], predicted_fruit=test_features_3_fruit))

# fruit data set 8th observation

test_features_8_fruit = fruit_classifier.predict([[150,5]])
print("Actual fruit type: {act_fruit} , Fruit classifier predicted: {predicted_fruit}".format(
act_fruit=fruit_data_set["fruit"][7], predicted_fruit=test_features_8_fruit))

#creating graphviz object


with open("fruit_classifier.txt", "w") as f:
f = tree.export_graphviz(fruit_classifier, out_file=f)

A fruit_classifier text file will be created which has:

digraph Tree {
node [shape=box] ;
0 [label="X[1] <= 6.5\ngini = 0.5\nsamples = 10\nvalue = [5, 5]"] ;
1 [label="gini = 0.0\nsamples = 5\nvalue = [5, 0]"] ;
0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
2 [label="gini = 0.0\nsamples = 5\nvalue = [0, 5]"] ;
0 -> 2 [labeldistance=2.5, labelangle=-45, headlabel="False"] ;
}

Now copy the above code, go to the above mentioned website and paste it. Select Generate graph.


ANOTHER PROCESS FOR THE PREVIOUS QUESTION(SOLUTION 2):


# using csv file
import pandas as pd
import numpy as np
from sklearn import tree
fruit_data_set = pd.read_csv("Fruit.csv")
fruit_classifier = tree.DecisionTreeClassifier()
fruit_classifier.fit(fruit_data_set.values[:,0:2], fruit_data_set.values[:,2])
print(">>>>> Trained fruit_classifier <<<<<")
print(fruit_classifier)

# fruit data set 1st observation


test_features_1_fruit = fruit_classifier.predict([[170,9]])
print(test_features_1_fruit)

print("Actual fruit type: {act_fruit} , Fruit classifier predicted: {predicted_fruit}".format(


act_fruit=fruit_data_set.values[0,2], predicted_fruit=test_features_1_fruit))

# fruit data set 3rd observation


test_features_3_fruit = fruit_classifier.predict([[180,8]])
print(test_features_3_fruit)


print("Actual fruit type: {act_fruit} , Fruit classifier predicted: {predicted_fruit}".format(


act_fruit=fruit_data_set.values[2,2], predicted_fruit=test_features_3_fruit))

# fruit data set 8th observation


test_features_8_fruit = fruit_classifier.predict([[130,2]])
print(test_features_8_fruit)
print("Actual fruit type: {act_fruit} , Fruit classifier predicted: {predicted_fruit}".format(
act_fruit=fruit_data_set.values[7,2], predicted_fruit=test_features_8_fruit))
with open("fruit_classifier.txt", "w") as f:
f = tree.export_graphviz(fruit_classifier, out_file=f)

SOLUTION 3:

from sklearn import tree

# Gathering training data

# features = [[155, “rough”], [180, “rough”], [135, “smooth”], [110, “smooth”]] # Input to classifier

features = [[155, 0], [180, 0], [135, 1], [110, 1]] # scikit-learn requires real-valued features

# labels = [“orange”, “orange”, “apple”, “apple”] # output values

labels = [1, 1, 0, 0]

# Training classifier

classifier = tree.DecisionTreeClassifier() # using decision tree classifier

classifier = classifier.fit(features, labels) # Find patterns in data

# Making predictions

print (classifier.predict([[120, 1]]))

# Output is 0 for apple


2. Below given is the diabetes dataset. 

(Ref: https://drive.google.com/file/d/1PJizP39JPh_T-
5dQVcUVCfswrPSxT734/view?usp=sharing)

 Make sure to install the scikit-learn package and other required packages.  
1. Find the correlation matrix for the diabetes dataset.
2. Split the dataset into train_set and test_set for modeling and prediction. Divide the dataset
   in such a way that the training dataset constitutes 70 percent of the original dataset and the
   rest belongs to the test dataset.
3. Produce a decision tree model on the trained dataset using the DecisionTreeClassifier function with
   a. the Gini index metric
   b. the Entropy and Information gain metric
4. Apply the prediction measures on the test dataset.
5. Define a function named accuracy_score by interpreting the difference between the
   predicted values and the test set values. Display the accuracy in terms of
   a. a fraction, using the accuracy_score function
   b. the number of correct predictions.
6. Print the confusion matrix of the test dataset.
7. Calculate the following values manually after obtaining the confusion matrix:
   a. Accuracy
   b. Error rate
   c. Precision
   d. Recall (sensitivity)
   e. F1 Score
   f. Specificity
8. Compare the two results (obtained from the two kinds of metrics) and state which method is
   more accurate for this dataset. Convert the trained decision tree classifier into a graphviz
   object. Later, we use the converted graphviz object for visualization.
9. Plot the ROC curve and calculate the AUC.
10. Plot the recall vs precision curve.

Writing space of the Problem:(For Student’s use only)


import os
import numpy as np
import pandas as pd
import seaborn as sb
import numpy as np, pandas as pd
import matplotlib.pyplot as plt
from sklearn import tree, metrics
from sklearn.tree import DecisionTreeClassifier # Import Decision Tree Classifier
from sklearn.model_selection import train_test_split # Import train_test_split function
from sklearn import metrics #Import scikit-learn metrics module for accuracy calculation
from sklearn import tree
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import precision_recall_curve


from sklearn.metrics import f1_score


from sklearn.metrics import auc
from sklearn.metrics import average_precision_score
from sklearn.metrics import roc_curve, roc_auc_score

pima = pd.read_csv("diabetes.csv")
type(pima)
X=pima.iloc[:,0:7]
y=pima.iloc[:,8]
#correlation matrix
print(pima.corr(method='pearson'))
#splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) # 70% training and 30%
test
#Create Decision Tree classifer object
clf = DecisionTreeClassifier(criterion='entropy') #try the criterion 'gini' too

# Train Decision Tree Classifer


clf.fit(X_train,y_train)

#Predict the response for test dataset


y_pred = clf.predict(X_test)

#accuracy in fraction
print(accuracy_score(y_test,y_pred,normalize=True))
#accuracy in number of predictions
print(accuracy_score(y_test,y_pred,normalize=False))

#confusion matrix
x=confusion_matrix(y_test,y_pred)
print(x)
tp=x[0][0]
fp=x[0][1]
fn=x[1][0]
tn=x[1][1]
fpr1=fp/(fp+tn)
tpr1=tp/(tp+fp)
print(fpr1)
print(tpr1)

#method 1
y_pred_proba = clf.predict_proba(X_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()

'''#method 2
import sklearn.metrics as metrics
# calculate the fpr and tpr for all thresholds of the classification
probs = clf.predict_proba(X_test)
preds = probs[:,1]
print(preds)
fpr, tpr, threshold = metrics.roc_curve(y_test, preds)
roc_auc = metrics.auc(fpr, tpr)

# method I: plt
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)


plt.legend(loc = 'lower right')


plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

#method 3
predictions = tree.predict_proba(X_test)
x = roc_auc_score(y_test, predict_proba[:,1])

fpr, tpr, _ = roc_curve(y_test, predictions[:,1])

plt.clf()
plot.plot(fpr, tpr)
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC curve')
plt.show()'''

#https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62
#Accuracy = TP+TN/TOTAL where total=TP+TN+FP+FN
#Error rate = FP+FN/TOTAL
#Recall = TP/(TP+FN)
#Precision = TP/(TP+FP)
#F1 score = 2*Recall*Precision/(Recall+Precision)
#Specificity = TN / TN + FP
#fpr=fp/(fp+tn)
#tpr=tp/(tp+fp)
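# A sketch (not part of the original solution) computing the Q7 metrics from the
# confusion-matrix counts tp, fp, fn, tn extracted above.
total = tp + tn + fp + fn
accuracy = (tp + tn) / total
error_rate = (fp + fn) / total
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * recall * precision / (recall + precision)
specificity = tn / (tn + fp)
print(accuracy, error_rate, precision, recall, f1, specificity)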

# entropy is more accurate


# creating graphviz object
with open("clf.txt","w") as f:
f=tree.export_graphviz(clf, out_file = f)


Post-lab:-
1. What is the C4.5 algorithm and how does it work? State the differences between ID3
and C4.5. 

A. C4.5, like ID3, builds a decision tree for the given data in a top-down fashion, starting from a set of
objects and a specification of properties. At each node of the tree, one property is tested based on
maximizing information gain and minimizing entropy, and the results are used to split the object set. This
process is done recursively until the set in a given sub-tree is homogeneous (i.e. it contains objects
belonging to the same category). The ID3 algorithm uses a greedy search: it selects a test using the
information gain criterion, and then never explores the possibility of alternate choices.
Disadvantages of ID3:
- Data may be over-fitted or over-classified if a small sample is tested.
- Only one attribute at a time is tested for making a decision.
- It does not handle numeric attributes and missing values.
The new features of C4.5 (versus ID3) are: (i) it accepts both continuous and discrete features; (ii) it
handles incomplete data points; (iii) it addresses the over-fitting problem by a bottom-up technique
usually known as "pruning"; and (iv) different weights can be applied to the features that comprise the
training data.
Disadvantages of C4.5:
- C4.5 constructs empty branches with zero values.
- Over-fitting happens when the model picks up data with uncommon characteristics, especially when the
  data is noisy.

2. Differentiate between over-fitting and under-fitting. Why do they occur during
classification?

A.  Overfitting refers to a model that models the training data too well.
Overfitting happens when a model learns the detail and noise in the training data to the extent that it
negatively impacts the performance of the model on new data. This means that the noise or random
fluctuations in the training data is picked up and learned as concepts by the model. The problem is
that these concepts do not apply to new data and negatively impact the model's ability to generalize.
For example, decision trees are a nonparametric machine learning algorithm that is very flexible and
is subject to overfitting training data. This problem can be addressed by pruning a tree after it has
learned in order to remove some of the detail it has picked up.
Underfitting refers to a model that can neither model the training data nor generalize to new data.
An underfit machine learning model is not a suitable model and will be obvious as it will have poor
performance on the training data.
Underfitting is often not discussed as it is easy to detect given a good performance metric. The
remedy is to move on and try alternate machine learning algorithms. Nevertheless, it does provide a
good contrast to the problem of overfitting.


3. Explain the concept of pruning and why it is important. Differentiate between pre-
pruning and post- pruning. 
A. Pruning is a technique in machine learning and search algorithms that reduces the size
of decision trees by removing sections of the tree that provide little power to classify
instances. Pruning reduces the complexity of the final classifier, and hence improves predictive
accuracy by the reduction of overfitting.
(OR)
In machine learning and data mining, pruning is a technique associated with decision trees. Pruning
reduces the size of decision trees by removing parts of the tree that do not provide power to classify
instances. Decision trees are the most susceptible out of all the machine learning algorithms to
overfitting and effective pruning can reduce this likelihood.

Pruning means reducing the size of a tree that has grown too large and deep. First is post-pruning, in
which the tree is built first and then branches and levels of the decision tree are removed. Second is
pre-pruning, in which, while building the decision tree, we keep checking whether the tree is overfitting
and stop growing it early.

Viva Voce:-
1. What is the difference between supervised and unsupervised machine learning? 
2. What is a confusion matrix? 
3. Which of the following is true about training and testing error in such case? 
a. The difference between training error and test error increases as number of
observations increase.
b. The difference between training error and test error decreases as number of
observations increase. 
c. The difference between training error and test error will not change 
4. What is the difference between classification and clustering? 
5. What are Recommender Systems? 

(For Evaluator’s use only)


Comment of the Evaluator (if Any) Evaluator’s Observation
Marks Secured: _______ out of ________

Full Name of the Evaluator:

Signature of the Evaluator Date of Evaluation:


Lab #5: Classification using K Nearest Neighbour.


Date of the Session: ___/___/___ Time of the Session: _____to______

Pre-requisite:
In LMS: Find the file named “Concept of k-Nearest-Neighbor.doc”. Read the specified
document and answer the below questions.

Pre-lab:-

1. State whether the given statement is true or false with supported reasoning.

a. k-Nearest-Neighbor is a simple algorithm that stores all available cases and classifies
the new case based on dissimilarity measure.
b. The value of ‘k’ in k-nearest-neighbor algorithm helps to check the no. of training sets
labels to assign the most common label for the testing set.

ANS:-
a. False

KNN is a simple algorithm that stores all available cases and classifies the new case based
on similarity measure.

b. True

K denotes the number of nearest neighbours which are voting the class of new data or testing
data.

2. List the industrial uses of k-nearest-neighbor algorithm in the real world.

ANS:- Industrial uses of k-nearest neighbour:

k – nearest neighbour as a datamining technique has a wide variety of applications in


classification. Some of the applications are mentioned below:
i. Text Mining
The k-nearest neighbor algorithm is one of the most popular algorithms for text
categorization or text mining.
ii. Finance
- Forecasting the stock market: predict the price of a stock based on company
  performance measures and economic data.
- Currency exchange rate
- Credit rating
iii. Medicine
- Predict whether a patient hospitalized due to a heart attack will have a second heart
  attack, based on demographic, diet and clinical measurements for that patient.
- Estimate the amount of glucose in the blood of a diabetic person, from the
  infrared absorption spectrum of that person’s blood.

3. Write an algorithm for k-nearest-neighbor classification given k, the nearest number
of neighbors, and n, the number of attributes describing each tuple.

ANS:- Algorithm for k-nearest-neighbor classification:


1. Load the data
2. Initialise the value of K.
3. For getting the predicted class, iterate from 1 to n of training data points.
i. Calculate the distance between the test data and each row of training data.
Here we will use Euclidean distance as our distance metric since it is the most popular
method. The other metrics that can be used are Chebyshev, cosine, etc.
ii. Sort the calculated distances in ascending order based on distance values.
iii. Get top K rows from the sorted array.
iv. Get the most frequent class of these rows.
v. Return the predicted class.
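For comparison with the hand-written implementation used later in this lab, a minimal sketch of the same classification through scikit-learn's KNeighborsClassifier (the training pairs are taken from the StudentDataSet table in the In-lab section, k = 3):

from sklearn.neighbors import KNeighborsClassifier

X_train = [[8.5, 8.5], [8.2, 9.0], [7.5, 7.6], [5.5, 4.5], [9.2, 9.0], [6.8, 7.1]]  # CGPA pairs
y_train = ['C', 'C', 'C', 'NC', 'C', 'NC']                                          # categories
knn = KNeighborsClassifier(n_neighbors=3)   # k = 3, Euclidean (Minkowski p=2) distance by default
knn.fit(X_train, y_train)
print(knn.predict([[8.4, 7.1]]))            # predicted category of the new student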

4. Compare the advantages and disadvantages of eager classification (e.g., decision tree,
Bayesian, neural network) versus lazy classification (e.g., k-nearest-neighbor, case-based
reasoning).

ANS:- Eager classification versus Lazy classification:


i. Eager classification:
Eager classification is seen to be much faster than the lazy classification. This is because
it constructs a generalization model before receiving any new tuples to classify. Accuracy
of classification is generally seen because weights are assigned to attributes. Since this
kind of classification involves a single hypothesis for the entire classification, this leads to
decrease in classification levels and takes a long time. People working on such
classifications need to be trained.
ii. Lazy Classification:
Lazy classification, on the other hand, effectively uses a richer hypothesis space, which can improve
classification accuracy. However, all training tuples must be stored, which increases storage costs and
requires efficient indexing structures. A classifier is not built until a new tuple needs to be
classified. There could also be irrelevant attributes in the data, leading to inefficiency.

5. Give the distance methods that are most commonly used in k-nearest-neighbor algorithm.
a. Euclidean Distance:
It is the square root of the sum of the squared differences between the new point x and the existing
point y. It is the direct or least possible distance between points A and B.
b. Manhattan Distance:
It is used to calculate the distance between real vectors using the sum of their absolute
differences. It is the distance between A and B measured along the axes at right angles.
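A tiny sketch computing both measures for a pair of CGPA tuples (the points are illustrative):

import math

a, b = (8.4, 7.1), (8.2, 9.0)                                       # two CGPA tuples
euclidean = math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))  # straight-line distance
manhattan = sum(abs(ai - bi) for ai, bi in zip(a, b))               # distance along the axes
print(euclidean, manhattan)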


In-lab:-
Perform the following Analysis:
Step-by-step process to compute k-nearest-neighbor algorithm is:
1. Determine parameter k=no. of nearest neighbors
2. Calculate the distance between the test sample and the training samples.
3. Sort the distance and determine nearest neighbors based on the kth minimum distance.
4. Gather the category of nearest neighbors.
5. Use simple majority of the category of nearest neighbors as the prediction value of
testing sample.

Dataset:
Suppose we have the following “StudentDataSet” dataset which consists of 1st year
CGPA, 2nd year CGPA, Category (C: CRT, NC: Non-CRT) as parameters.

Std.No   1st year CGPA   2nd year CGPA   Category
1        8.5             8.5             C
2        8.2             9               C
3        7.5             7.6             C
4        5.5             4.5             NC
5        9.2             9               C
6        7.8             7.3             C
7        7.3             7.4             NC
8        7.9             7               NC
9        10              6               C
10       6.8             7.1             NC
11       6.5             7.1             NC
12       7.2             7.3             NC

When a new student comes only with 1st year CGPA and 2nd year CGPA as information,
predict the category of that new student (whether he belongs to CRT or Non-CRT)
by the Euclidean distance measure, where the Euclidean distance between two points or tuples, say
X1 = (x11, x12, ..., x1n) and X2 = (x21, x22, ..., x2n), is
dist(X1, X2) = sqrt( (x11 - x21)^2 + (x12 - x22)^2 + ... + (x1n - x2n)^2 )

Test sample:
1st year CGPA and 2nd year CGPA of the new student are 8.4 and 7.1 respectively.
(Consider k=3)


Writing space of the Problem:(For Student’s use only)


import csv
import random
import math
import operator

def loadDataset(filename, split, trainingSet=[], testSet=[]):
    print(type(filename))
    with open(filename, 'r') as csvfile:
        lines = csv.reader(csvfile)
        dataset = list(lines)
        for x in range(len(dataset)-1):
            for y in range(2):
                dataset[x][y] = float(dataset[x][y])
            if random.random() < split:
                trainingSet.append(dataset[x])
            else:
                testSet.append(dataset[x])

# This operation returns the Euclidean distance.
def euclideanDistance(instance1, instance2, length):
    distance = 0
    for x in range(length):
        distance += pow((instance1[x] - instance2[x]), 2)
    return math.sqrt(distance)

def getNeighbors(trainingSet, testInstance, k):
    distances = []
    trainInstance = trainingSet[0]
    TrainLength = len(trainInstance)
    TestLength = len(testInstance)
    if (TrainLength == TestLength):
        TestLength = TestLength - 1
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance, trainingSet[x], TestLength)
        distances.append((trainingSet[x], dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors

def getResponse(neighbors):
    classVotes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]

def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct/float(len(testSet))) * 100.0

def main():
    # prepare data
    trainingSet = []
    testSet = []
    split = 0.70
    loadDataset('StudentDataSet.csv', split, trainingSet, testSet)
    print('Train set: ' + repr(len(trainingSet)))
    print(trainingSet)
    print('Test set: ' + repr(len(testSet)))
    print(testSet)
    # generate predictions
    predictions = []
    k = 3
    for x in range(len(testSet)):
        neighbors = getNeighbors(trainingSet, testSet[x], k)
        result = getResponse(neighbors)
        predictions.append(result)
        print('> predicted=' + repr(result) + ', actual=' + repr(testSet[x][-1]))
    # To Get Accuracy
    accuracy = getAccuracy(testSet, predictions)
    print('Accuracy: ' + repr(accuracy) + '%')
    # Predicting the Result for new Entry
    neighbors1 = getNeighbors(trainingSet, [8.4, 7.1], k)
    result1 = getResponse(neighbors1)
    print('> predicted class for new entry [8.4,7.1] =' + repr(result1))

main()

Output:
<class 'str'>
Train set: 8
[[8.2, 9.0, 'C'], [5.5, 4.5, 'NC'], [9.2, 9.0, 'C'], [7.8, 7.3, 'C'], [7.3, 7.4, 'NC'], [7.9, 7.0, 'NC'], [10.0,
6.0, 'C'], [6.8, 7.1, 'NC']]
Test set: 3
[[8.5, 8.5, 'C'], [7.5, 7.6, 'C'], [6.5, 7.1, 'NC']]
> predicted='C', actual='C'
> predicted='NC', actual='C'
> predicted='NC', actual='NC'
Accuracy: 66.66666666666666%
> predicted class for new entry [8.4,7.1] ='NC'


Post-lab:-
1. Predict the Category of student with 1st year CGPA and 2nd year CGPA as 7.3 and 7.1
respectively using the Manhattan measuring technique formula with k=3(Manually).
Note: The Manhattan distance between two tuples (or points) a and b is defined
as ∑i|ai−bi|
2. By considering the above StudentDataSet, predict the Category of the new student
having 1st year CGPA and 2nd year CGPA as 8.4 and 7.1 respectively, by
implementing the python code using Manhattan distance measure in order to find
nearest neighbors for k=3 and check whether the output is same for both the measuring
techniques or not.

import csv
import random
import math
import operator

def loadDataset(filename, split, trainingSet=[], testSet=[]):
    print(type(filename))
    with open(filename, 'r') as csvfile:
        lines = csv.reader(csvfile)
        dataset = list(lines)
        for x in range(len(dataset)-1):
            for y in range(2):
                dataset[x][y] = float(dataset[x][y])
            if random.random() < split:
                trainingSet.append(dataset[x])
            else:
                testSet.append(dataset[x])

# This operation returns the Manhattan distance |x1-y1| + |x2-y2| + ...
def getManhattan(xList, yList, length):
    distance = 0
    for i in range(length):
        distance += abs(xList[i] - yList[i])
    return distance

def getNeighbors(trainingSet, testInstance, k):
    distances = []
    trainInstance = trainingSet[0]
    TrainLength = len(trainInstance)
    TestLength = len(testInstance)
    if (TrainLength == TestLength):
        TestLength = TestLength - 1
    for x in range(len(trainingSet)):
        dist = getManhattan(testInstance, trainingSet[x], TestLength)
        distances.append((trainingSet[x], dist))
    distances.sort(key=operator.itemgetter(1))
    neighbors = []
    for x in range(k):
        neighbors.append(distances[x][0])
    return neighbors

def getResponse(neighbors):
    classVotes = {}
    for x in range(len(neighbors)):
        response = neighbors[x][-1]
        if response in classVotes:
            classVotes[response] += 1
        else:
            classVotes[response] = 1
    sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
    return sortedVotes[0][0]

def getAccuracy(testSet, predictions):
    correct = 0
    for x in range(len(testSet)):
        if testSet[x][-1] == predictions[x]:
            correct += 1
    return (correct/float(len(testSet))) * 100.0

def main():
    # prepare data
    trainingSet = []
    testSet = []
    split = 0.70
    loadDataset('StudentDataSet.csv', split, trainingSet, testSet)
    print('Train set: ' + repr(len(trainingSet)))
    print(trainingSet)
    print('Test set: ' + repr(len(testSet)))
    print(testSet)
    # generate predictions
    predictions = []
    k = 3
    for x in range(len(testSet)):
        neighbors = getNeighbors(trainingSet, testSet[x], k)
        result = getResponse(neighbors)
        predictions.append(result)
        print('> predicted=' + repr(result) + ', actual=' + repr(testSet[x][-1]))
    # To Get Accuracy
    accuracy = getAccuracy(testSet, predictions)
    print('Accuracy: ' + repr(accuracy) + '%')
    # Predicting the Result for new Entry
    neighbors1 = getNeighbors(trainingSet, [8.4, 7.1], k)
    result1 = getResponse(neighbors1)
    print('> predicted class for new entry [8.4,7.1] =' + repr(result1))

main()

Output:
<class 'str'>
Train set: 5
[[8.2, 9.0, 'C'], [7.5, 7.6, 'C'], [7.9, 7.0, 'NC'], [10.0, 6.0, 'C'], [6.8, 7.1, 'NC']]
Test set: 6
[[8.5, 8.5, 'C'], [5.5, 4.5, 'NC'], [9.2, 9.0, 'C'], [7.8, 7.3, 'C'], [7.3, 7.4, 'NC'], [6.5, 7.1, 'NC']]
> predicted='C', actual='C'
> predicted='C', actual='NC'
> predicted='C', actual='C'
> predicted='C', actual='C'
> predicted='C', actual='NC'
> predicted='C', actual='NC'
Accuracy: 50.0%
> predicted class for new entry [8.4,7.1] ='C'

Viva Voce:-
Refer Page no: 423,424,425 in Han J & Kamber M, “Data Mining: Concepts and Techniques”,
Third Edition, Elsevier, 2011

1. k-nearest-neighbor is a ____________ lazy learning algorithm.


2. How can the distance be computed for attributes that are not numeric, but nominal (or
categorical) such as color?
3. List some techniques used to speed up the classification time.
4. If the value of a given attribute A is missing in tuple X1 and/or in tuple X2, the difference
is always _________

(For Evaluator’s use only)


Comment of the Evaluator (if Any) Evaluator’s Observation
Marks Secured: _______ out of ________

Full Name of the Evaluator:

Signature of the Evaluator Date of Evaluation:


Lab #6: Classification using Bayesian Classifiers


Date of the Session: ___/___/___ Time of the Session: _____to______

Pre-lab:-
1. Match the following
             Column A                              Column B

a. Naive Bayesian Classification      a. Values are continuous
b. Bayesian belief network            b. Attributes conditionally dependent
c. Gaussian distribution              c. To avoid zero probability
d. Laplace estimator                  d. Attributes conditionally independent

ANS:- a-d, b-b, c-a, d-c

2. Explain Bayes' theorem and write its formula.

ANS:- Bayes' theorem describes the probability of an event based on conditions that might be related to
that event:
P(A|B) = P(A) P(B|A) / P(B)
It tells us how often A happens given that B happens, written P(A|B), when we know: how often B happens
given that A happens, written P(B|A); how likely A is on its own, written P(A); and how likely B is on
its own, written P(B).
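A small numeric check of the formula (the probability values are invented purely for illustration):

p_a, p_b, p_b_given_a = 0.3, 0.4, 0.6            # assumed P(A), P(B) and P(B|A)
p_a_given_b = p_a * p_b_given_a / p_b            # Bayes' theorem: P(A|B) = P(A) P(B|A) / P(B)
print(p_a_given_b)                               # 0.45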
   

3. Suppose we have continuous values for an attribute in a dataset; how do we calculate the
probability?

ANS:- Whenever the given attribute has continuous values, we use the Gaussian (normal) distribution:
P(x|C) = (1 / (sqrt(2*pi) * sigma)) * exp( -(x - mu)^2 / (2 * sigma^2) )
where mu and sigma are the mean and standard deviation of the attribute values for class C.

4. Let us assume  
   p(age=youth/buys_car =yes) = 0.222 , 
   p(income=medium/buys_car)=0.444 and 
   p(buys_car=yes)=0.643 then 
   Find the probability of p(x/buys_car=yes), where x=(income=medium, age=youth). 

ANS:- P(x | buys_car=yes) = P(age=youth | yes) × P(income=medium | yes) = 0.222 × 0.444 ≈ 0.0986.
Multiplying further by P(buys_car=yes) = 0.643 gives 0.222 × 0.444 × 0.643 ≈ 0.0634, which is the score
the naive Bayesian classifier uses for the class buys_car = yes.


5. While implementing a Naïve Bayesian classifier, suppose we encounter a zero probability; we then
add one to each count to avoid the zero probability. What is this estimation called?
ANS:- The Laplace estimator (Laplacian correction) is a way of dealing with zero probability values.
Whenever we encounter this problem, we assume that our dataset is large enough that adding one to each
count does not noticeably affect the estimated probabilities.


In-lab:-
1. Consider the given table named “Weather_cond.csv” consisting of attributes Temperature
Humidity, Windy and a class label named “Outcome”. Depending on the weather
conditions you have to choose whether to play cricket or not. 
a. Unlike conventional function, write a python function to split the dataset into training
set and test set. Assume test size length as 0.33. 
b. Write a python function to calculate mean and standard deviation for each numerical
attribute in the data set. 
c. Calculate the number of priors for the given dataset after splitting into training and
test sets using python. 

Writing space of the Problem:(For Student’s use only)


a. def splitDataset(dataset, splitRatio):
trainSize = int(len(dataset) * splitRatio)
trainSet = []
copy = list(dataset)
while len(trainSet) < trainSize:
index = random.randrange(len(copy))
trainSet.append(copy.pop(index))
return [trainSet, copy]

b. def loadCsv(filename):
lines = csv.reader(open(filename, "r"))   # open in text mode (Python 3)
dataset = list(lines)
for i in range(len(dataset)):
dataset[i] = [float(x) for x in dataset[i]]
return dataset

def summarize(dataset):
summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
del summaries[-1]
return summaries

def mean(numbers):
return sum(numbers)/float(len(numbers))

def stdev(numbers):
avg = mean(numbers)
variance = sum([pow(x-avg,2) for x in numbers])/float(len(numbers)-1)
return math.sqrt(variance)

def calculateProbability(x, mean, stdev):


exponent = math.exp(-(math.pow(x-mean,2)/(2*math.pow(stdev,2))))
return (1 / (math.sqrt(2*math.pi) * stdev)) * exponent

c. import pandas as pd
import numpy as np
d=pd.read_csv(r'Enter the path of dataset')
q=np.array(d['Outcome'])
x = d.drop('Outcome', axis=1).values   # features only; the class label is kept separately in y


y=q
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=1)   # 0.33 test split as assumed in part (a)
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(x_train, y_train)        # categorical columns such as Windy must be numerically encoded before fitting
print(gnb.class_prior_)          # prior probability of each class (the number of priors equals the number of classes)

2. The problem is comprised of 100 observations of medical details for


Pima Indian’s patients. The records describe instantaneous measurements taken from the patient
such as their age, the number of times pregnant and blood workup. All patients are women aged
21 or older. All attributes are numeric, and their units vary from attribute to attribute. Each
record has a class value that indicates whether the patient suffered an onset of diabetes within 5
years of when the measurements were taken (1) or not (0). This is a standard dataset that
has been studied a lot in machine learning literature. A good prediction accuracy is 70%-76%.

Implement a python code to find the accuracy for given dataset named “Diabetes.csv”
based on train set and test set. Take test size length as 0.4. 

Writing space of the Problem:(For Student’s use only)


import pandas as pd
import numpy as np
d=pd.read_csv(r'Enter the path of dataset')
q=np.array(d['Outcome'])
x = d.drop('Outcome', axis=1).values   # features only; Outcome is the class label
y=q
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=1)
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(x_train, y_train)

y_pred = gnb.predict(x_test)

from sklearn import metrics
print("Gaussian Naive Bayes model accuracy(in %):", metrics.accuracy_score(y_test, y_pred)*100)


Post-lab:-
1. Consider the given table that specifies loan classification problem.  

Tid  Home Owner  Marital status  Annual Income  Defaulted


Borrower 
1  Yes  Single  125K  No 
2  No  Married  100K  No 
3  No  Single  70K  No 
4  Yes  Married  120K  No 
5  No  Divorced  95K  Yes 
6  No  Married  60K  No 
7  Yes  Divorced  220K  No 
8  No  Single  85K  Yes 
9  No  Married  75K  No 
10  No  Single  90K  Yes 

a. Compute the class conditional probability for each categorical attribute. 
b. Predict the class label value for test record X= (Home Owner =No, Marital Status =
Married, Income=$120K) 
  
Writing space of the Problem:(For Student’s use only)
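One possible way to work this out (a sketch, not the only acceptable solution): the categorical class-conditional
probabilities are relative frequencies from the table, annual income is modelled with a Gaussian per class, and the
two scores P(X|class)·P(class) are compared. Since P(Marital Status = Married | Yes) = 0/3 = 0, the 'Yes' score is 0
and X is classified as Defaulted Borrower = No.

import math

# class counts from the table: 7 tuples with Defaulted = No, 3 with Defaulted = Yes
prior = {'No': 7/10, 'Yes': 3/10}

# part (a): class-conditional probabilities of the categorical values needed for X
# (the remaining values, e.g. P(Home Owner = Yes | No) = 3/7 or P(Single | Yes) = 2/3, are counted the same way)
p_homeowner_no = {'No': 4/7, 'Yes': 3/3}      # P(Home Owner = No | class)
p_married      = {'No': 4/7, 'Yes': 0/3}      # P(Marital Status = Married | class)

# annual incomes (in K) per class, used to fit a Gaussian for the continuous attribute
income = {'No': [125, 100, 70, 120, 60, 220, 75], 'Yes': [95, 85, 90]}

def gaussian(x, values):
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)   # sample variance
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# part (b): score each class for X = (Home Owner = No, Marital Status = Married, Income = 120K)
for c in ('No', 'Yes'):
    score = prior[c] * p_homeowner_no[c] * p_married[c] * gaussian(120, income[c])
    print('Defaulted =', c, ':', score)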



Viva Voce:-
1. Explain the difference between a Validation Set and a Test Set?
2. What are the three types of Naïve Bayes classifier?
3. How many terms are required for building a Bayes model?
4. What is training test and testing set?
5. What are the advantages of Naive Bayes?

(For Evaluator’s use only)


Comment of the Evaluator (if Any) Evaluator’s Observation
Marks Secured: _______ out of ________

Full Name of the Evaluator:

Signature of the Evaluator Date of Evaluation:


Lab #7: Classification using Backpropagation


Date of the Session: ___/___/___ Time of the Session: _____to______

Pre-lab:-
In LMS: Find the file named “Han J & Kamber M, Data Mining Concepts and
Techniques.doc”.  
Read the specified document from Pg. No: 398 – 404 and answer the below questions. 
1. State whether the given statement is True/False. 
a. Backpropagation is neural network learning algorithm. 
ANS:- TRUE

b. Backpropagation learns by iteratively processing a data set of training tuples,


comparing the network’s prediction for each tuple with the actual known target
value. 
ANS:- TRUE

2. What is the objective of Backpropagation? 


ANS:- The objective of the backpropagation algorithm is to provide a learning procedure for a multilayer feed-forward
neural network, so that the network can be trained to capture the input-to-output mapping implicitly in its weights.

3. Explain about Multilayer Feed-Forward Neural Network with diagram.  


ANS:- A multilayer feed-forward neural network consists of an input layer, one or more hidden layers, and an
output layer.

4. How does Backpropagation work? 


ANS:- Backpropagation learns by iteratively processing a dataset of training tuples, comparing the network’s
prediction for each tuple with the actual known target value. For each training tuple, the weights are
modified to minimize the mean-squared error between the network’s prediction and the actual target
value. These modifications are made in the backward direction through each hidden layer down to the first
hidden layer (hence the name backpropagation).


5. Consider the following table. 



Input | Desired Output | Model Output | Absolute Error | Square Error
  0   |       0        |              |                |
  1   |       2        |              |                |
  2   |       4        |              |                |

Predict the Model Output by considering the initial value of weight as 3. Find the Absolute Error
and Square Error. Use the Backpropagation algorithm to update the weight and try to minimize
the square error as much as possible. 
Hint: 
i. Model Output = W * I(x) (W=weight, I=Input, x=index that iterates from
0 to length(Input) ) 
ii. Absolute Error = mod(Model Output-Desired Output) 
iii. Square Error = (Absolute Error)^2
Writing space of the Problem:(For Student’s use only)

ANS:-
Input | Desired Output | Model Output (W=3) | Absolute Error | Square Error
  0   |       0        |         0          |       0        |      0
  1   |       2        |         3          |       1        |      1
  2   |       4        |         6          |       2        |      4

Increasing Weight:

Model Output (W=4) | Absolute Error | Square Error
        0          |       0        |      0
        4          |       2        |      4
        8          |       4        |      16

Since the square error increases when the weight is increased, we should decrease the weight instead.

Decreasing Weight:

Model Output (W=2) | Absolute Error | Square Error
        0          |       0        |      0
        2          |       0        |      0
        4          |       0        |      0

With W = 2 the square error is 0 for every input, so the weight update stops here.


In-lab:-
Analysis: 
The following steps will provide the foundation that you need to implement the Backpropagation
algorithm and apply it to your own predictive modelling problems: 
1. Initialize Network. 
2. Forward Propagate. 
i. Neuron Activation. 
ii. Neuron Transfer. 
iii. Forward Propagation. 
3. Back Propagate Error. 
i. Transfer Derivative 
ii. Error Backpropagation 
4. Train Network. 
i. Update Weights. 
ii. Train Network. 
5. Test Network.

Dataset: 
Suppose we have the following “Results Dataset” which consist of GPA’s of some
students that they had scored in two internal tests. And, it also consists of another attribute
named ‘Qualified’, which holds a character(Q/NQ), representing the student qualification
for final examination.  


S. No Test – 1 Test – 2 Qualified
1 8.5 8.5 Q
2 8.2 9.0 Q
3 3.5 5.0 NQ
4 5.5 4.5 NQ
5 9.2 9.0 Q
6 7.8 7.3 Q
7 8.0 3.1 NQ
8 10 7.0 Q
9 4.5 6.0 NQ
10 6.8 7.1 Q
11 5.1 4.1 NQ
12 4.2 5.3 NQ

Problem:    Train a network on above “Results Dataset” by applying Backpropagation


algorithm.  
a. Initializing a network with all weights and biases.   (Consider weights in range -0.5 to
+0.5, biases = 1, Learning Rate = {0.5, 0.7, 1})
b. Training the network according to the Dataset. (Consider both Activating Functions –
Sigmoid Function and Tanh Function)  
c. Backpropagating the errors. 


Writing space of the Problem:(For Student’s use only)


import csv
from math import exp
from random import random
# Convert the feature values of a row to float and map the class label to a number
# (assumes each CSV row ends with Test-1, Test-2, Qualified, where Qualified is 'Q' or 'NQ')
def convert_into_float(row):
    return [float(row[-3]), float(row[-2]), 1.0 if row[-1].strip() == 'Q' else 0.0]
# Keep the GPA features as floats and cast only the class label to an int index
def convert_into_int(row):
    return [row[0], row[1], int(row[2])]
# Initialize a network
def initialize_network(n_inputs, n_hidden, n_outputs):
network = list()
hidden_layer = [{'weights':[random() - 0.5 for i in range(n_inputs + 1)]} for i in range(n_hidden)]   # weights in [-0.5, 0.5) as the problem asks
network.append(hidden_layer)
output_layer = [{'weights':[random() - 0.5 for i in range(n_hidden + 1)]} for i in range(n_outputs)]
network.append(output_layer)
return network
# Calculate neuron activation for an input
def activate(weights, inputs):
activation = weights[-1]
for i in range(len(weights)-1):
activation += weights[i] * inputs[i]
return activation
# Transfer neuron activation
def transfer(activation):
return 1.0 / (1.0 + exp(-activation))
# Forward propagate input to a network output
def forward_propagate(network, row):
inputs = row
for layer in network:
new_inputs = []
for neuron in layer:
activation = activate(neuron['weights'], inputs)
neuron['output'] = transfer(activation)
new_inputs.append(neuron['output'])
inputs = new_inputs
return inputs
# Calculate the derivative of an neuron output
def transfer_derivative(output):
return output * (1.0 - output)
# Backpropagate error and store in neurons
def backward_propagate_error(network, expected):
for i in reversed(range(len(network))):
layer = network[i]
errors = list()
if i != len(network)-1:
for j in range(len(layer)):
error = 0.0
for neuron in network[i + 1]:
error += (neuron['weights'][j] * neuron['delta'])
errors.append(error)
else:
for j in range(len(layer)):

neuron = layer[j]
errors.append(expected[j] - neuron['output'])
for j in range(len(layer)):
neuron = layer[j]
neuron['delta'] = errors[j] * transfer_derivative(neuron['output'])
# Update network weights with error
def update_weights(network, row, l_rate):
for i in range(len(network)):
inputs = row[:-1]
if i != 0:
inputs = [neuron['output'] for neuron in network[i - 1]]
for neuron in network[i]:
for j in range(len(inputs)):
neuron['weights'][j] += l_rate * neuron['delta'] * inputs[j]
neuron['weights'][-1] += l_rate * neuron['delta']
# Train a network for a fixed number of epochs
def train_network(network, train, l_rate, n_epoch, n_outputs):
for epoch in range(n_epoch):
sum_error = 0
for row in train:
outputs = forward_propagate(network, row)
expected = [0 for i in range(n_outputs)]
expected[row[-1]] = 1
sum_error += sum([(expected[i]-outputs[i])**2 for i in range(len(expected))])
backward_propagate_error(network, expected)
update_weights(network, row, l_rate)
print('>epoch=%d, lrate=%.3f, error=%.3f' % (epoch, l_rate, sum_error))
# Test training backprop algorithmn
with open("G:\College\BackPropagation\Results Dataset.csv") as f:
reader = csv.reader(f)
next(reader) # skip header
data = []
for row in reader:
data.append(row)
data1=[]
dataset=[]
for row in data:
data1.append(convert_into_float(row))
for row in data1:
dataset.append(convert_into_int(row))
#Splitting dataset into train and test dataset
testdata=[]
traindata=[]
c=0
for row in dataset:
c=c+1
if(c<=8):
traindata.append(row)
else:
testdata.append(row)
n_inputs = len(traindata[0]) - 1
n_outputs = len(set([row[-1] for row in traindata]))
network = initialize_network(n_inputs, 2, n_outputs)
train_network(network, traindata,0.5,500, n_outputs)
for layer in network:
print(layer)
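Part (b) also asks for the tanh activation; the solution above uses only the sigmoid. A hedged sketch of drop-in
replacements (swap these for the transfer() and transfer_derivative() defined above and retrain to compare):

from math import tanh

def transfer(activation):
    # hyperbolic tangent activation; outputs lie in (-1, 1) instead of (0, 1)
    return tanh(activation)

def transfer_derivative(output):
    # derivative of tanh written in terms of the neuron's output
    return 1.0 - output ** 2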

Post-lab:-
1. Use the network which is trained on the above “Results Dataset” and test whether it is trained
with 100% accuracy or not. And, predict the result (qualified for final examination or not) of a
new entry which contains 5.9 and 5.9 GPA’s of test -1 and test -2 respectively.

Writing space of the Problem:(For Student’s use only)
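A possible sketch (assuming the variable names and the label encoding Q -> 1, NQ -> 0 used in the in-lab code
above): forward-propagate each row and take the output neuron with the largest activation as the predicted class.

# predict with the trained network: the output neuron with the largest activation wins
def predict(network, row):
    outputs = forward_propagate(network, row)
    return outputs.index(max(outputs))

# accuracy on the held-out rows from the in-lab split
correct = 0
for row in testdata:
    if predict(network, row) == row[-1]:
        correct += 1
print('Accuracy on test rows:', correct / float(len(testdata)) * 100, '%')

# result for the new entry with GPAs 5.9 and 5.9 (a dummy class value 0 is appended only to keep the
# row shape consistent; forward_propagate never reads it)
label = predict(network, [5.9, 5.9, 0])
print('Predicted for [5.9, 5.9]:', 'Q' if label == 1 else 'NQ')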



Viva Voce:-
1. What are the general tasks that are performed with backpropagation algorithm? 
2. What kind of real-world problems can neural networks solve? 
3. What is a gradient descent? 
4. Why is zero initialization not a recommended weight initialization technique? 
5. How are artificial neural networks different from normal networks? 

(For Evaluator’s use only)


Comment of the Evaluator (if Any) Evaluator’s Observation
Marks Secured: _______ out of ________

Full Name of the Evaluator:

Signature of the Evaluator Date of Evaluation:


Lab #8: Association Rule Mining - Apriori


Date of the Session: ___/___/___ Time of the Session: _____to______

Pre-lab:-
1. Define what is Apriori algorithm.
Apriori is an algorithm for frequent item set mining and association rule learning over transactional databases. It
proceeds by identifying the frequent individual items in the database and extending them to larger and larger
item sets as long as those item sets appear sufficiently often in the database.

If an item set is frequent, then all of its subsets must also be frequent, Or If an item set is infrequent, then all of its
supersets must be infrequent.
2. What is association mining? 
Association mining searches for frequent itemsets in a dataset; it finds the interesting associations and
correlations between itemsets in transactional and relational databases. In short, frequent-pattern mining shows
which items appear together in a transaction or relation.

3. What is the need of association mining? 


Frequent mining enables the generation of association rules from a transactional dataset. If two items X and Y are
frequently purchased together, it is good to put them together in stores or to provide a discount offer on one item
on purchase of the other; this can really increase sales. For example, it is likely to find that if a customer buys
milk and bread he/she also buys butter, so the association rule is [milk, bread] => [butter], and the seller can
suggest butter to a customer who buys milk and bread.

4. What is minimum support and minimum confidence? 


Minimum support: the minimum support threshold is applied to find all frequent itemsets in a database. It is one of
the interestingness measures and tells about the usefulness and certainty of rules; a 5% support means that 5% of
all transactions in the database follow the rule.
Minimum confidence:
The minimum confidence constraint is applied to these frequent itemsets in order to form rules.

5. Consider the market basket transactions given in the following table. Let
min-sup=40% and min_conf=40% 
Transaction ID Items Bought
T1 A,B,C
T2 A,B,C,D,E
T3 A,C,D
T4 A,C,D,E
T5 A,B,C,D

a. Find all the frequent item sets using apriori algorithm. 


b. Obtain significant decision rules. 
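A small brute-force check of part (a) in Python (it counts supports directly rather than running the full
candidate-pruning loop, but with only five transactions it yields the same frequent itemsets that Apriori would):

from itertools import combinations

transactions = [{'A','B','C'}, {'A','B','C','D','E'}, {'A','C','D'}, {'A','C','D','E'}, {'A','B','C','D'}]
min_count = 2   # min-sup = 40% of 5 transactions

items = sorted(set().union(*transactions))
for k in range(1, len(items) + 1):
    frequent = []
    for cand in combinations(items, k):
        count = sum(1 for t in transactions if set(cand) <= t)
        if count >= min_count:
            frequent.append((''.join(cand), count))
    if not frequent:
        break
    print('frequent', k, '-itemsets:', frequent)

# Frequent itemsets found: A, B, C, D, E; AB, AC, AD, AE, BC, BD, CD, CE, DE;
# ABC, ABD, ACD, ACE, ADE, BCD, CDE; ABCD, ACDE.
# Example rules for part (b) meeting 40% confidence: B => A,C (3/3), D => A,C (4/4), E => A,C,D (2/2).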


In-lab:-
1. For the following given transaction dataset, perform following operations : 
a. Generate rules using Apriori algorithm by using below dataset. 
Transactions (one per line):
1.  shrimp, almonds, avocado, vegetables mix, green grapes, whole wheat flour, yams, cottage cheese
2.  burgers, meatballs, eggs
3.  chutney
4.  turkey, avocado
5.  mineral water, milk, energy bar, whole wheat rice, green tea, eggs
6.  low fat yogurt
7.  whole wheat pasta, french fries
8.  soup, light cream, shallot
9.  frozen vegetables, spaghetti, green tea
10. french fries
11. eggs, pet food
12. cookies
13. turkey, burgers, mineral water, eggs, cooking oil
14. spaghetti, champagne, cookies
15. mineral water, salmon, eggs
16. mineral water
17. shrimp, chocolate, chicken, honey, oil, cooking oil, low fat yogurt
18. turkey, eggs
19. turkey, fresh tuna, tomatoes, spaghetti, mineral water, black tea, salmon, eggs
20. meatballs, milk, honey, french fries, protein bar
21. red wine, shrimp, pasta, pepper, eggs, chocolate, shampoo
22. rice, sparkling water
23. spaghetti, mineral water, ham, body spray, pancakes, green tea
24. burgers, grated cheese, eggs, pasta, avocado, honey, white wine, toothpaste
25. eggs
26. parmesan cheese, spaghetti, soup, avocado, milk, fresh bread
27. ground beef, spaghetti, mineral water, milk, eggs, black tea, salmon, frozen smoothie
28. sparkling water
29. mineral water, eggs, chicken, french fries, chocolate
30. frozen vegetables, spaghetti, yams, mineral water

Writing space of the Problem:(For Student’s use only)


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Data Preprocessing
dataset = pd.read_csv('Market_Basket_Optimisation.csv', header = None)
transactions = []
for i in range(len(dataset)):
    transactions.append([str(dataset.values[i, j]) for j in range(dataset.shape[1]) if str(dataset.values[i, j]) != 'nan'])

# Training Apriori on the dataset


from apyori import apriori
rules = apriori(transactions, min_support = 0.003, min_confidence = 0.2, min_lift = 3, min_length = 2)

# Visualising the results


results = list(rules)
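To actually display the rules, the results list can be unpacked; this sketch assumes apyori's usual result fields
(each RelationRecord carries items, support and a list of ordered_statistics with items_base, items_add, confidence
and lift):

for record in results:
    for stat in record.ordered_statistics:
        print(list(stat.items_base), '->', list(stat.items_add),
              '| support = %.4f' % record.support,
              '| confidence = %.2f' % stat.confidence,
              '| lift = %.2f' % stat.lift)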


Post-lab:-
1. Same as In-lab question generate rules on below dataset. 
Transactions (one per line):
1.  citrus fruit, semi-finished bread, margarine, ready soups
2.  tropical fruit, yogurt, coffee
3.  whole milk
4.  pip fruit, yogurt, cream cheese, meat spreads
5.  other vegetables, whole milk, condensed milk, long life bakery product
6.  whole milk, butter, yogurt, rice, abrasive cleaner
7.  rolls/buns
8.  other vegetables, UHT-milk, rolls/buns, bottled beer, liquor (appetizer)
9.  pot plants
10. whole milk, cereals
11. tropical fruit, other vegetables, white bread, bottled water, chocolate
12. citrus fruit, tropical fruit, whole milk, butter, curd, yogurt, flour, bottled water, dishes
13. beef
14. frankfurter, rolls/buns, soda
15. chicken, tropical fruit
16. butter, sugar, fruit/vegetable juice, newspapers
17. fruit/vegetable juice
18. packaged fruit/vegetables
19. chocolate
20. specialty bar
21. other vegetables
22. butter milk, pastry
23. whole milk
24. tropical fruit, cream cheese, processed cheese, detergent, newspapers
25. tropical fruit, root vegetables, other vegetables, frozen dessert, rolls/buns, flour, sweet spreads, salty snack, waffles, candy, bathroom cleaner
26. bottled water, canned beer
27. yogurt
28. sausage, rolls/buns, soda, chocolate
29. other vegetables
30. brown bread, soda, fruit/vegetable juice, canned beer, newspapers, shopping bags

Writing space of the Problem:(For Student’s use only)


import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Data Preprocessing
dataset = pd.read_csv('grow.csv', header = None)
transactions = []
for i in range(len(dataset)):
    transactions.append([str(dataset.values[i, j]) for j in range(dataset.shape[1]) if str(dataset.values[i, j]) != 'nan'])

# Training Apriori on the dataset


from apyori import apriori
rules = apriori(transactions, min_support = 0.14, min_confidence = 0.4, min_lift = 2.85, min_length = 2)
rules
# Visualising the results
results = list(rules)

Viva Voce:-
1. Who proposed Apriori algorithm in which year?  
2. What is frequent item set? 
3. Why do we convert dataset into list? 
4. What is the formula for support, confidence and lift? 
5. How they get the name as Apriori? 

(For Evaluator’s use only)


Comment of the Evaluator (if Any) Evaluator’s Observation
Marks Secured: _______ out of ________

Full Name of the Evaluator:

Signature of the Evaluator Date of Evaluation:


Lab #9: Implementation of K-Means Clustering


Date of the Session: ___/___/___ Time of the Session: _____to______

Pre-Requisites:
Data pre-processing  
Basics of plotting techniques  
Various clustering techniques 

Pre-lab:-
1. Match the following.

                    Parameters                          Application 
1. pch  a. To set orientation of axis labels 
2. col  b. No. of plots per row and column 
3. mfrow  c. To set plot color 
4. lwd  d. Plotting symbol 
5. las  e. To set line width 

ANS:- d c b e a

2. List out various parameters and attributes in K Means clustering. 


ANS:- Parameters: lwd, las, col, xlab, ylab, mfrow, bg, plot, lty, pch
Attributes: cluster, centers, totss, withinss, tot.withinss, betweenss, size

3. Into how many types is clustering divided? Name them.

ANS:- Clustering is classified into six categories namely:


Partitioning method, hierarchical method, density-based method, grid-based method, model-based method,
constraint-based method.

4. List out various applications of clustering.


ANS:- Clustering algorithms are used for robotic situational awareness to track objects and detect outliers in
sensor data.
To find weather regimes or preferred sea level pressure atmospheric patterns.
Clustering can be used to divide a digital image into distinct regions for border detection or object recognition.
In study of social networks, clustering may be used to recognize communities within large group of people.

5. Describe Euclidean distance and Manhattan distance in brief with its derived formula.  


ANS:- The Euclidean distance (Euclidean metric) is the ordinary straight-line distance between two points in
Euclidean space; with this distance, Euclidean space becomes a metric space.
Euclidean distance:  d(p, q) = sqrt( (p1-q1)^2 + (p2-q2)^2 + ... + (pn-qn)^2 )
Manhattan distance: the distance between two points measured along axes at right angles, i.e. the sum of the
absolute differences of their coordinates:  d(p, q) = |p1-q1| + |p2-q2| + ... + |pn-qn|
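A small numeric illustration of the two formulas (points chosen arbitrarily):

import numpy as np

p = np.array([1.0, 2.0])
q = np.array([4.0, 6.0])

euclidean = np.sqrt(np.sum((p - q) ** 2))   # sqrt((1-4)^2 + (2-6)^2) = 5.0
manhattan = np.sum(np.abs(p - q))           # |1-4| + |2-6| = 7.0
print(euclidean, manhattan)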


6. List out basic steps involved in K Means clustering. 


ANS:- Steps involved in K Means clustering:
a. Initialize centroids by randomly choosing data points.
b. All the data points that are closest to a centroid will create a cluster.
c. Move the centroids based on new clusters formed.
d. Repeat the process until all the centroids got stabilized.

In-lab:-
1. The given dataset comprises 150 data entries of different countries around the
world. It is a report on world happiness, a landmark survey of the state of global
happiness that ranks 156 countries by how happy their citizens perceive themselves to
be, with a focus on the technologies, social norms, conflicts and government policies
that have driven those changes. The records contains various attributes of each country
that includes positive_effect, negative_effect, corruption, freedom, health life
expectancy etc. The data frame includes categorical variables, numerical values and
their values vary from country to country.
Implement a python code using scikit-learn to display a K-means clustering plot for
given data frame named “world_happiness_report.csv”.
Writing space of the Problem:(For Student’s use only)

ANS:- %matplotlib inline


from copy import deepcopy
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans
d=pd.read_csv(r'file path')            # replace with the path to world_happiness_report.csv
a=d['variable_name1'].values           # first numeric column of your choice (e.g. freedom)
b=d['variable_name2'].values           # second numeric column of your choice (e.g. corruption)
x=np.array(list(zip(a,b)))
plt.scatter(a,b,c='black',s=7)
k=3
x1=np.random.randint(0,np.max(x),size=k)
x2=np.random.randint(0,np.max(x),size=k)
p=np.array(list(zip(x1,x2)),dtype=np.float32)
print(p)
plt.scatter(a,b,c='#050505',s=7)
plt.scatter(x1,x2,marker='*',s=200,c='g')
kmeans=KMeans(n_clusters=3)
kmeans=kmeans.fit(x)
labels=kmeans.predict(x)
centroids=kmeans.cluster_centers_
print(labels)
print(centroids)


2. The given dataset named “Student_performance” consists of 150 data entries


of students in an institution that displays the performance of a student. It consists of
various attributes such as gender,
ethnicity, test_preparation, math_score, reading_score etc. Perform the K-means
clustering for the given dataset taking an appropriate number of centers based on mean
and standard deviation for the data entries. Analyze the cluster plot and give a brief
note based on results obtained.
Writing space of the Problem:(For Student’s use only)

ANS:- import pandas as pd


import numpy as np
from matplotlib import pyplot as plt
df=pd.read_csv(r'file path')                            # path to Student_performance.csv
df.drop('unnecessary_variable',axis=1,inplace=True)     # e.g. an ID column that is not useful for clustering
df["categorical_var"]=pd.Categorical(df["categorical_var"])   # encode a categorical column (e.g. gender) as codes
df["categorical_var"]= df["categorical_var"].cat.codes
data=df.values[:,0:4]
category=df.values[:,0]
k=3
n = data.shape[0]
c = data.shape[1]
mean = np.mean(data, axis = 0)
std = np.std(data, axis = 0)
centers = np.random.randn(k,c)*std + mean
colors=['orange', 'blue', 'green']
for i in range(1,n):
plt.scatter(data[i, 0], data[i,1], s=7, color = colors[int(category[i])])
plt.scatter(centers[:,0], centers[:,1], marker='*', c='g', s=150)


Post-lab:-
1. This lab module aims to build an analysis of the customers of a shopping mall. It consists of 150
observations of customers, with details that include gender, age, annual_income,
spending_score etc. Based on the two parameters annual_income and spending_score, try to
build an analysis of the customers through cluster graphs.

Apply k means clustering on the given data set named “Mall_customers” marking number of
clusters based on mean and standard deviation of any two attributes of your choice and
implement the K-means iteratively till the centroids get stabilized

Writing space of the Problem:(For Student’s use only)


ANS:- Note : In-lab continuation

centers_old = np.zeros(centers.shape)
centers_new = deepcopy(centers)

data.shape
clusters = np.zeros(n)
distances = np.zeros((n,k))

error = np.linalg.norm(centers_new - centers_old)

while error != 0:
for i in range(k):
distances[:,i] = np.linalg.norm(data - centers[i], axis=1)
clusters = np.argmin(distances, axis = 1)
centers_old = deepcopy(centers_new)
for i in range(k):
centers_new[i] = np.mean(data[clusters == i], axis=0)
error = np.linalg.norm(centers_new - centers_old)
centers_new

colors=['orange', 'blue', 'green']


for i in range(n):
plt.scatter(data[i, 0], data[i,1], s=7, color = colors[int(category[i])])
plt.scatter(centers_new[:,0], centers_new[:,1], marker='*', c='g', s=150)



Viva Voce:-
1. K-means is which type of algorithm.
2. In K-means clustering algorithm what is the criteria used by the data points to get
separated from one cluster to another.
3. What are the basic steps in K Means clustering.
4. What does K refer in K-means algorithm - K refers to k no. of clusters.
5. How is K-means algorithm is different from KNN algorithm

(For Evaluator’s use only)


Comment of the Evaluator (if Any) Evaluator’s Observation
Marks Secured: _______ out of ________

Full Name of the Evaluator:

Signature of the Evaluator Date of Evaluation:


Lab #10: Implementation of Fuzzy c-means clustering


Date of the Session: ___/___/___ Time of the Session: _____to______

Pre-Requisites:
Should have a prior knowledge on Fuzzy c-means clustering algorithm.

Pre-lab:-
1.Can a data point in Fuzzy c-means clustering belong to more than one cluster.

ANS:- Yes

2. If Partition matrix = and data points are (1,3) , (1.5,3.2) ,(1.3,2.8) ,(3,1)
then find cluster centres.

ANS:-

3. Tick mark the requirements of clustering:


a. Scalability
b. Ability to deal with different types of attributes
c. Ability to deal with noisy data
d. Knowledge on the attributes of the dataset

ANS:- a,b,c are correct answers

4. In Fuzzy c-means clustering after each iteration cluster centres are updated according to

the formula:
Where,
a. n =
b. µij =
c. m =

ANS:-
a. n = number of datapoints
b. µij = element in membership matrix
c. m = Fuzzian parameter


5. List the formulae’s required for Fuzzy c-means algorithm


a. For calculating centroid of each cluster.
b. For updating the membership matrix.

ANS:- a. Centroid of cluster j:   v_j = ( Σ_{i=1..n} µ_ij^m · x_i ) / ( Σ_{i=1..n} µ_ij^m )

b. Membership update:   µ_ij = 1 / Σ_{k=1..c} ( ||x_i − v_j|| / ||x_i − v_k|| )^( 2/(m−1) )


In-lab:-
1. Brief the Fuzzy c-means algorithm. And apply Fuzzy c-means to (1,3) , (1.5,3.20 ) ,
(1.3,2.6) , (3,1). Assume there are 2 clusters and Fuzziness parameter m=2.
Writing space of the Problem:(For Student’s use only)
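In brief, FCM alternates two steps until the memberships stop changing: (1) compute each cluster centre as the
membership-weighted mean of all points, and (2) recompute each membership from the relative distances of the point
to all centres. A compact NumPy sketch for the four given points (c = 2, m = 2; the initial membership matrix is
random, so the exact intermediate numbers will vary from run to run):

import numpy as np

X = np.array([[1, 3], [1.5, 3.2], [1.3, 2.6], [3, 1]], dtype=float)
c, m, n = 2, 2.0, len(X)

U = np.random.rand(n, c)
U = U / U.sum(axis=1, keepdims=True)        # each row of memberships sums to 1

for _ in range(20):
    um = U ** m
    centers = (um.T @ X) / um.sum(axis=0)[:, None]           # v_j = sum_i u_ij^m x_i / sum_i u_ij^m
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    dist = np.fmax(dist, 1e-10)                               # guard against division by zero
    U = 1.0 / ((dist[:, :, None] / dist[:, None, :]) ** (2 / (m - 1))).sum(axis=2)

print(np.round(centers, 3))
print(np.round(U, 3))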


2. Implement Fuzzy c-means in python to find updated membership matrix for the dataset
Fuzzy_Code.csv which has 44 attributes of numerical type and 1 attribute of categorial type.
Assume number of clusters =2 and Fuzziness parameter =2.

Writing space of the Problem:(For Student’s use only)


Post-lab:-

1. Write a snippet of python implementation to find accuracy and precision of the above
dataset through Fuzzy c-means algorithm.
Writing space of the Problem:(For Student’s use only)
ANS:- import pandas as pd
import numpy as np
import random
import operator
import math

df_full = pd.read_csv("SPECTF_New.csv")
columns = list(df_full.columns)
features = columns[:len(columns)-1]
class_labels = list(df_full[columns[-1]])
df = df_full[features]

# Number of Attributes
num_attr = len(df.columns) - 1

# Number of Clusters
k=2

# Maximum number of iterations


MAX_ITER = 100

# Number of data points


n = len(df)

# Fuzzy parameter
m = 2.00

def accuracy(cluster_labels, class_labels):


county = [0,0]
countn = [0,0]
tp = [0, 0]
tn = [0, 0]
fp = [0, 0]
fn = [0, 0]

for i in range(len(df)):
# Yes = 1, No = 0
if cluster_labels[i] == 1 and class_labels[i] == 'Yes':
tp[0] = tp[0] + 1
if cluster_labels[i] == 0 and class_labels[i] == 'No':
tn[0] = tn[0] + 1
if cluster_labels[i] == 1 and class_labels[i] == 'No':



fp[0] = fp[0] + 1
if cluster_labels[i] == 0 and class_labels[i] == 'Yes':
fn[0] = fn[0] + 1

for i in range(len(df)):
# Yes = 0, No = 1
if cluster_labels[i] == 0 and class_labels[i] == 'Yes':
tp[1] = tp[1] + 1
if cluster_labels[i] == 1 and class_labels[i] == 'No':
tn[1] = tn[1] + 1
if cluster_labels[i] == 0 and class_labels[i] == 'No':
fp[1] = fp[1] + 1
if cluster_labels[i] == 1 and class_labels[i] == 'Yes':
fn[1] = fn[1] + 1

a0 = float((tp[0] + tn[0]))/(tp[0] + tn[0] + fn[0] + fp[0])


a1 = float((tp[1] + tn[1]))/(tp[1] + tn[1] + fn[1] + fp[1])
p0 = float(tp[0])/(tp[0] + fp[0])
p1 = float(tp[1])/(tp[1] + fp[1])
r0 = float(tp[0])/(tp[0] + fn[0])
r1 = float(tp[1])/(tp[1] + fn[1])

accuracy = [a0*100,a1*100]
precision = [p0*100,p1*100]
recall = [r0*100,r1*100]

return accuracy, precision, recall

def initializeMembershipMatrix():
membership_mat = list()
for i in range(n):
random_num_list = [random.random() for i in range(k)]
summation = sum(random_num_list)
temp_list = [x/summation for x in random_num_list]
membership_mat.append(temp_list)
return membership_mat

def calculateClusterCenter(membership_mat):
cluster_mem_val = list(zip(*membership_mat))
cluster_centers = list()
for j in range(k):
x = list(cluster_mem_val[j])
xraised = [e ** m for e in x]
denominator = sum(xraised)
temp_num = list()



for i in range(n):
data_point = list(df.iloc[i])
prod = [xraised[i] * val for val in data_point]
temp_num.append(prod)
numerator = map(sum, zip(*temp_num))
center = [z/denominator for z in numerator]
cluster_centers.append(center)
return cluster_centers

def updateMembershipValue(membership_mat, cluster_centers):


p = float(2/(m-1))
for i in range(n):
x = list(df.iloc[i])
distances = [np.linalg.norm(list(map(operator.sub, x, cluster_centers[j]))) for j in range(k)]
for j in range(k):
den = sum([math.pow(float(distances[j]/distances[c]), p) for c in range(k)])
membership_mat[i][j] = float(1/den)
return membership_mat

def getClusters(membership_mat):
cluster_labels = list()
for i in range(n):
max_val, idx = max((val, idx) for (idx, val) in enumerate(membership_mat[i]))
cluster_labels.append(idx)
return cluster_labels

def fuzzyCMeansClustering():
# Membership Matrix
membership_mat = initializeMembershipMatrix()
curr = 0
while curr <= MAX_ITER:
cluster_centers = calculateClusterCenter(membership_mat)
membership_mat = updateMembershipValue(membership_mat, cluster_centers)
cluster_labels = getClusters(membership_mat)
curr += 1
print(membership_mat)
return cluster_labels, cluster_centers

labels, centers = fuzzyCMeansClustering()


a,p,r = accuracy(labels, class_labels)

print("Accuracy = " + str(a))


#print("Precision = " + str(p))




#print("This is")
#print("Recall = " + str(r))

Viva Voce:-
1. What are the minimum number of attributes required for clustering.
2. Can decision trees be used for performing clustering?
3. In Fuzzy c-means can a data point be a part of more than one cluster?
4. List some applications of Fuzzy c-means .
5. Is FCM a static or dynamic algorithm.

(For Evaluator’s use only)


Comment of the Evaluator (if Any) Evaluator’s Observation
Marks Secured: _______ out of ________

Full Name of the Evaluator:

Signature of the Evaluator Date of Evaluation:


Lab #11: Classification: Support Vector Machine (SVM)


Date of the Session: ___/___/___ Time of the Session: _____to______

Pre-lab:-
1. What is SVM? 
ANS:- “Support Vector Machine” (SVM) is a supervised machine learning algorithm that can be used for both
classification and regression problems, though it is mostly used for classification. In this algorithm, we plot each
data item as a point in n-dimensional space (where n is the number of features), with the value of each feature
being the value of a particular coordinate. We then perform classification by finding the hyperplane that best
separates the two classes.

2. When do we use SVM? 
ANS:- SVM is used for classification (and regression) tasks, particularly when the classes are separable by a clear
margin or when a non-linear decision boundary is needed: it uses a technique called the kernel trick to transform
the data and then, based on these transformations, finds an optimal boundary between the possible outputs.

3. What is maximum marginal hyper plane and what is the equation of separating hyper
plane? 
ANS:- A support vector machine performs classification by finding the hyperplane that maximizes the margin
between two classes.
W · X + b = 0
Where,
W is the weight vector, W = {w1, w2, ..., wn}; n is the number of attributes,
b is a scalar (the bias), and
X = (x1, x2, ..., xn) holds the values of the attributes.

4. What are the two cases of SVM? 


ANS:- The two cases are :
(i)Linearly Separable :
The case when the data are linearly separable
(ii)Linearly Inseparable :
The case when the data are linearly inseparable or not separable.

5. What are the equations for point that lies above the separating hyperplane and below the
separating hyperplane? 
ANS:- Point that lies above the separating hyperplane :
w0+w1x1+w2x2 > 0
Point that lies below the separating hyperplane :
w0+w1x1+w2x2 < 0
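A tiny numeric illustration of these two tests (the weights, bias and point are arbitrary example values):

import numpy as np

w = np.array([0.5, -0.3])    # w1, w2
w0 = 0.1                     # bias
x = np.array([2.0, 1.0])     # sample point (x1, x2)

score = w0 + np.dot(w, x)    # w0 + w1*x1 + w2*x2
print('above the hyperplane' if score > 0 else 'on or below the hyperplane', score)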


In-lab:-
1. Below is the data of the employees in the company. The data shows  whether employee
purchased software or not. Take x co-ordinate as age and y co-ordinate as
estimated_salary. Now, Consider the following dataset and perform the below operations: 
User ID  Gender  Age  Estimated Salary  Purchased 
15624510  Male  19  19000  0 
15810944  Male  35  20000  0 
15668575  Female  26  43000  0 
15603246  Female  27  57000  0 
15804002  Male  19  76000  0 
15728773  Male  27  58000  0 
15598044  Female  27  84000  0 
15694829  Female  32  150000  1 
15600575  Male  25  33000  0 
15727311  Female  35  65000  0 
15570769  Female  26  80000  0 
15606274  Female  26  52000  0 
15746139  Male  20  86000  0 
15704987  Male  32  18000  0 
15628972  Male  18  82000  0 
15697686  Male  29  80000  0 
15733883  Male  47  25000  1 
15617482  Male  45  26000  1 
15704583  Male  46  28000  1 
15621083  Female  48  29000  1 
15649487  Male  45  22000  1 
15736760  Female  47  49000  1 
15714658  Male  48  41000  1 
15599081  Female  45  22000  1 
15705113  Male  46  23000  1 
15631159  Male  47  20000  1 
15792818  Male  49  28000  1 
15633531  Female  47  30000  1 
15744529  Male  29  43000  0 
a. Import the dataset into python 
b. Split the dataset set into training and testing sets 
c. Apply feature scaling on training and test sets 
d. Fit SVM to the training set 
e. Visualize the training set results 
f. Visualize the test set results. 


Writing space of the Problem:(For Student’s use only)


ANS:-
a. import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
b. from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
c. from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
d. from sklearn.svm import SVC
classifier=SVC(kernel='linear',random_state = 0)
classifier.fit(X_train,y_train)
e. from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('SVM (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
f. from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('SVM (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()


Post-lab:-
 1. Below dataset represents the bank transactions of KVB bank for an hour. Consider x co-
ordinate as Balance and y co-ordinate as Trtn_amt. Perform following operations on
given dataset: 
S.No  transaction_ID  Balance  Trtn_amt  sucornot 
1  3467  98687.36  500  0 
2  4801  8510.47  100  0 
3  2093  2475.3  200  1 
4  9933  37743.25  1000  0 
5  7178  2705.95  600  0 
6  1093  60314  750  1 
7  3708  812129.5  280  1 
8  3804  8076.25  140  0 
9  3192  42323.14  310  1 
10  3666  47045.25  2500  0 
11  8598  96171.25  6900  0 
12  8743  608581.8  8520  1 
13  9302  586057.3  410  1 
14  6127  4587.5  750  0 
15  7502  43597.75  250  0 

a. Import the dataset into python 
b. Split the dataset set into training and testing sets 
c. Apply feature scaling on training and test sets 
d. Fit SVM to the training set 
e. Visualize the training set results 
f. Visualize the test set results. 

Writing space of the Problem:(For Student’s use only)


(a) import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
dataset = pd.read_csv('trns.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
(b) from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
(c) from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
(d) from sklearn.svm import SVC
classifier=SVC(kernel='linear',random_state = 0)
classifier.fit(X_train,y_train)



(e)
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('SVM (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
(f)
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('SVM (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()



Viva Voce:-
1. What are the advantages of SVM? 
2. How many types of machine learnings are there and in which type this svm fall under? 
3. What are the turning parameters in SVM? 

(For Evaluator’s use only)


Comment of the Evaluator (if Any) Evaluator’s Observation
Marks Secured: _______ out of ________

Full Name of the Evaluator:

Signature of the Evaluator Date of Evaluation:


Lab #12: Rule Based Classification


Date of the Session: ___/___/___ Time of the Session: _____to______

Pre-requisite: 
Refer Page no: 355-363 in Han J & Kamber M, “Data Mining: Concepts and Techniques”, Third
Edition, Elsevier, 2011    

Pre-lab:-

1. What is rule-based classification in data mining? 

2. Briefly explain about the building classification rules.   

3. When to stop building a rule? 

4. List some aspects of sequential covering. 

5. What are the characteristics of rule-based classifier?  

6. Define coverage and accuracy.


In-lab:-
1. Implement a simple python code for rule-based
classification on “AllElectronicsCustomer” database  
(Download the dataset from LMS) 

RID  age  income  student  Credit_rating  Class:buys computer   

1  youth  high  no  fair  no 


2  youth  high  no  excellent  no 
3  middle_aged  high  no  fair  yes 
4  senior  medium  no  fair  yes 
5  senior  low  yes  fair  yes 
6  senior  low  yes  excellent  no 
7  middle_aged  low  yes  excellent  yes 
8  youth  medium  no  fair  no 
9  youth  low  yes  fair  yes 
10  senior  medium  yes  fair  yes 
11  youth  medium  yes  excellent  yes 
12  middle_aged  medium  no  excellent  yes 
13  middle_aged  high  yes  fair  yes 
14  senior  medium  no  excellent  no 

a. Calculate accuracy, coverage and print the RID values when the following rules
are satisfied:  
o Rule R1: if the age of the person is in the category of “youth” and he/she is a
student then the person purchases the computer. 
o Rule R2: if age of the person is in the category of “middle_aged” , income is
either medium or high and with excellent Credit_rating  then the person buys a
computer  
o Rule R3: if age of the person is in the category of “senior” and he/she is a
student then purchases a computer. 
o Rule R4: if age of the person is in the category of “senior” , income is
high, he/she is a student and with Credit_rating fair then purchases a computer. 
  

Writing space of the Problem:(For Student’s use only)


ANS:-

import math
import pandas as pd
import numpy as np
dt=pd.read_csv('AllElectronicsCustomer.csv')
print(dt)
print()
D=len(dt)
ncovers=0



ncorrect=0
c=0
j=int(input("Enter rule number(<=4):"))
for i in range(len(dt)):
if j==1:
if ((dt.iloc[i,1]=="youth")and(dt.iloc[i,3]=="yes")) :
ncovers+=1
if(dt.iloc[i,5]=="yes"):
ncorrect+=1
print(str(i+1)+" buys the computer")
else :
c+=1
elif j==2:
if ((dt.iloc[i,1]=="middle_aged") and (dt.iloc[i,2] in ("medium","high")) and
(dt.iloc[i,4]=="excellent")) :
ncovers+=1
if(dt.iloc[i,5]=="yes"):
ncorrect+=1
print(str(i+1)+" buys the computer")
else :
c+=1
elif j==3:
if ((dt.iloc[i,1]=="senior")and (dt.iloc[i,3]=="yes")) :
ncovers+=1
if(dt.iloc[i,5]=="yes"):
ncorrect+=1
print(str(i+1)+" buys the computer")
else :
c+=1
else:
if ((dt.iloc[i,1]=="senior")and (dt.iloc[i,2]== "high") and (dt.iloc[i,3]=="yes")) :
ncovers+=1
if(dt.iloc[i,5]=="yes"):
ncorrect+=1
print(str(i+1)+" buys the computer")
else :
c+=1

if(ncovers==0):
print("none of the tuples satisfies the given rule")
print()
coverage=(ncovers/D)
print("coverage")
print(coverage)



print()
accuracy=(ncorrect/ncovers)
print("accuracy")
print(accuracy) 

Post-lab:-
1. Extract possible classification rules from the given decision tree.

ANS:-

2. Write the sequential covering algorithm used in rule induction.


ANS:- Basic sequential covering algorithm (rules are learned one at a time; the tuples covered by each new rule are
removed before the next rule is learned):
Input:  D, a data set of class-labeled tuples; Att_vals, the set of all attributes and their possible values.
Output: a set of IF-THEN rules.
Method:
    Rule_set = { };                           // the set of rules learned so far is empty
    for each class c do
        repeat
            Rule = Learn_One_Rule(D, Att_vals, c);
            remove the tuples covered by Rule from D;
            Rule_set = Rule_set + Rule;       // add the new rule to the rule set
        until terminating condition;
    endfor
    return Rule_set;

3. Difference between Decision tree and rule based classification.

Writing space of the Problem:(For Student’s use only)

Viva Voce:-
1. Rule-Based classifier classify records by using a collection of ______ rules.
2. Most rule-based classification systems use which strategy?
3. Difference between class-based ordering and rule-based ordering.
4. Briefly explain the below terms in your own words:
a. Mutually exclusive
b. Exhaustive
5. Name the terms that define the following statements:
a. Fraction of records that satisfy only antecedent of a rule.
b. Fraction of records that satisfy both antecedent and consequent of a rule.

(For Evaluator’s use only)


Comment of the Evaluator (if Any) Evaluator’s Observation
Marks Secured: _______ out of ________

Full Name of the Evaluator:

Signature of the Evaluator Date of Evaluation:


Lab #13: Hierarchical Clustering


Date of the Session: ___/___/___ Time of the Session: _____to______

Pre-lab:-
1. Problem: If you are a business person trying to get the best return on your marketing
investment, it is crucial that you target people in the right way. If you get it wrong, you
risk not making any sales, or worse, damaging your customer trust.
Think of how hierarchical clustering can help you to solve this problem.

ANS:-
Business Problem: The enterprise wishes to organize customers into groups/segments based on
similar traits, product preferences and expectations. Segments are constructed based on
customer demographic characteristics, psychographics, past behavior and product use behavior.
Business Benefit: Once the segments are identified, marketing messages and products can be
customized for each segment. The better the segment(s) chosen for targeting by a particular
organization, the more successful the business will be in the market.
Hierarchical Clustering can help an enterprise organize data into groups to identify similarities
and, equally important, dissimilar groups and characteristics, so that the business can target
pricing, products, services, marketing messages and more.

2. How does hierarchical clustering work?


ANS:- Hierarchical clustering starts by treating each observation as a separate cluster. Then, it
repeatedly executes the following two steps: (1) identify the two clusters that are closest
together, and (2) merge the two most similar clusters. This continues until all the clusters are
merged together.

3. What are the strengths and weaknesses of Hierarchical Clustering?


ANS:- Strengths of Hierarchical Clustering
· Conceptually simple, and its theoretical properties are well understood.
· Does not require the number of clusters k in advance; the dendrogram provides clusterings at every level of granularity.
· Once clusters are merged/split the decision is permanent, so the number of different alternatives that need to be examined is reduced.
Weaknesses of Hierarchical Clustering
· Merge/split decisions are permanent and cannot be undone, so early mistakes cannot be corrected later.
· Divisive methods can be computationally hard.
· The methods are not (necessarily) scalable for large datasets.
· A termination condition is needed, and the final mode in both agglomerative and divisive clustering (all points in one cluster, or every point its own cluster) is of no use by itself.

4. What are the metrics used to compute the linkage?

5. List out the parameters and attributes involved in AgglomerativeClustering function.


In-lab:-
1. This dataset is taken from the Motor Trend US magazine, and comprises fuel consumption
and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).
Assume that you work for Motor Trend, a magazine about the automobile industry. Looking at a
data set of a collection of cars, they are interested in exploring the relationship between a set of
variables. Help them perform the desired analysis to get meaningful insights.

Import the following packages before proceeding:


numpy
pandas
scipy
pylab
seaborn
matplotlib
sklearn

(i) Split the dataset into two halves wherein the first half named ‘X’ contains
the columns 1,3,4,6 and the second half named ‘Y’ contains the column 9.
(ii) Apply the linkage function on ‘X’ considering the ward method. Construct
a dendrogram by giving appropriate title, x and y labels.
(iii) Plot a horizontal line for getting the precise number of clusters and store the value in
the variable ‘k’.
(iv) Perform the AgglomerativeClustering by inputting following different combinations
for affinity and linkage and then compare the accuracy between the fitted clustering and
the variable ‘Y’.
affinitylinkage
a)euclidean ward
b)euclidean complete
c)Euclidean average
d)manhattan average

Writing space of the Problem:(For Student’s use only)
import numpy as np
import pandas as pd

import scipy
from scipy.cluster.hierarchy import dendrogram,linkage
from scipy.cluster.hierarchy import fcluster
from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import pdist

from pylab import rcParams


import seaborn as sb
import matplotlib.pyplot as plt

import sklearn
from sklearn.cluster import AgglomerativeClustering
import sklearn.metrics as sm

# setting precision value


np.set_printoptions(precision=4,suppress=True)
plt.figure(figsize=(10,3))
%matplotlib inline
plt.style.use('seaborn-whitegrid')

#Splitting the data


address=r'C:\Users\Himani Agarwal\Desktop\DATA SCIENCE\Data Sets\mtcars.csv'
cars=pd.read_csv(address)
cars.columns=['car_names','mpg','cyl','disp','hp','drat','wt','qsec','vs','am','gear','carb']
X=cars.iloc[:,[1,3,4,6]].values    # mpg, disp, hp, wt
y=cars.iloc[:,9].values            # am (transmission)

# Z = linkage(X, method) creates the tree using the specified method, which describes how to
# measure the distance between clusters; Z = linkage(X, method, metric) performs clustering by
# passing metric to the pdist function, which computes the distances between the rows of X.
Z=linkage(X,'ward')

# Plotting dendrogram
dendrogram(Z,truncate_mode='lastp',p=12,leaf_rotation=45.,leaf_font_size=15.,show_contracted=True)

plt.title('Truncated Hierarchical Clustering Dendrogram')


plt.xlabel('Cluster Size')
plt.ylabel('Distance')
plt.axhline(y=500)
plt.axhline(y=150)
plt.show()

# Accuracy using AgglomerativeClustering with euclidean and ward


k=3
Hclustering=AgglomerativeClustering(n_clusters=k,affinity='euclidean',linkage='ward')
Hclustering.fit(X)
sm.accuracy_score(y,Hclustering.labels_)

# Accuracy using AgglomerativeClustering with euclidean and complete
k=2
Hclustering=AgglomerativeClustering(n_clusters=k,affinity='euclidean',linkage='complete')
Hclustering.fit(X)
sm.accuracy_score(y,Hclustering.labels_)

# Accuracy using AgglomerativeClustering with euclidean and complete

k=2
Hclustering=AgglomerativeClustering(n_clusters=k,affinity='euclidean',linkage='average')
Hclustering.fit(X)
sm.accuracy_score(y,Hclustering.labels_)

# Accuracy using AgglomerativeClustering with manhattan and average


k=2
Hclustering=AgglomerativeClustering(n_clusters=k,affinity='manhattan',linkage='average')
Hclustering.fit(X)
sm.accuracy_score(y,Hclustering.labels_)


2.Write the advantages and disadvantages of each of the following methods:


a. Ward’s method
b. Group Average linkage
c. Single linkage
d. Complete linkage
Writing space of the Problem:(For Student’s use only)

ANS:-
Pros of Ward’s method:
· Ward’s method approach also does well in separating clusters if there is noise between
clusters.
Cons of Ward’s method:
· Ward’s method approach is also biased towards globular clusters.
Pros of Group Average:
· Ward’s method approach also does well in separating clusters if there is noise between
clusters.
.
Cons of Group Average:
· The group Average approach is biased towards globular clusters.
Pros of Single Linkage:
· This approach can separate non-elliptical shapes as long as the gap between two clusters
is not small.
Cons of Single Linkage:
· MIN approach cannot separate clusters properly if there is noise between clusters.
Pros of Complete Linkage:
· MAX approach does well in separating clusters if there is noise between clusters.
Cons of Complete Linkage:
· Max approach is biased towards globular clusters.
· Max approach tends to break large clusters.


Post-lab:-
1. Earning a master's degree helps you gain specialized knowledge to advance in your field. You
can focus on a particular field of study, which helps you become more competitive in
your field.Even the most qualified and confident applicants worry about getting into masters
program. But don’t panic! Graduate school acceptance rates, which give the percentage of
applicants that were accepted to a particular program in an academic year, can help you
determine how likely you are to get into a given program.

This dataset helps us to predict Graduate Admissions from an Indian perspective. The dataset
contains a few parameters which are viewed as significant during the application for Masters
Programs. The parameters included are :
1. GRE Scores ( out of 340 )
2. TOEFL Scores ( out of 120 )
3. College Rating ( out of 5 )
4. Mission statement and Letter of Recommendation Strength ( out of 5 )
5. Undergrad GPA ( out of 10 )
6. Research Experience ( either 0 or 1 )
7. Possibility of Admit ( extending from 0 to 1 ).

Helps students in shortlisting universities with their profiles by predicting output giving them a
fair idea about their chances for a particular university.

Download the dataset from the following link:


https://www.kaggle.com/mohansacharya/graduate-admissions

a.Apply agglomerative clustering on the following dataset using ward method for the following:-
CGPA and chance of admit.
GRE and chance of admit.
University ranking and chance of admit.

b. Plot dendrogram for each case listed in the previous bit .

Note: An important point to understand while performing clustering is choosing the correct
number of clusters. In hierarchical clustering, the longest vertical distance without any
horizontal line passing through it is selected and a horizontal line is drawn through it. The number of
vertical lines this newly created horizontal line crosses is equal to the number of clusters.

c. Draw a horizontal line to get the precise number of clusters.

d. Display the scatter plots between all the influencing factors (CGPA, GRE, TOEFL, SOP, LOR
and University rating) and chance of admit.

e. By visualising the above scatter plots, rank the factors in order of linearity to analyze which
factor influences the chance of admit the most.
Example: x<y<z means that a greater value of z implies a greater chance of admit.


Writing space of the Problem:(For Student’s use only)


import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import sklearn.metrics as sm
from sklearn.metrics import accuracy_score

#importing the dataset


customer_data = pd.read_csv(r'C:\Users\Hp\Desktop\BIG DATA ANALYTICS\R Programing\DATASET\graduate-admissions\ad.csv')
customer_data.head()

# Considering CGPA and chance of admit in the data frame; change the column indices below to consider different pairs of attributes
data = customer_data.iloc[:, [2,8]].values
#print(data)
import scipy.cluster.hierarchy as shc
plt.figure(figsize=(10, 7))
plt.title("Customer Dendrograms")

# Plotting Dendrogram
dend = shc.dendrogram(shc.linkage(data, method='ward'),truncate_mode='level')

#Horizontal line to know the number of clusters


plt.axhline(y=100,color='r',linestyle='--')
from sklearn.cluster import AgglomerativeClustering
cluster = AgglomerativeClustering(n_clusters=5, affinity='euclidean', linkage='ward')
cluster.fit_predict(data)
#sm.accuracy_score(y,cluster.labels_)
plt.figure(figsize=(10, 7))

#Display the different scatter plots and comparing the attribute ranking.
plt.scatter(data[:,0], data[:,1], c=cluster.labels_, cmap='rainbow')

2. Your university is going to participate in an inter college cricket tournament. Therefore, you
need to select the best set of players to form an efficient team. You decide to take a physical
exam of all the interested students and then segregate them into 2 groups namely fit/unfit. Which
kind of data science technique will you use? Classification or Clustering?
Writing space of the Problem:(For Student’s use only)
ANS:-
In this case the classes are already defined, namely fit and unfit. Therefore, this is a classification
task. Classification is the process of assigning data to predefined class labels. Clustering, on the
other hand, is similar to classification but has no predefined class labels. Classification is a
supervised learning technique, whereas clustering is an unsupervised learning technique.

Viva Voce:-
1. Why is hierarchical clustering considered as an unsupervised machine learning algorithm?
2. List out the two kinds of hierarchical clustering techniques. How do the two kinds differ from
each other?
3. State the differences between hierarchical and K-means clustering.
4. What are the real world applications of clustering?
5. What are the different linkage methods to use agglomerative clustering effectively?

(For Evaluator’s use only)


Comment of the Evaluator (if Any) Evaluator’s Observation
Marks Secured: _______ out of ________

Full Name of the Evaluator:

Signature of the Evaluator Date of Evaluation:


Lab #14: Outliers Detection


Date of the Session: ___/___/___ Time of the Session: _____to______

Pre-lab:-
1.What do you mean by an outlier? What are the main causes for outliers?
ANS:-
Outliers are extreme values that deviate from the other observations in the data; they may indicate
variability in a measurement, experimental errors, or a novelty.
Most common causes of outliers on a data set:
· Data entry errors (human errors)
· Measurement errors (instrument errors)
· Experimental errors (data extraction or experiment planning/executing errors)
· Intentional (dummy outliers made to test detection methods)
· Data processing errors (data manipulation or data set unintended mutations)
· Sampling errors (extracting or mixing data from wrong or various sources)
· Natural (not an error, novelties in data)

2. What are the important methods for outlier detection?


ANS:-
Some of the important methods for outlier detection are:
· Z-Score or Extreme Value Analysis (parametric)
· Probabilistic and Statistical Modeling (parametric)
· Linear Regression Models (PCA, LMS)
· Proximity Based Models (non-parametric)
· Information Theory Models
· High Dimensional Outlier Detection Methods (high dimensional sparse data)

3. Why is outlier detection necessary in data analysis?


Data analytics deals with making observations with various data sets and trying to make sense of the
data. One of the most important tasks when dealing with very large data sets is to find an outlier.
Outliers can potentially skew or bias any analysis performed on the dataset. It is therefore very
important to detect and adequately deal with outliers.

4. How do we calculate z-score?


A z-score measures exactly how many standard deviations above or below the mean a data point is.
The formula for calculating a z-score is:
z = (data point - mean) / standard deviation
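
For illustration, a minimal sketch applying this formula directly with NumPy on a small made-up sample (the numbers are only illustrative):

import numpy as np

data = np.array([52, 55, 53, 58, 54, 90])
z = (data - data.mean()) / data.std()   # z-score of every data point
print(z)                                # the value 90 gets the largest |z|, flagging it as a potential outlier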


5. Consider the below dataset which comprises of the income( in thousands) of 15 people in
an organisation.
[ 45, 51, 63, 48, 67, 48, 56, 2, 62, 59, 44, 61, 99, 46, 52]
What do you observe from the above data? Is there any significant difference between the
income of few employees? If so, what could be the reason of it?

Writing space of the Problem:(For Student’s use only)

The observations 2 and 99 are outliers because their values deviate the most from the
remaining observations. These observations are also very important for this dataset, because such
values could correspond to, for example, the salary of a peon and the salary of the CEO of the organization.
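
A quick numerical check of this observation, using the IQR rule on the given list (a minimal sketch; the 1.5 × IQR fences are the usual convention):

import numpy as np

income = np.array([45, 51, 63, 48, 67, 48, 56, 2, 62, 59, 44, 61, 99, 46, 52])
q1, q3 = np.percentile(income, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(income[(income < lower) | (income > upper)])   # flags 2 and 99 as outliers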


In-lab:-
1. The dataset Boston house prices consists of 9 attributes CRIM, ZN, INDUS, LSTAT, NOX,
RM, DIS, RAD, TAX. The description of each attribute
 CRIM per capita crime rate by town
 ZN proportion of residential land zoned for lots over 25,000 sq.ft.
 INDUS proportion of non-retail business acres per town
 LSTAT % lower status of the population
 NOX nitric oxides concentration (parts per 10 million)
 RM average number of rooms per dwelling
 DIS weighted distances to five Boston employment centres
 RAD index of accessibility to radial highways
 TAX full-value property-tax rate per $10,000

Boston dataset: https://drive.google.com/file/d/1YVYWQWPKsLX1UM-0XCnGCwD1NIi7_uIv/view?usp=sharing

a. Using boxplot detect which columns have outliers


b. Implement scatterplot between INDUS and TAX and inspect the outliers
c. Apply z_score outlier detection method on Boston dataset considering threshold = 3
d. Print any five z_score values of the outliers.
e. Remove all the outliers obtained from the dataset and refashion the dataset.
f. Apply inter quartile range (IQR) outlier detection on the dataset and print the IQR values of
each column.
g. Calculate lower_bound and upper_bound and print boolean values wherein the outliers are
represented as TRUE.
h. Remove all the outliers produced by inter quartile range method and refashion the
dataset.

Writing space of the Problem:(For Student’s use only)

import pandas as pd
import numpy as np
boston=pd.read_csv(r'C:\Users\Hp\Desktop\BIG DATA ANALYTICS\R Programing\DATASET\boston.csv')
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
# boxplot for a column
sns.boxplot(x=boston["TAX"])
# scatterplot between INDUS and TAX
boston_c = boston
fig, ax = plt.subplots(figsize=(16,8))
ax.scatter(boston_c["INDUS"], boston_c["TAX"])
ax.set_xlabel('proportion of non-retail business acres per town')
ax.set_ylabel('full-value property-tax rate per $10,000')
plt.show()

# z_score method
from scipy import stats
zscore = np.abs(stats.zscore(boston_c))
print(zscore)
threshold = 3
# np.where returns two arrays: the first gives the row indices and the second the column indices of the outliers
print(np.where(zscore > threshold))
# first five outlier z_score values
print(zscore[55][1])
print(zscore[56][1])
print(zscore[57][1])
print(zscore[141][3])
print(zscore[199][1])
# removing outliers using z_score
boston_clean = boston
boston_clean = boston_clean[(zscore < 3).all(axis=1)]
boston.shape
boston_clean.shape
# IQR method
boston_iqr = boston
Q1 = boston_iqr.quantile(0.25)
Q1
Q3 = boston_iqr.quantile(0.75)
IQR = Q3 - Q1
#printing IQR values of each column
print(IQR)
# prints Boolean values in which the outliers are represented as True
print((boston_iqr < (Q1 - 1.5 * IQR)) | (boston_iqr > (Q3 + 1.5 * IQR)))
#Remove Outliers using IQR
boston_iqr_clean = boston_iqr[~((boston_iqr < (Q1 - 1.5 * IQR)) | (boston_iqr > (Q3 + 1.5 *
IQR))).any(axis=1)]


2. Consider the iris dataset. It includes three iris species with 50 samples each as well as some
properties about each flower. The columns in this dataset are:
 SepalLengthCm
 SepalWidthCm
 PetalLengthCm
 PetalWidthCm
 Species

https://drive.google.com/file/d/1HEEMrAQqAynHdM5TmK0G-mD5Qr0OW2J8/view?usp=sharing

Import the csv file and use the boxplot method to visualise the outliers considering the 4
properties of a flower. You will notice that one of the properties has outliers.

1.Considering the range of the outliers from the visualisation, display the observations which
have outliers.
2.Implement a DBSCAN model fitting on the dataset taking epsilon value as 0.8 and minimum
samples value as 19.
3.Print the counter values using the counter function on the model labels.
4.Considering the values obtained from the model labels print the outliers of the data.
5.Draw a scatter plot between petal length and sepal width to visualise the outliers.

Writing space of the Problem:(For Student’s use only)

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

from pylab import rcParams

%matplotlib inline

rcParams['figure.figsize']=5,4

df=pd.read_csv(r'C:\Users\Himani Agarwal\Desktop\DATA SCIENCE\Data Sets\irisnew.csv')

df.columns=['Sepal.Length','Sepal.Width','Petal.Length','Petal.Width','Species']

X=df.iloc[:,0:4].values

y=df.iloc[:,4].values

df[:5]

#boxplot of all the 4 columns

df.boxplot(return_type='dict')

plt.plot()

# prints the observations whose Sepal.Width is greater than 4 (the threshold 4 was chosen by observing the box plot)

Sepal_Width=X[:,1]

iris_outliers=(Sepal_Width>4)

df[iris_outliers]

# prints the observations whose Sepal.Width is less than 2.05 (the threshold 2.05 was chosen by observing the box plot)

Sepal_Width=X[:,1]

iris_outliers=(Sepal_Width<2.05)

df[iris_outliers]

# setting the precision value and printing the summary

pd.options.display.float_format='{:.1f}'.format

X_df=pd.DataFrame(X)

print(X_df.describe())

#DBSCAN method

import seaborn as sb

import sklearn

from sklearn.cluster import DBSCAN

from collections import Counter

%matplotlib inline

sb.set_style('whitegrid')

#DBSCAN function

data = X   # fit the DBSCAN model on the four numeric feature columns
model=DBSCAN(eps=0.8,min_samples=19).fit(data)

print(model)

# counter values gives the count of every cluster

outliers_df=pd.DataFrame(data)

print(Counter(model.labels_))

# -1 represents outliers

print(outliers_df[model.labels_==-1])

# scatter plot between petal length and sepal width

fig=plt.figure()

ax=fig.add_axes([.1,.1,1,1])

colors=model.labels_

ax.scatter(data[:,2],data[:,1],c=colors,s=120)

ax.set_xlabel('Petal Length')

ax.set_ylabel('Sepal Width')

plt.title('DBSCAN for outlier detection')


Post-lab:-
1. Consider the following student dataset
https://drive.google.com/file/d/1edmKnHjXkTyHT6gSYhwLw9rTpzoy1Cig/view?usp=sharing
which consists of student details of two schools in a town.
i. Find the students who have taken many more leaves than the average number of
absences by implementing a z_score function that takes the mean and standard
deviation into account.

ii. Find the number of students who got the lowest and highest scores in the subject G1,
considering threshold = 2.5.

iii. Apply boxplot for the above two instances.

import numpy as np
import pandas as pd
import seaborn as sn
import matplotlib.pyplot
from collections import Counter

dataset=pd.read_csv("student_dataset.csv")
outliers=[]
matplotlib.pyplot.boxplot(dataset["absences"]) # follow the same procedure for G1

def detect_outlier(data_1):
    c=0
    threshold=3
    mean_1 = np.mean(data_1)
    std_1 = np.std(data_1)
    for y in data_1:
        z_score = (y - mean_1)/std_1
        if np.abs(z_score) > threshold:
            outliers.append(y)
            c=c+1
    print(c)
    return outliers

outlier_datapoints = detect_outlier(dataset["absences"])
print(outlier_datapoints)
dataset.describe()

2. Can we find outliers for categorical values? Explain.


3. A sugar factory weighs every sugar packet on a weighing machine before packing
them into cartons. As per the guidelines of the factory, the standard weight of each sugar
packet should be 60 grams. It has been observed that during the final weighing of the
packets, a few of them showed an anomalous weight due to malfunctioning of the weighing
machines.
Consider the below dataset, which comprises the weights of the packets.
https://drive.google.com/file/d/1JkdkQ3j-J93DCfZa3kUjDycEtRzShk6V/view?usp=sharing

a. Find those anomalous weights by plotting a histogram.

b. In the range 0 to 1, consider lower_bound = 0.1 and upper_bound = 0.9 and find
the outliers using the quantile method.

c. Segregate the outliers from the inliers using the “loc” method to get the values of
“true_index”. Also obtain the values of “false_index”.

d. Now find the median from the values obtained in “true_index”.

e. Replace all the outliers with the median.

Writing space of the Problem:(For Student’s use only)
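
One possible solution sketch is given below. It assumes the downloaded CSV is saved as 'sugar_packets.csv' and that the weight column is named 'weight' (both names are assumptions; adjust them to the actual file). Here true_index is taken to be the in-range values, and the flagged outliers are replaced with the median of those in-range values:

import pandas as pd
import matplotlib.pyplot as plt

weights = pd.read_csv('sugar_packets.csv')            # hypothetical file name for the linked dataset

# a. histogram to spot the anomalous weights
weights['weight'].plot(kind='hist', bins=30)          # 'weight' column name is an assumption
plt.xlabel('packet weight (grams)')
plt.show()

# b. quantile-based bounds
lower_bound = weights['weight'].quantile(0.1)
upper_bound = weights['weight'].quantile(0.9)

# c. segregate inliers (true_index) from outliers (false_index) using loc
in_range = weights['weight'].between(lower_bound, upper_bound)
true_index = weights.loc[in_range, 'weight']          # values inside the bounds
false_index = weights.loc[~in_range, 'weight']        # flagged outliers

# d. median of the in-range values
median_weight = true_index.median()

# e. replace the outliers with that median
weights.loc[~in_range, 'weight'] = median_weight
print(weights.describe())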

Viva Voce:-

1. Is it good to remove an outlier from the dataset all the time?


2. What are the applications of outlier detection?
3. What are the different types of outliers?
4. Are outliers just side products of some clustering algorithms?
5. What is the difference between noise and an anomaly?

(For Evaluator’s use only)


Comment of the Evaluator (if Any) Evaluator’s Observation
Marks Secured: _______ out of ________

Full Name of the Evaluator:

Signature of the Evaluator Date of Evaluation:
