18CS3159 DATA WAREHOUSING AND MINING
LABORATORY WORKBOOK
STUDENT
NAME
REG. NO
YEAR
SEMESTER
SECTION
FACULTY
Table of Contents
Table of Contents .................................................................................................................................... 3
Organization of the STUDENT LAB WORKBOOK ............................................................................. 4
Lab #1: Basic Statistical Descriptions ................................................................................................... 5
Lab #2: To implement data pre-processing techniques. ...................................................................... 11
Lab #3: To implement principal component analysis .......................................................................... 20
Lab #4: Classification using Decision Trees. ......................................................................................... 24
Lab #5: Classification using K Nearest Neighbour. ............................................................................... 31
Lab #6: Classification using Bayesian Classifiers .................................................................................. 38
Lab #7: Classification using Backpropagation………………………………………………………………………………. 44
Lab #8: Association Rule Mining - Apriori ......................................................................................... 50
Lab #9: Implementation of K-Means Clustering ............................................................................... 57
Lab #10: Implementation of Fuzzy c-means clustering..................................................................... 63
Lab #11: Classification: Support Vector Machine (SVM) .................................................................. 69
Lab #12: Rule Based Classification .................................................................................................... 74
Lab #13: Hierarchical Clustering ....................................................................................................... 81
Lab #14: Outliers detection ……………………………………………………………………………………………………………88
Organization of the STUDENT LAB WORKBOOK
The laboratory framework includes a creative element but shifts the time-intensive aspects
outside of the Two-Hour closed laboratory period. Within this structure, each laboratory
includes three parts: Prelab, In-lab, and Post-lab.
a. Pre-Lab
The Prelab exercise is a homework assignment that links the lecture with the laboratory
period - typically takes 2 hours to complete. The goal is to synthesize the information they
learn in lecture with material from their textbook to produce a working piece of software.
Students attending a two-hour closed laboratory are expected to make a good-faith
effort to complete the Prelab exercise before coming to the lab. Their work need not be
perfect, but their effort must be real (roughly 80 percent correct).
b. In-Lab
The In-lab section takes place during the actual laboratory period. The First hour of the
laboratory period can be used to resolve any problems the students might have
experienced in completing the Prelab exercises. The intent is to give constructive
feedback so that students leave the lab with working Prelab software - a significant
accomplishment on their part. During the second hour, students complete the In-lab
exercise to reinforce the concepts learned in the Prelab. Students leave the lab having
received feedback on their Prelab and In-lab work.
c. Post-Lab
The last phase of each laboratory is a homework assignment that is done following the
laboratory period. In the Post-lab, students analyse the efficiency or utility of a given
system call. Each Post-lab exercise should take roughly 120 minutes to complete.
Lab #1: Basic Statistical Descriptions
Pre-lab
1. What are the various ways to measure the central tendency of data?
The three most common measures of central tendency are the mean, median, and mode.
Range, Quartiles, Variance, Standard Deviation, and Interquartile Range are the measures of dispersion.
The distance between the first and third quartiles is a simple measure of spread that gives the range
covered by the middle half of the data. This distance is called the interquartile range (IQR).
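A quick illustration of these measures with NumPy and the statistics module (the data values here are made up for illustration and are not from the lab dataset):
import numpy as np
from statistics import mode

data = [4, 8, 8, 15, 16, 23, 42]            # illustrative values only
print(np.mean(data), np.median(data), mode(data))
q1, q3 = np.percentile(data, [25, 75])
print("IQR =", q3 - q1)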
4. Observe the following diagrams; identify the quantile plot and the q-q plot. How is the q-q
plot different from a quantile plot?
A quantile plot is a simple and effective way to have a first look at a univariate
data distribution.
A quantile–quantile plot, or q-q plot, graphs the quantiles of one univariate distribution
against the corresponding quantiles of another.
6. Identify the symmetric data, positively skewed data and negatively skewed data from the
below graphs?
a) b) c)
In-lab
1. Given a data set “cars” for analysis it includes the variables speed and distance.
(Download the dataset from lms)
a) What are the average speed and the distance of the cars?
b) What is the median and midrange of the data?
c) Find mode of the data and comment on the data modality (i.e, unimodal or bimodal)?
d) What are the variance and the standard deviation of the data?
e) Find the five number summary of the data?
f) Show the histogram and box plot of the data?
1) data.mean()
speed 10.684211
dist 25.526316
dtype: float64
data.median()
speed 11.0
dist 24.0
dtype: float64
data.mode()
speed dist
0 10 26.0
1 12 NaN
d1=(data.max()+data.min())/2
d1
speed 10.0
dist 31.0
dtype: float64
Writing space of the Problem:(For Student’s use only)
statistics.stdev(data['speed'])
3.4165062007675386
statistics.stdev(data['dist'])
16.540752438388957
statistics.variance(data['speed'])
11.67251461988304
statistics.variance(data['dist'])
273.5964912280702
data.min()
speed 4
dist 2
dtype: int64
data.max()
speed 16
dist 60
dtype: int64
quartiles=percentile(data,[25,50,75])
quartiles[0]
10.0
In [33]:
quartiles[1]
14.0
In [34]:
quartiles[2]
23.5
In [47]:
data.boxplot()
<matplotlib.axes._subplots.AxesSubplot at 0x1c7bec5ff60>
data.plot.hist()
<matplotlib.axes._subplots.AxesSubplot at 0x1c7c0719cc0>
Writing space of the Problem:(For Student’s use only)
Post-lab
1. Suppose that a hospital tested the age and body fat data for 18 randomly selected adults with
the following results. (Download the dataset from lms)
a) Find the maximum and the minimum percentage of the fat and age of the adults who
visited the hospital.
b) Calculate mean, median and midrange of the age.
c) Find the first quartile and third quartile of the data.
d) Draw a scatter plot and q-q plot based on these two variables.
df.max()
age 61.0
%fat 42.5
dtype: float64
b)
df.mean()
age 46.444444
%fat 28.783333
dtype: float64
df.median()
age 51.0
%fat 30.7
dtype: float64
mid range
(df.min()+df.max())/2
age 42.00
%fat 25.15
dtype: float64
c)
quartiles=np.percentile(df,[25,50,75])
quartiles[0]
27.15
quartiles[1]
34.35
quartiles[2]
50.5
d)
import pandas as pd
Writing space of the Problem:(For Student’s use only)
Writing space of the Problem:(For Student’s use only)
Viva Voce:-
1. Difference between symmetric data and skewed data.
2. What are the most widely used forms of quartiles?
3. Variance and Standard deviation fall under what category of measuring data?
4. What do low and high standard deviations indicate?
5. Based on what condition, two variables are said to be correlated?
Lab #2: To implement data pre-processing techniques.
Date of the Session: ___/___/___ Time of the Session: _____to______
Pre-lab
DATA PREPROCESSING:
Databases are highly susceptible to noisy, missing, and inconsistent data because of
their typically enormous size (frequently several gigabytes or more). Low-quality data
will lead to low-quality mining results. Pre-processing helps to obtain quality data. The steps
involved in data pre-processing are data cleaning, data integration, data reduction, and data
transformation.
ANS:- 1.c 2.f 3.a 4.g 5.h 6.b 7.e 8.i 9.d
1. Mention any two methods that deals with missing values and noisy data.
a. Use a measure of central tendency for the attribute (e.g., the mean or median) to
fill in the missing value.
b. drop the tuple
c. use binning technique to deal with noisy data
2. Mention two techniques that are applied to obtain a reduced data set.
The techniques that are applied to obtain a reduced data set are dimensionality reduction,
numerosity reduction.
3.Using min-max normalization, transform the value 35 onto the range [0.0, 1.0].
import numpy as np
import math
from statsmodels import robust
df1=[200,300,400,600,1000]
df = np.asarray(df1)
normalized_df=(35-df.min())/(df.max()-df.min())
normalized_df
o/p: -0.20625
4. Using z-score normalization, transform the value 35, where the standard deviation is 12.94
years.
normalized_df=(35-df.mean())/12.94
normalized_df
o/p: -35.93508500772798
normalized_df=(35)/df.max()
normalized_df
o/p: 0.035
In-lab
1. Given a data set “data” for analysis it includes the attribute nation, purchased item, age,
Salary. (Download the data set from lms)
1.
import math
>>> import numpy as np
>>> import pandas as pd
>>> data = pd.read_csv(r"Desktop\nation.csv")
Missing values-dropna
df=data.dropna()
>>> df
nation purchased item age salary
0 india no 26.0 23456.0
2 america no 57.0 45676.0
5 india no 45.0 566678.0
6 japan yes 39.0 677644.0
9 india yes 20.0 45678.0
DATA FORMATING
type(data['age']) is int
False
df['age']=df['age'].astype("int")
df['age']
0 26
2 57
5 45
6 39
9 20
Name: age, dtype: int32
DATA NORMALIZATION
USING SIMPLE FEATURE SCALING
d1=df['salary']/df['salary'].max()
d1
0 0.034614
2 0.067404
5 0.836247
6 1.000000
9 0.067407
Name: salary, dtype: float64
BINNING
bins=np.linspace(min(df["salary"]),max(df["salary"]),4)
groups=["low","medium","high"]
df["salbin"]=pd.cut(df["salary"],bins,labels=groups,include_lowest=True)
print(df["salbin"])
0 low
2 low
5 high
6 high
9 low
Name: salbin, dtype: category
Categories (3, object): [low < medium < high]
2. Suppose john is working as a manager at Nuclear Power Corporation of India and have been
charged with analyzing the Nuclear power station construction data. He carefully inspects the
company’s database identifying and selecting the attributes (cost, date, t1, t2 and cap) to be
included in the analysis. (Download the dataset from lms)
a. He noticed that several values of the attributes for various tuples have no recorded
value.
b. He observed that data type of year is recorded in float instead of integer type.
c. He wants to normalize all the data (variables) in equal weights.
d. Finally, he wants to know if there are any outliers present in cost of the construction.
You immediately set out to perform this task.
(Hint: missing values can be solved by replacing them with the mean)
Writing space of the Problem:(For Student’s use only)
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
data = pd.read_csv("nuclear.csv")
df = data[['cost','date','t1','t2','cap']]
df
dm=df.fillna(df.mean())
dm
dm['date']=dm['date'].astype("int")
dm
d1=dm/dm.max()
d1
plt.scatter(df.iloc[:,0],df.iloc[:,1])
Post-lab
1. Data (13,15,16,16,19,20,20,21,22,22,25,25,25,25,30,33,33,35,35,35,35,36,40,45,46,52,70)
Use smoothing by bin means to smooth the above data, using a bin depth of 3. Illustrate your
steps.
Comment on the effect of this technique for the given data. Also Plot a histogram.
1.
import numpy as np
import math
from itertools import groupby
data = [13,15,16,16,19,20,20,21,22,22,25,25,25,25,30,33,33,35,35,35,35,36,40,45,46,52,70]
b = data
b=np.sort(b)
bin1=np.zeros((9,3))
for i in range(0, len(data), 3):
    k = int(i/3)
    mean = (b[i] + b[i+1] + b[i+2]) / 3
    for j in range(3):
        bin1[k, j] = mean
print("Bin Mean: \n", bin1)
histogram
import numpy as np
import math
import matplotlib.pyplot as plt
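The plotting call itself is not shown in the workbook; a minimal sketch continuing from the imports above (choosing 9 bins to mirror the bin depth of 3 over 27 values is an assumption):
plt.hist(data, bins=9)
plt.show()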
2. Use these two methods below to normalize the following group of data: 200, 300, 400, 600
and
1000.
a. min-max normalization by setting min = 0 and max = 1
b. z-score normalization
c. z-score normalization using the mean absolute deviation instead of the standard deviation
d. Normalization by simple feature scaling.
a. normalized_df=(df-df.min())/(df.max()-df.min())
normalized_df
o/p: array([0. , 0.125, 0.25 , 0.5 , 1. ])
b. normalized_df=(df-df.mean())/df.std()
normalized_df
o/p: array([-1.06066017, -0.70710678, -0.35355339, 0.35355339, 1.76776695])
c. normalized_df=(df-df.mean())/(robust.mad(df1))
normalized_df
d)normalized_df=(df)/df.max()
normalized_df
o/p: array([0.2, 0.3, 0.4, 0.6, 1. ])
A.
normalized_df=(df-df.mean())/df.std()
normalized_df
B.
CORRELATION COEFFICIENT
np.corrcoef(age,fat)
array([[1. , 0.71029574],
[0.71029574, 1. ]])
Viva Voce:-
1. What are the factors that comprising data quality?
2. What do you mean by noise in the dataset?
3. What are outliers in the dataset?
4. What is discretization?
5. What is the difference between lossy and lossless in data reduction?
Lab #3: To implement principal component analysis
Pre-lab:-
Step 1: from the dataset, standardize the variables so that all variables are represented in a single scale
Step 2: construct variance-covariance matrix of those variables
Step 3: Calculate the eigenvectors and eigenvalues of the covariance matrix. The eigenvectors represent
the components of the dataset
Step 4: Reorder the matrix by eigenvalues, highest to lowest. This gives the components in order of
significance
Step 5: Keep the top n-components which together explain 75%-80% variability of the dataset
Step 6: create a feature vector by taking the eigenvectors that are kept in step 5, and forming a matrix
with these eigenvectors in the columns
Step 7: Take the transpose of the feature vector and multiply it on the left of the original data set,
transposed. The values obtained are the principal scores.
In-lab:-
1. Suppose that you are given a small 3x2 matrix, you have to
calculate Principal Component Analysis without using pca() function?
Matrix: ([3, 5], [4, 2], [1, 6])
print(A)
[[20 50]
[70 40]
[50 80]]
M
array([46.66666667, 56.66666667])
C
array([[-26.66666667, -6.66666667],
[ 23.33333333, -16.66666667],
[ 3.33333333, 23.33333333]])
V
array([[633.33333333, -66.66666667],
[-66.66666667, 433.33333333]])
Vectors
array([[ 0.95709203, 0.28978415],
[-0.28978415, 0.95709203]])
Values
array([653.51837585, 413.14829082])
P.T
array([[-23.59055972, -14.10819081],
[ 27.1618831 , -9.18990364],
[ -3.57132338, 23.29809445]])
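The arrays printed above (A, M, C, V, the eigenvectors/values and P.T) are outputs only; one way they could have been produced with NumPy is sketched below (a sketch, not the original solution; eigenvector signs and ordering may differ from the printout):
from numpy import array, mean, cov
from numpy.linalg import eig

A = array([[20, 50], [70, 40], [50, 80]])
M = mean(A, axis=0)        # column means
C = A - M                  # centered data
V = cov(C.T)               # covariance matrix of the columns
values, vectors = eig(V)   # eigenvalues and eigenvectors
P = vectors.T.dot(C.T)     # project the centered data; P.T holds the principal scores
print(P.T)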
2. Calculate the principal component analysis for the matrix given in Q1 using PCA?
USING PCA
from sklearn.decomposition import PCA

pca = PCA(2)
pca.fit(A)
print(pca.components_)
print(pca.explained_variance_)
B = pca.transform(A)
print(B)
[[20 50]
[70 40]
[50 80]]
[[ 0.95709203 -0.28978415]
[ 0.28978415 0.95709203]]
[653.51837585 413.14829082]
[[-23.59055972 -14.10819081]
[ 27.1618831 -9.18990364]
[ -3.57132338 23.29809445]]
3. Pollution has been a concern since industrialization due to its effects on human lives and the planet.
According to the WHO, air pollution is linked to 7 million premature deaths per annum. A
report is generated on the quality of air over 5 months. It is found that the data within the reported
dataset are correlated. So, perform a strategic method to reconstruct the dataset with 2 components.
Also visualize a graph between the two components.
(Download the dataset from lms)
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
%matplotlib inline
df = pd.read_csv("airquality.csv")
df=df.dropna()
scaler = StandardScaler()
scaler.fit(df)
scaled_data=scaler.transform(df)
o/p: (111, 7)
scaled_data
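The step that actually produces x_pca is not shown above; a minimal sketch, assuming 2 principal components:
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(scaled_data)
x_pca = pca.transform(scaled_data)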
x_pca.shape
(111, 2)
plt.figure(figsize=(8,6))
plt.scatter(x_pca[:,0],x_pca[:,1],c=df['Day'])
plt.xlabel('First principle component')
plt.ylabel('Second principle component')
Post-lab:-
1. Pollution has been a concern since industrialization due to its effects on human lives and the
planet. According to the WHO, air pollution is linked to 7 million premature deaths per
annum. A report is generated on the quality of air over 5 months. It is found that the data within
the reported dataset are correlated. So, perform a strategic method to reconstruct the dataset
with 2 components. Also visualize a graph between the two components?
(Download the dataset from lms)
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
%matplotlib inline
d=pd.read_csv(r"Desktop\death.csv")
d.fillna(d.mean(),inplace=True)
df=d.iloc[:,5:10]
scaler = StandardScaler()
scaler.fit(df)
scaled_data=scaler.transform(df)
o/p: (151, 5)
scaled_data
x_pca.shape
(151, 2)
x_pca
array([[ 2.19650844e-02, -2.72925429e-01],
[-1.44914240e+00, -4.65084297e-02],
[-9.34861034e-01, 1.46993451e-01],
[-5.88842978e-02, 5.51733526e-01],
Viva Voce:-
Lab #4: Classification using Decision Trees
Pre-lab:-
1. What are the attribute selection measures in modeling a decision tree and write the
respective equations for each of them.
Gain Ratio
GainRatio(A) = Gain(A) / SplitInfo_A(D)
Gini index
Gini(D) = 1 − Σ pi², where pi is the probability that a tuple in D belongs to class Ci
The strategy used to choose the split at each node. Supported strategies are “best” to choose the best split
and “random” to choose the best random split.
max_depth : int or None, optional (default=None)
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all
leaves contain less than min_samples_split samples.
min_samples_split : int, float, optional (default=2)
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
In-lab:-
1. Implement the decision tree algorithm on the given data which has weight and smoothness as
the segregating criteria for the fruit apple and orange. Apple is represented by the number ‘1’
and orange by ‘0’. Construct a decision tree and apply the prediction measures for
the given data to obtain the types of fruits.
Weight Smooth Fruit
180 7 ?
140 8 ?
150 5 ?
Fruit dataset
https://drive.google.com/file/d/1qoMDjozHHELVn5tFAJxp8mMw0Ggt-BVX/view?usp=sharing
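The fruit_classifier used in the prediction code below is not defined in this excerpt; a minimal training sketch, assuming the downloaded dataset has weight, smooth and fruit columns (the file and column names are assumptions):
import pandas as pd
from sklearn import tree

fruit_data_set = pd.read_csv('fruit.csv')                   # assumed file name
fruit_classifier = tree.DecisionTreeClassifier()
fruit_classifier.fit(fruit_data_set[['weight', 'smooth']],  # assumed column names
                     fruit_data_set['fruit'])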
test_features_1_fruit = fruit_classifier.predict([[180,7]])
print("Actual fruit type: {act_fruit} , Fruit classifier predicted: {predicted_fruit}".format(
act_fruit=fruit_data_set["fruit"][0], predicted_fruit=test_features_1_fruit))
test_features_3_fruit = fruit_classifier.predict([[140,8]])
print("Actual fruit type: {act_fruit} , Fruit classifier predicted: {predicted_fruit}".format(
act_fruit=fruit_data_set["fruit"][2], predicted_fruit=test_features_3_fruit))
test_features_8_fruit = fruit_classifier.predict([[150,5]])
print("Actual fruit type: {act_fruit} , Fruit classifier predicted: {predicted_fruit}".format(
act_fruit=fruit_data_set["fruit"][7], predicted_fruit=test_features_8_fruit))
digraph Tree {
node [shape=box] ;
0 [label="X[1] <= 6.5\ngini = 0.5\nsamples = 10\nvalue = [5, 5]"] ;
1 [label="gini = 0.0\nsamples = 5\nvalue = [5, 0]"] ;
0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
2 [label="gini = 0.0\nsamples = 5\nvalue = [0, 5]"] ;
0 -> 2 [labeldistance=2.5, labelangle=-45, headlabel="False"] ;
}
Now copy the above code, go to the above mentioned website and paste it. Select Generate graph.
SOLUTION 3:
from sklearn import tree
# features = [[155, “rough”], [180, “rough”], [135, “smooth”], [110, “smooth”]] # Input to classifier
features = [[155, 0], [180, 0], [135, 1], [110, 1]] # scikit-learn requires real-valued features
labels = [1, 1, 0, 0]
# Training classifier
clf = tree.DecisionTreeClassifier()
clf = clf.fit(features, labels)
# Making predictions (e.g. a 150 g fruit with a rough surface)
print(clf.predict([[150, 0]]))
(Ref: https://drive.google.com/file/d/1PJizP39JPh_T-5dQVcUVCfswrPSxT734/view?usp=sharing)
Make sure to install the scikit-learn package and other required packages.
1. Find the correlation matrix for the diabetes dataset.
2. Split the dataset into train_set and test_set for modeling and prediction. Divide the dataset
in such a way that the trained dataset constitutes 70 percent of the original dataset and the
rest of the part belongs to the test dataset.
3. Produce a decision tree model using
a. Gini index metric
b. Entropy and Information gain metric on the trained dataset using
the DecisionTreeClassifier function.
4. Apply the prediction measures on the test dataset.
5. Define a function named accuracy_score by interpreting the difference between the
predicted values and the test set values. Display the accuracy in terms of
a. Fraction using the accuracy_score function
b. Number of correct predictions.
6. Print the confusion matrix of the test dataset.
7. Calculate the following values manually after obtaining the confusion matrix
a. Accuracy
b. Error rate
c. Precision
d. Recall (sensitivity)
e. F1 Score
f. Specificity
8. Compare the two results (obtained from the two kinds of metrics) and state which method is
more accurate for this dataset. Convert the trained decision tree classifier
into a graphviz object. Later, we use the converted graphviz object for visualization.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn import metrics

pima = pd.read_csv("diabetes.csv")
type(pima)
X=pima.iloc[:,0:7]
y=pima.iloc[:,8]
#correlation matrix
print(pima.corr(method='pearson'))
#splitting the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) # 70% training and 30%
test
#Create Decision Tree classifier object
clf = DecisionTreeClassifier(criterion='entropy') #try the criterion 'gini' too
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
#accuracy in fraction
print(accuracy_score(y_test,y_pred,normalize=True))
#accuracy in number of predictions
print(accuracy_score(y_test,y_pred,normalize=False))
#confusion matrix
x=confusion_matrix(y_test,y_pred)
print(x)
tn=x[0][0]
fp=x[0][1]
fn=x[1][0]
tp=x[1][1]
fpr1=fp/(fp+tn)
tpr1=tp/(tp+fn)
print(fpr1)
print(tpr1)
#method 1
y_pred_proba = clf.predict_proba(X_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
'''#method 2
import sklearn.metrics as metrics
# calculate the fpr and tpr for all thresholds of the classification
probs = clf.predict_proba(X_test)
preds = probs[:,1]
print(preds)
fpr, tpr, threshold = metrics.roc_curve(y_test, preds)
roc_auc = metrics.auc(fpr, tpr)
# method I: plt
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label = 'AUC = %0.2f' % roc_auc)
#method 3
predictions = tree.predict_proba(X_test)
x = roc_auc_score(y_test, predict_proba[:,1])
plt.clf()
plot.plot(fpr, tpr)
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.title('ROC curve')
plt.show()'''
#https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62
#Accuracy = TP+TN/TOTAL where total=TP+TN+FP+FN
#Error rate = FP+FN/TOTAL
#Recall = TP/(TP+FN)
#Precision = TP/(TP+FP)
#F1 score = 2*Recall*Precision/(Recall+Precision)
#Specificity = TN / TN + FP
#fpr=fp/(fp+tn)
#tpr=tp/(tp+fn)
Post-lab:-
1. What is the C4.5 algorithm and how does it work? State the differences between ID3
and C4.5.
A. C4.5 builds a decision tree for the given data in a top-down fashion, starting from a set of
objects and a specification of properties. At each node of the tree, one property is tested based on
maximizing information gain and minimizing entropy, and the results are used to split the object
set. This process is done recursively until the set in a given sub-tree is homogeneous (i.e. it
contains objects belonging to the same category). The ID3 algorithm uses a greedy search: it
selects a test using the information gain criterion, and then never explores the possibility of
alternate choices.
Disadvantages
Data may be over-fitted or over-classified, if a small sample is tested.
Only one attribute at a time is tested for making a decision.
Does not handle numeric attributes and missing values.
The new features (versus ID3) are: (i) accepts both continuous and discrete features; (ii) handles
incomplete data points; (iii) solves the over-fitting problem by a (very clever) bottom-up technique
usually known as "pruning"; and (iv) different weights can be applied to the features that comprise
the training data.
Disadvantages
C4.5 constructs empty branches with zero values.
Overfitting happens when the algorithm picks up data with uncommon characteristics,
especially when the data is noisy.
2. Differentiate between over-fitting and under-fitting. Why do they occur
during classification?
A. Overfitting refers to a model that models the training data too well.
Overfitting happens when a model learns the detail and noise in the training data to the extent that it
negatively impacts the performance of the model on new data. This means that the noise or random
fluctuations in the training data is picked up and learned as concepts by the model. The problem is
that these concepts do not apply to new data and negatively impact the models ability to generalize.
For example, decision trees are a nonparametric machine learning algorithm that is very flexible and
is subject to overfitting training data. This problem can be addressed by pruning a tree after it has
learned in order to remove some of the detail it has picked up.
Underfitting refers to a model that can neither model the training data nor generalize to new data.
An underfit machine learning model is not a suitable model and will be obvious as it will have poor
performance on the training data.
Underfitting is often not discussed as it is easy to detect given a good performance metric. The
remedy is to move on and try alternate machine learning algorithms. Nevertheless, it does provide a
good contrast to the problem of overfitting.
3. Explain the concept of pruning and why it is important. Differentiate between pre-
pruning and post- pruning.
A. Pruning is a technique in machine learning and search algorithms that reduces the size
of decision trees by removing sections of the tree that provide little power to classify
instances. Pruning reduces the complexity of the final classifier, and hence improves predictive
accuracy by the reduction of overfitting.
(OR)
In machine learning and data mining, pruning is a technique associated with decision trees. Pruning
reduces the size of decision trees by removing parts of the tree that do not provide power to classify
instances. Decision trees are the most susceptible out of all the machine learning algorithms to
overfitting and effective pruning can reduce this likelihood.
Pruning means reducing the size of a tree that has grown too large and deep. First is post-pruning,
in which the tree is built first and then branches and levels of the decision tree are removed.
Second is pre-pruning, in which, while building the decision tree, we keep checking whether the
tree is overfitting.
Viva Voce:-
1. What is the difference between supervised and unsupervised machine learning?
2. What is a confusion matrix?
3. Which of the following is true about training and testing error in such case?
a. The difference between training error and test error increases as number of
observations increase.
b. The difference between training error and test error decreases as number of
observations increase.
c. The difference between training error and test error will not change
4. What is the difference between classification and clustering?
5. What are Recommender Systems?
Lab #5: Classification using K Nearest Neighbour
Pre-requisite:
In LMS: Find the file named “Concept of k-Nearest-Neighbor.doc”. Read the specified
document and answer the below questions.
Pre-lab:-
1. State whether the given statement is true or false with supported reasoning.
a. k-Nearest-Neighbor is a simple algorithm that stores all available cases and classifies
the new case based on dissimilarity measure.
b. The value of ‘k’ in the k-nearest-neighbor algorithm determines how many training examples'
labels are considered when assigning the most common label to the test example.
ANS:-
a. False
KNN is a simple algorithm that stores all available cases and classifies the new case based
on similarity measure.
b. True
K denotes the number of nearest neighbours which are voting the class of new data or testing
data.
3. Write an algorithm for k-nearest-neighbor classification given k, the nearest number
of neighbors, and n, the number of attributes describing each tuple.
5. Give the distance methods that are most commonly used in k-nearest-neighbor algorithm.
a. Euclidean Distance:
It is the square root of the sum of the squared differences between the new point ‘x’ and the existing
point ‘y’. It is the direct or least possible distance between points A and B.
b. Manhattan Distance:
It calculates the distance between two real vectors as the sum of their absolute differences. It is the
distance between A and B measured along the axes at right angles.
In-lab:-
Perform the following Analysis:
Step-by-step process to compute k-nearest-neighbor algorithm is:
1. Determine parameter k=no. of nearest neighbors
2. Calculate the distance between the test sample and the training samples.
3. Sort the distance and determine nearest neighbors based on the kth minimum distance.
4. Gather the category of nearest neighbors.
5. Use simple majority of the category of nearest neighbors as the prediction value of
testing sample.
Dataset:
Suppose we have the following “StudentDataSet” dataset which consists of 1st year
CGPA, 2nd year CGPA, Category (C: CRT, NC: Non-CRT) as parameters.
When a new student comes only with 1st year CGPA and 2nd year CGPA as information,
predict the category of that new student (whether he belongs to CRT or Non-CRT)
by the Euclidean distance measure, where the Euclidean distance between two points or tuples, say
X1 = (x11, x12, ..., x1n) and X2 = (x21, x22, ..., x2n), is dist(X1, X2) = sqrt( Σ (x1i − x2i)² for i = 1..n ).
Test sample:
1st year CGPA and 2nd year CGPA of the new student are 8.4 and 7.1 respectively.
(Consider k=3)
def main():
# prepare data
trainingSet=[]
testSet=[]
split = 0.70
loadDataset('StudentDataSet.csv', split, trainingSet, testSet)
print ('Train set: ' + repr(len(trainingSet)))
print(trainingSet)
print ('Test set: ' + repr(len(testSet)))
print(testSet)
# generate predictions
predictions=[]
k=3
for x in range(len(testSet)):
neighbors = getNeighbors(trainingSet, testSet[x], k)
result = getResponse(neighbors)
predictions.append(result)
print('> predicted=' + repr(result) + ', actual=' + repr(testSet[x][-1]))
#To Get Accuracy
accuracy = getAccuracy(testSet, predictions)
print('Accuracy: ' + repr(accuracy) + '%')
#Predicting the Result for new Entry
neighbors1 = getNeighbors(trainingSet,[8.4,7.1], k)
result1 = getResponse(neighbors1)
print('> predicted class for new entry [8.4,7.1] =' + repr(result1))
main()
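The helper functions loadDataset, getNeighbors, getResponse and getAccuracy called in main() are not reproduced in this listing. A minimal sketch of getNeighbors with the Euclidean distance (the remaining helpers follow the same pattern; this is an assumed reconstruction, not the original solution):
import math
import operator

def euclideanDistance(instance1, instance2, length):
    # Euclidean distance over the first `length` numeric attributes
    distance = 0.0
    for i in range(length):
        distance += pow(float(instance1[i]) - float(instance2[i]), 2)
    return math.sqrt(distance)

def getNeighbors(trainingSet, testInstance, k):
    # return the k training tuples closest to testInstance
    length = min(len(testInstance), len(trainingSet[0]) - 1)  # skip the class label column
    distances = []
    for x in range(len(trainingSet)):
        dist = euclideanDistance(testInstance, trainingSet[x], length)
        distances.append((trainingSet[x], dist))
    distances.sort(key=operator.itemgetter(1))
    return [distances[x][0] for x in range(k)]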
Output:
<class 'str'>
Train set: 8
[[8.2, 9.0, 'C'], [5.5, 4.5, 'NC'], [9.2, 9.0, 'C'], [7.8, 7.3, 'C'], [7.3, 7.4, 'NC'], [7.9, 7.0, 'NC'], [10.0,
6.0, 'C'], [6.8, 7.1, 'NC']]
Test set: 3
[[8.5, 8.5, 'C'], [7.5, 7.6, 'C'], [6.5, 7.1, 'NC']]
> predicted='C', actual='C'
> predicted='NC', actual='C'
> predicted='NC', actual='NC'
Accuracy: 66.66666666666666%
> predicted class for new entry [8.4,7.1] ='NC'
Post-lab:-
1. Predict the Category of student with 1st year CGPA and 2nd year CGPA as 7.3 and 7.1
respectively using the Manhattan measuring technique formula with k=3(Manually).
Note: The Manhattan distance between two tuples (or points) a and b is defined
as ∑i|ai−bi|
2. By considering the above StudentDataSet, predict the Category of the new student
having 1st year CGPA and 2nd year CGPA as 8.4 and 7.1 respectively, by
implementing the python code using the Manhattan distance measure in order to find the
nearest neighbors for k=3, and check whether the output is the same for both measuring
techniques or not.
import csv
import random
import math
import operator
def getResponse(neighbors):
classVotes = {}
for x in range(len(neighbors)):
response = neighbors[x][-1]
if response in classVotes:
classVotes[response] += 1
else:
classVotes[response] = 1
sortedVotes = sorted(classVotes.items(), key=operator.itemgetter(1), reverse=True)
return sortedVotes[0][0]
def main():
# prepare data
trainingSet=[]
testSet=[]
split = 0.70
loadDataset('StudentDataSet.csv', split, trainingSet, testSet)
print ('Train set: ' + repr(len(trainingSet)))
print(trainingSet)
print ('Test set: ' + repr(len(testSet)))
print(testSet)
# generate predictions
predictions=[]
k=3
for x in range(len(testSet)):
neighbors = getNeighbors(trainingSet, testSet[x], k)
result = getResponse(neighbors)
predictions.append(result)
print('> predicted=' + repr(result) + ', actual=' + repr(testSet[x][-1]))
#To Get Accuracy
accuracy = getAccuracy(testSet, predictions)
print('Accuracy: ' + repr(accuracy) + '%')
#Predicting the Result for new Entry
neighbors1 = getNeighbors(trainingSet,[8.4,7.1], k)
result1 = getResponse(neighbors1)
print('> predicted class for new entry [8.4,7.1] =' + repr(result1))
main()
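getNeighbors is not shown here either; it would match the Euclidean sketch given in the In-lab section, except that it calls a Manhattan distance such as the following (a sketch):
def manhattanDistance(instance1, instance2, length):
    # sum of absolute differences over the first `length` numeric attributes
    return sum(abs(float(instance1[i]) - float(instance2[i])) for i in range(length))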
Output:
<class 'str'>
Train set: 5
[[8.2, 9.0, 'C'], [7.5, 7.6, 'C'], [7.9, 7.0, 'NC'], [10.0, 6.0, 'C'], [6.8, 7.1, 'NC']]
Test set: 6
[[8.5, 8.5, 'C'], [5.5, 4.5, 'NC'], [9.2, 9.0, 'C'], [7.8, 7.3, 'C'], [7.3, 7.4, 'NC'], [6.5, 7.1, 'NC']]
> predicted='C', actual='C'
> predicted='C', actual='NC'
> predicted='C', actual='C'
> predicted='C', actual='C'
> predicted='C', actual='NC'
> predicted='C', actual='NC'
Accuracy: 50.0%
> predicted class for new entry [8.4,7.1] ='C'
Viva Voce:-
Refer Page no: 423,424,425 in Han J & Kamber M, “Data Mining: Concepts and Techniques”,
Third Edition, Elsevier, 2011
Lab #6: Classification using Bayesian Classifiers
Pre-lab:-
1. Match the following
Column A                              Column B
a. Naive Bayesian Classification      a. Values are continuous
b. Bayesian belief network            b. Attributes conditionally dependent
c. Gaussian distribution              c. To avoid zero probability
d. Laplace estimator                  d. Attributes conditionally independent
ANS:- d b a c
ANS:- Bayes’ theorem describes the probability of an event, based on conditions that might be related to
that event: P(A|B) = P(B|A) · P(A) / P(B)
Which tells us: how often A happens given that B happens, written P(A|B),
When we know: how often B happens given that A happens, written P(B|A),
and how likely A is on its own, written P(A),
and how likely B is on its own, written P(B).
3. Suppose we have continuous values for an attribute in a dataset then how to calculate
probability.
ANS:- Whenever the given attribute has continuous values, we use the Gaussian (normal) distribution: g(x, μ, σ) = (1 / (σ·sqrt(2π))) · e^(−(x−μ)² / (2σ²)), so that P(xk | Ci) = g(xk, μCi, σCi).
4. Let us assume
p(age=youth/buys_car =yes) = 0.222 ,
p(income=medium/buys_car)=0.444 and
p(buys_car=yes)=0.643 then
Find the probability of p(x/buys_car=yes), where x=(income=medium, age=youth).
ANS:- P(x | buys_car = yes) = 0.222 × 0.444 ≈ 0.099. (The prior P(buys_car = yes) = 0.643 is only multiplied in later, when computing P(x | yes) × P(yes).)
In-lab:-
1. Consider the given table named “Weather_cond.csv” consisting of attributes Temperature
Humidity, Windy and a class label named “Outcome”. Depending on the weather
conditions you have to choose whether to play cricket or not.
a. Unlike conventional function, write a python function to split the dataset into training
set and test set. Assume test size length as 0.33.
b. Write a python function to calculate mean and standard deviation for each numerical
attribute in the data set.
c. Calculate the number of priors for the given dataset after splitting into training and
test sets using python.
b. def loadCsv(filename):
lines = csv.reader(open(filename, "rb"))
dataset = list(lines)
for i in range(len(dataset)):
dataset[i] = [float(x) for x in dataset[i]]
return dataset
def summarize(dataset):
summaries = [(mean(attribute), stdev(attribute)) for attribute in zip(*dataset)]
del summaries[-1]
return summaries
def mean(numbers):
return sum(numbers)/float(len(numbers))
def stdev(numbers):
avg = mean(numbers)
variance = sum([pow(x-avg,2) for x in numbers])/float(len(numbers)-1)
return math.sqrt(variance)
c. import pandas as pd
import numpy as np
d=pd.read_csv(r’Enter the path of dataset’)
p=d.rename_axis('data')
q=np.array(d['Outcome'])
x = p.values
y=q
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=1)
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
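The listing stops at creating the classifier; fitting and scoring it would follow the usual scikit-learn pattern (a sketch, not part of the original solution):
gnb.fit(x_train, y_train)
y_pred = gnb.predict(x_test)

from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_pred))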
Post-lab:-
1. Consider the given table that specifies loan classification problem.
Viva Voce:-
1. Explain the difference between a Validation Set and a Test Set?
2. What are the three types of Naïve Bayes classifier?
3. How many terms are required for building a Bayes model?
4. What is training test and testing set?
5. What are the advantages of Naive Bayes?
Lab #7: Classification using Backpropagation
Pre-lab:-
In LMS: Find the file named “Han J & Kamber M, Data Mining Concepts and
Techniques.doc”.
Read the specified document from Pg. No: 398 – 404 and answer the below questions.
1. State whether the given statement is True/False.
a. Backpropagation is neural network learning algorithm.
ANS:- TRUE
ANS:-
Input    Desired Output    Model Output (W=3)    Absolute Error    Square Error
0        0                 0                     0                 0
1        2                 3                     1                 1
2        4                 6                     2                 4
Increasing Weight:
Since the squared error increases as the weight increases, we should decrease
the weight.
Decreasing Weight:
In-lab:-
Analysis:
The following steps will provide the foundation that you need to implement the Backpropagation
algorithm and apply it to your own predictive modelling problems:
1. Initialize Network.
2. Forward Propagate.
i. Neuron Activation.
ii. Neuron Transfer.
iii. Forward Propagation.
3. Back Propagate Error.
i. Transfer Derivative
ii. Error Backpropagation
4. Train Network.
i. Update Weights.
ii. Train Network.
5. Test Network.
Dataset:
Suppose we have the following “Results Dataset” which consist of GPA’s of some
students that they had scored in two internal tests. And, it also consists of another attribute
named ‘Qualified’, which holds a character(Q/NQ), representing the student qualification
for final examination.
S. No Test – 1 Test – 2 Qualified
1 8.5 8.5 Q
2 8.2 9.0 Q
3 3.5 5.0 NQ
4 5.5 4.5 NQ
5 9.2 9.0 Q
6 7.8 7.3 Q
7 8.0 3.1 NQ
8 10 7.0 Q
9 4.5 6.0 NQ
10 6.8 7.1 Q
11 5.1 4.1 NQ
12 4.2 5.3 NQ
neuron = layer[j]
errors.append(expected[j] - neuron['output'])
for j in range(len(layer)):
neuron = layer[j]
neuron['delta'] = errors[j] * transfer_derivative(neuron['output'])
# Update network weights with error
def update_weights(network, row, l_rate):
for i in range(len(network)):
inputs = row[:-1]
if i != 0:
inputs = [neuron['output'] for neuron in network[i - 1]]
for neuron in network[i]:
for j in range(len(inputs)):
neuron['weights'][j] += l_rate * neuron['delta'] * inputs[j]
neuron['weights'][-1] += l_rate * neuron['delta']
# Train a network for a fixed number of epochs
def train_network(network, train, l_rate, n_epoch, n_outputs):
for epoch in range(n_epoch):
sum_error = 0
for row in train:
outputs = forward_propagate(network, row)
expected = [0 for i in range(n_outputs)]
expected[row[-1]] = 1
sum_error += sum([(expected[i]-outputs[i])**2 for i in range(len(expected))])
backward_propagate_error(network, expected)
update_weights(network, row, l_rate)
print('>epoch=%d, lrate=%.3f, error=%.3f' % (epoch, l_rate, sum_error))
# Test training backprop algorithmn
with open("G:\College\BackPropagation\Results Dataset.csv") as f:
reader = csv.reader(f)
next(reader) # skip header
data = []
for row in reader:
data.append(row)
data1=[]
dataset=[]
for row in data:
data1.append(convert_into_float(row))
for row in data1:
dataset.append(convert_into_int(row))
#Splitting dataset into train and test dataset
testdata=[]
traindata=[]
c=0
for row in dataset:
c=c+1
if(c<=8):
traindata.append(row)
else:
testdata.append(row)
n_inputs = len(traindata[0]) - 1
n_outputs = len(set([row[-1] for row in traindata]))
network = initialize_network(n_inputs, 2, n_outputs)
train_network(network, traindata,0.5,500, n_outputs)
for layer in network:
print(layer)
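The listing above begins inside backward_propagate_error and omits the forward-propagation helpers it relies on (neuron activation, the transfer function and its derivative). A minimal sketch of those helpers, assuming a sigmoid transfer function:
from math import exp

def activate(weights, inputs):
    # weighted sum of the inputs plus the bias (stored as the last weight)
    activation = weights[-1]
    for i in range(len(weights) - 1):
        activation += weights[i] * inputs[i]
    return activation

def transfer(activation):
    # sigmoid transfer function
    return 1.0 / (1.0 + exp(-activation))

def transfer_derivative(output):
    # derivative of the sigmoid, used when back-propagating the error
    return output * (1.0 - output)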
Post-lab:-
1. Use the network which is trained on the above “Results Dataset” and test whether it is trained
with 100% accuracy or not. And, predict the result (qualified for final examination or not) of a
new entry which contains 5.9 and 5.9 GPA’s of test -1 and test -2 respectively.
Viva Voce:-
1. What are the general tasks that are performed with backpropagation algorithm?
2. What kind of real-world problems can neural networks solve?
3. What is a gradient descent?
4. Why is zero initialization not a recommended weight initialization technique?
5. How are artificial neural networks different from normal networks?
Lab #8: Association Rule Mining - Apriori
Pre-lab:-
1. Define what is Apriori algorithm.
Apriori is an algorithm for frequent item set mining and association rule learning over transactional databases. It
proceeds by identifying the frequent individual items in the database and extending them to larger and larger
item sets as long as those item sets appear sufficiently often in the database.
If an item set is frequent, then all of its subsets must also be frequent, Or If an item set is infrequent, then all of its
supersets must be infrequent.
2. What is association mining?
Association Mining searches for frequent items in the data-set. In frequent mining usually the interesting
associations and correlations between item sets in transactional and relational databases are found. In short,
Frequent Mining shows which items appear together in a transaction or relation.
5. Consider the market basket transactions given in the following table. Let
min-sup=40% and min_conf=40%
Transaction ID Items Bought
T1 A,B,C
T2 A,B,C,D,E
T3 A,C,D
T4 A,C,D,E
T5 A,B,C,D
In-lab:-
1. For the following given transaction dataset, perform following operations :
a. Generate rules using Apriori algorithm by using below dataset.
vegetables green whole wheat flour cottage
shrimp almonds avocado mix grapes yams cheese
burgers meatballs eggs
chutney
turkey avocado
mineral energy whole
water milk bar wheat rice green tea eggs
low fat
yogurt
whole
wheat french fries
pasta
light
soup cream shallot
frozen green
vegetables spaghetti tea
french fries
eggs pet food
cookies
mineral cooking
turkey burgers water eggs oil
champagne
spaghetti cookies
mineral
water salmon eggs
mineral
water
low fat
shrimp chocolate chicken honey oil cooking oil yogurt
turkey eggs
tomatoes mineral
turkey fresh tuna spaghetti water black tea salmon eggs
french fries
meatballs milk honey protein bar
shampoo
red wine shrimp pasta pepper eggs chocolate
sparkling
rice water
mineral body
spaghetti water ham spray pancakes green tea
grated white toothpaste
burgers cheese eggs pasta avocado honey wine
eggs
parmesan
cheese spaghetti soup avocado milk fresh bread
ground mineral frozen
beef spaghetti water milk eggs black tea salmon smoothie
sparkling
water
mineral french fries
water eggs chicken chocolate
frozen
vegetables mineral
spaghetti yams water
# Data Preprocessing
import pandas as pd
dataset = pd.read_csv('Market_Basket_Optimisation.csv', header = None)
transactions = []
for i in range(0, 7501):
transactions.append([str(dataset.values[i,j]) for j in range(0, 20)])
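The rule-generation call itself is not shown; a minimal sketch, assuming the third-party apyori package and illustrative threshold values:
from apyori import apriori

rules = apriori(transactions, min_support=0.003, min_confidence=0.2, min_lift=3, min_length=2)
results = list(rules)
print(results[:5])   # inspect the first few generated rules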
Post-lab:-
1. Same as In-lab question generate rules on below dataset.
semi-
finished ready
citrus fruit bread margarine soups
tropical
fruit yogurt coffee
whole milk
cream meat
pip fruit yogurt cheese spreads
long life
other whole condensed bakery
vegetables milk milk product
abrasive
whole milk butter yogurt rice cleaner
rolls/buns
liquor
other UHT- bottled (appetizer
vegetables milk rolls/buns beer )
pot plants
whole milk cereals
other
tropical vegetabl white bottled chocolate
fruit es bread water
bottl
ed
tropical whole yogurt water dishes
citrus fruit fruit milk butter curd flour
beef
rolls/bun
frankfurter s soda
tropical
chicken fruit
fruit/vegeta newspape
butter sugar ble juice rs
fruit/vegetab
le juice
packaged
fruit/vegetab
les
chocolate
specialty
bar
other
vegetables
butter milk pastry
whole milk
tropical cream processed detergent newspape
fruit cheese cheese rs
bathroo
root sweet salty m
tropical vegetabl other frozen rolls/buns spread snac waffle cand cleaner
fruit es vegetables dessert flour s k s y
bottled canned
water beer
yogurt
rolls/bun chocolate
sausage s soda
other
vegetables
shoppi
brown fruit/vegeta canned newspape ng
bread soda ble juice beer rs bags
# Data Preprocessing
import pandas as pd
dataset = pd.read_csv('grow.csv', header = None)
transactions = []
for i in range(0, 30):
transactions.append([str(dataset.values[i,j]) for j in range(0, 12)])
Viva Voce:-
1. Who proposed Apriori algorithm in which year?
2. What is frequent item set?
3. Why do we convert dataset into list?
4. What is the formula for support, confidence and lift?
5. How they get the name as Apriori?
Lab #9: Implementation of K-Means Clustering
Pre-Requisites:
Data pre-processing
Basics of plotting techniques
Various clustering techniques
Pre-lab:-
1. Match the following.
Parameters Application
1. pch a. To set orientation of axis labels
2. col b. No. of plots per row and column
3. mfrow c. To set plot color
4. lwd d. Plotting symbol
5. las e. To set line width
ANS:- d c b e a
3. Into how many types is clustering divided? Name them.
In-lab:-
1. The given dataset comprises 150 data entries of different countries around the
world. It is a report on world happiness, a landmark survey of the state of global
happiness that ranks 156 countries by how happy their citizens perceive themselves to
be, with a focus on the technologies, social norms, conflicts and government policies
that have driven those changes. The records contains various attributes of each country
that includes positive_effect, negative_effect, corruption, freedom, health life
expectancy etc. The data frame includes categorical variables, numerical values and
their values vary from country to country.
Implement a python code using scikit-learn to display a K-means clustering plot for
given data frame named “world_happiness_report.csv”.
Writing space of the Problem:(For Student’s use only)
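No solution is printed in the workbook for this exercise; a minimal scikit-learn sketch (the file name, the choice of 3 clusters and plotting the first two numeric columns are assumptions):
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

df = pd.read_csv('world_happiness_report.csv')
X = df.select_dtypes(include='number').dropna()      # keep only numeric attributes

kmeans = KMeans(n_clusters=3, random_state=0)        # 3 clusters chosen arbitrarily
labels = kmeans.fit_predict(X)

# plot the first two numeric attributes coloured by cluster, with the centroids marked
plt.scatter(X.iloc[:, 0], X.iloc[:, 1], c=labels)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], marker='x', c='red')
plt.show()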
Post-lab:-
1. This lab module aims to build an analysis of the customers of a shopping mall. It consists of 150
observations of customers, with details that include gender, age, annual_income,
spending_score etc. Based on the two parameters annual_income and spending_score, try to
build an analysis of the customers through cluster graphs.
Apply k-means clustering on the given data set named "Mall_customers", marking the number of
clusters based on the mean and standard deviation of any two attributes of your choice, and
implement K-means iteratively till the centroids stabilize.
import numpy as np
from copy import deepcopy

centers_old = np.zeros(centers.shape)
centers_new = deepcopy(centers)
data.shape
clusters = np.zeros(n)
distances = np.zeros((n,k))
while error != 0:
for i in range(k):
distances[:,i] = np.linalg.norm(data - centers[i], axis=1)
clusters = np.argmin(distances, axis = 1)
centers_old = deepcopy(centers_new)
for i in range(k):
centers_new[i] = np.mean(data[clusters == i], axis=0)
error = np.linalg.norm(centers_new - centers_old)
centers_new
Viva Voce:-
1. K-means is which type of algorithm?
2. In the K-means clustering algorithm, what criterion is used to assign a data point to one
cluster rather than another?
3. What are the basic steps in K-means clustering?
4. What does K refer to in the K-means algorithm? (K refers to the number of clusters.)
5. How is the K-means algorithm different from the KNN algorithm?
Lab #10: Implementation of Fuzzy c-means clustering
Pre-Requisites:
Should have a prior knowledge on Fuzzy c-means clustering algorithm.
Pre-lab:-
1.Can a data point in Fuzzy c-means clustering belong to more than one cluster.
ANS:- Yes
2. If Partition matrix = and data points are (1,3) , (1.5,3.2) ,(1.3,2.8) ,(3,1)
then find cluster centres.
ANS:-
4. In Fuzzy c-means clustering, after each iteration the cluster centres are updated according to
the formula:
cj = Σ (i = 1..n) (µij^m · xi) / Σ (i = 1..n) (µij^m)
Where,
a. n =
b. µij =
c. m =
ANS:-
a. n = number of datapoints
b. µij = element in membership matrix
c. m = Fuzzian parameter
ANS:- a.
b.
In-lab:-
1. Brief the Fuzzy c-means algorithm. And apply Fuzzy c-means to (1,3) , (1.5,3.20 ) ,
(1.3,2.6) , (3,1). Assume there are 2 clusters and Fuzziness parameter m=2.
Writing space of the Problem:(For Student’s use only)
2. Implement Fuzzy c-means in python to find updated membership matrix for the dataset
Fuzzy_Code.csv which has 44 attributes of numerical type and 1 attribute of categorial type.
Assume number of clusters =2 and Fuzziness parameter =2.
Post-lab:-
1. Write a snippet of python implementation to find accuracy and precision of the above
dataset through Fuzzy c-means algorithm.
Writing space of the Problem:(For Student’s use only)
ANS:- import pandas as pd
import numpy as np
import random
import operator
import math
df_full = pd.read_csv("SPECTF_New.csv")
columns = list(df_full.columns)
features = columns[:len(columns)-1]
class_labels = list(df_full[columns[-1]])
df = df_full[features]
# Number of Attributes
num_attr = len(df.columns) - 1
# Number of Clusters
k=2
# Fuzzy parameter
m = 2.00
for i in range(len(df)):
# Yes = 1, No = 0
if cluster_labels[i] == 1 and class_labels[i] == 'Yes':
tp[0] = tp[0] + 1
if cluster_labels[i] == 0 and class_labels[i] == 'No':
tn[0] = tn[0] + 1
if cluster_labels[i] == 1 and class_labels[i] == 'No':
fp[0] = fp[0] + 1
if cluster_labels[i] == 0 and class_labels[i] == 'Yes':
fn[0] = fn[0] + 1
for i in range(len(df)):
# Yes = 0, No = 1
if cluster_labels[i] == 0 and class_labels[i] == 'Yes':
tp[1] = tp[1] + 1
if cluster_labels[i] == 1 and class_labels[i] == 'No':
tn[1] = tn[1] + 1
if cluster_labels[i] == 0 and class_labels[i] == 'No':
fp[1] = fp[1] + 1
if cluster_labels[i] == 1 and class_labels[i] == 'Yes':
fn[1] = fn[1] + 1
accuracy = [a0*100,a1*100]
precision = [p0*100,p1*100]
recall = [r0*100,r1*100]
def initializeMembershipMatrix():
membership_mat = list()
for i in range(n):
random_num_list = [random.random() for i in range(k)]
summation = sum(random_num_list)
temp_list = [x/summation for x in random_num_list]
membership_mat.append(temp_list)
return membership_mat
def calculateClusterCenter(membership_mat):
cluster_mem_val = list(zip(*membership_mat))
cluster_centers = list()
for j in range(k):
x = list(cluster_mem_val[j])
xraised = [e ** m for e in x]
denominator = sum(xraised)
temp_num = list()
for i in range(n):
data_point = list(df.iloc[i])
prod = [xraised[i] * val for val in data_point]
temp_num.append(prod)
numerator = map(sum, zip(*temp_num))
center = [z/denominator for z in numerator]
cluster_centers.append(center)
return cluster_centers
def getClusters(membership_mat):
cluster_labels = list()
for i in range(n):
max_val, idx = max((val, idx) for (idx, val) in enumerate(membership_mat[i]))
cluster_labels.append(idx)
return cluster_labels
def fuzzyCMeansClustering():
# Membership Matrix
membership_mat = initializeMembershipMatrix()
curr = 0
while curr <= MAX_ITER:
cluster_centers = calculateClusterCenter(membership_mat)
membership_mat = updateMembershipValue(membership_mat, cluster_centers)
cluster_labels = getClusters(membership_mat)
curr += 1
print(membership_mat)
return cluster_labels, cluster_centers
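updateMembershipValue, MAX_ITER and n are referenced above but not defined in this listing; a minimal sketch of the standard FCM membership update, where each µij becomes 1 / Σc (dij/dic)^(2/(m−1)) (MAX_ITER is an assumed iteration limit):
n = len(df)          # number of data points
MAX_ITER = 100       # assumed iteration limit

def updateMembershipValue(membership_mat, cluster_centers):
    p = float(2 / (m - 1))
    for i in range(n):
        x = list(df.iloc[i])
        distances = [np.linalg.norm(np.array(x) - np.array(cluster_centers[j])) for j in range(k)]
        for j in range(k):
            den = sum([math.pow(float(distances[j] / distances[c]), p) for c in range(k)])
            membership_mat[i][j] = float(1 / den)
    return membership_mat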
Viva Voce:-
1. What are the minimum number of attributes required for clustering.
2. Can decision trees be used for performing clustering?
3. In Fuzzy c-means can a data point be a part of more than one cluster?
4. List some applications of Fuzzy c-means .
5. Is FCM a static or dynamic algorithm.
Lab #11: Classification: Support Vector Machine (SVM)
Pre-lab:-
1. What is SVM?
ANS:- “Support Vector Machine” (SVM) is a supervised machine learning algorithm which can be used for both
classification and regression challenges. However, it is mostly used in classification problems. In this
algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you
have), with the value of each feature being the value of a particular coordinate. Then, we perform
classification by finding the hyper-plane that differentiates the two classes very well.
2. When do we use SVM?
ANS:- It uses a technique called the kernel trick to transform your data and then based on these transformations
it finds an optimal boundary between the possible outputs.
3. What is maximum marginal hyper plane and what is the equation of separating hyper
plane?
ANS:- A support vector machine performs classification by finding the hyperplane that maximizes the margin
between the two classes. The separating hyperplane is
W · X + b = 0
Where,
W is the weight vector, W = {w1, w2, w3, ..., wn}; n is the number of attributes
b is a scalar
X = (x1, x2, ..., xn), where x1, x2, ... are the values of the attributes.
5. What are the equations for point that lies above the separating hyperplane and below the
separating hyperplane?
ANS:- Point that lies above the separating hyperplane :
w0+w1x1+w2x2 > 0
Point that lies below the separating hyperplane :
w0+w1x1+w2x2 < 0
In-lab:-
1. Below is the data of the employees in the company. The data shows whether employee
purchased software or not. Take x co-ordinate as age and y co-ordinate as
estimated_salary. Now, Consider the following dataset and perform the below operations:
User ID Gender Age Estimated Salary Purchased
15624510 Male 19 19000 0
15810944 Male 35 20000 0
15668575 Female 26 43000 0
15603246 Female 27 57000 0
15804002 Male 19 76000 0
15728773 Male 27 58000 0
15598044 Female 27 84000 0
15694829 Female 32 150000 1
15600575 Male 25 33000 0
15727311 Female 35 65000 0
15570769 Female 26 80000 0
15606274 Female 26 52000 0
15746139 Male 20 86000 0
15704987 Male 32 18000 0
15628972 Male 18 82000 0
15697686 Male 29 80000 0
15733883 Male 47 25000 1
15617482 Male 45 26000 1
15704583 Male 46 28000 1
15621083 Female 48 29000 1
15649487 Male 45 22000 1
15736760 Female 47 49000 1
15714658 Male 48 41000 1
15599081 Female 45 22000 1
15705113 Male 46 23000 1
15631159 Male 47 20000 1
15792818 Male 49 28000 1
15633531 Female 47 30000 1
15744529 Male 29 43000 0
a. Import the dataset into python
b. Split the dataset set into training and testing sets
c. Apply feature scaling on training and test sets
d. Fit SVM to the training set
e. Visualize the training set results
f. Visualize the test set results.
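Only the visualization code (steps e and f) appears later in this lab; a minimal sketch of steps a–d, assuming the table above is saved as a CSV file (the file name and column positions are assumptions):
# (a) Import the dataset
import pandas as pd
dataset = pd.read_csv('Social_Network_Ads.csv')     # assumed file name
X = dataset.iloc[:, [2, 3]].values                  # Age, Estimated Salary
y = dataset.iloc[:, 4].values                       # Purchased

# (b) Split the dataset into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# (c) Feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# (d) Fit SVM to the training set
from sklearn.svm import SVC
classifier = SVC(kernel='linear', random_state=0)
classifier.fit(X_train, y_train)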
Post-lab:-
1. Below dataset represents the bank transactions of KVB bank for an hour. Consider x co-
ordinate as Balance and y co-ordinate as Trtn_amt. Perform following operations on
given dataset:
S.No transaction_ID Balance Trtn_amt sucornot
1 3467 98687.36 500 0
2 4801 8510.47 100 0
3 2093 2475.3 200 1
4 9933 37743.25 1000 0
5 7178 2705.95 600 0
6 1093 60314 750 1
7 3708 812129.5 280 1
8 3804 8076.25 140 0
9 3192 42323.14 310 1
10 3666 47045.25 2500 0
11 8598 96171.25 6900 0
12 8743 608581.8 8520 1
13 9302 586057.3 410 1
14 6127 4587.5 750 0
15 7502 43597.75 250 0
a. Import the dataset into python
b. Split the dataset set into training and testing sets
c. Apply feature scaling on training and test sets
d. Fit SVM to the training set
e. Visualize the training set results
f. Visualize the test set results.
(e)
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('SVM (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
(f)
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('SVM (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
Viva Voce:-
1. What are the advantages of SVM?
2. How many types of machine learning are there, and which type does SVM fall under?
3. What are the tuning parameters in SVM?
Pre-requisite:
Refer Page no: 355-363 in Han J & Kamber M, “Data Mining: Concepts and Techniques”, Third
Edition, Elsevier, 2011
Pre-lab:-
In-lab:-
1. Implement a simple Python code for rule-based classification on the
“AllElectronicsCustomer” database (download the dataset from the LMS).
RID age income student Credit_rating Class:buys computer
import math
import pandas as pd
import numpy as np

dt = pd.read_csv('AllElectronicsCustomer.csv')
print(dt)
print()

D = len(dt)
ncovers = 0
ncorrect = 0
c = 0
j = int(input("Enter rule number (<= 4): "))
for i in range(len(dt)):
    if j == 1:
        # R1: age = youth AND student = yes
        if (dt.iloc[i, 1] == "youth") and (dt.iloc[i, 3] == "yes"):
            ncovers += 1
            if dt.iloc[i, 5] == "yes":
                ncorrect += 1
                print(str(i + 1) + " buys the computer")
        else:
            c += 1
    elif j == 2:
        # R2: age = middle_aged AND income in {medium, high} AND credit_rating = excellent
        if (dt.iloc[i, 1] == "middle_aged") and (dt.iloc[i, 2] in ("medium", "high")) and (dt.iloc[i, 4] == "excellent"):
            ncovers += 1
            if dt.iloc[i, 5] == "yes":
                ncorrect += 1
                print(str(i + 1) + " buys the computer")
        else:
            c += 1
    elif j == 3:
        # R3: age = senior AND student = yes
        if (dt.iloc[i, 1] == "senior") and (dt.iloc[i, 3] == "yes"):
            ncovers += 1
            if dt.iloc[i, 5] == "yes":
                ncorrect += 1
                print(str(i + 1) + " buys the computer")
        else:
            c += 1
    else:
        # R4: age = senior AND income = high AND student = yes
        if (dt.iloc[i, 1] == "senior") and (dt.iloc[i, 2] == "high") and (dt.iloc[i, 3] == "yes"):
            ncovers += 1
            if dt.iloc[i, 5] == "yes":
                ncorrect += 1
                print(str(i + 1) + " buys the computer")
        else:
            c += 1

if c == D:
    print("none of the tuples satisfies the given rule")
print()
coverage = ncovers / D
print("coverage")
print(coverage)
print()
accuracy = ncorrect / ncovers if ncovers > 0 else 0.0   # guard against division by zero when the rule covers no tuples
print("accuracy")
print(accuracy)
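Here coverage(R) = ncovers / |D| is the fraction of tuples in the dataset that the chosen rule covers, and accuracy(R) = ncorrect / ncovers is the fraction of covered tuples that the rule classifies correctly, following the definitions used in Han & Kamber.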
Post-lab:-
1. Extract possible classification rules from the given decision tree.
ANS:-
3. What is the difference between a decision tree classifier and rule-based classification?
Viva Voce:-
1. Rule-Based classifier classify records by using a collection of ______ rules.
2. Most rule-based classification systems use which strategy?
3. Difference between class-based ordering and rule-based ordering.
4. Briefly explain the below terms in your own words:
a. Mutually exclusive
b. Exhaustive
5. Name the terms that define the following statements:
a. Fraction of records that satisfy only antecedent of a rule.
b. Fraction of records that satisfy both antecedent and consequent of a rule.
Pre-lab:-
1. Problem: If you are a business person trying to get the best return on your marketing
investment, it is crucial that you target people in the right way. If you get it wrong, you
risk not making any sales, or worse, damaging your customer trust.
Think of how hierarchical clustering can help you to solve this problem.
ANS:-
Business Problem: The enterprise wishes to organize customers into groups/segments based on
similar traits, product preferences and expectations. Segments are constructed based on
customer demographic characteristics, psychographics, past behavior and product use behavior.
Business Benefit: Once the segments are identified, marketing messages and products can be
customized for each segment. The better the segment(s) chosen for targeting by a particular
organization, the more successful the business will be in the market.
Hierarchical Clustering can help an enterprise organize data into groups to identify similarities
and, equally important, dissimilar groups and characteristics, so that the business can target
pricing, products, services, marketing messages and more.
4. What are the metrics used to compute the linkage?
In-lab:-
1. This dataset is taken from the Motor Trend US magazine, and comprises fuel consumption
and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).
Assume that you work for Motor Trend, a magazine about the automobile industry. Looking at a
data set of a collection of cars, they are interested in exploring the relationship between a set of
variables. Help them perform the desired analysis to get meaningful insights.
(i) Split the dataset into two halves wherein the first half named ‘X’ contains
the columns 1,3,4,6 and the second half named ‘Y’ contains the column 9.
(ii) Apply the linkage function on ‘X’ considering the ward method. Construct
a dendrogram by giving appropriate title, x and y labels.
(iii) Plot a horizontal line for getting the precise number of clusters and store the value in
the variable ‘k’.
(iv) Perform the AgglomerativeClustering by inputting the following different combinations
for affinity and linkage, and then compare the accuracy of the fitted clustering against
the variable ‘Y’.
      affinity      linkage
a)    euclidean     ward
b)    euclidean     complete
c)    euclidean     average
d)    manhattan     average
Writing space of the Problem:(For Student’s use only)
import numpy as np
import pandas as pd
import scipy
from scipy.cluster.hierarchy import dendrogram,linkage
from scipy.cluster.hierarchy import fcluster
from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import pdist
import sklearn
from sklearn.cluster import AgglomerativeClustering
import sklearn.metrics as sm
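# Sketch for step (i): load the data and split it into X and Y.
# The file name 'mtcars.csv' and a leading car-name column are assumptions.
import matplotlib.pyplot as plt
cars = pd.read_csv('mtcars.csv')
X = cars.iloc[:, [1, 3, 4, 6]].values   # columns 1, 3, 4 and 6
y = cars.iloc[:, 9].values              # column 9 (the variable called 'Y' in the question)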
# Z = linkage(X, method) creates the tree using the specified method, which describes how to
# measure the distance between clusters. Z = linkage(X, method, metric) performs clustering by
# passing metric to the pdist function, which computes the distance between the rows of X.
Z=linkage(X,'ward')
# Plotting dendrogram
dendrogram(Z,truncate_mode='lastp',p=12,leaf_rotation=45.,leaf_font_size=15.,show_contracted=True)
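# Sketch for steps (ii)-(iii): title, axis labels and a horizontal cut line.
# The cut height of 500 is an assumption; read the precise number of clusters from where the line crosses the tree.
plt.title('Hierarchical Clustering Dendrogram (truncated)')
plt.xlabel('Cluster size')
plt.ylabel('Distance')
plt.axhline(y=500, c='k')
plt.show()
k = 2   # number of clusters suggested by the cut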
# Accuracy using AgglomerativeClustering with euclidean and complete
k=2
Hclustering=AgglomerativeClustering(n_clusters=k,affinity='euclidean',linkage='complete')
Hclustering.fit(X)
sm.accuracy_score(y,Hclustering.labels_)
k=2
Hclustering=AgglomerativeClustering(n_clusters=k,affinity='euclidean',linkage='average')
Hclustering.fit(X)
sm.accuracy_score(y,Hclustering.labels_)
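# Sketch for the remaining combinations in step (iv): (a) euclidean/ward and (d) manhattan/average.
# Note: recent scikit-learn releases rename the 'affinity' parameter to 'metric'.
Hclustering=AgglomerativeClustering(n_clusters=k,affinity='euclidean',linkage='ward')
Hclustering.fit(X)
print(sm.accuracy_score(y,Hclustering.labels_))
Hclustering=AgglomerativeClustering(n_clusters=k,affinity='manhattan',linkage='average')
Hclustering.fit(X)
print(sm.accuracy_score(y,Hclustering.labels_))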
ANS:-
Pros of Ward’s method:
· Ward’s method approach also does well in separating clusters if there is noise between
clusters.
Cons of Ward’s method:
· Ward’s method approach is also biased towards globular clusters.
Pros of Group Average:
· The group average approach also does well in separating clusters if there is noise between
clusters.
Cons of Group Average:
· The group Average approach is biased towards globular clusters.
Pros of Single Linkage:
· This approach can separate non-elliptical shapes as long as the gap between two clusters
is not small.
Cons of Single Linkage:
· MIN approach cannot separate clusters properly if there is noise between clusters.
Pros of Complete Linkage:
· MAX approach does well in separating clusters if there is noise between clusters.
Cons of Complete Linkage:
· Max approach is biased towards globular clusters.
· Max approach tends to break large clusters.
Post-lab:-
1. Earning a master's degree helps you gain specialized knowledge to advance in your field. You
can focus on a particular field of study, which helps you become more competitive in your
field. Even the most qualified and confident applicants worry about getting into a master's
program. But don’t panic! Graduate school acceptance rates, which give the percentage of
applicants that were accepted to a particular program in an academic year, can help you
determine how likely you are to get into a given program.
This dataset helps us to predict Graduate Admissions from an Indian perspective. The dataset
contains a few parameters which are viewed as significant during the application for Masters
Programs. The parameters included are :
1. GRE Scores ( out of 340 )
2. TOEFL Scores ( out of 120 )
3. College Rating ( out of 5 )
4. Statement of Purpose (SOP) and Letter of Recommendation (LOR) Strength ( out of 5 )
5. Undergrad GPA ( out of 10 )
6. Research Experience ( either 0 or 1 )
7. Chance of Admit ( ranging from 0 to 1 )
This helps students shortlist universities that match their profiles by predicting the chance of
admission, giving them a fair idea about their chances at a particular university.
a. Apply agglomerative clustering on the dataset using the ward method for each of the following pairs:
CGPA and chance of admit.
GRE and chance of admit.
University ranking and chance of admit.
d. Display the scatter plots between all the influencing factors (CGPA, GRE, TOEFL, SOP, LOR
and University ranking) and chance of admit.
e. By visualising the above scatter plots, rank the factors in order of linearity to analyse which
factor influences the chance of admit the most.
Example: x < y < z means a greater value of z implies a greater chance of admit.
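A minimal setup sketch for the code below; the file name 'Admission_Predict.csv' and the column positions are assumptions:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
customer_data = pd.read_csv('Admission_Predict.csv')   # file name assumed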
# Considering the CGPA and chance of admit into the data frame; change the parameters by changing
# the index values to consider different sets of data.
data = customer_data.iloc[:, [2,8]].values
#print(data)
import scipy.cluster.hierarchy as shc
plt.figure(figsize=(10, 7))
plt.title("Customer Dendrograms")
# Plotting Dendrogram
dend = shc.dendrogram(shc.linkage(data, method='ward'),truncate_mode='level')
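plt.show()
# Sketch of the clustering fit used by the scatter plot below; the number of clusters (3) is assumed from the dendrogram
cluster = AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='ward')
cluster.fit_predict(data)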
# Display the different scatter plots and compare the attribute ranking.
plt.scatter(data[:,0], data[:,1], c=cluster.labels_, cmap='rainbow')
2. Your university is going to participate in an inter college cricket tournament. Therefore, you
need to select the best set of players to form an efficient team. You decide to take a physical
exam of all the interested students and then segregate them into 2 groups namely fit/unfit. Which
kind of data science technique will you use? Classification or Clustering?
Writing space of the Problem:(For Student’s use only)
ANS:-
In this case the classes are already defined, namely fit and unfit. Therefore, this is a classification
problem. Classification is the process of assigning data to predefined class labels. Clustering, on the
other hand, is similar to classification but has no predefined class labels. Classification is a
supervised learning task, whereas clustering is known as unsupervised learning.
Viva Voce:-
1. Why is hierarchical clustering considered as an unsupervised machine learning algorithm?
2. List out the 2 kinds of hierarchical clustering techniques. How do the both kinds differ from
each other?
3. State the differences between hierarchical and K-means clustering.
4. What are the real world applications of clustering?
5. What are the different linkage methods to use agglomerative clustering effectively?
Pre-lab:-
1.What do you mean by an outlier? What are the main causes for outliers?
ANS:-
Outliers are extreme values that deviate from the other observations in the data; they may indicate
variability in a measurement, experimental errors, or a novelty.
Most common causes of outliers on a data set:
· Data entry errors (human errors)
· Measurement errors (instrument errors)
· Experimental errors (data extraction or experiment planning/executing errors)
· Intentional (dummy outliers made to test detection methods)
· Data processing errors (data manipulation or data set unintended mutations)
· Sampling errors (extracting or mixing data from wrong or various sources)
· Natural (not an error, novelties in data)
5. Consider the below dataset, which comprises the income (in thousands) of 15 people in
an organisation.
[ 45, 51, 63, 48, 67, 48, 56, 2, 62, 59, 44, 61, 99, 46, 52]
What do you observe from the above data? Is there any significant difference between the
income of a few employees? If so, what could be the reason for it?
ANS:- The observations 2 and 99 are outliers because they deviate the most from the
remaining observations. These observations are also very important for this dataset, since they
could be the salary of a peon and of the CEO of the organization.
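This can be checked quickly with the usual 1.5 × IQR rule; a small sketch (only the choice of the IQR convention is assumed):
import numpy as np
income = np.array([45, 51, 63, 48, 67, 48, 56, 2, 62, 59, 44, 61, 99, 46, 52])
q1, q3 = np.percentile(income, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(income[(income < lower) | (income > upper)])   # -> [ 2 99]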
In-lab:-
1. The dataset Boston house prices consists of 9 attributes: CRIM, ZN, INDUS, LSTAT, NOX,
RM, DIS, RAD, TAX. The description of each attribute is given below.
CRIM per capita crime rate by town
ZN proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS proportion of non-retail business acres per town
LSTAT % lower status of the population
NOX nitric oxides concentration (parts per 10 million)
RM average number of rooms per dwelling
DIS weighted distances to five Boston employment centres
RAD index of accessibility to radial highways
TAX full-value property-tax rate per $10,000
import pandas as pd
import numpy as np
boston=pd.read_csv(r'C:\Users\Hp\Desktop\BIG DATA ANALYTICS\R Programing\DATASET\boston.csv')
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt
# boxplot for a column
sns.boxplot(x=boston["TAX"])
# scatterplot between INDUS and TAX
boston_c = boston
fig, ax = plt.subplots(figsize=(16,8))
ax.scatter(boston_c["INDUS"], boston_c["TAX"])
ax.set_xlabel('proportion of non-retail business acres per town')
ax.set_ylabel('full-value property-tax rate per $10,000')
plt.show()
# z_score method
from scipy import stats
zscore = np.abs(stats.zscore(boston_c))
print(zscore)
# np.where below returns two arrays: the first gives the row indices and the second the column indices of the outliers
threshold = 3
print(np.where(zscore > 3))
# first five outlier z_score values
print(zscore[55][1])
print(zscore[56][1])
print(zscore[57][1])
print(zscore[141][3])
print(zscore[199][1])
# removing outliers using z_score
boston_clean = boston
boston_clean = boston_clean[(zscore < 3).all(axis=1)]
boston.shape
boston_clean.shape
# IQR method
boston_iqr = boston
Q1 = boston_iqr.quantile(0.25)
Q1
Q3 = boston_iqr.quantile(0.75)
IQR = Q3 - Q1
#printing IQR values of each column
print(IQR)
# prints the Booleans values
print((boston_iqr < (Q1 - 1.5 * IQR)) | (boston_iqr > (Q3 + 1.5 * IQR)))
#Remove Outliers using IQR
boston_iqr_clean = boston_iqr[~((boston_iqr < (Q1 - 1.5 * IQR)) | (boston_iqr > (Q3 + 1.5 * IQR))).any(axis=1)]
2. Consider the iris dataset. It includes three iris species with 50 samples each as well as some
properties about each flower. The columns in this dataset are:
SepalLengthCm
SepalWidthCm
PetalLengthCm
PetalWidthCm
Species
https://drive.google.com/file/d/1HEEMrAQqAynHdM5TmK0G-mD5Qr0OW2J8/view?usp=sharing
Import the csv file and use the boxplot method to visualise the outliers considering the 4
properties of a flower. You will notice that one of the properties has outliers.
1. Considering the range of the outliers from the visualisation, display the observations which
have outliers.
2.Implement a DBSCAN model fitting on the dataset taking epsilon value as 0.8 and minimum
samples value as 19.
3.Print the counter values using the counter function on the model labels.
4.Considering the values obtained from the model labels print the outliers of the data.
5.Draw a scatter plot between petal length and sepal width to visualise the outliers.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rcParams
%matplotlib inline
rcParams['figure.figsize']=5,4
df=pd.read_csv('iris.csv')   # file name assumed; download the CSV from the link above
df.columns=['Sepal.Length','Sepal.Width','Petal.Length','Petal.Width','Species']
X=df.iloc[:,0:4].values
y=df.iloc[:,4].values
df[:5]
df.boxplot(return_type='dict')
plt.plot()
#prints the outliers greater than 4 ( 4-> by observing the box plot)
Sepal_Width=X[:,1]
iris_outliers=(Sepal_Width>4)
df[iris_outliers]
#prints the outliers less than 2.05 ( 2.05-> by observing the box plot)
Sepal_Width=X[:,1]
iris_outliers=(Sepal_Width<2.05)
df[iris_outliers]
pd.options.display.float_format='{:.1f}'.format
X_df=pd.DataFrame(X)
print(X_df.describe())
#DBSCAN method
import seaborn as sb
import sklearn
from sklearn.cluster import DBSCAN
from collections import Counter
%matplotlib inline
sb.set_style('whitegrid')
#DBSCAN function; 'data' holds the four numeric flower measurements
data = X
model=DBSCAN(eps=0.8,min_samples=19).fit(data)
print(model)
outliers_df=pd.DataFrame(data)
print(Counter(model.labels_))
# -1 represents outliers
print(outliers_df[model.labels_==-1])
fig=plt.figure()
ax=fig.add_axes([.1,.1,1,1])
colors=model.labels_
ax.scatter(data[:,2],data[:,1],c=colors,s=120)
ax.set_xlabel('Petal Length')
ax.set_ylabel('Sepal Width')
Post-lab:-
1. Consider the following student dataset
https://drive.google.com/file/d/1edmKnHjXkTyHT6gSYhwLw9rTpzoy1Cig/view?usp=sharing
which consists of student details of two schools in a town.
i. Find the students who have taken an anomalously high number of leaves compared with
the average number of absences, by implementing a z_score function that takes the mean
and standard deviation into account.
ii. Find the number of students who got the least and the highest scores in the subject G1,
considering threshold = 2.5.
import numpy as np
import pandas as pd
import seaborn as sn
import matplotlib.pyplot
from collections import Counter
dataset=pd.read_csv("student_dataset.csv")
outliers=[]
matplotlib.pyplot.boxplot(dataset["absences"]) # follow the same procedure for G1
def detect_outlier(data_1):
    c = 0
    threshold = 3   # use threshold = 2.5 when checking the G1 scores in part (ii)
    mean_1 = np.mean(data_1)
    std_1 = np.std(data_1)
    for y in data_1:
        z_score = (y - mean_1) / std_1
        if np.abs(z_score) > threshold:
            outliers.append(y)
            c = c + 1
    print(c)
    return outliers
outlier_datapoints = detect_outlier(dataset["absences"])
print(outlier_datapoints)
dataset.describe()
2. Can we find outliers for categorical values? Explain.
3. A sugar factory weighs every sugar packet on a weighing machine before packing
them into cartons. As per the guidelines of the factory, the standard weight of each sugar
packet should be 60 grams. It has been observed that during the final weighing of the
packets, a few of them gave an anomalous weight due to malfunctioning of the weighing
machines.
Consider the below dataset, which comprises the weights of the packets.
https://drive.google.com/file/d/1JkdkQ3j-J93DCfZa3kUjDycEtRzShk6V/view?usp=sharing
c. Segregate the outliers from the inliers using the “loc” method to get the values of
“true_index”. Also obtain the values of “false_index”.
d. Now find the median from the values obtained in “true_index”.
e. Replace all the outliers with the median.
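A minimal sketch of steps (c)-(e), assuming the file is saved as 'sugar_packets.csv' with a 'weight' column and using a z-score cut-off of 3 (the file name, the column name and the cut-off are all assumptions):
import numpy as np
import pandas as pd
packets = pd.read_csv('sugar_packets.csv')      # file name assumed
w = packets['weight']                           # column name assumed
z = np.abs((w - w.mean()) / w.std())
true_index = z <= 3    # inliers
false_index = z > 3    # outliers
# (c) segregate inliers and outliers with loc
inliers = packets.loc[true_index, 'weight']
outliers = packets.loc[false_index, 'weight']
# (d) median of the inlier weights
med = inliers.median()
# (e) replace the outliers with the median
packets.loc[false_index, 'weight'] = med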
Viva Voce:-