
MACHINE LEARNING LAB MANUAL (R20)

What is Find-S Algorithm in Machine Learning?

In order to understand Find-S algorithm, you need to have a basic idea of the following concepts
as well:

1. Concept Learning
2. General Hypothesis
3. Specific Hypothesis

1. Concept Learning

Let’s try to understand concept learning with a real-life example. Most human learning is
based on past instances or experiences. For example, we are able to identify any type of vehicle
based on a certain set of features, such as make and model, that are defined over a large set of
features.

These special features differentiate the set of cars, trucks, etc from the larger set of vehicles.
These features that define the set of cars, trucks, etc are known as concepts.

Similar to this, machines can also learn from concepts to identify whether an object belongs to a
specific category or not. Any algorithm that supports concept learning requires the following:

o Training Data
o Target Concept
o Actual Data Objects

2. General Hypothesis

A hypothesis, in general, is an explanation for something. The general hypothesis states the
general relationship between the major variables. For example, a general hypothesis for
ordering food would be “I want a burger”.

G = { ‘?’, ‘?’, ‘?’, …..’?’}

3. Specific Hypothesis

The specific hypothesis fills in all the important details about the variables given in the general
hypothesis. Adding more specific details to the example above gives: “I want a cheeseburger
with a chicken pepperoni filling and a lot of lettuce”.

S = {‘Φ’,’Φ’,’Φ’, ……,’Φ’}

Now, let’s talk about the Find-S Algorithm in Machine Learning.


The Find-S algorithm follows the steps written below:

1. Initialize ‘h’ to the most specific hypothesis.


2. The Find-S algorithm considers only the positive examples and ignores the negative
ones. For each positive example, the algorithm checks every attribute in the example.
If the attribute value is the same as the hypothesis value, the algorithm moves on
without any change; if the attribute value is different from the hypothesis value, the
algorithm changes that hypothesis value to ‘?’.

Now that we are done with the basic explanation of the Find-S algorithm, let us take a look at
how it works.

How Does It Work?

1. The process starts by initializing ‘h’ with the most specific hypothesis; generally, this is
   the first positive example in the data set.
2. Check each example: if it is negative, move on to the next example; if it is positive,
   consider it for the next step.
3. Check whether each attribute in the example is equal to the corresponding hypothesis value.
4. If the values match, no change is made.
5. If the values do not match, the hypothesis value is changed to ‘?’.
6. Repeat until the last positive example in the data set is reached.

Limitations of Find-S Algorithm

There are a few limitations of the Find-S algorithm listed down below:

1. There is no way to determine if the hypothesis is consistent throughout the data.


2. Inconsistent training sets can actually mislead the Find-S algorithm, since it ignores the
negative examples.
3. Find-S algorithm does not provide a backtracking technique to determine the best
possible changes that could be done to improve the resulting hypothesis.

Now that we are aware of the limitations of the Find-S algorithm, let us take a look at a practical
implementation of the Find-S Algorithm.

Experiment-1:

Implement and demonstrate the FIND-S algorithm for finding the most specific hypothesis
based on a given set of training data samples. Read the training data from a .CSV file.

Implementation of Find-S Algorithm

To understand the implementation, let us apply it to a small data set with a handful of
examples that record whether a person goes for a walk.
The target concept of this particular problem is the set of conditions under which the person likes to go for a walk.

Time Weather Temperature Company Humidity Wind Goes


Morning Sunny Warm Yes Mild Strong Yes
Evening Rainy Cold No Mild Normal No
Morning Sunny Moderate Yes Normal Normal Yes
Evening Sunny Cold Yes High Strong Yes

Looking at the data set, we have six attributes and a final attribute that defines the positive or
negative example. In this case, yes is a positive example, which means the person will go for a
walk.

So now, the initial (most specific) hypothesis, taken from the first positive example, is:

h0 = {‘Morning’, ‘Sunny’, ‘Warm’, ‘Yes’, ‘Mild’, ‘Strong’}

Starting from this hypothesis, we now consider the remaining examples one by one, but only the
positive ones.

h1= {‘Morning’, ‘Sunny’, ‘?’, ‘Yes’, ‘?’, ‘?’}

h2 = {‘?’, ‘Sunny’, ‘?’, ‘Yes’, ‘?’, ‘?’}

Wherever an attribute value differed from the current hypothesis, we replaced it with ‘?’ to
obtain the resulting hypothesis. Now that we know how the Find-S algorithm works, let us take
a look at an implementation using Python.
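The program below reads the training examples from a file named data.csv. For reference, the walk data set
above can be stored in that file as follows (a plain CSV with a header row, which is what the program assumes):

Time,Weather,Temperature,Company,Humidity,Wind,Goes
Morning,Sunny,Warm,Yes,Mild,Strong,Yes
Evening,Rainy,Cold,No,Mild,Normal,No
Morning,Sunny,Moderate,Yes,Normal,Normal,Yes
Evening,Sunny,Cold,Yes,High,Strong,Yes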

Program:

import pandas as pd
import numpy as np

data = pd.read_csv("data.csv")
print(data, "\n")

d = np.array(data)[:, :-1]
print("\nThe attributes are: ", d)

target = np.array(data)[:, -1]
print("\nThe target is: ", target)

def train(c, t):
    # start with the first positive example as the most specific hypothesis
    for i, val in enumerate(t):
        if val == "Yes":
            specific_hypothesis = c[i].copy()
            break
    # generalize the hypothesis using the remaining positive examples
    for i, val in enumerate(c):
        if t[i] == "Yes":
            for x in range(len(specific_hypothesis)):
                if val[x] != specific_hypothesis[x]:
                    specific_hypothesis[x] = '?'
    return specific_hypothesis

print("\nThe final hypothesis is:", train(d, target))

Output:
Experiment-2:

For a given set of training data examples stored in a .CSV file, implement and demonstrate
the Candidate Elimination algorithm to output a description of the set of all hypotheses
consistent with the training examples.

1. It is an extended form of the Find-S algorithm.

2. It considers both positive and negative examples.

3. Positive examples are handled as in Find-S, generalizing the specific hypothesis (specific to general).

4. Negative examples are used to specialize the general hypothesis (general to specific).

Algorithm:

Step 1: Load the data set.

Step 2: Initialize the General Hypothesis and the Specific Hypothesis.

Step 3: For each training example:

Step 4: If the example is positive:

if attribute_value == hypothesis_value:

Do nothing

else:

replace the attribute value with '?' (basically generalizing it)

Step 5: If the example is negative:

Make the general hypothesis more specific.

trainingdata.csv

Sunny,Warm,Normal,Strong,Warm,Same,Yes
Sunny,Warm,High,Strong,Warm,Same,Yes
Rainy,Cold,High,Strong,Warm,Change,No
Sunny,Warm,High,Strong,Cool,Change,Yes

prog2.py

import csv

with open("trainingdata.csv") as f:
    csv_file = csv.reader(f)
    data = list(csv_file)

s = data[1][:-1]
g = [['?' for i in range(len(s))] for j in range(len(s))]

for i in data:
    if i[-1] == "Yes":
        for j in range(len(s)):
            if i[j] != s[j]:
                s[j] = '?'
                g[j][j] = '?'
    elif i[-1] == "No":
        for j in range(len(s)):
            if i[j] != s[j]:
                g[j][j] = s[j]
            else:
                g[j][j] = "?"
    print("\nSteps of Candidate Elimination Algorithm", data.index(i) + 1)
    print(s)
    print(g)

gh = []
for i in g:
    for j in i:
        if j != '?':
            gh.append(i)
            break

print("\nFinal specific hypothesis:\n", s)
print("\nFinal general hypothesis:\n", gh)
Output

Steps of Candidate Elimination Algorithm 1


['Sunny', 'Warm', '?', 'Strong', 'Warm', 'Same']
[['?', '?', '?', '?', '?', '?'], ['?', '?', '?', '?', '?', '?'], ['?', '?', '?', '?', '?', '?'], ['?', '?', '?', '?', '?', '?'], ['?', '?', '?',
'?', '?', '?'], ['?', '?', '?', '?', '?', '?']]

Steps of Candidate Elimination Algorithm 2


['Sunny', 'Warm', '?', 'Strong', 'Warm', 'Same']
[['?', '?', '?', '?', '?', '?'], ['?', '?', '?', '?', '?', '?'], ['?', '?', '?', '?', '?', '?'], ['?', '?', '?', '?', '?', '?'], ['?', '?', '?',
'?', '?', '?'], ['?', '?', '?', '?', '?', '?']]

Steps of Candidate Elimination Algorithm 3


['Sunny', 'Warm', '?', 'Strong', 'Warm', 'Same']
[['Sunny', '?', '?', '?', '?', '?'], ['?', 'Warm', '?', '?', '?', '?'], ['?', '?', '?', '?', '?', '?'], ['?', '?', '?', '?', '?', '?'],
['?', '?', '?', '?', '?', '?'], ['?', '?', '?', '?', '?', 'Same']]

Steps of Candidate Elimination Algorithm 4


['Sunny', 'Warm', '?', 'Strong', '?', '?']
[['Sunny', '?', '?', '?', '?', '?'], ['?', 'Warm', '?', '?', '?', '?'], ['?', '?', '?', '?', '?', '?'], ['?', '?', '?', '?', '?', '?'],
['?', '?', '?', '?', '?', '?'], ['?', '?', '?', '?', '?', '?']]

Final specific hypothesis:


['Sunny', 'Warm', '?', 'Strong', '?', '?']

Final general hypothesis:

[['Sunny', '?', '?', '?', '?', '?'], ['?', 'Warm', '?', '?', '?', '?']]
Experiment-3:

Write a program to demonstrate the working of the decision tree based ID3 algorithm. Use
an appropriate data set for building the decision tree and apply this knowledge to classify a
new sample.

tennisdata.csv dataset:

Outlook,Temperature,Humidity,Windy,PlayTennis
Sunny,Hot,High,False,No
Sunny,Hot,High,True,No
Overcast,Hot,High,False,Yes
Rainy,Mild,High,False,Yes
Rainy,Cool,Normal,False,Yes
Rainy,Cool,Normal,True,No
Overcast,Cool,Normal,True,Yes
Sunny,Mild,High,False,No
Sunny,Cool,Normal,False,Yes
Rainy,Mild,Normal,False,Yes
Sunny,Mild,Normal,True,Yes
Overcast,Mild,High,True,Yes
Overcast,Hot,Normal,False,Yes
Rainy,Mild,High,True,No

Program:
import pandas as pd
from sklearn import tree
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

data = pd.read_csv('tennisdata.csv')
print("The first 5 values of data is \n", data.head())

X = data.iloc[:, :-1]
print("\nThe first 5 values of Train data is \n", X.head())

y = data.iloc[:, -1]
print("\nThe first 5 values of Train output is \n", y.head())

# Convert the categorical attribute values into numbers
le_outlook = LabelEncoder()
X.Outlook = le_outlook.fit_transform(X.Outlook)

le_Temperature = LabelEncoder()
X.Temperature = le_Temperature.fit_transform(X.Temperature)

le_Humidity = LabelEncoder()
X.Humidity = le_Humidity.fit_transform(X.Humidity)

le_Windy = LabelEncoder()
X.Windy = le_Windy.fit_transform(X.Windy.astype(str))  # cast to str in case pandas parsed True/False as booleans

print("\nNow the Train data is", X.head())

le_PlayTennis = LabelEncoder()
y = le_PlayTennis.fit_transform(y)
print("\nNow the Train output is\n", y)

classifier = DecisionTreeClassifier()
classifier.fit(X, y)

def labelEncoderForInput(list1):
    # encode a raw [Outlook, Temperature, Humidity, Windy] sample with the same encoders
    list1[0] = le_outlook.transform([list1[0]])[0]
    list1[1] = le_Temperature.transform([list1[1]])[0]
    list1[2] = le_Humidity.transform([list1[2]])[0]
    list1[3] = le_Windy.transform([list1[3]])[0]
    return [list1]

inp1 = ["Rainy", "Cool", "High", "False"]
pred1 = labelEncoderForInput(inp1)
y_pred = classifier.predict(pred1)
print("\nfor input {0}, we obtain {1}".format(inp1, le_PlayTennis.inverse_transform(y_pred)[0]))
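To inspect the tree that was learned (an optional addition, assuming a reasonably recent scikit-learn where
sklearn.tree.export_text is available), the decision rules can be printed in text form:

from sklearn.tree import export_text

# print the learned decision rules; feature names come from the encoded training frame
print(export_text(classifier, feature_names=list(X.columns)))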

Experiment-4:

Exercise to solve the real world problems using the following machine learning methods:

a)Linear Regression b)Logistic Regression

a) Linear Regression:

The linear regression algorithm models a linear relationship between a dependent variable (y) and one or more
independent variables (x), hence the name linear regression. Since linear regression models a linear
relationship, it finds how the value of the dependent variable changes according to the value of the
independent variable.

Linear regression uses the relationship between the data points to draw a straight line through all of them.

Program:
import matplotlib.pyplot as plt
from scipy import stats

x = [5, 7, 8, 7, 2, 17, 2, 9, 4, 11, 12, 9, 6]
y = [99, 86, 87, 88, 111, 86, 103, 87, 94, 78, 77, 85, 86]

slope, intercept, r, p, std_err = stats.linregress(x, y)

def myfunc(x):
    return slope * x + intercept

mymodel = list(map(myfunc, x))

plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()
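To get a quick sense of how well the line fits the data, the correlation coefficient r returned by linregress
above can also be printed (a small optional addition; values close to -1 or 1 indicate a strong linear
relationship):

# r comes from the stats.linregress call in the program above
print("Correlation coefficient r:", r)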

Output:

b)Logistic Regression:

Logistic regression aims to solve classification problems. It does this by predicting categorical
outcomes, unlike linear regression that predicts a continuous outcome.

Program:

import numpy
from sklearn import linear_model

x = numpy.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65, 4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1, 1)
y = numpy.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

logr = linear_model.LogisticRegression()
logr.fit(x, y)

predicted = logr.predict(numpy.array([2.09]).reshape(-1, 1))
print(predicted)

Output:

[0]
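Beyond the hard class label, logistic regression can also report the estimated probability of each class.
A small optional addition (a sketch reusing the fitted model above) is:

# probability of class 0 and class 1 for the same input value
print(logr.predict_proba(numpy.array([2.09]).reshape(-1, 1)))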

Experiment-5:

Develop a program to compute bias and variance, remove duplicates, and perform cross-validation.

Description:

In machine learning, an error is a measure of how accurately an algorithm can make predictions
on previously unseen data. On the basis of these errors, the machine learning model that performs
best on the particular dataset is selected. There are mainly two types of errors in machine
learning:

o Reducible errors: these errors can be reduced to improve the model accuracy. They can further
  be decomposed into bias and variance.

o Irreducible errors: these errors are caused by noise in the data itself and cannot be reduced,
  whichever model is chosen.

Program:

import numpy as np

from sklearn.linear_model import LinearRegression

from sklearn.model_selection import train_test_split, cross_val_score

from sklearn.metrics import mean_squared_error

np.random.seed(42)
X = np.random.rand(100, 10)

y = np.dot(X, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) + np.random.randn(100) * 0.5

X = np.unique(X, axis=0)

y = y[:len(X)]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LinearRegression()

model.fit(X_train, y_train)

train_error = mean_squared_error(y_train, model.predict(X_train))

test_error = mean_squared_error(y_test, model.predict(X_test))

print("Training error:", train_error)

print("Testing error:", test_error)

num_splits = 10

cv_scores = cross_val_score(model, X, y, cv=num_splits, scoring='neg_mean_squared_error')

bias = -np.mean(cv_scores)

variance = np.var(cv_scores)

print("Bias:", bias)

print("Variance:", variance)

Output:

Training error: 29.298675932449534


Testing error: 35.55713027301944
Bias: 37.451044010954064
Variance: 153.46346777877687

Experiment-6:

Write a program to implement Categorical Encoding, One-hot Encoding.

One-hot encoding is used to convert categorical variables into a format that can be readily used
by machine learning algorithms.
The basic idea of one-hot encoding is to create new variables that take on values 0 and 1 to
represent the original categorical values.

For example, one-hot encoding a categorical variable that contains team names (say A, B and C) replaces it
with new indicator variables teamA, teamB and teamC, each containing only 0 and 1 values, as the program
below demonstrates:

Program:

import pandas as pd

df = pd.DataFrame({'team': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'],

'points': [25, 12, 15, 14, 19, 23, 25, 29]})

print(df)

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown='ignore')

encoder_df = pd.DataFrame(encoder.fit_transform(df[['team']]).toarray())

final_df = df.join(encoder_df)

print(final_df)

final_df.drop('team', axis=1, inplace=True)

print(final_df)
final_df.columns = ['points', 'teamA', 'teamB', 'teamC']

print(final_df)
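As an alternative to OneHotEncoder, pandas offers get_dummies, which produces the same kind of indicator
columns in a single call (a sketch reusing the df created above):

# one-hot encode the 'team' column directly with pandas
dummies_df = pd.get_dummies(df, columns=['team'], prefix='team')
print(dummies_df)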

Experiment-7:
Build an Artificial Neural Network by implementing the Back propagation algorithm and
test the same using appropriate data sets.

Back Propagation:

Back propagation is a widely used algorithm in machine learning and neural networks that is
used to train artificial neural networks (ANNs) by adjusting the weights of the connections
between neurons. The goal of back propagation is to minimize the difference between the
predicted output of the neural network and the true output of the training data.
The back propagation algorithm works by first passing an input through the neural network to
obtain an output. The difference between the predicted output and the true output is then
calculated, and this difference is used to adjust the weights of the connections between neurons
in the network. This adjustment is done by calculating the gradient of the error function with
respect to the weights, which tells us how much each weight needs to be adjusted in order to
decrease the error.

Program:
import numpy as np

X = np.array(([2, 9], [1, 5], [3, 6]), dtype=float)
Y = np.array(([92], [86], [89]), dtype=float)
X = X / np.amax(X, axis=0)   # normalise inputs
Y = Y / 100                  # normalise outputs

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def derivatives_sigmoid(x):
    return x * (1 - x)

epoch = 7000
lr = 0.1
inputlayer_neurons = 2
hiddenlayer_neurons = 3
output_neurons = 1

wh = np.random.uniform(size=(inputlayer_neurons, hiddenlayer_neurons))
bh = np.random.uniform(size=(1, hiddenlayer_neurons))
wout = np.random.uniform(size=(hiddenlayer_neurons, output_neurons))
bout = np.random.uniform(size=(1, output_neurons))

for i in range(epoch):
    # forward pass
    hinp1 = np.dot(X, wh)
    hinp = hinp1 + bh
    hlayer_act = sigmoid(hinp)
    outinp1 = np.dot(hlayer_act, wout)
    outinp = outinp1 + bout
    output = sigmoid(outinp)
    # backward pass: propagate the error through the network
    EO = Y - output
    outgrad = derivatives_sigmoid(output)
    d_output = EO * outgrad
    EH = d_output.dot(wout.T)
    hiddengrad = derivatives_sigmoid(hlayer_act)
    d_hiddenlayer = EH * hiddengrad
    # update the weights
    wout += hlayer_act.T.dot(d_output) * lr
    wh += X.T.dot(d_hiddenlayer) * lr

print("input:\n" + str(X))
print("Actual output:\n" + str(Y))
print("predicted output:\n", output)

Output:
input:
[[0.66666667 1. ]
[0.33333333 0.55555556]
[1. 0.66666667]]
Actual output:
[[0.92]
[0.86]
[0.89]]
predicted output:
[[0.89699538]
[0.87715324]
[0.89533612]]

Experiment 8:

Write a program to implement the k-nearest neighbour algorithm to classify the iris data set.
Print both correct and wrong predictions.
Description:

o K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on


Supervised Learning technique.

o K-NN algorithm assumes the similarity between the new case/data and the available cases
and puts the new case into the category that is most similar to the available categories.

o K-NN algorithm stores all the available data and classifies a new data point based on
similarity. This means that when new data appears, it can easily be classified into a
well-suited category using the K-NN algorithm.

o K-NN algorithm can be used for Regression as well as for Classification but mostly it is
used for the Classification problems.

Why do we need a K-NN Algorithm?

Suppose there are two categories, Category A and Category B, and we have a new data point x1.
To decide which of these categories the point belongs to, we need a K-NN algorithm. With the
help of K-NN, we can easily identify the category or class of a new data point.

Program:

from sklearn.datasets import load_iris


from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# Load the iris dataset
iris = load_iris()
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2,
random_state=42)
# Create a KNeighborsClassifier object with k=3
knn = KNeighborsClassifier(n_neighbors=3)
# Fit the training data to the classifier
knn.fit(X_train, y_train)
# Predict the test set
y_pred = knn.predict(X_test)
# Print the number of correct and wrong predictions
correct = 0
wrong = 0
for i in range(len(y_test)):
    if y_test[i] == y_pred[i]:
        correct += 1
    else:
        wrong += 1
print("Number of correct predictions:", correct)
print("Number of wrong predictions:", wrong)

Output:
Number of correct predictions: 30
Number of wrong predictions: 0
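The experiment statement also asks for the individual predictions to be printed. A short optional addition
(a sketch reusing the variables from the program above) is:

# print every test sample with its predicted and actual class, flagged correct/wrong
for i in range(len(y_test)):
    status = "CORRECT" if y_test[i] == y_pred[i] else "WRONG"
    print(X_test[i], "Predicted:", iris.target_names[y_pred[i]],
          "Actual:", iris.target_names[y_test[i]], "->", status)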

Experiment-9:

Implement the non-parametric Locally Weighted Regression algorithm in order to fit data
points. Select appropriate data set for your experiment and draw graphs.

Program:
from math import ceil
import numpy as np
from scipy import linalg

def lowess(x, y, f, iterations):
    n = len(x)
    r = int(ceil(f * n))
    # distance to the r-th nearest neighbour of each point (local bandwidth)
    h = [np.sort(np.abs(x - x[i]))[r] for i in range(n)]
    w = np.clip(np.abs((x[:, None] - x[None, :]) / h), 0.0, 1.0)
    w = (1 - w ** 3) ** 3          # tricube weights
    yest = np.zeros(n)
    delta = np.ones(n)
    for iteration in range(iterations):
        for i in range(n):
            weights = delta * w[:, i]
            b = np.array([np.sum(weights * y), np.sum(weights * y * x)])
            A = np.array([[np.sum(weights), np.sum(weights * x)],
                          [np.sum(weights * x), np.sum(weights * x * x)]])
            beta = linalg.solve(A, b)
            yest[i] = beta[0] + beta[1] * x[i]
        # robustifying weights that down-weight points with large residuals
        residuals = y - yest
        s = np.median(np.abs(residuals))
        delta = np.clip(residuals / (6.0 * s), -1, 1)
        delta = (1 - delta ** 2) ** 2
    return yest

import math
n = 100
x = np.linspace(0, 2 * math.pi, n)
y = np.sin(x) + 0.3 * np.random.randn(n)
f = 0.25
iterations = 3
yest = lowess(x, y, f, iterations)

import matplotlib.pyplot as plt
plt.plot(x, y, "r.")
plt.plot(x, yest, "b-")
plt.show()
Output:
Experiment-10:
Assuming a set of documents that need to be classified, use the naïve Bayesian Classifier
model to perform this task. Built-in Java classes/API can be used to write the program.
Calculate the accuracy, precision, and recall for your data set.
Document1.csv dataset
I love this sandwich,pos
This is an amazing place,pos
I feel very good about these beers,pos
This is my best work,pos
What an awesome view,pos
I do not like this restaurant,neg
I am tired of this stuff,neg
I can't deal with this,neg
He is my sworn enemy,neg
My boss is horrible,neg
This is an awesome place,pos
I do not like the taste of this juice,neg
I love to dance,pos
I am sick and tired of this place,neg
What a great holiday,pos
That is a bad locality to stay,neg
We will have good fun tomorrow,pos
I went to my enemy's house today,neg

Program:
import pandas as pd
msg = pd.read_csv('document1.csv', names=['message', 'label'])
print("Total Instances of Dataset: ", msg.shape[0])
msg['labelnum'] = msg.label.map({'pos': 1, 'neg': 0})
X = msg.message
y = msg.labelnum
from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y)
from sklearn.feature_extraction.text import CountVectorizer
count_v = CountVectorizer()
Xtrain_dm = count_v.fit_transform(Xtrain)
Xtest_dm = count_v.transform(Xtest)
df = pd.DataFrame(Xtrain_dm.toarray(), columns=count_v.get_feature_names_out())
print(df[0:5])
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(Xtrain_dm, ytrain)
pred = clf.predict(Xtest_dm)
for doc, p in zip(Xtest, pred):
    p = 'pos' if p == 1 else 'neg'
    print("%s -> %s" % (doc, p))
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score
print('Accuracy Metrics: \n')
print('Accuracy: ', accuracy_score(ytest, pred))
print('Recall: ', recall_score(ytest, pred))
print('Precision: ', precision_score(ytest, pred))
print('Confusion Matrix: \n', confusion_matrix(ytest, pred))

Output:
Total Instances of Dataset: 18
am amazing an and awesome bad boss can dance deal ... that the \
0 0 0 0 0 0 0 0 0 0 0 ... 0 0
1 0 0 0 0 0 1 0 0 0 0 ... 1 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0
3 0 0 0 0 0 0 0 1 0 1 ... 0 0
4 0 0 0 0 0 0 1 0 0 0 ... 0 0

this tired to today view went what with


0 0 0 0 0 0 0 0 0
1 0 0 1 0 0 0 0 0
2 0 0 1 1 0 1 0 0
3 1 0 0 0 0 0 0 1
4 0 0 0 0 0 0 0 0
[5 rows x 43 columns]
He is my sworn enemy,neg -> pos
That is a bad locality to stay,neg -> neg
I went to my enemy's house today,neg -> pos
I can't deal with this,neg -> pos
My boss is horrible,neg -> pos
Accuracy Metrics:
Accuracy: 1.0
Recall: 1.0
Precision: 1.0
Confusion Matrix:
[[1 0]
[0 4]]

Experiment-11:
Apply EM algorithm to cluster a Heart Disease Data Set. Use the same data set for
clustering using k-Means algorithm. Compare the results of these two algorithms and
comment on the quality of clustering. You can add Java/Python ML library classes/API in
the program.

Program:
from sklearn.cluster import KMeans
from sklearn import preprocessing
from sklearn.mixture import GaussianMixture
from sklearn.datasets import load_iris
import sklearn.metrics as sm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
dataset=load_iris()
X=pd.DataFrame(dataset.data)
X.columns=['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width']
y=pd.DataFrame(dataset.target)
y.columns=['Targets']
plt.figure(figsize=(14,7))
colormap=np.array(['red','lime','black'])
# REAL PLOT
plt.subplot(1,3,1)
plt.scatter(X.Petal_Length,X.Petal_Width,c=colormap[y.Targets],s=40)
plt.title('Real')
# K-PLOT
plt.subplot(1,3,2)
model=KMeans(n_clusters=3)
model.fit(X)
predY=np.choose(model.labels_,[0,1,2]).astype(np.int64)
plt.scatter(X.Petal_Length,X.Petal_Width,c=colormap[predY],s=40)
plt.title('KMeans')
# GMM PLOT
scaler=preprocessing.StandardScaler()
scaler.fit(X)
xsa=scaler.transform(X)
xs=pd.DataFrame(xsa,columns=X.columns)
gmm=GaussianMixture(n_components=3)
gmm.fit(xs)
y_cluster_gmm=gmm.predict(xs)
plt.subplot(1,3,3)
plt.scatter(X.Petal_Length,X.Petal_Width,c=colormap[y_cluster_gmm],s=40)
plt.title('GMM Classification')

Output:
Text(0.5, 1.0, 'GMM Classification')
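To compare the quality of the two clusterings numerically, as the experiment asks, one option (a sketch
reusing the objects created above) is the adjusted Rand index from sklearn.metrics, which measures agreement
with the true labels and is insensitive to how the cluster numbers are assigned:

# higher adjusted Rand index = closer agreement with the true species labels
print("K-Means adjusted Rand index:", sm.adjusted_rand_score(y.Targets, model.labels_))
print("GMM adjusted Rand index:", sm.adjusted_rand_score(y.Targets, y_cluster_gmm))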

Experiment-12:
Exploratory data analysis for classification using pandas or Matplotlib.

Description:
o EDA is applied to investigate the data and summarize the key insights.
o It gives you a basic understanding of your data: its distribution, null values and much more.
o You can explore the data either using graphs or through some Python functions.
o In the non-graphical approach, you use functions such as shape, summary, describe, isnull,
  info, dtypes and more.
o In the graphical approach, you use plots such as scatter, box, bar, density and correlation
  plots.

Program:

import pandas as pd

import numpy as np

import seaborn as sns

df = pd.read_csv('titanic.csv')

df.head()   # View the data

#Basic information

df.info()

#Describe the data

df.describe()
#Find the duplicates

df.duplicated().sum()

#unique values

df['Pclass'].unique()

df['Survived'].unique()

df['Sex'].unique()

#Plot the counts of each class value

sns.countplot(x='Pclass', data=df)

#Datatypes

df.dtypes

#Filter data

df[df['Pclass']==1].head()

#Boxplot

df[['Fare']].boxplot()

#Correlation (numeric_only is needed for recent pandas versions)

df.corr(numeric_only=True)

#Correlation plot

sns.heatmap(df.corr(numeric_only=True))
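When this is run as a plain Python script rather than in a notebook, the seaborn and pandas plots created
above need an explicit show call to appear (a small, assumed addition):

import matplotlib.pyplot as plt
plt.show()   # render the countplot, boxplot and heatmap created above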
Output:
