Machine Learning Lab Manual R20
In order to understand Find-S algorithm, you need to have a basic idea of the following concepts
as well:
1. Concept Learning
2. General Hypothesis
3. Specific Hypothesis
1. Concept Learning
Let's try to understand concept learning with a real-life example. Most human learning is based
on past instances or experiences. For example, we are able to identify any type of vehicle based
on a certain set of features, such as make and model, that are defined over a large set of features.
These special features differentiate the set of cars, trucks, etc. from the larger set of vehicles.
The features that define the set of cars, trucks, etc. are known as concepts.
Similar to this, machines can also learn from concepts to identify whether an object belongs to a
specific category or not. Any algorithm that supports concept learning requires the following:
Training Data
Target Concept
Actual Data Objects
2. General Hypothesis
A hypothesis, in general, is an explanation for something. The general hypothesis states the
general relationship between the major variables. For example, a general hypothesis for
ordering food would be "I want a burger."
3. Specific Hypothesis
The specific hypothesis fills in all the important details about the variables given in the general
hypothesis. A more specific version of the example above would be "I want a cheeseburger with
a chicken pepperoni filling and a lot of lettuce."
In Find-S notation, the most general hypothesis is written as G = {'?', '?', '?', ……, '?'} and the
most specific hypothesis as S = {'Φ', 'Φ', 'Φ', ……, 'Φ'}, where '?' accepts any value for an
attribute and 'Φ' rejects every value.
Now that we are done with the basic explanation of the Find-S algorithm, let us take a look at
how it works.
1. The process starts by initializing 'h' to the most specific hypothesis; in practice it is set
from the first positive example in the data set.
2. We then examine each example in turn. If the example is negative, we move on to the
next one; if it is positive, we consider it in the next step.
3. For a positive example, we check whether each attribute value is equal to the
corresponding hypothesis value.
4. If the value matches, no change is made.
5. If the value does not match, the hypothesis value is changed to '?'.
6. We repeat this until we reach the last positive example in the data set (a short sketch of
this update step is shown below).
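The per-attribute update in steps 3 to 5 can be illustrated with a tiny, self-contained sketch; the
attribute values here are made up for illustration and are not the lab data set:
# Minimal illustration of one Find-S generalization step (hypothetical values).
hypothesis = ['Sunny', 'Warm', 'Normal', 'Strong']   # current specific hypothesis
example = ['Sunny', 'Warm', 'High', 'Strong']        # next positive example
# generalize only where the positive example disagrees with the hypothesis
for i in range(len(hypothesis)):
    if example[i] != hypothesis[i]:
        hypothesis[i] = '?'
print(hypothesis)   # ['Sunny', 'Warm', '?', 'Strong']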
There are a few limitations of the Find-S algorithm, listed below:
1. There is no way to determine whether the hypothesis it finds is the only one consistent
with the data, or whether it is the correct target concept.
2. It ignores negative examples entirely, so inconsistent or noisy training data can go
undetected and mislead the algorithm.
3. It always chooses the most specific hypothesis; if there are several maximally specific
hypotheses, Find-S cannot represent or choose among them.
Now that we are aware of the limitations of the Find-S algorithm, let us take a look at a practical
implementation of the Find-S Algorithm.
Experiment-1:
Implement and demonstrate the FIND-S algorithm for finding the most specific hypothesis
based on a given set of training data samples. Read the training data from a .CSV file.
To understand the implementation, let us apply it to a smaller data set with a handful of
examples that decide whether a person wants to go for a walk.
The concept for this particular problem is: on which days does a person like to go for a walk?
Looking at the data set, we have six attributes and a final attribute that defines the positive or
negative example. In this case, yes is a positive example, which means the person will go for a
walk.
This is our initial hypothesis, and now we will consider each example one by one, but only the
positive examples.
For every positive example, we replace each attribute value that differs from the hypothesis with
'?' to obtain the resultant hypothesis. Now that we know how the Find-S algorithm works, let us
take a look at an implementation using Python.
Program:
import pandas as pd
import numpy as np

# read the training data from a CSV file
data = pd.read_csv("data.csv")
print(data, "\n")

# separate the attribute columns from the target column
d = np.array(data)[:, :-1]
target = np.array(data)[:, -1]

def train(c, t):
    # initialize the specific hypothesis with the first positive example
    for i, val in enumerate(t):
        if val == "Yes":
            specific_hypothesis = c[i].copy()
            break
    # generalize the hypothesis attribute by attribute for every positive example
    for i, val in enumerate(c):
        if t[i] == "Yes":
            for x in range(len(specific_hypothesis)):
                if val[x] != specific_hypothesis[x]:
                    specific_hypothesis[x] = '?'
                else:
                    pass
    return specific_hypothesis

print("The final hypothesis is:", train(d, target))
Output:
Experiment-2:
For a given set of training data examples stored in a .CSV file, implement and demonstrate
the Candidate Elimination algorithm to output a description of the set of all hypotheses
consistent with the training examples.
Algorithm:
Step 1: Load the training data set.
Step 2: Initialize the general hypothesis G and the specific hypothesis S.
Step 3: For each training example:
Step 4: If the example is positive:
            if attribute_value == hypothesis_value:
                Do nothing
            else:
                replace the attribute value in S with '?' (generalize it)
Step 5: If the example is negative:
            for each attribute where the example differs from S, set that attribute in the
            corresponding general hypothesis to the value from S; otherwise keep '?'.
trainingdata.csv
Sunny  Warm  Normal  Strong  Warm  Same    Yes
Sunny  Warm  High    Strong  Warm  Same    Yes
Rainy  Cold  High    Strong  Warm  Change  No
Sunny  Warm  High    Strong  Cool  Change  Yes
prog2.py
import csv

# read the training data from the CSV file
with open("trainingdata.csv") as f:
    csv_file = csv.reader(f)
    data = list(csv_file)

# initialize the specific hypothesis from a positive example
s = data[1][:-1]
# initialize the general hypothesis with all '?'
g = [['?' for i in range(len(s))] for j in range(len(s))]

for i in data:
    if i[-1] == "Yes":
        # positive example: generalize the specific hypothesis
        for j in range(len(s)):
            if i[j] != s[j]:
                s[j] = '?'
                g[j][j] = '?'
    elif i[-1] == "No":
        # negative example: specialize the general hypothesis
        for j in range(len(s)):
            if i[j] != s[j]:
                g[j][j] = s[j]
            else:
                g[j][j] = '?'
    print("\nSteps of Candidate Elimination Algorithm", data.index(i) + 1)
    print(s)
    print(g)

# keep only the general hypotheses that constrain at least one attribute
gh = []
for i in g:
    for j in i:
        if j != '?':
            gh.append(i)
            break

print("\nFinal specific hypothesis:\n", s)
print("\nFinal general hypothesis:\n", gh)
Output
[['Sunny', '?', '?', '?', '?', '?'], ['?', 'Warm', '?', '?', '?', '?']]
Experiment-3:
Write a program to demonstrate the working of the decision tree based ID3 algorithm. Use
an appropriate data set for building the decision tree and apply this knowledge to classify a
new sample.
tennisdata.csv dataset:
Outlook   Temperature  Humidity  Windy  PlayTennis
Sunny     Hot          High      False  No
Sunny     Hot          High      True   No
Overcast  Hot          High      False  Yes
Rainy     Mild         High      False  Yes
Rainy     Cool         Normal    False  Yes
Rainy     Cool         Normal    True   No
Overcast  Cool         Normal    True   Yes
Sunny     Mild         High      False  No
Sunny     Cool         Normal    False  Yes
Rainy     Mild         Normal    False  Yes
Sunny     Mild         Normal    True   Yes
Overcast  Mild         High      True   Yes
Overcast  Hot          Normal    False  Yes
Rainy     Mild         High      True   No
Program:
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

data = pd.read_csv('tennisdata.csv')
X = data.iloc[:, :-1]
y = data.iloc[:, -1]

# encode each categorical attribute as integers
le_outlook = LabelEncoder()
X.Outlook = le_outlook.fit_transform(X.Outlook)
le_Temperature = LabelEncoder()
X.Temperature = le_Temperature.fit_transform(X.Temperature)
le_Humidity = LabelEncoder()
X.Humidity = le_Humidity.fit_transform(X.Humidity)
le_Windy = LabelEncoder()
X.Windy = le_Windy.fit_transform(X.Windy)
le_PlayTennis = LabelEncoder()
y = le_PlayTennis.fit_transform(y)

# entropy (information gain) is the splitting criterion used by ID3
classifier = DecisionTreeClassifier(criterion='entropy')
classifier.fit(X, y)

# encode a new sample with the same label encoders before prediction
def labelEncoderForInput(list1):
    list1[0] = le_outlook.transform([list1[0]])[0]
    list1[1] = le_Temperature.transform([list1[1]])[0]
    list1[2] = le_Humidity.transform([list1[2]])[0]
    list1[3] = le_Windy.transform([list1[3]])[0]
    return [list1]

inp = ["Rainy", "Mild", "High", "False"]
inp1 = ["Rainy", "Cool", "High", "False"]
pred1 = labelEncoderForInput(inp1)
y_pred = classifier.predict(pred1)
print("Predicted class:", le_PlayTennis.inverse_transform(y_pred))
Experiment-4:
Exercises to solve real-world problems using the following machine learning methods:
a)Linear Regression :
The linear regression algorithm models a linear relationship between a dependent variable (y) and
one or more independent variables (x), hence the name linear regression. Because the relationship
is linear, the algorithm finds how the value of the dependent variable changes as the value of the
independent variable changes.
Linear regression uses the relationship between the data points to draw a straight line through
them.
Program:
import matplotlib.pyplot as plt
from scipy import stats
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
slope, intercept, r, p, std_err = stats.linregress(x, y)
def myfunc(x):
return slope * x + intercept
mymodel = list(map(myfunc, x))
plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()
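As a quick follow-up (a sketch reusing the slope, intercept, and r computed above; the input
value 10 is purely illustrative), the fitted line can also be used to predict y for a new x:
# r indicates how well the line fits the data (already returned by linregress above)
print("r =", r)
# predict y for a new, illustrative x value
print("predicted y for x = 10:", myfunc(10))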
Output:
b)Logistic Regression:
Logistic regression aims to solve classification problems by predicting categorical outcomes,
unlike linear regression, which predicts a continuous outcome.
Program:
import numpy
from sklearn import linear_model

# feature values reshaped to a column vector; y holds the binary class labels
x = numpy.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65, 4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1, 1)
y = numpy.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

logr = linear_model.LogisticRegression()
logr.fit(x, y)

# predict the class of a new observation
predicted = logr.predict(numpy.array([2.09]).reshape(-1, 1))
print(predicted)
Output:
[0]
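If the probability behind the predicted label is also of interest, the same fitted model exposes
predict_proba (a small sketch reusing logr from the program above, with the same observation):
# probability of class 0 and class 1 for the same observation
print(logr.predict_proba(numpy.array([2.09]).reshape(-1, 1)))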
Experiment-5:
Description:
In machine learning, an error is a measure of how accurately an algorithm can make predictions
on a previously unseen dataset. On the basis of these errors, we select the machine learning model
that performs best on the particular dataset. There are mainly two types of errors in machine
learning:
o Reducible errors: These errors can be reduced to improve the model accuracy. They can
be further classified into bias and variance.
o Irreducible errors: These errors are caused by noise inherent in the data and cannot be
reduced, whichever model is chosen.
Program:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

np.random.seed(42)
# synthetic data; the target y is assumed here since the original listing omits its definition
X = np.random.rand(100, 10)
y = np.random.rand(100)
X = np.unique(X, axis=0)
y = y[:len(X)]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)

# k-fold cross-validation with negative MSE; the mean error is taken as a rough proxy for bias,
# and the spread of the fold scores as a proxy for variance
num_splits = 10
cv_scores = cross_val_score(model, X, y, cv=num_splits, scoring='neg_mean_squared_error')
bias = -np.mean(cv_scores)
variance = np.var(cv_scores)
print("Bias:", bias)
print("Variance:", variance)
Output:
Experiment-6:
One-hot encoding is used to convert categorical variables into a format that can be readily used
by machine learning algorithms.
The basic idea of one-hot encoding is to create new variables that take on values 0 and 1 to
represent the original categorical values.
For example, one-hot encoding can convert a categorical variable that contains team names into
new indicator variables that contain only 0 and 1 values, as the program below demonstrates:
Program:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# sample DataFrame (assumed values; the original listing omits its creation)
df = pd.DataFrame({'team': ['A', 'A', 'B', 'B', 'C', 'C'],
                   'points': [25, 12, 15, 14, 23, 29]})
print(df)

# one-hot encode the 'team' column
encoder = OneHotEncoder(handle_unknown='ignore')
encoder_df = pd.DataFrame(encoder.fit_transform(df[['team']]).toarray())

# join the indicator columns, drop the original 'team' column, and rename
final_df = df.join(encoder_df)
print(final_df)
final_df = final_df.drop('team', axis=1)
final_df.columns = ['points', 'teamA', 'teamB', 'teamC']
print(final_df)
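For comparison, pandas provides get_dummies, which produces the same kind of indicator
columns in a single call (a short sketch assuming the same df as above):
# one-hot encode the 'team' column directly with pandas
dummies_df = pd.get_dummies(df, columns=['team'], prefix='team')
print(dummies_df)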
Experiment-7:
Build an Artificial Neural Network by implementing the Back propagation algorithm and
test the same using appropriate data sets.
Back Propagation:
Back propagation is a widely used algorithm in machine learning and neural networks that is
used to train artificial neural networks (ANNs) by adjusting the weights of the connections
between neurons. The goal of back propagation is to minimize the difference between the
predicted output of the neural network and the true output of the training data.
The back propagation algorithm works by first passing an input through the neural network to
obtain an output. The difference between the predicted output and the true output is then
calculated, and this difference is used to adjust the weights of the connections between neurons
in the network. This adjustment is done by calculating the gradient of the error function with
respect to the weights, which tells us how much each weight needs to be adjusted in order to
decrease the error.
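In symbols, each weight is adjusted by gradient descent, w ← w − η · ∂E/∂w, where η is the
learning rate (lr in the program below) and E is the error function. In the listing that follows, the
sign of the gradient is folded into the error term EO = Y − output, so the update appears as an
addition.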
Program:
import numpy as np

# input features and target values
X = np.array(([2, 9], [1, 5], [3, 6]), dtype=float)
Y = np.array(([92], [86], [89]), dtype=float)

# normalize the data
X = X / np.amax(X, axis=0)
Y = Y / 100

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def derivatives_sigmoid(x):
    return x * (1 - x)

# network and training parameters
epoch = 7000
lr = 0.1
inputlayer_neurons = 2
hiddenlayer_neurons = 3
output_neurons = 1

# random weight and bias initialization
wh = np.random.uniform(size=(inputlayer_neurons, hiddenlayer_neurons))
bh = np.random.uniform(size=(1, hiddenlayer_neurons))
wout = np.random.uniform(size=(hiddenlayer_neurons, output_neurons))
bout = np.random.uniform(size=(1, output_neurons))

for i in range(epoch):
    # forward propagation
    hinp1 = np.dot(X, wh)
    hinp = hinp1 + bh
    hlayer_act = sigmoid(hinp)
    outinp1 = np.dot(hlayer_act, wout)
    outinp = outinp1 + bout
    output = sigmoid(outinp)
    # back propagation of error
    EO = Y - output
    outgrad = derivatives_sigmoid(output)
    d_output = EO * outgrad
    EH = d_output.dot(wout.T)
    hiddengrad = derivatives_sigmoid(hlayer_act)
    d_hiddenlayer = EH * hiddengrad
    # weight updates
    wout += hlayer_act.T.dot(d_output) * lr
    wh += X.T.dot(d_hiddenlayer) * lr

print("input:\n" + str(X))
print("Actual output:\n" + str(Y))
print("predicted output:\n", output)
Output:
input:
[[0.66666667 1. ]
[0.33333333 0.55555556]
[1. 0.66666667]]
Actual output:
[[0.92]
[0.86]
[0.89]]
predicted output:
[[0.89699538]
[0.87715324]
[0.89533612]]
Experiment 8:
Write a program to implement the k-nearest neighbor algorithm to classify the iris data set.
Print both correct and wrong predictions.
Description:
o The K-NN algorithm assumes similarity between the new case/data and the available cases
and puts the new case into the category that is most similar to the available categories.
o The K-NN algorithm stores all the available data and classifies a new data point based on
this similarity. This means that when new data appears, it can easily be classified into a
well-suited category using K-NN.
o The K-NN algorithm can be used for regression as well as classification, but it is mostly
used for classification problems.
Suppose there are two categories, Category A and Category B, and we have a new data point x1;
which of these categories does it belong to? To solve this type of problem, we need the K-NN
algorithm. With the help of K-NN, we can easily identify the category or class of a particular
data point.
Program:
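A minimal sketch of this experiment using scikit-learn's KNeighborsClassifier on the iris data
set; the value k = 3, the 80/20 train/test split, and random_state = 1 are assumptions rather than
values taken from the original listing:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# load the iris data set and hold out 20% of it for testing (assumed split)
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=1)

# fit a k-NN classifier with k = 3 neighbors (assumed value of k)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# classify the test samples and report each correct and wrong prediction
y_pred = knn.predict(X_test)
correct = 0
wrong = 0
for features, actual, predicted in zip(X_test, y_test, y_pred):
    result = "Correct" if actual == predicted else "Wrong"
    if actual == predicted:
        correct += 1
    else:
        wrong += 1
    print(result, features, "actual:", iris.target_names[actual], "predicted:", iris.target_names[predicted])

print("Number of correct predictions:", correct)
print("Number of wrong predictions:", wrong)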
Output:
Number of correct predictions: 30
Number of wrong predictions: 0
Experiment-9:
Implement the non-parametric Locally Weighted Regression algorithm in order to fit data
points. Select appropriate data set for your experiment and draw graphs.
Program:
from math import ceil
import numpy as np
from scipy import linalg

def lowess(x, y, f, iterations):
    n = len(x)
    r = int(ceil(f * n))
    # bandwidth for each point: distance to the r-th nearest neighbor
    h = [np.sort(np.abs(x - x[i]))[r] for i in range(n)]
    # tricube weights
    w = np.clip(np.abs((x[:, None] - x[None, :]) / h), 0.0, 1.0)
    w = (1 - w ** 3) ** 3
    yest = np.zeros(n)
    delta = np.ones(n)
    for iteration in range(iterations):
        for i in range(n):
            weights = delta * w[:, i]
            b = np.array([np.sum(weights * y), np.sum(weights * y * x)])
            A = np.array([[np.sum(weights), np.sum(weights * x)],
                          [np.sum(weights * x), np.sum(weights * x * x)]])
            beta = linalg.solve(A, b)
            yest[i] = beta[0] + beta[1] * x[i]
        # robustifying weights based on the residuals
        residuals = y - yest
        s = np.median(np.abs(residuals))
        delta = np.clip(residuals / (6.0 * s), -1, 1)
        delta = (1 - delta ** 2) ** 2
    return yest

import math
n = 100
x = np.linspace(0, 2 * math.pi, n)
y = np.sin(x) + 0.3 * np.random.randn(n)
f = 0.25
iterations = 3
yest = lowess(x, y, f, iterations)

# plot the noisy data points and the locally weighted fit
import matplotlib.pyplot as plt
plt.scatter(x, y, label='data')
plt.plot(x, yest, 'r', label='lowess fit')
plt.legend()
plt.show()
Output:
Experiment-10:
Assuming a set of documents that need to be classified, use the naïve Bayesian Classifier
model to perform this task. Built-in Java classes/API can be used to write the program.
Calculate the accuracy, precision, and recall for your data set.
Document1.csv dataset
I love this sandwich,pos
This is an amazing place,pos
I feel very good about these beers,pos
This is my best work,pos
What an awesome view,pos
I do not like this restaurant,neg
I am tired of this stuff,neg
I can't deal with this,neg
He is my sworn enemy,neg
My boss is horrible,neg
This is an awesome place,pos
I do not like the taste of this juice,neg
I love to dance,pos
I am sick and tired of this place,neg
What a great holiday,pos
That is a bad locality to stay,neg
We will have good fun tomorrow,pos
I went to my enemy's house today,neg
Program:
import pandas as pd

# load the documents and their labels
msg = pd.read_csv('document1.csv', names=['message', 'label'])
print("Total Instances of Dataset: ", msg.shape[0])
msg['labelnum'] = msg.label.map({'pos': 1, 'neg': 0})
X = msg.message
y = msg.labelnum

from sklearn.model_selection import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y)

# convert the text documents into a bag-of-words count matrix
from sklearn.feature_extraction.text import CountVectorizer
count_v = CountVectorizer()
Xtrain_dm = count_v.fit_transform(Xtrain)
Xtest_dm = count_v.transform(Xtest)
df = pd.DataFrame(Xtrain_dm.toarray(), columns=count_v.get_feature_names_out())
print(df[0:5])

# train the multinomial naive Bayes classifier and predict the test documents
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB()
clf.fit(Xtrain_dm, ytrain)
pred = clf.predict(Xtest_dm)
for doc, p in zip(Xtest, pred):
    p = 'pos' if p == 1 else 'neg'
    print("%s -> %s" % (doc, p))

from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score
print('Accuracy Metrics: \n')
print('Accuracy: ', accuracy_score(ytest, pred))
print('Recall: ', recall_score(ytest, pred))
print('Precision: ', precision_score(ytest, pred))
print('Confusion Matrix: \n', confusion_matrix(ytest, pred))
Output:
Total Instances of Dataset: 18
am amazing an and awesome bad boss can dance deal ... that the \
0 0 0 0 0 0 0 0 0 0 0 ... 0 0
1 0 0 0 0 0 1 0 0 0 0 ... 1 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0
3 0 0 0 0 0 0 0 1 0 1 ... 0 0
4 0 0 0 0 0 0 1 0 0 0 ... 0 0
Experiment-11:
Apply EM algorithm to cluster a Heart Disease Data Set. Use the same data set for
clustering using k-Means algorithm. Compare the results of these two algorithms and
comment on the quality of clustering. You can add Java/Python ML library classes/API in
the program.
Program:
from sklearn.cluster import KMeans
from sklearn import preprocessing
from sklearn.mixture import GaussianMixture
from sklearn.datasets import load_iris
import sklearn.metrics as sm
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
dataset=load_iris()
X=pd.DataFrame(dataset.data)
X.columns=['Sepal_Length','Sepal_Width','Petal_Length','Petal_Width']
y=pd.DataFrame(dataset.target)
y.columns=['Targets']
plt.figure(figsize=(14,7))
colormap=np.array(['red','lime','black'])
# REAL PLOT
plt.subplot(1,3,1)
plt.scatter(X.Petal_Length,X.Petal_Width,c=colormap[y.Targets],s=40)
plt.title('Real')
# K-PLOT
plt.subplot(1,3,2)
model=KMeans(n_clusters=3)
model.fit(X)
predY=np.choose(model.labels_,[0,1,2]).astype(np.int64)
plt.scatter(X.Petal_Length,X.Petal_Width,c=colormap[predY],s=40)
plt.title('KMeans')
# GMM PLOT
scaler=preprocessing.StandardScaler()
scaler.fit(X)
xsa=scaler.transform(X)
xs=pd.DataFrame(xsa,columns=X.columns)
gmm=GaussianMixture(n_components=3)
gmm.fit(xs)
y_cluster_gmm=gmm.predict(xs)
plt.subplot(1,3,3)
plt.scatter(X.Petal_Length,X.Petal_Width,c=colormap[y_cluster_gmm],s=40)
plt.title('GMM Classification')
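The experiment also asks for a comparison of the two clusterings. One possible way to quantify
it (a sketch reusing the sm alias and the labels computed above) is the adjusted Rand index,
which is insensitive to how the cluster numbers are permuted:
# a higher adjusted Rand index means closer agreement with the true iris classes
print("KMeans ARI:", sm.adjusted_rand_score(y.Targets, predY))
print("GMM ARI:", sm.adjusted_rand_score(y.Targets, y_cluster_gmm))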
Output:
Text(0.5, 1.0, 'GMM Classification')
Experiment-12:
Exploratory data analysis for classification using pandas or Matplotlib.
Description:
EDA is applied to investigate the data and summarize the key insights.
It gives you a basic understanding of your data: its distribution, null values, and much more.
You can explore the data either using graphs or through some Python functions.
In the non-graphical approach, you will use functions such as shape, describe, isnull, info,
dtypes, and more.
In the graphical approach, you will use plots such as scatter, box, bar, density, and correlation
plots.
Program:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('titanic.csv')

# Basic information
df.info()
df.describe()

# Find the duplicates
df.duplicated().sum()

# Unique values
df['Pclass'].unique()
df['Survived'].unique()
df['Sex'].unique()

# Count plot of passenger class
sns.countplot(x='Pclass', data=df)
plt.show()

# Datatypes
df.dtypes

# Filter data
df[df['Pclass'] == 1].head()

# Boxplot
df[['Fare']].boxplot()
plt.show()

# Correlation (numeric columns only)
df.corr(numeric_only=True)

# Correlation plot
sns.heatmap(df.corr(numeric_only=True))
plt.show()
Output: