Data Assignment 1
Instructions:
Please fill in your answers inline in the Word document. Submit code files wherever
applicable.
Batch Id: DSWDMCOD28/10/22B
Topic: Data Pre-Processing
Problem Statement:
Most datasets contain extreme values (outliers) among their observations. These
values affect the model's predictions (accuracy) in one way or another, yet removing
them outright is not always a good option. For such scenarios, we have various
techniques to treat these values.
Refer: https://360digitmg.com/mindmap-data-science
import pandas as pd
from feature_engine.outliers import Winsorizer

df = pd.read_csv(r"C:/Users/HP1/Documents/boston_data.csv")
df.dtypes

# Winsorizer setup reconstructed from the surviving fragments;
# IQR capping on both tails (fold=1.5) is assumed
winsor = Winsorizer(capping_method='iqr', tail='both', fold=1.5, variables=['crim'])
df_t = winsor.fit_transform(df[['crim']])

# Repeat for the 'zn' column
winsor = Winsorizer(capping_method='iqr', tail='both', fold=1.5, variables=['zn'])
df_t = winsor.fit_transform(df[['zn']])
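As a quick visual check (a minimal sketch building on the variables above), boxplots before and after winsorization show the capped tails:

import matplotlib.pyplot as plt

plt.subplot(1, 2, 1)
plt.boxplot(df['zn'])    # original distribution, with outliers
plt.title('zn (original)')

plt.subplot(1, 2, 2)
plt.boxplot(df_t['zn'])  # winsorized distribution
plt.title('zn (winsorized)')
plt.show()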
DISCRETIZATION
Instructions:
Please fill in your answers inline in the Word document. Submit Python and R code
files wherever applicable.
Name: _________________________
import pandas as pd

data = pd.read_csv(r"C:\Users\HP1\Documents\iris.csv")
data.head()
data.describe()

# Discretize Sepal_Length into two bins split at the mean
data['Sepal_Length_new'] = pd.cut(data['Sepal_Length'],
                                  bins=[min(data.Sepal_Length) - 1,
                                        data.Sepal_Length.mean(),
                                        max(data.Sepal_Length)],
                                  labels=["Low", "High"])
data.head()
data.Sepal_Length_new.value_counts()

# Discretize Sepal_Width the same way
data['Sepal_Width_new'] = pd.cut(data['Sepal_Width'],
                                 bins=[min(data.Sepal_Width) - 1,
                                       data.Sepal_Width.mean(),
                                       max(data.Sepal_Width)],
                                 labels=["Low", "High"])
data.head()
data.Sepal_Width_new.value_counts()

# Discretize Petal_Width the same way
data['Petal_Width_new'] = pd.cut(data['Petal_Width'],
                                 bins=[min(data.Petal_Width) - 1,
                                       data.Petal_Width.mean(),
                                       max(data.Petal_Width)],
                                 labels=["Low", "High"])
data.head()
data.Petal_Width_new.value_counts()
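Binning at the mean gives only two levels per feature. As an alternative sketch (not part of the original answer), pd.qcut produces equal-frequency bins:

# Quartile-based (equal-frequency) discretization of Sepal_Length
data['Sepal_Length_q'] = pd.qcut(data['Sepal_Length'], q=4,
                                 labels=["Q1", "Q2", "Q3", "Q4"])
data.Sepal_Length_q.value_counts()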
Dummy Variables
Problem Statement:
Data is one of the most important assets. It is common for data to be stored in
distinct systems with different formats and forms. Non-numeric data is difficult for
most algorithms to consume, so it must be encoded before modelling.
2) Prepare the dataset by applying preprocessing techniques so that all the
features are in numeric format.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder

# df: the raw dataset loaded earlier (the file path is not shown in the original)
# One-hot encode every column into numeric dummy variables
enc = OneHotEncoder()
enc_df = pd.DataFrame(enc.fit_transform(df.iloc[:, 0:]).toarray())
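For comparison, a sketch using pandas' built-in encoder (assuming the same df) achieves a similar result without scikit-learn:

# Dummy-encode all non-numeric columns; numeric columns pass through unchanged
df_dummies = pd.get_dummies(df, drop_first=True)
df_dummies.head()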
Duplication and Typecasting
Problem statement:
Collected data may contain duplicate entries, perhaps because the data were not
collected at regular intervals, or for other reasons. Building a proper solution on such
data is a tough ask. The common techniques are either to remove duplicates
completely or to substitute them with logically derived values. There are various
techniques to treat these types of problems.
Q1. For the given dataset perform the type casting (convert the datatypes, ex. float to int)
Q2. Check for the duplicate values, and handle the duplicate values (ex. drop)
Q3. Perform exploratory data analysis (EDA), such as histograms, boxplots,
scatterplots, etc.
Dataset columns: InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, Country
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # mostly used for visualization purposes
import seaborn as sns

df = pd.read_csv(r"C:/Users/HP1/Documents/onlineretail.csv")

# Type casting: convert 'UnitPrice' from 'float64' to 'int64'
df.UnitPrice = df.UnitPrice.astype('int64')

# Removing duplicates
data = df.drop_duplicates()

# Graphical representation
plt.hist(data.UnitPrice)  # histogram
sns.scatterplot(data=df, x="UnitPrice", y="Description")  # scatter plot
Inferential Statistics
Problem Statements:
Q1) Three coins are tossed; find the probability that two heads and one tail are obtained.
Answer: 3/8 (favorable outcomes HHT, HTH, THH out of 2³ = 8 equally likely outcomes)
Q2) Two Dice are rolled, find the probability that sum is
a) Equal to 1
b) Less than or equal to 4
c) Sum is divisible by 2 and 3
Answer:
a) 0 (the minimum possible sum of two dice is 2)
b) 1/6 (sums of 2, 3, or 4 occur in 1 + 2 + 3 = 6 of the 36 equally likely outcomes)
c) 1/6 (divisible by both 2 and 3 means divisible by 6; sums of 6 or 12 occur in 5 + 1 = 6 of 36 outcomes)
Q3) A bag contains 2 red, 3 green and 2 blue balls. Two balls are drawn at random. What is the
probability that none of the balls drawn is blue?
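A quick combinatorial check (a sketch using only the standard library): drawing 2 of the 5 non-blue balls out of 7 gives C(5,2)/C(7,2).

from math import comb

# 2 red + 3 green = 5 non-blue balls out of 7 in total; draw 2 without replacement
p_no_blue = comb(5, 2) / comb(7, 2)
print(p_no_blue)  # 10/21 ≈ 0.476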
Q4) Calculate the expected number of candies for a randomly selected child.
Below are the probabilities of candy counts for children (ignoring the nature of the
child; a generalized view). The expected value is the probability-weighted sum, E[X] = Σ x·P(x):
i. Child A – probability of having 1 candy is 0.015
ii. Child B – probability of having 4 candies is 0.2
Q5) For the given dataset, calculate the mean, median, mode, variance, standard
deviation, and range for Points, Score, and Weigh, and comment on the values /
draw inferences.
ANSWER
import pandas as pd

# df: load the given dataset here (the file path is not shown in the original)
df.info()

# For Points
# Measures of central tendency / first moment business decision
df.Points.mean()
df.Points.median()
df.Points.mode()

# Measures of dispersion / second moment business decision
df.Points.var()  # variance
df.Points.std()  # standard deviation
points_range = max(df.Points) - min(df.Points)  # range
points_range

# For Score
# Measures of central tendency / first moment business decision
df.Score.mean()
df.Score.median()
df.Score.mode()

# Measures of dispersion / second moment business decision
df.Score.var()  # variance
df.Score.std()  # standard deviation
score_range = max(df.Score) - min(df.Score)  # range
score_range

# For Weigh (same pattern)
df.Weigh.mean()
df.Weigh.median()
df.Weigh.mode()
df.Weigh.var()
df.Weigh.std()
weigh_range = max(df.Weigh) - min(df.Weigh)
weigh_range
ANSWER:
Probability of selecting each patient = 1/9

x:    108  110  123  134  135  145  167  187  199
P(x): 1/9  1/9  1/9  1/9  1/9  1/9  1/9  1/9  1/9

Expected Value = (1/9)(108) + (1/9)(110) + (1/9)(123) + (1/9)(134) + (1/9)(135)
+ (1/9)(145) + (1/9)(167) + (1/9)(187) + (1/9)(199)
= (1/9)(108 + 110 + 123 + 134 + 135 + 145 + 167 + 187 + 199)
= (1/9)(1308)
≈ 145.33
Q7) Look at the data given below. Plot the data, find the outliers, and find μ, σ, and σ².
Hint: [Use a plot which shows the data distribution, skewness, and the outliers; also use
R/Python code to evaluate the measures of centrality and spread.]
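A minimal sketch (the actual data values are not reproduced in this extract, so the array below is a placeholder):

import numpy as np
import matplotlib.pyplot as plt

x = np.array([])  # placeholder: substitute the data values given in the question

plt.boxplot(x)  # a boxplot shows the spread, skewness, and outliers
plt.show()

print(np.mean(x))  # μ (mean)
print(np.std(x))   # σ (standard deviation)
print(np.var(x))   # σ² (variance)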
Q8) AT&T was running commercials in 1990 aimed at luring back customers who had switched
to one of the other long-distance phone service providers. One such commercial shows a
businessman trying to reach Phoenix and mistakenly getting Fiji, where a half-naked native on a
beach responds incomprehensibly in Polynesian. When asked about this advertisement, AT&T
admitted that the portrayed incident did not actually take place but added that this was an
enactment of something that “could happen.” Suppose that one in 200 long-distance telephone
calls is misdirected.
What is the probability that at least one in five attempted telephone calls reaches the wrong
number? (Assume independence of attempts.)
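With p = 1/200 per call and independent attempts, P(at least one wrong in five) = 1 - (199/200)^5. A quick check:

# Probability of at least one misdirected call in 5 independent attempts
p_wrong = 1 / 200
p_at_least_one = 1 - (1 - p_wrong) ** 5
print(round(p_at_least_one, 4))  # ≈ 0.0248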
Q9) Returns on a certain business venture, to the nearest $1,000, are known to follow the
following probability distribution
x        P(x)
-2,000   0.1
-1,000   0.1
0        0.2
1,000    0.2
2,000    0.3
3,000    0.1
(i) What is the most likely monetary outcome of the business venture?
Hint: [The most likely outcome is the one with the highest probability]
(iii) What is the long-term average earning of business ventures of this kind? Explain.
Hint: [Here, the expected return of the venture is considered as the required average]
(iv) What is a good measure of the risk involved in a venture of this kind? Compute
this measure.
Hint: [Risk here stems from the possible variability in returns; name and compute
the appropriate risk measure for this venture]
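A sketch that evaluates the hints numerically (NumPy assumed):

import numpy as np

x = np.array([-2000, -1000, 0, 1000, 2000, 3000])
p = np.array([0.1, 0.1, 0.2, 0.2, 0.3, 0.1])

most_likely = x[np.argmax(p)]               # highest-probability outcome: 2000
expected = np.sum(x * p)                    # long-term average earning: 800
variance = np.sum(p * (x - expected) ** 2)  # spread of returns
std_dev = np.sqrt(variance)                 # standard deviation, a natural risk measure
print(most_likely, expected, std_dev)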
Imputation
Problem Statement:
The majority of datasets have missing values, which might be because the data were
not collected at regular intervals, because of instrument breakdowns, and so on. With
such gaps it is nearly impossible to build a proper model, or in other words, to get
accurate results. The common techniques are either to remove those records
completely or to substitute the missing values with logical ones; there are various
techniques to treat these types of problems.
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

d = pd.read_csv(r"C:/Users/HP1/Documents/Seeds_data.csv")
STRING MANIPULATIONS
Problem Statement:
It is obvious that as part of data analysis we encounter a lot of text data, which is a
collection of strings, each in turn a sequence of characters. Access the text data and
manipulate it as per our requirements.
1. Create a string “Grow Gratitude”.
Code for the following tasks:
a) How do you access the letter “G” of “Grow”?
b) How do you find the length of the string?
c) Count how many times “G” is in the string.
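A possible answer (a short sketch):

s = "Grow Gratitude"

s[0]          # a) the letter "G" at index 0
len(s)        # b) length of the string -> 14
s.count("G")  # c) "G" occurs 2 times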
2. Create a string “Being aware of a single shortcoming within yourself is far more
useful than being aware of a thousand in someone else.”
Code for the following:
3. Create a string "Idealistic as it may sound, altruism should be the driving force in
business, not just competition and a desire for wealth"
Code for the following tasks:
a) get one character of the string
b) get the first three characters
c) get the last three characters
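A possible answer (a short sketch):

s = ("Idealistic as it may sound, altruism should be the driving "
     "force in business, not just competition and a desire for wealth")

s[0]    # a) one character -> 'I'
s[:3]   # b) first three characters -> 'Ide'
s[-3:]  # c) last three characters -> 'lth'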
TRANSFORMATIONS
Problem Statement:
Everything in the analytics world revolves around data. Proper data helps you make
useful predictions that improve your business. Sometimes using the original data as it
is does not lead to accurate solutions; the data needs to be converted from one form
to another to obtain better predictions. Explore various techniques to transform the
data for better model performance. You can go through this link:
https://360digitmg.com/mindmap-data-science
4) Prepare the dataset by performing preprocessing techniques so that the data
improves model performance.
import pandas as pd
import numpy as np
import scipy.stats as stats
import pylab

# df: the calories dataset loaded earlier (the file path is not shown in the original)
# Q-Q plot to check normality of the original variable
stats.probplot(df["Calories Consumed"], dist="norm", plot=pylab)

# Log transformation to bring 'Weight gained (grams)' and 'Calories Consumed'
# closer to a normal distribution
stats.probplot(np.log(df["Weight gained (grams)"]), dist="norm", plot=pylab)
stats.probplot(np.log(df["Calories Consumed"]), dist="norm", plot=pylab)
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

data = pd.read_csv(r"C:/Users/HP1/Documents/Z_dataset.csv")

# Drop the id and colour columns before checking variance
# (exact column names may differ in the actual file)
data = data.drop(columns=["Id", "colour"])

# Flag zero-variance features
var_thres = VarianceThreshold(threshold=0)
var_thres.fit(data)
var_thres.get_support()  # boolean mask: True where a column's variance exceeds the threshold

# Inspect the variances of individual columns
data["square.length"].var()
data["square.breadth"].var()
data["rec.Length"].var()
data["rec.breadth"].var()