Data Assignment 1

Outlier Treatments

Instructions:

Please share your answers filled in inline in the Word document. Submit code files wherever applicable.

Please ensure you update all the details:

Name: _HARVIR SINGH

Batch Id: DSWDMCOD28/10/22B
Topic: Data Pre-Processing

Problem Statement:
Most datasets have extreme values or exceptions in their observations. These values affect the predictions (accuracy) of the model in one way or another, and simply removing them is not always a good option. For these types of scenarios, we have various techniques to treat such values.
Refer: https://360digitmg.com/mindmap-data-science

1. Prepare the dataset by performing the preprocessing techniques to treat the outliers.



import pandas as pd
import numpy as np
import seaborn as sns

df = pd.read_csv(r"C:/Users/HP1/Documents/boston_data.csv")
df.dtypes



# finding outliers in crim
sns.boxplot(df.crim)

# Detection of outliers (find limits for crim based on IQR)
IQR = df['crim'].quantile(0.75) - df['crim'].quantile(0.25)
lower_limit = df['crim'].quantile(0.25) - (IQR * 1.5)
upper_limit = df['crim'].quantile(0.75) + (IQR * 1.5)
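
Before capping, it can help to see how many observations these limits actually flag. A small optional check (not part of the original answer), reusing the lower_limit and upper_limit computed above:

# count the crim values that fall outside the IQR limits before capping
outliers_crim = df[(df['crim'] < lower_limit) | (df['crim'] > upper_limit)]
print(len(outliers_crim), "rows flagged as outliers in crim")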

############### Winsorization for crim ###############
# pip install feature_engine  # install the package
from feature_engine.outliers import Winsorizer



winsor = Winsorizer(capping_method='iqr',  # choose IQR rule boundaries or gaussian for mean and std
                    tail='both',           # cap left, right or both tails
                    fold=1.5,
                    variables=['crim'])

df_t = winsor.fit_transform(df[['crim']])

# we can inspect the minimum and maximum caps
# winsor.left_tail_caps_, winsor.right_tail_caps_

# let's see the boxplot
sns.boxplot(df_t.crim)



###################################################
# finding outliers in zn
sns.boxplot(df.zn)

# Detection of outliers (find limits for zn based on IQR)
IQR = df['zn'].quantile(0.75) - df['zn'].quantile(0.25)
lower_limit = df['zn'].quantile(0.25) - (IQR * 1.5)
upper_limit = df['zn'].quantile(0.75) + (IQR * 1.5)

### Winsorization for zn ###
from feature_engine.outliers import Winsorizer
winsor = Winsorizer(capping_method='iqr',  # choose IQR rule boundaries or gaussian for mean and std
                    tail='both',           # cap left, right or both tails
                    fold=1.5,
                    variables=['zn'])

df_t = winsor.fit_transform(df[['zn']])
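
As with crim, a quick visual check of the capped column confirms the winsorization worked (a small optional addition, using the df_t just created above):

sns.boxplot(df_t.zn)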

DISCRETIZATION
Instructions:

Please share your answers filled in inline in the Word document. Submit Python code and R code files wherever applicable.

Please ensure you update all the details:

Name: _________________________

Batch Id: _______________________


Topic: Data Pre-Processing



Problem Statement:
In the analytics world, everything revolves around data. Proper data helps you make useful predictions that improve your business. Sometimes using the original data as it is does not lead to accurate solutions, and the data needs to be converted from one form to another for better predictions. Explore the various techniques to transform the data for better model performance. You can go through this link:
https://360digitmg.com/mindmap-data-science
1) Convert the continuous data into discrete classes on the iris dataset.
Prepare the dataset by performing the preprocessing techniques to obtain data that improves model performance.

Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5 3.4 1.5 0.2 setosa
4.4 2.9 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa

import pandas as pd
data = pd.read_csv(C:\Users\HP1\Documents\iris.csv")
data.head()



data.describe()

# Discretization for Sepal_Length

data['Sepal_Length_new'] = pd.cut(data['Sepal_Length'],
bins=[min(data.Sepal_Length) - 1,
data.Sepal_Length.mean(), max(data.Sepal_Length)],
labels=["Low","High"])
data.head()
data.Sepal_Length_new.value_counts()

## Discretization for Sepal_Width

data['Sepal_Width_new'] = pd.cut(data['Sepal_Width'],
bins=[min(data.Sepal_Width) - 1,
data.Sepal_Width.mean(), max(data.Sepal_Width)],
labels=["Low","High"])
data.head()
data.Sepal_Width_new.value_counts()

# Discretization for Petal_Length

data['Petal_Length_new'] = pd.cut(data['Petal_Length'],
bins=[min(data.Petal_Length) - 1,
data.Petal_Length.mean(), max(data.Petal_Length)],
labels=["Low","High"])
data.head()
data.Petal_Length_new.value_counts()

## Discretization for Petal_Width

data['Petal_Width_new'] = pd.cut(data['Petal_Width'],
bins=[min(data.Petal_Width) - 1,
data.Petal_Width.mean(), max(data.Petal_Width)],
labels=["Low","High"])
data.head()
data.Petal_Width_new.value_counts()

Dummy Variables

Problem Statement:
Data is one of the most important assets. It is common for data to be stored in distinct systems with different formats and forms. Non-numeric data makes it tricky to develop mathematical equations for prediction models. We have preprocessing techniques to convert the data to numeric form. Explore the various techniques to obtain reliable, uniform, standard data; you can go through this link:
https://360digitmg.com/mindmap-data-science

2) Prepare the dataset by performing the preprocessing techniques to have all the features in numeric format.

Index   Animals   Gender   Homly   Types
1       Cat       Male     Yes     A
2       Dog       Male     Yes     B
3       Mouse     Male     Yes     C
4       Mouse     Male     Yes     C
5       Dog       Female   Yes     A
6       Cat       Female   Yes     B
7       Lion      Female   Yes     D
8       Goat      Female   Yes     E
9       Cat       Female   Yes     A
10      Dog       Male     Yes     B

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# we use Animal_category dataset



df = pd.read_csv(r"C:\Users\HP1\Documents\animal_category.csv")

df.columns  # column names
df.shape    # will give you the shape of the dataframe

# drop the Index column
df.drop(['Index'], axis=1, inplace=True)
df.dtypes

# Create dummy variables
df_new = pd.get_dummies(df)
df_new_1 = pd.get_dummies(df, drop_first=True)
# we have created dummies for all categorical columns

##### One Hot Encoding #####
df.columns



df = df[['Animals','Gender','Homly','Types']]

from sklearn.preprocessing import OneHotEncoder

# Creating an instance of OneHotEncoder
enc = OneHotEncoder()  # initializing the method

enc_df = pd.DataFrame(enc.fit_transform(df.iloc[:, 0:]).toarray())
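
The encoded frame above has numeric column names. If readable names are wanted, newer scikit-learn versions (1.0+) expose them via get_feature_names_out; a small optional addition, not part of the original answer:

# label the one-hot columns with the original category names
enc_df.columns = enc.get_feature_names_out(df.columns)
enc_df.head()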
Duplication & Typecasting

Problem statement:
Data collected may have duplicate entries; that might be because the data were not collected at regular intervals, or for some other reason. Building a proper solution on such data is a tough ask. The common techniques are either to remove duplicates completely or to substitute those values with logical data. There are various techniques to treat these types of problems.

Q1. For the given dataset perform the type casting (convert the datatypes, e.g. float to int)
Q2. Check for duplicate values, and handle the duplicate values (e.g. drop)
Q3. Do the data analysis (EDA), such as histogram, boxplot, scatterplot, etc.
InvoiceNo | StockCode | Description                         | Quantity | InvoiceDate    | UnitPrice | CustomerID | Country
536365    | 85123A    | WHITE HANGING HEART T-LIGHT HOLDER  | 6        | 12/1/2010 8:26 | 2.55      | 17850      | United Kingdom
536365    | 71053     | WHITE METAL LANTERN                 | 6        | 12/1/2010 8:26 | 3.39      | 17850      | United Kingdom
536365    | 84406B    | CREAM CUPID HEARTS COAT HANGER      | 8        | 12/1/2010 8:26 | 2.75      | 17850      | United Kingdom
536365    | 84029G    | KNITTED UNION FLAG HOT WATER BOTTLE | 6        | 12/1/2010 8:26 | 3.39      | 17850      | United Kingdom
536365    | 84029E    | RED WOOLLY HOTTIE WHITE HEART.      | 6        | 12/1/2010 8:26 | 3.39      | 17850      | United Kingdom
536365    | 22752     | SET 7 BABUSHKA NESTING BOXES        | 2        | 12/1/2010 8:26 | 7.65      | 17850      | United Kingdom
536365    | 21730     | GLASS STAR FROSTED T-LIGHT HOLDER   | 6        | 12/1/2010 8:26 | 4.25      | 17850      | United Kingdom
536366    | 22633     | HAND WARMER UNION JACK              | 6        | 12/1/2010 8:28 | 1.85      | 17850      | United Kingdom
536366    | 22632     | HAND WARMER RED POLKA DOT           | 6        | 12/1/2010 8:28 | 1.85      | 17850      | United Kingdom

import pandas as pd

df = pd.read_csv(r"C:/Users/HP1/Documents/onlineretail.csv")

# type casting
# Now we will convert 'float64' into 'int64' type.
df.UnitPrice = df.UnitPrice.astype('int64')



df.dtypes

# Identify duplicate records in the data
duplicate = df.duplicated()
sum(duplicate)

# Removing duplicates
data = df.drop_duplicates()

# Exploratory Data Analysis

# Measures of Central Tendency / First moment business decision
data.Description.mode()

from sklearn.impute import SimpleImputer
import numpy as np

# Mode Imputer
mode_imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
data["Description"] = pd.DataFrame(mode_imputer.fit_transform(data[["Description"]]))
data.Description.isnull().sum()  # all missing Description records replaced by the mode
data.isnull().sum()

# Graphical Representation
import matplotlib.pyplot as plt  # mostly used for visualization purposes
import numpy as np

plt.hist(data.UnitPrice) #histogram



plt.boxplot(data.UnitPrice) #boxplot

import seaborn as sns

sns.scatterplot(data=df, x="UnitPrice",
y="Description")

Inferential Statistics
Problem Statements:
Q1) Three coins are tossed; find the probability that two heads and one tail are obtained.

ANSWER: 3/8 (the favourable outcomes are HHT, HTH, THH out of the 8 equally likely outcomes)
Q2) Two dice are rolled; find the probability that the sum is
a) Equal to 1
b) Less than or equal to 4
c) Divisible by both 2 and 3

Answer:
a) 0 (the smallest possible sum is 2)
b) 1/6 (sums of 2, 3 or 4 account for 6 of the 36 equally likely outcomes)
c) 1/6 (divisible by both 2 and 3 means divisible by 6: sums of 6 or 12 account for 6 of the 36 outcomes)

Q3) A bag contains 2 red, 3 green and 2 blue balls. Two balls are drawn at random. What is the
probability that none of the balls drawn is blue?



ANSWER: 10/21 (two non-blue balls from the 5 non-blue, out of any two from 7: C(5,2)/C(7,2) = 10/21)

Q4) Calculate the Expected number of candies for a randomly selected child:
Below are the probabilities of candy counts for children (ignoring the nature of the child; a generalized view):
i. Child A – probability of having 1 candy is 0.015
ii. Child B – probability of having 4 candies is 0.2

CHILD   Candies count   Probability
A       1               0.015
B       4               0.20
C       3               0.65
D       5               0.005
E       6               0.01
F       2               0.12
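
A short check of the expected value implied by the table above (a sketch; the counts and probabilities are copied from the table):

# expected number of candies: E[X] = sum of (count * probability)
counts = [1, 4, 3, 5, 6, 2]
probs = [0.015, 0.20, 0.65, 0.005, 0.01, 0.12]
expected = sum(c * p for c, p in zip(counts, probs))
print(expected)  # 3.09 candies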

Q5) Calculate Mean, Median, Mode, Variance, Standard Deviation, Range & comment about
the values / draw inferences, for the given dataset
- For Points, Score, Weigh>
Find Mean, Median, Mode, Variance, Standard Deviation, and Range and comment about the
values/ Draw some inferences.



Dataset: Refer to the Hands-on Material in the LMS - Data Types EDA assignment; a snapshot of the dataset is given above.

ANSWER
import pandas as pd

# to read the file
df = pd.read_excel(r"C:/Users/HP1/Documents/Assignment_module02 (1).xlsx")

df.info()

# ( for points )
# measures of central tendency

df.Points.mean()
df.Points.median()
df.Points.mode()
# Measures of Dispersion / Second moment business decision
df.Points.var() # variance
df.Points.std() # standard deviation
points_range = max(df.Points) - min(df.Points)  # range
points_range

# for score
# measures of central tendency
df.Score.mean()
df.Score.median()
df.Score.mode()
# Measures of Dispersion / Second moment business decision
df.Score.var()  # variance
df.Score.std()  # standard deviation
score_range = max(df.Score) - min(df.Score)  # range
score_range

# for weigh
# measures of central tendency
df.Weigh.mean()
df.Weigh.median()
df.Weigh.mode()
# Measures of Dispersion / Second moment business decision
df.Weigh.var()  # variance
df.Weigh.std()  # standard deviation
weigh_range = max(df.Weigh) - min(df.Weigh)  # range
weigh_range
Q6) Calculate the Expected Value for the problem below.
a) The weights (X) of patients at a clinic (in pounds) are:
108, 110, 123, 134, 135, 145, 167, 187, 199
Assume one of the patients is chosen at random. What is the expected value of the weight of that patient?

ANSWER:
Probability of selecting each patient = 1/9
x:    108, 110, 123, 134, 135, 145, 167, 187, 199
P(x): 1/9 for each value
Expected Value = (1/9)(108) + (1/9)(110) + (1/9)(123) + (1/9)(134) + (1/9)(135) + (1/9)(145) + (1/9)(167) + (1/9)(187) + (1/9)(199)
               = (1/9)(108 + 110 + 123 + 134 + 135 + 145 + 167 + 187 + 199)
               = (1/9)(1308)
               = 145.33

Q7) Look at the data given below. Plot the data, find the outliers and find μ, σ and σ².
Hint: [Use a plot which shows the data distribution, skewness along with the outliers; also use
R/Python code to evaluate measures of centrality and spread]

Name of company       Measure X
Allied Signal 24.23%
Bankers Trust 25.53%
General Mills 25.41%
ITT Industries 24.14%
J.P.Morgan & Co. 29.62%
Lehman Brothers 28.25%
Marriott 25.81%
MCI 24.39%
Merrill Lynch 40.26%
Microsoft 32.95%
Morgan Stanley 91.36%
Sun Microsystems 25.99%
Travelers 39.42%
US Airways 26.71%
Warner-Lambert 35.00%
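
No worked answer is filled in for Q7; a minimal sketch of how the measures could be evaluated in Python (the values are typed in from the table above):

import numpy as np
import matplotlib.pyplot as plt

# Measure X values from the table above (in percent)
x = np.array([24.23, 25.53, 25.41, 24.14, 29.62, 28.25, 25.81, 24.39,
              40.26, 32.95, 91.36, 25.99, 39.42, 26.71, 35.00])

print("mean (mu):", x.mean())
print("standard deviation (sigma):", x.std())   # population standard deviation
print("variance (sigma^2):", x.var())           # population variance

# a boxplot shows the right skew and flags Morgan Stanley (91.36%) as an outlier
plt.boxplot(x)
plt.show()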

Q8) AT&T was running commercials in 1990 aimed at luring back customers who had switched
to one of the other long-distance phone service providers. One such commercial shows a
businessman trying to reach Phoenix and mistakenly getting Fiji, where a half-naked native on a
beach responds incomprehensibly in Polynesian. When asked about this advertisement, AT&T
admitted that the portrayed incident did not actually take place but added that this was an
enactment of something that “could happen.” Suppose that one in 200 long-distance telephone
calls is misdirected.

What is the probability that at least one in five attempted telephone calls reaches the wrong
number? (Assume independence of attempts.)



Hint: [Using Probability formula evaluate the probability of one call being wrong out of five
attempted calls]
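
No answer is filled in for Q8; one way to work it out, assuming each of the five calls is independently misdirected with probability 1/200:

# P(at least one of five calls is misdirected) = 1 - P(none is misdirected)
p_wrong = 1 / 200
p_at_least_one = 1 - (1 - p_wrong) ** 5
print(p_at_least_one)  # approximately 0.0248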

Q9) Returns on a certain business venture, to the nearest $1,000, are known to follow the
following probability distribution
X P(x)
-2,000 0.1
-1,000 0.1
0 0.2
1000 0.2
2000 0.3
3000 0.1

(i) What is the most likely monetary outcome of the business venture?
Hint: [The outcome is most likely the expected returns of the venture]

(ii) Is the venture likely to be successful? Explain.


Hint: [Probability of % of venture being a successful one]

(iii) What is the long-term average earning of business ventures of this kind? Explain.
Hint: [Here, the expected return of the venture is considered as the required average]

(iv) What is a good measure of the risk involved in a venture of this kind? Compute
this measure.
Hint: [Risk here stems from the possible variability in the expected returns,
therefore, name the risk measure for this venture]
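
The document leaves Q9 unanswered; a minimal sketch of the computations the hints ask for, using the distribution given above:

# returns distribution from the table above
x = [-2000, -1000, 0, 1000, 2000, 3000]
p = [0.1, 0.1, 0.2, 0.2, 0.3, 0.1]

expected = sum(xi * pi for xi, pi in zip(x, p))                    # long-run average return
variance = sum((xi - expected) ** 2 * pi for xi, pi in zip(x, p))
std_dev = variance ** 0.5                                          # a common measure of the risk

print("most likely outcome (mode):", x[p.index(max(p))])          # 2000, with probability 0.3
print("expected return:", expected)                                # 800
print("P(positive return):", sum(pi for xi, pi in zip(x, p) if xi > 0))  # 0.6
print("standard deviation (risk):", round(std_dev, 2))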

Imputation

Problem Statement:
Most datasets have missing values; that might be because the data were not collected at regular intervals, instruments broke down, and so on. With such gaps it is nearly impossible to build a proper model or, in other words, get accurate results. The common techniques are either to remove those records completely or to substitute the missing values with logical ones; there are various techniques to treat these types of problems.



1) Prepare the dataset using various techniques to solve the problem; explore all the techniques available and use them to see which gives the best result.
Hint: Go through this link: https://360digitmg.com/mindmap-data-science

CASENUM ATTORNEY CLMSEX CLMINSUR SEATBELT CLMAGE LOSS


5 0 0 1 0 50 34.94
3 1 1 0 0 18 0.891
66 1 0 1 0 5 0.33
70 0 0 1 1 31 0.037
96 1 0 1 0 30 0.038
97 0 1 1 0 35 0.309
10 0 0 1 0 9 3.538
36 0 1 1 0 34 4.881
51 1 1 1 0 60 0.874
55 1 0 1 0 0.35
61 0 1 1 0 37 6.19
148 0 0 1 0 41 19.61
150 1 0 1 0 7 1.678
150 0 1 1 0 40 0.673
169 1 1 1 0 37 0.143
171 1 1 0 0 9 0.053
334 1 1 1 0 58 0.05
360 0 0 1 0 58 0.758
376 1 0 1 0 3 0
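
The document does not include code for this imputation task; a minimal sketch using pandas and scikit-learn (the file name claimants.csv and the exact column names are assumptions based on the table above):

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

# the path and file name are assumptions; point this at wherever the claims data is stored
claims = pd.read_csv(r"C:/Users/HP1/Documents/claimants.csv")
claims.isnull().sum()  # count missing values per column

# mean imputation for the numeric CLMAGE column
mean_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
claims["CLMAGE"] = pd.DataFrame(mean_imputer.fit_transform(claims[["CLMAGE"]]))

# mode imputation for the categorical columns
mode_imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
for col in ["CLMSEX", "CLMINSUR", "SEATBELT"]:
    claims[col] = pd.DataFrame(mode_imputer.fit_transform(claims[[col]]))

claims.isnull().sum()  # verify no missing values remain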

STANDARDIZATION & NORMALIZATION


Problem Statement:
Data is one of the most important assets. It is common for data to be stored in distinct systems with different formats and scales. These seemingly small differences in how the data is stored can result in misinterpretations and inconsistencies in your analytics. Inconsistency can make it impossible to deliver reliable information to management for good decision making. We have preprocessing techniques to make the data uniform. Explore the various techniques to obtain reliable, uniform, standard data; you can go through this link:
https://360digitmg.com/mindmap-data-science



3) Prepare the dataset by performing the preprocessing techniques to bring the data to a standard scale.

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

d = pd.read_csv(r"C:/Users/HP1/Documents/Seeds_data.csv")

a = d.describe()
# Initialise the scaler
scaler = StandardScaler()
# To scale the data
df = scaler.fit_transform(d)
# Convert the array back to a dataframe
dataset = pd.DataFrame(df)
res = dataset.describe()
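
The section heading also mentions normalization, but the answer above only standardizes. A small sketch of min-max normalization for the same data (an optional addition, not part of the original answer):

from sklearn.preprocessing import MinMaxScaler

# min-max normalization rescales every feature to the [0, 1] range
minmax = MinMaxScaler()
df_norm = pd.DataFrame(minmax.fit_transform(d), columns=d.columns)
df_norm.describe()  # every column now has min 0 and max 1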

STRING MANIPULATIONS

Problem Statement:
As part of data analysis we obviously encounter a lot of text data, which is a collection of strings, each of which in turn is a sequence of characters. Access the text data and manipulate it as per our requirement.
1. Create a string “Grow Gratitude”.
Code for the following tasks:
a) How do you access the letter “G” of “Growth”?
b) How do you find the length of the string?
c) Count how many times “G” is in the string.

2. Create a string “Being aware of a single shortcoming within yourself is far more
useful than being aware of a thousand in someone else.”
Code for the following:



a) Count the number of characters in the string.

3. Create a string "Idealistic as it may sound, altruism should be the driving force in
business, not just competition and a desire for wealth"
Code for the following tasks:
a) get one character of the string
b) get the first three characters
c) get the last three characters
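
No code answers are filled in for the string tasks; a minimal sketch using plain Python string operations:

# Task 1
s1 = "Grow Gratitude"
s1[0]           # a) access the letter "G" (the first character)
len(s1)         # b) length of the string -> 14
s1.count("G")   # c) how many times "G" appears -> 2

# Task 2
s2 = ("Being aware of a single shortcoming within yourself is far more "
      "useful than being aware of a thousand in someone else.")
len(s2)         # a) number of characters in the string

# Task 3
s3 = ("Idealistic as it may sound, altruism should be the driving force in "
      "business, not just competition and a desire for wealth")
s3[0]           # a) get one character of the string
s3[:3]          # b) first three characters -> 'Ide'
s3[-3:]         # c) last three characters -> 'lth'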

TRANSFORMATIONS
Problem Statement:
In the analytics world, everything revolves around data. Proper data helps you make useful predictions that improve your business. Sometimes using the original data as it is does not lead to accurate solutions, and the data needs to be converted from one form to another for better predictions. Explore the various techniques to transform the data for better model performance. You can go through this link:
https://360digitmg.com/mindmap-data-science
4) Prepare the dataset by performing the preprocessing techniques to obtain data that improves model performance.



import pandas as pd
import numpy as np
import scipy.stats as stats
import pylab
import matplotlib.pyplot as plt

df = pd.read_csv(r"C:/Users/HP1/Documents/calories_consumed.csv")

# Checking whether the data is normally distributed
df.columns
stats.probplot(df["Weight gained (grams)"], dist="norm", plot=pylab)
stats.probplot(df["Calories Consumed"], dist="norm", plot=pylab)

# Transformation to make the Weight gained (grams) and Calories Consumed variables normal
stats.probplot(np.log(df["Weight gained (grams)"]), dist="norm", plot=pylab)
stats.probplot(np.log(df["Calories Consumed"]), dist="norm", plot=pylab)



Zero-Variance Features
Problem statement:
Find which columns of the given dataset have zero variance, and explore various techniques used to remove the zero-variance features from the dataset before performing further analysis.

import pandas as pd
from sklearn.feature_selection import VarianceThreshold

data = pd.read_csv(r"C:/Users/HP1/Documents/Z_dataset.csv")

# drop the Id and colour columns


data.info()
data.drop(['Id','colour'], axis=1, inplace=True)

var_thres=VarianceThreshold(threshold=0)
var_thres.fit(data)
var_thres.get_support()

data["square.length"].var()

data["square.breadth"].var()
data["rec.Length"].var()
data["rec.breadth"].var()

# The variance of square.breadth is nearly equal to zero (0.189), so we drop the column
data.drop(['square.breadth'], axis=1, inplace=True)



© 2013 - 2021 360DigiTMG. All Rights Reserved.
