Data Assignment 1

Outlier Treatments

Instructions:

Please share your answers filled in inline in the Word document. Submit code files wherever applicable.

Please ensure you update all the details:

Name: _HARVIR SINGH

Batch Id: DSWDMCOD28/10/22B
Topic: Data Pre-Processing

Problem Statement:
Most datasets have extreme values or exceptions in their observations. These values affect the predictions (accuracy) of the model in one way or another, and simply removing them is not always a good option. For these types of scenarios, we have various techniques to treat such values.
Refer: https://360digitmg.com/mindmap-data-science

1. Prepare the dataset by performing the preprocessing techniques to treat the outliers.



import pandas as pd
import numpy as np
import seaborn as sns

df = pd.read_csv(r"C:/Users/HP1/Documents/boston_data.csv")
df.dtypes



# finding outliers in crim
sns.boxplot(df.crim)

# Detection of outliers (find limits for crim based on IQR)
IQR = df['crim'].quantile(0.75) - df['crim'].quantile(0.25)
lower_limit = df['crim'].quantile(0.25) - (IQR * 1.5)
upper_limit = df['crim'].quantile(0.75) + (IQR * 1.5)
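
Before capping, it can help to see how many observations these limits actually flag. A small optional check (not part of the original answer), reusing the lower_limit and upper_limit computed above:

# count the crim values that fall outside the IQR limits before capping
outliers_crim = df[(df['crim'] < lower_limit) | (df['crim'] > upper_limit)]
print(len(outliers_crim), "rows flagged as outliers in crim")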

############### Winsorization for crim ###############
# pip install feature_engine  # install the package
from feature_engine.outliers import Winsorizer



winsor = Winsorizer(capping_method='iqr',  # choose IQR rule boundaries or gaussian for mean and std
                    tail='both',           # cap left, right or both tails
                    fold=1.5,
                    variables=['crim'])

df_t = winsor.fit_transform(df[['crim']])

# we can inspect the minimum and maximum caps
# winsor.left_tail_caps_, winsor.right_tail_caps_

# let's see the boxplot
sns.boxplot(df_t.crim)



###################################################
# finding outliers in zn
sns.boxplot(df.zn)

# Detection of outliers (find limits for zn based on IQR)
IQR = df['zn'].quantile(0.75) - df['zn'].quantile(0.25)
lower_limit = df['zn'].quantile(0.25) - (IQR * 1.5)
upper_limit = df['zn'].quantile(0.75) + (IQR * 1.5)

### Winsorization for zn ###
from feature_engine.outliers import Winsorizer
winsor = Winsorizer(capping_method='iqr',  # choose IQR rule boundaries or gaussian for mean and std
                    tail='both',           # cap left, right or both tails
                    fold=1.5,
                    variables=['zn'])

df_t = winsor.fit_transform(df[['zn']])
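
As with crim, a quick visual check of the capped column confirms the winsorization worked (a small optional addition, using the df_t just created above):

sns.boxplot(df_t.zn)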

DISCRETIZATION
Instructions:

Please share your answers filled in inline in the Word document. Submit Python code and R code files wherever applicable.

Please ensure you update all the details:

Name: _________________________

Batch Id: _______________________


Topic: Data Pre-Processing



Problem Statement:
In the analytics world, everything revolves around data. Proper data helps you make useful predictions that improve your business. Sometimes using the original data as it is does not lead to accurate solutions, and the data needs to be converted from one form to another for better predictions. Explore the various techniques to transform the data for better model performance. You can go through this link:
https://360digitmg.com/mindmap-data-science
1) Convert the continuous data into discrete classes on the iris dataset.
Prepare the dataset by performing the preprocessing techniques to obtain data that improves model performance.

Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5 3.4 1.5 0.2 setosa
4.4 2.9 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa

import pandas as pd
data = pd.read_csv(C:\Users\HP1\Documents\iris.csv")
data.head()



data.describe()

# Discretization for Sepal_Length

data['Sepal_Length_new'] = pd.cut(data['Sepal_Length'],
bins=[min(data.Sepal_Length) - 1,
data.Sepal_Length.mean(), max(data.Sepal_Length)],
labels=["Low","High"])
data.head()
data.Sepal_Length_new.value_counts()

## Discretization for Sepal_Width

data['Sepal_Width_new'] = pd.cut(data['Sepal_Width'],
bins=[min(data.Sepal_Width) - 1,
data.Sepal_Width.mean(), max(data.Sepal_Width)],
labels=["Low","High"])
data.head()
data.Sepal_Width_new.value_counts()

# Discretization for Petal_Length

data['Petal_Length_new'] = pd.cut(data['Petal_Length'],
bins=[min(data.Petal_Length) - 1,
data.Petal_Length.mean(), max(data.Petal_Length)],
labels=["Low","High"])
data.head()
data.Petal_Length_new.value_counts()

## Discretization for Petal_Width

data['Petal_Width_new'] = pd.cut(data['Petal_Width'],
bins=[min(data.Petal_Width) - 1,
data.Petal_Width.mean(), max(data.Petal_Width)],
labels=["Low","High"])
data.head()
data.Petal_Width_new.value_counts()

Dummy Variables

Problem Statement:
Data is one of the most important assets. It is common for data to be stored in distinct systems with different formats and forms. Non-numeric data makes it tricky to develop mathematical equations for prediction models. We have preprocessing techniques to convert the data to numeric form. Explore the various techniques to obtain reliable, uniform, standard data; you can go through this link:
https://360digitmg.com/mindmap-data-science

2) Prepare the dataset by performing the preprocessing techniques to have all the features in numeric format.

Index   Animals   Gender   Homly   Types
1       Cat       Male     Yes     A
2       Dog       Male     Yes     B
3       Mouse     Male     Yes     C
4       Mouse     Male     Yes     C
5       Dog       Female   Yes     A
6       Cat       Female   Yes     B
7       Lion      Female   Yes     D
8       Goat      Female   Yes     E
9       Cat       Female   Yes     A
10      Dog       Male     Yes     B

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# we use Animal_category dataset



df = pd.read_csv(r"C:\Users\HP1\Documents\animal_category.csv")

df.columns  # column names
df.shape    # will give you the shape of the dataframe

# drop the Index column
df.drop(['Index'], axis=1, inplace=True)
df.dtypes

# Create dummy variables
df_new = pd.get_dummies(df)
df_new_1 = pd.get_dummies(df, drop_first=True)
# we have created dummies for all categorical columns

##### One Hot Encoding #####
df.columns



df = df[['Animals','Gender','Homly','Types']]

from sklearn.preprocessing import OneHotEncoder

# Creating an instance of OneHotEncoder
enc = OneHotEncoder()  # initializing the method

enc_df = pd.DataFrame(enc.fit_transform(df.iloc[:, 0:]).toarray())
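
The encoded frame above has numeric column names. If readable names are wanted, newer scikit-learn versions (1.0+) expose them via get_feature_names_out; a small optional addition, not part of the original answer:

# label the one-hot columns with the original category names
enc_df.columns = enc.get_feature_names_out(df.columns)
enc_df.head()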
Duplication & Typecasting

Problem statement:
Data collected may have duplicate entries; that might be because the data were not collected at regular intervals, or for some other reason. Building a proper solution on such data is a tough ask. The common techniques are either to remove duplicates completely or to substitute those values with logical data. There are various techniques to treat these types of problems.

Q1. For the given dataset perform the type casting (convert the datatypes, e.g. float to int)
Q2. Check for duplicate values, and handle the duplicate values (e.g. drop)
Q3. Do the data analysis (EDA), such as histogram, boxplot, scatterplot, etc.
InvoiceNo | StockCode | Description                         | Quantity | InvoiceDate    | UnitPrice | CustomerID | Country
536365    | 85123A    | WHITE HANGING HEART T-LIGHT HOLDER  | 6        | 12/1/2010 8:26 | 2.55      | 17850      | United Kingdom
536365    | 71053     | WHITE METAL LANTERN                 | 6        | 12/1/2010 8:26 | 3.39      | 17850      | United Kingdom
536365    | 84406B    | CREAM CUPID HEARTS COAT HANGER      | 8        | 12/1/2010 8:26 | 2.75      | 17850      | United Kingdom
536365    | 84029G    | KNITTED UNION FLAG HOT WATER BOTTLE | 6        | 12/1/2010 8:26 | 3.39      | 17850      | United Kingdom
536365    | 84029E    | RED WOOLLY HOTTIE WHITE HEART.      | 6        | 12/1/2010 8:26 | 3.39      | 17850      | United Kingdom
536365    | 22752     | SET 7 BABUSHKA NESTING BOXES        | 2        | 12/1/2010 8:26 | 7.65      | 17850      | United Kingdom
536365    | 21730     | GLASS STAR FROSTED T-LIGHT HOLDER   | 6        | 12/1/2010 8:26 | 4.25      | 17850      | United Kingdom
536366    | 22633     | HAND WARMER UNION JACK              | 6        | 12/1/2010 8:28 | 1.85      | 17850      | United Kingdom
536366    | 22632     | HAND WARMER RED POLKA DOT           | 6        | 12/1/2010 8:28 | 1.85      | 17850      | United Kingdom

import pandas as pd

df = pd.read_csv(r"C:/Users/HP1/Documents/onlineretail.csv")

# type casting
# Now we will convert 'float64' into 'int64' type.
df.UnitPrice = df.UnitPrice.astype('int64')



df.dtypes

# Identify duplicate records in the data
duplicate = df.duplicated()
sum(duplicate)

# Removing duplicates
data = df.drop_duplicates()

# Exploratory Data Analysis

# Measures of Central Tendency / First moment business decision
data.Description.mode()

from sklearn.impute import SimpleImputer
import numpy as np

# Mode Imputer
mode_imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
data["Description"] = pd.DataFrame(mode_imputer.fit_transform(data[["Description"]]))
data.Description.isnull().sum()  # all missing Description records replaced by the mode
data.isnull().sum()

# Graphical Representation
import matplotlib.pyplot as plt  # mostly used for visualization purposes
import numpy as np

plt.hist(data.UnitPrice) #histogram



plt.boxplot(data.UnitPrice) #boxplot

import seaborn as sns

sns.scatterplot(data=df, x="UnitPrice",
y="Description")

Inferential Statistics
Problem Statements:
Q1) Three coins are tossed; find the probability that two heads and one tail are obtained.

ANSWER: 3/8 (the favourable outcomes are HHT, HTH, THH out of the 8 equally likely outcomes)
Q2) Two dice are rolled; find the probability that the sum is
a) Equal to 1
b) Less than or equal to 4
c) Divisible by both 2 and 3

Answer:
a) 0 (the smallest possible sum is 2)
b) 1/6 (sums of 2, 3 or 4 account for 6 of the 36 equally likely outcomes)
c) 1/6 (divisible by both 2 and 3 means divisible by 6: sums of 6 or 12 account for 6 of the 36 outcomes)

Q3) A bag contains 2 red, 3 green and 2 blue balls. Two balls are drawn at random. What is the
probability that none of the balls drawn is blue?



ANSWER: 10/21 (two non-blue balls from the 5 non-blue, out of any two from 7: C(5,2)/C(7,2) = 10/21)

Q4) Calculate the Expected number of candies for a randomly selected child:
Below are the probabilities of candy counts for children (ignoring the nature of the child; a generalized view):
i. Child A – probability of having 1 candy is 0.015
ii. Child B – probability of having 4 candies is 0.2

CHILD   Candies count   Probability
A       1               0.015
B       4               0.20
C       3               0.65
D       5               0.005
E       6               0.01
F       2               0.12
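
A short check of the expected value implied by the table above (a sketch; the counts and probabilities are copied from the table):

# expected number of candies: E[X] = sum of (count * probability)
counts = [1, 4, 3, 5, 6, 2]
probs = [0.015, 0.20, 0.65, 0.005, 0.01, 0.12]
expected = sum(c * p for c, p in zip(counts, probs))
print(expected)  # 3.09 candies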

Q5) Calculate Mean, Median, Mode, Variance, Standard Deviation, Range & comment about
the values / draw inferences, for the given dataset
- For Points, Score, Weigh>
Find Mean, Median, Mode, Variance, Standard Deviation, and Range and comment about the
values/ Draw some inferences.



Dataset: Refer to the Hands-on Material in the LMS - Data Types EDA assignment; a snapshot of the dataset is given above.

ANSWER
import pandas as pd

# to read the file
df = pd.read_excel(r"C:/Users/HP1/Documents/Assignment_module02 (1).xlsx")

df.info()

# ( for points )
# measures of central tendency

df.Points.mean()
df.Points.median()
df.Points.mode()
# Measures of Dispersion / Second moment business decision
df.Points.var() # variance
df.Points.std() # standard deviation
points_range = max(df.Points) - min(df.Points)  # range
points_range

# for score
# measures of central tendency
df.Score.mean()
df.Score.median()
df.Score.mode()
# Measures of Dispersion / Second moment business decision
df.Score.var()  # variance
df.Score.std()  # standard deviation
score_range = max(df.Score) - min(df.Score)  # range
score_range

# for weigh
# measures of central tendency
df.Weigh.mean()
df.Weigh.median()
df.Weigh.mode()
# Measures of Dispersion / Second moment business decision
df.Weigh.var()  # variance
df.Weigh.std()  # standard deviation
weigh_range = max(df.Weigh) - min(df.Weigh)  # range
weigh_range
Q6) Calculate the Expected Value for the problem below.
a) The weights (X) of patients at a clinic (in pounds) are:
108, 110, 123, 134, 135, 145, 167, 187, 199
Assume one of the patients is chosen at random. What is the expected value of the weight of that patient?

ANSWER:
Probability of selecting each patient = 1/9
x:    108, 110, 123, 134, 135, 145, 167, 187, 199
P(x): 1/9 for each value
Expected Value = (1/9)(108) + (1/9)(110) + (1/9)(123) + (1/9)(134) + (1/9)(135) + (1/9)(145) + (1/9)(167) + (1/9)(187) + (1/9)(199)
               = (1/9)(108 + 110 + 123 + 134 + 135 + 145 + 167 + 187 + 199)
               = (1/9)(1308)
               = 145.33

Q7) Look at the data given below. Plot the data, find the outliers and find μ, σ and σ².
Hint: [Use a plot which shows the data distribution, skewness along with the outliers; also use
R/Python code to evaluate measures of centrality and spread]

Name of company       Measure X
Allied Signal 24.23%
Bankers Trust 25.53%
General Mills 25.41%
ITT Industries 24.14%
J.P.Morgan & Co. 29.62%
Lehman Brothers 28.25%
Marriott 25.81%
MCI 24.39%
Merrill Lynch 40.26%
Microsoft 32.95%
Morgan Stanley 91.36%
Sun Microsystems 25.99%
Travelers 39.42%
US Airways 26.71%
Warner-Lambert 35.00%
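
No worked answer is filled in for Q7; a minimal sketch of how the measures could be evaluated in Python (the values are typed in from the table above):

import numpy as np
import matplotlib.pyplot as plt

# Measure X values from the table above (in percent)
x = np.array([24.23, 25.53, 25.41, 24.14, 29.62, 28.25, 25.81, 24.39,
              40.26, 32.95, 91.36, 25.99, 39.42, 26.71, 35.00])

print("mean (mu):", x.mean())
print("standard deviation (sigma):", x.std())   # population standard deviation
print("variance (sigma^2):", x.var())           # population variance

# a boxplot shows the right skew and flags Morgan Stanley (91.36%) as an outlier
plt.boxplot(x)
plt.show()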

Q8) AT&T was running commercials in 1990 aimed at luring back customers who had switched
to one of the other long-distance phone service providers. One such commercial shows a
businessman trying to reach Phoenix and mistakenly getting Fiji, where a half-naked native on a
beach responds incomprehensibly in Polynesian. When asked about this advertisement, AT&T
admitted that the portrayed incident did not actually take place but added that this was an
enactment of something that “could happen.” Suppose that one in 200 long-distance telephone
calls is misdirected.

What is the probability that at least one in five attempted telephone calls reaches the wrong
number? (Assume independence of attempts.)



Hint: [Using Probability formula evaluate the probability of one call being wrong out of five
attempted calls]
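
No answer is filled in for Q8; one way to work it out, assuming each of the five calls is independently misdirected with probability 1/200:

# P(at least one of five calls is misdirected) = 1 - P(none is misdirected)
p_wrong = 1 / 200
p_at_least_one = 1 - (1 - p_wrong) ** 5
print(p_at_least_one)  # approximately 0.0248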

Q9) Returns on a certain business venture, to the nearest $1,000, are known to follow the
following probability distribution
X P(x)
-2,000 0.1
-1,000 0.1
0 0.2
1000 0.2
2000 0.3
3000 0.1

(i) What is the most likely monetary outcome of the business venture?
Hint: [The outcome is most likely the expected returns of the venture]

(ii) Is the venture likely to be successful? Explain.


Hint: [Probability of % of venture being a successful one]

(iii) What is the long-term average earning of business ventures of this kind? Explain.
Hint: [Here, the expected return of the venture is considered as the required average]

(iv) What is a good measure of the risk involved in a venture of this kind? Compute
this measure.
Hint: [Risk here stems from the possible variability in the expected returns,
therefore, name the risk measure for this venture]
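
The document leaves Q9 unanswered; a minimal sketch of the computations the hints ask for, using the distribution given above:

# returns distribution from the table above
x = [-2000, -1000, 0, 1000, 2000, 3000]
p = [0.1, 0.1, 0.2, 0.2, 0.3, 0.1]

expected = sum(xi * pi for xi, pi in zip(x, p))                    # long-run average return
variance = sum((xi - expected) ** 2 * pi for xi, pi in zip(x, p))
std_dev = variance ** 0.5                                          # a common measure of the risk

print("most likely outcome (mode):", x[p.index(max(p))])          # 2000, with probability 0.3
print("expected return:", expected)                                # 800
print("P(positive return):", sum(pi for xi, pi in zip(x, p) if xi > 0))  # 0.6
print("standard deviation (risk):", round(std_dev, 2))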

Imputation

Problem Statement:
Most datasets have missing values; that might be because the data were not collected at regular intervals, instruments broke down, and so on. With such gaps it is nearly impossible to build a proper model or, in other words, get accurate results. The common techniques are either to remove those records completely or to substitute the missing values with logical ones; there are various techniques to treat these types of problems.



1) Prepare the dataset using various techniques to solve the problem; explore all the techniques available and use them to see which gives the best result.
Hint: Go through this link: https://360digitmg.com/mindmap-data-science

CASENUM ATTORNEY CLMSEX CLMINSUR SEATBELT CLMAGE LOSS


5 0 0 1 0 50 34.94
3 1 1 0 0 18 0.891
66 1 0 1 0 5 0.33
70 0 0 1 1 31 0.037
96 1 0 1 0 30 0.038
97 0 1 1 0 35 0.309
10 0 0 1 0 9 3.538
36 0 1 1 0 34 4.881
51 1 1 1 0 60 0.874
55 1 0 1 0 0.35
61 0 1 1 0 37 6.19
148 0 0 1 0 41 19.61
150 1 0 1 0 7 1.678
150 0 1 1 0 40 0.673
169 1 1 1 0 37 0.143
171 1 1 0 0 9 0.053
334 1 1 1 0 58 0.05
360 0 0 1 0 58 0.758
376 1 0 1 0 3 0
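
The document does not include code for this imputation task; a minimal sketch using pandas and scikit-learn (the file name claimants.csv and the exact column names are assumptions based on the table above):

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

# the path and file name are assumptions; point this at wherever the claims data is stored
claims = pd.read_csv(r"C:/Users/HP1/Documents/claimants.csv")
claims.isnull().sum()  # count missing values per column

# mean imputation for the numeric CLMAGE column
mean_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
claims["CLMAGE"] = pd.DataFrame(mean_imputer.fit_transform(claims[["CLMAGE"]]))

# mode imputation for the categorical columns
mode_imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
for col in ["CLMSEX", "CLMINSUR", "SEATBELT"]:
    claims[col] = pd.DataFrame(mode_imputer.fit_transform(claims[[col]]))

claims.isnull().sum()  # verify no missing values remain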

STANDARDIZATION & NORMALIZATION


Problem Statement:
Data is one of the most important assets. It is common for data to be stored in distinct systems with different formats and scales. These seemingly small differences in how the data is stored can result in misinterpretations and inconsistencies in your analytics. Inconsistency can make it impossible to deliver reliable information to management for good decision making. We have preprocessing techniques to make the data uniform. Explore the various techniques to obtain reliable, uniform, standard data; you can go through this link:
https://360digitmg.com/mindmap-data-science



3) Prepare the dataset by performing the preprocessing techniques to bring the data to a standard scale.

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

d = pd.read_csv(r"C:/Users/HP1/Documents/Seeds_data.csv")

a = d.describe()
# Initialise the scaler
scaler = StandardScaler()
# To scale the data
df = scaler.fit_transform(d)
# Convert the array back to a dataframe
dataset = pd.DataFrame(df)
res = dataset.describe()
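
The section heading also mentions normalization, but the answer above only standardizes. A small sketch of min-max normalization for the same data (an optional addition, not part of the original answer):

from sklearn.preprocessing import MinMaxScaler

# min-max normalization rescales every feature to the [0, 1] range
minmax = MinMaxScaler()
df_norm = pd.DataFrame(minmax.fit_transform(d), columns=d.columns)
df_norm.describe()  # every column now has min 0 and max 1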

STRING MANIPULATIONS

Problem Statement:
As part of data analysis we obviously encounter a lot of text data, which is a collection of strings, each of which in turn is a sequence of characters. Access the text data and manipulate it as per our requirement.
1. Create a string “Grow Gratitude”.
Code for the following tasks:
a) How do you access the letter “G” of “Growth”?
b) How do you find the length of the string?
c) Count how many times “G” is in the string.

2. Create a string “Being aware of a single shortcoming within yourself is far more
useful than being aware of a thousand in someone else.”
Code for the following:



a) Count the number of characters in the string.

3. Create a string "Idealistic as it may sound, altruism should be the driving force in
business, not just competition and a desire for wealth"
Code for the following tasks:
a) get one character of the string
b) get the first three characters
c) get the last three characters
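
No code answers are filled in for the string tasks; a minimal sketch using plain Python string operations:

# Task 1
s1 = "Grow Gratitude"
s1[0]           # a) access the letter "G" (the first character)
len(s1)         # b) length of the string -> 14
s1.count("G")   # c) how many times "G" appears -> 2

# Task 2
s2 = ("Being aware of a single shortcoming within yourself is far more "
      "useful than being aware of a thousand in someone else.")
len(s2)         # a) number of characters in the string

# Task 3
s3 = ("Idealistic as it may sound, altruism should be the driving force in "
      "business, not just competition and a desire for wealth")
s3[0]           # a) get one character of the string
s3[:3]          # b) first three characters -> 'Ide'
s3[-3:]         # c) last three characters -> 'lth'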

TRANSFORMATIONS
Problem Statement:
In the analytics world, everything revolves around data. Proper data helps you make useful predictions that improve your business. Sometimes using the original data as it is does not lead to accurate solutions, and the data needs to be converted from one form to another for better predictions. Explore the various techniques to transform the data for better model performance. You can go through this link:
https://360digitmg.com/mindmap-data-science
4) Prepare the dataset by performing the preprocessing techniques to obtain data that improves model performance.



import pandas as pd
import numpy as np
import scipy.stats as stats
import pylab
import matplotlib.pyplot as plt

df = pd.read_csv(r"C:/Users/HP1/Documents/calories_consumed.csv")

# Checking whether the data is normally distributed
df.columns
stats.probplot(df["Weight gained (grams)"], dist="norm", plot=pylab)
stats.probplot(df["Calories Consumed"], dist="norm", plot=pylab)

# Transformation to make the Weight gained (grams) and Calories Consumed variables normal
stats.probplot(np.log(df["Weight gained (grams)"]), dist="norm", plot=pylab)
stats.probplot(np.log(df["Calories Consumed"]), dist="norm", plot=pylab)



Zero-Variance Features
Problem statement:
Find which columns of the given dataset have zero variance, and explore various techniques used to remove the zero-variance features from the dataset before performing further analysis.

import pandas as pd
from sklearn.feature_selection import VarianceThreshold

data = pd.read_csv(r"C:/Users/HP1/Documents/Z_dataset.csv")

# drop the Id and colour columns


data.info()
data.drop(['Id','colour'], axis=1, inplace=True)

var_thres=VarianceThreshold(threshold=0)
var_thres.fit(data)
var_thres.get_support()

data["square.length"].var()

data["square.breadth"].var()
data["rec.Length"].var()
data["rec.breadth"].var()

# The variance of square.breadth is nearly equal to zero (0.189), so we drop the column
data.drop(['square.breadth'], axis=1, inplace=True)



© 2013 - 2021 360DigiTMG. All Rights Reserved.
