360DigiTMG Practical Data Science Workbook

The document discusses various concepts related to artificial intelligence, including data science, data mining, machine learning, deep learning, and reinforcement learning, and provides definitions and examples of these concepts.


INDEX

Ingredients of AI
- Artificial Intelligence
- Data Science
- Data Mining
- Machine Learning
- Deep Learning
- Reinforcement Learning

Stages of Analytics
CRISP - DM
- CRISP - DM Business Understanding
- CRISP - DM Data Collection
• Data Types
• Different Scales of Measurement
• Data Understanding
• Qualitative vs Quantitative
• Structured vs Unstructured
• Big Data vs Non-Big Data
• Cross Sectional vs Time Series vs Longitudinal Data
• Balanced vs Unbalanced
• Data Collection Sources
- Primary Data
- Secondary Data
• Preliminaries for Data Analysis
• Probability
• Base Equation
• Random Variables
• Probability Distributions
• Sampling Techniques
• Inferential Statistics
• Non-Probability Sampling
• Probability Sampling
• Sampling Funnel
- CRISP - DM Data Cleansing / Data Preparation
• Outlier Treatment
• Winsorization
• Alpha Trimmed
• Missing Values
• Imputation
• Transformation
• Normalization/Standardization
• Dummy Variables
• Type Casting
• Handling Duplicates
• String Manipulation

- CRISP - DM Exploratory Data Analysis


• Measure of Central Tendency
• Measure of Dispersion
• Measure of Skewness
• Measure of Kurtosis
• Graphical Representations
- Histogram
- Box Plot
- Q-Q Plot
- Bivariate Analysis
- Scatter Plot
- Correlation Coefficient
• Multivariate Analysis
• Data Quality Analysis
• Four Errors to Be Avoided During Data Collection
• Data Integration
• Feature Engineering
• Feature Extraction
• Feature Selection
- CRISP - DM Model Building Using Data Mining
• Supervised Learning
• Supervised Learning has Four Broad Problems to Solve:
- Predict A Categorical Class: Classification
- Predict A Numerical Value: Prediction
- Predict User Preference from a Large Pool of Options: Recommendation
- Predict Relevance of an Entity to a "Query": Retrieval
• Data Mining Unsupervised
- A Few of the Algorithms are:
- Clustering
- Dimension Reduction
- Network Analysis
- Association Rules
- Online Recommendation Systems

• Unsupervised Preliminaries
- Distance Calculation
- Linkages
• Clustering / Segmentation
- K-Means Clustering
- Disadvantages of K-Means
- K-Means++ Clustering
- K-Medians Clustering
- K-Medoids
- Partitioning Around Medoids (PAM)
- CLARA
• Hierarchical Clustering
- Disadvantages of Hierarchical Clustering
• Density Based Clustering: DBSCAN
• OPTICS
• Grid-Based Clustering Methods
• Three Broad Categories of Measurement in Clustering
• Most Common Measures
• Clustering Assessment Methods
• Finding K Value
• Mathematical Foundations
• Dimension Reduction
- PCA
- SVD
- LDA
• Association Rules
- Support
- Confidence
- Lift
• Recommender Systems
- Types of Recommendation Strategies
- Collaborative Filtering
- Similarity Measures
- Disadvantages
- Alternative Approaches
- Recommendations vs Association Rules
- New Users and New Items
• Network Analysis
- Applications
- Degree Centrality
- Closeness Centrality
- Betweenness Centrality
- Eigenvector Centrality
- Edge / Link Properties
- Cluster Coefficient
• Text Mining
- Examples of Sources
- Pre-Process the Data
- Document Term Matrix / Term Document Matrix
- Word Cloud
- Natural Language Processing (NLP)
- Natural Language Understanding (NLU)
- Natural Language Generation (NLG)
- Parts of Speech Tagging (Pos)
- Named Entity Recognition (NER)
- Topic Modelling
- LSA / LSI
- LDA
- Text Summarization
• Data Mining Supervised Learning
• Machine Learning Primer
- Key Challenges
• Model Evaluation Techniques
- Errors
- Confusion Matrix
- Cross Table
- ROC Curve
• K-Nearest Neighbor
- Choosing K Value
- Pros and Cons
• Naive Bayes Algorithm
• Decision Tree
- Three Types of Nodes
- Greedy Algorithm
- Information Theory 101
- Entropy
- Pros and Cons of Decision Tree
• Scatter Diagram
• Correlation Analysis
• Linear Regression
- Ordinary Least Squares
- Model Assumptions
• Logistic Regression
• Support Vector Machine
- Hyperplane
- Non-Linear Spaces
- Kernel Tricks
- Kernel Functions
• Deep Learning Primer
- Image Recognition
- Speech Data
- Text Data
- Shallow Machine Learning Models
• Perceptron Algorithm
- Biological Neuron
- Simple Neural Network Components
- Perceptron Algorithm
- Learning Rate
- Gradient Primer
- Gradient Descent Algorithms Variants
- Empirically Determined Components
• Multi-Layers Perceptron (MLP) / Artificial Neural Network (ANN)
- Non-Linear Patterns
- Integration Function
- Activation Function
- Regularization Techniques Used for Overfitting
- Error-Change Criterion
- Weight-Change Criterion
- Dropout
- Drop Connect
- Noise
- Batch Normalization
- Shuffling Inputs
- Weight Initialization Techniques
• Forecasting
- Time Series vs Cross Sectional Data
- EDA - Components of Time Series
• Systematic Part
• Level
• Trend
• Seasonality
- Non-Systematic Part
• Noise/Random
- Data Partition
- Forecast Model
- Model-Driven Techniques
- Data-Driven Techniques
- Smoothing Techniques
- Moving Average
- Exponential Smoothing
- De-Trending and De-Seasoning
- Regression
- Differencing
- Moving Average
Ingredients of AI

Artificial Intelligence

Data Science

Data Mining

Machine Learning

Deep Learning

Reinforcement
Learning

Definition of Artificial Intelligence,


Data Science, Data Mining, Machine
Learning, Deep Learning,
Reinforcement Learning (RL)

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 1


Artificial Intelligence

Artificial Intelligence

Data Science

Data Mining

Machine Learning

Deep Learning

Reinforcement
Learning

Ability of inanimate objects such as Machines, Robots,

Systems, etc., with computing capabilities to perform

____________ tasks that are similar to Humans.

Examples of AI
· ________ (Video Analytics & Image Processing)

· Hearing (Speech to Text Applications)

· Response to Stimuli (Inputs)

· __________

[Diagram: inputs X1, X2, X3 mapped through y = f(x) to output Y]

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 2


Data Science

Artificial Intelligence

Data Science

Data Mining

Machine Learning

Deep Learning

Reinforcement
Learning

Data Science is a field of study related to data, to bring out

meaningful ______ insights for effective ______ making.

Topics of Data Science include

1. ______ Analysis 6. Black Box Techniques

2. Hypothesis Testing 7. ______ Mining

3. Data ______ 8. Natural Language Processing

4. Regression Analysis 9. ______ Analysis, etc.

5. Classification Techniques

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 3


Data Mining

Artificial Intelligence

Data Science

Data Mining

Machine Learning

Deep Learning

Reinforcement
Learning

Data Mining is similar to coal mining where we get coal and

if lucky one might get precious stones such as diamond.

In Data Mining we get ______ from _____ and insights

similar to diamond are extremely valuable for ______.

Data Mining (Branches)

_______ Learning    ________ Learning    Active Learning    Structured Prediction

Unsupervised Learning    Semi-Supervised Learning    _______ Learning

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 4


Machine Learning
Machine Learning is learning from the ________ of the

historical / past data and then using it on the ________

unseen data to achieve a defined objective.

1. ______ Learning / _______ Learning - Both _______ and _______ are known in the historical data

2. ________ Learning / ________ Learning - Only ________ are known in the historical data & ________ is not known or assumed as not known

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 5


Deep Learning

Artificial Intelligence

Data Science

Data Mining

Machine Learning

Deep Learning

Reinforcement
Learning

Deep Learning is a special branch of Machine

Learning where the ___________ in data are

_________ extracted.

Some of the Deep Learning Architecture


• _________________ / • Gated Recurrent Units (GRUs)

Multi-Layered Perceptron • Mask R-CNN

• ___________ Neural Network • Autoencoders

• _________ Neural Network • Generative Adversarial Network (GAN)

• Deep Belief Network • Boltzmann Machine

• Long Short Term Memory (LSTM) • Deep Q-Networks

• Q Learning etc.
Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 6
Reinforcement Learning
Reinforcement Learning is a special branch of ___________ Learning, which is heavily used in applications including games, robotics, investment banking, trading, etc.

Reinforcement Learning is a ___________ based learning, which solves

sequential decision problems by __________ with the environment.

The 5 key elements of Reinforcement Learning

________ - A learning component that makes decisions on actions to maximize the reward.

Environment - The physical world where agents perform actions.

_____ - Defines the behavior of the agent from states to actions.

______ ___________ - Defines the problem and maps it to a numerical reward.

______ __________ - Defines the cumulative future reward.

Model of the Environment - An optional component which predicts the behavior of the environment.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 7


Stages of Analytics

Highest Level of Analytics

4. Prescriptive Analytics

3. Predictive Analytics

2. Diagnostic Analytics

1. Descriptive Analytics

__________________ - Answers questions on what happened in the

past and present.

Example: Number of Covid-19 cases to date across various countries

Diagnostic Analytics - Answers questions on ________________.

Example: Why are the Covid-19 cases increasing?

___________ - Answers questions on ____________________.

Example: What will be the number of Covid-19 cases for the next month?

_________________ - Provides remedies and solutions for what might

happen in the future.

Example: What should be done to avoid the spread of Covid-19 cases, which

might increase in the next one month?

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 8


CRISP - DM
C____ I_______ S______ P_____
for Data Mining

[CRISP-DM cycle]

1. ___________
2. _________
3. Data Cleansing / Preparation and Exploratory Data Analysis
4. Data Mining - Model Development
5. __________
6. __________
Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 9


CRISP - DM Business Understanding
Articulate the business problem by understanding the
client/customer requirements

____________ Objectives        Business ____________

A few examples on Business Objective and Business Constraints

Business Problem : Significant proportion of customers who take loan are


unable to repay
Business Objective : ____________ Loan Defaulters
Business Constraint : Maximize Profits

Business Problem : Significant proportion of customers are complaining that


they did not do the credit card transaction
Business Objective : Minimize Fraud
Business Constraint : _____________ Convenience

Key points to remember:


Ensure that objectives and constraints are SMART

- S____________
- Measurable
SMART - A____________
- R____________
- Time-Bound
Key Deliverable: ________________

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 10


CRISP - DM Data Collection
Understanding various __________ is pivotal to proceed further with data

collection.

Data Types
_____________________: Any data which can be represented in a _________ and makes sense.

_____________________: Data which, when represented in decimal format, does not make sense.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 11


CRISP - DM Data Collection
_______________________________________________

_______________________________________________

_______________________________________________

_______________________________________________

Count Data examples


_______________________________________________

_______________________________________________

_______________________________________________

_______________________________________________

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 12


Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 13
Data Understanding

_____________________ VS _____________________

_______________ data is non-numerical data.
Examples:
1. This weighs heavy
2. That kitten is small

Quantitative data includes numbers.
Examples:
1. Weight 85 kg
2. Height 164.3 cm

____________ Data and _______ Data fall under Quantitative Data.

Qualitative / Categorical
- Nominal: Binary, Multiple
- Ordinal

Quantitative
- Continuous
- Discrete (Count)

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 14


____________________ vs Unstructured
Structured data is data which in its raw state can be placed in a _______ format. For example, video is split into images, images into ________, and each pixel intensity value becomes an entry in a column; this makes the data structured.

Unstructured data is data which in its raw state cannot be placed in any _______ format. Videos, Images, Audio/Speech, and Textual Data are examples of Unstructured data.

Audio Files / Speech data can be converted into features using _________

Frequency ____________ Coefficient (MFCC).


Textual data can be converted into _____ of ______ (BoW) as an example

to make it Structured.

Example: But I, being poor, have only my dreams; I have spread my dreams

under your feet; Tread softly because you tread on my dreams.

Poor Dream Spread Feet Tread Soft

1 3 1 1 1 1

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 15


Data Understanding

_____________________ VS _____________________

Data which is governed by the 5 Vs:

__________   __________   __________   __________   __________

High Volume, generating at rapid Velocity, from a wide Variety, with an element of uncertainty (__________), and appropriate Value.

________ is that which cannot be stored in the available hardware and

cannot be processed using available software.

____________ is that data which can be stored in the available hardware

and can be processed using available software.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 16


Cross-Sectional vs ___________ vs ___________

1. Cross-sectional data is that data, where date, time, and


sequence in which we arrange the data is immaterial

2. Cross-sectional data usually contains more than one variable

Examples:
1. Population survey of demographics
2. Profit & Loss statements of various companies

1 _________ data is that data, where the date, time, and


sequence in which we arrange the data is important

2. _________ data usually contains only one variable of


interest to be forecasted

Examples:
1. Monitoring patient blood pressure every week
2. Global warming trend

1. _________ is also called _________

2. _________ includes properties of both Cross-Sectional data as well as Time Series data

3. _________ has more than one variable, which is sorted based on the date and time

Examples:
1. Exam scores of all students in a class from sessional to final exams
2. Health scores of all employees recorded every month

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 17


__________ vs __________
• Whether a person claims insurance or not

• Will a person 'pay on time', 'pay with a delay', or 'default', etc.

• ___________ is that data where the classes of output variables are more or less in equal proportion. E.g. 47% of people have defaulted and 53% have not defaulted in the loan default variable.

• _________ is that data where the classes of output variables are in unequal proportion. E.g. 23% of the data is defaulted and 77% is not defaulted in the loan default variable.

• When we have balanced data


then we can simply apply
random sampling techniques

Default Not Default

Thumb Rule: if proportion of minority output class is < 30% then


data is imbalanced.

Sampling for imbalanced data refer to next page.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 18


When we have imbalanced data then we apply different
sampling techniques such as:

• _____________ - Undersampling and Oversampling

• Bootstrap Resampling

• K-Fold Cross Validation

• _____________ K-Fold Cross Validation

• _____________ K-Fold Cross-Validation

• _____________ (N-Fold Cross-Validation) LOOCV

• SMOTE (Synthetic Minority Oversampling Technique)

• MSMOTE (Modified SMOTE)

• Cluster-Based Sampling

• Ensemble Techniques

Imbalanced Data

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 19


Data Collection Sources
_______________: Data Collected at the Source

_______________: Data Collected Before Hand

Primary Data

Secondary Data

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 20


Primary Data
Examples of _________

Surveys, __________ of ________, ________ Sensors Data,


Interviews, Focus Groups, etc.

Survey steps:
1. Understand the business __________ and __________ behind
conducting the survey. E.g. Sales are low for a training company
2. Perform __________ analysis - __________ Analysis, 5-_____
Analysis, etc. E.g. Product Pricing is uncompetitive
3. Formulate Decision Problem. E.g. Should product prices be changed
4. Formulate Research _________. E.g. Determine the price elasticity of
demand and the impact on sales and profits of various levels of price
changes
5. List out Constructs. E.g. Training Enrolment
6. Deduce Aspects based on construct. E.g. Time aspect, Strength aspect,
Constraint aspect
7. Devise Survey ____________ based on the _______. E.g. I am most
likely to enroll for the training program in: In the next one week, In the next
one month, In the next one quarter, etc.

__________of ______________ examples:

• Coupon marketing with a 10% discount vs 20% discount, to which of these


customers are responding well

• Coupon targeting customers within 10 km radius versus 20 km radius

• Combinations of discount & distance to experiment

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 21


Secondary Data

Organizational data are stored in databases:
• Oracle DB
• Microsoft DB
• MySQL
• NoSQL - MongoDB
• Big Data, etc.

_________ ___________ (paid) databases:
• Industry reports
• Government reports
• Quasi-government reports, etc.

Meta Data Description: Data about Data


• Obtaining meta data description is mandatory before we proceed
further in the project

• Understand the data volume details such as size, number of records,


total databases, tables, etc.

• Understand the data attributes/variables - description and values


which these variables take

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 22


Preliminaries for Data Analysis
Probability can be explained as the extent to which an event is likely to occur,
measured by the ratio of the _________ cases to the whole number of
cases possible.

Probability = (# ________) / (# Total events)

Properties of Probability:

• Ranges from 0 to 1
• Summation of probabilities of all values of an event will be equal to 1

Example:

P(H) = H / (H & T) = 1/2 = 0.5

P(Red) = 2 / (2(R) + 2(B)) = 2/4 = 0.5
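As a quick check of the ratio definition above, here is a minimal Python sketch; the coin and two-red/two-blue ball setup simply mirrors the worked example.

```python
from fractions import Fraction

# Probability = (# favourable events) / (# total events)
p_head = Fraction(1, 2)        # coin toss: 1 head out of 2 outcomes -> 0.5
p_red = Fraction(2, 2 + 2)     # 2 red balls out of 2 red + 2 blue  -> 0.5

print(float(p_head), float(p_red))   # 0.5 0.5
```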

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 23


Base Equation
Random variables can be broadly classified as Output and Input variables.

Mathematically the relation between these is expressed using base equation:

Y is known as: X is known as:

• _________ variable • __________

• Response • Explanatory

• _________ • __________

• Explained variable • Covariates

• Criterion • __________

• Measures variable • Factors

• _________ variable • __________

• _________ • Controlled variable

• _________ variable • __________ variable

• Exposure variable

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 24


_____________

• If there is a chance / probability associated with each of the possible outputs, then it is called a _______

• Any output on any event which _______ is called a Variable

_____________ are always represented using Upper case.

Values that a random variable takes are represented using ____________.

Ex: Roll of a single die

X = {1, 2, 3, 4, 5, 6}

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 25


Probability Distribution
Representing the probabilities of all possible outcomes of an event in a tabular

format or a graphical representation is called Probability Distribution.

If a random variable is continuous then the underlying probability distribution

is called __________________.

If a random variable is discrete then the underlying probability distribution is

called __________________.

X    P(X=x)
0    0.40
1    0.25
2    0.20
3    0.05
4    0.10

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 26


Sampling Techniques
Sampling is a technique to collect the _________ of population data.

These techniques are broadly classified into 2 types.

_________ Sampling

_________ Sampling

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 27


Inferential Statistics
Inferential statistical is a process of analysing the ________ and deriving

statements / properties of a ________.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 28


Sampling Techniques
This class of sampling techniques is based on convenience, wherein the priority varies for the data that is to be collected to represent the population; these approaches are also known as ____________ sampling.

A few examples of Non-Probability Sampling:

1 Convenience Sampling

2 Quota Sampling

3 Judgment Sampling

4 Snowball Sampling

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 29


Sampling Techniques
________ Sampling, also known as ________ Sampling, is the default approach for inferential statistics. Each data point to be collected will have ________ to get selected.

A few examples of __________ Sampling:

1 Simple Random Sampling

2 Systematic Sampling

3 Stratified Sampling

4 Clustered Sampling

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 30


Sampling Funnel

Population

Sampling Frame

Simple Random Sampling

Sample

Population: All Covid-19 cases on the planet

Sampling Frame:
• The majority of Covid-19 cases are in the USA, India, and Brazil, and hence these 3 countries can be selected as a ______________
• ______ does not have any hard and fast rule; it is devised based on business logic

Simple Random Sampling:
• Randomly sample 10% or 20% of the data from the sampling frame using the Simple ________ technique
• ____________ is the gold standard technique used for sampling
• ____________ is the only sampling technique which has no bias
• Other sampling techniques such as Stratified sampling, etc., can also be used to sample the data, but _________ is the best
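A small pandas sketch of the last step of the sampling funnel, simple random sampling; the DataFrame contents and the 10% fraction are illustrative assumptions, not from the workbook.

```python
import pandas as pd

# Hypothetical sampling frame (e.g. case records from the selected countries)
frame = pd.DataFrame({
    "case_id": range(1000),
    "country": ["USA", "India", "Brazil"] * 333 + ["USA"],
})

# Simple random sampling: every record has an equal chance of being selected
sample = frame.sample(frac=0.10, random_state=42)
print(len(sample))   # roughly 100 records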

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 31


CRISP - DM Data Cleansing
Data Cleansing / Data Preparation

Data Cleansing is also called as ___________, ___________,

___________, ___________.

Outlier or ____________ - Any value which is extremely small or extremely large compared to the remaining data.

Outliers are treated using 3 R technique:

___________________

___________________

___________________

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 32


Winsorization Technique
Winsorization is a technique which modifies the sample distribution of a random variable by capping extreme values (outliers) rather than removing them. For example, 90% winsorization means all data below the 5th percentile is set to the 5th percentile value and all data above the 95th percentile is set to the 95th percentile value.

All values below the 5th percentile are changed to the 5th percentile value; all values above the 95th percentile are changed to the 95th percentile value.

Alpha Trimmed Technique


Alpha Trimmed Technique lets you set an alpha value, for example if alpha = 5%,

then all the lower & upper 5% values are trimmed or removed.

The lower 5% of values are removed/trimmed; the upper 5% of values are removed/trimmed.
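A plain-NumPy sketch of both treatments, assuming 90% winsorization (capping at the 5th/95th percentiles) and alpha = 5% trimming; the data array is made up for illustration.

```python
import numpy as np

x = np.array([2, 3, 4, 5, 6, 7, 8, 9, 10, 250])   # 250 is an extreme value

# Winsorization: cap values at the 5th and 95th percentile values
low, high = np.percentile(x, [5, 95])
winsorized = np.clip(x, low, high)

# Alpha trimming (alpha = 5%): drop the lowest and highest 5% of values
alpha = 0.05
lo_cut, hi_cut = np.percentile(x, [100 * alpha, 100 * (1 - alpha)])
trimmed = x[(x >= lo_cut) & (x <= hi_cut)]

print(winsorized.max(), trimmed.max())
```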

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 33


Missing Values
Missing Values - Fields in the data which might have blank spaces and (or) NA,

_________, _________, _________.

3 Variants of Missing Values

• ________________________ (MAR)

• Missingness Not At Random (MNAR)

• ________________________ (MCAR)

Name Age Salary


Steve 23 $ 4,000
Raj 33 $ 6,500
Chen 41 Missingness
Wilma 37 $ 7,200
Audrey 51 $ 9,300

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 34


Imputation
Imputation is a technique used to replace missing values with logical values.

Wide variety of Techniques are available, choosing the one which fits the data is

an art:

______________ ______________

(Simple Strategies) Single Imputation Methods

• List-Wise Deletion or • Mean Imputation


Complete-Case Analysis
• Median Imputation

• Available Case Method or • Mode Imputation


Pair-Wise Deletion
• Random Imputation

• Hot deck Imputation

• Regression Imputation

• KNN Imputation
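A pandas sketch of the simple single-imputation strategies listed above (mean, median, mode); the toy salary column mirrors the missing-value table a few pages back.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["Steve", "Raj", "Chen", "Wilma", "Audrey"],
                   "Age": [23, 33, 41, 37, 51],
                   "Salary": [4000, 6500, np.nan, 7200, 9300]})

mean_imputed   = df["Salary"].fillna(df["Salary"].mean())
median_imputed = df["Salary"].fillna(df["Salary"].median())
# Mode imputation is usually reserved for categorical columns
mode_imputed   = df["Salary"].fillna(df["Salary"].mode()[0])

print(mean_imputed[2], median_imputed[2], mode_imputed[2])
```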

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 35


Transformation
______________________________________________

Types of transformation

• Logarithmic

• ____________

• Square Root

• ____________

• Box-Cox

• Johnson

____________ / Binning / Grouping - Converting _________ data to

____________

Binarization - Converting continuous data into ____________

Rounding - Rounding off the decimals to the nearest integer e.g. 5.6 = 6

Binning - Two types of Binning

• Fixed Width Binning

• Adaptive Binning

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 36


Normalization
Normalization / ____________ - Making the data _________ and ____________.

Z = (X - mean) / stdev

Methods of Normalization / ____________ include ____________, also called as ____________; and ____________, also called as ____________ or __________________________.

Standardization has two parts:

• ____________________ or Mean Subtraction - Mean

Normalization will make the mean of the data _________.

• Variance Normalization - _________________ will make the

variance of the data __________.

(X - min(x)) / (max(x) - min(x))

Normalization is also called the _____________________. Normalized data has a minimum value of 0 and a maximum value of 1; sometimes, when dealing with negative values, the range can be between -1 and +1.

The Min-Max Scaler's disadvantage is that its scaled values are influenced by _______________.

____________________ is not influenced by outliers because it considers 'Median' & 'IQR':

(X - median(x)) / IQR(x)
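A NumPy sketch of the three scalers discussed on this page (z-score standardization, min-max normalization, and the median/IQR robust scaler); the input array is illustrative.

```python
import numpy as np

x = np.array([10.0, 12.0, 14.0, 16.0, 100.0])    # 100 is an outlier

z = (x - x.mean()) / x.std()                     # standardization: mean 0, variance 1
minmax = (x - x.min()) / (x.max() - x.min())     # min-max: rescaled to the 0-1 range

q1, q3 = np.percentile(x, [25, 75])
robust = (x - np.median(x)) / (q3 - q1)          # robust scaler: less influenced by the outlier

print(z.round(2), minmax.round(2), robust.round(2))
```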

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 37


Dummy Variable
Dummy Variable Creation: Representing/Converting categorical data in
numerical data

Techniques for Dummy Variable creation are:

________________

_____________ Scheme

_______________

Label Encoding

________ Coding Scheme

________ Hashing Scheme
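A short pandas sketch of two of the schemes listed above, one-hot encoding and label encoding; the city column is a made-up example.

```python
import pandas as pd

df = pd.DataFrame({"City": ["Hyderabad", "Chennai", "Hyderabad", "Bengaluru"]})

# One-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(df["City"], prefix="City")

# Label encoding: each category mapped to an integer code
df["City_label"] = df["City"].astype("category").cat.codes

print(one_hot)
print(df)
```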

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 38


Type Casting
Converting one type to another, for e.g. converting 'Character' type to 'Factor';
'Integer' type to 'Float'.

Type Casting

Type Conversion Type Coercion Type Juggling

Handling Duplicates
Ensures that we get a ____________ of _________ from all the
various locations.

E.g. A person opens a bank account but his transactions are recorded as John
Travolta in a few, John in a few entries and Travolta in a few; however, all 3 are
the name of the same person. So we merge all these names into one.

Name            Amount Spent        Name            Amount Spent
John Travolta   $ 1,000             John Travolta   $ 3,600
Travolta        $ 800
John            $ 1,800

Merged because all 3 entries belong to the same customer.
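A pandas sketch of type casting and of merging duplicate entries, loosely following the John Travolta example above; the name-mapping dictionary is an illustrative assumption.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John Travolta", "Travolta", "John"],
                   "Amount Spent": ["1000", "800", "1800"]})

# Type casting: the amount column arrives as strings, convert it to integers
df["Amount Spent"] = df["Amount Spent"].astype(int)

# Handling duplicates: map the name variants onto one canonical customer name
df["Name"] = df["Name"].replace({"Travolta": "John Travolta",
                                 "John": "John Travolta"})
merged = df.groupby("Name", as_index=False)["Amount Spent"].sum()

print(merged)   # John Travolta  3600
```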

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 39


String Manipulation
Working with textual data. Various ways of converting unstructured textual
data into structured data are:

Stemming
____________ ____________

Stopword Removal
____________

Zero or Near Zero Variance


_______ & Near-_______________ feature:

Variables which are factors with a single level, or where the majority of the levels are the same. E.g. all the zip code numbers are the same, or the Gender column has all entries listed as female.

We remove variables from our analysis which have _____ or ________ variance in features.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 40


CRISP - DM Exploratory Data Analysis
(EDA)
Elements in EDA

Measure of Central Tendency is also called as "First Moment Business Decision".

MEAN

• Also called as ____________
• Gets influenced by ____________

MEDIAN

• Median is the middle value of the dataset

• Median of a dataset does not get influenced by __________

MODE

• Mode is the value, which repeats ____________ times

• Mode is applied to ____________ data

• If data has ______ mode it is called ______; if the data has ________ it is called ______ data; and more than two modes is called ____________

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 41


Measure of _____________, also called as
Second Moment Business Decision
Variance - How far away is each data point from the mean/average? The units of measurement get squared.

Standard Deviation - The square root of variance. Gets back the original units, which were squared during the variance calculation.

Range - Represents the boundaries of the data spread. Range = Maximum - Minimum.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 42


Measure of Skewness
• Data concentrated to the __________ is ________ Skewed, also called ______________ Skewed

• Data concentrated to the __________ is Right Skewed, also called ____________ Skewed

• Presence of long tails helps in devising interesting business strategies

[Skewed distribution diagrams showing the positions of the Mean and Median]

Measure of Kurtosis

__________ Curve __________ Curve __________ Curve
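All four moment-based business decisions can be computed in a few lines with pandas; the series below is illustrative and the skew/kurtosis values use pandas' default (sample, excess-kurtosis) definitions.

```python
import pandas as pd

s = pd.Series([23, 33, 41, 37, 51, 29, 33, 120])   # 120 drags the mean to the right

print("Mean:", s.mean(), "Median:", s.median(), "Mode:", list(s.mode()))
print("Variance:", round(s.var(), 1), "Std dev:", round(s.std(), 1),
      "Range:", s.max() - s.min())
print("Skewness:", round(s.skew(), 2))   # > 0: right (positively) skewed
print("Kurtosis:", round(s.kurt(), 2))   # excess kurtosis of the distribution
```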

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 43


Graphical Representations
Univariate analysis - Analysis of a single variable
is called Univariate Analysis.

Graphs using which we can visualize single variables are:

1. Bar Plot 8.____________

2. ____________ 9. Time Series Plots

3. ____________ 10. ____________

4. Strip Plot 11. Density Plot

5. ____________ 12. Boxplot or Box & Whisker Plot

6. ____________ 13. ____________ or

_____________
7. Candle Plot

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 44


Graphical Representations
Majorly used plots for Univariate Analysis includes Histogram, Box Plot,

and Q-Q Plot.

Histogram
Histogram is also called as Frequency Distribution Plot.

Primary Purpose: Histogram is used to identify the shape of the distribution.

Summarises the data into discrete bins. Used to identify the shape of the distribution. Identify if the data is unimodal, bimodal, or multimodal.

Secondary Purpose: Histogram is used to identify the presence of Outliers.

[Histogram of a sample: frequency/density on the y-axis, values roughly 30-70 on the x-axis]

Is used to identify presence of Outliers

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 45


Box Plot is also called as Box and Whisker Plot
• Box Plot gives the 5 point summary, namely, Min, Max, Q1 / First Quartile, Q3 /
Third Quartile, Median / Q2 / Second Quartile

• Middle 50% of data is located in the Inter Quartile Range (IQR) = Q3 - Q1

• Formula used to identify outliers is Q1 - 1.5 (IQR) on the lower side and
Q3 + 1.5 (IQR) on the upper side

• Primary Purpose of Boxplot is to identify the existence of outliers

• Secondary Purpose of Boxplot is to identify the shape of distribution

[Box plot: whisker | Q1 | Median | Q3 | whisker]

Q-Q plot is also called Quantile Quantile Plot


• Q-Q plot is used to check whether the data are normally distributed or not.
If data are non-normal then we resort to transformation techniques to
make the data normal

• The line in the Q-Q plot connects from Q1 to Q3

• X-axis contains the standardized values of the


random variable

• Y-axis contains the random values, which are not


Q-Q Plot
standardized

• If the data points fall along the line then data are
considered to be Normally Distributed
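A matplotlib/SciPy sketch that draws the three univariate plots discussed above for a made-up normal sample; scipy.stats.probplot supplies the Q-Q plot against the normal distribution.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=50, scale=8, size=200)     # illustrative sample

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

axes[0].hist(x, bins=20)                      # shape of the distribution
axes[0].set_title("Histogram")

axes[1].boxplot(x, vert=False)                # outliers beyond Q1 - 1.5*IQR / Q3 + 1.5*IQR
axes[1].set_title("Box plot")

stats.probplot(x, dist="norm", plot=axes[2])  # points on the line => roughly normal
axes[2].set_title("Q-Q plot")

plt.tight_layout()
plt.show()
```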

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 46


Bivariate Analysis
Bivariate analysis is analyzing two variables.

Scatter plot is used to check for the correlation between two variables.

The primary purpose of the Scatter Plot is to determine the following:

• Direction - Whether the direction is Positive or Negative or No Correlation

Positive Correlation Negative Correlation No Correlation

• Strength - Whether the strength is Strong or Moderate or Weak

Moderate Correlation Strong Correlation No Correlation

• Check whether the relationship is Linear or Nonlinear

Linear Nonlinear

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 47


The secondary purpose of the Scatter Plot is to determine _____________.

[Scatter plot example: x-axis roughly 130-190, y-axis roughly 30-90]

• Determining strength using a scatter plot is subjective

• Objectively evaluate strength using Correlation Coefficient (r)

• Correlation coefficient value ranges from +1 to -1

• Covariance is also used to track the correlation between 2 variables

• However, Correlation Coefficient normalizes the data in correlation


calculations whereas Covariance does not normalize the data in correlation
calculation

• | r | > 0.85 implies that there is a strong correlation between the variables

• | r | < = 0.4 implies that there is a weak correlation

• | r | > 0.4 & | r | < = 0.85 implies that there is a moderate correlation
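A NumPy sketch computing r for two illustrative variables and applying the thumb rules above; the height/weight numbers are invented.

```python
import numpy as np

height = np.array([150, 155, 160, 165, 170, 175, 180])
weight = np.array([52, 55, 61, 64, 70, 74, 80])

r = np.corrcoef(height, weight)[0, 1]   # Pearson correlation coefficient

if abs(r) > 0.85:
    strength = "strong"
elif abs(r) > 0.4:
    strength = "moderate"
else:
    strength = "weak"

print(round(r, 3), strength)
```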

Multivariate Analysis
The two main plots to perform Multivariate analysis are:

• Pair Plot

• Interaction Plot

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 48


Data Quality Analysis
Focus of this step is to identify the potential
errors, shortcomings, and issues with data.

Name    Age    Date    Date        Date
Steve   23     2001    Jan - 01    1-Jan-01
Jeff    37     2001    Jan - 01    17-Jan-01
Clara   28     2002    Feb - 01    8-Feb-02
Peter   41     2003    Jun - 01    12-Jun-03

Identify _________        Identify _________        Identify different levels of granularity

Name    Age         Salary        Sales     Region
Steve   $ 12,000    23            19,345    North
Jeff    $ 4,500     37            23,424    West
Clara   $ 5,200     28            24,164    East
                                  19,453    South

Validation and Reliability        __________ Data        Wrong metadata information

Wrong information due to data errors (manual / automated) - ______ ____________

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 49


Four errors to be avoided during Data
Collection
1. Random Errors - The measurement device (e.g. a thermometer) is faulty, or the person measuring makes mistakes. Leads to False Positives.

2. Systematic Errors - E.g. social desirability bias (support for Trump under-reported on Twitter); wearable device data comes mostly from wealthy customers.

3. Errors in choosing what to measure - Rather than choosing a person from a top university for a job, maybe we need to look at the social network which guided them through the series of events that resulted in them joining the top school. A high SAT score is not just based on high IQ; it depends on access to good tutors and good study material. Someone might like a subject and hence get a high GPA, but can we guarantee such success in other fields?

4. Errors of exclusion - Not capturing women's data pertaining to cardiovascular diseases; a US election dataset not having data on women candidates of color. A Chief Diversity Officer in big firms is one solution.

Random Errors Systematic Errors

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 50


Data Integration
Data Integration is invoked when there are
multiple datasets to be integrated or merged

Appending
Multiple datasets with the same attributes/columns.

Name Age Salary Name Age Salary


Steve 23 $ 4,000 Wilma 37 $ 7,200
Raj 33 $ 6,500 Audrey 51 $ 9,300
Chen 41 $ 5,900

Name Age Salary


Steve 23 $ 4,000
Raj 33 $ 6,500
Chen 41 $ 5,900
Wilma 37 $ 7,200
Audrey 51 $ 9,300

Merging
Multiple datasets having different attributes using a common attribute.

Name Age Salary Name Designation Location


Steve 23 $ 4,000 Wilma Manager Kuala Lumpur
Raj 33 $ 6,500 Chen V.P NY City
Chen 41 $ 5,900

Name Age Salary Designation Location


Steve 23 $ 4,000 Manager Kuala Lumpur
Raj 33 $ 6,500 NaN NaN
Chen 41 $ 5,900 V.P NY City
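A pandas sketch of the two integration operations shown above: appending with concat and merging on a common key; the designation table is adjusted slightly so that one employee (Raj) comes out with NaN values, as in the merged table.

```python
import pandas as pd

a = pd.DataFrame({"Name": ["Steve", "Raj", "Chen"], "Age": [23, 33, 41],
                  "Salary": [4000, 6500, 5900]})
b = pd.DataFrame({"Name": ["Wilma", "Audrey"], "Age": [37, 51],
                  "Salary": [7200, 9300]})

appended = pd.concat([a, b], ignore_index=True)   # same columns: stack the rows

roles = pd.DataFrame({"Name": ["Steve", "Chen"],
                      "Designation": ["Manager", "V.P"],
                      "Location": ["Kuala Lumpur", "NY City"]})
merged = a.merge(roles, on="Name", how="left")     # Raj gets NaN designation/location

print(appended)
print(merged)
```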

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 51


Feature Engineering
Attribute Generation is also called Feature Extraction or Feature
Engineering. Using the given variables, try to apply domain
knowledge to come up with more meaningful derived variables.

Feature Extraction can be performed on:

• for Temporal Data

- Date Based Features


- Time-Based Features

• for Numeric Data

• for Categorical Data

• on Text Data

• on Image Data

Feature Extraction
1. Deep Learning is performed using Automatic Extraction

2. Shallow Machine Learning is performed using Manual Extraction

3. Feature Extraction is used to get either Derived Features or


Normalized Features

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 52


Feature Selection
Feature Selection or Attribute Selection is
shortlisting a subset of features or attributes.

It is based on:
All Features
• Attribute importance

• Quality
Feature Selection
• _________________

• Assumptions Final Features

• Constraints

Feature Selection Techniques

• Filter Methods • _________________

• Wrapper Methods • Variable Importance Plot

• _________________ • Subset Selection Methods


includes:
• Threshold-Based Methods
• _________________
• Statistical Methods (Lasso Regression, Ridge
Regression)
• Hypothesis Testing
• Forward Stepwise Selection
• _________________
• Backward Stepwise Selection
• Model-Based Selection

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 53


Model Building using Data Mining
Supervised Learning

In the historical data if the ____________ variable ____ is known, then

we apply supervised learning tasks on the historical data. Supervised Learning

is also called ________________________ or Machine Learning.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 54


Supervised Learning has four broad problems to solve:

• Predict a __________ class: __________.


- Example: Does the pathology image show signs of Benign or
Malignant tumour

- Is employee 'X' going to Attrite or Not Attrite

• Predict a ______________ value: ______________

(also sometimes called as __________)


- Example - What will be the stock value tomorrow?

- How many Samsung mobile phones will we sell next month?

• Predict user _____________ from a large pool of options:


Recommendation
- Example - Who will be the best match for getting married on a
matrimonial website?

• Predict RELEVANCE of an entity to a “query”: Retrieval


- Example - Return the most relevant website in a Google search

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 55


Data Mining-Unsupervised
What is Unsupervised learning?

Algorithms that draw conclusions on ______________.

____________ data is data where output variable ‘Y’ is unknown.

Unsupervised Learning algorithms help in ____________ analysis.

A few of the algorithms are:

Clustering
___________

Network Analysis
___________

___________

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 56


Unsupervised Learning - Preliminaries
Distance Calculation:

Distance is either calculated between:

• Two _________
• A _______ and a _______
• Two clusters

Distance Properties:

• Should be non-negative (distance > 0)

• Distance between a record to itself is equal to 0

• Satisfies ____________ (Distance between records 'i' & 'j' is equal to

the distance between records 'j' & 'i')

Standardize or ______________ the variables before calculating the distance if the variables are on different scales or are of different units.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 57


Distance Calculations
Distance Metrics for Continuous Data

• __________ Distance which is calculated using Correlation Matrix

• _________ Distance, is also called as L1 norm

• Euclidean Distance, is also called as L2 norm

d = √(a² + b² + c²)

d = √((xi1 - xj1)² + (xi2 - xj2)² + .... + (xip - xjp)²)

Distance Metrics for Binary Categorical Data

• Binary Euclidean Distance

• Simple Matching Coefficient

• _________ Coefficient

Distance Metrics for Categorical Data (> 2 categories)

• Distance is 0, if both items have same category

• Distance is 1 otherwise

Distance Metrics when both Quantitative Data & Categorical Data


exists in a dataset

• ________ General Dissimilarity Coefficient
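A NumPy sketch of the L1 (Manhattan) and L2 (Euclidean) distances between two illustrative records, with standardization applied first because the two columns are on different scales.

```python
import numpy as np

X = np.array([[170.0, 65.0],    # height (cm), weight (kg)
              [160.0, 80.0],
              [180.0, 72.0]])

# Standardize each column so that scale differences do not dominate the distance
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

a, b = Xs[0], Xs[1]
manhattan = np.sum(np.abs(a - b))           # L1 norm
euclidean = np.sqrt(np.sum((a - b) ** 2))   # L2 norm

print(round(manhattan, 3), round(euclidean, 3))
```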

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 58


Linkages
Linkages - Distance between a record & a cluster, or between two clusters.

1. Single Linkage - This is the least distance between a record and


a cluster, or between two clusters.

• Single Linkage is also called as _________________

• Emphasis is on close records or regions and not on overall structure of

data

• Capable of clustering non-elliptical shaped regions

• Gets influenced greatly by outliers or noisy data

2. Complete Linkage - This is the largest distance (diameter) between a record


and a cluster, or between two clusters.

• Complete Linkage is also called as ____________

• Complete Linkage is also sensitive to outliers

Nearest Neighbour (Single Linkage)        Farthest Neighbour (Complete Linkage)

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 59


Average Centroid

3. Average Linkage - This is the average of all distances between


a record and a cluster, or between two clusters.

• Average Linkage is also called Group Average

• Very expensive because computation takes a lot of time

4. Centroid Linkage - This is the distance between the centroids (centers) of


two clusters or between a record and centroid of a cluster.

• Centroid Linkage is also called Centroid Similarity

5. ______ Criterion - It is the increase in the value of the SSE criterion obtained by merging two clusters into a single cluster.

• This is also called Ward's Minimum Variance and it minimizes the total

within cluster variance

6. G______ A______ A______ C______ (GAAC)

• Two clusters are merged based on cardinality of the clusters and

centroid of clusters

• Cardinality is the number of elements in the cluster

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 60


Clustering / Segmentation

Clustering has two main criteria:

a. Similar records to be grouped together. High ____________ similarity

b. Dissimilar records to be assigned to different groups.


Less ____________ similarity

• ____________ groups will form ____________ groups


after clustering exercise

• Clustering is an ____________ technique

• Separation of clusters can be of two types:


______ (one entry belongs to one cluster)
vs ___________ (one entry belongs to more than one cluster)

When we have a single variable then clustering can be performed by using a


simple boxplot. When we have 2 variables then we can perform scatter diagrams.


Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 61


Clustering / Segmentation
When we have more than 2 variables then there are a lot of other
techniques such as:

Partitioning Based Methods:

• K-Means Clustering

• K-Means ++ Clustering

• ____________ Clustering

• Genetic K-Means Clustering

• K-Medoids Clustering

• K-Medians Clustering

• K-Modes Clustering

• ____________ Clustering

• ____________ Clustering

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 62


K-Means Clustering
K-Means clustering is called Non-Hierarchical Clustering.

We upfront decide the number of clusters using ________________ or


Elbow Curve.

Steps for K-Means


1. Decide the number of clusters 'K' based on the elbow curve of scree plot or
based on the thumb rule _____________. Alternatively, users may
intuitively decide upon the number of clusters

2. Dataset is partitioned into K _____________ with 'K' centers called


centroids. These centroids are randomly chosen as part of __________

3. Each data point of the dataset, which is the closest to one of the centroids will
form a cluster with that closest centroid

4. Centroids are again _____________ with the data points assigned


to each cluster

5. Steps 3 & 4 are repeated iteratively until no ____________ of data points


to other clusters is possible
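A scikit-learn sketch of these steps, with the elbow/scree idea folded in; the synthetic data, the range of K values tried, and the final choice of K = 3 are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 4, 8)])

# Elbow / scree curve: within-cluster sum of squares (inertia) for each K
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))

# Fit the chosen K; init="k-means++" spreads the initial centroids apart
model = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
print(model.cluster_centers_)
print(model.labels_[:10])
```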

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 63


Disadvantages of K-Means Clustering
Random initialization of centroids leads to clustering exercise terminating
at a ______________ (______________ because the
objective is to get ______________ within the sum of squares).

Solution:
Initialize the algorithm multiple times with different initial partitions.

No defined rule for selecting the K-value, while there are thumb rules, these
are not foolproof.

Solution:
Run the algorithm with multiple 'K' values (range) and select the clusters with the least 'within sum of squares' and the highest ___________ 'Sum of Squares'.

Extremely sensitive to the outliers or extreme values.

Solution:
K-medians, _____________________ are a few other variants
which handle outliers very well.

K-Means clustering works for the data which is continuous in nature.

Solution:
Use _________________ for categorical data.

Cannot discover clusters with non-convex shapes.

Solution:
Use _______________ clustering and ____________ K-Means.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 64


K-Means++ Clustering
K-Means++ altogether, addresses the problem of different
initializations leading to different clusters.

Steps:

1. Decide the number of clusters 'K'

2. First centroid is randomly selected

3. Second centroid is selected such that it is at the _______________

4. Step 3 depends on a weighted _______________ score criterion

5. This process continues until all 'K' centroids are obtained

[Scatter plots of the same data clustered two ways: K-Means Clustering vs K-Means++ Clustering]

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 65


K-Medians
K-Medians is very good at handling outliers.

L1 Norm is the distance measure used and is also called Manhattan Distance.

Steps are very similar to K-Means except that instead of calculating Mean we
calculate Median.

1. K-Means cannot handle categorical data

2. Categorical data can be converted into one-hot encoding but will hamper the
quality of the clusters, especially when the dimensions are large

3. K-Modes is the solution and uses modes instead of means and everything
else is similar to K-Means

4. Distance is measured using ________________

5. If the data has a mixture of categorical and numerical data then the
_______________ method can be used

6. K-Means can only handle linearly separable patterns and ___________.


Kernel K-Means clustering works well when the data is in non-convex format
(non-linearly separable patterns)

7. ___________functions are used to take data to high-dimensional space


to make it linear and captures the patterns to form clusters

________________ Functions to be used are:

• ___________ Kernel
• Gaussian Radial Basis Function (RBF)
• Sigmoid ___________
• ______

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 66


K-Medoids
K-Medoids address the problem of K-Means getting influenced by outliers.

Steps:

1. Choose 'K' data points randomly as medoids

2. Instead of taking the centroid of data points of a cluster, medoids are


considered to be the center

3. Find out the distance from each and every data point to the medoid and add
them to get a value. This value is called total cost

4. Select any other point randomly as a representative point (any point other
than medoid points)

5. Find out the distance from each of the points to the new representative point
and add them to get a value. This value is called the total cost of a new
representative point

6. If the total cost of step 3 is greater than the total cost of step 5 then the
representative point at step 4 will become a new medoid and the process
continues

7. If the total cost of step 3 is less than the total cost of step 5 then the
algorithm ends

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 67


Partitioning Around Medoids (PAM)
Partitioning Around Medoids (PAM) is a classic example of ________ Algorithm.

Steps:

1. Randomly points are chosen to be ____________

2. Replace medoids with non-medoids, if the ____________ (Sum of Squared Errors - SSE) of the resulting cluster is improved (reduced)

3. Continue iteratively until the ____________ criteria of step 2 is satisfied

PAM is well suited for small datasets but it fails for large datasets.

CLARA - Clustering Large Applications:


1. In the case of large datasets performing clustering by in-memory
computation is not feasible. The sampling technique is used to avoid this
problem
2. CLARA is a variant of PAM
3. However unlike PAM, the medoids of all the data points aren’t calculated, but
only for a small sample
4. The PAM algorithm is now applied to create optimal medoids for the sample
5. CLARA then performs the entire process for a specified number of samples to reduce bias

CLARANS - Clustering Large Applications based on RANdomised Search:


1. The shortcoming of CLARA is that its result varies based on the sample size
2. CLARANS is akin to double randomization: the algorithm randomly selects 'K', and also randomly selects medoids and a non-medoid object (similar to K-Medoids)
3. CLARANS repeats this randomised process a finite number of times to obtain
optimal solution

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 68


Hierarchical Clustering
Hierarchical clustering is also called Agglomerative technique (bottom-up hierarchy
of clusters) or Divisive technique (top-down hierarchy of clusters).

Agglomerative:

Start by considering each data point as a cluster and keep merging the records
or clusters until we exhaust all records and reach a single big cluster.

Steps:

1. Start with 'n' number of clusters where 'n' is the number of data points

2. Merge two records, or a record and a cluster, or two clusters at each step
based on the distance criteria and linkage functions

Divisive:

• Start by considering that all data points belong to one single cluster and keep
splitting into two groups each time, until we reach a stage where each data
point is a single cluster

• Divisive Clustering is more efficient than Agglomerative Clustering

• Split the clusters with the largest SSE value

• Splitting criterion can be Ward's criterion or Gini-index in case of


categorical data

• Stopping criterion can be used to determine the termination criterion

Number of clusters are decided after running the algorithm and viewing the
Dendrogram. Dendrogram is a set of data points, which appear like a tree of
clusters with multi-level nested partitioning.
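A SciPy sketch of agglomerative clustering with Ward's criterion, cut into three clusters and visualized as a dendrogram; the toy data and the choice of three clusters are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(20, 2)) for c in (0, 3, 6)])

Z = linkage(X, method="ward")                      # bottom-up merges, Ward's minimum variance
labels = fcluster(Z, t=3, criterion="maxclust")    # cut the tree into 3 clusters
print(labels)

dendrogram(Z)                                      # tree of multi-level nested partitions
plt.show()
```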

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 69


Disadvantages of Hierarchical Clustering
Work done previously cannot be un-done and cannot work well

on ________ datasets.

Types of Hierarchical Clustering


1 BIRCH

B___________ I___________ R___________ and

C___________ using H ___________

2 CURE

C___________ U___________ RE___________

3 CHAMELEON

Hierarchical Clustering using Dynamic Modeling. This is a ___________


approach used in clustering ___________ structures

4 P___________ Hierarchical Clustering

5 G___________ Clustering Model

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 70


Density Based Clustering: DBSCAN
• Clustering based on a local cluster criterion

• Can discover clusters of random shapes and can handle outliers

• Density parameters should be provided for stopping condition

DBSCAN - D______B______ S_______ C________ of

A____________ with N____________

Works on the basis of two parameters:

______________ - Maximum radius of the neighbourhood

______________ - Minimum number of points in the Eps-neighbourhood of a point

It works on the principle of

_________________
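A scikit-learn sketch of DBSCAN on a non-convex two-moons dataset; eps and min_samples correspond to the two parameters above, and their values here are illustrative.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)   # eps = neighbourhood radius
print(set(db.labels_))                        # label -1 marks noise/outlier points
```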

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 71


OPTICS
Ordering of Points to Identify Cluster Structure

Works on the principle of varying density of clusters

2 Aspects for Optics

Core Distance Reachability Distance

“Plot the number of clusters for the image if it were subjected to OPTICS clustering.”

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 72


____________ - Based Clustering Methods
Partition the data space into a finite number of cells to form a __________.

Find clusters from the cells in the __________ structure.

Challenges:

Difficult to handle irregular distribution in the data.

Suffers from the curse of dimensionality, i.e., difficult to cluster

high-dimensional data.

STING CLIQUE

Methods:

STING - ST__________ IN________ G_______ approach

CLIQUE - Cl___________ in QUE___________ - This is both

density-based as well as grid-based subspace

clustering algorithm.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 73


Three broad categories of
measurement in clustering

External Internal Relative

___________________

Used to compare the clustering output against subject matter expertise

(ground truth)

Four criteria for ____________ Methods are:

1. Cluster homogeneity - More the purity, better is the cluster formation

2. Cluster completeness - Ground truth of objects and cluster assigned

objects belong to same cluster

3. Rag bag better than alien - Assigning a heterogeneous object (one that is very different from the remaining points of a cluster) to that cluster is penalized more than assigning it to a rag bag/miscellaneous/other category

4. Small cluster preservation - Splitting a large cluster into smaller clusters

is much better than splitting a small cluster into smaller clusters

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 74


Most Common __________ Measures
1. _________-based measures

• Purity

• Maximum Matching

• F-measure (Precision & Recall)

2. _________-based measures

• Entropy of Clustering • Entropy of Partitioning

• Conditional Entropy • Mutual Information

• Normalized Mutual Information (NMI)

3. Pairwise measures

• True Positive • False Negative

• False Positive • True Negative

• _________ Coefficient • __________ Statistic

• Fowlkes - Mallow Measure

4. Correlation measures

• Discretized ________ Statistic

• Normalized Discretized ________ Statistic

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 75


Most Common External Measures

____________________

Goodness of clustering and an example of same is ________ coefficient

Most common internal measures:

1. Beta-CV measure

2. ______________ Cut

3. Modularity

4. Relative measure - ___________ Coefficient

___________________

Compare the results of clustering obtained by different parameter settings of

the same algorithm.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 76


Clustering Assessment Methods
1. ______________ Histogram

2. Distance Distribution

3. ____________ Statistic

Finding K value in clustering


1. __________________ Approach

2. Empirical Method: K = √(n/2)

3. Elbow Method

4. ________-Validation Method

[Elbow plot: total within-cluster sum of squares dropping as the number of clusters K increases from 1 to 9]

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 77


Mathematical Foundations
Basic Matrix Operations:
1. Matrix _________________

2. Matrix Multiplication

3. Matrix ________________

4. _______________ Matrix

5. __________ and Eigenvalues

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 78


Dimension Analysis and _________________ ____________ (PCA, SVD, LDA)

Dimensions are also called as ____________, Variables, ________.

Feature extraction of input variables from hundreds of variables is known as

____________ Reduction.

Fewer dimensions mean easier interpretability and quicker calculations, which also helps in reducing ______ conditions and avoiding ___________.

Another benefit of dimensionality reduction is ____________ the

multivariate data on a ___________.

Out of the many techniques available, in this book

we will discuss the most popular methods:

• PCA - P__________ C___________ A___________

• SVD - S_________ V__________ D_____________

• LDA - L_________ D_____________ A__________

• ____________ Analysis

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 79


PCA - P_________ C____________ A______
PCA is applied on Dense data (data, which does not have many zeros) which is

quantitative in nature.

PCA is used to convert a large number of features into an equal number of

features called P____________ C__________________ (PCs).

These PCs capture 100% information, however, the initial set of PCs alone can

capture maximum information.

PCA helps us reduce the size of the dataset significantly at the expense of minimal information loss.

If the features of the original dataset are all uncorrelated, then applying PCA does not help.

Each PC will capture information contained in all the variables of the original

dataset.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 80


Benefits of PCA
• Reduction of number of ________ & hence faster processing

• Identify the __________ between multiple columns at one go by

interpreting the ________ of PCs

• Visualizing ____________ data using a ___ visualization technique

• Inputs being ________ is called as collinearity and this is a problem,

which is overcome by PCA because it makes the inputs ___________

• Helps in identifying similar columns

The ith principal component is a weighted average of the original measurements / columns:

PCi = ai1x1 + ai2x2 + ai3x3 + ... + ainxn

Weights (aij) are chosen such that:

• PCs are ordered by their _________(PC1 > PC2 > PC3, and so on)

• Pairs of PCs have ___________ = 0

• For each PC, sum of ___________ = 1 (Unit Vector)

Data Normalization / Standardization should be


performed before applying PCA.
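A hedged sketch of PCA with standardization first, as recommended above; the dataset and number of columns are assumptions made only for the example:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = np.random.rand(100, 6)                 # 100 rows, 6 numeric columns (assumed)
X_std = StandardScaler().fit_transform(X)  # normalize/standardize before PCA

pca = PCA()                                # as many PCs as original columns
scores = pca.fit_transform(X_std)          # the PCs - new, uncorrelated columns

print(pca.explained_variance_ratio_)       # ordered: PC1 > PC2 > PC3 ...
print(np.round(np.corrcoef(scores, rowvar=False), 2))  # off-diagonal correlations ~ 0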

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 81


SVD - S___________ V_______ D_____________
S_____ V_____ D________ or SVD - is applied to reduce

__________ data (data, which has a lot of entries as zeros).

SVD is applied on the images to reduce the size of images and helps

immensely in image processing.

SVD is extensively used in ____________________

It is a _____ decomposition

method, represented as:

• diagonal matrix values are known as the singular values of the


original matrix X

• U matrix column values are called the ______________ of X

• V matrix column values are called the _______________ of X

_____________
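A small NumPy sketch of SVD; the matrix here is random and only illustrates the shapes of the factors and a low-rank approximation, it is not an example from the workbook:

import numpy as np

X = np.random.rand(6, 4)                  # stand-in for a sparse matrix or image
U, s, Vt = np.linalg.svd(X, full_matrices=False)

print(U.shape, s.shape, Vt.shape)         # (6, 4) (4,) (4, 4)

# Keep only the top-2 singular values for a compressed approximation of X
k = 2
X_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(X_approx, 2))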

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 82


LDA - L_______ D_______________ A________
L_____ D__________ A_____ (LDA) is used to solve dimensionality

reduction for data with higher attributes.

Linear Discriminant Analysis is a supervised algorithm as it takes the class


label into consideration.

LDA finds a centroid for the data points of each class.

LDA determines a new dimension based on centroids in a way to satisfy two

criteria:

1. _____________ the distance between the centroid of each class.

2. ____________ the variation (which LDA calls scatter and is

represented by s2), within each category.
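A hedged sketch of Linear Discriminant Analysis with scikit-learn; the iris dataset is used only as a stand-in for "data with a class label":

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=2)  # at most (number of classes - 1)
X_lda = lda.fit_transform(X, y)                   # supervised: uses the class label y

print(X_lda.shape)                    # (150, 2) - reduced representation
print(lda.explained_variance_ratio_)  # variation captured by each discriminant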

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 83


Association Rules
Relationship Mining, ____________ Analysis, or ______ Analysis - All

mean the same thing, i.e., how are two entities related to each other, is there

any dependency between them.

The objective of study of association is to find

• What item goes with what?

• Are certain groups of items consistently purchased together?

• What business strategies will you devise with this knowledge?

Association rules are known as probabilistic ‘_____’ statements. Generating


the most ideal statements among all which show true dependencies is done
using the following measures.

Support Confidence Lift

If part of the statement is called as ____________.

Then part of the statement is called ____________.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 84


Association Rules
Support:

Percentage / Number of transactions in which IF / ____________ & THEN / Consequent appear in the data

Support = (# transactions in which A & C appear together) / (_____ # of transactions)

Drawbacks of Support:

1. Generating all possible rules is exponential in the number of distinct items

2. It does not capture the true dependency - How good are these rules

beyond the point that they have high support?

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 85


Association Rules
____________

Percentage of If/Antecedent transactions that also have the Then/Consequent item set.

P(Consequent | Antecedent) = P(C & A) / ____

__________ = __________________ / (# transactions with A)

Drawbacks of Confidence:

• Carries the same drawback as of Support

• It does not capture the true dependency - How good is the dependency

between entities which have high Support?

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 86


Lift Ratio is a measure describing the ratio between dependency and

independency between entities.

Formula: Confidence / ____________

Lift = Confidence / __________________

Note: __________________ assumes independence between antecedent & consequent:

P(C|A) = P(C & A) / P(A) = P(C) × P(A) / P(A) = P(C)

_____________ = (# transactions with consequent item sets) / (# transactions in database)

Threshold - 1:

Lift > 1 indicates a rule that is useful in finding consequent item sets. The rule

above is much better than selecting random transactions
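A from-scratch sketch of Support, Confidence and Lift for a single rule {A} -> {C}; the transactions and item names are invented for illustration only:

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

A, C = "bread", "butter"          # antecedent (IF part) and consequent (THEN part)
n = len(transactions)

support_AC = sum(1 for t in transactions if A in t and C in t) / n
support_A  = sum(1 for t in transactions if A in t) / n
support_C  = sum(1 for t in transactions if C in t) / n

confidence = support_AC / support_A          # P(C | A)
lift = confidence / support_C                # > 1 indicates a useful rule

print(round(support_AC, 2), round(confidence, 2), round(lift, 2))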

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 87


Recommender Systems
Recommender Systems is also called _______________

Data used for the analysis usually has ‘Users’ as rows and ‘Items’ will be

columns. The entries within the dataset can be:

(from retail ecommerce context)

• Whether a user has purchased or not

• Whether user has ____ the product or not

• How many products each user has purchased?

• What is the rating provided by the user?

Sometimes the values, for example ratings columns, are divided by the
______. ______ refers to the number of customers who have
purchased or rated the item. This process is called ______ the ratings.

Generally applied on e-commerce platforms. Customers' purchasing patterns are analysed to design personalized strategies to recommend items which have a high likelihood of getting purchased.

• What is the item most likely to be purchased?

• Can we identify and make suggestions/recommendations upfront?

These are the two broad questions that have to be addressed. Recommendations in turn help to gain the user's confidence and make them loyal to the brand.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 88


Types of Recommendation Strategies

1 ____________ Recommender system

2 __________________ Recommender system

3 Demographic based Recommender system

4 ______ based Recommender system

5 Knowledge based Recommender system

6 ______ Recommender system

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 89


Recommender System
Collaborative filtering is the most popular approach and it is based on

similarity measures between users.

Similarity Measures:

______ Based Similarity:

Cos(A,B) = A•B / |A|*|B|

______ Based Similarity:

Corr(AB) = Covariance (A,B) / Stdev (A) * Stdev (B)

Euclidean distance

____________ distance, etc.
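A small NumPy sketch of the similarity measures listed above, applied to two users' rating vectors; the ratings are made up for illustration:

import numpy as np

a = np.array([5.0, 3.0, 0.0, 4.0])
b = np.array([4.0, 0.0, 0.0, 5.0])

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))   # Cos(A,B) = A·B / |A||B|
pearson = np.corrcoef(a, b)[0, 1]                          # Corr(A,B)
euclidean = np.linalg.norm(a - b)                          # Euclidean distance

print(round(cosine, 3), round(pearson, 3), round(euclidean, 3))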

What to Recommend:

List out and recommend the items that the person is MOST LIKELY to buy
from the list of items that similar customers have already purchased.

Sorting the list of items can be based on:

• How many similar customers purchased it

• Rated by most

• Highest rated, etc.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 90


Recommender System
Disadvantages:

______ learning and ______ expensive

Compute is very expensive - n² similarity calculations

Options to reduce computational burden

1. Discard ______ buyers

2. Discard items that are very popular or very unpopular

3. ___________ can reduce the number of rows

4. PCA (dimension reduction) can reduce the number of columns

5. __________________ customers

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 91


Recommender System
Alternative Approaches

Strategic decision making in terms of

‘Better accuracy and ______’ vs ‘Slightly lower accuracy and

______recommendations’

Search-Based method is a recommendation based on previous purchases.

A variant of Search-Based method is called Item-to-Item Collaborative Filtering.

Rows will be Items and Columns will be Users.

Disadvantage is that most obvious items are always recommended.

Recommendations vs Association Rules

Association Rules Recommendation Engine

______, Common, Generic Strategy Personalized strategy

____________ is important __________ is unimportant

Useful for large physical stores Useful for ____ recommendation

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 92


Recommender System
For New Users:

• Recommend ______ popular items

• Recommend ______ popular items based on demography

• Recommend based on ____________ data

• Make user login using social network, then look at the user's social media activity and recommend accordingly

• Show a few items and ask user to rate them so that, based on the rating, one can be recommended

For New Items:

• Recommend ______ to a few users

• Recommend to the tech-geeks (if it is a gadget)

• Identify the most ______ person in the social media graph and recommend the new item to this influential person

Challenge with Rating Matrix-Based Recommendation:

Rating matrices are huge and sparse (too many empty cells)

______ is used to handle the sparse rating matrix

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 93


Network Analysis
Network Data or ______ Data is a different type of data, which requires

different types of analysis.

Key components of a ______ or Network are

Vertices / ______ and ______ / Links

Network can be represented as Adjacency Matrix. Note: For an undirected

graph, the adjacency matrix is symmetric in nature.

Links / ______ between ____ can be either bidirectional or ________

Applications

Supply Chain Network, Crowd-Funding Network, Airlines Network

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 94


Network Analysis

Node Properties

____________ = Number of direct ties with other nodes

In-Degree = Number of Incoming connections

Out-Degree = Number of Outgoing connections

Degree centrality is a local measure and


hence we should look at other measures.

____________ is how close the node is to other nodes in the network

_________ = 1 / (sum of distances to all other nodes)

When comparison of two networks arises, then normalized ______ should be considered:

Normalized ______ = (Total number of nodes - 1) * Closeness

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 95


Network Analysis

____________ centrality can be measured for a node or an edge

____________ centrality is how often the


node/edge lies on the shortest path between pairs

________ centrality = ∑ (# of shortest paths between a pair that it lies on) / (# of shortest paths between that pair)

When two networks are compared then we use normalized ____________

Normalized ________ = __________________ / (# of all possible pairs excluding the focal node)

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 96


Network Analysis
__________ centrality measures who you are connected to, and not just how many you are connected to

• Connections to high-scoring nodes contribute more to a node's score than connections to low-scoring nodes

• _________ is calculated from ____________ of adjacency matrix

x = (1/λ) A x, i.e., λx = Ax, where A is the adjacency matrix (aij) and x is the vector of node scores.

The x corresponding to the highest Eigenvalue is the vector that consists of the __________ centralities of the nodes.

__________ Centrality is a measure on how likely is a person, who

receives the information, going to diffuse the information further.
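An illustrative sketch of the node measures above using networkx on a tiny toy graph (the edges are invented); it is not code from the workbook:

import networkx as nx

G = nx.Graph()
G.add_edges_from([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")])

print(nx.degree_centrality(G))        # degree, normalized by (n - 1)
print(nx.closeness_centrality(G))     # closeness
print(nx.betweenness_centrality(G))   # betweenness (normalized)
print(nx.eigenvector_centrality(G))   # eigenvector centrality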

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 97


Network Analysis
Edge / Link Properties

Edge or Link properties are defined based on the domain knowledge and there

is no defined rule in defining the same.

Network Related Properties:

• _____ = Number of edges / Number of possible edges

• Shortest Path

• Average Path Length

• Path _____ - the longest shortest path

• Cluster Coefficient - measures the degree to which nodes in a graph tend to cluster together

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 98


Network Analysis

Cluster Coefficient of a node = (# of links that exist among its neighbors) / (# of links that could have existed among its neighbors)

Cluster coefficient of a network is average cluster

coefficient of nodes in the network.

Community Detection Algorithms

____________ Fast-Greedy ____________

Also called as Iterative network divisive algorithm

Steps
- Calculate edge betweenness of each edge
- Remove the edge with the highest betweenness
- Repeat the above two steps until no edges remain

Starts with the __________ and goes on until all

nodes are isolated communities

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 99


Text Mining
Analyzing __________ Text data by generating __________ data in

key-value pair form. Deriving insights from the extracted keywords by

arranging the extracted keywords in a plain space with font sizes varying

based on their frequency is called __________.

Collect the text data / Extract data from sources.

Examples of Sources:

• _____

• Speech transcripts

• _____

• Email to customer service

• Field agents, salespeople

• ________

• Social media outreach

• _______
Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 100


Text Mining
Pre-process the data

• Typos

• Case - uppercase / lowercase / proper case

• Punctuations & special symbols (‘%’, ‘!’, ‘&’, etc.)

• Filler words, connectors, pronouns (‘all’, ‘for’, ‘of’, ‘my’, ‘to’, etc.)

• Numbers

• Extra White spaces

• Custom words

• Stemming

• Lemmatization

• Tokenization - Tokenization refers to the process of splitting a


sentence into its constituent words

Document Term Matrix / Term Document Matrix

Documents arranged in rows and Terms arranged in columns is


called as DTM and transpose of DTM is TDM.
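A sketch of building a Document Term Matrix with scikit-learn's CountVectorizer; the three "documents" below are invented examples:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the service was great",
        "great product great price",
        "poor service and poor support"]

vec = CountVectorizer(stop_words="english")
dtm = vec.fit_transform(docs)        # rows = documents, columns = terms

print(vec.get_feature_names_out())
print(dtm.toarray())                 # the transpose of this array is the TDM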

Word Cloud
__________ - words present in positive dictionary.

__________- words present in negative dictionary.

_____ - two words repeated together - gives better context of the content.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 101


Natural Language Processing (NLP)
Text Analytics is the method of extracting meaningful insights and answering

questions from text data.

N_____ L________ U__________ (NLU)


A process by which an inanimate object (not alive - machines, systems, robots)

with computing power is able to comprehend spoken language.

Example: Humans talk to robot

N_________ L_________ G_________ (NLG)


A process by which an inanimate object (not alive - machines, systems, robots)

with computing power is able to manifest its thoughts in a language that

humans are able to understand.

Example: Robot responds to human queries

POS tags - Parts of Speech Tagging – Process of tagging words within


sentences into their respective PoS and then labelling them.

N_________ E_________ R_________


__________ are usually not present in the dictionaries so we need to treat

them separately. People, place, organizations, quantities, percentages, etc.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 102


Topic Modeling Algorithms
LSA/LSI (Latent Semantic Analysis / Latent Semantic Indexing)
Reducing dimension for classification. LSA assumes that the words will occur in

similar pieces of text if they have similar meaning.

LDA (____________________)
A topic modelling method that generates topics based on words/expression

frequency from documents.

Text Summarization:
Process of producing concise version of text by retaining all the important

information.

(Figure: topic modeling takes documents and the frequency of words within them and generates topics)
Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 103


Machine Learning Primer
Steps based on Training & Testing datasets

1. Get the ______________ / ____________ data needed for

analysis which is the output of data cleansing

2. Split the data into training data & testing data

(Figure: Training and Validation - historical data is split by random sampling into a 70% training dataset and a 30% test dataset)

a. Split the data based on random sampling if the data is balanced

b. Split the data based on other sampling techniques if the

data is imbalanced

(Refer to Step 2 of CRISP-DM to know about imbalance dataset

sampling techniques)

c. _____________________________________
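As an aside, a minimal sketch of step 2 with scikit-learn; X and y are placeholders for the cleansed data, and stratify=y is one of the options for an imbalanced target:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 5)               # assumed feature matrix
y = np.random.randint(0, 2, size=100)    # assumed binary target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)

print(X_train.shape, X_test.shape)       # 70% / 30% split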

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 104


Machine Learning Primer

(Figure: Training and Validation - the 70% training dataset builds the statistical / machine learning model; the 30% test dataset is used for prediction and model validation, giving X% accuracy)

3. ___________________________________

4. Test the model on testing data to get the predicted values

5. Compare the ________________________________ and

____________________ values of testing data to calculate error

or accuracy. (Model evaluation techniques are discussed in subsequent

sections). This will give us Testing Error or Testing Accuracy

6. Also test the built model on training data

7. Compare the training data predicted values and training data actual values

to calculate the error or accuracy. This will give us Training Error or Training

Accuracy

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 105


Machine Learning Primer
8. Training Error and Testing Error

a. If training error and testing error are small and close to each other then

the model is considered to be RIGHT FIT (how low the error values

should be is a subjective evaluation. E.g., In healthcare even 1% error

might be considered high, whereas in a garment manufacturing

process even 8% error might be considered low)

b. If training error is low and testing error is high then the model is

considered to be ________ . _______ is also called ______

c. If training error is high then testing error also will be high. This scenario

is called ___________ or ______

d. If training error is high and testing error is low then something is

seriously wrong with the data or model you built. Redo the entire

project

(Figure: three Y-versus-X fits illustrating __________________, __________________ and __________________)

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 106


Machine Learning Primer
9. ____________ is a common problem and also challenging to solve.

Different Machine Learning algorithms have different regularization

techniques (also called as generalization techniques) to handle

________________

10. __________ problems can be solved easily by increasing the number

of datapoints (observations) and/or features (columns). Also proper feature

engineering and transformation will address this issue

1. High bias, Low variance 2. Low bias, High variance

3. High bias, High variance 4. Low bias, Low variance

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 107


Machine Learning Primer
The challenge of the Training & Testing dataset split, which leads to information leak, is countered by a new school of thought with the idea to split the data into:

• Training Data

• __________ (Development Data)

• __________

Data is split into Training, Validation and Testing sets.

Training:

• Build the model

• Test on Training Data to get Training Error/Accuracy

• Keep validating and keep retraining the model until the desired accuracy is achieved

Validation:

• Test on Validation Data to get Validation Error/Accuracy

• Fine tune the model parameters to get better accuracy

• Pick the best-performing algorithm

• Combine Training & Validation Data to run the model, which has given the desired results based on a set of finalized model parameters. Then test the model on testing data

• Also ensure that the model neither overfits (Variance) nor underfits (Bias)

Testing:

• Test on Testing Data to get Test Error/Accuracy (also called as generalization error)

• Check for model Overfitting or Underfitting

• Run multiple trials and take the average (numerical outcome) or the majority (categorical outcome)

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 108


Model Evaluation Techniques
If the ‘Y’ output variable is __________ then we can use the following list of error
functions to evaluate the model.
Error = Predicted Value - Actual Value (Actual Value is also called as ________ Value)

• ______ (ME): ME = (1/n) ∑ (t = 1 to n) et

• _________ (MAE) or ___________ (MAD): MAD = (1/n) ∑ (t = 1 to n) |et|

• Mean Squared Error (MSE): MSE = (1/n) ∑ (t = 1 to n) et²

• Root Mean Squared Error (RMSE): RMSE = √( (1/n) ∑ (t = 1 to n) et² )

• Mean Percentage Error (MPE): MPE = (1/n) ∑ (t = 1 to n) (et / Yt)

• ______________ (MAPE): MAPE = (1/n) ∑ (t = 1 to n) |et / Yt|

• Mean Absolute Scaled Error (MASE): MASE = MAE / MAE(in-sample, naive)
  MAE(in-sample, naive) is the mean absolute error produced by a naive forecast

• Correlation Coefficient: r = (n∑xy - ∑x ∑y) / √( (n∑x² - (∑x)²)(n∑y² - (∑y)²) )

Example:

Actual Data | Prediction from Model 1 | Error from Model 1 | Prediction from Model 2 | Error from Model 2
100         | 101                     | 1                  | 110                     | 10
200         | 199                     | -1                 | 190                     | -10
300         | 301                     | 1                  | 310                     | 10
400         | 399                     | -1                 | 390                     | -10

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 109
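A from-scratch NumPy sketch of the error functions above, applied to the Model 1 column of the small example table:

import numpy as np

actual = np.array([100, 200, 300, 400])
pred   = np.array([101, 199, 301, 399])
e = pred - actual                        # error = predicted - actual

me   = e.mean()
mae  = np.abs(e).mean()
mse  = (e ** 2).mean()
rmse = np.sqrt(mse)
mpe  = (e / actual).mean()
mape = (np.abs(e) / actual).mean()

print(me, mae, mse, round(rmse, 3), mpe, round(mape, 5))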
Model Evaluation Techniques
If the ‘Y’ is Discrete variable (Classification Models) then we can use the
following list:

Confusion Matrix:
Can be applied for both _______ classification as well as ________
classification models.

Confusion matrix is used to compare predicted values and actual values.

Binary Classification Confusion Matrix:


                               Actual Class
                        Positive               Negative

Predicted    Positive   True Positives (TP)    False Positives (FP)
Class
             Negative   False Negatives (FN)   True Negatives (TN)

True Positive (TP)


• Patient with disease is told that he/she has disease

True Negative (TN)


• Patient with no disease is told that he/she has no disease

False Positive (FP)


• Patient with no disease is told that he/she has disease

False Negative (FN)


• Patient with disease is told that he/she has no disease

Error = 1 - Accuracy

Accuracy = (TP + TN) / (TP + FP + FN + TN)

Accuracy should be greater than the % of the majority class.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 110


Model Evaluation Techniques

Possible Outcomes

Decision Alternatives             | Not Your Wife's Birthday       | Your Wife's Birthday
Did not buy Flowers (No Action)   | Status Quo                     | Wife Angry
Bought Flowers (Action)           | Wife Suspicious, Money Wasted  | Domestic Bliss

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 111


Model Evaluation Techniques
• ______ = TP/(TP+FP) = TP/Predicted Positive = Prob. of correctly

identifying a random patient with disease as having disease

__________ is also called as __________________ (PPV)

Precision = TP / (TP + FP)   (Designers in the formula hide the name precision)

• ______ (Recall or ______ or ______ Rate) = TP/(TP+FN) =

TP/Actual Positive = Proportion of people with disease who are correctly

identified as having disease

Recall = TP / (TP + FN)

• ______ (True negative rate) = TN/(TN+FP) = Proportion of people with no

disease being characterized as not having disease

• ______ (Alpha or type I error) = 1 - Specificity

• FN rate (Beta or type II error) = 1 - Sensitivity

• ____ = 2 * ((Precision * Recall) / (Precision + Recall)); F1: 1 to 0 & defines a

measure that balances precision & recall

F1 Score = (2 × Precision × Recall) / (Precision + Recall)

F1 score is the harmonic mean of precision and recall.

Closer the ‘F1’ value to 1, better the accuracy.
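A sketch of the confusion-matrix based measures with scikit-learn on made-up binary labels (not data from the workbook):

from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)

print(accuracy_score(y_true, y_pred))    # (TP + TN) / total
print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(recall_score(y_true, y_pred))      # TP / (TP + FN), i.e., sensitivity
print(f1_score(y_true, y_pred))          # harmonic mean of precision & recall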

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 112


Confusion Matrix
A Confusion Matrix is also called a Cross Table or ____________. Here is

an example of a multi-class classification problem.

Activity recognition from video

(rows = Actual Class, columns = Predicted Class)

         Bend   Jack   Jump   Run   Skip   Walk
Bend      100      0      0     0      0      0
Jack        0    100      0     0      0      0
Jump        0      0     89     0      0     11
Run         0      0      0    67      0     33
Skip        0      0      0     0    100      0
Walk        0      0     11    33      0    100

The values along the diagonal are right predictions and the values off the

diagonal are incorrect predictions.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 113


ROC Curve

(Figure: ROC curve - True positive rate (sensitivity) from 0% to 100% on the Y-axis versus False positive rate (1 - specificity) from 0% to 100% on the X-axis, showing a perfect classifier, a tested classifier and, along the diagonal, a classifier with no predictive value)

R_____ O________ C_________ Curve was used right from World


War II to distinguish between true signals and false alarms.
The ROC curve has the ‘True Positive Rate (TPR)’ on the Y-axis and ‘False
Positive Rate (FPR)’ on the X-axis.

ROC curve is used to visually depict accuracy.

ROC curve is also used to find the _____ value


(Example: Risk Neutral: should probability be > 0.5 as cut-off value to
categorize a customer under ‘will default’ category; Risk Taking: should the
probability be > 0.8 cut-off to categorize a customer under ‘will default’
category; or Risk Averse: should the probability be > 0.3 cut-off to categorize a
customer under ‘will default’ category)

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 114


ROC Curve
Numerically if one must evaluate the accuracy then AUC
(Area Under the Curve) can be calculated.

Disease Present Disease Absent

Test Positive True Positives False Positives

Test Negative False Negatives True Negatives

0.9 - 1.0 = A (outstanding)

0.8 - 0.9 = B (excellent/good)

0.7 - 0.8 = C (acceptable/fair)

0.6 - 0.7 = D (poor)

0.5 - 0.6 = F (no discrimination)
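A sketch of computing the ROC curve and its AUC with scikit-learn; the predicted probabilities below are invented for illustration:

from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.3]   # model probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)      # points on the ROC curve
print(list(zip(thresholds.round(2), fpr.round(2), tpr.round(2))))

print(roc_auc_score(y_true, y_score))                  # area under the ROC curve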

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 115


K-Nearest Neighbour
KNN also known as:

• On-demand or Lazy Learning

• __________-Based Reasoning

• __________-Based Reasoning

• Instance-Based Learning

• Rote Learning

• __________ Reasoning

KNN works for both the scenarios : Y is ____________ as well as

____________.

KNN is based on calculating distance among the various points. Distance can

be any of the distance measures such as Euclidean distance discussed in

previous sections.

KNN also has an improved version where _____ _______ are assigned to

the neighbors based on their distance from the query point.

In case of continuous output, the final prediction will be the _______ of all
output values and in case of categorical output, the final prediction will be the

_______________ of all the output values.
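A sketch of KNN classification with scikit-learn; weights="distance" corresponds to the improved, distance-weighted version mentioned above, and the iris data is just a toy example:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

knn = KNeighborsClassifier(n_neighbors=5, weights="distance")
knn.fit(X_train, y_train)            # "lazy" learning: simply stores the data

print(knn.score(X_test, y_test))     # accuracy on the test split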

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 116


(Figure: Model Accuracy versus Model Complexity - training set accuracy keeps rising with complexity, while validation set accuracy peaks at the right level of model complexity)

Choosing ‘K’ value is critical because it is used to solve the problem of

bias-variance tradeoff.

• Low ‘K’ value is ____________________

• High ‘K’ value might introduce data points from _______________

Pros (Advantages) and Cons (Disadvantages)

Strengths:

• Does not depend on the underlying data distribution

• Training process is very fast because no model is built

Weaknesses:

• There is no model produced and hence no interesting relationship among output and inputs is learnt

• Memory requirement is large because the training data is kept in memory for distance calculations

• Testing process is slower in comparison to other models

• Categorical inputs require additional processing

• Suffers from the curse of dimensionality

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 117


Naive Bayes Algorithm
Naive Bayes is a machine learning algorithm based on the principle of probability.

The relationship between ______ events is described using Bayes Theorem.

Probability of event A given that event B has occurred is called as

__________Probability.

                     P(Class) * P(Data | Class)
P(Class | Data) = -------------------------------
                              P(Data)

Posterior probability = (Class Prior or Prior Probability × Data Likelihood given class) / (Data Prior or Marginal Likelihood)

Y = Whether the email is spam or not
X = Whether the email contains the word lottery or not
Spam Lottery
Not Spam Lottery
Spam No Lottery
Spam No Lottery
Not Spam No Lottery
Not Spam No Lottery
Not Spam Lottery
Not Spam Lottery
Spam No Lottery
Spam No Lottery

P(Class) = P(Spam) = No. of times spam appears in the data / Total no. of emails = 5/10

P(Data) = P(Lottery) = No. of times lottery appears in the data / Total no. of emails = 4/10

P(Data | Class) = P(Lottery | Spam) = No. of emails having word lottery given that emails are
spam = 1/5. In total there are 5 spam emails and out of which 1 email has the word lottery.
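A from-scratch sketch of Bayes' rule on the spam / lottery example above; the counts follow the table (5 spam emails, 4 emails with "lottery", 1 overlap):

p_spam = 5 / 10                    # P(Class) - prior
p_lottery = 4 / 10                 # P(Data) - marginal likelihood
p_lottery_given_spam = 1 / 5       # P(Data | Class) - likelihood

p_spam_given_lottery = p_spam * p_lottery_given_spam / p_lottery
print(p_spam_given_lottery)        # posterior P(Spam | Lottery) = 0.25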

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 118


Decision Tree
“___________ Decision Tree” “__________ Decision Tree”
When output is Categorical When output is Numerical

Decision trees are


• Nonparametric ________ model, that works on divide & conquer strategy
• Rule-based algorithm that works on the principle of ____________.
A path from root node to leaf node represents a rule
• Tree-like structure in which an internal node represents a test on an
attribute, each branch represents outcome of test and each leaf node
represents the class label

Three type of nodes

_____ Node | Branch Node / Internal Node / Decision Node | Leaf Node / _______ Node

(Figure: a tree with the Root Node at the top, Branch Nodes in the middle and Leaf Nodes at the bottom)

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 119


A Greedy Algorithm
To develop a Decision Tree, consider 2 important questions:

Q1. Which _______________________


Q2. When to ______________________

Start with 100% of the data → Identify an ______ to split → Attributes should be Discrete (if continuous, then ______ _______) → Tree is developed using a statistical measure (________ _______) → Generate the most homogeneous pair of branches

Conditions to Stop:

• All records of the branch are of __________

• No attributes to further split

• No records left

(Example: records with columns Age, CR, Class split into branches, e.g., >40/Fair/Yes, >40/Excellent/No, <=30/Excellent/Yes, <=30/Fair/Yes, 31..40/Fair/Yes, 31..40/Excellent/Yes)

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 120


Information Theory 101
If the event is very _____, then the ________ content in the event is very low.

Examples

• Sun rises in the east - Extremely high probability (p = 1) - Information content lowest ('0' bits)

• Occurrence of earthquake in Kuala Lumpur - Very low probability - High information content

In conclusion “Information Content is proportional to Rarity”

I(event) = log2( 1 / Prob(event) ) = -log2 Prob(event)

Entropy:
• Entropy is the expected information content of all the events
• Entropy value of 0 means the sample is completely homogeneous
• Entropy value of 1 means the sample is completely heterogeneous

H(p = (p1 ... pn)) = ∑ (i = 1 to n) pi log(1/pi) = -∑ (i = 1 to n) pi log(pi)

Purity = Accuracy = 1 - Entropy

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 121


Information Theory 101
In Accuracy we assign the _______________ Label to each region.

Counts per region (Royal Blue, Sky Blue):   (5, 60)     (40, 40)   (10, 60)
Dominant Label:                             Sky Blue    NA         Royal Blue
Accuracy - Sky Blue:                        60/65       40/80      60/70

Entropy is a measure of disorder or impurity (variation/______________)

Decision trees find attributes which return the most homogeneous branches.

Purity can also be measured using the GINI Measure, which is the Expected Accuracy with __________ Labeling.

Dominant - Royal Blue:   5/65     40/80    10/70
Accuracy - Sky Blue:     60/65    40/80    60/70

GINI example: (5/65 × 5/65) + (60/65 × 60/65)   and   (60/70 × 60/70) + (10/70 × 10/70)

After calculating the measure of _____, one must decide on which feature to
split. For this, one must measure the change in __________ resulting from
a split on each possible feature. This calculation is known as __________

Information gain of a feature is the difference between entropy in the segment


before the split (S1) and partitions resulting from the split (S2).

InfoGain (F) = Entropy (S1) - Entropy (S2)
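A from-scratch sketch of entropy and information gain for one binary split; the class counts are invented purely to illustrate the formula:

import numpy as np

def entropy(counts):
    p = np.array(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-(p * np.log2(p)).sum())

# Parent node: 10 positives, 10 negatives; split into two branches of 10 each
parent = entropy([10, 10])                        # 1.0 - completely heterogeneous
left, right = entropy([9, 1]), entropy([1, 9])    # much purer branches

weighted_children = (10 / 20) * left + (10 / 20) * right
info_gain = parent - weighted_children            # Entropy(S1) - Entropy(S2)
print(round(parent, 3), round(weighted_children, 3), round(info_gain, 3))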

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 122


Information Theory 101
The less the variation in class labels after the split, the better the _____

Information gain: Decrease in the _______ (variation) after the dataset is split on an attribute.

Higher homogeneity implies higher information gain.

Entropy before the split = E1; Entropy after the split = E2

Information gain (I1) = E1 - E2, where E1 > E2

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 123


Pros and Cons of Decision Tree
Strengths:

• Uses the important features during decision making

• Interpretation is very simple because no mathematical background is needed

Weaknesses:

• Biased towards factors (features) which have a lot of levels

• Small changes in the data will result in large changes to decision making

Model overfitting can be addressed using __________ techniques.

__________ is the regularization technique used in the Decision Tree.


Pruning is the process of reducing the size of the tree to generalize the
unseen data.

Two ________ techniques are

Pre-_____ or Early Stopping:

• Stopping the tree from growing once the desired condition is met

• Stop the tree from growing once it reaches a certain number of decisions

• Stop the tree from growing if decision nodes contain only a small number of examples

• Disadvantage: When to stop the tree from growing? What if an important pattern was prevented from being learnt?

Post-_____:

• Grows the tree completely and then applies conditions to reduce the tree size

• Example: if the error rate is less than 3%, then reduce the nodes. So, the nodes and branches that have less reduction of errors are removed

• This process of grafting branches is known as subtree raising or subtree replacement

Post _______ is more effective than pre- _______

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 124


Continuous Value Prediction
Scatter Diagram - Visual representation of the relationship between
two continuous variables

Strong Positive Correlation Moderate Positive Correlation No Correlation

Moderate Negative Correlation Strong Negative Correlation Curvilinear relationship

Correlation Analysis - Measures the correlation between two variables

r = +1: Perfect Positive Correlation

r close to +1: Strong Positive Association

r close to -1: Strong Negative Association

r close to 0: Weak or No Association

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 125


Linear Regression
Equation of straight line that we have learnt in our school days

______________________________

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 126


Ordinary Least Squares
(Figure: Fitting a straight line by least squares - an observed value of y when x equals x0, the error term, the fitted line ŷ = β0 + β1x, the mean value of y when x equals x0, the y-intercept β0 and the slope β1; x0 = a specific value of x, the independent variable)

y = β0 + β1x + ε;  ε = error term

____________________ Technique to find the best fit line.

The best fit line is the line which has minimum square deviations from all the data points
to the line.

To improve the accuracy, transformations can be applied, this will ensure that the data has a
linear pattern with minimum spread.

Coefficient of Determination R2 - also known as goodness of fit, is the measure of


predictability of Y (dependent variable) when X’s (independent variable) are given.

It can be interpreted as the % of variability in output (Y) that can be explained with the

_______________ (X)

R² = SSR/SST = SSR / (SSR + SSE)

0 ≤ R² ≤ 1

Where,

SSR = ∑(ŷ - ȳ)² (measure of explained variation)

SSE = ∑(y - ŷ)² (measure of unexplained variation)

SST = SSR + SSE = ∑(y - ȳ)² (measure of the total variation in y)
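A NumPy sketch of ordinary least squares and R²; the x-y data is synthetic, generated around an assumed straight line, not taken from the workbook:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 3.0 * x + rng.normal(0, 1.5, size=x.size)   # assumed true line + noise

b1, b0 = np.polyfit(x, y, 1)            # least-squares slope and intercept
y_hat = b0 + b1 * x

sse = np.sum((y - y_hat) ** 2)          # unexplained variation
ssr = np.sum((y_hat - y.mean()) ** 2)   # explained variation
r2 = ssr / (ssr + sse)
print(round(b0, 2), round(b1, 2), round(r2, 3))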

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 127


Model Assumptions
________________

________________

________________

________________

Problems arise while linear regression model training:

____________ : Errors are dependent on each other

____________ : Errors have non-constant variance

____________ : Independent variable pair are linearly dependent


on each other

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 128


Logistic Regression
Predicts the ____________ of the outcome class.

The algorithm finds the linear relationship between independent


variables and a link function of these probabilities.

The link function that provides the best goodness-of-fit for the
given data is chosen.

(Figure: probability versus predictor - a linear regression line can go below 0, whereas the logistic regression curve stays between 0 and 1)

The output from logistic regression will lie between 0 to 1.

The logistic regression curve is known as ________________ Curve.

Probability values are segregated into binary outcomes using a _________


value. The default cutoff is treated as 0.5 (50%)

• If probability of an event > 0.5; then Event is considered to be True


(predicted outcome = 1)

• If probability of an event </= 0.5; then Event is considered to be False


(predicted outcome = 0)

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 129


Logistic Regression
The logistic regression performed using:

Y = β0 + β1x1 + β2x2 + ... + βkxk + ε

Where,

β0 = the y _______________

βi = the model coefficient for the linear effect of variable i on y

ε = the random error

The probability function:

p = e^Y / (1 + e^Y); where e = 2.7183

The output of the logistic regression will give a sigmoid


curve (also known as S curve)

(Figure: Logistic Regression Curve - Probability on the Y-axis versus Inspection Time on the X-axis)

Interpretation: Probability p indicates that the event has a chance 'p' for a given _______________

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 130


Support Vector Machine
SVMs can be adapted to use with nearly any type of learning task, including

both __________ and _______________.

SVM is inspired from statistical learning theory.

Other names: Large-margin classifier, Max-margin classifier, __________

Two Dimensions Three Dimensions

The task of the SVM algorithm is to identify a line that separates the two

classes in a binary problem. However, in a multidimensional problem a line

cannot separate the classes.

The goal of an SVM is to create a flat boundary called a __________,

which divides the space to create _______________ partitions.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 131


Support Vector Machine
There is more than one choice of dividing line between the
groups of circles and squares.

(Figure: lines a, b and c all separate the circles from the squares; the maximum-margin line is the one defined by the support vectors)

SVM searches for M_________ M_____ H_________ (MMH)

MMH is as far away as possible from outer boundaries (convex hull) of the two

groups of data points.

The maximum margin linear classifier is the linear classifier with the maximum

margin. This is the simplest kind of SVM (Called an LSVM).

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 132


Support Vector Machine
Non-Linear Spaces

(Figure: the Kernel Trick Ф maps points from the Input Space (Latitude versus Longitude) to a Higher Dimension Space (Altitude versus Longitude), where Sunny and Snowy points become separable)

Kernel Tricks
A key feature of SVMs is their ability to map the problem into a higher

dimension space using a process known as the _____ trick. After

the _____ trick has been applied, we look at the data through the

lens of a new dimension and a nonlinear relationship may suddenly

appear to be quite linear.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 133


Support Vector Machine
Kernel Functions
• The linear kernel does not transform the data at all. Therefore, it can be
expressed simply as the dot product of the features:

K(xi, xj) = xi · xj

• The______ kernel results in a SVM model somewhat analogous to a


neural network using a sigmoid activation function. The Greek letters kappa
and delta are used as kernel parameters

K(xi, xj) = tanh(κ xi · xj - δ)

• The polynomial kernel of degree d adds a simple non-linear


transformation of the data

K(xi, xj) = (xi · xj + 1)^d

• The _________ kernel is similar to a RBF neural network. The RBF kernel
performs well on many types of data and is thought to be a reasonable
starting point for many learning tasks

K(xi, xj) = e^( -‖xi - xj‖² / (2σ²) )
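A sketch of fitting SVMs with the kernels listed above using scikit-learn; the two-class dataset is synthetic and the parameter values are illustrative:

from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel, degree=3, gamma="scale").fit(X, y)
    print(kernel, round(clf.score(X, y), 3))   # training accuracy per kernel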

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 134


Deep Learning/Neural Network
Artificial Neural Network is used to mimic Biological Neural Network.

Deep Learning is named in this way because it has many _____


________ to the output.

__________ Models versus __________ Learning Models

__________ Extraction from the data


(Images, Speech, Text, Videos (videos are a subset of images))is
automatically performed using Deep Learning models.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 135


Deep Learning/Neural Network
The multiple layers capture compositionality:

Image Recognition

Each layer captures some features, For Example:

Pixel → Edge → Texton → _____ → Part → Object

Initial layers capture __________ features

Next layers capture __________ features

Final layers capture __________ features

At the end, the classifier will predict the output.

Low-Level feature → Mid-Level feature → High-Level feature → Trainable Classifier → Car

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 136


Deep Learning/Neural Network
Deep Learning Learns Layers of Features

Linear Transformation → Linear Transformation → Linear Transformation → Linear Transformation

Speech data is processed through multiple


layers and __________ is captured.

Sample → _____ → Sound → Phone → Phoneme → Word → ____

Text data is processed through Deep Learning


layers and compositionality is captured.

Text: Character → Word → Word Group → Clause → _____ → Story

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 137


Deep Learning/Neural Network
Shallow Machine Learning Models:

Feature extraction from the data is performed manually.

Vision / Images / Videos: Videos are broken into ___________


and from each __________, the features are extracted.

Techniques used to extract features:

Vision:

• Unsupervised Learning: K-Means, SIFT (Scale Invariant Feature Transform), HOG (Histogram of Oriented Gradients)

• Supervised Learning: KNN (K-Nearest Neighbor)

Speech:

• Unsupervised Learning: Mixture of Gaussians, MFCC (Mel Frequency Cepstral Coefficient)

• Supervised Learning: Hidden Markov Models (HMMs)

Text (example: "Love the way 360DigiTMG delivers training on Data Science & Artificial Intelligence"):

• Unsupervised Learning: Bag of Words (BoW), n-grams, Term Document Matrix (TDM)

• Supervised Learning: Naive Bayes
Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 138


Perceptron Algorithm:
Artificial Intelligence is about trying to mimic a human brain.

An Artificial Neural Network (ANN) models the relationship between a set of


_____ signals and an _____ signal using a model derived from our
understanding of how a biological brain responds to stimuli from sensory
inputs. Just as a brain uses a network of interconnected cells called Neurons
to create a massive parallel processor, ANN uses a network of artificial neurons
or nodes to solve learning problems.

(Figure: a biological neuron - dendrites, nucleus, soma, axon, myelin, Schwann cell, node of Ranvier and axon terminal)

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 139


Deep Learning/Neural Network
Simple Neural Network components:

1. Input layer - contains the numbers of _____ equal to the


number of input features

2. Input layer also has one additional neuron called _____, which is
equivalent to the ‘b’ (y-intercept) in the equation of the line y = b + mx

3. ‘b’, ‘w1’, ‘w2’, ‘w3’,....... are called as weights and are ________ initialized

4. These neurons are also called as nodes and are connected via an edge to
the neuron in the next layer

5. __________function (usually summation) is used to __________


all the inputs and corresponding weights
f(x) = b + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 -> This equation will give a
numerical output

6. The output of the integration function is passed on to the


_______________ component of the neuron

(Figure: an artificial neuron - inputs x0 = 1, x1, x2, ..., xn with weights b, w1, w2, ..., wn (the "dendrites") feed a Summation in the cell body or soma, followed by an Activation that produces y = f(wx + b) on the "axon")

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 140


Deep Learning/Neural Network
(Figure: inputs x0 = 1, x1, ..., xn with weights w0 = b, w1, ..., wn are combined by the net input function net = ∑ (i = 0 to n) wi xi, passed through the activation function o = σ(net) = 1 / (1 + e^-net) to give the output; the error δ = a - y drives the weight update)

7. Based on the functioning of ________ function, the final output is


predicted

8. Predicted output and actual output are compared to calculate the _____
function / _____ function (error calculated for each record is called as
_____ function and combination of all these individual errors is called as
cost function)

9. Based on this error, the __________ algorithm is used to go back in the


network to update the weights

10. Weights are updated with the objective of minimizing the error and this
minimization of error is achieved using _______________ Algorithm

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 141


Deep Learning/Neural Network
Perceptron Algorithm

The Perceptron algorithm was proposed back in 1958 by Frank Rosenblatt


(Cornell Aeronautical Laboratory).

Neural network with no hidden layers and a single output neuron is called a
__________ Algorithm.

The __________ algorithm can only handle _____ boundaries. _____


boundaries are handled using the Multi-Layered __________ algorithm.

Weight updation as part of _______________ algorithm is done using


the following formula:

Randomly initialize weight vector w0

Repeat until error is less than a threshold γ or max_iterations M:

    For each training example (xi, ti):

        Predict output yi using current network weights wn

        Update weight vector as follows:

        wn+1 = wn + η * (ti - yi) * xi     (η = learning rate; (ti - yi) = error)
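A from-scratch NumPy sketch of the perceptron update rule above; the tiny dataset (a logical AND gate), the step threshold and the learning rate are all illustrative choices:

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0, 0, 0, 1], dtype=float)        # targets: AND gate
X = np.hstack([np.ones((4, 1)), X])            # prepend x0 = 1 for the bias weight

w = np.zeros(3)                                # w0 (bias), w1, w2
eta = 0.1                                      # learning rate

for epoch in range(20):
    for xi, ti in zip(X, t):
        yi = 1.0 if xi @ w > 0 else 0.0        # step activation
        w = w + eta * (ti - yi) * xi           # w_new = w_old + eta * error * x

print(w, [(1.0 if xi @ w > 0 else 0.0) for xi in X])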

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 142


Deep Learning/Neural Network
__________ are updated so that the error is minimized.

Learning Rate is also called Eta value and ranges from 0 to 1.

A value close to 0 would mean _____ steps to arrive at the bottom of the error surface.

A value close to 1 would mean __________ the bottom of the error surface.

Constant learning rate creates a


problem of bouncing around the bowl.
The gradient will never reach the
bottom of the error surface.

This problem is solved using


Changing Learning Rate
(_______________).

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 143


Deep Learning/Neural Network
____________ - Learning rate is reduced epoch after epoch, until it reaches the end of a defined number of epochs.

____________ - Learning rate is the same for a fixed number of epochs and then it starts reducing after every epoch until the defined number of epochs reaches the end.

____________ - Learning rate is reduced after a set of fixed number of epochs (e.g., learning rate will be reduced by 10% after every 5 epochs).

_________________________ - Learning rate is reduced when it is observed that the error stops reducing.

(Figures: Learning Rate (0.0 to 0.2) versus Epochs (0 to 20) for each schedule)

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 144


Deep Learning/Neural Network
Gradient Primer:

Rate of change

Gradient is also called as ________

Slope

Curves/Surfaces should be
continuous and smooth
(_____ /sharp points)

Curves / Surfaces should


be __________

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 145


Deep Learning/Neural Network
Gradient Descent Algorithms Variants:

A few definitions:

Iteration: Equivalent to when a weight update is done

Epoch: When entire training set is used once to update the weights

Batch Gradient Descent (example: 1 epoch, 10000 training records):

• Iterations per epoch: 1

• Weights are updated once, after all 10000 training records are passed through the network

Stochastic Gradient Descent (example: 1 epoch, 10000 training records):

• Iterations per epoch: 10000

• Weights are updated after each training sample passes through the network. If we have 10000 training samples, then weights are updated 10000 times

Mini-batch Stochastic Gradient Descent (example: 1 epoch, 10000 training records, minibatch size 100):

• Iterations per epoch: 10000/100 = 100

• Weights are updated after every minibatch (100 records in this case) is passed through the network. Records within a minibatch are randomly chosen

Other advanced variants of Mini-Batch SGD: Momentum, Nesterov Momentum, Adagrad, Adadelta, RMSprop, Adam

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 146


Deep Learning/Neural Network
Empirically Determined components are:

1. Number of hidden layers

2. Number of _____ within each hidden layer

3. _______________

4. Error/Cost/Loss Functions

5. _______________ Methods

Y (output)               | No. of neurons in output layer         | Activation Function in output layer | Loss Function
Continuous               | 1                                      | Linear / Identity                   | ME, MAE, MSE, etc.
Discrete (2 categories)  | 1 for a binary classification problem  | Sigmoid / Tanh                      | Binary Cross Entropy
Discrete (>2 categories) | 10 if we have a 10 class problem       | Sigmoid                             | Categorical Cross Entropy

Note: Hidden layers can have any activation function and majorly _________
activation functions seem to be giving good results.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 147


Multi-Layered Perceptron (MLP) /
Artificial Neural Network (ANN)
Non-Linear patterns can be handled in two ways:
Changing _______________ Function:

Quadratic Function: f = ∑ (j = 1 to m) wj xj² - θ

Spherical Function: f = ∑ (j = 1 to m) (xj - wj)² - θ

(Figure: a network with an Input Layer x1 ... xn, a Hidden Layer, and an Output Layer y1 ... yn)

The presence of hidden layers alone will not capture the ________ pattern. The activation function to be used should be non-linear. Usage of linear or identity activation functions within the neurons of the hidden layer will only capture linear patterns.

If no activation functions are specified in the layers, then by default the network assumes ________________.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 148


Multi-Layered Perceptron
List of activation functions include:

• Identity function (Linear function): a(f) = a

• ___ function

• Ramp function

• Sigmoid function

• Tanh function

• ReLU (Rectified Linear Unit) function

• _____ ReLU

• ELU (Exponential Linear Unit)

• Maxout

• _____ - the output neurons give Probability of A, Probability of B, Probability of C

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 149


Multi-Layered Perceptron
Regularization Techniques used for Overfitting

L1 regularization / L1 _____ decay term

L2 regularization / L2 _____ decay term

Weight Decay Term:

J(θ) = (1/m) ∑ (i = 1 to m) ½ (hθ(x(i)) - y(i))²  +  (λ/2) ∑ (l = 1 to nl-1) ∑ (i = 1 to sl) ∑ (j = 1 to sl+1) (wji(l))²

___ stopping:

(Figure: Accuracy versus Epoch - training set accuracy keeps increasing while test set accuracy starts to fall where overfitting begins; early stopping halts training at that point)

Error-change criterion

1. Stop when error isn't dropping over a window of, say, 10 epochs

2. Train for a fixed number of _____ after criterion is reached (possibly with
lower learning rate)

Weight-change criterion

1. Compare _____ at epochs t-10 & t and test:  max over i of | wi(t) - wi(t-10) | < ρ

2. Possibly express as a _______ of the weight

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 150


Multi-Layered Perceptron
Dropout

It is an interesting way to perform model averaging in Deep Learning

Training Phase: For each hidden layer, for each training sample, for each
iteration, ignore (zero out) a __________, p, of nodes (and corresponding
activations).

Test Phase: Use all __________, but reduce them by a factor p (to account
for the missing activations during training).

Randomly select a subset of _____ and force their output to ________.

(Figure: a standard neural net before and after applying dropout - inputs v1 ... v4 feed nodes r1 ... r3, with some activations forced to zero)

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 151


Multi-Layered Perceptron
Drop Connect:

Very similar to _____, however, we disable the weights instead of the nodes.

Here the nodes are partially active.

(Figure: drop connect - individual weights between v1 ... v4 and r1 ... r3 are disabled instead of whole nodes)

Noise:

Data Noise:

• Add noise to data while _____

Label Noise:

• Disturb each training sample with the probability

• For each disturbed sample, the label is randomly drawn from a uniform distribution regardless of the true label

Gradient Noise:

• Add _____ to the gradient

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 152


Multi-Layered Perceptron
Batch Normalization:
Input: Values of x over a mini-batch: B = {x1...m};
Parameters to be learned: γ, β
Output: {yi = BNγ,β(xi)}

μB ← (1/m) ∑ (i = 1 to m) xi                     // mini-batch _____
σB² ← (1/m) ∑ (i = 1 to m) (xi - μB)²            // mini-batch _____
x̂i ← (xi - μB) / √(σB² + ε)                      // normalize
yi ← γ x̂i + β = BNγ,β(xi)                        // __________
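A from-scratch sketch of the batch-normalization forward pass above for one mini-batch of activations; the gamma, beta and epsilon values are illustrative, not learned here:

import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0])     # activations for one unit over a mini-batch
gamma, beta, eps = 1.0, 0.0, 1e-5      # scale/shift (fixed here) and epsilon

mu = x.mean()                          # mini-batch mean
var = x.var()                          # mini-batch variance
x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
y = gamma * x_hat + beta               # scale and shift

print(np.round(x_hat, 3), np.round(y, 3))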

• Batch Normalization layer is usually inserted before __________ layer


(after Fully Connected or Dense Layer)

• Reduces the strong dependence on weight initialization

Shuffling inputs:

• Choose examples with maximum __________ content

• Shuffle the training set so that successive __________ examples never


(rarely) belong to the same class

• Present input examples that produce a large error more frequently than
examples that produce a small error. Why? It helps to take large steps in the
Gradient descent

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 153


Multi-Layered Perceptron
Weight Initialization Techniques:

_____ initialization:  uniform( -√(6 / (fan_in + fan_out)),  √(6 / (fan_in + fan_out)) )

Caffe implements a simpler version of Xavier's initialization:
uniform( -√(2 / (fan_in + fan_out)),  √(2 / (fan_in + fan_out)) )

_____ initialization:  uniform( -√(4 / (fan_in + fan_out)),  √(4 / (fan_in + fan_out)) )

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 154


Forecasting
Time Series vs _______________ Data

Time Series Data:

Data that is collected over equal spaced time intervals and the time interval is

also an essential part of the data.

____________ Data:

Data that can be collected at a single point of time.

Forecasting is the use of various modeling techniques to predict a future

outcome on the basis of historical time series data.

(Figure: Real Estate Price versus time, from 2001 through 2018 to 2035)

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 155


Forecasting
EDA - Components of Time Series

EDA: _______ and Visual

EDA in time series is mostly visual.

Elements of visualization in time series:

• Time plot

• Lag Scatter Plot

• _____ Plot

• Stacked Area Chart

• ______ Chart

• Heat Map (for multi-variable time series data)

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 156


Forecasting
Data Partition

Time series should be split in __________ order

Most Recent period data will be chosen as Validation data.

Training Data: Fit the model only to the Training period

Validation Data: Assess performance on the Validation period

(Timeline: Training → Validation → Future)

Conditions to choose the validation period:

• Forecast Horizon

• ____________

• Length of Time series

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 157


Forecasting
Forecasting Model

There are predominantly 2 approaches to forecasting

Model-Driven (_______ similar to future):

1. Linear Regression

2. Autoregressive models

3. ARIMA

4. ____________

5. ____________

Data-Driven (_______ similar to future):

1. Naïve forecasts

2. _________

3. Neural nets

(Figures: Sales versus Months - Sales Forecast; Sales versus Months - Centered MA(4))

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 158


Forecasting
Smoothing Techniques

Moving Average:

• ______ Moving Average

• ______ Moving Average

Exponential Smoothing:

• Simple Exponential Smoothing

• Holt's Method / Double Exponential Smoothing

• ____________ Method

MA (Moving Average) versus ES (Exponential Smoothing):

• MA assigns equal weights to all past observations; ES assigns ___________ to recent observations than to past observations

• MA is better to forecast when data & environment are not _________; ES is better when data & environment are _________

• For MA, the window width is key to success; for ES, the smoothing constant (α, β, γ) value is key to success (0 < α, β, γ ≤ 1)
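A pandas sketch of a simple moving average and simple exponential smoothing; the monthly sales figures and the window/alpha choices are invented for illustration:

import pandas as pd

sales = pd.Series([112, 118, 132, 129, 121, 135, 148, 148, 136, 119],
                  name="sales")

ma4 = sales.rolling(window=4).mean()             # window width is the key choice
ses = sales.ewm(alpha=0.3, adjust=False).mean()  # smoothing constant alpha

print(pd.concat([sales, ma4.rename("MA4"), ses.rename("SES")], axis=1))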

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 159


Forecasting
De-Trending and De-Seasoning

Regression:

• To remove ______ and/or ______, fit a regression model with trend and/or seasonality

• Series of forecast errors should be de-trended & deseasonalized

Differencing:

• Simple & popular for removing trend and/or seasonality from a time series

• Lag-1 difference: Yt - Yt-1 (for removing ______); Lag-M difference: Yt - Yt-M (for removing ______)

• Double differencing: difference the differenced series

Moving Average:

• Uses moving average to remove ______

• Generates seasonal indexes as a byproduct

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 160


Accreditation to international certification bodies

For further details, call us at

1800-212-654321

[email protected] 360digitmg.com

2-56/2/19, 3rd Floor, Vijaya Towers, Ayyappa Society Road, Madhapur, Hyderabad, Telangana 500008

USA | INDIA | MALAYSIA | ROMANIA | SOUTH AFRICA | DUBAI | BAHRAIN
