360DigiTMG Practical Data Science Workbook

The document discusses various concepts related to artificial intelligence, including data science, data mining, machine learning, deep learning, and reinforcement learning, and provides definitions and examples of these concepts.


INDEX

Ingredients of AI
- Artificial Intelligence
- Data Science
- Data Mining
- Machine Learning
- Deep Learning
- Reinforcement Learning

Stages of Analytics
CRISP - DM
- CRISP - DM Business Understanding
- CRISP - DM Data Collection
• Data Types
• Different Scales of Measurement
• Data Understanding
• Qualitative vs Quantitative
• Structured vs Unstructured
• Big Data vs Non-Big Data
• Cross Sectional vs Time Series vs Longitudinal Data
• Balanced vs Unbalanced
• Data Collection Sources
- Primary Data
- Secondary Data
• Preliminaries for Data Analysis
• Probability
• Base Equation
• Random Variables
• Probability Distributions
• Sampling Techniques
• Inferential Statistics
• Non-Probability Sampling
• Probability Sampling
• Sampling Funnel
- CRISP - DM Data Cleansing / Data Preparation
• Outlier Treatment
• Winsorization
• Alpha Trimmed
• Missing Values
• Imputation
• Transformation
• Normalization/Standardization
• Dummy Variables
• Type Casting
• Handling Duplicates
• String Manipulation

- CRISP - DM Exploratory Data Analysis


• Measure of Central Tendency
• Measure of Dispersion
• Measure of Skewness
• Measure of Kurtosis
• Graphical Representations
- Histogram
- Box Plot
- Q-Q Plot
- Bivariate Analysis
- Scatter Plot
- Correlation Coefficient
• Multivariate Analysis
• Data Quality Analysis
• Four Errors to Be Avoided During Data Collection
• Data Integration
• Feature Engineering
• Feature Extraction
• Feature Selection
- CRISP - DM Model Building Using Data Mining
• Supervised Learning
• Supervised Learning has Four Broad Problems to Solve:
- Predict A Categorical Class: Classification
- Predict A Numerical Value: Prediction
- Predict User Preference from a Large Pool of Options: Recommendation
- Predict Relevance of an Entity to a "Query": Retrieval
• Data Mining Unsupervised
- A Few of the Algorithms are:
- Clustering
- Dimension Reduction
- Network Analysis
- Association Rules
- Online Recommendation Systems

• Unsupervised Preliminaries
- Distance Calculation
- Linkages
• Clustering / Segmentation
- K-Means Clustering
- Disadvantages of K-Means
- K-Means++ Clustering
- K-Medians Clustering
- K-Medoids
- Partitioning Around Medoids (PAM)
- CLARA
• Hierarchical Clustering
- Disadvantages of Hierarchical Clustering
• Density Based Clustering: DBSCAN
• OPTICS
• Grid-Based Clustering Methods
• Three Broad Categories of Measurement in Clustering
• Most Common Measures
• Clustering Assessment Methods
• Finding K Value
• Mathematical Foundations
• Dimension Reduction
- PCA
- SVD
- LDA
• Association Rules
- Support
- Confidence
- Lift
• Recommender Systems
- Types of Recommendation Strategies
- Collaborative Filtering
- Similarity Measures
- Disadvantages
- Alternative Approaches
- Recommendations vs Association Rules
- New Users and New Items
• Network Analysis
- Applications
- Degree Centrality
- Closeness Centrality
- Betweenness Centrality
- Eigenvector Centrality
- Edge / Link Properties
- Cluster Coefficient
• Text Mining
- Examples of Sources
- Pre-Process the Data
- Document Term Matrix / Term Document Matrix
- Word Cloud
- Natural Language Processing (NLP)
- Natural Language Understanding (NLU)
- Natural Language Generation (NLG)
- Parts of Speech Tagging (Pos)
- Named Entity Recognition (NER)
- Topic Modelling
- LSA / LSI
- LDA
- Text Summarization
• Data Mining Supervised Learning
• Machine Learning Primer
- Key Challenges
• Model Evaluation Techniques
- Errors
- Confusion Matrix
- Cross Table
- ROC Curve
• K-Nearest Neighbor
- Choosing K Value
- Pros and Cons
• Naive Bayes Algorithm
• Decision Tree
- Three Types of Nodes
- Greedy Algorithm
- Information Theory 101
- Entropy
- Pros and Cons of Decision Tree
• Scatter Diagram
• Correlation Analysis
• Linear Regression
- Ordinary Least Squares
- Model Assumptions
• Logistic Regression
• Support Vector Machine
- Hyperplane
- Non-Linear Spaces
- Kernel Tricks
- Kernel Functions
• Deep Learning Primer
- Image Recognition
- Speech Data
- Text Data
- Shallow Machine Learning Models
• Perceptron Algorithm
- Biological Neuron
- Simple Neural Network Components
- Perceptron Algorithm
- Learning Rate
- Gradient Primer
- Gradient Descent Algorithms Variants
- Empirically Determined Components
• Multi-Layers Perceptron (MLP) / Artificial Neural Network (ANN)
- Non-Linear Patterns
- Integration Function
- Activation Function
- Regularization Techniques Used for Overfitting
- Error-Change Criterion
- Weight-Change Criterion
- Dropout
- Drop Connect
- Noise
- Batch Normalization
- Shuffling Inputs
- Weight Initialization Techniques
• Forecasting
- Time Series vs Cross Sectional Data
- EDA - Components of Time Series
• Systematic Part
• Level
• Trend
• Seasonality
- Non-Systematic Part
• Noise/Random
- Data Partition
- Forecast Model
- Model-Driven Techniques
- Data-Driven Techniques
- Smoothing Techniques
- Moving Average
- Exponential Smoothing
- De-Trending and De-Seasoning
- Regression
- Differencing
- Moving Average
Ingredients of AI

Artificial Intelligence

Data Science

Data Mining

Machine Learning

Deep Learning

Reinforcement
Learning

Definition of Artificial Intelligence,


Data Science, Data Mining, Machine
Learning, Deep Learning,
Reinforcement Learning (RL)

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 1


Artificial Intelligence

Artificial Intelligence

Data Science

Data Mining

Machine Learning

Deep Learning

Reinforcement
Learning

Ability of inanimate objects such as Machines, Robots,

Systems, etc., with computing capabilities to perform

____________ tasks that are similar to Humans.

Examples of AI
· ________ (Video Analytics & Image Processing)

· Hearing (Speech to Text Applications)

· Response to Stimuli (Inputs)

· __________

[Diagram: inputs X1, X2, X3 mapped through y = f(x) to output Y]

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 2


Data Science

Artificial Intelligence

Data Science

Data Mining

Machine Learning

Deep Learning

Reinforcement
Learning

Data Science is a field of study related to data, to bring out

meaningful ______ insights for effective ______ making.

Topics of Data Science include

1. ______ Analysis 6. Black Box Techniques

2. Hypothesis Testing 7. ______ Mining

3. Data ______ 8. Natural Language Processing

4. Regression Analysis 9. ______ Analysis, etc.

5. Classification Techniques

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 3


Data Mining

Artificial Intelligence

Data Science

Data Mining

Machine Learning

Deep Learning

Reinforcement
Learning

Data Mining is similar to coal mining where we get coal and

if lucky one might get precious stones such as diamond.

In Data Mining we get ______ from _____ and insights

similar to diamond are extremely valuable for ______.

Data Mining (Branches)

_______ Learning    ________ Learning    Active Learning    Structured Prediction

Unsupervised Learning    Semi-Supervised Learning    _______ Learning

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 4


Machine Learning
Machine Learning is learning from the ________ of the

historical / past data and then using it on the ________

unseen data to achieve a defined objective.

1. ______ Learning / _______ Learning - Both _______ and _______ are known in the historical data

2. ________ Learning / ________ Learning - Only ________ are known in the historical data & ________ is not known or assumed as not known

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 5


Deep Learning

Artificial Intelligence

Data Science

Data Mining

Machine Learning

Deep Learning

Reinforcement
Learning

Deep Learning is a special branch of Machine

Learning where the ___________ in data are

_________ extracted.

Some of the Deep Learning Architecture


• _________________ / • Gated Recurrent Units (GRUs)

Multi-Layered Perceptron • Mask R-CNN

• ___________ Neural Network • Autoencoders

• _________ Neural Network • Generative Adversarial Network (GAN)

• Deep Belief Network • Boltzmann Machine

• Long Short Term Memory (LSTM) • Deep Q-Networks

• Q Learning etc.
Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 6
Reinforcement Learning
Reinforcement Learning is a special branch of ___________ Learning, which is heavily used in applications including games, robotics, investment banking, trading, etc.

Reinforcement Learning is a ___________ based learning, which solves

sequential decision problems by __________ with the environment.

The 5 key elements of Reinforcement Learning

________ - A learning component that makes decisions on actions to maximize the reward.

Environment - The physical world where agents perform actions.

_____ - Defines the behavior of the agent from states to actions.

______ ___________ - Defines the problem and maps it to a numerical reward.

______ __________ - Defines the cumulative future reward.

Model of the Environment - An optional component which predicts the behavior of the environment.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 7


Stages of Analytics

Highest Level of Analytics

4. Prescriptive Analytics

3. Predictive Analytics

2. Diagnostic Analytics

1. Descriptive Analytics

__________________ - Answers questions on what happened in the

past and present.

Example: Number of Covid-19 cases to date across various countries

Diagnostic Analytics - Answers questions on ________________.

Example: Why are the Covid-19 cases increasing?

___________ - Answers questions on ____________________.

Example: What will be the number of Covid-19 cases for the next month?

_________________ - Provides remedies and solutions for what might

happen in the future.

Example: What should be done to avoid the spread of Covid-19 cases, which

might increase in the next one month?

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 8


CRISP - DM
C____ I_______ S______ P_____
for Data Mining

[CRISP-DM cycle]

1. ___________
2. _________
3. Data Cleansing / Preparation and Exploratory Data Analysis
4. Data Mining - Model Development
5. __________
6. __________
Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 9


CRISP - DM Business Understanding
Articulate the business problem by understanding the
client/customer requirements

____________ Objectives        Business ____________

A few examples on Business Objective and Business Constraints

Business Problem : Significant proportion of customers who take loan are


unable to repay
Business Objective : ____________ Loan Defaulters
Business Constraint : Maximize Profits

Business Problem : Significant proportion of customers are complaining that


they did not do the credit card transaction
Business Objective : Minimize Fraud
Business Constraint : _____________ Convenience

Key points to remember:


Ensure that objectives and constraints are SMART

- S____________
- Measurable
SMART - A____________
- R____________
- Time-Bound
Key Deliverable: ________________

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 10


CRISP - DM Data Collection
Understanding various __________ is pivotal to proceed further with data

collection.

Data Types
_____________________: Any data which can be represented in a _________ and makes sense.

_____________________: Data which, when represented in decimal format, does not make sense.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 11


CRISP - DM Data Collection
_______________________________________________

_______________________________________________

_______________________________________________

_______________________________________________

Count Data examples


_______________________________________________

_______________________________________________

_______________________________________________

_______________________________________________

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 12


Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 13
Data Understanding

_____________________ VS _____________________

_______________ data is non-numerical data.
Examples:
1. This weighs heavy
2. That kitten is small

Quantitative data includes numbers.
Examples:
1. Weight 85 kg
2. Height 164.3 cm

____________ Data and _______ Data fall under Quantitative Data.

Qualitative / Categorical
- Nominal: Binary, Multiple
- Ordinal

Quantitative
- Continuous
- Discrete (Count)

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 14


____________________ vs Unstructured
Structured data is data which in its raw state can be placed in a _______ format. For example, video is split into images, images into ________, and each pixel intensity value becomes an entry in a column; this makes the data structured.

Unstructured data is data which in its raw state cannot be placed in any _______ format. Videos, Images, Audio/Speech, and Textual Data are examples of Unstructured data.

Audio Files / Speech data can be converted into features using _________

Frequency ____________ Coefficient (MFCC).


Textual data can be converted into _____ of ______ (BoW) as an example

to make it Structured.

Example: But I, being poor, have only my dreams; I have spread my dreams

under your feet; Tread softly because you tread on my dreams.

Poor Dream Spread Feet Tread Soft

1 3 1 1 1 1

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 15


Data Understanding

_____________________ VS _____________________

Data which is governed by the 5 Vs:

__________   __________   __________   __________   __________

High Volume, generating at rapid Velocity, from a wide Variety, with an element of uncertainty (__________), and appropriate Value.

________ is that which cannot be stored in the available hardware and

cannot be processed using available software.

____________ is that data which can be stored in the available hardware

and can be processed using available software.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 16


Cross-Sectional vs ___________ vs ___________

1. Cross-sectional data is that data, where date, time, and


sequence in which we arrange the data is immaterial

2. Cross-sectional data usually contains more than one variable

Examples:
1. Population survey of demographics
2. Profit & Loss statements of various companies

1 _________ data is that data, where the date, time, and


sequence in which we arrange the data is important

2. _________ data usually contains only one variable of


interest to be forecasted

Examples:
1. Monitoring patient blood pressure every week
2. Global warming trend

1. _________ is also called _________

2. _________ includes properties of both Cross-Sectional data as well as Time Series data

3. _________ has more than one variable, which is sorted based on the date and time

Examples:
1. Exam scores of all students in a class from sessional to final exams
2. Health scores of all employees recorded every month

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 17


__________ vs __________
• Whether a person claims insurance or not

• Will a person 'pay on time', 'pay with a delay', or 'default', etc.

• ___________ is that data where the classes of output variables are more or less in equal proportion. E.g. 47% of people have defaulted and 53% have not defaulted in the loan default variable.

• _________ is that data where the classes of output variables are in unequal proportion. E.g. 23% of the data is defaulted and 77% is not defaulted in the loan default variable.

• When we have balanced data


then we can simply apply
random sampling techniques

Default Not Default

Thumb Rule: if proportion of minority output class is < 30% then


data is imbalanced.

Sampling for imbalanced data refer to next page.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 18


When we have imbalanced data then we apply different
sampling techniques such as:

• _____________ - Undersampling and Oversampling

• Bootstrap Resampling

• K-Fold Cross Validation

• _____________ K-Fold Cross Validation

• _____________ K-Fold Cross-Validation

• _____________ (N-Fold Cross-Validation) LOOCV

• SMOTE (Synthetic Minority Oversampling Technique)

• MSMOTE (Modified SMOTE)

• Cluster-Based Sampling

• Ensemble Techniques

Imbalanced Data

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 19


Data Collection Sources
_______________: Data Collected at the Source

_______________: Data Collected Before Hand

Primary Data

Secondary Data

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 20


Primary Data
Examples of _________

Surveys, __________ of ________, ________ Sensors Data,


Interviews, Focus Groups, etc.

Survey steps:
1. Understand the business __________ and __________ behind
conducting the survey. E.g. Sales are low for a training company
2. Perform __________ analysis - __________ Analysis, 5-_____
Analysis, etc. E.g. Product Pricing is uncompetitive
3. Formulate Decision Problem. E.g. Should product prices be changed
4. Formulate Research _________. E.g. Determine the price elasticity of
demand and the impact on sales and profits of various levels of price
changes
5. List out Constructs. E.g. Training Enrolment
6. Deduce Aspects based on construct. E.g. Time aspect, Strength aspect,
Constraint aspect
7. Devise Survey ____________ based on the _______. E.g. I am most
likely to enroll for the training program in: In the next one week, In the next
one month, In the next one quarter, etc.

__________of ______________ examples:

• Coupon marketing with a 10% discount vs 20% discount, to which of these


customers are responding well

• Coupon targeting customers within 10 km radius versus 20 km radius

• Combinations of discount & distance to experiment

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 21


Secondary Data

Organizational data are stored in databases:
• Oracle DB
• Microsoft DB
• MySQL
• NoSQL - MongoDB
• Big Data, etc.

_________ ___________ (paid) databases:
• Industry reports
• Government reports
• Quasi-government reports, etc.

Meta Data Description: Data about Data


• Obtaining meta data description is mandatory before we proceed
further in the project

• Understand the data volume details such as size, number of records,


total databases, tables, etc.

• Understand the data attributes/variables - description and values


which these variables take

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 22


Preliminaries for Data Analysis
Probability can be explained as the extent to which an event is likely to occur,
measured by the ratio of the _________ cases to the whole number of
cases possible.

Probability = (# ________) / (# Total events)

Properties of Probability:

• Ranges from 0 to 1
• Summation of probabilities of all values of an event will be equal to 1

Example:

P(H) = H / (H & T) = 1/2 = 0.5

P(Red) = 2 / (2(R) + 2(B)) = 2/4 = 0.5
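As a quick check of the ratio definition above, here is a minimal Python sketch; the coin and two-red/two-blue ball setup simply mirrors the worked example.

```python
from fractions import Fraction

# Probability = (# favourable events) / (# total events)
p_head = Fraction(1, 2)        # coin toss: 1 head out of 2 outcomes -> 0.5
p_red = Fraction(2, 2 + 2)     # 2 red balls out of 2 red + 2 blue  -> 0.5

print(float(p_head), float(p_red))   # 0.5 0.5
```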

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 23


Base Equation
Random variables can be broadly classified as Output and Input variables.

Mathematically the relation between these is expressed using base equation:

Y is known as: X is known as:

• _________ variable • __________

• Response • Explanatory

• _________ • __________

• Explained variable • Covariates

• Criterion • __________

• Measures variable • Factors

• _________ variable • __________

• _________ • Controlled variable

• _________ variable • __________ variable

• Exposure variable

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 24


_____________

• If there is a chance / probability associated with each of the possible outputs, then it is called a _______

• Any output on any event which _______ is called a Variable

_____________ are always represented using Upper case.

Values that a random variable takes are represented using ____________.

Ex: Roll of a single die

X = {1, 2, 3, 4, 5, 6}

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 25


Probability Distribution
Representing the probabilities of all possible outcomes of an event in a tabular

format or a graphical representation is called Probability Distribution.

If a random variable is continuous then the underlying probability distribution

is called __________________.

If a random variable is discrete then the underlying probability distribution is

called __________________.

X    P(X=x)
0    0.40
1    0.25
2    0.20
3    0.05
4    0.10

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 26


Sampling Techniques
Sampling is a technique to collect the _________ of population data.

These techniques are broadly classified into 2 types.

_________ Sampling

_________ Sampling

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 27


Inferential Statistics
Inferential statistical is a process of analysing the ________ and deriving

statements / properties of a ________.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 28


Sampling Techniques
This class of sampling techniques is based on convenience, wherein the priority varies for the data that is to be collected to represent the population; these approaches are also known as ____________ sampling.

A few examples of Non-Probability Sampling:

1 Convenience Sampling

2 Quota Sampling

3 Judgment Sampling

4 Snowball Sampling

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 29


Sampling Techniques
________ Sampling, also known as ________ Sampling, is the default approach for inferential statistics. Each data point to be collected will have ________ to get selected.

A few examples of __________ Sampling:

1 Simple Random Sampling

2 Systematic Sampling

3 Stratified Sampling

4 Clustered Sampling

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 30


Sampling Funnel

Population

Sampling Frame

Simple Random Sampling

Sample

Population: All Covid-19 cases on the planet

Sampling Frame:
• The majority of Covid-19 cases are in the USA, India, and Brazil, and hence these 3 countries can be selected as a ______________
• ______ does not have any hard and fast rule; it is devised based on business logic

Simple Random Sampling:
• Randomly sample 10% or 20% of the data from the sampling frame using the Simple ________ technique
• ____________ is the gold standard technique used for sampling
• ____________ is the only sampling technique which has no bias
• Other sampling techniques such as Stratified sampling, etc., can also be used to sample the data, but _________ is the best
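A small pandas sketch of the last step of the sampling funnel, simple random sampling; the DataFrame contents and the 10% fraction are illustrative assumptions, not from the workbook.

```python
import pandas as pd

# Hypothetical sampling frame (e.g. case records from the selected countries)
frame = pd.DataFrame({
    "case_id": range(1000),
    "country": ["USA", "India", "Brazil"] * 333 + ["USA"],
})

# Simple random sampling: every record has an equal chance of being selected
sample = frame.sample(frac=0.10, random_state=42)
print(len(sample))   # roughly 100 records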

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 31


CRISP - DM Data Cleansing
Data Cleansing / Data Preparation

Data Cleansing is also called as ___________, ___________,

___________, ___________.

Outlier or ____________ - Any value which is extremely small or extremely large compared to the remaining data.

Outliers are treated using 3 R technique:

___________________

___________________

___________________

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 32


Winsorization Technique
Winsorization is a technique which modifies the sample distribution of a random variable by capping extreme values (outliers) rather than removing them. For example, 90% winsorization means all data below the 5th percentile is set to the 5th percentile value and all data above the 95th percentile is set to the 95th percentile value.

All values below the 5th percentile are changed to the 5th percentile value; all values above the 95th percentile are changed to the 95th percentile value.

Alpha Trimmed Technique


Alpha Trimmed Technique lets you set an alpha value, for example if alpha = 5%,

then all the lower & upper 5% values are trimmed or removed.

The lower 5% of values are removed/trimmed; the upper 5% of values are removed/trimmed.
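A plain-NumPy sketch of both treatments, assuming 90% winsorization (capping at the 5th/95th percentiles) and alpha = 5% trimming; the data array is made up for illustration.

```python
import numpy as np

x = np.array([2, 3, 4, 5, 6, 7, 8, 9, 10, 250])   # 250 is an extreme value

# Winsorization: cap values at the 5th and 95th percentile values
low, high = np.percentile(x, [5, 95])
winsorized = np.clip(x, low, high)

# Alpha trimming (alpha = 5%): drop the lowest and highest 5% of values
alpha = 0.05
lo_cut, hi_cut = np.percentile(x, [100 * alpha, 100 * (1 - alpha)])
trimmed = x[(x >= lo_cut) & (x <= hi_cut)]

print(winsorized.max(), trimmed.max())
```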

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 33


Missing Values
Missing Values - Fields in the data which might have blank spaces and (or) NA,

_________, _________, _________.

3 Variants of Missing Values

• ________________________ (MAR)

• Missingness Not At Random (MNAR)

• ________________________ (MCAR)

Name Age Salary


Steve 23 $ 4,000
Raj 33 $ 6,500
Chen 41 Missingness
Wilma 37 $ 7,200
Audrey 51 $ 9,300

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 34


Imputation
Imputation is a technique used to replace missing values with logical values.

Wide variety of Techniques are available, choosing the one which fits the data is

an art:

______________ ______________

(Simple Strategies) Single Imputation Methods

• List-Wise Deletion or • Mean Imputation


Complete-Case Analysis
• Median Imputation

• Available Case Method or • Mode Imputation


Pair-Wise Deletion
• Random Imputation

• Hot deck Imputation

• Regression Imputation

• KNN Imputation
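A pandas sketch of the simple single-imputation strategies listed above (mean, median, mode); the toy salary column mirrors the missing-value table a few pages back.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["Steve", "Raj", "Chen", "Wilma", "Audrey"],
                   "Age": [23, 33, 41, 37, 51],
                   "Salary": [4000, 6500, np.nan, 7200, 9300]})

mean_imputed   = df["Salary"].fillna(df["Salary"].mean())
median_imputed = df["Salary"].fillna(df["Salary"].median())
# Mode imputation is usually reserved for categorical columns
mode_imputed   = df["Salary"].fillna(df["Salary"].mode()[0])

print(mean_imputed[2], median_imputed[2], mode_imputed[2])
```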

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 35


Transformation
______________________________________________

Types of transformation

• Logarithmic

• ____________

• Square Root

• ____________

• Box-Cox

• Johnson

____________ / Binning / Grouping - Converting _________ data to

____________

Binarization - Converting continuous data into ____________

Rounding - Rounding off the decimals to the nearest integer e.g. 5.6 = 6

Binning - Two types of Binning

• Fixed Width Binning

• Adaptive Binning

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 36


Normalization
Normalization / ____________ - Making the data _________ and ____________.

Z = (X - mean) / stdev

Methods of Normalization / ____________ include ____________, also called as ____________; and ____________, also called as ____________ or __________________________.

Standardization has two parts:

• ____________________ or Mean Subtraction - Mean

Normalization will make the mean of the data _________.

• Variance Normalization - _________________ will make the

variance of the data __________.

(X - min(x)) / (max(x) - min(x))

Normalization is also called the _____________________. Normalized data has a minimum value of 0 and a maximum value of 1; sometimes, when dealing with negative values, the range can be between -1 and +1.

The Min-Max Scaler's disadvantage is that its scaled values are influenced by _______________.

____________________ is not influenced by outliers because it considers 'Median' & 'IQR':

(X - median(x)) / IQR(x)
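A NumPy sketch of the three scalers discussed on this page (z-score standardization, min-max normalization, and the median/IQR robust scaler); the input array is illustrative.

```python
import numpy as np

x = np.array([10.0, 12.0, 14.0, 16.0, 100.0])    # 100 is an outlier

z = (x - x.mean()) / x.std()                     # standardization: mean 0, variance 1
minmax = (x - x.min()) / (x.max() - x.min())     # min-max: rescaled to the 0-1 range

q1, q3 = np.percentile(x, [25, 75])
robust = (x - np.median(x)) / (q3 - q1)          # robust scaler: less influenced by the outlier

print(z.round(2), minmax.round(2), robust.round(2))
```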

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 37


Dummy Variable
Dummy Variable Creation: Representing/Converting categorical data in
numerical data

Techniques for Dummy Variable creation are:

________________

_____________ Scheme

_______________

Label Encoding

________ Coding Scheme

________ Hashing Scheme
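A short pandas sketch of two of the schemes listed above, one-hot encoding and label encoding; the city column is a made-up example.

```python
import pandas as pd

df = pd.DataFrame({"City": ["Hyderabad", "Chennai", "Hyderabad", "Bengaluru"]})

# One-hot encoding: one 0/1 column per category
one_hot = pd.get_dummies(df["City"], prefix="City")

# Label encoding: each category mapped to an integer code
df["City_label"] = df["City"].astype("category").cat.codes

print(one_hot)
print(df)
```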

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 38


Type Casting
Converting one type to another, for e.g. converting 'Character' type to 'Factor';
'Integer' type to 'Float'.

Type Casting

Type Conversion Type Coercion Type Juggling

Handling Duplicates
Ensures that we get a ____________ of _________ from all the
various locations.

E.g. A person opens a bank account but his transactions are recorded as John
Travolta in a few, John in a few entries and Travolta in a few; however, all 3 are
the name of the same person. So we merge all these names into one.

Name            Amount Spent        Name            Amount Spent
John Travolta   $ 1,000             John Travolta   $ 3,600
Travolta        $ 800
John            $ 1,800

Merged because all 3 entries belong to the same customer.
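A pandas sketch of type casting and of merging duplicate entries, loosely following the John Travolta example above; the name-mapping dictionary is an illustrative assumption.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["John Travolta", "Travolta", "John"],
                   "Amount Spent": ["1000", "800", "1800"]})

# Type casting: the amount column arrives as strings, convert it to integers
df["Amount Spent"] = df["Amount Spent"].astype(int)

# Handling duplicates: map the name variants onto one canonical customer name
df["Name"] = df["Name"].replace({"Travolta": "John Travolta",
                                 "John": "John Travolta"})
merged = df.groupby("Name", as_index=False)["Amount Spent"].sum()

print(merged)   # John Travolta  3600
```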

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 39


String Manipulation
Working with textual data. Various ways of converting unstructured textual
data into structured data are:

Stemming
____________ ____________

Stopword Removal
____________

Zero or Near Zero Variance


_______ & Near-_______________ feature:

Variables which are factors with a single level, or where the majority of the levels are the same. E.g. all the zip code numbers are the same, or the Gender column has all entries listed as female.

We remove variables from our analysis which have _____ or ________ variance in features.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 40


CRISP - DM Exploratory Data Analysis
(EDA)
Elements in EDA

Measure of Central Tendency is also called as "First Moment Business Decision".

MEAN

• Also called as ____________
• Gets influenced by ____________

MEDIAN

• Median is the middle value of the dataset

• Median of a dataset does not get influenced by __________

MODE

• Mode is the value, which repeats ____________ times

• Mode is applied to ____________ data

• If data has ______ mode it is called ______; if the data has ________ it is called ______ data; and more than two modes is called ____________

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 41


Measure of _____________, also called as
Second Moment Business Decision
Variance - How far away is each data point from the mean/average? The units of measurement get squared.

Standard Deviation - The square root of variance. Gets back the original units, which were squared during the variance calculation.

Range - Represents the boundaries of the data spread. Range = Maximum - Minimum.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 42


Measure of Skewness
• Data concentrated to the __________ is ________ Skewed, also called ______________ Skewed

• Data concentrated to the __________ is Right Skewed, also called ____________ Skewed

• Presence of long tails helps in devising interesting business strategies

[Skewed distribution diagrams showing the positions of the Mean and Median]

Measure of Kurtosis

__________ Curve __________ Curve __________ Curve
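All four moment-based business decisions can be computed in a few lines with pandas; the series below is illustrative and the skew/kurtosis values use pandas' default (sample, excess-kurtosis) definitions.

```python
import pandas as pd

s = pd.Series([23, 33, 41, 37, 51, 29, 33, 120])   # 120 drags the mean to the right

print("Mean:", s.mean(), "Median:", s.median(), "Mode:", list(s.mode()))
print("Variance:", round(s.var(), 1), "Std dev:", round(s.std(), 1),
      "Range:", s.max() - s.min())
print("Skewness:", round(s.skew(), 2))   # > 0: right (positively) skewed
print("Kurtosis:", round(s.kurt(), 2))   # excess kurtosis of the distribution
```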

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 43


Graphical Representations
Univariate analysis - Analysis of a single variable
is called Univariate Analysis.

Graphs using which we can visualize single variables are:

1. Bar Plot 8.____________

2. ____________ 9. Time Series Plots

3. ____________ 10. ____________

4. Strip Plot 11. Density Plot

5. ____________ 12. Boxplot or Box & Whisker Plot

6. ____________ 13. ____________ or

_____________
7. Candle Plot

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 44


Graphical Representations
Majorly used plots for Univariate Analysis includes Histogram, Box Plot,

and Q-Q Plot.

Histogram
Histogram is also called as Frequency Distribution Plot.

Primary Purpose: Histogram is used to identify the shape of the distribution.

Summarises the data into discrete bins. Used to identify the shape of the distribution. Identify if the data is unimodal, bimodal, or multimodal.

Secondary Purpose: Histogram is used to identify the presence of Outliers.

[Histogram of a sample: frequency/density on the y-axis, values roughly 30-70 on the x-axis]

Is used to identify presence of Outliers

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 45


Box Plot is also called as Box and Whisker Plot
• Box Plot gives the 5 point summary, namely, Min, Max, Q1 / First Quartile, Q3 /
Third Quartile, Median / Q2 / Second Quartile

• Middle 50% of data is located in the Inter Quartile Range (IQR) = Q3 - Q1

• Formula used to identify outliers is Q1 - 1.5 (IQR) on the lower side and
Q3 + 1.5 (IQR) on the upper side

• Primary Purpose of Boxplot is to identify the existence of outliers

• Secondary Purpose of Boxplot is to identify the shape of distribution

[Box plot: whisker | Q1 | Median | Q3 | whisker]

Q-Q plot is also called Quantile Quantile Plot


• Q-Q plot is used to check whether the data are normally distributed or not.
If data are non-normal then we resort to transformation techniques to
make the data normal

• The line in the Q-Q plot connects from Q1 to Q3

• X-axis contains the standardized values of the


random variable

• Y-axis contains the random values, which are not


Q-Q Plot
standardized

• If the data points fall along the line then data are
considered to be Normally Distributed
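A matplotlib/SciPy sketch that draws the three univariate plots discussed above for a made-up normal sample; scipy.stats.probplot supplies the Q-Q plot against the normal distribution.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=50, scale=8, size=200)     # illustrative sample

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

axes[0].hist(x, bins=20)                      # shape of the distribution
axes[0].set_title("Histogram")

axes[1].boxplot(x, vert=False)                # outliers beyond Q1 - 1.5*IQR / Q3 + 1.5*IQR
axes[1].set_title("Box plot")

stats.probplot(x, dist="norm", plot=axes[2])  # points on the line => roughly normal
axes[2].set_title("Q-Q plot")

plt.tight_layout()
plt.show()
```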

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 46


Bivariate Analysis
Bivariate analysis is analyzing two variables.

Scatter plot is used to check for the correlation between two variables.

The primary purpose of the Scatter Plot is to determine the following:

• Direction - Whether the direction is Positive or Negative or No Correlation

Positive Correlation Negative Correlation No Correlation

• Strength - Whether the strength is Strong or Moderate or Weak

Moderate Correlation Strong Correlation No Correlation

• Check whether the relationship is Linear or Nonlinear

Linear Nonlinear

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 47


The secondary purpose of the Scatter Plot is to determine _____________.

[Scatter plot example: x-axis roughly 130-190, y-axis roughly 30-90]

• Determining strength using a scatter plot is subjective

• Objectively evaluate strength using Correlation Coefficient (r)

• Correlation coefficient value ranges from +1 to -1

• Covariance is also used to track the correlation between 2 variables

• However, Correlation Coefficient normalizes the data in correlation


calculations whereas Covariance does not normalize the data in correlation
calculation

• | r | > 0.85 implies that there is a strong correlation between the variables

• | r | < = 0.4 implies that there is a weak correlation

• | r | > 0.4 & | r | < = 0.85 implies that there is a moderate correlation
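A NumPy sketch computing r for two illustrative variables and applying the thumb rules above; the height/weight numbers are invented.

```python
import numpy as np

height = np.array([150, 155, 160, 165, 170, 175, 180])
weight = np.array([52, 55, 61, 64, 70, 74, 80])

r = np.corrcoef(height, weight)[0, 1]   # Pearson correlation coefficient

if abs(r) > 0.85:
    strength = "strong"
elif abs(r) > 0.4:
    strength = "moderate"
else:
    strength = "weak"

print(round(r, 3), strength)
```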

Multivariate Analysis
The two main plots to perform Multivariate analysis are:

• Pair Plot

• Interaction Plot

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 48


Data Quality Analysis
Focus of this step is to identify the potential
errors, shortcomings, and issues with data.

Name    Age    Date    Date        Date
Steve   23     2001    Jan - 01    1-Jan-01
Jeff    37     2001    Jan - 01    17-Jan-01
Clara   28     2002    Feb - 01    8-Feb-02
Peter   41     2003    Jun - 01    12-Jun-03

Identify _________        Identify _________        Identify different levels of granularity

Name    Age         Salary        Sales     Region
Steve   $ 12,000    23            19,345    North
Jeff    $ 4,500     37            23,424    West
Clara   $ 5,200     28            24,164    East
                                  19,453    South

Validation and Reliability        __________ Data        Wrong metadata information

Wrong information due to data errors (manual / automated) - ______ ____________

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 49


Four errors to be avoided during Data
Collection
1. Random Errors - The measurement device (e.g. a thermometer) is faulty, or the person measuring makes mistakes. Leads to False Positives.

2. Systematic Errors - E.g. social desirability bias (support for Trump under-reported on Twitter); wearable device data comes mostly from wealthy customers.

3. Errors in choosing what to measure - Rather than choosing a person from a top university for a job, maybe we need to look at the social network which guided them through the series of events that resulted in them joining the top school. A high SAT score is not just based on high IQ; it depends on access to good tutors and good study material. Someone might like a subject and hence get a high GPA, but can we guarantee such success in other fields?

4. Errors of exclusion - Not capturing women's data pertaining to cardiovascular diseases; a US election dataset not having data on women candidates of color. A Chief Diversity Officer in big firms is one solution.

Random Errors Systematic Errors

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 50


Data Integration
Data Integration is invoked when there are
multiple datasets to be integrated or merged

Appending
Multiple datasets with the same attributes/columns.

Name Age Salary Name Age Salary


Steve 23 $ 4,000 Wilma 37 $ 7,200
Raj 33 $ 6,500 Audrey 51 $ 9,300
Chen 41 $ 5,900

Name Age Salary


Steve 23 $ 4,000
Raj 33 $ 6,500
Chen 41 $ 5,900
Wilma 37 $ 7,200
Audrey 51 $ 9,300

Merging
Multiple datasets having different attributes using a common attribute.

Name Age Salary Name Designation Location


Steve 23 $ 4,000 Wilma Manager Kuala Lumpur
Raj 33 $ 6,500 Chen V.P NY City
Chen 41 $ 5,900

Name Age Salary Designation Location


Steve 23 $ 4,000 Manager Kuala Lumpur
Raj 33 $ 6,500 NaN NaN
Chen 41 $ 5,900 V.P NY City
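A pandas sketch of the two integration operations shown above: appending with concat and merging on a common key; the designation table is adjusted slightly so that one employee (Raj) comes out with NaN values, as in the merged table.

```python
import pandas as pd

a = pd.DataFrame({"Name": ["Steve", "Raj", "Chen"], "Age": [23, 33, 41],
                  "Salary": [4000, 6500, 5900]})
b = pd.DataFrame({"Name": ["Wilma", "Audrey"], "Age": [37, 51],
                  "Salary": [7200, 9300]})

appended = pd.concat([a, b], ignore_index=True)   # same columns: stack the rows

roles = pd.DataFrame({"Name": ["Steve", "Chen"],
                      "Designation": ["Manager", "V.P"],
                      "Location": ["Kuala Lumpur", "NY City"]})
merged = a.merge(roles, on="Name", how="left")     # Raj gets NaN designation/location

print(appended)
print(merged)
```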

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 51


Feature Engineering
Attribute Generation is also called Feature Extraction or Feature
Engineering. Using the given variables, try to apply domain
knowledge to come up with more meaningful derived variables.

Feature Extraction can be performed on:

• for Temporal Data

- Date Based Features


- Time-Based Features

• for Numeric Data

• for Categorical Data

• on Text Data

• on Image Data

Feature Extraction
1. Deep Learning is performed using Automatic Extraction

2. Shallow Machine Learning is performed using Manual Extraction

3. Feature Extraction is used to get either Derived Features or


Normalized Features

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 52


Feature Selection
Feature Selection or Attribute Selection is
shortlisting a subset of features or attributes.

It is based on:
All Features
• Attribute importance

• Quality
Feature Selection
• _________________

• Assumptions Final Features

• Constraints

Feature Selection Techniques

• Filter Methods • _________________

• Wrapper Methods • Variable Importance Plot

• _________________ • Subset Selection Methods


includes:
• Threshold-Based Methods
• _________________
• Statistical Methods (Lasso Regression, Ridge
Regression)
• Hypothesis Testing
• Forward Stepwise Selection
• _________________
• Backward Stepwise Selection
• Model-Based Selection

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 53


Model Building using Data Mining
Supervised Learning

In the historical data if the ____________ variable ____ is known, then

we apply supervised learning tasks on the historical data. Supervised Learning

is also called ________________________ or Machine Learning.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 54


Supervised Learning has four broad problems to solve:

• Predict a __________ class: __________.


- Example: Does the pathology image show signs of Benign or
Malignant tumour

- Is employee 'X' going to Attrite or Not Attrite

• Predict a ______________ value: ______________

(also sometimes called as __________)


- Example - What will be the stock value tomorrow?

- How many Samsung mobile phones will we sell next month?

• Predict user _____________ from a large pool of options:


Recommendation
- Example - Who will be the best match for getting married on a
matrimonial website?

• Predict RELEVANCE of an entity to a “query”: Retrieval


- Example - Return the most relevant website in a Google search

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 55


Data Mining-Unsupervised
What is Unsupervised learning?

Algorithms that draw conclusions on ______________.

____________ data is data where output variable ‘Y’ is unknown.

Unsupervised Learning algorithms help in ____________ analysis.

A few of the algorithms are:

Clustering
___________

Network Analysis
___________

___________

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 56


Unsupervised Learning - Preliminaries
Distance Calculation:

Distance is either calculated between:

• Two _________
• A _______ and a _______
• Two clusters

Distance Properties:

• Should be non-negative (distance > 0)

• Distance between a record to itself is equal to 0

• Satisfies ____________ (Distance between records 'i' & 'j' is equal to

the distance between records 'j' & 'i')

Standardize or ______________ the variables before calculating the distance if the variables are on different scales or are of different units.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 57


Distance Calculations
Distance Metrics for Continuous Data

• __________ Distance which is calculated using Correlation Matrix

• _________ Distance, is also called as L1 norm

• Euclidean Distance, is also called as L2 norm

d = √(a² + b² + c²)

d = √((xi1 - xj1)² + (xi2 - xj2)² + .... + (xip - xjp)²)

Distance Metrics for Binary Categorical Data

• Binary Euclidean Distance

• Simple Matching Coefficient

• _________ Coefficient

Distance Metrics for Categorical Data (> 2 categories)

• Distance is 0, if both items have same category

• Distance is 1 otherwise

Distance Metrics when both Quantitative Data & Categorical Data


exists in a dataset

• ________ General Dissimilarity Coefficient
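A NumPy sketch of the L1 (Manhattan) and L2 (Euclidean) distances between two illustrative records, with standardization applied first because the two columns are on different scales.

```python
import numpy as np

X = np.array([[170.0, 65.0],    # height (cm), weight (kg)
              [160.0, 80.0],
              [180.0, 72.0]])

# Standardize each column so that scale differences do not dominate the distance
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

a, b = Xs[0], Xs[1]
manhattan = np.sum(np.abs(a - b))           # L1 norm
euclidean = np.sqrt(np.sum((a - b) ** 2))   # L2 norm

print(round(manhattan, 3), round(euclidean, 3))
```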

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 58


Linkages
Linkages - Distance between a record & a cluster, or between two clusters.

1. Single Linkage - This is the least distance between a record and


a cluster, or between two clusters.

• Single Linkage is also called as _________________

• Emphasis is on close records or regions and not on overall structure of

data

• Capable of clustering non-elliptical shaped regions

• Gets influenced greatly by outliers or noisy data

2. Complete Linkage - This is the largest distance (diameter) between a record


and a cluster, or between two clusters.

• Complete Linkage is also called as ____________

• Complete Linkage is also sensitive to outliers

Nearest Neighbour (Single Linkage)        Farthest Neighbour (Complete Linkage)

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 59


Average Centroid

3. Average Linkage - This is the average of all distances between


a record and a cluster, or between two clusters.

• Average Linkage is also called Group Average

• Very expensive because computation takes a lot of time

4. Centroid Linkage - This is the distance between the centroids (centers) of


two clusters or between a record and centroid of a cluster.

• Centroid Linkage is also called Centroid Similarity

5. ______ Criterion - It is the increase in the value of the SSE criterion obtained by merging two clusters into a single cluster.

• This is also called Ward's Minimum Variance and it minimizes the total

within cluster variance

6. G______ A______ A______ C______ (GAAC)

• Two clusters are merged based on cardinality of the clusters and

centroid of clusters

• Cardinality is the number of elements in the cluster

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 60


Clustering / Segmentation

Clustering has two main criteria:

a. Similar records to be grouped together. High ____________ similarity

b. Dissimilar records to be assigned to different groups.


Less ____________ similarity

• ____________ groups will form ____________ groups


after clustering exercise

• Clustering is an ____________ technique

• Separation of clusters can be of two types:


______ (one entry belongs to one cluster)
vs ___________ (one entry belongs to more than one cluster)

When we have a single variable then clustering can be performed by using a


simple boxplot. When we have 2 variables then we can perform scatter diagrams.


Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 61


Clustering / Segmentation
When we have more than 2 variables then there are a lot of other
techniques such as:

Partitioning Based Methods:

• K-Means Clustering

• K-Means ++ Clustering

• ____________ Clustering

• Genetic K-Means Clustering

• K-Medoids Clustering

• K-Medians Clustering

• K-Modes Clustering

• ____________ Clustering

• ____________ Clustering

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 62


K-Means Clustering
K-Means clustering is called Non-Hierarchical Clustering.

We upfront decide the number of clusters using ________________ or


Elbow Curve.

Steps for K-Means


1. Decide the number of clusters 'K' based on the elbow curve of scree plot or
based on the thumb rule _____________. Alternatively, users may
intuitively decide upon the number of clusters

2. Dataset is partitioned into K _____________ with 'K' centers called


centroids. These centroids are randomly chosen as part of __________

3. Each data point of the dataset, which is the closest to one of the centroids will
form a cluster with that closest centroid

4. Centroids are again _____________ with the data points assigned


to each cluster

5. Steps 3 & 4 are repeated iteratively until no ____________ of data points


to other clusters is possible
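A scikit-learn sketch of these steps, with the elbow/scree idea folded in; the synthetic data, the range of K values tried, and the final choice of K = 3 are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 4, 8)])

# Elbow / scree curve: within-cluster sum of squares (inertia) for each K
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))

# Fit the chosen K; init="k-means++" spreads the initial centroids apart
model = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
print(model.cluster_centers_)
print(model.labels_[:10])
```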

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 63


Disadvantages of K-Means Clustering
Random initialization of centroids leads to clustering exercise terminating
at a ______________ (______________ because the
objective is to get ______________ within the sum of squares).

Solution:
Initialize the algorithm multiple times with different initial partitions.

No defined rule for selecting the K-value, while there are thumb rules, these
are not foolproof.

Solution:
Run the algorithm with multiple 'K' values (range) and select the clusters with the least 'within sum of squares' and the highest ___________ 'Sum of Squares'.

Extremely sensitive to the outliers or extreme values.

Solution:
K-medians, _____________________ are a few other variants
which handle outliers very well.

K-Means clustering works for the data which is continuous in nature.

Solution:
Use _________________ for categorical data.

Cannot discover clusters with non-convex shapes.

Solution:
Use _______________ clustering and ____________ K-Means.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 64


K-Means++ Clustering
K-Means++ altogether, addresses the problem of different
initializations leading to different clusters.

Steps:

1. Decide the number of clusters 'K'

2. First centroid is randomly selected

3. Second centroid is selected such that it is at the _______________

4. Step 3 depends on a weighted _______________ score criterion

5. This process continues until all 'K' centroids are obtained

[Scatter plots of the same data clustered two ways: K-Means Clustering vs K-Means++ Clustering]

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 65


K-Medians
K-Medians is very good at handling outliers.

L1 Norm is the distance measure used and is also called Manhattan Distance.

Steps are very similar to K-Means except that instead of calculating Mean we
calculate Median.

1. K-Means cannot handle categorical data

2. Categorical data can be converted into one-hot encoding but will hamper the
quality of the clusters, especially when the dimensions are large

3. K-Modes is the solution and uses modes instead of means and everything
else is similar to K-Means

4. Distance is measured using ________________

5. If the data has a mixture of categorical and numerical data then the
_______________ method can be used

6. K-Means can only handle linearly separable patterns and ___________.


Kernel K-Means clustering works well when the data is in non-convex format
(non-linearly separable patterns)

7. ___________functions are used to take data to high-dimensional space


to make it linear and captures the patterns to form clusters

________________ Functions to be used are:

• ___________ Kernel
• Gaussian Radial Basis Function (RBF)
• Sigmoid ___________
• ______

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 66


K-Medoids
K-Medoids address the problem of K-Means getting influenced by outliers.

Steps:

1. Choose 'K' data points randomly as medoids

2. Instead of taking the centroid of data points of a cluster, medoids are


considered to be the center

3. Find out the distance from each and every data point to the medoid and add
them to get a value. This value is called total cost

4. Select any other point randomly as a representative point (any point other
than medoid points)

5. Find out the distance from each of the points to the new representative point
and add them to get a value. This value is called the total cost of a new
representative point

6. If the total cost of step 3 is greater than the total cost of step 5 then the
representative point at step 4 will become a new medoid and the process
continues

7. If the total cost of step 3 is less than the total cost of step 5 then the
algorithm ends

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 67


Partitioning Around Medoids (PAM)
Partitioning Around Medoids (PAM) is a classic example of ________ Algorithm.

Steps:

1. Randomly points are chosen to be ____________

2. Replace medoids with non-medoids, if the ____________ (Sum of Squared Errors - SSE) of the resulting cluster is improved (reduced)

3. Continue iteratively until the ____________ criteria of step 2 is satisfied

PAM is well suited for small datasets but it fails for large datasets.

CLARA - Clustering Large Applications:


1. In the case of large datasets performing clustering by in-memory
computation is not feasible. The sampling technique is used to avoid this
problem
2. CLARA is a variant of PAM
3. However unlike PAM, the medoids of all the data points aren’t calculated, but
only for a small sample
4. The PAM algorithm is now applied to create optimal medoids for the sample
5. CLARA then performs the entire process for a specified number of samples to reduce bias

CLARANS - Clustering Large Applications based on RANdomised Search:


1. The shortcoming of CLARA is that its result varies based on the sample size
2. CLARANS is akin to double randomization: the algorithm randomly selects 'K', and also randomly selects medoids and a non-medoid object (similar to K-Medoids)
3. CLARANS repeats this randomised process a finite number of times to obtain
optimal solution

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 68


Hierarchical Clustering
Hierarchical clustering is also called Agglomerative technique (bottom-up hierarchy
of clusters) or Divisive technique (top-down hierarchy of clusters).

Agglomerative:

Start by considering each data point as a cluster and keep merging the records
or clusters until we exhaust all records and reach a single big cluster.

Steps:

1. Start with 'n' number of clusters where 'n' is the number of data points

2. Merge two records, or a record and a cluster, or two clusters at each step
based on the distance criteria and linkage functions

Divisive:

• Start by considering that all data points belong to one single cluster and keep
splitting into two groups each time, until we reach a stage where each data
point is a single cluster

• Divisive Clustering is more efficient than Agglomerative Clustering

• Split the clusters with the largest SSE value

• Splitting criterion can be Ward's criterion or Gini-index in case of


categorical data

• Stopping criterion can be used to determine the termination criterion

Number of clusters are decided after running the algorithm and viewing the
Dendrogram. Dendrogram is a set of data points, which appear like a tree of
clusters with multi-level nested partitioning.
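A SciPy sketch of agglomerative clustering with Ward's criterion, cut into three clusters and visualized as a dendrogram; the toy data and the choice of three clusters are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(20, 2)) for c in (0, 3, 6)])

Z = linkage(X, method="ward")                      # bottom-up merges, Ward's minimum variance
labels = fcluster(Z, t=3, criterion="maxclust")    # cut the tree into 3 clusters
print(labels)

dendrogram(Z)                                      # tree of multi-level nested partitions
plt.show()
```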

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 69


Disadvantages of Hierarchical Clustering
Work done previously cannot be un-done and cannot work well

on ________ datasets.

Types of Hierarchical Clustering


1 BIRCH

B___________ I___________ R___________ and

C___________ using H ___________

2 CURE

C___________ U___________ RE___________

3 CHAMELEON

Hierarchical Clustering using Dynamic Modeling. This is a ___________


approach used in clustering ___________ structures

4 P___________ Hierarchical Clustering

5 G___________ Clustering Model

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 70


Density Based Clustering: DBSCAN
• Clustering based on a local cluster criterion

• Can discover clusters of random shapes and can handle outliers

• Density parameters should be provided for stopping condition

DBSCAN - D______B______ S_______ C________ of

A____________ with N____________

Works on the basis of two parameters:

______________ - Maximum radius of the neighbourhood

______________ - Minimum number of points in the Eps-neighbourhood of a point

It works on the principle of

_________________
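A scikit-learn sketch of DBSCAN on a non-convex two-moons dataset; eps and min_samples correspond to the two parameters above, and their values here are illustrative.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)   # eps = neighbourhood radius
print(set(db.labels_))                        # label -1 marks noise/outlier points
```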

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 71


OPTICS
Ordering of Points to Identify Cluster Structure

Works on the principle of varying density of clusters

2 Aspects for Optics

Core Distance Reachability Distance

“Plot the number of clusters for the image if it were subjected to OPTICS clustering.”

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 72


____________ - Based Clustering Methods
Partition the data space into a finite number of cells to form a __________.

Find clusters from the cells in the __________ structure.

Challenges:

Difficult to handle irregular distribution in the data.

Suffers from the curse of dimensionality, i.e., difficult to cluster

high-dimensional data.

STING CLIQUE

Methods:

STING - ST__________ IN________ G_______ approach

CLIQUE - Cl___________ in QUE___________ - This is both

density-based as well as grid-based subspace

clustering algorithm.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 73


Three broad categories of
measurement in clustering

External Internal Relative

___________________

Used to compare the clustering output against subject matter expertise

(ground truth)

Four criteria for ____________ Methods are:

1. Cluster homogeneity - More the purity, better is the cluster formation

2. Cluster completeness - Ground truth of objects and cluster assigned

objects belong to same cluster

3. Rag bag better than alien - Assigning a heterogeneous object (one that is very different from the remaining points of a cluster) to that cluster is penalized more than assigning it to a rag bag/miscellaneous/other category

4. Small cluster preservation - Splitting a large cluster into smaller clusters

is much better than splitting a small cluster into smaller clusters

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 74


Most Common __________ Measures
1. _________-based measures

• Purity

• Maximum Matching

• F-measure (Precision & Recall)

2. _________-based measures

• Entropy of Clustering • Entropy of Partitioning

• Conditional Entropy • Mutual Information

• Normalized Mutual Information (NMI)

3. Pairwise measures

• True Positive • False Negative

• False Positive • True Negative

• _________ Coefficient • __________ Statistic

• Fowlkes - Mallow Measure

4. Correlation measures

• Discretized ________ Statistic

• Normalized Discretized ________ Statistic

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 75


Most Common External Measures

____________________

Goodness of clustering and an example of same is ________ coefficient

Most common internal measures:

1. Beta-CV measure

2. ______________ Cut

3. Modularity

4. Relative measure - ___________ Coefficient

___________________

Compare the results of clustering obtained by different parameter settings of

the same algorithm.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 76


Clustering Assessment Methods
1. ______________ Histogram

2. Distance Distribution

3. ____________ Statistic

Finding K value in clustering


1. __________________ Approach

2. Empirical Method: K = √(n/2)

3. Elbow Method

4. ________-Validation Method

[Elbow plot: total within-cluster sum of squares dropping as the number of clusters K increases from 1 to 9]

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 77


Mathematical Foundations
Basic Matrix Operations:
1. Matrix _________________

2. Matrix Multiplication

3. Matrix ________________

4. _______________ Matrix

5. __________ and Eigenvalues

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 78


Dimension Analysis and _________________ ____________ (PCA, SVD, LDA)

Dimensions are also called as ____________, Variables, ________.

Feature extraction of input variables from hundreds of variables is known as

____________ Reduction.

Fewer dimensions mean easier interpretability and quicker calculations, which also helps in reducing ______ conditions and avoiding ___________.

Another benefit of dimensionality reduction is ____________ the

multivariate data on a ___________.

Out of the many techniques available, in this book

we will discuss the most popular methods:

• PCA - P__________ C___________ A___________

• SVD - S_________ V__________ D_____________

• LDA - L_________ D_____________ A__________

• ____________ Analysis

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 79


PCA - P_________ C____________ A______
PCA is applied on Dense data (data, which does not have many zeros) which is

quantitative in nature.

PCA is used to convert a large number of features into an equal number of

features called P____________ C__________________ (PCs).

These PCs capture 100% information, however, the initial set of PCs alone can

capture maximum information.

PCA helps us reduce the size of the dataset significantly at the expense of minimal information loss.

If the features of the original dataset are all uncorrelated, then applying PCA does not help.

Each PC will capture information contained in all the variables of the original

dataset.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 80


Benefits of PCA
• Reduction of number of ________ & hence faster processing

• Identify the __________ between multiple columns at one go by

interpreting the ________ of PCs

• Visualizing ____________ data using a ___ visualization technique

• Inputs being ________ is called as collinearity and this is a problem,

which is overcome by PCA because it makes the inputs ___________

• Helps in identifying similar columns

The ith principal component is a weighted average of the original measurements / columns:

PCi = ai1x1 + ai2x2 + ai3x3 + ... + ainxn

Weights (aij) are chosen such that:

• PCs are ordered by their _________(PC1 > PC2 > PC3, and so on)

• Pairs of PCs have ___________ = 0

• For each PC, sum of ___________ = 1 (Unit Vector)

Data Normalization / Standardization should be


performed before applying PCA.
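A hedged sketch of PCA with standardization first, as recommended above; the dataset and number of columns are assumptions made only for the example:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = np.random.rand(100, 6)                 # 100 rows, 6 numeric columns (assumed)
X_std = StandardScaler().fit_transform(X)  # normalize/standardize before PCA

pca = PCA()                                # as many PCs as original columns
scores = pca.fit_transform(X_std)          # the PCs - new, uncorrelated columns

print(pca.explained_variance_ratio_)       # ordered: PC1 > PC2 > PC3 ...
print(np.round(np.corrcoef(scores, rowvar=False), 2))  # off-diagonal correlations ~ 0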

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 81


SVD - S___________ V_______ D_____________
S_____ V_____ D________ or SVD - is applied to reduce

__________ data (data, which has a lot of entries as zeros).

SVD is applied on the images to reduce the size of images and helps

immensely in image processing.

SVD is extensively used in ____________________

It is a _____ decomposition

method, represented as:

• diagonal matrix values are known as the singular values of the


original matrix X

• U matrix column values are called the ______________ of X

• V matrix column values are called the _______________ of X

_____________
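A small NumPy sketch of SVD; the matrix here is random and only illustrates the shapes of the factors and a low-rank approximation, it is not an example from the workbook:

import numpy as np

X = np.random.rand(6, 4)                  # stand-in for a sparse matrix or image
U, s, Vt = np.linalg.svd(X, full_matrices=False)

print(U.shape, s.shape, Vt.shape)         # (6, 4) (4,) (4, 4)

# Keep only the top-2 singular values for a compressed approximation of X
k = 2
X_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(X_approx, 2))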

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 82


LDA - L_______ D_______________ A________
L_____ D__________ A_____ (LDA) is used to solve dimensionality

reduction for data with higher attributes.

Linear Discriminant Analysis is a supervised algorithm as it takes the class


label into consideration.

LDA finds a centroid for the data points of each class.

LDA determines a new dimension based on centroids in a way to satisfy two

criteria:

1. _____________ the distance between the centroid of each class.

2. ____________ the variation (which LDA calls scatter and is

represented by s2), within each category.
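A hedged sketch of Linear Discriminant Analysis with scikit-learn; the iris dataset is used only as a stand-in for "data with a class label":

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=2)  # at most (number of classes - 1)
X_lda = lda.fit_transform(X, y)                   # supervised: uses the class label y

print(X_lda.shape)                    # (150, 2) - reduced representation
print(lda.explained_variance_ratio_)  # variation captured by each discriminant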

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 83


Association Rules
Relationship Mining, ____________ Analysis, or ______ Analysis - All

mean the same thing, i.e., how are two entities related to each other, is there

any dependency between them.

The objective of study of association is to find

• What item goes with what?

• Are certain groups of items consistently purchased together?

• What business strategies will you devise with this knowledge?

Association rules are known as probabilistic ‘_____’ statements. Generating


the most ideal statements among all which show true dependencies is done
using the following measures.

Support Confidence Lift

If part of the statement is called as ____________.

Then part of the statement is called ____________.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 84


Association Rules
Support:

Percentage / Number of transactions in which IF / ____________ & THEN / Consequent appear in the data

Support = (# transactions in which A & C appear together) / (_____ # of transactions)

Drawbacks of Support:

1. Generating all possible rules is exponential in the number of distinct items

2. It does not capture the true dependency - How good are these rules

beyond the point that they have high support?

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 85


Association Rules
____________

Percentage of If/Antecedent transactions that also have the Then/Consequent item set.

P(Consequent | Antecedent) = P(C & A) / ____

__________ = __________________ / (# transactions with A)

Drawbacks of Confidence:

• Carries the same drawback as of Support

• It does not capture the true dependency - How good is the dependency

between entities which have high Support?

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 86


Lift Ratio is a measure describing the ratio between dependency and

independency between entities.

Formula: Confidence / ____________

Lift = Confidence / __________________

Note: __________________ assumes independence between antecedent & consequent:

P(C|A) = P(C & A) / P(A) = P(C) × P(A) / P(A) = P(C)

_____________ = (# transactions with consequent item sets) / (# transactions in database)

Threshold - 1:

Lift > 1 indicates a rule that is useful in finding consequent item sets. The rule

above is much better than selecting random transactions
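A from-scratch sketch of Support, Confidence and Lift for a single rule {A} -> {C}; the transactions and item names are invented for illustration only:

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

A, C = "bread", "butter"          # antecedent (IF part) and consequent (THEN part)
n = len(transactions)

support_AC = sum(1 for t in transactions if A in t and C in t) / n
support_A  = sum(1 for t in transactions if A in t) / n
support_C  = sum(1 for t in transactions if C in t) / n

confidence = support_AC / support_A          # P(C | A)
lift = confidence / support_C                # > 1 indicates a useful rule

print(round(support_AC, 2), round(confidence, 2), round(lift, 2))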

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 87


Recommender Systems
Recommender Systems is also called _______________

Data used for the analysis usually has ‘Users’ as rows and ‘Items’ will be

columns. The entries within the dataset can be:

(from retail ecommerce context)

• Whether a user has purchased or not

• Whether user has ____ the product or not

• How many products each user has purchased?

• What is the rating provided by the user?

Sometimes the values, for example ratings columns, are divided by the
______. ______ refers to the number of customers who have
purchased or rated the item. This process is called ______ the ratings.

Generally applied on e-commerce platforms. Customers' purchasing patterns are analysed to design personalized strategies to recommend items which have a high likelihood of getting purchased.

• What is the item most likely to be purchased?

• Can we identify and make suggestions/recommendations upfront?

These are the two broad questions that have to be addressed. Recommendations in turn help to gain the user's confidence and make them loyal to the brand.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 88


Types of Recommendation Strategies

1 ____________ Recommender system

2 __________________ Recommender system

3 Demographic based Recommender system

4 ______ based Recommender system

5 Knowledge based Recommender system

6 ______ Recommender system

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 89


Recommender System
Collaborative filtering is the most popular approach and it is based on

similarity measures between users.

Similarity Measures:

______ Based Similarity:

Cos(A,B) = A•B / |A|*|B|

______ Based Similarity:

Corr(AB) = Covariance (A,B) / Stdev (A) * Stdev (B)

Euclidean distance

____________ distance, etc.
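A small NumPy sketch of the similarity measures listed above, applied to two users' rating vectors; the ratings are made up for illustration:

import numpy as np

a = np.array([5.0, 3.0, 0.0, 4.0])
b = np.array([4.0, 0.0, 0.0, 5.0])

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))   # Cos(A,B) = A·B / |A||B|
pearson = np.corrcoef(a, b)[0, 1]                          # Corr(A,B)
euclidean = np.linalg.norm(a - b)                          # Euclidean distance

print(round(cosine, 3), round(pearson, 3), round(euclidean, 3))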

What to Recommend:

List out and recommend the items that the person is MOST LIKELY to buy
from the list of items that similar customers have already purchased.

Sorting the list of items can be based on:

• How many similar customers purchased it

• Rated by most

• Highest rated, etc.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 90


Recommender System
Disadvantages:

______ learning and ______ expensive

Compute is very expensive - n² similarity calculations

Options to reduce computational burden

1. Discard ______ buyers

2. Discard items that are very popular or very unpopular

3. ___________ can reduce the number of rows

4. PCA (dimension reduction) can reduce the number of columns

5. __________________ customers

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 91


Recommender System
Alternative Approaches

Strategic decision making in terms of

‘Better accuracy and ______’ vs ‘Slightly lower accuracy and

______recommendations’

Search-Based method is a recommendation based on previous purchases.

A variant of Search-Based method is called Item-to-Item Collaborative Filtering.

Rows will be Items and Columns will be Users.

Disadvantage is that most obvious items are always recommended.

Recommendations vs Association Rules

Association Rules Recommendation Engine

______, Common, Generic Strategy Personalized strategy

____________ is important __________ is unimportant

Useful for large physical stores Useful for ____ recommendation

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 92


Recommender System
For New Users:

• Recommend ______ popular items

• Recommend ______ popular items based on demography

• Recommend based on ____________ data

• Make user login using social network, then look at the user's social media activity and recommend accordingly

• Show a few items and ask user to rate them so that, based on the rating, one can be recommended

For New Items:

• Recommend ______ to a few users

• Recommend to the tech-geeks (if it is a gadget)

• Identify the most ______ person in the social media graph and recommend the new item to this influential person

Challenge with Rating Matrix-Based Recommendation:

Rating matrices are huge and sparse (too many empty cells)

______ is used to handle the sparse rating matrix

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 93


Network Analysis
Network Data or ______ Data is a different type of data, which requires

different types of analysis.

Key components of a ______ or Network are

Vertices / ______ and ______ / Links

Network can be represented as Adjacency Matrix. Note: For an undirected

graph, the adjacency matrix is symmetric in nature.

Links / ______ between ____ can be either bidirectional or ________

Applications

Supply Chain Network, Crowd-Funding Network, Airlines Network

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 94


Network Analysis

Node Properties

____________ = Number of direct ties with other nodes

In-Degree = Number of Incoming connections

Out-Degree = Number of Outgoing connections

Degree centrality is a local measure and


hence we should look at other measures.

____________ is how close the node is to other nodes in the network

_________ = 1 / (sum of distances to all other nodes)

When comparison of two networks arises, then normalized ______ should be considered:

Normalized ______ = (Total number of nodes - 1) * Closeness

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 95


Network Analysis

____________ centrality can be measured for a node or an edge

____________ centrality is how often the


node/edge lies on the shortest path between pairs

________ centrality = ∑ (# of shortest paths between a pair that it lies on) / (# of shortest paths between that pair)

When two networks are compared then we use normalized ____________

Normalized ________ = __________________ / (# of all possible pairs excluding the focal node)

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 96


Network Analysis
__________ centrality measures who you are connected to, and not just how many you are connected to

• Connections to high-scoring nodes contribute more to a node's score than connections to low-scoring nodes

• _________ is calculated from ____________ of adjacency matrix

x = (1/λ) A x, i.e., λx = Ax, where A is the adjacency matrix (aij) and x is the vector of node scores.

The x corresponding to the highest Eigenvalue is the vector that consists of the __________ centralities of the nodes.

__________ Centrality is a measure on how likely is a person, who

receives the information, going to diffuse the information further.
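An illustrative sketch of the node measures above using networkx on a tiny toy graph (the edges are invented); it is not code from the workbook:

import networkx as nx

G = nx.Graph()
G.add_edges_from([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")])

print(nx.degree_centrality(G))        # degree, normalized by (n - 1)
print(nx.closeness_centrality(G))     # closeness
print(nx.betweenness_centrality(G))   # betweenness (normalized)
print(nx.eigenvector_centrality(G))   # eigenvector centrality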

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 97


Network Analysis
Edge / Link Properties

Edge or Link properties are defined based on the domain knowledge and there

is no defined rule in defining the same.

Network Related Properties:

• _____ = Number of edges / Number of possible edges

• Shortest Path

• Average Path Length

• Path _____ - the longest shortest path

• Cluster Coefficient - measures the degree to which nodes in a graph tend to cluster together

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 98


Network Analysis

Cluster Coefficient of a node = (# of links that exist among its neighbors) / (# of links that could have existed among its neighbors)

Cluster coefficient of a network is average cluster

coefficient of nodes in the network.

Community Detection Algorithms

____________ Fast-Greedy ____________

Also called as Iterative network divisive algorithm

Steps
- Calculate edge betweenness of each edge
- Remove the edge with the highest betweenness
- Repeat the above two steps until no edges remain

Starts with the __________ and goes on until all

nodes are isolated communities

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 99


Text Mining
Analyzing __________ Text data by generating __________ data in

key-value pair form. Deriving insights from the extracted keywords by

arranging the extracted keywords in a plain space with font sizes varying

based on their frequency is called __________.

Collect the text data / Extract data from sources.

Examples of Sources:

• _____

• Speech transcripts

• _____

• Email to customer service

• Field agents, salespeople

• ________

• Social media outreach

• _______
Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 100


Text Mining
Pre-process the data

• Typos

• Case - uppercase / lowercase / proper case

• Punctuations & special symbols (‘%’, ‘!’, ‘&’, etc.)

• Filler words, connectors, pronouns (‘all’, ‘for’, ‘of’, ‘my’, ‘to’, etc.)

• Numbers

• Extra White spaces

• Custom words

• Stemming

• Lemmatization

• Tokenization - Tokenization refers to the process of splitting a


sentence into its constituent words

Document Term Matrix / Term Document Matrix

Documents arranged in rows and Terms arranged in columns is


called as DTM and transpose of DTM is TDM.
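A sketch of building a Document Term Matrix with scikit-learn's CountVectorizer; the three "documents" below are invented examples:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the service was great",
        "great product great price",
        "poor service and poor support"]

vec = CountVectorizer(stop_words="english")
dtm = vec.fit_transform(docs)        # rows = documents, columns = terms

print(vec.get_feature_names_out())
print(dtm.toarray())                 # the transpose of this array is the TDM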

Word Cloud
__________ - words present in positive dictionary.

__________- words present in negative dictionary.

_____ - two words repeated together - gives better context of the content.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 101


Natural Language Processing (NLP)
Text Analytics is the method of extracting meaningful insights and answering

questions from text data.

N_____ L________ U__________ (NLU)


A process by which an inanimate object (not alive - machines, systems, robots)

with computing power is able to comprehend spoken language.

Example: Humans talk to robot

N_________ L_________ G_________ (NLG)


A process by which an inanimate object (not alive - machines, systems, robots)

with computing power is able to manifest its thoughts in a language that

humans are able to understand.

Example: Robot responds to human queries

POS tags - Parts of Speech Tagging – Process of tagging words within


sentences into their respective PoS and then labelling them.

N_________ E_________ R_________


__________ are usually not present in the dictionaries so we need to treat

them separately. People, place, organizations, quantities, percentages, etc.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 102


Topic Modeling Algorithms
LSA/LSI (Latent Semantic Analysis / Latent Semantic Indexing)
Reducing dimension for classification. LSA assumes that the words will occur in

similar pieces of text if they have similar meaning.

LDA (____________________)
A topic modelling method that generates topics based on words/expression

frequency from documents.

Text Summarization:
Process of producing concise version of text by retaining all the important

information.

(Figure: topic modeling takes documents and the frequency of words within them and generates topics)
Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 103


Machine Learning Primer
Steps based on Training & Testing datasets

1. Get the ______________ / ____________ data needed for

analysis which is the output of data cleansing

2. Split the data into training data & testing data

(Figure: Training and Validation - historical data is split by random sampling into a 70% training dataset and a 30% test dataset)

a. Split the data based on random sampling if the data is balanced

b. Split the data based on other sampling techniques if the

data is imbalanced

(Refer to Step 2 of CRISP-DM to know about imbalance dataset

sampling techniques)

c. _____________________________________
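As an aside, a minimal sketch of step 2 with scikit-learn; X and y are placeholders for the cleansed data, and stratify=y is one of the options for an imbalanced target:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 5)               # assumed feature matrix
y = np.random.randint(0, 2, size=100)    # assumed binary target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)

print(X_train.shape, X_test.shape)       # 70% / 30% split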

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 104


Machine Learning Primer

(Figure: Training and Validation - the 70% training dataset builds the statistical / machine learning model; the 30% test dataset is used for prediction and model validation, giving X% accuracy)

3. ___________________________________

4. Test the model on testing data to get the predicted values

5. Compare the ________________________________ and

____________________ values of testing data to calculate error

or accuracy. (Model evaluation techniques are discussed in subsequent

sections). This will give us Testing Error or Testing Accuracy

6. Also test the built model on training data

7. Compare the training data predicted values and training data actual values

to calculate the error or accuracy. This will give us Training Error or Training

Accuracy

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 105


Machine Learning Primer
8. Training Error and Testing Error

a. If training error and testing error are small and close to each other then

the model is considered to be RIGHT FIT (how low the error values

should be is a subjective evaluation. E.g., In healthcare even 1% error

might be considered high, whereas in a garment manufacturing

process even 8% error might be considered low)

b. If training error is low and testing error is high then the model is

considered to be ________ . _______ is also called ______

c. If training error is high then testing error also will be high. This scenario

is called ___________ or ______

d. If training error is high and testing error is low then something is

seriously wrong with the data or model you built. Redo the entire

project

(Figure: three Y-versus-X fits illustrating __________________, __________________ and __________________)

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 106


Machine Learning Primer
9. ____________ is a common problem and also challenging to solve.

Different Machine Learning algorithms have different regularization

techniques (also called as generalization techniques) to handle

________________

10. __________ problems can be solved easily by increasing the number

of datapoints (observations) and/or features (columns). Also proper feature

engineering and transformation will address this issue

1. High bias, Low variance 2. Low bias, High variance

3. High bias, High variance 4. Low bias, Low variance

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 107


Machine Learning Primer
The challenge of the Training & Testing dataset split, which leads to information leak, is countered by a new school of thought with the idea to split the data into:

• Training Data

• __________ (Development Data)

• __________

Data is split into Training, Validation and Testing sets.

Training:

• Build the model

• Test on Training Data to get Training Error/Accuracy

• Keep validating and keep retraining the model until the desired accuracy is achieved

Validation:

• Test on Validation Data to get Validation Error/Accuracy

• Fine tune the model parameters to get better accuracy

• Pick the best-performing algorithm

• Combine Training & Validation Data to run the model, which has given the desired results based on a set of finalized model parameters. Then test the model on testing data

• Also ensure that the model neither overfits (Variance) nor underfits (Bias)

Testing:

• Test on Testing Data to get Test Error/Accuracy (also called as generalization error)

• Check for model Overfitting or Underfitting

• Run multiple trials and take the average (numerical outcome) or the majority (categorical outcome)

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 108


Model Evaluation Techniques
If the ‘Y’ output variable is __________ then we can use the following list of error
functions to evaluate the model.
Error = Predicted Value - Actual Value (Actual Value is also called as ________ Value)

• ______ (ME): ME = (1/n) ∑ (t = 1 to n) et

• _________ (MAE) or ___________ (MAD): MAD = (1/n) ∑ (t = 1 to n) |et|

• Mean Squared Error (MSE): MSE = (1/n) ∑ (t = 1 to n) et²

• Root Mean Squared Error (RMSE): RMSE = √( (1/n) ∑ (t = 1 to n) et² )

• Mean Percentage Error (MPE): MPE = (1/n) ∑ (t = 1 to n) (et / Yt)

• ______________ (MAPE): MAPE = (1/n) ∑ (t = 1 to n) |et / Yt|

• Mean Absolute Scaled Error (MASE): MASE = MAE / MAE(in-sample, naive)
  MAE(in-sample, naive) is the mean absolute error produced by a naive forecast

• Correlation Coefficient: r = (n∑xy - ∑x ∑y) / √( (n∑x² - (∑x)²)(n∑y² - (∑y)²) )

Example:

Actual Data | Prediction from Model 1 | Error from Model 1 | Prediction from Model 2 | Error from Model 2
100         | 101                     | 1                  | 110                     | 10
200         | 199                     | -1                 | 190                     | -10
300         | 301                     | 1                  | 310                     | 10
400         | 399                     | -1                 | 390                     | -10

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 109
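A from-scratch NumPy sketch of the error functions above, applied to the Model 1 column of the small example table:

import numpy as np

actual = np.array([100, 200, 300, 400])
pred   = np.array([101, 199, 301, 399])
e = pred - actual                        # error = predicted - actual

me   = e.mean()
mae  = np.abs(e).mean()
mse  = (e ** 2).mean()
rmse = np.sqrt(mse)
mpe  = (e / actual).mean()
mape = (np.abs(e) / actual).mean()

print(me, mae, mse, round(rmse, 3), mpe, round(mape, 5))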
Model Evaluation Techniques
If the ‘Y’ is Discrete variable (Classification Models) then we can use the
following list:

Confusion Matrix:
Can be applied for both _______ classification as well as ________
classification models.

Confusion matrix is used to compare predicted values and actual values.

Binary Classification Confusion Matrix:


                               Actual Class
                        Positive               Negative

Predicted    Positive   True Positives (TP)    False Positives (FP)
Class
             Negative   False Negatives (FN)   True Negatives (TN)

True Positive (TP)


• Patient with disease is told that he/she has disease

True Negative (TN)


• Patient with no disease is told that he/she has no disease

False Positive (FP)


• Patient with no disease is told that he/she has disease

False Negative (FN)


• Patient with disease is told that he/she has no disease

Error = 1 - Accuracy

Accuracy = (TP + TN) / (TP + FP + FN + TN)

Accuracy should be greater than the % of the majority class.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 110


Model Evaluation Techniques

Possible Outcomes

Decision Alternatives             | Not Your Wife's Birthday       | Your Wife's Birthday
Did not buy Flowers (No Action)   | Status Quo                     | Wife Angry
Bought Flowers (Action)           | Wife Suspicious, Money Wasted  | Domestic Bliss

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 111


Model Evaluation Techniques
• ______ = TP/(TP+FP) = TP/Predicted Positive = Prob. of correctly

identifying a random patient with disease as having disease

__________ is also called as __________________ (PPV)

Precision = TP / (TP + FP)   (Designers in the formula hide the name precision)

• ______ (Recall or ______ or ______ Rate) = TP/(TP+FN) =

TP/Actual Positive = Proportion of people with disease who are correctly

identified as having disease

Recall = TP / (TP + FN)

• ______ (True negative rate) = TN/(TN+FP) = Proportion of people with no

disease being characterized as not having disease

• ______ (Alpha or type I error) = 1 - Specificity

• FN rate (Beta or type II error) = 1 - Sensitivity

• ____ = 2 * ((Precision * Recall) / (Precision + Recall)); F1: 1 to 0 & defines a

measure that balances precision & recall

F1 Score = (2 × Precision × Recall) / (Precision + Recall)

F1 score is the harmonic mean of precision and recall.

Closer the ‘F1’ value to 1, better the accuracy.
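A sketch of the confusion-matrix based measures with scikit-learn on made-up binary labels (not data from the workbook):

from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fp, fn, tn)

print(accuracy_score(y_true, y_pred))    # (TP + TN) / total
print(precision_score(y_true, y_pred))   # TP / (TP + FP)
print(recall_score(y_true, y_pred))      # TP / (TP + FN), i.e., sensitivity
print(f1_score(y_true, y_pred))          # harmonic mean of precision & recall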

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 112


Confusion Matrix
A Confusion Matrix is also called a Cross Table or ____________. Here is

an example of a multi-class classification problem.

Activity recognition from video

(rows = Actual Class, columns = Predicted Class)

         Bend   Jack   Jump   Run   Skip   Walk
Bend      100      0      0     0      0      0
Jack        0    100      0     0      0      0
Jump        0      0     89     0      0     11
Run         0      0      0    67      0     33
Skip        0      0      0     0    100      0
Walk        0      0     11    33      0    100

The values along the diagonal are right predictions and the values off the

diagonal are incorrect predictions.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 113


ROC Curve

(Figure: ROC curve - True positive rate (sensitivity) from 0% to 100% on the Y-axis versus False positive rate (1 - specificity) from 0% to 100% on the X-axis, showing a perfect classifier, a tested classifier and, along the diagonal, a classifier with no predictive value)

R_____ O________ C_________ Curve was used right from World


War II to distinguish between true signals and false alarms.
The ROC curve has the ‘True Positive Rate (TPR)’ on the Y-axis and ‘False
Positive Rate (FPR)’ on the X-axis.

ROC curve is used to visually depict accuracy.

ROC curve is also used to find the _____ value


(Example: Risk Neutral: should probability be > 0.5 as cut-off value to
categorize a customer under ‘will default’ category; Risk Taking: should the
probability be > 0.8 cut-off to categorize a customer under ‘will default’
category; or Risk Averse: should the probability be > 0.3 cut-off to categorize a
customer under ‘will default’ category)

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 114


ROC Curve
Numerically if one must evaluate the accuracy then AUC
(Area Under the Curve) can be calculated.

Disease Present Disease Absent

Test Positive True Positives False Positives

Test Negative False Negatives True Negatives

0.9 - 1.0 = A (outstanding)

0.8 - 0.9 = B (excellent/good)

0.7 - 0.8 = C (acceptable/fair)

0.6 - 0.7 = D (poor)

0.5 - 0.6 = F (no discrimination)
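A sketch of computing the ROC curve and its AUC with scikit-learn; the predicted probabilities below are invented for illustration:

from sklearn.metrics import roc_curve, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.65, 0.3]   # model probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_score)      # points on the ROC curve
print(list(zip(thresholds.round(2), fpr.round(2), tpr.round(2))))

print(roc_auc_score(y_true, y_score))                  # area under the ROC curve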

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 115


K-Nearest Neighbour
KNN also known as:

• On-demand or Lazy Learning

• __________-Based Reasoning

• __________-Based Reasoning

• Instance-Based Learning

• Rote Learning

• __________ Reasoning

KNN works for both the scenarios : Y is ____________ as well as

____________.

KNN is based on calculating distance among the various points. Distance can

be any of the distance measures such as Euclidean distance discussed in

previous sections.

KNN also has an improved version where _____ _______ are assigned to

the neighbors based on their distance from the query point.

In case of continuous output, the final prediction will be the _______ of all
output values and in case of categorical output, the final prediction will be the

_______________ of all the output values.
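A sketch of KNN classification with scikit-learn; weights="distance" corresponds to the improved, distance-weighted version mentioned above, and the iris data is just a toy example:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

knn = KNeighborsClassifier(n_neighbors=5, weights="distance")
knn.fit(X_train, y_train)            # "lazy" learning: simply stores the data

print(knn.score(X_test, y_test))     # accuracy on the test split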

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 116


(Figure: Model Accuracy versus Model Complexity - training set accuracy keeps rising with complexity, while validation set accuracy peaks at the right level of model complexity)

Choosing ‘K’ value is critical because it is used to solve the problem of

bias-variance tradeoff.

• Low ‘K’ value is ____________________

• High ‘K’ value might introduce data points from _______________

Pros (Advantages) and Cons (Disadvantages)

Strengths:

• Does not depend on the underlying data distribution

• Training process is very fast because no model is built

Weaknesses:

• There is no model produced and hence no interesting relationship among output and inputs is learnt

• Memory requirement is large because the training data is kept in memory for distance calculations

• Testing process is slower in comparison to other models

• Categorical inputs require additional processing

• Suffers from the curse of dimensionality

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 117


Naive Bayes Algorithm
Naive Bayes is a machine learning algorithm based on the principle of probability.

The relationship between ______ events is described using Bayes Theorem.

Probability of event A given that event B has occurred is called as

__________Probability.

                     P(Class) * P(Data | Class)
P(Class | Data) = -------------------------------
                              P(Data)

Posterior probability = (Class Prior or Prior Probability × Data Likelihood given class) / (Data Prior or Marginal Likelihood)

Y = Whether the email is spam or not
X = Whether the email contains the word lottery or not
Spam Lottery
Not Spam Lottery
Spam No Lottery
Spam No Lottery
Not Spam No Lottery
Not Spam No Lottery
Not Spam Lottery
Not Spam Lottery
Spam No Lottery
Spam No Lottery

P(Class) = P(Spam) = No. of times spam appears in the data / Total no. of emails = 5/10

P(Data) = P(Lottery) = No. of times lottery appears in the data / Total no. of emails = 4/10

P(Data | Class) = P(Lottery | Spam) = No. of emails having word lottery given that emails are
spam = 1/5. In total there are 5 spam emails and out of which 1 email has the word lottery.
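A from-scratch sketch of Bayes' rule on the spam / lottery example above; the counts follow the table (5 spam emails, 4 emails with "lottery", 1 overlap):

p_spam = 5 / 10                    # P(Class) - prior
p_lottery = 4 / 10                 # P(Data) - marginal likelihood
p_lottery_given_spam = 1 / 5       # P(Data | Class) - likelihood

p_spam_given_lottery = p_spam * p_lottery_given_spam / p_lottery
print(p_spam_given_lottery)        # posterior P(Spam | Lottery) = 0.25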

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 118


Decision Tree
“___________ Decision Tree” “__________ Decision Tree”
When output is Categorical When output is Numerical

Decision trees are


• Nonparametric ________ model, that works on divide & conquer strategy
• Rule-based algorithm that works on the principle of ____________.
A path from root node to leaf node represents a rule
• Tree-like structure in which an internal node represents a test on an
attribute, each branch represents outcome of test and each leaf node
represents the class label

Three type of nodes

_____ Node | Branch Node / Internal Node / Decision Node | Leaf Node / _______ Node

(Figure: a tree with the Root Node at the top, Branch Nodes in the middle and Leaf Nodes at the bottom)

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 119


A Greedy Algorithm
To develop a Decision Tree, consider 2 important questions:

Q1. Which _______________________


Q2. When to ______________________

Start with 100% of the data → Identify an ______ to split → Attributes should be Discrete (if continuous, then ______ _______) → Tree is developed using a statistical measure (________ _______) → Generate the most homogeneous pair of branches

Conditions to Stop:

• All records of the branch are of __________

• No attributes to further split

• No records left

(Example: records with columns Age, CR, Class split into branches, e.g., >40/Fair/Yes, >40/Excellent/No, <=30/Excellent/Yes, <=30/Fair/Yes, 31..40/Fair/Yes, 31..40/Excellent/Yes)

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 120


Information Theory 101
If the event is very _____, then the ________ content in the event is very low.

Examples

• Sun rises in the east - Extremely high probability (p = 1) - Information content lowest ('0' bits)

• Occurrence of earthquake in Kuala Lumpur - Very low probability - High information content

In conclusion “Information Content is proportional to Rarity”

I(event) = log2( 1 / Prob(event) ) = -log2 Prob(event)

Entropy:
• Entropy is the expected information content of all the events
• Entropy value of 0 means the sample is completely homogeneous
• Entropy value of 1 means the sample is completely heterogeneous

H(p = (p1 ... pn)) = ∑ (i = 1 to n) pi log(1/pi) = -∑ (i = 1 to n) pi log(pi)

Purity = Accuracy = 1 - Entropy

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 121


Information Theory 101
In Accuracy we assign the _______________ Label to each region.

Counts per region (Royal Blue, Sky Blue):   (5, 60)     (40, 40)   (10, 60)
Dominant Label:                             Sky Blue    NA         Royal Blue
Accuracy - Sky Blue:                        60/65       40/80      60/70

Entropy is a measure of disorder or impurity (variation/______________)

Decision trees find attributes which return the most homogeneous branches.

Purity can also be measured using the GINI Measure, which is the Expected Accuracy with __________ Labeling.

Dominant - Royal Blue:   5/65     40/80    10/70
Accuracy - Sky Blue:     60/65    40/80    60/70

GINI example: (5/65 × 5/65) + (60/65 × 60/65)   and   (60/70 × 60/70) + (10/70 × 10/70)

After calculating the measure of _____, one must decide on which feature to
split. For this, one must measure the change in __________ resulting from
a split on each possible feature. This calculation is known as __________

Information gain of a feature is the difference between entropy in the segment


before the split (S1) and partitions resulting from the split (S2).

InfoGain (F) = Entropy (S1) - Entropy (S2)
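A from-scratch sketch of entropy and information gain for one binary split; the class counts are invented purely to illustrate the formula:

import numpy as np

def entropy(counts):
    p = np.array(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-(p * np.log2(p)).sum())

# Parent node: 10 positives, 10 negatives; split into two branches of 10 each
parent = entropy([10, 10])                        # 1.0 - completely heterogeneous
left, right = entropy([9, 1]), entropy([1, 9])    # much purer branches

weighted_children = (10 / 20) * left + (10 / 20) * right
info_gain = parent - weighted_children            # Entropy(S1) - Entropy(S2)
print(round(parent, 3), round(weighted_children, 3), round(info_gain, 3))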

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 122


Information Theory 101
The less the variation in class labels after the split, the better the _____

Information gain: Decrease in the _______ (variation) after the dataset is split on an attribute.

Higher homogeneity implies higher information gain.

Entropy before the split = E1; Entropy after the split = E2

Information gain (I1) = E1 - E2, where E1 > E2

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 123


Pros and Cons of Decision Tree
Strengths:

• Uses the important features during decision making

• Interpretation is very simple because no mathematical background is needed

Weaknesses:

• Biased towards factors (features) which have a lot of levels

• Small changes in the data will result in large changes to decision making

Model overfitting can be addressed using __________ techniques.

__________ is the regularization technique used in the Decision Tree.


Pruning is the process of reducing the size of the tree to generalize the
unseen data.

Two ________ techniques are

Pre-_____ or Early Stopping:

• Stopping the tree from growing once the desired condition is met

• Stop the tree from growing once it reaches a certain number of decisions

• Stop the tree from growing if decision nodes contain only a small number of examples

• Disadvantage: When to stop the tree from growing? What if an important pattern was prevented from being learnt?

Post-_____:

• Grows the tree completely and then applies conditions to reduce the tree size

• Example: if the error rate is less than 3%, then reduce the nodes. So, the nodes and branches that have less reduction of errors are removed

• This process of grafting branches is known as subtree raising or subtree replacement

Post _______ is more effective than pre- _______

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 124


Continuous Value Prediction
Scatter Diagram - Visual representation of the relationship between
two continuous variables

Strong Positive Correlation Moderate Positive Correlation No Correlation

Moderate Negative Correlation Strong Negative Correlation Curvilinear relationship

Correlation Analysis - Measures the correlation between two variables

r = +1: Perfect Positive Correlation

r close to +1: Strong Positive Association

r close to -1: Strong Negative Association

r close to 0: Weak or No Association

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 125


Linear Regression
Equation of straight line that we have learnt in our school days

______________________________

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 126


Ordinary Least Squares
(Figure: Fitting a straight line by least squares - an observed value of y when x equals x0, the error term, the fitted line ŷ = β0 + β1x, the mean value of y when x equals x0, the y-intercept β0 and the slope β1; x0 = a specific value of x, the independent variable)

y = β0 + β1x + ε;  ε = error term

____________________ Technique to find the best fit line.

The best fit line is the line which has minimum square deviations from all the data points
to the line.

To improve the accuracy, transformations can be applied, this will ensure that the data has a
linear pattern with minimum spread.

Coefficient of Determination R2 - also known as goodness of fit, is the measure of


predictability of Y (dependent variable) when X’s (independent variable) are given.

It can be interpreted as the % of variability in output (Y) that can be explained with the

_______________ (X)

R² = SSR/SST = SSR / (SSR + SSE)

0 ≤ R² ≤ 1

Where,

SSR = ∑(ŷ - ȳ)² (measure of explained variation)

SSE = ∑(y - ŷ)² (measure of unexplained variation)

SST = SSR + SSE = ∑(y - ȳ)² (measure of the total variation in y)
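A NumPy sketch of ordinary least squares and R²; the x-y data is synthetic, generated around an assumed straight line, not taken from the workbook:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 3.0 * x + rng.normal(0, 1.5, size=x.size)   # assumed true line + noise

b1, b0 = np.polyfit(x, y, 1)            # least-squares slope and intercept
y_hat = b0 + b1 * x

sse = np.sum((y - y_hat) ** 2)          # unexplained variation
ssr = np.sum((y_hat - y.mean()) ** 2)   # explained variation
r2 = ssr / (ssr + sse)
print(round(b0, 2), round(b1, 2), round(r2, 3))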

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 127


Model Assumptions
________________

________________

________________

________________

Problems arise while linear regression model training:

____________ : Errors are dependent on each other

____________ : Errors have non-constant variance

____________ : Independent variable pair are linearly dependent


on each other

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 128


Logistic Regression
Predicts the ____________ of the outcome class.

The algorithm finds the linear relationship between independent


variables and a link function of these probabilities.

The link function that provides the best goodness-of-fit for the
given data is chosen.

(Figure: probability versus predictor - a linear regression line can go below 0, whereas the logistic regression curve stays between 0 and 1)

The output from logistic regression will lie between 0 to 1.

The logistic regression curve is known as ________________ Curve.

Probability values are segregated into binary outcomes using a _________


value. The default cutoff is treated as 0.5 (50%)

• If probability of an event > 0.5; then Event is considered to be True


(predicted outcome = 1)

• If probability of an event </= 0.5; then Event is considered to be False


(predicted outcome = 0)

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 129


Logistic Regression
The logistic regression performed using:

Y = β0 + β1x1 + β2x2 + ... + βkxk + ε

Where,

β0 = the y _______________

βi = the model coefficient for the linear effect of variable i on y

ε = the random error

The probability function:

p = e^Y / (1 + e^Y); where e = 2.7183

The output of the logistic regression will give a sigmoid


curve (also known as S curve)

(Figure: Logistic Regression Curve - Probability on the Y-axis versus Inspection Time on the X-axis)

Interpretation: Probability p indicates that the event has a chance 'p' for a given _______________

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 130


Support Vector Machine
SVMs can be adapted to use with nearly any type of learning task, including

both __________ and _______________.

SVM is inspired from statistical learning theory.

Other names: Large-margin classifier, Max-margin classifier, __________

Two Dimensions Three Dimensions

The task of the SVM algorithm is to identify a line that separates the two

classes in a binary problem. However, in a multidimensional problem a line

cannot separate the classes.

The goal of an SVM is to create a flat boundary called a __________,

which divides the space to create _______________ partitions.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 131


Support Vector Machine
There is more than one choice of dividing line between the
groups of circles and squares.

(Figure: lines a, b and c all separate the circles from the squares; the maximum-margin line is the one defined by the support vectors)

SVM searches for M_________ M_____ H_________ (MMH)

MMH is as far away as possible from outer boundaries (convex hull) of the two

groups of data points.

The maximum margin linear classifier is the linear classifier with the maximum

margin. This is the simplest kind of SVM (Called an LSVM).

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 132


Support Vector Machine
Non-Linear Spaces

(Figure: the Kernel Trick Ф maps points from the Input Space (Latitude versus Longitude) to a Higher Dimension Space (Altitude versus Longitude), where Sunny and Snowy points become separable)

Kernel Tricks
A key feature of SVMs is their ability to map the problem into a higher

dimension space using a process known as the _____ trick. After

the _____ trick has been applied, we look at the data through the

lens of a new dimension and a nonlinear relationship may suddenly

appear to be quite linear.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 133


Support Vector Machine
Kernel Functions
• The linear kernel does not transform the data at all. Therefore, it can be
expressed simply as the dot product of the features:

K(xi, xj) = xi · xj

• The______ kernel results in a SVM model somewhat analogous to a


neural network using a sigmoid activation function. The Greek letters kappa
and delta are used as kernel parameters

K(xi, xj) = tanh(κ xi · xj - δ)

• The polynomial kernel of degree d adds a simple non-linear


transformation of the data

K(xi, xj) = (xi · xj + 1)^d

• The _________ kernel is similar to a RBF neural network. The RBF kernel
performs well on many types of data and is thought to be a reasonable
starting point for many learning tasks

K(xi, xj) = e^( -‖xi - xj‖² / (2σ²) )
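A sketch of fitting SVMs with the kernels listed above using scikit-learn; the two-class dataset is synthetic and the parameter values are illustrative:

from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel, degree=3, gamma="scale").fit(X, y)
    print(kernel, round(clf.score(X, y), 3))   # training accuracy per kernel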

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 134


Deep Learning/Neural Network
Artificial Neural Network is used to mimic Biological Neural Network.

Deep Learning is named in this way because it has many _____


________ to the output.

__________ Models versus __________ Learning Models

__________ Extraction from the data


(Images, Speech, Text, Videos (videos are a subset of images))is
automatically performed using Deep Learning models.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 135


Deep Learning/Neural Network
The multiple layers capture compositionality:

Image Recognition

Each layer captures some features, For Example:

Pixel → Edge → Texton → _____ → Part → Object

Initial layers capture __________ features

Next layers capture __________ features

Final layers capture __________ features

At the end, the classifier will predict the output.

Low-Level feature → Mid-Level feature → High-Level feature → Trainable Classifier → Car

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 136


Deep Learning/Neural Network
Deep Learning Learns Layers of Features

Linear Transformation → Linear Transformation → Linear Transformation → Linear Transformation

Speech data is processed through multiple


layers and __________ is captured.

Sample → _____ → Sound → Phone → Phoneme → Word → ____

Text data is processed through Deep Learning


layers and compositionality is captured.

Text: Character → Word → Word Group → Clause → _____ → Story

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 137


Deep Learning/Neural Network
Shallow Machine Learning Models:

Feature extraction from the data is performed manually.

Vision / Images / Videos: Videos are broken into ___________


and from each __________, the features are extracted.

Techniques used to extract features:

Vision:

• Unsupervised Learning: K-Means, SIFT (Scale Invariant Feature Transform), HOG (Histogram of Oriented Gradients)

• Supervised Learning: KNN (K-Nearest Neighbor)

Speech:

• Unsupervised Learning: Mixture of Gaussians, MFCC (Mel Frequency Cepstral Coefficient)

• Supervised Learning: Hidden Markov Models (HMMs)

Text (example: "Love the way 360DigiTMG delivers training on Data Science & Artificial Intelligence"):

• Unsupervised Learning: Bag of Words (BoW), n-grams, Term Document Matrix (TDM)

• Supervised Learning: Naive Bayes
Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 138


Perceptron Algorithm:
Artificial Intelligence is about trying to mimic a human brain.

An Artificial Neural Network (ANN) models the relationship between a set of


_____ signals and an _____ signal using a model derived from our
understanding of how a biological brain responds to stimuli from sensory
inputs. Just as a brain uses a network of interconnected cells called Neurons
to create a massive parallel processor, ANN uses a network of artificial neurons
or nodes to solve learning problems.

(Figure: a biological neuron - dendrites, nucleus, soma, axon, myelin, Schwann cell, node of Ranvier and axon terminal)

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 139


Deep Learning/Neural Network
Simple Neural Network components:

1. Input layer - contains the numbers of _____ equal to the


number of input features

2. Input layer also has one additional neuron called _____, which is
equivalent to the ‘b’ (y-intercept) in the equation of the line y = b + mx

3. ‘b’, ‘w1’, ‘w2’, ‘w3’,....... are called as weights and are ________ initialized

4. These neurons are also called as nodes and are connected via an edge to
the neuron in the next layer

5. __________function (usually summation) is used to __________


all the inputs and corresponding weights
f(x) = b + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 -> This equation will give a
numerical output

6. The output of the integration function is passed on to the


_______________ component of the neuron

(Figure: an artificial neuron - inputs x0 = 1, x1, x2, ..., xn with weights b, w1, w2, ..., wn (the "dendrites") feed a Summation in the cell body or soma, followed by an Activation that produces y = f(wx + b) on the "axon")

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 140


Deep Learning/Neural Network
(Figure: inputs x0 = 1, x1, ..., xn with weights w0 = b, w1, ..., wn are combined by the net input function net = ∑ (i = 0 to n) wi xi, passed through the activation function o = σ(net) = 1 / (1 + e^-net) to give the output; the error δ = a - y drives the weight update)

7. Based on the functioning of ________ function, the final output is


predicted

8. Predicted output and actual output are compared to calculate the _____
function / _____ function (error calculated for each record is called as
_____ function and combination of all these individual errors is called as
cost function)

9. Based on this error, the __________ algorithm is used to go back in the


network to update the weights

10. Weights are updated with the objective of minimizing the error and this
minimization of error is achieved using _______________ Algorithm

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 141


Deep Learning/Neural Network
Perceptron Algorithm

The Perceptron algorithm was proposed back in 1958 by Frank Rosenblatt


(Cornell Aeronautical Laboratory).

Neural network with no hidden layers and a single output neuron is called a
__________ Algorithm.

The __________ algorithm can only handle _____ boundaries. _____


boundaries are handled using the Multi-Layered __________ algorithm.

Weight updation as part of _______________ algorithm is done using


the following formula:

Randomly initialize weight vector w0

Repeat until error is less than a threshold γ or max_iterations M:

    For each training example (xi, ti):

        Predict output yi using current network weights wn

        Update weight vector as follows:

        wn+1 = wn + η * (ti - yi) * xi     (η = learning rate; (ti - yi) = error)
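A from-scratch NumPy sketch of the perceptron update rule above; the tiny dataset (a logical AND gate), the step threshold and the learning rate are all illustrative choices:

import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0, 0, 0, 1], dtype=float)        # targets: AND gate
X = np.hstack([np.ones((4, 1)), X])            # prepend x0 = 1 for the bias weight

w = np.zeros(3)                                # w0 (bias), w1, w2
eta = 0.1                                      # learning rate

for epoch in range(20):
    for xi, ti in zip(X, t):
        yi = 1.0 if xi @ w > 0 else 0.0        # step activation
        w = w + eta * (ti - yi) * xi           # w_new = w_old + eta * error * x

print(w, [(1.0 if xi @ w > 0 else 0.0) for xi in X])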

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 142


Deep Learning/Neural Network
__________ are updated so that the error is minimized.

Learning Rate is also called Eta value and ranges from 0 to 1.

A value close to 0 would mean _____ steps to arrive at the bottom of the error surface.

A value close to 1 would mean __________ the bottom of the error surface.

Constant learning rate creates a


problem of bouncing around the bowl.
The gradient will never reach the
bottom of the error surface.

This problem is solved using


Changing Learning Rate
(_______________).

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 143


Deep Learning/Neural Network
____________ - Learning rate is reduced epoch after epoch, until it reaches the end of a defined number of epochs.

____________ - Learning rate is the same for a fixed number of epochs and then it starts reducing after every epoch until the defined number of epochs reaches the end.

____________ - Learning rate is reduced after a set of fixed number of epochs (e.g., learning rate will be reduced by 10% after every 5 epochs).

_________________________ - Learning rate is reduced when it is observed that the error stops reducing.

(Figures: Learning Rate (0.0 to 0.2) versus Epochs (0 to 20) for each schedule)

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 144


Deep Learning/Neural Network
Gradient Primer:

Rate of change

Gradient is also called as ________

Slope

Curves/Surfaces should be
continuous and smooth
(_____ /sharp points)

Curves / Surfaces should


be __________

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 145


Deep Learning/Neural Network
Gradient Descent Algorithms Variants:

A few definitions:

Iteration: Equivalent to when a weight update is done

Epoch: When entire training set is used once to update the weights

Batch Gradient Descent (example: 1 epoch, 10000 training records):

• Iterations per epoch: 1

• Weights are updated once, after all 10000 training records are passed through the network

Stochastic Gradient Descent (example: 1 epoch, 10000 training records):

• Iterations per epoch: 10000

• Weights are updated after each training sample passes through the network. If we have 10000 training samples, then weights are updated 10000 times

Mini-batch Stochastic Gradient Descent (example: 1 epoch, 10000 training records, minibatch size 100):

• Iterations per epoch: 10000/100 = 100

• Weights are updated after every minibatch (100 records in this case) is passed through the network. Records within a minibatch are randomly chosen

Other advanced variants of Mini-Batch SGD: Momentum, Nesterov Momentum, Adagrad, Adadelta, RMSprop, Adam

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 146


Deep Learning/Neural Network
Empirically Determined components are:

1. Number of hidden layers

2. Number of _____ within each hidden layer

3. _______________

4. Error/Cost/Loss Functions

5. _______________ Methods

Y (output)               | No. of neurons in output layer         | Activation Function in output layer | Loss Function
Continuous               | 1                                      | Linear / Identity                   | ME, MAE, MSE, etc.
Discrete (2 categories)  | 1 for a binary classification problem  | Sigmoid / Tanh                      | Binary Cross Entropy
Discrete (>2 categories) | 10 if we have a 10 class problem       | Sigmoid                             | Categorical Cross Entropy

Note: Hidden layers can have any activation function and majorly _________
activation functions seem to be giving good results.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 147


Multi-Layered Perceptron (MLP) /
Artificial Neural Network (ANN)
Non-Linear patterns can be handled in two ways:
Changing _______________ Function:

Quadratic Function: f = ∑ (j = 1 to m) wj xj² - θ

Spherical Function: f = ∑ (j = 1 to m) (xj - wj)² - θ

(Figure: a network with an Input Layer x1 ... xn, a Hidden Layer, and an Output Layer y1 ... yn)

The presence of hidden layers alone will not capture the ________ pattern. The activation function to be used should be non-linear. Usage of linear or identity activation functions within the neurons of the hidden layer will only capture linear patterns.

If no activation functions are specified in the layers, then by default the network assumes ________________.

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 148


Multi-Layered Perceptron
List of activation functions include:

• Identity function (Linear function): a(f) = a

• ___ function

• Ramp function

• Sigmoid function

• Tanh function

• ReLU (Rectified Linear Unit) function

• _____ ReLU

• ELU (Exponential Linear Unit)

• Maxout

• _____ - the output neurons give Probability of A, Probability of B, Probability of C

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 149


Multi-Layered Perceptron
Regularization Techniques used for Overfitting

L1 regularization / L1 _____ decay term

L2 regularization / L2 _____ decay term

Weight Decay Term:

J(θ) = (1/m) ∑ (i = 1 to m) ½ (hθ(x(i)) - y(i))²  +  (λ/2) ∑ (l = 1 to nl-1) ∑ (i = 1 to sl) ∑ (j = 1 to sl+1) (wji(l))²

___ stopping:

(Figure: Accuracy versus Epoch - training set accuracy keeps increasing while test set accuracy starts to fall where overfitting begins; early stopping halts training at that point)

Error-change criterion

1. Stop when error isn't dropping over a window of, say, 10 epochs

2. Train for a fixed number of _____ after criterion is reached (possibly with
lower learning rate)

Weight-change criterion

1. Compare _____ at epochs t-10 & t and test:  max over i of | wi(t) - wi(t-10) | < ρ

2. Possibly express as a _______ of the weight

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 150


Multi-Layered Perceptron
Dropout

It is an interesting way to perform model averaging in Deep Learning

Training Phase: For each hidden layer, for each training sample, for each
iteration, ignore (zero out) a __________, p, of nodes (and corresponding
activations).

Test Phase: Use all __________, but reduce them by a factor p (to account
for the missing activations during training).

Randomly select a subset of _____ and force their output to ________.

(Figure: a standard neural net before and after applying dropout - inputs v1 ... v4 feed nodes r1 ... r3, with some activations forced to zero)

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 151


Multi-Layered Perceptron
Drop Connect:

Very similar to _____, however, we disable the weights instead of the nodes.

Here the nodes are partially active.

(Figure: drop connect - individual weights between v1 ... v4 and r1 ... r3 are disabled instead of whole nodes)

Noise:

Data Noise:

• Add noise to data while _____

Label Noise:

• Disturb each training sample with the probability

• For each disturbed sample, the label is randomly drawn from a uniform distribution regardless of the true label

Gradient Noise:

• Add _____ to the gradient

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 152


Multi-Layered Perceptron
Batch Normalization:
Input: Values of x over a mini-batch: B = {x1...m};
Parameters to be learned: γ, β
Output: {yi = BNγ,β(xi)}

μB ← (1/m) ∑ (i = 1 to m) xi                     // mini-batch _____
σB² ← (1/m) ∑ (i = 1 to m) (xi - μB)²            // mini-batch _____
x̂i ← (xi - μB) / √(σB² + ε)                      // normalize
yi ← γ x̂i + β = BNγ,β(xi)                        // __________
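A from-scratch sketch of the batch-normalization forward pass above for one mini-batch of activations; the gamma, beta and epsilon values are illustrative, not learned here:

import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0])     # activations for one unit over a mini-batch
gamma, beta, eps = 1.0, 0.0, 1e-5      # scale/shift (fixed here) and epsilon

mu = x.mean()                          # mini-batch mean
var = x.var()                          # mini-batch variance
x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
y = gamma * x_hat + beta               # scale and shift

print(np.round(x_hat, 3), np.round(y, 3))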

• Batch Normalization layer is usually inserted before __________ layer


(after Fully Connected or Dense Layer)

• Reduces the strong dependence on weight initialization

Shuffling inputs:

• Choose examples with maximum __________ content

• Shuffle the training set so that successive __________ examples never


(rarely) belong to the same class

• Present input examples that produce a large error more frequently than
examples that produce a small error. Why? It helps to take large steps in the
Gradient descent

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 153


Multi-Layered Perceptron
Weight Initialization Techniques:

_____ initialization:  uniform( -√(6 / (fan_in + fan_out)),  √(6 / (fan_in + fan_out)) )

Caffe implements a simpler version of Xavier's initialization:
uniform( -√(2 / (fan_in + fan_out)),  √(2 / (fan_in + fan_out)) )

_____ initialization:  uniform( -√(4 / (fan_in + fan_out)),  √(4 / (fan_in + fan_out)) )

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 154


Forecasting
Time Series vs _______________ Data

Time Series Data:

Data that is collected over equal spaced time intervals and the time interval is

also an essential part of the data.

____________ Data:

Data that can be collected at a single point of time.

Forecasting is the use of various modeling techniques to predict a future

outcome on the basis of historical time series data.

(Figure: Real Estate Price versus time, from 2001 through 2018 to 2035)

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 155


Forecasting
EDA - Components of Time Series

EDA: _______ and Visual

EDA in time series is mostly visual.

Elements of visualization in time series:

• Time plot

• Lag Scatter Plot

• _____ Plot

• Stacked Area Chart

• ______ Chart

• Heat Map (for multi-variable time series data)

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 156


Forecasting
Data Partition

Time series should be split in __________ order

Most Recent period data will be chosen as Validation data.

Training Data: Fit the model only to the Training period

Validation Data: Assess performance on the Validation period

(Timeline: Training → Validation → Future)

Conditions to choose the validation period:

• Forecast Horizon

• ____________

• Length of Time series

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 157


Forecasting
Forecasting Model

There are predominantly 2 approaches to forecasting

Model-Driven (_______ similar to future):

1. Linear Regression

2. Autoregressive models

3. ARIMA

4. ____________

5. ____________

Data-Driven (_______ similar to future):

1. Naïve forecasts

2. _________

3. Neural nets

(Figures: Sales versus Months - Sales Forecast; Sales versus Months - Centered MA(4))

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 158


Forecasting
Smoothing Techniques

Moving Average:

• ______ Moving Average

• ______ Moving Average

Exponential Smoothing:

• Simple Exponential Smoothing

• Holt's Method / Double Exponential Smoothing

• ____________ Method

MA (Moving Average) versus ES (Exponential Smoothing):

• MA assigns equal weights to all past observations; ES assigns ___________ to recent observations than to past observations

• MA is better to forecast when data & environment are not _________; ES is better when data & environment are _________

• For MA, the window width is key to success; for ES, the smoothing constant (α, β, γ) value is key to success (0 < α, β, γ ≤ 1)
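A pandas sketch of a simple moving average and simple exponential smoothing; the monthly sales figures and the window/alpha choices are invented for illustration:

import pandas as pd

sales = pd.Series([112, 118, 132, 129, 121, 135, 148, 148, 136, 119],
                  name="sales")

ma4 = sales.rolling(window=4).mean()             # window width is the key choice
ses = sales.ewm(alpha=0.3, adjust=False).mean()  # smoothing constant alpha

print(pd.concat([sales, ma4.rename("MA4"), ses.rename("SES")], axis=1))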

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 159


Forecasting
De-Trending and De-Seasoning

Regression:

• To remove ______ and/or ______, fit a regression model with trend and/or seasonality

• Series of forecast errors should be de-trended & deseasonalized

Differencing:

• Simple & popular for removing trend and/or seasonality from a time series

• Lag-1 difference: Yt - Yt-1 (for removing ______); Lag-M difference: Yt - Yt-M (for removing ______)

• Double differencing: difference the differenced series

Moving Average:

• Uses moving average to remove ______

• Generates seasonal indexes as a byproduct

Data Science Workbook © 2020 360DigiTMG. All Rights Reserved 160


Accreditation to international certification bodies

For further details, call us at

1800-212-654321

[email protected] 360digitmg.com

2-56/2/19, 3rd Floor, Vijaya Towers, Ayyappa Society Road, Madhapur, Hyderabad, Telangana 500008

USA | INDIA | MALAYSIA | ROMANIA | SOUTH AFRICA | DUBAI | BAHRAIN
