360DigiTMG Practical Data Science New
Ingredients of AI
- Artificial Intelligence
- Data Science
- Data Mining
- Machine Learning
- Deep Learning
- Reinforcement Learning
Stages of Analytics
CRISP - DM
- CRISP - DM Business Understanding
- CRISP - DM Data Collection
• Data Types
• Different Scales of Measurement
• Data Understanding
• Qualitative vs Quantitative
• Structured vs Unstructured
• Big Data vs Non-Big Data
• Cross Sectional vs Time Series vs Longitudinal Data
• Balanced vs Unbalanced
• Data Collection Sources
- Primary Data
- Secondary Data
• Preliminaries for Data Analysis
• Probability
• Base Equation
• Random Variables
• Probability Distributions
• Sampling Techniques
• Inferential Statistics
• Non-Probability Sampling
• Probability Sampling
• Sampling Funnel
- CRISP - DM Data Cleansing / Data Preparation
• Outlier Treatment
• Winsorization
• Alpha Trimmed
• Missing Values
• Imputation
• Transformation
• Normalization/Standardization
• Dummy Variables
• Type Casting
• Handling Duplicates
• String Manipulation
• Unsupervised Preliminaries
- Distance Calculation
- Linkages
• Clustering / Segmentation
- K-Means Clustering
- Disadvantages of K-Means
- K-Means++ Clustering
- K-Medians Clustering
- K-Medoids
- Partitioning Around Medoids (PAM)
- CLARA
• Hierarchical Clustering
- Disadvantages of Hierarchical Clustering
• Density Based Clustering: DBSCAN
• OPTICS
• Grid-Based Clustering Methods
• Three Broad Categories of Measurement in Clustering
• Most Common Measures
• Clustering Assessment Methods
• Finding K Value
• Mathematical Foundations
• Dimension Reduction
- PCA
- SVD
- LDA
• Association Rules
- Support
- Confidence
- Lift
• Recommender Systems
- Types of Recommendation Strategies
- Collaborative Filtering
- Similarity Measures
- Disadvantages
- Alternative Approaches
- Recommendations vs Association Rules
- New Users and New Items
• Network Analysis
- Applications
- Degree Centrality
- Closeness Centrality
- Betweenness Centrality
- Eigenvector Centrality
- Edge / Link Properties
- Cluster Coefficient
• Text Mining
- Examples of Sources
- Pre-Process the Data
- Document Term Matrix / Term Document Matrix
- Word Cloud
- Natural Language Processing (NLP)
- Natural Language Understanding (NLU)
- Natural Language Generation (NLG)
- Parts of Speech Tagging (POS)
- Named Entity Recognition (NER)
- Topic Modelling
- LSA / LSI
- LDA
- Text Summarization
• Data Mining Supervised Learning
• Machine Learning Primer
- Key Challenges
• Model Evaluation Techniques
- Errors
- Confusion Matrix
- Cross Table
- ROC Curve
• K-Nearest Neighbor
- Choosing K Value
- Pros and Cons
• Naive Bayes Algorithm
• Decision Tree
- Three Types of Nodes
- Greedy Algorithm
- Information Theory 101
- Entropy
- Pros and Cons of Decision Tree
• Scatter Diagram
• Correlation Analysis
• Linear Regression
- Ordinary Least Squares
- Model Assumptions
• Logistic Regression
• Support Vector Machine
- Hyperplane
- Non-Linear Spaces
- Kernel Tricks
- Kernel Functions
• Deep Learning Primer
- Image Recognition
- Speech Data
- Text Data
- Shallow Machine Learning Models
• Perceptron Algorithm
- Biological Neuron
- Simple Neural Network Components
- Perceptron Algorithm
- Learning Rate
- Gradient Primer
- Gradient Descent Algorithms Variants
- Empirically Determined Components
• Multi-Layers Perceptron (MLP) / Artificial Neural Network (ANN)
- Non-Linear Patterns
- Integration Function
- Activation Function
- Regularization Techniques Used for Overfitting
- Error-Change Criterion
- Weight-Change Criterion
- Dropout
- Drop Connect
- Noise
- Batch Normalization
- Shuffling Inputs
- Weight Initialization Techniques
• Forecasting
- Time Series vs Cross Sectional Data
- EDA - Components of Time Series
• Systematic Part
• Level
• Trend
• Seasonality
- Non-Systematic Part
• Noise/Random
- Data Partition
- Forecast Model
- Model-Driven Techniques
- Data-Driven Techniques
- Smoothing Techniques
- Moving Average
- Exponential Smoothing
- De-Trending and De-Seasoning
- Regression
- Differencing
- Moving Average
Ingredients of AI
- Artificial Intelligence
- Data Science
- Data Mining
- Machine Learning
- Deep Learning
- Reinforcement Learning
Examples of AI
· ________ (Video Analytics & Image Processing)
y = f(x)
· __________
5. Classification Techniques
not known
_________ extracted.
• Q Learning etc.
Reinforcement Learning
Reinforcement Learning is a special branch of ___________ Learning
______
______ __________ defines the cumulative future reward.
1. Descriptive Analytics
2. Diagnostic Analytics
3. Predictive Analytics
Example: What will be the number of Covid-19 cases for the next month?
4. Prescriptive Analytics
Example: What should be done to avoid the spread of Covid-19 cases, which
CRISP - DM
1. ____________
2. ____________
3. Data Cleansing / Preparation and Exploratory Data Analysis
4. Data Mining - Model Development
5. ____________
6. ____________
____________ Business Objectives ____________

SMART:
- S____________
- Measurable
- A____________
- R____________
- Time-Bound

Key Deliverable: ________________
collection.
Data Types
_____________________ VS _____________________
_______________________________________________
_______________________________________________
_______________________________________________
_______________________________________________
_______________________________________________
_______________________________________________

_____________________ VS _____________________
Examples                              Examples

Qualitative examples:
• Binary
• Nominal
• Categorical (Multiple)
• Ordinal
Quantitative examples:
Audio Files / Speech data can be converted into features using _________
to make it Structured.
Example: "But I, being poor, have only my dreams; I have spread my dreams" (the original figure records the count of each word, with repeated words such as 'dreams' receiving counts above 1).
_____________________ VS _____________________
Techniques to handle Imbalanced Data:
• Bootstrap Resampling
• Cluster-Based Sampling
• Ensemble Techniques
Primary Data
Secondary Data
Survey steps:
1. Understand the business __________ and __________ behind
conducting the survey. E.g. Sales are low for a training company
2. Perform __________ analysis - __________ Analysis, 5-_____
Analysis, etc. E.g. Product Pricing is uncompetitive
3. Formulate Decision Problem. E.g. Should product prices be changed
4. Formulate Research _________. E.g. Determine the price elasticity of
demand and the impact on sales and profits of various levels of price
changes
5. List out Constructs. E.g. Training Enrolment
6. Deduce Aspects based on construct. E.g. Time aspect, Strength aspect,
Constraint aspect
7. Devise Survey ____________ based on the _______. E.g. I am most
likely to enroll for the training program in: In the next one week, In the next
one month, In the next one quarter, etc.
Probability = # ________ / # Total events
Properties of Probability:
• Ranges from 0 to 1
• Summation of probabilities of all values of an event will be equal to 1
Example (coin toss):
P(H) = H / (H & T) = 1 / 2 = 0.5
Example (two red and two black balls):
P(Red) = 2 / (2(R) + 2(B)) = 2 / 4 = 0.5
• Response            • Explanatory
• _________           • __________
• Criterion           • __________
                      • Exposure variable
X = {1, 2, 3, 4, 5, 6}
is called __________________.
called __________________.
X     P(X = x)
0     0.40
1     0.25
2     0.20
3     0.05
4     0.10
_________ Sampling
The priority varies for the data that is to be collected to represent the population.
1. Convenience Sampling
2. Quota Sampling
3. Judgment Sampling
4. Snowball Sampling

_________ Sampling
An approach for inferential statistics. Each data point to be collected will have a known chance of being selected.
1. ____________ Sampling
2. Systematic Sampling
3. Stratified Sampling
4. Clustered Sampling
Population
Sampling Frame
Sample
___________, ___________.
___________________
___________________
___________________
All values below the 5th percentile are changed to the 5th percentile value, and all values above the 95th percentile are changed to the 95th percentile value.

Alpha Trimmed: all the lower & upper 5% values are trimmed or removed.
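A minimal sketch of both treatments in Python (the sample array and 5% limits are hypothetical; assumes NumPy and SciPy are installed):

    import numpy as np
    from scipy.stats.mstats import winsorize

    x = np.array([1, 12, 13, 14, 15, 16, 17, 18, 19, 250])  # hypothetical data with outliers

    # Winsorization: clip the lowest and highest 5% to the 5th/95th percentile values
    x_winsorized = winsorize(x, limits=[0.05, 0.05])

    # Alpha trimming: drop the lowest and highest 5% of values entirely
    lo, hi = np.percentile(x, [5, 95])
    x_trimmed = x[(x >= lo) & (x <= hi)]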
• ________________________ (MAR)
• ________________________ (MCAR)
Wide variety of Techniques are available, choosing the one which fits the data is
an art:
______________ ______________
• Regression Imputation
• KNN Imputation
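A short sketch of mean and KNN imputation with scikit-learn (the array is hypothetical):

    import numpy as np
    from sklearn.impute import SimpleImputer, KNNImputer

    X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])  # hypothetical data with missing values

    # Mean imputation: replace NaNs with the column mean
    X_mean = SimpleImputer(strategy="mean").fit_transform(X)

    # KNN imputation: replace NaNs using the nearest neighbours' values
    X_knn = KNNImputer(n_neighbors=2).fit_transform(X)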
Types of transformation
• Logarithmic
• ____________
• Square Root
• ____________
• Box-Cox
• Johnson
____________
• Adaptive Binning
____________ or__________________________.
X' = (X - min(X)) / (max(X) - min(X))
_______________.
________________
_____________ Scheme
_______________
Label Encoding
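A brief illustration of dummy variables and label encoding, assuming pandas and scikit-learn (the 'city' column is hypothetical):

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    df = pd.DataFrame({"city": ["Hyderabad", "Chennai", "Hyderabad"]})  # hypothetical column

    # Dummy variables (one-hot encoding): one 0/1 column per category
    dummies = pd.get_dummies(df["city"], prefix="city")

    # Label encoding: map each category to an integer code
    df["city_code"] = LabelEncoder().fit_transform(df["city"])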
Type Casting
Handling Duplicates
Ensures that we get a ____________ of _________ from all the
various locations.
E.g. A person opens a bank account but his transactions are recorded as John
Travolta in a few, John in a few entries and Travolta in a few; however, all 3 are
the name of the same person. So we merge all these names into one.
Stemming
____________ ____________
Stopword Removal
____________
Variables which are factors with a single level or majority of the levels are the
same. E.g. All the zip code numbers are the same or Gender column has all
entries listed as female.
We remove the variables from our analysis which have _____ or ________
variance in features.
MEAN
____________ ____________
MEDIAN
MODE
• If data has ______ mode it is called ______, if the data has ________ it is called ______ data, and more than two modes is called ____________
______________Skewed
____________ Skewed
Mean
Median
Measure of Kurtosis
_____________
7. Candle Plot
Histogram
Histogram is also called a Frequency Distribution Plot.
• Formula used to identify outliers is Q1 - 1.5 (IQR) on the lower side and
Q3 + 1.5 (IQR) on the upper side
(Box plot: whisker - Q1 - Median - Q3 - whisker)
• If the data points fall along the line then data are
considered to be Normally Distributed
Scatter plot is used to check for the correlation between two variables.
(Figure: linear vs nonlinear scatter patterns)
The secondary purpose of the Scatter Plot is to determine _____________.
• | r | > 0.85 implies that there is a strong correlation between the variables
• 0.4 < | r | <= 0.85 implies that there is a moderate correlation between the variables
Multivariate Analysis
The two main plots to perform Multivariate analysis are:
• Pair Plot
• Interaction Plot
Name    Salary     Age
Steve   $ 12,000   23
Jeff    $ 4,500    37
Clara   $ 5,200    28

Sales    Region
19,345   North
23,424   West
24,164   East
19,453   South
Appending
Multiple datasets with the same attributes/columns.
Merging
Multiple datasets having different attributes using a common attribute.
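A small pandas sketch of both operations (the frames are hypothetical, loosely echoing the tables above):

    import pandas as pd

    a = pd.DataFrame({"Name": ["Steve", "Jeff"], "Salary": [12000, 4500]})
    b = pd.DataFrame({"Name": ["Clara"], "Salary": [5200]})
    c = pd.DataFrame({"Name": ["Steve", "Jeff"], "Age": [23, 37]})

    # Appending: stack datasets that share the same columns
    appended = pd.concat([a, b], ignore_index=True)

    # Merging: join datasets with different columns on a common attribute
    merged = pd.merge(a, c, on="Name", how="inner")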
• on Text Data
• on Image Data
Feature Extraction
1. Deep Learning is performed using Automatic Extraction
It is based on:
All Features
• Attribute importance
• Quality
Feature Selection
• _________________
• Constraints
Clustering
___________
Network Analysis
___________
___________
Distance Properties:
Standardize or ______________ the data when the variables are measured in different units.
d = √(a² + b² + c²)

d = √((xi1 - xj1)² + (xi2 - xj2)² + .... + (xip - xjp)²)
• _________ Coefficient
• Distance is 1 otherwise
• Nearest Neighbour (Single Linkage)
• Farthest Neighbour (Complete Linkage)
• This is also called Ward's Minimum Variance method and it minimizes the total within-cluster variance around the centroids of clusters
• K-Means Clustering
• K-Means ++ Clustering
• ____________ Clustering
• K-Medoids Clustering
• K-Medians Clustering
• K-Modes Clustering
• ____________ Clustering
• ____________ Clustering
3. Each data point of the dataset, which is the closest to one of the centroids will
form a cluster with that closest centroid
Solution:
Initialize the algorithm multiple times with different initial partitions.
No defined rule for selecting the K-value, while there are thumb rules, these
are not foolproof.
Solution:
Run the algorithm with multiple 'K' values (range) and select the clusters with the least 'within Sum of Squares' and highest ___________.
Solution:
K-medians, _____________________ are a few other variants
which handle outliers very well.
Solution:
Use _________________ for categorical data.
Solution:
Use _______________ clustering and ____________ K-Means.
Steps:
(Figure: scatter plots of the data before and after the cluster assignment)
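A minimal scikit-learn sketch tying these remedies together (the data array is hypothetical); init="k-means++" and n_init address the initialization issues noted above:

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.rand(200, 2)  # hypothetical numeric data

    # k-means++ initialization counters sensitivity to random starts;
    # n_init reruns the algorithm with different seeds and keeps the best result
    model = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
    labels = model.labels_
    wss = model.inertia_  # within-cluster sum of squares, used when comparing K values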
L1 Norm is the distance measure used and is also called Manhattan Distance.
Steps are very similar to K-Means except that instead of calculating Mean we
calculate Median.
2. Categorical data can be converted into one-hot encoding but will hamper the
quality of the clusters, especially when the dimensions are large
3. K-Modes is the solution and uses modes instead of means and everything
else is similar to K-Means
5. If the data has a mixture of categorical and numerical data then the
_______________ method can be used
• Gaussian Radial Basis Function (RBF) Kernel
• Sigmoid
• ___________
• ______
Steps:
3. Find out the distance from each and every data point to the medoid and add
them to get a value. This value is called total cost
4. Select any other point randomly as a representative point (any point other
than medoid points)
5. Find out the distance from each of the points to the new representative point
and add them to get a value. This value is called the total cost of a new
representative point
6. If the total cost of step 3 is greater than the total cost of step 5 then the
representative point at step 4 will become a new medoid and the process
continues
7. If the total cost of step 3 is less than the total cost of step 5 then the
algorithm ends
Steps:
PAM is well suited for small datasets but it fails for large datasets.
Agglomerative:
Start by considering each data point as a cluster and keep merging the records
or clusters until we exhaust all records and reach a single big cluster.
Steps:
1. Start with 'n' number of clusters where 'n' is the number of data points
2. Merge two records, or a record and a cluster, or two clusters at each step
based on the distance criteria and linkage functions
Divisive:
• Start by considering that all data points belong to one single cluster and keep
splitting into two groups each time, until we reach a stage where each data
point is a single cluster
Number of clusters are decided after running the algorithm and viewing the
Dendrogram. Dendrogram is a set of data points, which appear like a tree of
clusters with multi-level nested partitioning.
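A short SciPy sketch of agglomerative clustering with Ward linkage (the data is hypothetical):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

    X = np.random.rand(20, 2)  # hypothetical data

    # Agglomerative clustering with Ward's minimum-variance linkage
    Z = linkage(X, method="ward")

    # Cut the tree to obtain a chosen number of clusters
    labels = fcluster(Z, t=3, criterion="maxclust")

    # dendrogram(Z) draws the tree of nested partitions (requires matplotlib)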
on ________ datasets.
2 CURE
3 CHAMELEON
______________ ______________
_________________
“Plot the number of clusters for the image if it was subject to Optics clustering”.
Challenges:
high-dimensional data.
Methods: STING, CLIQUE (grid-based clustering algorithms)
___________________ (ground truth) category:
• Purity
• Maximum Matching
2. _________-based measures
3. Pairwise measures
4. Correlation measures
____________________
1. Beta-CV measure
2. ______________ Cut
3. Modularity
___________________
2. Distance Distribution
3. ____________ Statistic
2. Empirical Method: K = √(n/2)
(Figure: elbow plot of within-cluster variation against the number of clusters K = 1 ... 9)
2. Matrix Multiplication
3. Matrix ________________
4. _______________ Matrix
____________ Reduction.
___________.
• ____________ Analysis
quantitative in nature.
These PCs capture 100% of the information; however, the initial set of PCs alone can
PCA helps us reduce the size of the dataset significantly at the expense of
If the original dataset has features which are all correlated, then applying PCA
Each PC will capture information contained in all the variables of the original
dataset.
• PCs are ordered by their _________(PC1 > PC2 > PC3, and so on)
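A minimal scikit-learn PCA sketch (the data is hypothetical):

    import numpy as np
    from sklearn.decomposition import PCA

    X = np.random.rand(100, 10)  # hypothetical dataset with 10 features

    pca = PCA(n_components=3)        # keep the first 3 principal components
    scores = pca.fit_transform(X)    # dataset re-expressed in PC space

    # Variance captured by each PC, ordered PC1 > PC2 > PC3
    print(pca.explained_variance_ratio_)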
SVD is applied on the images to reduce the size of images and helps
It is a _____ decomposition
_____________
criteria:
mean the same thing, i.e., how are two entities related to each other, is there
Drawbacks of Support:
2. It does not capture the true dependency - How good are these rules
__________ = # __________________ / # transactions with A
Drawbacks of Confidence:
• It does not capture the true dependency - How good is the dependency
Lift = Confidence / __________________
Threshold - 1:
Lift > 1 indicates a rule that is useful in finding consequent item sets. The rule
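As a hedged sketch, the mlxtend library (if available) computes support, confidence, and lift from one-hot basket data; the basket frame below is hypothetical:

    import pandas as pd
    from mlxtend.frequent_patterns import apriori, association_rules

    # Hypothetical one-hot basket data: rows = transactions, columns = items
    baskets = pd.DataFrame({"bread":  [1, 1, 0, 1],
                            "butter": [1, 1, 0, 0],
                            "jam":    [0, 1, 1, 0]}).astype(bool)

    frequent = apriori(baskets, min_support=0.25, use_colnames=True)

    # Keep rules with lift > 1, i.e., rules useful in finding consequent item sets
    rules = association_rules(frequent, metric="lift", min_threshold=1.0)
    print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])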
Data used for the analysis usually has ‘Users’ as rows and ‘Items’ will be
Sometimes the values, for example ratings columns, are divided by the
______. ______ refers to the number of customers who have
purchased or rated the item. This process is called ______ the ratings.
Similarity Measures:
Euclidean distance
What to Recommend:
List out and recommend the items that the person is MOST LIKELY to buy
from the list of items that similar customers have already purchased.
• Rated by most
Compute is very expensive - n² similarity calculations
5 __________________ customers
______recommendations’
Rating matrices are huge and sparse (too many empty cells)
Applications
Node Properties
considered
Normalized ________ = __________________ / # of all possible pairs except the focal node
• Nodes which are connected to high-scoring nodes contribute more to the score of the focal node
• The eigenvector X corresponding to the highest eigenvalue is the vector that consists of the centrality scores of the nodes
Edge or Link properties are defined based on the domain knowledge.

Shortest Path Related Network Properties:
• Number of edges / Number of possible edges
• Path _____
• Average Path Length
• Cluster Coefficient
• _____
Steps
- Calculate edge betweenness of each edge
- Remove the edge with the highest betweenness
- Repeat the above two steps until no edges remain
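A small networkx sketch of the centrality measures above, plus one Girvan-Newman split (uses the built-in karate-club example graph):

    import networkx as nx
    from networkx.algorithms.community import girvan_newman

    G = nx.karate_club_graph()  # built-in example network

    degree = nx.degree_centrality(G)        # normalized by the possible neighbours
    closeness = nx.closeness_centrality(G)
    betweenness = nx.betweenness_centrality(G)
    eigenvector = nx.eigenvector_centrality(G)

    # Girvan-Newman repeatedly removes the highest-betweenness edge (steps above)
    communities = next(girvan_newman(G))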
arranging the extracted keywords in a plain space with font sizes varying
Examples of Sources:
• _____ Speech transcripts
• _____
• Email to customer service
• Field agents, salespeople
• Social media outreach
• ________
• _______
• Typos
• Filler words, connectors, pronouns (‘all’, ‘for’, ‘of’, ‘my’, ‘to’, etc.)
• Numbers
• Custom words
• Stemming
• Lemmatization
Word Cloud
__________ - words present in positive dictionary.
_____ - two words repeated together - gives better context of the content.
LDA (____________________)
A topic modelling method that generates topics based on words/expression
Text Summarization:
Process of producing concise version of text by retaining all the important
information.
(Figure: topic modelling maps Documents to Topics, and each Topic to a frequency of words)
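A minimal scikit-learn sketch of a Document Term Matrix and LDA topic modelling (the documents are hypothetical):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = ["love the training on data science",
            "deep learning training on text data",
            "cricket match score update"]  # hypothetical documents

    # Document Term Matrix: rows = documents, columns = word frequencies
    vec = CountVectorizer(stop_words="english")
    dtm = vec.fit_transform(docs)

    # LDA assigns each document a mixture of topics, each topic a mixture of words
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(dtm)
    print(lda.transform(dtm))  # per-document topic proportions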
Historical Data → Random Sampling → 70% Training Dataset + 30% Test Dataset
(If the data is imbalanced, use appropriate sampling techniques)
c. _____________________________________
Historical Data → Random Sampling → 30% Test Dataset → Prediction and Model Validation → Validation Outcome (X% Accuracy)
3. ___________________________________
7. Compare the training data predicted values and training data actual values
to calculate the error or accuracy. This will give us Training Error or Training
Accuracy
a. If training error and testing error are small and close to each other then the model is considered to be RIGHT FIT (how low the error values
b. If training error is low and testing error is high then the model is considered to be OVERFIT
c. If training error is high then testing error also will be high. This scenario is UNDERFIT and means something seriously wrong with the data or model you built. Redo the entire project
(Figure: Y vs X curves illustrating underfit, right fit, and overfit)
________________ leak is countered with a new school of thought, with an idea to split the data into:
• Training Data
• __________ Data
Build Model
• Test on Training Data to get Training Error/Accuracy
• Keep validating and keep retraining the model until the desired accuracy is achieved
• Test on Validation Data to get Validation Error/Accuracy
• Fine-tune the model parameters to get better accuracy
• Pick the best-performing algorithm
• Check for model Overfitting or Underfitting
• Combine Training & Validation Data to run the model, which has given the desired results based on a set of finalized model parameters. Then test the model on testing data
• Test on Testing Data to get Test Error/Accuracy (also called generalization error)
• Run multiple trials and take the average (numerical outcome) or majority vote
• Also ensure that model neither overfits (Variance) nor underfits (Bias)
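A minimal sketch of the splits with scikit-learn (array shapes are hypothetical); the second call carves a validation set out of the remaining 70%:

    import numpy as np
    from sklearn.model_selection import train_test_split

    X = np.random.rand(100, 4)          # hypothetical features
    y = np.random.randint(0, 2, 100)    # hypothetical labels

    # First carve out 30% for testing, then split the rest into train/validation
    X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
    X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)
    # stratify=y can be passed when the data is imbalanced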
• Correlation Coefficient: r = (nΣxy - ΣxΣy) / √[(nΣx² - (Σx)²)(nΣy² - (Σy)²)]
• Root Mean Squared Error (RMSE): RMSE = √[(1/n) Σ_{t=1..n} e_t²]
• Mean Percentage Error (MPE): MPE = (1/n) Σ_{t=1..n} (e_t / Y_t)
• ______________ (MAPE): MAPE = (1/n) Σ_{t=1..n} |e_t / Y_t|
• Mean Absolute Scaled Error (MASE): MASE = MAE / MAE(in-sample, naive)

MAE (in-sample, naive) is the mean absolute error produced by a naive forecast
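The same measures in a few lines of NumPy (the actual/forecast arrays are hypothetical):

    import numpy as np

    y_true = np.array([10.0, 12.0, 15.0, 11.0])   # hypothetical actuals
    y_pred = np.array([11.0, 11.5, 14.0, 12.5])   # hypothetical forecasts
    e = y_true - y_pred

    rmse = np.sqrt(np.mean(e ** 2))
    mpe = np.mean(e / y_true)
    mape = np.mean(np.abs(e / y_true))
    r = np.corrcoef(y_true, y_pred)[0, 1]

    # MASE: scale MAE by the in-sample MAE of a naive (previous-value) forecast
    mae = np.mean(np.abs(e))
    mae_naive = np.mean(np.abs(np.diff(y_true)))
    mase = mae / mae_naive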
Model Evaluation Techniques
If ‘Y’ is a Discrete variable (Classification Models) then we can use the
following list:
Confusion Matrix:
Can be applied for both _______ classification as well as ________
classification models.
                      Predicted
                 Positive    Negative
Actual Positive     TP          FN
Actual Negative     FP          TN

Error = 1 - Accuracy
Accuracy = (TP + TN) / (TP + FP + FN + TN); Accuracy should be greater than the % of the majority class
(Figure: 2x2 grid of possible outcomes for the action 'Bought Flowers', with cells such as Money Wasted, Domestic Bliss, and Wife Suspicious)
Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F1 Score = (2 × Precision × Recall) / (Precision + Recall)
                         Predicted Class
Actual Class    Bend   Jack   Jump   Run   Skip   Walk
Bend             100      0      0     0      0      0
Jack               0    100      0     0      0      0
Jump               0      0     89     0      0     11
Run                0      0      0    67      0     33
Skip               0      0      0     0    100      0
Walk               0      0     11    33      0    100
The values along the diagonal are right predictions and the values off the diagonal are wrong predictions.
(Figure: ROC curve plotting True positive rate (sensitivity) against False positive rate (1 - specificity); a perfect classifier hugs the top-left corner, while a classifier with no predictive value lies along the diagonal)
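A short scikit-learn sketch of these evaluation measures (labels, predictions, and scores are hypothetical):

    from sklearn.metrics import (accuracy_score, confusion_matrix,
                                 precision_score, recall_score, f1_score, roc_auc_score)

    y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical labels
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # hypothetical predictions
    y_prob = [0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3]  # hypothetical scores for ROC

    print(confusion_matrix(y_true, y_pred))   # [[TN, FP], [FN, TP]]
    print(accuracy_score(y_true, y_pred))
    print(precision_score(y_true, y_pred))    # TP / (TP + FP)
    print(recall_score(y_true, y_pred))       # TP / (TP + FN)
    print(f1_score(y_true, y_pred))
    print(roc_auc_score(y_true, y_prob))      # area under the ROC curve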
• __________-Based Reasoning
• __________-Based Reasoning
• Instance-Based Learning
• Rote Learning
• __________ Reasoning
____________.
KNN is based on calculating distance among the various points. Distance can be calculated using any of the measures discussed in the previous sections.
KNN also has an improved version where _____ _______ are assigned to the neighbours.
In case of continuous output, the final prediction will be the _______ of all output values, and in case of categorical output, the final prediction will be the ________.
(Figure: Model Accuracy vs Model Complexity; the right level of model complexity sits at the bias-variance tradeoff)
Strengths: Does not depend on the underlying data distribution
Weakness: There is no model produced and hence no interesting relationship among output and inputs is learnt
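A minimal scikit-learn KNN sketch on the built-in iris data; weights="distance" corresponds to the weighted variant mentioned above:

    from sklearn.datasets import load_iris
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    # weights="distance" gives closer neighbours a larger say in the vote
    knn = KNeighborsClassifier(n_neighbors=5, weights="distance").fit(X_tr, y_tr)
    print(knn.score(X_te, y_te))  # test accuracy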
__________Probability.
P(Class) = P(Spam) = No. of times spam appears in the data / Total no. of emails = 5/10
P(Data) = P(Lottery) = No. of times lottery appears in the data / Total no. of emails = 4/10
P(Data | Class) = P(Lottery | Spam) = No. of emails having word lottery given that emails are
spam = 1/5. In total there are 5 spam emails and out of which 1 email has the word lottery.
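The same numbers worked through Bayes' theorem in Python (values taken from the example above):

    p_spam = 5 / 10                 # P(Class) = P(Spam)
    p_lottery = 4 / 10              # P(Data) = P(Lottery)
    p_lottery_given_spam = 1 / 5    # P(Data | Class)

    # Bayes' theorem: P(Class | Data) = P(Data | Class) * P(Class) / P(Data)
    p_spam_given_lottery = p_lottery_given_spam * p_spam / p_lottery
    print(p_spam_given_lottery)     # 0.25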
Root Node
Branch Node
Leaf
Node
Conditions to Stop
Examples:
• Sun rises in the east: extremely high probability (p = 1), so information content is lowest ('0' bits)
• Occurrence of an earthquake in Kuala Lumpur: very low probability, so high information content
I(event) = log2(1 / Prob(event)) = -log2 Prob(event)
Entropy:
• Entropy is the expected information content of all the events
• Entropy value of 0 means the sample is completely homogeneous
• Entropy value of 1 means the sample is completely heterogeneous
H(p = (p1 ... pn)) = Σ_{i=1..n} p_i log(1 / p_i) = -Σ_{i=1..n} p_i log(p_i)
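A small Python helper for the entropy formula above:

    import numpy as np

    def entropy(p):
        """Expected information content: -sum(p_i * log2(p_i))."""
        p = np.asarray(p, dtype=float)
        p = p[p > 0]  # log(0) is undefined; zero-probability terms contribute 0
        return -np.sum(p * np.log2(p))

    print(entropy([1.0]))        # 0.0 bits: completely homogeneous
    print(entropy([0.5, 0.5]))   # 1.0 bit: completely heterogeneous (two classes)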
(Example class counts at candidate splits: 5 & 60, 40 & 40, 60 & 10)
Decision trees find attributes which return the most homogeneous branches.
Purity can also be measured using GINI Measure, which is the Expected.
Accuracy with __________ Labeling.
Example, for branches with class counts (5, 60) and (60, 10):
[(5/65 × 60/65) + (60/65 × 5/65)] and [(60/70 × 10/70) + (10/70 × 60/70)]
After calculating the measure of _____, one must decide on which feature to
split. For this, one must measure the change in __________ resulting from
a split on each possible feature. This calculation is known as __________
(Figure: Entropy = E1 before the split and Entropy = E2 after the split, where E1 > E2)
Pre-Pruning: stopping the tree from growing once the desired condition is met.
• Stop the tree from growing once it reaches a certain number of decisions
• Stop the tree from growing if decision nodes contain only a small number of examples
Disadvantage: When to stop the tree from growing? What if an important pattern was prevented from learning?

Post-Pruning: grows the tree completely and then applies the conditions to reduce the tree size. Example, if the error rate is less than 3% then reduce the nodes. So, the nodes and branches that have less reduction of errors are removed. This process of grafting branches is known as subtree raising or subtree replacement.
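A minimal scikit-learn sketch; criterion="entropy" uses information gain, and max_depth / min_samples_leaf act as pre-pruning conditions (data is the built-in iris set):

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # Pre-pruning: cap the depth and require a minimum number of samples per leaf
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=3,
                                  min_samples_leaf=5, random_state=0).fit(X, y)
    print(tree.score(X, y))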
______________________________
y = β0 + β1x + ε; where ε = error term, β0 = y-intercept, β1 = slope
(Figure: regression line showing the mean value of y when x equals x0, a specific value of the independent variable)
The best fit line is the line which has minimum square deviations from all the data points
to the line.
To improve the accuracy, transformations can be applied; this will ensure that the data has a linear pattern with minimum spread.
It can be interpreted as the % of variability in output (Y) that can be explained with the
_______________ (X)
0 ≦ R2 ≦ 1
Where,
________________
________________
________________
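A minimal ordinary-least-squares sketch with scikit-learn (the x/y values are hypothetical):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    x = np.array([[1], [2], [3], [4], [5]], dtype=float)  # hypothetical X
    y = np.array([2.1, 4.2, 5.9, 8.1, 9.8])               # hypothetical Y

    model = LinearRegression().fit(x, y)   # ordinary least squares fit
    print(model.intercept_, model.coef_)   # beta0 (y-intercept) and beta1 (slope)
    print(model.score(x, y))               # R-squared, between 0 and 1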
The link function that provides the best goodness-of-fit for the
given data is chosen.
(Figure: a straight Linear Regression Line vs the S-shaped Logistic Regression Curve, both plotted as Probability against the predictor; the straight line falls below 0, which is invalid for a probability)
Where,
β0 = the y _______________
p = e^y / (1 + e^y); where e = 2.7183
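A short sketch showing the same link function recovered from a fitted scikit-learn model (the data is hypothetical):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    X = np.array([[100], [300], [800], [1500], [2000], [2400]], dtype=float)  # hypothetical
    y = np.array([0, 0, 0, 1, 1, 1])

    clf = LogisticRegression().fit(X, y)
    linear_part = clf.intercept_ + clf.coef_[0] * X.ravel()   # y = b0 + b1*x
    p = np.exp(linear_part) / (1 + np.exp(linear_part))       # p = e^y / (1 + e^y)
    print(p)   # matches clf.predict_proba(X)[:, 1]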
The task of the SVM algorithm is to identify a line that separates the two classes.
(Figure: candidate separating lines a, b, c; the Support Vectors and the Maximum Margin hyperplane)
MMH is as far away as possible from the outer boundaries (convex hull) of the two classes.
The maximum margin linear classifier is the linear classifier with the maximum margin.
(Figure: kernel trick Ф; 'Sunny' vs 'Snowy' points plotted as Latitude vs Longitude are not linearly separable, but become separable when mapped to Altitude vs Longitude)
Kernel Tricks
A key feature of SVMs is their ability to map the problem into a higher-dimensional space. After the _____ trick has been applied, we look at the data through the new higher-dimensional space.
K(xi, xj) = (xi · xj + 1)^d

• The _________ kernel is similar to an RBF neural network. The RBF kernel performs well on many types of data and is thought to be a reasonable starting point for many learning tasks

K(xi, xj) = e^(-||xi - xj||² / (2σ²))
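A minimal scikit-learn sketch of a non-linear SVM (toy two-moons data); note that scikit-learn's gamma plays the role of 1/(2σ²) in the RBF formula above:

    from sklearn.datasets import make_moons
    from sklearn.svm import SVC

    X, y = make_moons(n_samples=200, noise=0.2, random_state=0)  # non-linear toy data

    # RBF kernel: exp(-gamma * ||xi - xj||^2)
    clf = SVC(kernel="rbf", gamma=1.0, C=1.0).fit(X, y)
    print(clf.score(X, y))

    # A polynomial kernel (xi . xj + 1)^d corresponds to kernel="poly", degree=d, coef0=1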
Image Recognition
Techniques used to
extract features
Text
Text example: "Love the way 360DigiTMG delivers training on Data Science & Artificial Intelligence"
• Unsupervised Learning: Bag of Words (BoW), n-grams, Term Document Matrix (TDM)
• Supervised Learning: Naive Bayes
Neuron
Dendrite
Axon
Nucleus
Axon Terminal
Myelin
Schwann Cell
Node of Ranvier
Soma
2. Input layer also has one additional neuron called _____, which is
equivalent to the ‘b’ (y-intercept) in the equation of the line y = b + mx
3. ‘b’, ‘w1’, ‘w2’, ‘w3’, ... are called weights and are ________ initialized
4. These neurons are also called nodes and are connected via an edge to the neuron in the next layer
(Figure: a perceptron with inputs x0 = 1, x1 ... xn (the Dendrites), bias b and weights w1 ... wn, a Summation unit (the Cell body or Soma), an Activation y = f(wx + b), and an output (the Axon))
Neural Network
net = Σ_{i=0..n} wi·xi
o = σ(net) = 1 / (1 + e^(-net))
(Input → Net input function → Activation function → Output)
8. Predicted output and actual output are compared to calculate the _____ function / _____ function (the error calculated for each record is called the _____ function and the combination of all these individual errors is called the cost function)
10. Weights are updated with the objective of minimizing the error and this
minimization of error is achieved using _______________ Algorithm
Neural network with no hidden layers and a single output neuron is called a
__________ Algorithm.
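A from-scratch perceptron sketch in NumPy, learning the AND gate (a linearly separable toy problem; the initial weights, learning rate, and epoch count are illustrative):

    import numpy as np

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # AND gate inputs
    y = np.array([0, 0, 0, 1])                                   # AND gate outputs

    rng = np.random.default_rng(0)
    w = rng.normal(size=2)   # randomly initialized weights
    b = 0.0                  # bias, the 'b' in y = b + w.x
    lr = 0.1                 # learning rate controls the step size

    for epoch in range(50):              # one epoch = one pass over the training set
        for xi, target in zip(X, y):
            pred = 1 if xi @ w + b > 0 else 0
            error = target - pred
            w += lr * error * xi         # update weights to reduce the error
            b += lr * error

    print([(1 if xi @ w + b > 0 else 0) for xi in X])  # learned AND outputs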
_____ steps to arrive at the bottom of the error surface.
Learning Rate
(Figures: learning-rate schedules plotted against epochs, e.g., held constant until the model reaches the end of a defined number of epochs, or decayed after every epoch until the defined floor is reached)
Rate of change
Slope
Curves/Surfaces should be continuous and smooth (_____ / sharp points)
A few definitions:
Epoch: When the entire training set is used once to update the weights
Other advanced variants of Mini-Batch SGD: Momentum, Nesterov Momentum, Adagrad, Adadelta, RMSprop, Adam
3. _______________
4. Error/Cost/Loss Functions
5. _______________ Methods
Note: Hidden layers can have any activation function and majorly _________
activation functions seem to be giving good results.
Activation functions (shapes shown in the original figure):
• Identity function (Linear function): a(ƒ) = a
• ___ function
• Ramp function
• Sigmoid function
• Tanh function
• ReLU (Rectified Linear Unit) function
• ELU (Exponential Linear Unit)
• _____ ReLU
• Maxout
• _____ (output neurons: converts the scores for classes A, B, C into Probability of A, Probability of B, Probability of C)
J(w) = (1/m) Σ_{i=1..m} ½ (h_w(x^(i)) - y^(i))² + (λ/2) Σ_{l=1..n_l-1} Σ_{i=1..s_l} Σ_{j=1..s_l+1} (w_ji^(l))²
_____ stopping:
(Figure: Test Set Accuracy plotted against Epoch; training is halted at the epoch where overfitting begins)
Error-change criterion
1. Stop when error isn't dropping over a window of, say, 10 epochs
2. Train for a fixed number of _____ after criterion is reached (possibly with
lower learning rate)
Weight-change criterion
max_i | w_i^(t) - w_i^(t-10) | < ρ
Training Phase: For each hidden layer, for each training sample, for each
iteration, ignore (zero out) a __________, p, of nodes (and corresponding
activations).
Test Phase: Use all __________, but reduce them by a factor p (to account
for the missing activations during training).
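A NumPy sketch of the two phases (the activations are hypothetical; here p is treated as the drop probability, so the test-phase scale factor is 1 - p; conventions for p vary):

    import numpy as np

    rng = np.random.default_rng(0)
    p = 0.5                              # fraction of nodes to ignore
    activations = rng.random((4, 8))     # hypothetical hidden-layer activations

    # Training phase: zero out a random fraction p of the nodes
    mask = rng.random(activations.shape) >= p
    train_out = activations * mask

    # Test phase: keep all nodes but scale them down to match the
    # expected activation seen during training
    test_out = activations * (1 - p)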
(Figure: a Standard Neural Net vs the same net after applying Dropout, with a random 0/1 mask applied to the nodes)
Very similar to _____, however, we disable the weights instead of the nodes.
Noise:

Batch Normalization
Parameters to be learned: γ, β
Shuffling inputs:
• Present input examples that produce a large error more frequently than examples that produce a small error. Why? It helps to take large steps in the gradient descent.
_____ initialization: uniform( -√6 / √(fan_in + fan_out), √6 / √(fan_in + fan_out) )
Data that is collected over equally spaced time intervals.
____________ Data:
EDA
_______ Visual
• Forecast Horizon
• ____________
Model-Driven: _______ similar to future
Data-Driven: _______ similar to future
4. ____________
5. ____________
(Figure: Sales plotted against Months, two panels)
• ____________ Method
MA (Moving Average): better to forecast when data & environment is not _________
ES (Exponential Smoothing): better to forecast when data & environment is _________
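A short pandas sketch of both smoothing techniques (the sales series is hypothetical):

    import pandas as pd

    sales = pd.Series([112, 118, 132, 129, 121, 135, 148, 148, 136, 119])  # hypothetical

    # Moving Average: mean of the last k observations
    ma = sales.rolling(window=3).mean()

    # Simple Exponential Smoothing: recent points get exponentially larger weights
    es = sales.ewm(alpha=0.3).mean()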
1800-212-654321
[email protected] 360digitmg.com
2-56/2/19, 3rd Floor, Vijaya Towers, Ayyappa Society Road, Madhapur, Hyderabad, Telangana 500008