Session 2024-2025
Submitted By:
Abdul Kadir
ARYAN COLLEGE
LOHAGAL, AJMER-305001
We express our sincere gratitude to our guide Mr. Trivendra Sir, Lecturer, Aryan College, Ajmer for discussing the case study, guiding us and constantly inspiring us towards the accomplishment of this project. To carry out this work under his supervision has been both a pleasure and a privilege.
Sincere thanks also to Mr. Amar Pal Singh Shekhawat, Director, Aryan College, Ajmer for his active cooperation and valuable suggestions throughout our project work.
Abdul Kadir
PREFACE
This project on Data Science and Big Data in Java has been made with a lot of hard work and the valuable guidance of Mr. Trivendra Sir. Through this project, the Data Science and Big Data workflow has been automated to reduce tedious manual and mental work.
I am also thankful to our mentor Dr. Prafull Chandra Narooka.
1. Introduction
1.1 DEFINITION OF DATA SCIENCE
1.2 Importance of Data Science
1.3 Role of Data Scientist
1.4 Tools for Data Science
1.5 Applications of Data Science
1.6 Lifecycle of Data Science
2. Big Data and Data Science hype
2.1 Types of Big Data
2.2 Three Characteristics of Big Data - the 3 Vs
2.3 Benefits of Big Data
2.4 Big Data Techniques
2.5 Underfitting and Overfitting
2.6 Data Science Hype
3. Statistical Inference, Statistical modelling
4. Probability Distributions
4.1 What is Probability?
4.2 Why Probability is important?
4.3 How to use Probability in Data Science?
4.4 What are probability distributions?
4.5 What are the types of probability distributions?
5. Fitting a model
5.1 Objectives of Model Fitting
5.2 Why are we fitting models to data?
6. Introduction to R
6.1 What is R?
6.2 Why We Choose R for Data Science?
6.3 History of R
6.4 R Features
6.5 How R is Different from Other Technologies
6.6 Applications of R Programming
6.7 Why is R Important in Data Science?
6.8 What Makes R Suitable For Data Science?
6.9 Data Science Companies that Use R
7. Exploratory Data Analysis and the Data Science Process
7.1 Exploratory Data Analysis
7.2 Data Science Process
8. Basic tools (plots, graphs and summary statistics) of EDA
8.1 Exploratory data analysis
8.2 Types of Data
9. The Data Science Process - Case Study, Real Direct (online real estate firm)
7. Data Wrangling
8. Feature Generation
8.1 INTRODUCTION
8.2 BACKGROUND
8.3 SYSTEM AND METHODS
9. Feature Selection algorithms
9.1 The Problem the Feature Selection Solves
9.2 Feature Selection Algorithms
9.3 How to Choose a Feature Selection Method for Machine Learning
Unit I - Introduction to Data Science & R Programming
1. INTRODUCTION: WHAT IS DATA SCIENCE?
What is Data Science?
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms
and systems to extract knowledge and insights from data in various forms, both structured and
unstructured, similar to data mining.
• Organizations have large amounts of data, such as financial records, reviews, customer data, employee records, etc.
• Data science is relevant because you want to keep that data clear and easy to understand so you can act on it.
• Data analysis lets people make better decisions, and make them faster.
Every company has data, but its business value depends on how much insight it can draw from that data.
Of late, data science has gained importance because it can help companies increase the business value of their available data and thereby gain an advantage over their competitors.
It can help us know our customers better, it can help us refine our processes and it can help us make better decisions. Knowledge, in the light of information technology, has become a vital instrument.
• Data scientists help organizations understand and handle data, and address complex problems using knowledge from a range of technology niches.
• They typically have backgrounds in computer science, modelling, statistics, analytics and mathematics, combined with a clear business sense.
A typical data science process looks like this, and can be modified for a specific use case:
• Build & validate the models
• Deploy & monitor the performance
Commonly used tools for data science include:
1. R
2. Python
3. SQL
4. Hadoop
5. Tableau
6. Weka
A brief overview of the main phases of the Data Science Lifecycle is shown in Figure 1:
Figure 1: Data Science Lifecycle
Phase 1 - Discovery: It's important to understand the various criteria, requirements, goals and necessary budget before you start the project. You ought to have the courage to ask the right questions. Here, you determine whether you have the resources needed to support the project in terms of people, equipment, time and data. You must also frame the business problem in this phase, and formulate initial hypotheses (IH) to test.
Phase 2 - Data preparation: You need an analytical sandbox in this phase, in which you can conduct analytics for the entire duration of the project. Before modelling, you need to explore, preprocess, and condition the data. In addition, you must perform ETLT (extract, transform, load, transform) to bring data into the sandbox. The flow of statistical analysis is shown in Figure 2 below.
R may be used for data cleaning, transformation, and visualization. It will help you identify outliers and establish relationships between the variables. Once the data has been cleaned and prepared, it is time to do some exploratory analytics on it.
Phase 3 - Model planning: Here you decide the methods and techniques for drawing the relationships between variables. These relationships will set the basis for the algorithms you will implement in the next phase. You will apply Exploratory Data Analysis (EDA) using statistical formulas and visualization tools.
1. R has a full range of modelling capabilities and offers a strong environment for building interpretive models.
2. SQL Analysis Services can use data mining functions and basic predictive models to perform in-database analytics.
3. SAS/ACCESS can be used to access Hadoop data, and is used to construct repeatable and reusable model flow diagrams.
While there are many tools on the market, R is the most commonly used.
Now that you have insight into the nature of your data and have chosen the algorithms to use, in the next phase you will apply those algorithms and build a model.
Phase 4 - Model building: You will be designing data sets for training and testing purposes during this phase. You should decide whether your current resources are adequate to run the models, or whether you need a more robust environment (such as fast and parallel processing). To construct the model, you will examine different learning techniques such as classification, association, and clustering.
Model building can be done using the methods shown in Figure 4.
Phase 5 - Operationalize: You deliver final reports, presentations, code and technical documents during this phase. In addition, a pilot project is often deployed in a real-time production environment. This gives you a clear, small-scale picture of the results and other relevant constraints before full deployment.
Phase 6 - Communicating results: Now it's necessary to determine whether you were able to achieve the goal that you defined in the first phase. In this last phase, you identify all the key findings, communicate them to the stakeholders and decide whether the project's results are a success or a failure based on the criteria developed in Phase 1.
2. BIG DATA AND DATA SCIENCE HYPE
Big Data is a term used to describe a huge volume of both structured and unstructured data that is so large it is impractical to handle using traditional database and programming techniques. In most enterprise scenarios the volume of data is too big, it moves too fast, or it exceeds current processing capacity.
Big Data involves:
(1) Collecting data in significant quantities, via machines, sensors, people and events.
(2) Doing something with it: making decisions, testing hypotheses, gaining insight, forecasting the future.
Types of Big Data
There are three types of data behind Big Data - structured, semi-structured, and unstructured - as shown in
Figure 5. There's a lot of useful knowledge in each category that you can mine to use in various
projects.
Structured
By structured data, we mean data that can be processed, stored, and retrieved in a fixed
format. It refers to highly organized information that can be readily and seamlessly stored and
accessed from a database by simple search engine algorithms. For instance, the employee
table in a company database will be structured as the employee details, their job
positions, their salaries, etc., will be present in an organized manner.
Unstructured
Unstructured data refers to data that does not have a particular form or structure at all. This makes it very difficult and time-consuming to process and analyze unstructured data. Email is an example of unstructured data.
Semi-structured
Semi-structured data contains both structured and unstructured forms of data; data represented in an XML or JSON file is a typical example.
1. Volume
A typical PC might have had 10 gigabytes of storage in 2000. Today, Facebook regularly ingests 500 terabytes of new data, and a Boeing 737 can generate 240 terabytes of flight data on a single trip across the US. Smartphones, the data they create and consume, and the sensors embedded in everyday objects will soon result in billions of new, continually updated data feeds containing human, environmental and other data, including video.
2. Velocity
• Clickstreams and ad impressions capture consumer activity at millions of events per second.
• High-frequency stock trading algorithms represent market movements within microseconds.
• Machine-to-machine processes exchange data between billions of devices.
• Networks and sensors generate huge volumes of real-time log data.
• Online gaming systems support millions of concurrent users, each producing multiple inputs per second.
3. Variety
• Big data is not just numbers, dates, and strings. Big data also includes geospatial data, 3D data, audio and video, and unstructured text, including web log files.
• Traditional database systems were built to handle smaller volumes of structured data, with fewer updates and a predictable, consistent data structure.
• Big Data analysis involves data of many different types.
Benefits of Big Data
Access to social data from search engines and sites like Facebook and Twitter is enabling organizations to fine-tune their business strategies.
Big Data technologies can be used to create a staging area or landing zone for new data before deciding which data should be moved to the data warehouse. Such integration of Big Data technologies with the data warehouse also helps an organization offload infrequently accessed data.
Big Data Tools for Data Analysis
1) Apache Hadoop
2) CDH (Cloudera Distribution for Hadoop)
3) Cassandra
4) Knime
5) Datawrapper
6) MongoDB
7) Lumify
8) HPCC
9) Storm
10) Apache SAMOA
11) Talend
12) Rapidminer
13) Qubole
14) Tableau
15) R
In addition to the newer machine-driven analysis techniques that harness emerging data, analysis is often still based on conventional statistical methods. Essentially, data analysis within an enterprise works in two ways: streaming data is analyzed as it arrives, and batch analysis is run on data as it accumulates, to search for behavioural patterns and trends.
The more informative data becomes in its size, scope and depth, the more innovation it drives.
Big Data techniques include:
1. A/B testing
3. Data mining
4. Machine learning
6. Statistics.
Underfitting and Overfitting
Machine learning uses data to create a "model" and uses the model to make predictions. For example:
• Customers who are women over age 20 are likely to respond to an advertisement.
• Students with good grades are predicted to do well on the SAT.
• The temperature of a city can be estimated as the average of its nearby cities, unless some of the cities are on the coast or in the mountains.
• Underfitting: the model used for predictions is too simplistic.
- 60% of men and 70% of women responded to an advertisement, therefore all future ads should go to women.
- If a furniture item has four legs and a flat top, it is a dining room table.
- The temperature of a city can be estimated as the average of its nearby cities.
• Overfitting: the model used for predictions is too specific.
- The best targets for an advertisement are married women between 25 and 27 years old with short black hair, one child, and one pet dog.
- If a furniture item has four 100 cm legs with decoration and a flat polished wooden top with rounded edges, then it is a dining room table.
A short R sketch contrasting the two cases follows.
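To make the contrast concrete, here is a minimal, hedged R sketch on simulated data (all values are made up): a straight line underfits a curved signal, while a very high-degree polynomial overfits it by also modelling the noise.

```r
# Toy illustration of underfitting vs. overfitting (hypothetical data).
set.seed(1)
x <- seq(0, 1, length.out = 30)
y <- sin(2 * pi * x) + rnorm(30, sd = 0.3)   # true signal plus noise

under <- lm(y ~ x)                 # degree 1: too simplistic (underfits)
good  <- lm(y ~ poly(x, 3))        # degree 3: captures the curve
over  <- lm(y ~ poly(x, 15))       # degree 15: chases the noise (overfits)

# Training error keeps shrinking as the model gets more complex ...
round(sapply(list(under, good, over), function(m) mean(resid(m)^2)), 3)

# ... but error on new data from the same process does not.
x_new <- seq(0, 1, length.out = 100)
y_new <- sin(2 * pi * x_new) + rnorm(100, sd = 0.3)
new_mse <- function(m) mean((y_new - predict(m, data.frame(x = x_new)))^2)
round(sapply(list(under, good, over), new_mse), 3)
```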
Data Science Hype
The noise around AI, data science, machine learning and deep learning is reaching fever pitch. As this noise has grown, our industry has experienced a shift in what people mean when they say "AI," "machine learning" or "data science." It can be argued that our industry lacks an agreed taxonomy, and if there is one, then we as data science professionals have not done a very good job of adhering to it. This has consequences. Two implications are a hype bubble that leads to unrealistic expectations, and a growing inability to communicate, especially with colleagues outside data science. I will cover concise definitions in this post and then argue why they matter.
Concise Definitions
Data Science: a discipline that produces predictions and explanations by using code and data to create models that are put into action.
Machine Learning: a class of algorithms or techniques to capture complex data patterns in
model form automatically.
Deep learning: A class of machine learning algorithms that uses more than one hidden layer of
neural networks.
AI: a group of systems functioning in a manner comparable to humans in both degree of
autonomy and reach.
Hype
These words carry a lot of star power. They encourage people to dream and envision a better future, which leads to their overuse. More buzz around our industry raises the tide that lifts all boats, no? Sure, we all hope the tide keeps rising, yet we should aim for a sustainable rise and avoid a hype bubble that, when it bursts, would cause widespread disillusionment.
Numerous leaders are asking for guidance on how to help executives, mid-level managers and even new data scientists hold reasonable expectations of data science initiatives without losing excitement about data science. Unrealistic expectations delay progress, because enthusiasm collapses when projects yield less-than-utopian results.
A major cause of this hype has been the constant overuse of "AI" when referring to any solution that makes some kind of prediction. Owing to this constant overuse, people automatically equate data science projects with near-perfect, autonomous, human-like solutions. Or, at the very least, people believe that data science can easily solve their particular predictive need, without questioning whether their organizational data supports such a goal.
I know a data science manager who insists that his data science team be practically locked in a room with the business executives for an hour before he approves any new data science project. Okay, the door isn't actually locked, but it is shut, and he wants them to discuss the project for a full hour. They have seen a reduction in project rework because they concentrate on early communication with business stakeholders. Describing data science concepts is challenging enough as it is; we only make it harder if we cannot define our own terms.
• Because AI and deep learning have come onto the scene, discussions constantly need to pause so we can ask questions and figure out what people actually mean by those words. For starters, how would you interpret these statements?
• "Our goal is to make our technology AI-driven within 5 years."
• "We need to improve machine learning before we invest in deep learning."
• "We use AI to predict fraud so that our customers can spend with confidence."
• "Our research showed that AI-investing organizations had a 10 percent increase in revenue."
The most common term-confusion is when someone talks about AI solutions, or about doing AI, when they should actually talk about building a deep learning or machine learning model. All too often the exchange of words seems deliberate, with the speaker hoping to get a hype boost by saying "AI." Let's walk through each of the definitions and see if we can agree on a taxonomy.
Data Science
First of all, I see data science as a discipline, like any other scientific discipline. Take biology, for example. Biology involves a variety of concepts, theories, processes, and instruments, and experimentation is normal. The biology research community continuously contributes to the discipline's knowledge base. Data science is no different: practitioners apply the science, and researchers move the field forward with new hypotheses, principles and methods.
The activity of data science involves the marriage of code (usually some mathematical programming language) with data to build models. This includes the essential, and often dominant, initial steps of obtaining, cleaning, and preparing data. Data science models generally make predictions (e.g., predicting loan risk, predicting a disease diagnosis, predicting how to respond in a conversation, predicting what objects are in an image).
Data science models may also illustrate or define the environment for us (e.g., which
combination of variables is most important in making a diagnosis of the disease, which consumers
are most similar and how). Eventually, when applied to new data, these models are put into action
for making predictions and explications. Data science is a discipline that produces predictions and
explanations using code and data to create models that are put into action.
A definition of data science can be difficult to formulate while also separating it from statistical analysis. I came to the data science profession through academic training in math and statistics and professional experience as a statistician. Like many of you, I was doing data science before it became a thing.
In my interpretation, one important point is that data science models are applied to new data to make future predictions and explanations, i.e., they are "put into production." Although it is true that response surface models can be used to predict a response on new data, it is typically a hypothetical prediction of what would happen if the inputs were modified. The engineers then adjust the inputs and analyze the responses the physical device produces in its new configuration. The response surface model is not put into production: it does not take thousands of new input settings, in batches or streams over time, and predict responses.
This concept of data science is by no means foolproof but it starts capturing the essence of data
science by bringing predictive and descriptive models into action.
Machine Learning
Machine learning as a term dates back to the 1950s. Today, data scientists see it as a collection of techniques used within data science: a toolset, or class of techniques, for constructing the models described above.
models on their own, rather than a person directly articulating the reasoning for a model. This is
achieved by analyzing an initial collection of data, finding complex hidden patterns in that data and
storing those patterns in a model so that they can be later applied to new data for predictions or
interpretations to be made.
The magic behind this automated pattern-discovery process lies in the algorithms.
Algorithms are the workhorses of machine learning. Popular machine learning algorithms include the various neural network approaches, clustering strategies, gradient boosting machines, random forests and many more. If data science is a discipline like biology, then machine learning is like microscopy or genetic engineering: a set of methods and techniques used to practice the discipline.
Deep Learning
Deep learning has the simplest definition of these concepts. Deep learning is a class of machine learning algorithms that employs more than one hidden layer of neural networks. Neural networks themselves date back to the 1950s. Deep learning algorithms became popular in the 1980s, went through a lull in the 1990s and 2000s, and then saw a resurgence in the current decade thanks to fairly minor changes in how deep networks are designed that turned out to have an incredible impact. Deep learning can be applied to a wide range of applications, including image recognition, chat assistants, and recommender systems. For example, Google Speech, Google Photos, and Google Search are some of the original solutions built using deep learning.
AI
AI has been around for a long time. Long before the recent hype storm that has co-opted it
with buzzwords. How do we, as data scientists, define it? When and how should we use it? What is
AI to us? Honestly, I'm not sure anyone really knows. This might be our "emperor has no clothes"
moment. We have the ambiguity and the resulting hype that comes from the promise of something
new and unknown. The CEO of a well-known data science company was recently talking with our
team at Domino when he mentioned “AI”. He immediately caught himself and said, “I know that
doesn’t really mean anything. I just had to start using it because everyone is talking about it. I
resisted for a long time but finally gave in.”
That said, I'm going to take a stab at it: AI is a category of systems that people aim to build whose distinguishing characteristic is that they are comparable to humans in their degree of autonomy and scope of activity.
To extend our analogy: if data science is like biology and machine learning is like genetic engineering, then AI is like disease resistance. It is the end product, a set of solutions or systems that we seek to build by applying machine learning (often deep learning) and other techniques.
Here is the bottom line. I think we need to distinguish between techniques that are part
of AI solutions, AI-like solutions and actual AI solutions. This includes AI building blocks,
solutions with AI-ish qualities, and solutions that approach human autonomy and scope. These are
three separate things. People just say “AI” for all three far too often.
For example,
• Deep learning is not AI. It is a technique that can be used as part of an AI solution.
• Most data science projects are not AI solutions. A customer churn model is not an AI
solution, no matter if it used deep learning or logistic regression.
• A self-driving car is an AI solution. It is a solution that operates with complexity and
autonomy that approaches what humans are capable of doing.
So let's start by reviewing the fundamental AI-related technologies that were omitted from this year's report but are still important for business:
• Deep neural networks (DNNs). Gartner also speaks about DNNs as a basis for many other
new technologies used in the Hype Cycle.
• Conversational AI platforms. Gartner no longer considers conversational AI platforms to be emerging technologies, though it emphasizes their importance to the market.
• Virtual assistants. Gartner no longer considers virtual assistants to be emerging technologies, though it emphasizes their importance to the industry.
• Artificial General Intelligence (AGI). In my opinion, a good call by Gartner to favor a
pragmatic vision around AI, moving away from hype. As Gartner mentions, AGI will not be
mature for decades.
According to Gartner, what areas of focus will be the AI leaders for companies? Based on the 2019
Emerging Technologies Priority List, those will be:
• Augmented Intelligence. Gartner recognizes this evolving technology as the key to the
design strategy of new business technologies, combining short-term automation with a mid-/long-
term strategy that ensures quality enhancement not only by means of automation, but also through
growing human talent.
• Edge AI. Useful in situations where communication costs, latency or high-volume ingestion are critical. This, of course, means ensuring that the right AI technologies and techniques for our use case (e.g., deep learning) are available on the IoT infrastructure we want to deploy, among other conditions.
Finally, which are the new emerging technologies related to AI in 2019 Hype Cycle:
• Explainable AI.
• Transfer Learning.
3. STATISTICAL INFERENCE, STATISTICAL MODELLING
STATISTICAL INFERENCE
Inferential Statistics
Inferential statistics allows you to make inferences about the population from the sample data
shown in Figure 7.
Sampling Distributions
Sample means become distributed more and more normally around the true mean (the population parameter) as we increase our sample size, and the variation of the sample means decreases as the sample size increases.
The Central Limit Theorem helps us understand the following facts, whether or not the population distribution is normal:
• The mean of the sample means equals the population mean.
• The standard deviation of the sample means equals the population standard deviation divided by the square root of the sample size.
• The distribution of sample means is approximately normal for sufficiently large samples.
A sample mean can be referred to as a point estimate of a population mean. A confidence interval is always centered around the mean of your sample. To construct the interval, you add a margin of error. The margin of error is found by multiplying the standard error of the mean by the z-score of the desired confidence level, as shown in Figure 8:
confidence interval = sample mean ± z* × (population standard deviation / √n)
The confidence level represents the number of times out of 100 that the population average would be
within the specified sample mean interval.
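This calculation can be written directly in R. The following is a minimal sketch with made-up numbers; the sample mean, population standard deviation and sample size are assumptions for illustration only.

```r
# Minimal sketch: 95% confidence interval for a mean when the population
# standard deviation is treated as known (all values below are made up).
x_bar <- 72      # sample mean
sigma <- 10      # population standard deviation (assumed known)
n     <- 40      # sample size

std_error <- sigma / sqrt(n)
z_star    <- qnorm(0.975)            # z-score for 95% confidence
margin    <- z_star * std_error      # the margin of error

c(lower = x_bar - margin, upper = x_bar + margin)
```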
Hypothesis Testing
Hypothesis testing is a kind of statistical inference that involves asking a question, collecting data, and then analysing what the data tells us about how to proceed. The hypothesis being tested is called the null hypothesis and is given the symbol H0. We test the null hypothesis against an alternative hypothesis, which is assigned the symbol Ha, as shown in Figure 9.
When testing a hypothesis, we have to determine how large a difference between means is required to reject the null hypothesis. Statisticians first select a level of significance, or alpha (α), for their hypothesis test.
Critical values mark the edge of the critical region. The critical region is the entire range of values for which you reject the null hypothesis.
These are the four basic steps we follow for hypothesis testing (for one and two group means):
1. State the null and alternative hypotheses.
2. Select the appropriate significance level and check the test assumptions.
3. Calculate the test statistic and the corresponding p-value (or critical value).
4. Draw a conclusion: reject or fail to reject the null hypothesis.
Hypothesis Test on One Sample Mean When the Population Parameters are Known
We find the z-statistic of our sample mean in the sampling distribution and determine whether that z-score falls within the critical (rejection) region. This test is only appropriate when you know the true mean and standard deviation of the population.
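Here is a minimal sketch of this z-test in R; the hypothesised mean, known standard deviation, observed sample mean and sample size are all hypothetical values chosen for illustration.

```r
# Minimal sketch of a one-sample z-test (hypothetical numbers).
mu0   <- 100    # hypothesised population mean
sigma <- 15     # known population standard deviation
x_bar <- 104.2  # observed sample mean
n     <- 50

z       <- (x_bar - mu0) / (sigma / sqrt(n))
p_value <- 2 * pnorm(-abs(z))        # two-tailed p-value
alpha   <- 0.05

z
p_value
p_value < alpha                      # TRUE here, so we reject the null hypothesis
```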
Hypothesis Tests When You Don't Know Your Population Parameters
The Student's t-distribution is similar to the normal distribution, except that it is more spread out, wider in appearance, and has thicker tails. The differences between the t-distribution and the normal distribution are more exaggerated when there are fewer data points, and therefore fewer degrees of freedom, as shown in Figure 11.
Estimation as a Follow-up to a Hypothesis Test
When a hypothesis is rejected, it is often useful to turn to estimation to try to capture the true value of the population mean.
Two-Sample t-Tests
When we have independent samples, we assume that the scores of one sample do not affect the other; this calls for an unpaired t-test. When we have two dependent samples of data, each score in one sample is paired with a specific score in the other sample; this calls for a paired t-test.
Chi-Square Test
The chi-square test is used for categorical data. It can be used to estimate how closely the distribution of a categorical variable matches an expected distribution (the goodness-of-fit test), or to estimate whether two categorical variables are independent of one another (the test of independence).
For the goodness-of-fit test, degrees of freedom (df) = number of categories (c) − 1.
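These tests are available as built-in R functions. The sketch below uses simulated data purely for illustration.

```r
# Minimal sketches of the tests above, using base R functions and made-up data.
set.seed(42)
group_a <- rnorm(25, mean = 10, sd = 2)
group_b <- rnorm(25, mean = 11, sd = 2)

# Unpaired (independent-samples) t-test
t.test(group_a, group_b)

# Paired t-test: before/after scores on the same subjects
before <- rnorm(20, mean = 50, sd = 5)
after  <- before + rnorm(20, mean = 2, sd = 3)
t.test(before, after, paired = TRUE)

# Chi-square goodness-of-fit: do observed counts match a uniform distribution?
observed <- c(18, 22, 27, 33)
chisq.test(observed)                 # df = number of categories - 1

# Chi-square test of independence on a 2x2 contingency table
tab <- matrix(c(30, 10, 20, 40), nrow = 2)
chisq.test(tab)
```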
Analysis of Variance (ANOVA)
ANOVA allows us to test the hypothesis that multiple population means are equal (assuming the group variances are equal). We could conduct a series of t-tests instead of an ANOVA, but that would be tedious and would inflate the Type I error rate.
One-Way ANOVA formulas
1. Calculate the total sum of squares (SST).
2. Calculate the sum of squares between groups (SSB).
3. Find the sum of squares within groups (SSW) by subtracting: SSW = SST - SSB.
4. Next, solve for the degrees of freedom for the test.
5. Using these values, you can now calculate the Mean Squares Between (MSB) and Mean Squares Within (MSW) using MSB = SSB / df_between and MSW = SSW / df_within.
6. Finally, calculate the F statistic using the ratio F = MSB / MSW.
7. It is easy to fill in the ANOVA table from here, and also to see that once the SS and df columns are filled in, the remaining values in the table for MS and F are simple calculations.
8. Find the F critical value and compare it with the F statistic.
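In R, the whole table is produced by one call. The following is a minimal sketch with invented group scores.

```r
# Minimal sketch of a one-way ANOVA in R, using made-up group scores.
scores <- c(12, 15, 14, 10, 13,    # group A
            18, 20, 17, 19, 16,    # group B
            11,  9, 12, 10, 13)    # group C
group  <- factor(rep(c("A", "B", "C"), each = 5))

fit <- aov(scores ~ group)
summary(fit)   # reports df, SS, MS, the F statistic and its p-value
```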
Regression
Regression analysis is a set of statistical processes for estimating the relationships among variables
shown in Figure 13.
Simple Regression
This method uses a single independent variable to predict a dependent variable by fitting the best
relationship shown in Figure 13.
Multiple Regression
This method uses more than one independent variable to predict a dependent variable by fitting the
best relationship shown in Figure 14.
It works best when multicollinearity is absent. Multicollinearity is a phenomenon in which two or more predictor variables are highly correlated.
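Both kinds of regression can be fitted with lm() in R. The sketch below uses the built-in mtcars data set; the choice of predictors is an assumption for illustration.

```r
# Minimal sketch of simple and multiple linear regression on mtcars.
fit_simple   <- lm(mpg ~ wt, data = mtcars)              # one predictor
fit_multiple <- lm(mpg ~ wt + hp + disp, data = mtcars)  # several predictors

summary(fit_simple)
summary(fit_multiple)

# A quick check for multicollinearity: highly correlated predictors.
cor(mtcars[, c("wt", "hp", "disp")])
```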
Nonlinear Regression
In data science, inferential statistics is used in many ways:
• Making inferences about the population from the sample.
• Assessing whether adding or removing a feature from a model will really help to improve the model.
4. PROBABILITY DISTRIBUTIONS
Data science makes use of statistical inference to forecast or interpret patterns in data, and statistical inference relies on probability distributions. It is therefore important to understand probability and its applications in order to work effectively on data science problems.
What is Conditional Probability?
Conditional probability is a measure of the likelihood of an event occurring given that another event has occurred (whether by inference, hypothesis, or evidence).
The probability of event B given event A equals the probability of event A and event B occurring together divided by the probability of event A:
P(B | A) = P(A and B) / P(A)
Many techniques in data science (e.g., Naive Bayes) depend on Bayes' theorem. Bayes' theorem is a formula that describes how to update the probabilities of hypotheses when given evidence.
Using the Bayes’ theorem, it’s possible to build a learner that predicts the probability of the
response variable belonging to some class, given a new set of attributes.
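A small worked example may help. The numbers below (spam rate, flag rates) are entirely made up; only the use of Bayes' theorem itself is the point.

```r
# Minimal worked example of Bayes' theorem with made-up numbers:
# 1% of emails are spam; a filter flags 95% of spam and 2% of non-spam.
p_spam      <- 0.01
p_flag_spam <- 0.95   # P(flagged | spam)
p_flag_ham  <- 0.02   # P(flagged | not spam)

# Total probability of being flagged
p_flag <- p_flag_spam * p_spam + p_flag_ham * (1 - p_spam)

# Bayes' theorem: P(spam | flagged)
p_spam_given_flag <- p_flag_spam * p_spam / p_flag
p_spam_given_flag   # about 0.32
```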
A random variable is a set of possible values from a random experiment shown in Figure 16.
A random variable (random quantity, aleatory variable, or stochastic variable) is a variable whose
possible values are outcomes of a random phenomenon.
Random variables can be discrete or continuous. Discrete random variables can only take certain
values while continuous random variables can take any value (within a range).
For a random variable, the probability distribution describes how the probabilities are distributed over the values of the random variable.
For a discrete random variable x, the probability distribution is defined by a probability mass function, denoted by f(x). This function gives the probability for each value of the random variable.
For a continuous random variable, we instead define the probability that the variable falls within a given interval, since there are infinitely many values in every interval. Here the probability distribution is defined by a probability density function, which is also denoted by f(x).
A binomial distribution describes a statistical experiment with the following characteristics: the experiment consists of n repeated trials; each trial can produce only two possible outcomes, one called a success and the other a failure; and the probability of success, denoted by p, is the same for every trial, as shown in Figure 17.
The normal distribution, also known as the Gaussian distribution, is a probability distribution that is symmetric around the mean, indicating that values near the mean occur more frequently than values far from the mean.
• The mean is in the middle and splits the area into two halves;
• It is entirely defined by its mean and standard deviation (or variance σ2).
A short R sketch of both distributions follows.
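The sketch below uses R's built-in distribution functions; the parameter values are arbitrary examples.

```r
# Minimal sketches of the two distributions described above.

# Binomial: probability of exactly 3 successes in 10 trials with p = 0.5
dbinom(3, size = 10, prob = 0.5)

# Normal: probability that a value from N(mean = 0, sd = 1) is below 1.96
pnorm(1.96, mean = 0, sd = 1)

# Drawing random samples from each distribution
rbinom(5, size = 10, prob = 0.5)
rnorm(5, mean = 0, sd = 1)
```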
Figure 18: Normal Distribution
How are random variables and probability distributions used in data science?
Data science makes use of statistical inference to forecast or interpret patterns in data, and statistical inference relies on probability distributions. It is therefore important to understand random variables and their probability distributions in order to work effectively on data science problems.
5. FITTING A MODEL
Fitting a model means letting your algorithm learn the relationship between the predictors and the outcome so that you can predict future values of the outcome.
The best-fitting model has the particular set of parameters that best describes the problem at hand.
The objectives of model fitting are to:
1. Make inferences about the relationship between variables in a given data set.
2. Concisely describe the relationship between the variables and make inferential statements about that relationship.
3. Predict the values of variables of interest on the basis of the values of other predictor variables, and quantify the uncertainty of those predictions.
A minimal sketch of fitting and using a model in R follows.
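This sketch uses R's built-in cars data set and lm(); the choice of data and predictor is an assumption made only for illustration.

```r
# Minimal sketch of "fitting a model": learn the relation between a
# predictor and an outcome, then predict the outcome for new values.
fit <- lm(dist ~ speed, data = cars)       # built-in cars data set
coef(fit)                                  # fitted parameters

new_speeds <- data.frame(speed = c(12, 21))
predict(fit, newdata = new_speeds)         # predicted stopping distances
```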
6. INTRODUCTION TO R
What is R?
R is a programming language for statistical computing that is commonly used by data miners and statisticians to analyze data. It was created by Ross Ihaka and Robert Gentleman in 1995; they derived the name 'R' from the first letters of their own names. R is a common choice for statistical computing and graphical techniques in data analytics and data science.
In its CRAN repository, R holds over 10,000 packages, many of which target specific statistical applications. R can present a steep learning curve for beginners, even though its syntax is fairly simple to understand. It is an interactive environment for applying statistical learning, so a user without knowledge of statistics may not be able to get the best out of R.
Data Science has emerged as one of the 21st century's most popular fields, because the need to evaluate data and create knowledge from it is pressing. Industries are converting raw data into data-driven products, and several essential tools are needed to churn through the raw data. R is one of the programming languages that provides an intensive environment for analyzing, transforming, and visualizing information.
For many statisticians who want to get involved in designing mathematical models to solve complex problems, R is the primary choice. R includes a sea of packages covering all kinds of disciplines, such as astronomy, biology, etc. Though R was originally used for academic purposes, it is now also being used in industry.
R is a technical language used for complex mathematical modelling. In addition, R supports array, matrix, and vector operations. R is renowned for its graphical libraries, which allow users to produce visual graphs and make them interactive. R also helps its users create web applications.
Beyond this, R provides several options for advanced data analytics, such as building prediction models and machine learning algorithms, and it offers several packages for image processing.
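As a brief, hedged taste of the vector and matrix operations mentioned above, the following base-R snippet can be run as-is.

```r
# A short taste of base R: vector, matrix and summary operations.
x <- c(2, 4, 6, 8, 10)        # a numeric vector
mean(x)
sd(x)                          # summary statistics

m <- matrix(1:6, nrow = 2)     # a 2 x 3 matrix
t(m) %*% m                     # matrix multiplication (3 x 3 result)

# Packages extend R; for example, installing and loading ggplot2:
# install.packages("ggplot2")
# library(ggplot2)
```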
History of R
• Ross Ihaka and Robert Gentleman developed the R project and released its first version in
1995 and a stable beta version in 2000
R Features
There are essential R features, which we will explore in order to understand R's role in data science
shown in Figure 19.
1. Open-source – R is an open-source platform that allows you to access and change the code,
and even create your own libraries. This is safe to download, as well.
3. Analytical support-With R, through its wide variety of support libraries, you can perform
analytical operations. You can clean, arrange, analyze, display your data and construct predictive
models, too.
4. Help extensions – R allows developers to write their own libraries and packages as
distributed add-ons, and to promote such packages. This makes R a developer-friendly language
that allows its users to make changes and updates.
How R is Different from Other Technologies
There are certain special features of R that set it apart from other technologies:
• Graphical libraries – R stays ahead of the curve with its elegant graphical libraries. Libraries such as ggplot2 and plotly support attractive, well-defined plot creation.
• Technology advancement – R supports various advanced tools and features that allow you to create robust statistical models.
• Job scenario – R is a primary data science tool, as stated above. With data science's exponential growth and increasing demand, R has become one of the most in-demand programming languages today.
• Community support – You will find good community support with R.
• Portability – R is highly portable. Many different programming languages and software frameworks can easily be combined with the R environment for best results.
Applications of R Programming
• R is used in the finance and banking industries for fraud prevention, reducing customer churn, and supporting decision making.
• R is also used in bioinformatics to analyze genetic sequences, to conduct drug discovery, and in computational neuroscience.
• R is used to discover potential consumers in online advertising through social media analysis. Organizations often use insights from social media to evaluate consumer sentiment and enhance their products.
• Manufacturing companies use R to analyze customer feedback. They also use it to forecast future demand so they can adjust production rates and increase profits.
1. R includes many essential data wrangling packages, such as dplyr, purrr, readxl, googlesheets, datapasta, jsonlite, tidyquant, tidyr, etc.
2. R facilitates robust statistical modelling. Considering that data science is heavy on statistics, R is an ideal tool for executing various statistical operations on data.
3. R is an appealing tool for various data science applications because it offers aesthetic visualization packages such as ggplot2, scatterplot3d, lattice, highcharter, etc.
4. R is used widely in ETL (Extract, Transform, Load) data science applications. It offers interfaces to various databases, such as SQL databases, and even to spreadsheets.
5. Another essential capability of R is interfacing with NoSQL databases and analyzing unstructured data. This is particularly useful for data science applications where a large collection of data needs to be analyzed.
6. Data scientists can use machine learning algorithms in R to gain insights into future events. Various packages are available, such as rpart, caret, randomForest, and nnet. A brief sketch combining a few of these capabilities is shown below.
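Here is a minimal, hedged sketch of points 1 and 3 above, wrangling the built-in mtcars data set with dplyr and plotting it with ggplot2 (both packages must be installed first).

```r
# Wrangling with dplyr and plotting with ggplot2 on the built-in mtcars data.
library(dplyr)
library(ggplot2)

mtcars %>%
  filter(cyl %in% c(4, 6)) %>%        # keep 4- and 6-cylinder cars
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg))      # average mileage per group

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")
```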
R is the most popular choice for data scientists. Following are some of the key reasons as to why
they use R –
1. R has been reliable and useful in academia for many years. R was generally used in academia for research purposes because it offered various statistical analysis tools. With the growth of data science and the need to analyze data, R has also become a common option within industry.
3. R offers the popular ggplot2 package, which is best known for its visualizations. ggplot2 offers aesthetic visualizations that are appropriate for all kinds of data operations. In addition, ggplot2 gives users a degree of interactivity so they can understand the data contained in the visualization more clearly.
Some of the major data science companies that use R for analysis and statistical modelling are shown in Figure 20.
Figure 20: Data Science companies that use R
7. EXPLORATORY DATA ANALYSIS AND THE DATA SCIENCE PROCESS
Exploratory Data Analysis (EDA) refers to the essential initial phase of data analysis, in which we identify patterns, find anomalies, test hypotheses, and use descriptive statistics and graphical representations to verify conclusions.
In short, it is a good idea to get to know the data and to gain as many insights from it as possible. EDA is all about making sense of the data in hand before getting your hands dirty with modelling.
To share my interpretation of the definition and the techniques I know, I will take the white-wine version of the Wine Quality data set available on the UCI Machine Learning Repository and try to obtain as many insights from it as possible using EDA.
To start with, I imported the required libraries (pandas, numpy, matplotlib, and seaborn for this example) and loaded the data set.
Note: any inferences I have been able to draw are described with bullet points.
• To take a closer look at the data, I used the ".head()" function of the pandas library, which returns the first five observations of the data set. Similarly, ".tail()" returns the last five observations.
• I found the total number of rows and columns in the data set using ".shape".
• Of these, one is the dependent variable and the remaining 11 are independent variables (physicochemical characteristics).
It is also a good practice to know the columns and their corresponding data types, along with
finding whether they contain null values or not.
• Data has only float and integer values.
The describe() function in pandas is very handy in getting various summary statistics. This function
returns the count, mean, standard deviation, minimum and maximum values and the quantiles of the
data.
• Here, as you can notice, the mean value is less than the median value (represented by the 50th percentile) for each column.
• There is a notably large difference between the 75th percentile and the maximum values of the predictors "residual sugar", "free sulfur dioxide" and "total sulfur dioxide".
• Observations 1 and 2 therefore suggest that there are extreme values (outliers) in our data set.
• The "quality" score ranges from 1 to 10, where 1 is poor and 10 is the best.
• Quality ratings of 1, 2 and 10 are not present in any observation; the only scores obtained are between 3 and 9. (An equivalent first look in R is sketched below.)
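The steps above use pandas in Python; an equivalent first look in R might be as follows. The file name and separator are assumptions about how the UCI white-wine file was downloaded and saved.

```r
# Hedged R counterpart to the pandas workflow described above.
wine <- read.csv("winequality-white.csv", sep = ";")   # assumed local file

head(wine)             # first six rows (like .head())
tail(wine)             # last six rows  (like .tail())
dim(wine)              # rows and columns (like .shape)
str(wine)              # column names and data types
summary(wine)          # mean, quartiles, min and max (like describe())
colSums(is.na(wine))   # null values per column
```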
Data Science is a multidimensional discipline that uses scientific methods, techniques and algorithms to derive information and insights from structured and unstructured data. Much of the work is data-related, but it also includes a variety of other data-driven processes.
Data Science is a multidisciplinary field. It involves the systematic blend of scientific and statistical methods, processes, algorithm development, and technology to obtain useful information from data.
Yet how do all these dimensions work together? To grasp this, you need to learn the data science process, i.e., the day-to-day life of a data scientist.
Step One: Ask Questions to Frame the Business Problem
In the first step, seek to understand what the organization wants, and collect data based on that. You start the data science process by asking the right questions to figure out what the problem is. Let's take a very common question for a bag company: the sales problem.
• How do we classify those clients who are more likely to buy our product?
• You agree to work on the issue after a conversation with the marketing team: "How do we find
potential buyers who are more likely to purchase our product? "
• The next move for you is to find out what all the details you have with you to answer the
questions above.
Step Two: Collect the Data
• Now that you are aware of your business problem, it's time to collect the data that will help you solve it. Before collecting new data, you should ask whether the organization already has the right data available.
• In certain cases, you can get the previously collected data sets from other investigations. The
following data are required: age, gender, previous customer’s transaction history, etc.
You find that most of the customer-related data is available in the company’s Customer
Relationship Management (CRM) software, managed by the sales team.
• A SQL database with many tables serves as the back end for the CRM application. As you go through the SQL database, you will find that the system stores detailed customer identification, contact and demographic details (that customers gave the company) as well as their detailed sales history.
• If you do not think the available data is adequate, then plans must be made to collect new data. By displaying or circulating a feedback form you can also gather input from your visitors and customers. Admittedly, that is a lot of groundwork, and it takes time and effort.
• In addition, the data you obtained is 'raw data' containing errors and missing values. And before
the data is analyzed, you need to clean (wrangle) the data.
Step Three: Clean (Wrangle) the Data
Exploring the data means cleaning and organizing it. This step takes up more than 70 percent of a data scientist's time. Even after gathering all the data, you cannot use it right away, because the raw data you have collected most likely contains oddities.
First, you need to make sure that the data is error-free and clean. It is the most significant step in the cycle and requires patience and focus.
Specific tools and techniques are used for this task, such as Python, R, and SQL.
• Are data values missing, for example customers without contact numbers?
• Does the data contain null values? If so, how do you fix them?
• Are there multiple data sets? Is it sensible to merge them? If so, how do you bring them together?
Once these checks reveal the missing and incorrect values, the data is ready for analysis. Remember that having incorrect data is worse than having no data at all.
Step Four: Model the Data
After preparing the data, you have enough information to construct a model that addresses the question: "How can we identify potential customers who are more likely to buy our product?"
In this phase, you analyze the data in order to extract information from it. Analyzing the data involves applying various algorithms that derive meaning from it.
Answering exploratory questions will only give you theories and hints. Modelling the data is a way of summarizing it in an equation that the machine understands, so that you can make model-based predictions. You might need to try out multiple models to find the best fit.
Coming back to the sales problem, this model will help you predict which clients are more likely to buy. The prediction can be specific, such as females in the 16-36 age group living in India.
Step Five: Communicate the Results
Communication skills are an important part of a data scientist's job but are often widely underestimated. This will in fact be a very challenging part of your work, because it involves explaining your findings to the public and to other team members in a way that they can clearly understand.
• Graph or chart the information for presentation with tools such as R, Python, Tableau and Excel.
• Believe me, the answers will always spark more questions, and the process begins again.
The goal of EDA is to use summary statistics and visualizations to better understand the data and to find clues about its patterns, its quality, and the assumptions of our study. EDA is NOT about creating fancy or aesthetically appealing visualizations; the aim is to try and answer questions with data. Your goal should be to produce a chart that anyone can look at for a few seconds and understand what is going on. If not, the visualization is too complex (or too fancy) and something simpler should be used.
EDA is also very iterative since we first make assumptions based on our first exploratory
visualizations, then build some models. We then make visualizations of the model results and tune
our models.
Types of Data
Before we start talking about data exploration, let's first learn about the various types of data, or scales of measurement. I highly recommend reading about levels of measurement in an online statistics book and then continuing with this section to refresh your statistical knowledge; this segment is simply a synopsis. Data comes in different forms but can be classified into two major groups: structured and unstructured data. Structured data is data that has a high degree of organization, such as numerical or categorical data. Temperature, phone numbers and gender are examples of structured data. Unstructured data is data in a form that we cannot use directly; types of unstructured data include images, videos, audio and natural-language text, among many others. There is an emerging field called deep learning that uses specialized sets of algorithms which perform well with unstructured data. We will concentrate on structured data in this guide but include brief details where relevant.
While a pie chart is a very common way of representing categorical variables, it is not recommended, since it is very difficult for humans to judge angles. As statistician and visualization professor Edward Tufte put it simply, "pie charts are bad"; others agree, and some even say that "pie charts are the worst".
For example, what can you determine from the following pie chart?
Categorical Variables
Categorical variables can be either nominal or ordinal. Nominal data has no intrinsic ordering of the categories; for example, gender (Male, Female, Other) has no specific ordering. Ordinal data has a clear ordering, such as the three settings on a toaster (high, medium and low). A frequency table (the count of each category) is the common statistic for describing the categorical data of each variable, and a bar chart or a waffle chart (shown below) are two visualizations that can be used.
The charts look identical, and it takes more than a couple of seconds to understand the data.
Now compare this to the corresponding bar chart
A reader can instantly compare the five variables. Since humans have a difficult time comparing angles, bar graphs or waffle charts are recommended. There are many other visualizations that are not recommended: spider charts, stacked bar charts, and many other junk charts.
Numeric or continuous variables can take any value within a finite or infinite interval. Examples include weight, height, and temperature. Interval and ratio variables are the two types of numeric variables. Interval variables have numerical values with the same interpretation of differences across the entire scale, but no absolute zero. For example, temperatures in Fahrenheit or Celsius can be meaningfully subtracted or added (the difference between 10 and 20 degrees is the same as the difference between 40 and 50 degrees), but ratios cannot be taken: a day that measures twice the number of degrees is not "twice as hot".
"The calculation proportion scale is the most insightful scale. This is a scale of intervals with the
additional property that its zero position implies the absence of the measured quantity. You can
think of a ratio scale as the three previous scales were rolled up into one. It provides a name or
category for each object as a nominal scale (numbers are used as labels). The objects are ordered,
like an ordinal scale (in terms of ordering the numbers). The same disparity at two positions on the
scale has the same value as an interval scale. And, moreover, the same ratio in two places on the
scale also has the same meaning. "A good example of a ratio scale is weight, because it has a true
zero and can be added, subtracted, multiplied or divided.
Binning
The process of transforming numerical variables into categorical ones is known as binning or discretization. For example, age may be binned into 0-12 (child), 13-19 (teenager), 20-65 (adult) and 65+ (senior) categories. Binning is useful because it can act as a filter to reduce noise or non-linearity, and some algorithms, such as decision trees, require categorical data. Binning also helps data scientists deal with outliers and with null or incomplete values. Binning strategies include using equal width (based on the range), equal frequency in each bin, sorted rank, quantiles, or mathematical functions (such as log). Binning may also be based on information entropy or information gain.
Encoding
Encoding is the transformation of categorical variables into numeric (or binary) variables. Sex is a simple example of encoding: -1, 0 and 1 could be used to identify males, females and others. Binary encoding is a special case of encoding where the value is set to 0 or 1 to indicate the absence or presence of a category. One-hot encoding is a special case where multiple categories are each encoded as their own binary variable: if we have k categories, this generates k extra features (increasing the dimensionality of the data). Another approach is target-based (or probability) encoding, where each category is replaced by the average value, or probability, of the target within that group.
For exploratory data analysis, we are going be investigating Anscombe’s_quartet and Fisher’s Iris
data set.
"Anscombe's quartet consists of four datasets with almost similar basic statistical features, but
looking somewhat different when graphed. Every dataset is composed of eleven points (x, y). They
were built by the statistician Francis Anscombe in 1973 in order to demonstrate both the
importance of graphing data before analyzing it and the effect of outliers on statistical properties."
The Iris data set is a multivariate data set introduced by Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems". It records the variation among three related species of Iris flowers.
A histogram can be used to show the count, mode, variance, standard deviation, coefficient of variation, skewness and kurtosis.
When plotting the relation between two variables, one can use a scatter plot.
If the data is time series or has an order, a line chart can be used.
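Both data sets mentioned above ship with base R, so the summaries and plots described here can be reproduced directly; this is only a minimal sketch.

```r
# Anscombe's quartet and Fisher's iris data are built into R.
data(anscombe)
data(iris)

# Nearly identical summary statistics across the four x-y pairs ...
sapply(anscombe, mean)
sapply(anscombe, sd)
# ... yet the scatter plots look very different:
plot(anscombe$x1, anscombe$y1)
plot(anscombe$x4, anscombe$y4)

# Histogram and scatter plot for the iris measurements
hist(iris$Sepal.Length)
plot(iris$Sepal.Length, iris$Petal.Length, col = iris$Species)
```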
Multivariate Visualization
When dealing with multiple variables, it is tempting to make three-dimensional plots, but as shown below it can be difficult to understand the data:
Instead of blindly using decomposition, a data scientist could plot the result. By looking at the contrast (black and white) in the images, we can see that the locations of the eyes, nose and mouth are important, along with the shape of the head.
9. THE DATA SCIENCE PROCESS - CASE STUDY, REAL DIRECT (ONLINE REAL
ESTATE FIRM)
With statistical models such as Markov chains, it is now possible to predict the probability that doctors will prescribe medicines based on their experience with a brand. In the same way, reinforcement learning is beginning to establish itself in the area of digital marketing, where it is used to understand physicians' patterns of digital engagement and their prescriptions. The main aim of this data science case study is to discuss the problems the industry faces and how data science offers solutions to them.
Data scientists can apply a Predictive Maintenance (PdM) strategy, using data to optimize high-value machinery for the production and refining of oil products. With telemetry data extracted through sensors, a steady stream of historical data can be used to train a machine learning model. This model will predict the failure of machine parts and alert operators to perform timely maintenance in order to avoid oil losses. The data scientist assigned to implementing the PdM strategy helps prevent hazards and predict machine failures, encouraging operators to take precautions.
Helix is one of the genome-analysis companies that provides customers with their genomic
data. Due to the emergence of new computational methodologies, medicines tailored to
particular genetic profiles have become increasingly popular. Thanks to the data explosion, we can
understand and analyze complex genomic sequences at a wide scale. Data scientists can use
modern computational resources to manage massive databases and understand patterns in genomic
sequences, detecting defects and providing information to physicians and researchers. In addition,
with the use of wearable devices, data scientists can use the relationship between genetic
characteristics and medical visits to build predictive modelling frameworks.
Data science has also changed the way students communicate with teachers and the way their
success is assessed. Instructors can use data science to evaluate the feedback they receive from
students and use it to improve their teaching. Data science can also be used to construct predictive
models that estimate student drop-out rates based on their results and advise instructors to take
appropriate precautions.
UNIT- II (Basic Machine Learning Algorithms & Applications)
You will discover the linear regression algorithm in this section: how it operates and how you can
best use it in your machine learning projects. You will learn: • Why linear regression belongs to
statistics as well as machine learning.
To grasp linear regression you don't need to know any statistics or linear algebra. This is a gentle,
high-level introduction to the technique, giving you enough background to be able to make
successful use of it on your own problems.
Let's kick off. Figure 1 shows Linear Regression for Machine Learning
Isn't linear regression a technique from statistics?
Before we immerse ourselves in the specifics of linear regression, you might wonder why we are
looking at this algorithm.
Machine learning, and more precisely the field of predictive modelling, is concerned primarily with
reducing a model's error or making the most accurate predictions possible, at the expense of
explainability. In applied machine learning we borrow, reuse and steal algorithms from many
different fields, including statistics, and use them for these purposes.
As such, linear regression was developed in the field of statistics and is studied as a model for
understanding the relationship between input and output numerical variables, but it has been
borrowed by machine learning. It is both a statistical algorithm and a machine learning algorithm.
Next, let's look at some of the common names used to refer to a linear regression model. Figure 2
shows the sample of the handy machine learning algorithms mind map.
Things can get really confusing when you start looking at linear regression. The explanation is
that linear regression has been around for so long (more than 200 years). It has been studied from
every possible angle, and often each angle has a new and different name. Linear regression is a
linear model, i.e. a model that assumes a linear relationship between the input variables (x) and the
output variable (y). More precisely, y can be calculated from a linear combination of the input
variables (x).
The approach is referred to as simple linear regression when a single input variable (x) exists.
The statistical literature refers to the approach as multiple linear regression when multiple input
variables are available. Different techniques, the most common of which is called Ordinary Least
Squares, may be used to prepare or train the linear regression equation from data. It is therefore
common to refer to a model built in this way as an Ordinary Least Squares linear regression model.
Linear Regression Model Representation
For each input value or column, the linear equation assigns one scale factor, called the coefficient,
represented by the Greek capital letter Beta (B). An additional coefficient is often introduced,
which gives the line an extra degree of freedom (for example, going up and down on a two-
dimensional plot) and is also referred to as the coefficient of intercept or bias.
For example, in a simple regression problem (a single x and a single y), the form of the model
would be:
y = B0 + B1*x
In higher dimensions, when we have more than one input (x), the line is called a plane or a
hyperplane. The representation is therefore the form of the equation and the specific values used
for the coefficients (e.g. B0 and B1 in the example above).
It is common to talk about the complexity of a regression model such as linear regression. This
refers to the number of coefficients used for the model.
When the coefficient is zero, the influence of the input variable on the model is effectively
removed from the model prediction (0 * x = 0). This becomes relevant when you look at the
regularization methods that change the learning algorithm to reduce the complexity of the
regression models by putting pressure on the absolute size of the coefficients, driving some to zero.
Now that we understand the representation used for the linear regression model, let's look at some
ways that we can learn this representation from the data.
Learning a linear regression model means estimating the values of the coefficients used in the
representation with the data available to us.
In this section, we will take a brief look at four techniques for the preparation of a linear regression
model. This is not enough information to implement it from scratch, but enough to get a taste of the
computation and the trade-offs involved.
There are many more techniques because the model is so well studied. Take note of Ordinary Least
Squares because it is the most common method used in general, and take note of Gradient Descent as
it is the most common technique taught in machine learning classes.
1. Simple Linear Regression
With simple linear regression, when we have a single input, we can use statistics to estimate the
coefficients. This requires calculating statistical properties of the data such as the means, standard
deviations, correlations and covariance. All of the data must be available in order to traverse it and
calculate these statistics.
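A minimal sketch of this statistical estimate on made-up height/weight numbers, using the standard formulas B1 = covariance(x, y) / variance(x) and B0 = mean(y) − B1 * mean(x):

import numpy as np

# Hypothetical data: predicting weight (y, kg) from height (x, cm).
x = np.array([160.0, 165.0, 170.0, 175.0, 180.0, 185.0])
y = np.array([55.0, 60.0, 66.0, 70.0, 77.0, 82.0])

# B1 = covariance(x, y) / variance(x); B0 = mean(y) - B1 * mean(x)
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
print(f"y = {b0:.2f} + {b1:.2f} * x")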
2. Ordinary Least Squares
When we have more than one input, we can use Ordinary Least Squares to estimate the values of the
coefficients.
The Ordinary Least Squares procedure seeks to minimize the sum of the squared residuals. This
means that given a regression line through the data, we calculate the distance from each data point to
the regression line, square it, and sum all of the squared errors together. This is the quantity that
ordinary least squares seeks to minimize.
This approach treats the data as a matrix and uses linear algebra operations to estimate the
optimum coefficient values. This means that all the data must be available and you must have
enough memory to fit the data and perform the matrix operation.
It is unusual to perform the Ordinary Least Squares procedure yourself, unless it is done as a linear
algebra exercise. It's more likely you're going to call a procedure in a linear algebra library. This
procedure is very quick to calculate.
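A minimal sketch of this matrix formulation using a linear algebra routine (numpy's least-squares solver) on made-up data:

import numpy as np

# Hypothetical data with two inputs (columns of X) and one output y.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([3.1, 2.9, 7.2, 6.8, 10.1])

# Add a column of ones so the intercept B0 is estimated as well.
X_b = np.hstack([np.ones((X.shape[0], 1)), X])

# Ordinary Least Squares via a linear algebra routine:
# minimizes the sum of squared residuals ||X_b @ coeffs - y||^2.
coeffs, residuals, rank, _ = np.linalg.lstsq(X_b, y, rcond=None)
print("B0, B1, B2 =", coeffs)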
3. Gradient Descent
When one or more inputs are available, you can use the process of optimizing coefficient values by
iteratively minimizing the model error on your training data.
This operation is called Gradient Descent, starting with random values for each coefficient. The
sum of squared errors is calculated for each pair of input and output values. The learning rate is
used as a scale factor and the coefficients are updated in order to minimize the error. The process is
repeated until a minimum amount of squared error is achieved or no further improvement is
possible.
When using this method, you must select the learning rate (alpha) parameter that will determine the
size of the improvement step to be taken for each iteration of the procedure.
Gradient descent is often taught using a linear regression model because it is relatively easy to
understand. In practice, it is useful when you have a very large dataset in either the number of rows
or the number of columns that may not fit into your memory.
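A minimal from-scratch sketch of gradient descent for simple linear regression on made-up data; alpha is the learning rate and the loop stops after a fixed number of iterations rather than on a convergence test:

import numpy as np

# Hypothetical 1-D data; fit y = b0 + b1 * x by iteratively reducing the squared error.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

b0, b1 = 0.0, 0.0      # start from (arbitrary) initial coefficients
alpha = 0.01           # learning rate: size of each update step
for _ in range(5000):
    error = (b0 + b1 * x) - y
    # Update steps proportional to the derivatives of the squared-error cost.
    b0 -= alpha * error.mean()
    b1 -= alpha * (error * x).mean()

print(f"y = {b0:.2f} + {b1:.2f} * x")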
4. Regularization
There are extensions to the training of a linear model called regularization methods. Both aim to
minimize the sum of the squared error of the model on the training data (using ordinary least
squares) but also to reduce the complexity of the model (such as the number or absolute size of the
sum of all coefficients in the model).
1. Lasso regression: where ordinary least squares are modified to minimize the absolute sum
of the coefficients (called L1 regularization) as well.
2. Ridge Regression: where the ordinary least squares are modified to minimize the squared
absolute sum of the coefficients (called L2 regularization)
These methods are effective to use when there is collinearity in your input values and ordinary
least squares would overfit the training data.
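A minimal sketch of both regularized variants using scikit-learn on made-up, deliberately correlated inputs; alpha controls the strength of the penalty:

import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Hypothetical data with two (highly correlated) input columns.
X = np.array([[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.9], [5.0, 5.1]])
y = np.array([2.0, 4.1, 6.0, 7.9, 10.1])

# L1 regularization (Lasso) can drive some coefficients exactly to zero.
lasso = Lasso(alpha=0.1).fit(X, y)
# L2 regularization (Ridge) shrinks coefficients towards zero.
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso coefficients:", lasso.coef_)
print("Ridge coefficients:", ridge.coef_)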
Now that you know some techniques to learn the coefficients in a linear regression model, let's
look at how we can use a model to make new data predictions.
Making Predictions with Linear Regression
Since the representation is a linear equation, making predictions is as simple as solving the
equation for a specific set of inputs.
Let's use an example to make this concrete. Imagine that we predict weight (y) from height (x). Our
linear regression model for this problem would be:
y = B0 + B1 * x1
or
weight = B0 + B1 * height
Where B0 is the bias coefficient and B1 is the coefficient for the height column. We use a
learning technique to find a good set of coefficient values. Once found, we can plug in
different height values to predict the weight.
For example, let's use B0 = 0.1 and B1 = 0.5, and calculate the weight (in kilograms) for an
individual with a height of 182 centimeters:
weight = 0.1 + 0.5 * 182 = 91.1 kg
You can see that the above equation could be plotted as a line in two dimensions. The B0 is our
starting point regardless of what height we have. We can run through a bunch of heights from 100
to 250 centimeters, plug them into the equation and get weight values, creating our line.
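A minimal sketch of that exercise in Python, using the coefficients from the example above (matplotlib assumed available):

import numpy as np
import matplotlib.pyplot as plt

# Coefficients from the worked example above.
b0, b1 = 0.1, 0.5

heights = np.arange(100, 251)          # heights from 100 to 250 cm
weights = b0 + b1 * heights            # predicted weight for each height

plt.plot(heights, weights)
plt.scatter([182], [b0 + b1 * 182])    # the single prediction: 91.1 kg at 182 cm
plt.xlabel("height (cm)")
plt.ylabel("predicted weight (kg)")
plt.show()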
Linear regression has been studied extensively, and there is a lot of literature on how your data
needs to be structured to make the most of the model.
As such, there is a lot of sophistication in talking about these requirements and expectations, which
can be intimidating. In practice, these rules can be used more as rules of thumb when using Ordinary
Least Squares regression, the most common linear regression implementation.
Try using these heuristics to prepare your data differently and see what works best for your
problem.
1. Linear Assumption. Linear regression assumes that the relationship between input and output is
linear. It doesn't support anything else. This may be obvious, but it's a good thing to remember
when you have a lot of attributes. You may need to transform the data to make the relationship
linear (e.g. transform log for an exponential relationship).
2. Remove your noise. Linear regression assumes that the input and output variables are not noisy.
Consider using data cleaning operations that will make it easier for you to expose and clarify the
signal in your data. This is most important for the output variable and, if possible, you want to
remove outliers in the output variable (y).
3. Remove Collinearity. Linear regression will over-fit your data when you have highly
correlated input variables. Consider calculating pairwise correlations for your input data and
removing the most correlated inputs.
4. Gaussian Distributions. Linear regression will make more reliable predictions if your input and
output variables have a Gaussian distribution. You may get some benefit from transforms (e.g.
log or Box-Cox) that make the distribution of your variables more Gaussian.
5. Rescale Inputs. Linear regression will often make more reliable predictions if you rescale the input
variables using standardization or normalization, as in the sketch below.
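A minimal sketch of both options using scikit-learn's preprocessing utilities on made-up input data:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical input data with very different scales (e.g. age vs income).
X = np.array([[25, 40000.0], [32, 52000.0], [47, 81000.0], [51, 60000.0]])

standardized = StandardScaler().fit_transform(X)   # mean 0, standard deviation 1
normalized = MinMaxScaler().fit_transform(X)       # rescaled to the [0, 1] range
print(standardized)
print(normalized)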
2. K-NEAREST NEIGHBORS (KNN)
• Lazy learning algorithm − KNN is a lazy learning algorithm because it doesn't have a
specialized training phase and uses all of the data for training while classifying.
The K-nearest neighbors (KNN) algorithm uses 'feature similarity' to predict the values of new data
points: a value is assigned to a new data point based on how closely it matches the points in the
training set. We can understand how it works with the following steps:
Step 1 − To implement any algorithm we need a data set, so during the first step of KNN we load
the training as well as the test data.
Step 2 − Next, we need to choose the value of K, i.e. the number of nearest data points to consider.
K can be any integer.
Step 3 − For each point in the test data, do the following:
3.1 − Calculate the distance between the test data and each row of the training data using any
method, namely Euclidean, Manhattan or Hamming distance. Euclidean distance is the most
commonly used method.
3.2 − Sort the distances in ascending order.
3.3 − Select the top K rows from the sorted array.
3.4 − Assign a class to the test point based on the most frequent class of these rows.
Step 4 − End.
Example − The following example illustrates the concept of K and the working of the KNN
algorithm. Suppose we have a dataset that can be plotted as shown in Figure 4:
Figure 4: Dataset
Now, we would like to classify a new data point, shown as a black dot (at point 60, 60), into the blue
or the red class. We assume K = 3, i.e. the algorithm finds the three nearest data points, as shown in
Figure 5.
We can see in Figure 5 the three nearest neighbors of the data point with the black dot. Among those
three, two lie in the red class, hence the black dot will also be assigned to the red class.
Implementation in Python
As we know, the K-nearest neighbors (KNN) algorithm can be used for both classification and
regression. The following are recipes in Python for using KNN as a classifier as well as a regressor.
KNN as Classifier
First, start by importing the necessary Python packages, loading the dataset and splitting it into
training and test sets. The features are then rescaled, for example:
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
Next, train the model with the help of KNeighborsClassifier class of sklearn as follows –
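A minimal end-to-end sketch of the classifier, assuming the UCI Iris dataset used elsewhere in this unit; the exact confusion matrix and accuracy will depend on the train/test split:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# Assumed data source: the UCI Iris dataset used elsewhere in this unit.
path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
headernames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
data = pd.read_csv(path, names=headernames)

X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

# Rescale the features so all inputs contribute comparably to the distance.
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Train the classifier and evaluate it on the held-out test set.
classifier = KNeighborsClassifier(n_neighbors=8)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)

print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))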
Confusion Matrix:
[[21 0 0]
[ 0 16 0]
[ 0 7 16]]
Classification Report:
Accuracy: 0.8833333333333333
KNN as Regressor
First, start with importing necessary Python packages −
import numpy as np
import pandas as pd
Next, download the iris dataset from its weblink as follows −
path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
Next, we need to assign column names to the dataset as follows −
headernames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
Now, we need to read the dataset into a pandas dataframe as follows −
data = pd.read_csv(path, names=headernames)
array = data.values
X = array[:, :2]
y = array[:, 2]
data.shape
output: (150, 5)
Next, import KNeighborsRegressor from sklearn to fit the model −
from sklearn.neighbors import KNeighborsRegressor
knnr = KNeighborsRegressor(n_neighbors=10)
knnr.fit(X, y)
At last, we can find the MSE as follows −
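A minimal sketch of this step, assuming the knnr model fitted above and using scikit-learn's mean_squared_error on the model's own predictions:

from sklearn.metrics import mean_squared_error

# Mean squared error of the regressor's predictions on the training data.
mse = mean_squared_error(y, knnr.predict(X))
print("MSE:", mse)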
Pros
• It is very useful for non-linear data because the algorithm makes no assumptions about the data.
• It is a versatile algorithm which can be used for both classification and regression.
• It is relatively accurate, although there are far better supervised learning models than KNN.
Cons
• It is a computationally expensive algorithm, because it stores all of the training data.
The following are some of the areas in which KNN can be successfully applied –
1. Banking system − KNN can be used to predict whether an individual is fit for loan approval.
Does that individual have characteristics similar to those of defaulters?
2. Calculation of credit ratings − KNN algorithms can be used to find an individual's credit
rating by comparing it to persons with similar characteristics.
3. Politics − With the help of KNN algorithms, we can classify potential voters into
different classes like "will vote", "will not vote", "will vote for the Congress Party" or "will
vote for the BJP Party".
4. Other areas in which the KNN algorithm is often used are speech recognition, handwriting
detection, image recognition and video recognition.
3. K-MEANS
The k-means algorithm is an iterative algorithm that attempts to divide the dataset into K pre-defined,
separate, non-overlapping subgroups (clusters) where each data point belongs to only one group. It
tries to make the intra-cluster data points as similar as possible while keeping the clusters as different
(far apart) as possible. It assigns data points to a cluster in such a way that the sum of the squared
distances between the data points and the cluster's centroid (the arithmetic mean of all data points
belonging to that cluster) is at a minimum. The less variation we have within clusters, the more
homogeneous (similar) the data points are within the same cluster.
The way the k-means algorithm works is as follows:
1. Specify the number of clusters K.
2. Initialize the centroids by first shuffling the dataset and then randomly selecting K data points
for the centroids without replacement.
3. Keep iterating until there is no change to the centroids, i.e. the assignment of data points to
clusters does not change.
4. Compute the sum of the squared distance between the data points and all the centroids.
5. Assign each data point to the closest cluster (centroid).
6. Compute the centroids for the clusters by taking the average of all data points belonging to each
cluster.
The approach k-means follows to solve the problem is called Expectation-Maximization. The
E-step assigns the data points to the nearest cluster; the M-step computes the centroid of each
cluster. Below is a breakdown of how we can solve it mathematically (feel free to skip it). The
objective function is
J = Σ_i Σ_k wik * ||xi − μk||²
where wik = 1 for data point xi if it belongs to cluster k and wik = 0 otherwise, and μk is the centroid
of xi's cluster.
It is a minimization problem of two parts. We first minimize J w.r.t. wik and treat μk as fixed; then we
minimize J w.r.t. μk and treat wik as fixed. Technically speaking, we differentiate J w.r.t. wik first and
update the cluster assignments (E-step), then we differentiate J w.r.t. μk and recompute the centroids
after the cluster assignments from the previous step (M-step). Therefore, the E-step is:
wik = 1 if k = argmin_j ||xi − μj||², and wik = 0 otherwise.
In other words, assign the data point xi to the closest cluster as judged by its squared distance
from the cluster's centroid. The M-step is:
μk = (Σ_i wik * xi) / (Σ_i wik)
which translates to recomputing the centroid of each cluster to reflect the new assignments.
• Since clustering algorithms, including k-means, use distance-based measurements to determine
the similarity between data points, it is recommended to standardize the data to have a mean of zero
and a standard deviation of one, since the features in almost any data set will have different units of
measurement, such as age versus income.
• Given the iterative nature of k-means and the random initialization of the centroids at the start of
the algorithm, different initializations may lead to different clusters, because the algorithm may get
stuck in a local optimum and fail to converge to the global optimum. It is therefore recommended to
run the algorithm with different centroid initializations and to pick the run that yields the lowest sum
of squared distances.
• The assignment of examples not changing is the same thing as no change in the within-cluster
variation.
Implementation
We will use a simple implementation of k-means here to illustrate some of the concepts, and then
use the sklearn implementation, which is more efficient and takes care of many things for us.
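A minimal sketch using scikit-learn's KMeans on made-up two-dimensional points (a from-scratch version would implement the E and M steps above directly):

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D points forming two loose groups.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
               rng.normal(loc=3.0, scale=0.5, size=(50, 2))])

# n_init runs the algorithm with several centroid initializations and
# keeps the run with the lowest sum of squared distances (inertia).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Centroids:\n", kmeans.cluster_centers_)
print("Sum of squared distances (inertia):", kmeans.inertia_)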
Applications
The k-means algorithm is very popular and is used in a variety of applications such as market
segmentation, document clustering, image segmentation and image compression. The goal when we
undertake a cluster analysis is usually either to:
1. Get a meaningful insight into the structure of the data we are dealing with, or
2. Cluster-then-predict, where different models are built for different subgroups if we believe
there is wide variation in the behavior of the subgroups. An example of this is clustering
patients into different subgroups and building a model for each subgroup to predict the risk of
heart attack.
4. FILTERING SPAM
4.1 Spam
• Involves sending messages by email to numerous recipients at the same time (mass emailing).
• Grew exponentially since 1990 but has leveled off recently and is no longer growing
exponentially.
• Giveaways
• Chain letters
• Political email
Spam as a Problem
Some statistics:
• $130 billion worldwide
• $42 billion in the U.S., an increase over 2007 estimates
• 100% increase in 2007 from 2005
• Productivity loss from inspecting and deleting spam missed by spam control products
(false negatives)
• Productivity loss from searching for legitimate mails deleted in error by spam control
products (false positives)
• Operations and service costs (filters and firewalls − installation and maintenance)
Email address harvesting − the process of obtaining email addresses through various methods:
• Bots
Types of Spam Filters
1. Header filters
a. Look at email headers to judge whether they are forged or not
2. Content filters
a. Scan the text content of emails
b. Use fuzzy logic
3. Permission filters
a. Based on a challenge/response system
4. Whitelist/blacklist filters
a. Will only accept emails from a list of "good email addresses"
b. Will block emails from "bad email addresses"
5. Community filters
a. Work on the principle of "communal knowledge" of spam
b. These types of filters communicate with a central server
6. Bayesian filters
a. Statistical email filtering
b. Uses the Naïve Bayes classifier
Bayesian Classification
1. Particular words have particular probabilities of occurring in spam emails and in legitimate
emails. For example, most email users frequently encounter the word "Viagra" in spam emails,
but rarely see it in other emails.
2. The filter does not know these probabilities beforehand, and must be trained first so that it can
build them up.
3. To train the filter, the user must manually indicate whether or not a new email is spam.
4. For all of the words in each training email, the filter adjusts the probability that each word
will appear in spam or in legitimate email in its database.
For example, Bayesian spam filters will typically have learned a very high spam probability for
the words "Viagra" and "refinance," but a very low spam probability for words that are seen
only in legitimate emails, such as the names of friends and family members
5. After training, the word probabilities are used to calculate the probability that an email
containing a specific set of words belongs to either category.
6. Each word in the email contributes to the spam probability of the email, or only the most
interesting words do.
7. This contribution is calculated using Bayes' theorem.
8. The individual word contributions are then combined into an overall spam probability for the
email; if it exceeds a chosen threshold, the message is classified as spam.
9. The user can correct misclassifications (false positives or false negatives), which allows the
software to dynamically adapt to the ever-evolving nature of spam.
10. Some spam filters combine the results of Bayesian spam filtering with other heuristics
(predefined content rules, envelope analysis, etc.), resulting in even higher filtering accuracy.
For example, suppose a suspected message contains the word "replica". Most people who regularly
receive e-mail know that such a message is likely to be spam.
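A minimal sketch of a Naïve Bayes spam filter with scikit-learn on a tiny made-up corpus; the learned word counts play the role of the word probabilities described above:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny, made-up training corpus: 1 = spam, 0 = ham.
emails = [
    "cheap viagra refinance your loan now",
    "free money click here to claim your prize",
    "meeting agenda for monday attached",
    "family dinner at grandma's house this weekend",
]
labels = [1, 1, 0, 0]

# Turn the text into word-count features, then learn per-word probabilities.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)
model = MultinomialNB().fit(X, labels)

# Classify a new, unseen message.
new = vectorizer.transform(["claim your free prize money now"])
print("spam probability:", model.predict_proba(new)[0, 1])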
Pros & Cons
Advantages
• It can be trained on a per-user basis
o The spam that a user receives is often related to that user's online activities
Disadvantages
o Google, through its Gmail email system, performs OCR on every mid-to-large-size
image and analyzes the text inside it
• Spam emails not only consume computing resources, but can also be frustrating
• Numerous detection techniques exist, but none is a “good for all scenarios” technique
• Data Mining approaches for content based spam filtering seem promising
4.8 How to Design a Spam Filtering System with Machine Learning Algorithm
Exploratory Data Analysis is a very important data science process. It helps the data scientist to
understand the data at hand and relates it to the business context.
The open source tool that I will use to visualize and analyze my data is Word Cloud.
Word Cloud is a data visualization tool used to represent text data. The size of the text in the image
represents the frequency or importance of the words in the training data.
3. Visualize the training data with Word Cloud & Bar Chart
Data is the essential ingredient before we can develop any meaningful algorithm. Knowing where
to get your data can be very handy, especially when you are just a beginner.
Below are a few of the famous repositories where you can easily get thousands of data sets for
free:
1. Kaggle datasets
2. AWS datasets
In short, there are two types of data present in this repository, namely ham (non-spam) and spam
data. In addition, the ham data is split into easy and hard examples, meaning that some non-spam data
are very similar to spam data. This could make it difficult for our system to make a decision.
If you are using Linux or Mac, simply run the command in the terminal; wget is a command that
helps you download a file from a URL:
Figure 7: Visualization for spam email
From this view, you can see something interesting about the spam email. Many of them have a
high number of "spam" words, such as: free, money, product, etc. Having this awareness could
help us make a better decision when it comes to designing a spam detection system.
One important thing to note is that the word cloud displays only the frequency of words, not
necessarily the importance of words. It is therefore necessary to do some data cleaning, such as
removing stop words, punctuation and so on, from the data before visualizing it.
Another visualization technique is to use a bar chart to display the frequency of the most common
words. The n-gram size specifies how many words are treated as a single unit when counting word
frequencies.
I have shown an example of 1-gram and 2-gram in Figure 9. You can definitely experiment with a
larger n-gram model.
Figure 9: Bar chart visualization of 1-gram model
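A minimal sketch of counting 1-gram and 2-gram frequencies with scikit-learn's CountVectorizer on made-up messages; the resulting counts are what a bar chart like Figure 9 would display:

from sklearn.feature_extraction.text import CountVectorizer

# Made-up messages standing in for the training emails.
docs = ["free money now", "claim free money", "meeting notes attached"]

# ngram_range=(1, 2) counts both single words (1-grams) and word pairs (2-grams).
vectorizer = CountVectorizer(ngram_range=(1, 2))
counts = vectorizer.fit_transform(docs)

# Total frequency of each n-gram across all documents, highest first.
totals = counts.sum(axis=0).A1
for ngram, total in sorted(zip(vectorizer.get_feature_names_out(), totals),
                           key=lambda p: -p[1]):
    print(ngram, total)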
It is important to divide your data set into a training set and test set, so that you can evaluate the
performance of your model using the test set before deploying it in a production environment.
Figure 11: Target Count For Train Data
The target distribution in the train data and the test data is quite similar, at around 20–21%, so we are
good to go and can start processing our data.
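A minimal sketch of such a split with scikit-learn's train_test_split on made-up messages and labels:

from sklearn.model_selection import train_test_split

# Made-up message texts and labels: 1 for spam, 0 for ham.
emails = ["free money now", "meeting at 10am", "claim your prize",
          "lunch tomorrow?", "win a free phone", "weekly report attached"]
labels = [1, 0, 1, 0, 1, 0]

# Hold out a third of the data for evaluation; stratify keeps the
# spam/ham proportions similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    emails, labels, test_size=1/3, stratify=labels, random_state=42)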
Data Preprocessing
Text Cleaning
Text cleaning is a very important step in machine learning because your data may contain a lot
of noise and unwanted characters such as punctuation, white space, numbers, hyperlinks and so on.
Typical cleaning steps include the following (a small sketch follows the list):
• removing numbers
• removing punctuation
• removing hyperlink
• removing stop words such as a, about, above, down, doing and the list goes on…
• Word Stemming
• Word lemmatization
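A minimal sketch of the first four cleaning steps with plain regular expressions (the stop-word list here is illustrative; in practice one would use, for example, NLTK's list):

import re

# Illustrative stop-word list; a real system would use a fuller list.
STOP_WORDS = {"a", "about", "above", "down", "doing", "the", "to"}

def clean_text(text: str) -> str:
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+", " ", text)   # remove hyperlinks
    text = re.sub(r"\d+", " ", text)                # remove numbers
    text = re.sub(r"[^\w\s]", " ", text)            # remove punctuation
    words = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(words)

print(clean_text("Visit http://example.com to claim the 1000 dollar prize!!!"))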
The two techniques that may seem foreign to most people are word stemming and word
lemmatization. Both of these techniques try to reduce words to their most basic form, but they do
so with different approaches.
• Word stemming — Stemming algorithms work by removing the word end or beginning,
using a list of common prefixes and suffixes that can be found in that language. Examples of
Word Stemming for English Words are as follows:
Form | Suffix removed | Stem
running | -ing | run
runs | -s | run
consolidate | -ate | consolid
consolidated | -ated | consolid
• Word Lemmatization — Lemmatization uses the dictionary of a particular language and
tries to convert words back to their base form. It tries to take into account the meaning of the
word (for example, whether it is a verb) and converts it back to the most suitable base form.
Form | Lemma
studies | study
breaks | break
Implementing these two algorithms might be tricky and requires a lot of thinking and design to
deal with different edge cases.
Luckily, the NLTK library provides implementations of these two algorithms, so we can use them
out of the box.
Import the library and define some functions to help us understand the basic workings of these two
algorithms.
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
dirty_text = "He studies in the house yesterday, unluckily, the fan breaks down"

# Apply the stemmer / lemmatizer to each word and join the results back together.
def word_stemmer(words):
    stem_words = [stemmer.stem(o) for o in words]
    return " ".join(stem_words)

def word_lemmatizer(words):
    lemma_words = [lemmatizer.lemmatize(o) for o in words]
    return " ".join(lemma_words)
The output of the word stemmer is quite obvious: some of the endings of the words have been
chopped off.
#Output
'He studied in the house yesterday, unluckily, the fan break down'
The lemmatization has converted studies -> study, breaks -> break
Our algorithms always expect the input to be integers or floats, so we need a feature
extraction layer in the middle to convert the words to integers/floats.
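A minimal sketch of such a feature-extraction layer with scikit-learn's TfidfVectorizer (CountVectorizer works the same way) on made-up messages:

from sklearn.feature_extraction.text import TfidfVectorizer

# Cleaned message texts (made up for illustration).
messages = ["free money now", "project meeting tomorrow", "claim your free prize"]

# Convert each message into a numeric vector of TF-IDF weights.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(messages)

print(X.shape)                      # (number of messages, vocabulary size)
print(vectorizer.get_feature_names_out())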