
Data Science and Big Data

A Project Submitted in partial fulfilment for the award of the Degree of

Bachelor of Science (Information Technology)

Session 2024-2025

Submitted By:

Abdul Kadir

ARYAN COLLEGE

LOHAGAL, AJMER-305001

MAHARSHI DAYANAND SARASWATI UNIVERSITY

AJMER (RAJ.) INDIA


ACKNOWLEDGEMENT

I express my deep sense of gratitude and indebtedness to my project guide, Mr. Trivendra Sir, Lecturer, Aryan College, Ajmer, for discussing the case study and for guiding and constantly inspiring me throughout the accomplishment of this project. Carrying out this work under his supervision has been both a great pleasure and a privilege.

I am genuinely grateful to my parents and friends for always being there as a support system. I also take this opportunity to express my sincere thanks to Mr. Amar Pal Singh Shekhawat, Director, Aryan College, Ajmer, for his active cooperation and valuable suggestions throughout the project work.

Abdul Kadir
PREFACE
This project on Data Science and Big Data in Java has been prepared with a lot of hard work and with the valuable guidance of Mr. Trivendra Sir. Through this project, the Data Science and Big Data workflow has been automated to reduce tedious manual and mental work. The manpower and financial resources saved by implementing this program can be utilized for the further growth of the business.

I am also thankful to our mentor, Dr. Prafull Chandra Narooka, for giving suggestions whenever I needed them.


Contents
Unit I- Introduction of Data Science & R Programming

1. Introduction
1.1 Definition of Data Science
1.2 Importance of Data Science
1.3 Role of Data Scientist
1.4 Tools for Data Science
1.5 Applications of Data Science
1.6 Lifecycle of Data Science
2. Big Data and Data Science hype
2.1 Types of Big Data
2.2 Three Characteristics of Big Data - the 3 V's
2.3 Benefits of Big Data
2.4 Big Data Techniques
2.5 Underfitting and Overfitting
2.6 Data Science Hype
3. Statistical Inference, Statistical modelling
4. Probability Distributions
4.1 What is Probability?
4.2 Why Probability is important?
4.3 How to use Probability in Data Science?
4.4 What are probability distributions?
4.5 What are the types of probability distributions?
5. Fitting a model
5.1 Objectives of Model Fitting
5.2 Why are we fitting models to data?
6. Introduction to R
6.1 What is R?
6.2 Why We Choose R for Data Science?
6.3 History of R
6.4 R Features
6.5 How R is Different from Other Technologies

6.6 Applications of R Programming
6.7 Why is R Important in Data Science?
6.8 What Makes R Suitable For Data Science?
6.9 Data Science Companies that Use R
7. Exploratory Data Analysis and the Data Science Process
7.1 Exploratory Data Analysis
7.2 Data Science Process
8. Basic tools (plots, graphs and summary statistics) of EDA
8.1 Exploratory data analysis
8.2 Types of Data
9. The Data Science Process - Case Study, Real Direct (online real estate firm)

UNIT- II (Basic Machine Learning Algorithms & Applications)

1. Linear Regression for Machine Learning


2. k-Nearest Neighbours (k-NN)
2.1 Working of KNN Algorithm
2.2 Implementation in Python
2.3 Pros and Cons of KNN
2.4 Applications of KNN
3. k-Means
3.1 Applications
4. Filtering Spam
4.1 What is spam?
4.2 Purpose of Spam
4.3 Spam Life Cycle
4.4 Types of Spam Filters
4.5 Spam Filters Properties
4.6 Bayesian Classification
4.7 Computing the Probability
4.8 How to Design a Spam Filtering System with Machine Learning Algorithm
5. Linear Regression and k-NN for filtering spam
6. Naive Bayes

7. Data Wrangling
8. Feature Generation
8.1 INTRODUCTION
8.2 BACKGROUND
8.3 SYSTEM AND METHODS
9. Feature Selection algorithms
9.1 The Problem the Feature Selection Solves
9.2 Feature Selection Algorithms
9.3 How to Choose a Feature Selection Method for Machine Learning

UNIT- III Mining Social-Network Graphs & Data Visualization

1. Social networks as graphs


2. Clustering of graphs
a. Graph Clustering Methods
b. What kinds of cuts are perfect for drawing clusters in graphs?
3. Direct discovery of communities in graphs
4. Partitioning of graphs
5. Neighbourhood properties in graphs
6. Data Visualization, Basic principles
a. Data Visualization
b. History of Data Visualization
c. Why is data visualization important?
d. What makes data visualization effective?
e. Five types of Big Data Visualization groups
7. Common data visualization tools
8. Examples of inspiring (industry) projects
9. Data Science and Ethical Issues

Unit I- Introduction of Data Science & R Programming
1. INTRODUCTION: WHAT IS DATA SCIENCE?
What is Data Science?

Data science is an interdisciplinary field that uses scientific methods, processes, algorithms
and systems to extract knowledge and insights from data in various forms, both structured and
unstructured, similar to data mining.

Why Data Science?

• Because organizations have a large amount of data, such as financial records, reviews, customer data, employee data, and so on.
• Because you want to keep that data clear and easy to understand so you can act on it; that is why data science is relevant.
• Because data analysis lets people make decisions faster and better.

Why is Data Science important?

Every company has data, and its business value depends on how much insight it can draw from that data.

Of late, data science has gained significance because it can help companies grow the business value of their available data and thereby gain a competitive advantage over their rivals.

It can help us know our customers better, it can help us refine our processes, and it can help us make better decisions. In the light of information technology, knowledge derived from data has become a vital instrument.

Role of Data Scientist

• Data scientists help organizations understand and handle data, and address complex problems using knowledge from a range of technology niches.

• They typically have backgrounds in computer science, modelling, statistics, analytics, and mathematics, combined with a clear business sense.

How to do Data Science?

A typical data science process looks like the following, and can be adapted for a specific use case:

• Understand the business


• Collect & explore the data
• Prepare & process the data

• Build & validate the models
• Deploy & monitor the performance

Tools for Data Science

1. R
2. Python
3. SQL
4. Hadoop
5. Tableau
6. Weka

Applications of Data Science

1. Data Science in Healthcare


2. Data Science in E-commerce
3. Data Science in Manufacturing
4. Data Science as Conversational Agents
5. Data Science in Transport

Lifecycle of Data Science

A brief overview of the main phases of the Data Science Lifecycle is shown in Figure 1:

Figure 1: Data Science Lifecycle

Phase 1—Discovery: It is important to understand the various criteria, requirements, goals, and necessary budget before you start the project. You ought to have the courage to ask the right questions. Here, you determine whether you have the resources needed to support the project in terms of people, equipment, time, and data. You must also frame the business problem in this phase and formulate initial hypotheses (IH) for testing.

Phase 2—Data preparation: You need an analytical sandbox in this phase, in which you can conduct analytics for the entire duration of the project. Before modelling, you need to explore, preprocess, and condition the data. In addition, you must perform ETLT (extract, transform, load, transform) to bring data into the sandbox. The flow of statistical analysis is shown in Figure 2 below.

Figure 2: Statistical analysis flow of data preparation

R may be used for data cleaning, retrieval, and visualization. It will help you identify the outliers and establish relationships between the variables. Once the data has been cleaned and prepared, it is time to do some exploratory analytics on it.

Phase 3—Model planning: Here you decide the methods and techniques for drawing the relationships between variables. These relationships will set the basis for the algorithms you will implement in the next phase. You will apply Exploratory Data Analysis (EDA) using statistical formulas and visualization tools.

Let’s have a look at various model planning tools in Figure 3.

Figure 3: Common Tools for Model planning

1. R has a full range of modelling capabilities and offers a strong environment for building interpretive models.

2. SQL Analysis Services can perform in-database analytics using data mining functions and basic predictive models.

3. SAS/ACCESS can be used to access data from Hadoop and is used to construct repeatable and reusable model flow diagrams.

While there are many tools on the market, R is the most commonly used tool.

Now that you have insights into the nature of your data and have chosen which algorithms to use, in the next phase you will apply those algorithms and build a model.

Phase 4—Model building: In this phase you will design data sets for training and testing purposes. You should decide whether your current resources are adequate for running the models or whether you need a more robust environment (such as fast and parallel processing). To construct the model, you will examine different learning techniques such as classification, association, and clustering.

Model building can be done using the following methods shown in Figure 4.

Figure 4: Common Tools for Model Building.

Phase 5—Operationalize: In this phase you deliver final reports, presentations, code, and technical documents. In addition, a pilot project is often run in a real-time production environment. This gives you a clear, small-scale picture of the performance and other relevant constraints before full deployment.

Phase 6—Communicating results: Now it is necessary to determine whether you were able to achieve the goal you defined in the first phase. In this last phase, you identify all the key findings, communicate them to the stakeholders, and determine whether the results of the project are a success or a failure based on the criteria developed in Phase 1.

2. BIG DATA AND DATA SCIENCE HYPE


What is meant by "Big Data"?

Big Data is a term used to describe a gigantic amount of both structured and unstructured data that is so large that it is difficult to handle using traditional database and programming techniques. In most enterprise scenarios the volume of data is too big, it moves too fast, or it exceeds current processing capacity. Working with Big Data involves two things:

(1) Collecting data in significant quantities, via machines, sensors, people, and events.

(2) Doing something with it: making decisions, testing hypotheses, gaining insight, and forecasting the future.

Types of Big Data

There are three types of data behind Big Data- structured, semi-structured, and unstructured shown in
Figure 5. There's a lot of useful knowledge in each category that you can mine to use in various
projects.

Figure 5: Types of Big data

 Structured

By structured data, we mean data that can be processed, stored, and retrieved in a fixed
format. It refers to highly organized information that can be readily and seamlessly stored and
accessed from a database by simple search engine algorithms. For instance, the employee
table in a company database will be structured as the employee details, their job
positions, their salaries, etc., will be present in an organized manner.

 Unstructured

Unstructured data refers to data that does not have any particular form or structure at all. This makes it very difficult and time-consuming to process and analyze. Email is an example of unstructured data.

 Semi-structured

Semi-structured data contains both of the formats mentioned above, that is, structured and unstructured data. To be precise, it refers to data that, although it has not been classified under a particular repository (database), still contains vital information or tags that separate individual elements within the data.

Big Data also comes from several sources and often in different formats, and it is not always easy to know how to combine all of the tools you need to work with these different types.
Three Characteristics of Big Data - the 3 V's
1. Volume - Data quantity
2. Velocity - Data Speed
3. Variety- Data Types

1. Volume
A typical PC might have had 10 gigabytes of storage in 2000. Today, Facebook regularly ingests around 500 terabytes of new data, and a Boeing 737 can generate 240 terabytes of flight data on a single trip across the US. Smartphones, with the data they create and consume, and sensors embedded into everyday objects will soon result in billions of new, constantly updated data feeds containing environmental, location, video, and other information.
2. Velocity
• Clickstreams and ad impressions capture consumer activity at millions of events per second.
• High-frequency stock trading algorithms reflect market movements within microseconds.
• Machine-to-machine processes exchange data between billions of devices.
• Networks and sensors produce huge volumes of real-time log data.
• Online gaming systems support millions of concurrent users, each producing multiple inputs per second.

3. Variety
• Big data is not just numbers, dates, and strings. It also includes geospatial data, 3D data, audio and video, and unstructured text, including web-based log files.
• Traditional database systems were designed to handle smaller volumes of structured data, with fewer updates and a predictable, consistent data structure.
• Big Data analysis involves data of many different kinds.

Benefits of Big Data

The ability to process Big Data brings several advantages, for example:

1. Organizations can use external intelligence while making decisions

Access to social data from search engines and sites such as Facebook and Twitter is enabling organizations to fine-tune their business strategies.

2. Improved customer service

Traditional customer feedback systems are being replaced by new systems built with Big Data technologies. In these new systems, Big Data and natural language processing technologies are used to read and evaluate consumer responses.

3. Early identification of risk to the product or services, if any

4. Better operational efficiency

Big Data technologies can be used to create a staging area or landing zone for new data before deciding which data should be moved to the data warehouse. In addition, such integration of Big Data technologies with a data warehouse helps an organization offload infrequently accessed data.

Big Data Tools for Data Analysis
1) Apache Hadoop
2) CDH (Cloudera Distribution for Hadoop)
3) Cassandra
4) Knime
5) Datawrapper
6) MongoDB
7) Lumify
8) HPCC
9) Storm
10) Apache SAMOA
11) Talend
12) Rapidminer
13) Qubole
14) Tableau
15) R

Big Data Techniques

Six big data analysis techniques


Big data is defined by the three V's: the vast amount of data, the pace at which it is processed, and the broad variety of data. Because of the second descriptor, the pace, data analytics has extended into the technical fields of machine learning and artificial intelligence. In addition to these emerging computer-driven analytical techniques, analyses often still rely on conventional statistical methods. Essentially, big data analysis operates within an enterprise in two ways: streaming analysis of data as it arrives, and batch analysis of data as it accumulates, in order to search for behavioural patterns and trends.

The more informative data becomes in its size, scope, and depth, the more creativity it drives.

1. A/B testing

2. Data fusion and data integration

3. Data mining

4. Machine learning

5. Natural language processing (NLP).

6. Statistics.

Underfitting and Overfitting

Machine learning uses data to create a “model” and uses the model to make predictions, for example:

 Customers who are women over age 20 are likely to respond to an advertisement
 Students with good grades are predicted to do well on the SAT
 The temperature of a city can be estimated as the average of its nearby cities, unless
some of the cities are on the coast or in the mountains

• Underfitting
Model used for predictions is too simplistic
 60% of men and 70% of women responded to an advertisement, therefore all future
ads should go to women
 If a furniture item has four legs and a flat top it is a dining room table
 The temperature of a city can be estimated as the average of its nearby cities, unless
some of the cities are on the coast or in the mountains

• Overfitting
Model used for predictions is too specific
 The best targets for an advertisement are married women between 25 and 27 years with
short black hair, one child, and one pet dog
 If a furniture item has four 100 cm legs with decoration and a flat polished wooden top with
rounded edges then it is a dining room table
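To make the contrast concrete, here is a minimal Python sketch (assuming numpy is available; the synthetic quadratic data and the polynomial degrees are illustrative assumptions, not part of the original report) that fits the same noisy data with a model that is too simplistic and one that is too specific, then compares errors on held-out data:

import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: a quadratic trend plus noise.
x_train = np.linspace(0, 1, 20)
y_train = 3 * x_train**2 + rng.normal(0, 0.2, x_train.size)
x_test = np.linspace(0, 1, 100)
y_test = 3 * x_test**2 + rng.normal(0, 0.2, x_test.size)

for degree in (1, 15):                      # underfit vs overfit
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE={train_err:.3f}, test MSE={test_err:.3f}")

# The degree-1 model has high error everywhere (underfitting); the degree-15
# model fits the training points closely but typically does worse on new data
# (overfitting).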

Data Science Hype

The noise around AI, data science, machine learning, and deep learning is reaching a fever pitch. As this noise has grown, our industry has seen a shift in what people mean when they say "AI," "machine learning," or "data science." It can be argued that our industry lacks a well-established taxonomy, and if there is one, then we as data science professionals have not done a very good job of adhering to it. This has consequences. Two of them are the creation of a hype bubble that leads to unrealistic expectations, and a growing inability to communicate, especially with colleagues from non-data-science fields. I will cover concise definitions in this post and then argue why they matter.

Concise Definitions

Data Science: a discipline that produces predictions and explanations using code and data to create models that are put into action.
Machine Learning: a class of algorithms or techniques to capture complex data patterns in
model form automatically.
Deep learning: A class of machine learning algorithms that uses more than one hidden layer of
neural networks.

AI: a group of systems functioning in a manner comparable to humans in both degree of
autonomy and reach.

Hype

There is a lot of star power in these words. They encourage people to dream and envision a better future, which leads to their unnecessary use. More buzz around our industry raises the tide that lifts all boats, right? Sure, we all hope the tide keeps rising. Yet we should aim for a sustainable rise and avoid a hype bubble that, when it bursts, would cause widespread disillusionment.

Numerous leaders are requesting guidance on how to help executives, mid-level managers, and even new data scientists hold reasonable expectations of data science initiatives without losing excitement about data science. Unrealistic expectations slow progress by deflating enthusiasm when projects yield less than utopian results.

A major cause of this hype has been the constant overuse of "AI" when referring to any
solution which allows some kind of prediction. Owing to constant overuse, people automatically
equate data science ventures with near-perfect autonomous human-like solutions. Or, at the very
least, people believe that data science can easily solve their particular predictive need, without
questioning whether their organizational data supports such a concept.

Communication

Imprecise words often clutter up conversations. This can be especially detrimental when a cross-functional team assembles to communicate priorities and develop the end solution in the early planning phases of a data science project.

I know a data science manager who demands that his data science team be practically locked in a room with the business stakeholders for an hour before he approves any new data science project. Okay, the door is not actually locked, but it is shut, and for a full hour he wants them to discuss the project. Because they concentrated on early communication with business stakeholders, they saw a reduction in project rework. Describing concepts related to data science is challenging enough as it is; we only make it more complicated if we cannot define our own terms.

Because AI and deep learning have come onto the scene, discussions constantly need to pause for questions to figure out what people actually mean when they use those words. For starters, how would you interpret the following statements?

• "Our goal is to make our technology AI-driven within 5 years."
• "We need to improve machine learning before we invest in deep learning."
• "We use AI to predict fraud so that our customers can spend with confidence."
• "Our research showed that AI-investing organizations had a 10 percent increase in revenue."

The most common term confusion is when someone talks about AI solutions, or about doing AI, when they should actually be talking about building a machine learning or deep learning model. All too frequently the exchange of words seems deliberate, with the speaker hoping to get a hype boost by saying "AI." Let's go through each of the definitions and see if we can agree on a taxonomy.

Data Science
First of all, I see data science as a technical discipline like any other academic discipline. Take biology, for example. Biology comprises a variety of concepts, theories, processes, and instruments, and experimentation is normal. The biology research community continuously contributes to the discipline's knowledge base. Data science is no different. Practitioners apply the science, and researchers move the field forward with new hypotheses, principles, and methods.

The activity of data science involves marrying code (usually in some mathematical programming language) with data to build models. This includes the essential, and often dominant, initial steps of obtaining, cleaning, and preparing data. Data science models generally make predictions (e.g., predicting loan risk, predicting a disease diagnosis, predicting how to respond in a conversation, predicting what objects are in an image).

Data science models may also explain or characterize the world for us (e.g., which combination of variables is most important in making a disease diagnosis, or which consumers are most similar to one another and how). Eventually, these models are put into action by applying them to new data to make predictions and explanations. Data science is a discipline that produces predictions and explanations using code and data to create models that are put into action.

A definition of data science can be difficult to formulate while also separating it from statistical analysis. I came to the data science profession via educational training in math and statistics as well as professional experience as a statistician. Like many of you, I was doing data science before it became a thing.

Statistical analysis is focused on samples, experimental conditions, probabilities, and distributions. It typically addresses questions about the probability of events or the validity of claims. It uses various techniques such as the t-test, chi-square test, ANOVA, design of experiments (DOE), response surface designs, and so on. Often these techniques create models too. For example, response surface designs are techniques for estimating a polynomial model of a physical system based on observed explanatory factors and how they contribute to the response factor.

In my interpretation, one important point is that data science models are applied to new data in order to make future predictions and explanations, that is, they are "put into production." Although it is true that response surface models can be used to predict a response on new data, this is typically a hypothetical prediction of what would happen if the inputs were modified. The engineers then adjust the inputs and analyze the responses the physical system produces in its new configuration. The response surface model itself is not put into production: it does not take thousands of new input settings, in batches or streams over time, and predict responses.

This definition of data science is by no means foolproof, but it starts to capture the essence of data science: bringing predictive and descriptive models into action.

Machine Learning
Machine learning as a term dates back to the 1950s. Today, data scientists see it as a collection of techniques used within data science: a toolset, or class of techniques, for constructing the models described above. Machine learning lets computers create (or learn) models on their own, rather than having a person directly articulate the logic of a model. This is achieved by analyzing an initial collection of data, finding complex hidden patterns in that data, and storing those patterns in a model so that they can later be applied to new data to make predictions or interpretations.

The magic behind this automated pattern-discovery process lies in the algorithms. Algorithms are the workhorses of machine learning. Popular machine learning algorithms include the various neural network approaches, clustering strategies, gradient boosting machines, random forests, and many more. If data science is a discipline like biology, then machine learning is like microscopy or genetic engineering: a set of methods and techniques used to practise the discipline.

Deep Learning
Of these terms, deep learning is the simplest to define. Deep learning is a class of machine learning algorithms that employs more than one hidden layer of neural networks. Neural networks themselves date back to the 1950s. Deep learning algorithms became popular in the 1980s, went through a lull in the 1990s and 2000s, and then resurged in the current decade thanks to fairly minor changes in how deep networks are designed that turned out to have incredible impact. Deep learning can be applied to a wide range of applications, including image recognition, chat assistants, and recommender systems. Google Speech, Google Photos, and Google Search, for example, are some of the original solutions built using deep learning.

AI
AI has been around for a long time, long before the recent hype storm that has co-opted it along with other buzzwords. How do we, as data scientists, define it? When and how should we use it? What is AI to us? Honestly, I am not sure anyone really knows. This might be our "emperor has no clothes" moment. We have the ambiguity, and the resulting hype, that comes from the promise of something new and unknown. The CEO of a well-known data science company was recently talking with our team at Domino when he mentioned "AI." He immediately caught himself and said, "I know that doesn't really mean anything. I just had to start using it because everyone is talking about it. I resisted for a long time but finally gave in."

That said, I am going to take a stab at it: AI is a category of systems that people aim to build whose distinguishing characteristic is that they are comparable to humans in their degree of autonomy and scope of activity.

To extend our analogy: if data science is like biology and machine learning is like genetic engineering, then AI is like resistance to disease. It is the end product, a set of solutions or systems we seek to build by applying machine learning (often deep learning) and other techniques.

Here is the bottom line. I think we need to distinguish between techniques that are part of AI solutions, AI-like solutions, and actual AI solutions; that is, AI building blocks, solutions with AI-ish qualities, and solutions that approach human autonomy and scope. These are three separate things, yet people say "AI" for all three far too often.

For example,
• Deep learning is not AI. It is a technique that can be used as part of an AI solution.
• Most data science projects are not AI solutions. A customer churn model is not an AI
solution, no matter if it used deep learning or logistic regression.
• A self-driving car is an AI solution. It is a solution that operates with complexity and
autonomy that approaches what humans are capable of doing.

So let's start by reviewing the fundamental AI-related technologies that were omitted from the Gartner report this year but are still important for business:

• Deep neural networks (DNNs). Gartner describes DNNs as the basis for many of the other new technologies covered in the Hype Cycle.
• Conversational AI platforms. Gartner no longer considers conversational AI platforms to be emerging technologies, though it stresses their importance to the market.
• Digital Helpers. Gartner no longer considers virtual assistants to be emerging technologies, though it stresses their importance to the industry.
• Artificial General Intelligence (AGI). In my opinion, a good call by Gartner to favour a pragmatic vision of AI and move away from the hype. As Gartner mentions, AGI will not be mature for decades.

According to Gartner, which areas of AI should companies focus on? Based on the 2019 Emerging Technologies Priority List, those will be:

• Augmented Intelligence. Gartner recognizes this emerging technology as key to the design strategy of new business technologies, combining short-term automation with a mid- to long-term strategy that ensures quality improvement not only through automation but also by growing human talent.

• Edge AI. For situations where communication costs, latency, or high-volume ingestion are critical. This, of course, means ensuring that the right AI technologies and techniques for our use case (e.g., deep learning) are available for the IoT infrastructure we want to deploy, among other conditions.

Finally, these are the new emerging technologies related to AI in the 2019 Hype Cycle:

• Adaptive ML.

• Emotion AI.

• Explainable AI.

• Generative Adversarial Networks (GANs).

• Transfer Learning.

3. STATISTICAL INFERENCE, STATISTICAL MODELLING

STATISTICAL INFERENCE

Inferential Statistics

Inferential statistics allows you to make inferences about the population from the sample data
shown in Figure 7.

Figure 7: Inferential statistics

Population & Sample

A sample is a representative subset of a population. In certain cases, carrying out a full population census is ideal yet unrealistic. Sampling is much more practical, though it is vulnerable to sampling error. A sample that is not representative of the population is called biased, and the approach that produces such a sample is called sampling bias. The key forms of sampling bias are convenience bias, judgment bias, size bias, and response bias. Randomisation is the best method to reduce sampling bias. Simple random sampling is the simplest randomisation method; other systematic sampling techniques are cluster sampling and stratified sampling.

Sampling Distributions

As we increase our sample size, the sample means become distributed more and more normally around the true mean (the population parameter), and the variance of the sample means decreases.

Central Limit Theorem

The Central Limit Theorem helps us understand the following facts, whether or not the population distribution is normal:

1. The mean of the sample means is the same as the population mean.

2. The standard deviation of the sample means is equal to the standard error.

3. The distribution of sample means becomes increasingly normal as the sample size increases.
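As a quick illustration of these facts, here is a minimal Python sketch (assuming numpy is available; the exponential population and the sample size of 50 are illustrative assumptions) that simulates the distribution of sample means:

import numpy as np

rng = np.random.default_rng(42)

# A decidedly non-normal population (exponential, mean = 2).
population = rng.exponential(scale=2.0, size=100_000)
n = 50                                   # sample size

# Draw many samples and record each sample mean.
sample_means = np.array([
    rng.choice(population, size=n, replace=False).mean()
    for _ in range(2_000)
])

print("population mean:", population.mean())
print("mean of sample means:", sample_means.mean())            # fact 1
print("standard error (sigma/sqrt(n)):", population.std() / np.sqrt(n))
print("std of sample means:", sample_means.std())               # fact 2
# A histogram of sample_means would look approximately normal (fact 3).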

A sample mean can be referred to as a point estimate of a population mean. A confidence interval is always centred around the mean of your sample. To construct the interval, you add a margin of error. The margin of error is found by multiplying the standard error of the mean by the z-score of the chosen confidence level, as shown in Figure 8:

Figure 8: Confidence intervals graph

The confidence level represents the number of times out of 100 that the population average would be
within the specified sample mean interval.
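In standard notation (added here for reference, not from the original report), the confidence interval described above is:

\bar{x} \pm z^{*} \cdot \frac{s}{\sqrt{n}}

where \bar{x} is the sample mean, s is the sample standard deviation, n is the sample size, and z^{*} is the z-score for the chosen confidence level (for example, z^{*} \approx 1.96 for 95% confidence).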

Hypothesis Testing

Hypothesis testing is a kind of statistical inference that involves asking a question, collecting data, and then analysing what the data tells us about how to proceed. The hypothesis to be tested is called the null hypothesis and is given the symbol H0. We test the null hypothesis against an alternative hypothesis, which is given the symbol Ha, as shown in Figure 9.

Figure 9: Hypothesis testing

When testing a hypothesis, we have to determine how much of a difference between means is required to reject the null hypothesis. For their hypothesis test, statisticians first select a level of significance, or alpha (α) level.

Critical values mark the edge of the critical region. The critical region covers the entire range of values for which you reject the null hypothesis.

Figure 10: left, right & two-tailed tests

These are the four basic steps we follow for (one & two group means) hypothesis testing:

1. State the null and alternative hypotheses.

2. Select the appropriate significance level and check the test assumptions.

3. Analyse the data and compute the test statistic.

4. Interpret the result.
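As a hedged illustration of these four steps, here is a minimal Python sketch using scipy (the sample data, the hypothesised mean of 12.0, and the alpha of 0.05 are illustrative assumptions, not from the original report):

import numpy as np
from scipy import stats

# Step 1: H0: the population mean is 12.0; Ha: it is not 12.0.
sample = np.array([12.9, 11.4, 12.6, 13.1, 12.2, 11.8, 12.7, 12.5])

# Step 2: choose the significance level.
alpha = 0.05

# Step 3: compute the test statistic (one-sample t-test, since the
# population standard deviation is unknown here).
t_stat, p_value = stats.ttest_1samp(sample, popmean=12.0)

# Step 4: interpret the result.
if p_value < alpha:
    print(f"t={t_stat:.3f}, p={p_value:.3f}: reject H0")
else:
    print(f"t={t_stat:.3f}, p={p_value:.3f}: fail to reject H0")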

Hypothesis Testing (One and Two Group Means)

Hypothesis Test on One Sample Mean When the Population Parameters are Known

We find the z-statistic of our sample mean in the sampling distribution and determine whether that z-score falls within the critical (rejection) region or not. This test is only appropriate when you know the true mean and standard deviation of the population.
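In standard notation (added for reference), the z-statistic for a single sample mean is:

z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}

where \mu and \sigma are the known population mean and standard deviation, \bar{x} is the sample mean, and n is the sample size.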

Hypothesis Tests When You Don’t Know Your Population Parameters

The Student's t-distribution is similar to the normal distribution, except that it is more spread out and wider in appearance, with thicker tails. The differences between the t-distribution and the normal distribution are more exaggerated when there are fewer data points, and therefore fewer degrees of freedom, as shown in Figure 11.

Figure 11: Distribution graph

Estimation as a follow-up to a Hypothesis Test

When a hypothesis is rejected, it is often useful to turn to estimation to try to capture the true value of the population mean.

Two-Sample T Tests

Independent Vs Dependent Samples

When we have independent samples, we assume that the scores of one sample do not affect the scores of the other sample; in this case we use an unpaired t-test.

In two dependent samples of data, each score in one sample is paired with a specific score in the other sample; in this case we use a paired t-test.

Hypothesis Testing (Categorical Data)

The chi-square test is used for categorical data. It can be used to estimate how closely the distribution of a categorical variable matches an expected distribution (the goodness-of-fit test), or to estimate whether two categorical variables are independent of one another (the test of independence).

For the goodness-of-fit test:

degrees of freedom (df) = number of categories (c) − 1
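For reference, the chi-square statistic itself (standard formula, added here for completeness) compares observed and expected counts:

\chi^{2} = \sum_{i=1}^{c} \frac{(O_i - E_i)^{2}}{E_i}

where O_i is the observed count and E_i is the expected count for category i.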

Analysis of Variance (ANOVA) allows us to test the hypothesis that multiple population means and variances of scores are equal. We could conduct a series of t-tests instead of ANOVA, but that would be tedious due to various factors.

One-Way ANOVA

We follow a series of steps to perform a one-way ANOVA:

1. Calculate the total sum of squares (SST).

2. Calculate the sum of squares between groups (SSB).

3. Find the sum of squares within groups (SSW) by subtracting SSB from SST.

4. Next, solve for the degrees of freedom for the test.

5. Using these values, calculate the Between Mean Squares (MSB) and Within Mean Squares (MSW) using the relationships below.

6. Finally, calculate the F statistic using the following ratio.

7. It is easy to fill in the ANOVA table from here; once the SS and df values are filled in, the remaining values in the table for MS and F are simple calculations.

8. Find F-critical.

ANOVA formulas

If the F-value from the ANOVA test is greater than the F-critical value, we reject the null hypothesis.
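For reference, the standard one-way ANOVA quantities referred to in the steps above are computed as:

MSB = \frac{SSB}{k - 1}, \qquad MSW = \frac{SSW}{N - k}, \qquad F = \frac{MSB}{MSW}

where k is the number of groups and N is the total number of observations, so the between-groups degrees of freedom are k − 1 and the within-groups degrees of freedom are N − k.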
Regression

Regression analysis is a set of statistical processes for estimating the relationships among variables
shown in Figure 13.

Figure 13: Regression

Simple Regression

This method uses a single independent variable to predict a dependent variable by fitting the best
relationship shown in Figure 13.

Figure 13: Simple Regression

Multiple Regression

This method uses more than one independent variable to predict a dependent variable by fitting the
best relationship shown in Figure 14.

Figure 14: Multiple Regression

It works best when multicollinearity is absent. Multicollinearity is a phenomenon in which two or more predictor variables are highly correlated.

Nonlinear Regression

In this method, observational data are modelled by a function which is a nonlinear combination of the model parameters and depends on one or more independent variables, as shown in Figure 14.

Figure 14: Nonlinear regression
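For reference, the three model forms discussed above can be written in standard notation (added here for clarity, not from the original report) as:

Simple regression: y = \beta_0 + \beta_1 x + \varepsilon

Multiple regression: y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon

Nonlinear regression: y = f(x, \beta) + \varepsilon, where f is nonlinear in the parameters \beta.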

Significance in Data Science

In data science, inferential statistics is used in many ways:

 Making inferences about the population from the sample.

 Concluding whether a sample is significantly different from the population.

 Deciding whether adding or removing a feature from a model will really help to improve the model.

 Deciding whether one model is significantly better than another.

 Hypothesis testing in general.
4. PROBABILITY DISTRIBUTIONS

Why Probability is important?

Uncertainty and randomness exist in many areas of our everyday lives, and having a strong knowledge of probability allows us to make sense of these uncertainties. Knowing about probability allows us to make educated judgments about what is likely to happen, based on patterns in previously collected data.

How to use Probability in Data Science?

Data science makes use of statistical inference to forecast or analyse patterns in data, and statistical inference in turn uses probability distributions of data. It is therefore important to know probability and its applications in order to work effectively on data science problems.

What is Conditional Probability?
Conditional probability is a measure of the likelihood of an occurrence (some particular circumstance
occurring), provided that another occurrence has occurred (by inference, hypothesis, conclusion, or
evidence).

The probability of event B provided event A equals the likelihood of event A and event B divided by
the likelihood of event A.

How is conditional probability used in data science?

Most techniques in data science (e.g., Naive Bayes) depend on Bayes' theorem. Bayes' theorem is a formula that describes how to update the probabilities of hypotheses when given evidence.
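In standard notation (added for reference), the conditional probability described above and Bayes' theorem are:

P(B \mid A) = \frac{P(A \cap B)}{P(A)}, \qquad P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)}

where H is a hypothesis and E is the observed evidence.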

Using the Bayes’ theorem, it’s possible to build a learner that predicts the probability of the
response variable belonging to some class, given a new set of attributes.

What are random variables?

A random variable is a set of possible values from a random experiment shown in Figure 16.

Figure 16: Random variables

A random variable (random quantity, aleatory variable, or stochastic variable) is a variable whose
possible values are outcomes of a random phenomenon.

Random variables can be discrete or continuous. Discrete random variables can only take certain
values while continuous random variables can take any value (within a range).

What are probability distributions?

The probability distribution of a random variable determines how probabilities are distributed over the values of that random variable.

For a discrete random variable x, the probability distribution is defined by a probability mass function, denoted by f(x). This function gives the probability of each value of the random variable.

For a continuous random variable, there is an infinite number of values in any interval, so we instead define the probability that the variable lies within a given interval. Here the probability distribution is defined by a probability density function, which is also denoted by f(x).

What are the types of probability distributions?

A binomial distribution describes a statistical experiment with the following characteristics: the experiment consists of n repeated trials; each trial can produce only two possible results, one of which we call a success and the other a failure; and the probability of success, denoted by P, is the same for every trial, as shown in Figure 17.

Figure 17: Probability Distributions

The normal distribution, also known as the Gaussian distribution, is a probability distribution that is symmetrical around the mean, indicating that data near the mean occur more frequently than data far from the mean.

This has the following characteristics:

• The normal curve is symmetrical about the mean μ;

• The mean is in the middle and splits the area into halves;

• The total area below the curve is equal to 1;

• It is entirely defined by its mean and standard deviation (or variance σ2)
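For reference, the standard formulas for these two distributions (added here, not from the original report) are:

Binomial: P(X = k) = \binom{n}{k} p^{k} (1 - p)^{n - k}, \quad k = 0, 1, \dots, n

Normal: f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x - \mu)^{2}}{2\sigma^{2}}\right)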

Figure 18: Normal Distribution

How are random variables & probability distributions used in data science?

Data science makes use of statistical inference to forecast or analyse patterns in data, and statistical inference in turn uses the probability distributions of data. It is therefore important to understand random variables and their probability distributions in order to work effectively on data science problems.

5. FITTING A MODEL
Fitting a model means making your algorithm learn the relationship between the predictors and the outcome so that you can predict future values of the outcome.

The best-fitted model has a particular set of parameters that best describes the problem at hand.

Objectives of Model Fitting

There are two main goals for model fitting

1. Make inferences about the relationship between variables in a given data set.

2. Predictions/forecasting of potential events, based on models calculated using historical evidence.

Why are we fitting models to data?

1. Estimate the distributional properties of variables, possibly conditional on other variables.

2. Concisely describe the relationship between variables and make inferential statements about that relationship.

3. Predict the values of variables of interest on the basis of the values of other predictor variables, and quantify the uncertainty of those predictions.
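As a minimal sketch of what "fitting a model" looks like in code (assuming scikit-learn and numpy are installed; the synthetic data and its coefficients are illustrative assumptions), the model learns the predictor/outcome relationship from training data and is then used to predict new outcomes:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic data: two predictors and an outcome with a known relationship.
X = rng.normal(size=(200, 2))
y = 4.0 + 2.5 * X[:, 0] - 1.2 * X[:, 1] + rng.normal(0, 0.5, 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)                 # fitting: learn the parameters

print("intercept:", model.intercept_)       # should be close to 4.0
print("coefficients:", model.coef_)         # should be close to [2.5, -1.2]
print("R^2 on unseen data:", model.score(X_test, y_test))

# Prediction: apply the fitted model to new predictor values.
print(model.predict([[1.0, 0.0]]))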

6. INTRODUCTION TO R
What is R?

The R programming language allows statistical computation and is commonly used by data miners and statisticians to analyse data. It was created by Ross Ihaka and Robert Gentleman, who released its first version in 1995 and derived the name 'R' from the first letters of their names. R is a common choice for statistical computing and graphical techniques in data analytics and data science.

In its CRAN repository, R holds a set of over 10,000 packages. These packages cater to specific statistical applications. R can present a steep learning curve for beginners, although its syntax is fairly simple to understand. It is an interactive environment used for statistical computing and statistical learning; hence, a user without knowledge of statistics may not be able to get the best out of R.

Why We Choose R for Data Science?

Data science has emerged as one of the most popular fields of the 21st century, because the need to evaluate and extract knowledge from data is pressing. Industries are converting raw data into data-driven products. Doing so requires several essential tools to churn the raw data, and R is one of the programming languages that provides an intensive environment for analysing, transforming, and visualizing data.

For many statisticians who want to get involved in designing mathematical models to solve complex problems, R is the primary option. R includes a sea of packages covering all kinds of disciplines such as astronomy, biology, and so on. Though R was originally used for academic purposes, it is now also widely used in industry.

R is a technical language used for complex mathematical modelling. It also supports array, matrix, and vector operations. R is renowned for its graphical libraries, which allow users to build visual graphs and make them interactive. In addition, R helps its users create web applications. Beyond this, R provides several options for advanced data analytics, such as building prediction models and machine learning algorithms, as well as several packages for image processing.

History of R

• The S language was conceived by John Chambers at Bell Laboratories in 1976. R was created as an extension of S, as well as a programming language implementation of S.

• Ross Ihaka and Robert Gentleman developed the R project, released its first version in 1995, and released a stable beta version in 2000.

R Features

There are essential R features, shown in Figure 19, which we will explore in order to understand R's role in data science.

Figure 19: Features of R

1. Open-source – R is an open-source platform that allows you to access and change the code,
and even create your own libraries. This is safe to download, as well.

2. A complete language – Although R is commonly regarded as a statistical programming


language, it also contains many features of an Object Oriented programming language.

3. Analytical support-With R, through its wide variety of support libraries, you can perform
analytical operations. You can clean, arrange, analyze, display your data and construct predictive
models, too.

4. Help extensions – R allows developers to write their own libraries and packages as
distributed add-ons, and to promote such packages. This makes R a developer-friendly language
that allows its users to make changes and updates.

5. Facilitates database connectivity – R offers a variety of add-on packages linking it to databases, such as the RODBC package for the Open Database Connectivity (ODBC) protocol and the ROracle package for connectivity with Oracle databases. R also has MySQL support through the RMySQL package.

6. Extensive community participation – R has an active community, which is further strengthened by R being an open-source programming language. For many, this makes R a perfect choice. Various R boot camps and seminars are offered around the world.
7. Simple and easy to understand – Although many would argue that R presents beginners with a steep learning curve, this is because R is a statistical language; to use R to its maximum, you need to have statistical experience. R, however, has a syntax that is easy to understand, which helps you recall and appreciate R better.

How R is Different from Other Technologies

There are certain special features of R programming that set it apart from other technologies:

• Graphical libraries – R stays ahead of the curve with its elegant graphical libraries. Libraries such as ggplot2 and plotly support attractive, well-defined plot creation.

• Availability / Cost – R is free to use, indicating universal accessibility.

• Technology advancement – R supports different advanced tools and features that allow you to
create robust statistical models.

• Job scenario – R is a primary data science tool, as stated above. With data science's exponential growth and increasing demand, R has become one of the most in-demand programming languages today.

• Community and public support – You will experience good community support with R.

• Portability – R is highly portable. For best results, many different programming languages and software frameworks can easily be combined with the R environment.

Applications of R Programming

• R is used in the finance and banking industries for fraud detection, customer churn reduction, and future decision making.

• R is also used in bioinformatics to analyse genetic sequences, and in drug discovery and computational neuroscience.

• R is used to discover potential consumers for online advertising through social media analysis. Organizations often use insights from social media to evaluate consumer sentiment and improve their products.

• E-commerce firms use R to evaluate consumer transactions and their input.

• Manufacturing companies use R to analyse customer feedback. They also use it to forecast future demand so they can adjust their production rates and increase profits.

Why is R Important in Data Science?

Some of R's essential features for data science are:

1. R includes many essential data wrangling packages, such as dplyr, purrr, readxl, googlesheets, datapasta, jsonlite, tidyquant, tidyr, etc.

2. R facilitates robust statistical modelling. Considering that data science is statistics-heavy, R is an ideal tool for performing various statistical operations on data.

3. R is an appealing tool for various data science applications because it offers aesthetic visualization packages such as ggplot2, scatterplot3d, lattice, highcharter, etc.

4. R is widely used in ETL (Extract, Transform, Load) data science applications. It offers an interface to various databases, such as SQL databases and even spreadsheets.

5. Another essential capability of R is interfacing with NoSQL databases and analysing unstructured data. This is particularly useful for data science applications where such data collections need to be analysed.

6. Data scientists may use machine learning algorithms with R to gain insights into future events.
Various packages are available such as rpart, CARET, randomForest, and nnet.

What Makes R Suitable For Data Science?

R is the most popular choice for data scientists. Following are some of the key reasons as to why
they use R –

1. R has been reliable and useful in academia for many years. R was generally used in academia for research purposes, as it offered various statistical analysis tools. With the advancement of data science and the need to analyse data, R has also become a common choice within industry.

2. R is an ideal tool when it comes to wrangling data. It offers many ready-made packages that make wrangling data much easier. This is one of the key reasons why R is favoured in the data science community.

3. R offers the popular ggplot2 package, which is best known for its visualizations. ggplot2 offers aesthetic visualizations that are appropriate for all data operations. In addition, ggplot2 provides users with a degree of interactivity so they can understand the data contained in the visualisation more clearly.

4. R includes packages for various machine learning operations. Whether it is boosting, constructing random forests, or performing regression and classification, R offers a wide variety of machine learning packages.

Data Science Companies that Use R

Some of the major data science companies that use R for analysis and statistical modelling are shown in Figure 20:

Figure20: Data Science companies that use R

1. Facebook – Facebook uses R extensively for social network analytics. It uses R to gain insights into users' behaviour and to establish relationships between them.
2. Airbnb – Airbnb uses R to support its complex day-to-day data operations. It uses the dplyr package to slice and dice the data, the ggplot2 graphics package to visualize results, and the pwr package for various experiments and statistical studies.
3. Uber – Uber makes extensive use of R for its charting components. Shiny, an interactive web application framework built with R, is used for embedding interactive graphics.
4. Google – R is a common choice at Google for carrying out several analytical operations. The Google Flu Trends project uses R to examine flu-related trends and patterns in searches. In addition, Google's Prediction API uses R to evaluate historical data and make predictions.
5. ANZ – ANZ is one of Australia's biggest banks. It uses R for credit risk analytics, which includes predicting loan defaults based on clients' transactions and credit scores.
6. Novartis – Novartis is a leading pharmaceutical firm that relies on R for clinical data analysis in FDA submissions.
7. IBM – IBM is one of R's largest investors and recently joined the R Consortium. IBM also uses R to develop various analytical solutions, including in IBM Watson, an open computing platform. Furthermore, IBM supports R projects and helps the R community flourish by making significant contributions.

7. EXPLORATORY DATA ANALYSIS AND THE DATA SCIENCE PROCESS

Exploratory Data Analysis

Exploratory Data Analysis refers to the essential phase of initial data analyses in order to identify
patterns, find anomalies, test hypotheses, and use descriptive statistics and graphical representations
to verify conclusions.

In short, it is a good idea to understand the data first and to try to gain as many insights from it as possible. EDA is all about making sense of the data in hand before getting your hands dirty with it.

EDA explained using sample data set:

In order to share my interpretation of the definition and techniques I know, I will take an example
of the white version of the Wine Quality data set available on the UCI Machine Learning
Repository and seek to obtain as many insights from the data set using EDA as possible.

To start with, I imported the required libraries (pandas, numpy, matplotlib, and seaborn for this example) and loaded the data set; a minimal code sketch of these steps appears at the end of this walkthrough.

Note: I have described any inferences I was able to draw as bullet points.

• In the given data set, the original data is separated by the delimiter ";".

• To take a closer look at the data, I used the ".head()" function of the pandas library, which returns the first five observations of the data set. Similarly, ".tail()" returns the last five observations of the data set.

I found out the total number of rows and columns in the data set using “.shape”.

• The data set comprises 4898 observations and 12 characteristics.

• One of these is the dependent variable, and the remaining 11 are independent variables (physicochemical characteristics).

It is also a good practice to know the columns and their corresponding data types, along with
finding whether they contain null values or not.

• Data has only float and integer values.

• No variable column has null/missing values.

The describe() function in pandas is very handy in getting various summary statistics. This function
returns the count, mean, standard deviation, minimum and maximum values and the quantiles of the
data.

• Here as you can notice mean value is less than median value of each column which is
represented by 50% (50th percentile) in index column.

• There is a notably large difference between the 75th percentile and the maximum values of the
predictors "residual sugar", "free sulfur dioxide" and "total sulfur dioxide".

• Thus observations 1 and 2 suggest that there are extreme values (outliers) in our data set.

A few key insights obtained just by looking at the dependent variable are as follows:

• Target variable/Dependent variable is discrete and categorical in nature.

• The "quality" score scale ranges from 1 to 10, where 1 is poor and 10 is the best.

• Quality ratings of 1, 2 and 10 are not given by any observation; the only scores obtained are
between 3 and 9.

• A value count tells us the number of observations for each quality score in descending order
(see the snippet after this list).

• “quality” has most values concentrated in the categories 5, 6 and 7.

• Only a few observations were made for the categories 3 and 9.
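
The counts referred to above can be reproduced in one line, assuming the same df:

print(df['quality'].value_counts())                 # vote count of each quality score, descending
df['quality'].value_counts().plot(kind='bar')       # optional bar chart of the counts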

Data Science Process

Data Science is a multidimensional discipline that uses scientific methods, techniques and
algorithms to derive information and insights from structured and unstructured data. Much of the
work is data-related, but it also includes a variety of other data-driven processes.

Data Science represents a multidisciplinary field. This includes the systematic blend of scientific
and statistical methods, procedures, creation of algorithms, and technology to obtain useful data
information.
Yet how do all these dimensions work together? To grasp this, you need to learn the data science
process, i.e. the day-to-day life of a data scientist.

Data Science Process – Daily Tasks of Data Scientist


The steps involved in the complete data science process are:

Step One: Ask Questions to Frame the Business Problem

In the first step, seek to get an idea of what the organization wants and collect data based on it. You
start the data science process by asking the right questions to figure out what the issue is. Let's take
a very common question from a bag company: the sales issue.

To analyze the issue, you need to begin by asking a lot of questions:

• Who are the target audience and the clients?

• How do you approach the target market?

• What does the sales cycle look like at the moment?

• Which details do you have about the target market?

• How do we classify those clients who are more likely to buy our product?

• After a conversation with the marketing team, you agree to work on the issue: "How do we find
potential buyers who are more likely to purchase our product?"

• The next move is to find out what details you have available to answer the questions
above.

Step Two: Get Relevant Data for Problem Analysis

• Now that you are aware of your business problem, it's time to collect the data that will help you
solve it. Before collecting the data, you should ask whether the organization already has the
correct data available.

• In certain cases, you can get previously collected data sets from other investigations. The
following data are required: age, gender, previous transaction history of customers, etc.

You find that most of the customer-related data is available in the company’s Customer
Relationship Management (CRM) software, managed by the sales team.

• A SQL database with many tables is the typical back end for CRM applications. As you go through the SQL
database, you will find that the system stores detailed customer identification, contact and
demographic details (that customers gave the company) as well as their detailed sales history.

• If you do not think the available data is adequate, then plans must be made to collect new data. By
displaying or circulating a feedback form, you can also take input from your visitors and customers.
Admittedly, that is a lot of engineering work, and it takes time and effort.

• In addition, the data you obtain is "raw data" containing errors and missing values. Before
the data is analyzed, you need to clean (wrangle) it.

Step Three: Explore the Data and Correct Errors

Exploring the data means cleaning and organizing it. This step takes up more than 70 per cent of a
data scientist's time. Even after gathering all the data, you cannot use it straight away, because the raw data
you have collected most likely contains oddities.

First, you need to make sure that the data is error free and clean. It is the most significant step in the
cycle that requires patience and concentration.

For this task, specific tools and techniques are used, such as Python, R, SQL, etc.

So you start answering these questions:

• Are data values missing, i.e. are there consumers without contact numbers?
• Does the data have null values in it? If so, how do you fix them?

• Are there multiple datasets? Is it sensible to fuse the data sets? If so, how can you bring
them together?

Once these checks reveal the missing and false values, the data is ready for review. Remember that
incorrect data is worse than having no data at all.
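
A minimal, hypothetical sketch of such checks with pandas (the file and column names here are
illustrative, not taken from the actual CRM data):

import pandas as pd

customers = pd.read_csv("crm_customers.csv")      # hypothetical export from the CRM
transactions = pd.read_csv("transactions.csv")    # hypothetical sales history

print(customers.isnull().sum())                   # how many missing values per column?
customers['contact_number'] = customers['contact_number'].fillna("unknown")

# Fuse the two data sets on a shared customer identifier
merged = customers.merge(transactions, on="customer_id", how="left")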

Step Four: Analyze the Data and Build a Model

After analyzing the data, you have sufficient knowledge to construct a model to address the question:

"How can we identify potential customers who are more likely to buy our product?"

In this phase, you analyze the data in order to extract information from it. Analyzing the data
involves the application of various algorithms which will derive meaning from it:

• Build data model to answer the query.

• Test the model against gathered data.

• Use of different visualization software for presenting data.

• Carry out the algorithms and statistical analysis required.

• Align the findings with other methods and sources.

But answering those questions will only give you theories and hints. Modelling is a way of expressing
the data as an equation that a machine understands, so that you can make model-based predictions.
You might need to try out multiple models to find the best fit.

Coming back to the sales issue, this model will help you predict which clients are more likely to
buy. The prediction can be specific, such as females in the 16-36 age group living in India.

Step Five: Communicate the Analytical Findings

Communication skills are an important part of a data scientist's job but are often widely
underestimated. This will in fact be a very difficult part of your work, because it involves

explaining the results to the public and other team members in a way that they will clearly
understand.

• Graph or chart the information for presentation with tools – R, Python, Tableau, Excel.

• Use "storytelling" to fit the results.

• Answer the different follow-up questions.

• Present data in different formats-reports, websites.

• Believe me; answers will always spark more questions, and the process begins again.

8. BASIC TOOLS (PLOTS, GRAPHS AND SUMMARY STATISTICS) OF EDA


Exploratory data analysis
Exploratory Data Analysis (EDA) is a very critical step that takes place after the feature
development and data acquisition process and should be carried out prior to the modelling process.
This is because it is really important for a data scientist to be able to understand the essence of the
data without making assumptions about it.

The goal of EDA is to use summary statistics and visualizations to better understand the data and to
find clues about its patterns and quality, and about the assumptions of our study. EDA is NOT about
creating fancy or aesthetically appealing visualizations; the aim is to try to answer questions with
the data. Your goal should be to produce a chart that anyone can look at for a few seconds and
understand what is going on. If not, the visualization is too complex (or fancy) and something
simpler should be used.

EDA is also very iterative since we first make assumptions based on our first exploratory
visualizations, then build some models. We then make visualizations of the model results and tune
our models.
Types of Data

Before we can start talking about data exploration, let's first learn about the various types of data,
or scales of measurement. I highly recommend reading the Measurement Scales section in an online
statistics book to brush up on your statistical knowledge; this segment is simply a synopsis. Data
comes in different forms but can be classified into two major groups: structured and unstructured
data. Structured data is data with a high degree of organization, such as numerical or categorical
data; temperature, phone numbers and gender are examples. Unstructured data is data in a form
that does not have a structure we can use directly; examples include pictures, videos, audio,
natural-language text and many others. There is an emerging field called Deep Learning that uses a
specialized collection of algorithms which perform well with unstructured data. We will concentrate
on structured data in this guide, but include brief details where relevant.

While a pie chart is a very common method for representing categorical variables, it is not
recommended, since it is very difficult for humans to understand angles. As the statistician
and visualization professor Edward Tufte put it simply: "pie charts are bad"; others agree, and some
even say that "pie charts are the worst".

For example, what can you determine from the following pie chart?
Categorical Variables

Categorical variables can be nominal or ordinal. Nominal data has no intrinsic ordering to the
categories. For example, gender (Male, Female, Other) has no specific ordering. Ordinal data has a
clear ordering, such as the three settings on a toaster (high, medium and low). A frequency table
(the count of each category) is the common statistic for describing categorical data, and a bar
chart or a waffle chart (shown below) are two visualizations that can be used.

The pie charts look identical, and it takes more than a couple of seconds to understand the data.
Now compare this to the corresponding bar chart:


A reader can instantly compare the five variables. Since humans have a difficult time comparing angles,
bar graphs or waffle diagrams are recommended. There are many other visualizations which are not
recommended: spider charts, stacked bar charts and many other junk charts.
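
For instance, a frequency table and bar chart for a categorical variable take only a couple of lines;
a sketch using an illustrative gender column:

import pandas as pd
import matplotlib.pyplot as plt

s = pd.Series(["Male", "Female", "Female", "Other", "Male", "Female"])  # illustrative data
counts = s.value_counts()      # frequency table: count of each category
counts.plot(kind="bar")        # bar chart instead of a pie chart
plt.show()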

For example, this visualization is very complicated and difficult to understand:

Often less is more: the same plot redone as a simple line graph:


Numeric Variables

Numeric or continuous variables can take any value within a finite or infinite interval. Examples
include weight, height and temperature. Interval and ratio variables are the two types of numeric
variables. Interval variables have numerical values with the same interpretation over the entire
scale, but have no absolute zero. For example, temperature in Fahrenheit or Celsius may be
meaningfully added or subtracted (the difference between 10 degrees and 20 degrees is the same as
the difference between 40 and 50 degrees), but ratios are not meaningful: a day that measures twice
the temperature is not "twice as hot".

"The calculation proportion scale is the most insightful scale. This is a scale of intervals with the
additional property that its zero position implies the absence of the measured quantity. You can
think of a ratio scale as the three previous scales were rolled up into one. It provides a name or
category for each object as a nominal scale (numbers are used as labels). The objects are ordered,
like an ordinal scale (in terms of ordering the numbers). The same disparity at two positions on the
scale has the same value as an interval scale. And, moreover, the same ratio in two places on the
scale also has the same meaning. "A good example of a ratio scale is weight, because it has a true
zero and can be added, subtracted, multiplied or divided.

Binning (Numeric to Categorical)

The process of transforming numerical variables into categorical ones is known as binning, or
discretization. For example, age may be binned into the categories 0-12 (child), 13-19 (teenager),
20-65 (adult) and 65+ (senior). Binning is useful because it can act as a filter to reduce noise or
non-linearity, and some algorithms, such as decision trees, need categorical data. Binning also helps
data scientists to evaluate numerical values for outliers and null or incomplete values. Binning
strategies include equal width (based on the range), equal frequency in each bin, sorted rank,
quantiles, and mathematical functions (such as log). Binning may also be based on entropy, or
information gain.
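
A short sketch of equal-width, label-based and quantile binning with pandas, using the age example
above (the ages themselves are illustrative):

import pandas as pd

ages = pd.Series([4, 15, 25, 40, 67, 80])
labels = ["child", "teenager", "adult", "senior"]
# Explicit bin edges matching the categories described above
age_groups = pd.cut(ages, bins=[0, 12, 19, 65, 120], labels=labels)
# Equal-width binning (based on the range) into 3 bins
equal_width = pd.cut(ages, bins=3)
# Equal-frequency binning (quantiles) into 3 bins
equal_freq = pd.qcut(ages, q=3)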

Encoding

Encoding is the transformation of categorical variables into numeric (or binary) variables,
otherwise known as continuization. Gender is a simple example of encoding: -1, 0 and 1 may be used
to identify males, females and others. Binary encoding is a special case of encoding where the value
is set to 0 or 1 to indicate the absence or presence of a category. One-hot encoding is a special case
where each of multiple categories is encoded as its own binary feature; if we have k categories, this
will generate k extra features (increasing the dimensionality of the data). Another type of encoding
is target-based (probability) encoding, where each category is replaced by the average value of the
target within that group.
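
A minimal sketch of one-hot and simple label encoding with pandas (illustrative data):

import pandas as pd

df = pd.DataFrame({"gender": ["Male", "Female", "Other", "Female"]})
# One-hot encoding: k categories become k extra 0/1 columns
one_hot = pd.get_dummies(df["gender"], prefix="gender")
# Simple label encoding: map each category to an integer
label = df["gender"].map({"Male": -1, "Female": 0, "Other": 1})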

Iris Data Set

For exploratory data analysis, we are going to investigate Anscombe's quartet and Fisher's Iris
data set.

"Anscombe's quartet consists of four datasets with almost similar basic statistical features, but
looking somewhat different when graphed. Every dataset is composed of eleven points (x, y). They
were built by the statistician Francis Anscombe in 1973 in order to demonstrate both the
importance of graphing data before analyzing it and the effect of outliers on statistical properties."
The Iris data set is a multivariate set of data introduced by Ronald Fisher in his 1936 paper the use
of multiple measures. This indicates the variability of three related species in the Iris bulbs.
A histogram can be used to show the count, mode, variance, standard deviation, coefficient of
variation, skewness and kurtosis.

Bivariate data (Two Variables)

When plotting the relation between two variables, one can use a scatter plot.
If the data is time series or has an order, a line chart can be used.

Multivariate Visualization

When dealing with multiple variables, it is tempting to make three-dimensional plots, but as shown
below it can be difficult to understand the data:

Instead of blindly using decomposition, a data scientist could plot the result:

By looking at the contrast (black and white) in the images, we can see that the locations of the eyes,
nose and mouth are important, along with the head shape.

9. THE DATA SCIENCE PROCESS - CASE STUDY, REAL DIRECT (ONLINE REAL
ESTATE FIRM)

Data Science Case Studies


Here are some of the most popular data science case studies, which show how data science is used in
various industries and why it is important to them.

1. Data Science in Pharmaceutical Industries


Through improved data processing and cloud-driven applications, it is now easier to access a wide
variety of patient information datasets. In the pharmaceutical industry, artificial intelligence and data
analytics have revolutionized oncology. With new pharmaceutical innovations emerging every day,
it is difficult for physicians to keep up to date on treatment options, and it is difficult to tap into a
highly competitive market for more standardized medical care options. However, with the advances
in technology and the development of parallel, pipelined computational models, it is now easier for
the pharmaceutical industry to gain a competitive advantage in the market.

With statistical models such as Markov chains, it is now possible to predict the probability that
doctors will prescribe medicines based on their experience with the brand. In the same way,
machine learning is beginning to establish itself in the area of digital marketing, where it is used to
understand the patterns of digital engagement of physicians and their prescriptions. The main aim
of this case study is to discuss the problems facing the industry and how data science offers
solutions to them.

2. Predictive Modelling for Maintaining Oil and Gas Supply


The crude oil and gas industries face a major problem of equipment failures, typically due to
inefficient oil wells operating at subpar output. With the implementation of an effective
predictive-maintenance strategy, well operators can be alerted to critical phases of shutdown and
informed of maintenance schedules. This leads to an increase in oil production and avoids further
losses.

Data scientists can apply the predictive maintenance (PdM) strategy to optimize high-value
machinery for the production and refining of oil products. With telemetry data extracted through
sensors, a steady stream of historical data can be used to train a machine learning model. This
model will predict the failure of machine parts and alert operators to carry out timely maintenance
in order to avoid oil losses. The data scientist assigned to the implementation of the PdM strategy
should help prevent hazards and predict machine failure, encouraging operators to take precautions.

3. Data Science in BioTech


The human genome consists of four building blocks: A, T, C and G. Our appearance and
characteristics are determined by the three billion permutations of these four building blocks.
Genetic defects, combined with lifestyle factors, can result in chronic diseases. Identifying these
defects at an early stage allows doctors and testing teams to take preventive action.

Helix is one of the companies for genome analysis that provides customers with their genomic
data. Also, due to the emergence of new computational methodologies, many medicines adapted to
particular genetic designs have become increasingly popular. Thanks to the data explosion, we can
understand and analyze complex genomic sequences on a wide scale. Data Scientists can use
modern computational resources to manage massive databases and understand patterns in genomic
sequences to detect defects and provide information to physicians and researchers. In addition,
with the use of wearable tools, data scientists may use the relationship between genetic
characteristics and medical visits to build a predictive modelling framework.

4. Data Science in Education

Data Science has also changed the way students interact with teachers and the way their performance
is assessed. Instructors can use data science to evaluate the feedback they receive from students and
use it to improve their teaching. Data science can be used to build predictive models that predict
student drop-out rates based on their results and advise instructors to take the appropriate
precautions.

UNIT- II (Basic Machine Learning Algorithms & Applications)

1. LINEAR REGRESSION FOR MACHINE LEARNING


Linear regression is perhaps one of the most well-known and well-understood algorithms in
statistics and machine learning.

In this article you will discover the linear regression algorithm, how it works and how you can
best use it in your machine learning projects. You will learn:

• Why linear regression belongs to both statistics and machine learning.

• The many names by which linear regression is known.

• The representation and learning algorithms used to construct a linear regression model.

• How best to prepare your data when using linear regression modelling.

You do not need to know any statistics or linear algebra to grasp linear regression. This is a gentle
high-level introduction to the technique, giving you enough background to be able to make
successful use of it on your own problems.


Let's kick off. Figure 1 shows Linear Regression for Machine Learning

Figure 1: Linear Regression for Machine Learning Photo by Nicolas Raymond.

Isn't Linear Regression a Statistical Technique?

Before we immerse ourselves in the specifics of linear regression, you might wonder why we are
looking at this algorithm.

Isn't it a statistical technique?

Machine learning, more precisely the field of predictive modelling, is concerned primarily with
reducing a model's error, or making the most accurate predictions possible, at the expense of
explainability. In applied machine learning we borrow, reuse and steal algorithms from many
different fields, including statistics, and use them for these purposes.

As such, linear regression was developed in the field of statistics and is studied as a model for
understanding the relationship between input and output numerical variables, but it has been borrowed
by machine learning. It is both a statistical algorithm and a machine learning algorithm.

Next, let's look at some of the common names used to refer to a linear regression model. Figure 2
shows a sample of the handy machine learning algorithms mind map.


Figure 2: Sample of the handy machine learning algorithms mind map.

Many Linear Regression Names

Things can get really confusing when you start looking into linear regression, because it has been
around for so long (more than 200 years). It has been studied from every possible angle, and often
each angle has a new and different name.

Linear regression is a linear model, i.e. a model that assumes a linear relationship between the input
variables (x) and the output variable (y). More precisely, y can be calculated from a linear
combination of the input variables (x).

When there is a single input variable (x), the method is referred to as simple linear regression. When
there are multiple input variables, the statistical literature refers to the method as multiple linear
regression. Different techniques can be used to prepare or train the linear regression equation from
data, the most common of which is called Ordinary Least Squares. It is therefore common to refer to
a model prepared this way as Ordinary Least Squares linear regression, or simply least squares
regression.
Linear Regression Model Representation

Linear regression is a popular model because it is so easy to represent. The representation is a linear
equation that combines a particular set of input values (x) with the predicted output (y) for that set
of input values. As such, both the input values (x) and the output value (y) are numeric.

For each input value or column, the linear equation assigns one scale factor, called the coefficient,
represented by the Greek capital letter Beta (B). An additional coefficient is often introduced,
which gives the line an extra degree of freedom (for example, going up and down on a two-
dimensional plot) and is also referred to as the coefficient of intercept or bias.

For example, in a simple regression problem (a single x and a single y), the form of the model
would be:

y = B0 + B1*x

In higher dimensions, when we have more than one input (x), the line is called a plane or a
hyperplane. The representation is therefore the form of the equation and the specific values used
for the coefficients (e.g. B0 and B1 in the example above).

It is common to talk about the complexity of a regression model such as linear regression. This
refers to the number of coefficients used for the model.

When the coefficient is zero, the influence of the input variable on the model is effectively
removed from the model prediction (0 * x = 0). This becomes relevant when you look at the
regularization methods that change the learning algorithm to reduce the complexity of the
regression models by putting pressure on the absolute size of the coefficients, driving some to zero.

Now that we understand the representation used for the linear regression model, let's look at some
ways that we can learn this representation from the data.

Figure 2: What is Linear Regression?

Linear Regression Learning Model

Learning a linear regression model means estimating the values of the coefficients used in the
representation with the data available to us.

In this section, we will take a brief look at four techniques for the preparation of a linear regression
model. This is not enough information to implement it from scratch, but enough to get a taste of the
computation and the trade-offs involved.

There are many more techniques because the model is so well studied. Take note of Ordinary Least
Squares because it is the most common method used in general, and take note of Gradient Descent
as it is the most common technique taught in machine learning classes.

1. Simple Linear Regression

With simple linear regression, if we have a single input, we can use statistics to estimate the
coefficients. This requires you to calculate statistical properties from data such as mean, standard
deviations, correlations and covariance. All data must be made available for the purpose of
crossing and calculating statistics.

2. Ordinary Least Squares

If we have more than one input, we can use Ordinary Least Squares to estimate the values of the
coefficients.

The Ordinary Least Squares procedure seeks to minimize the sum of the squared residuals. This
means that, given a regression line through the data, we calculate the distance from each data point
to the regression line, square it, and sum all of the squared errors together. This is the quantity that
ordinary least squares seeks to minimize.

This approach treats the data as a matrix and uses linear algebra operations to estimate the
optimum coefficient values. This means that all the data must be available and you must have
enough memory to fit the data and perform the matrix operation.

It is unusual to perform the Ordinary Least Squares procedure yourself, unless it is done as a linear
algebra exercise. It's more likely you're going to call a procedure in a linear algebra library. This
procedure is very quick to calculate.
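
A sketch of calling such a library routine through scikit-learn (the data here is illustrative):

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])   # single input column
y = np.array([2.1, 4.2, 5.9, 8.1])           # output values

model = LinearRegression()                   # ordinary least squares under the hood
model.fit(X, y)
print(model.intercept_, model.coef_)         # B0 and B1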

3. Gradient Descent

When one or more inputs are available, you can use the process of optimizing coefficient values by
iteratively minimizing the model error on your training data.

This operation is called Gradient Descent, starting with random values for each coefficient. The
sum of squared errors is calculated for each pair of input and output values. The learning rate is
used as a scale factor and the coefficients are updated in order to minimize the error. The process is
repeated until a minimum amount of squared error is achieved or no further improvement is
possible.

When using this method, you must select the learning rate (alpha) parameter that will determine the
size of the improvement step to be taken for each iteration of the procedure.

Gradient descent is often taught using a linear regression model because it is relatively easy to
understand. In practice, it is useful when you have a very large dataset in either the number of rows
or the number of columns that may not fit into your memory.
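
A minimal sketch of gradient descent for the simple model y = B0 + B1*x (the data, learning rate and
iteration count are illustrative):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.2, 5.9, 8.1])

b0, b1, alpha = 0.0, 0.0, 0.01               # starting coefficients and learning rate
for _ in range(5000):                        # repeat until the error stops improving
    error = (b0 + b1 * x) - y                # prediction error for every training pair
    b0 -= alpha * error.mean()               # gradient of the squared error w.r.t. B0
    b1 -= alpha * (error * x).mean()         # gradient of the squared error w.r.t. B1

print(b0, b1)                                # learned coefficients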

4. Regularization

There are extensions to the training of the linear model called regularization methods. These seek
both to minimize the sum of the squared error of the model on the training data (using ordinary least
squares) and to reduce the complexity of the model (such as the number or absolute size of the
sum of all coefficients in the model).

Two common examples of regularization procedures for linear regression are:

1. Lasso Regression: where Ordinary Least Squares is modified to also minimize the absolute sum
of the coefficients (called L1 regularization).

2. Ridge Regression: where Ordinary Least Squares is modified to also minimize the squared sum
of the coefficients (called L2 regularization).

These methods are effective to use when there is collinearity in your input values and ordinary
least squares would overfit the training data.
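
A sketch of both regularized variants in scikit-learn; the alpha values controlling the penalty
strength are illustrative:

from sklearn.linear_model import Lasso, Ridge

lasso = Lasso(alpha=0.1)   # L1 penalty: absolute sum of coefficients
ridge = Ridge(alpha=1.0)   # L2 penalty: squared sum of coefficients
# Both are fitted exactly like LinearRegression, e.g.:
# lasso.fit(X, y); ridge.fit(X, y)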

Now that you know some techniques to learn the coefficients in a linear regression model, let's
look at how we can use a model to make new data predictions.

Making Linear Regression Predictions

Since the representation is a linear equation, making predictions is as simple as solving the equation
for a specific set of inputs.

Let's use an example to make this concrete. Imagine that we predict weight (y) from height (x). Our
linear regression model for this problem would be:

y = B0 + B1 * x1
or
weight =B0 +B1 * height

Where B0 is the bias coefficient and B1 is the coefficient for the height column. We use a
learning technique to find a good set of coefficient values. Once found, we can plug in
different height values to predict the weight.

For example, let’s use B0 = 0.1 and B1 = 0.5. Let’s plug them in and calculate the weight (in
kilograms) for an individual with the height of 182 centimeters.

weight = 0.1 + 0.5 * 182


weight = 91.1

You can see that the above equation could be plotted as a line in two dimensions. B0 is our starting
point regardless of height. We can run through a range of heights from 100 to 250 centimeters, plug
them into the equation and get weight values, creating our line.

Figure 3: Sample Height vs Weight Linear Regression
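
The calculation and the line in Figure 3 can be reproduced in a few lines, using the illustrative
coefficients B0 = 0.1 and B1 = 0.5 from above:

import numpy as np
import matplotlib.pyplot as plt

b0, b1 = 0.1, 0.5
print(b0 + b1 * 182)                 # 91.1 kg for a height of 182 cm

heights = np.arange(100, 251)        # run through heights from 100 to 250 cm
weights = b0 + b1 * heights
plt.plot(heights, weights)
plt.xlabel("Height (cm)")
plt.ylabel("Weight (kg)")
plt.show()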


Now that we know how to make predictions given a learned linear regression model, let’s look at
some rules of thumb for preparing our data to make the most of this type of model.

Preparing Data for Linear Regression

Linear regression has been studied extensively, and there is a lot of literature on how your data
needs to be structured to make the most of the model.

As such, there is a lot of sophistication when talking about these requirements and expectations,
which can be intimidating. In practice, these rules can be used more as rules of thumb when using
Ordinary Least Squares regression, the most common implementation of linear regression.

Try using these heuristics to prepare your data differently and see what works best for your
problem.

1. Linear Assumption. Linear regression assumes that the relationship between input and output is
linear. It doesn't support anything else. This may be obvious, but it's a good thing to remember
when you have a lot of attributes. You may need to transform the data to make the relationship
linear (e.g. transform log for an exponential relationship).
2. Remove your noise. Linear regression assumes that the input and output variables are not noisy.
Consider using data cleaning operations that will make it easier for you to expose and clarify the
signal in your data. This is most important for the output variable and, if possible, you want to
remove outliers in the output variable (y).

3. Remove Collinearity. Linear regression will overfit your data when you have highly
correlated input variables. Consider calculating pairwise correlations for your input data and
removing the most correlated variables.

4. Gaussian Distributions. Linear regression will make more reliable predictions if your input and
output variables have a Gaussian distribution. You may get some benefit from transforms (e.g.
log or Box-Cox) on your variables to make their distribution look more Gaussian.

5. Rescale Inputs: Linear regression will often make more reliable predictions if you rescale input
variables using standardization or normalization.

2. K-NEAREST NEIGHBORS (K-NN)


The K-nearest neighbors (KNN) algorithm is a type of supervised ML algorithm that can be used for both
classification and regression predictive problems. However, it is mainly used for classification
predictive problems in industry.

The following two properties describe KNN well –

• Lazy learning algorithm − KNN is a lazy learning algorithm because it does not have a
specialized training phase and uses all of the data for training while classifying.

• Non-parametric learning algorithm − KNN is also a non-parametric learning algorithm
because it assumes nothing about the underlying data.

2.1 Working of KNN Algorithm

The K-nearest neighbors (KNN) algorithm uses 'feature similarity' to predict the values of new data
points, which means that a value is assigned to a new data point based on how closely it matches the
points in the training set. We can understand how it works through the following steps −

Step 1 − We need a data set to implement any algorithm, so we first load the training as well as the
test data.

Step 2 − Next, we need to choose the value of K, i.e. the number of nearest data points to consider.
K can be any integer.

Step 3 − For each point of the test data, do the following −

3.1 − Calculate the distance between the test data and each row of training data using any
method, namely: Euclidean, Manhattan or Hamming distance. Euclidean is the most
common method used to calculate distance.

3.2 − Now, based on the distance value, sort it in ascending order.

3.3 − Next, the top rows of K will be selected from the sorted array.

3.4 − Now assign a class to the test point based on the most frequent class of these rows.

Step 4 − End.

Example − The following example illustrates the concept of K and the working of the KNN
algorithm. Suppose we have a dataset that can be plotted as shown in Figure 4 −

Figure 4: Dataset

Now, we would like to classify a new data point, shown as a black dot (at point 60, 60), into the blue or
red class. We assume K = 3, i.e. the algorithm finds the three nearest data points, as shown in Figure 5 −

Figure 5: Finding three nearest neighbors

We can see in Figure 5 the three nearest neighbors of the data point with the black dot. Among those
three, two lie in the red class, hence the black dot will also be assigned to the red class.

Implementation in Python
As we all know K-nearest neighbors (KNN) algorithm are often used for both classification also as
regression. The following are the recipes in Python to use KNN as classifier also as regressor –
KNN as Classifier
First, start with importing necessary python packages −

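A hedged sketch of the likely setup (assuming the UCI iris data, a 60/40 train/test split consistent
with the 60-sample test report below, and a StandardScaler fitted on the training data):

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
headernames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
dataset = pd.read_csv(path, names=headernames)

X = dataset.iloc[:, :-1].values      # the four numeric features
y = dataset.iloc[:, 4].values        # the class labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40)
scaler = StandardScaler()
scaler.fit(X_train)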
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

Next, train the model with the help of the KNeighborsClassifier class of sklearn as follows –

from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors = 8)
classifier.fit(X_train, y_train)

At last we need to make predictions. It can be done with the help of the following script −

y_pred = classifier.predict(X_test)

Next, print the results as follows −

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
result = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result)
result1 = classification_report(y_test, y_pred)
print("Classification Report:")
print(result1)
result2 = accuracy_score(y_test, y_pred)
print("Accuracy:", result2)
Output

Confusion Matrix:

[[21 0 0]
[ 0 16 0]
[ 0 7 16]]

Classification Report:

                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        21
Iris-versicolor       0.70      1.00      0.82        16
 Iris-virginica       1.00      0.70      0.82        23

      micro avg       0.88      0.88      0.88        60
      macro avg       0.90      0.90      0.88        60
   weighted avg       0.92      0.88      0.88        60

Accuracy: 0.8833333333333333

KNN as Regressor
First, start with importing necessary Python packages −

import numpy as np
import pandas as pd

Next, download the iris dataset from its weblink as follows −

path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

Next, we need to assign column names to the dataset as follows −

headernames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']

Now, we need to read the dataset into a pandas dataframe as follows −

data = pd.read_csv(path, names = headernames)
array = data.values
X = array[:, :2]
Y = array[:, 2]
data.shape

output: (150, 5)

Next, import KNeighborsRegressor from sklearn to fit the model −

from sklearn.neighbors import KNeighborsRegressor
knnr = KNeighborsRegressor(n_neighbors = 10)
knnr.fit(X, Y)

At last, we can find the MSE as follows −

print("The MSE is:", format(np.power(Y - knnr.predict(X), 2).mean()))


Output
The MSE is: 0.12226666666666669
2.3 Pros and Cons of KNN

Pros

• It is a very simple algorithm to understand and interpret.

• It is very useful for non-linear data because the algorithm makes no assumptions about the
data.
• It is a versatile algorithm that can be used for both classification and regression.
• It is comparatively accurate, but there are far better supervised learning models than KNN.

Cons

• It is a computationally expensive algorithm because it stores all of the training data.

• It requires high memory storage compared to other supervised learning algorithms.

• Prediction is slow when N is large.
• It is very sensitive to the scale of the data and to irrelevant features.

2.4 Applications of KNN

The following are some of the areas in which KNN can be successfully applied –

1. Banking system: KNN can be used to predict whether an individual is fit for loan approval
and whether that individual has characteristics similar to those of defaulters.
2. Calculating credit ratings: KNN algorithms can be used to find an individual's credit
rating by comparing it with persons having similar characteristics.
3. Politics: With the help of KNN algorithms, we can classify potential voters into
different classes such as "will vote", "will not vote", "will vote for the Congress party" or
"will vote for the BJP party".
4. Other areas where the KNN algorithm is often used include speech recognition, handwriting
detection, image recognition and video recognition.

3. K-MEANS
The k-means algorithm is an iterative algorithm that attempts to divide the dataset into K pre-defined,
separate, non-overlapping subgroups (clusters), where each data point belongs to only one group. It
tries to make the data points within a cluster as similar as possible while keeping the clusters as
different (far apart) as possible. It assigns data points to a cluster in such a way that the sum of the
squared distances between the data points and the cluster's centroid (the arithmetic mean of all data
points belonging to that cluster) is at a minimum. The less variation we have within clusters, the more
homogeneous (similar) the data points are within the same cluster.
The way k-means algorithm works is as follows:

1. Specify the number of K clusters.

2. Initialize the centroids by first shuffling the dataset and then randomly selecting K data points
for the centroids without replacement.

3. Keep iterating until there is no change in the centroids. i.e. the assignment of data points to
clusters does not change.

4. Calculate the sum of the squared distance between the data points and all the centroids.

5. Assign each data point to the nearest cluster (centroid).

6. Calculate the centroids for clusters by taking the average of all data points belonging to each
cluster.

The approach k-means follows to solve the problem is called Expectation-Maximization. The
E-step assigns the data points to the nearest cluster. The M-step computes the centroid of each
cluster. Below is a breakdown of how we can solve it mathematically (feel free to skip it).

The objective function is:

J = Σi Σk wik ||xi − μk||²
Where wik=1 for data point xi if it belongs to cluster k; otherwise, wik=0. Also, μk is the centroid
of xi’s cluster.

It’s a minimization problem of two parts. We first minimize J w.r.t. wik and treat μk fixed. Then we
minimize J w.r.t. μk and treat wik fixed. Technically speaking, we differentiate J w.r.t. wik first and
update cluster assignments (E-step). Then we differentiate J w.r.t. μk and recompute the centroids
after the cluster assignments from the previous step (M-step). Therefore, the E-step is:

wik = 1 if k = argminj ||xi − μj||², and wik = 0 otherwise.
In other words, assign the data point xi to the closest cluster judged by its sum of squared distance
from cluster’s centroid.

And the M-step is:

μk = (Σi wik xi) / (Σi wik)
Which translates to recomputing the centroid of each cluster to reflect the new assignments.

Few things to note here:

• Since clustering algorithms, including k-means, use distance-based measurements to determine
the similarity between data points, it is recommended that the data be standardized to have a
mean of zero and a standard deviation of one, since the features in any data set almost always
have different units of measurement, such as age vs. income.

• Given the iterative nature of k-means and the random initialization of the centroids at the start
of the algorithm, different initializations may lead to different clusters, because the algorithm
may get stuck in a local optimum and may not converge to the global optimum.

It is therefore recommended that the algorithm be run with different centroid initializations and
that the run yielding the lowest sum of squared distances be chosen.

• No change in the assignment of examples to clusters is the same thing as no change in
within-cluster variation:

Implementation

We will use a simple implementation of k-means here to illustrate some of the concepts. Then we
will use the sklearn implementation, which is more efficient and takes care of many things for us.
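
A minimal sketch of the scikit-learn implementation on illustrative two-dimensional data:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 11.0], [1.0, 0.6], [9.5, 9.0]])
X_std = StandardScaler().fit_transform(X)       # standardize to mean 0, std 1

kmeans = KMeans(n_clusters=2, n_init=10)        # several random initializations
kmeans.fit(X_std)
print(kmeans.labels_)                           # cluster assignment of each point
print(kmeans.cluster_centers_)                  # final centroids
print(kmeans.inertia_)                          # sum of squared distances (the objective J)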

Applications

The k-means algorithm is very popular and is used in a variety of applications such as market
segmentation, document clustering, image segmentation and image compression. The goal when we
undertake a cluster analysis is usually either to:

1. Get a meaningful insight into the structure of the data we're dealing with.

2. Cluster-then-predict, where different models will be built for different subgroups if we believe
there is a wide variation in the behavior of different subgroups. An example of this is the
clustering of patients into different subgroups and the development of a model for each subgroup
to predict the risk of heart attack.

4. FILTERING SPAM

4.1 Spam

• Spam is also referred to as Unsolicited Commercial Email (UCE).

• It involves sending messages by email to numerous recipients at the same time (mass emailing).

• It grew exponentially from 1990 but has leveled off recently and is no longer growing
exponentially.

• 80% of all spam is sent by fewer than 200 spammers.

4.2 Purpose of Spam


• Advertisements

• Pyramid schemes (Multi-Level Marketing)

• Giveaways

• Chain letters

• Political email

• Stock market recommendation

Spam as a Problem

• Consumes computing resources and time

• Reduces the effectiveness of legitimate advertising
• Cost shifting
• Fraud
• Identity theft
• Consumer perception
• Global implications

John Borgan [ReplyNet] – "Junk email isn't just annoying any more. It's eating into productivity. It's
eating into time."

Some Statistics

• Cost of spam in 2009:
   o $130 billion worldwide
   o $42 billion in the US alone
   o A 30% increase from 2007 estimates; 2007 was itself a 100% increase over 2005

• Main components of cost:
   o Productivity loss from inspecting and deleting spam missed by spam control products
     (false negatives)
   o Productivity loss from searching for legitimate mail deleted in error by spam control
     products (false positives)
   o Operations and helpdesk costs (filters and firewalls – installation and maintenance)

Email Address Harvesting - Process of obtaining email addresses through various methods:

• Purchasing /Trading lists with other spammers


• Bots
• Directory harvest attack
• Free Product or Services requiring valid email address
• News bulletins /Forums


4.3 Spam Life Cycle

Figure 6: Life cycle of spam

Types of Spam Filters

1. Header Filters
a. Look at email headers to judge whether or not they are forged

b. Contain more information in addition to recipient , sender and subject fields


2. Language Filters
a. filters based on email body language
b. Can be used to filter out spam written in foreign languages

3. Content Filters
a. Scan the text content of emails
b. Use fuzzy logic
4. Permission Filters
a. Based on Challenge /Response system
5. White list/blacklist Filters
a. Will only accept emails from list of “good email addresses”
b. Will block emails from “bad email addresses”
6. Community Filters
a. Work on the principle of "communal knowledge" of spam
b. These types of filters communicate with a central server.
7. Bayesian Filters
a. Statistical email filtering
b. Uses Naïve Bayes classifier

Spam Filters Properties

1. Filter must prevent spam from entering inboxes


2. Able to detect the spam without blocking the ham
a. Maximize efficiency of the filter
3. Do not require any modification to existing e-mail protocols
4. Easily incremental
a. Spam evolves continuously
b. Needs to adapt to each user

Data Mining and Spam Filtering

• Spam filtering can be seen as a specific case of text categorization (classification).

• History: Jason Rennie's iFile (1996) was the first known program to use Bayesian classification
for spam filtering.

Bayesian Classification

1. Specific words are likely to occur in spam emails and in legitimate emails. For example, most
email users often encounter the word "Viagra" in spam emails, but rarely see it in other emails.
2. The filter does not know these probabilities beforehand and must be trained first so that it can
build them up.
3. To train the filter, the user must manually indicate whether or not the new email is spam
4. For all of the words in each training email, the filter will adjust the probability that each word
will appear in the spam or legitimate email in its database.

For example, Bayesian spam filters will typically have learned a very high spam probability for
the words "Viagra" and "refinance," but a very low spam probability for words that are seen
only in legitimate emails, such as the names of friends and family members

5. After training, the word probabilities are used to calculate the probability that an email
containing a specific set of words belongs to either category
6. Each word in the email contributes to the spam probability of the email, or only the most
interesting words
7. This contribution is calculated using the Bayes Theorem
8. Then, the email's spam probability is computed over all of its contributing words, and the email is
marked as spam if this probability exceeds a threshold.
9. Users can correct misclassifications (false positives or false negatives), which allows the software to
dynamically adapt to the ever-evolving nature of spam.
10. Some spam filters combine the results of both Bayesian spam filtering and other heuristics
(predefined content rules, envelope viewing, etc.) resulting in even higher filtering accuracy.

Computing the Probability

1. Calculation of the probability that a message containing a given word is spam:

2. Suppose the suspected message contains the word "replica". Most people who are used to receiving
e-mails know that such a message is likely to be spam.

3. The formula used by the software to compute this probability follows from Bayes' theorem:

P(spam | word) = P(word | spam) · P(spam) / [ P(word | spam) · P(spam) + P(word | ham) · P(ham) ]

Pros & Cons

Advantages

• It can be trained on a per-user basis
   o The spam that a user receives is often related to that user's online activities

Disadvantages

• Bayesian spam filtering is susceptible to Bayesian poisoning
   o Insertion of random innocuous words that are not normally associated with spam
   o Replacing text with pictures
   o Google, in its Gmail email system, counters this by performing OCR on every mid-to-large-size
     image and analyzing the text inside

• Spam emails not only consume computing resources, but can also be frustrating
• Numerous detection techniques exist, but none is a "good for all scenarios" technique
• Data mining approaches for content-based spam filtering seem promising
4.8 How to Design a Spam Filtering System with Machine Learning Algorithm

Exploratory Data Analysis (EDA)

Exploratory Data Analysis is a very important data science process. It helps the data scientist to
understand the data at hand and relates it to the business context.

The open-source tool that I will use to visualize and analyze my data is Word Cloud.

Word Cloud is a data visualization tool used to represent text data. The size of the text in the image
represents the frequency or importance of the words in the training data.

Steps to take in this section:

1. Get the email data

2. Explore and analyze the data

3. Visualize the training data with Word Cloud & Bar Chart

Get the spam data

Data is the essential ingredient before we can develop any meaningful algorithm. Knowing where
to get your data can be very handy, especially when you are just a beginner.

Below are a few of the famous repositories where you can easily get thousands of data sets for
free:

1. UC Irvine Machine Learning Repository

2. Kaggle datasets

3. AWS datasets

You can go to this link (https://spamassassin.apache.org/old/publiccorpus/) to get the email spam
data set, which is distributed by SpamAssassin. There are a few categories of data, and you can read
readme.html to get more background information about the data.

In short, there are two types of data present in this repository, namely ham (non-spam) and spam
data. In addition, the ham data is split into easy and hard, which means that some non-spam messages
are very similar to spam messages. This could make it difficult for our system to make a decision.

If you are using Linux or Mac, simply do this in the terminal; wget is a command that will
help you download a file from a URL:
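
For example (the exact archive names below are assumptions; check the repository listing for the
current file names):

wget https://spamassassin.apache.org/old/publiccorpus/20021010_easy_ham.tar.bz2
wget https://spamassassin.apache.org/old/publiccorpus/20021010_spam.tar.bz2
tar -xjf 20021010_easy_ham.tar.bz2
tar -xjf 20021010_spam.tar.bz2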

Figure 7: Visualization for spam email

Figure 8: Visualization for non-spam email

From this view, you can see something interesting about the spam email. Many of them have a
high number of "spam" words, such as: free, money, product, etc. Having this awareness could
help us make a better decision when it comes to designing a spam detection system.

One important thing to note is that the word cloud displays only the frequency of words, not
necessarily the importance of words. It is therefore necessary to do some data cleaning, such as
removing stop words, punctuation and so on, from the data before visualizing it.

N-grams model visualization

Another visualization technique is to use a bar chart to display the frequency of the most common
words. N-gram refers to how many words you treat as a single unit when you calculate the
frequency of words.

I have shown an example of 1-gram and 2-gram in Figure 9. You can definitely experiment with a
larger n-gram model.
Figure 9: Bar chart visualization of 1-gram model

Figure 10: Bar chart visualization of 2-gram model

Train Test Split

It is important to divide your data set into a training set and test set, so that you can evaluate the
performance of your model using the test set before deploying it in a production environment.
Figure 11: Target Count For Train Data

Figure 12: Train Data Distribution


Figure 13: Target Count For Test Data

Figure 14: Train Data Distribution

Test Data Distribution

The distribution between the train data and the test data is quite similar (around 20–21% spam), so we
are good to go and can start to process our data!

Data Preprocessing
Text Cleaning
Text cleaning is a very important step in machine learning because your data may contain a lot
of noise and unwanted characters such as punctuation, white space, numbers, hyperlinks and so on.

Some standard procedures that people generally use are:

• convert all letters to lower/upper case

• removing numbers

• removing punctuation

• removing white spaces

• removing hyperlink

• removing stop words such as a, about, above, down, doing and the list goes on…

• Word Stemming

• Word lemmatization

The two techniques that may seem foreign to most people are word stemming and word
lemmatization. Both of these techniques try to reduce words to their most basic form, but they do
so with different approaches.

• Word stemming — Stemming algorithms work by removing the end or the beginning of the word,
using a list of common prefixes and suffixes that can be found in that language. Examples of
word stemming for English words are as follows:

Form          Suffix   Stem
running       -ing     run
runs          -s       run
consolidate   -ate     consolid
consolidated  -ated    consolid
• Word lemmatization — Lemmatization utilizes the dictionary of a particular language and tries to
convert words back to their base form. It tries to take into account the meaning of the words
(for example, of verbs) and converts them back to the most suitable base form.

Form          Suffix   Stem
running       -ing     run
runs          -s       run
consolidate   -ate     consolid
Implementing these two algorithms might be tricky and requires a lot of thinking and design to
deal with different edge cases.

Luckily NLTK library has provided the implementation of these two algorithms, so we can use it
out of the box from the library!

Import the library and start designing some functions to help us understand the basic working of
these two algorithms.

# Just import them and use them
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

dirty_text = "He studies in the house yesterday, unluckily, the fans breaks down"

def word_stemmer(words):
    stem_words = [stemmer.stem(o) for o in words]
    return " ".join(stem_words)

def word_lemmatizer(words):
    lemma_words = [lemmatizer.lemmatize(o) for o in words]
    return " ".join(lemma_words)
The output of the word stemmer is very obvious: some of the word endings have been chopped off
(for example, "studies" becomes "studi" and "house" becomes "hous").

clean_text = word_stemmer(dirty_text.split(" "))
clean_text

clean_text = word_lemmatizer(dirty_text.split(" "))
clean_text

#Output (lemmatized)
'He study in the house yesterday, unluckily, the fan break down'

The lemmatization has converted studies -> study, fans -> fan and breaks -> break.

Our algorithms always expect the input to be integers/floats, so we need a feature extraction layer
in the middle to convert the words into integers/floats.
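
A minimal sketch of such a feature-extraction layer using scikit-learn's bag-of-words vectorizer,
followed by a naive Bayes classifier (the emails and labels here are illustrative):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["free money now", "meeting schedule for tomorrow", "win a free product"]
labels = [1, 0, 1]                      # 1 = spam, 0 = ham (illustrative)

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)    # words -> integer count features

clf = MultinomialNB()
clf.fit(X, labels)
print(clf.predict(vectorizer.transform(["free product for you"])))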
