
VISVESVARAYA TECHNOLOGICAL UNIVERSITY

BELAGAVI – 590 018, KARNATAKA

An Internship Report on

EXPLORATORY DATA ANALYSIS (EDA) ON


“SUICIDE DATA SET”

Submitted in partial fulfilment of the requirements for the award of the degree of


BACHELOR OF ENGINEERING
IN
INFORMATION SCIENCE AND ENGINEERING

Submitted By

THOTAD VINAY PATIL L V


(4BD18IS101)
Internship carried out at
Acranton Technologies, Davanagere.

Internal Guide: Mrs. Hemashree H C, Assistant Professor, Information Science and Engineering Dept.

External Guide: Mr. Bharath H C, CEO, Acranton Technologies Pvt. Ltd.

2021-2022
Bapuji Institute of Engineering and Technology,
Department of Information Science and Engineering,
Davangere -577004, Karnataka
BAPUJI INSTITUTE OF ENGINEERING AND TECHNOLOGY
DAVANGERE – 577004, KARNATAKA

DEPARTMENT OF INFORMATION SCIENCE AND ENGINEERING

CERTIFICATE
This is to certify that Mr. THOTAD VINAY PATIL L V bearing USN 4BD18IS101 of the Information Science and Engineering department has satisfactorily submitted the internship report entitled "Exploratory Data Analysis (EDA) on Suicide Data Set". The internship report has been approved as it satisfies the academic requirements with respect to the internship work prescribed for the Bachelor of Engineering degree of Visvesvaraya Technological University, Belagavi, during the year 2021-22.

Guide: Mrs. Hemashree H C

Prof & Head: Dr. Poornima B

Internship Coordinator: Mr. Sheik Imran

External Viva

Name of the Examiners Signature with Date

1. __________________________

2. __________________________
ACKNOWLEDGEMENT
I would like to acknowledge the help and encouragement given by various people
during the course of this Internship.
I would like to express my sincere gratitude to the resource person Mr. Bharath H C, co-founder, Training and Internship, ACRANTON TECHNOLOGIES Pvt Ltd.
Salutations to our beloved and highly esteemed institute, "Bapuji Institute of Engineering and Technology", for having well-qualified staff and labs furnished with the necessary equipment.
I express my sincere thanks to my guide Mrs. Hemashree H C and Internship Coordinator Mr. Sheik Imran for their constant encouragement, support and valuable guidance throughout the course of the internship, without which this report would not have been completed.
I express wholehearted gratitude to Dr. Poornima B, H.O.D. of IS&E, who made my task easy by providing valuable help and encouragement.
I would like to thank our beloved Principal Dr. H.B. Aravind and the Director
Prof. Y Vrushabhendrappa of this Institute for giving me the opportunity and guidance
to work for the Internship.
I would like to extend my gratitude to all teaching and non-teaching staff of the Department of Information Science and Engineering for the help and support rendered to me. I have benefited a lot from the feedback and suggestions given by them.
I would also like to extend my gratitude to all my family members and friends, especially for their advice and moral support.

THOTAD VINAY PATIL L V


4BD18IS101
ABSTRACT

Applied machine learning is the application of machine learning to a specific data-related problem. It can involve either supervised models, meaning that there is an algorithm that improves itself on the basis of labelled training data, or unsupervised models, in which the inferences and analyses are drawn from data that is unlabelled. Applied machine learning is characterized in general by the use of statistical algorithms and techniques to make sense of, categorize, and manipulate data. Machine learning can be applied in any case in which there are nondeterministic elements to a problem, and especially where the manipulation and analysis of a large amount of statistically generated data are required.
CONTENTS

TOPICS

About the Company
CHAPTER 1 : INTRODUCTION
1.1 INDUSTRY 4.0
1.2 COGNITIVE COMPUTING
1.3 AI VS ML VS DL
1.4 MACHINE LEARNING TYPES
1.5 TOOLS USED IN ML
1.6 CHALLENGES IN MACHINE LEARNING
CHAPTER 2 : SYSTEM REQUIREMENTS
2.1 HARDWARE REQUIREMENTS
2.2 SOFTWARE REQUIREMENTS
CHAPTER 3 : DIFFERENT ALGORITHMS IN ML & USE CASES
CHAPTER 4 : TERMINOLOGIES
4.1 OVERFITTING & UNDERFITTING
4.2 BIAS VS VARIANCE
4.3 REGULARIZATION
4.4 PERFORMANCE METRICS
4.5 ERROR ANALYSIS
CHAPTER 5 : PROJECT
5.1 DESCRIPTION
5.2 PRE-PROCESSING
5.3 VISUALIZATION OF DATA
CHAPTER 6 : RESULTS AND DISCUSSIONS
CONCLUSION
BIBLIOGRAPHY
ABOUT THE COMPANY
Acranton Technologies Pvt Ltd
Acranton Technologies Pvt. Ltd. was started in the year 2018 in Davangere by software engineers who dreamed of becoming entrepreneurs. "We love people and everything about them; thus, our product will simply be an extension or an expression of oneself. Our business is to change the way things work around us and not settle for anything less than revolutionary." We work with dedication and determination to provide the best IT solutions to our clients and also to help students in achieving their career goals.

Our mission is to:

• Provide superior quality products and services; we will radically shift the global economy toward small business by empowering our clients to confidently and successfully grow their own ventures.

• Solve people's problems and difficulties through the adoption of technologies.

• Turn technical ideas into reality through applications for a better future.

• Endeavour in digital technology.

• Provide an ocean of opportunities by sharing trending technologies with those who seek them.


CHAPTER 1
INTRODUCTION TO MACHINE LEARNING

1.1 INDUSTRY 4.0

Industry 1.0 refers to the first industrial revolution. It is marked by a transition from hand production methods to machines through the use of steam power and water power. Industry 2.0, the second industrial revolution, better known as the technological revolution, is the period between 1870 and 1914. It was made possible by extensive railroad networks and the telegraph, which allowed for faster transfer of people and ideas.

The third industrial revolution, or Industry 3.0, occurred in the late 20th century, after the end of the two world wars, as a result of a slowdown in industrialization and technological advancement compared to previous periods. Industry 4.0 is the fourth industrial revolution that
concerns industry. The fourth industrial revolution encompasses areas which are not normally
classified as an industry, such as smart cities, for instance. In essence, industry 4.0 is the trend
towards automation and data exchange in manufacturing technologies and processes which include
cyber-physical systems (CPS), the internet of things (IoT), industrial internet of things (IIOT), cloud
computing, cognitive computing and artificial intelligence.
The concept includes:
• Smart manufacturing
• Smart factory
• Lights out (manufacturing) also known as dark factories
• Industrial internet of things also called internet of things for manufacturing
Within modular structured smart factories, cyber-physical systems monitor physical processes, create
a virtual copy of the physical world and make decentralized decisions. Over the Internet of Things,
cyber-physical systems communicate and cooperate with each other and with humans in real-time
both internally and across organizational services offered and used by participants of the value chain.


Fig 1.1 Industry 4.0

1.2 COGNITIVE COMPUTING
Cognitive computing (CC) refers to technology platforms that, broadly speaking, are based
on the scientific disciplines of artificial intelligence and signal processing. These platforms
encompass machine learning, reasoning, natural language processing, speech recognition
and vision (object recognition), human–computer interaction, dialog and narrative
generation, among other technologies.
Some features of cognitive systems are:
Adaptive
They may learn as information changes, and as goals and requirements evolve. They
may resolve ambiguity and tolerate unpredictability. They may be engineered to feed on
dynamic data in real time, or near real time.
Interactive
They may interact easily with users so that those users can define their needs comfortably.
They may also interact with other processors, devices, and cloud services, as well as
with people.
Iterative and stateful
They may aid in defining a problem by asking questions or finding additional source input if a problem statement is ambiguous or incomplete. They may "remember" previous interactions in a process and return information that is suitable for the specific application at that point in time.
Contextual
They may understand, identify, and extract contextual elements such as meaning, syntax,
time, location, appropriate domain, regulations, user’s profile, process, task and goal. They
may draw on multiple sources of information, including both structured and unstructured
digital information, as well as sensory inputs (visual, gestural, auditory, or sensor-provided).

1.3 AI vs ML vs DL
AI is an umbrella discipline that covers everything related to making machines smarter. Machine Learning (ML) is commonly used along with AI, but it is a subset of AI. ML refers to an AI system that can self-learn based on an algorithm. Systems that get smarter and smarter over time without human intervention are ML. Deep Learning (DL) is machine learning applied to large data sets. Most AI work involves ML because intelligent behaviour requires considerable knowledge.

1. Artificial Intelligence (AI)

Humans have been obsessed with automation since the beginning of technology adoption. AI enables machines to think without any human intervention. It is a broad area of computer science. AI systems fall into three types: ANI (Artificial Narrow Intelligence), which is goal-oriented and programmed to perform a single task; AGI (Artificial General Intelligence), which allows machines to learn, understand, and act in a way that is indistinguishable from humans in a given situation; and ASI (Artificial Super Intelligence), a hypothetical AI where machines are capable of exhibiting intelligence that surpasses the brightest humans.
2. Machine Learning (ML)
ML is a subset of AI that uses statistical learning algorithms to build smart systems.
The ML systems can automatically learn and improve without explicitly being
programmed. The recommendation systems on music and video streaming
services are examples of ML. The machine learning algorithms are classified into
three categories: supervised, unsupervised and reinforcement learning.
3. Deep Learning (DL)
This subset of AI is a technique that is inspired by the way a human brain
filters information. It is associated with learning from examples. DL systems help a
computer model to filter the input data through layers to predict and classify
information. Deep Learning processes information in the same manner as the human
brain. It is used in technologies such as driver-less cars. DL network architectures
are classified into Convolutional Neural Networks, Recurrent Neural Networks,
and Recursive Neural Networks.

1.4 MACHINE LEARNING AND TYPES


With machine learning algorithms, AI was able to develop beyond just performing the
tasks it was programmed to do. Before ML entered the mainstream, AI programs were
only used to automate low-level tasks in business and enterprise settings.
This included tasks like intelligent automation or simple rule-based classification. This meant that AI algorithms were restricted to the domain of what they were programmed for. However, with machine learning, computers were able to move past doing only what they were programmed to do and began evolving with each iteration.
Machine learning is set apart from earlier, explicitly programmed AI systems by its capability to evolve. Using various programming techniques, machine learning algorithms are able to process large amounts of data and extract useful information. In this way, they can improve upon their previous iterations by learning from the data they are provided.
We cannot talk about machine learning without speaking about big data, one of
the most important aspects of machine learning algorithms. Any type of AI is usually
dependent on the quality of its dataset for good results, as the field makes use of statistical
methods heavily.
Types of Machine Learning

Fig 1.2 Types of ML

1. Supervised Learning
Supervised learning is one of the most basic types of machine learning. In this type,
the machine learning algorithm is trained on labelled data. Even though the data needs
to be labelled accurately for this method to work, supervised learning is extremely
powerful when used in the right circumstances. In supervised learning, the ML
algorithm is given a small training dataset to work with. This training dataset is a
smaller part of the bigger dataset and serves to give the algorithm a basic idea of the
problem, solution, and data points to be dealt with. The training dataset is also very
similar to the final dataset in its characteristics and provides the algorithm with the
labelled parameters required for the problem.
The algorithm then finds relationships between the parameters given,
essentially establishing a cause and effect relationship between the variables in the
dataset. At the end of the training, the algorithm has an idea of how the data works
and the relationship between the input and the output. This solution is then deployed
for use with the final dataset, which it learns from in the same way as the training
dataset. This means that supervised machine learning algorithms will continue to improve even after being deployed, discovering new patterns and relationships as they train themselves on new data.
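As a minimal sketch of this workflow (not part of the original internship work; it assumes scikit-learn, and the Iris data set and decision tree classifier are used purely as placeholders), training on labelled data and then predicting on unseen data looks roughly like this:

# Minimal supervised-learning workflow sketch (illustrative only).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                 # labelled data: features X, labels y

# A smaller training set gives the algorithm an idea of the problem and data points.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0)    # any supervised estimator could be used here
model.fit(X_train, y_train)                       # learn the input-output relationship

# The trained model is then applied to data it has never seen before.
print("Accuracy on unseen data:", accuracy_score(y_test, model.predict(X_test)))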


2. Unsupervised Learning
Unsupervised machine learning holds the advantage of being able to work with
unlabeled data. This means that human labor is not required to make the dataset
machine-readable, allowing much larger datasets to be worked on by the program.
In supervised learning, the labels allow the algorithm to find the exact nature of
the relationship between any two data points. However, unsupervised learning
does not have labels to work off of, resulting in the creation of hidden structures.
Relationships between data points are perceived by the algorithm in an abstract
manner, with no input required from human beings.
The creation of these hidden structures is what makes unsupervised learning
algorithms versatile. Instead of a defined and set problem statement, unsupervised
learning algorithms can adapt to the data by dynamically changing hidden structures.
This offers more post-deployment development than supervised learning algorithms.

3. Reinforcement Learning
Reinforcement learning directly takes inspiration from how human beings learn
from data in their lives. It features an algorithm that improves upon itself and
learns from new situations using a trial-and-error method. Favourable outputs are
encouraged or ‘reinforced’, and non-favourable outputs are discouraged or
‘punished’.
Based on the psychological concept of conditioning, reinforcement learning works
by putting the algorithm in a work environment with an interpreter and a reward
system. In every iteration of the algorithm, the output result is given to the
interpreter, which decides whether the outcome is favorable or not.
In typical reinforcement learning use-cases, such as finding the shortest
route between two points on a map, the solution is not an absolute value. Instead, it
takes on a score of effectiveness, expressed in a percentage value. The higher this
percentage value is, the more reward is given to the algorithm. Thus, the program
is trained to give the best possible solution for the best possible reward.
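A toy sketch of this trial-and-error loop (plain Python; the five-state "route", rewards and parameters below are invented purely for illustration and are not part of the report) shows an agent learning which action earns the reward:

# Toy Q-learning sketch: an agent learns to move right along a line of 5 states
# until it reaches the goal state (state 4). Rewards and parameters are illustrative.
import random

n_states, actions = 5, [-1, +1]             # move left or move right
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.2       # learning rate, discount, exploration rate

for episode in range(200):
    state = 0
    while state != n_states - 1:            # keep acting until the goal is reached
        # Explore occasionally (or when no action is yet preferred), otherwise exploit.
        if random.random() < epsilon or Q[(state, -1)] == Q[(state, 1)]:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: Q[(state, a)])
        next_state = min(max(state + action, 0), n_states - 1)
        reward = 1.0 if next_state == n_states - 1 else 0.0      # favourable outcome is reinforced
        best_next = max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

# Best learned action per non-goal state (should be +1, i.e. move towards the goal).
print({s: max(actions, key=lambda a: Q[(s, a)]) for s in range(n_states - 1)})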


CHAPTER 2
SYSTEM REQUIREMENTS

2.1 Hardware Requirements

• Processor : Intel Core i5 processor
• RAM : 8 GB or above
• Processor Speed : 2.4 GHz
• System Type : x64-based processor

2.2 Software Requirements

• Operating System : Windows 10
• Text Editor / IDE : Jupyter Notebook, Spyder
• Language : Python, HTML, CSS, JavaScript
• Distribution GUI : Anaconda
• Framework : Flask


CHAPTER 3
ALGORITHMS USED IN ML

3.1 Naïve Bayes Classifier Algorithm (Supervised Learning - Classification)

The Naïve Bayes classifier is based on Bayes' theorem and classifies every value as independent of any other value. It allows us to predict a class/category, based on a given set of features, using probability. Despite its simplicity, the classifier does surprisingly well and is often used because it frequently outperforms more sophisticated classification methods.
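A minimal scikit-learn sketch of this idea (the Iris data set is only a convenient placeholder) might look like the following:

# Illustrative Naive Bayes sketch: predicting a class and its probability.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB()                         # treats every feature as independent of the others
nb.fit(X_train, y_train)
print("Predicted class:    ", nb.predict(X_test[:1]))
print("Class probabilities:", nb.predict_proba(X_test[:1]))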

3.2 K Means Clustering Algorithm (Unsupervised Learning - Clustering)


The K Means Clustering algorithm is a type of unsupervised learning, which is used to
categorise unlabelled data, i.e. data without defined categories or groups. The algorithm
works by finding groups within the data, with the number of groups represented by the
variable K. It then works iteratively to assign each data point to one of K groups based
on the features provided.
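A minimal sketch with scikit-learn (the six unlabelled points below are invented for illustration):

# Illustrative K-Means sketch: grouping unlabelled points into K = 2 clusters.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],      # unlabelled data: no categories are given
              [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Cluster of each point:", kmeans.labels_)
print("Cluster centres:      ", kmeans.cluster_centers_)
print("Cluster of new point: ", kmeans.predict([[0, 0]]))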

3.3 Support Vector Machine Algorithm (Supervised Learning - Classification)

Support Vector Machine algorithms are supervised learning models that analyse data used for classification and regression analysis. They essentially filter data into categories, which is achieved by providing a set of training examples, each marked as belonging to one or the other of two categories. The algorithm then works to build a model that assigns new values to one category or the other.
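A minimal scikit-learn sketch, using a synthetically generated two-class data set as a stand-in for real training examples:

# Illustrative SVM sketch: two categories of training examples, then classifying new values.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svm = SVC(kernel="linear", C=1.0)
svm.fit(X_train, y_train)
print("Test accuracy:", svm.score(X_test, y_test))
print("Category of a new sample:", svm.predict(X_test[:1]))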

3.4 Linear Regression (Supervised Learning/Regression)


Linear regression is the most basic type of regression. Simple linear regression allows us to understand the relationship between two continuous variables.
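A minimal sketch (the study-hours data below is hypothetical and is used only to illustrate fitting one continuous variable against another):

# Illustrative simple linear regression sketch.
import numpy as np
from sklearn.linear_model import LinearRegression

hours = np.array([[1], [2], [3], [4], [5]])     # hypothetical hours studied
scores = np.array([52, 58, 65, 70, 77])         # hypothetical exam scores

reg = LinearRegression().fit(hours, scores)
print("Slope:", reg.coef_[0], " Intercept:", reg.intercept_)
print("Predicted score for 6 hours:", reg.predict([[6]])[0])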


3.5 Logistic Regression

In statistics, the logistic model is used to model the probability of a certain class or event
existing such as pass/fail, win/lose, alive/dead or healthy/sick. This can be extended to model
several classes of events such as determining whether an image contains a cat, dog, lion,
etc. Each object being detected in the image would be assigned a probability between 0 and
1 and the sum adding to one.
Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable, although many more complex extensions exist. In regression analysis, logistic regression estimates the parameters of a logistic model (a form of binary regression). Mathematically, a binary logistic model has a dependent variable with two possible values, such as pass/fail, which is represented by an indicator variable, where the two values are labelled "0" and "1". In the logistic model, the log-odds (the logarithm of the odds) for the value labelled "1" is a linear combination of one or more independent variables ("predictors"); the independent variables can each be a binary variable (two classes, coded by an indicator variable) or a continuous variable (any real value). The corresponding probability of the value labelled "1" can vary between 0 (certainly the value "0") and 1 (certainly the value "1"), hence the labelling; the function that converts log-odds to probability is the logistic function, hence the name. The unit of measurement for the log-odds scale is called a logit, from logistic unit, hence the alternative names.
Analogous models with a different sigmoid function instead of the logistic function can also
be used, such as the probit model; the defining characteristic of the logistic model is that
increasing one of the independent variables multiplicatively scales the odds of the given
outcome at a constant rate, with each independent variable having its own parameter; for a
binary dependent variable this generalizes the odds ratio.
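A minimal scikit-learn sketch of a binary logistic model (the breast-cancer data set is only a convenient placeholder for a 0/1 outcome):

# Illustrative logistic regression sketch: probability of the class labelled "1".
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)       # binary labels "0" and "1"
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000)          # the logistic function maps log-odds to a probability
clf.fit(X_train, y_train)
print("P(class = 1):   ", clf.predict_proba(X_test[:1])[0, 1])
print("Predicted label:", clf.predict(X_test[:1])[0])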

3.6 Decision Trees

Decision Trees are a type of Supervised Machine Learning where the data is continuously
split according to a certain parameter. The tree can be explained by two entities, namely
decision
nodes and leaves. The leaves are the decisions or the final outcomes. And the decision nodes
are where the data is split. Decision tree learning uses a decision tree as a predictive model
to go from observations about an item (represented in the branches) to conclusions about
the item's target value (represented in the leaves). It is one of the predictive modelling
approaches used in statistics, data mining and machine learning. Tree models where the target
variable can take a discrete set of values are called classification trees; in these tree
structures, leaves represent class labels and branches represent conjunctions of features
that lead to those class labels. Decision trees where the target variable can take continuous
values are called regression trees. In decision analysis, a decision tree can be used to
visually and explicitly represent decisions and decision making. In data mining, a decision
tree describes data, but the resulting classification tree can be an input for decision making.
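A minimal sketch that prints the learned decision nodes and leaves as text (the Iris data set is a placeholder):

# Illustrative decision tree sketch: branches are feature tests, leaves are class labels.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(data.data, data.target)

print(export_text(tree, feature_names=list(data.feature_names)))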

3.7 Random Forest

Random forest is a supervised learning algorithm which is used for both classification and regression. However, it is mainly used for classification problems. Just as a forest is made up of trees, and more trees make a more robust forest, the random
forest algorithm creates decision trees on data samples and then gets the prediction from each
of them and finally selects the best solution by means of voting. It is an ensemble method
which is better than a single decision tree because it reduces the over-fitting by averaging the
result. The fundamental concept behind random forest is a simple but powerful one —
the wisdom of crowds. In data science speak, the reason that the random forest model
works so well is: A large number of relatively uncorrelated models (trees) operating as a
committee will outperform any of the individual constituent models. The low correlation
between models is the key. Just like how investments with low correlations (like stocks and
bonds) come together to form a portfolio that is greater than the sum of its parts,
uncorrelated models can produce ensemble predictions that are more accurate than any of
the individual predictions. The reason for this wonderful effect is that the trees protect each
other from their individual errors. While some trees may be wrong, many other trees will
be right, so as a group the trees are able to move in the correct direction.
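A minimal sketch comparing a single tree with a forest of trees voting as a committee (the data set is a placeholder, and the exact accuracies will vary):

# Illustrative random forest sketch: an ensemble of trees usually beats a single tree.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

print("Single decision tree accuracy:", single_tree.score(X_test, y_test))
print("Random forest accuracy:       ", forest.score(X_test, y_test))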


CHAPTER 4
TERMINOLOGIES

4.1 OVERFITTING & UNDERFITTING

When we run our training algorithm on the data set, we allow the overall cost (i.e., the distance from each point to the line) to become smaller with more iterations. Letting this training algorithm run for a long time leads to a minimal overall cost. However, this means that the line will be fit to all the points (including noise), catching secondary patterns that may not be needed for the generalizability of the model. If the model does not capture the dominant trend, it can't predict a likely output for an input that it has never seen before.
Overfitting is the case where the overall cost is really small, but the generalization of the model is unreliable. This is due to the model learning "too much" from the training data set. The longer we leave the model training, the higher the chance of overfitting occurring. We always want to find the trend, not fit the line to all the data points. Overfitting (or high variance) does more harm than good.

Fig 3.1 Overfitting & Underfitting


We want the model to learn from the training data, but we don't want it to learn too much (i.e., too many patterns). Restricting it too far, however, could lead the model to not learn enough patterns from the training data, and possibly not even capture the dominant trend. This case is called underfitting. Underfitting is the case where the model has "not learned enough" from the
training data, resulting in low generalization and unreliable predictions. With high bias, the model might not have enough flexibility and does not generalize well.
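A small numerical sketch of this idea (NumPy only; the noisy sine data and the polynomial degrees are invented for illustration) shows how an overly flexible model drives the training cost down while fitting the noise instead of the dominant trend:

# Fitting polynomials of increasing degree to noisy data: degree 1 underfits,
# degree 15 overfits (tiny training error, larger error against the true trend).
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)   # noisy observations

x_dense = np.linspace(0, 1, 200)
y_true = np.sin(2 * np.pi * x_dense)                             # the dominant trend

for degree in (1, 3, 15):
    p = Polynomial.fit(x, y, degree)                             # minimise cost on the training points
    train_err = np.mean((p(x) - y) ** 2)
    trend_err = np.mean((p(x_dense) - y_true) ** 2)
    print(f"degree {degree:2d}: training error = {train_err:.3f}, error vs true trend = {trend_err:.3f}")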

4.2 BIAS vs VARIANCE

Bias is the difference between the average prediction of our model and the correct value which we are trying to predict. A model with high bias pays very little attention to the training data and oversimplifies the model. It always leads to high error on training and test data.

Variance is the variability of model prediction for a given data point, or a value which tells us the spread of our data. A model with high variance pays a lot of attention to training data and does not generalize on data which it hasn't seen before. As a result, such models perform very well on training data but have high error rates on test data.

Fig 3.2 Bias and variance using bulls-eye diagram

If our model is too simple and has very few parameters then it may have high bias and low variance. On the other hand, if our model has a large number of parameters then it is going to have high variance and low bias. So, we need to find the right balance without overfitting or underfitting the data. This trade-off in complexity is why there is a trade-off between bias and variance; an algorithm can't be more complex and less complex at the same time.


4.3 REGULARIZATION

This is a form of regression that constrains, regularizes, or shrinks the coefficient estimates towards zero. This technique discourages learning a more complex or flexible model, so as to avoid the risk of overfitting. Regularization is a technique used to reduce errors by fitting the function appropriately on the given training set while avoiding overfitting. The commonly used regularization techniques are:

1. LASSO regression:

Lasso (Least Absolute Shrinkage and Selection Operator) is another variation, in which the RSS plus a penalty on the size of the coefficients is minimized. This variation differs from ridge regression only in how it penalizes high coefficients: it uses |βj| (the modulus) instead of the squares of βj as its penalty. In statistics, this is known as the L1 norm.

2. Ridge regression:

In ridge regression, the RSS is modified by adding a shrinkage quantity, λ times the sum of the squared coefficients, and the coefficients are estimated by minimizing this function. Here, λ is the tuning parameter that decides how much we want to penalize the flexibility of our model. The increase in flexibility of a model is represented by an increase in its coefficients, and if we want to minimize this function, then these coefficients need to be small. This is how the ridge regression technique prevents coefficients from rising too high. In statistics, this penalty is known as the L2 norm.
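A minimal scikit-learn sketch contrasting the two penalties (the synthetic data below, where only two of ten features matter, is invented for illustration):

# Illustrative Ridge (L2) and Lasso (L1) sketch: coefficients are shrunk towards zero.
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                      # 10 features, only the first two matter
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)                  # alpha plays the role of the tuning parameter lambda
lasso = Lasso(alpha=0.1).fit(X, y)

print("OLS coefficients:  ", np.round(ols.coef_, 2))
print("Ridge coefficients:", np.round(ridge.coef_, 2))   # shrunk towards zero
print("Lasso coefficients:", np.round(lasso.coef_, 2))   # some driven exactly to zero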


4.4 PERFORMANCE METRICS

After doing the usual feature engineering and selection, and of course implementing a model and getting some output in the form of a probability or a class, the next step is to find out how effective the model is, based on some metric, using test datasets. Different performance metrics are used to evaluate different machine learning algorithms. We can use classification performance metrics such as log-loss, accuracy, AUC (Area Under Curve), etc. Other examples are precision, recall and F1 score, which are also used for ranking algorithms such as those in search engines.

1. Accuracy:

It is the most common performance metric for classification algorithms. It may be defined as the number of correct predictions made as a ratio of all predictions made. In the numerator are the correct predictions (true positives and true negatives), and in the denominator are all the predictions made by the algorithm (right as well as wrong ones). We can use the accuracy_score function of sklearn.metrics to compute the accuracy of our classification model.

2. Precision:

Precision, used in document retrieval, may be defined as the proportion of documents returned by our ML model that are actually correct. Precision is a measure that tells us what proportion of the predicted positive outcomes are actually true. Precision is about being precise: even if we managed to predict only one output, and we predicted it correctly, then we are 100% precise.

3. Recall:


Recall may be defined as the proportion of actual positives that are returned by our ML model. Recall quantifies the number of correct positive predictions made out of all positive predictions that could have been made. Unlike precision, which only comments on the correct positive predictions out of all positive predictions, recall provides an indication of missed positive predictions.

4. F1 score:

This score gives us the harmonic mean of precision and recall. The best value of F1 is 1 and the worst is 0. We can calculate the F1 score with the following formula:

F1 = 2 * (precision * recall) / (precision + recall)

In the F1 score, precision and recall make equal relative contributions. We can use the classification_report function of sklearn.metrics to get the classification report of our classification model.
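A minimal sketch with sklearn.metrics (the true labels and predictions below are hypothetical):

# Illustrative sketch: accuracy, precision, recall and F1 on hypothetical labels.
from sklearn.metrics import (accuracy_score, classification_report,
                             f1_score, precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]     # hypothetical true labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]     # hypothetical model predictions

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print(classification_report(y_true, y_pred))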

4.5 ERROR ANALYSIS

An intentional approach to building a model is to use error analysis. Error analysis requires you to dig into the results of your model after each iteration. You look at the data and predictions at an observational level and form hypotheses as to why your model failed on certain predictions. Then you test your hypothesis by changing the model in a way that might fix that error, and begin the next iteration. Each iteration of modeling becomes more time-consuming with error analysis, but the final results are better and will likely arrive faster.

The pattern to good error analysis is this:

• Find errors.

• Create a hypothesis for what could fix the errors.

• Test hypothesis.

• Repeat.


CHAPTER 5
PROJECT 1
EXPLORATORY DATA ANALYSIS (EDA) ON
“SUICIDE DATA SET”
5.1 DESCRIPTION
Each record in the data set represents a year, a country, a certain age range, and a gender. For example, in Brazil in the year 1985, 129 men over 75 years of age committed suicide.
The data set has 10 attributes.
These are:
Country: Country of the record;
Year: Year of the record;
Sex: Male / Female;
Age: Suicide age range, with ages divided into six categories;
Suicides_no: Number of suicides;
Population: Population of this sex, in this age range, in this country and in this year;
Suicides/100k pop: Ratio between the number of suicides and the population, per 100k;
GDP_for_year: GDP of the country in the year in question;
GDP_per_capita: Ratio between the country's GDP and its population;
Generation: Generation of the suicides in question, with 6 possible categories.
Suicide rate is defined as the number of suicides per 100k population.
After analysing the given dataset suicide rates.csv, we concluded that the given problem is a supervised machine learning problem where suicides/100k pop is the output label.

5.2 PRE-PROCESSING
➢ Step-1: Importing Libraries and Data set

➢ Step-2:


➢ Step-3: - Take care of Missing data


➢ Step-4: - Drop the unwanted columns

➢ Step-5: - Explore and Analyse


Correlation:
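The notebook screenshots for these steps appear as figures in the original report. A minimal pandas sketch of the same steps, assuming the public Kaggle CSV and its usual column names (for example "HDI for year" and "country-year", which may differ from the file actually used), is:

# Sketch of the pre-processing steps above (file name and column names are assumptions).
import pandas as pd

# Step-1: import libraries and the data set
df = pd.read_csv("suicide rates.csv")

# Step-3: take care of missing data (the HDI column is typically the sparse one)
print(df.isnull().sum())
if "HDI for year" in df.columns:
    df["HDI for year"] = df["HDI for year"].fillna(df["HDI for year"].mean())

# Step-4: drop unwanted columns (illustrative choice)
df = df.drop(columns=[c for c in ["country-year"] if c in df.columns])

# Step-5: explore and analyse, including the correlation between numeric columns
print(df.describe())
print(df.select_dtypes("number").corr())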

5.3 VISUALISATION OF DATA


➢ SUICIDES/100K POPULATION V/S COUNTRY
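The chart itself appears as a figure in the original report; a minimal matplotlib sketch of how such a chart could be produced (the file name and column names are assumptions based on the Kaggle data set) is:

# Illustrative bar chart: mean suicides/100k population for the top 20 countries.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("suicide rates.csv")            # cleaned data set from section 5.2

by_country = (df.groupby("country")["suicides/100k pop"]
                .mean()
                .sort_values(ascending=False)
                .head(20))                       # top 20 countries for readability

by_country.plot(kind="bar", figsize=(12, 5))
plt.ylabel("mean suicides per 100k population")
plt.title("Suicides/100k population vs country (top 20)")
plt.tight_layout()
plt.show()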


➢ SUICIDES/100K POPULATION V/S COUNTRY-SEX


➢ YEAR WISE ANALYSIS

It can be clearly seen from the above graph that the total number of suicides (summing up the suicide numbers of all countries in a particular year) was increasing during the years 1989-2005, and then it started to decrease.

➢ SUICIDE PER 100K (BY POPULATION)


It can be observed that:


➢ Till the year 1995 the suicide rate was increasing; it was about 18 suicides/100k population in the year 1995.

➢ After 1995 the suicide rate started to decrease.

➢ SUICIDE RATE BASED ON GENDER

It can be observed from the above graph that

➢ Females have a much lower suicide rate.

➢ Men have the highest suicide rate, and in 1995 about 25 men per 100k population committed suicide.

➢ AGE WISE ANALYSIS


It can be clearly observed that,

➢ Both males and females in the age group of 35-54 years have the highest number of suicides, while the 5-14 years age group has the lowest.

• GENERATION WISE ANALYSIS

It can be clearly observed that,

➢ People of Generation X have the highest suicide rate.

➢ People of Generation Z have the lowest suicide rate.


• GDP ANALYSIS

➢ It can be observed from the above scatter plot that as the GDP for the year increases, the number of suicides starts to decrease gradually.


CHAPTER 6
RESULTS AND DISCUSSIONS

After analyzing the given suicide-rate data set, we could come to the following conclusions:

➢ The country Lithuania has the highest suicides/100k population.

➢ The number of men who committed suicide is almost 4 times that of women.

➢ It was observed that middle-aged men, i.e. men of age 35-54, have the highest suicide rate.

➢ People of the G.I. Generation have committed the most suicides.

➢ It was observed that the GDP of a country clearly makes a difference in suicide rates, because as the GDP for the year and the GDP per capita started to increase, the suicide rate started to decrease.

Considering the suicide numbers, most of the victims were men; unemployment may be one of the reasons. In later years suicide rates started to decrease as precautions were taken and people were educated about suicide.


CONCLUSION
Many industries are executing projects based on applied machine learning for various applications. Machine learning is reorganizing the world globally and is automating business processes. Machine learning is one of the advanced technologies of AI, and an understanding of it is vital to stay ahead in a competitive market. It is one of the methods by which we can operate artificial intelligence and is also known as a subset of AI.

From deep learning to applied machine learning, the techniques and tools of machine
learning have enhanced and automated business support functions. Many industries (such as
healthcare, retail, manufacturing, finance, education, entertainment, banking, telecom, and much
more) use machine learning for their businesses.



BIBLIOGRAPHY

1. Data set taken from https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016

2. Advanced Machine Learning with scikit-learn, Andreas C. Müller, Infinite Skills, September 2015, ISBN: 9781771374927.

3. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Trevor Hastie, Robert Tibshirani, and Jerome Friedman.

4. https://towardsdatascience.com

5. https://medium.com/analytics-vidhya/a-quick-guide-on-missing-data-imputation-techniques-in-python-2020-5410f3df1c1e
