Internship - Report (Vinay Patil L V)
An Internship Report on
"Exploratory Data Analysis (EDA) on Suicide Data Set"
Submitted By
Vinay Patil L V (USN: 4BD18IS101)
2021-2022
Bapuji Institute of Engineering and Technology,
Department of Information Science and Engineering,
Davangere -577004, Karnataka
BAPUJI INSTITUTE OF ENGINEERING AND TECHNOLOGY
DAVANGERE – 577004, KARNATAKA
CERTIFICATE
This is to certify that Mr. THOTAD VINAY PATIL L V bearing USN 4BD18IS101 of the Information Science and Engineering department has satisfactorily submitted the internship report entitled "Exploratory Data Analysis (EDA) on Suicide Data Set". The internship report has been approved as it satisfies the academic requirements with respect to the internship work prescribed for the Bachelor of Engineering degree of Visvesvaraya Technological University, Belagavi, during the year 2021-22.
External Viva
1. __________________________
2. __________________________
ACKNOWLEDGEMENT
I would like to acknowledge the help and encouragement given by various people
during the course of this Internship.
I would like to express my sincere gratitude to the resource person Mr. Bharath H C, co-founder, Training and Internship, ACRANTON TECHNOLOGIES Pvt Ltd.
Salutations to our beloved and highly esteemed institute, "Bapuji Institute of Engineering and Technology," for having well-qualified staff and laboratories furnished with the necessary equipment.
I express my sincere thanks to my guide Mrs. Hemashree H C and Internship Coordinator Mr. Sheik Imran for their constant encouragement, support and valuable guidance throughout the course of the Internship, without which this report would not have been possible.
I express wholehearted gratitude to Dr. Poornima B, H.O.D. of IS&E, who made my task easier with her valuable help and encouragement.
I would like to thank our beloved Principal Dr. H.B. Aravind and the Director
Prof. Y Vrushabhendrappa of this Institute for giving me the opportunity and guidance
to work for the Internship.
I would like to extend my gratitude to all teaching and non-teaching staff of the Department of Information Science and Engineering for the help and support rendered to me. I have benefited a lot from their feedback and suggestions.
I would also like to extend my gratitude to all my family members and friends for their advice and moral support.
Our mission is:
• To provide superior quality products and services; we will radically shift the global economy toward small business by empowering people to confidently grow and successfully run their own ventures.
• To promote technical ideas into reality through applications for a better future.
Applied Machine Learning
CHAPTER 1
INTRODUCTION TO MACHINE LEARNING
Industry 1.0 refers to the first industrial revolution. It is marked by a transition from hand production methods to machines through the use of steam power and water power. Industry 2.0, the second industrial revolution, better known as the technological revolution, is the period between 1870 and 1914. It was made possible by the extensive railroad networks and the telegraph, which allowed for faster transfer of people and ideas.
The third industrial revolution, or Industry 3.0, occurred in the late 20th century, after the end of the two big wars, as a result of a slowdown of industrialization and technological advancement compared to previous periods. Industry 4.0 is the fourth industrial revolution that
concerns industry. The fourth industrial revolution encompasses areas which are not normally
classified as an industry, such as smart cities, for instance. In essence, industry 4.0 is the trend
towards automation and data exchange in manufacturing technologies and processes which include
cyber-physical systems (CPS), the internet of things (IoT), industrial internet of things (IIOT), cloud
computing, cognitive computing and artificial intelligence.
The concept includes:
• Smart manufacturing
• Smart factory
• Lights out (manufacturing) also known as dark factories
• Industrial internet of things also called internet of things for manufacturing
Within modular structured smart factories, cyber-physical systems monitor physical processes, create
a virtual copy of the physical world and make decentralized decisions. Over the Internet of Things,
cyber-physical systems communicate and cooperate with each other and with humans in real-time
both internally and across organizational services offered and used by participants of the value chain.
1.2 COGNITIVE COMPUTING
Cognitive computing (CC) refers to technology platforms that, broadly speaking, are based
on the scientific disciplines of artificial intelligence and signal processing. These platforms
encompass machine learning, reasoning, natural language processing, speech recognition
and vision (object recognition), human–computer interaction, dialog and narrative
generation, among other technologies.
Some features of cognitive systems are:
Adaptive
They may learn as information changes, and as goals and requirements evolve. They
may resolve ambiguity and tolerate unpredictability. They may be engineered to feed on
dynamic data in real time, or near real time.
Interactive
They may interact easily with users so that those users can define their needs comfortably.
They may also interact with other processors, devices, and cloud services, as well as
with people.
Iterative and stateful
They may aid in defining a problem by asking questions or finding additional source input if
a problem statement is ambiguous or incomplete. They may "remember" previous interactions
in a process and return information that is suitable for the specific application at that point in time.
Contextual
They may understand, identify, and extract contextual elements such as meaning, syntax,
time, location, appropriate domain, regulations, user’s profile, process, task and goal. They
may draw on multiple sources of information, including both structured and unstructured
digital information, as well as sensory inputs (visual, gestural, auditory, or sensor-provided).
1.3 AI vs ML vs DL
AI is an umbrella discipline that covers everything related to making machines smarter.
Machine Learning (ML) is commonly used along with AI but it is a subset of AI. ML refers
to an AI system that can self-learn based on an algorithm. Systems that get smarter and smarter over time without human intervention are ML. Deep Learning (DL) is machine learning (ML) applied to large data sets. Most AI work involves ML because intelligent behaviour requires considerable knowledge. Many everyday applications and services are examples of ML. Machine learning algorithms are classified into three categories: supervised, unsupervised and reinforcement learning.
3. Deep Learning (DL)
This subset of AI is a technique that is inspired by the way a human brain
filters information. It is associated with learning from examples. DL systems help a
computer model to filter the input data through layers to predict and classify
information. Deep Learning processes information in the same manner as the human
brain. It is used in technologies such as driver-less cars. DL network architectures
are classified into Convolutional Neural Networks, Recurrent Neural Networks,
and Recursive Neural Networks.
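To make the idea of "filtering the input data through layers" concrete, the following minimal NumPy sketch (not taken from the report) passes one input vector through two hidden layers and a final classification layer; the layer sizes and random weights are purely illustrative assumptions.

import numpy as np

# Minimal illustrative sketch (not from the report): an input vector is
# filtered through successive layers, each applying a weighted sum followed
# by a non-linear activation, before a final classification layer.

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0, x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

x = rng.random(8)                              # an 8-feature input vector
W1, b1 = rng.random((16, 8)), np.zeros(16)     # hidden layer 1 (illustrative sizes)
W2, b2 = rng.random((16, 16)), np.zeros(16)    # hidden layer 2
W3, b3 = rng.random((3, 16)), np.zeros(3)      # output layer for 3 classes

h1 = relu(W1 @ x + b1)            # layer 1 filters the raw input
h2 = relu(W2 @ h1 + b2)           # layer 2 filters layer 1's output
probs = softmax(W3 @ h2 + b3)     # class probabilities summing to one
print(probs)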
1. Supervised Learning
Supervised learning is one of the most basic types of machine learning. In this type,
the machine learning algorithm is trained on labelled data. Even though the data needs
to be labelled accurately for this method to work, supervised learning is extremely
powerful when used in the right circumstances. In supervised learning, the ML
algorithm is given a small training dataset to work with. This training dataset is a
smaller part of the bigger dataset and serves to give the algorithm a basic idea of the
problem, solution, and data points to be dealt with. The training dataset is also very
similar to the final dataset in its characteristics and provides the algorithm with the
labelled parameters required for the problem.
The algorithm then finds relationships between the parameters given,
essentially establishing a cause and effect relationship between the variables in the
dataset. At the end of the training, the algorithm has an idea of how the data works
and the relationship between the input and the output. This solution is then deployed
for use with the final dataset, which it learns from in the same way as the training
dataset. This means that supervised machine learning algorithms will continue to improve even after being deployed, discovering new patterns and relationships as they train themselves on new data.
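The workflow described above can be sketched in a few lines of scikit-learn. The snippet below is only an illustration (the dataset and classifier are arbitrary choices, not part of the internship work): a model is fitted on a labelled training split and then scored on unseen test data.

# Illustrative supervised-learning sketch (not from the report): train on a
# labelled training split, then evaluate on held-out test data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)                     # features and labels

# the smaller training split gives the algorithm its basic idea of the problem
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)                           # learn the input -> output relationship
print("Test accuracy:", model.score(X_test, y_test))  # performance on unseen data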
2. Unsupervised Learning
Unsupervised machine learning holds the advantage of being able to work with
unlabeled data. This means that human labor is not required to make the dataset
machine-readable, allowing much larger datasets to be worked on by the program.
In supervised learning, the labels allow the algorithm to find the exact nature of
the relationship between any two data points. However, unsupervised learning
does not have labels to work off of, resulting in the creation of hidden structures.
Relationships between data points are perceived by the algorithm in an abstract
manner, with no input required from human beings.
The creation of these hidden structures is what makes unsupervised learning
algorithms versatile. Instead of a defined and set problem statement, unsupervised
learning algorithms can adapt to the data by dynamically changing hidden structures.
This offers more post-deployment development than supervised learning algorithms.
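As an illustration of learning without labels, the hedged sketch below (not part of the report) clusters unlabelled points with k-means; the synthetic data and the choice of three clusters are assumptions made purely for the example.

# Illustrative unsupervised-learning sketch (not from the report): k-means
# groups unlabelled points into clusters with no human-provided labels.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)   # labels are discarded

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
clusters = kmeans.fit_predict(X)        # hidden structure discovered from the data alone
print(clusters[:10])                    # cluster index assigned to the first 10 points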
3. Reinforcement Learning
Reinforcement learning directly takes inspiration from how human beings learn
from data in their lives. It features an algorithm that improves upon itself and
learns from new situations using a trial-and-error method. Favourable outputs are
encouraged or ‘reinforced’, and non-favourable outputs are discouraged or
‘punished’.
Based on the psychological concept of conditioning, reinforcement learning works
by putting the algorithm in a work environment with an interpreter and a reward
system. In every iteration of the algorithm, the output result is given to the
interpreter, which decides whether the outcome is favorable or not.
In typical reinforcement learning use-cases, such as finding the shortest
route between two points on a map, the solution is not an absolute value. Instead, it
takes on a score of effectiveness, expressed in a percentage value. The higher this
percentage value is, the more reward is given to the algorithm. Thus, the program
is trained to give the best possible solution for the best possible reward.
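The shortest-route idea above can be illustrated with a toy tabular Q-learning sketch (not from the report); the corridor environment, reward values and learning parameters below are assumptions chosen only to show the reinforce/punish loop.

# Toy reinforcement-learning sketch (not from the report): tabular Q-learning
# on a 5-state corridor where reaching the rightmost state is rewarded.
import numpy as np

n_states, n_actions = 5, 2            # actions: 0 = move left, 1 = move right
goal = n_states - 1
Q = np.zeros((n_states, n_actions))   # the agent's learned value table
alpha, gamma, epsilon = 0.5, 0.9, 0.2
rng = np.random.default_rng(0)

for episode in range(200):
    state = 0
    while state != goal:
        # epsilon-greedy: occasionally explore, otherwise exploit current knowledge
        if rng.random() < epsilon:
            action = int(rng.integers(n_actions))
        else:
            action = int(np.argmax(Q[state]))
        next_state = min(state + 1, goal) if action == 1 else max(state - 1, 0)
        reward = 1.0 if next_state == goal else -0.01   # favourable outcomes are reinforced
        Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
        state = next_state

# learned action per non-goal state: expected to be all 1s (always move right)
print(np.argmax(Q[:goal], axis=1))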
CHAPTER 2
SYSTEM REQUIREMENTS
CHAPTER 3
ALGORITHMS USED IN ML
In statistics, the logistic model is used to model the probability of a certain class or event
existing such as pass/fail, win/lose, alive/dead or healthy/sick. This can be extended to model
several classes of events, such as determining whether an image contains a cat, dog, lion, etc. Each object being detected in the image would be assigned a probability between 0 and 1, with the probabilities summing to one.
Logistic regression is a statistical model that in its basic form uses a logistic function
to model a binary dependent variable, although many more complex extensions exist. In
regression analysis, logistic regression is estimating the parameters of a logistic model (a
form of binary regression). Mathematically, a binary logistic model has a dependent
variable with two possible values, such as pass/fail which is represented by an indicator
variable, where the two values are labelled "0" and "1". In the logistic model, the log-odds
(the logarithm of the odds) for the value labelled "1" is a linear combination of one or more
independent variables ("predictors"); the independent variables can each be a binary variable (two classes, coded by an indicator variable) or a continuous variable (any real value). The corresponding probability of the value labelled "1" can vary between 0 (certainly the value "0") and 1 (certainly the value "1"), hence the labelling; the function that converts log-odds to probability is the logistic function, hence the name. The unit of measurement for the log-odds scale is called a logit, from logistic unit, hence the alternative names.
Analogous models with a different sigmoid function instead of the logistic function can also
be used, such as the probit model; the defining characteristic of the logistic model is that
increasing one of the independent variables multiplicatively scales the odds of the given
outcome at a constant rate, with each independent variable having its own parameter; for a
binary dependent variable this generalizes the odds ratio.
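As a hedged illustration (not the report's own code), the sketch below fits a binary logistic model with scikit-learn; the breast-cancer dataset is an arbitrary choice used only to show how the model outputs class probabilities between 0 and 1.

# Illustrative binary logistic regression sketch (not from the report).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)       # binary target labelled 0 / 1
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000)
clf.fit(X_train, y_train)

print(clf.predict_proba(X_test[:5]))   # P(class 0), P(class 1); each row sums to one
print(clf.score(X_test, y_test))       # classification accuracy on the test split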
Decision Trees are a type of Supervised Machine Learning where the data is continuously
split according to a certain parameter. The tree can be explained by two entities, namely decision nodes and leaves. The leaves are the decisions or the final outcomes, and the decision nodes are where the data is split. Decision tree learning uses a decision tree as a predictive model
to go from observations about an item (represented in the branches) to conclusions about
the item's target value (represented in the leaves). It is one of the predictive modelling
approaches used in statistics, data mining and machine learning. Tree models where the target
variable can take a discrete set of values are called classification trees; in these tree
structures, leaves represent class labels and branches represent conjunctions of features
that lead to those class labels. Decision trees where the target variable can take continuous
values are called regression trees. In decision analysis, a decision tree can be used to
visually and explicitly represent decisions and decision making. In data mining, a decision
tree describes data, but the resulting classification tree can be an input for decision making.
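A minimal sketch (not the report's code) of a classification tree follows; the iris dataset and the depth limit are assumptions made only to show the decision nodes and leaves described above.

# Illustrative classification-tree sketch (not from the report): decision nodes
# split on feature values and leaves carry the class labels.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0)   # limit depth to reduce overfitting
tree.fit(X, y)

print(export_text(tree))    # text view of the splits (decision nodes) and leaves (classes)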
Random forest is a supervised learning algorithm which is used for both classification as well as regression. However, it is mainly used for classification problems. As we know, a forest is made up of trees, and more trees mean a more robust forest. Similarly, the random
forest algorithm creates decision trees on data samples and then gets the prediction from each
of them and finally selects the best solution by means of voting. It is an ensemble method
which is better than a single decision tree because it reduces the over-fitting by averaging the
result. The fundamental concept behind random forest is a simple but powerful one —
the wisdom of crowds. In data science speak, the reason that the random forest model
works so well is: A large number of relatively uncorrelated models (trees) operating as a
committee will outperform any of the individual constituent models. The low correlation
between models is the key. Just like how investments with low correlations (like stocks and
bonds) come together to form a portfolio that is greater than the sum of its parts,
uncorrelated models can produce ensemble predictions that are more accurate than any of
the individual predictions. The reason for this wonderful effect is that the trees protect each
other from their individual errors. While some trees may be wrong, many other trees will
be right, so as a group the trees are able to move in the correct direction.
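The voting committee described above can be sketched as follows (a hedged example, not the report's code); the dataset and the choice of 100 trees are illustrative assumptions.

# Illustrative random-forest sketch (not from the report): many decision trees
# are built on bootstrapped data samples and their predictions combined by voting.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)   # 100-tree committee
forest.fit(X_train, y_train)
print("Test accuracy:", forest.score(X_test, y_test))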
CHAPTER 4
TERMINOLOGIES
When we run our training algorithm on the data set, we allow the overall cost (i.e., the distance from each point to the line) to become smaller with more iterations. Letting this training algorithm run for too long leads to a minimal overall cost. However, this means that the line will be fit to all the points (including noise), catching secondary patterns that may not be needed for the generalizability of the model. If the model does not capture the dominant trend, it cannot predict a likely output for an input it has never seen before.
Overfitting is the case where the overall cost is really small, but the generalization
of the model is unreliable. This is due to the model learning “too much” from the training
data set. The more we leave the model training the higher the chance of overfitting
occurring. We always want to find the trend, not fit the line to all the data points.
Overfitting (or high variance) does more harm than good: the model learns too much from the training data, resulting in low generalization and unreliable predictions. With high bias, on the other hand, the model might not have enough flexibility to capture the dominant trend, so it too does not generalize well.
Bias is the difference between the average prediction of our model and the correct value which we are trying to predict. A model with high bias pays very little attention to the training data and oversimplifies the model; it always leads to high error on training and test data.
Variance is the variability of model prediction for a given data point, or a value which tells us the spread of our data. A model with high variance pays a lot of attention to the training data and does not generalize to data which it hasn't seen before. As a result, such models perform very well on training data but have high error rates on test data.
If our model is too simple and has very few parameters, then it may have high bias and low variance. On the other hand, if our model has a large number of parameters, then it is going to have high variance and low bias. So, we need to find the right/good balance without
overfitting and underfitting the data. This trade-off in complexity is why there is a trade-off
between bias and variance. An algorithm can’t be more complex and less complex at the
same time.
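The trade-off can be seen numerically in the small sketch below (not from the report): polynomials of increasing degree are fitted to noisy data that follows a sine trend, where the degrees, noise level and sample size are illustrative assumptions. The low-degree model underfits (high bias) and the high-degree model overfits (high variance), visible as a widening gap between training and test error.

# Illustrative bias-variance sketch (not from the report).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 60)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 60)    # dominant trend + noise
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):                     # underfit, reasonable fit, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")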
4.3 REGULARIZATION
This is a form of regression that constrains/regularizes or shrinks the coefficient estimates towards zero. This technique discourages learning a more complex or flexible model, so as to avoid the risk of overfitting. Regularization is a technique used to reduce errors by fitting the function appropriately on the given training set and avoiding overfitting. The commonly used regularization techniques are:
1. LASSO regression:
Lasso (Least Absolute Shrinkage and Selection Operator) is another variation, in which a similarly penalized function is minimized. This variation differs from ridge regression only in how it penalizes high coefficients: it uses |βj| (the modulus) instead of the squares of β as its penalty. In statistics, this is known as the L1 norm.
2. Ridge regression:
In ridge regression, the RSS is modified by adding a shrinkage quantity, and the coefficients are estimated by minimizing this penalized function. Here, λ is the tuning parameter that decides how much we want to penalize the flexibility of our model. The
increase in flexibility of a model is represented by increase in its coefficients, and if
we want to minimize the above function, then these coefficients need to be small. This is
how the Ridge regression technique prevents coefficients from rising too high. In statistics,
this is known as the L2 norm.
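Both penalties are available in scikit-learn. The hedged sketch below (not the report's code) compares ordinary least squares with ridge (L2) and lasso (L1) fits; the diabetes dataset is arbitrary and the alpha parameter stands in for λ purely as an illustration.

# Illustrative regularization sketch (not from the report): ridge and lasso
# shrink coefficient estimates towards zero compared with plain least squares.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression, Ridge, Lasso

X, y = load_diabetes(return_X_y=True)

ols = LinearRegression().fit(X, y)     # no penalty
ridge = Ridge(alpha=1.0).fit(X, y)     # L2 penalty: sum of squared coefficients
lasso = Lasso(alpha=1.0).fit(X, y)     # L1 penalty: sum of |coefficients|

print(ols.coef_)
print(ridge.coef_)    # coefficients shrunk towards zero
print(lasso.coef_)    # some coefficients driven exactly to zero (feature selection)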
After doing the usual feature engineering and selection, and of course implementing a model and getting some output in the form of a probability or a class, the next step is to find out how effective the model is, based on some metric computed on test datasets. Different performance metrics are used to evaluate different machine learning algorithms. We can use classification performance metrics such as Log-Loss, Accuracy, AUC (Area Under Curve), etc. Other examples are precision, recall and F1 score, which are also used in ranking and retrieval systems such as search engines.
1. Accuracy:
Accuracy is the ratio of the number of correct predictions to the total number of predictions made.
2. Precision:
Precision, used in document retrieval, may be defined as the proportion of documents returned by our ML model that are actually relevant. More generally, precision is a measure that tells us what proportion of the predicted positive outcomes are actually positive. Precision is about being precise: even if we managed to predict only one positive output, and we predicted it correctly, then we are 100% precise.
3. Recall:
Recall may be defined as the proportion of actual positives that are returned by our ML model. Recall is a metric that quantifies the number of correct positive predictions made out of all the positive predictions that could have been made. Unlike precision, which only comments on the correct positive predictions out of all predicted positives, recall provides an indication of missed positive predictions.
4. F1 score:
This score gives us the harmonic mean of precision and recall. The best value of F1 is 1 and the worst is 0. We can calculate the F1 score with the help of the following formula:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
The F1 score gives equal relative weight to precision and recall. We can use the classification_report function of sklearn.metrics to get the classification report of our classification model.
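A small hedged example follows (the labels are made up for illustration, not taken from the project), showing these metrics computed with sklearn.metrics as mentioned above.

# Illustrative metrics sketch with hypothetical labels (not project results).
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, classification_report)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # hypothetical actual labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # hypothetical model predictions

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))   # 2 * (precision * recall) / (precision + recall)
print(classification_report(y_true, y_pred))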
A more deliberate approach to building a model is to use error analysis. Error analysis requires
you to dig into the results of your model after each iteration. You look at the data and
predictions on an observational level and form hypotheses as to why your model failed on
certain predictions. Then you test your hypothesis by changing the model in a way that
might fix that error, and begin the next iteration. Each iteration of modeling becomes more
time consuming with error analysis, but the final results are better and will likely arrive faster.
• Find errors.
• Test hypothesis.
• Repeat.
CHAPTER 5
PROJECT 1
EXPLORATORY DATA ANALYSIS (EDA) ON
“SUICIDE DATA SET”
5.1 DESCRIPTION
Each record in the data set represents a year, a country, a certain age range, and a gender. For example, in Brazil in the year 1985, 129 men over 75 years of age committed suicide.
The data set has 10 attributes.
These being:
Country: Country of record data;
Year: Year of record data;
Sex: Male / Female;
Age: Suicide age range, ages divided into six categories;
Suicides_no: Number of Suicides;
Population: population of this sex, in this age range, in this country and in this year;
Suicides/100k pop: Ratio between the number of suicides and the population, per 100k;
GDP_for_year: GDP of the country in the year in question;
GDP_per_capita: ratio between the country's GDP and its population;
Generation: Generation of the victims in question, with 6 possible categories.
Suicide rate is defined as number of suicides per 100k population.
After analysing the given dataset, suicide rates.csv, we were able to determine that the given problem is a supervised machine learning problem where suicides/100k pop is the output label.
5.2 PRE-PROCESSING
➢ Step-1: Importing Libraries and Data set
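A hedged loading and inspection sketch for Step-1 is shown below; it is not the report's exact code, and the file name (taken from the description above) and its location in the working directory are assumptions.

# Step-1 sketch (not the report's exact code): import libraries and load the data set.
import pandas as pd

df = pd.read_csv("suicide rates.csv")   # dataset described in Section 5.1 (path assumed)

print(df.shape)            # number of rows and columns
df.info()                  # data types and non-null counts per attribute
print(df.isnull().sum())   # missing values in each column

df = df.drop_duplicates()  # drop duplicate records, if any
print(df.describe())       # basic statistics of the numeric attributes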
Correlation:
It can be clearly seen from the above graph that the total number of suicides (summing up the suicide numbers of all countries in a particular year) was increasing during the years 1989-2005 and then started to decrease.
➢ In the years after 1995, the suicide rate started to decrease.
➢ Both males and females in the 35-54 years age group have the highest number of suicides, while the 5-14 years age group has the fewest.
• GDP ANALYSIS
➢ It can be observed from the above scatter plot that as the GDP for year increases, the number of suicides gradually decreases.
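The yearly trend and GDP scatter plot referred to above can be reproduced along the following lines (a hedged sketch, not the report's exact code); column names such as 'year', 'suicides_no' and 'gdp_per_capita' are assumptions based on the attribute list in Section 5.1.

# Hedged analysis sketch (not the report's exact code); column names assumed.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("suicide rates.csv")

# total number of suicides per year (summed over all countries)
yearly = df.groupby("year")["suicides_no"].sum()
yearly.plot(kind="line", title="Total suicides per year")
plt.show()

# suicides versus GDP per capita, as in the scatter plot discussed above
df.plot(kind="scatter", x="gdp_per_capita", y="suicides_no",
        title="Suicides vs GDP per capita")
plt.show()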
CHAPTER 6
RESULTS AND DISCUSSIONS
After analyzing the given data set on suicide rates, we could come to the conclusion that:
➢ It was observed that middle-aged men, i.e., men aged 35-54, have the highest suicide rate.
➢ It was also observed that the GDP of the country clearly makes a difference in suicide rates, because as the GDP for year and GDP per capita started to increase, the suicide rate started to decrease.
Considering the suicide numbers, most of the victims were men; unemployment may be one of the reasons. In later years, suicide rates started to decrease as precautions were taken and people were educated about suicide prevention.
CONCLUSION
Many industries are executing projects based on applied machine learning for various applications. Machine learning is reorganizing the world globally and has automated many business processes. Machine learning is one of the advanced technologies of AI, and an understanding of its concepts is vital to stay ahead in a competitive market. Machine learning is one of the methods by which we can realize artificial intelligence and is also known as a subset of AI.
From deep learning to applied machine learning, the techniques and tools of machine
learning have enhanced and automated business support functions. Many industries (such as
healthcare, retail, manufacturing, finance, education, entertainment, banking, telecom, and much
more) use machine learning for their businesses.