
Title: Machine Learning

Abstract:

Machine learning is a branch of computer science that has the potential to transform
epidemiologic sciences. Amid a growing focus on “Big Data,” it offers
epidemiologists new tools to tackle problems for which classical methods are not
well-suited. In order to critically evaluate the value of integrating machine learning
algorithms and existing methods,
however, it is essential to address language and technical barriers between the two
fields that can make it difficult for epidemiologists to read and assess machine
learning studies. Here, we provide an overview of the concepts and terminology
used in machine learning literature, which encompasses a diverse set of tools with
goals ranging from prediction to classification to clustering. We provide a brief
introduction to 5 common machine learning algorithms and 4
ensemble-based approaches. We then summarize epidemiologic applications of
machine learning techniques in the published literature. We recommend approaches
to incorporate machine learning in epidemiologic research and discuss opportunities and challenges for integrating machine learning and existing epidemiologic research methods.

Keywords: Big Data; ensemble models; machine learning

Abbreviations:
ANN - artificial neural network
BMA - Bayesian model averaging
BMI - body mass index
CART - classification and regression trees
SVM - support vector machine

Literature review:
Machine learning is a branch of computer science that broadly aims to enable computers to "learn" without being directly programmed (1). It has origins in the artificial intelligence movement of the 1950s and emphasizes practical objectives and applications, particularly prediction and optimization. Computers "learn" in machine learning by improving their performance at tasks through "experience" (2, p. xv). In practice, "experience" usually means fitting to data; hence, there is not a clear boundary between machine learning and statistical approaches. Indeed, whether a given methodology is considered "machine learning" or "statistical" often reflects its history as much as genuine differences, and many algorithms (e.g., least absolute shrinkage and selection operator (LASSO), stepwise regression) may or may not be considered machine learning depending on whom you ask. Still, despite methodological similarities, machine learning is philosophically and practically distinguishable. At the risk of (considerable) oversimplification, machine learning generally emphasizes predictive accuracy over hypothesis-driven inference, usually focusing on large, high-dimensional (i.e., having many covariates) data sets (3, 4). Regardless of the precise distinction between approaches, in practice, machine learning offers epidemiologists important tools. In particular, a growing focus on "Big Data" emphasizes problems and data sets for which machine learning algorithms excel while more commonly used statistical approaches struggle.
This primer provides a basic introduction to machine learning with the aim of giving readers a foundation for critically reading studies based on these methods and a jumping-off point for those interested in using machine learning techniques in epidemiologic research. The "Concepts and Terminology" section of this paper presents concepts and terminology used in the machine learning literature. The "Machine Learning Algorithms" section provides a brief introduction to 5 common machine learning algorithms: artificial neural networks, decision trees, support vector machines, naive Bayes, and k-means clustering. These are important and commonly used algorithms that epidemiologists are likely to encounter in practice, but they are by no means comprehensive of this large and highly diverse field. The following two sections, "Ensemble Methods" and "Epidemiologic Applications," extend this examination to ensemble-based approaches and epidemiologic applications in the published literature. "Brief Recommendations" provides some recommendations for incorporating machine learning into epidemiologic practice, and the last section discusses opportunities and challenges.

CONCEPTS AND TERMINOLOGY:


For epidemiologists seeking to integrate machine learning techniques into their research, language and technical barriers between the two fields can make reading source materials and studies challenging. Some machine learning concepts lack statistical or epidemiologic parallels, and machine learning terminology often differs even where the underlying concepts are the same. Here we briefly review basic machine learning principles and provide a glossary of machine learning terms and their statistical/epidemiologic equivalents (Table 1).
Supervised, unsupervised, and semisupervised learning
Machine learning is broadly classifiable by whether the computer's learning (i.e., model-fitting) is "supervised" or "unsupervised." Supervised learning is akin to the type of model-fitting that is standard in epidemiologic practice: The value of the outcome (i.e., the dependent variable), often called its "label" in machine learning, is known for each observation. Data with specified outcome values are called "labeled data." Common supervised learning techniques include standard epidemiologic approaches such as linear and logistic regression, as well as many of the most popular machine learning algorithms (e.g., decision trees, support vector machines).

In unsupervised learning, the algorithm attempts to identify natural relationships and groupings within the data without reference to any outcome or the "right answer" (5, p. 517). Unsupervised learning approaches share similarities in goals and structure with statistical approaches that attempt to identify unspecified subgroups with similar characteristics (e.g., "latent" variables or classes) (6). Clustering algorithms, which group observations on the basis of similar data characteristics (e.g., both oranges and beach balls are round), are common unsupervised learning implementations. Examples include k-means clustering and expectation-maximization clustering using Gaussian mixture models (7, 8).
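As a concrete illustration of this idea, the sketch below uses scikit-learn's KMeans on a small hypothetical data set of two synthetic subpopulations; the algorithm recovers the grouping from the covariates alone, with no outcome supplied.

```python
# Illustrative sketch: k-means groups observations by covariate similarity
# alone; no outcome ("label") is ever shown to the algorithm.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two well-separated synthetic subpopulations (hypothetical data)
group_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
group_b = rng.normal(loc=5.0, scale=0.5, size=(50, 2))
X = np.vstack([group_a, group_b])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_  # cluster assignment for each observation
```

Because the two subpopulations are well separated here, the cluster labels align with the true groups; in real data, recovered clusters need not correspond to any meaningful subgroup.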
Semisupervised learning fits models to both labeled and unlabeled data. Labeling data (outcomes) is often time-consuming and expensive, particularly for large data sets. Semisupervised learning supplements limited labeled data with an abundance of unlabeled data with the goal of improving model performance (studies show that unlabeled data can help build a better classifier, but appropriate model selection is critical) (9). For example, in a study of Web page classification, Nigam et al. (10) fit a naive Bayes classifier to labeled data and then used the same classifier to probabilistically label unlabeled observations (i.e., fill in missing outcome data). They then retrained a new classifier on the resulting, fully labeled data set, achieving a 30% increase in Web page classification accuracy on data outside of the training set. Semisupervised learning bears some similarity to statistical approaches for missing data and censoring (e.g., multiple imputation), but it focuses on imputing missing outcomes rather than missing covariates.
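A minimal sketch of this self-training idea, using scikit-learn's SelfTrainingClassifier with a naive Bayes base learner on hypothetical synthetic data (unlabeled outcomes are marked with -1, scikit-learn's convention):

```python
# Semisupervised "self-training" in the spirit of Nigam et al.: a classifier
# fit on a few labeled points probabilistically labels the unlabeled rest,
# and the model is refit on the expanded, fully labeled data set.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.semi_supervised import SelfTrainingClassifier

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
y_true = np.array([0] * 100 + [1] * 100)

# Pretend labeling is expensive: only 10 outcomes are known (-1 = unlabeled)
y_partial = np.full(200, -1)
known = np.r_[0:5, 100:105]  # 5 labeled observations from each class
y_partial[known] = y_true[known]

model = SelfTrainingClassifier(GaussianNB()).fit(X, y_partial)
accuracy = model.score(X, y_true)
```

Here the 190 unlabeled observations help the classifier place its decision boundary despite only 10 known outcomes.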

Classification versus regression algorithms


Within the domain of supervised learning, machine learning algorithms can be further divided into classification or regression applications, depending upon the nature of the response variable. In general, in the machine learning literature, classification refers to prediction of categorical outcomes, while regression refers to prediction of continuous outcomes. We use this terminology throughout this primer and are explicit when referring to specific regression algorithms (e.g., logistic regression). Many machine learning algorithms that were developed to perform classification have been adapted to also address regression problems, and vice versa.
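For example, the decision-tree family exists in both forms in scikit-learn; the one-covariate data set below is hypothetical and chosen only to show the two response types side by side.

```python
# One algorithm family in both its classification (categorical outcome)
# and regression (continuous outcome) forms.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = np.arange(10, dtype=float).reshape(-1, 1)  # a single covariate
y_class = (X.ravel() >= 5).astype(int)         # categorical outcome
y_cont = 2.0 * X.ravel() + 1.0                 # continuous outcome

clf = DecisionTreeClassifier(random_state=0).fit(X, y_class)
reg = DecisionTreeRegressor(random_state=0).fit(X, y_cont)

pred_class = clf.predict([[7.0]])[0]  # a predicted class (0 or 1)
pred_value = reg.predict([[7.0]])[0]  # a predicted numeric value
```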
Generative versus discriminative algorithms
Machine learning algorithms, both supervised and unsupervised, can be discriminative or generative (11, 12). Discriminative algorithms directly model the conditional probability of an outcome, Pr(y|x) (the probability of y given x), in a set of observed data; for example, the probability that a subject has type 2 diabetes mellitus given a certain body mass index (BMI; weight (kg)/height (m)²). Most statistical approaches familiar to epidemiologists (e.g., linear and logistic regression) are discriminative, as are most of the algorithms discussed in this primer.

In contrast, while generative algorithms can also compute the conditional probability of an outcome, this computation occurs indirectly. Generative algorithms first model the joint probability distribution, Pr(x, y) (the probabilities associated with all possible combinations of x and y), or, continuing our example, a probabilistic model that accounts for all observed combinations of BMIs and diabetes outcomes (Table 2). This joint probability distribution can be transformed into a conditional probability distribution in order to classify data, as Pr(y|x) = Pr(x, y)/Pr(x). Because the joint probability distribution models the underlying data-generating process, generative models can also be used, as their name suggests, to directly generate new simulated data points reflecting the distribution of the covariates and outcome in the modeled population (11). However, because they model the full joint distribution of outcomes and covariates, generative models are generally more complex and require more assumptions to fit than discriminative algorithms (12, 13). Examples of generative algorithms include naive Bayes and hidden Markov models (11).
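The transformation Pr(y|x) = Pr(x, y)/Pr(x) can be made concrete numerically; the joint probabilities below are hypothetical, chosen only to illustrate the arithmetic.

```python
# Recovering a conditional probability from a (hypothetical) joint
# distribution over BMI category and diabetes status.
joint = {  # Pr(bmi_category, diabetes)
    ("normal", 0): 0.45, ("normal", 1): 0.05,
    ("high", 0): 0.30, ("high", 1): 0.20,
}

def pr_y_given_x(bmi_category, y):
    # Pr(y | x) = Pr(x, y) / Pr(x), where Pr(x) is obtained by
    # marginalizing the joint distribution over y.
    pr_x = sum(p for (x, _), p in joint.items() if x == bmi_category)
    return joint[(bmi_category, y)] / pr_x

p_diabetes_high = pr_y_given_x("high", 1)  # 0.20 / 0.50 = 0.4
```

A discriminative model would estimate the 0.4 directly; a generative model arrives at it by way of the full joint table, which can also be sampled to simulate new data.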
Reinforcement learning
In reinforcement learning, systems learn to excel at a task over time through trial and error (14). Reinforcement learning techniques take an iterative approach to learning by obtaining positive or negative feedback based on performance of a given task on some data (whether prediction, classification, or another action) and then self-adapting and attempting the task again on new data (though old data may be reencountered) (15). Depending on how it is implemented, this approach can be akin to supervised learning, or it may represent a semisupervised approach (as in generative adversarial neural networks (16)). Reinforcement learning algorithms often exploit early, "exploratory" versions of a model (that is, task attempts) that perform poorly in order to gain information for better performance on future attempts, and then become less labile as the model "learns" more (15). Medical and epidemiologic applications of reinforcement learning have included modeling the effect of sequential clinical treatment decisions on disease progression (17) (e.g., optimizing first- and second-line therapy decisions for schizophrenia management (18)) and personalized, adaptive medication dosing strategies. For example, Nemati et al. (19) used reinforcement learning with artificial neural networks in a cohort of intensive-care-unit patients to develop individualized heparin dosing strategies that evolve as a patient's clinical phenotype changes, in order to maximize the amount of time that blood drug levels remain within the therapeutic window.
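As a toy illustration of the explore-then-exploit pattern described above, the sketch below implements an epsilon-greedy bandit choosing between two hypothetical treatment actions; the response probabilities and exploration schedule are invented for illustration.

```python
# Epsilon-greedy reinforcement-learning sketch: early attempts explore;
# as value estimates improve, the agent increasingly exploits the better
# action and becomes less labile.
import random

random.seed(0)
true_response = [0.3, 0.7]  # unknown success probability of each action
value = [0.0, 0.0]          # running estimate of each action's value
count = [0, 0]

for t in range(2000):
    eps = 0.5 / (1 + t / 100)  # exploration decays as the agent "learns"
    if random.random() < eps:
        action = random.randrange(2)                    # explore
    else:
        action = max(range(2), key=lambda a: value[a])  # exploit
    reward = 1 if random.random() < true_response[action] else 0
    count[action] += 1
    value[action] += (reward - value[action]) / count[action]  # mean update

best_action = max(range(2), key=lambda a: value[a])
```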

MACHINE LEARNING ALGORITHMS:


In this section, we introduce 5 common machine learning algorithms: artificial neural networks, decision trees, support vector machines, naive Bayes, and k-means clustering. For each, we include a brief description, summarize strengths and limitations, and highlight implementations available on common statistical computing platforms. This section is intended to provide a high-level introduction to these algorithms, and we refer interested readers to the cited references for further information.
Artificial neural networks
Artificial neural networks (ANNs) are inspired by the signaling behavior of neurons in biological neural networks. ANNs, which consist of a population of neurons interconnected through complex signaling pathways, use this structure to analyze complex interactions between a group of measurable covariates in order to predict an outcome. ANNs possess layers of "neurons" connected by "axons" (20) (Figure 1A). These layers are grouped into 1) an input layer, 2) one or more middle "hidden" layers, and 3) an output layer. The neurons in the input and output layers correspond to the independent and dependent variables, respectively. Neurons in adjacent layers communicate with each other through activation functions, which convert the weighted sum of a neuron's inputs into an output (Figure 1B). Depending on the type of activation function, the output can be dichotomous ("1" when the weighted sum exceeds a given threshold and "0" otherwise) or continuous. The weighted sum of a neuron's inputs is somewhat analogous to coefficients in linear or logistic regression.
Figure 1 illustrates a simple neural network with a single hidden layer and a feed-forward structure (i.e., signals progress unidirectionally from input to output layers). For supervised learning applications, once the numbers of layers and neurons are selected, the connection weights of the ANN are fit on a training set of labeled data through a reinforcement learning approach. Initial connection weights are generally selected randomly, and network output is compared with the correct output (class labels) using a loss function, which is based on the difference between the predicted and true values of the outcome. The goal is to reduce the loss function to zero; that is, to make the ANN's predicted output match truth as closely as possible, albeit while also protecting against overfitting. In response, 1) resulting error values are distributed backwards through the network, from output to input, in order to assign an error value contribution to each hidden and input layer neuron (called "back-propagation"; for additional technical information on this process, see, for example, Rumelhart et al. (21)), and 2) connection weights are updated in order to minimize the loss function ("weight adjustment"). This 2-fold optimization process repeats for a number of "epochs" or iterations until the network meets a prespecified stopping rule or error rate threshold (22, 23).
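The back-propagation/weight-adjustment loop can be sketched from scratch for a tiny network; the architecture, learning rate, and synthetic data below are illustrative choices, not a reference implementation.

```python
# Back-propagation on a tiny 2-input, 3-hidden-neuron, 1-output network:
# each epoch pushes error contributions from output toward input, then
# adjusts connection weights to reduce the loss.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                    # 20 observations, 2 inputs
y = (X[:, 0] > 0).astype(float).reshape(-1, 1)  # dichotomous outcome

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W1 = rng.normal(size=(2, 3))  # input -> hidden connection weights
W2 = rng.normal(size=(3, 1))  # hidden -> output connection weights
lr = 0.5                      # learning rate (illustrative)

losses = []
for epoch in range(200):
    h = sigmoid(X @ W1)                    # hidden-layer activations
    out = sigmoid(h @ W2)                  # network output
    losses.append(np.mean((out - y) ** 2)) # loss function (squared error)
    # Back-propagation: error values flow from output toward input
    d_out = (out - y) * out * (1 - out)    # output-layer error
    d_h = (d_out @ W2.T) * h * (1 - h)     # hidden-layer error contribution
    # Weight adjustment: update weights to shrink the loss
    W2 -= lr * h.T @ d_out / len(X)
    W1 -= lr * X.T @ d_h / len(X)
```

After repeated epochs the loss falls, mirroring the stopping-rule-driven loop described above.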
Strengths and limitations. Strengths of ANNs include their ability to accommodate variable interactions and nonlinear associations without user specification (22). The primary limitation of ANNs is that, although it is arguably not completely a "black box" (23, p. 1112), the underlying model nevertheless remains largely opaque. Effects are mediated exclusively through hidden layer(s), making interpreting relationships between input and output layers challenging, especially for "deep" ANNs, which include multiple hidden layers. This lack of transparency complicates commonsense or etiological interpretation of individual variable effects and connection weights, although there are continuing efforts to enhance ANN interpretability (20, 24, 25). ANN training parameters can also be complex, and setting and tuning these parameters generally necessitates technical expertise. Moreover, complex ANNs, including deep networks, can require large data sets (potentially in the tens or hundreds of thousands, although there is no hard-and-fast rule) in order to achieve optimal model performance, which may be prohibitive for some epidemiologic applications (26).
Sample statistical packages and modules. Available software includes neuralnet, nnet, deepnet, and TensorFlow in R (R Foundation for Statistical Computing, Vienna, Austria); Enterprise Miner Neural Network and AutoNeural in SAS (SAS Institute, Inc., Cary, North Carolina); and sklearn and TensorFlow in Python (Python Software Foundation, Wilmington, Delaware).
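A minimal sketch using one of the Python implementations above (sklearn's MLPClassifier); the layer sizes, activation function, iteration cap, and synthetic data are illustrative choices, not recommendations.

```python
# A feed-forward ANN with 3 input neurons (covariates), one hidden layer
# of 5 neurons, and a dichotomous output, trained by iteratively adjusting
# connection weights to reduce a loss function.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))            # 3 covariates -> input layer
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # dichotomous outcome -> output

ann = MLPClassifier(
    hidden_layer_sizes=(5,),  # one hidden layer with 5 neurons
    activation="logistic",    # sigmoid activation function
    max_iter=2000,            # cap on training iterations (stopping rule)
    random_state=0,
).fit(X, y)

train_accuracy = ann.score(X, y)
```

As the text cautions, the fitted weights in `ann.coefs_` are difficult to interpret etiologically even for this small network.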

Conclusion:
The field of machine learning is rapidly developing and can make any technical review seem obsolete within months. Growing interest in the field from the general public, as reflected in extensive coverage of self-driving cars and AlphaGo (Alphabet, Inc., Mountain View, California) in the mainstream media, is accompanied by efforts from the machine learning community to make advanced machine learning technologies more accessible. Educational companies such as Udacity (Udacity, Inc., Mountain View, California) and Coursera (Coursera, Inc., Mountain View, California) have partnered with companies like Google and academic institutions to create freely available online courses on machine learning and deep learning.
In addition to the growing educational resources, large technological companies, including Google, IBM (International Business Machines Corporation, Armonk, New York), and Amazon Web Services (Amazon Web Services, Inc., Seattle, Washington), are heavily investing in open-source machine learning that uses data-flow graphs to build models (e.g., TensorFlow (Google, Inc.) (157)). The use of data-flow graphs in TensorFlow enables developers and data scientists to focus on the high-level overall logic of the algorithms rather than the technical coding details, which greatly increases the reproducibility and optimizability of the models. Models built with TensorFlow can be integrated into mobile devices, making on-device/bedside diagnosis practical when combined with mobile sensors. The ability of TensorFlow to build and run models on the cloud also dramatically increases processing power and storage capacity, which is particularly helpful for analyzing large data sets with complex algorithms. These machine learning developments continue to lower the entry barriers for epidemiologists interested in using advanced machine learning technologies, and they have the potential to transform epidemiologic research.
Yet, there continue to be challenges that impede greater integration of machine learning into epidemiologic research. Classically trained epidemiologists often lack the skills to take full advantage of machine learning technologies, partly because of the continued popularity of closed-source programming languages (e.g., SAS, Stata) in epidemiology. In addition, despite the promise of "Big Data," logistical roadblocks to sharing de-identified patient data and amassing large health-care data sets can make it challenging for epidemiologists to leverage these opportunities, particularly compared with the private sector. Even when data are available, epidemiologists should be mindful of the class-imbalance issue (see Table 1) often inherent in health-care and surveillance data, which can pose challenges for many standard algorithms (158). Most importantly, a general lack of working knowledge of machine learning algorithms, despite their substantial methodological overlap with statistical methods, reduces the practical uptake of these techniques in the epidemiologic literature.
Ultimately, advanced machine learning algorithms offer epidemiologists new tools for tackling problems that classical methods are not well-suited for, but they by no means serve as a cure-all for poor study design or poor data quality. Further eroding the cultural and language barriers between machine learning and epidemiology is an essential first step toward understanding the value of machine learning and achieving its greater integration with existing epidemiologic research methods.
REFERENCES:
1. Samuel AL. Some studies in machine learning using the game
of checkers. IBM J Res Dev. 1959;3(3):210–229.
2. Mitchell TM. Machine Learning. 1st ed. New York, NY:
McGraw-Hill Education; 1997.
3. Rasmussen CE, Williams CKI. Gaussian Processes for
Machine Learning. 1st ed. Cambridge, MA: MIT Press; 2006.
4. Breiman L. Statistical modeling: the two cultures (with
comments and a rejoinder by the author). Stat Sci. 2001;16(3):
199–231.
5. Duda RO, Hart PE, Stork DG. Pattern Classification. 2nd ed.
Hoboken, NJ: John Wiley & Sons, Inc.; 2012:517.
6. Bartholomew DJ, Knott M, Moustaki I. Latent Variable
Models and Factor Analysis: A Unified Approach. 3rd ed.
Hoboken, NJ: John Wiley & Sons, Inc.; 2011.
7. Hennig C, Meila M, Murtagh F, et al. Handbook of
Cluster Analysis. 1st ed. Boca Raton, FL: CRC Press;
2015:34.
8. Bishop CM. Pattern Recognition and Machine Learning. 1st
ed. New York, NY: Springer; 2006:424.
9. Zhu X, Goldberg AB. Introduction to Semi-Supervised
Learning. 1st ed. San Rafael, CA: Morgan & Claypool
Publishers; 2009:11.
10. Nigam K, McCallum AK, Thrun S, et al. Text classification
from labeled and unlabeled documents using EM. Mach
Learn. 2000;39(2):103–134.
11. Ng AY, Jordan MI. On discriminative vs. generative
classifiers: a comparison of logistic regression and naive
Bayes. In: Dietterich TG, Becker S, Ghahramani Z, eds.
Advances in Neural Information Processing Systems 14.
Cambridge, MA: MIT Press; 2002:841–848.
12. Vapnik VN. Statistical Learning Theory. 1st ed. Hoboken,
NJ: Wiley-Interscience; 1998:12–21.
13. Pernkopf F, Bilmes J. Discriminative versus generative
parameter and structure learning of Bayesian network
classifiers. In: Proceedings of the 22nd International
Conference on Machine Learning—ICML ’05. New York,
NY: Association for Computing Machinery; 2005:657–664.
https://dl.acm.org/citation.cfm?id=1102434. Accessed July
4, 2019.
14. Reinforcement Learning: An Introduction—Richard S.
Sutton and Andrew G. Barto [book review]. IEEE Trans
Neural Netw. 1998;9(5):1054.
15. Sutton RS, Barto AG. Reinforcement Learning: An
Introduction. 2nd ed. Cambridge, MA: MIT Press; 2018.
