Research Paper
Abstract:
Machine learning is a branch of computer science that has the potential to transform
epidemiologic sciences. Amid a growing focus on “Big Data,” it offers
epidemiologists new tools to tackle problems for which classical methods are not
well-suited. To critically evaluate the value of integrating machine learning
algorithms and existing methods, however, it is essential to address language and
technical barriers between the two fields that can make it difficult for
epidemiologists to read and assess machine learning studies. Here, we provide an
overview of the concepts and terminology
used in machine learning literature, which encompasses a diverse set of tools with
goals ranging from prediction to classification to clustering. We provide a brief
introduction to 5 common machine learning algorithms and 4
ensemble-based approaches. We then summarize epidemiologic applications of
machine learning techniques in the published literature. We recommend approaches
to incorporate machine learning in epidemiologic research and discuss
opportunities and challenges for integrating machine learning and existing
epidemiologic research methods.
Keywords:
Big Data; ensemble models; machine learning
Abbreviations:
ANN - artificial neural networks.
BMA - Bayesian model averaging.
BMI - body mass index.
CART - classification and regression trees.
SVM - support vector machine.
Literature review:
Machine learning is a branch of computer science that broadly
aims to enable computers to “learn” without being directly
programmed (1). It has origins in the artificial intelligence movement
of the 1950s and emphasizes practical objectives and applications,
particularly prediction and optimization. Computers
“learn” in machine learning by improving their performance
at tasks through “experience” (2, p. xv). In practice, “experience”
usually means fitting to data; hence, there is no clear
boundary between machine learning and statistical approaches.
Indeed, whether a given methodology is considered “machine
learning” or “statistical” often reflects its history as much as
genuine differences, and many algorithms (e.g., least absolute
shrinkage and selection operator (LASSO), stepwise regression)
may or may not be considered machine learning, depending
on whom you ask. Still, despite methodological similarities,
machine learning is philosophically and practically distinguishable.
At the risk of (considerable) oversimplification,
machine learning generally emphasizes predictive accuracy
over hypothesis-driven inference, usually focusing on large,
high-dimensional (i.e., having many covariates) data sets (3,
4). Regardless of the precise distinction between approaches,
in practice, machine learning offers epidemiologists important
tools. In particular, a growing focus on “Big Data” emphasizes
problems and data sets at which machine learning algorithms
excel while more commonly used statistical approaches struggle.
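To make the idea of “learning” as fitting to data concrete, the following minimal Python sketch fits a one-covariate linear model to simulated training data and then judges it by predictive accuracy on held-out data rather than by inference on its coefficients. Everything in it is an illustrative assumption: the simulated data, the true intercept and slope, and the helper names (fit_line, mse, simulate) are hypothetical and do not come from the studies cited here.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def mse(predictions, ys):
    """Mean squared prediction error."""
    return sum((p - y) ** 2 for p, y in zip(predictions, ys)) / len(ys)


def simulate(n, rng):
    """Simulated data (an assumption): y = 1 + 2x + standard normal noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [1.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]
    return xs, ys


rng = random.Random(0)                 # fixed seed so the run is repeatable
train_x, train_y = simulate(200, rng)  # the "experience" the model learns from
test_x, test_y = simulate(200, rng)    # held-out data used only for evaluation

intercept, slope = fit_line(train_x, train_y)

# Predictive accuracy of the fitted model on data it has not seen.
model_mse = mse([intercept + slope * x for x in test_x], test_y)

# Naive baseline: always predict the mean outcome of the training data.
train_mean = sum(train_y) / len(train_y)
baseline_mse = mse([train_mean] * len(test_y), test_y)

print(f"estimated slope: {slope:.3f}")
print(f"held-out MSE, fitted model: {model_mse:.3f}")
print(f"held-out MSE, mean baseline: {baseline_mse:.3f}")
```

The same least-squares fit would be called “regression” in a statistics text; what gives the sketch its machine learning flavor is only the emphasis on held-out prediction error as the measure of success.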
This primer provides a basic introduction to machine learning,
with the aim of giving readers a foundation for critically
reading studies based on these methods and a jumping-off point
for those interested in using machine learning techniques in
epidemiologic research. The “Concepts and Terminology” section
of this paper presents concepts and terminology used in the
machine learning literature. The “Machine Learning Algorithms”
section provides a brief introduction to 5 common
machine learning algorithms: artificial neural networks,
decision trees, support vector machines, naive Bayes, and
k-means clustering. These are important and commonly used
algorithms that epidemiologists are likely to encounter in
practice, but they by no means cover all of this
large and highly diverse field. The following two sections,
“Ensemble Methods” and “Epidemiologic Applications,”
extend this examination to ensemble-based approaches and
epidemiologic applications in the published literature. “Brief
Recommendations” provides some recommendations for
incorporating machine learning into epidemiologic practice,
and the last section discusses opportunities and challenges.
Conclusion:
The field of machine learning is developing so rapidly that
any technical review can seem obsolete within months. Growing
interest in the field from the general public, as reflected in
extensive coverage of self-driving cars and AlphaGo (Alphabet,
Inc., Mountain View, California) in the mainstream media, is
accompanied by efforts from the machine learning community
to make advanced machine learning technologies more accessible.
Educational companies such as Udacity (Udacity, Inc.,
Mountain View, California) and Coursera (Coursera, Inc.,
Mountain View, California) have partnered with companies
like Google and academic institutions to create freely
available online courses on machine learning and deep
learning.
In addition to the growing educational resources, large technology
companies, including Google, IBM (International
Business Machines Corporation, Armonk, New York), and
Amazon Web Services (Amazon Web Services, Inc., Seattle,
Washington), are investing heavily in open-source machine learning
software that uses data-flow graphs to build models (e.g., TensorFlow
(Google, Inc.) (157)). The use of data-flow graphs in TensorFlow
enables developers and data scientists to focus on the high-level
overall logic of the algorithms rather than the technical coding
details, which greatly increases the reproducibility and
optimizability of the models. Models built with TensorFlow can be
integrated into mobile devices, making on-device/bedside diagnosis
practical when combined with mobile sensors. The ability of
TensorFlow to build and run models in the cloud also dramatically
increases processing power and storage capacity, which is
particularly helpful for analyzing large data sets with complex
algorithms. These machine learning developments continue to
lower the entry barriers for epidemiologists interested in using
advanced machine learning technologies, and they have the
potential to transform epidemiologic research.
Yet challenges continue to impede greater integration
of machine learning into epidemiologic research. Classically
trained epidemiologists often lack the skills to take full
advantage of machine learning technologies, partly because of
the continued popularity of closed-source programming languages
(e.g., SAS, Stata) in epidemiology. In addition, despite
the promise of “Big Data,” logistical roadblocks to sharing
deidentified patient data and amassing large health-care data sets
can make it challenging for epidemiologists to leverage these
opportunities, particularly compared with the private sector.
Even when data are available, epidemiologists should be mindful
of the class-imbalance issue (see Table 1) often inherent in
health-care and surveillance data, which can pose challenges
for many standard algorithms (158). Most importantly, a general
lack of working knowledge of machine learning algorithms,
despite their substantial methodological overlap with statistical
methods, reduces the practical uptake of these techniques
in the epidemiologic literature.
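The class-imbalance issue mentioned above can be illustrated with a minimal Python sketch. It assumes a hypothetical surveillance sample of 1,000 records with 10 true cases (1% prevalence) and a trivial classifier that labels every record a non-case; the numbers and the helper name confusion_counts are illustrative only and are not taken from reference 158.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels coded 1 = case, 0 = non-case."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


# Hypothetical sample: 10 cases among 1,000 records (1% prevalence).
y_true = [1] * 10 + [0] * 990

# A trivial "classifier" that always predicts the majority class (non-case).
y_pred = [0] * len(y_true)

tp, tn, fp, fn = confusion_counts(y_true, y_pred)

accuracy = (tp + tn) / len(y_true)   # share of all records labeled correctly
sensitivity = tp / (tp + fn)         # share of true cases that were detected
specificity = tn / (tn + fp)         # share of non-cases labeled correctly

print(f"accuracy:    {accuracy:.2f}")
print(f"sensitivity: {sensitivity:.2f}")
print(f"specificity: {specificity:.2f}")
```

In this sketch the classifier reaches 99% accuracy while detecting none of the 10 cases, which is why overall accuracy alone can be a misleading summary for rare outcomes and why class-specific measures such as sensitivity are worth reporting alongside it.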
Ultimately, advanced machine learning algorithms offer
epidemiologists new tools for tackling problems for which classical
methods are not well suited, but they are by no means a
cure-all for poor study design or poor data quality. Further
eroding the cultural and language barriers between machine learning
and epidemiology is an essential first step toward understanding
the value of integrating machine learning with existing
epidemiologic research methods, and toward achieving that integration.
References:
1. Samuel AL. Some studies in machine learning using the game
of checkers. IBM J Res Dev. 1959;3(3):210–229.
2. Mitchell TM. Machine Learning. 1st ed. New York, NY:
McGraw-Hill Education; 1997.
3. Rasmussen CE, Williams CKI. Gaussian Processes for
Machine Learning. 1st ed. Cambridge, MA: MIT Press; 2006.
4. Breiman L. Statistical modeling: the two cultures (with
comments and a rejoinder by the author). Stat Sci. 2001;16(3):
199–231.
5. Duda RO, Hart PE, Stork DG. Pattern Classification. 2nd ed.
Hoboken, NJ: John Wiley & Sons, Inc.; 2012:517.
6. Bartholomew DJ, Knott M, Moustaki I. Latent Variable
Models and Factor Analysis: A Unified Approach. 3rd ed.
Hoboken, NJ: John Wiley & Sons, Inc.; 2011.
7. Hennig C, Meila M, Murtagh F, et al. Handbook of
Cluster Analysis. 1st ed. Boca Raton, FL: CRC Press;
2015:34.
8. Bishop CM. Pattern Recognition and Machine Learning. 1st
ed. New York, NY: Springer Publishing Company; 2006:424.
9. Zhu X, Goldberg AB. Introduction to Semi-Supervised
Learning. 1st ed. San Rafael, CA: Morgan & Claypool
Publishers; 2009:11.
10. Nigam K, McCallum AK, Thrun S, et al. Text classification
from labeled and unlabeled documents using EM. Mach
Learn. 2000;39(2):103–134.
11. Ng AY, Jordan MI. On discriminative vs. generative
classifiers: a comparison of logistic regression and naive
Bayes. In: Dietterich TG, Becker S, Ghahramani Z, eds.
Advances in Neural Information Processing Systems 14.
Cambridge, MA: MIT Press; 2002:841–848.
12. Vapnik VN. Statistical Learning Theory. 1st ed. Hoboken,
NJ: Wiley-Interscience; 1998:12–21.
13. Pernkopf F, Bilmes J. Discriminative versus generative
parameter and structure learning of Bayesian network
classifiers. In: Proceedings of the 22nd International
Conference on Machine Learning—ICML ’05. New York,
NY: Association for Computing Machinery; 2005:657–664.
https://dl.acm.org/citation.cfm?id=1102434. Accessed July
4, 2019.
14. Reinforcement Learning: An Introduction—Richard S.
Sutton and Andrew G. Barto [book review]. IEEE Trans
Neural Netw. 1998;9(5):1054.
15. Sutton RS, Barto AG. Reinforcement Learning: An
Introduction. 2nd ed. Cambridge, MA: MIT Press; 2018.