GENERALIZED ADDITIVE MODELS

Getting Started with GAMs in R
Contents

Preface
Introduction
Beyond the General Linear Model I
General Linear Model
Generalized Linear Model
Generalized Additive Model
Beyond the General Linear Model II
Fitting the Standard Linear Model
Polynomial Regression
Scatterplot Smoothing
Generalized Additive Models
Summary
Application Using R
Initial Examination
Single Predictor
Linear Fit
GAM
Graphical Display
Model Comparison
Multiple Predictors
Linear Fit
GAM
Graphical Display
Model Comparison
Other Issues
Estimation
Shrinkage & Variable Selection
Effective degrees of freedom again
Miscellaneous
Relation to Some Other Nonlinear Approaches
Known Form
Response Transformation
The Black Box
Extensions
Conclusion
Appendix
R packages
Miscellaneous Code
ggopts
mod_gam1 plot
Penalized Estimation Example
R Session Info
Preface
Current draft last updated 2012-08-07.

The following provides a brief introduction to generalized additive
models and some thoughts on getting started within the R environ-
ment. It doesn’t assume much more than a basic exposure to regres-
sion, and maybe a general idea of R, though not necessarily any particular expertise. The presentation is of a very applied nature, such that the topics build upon the familiar and generalize to the less so, with the hope that one can bring the concepts they are comfortable
with to the new material. The audience in mind is a researcher with
typical social science training.
To familiarize yourself with R, I have an introduction to R here and a second part here, and they would provide all one would need for
the following, though there is much to be gained from simple web
browsing on R. And while it wasn’t the intention starting out, this
document could be seen as a vignette for the mgcv package.
With regard to the text, functions are in red text, packages in blue,
and links are sort of orange. With regard to the graphs, they are meant
to be zoomed in upon, and you can zoom as much or as little as you like without loss of clarity. Having them smaller and in the margins allows the
reader to focus on them as much as one would like while having them
right where they need to be in relation to the text (for the most part).
This document was created entirely with Rstudio.
Introduction
Beyond the General Linear Model I
General Linear Model

One of the issues with the general linear model is that, in its basic form, it can be very limiting in its assumptions about the data generating process for the variable we wish to study. Also, while this is the usual manner in which it is presented in introductory texts, it very typically will not capture what is going on in a great many data situations.
Generalized Linear Model

Generalized linear models extend this framework by allowing other distributions for the response and a link function g(.) relating the mean µ to the linear predictor. For the Poisson, for example, the mean and variance are equal. The (canonical) link function g(.) is the natural log, and so relates the log of µ to the linear predictor. As such we could also write it in terms of exponentiating the right-hand side:

µ = e^(b0 + b1·X1 + b2·X2 + ... + bp·Xp)
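As a concrete illustration (my own sketch, not from the original text), a model with this log link can be fit with glm; the data here are simulated and the variable names are hypothetical:

# A minimal sketch: a Poisson GLM with the canonical log link on simulated data.
set.seed(1)
df = data.frame(x1 = rnorm(100), x2 = rnorm(100))
df$y = rpois(100, lambda = exp(.5 + .3*df$x1 - .2*df$x2))

mod_pois = glm(y ~ x1 + x2, data = df, family = poisson)
# exponentiating the linear predictor gives the fitted mean mu
head(exp(predict(mod_pois)))   # identical to head(fitted(mod_pois))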
While there is a great deal to further explore with regard to generalized linear models, the point here is simply to note them as a generalization of the typical linear model with which all would be fa-
miliar after an introductory course in statistics. As we eventually move
to generalized additive models, we can see them as a subsequent step
in the generalization.
Beyond the General Linear Model II
Polynomial Regression
One common approach we could undertake is to add a transformation of the predictor X, and in this case we might consider a quadratic term such that our model looks something like that to the right:

y = N(µ, σ²)
µ = b0 + b1·X + b2·X²

We haven't really moved from the general linear model in this case, but we are now able to capture a nonlinear relation between the predictor and the response.

[Introduction Figure 2: quadratic fit; y vs. x]
Scatterplot Smoothing
There are still other possible routes we could take. Many are probably familiar with lowess (or loess) regression, if only in the sense that statistical packages often, either by default or with relative ease, add a nonparametric fit line to a scatterplot². By default this is typically a lowess fit.

² See the scatterplots in the car package, for example.
Consider a particular value x0 = 3.0. For a local fit we might take a neighborhood of points around it, say the 100 X values below and 100 values above. Now, for just that range we fit a linear regression model and note the fitted value where x0 = 3.0 (Fig. 4). If we now do this for every X value, we'll have fitted values based on the local fits for each of the neighborhoods considered. Other options we could throw into this process, and usually do, would be to fit a polynomial regression for each neighborhood, and to weight observations by their distance to the X value in question, with less weight given to those further away.

[Introduction Figure 3]
[Introduction Figure 4]

Rather than using a lowess approach, we could have fit a model assuming a particular nonlinear functional form³. While that may serve us well in some cases, the issue is that nonlinear relationships are generally not specified so easily⁴. One can see a variety of approaches in Figure 6, which is a version of that found in Venables and Ripley (1997 edition).

³ y was in fact constructed as x + x² + noise.
The leftmost plots are from a polynomial regression and a lowess fit.

⁴ Bolker (2008) notes an example of fitting a polynomial to 1970s U.S. Census data that would have predicted a population crash to zero by 2015.

Generalized Additive Models
The measurements are time in milliseconds and acceleration in g. Here we have data that are probably not going to be captured well by the approaches considered so far. In the figure one can compare various approaches, and the first is a straight linear fit, which does not tell us much. We could try higher orders, which would help, but in the end a more flexible approach is warranted.

[Introduction Figure 6: comparison of several fits to the data; Acceleration vs. Time]
Summary
Polynomial Regression
y = N(µ, σ²)
µ = b0 + b1·X + b2·X²

GLM formulation
y = N(µ, σ²)
g(µ) = b0 + b1·X + b2·X²     (identity link function)

GAM formulation
y = N(µ, σ²)
g(µ) = f(X)     (identity link function)
Now that some of the mystery will hopefully have been removed, let’s
take a look at GAMs in action.
Application Using R
Initial Examination
The variables in the data include:

Overall Science Score (average score for 15 year olds)
Interest in science
Support for scientific inquiry
Income Index
Health Index
Education Index
Human Development Index (composed of the Income index, Health Index, and Education Index)

The education component is measured by mean of years of schooling for adults aged 25 years and expected years of schooling for children of school entering age, the health index by life expectancy at birth, and the wealth component is based on the gross national income per capita. The HDI sets a minimum and a maximum for each dimension, and values indicate where each country stands in relation to these endpoints, expressed as a value between 0 and 1. More information on the HDI measures can be found here.
But even before we get too far, it would be good to know our options in the world of GAMs. Near the end of this document is a list of some packages to be aware of, and there is quite a bit of GAM functionality available within R, even for just plotting⁵. For our purposes we will use mgcv.

⁵ ggplot2 has basic gam functionality for scatterplot smoothing. Examples are further in the text.
The first thing to do is get the data in and do some initial inspec-
tions.
d = read.csv("http://www.nd.edu/~mclark19/learn/data/pisasci2006.csv")
library(psych)
describe(d)[-1, 1:9] #univariate
library(car)   # scatterplotMatrix comes from the car package

scatterplotMatrix(d[,-c(1,3:5)], pch=19, cex=.5, reg.line=F, lwd.smooth=1.25,
                  spread=F, ellipse=T, col=c('gray60','#2957FF','#FF8000'),
                  col.axis='gray50')

[Application Fig. 1: scatterplot matrix of Overall, Interest, Support, Income, Health, Edu, and HDI]

The scatterplot matrix has quite a bit of information to spend some time with: univariate and bivariate density, as well as loess curves. For our purposes we will ignore the issue regarding the haves vs. have-nots on the science scale and save that for another day. Note the surprising negative correlation between interest and overall score for science.

library(ggplot2); library(reshape)   # melt comes from reshape (see also the appendix code)

dmelt = melt(d, id=c('Country','Overall'),
             measure=c('Interest','Support','Income','Health','Edu','HDI'))

ggplot(aes(x=value, y=Overall), data=dmelt) +
  geom_point(color='#FF8000',alpha=.75) +
  #geom_smooth(se=F) +
  facet_wrap(~variable, scales='free_x') +
  ggopts

⁶ In the following plots I have a ggplot object called ggopts to clean up the default plotting scheme. See the appendix.

[Application Fig. 2: Overall science score plotted against each of the other indices, one panel per variable, with countries labeled]
We can see again that linear fits aren’t going to do so well for some,
though it might be a close approximation for interest in science and
support for science. Now let’s run the smooth. By default ggplot2 will
use a loess smoother for small data sets (i.e. < 1000 observations),
but can use the mgcv gam function as a smoother, though you’ll need to
load either the gam or mgcv packages first.
library(mgcv)

ggplot(aes(x=value, y=Overall), data=dmelt) +
  geom_point(color='#FF8000',alpha=.75) +
  geom_smooth(se=F) +
  facet_wrap(~variable, scales='free_x') +
  ggopts

[Application Fig. 3: Overall score against each index with smooth fits added, one panel per variable]

Often in these situations people will perform some standard transformation, such as a log, but it will not help nearly as often as it is used. For example, in this case one can log the overall score, Income, or both, and a linear relation will still not be seen.
Single Predictor
Linear Fit
We will start with the simple situation of a single predictor. Let’s begin
by using a typical linear regression to predict science scores by the
Income index. We could use the standard R lm function, but I’ll leave
that as an exercise for comparison. We can still do straightforward
linear models with the gam function, and again it is important to note
that the standard linear model can be seen as a special case of a GAM.
library(mgcv)
mod_lm <- gam(Overall ~ Income, data = d)
summary(mod_lm)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Overall ~ Income
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 204.3 35.4 5.78 4.3e-07 ***
## Income 355.9 46.8 7.61 5.4e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
What are we getting here? The same thing you get from a regular linear model, because you just ran one. However there are a couple things to look at. The coefficient is statistically significant, but serves as a reminder that it is usually a good idea to scale predictor variables so that the effect is more meaningful. Here, moving one unit on Income is akin to going from broke to being the richest country. But in a more meaningful sense, if we moved from say, .7 to .8, we'd expect an increase of about 35 points on the science score. We also see the deviance explained, which serves as a generalization of R-squared⁷, and in this case, it actually is equivalent to R-squared. Likewise there is the familiar adjusted version of it to account for small sample size and model complexity. The scale estimate is the scaled deviance, which here is equivalent to the residual sums of squares. The GCV score we will save for when we run a GAM.

⁷ For those more familiar with generalized linear models, this is calculated as (Dev_Null − Dev_Residual)/Dev_Null. One can verify this by running the same model via glm, and using the values from the summary of the model object.
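As a quick check of the footnote above, here is a minimal sketch (my own, not from the original text) of verifying the deviance explained via glm:

# Sketch: refit the same linear model with glm and compute deviance explained by hand.
mod_glm = glm(Overall ~ Income, data = d)
(mod_glm$null.deviance - mod_glm$deviance) / mod_glm$null.deviance
# this should match the 'Deviance explained' reported for mod_lm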
GAM
A GAM approaches the smooth function f(X) through a set of basis functions. For the cubic polynomial case the basis functions are 1, x, x², and x³, and f(x) is a weighted sum of them.

[Application Fig. 4: the basis functions b1 = 1, b2 = x, b3 = x², b4 = x³, each multiplied by its coefficient, and the resulting fit f(x) plotted against Income]

In terms of linear modeling, this is just our 'model matrix', which we can save out with many model functions in R via the argument x=T or by using the model.matrix function. With the graph at the right we can see the effects in action. It is based on the results extracted from running the model⁹ and obtaining the coefficients (e.g. the first plot represents the intercept of 470.44, the second plot, our b2 coefficient of 289.5 multiplied by Income, and so forth). The bottom plot shows the final fit f(x), i.e. the linear combination of the basis functions.

⁹ See ?poly for how to fit a polynomial in R.
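A minimal sketch of that idea (my own illustration, not code from the original): pull the model matrix from a cubic polynomial fit of Overall on Income and scale each basis column by its coefficient; summing the columns recovers the fitted f(x).

# Sketch: the cubic polynomial of Income as a basis function approach.
# Assumes d (the PISA data) from earlier; raw=TRUE gives the plain 1, x, x^2, x^3 basis.
mod_poly = lm(Overall ~ poly(Income, 3, raw = TRUE), data = d)
B = model.matrix(mod_poly)                        # columns: intercept, Income, Income^2, Income^3
scaledB = sweep(B, 2, coef(mod_poly), `*`)        # each basis column times its coefficient
f_x = rowSums(scaledB)                            # the fit: a linear combination of the bases
all.equal(unname(f_x), unname(fitted(mod_poly)))  # TRUE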
At this point we have done nothing we couldn’t do in our regular
regression approach, but the take home message is that as we move
to GAMs we are going about things in much the same fashion; we are
simply changing the nature of the basis, and have a great deal more
flexibility in choosing the form.
In Figure 5 I show the fit using a 'by-hand' cubic spline basis (see the appendix).
A cubic spline fits the data over sections of the predictor's range, where the boundary points of the sections are referred to as knots. Separate cubic polynomials are fit at each section, and then joined at the knots to create a continuous curve. The graph represents a cubic spline with 8 knots¹⁰ between the first and third quartiles. The inset graph uses the GAM functionality within ggplot2's geom_smooth for comparison.

¹⁰ Ten including the endpoints.

[Application Fig. 5: 'by-hand' cubic spline fit of Overall against Income, with a ggplot2 GAM smooth shown in the inset]

Let's now fit an actual generalized additive model using the same cubic spline as our smoothing function. We again use the gam function as before for basic model fitting, but now we are using the function s within the formula to denote the smooth terms. Within that function one specifies the type of smooth, and I chose 'cr', denoting cubic regression splines, to keep consistent with our previous example.
mod_gam1 <- gam(Overall ~ s(Income, bs = "cr"), data = d)
summary(mod_gam1)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Overall ~ s(Income, bs = "cr")
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 470.44 4.08 115 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Income) 6.9 7.74 16.7 1.2e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.7 Deviance explained = 73.9%
## GCV score = 1053.7 Scale est. = 899.67 n = 54
The first thing to note is that, aside from the smooth part, our model code is similar to what we're used to with core R functions such as lm and glm. In the summary, we first see the distribution assumed as well as the link function used, in this case normal and identity respectively, which, to reiterate, had we had no smoothing would result in a standard regression model. After that we see that the output is separated into parametric and smooth, or nonparametric, parts. In this case, the only parametric component is the intercept, but it's good to remember that you are not bound to smooth every effect of interest, and indeed, as we will discuss in more detail later, part of the process may involve refitting the model with terms that were found to be essentially linear as standard linear components.
The GCV, or generalized cross validation score, can be taken as an estimate of the mean square prediction error based on a leave-one-out cross validation estimation process. We estimate the model for all observations except i, then note the squared residual predicting observation i from the model. Then we do this for all observations. However, the GCV score is an efficient measure of this concept that doesn't actually require fitting all those models and overcomes other issues. It is this score that is minimized when determining the specific nature of the smooth. On its own it doesn't tell us much, but we can use it similar to AIC as a comparative measure to choose among different models, with lower being better.

In this initial model the GCV can be found as:
GCV = (n × scale est.) / (n − edf − [no. of parametric terms])
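A minimal sketch (my own check, not from the original) of recovering the GCV score for mod_gam1 from the quantities in its summary:

# Sketch: recompute GCV for mod_gam1 by hand.
n   = length(mod_gam1$y)
dof = sum(mod_gam1$edf)                  # total effective degrees of freedom, intercept included
gcv = n * mod_gam1$sig2 / (n - dof)      # n x scale est. / (n - edf - no. of parametric terms)
c(by_hand = gcv, reported = mod_gam1$gcv.ubre)   # the two should (very nearly) match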
Graphical Display
One can get a sense of the form of the fit by simply plotting the model object as follows:

plot(mod_gam1)

[Figure: default plot of the fitted smooth for Income, on the scale of the linear predictor]

The shape is essentially that of Figure 5, where the inset was constructed directly from ggplot. You can find code for a ggplot version of this plot in the appendix.
Model Comparison
Let us now compare our regular regression fit to the GAM model fit.
The following shows how one can extract various measures of perfor-
mance and the subsequent table shows them gathered together.
AIC(mod_lm)

## [1] 550.2

summary(mod_lm)$sp.criterion

## [1] 1504

summary(mod_lm)$r.sq  # adjusted R-squared (this call is a reconstruction)

## [1] 0.5175
Do the same to extract those same elements from the GAM. The following display¹⁴ makes for easy comparison.

¹⁴ I just gathered the values into a data.frame object and used xtable from the eponymous package to produce the table.

  aic_lm   aic_gam1   gcv_lm   gcv_gam1   rsq_lm   rsq_gam1
  550.24     529.81  1504.50    1053.73     0.52       0.70
Comparing these various measures, it’s safe to conclude that the GAM
fits better. We can also perform the familiar statistical test via the
anova function we apply to other R model objects. As with the previ-
ous p-value issue, we can't be too particular, and technically one could have a model with more terms but lower edf, where it just wouldn't even make sense¹⁵. As it would be best to be conservative, we'll proceed cautiously.

¹⁵ See ?anova.gam for more information.
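The anova call itself is not shown in this extraction; a minimal sketch of such a comparison (the choice of test is mine, not the author's):

# Sketch: compare the linear fit and the GAM via anova.gam.
anova(mod_lm, mod_gam1, test = "F")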
Multiple Predictors
Let’s now see what we can do with a more realistic case where we have
added model complexity.
Linear Fit
We’ll start with the typical linear model approach again, this time
adding the Health and Education indices.
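The fitting call is not preserved in this extraction; based on the formula in the output below, it was presumably along these lines (the object name mod_lm2 is an assumption):

# Sketch of the presumed call: all predictors entered as linear terms, fit with gam.
mod_lm2 <- gam(Overall ~ Income + Edu + Health, data = d)
summary(mod_lm2)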
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Overall ~ Income + Edu + Health
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 121.2 79.0 1.53 0.131
## Income 182.3 85.3 2.14 0.038 *
## Edu 234.1 54.8 4.27 9.1e-05 ***
## Health 27.0 134.9 0.20 0.842
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## R-sq.(adj) = 0.616 Deviance explained = 63.9%
## GCV score = 1212.3 Scale est. = 1119 n = 52
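Likewise the GAM fitting call is missing here; given the formula shown in the output and the later references to mod_gam2, it was presumably:

# Sketch of the presumed call: a smooth term for each predictor.
mod_gam2 <- gam(Overall ~ s(Income) + s(Edu) + s(Health), data = d)
summary(mod_gam2)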
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Overall ~ s(Income) + s(Edu) + s(Health)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
There are again a couple things to take note of. First, statistically
speaking, we come to the same conclusion as the linear model regard-
ing the individual effects. One should take particular note of the effect
of the Health index. The effective degrees of freedom with value 1 sug-
gests that it has essentially been reduced to a simple linear effect. The
following will update the model to explicitly model the effect as such,
but as one can see, the results are identical.
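The updating call is not shown; a sketch consistent with the output that follows (the object name is an assumption):

# Sketch: refit with Health as a strictly linear (parametric) term.
mod_gam2B <- update(mod_gam2, . ~ . - s(Health) + Health)
summary(mod_gam2B)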
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Overall ~ s(Income) + s(Edu) + Health
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 640 102 6.26 3.1e-07 ***
## Health -190 115 -1.65 0.11
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Income) 7.59 8.41 8.83 1.3e-07 ***
## s(Edu) 6.20 7.18 3.31 0.0073 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.863 Deviance explained = 90.3%
## GCV score = 573.83 Scale est. = 399.5 n = 52
We can also note that this model accounts for much of the variance
in Overall science scores, with an adjusted R-squared of .86. In short,
it looks like the living standards and educational resources of a country
are associated with overall science scores, even if we don’t really need
the Health index in the model.
Graphical Display
Here we can see the effects of interest, and again one might note the penalized-to-linear effect of Health. We again see the tapering off of Income's effect at its highest levels, while Education shows a slight positive effect overall. Health, as noted, has been reduced to a simple linear effect.

[Application Fig. 7: the estimated smooths s(Income, 7.59), s(Edu, 6.2), and s(Health, 1) from mod_gam2]
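The call producing the panel of smooths is not preserved; a minimal sketch of one way to get it (the pages argument is my choice):

# Sketch: plot all smooth terms of mod_gam2 on a single page.
plot(mod_gam2, pages = 1)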
Previously, we had only one predictor and so we could let ggplot do the work to produce a plot on the response scale, although we could have added the intercept back in. Here we'll need to do a little more. The following will produce a plot for Income on the scale of the response, with the other predictors held at their mean. Seen at right.
# Note that mod_gam2$model is the data that was used in the modeling process,
# so it has only the complete cases.
# (The lines building testdata are partly reconstructed; the Income range follows the plotted axis.)
testdata = data.frame(Income = seq(.4, 1, length = 100),
                      Edu    = mean(mod_gam2$model$Edu),
                      Health = mean(mod_gam2$model$Health))
fits = predict(mod_gam2, newdata=testdata, type='response', se=T)
predicts = data.frame(testdata, fits)

ggplot(aes(x=Income, y=fit), data=predicts) +
  geom_smooth(aes(ymin = fit - se.fit, ymax = fit + se.fit),
              fill='gray80', size=1, stat='identity') +
  ggopts

[Application Fig. 8: predicted Overall score across Income, with the other predictors at their means]
This gives us a sense for one predictor, but let's take a gander at Income and Education at the same time. Previously, when we used the plot function, it was actually plot.gam, the plot method for a GAM class object, that was being called. There is another plotting function, vis.gam, that will give us a bit more to play with. The following will produce a contour plot (directly to the right) with Income on the x axis, Education on the y, values on the response scale given by the contours, and lighter color indicating higher values.
vis.gam(mod_gam2, type = "response", plot.type = "contour")
[Application Fig. 9: contour plot of the fitted response over Income and Edu]

First and foremost, Figure 9 reflects the individual plots (e.g. one can see the decrease in scores at the highest income levels), and we can see how the two predictors jointly relate to the response.
Let's take a look at another approach, continuing the focus on visual display. It may not be obvious at all, but one can utilize smooths of more than one variable, in effect, a smooth of the smooths of the variables that go into it. Let's create a new model to play around with this feature. After fitting the model, an example of the plot code is given (figure 11), but I also provide a different perspective (figure 12) as well as a contour plot (Application Figure 10) for comparison with the graph based on separate smooths.

[Application Fig. 10: contour plot of the tensor product smooth over Income and Edu]
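The fitting call for the tensor product smooth model is not preserved; based on the formula in the output, it was presumably something like the following (the object name mod_gam3 is an assumption):

# Sketch of the presumed call: a joint smooth of Income and Edu via a tensor product.
mod_gam3 <- gam(Overall ~ te(Income, Edu), data = d)
summary(mod_gam3)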
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Overall ~ te(Income, Edu)
## <environment: 0x000000000a575438>
##
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   471.15       3.35     141   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
##                 edf Ref.df    F p-value
## te(Income,Edu) 10.1   12.2 16.9  <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.8 Deviance explained = 84%
## GCV score = 741.42 Scale est. = 583.17 n = 52

[Application Fig. 11: perspective plot of the fitted tensor product smooth over Income and Edu]
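The code producing the perspective view is not shown in this extraction; a sketch of one way to obtain a similar plot (the viewing angles are my own choices):

# Sketch: perspective plot of the tensor product smooth on the response scale.
vis.gam(mod_gam3, type = "response", plot.type = "persp", theta = 30, phi = 30)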
Model Comparison
Other Issues
Estimation
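The objective function itself did not survive this extraction; the standard penalized likelihood for GAMs (as in Wood, 2006), which the sentence below describes, has the general form

l(β) − (1/2) Σ_j λ_j ∫ f_j''(x_j)² dx_j

which is maximized with respect to β.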
where l ( β) is the usual GLM likelihood function, and λ j are the smooth-
ing parameters. The part of the function including λ penalizes curva-
ture in the function, where λ establishes a trade-off between the good-
ness of fit and the smoothness, and such an approach will allow for less
overfitting¹⁹. Technically we could specify the smoothing parameters explicitly, and the appendix has some 'by-hand' code taken directly from Wood (2006) with only slight modifications, where the smoothing parameters are chosen and compared²⁰.

Smoothing parameters however are in fact estimated, and this brings us back to the cross-validation procedure mentioned before.

¹⁹ λ → ∞ would result in a straight line estimate for f_j, while λ = 0 would allow any f that interpolates the data (Wood, 2006, p. 128).
²⁰ And we would have to choose them in order for the penalized maximum likelihood to work.
Smoothing parameters are selected which minimize the GCV score
by default, though one has other options. Note that there are other
approaches to estimation such as backfitting, generalized smoothing
splines and Bayesian.
Effective degrees of freedom again

If we define F as the matrix mapping the unpenalized coefficient estimates to the penalized ones,

F = (XᵀX + S)⁻¹ XᵀX

(with S the penalty matrix) and note

β̃ = (XᵀX)⁻¹ Xᵀy
β̂ = F β̃

then the diagonal elements of F are where the effective degrees of freedom for each covariate come from. I doubt this helps much conceptually, but there it is anyway.
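A small toy illustration of that calculation (my own example, not from the original):

# Toy sketch: edf as the trace of F = (X'X + S)^{-1} X'X for a simple ridge-type penalty.
set.seed(123)
x = sort(runif(50))
X = cbind(1, x, x^2, x^3)            # a small polynomial basis
S = 10 * diag(c(0, 1, 1, 1))         # penalize everything but the intercept (lambda = 10)
Fmat = solve(crossprod(X) + S) %*% crossprod(X)
diag(Fmat)                            # per-coefficient effective degrees of freedom
sum(diag(Fmat))                       # total edf, less than the unpenalized value of 4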
A number of smooths are available with the mgcv package, and one
can learn more by entering ?smooth.terms at the command line. In
our models we have used cubic regression splines and thin plate re-
gression splines (TPRS), the latter being the default for a GAM in this
package. As a brief summary, TPRS work well in general in terms of
performance and otherwise has some particular advantages, and has
a shrinkage alternative available. One should still feel free to play
around, particularly when dealing with multiple smooths, where the
tensor product smooths would be better for covariates of different
scales.
We have some built-in abilities to examine whether there are any par-
ticular issues, and we can try it with our second GAM model.
gam.check(mod_gam2, k.rep = 1000)

## The RMS GCV score gradient at convergence was 2.499e-05 .
## The Hessian was positive definite.
## The estimated model rank was 28 (maximum possible: 28)

[gam.check diagnostic plots: Q-Q plot of deviance residuals against theoretical quantiles, residuals vs. linear predictor, histogram of residuals, and response vs. fitted values]
The plots are of the sort we’re used to from a typical regression
setting, though it’s perhaps a bit difficult to make any grand conclu-
sion based on such a small data set. The printed output on the other hand contains unfamiliar information, but is largely concerned with the fitting process and whether the basis dimensions chosen were adequate.

One can inspect the quantile-quantile plot directly with the qq.gam function.
Prediction
A previous example used the predict function on the data used to fit
the model to obtain fitted values on the response scale. Typically we’d
use this on new data. I do not cover it, because the functionality is the
same as the predict.glm function in base R, and one can just refer
to that. It is worth noting that there is an option, type=’lpmatrix’,
which will return the actual model matrix by which the coefficients
must be pre-multiplied to get the values of the linear predictor at the
supplied covariate values. This can be particularly useful towards
opening the black box as one learns the technique.
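A brief sketch of that option (my own illustration, reusing the testdata object constructed earlier):

# Sketch: the lpmatrix, premultiplied by the coefficients, gives the linear predictor
# at the supplied covariate values (here also the response, given the identity link).
Xp = predict(mod_gam2, newdata = testdata, type = "lpmatrix")
lp = Xp %*% coef(mod_gam2)
head(cbind(lp, predict(mod_gam2, newdata = testdata)))   # the two columns should match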
## df AIC
## mod_1d 15.59 476.1
## mod_2d 13.25 489.7
Again though, we could have just used the summary output from the
second model.
Instances where such an approach does not appear to be appropri-
ate within the context of the mgcv package are when terms are able
to be penalized to zero; in such a case p-values will be much too low.
In addition, when comparing GAMs, sometimes the nesting of models
would not be so clear when there are multiple smooths involved, and
additional steps may need to be taken to make sure they are nested.
We must make sure that each smooth term in the null model has no
more effective degrees of freedom than the same term in the alterna-
tive, otherwise it’s possible that the model with more terms can have
lower effective degrees of freedom but better fit, rendering the test
nonsensical. Wood (2006) suggests that if such model comparison is
the ultimate goal, an unpenalized approach²⁵ would be best to have much confidence in the p-values²⁶.

²⁵ This can be achieved with the argument s(..., fx=T), although now one has to worry more about the k value used, as very high k will lead to low power.
²⁶ One might also examine the gss package for an anova-based approach to generalized smoothing splines.

Miscellaneous
Known Form
A general form of nonlinear models: y = f(X, β) + e

It should be noted that one can fit generalized additive models under a general heading of nonlinear models whose focus may be on transformations of the outcome (as with generalized linear models), the predictor variables (polynomial regression and GAMs), or both (GAMs), in addition to those whose effects are nonlinear in the parameters²⁷. The difference between the current presentation and those latter nonlinear models is that in the latter the functional form is specified beforehand, whereas here it is left unspecified, to be determined by the data.

²⁷ For example, various theoretically motivated models in economics and ecology.
Response Transformation
It is common practice, perhaps too common, to manually transform the
response and go about things with a typical linear model. While there
might be specific reasons for doing so, the primary reason applied
researchers seem to do so is to make the distribution ’more normal’
so that regular regression methods can be applied. As an example, a
typical transformation is to take the log, particularly to tame ’outliers’
or deal with skewness.
While it was a convenience ’back in the day’ because we didn’t have
software or computing power to deal with a lot of data situations aptly,
this is definitely not the case now. In many situations it would be better
to, for example, conduct a generalized linear model with a log link or
perhaps assume a different distribution for the response directly (e.g.
skew-normal), and many tools allow researchers to do this with ease²⁸.

There are still cases where one might focus on response transformation, just not so one can overcome some particular nuisance in trying to fit a linear regression. An example might be in some forms of functional data analysis, where we are concerned with some function of the response that has been measured on many occasions over time.

²⁸ A lot of 'outliers' tend to magically go away with an appropriate choice of distribution for the data generating process.
The Black Box

In the case of neural net models, one can imagine a model where
the input units (predictor variables) are weighted and summed to
create hidden layer units, which are then essentially put through the
same process to create outputs (see a simple example to the right).
One can see projection pursuit models as an example where a smooth
function is taken of the components which make up the hidden layer.
Neural networks are highly flexible in that there can be any number
of inputs, hidden layers, and outputs. However, such models are very
explicit in the black box approach.
Projection pursuit and neural net models are usually found among
data mining/machine learning techniques, any number of which might
be utilized in a number of disciplines. Other more algorithmic/black
box approaches include k-nearest-neighbors, random forests, support
vector machines, and various tweaks or variations thereof including boosting, bagging, bragging and other alliterative shenanigans³⁰.

³⁰ See Hastie et al. (2009) for an overview of such approaches.

[Miscellaneous Fig. 1: A Neural Net Model, with input layer, hidden layer, and output]

As Venables and Ripley note, generalized additive models might be thought of as falling somewhere in between the fully parametric and interpretable models of linear regression and black box techniques. Indeed, there are algorithmic approaches which utilize GAMs as part of their approach.
Extensions
Conclusion
Generalized additive models are a conceptually straightforward tool
that allows one to incorporate nonlinear predictor effects into their
otherwise linear models. In addition, they allow one to keep within the
linear and generalized linear frameworks with which one is already fa-
miliar, while providing new avenues of model exploration and possibly improved results.
Appendix
R packages
Also GAMMs.
VGAM: Vector generalized linear and additive models, and associated
models.
Miscellaneous Code
Here is code that might be okay to play around with but would
have unnecessarily cluttered sections of the paper.
ggopts
My ggplot2 default options: ggopts. Note that by actually adding the
opts line after ggopts, you can then override or change certain parts of
it on the fly while maintaining the gist. I have this in my Rprofile.site
file so that it is always available.
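The actual ggopts definition did not survive extraction; a minimal sketch of an object in that spirit, using the current theme() interface rather than the opts() calls mentioned above (all specific choices are mine):

# Sketch: a reusable ggplot2 options object that can be added to plots as '+ ggopts'.
library(ggplot2)
ggopts = theme_minimal() +
  theme(panel.grid.minor = element_blank(),
        axis.text  = element_text(color = "gray50"),
        axis.title = element_text(color = "gray25"))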
mod_gam1 plot
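The original code for this plot is not preserved; a sketch along the lines of the Fig. 8 code in the main text (the Income range is an assumption) would be:

# Sketch: plot the mod_gam1 fit on the response scale with ggplot.
testIncome = data.frame(Income = seq(.4, 1, length = 100))
p = predict(mod_gam1, newdata = testIncome, se = TRUE)
pdat = data.frame(testIncome, fit = p$fit, se.fit = p$se.fit)
ggplot(aes(x = Income, y = fit), data = pdat) +
  geom_smooth(aes(ymin = fit - se.fit, ymax = fit + se.fit),
              fill = 'gray80', size = 1, stat = 'identity') +
  ggopts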
Penalized Estimation Example

size = c(1.42,1.58,1.78,1.99,1.99,1.99,2.13,2.13,2.13,
2.32,2.32,2.32,2.32,2.32,2.43,2.43,2.78,2.98,2.98)
wear = c(4.0,4.2,2.5,2.6,2.8,2.4,3.2,2.4,2.6,4.8,2.9,
3.8,3.0,2.7,3.1,3.3,3.0,2.8,1.7)
x= size-min(size); x = x/max(x)
d = data.frame(wear, x)

knots = 1:4/5

# Base R plot

# ggplot
library(ggplot2)
# xp, Xp, and mod.1 come from earlier parts of this example (Wood, 2006) not preserved here
ggplot(aes(x=x, y=wear), data=data.frame(x,wear)) +
  geom_point(color="#FF8000") +
  geom_line(aes(x=xp, y=Xp%*%coef(mod.1)), data=data.frame(xp,Xp), color="#2957FF") +
  ggopts

[Example 1: regression spline fit to the wear vs. x data]
d2 = data.frame(x=xp)
# d2 is then filled with penalized fits over a range of lambda values (that code is not preserved here)

[Figure: penalized regression spline fits of wear vs. x for lambda = 1e-04, 1e-05, and 1e-06]

### ggplot
library(ggplot2); library(reshape)
d3 = melt(d2, id='x')
R Session Info
References