Nonparametric Regression
1 Introduction
The linear regression model is
$$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon.$$
In nonparametric regression we drop the linearity assumption and simply write
$$Y = m(X) + \epsilon$$
where $E(\epsilon) = 0$ and $m(x) = E(Y \mid X = x)$. This follows since $\epsilon = Y - m(X)$ and $E(\epsilon) = E(E(\epsilon \mid X)) = E(m(X) - m(X)) = 0$.
Example 1 Figure 1 shows data on bone mineral density. The plots show the relative change in bone density over two consecutive visits, for men and women. The smooth estimates of the regression functions suggest that a growth spurt occurs two years earlier for females. In this example, Y is the change in bone mineral density and X is age.
$$Y = m(x_1, x_2, x_3, x_4) + \epsilon. \qquad (2)$$
[Figure 1: Bone mineral density data. Relative change in BMD plotted against age, for females (top) and males (bottom).]
The estimator $\widehat{m}$ typically involves smoothing the data in some way. The main challenge is to determine how much smoothing to do. When the data are oversmoothed, the bias term is large and the variance is small. When the data are undersmoothed the opposite is true.
[Figure 2: The outcome plotted against each of the four covariates Age, Bmi, Map and Tc.]
This is called the bias–variance tradeoff. Minimizing risk corresponds to balancing bias and
variance.
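One way to make this precise (a standard decomposition, not spelled out at this point in the notes): the pointwise mean squared error of an estimator $\widehat{m}$ satisfies
$$E\left\{(\widehat{m}(x) - m(x))^2\right\} = \underbrace{\left(E\,\widehat{m}(x) - m(x)\right)^2}_{\text{bias}^2} + \underbrace{\mathrm{Var}(\widehat{m}(x))}_{\text{variance}},$$
so more smoothing lowers the variance at the cost of more bias, and vice versa. These terms are studied more carefully in Section 7.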
3 The Regressogram
We will begin by assuming there is only one covariate X. For simplicity, assume that $0 \leq X \leq 1$. The simplest nonparametric estimator of $m$ is the regressogram. Let $k$ be an integer. Divide $[0, 1]$ into $k$ bins $B_1, \ldots, B_k$ of equal width $1/k$. Let $n_j$ be the number of observations $X_i$ falling in bin $B_j$ and define
$$\bar{Y}_j = \frac{1}{n_j} \sum_{X_i \in B_j} Y_i.$$
Finally, we define $\widehat{m}(x) = \bar{Y}_j$ for all $x \in B_j$. We can write this as
$$\widehat{m}(x) = \sum_{j=1}^{k} \bar{Y}_j \, I(x \in B_j).$$
regressogram = function(x,y,left,right,k,plotit,xlab="",ylab="",sub=""){
     ### assumes the data are on the interval [left,right]
     n = length(x)
     B = seq(left,right,length=k+1)                      ### bin boundaries
     WhichBin = findInterval(x,B,rightmost.closed=TRUE)  ### put x = right in the last bin
     N = tabulate(WhichBin,nbins=k)                      ### bin counts (zeros for empty bins)
     m.hat = rep(0,k)
     for(j in 1:k){
          if(N[j]>0) m.hat[j] = mean(y[WhichBin == j])
     }
     if(plotit==TRUE){
          a = min(c(y,m.hat))
          b = max(c(y,m.hat))
          plot(B,c(m.hat,m.hat[k]),lwd=3,type="s",
               xlab=xlab,ylab=ylab,ylim=c(a,b),col="blue",sub=sub)
          points(x,y)
     }
     return(list(bins=B,m.hat=m.hat))
}
pdf("regressogram.pdf")
par(mfrow=c(2,2))
### simulated example
n = 100
x = runif(n)
y = 3*sin(8*x) + rnorm(n,0,.3)
plot(x,y,pch=20)
out = regressogram(x,y,left=0,right=1,k=5,plotit=TRUE)
out = regressogram(x,y,left=0,right=1,k=10,plotit=TRUE)
out = regressogram(x,y,left=0,right=1,k=20,plotit=TRUE)
dev.off()
### Bone mineral density versus age for men and women (Figure 4)
pdf("bmd.pdf")
par(mfrow=c(2,2))
library(ElemStatLearn)
attach(bone)
age.female = age[gender == "female"]; density.female = spnbmd[gender == "female"]
age.male = age[gender == "male"]; density.male = spnbmd[gender == "male"]
out = regressogram(age.male,density.male,left=min(age.male),right=max(age.male),k=10,plotit=TRUE,xlab="Age",ylab="Density",sub="Male")            ### repeat with k = 20
out = regressogram(age.female,density.female,left=min(age.female),right=max(age.female),k=10,plotit=TRUE,xlab="Age",ylab="Density",sub="Female")  ### repeat with k = 20
dev.off()
From Figures 3 and 4 you can see two things. First, we need a way to choose $k$. (The answer will be cross-validation.) But also, $\widehat{m}(x)$ is very unsmooth. To get a smoother estimator, we will use kernel smoothing.
The idea of kernel smoothing is very simple. To estimate $m(x)$ we will take a local average of the $Y_i$'s. In other words, we average all $Y_i$ such that $|X_i - x| \leq h$ where $h$ is some small number called the bandwidth. We can write this estimator as
$$\widehat{m}(x) = \frac{\sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right) Y_i}{\sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right)}$$
where
$$K(z) = \begin{cases} 1 & \text{if } |z| \leq 1 \\ 0 & \text{if } |z| > 1. \end{cases}$$
The function $K$ is called the boxcar kernel.
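As a concrete illustration (my own sketch, not code from the notes; the function name boxcar.smooth is made up), the boxcar local average can be computed directly:

boxcar.smooth = function(x,y,grid,h){
     ### average the y's whose x lies within h of each grid point
     sapply(grid, function(g){
          in.window = abs(x - g) <= h
          if(any(in.window)) mean(y[in.window]) else NA   ### NA if the window is empty
     })
}
### example on simulated data like that used for the regressogram
n = 100; x = runif(n); y = 3*sin(8*x) + rnorm(n,0,.3)
grid = seq(0,1,length=200)
plot(x,y,pch=20)
lines(grid,boxcar.smooth(x,y,grid,h=.05),col="blue",lwd=3)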
[Figure 3: The simulated example. The data and the regressogram fits with k = 5, 10 and 20 bins.]
Figure 4: Bone density example. Men and women. Left plots use k = 10. Right plots use
k = 20.
[Figure 5: The female bone density data (density.female versus age.female) with boxcar kernel estimates.]
As you can see in Figure 5, this gives much smoother estimates than the regressogram. But we can improve this even further by replacing K with a smoother function.
This leads us to the following definition. A one-dimensional smoothing kernel is any smooth function $K$ such that $K(x) \geq 0$ and
$$\int K(x)\,dx = 1, \quad \int x K(x)\,dx = 0 \quad \text{and} \quad \sigma_K^2 \equiv \int x^2 K(x)\,dx > 0. \qquad (7)$$
Let $h > 0$ be a positive number, called the bandwidth. The Nadaraya–Watson kernel estimator is defined by
$$\widehat{m}(x) \equiv \widehat{m}_h(x) = \frac{\sum_{i=1}^{n} Y_i K\left(\frac{x - X_i}{h}\right)}{\sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right)} = \sum_{i=1}^{n} Y_i \ell_i(x) \qquad (8)$$
where $\ell_i(x) = K((x - X_i)/h) / \sum_j K((x - X_j)/h)$.
Thus $\widehat{m}(x)$ is a local average of the $Y_i$'s. It can be shown that there is an optimal kernel, called the Epanechnikov kernel. But the choice of kernel $K$ is not too important. Estimates obtained by using different kernels are usually numerically very similar. This observation is confirmed by theoretical calculations which show that the risk is very insensitive to the choice of kernel. What matters much more is the choice of bandwidth $h$, which controls the amount of smoothing. Small bandwidths give very rough estimates while larger bandwidths give smoother estimates. A common choice for the kernel is the normal density:
$$K(z) \propto e^{-z^2/2}.$$
This does not mean that we are assuming that the data are Normal. We are just using this as a convenient way to define the smoothing weights.
[Figure: The female bone density data (density.female versus age.female) with kernel regression estimates; four panels.]
The fitted value at $X_i$ is $\widehat{Y}_i = \widehat{m}(X_i) = \sum_{j=1}^{n} \ell_j(X_i) Y_j$, where $\ell_j(x) = K((x - X_j)/h) / \sum_t K((x - X_t)/h)$. Let $\widehat{Y} = (\widehat{Y}_1, \ldots, \widehat{Y}_n)$, $Y = (Y_1, \ldots, Y_n)$ and let $L$ be the $n \times n$ matrix with entries $L_{ij} = \ell_j(X_i)$. We then see that
$$\widehat{Y} = LY. \qquad (10)$$
The matrix $L$ is called the smoothing matrix. It is like the hat matrix but it is not a projection matrix. We call
$$\nu = \mathrm{tr}(L) = \sum_i L_{ii}$$
the effective degrees of freedom. As the bandwidth $h$ gets smaller, $\nu$ gets larger. In other words, small bandwidths correspond to more complex models.
The equation $\widehat{Y} = LY$ means that we can write each $\widehat{Y}_i$ as a linear combination of the $Y_i$'s. Because of this, we say that kernel regression is a linear smoother. This does not mean that $\widehat{m}(x)$ is linear. It just means that each $\widehat{Y}_i$ is a linear combination of the $Y_i$'s.
There are many other linear smoothers besides kernel regression estimators but we will focus on the kernel estimator for now. Remember: a linear smoother does not mean linear regression.
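Here is a small sketch (my own code, not from the notes) of how the smoothing matrix $L$ and $\nu = \mathrm{tr}(L)$ can be computed for the Gaussian (normal) kernel:

smoothing.matrix = function(x,h){
     ### L[i,j] = K((x[i]-x[j])/h) / sum_t K((x[i]-x[t])/h), Gaussian kernel
     n = length(x)
     L = matrix(0,n,n)
     for(i in 1:n){
          w = dnorm(x[i],x,h)
          L[i,] = w/sum(w)
     }
     return(L)
}
x = runif(50)
nu = sum(diag(smoothing.matrix(x,h=.1)))   ### effective degrees of freedom; grows as h shrinks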
6 Choosing h by Cross-Validation
It turns out that there is a handy shortcut formula for CV, which is
$$CV(h) = \frac{1}{n} \sum_i \left( \frac{Y_i - \widehat{m}_h(X_i)}{1 - L_{ii}} \right)^2. \qquad (11)$$
Figure 7: Cross validation (black) and generalized cross validation (red dotted line) versus
h and versus effective degrees of freedom. The bottom left plot is the kernel regression
estimator using the bandwidth that minimizes the cross-validation score.
A common approximation replaces each $L_{ii}$ in (11) by its average $\nu/n$, giving
$$GCV(h) = \frac{1}{n} \sum_i \frac{(Y_i - \widehat{m}_h(X_i))^2}{(1 - \nu/n)^2},$$
which is called generalized cross validation.
Figure 7 shows the cross-validation score and the generalized cross validation score. I plotted it versus $h$ and then versus the effective degrees of freedom $\nu$. The bottom left plot is the kernel regression estimator using the bandwidth that minimizes the cross-validation score.
kernel = function(x,y,grid,h){
     ### kernel regression estimator at a grid of values
     n = length(x)
     k = length(grid)
     m.hat = rep(0,k)
     for(i in 1:k){
          w = dnorm(grid[i],x,h)        ### Gaussian kernel weights
          m.hat[i] = sum(y*w)/sum(w)
     }
     return(m.hat)
}

kernel.fitted = function(x,y,h){
     ### fitted values and diagonal of the smoothing matrix
     n = length(x)
     m.hat = rep(0,n)
     S = rep(0,n)
     for(i in 1:n){
          w = dnorm(x[i],x,h)
          w = w/sum(w)
          m.hat[i] = sum(y*w)
          S[i] = w[i]                   ### L_{ii}
     }
     return(list(fitted=m.hat,S=S))
}
CV = function(x,y,H){
     ### H is a vector of bandwidths
     n = length(x)
     k = length(H)
     cv = rep(0,k)
     nu = rep(0,k)
     gcv = rep(0,k)
     for(i in 1:k){
          tmp = kernel.fitted(x,y,H[i])
          cv[i] = mean(((y - tmp$fitted)/(1-tmp$S))^2)       ### equation (11)
          nu[i] = sum(tmp$S)                                 ### effective degrees of freedom
          gcv[i] = mean((y - tmp$fitted)^2)/(1-nu[i]/n)^2    ### generalized cross validation
     }
     return(list(cv=cv,gcv=gcv,nu=nu))
}
pdf("crossval.pdf")
par(mfrow=c(2,2))
bone = read.table("BoneDensity.txt",header=TRUE)
attach(bone)
H = seq(.1,5,length=20)
out = CV(age.female,density.female,H)
plot(H,out$cv,type="l",lwd=3,xlab="Bandwidth",ylab="Cross-validation Score")
lines(H,out$gcv,lty=2,col="red",lwd=3)
plot(out$nu,out$cv,type="l",lwd=3,xlab="Effective Degrees of Freedom",ylab="Cross-valida
lines(out$nu,out$gcv,lty=2,col="red",lwd=3)
j = which.min(out$cv)
h = H[j]
grid = seq(min(age.female),max(age.female),length=100)
m.hat = kernel(age.female,density.female,grid,h)
plot(age.female,density.female,xlab="Age",ylab="Density")
lines(grid,m.hat,lwd=3,col="blue")
dev.off()
7 Analysis of the Kernel Estimator
The prediction risk at $x$ can be decomposed as
$$\tau^2 + b_n^2(x) + v_n(x)$$
where $\tau^2$ is the unavoidable error, $b_n(x) = E(\widehat{m}_h(x)) - m(x)$ is the bias and $v_n(x) = \mathrm{Var}(\widehat{m}_h(x))$ is the variance. The first term is unavoidable. It is the second two terms that we can try to make small. That is, we would like to minimize the integrated mean squared error
$$IMSE = \int b_n^2(x) p(x)\,dx + \int v_n(x) p(x)\,dx.$$
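The next paragraph refers to a formula for the optimal bandwidth; a standard sketch of that calculation (under the usual smoothness assumptions, with constants left unspecified since they are not derived in these notes) is the following. The IMSE of the kernel estimator behaves like
$$IMSE(h) \approx C_1 h^4 + \frac{C_2}{nh},$$
where $C_1$ depends on the curvature of $m$ (through its second derivative) and $C_2$ depends on $\sigma^2$ and $\int K^2(z)\,dz$. Setting the derivative with respect to $h$ to zero gives
$$h_n = \left(\frac{C_2}{4 C_1 n}\right)^{1/5} \propto n^{-1/5},$$
and plugging $h_n$ back in shows that $IMSE(h_n) = O(n^{-4/5})$.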
What do we learn from this? First, the optimal bandwidth gets smaller as the sample size increases. Second, the IMSE goes to 0 as $n$ gets larger. But it goes to 0 more slowly than for other estimators you are used to. For example, if you estimate the mean $\mu$ with the sample mean $\bar{Y}$ then $E(\bar{Y} - \mu)^2 = \sigma^2/n$. Roughly speaking, the mean squared error of parametric estimators is something like $1/n$ but kernel estimators (and other nonparametric estimators) behave like $(1/n)^{4/5}$, which goes to 0 more slowly. Slower convergence is the price of being nonparametric.
The formula we derived for $h_n$ is interesting for helping our understanding of the theoretical behavior of $\widehat{m}$. But we can't use that formula in practice because those constants involve quantities that depend on the unknown function $m$. So we use cross-validation to choose $h$ in practice.
8 Variability Bands
We would like to get some idea of how accurate our estimator is. To estimate the standard
error of m
b h (x) we are going to use a tool called the bootstrap. In 402, you will see that the
bootstrap is a very general tool. Here we will only use it to get standard errors for our kernel
estimator.
(a) Draw n observations, with replacement, from the original data {Z1 , . . . , Zn }. Call
these observations Z1∗ , . . . , Zn∗ . This is called a bootstrap sample.
(b) Compute a kernel estimator m b ∗j (x) from the bootstrap sample. (Note that we
b ∗j (x) at every x.)
compute m
b ∗1 (x), . . . , m
4. At each x, let se(x) be the standard deviation of the numbers m b ∗B (x).
kernel = function(x,y,grid,h){
     ### kernel regression estimator at a grid of values
     n = length(x)
     k = length(grid)
     m.hat = rep(0,k)
     for(i in 1:k){
          w = dnorm(grid[i],x,h)
          m.hat[i] = sum(y*w)/sum(w)
     }
     return(m.hat)
}
boot = function(x,y,grid,h,B){
     ### pointwise standard error for kernel regression using the bootstrap
     k = length(grid)
     n = length(x)
     M = matrix(0,k,B)
     for(j in 1:B){
          I = sample(1:n,size=n,replace=TRUE)    ### bootstrap sample of observations
          xx = x[I]
          yy = y[I]
          M[,j] = kernel(xx,yy,grid,h)           ### kernel fit on the bootstrap sample
     }
     s = sqrt(apply(M,1,var))                    ### sd over bootstrap fits at each grid point
     return(s)
}
bone = read.table("BoneDensity.txt",header=TRUE)
attach(bone)
age.female = age[gender == "female"]
density.female = spnbmd[gender == "female"]
h = .7
grid = seq(min(age.female),max(age.female),length=100)
plot(age.female,density.female)
mhat = kernel(age.female,density.female,grid,h)
lines(grid,mhat,lwd=3)
B = 1000
se = boot(age.female,density.female,grid,h,B)
lines(grid,mhat+2*se,lwd=3,lty=2,col="red")
lines(grid,mhat-2*se,lwd=3,lty=2,col="red")
Suppose now that the covariate is $d$-dimensional, $X_i = (x_{i1}, \ldots, x_{id})^T$. The regression equation is
$$Y = m(X_1, \ldots, X_d) + \epsilon. \qquad (13)$$
A problem that occurs with smoothing methods is the curse of dimensionality. Estimation gets harder very quickly as the dimension of the observations increases.
[Figure: The female bone density data (age.female versus density.female) with the kernel estimate and bootstrap variability bands from the code above.]
The IMSE of most nonparametric regression estimators has the form $n^{-4/(4+d)}$. This implies that the sample size needed for a given level of accuracy increases exponentially with $d$. The reason for this phenomenon is that smoothing involves estimating a function $m(x)$ using data points in a local neighborhood of $x$. But in a high-dimensional problem, the data are very sparse, so local neighborhoods contain very few points.
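A quick numerical illustration of this sparsity (my own sketch, not from the notes): with $n$ uniform covariates on $[0,1]^d$, the expected number of points within a cube of half-width $h$ around a central point is roughly $n(2h)^d$, which collapses as $d$ grows.

### how many of n uniform points fall in a cube of half-width h around (0.5,...,0.5)?
n = 1000
h = .1
for(d in c(1,2,5,10)){
     X = matrix(runif(n*d),n,d)
     in.cube = apply(abs(X - .5) <= h, 1, all)
     cat("d =",d,": points in the neighborhood =",sum(in.cube),"\n")
}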
9.1 Kernels
We now proceed as in the univariate case. For example, we use cross-validation to estimate
h.
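The notes do not write out the multivariate estimator here; one common version (a sketch, with names of my own choosing) uses a product kernel, multiplying one-dimensional Gaussian weights across coordinates:

kernel.multi = function(X,y,x0,h){
     ### Nadaraya-Watson estimate at the point x0, with one bandwidth per coordinate
     ### X: n x d matrix, x0: vector of length d, h: vector of d bandwidths
     w = rep(1,nrow(X))
     for(j in 1:ncol(X)) w = w * dnorm((X[,j] - x0[j])/h[j])
     sum(w*y)/sum(w)
}
### simulated example with d = 2
n = 200
X = matrix(runif(2*n),n,2)
y = sin(5*X[,1]) + X[,2]^2 + rnorm(n,0,.2)
kernel.multi(X,y,x0=c(.5,.5),h=c(.1,.1))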
The additive model is
$$Y = \alpha + \sum_{j=1}^{d} m_j(X_j) + \epsilon.$$
The additive model is not well defined as we have stated it. We can add any constant to $\alpha$ and subtract the same constant from one of the $m_j$'s without changing the regression function. This problem can be fixed in a number of ways; the simplest is to set $\widehat{\alpha} = \bar{Y}$ and then regard the $m_j$'s as deviations from $\bar{Y}$. In this case we require that $\sum_{i=1}^{n} \widehat{m}_j(X_i) = 0$ for each $j$.
There is a simple algorithm called backfitting for turning any one-dimensional regression
smoother into a method for fitting additive models. This is essentially a coordinate descent,
Gauss-Seidel algorithm.
Initialization: set $\widehat{\alpha} = \bar{Y}$ and set initial guesses for $\widehat{m}_1, \ldots, \widehat{m}_d$. Now iterate the following steps until convergence. For $j = 1, \ldots, d$ do:
• Compute $\widetilde{Y}_i = Y_i - \widehat{\alpha} - \sum_{k \neq j} \widehat{m}_k(X_i)$, $i = 1, \ldots, n$.
• Apply a one-dimensional smoother to the data $(x_{ij}, \widetilde{Y}_i)$, $i = 1, \ldots, n$, take $\widehat{m}_j$ to be the resulting estimate, and center it so that $\sum_i \widehat{m}_j(X_i) = 0$.
• end do.
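A minimal backfitting sketch (my own code, using R's built-in ksmooth as the one-dimensional smoother and a fixed number of sweeps instead of a formal convergence check; in practice the gam function from mgcv, used in the example below, handles the smoothing and centering automatically):

backfit = function(X,y,h,niter=20){
     ### X: n x d matrix of covariates, h: vector of d bandwidths
     ### returns alpha-hat and the estimated m_j's evaluated at the data points
     n = nrow(X); d = ncol(X)
     alpha = mean(y)
     m = matrix(0,n,d)                                    ### m[i,j] = estimate of m_j at X[i,j]
     for(it in 1:niter){
          for(j in 1:d){
               r = y - alpha - rowSums(m[,-j,drop=FALSE])      ### partial residuals
               ord = order(X[,j])
               sm = ksmooth(X[ord,j],r[ord],kernel="normal",bandwidth=h[j],x.points=X[ord,j])
               m[ord,j] = sm$y - mean(sm$y)                    ### smooth, then center
          }
     }
     return(list(alpha=alpha,m=m))
}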
9.3 Example
library(mgcv)
D = read.table("CarData.txt",header=TRUE)
### data from Consumer Reports (1990)
attach(D)
names(D)
# [1] "Price"   "Mileage" "Weight"  "Disp"    "HP"
pairs(D)
out = gam(Price ~ s(Mileage) + s(Weight) + s(Disp) + s(HP))
summary(out)
#Family: gaussian
#Link function: identity
#
#Formula:
#Price ~ s(Mileage) + s(Weight) + s(Disp) + s(HP)
#
#Parametric coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 12615.7 307.3 41.06 <2e-16 ***
#---
#Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
#
#Approximate significance of smooth terms:
# edf Ref.df F p-value
#s(Mileage) 4.328 5.314 1.897 0.12416
#s(Weight) 1.000 1.000 7.857 0.00723 **
#s(Disp) 1.000 1.000 11.354 0.00146 **
#s(HP) 4.698 5.685 3.374 0.00943 **
#---
#Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
#
#R-sq.(adj) = 0.66 Deviance explained = 72.4%
#GCV = 7.0853e+06 Scale est. = 5.6652e+06 n = 60
#
plot(out,lwd=3)
r = resid(out)
plot(Mileage,r);abline(h=0)
plot(Weight,r);abline(h=0)
plot(Disp,r);abline(h=0)
plot(HP,r);abline(h=0)
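As a small usage note (my own addition; the covariate values below are made up for illustration), predictions from the fitted additive model are obtained with predict:

newcar = data.frame(Mileage=25,Weight=2800,Disp=150,HP=110)   ### hypothetical car
predict(out,newdata=newcar,se.fit=TRUE)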
[Figure: Pairs plot of Price, Mileage, Weight, Disp and HP for the car data.]
[Figure: Estimated smooth terms s(Mileage, 4.33), s(Weight, 1), s(Disp, 1) and s(HP, 4.7) from the fitted additive model.]
[Figure: Residuals r plotted against Mileage, Weight, Disp and HP.]