DBSCAN
DBSCAN
DeBaCl
Theoretical Analysis of DeBaCl
Alessandro Rinaldo
Department of Statistics
Carnegie Mellon University
Outline
Density-based clustering.
Clustering
4
2
normal4[1:1000, ][,2]
0
-2
-4
-4 -2 0 2 4
normal4[1:1000, ][,1]
3
TIME
−1
−2
−3
−4 −2 0 2 4 6 8
ENGY
0
y
−1
−2
−3
−3 −2 −1 0 1 2 3
x
1.5
0.5
0
z
2
−0.5
−1
0
−1.5
2 1 −2
0 −1 x
y
Atom, n = 800, dimension = 3, classes = 2, main problem: different variances and linear not separable
50
0
z
−50
50 50
0 0
−50
y −50 x
Assume P has a density f . For a threshold λ > 0, the λ-upper level set
(high density region) of f is
Definition (λ-Clusters)
A λ-cluster of P is a maximal connected component of L(λ).
Definition (α-Clusters)
A α-cluster of P is maximal connected component of L(α). Minimal volume
set of prescribed probability content.
Assume P has a density f . For a threshold λ > 0, the λ-upper level set
(high density region) of f is
Definition (λ-Clusters)
A λ-cluster of P is a maximal connected component of L(λ).
Definition (α-Clusters)
A α-cluster of P is maximal connected component of L(α). Minimal volume
set of prescribed probability content.
A ⊂ B or B ⊂ A or A ∩ B = ∅.
A ⊂ B or B ⊂ A or A ∩ B = ∅.
0 5 10
0 5 10
0 5 10
0 5 10
0 5 10
λ α
0
0.21
0.2
0.17
0.3
0.13
0.4
0.1
0.6
0.04 0.07
0.8
0.9
0
1
0 5 10
Cluster Trees
Partition Property
The leaves and branches of the tree partition S = supp(P).
The minimax rate (attained by KDEs with vanishing bandwidth) for this
problem over Hölder classes of densities is
β
log n 2β+d
,
n
where β is the smoothness parameter.
This typically requires a sample size exponential in d. Consistent
estimation of T is unfeasible in high-dimensions.
The minimax rate (attained by KDEs with vanishing bandwidth) for this
problem over Hölder classes of densities is
β
log n 2β+d
,
n
where β is the smoothness parameter.
This typically requires a sample size exponential in d. Consistent
estimation of T is unfeasible in high-dimensions.
Fix
Use (slightly) incorrect connectivity based on Xn .
Fix
Use (slightly) incorrect connectivity based on Xn .
Fix
Use (slightly) incorrect connectivity based on Xn .
sup |f (x) − bfh (x)| ≤ sup |f (x) − fh (x)| + sup |fh (x) − bfh (x)|
x x x
| {z } | {z }
bias random fluctuations
r !
1
= O(hβ ) + OP
nhd
Ignoring the bias and for fixed h fh can be well estimated with the nearly
parameric, dimension independent rate:
r !
log n
sup |fh (x) − fh (x)| = O
b ,
x n
with high probability. The dimension is in the constants!
sup |f (x) − bfh (x)| ≤ sup |f (x) − fh (x)| + sup |fh (x) − bfh (x)|
x x x
| {z } | {z }
bias random fluctuations
r !
1
= O(hβ ) + OP
nhd
Ignoring the bias and for fixed h fh can be well estimated with the nearly
parameric, dimension independent rate:
r !
log n
sup |fh (x) − fh (x)| = O
b ,
x n
with high probability. The dimension is in the constants!
Easy Hard
-5 0 5 -5 0 5
Easy Hard
-5 0 5 -5 0 5
Meet DeBaCl
Implementations:
pyton module DeBaCl by Brian Kent (update coming soon)
https://github.com/CoAxLab/DeBaCl
Meet DeBaCl
Implementations:
pyton module DeBaCl by Brian Kent (update coming soon)
https://github.com/CoAxLab/DeBaCl
Meet DeBaCl
Meet DeBaCl
Meet DeBaCl
Meet DeBaCl
Meet DeBaCl
Remarks on DeBaCl
Why k-nn and not KDE? A KDE version of DeBaCl is easy enough to
devise but we prefer k-nn...
Remarks on DeBaCl
Why k-nn and not KDE? A KDE version of DeBaCl is easy enough to
devise but we prefer k-nn...
The fiber endpoint data is derived from in vivo difusion weighted brain
imaging (DWI) collected at the Scientific Imaging and Brain Research
Center at Carnegie Mellon University in 2012 for 30 neurologically
healthy controls (the CMU-30 group).
From the DWI data, deterministic fiber tractography was used to simulate
smooth 1-dimensional manifolds (with boundaries) called fiber
streamlines that represent tracks of strong water diffusion in the brain
(Hagmann et al., 2006).
10, 000 fiber streamlines were mapped from the cortex into the striatum
for a single subject. Only the teminal points of the streamlines were kept.
k = 200
Work done in collaboration with Timothy Verstynen:
http://www.psy.cmu.edu/~coaxlab/
(a)
(b) (c)
Figure 3.1: Fiber streamlines and streamline endpoints for one subject in the CMU-
30 group. (a) 10,000 streamlines mapped from DeBaCl
A. Rinaldo the middle frontal
: Density gyrusClustering
Based to the stria- 19/43
Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl
Figure 3.2: Interactive exploration of data subsets with the dendrogram. The top
row shows the dendrogram forA.the striatal
Rinaldo endpoints
DeBaCl inClustering
: Density Based Figure 19/43
3.1, with di↵erent
Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl
Example
3.2. 1: clustering
CLUSTER endpoints
RETRIEVAL of fiber tracks in the striatum
OPTIONS 33
Example
34 1: clusteringCHAPTER
endpoints of fiber
3. DATA tracks inAND
EXPLORATION the CLUSTERING
striatum
(a) (b)
A. Rinaldo DeBaCl : Density Based Clustering 19/43
Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl
(a) (b)
Figure 3.6: Oversmoothed level set tree result for population genetics. (a) When
the smoothing parameter k is set to 40, the estimated level set tree for the HGDP
population genetic data has only four all-mode clusters. (b) The map shows these
clusters describe continent groups well, but do not capture more detailed population
A. Rinaldo DeBaCl : Density Based Clustering 20/43
dendrogram). The DeBaClclusters in this result are very well m
Density-based Clustering
(a)
(a)
A. Rinaldo DeBaCl : Density Based Clustering 20/43
Density-based Clustering
n this result are veryTheoretical
wellAnalysis
matched
DeBaCl
of DeBaCl
to populations.
Example 2: clustering individuals into populations using SNPs
(b)
(a)
(b)
A. Rinaldo DeBaCl : Density Based Clustering 20/43
Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl
(a) (b)
(c) (d)
(e) (f)
Figure 4.1: Spoken phoneme data, separated into the 5 true classes. Each phoneme is
represented by the log-periodogram at 150 fixed frequencies (see Hastie et al. (1995)
for details), which we smoothed with a cubic spline. The phonemes are (a) ‘sh’, (b)
‘iy’, (c) ‘dcl’, (d) ‘aa’, and (e) ‘ao’. (f) The mean function for each phoneme; color
A. Rinaldo
indicates correspondence between DeBaCl
phoneme class and mean: Density Based Clustering
function. 21/43
Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl
Figure 4.2: Pseudo-density values for the phoneme data. More intense shades indi-
cate higher pseudo-density values,A. Rinaldo
implyingDeBaCl
greater proximity to neighbors.
: Density Based Clustering 21/43
Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl
Figure 4.2: Pseudo-density values for the phoneme data. More intense shades indi-
cate higher pseudo-density values, implying greater proximity to neighbors.
(a) (b)
Figure 4.3: (a) Phoneme level set tree. (b) All-mode clusters. The modal observation
in each all-mode cluster is shown with a black outline.
A. Rinaldo DeBaCl : Density Based Clustering 21/43
Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl
Example 4:mode
one major clustering
centeredhurricane trackspeninsula
near the Yucatán (functional data)
and one just o↵shore of the
Figure 4.4: Hurricane track data, shaded by pseudo-density value. Curves with more
intense color have higher pseudo-density, indicating greater proximity to neighbors.
A. Rinaldo DeBaCl : Density Based Clustering 22/43
54 CHAPTER 4. PSEUDO-DENSITIES
Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl
of similarity; that is, the Gulf of Mexico storms are more similar to each other than
Example 4: clustering hurricane tracks (functional data)
to any other cluster. The only cluster that seems possibly out of place is the plain
green, which is closer to the mid-Atlantic pink and orange than to the Gulf of Mexico
(a) (b)
Figure 4.5: (a) Hurricane track level set tree. (b) All-mode clusters. Color indicates
correspondence between dendrogram branches and clusters on the map; low-density,
unclustered background tracks are shown faintly
A. Rinaldo DeBaCl in gray.
: Density Based Clustering 22/43
Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl
Figure 7. Comparison of methods for whole-fiber segmentation. A, B) High-density fiber pathway clusters from the level set tree all-m
method for middle frontal gyrus fibers in two subjects. C, D) Single linkage hierarchical clustering results for the same fiber pathways, wit
dendrogram cut to match the same number of clusters in the level set tree result. E, F) K-means clustering results for the sample fiber pathw
A. Rinaldo
doi:10.1371/journal.pone.0093344.g007 DeBaCl : Density Based Clustering 23/43
Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl
DeBaCl estimates rp and its lower level sets at the sample points Xn .
DeBaCl estimates rp and its lower level sets at the sample points Xn .
DeBaCl estimates rp and its lower level sets at the sample points Xn .
1 kxk ≤ 1
Let K be the uniform kernel K (x) = and consider the
0 otherwise
biased, Lebesgue density
Z
1 kx − y k p 1
fp (x) = d
K dP(y ) = d
∝ d , x ∈ Rd ,
vd rp (x) Rd r p (x) vd rp (x) rp (x)
1 kxk ≤ 1
Let K be the uniform kernel K (x) = and consider the
0 otherwise
biased, Lebesgue density
Z
1 kx − y k p 1
fp (x) = d
K dP(y ) = d
∝ d , x ∈ Rd ,
vd rp (x) Rd r p (x) vd rp (x) rp (x)
fp is biased
If P has a continuous Lebesgue density f , then, a.e.,
p
lim = f (x),
p→0 vd rpd (x)
The upper level sets of fp (·) are the lower level sets of rp (·).
Algorithmic Tree
The algorithmic tree T is the dendrogram of the r -clusters of P with path
connectedness replaced by algorithmic connectedness.
The upper level sets of fp (·) are the lower level sets of rp (·).
Algorithmic Tree
The algorithmic tree T is the dendrogram of the r -clusters of P with path
connectedness replaced by algorithmic connectedness.
Empirical Tree
The empirical tree Tbn is the dendrogram produced by DeBaCl (i.e. based on
the estimated distances b r1 , . . . , b
rn ).
Oracle Tree
The oracle tree Tn is the dedrogram of nested subsets of Xn obtained by
running DeBaCl if an oracle gave us the true values of rp (X1 ), . . . , rp (Xn ).
The oracle tree: the output of DeBaCl using the true rXi ’s.
The empirical tree: the output of DeBaCl, using the estimated distances
rX1 , . . . , b
b rXn .
The oracle tree: the output of DeBaCl using the true rXi ’s.
The empirical tree: the output of DeBaCl, using the estimated distances
rX1 , . . . , b
b rXn .
Consistency of DeBaCl
Tree Consistency
A cluster A of T is said to be consistency estimable if, with high probability,
b n := A ∩ Xn is a cluster of Tbn . Consistency of the tree follows from uniform
A
consistency over clusters.
Consistency of DeBaCl
Tree Consistency
A cluster A of T is said to be consistency estimable if, with high probability,
b n := A ∩ Xn is a cluster of Tbn . Consistency of the tree follows from uniform
A
consistency over clusters.
Regularity Assumption
Let Fx be the cdf of kX − xk, X ∼ P. For some c > 0 and P-almost all x,
Fx0 (rp (x)) ≥ c. The constant c depends on dim(P).
x x
Fx(t) Fx(t)
t t
Figure 1: Two situations where there is no modulus of continuity for the quantile function Fx 1 .
Left: the set is not a connected set. Right: the measure is not (a, b)-standard.
Courtesy of Betrand Michel
4 Application to the A.
geometric
Rinaldo models
DeBaClin: Density
Rd Based Clustering 31/43
Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl
Lemma
There exists a constant C, depending on dim(P), such that, with probability at
least 1/n, r
log n
max ri − ri ≤ C
b := n .
i n
1
For a uniform distribution in full dimension C ∝ pd
.
Ln (r − n ) ⊂ b
Ln (r ) ⊂ Ln (r + n ),
and
Ln (r − n ) ⊂ Ln (r ) ⊂ b
b Ln (r + n ).
If the c.d.f. of rp (X ) with X ∼ P is locally Lipschizt atr , withhigh probability
q
log n
the proportion of misclustered nodes at level r is O n
.
Stability is a local property and does not hold around at values r of near
branching points.
Stability is a local property and does not hold around at values r of near
branching points.
Stability is a local property and does not hold around at values r of near
branching points.
Thickness
A cluster is (γ, c)-thick, where γ > 0 and c ≥ 0 if, for any x ∈ A, there exists a
y ∈ B(x, γ) ∩ A such that
Thickness
A cluster is (γ, c)-thick, where γ > 0 and c ≥ 0 if, for any x ∈ A, there exists a
y ∈ B(x, γ) ∩ A such that
Consistency
Assume that the sample X = (X1 , . . . , Xn ) is a γn -covering of supp(P). Then,
if A is an r -cluster that is (γn , c)-thick, where γn < rA /(2(2 + c)), then Xn,A is
a cluster of Ln (r ). This holds uniformly over all such clusters.
That is, when the sample is dense enough, the oracle tree will produce the
same clustering as the if we knew the algorithmic tree.
rA
When is Xn a γn -covering of a set A ⊂ S, with γn ≤ 2(c+2)
?
rA
When is Xn a γn -covering of a set A ⊂ S, with γn ≤ 2(c+2)
?
rA
When is Xn a γn -covering of a set A ⊂ S, with γn ≤ 2(c+2)
?
Consistency of DeBaCl
Consistency of DeBaCl
Let A be an r -cluster of T such that its r − n subcluster A0 is (γn , c)-thick,
with γn < (rA0 − n )/(2 + c) and c ≥ 0. If the sample X = (X1 , . . . , Xn ) is a γn
covering of supp(P) then the subgraph of G(r b − n ) induced by Xn,A0 is
connected; if A is also 2n -away from all the other r clusters, then Xn,A0 is a
cluster of Tbn (r − n ). This holds uniformly over such clusters of T .
κ-tree: each branch has length equal to the fraction of points comprising
it. The sum of the length of all branches and leaves is 1.
Highly interpretable.
κ-tree: each branch has length equal to the fraction of points comprising
it. The sum of the length of all branches and leaves is 1.
Highly interpretable.
κ-tree: each branch has length equal to the fraction of points comprising
it. The sum of the length of all branches and leaves is 1.
Highly interpretable.
κ-tree: each branch has length equal to the fraction of points comprising
it. The sum of the length of all branches and leaves is 1.
Highly interpretable.
0.067
0.914
6
1.518
4
0.042
lambda
r
2
0.015
0
2.989
-2
0
0 2 4 6 8 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
0.602
6
0.313
4
0.39
kappa
alpha
2
0.174
0
0.838
-2
0
0 2 4 6 8 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0
Then,
n√ o
B(x, rp (x))
rp (x) − rp (x) , x ∈ Rd , x ∈ Rd
n b
Fx0 (rp (x))
Then,
n√ o
B(x, rp (x))
rp (x) − rp (x) , x ∈ Rd , x ∈ Rd
n b
Fx0 (rp (x))
Then,
n√ o
B(x, rp (x))
rp (x) − rp (x) , x ∈ Rd , x ∈ Rd
n b
Fx0 (rp (x))
Conclusions
Conclusions
CMU TopStat