
Density-based Clustering

DeBaCl
Theoretical Analysis of DeBaCl

DeBaCl: a Density-based Clustering Algorithm and its Properties

Alessandro Rinaldo
Department of Statistics
Carnegie Mellon University

joint work with Brian Kent and Fabrizio Lecci


Thanks to: Larry Wasserman, Bertrand Michel, Fred Chazal
and Timothy Verstynen

November 10, 2014


Department of Statistics
Rice University

A. Rinaldo DeBaCl : Density Based Clustering 1/43


Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Outline

Density-based clustering.

The algorithm DeBaCl and some applications.

Theoretical analysis of DeBaCl.

A. Rinaldo DeBaCl : Density Based Clustering 2/43


Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Clustering

Classic problem in statistics, computer science, probability and many other fields. Huge literature!

Abstract formulation: optimally organize a set of objects into groups, so that objects in the same group are maximally similar and objects in different groups are maximally dissimilar.

The goal, scope and performance of a given clustering task are in many cases poorly or only partially defined.

Analyses of clustering procedures often focus on the algorithmic properties, and tend to ignore the probabilistic nature of the input.

A. Rinaldo DeBaCl : Density Based Clustering 3/43


Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Clustering in Euclidean spaces

In much of this talk, we are interested in clustering X_n = (X_1, . . . , X_n), an i.i.d. sample from a probability distribution P with support S ⊂ R^d.

A. Rinaldo DeBaCl : Density Based Clustering 4/43


Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Clustering in Euclidean spaces

[Scatter plot of a simulated sample of 1000 points (normal4 data), axes normal4[,1] vs normal4[,2], both ranging over roughly [−4, 4].]

A. Rinaldo DeBaCl : Density Based Clustering 4/43


Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Clustering in Euclidean spaces

EngyTime, n = 4096, dimension = 2, classes = 2, main problem: gaussian mixture

[Scatter plot of the EngyTime data: ENGY (horizontal axis) vs TIME (vertical axis).]

Source: Fundamental Clustering Problems Suite (FCPS).

A. Rinaldo DeBaCl : Density Based Clustering 4/43


Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Clustering in Euclidean spaces

Target, n = 770, dimension = 2, classes = 6, main problem: outlier


[Scatter plot of the Target data: x vs y, both in [−3, 3].]

Source: Fundamental Clustering Problems Suite (FCPS).

A. Rinaldo DeBaCl : Density Based Clustering 4/43


Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Clustering in Euclidean spaces

Chainlink, n = 1000, dimension = 3, classes = 2, main problem: linear not separable

[3-D scatter plot of the Chainlink data, axes x, y, z.]

Source: Fundamental Clustering Problems Suite (FCPS).

A. Rinaldo DeBaCl : Density Based Clustering 4/43


Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Clustering in Euclidean spaces

Atom, n = 800, dimension = 3, classes = 2, main problem: different variances and linear not separable

[3-D scatter plot of the Atom data, axes x, y, z.]

Source: Fundamental Clustering Problems Suite (FCPS).

A. Rinaldo DeBaCl : Density Based Clustering 4/43


Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Clustering in Euclidean spaces

We will try to be as agnostic as possible about P:


P has a density with respect to k-dimensional Hausdorff measure, or mixtures thereof, k ∈ {1, . . . , d};
the dimension k = dim(S) is unknown;
the smoothness of f is unknown;
the number of clusters is unknown;
we are interested in both the algorithmic and statistical challenges of
high dimensions.

We believe that many of our results extend to clustering of functional data.

A. Rinaldo DeBaCl : Density Based Clustering 4/43


Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Density-based clustering – Hartigan (1975, 1981)

Assume P has a density f . For a threshold λ > 0, the λ-upper level set
(high density region) of f is

L(λ) = {x ∈ Rd : f (x) ≥ λ}.

Definition (λ-Clusters)
A λ-cluster of P is a maximal connected component of L(λ).

More interpretable twist (see Rinaldo et al., 2012).


For α ∈ [0, 1], set λα = sup{λ : P (L(λ)) ≥ α} and L(α) = L(λα ).

Definition (α-Clusters)
An α-cluster of P is a maximal connected component of L(α). It is a minimal-volume set of prescribed probability content.
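As a toy illustration of these definitions (not from the talk), the sketch below computes the λ-clusters of a one-dimensional density evaluated on a grid: the upper level set L(λ) is a boolean mask over the grid, and its maximal runs of consecutive grid points are the λ-clusters. The two-component mixture used here is an arbitrary example.

import numpy as np
from scipy.stats import norm

grid = np.linspace(-6, 10, 2000)
f = 0.5 * norm.pdf(grid, 0, 1) + 0.5 * norm.pdf(grid, 5, 1)   # an arbitrary two-mode density

def lambda_clusters(grid, f, lam):
    # maximal runs of consecutive grid points with f >= lam, i.e. the lambda-clusters
    mask = f >= lam
    clusters, start = [], None
    for i, m in enumerate(mask):
        if m and start is None:
            start = i
        if (not m or i == len(mask) - 1) and start is not None:
            end = i if m else i - 1
            clusters.append((grid[start], grid[end]))
            start = None
    return clusters

print(lambda_clusters(grid, f, lam=0.05))   # two intervals, one around each mode
print(lambda_clusters(grid, f, lam=0.01))   # the two components merge into one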

A. Rinaldo DeBaCl : Density Based Clustering 5/43



Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Density-based clustering – Hartigan (1975, 1981)

Consider all thresholds simultaneously!


The family of all λ-clusters of P is called the cluster tree T of P because it has the tree property: A, B ∈ T implies that

A ⊂ B or B ⊂ A or A ∩ B = ∅.

The hierarchy of inclusions of T can be represented as a dendrogram, with height indexed by λ or α.

Many subtle topological and measure-theoretic details: see Steinwart (2014).

A. Rinaldo DeBaCl : Density Based Clustering 6/43



Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Density-based clustering in action

[Figure: a one-dimensional density supported on roughly [0, 10], cut at a decreasing sequence of thresholds; the resulting connected components, indexed by λ or α, build up the cluster tree shown as a dendrogram.]

A. Rinaldo DeBaCl : Density Based Clustering 7/43




Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Cluster Trees

The cluster tree T captures all the clustering properties of P simultaneously.
The cluster tree is an algebraic structure for visualizing and encoding P.
It is largely decoupled from the geometry and dimension of P.
There is no need to choose the number of clusters.

A. Rinaldo DeBaCl : Density Based Clustering 8/43


Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Trees, branches and leaves

Partition Property
The leaves and branches of the tree partition S = supp(P).

A. Rinaldo DeBaCl : Density Based Clustering 9/43


Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Issue I: Density-based clustering is statistically hard

To “estimate T consistently" using a density estimator f̂, we need sup-norm consistency, i.e. sup_x |f(x) − f̂(x)| = o_P(1).

The minimax rate (attained by KDEs with vanishing bandwidth) for this problem over Hölder classes of densities is

(log n / n)^(β / (2β + d)),

where β is the smoothness parameter.
This typically requires a sample size exponential in d. Consistent estimation of T is unfeasible in high dimensions.

A. Rinaldo DeBaCl : Density Based Clustering 10/43



Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Issue II: density-based clustering is algorithmically hard

Even assuming f known, deciding whether x and y are in the same λ-cluster of f requires finding a path ℓ ⊂ S between x and y such that f(z) ≥ λ for all z ∈ ℓ.

This computation is prohibitively difficult even in moderate dimensions. Building T is unfeasible in high dimensions.

Fix
Use (slightly) incorrect connectivity based on Xn .

A. Rinaldo DeBaCl : Density Based Clustering 11/43



Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

How to deal with curse of dimensionality in density-based clustering

So you are all about the bias...big trouble!


Statistical hardness is an unavoidable bias issue. So let’s ignore it!

Suppose P has a Lebesgue density f, assumed Hölder smooth with parameter β. Let f̂_h be a KDE with bandwidth h:

f̂_h(x) = (1/n) Σ_{i=1}^n (1/h^d) K(‖x − X_i‖ / h),  x ∈ R^d.

For each h > 0, f̂_h is an unbiased estimator of the smoothed density

f_h(x) = (1/h^d) ∫_{R^d} K(‖y − x‖ / h) f(y) dy,  x ∈ R^d.

f_h is much easier to estimate than f!!
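As a concrete (and hedged) illustration of f̂_h, here is a minimal numpy sketch; the Gaussian kernel is an arbitrary choice of K for the example, not something specified in the talk.

import numpy as np

def kde(x, X, h):
    # f_hat_h(x) = (1/n) * sum_i h^(-d) K(||x - X_i|| / h), with a Gaussian K
    n, d = X.shape
    u = np.linalg.norm(x - X, axis=1) / h
    K = np.exp(-0.5 * u ** 2) / (2 * np.pi) ** (d / 2)
    return K.sum() / (n * h ** d)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
print(kde(np.zeros(2), X, h=0.3))   # roughly the N(0, I_2) density at 0 (about 0.16),
                                    # up to smoothing bias and sampling noise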

A. Rinaldo DeBaCl : Density Based Clustering 12/43



Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

How to deal with the curse of dimensionality

By the Hölder assumption and Giné and Guillou (2002):

sup_x |f(x) − f̂_h(x)| ≤ sup_x |f(x) − f_h(x)| + sup_x |f_h(x) − f̂_h(x)|
                          (bias)                  (random fluctuations)
                       = O(h^β) + O_P( √(1/(n h^d)) ).

Ignoring the bias, and for fixed h, f_h can be well estimated at the nearly parametric, dimension-independent rate

sup_x |f_h(x) − f̂_h(x)| = O( √(log n / n) ),

with high probability. The dimension is in the constants!

A. Rinaldo DeBaCl : Density Based Clustering 13/43



Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

More on ignoring the bias

One may measure the difficulty of a clustering problem by whether density clustering based on biased density estimation can be successful.

[Figure: two one-dimensional examples on [−5, 5], one labeled "Easy" and one labeled "Hard".]

Another major advantage of allowing for bias is that it extends the applicability of density-based clustering to singular P.
See, e.g., Rinaldo and Wasserman (2010).

A. Rinaldo DeBaCl : Density Based Clustering 14/43



Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Very large amount of literature.

Cluster tree estimation: Koltchinskii (2000), Stuetzle and Nugent (2010). More recently: Chaudhuri, Dasgupta, Kpotufe and von Luxburg (2013), Balakrishnan et al. (2013) and Steinwart (2014).

Support estimation: Korostelev and Tsybakov (1993), Mammen and Tsybakov (1995), Cuevas and Fraiman (1997), Biau, Cadre and Pelletier (2008).

Level set estimation for fixed λ: Polonik (1995), Tsybakov (1997), Walther (1997), Scott and Nowak (2006), Cuevas, González-Manteiga and Rodríguez-Casal (2006), Singh, Scott and Nowak (2009), Rigollet and Vert (2010).

Some algorithms: DBSCAN, OPTICS, denpro.

A. Rinaldo DeBaCl : Density Based Clustering 15/43


Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Meet DeBaCl

DeBaCl is a simple algorithm for density-based clustering. We credit Kpotufe and von Luxburg (2011). It is based on the k-nn density estimator.

Implementations:
Python module DeBaCl by Brian Kent (update coming soon): https://github.com/CoAxLab/DeBaCl
R package TDA by Fabrizio Lecci et al.: http://cran.r-project.org/web/packages/TDA/index.html

A. Rinaldo DeBaCl : Density Based Clustering 16/43



Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Meet DeBaCl

For fixed p ∈ (0, 1) and each i = 1, . . . , n, let r̂_i be the distance of X_i from its k-th nearest neighbor in X_n, with k = ⌈np⌉.

Input: p ∈ (0, 1) and X_n.
1. Construct the knn graph Ĝ_n with nodes X_n and an edge between X_i and X_j
   (i) if ‖X_i − X_j‖ ≤ max{r̂_i, r̂_j} (k-nn), or
   (ii) if ‖X_i − X_j‖ ≤ min{r̂_i, r̂_j} (mutual k-nn).
2. For all r ∈ R := [min_i r̂_i, max_i r̂_i]:
   (i) let Ĝ_n(r) be the subgraph induced by {X_i : r̂_i ≤ r};
   (ii) compute the connected components of Ĝ_n(r).
Output {T̂_n(r), r ∈ R}, the dendrogram of the connected components.
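A minimal Python sketch of the procedure above, under the k-nn rule (i) and with a brute-force edge check kept for clarity; the function names are illustrative and this is not the API of the DeBaCl or TDA packages.

import numpy as np
from scipy.spatial import cKDTree

def knn_radii(X, p):
    # r_hat[i] = distance from X[i] to its k-th nearest neighbor, k = ceil(n * p)
    n = len(X)
    k = int(np.ceil(n * p))
    tree = cKDTree(X)
    dist, _ = tree.query(X, k=k + 1)   # k + 1 because each point is its own nearest neighbor
    return dist[:, -1]

def level_set_components(X, r_hat, r):
    # Connected components of the k-nn graph restricted to {i : r_hat[i] <= r},
    # with an edge (i, j) whenever ||X_i - X_j|| <= max(r_hat[i], r_hat[j]).
    idx = np.where(r_hat <= r)[0]
    parent = {i: i for i in idx}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj

    # brute-force O(m^2) edge check, kept simple for clarity
    for a in range(len(idx)):
        for b in range(a + 1, len(idx)):
            i, j = idx[a], idx[b]
            if np.linalg.norm(X[i] - X[j]) <= max(r_hat[i], r_hat[j]):
                union(i, j)

    comps = {}
    for i in idx:
        comps.setdefault(find(i), []).append(i)
    return list(comps.values())

# Example: sweep a few radii to see the nested hierarchy of clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
r_hat = knn_radii(X, p=0.05)
for r in np.quantile(r_hat, [0.25, 0.5, 0.75, 1.0]):
    print(round(float(r), 3), sorted(len(c) for c in level_set_components(X, r_hat, r)))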

A. Rinaldo DeBaCl : Density Based Clustering 17/43



Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Remarks on DeBaCl

The input to DeBaCl is the set of k-nn distances r̂_1, . . . , r̂_n, which are used both to compute level sets and to determine connectedness.
Computational complexity. The computation of all the k-nn’s has
complexity O (n log n) (using k -d trees, ball-trees and cover-trees).
The complexity of constructing all the connected components is nearly
linear in n because it relies on a modified union-find procedure (Najman
and Couprie, 2006) and never uses breadth-first search.
DeBaCl outputs a data structure.

Why k-nn and not KDE? A KDE version of DeBaCl is easy enough to
devise but we prefer k-nn...
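To illustrate why the sweep over levels is nearly linear, here is a hedged sketch (not the modified union-find of Najman and Couprie, 2006): points enter in increasing order of r̂, and each new point and each successful union update a running component count, so the whole dendrogram skeleton costs roughly O(n + number of edges) union-find operations.

class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]   # path halving
            i = self.parent[i]
        return i

    def union(self, i, j):
        ri, rj = self.find(i), self.find(j)
        if ri != rj:
            self.parent[ri] = rj

def sweep(r_hat, edges):
    # Process points in increasing order of r_hat; edges is the precomputed k-nn graph.
    # Each new point adds a component, each successful union removes one, so the whole
    # sweep costs roughly O(n + len(edges)) union-find operations.
    n = len(r_hat)
    uf = UnionFind(n)
    active = [False] * n
    adj = {i: [] for i in range(n)}
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    n_components = 0
    for i in sorted(range(n), key=lambda v: r_hat[v]):
        active[i] = True
        n_components += 1
        for j in adj[i]:
            if active[j] and uf.find(i) != uf.find(j):
                uf.union(i, j)
                n_components -= 1
        yield r_hat[i], n_components   # one record per level of the dendrogram

# Usage: list(sweep(r_hat, edges)) gives (level, number of clusters) along the whole tree.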

A. Rinaldo DeBaCl : Density Based Clustering 18/43



Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Example 1: clustering endpoints of fiber tracks in the striatum

The fiber endpoint data is derived from in vivo diffusion weighted brain
imaging (DWI) collected at the Scientific Imaging and Brain Research
Center at Carnegie Mellon University in 2012 for 30 neurologically
healthy controls (the CMU-30 group).
From the DWI data, deterministic fiber tractography was used to simulate
smooth 1-dimensional manifolds (with boundaries) called fiber
streamlines that represent tracks of strong water diffusion in the brain
(Hagmann et al., 2006).
10, 000 fiber streamlines were mapped from the cortex into the striatum
for a single subject. Only the terminal points of the streamlines were kept.
k = 200
Work done in collaboration with Timothy Verstynen:
http://www.psy.cmu.edu/~coaxlab/

A. Rinaldo DeBaCl : Density Based Clustering 19/43


Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Example 1: clustering endpoints of fiber tracks in the striatum

[Figure 3.1: Fiber streamlines and streamline endpoints for one subject in the CMU-30 group. (a) 10,000 streamlines mapped from the middle frontal gyrus to the striatum.]

[Figure 3.2: Interactive exploration of data subsets with the dendrogram. The top row shows the dendrogram for the striatal endpoints in Figure 3.1, with different…]

[Two further figures, from the sections "Cluster retrieval options" and "Data exploration and clustering", panels (a)–(b).]

A. Rinaldo DeBaCl : Density Based Clustering 19/43
Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Example 2: clustering individuals into populations using SNPs

Data from The Human Genome Diversity Project (HGDP) dataset, available at http://www.hagsc.org/hgdp/files.html.
The cleaned-up dataset, from Crosset et al. (2010), comprises 11,775 SNPs for 931 subjects from 53 populations.
The goal of the analysis is to identify the hierarchy of high-density
clusters of individuals in the sample, ideally capturing the correct
membership in populations.
In the first level set tree k = 40, in the second k = 6.

A. Rinaldo DeBaCl : Density Based Clustering 20/43


Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Example 2: clustering individuals into populations using SNPs

[Figure 3.6: Oversmoothed level set tree result for population genetics. (a) When the smoothing parameter k is set to 40, the estimated level set tree for the HGDP population genetic data has only four all-mode clusters. (b) The map shows these clusters describe continent groups well, but do not capture more detailed population…]

[Further figures (high-dimensional experiments): the DeBaCl clusters in this result are very well matched to populations.]

A. Rinaldo DeBaCl : Density Based Clustering 20/43
Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Example 3: clustering phonemes (functional data)

The phoneme dataset, from Ferraty and Vieu (2006), contains log-periodograms of 2000 instances of digitized human speech, divided evenly between five phonemes: “sh", “dcl" (as in “dark"), “iy" (as in the vowel of “she"), “aa", and “ao". Each recording is treated as a single functional observation, which was smoothed using a cubic spline.
The distance between functions is the L2 distance (each phoneme is observed over 150 frequencies).
k = 20.

A. Rinaldo DeBaCl : Density Based Clustering 21/43


Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Example 3: clustering phonemes (functional data)

[Figure 4.1: Spoken phoneme data, separated into the 5 true classes. Each phoneme is represented by the log-periodogram at 150 fixed frequencies (see Hastie et al. (1995) for details), smoothed with a cubic spline. The phonemes are (a) 'sh', (b) 'iy', (c) 'dcl', (d) 'aa', and (e) 'ao'. (f) The mean function for each phoneme; color indicates correspondence between phoneme class and mean function.]

[Figure 4.2: Pseudo-density values for the phoneme data. More intense shades indicate higher pseudo-density values, implying greater proximity to neighbors.]

[Figure 4.3: (a) Phoneme level set tree. (b) All-mode clusters. The modal observation in each all-mode cluster is shown with a black outline.]

A. Rinaldo DeBaCl : Density Based Clustering 21/43
Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Example 4: clustering hurricane tracks (functional data)

The U.S. National Hurricane Center's HURDAT dataset contains positional and atmospheric measurements of North Atlantic tropical cyclones from 1851 to 2012 (Landsea et al., 2013). The coordinates (in degrees latitude and longitude) for each storm are recorded at least every six hours.
The processed dataset contained 398 hurricane tracks.
Pairwise distances based on max-average-min distance (not a metric).
k = 6 (γ = 2).

A. Rinaldo DeBaCl : Density Based Clustering 22/43


Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Example 4: clustering hurricane tracks (functional data)

The data are shown in Figure 4.4, colored by pseudo-density (darker hues correspond to high pseudo-density values). Just by looking at this map, we expected at least one major mode centered near the Yucatán peninsula and one just offshore of the U.S. east coast.

[Figure 4.4: Hurricane track data, shaded by pseudo-density value. Curves with more intense color have higher pseudo-density, indicating greater proximity to neighbors.]

…of similarity; that is, the Gulf of Mexico storms are more similar to each other than to any other cluster. The only cluster that seems possibly out of place is the plain green, which is closer to the mid-Atlantic pink and orange than to the Gulf of Mexico storms, despite substantial overlap with the latter.

[Figure 4.5: (a) Hurricane track level set tree. (b) All-mode clusters. Color indicates correspondence between dendrogram branches and clusters on the map; low-density, unclustered background tracks are shown faintly in gray.]

A. Rinaldo DeBaCl : Density Based Clustering 22/43
Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Example 5: clustering fiber tracks (functional data)

Fiber tractography datasets obtained through DSI techniques. Focus on two corticostriatal pathways: lateral frontal (middle frontal gyrus to striatum) and orbitofrontal (gyrus rectus to striatum).
A 30-subject template was used.
We used DeBaCl to perform whole fiber track segmentation and looked at tracks in the lateral frontal cortex and orbitofrontal cortex. Total of 51,126 fibers.
Pairwise distances based on max-average-min distance (not a metric).
k = d0.25 ∗ ne
Work done in collaboration with Timothy Verstynen:
http://www.psy.cmu.edu/~coaxlab/

A. Rinaldo DeBaCl : Density Based Clustering 23/43


Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Example 5: clustering fiber tracks (functional data)

[Figure 6: Level set tree clustering for whole fiber streamlines. A) Foreground fibers for the seven selected clusters from the 30-subject template data set, for streamlines tracked between the middle frontal gyrus and striatum, shown in both a sagittal and coronal view. Clusters are colored according to an all-mode clustering of the tree. B) The level set tree for the data in panel A; tree leaves are matched to fiber clusters by color. C) Same analysis as shown in A, but for a set of streamlines from the orbitofrontal cortex; inset shows a closeup of fiber streamlines in the striatal mask. D) Level set tree for the data shown in panel C. The branch colors of the trees in panels B and D match the clusters shown in the streamlines of panels A and C respectively. doi:10.1371/journal.pone.0093344.g006]

[Figure 7: Comparison of methods for whole-fiber segmentation. A, B) High-density fiber pathway clusters from the level set tree all-mode method for middle frontal gyrus fibers in two subjects. C, D) Single linkage hierarchical clustering results for the same fiber pathways, with the dendrogram cut to match the same number of clusters as the level set tree result. E, F) K-means clustering results for the same fiber pathways. doi:10.1371/journal.pone.0093344.g007]

A. Rinaldo DeBaCl : Density Based Clustering 23/43
Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Theoretical analysis of DeBaCl: set-up

Let P be a non-atomic probability measure supported on S ⊂ R^d. We allow for dim(S) < d and for S to be of mixed dimension.

Fix a number p ∈ (0, 1). Define the function r_p : R^d → R_+ by

x ↦ r_p(x) = inf{ r > 0 : P(B(x, r)) ≥ p }.

Thus r_p(x) is the p-th quantile of the univariate variable ‖X − x‖, X ∼ P.

DeBaCl estimates r_p and its lower level sets at the sample points X_n.
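A small illustrative sketch (not from the talk) of the empirical version of r_p at a query point x: the p-th quantile of the distances ‖X_i − x‖, i.e. the ⌈np⌉-th smallest distance from x to the sample.

import numpy as np

def r_p_hat(X, x, p):
    # empirical p-th quantile of ||X_i - x||: the ceil(n*p)-th smallest distance
    dists = np.linalg.norm(X - x, axis=1)
    k = int(np.ceil(len(X) * p))
    return np.partition(dists, k - 1)[k - 1]

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
print(r_p_hat(X, np.array([0.0, 0.0]), p=0.1))   # small near the mode...
print(r_p_hat(X, np.array([3.0, 3.0]), p=0.1))   # ...larger in the tails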

A. Rinaldo DeBaCl : Density Based Clustering 24/43



Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl


Let K be the uniform kernel, K(x) = 1 if ‖x‖ ≤ 1 and 0 otherwise, and consider the biased Lebesgue density

f_p(x) = (1 / (v_d r_p^d(x))) ∫_{R^d} K( ‖x − y‖ / r_p(x) ) dP(y) = p / (v_d r_p^d(x)) ∝ 1 / r_p^d(x),  x ∈ R^d,

with v_d the volume of the unit ball in R^d.

The empirical equivalent of f_p(x) is the k-nn density estimator, with k = ⌈pn⌉:

f̂_p(x) = (k/n) · 1 / (v_d r̂_p^d(x)),  x ∈ R^d,

with r̂_p(x) the distance from x to its k-th nearest neighbor in X_n.
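Given the k-nn radii, the estimator above is a direct transformation; a short sketch (illustrative only), with the unit-ball volume v_d computed via the Gamma function:

import numpy as np
from scipy.special import gamma

def knn_density(r_hat, n, k, d):
    v_d = np.pi ** (d / 2) / gamma(d / 2 + 1)   # volume of the unit ball in R^d
    return (k / n) / (v_d * r_hat ** d)

print(knn_density(r_hat=0.3, n=1000, k=50, d=2))   # about 0.177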

A. Rinaldo DeBaCl : Density Based Clustering 25/43



Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

f_p is biased
If P has a continuous Lebesgue density f, then, a.e.,

lim_{p→0} p / (v_d r_p^d(x)) = f(x),

and the convergence holds uniformly over compacts. If P has a continuous density f w.r.t. H^k, then, a.e.,

lim_{p→0} p / (v_k r_p^k(x)) = f(x).

A. Rinaldo DeBaCl : Density Based Clustering 26/43


Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Density-based clustering with no densities

The upper level sets of f_p(·) are the lower level sets of r_p(·).

For r > 0, let L(r) = {x ∈ R^d : r_p(x) ≤ r} ∩ S. Define the r-clusters of P as the maximal (path-)connected components of L(r). The resulting tree is algorithmically hard.

Algorithmic connectedness. Two points x, y ∈ L(r) are algorithmically connected if there exists {x = z_0, z_1, . . . , z_m = y} ⊂ L(r) such that z_i ∈ B(z_{i+1}, r_p(z_{i+1})).

Algorithmic Tree
The algorithmic tree T is the dendrogram of the r-clusters of P with path connectedness replaced by algorithmic connectedness.

A. Rinaldo DeBaCl : Density Based Clustering 27/43




Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Density-based clustering with no densities

Connectedness implies algorithmic connectedness, but not the other way around.

A. Rinaldo DeBaCl : Density Based Clustering 27/43




Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Density-based clustering with no densities

Empirical Tree
The empirical tree T̂_n is the dendrogram produced by DeBaCl (i.e. based on the estimated distances r̂_1, . . . , r̂_n).

Oracle Tree
The oracle tree T_n is the dendrogram of nested subsets of X_n obtained by running DeBaCl if an oracle gave us the true values of r_p(X_1), . . . , r_p(X_n).

They are both data dependent!

A. Rinaldo DeBaCl : Density Based Clustering 28/43


Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

A tale of many trees

The “density" tree (high-density clusters of f):
  may not be defined (if, e.g., S is of mixed dimension);
  is statistically and algorithmically hard in d.
The algorithmic r-tree (tree of r-clusters with path connectedness replaced by algorithmic connectedness): our target.
The oracle tree: the output of DeBaCl using the true r_p(X_i)'s.
The empirical tree: the output of DeBaCl, using the estimated distances r̂_{X_1}, . . . , r̂_{X_n}.

A. Rinaldo DeBaCl : Density Based Clustering 29/43




Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Consistency of DeBaCl

Tree Consistency
A cluster A of T is said to be consistently estimable if, with high probability, Â_n := A ∩ X_n is a cluster of T̂_n. Consistency of the tree follows from uniform consistency over clusters.

To prove consistency we will show that:
  T̂_n (the empirical tree) and T_n (the oracle tree) yield nearly the same hierarchy of clusters over X_n, with high probability;
  T_n (the oracle tree) consistently estimates most of T (the algorithmic tree).

A. Rinaldo DeBaCl : Density Based Clustering 30/43



Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

The empirical tree Tbn is close to the oracle tree Tn


Write r_i and r̂_i for r_p(X_i) and r̂_p(X_i).

Regularity Assumption
Let F_x be the cdf of ‖X − x‖, X ∼ P. For some c > 0 and P-almost all x, F_x'(r_p(x)) ≥ c. The constant c depends on dim(P).

[Figure 1: Two situations where there is no modulus of continuity for the quantile function F_x^{-1} (plots of F_x(t) against t). Left: the set is not a connected set. Right: the measure is not (a, b)-standard. Courtesy of Bertrand Michel.]
A. Rinaldo DeBaCl : Density Based Clustering 31/43
Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

The empirical tree Tbn is close to the oracle tree Tn

Lemma
There exists a constant C, depending on dim(P), such that, with probability at least 1 − 1/n,

max_i |r̂_i − r_i| ≤ C √(log n / n) =: ε_n.

For a uniform distribution in full dimension, C ∝ 1/(p d).

A. Rinaldo DeBaCl : Density Based Clustering 31/43


Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

For fixed r > 0, consider the (random) sublevel sets:

L_n(r) = {X_i : r_i ≤ r}  (oracle),     (1)
L̂_n(r) = {X_i : r̂_i ≤ r}  (empirical).  (2)

Corollary (Level set consistency)
Uniformly over all r > ε_n, with probability at least 1 − 1/n,

L_n(r − ε_n) ⊂ L̂_n(r) ⊂ L_n(r + ε_n),
and
L̂_n(r − ε_n) ⊂ L_n(r) ⊂ L̂_n(r + ε_n).

If the c.d.f. of r_p(X), X ∼ P, is locally Lipschitz at r, then with high probability the proportion of misclustered nodes at level r is O( √(log n / n) ).

A. Rinaldo DeBaCl : Density Based Clustering 32/43


Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Stability (yields correct connectivity)


The interval of heights I = (a, b) is stable for T_n if the dendrogram does not change when the k-nn distances involved are perturbed by ε_n. Sufficient conditions are that the clusters are at least 2ε_n apart and X_n is “sufficiently dense".

Corollary (Cluster consistency)
If I is a stable interval, then for each a + ε_n ≤ r ≤ b − ε_n the number of clusters in L_n(r) and L̂_n(r) is the same and constant, and each cluster A of L_n(r) satisfies B ⊂ A ⊂ B′ for some cluster B of L̂_n(r − ε_n) and some cluster B′ of L̂_n(r + ε_n).

Stability is a local property and does not hold at values of r near branching points.

A. Rinaldo DeBaCl : Density Based Clustering 33/43




Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

The oracle tree Tn is close to the algorithmic tree T

We need clusters to be “well-shaped".

Thickness
A cluster A is (γ, c)-thick, where γ > 0 and c ≥ 0, if for any x ∈ A there exists a y ∈ B(x, γ) ∩ A such that

B(y, γ/(2 + c)) ∩ supp(P) ⊂ B(x, γ) ∩ A.

Thickness controls how narrow a cluster A can get compared to clusters at contiguous higher levels in the tree. It is satisfied for well-behaved manifolds if γ is small compared to the reach.

A. Rinaldo DeBaCl : Density Based Clustering 34/43




Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

For a cluster A of T, let r_A = inf_{x∈A} r_p(x).

Consistency
Assume that the sample X = (X_1, . . . , X_n) is a γ_n-covering of supp(P). Then, if A is an r-cluster that is (γ_n, c)-thick, where γ_n < r_A/(2(2 + c)), then X_{n,A} is a cluster of L_n(r). This holds uniformly over all such clusters.

That is, when the sample is dense enough, the oracle tree produces the same clustering as if we knew the algorithmic tree.

A. Rinaldo DeBaCl : Density Based Clustering 35/43


Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

When is Xn a dense covering?

When is X_n a γ_n-covering of a set A ⊂ S, with γ_n ≤ r_A/(2(c + 2))?

Assume A is “well-behaved" (standard assumption) and

inf_x P(B(x, γ/2) ∩ A) = c_A > 0.

Then X_{n,A} is a γ-covering of A with high probability if

c_A = Ω( (dim(A) + log n) / n ).

If we let p → 0, then c_A → 0 at a rate Θ(γ^k), with γ → 0. This would give a dimension-dependent rate.

A. Rinaldo DeBaCl : Density Based Clustering 36/43




Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Consistency of DeBaCl

Consistency of DeBaCl
Let A be an r-cluster of T such that its (r − ε_n)-subcluster A′ is (γ_n, c)-thick, with γ_n < (r_{A′} − ε_n)/(2 + c) and c ≥ 0. If the sample X = (X_1, . . . , X_n) is a γ_n-covering of supp(P), then the subgraph of Ĝ_n(r − ε_n) induced by X_{n,A′} is connected; if A is also 2ε_n-away from all the other r-clusters, then X_{n,A′} is a cluster of T̂_n(r − ε_n). This holds uniformly over such clusters of T.

A. Rinaldo DeBaCl : Density Based Clustering 37/43


Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Cluster trees come in many flavors: r , λ, α...and κ

DeBaCl outputs a data structure, which can be visualized in different ways.

The r-tree: the tree height is indexed by r. Counterintuitive: the tree grows as r gets smaller, and the higher portions of the tree are shorter!

The λ-tree: the tree height is indexed by 1/r^d, a rescaling proportional to the values of the k-nn density estimator. The proportions are right but it is not very interpretable.

The α-tree: for each α ∈ (0, 1) set r̂_α to be the α-quantile of r̂_1, . . . , r̂_n. The tree height is indexed by α: the α-level of the tree represents L̂(r̂_α), i.e. the α-fraction of “most clusterable" data points. Highly interpretable.

The κ-tree: each branch has length equal to the fraction of points comprising it. The sum of the lengths of all branches and leaves is 1. Highly interpretable.
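A small sketch (illustrative only, not the DeBaCl or TDA API) of the α re-indexing: the α-level of the tree corresponds to the α-quantile of the estimated radii r̂_1, . . . , r̂_n, so that the level set holds roughly an α-fraction of the data.

import numpy as np

def alpha_to_r(r_hat, alphas):
    # the alpha-quantile of the k-nn radii; {i : r_hat[i] <= r_alpha} then holds
    # roughly an alpha-fraction of the data
    return np.quantile(r_hat, alphas)

r_hat = np.random.default_rng(2).gamma(2.0, size=1000)   # stand-in radii for the example
for a, r in zip([0.25, 0.5, 0.9], alpha_to_r(r_hat, [0.25, 0.5, 0.9])):
    print(a, round(float(r), 3), float(np.mean(r_hat <= r)))   # last column is ~alpha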

A. Rinaldo DeBaCl : Density Based Clustering 38/43




Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Cluster trees come in many flavors: r , λ, α...and κ

[Figure: the same simulated data set with its cluster labels, and the estimated cluster tree drawn four ways: as a λ-tree, an r-tree, an α-tree and a κ-tree.]

A. Rinaldo DeBaCl : Density Based Clustering 38/43


Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Cluster trees come in many flavors: r , λ, α...and κ

The difference can be substantial.

A. Rinaldo DeBaCl : Density Based Clustering 38/43


Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Weak convergence and the bootstrap

Let B be the P-Brownian bridge indexed by the closed balls in R^d: a centered Gaussian process with covariance function

(B, B′) ↦ P(B ∩ B′) − P(B)P(B′).

Denote by B(x, r) the value of B at the ball B(x, r).

Recall that, for x ∈ R^d, F_x is the c.d.f. of ‖X − x‖ with X ∼ P and r_p(x) is the p-th quantile of F_x.

Theorem (Functional delta method)
Assume that, for some constant c > 0 and for P-almost all x,

F_x'(r_p(x)) ≥ c.

Then

{ √n ( r̂_p(x) − r_p(x) ), x ∈ R^d }  converges weakly to  { B(x, r_p(x)) / F_x'(r_p(x)), x ∈ R^d }.

A. Rinaldo DeBaCl : Density Based Clustering 39/43




Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Weak convergence and the bootstrap

Let {r̂*_p(x), x ∈ R^d} be the bootstrap version of r̂_p, based on the empirical distribution of X_n.

Corollary (Bootstrap validity)
Conditionally almost surely,

{ √n ( r̂*_p(x) − r̂_p(x) ), x ∈ R^d }  converges weakly to  { B(x, r_p(x)) / F_x'(r_p(x)), x ∈ R^d }.

Using the bootstrap, we can construct asymptotically correct confidence bands for r_p and the DeBaCl trees.
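A hedged sketch of how such a band could be computed in practice: resample X_n with replacement, recompute the k-nn radii at the original sample points, and take a quantile of the sup-differences. The helper names below are illustrative, not the DeBaCl or TDA API.

import numpy as np
from scipy.spatial import cKDTree

def knn_radii_at(points, sample, k):
    # k-th smallest distance from each query point to the sample; when the point
    # belongs to the sample, its zero self-distance counts, matching the empirical quantile
    tree = cKDTree(sample)
    dist, _ = tree.query(points, k=k)
    return dist[:, -1] if k > 1 else dist

def bootstrap_band(X, p, B=200, level=0.95, seed=None):
    # half-width of a uniform band for r_p at the sample points, from the bootstrap
    # distribution of sup_i |r*_i - r_hat_i|
    rng = np.random.default_rng(seed)
    n = len(X)
    k = int(np.ceil(n * p))
    r_hat = knn_radii_at(X, X, k)
    sups = np.empty(B)
    for b in range(B):
        Xb = X[rng.integers(0, n, size=n)]      # resample with replacement
        r_star = knn_radii_at(X, Xb, k)         # radii recomputed at the original points
        sups[b] = np.max(np.abs(r_star - r_hat))
    return r_hat, np.quantile(sups, level)

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 2))
r_hat, half_width = bootstrap_band(X, p=0.05)
print(half_width)   # band: [r_hat - half_width, r_hat + half_width] over the sample points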

A. Rinaldo DeBaCl : Density Based Clustering 40/43




Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Bootstrap confidence sets example

A. Rinaldo DeBaCl : Density Based Clustering 41/43




Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Conclusions

Density-based clustering is a principled paradigm for clustering.
The cluster tree provides an interpretable and highly informative encoding of all the clustering properties of P.
DeBaCl is a simple, computationally efficient algorithm for consistently estimating the cluster tree, even in high dimensions.

Current and future work:
Work out the theoretical properties of the α- and κ-trees.
Develop and study methods for constructing confidence sets for cluster trees and, more generally, for using cluster trees for statistical inference.
Provide guidelines on how to choose p!

A. Rinaldo DeBaCl : Density Based Clustering 42/43



Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

CMU TopStat

A. Rinaldo DeBaCl : Density Based Clustering 43/43
