0% found this document useful (0 votes)

69 views

DBSCAN

The document discusses density-based clustering and DeBaCl, a density-based clustering algorithm. It provides an outline that covers density-based clustering, the DeBaCl algorithm and applications, and a theoretical analysis of DeBaCl. Density-based clustering assumes the data is drawn from an underlying probability distribution with density and seeks to identify clusters as connected high-density regions. DeBaCl performs density-based clustering and its properties are analyzed in the document.

Uploaded by

KalpanaMohan

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

69 views

DBSCAN

Uploaded by

KalpanaMohan

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 119

Density-based Clustering

DeBaCl
Theoretical Analysis of DeBaCl

DeBaCl: a Density-based Clustering Algorithm

and its Properties

Alessandro Rinaldo
Department of Statistics
Carnegie Mellon University

joint work with Brian Kent and Fabrizio Lecci

Thanks to: Larry Wasserman, Bertrand Michel, Fred Chazal
and Timothy Verstynen

November 10, 2014

Department of Statistics
Rice University

A. Rinaldo DeBaCl : Density Based Clustering 1/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Outline

Density-based clustering.

The algorithm DeBaCland some applications.

Theoretical analysis of DeBaCl.

A. Rinaldo DeBaCl : Density Based Clustering 2/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Clustering

Classic problem in statistics, computer science, probability and many

other fields. Huge literature!

Abstract formulation: optimally organize a set of objects into groups, so

that objects in the same group are maximally similar and objects in
different groups are maximally dissimilar.

Goal, scope and performance of a given clustering task is in many cases

poorly or only partially defined.

Analyses of clustering procedures often focus on the algorithmic

properties, and tend to ignore the probabilistic nature of the input.

A. Rinaldo DeBaCl : Density Based Clustering 3/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Clustering in Euclidean spaces

In much of this talk, we are interested in clustering Xn = (X1 , . . . , Xn ), an i.i.d.

sample from a probability distribution P with support S ⊂ Rd .

A. Rinaldo DeBaCl : Density Based Clustering 4/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Clustering in Euclidean spaces

4
2
normal4[1:1000, ][,2]

0
-2
-4

-4 -2 0 2 4

normal4[1:1000, ][,1]

A. Rinaldo DeBaCl : Density Based Clustering 4/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Clustering in Euclidean spaces

EngyTime, n = 4096, dimension = 2, classes = 2, main problem: gaussian mixture

3
TIME

−1

−2

−3
−4 −2 0 2 4 6 8
ENGY

Source: Fundamental Clustering Problems Suite (FCPS).

A. Rinaldo DeBaCl : Density Based Clustering 4/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Clustering in Euclidean spaces

Target, n = 770, dimension = 2, classes = 6, main problem: outlier

0
y

−1

−2

−3
−3 −2 −1 0 1 2 3
x

Source: Fundamental Clustering Problems Suite (FCPS).

A. Rinaldo DeBaCl : Density Based Clustering 4/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Clustering in Euclidean spaces

Chainlink, n = 1000, dimension = 3, classes = 2, main problem: linear not separable

1.5

0.5

0
z

2
−0.5

−1
0
−1.5
2 1 −2
0 −1 x
y

Source: Fundamental Clustering Problems Suite (FCPS).

A. Rinaldo DeBaCl : Density Based Clustering 4/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Clustering in Euclidean spaces

Atom, n = 800, dimension = 3, classes = 2, main problem: different variances and linear not separable

0
z

−50

50 50

0 0

−50
y −50 x

Source: Fundamental Clustering Problems Suite (FCPS).

A. Rinaldo DeBaCl : Density Based Clustering 4/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Clustering in Euclidean spaces

We will try to be as agnostic as possible about P:

P has a density with respect to k -dimensional Hausdorff measure or
mixtures thereof, k = {1, . . . , d};
the dimension k = dim(S) is unknown;
the smoothness of f is unknown;
the number of clusters is unknown;
we are interested in both the algorithmic and statistical challenges of
high dimensions.

We believe that many of our results extend to clustering of functional data.

A. Rinaldo DeBaCl : Density Based Clustering 4/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Density-based clustering – Hartigan (1975, 1981)

Assume P has a density f . For a threshold λ > 0, the λ-upper level set
(high density region) of f is

L(λ) = {x ∈ Rd : f (x) ≥ λ}.

Definition (λ-Clusters)
A λ-cluster of P is a maximal connected component of L(λ).

More interpretable twist (see Rinaldo et al., 2012).

For α ∈ [0, 1], set λα = sup{λ : P (L(λ)) ≥ α} and L(α) = L(λα ).

Definition (α-Clusters)
A α-cluster of P is maximal connected component of L(α). Minimal volume
set of prescribed probability content.

A. Rinaldo DeBaCl : Density Based Clustering 5/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Density-based clustering – Hartigan (1975, 1981)

Assume P has a density f . For a threshold λ > 0, the λ-upper level set
(high density region) of f is

L(λ) = {x ∈ Rd : f (x) ≥ λ}.

Definition (λ-Clusters)
A λ-cluster of P is a maximal connected component of L(λ).

More interpretable twist (see Rinaldo et al., 2012).

For α ∈ [0, 1], set λα = sup{λ : P (L(λ)) ≥ α} and L(α) = L(λα ).

Definition (α-Clusters)
A α-cluster of P is maximal connected component of L(α). Minimal volume
set of prescribed probability content.

A. Rinaldo DeBaCl : Density Based Clustering 5/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Density-based clustering – Hartigan (1975, 1981)

Consider all thresholds simultaneously!

The family of all λ-clusters of P is called the cluster tree of P because it
has the tree property: A, B ∈ T implies that

A ⊂ B or B ⊂ A or A ∩ B = ∅.

The hierarchy of inclusions of T can be represented as a dendrogram,

with height indexed by λ or α.

Many subtle topological and measure-theoretical details: see Steinwart

(2014).

A. Rinaldo DeBaCl : Density Based Clustering 6/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Density-based clustering – Hartigan (1975, 1981)

Consider all thresholds simultaneously!

The family of all λ-clusters of P is called the cluster tree of P because it
has the tree property: A, B ∈ T implies that

A ⊂ B or B ⊂ A or A ∩ B = ∅.

The hierarchy of inclusions of T can be represented as a dendrogram,

with height indexed by λ or α.

Many subtle topological and measure-theoretical details: see Steinwart

(2014).

A. Rinaldo DeBaCl : Density Based Clustering 6/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Density-based clustering in action

0 5 10

A. Rinaldo DeBaCl : Density Based Clustering 7/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Density-based clustering in action

0 5 10

A. Rinaldo DeBaCl : Density Based Clustering 7/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Density-based clustering in action

0 5 10

A. Rinaldo DeBaCl : Density Based Clustering 7/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Density-based clustering in action

0 5 10

A. Rinaldo DeBaCl : Density Based Clustering 7/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Density-based clustering in action

0 5 10

A. Rinaldo DeBaCl : Density Based Clustering 7/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Density-based clustering in action

λ α

0
0.21

0.2
0.17

0.3
0.13

0.4
0.1

0.6
0.04 0.07

0.8
0.9
0

1
0 5 10

A. Rinaldo DeBaCl : Density Based Clustering 7/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Cluster Trees

The cluster tree T captures all the clustering properties of P

simultaneously.
The cluster tree is an algebraic structure for visualizing and encoding P.
It is largely decoupled from the geometry and dimension of P.
There is no need to choose the number of clusters.

A. Rinaldo DeBaCl : Density Based Clustering 8/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Trees, branches and leaves

Partition Property
The leaves and branches of the tree partition S = supp(P).

A. Rinaldo DeBaCl : Density Based Clustering 9/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Issue I: Density-based clustering is statistically hard

To “estimate T consistently" using a density estimator bf , we need sup norm

consistency, i.e. supx |f (x) − bf (x)| = oP (1).

The minimax rate (attained by KDEs with vanishing bandwidth) for this
problem over Hölder classes of densities is
β
log n 2β+d
,
n
where β is the smoothness parameter.
This typically requires a sample size exponential in d. Consistent
estimation of T is unfeasible in high-dimensions.

A. Rinaldo DeBaCl : Density Based Clustering 10/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Issue I: Density-based clustering is statistically hard

To “estimate T consistently" using a density estimator bf , we need sup norm

consistency, i.e. supx |f (x) − bf (x)| = oP (1).

A. Rinaldo DeBaCl : Density Based Clustering 10/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Issue II: density-based clustering is algorithmically hard

Even assuming f known, deciding whether x and y are in the same

λ-cluster of f requires finding a path ` ⊂ S between x and y such that
f (z) ≥ λ for all z ∈ `.

This computation is prohibitively difficult even in moderate dimensions.

Building T is unfeasible in high-dimensions.

Fix
Use (slightly) incorrect connectivity based on Xn .

A. Rinaldo DeBaCl : Density Based Clustering 11/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Issue II: density-based clustering is algorithmically hard

Even assuming f known, deciding whether x and y are in the same

λ-cluster of f requires finding a path ` ⊂ S between x and y such that
f (z) ≥ λ for all z ∈ `.

This computation is prohibitively difficult even in moderate dimensions.

Building T is unfeasible in high-dimensions.

Fix
Use (slightly) incorrect connectivity based on Xn .

A. Rinaldo DeBaCl : Density Based Clustering 11/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Issue II: density-based clustering is algorithmically hard

Even assuming f known, deciding whether x and y are in the same

λ-cluster of f requires finding a path ` ⊂ S between x and y such that
f (z) ≥ λ for all z ∈ `.

This computation is prohibitively difficult even in moderate dimensions.

Building T is unfeasible in high-dimensions.

Fix
Use (slightly) incorrect connectivity based on Xn .

A. Rinaldo DeBaCl : Density Based Clustering 11/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

How to deal with curse of dimensionality in density-based clustering

So you are all about the bias...big trouble!

Statistical hardness is an unavoidable bias issue. So let’s ignore it!

Suppose P has Lebesgue density f , assumed Hölder smooth with

parameter β. Let bfh be a KDE with bandwitdh h
n
bfh (x) = 1
X 1 kx − Xi k
K , x ∈ Rd .
n hd h
i=1

For eaxh h > 0, bfh is an unbiased estimator of the density

Z
1 ky − xk
fh (x) = d f (y )K dx, x ∈ Rd .
h Rd h

fh is much easier to estimate than f !!

A. Rinaldo DeBaCl : Density Based Clustering 12/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

How to deal with curse of dimensionality in density-based clustering

So you are all about the bias...big trouble!

Statistical hardness is an unavoidable bias issue. So let’s ignore it!

Suppose P has Lebesgue density f , assumed Hölder smooth with

parameter β. Let bfh be a KDE with bandwitdh h
n
bfh (x) = 1
X 1 kx − Xi k
K , x ∈ Rd .
n hd h
i=1

For eaxh h > 0, bfh is an unbiased estimator of the density

Z
1 ky − xk
fh (x) = d f (y )K dx, x ∈ Rd .
h Rd h

fh is much easier to estimate than f !!

A. Rinaldo DeBaCl : Density Based Clustering 12/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

How to deal with course of dimensionality

By the Hölder assumption and Gine’ and Guillon (2002):

sup |f (x) − bfh (x)| ≤ sup |f (x) − fh (x)| + sup |fh (x) − bfh (x)|
x x x
| {z } | {z }
bias random fluctuations
r !
1
= O(hβ ) + OP
nhd

Ignoring the bias and for fixed h fh can be well estimated with the nearly
parameric, dimension independent rate:
r !
log n
sup |fh (x) − fh (x)| = O
b ,
x n
with high probability. The dimension is in the constants!

A. Rinaldo DeBaCl : Density Based Clustering 13/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

How to deal with course of dimensionality

By the Hölder assumption and Gine’ and Guillon (2002):

sup |f (x) − bfh (x)| ≤ sup |f (x) − fh (x)| + sup |fh (x) − bfh (x)|
x x x
| {z } | {z }
bias random fluctuations
r !
1
= O(hβ ) + OP
nhd

A. Rinaldo DeBaCl : Density Based Clustering 13/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

One may measure the difficulty of a clustering problem depending on

whether density clustering based on biased density estimation can be
successful.

Easy Hard

-5 0 5 -5 0 5

Another major advantage of allowing for bias is that it extends the

applicability of density-based clustering to singular P.
See, e.g., Rinaldo and Wasserman (2010).

A. Rinaldo DeBaCl : Density Based Clustering 14/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

One may measure the difficulty of a clustering problem depending on

whether density clustering based on biased density estimation can be
successful.

Easy Hard

-5 0 5 -5 0 5

Another major advantage of allowing for bias is that it extends the

applicability of density-based clustering to singular P.
See, e.g., Rinaldo and Wasserman (2010).

A. Rinaldo DeBaCl : Density Based Clustering 14/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Very large amount of literature.

Cluster tree estimation: Koltchinksii (2000), Stuetzle and Nugent (2010).

More recently: Chaudhuri, Dasgupta, Kptufe and von Luxburg (2013),
Balakrishnan et al. (2013) and Steinwary (2014).

Support estimation: Korostelev and Tsybakov (1993), Mammen and

Tsybakov (1995), Cuevas and Fraiman (1997), Biau, Cadre and Pellettier
(2008).
Level set estimation for fixed λ: Polonik (1995), Tsybakov (1997), Walther
(1997), Scott and Nowak (2006), Cuevas, González-Menteiga and
Rodríguez-Casal (2006), Singh, Scott and Nowak (2009), Rigollet and Vert
(2010).

Some algorithms: DBSCAN, OPTICS, denpro.

A. Rinaldo DeBaCl : Density Based Clustering 15/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Meet DeBaCl

DeBaCl is a simple algorithm for density-based clustering.

We credit Kpotufe and von Luxburg (2011).
It is based on the k-nn density estimator.

Implementations:
pyton module DeBaCl by Brian Kent (update coming soon)
https://github.com/CoAxLab/DeBaCl

R package TDA by Fabrizio Lecci et al.

http://cran.r-project.org/web/packages/TDA/index.html

A. Rinaldo DeBaCl : Density Based Clustering 16/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Meet DeBaCl

DeBaCl is a simple algorithm for density-based clustering.

We credit Kpotufe and von Luxburg (2011).
It is based on the k-nn density estimator.

Implementations:
pyton module DeBaCl by Brian Kent (update coming soon)
https://github.com/CoAxLab/DeBaCl

R package TDA by Fabrizio Lecci et al.

http://cran.r-project.org/web/packages/TDA/index.html

A. Rinaldo DeBaCl : Density Based Clustering 16/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Meet DeBaCl

For fixed p ∈ (0, 1) and each i = 1, . . . , n, set b

ri be the distance of Xi from its
k -th nearest neighbors in Xn , with k = dnpe.

Input: p ∈ (0, 1) and Xn

1. Construct the knn graph Gbn with nodes Xn and edges (Xi , Xj )
(i) if kXi − Xj k ≤ max{b rj } (k-nn)
ri , b
(ii) if kXi − Xj k ≤ max{b rj } (mutual k-nn)
ri , b
2. For all r ∈ R := [mini b
ri , maxi b
ri ]
(i) set Gbn (r ) be subgraph induced by {Xi : b
ri ≤ r }.
(ii) compute the connected components of Gbn (r ).
Output {Tbn (r ), r ∈ R}, the dendrogram of the connected components.

A. Rinaldo DeBaCl : Density Based Clustering 17/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Meet DeBaCl

A. Rinaldo DeBaCl : Density Based Clustering 17/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Meet DeBaCl

A. Rinaldo DeBaCl : Density Based Clustering 17/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Meet DeBaCl

A. Rinaldo DeBaCl : Density Based Clustering 17/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Meet DeBaCl

A. Rinaldo DeBaCl : Density Based Clustering 17/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Remarks on DeBaCl

The input to DeBaCl are the k-nn distances b r1 , . . . , b

rn that are used
both to compute level sets and to determine connectedness.
Computational complexity. The computation of all the k-nn’s has
complexity O (n log n) (using k -d trees, ball-trees and cover-trees).
The complexity of constructing all the connected components is nearly
linear in n because it relies on a modified union-find procedure (Najman
and Couprie, 2006) and never uses breadth-first search.
DeBaCl outputs a data structure.

Why k-nn and not KDE? A KDE version of DeBaCl is easy enough to
devise but we prefer k-nn...

A. Rinaldo DeBaCl : Density Based Clustering 18/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Remarks on DeBaCl

The input to DeBaCl are the k-nn distances b r1 , . . . , b

Why k-nn and not KDE? A KDE version of DeBaCl is easy enough to
devise but we prefer k-nn...

A. Rinaldo DeBaCl : Density Based Clustering 18/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Example 1: clustering endpoints of fiber tracks in the striatum

The fiber endpoint data is derived from in vivo difusion weighted brain
imaging (DWI) collected at the Scientific Imaging and Brain Research
Center at Carnegie Mellon University in 2012 for 30 neurologically
healthy controls (the CMU-30 group).
From the DWI data, deterministic fiber tractography was used to simulate
smooth 1-dimensional manifolds (with boundaries) called fiber
streamlines that represent tracks of strong water diffusion in the brain
(Hagmann et al., 2006).
10, 000 fiber streamlines were mapped from the cortex into the striatum
for a single subject. Only the teminal points of the streamlines were kept.
k = 200
Work done in collaboration with Timothy Verstynen:
http://www.psy.cmu.edu/~coaxlab/

A. Rinaldo DeBaCl : Density Based Clustering 19/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Example 1: clustering endpoints of fiber tracks in the striatum

3.2. CLUSTER RETRIEVAL OPTIONS 31

(a)

(b) (c)

Figure 3.1: Fiber streamlines and streamline endpoints for one subject in the CMU-
30 group. (a) 10,000 streamlines mapped from DeBaCl
A. Rinaldo the middle frontal
: Density gyrusClustering
Based to the stria- 19/43
Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Example 1: clustering endpoints of fiber tracks in the striatum

32 CHAPTER 3. DATA EXPLORATION AND CLUSTERING

Figure 3.2: Interactive exploration of data subsets with the dendrogram. The top
row shows the dendrogram forA.the striatal
Rinaldo endpoints
DeBaCl inClustering
: Density Based Figure 19/43
3.1, with di↵erent
Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Example
3.2. 1: clustering
CLUSTER endpoints
RETRIEVAL of fiber tracks in the striatum
OPTIONS 33

A. Rinaldo DeBaCl : Density Based Clustering 19/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Example
34 1: clusteringCHAPTER
endpoints of fiber
3. DATA tracks inAND
EXPLORATION the CLUSTERING
striatum

(a) (b)
A. Rinaldo DeBaCl : Density Based Clustering 19/43
Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Example 2: clustering individuals into populations using SNPs

Data from The Human Genome Diversity Project (HGDP) dataset,

available at
http://www.hagsc.org/hgdp/files.html.
Cleaned-up comprised of 11,775 SNPs from 931 subjects from 53
populations from Crosset et al. (2010).
The goal of the analysis is to identify the hierarchy of high-density
clusters of individuals in the sample, ideally capturing the correct
membership in populations.
In the first level set tree k = 40, in the second k = 6.

A. Rinaldo DeBaCl : Density Based Clustering 20/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Example 2: clustering individuals into populations using SNPs

44 CHAPTER 3. DATA EXPLORATION AND CLUSTERING

(a) (b)

Figure 3.6: Oversmoothed level set tree result for population genetics. (a) When
the smoothing parameter k is set to 40, the estimated level set tree for the HGDP
population genetic data has only four all-mode clusters. (b) The map shows these
clusters describe continent groups well, but do not capture more detailed population
A. Rinaldo DeBaCl : Density Based Clustering 20/43
dendrogram). The DeBaClclusters in this result are very well m
Density-based Clustering

Theoretical Analysis of DeBaCl

Example 2: clustering individuals into populations using SNPs

(a)

A. Rinaldo DeBaCl : Density Based Clustering 20/43

Density-based Clustering
DeBaCl
3.4. HIGH-DIMENSIONAL EXPERIMENTS
Theoretical Analysis of DeBaCl 45
Example 2: clustering individuals into populations using SNPs

(a)
A. Rinaldo DeBaCl : Density Based Clustering 20/43
Density-based Clustering
n this result are veryTheoretical
wellAnalysis
matched
DeBaCl
of DeBaCl
to populations.
Example 2: clustering individuals into populations using SNPs

(b)

A. Rinaldo DeBaCl : Density Based Clustering 20/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Example 2: clustering individuals into populations using SNPs

(a)

(b)
A. Rinaldo DeBaCl : Density Based Clustering 20/43
Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Example 3: clustering phonemes (functional data)

The phoneme dataset, from Ferraty and Vieu (2006) contains

log-periodograms of 2000 instances of digitized human speech, divided
evenly between five phonemes: “sh", “dcl" (as in “dark"), “iy" (as in the
vowel of “she"), “aa", and “ao". Each recording is treated as a single
functional observation, which was smoothed using a cubic spline.
Distance between function is the L2 distace (each phoneme is observed
over 150 frequencies).
k = 20.

A. Rinaldo DeBaCl : Density Based Clustering 21/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Example 3: clustering phonemes (functional data)

50 CHAPTER 4. PSEUDO-DENSITIES

(a) (b)

(e) (f)

Figure 4.1: Spoken phoneme data, separated into the 5 true classes. Each phoneme is
represented by the log-periodogram at 150 fixed frequencies (see Hastie et al. (1995)
for details), which we smoothed with a cubic spline. The phonemes are (a) ‘sh’, (b)
‘iy’, (c) ‘dcl’, (d) ‘aa’, and (e) ‘ao’. (f) The mean function for each phoneme; color
A. Rinaldo
indicates correspondence between DeBaCl
phoneme class and mean: Density Based Clustering
function. 21/43
Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Example 3: clustering phonemes (functional data)

4.3. HURRICANE TRACKS 51

Figure 4.2: Pseudo-density values for the phoneme data. More intense shades indi-
cate higher pseudo-density values,A. Rinaldo
implyingDeBaCl
greater proximity to neighbors.
: Density Based Clustering 21/43
Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Example 3: clustering phonemes (functional data)

Figure 4.2: Pseudo-density values for the phoneme data. More intense shades indi-
cate higher pseudo-density values, implying greater proximity to neighbors.

(a) (b)

Figure 4.3: (a) Phoneme level set tree. (b) All-mode clusters. The modal observation
in each all-mode cluster is shown with a black outline.
A. Rinaldo DeBaCl : Density Based Clustering 21/43
Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Example 4: clustering hurricane tracks (functional data)

The U.S. National Hurricane Centers’s HURDAT dataset contains

positional and atmospheric measurements of North Atlantic tropical
cyclons from 1851 to 2012 (Landsea et al., 2013). The coordinates (in
degrees latitude and longitude) for each storm are recorded at least
every six hours.
The processed dataset contained 398 hurricane tracks.
Pairwise distances based on max-average-min distance (not a metric).
k = 6 (γ = 2).

A. Rinaldo DeBaCl : Density Based Clustering 22/43

The data are shown in Figure 4.4, colored by pseudo-density (darker hues correspond
Density-based Clustering
DeBaCl
to high pseudo-density values).
Theoretical AnalysisJust by looking at this map, we expected at least
of DeBaCl

Example 4:mode
one major clustering
centeredhurricane trackspeninsula
near the Yucatán (functional data)
and one just o↵shore of the

U.S east coast.

Figure 4.4: Hurricane track data, shaded by pseudo-density value. Curves with more
intense color have higher pseudo-density, indicating greater proximity to neighbors.
A. Rinaldo DeBaCl : Density Based Clustering 22/43
54 CHAPTER 4. PSEUDO-DENSITIES
Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl
of similarity; that is, the Gulf of Mexico storms are more similar to each other than
Example 4: clustering hurricane tracks (functional data)
to any other cluster. The only cluster that seems possibly out of place is the plain

green, which is closer to the mid-Atlantic pink and orange than to the Gulf of Mexico

storms, despite substantial overlap with the latter.

(a) (b)

Figure 4.5: (a) Hurricane track level set tree. (b) All-mode clusters. Color indicates
correspondence between dendrogram branches and clusters on the map; low-density,
unclustered background tracks are shown faintly
A. Rinaldo DeBaCl in gray.
: Density Based Clustering 22/43
Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Example 5: clustering fiber tracks (functional data)

Fiber tractography datasets obtained through DSI techniques. Focus on

two corticostriatal pathways: lateral frontal (middle frontal gyrus to
striatum) and orbitofrontal (gyrus rectus to striatum).
A 30 subject template was used.
We used DeBaClto perform whole fiber tracks segmentation and looked
at tracks in the lateral frontal cortex and orbitofrontal cortex. Total of
51,126 fibers.
Pairwise distances based on max-average-min distance (not a metric).
k = d0.25 ∗ ne
Work done in collaboration with Timothy Verstynen:
http://www.psy.cmu.edu/~coaxlab/

A. Rinaldo DeBaCl : Density Based Clustering 23/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Example 5: clustering fiber tracks (functional data)

A. Rinaldo DeBaCl : Density Based Clustering 23/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Example 5: clustering fiber tracks (functional data)

A. Rinaldo DeBaCl : Density Based Clustering 23/43

Density-based Clustering
Figure 6. Level set tree clustering for whole fiber streamlines. A) Foreground fibers for the seven selected clusters from the 30 su
DeBaCl
templateTheoretical
data set for streamlines
Analysis oftracked
DeBaCl between the middle frontal gyrus and striatum, shown in both a sagittal and coronal view. Cluster
colored according to an all-mode clustering of the tree. B) The level set tree for data in panel A. Tree leaves are matched to fiber clusters by col
Same analysis as shown in A, but for a set of streamlines from the orbitofrontal cortex. Inset shows closeup of fiber streamlines in the striata
Example 5: clustering fiber tracks (functional data)
mask. D) Level set tree for data shown in panel C. The branch colors of trees in panels B and D match the clusters shown in the streamlines of p
A and C respectively.
doi:10.1371/journal.pone.0093344.g006

Figure 7. Comparison of methods for whole-fiber segmentation. A, B) High-density fiber pathway clusters from the level set tree all-m
method for middle frontal gyrus fibers in two subjects. C, D) Single linkage hierarchical clustering results for the same fiber pathways, wit
dendrogram cut to match the same number of clusters in the level set tree result. E, F) K-means clustering results for the sample fiber pathw
A. Rinaldo
doi:10.1371/journal.pone.0093344.g007 DeBaCl : Density Based Clustering 23/43
Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Theoretical analysis of DeBaCl: set-up

Let P a non-atomic probability measure supported on S ⊂ Pd . We allow

for dim(S) < d and for S to be of mixed dimension.

Fix a number p ∈ (0, 1). Define the function rp : Rd → R+ given by

n o
x 7→ rp (x) = inf r > 0 : P (B(x, r )) ≥ p .

Thus rp (x) is the p-th quantile of the univariate variable kX − xk, X ∼ P.

DeBaCl estimates rp and its lower level sets at the sample points Xn .

A. Rinaldo DeBaCl : Density Based Clustering 24/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Theoretical analysis of DeBaCl: set-up

Let P a non-atomic probability measure supported on S ⊂ Pd . We allow

for dim(S) < d and for S to be of mixed dimension.

Fix a number p ∈ (0, 1). Define the function rp : Rd → R+ given by

n o
x 7→ rp (x) = inf r > 0 : P (B(x, r )) ≥ p .

Thus rp (x) is the p-th quantile of the univariate variable kX − xk, X ∼ P.

DeBaCl estimates rp and its lower level sets at the sample points Xn .

A. Rinaldo DeBaCl : Density Based Clustering 24/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Theoretical analysis of DeBaCl: set-up

Let P a non-atomic probability measure supported on S ⊂ Pd . We allow

for dim(S) < d and for S to be of mixed dimension.

Fix a number p ∈ (0, 1). Define the function rp : Rd → R+ given by

n o
x 7→ rp (x) = inf r > 0 : P (B(x, r )) ≥ p .

Thus rp (x) is the p-th quantile of the univariate variable kX − xk, X ∼ P.

DeBaCl estimates rp and its lower level sets at the sample points Xn .

A. Rinaldo DeBaCl : Density Based Clustering 24/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

1 kxk ≤ 1
Let K be the uniform kernel K (x) = and consider the
0 otherwise
biased, Lebesgue density
Z
1 kx − y k p 1
fp (x) = d
K dP(y ) = d
∝ d , x ∈ Rd ,
vd rp (x) Rd r p (x) vd rp (x) rp (x)

with vd the volume of unit ball in Rd .

The empirical equivalent of fp (x) is the k-nn density estimator, where

k = dpne:
bfp (x) = k 1
, x ∈ Rd
rpd (x)
n vd b
rp (x) the distance from x to its k -th nearest neighborhood in Xn .
with b

A. Rinaldo DeBaCl : Density Based Clustering 25/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

with vd the volume of unit ball in Rd .

The empirical equivalent of fp (x) is the k-nn density estimator, where

k = dpne:
bfp (x) = k 1
, x ∈ Rd
rpd (x)
n vd b
rp (x) the distance from x to its k -th nearest neighborhood in Xn .
with b

A. Rinaldo DeBaCl : Density Based Clustering 25/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

fp is biased
If P has a continuous Lebesgue density f , then, a.e.,
p
lim = f (x),
p→0 vd rpd (x)

and convergence holds uniformly over compacts. If P has a continuous

density f w.r.t. Hk , then, a.e.,
p
lim k
= f (x),
p→0 vk rx,p

A. Rinaldo DeBaCl : Density Based Clustering 26/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Density-based clustering with no densities

The upper level sets of fp (·) are the lower level sets of rp (·).

For r > 0, let L(r ) = {x ∈ Rd : rp (x) ≤ r }∩S.

Define the r -clusters of P as the maximal (path) connected component
of L(r ). The resulting tree is algorithmically hard.

Algorithmic connectedness. Two points in L(r ) are algorithmically

connected if there exists {x = z0 , z1 , . . . , zm = y } ⊂ L(r ) such that
zi ∈ B(zi+1 , rzi +1 ).

Algorithmic Tree
The algorithmic tree T is the dendrogram of the r -clusters of P with path
connectedness replaced by algorithmic connectedness.

A. Rinaldo DeBaCl : Density Based Clustering 27/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Density-based clustering with no densities

The upper level sets of fp (·) are the lower level sets of rp (·).

For r > 0, let L(r ) = {x ∈ Rd : rp (x) ≤ r }∩S.

Define the r -clusters of P as the maximal (path) connected component
of L(r ). The resulting tree is algorithmically hard.

Algorithmic connectedness. Two points in L(r ) are algorithmically

connected if there exists {x = z0 , z1 , . . . , zm = y } ⊂ L(r ) such that
zi ∈ B(zi+1 , rzi +1 ).

Algorithmic Tree
The algorithmic tree T is the dendrogram of the r -clusters of P with path
connectedness replaced by algorithmic connectedness.

A. Rinaldo DeBaCl : Density Based Clustering 27/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Density-based clustering with no densities

A. Rinaldo DeBaCl : Density Based Clustering 27/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Density-based clustering with no densities

A. Rinaldo DeBaCl : Density Based Clustering 27/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Density-based clustering with no densities

Connectedness implies algorithmic connectedness but not the other way

around.

A. Rinaldo DeBaCl : Density Based Clustering 27/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Density-based clustering with no densities

Connectedness implies algorithmic connectedness but not the other way

around.

A. Rinaldo DeBaCl : Density Based Clustering 27/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Density-based clustering with no densities

Empirical Tree
The empirical tree Tbn is the dendrogram produced by DeBaCl (i.e. based on
the estimated distances b r1 , . . . , b
rn ).

Oracle Tree
The oracle tree Tn is the dedrogram of nested subsets of Xn obtained by
running DeBaCl if an oracle gave us the true values of rp (X1 ), . . . , rp (Xn ).

They are both data dependent!

A. Rinaldo DeBaCl : Density Based Clustering 28/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

A tale of many trees

The “density" tree (high density clusters of f )

my not be defined (if, e.g., S is of mixed dimension)
is statistically and algorithmically hard in d.
The algorithmic r-tree (tree of r -clusters with path connectedness
replaced by algorithmic connectedness): our target.

The oracle tree: the output of DeBaCl using the true rXi ’s.
The empirical tree: the output of DeBaCl, using the estimated distances
rX1 , . . . , b
b rXn .

A. Rinaldo DeBaCl : Density Based Clustering 29/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

A tale of many trees

The “density" tree (high density clusters of f )

The oracle tree: the output of DeBaCl using the true rXi ’s.
The empirical tree: the output of DeBaCl, using the estimated distances
rX1 , . . . , b
b rXn .

A. Rinaldo DeBaCl : Density Based Clustering 29/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

A tale of many trees

A. Rinaldo DeBaCl : Density Based Clustering 29/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

A tale of many trees

A. Rinaldo DeBaCl : Density Based Clustering 29/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

A tale of many trees

A. Rinaldo DeBaCl : Density Based Clustering 29/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

A tale of many trees

A. Rinaldo DeBaCl : Density Based Clustering 29/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Consistency of DeBaCl

Tree Consistency
A cluster A of T is said to be consistency estimable if, with high probability,
b n := A ∩ Xn is a cluster of Tbn . Consistency of the tree follows from uniform
A
consistency over clusters.

To prove consistency we will show that

Tbn the (empirical tree) and Tn (the oracle tree) yield nearly the same
hierarchy of clusters over Xn , with high probability;
Tn (the oracle tree) consistently estimates most of T (the algorithmic
tree).

A. Rinaldo DeBaCl : Density Based Clustering 30/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Consistency of DeBaCl

To prove consistency we will show that

A. Rinaldo DeBaCl : Density Based Clustering 30/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

The empirical tree Tbn is close to the oracle tree Tn

Write ri and b
ri for rXi and b
rXi .

Regularity Assumption
Let Fx be the cdf of kX − xk, X ∼ P. For some c > 0 and P-almost all x,
Fx0 (rp (x)) ≥ c. The constant c depends on dim(P).

x x

Fx(t) Fx(t)

t t

Figure 1: Two situations where there is no modulus of continuity for the quantile function Fx 1 .
Left: the set is not a connected set. Right: the measure is not (a, b)-standard.
Courtesy of Betrand Michel
4 Application to the A.
geometric
Rinaldo models
DeBaClin: Density
Rd Based Clustering 31/43
Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

The empirical tree Tbn is close to the oracle tree Tn

Lemma
There exists a constant C, depending on dim(P), such that, with probability at
least 1/n, r
log n
max ri − ri ≤ C
b := n .
i n

1
For a uniform distribution in full dimension C ∝ pd
.

A. Rinaldo DeBaCl : Density Based Clustering 31/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

For fixed r > 0, consider the (random) sublevel sets:

Ln (r ) = {Xi ∈ {1, . . . , n} : ri ≤ r }, (oracle) (1)

Ln (r ) = {Xi ∈ {1, . . . , n} : b
b ri ≤ r }, (empirical). (2)

Corollary (Level set consistency)

Uniformly over all r > n , with probability at least 1 − 1/n,

Ln (r − n ) ⊂ b
Ln (r ) ⊂ Ln (r + n ),

and
Ln (r − n ) ⊂ Ln (r ) ⊂ b
b Ln (r + n ).
If the c.d.f. of rp (X ) with X ∼ P is locally Lipschizt atr , withhigh probability
q
log n
the proportion of misclustered nodes at level r is O n
.

A. Rinaldo DeBaCl : Density Based Clustering 32/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Stability (yields correct connectivity)

The interval of heights I = (a, b) is stable for Tn if the dendrogram does not
change when the k-nn distances involved are perturbed by n . Sufficient
conditions are that the clusters are at least 2n - apart and Xn is “sufficiently
dense".

Corollary (Cluster consistency)

If I is a stable interval, then for each a + n ≤ r ≤ b − n , the number of
clusters in Lr (n) and bLn (r ) is the same and constant and each cluster A of
Ln (r ) is such that B ⊂ A ⊂ B 0 , for some B cluster of b
Ln (r − n ) and B 0 of
Ln (r + n ).
b

Stability is a local property and does not hold around at values r of near
branching points.

A. Rinaldo DeBaCl : Density Based Clustering 33/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Stability (yields correct connectivity)

Corollary (Cluster consistency)

Stability is a local property and does not hold around at values r of near
branching points.

A. Rinaldo DeBaCl : Density Based Clustering 33/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Stability (yields correct connectivity)

Corollary (Cluster consistency)

Stability is a local property and does not hold around at values r of near
branching points.

A. Rinaldo DeBaCl : Density Based Clustering 33/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

The oracle tree Tn is close to the algorithmic tree T

We need clusters to be “well-shaped".

Thickness
A cluster is (γ, c)-thick, where γ > 0 and c ≥ 0 if, for any x ∈ A, there exists a
y ∈ B(x, γ) ∩ A such that

B(y , γ/(2 + c)) ∩ supp(P) ⊂ B(x, γ) ∩ A.

Thickness provides controls how narrow a cluster A can get compared to

clusters at contiguous higher levels in the tree. It is satisfied for
well-behaved manifold if γ is small compared to the reach.

A. Rinaldo DeBaCl : Density Based Clustering 34/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

The oracle tree Tn is close to the algorithmic tree T

We need clusters to be “well-shaped".

Thickness
A cluster is (γ, c)-thick, where γ > 0 and c ≥ 0 if, for any x ∈ A, there exists a
y ∈ B(x, γ) ∩ A such that

B(y , γ/(2 + c)) ∩ supp(P) ⊂ B(x, γ) ∩ A.

Thickness provides controls how narrow a cluster A can get compared to

clusters at contiguous higher levels in the tree. It is satisfied for
well-behaved manifold if γ is small compared to the reach.

A. Rinaldo DeBaCl : Density Based Clustering 34/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

The oracle tree Tn is close to the algorithmic tree T

A. Rinaldo DeBaCl : Density Based Clustering 34/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

The oracle tree Tn is close to the algorithmic tree T

A. Rinaldo DeBaCl : Density Based Clustering 34/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

The oracle tree Tn is close to the algorithmic tree T

A. Rinaldo DeBaCl : Density Based Clustering 34/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

For a cluster A of T , let rA = infx∈A rp (x).

Consistency
Assume that the sample X = (X1 , . . . , Xn ) is a γn -covering of supp(P). Then,
if A is an r -cluster that is (γn , c)-thick, where γn < rA /(2(2 + c)), then Xn,A is
a cluster of Ln (r ). This holds uniformly over all such clusters.

That is, when the sample is dense enough, the oracle tree will produce the
same clustering as the if we knew the algorithmic tree.

A. Rinaldo DeBaCl : Density Based Clustering 35/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

When is Xn a dense covering?

rA
When is Xn a γn -covering of a set A ⊂ S, with γn ≤ 2(c+2)
?

Assume A is “well-behaved" (standard assumption) and

inf P(B(x, γ/2) ∩ A) = cA > 0.

Then, Xn,A is a γ covering of A with high probability if

dim(A) + log n
cA = Ω .
n

If we let p → 0, then cA → 0 at a rate Θ(γ k ), with γ → 0. This would give

dimension dependent rate.

A. Rinaldo DeBaCl : Density Based Clustering 36/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

When is Xn a dense covering?

rA
When is Xn a γn -covering of a set A ⊂ S, with γn ≤ 2(c+2)
?

Assume A is “well-behaved" (standard assumption) and

inf P(B(x, γ/2) ∩ A) = cA > 0.

Then, Xn,A is a γ covering of A with high probability if

dim(A) + log n
cA = Ω .
n

If we let p → 0, then cA → 0 at a rate Θ(γ k ), with γ → 0. This would give

dimension dependent rate.

A. Rinaldo DeBaCl : Density Based Clustering 36/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

When is Xn a dense covering?

rA
When is Xn a γn -covering of a set A ⊂ S, with γn ≤ 2(c+2)
?

Assume A is “well-behaved" (standard assumption) and

inf P(B(x, γ/2) ∩ A) = cA > 0.

Then, Xn,A is a γ covering of A with high probability if

dim(A) + log n
cA = Ω .
n

If we let p → 0, then cA → 0 at a rate Θ(γ k ), with γ → 0. This would give

dimension dependent rate.

A. Rinaldo DeBaCl : Density Based Clustering 36/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Consistency of DeBaCl

Consistency of DeBaCl
Let A be an r -cluster of T such that its r − n subcluster A0 is (γn , c)-thick,
with γn < (rA0 − n )/(2 + c) and c ≥ 0. If the sample X = (X1 , . . . , Xn ) is a γn
covering of supp(P) then the subgraph of G(r b − n ) induced by Xn,A0 is
connected; if A is also 2n -away from all the other r clusters, then Xn,A0 is a
cluster of Tbn (r − n ). This holds uniformly over such clusters of T .

A. Rinaldo DeBaCl : Density Based Clustering 37/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Cluster trees come in many flavors: r , λ, α...and κ

DeBaCloutputs a data structure, which can be visualized in different ways.

The r -tree: the tree height is indexed by r . Counterintuitive: the tree
grows as r gets smaller and the higher portions of the tree are shorter!

λ-tree: the tree height is indexed by 1/r d , a rescaling proportional to the

values of the k-nn density estimator.
The proportions are right but it is not very interpretable.

α-tree: for each α ∈ (0, 1) set b

rα to be the α-quantile of b
r1 , . . . , b
rn . The
tree height is indexed by α: the α-level of the tree represents b L(b rα ), i.e.
the α-fraction of “most clusterable" data points. Highly interpretable.

κ-tree: each branch has length equal to the fraction of points comprising
it. The sum of the length of all branches and leaves is 1.
Highly interpretable.

A. Rinaldo DeBaCl : Density Based Clustering 38/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Cluster trees come in many flavors: r , λ, α...and κ

DeBaCloutputs a data structure, which can be visualized in different ways.

The r -tree: the tree height is indexed by r . Counterintuitive: the tree
grows as r gets smaller and the higher portions of the tree are shorter!

λ-tree: the tree height is indexed by 1/r d , a rescaling proportional to the

values of the k-nn density estimator.
The proportions are right but it is not very interpretable.

α-tree: for each α ∈ (0, 1) set b

κ-tree: each branch has length equal to the fraction of points comprising
it. The sum of the length of all branches and leaves is 1.
Highly interpretable.

A. Rinaldo DeBaCl : Density Based Clustering 38/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Cluster trees come in many flavors: r , λ, α...and κ

DeBaCloutputs a data structure, which can be visualized in different ways.

The r -tree: the tree height is indexed by r . Counterintuitive: the tree
grows as r gets smaller and the higher portions of the tree are shorter!

λ-tree: the tree height is indexed by 1/r d , a rescaling proportional to the

values of the k-nn density estimator.
The proportions are right but it is not very interpretable.

α-tree: for each α ∈ (0, 1) set b

κ-tree: each branch has length equal to the fraction of points comprising
it. The sum of the length of all branches and leaves is 1.
Highly interpretable.

A. Rinaldo DeBaCl : Density Based Clustering 38/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Cluster trees come in many flavors: r , λ, α...and κ

DeBaCloutputs a data structure, which can be visualized in different ways.

The r -tree: the tree height is indexed by r . Counterintuitive: the tree
grows as r gets smaller and the higher portions of the tree are shorter!

λ-tree: the tree height is indexed by 1/r d , a rescaling proportional to the

values of the k-nn density estimator.
The proportions are right but it is not very interpretable.

α-tree: for each α ∈ (0, 1) set b

κ-tree: each branch has length equal to the fraction of points comprising
it. The sum of the length of all branches and leaves is 1.
Highly interpretable.

A. Rinaldo DeBaCl : Density Based Clustering 38/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Cluster trees come in many flavors: r , λ, α...and κ

Data lambda Tree r Tree

0.067

0.914
6

1.518
4

0.042
lambda

r
2

0.015
0

2.989
-2

0
0 2 4 6 8 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0

cluster labels alpha Tree kappa Tree

0.004

0.602
6

0.313
4

0.39
kappa
alpha
2

0.174
0

0.838
-2

0
0 2 4 6 8 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0

A. Rinaldo DeBaCl : Density Based Clustering 38/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Cluster trees come in many flavors: r , λ, α...and κ

The difference can be substantial.

A. Rinaldo DeBaCl : Density Based Clustering 38/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Weak convergence and the bootstrap

Let B the P-Brownian bridge indexed by the closed balls in Rd : a centered

Gaussian process with covariance function

(B, B 0 ) → P(B ∩ B 0 ) = P(B)P(B 0 ).

Denote by B(x, r ) the value of B at the ball B(x, r ).

Recall that, for x ∈ Rd , Fx is the c.d.f. of kX − xk with X ∼ P and rp (x) is the
p-th quantile of Fx .

Theorem (Function delta method)

Assume that, for some constant c > 0 and for P-almost all x,

Fx0 (rp (x)) ≥ c.

Then,
n√ o
B(x, rp (x))

rp (x) − rp (x) , x ∈ Rd , x ∈ Rd

n b
Fx0 (rp (x))

A. Rinaldo DeBaCl : Density Based Clustering 39/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Weak convergence and the bootstrap

Let B the P-Brownian bridge indexed by the closed balls in Rd : a centered

Gaussian process with covariance function

(B, B 0 ) → P(B ∩ B 0 ) = P(B)P(B 0 ).

Denote by B(x, r ) the value of B at the ball B(x, r ).

Recall that, for x ∈ Rd , Fx is the c.d.f. of kX − xk with X ∼ P and rp (x) is the
p-th quantile of Fx .

Theorem (Function delta method)

Assume that, for some constant c > 0 and for P-almost all x,

Fx0 (rp (x)) ≥ c.

Then,
n√ o
B(x, rp (x))

rp (x) − rp (x) , x ∈ Rd , x ∈ Rd

n b
Fx0 (rp (x))

A. Rinaldo DeBaCl : Density Based Clustering 39/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Weak convergence and the bootstrap

Let B the P-Brownian bridge indexed by the closed balls in Rd : a centered

Gaussian process with covariance function

(B, B 0 ) → P(B ∩ B 0 ) = P(B)P(B 0 ).

Denote by B(x, r ) the value of B at the ball B(x, r ).

Recall that, for x ∈ Rd , Fx is the c.d.f. of kX − xk with X ∼ P and rp (x) is the
p-th quantile of Fx .

Theorem (Function delta method)

Assume that, for some constant c > 0 and for P-almost all x,

Fx0 (rp (x)) ≥ c.

Then,
n√ o
B(x, rp (x))

rp (x) − rp (x) , x ∈ Rd , x ∈ Rd

n b
Fx0 (rp (x))

A. Rinaldo DeBaCl : Density Based Clustering 39/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Weak convergence and the bootstrap

Let {brp∗ (x), x ∈ Rd } be the bootstrap version of b

rp based on the empirical
distribution of Xn .

Corollary (Boostrap validity)

Conditionally almost surely,
n√ o n B(x, r (x)) o
p
rp∗ (x) − b
rp (x) , x ∈ Rd , x ∈ Rd

n b 0
Fx (rp (x))

Using the bootstrap, we can construct asymptotically correct confidence

bands for rp and the DeBaCltrees.

A. Rinaldo DeBaCl : Density Based Clustering 40/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Weak convergence and the bootstrap

Let {brp∗ (x), x ∈ Rd } be the bootstrap version of b

rp based on the empirical
distribution of Xn .

Corollary (Boostrap validity)

Conditionally almost surely,
n√ o n B(x, r (x)) o
p
rp∗ (x) − b
rp (x) , x ∈ Rd , x ∈ Rd

n b 0
Fx (rp (x))

Using the bootstrap, we can construct asymptotically correct confidence

bands for rp and the DeBaCltrees.

A. Rinaldo DeBaCl : Density Based Clustering 40/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Bootstrap confidence sets example

A. Rinaldo DeBaCl : Density Based Clustering 41/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Bootstrap confidence sets example

A. Rinaldo DeBaCl : Density Based Clustering 41/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Bootstrap confidence sets example

A. Rinaldo DeBaCl : Density Based Clustering 41/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Conclusions

Density-based clustering is a principled paradigm for clustering.

The cluster tree provides an interpretable and highy informative
encoding of all the clustering properties of P.
DeBaClis a simple, computationally efficient algorithm for consistently
estimating the cluster tree, even in high dimensions.

Current and future work:

Work out the theoretical properties of the α and κ tree.
Develop and study methods for constructing confidence sets for cluster
trees and, more generally, for using cluster trees for statistical inference.
Provide guidelines on how to choose p!

A. Rinaldo DeBaCl : Density Based Clustering 42/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

Conclusions

Density-based clustering is a principled paradigm for clustering.

Current and future work:

A. Rinaldo DeBaCl : Density Based Clustering 42/43

Density-based Clustering
DeBaCl
Theoretical Analysis of DeBaCl

CMU TopStat

A. Rinaldo DeBaCl : Density Based Clustering 43/43

Density Based Silhouette Diagnostics For
No ratings yet
Density Based Silhouette Diagnostics For
14 pages
Python Cheat Sheet: Pandas - Numpy - Sklearn Matplotlib - Seaborn BS4 - Selenium - Scrapy
100% (3)
Python Cheat Sheet: Pandas - Numpy - Sklearn Matplotlib - Seaborn BS4 - Selenium - Scrapy
9 pages
Analysis of Transport Choice of Employees - A Project On Machine Learning
100% (10)
Analysis of Transport Choice of Employees - A Project On Machine Learning
24 pages
Android Based Parking Booking System
89% (9)
Android Based Parking Booking System
30 pages
Density Based Clustering
No ratings yet
Density Based Clustering
22 pages
Clusters - Density-Based
No ratings yet
Clusters - Density-Based
12 pages
Autoepsdbscan: Dbscan With Eps Automatic For Large Dataset: Manisha Naik Gaonkar & Kedar Sawant
No ratings yet
Autoepsdbscan: Dbscan With Eps Automatic For Large Dataset: Manisha Naik Gaonkar & Kedar Sawant
6 pages
Cluster Analysis
No ratings yet
Cluster Analysis
27 pages
Density
No ratings yet
Density
3 pages
A Survey of Density Based Clustering Algorithms: Panthadeep BHATTACHARJEE, Pinaki MITRA
No ratings yet
A Survey of Density Based Clustering Algorithms: Panthadeep BHATTACHARJEE, Pinaki MITRA
27 pages
Multi Density DBScan
No ratings yet
Multi Density DBScan
8 pages
Lesson 4.1 - Unsupervised Learning Partitioning Methods
No ratings yet
Lesson 4.1 - Unsupervised Learning Partitioning Methods
32 pages
DS143 Group 13 Presentation-1
No ratings yet
DS143 Group 13 Presentation-1
27 pages
Dbscan: Fast Density-Based Clustering With R: Michael Hahsler Matthew Piekenbrock
No ratings yet
Dbscan: Fast Density-Based Clustering With R: Michael Hahsler Matthew Piekenbrock
28 pages
A Density Clustering Based On Outlier
No ratings yet
A Density Clustering Based On Outlier
6 pages
Density-Based Clustering Algorithm: Presented by - Rohit Paul
No ratings yet
Density-Based Clustering Algorithm: Presented by - Rohit Paul
12 pages
Density Based Clustering
No ratings yet
Density Based Clustering
17 pages
DBSCAN
No ratings yet
DBSCAN
42 pages
Module 10
No ratings yet
Module 10
59 pages
Clustering: K-Means, Agglomerative, DBSCAN: Tan, Steinbach, Kumar
No ratings yet
Clustering: K-Means, Agglomerative, DBSCAN: Tan, Steinbach, Kumar
45 pages
A Novel Density-Based Clustering Algorithm For Predicting Cardiovascular Disease
No ratings yet
A Novel Density-Based Clustering Algorithm For Predicting Cardiovascular Disease
12 pages
An Empirical Evaluation of Density-Based Clustering Techniques
No ratings yet
An Empirical Evaluation of Density-Based Clustering Techniques
8 pages
Duan2006 1 3
No ratings yet
Duan2006 1 3
3 pages
Unit 3 Updated Notes
No ratings yet
Unit 3 Updated Notes
29 pages
4.5-Cluster Analysis
No ratings yet
4.5-Cluster Analysis
17 pages
Clustering Density Based
No ratings yet
Clustering Density Based
14 pages
ASDM C02 Clustering
No ratings yet
ASDM C02 Clustering
107 pages
Review On Density-Based Clustering - DBSCAN, DenClue & GRID
No ratings yet
Review On Density-Based Clustering - DBSCAN, DenClue & GRID
20 pages
Density Based Clustering [ Unit 5 ]
No ratings yet
Density Based Clustering [ Unit 5 ]
5 pages
A_Theoretical_Analysis_of_Density_Peaks_Clustering_and_the_Component-Wise_Peak-Finding_Algorithm
No ratings yet
A_Theoretical_Analysis_of_Density_Peaks_Clustering_and_the_Component-Wise_Peak-Finding_Algorithm
12 pages
Density ML
No ratings yet
Density ML
51 pages
Clustering in Machine Learning
No ratings yet
Clustering in Machine Learning
26 pages
Model-Based Clustering
No ratings yet
Model-Based Clustering
23 pages
A Survey of Some Density Based Clustering Techniques PDF
No ratings yet
A Survey of Some Density Based Clustering Techniques PDF
5 pages
DBSCAN_An_Assessment_of_Density_Based_Cl
No ratings yet
DBSCAN_An_Assessment_of_Density_Based_Cl
5 pages
Dbsmote: Density-Based Synthetic Minority Over-Sampling Technique
No ratings yet
Dbsmote: Density-Based Synthetic Minority Over-Sampling Technique
21 pages
Density Based Clustering Algorithm
No ratings yet
Density Based Clustering Algorithm
25 pages
DW & DM Unit 4 Notes
No ratings yet
DW & DM Unit 4 Notes
40 pages
A Fully Autonomous Data Density Based Clustering Technique: R.hyde1@lancaster - Ac.uk P.angelov@lancaster - Ac.uk
No ratings yet
A Fully Autonomous Data Density Based Clustering Technique: R.hyde1@lancaster - Ac.uk P.angelov@lancaster - Ac.uk
8 pages
Density Based
No ratings yet
Density Based
27 pages
Clustering
No ratings yet
Clustering
65 pages
Unit 2
No ratings yet
Unit 2
33 pages
Clustering
No ratings yet
Clustering
12 pages
Unit - V DW
No ratings yet
Unit - V DW
6 pages
CLUSTERING GRID-BASED METHODS Elsayed Hemayed Data Mining Course
No ratings yet
CLUSTERING GRID-BASED METHODS Elsayed Hemayed Data Mining Course
14 pages
Lecture 6
No ratings yet
Lecture 6
55 pages
Unsupervised Learning-01
No ratings yet
Unsupervised Learning-01
42 pages
Unsupervised Learning
No ratings yet
Unsupervised Learning
64 pages
Efficient Hierarchical Clustering of Large Data Sets Using P-Trees
No ratings yet
Efficient Hierarchical Clustering of Large Data Sets Using P-Trees
4 pages
DWDM 5
No ratings yet
DWDM 5
12 pages
VDBSCAN
No ratings yet
VDBSCAN
4 pages
Clustering Techniques
No ratings yet
Clustering Techniques
23 pages
DBSCAN Past, present and future
No ratings yet
DBSCAN Past, present and future
7 pages
V3i206 PDF
No ratings yet
V3i206 PDF
5 pages
ML UNIT 4
No ratings yet
ML UNIT 4
15 pages
Ifsa Eusflat 2015 Submission 24
No ratings yet
Ifsa Eusflat 2015 Submission 24
7 pages
Density-Based Clustering Validation: April 2014
No ratings yet
Density-Based Clustering Validation: April 2014
10 pages
Dear The Weight
From Everand
Dear The Weight
Masud Rana
No ratings yet
Sequences and Infinite Series, A Collection of Solved Problems
From Everand
Sequences and Infinite Series, A Collection of Solved Problems
Steven Tan
No ratings yet
Instructions For Typesetting Manuscripts Using Msword
No ratings yet
Instructions For Typesetting Manuscripts Using Msword
6 pages
Digital Communications and Networks: Yaira K. Rivera S Anchez, Steven A. Demurjian, Mohammed S. Baihan
No ratings yet
Digital Communications and Networks: Yaira K. Rivera S Anchez, Steven A. Demurjian, Mohammed S. Baihan
12 pages
IS Model QP
No ratings yet
IS Model QP
2 pages
Automated Extraction of Attributes From Natural Language Attribute-Based Access Control (ABAC) Policies
No ratings yet
Automated Extraction of Attributes From Natural Language Attribute-Based Access Control (ABAC) Policies
25 pages
Gip Unit 5 - 2018-19 Even
No ratings yet
Gip Unit 5 - 2018-19 Even
36 pages
Gip Uniy-1 Notes 2018-19
No ratings yet
Gip Uniy-1 Notes 2018-19
47 pages
Unit 2 Ques
No ratings yet
Unit 2 Ques
80 pages
Cs655 Unit II
No ratings yet
Cs655 Unit II
27 pages
C Programs - With Answers
No ratings yet
C Programs - With Answers
40 pages
Face Mask Detection Using Machine Learning and Deep Learning
No ratings yet
Face Mask Detection Using Machine Learning and Deep Learning
6 pages
Processes 08 01068
No ratings yet
Processes 08 01068
18 pages
Deep Super Learner: A Deep Ensemble For Classification Problems
No ratings yet
Deep Super Learner: A Deep Ensemble For Classification Problems
12 pages
Heart_Disease
No ratings yet
Heart_Disease
26 pages
Stock Market Analysis With The Usage of Machine Learning and Deep Learning Algorithms
No ratings yet
Stock Market Analysis With The Usage of Machine Learning and Deep Learning Algorithms
9 pages
Notes Data Science
100% (1)
Notes Data Science
5 pages
Performance Evaluation of Different Supervised Learning Algorithms For Mobile Price Classification
No ratings yet
Performance Evaluation of Different Supervised Learning Algorithms For Mobile Price Classification
10 pages
A Graph Based Deep Learning Model For Aml
No ratings yet
A Graph Based Deep Learning Model For Aml
135 pages
Tamiliniyan Santhosh 92
No ratings yet
Tamiliniyan Santhosh 92
34 pages
Data Science
No ratings yet
Data Science
13 pages
Scikit-Learn Cheat Sheet Python For Data Science: Preprocessing The Data Evaluate Your Model's Performance
100% (1)
Scikit-Learn Cheat Sheet Python For Data Science: Preprocessing The Data Evaluate Your Model's Performance
1 page
ML Viva and Oral Question and Answers
No ratings yet
ML Viva and Oral Question and Answers
5 pages
NTCC REPORT 010424.docx Megha Pachuri
No ratings yet
NTCC REPORT 010424.docx Megha Pachuri
27 pages
CHAPTER6
No ratings yet
CHAPTER6
52 pages
Thermal Facial Analysis For Deception Detectio
No ratings yet
Thermal Facial Analysis For Deception Detectio
9 pages
Jawaban Huawei
No ratings yet
Jawaban Huawei
58 pages
04 - RA - Crop Prediction Using Machine Learning
No ratings yet
04 - RA - Crop Prediction Using Machine Learning
12 pages
Experiment 2.2 KNN Classifier
No ratings yet
Experiment 2.2 KNN Classifier
7 pages
An Automatic Dermatology Detection System Based On Deep Learning and Computer Vision
No ratings yet
An Automatic Dermatology Detection System Based On Deep Learning and Computer Vision
10 pages
Consumer Behavior Analysis of Social Media Networks by Using Machine Learning
No ratings yet
Consumer Behavior Analysis of Social Media Networks by Using Machine Learning
4 pages
Dissertation Report - Final
No ratings yet
Dissertation Report - Final
28 pages
SPSS Predictive Analytics
No ratings yet
SPSS Predictive Analytics
8 pages
Gayatri Resume Update 2
No ratings yet
Gayatri Resume Update 2
2 pages
Data Science: M.Tech. (Computer Science & Engineering) First Semester (C.B.C.S.)
No ratings yet
Data Science: M.Tech. (Computer Science & Engineering) First Semester (C.B.C.S.)
2 pages
Lecture 4.2 Supervised Learning Classification
No ratings yet
Lecture 4.2 Supervised Learning Classification
25 pages
Data Science Solutions Sample
100% (6)
Data Science Solutions Sample
53 pages
40 - Malware Detection Using Machine Learning and Performance Evaluation
No ratings yet
40 - Malware Detection Using Machine Learning and Performance Evaluation
77 pages