BCA DM Chapter 5 - Clustering

The document provides an overview of clustering techniques, including hierarchical and partitional algorithms. Hierarchical algorithms create nested clusters and can be agglomerative (bottom-up) or divisive (top-down). Partitional algorithms create all clusters at once, requiring the number of clusters as input. Popular partitional algorithms include k-means, which assigns data to clusters based on minimizing distance to cluster means, and PAM, which clusters data around representative points called medoids. The document discusses key clustering concepts like similarity measures, outliers, and genetic algorithms for clustering.

Chapter 5

Clustering
Clustering Outline

Goal: Provide an overview of the clustering problem and introduce some of the basic algorithms.

• Clustering Problem Overview
• Clustering Techniques
• Hierarchical Algorithms
• Partitional Algorithms
• Genetic Algorithm
Clustering vs. Classification
• No prior knowledge of:
  • Number of clusters
  • Meaning of clusters
• Unsupervised learning

• Classification – classes are predefined
• Clustering – classes (clusters) are not predefined; data are grouped according to similarities
Clustering Examples

• Segment a customer database based on similar buying patterns.
• Group houses in a town into neighborhoods based on similar features.
• Identify new plant species.
• Identify similar Web usage patterns.
Clustering Example

[Figure: a group of homes grouped into clusters]
Clustering Issues
• Outlier handling – difficult, since an outlier does not naturally belong to any cluster but may be forced into one, distorting the result
• Dynamic data – cluster membership may change over time as the data change
• Interpreting results – the semantic meaning of each cluster must be determined after clustering
• Evaluating results – difficult because, unlike classification, there is no predefined class structure against which to judge the clusters
• Number of clusters – often must be chosen in advance
• Data to be used – which attributes to cluster on
• Scalability – handling very large databases
▪ The resulting clusters depend on the clustering process and parameters used
Clustering Problem
• Given a database D = {t1, t2, …, tn} of tuples and an integer value k, the Clustering Problem is to define a mapping f : D → {1, …, k} where each ti is assigned to one cluster Kj, 1 ≤ j ≤ k.
• A cluster Kj contains precisely those tuples mapped to it.
• Unlike the classification problem, clusters are not known a priori.
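The mapping can be made concrete with a small sketch; the database D and the assignment f below are hypothetical and serve only to illustrate the notation.

# Minimal illustration of the clustering problem as a mapping f: D -> {1, ..., k}.
# The database D and the assignment f are hypothetical.
D = [1.0, 1.2, 5.0, 5.3, 9.8]          # tuples t1..t5 (1-D values for simplicity)
k = 2
f = {1: 1, 2: 1, 3: 2, 4: 2, 5: 2}     # f assigns each ti to one cluster Kj, 1 <= j <= k

# Each cluster Kj contains precisely the tuples mapped to it
clusters = {j: [D[i - 1] for i, label in f.items() if label == j] for j in range(1, k + 1)}
print(clusters)                         # {1: [1.0, 1.2], 2: [5.0, 5.3, 9.8]}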
Types of Clustering

• Hierarchical – Nested set of clusters created.
• Partitional – One set of clusters created.
• Incremental – Each element handled one at a time.
• Simultaneous – All elements handled together.
• Overlapping/Non-overlapping
Clustering Approaches
Clustering
• Hierarchical
  • Agglomerative
  • Divisive
• Partitional
• Categorical
• Large DB
  • Sampling
  • Compression
Similarity & Distance Measures
Cluster Parameters

• Centroid – the "middle" of a cluster: the mean of its points (it need not be an actual item in the cluster).
• Medoid – the centrally located item of the cluster (an actual data point).
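A brief sketch of the difference, using a small hypothetical 2-D cluster (the points are made up for illustration):

import numpy as np

# Hypothetical 2-D cluster used only to illustrate the definitions
K = np.array([[1.0, 1.0], [2.0, 1.0], [1.5, 3.0], [8.0, 1.5]])

# Centroid: the mean of the points; it is generally not one of the items
centroid = K.mean(axis=0)

# Medoid: the actual item whose total distance to all other items is smallest
dists = np.linalg.norm(K[:, None, :] - K[None, :, :], axis=2)   # pairwise distances
medoid = K[dists.sum(axis=1).argmin()]

print("centroid:", centroid)
print("medoid:  ", medoid)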
Distance Between Clusters
• Single Link: smallest distance between points
• Complete Link: largest distance between points
• Average Link: average distance between points
• Centroid: distance between centroids
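The four measures can be computed directly from the pairwise distances; the two clusters below are hypothetical, chosen only to show the calculations.

import numpy as np
from scipy.spatial.distance import cdist

# Two hypothetical clusters of 2-D points, used only to illustrate the measures
K1 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
K2 = np.array([[4.0, 4.0], [5.0, 5.0]])

d = cdist(K1, K2)                  # all pairwise distances between the two clusters

single_link   = d.min()            # smallest distance between points
complete_link = d.max()            # largest distance between points
average_link  = d.mean()           # average distance between points
centroid_dist = np.linalg.norm(K1.mean(axis=0) - K2.mean(axis=0))

print(single_link, complete_link, average_link, centroid_dist)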
Outliers

• Outliers are sample points with values much different from those of the remaining set of data
• May represent errors in the data
• Could also be correct data values that are simply different from the remaining data
Impact of Outliers on Clustering
[Figure: the same data clustered two ways; solid and dashed lines outline the different clusterings obtained depending on how the outlier is treated.]
Hierarchical Clustering
Hierarchical Clustering
• Clusters are created in levels, actually creating sets of clusters at each level.
• Agglomerative
  • Initially each item is in its own cluster
  • Iteratively, clusters are merged together
  • Bottom up
• Divisive
  • Initially all items are in one cluster
  • Large clusters are successively divided
  • Top down
Hierarchical Algorithms

• Single Link
• MST Single Link
• Complete Link
• Average Link
Dendrogram
• Dendrogram: a tree data structure which illustrates hierarchical clustering techniques.
• Each level shows the clusters for that level.
  • Leaf – individual clusters
  • Root – one cluster
• A cluster at level i is the union of its children clusters at level i+1.
Levels of Clustering

Agglomerative Example
Distance matrix:

    A  B  C  D  E
A   0  1  2  2  3
B   1  0  2  4  3
C   2  2  0  1  5
D   2  4  1  0  3
E   3  3  5  3  0

[Figure: the items A–E drawn as a graph with these distances, and the resulting dendrogram with leaves A, B, C, D, E plotted against threshold levels 1 through 5.]
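A short sketch that reproduces this example with SciPy's single-link agglomerative clustering, using the distance matrix above; the threshold values mirror the dendrogram levels.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

labels = ["A", "B", "C", "D", "E"]
D = np.array([[0, 1, 2, 2, 3],
              [1, 0, 2, 4, 3],
              [2, 2, 0, 1, 5],
              [2, 4, 1, 0, 3],
              [3, 3, 5, 3, 0]], dtype=float)

# linkage() expects a condensed distance matrix; 'single' gives single-link merges
Z = linkage(squareform(D), method="single")

# Clusters obtained at increasing distance thresholds, as in the dendrogram
for t in (1, 2, 3):
    assignment = fcluster(Z, t=t, criterion="distance")
    print(f"threshold {t}:", dict(zip(labels, map(int, assignment))))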
Agglomerative Algorithm

Single Link
• View all items as a graph with links (distances) between them.
• Finds maximal connected components in this graph (illustrated in the sketch below).
• Two clusters are merged if there is at least one edge which connects them.
• Uses threshold distances at each level.
• Could be agglomerative or divisive.
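A minimal sketch of the graph view: keep only the edges whose distance is at or below the threshold and report the connected components as clusters. The union-find helper is illustrative, not an optimized implementation.

def single_link_clusters(D, threshold):
    # Clusters = connected components of the graph whose edges are the pairs
    # with distance <= threshold
    n = len(D)
    parent = list(range(n))

    def find(i):                        # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if D[i][j] <= threshold:    # an edge connects the two items
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# Using the distance matrix from the agglomerative example above (items 0..4 = A..E)
D = [[0, 1, 2, 2, 3],
     [1, 0, 2, 4, 3],
     [2, 2, 0, 1, 5],
     [2, 4, 1, 0, 3],
     [3, 3, 5, 3, 0]]
print(single_link_clusters(D, threshold=1))   # [[0, 1], [2, 3], [4]]
print(single_link_clusters(D, threshold=2))   # [[0, 1, 2, 3], [4]]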
Single Link Clustering

5.5 Partitional Algorithms
Partitional Clustering

• Nonhierarchical
• Creates clusters in one step as opposed to several steps.
• Since only one set of clusters is output, the user normally has to input the desired number of clusters, k.
• Usually deals with static sets.
Partitional Algorithms

• Minimum Spanning Tree
• Squared Error Clustering Algorithm
• K-Means
• Nearest Neighbor
• PAM – Partitioning Around Medoids
• BEA – Bond Energy Algorithm
• Clustering with Genetic Algorithms (GA) & Neural Networks (NN)
MST – Min. Spanning Tree Algorithm

Squared Error

• The squared error of a cluster is the sum of the squared distances of its items from the cluster center; the squared error of a clustering is the sum of these values over all clusters.
• The algorithm seeks the set of clusters that minimizes this squared error.
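A small sketch of the squared-error computation; the two 1-D clusters are hypothetical and chosen so the arithmetic is easy to follow.

import numpy as np

def squared_error(clusters):
    # Sum over all clusters of the squared distances of items from the cluster mean;
    # `clusters` is a list of arrays of points
    total = 0.0
    for K in clusters:
        center = K.mean(axis=0)               # cluster center (mean)
        total += ((K - center) ** 2).sum()    # squared distances to the center
    return total

# Hypothetical 1-D clustering used only to show the computation
clusters = [np.array([[2.0], [3.0], [4.0]]), np.array([[20.0], [25.0], [30.0]])]
print(squared_error(clusters))   # 2.0 + 50.0 = 52.0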
Squared Error Algorithm

K-Means
• Initial set of clusters randomly chosen.
• Iteratively, items are moved among sets of clusters until the desired set is reached.
• A high degree of similarity among elements in a cluster is obtained.
• Given a cluster Ki = {ti1, ti2, …, tim}, the cluster mean is mi = (1/m)(ti1 + … + tim).
K-Means Example
• Given: {2,4,10,12,3,20,30,11,25}, k=2
• Randomly assign means: m1=3, m2=4
• K1={2,3}, K2={4,10,12,20,30,11,25}, m1=2.5, m2=16
• K1={2,3,4}, K2={10,12,20,30,11,25}, m1=3, m2=18
• K1={2,3,4,10}, K2={12,20,30,11,25}, m1=4.75, m2=19.6
• K1={2,3,4,10,11,12}, K2={20,30,25}, m1=7, m2=25
• Stop, as the clusters with these means stay the same.
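A minimal 1-D k-means sketch that reproduces the iterations above; the convergence test (stop when the means no longer change) is the only detail assumed beyond the slide.

def k_means_1d(items, means, max_iter=100):
    # Assign each item to the nearest mean, recompute the means,
    # and stop when the means no longer change (assumes no cluster becomes empty)
    for _ in range(max_iter):
        clusters = [[] for _ in means]
        for x in items:
            nearest = min(range(len(means)), key=lambda j: abs(x - means[j]))
            clusters[nearest].append(x)
        new_means = [sum(c) / len(c) for c in clusters]
        if new_means == means:           # converged
            return clusters, means
        means = new_means
    return clusters, means

items = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, means = k_means_1d(items, means=[3, 4])
print(clusters)   # [[2, 4, 10, 12, 3, 11], [20, 30, 25]]
print(means)      # [7.0, 25.0]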
K-Means Algorithm

Nearest Neighbor

• Items are iteratively merged into the existing cluster that is closest.
• Incremental
• A threshold, t, is used to determine if an item is added to an existing cluster or a new cluster is created (see the sketch below).
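A minimal sketch of the incremental behaviour, assuming 1-D data and a hypothetical threshold t = 5: each item joins the cluster containing its nearest already-clustered item if that distance is within t, otherwise it starts a new cluster.

def nearest_neighbor_clustering(items, t):
    # Incremental: the first item starts the first cluster, every later item either
    # joins the cluster of its nearest neighbour (if within t) or starts a new cluster
    clusters = [[items[0]]]
    for x in items[1:]:
        dists = [min(abs(x - y) for y in c) for c in clusters]   # nearest item per cluster
        best = min(range(len(clusters)), key=lambda i: dists[i])
        if dists[best] <= t:
            clusters[best].append(x)     # close enough: join that cluster
        else:
            clusters.append([x])         # too far from everything: new cluster
    return clusters

# Hypothetical 1-D data and threshold, used only to illustrate the behaviour
print(nearest_neighbor_clustering([2, 4, 10, 12, 3, 20, 30, 11, 25], t=5))
# [[2, 4, 3], [10, 12, 11], [20, 25], [30]]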
Nearest Neighbor Algorithm

PAM
• Partitioning Around Medoids (PAM), also called K-Medoids
• Handles outliers well.
• Ordering of input does not impact results.
• Does not scale well.
• Each cluster is represented by one item, called the medoid.
• Initial set of k medoids randomly chosen.
PAM
PAM Cost Calculation
• At each step in the algorithm, medoids are changed if the overall cost is improved.
• Cjih – the change in cost for an item tj associated with swapping medoid ti with non-medoid th; the total cost of a swap is the sum of these changes over all items.
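A small sketch of the swap evaluation, assuming the usual k-medoids cost (each item contributes its distance to the nearest medoid); the starting medoids are hypothetical and the distance matrix is the one from the agglomerative example.

def total_cost(D, medoids):
    # Cost of a medoid set: each item contributes its distance to the nearest medoid
    return sum(min(D[j][m] for m in medoids) for j in range(len(D)))

def swap_cost_change(D, medoids, i, h):
    # Change in total cost when medoid i is swapped with non-medoid h
    # (the sum of the per-item changes Cjih); negative means the swap is an improvement
    new_medoids = [h if m == i else m for m in medoids]
    return total_cost(D, new_medoids) - total_cost(D, medoids)

D = [[0, 1, 2, 2, 3],     # distance matrix from the earlier example (items 0..4 = A..E)
     [1, 0, 2, 4, 3],
     [2, 2, 0, 1, 5],
     [2, 4, 1, 0, 3],
     [3, 3, 5, 3, 0]]
medoids = [0, 1]                                 # hypothetical starting medoids: A and B
print(swap_cost_change(D, medoids, i=1, h=3))    # -2: swapping B for D lowers the cost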
PAM Algorithm

BEA
• Bond Energy Algorithm
• Used in database design (physical and logical), e.g. vertical fragmentation
• Determines the affinity (bond) between attributes based on common usage (see the sketch below).
• Algorithm outline:
  1. Create affinity matrix
  2. Convert to BOND matrix
  3. Create regions of close bonding
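A small sketch of step 1 of the outline. The query usage matrix and access frequencies below are hypothetical; the affinity of two attributes is taken here as the total frequency of the queries that use both.

import numpy as np

# Hypothetical input: use[q][a] = 1 if query q accesses attribute a, else 0,
# and freq[q] = how often query q is run
use = np.array([[1, 0, 1, 0],    # q1 uses A1, A3
                [0, 1, 1, 0],    # q2 uses A2, A3
                [1, 0, 0, 1],    # q3 uses A1, A4
                [0, 1, 0, 1]])   # q4 uses A2, A4
freq = np.array([45, 5, 75, 3])

# Step 1 - affinity matrix: aff(Ai, Aj) = total frequency of queries using both Ai and Aj
n_attrs = use.shape[1]
aff = np.zeros((n_attrs, n_attrs), dtype=int)
for i in range(n_attrs):
    for j in range(n_attrs):
        aff[i, j] = freq[(use[:, i] == 1) & (use[:, j] == 1)].sum()

print(aff)
# Steps 2 and 3 would then reorder rows/columns to maximise the bond energy
# and identify regions of closely bonded attributes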
BEA

Modified from [OV99]

Genetic Algorithm Example
• Items: {A,B,C,D,E,F,G,H}
• Randomly choose an initial solution: {A,C,E} {B,F} {D,G,H}, encoded as the bit strings 10101000, 01000100, 00010011
• Suppose crossover at point four between the 1st and 3rd individuals: 10100011, 01000100, 00011000 (see the sketch below)
• What should the termination criteria be?
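A minimal sketch of the crossover step on the bit-string encoding; it reproduces the result shown above.

def crossover(parent1, parent2, point):
    # Single-point crossover: swap the tails of the two bit strings after `point`
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2

# The three individuals encode cluster membership of items A..H
population = ["10101000", "01000100", "00010011"]

# Crossover at point four between the 1st and 3rd individuals, as in the example
c1, c2 = crossover(population[0], population[2], point=4)
print(c1, c2)   # 10100011 00011000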
GA Algorithm

