BCA DM Chapter 5 - Clustering

The document provides an overview of clustering techniques, including hierarchical and partitional algorithms. Hierarchical algorithms create nested clusters and can be agglomerative (bottom-up) or divisive (top-down). Partitional algorithms create all clusters at once, requiring the number of clusters as input. Popular partitional algorithms include k-means, which assigns data to clusters based on minimizing distance to cluster means, and PAM, which clusters data around representative points called medoids. The document discusses key clustering concepts like similarity measures, outliers, and genetic algorithms for clustering.

Chapter 5

Clustering
Clustering Outline

Goal: Provide an overview of the clustering problem and introduce some of the basic algorithms.

• Clustering Problem Overview
• Clustering Techniques
• Hierarchical Algorithms
• Partitional Algorithms
• Genetic Algorithm
Clustering vs. Classification
• No prior knowledge of:
  • Number of clusters
  • Meaning of clusters
• Unsupervised learning

• Classification – classes are predefined
• Clustering – classes (clusters) are not predefined; data are grouped according to similarities
Clustering Examples

• Segment a customer database based on similar buying patterns.
• Group houses in a town into neighborhoods based on similar features.
• Identify new plant species.
• Identify similar Web usage patterns.
Clustering Example

[Figure: a group of homes grouped into clusters]
Clustering Issues
• Outlier handling – difficult, since an outlier does not naturally belong to any cluster but may be forced into one, distorting the result
• Dynamic data – cluster membership may change over time as the data change
• Interpreting results – the semantic meaning of each cluster must be determined after clustering
• Evaluating results – difficult because, unlike classification, there is no predefined class structure against which to judge the clusters
• Number of clusters – often must be chosen in advance
• Data to be used – which attributes to cluster on
• Scalability – handling very large databases
▪ The resulting clusters depend on the clustering process and parameters used
Clustering Problem
• Given a database D = {t1, t2, …, tn} of tuples and an integer value k, the Clustering Problem is to define a mapping f : D → {1, …, k} where each ti is assigned to one cluster Kj, 1 ≤ j ≤ k.
• A cluster Kj contains precisely those tuples mapped to it.
• Unlike the classification problem, clusters are not known a priori.
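The mapping can be made concrete with a small sketch; the database D and the assignment f below are hypothetical and serve only to illustrate the notation.

# Minimal illustration of the clustering problem as a mapping f: D -> {1, ..., k}.
# The database D and the assignment f are hypothetical.
D = [1.0, 1.2, 5.0, 5.3, 9.8]          # tuples t1..t5 (1-D values for simplicity)
k = 2
f = {1: 1, 2: 1, 3: 2, 4: 2, 5: 2}     # f assigns each ti to one cluster Kj, 1 <= j <= k

# Each cluster Kj contains precisely the tuples mapped to it
clusters = {j: [D[i - 1] for i, label in f.items() if label == j] for j in range(1, k + 1)}
print(clusters)                         # {1: [1.0, 1.2], 2: [5.0, 5.3, 9.8]}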
Types of Clustering

• Hierarchical – Nested set of clusters created.
• Partitional – One set of clusters created.
• Incremental – Each element handled one at a time.
• Simultaneous – All elements handled together.
• Overlapping/Non-overlapping
Clustering Approaches
Clustering
• Hierarchical
  • Agglomerative
  • Divisive
• Partitional
• Categorical
• Large DB
  • Sampling
  • Compression
Similarity & Distance Measures
Cluster Parameters

• Centroid – the "middle" of a cluster: the mean of its points (it need not be an actual item in the cluster).
• Medoid – the centrally located item of the cluster (an actual data point).
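A brief sketch of the difference, using a small hypothetical 2-D cluster (the points are made up for illustration):

import numpy as np

# Hypothetical 2-D cluster used only to illustrate the definitions
K = np.array([[1.0, 1.0], [2.0, 1.0], [1.5, 3.0], [8.0, 1.5]])

# Centroid: the mean of the points; it is generally not one of the items
centroid = K.mean(axis=0)

# Medoid: the actual item whose total distance to all other items is smallest
dists = np.linalg.norm(K[:, None, :] - K[None, :, :], axis=2)   # pairwise distances
medoid = K[dists.sum(axis=1).argmin()]

print("centroid:", centroid)
print("medoid:  ", medoid)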
Distance Between Clusters
• Single Link: smallest distance between points
• Complete Link: largest distance between points
• Average Link: average distance between points
• Centroid: distance between centroids
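The four measures can be computed directly from the pairwise distances; the two clusters below are hypothetical, chosen only to show the calculations.

import numpy as np
from scipy.spatial.distance import cdist

# Two hypothetical clusters of 2-D points, used only to illustrate the measures
K1 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
K2 = np.array([[4.0, 4.0], [5.0, 5.0]])

d = cdist(K1, K2)                  # all pairwise distances between the two clusters

single_link   = d.min()            # smallest distance between points
complete_link = d.max()            # largest distance between points
average_link  = d.mean()           # average distance between points
centroid_dist = np.linalg.norm(K1.mean(axis=0) - K2.mean(axis=0))

print(single_link, complete_link, average_link, centroid_dist)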
Outliers

• Outliers are sample points with values much different from those of the remaining set of data
• May represent errors in the data
• Could also be correct data values that are simply different from the remaining data
Impact of Outliers on Clustering
[Figure: the same data clustered two ways; solid and dashed lines outline the different clusterings obtained depending on how the outlier is treated.]
Hierarchical Clustering
Hierarchical Clustering
• Clusters are created in levels, actually creating sets of clusters at each level.
• Agglomerative
  • Initially each item is in its own cluster
  • Iteratively, clusters are merged together
  • Bottom up
• Divisive
  • Initially all items are in one cluster
  • Large clusters are successively divided
  • Top down
Hierarchical Algorithms

• Single Link
• MST Single Link
• Complete Link
• Average Link
Dendrogram
• Dendrogram: a tree data structure which illustrates hierarchical clustering techniques.
• Each level shows the clusters for that level.
  • Leaf – individual clusters
  • Root – one cluster
• A cluster at level i is the union of its children clusters at level i+1.
Levels of Clustering

Agglomerative Example
Distance matrix:

    A  B  C  D  E
A   0  1  2  2  3
B   1  0  2  4  3
C   2  2  0  1  5
D   2  4  1  0  3
E   3  3  5  3  0

[Figure: the items A–E drawn as a graph with these distances, and the resulting dendrogram with leaves A, B, C, D, E plotted against threshold levels 1 through 5.]
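A short sketch that reproduces this example with SciPy's single-link agglomerative clustering, using the distance matrix above; the threshold values mirror the dendrogram levels.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

labels = ["A", "B", "C", "D", "E"]
D = np.array([[0, 1, 2, 2, 3],
              [1, 0, 2, 4, 3],
              [2, 2, 0, 1, 5],
              [2, 4, 1, 0, 3],
              [3, 3, 5, 3, 0]], dtype=float)

# linkage() expects a condensed distance matrix; 'single' gives single-link merges
Z = linkage(squareform(D), method="single")

# Clusters obtained at increasing distance thresholds, as in the dendrogram
for t in (1, 2, 3):
    assignment = fcluster(Z, t=t, criterion="distance")
    print(f"threshold {t}:", dict(zip(labels, map(int, assignment))))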
Agglomerative Algorithm

Single Link
• View all items as a graph with links (distances) between them.
• Finds maximal connected components in this graph (illustrated in the sketch below).
• Two clusters are merged if there is at least one edge which connects them.
• Uses threshold distances at each level.
• Could be agglomerative or divisive.
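A minimal sketch of the graph view: keep only the edges whose distance is at or below the threshold and report the connected components as clusters. The union-find helper is illustrative, not an optimized implementation.

def single_link_clusters(D, threshold):
    # Clusters = connected components of the graph whose edges are the pairs
    # with distance <= threshold
    n = len(D)
    parent = list(range(n))

    def find(i):                        # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if D[i][j] <= threshold:    # an edge connects the two items
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# Using the distance matrix from the agglomerative example above (items 0..4 = A..E)
D = [[0, 1, 2, 2, 3],
     [1, 0, 2, 4, 3],
     [2, 2, 0, 1, 5],
     [2, 4, 1, 0, 3],
     [3, 3, 5, 3, 0]]
print(single_link_clusters(D, threshold=1))   # [[0, 1], [2, 3], [4]]
print(single_link_clusters(D, threshold=2))   # [[0, 1, 2, 3], [4]]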
Single Link Clustering

5.5 Partitional Algorithms
Partitional Clustering

• Nonhierarchical
• Creates clusters in one step as opposed to several steps.
• Since only one set of clusters is output, the user normally has to input the desired number of clusters, k.
• Usually deals with static sets.
Partitional Algorithms

• Minimum Spanning Tree
• Squared Error Clustering Algorithm
• K-Means
• Nearest Neighbor
• PAM – Partitioning Around Medoids
• BEA – Bond Energy Algorithm
• Clustering with Genetic Algorithms (GA) & Neural Networks (NN)
MST – Min. Spanning Tree Algorithm

Squared Error

• The squared error of a cluster is the sum of the squared distances of its items from the cluster center; the squared error of a clustering is the sum of these values over all clusters.
• The algorithm seeks the set of clusters that minimizes this squared error.
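A small sketch of the squared-error computation; the two 1-D clusters are hypothetical and chosen so the arithmetic is easy to follow.

import numpy as np

def squared_error(clusters):
    # Sum over all clusters of the squared distances of items from the cluster mean;
    # `clusters` is a list of arrays of points
    total = 0.0
    for K in clusters:
        center = K.mean(axis=0)               # cluster center (mean)
        total += ((K - center) ** 2).sum()    # squared distances to the center
    return total

# Hypothetical 1-D clustering used only to show the computation
clusters = [np.array([[2.0], [3.0], [4.0]]), np.array([[20.0], [25.0], [30.0]])]
print(squared_error(clusters))   # 2.0 + 50.0 = 52.0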
Squared Error Algorithm

K-Means
• Initial set of clusters randomly chosen.
• Iteratively, items are moved among sets of clusters until the desired set is reached.
• A high degree of similarity among elements in a cluster is obtained.
• Given a cluster Ki = {ti1, ti2, …, tim}, the cluster mean is mi = (1/m)(ti1 + … + tim).
K-Means Example
• Given: {2,4,10,12,3,20,30,11,25}, k=2
• Randomly assign means: m1=3, m2=4
• K1={2,3}, K2={4,10,12,20,30,11,25}, m1=2.5, m2=16
• K1={2,3,4}, K2={10,12,20,30,11,25}, m1=3, m2=18
• K1={2,3,4,10}, K2={12,20,30,11,25}, m1=4.75, m2=19.6
• K1={2,3,4,10,11,12}, K2={20,30,25}, m1=7, m2=25
• Stop, as the clusters with these means stay the same.
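A minimal 1-D k-means sketch that reproduces the iterations above; the convergence test (stop when the means no longer change) is the only detail assumed beyond the slide.

def k_means_1d(items, means, max_iter=100):
    # Assign each item to the nearest mean, recompute the means,
    # and stop when the means no longer change (assumes no cluster becomes empty)
    for _ in range(max_iter):
        clusters = [[] for _ in means]
        for x in items:
            nearest = min(range(len(means)), key=lambda j: abs(x - means[j]))
            clusters[nearest].append(x)
        new_means = [sum(c) / len(c) for c in clusters]
        if new_means == means:           # converged
            return clusters, means
        means = new_means
    return clusters, means

items = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, means = k_means_1d(items, means=[3, 4])
print(clusters)   # [[2, 4, 10, 12, 3, 11], [20, 30, 25]]
print(means)      # [7.0, 25.0]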
K-Means Algorithm

Nearest Neighbor

• Items are iteratively merged into the existing cluster that is closest.
• Incremental
• A threshold, t, is used to determine if an item is added to an existing cluster or a new cluster is created (see the sketch below).
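A minimal sketch of the incremental behaviour, assuming 1-D data and a hypothetical threshold t = 5: each item joins the cluster containing its nearest already-clustered item if that distance is within t, otherwise it starts a new cluster.

def nearest_neighbor_clustering(items, t):
    # Incremental: the first item starts the first cluster, every later item either
    # joins the cluster of its nearest neighbour (if within t) or starts a new cluster
    clusters = [[items[0]]]
    for x in items[1:]:
        dists = [min(abs(x - y) for y in c) for c in clusters]   # nearest item per cluster
        best = min(range(len(clusters)), key=lambda i: dists[i])
        if dists[best] <= t:
            clusters[best].append(x)     # close enough: join that cluster
        else:
            clusters.append([x])         # too far from everything: new cluster
    return clusters

# Hypothetical 1-D data and threshold, used only to illustrate the behaviour
print(nearest_neighbor_clustering([2, 4, 10, 12, 3, 20, 30, 11, 25], t=5))
# [[2, 4, 3], [10, 12, 11], [20, 25], [30]]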
Nearest Neighbor Algorithm

PAM
• Partitioning Around Medoids (PAM), also called K-Medoids
• Handles outliers well.
• Ordering of input does not impact results.
• Does not scale well.
• Each cluster is represented by one item, called the medoid.
• Initial set of k medoids randomly chosen.
PAM
PAM Cost Calculation
• At each step in the algorithm, medoids are changed if the overall cost is improved.
• Cjih – the change in cost for an item tj associated with swapping medoid ti with non-medoid th; the total cost of a swap is the sum of these changes over all items.
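A small sketch of the swap evaluation, assuming the usual k-medoids cost (each item contributes its distance to the nearest medoid); the starting medoids are hypothetical and the distance matrix is the one from the agglomerative example.

def total_cost(D, medoids):
    # Cost of a medoid set: each item contributes its distance to the nearest medoid
    return sum(min(D[j][m] for m in medoids) for j in range(len(D)))

def swap_cost_change(D, medoids, i, h):
    # Change in total cost when medoid i is swapped with non-medoid h
    # (the sum of the per-item changes Cjih); negative means the swap is an improvement
    new_medoids = [h if m == i else m for m in medoids]
    return total_cost(D, new_medoids) - total_cost(D, medoids)

D = [[0, 1, 2, 2, 3],     # distance matrix from the earlier example (items 0..4 = A..E)
     [1, 0, 2, 4, 3],
     [2, 2, 0, 1, 5],
     [2, 4, 1, 0, 3],
     [3, 3, 5, 3, 0]]
medoids = [0, 1]                                 # hypothetical starting medoids: A and B
print(swap_cost_change(D, medoids, i=1, h=3))    # -2: swapping B for D lowers the cost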
PAM Algorithm

BEA
• Bond Energy Algorithm
• Used in database design (physical and logical), e.g. vertical fragmentation
• Determines the affinity (bond) between attributes based on common usage (see the sketch below).
• Algorithm outline:
  1. Create affinity matrix
  2. Convert to BOND matrix
  3. Create regions of close bonding
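A small sketch of step 1 of the outline. The query usage matrix and access frequencies below are hypothetical; the affinity of two attributes is taken here as the total frequency of the queries that use both.

import numpy as np

# Hypothetical input: use[q][a] = 1 if query q accesses attribute a, else 0,
# and freq[q] = how often query q is run
use = np.array([[1, 0, 1, 0],    # q1 uses A1, A3
                [0, 1, 1, 0],    # q2 uses A2, A3
                [1, 0, 0, 1],    # q3 uses A1, A4
                [0, 1, 0, 1]])   # q4 uses A2, A4
freq = np.array([45, 5, 75, 3])

# Step 1 - affinity matrix: aff(Ai, Aj) = total frequency of queries using both Ai and Aj
n_attrs = use.shape[1]
aff = np.zeros((n_attrs, n_attrs), dtype=int)
for i in range(n_attrs):
    for j in range(n_attrs):
        aff[i, j] = freq[(use[:, i] == 1) & (use[:, j] == 1)].sum()

print(aff)
# Steps 2 and 3 would then reorder rows/columns to maximise the bond energy
# and identify regions of closely bonded attributes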
BEA

Modified from [OV99]

Genetic Algorithm Example
• Items: {A,B,C,D,E,F,G,H}
• Randomly choose an initial solution: {A,C,E} {B,F} {D,G,H}, encoded as the bit strings 10101000, 01000100, 00010011
• Suppose crossover at point four between the 1st and 3rd individuals: 10100011, 01000100, 00011000 (see the sketch below)
• What should the termination criteria be?
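A minimal sketch of the crossover step on the bit-string encoding; it reproduces the result shown above.

def crossover(parent1, parent2, point):
    # Single-point crossover: swap the tails of the two bit strings after `point`
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2

# The three individuals encode cluster membership of items A..H
population = ["10101000", "01000100", "00010011"]

# Crossover at point four between the 1st and 3rd individuals, as in the example
c1, c2 = crossover(population[0], population[2], point=4)
print(c1, c2)   # 10100011 00011000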
GA Algorithm

