
Clustering for Big Data Analytics
Basic Concepts and Methods

 Cluster Analysis: Basic Concepts
 Partitioning Methods
 Hierarchical Methods
What is Cluster Analysis?
 Cluster: A collection of data objects
 similar (or related) to one another within the same group
 dissimilar (or unrelated) to the objects in other groups
 Cluster analysis (or clustering, data segmentation, …)
 Finding similarities between data according to the characteristics
found in the data, and grouping similar data objects into clusters
 Unsupervised learning: no predefined classes (i.e., learning by
observations vs. learning by examples: supervised)
 Typical applications
 As a stand-alone tool to get insight into data distribution
 As a preprocessing step for other algorithms
Application Areas: Where Clustering is Appealing for Big Data Analysis
 Big data in biological domains: taxonomy of living things:
kingdom, phylum, class, order, family, genus, and species
 Clustering of large document collections for information retrieval
 Big geo data, e.g., land use: identification of areas of similar land
use in an earth observation database
 Marketing: help marketers discover distinct groups in their
customer bases, and then use this knowledge to develop targeted
marketing programs
 City planning: identifying groups of houses according to their house
type, value, and geographical location
 Earthquake studies: observed earthquake epicenters should be
clustered along continental faults
 Climate: understanding Earth's climate by finding patterns in
atmospheric and ocean data
 Economic science: market research
Clustering as a Preprocessing Tool for Big Data Analytics
 Summarization:
 Preprocessing for regression, PCA, classification, and
association analysis
 Compression:
 Image processing: vector quantization
 Finding K-nearest neighbors
 Localizing search to one or a small number of clusters
 Outlier detection
 Outliers are often viewed as those “far away” from any cluster
Quality: What Is Good Clustering?
 A good clustering method will produce high-quality clusters with
 high intra-class similarity: cohesive within clusters
 low inter-class similarity: distinctive between clusters
 The quality of a clustering method depends on
 the similarity measure used by the method,
 its implementation, and
 its ability to discover some or all of the hidden patterns
Measure the Quality of Clustering
 Dissimilarity/similarity metric
 Similarity is expressed in terms of a distance function,
typically a metric: d(i, j)
 The definitions of distance functions are usually rather
different for interval-scaled, boolean, categorical, ordinal,
ratio, and vector variables
 Weights should be associated with different variables
based on applications and data semantics
 Quality of clustering:
 There is usually a separate “quality” function that
measures the “goodness” of a cluster (see the sketch below)
 It is hard to define “similar enough” or “good enough”
 The answer is typically highly subjective
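To make the notion of a distance-based "quality" function concrete, here is a minimal Python sketch, assuming NumPy is available, that compares within-cluster cohesion against between-cluster separation for a labeled clustering; the data, labels, and the within/between ratio itself are illustrative choices, not a standard from the slides.

```python
import numpy as np

def clustering_quality(X, labels):
    """Average within-cluster vs. between-cluster Euclidean distance.

    A lower within/between ratio indicates more cohesive,
    better-separated clusters (illustrative quality function only).
    """
    # Pairwise Euclidean distances between all objects
    diff = X[:, None, :] - X[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(X), dtype=bool)
    within = d[same & off_diag].mean()   # cohesion
    between = d[~same].mean()            # separation
    return within, between

# Toy data: two blobs in 2-D (made up for illustration)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
labels = np.array([0] * 20 + [1] * 20)
w, b = clustering_quality(X, labels)
print(f"within={w:.2f}, between={b:.2f}, ratio={w/b:.2f}")
```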
Considerations for Cluster Analysis
 Partitioning criteria
 Single-level vs. hierarchical partitioning (often, multi-level
hierarchical partitioning is desirable)
 Separation of clusters
 Exclusive (e.g., one customer belongs to only one region) vs.
non-exclusive (e.g., one document may belong to more than one
class)
 Similarity measure
 Distance-based (e.g., Euclidean, road network, vector) vs.
connectivity-based (e.g., density or contiguity)
 Clustering space
 Full space (often when low-dimensional) vs. subspaces (often in
high-dimensional clustering)
Data to Big Data: Requirements and Challenges
 Scalability
 Clustering all the data instead of only samples
 Ability to deal with different types of attributes
 Numerical, binary, categorical, ordinal, linked, and mixtures of these
 Constraint-based clustering
 User may give inputs on constraints
 Use domain knowledge to determine input parameters
 Interpretability and usability
 Others
 Discovery of clusters with arbitrary shape
 Ability to deal with noisy data
 Incremental clustering and insensitivity to input order
 High dimensionality
Data to Big Data: Major Clustering Approaches (I)
 Partitioning approach:
 Construct various partitions and then evaluate them by some criterion, e.g.,
minimizing the sum of squared errors
 Typical methods: k-means, k-medoids, CLARA (Clustering Large Applications),
CLARANS (Clustering Large Applications based on RANdomized Search)
 The k-medoids clustering technique resolves the limitation of the k-means algorithm
of being adversely affected by noise/outliers in the input data. But k-medoids proves to
be a computationally costly method for considerably large values of ‘k’ (number of
clusters) and large datasets.
 The CLARA algorithm was introduced as an extension of k-medoids. It uses only random
samples of the input data (instead of the entire dataset) and computes the best medoids
in those samples (see the sketch after this list). It thus works better than k-medoids for
large datasets. However, the algorithm may give wrong clustering results if one or more
sampled medoids lie far from the actual best medoids.
 The CLARANS algorithm addresses the drawbacks of both the k-medoids and CLARA
algorithms, besides dealing with difficult-to-handle data such as spatial data. It maintains
a balance between the computational cost and the influence of data sampling on cluster
formation.
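A rough, self-contained Python sketch of the CLARA idea (NumPy assumed): run a medoid search on random samples and keep the medoids that score best on the full dataset. The inner search here is a simplified alternating k-medoids variant rather than full PAM, and the sample count and size are arbitrary illustrative values.

```python
import numpy as np

def kmedoids_alternate(X, k, rng, iters=20):
    """Simplified alternating k-medoids, used on each sample."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    medoids = rng.choice(len(X), size=k, replace=False)
    for _ in range(iters):
        labels = d[:, medoids].argmin(axis=1)
        # New medoid of each cluster: the member minimizing total distance
        new = np.array([np.where(labels == i)[0][
                            d[np.ix_(labels == i, labels == i)].sum(axis=1).argmin()]
                        if np.any(labels == i) else medoids[i] for i in range(k)])
        if np.array_equal(np.sort(new), np.sort(medoids)):
            break
        medoids = new
    return medoids

def clara(X, k=3, n_samples=5, sample_size=40, seed=0):
    """CLARA idea: medoid search on random samples, scored on the FULL data."""
    rng = np.random.default_rng(seed)
    best_cost, best = np.inf, None
    for _ in range(n_samples):
        idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
        medoids = X[idx][kmedoids_alternate(X[idx], k, rng)]
        # Evaluate the candidate medoids over the entire dataset
        cost = np.linalg.norm(X[:, None, :] - medoids[None, :, :],
                              axis=-1).min(axis=1).sum()
        if cost < best_cost:
            best_cost, best = cost, medoids
    return best, best_cost
```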
Major Clustering Approaches (I)
 Hierarchical approach:
 Create a hierarchical decomposition of the set of data
(or objects) using some criterion
 Typical methods: Diana, Agnes, BIRCH (Balanced Iterative Reducing
and Clustering using Hierarchies), CHAMELEON (clustering using
dynamic modeling)
 Density-based approach:
 Based on connectivity and density functions
 Typical methods: DBSCAN, OPTICS, DenClue
 Grid-based approach:
 Based on a multiple-level granularity structure
 Typical methods: STING, WaveCluster, CLIQUE
Major Clustering Approaches (II)
 Model-based:
 A model is hypothesized for each of the clusters, and the method
tries to find the best fit of the data to the given model
 Typical methods: EM, SOM, COBWEB
 Frequent pattern-based:
 Based on the analysis of frequent patterns
 Typical methods: p-Cluster
 User-guided or constraint-based:
 Clustering by considering user-specified or application-specific
constraints
 Typical methods: COD (obstacles), constrained clustering
 Link-based clustering:
 Objects are often linked together in various ways
 Massive links can be used to cluster objects: SimRank, LinkClus
Chapter 10. Cluster Analysis: Basic Concepts and Methods
 Cluster Analysis: Basic Concepts
 Partitioning Methods
 Hierarchical Methods
Partitioning Algorithms: Basic Concept
 Partitioning method: Partitioning a database D of n objects into a set of k
clusters, such that the sum of squared distances is minimized (where c_i is
the centroid or medoid of cluster C_i; a small evaluation sketch follows):

E = \sum_{i=1}^{k} \sum_{p \in C_i} (p - c_i)^2

 Given k, find a partition of k clusters that optimizes the chosen partitioning
criterion
 Global optimum: exhaustively enumerate all partitions
 Heuristic methods: k-means and k-medoids algorithms
 k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented by
the center of the cluster
 k-medoids or PAM (Partitioning Around Medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects in
the cluster
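For concreteness, a minimal Python sketch (NumPy assumed) that evaluates this sum-of-squared-errors criterion for a given assignment; the arrays are whatever the caller supplies.

```python
import numpy as np

def sse(X, labels, centers):
    """Sum of squared distances from each object to its cluster center:
    E = sum_i sum_{p in C_i} ||p - c_i||^2."""
    return sum(((X[labels == i] - c) ** 2).sum() for i, c in enumerate(centers))
```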
The K-Means Clustering Method
 Given k, the k-means algorithm is implemented in four steps
(a minimal sketch follows the list):
 1. Partition objects into k nonempty subsets
 2. Compute seed points as the centroids of the
clusters of the current partitioning (the centroid is
the center, i.e., mean point, of the cluster)
 3. Assign each object to the cluster with the nearest
seed point
 4. Go back to Step 2; stop when the assignment does
not change
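A minimal NumPy sketch of these four steps; the random initialization, the iteration cap, and the empty-cluster guard are illustrative choices, not part of the slide.

```python
import numpy as np

def k_means(X, k, seed=0, max_iter=100):
    """Plain k-means following the four steps on the slide."""
    rng = np.random.default_rng(seed)
    # Step 1: induce an initial partition from k randomly chosen objects
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each object to the nearest seed point
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # Step 2: recompute the centroid (mean point) of each cluster;
        # keep the old center if a cluster happens to become empty
        new_centers = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                else centers[i] for i in range(k)])
        # Step 4: stop when the assignment no longer changes
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```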
An Example of K-Means Clustering

K = 2

[Figure: the initial data set is arbitrarily partitioned into k groups;
the cluster centroids are updated and objects reassigned, looping if needed.]

 Partition objects into k nonempty subsets
 Repeat
 Compute the centroid (i.e., mean point) of each partition
 Assign each object to the cluster of its nearest centroid
 Until no change
Comments on the K-Means Method
 Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is #
iterations. Normally, k, t << n.
 Comparing: PAM: O(k(n-k)^2), CLARA: O(ks^2 + k(n-k))
 Comment: Often terminates at a local optimum.
 Weakness
 Applicable only to objects in a continuous n-dimensional space
 Use the k-modes method for categorical data
 In comparison, k-medoids can be applied to a wide range of data
 Need to specify k, the number of clusters, in advance (there are
ways to automatically determine the best k; see Hastie et al., 2009,
and the sketch below)
 Sensitive to noisy data and outliers
 Not suitable for discovering clusters with non-convex shapes
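One common way to choose k automatically is to scan a range of values and keep the one with the highest average silhouette score. A hedged sketch, assuming scikit-learn is available; the range 2..7 is an arbitrary illustrative choice.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def choose_k(X, k_range=range(2, 8)):
    """Return the k in k_range with the highest average silhouette score."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get), scores
```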
Variations of the K-Means Method
 Most variants of k-means differ in
 Selection of the initial k means
 Dissimilarity calculations
 Strategies to calculate cluster means
 Handling categorical data: k-modes
 Replacing means of clusters with modes
 Using new dissimilarity measures to deal with categorical objects
 Using a frequency-based method to update modes of clusters
 A mixture of categorical and numerical data: the k-prototype method
What Is the Problem of the K-Means Method?
 The k-means algorithm is sensitive to outliers!
 Since an object with an extremely large value may substantially
distort the distribution of the data
 K-Medoids: Instead of taking the mean value of the objects in a cluster
as a reference point, a medoid can be used, which is the most
centrally located object in a cluster (see the sketch below)

[Figure: two 10 × 10 scatter plots contrasting the mean-based and
medoid-based reference points of a cluster.]
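A tiny Python illustration (NumPy assumed; the numbers are made up) of how a single outlier drags the mean far more than the medoid:

```python
import numpy as np

pts = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # 100.0 is the outlier
mean = pts.mean()                            # 22.0 -- badly distorted
# Medoid: the actual data point minimizing total distance to the others
medoid = pts[np.abs(pts[:, None] - pts[None, :]).sum(axis=1).argmin()]
print(mean, medoid)                          # 22.0 vs. 3.0
```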
PAM: A Typical K-Medoids Algorithm

K = 2

[Figure: a sequence of 10 × 10 scatter plots tracing one PAM pass;
the plots are annotated with Total Cost = 20 and Total Cost = 26
for the competing medoid configurations.]

 Arbitrarily choose k objects as the initial medoids
 Assign each remaining object to the nearest medoid
 Randomly select a non-medoid object, O_random
 Compute the total cost of swapping a medoid O with O_random
 If the quality is improved, perform the swap
 Repeat the loop until no change
The K-Medoids Clustering Method
 K-Medoids clustering: Find representative objects (medoids) in clusters
 PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987)
 Starts from an initial set of medoids and iteratively replaces one
of the medoids by one of the non-medoids if it improves the total
distance of the resulting clustering (a minimal sketch follows)
 PAM works effectively for small data sets but does not scale well
to large data sets (due to its computational complexity)
 Efficiency improvements on PAM
 CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples
 CLARANS (Ng & Han, 1994): Randomized re-sampling
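A minimal, self-contained Python sketch of the PAM swap loop (NumPy assumed); the random initialization and the greedy accept-every-improving-swap policy are simplifying choices for illustration.

```python
import numpy as np

def pam(X, k, seed=0):
    """Greedy PAM: keep performing medoid/non-medoid swaps that reduce
    the total distance, until no swap improves it."""
    rng = np.random.default_rng(seed)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise distances
    medoids = list(rng.choice(len(X), size=k, replace=False))

    def cost(meds):
        return d[:, meds].min(axis=1).sum()  # each object to its nearest medoid

    best = cost(medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):                  # each current medoid
            for h in range(len(X)):         # each non-medoid candidate
                if h in medoids:
                    continue
                trial = medoids[:i] + [h] + medoids[i + 1:]
                c = cost(trial)
                if c < best:                # keep the improving swap
                    best, medoids, improved = c, trial, True
    labels = d[:, medoids].argmin(axis=1)
    return medoids, labels, best
```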
Chapter 10. Cluster Analysis: Basic Concepts and Methods
 Cluster Analysis: Basic Concepts
 Partitioning Methods
 Hierarchical Methods
Hierarchical Clustering
 Uses a distance matrix as the clustering criterion. This method
does not require the number of clusters k as an input, but
needs a termination condition

[Diagram: five objects a, b, c, d, e. Agglomerative clustering (AGNES)
runs Step 0 -> Step 4, merging {a, b} -> ab, {d, e} -> de, {c, de} -> cde,
and finally {ab, cde} -> abcde. Divisive clustering (DIANA) runs the
same steps in reverse, Step 4 -> Step 0.]
AGNES (Agglomerative Nesting)
 Introduced in Kaufmann and Rousseeuw (1990)
 Implemented in statistical packages, e.g., S-Plus
 Uses the single-link method and the dissimilarity matrix
(see the SciPy sketch below)
 Merges the nodes that have the least dissimilarity
 Proceeds in a non-descending fashion
 Eventually all nodes belong to the same cluster

[Figure: three 10 × 10 scatter plots showing successive
single-link merges of the data objects.]
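A hedged sketch of single-link agglomerative clustering using SciPy (assuming scipy is available; the toy points and the cut at 3 clusters are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0]])
Z = linkage(X, method="single")                  # merge least-dissimilar pairs
labels = fcluster(Z, t=3, criterion="maxclust")  # cut to at most 3 clusters
print(labels)
```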
Dendrogram: Shows How Clusters Are Merged
 Decompose data objects into several levels of nested
partitionings (a tree of clusters), called a dendrogram
 A clustering of the data objects is obtained by cutting
the dendrogram at the desired level; each
connected component then forms a cluster (sketch below)
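Cutting the dendrogram at a given height is one call in SciPy; a small self-contained sketch (matplotlib for the drawing is an optional assumption, and the cut height of 2.5 is arbitrary for these toy points):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0]])
Z = linkage(X, method="single")
# Cutting at height 2.5: each connected component below the cut is a cluster
labels = fcluster(Z, t=2.5, criterion="distance")
dendrogram(Z)   # draw the merge tree
plt.show()
```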
DIANA (Divisive Analysis)
 Introduced in Kaufmann and Rousseeuw (1990)
 Implemented in statistical analysis packages, e.g., S-Plus
 Inverse order of AGNES
 Eventually each node forms a cluster on its own

[Figure: three 10 × 10 scatter plots showing the data objects
being successively split into smaller clusters.]
Distance between Clusters
 Single link: smallest distance between an element in one cluster and an
element in the other, i.e., dist(K_i, K_j) = \min_{t_{ip} \in K_i, t_{jq} \in K_j} d(t_{ip}, t_{jq})
 Complete link: largest distance between an element in one cluster and
an element in the other, i.e., dist(K_i, K_j) = \max_{t_{ip} \in K_i, t_{jq} \in K_j} d(t_{ip}, t_{jq})
 Average: average distance between an element in one cluster and an
element in the other, i.e., dist(K_i, K_j) = \mathrm{avg}_{t_{ip} \in K_i, t_{jq} \in K_j} d(t_{ip}, t_{jq})
 Centroid: distance between the centroids of two clusters, i.e.,
dist(K_i, K_j) = dist(C_i, C_j)
 Medoid: distance between the medoids of two clusters, i.e.,
dist(K_i, K_j) = dist(M_i, M_j)
 Medoid: a chosen, centrally located object in the cluster
(a small sketch of the first three measures follows)
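A small Python sketch (NumPy assumed; the two clusters are toy data) computing the single-link, complete-link, average, and centroid inter-cluster distances:

```python
import numpy as np

Ki = np.array([[0.0, 0.0], [1.0, 0.0]])
Kj = np.array([[4.0, 0.0], [5.0, 1.0]])

d = np.linalg.norm(Ki[:, None, :] - Kj[None, :, :], axis=-1)  # all cross-pair distances
print(d.min())   # single link   (smallest cross-cluster distance)
print(d.max())   # complete link (largest cross-cluster distance)
print(d.mean())  # average link
print(np.linalg.norm(Ki.mean(axis=0) - Kj.mean(axis=0)))  # centroid distance
```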
Centroid, Radius and Diameter of a Cluster (for numerical data sets)
 Centroid: the “middle” of a cluster:

C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}

 Radius: square root of the average distance from any point
of the cluster to its centroid:

R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}

 Diameter: square root of the average mean squared
distance between all pairs of points in the cluster
(all three are computed in the sketch below):

D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}
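These three statistics in a minimal NumPy sketch; the one-dimensional toy points are made up for illustration.

```python
import numpy as np

t = np.array([1.0, 2.0, 3.0, 6.0])          # points of one cluster
N = len(t)
Cm = t.sum() / N                            # centroid (mean point)
Rm = np.sqrt(((t - Cm) ** 2).sum() / N)     # radius
pair_sq = (t[:, None] - t[None, :]) ** 2    # squared distances over all pairs
Dm = np.sqrt(pair_sq.sum() / (N * (N - 1))) # diameter
print(Cm, Rm, Dm)
```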
