fbpx
Wikipedia

SUBCLU

SUBCLU is an algorithm for clustering high-dimensional data by Karin Kailing, Hans-Peter Kriegel and Peer Kröger.[1] It is a subspace clustering algorithm that builds on the density-based clustering algorithm DBSCAN. SUBCLU can find clusters in axis-parallel subspaces, and uses a bottom-up, greedy strategy to remain efficient.

Approach edit

SUBCLU uses a monotonicity criteria: if a cluster is found in a subspace  , then each subspace   also contains a cluster. However, a cluster   in subspace   is not necessarily a cluster in  , since clusters are required to be maximal, and more objects might be contained in the cluster in   that contains  . However, a density-connected set in a subspace   is also a density-connected set in  .

This downward-closure property is utilized by SUBCLU in a way similar to the Apriori algorithm: first, all 1-dimensional subspaces are clustered. All clusters in a higher-dimensional subspace will be subsets of the clusters detected in this first clustering. SUBCLU hence recursively produces  -dimensional candidate subspaces by combining  -dimensional subspaces with clusters sharing   attributes. After pruning irrelevant candidates, DBSCAN is applied to the candidate subspace to find out if it still contains clusters. If it does, the candidate subspace is used for the next combination of subspaces. In order to improve the runtime of DBSCAN, only the points known to belong to clusters in one  -dimensional subspace (which is chosen to contain as little clusters as possible) are considered. Due to the downward-closure property, other point cannot be part of a  -dimensional cluster anyway.

Pseudocode edit

SUBCLU takes two parameters,   and  , which serve the same role as in DBSCAN. In a first step, DBSCAN is used to find 1D-clusters in each subspace spanned by a single attribute:

 

 
 
 
 
 
 
 
 
 
// In a second step,  -dimensional clusters are built from  -dimensional ones:
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

 

The set   contains all the  -dimensional subspaces that are known to contain clusters. The set   contains the sets of clusters found in the subspaces. The   is chosen to minimize the runs of DBSCAN (and the number of points that need to be considered in each run) for finding the clusters in the candidate subspaces.

Candidate subspaces are generated much alike the Apriori algorithm generates the frequent itemset candidates: Pairs of the  -dimensional subspaces are compared, and if they differ in one attribute only, they form a  -dimensional candidate. However, a number of irrelevant candidates are found as well; they contain a  -dimensional subspace that does not contain a cluster. Hence, these candidates are removed in a second step:

 

 
 
 
 
 
 
 
 
// Pruning of irrelevant candidate subspaces
 
 
 
 
 
 
 

 

Availability edit

An example implementation of SUBCLU is available in the ELKI framework.

References edit

  1. ^ Karin Kailing, Hans-Peter Kriegel and Peer Kröger. Density-Connected Subspace Clustering for High-Dimensional Data. In: Proc. SIAM Int. Conf. on Data Mining (SDM'04), pp. 246-257, 2004.

subclu, this, article, relies, largely, entirely, single, source, relevant, discussion, found, talk, page, please, help, improve, this, article, introducing, citations, additional, sources, find, sources, news, newspapers, books, scholar, jstor, february, 2010. This article relies largely or entirely on a single source Relevant discussion may be found on the talk page Please help improve this article by introducing citations to additional sources Find sources SUBCLU news newspapers books scholar JSTOR February 2010 SUBCLU is an algorithm for clustering high dimensional data by Karin Kailing Hans Peter Kriegel and Peer Kroger 1 It is a subspace clustering algorithm that builds on the density based clustering algorithm DBSCAN SUBCLU can find clusters in axis parallel subspaces and uses a bottom up greedy strategy to remain efficient Contents 1 Approach 2 Pseudocode 3 Availability 4 ReferencesApproach editSUBCLU uses a monotonicity criteria if a cluster is found in a subspace S displaystyle S nbsp then each subspace T S displaystyle T subseteq S nbsp also contains a cluster However a cluster C DB displaystyle C subseteq DB nbsp in subspace S displaystyle S nbsp is not necessarily a cluster in T S displaystyle T subseteq S nbsp since clusters are required to be maximal and more objects might be contained in the cluster in T displaystyle T nbsp that contains C displaystyle C nbsp However a density connected set in a subspace S displaystyle S nbsp is also a density connected set in T S displaystyle T subseteq S nbsp This downward closure property is utilized by SUBCLU in a way similar to the Apriori algorithm first all 1 dimensional subspaces are clustered All clusters in a higher dimensional subspace will be subsets of the clusters detected in this first clustering SUBCLU hence recursively produces k 1 displaystyle k 1 nbsp dimensional candidate subspaces by combining k displaystyle k nbsp dimensional subspaces with clusters sharing k 1 displaystyle k 1 nbsp attributes After pruning irrelevant candidates DBSCAN is applied to the candidate subspace to find out if it still contains clusters If it does the candidate subspace is used for the next combination of subspaces In order to improve the runtime of DBSCAN only the points known to belong to clusters in one k displaystyle k nbsp dimensional subspace which is chosen to contain as little clusters as possible are considered Due to the downward closure property other point cannot be part of a k 1 displaystyle k 1 nbsp dimensional cluster anyway Pseudocode editSUBCLU takes two parameters ϵ displaystyle epsilon nbsp and MinPts displaystyle MinPts nbsp which serve the same role as in DBSCAN In a first step DBSCAN is used to find 1D clusters in each subspace spanned by a single attribute SUBCLU DB eps MinPts displaystyle mathtt SUBCLU DB eps MinPts nbsp S1 displaystyle S 1 emptyset nbsp C1 displaystyle C 1 emptyset nbsp foreacha Attributes displaystyle mathtt for each a in Attributes nbsp C a DBSCAN DB a eps MinPts displaystyle C a mathtt DBSCAN DB a eps MinPts nbsp if C a displaystyle mathtt if C a neq emptyset nbsp S1 S1 a displaystyle S 1 S 1 cup a nbsp C1 C1 C a displaystyle C 1 C 1 cup C a nbsp dd endif displaystyle mathtt end if nbsp dd endfor displaystyle mathtt end for nbsp In a second step k 1 displaystyle k 1 nbsp dimensional clusters are built from k displaystyle k nbsp dimensional ones k 1 displaystyle k 1 nbsp while Ck displaystyle mathtt while C k neq emptyset nbsp CandSk 1 GenerateCandidateSubspaces Sk displaystyle mathtt CandS k 1 mathtt GenerateCandidateSubspaces S k nbsp foreachcand CandSk 1 displaystyle mathtt for each cand in mathtt CandS k 1 nbsp bestSubspace mins Sk s cand Ci Cs Ci displaystyle mathtt bestSubspace min s in S k wedge s subset cand sum C i in C s C i nbsp Ccand displaystyle C cand emptyset nbsp foreachclustercl CbestSubspace displaystyle mathtt for each cluster cl in C mathtt bestSubspace nbsp Ccand Ccand DBSCAN cl cand eps MinPts displaystyle C cand C cand cup mathtt DBSCAN cl cand eps MinPts nbsp if Ccand displaystyle mathtt if C cand neq emptyset nbsp Sk 1 Sk 1 cand displaystyle S k 1 S k 1 cup cand nbsp Ck 1 Ck 1 Ccand displaystyle C k 1 C k 1 cup C cand nbsp dd endif displaystyle mathtt end if nbsp dd endfor displaystyle mathtt end for nbsp dd endfor displaystyle mathtt end for nbsp k k 1 displaystyle k k 1 nbsp dd endwhile displaystyle mathtt end while nbsp end displaystyle mathtt end nbsp The set Sk displaystyle S k nbsp contains all the k displaystyle k nbsp dimensional subspaces that are known to contain clusters The set Ck displaystyle C k nbsp contains the sets of clusters found in the subspaces The bestSubspace displaystyle bestSubspace nbsp is chosen to minimize the runs of DBSCAN and the number of points that need to be considered in each run for finding the clusters in the candidate subspaces Candidate subspaces are generated much alike the Apriori algorithm generates the frequent itemset candidates Pairs of the k displaystyle k nbsp dimensional subspaces are compared and if they differ in one attribute only they form a k 1 displaystyle k 1 nbsp dimensional candidate However a number of irrelevant candidates are found as well they contain a k displaystyle k nbsp dimensional subspace that does not contain a cluster Hence these candidates are removed in a second step GenerateCandidateSubspaces Sk displaystyle mathtt GenerateCandidateSubspaces S k nbsp CandSk 1 displaystyle mathtt CandS k 1 emptyset nbsp foreachs1 Sk displaystyle mathtt for each s 1 in S k nbsp foreachs2 Sk displaystyle mathtt for each s 2 in S k nbsp if s1ands2differinexactlyoneattribute displaystyle mathtt if s 1 mathtt and s 2 mathtt differ in exactly one attribute nbsp CandSk 1 CandSk 1 s1 s2 displaystyle mathtt CandS k 1 mathtt CandS k 1 cup s 1 cup s 2 nbsp dd endif displaystyle mathtt end if nbsp dd endfor displaystyle mathtt end for nbsp dd endfor displaystyle mathtt end for nbsp Pruning of irrelevant candidate subspaces foreachcand CandSk 1 displaystyle mathtt for each cand in mathtt CandS k 1 nbsp foreachk elements cand displaystyle mathtt for each k texttt element s subset cand nbsp if s Sk displaystyle mathtt if s not in S k nbsp CandSk 1 CandSk 1 cand displaystyle mathtt CandS k 1 mathtt CandS k 1 setminus cand nbsp dd endif displaystyle mathtt end if nbsp dd endfor displaystyle mathtt end for nbsp dd endfor displaystyle mathtt end for nbsp end displaystyle mathtt end nbsp Availability editAn example implementation of SUBCLU is available in the ELKI framework References edit Karin Kailing Hans Peter Kriegel and Peer Kroger Density Connected Subspace Clustering for High Dimensional Data In Proc SIAM Int Conf on Data Mining SDM 04 pp 246 257 2004 Retrieved from https en wikipedia org w index php title SUBCLU amp oldid 1126165023, wikipedia, wiki, book, books, library,

article

, read, download, free, free download, mp3, video, mp4, 3gp, jpg, jpeg, gif, png, picture, music, song, movie, book, game, games.