模式識別與人工智能之七-part2_第1頁
已閱讀1頁,還剩31頁未讀, 繼續(xù)免費閱讀

下載本文檔

版權(quán)說明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請進行舉報或認(rèn)領(lǐng)

文檔簡介

1、Pattern Recognition & artificial IntelligenceLecture 7: 聚類算法(三 conts),1,主要內(nèi)容,Hierarchical Clustering—Realization 基于分層的聚類算法-代碼實現(xiàn),BIRCH:利用層次方法的平衡迭代歸約和聚類,,01,Chameleon:利用動態(tài)建模的層次聚類算法,,04,ROCK:分類屬性的層次聚類算法,,02,CURE:基于

2、質(zhì)心和基于代表對象方法之間的中間策略,,03,2,BIRCH,BIRCH: Balanced Iterative Reducing and Clustering Using HierarchiesAgglomerative Clustering designed for clustering a large amount of numerical dataWhat Birch algorithm tries to solve?

3、Most of the existing algorithms DO NOT consider the case that datasets can be too large to fit in main memoryThey DO NOT concentrate on minimizing the number of scans of the datasetI/O costs are very highThe compl

4、exity of BIRCH is O(n) where n is the number of objects to be clustered.,BIRCH: The Idea by example,Data Objects,,1,Clustering Process (build a tree),,,,Cluster1,,1,2,3,4,5,6,2,,If cluster 1 becomes too large (not compa

5、ct) by adding object 2,then split the cluster,Leaf node,BIRCH: The Idea by example,Data Objects,1,Clustering Process (build a tree),,,,Cluster1,,1,2,3,4,5,6,2,Leaf node,,,Cluster2,entry 1,entry 2,Leaf node with two ent

6、ries,BIRCH: The Idea by example,Data Objects,1,Clustering Process (build a tree),,,,Cluster1,,1,2,3,4,5,6,2,Leaf node,,,Cluster2,3,entry1 is the closest to object 3If cluster 1 becomes too large by adding object 3,th

7、en split the cluster,entry 1,entry 2,BIRCH: The Idea by example,Data Objects,1,Clustering Process (build a tree),,,,Cluster1,,1,2,3,4,5,6,2,Leaf node,,Cluster2,3,,,entry 1,,entry 2,entry 3,Cluster3,Leaf node with three

8、entries,BIRCH: The Idea by example,Data Objects,1,Clustering Process (build a tree),,,,Cluster1,,1,2,3,4,5,6,2,Leaf node,,Cluster2,3,,,entry 1,,entry 2,entry 3,Cluster3,4,entry3 is the closest to object 4Cluster 2 rem

9、ains compact when adding object 4then add object 4 to cluster 2,,Cluster2,BIRCH: The Idea by example,Data Objects,1,Clustering Process (build a tree),,,,Cluster1,,1,2,3,4,5,6,2,Leaf node,3,,,entry 1,,entry 2,entry 3,C

10、luster3,4,entry2 is the closest to object 5Cluster 3 becomes too large by adding object 5then split cluster 3?BUT there is a limit to the number of entries a node can haveThus, split the node,,Cluster2,5,BIRCH: Th

11、e Idea by example,Data Objects,1,Clustering Process (build a tree),,,Cluster1,,1,2,3,4,5,6,2,Leaf node,3,,Cluster3,4,,Cluster2,5,,,entry 1,entry 2,,,entry 1.1,entry 1.2,,,entry 2.1,entry 2.2,Leaf node,Non-Leaf node,,Clu

12、ster4,,,BIRCH: The Idea by example,Data Objects,1,Clustering Process (build a tree),,,Cluster1,,1,2,3,4,5,6,2,Leaf node,3,,Cluster3,4,,Cluster2,5,,,entry 1,entry 2,,,entry 1.1,entry 1.2,,,entry 2.1,entry 2.2,Leaf node,N

13、on-Leaf node,,Cluster4,,,6,entry1.2 is the closest to object 6Cluster 3 remains compact when adding object 6then add object 6 to cluster 3,,Cluster3,BIRCH: Key Components,Clustering Feature (CF)Summary of the stati

14、stics for a given cluster: the 0-th, 1st and 2nd moments of the cluster from the statistical point of view Used to compute centroids, and measures the compactness and distance of clusters CF-Treeheight-balanced tre

15、e two parameters: number of entries in each node The diameter of all entries in a leaf nodeLeaf nodes are connected via prev and next pointers,Clustering Feature,Clustering Feature (CF): CF = (N, LS, SS) N

16、: Number of data points LS: linear sum of N points: SS: square sum of N points:,,,,,Cluster 1 (2,5) (3,2) (4,3),CF2= ?3, (35,36), (417 ,440)?,,,,Cluster 2,,CF1= ?3, (2+3+4 , 5+2+3), (22+32+42 , 52+22+32)? = ?

17、3, (9,10), (29 ,38)?,,Cluster3,CF3=CF1+CF2= ?3+3, (9+35, 10+36), (29+417 , 38+440)? = ?6, (44,46), (446 ,478)?,Properties of Clustering Feature,CF entry is a summary of statistics of the cluster A representation of th

18、e clusterA CF entry has sufficient information to calculate the centroid, radius, diameter and many other distance measuresAdditively theorem allows us to merge sub-clusters incrementally,Distance Measures,Given a

19、 cluster with data points , Centroid: Radius: average distance from any point of the cluster to its centroidDiameter: square root of average mean squared distance between all pairs of points in the cluster,CF T

20、ree,B = Branching Factor, maximum children in a non-leaf nodeT = Threshold for diameter or radius of the cluster in a leafL = number of entries in a leafCF entry in parent =

21、 sum of CF entries of a child of that entryIn-memory, height-balanced tree,,,,…,…,…,…,…,,,,Root level,Firstlevel,CF Tree Insertion,Start with the rootFind the CF entry in the root closest to the data point, move to

22、 that child and repeat the process until a closest leaf entry is found.At the leafIf the point can be accommodated in the cluster, update the entryIf this addition violates the threshold T, split the entry, if this

23、violates the limit imposed by L, split the leaf. If its parent node too is full, split that and so onUpdate the CF entries from the root to the leaf to accommodate this point,,Phase 1: Load into memory by building a CF

24、 tree,,Phase 2 (optional): Condense tree into desirable range by building a smaller CF tree,,Initial CF tree,Data,,Phase 3: Global Clustering,,Smaller CF tree,,Good Clusters,Phase 4: (optional and offline): Cluster Refin

25、ing,Better Clusters,,,,Birch Algorithm,Birch Algorithm: Phase 1,Choose an initial value for threshold, start inserting the data points one by one into the tree as per the insertion algorithmIf, in the middle of the abo

26、ve step, the size of the CF tree exceeds the size of the available memory, increase the value of thresholdConvert the partially built tree into a new treeRepeat the above steps until the entire dataset is scanned and

27、 a full tree is builtOutlier Handling,Birch Algorithm: Phase 2,3, and 4,Phase 2A bridge between phase 1 and phase 3Builds a smaller CF tree by increasing the thresholdPhase 3Apply global clustering algorithm to

28、the sub-clusters given by leaf entries of the CF treeImproves clustering qualityPhase 4Scan the entire dataset to label the data pointsOutlier handling,ROCK,Similarity function NeighborsLinksCriterion functionGo

29、odness measure,Major definitions,ROCK,Similarity function NeighborsLinksCriterion functionGoodness measure,Major definitions,ROCK,Similarity function,Let Sim (Pi, Pj) be a similarity function that is used to measure

30、the closeness between points pi and Pj. ROCK assumes that Sim function is normalized to return a value between 0 and 1For Quran treasures data, a possible definition for the sim function is based on the Jaccard coeffic

31、ient:,ROCK,Neighbors and links,,,NeighborIf similarity between two points exceeds certain similarity threshold (?), they are neighbors.LinkThe Link for pair of points is: the number of their common neighbors.Obvious

32、ly, Link incorporates global information about the other points in the neighborhood of the two points. The larger the Link, the higher probability that this pair of points are in the same clusters.,ROCK,Criterion functio

33、n,,,to get the best clusters, we have to maximize this Criterion FunctionWhere Ci denotes cluster i ni is the number of points in Ci k is the number of clusters ? is the simi

34、larity threshold Suppose in Ci, each point has roughly nf(θ) neighbors.A suitable choice for basket data is : f(θ)=(1-θ)/(1+θ),ROCK,Goodness measure,,,Goodness FunctionDuring clustering, we use this goodness measur

35、e in order to maximize the criterion function. This goodness measure helps to identify the best pair of clusters to be merged during each step of ROCK.,ROCK,Algorithm,,,,Input: A set S of data points

36、Number of k clusters to be found The similarity thresholdOutput: Groups of clustered dataThe ROCK algorithm is divided into three major parts:Draw a random sample from the data set:Perform a hierarchi

37、cal agglomerative clustering algorithmLabel data on diskin our case, we do not deal with a very huge data set. So, we will consider the whole data in the process of forming clusters, i.e. we skip step1 and step3,ROCK,A

38、lgorithm,,,,places each single data point into a separate clustercompute the similarity measure for all pairs of clustersmerge the two clusters with the highest similarity (goodness measure)Verify a stop condition. If

39、 it is not met then go to step b,Computation of links:using the similarity threshold ?, we can convert the similarity matrix into an adjacency matrix (A)Then we obtain a matrix indicating the number of links by calcula

40、ting (A x A ) , i.e., by multiplying the adjacency matrix A with itself,ROCK,Suppose we have four verses contains some subjects , as follows:P1={ judgment, faith, prayer, fair}P2={ fasting, faith, prayer}P3={ fair, fa

41、sting, faith}P4={ fasting, prayer, pilgrimage}the similarity threshold = 0.3, and number of required cluster is 2.using Jaccard coefficient as a similarity measure, we obtain the following similarity table :,Example:,

42、ROCK,Example:,Since we have a similarity threshold equal to 0.3, then we derive the adjacency table:?By multiplying the adjacency table with itself, we derive the following table which shows the number of links (or comm

43、on neighbors) :?,ROCK,Example:,we compute the goodness measure for all adjacent points ,assuming that f(?) =1-? / 1+? we obtain the following table?: we have an equal goodness measure for merging ((P

44、1,P2), (P2,P1), (P3,P1)),ROCK,Example:,Now, after merging P1 and P2, we have only three clusters. The following table shows the number of common neighbors for these clusters:?Then we can obtain the following goodness me

45、asures for all adjacent clusters:?,Since the number of required clusters is 2, then we finish the clustering algorithm by merging C(P1,P2) and P3, obtaining a new cluster C(P1,P2,P3) which contains {P1,P2,P3} leaving P4

溫馨提示

  • 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
  • 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
  • 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁內(nèi)容里面會有圖紙預(yù)覽,若沒有圖紙預(yù)覽就沒有圖紙。
  • 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
  • 5. 眾賞文庫僅提供信息存儲空間,僅對用戶上傳內(nèi)容的表現(xiàn)方式做保護處理,對用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對任何下載內(nèi)容負(fù)責(zé)。
  • 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請與我們聯(lián)系,我們立即糾正。
  • 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時也不承擔(dān)用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。

評論

0/150

提交評論