版權(quán)說明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請進行舉報或認(rèn)領(lǐng)
文檔簡介
1、Pattern Recognition & artificial IntelligenceLecture 7: 聚類算法(三 conts),1,主要內(nèi)容,Hierarchical Clustering—Realization 基于分層的聚類算法-代碼實現(xiàn),BIRCH:利用層次方法的平衡迭代歸約和聚類,,01,Chameleon:利用動態(tài)建模的層次聚類算法,,04,ROCK:分類屬性的層次聚類算法,,02,CURE:基于
2、質(zhì)心和基于代表對象方法之間的中間策略,,03,2,BIRCH,BIRCH: Balanced Iterative Reducing and Clustering Using HierarchiesAgglomerative Clustering designed for clustering a large amount of numerical dataWhat Birch algorithm tries to solve?
3、Most of the existing algorithms DO NOT consider the case that datasets can be too large to fit in main memoryThey DO NOT concentrate on minimizing the number of scans of the datasetI/O costs are very highThe compl
4、exity of BIRCH is O(n) where n is the number of objects to be clustered.,BIRCH: The Idea by example,Data Objects,,1,Clustering Process (build a tree),,,,Cluster1,,1,2,3,4,5,6,2,,If cluster 1 becomes too large (not compa
5、ct) by adding object 2,then split the cluster,Leaf node,BIRCH: The Idea by example,Data Objects,1,Clustering Process (build a tree),,,,Cluster1,,1,2,3,4,5,6,2,Leaf node,,,Cluster2,entry 1,entry 2,Leaf node with two ent
6、ries,BIRCH: The Idea by example,Data Objects,1,Clustering Process (build a tree),,,,Cluster1,,1,2,3,4,5,6,2,Leaf node,,,Cluster2,3,entry1 is the closest to object 3If cluster 1 becomes too large by adding object 3,th
7、en split the cluster,entry 1,entry 2,BIRCH: The Idea by example,Data Objects,1,Clustering Process (build a tree),,,,Cluster1,,1,2,3,4,5,6,2,Leaf node,,Cluster2,3,,,entry 1,,entry 2,entry 3,Cluster3,Leaf node with three
8、entries,BIRCH: The Idea by example,Data Objects,1,Clustering Process (build a tree),,,,Cluster1,,1,2,3,4,5,6,2,Leaf node,,Cluster2,3,,,entry 1,,entry 2,entry 3,Cluster3,4,entry3 is the closest to object 4Cluster 2 rem
9、ains compact when adding object 4then add object 4 to cluster 2,,Cluster2,BIRCH: The Idea by example,Data Objects,1,Clustering Process (build a tree),,,,Cluster1,,1,2,3,4,5,6,2,Leaf node,3,,,entry 1,,entry 2,entry 3,C
10、luster3,4,entry2 is the closest to object 5Cluster 3 becomes too large by adding object 5then split cluster 3?BUT there is a limit to the number of entries a node can haveThus, split the node,,Cluster2,5,BIRCH: Th
11、e Idea by example,Data Objects,1,Clustering Process (build a tree),,,Cluster1,,1,2,3,4,5,6,2,Leaf node,3,,Cluster3,4,,Cluster2,5,,,entry 1,entry 2,,,entry 1.1,entry 1.2,,,entry 2.1,entry 2.2,Leaf node,Non-Leaf node,,Clu
12、ster4,,,BIRCH: The Idea by example,Data Objects,1,Clustering Process (build a tree),,,Cluster1,,1,2,3,4,5,6,2,Leaf node,3,,Cluster3,4,,Cluster2,5,,,entry 1,entry 2,,,entry 1.1,entry 1.2,,,entry 2.1,entry 2.2,Leaf node,N
13、on-Leaf node,,Cluster4,,,6,entry1.2 is the closest to object 6Cluster 3 remains compact when adding object 6then add object 6 to cluster 3,,Cluster3,BIRCH: Key Components,Clustering Feature (CF)Summary of the stati
14、stics for a given cluster: the 0-th, 1st and 2nd moments of the cluster from the statistical point of view Used to compute centroids, and measures the compactness and distance of clusters CF-Treeheight-balanced tre
15、e two parameters: number of entries in each node The diameter of all entries in a leaf nodeLeaf nodes are connected via prev and next pointers,Clustering Feature,Clustering Feature (CF): CF = (N, LS, SS) N
16、: Number of data points LS: linear sum of N points: SS: square sum of N points:,,,,,Cluster 1 (2,5) (3,2) (4,3),CF2= ?3, (35,36), (417 ,440)?,,,,Cluster 2,,CF1= ?3, (2+3+4 , 5+2+3), (22+32+42 , 52+22+32)? = ?
17、3, (9,10), (29 ,38)?,,Cluster3,CF3=CF1+CF2= ?3+3, (9+35, 10+36), (29+417 , 38+440)? = ?6, (44,46), (446 ,478)?,Properties of Clustering Feature,CF entry is a summary of statistics of the cluster A representation of th
18、e clusterA CF entry has sufficient information to calculate the centroid, radius, diameter and many other distance measuresAdditively theorem allows us to merge sub-clusters incrementally,Distance Measures,Given a
19、 cluster with data points , Centroid: Radius: average distance from any point of the cluster to its centroidDiameter: square root of average mean squared distance between all pairs of points in the cluster,CF T
20、ree,B = Branching Factor, maximum children in a non-leaf nodeT = Threshold for diameter or radius of the cluster in a leafL = number of entries in a leafCF entry in parent =
21、 sum of CF entries of a child of that entryIn-memory, height-balanced tree,,,,…,…,…,…,…,,,,Root level,Firstlevel,CF Tree Insertion,Start with the rootFind the CF entry in the root closest to the data point, move to
22、 that child and repeat the process until a closest leaf entry is found.At the leafIf the point can be accommodated in the cluster, update the entryIf this addition violates the threshold T, split the entry, if this
23、violates the limit imposed by L, split the leaf. If its parent node too is full, split that and so onUpdate the CF entries from the root to the leaf to accommodate this point,,Phase 1: Load into memory by building a CF
24、 tree,,Phase 2 (optional): Condense tree into desirable range by building a smaller CF tree,,Initial CF tree,Data,,Phase 3: Global Clustering,,Smaller CF tree,,Good Clusters,Phase 4: (optional and offline): Cluster Refin
25、ing,Better Clusters,,,,Birch Algorithm,Birch Algorithm: Phase 1,Choose an initial value for threshold, start inserting the data points one by one into the tree as per the insertion algorithmIf, in the middle of the abo
26、ve step, the size of the CF tree exceeds the size of the available memory, increase the value of thresholdConvert the partially built tree into a new treeRepeat the above steps until the entire dataset is scanned and
27、 a full tree is builtOutlier Handling,Birch Algorithm: Phase 2,3, and 4,Phase 2A bridge between phase 1 and phase 3Builds a smaller CF tree by increasing the thresholdPhase 3Apply global clustering algorithm to
28、the sub-clusters given by leaf entries of the CF treeImproves clustering qualityPhase 4Scan the entire dataset to label the data pointsOutlier handling,ROCK,Similarity function NeighborsLinksCriterion functionGo
29、odness measure,Major definitions,ROCK,Similarity function NeighborsLinksCriterion functionGoodness measure,Major definitions,ROCK,Similarity function,Let Sim (Pi, Pj) be a similarity function that is used to measure
30、the closeness between points pi and Pj. ROCK assumes that Sim function is normalized to return a value between 0 and 1For Quran treasures data, a possible definition for the sim function is based on the Jaccard coeffic
31、ient:,ROCK,Neighbors and links,,,NeighborIf similarity between two points exceeds certain similarity threshold (?), they are neighbors.LinkThe Link for pair of points is: the number of their common neighbors.Obvious
32、ly, Link incorporates global information about the other points in the neighborhood of the two points. The larger the Link, the higher probability that this pair of points are in the same clusters.,ROCK,Criterion functio
33、n,,,to get the best clusters, we have to maximize this Criterion FunctionWhere Ci denotes cluster i ni is the number of points in Ci k is the number of clusters ? is the simi
34、larity threshold Suppose in Ci, each point has roughly nf(θ) neighbors.A suitable choice for basket data is : f(θ)=(1-θ)/(1+θ),ROCK,Goodness measure,,,Goodness FunctionDuring clustering, we use this goodness measur
35、e in order to maximize the criterion function. This goodness measure helps to identify the best pair of clusters to be merged during each step of ROCK.,ROCK,Algorithm,,,,Input: A set S of data points
36、Number of k clusters to be found The similarity thresholdOutput: Groups of clustered dataThe ROCK algorithm is divided into three major parts:Draw a random sample from the data set:Perform a hierarchi
37、cal agglomerative clustering algorithmLabel data on diskin our case, we do not deal with a very huge data set. So, we will consider the whole data in the process of forming clusters, i.e. we skip step1 and step3,ROCK,A
38、lgorithm,,,,places each single data point into a separate clustercompute the similarity measure for all pairs of clustersmerge the two clusters with the highest similarity (goodness measure)Verify a stop condition. If
39、 it is not met then go to step b,Computation of links:using the similarity threshold ?, we can convert the similarity matrix into an adjacency matrix (A)Then we obtain a matrix indicating the number of links by calcula
40、ting (A x A ) , i.e., by multiplying the adjacency matrix A with itself,ROCK,Suppose we have four verses contains some subjects , as follows:P1={ judgment, faith, prayer, fair}P2={ fasting, faith, prayer}P3={ fair, fa
41、sting, faith}P4={ fasting, prayer, pilgrimage}the similarity threshold = 0.3, and number of required cluster is 2.using Jaccard coefficient as a similarity measure, we obtain the following similarity table :,Example:,
42、ROCK,Example:,Since we have a similarity threshold equal to 0.3, then we derive the adjacency table:?By multiplying the adjacency table with itself, we derive the following table which shows the number of links (or comm
43、on neighbors) :?,ROCK,Example:,we compute the goodness measure for all adjacent points ,assuming that f(?) =1-? / 1+? we obtain the following table?: we have an equal goodness measure for merging ((P
44、1,P2), (P2,P1), (P3,P1)),ROCK,Example:,Now, after merging P1 and P2, we have only three clusters. The following table shows the number of common neighbors for these clusters:?Then we can obtain the following goodness me
45、asures for all adjacent clusters:?,Since the number of required clusters is 2, then we finish the clustering algorithm by merging C(P1,P2) and P3, obtaining a new cluster C(P1,P2,P3) which contains {P1,P2,P3} leaving P4
溫馨提示
- 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
- 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
- 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁內(nèi)容里面會有圖紙預(yù)覽,若沒有圖紙預(yù)覽就沒有圖紙。
- 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
- 5. 眾賞文庫僅提供信息存儲空間,僅對用戶上傳內(nèi)容的表現(xiàn)方式做保護處理,對用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對任何下載內(nèi)容負(fù)責(zé)。
- 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請與我們聯(lián)系,我們立即糾正。
- 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時也不承擔(dān)用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。
最新文檔
- 模式識別與人工智能之七-part1
- 模式識別與人工智能之十
- 模式識別與人工智能之八
- 模式識別與人工智能之四
- 模式識別與人工智能之五
- 模式識別與人工智能
- 模式識別與人工智能整理
- 人工智能與模式識別的發(fā)展
- 模式識別在人工智能方面的應(yīng)用
- 模式識別、人工智能與醫(yī)學(xué)專家系統(tǒng)之間的關(guān)系
- 人工智能虹膜識別
- 大數(shù)據(jù)與人工智能-解惑
- 講座名稱大數(shù)據(jù)與人工智能
- 計算機科學(xué)與人工智能
- 煤氣流分布模式識別與布料指導(dǎo)人工智能系統(tǒng)的設(shè)計與實現(xiàn).pdf
- 在線習(xí)題part2(2)
- 圍棋人機大戰(zhàn)背后與人工智能發(fā)展
- 人工智能原理人工智能概述
- part2 of the fourth picture.dwg
- part2 of the second picture.dwg
評論
0/150
提交評論