相似性概念与聚类分析

资源描述

单击此处编辑母版标题样式,单击此处编辑母版文本样式,第二级,第三级,第四级,第五级,*,相似性，概念与聚类分析,Email:,1,机器学习的目的之一：概念,人们学习的目的是学习知识, 因此, 机器学习的一个自然期望是：从数据中学习到知识,什么是知识的最基本单位：概念,Concepts are the glue that holds our mental world together。,Cited from page 1 in the book entiled “The big book of concepts”, written by M.L. Murphy, 2002, MIT,2,经典概念的定义：（,Plato and Aristotle）,概念的内涵：必要而且充分条件（命题描述, 命题可以是复合命题）,概念的外延：给出论域中符合该概念的所有样例,符合排中率（law of the excluded middle）,要么符合这个概念，要么不符合这个概念,这种经典的概念形式称为定义法,什么是概念？,3,概念与数据分析,数据分析的一个重要的应用就是从数据中学习到概念(语义).,Cited from C. Rother, V. Kolmogorov, and A. Blake,GrabCut: Interactive foreground extraction using iterated graph cuts, ACM Trans. Graph., vol. 23, pp. 309314, 2004,4,相应的机器学习问题(I),已知:既定概念和该既定概念外延的一个有限子集(即: 标定样本),期望: 学习既定概念的内涵定义,机器学习:分类, 回归等技术可以归为此类问题, 即所谓的有监督学习,5,相应的机器学习问题(II),已知:,样本集, 但其中的样本属于哪一个概念未知 (未标定样本),期望:,学习出与人类认知相符的概念.最好得到概念的内涵表示, 否则,也希望得到概念的外延子集.,机器学习: 聚类分析可以归为此类问题, 无监督学习,6,本次演讲的重点,如何从未标定的数据集中提取概念, 即聚类分析,7,Outline,概念的形成（Gestalt Theory）,概念的非经典定义,聚类分析,类的复杂性讨论,未来展望,8,概念的形成,可分为实体类别(natural kinds)与抽象类别( abstract kinds),Max Wertheimer (1923)说:,“我站在窗前, 看到的是房屋,树, 天空.” 不可能认到一个一个的像素点这种程度.,提出了实体类别的组织原则,9,概念的形成格式塔理论与样本的概念归属,格式塔学派整体上认识视觉,提供了根据二维数据形成概念的基本依据,邻近律,相似律,连续律,封闭律,对称律,10,概念的形成相似律 Law of Similarity,11,概念的形成Law of proximity邻近律,12,概念的形成Gestalt 准则的推广性,封闭律, 连续律, 对称律在高维空间的推广挑战性高, 比如对称性:二维与三维不同.,相似律和近邻律的推广性受数据空间维数的影响相对较小,因此对于概念的研究来说, 似更为重要.,另外，封闭律, 连续律在概念不重叠和相切的情形下可以由相似律和近邻律来反映,13,概念“游戏”内包含的对象,不包含共有的特性,马术，游泳，下棋，网球等,都属于游戏,概念的非经典定义,经典概念的颠覆,Wittgenstein, L. (1958).,Philosophical Investigations,(G. E. M. Anscombe, Trans.). USA: Blackwell Publishing.,Ludwig Wittgenstein,14,概念的非经典定义Eleanor Roschs 的发现,上个世纪70年代,Eleanor Rosch 的工作在认知科学领域彻底终结了经典概念的定义-“The big book of concepts”, written by M.L. Murphy, 2002, MIT,典型样本与非典型样本,15,概念的非经典定义,Examples of items studied by Rosch & Mervis (1975), ordered by typicality,Fruit: orange, apple, banana, peach, pear, apricot, plum, grapes, strawberry, grapefruit, pineapple, blueberry, lemon, watermelon, honeydew, pomegranate, date, coconut, tomato, olive,Furniture: chair, sofa, table, dresser, desk, bed, bookcase, footstool, lamp, piano, cushion, mirror, rug, radio, stove, clock, picture, closet, vase, telephone,16,概念的非经典定义Prototype view of concepts,A single prototype as a category representation,It avoids the contradictable features,A feature list as a category representation,It is not popular as computational complexity,17,概念的非经典定义,Exemplar view of concepts (Medin and Schaffer, 1978),Concepts by represented by exemplars,18,概念的非经典定义Knowledge approach of concepts,Concepts can be considered a part of general knowledge,goal-derived categories (Barsalou,1985),things to eat on a diet, things to take from ones house during a fire,Its limitation: Much of a concept cannot be based on previous knowledge,19,概念的非经典定义样本如何归属于某个特定概念,样本归入与之最相似的特定概念,20,概念，相似性与聚类分析,聚类形成的划分子集内样本具有相当的同质性，即类内的样本是相似的，不同类之间的样本是不相似,如果借用经典概念，聚类分析得到的是概念的一个外延子集,由于聚类分析可以发现数据的内蕴结构，即数据自身蕴含的概念，近年来，聚类分析的应用日益增广,21,聚类分析聚类算法与使用的概念定义,类原型聚类算法：紧致型的类,样例型聚类算法：连通型的类,经典概念对应的聚类算法,22,聚类分析Prototype based clustering: C(K)-MEANS,Remark: The essence of K-means is the same as that of C-means.,LBG or GLA also has almost the same meaning as K-means,23,聚类分析K-means and its extensions,Fuzzy C-means,EM type clustering,Deterministic annealing clustering,Fuzzy c-shells,K-mode,PCM,Conditional fuzzy c-means,GCM (Yu Jian, 2005),24,聚类分析Prototype based clustering,Usually use dissimilarity to represent similarity, therefore ignore the step of computing similarity,Their advantages are as follows: fast speed, and good interpretation, can easily deal with touching clusters or overlapping clusters when prototypes are proper,25,聚类分析Exemplar based clustering,Additive clustering,Affinity Propagation,Minimum cut and Spectral clustering,Hierarchical clustering,Minimum Spanning tree,DBSCAN,QT(Quality Threshold）,26,聚类分析Affinity Propagation (,Frey & Dueck, 2007,),27,聚类分析Minimum cut,28,聚类分析Normalized Cut(Shi and Malik, 2000),Subject to the constraints:,y(i),1,-,b,y,t,D,1=0,29,聚类分析Minimum Cut,最小切聚类算法（minimum cut），,Ahmed Elgammal, Graph Cuts and Image Segmentation, Rutgers University,Jianbo Shi and Jitendra Malik,“Normalized Cuts and Image,Segmetnation”,IEEE Transactions on Pattern Analysis and Machine,Intelligence, Vol 22 No. 0, August 2000,30,聚类分析Single Linkage,31,聚类分析Single linkage的优缺点,优点： Single Linkage: J. Haritgan(1981, JASA, 76(374) 证明了只有Single linkage 可以统计一致的发现发现高密度类， average linkage和complete linkage 不具有此性质,缺点：,不能发现不同密度的类,受噪音影响特别厉害,难点：有一个很难确定的参数，聚类数或者阈值,32,聚类分析DBSCAN 算法,算法的思想是寻找具有足够高密度的连通区域划分作为类，而低密度区域的点则作为孤立点。,一个点的密度可以看作所有样本点与此点的相似度之和,优点：可以发现任意形状类,缺点：两个参数（密度水平参数，近邻参数），难以选择,33,聚类分析,DBSCAN等算法,(DBSCAN) M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. 1996.,A density-based algorithm for discovering clusters in large spatial databases,. KDD96,34,聚类分析QT clustering,QT(Quality Threshold)聚类算法是通过限定类的直径来聚类的。主要思想是：如果定义了相异矩阵对应的图，类的直径应该不大于给定的阈值。因此，其流程如下：选定一个样本，逐渐合并与其最相似的样本，直到再增加样本将导致类的直径超过给定的阈值为止，然后选定下一个样本，重新聚类。,Heyer,L.J., et al. “Exploring Expression Data: Identification and Analysis of Coexpressed Genes”. Genome Research, 9:1106-1115 (1999),35,聚类分析现存样例型聚类算法的不足,The predefined parameters such as the number of clusters for additive clustering, the preference value and the damping factor for the affinity propagation, the number of clusters for spectral clustering, Threshold for cluster diameter,High complexity of additive clustering, quality threshold,No convergence proof of Affinity propagation,No calculation results for Spectral clustering if similarity matrix is not proper,36,聚类分析,对应经典概念的聚类算法,如果经典概念的外延来表示划分，即可以用划分矩阵来表示这样发展出来的算法可以称为划分矩阵型聚类算法。,主要有三种技术,37,聚类分析基于矩阵分解技术,算法的输入是相似矩阵，计算的主要依据是可将相似矩阵分解成划分矩阵的乘积这样的形式，这样的聚类算法有基于非负矩阵分解的聚类算法，以及异质聚类算法中的矩阵分解聚类算法，可加性聚类算法(additive clustering)也可以勉强归为这样的算法,38,聚类分析基于信息论,算法的输入是概率分布矩阵，文献中的算法有信息瓶颈（information bottleneck）聚类算法，以及异质聚类算法的互信息联合聚类算法等等,39,聚类分析基于margin 理论,现有的方法有支持向量机聚类算法（support vector clustering）和最大margin聚类算法（maximum margin clustering）,40,类的复杂性讨论,概念的定义是一个非常难的问题,类是什么也一直是聚类分析研究者面对的核心难题,41,任意形状类(I),42,任意形状类（II),43,非同质类,44,相切类,45,重叠类,46,重叠类在图像中的表现,47,混合类,48,Cited from Jain, A.: Data clustering: 50 years beyond k-means. Pattern Recognition Letters (Available on line 9 Sept. 2009),49,现存聚类算法的优缺点,类原型聚类算法可以处理相切类，重叠类，条件是类原型合适。但是对于任意形状类处理不好,连通类聚类算法能够处理弱相切类（但是比较复杂），一般的相切类和重叠类处理不了.,50,Essentially, all models are wrong, but some are useful.,Cited from Box, George E. P.; Norman R. Draper (1987).,Empirical Model-Building and Response Surfaces,. Wiley. pp.p. 424.,ISBN 0471810339,“there is no single clustering algorithm that has been shown to dominate other algorithms across all application domains”,A.K. Jain, 2009, PRL, 2009,51,相似性的二值表示,一个是在得到相似性得到以后，如何判断对象与类别之间的关系。,一般假设相似性与一个理想相似性是一一对应的.,所谓的理想相似性是指其值与0或者1很接近,s(i,k)=e(i,k)+(i,k), 其中， e(i,k)取值为0或者1,52,相似性的二值表示定理,53,Texas clustering(Yu, Hao and Zhou),由此而来, 我们得到新的基于相似度的聚类算法,54,未来展望,类的表示（概念的表示）,数据的表示（特征空间）,如何结合领域知识聚类算法： semi-supervised clustering,现有算法的性能客观评估,55,谢谢,.,56,

展开阅读全文

相似性概念与聚类分析

最新文档