Outline
- Clustering intuition
- Clustering algorithms
- The distance measure
- Hierarchical vs. partitional clustering
- K-Means clustering
- Complexity
- Canopy clustering
- MapReducing a large data set with K-Means and canopy clustering

What is clustering?
- Google News: they didn't pick all 3,400,217 related articles by hand.
- Nor did Amazon. Or Netflix.
- Other, less glamorous uses:
  - Hospital records
  - Scientific imaging: related genes, related stars, related sequences
  - Market research: segmenting markets, product positioning
  - Social network analysis
  - Data mining
  - Image segmentation

The Distance Measure
- Determines how the similarity of two elements in a set is measured, e.g.:
  - Euclidean distance
  - Manhattan distance
  - Inner product space
  - Maximum norm
  - Or any metric you define over the space

Hierarchical vs. Partitional Clustering
- Hierarchical clustering: builds or breaks up a hierarchy of clusters.
- Partitional clustering: partitions the set into all clusters simultaneously.

K-Means Clustering
- A super simple partitional clustering algorithm:
  - Choose the number of clusters, k.
  - Choose k points to be the cluster centers.
  - Then iterate:
    - Compute the distance from every point to each of the k centers.
    - Assign each point to its nearest center.
    - Compute the average of all points assigned to each center.
    - Replace each center with the new average.

But!
- The complexity is pretty high: k * n * O(distance metric) * num(iterations).
- Moreover, it can be necessary to send tons of data to each mapper node. Depending on the bandwidth and memory available, this could be impossible.

Furthermore
- There are three big ways a data set can be large:
  - There are a large number of elements in the set.
  - Each element can have many features.
  - There can be many clusters to discover.
- Conclusion: clustering can be huge, even when you distribute it.

Canopy Clustering
- A preliminary step to help parallelize the computation.
- Clusters the data into overlapping "canopies" using a super-cheap distance metric.
- Efficient and accurate.

The Canopy Algorithm
- While there are unmarked points:
  - Pick a point that is not strongly marked; call it a canopy center.
  - Mark all points within some threshold of it as in its canopy.
  - Strongly mark all points within some tighter threshold.

After Canopy Clustering
- Resume hierarchical or partitional clustering as usual.
- Treat objects in separate canopies as being at infinite distance.

MapReduce Implementation
- Problem: efficiently partition a large data set (say, movies with user ratings!) into a fixed number of clusters using canopy clustering, K-Means clustering, and a Euclidean distance measure.

The Distance Metric
- The canopy metric ($)
- The K-Means metric ($)

Steps!
- Get the data into a form you can use (MR).
- Pick canopy centers (MR).
- Assign data points to canopies (MR).
- Pick K-Means cluster centers.
- Run the K-Means algorithm (MR).
- Iterate!

Data Massage
- This isn't interesting, but it has to be done.

Selecting Canopy Centers

Assigning Points to Canopies

K-Means Map

Iterating K-Means

Elbow Criterion
- Choose a number of clusters such that adding another cluster doesn't add interesting information.
- A rule of thumb for deciding how many clusters to use.
- The initial assignment of cluster seeds has a bearing on final model performance.
- It is often necessary to run the clustering several times to get the best performance.

Conclusions
- Clustering is slick.
- And it can be done super efficiently.
- And in lots of different ways.
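The canopy procedure described above ("pick a center, mark points within a loose threshold, strongly mark points within a tight threshold") can be sketched in Python. This is a minimal illustrative version, not the deck's actual code: the parameter names `t1`/`t2` and the use of a random pick for the next center are assumptions.

```python
import random

def canopy_clustering(points, t1, t2, cheap_distance):
    """Group points into overlapping canopies.

    t1: loose threshold -- a point within t1 of a center joins its canopy.
    t2: tight threshold (t2 < t1) -- a point within t2 is "strongly marked"
        and can no longer seed or join a later canopy.
    cheap_distance: the super-cheap metric the slides call for.
    """
    assert t2 < t1
    canopies = []                  # list of (center, members)
    candidates = list(points)      # points not yet strongly marked
    while candidates:
        # Pick an arbitrary not-strongly-marked point as the next center.
        center = candidates.pop(random.randrange(len(candidates)))
        members = [center]
        remaining = []
        for p in candidates:
            d = cheap_distance(center, p)
            if d < t1:
                members.append(p)      # in this canopy (may overlap others)
            if d >= t2:
                remaining.append(p)    # not strongly marked: still a candidate
        candidates = remaining
        canopies.append((center, members))
    return canopies

# One-dimensional toy example with |a - b| as the cheap metric:
# two well-separated groups yield exactly two canopies of three points each.
pts = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
canopies = canopy_clustering(pts, t1=1.0, t2=0.5,
                             cheap_distance=lambda a, b: abs(a - b))
```

Note how a point that falls inside `t1` but outside `t2` stays in the candidate list, so it can end up in more than one canopy; that overlap is what lets the later, expensive clustering step safely treat points in disjoint canopies as infinitely far apart.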
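The "K-Means Map" and "Iterating K-Means" steps can be sketched as a single map-and-reduce pass in Python. This is an in-memory illustration of the MapReduce structure, not the deck's implementation: the function names and the dictionary standing in for the shuffle phase are assumptions.

```python
import math
from collections import defaultdict

def euclidean(a, b):
    """The Euclidean distance measure the problem statement calls for."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans_map(point, centers):
    """Map: emit (index of nearest center, point)."""
    nearest = min(range(len(centers)), key=lambda i: euclidean(point, centers[i]))
    return nearest, point

def kmeans_reduce(assigned_points):
    """Reduce: average all points assigned to one center."""
    n = len(assigned_points)
    dim = len(assigned_points[0])
    return tuple(sum(p[d] for p in assigned_points) / n for d in range(dim))

def kmeans_iteration(points, centers):
    """One MapReduce-style iteration: map, shuffle by key, reduce."""
    groups = defaultdict(list)
    for p in points:
        key, value = kmeans_map(p, centers)
        groups[key].append(value)
    # Keep the old center if no points were assigned to it.
    return [kmeans_reduce(groups[i]) if i in groups else centers[i]
            for i in range(len(centers))]

pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers = [(0.0, 0.0), (10.0, 10.0)]
centers = kmeans_iteration(pts, centers)
# centers -> [(0.0, 0.5), (10.0, 10.5)]
```

The outer loop from the slides ("Iterate!") just calls `kmeans_iteration` repeatedly until the centers stop moving; the canopy assignment would additionally let each mapper skip the distance computation for centers outside a point's canopy.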