Tsinghua Cloud Computing Course Slides: Distributed Clustering

Uploaded by: 沈*** | Document ID: 241589582 | Uploaded: 2024-07-07 | Format: PPT | Pages: 55 | Size: 285.50 KB
Outline
  - Clustering
  - Intuition
  - Clustering Algorithms
  - The Distance Measure
  - Hierarchical vs. Partitional
  - K-Means Clustering
  - Complexity
  - Canopy Clustering
  - MapReducing a large data set with K-Means and Canopy Clustering

Clustering
  - What is clustering?

Google News
  - They didn't pick all 3,400,217 related articles by hand
  - Or Amazon
  - Or Netflix

Other less glamorous things
  - Hospital Records
  - Scientific Imaging: related genes, related stars, related sequences
  - Market Research: segmenting markets, product positioning
  - Social Network Analysis
  - Data mining
  - Image segmentation

The Distance Measure
  How the similarity of two elements in a set is determined, e.g.
  - Euclidean Distance
  - Manhattan Distance
  - Inner Product Space
  - Maximum Norm
  - Or any metric you define over the space

Hierarchical Clustering vs. Partitional Clustering

Types of Algorithms
  - Hierarchical Clustering: builds or breaks up a hierarchy of clusters.
  - Partitional Clustering: partitions set into all clusters simultaneously.

K-Means Clustering
  - Super simple Partitional Clustering
  - Choose the number of clusters, k
  - Choose k points to be cluster centers
  - Then...

K-Means Clustering
  Iterate:
  - Compute distance from all points to all k-centers
  - Assign each point to the nearest k-center
  - Compute the average of all points assigned to all specific k-centers
  - Replace the k-centers with the new averages

But!
  - The complexity is pretty high: k * n * O(distance metric) * num(iterations)
  - Moreover, it can be necessary to send tons of data to each Mapper Node. Depending on your bandwidth and memory available, this could be impossible.

Furthermore
  There are three big ways a data set can be large:
  - There are a large number of elements in the set.
  - Each element can have many features.
  - There can be many clusters to discover.
  Conclusion: Clustering can be huge, even when you distribute it.

Canopy Clustering
  - Preliminary step to help parallelize computation.
  - Clusters data into overlapping Canopies using super cheap distance metric.
  - Efficient
  - Accurate

Canopy Clustering
  While there are unmarked points:
  - pick a point which is not strongly marked
  - call it a canopy center
  - mark all points within some threshold of it as in its canopy
  - strongly mark all points within some stronger threshold

After the canopy clustering
  - Resume hierarchical or partitional clustering as usual.
  - Treat objects in separate clusters as being at infinite distances.

MapReduce Implementation: Problem
  Efficiently partition a large data set (say movies with user ratings!) into a fixed number of clusters using Canopy Clustering, K-Means Clustering, and a Euclidean distance measure.

The Distance Metric
  - The Canopy Metric ($)
  - The K-Means Metric ($)

Steps!
  - Get Data into a form you can use (MR)
  - Picking Canopy Centers (MR)
  - Assign Data Points to Canopies (MR)
  - Pick K-Means Cluster Centers
  - K-Means algorithm (MR)
  - Iterate!

Data Massage
  - This isn't interesting, but it has to be done.

Selecting Canopy Centers

Assigning Points to Canopies

K-Means Map

Iterating K-Means

Elbow Criterion
  - Choose a number of clusters s.t. adding a cluster doesn't add interesting information.
  - Rule of thumb to determine what number of Clusters should be chosen.
  - Initial assignment of cluster seeds has bearing on final model performance.
  - Often required to run clustering several times to get maximal performance.

Conclusions
  - Clustering is slick
  - And it can be done super efficiently
  - And in lots of different ways
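The measures named on "The Distance Measure" slide can be written in a few lines each. The sketch below is illustrative only: the function names are assumptions, and points are assumed to be equal-length tuples of numbers. Note that the inner product is a similarity score (larger means more alike) rather than a distance.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def maximum_norm(a, b):
    """Largest absolute coordinate difference (Chebyshev distance)."""
    return max(abs(x - y) for x, y in zip(a, b))


def inner_product(a, b):
    """Dot product; a similarity score, not a distance."""
    return sum(x * y for x, y in zip(a, b))
```

Any of the three distance functions can be passed wherever a clustering routine expects a metric; the cheaper ones (Manhattan, maximum norm) are natural candidates for a low-cost canopy metric.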
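The four steps on the K-Means "Iterate" slide map directly onto a short loop. This is a minimal single-machine sketch, not the deck's implementation: the names `k_means`, `initial_centers`, and `max_iterations` are assumptions, the caller chooses the k starting centers, and an empty cluster simply keeps its previous center.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two points (tuples of numbers)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Coordinate-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, initial_centers, max_iterations=100):
    """Return (centers, clusters) after iterating until the centers stop moving."""
    centers = [tuple(c) for c in initial_centers]
    k = len(centers)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iterations):
        # Compute distances and assign each point to the nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: euclidean(p, centers[i]))
            clusters[nearest].append(p)
        # Average the points assigned to each center.
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        # Stop once replacing the centers would change nothing.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters
```

The inner loop touches every point and every center on every pass, which is where the k * n * O(distance metric) * num(iterations) cost on the "But!" slide comes from.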
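The canopy loop on the "While there are unmarked points" slide can be sketched as follows. This is an assumption-laden illustration: the names `canopy_clustering`, `loose_threshold`, and `tight_threshold` are hypothetical, "pick a point" is taken to mean the first point in input order that is not strongly marked, and Manhattan distance stands in for the cheap metric.

```python
def manhattan(a, b):
    """A cheap distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def canopy_clustering(points, loose_threshold, tight_threshold, distance=manhattan):
    """Return a list of (center, members) canopies; canopies may overlap."""
    if tight_threshold > loose_threshold:
        raise ValueError("tight_threshold must not exceed loose_threshold")
    n = len(points)
    marked = [False] * n           # point belongs to at least one canopy
    strongly_marked = [False] * n  # point may no longer become a center
    canopies = []
    while not all(marked):
        # Pick a point which is not strongly marked and call it a canopy center.
        center_index = next(i for i in range(n) if not strongly_marked[i])
        center = points[center_index]
        members = []
        for j, p in enumerate(points):
            d = distance(center, p)
            if d <= loose_threshold:
                marked[j] = True
                members.append(p)
            if d <= tight_threshold:
                strongly_marked[j] = True
        canopies.append((center, members))
    return canopies
```

With the points `[(0, 0), (1, 0), (3, 0), (10, 0), (11, 0)]`, a loose threshold of 4, and a tight threshold of 2, the first two canopies share members, which is the overlap the slides describe. A later K-Means pass then only needs to compare a point against centers that share a canopy with it.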
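The "K-Means algorithm (MR)" step can be pictured as one map function and one reduce function per iteration. The slides do not show their code, so the sketch below is only one plausible shape: `kmeans_map`, `kmeans_reduce`, and `run_iteration` are hypothetical names, the shuffle phase is simulated locally with a dictionary, and the canopy-based pruning is left out for brevity.

```python
from collections import defaultdict


def squared_euclidean(a, b):
    """Squared distance; enough for finding the nearest center."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_map(point, centers):
    """Map: emit (index of the nearest center, point)."""
    nearest = min(range(len(centers)),
                  key=lambda i: squared_euclidean(point, centers[i]))
    return nearest, point


def kmeans_reduce(center_index, assigned_points):
    """Reduce: average all points that were assigned to one center."""
    n = len(assigned_points)
    new_center = tuple(sum(coords) / n for coords in zip(*assigned_points))
    return center_index, new_center


def run_iteration(points, centers):
    """Simulate one MapReduce pass locally: map, group by key, reduce."""
    groups = defaultdict(list)
    for p in points:
        key, value = kmeans_map(p, centers)
        groups[key].append(value)
    new_centers = list(centers)  # centers with no points stay where they are
    for key, values in groups.items():
        _, new_centers[key] = kmeans_reduce(key, values)
    return new_centers
```

Calling `run_iteration` repeatedly, feeding each result back in as `centers`, corresponds to the "Iterate!" step. Note that every mapper needs the full list of current centers, which is the data-shipping cost the "But!" slide warns about.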