"Data Warehousing and Data Mining", Chapter 8

Uploaded by: tfg****lgh · Document ID: 248292350 · Upload date: 2024-10-23 · Format: PPTX · 148 pages · 1.28 MB
Data Mining: Concepts and Techniques

Chapter 6: Association Rule Mining
- Association rule mining
- Algorithms for scalable mining of (single-dimensional Boolean) association rules in transactional databases
- Mining various kinds of association/correlation rules
- Constraint-based association mining
- Sequential pattern mining
- Applications/extensions of frequent pattern mining
- Summary

What Is Association Mining?
- Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
- Frequent pattern: a pattern (set of items, sequence, etc.) that occurs frequently in a database [AIS93].
- Motivation: finding regularities in data.
  - What products were often purchased together? Beer and diapers?!
  - What are the subsequent purchases after buying a PC?
  - What kinds of DNA are sensitive to this new drug?
  - Can we automatically classify web documents?

Basic Concepts of Association Rule Mining
- Market basket analysis motivates association rule mining. The question: "Which groups or sets of items are customers likely to buy together in a single trip?"
- Market basket analysis: let the universe be the set of items sold by the store (the full item set); the items bought in one trip (one transaction) form a subset of it. If each item is represented by a Boolean variable indicating its presence or absence, then each basket can be represented by a Boolean vector. Analyzing these vectors yields buying patterns reflecting items that are frequently associated or purchased together; such patterns can be described by association rules.
- Example: the association between buying a computer and buying financial-management software can be expressed as
    computer => financial_management_software [support = 2%, confidence = 60%]
  where support is the rule's support and confidence its confidence. The rule says: of all transactions analyzed, 2% contain both a computer and financial-management software, and 60% of the customers who bought a computer also bought financial-management software.

Why Is Frequent Pattern or Association Mining an Essential Task in Data Mining?
- Foundation for many essential data mining tasks:
  - association, correlation, causality;
  - sequential patterns, temporal or cyclic association, partial periodicity, spatial and multimedia association;
  - associative classification, cluster analysis, iceberg cube,
    fascicles (semantic data compression).
- Broad applications:
  - basket data analysis, cross-marketing, catalog design, sale-campaign analysis;
  - web log (clickstream) analysis, DNA sequence analysis, etc.

Association Rules
- Association analysis aims to uncover hidden relationships among data: given a set of items and a set of records, analyze the records to find correlations among the items in the item set. These correlations are described by association rules, which express how closely a group of data items is related.
- Definition 8.1: Let I = {i1, i2, ..., in} be the set of items and D the set of all transactions. A transaction T is a subset of I (T ⊆ I), identified by a unique identifier TID. An association rule is an implication of the form X => Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅; X is called the antecedent (condition) of the rule and Y its consequent (result).

Support and Confidence
- Definition 8.2: The support of an association rule X => Y over a transaction set D is the percentage of transactions in D containing both X and Y. The confidence of X => Y over D is the percentage of transactions containing X that also contain Y. That is:
    support(X => Y) = (number of transactions containing both X and Y / total number of transactions) × 100%
    confidence(X => Y) = (number of transactions containing both X and Y / number of transactions containing X) × 100%
- Definition 8.3: A rule whose support and confidence both reach the given thresholds (the minimum support and minimum confidence thresholds), i.e.
    support(X => Y) ≥ min_sup and confidence(X => Y) ≥ min_conf,
  is called a strong rule; otherwise it is a weak rule.

Association Rule Mining
- Data mining here is mainly the mining of strong rules. By setting the minimum support and minimum confidence, one can measure how strongly certain data are associated.
- Association rule mining: given a set of items and a set of records, find correlations among the items whose confidence and support exceed the user-specified minimum confidence and minimum support, respectively.
Association Rule Mining (cont.)
- The itemset (X ∪ Y) corresponding to a strong rule X => Y must be a frequent itemset. Association rule mining can therefore be divided into two subproblems:
  1. Find all frequent itemsets in the transaction set D with respect to the minimum support. (The core of the problem.)
  2. Generate association rules from the frequent itemsets with respect to the minimum confidence. (Comparatively easy.)

Classifying Association Rules: by Variable Type
- Based on the type of variables handled in the rule, association rules are Boolean or quantitative:
  - Boolean association rules: the rule concerns whether items are present or absent, e.g. rules derived from market basket analysis.
  - Quantitative association rules: the rule describes associations between quantitative items or attributes. Example (X is a variable denoting a customer; the quantitative attributes age and income have been discretized):
      age(X, "30...39") ∧ income(X, "42K...48K") => buys(X, "high_resolution_TV")
  - A rule may also mix variable types. For example:
      gender = "female" => occupation = "secretary" is a Boolean association rule;
      gender = "female" => avg(monthly income) = 2300 involves income, a numeric attribute, so it is a quantitative association rule.

Classifying Association Rules: by Abstraction Level
- Based on the abstraction levels of the data in the rule, association rules are single-level or multilevel:
  - Single-level rules: no variable involves items or attributes at different abstraction levels. Example:
      buys(X, "computer") => buys(X, "printer")
    The items bought by customer X are at the same abstraction level ("computer" and "printer"), so this is a single-level rule.
  - Multilevel rules: variables involve items or attributes at different abstraction levels. Example:
      age(X, "30...39") => buys(X, "laptop computer")
      age(X, "30...39") => buys(X, "computer")
    The items involve different abstraction levels ("computer" is at a higher level than "laptop computer"), so these are multilevel rules.
Classifying Association Rules: by Number of Dimensions
- Based on the number of data dimensions involved, association rules are single-dimensional or multidimensional:
  - Single-dimensional rules: relationships among attributes within a single dimension; only one dimension of the data is involved. Example: "coffee => sugar" over the items a customer buys involves only the purchase dimension.
  - Multidimensional rules: relationships among attributes across multiple dimensions. Example: gender = "female" => occupation = "secretary" involves fields from two dimensions, so it is an association rule over two dimensions.

The Association Rule Mining Process
- Definition 8.4: In association rule mining algorithms, a set of items is called an itemset; an itemset containing k items is a k-itemset. The number of transactions containing an itemset is its occurrence frequency, also called the frequency or support count of the itemset. If an itemset's frequency is at least the minimum support s times the total number of transactions in D, the itemset satisfies minimum support s.
- An itemset satisfying minimum support is called a frequent itemset.
- Association rule mining decomposes into two steps:
  - Step 1: find all frequent itemsets, i.e., all itemsets whose support is at least the given minimum support threshold; the k-frequent itemsets can be found recursively for k = 1, 2, ....
  - Step 2: generate strong rules from the frequent itemsets, i.e., rules satisfying minimum support and minimum confidence. For a given frequent itemset L and each nonempty subset A ⊂ L, with sup(L) the support of L and sup(A) the support of A, generate a rule of the form A => (L − A) when sup(L)/sup(A) reaches the minimum confidence.

Basic Concepts: Frequent Patterns and Association Rules
- Itemset X = {x1, ..., xk}. Find all rules X => Y with minimum confidence and support:
  - support, s: probability that a transaction contains X ∪ Y;
  - confidence, c: conditional probability that a transaction having X also contains Y.
- Transaction database:
    TID | Items bought
    10  | A, B, C
    20  | A, C
    30  | A, D
    40  | B, E, F
- Let min_support = 50%, min_conf = 50%. Then:
    A => C (50%, 66.7%)
    C => A (50%, 100%)
- (The slide's diagram shows customers buying beer, buying diapers, and buying both.)

Mining Association Rules: an Example
- With min. support 50% and min. confidence 50%, the frequent patterns over the table above are:
    Frequent pattern | Support
    {A}    | 75%
    {B}    | 50%
    {C}    | 50%
    {A, C} | 50%
- For the rule A => C:
    support = support({A, C}) = 50%
    confidence = support({A, C}) / support({A}) = 66.7%

Chapter 6: Mining Association Rules in Large Databases (chapter outline slide, repeated)
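The support and confidence figures for A => C above can be checked with a short sketch (not code from the slides; the function names are mine):

```python
# Recompute support and confidence for A => C over the slide's
# four-transaction database.
transactions = [
    {"A", "B", "C"},   # TID 10
    {"A", "C"},        # TID 20
    {"A", "D"},        # TID 30
    {"B", "E", "F"},   # TID 40
]

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(x, y, db):
    """support(X ∪ Y) / support(X)."""
    return support(x | y, db) / support(x, db)

print(support({"A", "C"}, transactions))        # 0.5
print(confidence({"A"}, {"C"}, transactions))   # 0.666...
```

The subset test `itemset <= t` is exactly the "transaction contains the itemset" condition from Definition 8.2.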
Apriori: A Candidate Generation-and-Test Approach
- Any subset of a frequent itemset must be frequent:
  - if {beer, diaper, nuts} is frequent, so is {beer, diaper};
  - every transaction having {beer, diaper, nuts} also contains {beer, diaper}.
- Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated/tested!
- Method:
  - generate length-(k+1) candidate itemsets from length-k frequent itemsets, and
  - test the candidates against the DB.
- Performance studies show its efficiency and scalability [Agrawal & Srikant 1994; Mannila et al. 1994].

The Apriori Algorithm: An Example
- Database TDB (minimum support count = 2):
    TID | Items
    10  | A, C, D
    20  | B, C, E
    30  | A, B, C, E
    40  | B, E
- 1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3 → L1: {A}:2, {B}:3, {C}:3, {E}:3
- C2: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}; 2nd scan → {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2 → L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2
- C3: {B,C,E}; 3rd scan → {B,C,E}:2 → L3: {B,C,E}:2

The Apriori Algorithm (Pseudo-code)
    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k
    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1 that are contained in t
        Lk+1 = candidates in Ck+1 with min_support
    end
    return ∪k Lk;

Important Details of Apriori
- How to generate candidates?
  - Step 1: self-joining Lk
  - Step 2: pruning
- How to count supports of candidates?
- Example of candidate generation:
  - L3 = {abc, abd, acd, ace, bcd}
  - Self-joining L3 * L3: abcd from abc and abd; acde from acd and ace
  - Pruning: acde is removed
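The pseudo-code above can be sketched concretely. This is a minimal, unoptimized rendering of the Apriori loop (self-join, subset pruning, support counting); names and structure are mine, not the slides':

```python
from itertools import combinations

def apriori(db, min_count):
    """Return {frozenset: support_count} for all frequent itemsets."""
    counts = {}
    for t in db:                       # first scan: frequent 1-itemsets
        for item in t:
            s = frozenset([item])
            counts[s] = counts.get(s, 0) + 1
    freq = {s: c for s, c in counts.items() if c >= min_count}
    result = dict(freq)
    k = 1
    while freq:
        # self-join: unions of frequent k-itemsets that form (k+1)-itemsets
        cands = {a | b for a in freq for b in freq if len(a | b) == k + 1}
        # prune: every k-subset of a candidate must itself be frequent
        cands = {c for c in cands
                 if all(frozenset(sub) in freq for sub in combinations(c, k))}
        # scan the DB and count supports of the surviving candidates
        counts = {c: sum(c <= t for t in db) for c in cands}
        freq = {c: n for c, n in counts.items() if n >= min_count}
        result.update(freq)
        k += 1
    return result

db = [{"A","C","D"}, {"B","C","E"}, {"A","B","C","E"}, {"B","E"}]
fi = apriori(db, min_count=2)
print(fi[frozenset("BCE")])   # 2, matching L3 on the slide
```

On the slide's TDB this yields exactly L1 ∪ L2 ∪ L3 from the worked example: four 1-itemsets, four 2-itemsets, and {B,C,E}:2.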
    because ade is not in L3
  - C4 = {abcd}

How to Generate Candidates?
- Suppose the items in Lk-1 are listed in an order.
- Step 1: self-joining Lk-1:
    insert into Ck
    select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
- Step 2: pruning — for each (k−1)-subset s of a candidate c in Ck, if s is not in Lk-1, delete c from Ck.

Generating Association Rules from Frequent Itemsets
- For each frequent itemset I, generate all nonempty subsets s of I.
- For every nonempty subset s of I, if sup(I)/sup(s) ≥ min_conf, output the association rule "s => (I − s)", where min_conf is the minimum confidence threshold.

Generating Association Rules from Frequent Itemsets: an Example
- Suppose the data contain the frequent itemset I = {I1, I2, I5}.
- Step 1: for I = {I1, I2, I5}, generate all nonempty subsets of I: {I1,I2}, {I1,I5}, {I2,I5}, {I1}, {I2}, {I5}.
- Step 2: for each nonempty subset s of I, output the rule "s => (I − s)":
    I1 ∧ I2 => I5, confidence = 2/4 = 50%
    I1 ∧ I5 => I2, confidence = 2/2 = 100%
    I2 ∧ I5 => I1, confidence = 2/2 = 100%
    I1 => I2 ∧ I5, confidence = 2/6 = 33%
    I2 => I1 ∧ I5, confidence = 2/7 = 29%
    I5 => I1 ∧ I2, confidence = 2/2 = 100%
- If the minimum confidence is set to 70%, only the following three rules are output:
    I1 ∧ I5 => I2, confidence = 2/2 = 100%
    I2 ∧ I5 => I1, confidence = 2/2 = 100%
    I5 => I1 ∧ I2, confidence = 2/2 = 100%

Example (Apriori Walk-through)
- Take the transaction set shown in the table as an example, where the Ci are the candidate sets and the Li the large (frequent) itemsets. Assume minimum support 40% and minimum confidence 70%: a data item must then appear at least 4 times to satisfy the large-itemset condition, and a rule's confidence must exceed 70% to form an association rule. (The transaction table and the figure of the Apriori mining process are not reproduced in this text export.)

Challenges of Frequent Pattern Mining
- Challenges:
  - multiple scans of the transaction database;
  - huge number of candidates;
  - tedious workload of support counting for candidates.
- Improving Apriori: general ideas:
  - reduce passes of transaction database scans;
  - shrink the number of candidates;
  - facilitate support counting of candidates.

DIC: Reduce the Number of Scans
- Itemset lattice over A, B, C, D: A, B, C, D; AB, AC, AD, BC, BD, CD; ABC, ABD, ACD, BCD; ABCD.
- Once both A and D are determined frequent, the counting of AD begins.
- Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins.
- Unlike Apriori, which finishes all 1-itemsets before starting 2-itemsets, DIC begins counting longer itemsets mid-scan.
- S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data.
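Step 2 of rule generation can be sketched as follows; the support counts are the ones implied by the {I1, I2, I5} example above (e.g. 2/4, 2/6, 2/7), and the function name is mine:

```python
from itertools import combinations

# Support counts implied by the worked example above.
sup = {
    frozenset(["I1", "I2", "I5"]): 2,
    frozenset(["I1", "I2"]): 4,
    frozenset(["I1", "I5"]): 2,
    frozenset(["I2", "I5"]): 2,
    frozenset(["I1"]): 6,
    frozenset(["I2"]): 7,
    frozenset(["I5"]): 2,
}

def rules_from_itemset(itemset, sup, min_conf):
    """Emit (antecedent, consequent, confidence) for each nonempty proper
    subset s of `itemset` with sup(itemset)/sup(s) >= min_conf."""
    out = []
    for r in range(1, len(itemset)):
        for s in combinations(sorted(itemset), r):
            s = frozenset(s)
            conf = sup[itemset] / sup[s]
            if conf >= min_conf:
                out.append((set(s), set(itemset - s), conf))
    return out

strong = rules_from_itemset(frozenset(["I1", "I2", "I5"]), sup, min_conf=0.7)
for a, c, conf in strong:
    print(a, "=>", c, f"{conf:.0%}")   # three rules, each 100%
```

With min_conf = 0.7 this keeps exactly the three 100%-confidence rules listed on the slide.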
  In SIGMOD'97.

Partition: Scan Database Only Twice
- Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB.
- Scan 1: partition the database and find local frequent patterns.
- Scan 2: consolidate global frequent patterns.
- A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In VLDB'95.

Sampling for Frequent Patterns
- Select a sample of the original database; mine frequent patterns within the sample using Apriori.
- Scan the database once to verify the frequent itemsets found in the sample; only borders of the closure of frequent patterns are checked.
- Scan the database again to find missed frequent patterns.
- H. Toivonen. Sampling large databases for association rules. In VLDB'96.

DHP: Reduce the Number of Candidates
- A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent.
- Candidates: a, b, c, d, e.
- Hash entries: {ab, ad, ae}, {bd, be, de}, ...
- Frequent 1-itemsets: a, b, d, e.
- ab is not a candidate 2-itemset if the sum of the counts of ab, ad, ae is below the support threshold.
- J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD'95.

Eclat/MaxEclat and VIPER: Exploring the Vertical Data Format
- Use tid-lists: the list of transaction ids containing an itemset.
- Compression of tid-lists.
- Itemset A: t1, t2, t3; sup(A) = 3.
- Itemset B: t2, t3, t4; sup(B) = 3.
- Itemset AB: t2, t3; sup(AB) = 2.
- Major operation: intersection of tid-lists.
- M. Zaki et al. New algorithms for fast discovery of association rules. In KDD'97.
- P. Shenoy et al. Turbo-charging vertical mining of large databases.
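The vertical (tid-list) operation behind Eclat is a plain set intersection; a minimal sketch using the A/B tid-lists from the slide (names are mine):

```python
# Vertical data format: item -> set of transaction ids containing it.
tidlists = {
    "A": {1, 2, 3},
    "B": {2, 3, 4},
}

def vertical_support(items, tidlists):
    """Intersect the tid-lists of `items`; support = size of the result."""
    tids = set.intersection(*(tidlists[i] for i in items))
    return tids, len(tids)

tids_ab, sup_ab = vertical_support(["A", "B"], tidlists)
print(tids_ab, sup_ab)   # {2, 3} 2
```

No database rescan is needed to count a longer itemset: its tid-list is derived from its parts, which is the point of the vertical format.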
  In SIGMOD'00.

Bottleneck of Frequent-Pattern Mining
- Multiple database scans are costly.
- Mining long patterns needs many passes of scanning and generates lots of candidates:
  - to find the frequent itemset i1 i2 ... i100:
    number of scans: 100;
    number of candidates: C(100,1) + C(100,2) + ... + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30!
- Bottleneck: candidate generation and test. Can we avoid candidate generation?

Mining Frequent Patterns Without Candidate Generation
- Grow long patterns from short ones using local frequent items:
  - "abc" is a frequent pattern; get all transactions having "abc": DB|abc;
  - if "d" is a local frequent item in DB|abc, then abcd is a frequent pattern.

Construct the FP-tree from a Transaction Database
- min_support = 3
    TID | Items bought             | (ordered) frequent items
    100 | f, a, c, d, g, i, m, p   | f, c, a, m, p
    200 | a, b, c, f, l, m, o      | f, c, a, b, m
    300 | b, f, h, j, o, w         | f, b
    400 | b, c, k, s, p            | c, b, p
    500 | a, f, c, e, l, p, m, n   | f, c, a, m, p
- Steps:
  1. Scan the DB once; find frequent 1-itemsets (single-item patterns).
  2. Sort frequent items in frequency-descending order: the f-list. Here F-list = f-c-a-b-m-p (header table: f:4, c:4, a:3, b:3, m:3, p:3).
  3. Scan the DB again; construct the FP-tree by inserting each transaction's ordered frequent items along shared prefix paths (the slide's tree has the main path f:4-c:3-a:3-m:2-p:2 with branches b:1, m:1, and a second path c:1-b:1-p:1).

Benefits of the FP-tree Structure
- Completeness:
  - preserves complete information for frequent pattern mining;
  - never breaks a long pattern of any transaction.
- Compactness:
  - reduces irrelevant information: infrequent items are gone;
  - items in frequency-descending order: the more frequently occurring, the more likely to be shared;
  - never larger than the original database (not counting node-links and the count fields);
  - for the Connect-4 DB, the compression ratio could be over 100.

Partition Patterns and Databases
- Frequent patterns can be partitioned into subsets according to the f-list. F-list = f-c-a-b-m-p:
  - patterns containing p;
  - patterns having m but no p;
  - ...
  - patterns having c but no a nor b, m,
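The "grow from short patterns" idea can be sketched without any tree at all: project the database on a pattern and count local frequent items. Data and threshold follow the FP-tree example above (min_support = 3); function names are mine:

```python
# Transactions from the FP-tree example slide.
db = [
    {"f", "a", "c", "d", "g", "i", "m", "p"},   # TID 100
    {"a", "b", "c", "f", "l", "m", "o"},        # TID 200
    {"b", "f", "h", "j", "o", "w"},             # TID 300
    {"b", "c", "k", "s", "p"},                  # TID 400
    {"a", "f", "c", "e", "l", "p", "m", "n"},   # TID 500
]

def project(db, pattern):
    """DB|pattern: transactions containing `pattern`, with `pattern` removed."""
    return [t - pattern for t in db if pattern <= t]

def local_frequent(proj_db, min_count):
    """Items whose count in the projected DB reaches `min_count`."""
    counts = {}
    for t in proj_db:
        for i in t:
            counts[i] = counts.get(i, 0) + 1
    return {i for i, c in counts.items() if c >= min_count}

proj = project(db, {"f", "c", "a"})   # three transactions survive
print(local_frequent(proj, 3))        # {'m'}: so f, c, a, m is frequent
```

The FP-tree is a compressed representation that makes exactly this projection cheap, without materializing DB|pattern per pattern.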
    p;
  - pattern f.
- Completeness and non-redundancy.

Find Patterns Having p from p's Conditional Pattern Base
- Starting at the frequent-item header table of the FP-tree, traverse the FP-tree by following the node-links of each frequent item p.
- Accumulate all transformed prefix paths of item p to form p's conditional pattern base.
- Conditional pattern bases (from the example FP-tree):
    item | conditional pattern base
    c    | f:3
    a    | fc:3
    b    | fca:1, f:1, c:1
    m    | fca:2, fcab:1
    p    | fcam:2, cb:1

From Conditional Pattern Bases to Conditional FP-trees
- For each pattern base:
  - accumulate the count for each item in the base;
  - construct the FP-tree for the frequent items of the pattern base.
- m's conditional pattern base: fca:2, fcab:1 → m-conditional FP-tree: the single path f:3-c:3-a:3 (b is infrequent here and is dropped).
- All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam.

Recursion: Mining Each Conditional FP-tree
- From the m-conditional FP-tree f:3-c:3-a:3:
  - conditional pattern base of "am": (fc:3) → am-conditional FP-tree: f:3-c:3;
  - conditional pattern base of "cm": (f:3) → cm-conditional FP-tree: f:3;
  - conditional
    pattern base of "cam": (f:3) → cam-conditional FP-tree: f:3.

A Special Case: a Single Prefix Path in the FP-tree
- Suppose a (conditional) FP-tree T has a shared single prefix path P.
- Mining can be decomposed into two parts:
  - reduction of the single prefix path into one node;
  - concatenation of the mining results of the two parts.
- (The slide's figure splits the tree a1:n1-a2:n2-a3:n3 with branches b1:m1, c1:k1, c2:k2, c3:k3 into the prefix-path part r1 and the branching part.)

Mining Frequent Patterns with FP-trees
- Idea: frequent-pattern growth — recursively grow frequent patterns by pattern and database partition.
- Method:
  - for each frequent item, construct its conditional pattern base and then its conditional FP-tree;
  - repeat the process on each newly created conditional FP-tree,
  - until the resulting FP-tree is empty or contains only one path (a single path generates all the combinations of its sub-paths, each of which is a frequent pattern).

Algorithm Explanation (slide in Chinese; its content is a figure not captured in this text export.)

Scaling FP-growth by DB Projection
- FP-tree cannot fit in memory? Use DB projection:
  - first partition the database into a set of projected DBs;
  - then construct and mine the FP-tree for each projected DB.
- Parallel projection vs. partition projection techniques; parallel projection is space-costly.

Partition-based Projection
- Parallel projection needs a lot of disk space; partition projection saves it.
- Transaction DB: fcamp, fcabm, fb, cbp, fcamp.
  - p-proj DB: fcam, cb, fcam
  - m-proj DB: fcab, fca, fca
  - b-proj DB: f, cb, ...
  - second-level projections: am-proj DB: fc, fc, fc; cm-proj DB: f, f, f; etc.

FP-Growth vs. Apriori: Scalability with the Support Threshold
- Data set T25I20D10K. (The run-time comparison chart is not reproduced here.)

FP-Growth vs.
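The parallel-projection scheme on the slide can be sketched in a few lines: each transaction (already reduced to its ordered frequent items) contributes, for every item it contains, the prefix before that item to that item's projected DB. Function and variable names are mine:

```python
# Transactions as ordered frequent-item strings, per the slide's Tran. DB.
tdb = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]

def parallel_projection(tdb):
    """item -> list of prefixes (one per transaction containing the item)."""
    proj = {}
    for t in tdb:
        for i, item in enumerate(t):
            if i > 0:   # an empty prefix projects nothing
                proj.setdefault(item, []).append(t[:i])
    return proj

proj = parallel_projection(tdb)
print(proj["p"])   # ['fcam', 'cb', 'fcam']
print(proj["m"])   # ['fca', 'fcab', 'fca'] (same multiset as the slide's m-proj DB)
```

The space cost is visible here: every transaction is copied once per item it contains, which is why partition projection (each transaction stored only in its last item's DB, then propagated) is preferred when disk space matters.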
Tree-Projection: Scalability with the Support Threshold
- Data set T25I20D100K. (The run-time comparison chart is not reproduced here.)

Why Is FP-Growth the Winner?
- Divide-and-conquer:
  - decompose both the mining task and the DB according to the frequent patterns obtained so far;
  - leads to focused search of smaller databases.
- Other factors:
  - no candidate generation, no candidate test;
  - compressed database: the FP-tree structure;
  - no repeated scan of the entire database;
  - basic operations are counting local frequent items and building sub-FP-trees — no pattern search and matching.

Implications of the Methodology
- Mining closed frequent itemsets and max-patterns