CS 361A (Advanced Data Structures and Algorithms)
Lecture 20 (Dec 7, 2005)
Data Mining: Association Rules
Rajeev Motwani
(partially based on notes by Jeff Ullman)

Association Rules Overview
  Market Baskets & Association Rules
  Frequent item-sets
  A-priori algorithm
  Hash-based improvements
  One- or two-pass approximations
  High-correlation mining

Association Rules: Two Traditions
  DM is the science of approximating joint distributions
    Representation of process generating data
    Predict P[E] for interesting events E
  DM is technology for fast counting
    Can compute certain summaries quickly
    Let's try to use them
  Association Rules
    Captures interesting pieces of joint distribution
    Exploits fast counting technology

Market-Basket Model
  Large Sets
    Items A = {A_1, A_2, ..., A_m}
      e.g., products sold in supermarket
    Baskets B = {B_1, B_2, ..., B_n}
      small subsets of items in A
      e.g., items bought by customer in one transaction
  Support: sup(X) = number of baskets with itemset X
  Frequent Itemset Problem
    Given: support threshold s
    Frequent Itemsets: sup(X) ≥ s
    Find: all frequent itemsets

Example
  Items A = {milk, coke, pepsi, beer, juice}
  Baskets
    B1 = {m, c, b}       B2 = {m, p, j}
    B3 = {m, b}          B4 = {c, j}
    B5 = {m, p, b}       B6 = {m, c, b, j}
    B7 = {c, b, j}       B8 = {b, c}
  Support threshold s = 3
  Frequent itemsets
    {m}, {c}, {b}, {j}, {m, b}, {c, b}, {j, c}

Application 1 (Retail Stores)
  Real market baskets
    chain stores keep TBs of customer purchase info
    Value?
      how typical customers navigate stores
      positioning tempting items
      suggests "tie-in tricks", e.g., hamburger sale while raising ketchup price
  High support needed, or no $s

Application 2 (Information Retrieval)
  Scenario 1
    baskets = documents
    items = words in documents
    frequent word-groups = linked concepts
  Scenario 2
    items = sentences
    baskets = documents containing sentences
    frequent sentence-groups = possible plagiarism

Application 3 (Web Search)
  Scenario 1
    baskets = web pages
    items = outgoing links
    pages with similar references: about same topic
  Scenario 2
    baskets = web pages
    items = incoming links
    pages with similar in-links: mirrors, or same topic

Scale of Problem
  WalMart
    sells m = 100,000 items
    tracks n = 1,000,000,000 baskets
  Web
    several billion pages
    one new "word" per page
  Assumptions
    m small enough for small amount of memory per item
    m too large for memory per pair or k-set of items
    n too large for memory per basket
    Very sparse data: rare for item to be in basket

Association Rules
  If-then rules about basket contents
    {A_1, A_2, ..., A_k} → A_j
    if basket has X = {A_1, ..., A_k}, then likely to have A_j
  Confidence: probability of A_j given A_1, ..., A_k
  Support (of rule): sup(X ∪ {A_j})

Example
  B1 = {m, c, b}       B2 = {m, p, j}
  B3 = {m, b}          B4 = {c, j}
  B5 = {m, p, b}       B6 = {m, c, b, j}
  B7 = {c, b, j}       B8 = {b, c}
  Association Rule
    {m, b} → c
    Support = 2
    Confidence = 2/4 = 50%

Finding Association Rules
  Goal: find all association rules such that
    support ≥ s
    confidence ≥ a given confidence threshold
  Reduction to Frequent Itemsets Problem
    Find all frequent itemsets X
    Given X = {A_1, ..., A_k}, generate all rules X - {A_j} → A_j
    Confidence = sup(X) / sup(X - {A_j})
    Support = sup(X)
  Observe: X - {A_j} is also frequent, so its support is known

Computation Model
  Data Storage
    Flat files, rather than database system
    Stored on disk, basket-by-basket
  Cost Measure: number of passes
    Count disk I/O only
    Given data size, avoid random seeks and do linear scans
  Main-Memory Bottleneck
    Algorithms maintain count-tables in memory
    Limitation on number of counters
    Disk-swapping count-tables is a disaster

Finding Frequent Pairs
  Frequent 2-Sets
    hard case already
    focus for now, later extend to k-sets
  Naïve Algorithm
    Counters: all m(m-1)/2 item pairs
    Single pass: scanning all baskets
    Basket of size b increments b(b-1)/2 counters
  Failure?
    if memory < m(m-1)/2
    even for m = 100,000

Monotonicity Property
  Underlies all known algorithms
  Monotonicity Property
    Given itemsets X ⊆ Y
    Then sup(X) ≥ sup(Y)
  Contrapositive (for 2-sets)
    if item X_i is not frequent, then no pair {X_i, X_j} is frequent

A-Priori Algorithm
  A-Priori: 2-pass approach in limited memory
  Pass 1
    m counters (candidate items in A)
    Linear scan of baskets b
    Increment counters for each item in b
  Mark as frequent the f items of count at least s
  Pass 2
    f(f-1)/2 counters (candidate pairs of frequent items)
    Linear scan of baskets b
    Increment counters for each pair of frequent items in b
  Failure: if memory …

… s, bit = 1)
  Pass 2
    Counter only for F qualified pairs (X_i, X_j):
      both are frequent
      pair hashes to frequent bucket (bit = 1)
    Linear scan of baskets b
    Increment counters for candidate qualified pairs of items in b

Multistage PCY Algorithm
  Problem: False positives from hashing
  New Idea: Multiple rounds of hashing
    After Pass 1, get list of qualified pairs
    In Pass 2, hash only qualified pairs
    Fewer pairs hash to buckets, so fewer false positives
      (buckets with count ≥ s, yet no pair of count ≥ s)
    In Pass 3, less likely to qualify infrequent pairs
  Repetition: reduce memory, but more passes
  Failure: memory …

… 2
  Monotonicity: itemset X is frequent only if X - {X_j} is frequent for all X_j
  Idea
    Stage k finds all frequent k-sets
    Stage 1 gets all frequent item
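
The two-pass A-Priori procedure for frequent pairs, together with the confidence computation from the association-rule example, can be sketched roughly as follows. This is a minimal in-memory illustration and not the lecture's own code: the function names `apriori_pairs` and `confidence` are made up for this sketch, the baskets are held in a Python list rather than scanned from disk, and counters are created lazily for pairs that actually occur rather than preallocated for all f(f-1)/2 candidates.

```python
from collections import Counter
from itertools import combinations


def apriori_pairs(baskets, s):
    """Two-pass A-Priori sketch: frequent items, then frequent pairs.

    baskets: a re-iterable collection of item sets (scanned twice)
    s: support threshold; an itemset is frequent if its count is at least s
    """
    # Pass 1: one counter per item, one linear scan of the baskets.
    item_counts = Counter()
    for basket in baskets:
        item_counts.update(set(basket))
    frequent_items = {item for item, n in item_counts.items() if n >= s}

    # Pass 2: count only pairs whose two items are both frequent
    # (monotonicity: a pair containing an infrequent item cannot be frequent).
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(item for item in set(basket) if item in frequent_items)
        pair_counts.update(combinations(kept, 2))
    frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= s}
    return frequent_items, frequent_pairs


def confidence(baskets, lhs, rhs):
    """Confidence of the rule lhs -> rhs: fraction of baskets containing
    all of lhs that also contain rhs (0.0 if no basket contains lhs)."""
    lhs = set(lhs)
    with_lhs = [b for b in baskets if lhs <= set(b)]
    if not with_lhs:
        return 0.0
    return sum(1 for b in with_lhs if rhs in b) / len(with_lhs)


# The eight example baskets from the slides (m, c, p, b, j).
baskets = [
    {"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
    {"m", "p", "b"}, {"m", "c", "b", "j"}, {"c", "b", "j"}, {"b", "c"},
]

frequent_items, frequent_pairs = apriori_pairs(baskets, s=3)
print(sorted(frequent_items))           # ['b', 'c', 'j', 'm']
print(sorted(frequent_pairs.items()))   # [(('b', 'c'), 4), (('b', 'm'), 4), (('c', 'j'), 3)]
print(confidence(baskets, {"m", "b"}, "c"))  # 0.5
```

With s = 3 this reproduces the frequent itemsets listed in the example slide, and the rule {m, b} → c comes out at 50% confidence. A disk-based version would replace the list with two sequential reads of the basket file.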