Internet Information Search
School of Computer and Communication, Hunan University
Liu Yufeng (刘钰峰)

Internet Information Search 6: tfidf and vector spaces

Review
  1. Chinese word segmentation
  2. Dictionary compression
  3. Posting list compression
  4. tfidf

Scoring documents
  How do we construct an index?
  What strategies can we use with limited main memory?

Scoring
  We wish to return, in order, the documents most likely to be useful to the searcher.
  How can we rank order the docs in the corpus with respect to a query?
  Assign a score, say in [0,1], for each doc on each query.
  Begin with a perfect world: no spammers.
    Nobody stuffing keywords into a doc to make it match queries.
    More on "adversarial IR" under web search.

Linear zone combinations
  First generation of scoring methods: use a linear combination of Booleans.
  E.g., Score = 0.6*<...> + 0.3*<...> + 0.05*<...> + 0.05*<...>
  Each Boolean expression <...> takes on a value in {0,1}.
  Then the overall score is in [0,1].
  For this example the scores can only take on a finite set of values. What are they?

Exercise
  On the query bill OR rights, suppose that we retrieve the following docs from the various zone indexes:
  [Figure: postings lists for bill and rights in the Author, Title, and Body zone indexes. Doc IDs as extracted: 1, 5, 2, 8, 3, 3, 5, 9, 2, 5, 1, 5, 8, 3, 9, 9]
  Compute the score for each doc based on the weightings 0.6, 0.3, 0.1.

General idea
  We are given a weight vector whose components sum up to 1.
    There is a weight for each zone/field.
  Given a Boolean query, we assign a score to each doc by adding up the weighted contributions of the zones/fields.
  Typically users want to see the K highest-scoring docs.

Index support for zone combinations
  In the simplest version we have a separate inverted index for each zone.
  Variant: have a single index with a separate dictionary entry for each term and zone. E.g.,
    bill.author -> 1, 2
    bill.title  -> 3, 5, 8
    bill.body   -> 1, 2, 5, 9
  Of course, compress zone names like author/title/body.

Zone combinations index
  The above scheme is still wasteful: each term is potentially replicated for each zone.
  In a slightly better scheme, we encode the zone in the postings:
    bill -> 1.author, 1.body -> 2.author, 2.body -> 3.title
  As before, the zone names get compressed.
  At query time, accumulate contributions to the total score of a document from the various postings, e.g.,
    bill   -> 1.author, 1.body -> 2.author, 2.body -> 3.title
    rights -> 3.title, 3.body -> 5.title, 5.body

Score accumulation
  As we walk the postings for the query bill OR rights, we accumulate scores for each doc in a linear merge as before.
    Doc    1    2    3    5
    Score  0.7  0.7  0.4  0.4
  Note: we get both bill and rights in the Title field of doc 3, but score it no higher.
  Should we give more weight to more hits?

Term-document count matrices
  Consider the number of occurrences of a term in a document:
    Bag of words model
    Document is a vector: a column below
  [Figure: term-document count matrix]

Bag of words view of a doc
  Thus the doc "John is quicker than Mary." is indistinguishable from the doc "Mary is quicker than John."
  Which of the indexes discussed so far distinguish these two docs?

Counts vs. frequencies
  WARNING: In a lot of IR literature, "frequency" is used to mean "count".
    Thus term frequency in IR literature is used to mean the number of occurrences in a doc,
    not divided by document length (which would actually make it a frequency).
  We will conform to this misnomer.
    In saying term frequency we mean the number of occurrences of a term in a document.

Term frequency tf
  Long docs are favored because they're more likely to contain query terms.
  Can fix this to some extent by normalizing for document length.
  But is raw tf the right measure?

Document frequency
  But document frequency (df) may be better:
  df = number of docs in the corpus containing the term
    Word       cf     df
    ferrari    10422  17
    insurance  10440  3997
  Document/collection frequency weighting is only possible in a known (static) collection.
  So how do we make use of df?

tf x idf term weights
  The tf x idf measure combines:
    term frequency (tf), or wf: some measure of term density in a doc
    inverse document frequency (idf): a measure of the informativeness of a term, i.e. its rarity across the whole corpus
      could just be based on the raw count of the number of documents the term occurs in (idf_i = 1/df_i)
      but by far the most commonly used version is: idf_i = log(n / df_i), where n is the number of docs in the corpus
  See Kishore Papineni, NAACL 2, 2002 for theoretical justification.

Summary: tf x idf (or tf.idf)
  Assign a tf.idf weight to each term i in each document d.
  Increases with the number of occurrences within a doc.
  Increases with the rarity of the term across the whole corpus.

TF revisited

Real-valued term-document matrices
  Function (scaling) of count of a word in a document:
    Bag of words model
    Each is a vector in R^v
    Here log-scaled tf.idf
  [Figure: term-document matrix of log-scaled tf.idf values]
  Note: can be >1!

Documents as vectors
  Each doc j can now be viewed as a vector of wf x idf values, one component for each term.
  So we have a vector space:
    terms are axes
    docs live in this space
    even with stemming, may have 20,000+ dimensions
  (The corpus of documents gives us a matrix, which we could also view as a vector space in which words live: transposable data.)

Why turn docs into vectors?
  First application: Query-by-example
    Given a doc d, find others "like" it.
  Now that d is a vector, find vectors (docs) "near" it.

Intuition
  Postulate: Documents that are "close together" in the vector space talk a
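The two scoring ideas in these slides, weighted zone scoring for bill OR rights and tf x idf term weights, can be sketched in Python as follows. This is only an illustration under stated assumptions: the names zone_scores and tfidf_vectors, the base-10 logarithm, and the three-sentence toy corpus are not from the slides. The zone weights (0.6, 0.3, 0.1) and the zone-encoded postings are taken from the slides.

```python
import math
from collections import Counter, defaultdict

# Zone weights from the exercise (they sum to 1).
ZONE_WEIGHTS = {"author": 0.6, "title": 0.3, "body": 0.1}

# Zone-encoded postings from the slides: term -> {doc_id: zones containing the term}.
POSTINGS = {
    "bill": {1: {"author", "body"}, 2: {"author", "body"}, 3: {"title"}},
    "rights": {3: {"title", "body"}, 5: {"title", "body"}},
}


def zone_scores(query_terms, postings, weights):
    """Score docs for an OR query; a zone counts once however many terms hit it."""
    zones_hit = defaultdict(set)
    for term in query_terms:
        for doc_id, zones in postings.get(term, {}).items():
            zones_hit[doc_id] |= zones
    # Rounding only hides floating-point noise in the sums.
    return {
        doc_id: round(sum(weights[z] for z in zones), 10)
        for doc_id, zones in sorted(zones_hit.items())
    }


def tfidf_vectors(docs):
    """Map each doc id to {term: tf * log10(n / df)}, with tf as a raw count.

    Assumption: whitespace tokenization and a base-10 log; the slides fix neither.
    """
    bags = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}
    n_docs = len(docs)
    df = Counter()
    for bag in bags.values():
        df.update(bag.keys())  # each doc contributes at most 1 to a term's df
    return {
        doc_id: {term: tf * math.log10(n_docs / df[term]) for term, tf in bag.items()}
        for doc_id, bag in bags.items()
    }


# Hypothetical toy corpus, only for illustration.
TOY_DOCS = {
    1: "john is quicker than mary",
    2: "mary is quicker than john",
    3: "bill of rights",
}

if __name__ == "__main__":
    # Docs 1 and 2 score 0.7; docs 3 and 5 score 0.4, as in the score accumulation slide.
    print(zone_scores(["bill", "rights"], POSTINGS, ZONE_WEIGHTS))
    print(tfidf_vectors(TOY_DOCS))
```

In the toy corpus, docs 1 and 2 get identical vectors, which is the bag of words point made above: word order is lost. Doc 3 still scores only 0.4 even though both query terms hit its Title zone, which is the behaviour the slides question with "Should we give more weight to more hits?".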