CS345 Data Mining

Mining the Web for Structured Data

Our view of the web so far
- Web pages as atomic units
- Great for some applications
  - e.g., conventional web search
- But not always the right model

Going beyond web pages
- Question answering
  - What is the height of Mt Everest?
  - Who killed Abraham Lincoln?
- Relation extraction
  - Find all pairs in a given relation
- Virtual databases
  - Answer database-like queries over web data
  - E.g., find all software engineering jobs in Fortune 500 companies

Question Answering
- E.g., Who killed Abraham Lincoln?
- Naïve algorithm
  - Find all web pages containing the terms “killed” and “Abraham Lincoln” in close proximity
  - Extract k-grams from a small window around the terms
  - Find the most commonly occurring k-grams

Question Answering (continued)
- The naïve algorithm works fairly well!
- Some improvements
  - Use sentence structure, e.g., restrict to noun phrases only
  - Rewrite questions before matching
    - “What is the height of Mt Everest” becomes “The height of Mt Everest is ”
- For simple questions, the number of pages analyzed is more important than the sophistication of the NLP
- Reference: Dumais et al.

Relation Extraction
- Find pairs (title, author)
  - Where title is the name of a book
  - E.g., (Foundation, Isaac Asimov)
- Find pairs (company, hq)
  - E.g., (Microsoft, Redmond)
- Find pairs (abbreviation, expansion)
  - E.g., (ADA, American Dental Association)
- Can also have tuples with more than 2 components

Relation Extraction (continued)
- Assumptions:
  - No single source contains all the tuples
  - Each tuple appears on many web pages
  - Components of a tuple appear “close” together
    - Foundation, by Isaac Asimov
    - Isaac Asimov’s masterpiece, the Foundation trilogy
  - There are repeated patterns in the way tuples are represented on web pages

Naïve approach
- Study a few websites and come up with a set of patterns, e.g., regular expressions
  - letter = [A-Za-z. ]
  - title = letter{5,40}
  - author = letter{10,30}
  - (title) by (author)

Problems with the naïve approach
- A pattern that works on one web page might produce nonsense when applied to another
  - So patterns need to be page-specific, or at least site-specific
- Impossible for a human to exhaustively enumerate patterns for every relevant website
  - Will result in low coverage

Better approach (Brin)
- Exploit the duality between patterns and tuples
  - Find tuples that match a set of patterns
  - Find patterns that match a lot of tuples
- DIPRE (Dual Iterative Pattern Relation Extraction)
- Diagram: patterns match tuples; tuples generate patterns

DIPRE Algorithm
1. R ← SampleTuples
   - e.g., a small set of pairs
2. O ← FindOccurrences(R)
   - Occurrences of tuples on web pages
   - Keep some surrounding context
3. P ← GenPatterns(O)
   - Look for patterns in the way tuples occur
   - Make sure patterns are not too general!
4. R ← MatchingTuples(P)
5. Return or go back to Step 2

Occurrences
- e.g., titles and authors
- Restrict to cases where author and title appear in close proximity on a web page
  - Foundation by Isaac Asimov (1951)
- url = URL of the page on which the occurrence appears
- order = [title, author] (or [author, title])
  - denote as 0 or 1
- prefix = “ ” (limit to e.g., 10 characters)
- middle = “ by ”
- suffix = “ (1951) ”
- occurrence = (Foundation, Isaac Asimov, url, order, prefix, middle, suffix)

Patterns
- Occurrences:
  - Foundation by Isaac Asimov (1951)
  - Nightfall by Isaac Asimov (1941)
- order = [title, author] (say 0)
- shared prefix = “ ”
- shared middle = “ by ”
- shared suffix = “ (19”
- pattern = (order, shared prefix, shared middle, shared suffix)

URL Prefix
- Patterns may be specific to a website
  - Or even parts of it
- Add a urlprefix component to the pattern
  - occurrence: Foundation by Isaac Asimov (1951)
  - occurrence: Nightfall by Isaac Asimov (1941)
  - shared urlprefix = longest common prefix of the URLs of the two occurrences
  - pattern = (urlprefix, order, prefix, middle, suffix)

Generating Patterns
1. Group occurrences by order and middle
2. Let O = a set of occurrences with the same order and middle
   - pattern.order = O.order
   - pattern.middle = O.middle
   - pattern.urlprefix = longest common prefix of all urls in O
   - pattern.prefix = longest common suffix of the prefixes of the occurrences in O (the text immediately before the tuple)
   - pattern.suffix = longest common prefix of the suffixes of the occurrences in O (the text immediately after the tuple)

Example
- occurrence: Foundation by Isaac Asimov (1951)
- occurrence: Nightfall by Isaac Asimov (1941)
- order = [title, author]
- middle = “ by ”
- urlprefix = longest common prefix of the two URLs
- prefix = “ ”
- suffix = “ (19”

Example
- occurrence: Foundation, by Isaac Asimov, has been hailed
- occurrence: Nightfall, by Isaac Asimov, tells the tale of
- order = [title, author]
- middle = “, by ”
- urlprefix = longest common prefix of the two URLs
- prefix = “”
- suffix = “, ”

Pattern Specificity
- We want to avoid generating patterns that are too general
- One approach:
  - For pattern p, define $\mathrm{specificity}(p) = |\mathrm{urlprefix}| \cdot |\mathrm{middle}| \cdot |\mathrm{prefix}| \cdot |\mathrm{suffix}|$
  - Suppose n(p) = number of occurrences that match the pattern p
  - Discard patterns where $n(p) < n_{\min}$
  - Discard patterns p where $\mathrm{specificity}(p) \cdot n(p) < \mathrm{threshold}$
- Then go back to Step 2 of the DIPRE algorithm

Some refinements
- Give more weight to tuples found earlier
- Approximate pattern matches
- Entity tagging

Tuple confidence
- If tuple t matches a set of patterns P:
  - $\mathrm{conf}(t) = 1 - \prod_{p \in P} \bigl(1 - \mathrm{conf}(p)\bigr)$
- Suppose we allow tuples that don’t exactly match patterns but only approximately:
  - $\mathrm{conf}(t) = 1 - \prod_{p \in P} \bigl(1 - \mathrm{conf}(p) \cdot \mathrm{match}(t, p)\bigr)$
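
The pattern-generation step, the specificity measure, and the exact-match tuple confidence described in the slides can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not Brin's implementation: the class and function names (`Occurrence`, `Pattern`, `gen_patterns`, `specificity`, `tuple_confidence`) and the `example.com` URLs are hypothetical, and n(p) is approximated by the size of the group that produced the pattern rather than by re-matching the pattern against the web.

```python
from collections import defaultdict
from dataclasses import dataclass
from math import prod
from os.path import commonprefix  # character-wise longest common prefix


@dataclass(frozen=True)
class Occurrence:
    title: str
    author: str
    url: str
    order: int   # 0 = title before author, 1 = author before title
    prefix: str  # text immediately before the tuple
    middle: str  # text between the two components
    suffix: str  # text immediately after the tuple


@dataclass(frozen=True)
class Pattern:
    urlprefix: str
    order: int
    prefix: str
    middle: str
    suffix: str


def common_suffix(strings):
    """Longest common suffix, computed as a common prefix of reversed strings."""
    return commonprefix([s[::-1] for s in strings])[::-1]


def gen_patterns(occurrences):
    """Group occurrences by (order, middle); emit one (pattern, count) per group."""
    groups = defaultdict(list)
    for occ in occurrences:
        groups[(occ.order, occ.middle)].append(occ)
    result = []
    for (order, middle), group in groups.items():
        pattern = Pattern(
            urlprefix=commonprefix([o.url for o in group]),
            order=order,
            # Keep the context adjacent to the tuple on both sides.
            prefix=common_suffix([o.prefix for o in group]),
            middle=middle,
            suffix=commonprefix([o.suffix for o in group]),
        )
        result.append((pattern, len(group)))  # len(group) stands in for n(p)
    return result


def specificity(p):
    """Product of the lengths of the pattern's string components."""
    return len(p.urlprefix) * len(p.middle) * len(p.prefix) * len(p.suffix)


def tuple_confidence(pattern_confs):
    """conf(t) = 1 - prod over matching patterns of (1 - conf(p))."""
    return 1 - prod(1 - c for c in pattern_confs)


# Hypothetical occurrences modeled on the slide example; the URLs are made up.
occs = [
    Occurrence("Foundation", "Isaac Asimov",
               "http://example.com/scifi/1950.html", 0, " ", " by ", " (1951) "),
    Occurrence("Nightfall", "Isaac Asimov",
               "http://example.com/scifi/1940.html", 0, " ", " by ", " (1941) "),
]
patterns = gen_patterns(occs)
pattern, count = patterns[0]
```

Note that the second slide example, whose shared prefix is the empty string, would get specificity 0 under the plain product, so a real system would need some policy for empty components; the sketch leaves that choice open.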