Stanford University - Big Data Mining - introdu (course slides)

Uploaded by: 阳*** | Document ID: 111944185 | Uploaded: 2022-06-21 | Format: PPT | Pages: 27 | Size: 59KB
Slide 1: CS345A: Data Mining on the Web
- Course Introduction
- Issues in Data Mining
- Bonferroni's Principle

Slide 2: Course Staff
- Instructors:
  - Anand Rajaraman
  - Jeff Ullman
- Reach us at cs345a-win0809-staff@lists.stanford.edu.
- More info on [URL missing in source].

Slide 3: Requirements
- Homework (Gradiance and other): 20%
  - Go to [URL missing in source].
  - Enter class code 83769DC9.
  - If you took CS145 or CS245 in the past year, you should have free access; otherwise you will have to purchase access from Pearson Ed.
- Project: 40%
- Final Exam: 40%

Slide 4: Project
- Software implementation related to course subject matter.
- Should involve an original component or experiment.
- More later about available data and computing resources.

Slide 5: Possible Projects
- Many past projects have dealt with collaborative filtering (advice based on what similar people do).
  - E.g., Netflix Challenge.
- Others have dealt with engineering solutions to "machine-learning" problems.

Slide 6: ML-Replacement Projects
- ML generally requires a large "training set" of correctly classified data.
  - Example: classifying Web pages by topic.
- Hard to find well-classified data.
  - Exception: Open Directory works for page topics, because work is collaborative and shared by many.
  - Other good exceptions?

Slide 7: ML-Replacement (2)
- Many problems require thought rather than ML:
  - Tell important pages from unimportant (PageRank).
  - Tell real news from publicity (how?).
  - Distinguish positive from negative product reviews (how?).
  - Etc., etc.

Slide 8: Team Projects
- Working in pairs OK, but:
  - No more than two per project.
  - We will expect more from a pair than from an individual.
  - The effort should be roughly evenly distributed.

Slide 9: What is Data Mining?
- Discovery of useful, possibly unexpected, patterns in data.
- Subsidiary issues:
  - Data cleaning: detection of bogus data (e.g., age = 150); entity resolution.
  - Visualization: something better than megabyte files of output.

Slide 10: Cultures
- Databases: concentrate on large-scale (non-main-memory) data.
- AI (machine-learning): concentrate on complex methods, small data.
- Statistics: concentrate on models.

Slide 11: Models vs. Analytic Processing
- To a database person, data mining is an extreme form of analytic processing: queries that examine large amounts of data.
  - Result is the query answer.
- To a statistician, data mining is the inference of models.
  - Result is the parameters of the model.

Slide 12: (Way too Simple) Example
- Given a billion numbers, a DB person would compute their average and standard deviation.
- A statistician might fit the billion points to the best Gaussian distribution and report the mean and standard deviation of that distribution.

Slide 13: Outline of Course
- Map-Reduce and Hadoop.
- Association rules, frequent itemsets.
- PageRank and related measures of importance on the Web (link analysis).
  - Spam detection.
  - Topic-specific search.
- Recommendation systems.
  - Collaborative filtering.

Slide 14: Outline (2)
- Finding similar sets.
  - Minhashing, Locality-Sensitive hashing.
- Extracting structured data (relations) from the Web.
- Clustering data.
- Managing Web advertisements.
- Mining data streams.

Slide 15: Meaningfulness of Answers
- A big data-mining risk is that you will "discover" patterns that are meaningless.
- Statisticians call it Bonferroni's principle: (roughly) if you look in more places for interesting patterns than your amount of data will support, you are bound to find crap.

Slide 16: Examples of Bonferroni's Principle
- A big objection to TIA was that it was looking for so many vague connections that it was sure to find things that were bogus and thus violate innocents' privacy.
- The Rhine Paradox: a great example of how not to conduct scientific research.

Slide 17: Stanford Professor Proves Tracking Terrorists Is Impossible!
- Three years ago, the example I am about to give you was picked up from my class slides by a reporter from the LA Times.
- Despite my talking to him at length, he was unable to grasp the point that the story was made up to illustrate Bonferroni's Principle, and was not real.

Slide 18: The "TIA" Story
- Suppose we believe that certain groups of evil-doers are meeting occasionally in hotels to plot doing evil.
- We want to find (unrelated) people who at least twice have stayed at the same hotel on the same day.

Slide 19: The Details
- $10^{9}$ people being tracked.
- 1000 days.
- Each person stays in a hotel 1% of the time (10 days out of 1000).
- Hotels hold 100 people (so $10^{5}$ hotels).
- If everyone behaves randomly (i.e., no evil-doers), will the data mining detect anything suspicious?

Slide 20: Calculations (1)
- Probability that given persons p and q will be at the same hotel on given day d:
  - $1/100 \times 1/100 \times 10^{-5} = 10^{-9}$.
  - The three factors are: p at some hotel, q at some hotel, and both at the same hotel.
- Probability that p and q will be at the same hotel on given days d1 and d2:
  - $10^{-9} \times 10^{-9} = 10^{-18}$.
- Pairs of days:
  - $5 \times 10^{5}$.

Slide 21: Calculations (2)
- Probability that p and q will be at the same hotel on some two days:
  - $5 \times 10^{5} \times 10^{-18} = 5 \times 10^{-13}$.
- Pairs of people:
  - $5 \times 10^{17}$.
- Expected number of "suspicious" pairs of people:
  - $5 \times 10^{17} \times 5 \times 10^{-13} = 250{,}000$.

Slide 22: Conclusion
- Suppose there are (say) 10 pairs of evil-doers who definitely stayed at the same hotel twice.
- Analysts have to sift through 250,010 candidates to find the 10 real cases.
  - Not gonna happen.
  - But how can we improve the scheme?

Slide 23: Moral
- When looking for a property (e.g., "two people stayed at the same hotel twice"), make sure that the property does not allow so many possibilities that random data will surely produce facts "of interest."

Slide 24: Rhine Paradox (1)
- Joseph Rhine was a parapsychologist in the 1950s who hypothesized that some people had Extra-Sensory Perception.
- He devised (something like) an experiment where subjects were asked to guess 10 hidden cards: red or blue.
- He discovered that almost 1 in 1000 had ESP: they were able to get all 10 right!

Slide 25: Rhine Paradox (2)
- He told these people they had ESP and called them in for another test of the same type.
- Alas, he discovered that almost all of them had lost their ESP.
- What did he conclude?
  - Answer on next slide.

Slide 26: Rhine Paradox (3)
- He concluded that you shouldn't tell people they have ESP; it causes them to lose it.

Slide 27: Moral
- Understanding Bonferroni's Principle will help you look a little less stupid than a parapsychologist.
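
The arithmetic behind the "TIA" story and the Rhine experiment above can be rechecked with a short script. This is a minimal Python sketch, not part of the original course material: the function and parameter names are illustrative, and it deliberately uses the same approximation as the slides (the number of pairs among n items taken as n^2 / 2). Exact fractions are used so the result is not blurred by floating-point rounding.

```python
from fractions import Fraction


def expected_suspicious_pairs(people, days, stay_prob, hotels):
    """Expected number of pairs of people who share a hotel on two days,
    assuming everyone behaves randomly (hypothetical helper)."""
    # p in a hotel, q in a hotel, and both in the same one of `hotels` hotels.
    p_same_hotel_one_day = stay_prob * stay_prob / hotels
    # Same approximation as the slides: pairs of n items ~ n^2 / 2.
    pairs_of_days = Fraction(days ** 2, 2)
    pairs_of_people = Fraction(people ** 2, 2)
    # Probability that a given pair coincides on some two days.
    p_two_days = pairs_of_days * p_same_hotel_one_day ** 2
    return pairs_of_people * p_two_days


def chance_all_correct(cards, choices=2):
    """Probability of guessing every one of `cards` hidden cards by luck
    when each card has `choices` equally likely values."""
    return Fraction(1, choices ** cards)


# Numbers from the "TIA" story: 10^9 people, 1000 days, 1% hotel stays, 10^5 hotels.
print(expected_suspicious_pairs(10**9, 1000, Fraction(1, 100), 10**5))  # 250000

# Rhine experiment: 10 red-or-blue cards, so about 1 subject in 1000 succeeds by chance.
print(chance_all_correct(10))  # 1/1024
```

Both results come purely from random behaviour: 250,000 "suspicious" pairs with no evil-doers at all, and roughly one "ESP" subject per thousand with no ESP at all, which is the point of Bonferroni's Principle as the slides present it.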