Chapter 1: Data Mining (Course Slides)

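7/11/2024

What Is Data Mining?
- Data mining is the technique of finding regularities in large amounts of data. It has three main steps: data preparation, pattern discovery, and pattern presentation.
- Data preparation selects and integrates the data to be mined from various data sources.
- Pattern discovery applies some method to find the regularities in the data.
- Pattern presentation expresses the discovered regularities in a form that matches users' habits as closely as possible, for example through visualization.
- As it developed, data mining absorbed many techniques from mathematical statistics, databases, and artificial intelligence.

Data Mining Examples in Practice
- A credit card company analyzes historical card data to determine which people are credit risks and which are not.
- A supermarket analyzes transaction data to arrange goods on its shelves and increase sales.
- An investigative bureau analyzes behavior patterns to determine which people pose a potential threat to protected information.
- A pharmacy analyzes physicians' prescriptions to determine which physicians are likely to buy its products.
- An insurance company analyzes past customer records to determine which customers are potentially expensive.
- A car company analyzes purchasing patterns in different regions and sends customers targeted brochures for the cars they are likely to prefer.
- An employment center analyzes clients' work histories and sends them information on jobs that may interest them.
- Accessing a competitor's unclassified database to infer potentially classified information.
- A college of education analyzes students' historical records to determine who is likely to attend training, and sends them brochures.
- A nuclear weapons plant analyzes historical inspection records to determine which omitted precautions could lead to a nuclear disaster.
- An advertising agency analyzes people's purchasing patterns to estimate their income and number of children as potential market information.
- An investigative bureau analyzes the travel patterns of different groups to determine associations among them.
- A physician analyzes patient histories and current medications, not only to diagnose and prescribe but also to predict potential problems.
- A tax agency analyzes the income tax records of different groups to discover abnormal patterns and trends.
- An investigative bureau analyzes criminal records to infer which people might commit terrorism or mass murder.

Chapter 1. Introduction
- What motivated data mining, and why is it important?
- What is data mining?
- On what kinds of data is data mining performed?
- Data mining functionality: what kinds of patterns can be mined?
- Are all patterns interesting?
- Classification of data mining systems
- Major issues in data mining

Motivation: "Necessity Is the Mother of Invention"
- The data explosion problem
  - Automated data collection tools and mature database technology have led to huge amounts of data being stored in databases, data warehouses, and other information repositories.
- We are data rich but information poor.
- Solution: data warehousing and data mining
  - Data warehousing and online analytical processing (OLAP)
  - Interesting knowledge (rules, patterns) from large databases

Evolution of Database Technology
- 1960s: from primitive file processing to sophisticated, powerful database systems
  - Data collection, database creation, information management systems (IMS), and database management systems (DBMS)
- 1970s: from hierarchical and network database systems to relational database systems
  - Relational data model, relational DBMS tools
- 1980s: wide acceptance of relational technology; research and development of new, powerful database systems using advanced data models (object-oriented, extended-relational, object-relational, and deductive)
  - RDBMS, advanced data models (object-oriented, deductive, etc.), and application-oriented DBMS (spatial, scientific, engineering)
- 1990s: the data warehouse, a database architecture in which multiple heterogeneous data sources are stored at a single site under a unified schema to support management decisions
  - Data mining and data warehousing, multimedia databases, and Web databases
- 2000s: a new generation of integrated information systems
  - Stream data management and mining
  - Data mining and its applications
  - Web technology (XML, data integration) and global information systems

The Emergence of Data Mining
- Data mining emerged in the late 1980s and advanced rapidly in the 1990s. In 2001, an advanced technology survey by Gartner Group ranked data mining and artificial intelligence first among "the five key technologies that will have a profound impact on industry in the next three to five years," and ranked parallel processing architectures and data mining as the top two of ten emerging technologies that would be the focus of investment over the following five years.
- Data mining first developed in the database field, where it was called Knowledge Discovery in Databases (KDD). Data mining is one step in the KDD process. Although its history is short, it has developed quickly since the 1990s, and there is still no complete, agreed definition.
- The term "knowledge discovery in databases" first appeared in 1989 at the Eleventh International Joint Conference on Artificial Intelligence in Detroit, USA. The first international conference on KDD and Data Mining was held in Montreal, Canada, in 1995, and the conference has been held every year since. After more than a decade of effort, data mining research has produced substantial results, and many software companies have developed data mining products that are in use in North America, Europe, and elsewhere.

Current Hot Topics in Data Mining
- The three pillars of data mining technology are database technology, artificial intelligence, and probability and mathematical statistics.
- Current research hot topics in data mining:
  1. Web site data mining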
  3. Text mining

What Is Data Mining?
- Data mining (knowledge discovery from data)
  - Extraction of interesting (important, implicit, previously unknown, and potentially useful) information and patterns from large databases
- Data mining: a misnomer?
- Alternative names and their "inside stories"
  - Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
- What is not data mining?
  - Simple search and query processing
  - (Deductive) expert systems

Expert Systems
- Expert systems were once the pride of artificial intelligence researchers. To build one, a knowledge engineer must first acquire knowledge from domain experts. This is essentially a process of induction: a highly complex person-to-person interaction that is strongly individual and random. Knowledge acquisition therefore became the acknowledged bottleneck of expert systems research.
- Second, when the knowledge engineer organizes and expresses the knowledge obtained from domain experts, rules of the if-then kind are too restrictive, and conventional mathematical logic is too limited and too difficult a tool for expressing social phenomena and human thinking. Knowledge representation thus became another major problem.
- In addition, even when the knowledge of some domain has been acquired and expressed, the resulting expert system is severely lacking in common sense and encyclopedic knowledge, whereas human expertise rests on a large body of common sense.
- These three problems (knowledge acquisition, knowledge representation, and lack of common sense) greatly limited the application of expert systems. Artificial intelligence researchers began to work on case-based reasoning. Machine learning scientists in particular were no longer content with the ivory tower of small-sample learning models they had built for themselves. They began to face the large, incomplete, noisy, fuzzy, and random data samples of real life and, combining their work with data warehouse technology, turned to data mining.

Database Processing vs. Data Mining Processing
- Database processing
  - Query: well defined; SQL
  - Data: operational data
  - Output: precise; a subset of the database
- Data mining processing
  - Query: poorly defined; no precise query language
  - Data: not operational data
  - Output: fuzzy; not a subset of the database

Query Examples
- Database
  - Find all credit applicants with last name of Smith.
  - Identify customers who have purchased more than $10,000 in the last month.
  - Find all customers who have purchased milk.
- Data Mining
  - Find all credit applicants who are poor credit risks. (classification)
  - Find all items which are frequently purchased with milk. (association rules)
  - Identify customers
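    with similar buying habits. (clustering)

Why Data Mining? Potential Applications
- Database analysis and decision support
  - Market analysis and management
    - Target marketing, customer relationship management, market basket analysis, cross-selling, market segmentation
  - Risk analysis and management
    - Forecasting, customer retention, improved insurance underwriting, quality control, competitive analysis
  - Fraud detection and detection of unusual patterns (outliers)
- Other applications
  - Text mining (newsgroups, email, documents) and Web analysis
  - Intelligent query answering
  - Bioinformatics and biological data analysis

Market Analysis and Management (1)
- Where does the data for analysis come from? Credit card transactions, loyalty cards, discount coupons, customer complaints, and public lifestyle surveys.
- Target marketing
  - Find groups of "model" customers who share the same characteristics: interests, income level, spending habits, etc.
  - Determine customer purchasing patterns over time
    - Conversion from a single to a joint bank account, for example on marriage
  - Predict what factors will attract new customers
- Cross-market analysis
  - Associations between the sales of different products
  - Prediction based on this association information

Market Analysis and Management (2)
- Customer profiling
  - Data mining can tell you what types of customers buy what products (clustering or classification)
- Identifying customer requirements
  - Make sure the best products are offered to different customers
  - Use prediction to find what factors will attract new customers
- Providing summary information
  - Various multidimensional summary reports
  - Statistical summary information (central tendency and variation of the data)

Corporate Analysis and Risk Management
- Finance planning and asset evaluation
  - Cash flow analysis and prediction
  - Contingent claim analysis to evaluate assets
  - Cross-sectional and time-series analysis (financial ratios, trend analysis, etc.)
- Resource planning
  - Summarize and compare resources and spending
- Competition
  - Monitor competitors and market directions
  - Group customers into classes and set pricing procedures by class
  - Set the pricing strategy in a highly competitive market

Fraud Detection and Management (1)
- Widely applied in health care, retail, credit card services, telecommunications (phone card fraud), etc.
- Approach: use historical data to build models of fraudulent behavior, and use data mining to help identify similar instances
- Examples
  - Auto insurance: detect people who deliberately stage accidents to collect insurance payouts
  - Tracing money of unknown origin: detect suspicious money transactions (the US Treasury's Financial Crimes Enforcement Network)
  - Medical insurance: detect professional patients and rings of doctors and references

Fraud Detection and Management (2)
- Detecting inappropriate medical treatment
  - The Australian Health Insurance Commission found that in many cases blanket screening tests were requested
- Detecting telephone fraud
  - Call model: call destination, duration, time of day or week. Analyze patterns that deviate from the expected norm
- Retail
  - Analysts estimate that 38% of retail shrink is due to dishonest employees.

Knowledge Discovery (KDD) Process
- Data mining: the core of the knowledge discovery process
Figure: the KDD pipeline, from Databases through Data Cleaning and Data Integration to the Data Warehouse, then Selection of Task-relevant Data, Data Mining, and Pattern Evaluation.

KDD Process: Several Key Steps
- Learn the application domain
  - Relevant prior knowledge and goals of the application
- Create a target data set: data selection
- Data cleaning and preprocessing (may take 60% of the effort)
- Data transformation
  - Find useful features, dimensionality/variable transformation, invariant representation
- Choose the data mining function (task)
  - Summarization, classification, association, clustering
- Choose the mining algorithm
- Data mining: search for patterns of interest
- Pattern evaluation and knowledge presentation
  - Visualization, transformation, removal of redundant patterns, etc.
- Use of discovered knowledge

Data Mining and Business Intelligence
Figure: a layered diagram in which the potential to support business decisions increases toward the top. Layers, top to bottom:
- Decision Making
- Data Presentation: visualization techniques
- Data Mining: information discovery
- Data Exploration: statistical summary, querying, and reporting
- Data Preprocessing/Integration, Data Warehouses
- Data Sources: paper, files, Web documents, scientific experiments, database systems
Roles, top to bottom: End User, Business Analyst, Data Analyst, DBA.

Architecture: Typical Data Mining System
Figure labels: data cleaning, integration, and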
selection; Database or Data Warehouse Server; Data Mining Engine; Pattern Evaluation; Graphical User Interface; Knowledge Base; Database; Data Warehouse; World Wide Web; Other Info Repositories

Data Mining: Confluence of Multiple Disciplines
Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Pattern Recognition, Algorithm, Visualization, and Other Disciplines.

Why Not Traditional Data Analysis?
- Tremendous amount of data
  - Algorithms must be highly scalable to handle terabytes of data
- High dimensionality of data
  - A microarray may have tens of thousands of dimensions
- High complexity of data
  - Data streams and sensor data
  - Time-series data, temporal data, sequence data
  - Structured data, graphs, social networks, and multi-linked data
  - Heterogeneous databases and legacy databases
  - Spatial, spatiotemporal, multimedia, text, and Web data
  - Software programs, scientific simulations
- New and sophisticated applications

Data Mining: On What Kinds of Data?
- Database-oriented data sets and applications
  - Relational database, data warehouse, transactional database
- Advanced data sets and advanced applications
  - Data streams and sensor data
  - Time-series data, temporal data, sequence data (incl. bio-sequences)
  - Structured data, graphs, social networks, and multi-linked data
  - Object-relational databases
  - Heterogeneous databases and legacy databases
  - Spatial data and spatiotemporal data
  - Multimedia databases
  - Text databases
  - The World Wide Web

Ex: Time Series Analysis
- Example: stock market
- Predict future values
- Determine similar patterns over time
- Classify behavior

Multi-Dimensional View of Data Mining
- Data to be mined
  - Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW
- Knowledge to be mined
  - Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
  - Multiple/integrated functions and mining at multiple
    levels
- Techniques utilized
  - Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
- Applications adapted
  - Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

Data Mining: Classification Schemes
- General functionality
  - Descriptive data mining
  - Predictive data mining
- Different views lead to different classifications
  - Data view: kinds of data to be mined
  - Knowledge view: kinds of knowledge to be discovered
  - Method view: kinds of techniques utilized
  - Application view: kinds of applications adapted

Data Mining Models and Tasks
- A predictive model predicts the values of data. Predictive modeling may be based on the use of other historical data.
- A descriptive model identifies patterns or relationships in data. Unlike a predictive model, a descriptive model offers a way to explore the properties of the data being analyzed rather than to predict new properties.

Are All the "Discovered" Patterns Interesting?
- Data mining may generate thousands of patterns, and not all of them are interesting
  - Suggested approach: human-centered, query-based, focused mining
- Interestingness measures
  - A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
- Objective vs. subjective interestingness measures
  - Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  - Subjective: based on the user's belief in the data, e.g., unexpectedness, novelty, actionability, etc.

Find All and Only Interesting Patterns?
- Find all the interesting patterns: completeness
  - Can a data mining system find all the interesting patterns? Do we need to find all of the interesting patterns?
  - Heuristic vs. exhaustive search
  - Association vs. classification vs. clustering
- Search for only interesting patterns: an optimization problem
  - Can a data mining system find only the interesting patterns?
  - Approaches
    - First generate all the patterns and then filter out the uninteresting ones
    - Generate only the interesting
      patterns: mining query optimization

Why Data Mining Query Language?
- Automated vs. query-driven?
  - Finding all the patterns autonomously in a database? Unrealistic, because the patterns could be too many but uninteresting
- Data mining should be an interactive process
  - The user directs what is to be mined
- Users must be provided with a set of primitives for communicating with the data mining system
- Incorporating these primitives in a data mining query language
  - More flexible user interaction
  - Foundation for the design of graphical user interfaces
  - Standardization of data mining industry and practice

DMQL: A Data Mining Query Language
- Motivation
  - A DMQL can provide the ability to support ad hoc and interactive data mining
  - By providing a standardized language like SQL
    - Hope to achieve an effect similar to the one SQL has had on relational databases
    - Foundation for system development and evolution
    - Facilitate information exchange, technology transfer, commercialization, and wide acceptance
- Design
  - DMQL is designed with the primitives described earlier

Primitives that Define a Data Mining Task
- Task-relevant data
- Type of knowledge to be mined
- Background knowledge
- Pattern interestingness measurements
- Visualization/presentation of discovered patterns

Major Issues in Data Mining
- Mining methodology
  - Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
  - Performance: efficiency, effectiveness, and scalability
  - Pattern evaluation: the interestingness problem
  - Incorporation of background knowledge
  - Handling noise and incomplete data
  - Parallel, distributed, and incremental mining methods
  - Integration of the discovered knowledge with existing knowledge: knowledge fusion
- User interaction
  - Data mining query languages and ad hoc mining
  - Expression and visualization of data mining results
  - Interactive mining of knowledge at multiple levels of abstraction
- Applications and social impacts
  - Domain-specific data mining and invisible data mining
  - Protection of data
    security, integrity, and privacy

DM Issues
- Human interaction
- Overfitting
- Outliers
- Interpretation
- Visualization
- Large datasets
- High dimensionality

KDD Issues (cont'd)
- Multimedia data
- Missing data
- Irrelevant data
- Noisy data
- Changing data
- Integration
- Application

Summary
- Data mining: discovering interesting patterns from large amounts of data
- A natural evolution of database technology, in great demand, with wide applications
- A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
- Mining can be performed in a variety of information repositories
- Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
- Data mining systems and architectures
- Major issues in data mining
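
Worked example: support and confidence
The slides name support and confidence as objective interestingness measures and use "items frequently purchased with milk" as an association-rule query. The sketch below shows one common way to compute both measures for a rule such as {milk} -> {bread}. The basket data and the function names (count_containing, support, confidence) are illustrative assumptions, not part of the original slides.

```python
def count_containing(transactions, itemset):
    """Count the transactions that contain every item in itemset."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted <= set(t))


def support(transactions, itemset):
    """Fraction of all transactions that contain itemset."""
    return count_containing(transactions, itemset) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing antecedent, the fraction that also contain consequent."""
    n_antecedent = count_containing(transactions, antecedent)
    if n_antecedent == 0:
        return 0.0
    n_both = count_containing(transactions, set(antecedent) | set(consequent))
    return n_both / n_antecedent


# Hypothetical market baskets, invented for illustration only.
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk", "eggs"},
    {"bread"},
    {"milk", "bread", "butter"},
]

rule_support = support(baskets, {"milk", "bread"})          # 3 of 5 baskets
rule_confidence = confidence(baskets, {"milk"}, {"bread"})  # 3 of the 4 milk baskets
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

A mining system would typically keep only rules whose support and confidence exceed user-chosen thresholds, which is one way to realize the "filter out the uninteresting ones" approach listed under Find All and Only Interesting Patterns.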

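Worked example: flagging deviations from an expected norm
The fraud detection slides describe analyzing call patterns that deviate from an expected norm. As a minimal sketch of that idea, the code below flags call durations whose z-score exceeds a threshold. The z-score rule, the threshold of 2.0, the sample durations, and the function name flag_outliers are all assumptions chosen for illustration; the slides do not prescribe a specific method.

```python
from statistics import mean, pstdev


def flag_outliers(values, threshold=2.0):
    """Return the values whose absolute z-score exceeds threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical call durations in minutes; one call is unusually long.
durations = [3, 4, 5, 4, 3, 5, 4, 60]
suspicious = flag_outliers(durations)
print(suspicious)
```

Because a single extreme value inflates both the mean and the standard deviation, a production system would more likely use a robust statistic or a learned model of normal behavior; the z-score is used here only because it is short.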