资源描述
,Click to edit Master title style,Click to edit Master text styles,Second level,Third level,Fourth level,Fifth level,*,Data Warehousing and OLAP Technology,#,Click to edit Master title style,Click to edit Master text styles,Second level,Third level,Fourth level,Fifth level,*,Data Warehousing and OLAP Technology,#,02 十月 2024,Data Warehousing and OLAP Technology,1,数据仓库和,OLAP,技术,什么是数据仓库,(,What is a data warehouse)?,多维数据模型,(,A multi-dimensional data model),数据仓库体系结构,(,Data warehouse architecture),数据仓库实现,(,Data warehouse implementation),Further development of data cube technology,From data warehousing to data mining,01,三,三月2020,Data Warehousing andOLAP Technology,2,数据库,的,的定义,传统的,数,数据库,技,技术是,以,以单一,的,的数据,资,资源为,中,中心,,同,同时进,行,行从事,务,务处理,,,,批处,理,理到决,策,策分析,的,的各类,处,处理;,数据库,主,主要是,为,为自动,化,化,精,简,简工作,任,任务和,高,高速数,据,据采集,服,服务的,。,。它的,运,运行是,事,事务驱,动,动,面,向,向应用,的,的,数,据,据库的,根,根本任,务,务是完,成,成数据,操,操作,,即,即及时,安,安全地,将,将当前,事,事务所,产,产生的,记,记录保,存,存下来,。,。,01,三,三月2020,Data Warehousing andOLAP Technology,3,两种不,同,同的数,据,据处理,需,需求,计算机,系,系统中,存,存在着,两,两类不,同,同的数,据,据处理,需,需求,,即,即:,操作型,处,处理(事务,处,处理),:,:主要,是,是对一,个,个或一,组,组记录,的,的查询,和,和修改,,,,这时,候,候人们,关,关心的,是,是响应,时,时间、数据的,安,安全性,和,和完整,性,性;,分析型,处,处理(信息,型,型处理,),):用,于,于管理,人,人员的,决,决策分,析,析,如DDS,(,(decisionsupportsystem,),)、多维分,析,析等。,01,三,三月2020,Data Warehousing andOLAP Technology,4,为什么,要,要建立,数,数据仓,库,库?,数据DATA,知识KNOWLEDGE,决定DECISIONS,Patterns,Trends,Facts,Relations,Models,Associations,Sequences,TargetMarkets,Fundsallocation,Trading options,Wheretoadvertise,Catalog mailinglist,Salesgeography,财经的Financial,经济的Economic,政府Government,销售分,数,数Point-of-Sale,人口统,计,计学Demographic,生活方,式,式Lifestyle,痛苦:太多数,据,据,无,法,法作出,正,正确判,断,断!,01,三,三月2020,Data Warehousing andOLAP Technology,5,What is DataWarehouse?,数据仓,库,库是在,企,企业管,理,理和决,策,策中面向主,题,题的,集成的,与时间,相,相关的和不可修,改,改的数据,集,集合,“Adata warehouseisasubject-oriented,integrated,time-variant,andnonvolatilecollection of datainsupportofmanagementsdecision,-,-making process,.,.”W.H.Inmon,Data warehousing:,Theprocessofconstructingand using datawarehouses,01,三,三月2020,Data Warehousing andOLAP Technology,6,Data WarehouseSubject-Oriented,Organizedaround major subjects,such ascustomer,product,sales.,Focusingonthemodelingand analysis of datafor decision makers,not on daily operationsortransactionprocessing.,Providea simpleandconciseview aroundparticular subjectissues byexcludingdatathat arenotuseful in thedecisionsupport process.,01,三,三月2020,Data Warehousing andOLAP Technology,7,面向应,用,用举例,采购子,系,系统:,订单(,订,订单号,,,,供应,商,商号,,总,总金额,,,,日期,),),订单细,则,则(订,单,单号,,商,商品号,,,,类别,,,,单价,,,,数量,),),供应商,(,(供应,商,商号,,供,供应商,名,名,地,址,址,电,话,话),销售子,系,系统:,顾客(,顾,顾客号,,,,姓名,,,,性别,,,,年龄,,,,地址,,,,电话,),),销售(,员工号,,,,顾客,号,号,商,品,品号,,数,数量,,单,单价日,期,期),库存管,理,理子系,统,统:,领料单,(,(,领料单,号,号,,领料,人,人,商,品,品号,,数,数量,,日,日期),进料单,(,(,进料单,号,号,订,单,单号,,进,进料人,,,,收料,人,人,日,期,期,),库存(,商,商品号,,,,库房,号,号,库,存,存量,,日,日期),库房(,库房号,,仓库,保,保管员,,,,地点,,,,库存,商,商品描,述,述),人事管,理,理子系,统,统:,员工(,员,员工号,,,,姓名,,,,性别,,,,年龄,,,,部门,号,号),部门(,部,部门号,,,,部门,名,名称,,部,部门主,管,管,电,话,话),面向主,题,题举例,:,:,商品:,商品固,有,有信息,:,:商品,号,号,商,品,品名,,类,类别,,颜,颜色等,商品采,购,购信息,:,:商品,号,号,供,应,应商号,,,,供应,价,价,供,应,应日期,,,,供应,量,量等,商品销,售,售信息,:,:商品,号,号,顾,客,客号,,售,售价,,销,销售日,期,期,销,售,售量等,商品库,存,存信息,:,:商品,号,号,库,房,房号,,日,日期,,库,库存量,等,等,供应商,:,:,供应商,固,固有信,息,息:供,应,应商号,,,,供应,商,商名,,地,地址,,电,电话等,供应商,品,品信息,:,:供应,商,商号,,商,商品号,,,,供应,价,价,供,应,应日期,,,,供应,量,量等,顾客:,顾客固,有,有信息,:,:顾客,号,号,顾,客,客名,,性,性别,,年,年龄,,住,住址,,电,电话等,顾客购,物,物信息,:,:顾客,号,号,商,品,品号,,售,售价,,购,购买日,期,期,购,买,买量等,01,三,三月2020,Data Warehousing andOLAP Technology,8,Data WarehouseIntegrated,Constructedbyintegratingmultiple,heterogeneousdata sources,relational databases,flatfiles,on,-,-linetransactionrecords,Data cleaning anddata integration techniquesare applied,.,.,Ensureconsistencyinnaming conventions,encodingstructures,attributemeasures, etc.amongdifferent datasources,E.g,.,.,Hotelprice:currency, tax,breakfast covered, etc.,When dataismovedtothe warehouse,itisconverted.,01,三,三月2020,Data Warehousing andOLAP Technology,9,Data WarehouseTime Variant,Thetimehorizon forthedatawarehouseissignificantlylongerthanthat of operational systems,.,.,Operationaldatabase:currentvaluedata.,Data warehousedata:provide information fromahistorical perspective (e.g,.,.,past 5,-,-10years,),),Everykeystructure in thedata warehouse,Containsanelement of time, explicitlyorimplicitly,Butthe keyofoperationaldata mayormaynot contain,“,“timeelement”,.,.,01,三,三月2020,Data Warehousing andOLAP Technology,10,Data WarehouseNon,-,-Volatile,Aphysically separate storeofdata transformed fromthe operational environment.,Operationalupdateofdatadoes notoccurinthedatawarehouseenvironment,.,.,Does notrequire transaction processing, recovery,andconcurrencycontrolmechanisms,Requiresonly twooperations in dataaccessing:,initial loadingofdataandaccessofdata.,01,三,三月2020,Data Warehousing andOLAP Technology,11,Data Warehousevs.HeterogeneousDBMS,Traditionalheterogeneous DB integration:,Buildwrappers/mediatorsontopofheterogeneousdatabases,Querydrivenapproach,When aqueryisposedtoaclientsite,ameta-dictionaryisusedtotranslatethe query intoqueriesappropriateforindividualheterogeneous sites involved,andthe resultsare integratedintoa globalanswerset,Complex information filtering,competeforresources,Data warehouse:update,-,-driven, highperformance,Informationfrom heterogeneoussourcesisintegrated in advanceand storedinwarehouses fordirectqueryand analysis,01,三,三月2020,Data Warehousing andOLAP Technology,12,Data Warehousevs.OperationalDBMS,OLTP (on-line transaction processing,),),Majortask of traditional relationalDBMS,Day,-,-to,-,-day operations,:,: purchasing, inventory,banking,manufacturing,payroll,registration,accounting,etc,.,.,OLAP (on-line analyticalprocessing),Majortask of datawarehouse system,Data analysis anddecisionmaking,Distinctfeatures,(,(OLTPvs.OLAP),:,:,User andsystemorientation,:,: customer vs.market,Data contents:current,detailedvs.historical,consolidated,Databasedesign,:,: ER +applicationvs,.,. star,+,+subject,View:current,localvs.evolutionary,integrated,Accesspatterns,:,: updatevs.read-only butcomplex queries,01,三,三月2020,Data Warehousing andOLAP Technology,13,OLTP vs.OLAP,01,三,三月2020,Data Warehousing andOLAP Technology,14,WhySeparateDataWarehouse,?,?,High performance forboth systems,DBMStunedforOLTP:access methods, indexing,concurrencycontrol,recovery,Warehouse,tunedfor OLAP,:,: complexOLAPqueries,multidimensional view, consolidation,.,.,Differentfunctions anddifferentdata:,missing data: Decision supportrequireshistoricaldata which operational DBsdonottypically maintain,data consolidation:DSrequiresconsolidation (aggregation,summarization)ofdatafrom heterogeneoussources,data quality: differentsources typicallyuseinconsistent datarepresentations,codesandformatswhichhave to be reconciled,01,三,三月2020,Data Warehousing andOLAP Technology,15,Data Warehousing andOLAP Technology,What is adatawarehouse,?,?,A multi-dimensional datamodel,Data warehousearchitecture,Data warehouseimplementation,Further development of datacubetechnology,From datawarehousingtodatamining,01,三,三月2020,Data Warehousing andOLAP Technology,16,From TablesandSpreadsheets to DataCubes,A datawarehouse is based on amultidimensional datamodelwhichviewsdata in theform of adatacube,A datacube,suchassales, allowsdata to be modeledand viewedinmultipledimensions,Dimensiontables,such asitem (item_name, brand,type),ortime(day,week,month, quarter, year,),),Fact table contains measures (such asdollars_sold) andkeys to eachofthe relateddimension tables,Indata warehousing literature, an n,-,-Dbase cubeiscalled abase cuboid. Thetopmost0-Dcuboid,whichholdsthehighest-levelofsummarization,iscalledtheapex cuboid.Thelatticeofcuboids forms adata cube,.,.,01,三,三月2020,Data Warehousing andOLAP Technology,17,Cube:A LatticeofCuboids,all,time,item,location,supplier,time,item,time,location,time,supplier,item,location,item,supplier,location,supplier,time,item,location,time,item,supplier,time,location,supplier,item,location,supplier,time,item,location,supplier,0-D(apex,),) cuboid,1-D cuboids,2-D cuboids,3-D cuboids,4-D(base,),) cuboid,01,三,三月2020,Data Warehousing andOLAP Technology,18,Conceptual Modeling of DataWarehouses,Modelingdata warehouses,:,: dimensions,&,&measures,Star schema:A facttableinthe middleconnectedtoasetofdimension tables,Snowflakeschema:A refinementofstarschemawheresomedimensionalhierarchyisnormalizedinto aset of smallerdimension tables, formingashapesimilar to snowflake,Fact constellations:Multiplefact tablessharedimensiontables, viewedasa collectionofstars, thereforecalledgalaxyschemaorfact constellation,01,三,三月2020,Data Warehousing andOLAP Technology,19,Example of StarSchema,time_key,day,day_of_the_week,month,quarter,year,time,location_key,street,city,province_or_street,country,location,SalesFact Table,time_key,item_key,branch,_,_key,location_key,units_sold,dollars_sold,avg,_,_sales,Measures,item_key,item_name,brand,type,supplier_type,item,branch_key,branch_name,branch_type,branch,01,三,三月2020,Data Warehousing andOLAP Technology,20,Example of SnowflakeSchema,time_key,day,day_of_the_week,month,quarter,year,time,location_key,street,city_key,location,SalesFact Table,time_key,item_key,branch,_,_key,location_key,units_sold,dollars_sold,avg,_,_sales,Measures,item_key,item_name,brand,type,supplier_key,item,branch_key,branch_name,branch_type,branch,supplier_key,supplier_type,supplier,city_key,city,province_or_street,country,city,01,三,三月2020,Data Warehousing andOLAP Technology,21,Example of FactConstellation,time_key,day,day_of_the_week,month,quarter,year,time,location_key,street,city,province_or_street,country,location,SalesFact Table,time_key,item_key,branch,_,_key,location_key,units_sold,dollars_sold,avg,_,_sales,Measures,item_key,item_name,brand,type,supplier_type,item,branch_key,branch_name,branch_type,branch,ShippingFact Table,time_key,item_key,shipper_key,from_location,to_location,dollars_cost,units_shipped,shipper_key,shipper_name,location_key,shipper_type,shipper,01,三,三月2020,Data Warehousing andOLAP Technology,22,A DataMining Query Language,DMQL:LanguagePrimitives,Cube Definition,(,(FactTable,),),definecube,:,:,DimensionDefinition,(,( DimensionTable,),),definedimension,as(,),),Special Case,(,(SharedDimensionTables),Firsttime as “cube definition,”,”,definedimension,asincube,01,三,三月2020,Data Warehousing andOLAP Technology,23,Defininga StarSchema in DMQL,definecubesales_star time,item,branch, location:,dollars_sold,=,=sum,(,(sales,_,_in,_,_dollars), avg_sales,=,= avg(sales_in_dollars),units,_,_sold,=,= count(*,),),definedimensiontimeas(time_key, day,day_of_week, month,quarter,year),definedimensionitemas(item_key, item,_,_name,brand, type, supplier_type,),),definedimensionbranchas(branch_key,branch_name, branch_type),definedimensionlocationas(location,_,_key,street, city, province_or_state,country),01,三,三月2020,Data Warehousing andOLAP Technology,24,Defininga SnowflakeSchemainDMQL,definecubesales_snowflake,time, item, branch,location,:,dollars_sold,=,=sum,(,(sales,_,_in,_,_dollars), avg_sales,=,= avg(sales_in_dollars),units,_,_sold,=,= count(*,),),definedimensiontimeas(time_key, day,day_of_week, month,quarter,year),definedimensionitemas(item_key, item,_,_name,brand, type,supplier(supplier_key, supplier_type,),),definedimensionbranchas(branch_key,branch_name, branch_type),definedimensionlocationas(location,_,_key,street,city(city,_,_key,province_or_state,country),),),01,三,三月2020,Data Warehousing andOLAP Technology,25,Defininga FactConstellationinDMQL,definecubesales,time,item,branch,location,:,:,dollars_sold,=,=sum,(,(sales,_,_in,_,_dollars), avg_sales,=,= avg(sales_in_dollars),units,_,_sold,=,= count(*,),),definedimensiontimeas(time_key, day,day_of_week, month,quarter,year),definedimensionitemas(item_key, item,_,_name,brand, type, supplier_type,),),definedimensionbranchas(branch_key,branch_name, branch_type),definedimensionlocationas(location,_,_key,street, city, province_or_state,country),definecubeshipping,time,item,shipper,from_location,to,_,_location,:,dollar,_,_cost,=,= sum(cost_in_dollars,),),unit_shipped,=,=count(,*,*),definedimensiontimeastimeincubesales,definedimensionitemasitemincubesales,definedimensionshipperas(shipper_key, shipper,_,_name,locationaslocationincubesales,shipper_type),definedimensionfrom_locationaslocationincubesales,definedimensionto_locationaslocationincubesales,01,三,三月2020,Data Warehousing andOLAP Technology,26,Measures:ThreeCategories,distributive: if theresultderivedbyapplyingthefunctiontonaggregatevalues is thesame as thatderivedbyapplyingthefunctiononall thedata withoutpartitioning.,E.g,.,.,count(,),),sum,(,(),min(), max(,),).,algebraic:ifitcanbecomputedbyanalgebraic function withMarguments,(,(whereMisa boundedinteger), eachofwhichisobtainedbyapplyingadistributiveaggregate function.,E.g,.,.,avg,(,(),min_N,(,(),standard,_,_deviation(),.,.,holistic:ifthereisnoconstantboundonthestoragesize neededtodescribea subaggregate.,E.g,.,.,median,(,(),mode(,),),rank(),.,.,01,三,三月2020,Data Warehousing andOLAP Technology,27,A ConceptHierarchy:Dimension,(,(location),all,Europe,North_America,Mexico,Canada,Spain,Germany,Vancouver,M.Wind,L.Chan,.,.,.,.,.,.,all,region,office,country,Toronto,Frankfurt,city,01,三,三月2020,Data Warehousing andOLAP Technology,28,View of Warehousesand Hierarchies,Specification of hierarchies,Schemahierarchy,day,month,quarter;week, year,Set,_,_groupinghierarchy,1.,.,.10, inexpensive,01,三,三月2020,Data Warehousing andOLAP Technology,29,Multidimensional Data,Salesvolumeasafunctionofproduct,month,and region,Product,Region,Month,Dimensions:Product,Location,Time,Hierarchicalsummarizationpaths,IndustryRegionYear,CategoryCountryQuarter,ProductCityMonthWeek,OfficeDay,01,三,三月2020,Data Warehousing andOLAP Technology,30,A SampleData Cube,Totalannualsales,ofTVinU.S.A,.,.,Date,Product,Country,All, All, All,sum,sum,TV,VCR,PC,1,Qtr,2,Qtr,3,Qtr,4,Qtr,U.S.A,Canada,Mexico,sum,01,三,三月2020,Data Warehousing andOLAP Technology,31,Cuboids Correspondingtothe Cube,all,product,date,country,product,date,product,country,date,country,product,date,country,0-D(apex,),) cuboid,1-D cuboids,2-D cuboids,3-D(base,),) cuboid,01,三,三月2020,Data Warehousing andOLAP Technology,32,Browsinga DataCube,Visualization,OLAP capabilities,Interactivemanipulation,01,三,三月2020,Data Warehousing andOLAP Technology,33,Typical OLAPOperations,Roll up (drill-up),:,:summarizedata,byclimbinguphierarchyorbydimension reduction,Drilldown (roll down,),):reverse of roll,-,-up,from higherlevelsummary to lower level summaryordetaileddata,orintroducingnew dimensions,Sliceanddice:,project andselect,Pivot,(,(rotate):,reorientthecube,visualization,3Dtoseries of 2D planes.,Otheroperations,drillacross,:,:involving,(,(across)morethan onefact table,drillthrough:through thebottomlevelofthe cubetoits back,-,-end relationaltables (usingSQL,),),01,三,三月2020,Data Warehousing andOLAP Technology,34,A Star,-,-Net Query Model,ShippingMethod,AIR,-,-EXPRESS,TRUCK,ORDER,CustomerOrders,CONTRACTS,Customer,Product,PRODUCT GROUP,PRODUCT LINE,PRODUCT ITEM,SALESPERSON,DISTRICT,DIVISION,Organization,Promotion,CITY,COUNTRY,REGION,Location,DAILY,QTRLY,ANNUALY,Time,Each circleiscalledafootprint,01,三,三月2020,Data Warehousing andOLAP Technology,35,Data Warehousing andOLAP Technologyfor DataMining,What is adatawarehouse,?,?,A multi-dimensional datamodel,Data warehousearchitecture,Data warehouseimplementation,Further development of datacubetechnology,From datawarehousingtodatamining,01,三,三月2020,Data Warehousing andOLAP Technology,36,DesignofaData Warehouse:ABusinessAnalysisFramework,Four views regardingthedesign of adatawarehouse,Top,-,-downview,allowsselection of therelevantinformationnecessaryfor thedata warehouse,Data sourceview,exposes theinformationbeingcaptured,stored,andmanagedbyoperationalsystems,Data warehouseview,consistsoffact tablesanddimension tables,Businessqueryview,sees theperspectivesofdatainthewarehouse fromthe viewofend-user,01,三,三月2020,Data Warehousing andOLAP Technology,37,Data WarehouseDesignProcess,Top,-,-down,bottom-up approachesoracombinationofboth,Top,-,-down: Startswith overalldesign andplanning,(,(mature),Bottom,-,-up: Startswith experiments andprototypes (rapid),From software engineering point of view,Waterfall:structured andsystematic analysis at eachstepbeforeproceedingtothenext,Spiral:rapidgeneration of increasinglyfunctional systems, short turnaround time, quick turnaround,Typical datawarehouse designprocess,Chooseabusinessprocesstomodel,e.g.,orders,invoices,etc.,Choosethegrain(atomiclevelofdata)ofthebusinessprocess,Choosethedimensionsthat willapplytoeachfact table record,Choosethemeasurethat willpopulateeachfact table record,01,三,三月2020,Data Warehousing andOLAP Technology,38,Multi-TieredArchitecture,Data,Warehouse,Extract,Transform,Load,Refresh,OLAP Engine,Analysis,Query,Reports,Data mining,Monitor,&,Integrator,Metadata,Data Sources,Front-EndTools,Serve,Data Marts,Operational,DBs,other,sources,Data Storage,OLAP Server,01,三,三月2020,Data Warehousing andOLAP Technology,39,Source,Databases,Data Extraction,Transformation, load,Warehouse,Admin.,Tools,Extract,Transform,and Load,Data,Modeling,Tool,Central,Metadata,Architected,Data Marts,Data Access,and Analysis,End-User,DW Tools,Central Data,Warehouse,Central,Data,Warehouse,Mid-,Tier,Mid-,Tier,Data,Mart,Data,Mart,Local,Metadata,Local,Metadata,Local,Metadata,Metadata,Exchange,MDB,Data,Cleansing,Tool,Relational,Appl. Package,Legacy,External,RDBMS,RDBMS,体系结,构,构,Pieter,1998,数据仓,库,库的焦,点,点问题,-,-,数据的,获,获得、,存,存储和,使,使用,Relational,Package,Legacy,External,source,Data,Clean,Tool,Data,Staging,Enterprise,Data,Warehouse,Datamart,Datamart,RDBMS,ROLAP,RDBMS,End-User,Tool,End-User,Tool,MDB,End-User,Tool,End-User,Tool,数据仓,库,库和集,市,市的加,载,载能力,至,至关重,要,要,数据仓,库,库和集,市,市的查,询,询输出,能,能力至,关,关重要,ETL工具,去掉操,作,作型数,据,据库中,的,的不需,要,要的数,据,据,统一转,换,换数据,的,的名称,和,和定义,计算汇,总,总数据,和,和派生,数,数据,估计遗,失,失数据,的,的缺省,值,值,调节源,数,数据的,定,定义变,化,化,01,三,三月2020,Data Warehousing andOLAP Technology,42,ThreeData WarehouseModels,Enterprise warehouse,collectsallofthe information about subjects spanning theentireorganization,Data Mart,a subsetofcorporate,-,-widedata thatisofvaluetoaspecificgroupsofusers,.,.Itsscopeisconfinedtospecific, selected groups,suchasmarketingdatamart,Independentvs.dependent (directlyfrom warehouse)datamart,Virtual warehouse,A setofviewsover operational databases,Only someofthe possible summaryviewsmay be materialized,01,三,三月2020,Data Warehousing andOLAP Technology,43,Data WarehouseDevelopment:ARecommendedApproach,Defineahigh-level corporatedata model,Data Mart,Data Mart,DistributedData Marts,Multi-Tier DataWarehouse,Enterprise DataWarehouse,Modelrefinement,Modelrefinement,01,三,三月2020,Data Warehousing andOLAP Technology,44,OLAP ServerArchitectures,Relational OLAP,(,(ROLAP),Userelationalorextended-relational DBMStostoreand managewarehousedataandOLAPmiddlewaretosupport missingpieces,Include optimizationofDBMS backend, implementation of aggregation navigationlogic, andadditional tools andservices,greater scalability,Multidimensional OLAP,(,(MOLAP),Array-basedmultidimensional storageengine (sparsematrix techniques,),),fast indexing to pre-computedsummarized data,HybridOLAP,(,(HOLAP,),),User flexibility,e.g,.,.,low level:relational,high-level:array,SpecializedSQLservers,specializedsupport forSQLqueriesover star,/,/snowflake schemas,01,三,三月2020,Data Warehousing andOLAP Technology,45,Data Warehousing andOLAP Technologyfor DataM
展开阅读全文