资源描述
学习学习学习学习 创造创造创造创造 超越超越超越超越 共享共享共享共享Greenplum数据库开发基础12014年10月目录Greenplum概述和数据分布概述和数据分布数据加载和外部表数据加载和外部表客户端工具客户端工具Greenplum数据库基础数据库基础Greenplum与与Oracle优化策略优化策略其他要点及示例其他要点及示例MPP架构MPP(MassivelyParallelProcessing)Shared-NothingArchitectureMPP无共享架构的优势InterconnectLoading数据分布在所有的并行节点上每个节点只处理其中一部分数据最优化的I/O处理所有的节点同时进行并行处理节点之间完全无共享,无I/O冲突自动化的并行处理机制内部处理自动化并行,无需人工分区或优化加载与访问方式与一般数据库相同最易于扩展的架构BI和数据分析的最佳选择增加节点实现线性扩展增加节点可线性增加存储、查询和加载性能Greenplum基本架构MPP(MassivelyParallelProcessing)Shared-NothingArchitectureNetworkInterconnect.MasterSeversQuery planning&dispatch SegmentSeversQuery processing&data storageSQLMapReduceExternalSourcesLoading,streaming,etc.Greenplum的并行处理特性v并行处理由系统自动完成,无需人工干预并行处理由系统自动完成,无需人工干预所有数据均匀分布到所有节点,每个节点都计算自己的部分数据,所以并行处理无需人工干预,系统自动完成。v无需复杂的调优需求,只需要加载数据和查询无需复杂的调优需求,只需要加载数据和查询DBA工作量极少,无需复杂的调优工作和维护工作。v扩展性扩展性可线性扩展到10,000个节点每增加一个节点,查询、加载性能都成线性增长v客户端访问及第三方工具支持客户端访问及第三方工具支持完全支持数据库技术接口标准,例如:SQL,ODBC,JDBC,OLEDB等。同时,广泛地支持各个BI和ETL软件工具。Greenplum基本体系架构客户端接口和程序lpsqllpgAdminIIIlODBC/DatadirectlJDBClPerlDBIlPythonllibpqlOLEDBMasterHostl访问系统的入口l建立与客户端的连接和管理lSQL的解析并形成执行计划l执行计划向Segment的分发l收集Segment的执行结果l协调工作处理过程lMaster不存储业务数据,只存储系统目录表和元数据(数据字典)Segmentv每段(Segment)存放一部分用户数据v一个系统可以有多段v用户不能直接存取访问v所有对段的访问都经过Masterv用户查询SQL的执行InterconnectvGreenplum数据库之间的连接层v进程间协调和管理v基于千兆以太网架构v属于系统内部私网配置v支持两种协议:TCPorUDPGreenplum高可用性体系架构Master/Standby 镜像保护lStandby实时与Master节点的Catalog和事务日志保持同步lStandby节点用于当PrimaryMaster节点损坏时提供Master服务数据冗余-Segment 镜像保护l每个Segment的数据冗余存放在另一个Segment上,数据实时同步l当PrimarySegment失败时,MirrorSegment将自动提供服务lPrimarySegment恢复正常后,使用gprecoversegF同步数据。表分布的策略-并行计算的基础lHash分布CREATETABLEDISTRIBUTEDBY(column,)同样数值的内容被分配到同一个Segment上l循环分布CREATETABLEDISTRIBUTEDRANDOMLY具有同样数值的行内容并不一定在同一个Segment上分布键(Distribution Keys)v用于将数据平均分布到Segments之中的一个或者多个字段v用表的主键作分布键可以使数据分布均匀v建表时使用 DISTRIBUTEDBY 子句定义表的分布键CREATETABLEsales(dtdate,prcfloat,qtyint,cust_idint,prod_idint,vend_idint)DISTRIBUTEDBY(dt,cust_id,prod_id);v如果表没有主键,或者没有合适的字段作为分布键,可以使用随机分布键(DISTRIBUTEDRANDOMLY)v如果没有明确定义分布键,系统会把第一个字段作为表的分布键分布存储数据均匀分布-并行处理的关键43Oct 20 20051264Oct 20 20051145Oct 20 20054246Oct 20 20056477Oct 20 20053248Oct 20 200512OrderOrder#OrderDateCustomerID50Oct 20 20053456Oct 20 20052163Oct 20 20051544Oct 20 20051053Oct 20 20058255Oct 20 200555策略策略:数据尽可能的均匀分布到每个节点查询命令的执行SQL查询处理机制并行查询计划SELECT customer,amount FROM sales JOIN customer USING(cust_id)WHERE date=04302008;压缩存储和行列存储v压缩存储压缩存储支持ZLIB和QUICKLZ方式的压缩,压缩比可到10:1压缩表只能是AppendOnly方式压缩数据不一定会带来性能的下降,压缩表将消耗CPU资源,而减少I/O资源占用语法CREATETABLEfoo(aint,btext)WITH(appendonly=true,compresstype=zlib,compresslevel=5);v行或列存储模式行或列存储模式列模式目前只支持AppendOnly 如果常用的查询只取表中少量字段,则列模式效率更高,如查询需要取表中的大量字段,行模式效率更高语法:CREATETABLEsales2(LIKEsales)WITH(appendonly=true,orientation=column);锁停止活动的SQL查询v查看要停止的SQL查询的进程IDv 执行 select*from pg_stat_activity 查看到当前数据库连接的IP 地址,用户名,提交的查询等。(另外也可以在master 主机上查看进程,对每个客户端连接,master 都会创建一个进程。ps-ef|grep-i postgres|grep-i con)v#查询表是否被锁vselect procpid,t.*from pg_stat_activity t where usename=lds_betl and datname=ldsdb and waiting=t;v停止SQL:v 执行 Select pg_cancel_backend(procpid)v 或者 Select pg_terminate_backend(procpid)v 或者 在MASTER OS:$kill procpidv注:极端情况下,kill 不能停止SQL时,采用kill-11停止进程v 千万不要使用kill-9,该操作导致数据库崩溃;v 生产系统请不要采用kill操作。表分区的概念v将一张大表逻辑性地分成多个部分,如按照分区条件进行查询,将减少数据的扫描范围,提高系统性能。v提高对于特定类型数据的查询速度和性能v也可以更方便数据库的维护和更新v两种类型:Range分区(日期范围或数字范围)/如日期、价格等List 分区,例如地区、产品等vGreenplum中的表分区在使用中具有总表的继承性,并通过Check参数指定相应的子表v分区的子表依然根据分布策略分布在各segment上v分区是一种非常有用的优化措施,例如一年的交易按交易日期分区后,查询一天的交易性能将提高365倍!Data Distribution&PartitioningSegment1ASegment1BSegment1CSegment1DSegment2ASegment2BSegment2CSegment2DSegment3ASegment3BSegment3CSegment3DJan2005Feb2005Mar2005Apr2005May2005Jun2005Jul2005Aug2005Sep2005Oct2005Nov2005Dec2005每个分区表的数据平均分布到各个节点表分区可减少数据的搜索范围,提高查询性能Full Table Scan VS.Partition PruningSegment1ASegment1BSegment1CSegment1DSegment2ASegment2BSegment2CSegment2DSegment3ASegment3BSegment3CSegment3DSegment1ASegment1BSegment1CSegment1DSegment2ASegment2BSegment2CSegment2DSegment3ASegment3BSegment3CSegment3DSegment1ASegment1BSegment1CSegment1DSegment2ASegment2BSegment2CSegment2DSegment3ASegment3BSegment3CSegment3DSELECT COUNT(*)FROM orders WHERE order_date=Oct 20 2005 AND order_date(non-superuser)vTo show a list of all databases:lvTo connect to another database:cdb_namevUse PGDATABASE environment variable to set the default databaseSchemavTo Create:CREATESCHEMAvTo Drop:DROPSCHEMAvTo Edit:ALTERSCHEMAChange nameAssign new ownervPSQL TipsvTo see the current schema:SELECTcurrent_schema();vTo see a list of all schemas in the database:dnvTo see the schema search path:SHOWsearch_path;vTo set the search path for a database:ALTERDATABASESETsearch_pathTOmyschema,public,pg_catalog;表vTo Create:CREATETABLEAdditional DISTRIBUTEDBY or DISTRIBUTEDRANDOMLY clauseSome syntax not supportedvTo Edit:ALTERTABLECannot alter distribution key columnsvTo Drop:DROPTABLEPSQL TipsvTo list tables in the database:dtvTo see structure of a table:d+table_namevTo list system catalog tables:dtSvTo list external tables only:dxvTo see the distribution key columns of a table:dtable_name表和字段约束vCHECK table or column constraintsvNOTNULL column constraintsvUNIQUE column constraintsOne allowed per tableUnique columns must also be in distribution keyNot allowed if table also has a primary keyvPRIMARYKEY table constraintsUsed as distribution key by defaultvFOREIGNKEY constraints definitions are supported but not enforcedvForeign key relationships are utilized by the query planner to improve query plans.视图vView SQL Commands:CREATEVIEWDROPVIEW vPSQL Tips:To list all views while in psql:dv To see a view definition:d+view_name EXAMPLE:CREATEVIEWtoptenASSELECTname,rank,gender,yearFROMnames,rankWHERErank11ANDnames.id=rank.id;SELECT*FROMtoptenORDERBYyear,rank;索引v在Greenplum数据库中应谨慎创建索引v索引不一定都能优化查询v应测试索引是否真正提升了性能v删除没用的索引vPRIMARY KEY索引会自动创建v唯一性索引只能在分布键字段创建索引(续)v索引类型:B-tree Bitmapv索引相关SQL命令:CREATEINDEXALTERINDEXDROPINDEXREINDEXvPSQL Tips:在PSQL显示所有索引:di显示索引定义:d+index_name v大批量ETL加工最好不建索引,对性能提升作用不大vB-TREE适用每次通过单一字段筛选查询少量数据vB-MAP适用每次通过多个字段筛选查询大量数据其他数据库对象vFunctions and operatorsvSequencesvTriggersvTablespaces数据类型v常用数据类型CHAR,VARCHAR,TEXTSmallint,integer,bigintNumeric,real,doubleprecisionTimestamp,date,timeBooleanArray 类型。如 integerv其它数据类型请参考常用系统表及视图v所有系统表在pg_catalogschemav标准PostgreSQL系统表(pg_*)v常用系统表:pg_stat_activitypg_tablespg_classpg_attribute pg_namespacev在psql显示所有系统表:dtSvPsql显示所有系统视图:dvSv其它catalog参考函数v日期函数Extract(day|month|year。Fromdate);Selectdate+1day:interval,date+1month:intervalSELECTdate_part(day,TIMESTAMP2001-02-1620:38:40);Result:16SELECTdate_trunc(hour,TIMESTAMP2001-02-1620:38:40);Result:2001-02-1620:00:00pg_sleep(seconds);系统日期变量Current_dateCurrent_timeCurrent_timestampNow()Timeofday()在事务中发生变化,以上函数在事务中不变函数v字符串处理函数Substr/length/lower/upper/trim/replace/positionrPad/lpadTo_char,|(字符串连接)substringlike,simillar to (模式匹配)v其它杂类Case。When/Coalescenullifgenerate_seriesIn/notin/exists/any/allBuilt-In Functions(SELECT)FunctionDescriptionExampleCURRENT_DATEReturns the current system date2006-11-06CURRENT_TIMEReturns the current system time16:50:54CURRENT_TIMESTAMPReturns the current system date and time2008-01-06 16:51:44.430000+00:00LOCALTIMEReturns the current system time with time zone adjustment19:50:54LOCALTIMESTAMPReturns the current system date and time with time zone adjustment2008-01-06 19:51:44.430000+00:00CURRENT_ROLEROLEReturns the current database userjdoeMathematical FunctionsFunctionReturnsDescriptionExampleResults+-*/sameAdd,Subtract,Multiply&Divide1+12%IntegerModulo10%20SameExponentiation224|/NumericSquare Root|/93|/NumericCube Root|/82!NumericFactorial!36&|#NumericBitwiseAnd,Or,XOR,Not91&1511NumericBitwiseShift left,right1 2162Mathematical Functions(Continued)FunctionReturnsDescriptionExampleResultsabssameAbsolute Valueabs(-998.2)998.2ceiling(numeric)NumericReturns smallest integer not less than argumentceiling(48.2)49floor(numeric)NumericReturns largest integer not greater than argumentfloor(48.2)48pi()NumericThe constantpi()3.1419random()NumericRandom value between 0.0 and 1.0random().87663round()NumericRound to nearest integerround(22.7)23String FunctionsFunctionReturnsDescriptionExampleResultsstring|stringTextString concatenationmy|mymymychar_length(string)Integernumber of chars in stringchar_length(mymy)4position(string in string)IntegerLocation of specified sub-stringposition(my in ohmy)3lower(string)TextConverts to lower caselower(MYMY)mymyupper(string)TextConverts to upper caseupper(mymy)MYMYsubstring(string from n for n)TextDisplays portion of stringsubstring(myohmy from 3 for 2)ohtrim(both,leading,trailing from string)TextRemove leading and/or trailing characterstrim(mymy )mymyString Functions(Continued)FunctionReturnsDescriptionExampleResultsinitcap(string)TextChanges caseinitcap(my my)My Mylength(string)IntegerReturns string lengthlength(mymy)4split_part(string,delimiter,occurrence)TextSeparates delimited listsplit_part(one|two|three,|,2)twoDate FunctionsFunctionReturnsDescriptionExampleResultsage(timestamp,timestamp)TimestampDifference in years,months and daysage(2008-08-12 timestamp,current_timestamp)0 years 1 month 11 daysextract(field from timestamp)IntegerReturns year,month,day,hour,minute or secondextract(day from current_date)11now()TimestampReturns current date&timenow()2008-09-22 11:00:01overlapsBooleanSimplifies comparing date rangesWHERE(2008-01-01,2008-02-11)overlaps(2008-02-01,2008-09-11)TRUE存储过程vGreenplum支持SQL/PYTHON/PERL/C语言构建函数,以下着重介绍SQL 存储过程。一个存储过程就是一个事务,包括对子过程的调用都在一个事务内存储过程结构:CREATEFUNCTIONsomefunc()RETURNSintegerAS$DECLAREquantityinteger:=30;BEGINRETURN.;END;$LANGUAGEplpgsql;赋值赋值给一个变量或行/记录赋值用下面方法:identifier:=expression例子:user_id:=20;执行一个没有结果的查询执行一个没有结果的查询:PERFORMquery;一个例子:PERFORMcreate_mv(cs_session_page_requests_mv,my_query);存储过程请参考:存储过程请参考:存储过程v动态动态SQLEXECUTEcommand-stringINTOSTRICTtarget;vSELECTINTOExample:SELECTIDINTOVAR_IDFROMTABLEAv获取结果状态获取结果状态GETDIAGNOSTICSvariable=item,.;一个例子:GETDIAGNOSTICSinteger_var=ROW_COUNT;vSQL返回变量返回变量SQLERRM(SQL出错信息),SQLSTATE(SQL执行返回状态编码)v控制结构控制结构IF.THEN.ELSEIF.THEN.ELSELOOP,EXIT,CONTINUE,WHILE,FORv从函数返回从函数返回有两个命令可以用来从函数中返回数据:RETURN和RETURNNEXT。Syntax:RETURNexpression;v设置回调设置回调EXECSQLWHENEVERconditionaction;condition可以是下列之一:SQLERROR,SQLWARNING,NOTFOUND存储过程v异常处理异常处理EXCEPTIONWHENunique_violationTHEN-donothingEND;忽略错误:EXCEPTIONWHENOTHERSTHENRAISENOTICEanEXCEPTIONisabouttoberaised;RAISEEXCEPTIONNUM:%,DETAILS:%,SQLSTATE,SQLERRM;END;v错误和消息错误和消息RAISElevelformat,expression,.;Level:Info:信息输入信息输入Notice:信息提示信息提示Exception:产生一个例外,将退出存储过程产生一个例外,将退出存储过程Example:RAISENOTICECallingcs_create_job(%),v_job_id;OVER(PARTITION BY)ExampleSELECT*,row_number()OVER()FROMsaleORDERBYcn;SELECT*,row_number()OVER(PARTITION BY cn)FROMsaleORDERBYcn;row_number|cn|vn|pn|dt|qty|prc-+-+-+-+-+-+-1|1|10|200|1401-03-01|1|02|1|30|300|1401-05-02|1|03|1|50|400|1401-06-01|1|04|1|30|500|1401-06-01|12|55|1|20|100|1401-05-01|1|01|2|50|400|1401-06-01|1|02|2|40|100|1401-01-01|1100|24001|3|40|200|1401-04-01|1|0(8rows)row_number|cn|vn|pn|dt|qty|prc-+-+-+-+-+-+-1|1|10|200|1401-03-01|1|02|1|30|300|1401-05-02|1|03|1|50|400|1401-06-01|1|04|1|30|500|1401-06-01|12|55|1|20|100|1401-05-01|1|06|2|50|400|1401-06-01|1|07|2|40|100|1401-01-01|1100|24008|3|40|200|1401-04-01|1|0(8rows)OVER(ORDER BY)ExampleSELECTvn,sum(prc*qty)FROMsaleGROUPBYvnORDERBY2DESC;SELECTvn,sum(prc*qty),rank()OVER(ORDER BY sum(prc*qty)DESC)FROMsaleGROUPBYvnORDERBY2DESC;vn|sum|rank-+-+-40|2640002|130|180|250|0|320|0|310|0|3(5rows)vn|sum-+-40|264000230|18050|020|010|0(5rows)事务v事务将多个语句捆绑为all-or-nothing操作v事务相关命令BEGINorSTARTTRANSACTIONENDorCOMMITROLLBACKSAVEPOINTandROLLBACKTOSAVEPOINTvPsql中设置自动提交模式:setautocommiton|off目录Greenplum概述和数据分布概述和数据分布数据加载和外部表数据加载和外部表客户端工具客户端工具Greenplum数据库基础数据库基础Greenplum与与Oracle优化策略优化策略其他要点及示例其他要点及示例数据类型DataTypeORACLEGreenplumNumericNUMBER(p,s)SMALLINT (2 bytes)INTEGER (4 bytes)BIGINT (8 bytes)DECIMAL(p,s)(11+p/2 bytes)NUMERIC(p,s)(11+p/2 bytes)REAL (4 bytes)DOUBLE (8 bytes)CharacterCHAR and NCHARCHARVARCHAR2 and NVARCHAR2VARCHARDate&TimeDATE(includes time to sec)DATE or TIMESTAMP without Time ZoneTIMESTAMPTIMESTAMPINTERVALINTERVAL/TIMEBinaryBFILE(1 GB)Large Objects(up to 2 GB)RAW BFILE(1 GB)BYTEACLOB and NCLOBTEXTGreenplum与Oracle比较(1)DifferenceOracleGreenplumDUALSELECT 1+1 FROM DUALSELECT 1+1;orCREATE VIEW dual AS SELECT X:VARCHAR(1)AS DUMMY;SELECT 1+1 FROM dual;NEXTVALSELECT A_TABLE_SEQUENCE.NEXTVAL FROM DUALSELECT NEXTVAL(A_TABLE_SEQUENCE)FROM DUALROWNUMSELECT*FROM AGE_TYPE WHERE ROWNUM=5SELECT*FROM AGE_TYPELIMIT 5 OFFSET 0SELECT*FROM AGE_TYPE WHERE CODE IS NOT NULLAND ROWNUM=5 ORDER BY CODE DESCSELECT*FROM AGE_TYPEWHERE CODE IS NOT NULLORDER BY CODE DESCLIMIT 5 OFFSET 0ASSELECT A.COL1 A_COL1,A.COL2 A_COL2FROM A_TABLE ASELECT A.COL1 AS A_COL1,A.COL2 AS A_COL2FROM A_TABLE AGreenplum与Oracle比较(2)DifferenceOracleGreenplum(+)SELECT*FROM A_TABLE A,B_TABLE B WHERE A.ID(+)=B.IDSELECT*FROM A_TABLE A RIGHT OUTER JOIN B_TABLE B ON A.ID=B.IDSELECT*FROM A_TABLE A,B_TABLE B WHERE A.ID(+)=B.IDAND A.COL1=COL1_VALUESELECT*FROM A_TABLE A RIGHT OUTER JOIN B_TABLE B ON A.ID=B.ID AND A.COL1=COL1_VALUESELECT*FROM A_TABLE A,B_TABLE B,C_TABLE C,D_TABLE DWHERE A.ID=B.ID(+)AND A.ID=C.ID(+)ANDA.COL1=D.COL1SELECT*FROM(A_TABLE ALEFT OUTER JOIN B_TABLE BON A.ID=B.ID)LEFT OUTER JOIN C_TABLE CON A.ID=C.ID,D_TABLE DWHERE A.COL1=D.COL1SELECT*FROM A_TABLE AWHERE A.COL1(+)=0 AND A.COL2(+)=A_VALUE2SELECT*FROM A_TABLE AWHERE A.COL1=0 AND A.COL2=A_VALUE2Greenplum与Oracle比较(3)DifferenceOracleGreenplumNVLSELECT NVL(SUM(VALUE11),0)FS_VALUE1,NVL(SUM(VALUE21),0)FS_VALUE2FROM FIELD_SUMSELECT COALESCE(SUM(VALUE11),0)AS FS_VALUE1,COALESCE(SUM(VALUE21),0)AS FS_VALUE2FROM FIELD_SUMTO_NUMBERSELECT COL1 FROM A_TABLEORDER BY TO_NUMBER(COL1)SELECT COL1 FROM A_TABLEORDER BY TO_NUMBER(COL1,999999)note:999999 is length of COL1DECODESELECT DECODE(ENDFLAG,1,A,B)ENDFLAGFROM TESTSELECT(CASE ENDFLAG WHEN 1 THEN A ELSE B END)AS ENDFLAGFROM TEST|SELECT NULL|-|NULL AS VALUES1FROM DUALSELECT COALESCE(NULL,)|-|COALESCE(NULL,)AS VALUES1FROM DUALGreenplum与Oracle比较(4)DifferenceOracleGreenplumSYSDATEUPDATE A_TABLESET ENTREDATE=SYSDATEUPDATE A_TABLESET ENTREDATE=CURRENT_TIMESTAMP;orUPDATE A_TABLESET ENTREDATE=TO_TIMESTAMP(CURRENT_TIMESTAMP,YYYY-MM-DD HH24:MI:SS)SELECT TO_DATE(SYSDATE,YYYY-MM-DD)AS DAYFROM DUALSELECT TO_DATE(CURRENT_DATE,YYYY-MM-DD)AS DAYFROM DUALaggregateSELECT ROUND(AVG(SUM(BASICCNT1)BASICCNTFROM ACCESS_INFO_SUM1_VWHERE YEARCODE BETWEEN 200305 AND 200505GROUP BY SCCODESELECT ROUND(AVG(AIV.BASICCNT)AS BASICCNTFROM(SELECT SUM(BASICCNT1)AS BASICCNT FROM ACCESS_INFO_SUM1_V WHERE YEARCODE BETWEEN 200305 AND 200505 GROUP BY sccode )AIVCEILSELECT CEIL(SYSDATE-TO_DATE(20051027 14:56:10,YYYYMMDD HH24:MI:SS)AS DAYSFROM DUALSELECT EXTRACT(DAY FROM(TO_TIMESTAMP(CURRENT_TIMESTAMP,YYYY-MM-DD-HH24-MI-SS)-TO_TIMESTAMP(2005-10-27 14:56:10,YYYY-MM-DD-HH24-MI-SS)+1 AS DAYSFROM DUALGreenplum与Oracle比较(5)DifferenceOracleGreenplumSELECT LENGTH()AS VALUE1 FROM DUALResultVALUE1=NULLSELECT LENGTH()AS VALUE1 FROM DUALResultVALUE1=0SELECT TO_DATE(,YYYYMMDD)AS VALUE2 FROM DUALResultVALUE2=NULLSELECT TO_DATE(,YYYYMMDD)AS VALUE2 FROM DUALResultVALUE2=0001-01-01 BCSELECT TO_NUMBER(,1)AS VALUE3 FROM DUALResultVALUE3=NULLSELECT TO_NUMBER(,1)AS VALUE3 FROM DUALResult cannot executeINSERT INTO TEST(VALUE4)VALUES()ResultVALUE4=NULLnote:VALUE4 is numeric typeINSERT INTO TEST(VALUE4)VALUES()ResultVALUE4=0note:VALUE4 is numeric typeINSERT INTO TEST(VALUE5)VALUES()ResultVALUE5=NULLnote:VALUE5 is character typeINSERT INTO TEST(VALUE5)VALUES()ResultVALUE5=note:VALUE5 is character type,length equal 0INSERT INTO TEST(VALUE6)VALUES(TO_DATE(,YYYYMMDD)ResultVALUE6=NULLnote:VALUE6 is time typeINSERT INTO TEST(VALUE6)VALUES(TO_DATE(,YYYYMMDD)ResultVALUE6=0001-01-01 BCnote:VALUE6 is time typeGreenplum与Oracle比较(6)DifferenceOracleGreenplumNULLIFnot support NULLIFSELECT NULLIF(VALUE1,VALUE2)AS COL1 FROM DUALCONCATCONCAT(CHAR,CHAR)CREATE FUNCTION CONCAT(CHAR,CHAR)RETURNS CHAR AS SELECT$1|$2 LANGUAGE sql;ADD_MONTHSADD_MONTHS(date,int)CREATE FUNCTION add_months(date,int)RETURNS date AS SELECT($1+($2:text|months):interval):date;LANGUAGE sql;LAST_DAYLAST_DAY(DATE)CREATE FUNCTION LAST_DAY(DATE)RETURNS DATE ASSELECT date(substr(text($1+interval(1 month),1,7)|-01)-1LANGUAGE sql;MONTH_BETWEENMONTH_BETWEEN(DATA,DATA)CREATE FUNCTION MONTH_BETWEEN(DATA,DATA)RETURNS NUMERIC ASSELECT to_number(date($1)-date($2),999999999)/31LANGUAGE sql;BIN_TO_NUMSELECT BIN_TO_NUM(1,0,1,0)AS VALUE1 FROM DUALSELECT CAST(B1010 AS INTEGER)AS VALUE1BITANDBITAND(int,int)SELECT int&int目录Greenplum概述和数据分布概述和数据分布数据加载和外部表数据加载和外部表客户端工具客户端工具Greenplum数据库基础数据库基础Greenplum与与Oracle优化策略优化策略其他要点及示例其他要点及示例查看执行计划查看sql执行计划的命令:vEXPLAINvEXPLAINANALYZE查看执行计划:v一般从最后面往上查看v以下命令可能会包含Gather,Redistribute,Broadcast等操作JoinsSortsAggregations每步操作给出如下度量:lcost(units of disk page fetches)lrows(rows output by this node)lwidth(byte count of the widest row produced by this node)执行计划(EXPLAIN)EXPLAINSELECT*FROMnamesWHEREname=Joelle;-QUERYPLAN-GatherMotion2:1(slice1)(cost=0.00.20.88rows=1width=13)-SeqScanonnames(cost=0.00.20.88rows=1width=13)Filter:name:textJoelle:textSQLqueryFilterconditionGathermotionCost,rows,andwidth执行计划(EXPLAIN ANALYZE)EXPLAINANALYZESELECT*FROMnamesWHEREname=Joelle;-QUERYPLAN-GatherMotion2:1(slice1)(cost=0.00.20.88rows=1width=13)recv:Total 1 rows with 0.305 ms to first row,0.537 ms to end.-SeqScanonnames(cost=0.00.20.88rows=1width=13)Total 1 rows(seg0)with 0.255 ms to first row,0.486 ms to end.Filter:name:textJoelle:text22.548 ms elapsed 1segmentreturnedrowsActualtimetorunthequery1rowreturnedtomaster优化策略(1)1、数据分布-选用合适字段作为DistributionKey(DK),尽量做到平均分布Selectgp_segment_id,count(*)fromtablenamegroupby1;例1:优化策略(2)2、尽量选择常用连接条件或Groupby列作为DistributionKey,最好只用一个字段作为DK,并且DK列的distinct值越多越好。例2:insertintotablec(auction_id,.)select*fromtablealeftjointalbebontablea.selid=tablec.id;优化前耗时120秒优化方式:将tablea的DistributionKey改为selid,tableb的改为id,tablec的DistributionKey改为auction_id优化后耗时88秒,提升了32秒修改DK的语法:ALTERTABLEnameSETDISTRIBUTEDBY(column,.);ALTERTABLEnameSETWITH(REORGANIZE=true);3、采用Createtable代替Insertinto;尽量对中间过程使用临时表;数据库内有很多表都是全量更新的,因此可以用Createtable来代替Insertinto,性能能够大幅提高。例2中的SQL,改成Createtabletablecas(select*fromtablealeftjointablebontablea.selid=tablec.id)distributedby(auction_id),优化后耗时为65秒,提升了13秒优化策略(3)4、定期收集统计信息和执行Vacuum定期收集统计信息,可以优化SQL执行路径;sytax:analyzetable;数据加载后执行Vacuum,或者采用重创建表的方式来释放垃圾数据,可以提高SQL效率;对系统的数据字典也需要定期做vacuum由于GPDB使用的是MVCC事务并发模型,被删除或更新的数据行依然占据着物理磁盘空间,即便它们对于新的事务已经不可见。如果数据库有大量的更新和删除,会产生大量过期记录。VACUMM命令还会收集表级别的统计信息,如行数和页面数。ANALYZE命令收集查询规划器需要用到的列统计信息。VACUUM和ANALYZE操作可以在同一个命令中一起运行。例:=#VACUUM ANALYZE mytable;5、SQL合并在Greenplum中,建议将多个表的join和嵌套子查询SQL适当合并为一个SQL实现,可以减少IO,达到提高性能的目的。6、列存储和数据压缩,可以提高SQL查询效率;7、尽量避免用exists和notin语法,可以改成join和leftjoin方法实现尽量避免在用where条件中使用or语法,可以用Union带代替;8、高并发查询下,可以用分区和索引来加快速度,可以用setenable_seq_scan=off来强制使用索引;索引不宜太多,会增加explain的时间。另外,对于join的表,可以用明确的方式减少explain时间,避免因为多表关联plan时间。优化策略(4)9、尽量避免创建索引一般来说,在Greenplum中索引对于多表的join是没有性能提升的。优化方法:删除这类索引,可以提高数据加载的性能;10、使用索引对值比较类型的SQL进行优化:Select*fromaccess_loga,ip_tablebwherea.ip=b.min_ipanda.ip=b.max_ipA表:千万级B表:几万行在A表上建立IP的b-tree索引可以提高系统效率11、表分区提高查询效率12、Truncate代替deleteall,避免Vacuum处理;优化策略(5)例12:Demo1:Vacuum analyze 优化优化Table1:100万条数据createtabletable1(t1integer,t2varchar)distributedrandomly;Table1:1千条数据createtabletable2(t1integer,t2varchar)distributedrandomly;执行SQL:selecta.*,b.*fromtable1a,table2bwherea.t1=b.t1;优化策略(6)未作vacuumanalyze前:Vacuumanalyze后:说明:对小表做Broadcast到所有节点,比对大表做redistribution更好性能知识点:greenplum中有三种数据的移动操作:A:Broadcast Motion(N:N),即广播数据,每个节点向其他节点广播需要发送的数据。B:Redistribute Motion(N:N),重新分布数据,利用join的列值hash不同,将筛选后的数据在其他segment重新分布。C:Gather Motion(N:1),聚合汇总数据,每个节点将join后的数据发到一个单节点上,通常是发到主节点master。优化策略(7)13、DK优化,相同优化,相同DK 表关联性能更表关联性能更高高createtabletable1(t1integer,t2varchar)distributedby(t1);createtabletable2(t1integer,t2varchar)distributedby(t1,t2);执行SQL:selecta.*,b.*fromtable1a,table2bwherea.t1=b.t1;优化策略(8)修改table2的DK;createtabletable2(t1integer,t2varchar)distributedby(t1);说明:两个表的DK相同,节省了做Redistribution的消耗。优化策略(9)需要说明的是,DK需要顺序相同,类型相同,才能避免redistribution。例如:CREATETABLEtest(t1integer,t2charactervarying(10)DISTRIBUTEDBY(t1);CREATETABLEtest1(t1integer,t2charactervarying(10)DISTRIBUTEDBY(t2);select*fromtesta,test1bwherea.t1=b.t2两个表都要做重分布;优化策略(10)14、批量数据处理后,无论成功与否,都应该进行vac
展开阅读全文