Performance libraries: Math Kernel Library
March 2003

Agenda
  - Introduction
  - Purpose
  - MKL contents
  - Performance features
  - Resource-limited optimization
  - Threading
  - Using the library
  - BLAS review

Purpose
  - Performance, performance, performance!
  - MKL is a math library designed by Intel for scientific and engineering computing
      - At first it contained only BLAS and FFT
  - Target areas:
      - Solvers (BLAS, LAPACK)
      - Eigenvector/eigenvalue solvers (BLAS, LAPACK)
      - Some quantum chemistry needs (dgemm)
      - PDEs, signal processing, seismic, solid-state physics (FFTs)
      - General scientific, financial (vector transcendental functions, VML)
  - Optimized for current and future Intel processors

Purpose: don'ts
  - But don't use MKL on
      - Don't use MKL on "small" counts
      - Don't call vector math functions on small n
  - Example: geometric transformation, (X, Y, Z, W) = 4x4 transformation matrix applied to (X, Y, Z, W)
  * For these cases, IPP can be used

MKL contents
  - BLAS (Basic Linear Algebra Subroutines)
      - Level 1 BLAS: vector-vector operations (15 function types, 48 functions)
      - Level 2 BLAS: matrix-vector operations (26 function types, 66 functions)
      - Level 3 BLAS: matrix-matrix operations (9 function types, 30 functions)
      - Extended BLAS: level 1 BLAS for sparse vectors (8 function types, 24 functions)

MKL contents
  - LAPACK (linear algebra package)
      - Solvers and eigensolvers
      - Many hundreds of routines total!
      - Total user callable + support routines 1000
  - FFTs (fast Fourier transforms)
      - One and two dimensional
      - With and without frequency ordering (bit reversal)
  - VML (vector math library)
      - Set of vectorized transcendental functions
      - Most of the libm functions, but faster

MKL contents
  - Most MKL functions have a Fortran interface
      - This comes from high-performance computing, where libraries such as BLAS and LAPACK are mostly implemented in Fortran
  - CBLAS interface: makes it convenient for C/C++ programmers to call BLAS

MKL contents: environment
  - Supports the Intel and CVF Fortran compilers
  - Supports the Linux and Windows operating systems
  - Static and dynamic link libraries
  - Supports all processors, 32-bit and 64-bit
  - Extensive tests and examples
  - Extensive documentation (MKL Index)

Resource-limited optimization
  - The goal of all optimization: highest speed
  - Exhaust one or more system resources (use the resources as fully as possible):
      - CPU: register shortage is the bigger challenge
      - Cache: keep data in cache as much as possible
      - Memory bandwidth: access memory as little as possible
      - Computer: use all processors as much as possible
      - System: use all nodes as much as possible (future)

Threading
  - Most MKL functions support multithreading, but threading gains little for level 1 and level 2 BLAS (O(n))
  - Many functions can be threaded:
      - Level 3 BLAS (O(n^3))
      - LAPACK (O(n^3))
      - FFTs (O(n log(n)))
      - VML? Depends on processor and function
  - All threading is implemented with OpenMP
  - All MKL functions are designed and compiled to be safe for multithreaded use

How to link with MKL
  - Assume the program calls an MKL function; then what?
  - Two approaches:
      - Static: all library objects are linked into the program binary
      - DLL: use without static link; a frequent C approach

Static Link
  - Scenario 1: ifl, BLAS, Pentium III processor:
        ifl -o myprog myprog.f -static -L/opt/intel/mkl/lib/32 -lmkl_p3 -lpthread -lguide    (Linux)

Dynamic Link
  - Scenario 2: C program uses BLAS but wants the optimal code determined at runtime:
        ifl -o myprog myprog.f -L/opt/intel/mkl/lib/32 -lmkl -lpthread -lguide    (Linux)

BLAS review
  - 3 "levels" of functions + sparse
      - Level 1: vector-vector operations
      - Level 2: vector-matrix operations
      - Level 3: matrix-matrix operations
      - Sparse: level 1 operations on sparse vectors
  - History of the "levels"
      - Level 1 in the early 70s
      - Level 2 in the mid-80s, followed immediately by level 3

BLAS naming conventions
  - General scheme:
  - precision: one or two letters
      - 1 letter implies input and output are the same type
          - s = single, d = double, c = single complex, z = double complex
      - 2 letters: input and output are different
          - cs, zd: complex in, real out; sc, dz: real in, complex out

BLAS naming conventions: matrix type
  - g: general
      - ge: general; gb: band (general band)
  - s: symmetric
      - sy: symmetric; sp: packed; sb: band (symmetric band)
  - h: Hermitian
      - he: Hermitian; hp: packed; hb: band (Hermitian band, complex conjugate)
  - t: triangular
      - tr: triangular; tp: packed; tb: band (triangular band)
  - Storage schemes: general band, symmetric band, Hermitian band, triangular band, packed

BLAS Naming Conventions: operation
  - Level 1
      - c: conjugated (cdotc); u: unconjugated (cdotu); g: Givens (srotg)
  - Level 2
      - mv: matrix-vector; sv: solve (vector operations); r: rank update; r2: rank-2 update
      - dger: double-precision general rank update: A := alpha*x*y' + A
  - Level 3
      - mm: matrix-matrix; sm: solve (matrix operations); rk: rank-k update; r2k: rank-2k update
      - dsyr2k: double-precision symmetric rank-2k update

Matrix Multiplication: four implementations
  - Roll your own
  - DDOT (level 1)
  - DGEMV (level 2)
  - DGEMM (level 3)
  - Because C is used, all is not pretty :-)

Matrix Multiplication: Roll Your Own / Dot Product

  Roll your own:

        for (i = 0; i < n; i++)
            for (j = 0; j < m; j++) {
                temp = 0.0;
                for (k = 0; k < kk; k++)
                    temp += a[i][k] * b[k][j];
                c[i][j] = temp;
            }

  ddot:

        incx = 1; incy = ldb;
        for (i = 0; i < n; i++)
            for (j = 0; j < m; j++)
                /* dot length is the inner dimension kk (the slide passes &n, which only matches when n == kk) */
                c[i][j] = DDOT(&kk, &a[i][0], &incx, &b[0][j], &incy);

Matrix Multiplication: DGEMV / DGEMM

  dgemv:

        incx = 1; incy = ldb; alpha = 1.0; beta = 0.0; transa = 't';
        for (i = 0; i < n; i++)
            dgemv(&transa, &m, &n, &alpha, a, &lda, &b[0][i], &ldb, &beta, &c[0][i], &ldc);

  dgemm:

        alpha = 1.0; beta = 0.0; transa = 'n'; transb = 'n';
        dgemm(&transa, &transb, &m, &n, &kk, &alpha, b, &ldb, a, &lda, &beta, c, &ldc);

MKL matrix-multiply performance: compiled code vs DGEMM
  * 2.2 GHz Intel Pentium 4 processor, 512 MB memory
  * 800 MHz Itanium processor, 4 MB cache, NEC Express5800

MKL performance in a parallel environment: DGEMM
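
Worked sketch: roll your own vs. dot product

The first two matrix-multiplication variants from the slides can be tried without MKL installed. The sketch below is an illustration under stated assumptions, not the deck's own code: matrices are row-major flat arrays, and my_ddot, matmul_naive and matmul_ddot are hypothetical names. my_ddot is only a plain-C stand-in that mimics the strided calling pattern of BLAS DDOT (by value rather than by reference); it is not the optimized routine.

```c
#include <assert.h>

/* Hypothetical stand-in for BLAS DDOT: dot product of n elements,
   reading x with stride incx and y with stride incy. */
double my_ddot(int n, const double *x, int incx, const double *y, int incy)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += x[i * incx] * y[i * incy];
    return sum;
}

/* "Roll your own": C (n x m) = A (n x kk) * B (kk x m).
   All matrices are row-major with leading dimensions lda, ldb, ldc. */
void matmul_naive(int n, int m, int kk,
                  const double *a, int lda,
                  const double *b, int ldb,
                  double *c, int ldc)
{
    assert(lda >= kk && ldb >= m && ldc >= m);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++) {
            double temp = 0.0;
            for (int k = 0; k < kk; k++)
                temp += a[i * lda + k] * b[k * ldb + j];
            c[i * ldc + j] = temp;
        }
}

/* Level 1 formulation: each C[i][j] is one dot product of length kk.
   Row i of A is contiguous (stride 1); column j of B is strided by ldb. */
void matmul_ddot(int n, int m, int kk,
                 const double *a, int lda,
                 const double *b, int ldb,
                 double *c, int ldc)
{
    assert(lda >= kk && ldb >= m && ldc >= m);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            c[i * ldc + j] = my_ddot(kk, &a[i * lda], 1, &b[j], ldb);
}
```

Both functions should produce the same result for the same inputs; the dot-product form only moves the inner loop into a library-shaped call. With a CBLAS implementation linked in, the whole product would normally be a single cblas_dgemm call instead, which is the level 3 case the performance slides compare against compiled code.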