Execution Time (Latency)


Performance Measurement

Execution time (latency): the time between the start and the completion of an event.
Performance is inversely proportional to execution time: $\text{Performance} \propto 1 / \text{Execution time}$.
Throughput (bandwidth): the total amount of work done in a given time.

"Machine X is n% faster than Machine Y" means:

$\dfrac{\text{Execution time}_Y}{\text{Execution time}_X} = 1 + \dfrac{n}{100}$

Example: Machine A runs a program in 10 seconds and Machine B runs the same program in 15 seconds. Since 15 / 10 = 1.5 = 1 + 50/100, A is 50% faster than B.

Make the Common Case Fast

Perhaps the most important and pervasive principle of computer design is to make the common case fast: in making a design trade-off, favor the frequent case over the infrequent case.
Improving the frequent event, rather than the rare event, gives the larger performance gain. For example, in addition the no-overflow case is far more frequent than the overflow case, so the no-overflow case is the one to optimize.

Amdahl's Law

Amdahl's Law states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used.

Speedup is defined as:

$\text{Speedup} = \dfrac{\text{Performance for entire task using the enhancement when possible}}{\text{Performance for entire task without the enhancement}} = \dfrac{\text{Execution time for entire task without the enhancement}}{\text{Execution time for entire task using the enhancement when possible}}$

That is, speedup is the old execution time divided by the new execution time.

$\text{Execution time}_{new} = \text{Execution time}_{old} \times \left[ (1 - f_E) + \dfrac{f_E}{s_E} \right]$

$\text{Speedup} = \dfrac{\text{Execution time}_{old}}{\text{Execution time}_{new}} = \dfrac{1}{(1 - f_E) + f_E / s_E}$

where
f_E: fraction of the original execution time that can use the enhancement
s_E: speedup gained while the enhanced mode is in use

Example: An enhancement runs 10 times faster than the original machine, but it is usable only 40% of the time. What is the speedup?
Solution: f_E = 0.4, s_E = 10, so Speedup = 1 / ((1 - 0.4) + 0.4/10) = 1 / 0.64 ≈ 1.56.

Amdahl's Law can also be applied to compare two CPU design alternatives. Implementations of floating-point (FP) square root vary significantly in performance, especially among processors designed for graphics. Suppose FP square root (FPSQR) is responsible for 20% of the execution time of a critical graphics benchmark. One proposal is to enhance the FPSQR hardware and speed up this operation by a factor of 10. The other alternative is to make all FP instructions in the graphics processor run faster by a factor of 1.6; FP instructions are responsible for a total of 50% of the execution time for the application. Compare these two design alternatives.

Answer: we can compare the two alternatives by comparing their speedups:

$\text{Speedup}_{FPSQR} = \dfrac{1}{(1 - 0.2) + 0.2/10} = \dfrac{1}{0.82} \approx 1.22$

$\text{Speedup}_{FP} = \dfrac{1}{(1 - 0.5) + 0.5/1.6} = \dfrac{1}{0.8125} \approx 1.23$

Improving the performance of the FP operations overall is slightly better because of their higher frequency.

Extreme cases:
f_E = 0: Speedup = 1
f_E = 1: Speedup = s_E

CPU Performance

Most computers are constructed using a clock running at a constant rate. The clock is described either by its period (e.g., 10 ns) or by its rate (e.g., 100 MHz).

ms = 10^-3 sec, μs = 10^-6 sec, ns = 10^-9 sec
Hz = 1/sec, KHz = 10^3 Hz, MHz = 10^6 Hz, GHz = 10^9 Hz
Clock cycle time = 1 / clock rate

CPI (clock cycles per instruction):

$\text{CPI} = \dfrac{\text{CPU clock cycles for a program}}{\text{Instruction count}}$

$\text{CPU time for a program} = \text{CPU clock cycles for a program} \times \text{Clock cycle time}$

$\text{CPU time} = \text{CPI} \times \text{Instruction count} \times \dfrac{1}{\text{Clock rate}}$

But not every instruction takes the same number of clock cycles to execute, so CPI is taken as an average:

$\text{CPI} = \sum_{i=1}^{n} \text{CPI}_i \times f_i$

where
n: number of different instructions in a program
CPI_i: CPI of instruction i
f_i: frequency of instruction i in the program (its share of all executed instructions)

Example:

Operation   Frequency   Clock cycles
ADD         60%         1
LOAD        40%         2

CPI_overall = 0.6 × 1 + 0.4 × 2 = 1.4

Example: A given program consists of a 100-instruction loop that is executed 42 times. If it takes 16000 cycles to execute the program on a given system, what is that system's CPI for the program?
The total number of instructions executed is 100 × 42 = 4200.
So the CPI is 16000 / 4200 ≈ 3.81.

Improve CPU Performance

How do we improve CPU performance, i.e., reduce CPU time? Again,

$\text{CPU time} = \text{CPI} \times \text{Instruction count} \times \dfrac{1}{\text{Clock rate}}$

So we want to reduce CPI, reduce the instruction count (IC), and increase the clock rate (equivalently, reduce the clock cycle time).

Ways to increase the clock rate: hardware technology, organization.
Ways to reduce CPI: organization, instruction set architecture.
Ways to reduce the instruction count: instruction set architecture, compiler technology.

MIPS

MIPS: Million Instructions Per Second.

$\text{MIPS} = \dfrac{\text{Instruction count}}{\text{Execution time} \times 10^6}$

Given MIPS:

$\text{Execution time} = \dfrac{\text{Instruction count}}{\text{MIPS} \times 10^6}$

So for a fixed instruction count, if MIPS increases, execution time decreases and performance improves.

Advantage:
Easy to understand (especially by customers).

Disadvantages:
MIPS depends on the instruction set, so it is difficult to compare the MIPS of computers with different instruction sets.
MIPS varies between programs on the same computer.
MIPS can vary inversely to performance (e.g., floating-point instructions executed by hardware or by software).

Example: When running a particular program, computer A achieves 100 MIPS and computer B achieves 75 MIPS. However, computer A takes 60 s to execute the program, while computer B takes only 45 s. How is this possible?

Solution: MIPS measures the rate at which a processor executes instructions, but different processor architectures require different numbers of instructions to perform a given computation. If computer A had to execute significantly more instructions than computer B to complete the program, it would be possible for computer A to take longer to run the program than computer B, despite the fact that computer A executes more instructions per second.

Other Measurements

MFLOPS: Millions of floating-point operations per second.

Internet Resources

Search for:
the Intel Museum
Charles Babbage Institute
PowerPC
Intel Developer Home
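The worked examples in these slides (Amdahl's Law, average CPI, CPU time, and MIPS) can be checked numerically. The following is a minimal Python sketch, not part of the original slides: the function names are illustrative, and the 100 MHz clock used for the CPU-time line is an assumption borrowed from the slides' example rate, since the loop exercise itself gives no clock rate.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when a fraction f_E of the original time runs s_E times faster."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


def average_cpi(mix):
    """Weighted-average CPI from (frequency, clock_cycles) pairs."""
    return sum(freq * cycles for freq, cycles in mix)


def cpu_time(cpi, instruction_count, clock_rate_hz):
    """CPU time in seconds: CPI x instruction count x (1 / clock rate)."""
    return cpi * instruction_count / clock_rate_hz


def mips(instruction_count, execution_time_s):
    """Million instructions per second."""
    return instruction_count / (execution_time_s * 1e6)


def instructions_executed(mips_rating, execution_time_s):
    """Instruction count implied by a MIPS rating and a run time."""
    return mips_rating * 1e6 * execution_time_s


if __name__ == "__main__":
    # Amdahl's Law example: usable 40% of the time, 10x faster.
    print(f"40% at 10x:        {amdahl_speedup(0.4, 10):.4f}")

    # Design comparison: FPSQR (20% of time, 10x) vs. all FP (50% of time, 1.6x).
    print(f"FPSQR alternative: {amdahl_speedup(0.2, 10):.4f}")
    print(f"All-FP alternative: {amdahl_speedup(0.5, 1.6):.4f}")

    # Average CPI for the ADD/LOAD mix.
    print(f"CPI overall:       {average_cpi([(0.6, 1), (0.4, 2)]):.2f}")

    # Loop example: 100 instructions x 42 iterations in 16000 cycles.
    instruction_count = 100 * 42
    loop_cpi = 16000 / instruction_count
    print(f"Loop CPI:          {loop_cpi:.2f}")
    # Assumed 100 MHz clock (not given in the exercise).
    print(f"Loop CPU time (s): {cpu_time(loop_cpi, instruction_count, 100e6):.6f}")

    # MIPS puzzle: A has the higher MIPS rating but executes far more instructions.
    print(f"A instructions:    {instructions_executed(100, 60):.3e}")
    print(f"B instructions:    {instructions_executed(75, 45):.3e}")
```

For the MIPS puzzle, the last two lines show why the higher MIPS rating does not win: at 100 MIPS for 60 s, computer A executes about 6.0 × 10^9 instructions, while computer B at 75 MIPS for 45 s executes about 3.4 × 10^9.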
