Hazards (where structural hazards mostly occur)


More Pipeline

Basic RISC Pipelining
- Basic idea: each instruction spends 1 clock cycle in each of the 5 execution stages.
- During 1 clock cycle, the pipeline can process (in different stages) 5 different instructions.

Simple RISC Datapath
[Figure: datapath with stages IF, ID, EX, MEM, WB; labels: Program Counter, Next PC, Inst. Reg., Load fr. Mem., Data]

Description of Pipe Stages

Hazards

The hazards of pipelining
- Pipeline hazards prevent the next instruction from executing during its designated clock cycle.
- There are 3 classes of hazards:
  - Structural hazards: arise from resource conflicts; the HW cannot support all possible combinations of instructions.
  - Data hazards: occur when a given instruction depends on data from an instruction ahead of it in the pipeline.
  - Control hazards: result from branches and other instructions that change the flow of the program (i.e. change the PC).

How do we deal with hazards?
- Often, the pipeline must be stalled.
- Stalling the pipeline usually lets some instruction(s) in the pipeline proceed while another/others wait for data, a resource, etc.

Stalls and performance
- Stalls impede the progress of a pipeline and result in deviation from 1 instruction executing per clock cycle.
- Pipelining can be viewed as decreasing either the CPI or the clock cycle time per instruction.
- Let's see what effect stalls have on CPI:
  CPI pipelined = Ideal CPI + Pipeline stall cycles per instruction
                = 1 + Pipeline stall cycles per instruction
- Ignoring overhead and assuming stages are balanced: [formula not preserved in the source]

Even more pipeline performance issues!
- This results in: [formula not preserved in the source]
- Which leads to: [formula not preserved in the source]
- If there are no stalls (the ideal case), speedup = number of pipeline stages.

1. Structural hazards
- Most common instances of structural hazards:
  - When a functional unit is not fully pipelined
  - When some resource is not duplicated enough
- One way to avoid structural hazards is to duplicate resources.
- When the pipeline stalls as a result of hazards, CPI increases from the usual "1".

An example of a structural hazard
[Figure: pipeline diagram over time for Load and Instructions 1-4, each passing through Mem, Reg, ALU, DM, Reg]
- What's the problem here? The processor has a combined instruction + data memory with only 1 read port.

How is it resolved?
[Figure: the same pipeline diagram for Load, Instructions 1-3, with a stall (a row of bubbles) inserted before Instruction 3]
- The pipeline is generally stalled by inserting a "bubble" or NOP.

Or alternatively
Clock no.  1     2     3     4     5     6     7     8     9     10
LOAD       IF    ID    EX    MEM   WB
Inst. i+1        IF    ID    EX    MEM   WB
Inst. i+2              IF    ID    EX    MEM   WB
Inst. i+3                    stall IF    ID    EX    MEM   WB
Inst. i+4                                IF    ID    EX    MEM   WB
Inst. i+5                                      IF    ID    EX    MEM
Inst. i+6                                            IF    ID    EX
- The LOAD instruction "steals" an instruction fetch cycle, which causes the pipeline to stall. Thus, no instruction completes on clock cycle 8.

Remember the common case!
- In some cases it may be better to allow structural hazards than to eliminate them.
- These are situations a computer architect might have to consider:
  - Is pipelining functional units or duplicating them costly in terms of HW?
  - Does the structural hazard occur often?
  - What's the common case?

2. Data hazards
- Why do they exist?
  - Pipelining changes the order of read/write accesses to operands.
  - The order differs from the order seen by sequentially executing instructions on an unpipelined machine.
- Consider this example:
    ADD R1,R2,R3
    SUB R4,R1,R5
    AND R6,R1,R7
    OR  R8,R1,R9
    XOR R10,R1,R11
- All instructions after ADD use the result of ADD.
- ADD writes the register in WB, but SUB needs it in ID. This is a data hazard.

Illustrating a data hazard
[Figure: pipeline diagram over time for ADD R1,R2,R3; SUB R4,R1,R5; AND R6,R1,R7; OR R8,R1,R9; XOR R10,R1,R11]
- The ADD instruction causes a hazard in the next 3 instructions because the register is not written until after those 3 read it.

Data hazard specifics
- There are actually 3 different kinds of data hazards:
  - Read After Write (RAW)
  - Write After Write (WAW)
  - Write After Read (WAR)
- Assume that hazards involve instructions i and j, where i is always issued before j. Thus, i will always be further along in the pipeline than j.
- With an in-order issue/in-order completion machine, we're not as concerned with WAW and WAR.

Three Types of Data Hazards
- Let i be an earlier instruction, j a later one.
- RAW (read after write): j tries to read a value before i writes it.
- WAW (write after write): i and j write to the same place, but in the wrong order.
  - Condition: only occurs if more than 1 pipeline stage can write (in-order).
- WAR (write after read): j writes a new value to a location before i has read the old one.
  - Condition: only occurs if writes can happen before reads in the pipeline (in-order).

Read after write (RAW) hazards
- With a RAW hazard, instruction j tries to read a source operand before instruction i writes it.
- Thus, j would incorrectly receive an old or incorrect value.
- Example:
    i: ADD R1,R2,R3    (i is a write instruction issued before j)
    j: SUB R4,R1,R6    (j is a read instruction issued after i)
- Stalling or forwarding can be used to resolve this hazard.

Forwarding
- A RAW hazard can actually be solved relatively easily with forwarding.
- In this example, the result of the ADD instruction is not really needed until after ADD actually produces it.
- Can we move the result from the EX/MEM register to the beginning of the ALU (where SUB needs it)?
- Generally speaking:
  - Forwarding occurs when a result is passed directly to the functional unit that requires it.
  - The result goes from the output of one unit to the input of another.

When can we forward?
[Figure: pipeline diagram over time for ADD R1,R2,R3; SUB R4,R1,R5; AND R6,R1,R7; OR R8,R1,R9; XOR R10,R1,R11, with forwarding paths]
- SUB gets its value from the EX/MEM pipe register.
- AND gets its value from the MEM/WB pipe register.
- OR gets its value by forwarding from the register file.
- Rule of thumb: if the line goes "forward" in time, you can do forwarding. If it is drawn backward, it is physically impossible.

Data Hazard Detection

Hazard Detection Logic
- Example: detecting whether an instruction that has just been fetched needs to be stalled because of a preceding load.

Forwarding Situations in DLX

HW Change for Forwarding
[Figure: forwarding hardware; labels: Mux, Mux, ALU, Zero?, Data memory, ID/EX, EX/MEM, MEM/WB]

Forwarding: it doesn't always work
[Figure: pipeline diagram over time for LW R1,0(R2); SUB R4,R1,R5; AND R6,R1,R7; OR R8,R1,R9]
- A load has a latency that forwarding can't solve.
- The pipeline must stall until the hazard is cleared (starting with the instruction that wants to use the data, until the source produces it).

The solution
[Figure: the same sequence LW R1,0(R2); SUB R4,R1,R5; AND R6,R1,R7; OR R8,R1,R9 with a bubble inserted]
- Insertion of the bubble causes the number of cycles needed to complete this sequence to grow by 1.

Data hazards and the compiler
- The compiler should be able to help eliminate some stalls caused by data hazards.
- For example, the compiler could avoid generating a LOAD instruction that is immediately followed by an instruction that uses the LOAD's destination register.
- The technique is called "pipeline/instruction scheduling".

A simple example
- A clever compiler can often reschedule instructions to avoid a stall.
- Original code:
    lw  r2,0(r4)
    add r1,r2,r3    (stall happens here!)
    lw  r5,4(r4)
- Transformed code:
    lw  r2,0(r4)
    lw  r5,4(r4)
    add r1,r2,r3    (no stall needed!)

Simple RISC Pipeline Stall Statistics
[Figure: % of loads that cause a stall]

Write after write (WAW) hazards
- With a WAW hazard, instruction j tries to write an operand before instruction i writes it.
- The writes are performed in the wrong order, leaving the value written by the earlier instruction.
- Example:
    i: DIV F1,F2,F3    (i is a write instruction issued before j)
    j: SUB F1,F4,F6    (j is a write instruction issued after i)

Write after read (WAR) hazards
- With a WAR hazard, instruction j tries to write an operand before instruction i reads it.
- Instruction i would incorrectly receive the newer value of its operand: instead of getting the old value, it could receive some newer, undesired value.
- Example:
    i: DIV F7,F1,F3    (i is a read instruction issued before j)
    j: SUB F1,F4,F6    (j is a write instruction issued after i)

3. Control (Branch) Hazards
- Suppose the new PC value is not computed until the MEM stage.
- Then we must stall 3 clocks after every branch!

Branch Hazards
- We need to consider hazards involving branches.
- Example:
    40: beq $1,$3,28
    44: and $12,$2,$5
    48: or  $13,$6,$2
    52: add $14,$2,$2
    72: lw  $4,50($7)

Pipeline impact on branch
- How do we deal with this?
  - Always stall
  - Assume branch not taken
  - Branch delay slots

Assume branch not taken
- On average, branches are taken [...] of the time.
- If the branch is not taken: continue normal processing.
- Else, if the branch is taken: need to flush the improper instructions from the pipeline.
- Cuts overall time for branch processing in [...]

Assume branch not taken
- Case 1: not taken. Execution proceeds normally, no penalty.

Assume branch not taken
- Case 2: taken branch. Bubbles are injected into 3 stages during cycle 5.

Sum: Branch Penalty Impact
- Assume 16% of all instructions are branches:
  - 4% unconditional branches: 3-cycle penalty
  - 12% conditional: 50% taken, 3-cycle penalty
- For a sequence of N instructions (assume N is large):
  - N cycles to initiate each
  - 3 * 0.04 * N delays due to unconditional branches
  - 0.5 * 3 * 0.12 * N delays due to taken conditional branches
  - Also, an extra 4 cycles for the pipeline to empty
- Total: 1.3 * N + 4 cycles (or a CPI of 1.3 cycles/instruction)
- 30% performance hit! (Bad thing)

Branch delay slot
- Delay slot: find one instruction that will be executed no matter which way the branch goes.
- Branches always execute the next 1 or 2 instructions.
- An instruction so executed is said to be in the delay slot.
- Branch delay slot of length n:
    branch instruction
    Delay slot instruction 1
    Delay slot instruction 2
    ...
    Delay slot instruction n
    branch target if taken

Scheduling Delayed Branch
[Figure: three ways to fill the delay slot, labeled "From before", "From target", and "From fall through", built from ADD R1,R2,R3; "if R2=0 then" / "if R1=0 then"; SUB R4,R5,R6; OR R7,R8,R9]

Scheduling Delayed Branch
- Where do we get instructions to fill the branch delay slot?
  - From before the branch instruction: always valuable
  - From the target address: only valuable when the branch is taken
  - From the fall-through path: only valuable when the branch is not taken

Fast Branch Resolution
- The performance penalty could be more than 30%:
  - Deeper pipelines; some code is very branch heavy
- Fast branch resolution:
  - Adder in ID for PC + immediate targets
  - Only works for simple conditions (compare to 0)
  - Comparing two register values could be too slow

New Pipeline Logic

Example
- Assume the following MIPS instruction mix:
    Type          Frequency
    Arith/Logic   40%
    Load          30%   of which 25% are followed immediately by an instruction using the loaded value
    Store         10%
    Branch        20%   of which 45% are taken
- What is the resulting CPI for the pipelined MIPS with forwarding and branch address calculation in the ID stage when using a branch-not-taken scheme?
    CPI = Ideal CPI + Pipeline stall clock cycles per instruction
        = 1 + stalls by loads + stalls by branches
        = 1 + 0.3 x 0.25 x 1 + 0.2 x 0.45 x 1
        = 1 + 0.075 + 0.09
        = 1.165

Exceptions

Types of Exceptions (Interrupts, Faults)
- I/O device request, timer event
- Invoking OS services from a user program
- Tracing (single-stepping) through a program
- Breakpoints
- Integer arithmetic overflow, divide by zero
- FP arithmetic anomaly (overflow, underflow, etc.)
- Page fault (page not in physical memory)
- Misaligned memory access
- Memory-protection violation (access to memory not allocated to the process)
- Illegal (undefined or unimplemented) instruction
- Hardware malfunction
- Power-related interrupt (e.g. battery low, power failure)

Exception Characterization 1
- Synchronous vs. asynchronous: is the event synchronized with program execution?
  - Synchronous: the event occurs at the same place every time.
  - Asynchronous: caused by devices external to the CPU and memory, also by HW malfunctions.
- User requested vs. coerced: is the event caused intentionally by the user program?
  - Requested: the user task asks for it.
  - Coerced: a HW event not under control of the user program.

Exception Characterization 2
- User maskable or not: can the event be disabled?
  - Maskable: an event that can be disabled by the user task.
- Within instructions or between instructions: does the event prevent the instruction from completing?
  - Within: occurs during execution of the instruction; hard to handle; usually synchronous, since the instruction is the trigger.
- Resume vs. terminate: does the program continue from where it left off after the exception is handled, or does it stop?
  - Terminating: execution always stops after the interrupt.

Restartable Exceptions
- Requirements:
  - The exception may occur within an instruction.
  - The program must continue after the exception is handled.
- Example: virtual memory page fault.
- Difficult because the pipeline state must be saved.
- One approach, for easy cases:
  1. Force a trap instruction into the pipeline on the next IF.
  2. Clear the pipeline behind the faulting instruction.
  3. The exception handler saves the PC of the faulting instruction.

Precise vs. Imprecise Handling
- Machines may support either or both modes of exception handling.
- Precise exception handling:
  - Correctly implements all possible combinations of exceptions in all circumstances.
  - May be a requirement for some systems/applications.
  - May be 10x slower!
  - Easier for integer than for floating-point.
  - Useful for debugging code.
- Imprecise exception handling:
  - Only correctly implements the most common cases.
  - Software may avoid some exceptions.
  - Only statistical guarantees of correctness, through testing.

Exceptions in DLX pipeline
- Instruction Fetch and Memory stages:
  - Page fault on instruction/data fetch
  - Misaligned memory access
  - Memory-protection violation
- Instruction Decode stage:
  - Undefined/illegal opcode
- Execution stage:
  - Arithmetic exception
- Write-Back stage:
  - None!

Out-of-Order Exceptions
- Consider the following code sequence:
    LW   IF  ID  EX  MEM WB
    ADD      IF  ID  EX  MEM WB
- The ADD may cause an exception during IF, before LW causes an exception during MEM! We can't restart the PC on the ADD.
- Solution:
  - Note the exception in a status vector that is carried along with the instruction.
  - Disable writes for that instruction.
  - Resolve all exceptions at a late stage (e.g. WB).

Pipelining Complications
- Complex addressing modes and instructions:
  - Autoincrement addressing modes change a register during instruction execution. What about interrupts? The register state needs to be restored.
  - They add WAR and WAW hazards, since writes are no longer in the last stage.
- Floating point: long execution time; out-of-order completion.

Stopping and Starting Execution
- The most difficult exception occurrences have 2 properties:
  - They occur within instructions.
  - They must be restartable.
- The pipeline must be shut down safely and the state must be saved for correct restarting.
- Restarting is usually done by saving the PC of the instruction at which to start.
- Branches and delayed branches require special treatment.
- Precise exceptions allow instructions just before the exception to be completed, while restarting instructions after the exception.

Multi-cycle Operations

Multi-cycle Operations for FP

Pipelined Multiple-Issue FPU

Out-of-order completion
- Notice that instructions may complete out of order:
    MULTD  IF ID M1 M2 M3 M4 M5 M6 M7 ME WB
    ADDD      IF ID A1 A2 A3 A4 ME WB
    LD           IF ID EX ME WB
    SD              IF ID EX ME WB

Typical FP Code Sequence: RAW Stalls
(st = stall)
Clock cycle      1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17
L.D   F4,0(R2)   IF ID EX ME WB
MUL.D F0,F4,F6      IF ID st M1 M2 M3 M4 M5 M6 M7 ME WB
ADD.D F2,F0,F8         IF st ID st st st st st st A1 A2 A3 A4 ME WB
S.D   F2,0(R2)            IF st st st st st st ID EX st st st ME

Structural hazards

Sum: multiple-cycle problems
- Multi-cycle operations raise the possibility of WAW hazards, and of structural hazards in the MEM and WB stages.
- Structural hazards may occur especially often with a non-pipelined DIV unit.
- Out-of-order completion impacts exception handling.

Appendix: The MIPS R4000 Pipeline

The MIPS R4300 Pipeline
- Manufactured by NEC
- 64-bit processor, implements the MIPS64 ISA
- Used in embedded applications:
  - Nintendo-64 game processor, network router, ...
- Multiple EX stages for the floating-point pipeline
- Out-of-order completion, precise exceptions
- NEC VR 4122: integer datapath, software for FP operations

Real MIPS R4000 "SuperPipeline"
- IF, IS: instruction cache fetch, first and second halves
- RF: instruction decode, register fetch, hazard check
- EX: execution (EA calc, ALU op, target calc)
- DF, DS: data cache access, first and second halves
- TC: tag check (did the cache access hit?)
- WB: write-back for loads and register-register ops

R4000: Two-Cycle Load Delay
[Figure: overlapped IF, IS, RF, EX, DF, DS, TC, WB pipeline diagram showing the TWO-cycle load latency]

R4000: Three-Cycle Branch Delay
[Figure: overlapped IF, IS, RF, EX, DF, DS, TC, WB pipeline diagram showing the THREE-cycle branch latency (conditions evaluated during the EX phase)]

R4000 FP Functional Unit Stages
- U: unpack floating-point numbers
- FP adder functional unit stages:
  - A: mantissa ADD stage
  - R: rounding stage
  - S: operand shift stage
- FP multiplier functional unit stages:
  - E: exception test stage
  - M: first stage of multiplier
  - N: second stage of multiplier
- FP divider functional unit stages:
  - D: divide pipeline stage

More R4000 FP pipeline details
FP instruction   Latency   Initiation interval   Pipe stages
Add, subtract    4         3                     U, S+A, A+R, R+S
Multiply         8         4                     U, E+M, M, M, M, N, N+A, R
Divide           36        35                    U, A, R, D^27, D+A, D+R, D+A, D+R, A, R
Square root      112       111                   U, E, (A+R)^108, A, R
Negate           2         1                     U, S
Absolute value   2         1                     U, S
FP compare       3         2                     U, A, R
- "+": both units are used in the same clock cycle.
- (A+R)^108: the pair of units is used on 108 consecutive cycles.
- Stage legend: U unpack; A mantissa add; R round; S shift; E exception test; M multiply 1st stage; N multiply 2nd stage; D divide.

FP Multiply followed by Add
[Table: clock cycles 0-12. A multiply issues and passes through U, E+M, M, M, M, N, N+A, R. The rows below it show an add (U, S+A, A+R, R+S) with the statuses, in order: Issue, Issue, Issue, Stall, Stall, Issue, Issue. The per-cycle column alignment is not preserved in the source.]
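The CPI arithmetic in the branch-penalty summary and in the MIPS instruction-mix example can be checked with a few lines of Python. This is a minimal sketch: the helper names `pipelined_cpi` and `speedup_from_depth` are illustrative, not from the slides, and the speedup helper assumes the usual textbook relation (ideal CPI of 1, balanced stages, no overhead), which is consistent with the slides' statement that speedup equals the number of stages when there are no stalls.

```python
def pipelined_cpi(stall_terms, ideal_cpi=1.0):
    """CPI = ideal CPI + sum of (fraction of instructions * stall cycles)."""
    return ideal_cpi + sum(freq * penalty for freq, penalty in stall_terms)


def speedup_from_depth(depth, stalls_per_instr):
    """Assumed relation: speedup = pipeline depth / (1 + stall cycles per instruction)."""
    return depth / (1.0 + stalls_per_instr)


# Branch penalty summary: 4% unconditional branches and 12% conditional
# branches (half of them taken), each costing 3 stall cycles.
branch_cpi = pipelined_cpi([
    (0.04, 3),        # unconditional branches
    (0.12 * 0.5, 3),  # taken conditional branches
])

# Instruction-mix example: 30% loads (25% followed by a dependent
# instruction, 1 stall) and 20% branches (45% taken, 1 stall).
mix_cpi = pipelined_cpi([
    (0.30 * 0.25, 1),  # load-use stalls
    (0.20 * 0.45, 1),  # taken-branch stalls
])

print(f"branch-penalty CPI: {branch_cpi:.3f}")
print(f"instruction-mix CPI: {mix_cpi:.3f}")
print(f"5-stage speedup at mix CPI: {speedup_from_depth(5, mix_cpi - 1.0):.2f}")
```

The extra 4 cycles needed to drain the pipeline are ignored here, as in the slides' per-instruction figure for large N.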
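The three data-hazard definitions (RAW, WAW, WAR) for an earlier instruction i and a later instruction j can be sketched as a register-name check. The `Instr` type and `data_hazards` function are hypothetical names introduced for illustration. The function only reports the dependences that could become hazards; whether a stall actually occurs depends on the pipeline (for instance, WAW and WAR do not arise in the simple in-order 5-stage pipeline).

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    """A register-register instruction: one destination, several sources."""
    op: str
    dest: str
    srcs: Tuple[str, ...]


def data_hazards(i: Instr, j: Instr) -> set:
    """Return the potential hazard kinds of later instruction j against earlier i."""
    kinds = set()
    if i.dest in j.srcs:   # j reads what i writes
        kinds.add("RAW")
    if i.dest == j.dest:   # both write the same register
        kinds.add("WAW")
    if j.dest in i.srcs:   # j writes what i reads
        kinds.add("WAR")
    return kinds


# The example pairs used on the slides.
raw = data_hazards(Instr("ADD", "R1", ("R2", "R3")), Instr("SUB", "R4", ("R1", "R6")))
waw = data_hazards(Instr("DIV", "F1", ("F2", "F3")), Instr("SUB", "F1", ("F4", "F6")))
war = data_hazards(Instr("DIV", "F7", ("F1", "F3")), Instr("SUB", "F1", ("F4", "F6")))
print(raw, waw, war)
```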
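The compiler-scheduling example (moving the second `lw` between the first `lw` and the dependent `add`) can be illustrated by counting load-use stalls before and after the transformation. This sketch assumes full forwarding and a one-cycle load delay, so the only stall modeled is a load immediately followed by an instruction that reads the loaded register. The tuple encoding and the name `count_load_use_stalls` are assumptions made for the example.

```python
def count_load_use_stalls(program):
    """Count stalls where an instruction directly follows a lw and reads its result.

    Each instruction is a tuple (op, dest, srcs). Assumes full forwarding and
    a one-cycle load delay; other stall sources are not modeled.
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_srcs = cur
        if prev_op == "lw" and prev_dest in cur_srcs:
            stalls += 1
    return stalls


original = [
    ("lw", "r2", ("r4",)),        # lw  r2,0(r4)
    ("add", "r1", ("r2", "r3")),  # add r1,r2,r3  (uses r2 right after the load)
    ("lw", "r5", ("r4",)),        # lw  r5,4(r4)
]
scheduled = [
    ("lw", "r2", ("r4",)),        # lw  r2,0(r4)
    ("lw", "r5", ("r4",)),        # lw  r5,4(r4)  (independent, fills the slot)
    ("add", "r1", ("r2", "r3")),  # add r1,r2,r3
]

print(count_load_use_stalls(original), count_load_use_stalls(scheduled))
```

The reordering is legal here because the second load neither reads nor writes any register used by the `add`.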
