Advanced Computer Architecture 10: Memory Hierarchy (Lecture Slides)

[Slide 2] Lecture 10: Memory Hierarchy: Reducing Hit Time, Main Memory, & Examples
Spring 2010, Super Computing Lab.

[Slide 3] Review: Reducing Misses
- 3 Cs: compulsory, capacity, conflict misses
- Reducing miss rate:
  1. Reduce misses via larger block size
  2. Reduce misses via higher associativity
  3. Reduce misses via a victim cache
  4. Reduce misses via pseudo-associativity
  5. Reduce misses by HW prefetching of instructions and data
  6. Reduce misses by SW prefetching of data
  7. Reduce misses by compiler optimizations
- Remember the danger of concentrating on just one parameter when evaluating performance

[Slide 4] Reducing Miss Penalty: Summary
- Five techniques:
  1. Read priority over write on miss
  2. Subblock placement
  3. Early restart and critical word first on miss
  4. Non-blocking caches (hit under miss, miss under miss)
  5. Second-level cache
- Can be applied recursively to multilevel caches
- The danger is that the time to DRAM grows with multiple levels in between; first attempts at L2 caches can make things worse, since the increased worst case is worse

[Slide 5] Review: Improving Cache Performance
1. Reduce the miss rate,
2. Reduce the miss penalty, or
3. Reduce the time to hit in the cache (hit time: read tag + compare)

[Slide 6] 1. Fast Hit Times via Small and Simple Caches
- Why does the Alpha 21164 have 8 KB instruction and 8 KB data caches plus a 96 KB second-level cache?
- A small data cache is faster and keeps up with the clock rate (on chip)
- Direct mapped, on chip; advantage: overlap tag check & data transfer

[Slide 7] 1. Fast Hit Times via Small and Simple Caches (cont.)
- Indexing the tag memory and then comparing takes time
- Small: a small cache helps hit time, since a smaller memory takes less time to index
  - E.g., L1 caches stayed the same size across three generations of AMD microprocessors: K6, Athlon, and Opteron
  - Likewise, an L2 cache small enough to fit on chip with the processor avoids the time penalty of going off chip
- Simple: direct mapping
  - Can overlap tag check with data transmission, since there is no choice of way
- Access time estimates for 90 nm using the CACTI 4.0 model: median access-time ratios relative to direct-mapped caches are 1.32, 1.39, and 1.43 for 2-way, 4-way, and 8-way caches

[Slide 8] 2. Fast Hits by Avoiding Address Translation
- Send the virtual address to the cache: called a virtually addressed cache or just virtual cache, vs. a physical cache
- Every time the process is switched, the cache logically must be flushed; otherwise we get false hits
  - Cost is the time to flush + "compulsory" misses from the empty cache
- Dealing with aliases (sometimes called synonyms): two different virtual addresses map to the same physical address
- I/O must interact with the cache, so it needs virtual addresses
- Solutions to aliases:
  - HW guarantees that every cache block has a unique physical address
  - SW guarantee: the lower n bits of aliases must be the same; as long as n covers the index field, this is called page coloring
- Solution to cache flushes: add a process-identifier tag that identifies the process as well as the address within the process, so there cannot be a hit for the wrong process
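The page-coloring rule on slide 8 can be made concrete. Below is a minimal sketch in C (mine, not from the slides), assuming a hypothetical cache with 32-byte blocks and 256 sets; the function name and parameters are illustrative. Two virtual addresses that alias the same physical address are harmless to a virtually indexed cache exactly when they agree in every bit used to form the index.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative cache geometry (assumptions, not from the slides). */
    #define BLOCK_BITS 5                          /* 32-byte blocks */
    #define INDEX_BITS 8                          /* 256 sets      */
    #define INDEX_TOP  (BLOCK_BITS + INDEX_BITS)  /* = 13 here     */

    /* Page coloring: two virtual aliases of the same physical address
       are safe for a virtually indexed cache only if they agree in
       every bit that forms the cache index (block offset + index). */
    bool aliases_are_safe(uintptr_t va1, uintptr_t va2)
    {
        uintptr_t index_mask = ((uintptr_t)1 << INDEX_TOP) - 1;
        return (va1 & index_mask) == (va2 & index_mask);
    }

    int main(void)
    {
        /* Same low 13 bits: the OS "colored" these pages consistently. */
        printf("%d\n", aliases_are_safe(0x00012340, 0x0009A340)); /* 1 */
        /* Different index bits: the aliases would land in two sets.   */
        printf("%d\n", aliases_are_safe(0x00012340, 0x0009B340)); /* 0 */
        return 0;
    }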
[Slide 9] Virtually Addressed Caches
[Figure: three organizations. Conventional: CPU -> TB (translation buffer) -> physically addressed cache -> memory. Virtually addressed cache: CPU -> cache (VA tags) -> TB -> memory; translate only on a miss; synonym problem. Overlapped: cache access proceeds in parallel with VA translation using physical tags and an L2 cache; requires the cache index to remain invariant across translation]

[Slide 10] 2. Fast Cache Hits by Avoiding Translation: Index with Physical Portion of Address
- If the index uses only the physical part of the address (the page offset), tag access can start in parallel with translation, so the comparison can be made against the physical tag
- This limits the cache to the page size: what if we want bigger caches while using the same trick?
  - Higher associativity moves the barrier to the right
  - Page coloring
[Figure: 32-bit address split as Page Address (bits 31..12) | Page Offset (bits 11..0), overlaid with Address Tag | Index | Block Offset]

[Slide 11] Pipeline Tag Check and Update Cache as Separate Stages
- The current write's tag check overlaps the previous write's cache update
- Only STOREs are in this pipeline; it is empty during a miss
- Example sequence: Store r2, (r1): check r1; Add; Sub; Store r4, (r3): M[r1] <- r2 & check r3

[Slide 23] DRAM History
- DRAMs: capacity +6...
- Computers can use any generation of DRAM
- Commodity, second-source industry = high volume, low profit, conservative
- Little organizational innovation in 20 years
- Order of importance: 1) cost/bit, 2) capacity
- First RAMBUS: 10X BW, +30% cost = little impact

[Slide 24] DRAM Future: 1 Gbit DRAM

                  Mitsubishi      Samsung
  Blocks          512 x 2 Mbit    1024 x 1 Mbit
  Clock           200 MHz         250 MHz
  Data pins       64              16
  Die size        24 x 24 mm      31 x 21 mm
  Metal layers    3               4
  Technology      0.15 micron     0.16 micron

  (Die sizes will be much smaller in production.)

[Slide 25] Fast Memory Systems: DRAM-Specific
- Multiple CAS accesses: goes by several names (page mode)
  - Extended Data Out (EDO): 30% faster in page mode
- New DRAMs to address the gap; what will they cost, and will they survive?
  - RAMBUS: a startup company; reinvents the DRAM interface
    - Each chip is a module, vs. a slice of memory
    - Short bus between CPU and chips
    - Does its own refresh
    - Variable amount of data returned
    - 1 byte / 2 ns (500 MB/s per chip)
    - 20% increase in DRAM area
  - Synchronous DRAM: 2 banks on chip, a clock signal to the DRAM, transfers synchronous to the system clock (66-150 MHz)
  - Intel claims RAMBUS Direct (16 bits wide) is the future of PC memory? Possibly not true! Intel to drop RAMBUS?
  - Niche memory or main memory? E.g., video RAM for frame buffers: DRAM + a fast serial output

[Slide 26] Main Memory Performance
- Simple: CPU, cache, bus, and memory all the same width (32 or 64 bits)
- Wide: CPU/mux 1 word; mux/cache, bus, and memory N words (Alpha: 64 bits; UltraSPARC: 512)
- Interleaved: CPU, cache, and bus 1 word; memory N modules (e.g., 4 modules); the example is word interleaved

[Slide 27] Interleaving
[Figure: access pattern without interleaving: the CPU starts the access for D2 only after D1 is available. With 4-way interleaving: accesses to banks 0-3 are started back to back and overlap]

[Slide 28] Main Memory Performance
- Timing model (word size is 32 bits): 1 cycle to send the address, 6 cycles access time, 1 cycle to send the data; a cache block is 4 words
- Simple miss penalty = 4 x (1 + 6 + 1) = 32
- Wide miss penalty = 1 + 6 + 1 = 8
- Interleaved miss penalty = 1 + 6 + 4 x 1 = 11
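Slide 28's arithmetic is easy to check mechanically: the three organizations differ only in how many times the (address, access, transfer) sequence is paid per 4-word block. A small sketch in C that just encodes the slide's timing model; the constant names are mine:

    #include <stdio.h>

    /* Slide 28's timing model: 1 cycle to send the address, 6 cycles
       of access, 1 cycle per word transferred; a block is 4 words. */
    enum { ADDR = 1, ACCESS = 6, XFER = 1, BLOCK_WORDS = 4 };

    int main(void)
    {
        /* Simple: every word pays the full address+access+transfer cost. */
        int simple = BLOCK_WORDS * (ADDR + ACCESS + XFER);    /* 32 */

        /* Wide: the whole block moves in one access. */
        int wide = ADDR + ACCESS + XFER;                      /*  8 */

        /* Interleaved: the banks' accesses overlap, so only the
           one-cycle-per-word transfers serialize on the bus. */
        int interleaved = ADDR + ACCESS + BLOCK_WORDS * XFER; /* 11 */

        printf("simple=%d wide=%d interleaved=%d\n",
               simple, wide, interleaved);
        return 0;
    }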
[Slide 29] Independent Memory Banks
- Memory banks for independent accesses, vs. faster sequential accesses:
  - Multiprocessor
  - I/O
  - CPU with hit under n misses, non-blocking cache
- Superbank: all memory active on one block transfer (or just "bank")
- Bank: portion within a superbank that is word interleaved (or "subbank")
[Figure: superbank/bank organization]

[Slide 30] Independent Memory Banks (cont.)
- How many banks? number of banks >= number of clocks to access a word in a bank
  - This holds for sequential accesses; otherwise the CPU returns to the original bank before it has the next word ready (as in the vector case); a sketch of this mapping appears at the end of this outline
- Increasing DRAM capacity = fewer chips = harder to have banks

[Slide 31] DRAMs per PC over Time

  Minimum        DRAM generation
  memory size    '86     '89     '92     '96     '99     '02
                 1 Mb    4 Mb    16 Mb   64 Mb   256 Mb  1 Gb
  4 MB           32      8
  8 MB                   16      4
  16 MB                          8       2
  32 MB                                  4       1
  64 MB                                  8       2
  128 MB                                         4       1
  256 MB                                         8       2

[Slide 32] DRAM Latency vs. BW
- More application bandwidth = more cache misses = more DRAM RAS/CAS activity
- Application BW = lower DRAM latency
- RAMBUS and synchronous DRAM increase BW, but at higher latency
- EDO DRAM ...

[Slide 33] Potential: DRAM Crossroads?
- After 20 years of 4X every 3 years, are we running into a wall? (64 Mb - 1 Gb)
- How can $1B fab lines be kept full if each computer buys fewer DRAMs?
- Cost/bit -0%/yr if the 4X/3 yr scaling stops?
- What will happen to the $40B/yr DRAM industry?

[Slide 34] Main Memory Summary
- Wider memory
- Interleaved memory: for sequential or independent accesses
- Avoiding bank conflicts: SW & HW
- DRAM-specific optimizations: page mode & specialty DRAMs

[Slide 35] Cache Cross-Cutting Issues
- Superscalar CPU & the number of cache ports must match: how many memory accesses per cycle?
- Speculative execution needs a non-faulting option on memory/TLB accesses
- Parallel execution vs. cache locality: want wide separation to find independent operations, vs. want reuse of data accesses to avoid misses
- I/O and consistency of data between cache and memory:
  - Caches mean multiple copies of data
  - Consistency by HW or by SW?
  - Where to connect I/O to the computer?

[Slide 36] Alpha 21064
- Separate instruction & data TLBs & caches
- TLBs fully associative; TLB updates in SW ("Priv Arch Libr")
- Caches 8 KB direct mapped, write through
- Critical 8 bytes first
- Prefetch instruction stream buffer
- 2 MB L2 cache, direct mapped, write back (off chip)
- 256-bit path to main memory, 4 x 64-bit modules
- Victim buffer: to give reads priority over writes
- 4-entry write buffer between D$ & L2$
[Figure: block diagram with stream buffer, write buffer, and victim buffer on the instruction and data paths]

[Slide 37] Alpha Memory Performance: Miss Rates of SPEC92
- 8K I$, 8K D$, 2M L2
[Figure: miss-rate chart; representative points: I$ miss = 2%, D$ miss = 13%, L2 miss = 0.6%; I$ miss = 1%, D$ miss = 21%, L2 miss = 0.3%; I$ miss = 6%, D$ miss = 32%, L2 miss = 10%]

[Slide 38] Alpha CPI Components
[Figure: stacked CPI chart; instruction stall from branch mispredict (green); data cache (blue); instruction cache (yellow); L2$ (pink); other: compute + register conflicts, structural conflicts]

[Slide 39] Pitfall: Predicting Cache Performance from Different Programs (ISA, compiler, ...)
- Is the 4 KB data cache miss rate 8%, 12%, or 28%?
- Is the 1 KB instruction cache miss rate 0%, 3%, or 10%?
- Alpha vs. MIPS for an 8 KB data cache: 17% vs. 10%; why 2X Alpha vs. MIPS?
[Figure: miss-rate curves labeled D$ Tom, D$ gcc, D$ esp, I$ gcc, I$ esp, I$ Tom]

[Slide 40] Pitfall: Simulating Too Small an Address Trace
- I$ = 4 KB, B = 16 B
- D$ = 4 KB, B = 16 B
- L2 = 512 KB, B = 128 B
- MP = 12, 200
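As referenced under slide 30, here is a minimal sketch in C of word-interleaved bank mapping and the bank-count rule of thumb. The mapping and the rule come from slides 29-30; the function names and the modulo formulation are my illustrative assumptions:

    #include <stdio.h>

    /* Word interleaving: consecutive word addresses go to consecutive
       banks, so sequential accesses rotate through all the banks. */
    unsigned bank_of(unsigned word_addr, unsigned nbanks)
    {
        return word_addr % nbanks;
    }

    /* Slide 30's rule of thumb: for sequential accesses you need at
       least as many banks as clocks to access a word in a bank, or the
       CPU returns to a bank before it has the next word ready. */
    unsigned min_banks(unsigned bank_access_clocks)
    {
        return bank_access_clocks;
    }

    int main(void)
    {
        for (unsigned w = 0; w < 8; w++)
            printf("word %u -> bank %u\n", w, bank_of(w, 4));
        printf("6-clock banks need >= %u banks\n", min_banks(6));
        return 0;
    }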