niagara
=======

IEEE Micro, vol 25, issue 2

implementation of SPARC V9.
expected to use around 60 W (Montecito, a dual-core Itanium 2, uses 100 W).

in server workloads ILP is usually low and TLP is usually high.
large working sets destroy the cache.
the paper argues that a complex ILP processor doesn't scale much beyond a single-issue one.

32 threads of execution.
4 threads make a "thread group".
a thread group shares a "sparc pipe"; niagara has 8 of these.
each sparc pipe contains its L1 caches.
all pipes share an L2 cache (3 MB; 4-way banked, 12-way set associative).

commercial code has data sharing, meaning cache coherence traffic is high.
with SMP this traffic goes over a bus; with niagara it stays on chip.

chip to memory is a 200 GB/s crossbar.

sparc pipeline
==============

each thread (of four) has its own registers and instruction/store buffers.
the threads share the L1 caches, TLB, execution units and pipeline registers.

single-issue pipeline; two instructions fetched per cycle (a predecode bit indicates long-latency instructions).

1) fetch - instruction cache and ITLB accessed
2) thread select
3) decode
4) execute
5) memory - DTLB and data cache
6) write-back

first two stages are replicated for each thread.

ALU and shift instructions have single-cycle latency.
multiply and divide operations are long latency and cause a thread switch.

* thread selection
  policy is to switch between threads every cycle, giving priority to the least recently run thread.
  the thread scheduler assumes loads are cache hits and issues dependent instructions speculatively (but speculative instructions are given lower priority than non-speculative ones).

* register file
  register windows; a window has eight in, eight local and eight out registers.
  each thread has eight register windows.
  there are four such register sets, one for each of the four threads.
  the register file contains 640 64-bit registers.
  a procedure call requests a new window; the caller's outputs become the callee's inputs.
  returning slides the window back down.
  the working set is held in "fast registers", whilst the other registers are in slower six-transistor SRAM cells.
  a transfer between the two takes 1-2 cycles and stalls the thread.

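a rough sketch of how the register-file numbers above could add up to 640. the notes give the window size, the window count and the thread count. the overlap between adjacent windows and the four sets of eight globals per thread are assumptions taken from general SPARC V9 background, not from these notes. the slot layout and the call direction in `physical_index` are hypothetical, chosen only to show "outputs become inputs".

```python
THREADS_PER_PIPE = 4      # from the notes
WINDOWS_PER_THREAD = 8    # from the notes
REGS_PER_GROUP = 8        # eight in, eight local, eight out (from the notes)

# assumption: adjacent windows overlap (one window's outs are the next
# window's ins), so each window adds only 16 new registers, not 24
NEW_REGS_PER_WINDOW = 2 * REGS_PER_GROUP

# assumption: each thread also has four sets of eight global registers
GLOBAL_SETS_PER_THREAD = 4


def registers_per_thread():
    windowed = WINDOWS_PER_THREAD * NEW_REGS_PER_WINDOW
    global_regs = GLOBAL_SETS_PER_THREAD * REGS_PER_GROUP
    return windowed + global_regs


def registers_per_pipe():
    return THREADS_PER_PIPE * registers_per_thread()


def physical_index(window, group, i):
    """Map (window, 'in' | 'local' | 'out', i) to a slot in one thread's
    circular file of windowed registers (hypothetical layout)."""
    if group == "in":
        base = window * NEW_REGS_PER_WINDOW
    elif group == "local":
        base = window * NEW_REGS_PER_WINDOW + REGS_PER_GROUP
    elif group == "out":
        # outs alias the ins of the next window, wrapping around
        base = ((window + 1) % WINDOWS_PER_THREAD) * NEW_REGS_PER_WINDOW
    else:
        raise ValueError(group)
    return base + i


print(registers_per_thread())   # 160
print(registers_per_pipe())     # 640
# a call from window 0 into window 1 sees the caller's out 3 as its in 3
print(physical_index(0, "out", 3) == physical_index(1, "in", 3))   # True
```

under these assumptions each thread needs 128 windowed registers plus 32 globals, which matches the 640 quoted for four threads.
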
* memory subsystem
  the L1 cache uses random replacement.
  8 KB L1 cache, 4-way set associative, 16-byte lines, write-through.
  miss rates of around 10% are quoted.
  applications with large data sets would need much bigger L1 caches to get a lower miss rate; instead, having multiple threads ready to go hides the latency of L1/L2 misses.
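
the L1 geometry quoted above (8 KB, 4-way, 16-byte lines) fixes the number of sets and how an address splits into fields. a minimal sketch, assuming plain power-of-two indexing on the low address bits; the function names are illustrative and the real indexing scheme is not described in these notes.

```python
CACHE_BYTES = 8 * 1024   # 8 KB, from the notes
WAYS = 4                 # 4-way set associative, from the notes
LINE_BYTES = 16          # 16-byte lines, from the notes


def geometry():
    """Return (lines, sets, offset_bits, index_bits) for the cache."""
    lines = CACHE_BYTES // LINE_BYTES
    sets = lines // WAYS
    offset_bits = LINE_BYTES.bit_length() - 1   # log2 of a power of two
    index_bits = sets.bit_length() - 1
    return lines, sets, offset_bits, index_bits


def split_address(addr):
    """Split an address into (tag, set index, byte offset).
    assumption: simple modulo indexing on the low address bits."""
    _, sets, offset_bits, index_bits = geometry()
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> offset_bits) & (sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


print(geometry())               # (512, 128, 4, 7)
print(split_address(0x12345))   # (36, 52, 5)
```

with only 128 sets of four 16-byte lines, a large working set will keep evicting lines, which is consistent with the roughly 10% miss rate quoted above.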