Notes on the Pentium IV
=======================

northwood - earlier
prescott - updated

an add [m], r1 has many steps -- that is why we have superscalar

front end - decode and store instructions
back end - reorder and execute instructions

x86 instructions are of variable length, decoded into microops (uops)

uops are regular; same size, same operands, which keeps functional
units simple and separate.

each instruction == 1-4 uops

uops want to run out of order

something needs to keep instructions together
 - p6 : reservation station
 - p4 : planner

logical unit restores results in the order expected by the program

since you need to keep the results of intermediate calculations
somewhere, 8 registers is not enough; pentium 4 has 128 rotating
renamable registers

trace cache : after decoder but before processor units. stores uops.
if empty, decoding happens at only one x86 instruction/cycle. if
found, decoded instructions are moved out at up to 6 uops per two
cycles (i.e. half the frequency). hits in trace cache around 75-95%,
holds 12,000 uops

complex instructions turn into mrom vectors; like a "macro" for uops
in microcode. saves space in the trace cache.

branch prediction unit foresees jumps. if it misses, the pipeline is
stopped and buffers flushed. stores ~4,000 branches in the Branch
History Table.

data prefetch works with branch prediction system and memory
subsystem. knows that a load might take a long time from memory.

memory
 - p4 has two cache levels.
 - l2 plays major part. 1mb on prescott
 - l1 small (16K) compared to p3 but faster (2 cycle latency)
 - advanced transfer cache is a 256 bit bus between processor and l1
   p4 copies 256 bits every cycle (p3 every second cycle)
 - l3 -> l2 via 64 bit bus, l3 duplicates l2

alus in netburst
 - integer units divided into two fast (some ops), one slow (more
   complex)
 - fast runs at 2x clock
   - adds in 16 bit increments in two stages + one for flags
   - latency trade off; a full add now takes 3 half cycles (1.5 cycles)
   - however good for dependent operations.
 - slow alu is actually a 'virtual unit' which is more like a scheduler
   - thus shifts might go faster as 4 adds
 - prescott got an integer multiplication unit and shifts in the fast
   ALU, but the fast ALU runs at clock speed.
 - shift v add means recompile!

cache
 - l1 -> registers slower than l2 -> l1
   - this is because l1 needs to be found fast, if not there then
     moved in fast
 - latencies to integer, mmx and sse registers all different
 - northwood couldn't prefetch across virtual pages with no TLB,
   prescott changed that
 - prescott uses more bits to tag cache, so fewer aliasing issues (4mb,
   rather than 64kb)
   - longer tags mean higher penalty on miss, but fewer misses

pipeline
 - after trace cache
 - 20 stages, prescott even bigger (31)

   trace -> fetch -> queue -> schedulers -> dispatch -> execute -> retire

 - retirement restores program order and writes to memory (bases
   that on data written into buffers earlier in the pipeline)
 - there are two queues in the queue step, one for address operations
   and another for everything else. this changes which schedulers they
   end up in for the next step. address queue is 16 uops deep, other
   is 32 uops.
 - 5 schedulers; fast0/1 with alu, slow0/1 with fpu, multiplication,
   etc, mem for memory (all from addr queue sent here).
 - scheduler queues correspond to reservation stations in p6 core.
   this is where out of order selection occurs. uops are sent when the
   execution unit is free and operands are available.
 - scheduler queue length gives quality of out of order execution.
   windows: slow1 = 12, slow0 = 10, others = 8
 - ht
   - netburst trace cache allows better multithreading because the
     decoder has already done some of the work, unlike in p6 core.
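
a toy sketch of the out-of-order selection the scheduler queues do
(pipeline notes above): each cycle, look at a limited window of waiting
uops and send the oldest one whose operands are available. this is only
an illustration, not how the p4 is actually wired; the names Uop and
schedule, the single issue port, the made-up latencies and the
assumption that every destination register has already been renamed to
a unique name are all mine.

```python
from dataclasses import dataclass


@dataclass
class Uop:
    name: str         # label for the uop (hypothetical)
    srcs: tuple       # registers it reads
    dst: str          # register it writes (assumed already renamed, so unique)
    latency: int = 1  # cycles until the result is available (made up)


def schedule(uops, ready_regs, window=8):
    """Toy one-port scheduler: every cycle issue the oldest uop inside
    the window whose source operands are all available."""
    ready = set(ready_regs)   # registers whose values are available
    pending = list(uops)      # waiting uops, in program order
    in_flight = []            # (finish_cycle, dst) for issued uops
    issued = []               # (cycle, name) in issue order
    cycle = 0
    while pending or in_flight:
        # results finishing by this cycle become visible to waiting uops
        for done in [f for f in in_flight if f[0] <= cycle]:
            ready.add(done[1])
            in_flight.remove(done)
        # only the first `window` waiting uops are candidates
        sent = False
        for u in pending[:window]:
            if all(s in ready for s in u.srcs):
                pending.remove(u)
                in_flight.append((cycle + u.latency, u.dst))
                issued.append((cycle, u.name))
                sent = True
                break
        if pending and not sent and not in_flight:
            raise ValueError("uops in the window wait on operands nobody produces")
        cycle += 1
    return issued


demo = [
    Uop("load", ("r1",), "t0", latency=3),
    Uop("add", ("t0", "r2"), "t1"),   # depends on the load
    Uop("sub", ("r3", "r4"), "t2"),   # independent
]
regs = ("r1", "r2", "r3", "r4")

print(schedule(demo, regs, window=8))  # [(0, 'load'), (1, 'sub'), (3, 'add')]
print(schedule(demo, regs, window=1))  # [(0, 'load'), (3, 'add'), (4, 'sub')]
```

with a window of 8 the independent sub slips ahead of the add that is
waiting on the load; with a window of 1 nothing can be reordered and
the sub is stuck behind the add, which is the sense in which queue
length gives the quality of out of order execution.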