PPC970 - G5
======

g4e: "wide and shallow". 9 functional units; front end issues 4 instructions/clock (3 + branch). each functional unit has a short pipeline.

p4: "narrow and deep". fewer functional units, instructions pushed through faster. 126 instructions in flight.

970: "wide and deep". 16-stage pipeline (pentium 4: > 20). up to 200 instructions in flight.

970 has a bigger l1 cache than the g4, because of the longer pipeline.

branch prediction: p4 uses a 4096-entry branch history table. 970 has a 16k-entry BHT & a 128-entry branch target instruction cache.

970 fetches 8 instructions/cycle from l1 into the instruction queue. p4 only takes 1 instruction/cycle from l1, but the comparison can be misleading, since the trace cache takes fetch & decode out of the critical path.

970 front end can issue up to 3 instructions/cycle into 3 issue queues.

970 breaks instructions into iops (internal operations), which are executed out of order by the core. squeezes out some extra ilp.

- cracked instruction = 2 iops
- millicoded instruction = more than 2 iops

iops are arranged into groups of 5, according to certain rules and restrictions. a group is in program order.

the issue queues issue the groups' iops to the execution units. combined, the queues can issue 8 iops/cycle.

groups reduce the number of instructions to track in flight: the 970 tracks 20 groups instead of individual instructions (20 groups x 5 iops = 100 of the up to 200 in flight; the rest are still in the fetch/decode stages). each group has a group completion table entry, the equivalent of a reorder buffer entry.

groups can fragment with bad code, leaving slots empty.
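
a rough sketch of how group fragmentation can waste slots. this is not the 970's real grouping logic: the notes only say "certain rules and restrictions", so the one rule used here (all iops of an instruction stay in the same group when they fit) is an assumption, and `Instr`, `form_groups` and `slot_utilisation` are made-up names for illustration.

```python
from dataclasses import dataclass

GROUP_SIZE = 5  # iops per group, per the notes


@dataclass
class Instr:
    name: str
    iops: int = 1  # 1 = simple, 2 = cracked, > 2 = millicoded


def form_groups(instrs, group_size=GROUP_SIZE):
    """Greedily pack instructions into groups, in program order.

    Assumed rule (not from the notes): an instruction's iops are kept
    in one group, so the current group is closed early when the next
    instruction does not fit. An instruction with more iops than a
    group holds is split across consecutive groups.
    """
    groups, current, used = [], [], 0
    for ins in instrs:
        remaining = ins.iops
        # Close a partly filled group if this instruction will not fit.
        if used and used + remaining > group_size:
            groups.append(current)
            current, used = [], 0
        while remaining > 0:
            take = min(remaining, group_size - used)
            current.append((ins.name, take))
            used += take
            remaining -= take
            if used == group_size:
                groups.append(current)
                current, used = [], 0
    if current:
        groups.append(current)
    return groups


def slot_utilisation(groups, group_size=GROUP_SIZE):
    """Fraction of group slots that actually hold an iop."""
    filled = sum(n for group in groups for _, n in group)
    return filled / (len(groups) * group_size)


if __name__ == "__main__":
    simple = [Instr("add")] * 10         # ten single-iop instructions
    cracked = [Instr("cracked", 2)] * 4  # four cracked instructions
    for label, code in (("simple", simple), ("cracked", cracked)):
        groups = form_groups(code)
        print(label, len(groups), "groups,", slot_utilisation(groups))
```

under this assumed rule, a run of cracked instructions leaves one slot empty in every full group, so the same number of group completion table entries covers fewer iops.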