Branch Prediction: Case For Branch Prediction When Issue N Instructions Per Clock Cycle
Branch Prediction: Case For Branch Prediction When Issue N Instructions Per Clock Cycle
Easy to predict (direction and address) Easy to predict? (maybe not/ maybe dynamically)
not taken
2.5
taken • 1-bit branch prediction scheme – simplest.
2 • Branch History Table (BHT): Lower bits of PC address index
1.5 table of 1-bit values
– Says whether or not branch taken last time
1
– No address check (saves HW, but may not be right branch)
0.5 • Problem: in a loop, 1-bit BHT will cause
0 2 mispredictions (avg is 9 iterations before exit):
– End of loop case, when it exits instead of looping as before
5
20
35
50
65
80
95
5
0
00
– First time through loop on next time through code, when it predicts exit
-8
-7
-5
-4
-2
-1
-1
How big is the BHT? • (2,2) predictor: 2-bit 2-bit recent global
global, 2-bit local branch history
(01 = not taken then taken)
Correlating Branches (2-level Branch Predictors) Example
BNEZ R1, L1 ; branch b1 (d!=0)
If (d == 0)
ADDI R1, R0, #1 ; d==0, so d=1
d = 1;
Branch address (k bits) L1: SUBI R3, R1, #1
if (d==1)
BNEZ R3, L2 ;branch b2 (d!=1)
(m, n) branch predictor n-bits per branch …
local predictors L2:
m-bit global history
n-bit local predictor
Local history only: Behavior of a 1-bit predictor initialized to NT (not taken)
Example Example
BNEZ R1, L1 ; branch b1 (d!=0) BNEZ R1, L1 ; branch b1 (d!=0)
If (d == 0) If (d == 0)
ADDI R1, R0, #1 ; d==0, so d=1 ADDI R1, R0, #1 ; d==0, so d=1
d = 1; d = 1;
L1: SUBI R3, R1, #1 L1: SUBI R3, R1, #1
if (d==1) if (d==1)
BNEZ R3, L2 ;branch b2 (d!=1) BNEZ R3, L2 ;branch b2 (d!=1)
… …
L2: L2:
The action of the 1-bit predictor with 1-bit correlation -- (1,1) predictor The action of the 1-bit predictor with 1-bit correlation -- (1,1) predictor
Initialized to NT/NT Initialized to T/T
Initial value Pred b1 real b1 Pred b2 real b2 Initial value Pred b1 real b1 Pred b2 real b2
of d of d
2 NT/NT T NT/NT T 2 T/T T T/T T
0 T/NT NT NT/T NT 0 T/T NT T/T NT
2 T/NT T NT/T T 2 T/NT T NT/T T
0 T/NT NT NT/T NT 0 T/NT NT NT/T NT
Prediction
Example Example
NT T
Branch PC
00 00 00 00 00 00
NT
0 0 0
The action of the 2-bit predictor with 1-bit correlation -- (1,2) predictor The action of the 2-bit predictor with 2-bit correlation -- (2,2) predictor
Initialized to NT(00)/NT(00) Initialized to NT/NT/NT/NT
Initial Pred b1
real Pred b2 real Initial value Pred b1 real b1 Pred b2 real b2
d =? b1 b2 of d
2 NT(00)/NT(00) T NT(00)/NT(00) T 2 NT/NT/NT/NT T NT/NT/NT/NT T
0 NT(01)/NT(00) NT NT(00)/NT(01) NT 0 NT NT
2 NT(01)/NT(00) T NT(00)/NT(01) T 2 T T
0 T(11)/NT(00) NT NT(00)/T(11) NT 0 NT NT
18%
• Several of the SPEC benchmarks have less
4096 Entries 2-bit BHT
than a dozen branches responsible for 90%
Frequency of Mispredictions
16%
14%
Unlimited Entries 2-bit BHT of taken branches:
12%
1024 Entries (2,2) BHT program branch % static # = 90%
11%
10%
compress 14% 236 13
eqntott 25% 494 5
8%
6% 6% 6% gcc 15% 9531 2020
mpeg 10% 5598 532
6% 5% 5%
4%
4%
real gcc 13% 17361 3214
2% 1%
0%
1% • Real programs + OS more like gcc
• Small benefits beyond benchmarks for
0%
0%
Example Example
T T
NT NT
T(11) T(10) PC of b1 T(11) T(10) PC of b1
T 10 01 10 00 T 10 01 10 00
NT T NT T
T PC of b2 T PC of b2
T 01 01 00 11 T 01 01 00 11
NT(01) NT(00) ? NT(01) NT(00) ?
NT NT
NT NT
0 0 0 0
The action of the 2-bit (per address (local) history) predictor with 2-bit (global history) correlation -- The action of the 2-bit (per address (local) history) predictor with 2-bit (global history) correlation --
(2,2) predictor (GAp predictor (Yeh&Patt’s terminology)) (2,2) predictor (GAp predictor (Yeh&Patt’s terminology))
Pred b1 real b1 Pred b2 real b2 Pred b1 real b1 Pred b2 real b2
T(10)/NT(01)/T(10)/NT(00) NT NT(01)/NT(01)/NT(00)/T(11) T T(10)/NT(01)/T(10)/NT(00) NT NT(01)/NT(01)/NT(00)/T(11) T
T NT NT(00)/NT(01)/T(10)/NT(00) T T(11)/NT(01)/NT(00)/T(11) NT
T NT NT(00)/T(11)/T(10)/NT(00) T T(11)/NT(01)/NT(00)/T(10) NT
T NT NT(00)/T(11)/T(11)/NT(00) T T(11)/NT(00)/NT(00)/T(10) NT
NT NT NT(00)/T(11)/T(11)/NT(00) NT T(11)/NT(00)/NT(00)/T(10) NT
T T NT(00)/T(11)/T(10)/NT(00) T T(10)/NT(00)/NT(00)/T(10) T
NT NT NT(01)/T(11)/T(10)/NT(00) NT T(10)/NT(01)/NT(00)/T(10) NT
Accuracy of Different Schemes
20% Re-evaluating Correlation
18%
16%
14%
Unlimited Entries 2-bit BHT of taken branches:
12%
1024 Entries (2,2) BHT program branch % static # = 90%
11%
10%
compress 14% 236 13
eqntott 25% 494 5
8%
6% 6% 6% gcc 15% 9531 2020
mpeg 10% 5598 532
6% 5% 5%
4%
4%
real gcc 13% 17361 3214
2% 1%
0%
1% • Real programs + OS more like gcc
• Small benefits beyond benchmarks for
0%
0%
PC of instruction
6%
FETCH
5%
Correlating
4%
3%
2%
Tournament
1% =? Extra
Yes: instruction is prediction state
branch and use bits
0% No: branch not predicted PC as
predicted, proceed normally
0 8 16 24 32 40 48 56 64 72 80 88 96 104 112 120 128
next PC
Total predictor size (Kbits) (Next PC = PC+4)
Predicated Execution
Overriding Predictors • Avoid branch prediction by turning branches
into conditionally executed instructions:
• Big predictors are slow, but more accurate if (x) then A = B op C else NOP
– If false, then neither store result nor cause exception
x
• Use a single cycle predictor in fetch – Expanded ISA of Alpha, MIPS, PowerPC, SPARC have
conditional move; PA-RISC can annul any following
• Start the multi-cycle predictor instr. A=
– When it completes, compare it to the fast prediction. – IA-64: 64 1-bit condition fields selected B op C
» If same, do nothing so conditional execution of any instruction
» If different, assume the slow predictor is right and flush – This transformation is called “if-conversion”
pipline.
• Advantage: reduced branch penalty for those • Drawbacks to conditional instructions
branches mispredicted by the fast predictor and – Still takes a clock even if “annulled”
correctly predicted by the slow predictor – Stall if condition evaluated late
– Complex conditions reduce effectiveness;
condition becomes known late in pipeline
Pitfall: Sometimes bigger and
Special Case Return Addresses dumber is better
• 21264 uses tournament predictor (29 Kbits)
• Register Indirect branch hard to predict • Earlier 21164 uses a simple 2-bit predictor
address with 2K entries (or a total of 4 Kbits)
• SPEC95 benchmarks, 21264 outperforms
• SPEC89 85% such branches for procedure
– 21264 avg. 11.5 mispredictions per 1000 instructions
return – 21164 avg. 16.5 mispredictions per 1000 instructions
• Since stack discipline for procedures, save • Reversed for transaction processing (TP) !
return address in small buffer that acts like – 21264 avg. 17 mispredictions per 1000 instructions
a stack: 8 to 16 entries has small miss rate – 21164 avg. 15 mispredictions per 1000 instructions
• TP code much larger & 21164 hold 2X
branch predictions based on local behavior
(2K vs. 1K local predictor in the 21264)
Mis-speculation Recovery
Mis-speculation Recovery • Eliminate Incorrect Path
– Use branch tag(s) to deallocate completion buffer entries occupied by speculative
NT T tag1 instructions (now determined to be mis-speculated).
– Invalidate all instructions in the decode and dispatch buffers, as well as those in
reservation stations
NT T tag2
NT T How expensive is a misprediction?
tag2
NT T NT T NT T NT T • Start New Correct Path
– Update PC with computed branch target (if it was predicted NT)
tag3 tag3 tag3 – Update PC with sequential instruction address (if it was predicted T)
– Can begin speculation once again when encounter a new branch
How soon can you restart?
• Eliminate Incorrect Path
– Must ensure that the mis-speculated instructions produce no side
effects
• Start New Correct Path
– Must have remembered the alternate (non-predicted) path
Fast Branch Rewind and Restart:
Trailing Confirmation
NT T tag1 • Discard all ROB entries (and
corresponding operations) younger
than the mispredicted branches
NT T tag2
NT T
tag2 • Can restart immediately from oldest
NT T NT T NT T NT T the corrected branch target
because the ROB has sufficient
tag3 tag3 tag3 information (rename & value) to
continue from where left off another miss
• Works with nested misprediction
• Trailing Confirmation mispredictions!!
– When branch is resolved, remove/deallocate speculation another miss
youngest
tag
– Permit completion of branch and following instructions
Operand Value/Tag
• To reinitiate renaming:
– wait for all instructions older than the rewind point to drain clear of the
pipeline and then reset register remapping to null
Long restart latency
– Reorder buffer has to remember how to restored the map table to the point
of the mispredicted branch
Complicated multi-cycle logic
- Cache rename map after branch prediction