Branch Prediction: The Case for Branch Prediction When Issuing N Instructions per Clock Cycle

1. Branch prediction is important for processors that can issue multiple instructions per cycle, since branches arrive faster than the processor can handle them without prediction.
2. By Amdahl's law, the relative impact of control stalls due to branches is larger for processors that can issue more instructions per cycle.
3. Common branch prediction schemes include 1-bit and 2-bit prediction buffers, correlating branch predictors that use histories of recent branches, and branch target buffers that predict target addresses.


Branch Prediction

Case for Branch Prediction when Issuing N Instructions per Clock Cycle

1. Branches will arrive up to n times faster in an n-issue processor.
2. Amdahl's Law => the relative impact of control stalls will be larger with the lower potential CPI of an n-issue processor.
   Conversely, branch prediction is needed to "see" the potential parallelism.

Branch Prediction Schemes

1. 1-bit Branch-Prediction Buffer
2. 2-bit Branch-Prediction Buffer
3. Correlating Branch Prediction
4. Tournament Branch Predictor
5. Branch Target Buffer
6. Integrated Instruction Fetch Units
7. Return Address Predictors

Parts of the predictor

• Direction Predictor
  – For conditional branches
    » Predicts whether the branch will be taken
  – Examples:
    » Always taken; backwards taken
• Address Predictor
  – Predicts the target address (used if predicted taken)
  – Examples:
    » BTB; Return Address Stack; Precomputed Branch
• Recovery logic
Example gzip:

• gzip: loop branch A @ 0x1200098d8
  – Executed: 1359575 times
  – Taken: 1359565 times
  – Not-taken: 10 times
  – % time taken: 99% - 100%
  Easy to predict (direction and address)

Example gzip:

• gzip: if branch B @ 0x12000fa04
  – Executed: 151409 times
  – Taken: 71480 times
  – Not-taken: 79929 times
  – % time taken: ~49%
  Easy to predict? (maybe not / maybe dynamically)

Branch Backwards

[Figure: histogram of % of total branches versus distance of the branch target, split into taken and not-taken branches.]

Most backward branches are heavily TAKEN.
Forward branches are slightly more likely to be NOT-TAKEN.
Ref: The Effects of Predicated Execution on Branch Prediction

Dynamic Branch Prediction

• Performance = f(accuracy, cost of prediction and misprediction)
• 1-bit branch prediction scheme is the simplest.
• Branch History Table (BHT): lower bits of the PC address index a table of 1-bit values
  – Says whether or not the branch was taken last time
  – No address check (saves HW, but the entry may not belong to the right branch)
• Problem: in a loop, a 1-bit BHT causes 2 mispredictions per loop visit (avg. is 9 iterations before exit):
  – End-of-loop case, when the branch exits instead of looping as before
  – First time through the loop on the next pass through the code, when it predicts exit instead of looping
  – Only 80% accuracy even if the loop branch is taken 90% of the time
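To make the 80% figure concrete, here is a toy simulation (not from the slides; plain C) of a 1-bit predictor on a hypothetical loop branch that is taken 9 times and then exits, repeated over many visits. It mispredicts the exit and then the first iteration of the next visit, giving roughly 8 out of 10 correct.

    #include <stdio.h>
    #include <stdbool.h>

    int main(void) {
        /* Hypothetical loop branch: taken 9 times, then not taken (exit),
           repeated for 100 visits to the loop.                            */
        bool last_outcome = true;          /* 1-bit BHT entry: last direction */
        int correct = 0, total = 0;

        for (int visit = 0; visit < 100; visit++) {
            for (int i = 0; i < 10; i++) {
                bool outcome = (i < 9);            /* taken 9 times, then exit      */
                bool prediction = last_outcome;    /* 1-bit scheme: repeat last dir */
                if (prediction == outcome) correct++;
                total++;
                last_outcome = outcome;            /* remember the new direction    */
            }
        }
        printf("1-bit predictor accuracy: %d/%d = %.0f%%\n",
               correct, total, 100.0 * correct / total);   /* prints ~80% */
        return 0;
    }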
Dynamic Branch Prediction (Jim Smith, 1981)

1-bit predictor
• 1-bit history (direction predictor)
  – Remember the last direction taken by a branch
• Branch History Table indexed by branchPC
• State machine: Predict Taken --(NT)--> Predict Not Taken; Predict Not Taken --(T)--> Predict Taken
  (a T outcome keeps Predict Taken, an NT outcome keeps Predict Not Taken)
• How big is the BHT?

• Better solution: a 2-bit scheme where the prediction is changed only after two consecutive mispredictions:
  – States 11, 10: Predict Taken
  – States 01, 00: Predict Not Taken
  – Red: stop, not taken; Green: go, taken
• Adds hysteresis to the decision-making process
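For comparison, a minimal sketch (not from the slides) of the 2-bit saturating-counter scheme on the same hypothetical loop pattern: the hysteresis means a single exit does not flip the prediction, so only the exit itself is mispredicted, for roughly 90% accuracy.

    #include <stdio.h>
    #include <stdbool.h>

    /* 2-bit saturating counter: 0,1 => predict not taken; 2,3 => predict taken. */
    static unsigned counter = 3;                 /* start strongly taken (state 11) */

    static bool predict(void)      { return counter >= 2; }
    static void update(bool taken) {
        if (taken  && counter < 3) counter++;    /* saturate at 3 */
        if (!taken && counter > 0) counter--;    /* saturate at 0 */
    }

    int main(void) {
        int correct = 0, total = 0;
        for (int visit = 0; visit < 100; visit++) {
            for (int i = 0; i < 10; i++) {       /* 9 taken iterations, then exit */
                bool outcome = (i < 9);
                if (predict() == outcome) correct++;
                total++;
                update(outcome);
            }
        }
        printf("2-bit predictor accuracy: %d/%d = %.0f%%\n",
               correct, total, 100.0 * correct / total);   /* prints ~90% */
        return 0;
    }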

2-bit predictor
• 2-bit history (direction predictor)
• Branch History Table indexed by branchPC, 2 bits per branch
• States: SN, NT, T, ST (strongly not-taken, not-taken, taken, strongly taken)
• How big is the BHT?

Correlating Branches (2-level Branch Predictors)

• Idea: the taken/not-taken behavior of recently executed branches is related to the behavior of the next branch (as well as to the history of that branch's own behavior):
    if (aa == 2)
        aa = 0;
    if (bb == 2)
        bb = 0;
    if (aa != bb) {
  – The behavior of recent branches then selects between, say, 4 predictions of the next branch, updating just that prediction
• (2,2) predictor: 2-bit global, 2-bit local

[Figure: a branch address (4 bits) indexes a set of 2-bits-per-branch local predictors; a 2-bit recent global branch history (01 = not taken then taken) selects which of the 4 predictions to use.]
Correlating Branches (2-level Branch Predictors)

[Figure: a branch address (k bits) indexes n-bits-per-branch local predictors; an m-bit recent global branch history selects among them -- the (m, n) branch predictor.]

• The number of bits in an (m, n) predictor is: 2^m x n x 2^k
• A 16-entry (k = 4) (2, 2) predictor has: 2^2 x 2 x 2^4 = 128 bits.

Example
    BNEZ R1, L1        ; branch b1 (d != 0)      | if (d == 0)
    ADDI R1, R0, #1    ; d == 0, so d = 1        |     d = 1;
L1: SUBI R3, R1, #1                              | if (d == 1)
    BNEZ R3, L2        ; branch b2 (d != 1)      |     ...
    ...
L2:

Local history only: behavior of a 1-bit predictor initialized to NT (not taken)

Initial value of d | Pred b1 | real b1 | Pred b2 | real b2
2                  | NT      | T       | NT      | T
0                  | T       | NT      | T       | NT
2                  | NT      | T       | NT      | T
0                  | T       | NT      | T       | NT
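As a sketch of how the (m, n) indexing can work (not from the slides; a minimal C model using the same m = 2, n = 2, k = 4 as the 128-bit example above): the lower k PC bits pick a row of 2-bit counters and the m-bit global history picks which counter in that row to consult and update.

    #include <stdbool.h>
    #include <stdint.h>

    #define M 2                       /* bits of global history            */
    #define N_STATES 4                /* n = 2-bit counters: states 0..3   */
    #define K 4                       /* lower PC bits used as the index   */

    /* 2^k rows, each with 2^m independent 2-bit counters. */
    static uint8_t table[1 << K][1 << M];     /* each entry holds 0..3             */
    static uint8_t global_history;            /* last M branch outcomes, 1 = taken */

    bool predict(uint32_t pc) {
        return table[pc & ((1 << K) - 1)][global_history] >= N_STATES / 2;
    }

    void update(uint32_t pc, bool taken) {
        uint8_t *ctr = &table[pc & ((1 << K) - 1)][global_history];
        if (taken  && *ctr < N_STATES - 1) (*ctr)++;       /* saturating up   */
        if (!taken && *ctr > 0)            (*ctr)--;       /* saturating down */
        /* shift the new outcome into the global history register */
        global_history = (uint8_t)(((global_history << 1) | taken) & ((1 << M) - 1));
    }

Storage here is (2^k rows) x (2^m counters) x (n bits) = 16 x 4 x 2 = 128 bits, matching the slide's calculation.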

Example (same code: branch b1 tests d != 0, branch b2 tests d != 1)

The action of the 1-bit predictor with 1-bit correlation -- (1,1) predictor
Initialized to NT/NT

Initial value of d | Pred b1 | real b1 | Pred b2 | real b2
2                  | NT/NT   | T       | NT/NT   | T
0                  | T/NT    | NT      | NT/T    | NT
2                  | T/NT    | T       | NT/T    | T
0                  | T/NT    | NT      | NT/T    | NT

Example

The action of the 1-bit predictor with 1-bit correlation -- (1,1) predictor
Initialized to T/T

Initial value of d | Pred b1 | real b1 | Pred b2 | real b2
2                  | T/T     | T       | T/T     | T
0                  | T/T     | NT      | T/T     | NT
2                  | T/NT    | T       | NT/T    | T
0                  | T/NT    | NT      | NT/T    | NT
Example

[Figure: branch-PC-indexed table of 2-bit entries, all initialized to 00 (NT).]

The action of the 2-bit predictor with 1-bit correlation -- (1,2) predictor
Initialized to NT(00)/NT(00)

Initial value of d | Pred b1       | real b1 | Pred b2       | real b2
2                  | NT(00)/NT(00) | T       | NT(00)/NT(00) | T
0                  | NT(01)/NT(00) | NT      | NT(00)/NT(01) | NT
2                  | NT(01)/NT(00) | T       | NT(00)/NT(01) | T
0                  | T(11)/NT(00)  | NT      | NT(00)/T(11)  | NT

Example

The action of the 2-bit predictor with 2-bit correlation -- (2,2) predictor
Initialized to NT/NT/NT/NT (the full per-entry states for this case are shown in the (2,2) GAp example below)

Initial value of d | Pred b1     | real b1 | Pred b2     | real b2
2                  | NT/NT/NT/NT | T       | NT/NT/NT/NT | T
0                  |             | NT      |             | NT
2                  |             | T       |             | T
0                  |             | NT      |             | NT


Example (same code as above)

For comparison, the (1,1) predictor initialized to T/T behaves as shown in the earlier table. The (2,2) predictor on the same sequence:

The action of the 2-bit (per-address (local) history) predictor with 2-bit (global history) correlation --
(2,2) predictor (the GAp predictor in Yeh & Patt's terminology)
Initialized to NT(00)/NT(00)/NT(00)/NT(00)

[Figure: the 2-bit state machine T(11), T(10), NT(01), NT(00); the BHT rows for the PC of b1 and the PC of b2 each hold four 2-bit entries, all initialized to 00.]

Initial value of d | Pred b1                     | real b1 | Pred b2                     | real b2
2                  | NT(00)/NT(00)/NT(00)/NT(00) | T       | NT(00)/NT(00)/NT(00)/NT(00) | T
0                  | NT(01)/NT(00)/NT(00)/NT(00) | NT      | NT(00)/NT(01)/NT(00)/NT(00) | NT
2                  | NT(01)/NT(00)/NT(00)/NT(00) | T       | NT(00)/NT(01)/NT(00)/NT(00) | T
0                  | T(11)/NT(00)/NT(00)/NT(00)  | NT      | NT(00)/T(11)/NT(00)/NT(00)  | NT

Example (continued)

The action of the 2-bit (per-address (local) history) predictor with 2-bit (global history) correlation -- the (2,2) GAp predictor -- stepped through a longer branch sequence. (The two slides step through the same sequence; the table below shows the fully filled-in version.)

[Figure: BHT contents at the start of the sequence -- row for the PC of b1: 10 01 10 00; row for the PC of b2: 01 01 00 11.]

Pred b1                     | real b1 | Pred b2                     | real b2
T(10)/NT(01)/T(10)/NT(00)   | NT      | NT(01)/NT(01)/NT(00)/T(11)  | T
NT(00)/NT(01)/T(10)/NT(00)  | T       | T(11)/NT(01)/NT(00)/T(11)   | NT
NT(00)/T(11)/T(10)/NT(00)   | T       | T(11)/NT(01)/NT(00)/T(10)   | NT
NT(00)/T(11)/T(11)/NT(00)   | T       | T(11)/NT(00)/NT(00)/T(10)   | NT
NT(00)/T(11)/T(11)/NT(00)   | NT      | T(11)/NT(00)/NT(00)/T(10)   | NT
NT(00)/T(11)/T(10)/NT(00)   | T       | T(10)/NT(00)/NT(00)/T(10)   | T
NT(01)/T(11)/T(10)/NT(00)   | NT      | T(10)/NT(01)/NT(00)/T(10)   | NT
Accuracy of Different Schemes

[Figure: frequency of mispredictions on SPEC89 benchmarks for three schemes -- a 4,096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1,024-entry (2,2) BHT. Misprediction rates range from about 0-1% (nasa7, tomcatv) up to 18% (eqntott); the (2,2) predictor generally has the lowest misprediction rate.]

What's missing in this picture?

Re-evaluating Correlation

• Several of the SPEC benchmarks have less than a dozen branches responsible for 90% of taken branches:

  program  | branch % | static # | # responsible for 90%
  compress | 14%      | 236      | 13
  eqntott  | 25%      | 494      | 5
  gcc      | 15%      | 9531     | 2020
  mpeg     | 10%      | 5598     | 532
  real gcc | 13%      | 17361    | 3214

• Real programs + OS are more like gcc
• Small benefits beyond benchmarks for correlation? Problems with branch aliases?

BHT Accuracy

• Mispredict because either:
  – Wrong guess for that branch
  – Got the branch history of the wrong branch when indexing the table
• With a 4096-entry table, programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%
• For SPEC92, 4096 entries is about as good as an infinite table

Tournament Predictors

• The motivation for correlating branch predictors is that the 2-bit predictor failed on important branches; by adding global information, performance improved
• Tournament predictors: use 2 predictors, 1 based on global information and 1 based on local information, and combine them with a selector
• The hope is to select the right predictor for the right branch (or the right context of a branch)
Tournament (hybrid) predictors

[Figure: a local predictor (e.g. 2-bit) and a global/gshare predictor (much more state) each produce a prediction; a selection table of 2-bit state machines chooses which of the two predictions to use.]

• How do you select which predictor to use?
• How do you update the various predictors and the selector?

Tournament Predictor in Alpha 21264

• 4K 2-bit counters to choose from among a global predictor and a local predictor
• Global predictor also has 4K entries and is indexed by the history of the last 12 branches; each entry in the global predictor is a standard 2-bit predictor
  – 12-bit pattern: ith bit 0 => ith prior branch not taken; ith bit 1 => ith prior branch taken
• Local predictor consists of a 2-level predictor:
  – Top level: a local history table consisting of 1024 10-bit entries; each 10-bit entry records the most recent 10 branch outcomes for that entry. A 10-bit history allows patterns of up to 10 branches to be discovered and predicted.
  – Next level: the selected entry from the local history table is used to index a table of 1K entries consisting of 3-bit saturating counters, which provide the local prediction
• Total size: 4K*2 + 4K*2 + 1K*10 + 1K*3 = 29K bits! (~180,000 transistors)
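The following is a minimal sketch of the tournament idea in C, not the actual 21264 logic: the component predictors here are toy 2-bit tables rather than the 21264's 12-bit-global / 10-bit-local structures described above, and a 2-bit "chooser" counter per entry is nudged toward whichever component was correct whenever the two disagree.

    #include <stdbool.h>
    #include <stdint.h>

    #define ENTRIES 4096

    /* Toy stand-ins for the component predictors (tables of 2-bit counters). */
    static uint8_t local_ctr[ENTRIES];
    static uint8_t global_ctr[ENTRIES];
    static uint8_t chooser[ENTRIES];   /* 2-bit: 0-1 favor local, 2-3 favor global */

    static bool local_predict(uint32_t pc)  { return local_ctr[pc % ENTRIES]  >= 2; }
    static bool global_predict(uint32_t pc) { return global_ctr[pc % ENTRIES] >= 2; }

    bool tournament_predict(uint32_t pc) {
        bool use_global = chooser[pc % ENTRIES] >= 2;
        return use_global ? global_predict(pc) : local_predict(pc);
    }

    void tournament_update(uint32_t pc, bool taken) {
        bool l = local_predict(pc), g = global_predict(pc);
        uint8_t *c = &chooser[pc % ENTRIES];
        if (g == taken && l != taken && *c < 3) (*c)++;   /* global right, local wrong */
        if (l == taken && g != taken && *c > 0) (*c)--;   /* local right, global wrong */
        /* each component predictor also trains on the actual outcome */
        uint8_t *lp = &local_ctr[pc % ENTRIES], *gp = &global_ctr[pc % ENTRIES];
        if (taken)  { if (*lp < 3) (*lp)++; if (*gp < 3) (*gp)++; }
        else        { if (*lp > 0) (*lp)--; if (*gp > 0) (*gp)--; }
    }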

% of Predictions from the Local Predictor in the Tournament Prediction Scheme

[Figure: fraction of predictions taken from the local predictor, per benchmark -- nasa7 98%, matrix300 100%, tomcatv 94%, doduc 90%, spice 55%, fpppp 76%, gcc 72%, espresso 63%, eqntott 37%, li 69%.]

Accuracy of Branch Prediction

[Figure: branch prediction accuracy per SPEC89 benchmark (tomcatv, doduc, fpppp, li, espresso, gcc) for three schemes -- profile-based, 2-bit counter, and tournament. Profile-based prediction is the least accurate (as low as ~70% on gcc), the 2-bit counter is better, and the tournament predictor is the most accurate (roughly 94-100%).]

• Profile: branch profile from the last execution (static in that it is encoded in the instruction, but derived from a profile)
Accuracy v. Size (SPEC89)

[Figure: conditional branch misprediction rate (0% to 10%) versus total predictor size (0 to 128 Kbits) for local, correlating, and tournament predictors; the tournament predictor is the most accurate at every size shown.]

Need Address at Same Time as Prediction

• Branch Target Buffer (BTB): the address of the branch indexes the buffer to get the prediction AND the branch target address (if taken)
  – Note: must check for a branch match now, since we can't use the wrong branch's address

[Figure: the PC of the instruction being fetched is compared (=?) against the stored Branch PC of the BTB entry. On a match, the instruction is a branch: use the Predicted PC (plus extra prediction state bits) as the next PC. On no match, the branch is not predicted: proceed normally (Next PC = PC+4).]
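A sketch of the BTB lookup described above (not from the slides; a direct-mapped toy version in C): the fetch PC indexes the buffer, the stored branch PC must match, and on a hit with a taken prediction the stored target becomes the next PC; otherwise fetch falls through to PC + 4.

    #include <stdbool.h>
    #include <stdint.h>

    #define BTB_ENTRIES 512

    struct btb_entry {
        bool     valid;
        uint32_t branch_pc;      /* tag: full PC of the branch               */
        uint32_t target_pc;      /* predicted target if the branch is taken  */
        uint8_t  counter;        /* 2-bit direction predictor (0..3)         */
    };

    static struct btb_entry btb[BTB_ENTRIES];

    /* Returns the predicted next PC for the instruction being fetched. */
    uint32_t predict_next_pc(uint32_t fetch_pc) {
        struct btb_entry *e = &btb[(fetch_pc >> 2) % BTB_ENTRIES];
        if (e->valid && e->branch_pc == fetch_pc && e->counter >= 2)
            return e->target_pc;            /* hit and predicted taken        */
        return fetch_pc + 4;                /* miss or predicted not taken    */
    }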

Overriding Predictors

• Big predictors are slow, but more accurate
• Use a single-cycle predictor in fetch
• Start the multi-cycle predictor in parallel
  – When it completes, compare it to the fast prediction:
    » If the same, do nothing
    » If different, assume the slow predictor is right and flush the pipeline
• Advantage: reduced branch penalty for those branches mispredicted by the fast predictor and correctly predicted by the slow predictor

Predicated Execution

• Avoid branch prediction by turning branches into conditionally executed instructions:
    if (x) then A = B op C else NOP
  – If the condition is false, then neither store the result nor cause an exception
  – Expanded ISAs of Alpha, MIPS, PowerPC, and SPARC have a conditional move; PA-RISC can annul any following instruction
  – IA-64: 64 1-bit condition fields can be selected, so conditional execution of any instruction
  – This transformation is called "if-conversion"
• Drawbacks of conditional instructions
  – Still takes a clock even if "annulled"
  – Stall if the condition is evaluated late
  – Complex conditions reduce effectiveness; the condition becomes known late in the pipeline
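A small illustration of the if-conversion transformation in C (hedged: the ternary below stands in for a hardware conditional move; compilers for ISAs with CMOV or full predication typically emit it without a branch).

    /* Original, branchy form: the compiler would emit a conditional branch. */
    int select_branchy(int x, int a, int b, int c) {
        if (x)
            a = b + c;     /* executed only when x != 0 */
        return a;
    }

    /* If-converted form: no branch; the value is always computed and a
       conditional move commits it only when the predicate holds.        */
    int select_predicated(int x, int a, int b, int c) {
        int t = b + c;                 /* always computed (the "A = B op C") */
        return x ? t : a;              /* conditional-move-style selection   */
    }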
Pitfall: Sometimes bigger and dumber is better

• The 21264 uses a tournament predictor (29 Kbits)
• The earlier 21164 uses a simple 2-bit predictor with 2K entries (a total of 4 Kbits)
• On the SPEC95 benchmarks, the 21264 outperforms:
  – 21264 avg. 11.5 mispredictions per 1000 instructions
  – 21164 avg. 16.5 mispredictions per 1000 instructions
• Reversed for transaction processing (TP)!
  – 21264 avg. 17 mispredictions per 1000 instructions
  – 21164 avg. 15 mispredictions per 1000 instructions
• TP code is much larger, and the 21164 holds 2X as many branch predictions based on local behavior (2K vs. 1K local predictor entries in the 21264)

Special Case: Return Addresses

• Register-indirect branches are hard to predict the target address of
• In SPEC89, 85% of such branches are procedure returns
• Since procedures follow a stack discipline, save the return address in a small buffer that acts like a stack: 8 to 16 entries gives a small miss rate
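A minimal sketch (not from the slides) of the small return-address stack: predicted calls push the return PC, predicted returns pop it, and with 8-16 entries the stack simply wraps, overwriting the oldest entry on overflow.

    #include <stdint.h>

    #define RAS_ENTRIES 16                 /* 8-16 entries is typical per the slide */

    static uint32_t ras[RAS_ENTRIES];
    static unsigned ras_top;               /* index of the next free slot (wraps)   */

    /* On a predicted call: remember the address of the following instruction. */
    void ras_push(uint32_t return_pc) {
        ras[ras_top] = return_pc;
        ras_top = (ras_top + 1) % RAS_ENTRIES;      /* overflow overwrites oldest */
    }

    /* On a predicted return: predict the target from the top of the stack. */
    uint32_t ras_predict_return(void) {
        ras_top = (ras_top + RAS_ENTRIES - 1) % RAS_ENTRIES;
        return ras[ras_top];
    }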

Dynamic Branch Prediction Summary

• Prediction is becoming an important part of scalar execution
• Branch History Table: 2 bits for loop accuracy
• Correlation: recently executed branches are correlated with the next branch
  – Either different branches
  – Or different executions of the same branch
• Tournament Predictor: give more resources to competing solutions and pick between them
• Branch Target Buffer: include the branch target address & prediction
• Predicated Execution can reduce the number of branches and the number of mispredicted branches
• Return address stack for prediction of indirect jumps

General speculation

• Control speculation
  – "I think this branch will go to address 90004"
• Data speculation
  – "I'll guess the result of the load will be zero"
• Memory conflict speculation
  – "I don't think this load conflicts with any preceding store."
• Error speculation
  – "I don't think there were any errors in this calculation"
Speculation in general

• Need to be 100% sure of final correctness!
  – So need a recovery mechanism
  – Must make forward progress!
• Want to speed up overall performance
  – So the recovery cost should be low, or the expected rate of occurrence should be low
  – There can be a real trade-off among accuracy, cost of recovery, and speedup when correct
• Should keep the worst case in mind…

Control Flow Speculation

[Figure: a tree of speculated branches; each branch point has not-taken (NT) and taken (T) outcomes, and each level of speculation is assigned a tag (tag1, tag2, tag3).]

• Leading Speculation
  – Tag speculative instructions
  – Advance the branch and following instructions
  – Buffer the addresses of speculated branch instructions

Mis-speculation Recovery

• Eliminate the incorrect path
  – Must ensure that the mis-speculated instructions produce no side effects
• Start the new correct path
  – Must have remembered the alternate (non-predicted) path

Mis-speculation Recovery

[Figure: the speculation tree again; the subtrees tagged tag2 and tag3 under the mispredicted branch are squashed.]

• Eliminate the incorrect path
  – Use the branch tag(s) to deallocate completion buffer entries occupied by speculative instructions (now determined to be mis-speculated)
  – Invalidate all such instructions in the decode and dispatch buffers, as well as those in reservation stations
  How expensive is a misprediction?
• Start the new correct path
  – Update the PC with the computed branch target (if the branch was predicted NT)
  – Update the PC with the sequential instruction address (if it was predicted T)
  – Can begin speculating again when a new branch is encountered
  How soon can you restart?
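A minimal sketch (not from the slides) of the tag-based squash on the incorrect path: every completion-buffer entry carries the tag of the branch it was speculated under, and a misprediction invalidates the entries whose tag matches or is younger than the mispredicted branch. The tag-age comparison here is simplified (monotonically increasing tags) rather than the wrap-aware logic real hardware would use.

    #include <stdbool.h>
    #include <stdint.h>

    #define BUFFER_ENTRIES 64

    struct cb_entry {
        bool    valid;
        uint8_t branch_tag;   /* tag of the branch this entry was speculated under
                                 (0 = non-speculative)                              */
    };

    static struct cb_entry completion_buffer[BUFFER_ENTRIES];

    /* Simplified ordering: tags are assumed to be allocated in increasing order. */
    static bool tag_is_younger(uint8_t tag, uint8_t other) { return tag > other; }

    /* On a misprediction, deallocate every entry speculated under the bad branch
       or under any younger (more speculative) branch.                            */
    void squash(uint8_t mispredicted_tag) {
        for (int i = 0; i < BUFFER_ENTRIES; i++) {
            struct cb_entry *e = &completion_buffer[i];
            if (e->valid && e->branch_tag != 0 &&
                (e->branch_tag == mispredicted_tag ||
                 tag_is_younger(e->branch_tag, mispredicted_tag)))
                e->valid = false;
        }
    }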
Fast Branch Rewind and Restart: Trailing Confirmation

[Figure: the speculation tree; on a misprediction, everything younger than the mispredicted branch (another miss, youngest tag) is discarded.]

• Discard all ROB entries (and the corresponding operations) younger than the mispredicted branch
• Can restart immediately from the corrected branch target, because the ROB has sufficient information (rename & value) to continue from where it left off
• Works with nested mispredictions!
• Trailing Confirmation
  – When a branch is resolved, remove/deallocate the speculation tag
  – Permit completion of the branch and the following instructions

Rewinding/Flushing of Rename Table

[Figure: the ARF (data, busy, tag), the Map Table (logical register name -> mapping), and the PRF (data, rdy) with next-to-free and next-to-allocate pointers, feeding the operand value/tag buses.]

• To reinitiate renaming:
  – Wait for all instructions older than the rewind point to drain clear of the pipeline and then reset the register remapping to null
    » Long restart latency
  – Or the reorder buffer has to remember how to restore the map table to the point of the mispredicted branch
    » Complicated multi-cycle logic
  – Cache the rename map after each branch prediction
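A minimal sketch (not from the slides) of the "cache the rename map after branch prediction" idea: a copy of the map table is checkpointed at each predicted branch, so a misprediction restores the map in one step instead of draining the pipeline or walking the ROB backwards. The sizes and the branch_tag indexing are illustrative assumptions.

    #include <string.h>
    #include <stdint.h>

    #define NUM_LOGICAL   32
    #define MAX_BRANCHES  8          /* in-flight speculated branches supported */

    static uint8_t map_table[NUM_LOGICAL];                    /* logical -> physical */
    static uint8_t checkpoints[MAX_BRANCHES][NUM_LOGICAL];    /* saved map copies    */

    /* Called when a branch is predicted: snapshot the current rename map. */
    void checkpoint_map(int branch_tag) {
        memcpy(checkpoints[branch_tag], map_table, sizeof map_table);
    }

    /* Called on a misprediction: restore the map in a single step. */
    void restore_map(int branch_tag) {
        memcpy(map_table, checkpoints[branch_tag], sizeof map_table);
    }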
