0% found this document useful (0 votes)

136 views

GDC2003 Memory Optimization 18mar03

This document discusses techniques for optimizing memory and cache performance. It begins with an overview of the memory hierarchy and importance of cache optimization. It then provides strategies for rearranging and reducing data to improve cache utilization including prefetching, structure layout, tree data structures, and linearization caching. Finally, it discusses reducing aliasing and improving memory allocation policies.

Uploaded by

chegu.balaji

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

136 views

GDC2003 Memory Optimization 18mar03

Uploaded by

chegu.balaji

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 60

MEMORY

OPTIMIZATION
Christer Ericson
Sony Computer Entertainment, Santa Monica
([email protected])
Talk contents 1/2
► Problem statement
! Why “memory optimization?”
► Brief architecture overview
! The memory hierarchy
► Optimizing for (code and) data cache
! General suggestions
! Data structures
►Prefetching and preloading
►Structure layout
►Tree structures
►Linearization caching
►…
Talk contents 2/2
►…
► Aliasing
! Abstraction penalty problem
! Alias analysis (type-based)
! ‘restrict’ pointers
! Tips for reducing aliasing
Problem statement
► For the last 20-something years…
! CPU speeds have increased ~60%/year
! Memory speeds only decreased ~10%/year
► Gap covered by use of cache memory
► Cache is under-exploited
! Diminishing returns for larger caches

► Inefficient cache use = lower performance

! How increase cache utilization? Cache-awareness!
Need more justification? 1/3

Instruction parallelism:
SIMD instructions consume
data at 2-8 times the rate
of normal instructions!
Need more justification? 2/3

Proebsting’s law:
Improvements to
compiler technology
double program performance
every ~18 years!
Corollary: Don’t expect the compiler to do it for you!
Need more justification? 3/3

On Moore’s law:
► Consoles don’t follow it (as such)
! Fixed hardware
! 2nd/3rd generation titles must get
improvements from somewhere
Brief cache review
► Caches
! Code cache for instructions, data cache for data
! Forms a memory hierarchy
► Cache lines
! Cache divided into cache lines of ~32/64 bytes each
! Correct unit in which to count memory accesses
► Direct-mapped
! For n KB cache, bytes at k, k+n, k+2n, … map to same
cache line
► N-way set-associative
! Logical cache line corresponds to N physical lines
! Helps minimize cache line thrashing
The memory hierarchy
Roughly:

CPU 1 cycle

~1-5 cycles
L1 cache

L2 cache ~5-20 cycles

Main memory ~40-100 cycles

Some cache specs
L1 cache (I/D) L2 cache
PS2 16K/8K† 2-way N/A
GameCube 32K/32K‡ 8-way 256K 2-way unified
XBOX 16K/16K 4-way 128K 8-way unified
PC ~32-64K ~128-512K

► †16K data scratchpad important part of design

► ‡configurable as 16K 4-way + 16K scratchpad
Foes: 3 C’s of cache misses
► Compulsory misses
! Unavoidable misses when data read for first time
► Capacity misses
! Not enough cache space to hold all active data
! Too much data accessed inbetween successive use
► Conflict misses
! Cache thrashing due to data mapping to same cache
lines
Friends: Introducing the 3 R’s
► Rearrange (code, data)
! Change layout to increase spatial locality
► Reduce (size, # cache lines read)
! Smaller/smarter formats, compression
► Reuse (cache lines)
! Increase temporal (and spatial) locality

Compulsory Capacity Conflict

Rearrange X (x) X
Reduce X X (x)
Reuse (x) X
Measuring cache utilization
► Profile
! CPU performance/event counters
►Give memory access statistics
►But not access patterns (e.g. stride)
! Commercial products
►SN Systems’ Tuner, Metrowerks’ CATS, Intel’s VTune
! Roll your own
►In gcc ‘-p’ option + define _mcount()
►Instrument code with calls to logging class
! Do back-of-the-envelope comparison
► Study the generated code
Code cache optimization 1/2
► Locality
! Reorder functions
►Manually within file
►Reorder object files during linking (order in makefile)
►__attribute__ ((section ("xxx"))) in gcc

! Adapt coding style

►Monolithicfunctions
►Encapsulation/OOP is less code cache friendly

! Moving target
! Beware various implicit functions (e.g. fptodp)
Code cache optimization 2/2
► Size
! Beware: inlining, unrolling, large macros
! KISS
►Avoid featuritis
►Provide multiple copies (also helps locality)
! Loop splitting and loop fusion
! Compile for size (‘-Os’ in gcc)
! Rewrite in asm (where it counts)
► Again, study generated code
! Build intuition about code generated
Data cache optimization
► Lots and lots of stuff…
! “Compressing” data
! Blocking and strip mining
! Padding data to align to cache lines
! Plus other things I won’t go into
► What I will talk about…
! Prefetching and preloading data into cache
! Cache-conscious structure layout
! Tree data structures
! Linearization caching
! Memory allocation
! Aliasing and “anti-aliasing”
Prefetching and preloading
► Software prefetching
! Not too early – data may be evicted before use
! Not too late – data not fetched in time for use
! Greedy
► Preloading (pseudo-prefetching)
! Hit-under-miss processing
Software prefetching
// Loop through and process all 4n elements
for (int i = 0; i < 4 * n; i++)
Process(elem[i]);

const int kLookAhead = 4; // Some elements ahead

for (int i = 0; i < 4 * n; i += 4) {
Prefetch(elem[i + kLookAhead]);
Process(elem[i + 0]);
Process(elem[i + 1]);
Process(elem[i + 2]);
Process(elem[i + 3]);
}
Greedy prefetching
void PreorderTraversal(Node *pNode) {
// Greedily prefetch left traversal path
Prefetch(pNode->left);
// Process the current node
Process(pNode);
// Greedily prefetch right traversal path
Prefetch(pNode->right);
// Recursively visit left then right subtree
PreorderTraversal(pNode->left);
PreorderTraversal(pNode->right);
}
Preloading (pseudo-prefetch)
Elem a = elem[0];
for (int i = 0; i < 4 * n; i += 4) {
Elem e = elem[i + 4]; // Cache miss, non-blocking
Elem b = elem[i + 1]; // Cache hit
Elem c = elem[i + 2]; // Cache hit
Elem d = elem[i + 3]; // Cache hit
Process(a);
Process(b);
Process(c);
Process(d);
a = e;
}

(NB: This code reads one element beyond the end of the elem array.)
Structures
► Cache-conscious layout
! Field reordering (usually grouped conceptually)
! Hot/cold splitting
► Let use decide format
! Array of structures
! Structures of arrays
► Little compiler support
! Easier for non-pointer languages (Java)
! C/C++: do it yourself
Field reordering
struct S { struct S {
void *key; void *key;
int count[20]; S *pNext;
S *pNext; int count[20];
}; };

void Foo(S p, void key, int k) {

while (p) {
if (p->key == key) { ► Likely accessed
p->count[k]++; together so
break; store them
} together!
p = p->pNext;
}
}
Hot/cold splitting
Hot fields: Cold fields:
struct S { struct S2 {
void *key; int count[10];
S *pNext; };
S2 *pCold;
};

► Allocate all ‘struct S’ from a memory pool

! Increases coherence
► Prefer array-style allocation
! No need for actual pointer to cold fields
Hot/cold splitting
Beware compiler padding
struct X { struct Y { struct Z {

Decreasing size!
int8 a; int8 a, pad_a[7]; int64 b;
int64 b; int64 b; int64 e;
int8 c; int8 c, pad_c[1]; float f;
int16 d; int16 d, pad_d[2]; int16 d;
int64 e; int64 e; int8 a;
float f; float f, pad_f[1]; int8 c;
}; }; };

Assuming 4-byte floats, for most compilers sizeof(X) == 40,

sizeof(Y) == 40, and sizeof(Z) == 24.
Cache performance analysis
► Usage patterns
! Activity – indicates hot or cold field
! Correlation – basis for field reordering
► Logging tool
! Access all class members through accessor functions
! Manually instrument functions to call Log() function
! Log() function…
► takesobject type + member field as arguments
► hash-maps current args to count field accesses
► hash-maps current + previous args to track pairwise accesses
Tree data structures
► Rearrange nodes
! Increase spatial locality
! Cache-aware vs. cache-oblivious layouts
► Reduce size
! Pointer elimination (using implicit pointers)
! “Compression”
►Quantize values
►Store data relative to parent node
Breadth-first order

► Pointer-less:Left(n)=2n, Right(n)=2n+1
► Requires storage for complete tree of height H
Depth-first order

► Left(n)= n + 1, Right(n) = stored index

► Only stores existing nodes
van Emde Boas layout

► “Cache-oblivious”
► Recursive construction
A compact static k-d tree
union KDNode {
// leaf, type 11
int32 leafIndex_type;
// non-leaf, type 00 = x,
// 01 = y, 10 = z-split
float splitVal_type;
};
Linearization caching
► Nothing better than linear data
! Best possible spatial locality
! Easily prefetchable
► So linearize data at runtime!
! Fetch data, store linearized in a custom cache
! Use it to linearize…
►hierarchy traversals
►indexed data
►other random-access stuff
Memory allocation policy
► Don’t allocate from heap, use pools
! No block overhead
! Keeps data together
! Faster too, and no fragmentation
► Free ASAP, reuse immediately
! Block is likely in cache so reuse its cachelines
! First fit, using free list
The curse of aliasing
What is aliasing?
int n; Aliasing is multiple
int *p1 = &n; references to the
int *p2 = &n; same storage location

Aliasing is also missed opportunities for optimization

int Foo(int *a, int *b) {
*a = 1; What value is
*b = 2; returned here?
return *a; Who knows!
}
The curse of aliasing
► What is causing aliasing?
! Pointers
! Global variables/class members make it worse
► What is the problem with aliasing?
! Hinders reordering/elimination of loads/stores
►Poisoning data cache
►Negatively affects instruction scheduling
►Hinders common subexpression elimination (CSE),
loop-invariant code motion, constant/copy
propagation, etc.
How do we do ‘anti-aliasing’?
► What can be done about aliasing?
! Better languages
►Less aliasing, lower abstraction penalty†
! Better compilers
►Alias analysis such as type-based alias analysis†
! Better programmers (aiding the compiler)
►That’s you, after the next 20 slides!
! Leap of faith
►-fno-aliasing
† To be defined
Matrix multiplication 1/3
Consider optimizing a 2x2 matrix multiplication:
Mat22mul(float a[2][2], float b[2][2], float c[2][2]){
for (int i = 0; i < 2; i++) {
for (int j = 0; j < 2; j++) {
a[i][j] = 0.0f;
for (int k = 0; k < 2; k++)
a[i][j] += b[i][k] * c[k][j];
}
}
}

How do we typically optimize it? Right, unrolling!

Matrix multiplication 2/3
Staightforward unrolling results in this:
// 16 memory reads, 4 writes
Mat22mul(float a[2][2], float b[2][2], float c[2][2]){
a[0][0] = b[0][0]*c[0][0] + b[0][1]*c[1][0];
a[0][1] = b[0][0]*c[0][1] + b[0][1]*c[1][1]; //(1)
a[1][0] = b[1][0]*c[0][0] + b[1][1]*c[1][0]; //(2)
a[1][1] = b[1][0]*c[0][1] + b[1][1]*c[1][1]; //(3)
}

► But wait! There’s a hidden assumption! a is not b or c!

► Compiler doesn’t (cannot) know this!
! (1) Must refetch b[0][0] and b[0][1]
! (2) Must refetch c[0][0] and c[1][0]
! (3) Must refetch b[0][0], b[0][1], c[0][0] and c[1][0]
Matrix multiplication 3/3
A correct approach is instead writing it as:
// 8 memory reads, 4 writes
Mat22mul(float a[2][2], float b[2][2], float c[2][2]){
float b00 = b[0][0], b01 = b[0][1];
float b10 = b[1][0], b11 = b[1][1]; Consume
float c00 = c[0][0], c01 = c[0][1]; inputs…
float c10 = c[1][0], c11 = c[1][1];

a[0][0] = b00c00 + b01c10; …before

a[0][1] = b00*c01 + b01*c11;
producing
a[1][0] = b10*c00 + b11*c10;
a[1][1] = b10*c01 + b11*c11;
outputs
}
Abstraction penalty problem
► Higher levels of abstraction have a negative
effect on optimization
! Code broken into smaller generic subunits
! Data and operation hiding
►Cannot make local copy of e.g. internal pointers
►Cannot hoist constant expressions out of loops

► Especially because of aliasing issues

C++ abstraction penalty
► Lots of (temporary) objects around
! Iterators
! Matrix/vector classes
► Objects live in heap/stack
! Thus subject to aliasing
! Makes tracking of current member value very difficult
! But tracking required to keep values in registers!

► Implicit aliasing through the this pointer

! Class members are virtually as bad as global variables
C++ abstraction penalty
Pointer members in classes may alias other members:
numVals not a
class Buf {
public:
local variable!
void Clear() {
for (int i = 0; i < numVals; i++)
pBuf[i] = 0;
}
private: May be
int numVals, *pBuf; aliased
}
by pBuf!
Code likely to refetch numVals each iteration!
C++ abstraction penalty
We know that aliasing won’t happen, and can
manually solve the aliasing issue by writing code as:
class Buf {
public:
void Clear() {
for (int i = 0, n = numVals; i < n; i++)
pBuf[i] = 0;
}
private:
int numVals, *pBuf;
}
C++ abstraction penalty
Since pBuf[i] can only alias numVals in the first
iteration, a quality compiler can fix this problem by
peeling the loop once, turning it into:
void Clear() {
if (numVals >= 1) {
pBuf[0] = 0;
for (int i = 1, n = numVals; i < n; i++)
pBuf[i] = 0;
}
}

Q: Does your compiler do this optimization?!

Type-based alias analysis
► Some aliasing the compiler can catch
! A powerful tool is type-based alias analysis

Use language types

to disambiguate
memory
references!
Type-based alias analysis
► ANSI C/C++ states that…
! Each area of memory can only be associated
with one type during its lifetime
! Aliasing may only occur between references of
the same compatible type

► Enables
compiler to rule out aliasing
between references of non-compatible type
! Turned on with –fstrict-aliasing in gcc
Compatibility of C/C++ types
► In short…
! Types compatible if differing by signed,
unsigned, const or volatile
! char and unsigned char compatible with any
type
! Otherwise not compatible

► (See standard for full details.)

What TBAA can do for you
It can turn this:
void Foo(float *v, int *n) {
Possible aliasing
for (int i = 0; i < *n; i++)
between
v[i] += 1.0f;
v[i] and *n
}

into this:
void Foo(float *v, int *n) {
int t = *n;
No aliasing possible
for (int i = 0; i < t; i++)
so fetch *n once!
v[i] += 1.0f;
}
What TBAA can also do
► Cause obscure bugs in non-conforming code!
! Beware especially so-called “type punning”

uint32 i; uint32 i; uint32 i;

float f; union { union {
i = *((uint32 *)&f); float f; float f;
uint32 i; uchar8 c[4];
} u; } u;
u.f = f; u.f = f;
i = u.i; i = (u.c[3]<<24L)+
(u.c[2]<<16L)+
...;

Illegal Allowed Required

C/C++ code! By gcc by standard
Restrict-qualified pointers
► restrict keyword
! New to 1999 ANSI/ISO C standard
! Not in C++ standard yet, but supported by many C++
compilers
! A hint only, so may do nothing and still be conforming
►A restrict-qualified pointer (or reference)…
! …is basically a promise to the compiler that for the
scope of the pointer, the target of the pointer will only
be accessed through that pointer (and pointers copied
from it).
! (See standard for full details.)
Using the restrict keyword
Given this code:
void Foo(float v[], float *c, int n) {
for (int i = 0; i < n; i++)
v[i] = *c + 1.0f;
}

You really want the compiler to treat it as if written:

void Foo(float v[], float *c, int n) {
float tmp = *c + 1.0f;
for (int i = 0; i < n; i++)
v[i] = tmp;
}

But because of possible aliasing it cannot!

Using the restrict keyword
For example, the code might be called as:
float a[10];
a[4] = 0.0f;
Foo(a, &a[4], 10);

giving for the first version:

v[] = 1, 1, 1, 1, 1, 2, 2, 2, 2, 2

and for the second version:

v[] = 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

The compiler must be conservative, and

cannot perform the optimization!
Solving the aliasing problem
The fix? Declaring the output as restrict:
void Foo(float * restrict v, float *c, int n) {
for (int i = 0; i < n; i++)
v[i] = *c + 1.0f;
}

► Alas, in practice may need to declare both pointers restrict!

! A restrict-qualified pointer can grant access to non-restrict pointer
! Full data-flow analysis required to detect this
! However, two restrict-qualified pointers are trivially non-aliasing!
! Also may work declaring second argument as “float * const c”
‘const’ doesn’t help
Some might think this would work:
void Foo(float v[], const float *c, int n) {
for (int i = 0; i < n; i++)
v[i] = *c + 1.0f;
}
Since *c is const, v[i]
cannot write to it, right?

► Wrong! const promises almost nothing!

! Says *c is const through c, not that *c is const in
general
! Can be cast away
! For detecting programming errors, not fixing aliasing
SIMD + restrict = TRUE
► restrict enables SIMD optimizations
void VecAdd(int *a, int *b, int *c) {
for (int i = 0; i < 4; i++)
a[i] = b[i] + c[i]; Stores may alias loads.
} Must perform operations
sequentially.

void VecAdd(int * restrict a, int b, int c) {

for (int i = 0; i < 4; i++)
a[i] = b[i] + c[i]; Independent loads and
} stores. Operations can
be performed in parallel!
Restrict-qualified pointers
► Important, especially with C++
! Helps combat abstraction penalty problem
► But beware…
! Tricky semantics, easy to get wrong
! Compiler won’t tell you about incorrect use
! Incorrect use = slow painful death!
Tips for avoiding aliasing
► Minimize use of globals, pointers, references
! Pass small variables by-value
! Inline small functions taking pointer or reference
arguments
► Use local variables as much as possible
! Make local copies of global and class member variables
► Don’t take the address of variables (with &)
► restrict pointers and references
► Declare variables close to point of use
► Declare side-effect free functions as const
► Do manual CSE, especially of pointer expressions
That’s it! – Resources 1/2
► Ericson, Christer. Real-time collision detection. Morgan-
Kaufmann, 2003. (Chapter on memory optimization)
► Mitchell, Mark. Type-based alias analysis. Dr. Dobb’s
journal, October 2000.
► Robison, Arch. Restricted pointers are coming. C/C++
Users Journal, July 1999.
http://www.cuj.com/articles/1999/9907/9907d/9907d.htm
► Chilimbi, Trishul. Cache-conscious data structures - design
and implementation. PhD Thesis. University of Wisconsin,
Madison, 1999.
► Prokop, Harald. Cache-oblivious algorithms. Master’s
Thesis. MIT, June, 1999.
► …
Resources 2/2
► …
► Gavin, Andrew. Stephen White. Teaching an old dog new
bits: How console developers are able to improve
performance when the hardware hasn’t changed.
Gamasutra. November 12, 1999
http://www.gamasutra.com/features/19991112/GavinWhite_01.htm
► Handy, Jim. The cache memory book. Academic Press,
1998.
► Macris, Alexandre. Pascal Urro. Leveraging the power of
cache memory. Gamasutra. April 9, 1999
http://www.gamasutra.com/features/19990409/cache_01.htm
► Gross, Ornit. Pentium III prefetch optimizations using the
VTune performance analyzer. Gamasutra. July 30, 1999
http://www.gamasutra.com/features/19990730/sse_prefetch_01.htm
► Truong, Dan. François Bodin. André Seznec. Improving
cache behavior of dynamically allocated data structures.

DWC_pcie_ctl_sw_databook
No ratings yet
DWC_pcie_ctl_sw_databook
1,093 pages
L01 - Introduction To Studio 5000 Logix Designer - Lab Manual
80% (5)
L01 - Introduction To Studio 5000 Logix Designer - Lab Manual
96 pages
Efabless Caravel "Harness" Soc: Preliminary
No ratings yet
Efabless Caravel "Harness" Soc: Preliminary
30 pages
PCIE Protocol
No ratings yet
PCIE Protocol
29 pages
Interrupts
No ratings yet
Interrupts
59 pages
Xeon D 1500 Datasheet Vol 1
No ratings yet
Xeon D 1500 Datasheet Vol 1
608 pages
DRAM Command Guide
No ratings yet
DRAM Command Guide
2 pages
Memory Address Decoding
No ratings yet
Memory Address Decoding
25 pages
Memory Controller
No ratings yet
Memory Controller
26 pages
NVM Express 1 - 1a
No ratings yet
NVM Express 1 - 1a
166 pages
Intel SATA Controller
No ratings yet
Intel SATA Controller
59 pages
Debugging SystemVerilog
No ratings yet
Debugging SystemVerilog
175 pages
Basics of DDR Protocol: Jose Thomas Vellara
No ratings yet
Basics of DDR Protocol: Jose Thomas Vellara
57 pages
Cortex Mo+ Technical Reference Manual
No ratings yet
Cortex Mo+ Technical Reference Manual
51 pages
7 Series Memory Controllers
100% (1)
7 Series Memory Controllers
36 pages
DDR Sdram: A 1.8V, 700mb/s/pin, 512Mb DDR-II SDRAM With On-Die Termination and Off-Chip Driver Calibration
No ratings yet
DDR Sdram: A 1.8V, 700mb/s/pin, 512Mb DDR-II SDRAM With On-Die Termination and Off-Chip Driver Calibration
36 pages
Intel 82802 Firmware Hub
No ratings yet
Intel 82802 Firmware Hub
53 pages
Mobile DRAM Standard Formulation
No ratings yet
Mobile DRAM Standard Formulation
5 pages
Palestra 4 Abram Belk
No ratings yet
Palestra 4 Abram Belk
143 pages
Chapter 4 - Cache Memory: Luis Tarrataca
No ratings yet
Chapter 4 - Cache Memory: Luis Tarrataca
159 pages
Development of CPLD Based Memory Controller For MPC 603E PPC Based Single Board Computer
No ratings yet
Development of CPLD Based Memory Controller For MPC 603E PPC Based Single Board Computer
38 pages
Pulpissimo: Datasheet: The Pulp Team
No ratings yet
Pulpissimo: Datasheet: The Pulp Team
101 pages
Network On Chip
No ratings yet
Network On Chip
43 pages
ASU DDR5 Digital Presentation
No ratings yet
ASU DDR5 Digital Presentation
59 pages
PATA/IDE and Sata Interface
No ratings yet
PATA/IDE and Sata Interface
12 pages
Lecture 7 Main Memory
No ratings yet
Lecture 7 Main Memory
36 pages
Intelligent High Performance Memory Access Technique in Aspect of DDR3
No ratings yet
Intelligent High Performance Memory Access Technique in Aspect of DDR3
6 pages
U-Boot-v2: Sascha Hauer Marc Kleine-Budde
No ratings yet
U-Boot-v2: Sascha Hauer Marc Kleine-Budde
36 pages
2-3 - Common - Storage - Protocols - Copie
No ratings yet
2-3 - Common - Storage - Protocols - Copie
58 pages
PCI Express Basics & Background: Richard Solomon Synopsys
100% (1)
PCI Express Basics & Background: Richard Solomon Synopsys
45 pages
21CS43 - Module 1
No ratings yet
21CS43 - Module 1
21 pages
Aquabolt-XL: Samsung HBM2-PIM With In-Memory Processing For ML Accelerators and Beyond
No ratings yet
Aquabolt-XL: Samsung HBM2-PIM With In-Memory Processing For ML Accelerators and Beyond
26 pages
Cache Design
No ratings yet
Cache Design
59 pages
DDR3 HynixX
No ratings yet
DDR3 HynixX
75 pages
Tutorial On Transaction Level Modeling With Systemc For System Level Design
No ratings yet
Tutorial On Transaction Level Modeling With Systemc For System Level Design
151 pages
PCS White Paper
No ratings yet
PCS White Paper
14 pages
Intel Punit
No ratings yet
Intel Punit
3 pages
ECRC Calculation
No ratings yet
ECRC Calculation
8 pages
Intel Desktop Board DQ67SW Technical Product Specification
No ratings yet
Intel Desktop Board DQ67SW Technical Product Specification
96 pages
DDR5 Ram
No ratings yet
DDR5 Ram
3 pages
Registers in Computer Architecture
No ratings yet
Registers in Computer Architecture
37 pages
SR MR Iov
No ratings yet
SR MR Iov
63 pages
The Berkeley Out - of - Order Machine (Boom!) : An Open - Source Industry - Compeeeve, Synthesizable, Parameterized Risc - V Processor
100% (1)
The Berkeley Out - of - Order Machine (Boom!) : An Open - Source Industry - Compeeeve, Synthesizable, Parameterized Risc - V Processor
45 pages
ddr3 PDF
No ratings yet
ddr3 PDF
120 pages
AXI OCP Protocol Signal Description AXI
100% (1)
AXI OCP Protocol Signal Description AXI
2 pages
DDR memory system simulation method
No ratings yet
DDR memory system simulation method
17 pages
489 6GbsSAS 12Gbs PerfTuningGuide
No ratings yet
489 6GbsSAS 12Gbs PerfTuningGuide
97 pages
Pci Bus End Point Block
0% (1)
Pci Bus End Point Block
178 pages
A Dual-Core RISC-V Vector Processor With On-Chip Fine-Grain Power Management in 28-nm FD-SOI
No ratings yet
A Dual-Core RISC-V Vector Processor With On-Chip Fine-Grain Power Management in 28-nm FD-SOI
5 pages
Embedded Systems Design - 2: Dr. N. Mathivanan
No ratings yet
Embedded Systems Design - 2: Dr. N. Mathivanan
10 pages
2023 CXL DesignTradeoffs IEEE Micro
No ratings yet
2023 CXL DesignTradeoffs IEEE Micro
9 pages
Intel® Management Engine (ME) Firmware Update Procedure
No ratings yet
Intel® Management Engine (ME) Firmware Update Procedure
3 pages
DDR5-Anil Pandey PDF
No ratings yet
DDR5-Anil Pandey PDF
3 pages
Pcie Intel Specification
No ratings yet
Pcie Intel Specification
9 pages
Embedded System: 1 History
No ratings yet
Embedded System: 1 History
11 pages
PCI SIG Arch Overview
No ratings yet
PCI SIG Arch Overview
37 pages
AXI4-Stream Upsizing Downsizing Data Width Converters For Hardware-In-The-Loop Simulations
No ratings yet
AXI4-Stream Upsizing Downsizing Data Width Converters For Hardware-In-The-Loop Simulations
3 pages
Logic synthesis Standard Requirements
From Everand
Logic synthesis Standard Requirements
Gerardus Blokdyk
No ratings yet
Application-Specific Integrated Circuit ASIC A Complete Guide
From Everand
Application-Specific Integrated Circuit ASIC A Complete Guide
Gerardus Blokdyk
No ratings yet
10_Caches
No ratings yet
10_Caches
34 pages
Memory Cache
No ratings yet
Memory Cache
18 pages
Proficy - Logic Developer - PLC 7.0 Getting Started, GFK-1918P
No ratings yet
Proficy - Logic Developer - PLC 7.0 Getting Started, GFK-1918P
176 pages
L17 Studio5000
100% (1)
L17 Studio5000
98 pages
Cs 101
No ratings yet
Cs 101
29 pages
Language Evaluation Criteria
0% (1)
Language Evaluation Criteria
28 pages
CS-GATE - Solved PYQ
No ratings yet
CS-GATE - Solved PYQ
38 pages
L18 Manual
No ratings yet
L18 Manual
108 pages
WA Agent For Databases Impl r11.3.4 ENU
No ratings yet
WA Agent For Databases Impl r11.3.4 ENU
45 pages
Instruction Intrinseque
No ratings yet
Instruction Intrinseque
275 pages
Verdi Quickrefpdf
No ratings yet
Verdi Quickrefpdf
7 pages
Lab 01 - Characteristics of Programming Languages (Answers) PDF
No ratings yet
Lab 01 - Characteristics of Programming Languages (Answers) PDF
5 pages
PWP Chapter 3
No ratings yet
PWP Chapter 3
48 pages
Fundamentals of Python: First Programs Second Edition: Week 5 (Chapter 5)
No ratings yet
Fundamentals of Python: First Programs Second Edition: Week 5 (Chapter 5)
46 pages
Abclx Driver Help
No ratings yet
Abclx Driver Help
56 pages
Python Programming R18A0513: B.Tech Iii Year - I Sem (R18) (2021-22)
No ratings yet
Python Programming R18A0513: B.Tech Iii Year - I Sem (R18) (2021-22)
208 pages
GNU C Language Manual
No ratings yet
GNU C Language Manual
242 pages
ARM Optimizing C/C++
No ratings yet
ARM Optimizing C/C++
203 pages
Data Flow Testing
No ratings yet
Data Flow Testing
50 pages
Rust Programming Tutorial PDF
100% (1)
Rust Programming Tutorial PDF
57 pages
Release Notes
No ratings yet
Release Notes
13 pages
Popl18 p202
No ratings yet
Popl18 p202
34 pages
Unit 4 GE8151 PSPP
No ratings yet
Unit 4 GE8151 PSPP
21 pages
Optimising Serial Code
No ratings yet
Optimising Serial Code
101 pages
PDB Command Reference
No ratings yet
PDB Command Reference
8 pages
GNU C Language Manual
No ratings yet
GNU C Language Manual
242 pages
Unit - 4 List, Tuples and Dictionaries
No ratings yet
Unit - 4 List, Tuples and Dictionaries
10 pages
Spru 187 o
No ratings yet
Spru 187 o
229 pages
GDC2003 Memory Optimization 18mar03
No ratings yet
GDC2003 Memory Optimization 18mar03
60 pages
Understanding C/C++ Strict Aliasing: or - Why Won't The #$@##@ % Compiler Let Me Do What I Need To Do! by Patrick Horgan
No ratings yet
Understanding C/C++ Strict Aliasing: or - Why Won't The #$@##@ % Compiler Let Me Do What I Need To Do! by Patrick Horgan
7 pages
Accessing Real Time Clock Registers and NMI Enable Bit
No ratings yet
Accessing Real Time Clock Registers and NMI Enable Bit
22 pages

Uploaded by

Uploaded by

MEMORY

► Inefficient cache use = lower performance

L2 cache ~5-20 cycles

Main memory ~40-100 cycles

► †16K data scratchpad important part of design

Compulsory Capacity Conflict

! Adapt coding style

const int kLookAhead = 4; // Some elements ahead

void Foo(S *p, void *key, int k) {

► Allocate all ‘struct S’ from a memory pool

Assuming 4-byte floats, for most compilers sizeof(X) == 40,

► Left(n)= n + 1, Right(n) = stored index

Aliasing is also missed opportunities for optimization

How do we typically optimize it? Right, unrolling!

► But wait! There’s a hidden assumption! a is not b or c!

a[0][0] = b00*c00 + b01*c10; …before

► Especially because of aliasing issues

► Implicit aliasing through the this pointer

Q: Does your compiler do this optimization?!

Use language types

► (See standard for full details.)

uint32 i; uint32 i; uint32 i;

Illegal Allowed Required

You really want the compiler to treat it as if written:

But because of possible aliasing it cannot!

giving for the first version:

and for the second version:

The compiler must be conservative, and

► Alas, in practice may need to declare both pointers restrict!

► Wrong! const promises almost nothing!

void VecAdd(int * restrict a, int *b, int *c) {

You might also like

void Foo(S p, void key, int k) {

a[0][0] = b00c00 + b01c10; …before

void VecAdd(int * restrict a, int b, int c) {