GDC2003 Memory Optimization 18mar03
MEMORY OPTIMIZATION
Christer Ericson
Sony Computer Entertainment, Santa Monica
([email protected])
Talk contents 1/2
► Problem statement
! Why “memory optimization?”
► Brief architecture overview
! The memory hierarchy
► Optimizing for (code and) data cache
! General suggestions
! Data structures
►Prefetching and preloading
►Structure layout
►Tree structures
►Linearization caching
►…
Talk contents 2/2
►…
► Aliasing
! Abstraction penalty problem
! Alias analysis (type-based)
! ‘restrict’ pointers
! Tips for reducing aliasing
Problem statement
► For the last 20-something years…
! CPU speeds have increased ~60%/year
! Memory speeds have increased only ~10%/year
► Gap covered by use of cache memory
► Cache is under-exploited
! Diminishing returns for larger caches
Need more justification? 1/3
Instruction parallelism:
SIMD instructions consume
data at 2–8 times the rate
of normal instructions!
Need more justification? 2/3
Proebsting’s law:
Improvements to
compiler technology
double program performance
every ~18 years!
Corollary: Don’t expect the compiler to do it for you!
Need more justification? 3/3
On Moore’s law:
► Consoles don’t follow it (as such)
! Fixed hardware
! 2nd/3rd generation titles must get
improvements from somewhere
Brief cache review
► Caches
! Code cache for instructions, data cache for data
! Forms a memory hierarchy
► Cache lines
! Cache divided into cache lines of ~32/64 bytes each
! Correct unit in which to count memory accesses
► Direct-mapped
! For n KB cache, bytes at k, k+n, k+2n, … map to same
cache line
► N-way set-associative
! Logical cache line corresponds to N physical lines
! Helps minimize cache line thrashing
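The direct-mapped case above can be sketched in code. A minimal sketch, assuming a hypothetical 16 KB direct-mapped cache with 32-byte lines (actual sizes vary per platform):

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical cache parameters — not any specific console's.
constexpr std::size_t kCacheSize = 16 * 1024;              // 16 KB
constexpr std::size_t kLineSize  = 32;                     // bytes per line
constexpr std::size_t kNumLines  = kCacheSize / kLineSize; // 512 lines

// Which cache line a byte address lands in: addresses that differ by a
// multiple of the cache size map to the same line and evict each other.
constexpr std::size_t CacheLineIndex(std::uintptr_t addr) {
    return (addr / kLineSize) % kNumLines;
}
```

So bytes at k and k + 16384 collide: `CacheLineIndex(k) == CacheLineIndex(k + kCacheSize)` for any k.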
The memory hierarchy
Roughly:
[Diagram: CPU (1 cycle) → L1 cache (~1–5 cycles) → further levels of the hierarchy]
Code cache optimization 1/2
! Moving target
! Beware various implicit functions (e.g. fptodp)
Code cache optimization 2/2
► Size
! Beware: inlining, unrolling, large macros
! KISS
►Avoid featuritis
►Provide multiple copies (also helps locality)
! Loop splitting and loop fusion
! Compile for size (‘-Os’ in gcc)
! Rewrite in asm (where it counts)
► Again, study generated code
! Build intuition about code generated
Data cache optimization
► Lots and lots of stuff…
! “Compressing” data
! Blocking and strip mining
! Padding data to align to cache lines
! Plus other things I won’t go into
► What I will talk about…
! Prefetching and preloading data into cache
! Cache-conscious structure layout
! Tree data structures
! Linearization caching
! Memory allocation
! Aliasing and “anti-aliasing”
Prefetching and preloading
► Software prefetching
! Not too early – data may be evicted before use
! Not too late – data not fetched in time for use
! Greedy
► Preloading (pseudo-prefetching)
! Hit-under-miss processing
Software prefetching
// Loop through and process all 4n elements
for (int i = 0; i < 4 * n; i++)
    Process(elem[i]);
(NB: the prefetching variant of this loop reads one element beyond the end of the elem array.)
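A prefetching variant of the loop above might look like the sketch below. `__builtin_prefetch` is the gcc/Clang intrinsic; `kLookAhead`, `Process`, and the `elem` array are stand-ins, and the look-ahead distance must be tuned per platform:

```cpp
// Naive loop plus a software prefetch a fixed distance ahead.
// elem, Process, and kLookAhead are stand-ins; __builtin_prefetch is
// the gcc/Clang prefetch intrinsic (skipped on other compilers here).
float elem[4 * 100 + 4]; // +4 pads so the last prefetches stay in bounds
float sum = 0.0f;        // stand-in "processing" result

inline void Process(float &e) { sum += e; }

void ProcessAll(int n) {      // processes 4n elements, n <= 100 here
    const int kLookAhead = 4; // elements ahead; tune per platform
    for (int i = 0; i < 4 * n; i++) {
#if defined(__GNUC__)
        __builtin_prefetch(&elem[i + kLookAhead]); // hint: fetch upcoming line
#endif
        Process(elem[i]);
    }
}
```

Note the trade-off the slide warns about: too small a look-ahead and the data arrives late; too large and it may be evicted before use.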
Structures
► Cache-conscious layout
! Field reordering (usually grouped conceptually)
! Hot/cold splitting
► Let usage decide format
! Array of structures
! Structures of arrays
► Little compiler support
! Easier for non-pointer languages (Java)
! C/C++: do it yourself
Field reordering
// Before
struct S {
    void *key;
    int   count[20];
    S    *pNext;
};
// After (hot fields grouped first)
struct S {
    void *key;
    S    *pNext;
    int   count[20];
};
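The hot/cold splitting mentioned above can be sketched like this; `SCold` and the choice of which fields are cold are hypothetical:

```cpp
// Hot/cold splitting: keep frequently accessed ("hot") fields in the
// main struct and push rarely touched ("cold") fields behind a pointer,
// so traversals only pull hot data into the cache.
struct SCold {        // cold: accessed rarely
    int count[20];
};

struct S {            // hot: walked every frame
    void  *key;
    S     *pNext;
    SCold *pCold;     // one indirection to the cold block
};
```

The hot part now fits in a fraction of a cache line, at the cost of an extra indirection on the rare cold accesses.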
Decreasing size!
// Original layout
struct X {
    int8  a;
    int64 b;
    int8  c;
    int16 d;
    int64 e;
    float f;
};
// Padding made explicit
struct X {
    int8  a, pad_a[7];
    int64 b;
    int8  c, pad_c[1];
    int16 d, pad_d[2];
    int64 e;
    float f, pad_f[1];
};
// Reordered by decreasing size
struct X {
    int64 b;
    int64 e;
    float f;
    int16 d;
    int8  a;
    int8  c;
};
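A sketch of the size difference, using `<cstdint>` stand-ins for the slide's `int8`/`int16`/`int64` typedefs (exact sizes depend on the ABI; these comments assume a typical one with 8-byte-aligned `int64`):

```cpp
#include <cstdint>

// Same six fields, two declaration orders.
using int8  = std::int8_t;
using int16 = std::int16_t;
using int64 = std::int64_t;

struct Padded {       // declaration order as on the slide
    int8  a;          // compiler inserts 7 bytes of padding after a
    int64 b;
    int8  c;          // +1 byte padding
    int16 d;          // +4 bytes padding
    int64 e;
    float f;          // +4 bytes tail padding
};                    // 40 bytes total

struct Reordered {    // fields sorted by decreasing size
    int64 b;
    int64 e;
    float f;
    int16 d;
    int8  a;
    int8  c;
};                    // 24 bytes: no internal padding at all
```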
► Pointer-less: Left(n) = 2n, Right(n) = 2n+1
► Requires storage for complete tree of height H
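The pointer-less scheme can be sketched as follows, using the convention that a tree of height H has H+1 levels (function names are illustrative):

```cpp
#include <cstddef>

// Implicit ("pointer-less") complete binary tree: with 1-based node
// indices, children and parent are pure arithmetic, so no child
// pointers are stored.
constexpr std::size_t Left(std::size_t n)   { return 2 * n; }
constexpr std::size_t Right(std::size_t n)  { return 2 * n + 1; }
constexpr std::size_t Parent(std::size_t n) { return n / 2; }

// Storage must cover the complete tree: a tree of height H (H+1 levels)
// has 2^(H+1) - 1 nodes, whether or not they are all used.
constexpr std::size_t NodesForHeight(std::size_t H) {
    return (std::size_t(1) << (H + 1)) - 1;
}
```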
Depth-first order
► “Cache-oblivious”
► Recursive construction
A compact static k-d tree
union KDNode {
    // leaf, type 11
    int32 leafIndex_type;
    // non-leaf, type 00 = x,
    // 01 = y, 10 = z-split
    float splitVal_type;
};
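Reading the type bits out of such a node might look like the sketch below; `NodeType` and `IsLeaf` are hypothetical helpers, and the raw bits are read via `memcpy` to stay clear of the aliasing rules discussed later:

```cpp
#include <cstdint>
#include <cstring>

union KDNode {
    std::int32_t leafIndex_type; // leaf: index in high 30 bits, type 11 low
    float        splitVal_type;  // non-leaf: split value, axis in low 2 bits
};

// The low two bits of the node's 32-bit pattern hold the type:
// 00/01/10 = x/y/z split, 11 = leaf. memcpy reads the raw bits without
// violating the aliasing rules.
inline unsigned NodeType(const KDNode &n) {
    std::uint32_t bits;
    std::memcpy(&bits, &n, sizeof bits);
    return bits & 3u;
}

inline bool IsLeaf(const KDNode &n) { return NodeType(n) == 3u; }
```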
Linearization caching
► Nothing better than linear data
! Best possible spatial locality
! Easily prefetchable
► So linearize data at runtime!
! Fetch data, store linearized in a custom cache
! Use it to linearize…
►hierarchy traversals
►indexed data
►other random-access stuff
Memory allocation policy
► Don’t allocate from heap, use pools
! No block overhead
! Keeps data together
! Faster too, and no fragmentation
► Free ASAP, reuse immediately
! Block is likely in cache so reuse its cachelines
! First fit, using free list
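A minimal sketch of such a pool: fixed-size slots, a free list threaded through the unused slots, and freed blocks reused immediately while their cache lines are still hot. Class and method names are illustrative:

```cpp
#include <cstddef>
#include <vector>

// Fixed-size pool: one contiguous block with a free list threaded
// through the unused slots. Alloc/Free are O(1), blocks stay together
// in memory, and there is no fragmentation.
class Pool {
public:
    // slotSize must be at least sizeof(FreeNode) (one pointer)
    Pool(std::size_t slotSize, std::size_t slotCount)
        : mem_(slotSize * slotCount), free_(nullptr) {
        for (std::size_t i = slotCount; i-- > 0; ) { // thread the free list
            FreeNode *n = reinterpret_cast<FreeNode *>(&mem_[i * slotSize]);
            n->next = free_;
            free_ = n;
        }
    }
    void *Alloc() {                 // pop head of free list
        if (!free_) return nullptr;
        FreeNode *n = free_;
        free_ = n->next;
        return n;
    }
    void Free(void *p) {            // push back on the head, so the block
        FreeNode *n = static_cast<FreeNode *>(p); // is reused first
        n->next = free_;
        free_ = n;
    }
private:
    struct FreeNode { FreeNode *next; };
    std::vector<unsigned char> mem_;
    FreeNode *free_;
};
```

Because `Free` pushes onto the head of the list, the most recently freed block — the one most likely still in cache — is the next one handed out.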
The curse of aliasing
What is aliasing?
Aliasing is multiple references to the same storage location:
int n;
int *p1 = &n;
int *p2 = &n;
Type-based alias analysis
► Enables compiler to rule out aliasing between references of non-compatible type
! Turned on with -fstrict-aliasing in gcc
Compatibility of C/C++ types
► In short…
! Types compatible if differing by signed, unsigned, const or volatile
! char and unsigned char compatible with any type
! Otherwise not compatible
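A sketch of what these rules permit: accessing an `int` through an `unsigned int *` is allowed (signedness difference only), so the compiler must assume the alias; a `float *` could not alias the `int`, and with -fstrict-aliasing gcc exploits that. The function name is illustrative:

```cpp
// 'unsigned int *' may legally alias an 'int' object, so the compiler
// must assume the store through p can change n and reload it.
int n = 1;

int ReadViaCompatible() {
    unsigned int *p = reinterpret_cast<unsigned int *>(&n);
    *p = 42u;        // OK: compatible (unsigned variant of int)
    return n;        // compiler must reload n here — p may alias it
}
```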
into this:
void Foo(float *v, int *n) {
    int t = *n;                  // no aliasing possible,
    for (int i = 0; i < t; i++)  // so fetch *n once!
        v[i] += 1.0f;
}
What TBAA can also do
► Cause obscure bugs in non-conforming code!
! Beware especially so-called “type punning”
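A sketch of the type-punning pitfall and a conforming workaround (`FloatBits` is an illustrative name):

```cpp
#include <cstdint>
#include <cstring>

// Classic type punning — reading a float's bits through an int pointer —
// breaks the compatibility rules above, and under -fstrict-aliasing the
// compiler may reorder or drop such accesses:
//
//     std::uint32_t BrokenBits(float f) {
//         return *reinterpret_cast<std::uint32_t *>(&f);  // undefined!
//     }
//
// The conforming alternative goes through memcpy (or unsigned char):
inline std::uint32_t FloatBits(float f) {
    std::uint32_t u;
    std::memcpy(&u, &f, sizeof u); // compilers optimize this to a move
    return u;
}
```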