Similar presentations:
Common C++ Performance Mistakes in Games
1. Common C++ Performance Mistakes in Games
Pete IsenseeXbox Advanced Technology Group
2. About the Data
• ATG reviews code to find bottlenecksand make perf recommendations
– 50 titles per year
– 96% use C++
– 1 in 3 use “advanced” features like
templates or generics
3. Why This Talk Is Important
• The majority of Xbox games are CPUbound
• The CPU bottleneck is often a language
or C++ library issue
• These issues are not usually specific to
the platform
4. Format
Definition of the problem
Examples
Recommendation
For reference
– A frame is 17 or 33 ms (60fps / 30fps)
– Bottlenecks given in ms per frame
5. Issue: STL
• Game using std::list– Adding ~20,000 objects every frame
– Rebuilding the list every frame
– Time spent: 6.5 ms/frame!
– ~156K overhead (2 pointers per node)
– Objects spread all over the heap
6. std::set and map
Many games use set/map as sorted lists
Inserts are slow (log(N))
Memory overhead: 3 ptrs + color
Worst case in game: 3.8 ms/frame
7. std::vector
• Hundreds of push_back()s per frame• VS7.1 expands vector by 50%
• Question: How many reallocations for
100 push_back()s?
• Answer: 13!
(1,2,3,4,5,7,10,14,20,29,43,64,95)
8. Clearly, the STL is Evil
9. Use the Right Tool for the Job
The STL is powerful, but it’s not free
Filling any container is expensive
Be aware of container overhead
Be aware of heap fragmentation and
cache coherency
• Prefer vector, vector::reserve()
10. The STL is Evil, Sometimes
• The STL doesn’t solve every problem• The STL solves some problems poorly
• Sometimes good old C-arrays are the
perfect container
• Mike Abrash puts it well:
– “The best optimizer is between your ears”
11. Issue: NIH Syndrome
• Example: Custom binary tree– Sorted list of transparent objects
– Badly unbalanced
– 1 ms/frame to add only 400 items
• Example: Custom dynamic array class
– Poorer performance than std::vector
– Fewer features
12. Optimizations that Aren’t
void appMemcpy( void* d, const void* s, size_t b ){
// lots of assembly code here ...
}
appMemcpy( pDest, pSrc, 100 ); // bottleneck
• appMemcpy was slower than memcpy
for anything under 64K
13. Invent Only What You Need
• std::set/map more efficient than thecustom tree by 10X
– Tested and proven
– Still high overhead
• An even better solution
– Unsorted vector or array
– Sort once
– 20X improvement
14. Profile
• Run your profiler– Rinse. Repeat.
– Prove the improvement.
• Don’t rewrite the C runtime or STL just
because you can. There are more
interesting places to spend your time.
15. Issue: Tool Knowledge
• If you’re a programmer, you use C/C++every day
• C++ is complex
• CRT and STL libraries are complex
• The complexities matter
• Sometimes they really matter
16. vector::clear
• Game reused global vector in frame loop• clear() called every frame to empty the vector
• C++ Standard
– clear() erases all elements (size() goes to 0)
– No mention of what happens to vector capacity
• On VS7.1/Dinkumware, frees the memory
• Every frame reallocated memory
17. Zero-Initialization
struct Array { int x[1000]; };struct Container {
Array arr;
Container() : arr() { }
};
Container x; // bottleneck
• Costing 3.5 ms/frame
• Removing : arr() speeds this by 20X
18. Know Thine Holy Standard
• Use resize(0) to reduce container sizewithout affecting capacity
• T() means zero-initialize PODs. Don’t
use T() unless you mean it.
• Get a copy of the C++ Standard. Really.
– www.techstreet.com; search on 14882
– Only $18 for the PDF
19. Issue: C Runtime
void BuildScore( char* s, int n ){
if( n > 0 )
sprintf( s, “%d”, n );
else
sprintf( s, “” );
}
• n was often zero
• sprintf was a hotspot
20. qsort
• Sorting is important in games• qsort is not an ideal sorting function
– No type safety
– Comparison function call overhead
– No opportunity for compiler inlining
• There are faster options
21. Clearly, the CRT is Evil
22. Understand Your Options
• itoa() can replace sprintf( s, “%d”, n )• *s = ‘\0’ can replace sprintf( s, “” )
• std::sort can replace qsort
– Type safe
– Comparison can be inlined
• Other sorting options can be even
faster: partial_sort, partition
23. Issue: Function Calls
50,000-100,000 calls/frame is normal
At 60Hz, Xbox has 12.2M cycles/frame
Function call/return averages 20 cycles
A game calling 61,000 functions/frame
spends 10% CPU (1.7 ms/frame) in
function call overhead
24. Extreme Function-ality
• 120,000 functions/frame• 140,000 functions/frame
• 130,000 calls to a single function/frame
(ColumnVec<3,float>::operator[])
• And the winner:
– 340,000 calls per frame!
– 9 ms/frame of call overhead
25. Beware Elegance
• Elegance → levels of indirection →more functions → perf impact
• Use algorithmic solutions first
– One pass through the world
– Better object rejection
– Do AI/physics/networking less often than
once/frame
26. Inline Judiciously
• Remember: inline is a suggestion• Try “inline any suitable” compiler option
– 15 to 20 fps
– 68,000 calls down to 47,000
• Try __forceinline or similar keyword
– Adding to 5 funcs shaved 1.5 ms/frame
• Don’t over-inline
27. Issue: for loops
// Example 1: Copy indices to push bufferfor( DWORD i = 0; i < dwIndexCnt; ++i )
*pPushBuffer++ = arrIndices[ i ];
// Example 2: Initialize vector array
for( DWORD i = 0; i < dwMax; ++i )
mVectorArr[i] = XGVECTOR4(0,0,0,0);
// Example 3: Process items in world
for( itr i = c.begin(); i < c.end(); ++i )
Process( *i );
28. Watch Out For For
• Never copy/clear a POD with a for loop• std::algorithms are optimized; use them
memcpy( pPushBuffer, arrIndices,
dwIndexCnt * sizeof(DWORD) );
memset( mVectorArr, 0, dwMax * sizeof(XGVECTOR4) );
for_each( c.begin(), c.end(), Process );
29. Issue: Exception Handling
Most games never throw
Most games never catch
Yet, most games enable EH
EH adds code to do stack unwinding
– A little bit of overhead to a lot of code
– 10% size increase is common
– 2 ms/frame in worst case
30. Disable Exception Handling
• Don’t throw or catch exceptions• Turn off the C++ EH compiler option
• For Dinkumware STL
– Define “_HAS_EXCEPTIONS=0”
– Write empty _Throw and _Raise_handler; see
stdthrow.cpp and raisehan.cpp in crt folder
– Add #pragma warning(disable: 4530)
31. Issue: Strings
• Programmers love strings• Love hurts
• ~7000 calls to stricmp in frame loop
– 1.5 ms/frame
• Binary search of a string table
– 2 ms/frame
32. Avoid strings
• String comparisons don’t belong in theframe loop
• Put strings in an table and compare
indices
• At least optimize the comparison
– Compare pointers only
– Prefer strcmp to stricmp
33. Issue: Memory Allocation
• Memory overhead– Xbox granularity/overhead is 16/16 bytes
– Overhead alone is often 1+ MB
• Too many allocations
– Games commonly do thousands of
allocations per frame
– Cost: 1-5 ms/frame
34. Hidden Allocations
• push_back(), insert() and friendstypically allocate memory
• String constructors allocate
• Init-style calls often allocate
• Temporary objects, particularly string
constants that convert to string objects
35. Minimize Per-Frame Allocations
• Use memory-friendly data structures, e.g.arrays, vectors
• Reserve memory in advance
• Use custom allocators
– Pool same-size allocations in a single block of
memory to avoid overhead
• Use the explicit keyword to avoid hidden
temporaries
• Avoid strings
36. Other Tidbits
Compiler settings: experiment
dynamic_cast: just say no
Constructors: performance killers
Unused static array space: track this
Loop unrolling: huge wins, sometimes
Suspicious comments: watch out
– “Immensely slow matrix multiplication”
37. Wrap Up
• Use the Right Tool for theJob
• The STL is Evil, Sometimes
• Invent Only What You Need
• Profile
• Know Thine Holy Standard
• Understand Your Options
Beware Elegance
Inline Judiciously
Watch Out For For
Disable Exception Handling
Avoid Strings
Minimize Per-frame
Allocations
38. Call to Action: Evolve!
• Pass the rubber chicken– Share your C++ performance mistakes
with your team
• Mentor junior programmers
– So they only make new mistakes
• Don’t stop learning
– You can never know enough C++
39. Questions
• Fill out your feedback forms• Email: [email protected]
• This presentation:
www.tantalon.com/pete.htm