Common C++ Performance Mistakes in Games
About the Data
Why This Talk Is Important
Format
Issue: STL
std::set and map
std::vector
Clearly, the STL is Evil
Use the Right Tool for the Job
The STL is Evil, Sometimes
Issue: NIH Syndrome
Optimizations that Aren’t
Invent Only What You Need
Profile
Issue: Tool Knowledge
vector::clear
Zero-Initialization
Know Thine Holy Standard
Issue: C Runtime
qsort
Clearly, the CRT is Evil
Understand Your Options
Issue: Function Calls
Extreme Function-ality
Beware Elegance
Inline Judiciously
Issue: for loops
Watch Out For For
Issue: Exception Handling
Disable Exception Handling
Issue: Strings
Avoid strings
Issue: Memory Allocation
Hidden Allocations
Minimize Per-Frame Allocations
Other Tidbits
Wrap Up
Call to Action: Evolve!
Questions
227.50K
Category: programmingprogramming

Common C++ Performance Mistakes in Games

1. Common C++ Performance Mistakes in Games

Pete Isensee
Xbox Advanced Technology Group

2. About the Data

• ATG reviews code to find bottlenecks
and make perf recommendations
– 50 titles per year
– 96% use C++
– 1 in 3 use “advanced” features like
templates or generics

3. Why This Talk Is Important

• The majority of Xbox games are CPU
bound
• The CPU bottleneck is often a language
or C++ library issue
• These issues are not usually specific to
the platform

4. Format


Definition of the problem
Examples
Recommendation
For reference
– A frame is 17 or 33 ms (60fps / 30fps)
– Bottlenecks given in ms per frame

5. Issue: STL

• Game using std::list
– Adding ~20,000 objects every frame
– Rebuilding the list every frame
– Time spent: 6.5 ms/frame!
– ~156K overhead (2 pointers per node)
– Objects spread all over the heap

6. std::set and map


Many games use set/map as sorted lists
Inserts are slow (log(N))
Memory overhead: 3 ptrs + color
Worst case in game: 3.8 ms/frame

7. std::vector

• Hundreds of push_back()s per frame
• VS7.1 expands vector by 50%
• Question: How many reallocations for
100 push_back()s?
• Answer: 13!
(1,2,3,4,5,7,10,14,20,29,43,64,95)

8. Clearly, the STL is Evil

9. Use the Right Tool for the Job


The STL is powerful, but it’s not free
Filling any container is expensive
Be aware of container overhead
Be aware of heap fragmentation and
cache coherency
• Prefer vector, vector::reserve()

10. The STL is Evil, Sometimes

• The STL doesn’t solve every problem
• The STL solves some problems poorly
• Sometimes good old C-arrays are the
perfect container
• Mike Abrash puts it well:
– “The best optimizer is between your ears”

11. Issue: NIH Syndrome

• Example: Custom binary tree
– Sorted list of transparent objects
– Badly unbalanced
– 1 ms/frame to add only 400 items
• Example: Custom dynamic array class
– Poorer performance than std::vector
– Fewer features

12. Optimizations that Aren’t

void appMemcpy( void* d, const void* s, size_t b )
{
// lots of assembly code here ...
}
appMemcpy( pDest, pSrc, 100 ); // bottleneck
• appMemcpy was slower than memcpy
for anything under 64K

13. Invent Only What You Need

• std::set/map more efficient than the
custom tree by 10X
– Tested and proven
– Still high overhead
• An even better solution
– Unsorted vector or array
– Sort once
– 20X improvement

14. Profile

• Run your profiler
– Rinse. Repeat.
– Prove the improvement.
• Don’t rewrite the C runtime or STL just
because you can. There are more
interesting places to spend your time.

15. Issue: Tool Knowledge

• If you’re a programmer, you use C/C++
every day
• C++ is complex
• CRT and STL libraries are complex
• The complexities matter
• Sometimes they really matter

16. vector::clear

• Game reused global vector in frame loop
• clear() called every frame to empty the vector
• C++ Standard
– clear() erases all elements (size() goes to 0)
– No mention of what happens to vector capacity
• On VS7.1/Dinkumware, frees the memory
• Every frame reallocated memory

17. Zero-Initialization

struct Array { int x[1000]; };
struct Container {
Array arr;
Container() : arr() { }
};
Container x; // bottleneck
• Costing 3.5 ms/frame
• Removing : arr() speeds this by 20X

18. Know Thine Holy Standard

• Use resize(0) to reduce container size
without affecting capacity
• T() means zero-initialize PODs. Don’t
use T() unless you mean it.
• Get a copy of the C++ Standard. Really.
– www.techstreet.com; search on 14882
– Only $18 for the PDF

19. Issue: C Runtime

void BuildScore( char* s, int n )
{
if( n > 0 )
sprintf( s, “%d”, n );
else
sprintf( s, “” );
}
• n was often zero
• sprintf was a hotspot

20. qsort

• Sorting is important in games
• qsort is not an ideal sorting function
– No type safety
– Comparison function call overhead
– No opportunity for compiler inlining
• There are faster options

21. Clearly, the CRT is Evil

22. Understand Your Options

• itoa() can replace sprintf( s, “%d”, n )
• *s = ‘\0’ can replace sprintf( s, “” )
• std::sort can replace qsort
– Type safe
– Comparison can be inlined
• Other sorting options can be even
faster: partial_sort, partition

23. Issue: Function Calls


50,000-100,000 calls/frame is normal
At 60Hz, Xbox has 12.2M cycles/frame
Function call/return averages 20 cycles
A game calling 61,000 functions/frame
spends 10% CPU (1.7 ms/frame) in
function call overhead

24. Extreme Function-ality

• 120,000 functions/frame
• 140,000 functions/frame
• 130,000 calls to a single function/frame
(ColumnVec<3,float>::operator[])
• And the winner:
– 340,000 calls per frame!
– 9 ms/frame of call overhead

25. Beware Elegance

• Elegance → levels of indirection →
more functions → perf impact
• Use algorithmic solutions first
– One pass through the world
– Better object rejection
– Do AI/physics/networking less often than
once/frame

26. Inline Judiciously

• Remember: inline is a suggestion
• Try “inline any suitable” compiler option
– 15 to 20 fps
– 68,000 calls down to 47,000
• Try __forceinline or similar keyword
– Adding to 5 funcs shaved 1.5 ms/frame
• Don’t over-inline

27. Issue: for loops

// Example 1: Copy indices to push buffer
for( DWORD i = 0; i < dwIndexCnt; ++i )
*pPushBuffer++ = arrIndices[ i ];
// Example 2: Initialize vector array
for( DWORD i = 0; i < dwMax; ++i )
mVectorArr[i] = XGVECTOR4(0,0,0,0);
// Example 3: Process items in world
for( itr i = c.begin(); i < c.end(); ++i )
Process( *i );

28. Watch Out For For

• Never copy/clear a POD with a for loop
• std::algorithms are optimized; use them
memcpy( pPushBuffer, arrIndices,
dwIndexCnt * sizeof(DWORD) );
memset( mVectorArr, 0, dwMax * sizeof(XGVECTOR4) );
for_each( c.begin(), c.end(), Process );

29. Issue: Exception Handling


Most games never throw
Most games never catch
Yet, most games enable EH
EH adds code to do stack unwinding
– A little bit of overhead to a lot of code
– 10% size increase is common
– 2 ms/frame in worst case

30. Disable Exception Handling

• Don’t throw or catch exceptions
• Turn off the C++ EH compiler option
• For Dinkumware STL
– Define “_HAS_EXCEPTIONS=0”
– Write empty _Throw and _Raise_handler; see
stdthrow.cpp and raisehan.cpp in crt folder
– Add #pragma warning(disable: 4530)

31. Issue: Strings

• Programmers love strings
• Love hurts
• ~7000 calls to stricmp in frame loop
– 1.5 ms/frame
• Binary search of a string table
– 2 ms/frame

32. Avoid strings

• String comparisons don’t belong in the
frame loop
• Put strings in an table and compare
indices
• At least optimize the comparison
– Compare pointers only
– Prefer strcmp to stricmp

33. Issue: Memory Allocation

• Memory overhead
– Xbox granularity/overhead is 16/16 bytes
– Overhead alone is often 1+ MB
• Too many allocations
– Games commonly do thousands of
allocations per frame
– Cost: 1-5 ms/frame

34. Hidden Allocations

• push_back(), insert() and friends
typically allocate memory
• String constructors allocate
• Init-style calls often allocate
• Temporary objects, particularly string
constants that convert to string objects

35. Minimize Per-Frame Allocations

• Use memory-friendly data structures, e.g.
arrays, vectors
• Reserve memory in advance
• Use custom allocators
– Pool same-size allocations in a single block of
memory to avoid overhead
• Use the explicit keyword to avoid hidden
temporaries
• Avoid strings

36. Other Tidbits


Compiler settings: experiment
dynamic_cast: just say no
Constructors: performance killers
Unused static array space: track this
Loop unrolling: huge wins, sometimes
Suspicious comments: watch out
– “Immensely slow matrix multiplication”

37. Wrap Up

• Use the Right Tool for the
Job
• The STL is Evil, Sometimes
• Invent Only What You Need
• Profile
• Know Thine Holy Standard
• Understand Your Options
Beware Elegance
Inline Judiciously
Watch Out For For
Disable Exception Handling
Avoid Strings
Minimize Per-frame
Allocations

38. Call to Action: Evolve!

• Pass the rubber chicken
– Share your C++ performance mistakes
with your team
• Mentor junior programmers
– So they only make new mistakes
• Don’t stop learning
– You can never know enough C++

39. Questions

• Fill out your feedback forms
• Email: [email protected]
• This presentation:
www.tantalon.com/pete.htm
English     Русский Rules