Threading results were disappointing. I was expecting a 4x improvement on my quad-core machine; the actual result was a 35–40% improvement, which is still something.
After a closer look, I concluded that at 250×250 instances we're memory-bandwidth bound: disabling all the math in UpdateAllTransforms and just doing a raw copy of the positions to the derived positions did nothing for performance.
I tried very hard with other approaches to optimize UpdateAllTransforms & UpdateAllBounds that should theoretically reduce cache misses (like doing an AoS-to-SoA conversion using shufps instead of simple movss loads), but they required a few extra bytes per instance and the framerate only decreased.
The only solution I can think of to that problem is the same trick used with large matrices: we currently update the transforms of all nodes, then all bounds, then cull all of them.
Perhaps if we did this in smaller batches, it could be possible to keep everything in the cache (the data of 62,500 instances does NOT fit in my 2×6 MB cache, do the math): update 1,000 nodes, then update those 1,000 bounds, then cull them; repeat. But this is very hard to pull off correctly.
Anyway, I’m happy with the improvements over Ogre 1.x; I can’t keep refactoring recently refactored code, otherwise we would never get anywhere.
The other bottlenecks I suspect could be living in the parts I didn’t touch: the RenderQueue and AutoParamsData.
One thing is awesome though, and that is I can run 100×100 rotating instances at 60 fps… IN DEBUG MODE 🙂 🙂 🙂
Debug mode is a lot less bandwidth bound (for obvious reasons) so threading is usually more visible there.
The tests are similar to the previous ones and were done on 4 threads. Do not compare the framerate against earlier posts, as I’ve changed the number of instances per batch (which bumped performance) to get rid of the slowdown effects from the API; I was trying to benchmark the threading here.
Not a silver bullet
Before I begin, I don’t even need my usual disclaimers because there’s something I have to tell you:
Threading results can vary a lot. It depends on your scene and how many entities are in it. In some cases you may not see a performance increase at all. I’ve seen the CPU usage stay around 50%, or go to 100% when not looking at anything (since then all of the work Ogre has to do is threaded).
After launching the profiler, it came to my attention that the test was spending significant CPU time inside SceneNode::roll; that is, animating the cubes.
So what did I do? I let Ogre run that in parallel too. “But that’s cheating!” you may think. Well, the thing is, in Ogre 1.x it was impossible to call roll or setPosition from multiple threads: the SceneManager would crash (more quickly with Octree) or the Node would end up in an invalid state due to race conditions.
In Ogre 2.x, it’s perfectly safe to call roll, setOrientation, setPosition or setScale from multiple threads, as long as the same Node isn’t touched by two threads at once. So I thought it was a valid excuse to do them in parallel.
“But it won’t happen in a real game!” you may say… umm… yes, it will! A real game probably runs Bullet, Havok or PhysX, and positions & orientations are copied every frame from the physics engine to the graphics engine. Now that copy can be done from multiple threads.
If you still think I’m cheating, you can download the source code, disable the aforementioned code, and post your own results. But if you do that, don’t call roll from a single thread, that’s unfair.
“Dynamic Rotating” is the demo with the cubes rotating.
“Dynamic” is the demo with the cubes not rotating, but created with SCENE_DYNAMIC.
“Static” is the demo with the cubes not rotating, and created with SCENE_STATIC flag.
I’ve normalized the CPU usage to the range [0%; 400%], where 400% means all four cores running at 100%, and 100% can mean either one core running at 100% or all four at 25% each.
| Test – Dynamic Rotating | 4 threads | 1 thread | Speedup |
|---|---|---|---|
| All entities | 20.01 ms – 240% CPU | 28.50 ms – 100% CPU | 1.42x |
| Few entities | 11.06 ms – 320% CPU | 18.60 ms – 100% CPU | 1.68x |
| No entities | 10.88 ms – 348% CPU | 18.20 ms – 100% CPU | 1.67x |
| Test – Dynamic | 4 threads | 1 thread | Speedup |
|---|---|---|---|
| All entities | 16.70 ms – 208% CPU | 20.50 ms – 100% CPU | 1.22x |
| Few entities | 7.50 ms – 340% CPU | 10.70 ms – 100% CPU | 1.43x |
| No entities | 7.32 ms – 332% CPU | 9.99 ms – 100% CPU | 1.36x |
| Test – Static | 4 threads | 1 thread | Speedup |
|---|---|---|---|
| All entities | 10.40 ms – 110% CPU | 10.50 ms – 100% CPU | 1.01x |
| Few entities | 1.00 ms – 132% CPU | 1.00 ms – 100% CPU | 1.00x |
| No entities | 0.69 ms – 136% CPU | 0.59 ms – 100% CPU | 0.86x |
As is evident, attaining more than a 1.5x improvement is possible, though not easy. Still, overall it’s something, especially when gamers are so desperate to see their games reach the 60 fps landmark. And if your engine already runs its logic in another thread, getting all cores at full utilization becomes a real possibility.
The static test “few entities” had no speedup at all (1.00x), but I coloured it in red because I noticed a significant amount of extra CPU being used in the threaded version, which may affect battery life on mobile.
Maybe more Ogre devs (including the community) can help track down the inefficiencies that are preventing scalability. And take note that not everything is being threaded yet; particularly the RenderQueue, which in theory should be a candidate for threading.
Furthermore, our threading system is synchronous (for simplicity), but it might be worth researching whether an asynchronous approach could gain some extra speedup.
Now I have to focus on providing a cross-platform facility to autodetect the number of cores in a machine.
As always, here’s the data you need to reproduce my results. I’m very interested in hearing your results: my quad core has a 2×6 MB L2 cache, which means two cores share one 6 MB L2 cache, while the other two share another 6 MB L2 cache.
When going multithreaded, my system is able to use the full 12 MB of cache for Ogre, which could be another explanation for the speed increase.
AFAIK on newer Intel systems each core has its own L2 cache, but all cores share the L3 cache. Also, I haven’t tested AMD systems. Results could be very interesting indeed if they happen to vary much from mine.