…and I’m biased like no one!
Now that I’ve got your attention with such a pretentious title :), let’s clear up a few things:
This is an unfair comparison. More than half of Ogre’s features are disabled (because they don’t compile or would crash), and it’s possible that adding them back could slow things down or improve performance. This benchmark is also far from a real-world scenario: I’m only drawing the same non-textured, 24-vertex cube (with normals, plus unused tangents and one UV set), and it’s likely that API overhead changes these measurements. None of the objects are skeletally animated (although I do animate the cubes). Furthermore, my GSoC is not complete and there’s still a lot to be done (hopefully performance will get even better). I also don’t consider this a “scientifically done” benchmark, although some may disagree: if I were doing it scientifically, I would’ve put more effort into the profiling code so that it logs a history graph, rather than writing the values down on paper while watching the HUD; I was also moving the camera manually instead of using fixed values every time.
Now that I’ve done the proper disclaimers, let’s get to the test.
The test consists of drawing multiple cubes of 24 vertices each, one draw call per cube, using the fixed-function DX9 API. The cubes are not textured.
The point of this little benchmark is to show the difference between Ogre 1.9 and Ogre 2.0 in parsing the scene graph: updating all transformations and frustum culling.
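To make the two stages concrete, here’s a deliberately tiny sketch of what a scene-update pass boils down to. This is not Ogre’s actual code or API; the types and the sphere-based “frustum” test are simplifications I made up for illustration:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical, simplified stand-ins for the two stages of a scene update:
// 1) derive each object's world position, 2) frustum-cull it.
struct Object {
    float localX, localY, localZ;   // position relative to parent
    float worldX, worldY, worldZ;   // derived every frame
    bool  visible;
};

// Stage 1: update transforms (reduced here to parent offset + local position;
// the real thing walks a node hierarchy and builds full matrices).
void updateTransforms(std::vector<Object>& objs,
                      float px, float py, float pz) {
    for (Object& o : objs) {
        o.worldX = px + o.localX;
        o.worldY = py + o.localY;
        o.worldZ = pz + o.localZ;
    }
}

// Stage 2: frustum culling, reduced to a distance test against a
// camera "view sphere" for brevity.
void cullObjects(std::vector<Object>& objs,
                 float cx, float cy, float cz, float radius) {
    for (Object& o : objs) {
        float dx = o.worldX - cx, dy = o.worldY - cy, dz = o.worldZ - cz;
        o.visible = std::sqrt(dx * dx + dy * dy + dz * dz) <= radius;
    }
}
```

The benchmark is essentially measuring how fast these two loops run over tens of thousands of objects, which is why the memory layout of the scene graph matters so much.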
The test can easily switch between Ogre 1.x & Ogre 2.x using a simple batch script, then recompiling.
The bench was performed on the following machine:
- CPU: Intel Quad Core Extreme QX9650 @ 3 GHz (only one core is used at the moment). L1 cache: 32 KB, 42,249 MB/s; L2 cache: 2×6144 KB, 19,606 MB/s
- GPU: AMD Radeon HD 7770 GHz Edition, 1 GB
- RAM: 4 GB DDR2 @ 400 MHz (two sticks in dual channel; they report 4,070 MB/s)
Bandwidth (caches & RAM) was reported by Memtest86+.
Ogre 2.0 was compiled with /arch:SSE2 while Ogre 1.9 wasn’t (in 2.0, SSE2 is mandatory on PC). Both were compiled with MSVC 2008 Express Edition.
There are two factors that make a big difference, and every combination of them is a possible scenario:
- Culling: everything is culled, half of the scene is culled, or everything is rendered on screen.
- Animation: everything is being animated (I just rotate the cubes in place), or nothing is animated.
That’s 3 × 2 combinations, hence six scenarios.
I draw a grid of 250×250 boxes, which makes a total of 62,500 Entities in the scene.
On the left, Ogre 2.0; on the right, Ogre 1.9.
This table summarizes the differences:
| Test – Animated | Ogre 2.0 | Ogre 1.9 | Speedup |
Yay! An astonishing difference in every single case. Note that when rendering all 62,500 cubes, either the RenderQueue inefficiencies (which I haven’t refactored yet) or the API constraints start to show up.
I’m expecting instancing to cause a big speedup even at insanely high entity counts.
Most importantly, 62,500 cubes was “completely unplayable” in Ogre 1.9, while Ogre 2.0 keeps an “acceptable framerate” most of the time.
Note, however, that when not animating, the change isn’t all that impressive:
| Test – Not Animated | Ogre 2.0 | Ogre 1.9 | Speedup |
This time there’s almost no speedup, and 2.0 struggles to keep up with 1.9.
This isn’t surprising at all. Ogre 1.9 updates a transform only when something changes, while 2.0 always updates every transform. This is actually very interesting: 2.0 is so fast that it can almost keep up with 1.9, even though 2.0 is doing all the work while 1.9 is avoiding it. Brute force almost keeps up with brains.
The most noticeable difference is when all entities are culled: 1.9 achieves over 60 fps (~100 fps), while 2.0 stays under 45 fps. This doesn’t worry me because:
- The framerate of 2.0 is a lot more consistent than 1.9’s (a steady low framerate is much better than a high, jumpy/spiky one).
- Nobody really has a game that shows nothing on screen.
- I haven’t implemented the static system for Ogre yet.
The last point is very important. Because each stage is well defined in Ogre 2.0 (unlike the mess in 1.x), the planned static system puts a bit of burden on users: they have to know beforehand which entities won’t move for a long time (e.g. buildings, some scenery props, trees) instead of letting Ogre track that at the micro level.
But the advantage is that Ogre 2.0 will know static nodes can skip the transform stage (updating the node hierarchy and transforming the entities’ bounding boxes from local to world space) and go straight to the frustum-culling stage.
Once that is done, static entities will be super fast.
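Here’s my rough sketch of what that split could look like; this is a guess at the planned design, not actual Ogre 2.0 code. The user tags nodes as static once, and the per-frame transform pass then skips them while culling still sees everything:

```cpp
#include <cassert>
#include <vector>

// Hypothetical static/dynamic split: static nodes keep whatever world
// transform was computed when they were last marked dirty, so the
// per-frame transform pass only touches the dynamic ones.
struct Node {
    bool  isStatic;
    float pos;       // local position
    float worldPos;  // cached world position
};

void updateDynamicTransforms(std::vector<Node>& nodes) {
    for (Node& n : nodes)
        if (!n.isStatic)            // static nodes skip the transform stage
            n.worldPos = n.pos;     // stand-in for the real transform update
}
// Frustum culling (not shown) would still run over *all* nodes,
// static ones included, using their cached world-space bounds.
```

In a real implementation one would likely keep static and dynamic nodes in separate arrays rather than branching per node, so the dynamic pass never even reads static data.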
Threading is around the corner
Bear in mind that, even just by looking at the code, scene graph traversal is perfectly threadable now. When the time comes, it’s just a matter of handing the data to multiple threads and crossing fingers that scalability is close to 100% (I can already imagine my four cores running at full power :P).
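Why is a flat transform array “perfectly threadable”? Because each thread can update a disjoint slice with no locks and no shared writes. A minimal sketch (my illustration, not Ogre code; the transform update is again reduced to a trivial computation):

```cpp
#include <algorithm>
#include <cassert>
#include <thread>
#include <vector>

// Update one contiguous slice of the transform array.
void updateRange(std::vector<float>& world, const std::vector<float>& local,
                 size_t begin, size_t end) {
    for (size_t i = begin; i < end; ++i)
        world[i] = local[i] + 1.0f;   // stand-in for the transform update
}

// Split the array into one slice per thread; slices don't overlap,
// so no synchronization is needed beyond the final join.
void updateThreaded(std::vector<float>& world,
                    const std::vector<float>& local, unsigned numThreads) {
    std::vector<std::thread> pool;
    size_t chunk = (local.size() + numThreads - 1) / numThreads;
    for (unsigned t = 0; t < numThreads; ++t) {
        size_t b = t * chunk;
        size_t e = std::min(local.size(), b + chunk);
        if (b < e)
            pool.emplace_back(updateRange, std::ref(world),
                              std::cref(local), b, e);
    }
    for (auto& th : pool) th.join();
}
```

Near-linear scaling is plausible here precisely because there is no shared mutable state between slices; in practice memory bandwidth usually becomes the ceiling before core count does.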
Let’s take a peek at memory consumption. This is the least reliable test of all, because I’m using the Task Manager to measure the difference, which is a terrible (as in bad) tool for measuring a process’ RAM usage.
Nonetheless, I kept removing variables and useless virtual keywords, and added very few (almost none). RAM usage per Entity is expected to be lower, and that should at least show up in Task Manager as a rough indication. I would worry if, for example, Task Manager suddenly showed Ogre 2.0 using more RAM than 1.9.
Yay! I’m not going to use ratios here, because the numbers shown could be quite misleading, or just a mere lie from the OS.
But it’s good to know we’re on the right track.
Google Summer of Code
GSoC is far from over. In fact, we’re a bit ahead of schedule, because this is the state Ogre was supposed to be in by August 2nd.
Now it’s time to get everything working again. There’s a lot of broken functionality: the GL render system doesn’t compile; RibbonTrails, Billboards, etc. don’t work; instancing is not just broken, it even fires compiler warnings just from including the headers; and shadows aren’t working (neither stencil nor shadow maps).
I had to do something that wasn’t in the schedule: creating the light lists (a sorted list of the closest lights for each Entity) in SIMD form.
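For reference, this is what building such a light list looks like in plain scalar form (my illustration; the actual work described above does the equivalent in SIMD across many entities at once, and the names here are made up):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Light { float x, y, z; int id; };

// Return the ids of the n lights closest to the entity at (ex, ey, ez),
// sorted nearest-first. Squared distance avoids a needless sqrt.
std::vector<int> closestLights(float ex, float ey, float ez,
                               std::vector<Light> lights, size_t n) {
    n = std::min(n, lights.size());
    auto distSq = [&](const Light& l) {
        float dx = l.x - ex, dy = l.y - ey, dz = l.z - ez;
        return dx * dx + dy * dy + dz * dz;
    };
    // Only the first n elements need to end up sorted.
    std::partial_sort(lights.begin(), lights.begin() + n, lights.end(),
                      [&](const Light& a, const Light& b) {
                          return distSq(a) < distSq(b);
                      });
    std::vector<int> ids;
    for (size_t i = 0; i < n; ++i) ids.push_back(lights[i].id);
    return ids;
}
```

Doing this per entity, per frame, for tens of thousands of entities is exactly the kind of embarrassingly parallel, branch-light work that benefits from a SIMD formulation.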
The next thing I’ll be focusing on is bringing instancing back.
Code and reproducing results
Of course, here’s what you need to reproduce my results:
Update: Added source code link