Ogre 2.0 is up to 3x faster


…and I’m biased like no one!

Now that I’ve got your attention with such a pretentious title :), let’s clear up a few things:

  • This is an unfair comparison. More than half of Ogre’s features are disabled (because they do not compile or would crash). Adding them back could slow things down or improve performance.
  • This benchmark is far from a real-world scenario (I’m only testing the same non-textured 24-vertex cube, which has Normals plus unused Tangents & 1 UV set), and it’s likely that API overhead affects these measurements.
  • None of the objects are skeletally animated (although I do animate the cubes).
  • My GSoC is not complete and there’s still a lot to be done (hopefully performance will get even better).
  • I don’t consider this a “scientifically done” benchmark myself, although some may disagree. If I were to do it scientifically, I would’ve put more effort into the profiling code so that it logs the history graph, rather than writing the values down on paper while watching the HUD. I was also moving the camera manually instead of using fixed values every time.

Now that I’ve made the proper disclaimers, let’s get to the test.

The test

The test consists of drawing multiple 24-vertex cubes, one draw call per cube, using the DX9 fixed-function API. The cubes are not textured.

The point of this little benchmark is to show the difference between Ogre 1.9 & Ogre 2.0 in processing the scene graph: updating all transformations and frustum culling.
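To make the "frustum culling" half of that concrete, here is a minimal scalar sketch of culling bounding spheres against six planes. The types and the function `cullSpheres` are hypothetical names made up for illustration; they are not Ogre classes, and this is not the SIMD code path Ogre 2.0 uses.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical types for illustration only; these are not Ogre classes.
struct Vec3 { float x, y, z; };

// Plane in the form dot(n, p) + d = 0, with n pointing into the frustum.
struct Plane { Vec3 n; float d; };

struct Sphere { Vec3 center; float radius; };

inline float signedDistance(const Plane& pl, const Vec3& p) {
    return pl.n.x * p.x + pl.n.y * p.y + pl.n.z * p.z + pl.d;
}

// Returns the indices of the spheres that are not fully outside any plane.
std::vector<std::size_t> cullSpheres(const std::array<Plane, 6>& frustum,
                                     const std::vector<Sphere>& bounds) {
    std::vector<std::size_t> visible;
    for (std::size_t i = 0; i < bounds.size(); ++i) {
        bool inside = true;
        for (const Plane& pl : frustum) {
            // Entirely behind one plane means the sphere cannot be visible.
            if (signedDistance(pl, bounds[i].center) < -bounds[i].radius) {
                inside = false;
                break;
            }
        }
        if (inside) visible.push_back(i);
    }
    return visible;
}
```

With 62,500 objects this loop runs once per object per frame, which is why the cost of the scene graph pass shows up so clearly in the numbers below.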

The test can easily switch between Ogre 1.x & Ogre 2.x using a simple batch script, then recompiling.

The bench was performed on the following machine:

  • CPU: Intel Quad Core Extreme QX9650 @ 3 GHz (only 1 core is used at the moment). L1 cache: 32 KB, 42249 MB/s. L2 cache: 2×6144 KB, 19606 MB/s
  • GPU: AMD Radeon HD 7770 1 GB, GHz Edition
  • RAM: 4 GB DDR2 at 400 MHz (2 sticks in dual channel; they report 4070 MB/s)

Bandwidth (caches & RAM) was reported by Memtest86+.

Ogre 2.0 was compiled with /arch:SSE2 while Ogre 1.9 wasn’t (in 2.0 SSE2 is mandatory for PC). Both were compiled with MSVC 2008 Express Edition

The Scenarios

There are eight scenarios, built from two factors that make a big difference:

  • How much is visible: everything is culled, half of the stuff is culled, or everything is rendered on screen (the results below also include a case where only a few entities are visible).
  • Whether the entities are animated: everything is animated (I just rotate them on the spot), or nothing is animated.

That’s 4 visibility cases × 2 animation states, hence 8 scenarios.

I draw a grid of 250×250 boxes, which makes a total of 62,500 Entities in the scene.

The Results

On the left Ogre 2.0; on the right Ogre 1.9

Ogre 2.0 Few Entities – Animated

Ogre 1.9 Few Entities – Animated

Ogre 2.0 Some Entities – Animated

Ogre 1.9 Some Entities – Animated

Ogre 2.0 All Entities – Animated

Ogre 1.9 All Entities – Animated

Ogre 2.0 No Entities – Animated

Ogre 1.9 No Entities – Animated

This table summarizes the differences:

Test – Animated    Ogre 2.0     Ogre 1.9     Speedup
Few entities       38.92 ms     124.33 ms    3.19x
Some entities      50.85 ms     137.75 ms    2.71x
All entities       272.75 ms    383.33 ms    1.41x
No entities        38.73 ms     107.10 ms    2.77x

Yay! An astonishing difference in every single case. Note that when rendering all 62,500 cubes, either the RenderQueue inefficiencies (which I haven’t refactored yet) or the API constraints start to show.

I’m expecting instancing to give a big speedup even at insanely high entity counts.

Most importantly, having 62,500 cubes was “completely unplayable” in Ogre 1.9, while Ogre 2.0 has an “acceptable framerate” most of the time.

Note, however, that when not animating, the change isn’t all that impressive:

Test – Not Animated    Ogre 2.0     Ogre 1.9     Speedup
Few entities           25.17 ms     26.87 ms     1.07x
Some entities          37.56 ms     37.52 ms     1.00x
All entities           262.25 ms    222.20 ms    0.85x
No entities            25.05 ms     10.08 ms     0.40x

This time there’s almost no speedup, and 2.0 struggles to keep up with 1.9.

This isn’t surprising at all. Ogre 1.9 updates transform only when something changes, while 2.0 always updates the transform. This is actually very interesting because 2.0 is so fast that it can almost keep up with 1.9; except 2.0 is doing a lot of work, while 1.9 is avoiding everything. Brute force is almost keeping up with brains.
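The "brains vs. brute force" contrast can be sketched like this. It is a deliberately simplified 1-D model with a made-up `Node` struct; it does not reflect how either Ogre version actually stores nodes or tracks changes.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical node layout for illustration; not Ogre's actual data structures.
struct Node {
    float localX = 0.0f;   // position relative to the parent
    float parentX = 0.0f;  // parent's world position (hierarchy flattened)
    float worldX = 0.0f;   // derived world position
    bool dirty = true;     // set whenever localX or parentX changes
};

// "Brains": touch only the nodes flagged as changed. Returns how many were updated.
std::size_t updateDirtyOnly(std::vector<Node>& nodes) {
    std::size_t updated = 0;
    for (Node& n : nodes) {
        if (!n.dirty) continue;  // one branch per node, plus flag bookkeeping
        n.worldX = n.parentX + n.localX;
        n.dirty = false;
        ++updated;
    }
    return updated;
}

// "Brute force": recompute every node each frame, with no bookkeeping at all.
std::size_t updateAll(std::vector<Node>& nodes) {
    for (Node& n : nodes) n.worldX = n.parentX + n.localX;
    return nodes.size();
}
```

When nothing moves, the first function does almost no work, which matches the not-animated table. When everything moves, both do the same math, and the simpler loop has less overhead per node.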

The most noticeable difference is when all entities are culled, mostly because 1.9 runs well above 60 fps (~100 fps) while 2.0 stays under 45 fps. This doesn’t worry me because:

  1. The framerate of 2.0 is a lot more consistent than 1.9 (low steady framerate is much better than high, jumpy/spiky framerate)
  2. Nobody really has a game that shows nothing on screen.
  3. I haven’t implemented the static system for Ogre yet.

The last point is very important. Each stage is well defined in Ogre 2.0 (unlike the mess from 1.x). The planned static system puts a bit of a burden on the user, who has to know beforehand which entities won’t move for a long time (e.g. buildings, some scenery props, trees), instead of letting Ogre track that at the micro level.

But the advantage is that Ogre 2.0 will know static nodes can skip the transform stage (update node hierarchy, update the entities’ bounding boxes from local to world space) and go straight to the frustum cull stage.

Once that is done, static entities would go super fast.
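Since the static system doesn’t exist yet, the following is only a guess at the general shape of the idea, not the planned design. `Scene`, `notifyStaticChanged` and the 1-D "culling" are all hypothetical names and simplifications:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, simplified 1-D objects; not Ogre's actual classes.
struct Object {
    float localX = 0.0f;  // position relative to the parent
    float worldX = 0.0f;  // derived world position
};

struct Scene {
    std::vector<Object> dynamicObjects;  // transformed every frame
    std::vector<Object> staticObjects;   // transformed only after a notification
    bool staticDirty = true;             // true until the first transform pass

    // The user promises to call this whenever a static object is moved.
    void notifyStaticChanged() { staticDirty = true; }

    // Transform stage: returns how many objects were recomputed.
    std::size_t transformStage(float parentX) {
        std::size_t updated = 0;
        for (Object& o : dynamicObjects) {
            o.worldX = parentX + o.localX;
            ++updated;
        }
        if (staticDirty) {
            for (Object& o : staticObjects) {
                o.worldX = parentX + o.localX;
                ++updated;
            }
            staticDirty = false;
        }
        return updated;
    }

    // Cull stage: always runs over both groups; counts objects in [minX, maxX].
    std::size_t cullStage(float minX, float maxX) const {
        std::size_t visible = 0;
        for (const Object& o : dynamicObjects)
            if (o.worldX >= minX && o.worldX <= maxX) ++visible;
        for (const Object& o : staticObjects)
            if (o.worldX >= minX && o.worldX <= maxX) ++visible;
        return visible;
    }
};
```

The trade-off is the one described above: the user has to sort objects into the right group and signal changes, and in exchange the static group costs nothing in the transform stage on most frames.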

Threading is around the corner

Keep in mind that, as you can tell just by looking at the code, scene graph traversal is perfectly threadable now. When the time comes, it’s just a matter of splitting the data across multiple threads and crossing my fingers that scalability is close to 100% (I can already imagine my four cores running at full power :P)
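As a rough idea of what "splitting the data across multiple threads" could look like, here is a sketch that divides a flat array of transforms into contiguous chunks, one per worker. It assumes a C++11 compiler with `std::thread` (newer than the MSVC 2008 used for this benchmark), and `updateParallel` is a hypothetical function, not anything in Ogre.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Updates world[i] = parent + local[i] for i in [begin, end).
void updateRange(std::vector<float>& world, const std::vector<float>& local,
                 float parent, std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) world[i] = parent + local[i];
}

// Splits the work into contiguous chunks, one per thread, then waits for all.
void updateParallel(std::vector<float>& world, const std::vector<float>& local,
                    float parent, unsigned threadCount) {
    const std::size_t count = local.size();
    world.resize(count);
    threadCount = std::max(1u, threadCount);
    const std::size_t chunk = (count + threadCount - 1) / threadCount;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threadCount; ++t) {
        const std::size_t begin = std::min(count, t * chunk);
        const std::size_t end = std::min(count, begin + chunk);
        if (begin == end) break;  // nothing left to hand out
        // Each worker writes to a disjoint range, so no locking is needed.
        workers.emplace_back([&world, &local, parent, begin, end] {
            updateRange(world, local, parent, begin, end);
        });
    }
    for (std::thread& w : workers) w.join();
}
```

This only works this simply because each element depends on nothing but its own input; a real hierarchy needs parents updated before children, for example by processing one depth level at a time.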

Memory Consumption

Let’s take a peek at the memory consumption. This is the least reliable test of all, because I’m using the task manager to measure the difference, which is a terrible (as in bad) tool to measure a process’ RAM usage.

Nonetheless, I kept removing variables and useless virtual keywords, and added very few (almost none). RAM usage per Entity is expected to be lower, and that should at least show up in Task Manager as a rough indication. I would worry if, for example, Task Manager suddenly showed Ogre 2.0 using more RAM than 1.9.

Ogre 1.9

Ogre 1.9 using 235MB

Ogre 2.0

Ogre 2.0 using 167MB

Yay! I’m not going to use ratios here, because the numbers showing up here could be quite misleading or just a mere lie from the OS.

But it’s good to know we’re on the right track.

Google Summer of Code

GSoC is far from over. In fact, we’re a bit ahead of schedule, because this is the state Ogre was supposed to be in by August 2nd.

Now it’s time to get everything working. There’s a lot of broken functionality. The GL render system doesn’t compile. RibbonTrails, Billboards, etc. don’t work. Instancing is not just broken, it even triggers compiler warnings just from including the headers. Shadows aren’t working (neither stencil nor shadow maps).

I had to do something that wasn’t in the schedule: creating the light lists (a sorted list of the closest lights to each Entity) in SIMD form.
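For readers unfamiliar with light lists, this is what one looks like in plain scalar form: for a given entity position, pick the nearest few lights and sort them by distance. The function `closestLights` and the `Vec3` type are hypothetical, and this is explicitly not the SIMD version mentioned above.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical type for illustration; not an Ogre class.
struct Vec3 { float x, y, z; };

inline float squaredDistance(const Vec3& a, const Vec3& b) {
    const float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// Returns indices into `lights`, nearest first, with at most maxLights entries.
std::vector<std::size_t> closestLights(const Vec3& entityPos,
                                       const std::vector<Vec3>& lights,
                                       std::size_t maxLights) {
    std::vector<std::size_t> order(lights.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;

    const std::size_t keep = std::min(maxLights, order.size());
    // Only the first `keep` entries need to end up sorted.
    std::partial_sort(order.begin(),
                      order.begin() + static_cast<std::ptrdiff_t>(keep),
                      order.end(),
                      [&](std::size_t a, std::size_t b) {
                          return squaredDistance(entityPos, lights[a]) <
                                 squaredDistance(entityPos, lights[b]);
                      });
    order.resize(keep);
    return order;
}
```

Doing this per entity is a lot of distance computations, which is presumably why it was worth vectorizing.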

The next thing I’ll be focusing on is bringing instancing back.

Code and reproducing results

Of course, what you need to reproduce my results:

Update: Added source code link


2 thoughts on “Ogre 2.0 is up to 3x faster”

  • cyrfer

    This sounds like great work. By the way, I think OGRE should remove the derivatives (not necessarily literal derivatives) of Entity from the Core API, like Billboards and such. So don’t waste your time fixing those classes!

    Does your work remove the need for an Entity derivative to support Instancing? I never liked the fact that we need a different Entity type and factory to achieve instancing.

    • Matias Post author

      InstancedEntity is still a separate class from Entity (both deriving from MovableObject and both being able to attach to SceneNodes).
      Considering how things are developing, unless your entities have a very small number of instances, using InstancedEntity from the start will most likely achieve greater performance.

      I share part of that frustration. I’ve been thinking of ways to “auto instance” like Frostbite does and treat all entities as instanced entities automatically.
      However all approaches I can think of involve some form of HW instancing support, which unfortunately doesn’t play well with Mobile right now.

Comments are closed.