Ogre 2.0: O(1) insertion, removal and contiguous memory iteration 3


(and O(N) for lookup)

This pattern will be used widely in Ogre 2.0, so better get used it I thought it would be important to describe and explain it.
It’s not rocket science, but without any explanation it can be a bit daunting at first.

The problem

We need a container that ideally meets the following properties to be appropriate for game development:

  • Fast iteration
  • Fast insertion
  • Fast removal
  • Lookups are rarely used
  • We don’t care about keeping the contents sorted

In 1.x Ogre keeps everything like SceneNodes and MovableObjects within maps (as in std::map or Hashed maps). Sometimes these maps are used just to keep reference of created objects (ie. avoid memory leaks on cleanup) and sometimes they’re used for actual iteration.

Create an entity? then do mMovableObjectCollectionMap[“Entity”].map[“MyEntityName”] = OGRE_NEW Entity();

The good thing about maps is that they’re good for insertion, removals and lookups because of their amortized complexity O( log n ).

But they suck at keeping everything contiguous (in virtual space), and hence iteration; which is what we’ll be doing most often. Instead, we’ll be using an std::vector with a few tricks.

Before we start, let’s remember that Ogre 1.x made an unreasonable effort to keep instances indexed by unique names. This makes a lot sense for Materials & Meshes for example, and those will still be kept as is. However for stuff like Entities, Lights & SceneNodes, names will now be optional and may not be unique. With this in mind, it’s easy to see we’re not doing magic, just fixing a flaw from the aging 1.x design. Now every Entity & SceneNode is unique by an integer (the size of this int is configurable, but default for now is 32-bit)

Achieving Fast insertions

Well this one’s easy. Since we’re a graphics engine, we can impose the developer to actually tell us how many objects he will be creating during the lifetime of his application. Even if the developer doesn’t like it, many ports impose this restriction already due to very limited memory (Consoles, Mobile). This allows us to reserve enough ram.

Unless capacity is reached, every insertion is a push_back since we don’t need sorting, thus taking constant time. Granted, if capacity is exceeded (which can actually happen, specially in PC) that particular insertion will take O(N); a concession we’re reasonably allowed to make.

Achieving Fast removals

This one’s the most known trick: “Swap and pop”. Effectively it consists in swapping the element to remove with the one at the last of the vector and then pop()

This doesn’t work if you need to keep your objects sorted, but we don’t; so it works for us.

The following snippet can do the work for us:

/** Used for efficient removal in std::vector and std::deque (like an std::list)
	However it assumes the order of elements in container is not important or
	something external to the container holds the index of an element in it
	(but still should be kept deterministically across machines)
	Basically it swaps the iterator with the last iterator, and pops back
	Returns the next iterator
*/
template
typename T::iterator efficientVectorRemove( T& container, typename T::iterator& iterator )
{
	const size_t idx = iterator - container.begin();
	*iterator = container.back();
	container.pop_back();

	return container.begin() + idx;
}

std::vector<SceneNode*> sceneNodeList;
std::vector<SceneNode*>::iterator itor; //Assume it's initialized
itor = efficientVectorRemove( sceneNodeList, itor );
//Now itor points to the next element, which could be sceneNodeList.end() if it was the last one

There is a problem though: Given the pointer of the object, how do we find where is it located in the container?

One simple way is to linearly iterate through each object until we find the match (either because their pointer addresses match or their Ids are equal). However this suddenly converts our removals into O(N), because lookups aren’t free, nor amortized.

A very simple solution is to keep an index to our location inside the pointer. This adds an overhead of 4 bytes per instance. Another restriction is that this only works when passing Pointers for removal. Passing its Id means we have to do a look up.

Consider the following code:

class SceneNode
{
public:
	size_t mGlobalIndex;
};

std::vector<SceneNode*> nodeVec;

void removeSceneNode( SceneNode *node )
{
	std::vector<SceneNode*>::iterator itor = nodeVec.begin() + node->mGlobalIndex;
	efficientVectorRemove( nodeVec, itor );
}

Now we just did a removal in O(1) time. The key was in keeping mGlobalIndex up to date. Oh, but there was one caveat. That code above as it is now, is buggy. The node that was at the end is now where the old node used to be. It’s mGlobalIndex is out of date. We need to account that:

void removeSceneNode( SceneNode *node )
{
	std::vector<SceneNode*>::iterator itor = nodeVec.begin() + node->mGlobalIndex;
	efficientVectorRemove( nodeVec, itor );

	//The node that was at the end got swapped and has now a different index
	if( itor != nodeVec.end() )
		(*itor)->mGlobalIndex = itor - nodeVec.begin();
}

Of course this code assumes mGlobalIndex was set at creation time.

We need 2 variables, not 1

There is a caveat with this approach: We need as many indexes as arrays we have. Luckily in Ogre MovableObjects & SceneNodes we only have two: A global vector used to keep track of all objects created, and a ‘parent’ vector that a parent node uses to know which ones are they children.

This means in Ogre we’ll have two variables. In the previous example, we used mGlobalIndex; we’ll also have mParentIndex, which works in the exact same way, but used at object’s scope.

That’s 8 bytes per instance in a 32-bit system, and 8 to 16 bytes in a 64-bit build (a 32-bit index should be totally okay unless you’re planning on having more than 4 billion objects in your scene!). Pretty nice considering a linked list needs the same overhead, maps also have their own overhead, not to mention the overhead of having multiple chunks of heap memory requested.

Pitfalls

  • Lookup / finds take up to O(N). If we assume user tries to maintain LIFO order we can do a reverse search since most likely the objects to be found most often are the last ones being created. But that’s just a hint.
  • If we forget to update mGlobalIndex or mParentIndex after messing the vector’s order; it’s a pita to find the bug. Fortunately the places where these values are manipulated are very few and centralized. As long as we keep the interface design centralized, this shouldn’t be much of a problem.
  • Not Ogre’s case (but applicable to the technique): It requires N variables to track the index of N vectors.
  • Not Ogre’s case (but applicable to the technique): Doesn’t work if you’ll be constantly shuffling your elements in the vector (ie. insert, erase, sort)

Not so contiguous

Well, as Jesus de Santos pointed out, even a flat array isn’t 100% contiguous in physical memory, but rather in virtual memory. Additionally we keep them as an array of pointers, rather than an array of the actual objects. This means one extra level of indirection, where the SceneNodes and MovableObjects may not even be contiguous in virtual space (perhaps some day…).
Nonetheless, this was already true for the 1.x implementation, with the increased overhead of having the list of pointers in non-contiguous operation (adding several layers of extra indirections), nor having random access like arrays do.

Code

A preview of this implementation seen in action is in Rev. 4686. Note however, in its current state Ogre 2.0 is not compiling, not by a long shot. I’m working day and night to get it back online.

**Sign off**


3 thoughts on “Ogre 2.0: O(1) insertion, removal and contiguous memory iteration

  • Lee

    Nice work. I like it.

    I wouldto know if your going to instrument the Ogre code and profile Ogre before you optimize other in it for 2.0
    Liike using Intel Vtune.?

  • Matias Post author

    I’m using CodeAnalyst from AMD. However I would love to put my hands on a VTune license (mostly because my main dev machine is Intel, and I have to use my amd machine to profile the cache misses otherwise).
    In case you haven’t seen my slides, I’ve already profiled the 1.x branch extensively.
    However when I’m done, I’ll have to do it again to see the 2.0 changes and the new hotspots.

  • FF

    At least you can use vtune 30 days for free. At the end of your project I would consider it.

Comments are closed.