RakNet, SWIG and a nasty Garbage Collector bug.


We’ve been hunting for months a very rare crash that was driving us nuts.
It was further obfuscated because a SEH exception (a nasty one) was being caught by C# and being rethrown as a completely different exception (that looked like an innocent one).

If you use RakNet + SWIG to get the code running in C#; the following snippet will be familiar to you:

for (Packet packet = node.Receive(); node != null && packet != null; node.DeallocatePacket(packet), packet = node.Receive())
{
	BitStream packetBitStream = new BitStream(packet.data, packet.length, false);
}

Looks innocent, right? It tells the BitStream to use the memory pointer directly from packet.data (without cloning the pointer into a data pointer BitStream owns) and sets the length. The memory is guaranteed to remain valid until its deallocation, which happens when the iteration finishes.
It definitely works in C++. And it seems to work on C#. Except, it can randomly crash at you on C#. Why?

 

After ton of debugging, the problem became quite dead obvious once it gets pointed out.
On C++, the “data” is a pointer owned by the Packet. And the memory is guaranteed to live until DeallocatePacket is called.

However Packet.data returns a byte[] object. This data was allocated by the Garbage Collector. It’s not the direct pointer returned by RakNet, but rather a C# clone. Oops.

Let’s check MSDN’s documentation on the matter:

What happens during a garbage collection

  • (…)A relocating phase that updates the references to the objects that will be compacted.
  • (…)Because generation 2 collections can occupy multiple segments, objects that are promoted into generation 2 can be moved into an older segment. Both generation 1 and generation 2 survivors can be moved to a different segment, because they are promoted to generation 2.(…)

So, basically, the internal memory pointer from a byte[] can become dangling at any time on the next GC iteration, even if there are still references to the object. [sarcasm]GREAT![/sarcasm]

It will probably work if the C/C++ does something on the naked pointer without returning control to C# before it’s done. However the BitStream data is specifically meant to keep a reference to that pointer for a long time.

This does not work.

Worst part, crashes will probably be very unpredictable because they depend on the GC.

All of this comes from code autogenerated from SWIG!!!

It even looks like the SWIG or RakNet developers noticed this problem and attempted to “fix” it by keeping a reference to the byte[] to prevent the GC from reclaiming the memory of a zero-ref object. However this doesn’t fully solve it, as it cannot prevent the cases where the GC decides to move references to different blocks, or migrate an object from one generation to another. That only hides the problem deeper.

 

Solving the problem:

Solution 1

Always use BitStream with the last argument set to “true”; which will force BitStream to use an internal copy of memory that it owns. This is the easiest and safest approach.

 

Solution 2

Solution 1 is great, but we take performance seriously. And allocating and deallocating at such granularity (per packet) doesn’t make us happy.
Manually edit the auto-generated cs files from SWIG so that you don’t pass the byte[] data pointer to C++, but rather the native IntPtr that lives in the C++ Packet class and will live until DeallocatePacket gets called. The files you will need to change are BitStream.cs (remove the constructor that takes a byte[] pointer, add one that takes IntPtr), Packet.cs (add a getter that directly returns RakNetPINVOKE.Packet_data_get (swigCPtr)) and RakNetPINVOKE.cs (add an entry point to CSharp_new_BitStream__SWIG_2 that uses an IntPtr instead of a byte[])

…but it doesn’t stop there. Sadly this guarantee doesn’t hold up when we work in C#. This innocent looking code:

AddressOrGUID( myPacket )

Will call Packet.getCPtr. This function checks if Packet.data has already been called. If that’s true, then it is possible that the data array could’ve been modified from C# and therefore “obj.SetPacketData(obj.data, obj.data.Length)” will be called. In other words, it’s just trying to keep the C# and C++ copies in sync.

Let’s check the C++ definition of CSharp_Packet_SetPacketData:

SWIGINTERN void RakNet_Packet_SetPacketData(RakNet::Packet *self,unsigned char *inByteArray,int numBytes){
	if (self->data!=NULL)
	{
		rakFree_Ex(self->data, "SwigInterfaceFiles\\RakNetCSharpExtends.i", 253);
	}
	//create new with size
	self->data=(unsigned char*) rakMalloc_Ex(numBytes, "SwigInterfaceFiles\\RakNetCSharpExtends.i", 253);
	//copy
	memcpy( self->data,inByteArray, numBytes );
	self->length=numBytes;
}

DANGLING ALERT! DANGLING ALERT!
The data pointer from the C++ side of Packet gets reallocated. The guarantee we had that the Packet’s memory is valid until DeallocatePacket no longer holds true. Something that would never happen in C++ alone; but the custom SWIG C# -> C++ interface breaks it.
Stupid C# snippets like “AddressOrGUID( myPacket )” can cause a chain reaction (in case myPacket.data was already called regardless of whether it was the getter or the setter) that leads to reallocating the Packet’s internal memory, leaving your BitStream in a very dangerous state.

It’s better than the GC dangling your pointers randomly, but still very bad.

Luckily for us, you can edit RakNetCSharpExtends.i so that RakNet_Packet_SetPacketData frees the pointer ONLY IF the number of bytes is higher than what the current pointer can hold (and assert or log otherwise to get a good guess that something strange is going on in case this ever happens). This is rare to actually happen because this function is often used to update the data sent via C# to C++.

You also gain some performance by avoiding to touch the heap unnecessarily. Yay!

You should do the same with RakNet_FileListNode_SetData btw, just in case.
Also stay away from manipulating FileStruct or just downloading files using RakNet from C#. It’s plagued with this pattern of reallocating memory every time it gets called from C# but this time we don’t have the required information to prevent reallocations. If you ever need this functionality, use it solely from C++ side.

If you think this is too much for you to handle, go for solution 1.

 

In case you’re wondering… yes. Once we fixed this problem, all of our rare & random crashes disappeared. I actually took some time to publish this article because we’ve been waiting for thorough testing to confirm the bug was fixed.