Mesa radeonsi: I’m impressed


Around 5 months ago I upgraded from Ubuntu 14.04 to 16.04.

Little did I know that fglrx (the proprietary binary driver blob from AMD) was no longer supported, and at that time the AMDGPU Pro driver (the new proprietary driver with an open source kernel driver) was extremely experimental for GCN 1.0 cards (it was ok for GCN 1.2 cards).

So I was forced to use the open source “radeon” driver (specifically called “radeonsi” for GCN) that comes with Ubuntu 16.04. At first I was furious. None of my Ogre 2.1 samples would work. For a time I genuinely considered rolling back to 14.04. Mesa sucked as much as it always used to. Then I noticed the Mesa drivers that come bundled with Ubuntu 16.04 are extremely old.

So I tried bleeding edge Mesa via the Oibaf PPA. And oh my God… everything was working! Then I learnt how to compile Mesa myself from source, and Mesa developers eventually told me I needed Kernel 4.7 or higher to run compute shaders. So I also built the kernel from source.

At the time of writing I prefer using non-LTS Ubuntu 16.10 simply because it’s bundled with Kernel 4.8 (no need for me to compile it) as well as coming with more up to date dependencies for building Mesa.
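If you want to check quickly whether a machine meets the kernel requirement above, here is a minimal sketch. It assumes the only thing that matters is the 4.7 minimum the Mesa developers quoted to me; the function names are made up for this example and are not part of any Mesa tooling.

```python
import platform
import re

# Assumption taken from the post: compute shaders on radeonsi need kernel 4.7+.
MIN_KERNEL = (4, 7)


def parse_kernel_release(release):
    """Extract (major, minor) from a release string such as '4.8.0-22-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_supports_compute(release, minimum=MIN_KERNEL):
    """Return True if the release is at or above the assumed minimum version."""
    return parse_kernel_release(release) >= minimum


if __name__ == "__main__":
    running = platform.release()
    verdict = "ok" if kernel_supports_compute(running) else "too old"
    print(f"kernel {running}: {verdict} for compute shaders (assumed minimum 4.7)")
```

Tuples compare element by element, so (4, 8) passes and (4, 4) does not, without any string comparison pitfalls.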

The Good

Source Level Access: Nothing better than being able to step inside the actual binary driver. It’s not working? I can take a look at what it’s doing. Is there a debug switch? Find in Files throughout the source code. We just don’t have this on Windows. You can use debug flags (e.g. R600_DEBUG, LIBGL_ALWAYS_SOFTWARE) to dump the shader ISA or to compare against the software renderer. You can even profile the driver!
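To make the debug flags a bit more concrete, here is a rough sketch of launching a program with them set. It assumes the R600 flag mentioned above is the R600_DEBUG environment variable and that it accepts a comma-separated list of shader stages such as "vs" and "ps"; the option names have changed between Mesa versions, so check the documentation for yours. The helper names and the sample binary are hypothetical.

```python
import os
import subprocess


def mesa_debug_env(base_env=None, software=False, dump_stages=()):
    """Build an environment dict with Mesa debug variables added.

    software    -> sets LIBGL_ALWAYS_SOFTWARE=1 (use the software renderer)
    dump_stages -> sets R600_DEBUG to a comma-separated list, e.g. "vs,ps"
                   (assumed option names; they vary by Mesa version)
    """
    env = dict(os.environ if base_env is None else base_env)
    if software:
        env["LIBGL_ALWAYS_SOFTWARE"] = "1"
    if dump_stages:
        env["R600_DEBUG"] = ",".join(dump_stages)
    return env


def run_with_mesa_debug(command, **kwargs):
    """Run a command with the debug environment and return its exit code."""
    return subprocess.run(command, env=mesa_debug_env(**kwargs)).returncode
```

For example, `run_with_mesa_debug(["./MySample"], dump_stages=("vs", "ps"))` would run a hypothetical sample with shader dumps enabled, and a second run with `software=True` gives you the software renderer's output to compare against.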

Mesa is thin: Unlike proprietary drivers, which try to be overly smart just to get better benchmarks and sell more cards, Mesa has a tendency to do exactly what it is told. If you are a veteran GL developer following modern AZDO practices, a thin driver is exactly what you need to squeeze out the most performance! Of course if you are an AAA studio with a giant codebase full of poor practices and legacy code, or you are porting from D3D (quite common), this may not be exactly good news for you.

Performance on par with Windows: This is where it gets interesting: a comparison with our samples shows Mesa is comparable to both the OpenGL and Direct3D11 proprietary drivers. Sometimes it outperforms them, sometimes it loses (though when it loses, it may lose by a bigger margin than the wins). Because Ogre 2.1 is AZDO, driver overhead is pretty much out of the picture, and what we are really comparing is how good each driver is at compiling shader code. I guess LLVM should really be getting the compliments here.

This is a stark contrast with most benchmarks in Phoronix where Windows builds have a comfortable edge over Linux builds.

Feature level is quite complete: We make very advanced use of OpenGL and, except for Compute Shaders, it runs everything we throw at it.

Decent bugtrack record: At the time of writing I submitted 6 bug reports, often with repros. 4 of them have been fixed and 1 of them has a fix but it’s not been integrated yet. That’s quite impressive.

Works better than fglrx: fglrx had GPU RAM leaks and did not handle out of GPU RAM conditions well. It was blatantly obvious: opening many tabs in Chrome would suddenly start glitching the whole screen. After long use of the system, Ogre 2.1 would fail to open unless I rebooted (because there was not enough GPU RAM), or texture arrays would come out glitched. So in this sense, radeonsi is definitely a win.

The Bad

Compute Shaders: Compute Shader support is relatively new. Basic Compute Shaders run ok, but things begin to break apart with more complex ones. Recently I reported a bug in Compute where it would fail to read depth buffers, which has been fixed at the time of writing. Our Terrain sample relies on Compute to generate the shadows. Right now it generates garbage that loosely resembles the actual output. I haven’t reported this bug because I don’t have time to prepare a repro.

Bleeding edge or GTFO: Updating from LLVM 3.9.0 to 3.9.1 introduced a red tint bug, so you have to skip it and go directly to 4.0.0. However, LLVM 4.0.0 has a bug I reported that was affecting many of our samples: “Flickering artifacts in radeonsi driver with divergent texture reads”. The good news is this bug has a fix. The bad news is the fix comes in the form of an LLVM patch that is still being evaluated for inclusion, and it requires a patch for Mesa to use the LLVM fix (the patch is in the bug report).

If you are a distro packager, I strongly recommend you use LLVM 5.0 + these 2 patches. Anything older than that will cause problems. This is very bleeding edge.

You can also go back to LLVM 3.9.0, but then you miss important bugfixes for recent GPU models.

Please note some of these bugs may affect performance metrics. If the GPU isn’t doing what it should be doing, it may be taking shortcuts that artificially improve performance. It’s pointless to be faster if you’re not getting the job done correctly. How much this is true, I don’t know.

Fault tolerance is suboptimal: Causing a GPU reset or locking up the driver/GPU is much easier if you give bad input, which unfortunately is more common when you’re developing and make mistakes. For example, a silly mistake that causes GL_INVALID_OPERATION with Windows drivers locks up the GPU on Linux.

There’s a lot of incorrect code out there: GL has historically been a big mess with many ways to do the same thing. X11 is also a mess. And this gets noticed. For instance, a default installation of Blender flickers a lot. That’s because it incorrectly assumes the behavior of GL when partially rendering to the front buffer. And Mesa is unforgiving (this is a good thing for new software, but for legacy stuff…). Fortunately this can be fixed in Blender by selecting “Triple Buffer” instead of Automatic as the Window Draw Method (Ctrl+Alt+U); according to Blender’s manual, Triple Buffer is the best method. I don’t know why Automatic doesn’t select Triple Buffer for Mesa.

The Ugly

GPU resets: I… I don’t know why they even try. I complained about this 1.5 years ago. Even David Airlie passed by (thanks Dave!) and told me proprietary IHVs should stop trying to bypass the DRM architecture by implementing their own replacement. Well… I am using the radeonsi drivers this time. And GPU reset recovery sucks. Even when it recovers, spawning LightDM again finishes off the system for good. Not once was I able to resume like I do in Windows. Every time I either had to press Ctrl+Alt+Del to reboot, if I was lucky enough to get a terminal, or (almost always) hit the Reset button.

Hardware Bugs: This is not Mesa’s fault, but it has to be said because the IHV often sits in a more comfortable position, where they hear about bugs from the next room over. If they’ve documented it, chances are the radeonsi driver writers are going to know about it (thanks AMD for being FOSS friendly!). But if it was only said verbally and fixed in Catalyst’s code, they may never find out.

My Radeon HD 7770 has several hardware bugs. Every now and then it syncs incorrectly with the monitor, which manifests itself as a “flash” that lasts for a few milliseconds. This flash often looks like the content of what should be displayed, tiled circa 3×3 with a diagonal inclination (suggesting a row pitch misalignment). Most people are unable to tell what was in the flash; I’m just used to watching for high speed artifacts.

It happens about once per hour, so it’s not a big deal. However with the radeon driver, it appears that sometimes it is unable to sync at all if it fails the first time (i.e. during boot). If that happens, my primary monitor displays a black screen while the secondary monitor displays correctly. This can be fixed by rebooting, or by switching resolutions and then back. It’s annoying, and an actual problem if the second monitor is disabled, because then I’m blind (at least I know how to reboot by pressing Ctrl+Alt+F1 then Ctrl+Alt+Del).

Another HW bug is related to hardware cursors. When using multimonitor setups, moving the cursor from monitor A to monitor B will sometimes glitch the cursor, often looking like a small and thin missingno. Moving the cursor between monitors a few more times will eventually fix it. This bug can be reproduced with fglrx on Linux, older Catalyst drivers on Windows (modern ones seem to have switched to a software cursor), and radeonsi. It is harmless and not annoying at all, but it shows that these clever little buggers can get their way in unnoticed.

I don’t have a confirmation it’s a HW bug, but yesterday my work with SSR (Screen Space Reflections) showed that when I made for loops with large iteration counts diverge a lot (code was the same, data fed was different) in Mesa it would cause flicker (even if everything was stationary). When I tried on Windows, the regions that flickered became 0 instead (but that’s not the value I expected either). I could be wrong, but I have a strong feeling Windows’ driver added failsafes to prevent these HW bugs from popping up while radeonsi just tried to naively do what I asked.
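For readers not used to thinking about divergence: the toy model below sketches why "same code, different data" matters so much for loops on a GPU. It assumes lanes run in lockstep in wavefronts of 64 (the GCN wavefront size) and that a wavefront keeps looping until its slowest lane is done. It says nothing about the flicker itself, which this model cannot reproduce; the function names are invented for the illustration.

```python
def wavefront_steps(iteration_counts, wave_size=64):
    """Total loop steps issued when lanes are grouped into fixed-size wavefronts.

    Each wavefront keeps stepping until its slowest lane finishes, so its cost
    is the maximum iteration count among its lanes.
    """
    steps = 0
    for start in range(0, len(iteration_counts), wave_size):
        wave = iteration_counts[start:start + wave_size]
        steps += max(wave)
    return steps


def lane_utilisation(iteration_counts, wave_size=64):
    """Fraction of issued lane-steps that did useful work (1.0 = no divergence)."""
    issued = wavefront_steps(iteration_counts, wave_size) * wave_size
    useful = sum(iteration_counts)
    return useful / issued if issued else 1.0


if __name__ == "__main__":
    uniform = [32] * 64            # every lane loops the same number of times
    divergent = [1] * 63 + [256]   # one lane loops far longer than the rest
    print(lane_utilisation(uniform), lane_utilisation(divergent))
```

In the divergent case a single long-running lane keeps the other 63 idle for almost the whole loop, which is the kind of workload my SSR loops were producing.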

 

Overall

I’m quite impressed with radeonsi. The performance is great and it’s open source (Stallman’s speeches can tell you how important that is, if you can withstand his paranoia/conspiracy-talk; he’s not wrong, though). I love being able to step inside the driver. Every version keeps improving (although sometimes they step back). The Mesa project moves extremely fast. This blog post will probably be out of date in 2 months.

If I were to recommend what to use for Ogre 2.1; then use bleeding edge Mesa. It can run what we do very well.

Sometimes I wish vendors devoted even more of their resources to Open Source. When an Open Source alternative can compete with and sometimes beat your version, then what’s the point of developing the proprietary driver?

Of course this isn’t enough for developing graphics exclusively on Linux. X11 still sucks (i.e. no reliable VSync; its API is hell). Most tools are on Windows. And AMD offers several diagnostic and profiling tools that only work with their proprietary drivers (on Windows). Stability on Windows is also far superior. The number of GPU resets I get on Windows (unless it’s my fault) is 0. I can’t say the same of the Mesa drivers (*ahem*, running Chrome).

However their inability to properly restore from GPU resets is worrying. It’s not working. Stop pretending it does. The situation has not improved one bit from my rant 1.5 years ago.


One thought on “Mesa radeonsi: I’m impressed”

  • Hi-Angel

    Sorry if this turns out to be a duplicate; I left a comment earlier, but it seems to have disappeared. I thought it had been sent to moderation, but I didn’t see a notification about it.

    Your post made me ask about GPU reset on the #radeon channel. I’ve been told that as of now (in-development kernel 4.15) GPU reset for amdgpu-family cards is not working and has not yet been enabled. This is probably why you’ve seen problems with recovery.
