Maybe it’s time to talk about a new Linux Display Driver Model


With new APIs being all the rage (Vulkan, D3D12, Metal) and SteamOS pushing for Linux gaming, even I find myself doing more and more graphics work on a Linux machine.

After quite some time of graphics development on an Ubuntu 14.04 machine, it has become clear to me that the Linux graphics architecture is a disaster.

Let’s take a look at the issues:

  1. Due to the nature of my work, I hang the entire system waaaaaaaaaaaaaaay more often than I would on a Windows machine (on Windows, it’s just a TDR). I need to hit the reset button quite often.
  2. OpenGL drivers are inferior to their Windows counterparts. If you stick to strict GL 4.3+ & AZDO it’s actually very nice to work with and performance is on par (sometimes superior to Windows), but I still encounter system hangs, texturing issues (e.g. corruption) or out-of-memory issues I do not have on Windows.
  3. X11 sucks.
  4. To get decent GL on the replacements for X11, you need libGL… which pulls in all the X dependencies.
  5. Current Compositors [stares at Compiz] drain battery like crazy (this is mostly an issue with X).
  6. Getting VSync to work correctly is a nightmare. Even when it’s on, there’s still tearing (WTF???) unless you ditch the Compositor entirely.
  7. VSync on Optimus machines doesn’t work.
  8. Only one GPU works in systems with more than one, unlike on Windows.
  9. There’s no security. I can ask the GL drivers to give me a lot of memory, and read whatever other, now-dead processes had been writing to the framebuffer. WTF! (A small sketch of this follows the list.)
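
To make point 9 concrete, here is a minimal sketch of the kind of probing I mean (assuming any GL 3.1+ context is already current and a loader such as GLEW is initialized; what actually comes back is entirely driver-dependent):

```cpp
// Hedged sketch: allocate a large GL buffer with undefined contents and read it back.
// On drivers that do not zero freshly allocated video memory, the mapped contents
// may still hold data written by other (possibly dead) processes.
#include <GL/glew.h> // any GL loader works; GLEW is just an assumption for this sketch
#include <cstdio>

void dumpPossiblyStaleVram(size_t bytes)
{
    GLuint buffer = 0;
    glGenBuffers(1, &buffer);
    glBindBuffer(GL_COPY_READ_BUFFER, buffer);

    // Passing nullptr means "allocate storage, contents undefined".
    glBufferData(GL_COPY_READ_BUFFER, (GLsizeiptr)bytes, nullptr, GL_STATIC_READ);

    const void* ptr = glMapBufferRange(GL_COPY_READ_BUFFER, 0, (GLsizeiptr)bytes,
                                       GL_MAP_READ_BIT);
    if (ptr)
    {
        // Whatever is here is up to the driver: zeros on a secure stack,
        // stale framebuffer/texture data on an insecure one.
        const unsigned char* p = (const unsigned char*)ptr;
        printf("first bytes: %02x %02x %02x %02x\n", p[0], p[1], p[2], p[3]);
        glUnmapBuffer(GL_COPY_READ_BUFFER);
    }
    glDeleteBuffers(1, &buffer);
}
```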

This applies to Android too. GPU virtualization is a joke: processes screwing with the GLES contexts of other processes; listeners for creating, restoring and destroying windows and contexts being called out of order (e.g. two onCreate calls followed by two onDestroy calls instead of an onCreate->onDestroy->onCreate->onDestroy chain, or onRestore being called before onCreate). A nightmare, probably caused or aggravated by vendor-specific customization of stock Android.

Let’s compare it to WDDM (Windows Display Driver Model):

  1. Memory residency tightly controlled by the kernel and vendor agnostic. Virtual memory doesn’t mean memory is resident. GPU makers of course hate this since they lose control, but it enforces proper virtualization across processes and strong security (memory is zeroed out before being handed to a process; reading out of bounds on the GPU has defined behavior).
  2. Proper enumeration of all “Adapters” (GPUs) and “Monitors”, and their topology (e.g. Monitor A supports N modes and is connected via DVI to Adapter B). A sketch of what this looks like from the application side follows the list.
  3. Proper VSync, facilitated by the Compositor thanks to a sane windowing system.
  4. Standardized protocol for cross-GPU talk. Adapter A can render, copy its contents to the framebuffer of Adapter B, and output the result to any monitor connected to Adapter B. Let’s take a moment to appreciate how I can plug one monitor into my Radeon HD 7770 and another into my integrated Intel HD 4400, place a window straddling the two monitors, and still have it rendered perfectly fine by either card, with proper VSync! That’s an incredible feat of engineering. I can even plug all monitors into my Radeon card and render using the Intel card. Wow, just… wow. (It’s also great for testing.)
  5. Proper handling of adapter loss. In the previous example, I can disable the Radeon card in the Device Manager, and only the monitor connected to the Intel card will keep working. If I was rendering with the Intel card, everything goes on as usual; if I was using the Radeon card, the process gets a device-removed event and has to handle it, either by switching to the Intel one or by quitting (or just crashing). It’s like USB for GPUs. Another wow… I mean… wow.
  6. TDR (Timeout Detection and Recovery). Admittedly it’s a hack: a last-resort failsafe, based on heuristics, for when things go horribly wrong. But it’s so damn useful! The screen freezes for 2 seconds, then flashes, and I can continue to work instead of having to hit Reset. Ideally the GPU should report all sorts of exceptions to the CPU so they can be caught CPU-side, terminating GPU tasks (possibly alongside their CPU process), or letting you monitor each process’s GPU tasks in real time in Task Manager and kill them on demand; TDR should only be needed in case of real HW failure. But right now the HW and SW are not prepared for this, so TDR is what we get. And no, restarting the X11 server is not the same as a TDR. A TDR is roughly equivalent to running modprobe -rf gpumodule && modprobe gpumodule on Linux, but without ripping apart all your non-terminal processes (i.e. most processes in an X11 desktop session).
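
For point 2, here is roughly what that enumeration looks like from the application side on Windows through DXGI, the user-mode face of WDDM (a minimal sketch; error handling trimmed):

```cpp
// Hedged sketch: enumerate adapters, their attached outputs (monitors) and mode counts.
#include <windows.h>
#include <dxgi.h>
#include <cstdio>
#pragma comment(lib, "dxgi.lib")

int main()
{
    IDXGIFactory1* factory = nullptr;
    if (FAILED(CreateDXGIFactory1(__uuidof(IDXGIFactory1), (void**)&factory)))
        return 1;

    IDXGIAdapter1* adapter = nullptr;
    for (UINT a = 0; factory->EnumAdapters1(a, &adapter) != DXGI_ERROR_NOT_FOUND; ++a)
    {
        DXGI_ADAPTER_DESC1 adapterDesc;
        adapter->GetDesc1(&adapterDesc);
        printf("Adapter %u: %ls\n", a, adapterDesc.Description);

        IDXGIOutput* output = nullptr;
        for (UINT o = 0; adapter->EnumOutputs(o, &output) != DXGI_ERROR_NOT_FOUND; ++o)
        {
            DXGI_OUTPUT_DESC outputDesc;
            output->GetDesc(&outputDesc);
            printf("  Output %u: %ls\n", o, outputDesc.DeviceName);

            // Each output lists the display modes it supports for a given format.
            UINT numModes = 0;
            output->GetDisplayModeList(DXGI_FORMAT_R8G8B8A8_UNORM, 0, &numModes, nullptr);
            printf("    %u modes for R8G8B8A8_UNORM\n", numModes);
            output->Release();
        }
        adapter->Release();
    }
    factory->Release();
    return 0;
}
```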

 

I’m not an expert on the driver side of things. I’m also pretty sure driver devs from AMD, Intel and NVIDIA have a lot of bad things to say about WDDM. But the shortcomings on Linux are pretty evident. I can’t get Flash player video to play back smoothly with VSync when running behind a Compositor, for God’s sake (on pretty much any decent and popular Linux distro). Even VLC has issues achieving smooth playback.

All these new APIs (especially with Vulkan coming soon) look great, and it does sound like Vulkan will solve a lot of the issues we’ve been having on *NIX OSes (stability, performance, bugs, information about HW). I’m really excited about it.

I also get the feeling Vulkan’s ability to do cross-GPU talk (just like D3D12) will end up being a workaround to get multiple GPUs working on Linux, one that will be useful outside of games too. But relying on drivers from different vendors to cooperate correctly in an OS whose driver model doesn’t standardize this cross-device talk sounds like something that will only work on sunny days with rainbows.

I fear that if Linux wants to take the next step after Vulkan, to get really serious as a desktop OS and a gaming machine, something needs to be done on the driver side: a WDDM equivalent (LDDM?). This won’t be easy. It needs cooperation from the Linux kernel devs (*cough* Torvalds and his team *cough*) accepting such an initiative, Google’s help (on the Android side), good Vulkan drivers, and a better windowing system than X (like Wayland). That’s a lot of coordination and a lot of work.

 

This isn’t a grim post about Linux gaming or Linux as a desktop OS. Not at all. Like I said, I find myself working with Linux for intensive graphical applications increasingly often (mostly thanks to GL 4.3), and I hear the same from other people. Vulkan is coming, and Valve is pushing this OS as a gaming platform.

I’m just pointing out that the next challenge on Linux after Vulkan will be to address these issues: a proper driver model and a better windowing system.


11 thoughts on “Maybe it’s time to talk about a new Linux Display Driver Model”

  • Niklas Rosenqvist

    Yeah, that is basically why the Linux ecosystem is moving away from X11 to Wayland. You’re a few years too late to complain about the situation; action has already been taken.

    • Matias Post author

      And I’m very glad about that!
      However, the main big issue remaining with Wayland right now is that it is restricted to GLES. GLES is very limited in what you can do compared to GL. The argument is that linking to libGL pulls in lots of X dependencies. I get that. But it needs to be addressed or else adoption will be limited.

      The second problem has little to do with Wayland; it is about kernel-level changes to seamlessly support multiple GPUs. Wayland will have to be aware of this interface too (so that if a window sits in the middle of 4 monitors plugged into 3 different cards, it sends the proper work to all 3 cards; most likely the bulk of the work is done by one card, while the other 2 receive a copy of the subregion they need to display). The same goes for GPU virtualization and security enhancements, which have nothing to do with Wayland either.

      • David Airlie

        Eh, wayland isn’t GLES only, you can run OpenGL via EGL apps fine on it.
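
        Illustrating that point: requesting a desktop OpenGL (not GLES) context through EGL on Wayland looks roughly like this (a sketch assuming EGL 1.5 with EGL_KHR_platform_wayland and an already connected wl_display; window-surface creation is omitted):

        ```cpp
        // Hedged sketch: a desktop GL context on Wayland via EGL.
        #include <EGL/egl.h>
        #include <EGL/eglext.h>
        #include <wayland-client.h>

        EGLContext createDesktopGLContext(struct wl_display* wlDisplay)
        {
            EGLDisplay display = eglGetPlatformDisplay(EGL_PLATFORM_WAYLAND_KHR,
                                                       wlDisplay, nullptr);
            eglInitialize(display, nullptr, nullptr);

            // The key call: ask EGL for desktop OpenGL instead of GLES.
            eglBindAPI(EGL_OPENGL_API);

            const EGLint configAttribs[] = {
                EGL_RENDERABLE_TYPE, EGL_OPENGL_BIT,
                EGL_RED_SIZE, 8, EGL_GREEN_SIZE, 8, EGL_BLUE_SIZE, 8,
                EGL_NONE
            };
            EGLConfig config;
            EGLint numConfigs = 0;
            eglChooseConfig(display, configAttribs, &config, 1, &numConfigs);

            const EGLint contextAttribs[] = {
                EGL_CONTEXT_MAJOR_VERSION, 4,
                EGL_CONTEXT_MINOR_VERSION, 3,
                EGL_NONE
            };
            return eglCreateContext(display, config, EGL_NO_CONTEXT, contextAttribs);
        }
        ```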

        wayland compositors and wayland-using applications are also different things; you need to be more specific when you say “wayland”: the term covers a protocol between compositor and apps, not the implementation of either end.

      • Matias Post author

        GLES is not “quickly catching up to GL”. Not by a long shot; it’s not meant to, either. There is no ARB_base_instance, no multi-draw indirect, no persistent mapping, no BufferStorage, no texture buffers, no SSBOs, no image store, no bindless. Not to mention the GLSL syntax limitations.
        Basically anything needed for low CPU overhead, high performance rendering (quite useful for a compositing manager) and modern rendering techniques.
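
        To put that in concrete terms, here is a minimal sketch of the kind of low-overhead path GL 4.3/4.4 allows and GLES doesn’t (assuming ARB_buffer_storage and ARB_multi_draw_indirect; VAO, shader and fence management are omitted):

        ```cpp
        // Hedged sketch: persistently mapped buffer + multi-draw indirect, the core of
        // the "AZDO" low-CPU-overhead approach. Needs GL 4.4 (glBufferStorage) and
        // GL 4.3 (glMultiDrawElementsIndirect).
        #include <GL/glew.h> // any GL loader; GLEW is an assumption for this sketch
        #include <cstring>

        struct DrawElementsIndirectCommand
        {
            GLuint count, instanceCount, firstIndex, baseVertex, baseInstance;
        };

        static void*  persistentPtr  = nullptr;
        static GLuint indirectBuffer = 0;

        void createIndirectBuffer(GLsizeiptr maxDraws)
        {
            const GLbitfield flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT |
                                     GL_MAP_COHERENT_BIT;
            glGenBuffers(1, &indirectBuffer);
            glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirectBuffer);
            // Immutable storage that stays mapped for the lifetime of the buffer.
            glBufferStorage(GL_DRAW_INDIRECT_BUFFER,
                            maxDraws * sizeof(DrawElementsIndirectCommand), nullptr, flags);
            persistentPtr = glMapBufferRange(GL_DRAW_INDIRECT_BUFFER, 0,
                                             maxDraws * sizeof(DrawElementsIndirectCommand),
                                             flags);
        }

        // Assumes the VAO, program and indirect buffer are still bound, and that a fence
        // guards against overwriting commands the GPU is still reading (omitted here).
        void submitDraws(const DrawElementsIndirectCommand* cmds, GLsizei numDraws)
        {
            // CPU writes straight into GPU-visible memory: no glBufferSubData, no map/unmap.
            std::memcpy(persistentPtr, cmds, numDraws * sizeof(DrawElementsIndirectCommand));
            // One call issues all the draws; their parameters live in the indirect buffer.
            glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT, nullptr, numDraws, 0);
        }
        ```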

        Vulkan supporting Wayland: that’s great. But Vulkan is not the successor to OpenGL, it’s the low-level alternative (pretty much as DX12 is to DX11.3). At the very least, though, it should be possible to build a vendor-agnostic OpenGL implementation on top of Vulkan.

  • bumblebee

    While yes, Linux isn’t Windows: there is only one Windows and one compositor, but there is no ‘one Linux’. There is one Linux kernel, and there are many desktop environments, many compositors and many window managers. There is no use in comparing Windows and Linux-based operating systems that way, as the relationship is simply not one-to-one but one-to-many.

    Sure, graphics on Linux-kernel-based operating systems is not the same as the one graphics system that the one Windows operating system has, and it’s definitely not as polished, but there is also not one company or entity to blame.

    If you want it to be ‘better’, you won’t get there by simply ‘fixing’ something; you’re going to have to replace a large part of the graphics system. That in turn will break pretty much everything, so nobody will accept it and therefore nobody will include it. This is where the problem starts: you either need traction by getting large community support and support from the large distros, or you need a boatload of money and deliver it to them in a nice package.

    It’s not impossible to fix, but it’s also not possible to just ‘write some documentation’ so people know about its shitty state, hope for the best, and actually get a result.

  • David Airlie

    As the person responsible for the kernel code, really all of these things take time and investment, and most of that is on-going.

    If you used Windows Vista, its WDDM was a disaster: you couldn’t plug in two GPUs from two manufacturers and have anything work. They did a lot of work before Windows 8 to resolve that; however, they have a much simpler ecosystem (as does Apple), with a single window system and compositor. Linux userspace is a fragmented nightmare, and you spend more time fixing things in KDE/gnome/etc., all valid bugs, but ones you don’t see.

    We also suffer from NVIDIA having shipped their own replacement for 90% of the stack for so long that they can’t help but fight against attempts to make things better overall. Things are getting better.

    Projects like:
    • glvnd, which splits up the OpenGL ABI so it isn’t X dependent.
    • wayland: as a protocol it is good; however, people keep seeing it as an X server replacement instead of an X11 replacement. The only wayland compositor that is close to competent is mutter, and it’s still quite an internal design mess.
    • GPU reset: we’ve had it for ages, but getting testing on it from GPU vendors is hard; again, the nvidia driver isn’t open source, so only they can spend the time/money to fix that problem.
    • adapter loss: this is a hard one. X11 doesn’t make it easier, and current OpenGL as an API also sucks for it; Direct3D had adapter loss built in from the start, which helped apps deal with it.

    So we don’t really need a new start-from-scratch stack; the current kernel driver model is pretty much sufficient to cover the use case, and WDDM is just that piece in Windows. We need a lot more people invested in making the problems go away. At the moment there are maybe 5 people in the world trying to look at the overall stack design, instead of whatever focused task their employer wants to aid the bottom line. (Lots of people are writing drivers for Android/ARM/Chrome, but it still took 2-3 years to get funded work on making the modesetting API atomic, which helps everyone.)

    • Matias Post author

      Thanks for dropping by! It’s very interesting to get feedback “from the other side”

      Vista’s WDDM: I agree, it was a disaster. But we’re on Windows 10 now, and it has been working fine since Win 7. When we compare that to the current state of Linux, Linux is certainly lagging behind.
      It is true that the userspace is much more fragmented than Apple’s or MS’s. But enumerating all devices properly along with their monitor topology is manager-agnostic: ask DRM for a list of available GPUs, and for a list of Monitors with their available modes and links to those GPUs.
      If X11 or KDE/GNOME don’t handle that well, that’s their problem. Correct me if I’m wrong, but DRM doesn’t seem to expose such a simple interface for querying available adapters, which leaves X11 enumerating all hardware, filtering GPUs, and then selecting only one for Compiz/KWin/Weston/Mutter/Metacity to use. By the time we reach full KDE/GNOME/etc. integration (i.e. choosing window modes, turning off a GPU) a lot of things can go wrong.
      Certainly the biggest offenders here are DRM not providing a simple “query” interface (correct me if I’m wrong) and X11 not supporting more than one card at the same time (at least not by design).
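
      For reference, the closest thing today is querying connectors and modes per card node through libdrm’s KMS calls; a minimal sketch (error handling omitted, and enumerating the GPUs themselves still means scanning /dev/dri/card* yourself):

      ```cpp
      // Hedged sketch: per-GPU connector/mode query via libdrm (KMS).
      #include <xf86drm.h>
      #include <xf86drmMode.h>
      #include <fcntl.h>
      #include <unistd.h>
      #include <cstdio>

      void listConnectorsAndModes(const char* cardNode) // e.g. "/dev/dri/card0"
      {
          int fd = open(cardNode, O_RDWR | O_CLOEXEC);
          if (fd < 0)
              return;

          if (drmModeRes* res = drmModeGetResources(fd))
          {
              for (int i = 0; i < res->count_connectors; ++i)
              {
                  drmModeConnector* conn = drmModeGetConnector(fd, res->connectors[i]);
                  if (!conn)
                      continue;
                  printf("connector %u: %s\n", conn->connector_id,
                         conn->connection == DRM_MODE_CONNECTED ? "connected" : "disconnected");
                  for (int m = 0; m < conn->count_modes; ++m)
                      printf("  mode %s @ %u Hz\n", conn->modes[m].name, conn->modes[m].vrefresh);
                  drmModeFreeConnector(conn);
              }
              drmModeFreeResources(res);
          }
          close(fd);
      }
      ```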

      “we’ve had GPU reset for ages”
      OMG!!! Does anybody else know about it??? These things need to be advertised more. I wouldn’t be surprised if the Linux driver teams of the big 3 vendors don’t know about it.
      I have no problem with a GPU reset restarting the entire X server (and killing all of its processes) if it lets me avoid hitting the reset button (with the reset button there’s a danger of losing work from disk writes that haven’t been flushed yet). Once something like that exists, someone will start looking at whether X can survive a reset for a smoother experience; that probably won’t happen because it’s too difficult, but it may happen on Wayland.

      “X11 doesn’t make that easier, current OpenGL as an API also sucks for it, Direct3D had adapter loss built-in from the start which helped apps deal with it.”
      You’re right, but that is exactly the point of this blogpost! The Vulkan API supports device loss from the start. The next step, once Vulkan is released, is to begin looking at things like this: being able to recover from a reset, and allowing data & monitors to be shared across multiple GPUs.
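
      In Vulkan terms, the recovery path I’m describing should look roughly like this once the API ships (a sketch; the recreation helper is hypothetical):

      ```cpp
      // Hedged sketch: reacting to device loss in Vulkan. Unlike GL, the API reports
      // VK_ERROR_DEVICE_LOST explicitly, so the app can rebuild instead of hanging.
      #include <vulkan/vulkan.h>

      // Hypothetical application helper: re-enumerates physical devices and rebuilds
      // the VkDevice, queues, swapchain and GPU resources.
      extern void recreateDeviceAndResources();

      bool submitFrame(VkQueue queue, const VkSubmitInfo& submitInfo, VkFence fence)
      {
          VkResult result = vkQueueSubmit(queue, 1, &submitInfo, fence);
          if (result == VK_ERROR_DEVICE_LOST)
          {
              // The GPU was reset or the adapter disappeared: tear everything down
              // and start over on whatever adapter is still available.
              recreateDeviceAndResources();
              return false;
          }
          return result == VK_SUCCESS;
      }
      ```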

      Microsoft didn’t have it easy either. For their legacy rendering paths (e.g. GDI) they moved everything to CPU-only rendering, reading and writing directly from GPU memory, with some minor common operations happening entirely on the GPU; see http://blogs.msdn.com/b/e7/archive/2009/04/25/engineering-windows-7-for-graphics-performance.aspx “Desktop Graphics – Reduced Memory Footprint”

      Oh well… again, this isn’t about painting a grim picture. I’m pointing out the challenges. With Valve pushing Linux as a gaming platform, these issues matter more than before. And with many companies behind Vulkan trying to make it happen, this is the right time to start talking about these problems so they get tackled before it’s too late.

      • David Airlie

        I wrote something longer and my browser ate it

        The people who work on this stuff are aware of the problems and of most of the possible solutions for them. There is no single revolutionary feature called LDDM that will solve this.

        There are lots of small problems that need to be fixed one by one, by people with the time and investment. The Linux graphics stack has a resourcing issue: compared to Windows we are between 1/100 and 1/1000 the size in terms of engineering staff, maybe even more. A lot of recent investment has gone into just catching up with the other OSes and with the nvidia binary driver, which doesn’t really contribute much to the Linux stack. I’m not sure you realise the burden of having a complete OpenGL implementation that doesn’t involve itself in the driver model at all and which can’t be adapted without the blessing of the company that works on it. One of the main blockers to removing X11 has been and will always be the nvidia driver.

        So really Linux needs time/investment/money, more people using it, and so on. Until that happens more and more, we are going to just plod along like we do now, with people fixing problems as they become problems for them, their company, or their company’s customers, and other stuff that isn’t as important falling into the cracks until it becomes important.

        Laptops with two GPUs are an example of a problem I spent 2-3 years fixing, because I had company-funded time to work on it. It was a lot of work getting X11 to where it could support GPU output and render offloading; Wayland means a lot of that work has to be redone in a different manner. Guess what I’ll probably end up having to work on next.

        But there are so many niches, like GPU reset and reloading drivers, that we might never get to until an interested party decides to invest the resources. There is no Microsoft enforcer; we can’t make vendors write drivers for a model we create. We create the model in association with the vendors that participate in the process, like Intel and AMD. A lot of vendors participate just to get their driver written without ever graduating to the league of helping sustain the ecosystem, and some vendors sustain the ecosystem but don’t spend enough time understanding the other vendors’ problems.

        So we can’t just do what you think we should, because the solution for Microsoft is nothing like the solution for Linux unfortunately.

        • Matias Post author

          I saw that my post was far more popular than I thought.

          I also saw on reddit & Hacker News that people got the feeling I was attacking Linux. Quite the contrary: I was pointing out the next challenges. I know some Valve employees pass by this blog, and I was (and still am) hoping they would see this.

          Having you reassure me that there is already GPU reset in the kernel stack, going unused, is really nice. The word needs to be spread.

          As for the resources of MS vs FOSS: that’s something I know first hand. However, this is another of the points I mentioned: there’s an increasing number of graphics devs moving to Linux (not exclusively, though). This will help in the long term. Hopefully there will be commercial interest in fixing these issues in the foreseeable future.

          As for the NVIDIA thing: that’s sad. There’s something that bugs me, though.
          When asking GPU vendors for certain features on Windows, there is quite a common response: “We can’t do that. It’s not up to us. The OS kernel handles that part”, sometimes followed by a deeper technical explanation.
          That doesn’t mean they don’t try, though. I’ve certainly been bitten by NVIDIA’s WDDM hacks to make AAA games “faster”, where I have to flush & stall the entire pipeline because they let staging buffer transfers queue up until the system runs out of memory, or I have to artificially clamp the amount of discards/buffer renames because they don’t enforce a limit like AMD or Intel do (which is against the WDDM specs), and this can trigger bus failures on some low-end motherboards due to saturation.

          Despite this, it’s quite clear that graphics drivers on Windows are forced to go through the OS stack. Perhaps at this point in time, with so much fragmentation, it’s too late for Linux, but wouldn’t it be possible to make the windowing system (whether X, Wayland, Mir, whatever) reject anything that doesn’t go through libdrm? It’s not a popular option. But gradually forcing vendors to use certain OS paths would push them in the right direction. The reason multi-GPU, multi-monitor setups work on Windows is because the OS is in charge of that part, not the vendor drivers.

          OpenGL certainly didn’t help, as you say (no device-lost, no way to choose the GPU adapter, completely clueless about the monitor choice… unless you use vendor-specific extensions, of course), while Vulkan has all of these things built in. The tide may change.
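
          For instance, picking the adapter explicitly, which plain GL never let you do, should look roughly like this in Vulkan (a sketch assuming an already-created VkInstance):

          ```cpp
          // Hedged sketch: explicit adapter selection in Vulkan, something plain OpenGL
          // never exposed without vendor-specific extensions.
          #include <vulkan/vulkan.h>
          #include <vector>
          #include <cstdio>

          VkPhysicalDevice pickDiscreteGpu(VkInstance instance)
          {
              uint32_t count = 0;
              vkEnumeratePhysicalDevices(instance, &count, nullptr);
              std::vector<VkPhysicalDevice> gpus(count);
              vkEnumeratePhysicalDevices(instance, &count, gpus.data());

              for (VkPhysicalDevice gpu : gpus)
              {
                  VkPhysicalDeviceProperties props;
                  vkGetPhysicalDeviceProperties(gpu, &props);
                  printf("found adapter: %s\n", props.deviceName);
                  if (props.deviceType == VK_PHYSICAL_DEVICE_TYPE_DISCRETE_GPU)
                      return gpu; // e.g. prefer the discrete card over the integrated one
              }
              return count ? gpus[0] : VK_NULL_HANDLE;
          }
          ```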

          Despite all the shit DXGI gets, a proper vendor-agnostic way to get graphics on an OS should be similar to that system (let’s call the theoretical equivalent LGI, as in Linux Graphics Interface; a rough sketch in code follows a bit further below):

          • GPU devices are enumerated with unique IDs.
          • Monitors are enumerated with links to those GPU IDs, and a list of video modes (video mode changes happen via KMS).
          • A Surface is requested from LGI in order to draw to the screen. A Surface:
            1. May occupy no monitor, or more than one monitor. The process can request which monitor(s) to take, but the Compositor has the final say.
            2. Must be owned by exactly one GPU device.
            3. Can be linked to other surfaces so that their VSync is synchronized (i.e. they are displayed at the same time; some very specific industrial-grade simulators want this, otherwise you see visible stutter because each monitor is VSync’ed but showing different frames). This part is really tricky to get right; not even Windows does it fully well.
          • An OpenGL/Vulkan context must be linked to a requested Surface, which in turn means it’s linked to one GPU.
          • Now the Compositor has all the information it needs: if the surface is moved to the middle of 4 monitors, it knows which GPU is rendering to that surface (via GL, GLES, Vulkan, SW, whatever) and to which GPUs the framebuffer needs to be copied so that it can be shown on each monitor correctly. This needs GPU-to-GPU transfers, which should ideally happen in libdrm, or less ideally via Vulkan.
          • On an Optimus machine, the Surface is owned and rendered by the NVIDIA card, but shown on the monitor owned by the Intel card.
          • Advanced multi-GPU rendering (e.g. SLI/CrossFire, or the explicit forms à la DX12) is up to the API (i.e. Vulkan can do it), and that’s why a surface should be allowed to be monitor-less. The Compositor will know that a monitor-less surface has no window and doesn’t need to care about it: it’s just a process performing offscreen rendering (e.g. to later save a PNG) or multi-GPU cross-talk (e.g. drawing shadow maps on one GPU and the final pass on the GPU owning a surface).
          • When a device gets reset, the surfaces owned by that GPU get busted and become the equivalent of /dev/null. Some processes (the modern ones) may be able to recover if they handle the restore correctly; the rest will be left with a blank window and the close button intact (since the windowing manager is capable of recovering, right???).

          This is a simple algorithm. It is and looks so simple because it took us decades to figure it out (Windows 9x didn’t do it that way, OS X doesn’t do it, nor did Windows XP, nor does any Linux windowing system before Wayland; it only started with Vista, as a rough mockup that got polished in Windows 7).
          That is what Wayland and Mir should be doing (maybe X too? I doubt it will be simple to correct so much legacy). It’s a simple and elegant design.
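
          To make the above concrete, here is how such an interface might read. This is purely hypothetical: every type and function here is made up to restate the bullet points as code, nothing more:

          ```cpp
          // Purely hypothetical sketch of the "LGI" interface described above.
          // None of these types exist anywhere; they only restate the bullet points.
          #include <cstdint>
          #include <vector>

          using GpuId = uint32_t;

          struct VideoMode { uint32_t width, height, refreshHz; };

          struct Monitor
          {
              uint32_t               id;
              GpuId                  attachedGpu; // the adapter this monitor is physically plugged into
              std::vector<VideoMode> modes;       // mode switches would go through KMS
          };

          struct Surface
          {
              GpuId                 ownerGpu;   // owned by exactly one GPU
              std::vector<uint32_t> monitors;   // zero, one or several; the Compositor has the final say
              std::vector<uint32_t> vsyncGroup; // surfaces whose presents must hit the same vblank
          };

          // What the Compositor would do each frame for a surface spanning several GPUs:
          // the owner renders; every other GPU only receives a copy of the sub-region it displays.
          void presentSurface(const Surface& s, const std::vector<Monitor>& allMonitors)
          {
              for (uint32_t monitorId : s.monitors)
              {
                  const Monitor& mon = allMonitors[monitorId]; // sketch: id doubles as index
                  if (mon.attachedGpu != s.ownerGpu)
                  {
                      // Cross-GPU copy of just the visible sub-region, ideally via libdrm,
                      // less ideally via Vulkan. Placeholder, not a real API:
                      // lgiCopyRegion(s.ownerGpu, mon.attachedGpu, s, mon);
                  }
                  // Scan out on mon, honouring the vsyncGroup if there is one.
              }
              // On device reset, every Surface whose ownerGpu was reset becomes a dead,
              // /dev/null-like handle; processes handling the restore event recreate their
              // resources, the rest are left with a blank but still-closable window.
          }
          ```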

          Even if NVIDIA completely replaces its OpenGL and driver stack, OpenGL context creation happens via GLX, and AFAIK NVIDIA doesn’t bundle its own X Server. THAT’S the place where things need to change. To acquire a GL context, NVIDIA must go through GLX. Just change that part: make a GLX2 that works this way and leave a legacy path for GLX (obviously the hardest part, not an easy feat).
          Obviously, NVIDIA replacing the driver stack means GPU-to-GPU transfers won’t work through libdrm. But that only means that:

          • Worst case scenario, only the NVIDIA card can render and the other cards need to be disabled.
          • Middle ground: NVIDIA can render to monitor A and the other card can render to monitor B (as long as monitor B is physically plugged into the non-NVIDIA card), but a window cannot be placed in the middle of the two monitors; windows started via the NVIDIA card cannot be shown on monitor B, and windows started via the other card cannot be shown on monitor A (unless we force a recreation of resources every time they move to another monitor). Restricting each card to the monitors plugged into it sucks for a good desktop user experience, but there are certain industrial applications that would welcome it (e.g. simulators).
          • Best case, everything still works as intended.
          • AMD <-> Intel cooperation should hopefully work, which would put NV at a disadvantage and thus encourage them to play nice with libdrm’s GPU-to-GPU transfers.

          The basic algorithm is simple and easy; the little details and keeping the legacy cruft compatible must certainly be overwhelmingly difficult. But I felt I had to describe how the system should work to avoid losing sight of the end goal.

          If Wayland & Mir (and their X Server emulation) work exclusively using Surfaces with these properties (shown on multiple monitors or none, owned by one GPU), then all that remains is driver support. And until driver support gets there, the system will still work fine as a single-GPU, single- or multi-monitor setup.
