A little clarification on modern shader compile times


So I saw these tweets earlier today:

> Horizon Zero Dawn has been fun.
> Tells me I have an old driver. Still runs. Takes 30 minutes to compile shaders. I watch the intro cut scene. I think. Better to update the drivers. I quit. Update drivers. Start the game again.
> And now we’re compiling shaders again? What?

> Why are companies not compiling shaders as part of their build?
> I am so confused are they injecting screen resolution and driver version or what?

I won’t quote every tweet; I’ll just mention that another one pointed out that other (older) games don’t have this problem.

So there are a few issues that need explaining.

Pre DX12/Vulkan world

In D3D11/OpenGL, from a 1000 km view, we can simplify the rendering pipeline to the following (I’m not going to cover every stage):

vertex input -> run vertex shader -> run pixel shader -> output pixel

  1. Vertex input: This is programmable from C++ side. It answers questions such as:
    • Does the vertex contain only position?
    • Does it have a normal?
    • Is the attribute stored as 32-bit floats? As 16-bit floats? As 8-bit values where the range [0; 255] is converted to the range [0; 1.0]? (See the sketch after this list.)
  2. Vertex shader, which pretends all the vertex inputs are 32-bit floats
  3. Pixel shader, which pretends all pixels are 4-channel RGBA in 32-bit floating-point precision
  4. The output pixel, which can be:
    • In RGBA8_UNORM, RGBA16_UNORM, RGBA16_FLOAT, RG16_UNORM, R8_SNORM, etc. See DXGI_FORMAT for a long list.
    • It may or may not use MSAA
    • It may or may not use alpha blending
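
To make that concrete, here is a rough D3D11-style sketch of how much of this lives on the C++ side. The structs and format enums are the real API; the particular attributes, offsets and formats are just a hypothetical example:

```cpp
#include <d3d11.h>

// The same vertex shader could be fed by many different layouts like this one.
static const D3D11_INPUT_ELEMENT_DESC kVertexLayout[] = {
    // Position stored as three 32-bit floats.
    { "POSITION", 0, DXGI_FORMAT_R32G32B32_FLOAT,    0,  0, D3D11_INPUT_PER_VERTEX_DATA, 0 },
    // Normal packed as four 16-bit floats (the shader still sees 32-bit floats).
    { "NORMAL",   0, DXGI_FORMAT_R16G16B16A16_FLOAT, 0, 12, D3D11_INPUT_PER_VERTEX_DATA, 0 },
    // Colour as 8-bit UNORM: bytes in [0; 255] arrive in the shader as floats in [0; 1.0].
    { "COLOR",    0, DXGI_FORMAT_R8G8B8A8_UNORM,     0, 20, D3D11_INPUT_PER_VERTEX_DATA, 0 },
};

// The render target format is equally free to vary, independently of the pixel shader.
// (Texture/view creation omitted; this is just the format choice.)
static const DXGI_FORMAT kRenderTargetFormat = DXGI_FORMAT_R8G8B8A8_UNORM;
```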

So the problem is that vertex & pixel shaders pretend their inputs and outputs are 32-bit floats.

How do modern GPUs address this problem in D3D11/GL? By dividing both shaders into 3 parts:

  1. Prefix or Preamble
  2. Body
  3. Suffix or Epilogue

The body is the vertex/pixel shader that gets compiled by fxc (or by the OpenGL driver) and later converted to a GPU-specific ISA (Instruction Set Architecture) i.e. the binary program that the GPU will run.

The prefix and suffix are both patchable regions. What do I mean by patchable? Well… as in binary patching.

The vertex shader ‘pretends’ the input is in 32-bit float, so the body was compiled assuming 32-bit floats.
But if the input is, say, 16-bit half floating point with a specific vertex stride (the offset in bytes between each vertex), then the preamble gets patched with a short sequence of instructions that loads the 16-bit half floats from the right offsets and converts them to the 32-bit floats the body expects.
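
To visualise what that patched preamble does, here is a CPU-side sketch of the equivalent logic. This is not driver code (on the GPU it is just a handful of load/convert instructions emitted by the driver); the offsets and stride are hypothetical:

```cpp
#include <cstdint>
#include <cstring>

// Widen an IEEE 754 half (16-bit) to a 32-bit float.
static float HalfToFloat( uint16_t h )
{
    const uint32_t sign     = (h & 0x8000u) << 16;
    const uint32_t exponent = (h >> 10) & 0x1Fu;
    const uint32_t mantissa =  h & 0x03FFu;

    uint32_t bits;
    if( exponent == 0 )                 // zero (denormals treated as zero to keep the sketch short)
        bits = sign;
    else if( exponent == 0x1F )         // Inf / NaN
        bits = sign | 0x7F800000u | (mantissa << 13);
    else                                // normal number: rebias exponent 15 -> 127
        bits = sign | ((exponent + 112u) << 23) | (mantissa << 13);

    float f;
    std::memcpy( &f, &bits, sizeof( f ) );
    return f;
}

// Hypothetical vertex with a 16-bit half position at byte offset 0.
// The 'body' only ever sees the 32-bit floats written to outPos.
static void LoadPosition( const uint8_t *vertexBuffer, uint32_t vertexIdx,
                          uint32_t strideBytes, float outPos[3] )
{
    const uint8_t *vertex = vertexBuffer + vertexIdx * strideBytes;
    uint16_t half[3];
    std::memcpy( half, vertex, sizeof( half ) );
    for( int i = 0; i < 3; ++i )
        outPos[i] = HalfToFloat( half[i] );
}
```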

The same happens to the pixel shader, which needs to convert its four 32-bit floats into, say, RG8_UNORM: discard the blue and alpha channels, convert the red and green from the range [0; 1.0] to the range [0; 255], and store them to memory.

In order to do that, the driver will patch the epilogue and perform the 32-bit -> 8-bit conversion on the red and green channels.
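
Again, a CPU-side sketch of the equivalent epilogue logic for an RG8_UNORM target (illustration only; on the GPU these are a few convert/export instructions):

```cpp
#include <cstdint>
#include <algorithm>
#include <cmath>

// Quantise a [0; 1.0] float to a [0; 255] byte.
static uint8_t FloatToUnorm8( float x )
{
    x = std::min( 1.0f, std::max( 0.0f, x ) );      // clamp to [0; 1]
    return static_cast<uint8_t>( std::lround( x * 255.0f ) );
}

// Keep red and green, drop blue and alpha, store two bytes per pixel.
static void StoreRG8( const float rgba[4], uint8_t outPixel[2] )
{
    outPixel[0] = FloatToUnorm8( rgba[0] );         // red
    outPixel[1] = FloatToUnorm8( rgba[1] );         // green
    // blue and alpha are simply discarded
}
```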

Depending on the GPU, the suffix may contain more operations related to MSAA or even alpha blending (the latter is particularly true on mobile).

D3D11 games ran heavy optimizations only on the body section, mostly done by the fxc compiler, and developers could store the results in a single file (a cache) that could be distributed to all machines.

The driver still needs to convert the D3D11 bytecode to a GPU-specific ISA, but it relies on fxc’s heavy optimizations having done the job. Thus the conversion from D3D11 bytecode to ISA isn’t free, but it isn’t too costly and can often be hidden by driver-side threading.
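
A minimal sketch of that D3D11-era workflow, using D3DCompile (the same compiler as fxc, exposed as a library) to build the heavily optimized, GPU-agnostic bytecode once and dump it to a cache file. The file names and entry point are hypothetical:

```cpp
#include <d3dcompiler.h>   // link against d3dcompiler.lib
#include <cstdio>

bool CompileAndCacheShader( const char *hlslSource, size_t sourceLen )
{
    ID3DBlob *bytecode = nullptr;
    ID3DBlob *errors   = nullptr;

    HRESULT hr = D3DCompile( hlslSource, sourceLen, "my_shader.hlsl",
                             nullptr, nullptr,        // no macros, no includes
                             "main", "ps_5_0",        // entry point & target profile
                             D3DCOMPILE_OPTIMIZATION_LEVEL3, 0,
                             &bytecode, &errors );
    if( FAILED( hr ) )
    {
        if( errors )
        {
            std::printf( "%s\n", (const char *)errors->GetBufferPointer() );
            errors->Release();
        }
        return false;
    }

    // The blob is GPU-agnostic DXBC; the driver still lowers it to its own ISA at runtime.
    FILE *f = std::fopen( "my_shader.dxbc", "wb" );
    if( f )
    {
        std::fwrite( bytecode->GetBufferPointer(), 1, bytecode->GetBufferSize(), f );
        std::fclose( f );
    }
    bytecode->Release();
    return true;
}
```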

Shaders could be paired arbitrarily

One more thing I forgot to mention is that vertex and pixel shaders could be combined arbitrarily at runtime. There are a few rules, called signatures, about having matching layouts; otherwise the two shaders can’t be paired together.

But despite those rules, if a vertex shader outputs 16 floats for the pixel shader to use but the pixel shader only uses 4 of them, the vertex shader can’t be optimized for that assumption.

The vertex shader’s body will be optimized as if the pixel shader consumed all 16 floats. At most, the suffix will export only 4 floats to the pixel shader; but there’s still a lot of wasted code in the body that could be removed but won’t be.

Drivers may try to analyze the resulting pair and remove that waste, but they only have limited time to do so (otherwise some games could see permanent heavy stuttering).

Post DX12/Vulkan world

DX12/Vk introduced the concept of Pipeline State Objects, aka PSOs: one huge blob of all the data, embedded into a single API object.

Because PSOs contain all the data required, there is no longer a need to divide shaders into prefix, body and epilogue.

Drivers know in advance the vertex format, the vertex & pixel shaders that will be paired together, pixel format, MSAA count, whether alpha blending will be used, etc.

We have all the information required to produce the optimal shader (see the sketch after this list):

  • Code paths producing unused output will be removed
  • The whole shader’s ‘body’ may prefer to use 16-bit float registers if the vertex input format is 16-bit (rather than converting 16 -> 32 bits and then operating in 32 bits)
  • Load & store instructions may be reordered anywhere to reduce latency (instead of being forced into the prefix or the suffix)
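
Here is a trimmed sketch of what creating such a PSO looks like in D3D12, just to show how much state the driver sees up front. The struct and the call are the real API; the particular shaders, formats and state choices are hypothetical:

```cpp
#include <d3d12.h>

ID3D12PipelineState* CreateExamplePso( ID3D12Device *device,
                                       ID3D12RootSignature *rootSig,
                                       D3D12_SHADER_BYTECODE vs,
                                       D3D12_SHADER_BYTECODE ps,
                                       const D3D12_INPUT_LAYOUT_DESC &inputLayout )
{
    D3D12_GRAPHICS_PIPELINE_STATE_DESC desc = {};
    desc.pRootSignature        = rootSig;
    desc.VS                    = vs;                              // exact VS/PS pairing is known
    desc.PS                    = ps;
    desc.InputLayout           = inputLayout;                     // exact vertex format is known
    desc.PrimitiveTopologyType = D3D12_PRIMITIVE_TOPOLOGY_TYPE_TRIANGLE;
    desc.NumRenderTargets      = 1;
    desc.RTVFormats[0]         = DXGI_FORMAT_R8G8B8A8_UNORM;      // exact output format is known
    desc.SampleDesc.Count      = 1;                               // MSAA count is known
    desc.SampleMask            = 0xFFFFFFFF;
    desc.BlendState.RenderTarget[0].BlendEnable = FALSE;          // blending choice is known
    desc.RasterizerState.FillMode = D3D12_FILL_MODE_SOLID;
    desc.RasterizerState.CullMode = D3D12_CULL_MODE_BACK;
    desc.DepthStencilState.DepthEnable = FALSE;

    // This single call is where the expensive, GPU-specific optimization now happens.
    ID3D12PipelineState *pso = nullptr;
    device->CreateGraphicsPipelineState( &desc, IID_PPV_ARGS( &pso ) );
    return pso;
}
```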

Therefore most optimizations are delayed until actual PSO creation time. Unlike D3D11’s fxc, which took forever to compile, D3D12’s newer dxc compiler and glslang (if you’re not using Microsoft’s compiler) actually compile very fast.

These shader compilers barely perform optimizations (although SPIRV optimizers still exist, and they may make a difference on mobile).

Unfortunately, a cached PSO is tied to the GPU and driver version. Therefore something as trivial as a driver upgrade can invalidate the cache, meaning you have to wait all over again.
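
Vulkan makes this explicit: the blob returned by vkGetPipelineCacheData begins with a header identifying the vendor, device and a driver-specific cache UUID, and the driver discards the data if they no longer match. A minimal validation sketch, assuming the application saved the blob to disk earlier (the header layout is the one documented in the spec):

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>
#include <cstring>
#include <vector>

bool IsPipelineCacheStillValid( const std::vector<uint8_t> &blob,
                                const VkPhysicalDeviceProperties &props )
{
    // Header layout: u32 headerSize, u32 headerVersion, u32 vendorID, u32 deviceID,
    // then pipelineCacheUUID[VK_UUID_SIZE].
    const size_t headerBytes = 4u * sizeof( uint32_t ) + VK_UUID_SIZE;
    if( blob.size() < headerBytes )
        return false;

    uint32_t vendorId, deviceId;
    std::memcpy( &vendorId, blob.data() +  8, sizeof( uint32_t ) );
    std::memcpy( &deviceId, blob.data() + 12, sizeof( uint32_t ) );

    // A driver update changes pipelineCacheUUID, which is what forces the recompile.
    return vendorId == props.vendorID &&
           deviceId == props.deviceID &&
           std::memcmp( blob.data() + 16, props.pipelineCacheUUID, VK_UUID_SIZE ) == 0;
}
```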

As David Clyde said, this actively discourages people from updating their drivers, which could become a problem.

There are mitigation strategies being researched, such as:

  • Compiling a slow version of the shader quickly, then recompiling an optimized version in the background
  • Users uploading their caches to a giant shared database, classified by GPU device, vendor and driver version, which other users can then download (note this may raise security concerns; a shader is an executable, after all)
  • Faster optimizing compilers

Please note that ported PS4 exclusives such as Horizon Zero Dawn and Detroit Become Human were designed around having the PSO cache distributed with the binary (because there are only two GPUs to target: PS4 and PS4 Pro), so they were not designed to compile twice (a slow version, then a fast one). That is why these games spend 20 minutes at the beginning building their PSO cache.

So there you have it: that’s why modern games take so long to build shaders at startup, and why it may become more frequent in the future.


12 thoughts on “A little clarification on modern shader compile times”

  • ADEV

    > Thus these games spend 20 minutes at the beginning building their PSO cache.

    why not pre-compile it yourself once, and ship it to people?

    because this sounds like a huge waste of energy

everyone recompiling the same thing for 20 minutes, this is not eco-friendly at all

    • Matias Post author

      Because the PSO is tied to the GPU and driver version.

GPUs are not like CPUs. PCs, for example, have the x86 instruction set, which both Intel and AMD must conform to, and that’s why both CPUs can run the same exe without any modification.

However GPUs are vastly different between vendors (e.g. Intel, NVIDIA, AMD) and even between models (e.g. a GeForce 680 works very differently from a GeForce 1080), thus they can’t run a PSO that was compiled for a different GPU.

What you’re mentioning (compile PSOs once, ship them with the game) is what’s done on consoles (PS4, XBox, Switch) because there are only 1 or 2 GPUs to support (the PS4 & PS4 Pro, the XBox One & XBox One X, only one for the Switch).

      > because this sounds like a huge waste of energy

      Yes it is : (

> everyone recompiling the same thing for 20 minutes, this is not eco-friendly at all

They’re not recompiling *the same thing* if they have different GPUs. But everyone with the same GPU is indeed recompiling the same thing.

      I agree it’s not eco-friendly at all

      Right now the only way to ship a PSO that works on every machine would be to buy every GPU in existence and compile them all once. This is very expensive.

If vendors (Intel, NV, AMD) had a tool that allowed gamedevs to compile for each GPU without actually owning the device, that would be a game changer.

      • Kelly MacNeill

        might it be possible for the gpu vendor to cache these compiled isa blobs and distribute them with something like geforce experience? of course this pushes a bandwidth cost to the gpu vendor, but maybe it can be seen as a sort of service.

        • Silent

          This is the exact problem Steam shader cache is trying to solve – as more people with different hardware play the game, their cache might end up being cached by Steam and then served to other people with the exact same configuration.

  • steve m

    This seems like a huge miss in the DX12 architecture, or at least DXC. If PS4 games can precompile their shaders, meaning they have all the state ready offline, PC games should be able to as well. Shaders should be compiled to a DXIL/SPIRV-like language, mostly optimized. The only thing that should be necessary at runtime is a direct translation of DXIL to ASM, and a few optimizations, which is what DX11 drivers have already been doing decently fast.

    Can you detail a bit more where the missing piece is in DX12 in relation to PS4? Again assuming that the driver specific conversion to ASM can be fast.

    If it’s just that DXC is too lazy to do any optimizations, that’s dumb, but relatively easily fixable.

    SPIRV-opt seems to work along those lines. Is the problem in the Vulkan world just as bad still?

    How much leeway does the driver have? Why can’t it compile a slow version of the shader, then quietly swap it out once a faster version is compiled? Is the PSO an immutable chunk of memory not owned by the driver? (I left the industry before DX12, so I have lots of holes in my knowledge).

    Do developers have enough control on DX12 to make startup fast, they just missed in these cases? Is the current 30min PSO compilation fully threaded? Would disabling optimizations at PSO compilation time reduce that 30 minute wait to something completely reasonable, like <1min?

    On the flipside, DX11 was not flawless either, and could spend large amounts of time compiling shaders inside the driver.

    • Matias Post author

      > Again assuming that the driver specific conversion to ASM can be fast.

      No, that’s the core of the problem. It’s not fast.

      > If it’s just that DXC is too lazy to do any optimizations, that’s dumb, but relatively easily fixable.

      No, it’s not. Because optimizing DXIL/SPIRV is optimizing for a hypothetical GPU. What really matters is optimizing for the real GPU. And there’s hundreds, if not thousands, of them out there.

It’s not that DXC compiler writers are lazy, it’s just that optimization choices at that stage help but are limited.

      > Why can’t it compile a slow version of the shader, then quietly swap it out once a faster version is compiled?

      There is work being done towards this end. But DX12/Vulkan is still in its infancy until that matures.

      There’s also the issue that the application needs to be aware of this, because the driver silently swapping shaders automatically (without app intervention) is part of what got us into the microstutter problem that D3D11/GL has.

      > Do developers have enough control on DX12 to make startup fast, they just missed in these cases?

Games developed as console exclusives never had to deal with this because they only targeted 2 GPUs at most (vs the hundreds of different GPUs PCs have).
This is like a building’s foundation: it’s very expensive and hard to change once the full building is built.

There are also engines that were originally designed for a DX11 paradigm. Engines are rarely written in ideal scenarios.

      • steve m

        Thanks for the answers. But I’m still not clear on some points.

        >> Again assuming that the driver specific conversion to ASM can be fast.

        >No, that’s the core of the problem. It’s not fast.

        But drivers have always been compiling DXIL to ASM. Why would it be so much slower now? [I think I answer myself below.]

        You also stated that fxc was slow and dxc is fast. That presumably means fxc used to do more optimizations than dxc. And you said as much: “These shader compilers [dxc/glslang] barely perform optimizations “. Why would dxc not keep doing as much optimization as fxc did? If the answer is “because those optimizations are worthless”, then why was fxc wasting time on them in the first place?

        Sure the vendor can make their compiler arbitrarily slow if they want to eke out every optimization. But apparently DX11 drivers were not that slow. Were they doing the trick with quick compile followed by slow optimized compile behind the scenes? That would be a good call. Now that I think of it, I’m pretty sure I do recall dx11 drivers doing heavy shader compilation well after the game was already running, which must be the optimization pass. Why not keep doing that in the dx12 driver?

        But in that light, I now have my answer to the part of the question that’s “where the missing piece is in DX12 in relation to PS4?”. An optimized ASM shader does actually take a long time to compile and it obviously can’t be done offline as consoles do.

        > swapping shaders automatically (without app intervention) is part of what got us into the microstutter problem that D3D11/GL has.

        I think the microstutter on dx11 came from the first cold compile that would happen in the middle of a frame, not swapping to the optimized shader later. The GPU doesn’t care that last frame this PSO had one shader binary in it and this frame it has another. The PSO takes care of the cold compile problem, because by setting up your PSOs at load time, they’re guaranteed to be compiled at least with a slow shader before they’re accessed for rendering. That’s a good thing that solves the main issue: avoids on-demand compilation in the middle of a frame that blocks a draw call.

        Seems to me the fundamental issue is that if PSOs could simply compile a quick version of the shader, and the driver could later swap it out to an optimized version, it should work even with lazy developers. What is it in dx12 that precludes this driver trick? Is it that the PSO is a chunk of memory with a set size, and an optimized shader is likely to not fit in that memory, and the driver is not supposed to waste your memory on padding just in case? Or is the PSO immutable for other reasons? Again, excuse my lack of education on the dx12 api.

        >> Why can’t it compile a slow version of the shader, then quietly swap it out once a faster version is compiled?

        >There is work being done towards this end. But DX12/Vulkan is still in its infancy until that matures.

        This sounds like it would be a driver optimization. I find it hard to believe that driver writers have not gotten around to doing this optimization that already existed in their dx11 drivers. If it requires API changes, well, that would imply my first statement “This seems like a huge miss in the DX12 architecture” 🙂

        > Engines are rarely written in ideal scenarios.

        I understand. The question was: can developers avoid slow startups if they do everything right in today’s dx12? If this is affirmative, then we could say the miss on the part of the API designers was that they naively assumed devs would do the right thing (manually swap PSOs), and they wouldn’t need to add hacks to bail out “bad” devs.

        Appreciate your insight!

        • Simon Deschenes

          I will try to answer more clearly what changed between DX11 and DX12/Vulkan so you understand better.

Under DX11, there were many objects (shaders, states, …), each of which could be patched easily at runtime, and this resulted in a lot of small calls that impacted the CPU a lot.
          – Set vertex shader
          – Set pixel shader
          – Set depth state
          – Set raster state
          – Set blend state
          – Set constant buffer
          – Set texture
          – Set sampler

In D3D12 and Vulkan, you have one big object, the Pipeline State Object, for all the shaders and all the states (depth, raster, blend, …). It is compiled as a single optimal blob and set with a single call (saving precious CPU time).

You bind all resources (textures and buffers) in groups, with objects named descriptor tables (D3D12) / descriptor sets (Vulkan).

So the number of objects that is actually bound at runtime is very low (PSO and descriptor tables), and the number of API calls is also extremely low.
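
A rough side-by-side sketch of that difference (real API calls, hypothetical resources, setup and error handling omitted):

```cpp
#include <d3d11.h>
#include <d3d12.h>

void BindDrawStateD3D11( ID3D11DeviceContext *ctx,
                         ID3D11VertexShader *vs, ID3D11PixelShader *ps,
                         ID3D11DepthStencilState *depth, ID3D11RasterizerState *raster,
                         ID3D11BlendState *blend, ID3D11Buffer *cb,
                         ID3D11ShaderResourceView *tex, ID3D11SamplerState *samp )
{
    // Many small calls; the driver must re-validate/patch state as combinations change.
    ctx->VSSetShader( vs, nullptr, 0 );
    ctx->PSSetShader( ps, nullptr, 0 );
    ctx->OMSetDepthStencilState( depth, 0 );
    ctx->RSSetState( raster );
    ctx->OMSetBlendState( blend, nullptr, 0xFFFFFFFF );
    ctx->PSSetConstantBuffers( 0, 1, &cb );
    ctx->PSSetShaderResources( 0, 1, &tex );
    ctx->PSSetSamplers( 0, 1, &samp );
}

void BindDrawStateD3D12( ID3D12GraphicsCommandList *cmd,
                         ID3D12PipelineState *pso,
                         D3D12_GPU_DESCRIPTOR_HANDLE resourceTable )
{
    // One pre-baked blob plus one descriptor table: very few calls at draw time.
    cmd->SetPipelineState( pso );
    cmd->SetGraphicsRootDescriptorTable( 0, resourceTable );
}
```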

          In D3D11, it wasn’t possible to get an “optimal” pre-baked object for all of the states on the GPU. A lot of optimizations were not done because it was not possible to know the states in advance. With D3D12/Vulkan, the driver can effectively make much more involved optimizations because it knows (almost) everything about the state of the GPU when the PSOs will be used.

          This is why the vendors are actually doing many additional optimization passes they didn’t do in the past.

          > Why can’t it compile a slow version of the shader, then quietly swap it out once a faster version is compiled?

Because they can’t do that. They are low-level APIs; the user of those APIs is supposed to have full control over the memory and the resources. If the API starts doing things “behind the back” of the code, it is no longer a low-level API and we just get back to D3D11.

          • Simon Deschenes

            Here is one thing that actually exists in the Vulkan API that can help the situation : “Pipeline Derivatives”.

            > 9.5. Pipeline Derivatives
            > A pipeline derivative is a child pipeline created from a parent pipeline, where the child and parent are expected to have much commonality. The goal of derivative pipelines is that they be cheaper to create using the parent as a starting point, and that it be more efficient (on either host or device) to switch/bind between children of the same parent.

            It would be interesting to profile the use of pipeline derivatives to know if the GPU vendors actually use that information, or if they ignore it completely.
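
A minimal sketch of what opting into that looks like, assuming the rest of the create-info has been filled in elsewhere:

```cpp
#include <vulkan/vulkan.h>

VkPipeline CreateDerivedPipeline( VkDevice device, VkPipelineCache cache,
                                  VkPipeline parent,
                                  VkGraphicsPipelineCreateInfo createInfo )
{
    // Mark the new pipeline as a derivative of 'parent'; the parent itself must have been
    // created with VK_PIPELINE_CREATE_ALLOW_DERIVATIVES_BIT.
    createInfo.flags              |= VK_PIPELINE_CREATE_DERIVATIVE_BIT;
    createInfo.basePipelineHandle  = parent;   // the parent we expect to share most state with
    createInfo.basePipelineIndex   = -1;       // unused when a handle is given

    VkPipeline child = VK_NULL_HANDLE;
    vkCreateGraphicsPipelines( device, cache, 1, &createInfo, nullptr, &child );
    return child;
}
```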

            PS. I would have liked to add that information to my previous post but there is no “Edit” button

        • Matias Post author

          > But drivers have always been compiling DXIL to ASM. Why would it be so much slower now?

You answered it yourself, indeed. The optimization phase WAS MOVED from the shader compilation stage to the PSO creation stage.
It is pointless to do it twice (at both shader compilation and PSO creation).

          > Were they doing the trick with quick compile followed by slow optimized compile behind the scenes

          Yes and no.
Sometimes DX11 drivers decide to sacrifice performance and not optimize the shader much further; other times they optimize the shader a lot and take the hit while trying to hide it, somehow.

          > why was fxc wasting time on them in the first place?

Because there was no notion of PSOs. They didn’t exist. GPUs worked very differently in 2007.
Architectures evolve. Vertex format conversion used to be done by a dedicated chip instead of via instructions in a shader.

DX11 was an incremental improvement over DX10. You’re comparing an API that was designed for GPUs from 2006 against an API designed for GPUs from the 2012+ era.

          Things change. What was optimal yesterday, is suboptimal today.

          > I think the microstutter on dx11 came from the first cold compile that would happen in the middle of a frame

          No. What you’re mentioning caused a big stutter, which is why games render a few frames to warm up the drivers.

          > I think the microstutter on dx11 came from the first cold compile that would happen in the middle of a frame, not swapping to the optimized shader later

          That’s a big misconception. One of the main causes of microstutter is the driver randomly swapping a slow shader for a fast shader.

          Microstutter happens because frame N takes 20ms to render while frame N+1 takes 8ms to render and then frame N+2 takes 17ms to render.

          The FPS counter will read 60 fps, but it will be a microstutter fest.

          > > Why can’t it compile a slow version of the shader, then quietly swap it out once a faster version is compiled?

> Because they can’t do that. They are low-level APIs; the user of those APIs is supposed to have full control over the memory and the resources. If the API starts doing things “behind the back” of the code, it is no longer a low-level API and we just get back to D3D11.

          That’s a perfect answer. If the driver starts doing things behind the app’s back, we’re back to D3D11 and thus back to microstutters.

  • Dave Ranck

    Great info! Thanks. Lots of hilarious arguments on Steam about this topic for HZD. Everyone is an expert. You shed light on the topic and I thank you.
