Clustered Forward vs Deferred Shading


So… this is a highly discussed topic these days. Many engines already do Deferred Shading. Doom 2016 decided to go Forward at 60 FPS and 1080p, popularizing the idea that big AAA titles with high demands can do Forward just fine (although it’s neither the first nor the only AAA title to use Forward; Forza Horizon 2, for example, did too). VR also tends to favor Forward because of its lower bandwidth costs at high resolutions + MSAA.

However, I feel there’s a lot not being covered in regular discussions. Having implemented both, I want to describe the strengths and weaknesses of each.

Note: When I say “Forward” in this post, I’m referring to any of the modern variations that allow efficiently rendering lots of lights in forward shading methods: Tiled Forward, Clustered Forward, Forward+, Forward+ 2.5, Forward3D. They’re all variations of the same thing. I will not be referring to “Forward” as the old technique of keeping a per-draw fixed-length array of lights that was common in the 90’s and 2000’s, most commonly but not always limited to 8 lights per draw.

 

What everyone tells you about Forward vs Deferred:

Pros of Forward:

  1. Works with MSAA
  2. Works with transparency
  3. Allows using multiple BRDFs

Cons of Forward:

  1. Often requires a Z pre-pass (not always; depends on implementation and trade-offs)
  2. Pixel quad occupancy problems for tiny triangles
  3. Performance more dependent on complexity of scene
  4. The “good” modern algorithms require compute shaders, which can be a problem if DX10-level hardware is being targeted.

Pros of Deferred:

  1. Performance depends more on screen resolution than scene complexity.
  2. Lots of resources online on how to implement one.
  3. Works on really old hardware, including mobile.
  4. Impact of tiny triangles is lower.
  5. Mobile extensions can optimize bandwidth consumption a lot by keeping passes in on-chip memory.

Cons of Deferred:

  1. Antialiasing is hard.
  2. Transparency is really hard. Better to revert to Forward for those.
  3. Poor support for multiple BRDFs.
  4. Consumes A LOT of bandwidth (though there are modern variations that reduce bandwidth consumption, they only work on modern GPUs).
  5. You’re lucky if you get those mobile extensions for using on-chip memory to work on shipped devices other than your own. Yes, Vulkan supports subpasses as a core feature, but again… you’re lucky if you find many Android devices supporting Vulkan. The market share right now is minuscule. There are still GLES2-only Android devices being manufactured in 2016!

Both Deferred and modern Forward share the motto that “one big light == many small lights”: due to fillrate/computation behavior, cost scales with how many pixels each light touches, not with the number of lights.

 

What is rarely talked about

This is where I wanted to get to. If we look at the pros and cons I just listed, which are listed everywhere, modern Forward has huge pluses. As in, huuuge pluses. MSAA + transparency + multiple BRDFs + low bandwidth??? Count me in!!!

So… what’s the catch?

 

One huge mega-shader

The first problem that pops up is that everything gets mashed up into one big pixel shader. The same pixel shader has to do:

  1. The directional light’s shadow mapping (often selecting splits or doing fancy things to get pretty-looking shadows).
  2. Shadow mapping for any other light (spot and/or point).
  3. Normal mapping (TBN matrix operations).
  4. Sampling all textures (albedo, specular/metalness, roughness).
  5. Detail map compositing (if more than one diffuse texture is used).
  6. Iterating through the cluster for point lights.
  7. Iterating through the cluster for spot lights.

This results in horrible VGPR/SGPR register usage, which results in horrible occupancy and thus hinders the GPU’s ability to hide the latency of memory reads and other stalls. Deferred shading has the advantage of being able to do more passes with smaller shaders that do one operation each (at the expense of more bandwidth consumption for each pass).

Deferred gives us better control over the ratio of shader_passes / shader_size, i.e. how many operations we cram into the same shader before writing a new one.
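To make the cluster iteration concrete, here’s a condensed sketch of just the point-light loop, assuming each cluster stores an (offset, count) pair into a global light index list; all names here (clusterGrid, lightIndexList, pointLights) are illustrative, not from any particular engine:

struct PointLight
{
    float3 position;
    float  invRadius;
    float3 colour;
};

StructuredBuffer<uint2>      clusterGrid;    //per cluster: x = offset, y = light count
StructuredBuffer<uint>       lightIndexList; //flattened per-cluster light indices
StructuredBuffer<PointLight> pointLights;

float3 shadePointLights( uint clusterIdx, float3 worldPos, float3 vNormal, float3 albedo )
{
    float3 outColour = float3( 0, 0, 0 );
    uint2 offsetCount = clusterGrid[clusterIdx];
    for( uint i = 0u; i < offsetCount.y; ++i )
    {
        PointLight light = pointLights[lightIndexList[offsetCount.x + i]];
        float3 toLight = light.position - worldPos;
        float dist     = length( toLight );
        float atten    = saturate( 1.0f - dist * light.invRadius );
        float NdotL    = saturate( dot( vNormal, toLight / dist ) );
        outColour += albedo * light.colour * (atten * NdotL);
    }
    return outColour;
}

Spot lights get a second, analogous loop; and all of this sits in the same shader as the shadow mapping, normal mapping and texture sampling listed above, which is exactly why the register pressure explodes.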

 

Dependency problems

The first thing that comes to mind is SSAO. Ideally, SSAO should be applied only to the ambient lighting term. The thing is, to compute SSAO you need access to the depth buffer, but you don’t have the depth buffer until you’ve done your forward pass. So you’re in a Catch-22:

  • You need the forward pass to finish in order to compute SSAO.
  • You need SSAO to apply the AO to the ambient lighting term to do the forward pass.

To get out of it you have two options:

  1. Perform a depth pre-pass
  2. Apply SSAO to the whole thing, ignoring correctness. Perhaps you can use the alpha channel to indicate what percentage of the pixel was lit by ambient, as an approximation. Or other tricks. But it would be a fake. Let’s remember that AO is already a fake (it has no real-world physical basis), making SSAO the fake of a fake. Such tricks add up to faking the fake of a fake. Not nice. But maybe you can make it look good.

Deferred doesn’t have this problem at all, as the depth is available before the shading pass.
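To illustrate the difference, here’s a minimal sketch of both options in the forward shader, assuming a depth pre-pass made an SSAO texture available for option #1 (ssaoTex, ambientColour and the function names are illustrative):

Texture2D<float> ssaoTex;
SamplerState     pointSampler;

//Option #1: with a depth pre-pass, SSAO is ready before shading, so AO
//can be applied to the ambient term only, which is where it belongs.
float3 shadeOption1( float2 screenUv, float3 albedo, float3 directLighting, float3 ambientColour )
{
    float ao = ssaoTex.Sample( pointSampler, screenUv ).x;
    return ambientColour * albedo * ao + directLighting;
}

//Option #2: no pre-pass; compute SSAO after the forward pass and darken
//the whole shaded result in a post-process, direct lighting included.
float3 shadeOption2( float2 screenUv, float3 shadedResult )
{
    float ao = ssaoTex.Sample( pointSampler, screenUv ).x;
    return shadedResult * ao;
}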

 

SSR (Screen Space Reflections)

Technically this is a subset of the “Dependency problems” described above, but since SSR is all the rage now, it deserves its own section.

SSR has a similar dependency problem:

  • You need to know how the shaded result is going to look (you need depth, normals from G-Buffer + shaded result) to compute SSR.
  • You need SSR to render the final result.

Deferred Shading doesn’t have this problem: it can compute the shaded result, then do the reflections, then combine them, since it has all the necessary data in the G-Buffers to do a correct merge.

In Forward, to solve this problem you have three options:

  1. Do what Doom 2016 does: hybrid rendering (Deferred + Forward). Perform a depth pre-pass with a small GBuffer (you need at least normals + depth, maybe roughness depending on how you compute glossy reflections), and use the final result of the previous frame (re-projected). Reflections will lag one frame behind. This inter-frame dependency also doesn’t play nice with SLI/Crossfire.
  2. Output a full-blown GBuffer so you can later correctly merge reflections with the final output. This puts Forward in the “high bandwidth cost” category, and MSAA gets harder to support. You’re essentially bringing in some of Deferred’s cons.
  3. Like with SSAO, don’t care about correctness and merge the output with the reflections in whatever way you can.
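For reference, the merge step that Deferred (and option #2) can do correctly looks roughly like this; sceneTex, ssrTex and gbufferSpecTex are illustrative names, and ssr.a is assumed to hold the ray-hit confidence:

Texture2D    sceneTex;       //shaded result, without reflections
Texture2D    ssrTex;         //traced reflections; a = ray-hit confidence
Texture2D    gbufferSpecTex; //specular/fresnel term kept in the G-Buffer
SamplerState pointSampler;

float4 mergeSsr( float4 svPos : SV_Position, float2 uv : TEXCOORD0 ) : SV_Target
{
    float3 shaded = sceneTex.Sample( pointSampler, uv ).xyz;
    float4 ssr    = ssrTex.Sample( pointSampler, uv );
    float3 spec   = gbufferSpecTex.Sample( pointSampler, uv ).xyz;
    //Add the reflection weighted by the surface's specular response,
    //fading it out where the ray didn't find a hit.
    return float4( shaded + ssr.xyz * spec * ssr.a, 1.0f );
}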

 

Everything points to hybrid rendering via a depth prepass

Depth prepass seems to be the solution to everything:

  1. By having the depth earlier, you can compute SSAO.
  2. Early depth helps with the mega-shader problem as every fragment will only be shaded once.
  3. You can compute the scene’s shadow mapping in a deferred way, then apply it as a mask during the forward shading pass, thus offloading the mega-shader and reducing its size (see the sketch after this list). Doing this mask during the depth prepass may be useful on AMD GPUs, which have all those GCN cores idle. Maybe even compute some lights.
  4. You can use Doom’s solution to SSR (if you can sustain 60 fps)
  5. Beware: if you’re not careful, some of these solutions can hinder your ability to use MSAA.
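For item #3, a minimal sketch of such a deferred shadow-mask pass; reconstructWorldPos, selectCascade and sampleCsmPcf are hypothetical helpers, the point being that all the CSM cost leaves the mega-shader:

Texture2D<float> depthTex; //from the depth pre-pass

float3 reconstructWorldPos( float2 uv, float depth ); //hypothetical helper
uint   selectCascade( float3 worldPos );              //hypothetical helper
float  sampleCsmPcf( uint cascade, float3 worldPos ); //hypothetical helper

float4 shadowMaskPS( float4 svPos : SV_Position, float2 uv : TEXCOORD0 ) : SV_Target
{
    float depth = depthTex.Load( int3( svPos.xy, 0 ) );
    float3 worldPos = reconstructWorldPos( uv, depth ); //via inverse view-proj
    uint cascade = selectCascade( worldPos );
    //All the expensive cascade selection & PCF happens here, once per pixel,
    //instead of inside the forward mega-shader.
    float shadow = sampleCsmPcf( cascade, worldPos );
    return float4( shadow, 0.0f, 0.0f, 0.0f );
}

//The forward pass then just multiplies: litColour *= shadowMask.Load( ... ).x;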

But of course… a depth prepass is not free. It costs both CPU (batch generation and command preparation/execution) and GPU (the more tessellated your geometry is, the worse it gets). Furthermore, if you’re an experienced veteran writing a new engine with access to modern API techniques (e.g. GL’s AZDO, D3D11, D3D12, Vulkan), then the depth prepass gets cheaper CPU-side, as your engine will be very efficient with low driver overhead and multi-threaded command preparation. If you’re not an experienced programmer, or if you’re working with an engine carrying giant amounts of legacy, an extra pass can be very expensive.

As for the depth prepass itself, you also have two choices:

  1. While doing the depth prepass, output a small GBuffer & do some offloading from the mega-shader (e.g. by computing the shadow mask). This means the depth prepass must split batches by material in the same way the main forward pass will. CPU-side this is great because commands can now be reused entirely, except for the ones that set the shader: just change those and you’re done. But this may result in heavier driver computations and a higher chance of pipeline stalls on the GPU.
  2. Perform a pure depth prepass: no pixel shader, only depth output. CPU-side you will have to create the commands twice, but it allows batching a lot more (basically almost everything can be batched together). GPU-side this is great because nearly no pipeline stalls will happen; though load balancing may be an issue on GCN, so you may need some async compute to compensate.

Which one is better? I don’t know; I have yet to try & profile. But since option #2 doesn’t allow hybrid rendering, I likely won’t even be trying it.

 

Other options

I won’t go into detail because I don’t have experience with these techniques. A GPUOpen post talks about texel shading, and Oxide Games’ presentation talks about (the poorly named) Object Space Shading. These techniques fight the drawbacks discussed here, but the texel shading article has a pretty good explanation of their own set of drawbacks.

 

Final words

Phew! I got this off my chest.

Forward + big mega-shaders (which will get you GPU performance problems) vs Forward + depth prepass (and potentially MSAA issues) vs Deferred rendering with transparency, MSAA, and bandwidth issues. Pick your poison.

I’m a huge fan of modern Forward algorithms. I love being able to use MSAA and having no problems with transparency.

But unfortunately, I don’t see any clear winner as a general case algorithm.

If you can foresee you will have low-tessellation geometry with expensive pixel shaders, go for Forward + depth prepass. If you have low or moderately tessellated geometry with cheaper pixel shaders (e.g. non-photorealistic rendering, or you don’t care about perfect reflections or “correct” SSAO), just go Forward. If you can foresee you’ll have highly detailed geometry, will reuse the same BRDFs, and can spare the bandwidth (or have the time to implement the textureless techniques, or the resources to deal with mobile extensions for using on-chip memory), then go Deferred.

If your company has massive amounts of resources to support all of these techniques at the same time and switch between them at will, then great for you! If you’re not that giant company, then evaluate your goals and the limitations of each technique, and pick the one best suited to your needs.

Now you have what you need to make a more informed decision. Go back to making games!

 

Wish list: Variable-frequency shading

If API/GPU engineers are reading this, I’d like to suggest the following idea:

Sometimes I wish certain parts of the shader could be executed at a lower frequency than others.

For example, I want to sample the albedo texture at full resolution, but compute the shading at quarter resolution without doing two passes (pseudo-language pixel shader):

float3 normal : TEXCOORD0;
float3 tangent : TEXCOORD1;
float3 binormal : TEXCOORD2;
float2 uv : TEXCOORD3;
struct QuarterResResult
{
    float NdotL;
};

QuarterResResult mainQuarterRes() [frequency=4] //runs at 1/4th resolution
{
    QuarterResResult retVal;
    //Unpack the tangent-space normal and bring it to world space via the TBN matrix.
    float3 tsNormal = normalMap.Sample( sampler, uv ).xyz * 2.0f - 1.0f;
    float3 vNormal  = mul( tsNormal, float3x3( tangent, binormal, normal ) );
    retVal.NdotL = saturate( dot( vNormal, lightDir ) ); //lightDir comes from elsewhere (pseudo)
    return retVal;
}

float4 mainFullRes( in QuarterResResult inValues [frequency=4] )
{
    //inValues would be sourced from LDS, cache or some other on-chip memory.
    float4 albedo = myTex.Sample( sampler, uv );
    return albedo * inValues.NdotL;
}

This would be the same as rendering twice: first to a 960×540 target to compute NdotL, then to a 1920×1080 target to compute the final result; except only one scene pass is used, which would result in lower vertex shader and rasterizer usage. I think we could compromise by choosing normal, tangent, binormal & uv from a particular lane instead of computing the correctly interpolated value at the middle of the 4 texels.
 
A feature like this would be amazing, and it would make Forward much more appealing.
 
Food for thought.


7 thoughts on “Clustered Forward vs Deferred Shading”

  • Georg

    Excellent article, thanks!

    I don’t understand the shadow masking, though. You’d need one mask for each light per pixel, wouldn’t you? You can’t just multiply the shadow factors of all the lights. Only some may be shadowed and you have to figure out which. A linked list of values per pixel doesn’t sound appealing to me.

    • Georg

      Did you mean shadow mask for the directional light only, maybe? This would at least save you from having to sample cascaded shadow maps in the forward shader.

      • Matias Post author

        Yes, you’d need 1 mask per light for the shadows.

        HOWEVER:
        1. The main directional light (aka the sun) is usually the fattest shadow mapping routine due to CSM/PSSM; so you may just have one mask for the directional light only.
        2. Even if you apply it to all the lights casting shadows, you don’t need a linked list. Usually the shadow-mapping lights are sent as regular forward and thus their number is fixed (sending them as Forward+ would be very expensive), so you can use multiple channels (RGBA), one per light, and MRT if you need more than 4 lights (or pack multiple shadow masks into the same channel using bitmasks or the builtin packFloat2x16 and co.)
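        For illustration (sampleShadowMap being a hypothetical helper), the packing is just one channel per shadow-casting light:

        float4 masks;
        masks.x = sampleShadowMap( shadowLight0, worldPos );
        masks.y = sampleShadowMap( shadowLight1, worldPos );
        masks.z = sampleShadowMap( shadowLight2, worldPos );
        masks.w = sampleShadowMap( shadowLight3, worldPos );
        return masks; //more than 4 lights -> add another target via MRT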

        Cheers

        • Georg

          I see. Thanks for clearing that up.

          Why would sending them as Forward+ be very expensive? Because then nothing would be stopping the artists from putting tons of shadow map sampling lights on the screen? 😉

          • Matias Post author

            Shadow mapping is expensive.

            You can put it in F+, but there are so many things to consider… The main problem is memory: shadow mapping requires a texture. You can use a UV atlas and texture arrays to have lots of shadow maps, but that doesn’t solve the problem that you need width*height*bpp*max_num_shadowmaps bytes.
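            To put illustrative numbers on it: 16 shadow maps of 2048×2048 at 4 bytes per pixel is 2048 × 2048 × 4 × 16 = 256 MB of GPU memory for shadows alone.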

            The next problem is performance. Sampling shadow maps is not free. Sample more than 16 and you’ll have serious performance problems (most games limit themselves to 3-5* shadow maps; some may go up to 10 or 16).
            Generating shadow maps is not free either. You can have 10-16 shadow maps assuming most of them are static or updated on demand or at a much lower frequency (e.g. 5 shadow maps updated every frame, the other 11 updated twice per second at most, and interleaved, i.e. not all 11 in the same frame).

            *Keep in mind 3 to 4 of these shadow maps are usually just for the directional light’s CSM/PSSM cascades. So in practice games get the sun, plus one or two more lights that cast shadows; unless the game can keep the camera in interiors, where you can get rid of the sun.

            And if you plan on having F+ with hundreds of lights, you’ll want to keep two lists (shadow-casting lights and non-shadow-casting lights), because branching in the pixel shader with if( light[i]->hasShadowmap ) is incredibly wasteful when you have 16 shadow-casting lights and 100 lights that won’t enter that branch.
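            A sketch of what I mean (illustrative names):

            for( uint i = 0u; i < numShadowLights; ++i )
                colour += shadeWithShadow( shadowLights[i], worldPos, vNormal );
            for( uint j = 0u; j < numPlainLights; ++j )
                colour += shade( plainLights[j], worldPos, vNormal ); //no branch here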

            The next problem will be prioritizing (which lights get a shadow map?)

            And all of this assumes you’re targeting high-end GPUs (e.g. GTX 980 / 1060 / 1080; Radeon 580/Vega).

            A much easier hack is to not use shadow mapping for so many lights, and place “dark lights” instead. Dark lights are lights with negative colour values, so they subtract light instead of adding it. It’s not physically correct, but it empowers your artists to cheaply place a single light (or a few lights) that creates dark corners where needed (it can also be automated, i.e. an algorithm that analyzes where to place lights that darken the scene; basically a fake of AO), while using around 5 shadow maps for generating accurate dynamic shadows.
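            In shader terms a dark light is accumulated exactly like any other light; its negative colour simply makes the term subtract (illustrative values):

            float3 darkColour = float3( -0.25f, -0.25f, -0.25f ); //negative = darkens
            colour += darkColour * atten * saturate( dot( vNormal, lightDir ) );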

            Cheers

            • Georg

              Ok, thanks for confirming my assumption. I am aware that sampling shadow maps is expensive, especially when you are using PCF. Yes, you’d have to use a shadow map atlas, texture array or bindless textures if you wanted to sample them in the forward+ shader.

              So, yes, you have to limit and prioritize the number of shadow casting lights. But you also have to prioritize if you pass the shadow casting lights with the draw calls.

              Also, you need to know which pixels are affected by the shadow-casting lights, so unless you are using deferred rendering for those, you want to have them in the lists somehow.

              But having said that, I do like the idea of limiting the shadow casting lights to 4 and using a small downscaled g-buffer to store the shadow masks in RGBA. Then you get the best of both worlds.
