Vertex Formats Part 2: Fetch vs Pull


Welcome to the final part of this 2-part series! You can find Part 1 here: Vertex Formats Part 1: Compression

Another issue that seems to be poorly understood is how GPUs are shaping up for the future.

Traditionally, vertex formats were “push”. That means the vertex data was pushed into the shader as input, with the format and stride specified via the API’s functions.

The “push” form is how rendering APIs have traditionally treated vertex buffers: you specify a vertex format (e.g. via a VAO in OpenGL) and a stride (i.e. how many bytes between the start of one vertex and the start of the next), and then the GPU goes vertex by vertex, executing the vertex shader for each one.
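In HLSL terms, push style means the shader only declares attribute inputs, and the input layout bound through the API decides how they get filled. A minimal sketch (the struct and semantics here are illustrative):

struct VertexIn
{
    float3 pos    : POSITION;
    float3 normal : NORMAL;
    float2 uv     : TEXCOORD0;
};

float4 main( VertexIn input ) : SV_Position
{
    // Push style: the API’s vertex declaration fills these attributes
    // for us; the shader never sees the vertex buffer itself.
    // Illustrative only; a real shader would apply its transforms here.
    return float4( input.pos, 1.0f );
}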

“Pull” is getting more popular as of late, as it is more flexible and enables new algorithms. In simple shader terms, it’s just doing:

float3 position = buffer[vertexId].pos.xyz;
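Fleshed out a little, a pull-style vertex shader might look like this (a sketch; the buffer layout and names are made up for illustration):

struct Vertex
{
    float3 pos;
    float3 normal;
    float2 uv;
};

// Pull style: no input layout; we fetch the data ourselves by vertex id.
StructuredBuffer<Vertex> vertexBuffer : register(t0);

float4 main( uint vertexId : SV_VertexID ) : SV_Position
{
    float3 position = vertexBuffer[vertexId].pos.xyz;
    return float4( position, 1.0f );
}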

You can find more information about “pull” methods by googling “programmable vertex pulling”. A good read is Vertex Shader Tricks by Bill Bilodeau (archive.org link), as well as “merge-instancing” from Graphics Gems for Games by Emil Persson.

How it works is very simple. The questions are: why do it? And is there a performance penalty?

Why and where

GPU problems almost always boil down to batching problems.

In 2003 we were hitting CPU limits because we were hitting draw call limits. Wanted to render 1000 objects? That’s 1000 draw primitive calls. The API overhead was too high: kernel transitions, validation, pushing parameters onto the stack, the call instructions, thrashing the icache, etc.

So we added instancing. You wanted to render 1000 instances of the same mesh in different locations? It’s now just one draw call with instance_count = 1000. The problem later shifted to state changes (switching material parameters, switching textures, etc.), and I’m gonna make it short: we got const buffers, indirect drawing, texture arrays and bindless textures.
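On the shader side, instancing just hands you one more system value next to the vertex id. A minimal sketch, assuming per-instance transforms live in a buffer (the names are made up):

// One draw call, instance_count instances: the GPU tells us which
// instance we’re shading and we fetch its transform ourselves.
StructuredBuffer<float4x4> instanceTransforms : register(t0);

float4 main( float3 pos : POSITION,
             uint instanceId : SV_InstanceID ) : SV_Position
{
    return mul( instanceTransforms[instanceId], float4( pos, 1.0f ) );
}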

Problem solved. Except… the problem is still there.

See… GPUs work in batches (a wavefront in AMD terms, a warp in NVIDIA terms). For example, AMD’s wavefront size is 64. If I render 1 triangle, I’m leaving 61 threads of a wavefront wasted (assuming the GPU computes 1 vertex per thread).

If I render 1000 instances of that triangle, it’s going to require 1000 wavefronts, instead of the 47 wavefronts it could take (1000 tris x 3 vertices / 64 threads = 46.88 wavefronts) (*).

Yes, if certain conditions are met, the driver is able to pack all of those instances into the same wavefront, thus maximizing usage. But don’t count on it, because chances are you’re doing something that prevents the shader from merging them. If the data you’re feeding the shader means the GPU may not be able to render correctly unless the instances are put in different wavefronts, the driver must play it safe and split them. It can only do the merging automatically when it’s guaranteed the HW will process your commands correctly.

Christophe Riccio made a nice test measuring the minimum number of triangles per draw call needed to maximize GPU utilization.

If the mesh you are rendering is large enough, you don’t need to worry about this problem. But if the mesh is small, the problem gets quite big. Why? LOD gives you diminishing returns (i.e. past a certain point, lowering the vertex count literally gives you no gain), and cities are filled with these small objects: lamps, posts, signs, crash barriers. Lots of objects with small vertex counts.

In other words, we just shifted the batching problem elsewhere. It’s no longer on the CPU, and now it’s harder to profile & measure.

(*) It’s actually more complex than that: there are Compute Units, each CU has 4 SIMD units, each SIMD has 16 lanes (not 64), and each SIMD can cycle between up to 10 wavefronts to hide latency. I just don’t want to go into those details. GPU purists could describe what actually goes on, but it can be quite confusing and goes way beyond the scope of this post.


That’s where vertex pulling comes in. Basically, we say “screw that” and merge the vertices by hand, rather than praying the driver does it for us. Most algorithms that perform manual vertex pulling do it in order to render the same mesh, or different meshes, as part of the same batch.

Just Cause 2, for example, uses what it calls merge-instancing: the indices of the meshes to render are stored in one buffer, and the shader loads the offset to the corresponding mesh + the vertex id; that way it can render multiple buildings (which have low vertex counts) as part of the same batch. It adds an extra layer of indirection though. The goal is to maximize GPU utilization.
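In rough HLSL terms the idea could look something like this (a sketch, not Just Cause 2’s actual code; all names and the slot size are made up):

// Merge-instancing sketch: every instance occupies a fixed number of
// index "slots" in the batch; we figure out which instance and which
// index within it we are, then pull everything through two indirections.
StructuredBuffer<uint>   instanceMeshStart : register(t0); // where each instance’s indices begin
StructuredBuffer<uint>   mergedIndices     : register(t1); // index buffers of all meshes, merged
StructuredBuffer<float3> mergedPositions   : register(t2); // vertex data of all meshes, merged

static const uint IndicesPerInstance = 256; // fixed slot size (assumption);
                                            // smaller meshes pad with degenerate triangles

float4 main( uint vertexId : SV_VertexID ) : SV_Position
{
    uint instanceId = vertexId / IndicesPerInstance;
    uint localId    = vertexId % IndicesPerInstance;
    uint index      = mergedIndices[instanceMeshStart[instanceId] + localId];
    return float4( mergedPositions[index], 1.0f );
}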

Without vertex pulling, the only way to solve this particular batching problem is to manually bake the meshes you want to render into a different vertex buffer. The problem with that approach is a lot of data duplication: if 5 different buildings are reused 3 times each, but in many different combinations, then instead of storing the vertex data for just 5 meshes, you’ll have to keep one baked buffer copy for each combination, duplicating a building’s data every time it appears.


The same goes for particle FXs: send just the position and generate the quad in the vertex shader. The per-particle data is loaded by reading buffer[vertexId / 6]; see the sketch below.

Otherwise you’d have to use instancing (6 vertices per wavefront? no thanks) or send 6x the data per particle across the PCIe bus.
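A minimal sketch of the particle case (the names are illustrative; real code would billboard the quad in view space):

StructuredBuffer<float3> particlePositions : register(t0);

// Two triangles per particle (6 vertices), expanded from one position.
static const float2 QuadCorners[6] =
{
    float2( -1, -1 ), float2(  1, -1 ), float2(  1,  1 ),
    float2( -1, -1 ), float2(  1,  1 ), float2( -1,  1 )
};

float4 main( uint vertexId : SV_VertexID ) : SV_Position
{
    float3 center = particlePositions[vertexId / 6]; // one fetch per particle
    float2 corner = QuadCorners[vertexId % 6];       // which corner of the quad we are
    // Illustrative expansion in world XY with a hardcoded half-size.
    return float4( center.xy + corner * 0.5f, center.z, 1.0f );
}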


Performance implications

This varies widely because it depends on the GPU:

  1. On AMD’s GCN, it works as pull. Always. When you’re doing push, the driver patches the shader so the ISA (Instruction Set Architecture) code is converted to pull, based on the currently bound vertex format.
    • Now you know why PSOs (Pipeline State Objects) need both the vertex format and the shader: so the ISA can be patched when the PSO is created. For older APIs such as OpenGL and Direct3D11, where there’s no PSO, the driver must keep an internal lookup table that matches vertex formats and vertex shaders with already-generated ISAs (and create a new entry in the table if none is found). And this lookup table must be touched inside the draw call, every time you’ve changed the vertex attributes or the shader.
  2. To the best of my knowledge, Intel and NVIDIA prefer push, as that allows prefetching the vertices and prevents potential stalls.
    • Update: Turánszki János posted a benchmark of pull vs push for NV & GCN.
      • Timothy Lottes tells me NVIDIA should support pull much better, and that he recommended pull back when he was working there. He said about the benchmark: “That factor of 4 difference in NV timing suggests non-optimal in-app implementation for pull model, specifically bad data layout or memory access patterns”. Nonetheless, I think it is a good example that pull is riskier: pull code that isn’t careful can lead to significant slowdowns (this advice applies to GCN as well!). Check out the thread.
  3. iOS hardware before A11 prefers push. Apple devs have publicly stated the Metal compiler can recognize certain pull patterns and still use the prefetcher as if it were push; but it only works for basic pull code.
    • On A11 Bionic SoC, all push is converted to pull, just like with AMD’s GCN. Thanks to Jedd Haberstro for the info.
  4. When doing pull, beware of buffer semantics whose details can limit your performance unnecessarily. Some buffer types don’t map 1:1 between the API and what the GPU can actually do. See the perftest repo for more information on the fastest way to read data out of a buffer, and the sketch after this list.
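To make that concrete, the same fetch can be expressed through several buffer semantics, and they don’t necessarily compile to the same load instructions on every GPU. A sketch (which one is fastest varies per vendor, which is exactly what perftest measures):

// Same vertex fetch, three different buffer semantics:
StructuredBuffer<float3> structuredPositions : register(t0); // structured: stride known to the API
ByteAddressBuffer        rawPositions        : register(t1); // raw: we do the addressing ourselves
Buffer<float4>           typedPositions      : register(t2); // typed: format conversion done by the HW

float3 FetchPosition( uint vertexId )
{
    float3 a = structuredPositions[vertexId];                  // structured load
    float3 b = asfloat( rawPositions.Load3( vertexId * 12 ) ); // raw load; 12 = sizeof(float3)
    float3 c = typedPositions[vertexId].xyz;                   // typed load
    return a; // illustrative; which is fastest depends on the GPU, so measure!
}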

What does this mean?

  1. If you’re targeting consoles (PS4 & Xbox One), which both use GCN, then push has no performance benefit.
  2. The same goes for post-A11 iOS devices.
  3. If you’re including anything else in your targets, then your choice depends on how “classic” your rendering is. If you’re rendering traditionally, then push style is the most sensible approach, reserving pull style for when it is a clear win, such as merge-instancing for small objects and particle FXs. If your renderer is more unconventional, then you may want to always use pull, since it gives you more freedom.



Is “pull” the future? I don’t know, but it appears to be the case. It certainly is more flexible, and it puts the burden of solving the batching problem on the developer rather than the driver. But for classic rendering, “push” generally sounds like the better approach, simply because it’s more streamlined and easier to pipeline; it just falls apart for small mesh sizes.