It’s time we talk about TDR


So… I made a joke that became more popular than I expected.

I thought it was going to be a niche joke only graphics programmers would understand, but as it spread I realized many others get it too: Substance Painter artists, Adobe users, Blender/Maya/etc. users rendering on the GPU. HPC folks can relate. Even some gamers got it.

Basically, the users of any tool that relies on compute shaders to accelerate its work can relate to this problem.

TDR (Timeout Detection and Recovery) is the term for Windows’ solution and hence the name people are most familiar with, but the concept applies to all platforms (Windows, Linux, macOS, iOS, Android).

When I talk about TDR, I’m referring to all of them regardless of their name or implementation details.

The halting problem

TDR is a solution to the halting problem: determining whether the program will finish running, or continue to run forever. Alan Turing proved in 1936 that a general algorithm to solve the halting problem for all possible program-input pairs cannot exist.

Therefore TDR forces an easy solution: Nothing can run forever. Period. If it takes more than two seconds of execution time, it gets the axe.

This makes sense if we see GPUs merely as graphics cards refreshing a monitor every 16.67ms. But GPUs have come a long way since then.

But… is it the only solution? No.

What do Windows, Linux, macOS, FreeBSD, etc. do? They delegate solving the halting problem to the user: if users suspect a process is misbehaving, they can kill it from a process explorer. Can a human always solve the halting problem accurately? Not really, but they own the machine.

Android & iOS solve the problem slightly differently: although the user can kill individual processes, most services are invisible and, while in the background, are only allowed to run for short periods of time or with limited access; while in the foreground they are allowed to run indefinitely.

TDR has been holding the field back:

  1. Graphics/gaming programmers want to experiment with long-running compute shaders that go idle when there’s nothing to process, and wake up when they get pinged with new data
  2. Other fields also want to execute very expensive, long-running compute shaders and have to edit or disable the TDR timeouts to do so. But this is not practical: it’s hard to recover the system when a programming error makes it unresponsive, and you end up having to ssh into it.

GPU process explorer

So… Windows 10 has been showing GPU usage for a while now.

Why couldn’t it show a GPU process tab?

The idea would be as follows: GPUs that support preemption would be allowed to run GPU processes, which would be managed similarly to CPU processes: they’d have handles, their own PID pool, priorities, possibly execution masks, and the ability to be paused and resumed.

The compositor’s rendering process in the primary GPU connected to a monitor would always have critical priority.
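As a rough sketch of what that could look like, here are hypothetical analogues of the existing Win32 process APIs. None of these calls exist today; the names are made up purely for illustration:

// Made-up analogues of OpenProcess / SuspendThread / TerminateProcess /
// SetPriorityClass, operating on GPU processes instead of CPU ones.
HANDLE OpenGpuProcess( DWORD gpuPid );
BOOL   SuspendGpuProcess( HANDLE gpuProcess );
BOOL   ResumeGpuProcess( HANDLE gpuProcess );
BOOL   TerminateGpuProcess( HANDLE gpuProcess, UINT exitCode );
BOOL   SetGpuProcessPriority( HANDLE gpuProcess, DWORD priorityClass );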

Of course, that would require GPU HW capable of preemption. But I am told current-gen HW can already do this.

The minimum we need is:

  1. The GPU being capable of suspending compute shaders long enough for the OS to display on the screen, move the mouse cursor and click some buttons
  2. The GPU being capable of terminating a compute shader at any point without having to resort to a full GPU reset. Some latency is tolerable. But a shader must not get easily stuck in an uninterruptible loop. It’s fine if that can only happen on purpose or rarely, e.g. by exploiting a rare/invalid instruction or unusual conditions

So, assuming it’s possible, at least to a high degree: how would that work?

Opt-in

We have existing graphics APIs which have been designed around the current model: “Fire and forget”.

If we look at Vulkan:

void vkCmdDispatch( VkCommandBuffer commandBuffer, uint32_t x, uint32_t y, uint32_t z );

That’s it. That dispatch better take less than 2 seconds (or whatever the TDR registry key says).

Or at least run on a card not connected to a monitor; that way it can run indefinitely.

D3D11 has D3D11_CREATE_DEVICE_DISABLE_GPU_TIMEOUT but that’s still very subpar.
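For reference, this is roughly how that flag is used; a minimal sketch assuming the Direct3D 11.1 runtime, which the flag requires. It only disables the timeout for this one device; the rest of the system still has no way to inspect, pause or kill whatever we dispatch on it:

#include <d3d11.h>

ID3D11Device *device = NULL;
ID3D11DeviceContext *context = NULL;
D3D_FEATURE_LEVEL featureLevel;

// Opt this device out of the ~2-second timeout.
HRESULT hr = D3D11CreateDevice(
    NULL, D3D_DRIVER_TYPE_HARDWARE, NULL,
    D3D11_CREATE_DEVICE_DISABLE_GPU_TIMEOUT,
    NULL, 0, D3D11_SDK_VERSION,
    &device, &featureLevel, &context );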

But how about we make slight additions:

typedef void* GPUHANDLE;
// Analogous to pthread_create/CreateThread
GPUHANDLE vkCmdDispatch2( VkCommandBuffer commandBuffer, uint32_t x, uint32_t y, uint32_t z );
// Analogous to popen/CreateProcess
GPUHANDLE vkCmdDispatchProcess( VkCommandBuffer commandBuffer, uint32_t x, uint32_t y, uint32_t z );

The handle could be used like regular Win32/POSIX handles:

// Windows
WaitForSingleObject( gpuhandle, ms );

// *nix
pthread_join( gpuhandle, NULL );

waitpid( gpuhandle, &status, 0 );

The same goes for every function that manipulates threads (suspend, resume, kill). I would be OK if, for performance/architectural reasons, alternate functions needed to be used, e.g.

// Windows
WaitForSingleObjectGpu( gpuhandle, ms );
WaitForMultipleObjects2( cpuhandles, cpuhandlecount, gpuhandles, gpuhandlecount, bWaitAll, ms );

// *nix
pthread_join_gpu( gpuhandle, NULL );
waitpid_gpu( gpuhandle, &status, 0 );

An additional side effect of this is that any VkSemaphore or VkEvent issued after vkCmdDispatch2 no longer guarantees the shader has finished.
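Putting it together, hypothetical usage of the proposal could look like this. Everything with a 2 or Gpu suffix is part of the proposal above; none of it exists today:

// Record a long-running compute process; it no longer has to finish within
// the TDR window because it lives as its own GPU process.
GPUHANDLE gpuProcess = vkCmdDispatchProcess( commandBuffer, 2048u, 1u, 1u );
// ...vkEndCommandBuffer + vkQueueSubmit as usual...

// Windows: poll it like any other handle instead of relying on fences.
if( WaitForSingleObjectGpu( gpuProcess, 5000u ) == WAIT_TIMEOUT )
{
    // Still running. Thanks to preemption the OS keeps compositing the
    // desktop, and the user could see/kill the process from a GPU tab.
}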

GPUs which do not support preemption

  • Calling vkCmdDispatch2 & vkCmdDispatchProcess would be illegal.
  • Calling WaitForSingleObjectGpu or waitpid_gpu would return an error flag.
  • Calling WaitForMultipleObjects2 would work as long as gpuhandlecount is 0; otherwise it should return an error condition.

Signal handling

New signals would have to be created, e.g. SIGGPUTERM, SIGGPUKILL. GPU processes can trap these signals; but if for some reason they’re unable or unwilling to, the signals will be routed to their parent CPU process.

If a GPU process has detached itself from its CPU parent, then the root CPU process becomes its parent (e.g. the init process on Linux systems).

Perhaps a more generic name could be used, like SIGTERMCOP, with ‘COP’ standing for coprocessor.
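A minimal sketch of the CPU side, assuming the proposed SIGGPUTERM existed (here faked with a realtime signal just so the snippet compiles):

#include <signal.h>

// SIGGPUTERM is the proposed signal; it does not exist today. Reusing a
// realtime signal as a stand-in for illustration purposes only.
#define SIGGPUTERM ( SIGRTMIN + 0 )

static void onGpuTerm( int signum )
{
    // The kernel routed the signal here because our GPU child did not trap it.
    // A siginfo-based variant could tell us which GPU PID was terminated.
    (void)signum;
}

int main( void )
{
    signal( SIGGPUTERM, onGpuTerm );
    /* ...spawn GPU processes, do work... */
    return 0;
}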

OS-level convergence rules

Unless stated otherwise, OS API calls made from the GPU must use convergent inputs. For example:

// API declaration
threadgroup_uniform sighandler_t signal(int signum, threadgroup_uniform sighandler_t handler);

signal( SIGGPUTERM, sig_handler ); // sig_handler must be convergent at threadgroup level

The threadgroup_uniform keyword does not yet exist. I’m using my proposed improvements to shading languages to ensure these errors are caught at compile time.

If sig_handler were non-convergent, an arbitrary one of the divergent values would be chosen.

It’s not too different from calling signal() with uninitialized inputs in a regular CPU app. Pass bad inputs to OS calls and expect bad things to happen along the way: SIGGPUSEGV at best, silently performing unexpected behavior at worst.
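A shader-side sketch of the kind of mistake this rule would catch; the builtin and handler names are made up for illustration:

// 'handler' diverges between threads of the same threadgroup, so under the
// proposed rule this fails to compile instead of silently picking one lane's value.
sighandler_t handler = ( local_thread_index & 1u ) ? handler_even : handler_odd;
signal( SIGGPUTERM, handler ); // compile error: handler is not threadgroup_uniform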

Using the ‘old’ API

Calling vkCmdDispatch, i.e. the function that doesn’t return a handle, could behave in one of two ways:

  1. If the Vulkan context was not initialized with the flag indicating it is GPU-context-aware (i.e. old apps): display a single GPU process, child of its CPU process, so the user can see it. Killing the GPU process would trigger VK_ERROR_DEVICE_LOST.
  2. If the Vulkan context was initialized as GPU-context-aware: display a GPU process, child of the CPU process, which represents the GPU context. The application may create more GPU processes via vkCmdDispatchProcess and co.; killing the GPU context would raise SIGGPUTERM/SIGGPUKILL with the PID of the GPU context.

In both cases the GPU process dies when the CPU process that issued these calls dies; but in the second case GPU processes that have been spawned as parentless will survive.
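A hypothetical way that opt-in could be expressed at device creation; the flag name is made up and nothing like it exists in Vulkan today:

VkDeviceCreateInfo deviceCi = { 0 };
deviceCi.sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
// Proposed flag telling the driver/OS this app understands GPU processes,
// GPU handles and the new signals. It does not exist today.
deviceCi.flags = VK_DEVICE_CREATE_GPU_CONTEXT_AWARE_BIT_EXT;
// ...fill in queues & extensions as usual, then call vkCreateDevice().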

GPU restart

If the GPU is restarted, all GPU processes are terminated and their parent CPU processes notified.

Multi-GPUs

Clients would see a single PID. One would have to query the OS to find out which GPU is executing that PID.

Encoding the GPU into the PID is ill-advised: in the future we may want to allow shaders to migrate between GPUs.
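For example, a hypothetical query (the call name is made up):

// The PID alone does not say which physical GPU is running the process, so
// the OS must be queried; this also leaves room for future GPU migration.
uint32_t gpuIndex = 0u;
int err = get_gpu_for_process( gpupid, &gpuIndex );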

Other stuff

It’s natural that once GPUs move on from plain TDR and become independent co-processors, people will demand more. And it’s hard to foresee those demands because we are not there yet.

Could GPU launch CPU processes?

That’s a question that may need revisiting later. Let’s keep things simple for now. So the answer is: no.

Could GPU launch GPU threads & processes?

I don’t see why not, as long as the call spawning the GPU process is convergent.

I only see one risk though, which is spawning too many processes too fast.

Unlike on CPUs, it is far too easy to accidentally have every threadgroup spawn a process when you only meant one threadgroup to do so. Not to mention it could be done on purpose, and the rate of new processes per second could be much higher than anything a CPU can achieve.

If it can DoS a system, then it’s something that needs to be accounted for, whether via rate limiting or by forbidding it.
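A shader-side sketch of the accidental case described above; the spawn call and builtins are made up for illustration:

// Without this guard, every threadgroup of the dispatch spawns its own child
// process: thousands of spawns when only one was intended.
if( threadgroup_id.x == 0u && threadgroup_id.y == 0u && threadgroup_id.z == 0u )
{
    GPUHANDLE child = gpu_spawn_process( child_entry_point ); // hypothetical call
}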

Disk access

There will be lots of race conditions. And this is a subject I’m not well versed in. My instinct, based on how virtual machines behave, tells me any attempt will end up in one of two possibilities:

  1. GPUs having read-only access to a filesystem already mounted by the CPU
  2. GPUs having R/W access to a filesystem mounted exclusively by the GPU

This is an entire field: A GPU reset with a GPU-mounted filesystem means data loss and journal recovery.

There’s already some work in this area.

Network access

Due to security concerns, let’s keep things simple for now. So the answer is: no.

Security

Processes need a flag indicating the user allows them to spawn GPU processes. Think of it as the Linux exec bit, but for the GPU. If it’s not set, the CPU process will see the GPU as if it were not capable of preemption; it would be incapable of spawning GPU processes (other than the main Vulkan context) or communicating with other GPU processes.
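A sketch of how a process might discover this, assuming a made-up query call:

// Hypothetical check (the call does not exist): without the permission flag,
// the proposed handle-returning dispatches are simply reported as unsupported.
if( !has_gpu_spawn_permission( getpid() ) )
{
    // vkCmdDispatch2 / vkCmdDispatchProcess are unavailable to this process;
    // only the plain fire-and-forget vkCmdDispatch keeps working.
}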

RAM access controls and virtualization are something that AFAIK Windows 10 already implements. I don’t know if it’s robust enough to be used like this.

Linux is very far behind in this area. I can ask for a texture in Vulkan or OpenGL right now and see the previous contents of whatever process that memory used to belong to.

GPU Debugging

Unlike the CPU, there is not one ISA but many. It’s not just different vendors, but a moving target across time for the same vendor. It’s what allowed the GPU field to progress so much faster than the CPU field.

Therefore the kernel needs the driver to implement an API to read and parse ISAs:

size_t get_current_exec_address( GPUHANDLE gpupid, uint32_t threadGroupIdx[3] );
void get_current_exec_bytecode( GPUHANDLE gpupid, size_t address, size_t numInstructionsPrev, size_t numInstructionsAfter, uint8_t *outByteCode, size_t *inOutBytecodeSize );
void translate_bytecode( const uint8_t *byteCode, size_t sizeBytes, char *outString, size_t *inOutMaxStringSize );

It could work as follows:

// Buffers for the raw bytecode and its disassembly
uint8_t bytecode[256];
char outString[2048];
size_t bytecodeSize = sizeof( bytecode );
size_t maxStringSize = sizeof( outString );

// Get current execution offset at the given threadgroup
size_t addr = get_current_exec_address( gpupid, threadGroupIdx );
// Grab 11 instructions: 5 before the current one at addr, the current one, and 5 after it
get_current_exec_bytecode( gpupid, addr, 5u, 5u, bytecode, &bytecodeSize );
// Convert the binary bytecode into a human-readable string
translate_bytecode( bytecode, bytecodeSize, outString, &maxStringSize );

More functions would be needed to grab the state of all registers and manipulate execution (i.e. single step, jump to a different instruction), but this gets the idea.
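In the same spirit, those could look something like this (all names made up):

// Register state of one thread within a threadgroup of a suspended GPU process
void get_register_state( GPUHANDLE gpupid, uint32_t threadGroupIdx[3], uint32_t threadIdx,
                         uint32_t *outRegisters, size_t *inOutNumRegisters );
// Execution control: single-step a threadgroup or move its program counter
void single_step( GPUHANDLE gpupid, uint32_t threadGroupIdx[3] );
void set_current_exec_address( GPUHANDLE gpupid, uint32_t threadGroupIdx[3], size_t address );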

I’ll leave it up to politics whether vendors can always return an empty string from translate_bytecode and still be WDDM-certified (IMHO keeping the ISA secret is ridiculous and only hurts your business, but politics will be politics).

This assumes instruction level preemption is possible though.

Closing thoughts

I feel like a baby making demands. There are details I am missing. I’m not a kernel developer.

Some of you may be thinking “it’s not that easy”. And yes, I understand: keeping independent systems synchronized (multiple cores, computers across the internet) is never easy. I wouldn’t expect CPU-GPU synchronization that is secure, responsive and stable to be easy either.

The purpose of this article is to spark ideas that OS engineers (Microsoft, Linux kernel/DRM maintainers, Apple) might start considering. TDR should not be the best we can hope for.