The road to 16-bit GPU floats is paved with our blood


Ok that was a flashy clickbait-y title! What? You didn’t like it? I’m proud of that one!

So I recently tried to implement 16-bit floats in our shaders.

I knew it was untested. What I was not prepared for is… how untested everything is!

We use macros like #define midf min16float. We use midf because half is already a type in Metal.

Our new code has 3 modes of operation:

Full32: midf is just float. Nothing special. The default.

#define midf float
#define midf4 float4

Midf16: midf is just float16_t. It requires shaderFloat16 and storageInputOutput16 to be YES, and in turn extensions VK_KHR_shader_float16_int8 and VK_KHR_16bit_storage.

Only Vulkan and Metal support this feature. It’s excellent for testing and debugging 16-bit floats, because you force 16-bit precision regardless of whether the driver/HW can perform more efficiently with it.

#define midf float16_t
#define midf4 f16vec4
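
For these defines to compile, the GLSL source must also enable the relevant extensions (the same lines appear in the full pixel shader example further below):

#extension GL_EXT_shader_16bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types_float16 : require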

Support is actually scarce and limited mostly to post-Vega and a few Intel cards. And Metal.

Relaxed: midf is mediump float on Vulkan and min16float on D3D11 (though we disabled it for D3D11). Support is much broader.

But that’s because most drivers will simply default to 32-bit, so you have no way to test real 16-bit precision other than via the reference rasterizer or an Android phone.

#define midf mediump float
#define midf4 mediump vec4

// Unfortunately casts and construction, e.g.
// midf myFloat = midf_c(5.0), require
// a different macro because:
// mediump float myFloat = mediump float(5.0)
// is not valid syntax.
#define midf_c float
#define midf4_c vec4
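
To make the modes concrete, here’s a minimal sketch of a fragment shader written against these macros, shown with the Relaxed defines (the inputs and colour values are made up for illustration; only the midf/midf4/midf_c/midf4_c names come from our code):

#version 450

#define midf mediump float
#define midf4 mediump vec4
#define midf_c float
#define midf4_c vec4

layout( location = 0 ) in vec3 inNormal;
layout( location = 1 ) in vec3 inLightDir;

layout( location = 0 ) out midf4 fragColour;

void main()
{
    // midf/midf4 declare the variables; midf_c/midf4_c handle
    // construction/casts, since e.g. 'mediump float(...)' won't parse.
    midf NdotL = midf_c( max( dot( inNormal, inLightDir ), 0.0 ) );
    fragColour = midf4_c( 1.0, 0.8, 0.6, 1.0 ) * NdotL;
}

Swap in the Full32 or Midf16 defines and the same source compiles at 32-bit or true 16-bit precision (assuming the _c macros map to float16_t / f16vec4 in Midf16, plus the extensions mentioned earlier).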

Bugs, bugs everywhere

I knew Godot had problems with mediump float because mixing mediump with highp would cause some PSOs to fail to build in older Qualcomm drivers. So we’re off to a bad start.

What I didn’t expect:

  • MS FXC with optimizations enabled fails to compile a simple shader with a 2-level nested static branch. It would seem it tries to generate a cmov and fails.
    • Due to this, and because fxc is no longer maintained, we prefer to disable min16float support on Direct3D 11.
    • Direct3D 12 is not on our roadmap.
  • SPIRV-Reflect would randomly fail if mediump precision is used. This bug has been fixed now.
  • SPIRV debugging in RenderDoc will be unreliable if float16_t is used.
  • RADV would ignore vertex layouts for float16_t vertex inputs. i.e. in f16vec4 vertexPosition will only work if the vertex data is natively 16-bit (16_unorm, 16_snorm or 16_half), but it won’t work correctly if it is stored as 32-bit float/unorm/snorm. I didn’t report this bug because it may be ‘working as intended’, since something similar happens when the input data is declared as int: autoconversion no longer works. This is easy to work around: just declare vertexPosition as vec4 and then cast it with f16vec4( vertexPosition ), as in the sketch after this list. Fortunately we know which data is natively stored as 16-bit, so those are declared as f16vec4.
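
A minimal sketch of that workaround, assuming the vertex data is actually stored as 32-bit (only the vertexPosition name comes from the text above):

#version 450
#extension GL_EXT_shader_explicit_arithmetic_types_float16 : require

// Vertex data stored as 32-bit float: declare the input as vec4...
layout( location = 0 ) in vec4 vertexPosition;

void main()
{
    // ...and cast afterwards, instead of declaring
    // 'in f16vec4 vertexPosition' (which RADV only honours when
    // the vertex buffer is natively 16-bit).
    f16vec4 posHalf = f16vec4( vertexPosition );
    gl_Position = vec4( posHalf );
}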

On RDNA2, half (16-bit) is not a sure win

I was surprised to see (and later confirmed) that RDNA2 does not support converting as part of an arithmetic op.

e.g. with the following code:

uniform float k; // Inside a UBO (simplified)

float16_t b = ...;
float16_t a = b + float16_t( k );

First k is loaded into an SGPR, then converted to half in a VGPR. Then the addition happens.
If we use float all the way, k is loaded into an SGPR and kept there. If b is in a VGPR, the addition will be VGPR a = VGPR b + SGPR k.

The only solution to this is to declare uniform float16_t k so that the data can be natively loaded as 16-bit.
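
A sketch of what that looks like in GLSL, assuming the 16-bit storage extension/feature is available (the block layout is made up; only k comes from the snippet above):

#version 450
#extension GL_EXT_shader_16bit_storage : require
#extension GL_EXT_shader_explicit_arithmetic_types_float16 : require

layout( set = 0, binding = 0 ) uniform Params
{
    // Stored as 16-bit in the buffer, so the GPU can load it natively
    // instead of loading FP32 into an SGPR and spending a V_CVT (and
    // a VGPR) on the conversion.
    float16_t k;
};

void main()
{
    float16_t b = float16_t( 0.5 );
    float16_t a = b + k; // no float16_t( k ) cast needed anymore
}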

However, with such scarce support and FP16 not being natively supported by C/C++, it is very difficult to support such a path. We can’t ditch FP32 paths, and supporting both is a lot of effort.

It is much easier to send all the data as FP32 and then let the GPU load and convert automatically.

What this means in practice:

The following pixel shader:

#version 450

#extension GL_EXT_shader_16bit_storage: require
#extension GL_EXT_shader_explicit_arithmetic_types_float16: require

layout( ogre_P0 ) uniform Params {
  uniform vec4 myParamA[128];
  uniform vec4 myParamB;
  uniform vec4 myParamC;
};

// #define f16vec4 vec4

layout( location = 0 )
out f16vec4 fragColour;

void main()
{
  f16vec4 tmp0 = f16vec4( float16_t( 0 ) );
  f16vec4 tmp1 = f16vec4( float16_t( 0 ) );
  f16vec4 tmp2 = f16vec4( float16_t( 0 ) );
  f16vec4 tmp3 = f16vec4( float16_t( 0 ) );
  for( int i = 0; i < 128; ++i )
  {
      tmp0 += f16vec4( myParamA[i] ) + f16vec4( myParamB ) * f16vec4( myParamC );
      tmp1 += tmp0 + f16vec4( myParamA[i] ) + f16vec4( myParamB ) * f16vec4( myParamC );
      tmp2 += tmp0 + tmp1 + f16vec4( myParamA[i] ) + f16vec4( myParamB ) * f16vec4( myParamC );
      tmp3 += tmp0 + tmp1 + tmp2 + f16vec4( myParamA[i] ) + f16vec4( myParamB ) * f16vec4( myParamC );
  }
  fragColour = (tmp0 * tmp1 + tmp2) * tmp3;
}

Produces the following ISA with RADV:

BB0:
    v_mov_b32_sdwa v0, 0 dst_sel:WORD_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD ; 7e0002f9 00861480
    v_mov_b32_sdwa v1, 0 dst_sel:WORD_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD ; 7e0202f9 00861480
    v_mov_b32_sdwa v2, 0 dst_sel:WORD_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD ; 7e0402f9 00861480
    v_mov_b32_sdwa v3, 0 dst_sel:WORD_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD ; 7e0602f9 00861480
    v_mov_b32_sdwa v4, 0 dst_sel:WORD_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD ; 7e0802f9 00861480
    v_mov_b32_sdwa v5, 0 dst_sel:WORD_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD ; 7e0a02f9 00861480
    v_mov_b32_sdwa v6, 0 dst_sel:WORD_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD ; 7e0c02f9 00861480
    v_mov_b32_sdwa v7, 0 dst_sel:WORD_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD ; 7e0e02f9 00861480
    v_mov_b32_sdwa v8, 0 dst_sel:WORD_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD ; 7e1002f9 00861480
    v_mov_b32_sdwa v9, 0 dst_sel:WORD_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD ; 7e1202f9 00861480
    v_mov_b32_sdwa v10, 0 dst_sel:WORD_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD ; 7e1402f9 00861480
    v_mov_b32_sdwa v11, 0 dst_sel:WORD_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD ; 7e1602f9 00861480
    v_mov_b32_sdwa v12, 0 dst_sel:WORD_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD ; 7e1802f9 00861480
    v_mov_b32_sdwa v13, 0 dst_sel:WORD_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD ; 7e1a02f9 00861480
    v_mov_b32_sdwa v14, 0 dst_sel:WORD_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD ; 7e1c02f9 00861480
    v_mov_b32_sdwa v15, 0 dst_sel:WORD_0 dst_unused:UNUSED_PRESERVE src0_sel:DWORD ; 7e1e02f9 00861480
    s_mov_b32 s0, 0                                             ; be800380
BB1:
    s_cmp_ge_i32 s0, 0x80                                       ; bf03ff00 00000080
    s_cbranch_scc1 BB5                                          ; bf850054
BB4:
    s_add_i32 s4, 16, s2                                        ; 81040290
    s_movk_i32 s5, 0x8000                                       ; b0058000
    s_load_dwordx4 s[4:7], s[4:5], 0x0                          ; f4080102 fa000000
    s_lshl_b32 s1, s0, 4                                        ; 8f018400
    s_waitcnt lgkmcnt(0)                                        ; bf8cc07f
    s_clause 0x2                                                ; bfa10002
    s_buffer_load_dwordx4 s[8:11], s[4:7], s1                   ; f4280202 02000000
    s_buffer_load_dwordx4 s[12:15], s[4:7], 0x800               ; f4280302 fa000800
    s_buffer_load_dwordx4 s[4:7], s[4:7], 0x810                 ; f4280102 fa000810
    s_add_u32 s0, s0, 1                                         ; 80008100
    s_waitcnt lgkmcnt(0)                                        ; bf8cc07f
    v_cvt_f16_f32_e32 v16, s8                                   ; 7e201408
    v_cvt_f16_f32_e32 v17, s9                                   ; 7e221409
    v_cvt_f16_f32_e32 v18, s10                                  ; 7e24140a
    v_cvt_f16_f32_e32 v19, s11                                  ; 7e26140b
    v_cvt_f16_f32_e32 v20, s12                                  ; 7e28140c
    v_cvt_f16_f32_e32 v21, s13                                  ; 7e2a140d
    v_cvt_f16_f32_e32 v22, s14                                  ; 7e2c140e
    v_cvt_f16_f32_e32 v23, s15                                  ; 7e2e140f
    v_cvt_f16_f32_e32 v24, s4                                   ; 7e301404
    v_cvt_f16_f32_e32 v25, s5                                   ; 7e321405
    v_cvt_f16_f32_e32 v26, s6                                   ; 7e341406
    v_cvt_f16_f32_e32 v27, s7                                   ; 7e361407
    v_fma_f16 v28, v20, v24, v16                                ; d74b001c 04423114
    v_fma_f16 v29, v21, v25, v17                                ; d74b001d 04463315
    v_fma_f16 v30, v22, v26, v18                                ; d74b001e 044a3516
    v_fma_f16 v31, v23, v27, v19                                ; d74b001f 044e3717
    v_add_f16_e32 v12, v12, v28                                 ; 6418390c
    v_add_f16_e32 v13, v13, v29                                 ; 641a3b0d
    v_add_f16_e32 v14, v14, v30                                 ; 641c3d0e
    v_add_f16_e32 v15, v15, v31                                 ; 641e3f0f
    v_add_f16_e32 v28, v12, v16                                 ; 6438210c
    v_add_f16_e32 v29, v13, v17                                 ; 643a230d
    v_add_f16_e32 v30, v14, v18                                 ; 643c250e
    v_add_f16_e32 v31, v15, v19                                 ; 643e270f
    v_fmac_f16_e32 v28, v20, v24                                ; 6c383114
    v_fmac_f16_e32 v29, v21, v25                                ; 6c3a3315
    v_fmac_f16_e32 v30, v22, v26                                ; 6c3c3516
    v_fmac_f16_e32 v31, v23, v27                                ; 6c3e3717
    v_add_f16_e32 v8, v8, v28                                   ; 64103908
    v_add_f16_e32 v9, v9, v29                                   ; 64123b09
    v_add_f16_e32 v10, v10, v30                                 ; 64143d0a
    v_add_f16_e32 v11, v11, v31                                 ; 64163f0b
    v_add_f16_e32 v28, v12, v8                                  ; 6438110c
    v_add_f16_e32 v29, v13, v9                                  ; 643a130d
    v_add_f16_e32 v30, v14, v10                                 ; 643c150e
    v_add_f16_e32 v31, v15, v11                                 ; 643e170f
    v_add_f16_e32 v32, v28, v16                                 ; 6440211c
    v_add_f16_e32 v33, v29, v17                                 ; 6442231d
    v_add_f16_e32 v34, v30, v18                                 ; 6444251e
    v_add_f16_e32 v35, v31, v19                                 ; 6446271f
    v_fmac_f16_e32 v32, v20, v24                                ; 6c403114
    v_fmac_f16_e32 v33, v21, v25                                ; 6c423315
    v_fmac_f16_e32 v34, v22, v26                                ; 6c443516
    v_fmac_f16_e32 v35, v23, v27                                ; 6c463717
    v_add_f16_e32 v4, v4, v32                                   ; 64084104
    v_add_f16_e32 v5, v5, v33                                   ; 640a4305
    v_add_f16_e32 v6, v6, v34                                   ; 640c4506
    v_add_f16_e32 v7, v7, v35                                   ; 640e4707
    v_add_f16_e32 v28, v28, v4                                  ; 6438091c
    v_add_f16_e32 v29, v29, v5                                  ; 643a0b1d
    v_add_f16_e32 v30, v30, v6                                  ; 643c0d1e
    v_add_f16_e32 v31, v31, v7                                  ; 643e0f1f
    v_add_f16_e32 v16, v28, v16                                 ; 6420211c
    v_add_f16_e32 v17, v29, v17                                 ; 6422231d
    v_add_f16_e32 v18, v30, v18                                 ; 6424251e
    v_add_f16_e32 v19, v31, v19                                 ; 6426271f
    v_fmac_f16_e32 v16, v20, v24                                ; 6c203114
    v_fmac_f16_e32 v17, v21, v25                                ; 6c223315
    v_fmac_f16_e32 v18, v22, v26                                ; 6c243516
    v_fmac_f16_e32 v19, v23, v27                                ; 6c263717
    v_add_f16_e32 v0, v0, v16                                   ; 64002100
    v_add_f16_e32 v1, v1, v17                                   ; 64022301
    v_add_f16_e32 v2, v2, v18                                   ; 64042502
    v_add_f16_e32 v3, v3, v19                                   ; 64062703
    s_branch BB1                                                ; bf82ffa9
BB5:
    v_fmac_f16_e32 v4, v12, v8                                  ; 6c08110c
    v_fmac_f16_e32 v5, v13, v9                                  ; 6c0a130d
    v_fmac_f16_e32 v6, v14, v10                                 ; 6c0c150e
    v_fmac_f16_e32 v7, v15, v11                                 ; 6c0e170f
    v_mul_f16_e32 v0, v4, v0                                    ; 6a000104
    v_mul_f16_sdwa v0, v5, v1 dst_sel:WORD_1 dst_unused:UNUSED_PRESERVE src0_sel:WORD_0 src1_sel:WORD_0 ; 6a0002f9 04041505
    v_mul_f16_e32 v1, v6, v2                                    ; 6a020506
    v_mul_f16_sdwa v1, v7, v3 dst_sel:WORD_1 dst_unused:UNUSED_PRESERVE src0_sel:WORD_0 src1_sel:WORD_0 ; 6a0206f9 04041507
    exp mrt0 v0, v0, v1, v1 done compr vm                       ; f8001c0f 80800100
    s_endpgm                                                    ; bf810000



Pixel Shader:
*** SHADER STATS ***
SGPRs: 128
VGPRs: 40
Spilled SGPRs: 0
Spilled VGPRs: 0
PrivMem VGPRs: 0
Code size: 532
LDS size: 0
Scratch size: 0
Subgroups per SIMD: 24
Hash: 3918390333
Instructions: 105
Copies: 18
Branches: 2
Latency: 3032
Inverse Throughput: 1072
VMEM Clause: 0
SMEM Clause: 2
Pre-Sched SGPRs: 10
Pre-Sched VGPRs: 36
********************

But if we compile it as 32-bit (by replacing all f16vec4 with vec4):

BB0:
    v_lshrrev_b64 v[0:1], 0, 0                                  ; d7000000 00010080
    v_lshrrev_b64 v[2:3], 0, 0                                  ; d7000002 00010080
    v_lshrrev_b64 v[4:5], 0, 0                                  ; d7000004 00010080
    v_lshrrev_b64 v[6:7], 0, 0                                  ; d7000006 00010080
    v_lshrrev_b64 v[8:9], 0, 0                                  ; d7000008 00010080
    v_lshrrev_b64 v[10:11], 0, 0                                ; d700000a 00010080
    v_lshrrev_b64 v[12:13], 0, 0                                ; d700000c 00010080
    v_lshrrev_b64 v[14:15], 0, 0                                ; d700000e 00010080
    s_mov_b32 s0, 0                                             ; be800380
BB1:
    s_cmp_ge_i32 s0, 0x80                                       ; bf03ff00 00000080
    s_cbranch_scc1 BB5                                          ; bf85004c
BB4:
    s_add_i32 s4, 16, s2                                        ; 81040290
    s_movk_i32 s5, 0x8000                                       ; b0058000
    s_load_dwordx4 s[4:7], s[4:5], 0x0                          ; f4080102 fa000000
    s_lshl_b32 s1, s0, 4                                        ; 8f018400
    s_waitcnt lgkmcnt(0)                                        ; bf8cc07f
    s_clause 0x2                                                ; bfa10002
    s_buffer_load_dwordx4 s[8:11], s[4:7], 0x810                ; f4280202 fa000810
    s_buffer_load_dwordx4 s[12:15], s[4:7], 0x800               ; f4280302 fa000800
    s_buffer_load_dwordx4 s[4:7], s[4:7], s1                    ; f4280102 02000000
    s_add_u32 s0, s0, 1                                         ; 80008100
    s_waitcnt lgkmcnt(0)                                        ; bf8cc07f
    v_mul_f32_e64 v16, s12, s8                                  ; d5080010 0000100c
    v_mul_f32_e64 v17, s13, s9                                  ; d5080011 0000120d
    v_mul_f32_e64 v18, s14, s10                                 ; d5080012 0000140e
    v_mul_f32_e64 v19, s15, s11                                 ; d5080013 0000160f
    v_add_f32_e32 v20, s4, v16                                  ; 06282004
    v_add_f32_e32 v21, s5, v17                                  ; 062a2205
    v_add_f32_e32 v22, s6, v18                                  ; 062c2406
    v_add_f32_e32 v23, s7, v19                                  ; 062e2607
    v_add_f32_e32 v12, v12, v20                                 ; 0618290c
    v_add_f32_e32 v13, v13, v21                                 ; 061a2b0d
    v_add_f32_e32 v14, v14, v22                                 ; 061c2d0e
    v_add_f32_e32 v15, v15, v23                                 ; 061e2f0f
    v_add_f32_e32 v20, s4, v12                                  ; 06281804
    v_add_f32_e32 v21, s5, v13                                  ; 062a1a05
    v_add_f32_e32 v22, s6, v14                                  ; 062c1c06
    v_add_f32_e32 v23, s7, v15                                  ; 062e1e07
    v_add_f32_e32 v20, v20, v16                                 ; 06282114
    v_add_f32_e32 v21, v21, v17                                 ; 062a2315
    v_add_f32_e32 v22, v22, v18                                 ; 062c2516
    v_add_f32_e32 v23, v23, v19                                 ; 062e2717
    v_add_f32_e32 v8, v8, v20                                   ; 06102908
    v_add_f32_e32 v9, v9, v21                                   ; 06122b09
    v_add_f32_e32 v10, v10, v22                                 ; 06142d0a
    v_add_f32_e32 v11, v11, v23                                 ; 06162f0b
    v_add_f32_e32 v20, v12, v8                                  ; 0628110c
    v_add_f32_e32 v21, v13, v9                                  ; 062a130d
    v_add_f32_e32 v22, v14, v10                                 ; 062c150e
    v_add_f32_e32 v23, v15, v11                                 ; 062e170f
    v_add_f32_e32 v24, s4, v20                                  ; 06302804
    v_add_f32_e32 v25, s5, v21                                  ; 06322a05
    v_add_f32_e32 v26, s6, v22                                  ; 06342c06
    v_add_f32_e32 v27, s7, v23                                  ; 06362e07
    v_add_f32_e32 v24, v24, v16                                 ; 06302118
    v_add_f32_e32 v25, v25, v17                                 ; 06322319
    v_add_f32_e32 v26, v26, v18                                 ; 0634251a
    v_add_f32_e32 v27, v27, v19                                 ; 0636271b
    v_add_f32_e32 v4, v4, v24                                   ; 06083104
    v_add_f32_e32 v5, v5, v25                                   ; 060a3305
    v_add_f32_e32 v6, v6, v26                                   ; 060c3506
    v_add_f32_e32 v7, v7, v27                                   ; 060e3707
    v_add_f32_e32 v20, v20, v4                                  ; 06280914
    v_add_f32_e32 v21, v21, v5                                  ; 062a0b15
    v_add_f32_e32 v22, v22, v6                                  ; 062c0d16
    v_add_f32_e32 v23, v23, v7                                  ; 062e0f17
    v_add_f32_e32 v20, s4, v20                                  ; 06282804
    v_add_f32_e32 v21, s5, v21                                  ; 062a2a05
    v_add_f32_e32 v22, s6, v22                                  ; 062c2c06
    v_add_f32_e32 v23, s7, v23                                  ; 062e2e07
    v_add_f32_e32 v16, v20, v16                                 ; 06202114
    v_add_f32_e32 v17, v21, v17                                 ; 06222315
    v_add_f32_e32 v18, v22, v18                                 ; 06242516
    v_add_f32_e32 v19, v23, v19                                 ; 06262717
    v_add_f32_e32 v0, v0, v16                                   ; 06002100
    v_add_f32_e32 v1, v1, v17                                   ; 06022301
    v_add_f32_e32 v2, v2, v18                                   ; 06042502
    v_add_f32_e32 v3, v3, v19                                   ; 06062703
    s_branch BB1                                                ; bf82ffb1
BB5:
    v_fmac_f32_e32 v4, v12, v8                                  ; 5608110c
    v_fmac_f32_e32 v5, v13, v9                                  ; 560a130d
    v_fmac_f32_e32 v6, v14, v10                                 ; 560c150e
    v_fmac_f32_e32 v7, v15, v11                                 ; 560e170f
    v_mul_f32_e32 v0, v4, v0                                    ; 10000104
    v_mul_f32_e32 v1, v5, v1                                    ; 10020305
    v_mul_f32_e32 v2, v6, v2                                    ; 10040506
    v_mul_f32_e32 v3, v7, v3                                    ; 10060707
    v_cvt_pkrtz_f16_f32_e32 v0, v0, v1                          ; 5e000300
    v_cvt_pkrtz_f16_f32_e32 v1, v2, v3                          ; 5e020702
    exp mrt0 v0, v0, v1, v1 done compr vm                       ; f8001c0f 80800100
    s_endpgm                                                    ; bf810000



Pixel Shader:
*** SHADER STATS ***
SGPRs: 128
VGPRs: 32
Spilled SGPRs: 0
Spilled VGPRs: 0
PrivMem VGPRs: 0
Code size: 436
LDS size: 0
Scratch size: 0
Subgroups per SIMD: 32
Hash: 1646907196
Instructions: 91
Copies: 10
Branches: 2
Latency: 2925
Inverse Throughput: 948
VMEM Clause: 0
SMEM Clause: 2
Pre-Sched SGPRs: 15
Pre-Sched VGPRs: 28
********************

We can notice a few things:

  • Inverse Throughput (estimated busy cycles to execute one wave, i.e. lower is better): 1072 vs 948. 32-bit wins
  • Latency (issue cycles plus stall cycles, i.e. lower is better): 3032 vs 2925. 32-bit wins
  • RADV generated zero V_PK_ADD_F16 and V_PK_FMAC_F16 despite there being plenty of opportunities
  • I don’t know if V_CVT_F16_F32_SDWA is possible or if it has a cost. It seems SDWA instructions need 2 DWORDs.
  • 16-bit needed 36 VGPRs vs 28 VGPRs for 32-bit
  • Part of this is explained by data being kept in scalar registers in 32-bit, whereas 16-bit needs to move everything to VGPRs
  • Another part is RADV not being able to pack 2 FP16 values into the same VGPR
    • That only happens twice, at the very end, where we see the v_mul_f16_sdwa instructions

In other words

  • RDNA2 does not seem to double the register count with FP16. It would be cool to see registers s[0:512] and v[0:128] alias to sh[0:1024] and vh[0:512] respectively. This could be implemented either as register aliasing or via additional instructions that operate on the high/low bits of a register; that is an implementation detail, since externally it can be seen as register aliasing either way. It would be easier to mentally track.
  • RDNA2 does support packed math, i.e. operating on two FP16 values living in the same 32-bit register at the same time
  • RDNA2 also supports the SDWA suffix to target the 2nd FP16 value in a VGPR register
  • RADV / ACO does not yet seem to take advantage of packed math instructions
  • Proprietary drivers on Windows do use packed operations. Our PBS shader was filled with V_CVT_PKRTZ_F16_F32 and PK arithmetic instructions
    • However, there was no noticeable change in performance
      • It’s quite possible I was not stressing the GPU enough
    • VGPR usage went down, SGPR usage went up. Overall a win (yay!)
    • With AMD_shader_info:
      • “Total Cycles” FP16 was 6% higher
      • “Total Stalls” for FP16 is 96% higher. Sounds bad to me, but I don’t know.
  • Loading a float from const buffers and converting it to FP16 needs a V_CVT instruction and takes away 1 VGPR, since the conversion cannot happen on SGPRs
    • This is a big problem
      • Light data is in FP32 (position x3, direction x3, colour x3, spot params x3, attenuation x4)
        • In advanced games light data typically ends up in VGPRs for Deferred/Forward+, BUT
        • Power-constrained games (like the ones targeting mobile! where FP16 is most useful!) keep light data in constant buffers, as they use regular forward
      • Material data is in FP32 (kD x3, kS x3, fresnel x1, roughness x1, transparency x1)
  • This could be fixed if, instead of S_LOAD_DWORDX16, a new instruction like S_LOADCVT_DWORDXn_LO could load and then batch-convert the FP32 floats in s[0:1] into two FP16 in sh[0:1], leaving s0 with two FP16 floats in it and s1 with the original FP32 value. S_LOADCVT_DWORDXn_HI would do the same but store the results in sh[2:3], leaving the original FP32 in s0. V_CVT_PKRTZ_F16_F32 already does this but:
    • It operates exclusively on VGPR results
    • It does not respect the current rounding mode (it assumes round to zero)
    • It works on 2 floats at a time. It’s not in bulk. S_LOADCVT_DWORDX16 would convert 16 floats at a time.
    • S_LOADCVT_DWORDXn_LO and S_LOADCVT_DWORDXn_HI make sense because based on my observations a few operations need the original data from the const buffer as both FP16 and FP32. Thus being able to load-and-convert F32-to-F16 data while preserving half of the original FP32 values could be very useful.
    • The goal behind an amalgamated instruction S_LOADCVT_DWORDn is to perform the load and conversion without an S_WAITCNT instruction in between. The rationale is the same as BUFFER_LOAD_FORMAT_XYZW (untyped buffer load and conversion) but targeting scalar registers.
    • Besides, no instruction other than raw load dword operations seems to support writing to SGPRs. This is obviously for simplicity. Thus having the ASIC convert data on the fly during the load would make sense.
  • Alternatively, arithmetic instructions that take an SGPR input and convert it on the spot to half could work as well

With the Steam Deck disembarking in a few weeks with RDNA2, and Samsung soon to release an RDNA2-powered phone, I thought FP16 support would be more advanced (Note: I don’t know if Samsung uses RADV or a proprietary driver).

Valve should invest in getting packed FP16 math for RADV. If AMD or Valve needs a test case, they can run our samples. They run on Wine, so they should run on Proton too.

I modified the samples to add the command-line option --force16, which forces Midf16 mode as described above. Look for “RUNNING WITH 16-bit PRECISION AND SUPPORTED! :)” in the Log. And of course, choose Vulkan.

For native Linux, build master and apply this patch to force Midf16:

diff --git a/OgreMain/src/OgreHlms.cpp b/OgreMain/src/OgreHlms.cpp
index 55e00e70bb..3498c8b0df 100644
--- a/OgreMain/src/OgreHlms.cpp
+++ b/OgreMain/src/OgreHlms.cpp
@@ -275,7 +275,7 @@ namespace Ogre
 #else
         mDebugOutputProperties( false ),
 #endif
-        mPrecisionMode( PrecisionFull32 ),
+        mPrecisionMode( PrecisionMidf16 ),
         mFastShaderBuildHack( false ),
         mDefaultDatablock( 0 ),
         mType( type ),