# ncnn GLSL extension

## rationale

Different GPUs support different features: some support fp16 as a buffer storage type, some support fp16 as an arithmetic operand type, and some older GPUs support only fp32.

When the GPU supports the `VK_KHR_16bit_storage` extension, we prefer fp16 as the storage type in order to minimize GPU memory bandwidth consumption. Otherwise, we use `packHalf2x16` and `unpackHalf2x16` from GLSL 4.2 to pack two fp32 values into one uint, which still reduces read and write bandwidth.

Similarly, when the GPU supports the `VK_KHR_shader_float16_int8` extension, we prefer fp16 as the arithmetic operand type in order to speed up computation, which usually doubles the speed. Otherwise, we use fp32.

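To make the packed fallback concrete, here is a minimal CPU-side sketch of what packing two fp32 values into one 32-bit word means. The helper names (`float_to_half_bits`, `half_bits_to_float`, `pack_half_2x16`, `unpack_half_2x16`) are hypothetical and are not part of ncnn or GLSL. The sketch assumes IEEE 754 binary16 with round-to-nearest-even and follows the GLSL convention that the first component occupies the 16 least significant bits.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Convert an fp32 value to IEEE 754 binary16 bits (round to nearest even).
static uint16_t float_to_half_bits(float f)
{
    uint32_t x;
    std::memcpy(&x, &f, sizeof(x));
    const uint32_t sign = (x >> 16) & 0x8000u;
    const uint32_t exp32 = (x >> 23) & 0xFFu;
    uint32_t mant = x & 0x7FFFFFu;

    if (exp32 == 0xFFu) // infinity or NaN
        return static_cast<uint16_t>(sign | 0x7C00u | (mant ? 0x200u : 0u));

    const int e = static_cast<int>(exp32) - 127 + 15; // exponent re-biased for fp16
    if (e >= 31) // too large for fp16: becomes infinity
        return static_cast<uint16_t>(sign | 0x7C00u);

    if (e <= 0) // subnormal or zero in fp16
    {
        if (e < -10) return static_cast<uint16_t>(sign); // rounds to zero
        mant |= 0x800000u; // restore the implicit leading 1
        const int shift = 14 - e;
        assert(shift >= 14 && shift <= 24);
        uint32_t h = mant >> shift;
        const uint32_t rem = mant & ((1u << shift) - 1u);
        const uint32_t halfway = 1u << (shift - 1);
        if (rem > halfway || (rem == halfway && (h & 1u))) h++;
        return static_cast<uint16_t>(sign | h);
    }

    uint32_t h = (static_cast<uint32_t>(e) << 10) | (mant >> 13);
    const uint32_t rem = mant & 0x1FFFu;
    if (rem > 0x1000u || (rem == 0x1000u && (h & 1u))) h++; // a carry into the exponent is correct
    return static_cast<uint16_t>(sign | h);
}

// Convert IEEE 754 binary16 bits back to fp32 (always exact).
static float half_bits_to_float(uint16_t h)
{
    const int exp16 = (h >> 10) & 0x1F;
    const int mant = h & 0x3FF;
    float v;
    if (exp16 == 0)
        v = std::ldexp(static_cast<float>(mant), -24); // zero or subnormal
    else if (exp16 == 31)
        v = mant ? std::numeric_limits<float>::quiet_NaN() : std::numeric_limits<float>::infinity();
    else
        v = std::ldexp(static_cast<float>(mant | 0x400), exp16 - 25);
    return (h & 0x8000u) ? -v : v;
}

// Two fp32 values share one 32-bit word: lo in bits 0..15, hi in bits 16..31.
static uint32_t pack_half_2x16(float lo, float hi)
{
    return static_cast<uint32_t>(float_to_half_bits(lo)) | (static_cast<uint32_t>(float_to_half_bits(hi)) << 16);
}

static void unpack_half_2x16(uint32_t v, float& lo, float& hi)
{
    lo = half_bits_to_float(static_cast<uint16_t>(v & 0xFFFFu));
    hi = half_bits_to_float(static_cast<uint16_t>(v >> 16));
}
```

With this layout a four-component fp32 vector fits in two 32-bit words, which matches the `uvec2` element type used for the packed case. Values that fp16 cannot represent exactly, such as 0.1, come back slightly changed after a round trip.
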
To ensure the widest compatibility, the code for declaring a descriptor binding and loading data would have to be written as follows:

```c
#if NCNN_fp16_storage // gpu supports 16bit storage
layout (binding = 0) buffer blob { f16vec4 blob_data[]; };
#elif NCNN_fp16_packed // gpu supports GLSL 4.2
layout (binding = 0) buffer blob { uvec2 blob_data[]; };
#else // gpu only supports fp32
layout (binding = 0) buffer blob { vec4 blob_data[]; };
#endif

void main()
{
    const int i = int(gl_GlobalInvocationID.x);

#if NCNN_fp16_storage && NCNN_fp16_arithmetic // gpu supports 16bit storage and shader float16
    f16vec4 x = blob_data[i];
#elif NCNN_fp16_storage // gpu supports 16bit storage but no shader float16
    vec4 x = vec4(blob_data[i]);
#elif NCNN_fp16_packed && NCNN_fp16_arithmetic // gpu supports GLSL 4.2 and shader float16
    f16vec4 x = f16vec4(unpackFloat2x16(blob_data[i].x), unpackFloat2x16(blob_data[i].y));
#elif NCNN_fp16_packed // gpu supports GLSL 4.2
    vec4 x = vec4(unpackHalf2x16(blob_data[i].x), unpackHalf2x16(blob_data[i].y));
#else // gpu only supports fp32
    vec4 x = blob_data[i];
#endif
}
```

As you can see, just declaring the buffer type and reading one value takes many lines of code, which is a maintenance nightmare. Therefore, ncnn adds more flexible data types and auxiliary functions that reduce code size and improve readability. They expand automatically to the most efficient implementation for the feature level supported by the GPU.

With the ncnn GLSL extension, the code above can be simplified to

```c
layout (binding = 0) buffer blob { sfpvec4 blob_data[]; };

void main()
{
    const int i = int(gl_GlobalInvocationID.x);

    afpvec4 x = buffer_ld4(blob_data, i);
}
```

The ncnn GLSL extension provides the data types needed for storage, computation, and shared memory, together with load, store, and conversion functions for buffers and images. It also provides buffer and image copy functions, which prevent loss of precision when fp16 is used as the intermediate data type and avoid unnecessary `unpackHalf2x16` and `packHalf2x16` pairs.

# entrypoint for compiling GLSL

The `gpu.h` header in the ncnn library exposes three functions for compiling GLSL code into a SPIR-V binary. All three support the ncnn GLSL extension and accept an `Option` argument that controls how the extension is expanded. The first two accept raw GLSL source strings, and the last one compiles one of ncnn's built-in shaders.

```cpp
namespace ncnn {

// online spirv compilation
NCNN_EXPORT int compile_spirv_module(const char* comp_string, const Option& opt, std::vector<uint32_t>& spirv);
NCNN_EXPORT int compile_spirv_module(const char* comp_data, int comp_data_size, const Option& opt, std::vector<uint32_t>& spirv);
NCNN_EXPORT int compile_spirv_module(int shader_type_index, const Option& opt, std::vector<uint32_t>& spirv);

} // namespace ncnn
```

## compile ncnn extended GLSL code directly

You can write shader code that uses the ncnn GLSL extension and compile it to SPIR-V with the ncnn functions. The result is a standard-compliant SPIR-V binary that can be used directly to create a pipeline object through the Vulkan API.

```cpp
static const char my_glsl_data[] = R"(
#version 450

layout (binding = 0) readonly buffer a_blob { sfpvec4 a_blob_data[]; };
layout (binding = 1) writeonly buffer b_blob { sfpvec4 b_blob_data[]; };

void main()
{
    const int i = int(gl_GlobalInvocationID.x);

    afpvec4 v = buffer_ld4(a_blob_data, i);

    v = v + 123;

    buffer_st4(b_blob_data, i, v);
}
)";

ncnn::Option opt;
// you can control how the extension expands,
// e.g. disable fp16 storage even if the gpu supports it
opt.use_fp16_storage = false;

std::vector<uint32_t> spirv;
ncnn::compile_spirv_module(my_glsl_data, sizeof(my_glsl_data) - 1, opt, spirv);

// To create pipeline object later
// ncnn::Pipeline pipeline(vkdev);
// pipeline.set_local_size_xyz(64, 1, 1);
// pipeline.create(spirv.data(), spirv.size() * 4, specializations);
```

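The compiled module is a vector of 32-bit words, and the commented-out `pipeline.create` call above passes `spirv.size() * 4`, which suggests a size in bytes. Here is a minimal sketch of a sanity check you could run on the output before creating a pipeline. The helper names `looks_like_spirv` and `spirv_size_in_bytes` are hypothetical and not part of ncnn. The magic number and the five-word header come from the SPIR-V specification.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// First word of every SPIR-V module, as defined by the SPIR-V specification.
static const uint32_t SPIRV_MAGIC = 0x07230203u;

// Hypothetical helper: cheap plausibility check on a compiled module.
static bool looks_like_spirv(const std::vector<uint32_t>& spirv)
{
    // The SPIR-V header is 5 words: magic, version, generator, bound, schema.
    return spirv.size() >= 5 && spirv[0] == SPIRV_MAGIC;
}

// Hypothetical helper: size in bytes of a module stored as 32-bit words.
static size_t spirv_size_in_bytes(const std::vector<uint32_t>& spirv)
{
    return spirv.size() * sizeof(uint32_t);
}
```

An empty vector fails the check, which makes it a convenient guard for a failed compilation as well.
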
## ncnn built-in shader

The indices of ncnn's built-in shaders are exposed in the `layer_shader_type.h` header and can be used if needed.

```cpp
#include "layer_shader_type.h"

int shader_type_index = ncnn::LayerShaderType::convert_ycbcr;

ncnn::Option opt;

std::vector<uint32_t> spirv;
int retc = ncnn::compile_spirv_module(shader_type_index, opt, spirv);
```

# data types

## storage type

Declare the buffer data layout in a descriptor binding.

```c
layout (binding = 0) buffer top_blob { sfpvec4 top_blob_data[]; };
```

|storage type|fp32|fp16p|fp16s|
|---|---|---|---|
|sfp|float|uint|float16_t|
|sfpvec2|vec2|uint|f16vec2|
|sfpvec4|vec4|uvec2|f16vec4|
|sfpvec8|mat2x4|uvec4|f16mat2x4|

Here fp16p denotes packed fp16 storage (`NCNN_fp16_packed`) and fp16s denotes native fp16 storage (`NCNN_fp16_storage`).

## arithmetic type

Declare a local variable in GLSL code.

```c
void main()
{
    // a and b are arithmetic-type values computed earlier
    afpvec4 v = a * b;
}
```

|arithmetic type|fp32|fp16a|
|---|---|---|
|afp|float|float16_t|
|afpvec2|vec2|f16vec2|
|afpvec4|vec4|f16vec4|
|afpvec8|mat2x4|f16mat2x4|

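The storage and arithmetic tables can be read as a small decision procedure. The sketch below models it for the four-component types only. `ExpansionSwitches`, `expand_sfpvec4`, and `expand_afpvec4` are hypothetical names, not ncnn's implementation, and the precedence (fp16 storage first, then fp16 packed, then fp32) is taken from the `#if` / `#elif` order in the rationale section.

```cpp
#include <cassert>
#include <string>

// Hypothetical mirror of the three switches; this is not ncnn's Option struct.
struct ExpansionSwitches
{
    bool fp16_storage;
    bool fp16_packed;
    bool fp16_arithmetic;
};

// GLSL type that sfpvec4 stands for, following the storage type table.
static std::string expand_sfpvec4(const ExpansionSwitches& s)
{
    if (s.fp16_storage) return "f16vec4"; // native 16-bit storage wins
    if (s.fp16_packed) return "uvec2";    // two packed halves per uint
    return "vec4";                        // plain fp32
}

// GLSL type that afpvec4 stands for, following the arithmetic type table.
static std::string expand_afpvec4(const ExpansionSwitches& s)
{
    return s.fp16_arithmetic ? "f16vec4" : "vec4";
}
```

Storage and arithmetic are chosen independently, so a device with packed storage and fp16 arithmetic would, under this model, store `uvec2` and compute in `f16vec4`.
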
## local type

Declare a variable in shared local memory.

```c
shared lfp tmp_a[8][4][2];
```

|local type|fp32|fp16p / fp16s only|fp16s+fp16a|fp16s+fp16u|
|---|---|---|---|---|
|lfp|float|float|float|float16_t|
|lfpvec4|vec4|uvec2|uint64_t|f16vec4|

# buffer functions

- load typed value from src[offset]

```c
afp buffer_ld1(sfp src, int offset);
afpvec2 buffer_ld2(sfpvec2 src, int offset);
afpvec4 buffer_ld4(sfpvec4 src, int offset);
afpvec8 buffer_ld8(sfpvec8 src, int offset);
```

- store typed value to dst[offset]

```c
void buffer_st1(sfp dst, int offset, afp v);
void buffer_st2(sfpvec2 dst, int offset, afpvec2 v);
void buffer_st4(sfpvec4 dst, int offset, afpvec4 v);
void buffer_st8(sfpvec8 dst, int offset, afpvec8 v);
```

- copy typed value from src[src_offset] to dst[dst_offset]

```c
void buffer_cp1(sfp dst, int dst_offset, sfp src, int src_offset);
void buffer_cp2(sfpvec2 dst, int dst_offset, sfpvec2 src, int src_offset);
void buffer_cp4(sfpvec4 dst, int dst_offset, sfpvec4 src, int src_offset);
void buffer_cp8(sfpvec8 dst, int dst_offset, sfpvec8 src, int src_offset);
```

- copy and pack value from src[src_offsets[0],src_offsets[1],...] to dst[dst_offset]

```c
void buffer_cp1to4(sfpvec4 dst, int dst_offset, sfp src, ivec4 src_offsets);
void buffer_cp1to8(sfpvec8 dst, int dst_offset, sfp src, ivec4 src_offsets_0, ivec4 src_offsets_1);
void buffer_cp4to8(sfpvec8 dst, int dst_offset, sfpvec4 src, ivec2 src_offsets);
```

- copy and unpack value from src[src_offset] to dst[dst_offsets[0],dst_offsets[1],...]

```c
void buffer_cp4to1(sfp dst, ivec4 dst_offsets, sfpvec4 src, int src_offset);
void buffer_cp8to1(sfp dst, ivec4 dst_offsets_0, ivec4 dst_offsets_1, sfpvec8 src, int src_offset);
void buffer_cp8to4(sfpvec4 dst, ivec2 dst_offsets, sfpvec8 src, int src_offset);
```

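The pack and unpack copies are easiest to understand as gather and scatter operations. The following CPU reference model illustrates the indexing described above for the 1-to-4 and 4-to-1 cases, using plain fp32. The names `Vec4`, `ref_buffer_cp1to4`, and `ref_buffer_cp4to1` are hypothetical, and the model makes no claim about how ncnn implements the GLSL functions.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Vec4 = std::array<float, 4>;

// Gather: dst[dst_offset] = (src[o[0]], src[o[1]], src[o[2]], src[o[3]])
static void ref_buffer_cp1to4(std::vector<Vec4>& dst, int dst_offset,
                              const std::vector<float>& src, const std::array<int, 4>& src_offsets)
{
    assert(dst_offset >= 0 && static_cast<size_t>(dst_offset) < dst.size());
    for (int k = 0; k < 4; k++)
    {
        assert(src_offsets[k] >= 0 && static_cast<size_t>(src_offsets[k]) < src.size());
        dst[dst_offset][k] = src[src_offsets[k]];
    }
}

// Scatter: dst[o[k]] = src[src_offset][k] for k = 0..3
static void ref_buffer_cp4to1(std::vector<float>& dst, const std::array<int, 4>& dst_offsets,
                              const std::vector<Vec4>& src, int src_offset)
{
    assert(src_offset >= 0 && static_cast<size_t>(src_offset) < src.size());
    for (int k = 0; k < 4; k++)
    {
        assert(dst_offsets[k] >= 0 && static_cast<size_t>(dst_offsets[k]) < dst.size());
        dst[dst_offsets[k]] = src[src_offset][k];
    }
}
```

For example, gathering offsets 0, 2, 4, 6 from a scalar buffer packs every second element into a single four-component element, and the scatter function undoes that with the same offsets.
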
# local data conversion functions

- storage buffer to local memory

```c
lfp sfp2lfp(sfp v);
lfpvec4 sfp2lfpvec4(sfpvec4 v);
```

- local memory to local variable

```c
afp lfp2afp(lfp v);
afpvec4 lfp2afpvec4(lfpvec4 v);
```

Note: Local memory is typically used by first reading data from global memory and storing it in local memory, then reading local variables from local memory for subsequent use. Therefore, only storage-type-to-local-type and local-type-to-arithmetic-type conversion functions are provided here.

# misc functions

- prefer specialization constant over push constant

```c
T psc(T x);
```

Declare the same variable in both the specialization constant section and the push constant section. `psc(x)` then becomes a compile-time constant when the specialization constant is given a non-zero value, and otherwise takes the dynamic value from the push constant. This is often used for tensor shape specialization: we can usually resolve all shape information and turn it into compile-time constants, which allows more aggressive shader optimization.

```c
layout (constant_id = 0) const int size = 0;

layout (push_constant) uniform parameter
{
    int size;
} p;

void main()
{
    const int s = psc(size);
}
```

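The selection rule can be modelled on the CPU in a few lines. This is only a sketch of the behavior described above, with a hypothetical function name `psc_model`. It is not ncnn's actual definition of `psc`, which is not shown here.

```cpp
#include <cassert>

// Model of the rule: a non-zero specialization constant wins,
// otherwise the value supplied at dispatch time via push constant is used.
constexpr int psc_model(int specialization_value, int push_constant_value)
{
    return specialization_value != 0 ? specialization_value : push_constant_value;
}

// Shape known when the pipeline is created: the result is a compile-time constant.
static_assert(psc_model(64, 0) == 64, "specialized shape is used");

// Shape left at 0: the value falls back to the dynamic one.
static_assert(psc_model(0, 17) == 17, "push constant is used");
```

One consequence of this rule is that 0 acts as the "not specialized" marker, so a shape value of 0 cannot itself be specialized.
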
# platform macros

Check whether the current platform is MoltenVK in order to enable platform-specific workarounds.

```c
#if NCNN_moltenvk
// enable workaround for moltenvk
#endif
```

Newer versions of ncnn add further macro definitions, which may conflict with or confuse existing GLSL code. To stay compatible across ncnn versions, you can switch between old and new code paths based on the value of the `ncnn_glsl_version` macro.

```c
#if ncnn_glsl_version >= 1
// use device macros introduced since version 1
#endif
```

ncnn additionally defines most Vulkan device-related features as macros, which can be used to distinguish between platforms, device extensions, features, and properties.

### extension macros

When the device supports an extension, `ncnn_<extension_name>` is defined as the extension version.

```c
void main()
{
#if ncnn_VK_KHR_16bit_storage
    // here is the code for any device that supports VK_KHR_16bit_storage
#endif

#if ncnn_VK_KHR_sampler_ycbcr_conversion >= 10
    // here is the code for any device that supports VK_KHR_sampler_ycbcr_conversion and version >= 10
#endif
}
```

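To illustrate the naming rule only, the sketch below turns a list of extension names and versions into the macro definitions a shader would see. `make_extension_defines` is a hypothetical helper, the versions in the usage note are sample values, and how ncnn actually injects these macros is not described here.

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: one "#define ncnn_<extension_name> <version>" line per extension.
static std::string make_extension_defines(const std::vector<std::pair<std::string, uint32_t> >& extensions)
{
    std::string out;
    for (const auto& ext : extensions)
    {
        out += "#define ncnn_" + ext.first + " " + std::to_string(ext.second) + "\n";
    }
    return out;
}
```

Given the sample pair (`VK_KHR_sampler_ycbcr_conversion`, 14), this yields `#define ncnn_VK_KHR_sampler_ycbcr_conversion 14`, so a check such as `>= 10` in the shader above would pass.
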
### device feature and property macros

ncnn queries the device features and properties and defines them as macros.

The macro name is `ncnn_<feature_name>` or `ncnn_<property_name>`.

When the device supports `shaderInt64`, the `GL_EXT_shader_explicit_arithmetic_types_int64` extension is enabled automatically, without having to be requested in the shader code.

```c
void main()
{
#if ncnn_robustBufferAccess
    // here is the code for any device that supports robustBufferAccess feature
#endif

#if ncnn_vendorID == 4318
    // here is the vendor specific code, 4318 is nvidia graphics
#endif

#if ncnn_subgroupSize == 32
    // here is the code path optimized for subgroup_size == 32
#endif

    // use macro definitions
    uint size; // dynamic value from some previous routines
    if (size < ncnn_subgroupSize)
    {
#if ncnn_supportedOperations & 4
        // subgroup support arithmetic
#endif

#if ncnn_subgroup_arithmetic
        // shorthand style for checking subgroup arithmetic :P
#endif
    }
}
```

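The literal values in the shader above come straight from Vulkan: `4` is the arithmetic bit of `VkSubgroupFeatureFlagBits`, and `4318` is the PCI vendor ID `0x10DE`. The sketch below spells them out on the host side. The enum is redeclared locally (with hypothetical names) so the example builds without `vulkan.h`, and `has_subgroup_arithmetic` is a hypothetical helper.

```cpp
#include <cstdint>

// Values of VkSubgroupFeatureFlagBits from the Vulkan specification,
// redeclared here so the sketch does not depend on vulkan.h.
enum SubgroupFeatureBits : uint32_t
{
    SUBGROUP_FEATURE_BASIC = 0x1,
    SUBGROUP_FEATURE_VOTE = 0x2,
    SUBGROUP_FEATURE_ARITHMETIC = 0x4,
    SUBGROUP_FEATURE_BALLOT = 0x8,
    SUBGROUP_FEATURE_SHUFFLE = 0x10,
    SUBGROUP_FEATURE_SHUFFLE_RELATIVE = 0x20,
    SUBGROUP_FEATURE_CLUSTERED = 0x40,
    SUBGROUP_FEATURE_QUAD = 0x80
};

// Same test as "#if ncnn_supportedOperations & 4" in the shader.
constexpr bool has_subgroup_arithmetic(uint32_t supported_operations)
{
    return (supported_operations & SUBGROUP_FEATURE_ARITHMETIC) != 0;
}

// 4318 in decimal is the PCI vendor ID 0x10DE.
constexpr uint32_t VENDOR_ID_NVIDIA = 0x10DE;
static_assert(VENDOR_ID_NVIDIA == 4318, "decimal form used in the shader");
```

Writing such checks against named bits on the host and keeping the shader-side literals in sync with them can make vendor-specific paths easier to audit.
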
### validation layer macros

ncnn defines some additional convenience macros when the Vulkan validation layer is enabled.

* `ncnn_enable_validation_layer`
* `NCNN_LOGE`

Currently, you have to change the `ENABLE_VALIDATION_LAYER` definition at the beginning of `src/gpu.cpp` to `1` to enable these macros.

The `GL_EXT_debug_printf` extension is enabled automatically, without having to be requested in your code.

```c
void main()
{
    int gx = int(gl_GlobalInvocationID.x);

#if ncnn_enable_validation_layer
    NCNN_LOGE("gx = %d\n", gx);
#endif
}
```

At runtime, `NCNN_LOGE` prints the value of `gx`.

### option macros

Some GLSL extensions are enabled only when the user turns on the corresponding options.

When the device supports 16-bit storage and the user turns on `opt.use_fp16_storage`, the `GL_EXT_shader_16bit_storage` extension is enabled automatically, without having to be requested in the shader code.

When the device supports 16-bit arithmetic and the user turns on `opt.use_fp16_arithmetic`, the `GL_EXT_shader_explicit_arithmetic_types_float16` extension is enabled automatically, without having to be requested in the shader code.

```c
void main()
{
#if NCNN_fp16_storage
    // the user enabled the fp16 storage option and the device has fp16 storage support
#endif

#if NCNN_fp16_arithmetic
    // the user enabled the fp16 arithmetic option and the device has fp16 arithmetic support
#endif
}
```

|macro|defined by option|
|---|---|
|NCNN_fp16_packed|opt.use_fp16_packed|
|NCNN_fp16_storage|opt.use_fp16_storage|
|NCNN_fp16_arithmetic|opt.use_fp16_arithmetic|
|NCNN_int8_packed|opt.use_int8_packed|
|NCNN_int8_storage|opt.use_int8_storage|
|NCNN_int8_arithmetic|opt.use_int8_arithmetic|
|NCNN_shader_local_memory|opt.use_shader_local_memory|