# layer feature mask

Each ncnn layer accepts a special parameter pair `31=X` to control specific behavior.

X is an unsigned integer in which each bit acts as a feature mask.

We usually use it to configure fine-grained behavior for certain layers to maintain accuracy, reduce memory usage or optimize performance.

|bit|value|mask|rationale|
|---|---|---|---|
|1<<0|1|no fp16 arithmetic|precision concern|
|1<<1|2|no fp16 storage|precision concern|
|1<<2|4|no bf16 storage|precision concern|
|1<<3|8|no int8|debug dynamic quantized model|
|1<<4|16|no vulkan|reduce overhead for cpu op - gpu split - cpu op|
|1<<5|32|no sgemm|reduce some memory|
|1<<6|64|no winograd|reduce some memory|
|1<<7|128|no threading|force single thread|

These bits can be OR-combined into one value to control multiple behaviors simultaneously.

For example, `31=17` (16 | 1) disables both vulkan and fp16 arithmetic.
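
The OR-combination can be sketched in a few lines of Python. This is only an illustration: the bit values come from the table, but the `FeatureMask` class, its member names and the two helper functions are hypothetical and not part of ncnn.

```python
from enum import IntFlag


class FeatureMask(IntFlag):
    """Bit values from the feature mask table; the Python names are hypothetical."""
    NO_FP16_ARITHMETIC = 1 << 0
    NO_FP16_STORAGE = 1 << 1
    NO_BF16_STORAGE = 1 << 2
    NO_INT8 = 1 << 3
    NO_VULKAN = 1 << 4
    NO_SGEMM = 1 << 5
    NO_WINOGRAD = 1 << 6
    NO_THREADING = 1 << 7


def to_param(mask):
    """Format a mask as the `31=X` pair written in a .param file."""
    return f"31={int(mask)}"


def decode(value):
    """List the names of the bits that are set in a raw `31=X` value."""
    return [flag.name for flag in FeatureMask if value & flag]


combined = FeatureMask.NO_VULKAN | FeatureMask.NO_FP16_ARITHMETIC
print(to_param(combined))  # 31=17
print(decode(3))           # ['NO_FP16_ARITHMETIC', 'NO_FP16_STORAGE']
```

Decoding is handy when reading an existing param file, since a value such as `31=3` does not reveal at a glance which bits it sets.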

## disable fp16 for certain layer to fix overflow

```ruby
7767517
3 3
Input input 0 1 input0 0=22 1=22 2=32
Convolution conv0 1 1 input0 conv0 0=32 1=1 6=1024 9=1
Convolution conv1 1 1 conv0 conv1 0=128 1=3 6=36864 9=1
```

Typically, we use fp16 computation to improve inference speed.

However, since the weight values of `conv1` are very large, fp16 accumulation may cause numerical overflow, so fp16 needs to be disabled individually for `conv1`, while the other layers continue to use fp16 mode.

Add `31=3` to `conv1` to disable both fp16 storage (2) and fp16 arithmetic (1).

```ruby
7767517
3 3
Input input 0 1 input0 0=22 1=22 2=32
Convolution conv0 1 1 input0 conv0 0=32 1=1 6=1024 9=1
Convolution conv1 1 1 conv0 conv1 0=128 1=3 6=36864 9=1 31=3
```
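
To see why large weights are a problem, the sketch below emulates fp16 accumulation in plain Python. It is an illustration under stated assumptions, not how ncnn computes a convolution: the weight and activation values are made up, `to_fp16`, `dot_fp16` and `dot_wide` are hypothetical helpers, and the only thing taken from the example is the number of terms per output of `conv1` (3 x 3 kernel x 32 input channels = 288).

```python
import math
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct rejects finite values beyond the fp16 range (max 65504);
        # emulate the overflow by returning infinity with the same sign.
        return math.copysign(math.inf, x)


def dot_fp16(weights, activations):
    """Dot product with every product and partial sum rounded to fp16."""
    acc = 0.0
    for w, a in zip(weights, activations):
        acc = to_fp16(acc + to_fp16(w * a))
    return acc


def dot_wide(weights, activations):
    """Same dot product accumulated in Python's double precision floats."""
    return sum(w * a for w, a in zip(weights, activations))


terms = 3 * 3 * 32            # terms summed for one output value of conv1
weights = [16.0] * terms      # made-up "very large" weights
activations = [16.0] * terms  # made-up activations

print(dot_fp16(weights, activations))  # inf
print(dot_wide(weights, activations))  # 73728.0
```

The exact sum, 73728, is above the largest finite fp16 value (65504), so the fp16 accumulator saturates to infinity while a wider accumulator does not.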

## disable vulkan for certain layer to improve performance

```ruby
7767517
5 5
Input input 0 1 input0 0=22 1=22 2=32
Convolution conv0 1 1 input0 conv0 0=32 1=1 6=1024 9=1
SomeCPULayer c0 1 1 conv0 c0 0=32
ReLU relu0 1 1 c0 relu0
SomeCPULayer c1 1 1 relu0 c1 0=32
```

Between the CPU layers, there is a simple calculation layer (`relu0`) that supports vulkan. We can set `31=16` to force it to run on the CPU. This avoids the overhead of data upload, download and storage layout conversion between CPU and GPU. After all, the CPU is fast enough for simple operations.

```ruby
7767517
5 5
Input input 0 1 input0 0=22 1=22 2=32
Convolution conv0 1 1 input0 conv0 0=32 1=1 6=1024 9=1
SomeCPULayer c0 1 1 conv0 c0 0=32
ReLU relu0 1 1 c0 relu0 31=16
SomeCPULayer c1 1 1 relu0 c1 0=32
```
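
Candidates for `31=16` can be spotted mechanically. The sketch below is a rough heuristic, not an ncnn tool: it assumes the plain-text param layout used in these examples (type, name, input count, output count, blob names, then `key=value` pairs), and both the helper names and the set of CPU-only layer types are hypothetical and have to be supplied by the caller.

```python
def parse_layers(param_text):
    """Return (type, name, inputs, outputs) for each layer line of a .param text."""
    layers = []
    # Skip the magic number and the "layer_count blob_count" line.
    for line in param_text.strip().splitlines()[2:]:
        tok = line.split()
        if not tok:
            continue
        n_in, n_out = int(tok[2]), int(tok[3])
        inputs = tok[4:4 + n_in]
        outputs = tok[4 + n_in:4 + n_in + n_out]
        layers.append((tok[0], tok[1], inputs, outputs))
    return layers


def sandwiched_layers(param_text, cpu_only_types):
    """Names of non-CPU-only layers whose neighbours are all CPU-only layers."""
    layers = parse_layers(param_text)
    producer = {}   # blob name -> type of the layer that produces it
    consumers = {}  # blob name -> types of the layers that consume it
    for ltype, _, inputs, outputs in layers:
        for blob in outputs:
            producer[blob] = ltype
        for blob in inputs:
            consumers.setdefault(blob, []).append(ltype)

    found = []
    for ltype, name, inputs, outputs in layers:
        if ltype in cpu_only_types or not inputs:
            continue
        neighbours = [producer[blob] for blob in inputs]
        for blob in outputs:
            neighbours += consumers.get(blob, [])
        if all(t in cpu_only_types for t in neighbours):
            found.append(name)
    return found


PARAM = "\n".join([
    "7767517",
    "5 5",
    "Input input 0 1 input0 0=22 1=22 2=32",
    "Convolution conv0 1 1 input0 conv0 0=32 1=1 6=1024 9=1",
    "SomeCPULayer c0 1 1 conv0 c0 0=32",
    "ReLU relu0 1 1 c0 relu0",
    "SomeCPULayer c1 1 1 relu0 c1 0=32",
])

print(sandwiched_layers(PARAM, {"SomeCPULayer"}))  # ['relu0']
```

Whether forcing such a layer onto the CPU actually helps still depends on the tensor size and the device, so the result is best treated as a list of layers worth benchmarking.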

## disable winograd for certain layer to reduce memory usage

```ruby
7767517
3 3
Input input 0 1 input0 0=22 1=22 2=32
Convolution conv0 1 1 input0 conv0 0=32 1=1 6=1024 9=1
Convolution conv1 1 1 conv0 conv1 0=128 1=3 6=36864 9=1
```

Winograd uses more memory in order to speed up convolution, but the speedup is not guaranteed. In memory-constrained situations, or when memory IO is the bottleneck, we can disable winograd on some layers in exchange for a smaller memory footprint. Add `31=64` to the Convolution layer, which forces it to use the implicit-gemm or tiled im2col-gemm implementation, reducing memory usage and sometimes improving vulkan performance.

```ruby
7767517
3 3
Input input 0 1 input0 0=22 1=22 2=32
Convolution conv0 1 1 input0 conv0 0=32 1=1 6=1024 9=1
Convolution conv1 1 1 conv0 conv1 0=128 1=3 6=36864 9=1 31=64
```
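
A back-of-the-envelope estimate shows where part of the extra memory comes from. In a Winograd F(m x m, 3 x 3) convolution, each 3 x 3 kernel is transformed into an (m + 2) x (m + 2) tile, so the transformed weights are larger than the originals. The sketch below only counts transformed weights for `conv1`. The function name is hypothetical, the tile sizes 2, 4 and 6 are the commonly used ones rather than a statement of which variant ncnn selects, and transformed input/output buffers are ignored.

```python
def winograd_weight_count(num_output, num_input, tile_out):
    """Weight count after a Winograd F(tile_out x tile_out, 3 x 3) kernel transform."""
    tile_in = tile_out + 3 - 1  # each 3 x 3 kernel becomes a tile_in x tile_in tile
    return num_output * num_input * tile_in * tile_in


num_output, num_input = 128, 32           # conv1: 36864 = 128 * 32 * 3 * 3
original = num_output * num_input * 3 * 3

for tile_out in (2, 4, 6):
    transformed = winograd_weight_count(num_output, num_input, tile_out)
    print(f"F({tile_out},3): {transformed} weights, {transformed / original:.2f}x original")
```

For `conv1` this gives 65536, 147456 and 262144 transformed weights against 36864 original ones, which is the kind of growth that `31=64` avoids.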

## disable threading for certain layer to improve performance

```ruby
7767517
4 4
Input input 0 1 input0 0=22 1=22 2=3
Convolution conv0 1 1 input0 conv0 0=16 1=3 6=432
HardSigmoid hs 1 1 conv0 hs0
Convolution conv1 1 1 hs0 conv1 0=16 1=3 6=2304
```

The overhead of multi-thread dispatch and merging is too large for small tensors. Add `31=128` to the HardSigmoid layer, which forces it to execute in a single thread, reducing power consumption and improving performance.

```ruby
7767517
4 4
Input input 0 1 input0 0=22 1=22 2=3
Convolution conv0 1 1 input0 conv0 0=16 1=3 6=432
HardSigmoid hs 1 1 conv0 hs0 31=128
Convolution conv1 1 1 hs0 conv1 0=16 1=3 6=2304
```
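
Appending `31=X` to one layer line is easy to script when a model has many layers. The helper below is a minimal sketch, assuming the plain-text param layout shown in these examples; `set_feature_mask` is a hypothetical name, not an ncnn utility, and it normalizes the whitespace of the line it edits.

```python
def set_feature_mask(param_text, layer_name, mask):
    """Return param_text with `31=mask` set on the layer named layer_name."""
    out = []
    found = False
    for index, line in enumerate(param_text.splitlines()):
        tokens = line.split()
        # Layer lines start at the third line; the second token is the layer name.
        if index >= 2 and len(tokens) >= 4 and tokens[1] == layer_name:
            # Drop any existing feature mask before appending the new one.
            tokens = [t for t in tokens if not t.startswith("31=")]
            tokens.append(f"31={mask}")
            line = " ".join(tokens)
            found = True
        out.append(line)
    if not found:
        raise KeyError(layer_name)
    return "\n".join(out)


PARAM = "\n".join([
    "7767517",
    "4 4",
    "Input input 0 1 input0 0=22 1=22 2=3",
    "Convolution conv0 1 1 input0 conv0 0=16 1=3 6=432",
    "HardSigmoid hs 1 1 conv0 hs0",
    "Convolution conv1 1 1 hs0 conv1 0=16 1=3 6=2304",
])

patched = set_feature_mask(PARAM, "hs", 128)
print(patched.splitlines()[4])  # HardSigmoid hs 1 1 conv0 hs0 31=128
```

Replacing an existing `31=` pair instead of appending a second one keeps the line unambiguous when the mask is changed more than once.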