 [WIP] vulkan compute (#618)
* vulkan infrastructure
* vkallocator and vkmat
* layer interface for vulkan compute
* wip...
* default vulkan device, command wrapper, upload model weight in load_model to simplify layer interface
* simplify command api, vkmat holds staging buffer, relu works
* initialize specialization constant, simplify command dispatch, fix staging buffer copy with different shape, convolution works
* init extension functions
* dynamic local size and group count
* group count=1 is invalid
* regard device max workgroup size limit
* fix relu oooops
* decouple command record and staging allocation
* create result blob
* add pooling shader
* buffer is faster than image :)
* fix pooling shader
* add innerproduct shader
* readonly writeonly decoration
* simplify buffer creation
* decouple command and layer, VK_KHR_descriptor_update_template extension makes descriptor binding update easy :D
* fix vulkan building issues in visual studio (#1)
* fix building issues on visual studio
* ignore benchmark
* cancel changes
* ... ...
* decouple paramdict and vulkandevice
* fix staging buffer destroy in model loading
* remove vkdev member in option
* add padding shader
* simplify vulkan layer creation, simplify convolution and pooling shader for no padding, less debug output
* add convolutiondepthwise and softmax shader
* specialization float type, add leakyrelu
* add dropout shader
* add batchnorm shader
* split vulkan forward
* add scale shader
* push constant type can be int or float
* set_optimal_local_size_xyz
* add eltwise shader
* concat vulkan forward
* fix convolution without bias
* add dummy shader for concat and split, more fix ...
* optional VK_KHR_descriptor_update_template and VK_KHR_push_descriptor
* check VK_KHR_push_descriptor for vkCmdPushDescriptorSetWithTemplateKHR
* binaryop and unaryop shader
* hide raw command buffer
* simple vkbenchncnn benchmark
* create device with transfer queue
* rename command to vkcompute, add vktransfer and layer upload_model interface
* external VkMat, copy and map wrt buffer offset
* command copy respect offset and size
* decouple weight upload and load, simplify upload weight api, use one big staging buffer for uploading weights
* fix build on android
* binding count can not vary :(
* barrier check state, fix sub-op destruction
* declare local_size_xyz constant, fix crash on radv
* fix local_size_xyz, second try
* more barrier and state fix
* fix softmax
* reconstruct buffer memory allocator, reuse blob buffer, less verbose output
* find unified memory type index
* weight staging buffer allocator and weight buffer allocator, respect descriptor buffer offset alignment
* use VK_KHR_descriptor_update_template for faster descriptor update if available, multithread pipeline creation
* find more useful vulkan extensions and enable them
* fix msvc build
* respect VK_KHR_dedicated_allocation for weight buffer allocation
* fix android build
* fix bias name conflicts with metal
* decouple pipeline and layer, building shader sources into shader module, dedicated create_pipeline api, simplify pipeline recording
* drop dummy shader, inplace softmax, multiple shader module works
* fix unique queue family index error
* flatten support vulkan
* mnasnet run
* find shader module by name, each entry point per shader module, fix attribute/id conflict on moltenvk
* some minor changes
* add some high level api
* use dedicated transfer queue to upload weight model
* prefer mappable buffer on unified memory
* global pooling and convolution fc, reuse staging buffer
* implement ring-buffer style blob allocator, add VkBufferMemory capacity
* use blob allocator for workspace blob, it works fine :)
* vulkan option off
* Update layer.cpp
* fix build with vulkan off
* less verbose output, fix crash on vulkan_compute off
* merge benchncnn tool
* allocator clear api, use new weight buffer allocator per net
* add default locked allocator
* mapped mat ptr api, persistent mapped memory works generally :)
* travis ci linux vulkan
* travis ci vulkan wip ...
* more gpu wip ...
* more gpu wip ...
* wip...
* wip...
* wip... ...
* wip... ios vulkan build...
* find glslangValidator on ios build
* use dynamic moltenvk library
* travis ci wip ...
* ios simulator does not support metal at all
* fix cpu only extractor
* optimize workgroup size, first try
* optimize workgroup size, second try
* conv1x1s1d1 vec4
* revert build system
* fix ncnn2mem build
* fix ncnn2mem build
7 years ago  [WIP] vulkan compute (#618)
* vulkan infrastructure
* vkallocator and vkmat
* layer interface for vulkan compute
* wip...
* default vulkan device, command wrapper, upload model weight in load_model to simplify layer interface
* simplify command api, vkmat holds staging buffer, relu works
* initialize specialization constant, simplify command dispatch, fix staging buffer copy with different shape, convolution works
* init extension functions
* dynamic local size and group count
* group count=1 is invalid
* regard device max workgroup size limit
* fix relu oooops
* decouple command record and staging allocation
* create result blob
* add pooling shader
* buffer is faster than image :)
* fix pooling shader
* add innerproduct shader
* readonly writeonly decoration
* simplify buffer creation
* decouple command and layer, VK_KHR_descriptor_update_template extension makes descriptor binding update easy :D
* fix vulkan building issues in visual studio (#1)
* fix building issues on visual studio
* ignore benchmark
* cancel changes
* ... ...
* decouple paramdict and vulkandevice
* fix staging buffer destroy in model loading
* remove vkdev member in option
* add padding shader
* simplify vulkan layer creation, simplify convolution and pooling shader for no padding, less debug output
* add convolutiondepthwise and softmax shader
* specialization float type, add leakyrelu
* add dropout shader
* add batchnorm shader
* split vulkan forward
* add scale shader
* push constant type can be int or float
* set_optimal_local_size_xyz
* add eltwise shader
* concat vulkan forward
* fix convolution without bias
* add dummy shader for concat and split, more fix ...
* optional VK_KHR_descriptor_update_template and VK_KHR_push_descriptor
* check VK_KHR_push_descriptor for vkCmdPushDescriptorSetWithTemplateKHR
* binaryop and unaryop shader
* hide raw command buffer
* simple vkbenchncnn benchmark
* create device with transfer queue
* rename command to vkcompute, add vktransfer and layer upload_model interface
* external VkMat, copy and map wrt buffer offset
* command copy respect offset and size
* decouple weight upload and load, simplify upload weight api, use one big staging buffer for uploading weights
* fix build on android
* binding count can not vary :(
* barrier check state, fix sub-op destruction
* declare local_size_xyz constant, fix crash on radv
* fix local_size_xyz, second try
* more barrier and state fix
* fix softmax
* reconstruct buffer memory allocator, reuse blob buffer, less verbose output
* find unified memory type index
* weight staging buffer allocator and weight buffer allocator, respect descriptor buffer offset alignment
* use VK_KHR_descriptor_update_template for faster descriptor update if available, multithread pipeline creation
* find more useful vulkan extensions and enable them
* fix msvc build
* respect VK_KHR_dedicated_allocation for weight buffer allocation
* fix android build
* fix bias name conflicts with metal
* decouple pipeline and layer, building shader sources into shader module, dedicated create_pipeline api, simplify pipeline recording
* drop dummy shader, inplace softmax, multiple shader module works
* fix unique queue family index error
* flatten support vulkan
* mnasnet run
* find shader module by name, each entry point per shader module, fix attribute/id conflict on moltenvk
* some minor changes
* add some high level api
* use dedicated transfer queue to upload weight model
* prefer mappable buffer on unified memory
* global pooling and convolution fc, reuse staging buffer
* implement ring-buffer style blob allocator, add VkBufferMemory capacity
* use blob allocator for workspace blob, it works fine :)
* vulkan option off
* Update layer.cpp
* fix build with vulkan off
* less verbose output, fix crash on vulkan_compute off
* merge benchncnn tool
* allocator clear api, use new weight buffer allocator per net
* add default locked allocator
* mapped mat ptr api, persistent mapped memory works generally :)
* travis ci linux vulkan
* travis ci vulkan wip ...
* more gpu wip ...
* more gpu wip ...
* wip...
* wip...
* wip... ...
* wip... ios vulkan build...
* find glslangValidator on ios build
* use dynamic moltenvk library
* travis ci wip ...
* ios simulator does not support metal at all
* fix cpu only extractor
* optimize workgroup size, first try
* optimize workgroup size, second try
* conv1x1s1d1 vec4
* revert build system
* fix ncnn2mem build
* fix ncnn2mem build
7 years ago  [WIP] vulkan compute (#618)
* vulkan infrastructure
* vkallocator and vkmat
* layer interface for vulkan compute
* wip...
* default vulkan device, command wrapper, upload model weight in load_model to simplify layer interface
* simplify command api, vkmat holds staging buffer, relu works
* initialize specialization constant, simplify command dispatch, fix staging buffer copy with different shape, convolution works
* init extension functions
* dynamic local size and group count
* group count=1 is invalid
* regard device max workgroup size limit
* fix relu oooops
* decouple command record and staging allocation
* create result blob
* add pooling shader
* buffer is faster than image :)
* fix pooling shader
* add innerproduct shader
* readonly writeonly decoration
* simplify buffer creation
* decouple command and layer, VK_KHR_descriptor_update_template extension makes descriptor binding update easy :D
* fix vulkan building issues in visual studio (#1)
* fix building issues on visual studio
* ignore benchmark
* cancel changes
* ... ...
* decouple paramdict and vulkandevice
* fix staging buffer destroy in model loading
* remove vkdev member in option
* add padding shader
* simplify vulkan layer creation, simplify convolution and pooling shader for no padding, less debug output
* add convolutiondepthwise and softmax shader
* specialization float type, add leakyrelu
* add dropout shader
* add batchnorm shader
* split vulkan forward
* add scale shader
* push constant type can be int or float
* set_optimal_local_size_xyz
* add eltwise shader
* concat vulkan forward
* fix convolution without bias
* add dummy shader for concat and split, more fix ...
* optional VK_KHR_descriptor_update_template and VK_KHR_push_descriptor
* check VK_KHR_push_descriptor for vkCmdPushDescriptorSetWithTemplateKHR
* binaryop and unaryop shader
* hide raw command buffer
* simple vkbenchncnn benchmark
* create device with transfer queue
* rename command to vkcompute, add vktransfer and layer upload_model interface
* external VkMat, copy and map wrt buffer offset
* command copy respect offset and size
* decouple weight upload and load, simplify upload weight api, use one big staging buffer for uploading weights
* fix build on android
* binding count can not vary :(
* barrier check state, fix sub-op destruction
* declare local_size_xyz constant, fix crash on radv
* fix local_size_xyz, second try
* more barrier and state fix
* fix softmax
* reconstruct buffer memory allocator, reuse blob buffer, less verbose output
* find unified memory type index
* weight staging buffer allocator and weight buffer allocator, respect descriptor buffer offset alignment
* use VK_KHR_descriptor_update_template for faster descriptor update if available, multithread pipeline creation
* find more useful vulkan extensions and enable them
* fix msvc build
* respect VK_KHR_dedicated_allocation for weight buffer allocation
* fix android build
* fix bias name conflicts with metal
* decouple pipeline and layer, building shader sources into shader module, dedicated create_pipeline api, simplify pipeline recording
* drop dummy shader, inplace softmax, multiple shader module works
* fix unique queue family index error
* flatten support vulkan
* mnasnet run
* find shader module by name, each entry point per shader module, fix attribute/id conflict on moltenvk
* some minor changes
* add some high level api
* use dedicated transfer queue to upload weight model
* prefer mappable buffer on unified memory
* global pooling and convolution fc, reuse staging buffer
* implement ring-buffer style blob allocator, add VkBufferMemory capacity
* use blob allocator for workspace blob, it works fine :)
* vulkan option off
* Update layer.cpp
* fix build with vulkan off
* less verbose output, fix crash on vulkan_compute off
* merge benchncnn tool
* allocator clear api, use new weight buffer allocator per net
* add default locked allocator
* mapped mat ptr api, persistent mapped memory works generally :)
* travis ci linux vulkan
* travis ci vulkan wip ...
* more gpu wip ...
* more gpu wip ...
* wip...
* wip...
* wip... ...
* wip... ios vulkan build...
* find glslangValidator on ios build
* use dynamic moltenvk library
* travis ci wip ...
* ios simulator does not support metal at all
* fix cpu only extractor
* optimize workgroup size, first try
* optimize workgroup size, second try
* conv1x1s1d1 vec4
* revert build system
* fix ncnn2mem build
* fix ncnn2mem build
7 years ago  [WIP] vulkan compute (#618)
* vulkan infrastructure
* vkallocator and vkmat
* layer interface for vulkan compute
* wip...
* default vulkan device, command wrapper, upload model weight in load_model to simplify layer interface
* simplify command api, vkmat holds staging buffer, relu works
* initialize specialization constant, simplify command dispatch, fix staging buffer copy with different shape, convolution works
* init extension functions
* dynamic local size and group count
* group count=1 is invalid
* regard device max workgroup size limit
* fix relu oooops
* decouple command record and staging allocation
* create result blob
* add pooling shader
* buffer is faster than image :)
* fix pooling shader
* add innerproduct shader
* readonly writeonly decoration
* simplify buffer creation
* decouple command and layer, VK_KHR_descriptor_update_template extension makes descriptor binding update easy :D
* fix vulkan building issues in visual studio (#1)
* fix building issues on visual studio
* ignore benchmark
* cancel changes
* ... ...
* decouple paramdict and vulkandevice
* fix staging buffer destroy in model loading
* remove vkdev member in option
* add padding shader
* simplify vulkan layer creation, simplify convolution and pooling shader for no padding, less debug output
* add convolutiondepthwise and softmax shader
* specialization float type, add leakyrelu
* add dropout shader
* add batchnorm shader
* split vulkan forward
* add scale shader
* push constant type can be int or float
* set_optimal_local_size_xyz
* add eltwise shader
* concat vulkan forward
* fix convolution without bias
* add dummy shader for concat and split, more fix ...
* optional VK_KHR_descriptor_update_template and VK_KHR_push_descriptor
* check VK_KHR_push_descriptor for vkCmdPushDescriptorSetWithTemplateKHR
* binaryop and unaryop shader
* hide raw command buffer
* simple vkbenchncnn benchmark
* create device with transfer queue
* rename command to vkcompute, add vktransfer and layer upload_model interface
* external VkMat, copy and map wrt buffer offset
* command copy respect offset and size
* decouple weight upload and load, simplify upload weight api, use one big staging buffer for uploading weights
* fix build on android
* binding count can not vary :(
* barrier check state, fix sub-op destruction
* declare local_size_xyz constant, fix crash on radv
* fix local_size_xyz, second try
* more barrier and state fix
* fix softmax
* reconstruct buffer memory allocator, reuse blob buffer, less verbose output
* find unified memory type index
* weight staging buffer allocator and weight buffer allocator, respect descriptor buffer offset alignment
* use VK_KHR_descriptor_update_template for faster descriptor update if available, multithread pipeline creation
* find more useful vulkan extensions and enable them
* fix msvc build
* respect VK_KHR_dedicated_allocation for weight buffer allocation
* fix android build
* fix bias name conflicts with metal
* decouple pipeline and layer, building shader sources into shader module, dedicated create_pipeline api, simplify pipeline recording
* drop dummy shader, inplace softmax, multiple shader module works
* fix unique queue family index error
* flatten support vulkan
* mnasnet run
* find shader module by name, each entry point per shader module, fix attribute/id conflict on moltenvk
* some minor changes
* add some high level api
* use dedicated transfer queue to upload weight model
* prefer mappable buffer on unified memory
* global pooling and convolution fc, reuse staging buffer
* implement ring-buffer style blob allocator, add VkBufferMemory capacity
* use blob allocator for workspace blob, it works fine :)
* vulkan option off
* Update layer.cpp
* fix build with vulkan off
* less verbose output, fix crash on vulkan_compute off
* merge benchncnn tool
* allocator clear api, use new weight buffer allocator per net
* add default locked allocator
* mapped mat ptr api, persistent mapped memory works generally :)
* travis ci linux vulkan
* travis ci vulkan wip ...
* more gpu wip ...
* more gpu wip ...
* wip...
* wip...
* wip... ...
* wip... ios vulkan build...
* find glslangValidator on ios build
* use dynamic moltenvk library
* travis ci wip ...
* ios simulator does not support metal at all
* fix cpu only extractor
* optimize workgroup size, first try
* optimize workgroup size, second try
* conv1x1s1d1 vec4
* revert build system
* fix ncnn2mem build
* fix ncnn2mem build
7 years ago  [WIP] vulkan compute (#618)
* vulkan infrastructure
* vkallocator and vkmat
* layer interface for vulkan compute
* wip...
* default vulkan device, command wrapper, upload model weight in load_model to simplify layer interface
* simplify command api, vkmat holds staging buffer, relu works
* initialize specialization constant, simplify command dispatch, fix staging buffer copy with different shape, convolution works
* init extension functions
* dynamic local size and group count
* group count=1 is invalid
* regard device max workgroup size limit
* fix relu oooops
* decouple command record and staging allocation
* create result blob
* add pooling shader
* buffer is faster than image :)
* fix pooling shader
* add innerproduct shader
* readonly writeonly decoration
* simplify buffer creation
* decouple command and layer, VK_KHR_descriptor_update_template extension makes descriptor binding update easy :D
* fix vulkan building issues in visual studio (#1)
* fix building issues on visual studio
* ignore benchmark
* cancel changes
* ... ...
* decouple paramdict and vulkandevice
* fix staging buffer destroy in model loading
* remove vkdev member in option
* add padding shader
* simplify vulkan layer creation, simplify convolution and pooling shader for no padding, less debug output
* add convolutiondepthwise and softmax shader
* specialization float type, add leakyrelu
* add dropout shader
* add batchnorm shader
* split vulkan forward
* add scale shader
* push constant type can be int or float
* set_optimal_local_size_xyz
* add eltwise shader
* concat vulkan forward
* fix convolution without bias
* add dummy shader for concat and split, more fix ...
* optional VK_KHR_descriptor_update_template and VK_KHR_push_descriptor
* check VK_KHR_push_descriptor for vkCmdPushDescriptorSetWithTemplateKHR
* binaryop and unaryop shader
* hide raw command buffer
* simple vkbenchncnn benchmark
* create device with transfer queue
* rename command to vkcompute, add vktransfer and layer upload_model interface
* external VkMat, copy and map wrt buffer offset
* command copy respect offset and size
* decouple weight upload and load, simplify upload weight api, use one big staging buffer for uploading weights
* fix build on android
* binding count can not vary :(
* barrier check state, fix sub-op destruction
* declare local_size_xyz constant, fix crash on radv
* fix local_size_xyz, second try
* more barrier and state fix
* fix softmax
* reconstruct buffer memory allocator, reuse blob buffer, less verbose output
* find unified memory type index
* weight staging buffer allocator and weight buffer allocator, respect descriptor buffer offset alignment
* use VK_KHR_descriptor_update_template for faster descriptor update if available, multithread pipeline creation
* find more useful vulkan extensions and enable them
* fix msvc build
* respect VK_KHR_dedicated_allocation for weight buffer allocation
* fix android build
* fix bias name conflicts with metal
* decouple pipeline and layer, building shader sources into shader module, dedicated create_pipeline api, simplify pipeline recording
* drop dummy shader, inplace softmax, multiple shader module works
* fix unique queue family index error
* flatten support vulkan
* mnasnet run
* find shader module by name, each entry point per shader module, fix attribute/id conflict on moltenvk
* some minor changes
* add some high level api
* use dedicated transfer queue to upload weight model
* prefer mappable buffer on unified memory
* global pooling and convolution fc, reuse staging buffer
* implement ring-buffer style blob allocator, add VkBufferMemory capacity
* use blob allocator for workspace blob, it works fine :)
* vulkan option off
* Update layer.cpp
* fix build with vulkan off
* less verbose output, fix crash on vulkan_compute off
* merge benchncnn tool
* allocator clear api, use new weight buffer allocator per net
* add default locked allocator
* mapped mat ptr api, persistent mapped memory works generally :)
* travis ci linux vulkan
* travis ci vulkan wip ...
* more gpu wip ...
* more gpu wip ...
* wip...
* wip...
* wip... ...
* wip... ios vulkan build...
* find glslangValidator on ios build
* use dynamic moltenvk library
* travis ci wip ...
* ios simulator does not support metal at all
* fix cpu only extractor
* optimize workgroup size, first try
* optimize workgroup size, second try
* conv1x1s1d1 vec4
* revert build system
* fix ncnn2mem build
* fix ncnn2mem build
7 years ago  [WIP] vulkan compute (#618)
* vulkan infrastructure
* vkallocator and vkmat
* layer interface for vulkan compute
* wip...
* default vulkan device, command wrapper, upload model weight in load_model to simplify layer interface
* simplify command api, vkmat holds staging buffer, relu works
* initialize specialization constant, simplify command dispatch, fix staging buffer copy with different shape, convolution works
* init extension functions
* dynamic local size and group count
* group count=1 is invalid
* regard device max workgroup size limit
* fix relu oooops
* decouple command record and staging allocation
* create result blob
* add pooling shader
* buffer is faster than image :)
* fix pooling shader
* add innerproduct shader
* readonly writeonly decoration
* simplify buffer creation
* decouple command and layer, VK_KHR_descriptor_update_template extension makes descriptor binding update easy :D
* fix vulkan building issues in visual studio (#1)
* fix building issues on visual studio
* ignore benchmark
* cancel changes
* ... ...
* decouple paramdict and vulkandevice
* fix staging buffer destroy in model loading
* remove vkdev member in option
* add padding shader
* simplify vulkan layer creation, simplify convolution and pooling shader for no padding, less debug output
* add convolutiondepthwise and softmax shader
* specialization float type, add leakyrelu
* add dropout shader
* add batchnorm shader
* split vulkan forward
* add scale shader
* push constant type can be int or float
* set_optimal_local_size_xyz
* add eltwise shader
* concat vulkan forward
* fix convolution without bias
* add dummy shader for concat and split, more fix ...
* optional VK_KHR_descriptor_update_template and VK_KHR_push_descriptor
* check VK_KHR_push_descriptor for vkCmdPushDescriptorSetWithTemplateKHR
* binaryop and unaryop shader
* hide raw command buffer
* simple vkbenchncnn benchmark
* create device with transfer queue
* rename command to vkcompute, add vktransfer and layer upload_model interface
* external VkMat, copy and map wrt buffer offset
* command copy respect offset and size
* decouple weight upload and load, simplify upload weight api, use one big staging buffer for uploading weights
* fix build on android
* binding count can not vary :(
* barrier check state, fix sub-op destruction
* declare local_size_xyz constant, fix crash on radv
* fix local_size_xyz, second try
* more barrier and state fix
* fix softmax
* reconstruct buffer memory allocator, reuse blob buffer, less verbose output
* find unified memory type index
* weight staging buffer allocator and weight buffer allocator, respect descriptor buffer offset alignment
* use VK_KHR_descriptor_update_template for faster descriptor update if available, multithread pipeline creation
* find more useful vulkan extensions and enable them
* fix msvc build
* respect VK_KHR_dedicated_allocation for weight buffer allocation
* fix android build
* fix bias name conflicts with metal
* decouple pipeline and layer, building shader sources into shader module, dedicated create_pipeline api, simplify pipeline recording
* drop dummy shader, inplace softmax, multiple shader module works
* fix unique queue family index error
* flatten support vulkan
* mnasnet run
* find shader module by name, each entry point per shader module, fix attribute/id conflict on moltenvk
* some minor changes
* add some high level api
* use dedicated transfer queue to upload weight model
* prefer mappable buffer on unified memory
* global pooling and convolution fc, reuse staging buffer
* implement ring-buffer style blob allocator, add VkBufferMemory capacity
* use blob allocator for workspace blob, it works fine :)
* vulkan option off
* Update layer.cpp
* fix build with vulkan off
* less verbose output, fix crash on vulkan_compute off
* merge benchncnn tool
* allocator clear api, use new weight buffer allocator per net
* add default locked allocator
* mapped mat ptr api, persistent mapped memory works generally :)
* travis ci linux vulkan
* travis ci vulkan wip ...
* more gpu wip ...
* more gpu wip ...
* wip...
* wip...
* wip... ...
* wip... ios vulkan build...
* find glslangValidator on ios build
* use dynamic moltenvk library
* travis ci wip ...
* ios simulator does not support metal at all
* fix cpu only extractor
* optimize workgroup size, first try
* optimize workgroup size, second try
* conv1x1s1d1 vec4
* revert build system
* fix ncnn2mem build
* fix ncnn2mem build
7 years ago  [WIP] vulkan compute (#618)
* vulkan infrastructure
* vkallocator and vkmat
* layer interface for vulkan compute
* wip...
* default vulkan device, command wrapper, upload model weight in load_model to simplify layer interface
* simplify command api, vkmat holds staging buffer, relu works
* initialize specialization constant, simplify command dispatch, fix staging buffer copy with different shape, convolution works
* init extension functions
* dynamic local size and group count
* group count=1 is invalid
* regard device max workgroup size limit
* fix relu oooops
* decouple command record and staging allocation
* create result blob
* add pooling shader
* buffer is faster than image :)
* fix pooling shader
* add innerproduct shader
* readonly writeonly decoration
* simplify buffer creation
* decouple command and layer, VK_KHR_descriptor_update_template extension makes descriptor binding update easy :D
* fix vulkan building issues in visual studio (#1)
* fix building issues on visual studio
* ignore benchmark
* cancel changes
* ... ...
* decouple paramdict and vulkandevice
* fix staging buffer destroy in model loading
* remove vkdev member in option
* add padding shader
* simplify vulkan layer creation, simplify convolution and pooling shader for no padding, less debug output
* add convolutiondepthwise and softmax shader
* specialization float type, add leakyrelu
* add dropout shader
* add batchnorm shader
* split vulkan forward
* add scale shader
* push constant type can be int or float
* set_optimal_local_size_xyz
* add eltwise shader
* concat vulkan forward
* fix convolution without bias
* add dummy shader for concat and split, more fix ...
* optional VK_KHR_descriptor_update_template and VK_KHR_push_descriptor
* check VK_KHR_push_descriptor for vkCmdPushDescriptorSetWithTemplateKHR
* binaryop and unaryop shader
* hide raw command buffer
* simple vkbenchncnn benchmark
* create device with transfer queue
* rename command to vkcompute, add vktransfer and layer upload_model interface
* external VkMat, copy and map wrt buffer offset
* command copy respect offset and size
* decouple weight upload and load, simplify upload weight api, use one big staging buffer for uploading weights
* fix build on android
* binding count can not vary :(
* barrier check state, fix sub-op destruction
* declare local_size_xyz constant, fix crash on radv
* fix local_size_xyz, second try
* more barrier and state fix
* fix softmax
* reconstruct buffer memory allocator, reuse blob buffer, less verbose output
* find unified memory type index
* weight staging buffer allocator and weight buffer allocator, respect descriptor buffer offset alignment
* use VK_KHR_descriptor_update_template for faster descriptor update if available, multithread pipeline creation
* find more useful vulkan extensions and enable them
* fix msvc build
* respect VK_KHR_dedicated_allocation for weight buffer allocation
* fix android build
* fix bias name conflicts with metal
* decouple pipeline and layer, building shader sources into shader module, dedicated create_pipeline api, simplify pipeline recording
* drop dummy shader, inplace softmax, multiple shader module works
* fix unique queue family index error
* flatten support vulkan
* mnasnet run
* find shader module by name, each entry point per shader module, fix attribute/id conflict on moltenvk
* some minor changes
* add some high level api
* use dedicated transfer queue to upload weight model
* prefer mappable buffer on unified memory
* global pooling and convolution fc, reuse staging buffer
* implement ring-buffer style blob allocator, add VkBufferMemory capacity
* use blob allocator for workspace blob, it works fine :)
* vulkan option off
* Update layer.cpp
* fix build with vulkan off
* less verbose output, fix crash on vulkan_compute off
* merge benchncnn tool
* allocator clear api, use new weight buffer allocator per net
* add default locked allocator
* mapped mat ptr api, persistent mapped memory works generally :)
* travis ci linux vulkan
* travis ci vulkan wip ...
* more gpu wip ...
* more gpu wip ...
* wip...
* wip...
* wip... ...
* wip... ios vulkan build...
* find glslangValidator on ios build
* use dynamic moltenvk library
* travis ci wip ...
* ios simulator does not support metal at all
* fix cpu only extractor
* optimize workgroup size, first try
* optimize workgroup size, second try
* conv1x1s1d1 vec4
* revert build system
* fix ncnn2mem build
* fix ncnn2mem build
- // Tencent is pleased to support the open source community by making ncnn available.
- //
- // Copyright (C) 2018 THL A29 Limited, a Tencent company. All rights reserved.
- //
- // Licensed under the BSD 3-Clause License (the "License"); you may not use this file except
- // in compliance with the License. You may obtain a copy of the License at
- //
- // https://opensource.org/licenses/BSD-3-Clause
- //
- // Unless required by applicable law or agreed to in writing, software distributed
- // under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
- // CONDITIONS OF ANY KIND, either express or implied. See the License for the
- // specific language governing permissions and limitations under the License.
-
- #include "command.h"
-
- #if NCNN_VULKAN
-
- #include <stdio.h>
-
- namespace ncnn {
-
- Command::Command(VulkanDevice* _vkdev, uint32_t _queue_index) : vkdev(_vkdev), queue_index(_queue_index)
- {
- // get queue
- vkGetDeviceQueue(vkdev->vkdevice(), queue_index, 0, &queue);
-
- create_command_pool();
-
- create_command_buffer();
-
- // create fence
- VkFenceCreateInfo fenceCreateInfo;
- fenceCreateInfo.sType = VK_STRUCTURE_TYPE_FENCE_CREATE_INFO;
- fenceCreateInfo.pNext = 0;
- fenceCreateInfo.flags = 0;
-
- VkResult ret = vkCreateFence(vkdev->vkdevice(), &fenceCreateInfo, 0, &fence);
- if (ret != VK_SUCCESS)
- {
- fprintf(stderr, "vkCreateFence failed %d\n", ret);
- }
- }
-
- Command::~Command()
- {
- vkDestroyFence(vkdev->vkdevice(), fence, 0);
-
- vkFreeCommandBuffers(vkdev->vkdevice(), command_pool, 1, &command_buffer);
-
- vkDestroyCommandPool(vkdev->vkdevice(), command_pool, 0);
- }
-
- int Command::create_command_pool()
- {
- VkCommandPoolCreateInfo commandPoolCreateInfo;
- commandPoolCreateInfo.sType = VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO;
- commandPoolCreateInfo.pNext = 0;
- commandPoolCreateInfo.flags = 0;
- commandPoolCreateInfo.queueFamilyIndex = queue_index;
-
- VkResult ret = vkCreateCommandPool(vkdev->vkdevice(), &commandPoolCreateInfo, 0, &command_pool);
- if (ret != VK_SUCCESS)
- {
- fprintf(stderr, "vkCreateCommandPool failed %d\n", ret);
- return -1;
- }
-
- return 0;
- }
-
- int Command::create_command_buffer()
- {
- VkCommandBufferAllocateInfo commandBufferAllocateInfo;
- commandBufferAllocateInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO;
- commandBufferAllocateInfo.pNext = 0;
- commandBufferAllocateInfo.commandPool = command_pool;
- commandBufferAllocateInfo.level = VK_COMMAND_BUFFER_LEVEL_PRIMARY;
- commandBufferAllocateInfo.commandBufferCount = 1;
-
- VkResult ret = vkAllocateCommandBuffers(vkdev->vkdevice(), &commandBufferAllocateInfo, &command_buffer);
- if (ret != VK_SUCCESS)
- {
- fprintf(stderr, "vkAllocateCommandBuffers failed %d\n", ret);
- return -1;
- }
-
- return 0;
- }
-
- int Command::begin_command_buffer()
- {
- // fprintf(stderr, "==================== begin\n");
-
- VkCommandBufferBeginInfo commandBufferBeginInfo;
- commandBufferBeginInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
- commandBufferBeginInfo.pNext = 0;
- commandBufferBeginInfo.flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT;
- commandBufferBeginInfo.pInheritanceInfo = 0;
-
- VkResult ret = vkBeginCommandBuffer(command_buffer, &commandBufferBeginInfo);
- if (ret != VK_SUCCESS)
- {
- fprintf(stderr, "vkBeginCommandBuffer failed %d\n", ret);
- return -1;
- }
-
- return 0;
- }
-
- int Command::end_command_buffer()
- {
- // fprintf(stderr, "==================== end\n");
-
- VkResult ret = vkEndCommandBuffer(command_buffer);
- if (ret != VK_SUCCESS)
- {
- fprintf(stderr, "vkEndCommandBuffer failed %d\n", ret);
- return -1;
- }
-
- return 0;
- }
-
- int Command::queue_submit()
- {
- // fprintf(stderr, "==================== submit\n");
-
- VkSubmitInfo submitInfo;
- submitInfo.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
- submitInfo.pNext = 0;
- submitInfo.waitSemaphoreCount = 0;
- submitInfo.pWaitSemaphores = 0;
- submitInfo.pWaitDstStageMask = 0;
- submitInfo.commandBufferCount = 1;
- submitInfo.pCommandBuffers = &command_buffer;
- submitInfo.signalSemaphoreCount = 0;
- submitInfo.pSignalSemaphores = 0;
-
- VkResult ret = vkQueueSubmit(queue, 1, &submitInfo, fence);
- if (ret != VK_SUCCESS)
- {
- fprintf(stderr, "vkQueueSubmit failed %d\n", ret);
- return -1;
- }
-
- return 0;
- }
-
- int Command::wait_fence()
- {
- // fprintf(stderr, "==================== wait\n");
-
- VkResult ret = vkWaitForFences(vkdev->vkdevice(), 1, &fence, VK_TRUE, UINT64_MAX);
- if (ret != VK_SUCCESS)
- {
- fprintf(stderr, "vkWaitForFences failed %d\n", ret);
- return -1;
- }
-
- return 0;
- }
-
- VkCompute::VkCompute(VulkanDevice* _vkdev) : Command(_vkdev, _vkdev->info.compute_queue_index)
- {
- }
-
- VkCompute::~VkCompute()
- {
- if (!vkdev->info.support_VK_KHR_push_descriptor)
- {
- for (size_t i=0; i<descriptorsets.size(); i++)
- {
- vkFreeDescriptorSets(vkdev->vkdevice(), descriptor_pools[i], 1, &descriptorsets[i]);
- vkDestroyDescriptorPool(vkdev->vkdevice(), descriptor_pools[i], 0);
- }
- }
- }
-
- int VkCompute::begin()
- {
- if (vkdev->info.support_VK_KHR_push_descriptor)
- return begin_command_buffer();
-
- record_type r;
- r.type = 0;
- delayed_records.push_back(r);
-
- return 0;
- }
-
- void VkCompute::record_upload(const VkMat& m)
- {
- if (vkdev->info.support_VK_KHR_push_descriptor)
- return copy_buffer(m.staging_buffer(), 0, m.buffer(), m.buffer_offset(), m.total() * m.elemsize);
-
- record_type r;
- r.type = 1;
- r.copy.src = m.staging_buffer();
- r.copy.src_offset = 0;
- r.copy.dst = m.buffer();
- r.copy.dst_offset = m.buffer_offset();
- r.copy.size = m.total() * m.elemsize;
- delayed_records.push_back(r);
- }
-
- void VkCompute::record_download(const VkMat& m)
- {
- if (vkdev->info.support_VK_KHR_push_descriptor)
- return copy_buffer(m.buffer(), m.buffer_offset(), m.staging_buffer(), 0, m.total() * m.elemsize);
-
- record_type r;
- r.type = 1;
- r.copy.src = m.buffer();
- r.copy.src_offset = m.buffer_offset();
- r.copy.dst = m.staging_buffer();
- r.copy.dst_offset = 0;
- r.copy.size = m.total() * m.elemsize;
- delayed_records.push_back(r);
- }
-
- void VkCompute::record_clone(const VkMat& src, const VkMat& dst)
- {
- if (vkdev->info.support_VK_KHR_push_descriptor)
- return copy_buffer(src.buffer(), src.buffer_offset(), dst.buffer(), dst.buffer_offset(), src.total() * src.elemsize);
-
- record_type r;
- r.type = 1;
- r.copy.src = src.buffer();
- r.copy.src_offset = src.buffer_offset();
- r.copy.dst = dst.buffer();
- r.copy.dst_offset = dst.buffer_offset();
- r.copy.size = src.total() * src.elemsize;
- delayed_records.push_back(r);
- }
-
- void VkCompute::record_copy_region(const VkMat& src, const VkMat& dst, const VkBufferCopy& region)
- {
- std::vector<VkBufferCopy> regions(1);
- regions[0] = region;
-
- record_copy_regions(src, dst, regions);
- }
-
- void VkCompute::record_copy_regions(const VkMat& src, const VkMat& dst, const std::vector<VkBufferCopy>& regions)
- {
- if (vkdev->info.support_VK_KHR_push_descriptor)
- return copy_buffer_regions(src.buffer(), dst.buffer(), regions);
-
- record_type r;
- r.type = 2;
- r.copy_regions.src = src.buffer();
- r.copy_regions.dst = dst.buffer();
- r.regions = regions;
- delayed_records.push_back(r);
- }
-
- void VkCompute::record_pipeline(const Pipeline* pipeline, const std::vector<VkMat>& bindings, const std::vector<vk_constant_type>& constants, const VkMat& m)
- {
- record_bind_pipeline(pipeline->pipeline);
-
- record_update_bindings(pipeline->pipeline_layout, pipeline->descriptorset_layout, pipeline->descriptor_update_template, bindings);
-
- record_push_constants(pipeline->pipeline_layout, constants);
-
- uint32_t group_count_xyz[3];
- group_count_xyz[0] = (m.w + pipeline->local_size_x - 1) / pipeline->local_size_x;
- group_count_xyz[1] = (m.h + pipeline->local_size_y - 1) / pipeline->local_size_y;
- group_count_xyz[2] = (m.c + pipeline->local_size_z - 1) / pipeline->local_size_z;
-
- record_dispatch(group_count_xyz);
- }
-
- void VkCompute::record_bind_pipeline(VkPipeline pipeline)
- {
- if (vkdev->info.support_VK_KHR_push_descriptor)
- return bind_pipeline(pipeline);
-
- record_type r;
- r.type = 3;
- r.bind_pipeline.pipeline = pipeline;
- delayed_records.push_back(r);
- }
-
- void VkCompute::record_update_bindings(VkPipelineLayout pipeline_layout, VkDescriptorSetLayout descriptorset_layout, VkDescriptorUpdateTemplateKHR descriptor_update_template, const std::vector<VkMat>& bindings)
- {
- const int binding_count = bindings.size();
-
- if (binding_count == 0)
- return;
-
- std::vector<VkDescriptorBufferInfo> descriptorBufferInfos(binding_count);
- for (int i=0; i<binding_count; i++)
- {
- descriptorBufferInfos[i].buffer = bindings[i].buffer();
- descriptorBufferInfos[i].offset = bindings[i].buffer_offset();
- descriptorBufferInfos[i].range = bindings[i].total() * bindings[i].elemsize;
- }
-
- if (vkdev->info.support_VK_KHR_push_descriptor)
- return update_bindings(pipeline_layout, descriptor_update_template, descriptorBufferInfos);
-
- // create new descriptor_pool and descriptorset
- VkDescriptorPool descriptor_pool;
- {
- VkDescriptorPoolSize poolSize;
- poolSize.type = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER;
- poolSize.descriptorCount = binding_count;
-
- VkDescriptorPoolCreateInfo descriptorPoolCreateInfo;
- descriptorPoolCreateInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_POOL_CREATE_INFO;
- descriptorPoolCreateInfo.pNext = 0;
- descriptorPoolCreateInfo.flags = VK_DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET_BIT;
- descriptorPoolCreateInfo.maxSets = 1;
- descriptorPoolCreateInfo.poolSizeCount = 1;
- descriptorPoolCreateInfo.pPoolSizes = &poolSize;
-
- VkResult ret = vkCreateDescriptorPool(vkdev->vkdevice(), &descriptorPoolCreateInfo, 0, &descriptor_pool);
- if (ret != VK_SUCCESS)
- {
- fprintf(stderr, "vkCreateDescriptorPool failed %d\n", ret);
- return;
- }
- }
- descriptor_pools.push_back(descriptor_pool);
-
- VkDescriptorSet descriptorset;
- {
- VkDescriptorSetAllocateInfo descriptorSetAllocateInfo;
- descriptorSetAllocateInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_ALLOCATE_INFO;
- descriptorSetAllocateInfo.pNext = 0;
- descriptorSetAllocateInfo.descriptorPool = descriptor_pool;
- descriptorSetAllocateInfo.descriptorSetCount = 1;
- descriptorSetAllocateInfo.pSetLayouts = &descriptorset_layout;
-
- VkResult ret = vkAllocateDescriptorSets(vkdev->vkdevice(), &descriptorSetAllocateInfo, &descriptorset);
- if (ret != VK_SUCCESS)
- {
- fprintf(stderr, "vkAllocateDescriptorSets failed %d\n", ret);
- return;
- }
- }
- descriptorsets.push_back(descriptorset);
-
- // fprintf(stderr, "update descriptorset %p\n", descriptorset);
-
- if (vkdev->info.support_VK_KHR_descriptor_update_template)
- {
- vkdev->vkUpdateDescriptorSetWithTemplateKHR(vkdev->vkdevice(), descriptorset, descriptor_update_template, descriptorBufferInfos.data());
- }
- else
- {
- std::vector<VkWriteDescriptorSet> writeDescriptorSets(binding_count);
- for (int i=0; i<binding_count; i++)
- {
- writeDescriptorSets[i].sType = VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET;
- writeDescriptorSets[i].pNext = 0;
- writeDescriptorSets[i].dstSet = descriptorset;
- writeDescriptorSets[i].dstBinding = i;
- writeDescriptorSets[i].dstArrayElement = 0;
- writeDescriptorSets[i].descriptorCount = 1;
- writeDescriptorSets[i].descriptorType = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER;
- writeDescriptorSets[i].pImageInfo = 0;
- writeDescriptorSets[i].pBufferInfo = &descriptorBufferInfos[i];
- writeDescriptorSets[i].pTexelBufferView = 0;
- }
-
- vkUpdateDescriptorSets(vkdev->vkdevice(), binding_count, writeDescriptorSets.data(), 0, 0);
- }
-
- record_type r;
- r.type = 4;
- r.bind_descriptorset.pipeline_layout = pipeline_layout;
- r.bind_descriptorset.descriptorset = descriptorset;
- delayed_records.push_back(r);
- }
-
- void VkCompute::record_push_constants(VkPipelineLayout pipeline_layout, const std::vector<vk_constant_type>& constants)
- {
- if (vkdev->info.support_VK_KHR_push_descriptor)
- return push_constants(pipeline_layout, constants);
-
- record_type r;
- r.type = 5;
- r.push_constants.pipeline_layout = pipeline_layout;
- r.constants = constants;
- delayed_records.push_back(r);
- }
-
- void VkCompute::record_dispatch(const uint32_t* group_count_xyz)
- {
- if (vkdev->info.support_VK_KHR_push_descriptor)
- return dispatch(group_count_xyz);
-
- record_type r;
- r.type = 6;
- r.dispatch.group_count_xyz[0] = group_count_xyz[0];
- r.dispatch.group_count_xyz[1] = group_count_xyz[1];
- r.dispatch.group_count_xyz[2] = group_count_xyz[2];
- delayed_records.push_back(r);
- }
-
- void VkCompute::record_transfer_compute_barrier(const VkMat& m)
- {
- m.state = 3;
-
- if (vkdev->info.support_VK_KHR_push_descriptor)
- return transfer_compute_barrier(m.buffer(), m.buffer_offset(), m.total() * m.elemsize);
-
- record_type r;
- r.type = 7;
- r.transfer_compute_barrier.buffer = m.buffer();
- r.transfer_compute_barrier.offset = m.buffer_offset();
- r.transfer_compute_barrier.size = m.total() * m.elemsize;
- delayed_records.push_back(r);
- }
-
- void VkCompute::record_compute_transfer_barrier(const VkMat& m)
- {
- m.state = 2;
-
- if (vkdev->info.support_VK_KHR_push_descriptor)
- return compute_transfer_barrier(m.buffer(), m.buffer_offset(), m.total() * m.elemsize);
-
- record_type r;
- r.type = 8;
- r.compute_transfer_barrier.buffer = m.buffer();
- r.compute_transfer_barrier.offset = m.buffer_offset();
- r.compute_transfer_barrier.size = m.total() * m.elemsize;
- delayed_records.push_back(r);
- }
-
- void VkCompute::record_compute_compute_barrier(const VkMat& m)
- {
- m.state = 3;
-
- if (vkdev->info.support_VK_KHR_push_descriptor)
- return compute_compute_barrier(m.buffer(), m.buffer_offset(), m.total() * m.elemsize);
-
- record_type r;
- r.type = 9;
- r.compute_compute_barrier.buffer = m.buffer();
- r.compute_compute_barrier.offset = m.buffer_offset();
- r.compute_compute_barrier.size = m.total() * m.elemsize;
- delayed_records.push_back(r);
- }
-
- void VkCompute::record_transfer_transfer_barrier(const VkMat& m)
- {
- m.state = 2;
-
- if (vkdev->info.support_VK_KHR_push_descriptor)
- return transfer_transfer_barrier(m.buffer(), m.buffer_offset(), m.total() * m.elemsize);
-
- record_type r;
- r.type = 10;
- r.transfer_transfer_barrier.buffer = m.buffer();
- r.transfer_transfer_barrier.offset = m.buffer_offset();
- r.transfer_transfer_barrier.size = m.total() * m.elemsize;
- delayed_records.push_back(r);
- }
-
- void VkCompute::record_prepare_transfer_barrier(const VkMat& m)
- {
- if (m.state == 2)
- return record_transfer_transfer_barrier(m);
-
- if (m.state == 3)
- return record_compute_transfer_barrier(m);
-
- m.state = 2;
- }
-
- void VkCompute::record_prepare_compute_barrier(const VkMat& m)
- {
- if (m.state == 2)
- return record_transfer_compute_barrier(m);
-
- if (m.state == 3)
- return record_compute_compute_barrier(m);
-
- m.state = 3;
- }
-
- int VkCompute::end()
- {
- if (vkdev->info.support_VK_KHR_push_descriptor)
- return end_command_buffer();
-
- record_type r;
- r.type = 11;
- delayed_records.push_back(r);
-
- return 0;
- }
-
- int VkCompute::submit()
- {
- if (vkdev->info.support_VK_KHR_push_descriptor)
- return queue_submit();
-
- // handle delayed records
- for (size_t i=0; i<delayed_records.size(); i++)
- {
- const record_type& r = delayed_records[i];
-
- switch (r.type)
- {
- case 0:
- begin_command_buffer();
- break;
- case 1:
- copy_buffer(r.copy.src, r.copy.src_offset, r.copy.dst, r.copy.dst_offset, r.copy.size);
- break;
- case 2:
- copy_buffer_regions(r.copy_regions.src, r.copy_regions.dst, r.regions);
- break;
- case 3:
- bind_pipeline(r.bind_pipeline.pipeline);
- break;
- case 4:
- bind_descriptorset(r.bind_descriptorset.pipeline_layout, r.bind_descriptorset.descriptorset);
- break;
- case 5:
- push_constants(r.push_constants.pipeline_layout, r.constants);
- break;
- case 6:
- dispatch(r.dispatch.group_count_xyz);
- break;
- case 7:
- transfer_compute_barrier(r.transfer_compute_barrier.buffer, r.transfer_compute_barrier.offset, r.transfer_compute_barrier.size);
- break;
- case 8:
- compute_transfer_barrier(r.compute_transfer_barrier.buffer, r.compute_transfer_barrier.offset, r.compute_transfer_barrier.size);
- break;
- case 9:
- compute_compute_barrier(r.compute_compute_barrier.buffer, r.compute_compute_barrier.offset, r.compute_compute_barrier.size);
- break;
- case 10:
- transfer_transfer_barrier(r.transfer_transfer_barrier.buffer, r.transfer_transfer_barrier.offset, r.transfer_transfer_barrier.size);
- break;
- case 11:
- end_command_buffer();
- break;
- }
- }
-
- return queue_submit();
- }
-
- int VkCompute::wait()
- {
- return wait_fence();
- }
-
- void VkCompute::copy_buffer(VkBuffer src, size_t src_offset, VkBuffer dst, size_t dst_offset, size_t size)
- {
- // fprintf(stderr, "cmd copy %p to %p\n", src, dst);
-
- VkBufferCopy region;
- region.srcOffset = src_offset;
- region.dstOffset = dst_offset;
- region.size = size;
-
- vkCmdCopyBuffer(command_buffer, src, dst, 1, &region);
- }
-
- void VkCompute::copy_buffer_regions(VkBuffer src, VkBuffer dst, const std::vector<VkBufferCopy>& regions)
- {
- // fprintf(stderr, "cmd copy regions %p to %p\n", src, dst);
-
- vkCmdCopyBuffer(command_buffer, src, dst, regions.size(), regions.data());
- }
-
- void VkCompute::bind_pipeline(VkPipeline pipeline)
- {
- // fprintf(stderr, "cmd bind_pipeline %p\n", pipeline);
-
- vkCmdBindPipeline(command_buffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipeline);
- }
-
- void VkCompute::bind_descriptorset(VkPipelineLayout pipeline_layout, VkDescriptorSet descriptorset)
- {
- // fprintf(stderr, "cmd bind_descriptorset %p %p\n", pipeline_layout, descriptorset);
-
- vkCmdBindDescriptorSets(command_buffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipeline_layout, 0, 1, &descriptorset, 0, 0);
- }
-
- void VkCompute::update_bindings(VkPipelineLayout pipeline_layout, VkDescriptorUpdateTemplateKHR descriptor_update_template, const std::vector<VkDescriptorBufferInfo>& descriptorBufferInfos)
- {
- // fprintf(stderr, "cmd update_bindings %p %p\n", pipeline_layout, descriptor_update_template);
-
- vkdev->vkCmdPushDescriptorSetWithTemplateKHR(command_buffer, descriptor_update_template, pipeline_layout, 0, descriptorBufferInfos.data());
- }
-
- void VkCompute::push_constants(VkPipelineLayout pipeline_layout, const std::vector<vk_constant_type>& constants)
- {
- // fprintf(stderr, "cmd push_constants %p\n", pipeline_layout);
-
- vkCmdPushConstants(command_buffer, pipeline_layout, VK_SHADER_STAGE_COMPUTE_BIT, 0, constants.size() * sizeof(vk_constant_type), constants.data());
- }
-
- void VkCompute::dispatch(const uint32_t* group_count_xyz)
- {
- // fprintf(stderr, "cmd dispatch %d %d %d\n", group_count_xyz[0], group_count_xyz[1], group_count_xyz[2]);
-
- vkCmdDispatch(command_buffer, group_count_xyz[0], group_count_xyz[1], group_count_xyz[2]);
- }
-
- void VkCompute::transfer_compute_barrier(VkBuffer buffer, size_t offset, size_t size)
- {
- // fprintf(stderr, "cmd transfer_compute_barrier %p\n", buffer);
-
- VkBufferMemoryBarrier bufferBarrier;
- bufferBarrier.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
- bufferBarrier.pNext = 0;
- bufferBarrier.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
- bufferBarrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_SHADER_WRITE_BIT;
- bufferBarrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
- bufferBarrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
- bufferBarrier.buffer = buffer;
- bufferBarrier.offset = offset;
- bufferBarrier.size = size;
-
- VkPipelineStageFlags srcStageMask = VK_PIPELINE_STAGE_TRANSFER_BIT;
- VkPipelineStageFlags dstStageMask = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT;
-
- vkCmdPipelineBarrier(command_buffer, srcStageMask, dstStageMask, 0, 0, 0, 1, &bufferBarrier, 0, 0);
- }
-
- void VkCompute::compute_transfer_barrier(VkBuffer buffer, size_t offset, size_t size)
- {
- // fprintf(stderr, "cmd compute_transfer_barrier %p\n", buffer);
-
- VkBufferMemoryBarrier bufferBarrier;
- bufferBarrier.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
- bufferBarrier.pNext = 0;
- bufferBarrier.srcAccessMask = VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_SHADER_WRITE_BIT;
- bufferBarrier.dstAccessMask = VK_ACCESS_TRANSFER_READ_BIT;
- bufferBarrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
- bufferBarrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
- bufferBarrier.buffer = buffer;
- bufferBarrier.offset = offset;
- bufferBarrier.size = size;
-
- VkPipelineStageFlags srcStageMask = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT;
- VkPipelineStageFlags dstStageMask = VK_PIPELINE_STAGE_TRANSFER_BIT;
-
- vkCmdPipelineBarrier(command_buffer, srcStageMask, dstStageMask, 0, 0, 0, 1, &bufferBarrier, 0, 0);
- }
-
- void VkCompute::compute_compute_barrier(VkBuffer buffer, size_t offset, size_t size)
- {
- // fprintf(stderr, "cmd compute_compute_barrier %p\n", buffer);
-
- VkBufferMemoryBarrier bufferBarrier;
- bufferBarrier.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
- bufferBarrier.pNext = 0;
- bufferBarrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
- bufferBarrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
- bufferBarrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
- bufferBarrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
- bufferBarrier.buffer = buffer;
- bufferBarrier.offset = offset;
- bufferBarrier.size = size;
-
- VkPipelineStageFlags srcStageMask = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT;
- VkPipelineStageFlags dstStageMask = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT;
-
- vkCmdPipelineBarrier(command_buffer, srcStageMask, dstStageMask, 0, 0, 0, 1, &bufferBarrier, 0, 0);
- }
-
- void VkCompute::transfer_transfer_barrier(VkBuffer buffer, size_t offset, size_t size)
- {
- // fprintf(stderr, "cmd transfer_transfer_barrier %p\n", buffer);
-
- VkBufferMemoryBarrier bufferBarrier;
- bufferBarrier.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
- bufferBarrier.pNext = 0;
- bufferBarrier.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
- bufferBarrier.dstAccessMask = VK_ACCESS_TRANSFER_READ_BIT;
- bufferBarrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
- bufferBarrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
- bufferBarrier.buffer = buffer;
- bufferBarrier.offset = offset;
- bufferBarrier.size = size;
-
- VkPipelineStageFlags srcStageMask = VK_PIPELINE_STAGE_TRANSFER_BIT;
- VkPipelineStageFlags dstStageMask = VK_PIPELINE_STAGE_TRANSFER_BIT;
-
- vkCmdPipelineBarrier(command_buffer, srcStageMask, dstStageMask, 0, 0, 0, 1, &bufferBarrier, 0, 0);
- }
-
- VkTransfer::VkTransfer(VulkanDevice* _vkdev) : Command(_vkdev, _vkdev->info.transfer_queue_index)
- {
- staging_data = 0;
- }
-
- VkTransfer::~VkTransfer()
- {
- }
-
- void VkTransfer::record_upload(const Mat& src, VkMat& dst)
- {
- dst.create_like(src, weight_vkallocator, staging_vkallocator);
-
- if (dst.allocator->mappable)
- {
- dst.upload(src);
- return;
- }
-
- record_type r;
- r.type = 0;
- r.size = src.total() * src.elemsize;
- r.upload.src = src.data;
- r.upload.dst = dst.buffer();
- r.upload.dst_offset = dst.buffer_offset();
- delayed_records.push_back(r);
- }
-
- void VkTransfer::record_download(const VkMat& src, Mat& dst)
- {
- dst.create_like(src);// TODO respect blob allocator
-
- if (src.allocator->mappable)
- {
- src.download(dst);
- return;
- }
-
- record_type r;
- r.type = 1;
- r.size = src.total() * src.elemsize;
- r.download.src = src.buffer();
- r.download.src_offset = src.buffer_offset();
- r.download.dst = dst.data;
- delayed_records.push_back(r);
- }
-
- int VkTransfer::submit()
- {
- if (delayed_records.empty())
- return 0;
-
- int transfer_count = delayed_records.size();
-
- // solve staging buffer size
- size_t staging_buffer_size = 0;
- for (int i=0; i<transfer_count; i++)
- {
- const record_type& r = delayed_records[i];
- staging_buffer_size += r.size;
- }
-
- // TODO separate staging buffers for upload and download?
- // allocate staging buffer
- staging_data = staging_vkallocator->fastMalloc(staging_buffer_size);
-
- // copy upload data
- size_t mapped_ptr_offset = 0;
- for (int i=0; i<transfer_count; i++)
- {
- const record_type& r = delayed_records[i];
- if (r.type == 0)
- {
- memcpy((unsigned char*)staging_data->mapped_ptr + mapped_ptr_offset, r.upload.src, r.size);
- }
-
- mapped_ptr_offset += r.size;
- }
-
- begin_command_buffer();
-
- // fprintf(stderr, "cmd transfer %p %lu\n", staging_data->buffer, staging_buffer_size);
-
- // handle delayed records
- size_t staging_buffer_offset = 0;
- for (int i=0; i<transfer_count; i++)
- {
- const record_type& r = delayed_records[i];
-
- switch (r.type)
- {
- case 0:
- copy_buffer(staging_data->buffer, staging_buffer_offset, r.upload.dst, r.upload.dst_offset, r.size);
- break;
- case 1:
- copy_buffer(r.download.src, r.download.src_offset, staging_data->buffer, staging_buffer_offset, r.size);
- break;
- }
-
- staging_buffer_offset += r.size;
- }
-
- end_command_buffer();
-
- return queue_submit();
- }
-
- int VkTransfer::wait()
- {
- if (delayed_records.empty())
- return 0;
-
- int ret = wait_fence();
-
- int transfer_count = delayed_records.size();
-
- // copy download data
- size_t mapped_ptr_offset = 0;
- for (int i=0; i<transfer_count; i++)
- {
- const record_type& r = delayed_records[i];
- if (r.type == 1)
- {
- memcpy(r.download.dst, (unsigned char*)staging_data->mapped_ptr + mapped_ptr_offset, r.size);
- }
-
- mapped_ptr_offset += r.size;
- }
-
- // deallocate staging buffer
- staging_vkallocator->fastFree(staging_data);
-
- staging_data = 0;
-
- return ret;
- }
-
- void VkTransfer::copy_buffer(VkBuffer src, size_t src_offset, VkBuffer dst, size_t dst_offset, size_t size)
- {
- // fprintf(stderr, "cmd copy %p to %p\n", src, dst);
-
- VkBufferCopy region;
- region.srcOffset = src_offset;
- region.dstOffset = dst_offset;
- region.size = size;
-
- vkCmdCopyBuffer(command_buffer, src, dst, 1, &region);
- }
-
- void VkTransfer::copy_buffer_regions(VkBuffer src, VkBuffer dst, const std::vector<VkBufferCopy>& regions)
- {
- // fprintf(stderr, "cmd copy regions %p to %p\n", src, dst);
-
- vkCmdCopyBuffer(command_buffer, src, dst, regions.size(), regions.data());
- }
-
- } // namespace ncnn
-
- #endif // NCNN_VULKAN
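
Two pieces of arithmetic in the removed file can be isolated from Vulkan. VkCompute::record_pipeline rounds each dispatch dimension up, so a partial workgroup still covers the trailing elements. VkTransfer::submit places every delayed record back to back in one staging buffer by accumulating r.size. The sketch below mirrors that logic under stated assumptions: the helper names group_count and pack_staging_offsets are hypothetical and do not exist in ncnn, and plain integers stand in for VkMat dimensions and record sizes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of ncnn): number of workgroups needed so that
// group_count * local_size >= extent, i.e. integer division rounded up.
inline std::uint32_t group_count(std::uint32_t extent, std::uint32_t local_size)
{
    assert(local_size > 0); // a zero local size would divide by zero
    return (extent + local_size - 1) / local_size;
}

// Hypothetical helper (not part of ncnn): given the byte size of each pending
// transfer, return the start offset of each one inside a single packed
// staging buffer. The final element is the total buffer size.
inline std::vector<std::size_t> pack_staging_offsets(const std::vector<std::size_t>& sizes)
{
    std::vector<std::size_t> offsets;
    offsets.reserve(sizes.size() + 1);

    std::size_t offset = 0;
    for (std::size_t i = 0; i < sizes.size(); i++)
    {
        offsets.push_back(offset); // record i starts where record i-1 ended
        offset += sizes[i];
    }
    offsets.push_back(offset); // total staging buffer size

    return offsets;
}
```

For example, a blob of width 10 with a local size of 4 needs group_count(10, 4) workgroups along x, which is 3, and three transfers of 16, 4 and 32 bytes land at offsets 0, 16 and 20 in a 52-byte staging buffer. Like the removed code, this packing applies no alignment between records.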