command.cpp 27 kB

[WIP] vulkan compute (#618)

* vulkan infrastructure
* vkallocator and vkmat
* layer interface for vulkan compute
* wip...
* default vulkan device, command wrapper, upload model weight in load_model to simplify layer interface
* simplify command api, vkmat holds staging buffer, relu works
* initialize specialization constant, simplify command dispatch, fix staging buffer copy with different shape, convolution works
* init extension functions
* dynamic local size and group count
* group count=1 is invalid
* regard device max workgroup size limit
* fix relu oooops
* decouple command record and staging allocation
* create result blob
* add pooling shader
* buffer is faster than image :)
* fix pooling shader
* add innerproduct shader
* readonly writeonly decoration
* simplify buffer creation
* decouple command and layer, VK_KHR_descriptor_update_template extension makes descriptor binding update easy :D
* fix vulkan building issues in visual studio (#1)
* fix building issues on visual studio
* ignore benchmark
* cancel changes
* ... ...
* decouple paramdict and vulkandevice
* fix staging buffer destroy in model loading
* remove vkdev member in option
* add padding shader
* simplify vulkan layer creation, simplify convolution and pooling shader for no padding, less debug output
* add convolutiondepthwise and softmax shader
* specialization float type, add leakyrelu
* add dropout shader
* add batchnorm shader
* split vulkan forward
* add scale shader
* push constant type can be int or float
* set_optimal_local_size_xyz
* add eltwise shader
* concat vulkan forward
* fix convolution without bias
* add dummy shader for concat and split, more fix ...
* optional VK_KHR_descriptor_update_template and VK_KHR_push_descriptor
* check VK_KHR_push_descriptor for vkCmdPushDescriptorSetWithTemplateKHR
* binaryop and unaryop shader
* hide raw command buffer
* simple vkbenchncnn benchmark
* create device with transfer queue
* rename command to vkcompute, add vktransfer and layer upload_model interface
* external VkMat, copy and map wrt buffer offset
* command copy respect offset and size
* decouple weight upload and load, simplify upload weight api, use one big staging buffer for uploading weights
* fix build on android
* binding count can not vary :(
* barrier check state, fix sub-op destruction
* declare local_size_xyz constant, fix crash on radv
* fix local_size_xyz, second try
* more barrier and state fix
* fix softmax
* reconstruct buffer memory allocator, reuse blob buffer, less verbose output
* find unified memory type index
* weight staging buffer allocator and weight buffer allocator, respect descriptor buffer offset alignment
* use VK_KHR_descriptor_update_template for faster descriptor update if available, multithread pipeline creation
* find more useful vulkan extensions and enable them
* fix msvc build
* respect VK_KHR_dedicated_allocation for weight buffer allocation
* fix android build
* fix bias name conflicts with metal
* decouple pipeline and layer, building shader sources into shader module, dedicated create_pipeline api, simplify pipeline recording
* drop dummy shader, inplace softmax, multiple shader module works
* fix unique queue family index error
* flatten support vulkan
* mnasnet run
* find shader module by name, each entry point per shader module, fix attribute/id conflict on moltenvk
* some minor changes
* add some high level api
* use dedicated transfer queue to upload weight model
* prefer mappable buffer on unified memory
* global pooling and convolution fc, reuse staging buffer
* implement ring-buffer style blob allocator, add VkBufferMemory capacity
* use blob allocator for workspace blob, it works fine :)
* vulkan option off
* Update layer.cpp
* fix build with vulkan off
* less verbose output, fix crash on vulkan_compute off
* merge benchncnn tool
* allocator clear api, use new weight buffer allocator per net
* add default locked allocator
* mapped mat ptr api, persistent mapped memory works generally :)
* travis ci linux vulkan
* travis ci vulkan wip ...
* more gpu wip ...
* more gpu wip ...
* wip...
* wip...
* wip... ...
* wip... ios vulkan build...
* find glslangValidator on ios build
* use dynamic moltenvk library
* travis ci wip ...
* ios simulator does not support metal at all
* fix cpu only extractor
* optimize workgroup size, first try
* optimize workgroup size, second try
* conv1x1s1d1 vec4
* revert build system
* fix ncnn2mem build
* fix ncnn2mem build
7 years ago
[WIP] vulkan compute (#618) * vulkan infrastructure * vkallocator and vkmat * layer interface for vulkan compute * wip... * default vulkan device, command wrapper, upload model weight in load_model to simplify layer interface * simplify command api, vkmat holds staging buffer, relu works * initialize specialization constant, simplify command dispatch, fix staging buffer copy with different shape, convolution works * init extension functions * dynamic local size and group count * group count=1 is invalid * regard device max workgroup size limit * fix relu oooops * decouple command record and staging allocation * create result blob * add pooling shader * buffer is faster than image :) * fix pooling shader * add innerproduct shader * readonly writeonly decoration * simplify buffer creation * decouple command and layer, VK_KHR_descriptor_update_template extension makes descriptor binding update easy :D * fix vulkan building issues in visual studio (#1) * fix building issues on visual studio * ignore benchmark * cancel changes * ... ... * decouple paramdict and vulkandevice * fix staging buffer destroy in model loading * remove vkdev member in option * add padding shader * simplify vulkan layer creation, simplify convolution and pooling shader for no padding, less debug output * add convolutiondepthwise and softmax shader * specialization float type, add leakyrelu * add dropout shader * add batchnorm shader * split vulkan forward * add scale shader * push constant type can be int or float * set_optimal_local_size_xyz * add eltwise shader * concat vulkan forward * fix convolution without bias * add dummy shader for concat and split, more fix ... 
* optional VK_KHR_descriptor_update_template and VK_KHR_push_descriptor * check VK_KHR_push_descriptor for vkCmdPushDescriptorSetWithTemplateKHR * binaryop and unaryop shader * hide raw command buffer * simple vkbenchncnn benchmark * create device with transfer queue * rename command to vkcompute, add vktransfer and layer upload_model interface * external VkMat, copy and map wrt buffer offset * command copy respect offset and size * decouple weight upload and load, simplify upload weight api, use one big staging buffer for uploading weights * fix build on android * binding count can not vary :( * barrier check state, fix sub-op destruction * declare local_size_xyz constant, fix crash on radv * fix local_size_xyz, second try * more barrier and state fix * fix softmax * reconstruct buffer memory allocator, reuse blob buffer, less verbose output * find unified memory type index * weight staging buffer allocator and weight buffer allocator, respect descriptor buffer offset alignment * use VK_KHR_descriptor_update_template for faster descriptor update if available, multithread pipeline creation * find more useful vulkan extensions and enable them * fix msvc build * respect VK_KHR_dedicated_allocation for weight buffer allocation * fix android build * fix bias name conflicts with metal * decouple pipeline and layer, building shader sources into shader module, dedicated create_pipeline api, simplify pipeline recording * drop dummy shader, inplace softmax, multiple shader module works * fix unique queue family index error * flatten support vulkan * mnasnet run * find shader module by name, each entry point per shader module, fix attribute/id conflict on moltenvk * some minor changes * add some high level api * use dedicated transfer queue to upload weight model * prefer mappable buffer on unified memory * global pooling and convolution fc, reuse staging buffer * implement ring-buffer style blob allocator, add VkBufferMemory capacity * use blob allocator for workspace blob, 
it works fine :) * vulkan option off * Update layer.cpp * fix build with vulkan off * less verbose output, fix crash on vulkan_compute off * merge benchncnn tool * allocator clear api, use new weight buffer allocator per net * add default locked allocator * mapped mat ptr api, persistent mapped memory works generally :) * travis ci linux vulkan * travis ci vulkan wip ... * more gpu wip ... * more gpu wip ... * wip... * wip... * wip... ... * wip... ios vulkan build... * find glslangValidator on ios build * use dynamic moltenvk library * travis ci wip ... * ios simulator does not support metal at all * fix cpu only extractor * optimize workgroup size, first try * optimize workgroup size, second try * conv1x1s1d1 vec4 * revert build system * fix ncnn2mem build * fix ncnn2mem build
7 years ago
[WIP] vulkan compute (#618) * vulkan infrastructure * vkallocator and vkmat * layer interface for vulkan compute * wip... * default vulkan device, command wrapper, upload model weight in load_model to simplify layer interface * simplify command api, vkmat holds staging buffer, relu works * initialize specialization constant, simplify command dispatch, fix staging buffer copy with different shape, convolution works * init extension functions * dynamic local size and group count * group count=1 is invalid * regard device max workgroup size limit * fix relu oooops * decouple command record and staging allocation * create result blob * add pooling shader * buffer is faster than image :) * fix pooling shader * add innerproduct shader * readonly writeonly decoration * simplify buffer creation * decouple command and layer, VK_KHR_descriptor_update_template extension makes descriptor binding update easy :D * fix vulkan building issues in visual studio (#1) * fix building issues on visual studio * ignore benchmark * cancel changes * ... ... * decouple paramdict and vulkandevice * fix staging buffer destroy in model loading * remove vkdev member in option * add padding shader * simplify vulkan layer creation, simplify convolution and pooling shader for no padding, less debug output * add convolutiondepthwise and softmax shader * specialization float type, add leakyrelu * add dropout shader * add batchnorm shader * split vulkan forward * add scale shader * push constant type can be int or float * set_optimal_local_size_xyz * add eltwise shader * concat vulkan forward * fix convolution without bias * add dummy shader for concat and split, more fix ... 
* optional VK_KHR_descriptor_update_template and VK_KHR_push_descriptor * check VK_KHR_push_descriptor for vkCmdPushDescriptorSetWithTemplateKHR * binaryop and unaryop shader * hide raw command buffer * simple vkbenchncnn benchmark * create device with transfer queue * rename command to vkcompute, add vktransfer and layer upload_model interface * external VkMat, copy and map wrt buffer offset * command copy respect offset and size * decouple weight upload and load, simplify upload weight api, use one big staging buffer for uploading weights * fix build on android * binding count can not vary :( * barrier check state, fix sub-op destruction * declare local_size_xyz constant, fix crash on radv * fix local_size_xyz, second try * more barrier and state fix * fix softmax * reconstruct buffer memory allocator, reuse blob buffer, less verbose output * find unified memory type index * weight staging buffer allocator and weight buffer allocator, respect descriptor buffer offset alignment * use VK_KHR_descriptor_update_template for faster descriptor update if available, multithread pipeline creation * find more useful vulkan extensions and enable them * fix msvc build * respect VK_KHR_dedicated_allocation for weight buffer allocation * fix android build * fix bias name conflicts with metal * decouple pipeline and layer, building shader sources into shader module, dedicated create_pipeline api, simplify pipeline recording * drop dummy shader, inplace softmax, multiple shader module works * fix unique queue family index error * flatten support vulkan * mnasnet run * find shader module by name, each entry point per shader module, fix attribute/id conflict on moltenvk * some minor changes * add some high level api * use dedicated transfer queue to upload weight model * prefer mappable buffer on unified memory * global pooling and convolution fc, reuse staging buffer * implement ring-buffer style blob allocator, add VkBufferMemory capacity * use blob allocator for workspace blob, 
it works fine :) * vulkan option off * Update layer.cpp * fix build with vulkan off * less verbose output, fix crash on vulkan_compute off * merge benchncnn tool * allocator clear api, use new weight buffer allocator per net * add default locked allocator * mapped mat ptr api, persistent mapped memory works generally :) * travis ci linux vulkan * travis ci vulkan wip ... * more gpu wip ... * more gpu wip ... * wip... * wip... * wip... ... * wip... ios vulkan build... * find glslangValidator on ios build * use dynamic moltenvk library * travis ci wip ... * ios simulator does not support metal at all * fix cpu only extractor * optimize workgroup size, first try * optimize workgroup size, second try * conv1x1s1d1 vec4 * revert build system * fix ncnn2mem build * fix ncnn2mem build
7 years ago
[WIP] vulkan compute (#618) * vulkan infrastructure * vkallocator and vkmat * layer interface for vulkan compute * wip... * default vulkan device, command wrapper, upload model weight in load_model to simplify layer interface * simplify command api, vkmat holds staging buffer, relu works * initialize specialization constant, simplify command dispatch, fix staging buffer copy with different shape, convolution works * init extension functions * dynamic local size and group count * group count=1 is invalid * regard device max workgroup size limit * fix relu oooops * decouple command record and staging allocation * create result blob * add pooling shader * buffer is faster than image :) * fix pooling shader * add innerproduct shader * readonly writeonly decoration * simplify buffer creation * decouple command and layer, VK_KHR_descriptor_update_template extension makes descriptor binding update easy :D * fix vulkan building issues in visual studio (#1) * fix building issues on visual studio * ignore benchmark * cancel changes * ... ... * decouple paramdict and vulkandevice * fix staging buffer destroy in model loading * remove vkdev member in option * add padding shader * simplify vulkan layer creation, simplify convolution and pooling shader for no padding, less debug output * add convolutiondepthwise and softmax shader * specialization float type, add leakyrelu * add dropout shader * add batchnorm shader * split vulkan forward * add scale shader * push constant type can be int or float * set_optimal_local_size_xyz * add eltwise shader * concat vulkan forward * fix convolution without bias * add dummy shader for concat and split, more fix ... 
* optional VK_KHR_descriptor_update_template and VK_KHR_push_descriptor * check VK_KHR_push_descriptor for vkCmdPushDescriptorSetWithTemplateKHR * binaryop and unaryop shader * hide raw command buffer * simple vkbenchncnn benchmark * create device with transfer queue * rename command to vkcompute, add vktransfer and layer upload_model interface * external VkMat, copy and map wrt buffer offset * command copy respect offset and size * decouple weight upload and load, simplify upload weight api, use one big staging buffer for uploading weights * fix build on android * binding count can not vary :( * barrier check state, fix sub-op destruction * declare local_size_xyz constant, fix crash on radv * fix local_size_xyz, second try * more barrier and state fix * fix softmax * reconstruct buffer memory allocator, reuse blob buffer, less verbose output * find unified memory type index * weight staging buffer allocator and weight buffer allocator, respect descriptor buffer offset alignment * use VK_KHR_descriptor_update_template for faster descriptor update if available, multithread pipeline creation * find more useful vulkan extensions and enable them * fix msvc build * respect VK_KHR_dedicated_allocation for weight buffer allocation * fix android build * fix bias name conflicts with metal * decouple pipeline and layer, building shader sources into shader module, dedicated create_pipeline api, simplify pipeline recording * drop dummy shader, inplace softmax, multiple shader module works * fix unique queue family index error * flatten support vulkan * mnasnet run * find shader module by name, each entry point per shader module, fix attribute/id conflict on moltenvk * some minor changes * add some high level api * use dedicated transfer queue to upload weight model * prefer mappable buffer on unified memory * global pooling and convolution fc, reuse staging buffer * implement ring-buffer style blob allocator, add VkBufferMemory capacity * use blob allocator for workspace blob, 
it works fine :) * vulkan option off * Update layer.cpp * fix build with vulkan off * less verbose output, fix crash on vulkan_compute off * merge benchncnn tool * allocator clear api, use new weight buffer allocator per net * add default locked allocator * mapped mat ptr api, persistent mapped memory works generally :) * travis ci linux vulkan * travis ci vulkan wip ... * more gpu wip ... * more gpu wip ... * wip... * wip... * wip... ... * wip... ios vulkan build... * find glslangValidator on ios build * use dynamic moltenvk library * travis ci wip ... * ios simulator does not support metal at all * fix cpu only extractor * optimize workgroup size, first try * optimize workgroup size, second try * conv1x1s1d1 vec4 * revert build system * fix ncnn2mem build * fix ncnn2mem build
7 years ago
[WIP] vulkan compute (#618) * vulkan infrastructure * vkallocator and vkmat * layer interface for vulkan compute * wip... * default vulkan device, command wrapper, upload model weight in load_model to simplify layer interface * simplify command api, vkmat holds staging buffer, relu works * initialize specialization constant, simplify command dispatch, fix staging buffer copy with different shape, convolution works * init extension functions * dynamic local size and group count * group count=1 is invalid * regard device max workgroup size limit * fix relu oooops * decouple command record and staging allocation * create result blob * add pooling shader * buffer is faster than image :) * fix pooling shader * add innerproduct shader * readonly writeonly decoration * simplify buffer creation * decouple command and layer, VK_KHR_descriptor_update_template extension makes descriptor binding update easy :D * fix vulkan building issues in visual studio (#1) * fix building issues on visual studio * ignore benchmark * cancel changes * ... ... * decouple paramdict and vulkandevice * fix staging buffer destroy in model loading * remove vkdev member in option * add padding shader * simplify vulkan layer creation, simplify convolution and pooling shader for no padding, less debug output * add convolutiondepthwise and softmax shader * specialization float type, add leakyrelu * add dropout shader * add batchnorm shader * split vulkan forward * add scale shader * push constant type can be int or float * set_optimal_local_size_xyz * add eltwise shader * concat vulkan forward * fix convolution without bias * add dummy shader for concat and split, more fix ... 
* optional VK_KHR_descriptor_update_template and VK_KHR_push_descriptor * check VK_KHR_push_descriptor for vkCmdPushDescriptorSetWithTemplateKHR * binaryop and unaryop shader * hide raw command buffer * simple vkbenchncnn benchmark * create device with transfer queue * rename command to vkcompute, add vktransfer and layer upload_model interface * external VkMat, copy and map wrt buffer offset * command copy respect offset and size * decouple weight upload and load, simplify upload weight api, use one big staging buffer for uploading weights * fix build on android * binding count can not vary :( * barrier check state, fix sub-op destruction * declare local_size_xyz constant, fix crash on radv * fix local_size_xyz, second try * more barrier and state fix * fix softmax * reconstruct buffer memory allocator, reuse blob buffer, less verbose output * find unified memory type index * weight staging buffer allocator and weight buffer allocator, respect descriptor buffer offset alignment * use VK_KHR_descriptor_update_template for faster descriptor update if available, multithread pipeline creation * find more useful vulkan extensions and enable them * fix msvc build * respect VK_KHR_dedicated_allocation for weight buffer allocation * fix android build * fix bias name conflicts with metal * decouple pipeline and layer, building shader sources into shader module, dedicated create_pipeline api, simplify pipeline recording * drop dummy shader, inplace softmax, multiple shader module works * fix unique queue family index error * flatten support vulkan * mnasnet run * find shader module by name, each entry point per shader module, fix attribute/id conflict on moltenvk * some minor changes * add some high level api * use dedicated transfer queue to upload weight model * prefer mappable buffer on unified memory * global pooling and convolution fc, reuse staging buffer * implement ring-buffer style blob allocator, add VkBufferMemory capacity * use blob allocator for workspace blob, 
it works fine :) * vulkan option off * Update layer.cpp * fix build with vulkan off * less verbose output, fix crash on vulkan_compute off * merge benchncnn tool * allocator clear api, use new weight buffer allocator per net * add default locked allocator * mapped mat ptr api, persistent mapped memory works generally :) * travis ci linux vulkan * travis ci vulkan wip ... * more gpu wip ... * more gpu wip ... * wip... * wip... * wip... ... * wip... ios vulkan build... * find glslangValidator on ios build * use dynamic moltenvk library * travis ci wip ... * ios simulator does not support metal at all * fix cpu only extractor * optimize workgroup size, first try * optimize workgroup size, second try * conv1x1s1d1 vec4 * revert build system * fix ncnn2mem build * fix ncnn2mem build
7 years ago
[WIP] vulkan compute (#618) * vulkan infrastructure * vkallocator and vkmat * layer interface for vulkan compute * wip... * default vulkan device, command wrapper, upload model weight in load_model to simplify layer interface * simplify command api, vkmat holds staging buffer, relu works * initialize specialization constant, simplify command dispatch, fix staging buffer copy with different shape, convolution works * init extension functions * dynamic local size and group count * group count=1 is invalid * regard device max workgroup size limit * fix relu oooops * decouple command record and staging allocation * create result blob * add pooling shader * buffer is faster than image :) * fix pooling shader * add innerproduct shader * readonly writeonly decoration * simplify buffer creation * decouple command and layer, VK_KHR_descriptor_update_template extension makes descriptor binding update easy :D * fix vulkan building issues in visual studio (#1) * fix building issues on visual studio * ignore benchmark * cancel changes * ... ... * decouple paramdict and vulkandevice * fix staging buffer destroy in model loading * remove vkdev member in option * add padding shader * simplify vulkan layer creation, simplify convolution and pooling shader for no padding, less debug output * add convolutiondepthwise and softmax shader * specialization float type, add leakyrelu * add dropout shader * add batchnorm shader * split vulkan forward * add scale shader * push constant type can be int or float * set_optimal_local_size_xyz * add eltwise shader * concat vulkan forward * fix convolution without bias * add dummy shader for concat and split, more fix ... 
* optional VK_KHR_descriptor_update_template and VK_KHR_push_descriptor * check VK_KHR_push_descriptor for vkCmdPushDescriptorSetWithTemplateKHR * binaryop and unaryop shader * hide raw command buffer * simple vkbenchncnn benchmark * create device with transfer queue * rename command to vkcompute, add vktransfer and layer upload_model interface * external VkMat, copy and map wrt buffer offset * command copy respect offset and size * decouple weight upload and load, simplify upload weight api, use one big staging buffer for uploading weights * fix build on android * binding count can not vary :( * barrier check state, fix sub-op destruction * declare local_size_xyz constant, fix crash on radv * fix local_size_xyz, second try * more barrier and state fix * fix softmax * reconstruct buffer memory allocator, reuse blob buffer, less verbose output * find unified memory type index * weight staging buffer allocator and weight buffer allocator, respect descriptor buffer offset alignment * use VK_KHR_descriptor_update_template for faster descriptor update if available, multithread pipeline creation * find more useful vulkan extensions and enable them * fix msvc build * respect VK_KHR_dedicated_allocation for weight buffer allocation * fix android build * fix bias name conflicts with metal * decouple pipeline and layer, building shader sources into shader module, dedicated create_pipeline api, simplify pipeline recording * drop dummy shader, inplace softmax, multiple shader module works * fix unique queue family index error * flatten support vulkan * mnasnet run * find shader module by name, each entry point per shader module, fix attribute/id conflict on moltenvk * some minor changes * add some high level api * use dedicated transfer queue to upload weight model * prefer mappable buffer on unified memory * global pooling and convolution fc, reuse staging buffer * implement ring-buffer style blob allocator, add VkBufferMemory capacity * use blob allocator for workspace blob, 
it works fine :) * vulkan option off * Update layer.cpp * fix build with vulkan off * less verbose output, fix crash on vulkan_compute off * merge benchncnn tool * allocator clear api, use new weight buffer allocator per net * add default locked allocator * mapped mat ptr api, persistent mapped memory works generally :) * travis ci linux vulkan * travis ci vulkan wip ... * more gpu wip ... * more gpu wip ... * wip... * wip... * wip... ... * wip... ios vulkan build... * find glslangValidator on ios build * use dynamic moltenvk library * travis ci wip ... * ios simulator does not support metal at all * fix cpu only extractor * optimize workgroup size, first try * optimize workgroup size, second try * conv1x1s1d1 vec4 * revert build system * fix ncnn2mem build * fix ncnn2mem build
7 years ago
[WIP] vulkan compute (#618) * vulkan infrastructure * vkallocator and vkmat * layer interface for vulkan compute * wip... * default vulkan device, command wrapper, upload model weight in load_model to simplify layer interface * simplify command api, vkmat holds staging buffer, relu works * initialize specialization constant, simplify command dispatch, fix staging buffer copy with different shape, convolution works * init extension functions * dynamic local size and group count * group count=1 is invalid * regard device max workgroup size limit * fix relu oooops * decouple command record and staging allocation * create result blob * add pooling shader * buffer is faster than image :) * fix pooling shader * add innerproduct shader * readonly writeonly decoration * simplify buffer creation * decouple command and layer, VK_KHR_descriptor_update_template extension makes descriptor binding update easy :D * fix vulkan building issues in visual studio (#1) * fix building issues on visual studio * ignore benchmark * cancel changes * ... ... * decouple paramdict and vulkandevice * fix staging buffer destroy in model loading * remove vkdev member in option * add padding shader * simplify vulkan layer creation, simplify convolution and pooling shader for no padding, less debug output * add convolutiondepthwise and softmax shader * specialization float type, add leakyrelu * add dropout shader * add batchnorm shader * split vulkan forward * add scale shader * push constant type can be int or float * set_optimal_local_size_xyz * add eltwise shader * concat vulkan forward * fix convolution without bias * add dummy shader for concat and split, more fix ... 
* optional VK_KHR_descriptor_update_template and VK_KHR_push_descriptor * check VK_KHR_push_descriptor for vkCmdPushDescriptorSetWithTemplateKHR * binaryop and unaryop shader * hide raw command buffer * simple vkbenchncnn benchmark * create device with transfer queue * rename command to vkcompute, add vktransfer and layer upload_model interface * external VkMat, copy and map wrt buffer offset * command copy respect offset and size * decouple weight upload and load, simplify upload weight api, use one big staging buffer for uploading weights * fix build on android * binding count can not vary :( * barrier check state, fix sub-op destruction * declare local_size_xyz constant, fix crash on radv * fix local_size_xyz, second try * more barrier and state fix * fix softmax * reconstruct buffer memory allocator, reuse blob buffer, less verbose output * find unified memory type index * weight staging buffer allocator and weight buffer allocator, respect descriptor buffer offset alignment * use VK_KHR_descriptor_update_template for faster descriptor update if available, multithread pipeline creation * find more useful vulkan extensions and enable them * fix msvc build * respect VK_KHR_dedicated_allocation for weight buffer allocation * fix android build * fix bias name conflicts with metal * decouple pipeline and layer, building shader sources into shader module, dedicated create_pipeline api, simplify pipeline recording * drop dummy shader, inplace softmax, multiple shader module works * fix unique queue family index error * flatten support vulkan * mnasnet run * find shader module by name, each entry point per shader module, fix attribute/id conflict on moltenvk * some minor changes * add some high level api * use dedicated transfer queue to upload weight model * prefer mappable buffer on unified memory * global pooling and convolution fc, reuse staging buffer * implement ring-buffer style blob allocator, add VkBufferMemory capacity * use blob allocator for workspace blob, 
it works fine :) * vulkan option off * Update layer.cpp * fix build with vulkan off * less verbose output, fix crash on vulkan_compute off * merge benchncnn tool * allocator clear api, use new weight buffer allocator per net * add default locked allocator * mapped mat ptr api, persistent mapped memory works generally :) * travis ci linux vulkan * travis ci vulkan wip ... * more gpu wip ... * more gpu wip ... * wip... * wip... * wip... ... * wip... ios vulkan build... * find glslangValidator on ios build * use dynamic moltenvk library * travis ci wip ... * ios simulator does not support metal at all * fix cpu only extractor * optimize workgroup size, first try * optimize workgroup size, second try * conv1x1s1d1 vec4 * revert build system * fix ncnn2mem build * fix ncnn2mem build
7 years ago
// Tencent is pleased to support the open source community by making ncnn available.
//
// Copyright (C) 2018 THL A29 Limited, a Tencent company. All rights reserved.
//
// Licensed under the BSD 3-Clause License (the "License"); you may not use this file except
// in compliance with the License. You may obtain a copy of the License at
//
// https://opensource.org/licenses/BSD-3-Clause
//
// Unless required by applicable law or agreed to in writing, software distributed
// under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
// CONDITIONS OF ANY KIND, either express or implied. See the License for the
// specific language governing permissions and limitations under the License.
#include "command.h"

#if NCNN_VULKAN

#include <stdio.h>
#include <string.h> // memcpy, used by VkTransfer::submit and VkTransfer::wait

namespace ncnn {
Command::Command(VulkanDevice* _vkdev, uint32_t _queue_index) : vkdev(_vkdev), queue_index(_queue_index)
{
    // get queue
    vkGetDeviceQueue(vkdev->vkdevice(), queue_index, 0, &queue);

    create_command_pool();

    create_command_buffer();

    // create fence
    VkFenceCreateInfo fenceCreateInfo;
    fenceCreateInfo.sType = VK_STRUCTURE_TYPE_FENCE_CREATE_INFO;
    fenceCreateInfo.pNext = 0;
    fenceCreateInfo.flags = 0;

    VkResult ret = vkCreateFence(vkdev->vkdevice(), &fenceCreateInfo, 0, &fence);
    if (ret != VK_SUCCESS)
    {
        fprintf(stderr, "vkCreateFence failed %d\n", ret);
    }
}

Command::~Command()
{
    vkDestroyFence(vkdev->vkdevice(), fence, 0);

    vkFreeCommandBuffers(vkdev->vkdevice(), command_pool, 1, &command_buffer);

    vkDestroyCommandPool(vkdev->vkdevice(), command_pool, 0);
}
int Command::create_command_pool()
{
    VkCommandPoolCreateInfo commandPoolCreateInfo;
    commandPoolCreateInfo.sType = VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO;
    commandPoolCreateInfo.pNext = 0;
    commandPoolCreateInfo.flags = 0;
    commandPoolCreateInfo.queueFamilyIndex = queue_index;

    VkResult ret = vkCreateCommandPool(vkdev->vkdevice(), &commandPoolCreateInfo, 0, &command_pool);
    if (ret != VK_SUCCESS)
    {
        fprintf(stderr, "vkCreateCommandPool failed %d\n", ret);
        return -1;
    }

    return 0;
}

int Command::create_command_buffer()
{
    VkCommandBufferAllocateInfo commandBufferAllocateInfo;
    commandBufferAllocateInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO;
    commandBufferAllocateInfo.pNext = 0;
    commandBufferAllocateInfo.commandPool = command_pool;
    commandBufferAllocateInfo.level = VK_COMMAND_BUFFER_LEVEL_PRIMARY;
    commandBufferAllocateInfo.commandBufferCount = 1;

    VkResult ret = vkAllocateCommandBuffers(vkdev->vkdevice(), &commandBufferAllocateInfo, &command_buffer);
    if (ret != VK_SUCCESS)
    {
        fprintf(stderr, "vkAllocateCommandBuffers failed %d\n", ret);
        return -1;
    }

    return 0;
}
int Command::begin_command_buffer()
{
//     fprintf(stderr, "==================== begin\n");

    VkCommandBufferBeginInfo commandBufferBeginInfo;
    commandBufferBeginInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
    commandBufferBeginInfo.pNext = 0;
    commandBufferBeginInfo.flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT;
    commandBufferBeginInfo.pInheritanceInfo = 0;

    VkResult ret = vkBeginCommandBuffer(command_buffer, &commandBufferBeginInfo);
    if (ret != VK_SUCCESS)
    {
        fprintf(stderr, "vkBeginCommandBuffer failed %d\n", ret);
        return -1;
    }

    return 0;
}

int Command::end_command_buffer()
{
//     fprintf(stderr, "==================== end\n");

    VkResult ret = vkEndCommandBuffer(command_buffer);
    if (ret != VK_SUCCESS)
    {
        fprintf(stderr, "vkEndCommandBuffer failed %d\n", ret);
        return -1;
    }

    return 0;
}
int Command::queue_submit()
{
//     fprintf(stderr, "==================== submit\n");

    VkSubmitInfo submitInfo;
    submitInfo.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
    submitInfo.pNext = 0;
    submitInfo.waitSemaphoreCount = 0;
    submitInfo.pWaitSemaphores = 0;
    submitInfo.pWaitDstStageMask = 0;
    submitInfo.commandBufferCount = 1;
    submitInfo.pCommandBuffers = &command_buffer;
    submitInfo.signalSemaphoreCount = 0;
    submitInfo.pSignalSemaphores = 0;

    VkResult ret = vkQueueSubmit(queue, 1, &submitInfo, fence);
    if (ret != VK_SUCCESS)
    {
        fprintf(stderr, "vkQueueSubmit failed %d\n", ret);
        return -1;
    }

    return 0;
}

int Command::wait_fence()
{
//     fprintf(stderr, "==================== wait\n");

    VkResult ret = vkWaitForFences(vkdev->vkdevice(), 1, &fence, VK_TRUE, UINT64_MAX);
    if (ret != VK_SUCCESS)
    {
        fprintf(stderr, "vkWaitForFences failed %d\n", ret);
        return -1;
    }

    return 0;
}
VkCompute::VkCompute(VulkanDevice* _vkdev) : Command(_vkdev, _vkdev->info.compute_queue_index)
{
}

VkCompute::~VkCompute()
{
    if (!vkdev->info.support_VK_KHR_push_descriptor)
    {
        for (size_t i=0; i<descriptorsets.size(); i++)
        {
            vkFreeDescriptorSets(vkdev->vkdevice(), descriptor_pools[i], 1, &descriptorsets[i]);
            vkDestroyDescriptorPool(vkdev->vkdevice(), descriptor_pools[i], 0);
        }
    }
}

int VkCompute::begin()
{
    if (vkdev->info.support_VK_KHR_push_descriptor)
        return begin_command_buffer();

    record_type r;
    r.type = 0;
    delayed_records.push_back(r);

    return 0;
}
void VkCompute::record_upload(const VkMat& m)
{
    if (vkdev->info.support_VK_KHR_push_descriptor)
        return copy_buffer(m.staging_buffer(), 0, m.buffer(), m.buffer_offset(), m.total() * m.elemsize);

    record_type r;
    r.type = 1;
    r.copy.src = m.staging_buffer();
    r.copy.src_offset = 0;
    r.copy.dst = m.buffer();
    r.copy.dst_offset = m.buffer_offset();
    r.copy.size = m.total() * m.elemsize;
    delayed_records.push_back(r);
}

void VkCompute::record_download(const VkMat& m)
{
    if (vkdev->info.support_VK_KHR_push_descriptor)
        return copy_buffer(m.buffer(), m.buffer_offset(), m.staging_buffer(), 0, m.total() * m.elemsize);

    record_type r;
    r.type = 1;
    r.copy.src = m.buffer();
    r.copy.src_offset = m.buffer_offset();
    r.copy.dst = m.staging_buffer();
    r.copy.dst_offset = 0;
    r.copy.size = m.total() * m.elemsize;
    delayed_records.push_back(r);
}

void VkCompute::record_clone(const VkMat& src, const VkMat& dst)
{
    if (vkdev->info.support_VK_KHR_push_descriptor)
        return copy_buffer(src.buffer(), src.buffer_offset(), dst.buffer(), dst.buffer_offset(), src.total() * src.elemsize);

    record_type r;
    r.type = 1;
    r.copy.src = src.buffer();
    r.copy.src_offset = src.buffer_offset();
    r.copy.dst = dst.buffer();
    r.copy.dst_offset = dst.buffer_offset();
    r.copy.size = src.total() * src.elemsize;
    delayed_records.push_back(r);
}

void VkCompute::record_copy_region(const VkMat& src, const VkMat& dst, const VkBufferCopy& region)
{
    std::vector<VkBufferCopy> regions(1);
    regions[0] = region;

    record_copy_regions(src, dst, regions);
}

void VkCompute::record_copy_regions(const VkMat& src, const VkMat& dst, const std::vector<VkBufferCopy>& regions)
{
    if (vkdev->info.support_VK_KHR_push_descriptor)
        return copy_buffer_regions(src.buffer(), dst.buffer(), regions);

    record_type r;
    r.type = 2;
    r.copy_regions.src = src.buffer();
    r.copy_regions.dst = dst.buffer();
    r.regions = regions;
    delayed_records.push_back(r);
}
void VkCompute::record_pipeline(const Pipeline* pipeline, const std::vector<VkMat>& bindings, const std::vector<vk_constant_type>& constants, const VkMat& m)
{
    record_bind_pipeline(pipeline->pipeline);

    record_update_bindings(pipeline->pipeline_layout, pipeline->descriptorset_layout, pipeline->descriptor_update_template, bindings);

    record_push_constants(pipeline->pipeline_layout, constants);

    // round up so the workgroups cover the whole w x h x c extent of m
    uint32_t group_count_xyz[3];
    group_count_xyz[0] = (m.w + pipeline->local_size_x - 1) / pipeline->local_size_x;
    group_count_xyz[1] = (m.h + pipeline->local_size_y - 1) / pipeline->local_size_y;
    group_count_xyz[2] = (m.c + pipeline->local_size_z - 1) / pipeline->local_size_z;

    record_dispatch(group_count_xyz);
}

void VkCompute::record_bind_pipeline(VkPipeline pipeline)
{
    if (vkdev->info.support_VK_KHR_push_descriptor)
        return bind_pipeline(pipeline);

    record_type r;
    r.type = 3;
    r.bind_pipeline.pipeline = pipeline;
    delayed_records.push_back(r);
}
void VkCompute::record_update_bindings(VkPipelineLayout pipeline_layout, VkDescriptorSetLayout descriptorset_layout, VkDescriptorUpdateTemplateKHR descriptor_update_template, const std::vector<VkMat>& bindings)
{
    const int binding_count = bindings.size();

    if (binding_count == 0)
        return;

    std::vector<VkDescriptorBufferInfo> descriptorBufferInfos(binding_count);
    for (int i=0; i<binding_count; i++)
    {
        descriptorBufferInfos[i].buffer = bindings[i].buffer();
        descriptorBufferInfos[i].offset = bindings[i].buffer_offset();
        descriptorBufferInfos[i].range = bindings[i].total() * bindings[i].elemsize;
    }

    if (vkdev->info.support_VK_KHR_push_descriptor)
        return update_bindings(pipeline_layout, descriptor_update_template, descriptorBufferInfos);

    // create new descriptor_pool and descriptorset
    VkDescriptorPool descriptor_pool;
    {
        VkDescriptorPoolSize poolSize;
        poolSize.type = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER;
        poolSize.descriptorCount = binding_count;

        VkDescriptorPoolCreateInfo descriptorPoolCreateInfo;
        descriptorPoolCreateInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_POOL_CREATE_INFO;
        descriptorPoolCreateInfo.pNext = 0;
        descriptorPoolCreateInfo.flags = VK_DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET_BIT;
        descriptorPoolCreateInfo.maxSets = 1;
        descriptorPoolCreateInfo.poolSizeCount = 1;
        descriptorPoolCreateInfo.pPoolSizes = &poolSize;

        VkResult ret = vkCreateDescriptorPool(vkdev->vkdevice(), &descriptorPoolCreateInfo, 0, &descriptor_pool);
        if (ret != VK_SUCCESS)
        {
            fprintf(stderr, "vkCreateDescriptorPool failed %d\n", ret);
            return;
        }
    }

    VkDescriptorSet descriptorset;
    {
        VkDescriptorSetAllocateInfo descriptorSetAllocateInfo;
        descriptorSetAllocateInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_ALLOCATE_INFO;
        descriptorSetAllocateInfo.pNext = 0;
        descriptorSetAllocateInfo.descriptorPool = descriptor_pool;
        descriptorSetAllocateInfo.descriptorSetCount = 1;
        descriptorSetAllocateInfo.pSetLayouts = &descriptorset_layout;

        VkResult ret = vkAllocateDescriptorSets(vkdev->vkdevice(), &descriptorSetAllocateInfo, &descriptorset);
        if (ret != VK_SUCCESS)
        {
            fprintf(stderr, "vkAllocateDescriptorSets failed %d\n", ret);
            // destroy the unused pool so it neither leaks nor shifts descriptor_pools out of step with descriptorsets
            vkDestroyDescriptorPool(vkdev->vkdevice(), descriptor_pool, 0);
            return;
        }
    }

    // store the pool and its set together, the destructor indexes both vectors with the same i
    descriptor_pools.push_back(descriptor_pool);
    descriptorsets.push_back(descriptorset);

//     fprintf(stderr, "update descriptorset %p\n", descriptorset);

    if (vkdev->info.support_VK_KHR_descriptor_update_template)
    {
        vkdev->vkUpdateDescriptorSetWithTemplateKHR(vkdev->vkdevice(), descriptorset, descriptor_update_template, descriptorBufferInfos.data());
    }
    else
    {
        std::vector<VkWriteDescriptorSet> writeDescriptorSets(binding_count);
        for (int i=0; i<binding_count; i++)
        {
            writeDescriptorSets[i].sType = VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET;
            writeDescriptorSets[i].pNext = 0;
            writeDescriptorSets[i].dstSet = descriptorset;
            writeDescriptorSets[i].dstBinding = i;
            writeDescriptorSets[i].dstArrayElement = 0;
            writeDescriptorSets[i].descriptorCount = 1;
            writeDescriptorSets[i].descriptorType = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER;
            writeDescriptorSets[i].pImageInfo = 0;
            writeDescriptorSets[i].pBufferInfo = &descriptorBufferInfos[i];
            writeDescriptorSets[i].pTexelBufferView = 0;
        }

        vkUpdateDescriptorSets(vkdev->vkdevice(), binding_count, writeDescriptorSets.data(), 0, 0);
    }

    record_type r;
    r.type = 4;
    r.bind_descriptorset.pipeline_layout = pipeline_layout;
    r.bind_descriptorset.descriptorset = descriptorset;
    delayed_records.push_back(r);
}
void VkCompute::record_push_constants(VkPipelineLayout pipeline_layout, const std::vector<vk_constant_type>& constants)
{
    if (vkdev->info.support_VK_KHR_push_descriptor)
        return push_constants(pipeline_layout, constants);

    record_type r;
    r.type = 5;
    r.push_constants.pipeline_layout = pipeline_layout;
    r.constants = constants;
    delayed_records.push_back(r);
}

void VkCompute::record_dispatch(const uint32_t* group_count_xyz)
{
    if (vkdev->info.support_VK_KHR_push_descriptor)
        return dispatch(group_count_xyz);

    record_type r;
    r.type = 6;
    r.dispatch.group_count_xyz[0] = group_count_xyz[0];
    r.dispatch.group_count_xyz[1] = group_count_xyz[1];
    r.dispatch.group_count_xyz[2] = group_count_xyz[2];
    delayed_records.push_back(r);
}
// As the functions below imply, m.state records the stage that last accessed
// the buffer: 2 for transfer, 3 for compute. Each barrier sets it to the
// destination stage.
void VkCompute::record_transfer_compute_barrier(const VkMat& m)
{
    m.state = 3;

    if (vkdev->info.support_VK_KHR_push_descriptor)
        return transfer_compute_barrier(m.buffer(), m.buffer_offset(), m.total() * m.elemsize);

    record_type r;
    r.type = 7;
    r.transfer_compute_barrier.buffer = m.buffer();
    r.transfer_compute_barrier.offset = m.buffer_offset();
    r.transfer_compute_barrier.size = m.total() * m.elemsize;
    delayed_records.push_back(r);
}

void VkCompute::record_compute_transfer_barrier(const VkMat& m)
{
    m.state = 2;

    if (vkdev->info.support_VK_KHR_push_descriptor)
        return compute_transfer_barrier(m.buffer(), m.buffer_offset(), m.total() * m.elemsize);

    record_type r;
    r.type = 8;
    r.compute_transfer_barrier.buffer = m.buffer();
    r.compute_transfer_barrier.offset = m.buffer_offset();
    r.compute_transfer_barrier.size = m.total() * m.elemsize;
    delayed_records.push_back(r);
}

void VkCompute::record_compute_compute_barrier(const VkMat& m)
{
    m.state = 3;

    if (vkdev->info.support_VK_KHR_push_descriptor)
        return compute_compute_barrier(m.buffer(), m.buffer_offset(), m.total() * m.elemsize);

    record_type r;
    r.type = 9;
    r.compute_compute_barrier.buffer = m.buffer();
    r.compute_compute_barrier.offset = m.buffer_offset();
    r.compute_compute_barrier.size = m.total() * m.elemsize;
    delayed_records.push_back(r);
}

void VkCompute::record_transfer_transfer_barrier(const VkMat& m)
{
    m.state = 2;

    if (vkdev->info.support_VK_KHR_push_descriptor)
        return transfer_transfer_barrier(m.buffer(), m.buffer_offset(), m.total() * m.elemsize);

    record_type r;
    r.type = 10;
    r.transfer_transfer_barrier.buffer = m.buffer();
    r.transfer_transfer_barrier.offset = m.buffer_offset();
    r.transfer_transfer_barrier.size = m.total() * m.elemsize;
    delayed_records.push_back(r);
}

void VkCompute::record_prepare_transfer_barrier(const VkMat& m)
{
    if (m.state == 2)
        return record_transfer_transfer_barrier(m);

    if (m.state == 3)
        return record_compute_transfer_barrier(m);

    // no earlier transfer or compute access recorded, so no barrier is needed
    m.state = 2;
}

void VkCompute::record_prepare_compute_barrier(const VkMat& m)
{
    if (m.state == 2)
        return record_transfer_compute_barrier(m);

    if (m.state == 3)
        return record_compute_compute_barrier(m);

    // no earlier transfer or compute access recorded, so no barrier is needed
    m.state = 3;
}
int VkCompute::end()
{
    if (vkdev->info.support_VK_KHR_push_descriptor)
        return end_command_buffer();

    record_type r;
    r.type = 11;
    delayed_records.push_back(r);

    return 0;
}

int VkCompute::submit()
{
    if (vkdev->info.support_VK_KHR_push_descriptor)
        return queue_submit();

    // handle delayed records: replay them into the command buffer in recording order
    for (size_t i=0; i<delayed_records.size(); i++)
    {
        const record_type& r = delayed_records[i];

        switch (r.type)
        {
        case 0:
            begin_command_buffer();
            break;
        case 1:
            copy_buffer(r.copy.src, r.copy.src_offset, r.copy.dst, r.copy.dst_offset, r.copy.size);
            break;
        case 2:
            copy_buffer_regions(r.copy_regions.src, r.copy_regions.dst, r.regions);
            break;
        case 3:
            bind_pipeline(r.bind_pipeline.pipeline);
            break;
        case 4:
            bind_descriptorset(r.bind_descriptorset.pipeline_layout, r.bind_descriptorset.descriptorset);
            break;
        case 5:
            push_constants(r.push_constants.pipeline_layout, r.constants);
            break;
        case 6:
            dispatch(r.dispatch.group_count_xyz);
            break;
        case 7:
            transfer_compute_barrier(r.transfer_compute_barrier.buffer, r.transfer_compute_barrier.offset, r.transfer_compute_barrier.size);
            break;
        case 8:
            compute_transfer_barrier(r.compute_transfer_barrier.buffer, r.compute_transfer_barrier.offset, r.compute_transfer_barrier.size);
            break;
        case 9:
            compute_compute_barrier(r.compute_compute_barrier.buffer, r.compute_compute_barrier.offset, r.compute_compute_barrier.size);
            break;
        case 10:
            // read the member that record_transfer_transfer_barrier actually wrote
            transfer_transfer_barrier(r.transfer_transfer_barrier.buffer, r.transfer_transfer_barrier.offset, r.transfer_transfer_barrier.size);
            break;
        case 11:
            end_command_buffer();
            break;
        }
    }

    return queue_submit();
}

int VkCompute::wait()
{
    return wait_fence();
}
void VkCompute::copy_buffer(VkBuffer src, size_t src_offset, VkBuffer dst, size_t dst_offset, size_t size)
{
//     fprintf(stderr, "cmd copy %p to %p\n", src, dst);

    VkBufferCopy region;
    region.srcOffset = src_offset;
    region.dstOffset = dst_offset;
    region.size = size;

    vkCmdCopyBuffer(command_buffer, src, dst, 1, &region);
}

void VkCompute::copy_buffer_regions(VkBuffer src, VkBuffer dst, const std::vector<VkBufferCopy>& regions)
{
//     fprintf(stderr, "cmd copy regions %p to %p\n", src, dst);

    vkCmdCopyBuffer(command_buffer, src, dst, regions.size(), regions.data());
}

void VkCompute::bind_pipeline(VkPipeline pipeline)
{
//     fprintf(stderr, "cmd bind_pipeline %p\n", pipeline);

    vkCmdBindPipeline(command_buffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipeline);
}

void VkCompute::bind_descriptorset(VkPipelineLayout pipeline_layout, VkDescriptorSet descriptorset)
{
//     fprintf(stderr, "cmd bind_descriptorset %p %p\n", pipeline_layout, descriptorset);

    vkCmdBindDescriptorSets(command_buffer, VK_PIPELINE_BIND_POINT_COMPUTE, pipeline_layout, 0, 1, &descriptorset, 0, 0);
}

void VkCompute::update_bindings(VkPipelineLayout pipeline_layout, VkDescriptorUpdateTemplateKHR descriptor_update_template, const std::vector<VkDescriptorBufferInfo>& descriptorBufferInfos)
{
//     fprintf(stderr, "cmd update_bindings %p %p\n", pipeline_layout, descriptor_update_template);

    vkdev->vkCmdPushDescriptorSetWithTemplateKHR(command_buffer, descriptor_update_template, pipeline_layout, 0, descriptorBufferInfos.data());
}

void VkCompute::push_constants(VkPipelineLayout pipeline_layout, const std::vector<vk_constant_type>& constants)
{
//     fprintf(stderr, "cmd push_constants %p\n", pipeline_layout);

    vkCmdPushConstants(command_buffer, pipeline_layout, VK_SHADER_STAGE_COMPUTE_BIT, 0, constants.size() * sizeof(vk_constant_type), constants.data());
}

void VkCompute::dispatch(const uint32_t* group_count_xyz)
{
//     fprintf(stderr, "cmd dispatch %d %d %d\n", group_count_xyz[0], group_count_xyz[1], group_count_xyz[2]);

    vkCmdDispatch(command_buffer, group_count_xyz[0], group_count_xyz[1], group_count_xyz[2]);
}
void VkCompute::transfer_compute_barrier(VkBuffer buffer, size_t offset, size_t size)
{
//     fprintf(stderr, "cmd transfer_compute_barrier %p\n", buffer);

    VkBufferMemoryBarrier bufferBarrier;
    bufferBarrier.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
    bufferBarrier.pNext = 0;
    bufferBarrier.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
    bufferBarrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_SHADER_WRITE_BIT;
    bufferBarrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    bufferBarrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    bufferBarrier.buffer = buffer;
    bufferBarrier.offset = offset;
    bufferBarrier.size = size;

    VkPipelineStageFlags srcStageMask = VK_PIPELINE_STAGE_TRANSFER_BIT;
    VkPipelineStageFlags dstStageMask = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT;

    vkCmdPipelineBarrier(command_buffer, srcStageMask, dstStageMask, 0, 0, 0, 1, &bufferBarrier, 0, 0);
}

void VkCompute::compute_transfer_barrier(VkBuffer buffer, size_t offset, size_t size)
{
//     fprintf(stderr, "cmd compute_transfer_barrier %p\n", buffer);

    VkBufferMemoryBarrier bufferBarrier;
    bufferBarrier.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
    bufferBarrier.pNext = 0;
    bufferBarrier.srcAccessMask = VK_ACCESS_SHADER_READ_BIT | VK_ACCESS_SHADER_WRITE_BIT;
    bufferBarrier.dstAccessMask = VK_ACCESS_TRANSFER_READ_BIT;
    bufferBarrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    bufferBarrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    bufferBarrier.buffer = buffer;
    bufferBarrier.offset = offset;
    bufferBarrier.size = size;

    VkPipelineStageFlags srcStageMask = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT;
    VkPipelineStageFlags dstStageMask = VK_PIPELINE_STAGE_TRANSFER_BIT;

    vkCmdPipelineBarrier(command_buffer, srcStageMask, dstStageMask, 0, 0, 0, 1, &bufferBarrier, 0, 0);
}

void VkCompute::compute_compute_barrier(VkBuffer buffer, size_t offset, size_t size)
{
//     fprintf(stderr, "cmd compute_compute_barrier %p\n", buffer);

    VkBufferMemoryBarrier bufferBarrier;
    bufferBarrier.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
    bufferBarrier.pNext = 0;
    bufferBarrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
    bufferBarrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
    bufferBarrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    bufferBarrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    bufferBarrier.buffer = buffer;
    bufferBarrier.offset = offset;
    bufferBarrier.size = size;

    VkPipelineStageFlags srcStageMask = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT;
    VkPipelineStageFlags dstStageMask = VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT;

    vkCmdPipelineBarrier(command_buffer, srcStageMask, dstStageMask, 0, 0, 0, 1, &bufferBarrier, 0, 0);
}

void VkCompute::transfer_transfer_barrier(VkBuffer buffer, size_t offset, size_t size)
{
//     fprintf(stderr, "cmd transfer_transfer_barrier %p\n", buffer);

    VkBufferMemoryBarrier bufferBarrier;
    bufferBarrier.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
    bufferBarrier.pNext = 0;
    bufferBarrier.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
    bufferBarrier.dstAccessMask = VK_ACCESS_TRANSFER_READ_BIT;
    bufferBarrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    bufferBarrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    bufferBarrier.buffer = buffer;
    bufferBarrier.offset = offset;
    bufferBarrier.size = size;

    VkPipelineStageFlags srcStageMask = VK_PIPELINE_STAGE_TRANSFER_BIT;
    VkPipelineStageFlags dstStageMask = VK_PIPELINE_STAGE_TRANSFER_BIT;

    vkCmdPipelineBarrier(command_buffer, srcStageMask, dstStageMask, 0, 0, 0, 1, &bufferBarrier, 0, 0);
}
VkTransfer::VkTransfer(VulkanDevice* _vkdev) : Command(_vkdev, _vkdev->info.transfer_queue_index)
{
    staging_data = 0;
}

VkTransfer::~VkTransfer()
{
}

void VkTransfer::record_upload(const Mat& src, VkMat& dst)
{
    dst.create_like(src, weight_vkallocator, staging_vkallocator);

    if (dst.allocator->mappable)
    {
        dst.upload(src);
        return;
    }

    record_type r;
    r.type = 0;
    r.size = src.total() * src.elemsize;
    r.upload.src = src.data;
    r.upload.dst = dst.buffer();
    r.upload.dst_offset = dst.buffer_offset();
    delayed_records.push_back(r);
}

void VkTransfer::record_download(const VkMat& src, Mat& dst)
{
    dst.create_like(src);// TODO respect blob allocator

    if (src.allocator->mappable)
    {
        src.download(dst);
        return;
    }

    record_type r;
    r.type = 1;
    r.size = src.total() * src.elemsize;
    r.download.src = src.buffer();
    r.download.src_offset = src.buffer_offset();
    r.download.dst = dst.data;
    delayed_records.push_back(r);
}
int VkTransfer::submit()
{
    if (delayed_records.empty())
        return 0;

    int transfer_count = delayed_records.size();

    // solve staging buffer size
    size_t staging_buffer_size = 0;
    for (int i=0; i<transfer_count; i++)
    {
        const record_type& r = delayed_records[i];
        staging_buffer_size += r.size;
    }

    // TODO separate staging buffers for upload and download ?

    // allocate staging buffer
    staging_data = staging_vkallocator->fastMalloc(staging_buffer_size);

    // copy upload data
    // every record, upload or download, owns a consecutive r.size slice of the staging buffer
    size_t mapped_ptr_offset = 0;
    for (int i=0; i<transfer_count; i++)
    {
        const record_type& r = delayed_records[i];
        if (r.type == 0)
        {
            memcpy((unsigned char*)staging_data->mapped_ptr + mapped_ptr_offset, r.upload.src, r.size);
        }

        mapped_ptr_offset += r.size;
    }

    begin_command_buffer();

//     fprintf(stderr, "cmd transfer %p %lu\n", staging_data->buffer, staging_buffer_size);

    // handle delayed records
    size_t staging_buffer_offset = 0;
    for (int i=0; i<transfer_count; i++)
    {
        const record_type& r = delayed_records[i];

        switch (r.type)
        {
        case 0:
            copy_buffer(staging_data->buffer, staging_buffer_offset, r.upload.dst, r.upload.dst_offset, r.size);
            break;
        case 1:
            copy_buffer(r.download.src, r.download.src_offset, staging_data->buffer, staging_buffer_offset, r.size);
            break;
        }

        staging_buffer_offset += r.size;
    }

    end_command_buffer();

    return queue_submit();
}
int VkTransfer::wait()
{
    if (delayed_records.empty())
        return 0;

    int ret = wait_fence();

    int transfer_count = delayed_records.size();

    // copy download data
    // offsets follow the same back-to-back layout used in submit()
    size_t mapped_ptr_offset = 0;
    for (int i=0; i<transfer_count; i++)
    {
        const record_type& r = delayed_records[i];
        if (r.type == 1)
        {
            memcpy(r.download.dst, (unsigned char*)staging_data->mapped_ptr + mapped_ptr_offset, r.size);
        }

        mapped_ptr_offset += r.size;
    }

    // deallocate staging buffer
    staging_vkallocator->fastFree(staging_data);
    staging_data = 0;

    return ret;
}

void VkTransfer::copy_buffer(VkBuffer src, size_t src_offset, VkBuffer dst, size_t dst_offset, size_t size)
{
//     fprintf(stderr, "cmd copy %p to %p\n", src, dst);

    VkBufferCopy region;
    region.srcOffset = src_offset;
    region.dstOffset = dst_offset;
    region.size = size;

    vkCmdCopyBuffer(command_buffer, src, dst, 1, &region);
}

void VkTransfer::copy_buffer_regions(VkBuffer src, VkBuffer dst, const std::vector<VkBufferCopy>& regions)
{
//     fprintf(stderr, "cmd copy regions %p to %p\n", src, dst);

    vkCmdCopyBuffer(command_buffer, src, dst, regions.size(), regions.data());
}

} // namespace ncnn

#endif // NCNN_VULKAN