
binaryop.cpp 17 kB

[WIP] vulkan compute (#618)

* vulkan infrastructure
* vkallocator and vkmat
* layer interface for vulkan compute
* wip...
* default vulkan device, command wrapper, upload model weight in load_model to simplify layer interface
* simplify command api, vkmat holds staging buffer, relu works
* initialize specialization constant, simplify command dispatch, fix staging buffer copy with different shape, convolution works
* init extension functions
* dynamic local size and group count
* group count=1 is invalid
* regard device max workgroup size limit
* fix relu oooops
* decouple command record and staging allocation
* create result blob
* add pooling shader
* buffer is faster than image :)
* fix pooling shader
* add innerproduct shader
* readonly writeonly decoration
* simplify buffer creation
* decouple command and layer, VK_KHR_descriptor_update_template extension makes descriptor binding update easy :D
* fix vulkan building issues in visual studio (#1)
* fix building issues on visual studio
* ignore benchmark
* cancel changes
* ... ...
* decouple paramdict and vulkandevice
* fix staging buffer destroy in model loading
* remove vkdev member in option
* add padding shader
* simplify vulkan layer creation, simplify convolution and pooling shader for no padding, less debug output
* add convolutiondepthwise and softmax shader
* specialization float type, add leakyrelu
* add dropout shader
* add batchnorm shader
* split vulkan forward
* add scale shader
* push constant type can be int or float
* set_optimal_local_size_xyz
* add eltwise shader
* concat vulkan forward
* fix convolution without bias
* add dummy shader for concat and split, more fix ...
* optional VK_KHR_descriptor_update_template and VK_KHR_push_descriptor
* check VK_KHR_push_descriptor for vkCmdPushDescriptorSetWithTemplateKHR
* binaryop and unaryop shader
* hide raw command buffer
* simple vkbenchncnn benchmark
* create device with transfer queue
* rename command to vkcompute, add vktransfer and layer upload_model interface
* external VkMat, copy and map wrt buffer offset
* command copy respect offset and size
* decouple weight upload and load, simplify upload weight api, use one big staging buffer for uploading weights
* fix build on android
* binding count can not vary :(
* barrier check state, fix sub-op destruction
* declare local_size_xyz constant, fix crash on radv
* fix local_size_xyz, second try
* more barrier and state fix
* fix softmax
* reconstruct buffer memory allocator, reuse blob buffer, less verbose output
* find unified memory type index
* weight staging buffer allocator and weight buffer allocator, respect descriptor buffer offset alignment
* use VK_KHR_descriptor_update_template for faster descriptor update if available, multithread pipeline creation
* find more useful vulkan extensions and enable them
* fix msvc build
* respect VK_KHR_dedicated_allocation for weight buffer allocation
* fix android build
* fix bias name conflicts with metal
* decouple pipeline and layer, building shader sources into shader module, dedicated create_pipeline api, simplify pipeline recording
* drop dummy shader, inplace softmax, multiple shader module works
* fix unique queue family index error
* flatten support vulkan
* mnasnet run
* find shader module by name, each entry point per shader module, fix attribute/id conflict on moltenvk
* some minor changes
* add some high level api
* use dedicated transfer queue to upload weight model
* prefer mappable buffer on unified memory
* global pooling and convolution fc, reuse staging buffer
* implement ring-buffer style blob allocator, add VkBufferMemory capacity
* use blob allocator for workspace blob, it works fine :)
* vulkan option off
* Update layer.cpp
* fix build with vulkan off
* less verbose output, fix crash on vulkan_compute off
* merge benchncnn tool
* allocator clear api, use new weight buffer allocator per net
* add default locked allocator
* mapped mat ptr api, persistent mapped memory works generally :)
* travis ci linux vulkan
* travis ci vulkan wip ...
* more gpu wip ...
* more gpu wip ...
* wip...
* wip...
* wip... ...
* wip... ios vulkan build...
* find glslangValidator on ios build
* use dynamic moltenvk library
* travis ci wip ...
* ios simulator does not support metal at all
* fix cpu only extractor
* optimize workgroup size, first try
* optimize workgroup size, second try
* conv1x1s1d1 vec4
* revert build system
* fix ncnn2mem build
* fix ncnn2mem build
7 years ago
8 years ago
[WIP] vulkan compute (#618) * vulkan infrastructure * vkallocator and vkmat * layer interface for vulkan compute * wip... * default vulkan device, command wrapper, upload model weight in load_model to simplify layer interface * simplify command api, vkmat holds staging buffer, relu works * initialize specialization constant, simplify command dispatch, fix staging buffer copy with different shape, convolution works * init extension functions * dynamic local size and group count * group count=1 is invalid * regard device max workgroup size limit * fix relu oooops * decouple command record and staging allocation * create result blob * add pooling shader * buffer is faster than image :) * fix pooling shader * add innerproduct shader * readonly writeonly decoration * simplify buffer creation * decouple command and layer, VK_KHR_descriptor_update_template extension makes descriptor binding update easy :D * fix vulkan building issues in visual studio (#1) * fix building issues on visual studio * ignore benchmark * cancel changes * ... ... * decouple paramdict and vulkandevice * fix staging buffer destroy in model loading * remove vkdev member in option * add padding shader * simplify vulkan layer creation, simplify convolution and pooling shader for no padding, less debug output * add convolutiondepthwise and softmax shader * specialization float type, add leakyrelu * add dropout shader * add batchnorm shader * split vulkan forward * add scale shader * push constant type can be int or float * set_optimal_local_size_xyz * add eltwise shader * concat vulkan forward * fix convolution without bias * add dummy shader for concat and split, more fix ... 
* optional VK_KHR_descriptor_update_template and VK_KHR_push_descriptor * check VK_KHR_push_descriptor for vkCmdPushDescriptorSetWithTemplateKHR * binaryop and unaryop shader * hide raw command buffer * simple vkbenchncnn benchmark * create device with transfer queue * rename command to vkcompute, add vktransfer and layer upload_model interface * external VkMat, copy and map wrt buffer offset * command copy respect offset and size * decouple weight upload and load, simplify upload weight api, use one big staging buffer for uploading weights * fix build on android * binding count can not vary :( * barrier check state, fix sub-op destruction * declare local_size_xyz constant, fix crash on radv * fix local_size_xyz, second try * more barrier and state fix * fix softmax * reconstruct buffer memory allocator, reuse blob buffer, less verbose output * find unified memory type index * weight staging buffer allocator and weight buffer allocator, respect descriptor buffer offset alignment * use VK_KHR_descriptor_update_template for faster descriptor update if available, multithread pipeline creation * find more useful vulkan extensions and enable them * fix msvc build * respect VK_KHR_dedicated_allocation for weight buffer allocation * fix android build * fix bias name conflicts with metal * decouple pipeline and layer, building shader sources into shader module, dedicated create_pipeline api, simplify pipeline recording * drop dummy shader, inplace softmax, multiple shader module works * fix unique queue family index error * flatten support vulkan * mnasnet run * find shader module by name, each entry point per shader module, fix attribute/id conflict on moltenvk * some minor changes * add some high level api * use dedicated transfer queue to upload weight model * prefer mappable buffer on unified memory * global pooling and convolution fc, reuse staging buffer * implement ring-buffer style blob allocator, add VkBufferMemory capacity * use blob allocator for workspace blob, 
it works fine :) * vulkan option off * Update layer.cpp * fix build with vulkan off * less verbose output, fix crash on vulkan_compute off * merge benchncnn tool * allocator clear api, use new weight buffer allocator per net * add default locked allocator * mapped mat ptr api, persistent mapped memory works generally :) * travis ci linux vulkan * travis ci vulkan wip ... * more gpu wip ... * more gpu wip ... * wip... * wip... * wip... ... * wip... ios vulkan build... * find glslangValidator on ios build * use dynamic moltenvk library * travis ci wip ... * ios simulator does not support metal at all * fix cpu only extractor * optimize workgroup size, first try * optimize workgroup size, second try * conv1x1s1d1 vec4 * revert build system * fix ncnn2mem build * fix ncnn2mem build
7 years ago
[WIP] vulkan compute (#618) * vulkan infrastructure * vkallocator and vkmat * layer interface for vulkan compute * wip... * default vulkan device, command wrapper, upload model weight in load_model to simplify layer interface * simplify command api, vkmat holds staging buffer, relu works * initialize specialization constant, simplify command dispatch, fix staging buffer copy with different shape, convolution works * init extension functions * dynamic local size and group count * group count=1 is invalid * regard device max workgroup size limit * fix relu oooops * decouple command record and staging allocation * create result blob * add pooling shader * buffer is faster than image :) * fix pooling shader * add innerproduct shader * readonly writeonly decoration * simplify buffer creation * decouple command and layer, VK_KHR_descriptor_update_template extension makes descriptor binding update easy :D * fix vulkan building issues in visual studio (#1) * fix building issues on visual studio * ignore benchmark * cancel changes * ... ... * decouple paramdict and vulkandevice * fix staging buffer destroy in model loading * remove vkdev member in option * add padding shader * simplify vulkan layer creation, simplify convolution and pooling shader for no padding, less debug output * add convolutiondepthwise and softmax shader * specialization float type, add leakyrelu * add dropout shader * add batchnorm shader * split vulkan forward * add scale shader * push constant type can be int or float * set_optimal_local_size_xyz * add eltwise shader * concat vulkan forward * fix convolution without bias * add dummy shader for concat and split, more fix ... 
* optional VK_KHR_descriptor_update_template and VK_KHR_push_descriptor * check VK_KHR_push_descriptor for vkCmdPushDescriptorSetWithTemplateKHR * binaryop and unaryop shader * hide raw command buffer * simple vkbenchncnn benchmark * create device with transfer queue * rename command to vkcompute, add vktransfer and layer upload_model interface * external VkMat, copy and map wrt buffer offset * command copy respect offset and size * decouple weight upload and load, simplify upload weight api, use one big staging buffer for uploading weights * fix build on android * binding count can not vary :( * barrier check state, fix sub-op destruction * declare local_size_xyz constant, fix crash on radv * fix local_size_xyz, second try * more barrier and state fix * fix softmax * reconstruct buffer memory allocator, reuse blob buffer, less verbose output * find unified memory type index * weight staging buffer allocator and weight buffer allocator, respect descriptor buffer offset alignment * use VK_KHR_descriptor_update_template for faster descriptor update if available, multithread pipeline creation * find more useful vulkan extensions and enable them * fix msvc build * respect VK_KHR_dedicated_allocation for weight buffer allocation * fix android build * fix bias name conflicts with metal * decouple pipeline and layer, building shader sources into shader module, dedicated create_pipeline api, simplify pipeline recording * drop dummy shader, inplace softmax, multiple shader module works * fix unique queue family index error * flatten support vulkan * mnasnet run * find shader module by name, each entry point per shader module, fix attribute/id conflict on moltenvk * some minor changes * add some high level api * use dedicated transfer queue to upload weight model * prefer mappable buffer on unified memory * global pooling and convolution fc, reuse staging buffer * implement ring-buffer style blob allocator, add VkBufferMemory capacity * use blob allocator for workspace blob, 
it works fine :) * vulkan option off * Update layer.cpp * fix build with vulkan off * less verbose output, fix crash on vulkan_compute off * merge benchncnn tool * allocator clear api, use new weight buffer allocator per net * add default locked allocator * mapped mat ptr api, persistent mapped memory works generally :) * travis ci linux vulkan * travis ci vulkan wip ... * more gpu wip ... * more gpu wip ... * wip... * wip... * wip... ... * wip... ios vulkan build... * find glslangValidator on ios build * use dynamic moltenvk library * travis ci wip ... * ios simulator does not support metal at all * fix cpu only extractor * optimize workgroup size, first try * optimize workgroup size, second try * conv1x1s1d1 vec4 * revert build system * fix ncnn2mem build * fix ncnn2mem build
7 years ago
[WIP] vulkan compute (#618) * vulkan infrastructure * vkallocator and vkmat * layer interface for vulkan compute * wip... * default vulkan device, command wrapper, upload model weight in load_model to simplify layer interface * simplify command api, vkmat holds staging buffer, relu works * initialize specialization constant, simplify command dispatch, fix staging buffer copy with different shape, convolution works * init extension functions * dynamic local size and group count * group count=1 is invalid * regard device max workgroup size limit * fix relu oooops * decouple command record and staging allocation * create result blob * add pooling shader * buffer is faster than image :) * fix pooling shader * add innerproduct shader * readonly writeonly decoration * simplify buffer creation * decouple command and layer, VK_KHR_descriptor_update_template extension makes descriptor binding update easy :D * fix vulkan building issues in visual studio (#1) * fix building issues on visual studio * ignore benchmark * cancel changes * ... ... * decouple paramdict and vulkandevice * fix staging buffer destroy in model loading * remove vkdev member in option * add padding shader * simplify vulkan layer creation, simplify convolution and pooling shader for no padding, less debug output * add convolutiondepthwise and softmax shader * specialization float type, add leakyrelu * add dropout shader * add batchnorm shader * split vulkan forward * add scale shader * push constant type can be int or float * set_optimal_local_size_xyz * add eltwise shader * concat vulkan forward * fix convolution without bias * add dummy shader for concat and split, more fix ... 
* optional VK_KHR_descriptor_update_template and VK_KHR_push_descriptor * check VK_KHR_push_descriptor for vkCmdPushDescriptorSetWithTemplateKHR * binaryop and unaryop shader * hide raw command buffer * simple vkbenchncnn benchmark * create device with transfer queue * rename command to vkcompute, add vktransfer and layer upload_model interface * external VkMat, copy and map wrt buffer offset * command copy respect offset and size * decouple weight upload and load, simplify upload weight api, use one big staging buffer for uploading weights * fix build on android * binding count can not vary :( * barrier check state, fix sub-op destruction * declare local_size_xyz constant, fix crash on radv * fix local_size_xyz, second try * more barrier and state fix * fix softmax * reconstruct buffer memory allocator, reuse blob buffer, less verbose output * find unified memory type index * weight staging buffer allocator and weight buffer allocator, respect descriptor buffer offset alignment * use VK_KHR_descriptor_update_template for faster descriptor update if available, multithread pipeline creation * find more useful vulkan extensions and enable them * fix msvc build * respect VK_KHR_dedicated_allocation for weight buffer allocation * fix android build * fix bias name conflicts with metal * decouple pipeline and layer, building shader sources into shader module, dedicated create_pipeline api, simplify pipeline recording * drop dummy shader, inplace softmax, multiple shader module works * fix unique queue family index error * flatten support vulkan * mnasnet run * find shader module by name, each entry point per shader module, fix attribute/id conflict on moltenvk * some minor changes * add some high level api * use dedicated transfer queue to upload weight model * prefer mappable buffer on unified memory * global pooling and convolution fc, reuse staging buffer * implement ring-buffer style blob allocator, add VkBufferMemory capacity * use blob allocator for workspace blob, 
it works fine :) * vulkan option off * Update layer.cpp * fix build with vulkan off * less verbose output, fix crash on vulkan_compute off * merge benchncnn tool * allocator clear api, use new weight buffer allocator per net * add default locked allocator * mapped mat ptr api, persistent mapped memory works generally :) * travis ci linux vulkan * travis ci vulkan wip ... * more gpu wip ... * more gpu wip ... * wip... * wip... * wip... ... * wip... ios vulkan build... * find glslangValidator on ios build * use dynamic moltenvk library * travis ci wip ... * ios simulator does not support metal at all * fix cpu only extractor * optimize workgroup size, first try * optimize workgroup size, second try * conv1x1s1d1 vec4 * revert build system * fix ncnn2mem build * fix ncnn2mem build
7 years ago
[WIP] vulkan compute (#618) * vulkan infrastructure * vkallocator and vkmat * layer interface for vulkan compute * wip... * default vulkan device, command wrapper, upload model weight in load_model to simplify layer interface * simplify command api, vkmat holds staging buffer, relu works * initialize specialization constant, simplify command dispatch, fix staging buffer copy with different shape, convolution works * init extension functions * dynamic local size and group count * group count=1 is invalid * regard device max workgroup size limit * fix relu oooops * decouple command record and staging allocation * create result blob * add pooling shader * buffer is faster than image :) * fix pooling shader * add innerproduct shader * readonly writeonly decoration * simplify buffer creation * decouple command and layer, VK_KHR_descriptor_update_template extension makes descriptor binding update easy :D * fix vulkan building issues in visual studio (#1) * fix building issues on visual studio * ignore benchmark * cancel changes * ... ... * decouple paramdict and vulkandevice * fix staging buffer destroy in model loading * remove vkdev member in option * add padding shader * simplify vulkan layer creation, simplify convolution and pooling shader for no padding, less debug output * add convolutiondepthwise and softmax shader * specialization float type, add leakyrelu * add dropout shader * add batchnorm shader * split vulkan forward * add scale shader * push constant type can be int or float * set_optimal_local_size_xyz * add eltwise shader * concat vulkan forward * fix convolution without bias * add dummy shader for concat and split, more fix ... 
* optional VK_KHR_descriptor_update_template and VK_KHR_push_descriptor * check VK_KHR_push_descriptor for vkCmdPushDescriptorSetWithTemplateKHR * binaryop and unaryop shader * hide raw command buffer * simple vkbenchncnn benchmark * create device with transfer queue * rename command to vkcompute, add vktransfer and layer upload_model interface * external VkMat, copy and map wrt buffer offset * command copy respect offset and size * decouple weight upload and load, simplify upload weight api, use one big staging buffer for uploading weights * fix build on android * binding count can not vary :( * barrier check state, fix sub-op destruction * declare local_size_xyz constant, fix crash on radv * fix local_size_xyz, second try * more barrier and state fix * fix softmax * reconstruct buffer memory allocator, reuse blob buffer, less verbose output * find unified memory type index * weight staging buffer allocator and weight buffer allocator, respect descriptor buffer offset alignment * use VK_KHR_descriptor_update_template for faster descriptor update if available, multithread pipeline creation * find more useful vulkan extensions and enable them * fix msvc build * respect VK_KHR_dedicated_allocation for weight buffer allocation * fix android build * fix bias name conflicts with metal * decouple pipeline and layer, building shader sources into shader module, dedicated create_pipeline api, simplify pipeline recording * drop dummy shader, inplace softmax, multiple shader module works * fix unique queue family index error * flatten support vulkan * mnasnet run * find shader module by name, each entry point per shader module, fix attribute/id conflict on moltenvk * some minor changes * add some high level api * use dedicated transfer queue to upload weight model * prefer mappable buffer on unified memory * global pooling and convolution fc, reuse staging buffer * implement ring-buffer style blob allocator, add VkBufferMemory capacity * use blob allocator for workspace blob, 
it works fine :) * vulkan option off * Update layer.cpp * fix build with vulkan off * less verbose output, fix crash on vulkan_compute off * merge benchncnn tool * allocator clear api, use new weight buffer allocator per net * add default locked allocator * mapped mat ptr api, persistent mapped memory works generally :) * travis ci linux vulkan * travis ci vulkan wip ... * more gpu wip ... * more gpu wip ... * wip... * wip... * wip... ... * wip... ios vulkan build... * find glslangValidator on ios build * use dynamic moltenvk library * travis ci wip ... * ios simulator does not support metal at all * fix cpu only extractor * optimize workgroup size, first try * optimize workgroup size, second try * conv1x1s1d1 vec4 * revert build system * fix ncnn2mem build * fix ncnn2mem build
7 years ago
[WIP] vulkan compute (#618) * vulkan infrastructure * vkallocator and vkmat * layer interface for vulkan compute * wip... * default vulkan device, command wrapper, upload model weight in load_model to simplify layer interface * simplify command api, vkmat holds staging buffer, relu works * initialize specialization constant, simplify command dispatch, fix staging buffer copy with different shape, convolution works * init extension functions * dynamic local size and group count * group count=1 is invalid * regard device max workgroup size limit * fix relu oooops * decouple command record and staging allocation * create result blob * add pooling shader * buffer is faster than image :) * fix pooling shader * add innerproduct shader * readonly writeonly decoration * simplify buffer creation * decouple command and layer, VK_KHR_descriptor_update_template extension makes descriptor binding update easy :D * fix vulkan building issues in visual studio (#1) * fix building issues on visual studio * ignore benchmark * cancel changes * ... ... * decouple paramdict and vulkandevice * fix staging buffer destroy in model loading * remove vkdev member in option * add padding shader * simplify vulkan layer creation, simplify convolution and pooling shader for no padding, less debug output * add convolutiondepthwise and softmax shader * specialization float type, add leakyrelu * add dropout shader * add batchnorm shader * split vulkan forward * add scale shader * push constant type can be int or float * set_optimal_local_size_xyz * add eltwise shader * concat vulkan forward * fix convolution without bias * add dummy shader for concat and split, more fix ... 
* optional VK_KHR_descriptor_update_template and VK_KHR_push_descriptor * check VK_KHR_push_descriptor for vkCmdPushDescriptorSetWithTemplateKHR * binaryop and unaryop shader * hide raw command buffer * simple vkbenchncnn benchmark * create device with transfer queue * rename command to vkcompute, add vktransfer and layer upload_model interface * external VkMat, copy and map wrt buffer offset * command copy respect offset and size * decouple weight upload and load, simplify upload weight api, use one big staging buffer for uploading weights * fix build on android * binding count can not vary :( * barrier check state, fix sub-op destruction * declare local_size_xyz constant, fix crash on radv * fix local_size_xyz, second try * more barrier and state fix * fix softmax * reconstruct buffer memory allocator, reuse blob buffer, less verbose output * find unified memory type index * weight staging buffer allocator and weight buffer allocator, respect descriptor buffer offset alignment * use VK_KHR_descriptor_update_template for faster descriptor update if available, multithread pipeline creation * find more useful vulkan extensions and enable them * fix msvc build * respect VK_KHR_dedicated_allocation for weight buffer allocation * fix android build * fix bias name conflicts with metal * decouple pipeline and layer, building shader sources into shader module, dedicated create_pipeline api, simplify pipeline recording * drop dummy shader, inplace softmax, multiple shader module works * fix unique queue family index error * flatten support vulkan * mnasnet run * find shader module by name, each entry point per shader module, fix attribute/id conflict on moltenvk * some minor changes * add some high level api * use dedicated transfer queue to upload weight model * prefer mappable buffer on unified memory * global pooling and convolution fc, reuse staging buffer * implement ring-buffer style blob allocator, add VkBufferMemory capacity * use blob allocator for workspace blob, 
it works fine :) * vulkan option off * Update layer.cpp * fix build with vulkan off * less verbose output, fix crash on vulkan_compute off * merge benchncnn tool * allocator clear api, use new weight buffer allocator per net * add default locked allocator * mapped mat ptr api, persistent mapped memory works generally :) * travis ci linux vulkan * travis ci vulkan wip ... * more gpu wip ... * more gpu wip ... * wip... * wip... * wip... ... * wip... ios vulkan build... * find glslangValidator on ios build * use dynamic moltenvk library * travis ci wip ... * ios simulator does not support metal at all * fix cpu only extractor * optimize workgroup size, first try * optimize workgroup size, second try * conv1x1s1d1 vec4 * revert build system * fix ncnn2mem build * fix ncnn2mem build
7 years ago
[WIP] vulkan compute (#618) * vulkan infrastructure * vkallocator and vkmat * layer interface for vulkan compute * wip... * default vulkan device, command wrapper, upload model weight in load_model to simplify layer interface * simplify command api, vkmat holds staging buffer, relu works * initialize specialization constant, simplify command dispatch, fix staging buffer copy with different shape, convolution works * init extension functions * dynamic local size and group count * group count=1 is invalid * regard device max workgroup size limit * fix relu oooops * decouple command record and staging allocation * create result blob * add pooling shader * buffer is faster than image :) * fix pooling shader * add innerproduct shader * readonly writeonly decoration * simplify buffer creation * decouple command and layer, VK_KHR_descriptor_update_template extension makes descriptor binding update easy :D * fix vulkan building issues in visual studio (#1) * fix building issues on visual studio * ignore benchmark * cancel changes * ... ... * decouple paramdict and vulkandevice * fix staging buffer destroy in model loading * remove vkdev member in option * add padding shader * simplify vulkan layer creation, simplify convolution and pooling shader for no padding, less debug output * add convolutiondepthwise and softmax shader * specialization float type, add leakyrelu * add dropout shader * add batchnorm shader * split vulkan forward * add scale shader * push constant type can be int or float * set_optimal_local_size_xyz * add eltwise shader * concat vulkan forward * fix convolution without bias * add dummy shader for concat and split, more fix ... 
* optional VK_KHR_descriptor_update_template and VK_KHR_push_descriptor * check VK_KHR_push_descriptor for vkCmdPushDescriptorSetWithTemplateKHR * binaryop and unaryop shader * hide raw command buffer * simple vkbenchncnn benchmark * create device with transfer queue * rename command to vkcompute, add vktransfer and layer upload_model interface * external VkMat, copy and map wrt buffer offset * command copy respect offset and size * decouple weight upload and load, simplify upload weight api, use one big staging buffer for uploading weights * fix build on android * binding count can not vary :( * barrier check state, fix sub-op destruction * declare local_size_xyz constant, fix crash on radv * fix local_size_xyz, second try * more barrier and state fix * fix softmax * reconstruct buffer memory allocator, reuse blob buffer, less verbose output * find unified memory type index * weight staging buffer allocator and weight buffer allocator, respect descriptor buffer offset alignment * use VK_KHR_descriptor_update_template for faster descriptor update if available, multithread pipeline creation * find more useful vulkan extensions and enable them * fix msvc build * respect VK_KHR_dedicated_allocation for weight buffer allocation * fix android build * fix bias name conflicts with metal * decouple pipeline and layer, building shader sources into shader module, dedicated create_pipeline api, simplify pipeline recording * drop dummy shader, inplace softmax, multiple shader module works * fix unique queue family index error * flatten support vulkan * mnasnet run * find shader module by name, each entry point per shader module, fix attribute/id conflict on moltenvk * some minor changes * add some high level api * use dedicated transfer queue to upload weight model * prefer mappable buffer on unified memory * global pooling and convolution fc, reuse staging buffer * implement ring-buffer style blob allocator, add VkBufferMemory capacity * use blob allocator for workspace blob, 
it works fine :) * vulkan option off * Update layer.cpp * fix build with vulkan off * less verbose output, fix crash on vulkan_compute off * merge benchncnn tool * allocator clear api, use new weight buffer allocator per net * add default locked allocator * mapped mat ptr api, persistent mapped memory works generally :) * travis ci linux vulkan * travis ci vulkan wip ... * more gpu wip ... * more gpu wip ... * wip... * wip... * wip... ... * wip... ios vulkan build... * find glslangValidator on ios build * use dynamic moltenvk library * travis ci wip ... * ios simulator does not support metal at all * fix cpu only extractor * optimize workgroup size, first try * optimize workgroup size, second try * conv1x1s1d1 vec4 * revert build system * fix ncnn2mem build * fix ncnn2mem build
7 years ago
// Tencent is pleased to support the open source community by making ncnn available.
//
// Copyright (C) 2017 THL A29 Limited, a Tencent company. All rights reserved.
//
// Licensed under the BSD 3-Clause License (the "License"); you may not use this file except
// in compliance with the License. You may obtain a copy of the License at
//
// https://opensource.org/licenses/BSD-3-Clause
//
// Unless required by applicable law or agreed to in writing, software distributed
// under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
// CONDITIONS OF ANY KIND, either express or implied. See the License for the
// specific language governing permissions and limitations under the License.

#include "binaryop.h"
#include <math.h>
#include <algorithm>
#include <functional>

namespace ncnn {

DEFINE_LAYER_CREATOR(BinaryOp)
BinaryOp::BinaryOp()
{
    one_blob_only = false;
    support_inplace = false;
    support_vulkan = true;

#if NCNN_VULKAN
    pipeline_binaryop = 0;
    pipeline_binaryop_pack4 = 0;
#endif // NCNN_VULKAN
}

int BinaryOp::load_param(const ParamDict& pd)
{
    op_type = pd.get(0, 0);
    with_scalar = pd.get(1, 0);
    b = pd.get(2, 0.f);

    if (with_scalar != 0)
    {
        one_blob_only = true;
        support_inplace = true;
    }

    return 0;
}
// Apply Op elementwise to a and b, writing the result to c.
// When the shapes differ, the lower-rank operand is broadcast over the other.
template<typename Op>
static int binary_op(const Mat& a, const Mat& b, Mat& c, const Option& opt)
{
    Op op;

    int w = a.w;
    int h = a.h;
    int channels = a.c;
    int size = w * h;
    size_t elemsize = a.elemsize;

    int w1 = b.w;
    int h1 = b.h;
    int channels1 = b.c;
    int size1 = w1 * h1;

    if (a.dims == 3)
    {
        c.create(w, h, channels, elemsize, opt.blob_allocator);
        if (c.empty())
            return -100;

        if (b.dims == 3)
        {
            if (b.w == 1 && b.h == 1)
            {
                // b holds one value per channel
                #pragma omp parallel for num_threads(opt.num_threads)
                for (int q = 0; q < channels; q++)
                {
                    const float* ptr = a.channel(q);
                    float* outptr = c.channel(q);
                    const float* b0 = b.channel(q);

                    for (int i = 0; i < size; i++)
                    {
                        outptr[i] = op(ptr[i], b0[0]);
                    }
                }

                return 0;
            }

            // same shape, plain elementwise
            #pragma omp parallel for num_threads(opt.num_threads)
            for (int q = 0; q < channels; q++)
            {
                const float* ptr = a.channel(q);
                const float* ptr1 = b.channel(q);
                float* outptr = c.channel(q);

                for (int i = 0; i < size; i++)
                {
                    outptr[i] = op(ptr[i], ptr1[i]);
                }
            }

            return 0;
        }

        if (b.dims == 2)
        {
            // b holds one value per (channel, row) of a
            #pragma omp parallel for num_threads(opt.num_threads)
            for (int q = 0; q < channels; q++)
            {
                const float* ptr = a.channel(q);
                const float* ptr1 = (const float*)b + h * q;
                float* outptr = c.channel(q);

                for (int y = 0; y < h; y++)
                {
                    const float b0 = ptr1[y];
                    for (int x = 0; x < w; x++)
                    {
                        outptr[x] = op(ptr[x], b0);
                    }

                    ptr += w;
                    outptr += w;
                }
            }

            return 0;
        }

        if (b.dims == 1)
        {
            if (b.w == 1)
            {
                // b is a single scalar
                const float b0 = b[0];

                #pragma omp parallel for num_threads(opt.num_threads)
                for (int q = 0; q < channels; q++)
                {
                    const float* ptr = a.channel(q);
                    float* outptr = c.channel(q);

                    for (int i = 0; i < size; i++)
                    {
                        outptr[i] = op(ptr[i], b0);
                    }
                }

                return 0;
            }

            // b holds one value per channel
            #pragma omp parallel for num_threads(opt.num_threads)
            for (int q = 0; q < channels; q++)
            {
                const float* ptr = a.channel(q);
                const float b0 = b[q];
                float* outptr = c.channel(q);

                for (int i = 0; i < size; i++)
                {
                    outptr[i] = op(ptr[i], b0);
                }
            }

            return 0;
        }
    }
    else if (a.dims == 2)
    {
        if (b.dims == 3)
        {
            // a holds one value per (channel, row) of b
            c.create(w1, h1, channels1, elemsize, opt.blob_allocator);
            if (c.empty())
                return -100;

            #pragma omp parallel for num_threads(opt.num_threads)
            for (int q = 0; q < channels1; q++)
            {
                const float* ptr = (const float*)a + h1 * q;
                const float* ptr1 = b.channel(q);
                float* outptr = c.channel(q);

                for (int y = 0; y < h1; y++)
                {
                    const float a0 = ptr[y];
                    for (int x = 0; x < w1; x++)
                    {
                        outptr[x] = op(a0, ptr1[x]);
                    }

                    ptr1 += w1;
                    outptr += w1;
                }
            }

            return 0;
        }

        // the remaining cases all produce an output shaped like a
        c.create(w, h, elemsize, opt.blob_allocator);
        if (c.empty())
            return -100;

        if (b.dims == 2)
        {
            for (int i = 0; i < size; i++)
            {
                c[i] = op(a[i], b[i]);
            }

            return 0;
        }

        if (b.dims == 1)
        {
            if (b.w == 1)
            {
                // b is a single scalar
                const float b0 = b[0];
                for (int i = 0; i < size; i++)
                {
                    c[i] = op(a[i], b0);
                }

                return 0;
            }

            // b holds one value per row of a
            const float* ptr = a;
            float* outptr = c;

            for (int y = 0; y < h; y++)
            {
                const float b0 = b[y];
                for (int x = 0; x < w; x++)
                {
                    outptr[x] = op(ptr[x], b0);
                }

                ptr += w;
                outptr += w;
            }

            return 0;
        }
    }
    else if (a.dims == 1)
    {
        if (a.w == 1)
        {
            // a is a single scalar, the output is shaped like b
            if (b.dims == 3)
            {
                c.create(w1, h1, channels1, elemsize, opt.blob_allocator);
                if (c.empty())
                    return -100;

                const float a0 = a[0];

                #pragma omp parallel for num_threads(opt.num_threads)
                for (int q = 0; q < channels1; q++)
                {
                    const float* ptr1 = b.channel(q);
                    float* outptr = c.channel(q);

                    for (int i = 0; i < size1; i++)
                    {
                        outptr[i] = op(a0, ptr1[i]);
                    }
                }

                return 0;
            }

            if (b.dims == 2)
            {
                c.create(w1, h1, elemsize, opt.blob_allocator);
                if (c.empty())
                    return -100;

                const float a0 = a[0];
                for (int i = 0; i < size1; i++)
                {
                    c[i] = op(a0, b[i]);
                }

                return 0;
            }

            if (b.dims == 1)
            {
                c.create(w1, elemsize, opt.blob_allocator);
                if (c.empty())
                    return -100;

                const float a0 = a[0];
                for (int i = 0; i < size1; i++)
                {
                    c[i] = op(a0, b[i]);
                }

                return 0;
            }
        }

        if (b.dims == 3)
        {
            // a holds one value per channel of b
            c.create(w1, h1, channels1, elemsize, opt.blob_allocator);
            if (c.empty())
                return -100;

            #pragma omp parallel for num_threads(opt.num_threads)
            for (int q = 0; q < channels1; q++)
            {
                const float a0 = a[q];
                const float* ptr1 = b.channel(q);
                float* outptr = c.channel(q);

                for (int i = 0; i < size1; i++)
                {
                    outptr[i] = op(a0, ptr1[i]);
                }
            }

            return 0;
        }

        if (b.dims == 2)
        {
            // a holds one value per row of b
            c.create(w1, h1, elemsize, opt.blob_allocator);
            if (c.empty())
                return -100;

            const float* ptr1 = b;
            float* outptr = c;

            for (int y = 0; y < h1; y++)
            {
                const float a0 = a[y];
                for (int x = 0; x < w1; x++)
                {
                    outptr[x] = op(a0, ptr1[x]);
                }

                ptr1 += w1;
                outptr += w1;
            }

            return 0;
        }

        if (b.dims == 1)
        {
            c.create(w, elemsize, opt.blob_allocator);
            if (c.empty())
                return -100;

            if (b.w == 1)
            {
                // b is a single scalar
                const float b0 = b[0];
                for (int i = 0; i < size; i++)
                {
                    c[i] = op(a[i], b0);
                }

                return 0;
            }

            for (int i = 0; i < size; i++)
            {
                c[i] = op(a[i], b[i]);
            }

            return 0;
        }
    }

    return 0;
}
// Apply Op in place to every element of a, with the scalar b as second operand.
template<typename Op>
static int binary_op_scalar_inplace(Mat& a, float b, const Option& opt)
{
    Op op;

    int w = a.w;
    int h = a.h;
    int channels = a.c;
    int size = w * h;

    #pragma omp parallel for num_threads(opt.num_threads)
    for (int q = 0; q < channels; q++)
    {
        float* ptr = a.channel(q);

        for (int i = 0; i < size; i++)
        {
            ptr[i] = op(ptr[i], b);
        }
    }

    return 0;
}

// Functors for the operations that <functional> does not provide.
// They no longer derive from std::binary_function, which is deprecated
// since C++11 and removed in C++17; the base class was not used here.
template<typename T>
struct binary_op_max
{
    T operator()(const T& x, const T& y) const { return std::max(x, y); }
};

template<typename T>
struct binary_op_min
{
    T operator()(const T& x, const T& y) const { return std::min(x, y); }
};

template<typename T>
struct binary_op_pow
{
    T operator()(const T& x, const T& y) const { return pow(x, y); }
};

// reversed subtraction: y - x
template<typename T>
struct binary_op_rsub
{
    T operator()(const T& x, const T& y) const { return y - x; }
};

// reversed division: y / x
template<typename T>
struct binary_op_rdiv
{
    T operator()(const T& x, const T& y) const { return y / x; }
};
int BinaryOp::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& top_blobs, const Option& opt) const
{
    const Mat& bottom_blob = bottom_blobs[0];
    const Mat& bottom_blob1 = bottom_blobs[1];
    Mat& top_blob = top_blobs[0];

    if (op_type == Operation_ADD)
        return binary_op< std::plus<float> >(bottom_blob, bottom_blob1, top_blob, opt);

    if (op_type == Operation_SUB)
        return binary_op< std::minus<float> >(bottom_blob, bottom_blob1, top_blob, opt);

    if (op_type == Operation_MUL)
        return binary_op< std::multiplies<float> >(bottom_blob, bottom_blob1, top_blob, opt);

    if (op_type == Operation_DIV)
        return binary_op< std::divides<float> >(bottom_blob, bottom_blob1, top_blob, opt);

    if (op_type == Operation_MAX)
        return binary_op< binary_op_max<float> >(bottom_blob, bottom_blob1, top_blob, opt);

    if (op_type == Operation_MIN)
        return binary_op< binary_op_min<float> >(bottom_blob, bottom_blob1, top_blob, opt);

    if (op_type == Operation_POW)
        return binary_op< binary_op_pow<float> >(bottom_blob, bottom_blob1, top_blob, opt);

    if (op_type == Operation_RSUB)
        return binary_op< binary_op_rsub<float> >(bottom_blob, bottom_blob1, top_blob, opt);

    if (op_type == Operation_RDIV)
        return binary_op< binary_op_rdiv<float> >(bottom_blob, bottom_blob1, top_blob, opt);

    return 0;
}

int BinaryOp::forward_inplace(Mat& bottom_top_blob, const Option& opt) const
{
    if (op_type == Operation_ADD)
        return binary_op_scalar_inplace< std::plus<float> >(bottom_top_blob, b, opt);

    if (op_type == Operation_SUB)
        return binary_op_scalar_inplace< std::minus<float> >(bottom_top_blob, b, opt);

    if (op_type == Operation_MUL)
        return binary_op_scalar_inplace< std::multiplies<float> >(bottom_top_blob, b, opt);

    if (op_type == Operation_DIV)
        return binary_op_scalar_inplace< std::divides<float> >(bottom_top_blob, b, opt);

    if (op_type == Operation_MAX)
        return binary_op_scalar_inplace< binary_op_max<float> >(bottom_top_blob, b, opt);

    if (op_type == Operation_MIN)
        return binary_op_scalar_inplace< binary_op_min<float> >(bottom_top_blob, b, opt);

    if (op_type == Operation_POW)
        return binary_op_scalar_inplace< binary_op_pow<float> >(bottom_top_blob, b, opt);

    if (op_type == Operation_RSUB)
        return binary_op_scalar_inplace< binary_op_rsub<float> >(bottom_top_blob, b, opt);

    if (op_type == Operation_RDIV)
        return binary_op_scalar_inplace< binary_op_rdiv<float> >(bottom_top_blob, b, opt);

    return 0;
}
#if NCNN_VULKAN
int BinaryOp::create_pipeline()
{
    pipeline_binaryop = new Pipeline(vkdev);
    pipeline_binaryop->set_optimal_local_size_xyz();

    std::vector<vk_specialization_type> specializations(3);
    specializations[0].i = op_type;
    specializations[1].i = with_scalar;
    specializations[2].f = b;

    pipeline_binaryop->create("binaryop", specializations, 3, 15);

    // pack4
    {
        pipeline_binaryop_pack4 = new Pipeline(vkdev);
        pipeline_binaryop_pack4->set_optimal_local_size_xyz();
        pipeline_binaryop_pack4->create("binaryop_pack4", specializations, 3, 15);
    }

    return 0;
}

int BinaryOp::destroy_pipeline()
{
    delete pipeline_binaryop;
    pipeline_binaryop = 0;

    delete pipeline_binaryop_pack4;
    pipeline_binaryop_pack4 = 0;

    return 0;
}
int BinaryOp::forward(const std::vector<VkMat>& bottom_blobs, std::vector<VkMat>& top_blobs, VkCompute& cmd, const Option& opt) const
{
    const VkMat& bottom_blob = bottom_blobs[0];
    const VkMat& bottom_blob1 = bottom_blobs[1];
    VkMat& top_blob = top_blobs[0];

    int packing = bottom_blob.packing;

    // TODO broadcast
    top_blob.create_like(bottom_blob, opt.blob_vkallocator, opt.staging_vkallocator);
    if (top_blob.empty())
        return -100;

    // fprintf(stderr, "BinaryOp::forward %p %p %p\n", bottom_blob.buffer(), bottom_blob1.buffer(), top_blob.buffer());

    std::vector<VkMat> bindings(3);
    bindings[0] = bottom_blob;
    bindings[1] = bottom_blob1;
    bindings[2] = top_blob;

    std::vector<vk_constant_type> constants(15);
    constants[0].i = bottom_blob.dims;
    constants[1].i = bottom_blob.w;
    constants[2].i = bottom_blob.h;
    constants[3].i = bottom_blob.c;
    constants[4].i = bottom_blob.cstep;
    constants[5].i = bottom_blob1.dims;
    constants[6].i = bottom_blob1.w;
    constants[7].i = bottom_blob1.h;
    constants[8].i = bottom_blob1.c;
    constants[9].i = bottom_blob1.cstep;
    constants[10].i = top_blob.dims;
    constants[11].i = top_blob.w;
    constants[12].i = top_blob.h;
    constants[13].i = top_blob.c;
    constants[14].i = top_blob.cstep;

    const Pipeline* pipeline = packing == 4 ? pipeline_binaryop_pack4 : pipeline_binaryop;

    // record
    cmd.record_prepare_compute_barrier(bottom_blob);
    cmd.record_prepare_compute_barrier(bottom_blob1);
    cmd.record_pipeline(pipeline, bindings, constants, top_blob);

    return 0;
}

int BinaryOp::forward_inplace(VkMat& bottom_top_blob, VkCompute& cmd, const Option& opt) const
{
    int packing = bottom_top_blob.packing;

    // fprintf(stderr, "BinaryOp::forward_inplace %p\n", bottom_top_blob.buffer());

    std::vector<VkMat> bindings(3);
    bindings[0] = bottom_top_blob;
    bindings[1] = bottom_top_blob; // TODO use dummy buffer
    bindings[2] = bottom_top_blob; // TODO use dummy buffer

    std::vector<vk_constant_type> constants(15);
    constants[10].i = bottom_top_blob.dims;
    constants[11].i = bottom_top_blob.w;
    constants[12].i = bottom_top_blob.h;
    constants[13].i = bottom_top_blob.c;
    constants[14].i = bottom_top_blob.cstep;

    const Pipeline* pipeline = packing == 4 ? pipeline_binaryop_pack4 : pipeline_binaryop;

    // record
    cmd.record_prepare_compute_barrier(bottom_top_blob);
    cmd.record_pipeline(pipeline, bindings, constants, bottom_top_blob);

    return 0;
}
#endif // NCNN_VULKAN

} // namespace ncnn
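
The functor-dispatch pattern used by binary_op can be tried without ncnn. The block below is a minimal sketch under stated assumptions, not part of binaryop.cpp: it uses std::vector<float> in place of Mat, and the names elementwise and rsub_op are hypothetical stand-ins introduced only for illustration. It covers just two of the shape cases above: equal sizes, and a single-value b broadcast across a.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical stand-in for a custom functor: reversed subtraction, y - x.
struct rsub_op
{
    float operator()(float x, float y) const { return y - x; }
};

// Apply Op elementwise to a and b.
// Assumption for this sketch: b either has the same length as a,
// or holds exactly one value that is broadcast over every element of a.
template<typename Op>
std::vector<float> elementwise(const std::vector<float>& a, const std::vector<float>& b)
{
    assert(b.size() == a.size() || b.size() == 1);

    Op op;
    std::vector<float> c(a.size());

    for (std::size_t i = 0; i < a.size(); i++)
    {
        // pick the broadcast scalar or the matching element
        const float rhs = b.size() == 1 ? b[0] : b[i];
        c[i] = op(a[i], rhs);
    }

    return c;
}
```

Calling elementwise< std::plus<float> >(a, s) with a one-element s adds that value to every element of a, and swapping in rsub_op selects the reversed subtraction without touching the loop. The operation is chosen at compile time through the template argument, which is why the layer's forward functions need one if branch per op_type.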