
gpu.cpp 241 kB

[WIP] vulkan compute (#618) * vulkan infrastructure * vkallocator and vkmat * layer interface for vulkan compute * wip... * default vulkan device, command wrapper, upload model weight in load_model to simplify layer interface * simplify command api, vkmat holds staging buffer, relu works * initialize specialization constant, simplify command dispatch, fix staging buffer copy with different shape, convolution works * init extension functions * dynamic local size and group count * group count=1 is invalid * regard device max workgroup size limit * fix relu oooops * decouple command record and staging allocation * create result blob * add pooling shader * buffer is faster than image :) * fix pooling shader * add innerproduct shader * readonly writeonly decoration * simplify buffer creation * decouple command and layer, VK_KHR_descriptor_update_template extension makes descriptor binding update easy :D * fix vulkan building issues in visual studio (#1) * fix building issues on visual studio * ignore benchmark * cancel changes * ... ... * decouple paramdict and vulkandevice * fix staging buffer destroy in model loading * remove vkdev member in option * add padding shader * simplify vulkan layer creation, simplify convolution and pooling shader for no padding, less debug output * add convolutiondepthwise and softmax shader * specialization float type, add leakyrelu * add dropout shader * add batchnorm shader * split vulkan forward * add scale shader * push constant type can be int or float * set_optimal_local_size_xyz * add eltwise shader * concat vulkan forward * fix convolution without bias * add dummy shader for concat and split, more fix ... 
* optional VK_KHR_descriptor_update_template and VK_KHR_push_descriptor * check VK_KHR_push_descriptor for vkCmdPushDescriptorSetWithTemplateKHR * binaryop and unaryop shader * hide raw command buffer * simple vkbenchncnn benchmark * create device with transfer queue * rename command to vkcompute, add vktransfer and layer upload_model interface * external VkMat, copy and map wrt buffer offset * command copy respect offset and size * decouple weight upload and load, simplify upload weight api, use one big staging buffer for uploading weights * fix build on android * binding count can not vary :( * barrier check state, fix sub-op destruction * declare local_size_xyz constant, fix crash on radv * fix local_size_xyz, second try * more barrier and state fix * fix softmax * reconstruct buffer memory allocator, reuse blob buffer, less verbose output * find unified memory type index * weight staging buffer allocator and weight buffer allocator, respect descriptor buffer offset alignment * use VK_KHR_descriptor_update_template for faster descriptor update if available, multithread pipeline creation * find more useful vulkan extensions and enable them * fix msvc build * respect VK_KHR_dedicated_allocation for weight buffer allocation * fix android build * fix bias name conflicts with metal * decouple pipeline and layer, building shader sources into shader module, dedicated create_pipeline api, simplify pipeline recording * drop dummy shader, inplace softmax, multiple shader module works * fix unique queue family index error * flatten support vulkan * mnasnet run * find shader module by name, each entry point per shader module, fix attribute/id conflict on moltenvk * some minor changes * add some high level api * use dedicated transfer queue to upload weight model * prefer mappable buffer on unified memory * global pooling and convolution fc, reuse staging buffer * implement ring-buffer style blob allocator, add VkBufferMemory capacity * use blob allocator for workspace blob, 
it works fine :) * vulkan option off * Update layer.cpp * fix build with vulkan off * less verbose output, fix crash on vulkan_compute off * merge benchncnn tool * allocator clear api, use new weight buffer allocator per net * add default locked allocator * mapped mat ptr api, persistent mapped memory works generally :) * travis ci linux vulkan * travis ci vulkan wip ... * more gpu wip ... * more gpu wip ... * wip... * wip... * wip... ... * wip... ios vulkan build... * find glslangValidator on ios build * use dynamic moltenvk library * travis ci wip ... * ios simulator does not support metal at all * fix cpu only extractor * optimize workgroup size, first try * optimize workgroup size, second try * conv1x1s1d1 vec4 * revert build system * fix ncnn2mem build * fix ncnn2mem build
7 years ago
adreno image shader + fp16 + fp16a (#1714) * wip * wip * fix * image and imageview can not be destroyed until command execution ends * fast copy path for tightly packed data * wip * texture load works * 1d 3d image * record clone image, multiple commands share one image reference * upload download image * layer forward accept vkimagemat * vkimagemat graph works * staging vkimagemat for passing dynamic parameters, macro for fp32+image shader, padding image shader * vkimagemat elemsize * convolution test pass * conv1x1s1 image shader * fast staging image allocator from host memory, pooling image shader * convolutiondepthwise image shader * innerproduct image shader * packing image shader * crop deconvolution image shader * resolve spirv binding types * image fp16 and fp16a, cast image shader * eltwise image shader * wip * absval image shader * deconvolutiondepthwise image shader * concat image shader, squeezenet works * noop split image shader * uniform precision hint * layer support_image_storage * wip * vulkan device utility operator * command is storage and packing option aware * fallback to cpu on image allocation failed, mobilenetssd works * flatten image shader, enable more test * ci test * check imgfp32 imgfp16 imgfp16a features * fix ci test * fix ci test * upgrade swiftshader * wip * opt aggressive * imgfp16p * opt none * convolution winograd image shader * fix flush range, fast copy path for continous buffer * minor fix * fix innerproduct * wip ... * wip * cast fix * packing test * wip * image fp16p is fp16p * wip * silence * more line info * code clean * softmax image shader
6 years ago
[WIP] vulkan compute (#618) * vulkan infrastructure * vkallocator and vkmat * layer interface for vulkan compute * wip... * default vulkan device, command wrapper, upload model weight in load_model to simplify layer interface * simplify command api, vkmat holds staging buffer, relu works * initialize specialization constant, simplify command dispatch, fix staging buffer copy with different shape, convolution works * init extension functions * dynamic local size and group count * group count=1 is invalid * regard device max workgroup size limit * fix relu oooops * decouple command record and staging allocation * create result blob * add pooling shader * buffer is faster than image :) * fix pooling shader * add innerproduct shader * readonly writeonly decoration * simplify buffer creation * decouple command and layer, VK_KHR_descriptor_update_template extension makes descriptor binding update easy :D * fix vulkan building issues in visual studio (#1) * fix building issues on visual studio * ignore benchmark * cancel changes * ... ... * decouple paramdict and vulkandevice * fix staging buffer destroy in model loading * remove vkdev member in option * add padding shader * simplify vulkan layer creation, simplify convolution and pooling shader for no padding, less debug output * add convolutiondepthwise and softmax shader * specialization float type, add leakyrelu * add dropout shader * add batchnorm shader * split vulkan forward * add scale shader * push constant type can be int or float * set_optimal_local_size_xyz * add eltwise shader * concat vulkan forward * fix convolution without bias * add dummy shader for concat and split, more fix ... 
* optional VK_KHR_descriptor_update_template and VK_KHR_push_descriptor * check VK_KHR_push_descriptor for vkCmdPushDescriptorSetWithTemplateKHR * binaryop and unaryop shader * hide raw command buffer * simple vkbenchncnn benchmark * create device with transfer queue * rename command to vkcompute, add vktransfer and layer upload_model interface * external VkMat, copy and map wrt buffer offset * command copy respect offset and size * decouple weight upload and load, simplify upload weight api, use one big staging buffer for uploading weights * fix build on android * binding count can not vary :( * barrier check state, fix sub-op destruction * declare local_size_xyz constant, fix crash on radv * fix local_size_xyz, second try * more barrier and state fix * fix softmax * reconstruct buffer memory allocator, reuse blob buffer, less verbose output * find unified memory type index * weight staging buffer allocator and weight buffer allocator, respect descriptor buffer offset alignment * use VK_KHR_descriptor_update_template for faster descriptor update if available, multithread pipeline creation * find more useful vulkan extensions and enable them * fix msvc build * respect VK_KHR_dedicated_allocation for weight buffer allocation * fix android build * fix bias name conflicts with metal * decouple pipeline and layer, building shader sources into shader module, dedicated create_pipeline api, simplify pipeline recording * drop dummy shader, inplace softmax, multiple shader module works * fix unique queue family index error * flatten support vulkan * mnasnet run * find shader module by name, each entry point per shader module, fix attribute/id conflict on moltenvk * some minor changes * add some high level api * use dedicated transfer queue to upload weight model * prefer mappable buffer on unified memory * global pooling and convolution fc, reuse staging buffer * implement ring-buffer style blob allocator, add VkBufferMemory capacity * use blob allocator for workspace blob, 
it works fine :) * vulkan option off * Update layer.cpp * fix build with vulkan off * less verbose output, fix crash on vulkan_compute off * merge benchncnn tool * allocator clear api, use new weight buffer allocator per net * add default locked allocator * mapped mat ptr api, persistent mapped memory works generally :) * travis ci linux vulkan * travis ci vulkan wip ... * more gpu wip ... * more gpu wip ... * wip... * wip... * wip... ... * wip... ios vulkan build... * find glslangValidator on ios build * use dynamic moltenvk library * travis ci wip ... * ios simulator does not support metal at all * fix cpu only extractor * optimize workgroup size, first try * optimize workgroup size, second try * conv1x1s1d1 vec4 * revert build system * fix ncnn2mem build * fix ncnn2mem build
7 years ago
[WIP] vulkan compute (#618) * vulkan infrastructure * vkallocator and vkmat * layer interface for vulkan compute * wip... * default vulkan device, command wrapper, upload model weight in load_model to simplify layer interface * simplify command api, vkmat holds staging buffer, relu works * initialize specialization constant, simplify command dispatch, fix staging buffer copy with different shape, convolution works * init extension functions * dynamic local size and group count * group count=1 is invalid * regard device max workgroup size limit * fix relu oooops * decouple command record and staging allocation * create result blob * add pooling shader * buffer is faster than image :) * fix pooling shader * add innerproduct shader * readonly writeonly decoration * simplify buffer creation * decouple command and layer, VK_KHR_descriptor_update_template extension makes descriptor binding update easy :D * fix vulkan building issues in visual studio (#1) * fix building issues on visual studio * ignore benchmark * cancel changes * ... ... * decouple paramdict and vulkandevice * fix staging buffer destroy in model loading * remove vkdev member in option * add padding shader * simplify vulkan layer creation, simplify convolution and pooling shader for no padding, less debug output * add convolutiondepthwise and softmax shader * specialization float type, add leakyrelu * add dropout shader * add batchnorm shader * split vulkan forward * add scale shader * push constant type can be int or float * set_optimal_local_size_xyz * add eltwise shader * concat vulkan forward * fix convolution without bias * add dummy shader for concat and split, more fix ... 
* optional VK_KHR_descriptor_update_template and VK_KHR_push_descriptor * check VK_KHR_push_descriptor for vkCmdPushDescriptorSetWithTemplateKHR * binaryop and unaryop shader * hide raw command buffer * simple vkbenchncnn benchmark * create device with transfer queue * rename command to vkcompute, add vktransfer and layer upload_model interface * external VkMat, copy and map wrt buffer offset * command copy respect offset and size * decouple weight upload and load, simplify upload weight api, use one big staging buffer for uploading weights * fix build on android * binding count can not vary :( * barrier check state, fix sub-op destruction * declare local_size_xyz constant, fix crash on radv * fix local_size_xyz, second try * more barrier and state fix * fix softmax * reconstruct buffer memory allocator, reuse blob buffer, less verbose output * find unified memory type index * weight staging buffer allocator and weight buffer allocator, respect descriptor buffer offset alignment * use VK_KHR_descriptor_update_template for faster descriptor update if available, multithread pipeline creation * find more useful vulkan extensions and enable them * fix msvc build * respect VK_KHR_dedicated_allocation for weight buffer allocation * fix android build * fix bias name conflicts with metal * decouple pipeline and layer, building shader sources into shader module, dedicated create_pipeline api, simplify pipeline recording * drop dummy shader, inplace softmax, multiple shader module works * fix unique queue family index error * flatten support vulkan * mnasnet run * find shader module by name, each entry point per shader module, fix attribute/id conflict on moltenvk * some minor changes * add some high level api * use dedicated transfer queue to upload weight model * prefer mappable buffer on unified memory * global pooling and convolution fc, reuse staging buffer * implement ring-buffer style blob allocator, add VkBufferMemory capacity * use blob allocator for workspace blob, 
it works fine :) * vulkan option off * Update layer.cpp * fix build with vulkan off * less verbose output, fix crash on vulkan_compute off * merge benchncnn tool * allocator clear api, use new weight buffer allocator per net * add default locked allocator * mapped mat ptr api, persistent mapped memory works generally :) * travis ci linux vulkan * travis ci vulkan wip ... * more gpu wip ... * more gpu wip ... * wip... * wip... * wip... ... * wip... ios vulkan build... * find glslangValidator on ios build * use dynamic moltenvk library * travis ci wip ... * ios simulator does not support metal at all * fix cpu only extractor * optimize workgroup size, first try * optimize workgroup size, second try * conv1x1s1d1 vec4 * revert build system * fix ncnn2mem build * fix ncnn2mem build
7 years ago
[WIP] vulkan compute (#618) * vulkan infrastructure * vkallocator and vkmat * layer interface for vulkan compute * wip... * default vulkan device, command wrapper, upload model weight in load_model to simplify layer interface * simplify command api, vkmat holds staging buffer, relu works * initialize specialization constant, simplify command dispatch, fix staging buffer copy with different shape, convolution works * init extension functions * dynamic local size and group count * group count=1 is invalid * regard device max workgroup size limit * fix relu oooops * decouple command record and staging allocation * create result blob * add pooling shader * buffer is faster than image :) * fix pooling shader * add innerproduct shader * readonly writeonly decoration * simplify buffer creation * decouple command and layer, VK_KHR_descriptor_update_template extension makes descriptor binding update easy :D * fix vulkan building issues in visual studio (#1) * fix building issues on visual studio * ignore benchmark * cancel changes * ... ... * decouple paramdict and vulkandevice * fix staging buffer destroy in model loading * remove vkdev member in option * add padding shader * simplify vulkan layer creation, simplify convolution and pooling shader for no padding, less debug output * add convolutiondepthwise and softmax shader * specialization float type, add leakyrelu * add dropout shader * add batchnorm shader * split vulkan forward * add scale shader * push constant type can be int or float * set_optimal_local_size_xyz * add eltwise shader * concat vulkan forward * fix convolution without bias * add dummy shader for concat and split, more fix ... 
* optional VK_KHR_descriptor_update_template and VK_KHR_push_descriptor * check VK_KHR_push_descriptor for vkCmdPushDescriptorSetWithTemplateKHR * binaryop and unaryop shader * hide raw command buffer * simple vkbenchncnn benchmark * create device with transfer queue * rename command to vkcompute, add vktransfer and layer upload_model interface * external VkMat, copy and map wrt buffer offset * command copy respect offset and size * decouple weight upload and load, simplify upload weight api, use one big staging buffer for uploading weights * fix build on android * binding count can not vary :( * barrier check state, fix sub-op destruction * declare local_size_xyz constant, fix crash on radv * fix local_size_xyz, second try * more barrier and state fix * fix softmax * reconstruct buffer memory allocator, reuse blob buffer, less verbose output * find unified memory type index * weight staging buffer allocator and weight buffer allocator, respect descriptor buffer offset alignment * use VK_KHR_descriptor_update_template for faster descriptor update if available, multithread pipeline creation * find more useful vulkan extensions and enable them * fix msvc build * respect VK_KHR_dedicated_allocation for weight buffer allocation * fix android build * fix bias name conflicts with metal * decouple pipeline and layer, building shader sources into shader module, dedicated create_pipeline api, simplify pipeline recording * drop dummy shader, inplace softmax, multiple shader module works * fix unique queue family index error * flatten support vulkan * mnasnet run * find shader module by name, each entry point per shader module, fix attribute/id conflict on moltenvk * some minor changes * add some high level api * use dedicated transfer queue to upload weight model * prefer mappable buffer on unified memory * global pooling and convolution fc, reuse staging buffer * implement ring-buffer style blob allocator, add VkBufferMemory capacity * use blob allocator for workspace blob, 
it works fine :) * vulkan option off * Update layer.cpp * fix build with vulkan off * less verbose output, fix crash on vulkan_compute off * merge benchncnn tool * allocator clear api, use new weight buffer allocator per net * add default locked allocator * mapped mat ptr api, persistent mapped memory works generally :) * travis ci linux vulkan * travis ci vulkan wip ... * more gpu wip ... * more gpu wip ... * wip... * wip... * wip... ... * wip... ios vulkan build... * find glslangValidator on ios build * use dynamic moltenvk library * travis ci wip ... * ios simulator does not support metal at all * fix cpu only extractor * optimize workgroup size, first try * optimize workgroup size, second try * conv1x1s1d1 vec4 * revert build system * fix ncnn2mem build * fix ncnn2mem build
7 years ago
[WIP] vulkan compute (#618) * vulkan infrastructure * vkallocator and vkmat * layer interface for vulkan compute * wip... * default vulkan device, command wrapper, upload model weight in load_model to simplify layer interface * simplify command api, vkmat holds staging buffer, relu works * initialize specialization constant, simplify command dispatch, fix staging buffer copy with different shape, convolution works * init extension functions * dynamic local size and group count * group count=1 is invalid * regard device max workgroup size limit * fix relu oooops * decouple command record and staging allocation * create result blob * add pooling shader * buffer is faster than image :) * fix pooling shader * add innerproduct shader * readonly writeonly decoration * simplify buffer creation * decouple command and layer, VK_KHR_descriptor_update_template extension makes descriptor binding update easy :D * fix vulkan building issues in visual studio (#1) * fix building issues on visual studio * ignore benchmark * cancel changes * ... ... * decouple paramdict and vulkandevice * fix staging buffer destroy in model loading * remove vkdev member in option * add padding shader * simplify vulkan layer creation, simplify convolution and pooling shader for no padding, less debug output * add convolutiondepthwise and softmax shader * specialization float type, add leakyrelu * add dropout shader * add batchnorm shader * split vulkan forward * add scale shader * push constant type can be int or float * set_optimal_local_size_xyz * add eltwise shader * concat vulkan forward * fix convolution without bias * add dummy shader for concat and split, more fix ... 
* optional VK_KHR_descriptor_update_template and VK_KHR_push_descriptor * check VK_KHR_push_descriptor for vkCmdPushDescriptorSetWithTemplateKHR * binaryop and unaryop shader * hide raw command buffer * simple vkbenchncnn benchmark * create device with transfer queue * rename command to vkcompute, add vktransfer and layer upload_model interface * external VkMat, copy and map wrt buffer offset * command copy respect offset and size * decouple weight upload and load, simplify upload weight api, use one big staging buffer for uploading weights * fix build on android * binding count can not vary :( * barrier check state, fix sub-op destruction * declare local_size_xyz constant, fix crash on radv * fix local_size_xyz, second try * more barrier and state fix * fix softmax * reconstruct buffer memory allocator, reuse blob buffer, less verbose output * find unified memory type index * weight staging buffer allocator and weight buffer allocator, respect descriptor buffer offset alignment * use VK_KHR_descriptor_update_template for faster descriptor update if available, multithread pipeline creation * find more useful vulkan extensions and enable them * fix msvc build * respect VK_KHR_dedicated_allocation for weight buffer allocation * fix android build * fix bias name conflicts with metal * decouple pipeline and layer, building shader sources into shader module, dedicated create_pipeline api, simplify pipeline recording * drop dummy shader, inplace softmax, multiple shader module works * fix unique queue family index error * flatten support vulkan * mnasnet run * find shader module by name, each entry point per shader module, fix attribute/id conflict on moltenvk * some minor changes * add some high level api * use dedicated transfer queue to upload weight model * prefer mappable buffer on unified memory * global pooling and convolution fc, reuse staging buffer * implement ring-buffer style blob allocator, add VkBufferMemory capacity * use blob allocator for workspace blob, 
it works fine :) * vulkan option off * Update layer.cpp * fix build with vulkan off * less verbose output, fix crash on vulkan_compute off * merge benchncnn tool * allocator clear api, use new weight buffer allocator per net * add default locked allocator * mapped mat ptr api, persistent mapped memory works generally :) * travis ci linux vulkan * travis ci vulkan wip ... * more gpu wip ... * more gpu wip ... * wip... * wip... * wip... ... * wip... ios vulkan build... * find glslangValidator on ios build * use dynamic moltenvk library * travis ci wip ... * ios simulator does not support metal at all * fix cpu only extractor * optimize workgroup size, first try * optimize workgroup size, second try * conv1x1s1d1 vec4 * revert build system * fix ncnn2mem build * fix ncnn2mem build
7 years ago
[WIP] vulkan compute (#618) * vulkan infrastructure * vkallocator and vkmat * layer interface for vulkan compute * wip... * default vulkan device, command wrapper, upload model weight in load_model to simplify layer interface * simplify command api, vkmat holds staging buffer, relu works * initialize specialization constant, simplify command dispatch, fix staging buffer copy with different shape, convolution works * init extension functions * dynamic local size and group count * group count=1 is invalid * regard device max workgroup size limit * fix relu oooops * decouple command record and staging allocation * create result blob * add pooling shader * buffer is faster than image :) * fix pooling shader * add innerproduct shader * readonly writeonly decoration * simplify buffer creation * decouple command and layer, VK_KHR_descriptor_update_template extension makes descriptor binding update easy :D * fix vulkan building issues in visual studio (#1) * fix building issues on visual studio * ignore benchmark * cancel changes * ... ... * decouple paramdict and vulkandevice * fix staging buffer destroy in model loading * remove vkdev member in option * add padding shader * simplify vulkan layer creation, simplify convolution and pooling shader for no padding, less debug output * add convolutiondepthwise and softmax shader * specialization float type, add leakyrelu * add dropout shader * add batchnorm shader * split vulkan forward * add scale shader * push constant type can be int or float * set_optimal_local_size_xyz * add eltwise shader * concat vulkan forward * fix convolution without bias * add dummy shader for concat and split, more fix ... 
* optional VK_KHR_descriptor_update_template and VK_KHR_push_descriptor * check VK_KHR_push_descriptor for vkCmdPushDescriptorSetWithTemplateKHR * binaryop and unaryop shader * hide raw command buffer * simple vkbenchncnn benchmark * create device with transfer queue * rename command to vkcompute, add vktransfer and layer upload_model interface * external VkMat, copy and map wrt buffer offset * command copy respect offset and size * decouple weight upload and load, simplify upload weight api, use one big staging buffer for uploading weights * fix build on android * binding count can not vary :( * barrier check state, fix sub-op destruction * declare local_size_xyz constant, fix crash on radv * fix local_size_xyz, second try * more barrier and state fix * fix softmax * reconstruct buffer memory allocator, reuse blob buffer, less verbose output * find unified memory type index * weight staging buffer allocator and weight buffer allocator, respect descriptor buffer offset alignment * use VK_KHR_descriptor_update_template for faster descriptor update if available, multithread pipeline creation * find more useful vulkan extensions and enable them * fix msvc build * respect VK_KHR_dedicated_allocation for weight buffer allocation * fix android build * fix bias name conflicts with metal * decouple pipeline and layer, building shader sources into shader module, dedicated create_pipeline api, simplify pipeline recording * drop dummy shader, inplace softmax, multiple shader module works * fix unique queue family index error * flatten support vulkan * mnasnet run * find shader module by name, each entry point per shader module, fix attribute/id conflict on moltenvk * some minor changes * add some high level api * use dedicated transfer queue to upload weight model * prefer mappable buffer on unified memory * global pooling and convolution fc, reuse staging buffer * implement ring-buffer style blob allocator, add VkBufferMemory capacity * use blob allocator for workspace blob, 
it works fine :) * vulkan option off * Update layer.cpp * fix build with vulkan off * less verbose output, fix crash on vulkan_compute off * merge benchncnn tool * allocator clear api, use new weight buffer allocator per net * add default locked allocator * mapped mat ptr api, persistent mapped memory works generally :) * travis ci linux vulkan * travis ci vulkan wip ... * more gpu wip ... * more gpu wip ... * wip... * wip... * wip... ... * wip... ios vulkan build... * find glslangValidator on ios build * use dynamic moltenvk library * travis ci wip ... * ios simulator does not support metal at all * fix cpu only extractor * optimize workgroup size, first try * optimize workgroup size, second try * conv1x1s1d1 vec4 * revert build system * fix ncnn2mem build * fix ncnn2mem build
7 years ago
[WIP] vulkan compute (#618) * vulkan infrastructure * vkallocator and vkmat * layer interface for vulkan compute * wip... * default vulkan device, command wrapper, upload model weight in load_model to simplify layer interface * simplify command api, vkmat holds staging buffer, relu works * initialize specialization constant, simplify command dispatch, fix staging buffer copy with different shape, convolution works * init extension functions * dynamic local size and group count * group count=1 is invalid * regard device max workgroup size limit * fix relu oooops * decouple command record and staging allocation * create result blob * add pooling shader * buffer is faster than image :) * fix pooling shader * add innerproduct shader * readonly writeonly decoration * simplify buffer creation * decouple command and layer, VK_KHR_descriptor_update_template extension makes descriptor binding update easy :D * fix vulkan building issues in visual studio (#1) * fix building issues on visual studio * ignore benchmark * cancel changes * ... ... * decouple paramdict and vulkandevice * fix staging buffer destroy in model loading * remove vkdev member in option * add padding shader * simplify vulkan layer creation, simplify convolution and pooling shader for no padding, less debug output * add convolutiondepthwise and softmax shader * specialization float type, add leakyrelu * add dropout shader * add batchnorm shader * split vulkan forward * add scale shader * push constant type can be int or float * set_optimal_local_size_xyz * add eltwise shader * concat vulkan forward * fix convolution without bias * add dummy shader for concat and split, more fix ... 
* optional VK_KHR_descriptor_update_template and VK_KHR_push_descriptor * check VK_KHR_push_descriptor for vkCmdPushDescriptorSetWithTemplateKHR * binaryop and unaryop shader * hide raw command buffer * simple vkbenchncnn benchmark * create device with transfer queue * rename command to vkcompute, add vktransfer and layer upload_model interface * external VkMat, copy and map wrt buffer offset * command copy respect offset and size * decouple weight upload and load, simplify upload weight api, use one big staging buffer for uploading weights * fix build on android * binding count can not vary :( * barrier check state, fix sub-op destruction * declare local_size_xyz constant, fix crash on radv * fix local_size_xyz, second try * more barrier and state fix * fix softmax * reconstruct buffer memory allocator, reuse blob buffer, less verbose output * find unified memory type index * weight staging buffer allocator and weight buffer allocator, respect descriptor buffer offset alignment * use VK_KHR_descriptor_update_template for faster descriptor update if available, multithread pipeline creation * find more useful vulkan extensions and enable them * fix msvc build * respect VK_KHR_dedicated_allocation for weight buffer allocation * fix android build * fix bias name conflicts with metal * decouple pipeline and layer, building shader sources into shader module, dedicated create_pipeline api, simplify pipeline recording * drop dummy shader, inplace softmax, multiple shader module works * fix unique queue family index error * flatten support vulkan * mnasnet run * find shader module by name, each entry point per shader module, fix attribute/id conflict on moltenvk * some minor changes * add some high level api * use dedicated transfer queue to upload weight model * prefer mappable buffer on unified memory * global pooling and convolution fc, reuse staging buffer * implement ring-buffer style blob allocator, add VkBufferMemory capacity * use blob allocator for workspace blob, 
it works fine :) * vulkan option off * Update layer.cpp * fix build with vulkan off * less verbose output, fix crash on vulkan_compute off * merge benchncnn tool * allocator clear api, use new weight buffer allocator per net * add default locked allocator * mapped mat ptr api, persistent mapped memory works generally :) * travis ci linux vulkan * travis ci vulkan wip ... * more gpu wip ... * more gpu wip ... * wip... * wip... * wip... ... * wip... ios vulkan build... * find glslangValidator on ios build * use dynamic moltenvk library * travis ci wip ... * ios simulator does not support metal at all * fix cpu only extractor * optimize workgroup size, first try * optimize workgroup size, second try * conv1x1s1d1 vec4 * revert build system * fix ncnn2mem build * fix ncnn2mem build
7 years ago
[WIP] vulkan compute (#618) * vulkan infrastructure * vkallocator and vkmat * layer interface for vulkan compute * wip... * default vulkan device, command wrapper, upload model weight in load_model to simplify layer interface * simplify command api, vkmat holds staging buffer, relu works * initialize specialization constant, simplify command dispatch, fix staging buffer copy with different shape, convolution works * init extension functions * dynamic local size and group count * group count=1 is invalid * regard device max workgroup size limit * fix relu oooops * decouple command record and staging allocation * create result blob * add pooling shader * buffer is faster than image :) * fix pooling shader * add innerproduct shader * readonly writeonly decoration * simplify buffer creation * decouple command and layer, VK_KHR_descriptor_update_template extension makes descriptor binding update easy :D * fix vulkan building issues in visual studio (#1) * fix building issues on visual studio * ignore benchmark * cancel changes * ... ... * decouple paramdict and vulkandevice * fix staging buffer destroy in model loading * remove vkdev member in option * add padding shader * simplify vulkan layer creation, simplify convolution and pooling shader for no padding, less debug output * add convolutiondepthwise and softmax shader * specialization float type, add leakyrelu * add dropout shader * add batchnorm shader * split vulkan forward * add scale shader * push constant type can be int or float * set_optimal_local_size_xyz * add eltwise shader * concat vulkan forward * fix convolution without bias * add dummy shader for concat and split, more fix ... 
* optional VK_KHR_descriptor_update_template and VK_KHR_push_descriptor * check VK_KHR_push_descriptor for vkCmdPushDescriptorSetWithTemplateKHR * binaryop and unaryop shader * hide raw command buffer * simple vkbenchncnn benchmark * create device with transfer queue * rename command to vkcompute, add vktransfer and layer upload_model interface * external VkMat, copy and map wrt buffer offset * command copy respect offset and size * decouple weight upload and load, simplify upload weight api, use one big staging buffer for uploading weights * fix build on android * binding count can not vary :( * barrier check state, fix sub-op destruction * declare local_size_xyz constant, fix crash on radv * fix local_size_xyz, second try * more barrier and state fix * fix softmax * reconstruct buffer memory allocator, reuse blob buffer, less verbose output * find unified memory type index * weight staging buffer allocator and weight buffer allocator, respect descriptor buffer offset alignment * use VK_KHR_descriptor_update_template for faster descriptor update if available, multithread pipeline creation * find more useful vulkan extensions and enable them * fix msvc build * respect VK_KHR_dedicated_allocation for weight buffer allocation * fix android build * fix bias name conflicts with metal * decouple pipeline and layer, building shader sources into shader module, dedicated create_pipeline api, simplify pipeline recording * drop dummy shader, inplace softmax, multiple shader module works * fix unique queue family index error * flatten support vulkan * mnasnet run * find shader module by name, each entry point per shader module, fix attribute/id conflict on moltenvk * some minor changes * add some high level api * use dedicated transfer queue to upload weight model * prefer mappable buffer on unified memory * global pooling and convolution fc, reuse staging buffer * implement ring-buffer style blob allocator, add VkBufferMemory capacity * use blob allocator for workspace blob, 
it works fine :) * vulkan option off * Update layer.cpp * fix build with vulkan off * less verbose output, fix crash on vulkan_compute off * merge benchncnn tool * allocator clear api, use new weight buffer allocator per net * add default locked allocator * mapped mat ptr api, persistent mapped memory works generally :) * travis ci linux vulkan * travis ci vulkan wip ... * more gpu wip ... * more gpu wip ... * wip... * wip... * wip... ... * wip... ios vulkan build... * find glslangValidator on ios build * use dynamic moltenvk library * travis ci wip ... * ios simulator does not support metal at all * fix cpu only extractor * optimize workgroup size, first try * optimize workgroup size, second try * conv1x1s1d1 vec4 * revert build system * fix ncnn2mem build * fix ncnn2mem build
7 years ago
adreno image shader + fp16 + fp16a (#1714) * wip * wip * fix * image and imageview cannot be destroyed until command execution ends * fast copy path for tightly packed data * wip * texture load works * 1d 3d image * record clone image, multiple commands share one image reference * upload download image * layer forward accept vkimagemat * vkimagemat graph works * staging vkimagemat for passing dynamic parameters, macro for fp32+image shader, padding image shader * vkimagemat elemsize * convolution test pass * conv1x1s1 image shader * fast staging image allocator from host memory, pooling image shader * convolutiondepthwise image shader * innerproduct image shader * packing image shader * crop deconvolution image shader * resolve spirv binding types * image fp16 and fp16a, cast image shader * eltwise image shader * wip * absval image shader * deconvolutiondepthwise image shader * concat image shader, squeezenet works * noop split image shader * uniform precision hint * layer support_image_storage * wip * vulkan device utility operator * command is storage and packing option aware * fallback to cpu on image allocation failed, mobilenetssd works * flatten image shader, enable more test * ci test * check imgfp32 imgfp16 imgfp16a features * fix ci test * fix ci test * upgrade swiftshader * wip * opt aggressive * imgfp16p * opt none * convolution winograd image shader * fix flush range, fast copy path for continuous buffer * minor fix * fix innerproduct * wip ... * wip * cast fix * packing test * wip * image fp16p is fp16p * wip * silence * more line info * code clean * softmax image shader
6 years ago
adreno image shader + fp16 + fp16a (#1714) * wip * wip * fix * image and imageview can not be destroyed until command execution ends * fast copy path for tightly packed data * wip * texture load works * 1d 3d image * record clone image, multiple commands share one image reference * upload download image * layer forward accept vkimagemat * vkimagemat graph works * staging vkimagemat for passing dynamic parameters, macro for fp32+image shader, padding image shader * vkimagemat elemsize * convolution test pass * conv1x1s1 image shader * fast staging image allocator from host memory, pooling image shader * convolutiondepthwise image shader * innerproduct image shader * packing image shader * crop deconvolution image shader * resolve spirv binding types * image fp16 and fp16a, cast image shader * eltwise image shader * wip * absval image shader * deconvolutiondepthwise image shader * concat image shader, squeezenet works * noop split image shader * uniform precision hint * layer support_image_storage * wip * vulkan device utility operator * command is storage and packing option aware * fallback to cpu on image allocation failed, mobilenetssd works * flatten image shader, enable more test * ci test * check imgfp32 imgfp16 imgfp16a features * fix ci test * fix ci test * upgrade swiftshader * wip * opt aggressive * imgfp16p * opt none * convolution winograd image shader * fix flush range, fast copy path for continous buffer * minor fix * fix innerproduct * wip ... * wip * cast fix * packing test * wip * image fp16p is fp16p * wip * silence * more line info * code clean * softmax image shader
6 years ago
adreno image shader + fp16 + fp16a (#1714) * wip * wip * fix * image and imageview can not be destroyed until command execution ends * fast copy path for tightly packed data * wip * texture load works * 1d 3d image * record clone image, multiple commands share one image reference * upload download image * layer forward accept vkimagemat * vkimagemat graph works * staging vkimagemat for passing dynamic parameters, macro for fp32+image shader, padding image shader * vkimagemat elemsize * convolution test pass * conv1x1s1 image shader * fast staging image allocator from host memory, pooling image shader * convolutiondepthwise image shader * innerproduct image shader * packing image shader * crop deconvolution image shader * resolve spirv binding types * image fp16 and fp16a, cast image shader * eltwise image shader * wip * absval image shader * deconvolutiondepthwise image shader * concat image shader, squeezenet works * noop split image shader * uniform precision hint * layer support_image_storage * wip * vulkan device utility operator * command is storage and packing option aware * fallback to cpu on image allocation failed, mobilenetssd works * flatten image shader, enable more test * ci test * check imgfp32 imgfp16 imgfp16a features * fix ci test * fix ci test * upgrade swiftshader * wip * opt aggressive * imgfp16p * opt none * convolution winograd image shader * fix flush range, fast copy path for continous buffer * minor fix * fix innerproduct * wip ... * wip * cast fix * packing test * wip * image fp16p is fp16p * wip * silence * more line info * code clean * softmax image shader
6 years ago
5 years ago
adreno image shader + fp16 + fp16a (#1714) * wip * wip * fix * image and imageview can not be destroyed until command execution ends * fast copy path for tightly packed data * wip * texture load works * 1d 3d image * record clone image, multiple commands share one image reference * upload download image * layer forward accept vkimagemat * vkimagemat graph works * staging vkimagemat for passing dynamic parameters, macro for fp32+image shader, padding image shader * vkimagemat elemsize * convolution test pass * conv1x1s1 image shader * fast staging image allocator from host memory, pooling image shader * convolutiondepthwise image shader * innerproduct image shader * packing image shader * crop deconvolution image shader * resolve spirv binding types * image fp16 and fp16a, cast image shader * eltwise image shader * wip * absval image shader * deconvolutiondepthwise image shader * concat image shader, squeezenet works * noop split image shader * uniform precision hint * layer support_image_storage * wip * vulkan device utility operator * command is storage and packing option aware * fallback to cpu on image allocation failed, mobilenetssd works * flatten image shader, enable more test * ci test * check imgfp32 imgfp16 imgfp16a features * fix ci test * fix ci test * upgrade swiftshader * wip * opt aggressive * imgfp16p * opt none * convolution winograd image shader * fix flush range, fast copy path for continous buffer * minor fix * fix innerproduct * wip ... * wip * cast fix * packing test * wip * image fp16p is fp16p * wip * silence * more line info * code clean * softmax image shader
6 years ago
adreno image shader + fp16 + fp16a (#1714) * wip * wip * fix * image and imageview can not be destroyed until command execution ends * fast copy path for tightly packed data * wip * texture load works * 1d 3d image * record clone image, multiple commands share one image reference * upload download image * layer forward accept vkimagemat * vkimagemat graph works * staging vkimagemat for passing dynamic parameters, macro for fp32+image shader, padding image shader * vkimagemat elemsize * convolution test pass * conv1x1s1 image shader * fast staging image allocator from host memory, pooling image shader * convolutiondepthwise image shader * innerproduct image shader * packing image shader * crop deconvolution image shader * resolve spirv binding types * image fp16 and fp16a, cast image shader * eltwise image shader * wip * absval image shader * deconvolutiondepthwise image shader * concat image shader, squeezenet works * noop split image shader * uniform precision hint * layer support_image_storage * wip * vulkan device utility operator * command is storage and packing option aware * fallback to cpu on image allocation failed, mobilenetssd works * flatten image shader, enable more test * ci test * check imgfp32 imgfp16 imgfp16a features * fix ci test * fix ci test * upgrade swiftshader * wip * opt aggressive * imgfp16p * opt none * convolution winograd image shader * fix flush range, fast copy path for continous buffer * minor fix * fix innerproduct * wip ... * wip * cast fix * packing test * wip * image fp16p is fp16p * wip * silence * more line info * code clean * softmax image shader
6 years ago
adreno image shader + fp16 + fp16a (#1714) * wip * wip * fix * image and imageview can not be destroyed until command execution ends * fast copy path for tightly packed data * wip * texture load works * 1d 3d image * record clone image, multiple commands share one image reference * upload download image * layer forward accept vkimagemat * vkimagemat graph works * staging vkimagemat for passing dynamic parameters, macro for fp32+image shader, padding image shader * vkimagemat elemsize * convolution test pass * conv1x1s1 image shader * fast staging image allocator from host memory, pooling image shader * convolutiondepthwise image shader * innerproduct image shader * packing image shader * crop deconvolution image shader * resolve spirv binding types * image fp16 and fp16a, cast image shader * eltwise image shader * wip * absval image shader * deconvolutiondepthwise image shader * concat image shader, squeezenet works * noop split image shader * uniform precision hint * layer support_image_storage * wip * vulkan device utility operator * command is storage and packing option aware * fallback to cpu on image allocation failed, mobilenetssd works * flatten image shader, enable more test * ci test * check imgfp32 imgfp16 imgfp16a features * fix ci test * fix ci test * upgrade swiftshader * wip * opt aggressive * imgfp16p * opt none * convolution winograd image shader * fix flush range, fast copy path for continous buffer * minor fix * fix innerproduct * wip ... * wip * cast fix * packing test * wip * image fp16p is fp16p * wip * silence * more line info * code clean * softmax image shader
6 years ago
adreno image shader + fp16 + fp16a (#1714) * wip * wip * fix * image and imageview can not be destroyed until command execution ends * fast copy path for tightly packed data * wip * texture load works * 1d 3d image * record clone image, multiple commands share one image reference * upload download image * layer forward accept vkimagemat * vkimagemat graph works * staging vkimagemat for passing dynamic parameters, macro for fp32+image shader, padding image shader * vkimagemat elemsize * convolution test pass * conv1x1s1 image shader * fast staging image allocator from host memory, pooling image shader * convolutiondepthwise image shader * innerproduct image shader * packing image shader * crop deconvolution image shader * resolve spirv binding types * image fp16 and fp16a, cast image shader * eltwise image shader * wip * absval image shader * deconvolutiondepthwise image shader * concat image shader, squeezenet works * noop split image shader * uniform precision hint * layer support_image_storage * wip * vulkan device utility operator * command is storage and packing option aware * fallback to cpu on image allocation failed, mobilenetssd works * flatten image shader, enable more test * ci test * check imgfp32 imgfp16 imgfp16a features * fix ci test * fix ci test * upgrade swiftshader * wip * opt aggressive * imgfp16p * opt none * convolution winograd image shader * fix flush range, fast copy path for continous buffer * minor fix * fix innerproduct * wip ... * wip * cast fix * packing test * wip * image fp16p is fp16p * wip * silence * more line info * code clean * softmax image shader
6 years ago
adreno image shader + fp16 + fp16a (#1714) * wip * wip * fix * image and imageview can not be destroyed until command execution ends * fast copy path for tightly packed data * wip * texture load works * 1d 3d image * record clone image, multiple commands share one image reference * upload download image * layer forward accept vkimagemat * vkimagemat graph works * staging vkimagemat for passing dynamic parameters, macro for fp32+image shader, padding image shader * vkimagemat elemsize * convolution test pass * conv1x1s1 image shader * fast staging image allocator from host memory, pooling image shader * convolutiondepthwise image shader * innerproduct image shader * packing image shader * crop deconvolution image shader * resolve spirv binding types * image fp16 and fp16a, cast image shader * eltwise image shader * wip * absval image shader * deconvolutiondepthwise image shader * concat image shader, squeezenet works * noop split image shader * uniform precision hint * layer support_image_storage * wip * vulkan device utility operator * command is storage and packing option aware * fallback to cpu on image allocation failed, mobilenetssd works * flatten image shader, enable more test * ci test * check imgfp32 imgfp16 imgfp16a features * fix ci test * fix ci test * upgrade swiftshader * wip * opt aggressive * imgfp16p * opt none * convolution winograd image shader * fix flush range, fast copy path for continous buffer * minor fix * fix innerproduct * wip ... * wip * cast fix * packing test * wip * image fp16p is fp16p * wip * silence * more line info * code clean * softmax image shader
6 years ago
adreno image shader + fp16 + fp16a (#1714) * wip * wip * fix * image and imageview can not be destroyed until command execution ends * fast copy path for tightly packed data * wip * texture load works * 1d 3d image * record clone image, multiple commands share one image reference * upload download image * layer forward accept vkimagemat * vkimagemat graph works * staging vkimagemat for passing dynamic parameters, macro for fp32+image shader, padding image shader * vkimagemat elemsize * convolution test pass * conv1x1s1 image shader * fast staging image allocator from host memory, pooling image shader * convolutiondepthwise image shader * innerproduct image shader * packing image shader * crop deconvolution image shader * resolve spirv binding types * image fp16 and fp16a, cast image shader * eltwise image shader * wip * absval image shader * deconvolutiondepthwise image shader * concat image shader, squeezenet works * noop split image shader * uniform precision hint * layer support_image_storage * wip * vulkan device utility operator * command is storage and packing option aware * fallback to cpu on image allocation failed, mobilenetssd works * flatten image shader, enable more test * ci test * check imgfp32 imgfp16 imgfp16a features * fix ci test * fix ci test * upgrade swiftshader * wip * opt aggressive * imgfp16p * opt none * convolution winograd image shader * fix flush range, fast copy path for continous buffer * minor fix * fix innerproduct * wip ... * wip * cast fix * packing test * wip * image fp16p is fp16p * wip * silence * more line info * code clean * softmax image shader
6 years ago
adreno image shader + fp16 + fp16a (#1714) * wip * wip * fix * image and imageview can not be destroyed until command execution ends * fast copy path for tightly packed data * wip * texture load works * 1d 3d image * record clone image, multiple commands share one image reference * upload download image * layer forward accept vkimagemat * vkimagemat graph works * staging vkimagemat for passing dynamic parameters, macro for fp32+image shader, padding image shader * vkimagemat elemsize * convolution test pass * conv1x1s1 image shader * fast staging image allocator from host memory, pooling image shader * convolutiondepthwise image shader * innerproduct image shader * packing image shader * crop deconvolution image shader * resolve spirv binding types * image fp16 and fp16a, cast image shader * eltwise image shader * wip * absval image shader * deconvolutiondepthwise image shader * concat image shader, squeezenet works * noop split image shader * uniform precision hint * layer support_image_storage * wip * vulkan device utility operator * command is storage and packing option aware * fallback to cpu on image allocation failed, mobilenetssd works * flatten image shader, enable more test * ci test * check imgfp32 imgfp16 imgfp16a features * fix ci test * fix ci test * upgrade swiftshader * wip * opt aggressive * imgfp16p * opt none * convolution winograd image shader * fix flush range, fast copy path for continous buffer * minor fix * fix innerproduct * wip ... * wip * cast fix * packing test * wip * image fp16p is fp16p * wip * silence * more line info * code clean * softmax image shader
6 years ago
adreno image shader + fp16 + fp16a (#1714) * wip * wip * fix * image and imageview can not be destroyed until command execution ends * fast copy path for tightly packed data * wip * texture load works * 1d 3d image * record clone image, multiple commands share one image reference * upload download image * layer forward accept vkimagemat * vkimagemat graph works * staging vkimagemat for passing dynamic parameters, macro for fp32+image shader, padding image shader * vkimagemat elemsize * convolution test pass * conv1x1s1 image shader * fast staging image allocator from host memory, pooling image shader * convolutiondepthwise image shader * innerproduct image shader * packing image shader * crop deconvolution image shader * resolve spirv binding types * image fp16 and fp16a, cast image shader * eltwise image shader * wip * absval image shader * deconvolutiondepthwise image shader * concat image shader, squeezenet works * noop split image shader * uniform precision hint * layer support_image_storage * wip * vulkan device utility operator * command is storage and packing option aware * fallback to cpu on image allocation failed, mobilenetssd works * flatten image shader, enable more test * ci test * check imgfp32 imgfp16 imgfp16a features * fix ci test * fix ci test * upgrade swiftshader * wip * opt aggressive * imgfp16p * opt none * convolution winograd image shader * fix flush range, fast copy path for continous buffer * minor fix * fix innerproduct * wip ... * wip * cast fix * packing test * wip * image fp16p is fp16p * wip * silence * more line info * code clean * softmax image shader
6 years ago
adreno image shader + fp16 + fp16a (#1714) * wip * wip * fix * image and imageview can not be destroyed until command execution ends * fast copy path for tightly packed data * wip * texture load works * 1d 3d image * record clone image, multiple commands share one image reference * upload download image * layer forward accept vkimagemat * vkimagemat graph works * staging vkimagemat for passing dynamic parameters, macro for fp32+image shader, padding image shader * vkimagemat elemsize * convolution test pass * conv1x1s1 image shader * fast staging image allocator from host memory, pooling image shader * convolutiondepthwise image shader * innerproduct image shader * packing image shader * crop deconvolution image shader * resolve spirv binding types * image fp16 and fp16a, cast image shader * eltwise image shader * wip * absval image shader * deconvolutiondepthwise image shader * concat image shader, squeezenet works * noop split image shader * uniform precision hint * layer support_image_storage * wip * vulkan device utility operator * command is storage and packing option aware * fallback to cpu on image allocation failed, mobilenetssd works * flatten image shader, enable more test * ci test * check imgfp32 imgfp16 imgfp16a features * fix ci test * fix ci test * upgrade swiftshader * wip * opt aggressive * imgfp16p * opt none * convolution winograd image shader * fix flush range, fast copy path for continous buffer * minor fix * fix innerproduct * wip ... * wip * cast fix * packing test * wip * image fp16p is fp16p * wip * silence * more line info * code clean * softmax image shader
6 years ago
adreno image shader + fp16 + fp16a (#1714) * wip * wip * fix * image and imageview can not be destroyed until command execution ends * fast copy path for tightly packed data * wip * texture load works * 1d 3d image * record clone image, multiple commands share one image reference * upload download image * layer forward accept vkimagemat * vkimagemat graph works * staging vkimagemat for passing dynamic parameters, macro for fp32+image shader, padding image shader * vkimagemat elemsize * convolution test pass * conv1x1s1 image shader * fast staging image allocator from host memory, pooling image shader * convolutiondepthwise image shader * innerproduct image shader * packing image shader * crop deconvolution image shader * resolve spirv binding types * image fp16 and fp16a, cast image shader * eltwise image shader * wip * absval image shader * deconvolutiondepthwise image shader * concat image shader, squeezenet works * noop split image shader * uniform precision hint * layer support_image_storage * wip * vulkan device utility operator * command is storage and packing option aware * fallback to cpu on image allocation failed, mobilenetssd works * flatten image shader, enable more test * ci test * check imgfp32 imgfp16 imgfp16a features * fix ci test * fix ci test * upgrade swiftshader * wip * opt aggressive * imgfp16p * opt none * convolution winograd image shader * fix flush range, fast copy path for continous buffer * minor fix * fix innerproduct * wip ... * wip * cast fix * packing test * wip * image fp16p is fp16p * wip * silence * more line info * code clean * softmax image shader
6 years ago
adreno image shader + fp16 + fp16a (#1714) * wip * wip * fix * image and imageview can not be destroyed until command execution ends * fast copy path for tightly packed data * wip * texture load works * 1d 3d image * record clone image, multiple commands share one image reference * upload download image * layer forward accept vkimagemat * vkimagemat graph works * staging vkimagemat for passing dynamic parameters, macro for fp32+image shader, padding image shader * vkimagemat elemsize * convolution test pass * conv1x1s1 image shader * fast staging image allocator from host memory, pooling image shader * convolutiondepthwise image shader * innerproduct image shader * packing image shader * crop deconvolution image shader * resolve spirv binding types * image fp16 and fp16a, cast image shader * eltwise image shader * wip * absval image shader * deconvolutiondepthwise image shader * concat image shader, squeezenet works * noop split image shader * uniform precision hint * layer support_image_storage * wip * vulkan device utility operator * command is storage and packing option aware * fallback to cpu on image allocation failed, mobilenetssd works * flatten image shader, enable more test * ci test * check imgfp32 imgfp16 imgfp16a features * fix ci test * fix ci test * upgrade swiftshader * wip * opt aggressive * imgfp16p * opt none * convolution winograd image shader * fix flush range, fast copy path for continous buffer * minor fix * fix innerproduct * wip ... * wip * cast fix * packing test * wip * image fp16p is fp16p * wip * silence * more line info * code clean * softmax image shader
6 years ago
[WIP] vulkan compute (#618) * vulkan infrastructure * vkallocator and vkmat * layer interface for vulkan compute * wip... * default vulkan device, command wrapper, upload model weight in load_model to simplify layer interface * simplify command api, vkmat holds staging buffer, relu works * initialize specialization constant, simplify command dispatch, fix staging buffer copy with different shape, convolution works * init extension functions * dynamic local size and group count * group count=1 is invalid * regard device max workgroup size limit * fix relu oooops * decouple command record and staging allocation * create result blob * add pooling shader * buffer is faster than image :) * fix pooling shader * add innerproduct shader * readonly writeonly decoration * simplify buffer creation * decouple command and layer, VK_KHR_descriptor_update_template extension makes descriptor binding update easy :D * fix vulkan building issues in visual studio (#1) * fix building issues on visual studio * ignore benchmark * cancel changes * ... ... * decouple paramdict and vulkandevice * fix staging buffer destroy in model loading * remove vkdev member in option * add padding shader * simplify vulkan layer creation, simplify convolution and pooling shader for no padding, less debug output * add convolutiondepthwise and softmax shader * specialization float type, add leakyrelu * add dropout shader * add batchnorm shader * split vulkan forward * add scale shader * push constant type can be int or float * set_optimal_local_size_xyz * add eltwise shader * concat vulkan forward * fix convolution without bias * add dummy shader for concat and split, more fix ... 
* optional VK_KHR_descriptor_update_template and VK_KHR_push_descriptor * check VK_KHR_push_descriptor for vkCmdPushDescriptorSetWithTemplateKHR * binaryop and unaryop shader * hide raw command buffer * simple vkbenchncnn benchmark * create device with transfer queue * rename command to vkcompute, add vktransfer and layer upload_model interface * external VkMat, copy and map wrt buffer offset * command copy respect offset and size * decouple weight upload and load, simplify upload weight api, use one big staging buffer for uploading weights * fix build on android * binding count can not vary :( * barrier check state, fix sub-op destruction * declare local_size_xyz constant, fix crash on radv * fix local_size_xyz, second try * more barrier and state fix * fix softmax * reconstruct buffer memory allocator, reuse blob buffer, less verbose output * find unified memory type index * weight staging buffer allocator and weight buffer allocator, respect descriptor buffer offset alignment * use VK_KHR_descriptor_update_template for faster descriptor update if available, multithread pipeline creation * find more useful vulkan extensions and enable them * fix msvc build * respect VK_KHR_dedicated_allocation for weight buffer allocation * fix android build * fix bias name conflicts with metal * decouple pipeline and layer, building shader sources into shader module, dedicated create_pipeline api, simplify pipeline recording * drop dummy shader, inplace softmax, multiple shader module works * fix unique queue family index error * flatten support vulkan * mnasnet run * find shader module by name, each entry point per shader module, fix attribute/id conflict on moltenvk * some minor changes * add some high level api * use dedicated transfer queue to upload weight model * prefer mappable buffer on unified memory * global pooling and convolution fc, reuse staging buffer * implement ring-buffer style blob allocator, add VkBufferMemory capacity * use blob allocator for workspace blob, 
it works fine :) * vulkan option off * Update layer.cpp * fix build with vulkan off * less verbose output, fix crash on vulkan_compute off * merge benchncnn tool * allocator clear api, use new weight buffer allocator per net * add default locked allocator * mapped mat ptr api, persistent mapped memory works generally :) * travis ci linux vulkan * travis ci vulkan wip ... * more gpu wip ... * more gpu wip ... * wip... * wip... * wip... ... * wip... ios vulkan build... * find glslangValidator on ios build * use dynamic moltenvk library * travis ci wip ... * ios simulator does not support metal at all * fix cpu only extractor * optimize workgroup size, first try * optimize workgroup size, second try * conv1x1s1d1 vec4 * revert build system * fix ncnn2mem build * fix ncnn2mem build
7 years ago
72278227922802281228222832284228522862287228822892290229122922293229422952296229722982299230023012302230323042305230623072308230923102311231223132314231523162317231823192320232123222323232423252326232723282329233023312332233323342335233623372338233923402341234223432344234523462347234823492350235123522353235423552356235723582359236023612362236323642365236623672368236923702371237223732374237523762377237823792380238123822383238423852386238723882389239023912392239323942395239623972398239924002401240224032404240524062407240824092410241124122413241424152416241724182419242024212422242324242425242624272428242924302431243224332434243524362437243824392440244124422443244424452446244724482449245024512452245324542455245624572458245924602461246224632464246524662467246824692470247124722473247424752476247724782479248024812482248324842485248624872488248924902491249224932494249524962497249824992500250125022503250425052506250725082509251025112512251325142515251625172518251925202521252225232524252525262527252825292530253125322533253425352536253725382539254025412542254325442545254625472548254925502551255225532554255525562557255825592560256125622563256425652566256725682569257025712572257325742575257625772578257925802581258225832584258525862587258825892590259125922593259425952596259725982599260026012602260326042605260626072608260926102611261226132614261526162617261826192620262126222623262426252626262726282629263026312632263326342635263626372638263926402641264226432644264526462647264826492650265126522653265426552656265726582659266026612662266326642665266626672668266926702671267226732674267526762677267826792680268126822683268426852686268726882689269026912692269326942695269626972698269927002701270227032704270527062707270827092710271127122713271427152716271727182719272027212722272327242725272627272728272927302731273227332734273527362737273827392740274127422743274427452746274727482749275027512752275327542755275627572758275927602761276227632764276527662767276827692770277127722773277427752776277
72778277927802781278227832784278527862787278827892790279127922793279427952796279727982799280028012802280328042805280628072808280928102811281228132814281528162817281828192820282128222823282428252826282728282829283028312832283328342835283628372838283928402841284228432844284528462847284828492850285128522853285428552856285728582859286028612862286328642865286628672868286928702871287228732874287528762877287828792880288128822883288428852886288728882889289028912892289328942895289628972898289929002901290229032904290529062907290829092910291129122913291429152916291729182919292029212922292329242925292629272928292929302931293229332934293529362937293829392940294129422943294429452946294729482949295029512952295329542955295629572958295929602961296229632964296529662967296829692970297129722973297429752976297729782979298029812982298329842985298629872988298929902991299229932994299529962997299829993000300130023003300430053006300730083009301030113012301330143015301630173018301930203021302230233024302530263027302830293030303130323033303430353036303730383039304030413042304330443045304630473048304930503051305230533054305530563057305830593060306130623063306430653066306730683069307030713072307330743075307630773078307930803081308230833084308530863087308830893090309130923093309430953096309730983099310031013102310331043105310631073108310931103111311231133114311531163117311831193120312131223123312431253126312731283129313031313132313331343135313631373138313931403141314231433144314531463147314831493150315131523153315431553156315731583159316031613162316331643165316631673168316931703171317231733174317531763177317831793180318131823183318431853186318731883189319031913192319331943195319631973198319932003201320232033204320532063207320832093210321132123213321432153216321732183219322032213222322332243225322632273228322932303231323232333234323532363237323832393240324132423243324432453246324732483249325032513252325332543255325632573258325932603261326232633264326532663267326832693270327132723273327432753276327
73278327932803281328232833284328532863287328832893290329132923293329432953296329732983299330033013302330333043305330633073308330933103311331233133314331533163317331833193320332133223323332433253326332733283329333033313332333333343335333633373338333933403341334233433344334533463347334833493350335133523353335433553356335733583359336033613362336333643365336633673368336933703371337233733374337533763377337833793380338133823383338433853386338733883389339033913392339333943395339633973398339934003401340234033404340534063407340834093410341134123413341434153416341734183419342034213422342334243425342634273428342934303431343234333434343534363437343834393440344134423443344434453446344734483449345034513452345334543455345634573458345934603461346234633464346534663467346834693470347134723473347434753476347734783479348034813482348334843485348634873488348934903491349234933494349534963497349834993500350135023503350435053506350735083509351035113512351335143515351635173518351935203521352235233524352535263527352835293530353135323533353435353536353735383539354035413542354335443545354635473548354935503551355235533554355535563557355835593560356135623563356435653566356735683569357035713572357335743575357635773578357935803581358235833584358535863587358835893590359135923593359435953596359735983599360036013602360336043605360636073608360936103611361236133614361536163617361836193620362136223623362436253626362736283629363036313632363336343635363636373638363936403641364236433644364536463647364836493650365136523653365436553656365736583659366036613662366336643665366636673668366936703671367236733674367536763677367836793680368136823683368436853686368736883689369036913692369336943695369636973698369937003701370237033704370537063707370837093710371137123713371437153716371737183719372037213722372337243725372637273728372937303731373237333734373537363737373837393740374137423743374437453746374737483749375037513752375337543755375637573758375937603761376237633764376537663767376837693770377137723773377437753776377
73778377937803781378237833784378537863787378837893790379137923793379437953796379737983799380038013802380338043805380638073808380938103811381238133814381538163817381838193820382138223823382438253826382738283829383038313832383338343835383638373838383938403841384238433844384538463847384838493850385138523853385438553856385738583859386038613862386338643865386638673868386938703871387238733874387538763877387838793880388138823883388438853886388738883889389038913892389338943895389638973898389939003901390239033904390539063907390839093910391139123913391439153916391739183919392039213922392339243925392639273928392939303931393239333934393539363937393839393940394139423943394439453946394739483949395039513952395339543955395639573958395939603961396239633964396539663967396839693970397139723973397439753976397739783979398039813982398339843985398639873988398939903991399239933994399539963997399839994000400140024003400440054006400740084009401040114012401340144015401640174018401940204021402240234024402540264027402840294030403140324033403440354036403740384039404040414042404340444045404640474048404940504051405240534054405540564057405840594060406140624063406440654066406740684069407040714072407340744075407640774078407940804081408240834084408540864087408840894090409140924093409440954096409740984099410041014102410341044105410641074108410941104111411241134114411541164117411841194120412141224123412441254126412741284129413041314132413341344135413641374138413941404141414241434144414541464147414841494150415141524153415441554156415741584159416041614162416341644165416641674168416941704171417241734174417541764177417841794180418141824183418441854186418741884189419041914192419341944195419641974198419942004201420242034204420542064207420842094210421142124213421442154216421742184219422042214222422342244225422642274228422942304231423242334234423542364237423842394240424142424243424442454246424742484249425042514252425342544255425642574258425942604261426242634264426542664267426842694270427142724273427442754276427
74278427942804281428242834284428542864287428842894290429142924293429442954296429742984299430043014302430343044305430643074308430943104311431243134314431543164317431843194320432143224323432443254326432743284329433043314332433343344335433643374338433943404341434243434344434543464347434843494350435143524353435443554356435743584359436043614362436343644365436643674368436943704371437243734374437543764377437843794380438143824383438443854386438743884389439043914392439343944395439643974398439944004401440244034404440544064407440844094410441144124413441444154416441744184419442044214422442344244425442644274428442944304431443244334434443544364437443844394440444144424443444444454446444744484449445044514452445344544455445644574458445944604461446244634464446544664467446844694470447144724473447444754476447744784479448044814482448344844485448644874488448944904491449244934494449544964497449844994500450145024503450445054506450745084509451045114512451345144515451645174518451945204521452245234524452545264527452845294530453145324533453445354536453745384539454045414542454345444545454645474548454945504551455245534554455545564557455845594560456145624563456445654566456745684569457045714572457345744575457645774578457945804581458245834584458545864587458845894590459145924593459445954596459745984599460046014602460346044605460646074608460946104611461246134614461546164617461846194620462146224623462446254626462746284629463046314632463346344635463646374638463946404641464246434644464546464647464846494650465146524653465446554656465746584659466046614662466346644665466646674668466946704671467246734674467546764677467846794680468146824683468446854686468746884689469046914692469346944695469646974698469947004701470247034704470547064707470847094710471147124713471447154716471747184719472047214722472347244725472647274728472947304731473247334734473547364737473847394740474147424743474447454746474747484749475047514752475347544755475647574758475947604761476247634764476547664767476847694770477147724773477447754776477
74778477947804781478247834784478547864787478847894790479147924793479447954796479747984799480048014802480348044805480648074808480948104811481248134814481548164817481848194820482148224823482448254826482748284829483048314832483348344835483648374838483948404841484248434844484548464847484848494850485148524853485448554856485748584859486048614862486348644865486648674868486948704871487248734874487548764877487848794880488148824883488448854886488748884889489048914892489348944895489648974898489949004901490249034904490549064907490849094910491149124913491449154916491749184919492049214922492349244925492649274928492949304931493249334934493549364937493849394940494149424943494449454946494749484949495049514952495349544955495649574958495949604961496249634964496549664967496849694970497149724973497449754976497749784979498049814982498349844985498649874988498949904991499249934994499549964997499849995000500150025003500450055006500750085009501050115012501350145015501650175018501950205021502250235024502550265027502850295030503150325033503450355036503750385039504050415042504350445045504650475048504950505051505250535054505550565057505850595060506150625063506450655066506750685069507050715072507350745075507650775078507950805081508250835084508550865087508850895090509150925093509450955096509750985099510051015102510351045105510651075108510951105111511251135114511551165117511851195120512151225123512451255126512751285129513051315132513351345135513651375138513951405141514251435144514551465147514851495150515151525153515451555156515751585159516051615162516351645165516651675168516951705171517251735174517551765177517851795180518151825183518451855186518751885189519051915192519351945195519651975198519952005201520252035204520552065207520852095210521152125213521452155216521752185219522052215222522352245225522652275228522952305231523252335234523552365237523852395240524152425243524452455246524752485249525052515252525352545255525652575258525952605261526252635264526552665267526852695270527152725273527452755276527
75278527952805281528252835284528552865287528852895290529152925293529452955296529752985299530053015302530353045305530653075308530953105311531253135314531553165317531853195320532153225323532453255326532753285329533053315332533353345335533653375338533953405341534253435344534553465347534853495350535153525353535453555356535753585359536053615362536353645365536653675368536953705371537253735374537553765377537853795380538153825383538453855386538753885389539053915392539353945395539653975398539954005401540254035404540554065407540854095410541154125413541454155416541754185419542054215422542354245425542654275428542954305431543254335434543554365437543854395440544154425443544454455446544754485449545054515452545354545455545654575458545954605461546254635464546554665467546854695470547154725473547454755476547754785479548054815482548354845485548654875488548954905491549254935494549554965497549854995500550155025503550455055506550755085509551055115512551355145515551655175518551955205521552255235524552555265527552855295530553155325533553455355536553755385539554055415542554355445545554655475548554955505551555255535554555555565557555855595560556155625563556455655566556755685569557055715572557355745575557655775578557955805581558255835584558555865587558855895590559155925593559455955596559755985599560056015602560356045605560656075608560956105611561256135614561556165617561856195620562156225623562456255626562756285629563056315632563356345635563656375638563956405641564256435644564556465647564856495650565156525653565456555656565756585659566056615662566356645665566656675668566956705671567256735674567556765677567856795680568156825683568456855686568756885689569056915692569356945695569656975698569957005701570257035704570557065707570857095710571157125713571457155716571757185719572057215722572357245725572657275728572957305731573257335734573557365737573857395740574157425743574457455746574757485749575057515752575357545755575657575758575957605761576257635764576557665767576857695770577157725773577457755776577
757785779578057815782578357845785
// Copyright 2018 Tencent
// SPDX-License-Identifier: BSD-3-Clause

#include "gpu.h"

#if NCNN_VULKAN

#include <float.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#include "glslang/SPIRV/GlslangToSpv.h"

#if NCNN_SYSTEM_GLSLANG
#include "glslang/Public/ShaderLang.h"
#else
#include "glslang/glslang/Public/ShaderLang.h"
#endif

#include "layer/vulkan/shader/vulkan_activation.comp.hex.h"

#include "command.h"
#include "layer.h"
#include "layer_type.h"
#include "mat.h"
#include "pipelinecache.h"

// There is a known issue where vkDestroyDebugUtilsMessengerEXT crashes on exit when the Vulkan validation layer is enabled
// upstream fix https://github.com/KhronosGroup/Vulkan-Loader/pull/539
#define ENABLE_VALIDATION_LAYER 0

namespace ncnn {
// global
static Mutex g_instance_lock;

class __ncnn_vulkan_instance_holder
{
public:
    __ncnn_vulkan_instance_holder()
    {
        instance = 0;
        instance_api_version = 0;
        created = 0;
        glslang_initialized = false;

#if NCNN_VULKAN_LOADER
        libvulkan = 0;
#if defined __ANDROID__
        hvkdi = 0;
#endif
#endif // NCNN_VULKAN_LOADER

#if ENABLE_VALIDATION_LAYER
        callback = 0;
#endif
    }

    ~__ncnn_vulkan_instance_holder()
    {
        destroy_gpu_instance();
    }

    operator VkInstance()
    {
        return instance;
    }

    VkInstance instance;
    uint32_t instance_api_version;
    int created;
    bool glslang_initialized;

#if ENABLE_VALIDATION_LAYER
    VkDebugUtilsMessengerEXT callback;
#endif
};

static __ncnn_vulkan_instance_holder g_instance;
static int g_gpu_count = 0;
static int g_default_gpu_index = -1;

// NOTE 32 is large enough i think ...
#define NCNN_MAX_GPU_COUNT 32
static GpuInfo* g_gpu_infos[NCNN_MAX_GPU_COUNT] = {0};

// default vulkan device
static Mutex g_default_vkdev_lock;
static VulkanDevice* g_default_vkdev[NCNN_MAX_GPU_COUNT] = {0};

struct layer_shader_registry_entry
{
    const char* comp_data;
    int comp_data_size;
};

#include "layer_shader_spv_data.h"

static const layer_shader_registry_entry layer_shader_registry[] = {
#include "layer_shader_registry.h"
};

static const int layer_shader_registry_entry_count = sizeof(layer_shader_registry) / sizeof(layer_shader_registry_entry);
  81. // vulkan core
  82. PFN_vkAllocateCommandBuffers vkAllocateCommandBuffers = 0;
  83. PFN_vkAllocateDescriptorSets vkAllocateDescriptorSets = 0;
  84. PFN_vkAllocateMemory vkAllocateMemory = 0;
  85. PFN_vkBeginCommandBuffer vkBeginCommandBuffer = 0;
  86. PFN_vkBindBufferMemory vkBindBufferMemory = 0;
  87. PFN_vkBindImageMemory vkBindImageMemory = 0;
  88. PFN_vkCmdBeginQuery vkCmdBeginQuery = 0;
  89. PFN_vkCmdBindDescriptorSets vkCmdBindDescriptorSets = 0;
  90. PFN_vkCmdBindIndexBuffer vkCmdBindIndexBuffer = 0;
  91. PFN_vkCmdBindPipeline vkCmdBindPipeline = 0;
  92. PFN_vkCmdCopyBuffer vkCmdCopyBuffer = 0;
  93. PFN_vkCmdCopyBufferToImage vkCmdCopyBufferToImage = 0;
  94. PFN_vkCmdCopyImage vkCmdCopyImage = 0;
  95. PFN_vkCmdCopyImageToBuffer vkCmdCopyImageToBuffer = 0;
  96. PFN_vkCmdCopyQueryPoolResults vkCmdCopyQueryPoolResults = 0;
  97. PFN_vkCmdDispatch vkCmdDispatch = 0;
  98. PFN_vkCmdDispatchIndirect vkCmdDispatchIndirect = 0;
  99. PFN_vkCmdEndQuery vkCmdEndQuery = 0;
  100. PFN_vkCmdExecuteCommands vkCmdExecuteCommands = 0;
  101. PFN_vkCmdFillBuffer vkCmdFillBuffer = 0;
  102. PFN_vkCmdPipelineBarrier vkCmdPipelineBarrier = 0;
  103. PFN_vkCmdPushConstants vkCmdPushConstants = 0;
  104. PFN_vkCmdResetQueryPool vkCmdResetQueryPool = 0;
  105. PFN_vkCmdResolveImage vkCmdResolveImage = 0;
  106. PFN_vkCmdUpdateBuffer vkCmdUpdateBuffer = 0;
  107. PFN_vkCmdWriteTimestamp vkCmdWriteTimestamp = 0;
  108. PFN_vkCreateBuffer vkCreateBuffer = 0;
  109. PFN_vkCreateBufferView vkCreateBufferView = 0;
  110. PFN_vkCreateCommandPool vkCreateCommandPool = 0;
  111. PFN_vkCreateComputePipelines vkCreateComputePipelines = 0;
  112. PFN_vkCreateDescriptorPool vkCreateDescriptorPool = 0;
  113. PFN_vkCreateDescriptorSetLayout vkCreateDescriptorSetLayout = 0;
  114. PFN_vkCreateDevice vkCreateDevice = 0;
  115. PFN_vkCreateFence vkCreateFence = 0;
  116. PFN_vkCreateImage vkCreateImage = 0;
  117. PFN_vkCreateImageView vkCreateImageView = 0;
  118. PFN_vkCreatePipelineCache vkCreatePipelineCache = 0;
  119. PFN_vkCreatePipelineLayout vkCreatePipelineLayout = 0;
  120. PFN_vkCreateQueryPool vkCreateQueryPool = 0;
  121. PFN_vkCreateSampler vkCreateSampler = 0;
  122. PFN_vkCreateSemaphore vkCreateSemaphore = 0;
  123. PFN_vkCreateShaderModule vkCreateShaderModule = 0;
  124. PFN_vkDestroyBuffer vkDestroyBuffer = 0;
  125. PFN_vkDestroyBufferView vkDestroyBufferView = 0;
  126. PFN_vkDestroyCommandPool vkDestroyCommandPool = 0;
  127. PFN_vkDestroyDescriptorPool vkDestroyDescriptorPool = 0;
  128. PFN_vkDestroyDescriptorSetLayout vkDestroyDescriptorSetLayout = 0;
  129. PFN_vkDestroyDevice vkDestroyDevice = 0;
  130. PFN_vkDestroyFence vkDestroyFence = 0;
  131. PFN_vkDestroyImage vkDestroyImage = 0;
  132. PFN_vkDestroyImageView vkDestroyImageView = 0;
  133. PFN_vkDestroyInstance vkDestroyInstance = 0;
  134. PFN_vkDestroyPipeline vkDestroyPipeline = 0;
  135. PFN_vkDestroyPipelineCache vkDestroyPipelineCache = 0;
  136. PFN_vkDestroyPipelineLayout vkDestroyPipelineLayout = 0;
  137. PFN_vkDestroyQueryPool vkDestroyQueryPool = 0;
  138. PFN_vkDestroySampler vkDestroySampler = 0;
  139. PFN_vkDestroySemaphore vkDestroySemaphore = 0;
  140. PFN_vkDestroyShaderModule vkDestroyShaderModule = 0;
  141. PFN_vkDeviceWaitIdle vkDeviceWaitIdle = 0;
  142. PFN_vkEndCommandBuffer vkEndCommandBuffer = 0;
  143. PFN_vkEnumerateDeviceExtensionProperties vkEnumerateDeviceExtensionProperties = 0;
  144. PFN_vkEnumerateDeviceLayerProperties vkEnumerateDeviceLayerProperties = 0;
  145. PFN_vkEnumeratePhysicalDevices vkEnumeratePhysicalDevices = 0;
  146. PFN_vkFlushMappedMemoryRanges vkFlushMappedMemoryRanges = 0;
  147. PFN_vkFreeCommandBuffers vkFreeCommandBuffers = 0;
  148. PFN_vkFreeDescriptorSets vkFreeDescriptorSets = 0;
  149. PFN_vkFreeMemory vkFreeMemory = 0;
  150. PFN_vkGetBufferMemoryRequirements vkGetBufferMemoryRequirements = 0;
  151. PFN_vkGetDeviceMemoryCommitment vkGetDeviceMemoryCommitment = 0;
  152. PFN_vkGetDeviceProcAddr vkGetDeviceProcAddr = 0;
  153. PFN_vkGetDeviceQueue vkGetDeviceQueue = 0;
  154. PFN_vkGetFenceStatus vkGetFenceStatus = 0;
  155. PFN_vkGetImageMemoryRequirements vkGetImageMemoryRequirements = 0;
  156. PFN_vkGetImageSubresourceLayout vkGetImageSubresourceLayout = 0;
  157. PFN_vkGetPhysicalDeviceFeatures vkGetPhysicalDeviceFeatures = 0;
  158. PFN_vkGetPhysicalDeviceFormatProperties vkGetPhysicalDeviceFormatProperties = 0;
  159. PFN_vkGetPhysicalDeviceImageFormatProperties vkGetPhysicalDeviceImageFormatProperties = 0;
  160. PFN_vkGetPhysicalDeviceMemoryProperties vkGetPhysicalDeviceMemoryProperties = 0;
  161. PFN_vkGetPhysicalDeviceProperties vkGetPhysicalDeviceProperties = 0;
  162. PFN_vkGetPhysicalDeviceQueueFamilyProperties vkGetPhysicalDeviceQueueFamilyProperties = 0;
  163. PFN_vkGetPipelineCacheData vkGetPipelineCacheData = 0;
  164. PFN_vkGetQueryPoolResults vkGetQueryPoolResults = 0;
  165. PFN_vkInvalidateMappedMemoryRanges vkInvalidateMappedMemoryRanges = 0;
  166. PFN_vkMapMemory vkMapMemory = 0;
  167. PFN_vkMergePipelineCaches vkMergePipelineCaches = 0;
  168. PFN_vkQueueSubmit vkQueueSubmit = 0;
  169. PFN_vkQueueWaitIdle vkQueueWaitIdle = 0;
  170. PFN_vkResetCommandBuffer vkResetCommandBuffer = 0;
  171. PFN_vkResetCommandPool vkResetCommandPool = 0;
  172. PFN_vkResetDescriptorPool vkResetDescriptorPool = 0;
  173. PFN_vkResetFences vkResetFences = 0;
  174. PFN_vkUnmapMemory vkUnmapMemory = 0;
  175. PFN_vkUpdateDescriptorSets vkUpdateDescriptorSets = 0;
  176. PFN_vkWaitForFences vkWaitForFences = 0;

int support_VK_KHR_external_memory_capabilities = 0;
int support_VK_KHR_get_physical_device_properties2 = 0;
int support_VK_KHR_get_surface_capabilities2 = 0;
int support_VK_KHR_portability_enumeration = 0;
int support_VK_KHR_surface = 0;
int support_VK_EXT_debug_utils = 0;
int support_VK_EXT_validation_features = 0;
int support_VK_EXT_validation_flags = 0;
#if __ANDROID_API__ >= 26
int support_VK_KHR_android_surface = 0;
#endif // __ANDROID_API__ >= 26

// VK_KHR_cooperative_matrix
PFN_vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR = 0;

// VK_KHR_external_memory_capabilities
PFN_vkGetPhysicalDeviceExternalBufferPropertiesKHR vkGetPhysicalDeviceExternalBufferPropertiesKHR = 0;

// VK_KHR_get_physical_device_properties2
PFN_vkGetPhysicalDeviceFeatures2KHR vkGetPhysicalDeviceFeatures2KHR = 0;
PFN_vkGetPhysicalDeviceProperties2KHR vkGetPhysicalDeviceProperties2KHR = 0;
PFN_vkGetPhysicalDeviceFormatProperties2KHR vkGetPhysicalDeviceFormatProperties2KHR = 0;
PFN_vkGetPhysicalDeviceImageFormatProperties2KHR vkGetPhysicalDeviceImageFormatProperties2KHR = 0;
PFN_vkGetPhysicalDeviceQueueFamilyProperties2KHR vkGetPhysicalDeviceQueueFamilyProperties2KHR = 0;
PFN_vkGetPhysicalDeviceMemoryProperties2KHR vkGetPhysicalDeviceMemoryProperties2KHR = 0;

// VK_KHR_get_surface_capabilities2
PFN_vkGetPhysicalDeviceSurfaceCapabilities2KHR vkGetPhysicalDeviceSurfaceCapabilities2KHR = 0;
PFN_vkGetPhysicalDeviceSurfaceFormats2KHR vkGetPhysicalDeviceSurfaceFormats2KHR = 0;

// VK_KHR_surface
PFN_vkDestroySurfaceKHR vkDestroySurfaceKHR = 0;
PFN_vkGetPhysicalDeviceSurfaceSupportKHR vkGetPhysicalDeviceSurfaceSupportKHR = 0;
PFN_vkGetPhysicalDeviceSurfaceCapabilitiesKHR vkGetPhysicalDeviceSurfaceCapabilitiesKHR = 0;
PFN_vkGetPhysicalDeviceSurfaceFormatsKHR vkGetPhysicalDeviceSurfaceFormatsKHR = 0;
PFN_vkGetPhysicalDeviceSurfacePresentModesKHR vkGetPhysicalDeviceSurfacePresentModesKHR = 0;

#if __ANDROID_API__ >= 26
// VK_KHR_android_surface
PFN_vkCreateAndroidSurfaceKHR vkCreateAndroidSurfaceKHR = 0;
#endif // __ANDROID_API__ >= 26

// VK_NV_cooperative_matrix
PFN_vkGetPhysicalDeviceCooperativeMatrixPropertiesNV vkGetPhysicalDeviceCooperativeMatrixPropertiesNV = 0;

// VK_NV_cooperative_matrix2
PFN_vkGetPhysicalDeviceCooperativeMatrixFlexibleDimensionsPropertiesNV vkGetPhysicalDeviceCooperativeMatrixFlexibleDimensionsPropertiesNV = 0;

// VK_NV_cooperative_vector
PFN_vkGetPhysicalDeviceCooperativeVectorPropertiesNV vkGetPhysicalDeviceCooperativeVectorPropertiesNV = 0;

class GpuInfoPrivate
{
public:
    void query_features();
    void query_properties();
    void query_queue_properties();
    int query_extensions();
    void query_extension_features();
    void query_extension_properties();

public:
    int device_index;

    // physical device
    VkPhysicalDevice physicalDevice;

    // features
    VkPhysicalDeviceFeatures physicalDevicefeatures;

    // properties
    VkPhysicalDeviceProperties physicalDeviceProperties;

    // memory properties
    VkPhysicalDeviceMemoryProperties physicalDeviceMemoryProperties;

    // extension properties
    std::vector<VkExtensionProperties> deviceExtensionProperties;

    // 0 = discrete gpu
    // 1 = integrated gpu
    // 2 = virtual gpu
    // 3 = cpu
    int type;

    // runtime
    uint32_t compute_queue_family_index;
    uint32_t transfer_queue_family_index;

    uint32_t compute_queue_count;
    uint32_t transfer_queue_count;

    // property
    bool unified_compute_transfer_queue;

    // bug is not feature
    bool bug_storage_buffer_no_l1;
    bool bug_corrupted_online_pipeline_cache;
    bool bug_buffer_image_load_zero;

    // but sometimes bug is a feature
    bool bug_implicit_fp16_arithmetic;

    // cooperative matrix
    bool support_cooperative_matrix_8_8_16;
    bool support_cooperative_matrix_16_8_8;
    bool support_cooperative_matrix_16_8_16;
    bool support_cooperative_matrix_16_16_16;

    // extension capability
    int support_VK_KHR_8bit_storage;
    int support_VK_KHR_16bit_storage;
    int support_VK_KHR_bind_memory2;
    int support_VK_KHR_buffer_device_address;
    int support_VK_KHR_create_renderpass2;
    int support_VK_KHR_cooperative_matrix;
    int support_VK_KHR_dedicated_allocation;
    int support_VK_KHR_descriptor_update_template;
    int support_VK_KHR_driver_properties;
    int support_VK_KHR_external_memory;
    int support_VK_KHR_get_memory_requirements2;
    int support_VK_KHR_maintenance1;
    int support_VK_KHR_maintenance2;
    int support_VK_KHR_maintenance3;
    int support_VK_KHR_multiview;
    int support_VK_KHR_portability_subset;
    int support_VK_KHR_push_descriptor;
    int support_VK_KHR_sampler_ycbcr_conversion;
    int support_VK_KHR_shader_bfloat16;
    int support_VK_KHR_shader_float16_int8;
    int support_VK_KHR_shader_float_controls;
    int support_VK_KHR_shader_float_controls2;
    int support_VK_KHR_shader_integer_dot_product;
    int support_VK_KHR_shader_non_semantic_info;
    int support_VK_KHR_shader_subgroup_extended_types;
    int support_VK_KHR_shader_subgroup_rotate;
    int support_VK_KHR_storage_buffer_storage_class;
    int support_VK_KHR_swapchain;
    int support_VK_KHR_vulkan_memory_model;
    int support_VK_KHR_zero_initialize_workgroup_memory;
    int support_VK_EXT_buffer_device_address;
    int support_VK_EXT_descriptor_indexing;
    int support_VK_EXT_memory_budget;
    int support_VK_EXT_memory_priority;
    int support_VK_EXT_queue_family_foreign;
    int support_VK_EXT_shader_atomic_float;
    int support_VK_EXT_shader_atomic_float2;
    int support_VK_EXT_shader_float8;
    int support_VK_EXT_subgroup_size_control;
    int support_VK_AMD_device_coherent_memory;
#if __ANDROID_API__ >= 26
    int support_VK_ANDROID_external_memory_android_hardware_buffer;
#endif // __ANDROID_API__ >= 26
    int support_VK_NV_cooperative_matrix;
    int support_VK_NV_cooperative_matrix2;
    int support_VK_NV_cooperative_vector;

    // extension features
    void* queryExtensionFeatures;
    VkPhysicalDevice8BitStorageFeaturesKHR query8BitStorageFeatures;
    VkPhysicalDevice16BitStorageFeaturesKHR query16BitStorageFeatures;
    VkPhysicalDeviceFloat16Int8FeaturesKHR queryFloat16Int8Features;
    VkPhysicalDeviceSamplerYcbcrConversionFeaturesKHR querySamplerYcbcrConversionFeatures;
    VkPhysicalDeviceCooperativeMatrixFeaturesKHR queryCooperativeMatrixFeatures;
    VkPhysicalDeviceCooperativeMatrixFeaturesNV queryCooperativeMatrixFeaturesNV;
    VkPhysicalDeviceCooperativeMatrix2FeaturesNV queryCooperativeMatrix2FeaturesNV;
    VkPhysicalDeviceShaderBfloat16FeaturesKHR queryShaderBfloat16Features;
    VkPhysicalDeviceShaderFloat8FeaturesEXT queryShaderFloat8Features;
    VkPhysicalDeviceShaderFloatControls2FeaturesKHR queryShaderFloatControls2Features;
    VkPhysicalDeviceShaderIntegerDotProductFeaturesKHR queryShaderIntegerDotProductFeatures;
    VkPhysicalDeviceSubgroupSizeControlFeaturesEXT querySubgroupSizeControlFeatures;
    VkPhysicalDeviceShaderSubgroupRotateFeaturesKHR queryShaderSubgroupRotateFeatures;
    VkPhysicalDeviceShaderAtomicFloatFeaturesEXT queryShaderAtomicFloatFeatures;
    VkPhysicalDeviceShaderAtomicFloat2FeaturesEXT queryShaderAtomicFloat2Features;
    VkPhysicalDeviceCooperativeVectorFeaturesNV queryCooperativeVectorFeaturesNV;
    VkPhysicalDeviceVulkanMemoryModelFeaturesKHR queryVulkanMemoryModelFeatures;

    // extension properties
    void* queryExtensionProperties;
    VkPhysicalDeviceFloatControlsPropertiesKHR queryFloatControlsProperties;
    VkPhysicalDeviceShaderIntegerDotProductProperties queryShaderIntegerDotProductProperties;
    VkPhysicalDeviceSubgroupProperties querySubgroupProperties;
    VkPhysicalDeviceDriverPropertiesKHR queryDriverProperties;
    VkPhysicalDeviceSubgroupSizeControlPropertiesEXT querySubgroupSizeControlProperties;
    VkPhysicalDeviceCooperativeMatrix2PropertiesNV queryCooperativeMatrix2PropertiesNV;
    VkPhysicalDeviceCooperativeVectorPropertiesNV queryCooperativeVectorPropertiesNV;

    // extension sub properties
    std::vector<VkCooperativeMatrixPropertiesKHR> queryCooperativeMatrixSubProperties;
    std::vector<VkCooperativeMatrixPropertiesNV> queryCooperativeMatrixSubPropertiesNV;
    std::vector<VkCooperativeMatrixFlexibleDimensionsPropertiesNV> queryCooperativeMatrixFlexibleDimensionsSubPropertiesNV;
    std::vector<VkCooperativeVectorPropertiesNV> queryCooperativeVectorSubPropertiesNV;
};
  343. void GpuInfoPrivate::query_features()
  344. {
  345. vkGetPhysicalDeviceFeatures(physicalDevice, &physicalDevicefeatures);
  346. }
  347. void GpuInfoPrivate::query_properties()
  348. {
  349. vkGetPhysicalDeviceProperties(physicalDevice, &physicalDeviceProperties);
  350. // NCNN_LOGE("[%u] apiVersion = %u.%u.%u", i, VK_VERSION_MAJOR(physicalDeviceProperties.apiVersion),
  351. // VK_VERSION_MINOR(physicalDeviceProperties.apiVersion), VK_VERSION_PATCH(physicalDeviceProperties.apiVersion));
  352. // NCNN_LOGE("[%u] driverVersion = %u.%u.%u", i, VK_VERSION_MAJOR(physicalDeviceProperties.driverVersion),
  353. // VK_VERSION_MINOR(physicalDeviceProperties.driverVersion), VK_VERSION_PATCH(physicalDeviceProperties.driverVersion));
  354. // NCNN_LOGE("[%u] vendorID = %x", i, physicalDeviceProperties.vendorID);
  355. // NCNN_LOGE("[%u] deviceID = %x", i, physicalDeviceProperties.deviceID);
  356. // NCNN_LOGE("[%u] deviceType = %x", i, physicalDeviceProperties.deviceType);
  357. // NCNN_LOGE("[%u] deviceName = %s", i, physicalDeviceProperties.deviceName);
  358. // NCNN_LOGE("[%u] pipelineCacheUUID = %u", i, physicalDeviceProperties.pipelineCacheUUID);
  359. // device type
  360. {
  361. type = -1;
  362. if (physicalDeviceProperties.deviceType == VK_PHYSICAL_DEVICE_TYPE_DISCRETE_GPU)
  363. type = 0;
  364. if (physicalDeviceProperties.deviceType == VK_PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU)
  365. type = 1;
  366. if (physicalDeviceProperties.deviceType == VK_PHYSICAL_DEVICE_TYPE_VIRTUAL_GPU)
  367. type = 2;
  368. if (physicalDeviceProperties.deviceType == VK_PHYSICAL_DEVICE_TYPE_CPU)
  369. type = 3;
  370. }
  371. // mali
  372. // t760 = 0x13b5 0x7500001 / 0x7501000
  373. // t860 = 0x13b5 0x8602000
  374. // t880 = 0x13b5 0x8800020
  375. // g31 = 0x13b5 0x70930000
  376. // g51 = 0x13b5 0x70901010
  377. // g52 = 0x13b5 0x74021000 / 0x72120000
  378. // g71 = 0x13b5 0x60a00002
  379. // g72 = 0x13b5 0x62210001
  380. // g76 = 0x13b5 0x72110000
  381. // g77 = 0x13b5 0x90800011
  382. // adreno
  383. // 506 = 0x5143 0x5000600
  384. // 510 = 0x5143 0x5010000
  385. // 512 = 0x5143 0x5010200
  386. // 530 = 0x5143 0x5030004
  387. // 540 = 0x5143 0x5040001
  388. // 616 = 0x5143 0x6010600
  389. // 630 = 0x5143 0x6030001
  390. // 640 = 0x5143 0x6040001
  391. // 650 = 0x5143 0x6050002
  392. bug_storage_buffer_no_l1 = false;
  393. bug_corrupted_online_pipeline_cache = false;
  394. bug_implicit_fp16_arithmetic = false;
  395. bug_buffer_image_load_zero = false;
  396. if (physicalDeviceProperties.vendorID == 0x5143 && physicalDeviceProperties.apiVersion < VK_MAKE_VERSION(1, 0, 66))
  397. {
  398. // qcom adreno with old buggy driver cannot share created pipeline properly
  399. bug_corrupted_online_pipeline_cache = true;
  400. }
  401. if (physicalDeviceProperties.vendorID == 0x5143 && !(physicalDeviceProperties.deviceID == 0x6040001 || physicalDeviceProperties.deviceID == 0x6050002))
  402. {
  403. // qcom adreno accesses storage buffers without L1 cache
  404. // NOTE qcom855/qcom855plus/qcom865 are known exceptions
  405. bug_storage_buffer_no_l1 = true;
  406. }
  407. if (physicalDeviceProperties.vendorID == 0x5143 && physicalDeviceProperties.apiVersion < VK_MAKE_VERSION(1, 1, 87))
  408. {
  409. // HACK the buffer2image copy before an image-read dependency does not work properly
  410. // on old adreno drivers, even with a full image memory barrier
  411. // TODO figure out a proper workaround without hurting speed too much
  412. // TODO only for old drivers
  413. bug_buffer_image_load_zero = true;
  414. }
  415. if (physicalDeviceProperties.vendorID == 0x13b5
  416. && (physicalDeviceProperties.deviceID == 0x7500001
  417. || physicalDeviceProperties.deviceID == 0x7501000
  418. || physicalDeviceProperties.deviceID == 0x8602000
  419. || physicalDeviceProperties.deviceID == 0x8800020
  420. || physicalDeviceProperties.deviceID == 0x70930000
  421. || physicalDeviceProperties.deviceID == 0x70901010
  422. || physicalDeviceProperties.deviceID == 0x72120000
  423. || physicalDeviceProperties.deviceID == 0x74021000
  424. || physicalDeviceProperties.deviceID == 0x60a00002
  425. || physicalDeviceProperties.deviceID == 0x62210001))
  426. {
  427. // NOTE rk3288/rk3399/t880/g31/g51/g52/g71/g72
  428. // however, g76/g77 have explicit fp16 arithmetic
  429. // the arm mali driver accepts spirv with fp16 arithmetic
  430. bug_implicit_fp16_arithmetic = true;
  431. }
  432. if (physicalDeviceProperties.vendorID == 0x5143
  433. && (physicalDeviceProperties.deviceID == 0x6030001
  434. || physicalDeviceProperties.deviceID == 0x6040001
  435. || physicalDeviceProperties.deviceID == 0x6050002))
  436. {
  437. // TODO enable devices other than qcom845/qcom855/qcom855plus/qcom865
  438. // the qcom adreno driver accepts spirv with fp16 arithmetic
  439. bug_implicit_fp16_arithmetic = true;
  440. }
  441. }
  442. static uint32_t find_device_compute_queue(const std::vector<VkQueueFamilyProperties>& queueFamilyProperties)
  443. {
  444. // first try, compute only queue
  445. for (uint32_t i = 0; i < queueFamilyProperties.size(); i++)
  446. {
  447. const VkQueueFamilyProperties& queueFamilyProperty = queueFamilyProperties[i];
  448. if ((queueFamilyProperty.queueFlags & VK_QUEUE_COMPUTE_BIT)
  449. && !(queueFamilyProperty.queueFlags & VK_QUEUE_GRAPHICS_BIT))
  450. {
  451. return i;
  452. }
  453. }
  454. // second try, any queue with compute and graphics
  455. for (uint32_t i = 0; i < queueFamilyProperties.size(); i++)
  456. {
  457. const VkQueueFamilyProperties& queueFamilyProperty = queueFamilyProperties[i];
  458. if ((queueFamilyProperty.queueFlags & VK_QUEUE_COMPUTE_BIT)
  459. && (queueFamilyProperty.queueFlags & VK_QUEUE_GRAPHICS_BIT))
  460. {
  461. return i;
  462. }
  463. }
  464. // third try, any queue with compute
  465. for (uint32_t i = 0; i < queueFamilyProperties.size(); i++)
  466. {
  467. const VkQueueFamilyProperties& queueFamilyProperty = queueFamilyProperties[i];
  468. if (queueFamilyProperty.queueFlags & VK_QUEUE_COMPUTE_BIT)
  469. {
  470. return i;
  471. }
  472. }
  473. // NCNN_LOGE("no compute queue");
  474. return (uint32_t)-1;
  475. }
  476. static uint32_t find_device_transfer_queue(const std::vector<VkQueueFamilyProperties>& queueFamilyProperties)
  477. {
  478. // first try, transfer only queue
  479. for (uint32_t i = 0; i < queueFamilyProperties.size(); i++)
  480. {
  481. const VkQueueFamilyProperties& queueFamilyProperty = queueFamilyProperties[i];
  482. if ((queueFamilyProperty.queueFlags & VK_QUEUE_TRANSFER_BIT)
  483. && !(queueFamilyProperty.queueFlags & VK_QUEUE_COMPUTE_BIT)
  484. && !(queueFamilyProperty.queueFlags & VK_QUEUE_GRAPHICS_BIT))
  485. {
  486. return i;
  487. }
  488. }
  489. // second try, any queue with transfer
  490. for (uint32_t i = 0; i < queueFamilyProperties.size(); i++)
  491. {
  492. const VkQueueFamilyProperties& queueFamilyProperty = queueFamilyProperties[i];
  493. if (queueFamilyProperty.queueFlags & VK_QUEUE_TRANSFER_BIT)
  494. {
  495. return i;
  496. }
  497. }
  498. // third try, use compute queue
  499. uint32_t compute_queue_index = find_device_compute_queue(queueFamilyProperties);
  500. if (compute_queue_index != (uint32_t)-1)
  501. {
  502. return compute_queue_index;
  503. }
  504. // NCNN_LOGE("no transfer queue");
  505. return (uint32_t)-1;
  506. }
  507. void GpuInfoPrivate::query_queue_properties()
  508. {
  509. // find compute and transfer queue families
  510. uint32_t queueFamilyPropertiesCount;
  511. vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &queueFamilyPropertiesCount, 0);
  512. std::vector<VkQueueFamilyProperties> queueFamilyProperties(queueFamilyPropertiesCount);
  513. vkGetPhysicalDeviceQueueFamilyProperties(physicalDevice, &queueFamilyPropertiesCount, queueFamilyProperties.data());
  514. compute_queue_family_index = find_device_compute_queue(queueFamilyProperties);
  515. transfer_queue_family_index = find_device_transfer_queue(queueFamilyProperties);
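  516. compute_queue_count = compute_queue_family_index != (uint32_t)-1 ? queueFamilyProperties[compute_queue_family_index].queueCount : 0; // avoid out-of-range access when no compute queue exists
  517. transfer_queue_count = transfer_queue_family_index != (uint32_t)-1 ? queueFamilyProperties[transfer_queue_family_index].queueCount : 0; // avoid out-of-range access when no transfer queue exists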
  518. unified_compute_transfer_queue = compute_queue_family_index == transfer_queue_family_index;
  519. }
  520. int GpuInfoPrivate::query_extensions()
  521. {
  522. // get device extension
  523. uint32_t deviceExtensionPropertyCount = 0;
  524. VkResult ret = vkEnumerateDeviceExtensionProperties(physicalDevice, NULL, &deviceExtensionPropertyCount, NULL);
  525. if (ret != VK_SUCCESS)
  526. {
  527. NCNN_LOGE("vkEnumerateDeviceExtensionProperties failed %d", ret);
  528. return -1;
  529. }
  530. deviceExtensionProperties.resize(deviceExtensionPropertyCount);
  531. ret = vkEnumerateDeviceExtensionProperties(physicalDevice, NULL, &deviceExtensionPropertyCount, deviceExtensionProperties.data());
  532. if (ret != VK_SUCCESS)
  533. {
  534. NCNN_LOGE("vkEnumerateDeviceExtensionProperties failed %d", ret);
  535. return -1;
  536. }
  537. // extension capability
  538. support_VK_KHR_8bit_storage = 0;
  539. support_VK_KHR_16bit_storage = 0;
  540. support_VK_KHR_bind_memory2 = 0;
  541. support_VK_KHR_buffer_device_address = 0;
  542. support_VK_KHR_create_renderpass2 = 0;
  543. support_VK_KHR_cooperative_matrix = 0;
  544. support_VK_KHR_dedicated_allocation = 0;
  545. support_VK_KHR_descriptor_update_template = 0;
  546. support_VK_KHR_driver_properties = 0;
  547. support_VK_KHR_external_memory = 0;
  548. support_VK_KHR_get_memory_requirements2 = 0;
  549. support_VK_KHR_maintenance1 = 0;
  550. support_VK_KHR_maintenance2 = 0;
  551. support_VK_KHR_maintenance3 = 0;
  552. support_VK_KHR_multiview = 0;
  553. support_VK_KHR_portability_subset = 0;
  554. support_VK_KHR_push_descriptor = 0;
  555. support_VK_KHR_sampler_ycbcr_conversion = 0;
  556. support_VK_KHR_shader_bfloat16 = 0;
  557. support_VK_KHR_shader_float16_int8 = 0;
  558. support_VK_KHR_shader_float_controls = 0;
  559. support_VK_KHR_shader_float_controls2 = 0;
  560. support_VK_KHR_shader_integer_dot_product = 0;
  561. support_VK_KHR_shader_non_semantic_info = 0;
  562. support_VK_KHR_shader_subgroup_extended_types = 0;
  563. support_VK_KHR_shader_subgroup_rotate = 0;
  564. support_VK_KHR_storage_buffer_storage_class = 0;
  565. support_VK_KHR_swapchain = 0;
  566. support_VK_KHR_vulkan_memory_model = 0;
  567. support_VK_KHR_zero_initialize_workgroup_memory = 0;
  568. support_VK_EXT_buffer_device_address = 0;
  569. support_VK_EXT_descriptor_indexing = 0;
  570. support_VK_EXT_memory_budget = 0;
  571. support_VK_EXT_memory_priority = 0;
  572. support_VK_EXT_queue_family_foreign = 0;
  573. support_VK_EXT_shader_atomic_float = 0;
  574. support_VK_EXT_shader_atomic_float2 = 0;
  575. support_VK_EXT_shader_float8 = 0;
  576. support_VK_EXT_subgroup_size_control = 0;
  577. support_VK_AMD_device_coherent_memory = 0;
  578. #if __ANDROID_API__ >= 26
  579. support_VK_ANDROID_external_memory_android_hardware_buffer = 0;
  580. #endif // __ANDROID_API__ >= 26
  581. support_VK_NV_cooperative_matrix = 0;
  582. support_VK_NV_cooperative_matrix2 = 0;
  583. support_VK_NV_cooperative_vector = 0;
  584. for (uint32_t j = 0; j < deviceExtensionPropertyCount; j++)
  585. {
  586. const VkExtensionProperties& exp = deviceExtensionProperties[j];
  587. // NCNN_LOGE("device extension %s = %u", exp.extensionName, exp.specVersion);
  588. if (strcmp(exp.extensionName, "VK_KHR_8bit_storage") == 0)
  589. support_VK_KHR_8bit_storage = exp.specVersion;
  590. else if (strcmp(exp.extensionName, "VK_KHR_16bit_storage") == 0)
  591. support_VK_KHR_16bit_storage = exp.specVersion;
  592. else if (strcmp(exp.extensionName, "VK_KHR_bind_memory2") == 0)
  593. support_VK_KHR_bind_memory2 = exp.specVersion;
  594. else if (strcmp(exp.extensionName, "VK_KHR_buffer_device_address") == 0)
  595. support_VK_KHR_buffer_device_address = exp.specVersion;
  596. else if (strcmp(exp.extensionName, "VK_KHR_create_renderpass2") == 0)
  597. support_VK_KHR_create_renderpass2 = exp.specVersion;
  598. else if (strcmp(exp.extensionName, "VK_KHR_cooperative_matrix") == 0)
  599. support_VK_KHR_cooperative_matrix = exp.specVersion;
  600. else if (strcmp(exp.extensionName, "VK_KHR_dedicated_allocation") == 0)
  601. support_VK_KHR_dedicated_allocation = exp.specVersion;
  602. else if (strcmp(exp.extensionName, "VK_KHR_descriptor_update_template") == 0)
  603. support_VK_KHR_descriptor_update_template = exp.specVersion;
  604. else if (strcmp(exp.extensionName, "VK_KHR_driver_properties") == 0)
  605. support_VK_KHR_driver_properties = exp.specVersion;
  606. else if (strcmp(exp.extensionName, "VK_KHR_external_memory") == 0)
  607. support_VK_KHR_external_memory = exp.specVersion;
  608. else if (strcmp(exp.extensionName, "VK_KHR_get_memory_requirements2") == 0)
  609. support_VK_KHR_get_memory_requirements2 = exp.specVersion;
  610. else if (strcmp(exp.extensionName, "VK_KHR_maintenance1") == 0)
  611. support_VK_KHR_maintenance1 = exp.specVersion;
  612. else if (strcmp(exp.extensionName, "VK_KHR_maintenance2") == 0)
  613. support_VK_KHR_maintenance2 = exp.specVersion;
  614. else if (strcmp(exp.extensionName, "VK_KHR_maintenance3") == 0)
  615. support_VK_KHR_maintenance3 = exp.specVersion;
  616. else if (strcmp(exp.extensionName, "VK_KHR_multiview") == 0)
  617. support_VK_KHR_multiview = exp.specVersion;
  618. else if (strcmp(exp.extensionName, "VK_KHR_portability_subset") == 0)
  619. support_VK_KHR_portability_subset = exp.specVersion;
  620. else if (strcmp(exp.extensionName, "VK_KHR_push_descriptor") == 0)
  621. support_VK_KHR_push_descriptor = exp.specVersion;
  622. else if (strcmp(exp.extensionName, "VK_KHR_sampler_ycbcr_conversion") == 0)
  623. support_VK_KHR_sampler_ycbcr_conversion = exp.specVersion;
  624. else if (strcmp(exp.extensionName, "VK_KHR_shader_bfloat16") == 0)
  625. support_VK_KHR_shader_bfloat16 = exp.specVersion;
  626. else if (strcmp(exp.extensionName, "VK_KHR_shader_float16_int8") == 0)
  627. support_VK_KHR_shader_float16_int8 = exp.specVersion;
  628. else if (strcmp(exp.extensionName, "VK_KHR_shader_float_controls") == 0)
  629. support_VK_KHR_shader_float_controls = exp.specVersion;
  630. else if (strcmp(exp.extensionName, "VK_KHR_shader_float_controls2") == 0)
  631. support_VK_KHR_shader_float_controls2 = exp.specVersion;
  632. else if (strcmp(exp.extensionName, "VK_KHR_shader_integer_dot_product") == 0)
  633. support_VK_KHR_shader_integer_dot_product = exp.specVersion;
  634. else if (strcmp(exp.extensionName, "VK_KHR_shader_non_semantic_info") == 0)
  635. support_VK_KHR_shader_non_semantic_info = exp.specVersion;
  636. else if (strcmp(exp.extensionName, "VK_KHR_shader_subgroup_extended_types") == 0)
  637. support_VK_KHR_shader_subgroup_extended_types = exp.specVersion;
  638. else if (strcmp(exp.extensionName, "VK_KHR_shader_subgroup_rotate") == 0)
  639. support_VK_KHR_shader_subgroup_rotate = exp.specVersion;
  640. else if (strcmp(exp.extensionName, "VK_KHR_storage_buffer_storage_class") == 0)
  641. support_VK_KHR_storage_buffer_storage_class = exp.specVersion;
  642. else if (strcmp(exp.extensionName, "VK_KHR_swapchain") == 0)
  643. support_VK_KHR_swapchain = exp.specVersion;
  644. else if (strcmp(exp.extensionName, "VK_KHR_vulkan_memory_model") == 0)
  645. support_VK_KHR_vulkan_memory_model = exp.specVersion;
  646. else if (strcmp(exp.extensionName, "VK_KHR_zero_initialize_workgroup_memory") == 0)
  647. support_VK_KHR_zero_initialize_workgroup_memory = exp.specVersion;
  648. else if (strcmp(exp.extensionName, "VK_EXT_buffer_device_address") == 0)
  649. support_VK_EXT_buffer_device_address = exp.specVersion;
  650. else if (strcmp(exp.extensionName, "VK_EXT_descriptor_indexing") == 0)
  651. support_VK_EXT_descriptor_indexing = exp.specVersion;
  652. else if (strcmp(exp.extensionName, "VK_EXT_memory_budget") == 0)
  653. support_VK_EXT_memory_budget = exp.specVersion;
  654. else if (strcmp(exp.extensionName, "VK_EXT_memory_priority") == 0)
  655. support_VK_EXT_memory_priority = exp.specVersion;
  656. else if (strcmp(exp.extensionName, "VK_EXT_queue_family_foreign") == 0)
  657. support_VK_EXT_queue_family_foreign = exp.specVersion;
  658. else if (strcmp(exp.extensionName, "VK_EXT_shader_atomic_float") == 0)
  659. support_VK_EXT_shader_atomic_float = exp.specVersion;
  660. else if (strcmp(exp.extensionName, "VK_EXT_shader_atomic_float2") == 0)
  661. support_VK_EXT_shader_atomic_float2 = exp.specVersion;
  662. else if (strcmp(exp.extensionName, "VK_EXT_shader_float8") == 0)
  663. support_VK_EXT_shader_float8 = exp.specVersion;
  664. else if (strcmp(exp.extensionName, "VK_EXT_subgroup_size_control") == 0)
  665. support_VK_EXT_subgroup_size_control = exp.specVersion;
  666. else if (strcmp(exp.extensionName, "VK_AMD_device_coherent_memory") == 0)
  667. support_VK_AMD_device_coherent_memory = exp.specVersion;
  668. #if __ANDROID_API__ >= 26
  669. else if (strcmp(exp.extensionName, "VK_ANDROID_external_memory_android_hardware_buffer") == 0)
  670. support_VK_ANDROID_external_memory_android_hardware_buffer = exp.specVersion;
  671. #endif // __ANDROID_API__ >= 26
  672. else if (strcmp(exp.extensionName, "VK_NV_cooperative_matrix") == 0)
  673. support_VK_NV_cooperative_matrix = exp.specVersion;
  674. else if (strcmp(exp.extensionName, "VK_NV_cooperative_matrix2") == 0)
  675. support_VK_NV_cooperative_matrix2 = exp.specVersion;
  676. else if (strcmp(exp.extensionName, "VK_NV_cooperative_vector") == 0)
  677. support_VK_NV_cooperative_vector = exp.specVersion;
  678. }
  679. if (support_VK_KHR_buffer_device_address)
  680. {
  681. // we prefer khr extension
  682. support_VK_EXT_buffer_device_address = 0;
  683. }
  684. if (support_VK_KHR_cooperative_matrix)
  685. {
  686. // we prefer khr extension
  687. support_VK_NV_cooperative_matrix = 0;
  688. }
  689. return 0;
  690. }
  691. void GpuInfoPrivate::query_extension_features()
  692. {
  693. queryExtensionFeatures = 0;
  694. // query int8 storage
  695. memset(&query8BitStorageFeatures, 0, sizeof(query8BitStorageFeatures));
  696. query8BitStorageFeatures.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_8BIT_STORAGE_FEATURES_KHR;
  697. query8BitStorageFeatures.pNext = 0;
  698. if (support_VK_KHR_8bit_storage)
  699. {
  700. query8BitStorageFeatures.pNext = queryExtensionFeatures;
  701. queryExtensionFeatures = &query8BitStorageFeatures;
  702. }
  703. // query fp16/int16 storage
  704. memset(&query16BitStorageFeatures, 0, sizeof(query16BitStorageFeatures));
  705. query16BitStorageFeatures.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_16BIT_STORAGE_FEATURES_KHR;
  706. query16BitStorageFeatures.pNext = 0;
  707. if (support_VK_KHR_16bit_storage)
  708. {
  709. query16BitStorageFeatures.pNext = queryExtensionFeatures;
  710. queryExtensionFeatures = &query16BitStorageFeatures;
  711. }
  712. // query fp16/int8 arithmetic
  713. memset(&queryFloat16Int8Features, 0, sizeof(queryFloat16Int8Features));
  714. queryFloat16Int8Features.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FLOAT16_INT8_FEATURES_KHR;
  715. queryFloat16Int8Features.pNext = 0;
  716. if (support_VK_KHR_shader_float16_int8)
  717. {
  718. queryFloat16Int8Features.pNext = queryExtensionFeatures;
  719. queryExtensionFeatures = &queryFloat16Int8Features;
  720. }
  721. // query ycbcr_conversion
  722. memset(&querySamplerYcbcrConversionFeatures, 0, sizeof(querySamplerYcbcrConversionFeatures));
  723. querySamplerYcbcrConversionFeatures.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SAMPLER_YCBCR_CONVERSION_FEATURES_KHR;
  724. querySamplerYcbcrConversionFeatures.pNext = 0;
  725. if (support_VK_KHR_sampler_ycbcr_conversion)
  726. {
  727. querySamplerYcbcrConversionFeatures.pNext = queryExtensionFeatures;
  728. queryExtensionFeatures = &querySamplerYcbcrConversionFeatures;
  729. }
  730. // query cooperative_matrix
  731. memset(&queryCooperativeMatrixFeatures, 0, sizeof(queryCooperativeMatrixFeatures));
  732. queryCooperativeMatrixFeatures.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_COOPERATIVE_MATRIX_FEATURES_KHR;
  733. queryCooperativeMatrixFeatures.pNext = 0;
  734. if (support_VK_KHR_cooperative_matrix)
  735. {
  736. queryCooperativeMatrixFeatures.pNext = queryExtensionFeatures;
  737. queryExtensionFeatures = &queryCooperativeMatrixFeatures;
  738. }
  739. // query nv cooperative matrix
  740. memset(&queryCooperativeMatrixFeaturesNV, 0, sizeof(queryCooperativeMatrixFeaturesNV));
  741. queryCooperativeMatrixFeaturesNV.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_COOPERATIVE_MATRIX_FEATURES_NV;
  742. queryCooperativeMatrixFeaturesNV.pNext = 0;
  743. if (support_VK_NV_cooperative_matrix)
  744. {
  745. queryCooperativeMatrixFeaturesNV.pNext = queryExtensionFeatures;
  746. queryExtensionFeatures = &queryCooperativeMatrixFeaturesNV;
  747. }
  748. // query nv cooperative matrix2
  749. memset(&queryCooperativeMatrix2FeaturesNV, 0, sizeof(queryCooperativeMatrix2FeaturesNV));
  750. queryCooperativeMatrix2FeaturesNV.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_COOPERATIVE_MATRIX_2_FEATURES_NV;
  751. queryCooperativeMatrix2FeaturesNV.pNext = 0;
  752. if (support_VK_NV_cooperative_matrix2)
  753. {
  754. queryCooperativeMatrix2FeaturesNV.pNext = queryExtensionFeatures;
  755. queryExtensionFeatures = &queryCooperativeMatrix2FeaturesNV;
  756. }
  757. // query bfloat16
  758. memset(&queryShaderBfloat16Features, 0, sizeof(queryShaderBfloat16Features));
  759. queryShaderBfloat16Features.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SHADER_BFLOAT16_FEATURES_KHR;
  760. queryShaderBfloat16Features.pNext = 0;
  761. if (support_VK_KHR_shader_bfloat16)
  762. {
  763. queryShaderBfloat16Features.pNext = queryExtensionFeatures;
  764. queryExtensionFeatures = &queryShaderBfloat16Features;
  765. }
  766. // query float8
  767. memset(&queryShaderFloat8Features, 0, sizeof(queryShaderFloat8Features));
  768. queryShaderFloat8Features.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SHADER_FLOAT8_FEATURES_EXT;
  769. queryShaderFloat8Features.pNext = 0;
  770. if (support_VK_EXT_shader_float8)
  771. {
  772. queryShaderFloat8Features.pNext = queryExtensionFeatures;
  773. queryExtensionFeatures = &queryShaderFloat8Features;
  774. }
  775. // query float controls 2
  776. memset(&queryShaderFloatControls2Features, 0, sizeof(queryShaderFloatControls2Features));
  777. queryShaderFloatControls2Features.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SHADER_FLOAT_CONTROLS_2_FEATURES_KHR;
  778. queryShaderFloatControls2Features.pNext = 0;
  779. if (support_VK_KHR_shader_float_controls2)
  780. {
  781. queryShaderFloatControls2Features.pNext = queryExtensionFeatures;
  782. queryExtensionFeatures = &queryShaderFloatControls2Features;
  783. }
  784. // query integer dot product
  785. memset(&queryShaderIntegerDotProductFeatures, 0, sizeof(queryShaderIntegerDotProductFeatures));
  786. queryShaderIntegerDotProductFeatures.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SHADER_INTEGER_DOT_PRODUCT_FEATURES_KHR;
  787. queryShaderIntegerDotProductFeatures.pNext = 0;
  788. if (support_VK_KHR_shader_integer_dot_product)
  789. {
  790. queryShaderIntegerDotProductFeatures.pNext = queryExtensionFeatures;
  791. queryExtensionFeatures = &queryShaderIntegerDotProductFeatures;
  792. }
  793. // query subgroup size control
  794. memset(&querySubgroupSizeControlFeatures, 0, sizeof(querySubgroupSizeControlFeatures));
  795. querySubgroupSizeControlFeatures.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SUBGROUP_SIZE_CONTROL_FEATURES_EXT;
  796. querySubgroupSizeControlFeatures.pNext = 0;
  797. if (support_VK_EXT_subgroup_size_control >= 2)
  798. {
  799. querySubgroupSizeControlFeatures.pNext = queryExtensionFeatures;
  800. queryExtensionFeatures = &querySubgroupSizeControlFeatures;
  801. }
  802. // query subgroup rotate
  803. memset(&queryShaderSubgroupRotateFeatures, 0, sizeof(queryShaderSubgroupRotateFeatures));
  804. queryShaderSubgroupRotateFeatures.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SHADER_SUBGROUP_ROTATE_FEATURES_KHR;
  805. queryShaderSubgroupRotateFeatures.pNext = 0;
  806. if (support_VK_KHR_shader_subgroup_rotate)
  807. {
  808. queryShaderSubgroupRotateFeatures.pNext = queryExtensionFeatures;
  809. queryExtensionFeatures = &queryShaderSubgroupRotateFeatures;
  810. }
  811. // query atomic float
  812. memset(&queryShaderAtomicFloatFeatures, 0, sizeof(queryShaderAtomicFloatFeatures));
  813. queryShaderAtomicFloatFeatures.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SHADER_ATOMIC_FLOAT_FEATURES_EXT;
  814. queryShaderAtomicFloatFeatures.pNext = 0;
  815. if (support_VK_EXT_shader_atomic_float)
  816. {
  817. queryShaderAtomicFloatFeatures.pNext = queryExtensionFeatures;
  818. queryExtensionFeatures = &queryShaderAtomicFloatFeatures;
  819. }
  820. // query atomic float2
  821. memset(&queryShaderAtomicFloat2Features, 0, sizeof(queryShaderAtomicFloat2Features));
  822. queryShaderAtomicFloat2Features.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SHADER_ATOMIC_FLOAT_2_FEATURES_EXT;
  823. queryShaderAtomicFloat2Features.pNext = 0;
  824. if (support_VK_EXT_shader_atomic_float2)
  825. {
  826. queryShaderAtomicFloat2Features.pNext = queryExtensionFeatures;
  827. queryExtensionFeatures = &queryShaderAtomicFloat2Features;
  828. }
  829. // query vulkan memory model
  830. memset(&queryVulkanMemoryModelFeatures, 0, sizeof(queryVulkanMemoryModelFeatures));
  831. queryVulkanMemoryModelFeatures.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_VULKAN_MEMORY_MODEL_FEATURES_KHR;
  832. queryVulkanMemoryModelFeatures.pNext = 0;
  833. if (support_VK_KHR_vulkan_memory_model)
  834. {
  835. queryVulkanMemoryModelFeatures.pNext = queryExtensionFeatures;
  836. queryExtensionFeatures = &queryVulkanMemoryModelFeatures;
  837. }
  838. // query nv cooperative vector
  839. memset(&queryCooperativeVectorFeaturesNV, 0, sizeof(queryCooperativeVectorFeaturesNV));
  840. queryCooperativeVectorFeaturesNV.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_COOPERATIVE_VECTOR_FEATURES_NV;
  841. queryCooperativeVectorFeaturesNV.pNext = 0;
  842. if (support_VK_NV_cooperative_vector)
  843. {
  844. queryCooperativeVectorFeaturesNV.pNext = queryExtensionFeatures;
  845. queryExtensionFeatures = &queryCooperativeVectorFeaturesNV;
  846. }
  847. if (support_VK_KHR_get_physical_device_properties2)
  848. {
  849. VkPhysicalDeviceFeatures2KHR queryFeatures;
  850. queryFeatures.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FEATURES_2_KHR;
  851. queryFeatures.pNext = queryExtensionFeatures;
  852. vkGetPhysicalDeviceFeatures2KHR(physicalDevice, &queryFeatures);
  853. }
  854. // apply known blacklist
  855. if (physicalDeviceProperties.vendorID == 0x13b5 && physicalDeviceProperties.apiVersion < VK_MAKE_VERSION(1, 0, 82))
  856. {
  857. // the 16bit_storage implementation of arm mali driver is buggy :[
  858. query16BitStorageFeatures.storageBuffer16BitAccess = VK_FALSE;
  859. }
  860. if (physicalDeviceProperties.vendorID == 0x10002 && physicalDeviceProperties.deviceID == 0x70006214 && physicalDeviceProperties.apiVersion == VK_MAKE_VERSION(1, 1, 82))
  861. {
  862. // the 16bit_storage implementation of vivante gc1700 driver is buggy :[
  863. query16BitStorageFeatures.storageBuffer16BitAccess = VK_FALSE;
  864. }
  865. if (bug_implicit_fp16_arithmetic)
  866. {
  867. // force capability on as long as the driver accepts spirv with fp16 arithmetic :D
  868. queryFloat16Int8Features.shaderFloat16 = VK_TRUE;
  869. }
  870. if (physicalDeviceProperties.vendorID == 0x5143 && !query16BitStorageFeatures.storageBuffer16BitAccess)
  871. {
  872. // fp16 arithmetic yields wrong result on old adreno drivers :(
  873. queryFloat16Int8Features.shaderFloat16 = VK_FALSE;
  874. }
  875. }
  876. void GpuInfoPrivate::query_extension_properties()
  877. {
  878. queryExtensionProperties = 0;
  879. // query float controls
  880. memset(&queryFloatControlsProperties, 0, sizeof(queryFloatControlsProperties));
  881. queryFloatControlsProperties.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_FLOAT_CONTROLS_PROPERTIES;
  882. queryFloatControlsProperties.pNext = 0;
  883. if (support_VK_KHR_shader_float_controls)
  884. {
  885. queryFloatControlsProperties.pNext = queryExtensionProperties;
  886. queryExtensionProperties = &queryFloatControlsProperties;
  887. }
  888. // query integer dot product
  889. memset(&queryShaderIntegerDotProductProperties, 0, sizeof(queryShaderIntegerDotProductProperties));
  890. queryShaderIntegerDotProductProperties.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SHADER_INTEGER_DOT_PRODUCT_PROPERTIES_KHR;
  891. queryShaderIntegerDotProductProperties.pNext = 0;
  892. if (support_VK_KHR_shader_integer_dot_product)
  893. {
  894. queryShaderIntegerDotProductProperties.pNext = queryExtensionProperties;
  895. queryExtensionProperties = &queryShaderIntegerDotProductProperties;
  896. }
  897. // query subgroup
  898. memset(&querySubgroupProperties, 0, sizeof(querySubgroupProperties));
  899. querySubgroupProperties.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SUBGROUP_PROPERTIES;
  900. querySubgroupProperties.pNext = 0;
  901. if (g_instance.instance_api_version >= VK_MAKE_VERSION(1, 1, 0))
  902. {
  903. querySubgroupProperties.pNext = queryExtensionProperties;
  904. queryExtensionProperties = &querySubgroupProperties;
  905. }
  906. else
  907. {
  908. querySubgroupProperties.subgroupSize = 64;
  909. if (physicalDeviceProperties.vendorID == 0x5143) // qcom adreno prefers very large workgroups :P
  910. querySubgroupProperties.subgroupSize = 128;
  911. if (physicalDeviceProperties.vendorID == 0x13b5) // arm mali
  912. querySubgroupProperties.subgroupSize = 16;
  913. if (physicalDeviceProperties.vendorID == 0x1010) // imgtec powervr
  914. querySubgroupProperties.subgroupSize = 32;
  915. if (physicalDeviceProperties.vendorID == 0x1002) // amd
  916. querySubgroupProperties.subgroupSize = 64;
  917. if (physicalDeviceProperties.vendorID == 0x10de) // nvidia
  918. querySubgroupProperties.subgroupSize = 32;
  919. if (physicalDeviceProperties.vendorID == 0x8086) // intel
  920. querySubgroupProperties.subgroupSize = 32;
  921. }

    // query driver properties
    memset(&queryDriverProperties, 0, sizeof(queryDriverProperties));
    queryDriverProperties.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_DRIVER_PROPERTIES_KHR;
    queryDriverProperties.pNext = 0;
    if (support_VK_KHR_driver_properties)
    {
        queryDriverProperties.pNext = queryExtensionProperties;
        queryExtensionProperties = &queryDriverProperties;
    }

    // query subgroup size control
    memset(&querySubgroupSizeControlProperties, 0, sizeof(querySubgroupSizeControlProperties));
    querySubgroupSizeControlProperties.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SUBGROUP_SIZE_CONTROL_PROPERTIES_EXT;
    querySubgroupSizeControlProperties.pNext = 0;
    if (support_VK_EXT_subgroup_size_control)
    {
        querySubgroupSizeControlProperties.pNext = queryExtensionProperties;
        queryExtensionProperties = &querySubgroupSizeControlProperties;
    }

    // query nv cooperative matrix2
    memset(&queryCooperativeMatrix2PropertiesNV, 0, sizeof(queryCooperativeMatrix2PropertiesNV));
    queryCooperativeMatrix2PropertiesNV.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_COOPERATIVE_MATRIX_2_PROPERTIES_NV;
    queryCooperativeMatrix2PropertiesNV.pNext = 0;
    if (support_VK_NV_cooperative_matrix2)
    {
        queryCooperativeMatrix2PropertiesNV.pNext = queryExtensionProperties;
        queryExtensionProperties = &queryCooperativeMatrix2PropertiesNV;
    }

    // query nv cooperative vector
    memset(&queryCooperativeVectorPropertiesNV, 0, sizeof(queryCooperativeVectorPropertiesNV));
    queryCooperativeVectorPropertiesNV.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_COOPERATIVE_VECTOR_PROPERTIES_NV;
    queryCooperativeVectorPropertiesNV.pNext = 0;
    if (support_VK_NV_cooperative_vector)
    {
        queryCooperativeVectorPropertiesNV.pNext = queryExtensionProperties;
        queryExtensionProperties = &queryCooperativeVectorPropertiesNV;
    }

    if (support_VK_KHR_get_physical_device_properties2)
    {
        VkPhysicalDeviceProperties2KHR queryProperties;
        queryProperties.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2_KHR;
        queryProperties.pNext = queryExtensionProperties;
        vkGetPhysicalDeviceProperties2KHR(physicalDevice, &queryProperties);

        // append subgroup rotate
        if (support_VK_KHR_shader_subgroup_rotate)
        {
            if (queryShaderSubgroupRotateFeatures.shaderSubgroupRotate)
                querySubgroupProperties.supportedOperations |= VK_SUBGROUP_FEATURE_ROTATE_BIT_KHR;
            if (queryShaderSubgroupRotateFeatures.shaderSubgroupRotateClustered)
                querySubgroupProperties.supportedOperations |= VK_SUBGROUP_FEATURE_ROTATE_CLUSTERED_BIT_KHR;
        }
    }

    if (!support_VK_EXT_subgroup_size_control)
    {
        querySubgroupSizeControlProperties.minSubgroupSize = querySubgroupProperties.subgroupSize;
        querySubgroupSizeControlProperties.maxSubgroupSize = querySubgroupProperties.subgroupSize;
        querySubgroupSizeControlProperties.maxComputeWorkgroupSubgroups = std::max(physicalDeviceProperties.limits.maxComputeWorkGroupInvocations / querySubgroupProperties.subgroupSize, 1u);
    }

    // query supported cooperative matrix types and operations
    queryCooperativeMatrixSubProperties.clear();
    queryCooperativeMatrixSubPropertiesNV.clear();
    support_cooperative_matrix_8_8_16 = false;
    support_cooperative_matrix_16_8_8 = false;
    support_cooperative_matrix_16_8_16 = false;
    support_cooperative_matrix_16_16_16 = false;
    if (support_VK_KHR_cooperative_matrix && queryCooperativeMatrixFeatures.cooperativeMatrix)
    {
        uint32_t propertyCount = 0;
        VkResult ret = vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR(physicalDevice, &propertyCount, 0);
        if (ret != VK_SUCCESS)
        {
            NCNN_LOGE("vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR failed %d", ret);
        }

        queryCooperativeMatrixSubProperties.resize(propertyCount);
        for (uint32_t j = 0; j < propertyCount; j++)
        {
            memset(&queryCooperativeMatrixSubProperties[j], 0, sizeof(queryCooperativeMatrixSubProperties[j]));
            queryCooperativeMatrixSubProperties[j].sType = VK_STRUCTURE_TYPE_COOPERATIVE_MATRIX_PROPERTIES_KHR;
            queryCooperativeMatrixSubProperties[j].pNext = 0;
        }
        ret = vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR(physicalDevice, &propertyCount, queryCooperativeMatrixSubProperties.data());
        if (ret != VK_SUCCESS)
        {
            NCNN_LOGE("vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR failed %d", ret);
        }

        for (uint32_t j = 0; j < propertyCount; j++)
        {
            const VkCooperativeMatrixPropertiesKHR& cmp = queryCooperativeMatrixSubProperties[j];
            // NCNN_LOGE("cpm %2d %2d %2d %d %d %d %d %d", cmp.MSize, cmp.NSize, cmp.KSize, cmp.AType, cmp.BType, cmp.CType, cmp.ResultType, cmp.scope);

            if (cmp.MSize == 8 && cmp.NSize == 8 && cmp.KSize == 16
                    && cmp.AType == VK_COMPONENT_TYPE_FLOAT16_KHR && cmp.BType == VK_COMPONENT_TYPE_FLOAT16_KHR
                    && cmp.CType == VK_COMPONENT_TYPE_FLOAT32_KHR && cmp.ResultType == VK_COMPONENT_TYPE_FLOAT32_KHR
                    && cmp.scope == VK_SCOPE_SUBGROUP_KHR)
            {
                support_cooperative_matrix_8_8_16 = true;
            }
            if (cmp.MSize == 16 && cmp.NSize == 8 && cmp.KSize == 8
                    && cmp.AType == VK_COMPONENT_TYPE_FLOAT16_KHR && cmp.BType == VK_COMPONENT_TYPE_FLOAT16_KHR
                    && cmp.CType == VK_COMPONENT_TYPE_FLOAT32_KHR && cmp.ResultType == VK_COMPONENT_TYPE_FLOAT32_KHR
                    && cmp.scope == VK_SCOPE_SUBGROUP_KHR)
            {
                support_cooperative_matrix_16_8_8 = true;
            }
            if (cmp.MSize == 16 && cmp.NSize == 8 && cmp.KSize == 16
                    && cmp.AType == VK_COMPONENT_TYPE_FLOAT16_KHR && cmp.BType == VK_COMPONENT_TYPE_FLOAT16_KHR
                    && cmp.CType == VK_COMPONENT_TYPE_FLOAT32_KHR && cmp.ResultType == VK_COMPONENT_TYPE_FLOAT32_KHR
                    && cmp.scope == VK_SCOPE_SUBGROUP_KHR)
            {
                support_cooperative_matrix_16_8_16 = true;
            }
            if (cmp.MSize == 16 && cmp.NSize == 16 && cmp.KSize == 16
                    && cmp.AType == VK_COMPONENT_TYPE_FLOAT16_KHR && cmp.BType == VK_COMPONENT_TYPE_FLOAT16_KHR
                    && cmp.CType == VK_COMPONENT_TYPE_FLOAT32_KHR && cmp.ResultType == VK_COMPONENT_TYPE_FLOAT32_KHR
                    && cmp.scope == VK_SCOPE_SUBGROUP_KHR)
            {
                support_cooperative_matrix_16_16_16 = true;
            }
        }
    }
    else if (support_VK_NV_cooperative_matrix && queryCooperativeMatrixFeaturesNV.cooperativeMatrix)
    {
        uint32_t propertyCount = 0;
        VkResult ret = vkGetPhysicalDeviceCooperativeMatrixPropertiesNV(physicalDevice, &propertyCount, 0);
        if (ret != VK_SUCCESS)
        {
            NCNN_LOGE("vkGetPhysicalDeviceCooperativeMatrixPropertiesNV failed %d", ret);
        }

        queryCooperativeMatrixSubPropertiesNV.resize(propertyCount);
        for (uint32_t j = 0; j < propertyCount; j++)
        {
            memset(&queryCooperativeMatrixSubPropertiesNV[j], 0, sizeof(queryCooperativeMatrixSubPropertiesNV[j]));
            queryCooperativeMatrixSubPropertiesNV[j].sType = VK_STRUCTURE_TYPE_COOPERATIVE_MATRIX_PROPERTIES_NV;
            queryCooperativeMatrixSubPropertiesNV[j].pNext = 0;
        }
        ret = vkGetPhysicalDeviceCooperativeMatrixPropertiesNV(physicalDevice, &propertyCount, queryCooperativeMatrixSubPropertiesNV.data());
        if (ret != VK_SUCCESS)
        {
            NCNN_LOGE("vkGetPhysicalDeviceCooperativeMatrixPropertiesNV failed %d", ret);
        }

        for (uint32_t j = 0; j < propertyCount; j++)
        {
            const VkCooperativeMatrixPropertiesNV& cmp = queryCooperativeMatrixSubPropertiesNV[j];
            // NCNN_LOGE("cpm %2d %2d %2d %d %d %d %d %d", cmp.MSize, cmp.NSize, cmp.KSize, cmp.AType, cmp.BType, cmp.CType, cmp.DType, cmp.scope);

            if (cmp.MSize == 8 && cmp.NSize == 8 && cmp.KSize == 16
                    && cmp.AType == VK_COMPONENT_TYPE_FLOAT16_NV && cmp.BType == VK_COMPONENT_TYPE_FLOAT16_NV
                    && cmp.CType == VK_COMPONENT_TYPE_FLOAT32_NV && cmp.DType == VK_COMPONENT_TYPE_FLOAT32_NV
                    && cmp.scope == VK_SCOPE_SUBGROUP_NV)
            {
                support_cooperative_matrix_8_8_16 = true;
            }
            if (cmp.MSize == 16 && cmp.NSize == 8 && cmp.KSize == 8
                    && cmp.AType == VK_COMPONENT_TYPE_FLOAT16_NV && cmp.BType == VK_COMPONENT_TYPE_FLOAT16_NV
                    && cmp.CType == VK_COMPONENT_TYPE_FLOAT32_NV && cmp.DType == VK_COMPONENT_TYPE_FLOAT32_NV
                    && cmp.scope == VK_SCOPE_SUBGROUP_NV)
            {
                support_cooperative_matrix_16_8_8 = true;
            }
            if (cmp.MSize == 16 && cmp.NSize == 8 && cmp.KSize == 16
                    && cmp.AType == VK_COMPONENT_TYPE_FLOAT16_NV && cmp.BType == VK_COMPONENT_TYPE_FLOAT16_NV
                    && cmp.CType == VK_COMPONENT_TYPE_FLOAT32_NV && cmp.DType == VK_COMPONENT_TYPE_FLOAT32_NV
                    && cmp.scope == VK_SCOPE_SUBGROUP_NV)
            {
                support_cooperative_matrix_16_8_16 = true;
            }
            if (cmp.MSize == 16 && cmp.NSize == 16 && cmp.KSize == 16
                    && cmp.AType == VK_COMPONENT_TYPE_FLOAT16_NV && cmp.BType == VK_COMPONENT_TYPE_FLOAT16_NV
                    && cmp.CType == VK_COMPONENT_TYPE_FLOAT32_NV && cmp.DType == VK_COMPONENT_TYPE_FLOAT32_NV
                    && cmp.scope == VK_SCOPE_SUBGROUP_NV)
            {
                support_cooperative_matrix_16_16_16 = true;
            }
        }
    }

    // query supported cooperative matrix2 types and operations
    queryCooperativeMatrixFlexibleDimensionsSubPropertiesNV.clear();
    if (support_VK_NV_cooperative_matrix2 && queryCooperativeMatrix2FeaturesNV.cooperativeMatrixFlexibleDimensions)
    {
        uint32_t propertyCount = 0;
        VkResult ret = vkGetPhysicalDeviceCooperativeMatrixFlexibleDimensionsPropertiesNV(physicalDevice, &propertyCount, 0);
        if (ret != VK_SUCCESS)
        {
            NCNN_LOGE("vkGetPhysicalDeviceCooperativeMatrixFlexibleDimensionsPropertiesNV failed %d", ret);
        }

        queryCooperativeMatrixFlexibleDimensionsSubPropertiesNV.resize(propertyCount);
        for (uint32_t j = 0; j < propertyCount; j++)
        {
            memset(&queryCooperativeMatrixFlexibleDimensionsSubPropertiesNV[j], 0, sizeof(queryCooperativeMatrixFlexibleDimensionsSubPropertiesNV[j]));
            queryCooperativeMatrixFlexibleDimensionsSubPropertiesNV[j].sType = VK_STRUCTURE_TYPE_COOPERATIVE_MATRIX_FLEXIBLE_DIMENSIONS_PROPERTIES_NV;
            queryCooperativeMatrixFlexibleDimensionsSubPropertiesNV[j].pNext = 0;
        }
        ret = vkGetPhysicalDeviceCooperativeMatrixFlexibleDimensionsPropertiesNV(physicalDevice, &propertyCount, queryCooperativeMatrixFlexibleDimensionsSubPropertiesNV.data());
        if (ret != VK_SUCCESS)
        {
            NCNN_LOGE("vkGetPhysicalDeviceCooperativeMatrixFlexibleDimensionsPropertiesNV failed %d", ret);
        }

        for (uint32_t j = 0; j < propertyCount; j++)
        {
            const VkCooperativeMatrixFlexibleDimensionsPropertiesNV& cmfdp = queryCooperativeMatrixFlexibleDimensionsSubPropertiesNV[j];
            // NCNN_LOGE("cmfdp %2d %2d %2d %d %d %d %d %d %d %d", cmfdp.MGranularity, cmfdp.NGranularity, cmfdp.KGranularity, cmfdp.AType, cmfdp.BType, cmfdp.CType, cmfdp.ResultType, cmfdp.saturatingAccumulation, cmfdp.scope, cmfdp.workgroupInvocations);
        }
    }

    // query supported cooperative vector types and operations
    queryCooperativeVectorSubPropertiesNV.clear();
    if (support_VK_NV_cooperative_vector && queryCooperativeVectorFeaturesNV.cooperativeVector)
    {
        uint32_t propertyCount = 0;
        VkResult ret = vkGetPhysicalDeviceCooperativeVectorPropertiesNV(physicalDevice, &propertyCount, 0);
        if (ret != VK_SUCCESS)
        {
            NCNN_LOGE("vkGetPhysicalDeviceCooperativeVectorPropertiesNV failed %d", ret);
        }

        queryCooperativeVectorSubPropertiesNV.resize(propertyCount);
        for (uint32_t j = 0; j < propertyCount; j++)
        {
            memset(&queryCooperativeVectorSubPropertiesNV[j], 0, sizeof(queryCooperativeVectorSubPropertiesNV[j]));
            queryCooperativeVectorSubPropertiesNV[j].sType = VK_STRUCTURE_TYPE_COOPERATIVE_VECTOR_PROPERTIES_NV;
            queryCooperativeVectorSubPropertiesNV[j].pNext = 0;
        }
        ret = vkGetPhysicalDeviceCooperativeVectorPropertiesNV(physicalDevice, &propertyCount, queryCooperativeVectorSubPropertiesNV.data());
        if (ret != VK_SUCCESS)
        {
            NCNN_LOGE("vkGetPhysicalDeviceCooperativeVectorPropertiesNV failed %d", ret);
        }

        for (uint32_t j = 0; j < propertyCount; j++)
        {
            const VkCooperativeVectorPropertiesNV& cvp = queryCooperativeVectorSubPropertiesNV[j];
            // NCNN_LOGE("cvp %d %d %d %d %d %d", cvp.inputType, cvp.inputInterpretation, cvp.matrixInterpretation, cvp.biasInterpretation, cvp.resultType, cvp.transpose);
        }
    }

    if (queryDriverProperties.driverID == VK_DRIVER_ID_MESA_TURNIP)
    {
        // turnip crashes when compiling large shaders with full subgroups
        querySubgroupSizeControlFeatures.computeFullSubgroups = VK_FALSE;
    }
}

GpuInfo::GpuInfo()
    : d(new GpuInfoPrivate)
{
}

GpuInfo::~GpuInfo()
{
    delete d;
}

GpuInfo::GpuInfo(const GpuInfo&)
    : d(0)
{
}

GpuInfo& GpuInfo::operator=(const GpuInfo&)
{
    return *this;
}

int GpuInfo::device_index() const
{
    return d->device_index;
}

VkPhysicalDevice GpuInfo::physicalDevice() const
{
    return d->physicalDevice;
}

VkPhysicalDevice GpuInfo::physical_device() const
{
    return d->physicalDevice;
}

const VkPhysicalDeviceFeatures& GpuInfo::physicalDevicefeatures() const
{
    return d->physicalDevicefeatures;
}

const VkPhysicalDeviceProperties& GpuInfo::physicalDeviceProperties() const
{
    return d->physicalDeviceProperties;
}

const VkPhysicalDeviceMemoryProperties& GpuInfo::physicalDeviceMemoryProperties() const
{
    return d->physicalDeviceMemoryProperties;
}

const VkPhysicalDeviceMemoryProperties& GpuInfo::physical_device_memory_properties() const
{
    return d->physicalDeviceMemoryProperties;
}

const std::vector<VkExtensionProperties>& GpuInfo::deviceExtensionProperties() const
{
    return d->deviceExtensionProperties;
}

uint32_t GpuInfo::api_version() const
{
    return d->physicalDeviceProperties.apiVersion;
}

uint32_t GpuInfo::driver_version() const
{
    return d->physicalDeviceProperties.driverVersion;
}

uint32_t GpuInfo::vendor_id() const
{
    return d->physicalDeviceProperties.vendorID;
}

uint32_t GpuInfo::device_id() const
{
    return d->physicalDeviceProperties.deviceID;
}

const char* GpuInfo::device_name() const
{
    return d->physicalDeviceProperties.deviceName;
}

uint8_t* GpuInfo::pipeline_cache_uuid() const
{
    return d->physicalDeviceProperties.pipelineCacheUUID;
}

uint32_t GpuInfo::driver_id() const
{
    return d->queryDriverProperties.driverID;
}

const char* GpuInfo::driver_name() const
{
    return d->queryDriverProperties.driverName;
}

int GpuInfo::type() const
{
    return d->type;
}

uint32_t GpuInfo::max_shared_memory_size() const
{
    return d->physicalDeviceProperties.limits.maxComputeSharedMemorySize;
}

uint32_t GpuInfo::max_workgroup_count_x() const
{
    return d->physicalDeviceProperties.limits.maxComputeWorkGroupCount[0];
}

uint32_t GpuInfo::max_workgroup_count_y() const
{
    return d->physicalDeviceProperties.limits.maxComputeWorkGroupCount[1];
}

uint32_t GpuInfo::max_workgroup_count_z() const
{
    return d->physicalDeviceProperties.limits.maxComputeWorkGroupCount[2];
}

uint32_t GpuInfo::max_workgroup_invocations() const
{
    return d->physicalDeviceProperties.limits.maxComputeWorkGroupInvocations;
}

uint32_t GpuInfo::max_workgroup_size_x() const
{
    return d->physicalDeviceProperties.limits.maxComputeWorkGroupSize[0];
}

uint32_t GpuInfo::max_workgroup_size_y() const
{
    return d->physicalDeviceProperties.limits.maxComputeWorkGroupSize[1];
}

uint32_t GpuInfo::max_workgroup_size_z() const
{
    return d->physicalDeviceProperties.limits.maxComputeWorkGroupSize[2];
}

size_t GpuInfo::memory_map_alignment() const
{
    return d->physicalDeviceProperties.limits.minMemoryMapAlignment;
}

size_t GpuInfo::buffer_offset_alignment() const
{
    return d->physicalDeviceProperties.limits.minStorageBufferOffsetAlignment;
}

size_t GpuInfo::non_coherent_atom_size() const
{
    return d->physicalDeviceProperties.limits.nonCoherentAtomSize;
}

size_t GpuInfo::buffer_image_granularity() const
{
    return d->physicalDeviceProperties.limits.bufferImageGranularity;
}

uint32_t GpuInfo::max_image_dimension_1d() const
{
    return d->physicalDeviceProperties.limits.maxImageDimension1D;
}

uint32_t GpuInfo::max_image_dimension_2d() const
{
    return d->physicalDeviceProperties.limits.maxImageDimension2D;
}

uint32_t GpuInfo::max_image_dimension_3d() const
{
    return d->physicalDeviceProperties.limits.maxImageDimension3D;
}

float GpuInfo::timestamp_period() const
{
    return d->physicalDeviceProperties.limits.timestampPeriod;
}

uint32_t GpuInfo::compute_queue_family_index() const
{
    return d->compute_queue_family_index;
}

uint32_t GpuInfo::transfer_queue_family_index() const
{
    return d->transfer_queue_family_index;
}

uint32_t GpuInfo::compute_queue_count() const
{
    return d->compute_queue_count;
}

uint32_t GpuInfo::transfer_queue_count() const
{
    return d->transfer_queue_count;
}

bool GpuInfo::unified_compute_transfer_queue() const
{
    return d->unified_compute_transfer_queue;
}

uint32_t GpuInfo::subgroup_size() const
{
    return d->querySubgroupProperties.subgroupSize;
}

uint32_t GpuInfo::min_subgroup_size() const
{
    return d->querySubgroupSizeControlProperties.minSubgroupSize;
}

uint32_t GpuInfo::max_subgroup_size() const
{
    return d->querySubgroupSizeControlProperties.maxSubgroupSize;
}

uint32_t GpuInfo::max_compute_workgroup_subgroups() const
{
    return d->querySubgroupSizeControlProperties.maxComputeWorkgroupSubgroups;
}

bool GpuInfo::support_subgroup_size_control() const
{
    return d->querySubgroupSizeControlFeatures.subgroupSizeControl;
}

bool GpuInfo::support_compute_full_subgroups() const
{
    return d->querySubgroupSizeControlFeatures.computeFullSubgroups;
}

uint32_t GpuInfo::support_subgroup_ops() const
{
    return d->querySubgroupProperties.supportedOperations;
}

bool GpuInfo::bug_storage_buffer_no_l1() const
{
    return d->bug_storage_buffer_no_l1;
}

bool GpuInfo::bug_corrupted_online_pipeline_cache() const
{
    return d->bug_corrupted_online_pipeline_cache;
}

bool GpuInfo::bug_buffer_image_load_zero() const
{
    return d->bug_buffer_image_load_zero;
}

bool GpuInfo::bug_implicit_fp16_arithmetic() const
{
    return d->bug_implicit_fp16_arithmetic;
}

bool GpuInfo::support_fp16_packed() const
{
    return true;
}

bool GpuInfo::support_fp16_storage() const
{
    return d->query16BitStorageFeatures.storageBuffer16BitAccess;
}

bool GpuInfo::support_fp16_uniform() const
{
    return d->query16BitStorageFeatures.uniformAndStorageBuffer16BitAccess;
}

bool GpuInfo::support_fp16_arithmetic() const
{
    return d->queryFloat16Int8Features.shaderFloat16;
}

bool GpuInfo::support_int8_packed() const
{
    return true;
}

bool GpuInfo::support_int8_storage() const
{
    return d->query8BitStorageFeatures.storageBuffer8BitAccess;
}

bool GpuInfo::support_int8_uniform() const
{
    return d->query8BitStorageFeatures.uniformAndStorageBuffer8BitAccess;
}

bool GpuInfo::support_int8_arithmetic() const
{
    return d->queryFloat16Int8Features.shaderInt8;
}

bool GpuInfo::support_fp16_image() const
{
    return d->physicalDevicefeatures.shaderStorageImageExtendedFormats;
}

bool GpuInfo::support_int8_image() const
{
    return d->physicalDevicefeatures.shaderStorageImageExtendedFormats;
}

bool GpuInfo::support_fp_fast_math() const
{
    return d->queryShaderFloatControls2Features.shaderFloatControls2;
}

bool GpuInfo::support_ycbcr_conversion() const
{
    return d->querySamplerYcbcrConversionFeatures.samplerYcbcrConversion;
}

bool GpuInfo::support_cooperative_matrix() const
{
    return d->queryCooperativeMatrixFeatures.cooperativeMatrix || d->queryCooperativeMatrixFeaturesNV.cooperativeMatrix;
}

bool GpuInfo::support_cooperative_matrix_8_8_16() const
{
    return d->support_cooperative_matrix_8_8_16;
}

bool GpuInfo::support_cooperative_matrix_16_8_8() const
{
    return d->support_cooperative_matrix_16_8_8;
}

bool GpuInfo::support_cooperative_matrix_16_8_16() const
{
    return d->support_cooperative_matrix_16_8_16;
}

bool GpuInfo::support_cooperative_matrix_16_16_16() const
{
    return d->support_cooperative_matrix_16_16_16;
}

int GpuInfo::support_VK_KHR_8bit_storage() const
{
    return d->support_VK_KHR_8bit_storage;
}

int GpuInfo::support_VK_KHR_16bit_storage() const
{
    return d->support_VK_KHR_16bit_storage;
}

int GpuInfo::support_VK_KHR_bind_memory2() const
{
    return d->support_VK_KHR_bind_memory2;
}

int GpuInfo::support_VK_KHR_buffer_device_address() const
{
    return d->support_VK_KHR_buffer_device_address;
}

int GpuInfo::support_VK_KHR_create_renderpass2() const
{
    return d->support_VK_KHR_create_renderpass2;
}

int GpuInfo::support_VK_KHR_cooperative_matrix() const
{
    return d->support_VK_KHR_cooperative_matrix;
}

int GpuInfo::support_VK_KHR_dedicated_allocation() const
{
    return d->support_VK_KHR_dedicated_allocation;
}

int GpuInfo::support_VK_KHR_descriptor_update_template() const
{
    return d->support_VK_KHR_descriptor_update_template;
}

int GpuInfo::support_VK_KHR_driver_properties() const
{
    return d->support_VK_KHR_driver_properties;
}

int GpuInfo::support_VK_KHR_external_memory() const
{
    return d->support_VK_KHR_external_memory;
}

int GpuInfo::support_VK_KHR_get_memory_requirements2() const
{
    return d->support_VK_KHR_get_memory_requirements2;
}

int GpuInfo::support_VK_KHR_maintenance1() const
{
    return d->support_VK_KHR_maintenance1;
}

int GpuInfo::support_VK_KHR_maintenance2() const
{
    return d->support_VK_KHR_maintenance2;
}

int GpuInfo::support_VK_KHR_maintenance3() const
{
    return d->support_VK_KHR_maintenance3;
}

int GpuInfo::support_VK_KHR_multiview() const
{
    return d->support_VK_KHR_multiview;
}

int GpuInfo::support_VK_KHR_portability_subset() const
{
    return d->support_VK_KHR_portability_subset;
}

int GpuInfo::support_VK_KHR_push_descriptor() const
{
    return d->support_VK_KHR_push_descriptor;
}

int GpuInfo::support_VK_KHR_sampler_ycbcr_conversion() const
{
    return d->support_VK_KHR_sampler_ycbcr_conversion;
}

int GpuInfo::support_VK_KHR_shader_bfloat16() const
{
    return d->support_VK_KHR_shader_bfloat16;
}

int GpuInfo::support_VK_KHR_shader_float16_int8() const
{
    return d->support_VK_KHR_shader_float16_int8;
}

int GpuInfo::support_VK_KHR_shader_float_controls() const
{
    return d->support_VK_KHR_shader_float_controls;
}

int GpuInfo::support_VK_KHR_shader_float_controls2() const
{
    return d->support_VK_KHR_shader_float_controls2;
}

int GpuInfo::support_VK_KHR_shader_integer_dot_product() const
{
    return d->support_VK_KHR_shader_integer_dot_product;
}

int GpuInfo::support_VK_KHR_shader_non_semantic_info() const
{
    return d->support_VK_KHR_shader_non_semantic_info;
}

int GpuInfo::support_VK_KHR_shader_subgroup_extended_types() const
{
    return d->support_VK_KHR_shader_subgroup_extended_types;
}

int GpuInfo::support_VK_KHR_shader_subgroup_rotate() const
{
    return d->support_VK_KHR_shader_subgroup_rotate;
}

int GpuInfo::support_VK_KHR_storage_buffer_storage_class() const
{
    return d->support_VK_KHR_storage_buffer_storage_class;
}

int GpuInfo::support_VK_KHR_swapchain() const
{
    return d->support_VK_KHR_swapchain;
}

int GpuInfo::support_VK_KHR_vulkan_memory_model() const
{
    return d->support_VK_KHR_vulkan_memory_model;
}

int GpuInfo::support_VK_KHR_zero_initialize_workgroup_memory() const
{
    return d->support_VK_KHR_zero_initialize_workgroup_memory;
}

int GpuInfo::support_VK_EXT_buffer_device_address() const
{
    return d->support_VK_EXT_buffer_device_address;
}

int GpuInfo::support_VK_EXT_descriptor_indexing() const
{
    return d->support_VK_EXT_descriptor_indexing;
}

int GpuInfo::support_VK_EXT_memory_budget() const
{
    return d->support_VK_EXT_memory_budget;
}

int GpuInfo::support_VK_EXT_memory_priority() const
{
    return d->support_VK_EXT_memory_priority;
}

int GpuInfo::support_VK_EXT_queue_family_foreign() const
{
    return d->support_VK_EXT_queue_family_foreign;
}

int GpuInfo::support_VK_EXT_shader_atomic_float() const
{
    return d->support_VK_EXT_shader_atomic_float;
}

int GpuInfo::support_VK_EXT_shader_atomic_float2() const
{
    return d->support_VK_EXT_shader_atomic_float2;
}

int GpuInfo::support_VK_EXT_shader_float8() const
{
    return d->support_VK_EXT_shader_float8;
}

int GpuInfo::support_VK_EXT_subgroup_size_control() const
{
    return d->support_VK_EXT_subgroup_size_control;
}

int GpuInfo::support_VK_AMD_device_coherent_memory() const
{
    return d->support_VK_AMD_device_coherent_memory;
}

#if __ANDROID_API__ >= 26
int GpuInfo::support_VK_ANDROID_external_memory_android_hardware_buffer() const
{
    return d->support_VK_ANDROID_external_memory_android_hardware_buffer;
}
#endif // __ANDROID_API__ >= 26

int GpuInfo::support_VK_NV_cooperative_matrix() const
{
    return d->support_VK_NV_cooperative_matrix;
}

int GpuInfo::support_VK_NV_cooperative_matrix2() const
{
    return d->support_VK_NV_cooperative_matrix2;
}

int GpuInfo::support_VK_NV_cooperative_vector() const
{
    return d->support_VK_NV_cooperative_vector;
}

const void* GpuInfo::queryExtensionFeatures() const
{
    return d->queryExtensionFeatures;
}

const VkPhysicalDevice8BitStorageFeaturesKHR& GpuInfo::query8BitStorageFeatures() const
{
    return d->query8BitStorageFeatures;
}

const VkPhysicalDevice16BitStorageFeaturesKHR& GpuInfo::query16BitStorageFeatures() const
{
    return d->query16BitStorageFeatures;
}

const VkPhysicalDeviceFloat16Int8FeaturesKHR& GpuInfo::queryFloat16Int8Features() const
{
    return d->queryFloat16Int8Features;
}

const VkPhysicalDeviceSamplerYcbcrConversionFeaturesKHR& GpuInfo::querySamplerYcbcrConversionFeatures() const
{
    return d->querySamplerYcbcrConversionFeatures;
}

const VkPhysicalDeviceCooperativeMatrixFeaturesKHR& GpuInfo::queryCooperativeMatrixFeatures() const
{
    return d->queryCooperativeMatrixFeatures;
}

const VkPhysicalDeviceCooperativeMatrixFeaturesNV& GpuInfo::queryCooperativeMatrixFeaturesNV() const
{
    return d->queryCooperativeMatrixFeaturesNV;
}

const VkPhysicalDeviceCooperativeMatrix2FeaturesNV& GpuInfo::queryCooperativeMatrix2FeaturesNV() const
{
    return d->queryCooperativeMatrix2FeaturesNV;
}

const VkPhysicalDeviceCooperativeVectorFeaturesNV& GpuInfo::queryCooperativeVectorFeaturesNV() const
{
    return d->queryCooperativeVectorFeaturesNV;
}

const VkPhysicalDeviceSubgroupSizeControlFeaturesEXT& GpuInfo::querySubgroupSizeControlFeatures() const
{
    return d->querySubgroupSizeControlFeatures;
}

const VkPhysicalDeviceShaderBfloat16FeaturesKHR& GpuInfo::queryShaderBfloat16Features() const
{
    return d->queryShaderBfloat16Features;
}

const VkPhysicalDeviceShaderFloat8FeaturesEXT& GpuInfo::queryShaderFloat8Features() const
{
    return d->queryShaderFloat8Features;
}

const VkPhysicalDeviceShaderFloatControls2FeaturesKHR& GpuInfo::queryShaderFloatControls2Features() const
{
    return d->queryShaderFloatControls2Features;
}

const VkPhysicalDeviceShaderIntegerDotProductFeaturesKHR& GpuInfo::queryShaderIntegerDotProductFeatures() const
{
    return d->queryShaderIntegerDotProductFeatures;
}

const VkPhysicalDeviceShaderSubgroupRotateFeaturesKHR& GpuInfo::queryShaderSubgroupRotateFeatures() const
{
    return d->queryShaderSubgroupRotateFeatures;
}

const VkPhysicalDeviceShaderAtomicFloatFeaturesEXT& GpuInfo::queryShaderAtomicFloatFeatures() const
{
    return d->queryShaderAtomicFloatFeatures;
}

const VkPhysicalDeviceShaderAtomicFloat2FeaturesEXT& GpuInfo::queryShaderAtomicFloat2Features() const
{
    return d->queryShaderAtomicFloat2Features;
}

const VkPhysicalDeviceVulkanMemoryModelFeaturesKHR& GpuInfo::queryVulkanMemoryModelFeatures() const
{
    return d->queryVulkanMemoryModelFeatures;
}

const void* GpuInfo::queryExtensionProperties() const
{
    return d->queryExtensionProperties;
}

const VkPhysicalDeviceCooperativeMatrix2PropertiesNV& GpuInfo::queryCooperativeMatrix2PropertiesNV() const
{
    return d->queryCooperativeMatrix2PropertiesNV;
}

const VkPhysicalDeviceCooperativeVectorPropertiesNV& GpuInfo::queryCooperativeVectorPropertiesNV() const
{
    return d->queryCooperativeVectorPropertiesNV;
}

const VkPhysicalDeviceDriverPropertiesKHR& GpuInfo::queryDriverProperties() const
{
    return d->queryDriverProperties;
}

const VkPhysicalDeviceFloatControlsPropertiesKHR& GpuInfo::queryFloatControlsProperties() const
{
    return d->queryFloatControlsProperties;
}

const VkPhysicalDeviceShaderIntegerDotProductProperties& GpuInfo::queryShaderIntegerDotProductProperties() const
{
    return d->queryShaderIntegerDotProductProperties;
}

const VkPhysicalDeviceSubgroupProperties& GpuInfo::querySubgroupProperties() const
{
    return d->querySubgroupProperties;
}

const VkPhysicalDeviceSubgroupSizeControlPropertiesEXT& GpuInfo::querySubgroupSizeControlProperties() const
{
    return d->querySubgroupSizeControlProperties;
}

const std::vector<VkCooperativeMatrixPropertiesKHR>& GpuInfo::queryCooperativeMatrixSubProperties() const
{
    return d->queryCooperativeMatrixSubProperties;
}

const std::vector<VkCooperativeMatrixPropertiesNV>& GpuInfo::queryCooperativeMatrixSubPropertiesNV() const
{
    return d->queryCooperativeMatrixSubPropertiesNV;
}

const std::vector<VkCooperativeMatrixFlexibleDimensionsPropertiesNV>& GpuInfo::queryCooperativeMatrixFlexibleDimensionsSubPropertiesNV() const
{
    return d->queryCooperativeMatrixFlexibleDimensionsSubPropertiesNV;
}

const std::vector<VkCooperativeVectorPropertiesNV>& GpuInfo::queryCooperativeVectorSubPropertiesNV() const
{
    return d->queryCooperativeVectorSubPropertiesNV;
}
  1734. void GpuInfo::get_optimal_cooperative_matrix_mnk(int M, int N, int K, VkComponentTypeKHR type, VkComponentTypeKHR acctype, VkScopeKHR scope, int& coopmat_M, int& coopmat_N, int& coopmat_K) const
  1735. {
  1736. coopmat_M = 0;
  1737. coopmat_N = 0;
  1738. coopmat_K = 0;
  1739. // collect mnk candidates
  1740. std::vector<VkCooperativeMatrixPropertiesKHR> mnk_properties;
  1741. if (d->support_VK_KHR_cooperative_matrix && d->queryCooperativeMatrixFeatures.cooperativeMatrix)
  1742. {
  1743. for (size_t i = 0; i < d->queryCooperativeMatrixSubProperties.size(); i++)
  1744. {
  1745. const VkCooperativeMatrixPropertiesKHR& cmp = d->queryCooperativeMatrixSubProperties[i];
  1746. if (cmp.AType == type && cmp.BType == type
  1747. && cmp.CType == acctype && cmp.ResultType == acctype
  1748. && cmp.scope == scope)
  1749. {
  1750. mnk_properties.push_back(cmp);
  1751. }
  1752. }
  1753. }
  1754. else if (d->support_VK_NV_cooperative_matrix && d->queryCooperativeMatrixFeaturesNV.cooperativeMatrix)
  1755. {
  1756. for (size_t i = 0; i < d->queryCooperativeMatrixSubPropertiesNV.size(); i++)
  1757. {
  1758. const VkCooperativeMatrixPropertiesNV& cmp = d->queryCooperativeMatrixSubPropertiesNV[i];
  1759. if (cmp.AType == (VkComponentTypeNV)type && cmp.BType == (VkComponentTypeNV)type
  1760. && cmp.CType == (VkComponentTypeNV)acctype && cmp.DType == (VkComponentTypeNV)acctype
  1761. && cmp.scope == (VkScopeNV)scope)
  1762. {
  1763. VkCooperativeMatrixPropertiesKHR cmp_khr = {}; // zero-init, only MSize/NSize/KSize are used below
  1764. cmp_khr.MSize = cmp.MSize;
  1765. cmp_khr.NSize = cmp.NSize;
  1766. cmp_khr.KSize = cmp.KSize;
  1767. mnk_properties.push_back(cmp_khr);
  1768. }
  1769. }
  1770. }
  1771. if (mnk_properties.empty() && (acctype == VK_COMPONENT_TYPE_FLOAT16_KHR || acctype == VK_COMPONENT_TYPE_BFLOAT16_KHR))
  1772. {
  1773. // no candidate with a 16-bit accumulator, retry with an fp32 accumulator
  1774. return get_optimal_cooperative_matrix_mnk(M, N, K, type, VK_COMPONENT_TYPE_FLOAT32_KHR, scope, coopmat_M, coopmat_N, coopmat_K);
  1775. }
  1776. if (mnk_properties.empty())
  1777. return;
  1778. // pick the tuple with the least padding waste, the first mnk tuple wins on equal cost
  1779. double min_cost = DBL_MAX;
  1780. for (size_t i = 0; i < mnk_properties.size(); i++)
  1781. {
  1782. const VkCooperativeMatrixPropertiesKHR& cmp = mnk_properties[i];
  1783. const int M_pad = (M + cmp.MSize - 1) / cmp.MSize * cmp.MSize;
  1784. const int N_pad = (N + cmp.NSize - 1) / cmp.NSize * cmp.NSize;
  1785. const int K_pad = (K + cmp.KSize - 1) / cmp.KSize * cmp.KSize;
  1786. double cost = (double)M_pad * N_pad * K_pad - (double)M * N * K; // multiply in double to avoid int overflow
  1787. if (cost < min_cost)
  1788. {
  1789. min_cost = cost;
  1790. coopmat_M = cmp.MSize;
  1791. coopmat_N = cmp.NSize;
  1792. coopmat_K = cmp.KSize;
  1793. }
  1794. }
  1795. }
  1796. static int init_instance_core()
  1797. {
  1798. vkAllocateCommandBuffers = (PFN_vkAllocateCommandBuffers)vkGetInstanceProcAddr(g_instance, "vkAllocateCommandBuffers");
  1799. vkAllocateDescriptorSets = (PFN_vkAllocateDescriptorSets)vkGetInstanceProcAddr(g_instance, "vkAllocateDescriptorSets");
  1800. vkAllocateMemory = (PFN_vkAllocateMemory)vkGetInstanceProcAddr(g_instance, "vkAllocateMemory");
  1801. vkBeginCommandBuffer = (PFN_vkBeginCommandBuffer)vkGetInstanceProcAddr(g_instance, "vkBeginCommandBuffer");
  1802. vkBindBufferMemory = (PFN_vkBindBufferMemory)vkGetInstanceProcAddr(g_instance, "vkBindBufferMemory");
  1803. vkBindImageMemory = (PFN_vkBindImageMemory)vkGetInstanceProcAddr(g_instance, "vkBindImageMemory");
  1804. vkCmdBeginQuery = (PFN_vkCmdBeginQuery)vkGetInstanceProcAddr(g_instance, "vkCmdBeginQuery");
  1805. vkCmdBindDescriptorSets = (PFN_vkCmdBindDescriptorSets)vkGetInstanceProcAddr(g_instance, "vkCmdBindDescriptorSets");
  1806. vkCmdBindIndexBuffer = (PFN_vkCmdBindIndexBuffer)vkGetInstanceProcAddr(g_instance, "vkCmdBindIndexBuffer");
  1807. vkCmdBindPipeline = (PFN_vkCmdBindPipeline)vkGetInstanceProcAddr(g_instance, "vkCmdBindPipeline");
  1808. vkCmdCopyBuffer = (PFN_vkCmdCopyBuffer)vkGetInstanceProcAddr(g_instance, "vkCmdCopyBuffer");
  1809. vkCmdCopyBufferToImage = (PFN_vkCmdCopyBufferToImage)vkGetInstanceProcAddr(g_instance, "vkCmdCopyBufferToImage");
  1810. vkCmdCopyImage = (PFN_vkCmdCopyImage)vkGetInstanceProcAddr(g_instance, "vkCmdCopyImage");
  1811. vkCmdCopyImageToBuffer = (PFN_vkCmdCopyImageToBuffer)vkGetInstanceProcAddr(g_instance, "vkCmdCopyImageToBuffer");
  1812. vkCmdCopyQueryPoolResults = (PFN_vkCmdCopyQueryPoolResults)vkGetInstanceProcAddr(g_instance, "vkCmdCopyQueryPoolResults");
  1813. vkCmdDispatch = (PFN_vkCmdDispatch)vkGetInstanceProcAddr(g_instance, "vkCmdDispatch");
  1814. vkCmdDispatchIndirect = (PFN_vkCmdDispatchIndirect)vkGetInstanceProcAddr(g_instance, "vkCmdDispatchIndirect");
  1815. vkCmdEndQuery = (PFN_vkCmdEndQuery)vkGetInstanceProcAddr(g_instance, "vkCmdEndQuery");
  1816. vkCmdExecuteCommands = (PFN_vkCmdExecuteCommands)vkGetInstanceProcAddr(g_instance, "vkCmdExecuteCommands");
  1817. vkCmdFillBuffer = (PFN_vkCmdFillBuffer)vkGetInstanceProcAddr(g_instance, "vkCmdFillBuffer");
  1818. vkCmdPipelineBarrier = (PFN_vkCmdPipelineBarrier)vkGetInstanceProcAddr(g_instance, "vkCmdPipelineBarrier");
  1819. vkCmdPushConstants = (PFN_vkCmdPushConstants)vkGetInstanceProcAddr(g_instance, "vkCmdPushConstants");
  1820. vkCmdResetQueryPool = (PFN_vkCmdResetQueryPool)vkGetInstanceProcAddr(g_instance, "vkCmdResetQueryPool");
  1821. vkCmdResolveImage = (PFN_vkCmdResolveImage)vkGetInstanceProcAddr(g_instance, "vkCmdResolveImage");
  1822. vkCmdUpdateBuffer = (PFN_vkCmdUpdateBuffer)vkGetInstanceProcAddr(g_instance, "vkCmdUpdateBuffer");
  1823. vkCmdWriteTimestamp = (PFN_vkCmdWriteTimestamp)vkGetInstanceProcAddr(g_instance, "vkCmdWriteTimestamp");
  1824. vkCreateBuffer = (PFN_vkCreateBuffer)vkGetInstanceProcAddr(g_instance, "vkCreateBuffer");
  1825. vkCreateBufferView = (PFN_vkCreateBufferView)vkGetInstanceProcAddr(g_instance, "vkCreateBufferView");
  1826. vkCreateCommandPool = (PFN_vkCreateCommandPool)vkGetInstanceProcAddr(g_instance, "vkCreateCommandPool");
  1827. vkCreateComputePipelines = (PFN_vkCreateComputePipelines)vkGetInstanceProcAddr(g_instance, "vkCreateComputePipelines");
  1828. vkCreateDescriptorPool = (PFN_vkCreateDescriptorPool)vkGetInstanceProcAddr(g_instance, "vkCreateDescriptorPool");
  1829. vkCreateDescriptorSetLayout = (PFN_vkCreateDescriptorSetLayout)vkGetInstanceProcAddr(g_instance, "vkCreateDescriptorSetLayout");
  1830. vkCreateDevice = (PFN_vkCreateDevice)vkGetInstanceProcAddr(g_instance, "vkCreateDevice");
  1831. vkCreateFence = (PFN_vkCreateFence)vkGetInstanceProcAddr(g_instance, "vkCreateFence");
  1832. vkCreateImage = (PFN_vkCreateImage)vkGetInstanceProcAddr(g_instance, "vkCreateImage");
  1833. vkCreateImageView = (PFN_vkCreateImageView)vkGetInstanceProcAddr(g_instance, "vkCreateImageView");
  1834. vkCreatePipelineCache = (PFN_vkCreatePipelineCache)vkGetInstanceProcAddr(g_instance, "vkCreatePipelineCache");
  1835. vkCreatePipelineLayout = (PFN_vkCreatePipelineLayout)vkGetInstanceProcAddr(g_instance, "vkCreatePipelineLayout");
  1836. vkCreateQueryPool = (PFN_vkCreateQueryPool)vkGetInstanceProcAddr(g_instance, "vkCreateQueryPool");
  1837. vkCreateSampler = (PFN_vkCreateSampler)vkGetInstanceProcAddr(g_instance, "vkCreateSampler");
  1838. vkCreateSemaphore = (PFN_vkCreateSemaphore)vkGetInstanceProcAddr(g_instance, "vkCreateSemaphore");
  1839. vkCreateShaderModule = (PFN_vkCreateShaderModule)vkGetInstanceProcAddr(g_instance, "vkCreateShaderModule");
  1840. vkDestroyBuffer = (PFN_vkDestroyBuffer)vkGetInstanceProcAddr(g_instance, "vkDestroyBuffer");
  1841. vkDestroyBufferView = (PFN_vkDestroyBufferView)vkGetInstanceProcAddr(g_instance, "vkDestroyBufferView");
  1842. vkDestroyCommandPool = (PFN_vkDestroyCommandPool)vkGetInstanceProcAddr(g_instance, "vkDestroyCommandPool");
  1843. vkDestroyDescriptorPool = (PFN_vkDestroyDescriptorPool)vkGetInstanceProcAddr(g_instance, "vkDestroyDescriptorPool");
  1844. vkDestroyDescriptorSetLayout = (PFN_vkDestroyDescriptorSetLayout)vkGetInstanceProcAddr(g_instance, "vkDestroyDescriptorSetLayout");
  1845. vkDestroyDevice = (PFN_vkDestroyDevice)vkGetInstanceProcAddr(g_instance, "vkDestroyDevice");
  1846. vkDestroyFence = (PFN_vkDestroyFence)vkGetInstanceProcAddr(g_instance, "vkDestroyFence");
  1847. vkDestroyImage = (PFN_vkDestroyImage)vkGetInstanceProcAddr(g_instance, "vkDestroyImage");
  1848. vkDestroyImageView = (PFN_vkDestroyImageView)vkGetInstanceProcAddr(g_instance, "vkDestroyImageView");
  1849. vkDestroyInstance = (PFN_vkDestroyInstance)vkGetInstanceProcAddr(g_instance, "vkDestroyInstance");
  1850. vkDestroyPipeline = (PFN_vkDestroyPipeline)vkGetInstanceProcAddr(g_instance, "vkDestroyPipeline");
  1851. vkDestroyPipelineCache = (PFN_vkDestroyPipelineCache)vkGetInstanceProcAddr(g_instance, "vkDestroyPipelineCache");
  1852. vkDestroyPipelineLayout = (PFN_vkDestroyPipelineLayout)vkGetInstanceProcAddr(g_instance, "vkDestroyPipelineLayout");
  1853. vkDestroyQueryPool = (PFN_vkDestroyQueryPool)vkGetInstanceProcAddr(g_instance, "vkDestroyQueryPool");
  1854. vkDestroySampler = (PFN_vkDestroySampler)vkGetInstanceProcAddr(g_instance, "vkDestroySampler");
  1855. vkDestroySemaphore = (PFN_vkDestroySemaphore)vkGetInstanceProcAddr(g_instance, "vkDestroySemaphore");
  1856. vkDestroyShaderModule = (PFN_vkDestroyShaderModule)vkGetInstanceProcAddr(g_instance, "vkDestroyShaderModule");
  1857. vkDeviceWaitIdle = (PFN_vkDeviceWaitIdle)vkGetInstanceProcAddr(g_instance, "vkDeviceWaitIdle");
  1858. vkEndCommandBuffer = (PFN_vkEndCommandBuffer)vkGetInstanceProcAddr(g_instance, "vkEndCommandBuffer");
  1859. vkEnumerateDeviceExtensionProperties = (PFN_vkEnumerateDeviceExtensionProperties)vkGetInstanceProcAddr(g_instance, "vkEnumerateDeviceExtensionProperties");
  1860. vkEnumerateDeviceLayerProperties = (PFN_vkEnumerateDeviceLayerProperties)vkGetInstanceProcAddr(g_instance, "vkEnumerateDeviceLayerProperties");
  1861. vkEnumeratePhysicalDevices = (PFN_vkEnumeratePhysicalDevices)vkGetInstanceProcAddr(g_instance, "vkEnumeratePhysicalDevices");
  1862. vkFlushMappedMemoryRanges = (PFN_vkFlushMappedMemoryRanges)vkGetInstanceProcAddr(g_instance, "vkFlushMappedMemoryRanges");
  1863. vkFreeCommandBuffers = (PFN_vkFreeCommandBuffers)vkGetInstanceProcAddr(g_instance, "vkFreeCommandBuffers");
  1864. vkFreeDescriptorSets = (PFN_vkFreeDescriptorSets)vkGetInstanceProcAddr(g_instance, "vkFreeDescriptorSets");
  1865. vkFreeMemory = (PFN_vkFreeMemory)vkGetInstanceProcAddr(g_instance, "vkFreeMemory");
  1866. vkGetBufferMemoryRequirements = (PFN_vkGetBufferMemoryRequirements)vkGetInstanceProcAddr(g_instance, "vkGetBufferMemoryRequirements");
  1867. vkGetDeviceMemoryCommitment = (PFN_vkGetDeviceMemoryCommitment)vkGetInstanceProcAddr(g_instance, "vkGetDeviceMemoryCommitment");
  1868. vkGetDeviceProcAddr = (PFN_vkGetDeviceProcAddr)vkGetInstanceProcAddr(g_instance, "vkGetDeviceProcAddr");
  1869. vkGetDeviceQueue = (PFN_vkGetDeviceQueue)vkGetInstanceProcAddr(g_instance, "vkGetDeviceQueue");
  1870. vkGetFenceStatus = (PFN_vkGetFenceStatus)vkGetInstanceProcAddr(g_instance, "vkGetFenceStatus");
  1871. vkGetImageMemoryRequirements = (PFN_vkGetImageMemoryRequirements)vkGetInstanceProcAddr(g_instance, "vkGetImageMemoryRequirements");
  1872. vkGetImageSubresourceLayout = (PFN_vkGetImageSubresourceLayout)vkGetInstanceProcAddr(g_instance, "vkGetImageSubresourceLayout");
  1873. vkGetPhysicalDeviceFeatures = (PFN_vkGetPhysicalDeviceFeatures)vkGetInstanceProcAddr(g_instance, "vkGetPhysicalDeviceFeatures");
  1874. vkGetPhysicalDeviceFormatProperties = (PFN_vkGetPhysicalDeviceFormatProperties)vkGetInstanceProcAddr(g_instance, "vkGetPhysicalDeviceFormatProperties");
  1875. vkGetPhysicalDeviceImageFormatProperties = (PFN_vkGetPhysicalDeviceImageFormatProperties)vkGetInstanceProcAddr(g_instance, "vkGetPhysicalDeviceImageFormatProperties");
  1876. vkGetPhysicalDeviceMemoryProperties = (PFN_vkGetPhysicalDeviceMemoryProperties)vkGetInstanceProcAddr(g_instance, "vkGetPhysicalDeviceMemoryProperties");
  1877. vkGetPhysicalDeviceProperties = (PFN_vkGetPhysicalDeviceProperties)vkGetInstanceProcAddr(g_instance, "vkGetPhysicalDeviceProperties");
  1878. vkGetPhysicalDeviceQueueFamilyProperties = (PFN_vkGetPhysicalDeviceQueueFamilyProperties)vkGetInstanceProcAddr(g_instance, "vkGetPhysicalDeviceQueueFamilyProperties");
  1879. vkGetPipelineCacheData = (PFN_vkGetPipelineCacheData)vkGetInstanceProcAddr(g_instance, "vkGetPipelineCacheData");
  1880. vkGetQueryPoolResults = (PFN_vkGetQueryPoolResults)vkGetInstanceProcAddr(g_instance, "vkGetQueryPoolResults");
  1881. vkInvalidateMappedMemoryRanges = (PFN_vkInvalidateMappedMemoryRanges)vkGetInstanceProcAddr(g_instance, "vkInvalidateMappedMemoryRanges");
  1882. vkMapMemory = (PFN_vkMapMemory)vkGetInstanceProcAddr(g_instance, "vkMapMemory");
  1883. vkMergePipelineCaches = (PFN_vkMergePipelineCaches)vkGetInstanceProcAddr(g_instance, "vkMergePipelineCaches");
  1884. vkQueueSubmit = (PFN_vkQueueSubmit)vkGetInstanceProcAddr(g_instance, "vkQueueSubmit");
  1885. vkQueueWaitIdle = (PFN_vkQueueWaitIdle)vkGetInstanceProcAddr(g_instance, "vkQueueWaitIdle");
  1886. vkResetCommandBuffer = (PFN_vkResetCommandBuffer)vkGetInstanceProcAddr(g_instance, "vkResetCommandBuffer");
  1887. vkResetCommandPool = (PFN_vkResetCommandPool)vkGetInstanceProcAddr(g_instance, "vkResetCommandPool");
  1888. vkResetDescriptorPool = (PFN_vkResetDescriptorPool)vkGetInstanceProcAddr(g_instance, "vkResetDescriptorPool");
  1889. vkResetFences = (PFN_vkResetFences)vkGetInstanceProcAddr(g_instance, "vkResetFences");
  1890. vkUnmapMemory = (PFN_vkUnmapMemory)vkGetInstanceProcAddr(g_instance, "vkUnmapMemory");
  1891. vkUpdateDescriptorSets = (PFN_vkUpdateDescriptorSets)vkGetInstanceProcAddr(g_instance, "vkUpdateDescriptorSets");
  1892. vkWaitForFences = (PFN_vkWaitForFences)vkGetInstanceProcAddr(g_instance, "vkWaitForFences");
  1893. return 0;
  1894. }
  1895. static int init_instance_extension()
  1896. {
  1897. if (support_VK_KHR_external_memory_capabilities)
  1898. {
  1899. vkGetPhysicalDeviceExternalBufferPropertiesKHR = (PFN_vkGetPhysicalDeviceExternalBufferPropertiesKHR)vkGetInstanceProcAddr(g_instance, "vkGetPhysicalDeviceExternalBufferPropertiesKHR");
  1900. }
  1901. if (support_VK_KHR_get_physical_device_properties2)
  1902. {
  1903. vkGetPhysicalDeviceFeatures2KHR = (PFN_vkGetPhysicalDeviceFeatures2KHR)vkGetInstanceProcAddr(g_instance, "vkGetPhysicalDeviceFeatures2KHR");
  1904. vkGetPhysicalDeviceProperties2KHR = (PFN_vkGetPhysicalDeviceProperties2KHR)vkGetInstanceProcAddr(g_instance, "vkGetPhysicalDeviceProperties2KHR");
  1905. vkGetPhysicalDeviceFormatProperties2KHR = (PFN_vkGetPhysicalDeviceFormatProperties2KHR)vkGetInstanceProcAddr(g_instance, "vkGetPhysicalDeviceFormatProperties2KHR");
  1906. vkGetPhysicalDeviceImageFormatProperties2KHR = (PFN_vkGetPhysicalDeviceImageFormatProperties2KHR)vkGetInstanceProcAddr(g_instance, "vkGetPhysicalDeviceImageFormatProperties2KHR");
  1907. vkGetPhysicalDeviceQueueFamilyProperties2KHR = (PFN_vkGetPhysicalDeviceQueueFamilyProperties2KHR)vkGetInstanceProcAddr(g_instance, "vkGetPhysicalDeviceQueueFamilyProperties2KHR");
  1908. vkGetPhysicalDeviceMemoryProperties2KHR = (PFN_vkGetPhysicalDeviceMemoryProperties2KHR)vkGetInstanceProcAddr(g_instance, "vkGetPhysicalDeviceMemoryProperties2KHR");
  1909. }
  1910. if (support_VK_KHR_get_surface_capabilities2)
  1911. {
  1912. vkGetPhysicalDeviceSurfaceCapabilities2KHR = (PFN_vkGetPhysicalDeviceSurfaceCapabilities2KHR)vkGetInstanceProcAddr(g_instance, "vkGetPhysicalDeviceSurfaceCapabilities2KHR");
  1913. vkGetPhysicalDeviceSurfaceFormats2KHR = (PFN_vkGetPhysicalDeviceSurfaceFormats2KHR)vkGetInstanceProcAddr(g_instance, "vkGetPhysicalDeviceSurfaceFormats2KHR");
  1914. }
  1915. if (support_VK_KHR_surface)
  1916. {
  1917. vkDestroySurfaceKHR = (PFN_vkDestroySurfaceKHR)vkGetInstanceProcAddr(g_instance, "vkDestroySurfaceKHR");
  1918. vkGetPhysicalDeviceSurfaceSupportKHR = (PFN_vkGetPhysicalDeviceSurfaceSupportKHR)vkGetInstanceProcAddr(g_instance, "vkGetPhysicalDeviceSurfaceSupportKHR");
  1919. vkGetPhysicalDeviceSurfaceCapabilitiesKHR = (PFN_vkGetPhysicalDeviceSurfaceCapabilitiesKHR)vkGetInstanceProcAddr(g_instance, "vkGetPhysicalDeviceSurfaceCapabilitiesKHR");
  1920. vkGetPhysicalDeviceSurfaceFormatsKHR = (PFN_vkGetPhysicalDeviceSurfaceFormatsKHR)vkGetInstanceProcAddr(g_instance, "vkGetPhysicalDeviceSurfaceFormatsKHR");
  1921. vkGetPhysicalDeviceSurfacePresentModesKHR = (PFN_vkGetPhysicalDeviceSurfacePresentModesKHR)vkGetInstanceProcAddr(g_instance, "vkGetPhysicalDeviceSurfacePresentModesKHR");
  1922. }
  1923. #if __ANDROID_API__ >= 26
  1924. if (support_VK_KHR_android_surface)
  1925. {
  1926. vkCreateAndroidSurfaceKHR = (PFN_vkCreateAndroidSurfaceKHR)vkGetInstanceProcAddr(g_instance, "vkCreateAndroidSurfaceKHR");
  1927. }
  1928. #endif // __ANDROID_API__ >= 26
  1929. // VK_KHR_cooperative_matrix
  1930. {
  1931. vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR = (PFN_vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR)vkGetInstanceProcAddr(g_instance, "vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR");
  1932. }
  1933. // VK_NV_cooperative_matrix
  1934. {
  1935. vkGetPhysicalDeviceCooperativeMatrixPropertiesNV = (PFN_vkGetPhysicalDeviceCooperativeMatrixPropertiesNV)vkGetInstanceProcAddr(g_instance, "vkGetPhysicalDeviceCooperativeMatrixPropertiesNV");
  1936. }
  1937. // VK_NV_cooperative_matrix2
  1938. {
  1939. vkGetPhysicalDeviceCooperativeMatrixFlexibleDimensionsPropertiesNV = (PFN_vkGetPhysicalDeviceCooperativeMatrixFlexibleDimensionsPropertiesNV)vkGetInstanceProcAddr(g_instance, "vkGetPhysicalDeviceCooperativeMatrixFlexibleDimensionsPropertiesNV");
  1940. }
  1941. // VK_NV_cooperative_vector
  1942. {
  1943. vkGetPhysicalDeviceCooperativeVectorPropertiesNV = (PFN_vkGetPhysicalDeviceCooperativeVectorPropertiesNV)vkGetInstanceProcAddr(g_instance, "vkGetPhysicalDeviceCooperativeVectorPropertiesNV");
  1944. }
  1945. return 0;
  1946. }
  1947. #if ENABLE_VALIDATION_LAYER
  1948. static VKAPI_ATTR VkBool32 VKAPI_CALL debugCallback(
  1949. VkDebugUtilsMessageSeverityFlagBitsEXT /*messageSeverity*/,
  1950. VkDebugUtilsMessageTypeFlagsEXT /*messageType*/,
  1951. const VkDebugUtilsMessengerCallbackDataEXT* pCallbackData,
  1952. void* /*pUserData*/)
  1953. {
  1954. NCNN_LOGE("validation layer: %s", pCallbackData->pMessage);
  1955. return VK_FALSE;
  1956. }
  1957. static VkResult CreateDebugUtilsMessengerEXT(VkInstance instance, const VkDebugUtilsMessengerCreateInfoEXT* pCreateInfo, const VkAllocationCallbacks* pAllocator, VkDebugUtilsMessengerEXT* pCallback)
  1958. {
  1959. PFN_vkCreateDebugUtilsMessengerEXT func = (PFN_vkCreateDebugUtilsMessengerEXT)vkGetInstanceProcAddr(instance, "vkCreateDebugUtilsMessengerEXT");
  1960. if (func)
  1961. return func(instance, pCreateInfo, pAllocator, pCallback);
  1962. return VK_ERROR_EXTENSION_NOT_PRESENT;
  1963. }
  1964. static void DestroyDebugUtilsMessengerEXT(VkInstance instance, VkDebugUtilsMessengerEXT callback, const VkAllocationCallbacks* pAllocator)
  1965. {
  1966. PFN_vkDestroyDebugUtilsMessengerEXT func = (PFN_vkDestroyDebugUtilsMessengerEXT)vkGetInstanceProcAddr(instance, "vkDestroyDebugUtilsMessengerEXT");
  1967. if (func)
  1968. func(instance, callback, pAllocator);
  1969. }
  1970. #endif // ENABLE_VALIDATION_LAYER
  1971. static int find_default_vulkan_device_index()
  1972. {
  1973. // first try, discrete gpu
  1974. for (int i = 0; i < g_gpu_count; i++)
  1975. {
  1976. if (g_gpu_infos[i]->type() == 0)
  1977. return i;
  1978. }
  1979. // second try, integrated gpu
  1980. for (int i = 0; i < g_gpu_count; i++)
  1981. {
  1982. if (g_gpu_infos[i]->type() == 1)
  1983. return i;
  1984. }
  1985. // third try, any probed device
  1986. if (g_gpu_count > 0)
  1987. return 0;
  1988. NCNN_LOGE("no vulkan device");
  1989. return -1;
  1990. }
  1991. int create_gpu_instance(const char* driver_path)
  1992. {
  1993. MutexLockGuard lock(g_instance_lock);
  1994. if (g_instance.created != 0)
  1995. return g_instance.instance ? 0 : -1;
  1996. g_instance.created = 1;
  1997. // NCNN_LOGE("create_gpu_instance");
  1998. #if NCNN_SIMPLEVK
  1999. // load vulkan driver
  2000. {
  2001. int ret = load_vulkan_driver(driver_path);
  2002. if (ret != 0)
  2003. {
  2004. NCNN_LOGE("load vulkan driver failed");
  2005. return -1;
  2006. }
  2007. }
  2008. #else
  2009. if (driver_path)
  2010. {
  2011. NCNN_LOGE("custom vulkan driver is not supported when NCNN_SIMPLEVK is off");
  2012. NCNN_LOGE("will always use the system vulkan driver");
  2013. }
  2014. #endif // NCNN_SIMPLEVK
  2015. VkResult ret;
  2016. std::vector<const char*> enabledLayers;
  2017. #if ENABLE_VALIDATION_LAYER
  2018. uint32_t instanceLayerPropertyCount;
  2019. ret = vkEnumerateInstanceLayerProperties(&instanceLayerPropertyCount, NULL);
  2020. if (ret != VK_SUCCESS)
  2021. {
  2022. NCNN_LOGE("vkEnumerateInstanceLayerProperties failed %d", ret);
  2023. return -1;
  2024. }
  2025. std::vector<VkLayerProperties> instanceLayerProperties(instanceLayerPropertyCount);
  2026. ret = vkEnumerateInstanceLayerProperties(&instanceLayerPropertyCount, instanceLayerProperties.data());
  2027. if (ret != VK_SUCCESS)
  2028. {
  2029. NCNN_LOGE("vkEnumerateInstanceLayerProperties failed %d", ret);
  2030. return -1;
  2031. }
  2032. for (uint32_t i = 0; i < instanceLayerPropertyCount; i++)
  2033. {
  2034. const VkLayerProperties& lp = instanceLayerProperties[i];
  2035. // NCNN_LOGE("instance layer %s = %u", lp.layerName, lp.implementationVersion);
  2036. if (strcmp(lp.layerName, "VK_LAYER_LUNARG_standard_validation") == 0)
  2037. {
  2038. enabledLayers.push_back("VK_LAYER_LUNARG_standard_validation");
  2039. }
  2040. if (strcmp(lp.layerName, "VK_LAYER_LUNARG_parameter_validation") == 0)
  2041. {
  2042. enabledLayers.push_back("VK_LAYER_LUNARG_parameter_validation");
  2043. }
  2044. if (strcmp(lp.layerName, "VK_LAYER_KHRONOS_validation") == 0)
  2045. {
  2046. enabledLayers.push_back("VK_LAYER_KHRONOS_validation");
  2047. }
  2048. }
  2049. #endif // ENABLE_VALIDATION_LAYER
  2050. std::vector<const char*> enabledExtensions;
  2051. uint32_t instanceExtensionPropertyCount;
  2052. ret = vkEnumerateInstanceExtensionProperties(NULL, &instanceExtensionPropertyCount, NULL);
  2053. if (ret != VK_SUCCESS)
  2054. {
  2055. NCNN_LOGE("vkEnumerateInstanceExtensionProperties failed %d", ret);
  2056. return -1;
  2057. }
  2058. std::vector<VkExtensionProperties> instanceExtensionProperties(instanceExtensionPropertyCount);
  2059. ret = vkEnumerateInstanceExtensionProperties(NULL, &instanceExtensionPropertyCount, instanceExtensionProperties.data());
  2060. if (ret != VK_SUCCESS)
  2061. {
  2062. NCNN_LOGE("vkEnumerateInstanceExtensionProperties failed %d", ret);
  2063. return -1;
  2064. }
  2065. support_VK_KHR_get_physical_device_properties2 = 0;
  2066. support_VK_KHR_get_surface_capabilities2 = 0;
  2067. support_VK_KHR_portability_enumeration = 0;
  2068. support_VK_KHR_surface = 0;
  2069. support_VK_EXT_debug_utils = 0;
  2070. support_VK_EXT_validation_features = 0;
  2071. support_VK_EXT_validation_flags = 0;
  2072. #if __ANDROID_API__ >= 26
  2073. support_VK_KHR_android_surface = 0;
  2074. #endif // __ANDROID_API__ >= 26
  2075. for (uint32_t j = 0; j < instanceExtensionPropertyCount; j++)
  2076. {
  2077. const VkExtensionProperties& exp = instanceExtensionProperties[j];
  2078. // NCNN_LOGE("instance extension %s = %u", exp.extensionName, exp.specVersion);
  2079. if (strcmp(exp.extensionName, "VK_KHR_external_memory_capabilities") == 0)
  2080. support_VK_KHR_external_memory_capabilities = exp.specVersion;
  2081. else if (strcmp(exp.extensionName, "VK_KHR_get_physical_device_properties2") == 0)
  2082. support_VK_KHR_get_physical_device_properties2 = exp.specVersion;
  2083. else if (strcmp(exp.extensionName, "VK_KHR_get_surface_capabilities2") == 0)
  2084. support_VK_KHR_get_surface_capabilities2 = exp.specVersion;
  2085. else if (strcmp(exp.extensionName, "VK_KHR_portability_enumeration") == 0)
  2086. support_VK_KHR_portability_enumeration = exp.specVersion;
  2087. else if (strcmp(exp.extensionName, "VK_KHR_surface") == 0)
  2088. support_VK_KHR_surface = exp.specVersion;
  2089. else if (strcmp(exp.extensionName, "VK_EXT_debug_utils") == 0)
  2090. support_VK_EXT_debug_utils = exp.specVersion;
  2091. else if (strcmp(exp.extensionName, "VK_EXT_validation_features") == 0)
  2092. support_VK_EXT_validation_features = exp.specVersion;
  2093. else if (strcmp(exp.extensionName, "VK_EXT_validation_flags") == 0)
  2094. support_VK_EXT_validation_flags = exp.specVersion;
  2095. #if __ANDROID_API__ >= 26
  2096. else if (strcmp(exp.extensionName, "VK_KHR_android_surface") == 0)
  2097. support_VK_KHR_android_surface = exp.specVersion;
  2098. #endif // __ANDROID_API__ >= 26
  2099. }
  2100. if (support_VK_EXT_validation_features)
  2101. {
  2102. // prefer VK_EXT_validation_features, which supersedes VK_EXT_validation_flags
  2103. support_VK_EXT_validation_flags = 0;
  2104. }
  2105. if (support_VK_KHR_external_memory_capabilities)
  2106. enabledExtensions.push_back("VK_KHR_external_memory_capabilities");
  2107. if (support_VK_KHR_get_physical_device_properties2)
  2108. enabledExtensions.push_back("VK_KHR_get_physical_device_properties2");
  2109. if (support_VK_KHR_get_surface_capabilities2)
  2110. enabledExtensions.push_back("VK_KHR_get_surface_capabilities2");
  2111. if (support_VK_KHR_portability_enumeration)
  2112. enabledExtensions.push_back("VK_KHR_portability_enumeration");
  2113. if (support_VK_KHR_surface)
  2114. enabledExtensions.push_back("VK_KHR_surface");
  2115. #if ENABLE_VALIDATION_LAYER
  2116. if (support_VK_EXT_debug_utils)
  2117. enabledExtensions.push_back("VK_EXT_debug_utils");
  2118. if (support_VK_EXT_validation_features)
  2119. enabledExtensions.push_back("VK_EXT_validation_features");
  2120. if (support_VK_EXT_validation_flags)
  2121. enabledExtensions.push_back("VK_EXT_validation_flags");
  2122. #endif // ENABLE_VALIDATION_LAYER
  2123. #if __ANDROID_API__ >= 26
  2124. if (support_VK_KHR_android_surface)
  2125. enabledExtensions.push_back("VK_KHR_android_surface");
  2126. #endif // __ANDROID_API__ >= 26
  2127. uint32_t instance_api_version = VK_MAKE_VERSION(1, 0, 0);
  2128. typedef VkResult(VKAPI_PTR * PFN_vkEnumerateInstanceVersion)(uint32_t * pApiVersion);
  2129. PFN_vkEnumerateInstanceVersion vkEnumerateInstanceVersion = (PFN_vkEnumerateInstanceVersion)vkGetInstanceProcAddr(0, "vkEnumerateInstanceVersion");
  2130. if (vkEnumerateInstanceVersion)
  2131. {
  2132. ret = vkEnumerateInstanceVersion(&instance_api_version);
  2133. if (ret != VK_SUCCESS)
  2134. {
  2135. NCNN_LOGE("vkEnumerateInstanceVersion failed %d", ret);
  2136. return -1;
  2137. }
  2138. }
  2139. // NCNN_LOGE("instance apiVersion = %u.%u.%u", VK_VERSION_MAJOR(instance_api_version), VK_VERSION_MINOR(instance_api_version), VK_VERSION_PATCH(instance_api_version));
  2140. VkApplicationInfo applicationInfo;
  2141. applicationInfo.sType = VK_STRUCTURE_TYPE_APPLICATION_INFO;
  2142. applicationInfo.pNext = 0;
  2143. applicationInfo.pApplicationName = "ncnn";
  2144. applicationInfo.applicationVersion = 0;
  2145. applicationInfo.pEngineName = "ncnn";
  2146. applicationInfo.engineVersion = 20250327;
  2147. applicationInfo.apiVersion = instance_api_version;
  2148. void* enabledExtensionFeatures = 0;
  2149. #if ENABLE_VALIDATION_LAYER
  2150. std::vector<VkValidationFeatureEnableEXT> enabledValidationFeature;
  2151. enabledValidationFeature.push_back(VK_VALIDATION_FEATURE_ENABLE_GPU_ASSISTED_EXT);
  2152. enabledValidationFeature.push_back(VK_VALIDATION_FEATURE_ENABLE_GPU_ASSISTED_RESERVE_BINDING_SLOT_EXT);
  2153. enabledValidationFeature.push_back(VK_VALIDATION_FEATURE_ENABLE_BEST_PRACTICES_EXT);
  2154. enabledValidationFeature.push_back(VK_VALIDATION_FEATURE_ENABLE_DEBUG_PRINTF_EXT);
  2155. enabledValidationFeature.push_back(VK_VALIDATION_FEATURE_ENABLE_SYNCHRONIZATION_VALIDATION_EXT);
  2156. VkValidationFeaturesEXT validationFeatures;
  2157. validationFeatures.sType = VK_STRUCTURE_TYPE_VALIDATION_FEATURES_EXT;
  2158. validationFeatures.pNext = 0;
  2159. validationFeatures.enabledValidationFeatureCount = enabledValidationFeature.size();
  2160. validationFeatures.pEnabledValidationFeatures = enabledValidationFeature.data();
  2161. validationFeatures.disabledValidationFeatureCount = 0;
  2162. validationFeatures.pDisabledValidationFeatures = 0;
  2163. if (support_VK_EXT_validation_features)
  2164. {
  2165. validationFeatures.pNext = enabledExtensionFeatures;
  2166. enabledExtensionFeatures = &validationFeatures;
  2167. }
  2168. VkValidationFlagsEXT validationFlags;
  2169. validationFlags.sType = VK_STRUCTURE_TYPE_VALIDATION_FLAGS_EXT;
  2170. validationFlags.pNext = 0;
  2171. validationFlags.disabledValidationCheckCount = 0;
  2172. validationFlags.pDisabledValidationChecks = 0;
  2173. if (support_VK_EXT_validation_flags)
  2174. {
  2175. validationFlags.pNext = enabledExtensionFeatures;
  2176. enabledExtensionFeatures = &validationFlags;
  2177. }
  2178. #endif // ENABLE_VALIDATION_LAYER
  2179. VkInstanceCreateInfo instanceCreateInfo;
  2180. instanceCreateInfo.sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
  2181. instanceCreateInfo.pNext = enabledExtensionFeatures;
  2182. instanceCreateInfo.flags = 0;
  2183. if (support_VK_KHR_portability_enumeration)
  2184. instanceCreateInfo.flags |= VK_INSTANCE_CREATE_ENUMERATE_PORTABILITY_BIT_KHR;
  2185. instanceCreateInfo.pApplicationInfo = &applicationInfo;
  2186. instanceCreateInfo.enabledLayerCount = enabledLayers.size();
  2187. instanceCreateInfo.ppEnabledLayerNames = enabledLayers.data();
  2188. instanceCreateInfo.enabledExtensionCount = enabledExtensions.size();
  2189. instanceCreateInfo.ppEnabledExtensionNames = enabledExtensions.data();
  2190. VkInstance instance = 0;
  2191. ret = vkCreateInstance(&instanceCreateInfo, 0, &instance);
  2192. if (ret != VK_SUCCESS)
  2193. {
  2194. NCNN_LOGE("vkCreateInstance failed %d", ret);
  2195. return -1;
  2196. }
  2197. g_instance.instance = instance;
  2198. g_instance.instance_api_version = instance_api_version;
  2199. init_instance_core();
#if ENABLE_VALIDATION_LAYER
    if (support_VK_EXT_debug_utils)
    {
        VkDebugUtilsMessengerCreateInfoEXT createInfo = {};
        createInfo.sType = VK_STRUCTURE_TYPE_DEBUG_UTILS_MESSENGER_CREATE_INFO_EXT;
        createInfo.messageSeverity = VK_DEBUG_UTILS_MESSAGE_SEVERITY_VERBOSE_BIT_EXT | VK_DEBUG_UTILS_MESSAGE_SEVERITY_INFO_BIT_EXT | VK_DEBUG_UTILS_MESSAGE_SEVERITY_WARNING_BIT_EXT | VK_DEBUG_UTILS_MESSAGE_SEVERITY_ERROR_BIT_EXT;
        createInfo.messageType = VK_DEBUG_UTILS_MESSAGE_TYPE_GENERAL_BIT_EXT | VK_DEBUG_UTILS_MESSAGE_TYPE_VALIDATION_BIT_EXT | VK_DEBUG_UTILS_MESSAGE_TYPE_PERFORMANCE_BIT_EXT;
        createInfo.pfnUserCallback = debugCallback;
        createInfo.pUserData = 0;
        ret = CreateDebugUtilsMessengerEXT(g_instance, &createInfo, NULL, &g_instance.callback);
        if (ret != VK_SUCCESS)
        {
            NCNN_LOGE("CreateDebugUtilsMessengerEXT failed %d", ret);
            return -1;
        }
    }
#endif // ENABLE_VALIDATION_LAYER

    init_instance_extension();
    uint32_t physicalDeviceCount = 0;
    ret = vkEnumeratePhysicalDevices(g_instance, &physicalDeviceCount, 0);
    if (ret != VK_SUCCESS)
    {
        NCNN_LOGE("vkEnumeratePhysicalDevices failed %d", ret);
        return -1;
    }

    if (physicalDeviceCount > NCNN_MAX_GPU_COUNT)
        physicalDeviceCount = NCNN_MAX_GPU_COUNT;

    std::vector<VkPhysicalDevice> physicalDevices(physicalDeviceCount);
    ret = vkEnumeratePhysicalDevices(g_instance, &physicalDeviceCount, physicalDevices.data());
    if (ret != VK_SUCCESS)
    {
        NCNN_LOGE("vkEnumeratePhysicalDevices failed %d", ret);
        return -1;
    }
    // find proper device and queue
    int gpu_info_index = 0;
    for (uint32_t i = 0; i < physicalDeviceCount; i++)
    {
        const VkPhysicalDevice& physicalDevice = physicalDevices[i];

        delete g_gpu_infos[gpu_info_index];
        g_gpu_infos[gpu_info_index] = new GpuInfo;
        GpuInfo& gpu_info = *g_gpu_infos[gpu_info_index];

        gpu_info.d->device_index = gpu_info_index;
        gpu_info.d->physicalDevice = physicalDevice;

        gpu_info.d->query_features();
        gpu_info.d->query_properties();

        // device type
        // info
        // NCNN_LOGE("[%u] max_shared_memory_size = %u", i, gpu_info.max_shared_memory_size);
        // NCNN_LOGE("[%u] max_workgroup_count = %u %u %u", i, gpu_info.max_workgroup_count[0], gpu_info.max_workgroup_count[1], gpu_info.max_workgroup_count[2]);
        // NCNN_LOGE("[%u] max_workgroup_invocations = %u", i, gpu_info.max_workgroup_invocations);
        // NCNN_LOGE("[%u] max_workgroup_size = %u %u %u", i, gpu_info.max_workgroup_size[0], gpu_info.max_workgroup_size[1], gpu_info.max_workgroup_size[2]);
        // NCNN_LOGE("[%u] memory_map_alignment = %lu", i, gpu_info.memory_map_alignment);
        // NCNN_LOGE("[%u] buffer_offset_alignment = %lu", i, gpu_info.buffer_offset_alignment);

        gpu_info.d->query_queue_properties();

        // cache memory properties
        vkGetPhysicalDeviceMemoryProperties(physicalDevice, &gpu_info.d->physicalDeviceMemoryProperties);

        int rqde = gpu_info.d->query_extensions();
        if (rqde != 0)
        {
            return -1;
        }

        gpu_info.d->query_extension_features();
        gpu_info.d->query_extension_properties();

        NCNN_LOGE("[%u %s] queueC=%u[%u] queueT=%u[%u]", i, gpu_info.device_name(),
                  gpu_info.compute_queue_family_index(), gpu_info.compute_queue_count(),
                  gpu_info.transfer_queue_family_index(), gpu_info.transfer_queue_count());

        NCNN_LOGE("[%u %s] fp16-p/s/u/a=%d/%d/%d/%d int8-p/s/u/a=%d/%d/%d/%d", i, gpu_info.device_name(),
                  gpu_info.support_fp16_packed(), gpu_info.support_fp16_storage(), gpu_info.support_fp16_uniform(), gpu_info.support_fp16_arithmetic(),
                  gpu_info.support_int8_packed(), gpu_info.support_int8_storage(), gpu_info.support_int8_uniform(), gpu_info.support_int8_arithmetic());

        NCNN_LOGE("[%u %s] subgroup=%u(%u~%u) ops=%d/%d/%d/%d/%d/%d/%d/%d/%d/%d", i, gpu_info.device_name(),
                  gpu_info.subgroup_size(), gpu_info.min_subgroup_size(), gpu_info.max_subgroup_size(),
                  (gpu_info.support_subgroup_ops() & VK_SUBGROUP_FEATURE_BASIC_BIT) != 0,
                  (gpu_info.support_subgroup_ops() & VK_SUBGROUP_FEATURE_VOTE_BIT) != 0,
                  (gpu_info.support_subgroup_ops() & VK_SUBGROUP_FEATURE_ARITHMETIC_BIT) != 0,
                  (gpu_info.support_subgroup_ops() & VK_SUBGROUP_FEATURE_BALLOT_BIT) != 0,
                  (gpu_info.support_subgroup_ops() & VK_SUBGROUP_FEATURE_SHUFFLE_BIT) != 0,
                  (gpu_info.support_subgroup_ops() & VK_SUBGROUP_FEATURE_SHUFFLE_RELATIVE_BIT) != 0,
                  (gpu_info.support_subgroup_ops() & VK_SUBGROUP_FEATURE_CLUSTERED_BIT) != 0,
                  (gpu_info.support_subgroup_ops() & VK_SUBGROUP_FEATURE_QUAD_BIT) != 0,
                  (gpu_info.support_subgroup_ops() & VK_SUBGROUP_FEATURE_ROTATE_BIT_KHR) != 0,
                  (gpu_info.support_subgroup_ops() & VK_SUBGROUP_FEATURE_ROTATE_CLUSTERED_BIT_KHR) != 0);
        // collect matrix mnk
        std::vector<VkCooperativeMatrixPropertiesKHR> fp16_matrix_properties;
        std::vector<VkCooperativeMatrixPropertiesKHR> int8_matrix_properties;
        std::vector<VkCooperativeMatrixPropertiesKHR> bf16_matrix_properties;
        std::vector<VkCooperativeMatrixPropertiesKHR> fp8_matrix_properties;
        if (gpu_info.support_VK_KHR_cooperative_matrix())
        {
            const std::vector<VkCooperativeMatrixPropertiesKHR>& properties = gpu_info.queryCooperativeMatrixSubProperties();
            for (uint32_t j = 0; j < properties.size(); j++)
            {
                const VkCooperativeMatrixPropertiesKHR& cmp = properties[j];

                if (cmp.AType == VK_COMPONENT_TYPE_FLOAT16_KHR && cmp.BType == VK_COMPONENT_TYPE_FLOAT16_KHR)
                {
                    bool mnk_hit = false;
                    for (size_t k = 0; k < fp16_matrix_properties.size(); k++)
                    {
                        const VkCooperativeMatrixPropertiesKHR& cmp0 = fp16_matrix_properties[k];
                        if (cmp.MSize == cmp0.MSize && cmp.NSize == cmp0.NSize && cmp.KSize == cmp0.KSize)
                        {
                            mnk_hit = true;
                            break;
                        }
                    }

                    if (!mnk_hit)
                        fp16_matrix_properties.push_back(cmp);
                }

                if ((cmp.AType == VK_COMPONENT_TYPE_SINT8_KHR || cmp.AType == VK_COMPONENT_TYPE_SINT8_PACKED_NV)
                        && (cmp.BType == VK_COMPONENT_TYPE_SINT8_KHR || cmp.BType == VK_COMPONENT_TYPE_SINT8_PACKED_NV))
                {
                    bool mnk_hit = false;
                    for (size_t k = 0; k < int8_matrix_properties.size(); k++)
                    {
                        const VkCooperativeMatrixPropertiesKHR& cmp0 = int8_matrix_properties[k];
                        if (cmp.MSize == cmp0.MSize && cmp.NSize == cmp0.NSize && cmp.KSize == cmp0.KSize)
                        {
                            mnk_hit = true;
                            break;
                        }
                    }

                    if (!mnk_hit)
                        int8_matrix_properties.push_back(cmp);
                }

                if (cmp.AType == VK_COMPONENT_TYPE_BFLOAT16_KHR && cmp.BType == VK_COMPONENT_TYPE_BFLOAT16_KHR)
                {
                    bool mnk_hit = false;
                    for (size_t k = 0; k < bf16_matrix_properties.size(); k++)
                    {
                        const VkCooperativeMatrixPropertiesKHR& cmp0 = bf16_matrix_properties[k];
                        if (cmp.MSize == cmp0.MSize && cmp.NSize == cmp0.NSize && cmp.KSize == cmp0.KSize)
                        {
                            mnk_hit = true;
                            break;
                        }
                    }

                    if (!mnk_hit)
                        bf16_matrix_properties.push_back(cmp);
                }

                if ((cmp.AType == VK_COMPONENT_TYPE_FLOAT8_E4M3_EXT || cmp.AType == VK_COMPONENT_TYPE_FLOAT8_E5M2_EXT
                        || cmp.AType == VK_COMPONENT_TYPE_FLOAT_E4M3_NV || cmp.AType == VK_COMPONENT_TYPE_FLOAT_E5M2_NV)
                        && (cmp.BType == VK_COMPONENT_TYPE_FLOAT8_E4M3_EXT || cmp.BType == VK_COMPONENT_TYPE_FLOAT8_E5M2_EXT
                            || cmp.BType == VK_COMPONENT_TYPE_FLOAT_E4M3_NV || cmp.BType == VK_COMPONENT_TYPE_FLOAT_E5M2_NV))
                {
                    bool mnk_hit = false;
                    for (size_t k = 0; k < fp8_matrix_properties.size(); k++)
                    {
                        const VkCooperativeMatrixPropertiesKHR& cmp0 = fp8_matrix_properties[k];
                        if (cmp.MSize == cmp0.MSize && cmp.NSize == cmp0.NSize && cmp.KSize == cmp0.KSize)
                        {
                            mnk_hit = true;
                            break;
                        }
                    }

                    if (!mnk_hit)
                        fp8_matrix_properties.push_back(cmp);
                }
            }
        }
        else if (gpu_info.support_VK_NV_cooperative_matrix())
        {
            const std::vector<VkCooperativeMatrixPropertiesNV>& properties = gpu_info.queryCooperativeMatrixSubPropertiesNV();
            for (uint32_t j = 0; j < properties.size(); j++)
            {
                const VkCooperativeMatrixPropertiesNV& cmp = properties[j];

                if (cmp.AType == VK_COMPONENT_TYPE_FLOAT16_NV && cmp.BType == VK_COMPONENT_TYPE_FLOAT16_NV)
                {
                    bool mnk_hit = false;
                    for (size_t k = 0; k < fp16_matrix_properties.size(); k++)
                    {
                        const VkCooperativeMatrixPropertiesKHR& cmp0 = fp16_matrix_properties[k];
                        if (cmp.MSize == cmp0.MSize && cmp.NSize == cmp0.NSize && cmp.KSize == cmp0.KSize)
                        {
                            mnk_hit = true;
                            break;
                        }
                    }

                    if (!mnk_hit)
                    {
                        // only M/N/K are consumed later; zero-initialize so the
                        // remaining fields are not copied with indeterminate values
                        VkCooperativeMatrixPropertiesKHR cmp_khr = {};
                        cmp_khr.MSize = cmp.MSize;
                        cmp_khr.NSize = cmp.NSize;
                        cmp_khr.KSize = cmp.KSize;
                        fp16_matrix_properties.push_back(cmp_khr);
                    }
                }

                if (cmp.AType == VK_COMPONENT_TYPE_SINT8_NV && cmp.BType == VK_COMPONENT_TYPE_SINT8_NV)
                {
                    bool mnk_hit = false;
                    for (size_t k = 0; k < int8_matrix_properties.size(); k++)
                    {
                        const VkCooperativeMatrixPropertiesKHR& cmp0 = int8_matrix_properties[k];
                        if (cmp.MSize == cmp0.MSize && cmp.NSize == cmp0.NSize && cmp.KSize == cmp0.KSize)
                        {
                            mnk_hit = true;
                            break;
                        }
                    }

                    if (!mnk_hit)
                    {
                        // only M/N/K are consumed later; zero-initialize the rest
                        VkCooperativeMatrixPropertiesKHR cmp_khr = {};
                        cmp_khr.MSize = cmp.MSize;
                        cmp_khr.NSize = cmp.NSize;
                        cmp_khr.KSize = cmp.KSize;
                        int8_matrix_properties.push_back(cmp_khr);
                    }
                }
            }
        }
        std::string fp16_matrix_info_str;
        std::string int8_matrix_info_str;
        std::string bf16_matrix_info_str;
        std::string fp8_matrix_info_str;
        {
            for (uint32_t j = 0; j < fp16_matrix_properties.size(); j++)
            {
                const VkCooperativeMatrixPropertiesKHR& cmp = fp16_matrix_properties[j];
                char tmp[64];
                sprintf(tmp, j > 0 ? "/%ux%ux%u" : "%ux%ux%u", cmp.MSize, cmp.NSize, cmp.KSize);
                fp16_matrix_info_str += tmp;
            }
            for (uint32_t j = 0; j < int8_matrix_properties.size(); j++)
            {
                const VkCooperativeMatrixPropertiesKHR& cmp = int8_matrix_properties[j];
                char tmp[64];
                sprintf(tmp, j > 0 ? "/%ux%ux%u" : "%ux%ux%u", cmp.MSize, cmp.NSize, cmp.KSize);
                int8_matrix_info_str += tmp;
            }
            for (uint32_t j = 0; j < bf16_matrix_properties.size(); j++)
            {
                const VkCooperativeMatrixPropertiesKHR& cmp = bf16_matrix_properties[j];
                char tmp[64];
                sprintf(tmp, j > 0 ? "/%ux%ux%u" : "%ux%ux%u", cmp.MSize, cmp.NSize, cmp.KSize);
                bf16_matrix_info_str += tmp;
            }
            for (uint32_t j = 0; j < fp8_matrix_properties.size(); j++)
            {
                const VkCooperativeMatrixPropertiesKHR& cmp = fp8_matrix_properties[j];
                char tmp[64];
                sprintf(tmp, j > 0 ? "/%ux%ux%u" : "%ux%ux%u", cmp.MSize, cmp.NSize, cmp.KSize);
                fp8_matrix_info_str += tmp;
            }

            if (fp16_matrix_info_str.empty())
                fp16_matrix_info_str = "0";
            if (int8_matrix_info_str.empty())
                int8_matrix_info_str = "0";
            if (bf16_matrix_info_str.empty())
                bf16_matrix_info_str = "0";
            if (fp8_matrix_info_str.empty())
                fp8_matrix_info_str = "0";
        }

        NCNN_LOGE("[%u %s] fp16-cm=%s int8-cm=%s bf16-cm=%s fp8-cm=%s", i, gpu_info.device_name(),
                  fp16_matrix_info_str.c_str(), int8_matrix_info_str.c_str(), bf16_matrix_info_str.c_str(), fp8_matrix_info_str.c_str());

        gpu_info_index++;
    }
    g_gpu_count = gpu_info_index;

    // the default gpu device
    g_default_gpu_index = find_default_vulkan_device_index();

    g_instance.glslang_initialized = glslang::InitializeProcess();

    // the global __ncnn_vulkan_instance_holder destructor will call destroy_gpu_instance() on exit
    // but it seems to be too late for nvidia driver :(
    // driver's internal data structure has been destroyed when called, causing segfault
    // atexit() seems to be helpful for calling it earlier --- nihui
    static int destroy_gpu_instance_atexit_registered = 0;
    if (!destroy_gpu_instance_atexit_registered)
    {
        atexit(destroy_gpu_instance);
        destroy_gpu_instance_atexit_registered = 1;
    }

    return 0;
}
VkInstance get_gpu_instance()
{
    return (VkInstance)g_instance;
}

void destroy_gpu_instance()
{
    MutexLockGuard lock(g_instance_lock);

    if (g_instance.created == 0)
        return;

    for (int i = 0; i < NCNN_MAX_GPU_COUNT; i++)
    {
        VulkanDevice* vulkan_device = g_default_vkdev[i];
        if (vulkan_device)
        {
            VkDevice vkdev = g_default_vkdev[i]->vkdevice();
            if (vkdev)
            {
                vkDeviceWaitIdle(vkdev);
            }
        }
    }

    // NCNN_LOGE("destroy_gpu_instance");

    if (g_instance.glslang_initialized)
    {
        glslang::FinalizeProcess();
        g_instance.glslang_initialized = false;
    }

    for (int i = 0; i < NCNN_MAX_GPU_COUNT; i++)
    {
        delete g_default_vkdev[i];
        g_default_vkdev[i] = 0;

        delete g_gpu_infos[i];
        g_gpu_infos[i] = 0;
    }

#if ENABLE_VALIDATION_LAYER
    if (support_VK_EXT_debug_utils && g_instance.callback)
    {
        DestroyDebugUtilsMessengerEXT(g_instance, g_instance.callback, NULL);
        g_instance.callback = 0;
    }
#endif // ENABLE_VALIDATION_LAYER

    if (vkDestroyInstance)
    {
        vkDestroyInstance(g_instance, 0);
        vkDestroyInstance = 0;
    }

    g_instance.instance = 0;

#if NCNN_SIMPLEVK
    unload_vulkan_driver();
#endif

    g_instance.created = 0;
}
static void try_create_gpu_instance()
{
    {
        MutexLockGuard lock(g_instance_lock);

        if (g_instance.created != 0)
            return;
    }

    create_gpu_instance();
}

int get_gpu_count()
{
    try_create_gpu_instance();

    return g_gpu_count;
}

int get_default_gpu_index()
{
    try_create_gpu_instance();

    return g_default_gpu_index;
}

const GpuInfo& get_gpu_info(int device_index)
{
    try_create_gpu_instance();

    return *g_gpu_infos[device_index];
}
class VkDummyAllocator : public VkBlobAllocator
{
public:
    // NOTE 16k is large enough I think ...
    VkDummyAllocator(const VulkanDevice* _vkdev)
        : VkBlobAllocator(_vkdev, 16 * 1024)
    {
    }
};

class VkDummyCompute : public VkCompute
{
public:
    VkDummyCompute(const VulkanDevice* _vkdev)
        : VkCompute(_vkdev)
    {
    }

    void record_dummy(const VkMat& buffer)
    {
        barrier_readwrite(buffer);
    }

    void record_dummy(const VkImageMat& image)
    {
        barrier_readwrite(image);
    }

    void record_dummy_readonly(const VkImageMat& image)
    {
        barrier_readonly(image);
    }
};
class VulkanDevicePrivate
{
public:
    VulkanDevicePrivate(VulkanDevice* _vkdev);
    VulkanDevice* const vkdev;

    // dummy buffer and image
    int create_dummy_buffer_image();
    void destroy_dummy_buffer_image();

    // utility operator
    const ncnn::Layer* get_utility_operator(int cast_type_from_index, int cast_type_to_index, int packing_type_to_index) const;
    void destroy_utility_operator();

    VkDevice device;

    // hardware queue
    mutable std::vector<VkQueue> compute_queues;
    mutable std::vector<VkQueue> transfer_queues;
    mutable int free_compute_queue_count;
    mutable int free_transfer_queue_count;
    mutable Mutex compute_queue_lock;
    mutable Mutex transfer_queue_lock;
    mutable ConditionVariable compute_queue_condition;
    mutable ConditionVariable transfer_queue_condition;

    // default blob allocator for each queue
    mutable std::vector<VkAllocator*> blob_allocators;
    mutable Mutex blob_allocator_lock;

    // default staging allocator for each queue
    mutable std::vector<VkAllocator*> staging_allocators;
    mutable Mutex staging_allocator_lock;

    // nearest sampler for texelfetch
    VkSampler texelfetch_sampler;

    // dummy buffer and image
    VkAllocator* dummy_allocator;
    VkMat dummy_buffer;
    VkImageMat dummy_image;
    VkImageMat dummy_image_readonly;

    // device-wide pipeline cache
    PipelineCache* pipeline_cache;

    // utility operator
    // from fp32 | fp16
    // to fp32 | fp16
    // to pack1 | pack4 | pack8
    mutable ncnn::Layer* uop_packing[2][2][3];
    // from int8
    // to int8
    // to pack1 | pack4 | pack8
    mutable ncnn::Layer* uop_packing_int8[3];
    mutable Mutex uop_lock;

    // device is valid and successfully initialized
    bool valid;
};
VulkanDevicePrivate::VulkanDevicePrivate(VulkanDevice* _vkdev)
    : vkdev(_vkdev)
{
    device = 0;
    texelfetch_sampler = 0;
    dummy_allocator = 0;
    pipeline_cache = 0;
    valid = false;
    memset(uop_packing, 0, sizeof(uop_packing));
    memset(uop_packing_int8, 0, sizeof(uop_packing_int8));
}

int VulkanDevicePrivate::create_dummy_buffer_image()
{
    dummy_allocator = new VkDummyAllocator(vkdev);

    dummy_buffer.create(1, 4u, dummy_allocator);
    dummy_image.create(1, 4u, dummy_allocator);
#if __APPLE__
    if (vkdev->info.type() == 0)
        dummy_image_readonly.create(1, 4u, dummy_allocator);
#else
    dummy_image_readonly.create(1, 4u, dummy_allocator);
#endif

    VkDummyCompute cmd(vkdev);

    cmd.record_dummy(dummy_buffer);
    cmd.record_dummy(dummy_image);
#if __APPLE__
    if (vkdev->info.type() == 0)
        cmd.record_dummy_readonly(dummy_image_readonly);
#else
    cmd.record_dummy_readonly(dummy_image_readonly);
#endif

    return cmd.submit_and_wait();
}

void VulkanDevicePrivate::destroy_dummy_buffer_image()
{
    dummy_buffer.release();
    dummy_image.release();
#if __APPLE__
    if (vkdev->info.type() == 0)
        dummy_image_readonly.release();
#else
    dummy_image_readonly.release();
#endif

    if (dummy_allocator)
    {
        delete dummy_allocator;
        dummy_allocator = 0;
    }
}
const ncnn::Layer* VulkanDevicePrivate::get_utility_operator(int cast_type_from_index, int cast_type_to_index, int packing_type_to_index) const
{
    bool use_fp16 = (cast_type_from_index == 1 || cast_type_to_index == 1);
    bool use_int8 = (cast_type_from_index == 3 || cast_type_to_index == 3);

    MutexLockGuard lock(uop_lock);

    const ncnn::Layer* cached_uop = 0;
    if (use_int8)
    {
        cached_uop = uop_packing_int8[packing_type_to_index];
    }
    else
    {
        cached_uop = uop_packing[cast_type_from_index][cast_type_to_index][packing_type_to_index];
    }
    if (cached_uop)
        return cached_uop;

    // create uop
    Option opt;
    opt.use_fp16_packed = use_fp16; // fp16p is always supported
    opt.use_fp16_storage = use_fp16 && vkdev->info.support_fp16_storage();
    opt.use_int8_packed = use_int8; // int8p is always supported
    opt.use_int8_storage = use_int8 && vkdev->info.support_int8_storage();

    // fp16/int8 arithmetic are not necessary for packing
    // and may conflict with storage options
    opt.use_fp16_arithmetic = false;
    opt.use_int8_arithmetic = false;

    // enable pack8 for pack8to1/pack8to4
    opt.use_shader_pack8 = true;

    // do not enable spirv-1.3 from cooperative matrix
    opt.use_cooperative_matrix = false;

    opt.use_vulkan_compute = true;

    // cache uop pipeline as device member explicitly
    opt.pipeline_cache = 0;

    opt.vulkan_device_index = vkdev->info.device_index();

    ncnn::Layer* uop = ncnn::create_layer_vulkan(LayerType::Packing);
    uop->vkdev = vkdev;

    ncnn::ParamDict pd;
    pd.set(0, packing_type_to_index == 0 ? 1 : packing_type_to_index == 1 ? 4 : 8); // out_elempack
    pd.set(2, cast_type_from_index + 1); // 0=auto 1=fp32 2=fp16 3=int8
    pd.set(3, cast_type_to_index + 1);

    uop->load_param(pd);

    uop->create_pipeline(opt);

    if (use_int8)
    {
        uop_packing_int8[packing_type_to_index] = uop;
    }
    else
    {
        uop_packing[cast_type_from_index][cast_type_to_index][packing_type_to_index] = uop;
    }

    return uop;
}
void VulkanDevicePrivate::destroy_utility_operator()
{
    Option opt;
    opt.use_vulkan_compute = true;
    opt.use_fp16_arithmetic = false;
    opt.use_int8_arithmetic = false;
    opt.use_cooperative_matrix = false;
    opt.pipeline_cache = 0;
    opt.vulkan_device_index = vkdev->info.device_index();

    // from fp32 | fp16
    for (int j0 = 0; j0 < 2; j0++)
    {
        // to fp32 | fp16
        for (int j1 = 0; j1 < 2; j1++)
        {
            bool use_fp16 = (j0 == 1 || j1 == 1);

            opt.use_fp16_packed = use_fp16;
            opt.use_fp16_storage = use_fp16 && vkdev->info.support_fp16_storage();
            opt.use_int8_packed = false;
            opt.use_int8_storage = false;

            // to pack1 | pack4 | pack8
            for (int k = 0; k < 3; k++)
            {
                // enable pack8 for pack8to1/pack8to4
                opt.use_shader_pack8 = true;

                ncnn::Layer* uop = uop_packing[j0][j1][k];
                if (!uop)
                    continue;

                uop->destroy_pipeline(opt);

                delete uop;

                uop_packing[j0][j1][k] = 0;
            }
        }
    }

    // int8
    {
        bool use_int8 = true;

        opt.use_fp16_packed = false;
        opt.use_fp16_storage = false;
        opt.use_int8_packed = use_int8;
        opt.use_int8_storage = use_int8 && vkdev->info.support_int8_storage();

        // to pack1 | pack4 | pack8
        for (int k = 0; k < 3; k++)
        {
            // enable pack8 for pack8to1/pack8to4
            opt.use_shader_pack8 = true;

            ncnn::Layer* uop = uop_packing_int8[k];
            if (!uop)
                continue;

            uop->destroy_pipeline(opt);

            delete uop;

            uop_packing_int8[k] = 0;
        }
    }
}
VulkanDevice::VulkanDevice(int device_index)
    : info(get_gpu_info(device_index)), d(new VulkanDevicePrivate(this))
{
    try_create_gpu_instance();

    std::vector<const char*> enabledExtensions;
    if (info.support_VK_KHR_8bit_storage())
        enabledExtensions.push_back("VK_KHR_8bit_storage");
    if (info.support_VK_KHR_16bit_storage())
        enabledExtensions.push_back("VK_KHR_16bit_storage");
    if (info.support_VK_KHR_bind_memory2())
        enabledExtensions.push_back("VK_KHR_bind_memory2");
    if (info.support_VK_KHR_buffer_device_address())
        enabledExtensions.push_back("VK_KHR_buffer_device_address");
    if (info.support_VK_KHR_create_renderpass2())
        enabledExtensions.push_back("VK_KHR_create_renderpass2");
    if (info.support_VK_KHR_cooperative_matrix())
        enabledExtensions.push_back("VK_KHR_cooperative_matrix");
    if (info.support_VK_KHR_dedicated_allocation())
        enabledExtensions.push_back("VK_KHR_dedicated_allocation");
    if (info.support_VK_KHR_descriptor_update_template())
        enabledExtensions.push_back("VK_KHR_descriptor_update_template");
    if (info.support_VK_KHR_driver_properties())
        enabledExtensions.push_back("VK_KHR_driver_properties");
    if (info.support_VK_KHR_external_memory())
        enabledExtensions.push_back("VK_KHR_external_memory");
    if (info.support_VK_KHR_get_memory_requirements2())
        enabledExtensions.push_back("VK_KHR_get_memory_requirements2");
    if (info.support_VK_KHR_maintenance1())
        enabledExtensions.push_back("VK_KHR_maintenance1");
    if (info.support_VK_KHR_maintenance2())
        enabledExtensions.push_back("VK_KHR_maintenance2");
    if (info.support_VK_KHR_maintenance3())
        enabledExtensions.push_back("VK_KHR_maintenance3");
    if (info.support_VK_KHR_multiview())
        enabledExtensions.push_back("VK_KHR_multiview");
    if (info.support_VK_KHR_portability_subset())
        enabledExtensions.push_back("VK_KHR_portability_subset");
    if (info.support_VK_KHR_push_descriptor())
        enabledExtensions.push_back("VK_KHR_push_descriptor");
    if (info.support_VK_KHR_sampler_ycbcr_conversion())
        enabledExtensions.push_back("VK_KHR_sampler_ycbcr_conversion");
    if (info.support_VK_KHR_shader_bfloat16())
        enabledExtensions.push_back("VK_KHR_shader_bfloat16");
    if (info.support_VK_KHR_shader_float16_int8())
        enabledExtensions.push_back("VK_KHR_shader_float16_int8");
    if (info.support_VK_KHR_shader_float_controls())
        enabledExtensions.push_back("VK_KHR_shader_float_controls");
    if (info.support_VK_KHR_shader_float_controls2())
        enabledExtensions.push_back("VK_KHR_shader_float_controls2");
    if (info.support_VK_KHR_shader_integer_dot_product())
        enabledExtensions.push_back("VK_KHR_shader_integer_dot_product");
    if (info.support_VK_KHR_shader_non_semantic_info())
        enabledExtensions.push_back("VK_KHR_shader_non_semantic_info");
    if (info.support_VK_KHR_shader_subgroup_extended_types())
        enabledExtensions.push_back("VK_KHR_shader_subgroup_extended_types");
    if (info.support_VK_KHR_shader_subgroup_rotate())
        enabledExtensions.push_back("VK_KHR_shader_subgroup_rotate");
    if (info.support_VK_KHR_storage_buffer_storage_class())
        enabledExtensions.push_back("VK_KHR_storage_buffer_storage_class");
    if (info.support_VK_KHR_swapchain())
        enabledExtensions.push_back("VK_KHR_swapchain");
    if (info.support_VK_KHR_vulkan_memory_model())
        enabledExtensions.push_back("VK_KHR_vulkan_memory_model");
    if (info.support_VK_KHR_zero_initialize_workgroup_memory())
        enabledExtensions.push_back("VK_KHR_zero_initialize_workgroup_memory");
    if (info.support_VK_EXT_buffer_device_address())
        enabledExtensions.push_back("VK_EXT_buffer_device_address");
    if (info.support_VK_EXT_descriptor_indexing())
        enabledExtensions.push_back("VK_EXT_descriptor_indexing");
    if (info.support_VK_EXT_memory_budget())
        enabledExtensions.push_back("VK_EXT_memory_budget");
    if (info.support_VK_EXT_memory_priority())
        enabledExtensions.push_back("VK_EXT_memory_priority");
    if (info.support_VK_EXT_queue_family_foreign())
        enabledExtensions.push_back("VK_EXT_queue_family_foreign");
    if (info.support_VK_EXT_shader_atomic_float())
        enabledExtensions.push_back("VK_EXT_shader_atomic_float");
    if (info.support_VK_EXT_shader_atomic_float2())
        enabledExtensions.push_back("VK_EXT_shader_atomic_float2");
    if (info.support_VK_EXT_shader_float8())
        enabledExtensions.push_back("VK_EXT_shader_float8");
    if (info.support_VK_EXT_subgroup_size_control())
        enabledExtensions.push_back("VK_EXT_subgroup_size_control");
    if (info.support_VK_AMD_device_coherent_memory())
        enabledExtensions.push_back("VK_AMD_device_coherent_memory");
#if __ANDROID_API__ >= 26
    if (info.support_VK_ANDROID_external_memory_android_hardware_buffer())
        enabledExtensions.push_back("VK_ANDROID_external_memory_android_hardware_buffer");
#endif // __ANDROID_API__ >= 26
    if (info.support_VK_NV_cooperative_matrix())
        enabledExtensions.push_back("VK_NV_cooperative_matrix");
    if (info.support_VK_NV_cooperative_matrix2())
        enabledExtensions.push_back("VK_NV_cooperative_matrix2");
    if (info.support_VK_NV_cooperative_vector())
        enabledExtensions.push_back("VK_NV_cooperative_vector");

    const void* enabledExtensionFeatures = info.queryExtensionFeatures();
    std::vector<float> compute_queue_priorities(info.compute_queue_count(), 1.f);   // 0.f ~ 1.f
    std::vector<float> transfer_queue_priorities(info.transfer_queue_count(), 1.f); // 0.f ~ 1.f

    VkDeviceQueueCreateInfo deviceQueueCreateInfos[3];

    VkDeviceQueueCreateInfo deviceComputeQueueCreateInfo;
    deviceComputeQueueCreateInfo.sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
    deviceComputeQueueCreateInfo.pNext = 0;
    deviceComputeQueueCreateInfo.flags = 0;
    deviceComputeQueueCreateInfo.queueFamilyIndex = info.compute_queue_family_index();
    deviceComputeQueueCreateInfo.queueCount = info.compute_queue_count();
    deviceComputeQueueCreateInfo.pQueuePriorities = compute_queue_priorities.data();

    VkDeviceQueueCreateInfo deviceTransferQueueCreateInfo;
    deviceTransferQueueCreateInfo.sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
    deviceTransferQueueCreateInfo.pNext = 0;
    deviceTransferQueueCreateInfo.flags = 0;
    deviceTransferQueueCreateInfo.queueFamilyIndex = info.transfer_queue_family_index();
    deviceTransferQueueCreateInfo.queueCount = info.transfer_queue_count();
    deviceTransferQueueCreateInfo.pQueuePriorities = transfer_queue_priorities.data();

    VkDeviceCreateInfo deviceCreateInfo;
    deviceCreateInfo.sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
    deviceCreateInfo.pNext = enabledExtensionFeatures;
    deviceCreateInfo.flags = 0;
    if (info.compute_queue_family_index() == info.transfer_queue_family_index())
    {
        deviceQueueCreateInfos[0] = deviceComputeQueueCreateInfo;
        deviceCreateInfo.queueCreateInfoCount = 1;
    }
    else // if (info.compute_queue_family_index() != info.transfer_queue_family_index())
    {
        deviceQueueCreateInfos[0] = deviceComputeQueueCreateInfo;
        deviceQueueCreateInfos[1] = deviceTransferQueueCreateInfo;
        deviceCreateInfo.queueCreateInfoCount = 2;
    }
    deviceCreateInfo.pQueueCreateInfos = deviceQueueCreateInfos;
    deviceCreateInfo.enabledLayerCount = 0;
    deviceCreateInfo.ppEnabledLayerNames = 0;
    deviceCreateInfo.enabledExtensionCount = enabledExtensions.size();
    deviceCreateInfo.ppEnabledExtensionNames = enabledExtensions.data();
    deviceCreateInfo.pEnabledFeatures = 0; // VkPhysicalDeviceFeatures pointer

    VkResult ret = vkCreateDevice(info.physicalDevice(), &deviceCreateInfo, 0, &d->device);
    if (ret != VK_SUCCESS)
    {
        NCNN_LOGE("vkCreateDevice failed %d", ret);
        return;
    }

    init_device_extension();
    d->free_compute_queue_count = 0;
    d->free_transfer_queue_count = 0;

    d->free_compute_queue_count = info.compute_queue_count();
    d->compute_queues.resize(info.compute_queue_count());
    d->blob_allocators.resize(info.compute_queue_count());
    d->staging_allocators.resize(info.compute_queue_count());
    for (uint32_t i = 0; i < info.compute_queue_count(); i++)
    {
        vkGetDeviceQueue(d->device, info.compute_queue_family_index(), i, &d->compute_queues[i]);
        d->blob_allocators[i] = new VkBlobAllocator(this);
        d->staging_allocators[i] = new VkStagingAllocator(this);
    }
    if (info.compute_queue_family_index() != info.transfer_queue_family_index())
    {
        d->free_transfer_queue_count = info.transfer_queue_count();
        d->transfer_queues.resize(info.transfer_queue_count());
        for (uint32_t i = 0; i < info.transfer_queue_count(); i++)
        {
            vkGetDeviceQueue(d->device, info.transfer_queue_family_index(), i, &d->transfer_queues[i]);
        }
    }
  2943. // prepare immutable texelfetch sampler
  2944. {
  2945. VkSamplerCreateInfo samplerCreateInfo;
  2946. samplerCreateInfo.sType = VK_STRUCTURE_TYPE_SAMPLER_CREATE_INFO;
  2947. samplerCreateInfo.pNext = 0;
  2948. samplerCreateInfo.flags = 0;
  2949. samplerCreateInfo.magFilter = VK_FILTER_NEAREST;
  2950. samplerCreateInfo.minFilter = VK_FILTER_NEAREST;
  2951. samplerCreateInfo.mipmapMode = VK_SAMPLER_MIPMAP_MODE_NEAREST;
  2952. samplerCreateInfo.addressModeU = VK_SAMPLER_ADDRESS_MODE_CLAMP_TO_EDGE;
  2953. samplerCreateInfo.addressModeV = VK_SAMPLER_ADDRESS_MODE_CLAMP_TO_EDGE;
  2954. samplerCreateInfo.addressModeW = VK_SAMPLER_ADDRESS_MODE_CLAMP_TO_EDGE;
  2955. samplerCreateInfo.mipLodBias = 0.0f;
  2956. samplerCreateInfo.anisotropyEnable = VK_FALSE;
  2957. samplerCreateInfo.maxAnisotropy = 1;
  2958. samplerCreateInfo.compareEnable = VK_FALSE;
  2959. samplerCreateInfo.compareOp = VK_COMPARE_OP_NEVER;
  2960. samplerCreateInfo.minLod = 0.0f;
  2961. samplerCreateInfo.maxLod = 0.0f;
  2962. samplerCreateInfo.borderColor = VK_BORDER_COLOR_FLOAT_TRANSPARENT_BLACK;
  2963. samplerCreateInfo.unnormalizedCoordinates = VK_TRUE;
  2964. ret = vkCreateSampler(d->device, &samplerCreateInfo, 0, &d->texelfetch_sampler);
  2965. if (ret != VK_SUCCESS)
  2966. {
  2967. NCNN_LOGE("vkCreateSampler failed %d", ret);
  2968. }
  2969. }
  2970. int cret = d->create_dummy_buffer_image();
  2971. if (cret != 0)
  2972. {
  2973. NCNN_LOGE("VulkanDevice create_dummy_buffer_image failed %d", cret);
  2974. return;
  2975. }
  2976. d->pipeline_cache = new PipelineCache(this);
  2977. d->valid = true;
  2978. }
  2979. VulkanDevice::~VulkanDevice()
  2980. {
  2981. d->destroy_utility_operator();
  2982. d->destroy_dummy_buffer_image();
  2983. if (d->texelfetch_sampler)
  2984. {
  2985. vkDestroySampler(d->device, d->texelfetch_sampler, 0);
  2986. }
  2987. for (size_t i = 0; i < d->blob_allocators.size(); i++)
  2988. {
  2989. delete d->blob_allocators[i];
  2990. }
  2991. d->blob_allocators.clear();
  2992. for (size_t i = 0; i < d->staging_allocators.size(); i++)
  2993. {
  2994. delete d->staging_allocators[i];
  2995. }
  2996. d->staging_allocators.clear();
  2997. if (d->pipeline_cache)
  2998. {
  2999. delete d->pipeline_cache;
  3000. }
  3001. if (d->device)
  3002. {
  3003. vkDestroyDevice(d->device, 0);
  3004. }
  3005. delete d;
  3006. }
  3007. VulkanDevice::VulkanDevice(const VulkanDevice&)
  3008. : info(get_gpu_info(0)), d(0)
  3009. {
  3010. }
  3011. VulkanDevice& VulkanDevice::operator=(const VulkanDevice&)
  3012. {
  3013. return *this;
  3014. }
  3015. VkDevice VulkanDevice::vkdevice() const
  3016. {
  3017. return d->device;
  3018. }
  3019. bool VulkanDevice::is_valid() const
  3020. {
  3021. return d->valid;
  3022. }
  3023. VkShaderModule VulkanDevice::compile_shader_module(const uint32_t* spv_data, size_t spv_data_size) const
  3024. {
  3025. VkShaderModuleCreateInfo shaderModuleCreateInfo;
  3026. shaderModuleCreateInfo.sType = VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO;
  3027. shaderModuleCreateInfo.pNext = 0;
  3028. shaderModuleCreateInfo.flags = 0;
  3029. shaderModuleCreateInfo.codeSize = spv_data_size;
  3030. shaderModuleCreateInfo.pCode = spv_data;
  3031. VkShaderModule shader_module;
  3032. VkResult ret = vkCreateShaderModule(d->device, &shaderModuleCreateInfo, 0, &shader_module);
  3033. if (ret != VK_SUCCESS)
  3034. {
  3035. NCNN_LOGE("vkCreateShaderModule failed %d", ret);
  3036. return 0;
  3037. }
  3038. return shader_module;
  3039. }
  3040. static void inject_local_size_xyz(const uint32_t* code, size_t size, uint32_t local_size_x, uint32_t local_size_y, uint32_t local_size_z, uint32_t* dstcode, size_t* dstsize)
  3041. {
  3042. uint32_t local_size_x_id = -1;
  3043. uint32_t local_size_y_id = -1;
  3044. uint32_t local_size_z_id = -1;
  3045. uint32_t gl_WorkGroupSize_id = -1;
  3046. const uint32_t* p = code;
  3047. uint32_t* dp = dstcode;
  3048. // copy the 5-word header unchanged: magic, version, generator, bound, schema
  3049. memcpy(dp, p, 5 * sizeof(uint32_t));
  3050. p += 5;
  3051. dp += 5;
  3052. // foreach op
  3053. while ((const unsigned char*)p < (const unsigned char*)code + size)
  3054. {
  3055. uint32_t opcode = p[0];
  3056. uint16_t wordcount = opcode >> 16;
  3057. uint16_t op = opcode & 0xffff;
  3058. if (op == 16) // OpExecutionMode
  3059. {
  3060. uint32_t mode = p[2];
  3061. if (mode == 17) // LocalSize
  3062. {
  3063. memcpy(dp, p, wordcount * sizeof(uint32_t));
  3064. // set local_size_xyz
  3065. dp[3] = local_size_x;
  3066. dp[4] = local_size_y;
  3067. dp[5] = local_size_z;
  3068. p += wordcount;
  3069. dp += wordcount;
  3070. continue;
  3071. }
  3072. }
  3073. else if (op == 50) // OpSpecConstant
  3074. {
  3075. uint32_t id = p[2];
  3076. if (id == local_size_x_id || id == local_size_y_id || id == local_size_z_id)
  3077. {
  3078. p += wordcount;
  3079. continue;
  3080. }
  3081. }
  3082. else if (op == 51) // OpSpecConstantComposite
  3083. {
  3084. uint32_t id = p[2];
  3085. if (id == gl_WorkGroupSize_id)
  3086. {
  3087. if (wordcount == 6 && (p[3] == local_size_x_id || p[4] == local_size_y_id || p[5] == local_size_z_id))
  3088. {
  3089. p += wordcount;
  3090. continue;
  3091. }
  3092. }
  3093. }
  3094. else if (op == 71) // OpDecorate
  3095. {
  3096. uint32_t id = p[1];
  3097. uint32_t decoration = p[2];
  3098. if (decoration == 1) // SpecId
  3099. {
  3100. uint32_t specid = p[3];
  3101. if (specid == 233) local_size_x_id = id;
  3102. if (specid == 234) local_size_y_id = id;
  3103. if (specid == 235) local_size_z_id = id;
  3104. if (specid == 233 || specid == 234 || specid == 235)
  3105. {
  3106. p += wordcount;
  3107. continue;
  3108. }
  3109. }
  3110. else if (decoration == 11) // BuiltIn
  3111. {
  3112. uint32_t builtin = p[3];
  3113. if (builtin == 25) // WorkgroupSize
  3114. {
  3115. gl_WorkGroupSize_id = id;
  3116. p += wordcount;
  3117. continue;
  3118. }
  3119. }
  3120. }
  3121. memcpy(dp, p, wordcount * sizeof(uint32_t));
  3122. p += wordcount;
  3123. dp += wordcount;
  3124. }
  3125. *dstsize = (unsigned char*)dp - (unsigned char*)dstcode;
  3126. }
  3127. VkShaderModule VulkanDevice::compile_shader_module(const uint32_t* spv_data, size_t spv_data_size, uint32_t local_size_x, uint32_t local_size_y, uint32_t local_size_z) const
  3128. {
  3129. uint32_t* spv_data_modified = (uint32_t*)malloc(spv_data_size);
  3130. size_t spv_data_size_modified = spv_data_size;
  3131. inject_local_size_xyz(spv_data, spv_data_size, local_size_x, local_size_y, local_size_z, spv_data_modified, &spv_data_size_modified);
  3132. VkShaderModule shader_module = compile_shader_module(spv_data_modified, spv_data_size_modified);
  3133. free(spv_data_modified);
  3134. return shader_module;
  3135. }
  3136. int VulkanDevice::create_descriptorset_layout(int binding_count, const int* binding_types, VkDescriptorSetLayout* descriptorset_layout) const
  3137. {
  3138. if (binding_count == 0)
  3139. {
  3140. *descriptorset_layout = 0;
  3141. return 0;
  3142. }
  3143. std::vector<VkDescriptorSetLayoutBinding> descriptorSetLayoutBindings(binding_count);
  3144. for (int i = 0; i < binding_count; i++)
  3145. {
  3146. int binding_type = binding_types[i];
  3147. descriptorSetLayoutBindings[i].binding = i;
  3148. descriptorSetLayoutBindings[i].descriptorCount = 1;
  3149. descriptorSetLayoutBindings[i].stageFlags = VK_SHADER_STAGE_COMPUTE_BIT;
  3150. if (binding_type == 1)
  3151. {
  3152. descriptorSetLayoutBindings[i].descriptorType = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER;
  3153. descriptorSetLayoutBindings[i].pImmutableSamplers = 0;
  3154. }
  3155. else if (binding_type == 2)
  3156. {
  3157. descriptorSetLayoutBindings[i].descriptorType = VK_DESCRIPTOR_TYPE_STORAGE_IMAGE;
  3158. descriptorSetLayoutBindings[i].pImmutableSamplers = 0;
  3159. }
  3160. else // if (binding_type == 3)
  3161. {
  3162. descriptorSetLayoutBindings[i].descriptorType = VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER;
  3163. descriptorSetLayoutBindings[i].pImmutableSamplers = immutable_texelfetch_sampler(); // we always use texelfetch
  3164. }
  3165. }
  3166. VkDescriptorSetLayoutCreateInfo descriptorSetLayoutCreateInfo;
  3167. descriptorSetLayoutCreateInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO;
  3168. descriptorSetLayoutCreateInfo.pNext = 0;
  3169. descriptorSetLayoutCreateInfo.flags = 0;
  3170. descriptorSetLayoutCreateInfo.bindingCount = binding_count;
  3171. descriptorSetLayoutCreateInfo.pBindings = descriptorSetLayoutBindings.data();
  3172. if (info.support_VK_KHR_push_descriptor())
  3173. {
  3174. descriptorSetLayoutCreateInfo.flags |= VK_DESCRIPTOR_SET_LAYOUT_CREATE_PUSH_DESCRIPTOR_BIT_KHR;
  3175. }
  3176. VkResult ret = vkCreateDescriptorSetLayout(d->device, &descriptorSetLayoutCreateInfo, 0, descriptorset_layout);
  3177. if (ret != VK_SUCCESS)
  3178. {
  3179. NCNN_LOGE("vkCreateDescriptorSetLayout failed %d", ret);
  3180. return -1;
  3181. }
  3182. return 0;
  3183. }
  3184. int VulkanDevice::create_pipeline_layout(int push_constant_count, VkDescriptorSetLayout descriptorset_layout, VkPipelineLayout* pipeline_layout) const
  3185. {
  3186. VkPushConstantRange pushConstantRange;
  3187. pushConstantRange.stageFlags = VK_SHADER_STAGE_COMPUTE_BIT;
  3188. pushConstantRange.offset = 0;
  3189. pushConstantRange.size = sizeof(vk_constant_type) * push_constant_count;
  3190. VkPipelineLayoutCreateInfo pipelineLayoutCreateInfo;
  3191. pipelineLayoutCreateInfo.sType = VK_STRUCTURE_TYPE_PIPELINE_LAYOUT_CREATE_INFO;
  3192. pipelineLayoutCreateInfo.pNext = 0;
  3193. pipelineLayoutCreateInfo.flags = 0;
  3194. if (descriptorset_layout)
  3195. {
  3196. pipelineLayoutCreateInfo.setLayoutCount = 1;
  3197. pipelineLayoutCreateInfo.pSetLayouts = &descriptorset_layout;
  3198. }
  3199. else
  3200. {
  3201. pipelineLayoutCreateInfo.setLayoutCount = 0;
  3202. pipelineLayoutCreateInfo.pSetLayouts = 0;
  3203. }
  3204. if (push_constant_count > 0)
  3205. {
  3206. pipelineLayoutCreateInfo.pushConstantRangeCount = 1;
  3207. pipelineLayoutCreateInfo.pPushConstantRanges = &pushConstantRange;
  3208. }
  3209. else
  3210. {
  3211. pipelineLayoutCreateInfo.pushConstantRangeCount = 0;
  3212. pipelineLayoutCreateInfo.pPushConstantRanges = 0;
  3213. }
  3214. VkResult ret = vkCreatePipelineLayout(d->device, &pipelineLayoutCreateInfo, 0, pipeline_layout);
  3215. if (ret != VK_SUCCESS)
  3216. {
  3217. NCNN_LOGE("vkCreatePipelineLayout failed %d", ret);
  3218. return -1;
  3219. }
  3220. return 0;
  3221. }
  3222. int VulkanDevice::create_pipeline(VkShaderModule shader_module, VkPipelineLayout pipeline_layout, const std::vector<vk_specialization_type>& specializations, uint32_t subgroup_size, VkPipeline* pipeline) const
  3223. {
  3224. const int specialization_count = specializations.size();
  3225. std::vector<VkSpecializationMapEntry> specializationMapEntries(specialization_count);
  3226. for (int i = 0; i < specialization_count; i++)
  3227. {
  3228. specializationMapEntries[i].constantID = i;
  3229. specializationMapEntries[i].offset = i * sizeof(vk_specialization_type);
  3230. specializationMapEntries[i].size = sizeof(vk_specialization_type);
  3231. }
  3232. VkSpecializationInfo specializationInfo;
  3233. specializationInfo.mapEntryCount = specializationMapEntries.size();
  3234. specializationInfo.pMapEntries = specializationMapEntries.data();
  3235. specializationInfo.dataSize = specializations.size() * sizeof(vk_specialization_type);
  3236. specializationInfo.pData = specializations.data();
  3237. VkPipelineShaderStageCreateInfo pipelineShaderStageCreateInfo;
  3238. pipelineShaderStageCreateInfo.sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
  3239. pipelineShaderStageCreateInfo.pNext = 0;
  3240. pipelineShaderStageCreateInfo.flags = 0;
  3241. pipelineShaderStageCreateInfo.stage = VK_SHADER_STAGE_COMPUTE_BIT;
  3242. pipelineShaderStageCreateInfo.module = shader_module;
  3243. pipelineShaderStageCreateInfo.pName = "main";
  3244. pipelineShaderStageCreateInfo.pSpecializationInfo = &specializationInfo;
  3245. // left disabled: the full-subgroups flag requires local_size_x to be a multiple of the subgroup size
  3246. // if (info.support_compute_full_subgroups())
  3247. // {
  3248. // pipelineShaderStageCreateInfo.flags |= VK_PIPELINE_SHADER_STAGE_CREATE_REQUIRE_FULL_SUBGROUPS_BIT_EXT;
  3249. // }
  3250. void* enabledExtensionFeatures = 0;
  3251. // subgroup size control
  3252. VkPipelineShaderStageRequiredSubgroupSizeCreateInfoEXT pipelineShaderStageRequiredSubgroupSizeCreateInfo;
  3253. pipelineShaderStageRequiredSubgroupSizeCreateInfo.sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_REQUIRED_SUBGROUP_SIZE_CREATE_INFO_EXT;
  3254. pipelineShaderStageRequiredSubgroupSizeCreateInfo.pNext = 0;
  3255. pipelineShaderStageRequiredSubgroupSizeCreateInfo.requiredSubgroupSize = subgroup_size;
  3256. if (info.support_subgroup_size_control())
  3257. {
  3258. // pipelineShaderStageCreateInfo.flags |= VK_PIPELINE_SHADER_STAGE_CREATE_ALLOW_VARYING_SUBGROUP_SIZE_BIT;
  3259. pipelineShaderStageRequiredSubgroupSizeCreateInfo.pNext = enabledExtensionFeatures;
  3260. enabledExtensionFeatures = &pipelineShaderStageRequiredSubgroupSizeCreateInfo;
  3261. }
  3262. pipelineShaderStageCreateInfo.pNext = enabledExtensionFeatures;
  3263. VkComputePipelineCreateInfo computePipelineCreateInfo;
  3264. computePipelineCreateInfo.sType = VK_STRUCTURE_TYPE_COMPUTE_PIPELINE_CREATE_INFO;
  3265. computePipelineCreateInfo.pNext = 0;
  3266. computePipelineCreateInfo.flags = 0;
  3267. computePipelineCreateInfo.stage = pipelineShaderStageCreateInfo;
  3268. computePipelineCreateInfo.layout = pipeline_layout;
  3269. computePipelineCreateInfo.basePipelineHandle = 0;
  3270. computePipelineCreateInfo.basePipelineIndex = 0;
  3271. VkResult ret = vkCreateComputePipelines(d->device, 0, 1, &computePipelineCreateInfo, 0, pipeline);
  3272. if (ret != VK_SUCCESS)
  3273. {
  3274. NCNN_LOGE("vkCreateComputePipelines failed %d", ret);
  3275. return -1;
  3276. }
  3277. return 0;
  3278. }
  3279. int VulkanDevice::create_descriptor_update_template(int binding_count, const int* binding_types, VkDescriptorSetLayout descriptorset_layout, VkPipelineLayout pipeline_layout, VkDescriptorUpdateTemplateKHR* descriptor_update_template) const
  3280. {
  3281. if (binding_count == 0)
  3282. {
  3283. *descriptor_update_template = 0;
  3284. return 0;
  3285. }
  3286. std::vector<VkDescriptorUpdateTemplateEntryKHR> descriptorUpdateTemplateEntries(binding_count);
  3287. size_t offset = 0;
  3288. for (int i = 0; i < binding_count; i++) // TODO do not update weights
  3289. {
  3290. int binding_type = binding_types[i];
  3291. descriptorUpdateTemplateEntries[i].dstBinding = i;
  3292. descriptorUpdateTemplateEntries[i].dstArrayElement = 0;
  3293. descriptorUpdateTemplateEntries[i].descriptorCount = 1;
  3294. descriptorUpdateTemplateEntries[i].offset = offset;
  3295. if (binding_type == 1)
  3296. {
  3297. descriptorUpdateTemplateEntries[i].descriptorType = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER;
  3298. descriptorUpdateTemplateEntries[i].stride = sizeof(VkDescriptorBufferInfo);
  3299. }
  3300. else if (binding_type == 2)
  3301. {
  3302. descriptorUpdateTemplateEntries[i].descriptorType = VK_DESCRIPTOR_TYPE_STORAGE_IMAGE;
  3303. descriptorUpdateTemplateEntries[i].stride = sizeof(VkDescriptorImageInfo);
  3304. }
  3305. else // if (binding_type == 3)
  3306. {
  3307. descriptorUpdateTemplateEntries[i].descriptorType = VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER;
  3308. descriptorUpdateTemplateEntries[i].stride = sizeof(VkDescriptorImageInfo);
  3309. }
  3310. offset += descriptorUpdateTemplateEntries[i].stride;
  3311. }
  3312. VkDescriptorUpdateTemplateCreateInfoKHR descriptorUpdateTemplateCreateInfo;
  3313. descriptorUpdateTemplateCreateInfo.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_UPDATE_TEMPLATE_CREATE_INFO_KHR;
  3314. descriptorUpdateTemplateCreateInfo.pNext = 0;
  3315. descriptorUpdateTemplateCreateInfo.flags = 0;
  3316. descriptorUpdateTemplateCreateInfo.descriptorUpdateEntryCount = binding_count; // TODO do not update weights
  3317. descriptorUpdateTemplateCreateInfo.pDescriptorUpdateEntries = descriptorUpdateTemplateEntries.data();
  3318. if (info.support_VK_KHR_push_descriptor())
  3319. {
  3320. descriptorUpdateTemplateCreateInfo.templateType = VK_DESCRIPTOR_UPDATE_TEMPLATE_TYPE_PUSH_DESCRIPTORS_KHR;
  3321. }
  3322. else
  3323. {
  3324. descriptorUpdateTemplateCreateInfo.templateType = VK_DESCRIPTOR_UPDATE_TEMPLATE_TYPE_DESCRIPTOR_SET_KHR;
  3325. }
  3326. // descriptorSetLayout should be ignored when templateType is VK_DESCRIPTOR_UPDATE_TEMPLATE_TYPE_PUSH_DESCRIPTORS_KHR
  3327. // FIXME radv crashes if it is set to NULL, so the layout is always passed :(
  3328. descriptorUpdateTemplateCreateInfo.descriptorSetLayout = descriptorset_layout;
  3329. descriptorUpdateTemplateCreateInfo.pipelineBindPoint = VK_PIPELINE_BIND_POINT_COMPUTE;
  3330. descriptorUpdateTemplateCreateInfo.pipelineLayout = pipeline_layout;
  3331. descriptorUpdateTemplateCreateInfo.set = 0;
  3332. VkResult ret = vkCreateDescriptorUpdateTemplateKHR(d->device, &descriptorUpdateTemplateCreateInfo, 0, descriptor_update_template);
  3333. if (ret != VK_SUCCESS)
  3334. {
  3335. NCNN_LOGE("vkCreateDescriptorUpdateTemplateKHR failed %d", ret);
  3336. return -1;
  3337. }
  3338. return 0;
  3339. }
  3340. uint32_t VulkanDevice::find_memory_index(uint32_t memory_type_bits, VkFlags required, VkFlags preferred, VkFlags preferred_not) const
  3341. {
  3342. const VkPhysicalDeviceMemoryProperties& memory_properties = info.physicalDeviceMemoryProperties();
  3343. // first try, find required and with preferred and without preferred_not
  3344. for (uint32_t i = 0; i < memory_properties.memoryTypeCount; i++)
  3345. {
  3346. bool is_required = (1u << i) & memory_type_bits;
  3347. if (is_required)
  3348. {
  3349. const VkMemoryType& memoryType = memory_properties.memoryTypes[i];
  3350. if ((memoryType.propertyFlags & required) == required
  3351. && (preferred && (memoryType.propertyFlags & preferred))
  3352. && (preferred_not && !(memoryType.propertyFlags & preferred_not)))
  3353. {
  3354. return i;
  3355. }
  3356. }
  3357. }
  3358. // second try, find required and with preferred
  3359. for (uint32_t i = 0; i < memory_properties.memoryTypeCount; i++)
  3360. {
  3361. bool is_required = (1u << i) & memory_type_bits;
  3362. if (is_required)
  3363. {
  3364. const VkMemoryType& memoryType = memory_properties.memoryTypes[i];
  3365. if ((memoryType.propertyFlags & required) == required
  3366. && (preferred && (memoryType.propertyFlags & preferred)))
  3367. {
  3368. return i;
  3369. }
  3370. }
  3371. }
  3372. // third try, find required and without preferred_not
  3373. for (uint32_t i = 0; i < memory_properties.memoryTypeCount; i++)
  3374. {
  3375. bool is_required = (1u << i) & memory_type_bits;
  3376. if (is_required)
  3377. {
  3378. const VkMemoryType& memoryType = memory_properties.memoryTypes[i];
  3379. if ((memoryType.propertyFlags & required) == required
  3380. && (preferred_not && !(memoryType.propertyFlags & preferred_not)))
  3381. {
  3382. return i;
  3383. }
  3384. }
  3385. }
  3386. // fourth try, find any required
  3387. for (uint32_t i = 0; i < memory_properties.memoryTypeCount; i++)
  3388. {
  3389. bool is_required = (1u << i) & memory_type_bits;
  3390. if (is_required)
  3391. {
  3392. const VkMemoryType& memoryType = memory_properties.memoryTypes[i];
  3393. if ((memoryType.propertyFlags & required) == required)
  3394. {
  3395. return i;
  3396. }
  3397. }
  3398. }
  3399. NCNN_LOGE("no such memory type %u %u %u %u", memory_type_bits, required, preferred, preferred_not);
  3400. return -1;
  3401. }
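// Editor's sketch, not part of ncnn: the four "tries" in find_memory_index can be read
// as one loop over progressively weaker filters. MemType and pick_memory_type are
// hypothetical names, and the flag constants are plain integers so that the example
// builds without the Vulkan headers. As in the original, a zero preferred or
// preferred_not mask makes the passes that depend on it match nothing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for VkMemoryType, reduced to its property flags.
struct MemType
{
    uint32_t flags;
};

// Returns the first memory type index that survives the strictest possible pass, or -1.
static int pick_memory_type(const std::vector<MemType>& types, uint32_t type_bits,
                            uint32_t required, uint32_t preferred, uint32_t preferred_not)
{
    for (int pass = 0; pass < 4; pass++)
    {
        const bool want_preferred = (pass == 0 || pass == 1);
        const bool want_not = (pass == 0 || pass == 2);
        for (size_t i = 0; i < types.size() && i < 32; i++)
        {
            if (!((1u << i) & type_bits))
                continue; // the resource cannot live in this memory type
            const uint32_t f = types[i].flags;
            if ((f & required) != required)
                continue; // every required bit must be present
            if (want_preferred && !(preferred && (f & preferred)))
                continue;
            if (want_not && !(preferred_not && !(f & preferred_not)))
                continue;
            return (int)i;
        }
    }
    return -1;
}
```

// Example: with types {1}, {2|4}, {2|4|8}, {1|2|4}, asking for required=2, preferred=8
// and preferred_not=1 selects index 2 in the first pass.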
  3402. bool VulkanDevice::is_mappable(uint32_t memory_type_index) const
  3403. {
  3404. const VkMemoryType& memoryType = info.physicalDeviceMemoryProperties().memoryTypes[memory_type_index];
  3405. return memoryType.propertyFlags & VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT;
  3406. }
  3407. bool VulkanDevice::is_coherent(uint32_t memory_type_index) const
  3408. {
  3409. const VkMemoryType& memoryType = info.physicalDeviceMemoryProperties().memoryTypes[memory_type_index];
  3410. return memoryType.propertyFlags & VK_MEMORY_PROPERTY_HOST_COHERENT_BIT;
  3411. }
  3412. VkQueue VulkanDevice::acquire_queue(uint32_t queue_family_index) const
  3413. {
  3414. if (queue_family_index != info.compute_queue_family_index() && queue_family_index != info.transfer_queue_family_index())
  3415. {
  3416. NCNN_LOGE("invalid queue_family_index %u", queue_family_index);
  3417. return 0;
  3418. }
  3419. Mutex& queue_lock = queue_family_index == info.compute_queue_family_index() ? d->compute_queue_lock : d->transfer_queue_lock;
  3420. queue_lock.lock();
  3421. ConditionVariable& queue_condition = queue_family_index == info.compute_queue_family_index() ? d->compute_queue_condition : d->transfer_queue_condition;
  3422. int& free_queue_count = queue_family_index == info.compute_queue_family_index() ? d->free_compute_queue_count : d->free_transfer_queue_count;
  3423. while (free_queue_count == 0)
  3424. {
  3425. // no free queues, wait for reclaims from other threads
  3426. queue_condition.wait(queue_lock);
  3427. }
  3428. std::vector<VkQueue>& queues = queue_family_index == info.compute_queue_family_index() ? d->compute_queues : d->transfer_queues;
  3429. VkQueue queue = 0;
  3430. for (size_t i = 0; i < queues.size(); i++)
  3431. {
  3432. if (queues[i])
  3433. {
  3434. queue = queues[i];
  3435. queues[i] = 0;
  3436. break;
  3437. }
  3438. }
  3439. if (!queue)
  3440. {
  3441. NCNN_LOGE("FATAL ERROR! out of hardware queue %u", queue_family_index);
  3442. }
  3443. free_queue_count -= 1;
  3444. queue_lock.unlock();
  3445. queue_condition.signal();
  3446. return queue;
  3447. }
  3448. void VulkanDevice::reclaim_queue(uint32_t queue_family_index, VkQueue queue) const
  3449. {
  3450. if (queue_family_index != info.compute_queue_family_index() && queue_family_index != info.transfer_queue_family_index())
  3451. {
  3452. NCNN_LOGE("invalid queue_family_index %u", queue_family_index);
  3453. return;
  3454. }
  3455. Mutex& queue_lock = queue_family_index == info.compute_queue_family_index() ? d->compute_queue_lock : d->transfer_queue_lock;
  3456. queue_lock.lock();
  3457. ConditionVariable& queue_condition = queue_family_index == info.compute_queue_family_index() ? d->compute_queue_condition : d->transfer_queue_condition;
  3458. int& free_queue_count = queue_family_index == info.compute_queue_family_index() ? d->free_compute_queue_count : d->free_transfer_queue_count;
  3459. std::vector<VkQueue>& queues = queue_family_index == info.compute_queue_family_index() ? d->compute_queues : d->transfer_queues;
  3460. size_t i = 0;
  3461. for (; i < queues.size(); i++)
  3462. {
  3463. if (!queues[i])
  3464. {
  3465. queues[i] = queue;
  3466. break;
  3467. }
  3468. }
  3469. if (i == queues.size())
  3470. {
  3471. NCNN_LOGE("FATAL ERROR! reclaim_queue get wild queue %u %p", queue_family_index, queue);
  3472. }
  3473. free_queue_count += 1;
  3474. queue_lock.unlock();
  3475. queue_condition.signal();
  3476. }
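// Editor's sketch, not part of ncnn: acquire_queue and reclaim_queue above form a
// fixed-size pool guarded by a mutex and a condition variable. The version below shows
// the same pattern with the C++11 standard library instead of ncnn's Mutex and
// ConditionVariable wrappers. HandlePool is a hypothetical name, and non-zero integers
// stand in for VkQueue handles.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <vector>

class HandlePool
{
public:
    // Every handle must be non-zero, because zero marks an empty slot.
    explicit HandlePool(const std::vector<int>& handles)
        : slots_(handles), free_count_((int)handles.size())
    {
    }

    // Blocks until a handle is free, then removes it from its slot.
    int acquire()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        cond_.wait(lock, [this] { return free_count_ > 0; });
        int handle = 0;
        for (size_t i = 0; i < slots_.size(); i++)
        {
            if (slots_[i])
            {
                handle = slots_[i];
                slots_[i] = 0;
                break;
            }
        }
        free_count_ -= 1;
        return handle;
    }

    // Puts a handle back into the first empty slot and wakes one waiter.
    void reclaim(int handle)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            for (size_t i = 0; i < slots_.size(); i++)
            {
                if (!slots_[i])
                {
                    slots_[i] = handle;
                    break;
                }
            }
            free_count_ += 1;
        }
        cond_.notify_one();
    }

    int free_count()
    {
        std::lock_guard<std::mutex> lock(mutex_);
        return free_count_;
    }

private:
    std::vector<int> slots_;
    int free_count_;
    std::mutex mutex_;
    std::condition_variable cond_;
};
```

// The predicate form of wait() re-checks free_count_ after every wake-up, which plays
// the role of the while loop in acquire_queue.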
  3477. VkAllocator* VulkanDevice::acquire_blob_allocator() const
  3478. {
  3479. MutexLockGuard lock(d->blob_allocator_lock);
  3480. for (int i = 0; i < (int)d->blob_allocators.size(); i++)
  3481. {
  3482. VkAllocator* allocator = d->blob_allocators[i];
  3483. if (allocator)
  3484. {
  3485. d->blob_allocators[i] = 0;
  3486. return allocator;
  3487. }
  3488. }
  3489. // pre-allocated allocators exhausted, create a new one
  3490. VkAllocator* allocator = new VkBlobAllocator(this);
  3491. d->blob_allocators.push_back(allocator);
  3492. d->blob_allocators[d->blob_allocators.size() - 1] = 0;
  3493. return allocator;
  3494. }
  3495. void VulkanDevice::reclaim_blob_allocator(VkAllocator* allocator) const
  3496. {
  3497. MutexLockGuard lock(d->blob_allocator_lock);
  3498. for (int i = 0; i < (int)d->blob_allocators.size(); i++)
  3499. {
  3500. if (!d->blob_allocators[i])
  3501. {
  3502. d->blob_allocators[i] = allocator;
  3503. return;
  3504. }
  3505. }
  3506. NCNN_LOGE("FATAL ERROR! reclaim_blob_allocator get wild allocator %p", allocator);
  3507. }
  3508. VkAllocator* VulkanDevice::acquire_staging_allocator() const
  3509. {
  3510. MutexLockGuard lock(d->staging_allocator_lock);
  3511. for (int i = 0; i < (int)d->staging_allocators.size(); i++)
  3512. {
  3513. VkAllocator* allocator = d->staging_allocators[i];
  3514. if (allocator)
  3515. {
  3516. d->staging_allocators[i] = 0;
  3517. return allocator;
  3518. }
  3519. }
  3520. // pre-allocated allocators exhausted, create a new one
  3521. VkAllocator* allocator = new VkStagingAllocator(this);
  3522. d->staging_allocators.push_back(allocator);
  3523. d->staging_allocators[d->staging_allocators.size() - 1] = 0;
  3524. return allocator;
  3525. }
  3526. void VulkanDevice::reclaim_staging_allocator(VkAllocator* allocator) const
  3527. {
  3528. MutexLockGuard lock(d->staging_allocator_lock);
  3529. for (int i = 0; i < (int)d->staging_allocators.size(); i++)
  3530. {
  3531. if (!d->staging_allocators[i])
  3532. {
  3533. d->staging_allocators[i] = allocator;
  3534. return;
  3535. }
  3536. }
  3537. NCNN_LOGE("FATAL ERROR! reclaim_staging_allocator get wild allocator %p", allocator);
  3538. }
  3539. const VkSampler* VulkanDevice::immutable_texelfetch_sampler() const
  3540. {
  3541. return &d->texelfetch_sampler;
  3542. }
  3543. VkMat VulkanDevice::get_dummy_buffer() const
  3544. {
  3545. return d->dummy_buffer;
  3546. }
  3547. VkImageMat VulkanDevice::get_dummy_image() const
  3548. {
  3549. return d->dummy_image;
  3550. }
  3551. VkImageMat VulkanDevice::get_dummy_image_readonly() const
  3552. {
  3553. #if __APPLE__
  3554. if (info.type() != 0)
  3555. return d->dummy_image;
  3556. #endif
  3557. return d->dummy_image_readonly;
  3558. }
  3559. const PipelineCache* VulkanDevice::get_pipeline_cache() const
  3560. {
  3561. return d->pipeline_cache;
  3562. }
  3563. bool VulkanDevice::shape_support_image_storage(const Mat& shape) const
  3564. {
  3565. int dims = shape.dims;
  3566. int width = shape.w;
  3567. int height = shape.h;
  3568. int depth = shape.c;
  3569. int elempack = shape.elempack;
  3570. // elempack larger than 4 is spread along the image width
  3571. if (elempack == 8) width *= 2;
  3572. if (elempack == 16) width *= 4;
  3573. if (elempack == 32) width *= 8;
  3574. if (elempack == 64) width *= 16;
  3575. if (dims == 1)
  3576. {
  3577. if (width > (int)info.max_image_dimension_1d())
  3578. {
  3579. return false;
  3580. }
  3581. }
  3582. else if (dims == 2)
  3583. {
  3584. if (width > (int)info.max_image_dimension_2d() || height > (int)info.max_image_dimension_2d())
  3585. {
  3586. return false;
  3587. }
  3588. }
  3589. else // if (dims == 3)
  3590. {
  3591. if (width > (int)info.max_image_dimension_3d() || height > (int)info.max_image_dimension_3d() || depth > (int)info.max_image_dimension_3d())
  3592. {
  3593. return false;
  3594. }
  3595. }
  3596. return true;
  3597. }
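// Editor's sketch, not part of ncnn: the width adjustment in shape_support_image_storage
// amounts to multiplying the width by elempack / 4 for elempack 8, 16, 32 and 64, then
// comparing against the device limit. The helper names and the max_dim parameter are
// hypothetical.

```cpp
#include <cassert>

// Width actually needed in the image once large elempack values are spread along w.
static int image_width_for_elempack(int width, int elempack)
{
    // elempack 1 and 4 leave the width unchanged in the source
    if (elempack == 8 || elempack == 16 || elempack == 32 || elempack == 64)
        return width * (elempack / 4);
    return width;
}

// 2D case of the check: both extents must fit the device limit.
static bool fits_2d(int width, int height, int elempack, int max_dim)
{
    const int w = image_width_for_elempack(width, elempack);
    return w <= max_dim && height <= max_dim;
}
```

// For instance, a 5000-wide shape with elempack 16 needs an image width of 20000 and so
// would not fit a hypothetical 16384 limit, while elempack 8 (10000) would.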
  3598. uint32_t VulkanDevice::get_heap_budget() const
  3599. {
  3600. const VkPhysicalDeviceMemoryProperties& memory_properties = info.physicalDeviceMemoryProperties();
  3601. uint32_t buffer_memory_type_index = d->dummy_allocator->buffer_memory_type_index;
  3602. uint32_t buffer_heap_index = memory_properties.memoryTypes[buffer_memory_type_index].heapIndex;
  3603. if (!info.support_VK_EXT_memory_budget())
  3604. {
  3605. // NCNN_LOGE("heap budget from assumption\n");
  3606. uint32_t device_local_heap_size = memory_properties.memoryHeaps[buffer_heap_index].size / 1024 / 1024;
  3607. // the whole heap is usually not available to us, so assume
  3608. // 70% for heaps of 4000 MB or more
  3609. // 50% for smaller heaps
  3610. return device_local_heap_size >= 4000 ? device_local_heap_size * 0.7 : device_local_heap_size * 0.5;
  3611. }
  3612. VkPhysicalDeviceMemoryBudgetPropertiesEXT memoryBudgetProperties;
  3613. memoryBudgetProperties.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MEMORY_BUDGET_PROPERTIES_EXT;
  3614. memoryBudgetProperties.pNext = 0;
  3615. VkPhysicalDeviceMemoryProperties2KHR memoryProperties;
  3616. memoryProperties.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_MEMORY_PROPERTIES_2_KHR;
  3617. memoryProperties.pNext = &memoryBudgetProperties;
  3618. vkGetPhysicalDeviceMemoryProperties2KHR(info.physicalDevice(), &memoryProperties);
  3619. return memoryBudgetProperties.heapBudget[buffer_heap_index] / 1024 / 1024;
  3620. }
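// Editor's sketch, not part of ncnn: the fallback branch of get_heap_budget, used when
// VK_EXT_memory_budget is unavailable, reduces to the arithmetic below.
// assumed_budget_mb is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// Estimate the usable budget in MB from the heap size in bytes.
static uint32_t assumed_budget_mb(uint64_t heap_size_bytes)
{
    const uint32_t heap_mb = (uint32_t)(heap_size_bytes / 1024 / 1024);
    // 70% of the heap at 4000 MB or more, 50% below that, truncated to whole MB
    return heap_mb >= 4000 ? (uint32_t)(heap_mb * 0.7) : (uint32_t)(heap_mb * 0.5);
}
```

// An 8 GiB heap (8192 MB) yields 5734 MB, and a 2 GiB heap (2048 MB) yields 1024 MB.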
  3621. void VulkanDevice::convert_packing(const VkMat& src, VkMat& dst, int dst_elempack, VkCompute& cmd, const Option& opt) const
  3622. {
  3623. convert_packing(src, dst, dst_elempack, 0, cmd, opt);
  3624. }
  3625. void VulkanDevice::convert_packing(const VkMat& src, VkMat& dst, int dst_elempack, int cast_type_to, VkCompute& cmd, const Option& opt) const
  3626. {
  3627. int packing_type_to_index = dst_elempack == 1 ? 0 : dst_elempack == 4 ? 1 : 2;
  3628. int cast_type_from_index;
  3629. if (src.elembits() == 32)
  3630. {
  3631. cast_type_from_index = 0;
  3632. }
  3633. else if (src.elembits() == 16)
  3634. {
  3635. cast_type_from_index = 1;
  3636. }
  3637. else // if (src.elembits() == 8)
  3638. {
  3639. cast_type_from_index = 3;
  3640. }
  3641. int cast_type_to_index = cast_type_to ? cast_type_to - 1 : cast_type_from_index;
  3642. // NCNN_LOGE("convert_packing b2b %d %d %d", cast_type_from_index, cast_type_to_index, packing_type_to_index);
  3643. if ((cast_type_from_index == 0 || cast_type_from_index == 1) && (cast_type_to_index == 2 || cast_type_to_index == 3))
  3644. {
  3645. NCNN_LOGE("convert_packing from fp32/fp16 to int32/int8 is not supported");
  3646. return;
  3647. }
  3648. if ((cast_type_from_index == 2 || cast_type_from_index == 3) && (cast_type_to_index == 0 || cast_type_to_index == 1))
  3649. {
  3650. NCNN_LOGE("convert_packing from int32/int8 to fp32/fp16 is not supported");
  3651. return;
  3652. }
  3653. Option opt2 = opt;
  3654. opt2.use_fp16_packed = (cast_type_from_index == 1 || cast_type_to_index == 1);
  3655. opt2.use_fp16_storage = (cast_type_from_index == 1 || cast_type_to_index == 1) && info.support_fp16_storage();
  3656. opt2.use_int8_packed = (cast_type_from_index == 3 || cast_type_to_index == 3);
  3657. opt2.use_int8_storage = (cast_type_from_index == 3 || cast_type_to_index == 3) && info.support_int8_storage();
  3658. const ncnn::Layer* uop = d->get_utility_operator(cast_type_from_index, cast_type_to_index, packing_type_to_index);
  3659. uop->forward(src, dst, cmd, opt2);
  3660. }
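// Editor's sketch, not part of ncnn: the index bookkeeping in convert_packing, pulled
// out into small functions. The meaning of the cast indices (0 fp32, 1 fp16, 2 int32,
// 3 int8) is inferred from the two error messages above. All function names are
// hypothetical.

```cpp
#include <cassert>

// Source cast index from element width in bits: 32 -> 0, 16 -> 1, anything else -> 3.
static int cast_index_from_elembits(int elembits)
{
    if (elembits == 32) return 0;
    if (elembits == 16) return 1;
    return 3;
}

// cast_type_to == 0 means "keep the source type"; otherwise it is a 1-based index.
static int resolve_cast_to_index(int cast_type_to, int from_index)
{
    return cast_type_to ? cast_type_to - 1 : from_index;
}

// Conversions between the float pair (0, 1) and the int pair (2, 3) are rejected.
static bool is_supported_cast(int from_index, int to_index)
{
    const bool from_float = (from_index == 0 || from_index == 1);
    const bool to_float = (to_index == 0 || to_index == 1);
    return from_float == to_float;
}

// Destination packing index: elempack 1 -> 0, 4 -> 1, anything else -> 2.
static int packing_index(int dst_elempack)
{
    return dst_elempack == 1 ? 0 : dst_elempack == 4 ? 1 : 2;
}
```

// The three resulting indices are what the function passes to get_utility_operator.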

int VulkanDevice::init_device_extension()
{
    if (info.support_VK_KHR_bind_memory2())
    {
        vkBindBufferMemory2KHR = (PFN_vkBindBufferMemory2KHR)vkGetDeviceProcAddr(d->device, "vkBindBufferMemory2KHR");
        vkBindImageMemory2KHR = (PFN_vkBindImageMemory2KHR)vkGetDeviceProcAddr(d->device, "vkBindImageMemory2KHR");
    }

    if (info.support_VK_KHR_buffer_device_address())
    {
        vkGetBufferDeviceAddressKHR = (PFN_vkGetBufferDeviceAddressKHR)vkGetDeviceProcAddr(d->device, "vkGetBufferDeviceAddressKHR");
        vkGetBufferOpaqueCaptureAddressKHR = (PFN_vkGetBufferOpaqueCaptureAddressKHR)vkGetDeviceProcAddr(d->device, "vkGetBufferOpaqueCaptureAddressKHR");
        vkGetDeviceMemoryOpaqueCaptureAddressKHR = (PFN_vkGetDeviceMemoryOpaqueCaptureAddressKHR)vkGetDeviceProcAddr(d->device, "vkGetDeviceMemoryOpaqueCaptureAddressKHR");
    }

    if (info.support_VK_KHR_descriptor_update_template())
    {
        vkCreateDescriptorUpdateTemplateKHR = (PFN_vkCreateDescriptorUpdateTemplateKHR)vkGetDeviceProcAddr(d->device, "vkCreateDescriptorUpdateTemplateKHR");
        vkDestroyDescriptorUpdateTemplateKHR = (PFN_vkDestroyDescriptorUpdateTemplateKHR)vkGetDeviceProcAddr(d->device, "vkDestroyDescriptorUpdateTemplateKHR");
        vkUpdateDescriptorSetWithTemplateKHR = (PFN_vkUpdateDescriptorSetWithTemplateKHR)vkGetDeviceProcAddr(d->device, "vkUpdateDescriptorSetWithTemplateKHR");
    }

    if (info.support_VK_KHR_get_memory_requirements2())
    {
        vkGetImageMemoryRequirements2KHR = (PFN_vkGetImageMemoryRequirements2KHR)vkGetDeviceProcAddr(d->device, "vkGetImageMemoryRequirements2KHR");
        vkGetBufferMemoryRequirements2KHR = (PFN_vkGetBufferMemoryRequirements2KHR)vkGetDeviceProcAddr(d->device, "vkGetBufferMemoryRequirements2KHR");
    }

    if (info.support_VK_KHR_maintenance1())
    {
        vkTrimCommandPoolKHR = (PFN_vkTrimCommandPoolKHR)vkGetDeviceProcAddr(d->device, "vkTrimCommandPoolKHR");
    }

    if (info.support_VK_KHR_maintenance3())
    {
        vkGetDescriptorSetLayoutSupportKHR = (PFN_vkGetDescriptorSetLayoutSupportKHR)vkGetDeviceProcAddr(d->device, "vkGetDescriptorSetLayoutSupportKHR");
    }

    if (info.support_VK_KHR_push_descriptor())
    {
        if (info.support_VK_KHR_descriptor_update_template())
        {
            vkCmdPushDescriptorSetWithTemplateKHR = (PFN_vkCmdPushDescriptorSetWithTemplateKHR)vkGetDeviceProcAddr(d->device, "vkCmdPushDescriptorSetWithTemplateKHR");
        }

        vkCmdPushDescriptorSetKHR = (PFN_vkCmdPushDescriptorSetKHR)vkGetDeviceProcAddr(d->device, "vkCmdPushDescriptorSetKHR");
    }

    if (info.support_VK_KHR_sampler_ycbcr_conversion())
    {
        vkCreateSamplerYcbcrConversionKHR = (PFN_vkCreateSamplerYcbcrConversionKHR)vkGetDeviceProcAddr(d->device, "vkCreateSamplerYcbcrConversionKHR");
        vkDestroySamplerYcbcrConversionKHR = (PFN_vkDestroySamplerYcbcrConversionKHR)vkGetDeviceProcAddr(d->device, "vkDestroySamplerYcbcrConversionKHR");
    }

    if (info.support_VK_KHR_swapchain())
    {
        vkCreateSwapchainKHR = (PFN_vkCreateSwapchainKHR)vkGetDeviceProcAddr(d->device, "vkCreateSwapchainKHR");
        vkDestroySwapchainKHR = (PFN_vkDestroySwapchainKHR)vkGetDeviceProcAddr(d->device, "vkDestroySwapchainKHR");
        vkGetSwapchainImagesKHR = (PFN_vkGetSwapchainImagesKHR)vkGetDeviceProcAddr(d->device, "vkGetSwapchainImagesKHR");
        vkAcquireNextImageKHR = (PFN_vkAcquireNextImageKHR)vkGetDeviceProcAddr(d->device, "vkAcquireNextImageKHR");
        vkQueuePresentKHR = (PFN_vkQueuePresentKHR)vkGetDeviceProcAddr(d->device, "vkQueuePresentKHR");
    }

    if (info.support_VK_EXT_buffer_device_address())
    {
        vkGetBufferDeviceAddressEXT = (PFN_vkGetBufferDeviceAddressEXT)vkGetDeviceProcAddr(d->device, "vkGetBufferDeviceAddressEXT");
    }

#if __ANDROID_API__ >= 26
    if (info.support_VK_ANDROID_external_memory_android_hardware_buffer())
    {
        vkGetAndroidHardwareBufferPropertiesANDROID = (PFN_vkGetAndroidHardwareBufferPropertiesANDROID)vkGetDeviceProcAddr(d->device, "vkGetAndroidHardwareBufferPropertiesANDROID");
        vkGetMemoryAndroidHardwareBufferANDROID = (PFN_vkGetMemoryAndroidHardwareBufferANDROID)vkGetDeviceProcAddr(d->device, "vkGetMemoryAndroidHardwareBufferANDROID");
    }
#endif // __ANDROID_API__ >= 26

    if (info.support_VK_NV_cooperative_vector())
    {
        vkCmdConvertCooperativeVectorMatrixNV = (PFN_vkCmdConvertCooperativeVectorMatrixNV)vkGetDeviceProcAddr(d->device, "vkCmdConvertCooperativeVectorMatrixNV");
        vkConvertCooperativeVectorMatrixNV = (PFN_vkConvertCooperativeVectorMatrixNV)vkGetDeviceProcAddr(d->device, "vkConvertCooperativeVectorMatrixNV");
    }

    return 0;
}

VulkanDevice* get_gpu_device(int device_index)
{
    try_create_gpu_instance();

    if (device_index < 0 || device_index >= g_gpu_count)
        return 0;

    MutexLockGuard lock(g_default_vkdev_lock);

    if (!g_default_vkdev[device_index])
        g_default_vkdev[device_index] = new VulkanDevice(device_index);

    return g_default_vkdev[device_index];
}

static TBuiltInResource get_default_TBuiltInResource()
{
    TBuiltInResource resource;

    resource.maxLights = 32;
    resource.maxClipPlanes = 6;
    resource.maxTextureUnits = 32;
    resource.maxTextureCoords = 32;
    resource.maxVertexAttribs = 64;
    resource.maxVertexUniformComponents = 4096;
    resource.maxVaryingFloats = 64;
    resource.maxVertexTextureImageUnits = 32;
    resource.maxCombinedTextureImageUnits = 80;
    resource.maxTextureImageUnits = 32;
    resource.maxFragmentUniformComponents = 4096;
    resource.maxDrawBuffers = 32;
    resource.maxVertexUniformVectors = 128;
    resource.maxVaryingVectors = 8;
    resource.maxFragmentUniformVectors = 16;
    resource.maxVertexOutputVectors = 16;
    resource.maxFragmentInputVectors = 15;
    resource.minProgramTexelOffset = -8;
    resource.maxProgramTexelOffset = 7;
    resource.maxClipDistances = 8;
    resource.maxComputeWorkGroupCountX = 65535;
    resource.maxComputeWorkGroupCountY = 65535;
    resource.maxComputeWorkGroupCountZ = 65535;
    resource.maxComputeWorkGroupSizeX = 1024;
    resource.maxComputeWorkGroupSizeY = 1024;
    resource.maxComputeWorkGroupSizeZ = 64;
    resource.maxComputeUniformComponents = 1024;
    resource.maxComputeTextureImageUnits = 16;
    resource.maxComputeImageUniforms = 8;
    resource.maxComputeAtomicCounters = 8;
    resource.maxComputeAtomicCounterBuffers = 1;
    resource.maxVaryingComponents = 60;
    resource.maxVertexOutputComponents = 64;
    resource.maxGeometryInputComponents = 64;
    resource.maxGeometryOutputComponents = 128;
    resource.maxFragmentInputComponents = 128;
    resource.maxImageUnits = 8;
    resource.maxCombinedImageUnitsAndFragmentOutputs = 8;
    resource.maxCombinedShaderOutputResources = 8;
    resource.maxImageSamples = 0;
    resource.maxVertexImageUniforms = 0;
    resource.maxTessControlImageUniforms = 0;
    resource.maxTessEvaluationImageUniforms = 0;
    resource.maxGeometryImageUniforms = 0;
    resource.maxFragmentImageUniforms = 8;
    resource.maxCombinedImageUniforms = 8;
    resource.maxGeometryTextureImageUnits = 16;
    resource.maxGeometryOutputVertices = 256;
    resource.maxGeometryTotalOutputComponents = 1024;
    resource.maxGeometryUniformComponents = 1024;
    resource.maxGeometryVaryingComponents = 64;
    resource.maxTessControlInputComponents = 128;
    resource.maxTessControlOutputComponents = 128;
    resource.maxTessControlTextureImageUnits = 16;
    resource.maxTessControlUniformComponents = 1024;
    resource.maxTessControlTotalOutputComponents = 4096;
    resource.maxTessEvaluationInputComponents = 128;
    resource.maxTessEvaluationOutputComponents = 128;
    resource.maxTessEvaluationTextureImageUnits = 16;
    resource.maxTessEvaluationUniformComponents = 1024;
    resource.maxTessPatchComponents = 120;
    resource.maxPatchVertices = 32;
    resource.maxTessGenLevel = 64;
    resource.maxViewports = 16;
    resource.maxVertexAtomicCounters = 0;
    resource.maxTessControlAtomicCounters = 0;
    resource.maxTessEvaluationAtomicCounters = 0;
    resource.maxGeometryAtomicCounters = 0;
    resource.maxFragmentAtomicCounters = 8;
    resource.maxCombinedAtomicCounters = 8;
    resource.maxAtomicCounterBindings = 1;
    resource.maxVertexAtomicCounterBuffers = 0;
    resource.maxTessControlAtomicCounterBuffers = 0;
    resource.maxTessEvaluationAtomicCounterBuffers = 0;
    resource.maxGeometryAtomicCounterBuffers = 0;
    resource.maxFragmentAtomicCounterBuffers = 1;
    resource.maxCombinedAtomicCounterBuffers = 1;
    resource.maxAtomicCounterBufferSize = 16384;
    resource.maxTransformFeedbackBuffers = 4;
    resource.maxTransformFeedbackInterleavedComponents = 64;
    resource.maxCullDistances = 8;
    resource.maxCombinedClipAndCullDistances = 8;
    resource.maxSamples = 4;
    resource.maxMeshOutputVerticesNV = 256;
    resource.maxMeshOutputPrimitivesNV = 512;
    resource.maxMeshWorkGroupSizeX_NV = 32;
    resource.maxMeshWorkGroupSizeY_NV = 1;
    resource.maxMeshWorkGroupSizeZ_NV = 1;
    resource.maxTaskWorkGroupSizeX_NV = 32;
    resource.maxTaskWorkGroupSizeY_NV = 1;
    resource.maxTaskWorkGroupSizeZ_NV = 1;
    resource.maxMeshViewCountNV = 4;

    // TODO compile-time glslang version check
    // resource.maxDualSourceDrawBuffersEXT = 1;

    resource.limits.nonInductiveForLoops = 1;
    resource.limits.whileLoops = 1;
    resource.limits.doWhileLoops = 1;
    resource.limits.generalUniformIndexing = 1;
    resource.limits.generalAttributeMatrixVectorIndexing = 1;
    resource.limits.generalVaryingIndexing = 1;
    resource.limits.generalSamplerIndexing = 1;
    resource.limits.generalVariableIndexing = 1;
    resource.limits.generalConstantMatrixVectorIndexing = 1;

    return resource;
}

class VulkanShaderIncluder : public glslang::TShader::Includer
{
public:
    virtual glslang::TShader::Includer::IncludeResult* includeLocal(const char* headerName, const char* /*includerName*/, size_t /*inclusionDepth*/)
    {
        if (strcmp(headerName, "vulkan_activation.comp") == 0)
        {
            const char* const headerData = vulkan_activation_comp_data;
            const size_t headerLength = sizeof(vulkan_activation_comp_data);
            glslang::TShader::Includer::IncludeResult* r = new glslang::TShader::Includer::IncludeResult(headerName, headerData, headerLength, 0);
            return r;
        }

        return 0;
    }

    virtual void releaseInclude(glslang::TShader::Includer::IncludeResult* r)
    {
        delete r;
    }
};

class DefinitionCollector
{
public:
    template<typename T>
    void append(const char* key, T def)
    {
        definitions.push_back(std::make_pair(key, def));
    }

public:
    struct typed_value
    {
        typed_value(const char* _s)
            : type(0), s(_s)
        {
        }
        typed_value(uint8_t _u8)
            : type(1), u8(_u8)
        {
        }
        typed_value(uint32_t _u32)
            : type(2), u32(_u32)
        {
        }
        typed_value(int32_t _i32)
            : type(3), i32(_i32)
        {
        }
        typed_value(uint64_t _u64)
            : type(4), u64(_u64)
        {
        }
        typed_value(float _f32)
            : type(5), f32(_f32)
        {
        }

        int type;
        union
        {
            const char* s;
            uint8_t u8;
            uint32_t u32;
            int32_t i32;
            uint64_t u64;
            float f32;
        };
    };

    std::vector<std::pair<const char*, typed_value> > definitions;
};

int compile_spirv_module(const char* comp_string, const Option& opt, std::vector<uint32_t>& spirv)
{
    // strlen() already excludes the tail '\0', so no further -1 is needed
    int length = (int)strlen(comp_string);
    return compile_spirv_module(comp_string, length, opt, spirv);
}

int compile_spirv_module(const char* comp_data, int comp_data_size, const Option& opt, std::vector<uint32_t>& spirv)
{
    DefinitionCollector custom_defines;
    DefinitionCollector device_defines;

    if (opt.use_fp16_storage)
    {
        custom_defines.append("sfp", "float16_t");
        custom_defines.append("sfpvec2", "f16vec2");
        custom_defines.append("sfpvec4", "f16vec4");

        if (opt.use_fp16_arithmetic)
        {
            custom_defines.append("sfpvec8", "f16mat2x4");
            custom_defines.append("sfpmat4", "f16mat4");
        }
    }
    else if (opt.use_fp16_packed)
    {
        custom_defines.append("sfp", "uint");
        custom_defines.append("sfpvec2", "uint");
        custom_defines.append("sfpvec4", "uvec2");
        custom_defines.append("sfpvec8", "uvec4");
    }
    else
    {
        custom_defines.append("sfp", "float");
        custom_defines.append("sfpvec2", "vec2");
        custom_defines.append("sfpvec4", "vec4");
        custom_defines.append("sfpvec8", "mat2x4");
        custom_defines.append("sfpmat4", "mat4");
    }

    if (opt.use_fp16_arithmetic)
    {
        custom_defines.append("afp", "float16_t");
        custom_defines.append("afpvec2", "f16vec2");
        custom_defines.append("afpvec4", "f16vec4");
        custom_defines.append("afpvec8", "f16mat2x4");
        custom_defines.append("afpmat4", "f16mat4");
    }
    else
    {
        custom_defines.append("afp", "float");
        custom_defines.append("afpvec2", "vec2");
        custom_defines.append("afpvec4", "vec4");
        custom_defines.append("afpvec8", "mat2x4");
        custom_defines.append("afpmat4", "mat4");
    }

    if (opt.use_fp16_storage && opt.use_fp16_uniform && opt.use_fp16_arithmetic)
    {
        custom_defines.append("lfp", "float16_t");
        custom_defines.append("lfpvec4", "f16vec4");
    }
    else if (opt.use_fp16_storage && opt.use_fp16_arithmetic)
    {
        custom_defines.append("lfp", "float");
        custom_defines.append("lfpvec4", "uint64_t");
    }
    else if (opt.use_fp16_storage || opt.use_fp16_packed)
    {
        custom_defines.append("lfp", "float");
        custom_defines.append("lfpvec4", "uvec2");
    }
    else
    {
        custom_defines.append("lfp", "float");
        custom_defines.append("lfpvec4", "vec4");
    }

    if (opt.use_fp16_storage && opt.use_fp16_uniform && opt.use_fp16_arithmetic)
    {
        custom_defines.append("sfp2lfp(v)", "v");
        custom_defines.append("sfp2lfpvec4(v)", "v");

        custom_defines.append("lfp2afp(v)", "v");
        custom_defines.append("lfp2afpvec4(v)", "v");
    }
    else if (opt.use_fp16_storage && opt.use_fp16_arithmetic)
    {
        custom_defines.append("sfp2lfp(v)", "float(v)");
        custom_defines.append("sfp2lfpvec4(v)", "pack64(halfBitsToUInt16(v))");

        custom_defines.append("lfp2afp(v)", "float16_t(v)");
        custom_defines.append("lfp2afpvec4(v)", "int16BitsToHalf(unpack16(v))");
    }
    else if (opt.use_fp16_packed && opt.use_fp16_arithmetic)
    {
        custom_defines.append("sfp2lfp(v)", "v");
        custom_defines.append("sfp2lfpvec4(v)", "v");

        custom_defines.append("lfp2afp(v)", "float16_t(v)");
        custom_defines.append("lfp2afpvec4(v)", "f16vec4(unpackFloat2x16(v.x),unpackFloat2x16(v.y))");
    }
    else if (opt.use_fp16_storage)
    {
        custom_defines.append("sfp2lfp(v)", "float(v)");
        custom_defines.append("sfp2lfpvec4(v)", "uvec2(packHalf2x16(vec4(v).rg),packHalf2x16(vec4(v).ba))");

        custom_defines.append("lfp2afp(v)", "v");
        custom_defines.append("lfp2afpvec4(v)", "vec4(unpackHalf2x16(v.x),unpackHalf2x16(v.y))");
    }
    else if (opt.use_fp16_packed)
    {
        custom_defines.append("sfp2lfp(v)", "v");
        custom_defines.append("sfp2lfpvec4(v)", "v");

        custom_defines.append("lfp2afp(v)", "v");
        custom_defines.append("lfp2afpvec4(v)", "vec4(unpackHalf2x16(v.x),unpackHalf2x16(v.y))");
    }
    else
    {
        custom_defines.append("sfp2lfp(v)", "v");
        custom_defines.append("sfp2lfpvec4(v)", "v");

        custom_defines.append("lfp2afp(v)", "v");
        custom_defines.append("lfp2afpvec4(v)", "v");
    }

    if (opt.use_fp16_storage && opt.use_fp16_arithmetic)
    {
        custom_defines.append("buffer_ld1(buf,i)", "buf[i]");
        custom_defines.append("buffer_st1(buf,i,v)", "{buf[i]=v;}");
        custom_defines.append("buffer_cp1(buf,i,sbuf,si)", "{buf[i]=sbuf[si];}");
        custom_defines.append("buffer_cp1to4(buf,i,sbuf,si4)", "{buf[i]=f16vec4(sbuf[si4.r],sbuf[si4.g],sbuf[si4.b],sbuf[si4.a]);}");
        custom_defines.append("buffer_cp1to8(buf,i,sbuf,si4,sii4)", "{buf[i]=f16mat2x4(sbuf[si4.r],sbuf[si4.g],sbuf[si4.b],sbuf[si4.a],sbuf[sii4.r],sbuf[sii4.g],sbuf[sii4.b],sbuf[sii4.a]);}");
        custom_defines.append("buffer_ld2(buf,i)", "buf[i]");
        custom_defines.append("buffer_st2(buf,i,v)", "{buf[i]=v;}");
        custom_defines.append("buffer_cp2(buf,i,sbuf,si)", "{buf[i]=sbuf[si];}");
        custom_defines.append("buffer_ld4(buf,i)", "buf[i]");
        custom_defines.append("buffer_st4(buf,i,v)", "{buf[i]=v;}");
        custom_defines.append("buffer_cp4(buf,i,sbuf,si)", "{buf[i]=sbuf[si];}");
        custom_defines.append("buffer_cp4to1(buf,i4,sbuf,si)", "{buf[i4.r]=sbuf[si].r;buf[i4.g]=sbuf[si].g;buf[i4.b]=sbuf[si].b;buf[i4.a]=sbuf[si].a;}");
        custom_defines.append("buffer_cp4to8(buf,i,sbuf,si2)", "{buf[i]=f16mat2x4(sbuf[si2.r],sbuf[si2.g]);}");
        custom_defines.append("buffer_ld8(buf,i)", "buf[i]");
        custom_defines.append("buffer_st8(buf,i,v)", "{buf[i]=v;}");
        custom_defines.append("buffer_cp8(buf,i,sbuf,si)", "{buf[i]=sbuf[si];}");
        custom_defines.append("buffer_cp8to1(buf,i4,ii4,sbuf,si)", "{f16mat2x4 _v=sbuf[si]; buf[i4.r]=_v[0].r;buf[i4.g]=_v[0].g;buf[i4.b]=_v[0].b;buf[i4.a]=_v[0].a; buf[ii4.r]=_v[1].r;buf[ii4.g]=_v[1].g;buf[ii4.b]=_v[1].b;buf[ii4.a]=_v[1].a;}");
        custom_defines.append("buffer_cp8to4(buf,i2,sbuf,si)", "{f16mat2x4 _v=sbuf[si]; buf[i2.r]=_v[0];buf[i2.g]=_v[1];}");
        custom_defines.append("sfp2afpmat4(v)", "v");
        custom_defines.append("afp2sfpmat4(v)", "v");
    }
    else if (opt.use_fp16_packed && opt.use_fp16_arithmetic)
    {
        // custom_defines.append("buffer_ld1(buf,i)", "float16_t(buf[i])");
        custom_defines.append("buffer_ld1(buf,i)", "float16_t(unpackHalf2x16(buf[(i)/2])[(i)%2])");
        // custom_defines.append("buffer_st1(buf,i,v)", "{buf[i]=float(v);}");
        custom_defines.append("buffer_st1(buf,i,v)", "{uint _i=uint(i);uint _id2=_i/2;uint _im2=_i%2;float _vs=float(v);uint _old_v, _new_v;do{_old_v=atomicCompSwap(buf[_id2],0,0);vec2 _v=unpackHalf2x16(_old_v);_v[_im2]=_vs;_new_v=packHalf2x16(_v);} while(atomicCompSwap(buf[_id2],_old_v,_new_v)!=_old_v);}");
        // custom_defines.append("buffer_cp1(buf,i,sbuf,si)", "{buf[i]=sbuf[si];}");
        custom_defines.append("buffer_cp1(buf,i,sbuf,si)", "{uint _i=uint(i);uint _id2=_i/2;uint _im2=_i%2;uint _si=uint(si);uint _sid2=_si/2;uint _sim2=_si%2;float v=unpackHalf2x16(sbuf[_sid2])[_sim2];uint _old_v, _new_v;do{_old_v=atomicCompSwap(buf[_id2],0,0);vec2 _v=unpackHalf2x16(_old_v);_v[_im2]=v;_new_v=packHalf2x16(_v);} while(atomicCompSwap(buf[_id2],_old_v,_new_v)!=_old_v);}");
        // custom_defines.append("buffer_cp1to4(buf,i,sbuf,si4)", "{buf[i]=uvec2(packFloat2x16(f16vec2(sbuf[si4.r],sbuf[si4.g])),packFloat2x16(f16vec2(sbuf[si4.b],sbuf[si4.a])));}");
        custom_defines.append("buffer_cp1to4(buf,i,sbuf,si4)", "{uvec4 _si4d2=uvec4(si4)/2;uvec4 _si4m2=uvec4(si4)%2; buf[i]=uvec2(packHalf2x16(vec2(unpackHalf2x16(sbuf[_si4d2.r])[_si4m2.r],unpackHalf2x16(sbuf[_si4d2.g])[_si4m2.g])),packHalf2x16(vec2(unpackHalf2x16(sbuf[_si4d2.b])[_si4m2.b],unpackHalf2x16(sbuf[_si4d2.a])[_si4m2.a])));}");
        // custom_defines.append("buffer_cp1to8(buf,i,sbuf,si4,sii4)", "{buf[i]=uvec4(packFloat2x16(f16vec2(sbuf[si4.r],sbuf[si4.g])),packFloat2x16(f16vec2(sbuf[si4.b],sbuf[si4.a])),packFloat2x16(f16vec2(sbuf[sii4.r],sbuf[sii4.g])),packFloat2x16(f16vec2(sbuf[sii4.b],sbuf[sii4.a])));}");
        custom_defines.append("buffer_cp1to8(buf,i,sbuf,si4,sii4)", "{uvec4 _si4d2=uvec4(si4)/2;uvec4 _sii4d2=uvec4(sii4)/2;uvec4 _si4m2=uvec4(si4)%2;uvec4 _sii4m2=uvec4(sii4)%2; buf[i]=uvec4(packHalf2x16(vec2(unpackHalf2x16(sbuf[_si4d2.r])[_si4m2.r],unpackHalf2x16(sbuf[_si4d2.g])[_si4m2.g])),packHalf2x16(vec2(unpackHalf2x16(sbuf[_si4d2.b])[_si4m2.b],unpackHalf2x16(sbuf[_si4d2.a])[_si4m2.a])),packHalf2x16(vec2(unpackHalf2x16(sbuf[_sii4d2.r])[_sii4m2.r],unpackHalf2x16(sbuf[_sii4d2.g])[_sii4m2.g])),packHalf2x16(vec2(unpackHalf2x16(sbuf[_sii4d2.b])[_sii4m2.b],unpackHalf2x16(sbuf[_sii4d2.a])[_sii4m2.a])));}");
        custom_defines.append("buffer_ld2(buf,i)", "unpackFloat2x16(buf[i])");
        custom_defines.append("buffer_st2(buf,i,v)", "{buf[i]=packFloat2x16(v);}");
        custom_defines.append("buffer_cp2(buf,i,sbuf,si)", "{buf[i]=sbuf[si];}");
        custom_defines.append("buffer_ld4(buf,i)", "f16vec4(unpackFloat2x16(buf[i].x),unpackFloat2x16(buf[i].y))");
        custom_defines.append("buffer_st4(buf,i,v)", "{buf[i]=uvec2(packFloat2x16(v.rg),packFloat2x16(v.ba));}");
        custom_defines.append("buffer_cp4(buf,i,sbuf,si)", "{buf[i]=sbuf[si];}");
        // custom_defines.append("buffer_cp4to1(buf,i4,sbuf,si)", "{uvec2 _v=sbuf[si]; f16vec2 _v0=unpackFloat2x16(_v.x);f16vec2 _v1=unpackFloat2x16(_v.y); buf[i4.r]=_v0.r;buf[i4.g]=_v0.g;buf[i4.b]=_v1.r;buf[i4.a]=_v1.g;}");
        custom_defines.append("buffer_cp4to1(buf,i4,sbuf,si)", "{uvec2 _v=sbuf[si]; vec2 _v0=unpackHalf2x16(_v.x);vec2 _v1=unpackHalf2x16(_v.y);buffer_st1(buf,i4.r,_v0.r);buffer_st1(buf,i4.g,_v0.g);buffer_st1(buf,i4.b,_v1.r);buffer_st1(buf,i4.a,_v1.g);}");
        custom_defines.append("buffer_cp4to8(buf,i,sbuf,si2)", "{buf[i]=uvec4(sbuf[si2.r],sbuf[si2.g]);}");
        custom_defines.append("buffer_ld8(buf,i)", "f16mat2x4(f16vec4(unpackFloat2x16(buf[i].r),unpackFloat2x16(buf[i].g)),f16vec4(unpackFloat2x16(buf[i].b),unpackFloat2x16(buf[i].a)))");
        custom_defines.append("buffer_st8(buf,i,v)", "{buf[i]=uvec4(uvec2(packFloat2x16(v[0].rg),packFloat2x16(v[0].ba)),uvec2(packFloat2x16(v[1].rg),packFloat2x16(v[1].ba)));}");
        custom_defines.append("buffer_cp8(buf,i,sbuf,si)", "{buf[i]=sbuf[si];}");
        // custom_defines.append("buffer_cp8to1(buf,i4,ii4,sbuf,si)", "{uvec4 _v=sbuf[si]; f16vec2 _v0=unpackFloat2x16(_v.r);f16vec2 _v1=unpackFloat2x16(_v.g);f16vec2 _v2=unpackFloat2x16(_v.b);f16vec2 _v3=unpackFloat2x16(_v.a); buf[i4.r]=_v0.r;buf[i4.g]=_v0.g;buf[i4.b]=_v1.r;buf[i4.a]=_v1.g; buf[ii4.r]=_v2.r;buf[ii4.g]=_v2.g;buf[ii4.b]=_v3.r;buf[ii4.a]=_v3.g;}");
        custom_defines.append("buffer_cp8to1(buf,i4,ii4,sbuf,si)", "{uvec4 _v=sbuf[si]; vec2 _v0=unpackHalf2x16(_v.r);vec2 _v1=unpackHalf2x16(_v.g);vec2 _v2=unpackHalf2x16(_v.b);vec2 _v3=unpackHalf2x16(_v.a);buffer_st1(buf,i4.r,_v0.r);buffer_st1(buf,i4.g,_v0.g);buffer_st1(buf,i4.b,_v1.r);buffer_st1(buf,i4.a,_v1.g);buffer_st1(buf,ii4.r,_v2.r);buffer_st1(buf,ii4.g,_v2.g);buffer_st1(buf,ii4.b,_v3.r);buffer_st1(buf,ii4.a,_v3.g);}");
        custom_defines.append("buffer_cp8to4(buf,i2,sbuf,si)", "{uvec4 _v=sbuf[si]; buf[i2.r]=_v.rg;buf[i2.g]=_v.ba;}");
    }
    else if (opt.use_fp16_storage)
    {
        custom_defines.append("buffer_ld1(buf,i)", "float(buf[i])");
        custom_defines.append("buffer_st1(buf,i,v)", "{buf[i]=float16_t(v);}");
        custom_defines.append("buffer_cp1(buf,i,sbuf,si)", "{buf[i]=sbuf[si];}");
        custom_defines.append("buffer_cp1to4(buf,i,sbuf,si4)", "{buf[i].r=sbuf[si4.r];buf[i].g=sbuf[si4.g];buf[i].b=sbuf[si4.b];buf[i].a=sbuf[si4.a];}");
        custom_defines.append("buffer_cp1to8(buf,i,sbuf,si4,sii4)", "{buf[i].abcd.r=sbuf[si4.r];buf[i].abcd.g=sbuf[si4.g];buf[i].abcd.b=sbuf[si4.b];buf[i].abcd.a=sbuf[si4.a];buf[i].efgh.r=sbuf[sii4.r];buf[i].efgh.g=sbuf[sii4.g];buf[i].efgh.b=sbuf[sii4.b];buf[i].efgh.a=sbuf[sii4.a];}");
        custom_defines.append("buffer_ld2(buf,i)", "vec2(buf[i])");
        custom_defines.append("buffer_st2(buf,i,v)", "{buf[i]=f16vec2(v);}");
        custom_defines.append("buffer_cp2(buf,i,sbuf,si)", "{buf[i]=sbuf[si];}");
        custom_defines.append("buffer_ld4(buf,i)", "vec4(buf[i])");
        custom_defines.append("buffer_st4(buf,i,v)", "{buf[i]=f16vec4(v);}");
        custom_defines.append("buffer_cp4(buf,i,sbuf,si)", "{buf[i]=sbuf[si];}");
        custom_defines.append("buffer_cp4to1(buf,i4,sbuf,si)", "{buf[i4.r]=sbuf[si].r;buf[i4.g]=sbuf[si].g;buf[i4.b]=sbuf[si].b;buf[i4.a]=sbuf[si].a;}");
        custom_defines.append("buffer_cp4to8(buf,i,sbuf,si2)", "{buf[i].abcd=sbuf[si2.r];buf[i].efgh=sbuf[si2.g];}");
        custom_defines.append("buffer_ld8(buf,i)", "mat2x4(vec4(buf[i].abcd),vec4(buf[i].efgh))");
        custom_defines.append("buffer_st8(buf,i,v)", "{buf[i].abcd=f16vec4(v[0]);buf[i].efgh=f16vec4(v[1]);}");
        custom_defines.append("buffer_cp8(buf,i,sbuf,si)", "{buf[i].abcd=sbuf[si].abcd;buf[i].efgh=sbuf[si].efgh;}");
        custom_defines.append("buffer_cp8to1(buf,i4,ii4,sbuf,si)", "{buf[i4.r]=sbuf[si].abcd.r;buf[i4.g]=sbuf[si].abcd.g;buf[i4.b]=sbuf[si].abcd.b;buf[i4.a]=sbuf[si].abcd.a; buf[ii4.r]=sbuf[si].efgh.r;buf[ii4.g]=sbuf[si].efgh.g;buf[ii4.b]=sbuf[si].efgh.b;buf[ii4.a]=sbuf[si].efgh.a;}");
        custom_defines.append("buffer_cp8to4(buf,i2,sbuf,si)", "{buf[i2.r]=sbuf[si].abcd;buf[i2.g]=sbuf[si].efgh;}");
    }
    else if (opt.use_fp16_packed)
    {
        // custom_defines.append("buffer_ld1(buf,i)", "buf[i]");
        custom_defines.append("buffer_ld1(buf,i)", "unpackHalf2x16(buf[(i)/2])[(i)%2]");
        // custom_defines.append("buffer_st1(buf,i,v)", "{buf[i]=v;}");
        custom_defines.append("buffer_st1(buf,i,v)", "{uint _i=uint(i);uint _id2=_i/2;uint _im2=_i%2;float _vs=float(v);uint _old_v, _new_v;do{_old_v=atomicCompSwap(buf[_id2],0,0);vec2 _v=unpackHalf2x16(_old_v);_v[_im2]=_vs;_new_v=packHalf2x16(_v);} while(atomicCompSwap(buf[_id2],_old_v,_new_v)!=_old_v);}");
        // custom_defines.append("buffer_cp1(buf,i,sbuf,si)", "{buf[i]=sbuf[si];}");
        custom_defines.append("buffer_cp1(buf,i,sbuf,si)", "{uint _i=uint(i);uint _id2=_i/2;uint _im2=_i%2;uint _si=uint(si);uint _sid2=_si/2;uint _sim2=_si%2;float v=unpackHalf2x16(sbuf[_sid2])[_sim2];uint _old_v, _new_v;do{_old_v=atomicCompSwap(buf[_id2],0,0);vec2 _v=unpackHalf2x16(_old_v);_v[_im2]=v;_new_v=packHalf2x16(_v);} while(atomicCompSwap(buf[_id2],_old_v,_new_v)!=_old_v);}");
        // custom_defines.append("buffer_cp1to4(buf,i,sbuf,si4)", "{buf[i]=uvec2(packHalf2x16(vec2(sbuf[si4.r],sbuf[si4.g])),packHalf2x16(vec2(sbuf[si4.b],sbuf[si4.a])));}");
        custom_defines.append("buffer_cp1to4(buf,i,sbuf,si4)", "{uvec4 _si4d2=uvec4(si4)/2;uvec4 _si4m2=uvec4(si4)%2; buf[i]=uvec2(packHalf2x16(vec2(unpackHalf2x16(sbuf[_si4d2.r])[_si4m2.r],unpackHalf2x16(sbuf[_si4d2.g])[_si4m2.g])),packHalf2x16(vec2(unpackHalf2x16(sbuf[_si4d2.b])[_si4m2.b],unpackHalf2x16(sbuf[_si4d2.a])[_si4m2.a])));}");
        // custom_defines.append("buffer_cp1to8(buf,i,sbuf,si4,sii4)", "{buf[i]=uvec4(packHalf2x16(vec2(sbuf[si4.r],sbuf[si4.g])),packHalf2x16(vec2(sbuf[si4.b],sbuf[si4.a])),packHalf2x16(vec2(sbuf[sii4.r],sbuf[sii4.g])),packHalf2x16(vec2(sbuf[sii4.b],sbuf[sii4.a])));}");
        custom_defines.append("buffer_cp1to8(buf,i,sbuf,si4,sii4)", "{uvec4 _si4d2=uvec4(si4)/2;uvec4 _sii4d2=uvec4(sii4)/2;uvec4 _si4m2=uvec4(si4)%2;uvec4 _sii4m2=uvec4(sii4)%2; buf[i]=uvec4(packHalf2x16(vec2(unpackHalf2x16(sbuf[_si4d2.r])[_si4m2.r],unpackHalf2x16(sbuf[_si4d2.g])[_si4m2.g])),packHalf2x16(vec2(unpackHalf2x16(sbuf[_si4d2.b])[_si4m2.b],unpackHalf2x16(sbuf[_si4d2.a])[_si4m2.a])),packHalf2x16(vec2(unpackHalf2x16(sbuf[_sii4d2.r])[_sii4m2.r],unpackHalf2x16(sbuf[_sii4d2.g])[_sii4m2.g])),packHalf2x16(vec2(unpackHalf2x16(sbuf[_sii4d2.b])[_sii4m2.b],unpackHalf2x16(sbuf[_sii4d2.a])[_sii4m2.a])));}");
        custom_defines.append("buffer_ld2(buf,i)", "unpackHalf2x16(buf[i])");
        custom_defines.append("buffer_st2(buf,i,v)", "{buf[i]=packHalf2x16(v);}");
        custom_defines.append("buffer_cp2(buf,i,sbuf,si)", "{buf[i]=sbuf[si];}");
        custom_defines.append("buffer_ld4(buf,i)", "vec4(unpackHalf2x16(buf[i].x),unpackHalf2x16(buf[i].y))");
        custom_defines.append("buffer_st4(buf,i,v)", "{buf[i]=uvec2(packHalf2x16(v.rg),packHalf2x16(v.ba));}");
        custom_defines.append("buffer_cp4(buf,i,sbuf,si)", "{buf[i]=sbuf[si];}");
        // custom_defines.append("buffer_cp4to1(buf,i4,sbuf,si)", "{uvec2 _v=sbuf[si]; vec2 _v0=unpackHalf2x16(_v.x);vec2 _v1=unpackHalf2x16(_v.y); buf[i4.r]=_v0.r;buf[i4.g]=_v0.g;buf[i4.b]=_v1.r;buf[i4.a]=_v1.g;}");
        custom_defines.append("buffer_cp4to1(buf,i4,sbuf,si)", "{uvec2 _v=sbuf[si]; vec2 _v0=unpackHalf2x16(_v.x);vec2 _v1=unpackHalf2x16(_v.y);buffer_st1(buf,i4.r,_v0.r);buffer_st1(buf,i4.g,_v0.g);buffer_st1(buf,i4.b,_v1.r);buffer_st1(buf,i4.a,_v1.g);}");
        custom_defines.append("buffer_cp4to8(buf,i,sbuf,si2)", "{buf[i]=uvec4(sbuf[si2.r],sbuf[si2.g]);}");
        custom_defines.append("buffer_ld8(buf,i)", "mat2x4(vec4(unpackHalf2x16(buf[i].r),unpackHalf2x16(buf[i].g)),vec4(unpackHalf2x16(buf[i].b),unpackHalf2x16(buf[i].a)))");
        custom_defines.append("buffer_st8(buf,i,v)", "{buf[i]=uvec4(uvec2(packHalf2x16(v[0].rg),packHalf2x16(v[0].ba)),uvec2(packHalf2x16(v[1].rg),packHalf2x16(v[1].ba)));}");
        custom_defines.append("buffer_cp8(buf,i,sbuf,si)", "{buf[i]=sbuf[si];}");
        // custom_defines.append("buffer_cp8to1(buf,i4,ii4,sbuf,si)", "{uvec4 _v=sbuf[si]; vec2 _v0=unpackHalf2x16(_v.r);vec2 _v1=unpackHalf2x16(_v.g);vec2 _v2=unpackHalf2x16(_v.b);vec2 _v3=unpackHalf2x16(_v.a); buf[i4.r]=_v0.r;buf[i4.g]=_v0.g;buf[i4.b]=_v1.r;buf[i4.a]=_v1.g; buf[ii4.r]=_v2.r;buf[ii4.g]=_v2.g;buf[ii4.b]=_v3.r;buf[ii4.a]=_v3.g;}");
        custom_defines.append("buffer_cp8to1(buf,i4,ii4,sbuf,si)", "{uvec4 _v=sbuf[si]; vec2 _v0=unpackHalf2x16(_v.r);vec2 _v1=unpackHalf2x16(_v.g);vec2 _v2=unpackHalf2x16(_v.b);vec2 _v3=unpackHalf2x16(_v.a);buffer_st1(buf,i4.r,_v0.r);buffer_st1(buf,i4.g,_v0.g);buffer_st1(buf,i4.b,_v1.r);buffer_st1(buf,i4.a,_v1.g);buffer_st1(buf,ii4.r,_v2.r);buffer_st1(buf,ii4.g,_v2.g);buffer_st1(buf,ii4.b,_v3.r);buffer_st1(buf,ii4.a,_v3.g);}");
        custom_defines.append("buffer_cp8to4(buf,i2,sbuf,si)", "{uvec4 _v=sbuf[si]; buf[i2.r]=_v.rg;buf[i2.g]=_v.ba;}");
    }
    else
    {
        custom_defines.append("buffer_ld1(buf,i)", "buf[i]");
        custom_defines.append("buffer_st1(buf,i,v)", "{buf[i]=v;}");
        custom_defines.append("buffer_cp1(buf,i,sbuf,si)", "{buf[i]=sbuf[si];}");
        custom_defines.append("buffer_cp1to4(buf,i,sbuf,si4)", "{buf[i]=vec4(sbuf[si4.r],sbuf[si4.g],sbuf[si4.b],sbuf[si4.a]);}");
        custom_defines.append("buffer_cp1to8(buf,i,sbuf,si4,sii4)", "{buf[i]=mat2x4(sbuf[si4.r],sbuf[si4.g],sbuf[si4.b],sbuf[si4.a],sbuf[sii4.r],sbuf[sii4.g],sbuf[sii4.b],sbuf[sii4.a]);}");
        custom_defines.append("buffer_ld2(buf,i)", "buf[i]");
        custom_defines.append("buffer_st2(buf,i,v)", "{buf[i]=v;}");
        custom_defines.append("buffer_cp2(buf,i,sbuf,si)", "{buf[i]=sbuf[si];}");
        custom_defines.append("buffer_ld4(buf,i)", "buf[i]");
        custom_defines.append("buffer_st4(buf,i,v)", "{buf[i]=v;}");
        custom_defines.append("buffer_cp4(buf,i,sbuf,si)", "{buf[i]=sbuf[si];}");
        custom_defines.append("buffer_cp4to1(buf,i4,sbuf,si)", "{vec4 _v=sbuf[si]; buf[i4.r]=_v.r;buf[i4.g]=_v.g;buf[i4.b]=_v.b;buf[i4.a]=_v.a;}");
        custom_defines.append("buffer_cp4to8(buf,i,sbuf,si2)", "{buf[i]=mat2x4(sbuf[si2.r],sbuf[si2.g]);}");
        custom_defines.append("buffer_ld8(buf,i)", "buf[i]");
        custom_defines.append("buffer_st8(buf,i,v)", "{buf[i]=v;}");
        custom_defines.append("buffer_cp8(buf,i,sbuf,si)", "{buf[i]=sbuf[si];}");
        custom_defines.append("buffer_cp8to1(buf,i4,ii4,sbuf,si)", "{mat2x4 _v=sbuf[si]; buf[i4.r]=_v[0].r;buf[i4.g]=_v[0].g;buf[i4.b]=_v[0].b;buf[i4.a]=_v[0].a; buf[ii4.r]=_v[1].r;buf[ii4.g]=_v[1].g;buf[ii4.b]=_v[1].b;buf[ii4.a]=_v[1].a;}");
        custom_defines.append("buffer_cp8to4(buf,i2,sbuf,si)", "{mat2x4 _v=sbuf[si]; buf[i2.r]=_v[0];buf[i2.g]=_v[1];}");
        custom_defines.append("sfp2afpmat4(v)", "v");
        custom_defines.append("afp2sfpmat4(v)", "v");
    }
  4154. if (opt.use_int8_storage)
  4155. {
  4156. custom_defines.append("sint8", "int8_t");
  4157. }
  4158. else if (opt.use_int8_packed)
  4159. {
  4160. custom_defines.append("sint8", "int");
  4161. }
  4162. else
  4163. {
  4164. custom_defines.append("sint8", "int");
  4165. }
  4166. custom_defines.append("sint8vec4", "int");
  4167. custom_defines.append("sint8vec8", "ivec2");
  4168. custom_defines.append("aint8", "int");
  4169. custom_defines.append("aint8vec4", "ivec4");
  4170. custom_defines.append("unpackInt4x8(v)", "ivec4((v<<24)>>24,(v<<16)>>24,(v<<8)>>24,v>>24)");
  4171. custom_defines.append("packInt4x8(v)", "int((uint(v.r)&0xFFu)|((uint(v.g)&0xFFu)<<8)|((uint(v.b)&0xFFu)<<16)|((uint(v.a)&0xFFu)<<24))");
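// Illustration only, not part of gpu.cpp: a host-side C++ sketch of what the
// unpackInt4x8 / packInt4x8 shader macros above compute. The names pack_int4x8
// and unpack_int4x8 are hypothetical, and the sketch assumes lane 0 lives in the
// low byte, as the shift amounts in the macros suggest.

```cpp
#include <array>
#include <cstdint>

// Pack four signed 8-bit lanes into one 32-bit word, lane 0 in the low byte.
inline uint32_t pack_int4x8(const std::array<int, 4>& v)
{
    return (static_cast<uint32_t>(v[0]) & 0xFFu)
           | ((static_cast<uint32_t>(v[1]) & 0xFFu) << 8)
           | ((static_cast<uint32_t>(v[2]) & 0xFFu) << 16)
           | ((static_cast<uint32_t>(v[3]) & 0xFFu) << 24);
}

// Unpack a 32-bit word into four sign-extended 8-bit lanes.
inline std::array<int, 4> unpack_int4x8(uint32_t w)
{
    std::array<int, 4> v;
    for (int k = 0; k < 4; k++)
    {
        // isolate byte k, then sign-extend it by hand (portable, no signed shifts)
        const int b = static_cast<int>((w >> (8 * k)) & 0xFFu);
        v[k] = b >= 128 ? b - 256 : b;
    }
    return v;
}
```

// The shader macro gets the same sign extension from an arithmetic right shift
// on a signed int; the sketch avoids that so it stays portable on the host.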
  4172. if (opt.use_int8_storage)
  4173. {
  4174. custom_defines.append("i8buffer_ld1(buf,i)", "int(buf[i])");
  4175. custom_defines.append("i8buffer_st1(buf,i,v)", "{buf[i]=int8_t(v);}");
  4176. custom_defines.append("i8buffer_cp1(buf,i,sbuf,si)", "{buf[i]=sbuf[si];}");
  4177. }
  4178. else
  4179. {
  4180. custom_defines.append("i8buffer_ld1(buf,i)", "int(((buf[(i)/4])<<(24-((i)%4)*8))>>24)");
  4181. custom_defines.append("i8buffer_st1(buf,i,v)", "{uint _i=uint(i);uint _id4=_i/4;uint _im4=_i%4;int _vs=int(v);int _old_v, _new_v;do{_old_v=atomicCompSwap(buf[_id4],0,0);ivec4 _v=unpackInt4x8(_old_v);_v[_im4]=_vs;_new_v=packInt4x8(_v);} while(atomicCompSwap(buf[_id4],_old_v,_new_v)!=_old_v);}");
  4182. custom_defines.append("i8buffer_cp1(buf,i,sbuf,si)", "{int _v=i8buffer_ld1(sbuf,si);i8buffer_st1(buf,i,_v);}");
  4183. }
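// Illustration only, not part of gpu.cpp: when 8-bit storage is unavailable, the
// i8buffer_st1 macro above rewrites one byte inside a 32-bit word with a
// compare-and-swap retry loop. A minimal host-side sketch of the same idea using
// std::atomic follows. The names i8_load and i8_store are hypothetical, and the
// buffer is assumed to be an array of 32-bit words holding four bytes each.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// Read signed byte i from a buffer of packed 32-bit words.
inline int i8_load(const std::atomic<uint32_t>* buf, size_t i)
{
    const uint32_t w = buf[i / 4].load();
    const int b = static_cast<int>((w >> ((i % 4) * 8)) & 0xFFu);
    return b >= 128 ? b - 256 : b; // sign-extend the byte
}

// Write signed byte i without disturbing the three neighbouring bytes.
inline void i8_store(std::atomic<uint32_t>* buf, size_t i, int v)
{
    const unsigned shift = static_cast<unsigned>(i % 4) * 8u;
    const uint32_t mask = 0xFFu << shift;
    const uint32_t bits = (static_cast<uint32_t>(v) & 0xFFu) << shift;
    std::atomic<uint32_t>& word = buf[i / 4];
    uint32_t old_word = word.load();
    while (!word.compare_exchange_weak(old_word, (old_word & ~mask) | bits))
    {
        // on failure old_word holds the current value; the new word is
        // recomputed from it and the exchange is retried
    }
}
```

// The retry loop matters when several invocations write different bytes of the
// same word at once: a plain read-modify-write could drop a neighbour's update.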
  4184. custom_defines.append("i8buffer_ld4(buf,i)", "unpackInt4x8(buf[i])");
  4185. custom_defines.append("i8buffer_st4(buf,i,v)", "{buf[i]=packInt4x8(v);}");
  4186. custom_defines.append("i8buffer_cp4(buf,i,sbuf,si)", "{buf[i]=sbuf[si];}");
  4187. custom_defines.append("i8buffer_ld8(buf,i)", "ivec8(unpackInt4x8(buf[i].r),unpackInt4x8(buf[i].g))");
  4188. custom_defines.append("i8buffer_st8(buf,i,v)", "{buf[i]=ivec2(packInt4x8(v.abcd),packInt4x8(v.efgh));}");
  4189. custom_defines.append("i8buffer_cp8(buf,i,sbuf,si)", "{buf[i]=sbuf[si];}");
  4190. custom_defines.append("psc(x)", "(x==0?p.x:x)");
  4191. if (opt.use_fp16_storage)
  4192. {
  4193. custom_defines.append("NCNN_fp16_storage", 1);
  4194. }
  4195. else if (opt.use_fp16_packed)
  4196. {
  4197. custom_defines.append("NCNN_fp16_packed", 1);
  4198. }
  4199. if (opt.use_fp16_uniform)
  4200. {
  4201. custom_defines.append("NCNN_fp16_uniform", 1);
  4202. }
  4203. if (opt.use_fp16_arithmetic)
  4204. {
  4205. custom_defines.append("NCNN_fp16_arithmetic", 1);
  4206. }
  4207. if (opt.use_int8_storage)
  4208. {
  4209. custom_defines.append("NCNN_int8_storage", 1);
  4210. }
  4211. else if (opt.use_int8_packed)
  4212. {
  4213. custom_defines.append("NCNN_int8_packed", 1);
  4214. }
  4215. if (opt.use_int8_uniform)
  4216. {
  4217. custom_defines.append("NCNN_int8_uniform", 1);
  4218. }
  4219. if (opt.use_int8_arithmetic)
  4220. {
  4221. custom_defines.append("NCNN_int8_arithmetic", 1);
  4222. }
  4223. if (opt.use_shader_local_memory)
  4224. {
  4225. custom_defines.append("NCNN_shader_local_memory", 1);
  4226. }
  4227. #if __APPLE__
  4228. custom_defines.append("NCNN_moltenvk", 1);
  4229. #endif
  4230. custom_defines.append("ncnn_glsl_version", 1);
  4231. bool support_shader_int64 = false;
  4232. // fill device macros
  4233. {
  4234. int device_index = opt.vulkan_device_index;
  4235. if (device_index < 0 || device_index >= get_gpu_count())
  4236. device_index = get_default_gpu_index();
  4237. const GpuInfo& info = get_gpu_info(device_index);
  4238. support_shader_int64 = info.physicalDevicefeatures().shaderInt64;
  4239. // pull in device extensions
  4240. {
  4241. const std::vector<VkExtensionProperties>& properties = info.deviceExtensionProperties();
  4242. for (size_t i = 0; i < properties.size(); i++)
  4243. {
  4244. const VkExtensionProperties& exp = properties[i];
  4245. device_defines.append(exp.extensionName, exp.specVersion);
  4246. }
  4247. }
  4248. #define DD_APPEND_FEATURE(X) device_defines.append(#X, features.X ? 1 : 0);
  4249. // pull in device features macros
  4250. {
  4251. const VkPhysicalDeviceFeatures& features = info.physicalDevicefeatures();
  4252. DD_APPEND_FEATURE(robustBufferAccess)
  4253. DD_APPEND_FEATURE(fullDrawIndexUint32)
  4254. DD_APPEND_FEATURE(imageCubeArray)
  4255. DD_APPEND_FEATURE(independentBlend)
  4256. DD_APPEND_FEATURE(geometryShader)
  4257. DD_APPEND_FEATURE(tessellationShader)
  4258. DD_APPEND_FEATURE(sampleRateShading)
  4259. DD_APPEND_FEATURE(dualSrcBlend)
  4260. DD_APPEND_FEATURE(logicOp)
  4261. DD_APPEND_FEATURE(multiDrawIndirect)
  4262. DD_APPEND_FEATURE(drawIndirectFirstInstance)
  4263. DD_APPEND_FEATURE(depthClamp)
  4264. DD_APPEND_FEATURE(depthBiasClamp)
  4265. DD_APPEND_FEATURE(fillModeNonSolid)
  4266. DD_APPEND_FEATURE(depthBounds)
  4267. DD_APPEND_FEATURE(wideLines)
  4268. DD_APPEND_FEATURE(largePoints)
  4269. DD_APPEND_FEATURE(alphaToOne)
  4270. DD_APPEND_FEATURE(multiViewport)
  4271. DD_APPEND_FEATURE(samplerAnisotropy)
  4272. DD_APPEND_FEATURE(textureCompressionETC2)
  4273. DD_APPEND_FEATURE(textureCompressionASTC_LDR)
  4274. DD_APPEND_FEATURE(textureCompressionBC)
  4275. DD_APPEND_FEATURE(occlusionQueryPrecise)
  4276. DD_APPEND_FEATURE(pipelineStatisticsQuery)
  4277. DD_APPEND_FEATURE(vertexPipelineStoresAndAtomics)
  4278. DD_APPEND_FEATURE(fragmentStoresAndAtomics)
  4279. DD_APPEND_FEATURE(shaderTessellationAndGeometryPointSize)
  4280. DD_APPEND_FEATURE(shaderImageGatherExtended)
  4281. DD_APPEND_FEATURE(shaderStorageImageExtendedFormats)
  4282. DD_APPEND_FEATURE(shaderStorageImageMultisample)
  4283. DD_APPEND_FEATURE(shaderStorageImageReadWithoutFormat)
  4284. DD_APPEND_FEATURE(shaderStorageImageWriteWithoutFormat)
  4285. DD_APPEND_FEATURE(shaderUniformBufferArrayDynamicIndexing)
  4286. DD_APPEND_FEATURE(shaderSampledImageArrayDynamicIndexing)
  4287. DD_APPEND_FEATURE(shaderStorageBufferArrayDynamicIndexing)
  4288. DD_APPEND_FEATURE(shaderStorageImageArrayDynamicIndexing)
  4289. DD_APPEND_FEATURE(shaderClipDistance)
  4290. DD_APPEND_FEATURE(shaderCullDistance)
  4291. DD_APPEND_FEATURE(shaderFloat64)
  4292. DD_APPEND_FEATURE(shaderInt64)
  4293. DD_APPEND_FEATURE(shaderInt16)
  4294. DD_APPEND_FEATURE(shaderResourceResidency)
  4295. DD_APPEND_FEATURE(shaderResourceMinLod)
  4296. DD_APPEND_FEATURE(sparseBinding)
  4297. DD_APPEND_FEATURE(sparseResidencyBuffer)
  4298. DD_APPEND_FEATURE(sparseResidencyImage2D)
  4299. DD_APPEND_FEATURE(sparseResidencyImage3D)
  4300. DD_APPEND_FEATURE(sparseResidency2Samples)
  4301. DD_APPEND_FEATURE(sparseResidency4Samples)
  4302. DD_APPEND_FEATURE(sparseResidency8Samples)
  4303. DD_APPEND_FEATURE(sparseResidency16Samples)
  4304. DD_APPEND_FEATURE(sparseResidencyAliased)
  4305. DD_APPEND_FEATURE(variableMultisampleRate)
  4306. DD_APPEND_FEATURE(inheritedQueries)
  4307. }
  4308. if (info.support_VK_KHR_8bit_storage())
  4309. {
  4310. const VkPhysicalDevice8BitStorageFeaturesKHR& features = info.query8BitStorageFeatures();
  4311. DD_APPEND_FEATURE(storageBuffer8BitAccess)
  4312. DD_APPEND_FEATURE(uniformAndStorageBuffer8BitAccess)
  4313. DD_APPEND_FEATURE(storagePushConstant8)
  4314. }
  4315. if (info.support_VK_KHR_16bit_storage())
  4316. {
  4317. const VkPhysicalDevice16BitStorageFeaturesKHR& features = info.query16BitStorageFeatures();
  4318. DD_APPEND_FEATURE(storageBuffer16BitAccess)
  4319. DD_APPEND_FEATURE(uniformAndStorageBuffer16BitAccess)
  4320. DD_APPEND_FEATURE(storagePushConstant16)
  4321. DD_APPEND_FEATURE(storageInputOutput16)
  4322. }
  4323. if (info.support_VK_KHR_shader_float16_int8())
  4324. {
  4325. const VkPhysicalDeviceFloat16Int8FeaturesKHR& features = info.queryFloat16Int8Features();
  4326. DD_APPEND_FEATURE(shaderFloat16)
  4327. DD_APPEND_FEATURE(shaderInt8)
  4328. }
  4329. if (info.support_VK_KHR_sampler_ycbcr_conversion())
  4330. {
  4331. const VkPhysicalDeviceSamplerYcbcrConversionFeaturesKHR& features = info.querySamplerYcbcrConversionFeatures();
  4332. DD_APPEND_FEATURE(samplerYcbcrConversion)
  4333. }
  4334. if (info.support_VK_KHR_cooperative_matrix())
  4335. {
  4336. const VkPhysicalDeviceCooperativeMatrixFeaturesKHR& features = info.queryCooperativeMatrixFeatures();
  4337. DD_APPEND_FEATURE(cooperativeMatrix)
  4338. DD_APPEND_FEATURE(cooperativeMatrixRobustBufferAccess)
  4339. }
  4340. else if (info.support_VK_NV_cooperative_matrix())
  4341. {
  4342. const VkPhysicalDeviceCooperativeMatrixFeaturesNV& features = info.queryCooperativeMatrixFeaturesNV();
  4343. DD_APPEND_FEATURE(cooperativeMatrix)
  4344. DD_APPEND_FEATURE(cooperativeMatrixRobustBufferAccess)
  4345. }
  4346. if (info.support_VK_NV_cooperative_matrix2())
  4347. {
  4348. const VkPhysicalDeviceCooperativeMatrix2FeaturesNV& features = info.queryCooperativeMatrix2FeaturesNV();
  4349. DD_APPEND_FEATURE(cooperativeMatrixWorkgroupScope)
  4350. DD_APPEND_FEATURE(cooperativeMatrixFlexibleDimensions)
  4351. DD_APPEND_FEATURE(cooperativeMatrixReductions)
  4352. DD_APPEND_FEATURE(cooperativeMatrixConversions)
  4353. DD_APPEND_FEATURE(cooperativeMatrixPerElementOperations)
  4354. DD_APPEND_FEATURE(cooperativeMatrixTensorAddressing)
  4355. DD_APPEND_FEATURE(cooperativeMatrixBlockLoads)
  4356. }
  4357. if (info.support_VK_NV_cooperative_vector())
  4358. {
  4359. const VkPhysicalDeviceCooperativeVectorFeaturesNV& features = info.queryCooperativeVectorFeaturesNV();
  4360. DD_APPEND_FEATURE(cooperativeVector)
  4361. DD_APPEND_FEATURE(cooperativeVectorTraining)
  4362. }
  4363. if (info.support_VK_EXT_subgroup_size_control())
  4364. {
  4365. const VkPhysicalDeviceSubgroupSizeControlFeaturesEXT& features = info.querySubgroupSizeControlFeatures();
  4366. DD_APPEND_FEATURE(subgroupSizeControl)
  4367. DD_APPEND_FEATURE(computeFullSubgroups)
  4368. }
  4369. if (info.support_VK_KHR_shader_bfloat16())
  4370. {
  4371. const VkPhysicalDeviceShaderBfloat16FeaturesKHR& features = info.queryShaderBfloat16Features();
  4372. DD_APPEND_FEATURE(shaderBFloat16Type)
  4373. DD_APPEND_FEATURE(shaderBFloat16DotProduct)
  4374. DD_APPEND_FEATURE(shaderBFloat16CooperativeMatrix)
  4375. }
  4376. if (info.support_VK_EXT_shader_float8())
  4377. {
  4378. const VkPhysicalDeviceShaderFloat8FeaturesEXT& features = info.queryShaderFloat8Features();
  4379. DD_APPEND_FEATURE(shaderFloat8)
  4380. DD_APPEND_FEATURE(shaderFloat8CooperativeMatrix)
  4381. }
  4382. if (info.support_VK_KHR_shader_float_controls2())
  4383. {
  4384. const VkPhysicalDeviceShaderFloatControls2FeaturesKHR& features = info.queryShaderFloatControls2Features();
  4385. DD_APPEND_FEATURE(shaderFloatControls2)
  4386. }
  4387. if (info.support_VK_KHR_shader_integer_dot_product())
  4388. {
  4389. const VkPhysicalDeviceShaderIntegerDotProductFeaturesKHR& features = info.queryShaderIntegerDotProductFeatures();
  4390. DD_APPEND_FEATURE(shaderIntegerDotProduct)
  4391. }
  4392. if (info.support_VK_KHR_shader_subgroup_rotate())
  4393. {
  4394. const VkPhysicalDeviceShaderSubgroupRotateFeaturesKHR& features = info.queryShaderSubgroupRotateFeatures();
  4395. DD_APPEND_FEATURE(shaderSubgroupRotate)
  4396. DD_APPEND_FEATURE(shaderSubgroupRotateClustered)
  4397. }
  4398. if (info.support_VK_EXT_shader_atomic_float())
  4399. {
  4400. const VkPhysicalDeviceShaderAtomicFloatFeaturesEXT& features = info.queryShaderAtomicFloatFeatures();
  4401. DD_APPEND_FEATURE(shaderBufferFloat32Atomics)
  4402. DD_APPEND_FEATURE(shaderBufferFloat32AtomicAdd)
  4403. DD_APPEND_FEATURE(shaderBufferFloat64Atomics)
  4404. DD_APPEND_FEATURE(shaderBufferFloat64AtomicAdd)
  4405. DD_APPEND_FEATURE(shaderSharedFloat32Atomics)
  4406. DD_APPEND_FEATURE(shaderSharedFloat32AtomicAdd)
  4407. DD_APPEND_FEATURE(shaderSharedFloat64Atomics)
  4408. DD_APPEND_FEATURE(shaderSharedFloat64AtomicAdd)
  4409. DD_APPEND_FEATURE(shaderImageFloat32Atomics)
  4410. DD_APPEND_FEATURE(shaderImageFloat32AtomicAdd)
  4411. DD_APPEND_FEATURE(sparseImageFloat32Atomics)
  4412. DD_APPEND_FEATURE(sparseImageFloat32AtomicAdd)
  4413. }
  4414. if (info.support_VK_EXT_shader_atomic_float2())
  4415. {
  4416. const VkPhysicalDeviceShaderAtomicFloat2FeaturesEXT& features = info.queryShaderAtomicFloat2Features();
  4417. DD_APPEND_FEATURE(shaderBufferFloat16Atomics)
  4418. DD_APPEND_FEATURE(shaderBufferFloat16AtomicAdd)
  4419. DD_APPEND_FEATURE(shaderBufferFloat16AtomicMinMax)
  4420. DD_APPEND_FEATURE(shaderBufferFloat32AtomicMinMax)
  4421. DD_APPEND_FEATURE(shaderBufferFloat64AtomicMinMax)
  4422. DD_APPEND_FEATURE(shaderSharedFloat16Atomics)
  4423. DD_APPEND_FEATURE(shaderSharedFloat16AtomicAdd)
  4424. DD_APPEND_FEATURE(shaderSharedFloat16AtomicMinMax)
  4425. DD_APPEND_FEATURE(shaderSharedFloat32AtomicMinMax)
  4426. DD_APPEND_FEATURE(shaderSharedFloat64AtomicMinMax)
  4427. DD_APPEND_FEATURE(shaderImageFloat32AtomicMinMax)
  4428. DD_APPEND_FEATURE(sparseImageFloat32AtomicMinMax)
  4429. }
  4430. if (info.support_VK_KHR_vulkan_memory_model())
  4431. {
  4432. const VkPhysicalDeviceVulkanMemoryModelFeaturesKHR& features = info.queryVulkanMemoryModelFeatures();
  4433. DD_APPEND_FEATURE(vulkanMemoryModel)
  4434. DD_APPEND_FEATURE(vulkanMemoryModelDeviceScope)
  4435. DD_APPEND_FEATURE(vulkanMemoryModelAvailabilityVisibilityChains)
  4436. }
  4437. #undef DD_APPEND_FEATURE
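// Illustration only, not part of gpu.cpp: DD_APPEND_FEATURE relies on the
// preprocessor stringification operator (#X) so that each field name is written
// once and serves as both the macro key and the struct member. A stripped-down
// sketch of the pattern, where ExampleFeatures, ExampleDefines and
// EXAMPLE_APPEND_FEATURE are hypothetical stand-ins for the Vulkan feature
// structs and the DefinitionCollector used above:

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a Vulkan features struct (VkBool32-like fields).
struct ExampleFeatures
{
    unsigned int shaderInt64;
    unsigned int shaderInt16;
};

typedef std::vector<std::pair<std::string, int> > ExampleDefines;

// #X turns the member name into a string literal; features.X reads the member.
#define EXAMPLE_APPEND_FEATURE(X) defines.push_back(std::make_pair(std::string(#X), features.X ? 1 : 0));

inline ExampleDefines collect_example_defines(const ExampleFeatures& features)
{
    ExampleDefines defines;
    EXAMPLE_APPEND_FEATURE(shaderInt64)
    EXAMPLE_APPEND_FEATURE(shaderInt16)
    return defines;
}

#undef EXAMPLE_APPEND_FEATURE
```

// As in the code above, the macro expects local variables with fixed names
// (here defines and features) to be in scope at each expansion site.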
  4438. #define DD_APPEND_PROPERTY(X) device_defines.append(#X, properties.X);
  4439. // pull in device properties macros
  4440. {
  4441. const VkPhysicalDeviceProperties& properties = info.physicalDeviceProperties();
  4442. DD_APPEND_PROPERTY(apiVersion)
  4443. DD_APPEND_PROPERTY(driverVersion)
  4444. DD_APPEND_PROPERTY(vendorID)
  4445. DD_APPEND_PROPERTY(deviceID)
  4446. DD_APPEND_PROPERTY(deviceType)
  4447. // DD_APPEND_PROPERTY(deviceName)
  4448. // DD_APPEND_PROPERTY(pipelineCacheUUID)
  4449. #define DD_APPEND_PROPERTY_LIMIT(X) device_defines.append(#X, properties.limits.X);
  4450. #define DD_APPEND_PROPERTY_LIMIT_2(X) \
  4451. device_defines.append(#X "_0", properties.limits.X[0]); \
  4452. device_defines.append(#X "_1", properties.limits.X[1]);
  4453. #define DD_APPEND_PROPERTY_LIMIT_3(X) \
  4454. device_defines.append(#X "_0", properties.limits.X[0]); \
  4455. device_defines.append(#X "_1", properties.limits.X[1]); \
  4456. device_defines.append(#X "_2", properties.limits.X[2]);
  4457. DD_APPEND_PROPERTY_LIMIT(maxImageDimension1D)
  4458. DD_APPEND_PROPERTY_LIMIT(maxImageDimension2D)
  4459. DD_APPEND_PROPERTY_LIMIT(maxImageDimension3D)
  4460. DD_APPEND_PROPERTY_LIMIT(maxImageDimensionCube)
  4461. DD_APPEND_PROPERTY_LIMIT(maxImageArrayLayers)
  4462. DD_APPEND_PROPERTY_LIMIT(maxTexelBufferElements)
  4463. DD_APPEND_PROPERTY_LIMIT(maxUniformBufferRange)
  4464. DD_APPEND_PROPERTY_LIMIT(maxStorageBufferRange)
  4465. DD_APPEND_PROPERTY_LIMIT(maxPushConstantsSize)
  4466. DD_APPEND_PROPERTY_LIMIT(maxMemoryAllocationCount)
  4467. DD_APPEND_PROPERTY_LIMIT(maxSamplerAllocationCount)
  4468. DD_APPEND_PROPERTY_LIMIT(bufferImageGranularity)
  4469. DD_APPEND_PROPERTY_LIMIT(sparseAddressSpaceSize)
  4470. DD_APPEND_PROPERTY_LIMIT(maxBoundDescriptorSets)
  4471. DD_APPEND_PROPERTY_LIMIT(maxPerStageDescriptorSamplers)
  4472. DD_APPEND_PROPERTY_LIMIT(maxPerStageDescriptorUniformBuffers)
  4473. DD_APPEND_PROPERTY_LIMIT(maxPerStageDescriptorStorageBuffers)
  4474. DD_APPEND_PROPERTY_LIMIT(maxPerStageDescriptorSampledImages)
  4475. DD_APPEND_PROPERTY_LIMIT(maxPerStageDescriptorStorageImages)
  4476. DD_APPEND_PROPERTY_LIMIT(maxPerStageDescriptorInputAttachments)
  4477. DD_APPEND_PROPERTY_LIMIT(maxPerStageResources)
  4478. DD_APPEND_PROPERTY_LIMIT(maxDescriptorSetSamplers)
  4479. DD_APPEND_PROPERTY_LIMIT(maxDescriptorSetUniformBuffers)
  4480. DD_APPEND_PROPERTY_LIMIT(maxDescriptorSetUniformBuffersDynamic)
  4481. DD_APPEND_PROPERTY_LIMIT(maxDescriptorSetStorageBuffers)
  4482. DD_APPEND_PROPERTY_LIMIT(maxDescriptorSetStorageBuffersDynamic)
  4483. DD_APPEND_PROPERTY_LIMIT(maxDescriptorSetSampledImages)
  4484. DD_APPEND_PROPERTY_LIMIT(maxDescriptorSetStorageImages)
  4485. DD_APPEND_PROPERTY_LIMIT(maxDescriptorSetInputAttachments)
  4486. DD_APPEND_PROPERTY_LIMIT(maxVertexInputAttributes)
  4487. DD_APPEND_PROPERTY_LIMIT(maxVertexInputBindings)
  4488. DD_APPEND_PROPERTY_LIMIT(maxVertexInputAttributeOffset)
  4489. DD_APPEND_PROPERTY_LIMIT(maxVertexInputBindingStride)
  4490. DD_APPEND_PROPERTY_LIMIT(maxVertexOutputComponents)
  4491. DD_APPEND_PROPERTY_LIMIT(maxTessellationGenerationLevel)
  4492. DD_APPEND_PROPERTY_LIMIT(maxTessellationPatchSize)
  4493. DD_APPEND_PROPERTY_LIMIT(maxTessellationControlPerVertexInputComponents)
  4494. DD_APPEND_PROPERTY_LIMIT(maxTessellationControlPerVertexOutputComponents)
  4495. DD_APPEND_PROPERTY_LIMIT(maxTessellationControlPerPatchOutputComponents)
  4496. DD_APPEND_PROPERTY_LIMIT(maxTessellationControlTotalOutputComponents)
  4497. DD_APPEND_PROPERTY_LIMIT(maxTessellationEvaluationInputComponents)
  4498. DD_APPEND_PROPERTY_LIMIT(maxTessellationEvaluationOutputComponents)
  4499. DD_APPEND_PROPERTY_LIMIT(maxGeometryShaderInvocations)
  4500. DD_APPEND_PROPERTY_LIMIT(maxGeometryInputComponents)
  4501. DD_APPEND_PROPERTY_LIMIT(maxGeometryOutputComponents)
  4502. DD_APPEND_PROPERTY_LIMIT(maxGeometryOutputVertices)
  4503. DD_APPEND_PROPERTY_LIMIT(maxGeometryTotalOutputComponents)
  4504. DD_APPEND_PROPERTY_LIMIT(maxFragmentInputComponents)
  4505. DD_APPEND_PROPERTY_LIMIT(maxFragmentOutputAttachments)
  4506. DD_APPEND_PROPERTY_LIMIT(maxFragmentDualSrcAttachments)
  4507. DD_APPEND_PROPERTY_LIMIT(maxFragmentCombinedOutputResources)
  4508. DD_APPEND_PROPERTY_LIMIT(maxComputeSharedMemorySize)
  4509. DD_APPEND_PROPERTY_LIMIT_3(maxComputeWorkGroupCount)
  4510. DD_APPEND_PROPERTY_LIMIT(maxComputeWorkGroupInvocations)
  4511. DD_APPEND_PROPERTY_LIMIT_3(maxComputeWorkGroupSize)
  4512. DD_APPEND_PROPERTY_LIMIT(subPixelPrecisionBits)
  4513. DD_APPEND_PROPERTY_LIMIT(subTexelPrecisionBits)
  4514. DD_APPEND_PROPERTY_LIMIT(mipmapPrecisionBits)
  4515. DD_APPEND_PROPERTY_LIMIT(maxDrawIndexedIndexValue)
  4516. DD_APPEND_PROPERTY_LIMIT(maxDrawIndirectCount)
  4517. DD_APPEND_PROPERTY_LIMIT(maxSamplerLodBias)
  4518. DD_APPEND_PROPERTY_LIMIT(maxSamplerAnisotropy)
  4519. DD_APPEND_PROPERTY_LIMIT(maxViewports)
  4520. DD_APPEND_PROPERTY_LIMIT_2(maxViewportDimensions)
  4521. DD_APPEND_PROPERTY_LIMIT_2(viewportBoundsRange)
  4522. DD_APPEND_PROPERTY_LIMIT(viewportSubPixelBits)
  4523. device_defines.append("minMemoryMapAlignment", (uint32_t)properties.limits.minMemoryMapAlignment);
  4524. DD_APPEND_PROPERTY_LIMIT(minTexelBufferOffsetAlignment)
  4525. DD_APPEND_PROPERTY_LIMIT(minUniformBufferOffsetAlignment)
  4526. DD_APPEND_PROPERTY_LIMIT(minStorageBufferOffsetAlignment)
  4527. DD_APPEND_PROPERTY_LIMIT(minTexelOffset)
  4528. DD_APPEND_PROPERTY_LIMIT(maxTexelOffset)
  4529. DD_APPEND_PROPERTY_LIMIT(minTexelGatherOffset)
  4530. DD_APPEND_PROPERTY_LIMIT(maxTexelGatherOffset)
  4531. DD_APPEND_PROPERTY_LIMIT(minInterpolationOffset)
  4532. DD_APPEND_PROPERTY_LIMIT(maxInterpolationOffset)
  4533. DD_APPEND_PROPERTY_LIMIT(subPixelInterpolationOffsetBits)
  4534. DD_APPEND_PROPERTY_LIMIT(maxFramebufferWidth)
  4535. DD_APPEND_PROPERTY_LIMIT(maxFramebufferHeight)
  4536. DD_APPEND_PROPERTY_LIMIT(maxFramebufferLayers)
  4537. DD_APPEND_PROPERTY_LIMIT(framebufferColorSampleCounts)
  4538. DD_APPEND_PROPERTY_LIMIT(framebufferDepthSampleCounts)
  4539. DD_APPEND_PROPERTY_LIMIT(framebufferStencilSampleCounts)
  4540. DD_APPEND_PROPERTY_LIMIT(framebufferNoAttachmentsSampleCounts)
  4541. DD_APPEND_PROPERTY_LIMIT(maxColorAttachments)
  4542. DD_APPEND_PROPERTY_LIMIT(sampledImageColorSampleCounts)
  4543. DD_APPEND_PROPERTY_LIMIT(sampledImageIntegerSampleCounts)
  4544. DD_APPEND_PROPERTY_LIMIT(sampledImageDepthSampleCounts)
  4545. DD_APPEND_PROPERTY_LIMIT(sampledImageStencilSampleCounts)
  4546. DD_APPEND_PROPERTY_LIMIT(storageImageSampleCounts)
  4547. DD_APPEND_PROPERTY_LIMIT(maxSampleMaskWords)
  4548. DD_APPEND_PROPERTY_LIMIT(timestampComputeAndGraphics)
  4549. DD_APPEND_PROPERTY_LIMIT(timestampPeriod)
  4550. DD_APPEND_PROPERTY_LIMIT(maxClipDistances)
  4551. DD_APPEND_PROPERTY_LIMIT(maxCullDistances)
  4552. DD_APPEND_PROPERTY_LIMIT(maxCombinedClipAndCullDistances)
  4553. DD_APPEND_PROPERTY_LIMIT(discreteQueuePriorities)
  4554. DD_APPEND_PROPERTY_LIMIT_2(pointSizeRange)
  4555. DD_APPEND_PROPERTY_LIMIT_2(lineWidthRange)
  4556. DD_APPEND_PROPERTY_LIMIT(pointSizeGranularity)
  4557. DD_APPEND_PROPERTY_LIMIT(lineWidthGranularity)
  4558. DD_APPEND_PROPERTY_LIMIT(strictLines)
  4559. DD_APPEND_PROPERTY_LIMIT(standardSampleLocations)
  4560. DD_APPEND_PROPERTY_LIMIT(optimalBufferCopyOffsetAlignment)
  4561. DD_APPEND_PROPERTY_LIMIT(optimalBufferCopyRowPitchAlignment)
  4562. DD_APPEND_PROPERTY_LIMIT(nonCoherentAtomSize)
  4563. #undef DD_APPEND_PROPERTY_LIMIT
  4564. #undef DD_APPEND_PROPERTY_LIMIT_2
  4565. #undef DD_APPEND_PROPERTY_LIMIT_3
  4566. #define DD_APPEND_PROPERTY_SPARSE(X) device_defines.append(#X, properties.sparseProperties.X);
  4567. DD_APPEND_PROPERTY_SPARSE(residencyStandard2DBlockShape)
  4568. DD_APPEND_PROPERTY_SPARSE(residencyStandard2DMultisampleBlockShape)
  4569. DD_APPEND_PROPERTY_SPARSE(residencyStandard3DBlockShape)
  4570. DD_APPEND_PROPERTY_SPARSE(residencyAlignedMipSize)
  4571. DD_APPEND_PROPERTY_SPARSE(residencyNonResidentStrict)
  4572. #undef DD_APPEND_PROPERTY_SPARSE
  4573. }
  4574. {
  4575. const VkPhysicalDeviceSubgroupProperties& properties = info.querySubgroupProperties();
  4576. DD_APPEND_PROPERTY(subgroupSize)
  4577. DD_APPEND_PROPERTY(supportedStages)
  4578. DD_APPEND_PROPERTY(supportedOperations)
  4579. DD_APPEND_PROPERTY(quadOperationsInAllStages)
  4580. // append subgroup ops
  4581. device_defines.append("subgroup_basic", (properties.supportedOperations & VK_SUBGROUP_FEATURE_BASIC_BIT) ? 1 : 0);
  4582. device_defines.append("subgroup_vote", (properties.supportedOperations & VK_SUBGROUP_FEATURE_VOTE_BIT) ? 1 : 0);
  4583. device_defines.append("subgroup_arithmetic", (properties.supportedOperations & VK_SUBGROUP_FEATURE_ARITHMETIC_BIT) ? 1 : 0);
  4584. device_defines.append("subgroup_ballot", (properties.supportedOperations & VK_SUBGROUP_FEATURE_BALLOT_BIT) ? 1 : 0);
  4585. device_defines.append("subgroup_shuffle", (properties.supportedOperations & VK_SUBGROUP_FEATURE_SHUFFLE_BIT) ? 1 : 0);
  4586. device_defines.append("subgroup_shuffle_relative", (properties.supportedOperations & VK_SUBGROUP_FEATURE_SHUFFLE_RELATIVE_BIT) ? 1 : 0);
  4587. device_defines.append("subgroup_clustered", (properties.supportedOperations & VK_SUBGROUP_FEATURE_CLUSTERED_BIT) ? 1 : 0);
  4588. device_defines.append("subgroup_quad", (properties.supportedOperations & VK_SUBGROUP_FEATURE_QUAD_BIT) ? 1 : 0);
  4589. device_defines.append("subgroup_rotate", (properties.supportedOperations & VK_SUBGROUP_FEATURE_ROTATE_BIT) ? 1 : 0);
  4590. device_defines.append("subgroup_rotate_relative", (properties.supportedOperations & VK_SUBGROUP_FEATURE_ROTATE_CLUSTERED_BIT) ? 1 : 0);
  4591. device_defines.append("subgroup_partitioned", (properties.supportedOperations & VK_SUBGROUP_FEATURE_PARTITIONED_BIT_NV) ? 1 : 0);
  4592. }
  4593. if (info.support_VK_NV_cooperative_matrix2())
  4594. {
  4595. const VkPhysicalDeviceCooperativeMatrix2PropertiesNV& properties = info.queryCooperativeMatrix2PropertiesNV();
  4596. DD_APPEND_PROPERTY(cooperativeMatrixWorkgroupScopeMaxWorkgroupSize)
  4597. DD_APPEND_PROPERTY(cooperativeMatrixFlexibleDimensionsMaxDimension)
  4598. DD_APPEND_PROPERTY(cooperativeMatrixWorkgroupScopeReservedSharedMemory)
  4599. }
  4600. if (info.support_VK_NV_cooperative_vector())
  4601. {
  4602. const VkPhysicalDeviceCooperativeVectorPropertiesNV& properties = info.queryCooperativeVectorPropertiesNV();
  4603. DD_APPEND_PROPERTY(cooperativeVectorSupportedStages)
  4604. DD_APPEND_PROPERTY(cooperativeVectorTrainingFloat16Accumulation)
  4605. DD_APPEND_PROPERTY(cooperativeVectorTrainingFloat32Accumulation)
  4606. DD_APPEND_PROPERTY(maxCooperativeVectorComponents)
  4607. }
  4608. if (info.support_VK_KHR_driver_properties())
  4609. {
  4610. const VkPhysicalDeviceDriverPropertiesKHR& properties = info.queryDriverProperties();
  4611. DD_APPEND_PROPERTY(driverID)
  4612. // DD_APPEND_PROPERTY(driverName)
  4613. // DD_APPEND_PROPERTY(driverInfo)
  4614. device_defines.append("conformanceVersion_major", properties.conformanceVersion.major);
  4615. device_defines.append("conformanceVersion_minor", properties.conformanceVersion.minor);
  4616. device_defines.append("conformanceVersion_subminor", properties.conformanceVersion.subminor);
  4617. device_defines.append("conformanceVersion_patch", properties.conformanceVersion.patch);
  4618. }
  4619. if (info.support_VK_KHR_shader_integer_dot_product())
  4620. {
  4621. const VkPhysicalDeviceShaderIntegerDotProductProperties& properties = info.queryShaderIntegerDotProductProperties();
  4622. DD_APPEND_PROPERTY(integerDotProduct8BitUnsignedAccelerated)
  4623. DD_APPEND_PROPERTY(integerDotProduct8BitSignedAccelerated)
  4624. DD_APPEND_PROPERTY(integerDotProduct8BitMixedSignednessAccelerated)
  4625. DD_APPEND_PROPERTY(integerDotProduct4x8BitPackedUnsignedAccelerated)
  4626. DD_APPEND_PROPERTY(integerDotProduct4x8BitPackedSignedAccelerated)
  4627. DD_APPEND_PROPERTY(integerDotProduct4x8BitPackedMixedSignednessAccelerated)
  4628. DD_APPEND_PROPERTY(integerDotProduct16BitUnsignedAccelerated)
  4629. DD_APPEND_PROPERTY(integerDotProduct16BitSignedAccelerated)
  4630. DD_APPEND_PROPERTY(integerDotProduct16BitMixedSignednessAccelerated)
  4631. DD_APPEND_PROPERTY(integerDotProduct32BitUnsignedAccelerated)
  4632. DD_APPEND_PROPERTY(integerDotProduct32BitSignedAccelerated)
  4633. DD_APPEND_PROPERTY(integerDotProduct32BitMixedSignednessAccelerated)
  4634. DD_APPEND_PROPERTY(integerDotProduct64BitUnsignedAccelerated)
  4635. DD_APPEND_PROPERTY(integerDotProduct64BitSignedAccelerated)
  4636. DD_APPEND_PROPERTY(integerDotProduct64BitMixedSignednessAccelerated)
  4637. DD_APPEND_PROPERTY(integerDotProductAccumulatingSaturating8BitUnsignedAccelerated)
  4638. DD_APPEND_PROPERTY(integerDotProductAccumulatingSaturating8BitSignedAccelerated)
  4639. DD_APPEND_PROPERTY(integerDotProductAccumulatingSaturating8BitMixedSignednessAccelerated)
  4640. DD_APPEND_PROPERTY(integerDotProductAccumulatingSaturating4x8BitPackedUnsignedAccelerated)
  4641. DD_APPEND_PROPERTY(integerDotProductAccumulatingSaturating4x8BitPackedSignedAccelerated)
  4642. DD_APPEND_PROPERTY(integerDotProductAccumulatingSaturating4x8BitPackedMixedSignednessAccelerated)
  4643. DD_APPEND_PROPERTY(integerDotProductAccumulatingSaturating16BitUnsignedAccelerated)
  4644. DD_APPEND_PROPERTY(integerDotProductAccumulatingSaturating16BitSignedAccelerated)
  4645. DD_APPEND_PROPERTY(integerDotProductAccumulatingSaturating16BitMixedSignednessAccelerated)
  4646. DD_APPEND_PROPERTY(integerDotProductAccumulatingSaturating32BitUnsignedAccelerated)
  4647. DD_APPEND_PROPERTY(integerDotProductAccumulatingSaturating32BitSignedAccelerated)
  4648. DD_APPEND_PROPERTY(integerDotProductAccumulatingSaturating32BitMixedSignednessAccelerated)
  4649. DD_APPEND_PROPERTY(integerDotProductAccumulatingSaturating64BitUnsignedAccelerated)
  4650. DD_APPEND_PROPERTY(integerDotProductAccumulatingSaturating64BitSignedAccelerated)
  4651. DD_APPEND_PROPERTY(integerDotProductAccumulatingSaturating64BitMixedSignednessAccelerated)
  4652. }
  4653. if (info.support_VK_EXT_subgroup_size_control())
  4654. {
  4655. const VkPhysicalDeviceSubgroupSizeControlPropertiesEXT& properties = info.querySubgroupSizeControlProperties();
  4656. DD_APPEND_PROPERTY(minSubgroupSize)
  4657. DD_APPEND_PROPERTY(maxSubgroupSize)
  4658. DD_APPEND_PROPERTY(maxComputeWorkgroupSubgroups)
  4659. DD_APPEND_PROPERTY(requiredSubgroupSizeStages)
  4660. }
  4661. #if ENABLE_VALIDATION_LAYER
  4662. if (info.support_VK_KHR_shader_non_semantic_info())
  4663. {
  4664. device_defines.append("enable_validation_layer", VK_TRUE);
  4665. custom_defines.append("NCNN_LOGE", "debugPrintfEXT");
  4666. }
  4667. #endif
  4668. #undef DD_APPEND_PROPERTY
  4669. }
  4670. std::string define_macro_data;
  4671. for (size_t i = 0; i < custom_defines.definitions.size(); i++)
  4672. {
  4673. const char* key = custom_defines.definitions[i].first;
  4674. const DefinitionCollector::typed_value& def = custom_defines.definitions[i].second;
  4675. if (def.type == 0)
  4676. {
  4677. define_macro_data += std::string("#define ") + key + " " + def.s + "\n";
  4678. }
  4679. else
  4680. {
  4681. char defstr[256];
  4682. if (def.type == 1)
  4683. {
  4684. sprintf(defstr, "%u", def.u8);
  4685. }
  4686. if (def.type == 2)
  4687. {
  4688. sprintf(defstr, "%u", def.u32);
  4689. }
  4690. if (def.type == 3)
  4691. {
  4692. sprintf(defstr, "%d", def.i32);
  4693. }
  4694. if (def.type == 4)
  4695. {
  4696. if (support_shader_int64)
  4697. {
  4698. sprintf(defstr, "%lluull", (unsigned long long)def.u64);
  4699. }
  4700. else
  4701. {
  4702. uint32_t u32 = def.u64 > UINT_MAX ? UINT_MAX : (uint32_t)def.u64;
  4703. sprintf(defstr, "%u", u32);
  4704. }
  4705. }
  4706. if (def.type == 5)
  4707. {
  4708. sprintf(defstr, "%e", def.f32);
  4709. }
  4710. define_macro_data += std::string("#define ") + key + " " + defstr + "\n";
  4711. }
  4712. }
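// Illustration only, not part of gpu.cpp: the loop above turns each typed value
// into a "#define key value" line, and clamps 64-bit values to 32 bits when the
// device lacks shaderInt64. A minimal sketch of the 64-bit branch, written with
// snprintf and the <cinttypes> format macros so the format specifier matches
// uint64_t on every platform. The function name format_u64_define is
// hypothetical; the "ull" suffix and "ncnn_" prefix are copied from the code
// in this file.

```cpp
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// Build one "#define ncnn_<key> <value>" line for a 64-bit value.
inline std::string format_u64_define(const char* key, uint64_t value, bool support_int64)
{
    char defstr[64];
    if (support_int64)
    {
        // keep the full 64-bit value and append the literal suffix
        std::snprintf(defstr, sizeof(defstr), "%" PRIu64 "ull", value);
    }
    else
    {
        // saturate to the largest 32-bit value instead of wrapping around
        const uint32_t clamped = value > UINT32_MAX ? UINT32_MAX : static_cast<uint32_t>(value);
        std::snprintf(defstr, sizeof(defstr), "%" PRIu32, clamped);
    }
    return std::string("#define ncnn_") + key + " " + defstr + "\n";
}
```

// Saturating rather than truncating keeps large limits (for example a buffer
// range above 4 GiB) reading as "very large" in shaders that only have 32-bit
// integers.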
  4713. for (size_t i = 0; i < device_defines.definitions.size(); i++)
  4714. {
  4715. const char* key = device_defines.definitions[i].first;
  4716. const DefinitionCollector::typed_value& def = device_defines.definitions[i].second;
  4717. if (def.type == 0)
  4718. {
  4719. define_macro_data += std::string("#define ncnn_") + key + " \"" + def.s + "\"\n";
  4720. }
  4721. else
  4722. {
  4723. char defstr[256];
  4724. if (def.type == 1)
  4725. {
  4726. sprintf(defstr, "%u", def.u8);
  4727. }
  4728. if (def.type == 2)
  4729. {
  4730. sprintf(defstr, "%u", def.u32);
  4731. }
  4732. if (def.type == 3)
  4733. {
  4734. sprintf(defstr, "%d", def.i32);
  4735. }
  4736. if (def.type == 4)
  4737. {
  4738. if (support_shader_int64)
  4739. {
  4740. sprintf(defstr, "%lluull", (unsigned long long)def.u64);
  4741. }
  4742. else
  4743. {
  4744. uint32_t u32 = def.u64 > UINT_MAX ? UINT_MAX : (uint32_t)def.u64;
  4745. sprintf(defstr, "%u", u32);
  4746. }
  4747. }
  4748. if (def.type == 5)
  4749. {
  4750. sprintf(defstr, "%e", def.f32);
  4751. }
  4752. define_macro_data += std::string("#define ncnn_") + key + " " + defstr + "\n";
  4753. }
  4754. }

    // enable extensions
    std::string custom_exts;
    if (support_shader_int64)
    {
        custom_exts += "#extension GL_EXT_shader_explicit_arithmetic_types_int64: require\n";
    }
    if (opt.use_fp16_storage)
    {
        custom_exts += "#extension GL_EXT_shader_16bit_storage: require\n";
        custom_exts += "struct sfpvec8 { f16vec4 abcd; f16vec4 efgh; };\n";
    }
    if (opt.use_fp16_arithmetic)
    {
        custom_exts += "#extension GL_EXT_shader_explicit_arithmetic_types_float16: require\n";
    }
    custom_exts += "struct ivec8 { ivec4 abcd; ivec4 efgh; };\n";
    if (opt.use_int8_storage)
    {
        custom_exts += "#extension GL_EXT_shader_8bit_storage: require\n";
    }
    if (opt.use_int8_arithmetic)
    {
        custom_exts += "#extension GL_EXT_shader_explicit_arithmetic_types_int8: require\n";
    }
#if ENABLE_VALIDATION_LAYER
    {
        custom_exts += "#extension GL_EXT_debug_printf : require\n";
    }
#endif

    // debug
    // NCNN_LOGE("%s", define_macro_data.c_str());

    bool compile_success = true;

    {
        glslang::TShader s(EShLangCompute);

        // split the shader source right after the "#version 450\n" line
        int version_end_pos = -1;
        {
            for (int i = 0; i < comp_data_size - 8; i++)
            {
                if (strncmp(comp_data + i, "#version", 8) != 0)
                    continue;

                // #version must be at the very beginning or right after a newline
                if (i != 0 && comp_data[i - 1] != '\n')
                    continue;

                // %n reports how many characters the whole "#version N\n" line consumed
                int nversion = 0;
                sscanf(comp_data + i, "#version %*d\n%n", &nversion);
                if (nversion == 0)
                    continue;

                version_end_pos = i + nversion;
                break;
            }

            if (version_end_pos == -1)
            {
                NCNN_LOGE("shader source has no #version token");
                return -1;
            }

            // NCNN_LOGE("version_end_pos = %d", version_end_pos);
        }

        const char* comp_data_2 = comp_data + version_end_pos;
        int comp_data_size_1 = version_end_pos;
        int comp_data_size_2 = comp_data_size - comp_data_size_1;

        // source order: #version line, extensions, macro definitions, rest of the shader
        const char* comp_datas[4] = {comp_data, custom_exts.c_str(), define_macro_data.c_str(), comp_data_2};
        const int comp_data_sizes[4] = {comp_data_size_1, (int)custom_exts.size(), (int)define_macro_data.size(), comp_data_size_2};

        s.setStringsWithLengths(comp_datas, comp_data_sizes, 4);

        s.setEntryPoint("main");
        s.setSourceEntryPoint("main");

        s.setEnvInput(glslang::EShSourceGlsl, EShLangCompute, glslang::EShClientVulkan, 1);

        if (opt.use_subgroup_ops || opt.use_cooperative_matrix)
        {
            // subgroup / cooperative_matrix need vulkan-1.1 and spirv-1.3
            s.setEnvClient(glslang::EShClientVulkan, glslang::EShTargetVulkan_1_1);
            s.setEnvTarget(glslang::EshTargetSpv, glslang::EShTargetSpv_1_3);
        }
        else
        {
            s.setEnvClient(glslang::EShClientVulkan, glslang::EShTargetVulkan_1_0);
            s.setEnvTarget(glslang::EshTargetSpv, glslang::EShTargetSpv_1_0);
        }

        TBuiltInResource resources = get_default_TBuiltInResource();

        VulkanShaderIncluder includer;

        bool pr = s.parse(&resources, 100, ENoProfile, false, false, EShMsgDefault, includer);
        if (!pr)
        {
            NCNN_LOGE("compile spir-v module failed");
            NCNN_LOGE("%s", s.getInfoLog());
            NCNN_LOGE("%s", s.getInfoDebugLog());

            // print as line_number: code
            {
                const char* p = comp_datas[3];
                const char* line_end;
                int line_number = 1;

                while ((line_end = strchr(p, '\n')) != NULL)
                {
                    NCNN_LOGE("%d:\t%.*s", line_number++, (int)(line_end - p), p);
                    p = line_end + 1;
                }

                if (*p != '\0')
                {
                    NCNN_LOGE("%d:\t%s", line_number, p);
                }
            }

            compile_success = false;
        }
        else
        {
            glslang::TIntermediate* ir = s.getIntermediate();
            glslang::GlslangToSpv(*ir, spirv);
        }
    }

    return compile_success ? 0 : -1;
}

int compile_spirv_module(int shader_type_index, const Option& opt, std::vector<uint32_t>& spirv)
{
    if (shader_type_index < 0 || shader_type_index >= layer_shader_registry_entry_count)
    {
        NCNN_LOGE("no such shader module %d", shader_type_index);
        return -1;
    }

    const char* comp_data = layer_shader_registry[shader_type_index].comp_data;
    int comp_data_size = layer_shader_registry[shader_type_index].comp_data_size;

    return compile_spirv_module(comp_data, comp_data_size, opt, spirv);
}

int resolve_shader_info(const uint32_t* spv_data, size_t spv_data_size, ShaderInfo& shader_info)
{
    shader_info.specialization_count = 0;
    shader_info.binding_count = 0;
    shader_info.push_constant_count = 0;

    // sentinel that matches no id until the OpName for "parameter" is seen
    uint32_t parameter_id = -233;

    int specialization_count = 0;
    int binding_count = 0;
    int push_constant_count = 0;

    // id -> binding_type (1 = storage buffer, 2 = image, 3 = sampled image)
    std::vector<int> id_types;

    // binding_id -> id of the variable decorated with that binding
    std::vector<int> binding_types;

    const uint32_t* p = spv_data;

    int bound = p[3];

    id_types.resize(bound);

    // skip the 5-word header: magic, version, generator, bound, schema
    p += 5;

    // foreach op
    while ((const unsigned char*)p < (const unsigned char*)spv_data + spv_data_size)
    {
        // first word of an instruction: high 16 bits = word count, low 16 bits = opcode
        uint32_t opcode = p[0];

        uint16_t wordcount = opcode >> 16;
        uint16_t op = opcode & 0xffff;

        if (op == 5) // OpName
        {
            uint32_t id = p[1];
            const char* name = (const char*)&p[2];
            if (strcmp(name, "parameter") == 0)
            {
                parameter_id = id;
            }
        }
        else if (op == 6) // OpMemberName
        {
            uint32_t id = p[1];
            if (id == parameter_id)
            {
                push_constant_count++;
            }
        }
        else if (op == 25) // OpTypeImage
        {
            uint32_t id = p[1];
            id_types[id] = 2;
        }
        else if (op == 27) // OpTypeSampledImage
        {
            uint32_t id = p[1];
            id_types[id] = 3;
        }
        else if (op == 32) // OpTypePointer
        {
            uint32_t id = p[1];
            uint32_t storage_class = p[2];
            uint32_t type = p[3];
            if (storage_class == 0) // UniformConstant
            {
                id_types[id] = id_types[type];
            }
            if (storage_class == 2) // Uniform
            {
                id_types[id] = id_types[type];
            }
            if (storage_class == 12) // StorageBuffer
            {
                id_types[type] = 1;
                id_types[id] = id_types[type];
            }
        }
        else if (op == 59) // OpVariable
        {
            uint32_t id = p[1];
            uint32_t var_id = p[2];
            uint32_t storage_class = p[3];
            if (storage_class == 0) // UniformConstant
            {
                id_types[var_id] = id_types[id];
            }
            if (storage_class == 2) // Uniform
            {
                id_types[var_id] = id_types[id];
            }
            if (storage_class == 12) // StorageBuffer
            {
                id_types[var_id] = id_types[id];
            }
        }
        else if (op == 71) // OpDecorate
        {
            uint32_t id = p[1];
            uint32_t decoration = p[2];
            if (decoration == 1) // SpecId
            {
                specialization_count++;
            }
            if (decoration == 3) // BufferBlock
            {
                id_types[id] = 1;
            }
            else if (decoration == 33) // Binding
            {
                // read the operand only here: BufferBlock has no operand word, so p[3] would belong to the next instruction
                uint32_t binding_id = p[3];

                binding_count = std::max(binding_count, (int)binding_id + 1);

                binding_types.resize(binding_count);
                binding_types[binding_id] = id;
            }
        }

        p += wordcount;
    }

    if (binding_count > 16)
    {
        NCNN_LOGE("too many binding %d", binding_count);
        return -1;
    }

    shader_info.specialization_count = specialization_count;
    shader_info.binding_count = binding_count;
    shader_info.push_constant_count = push_constant_count;

    // resolve binding_types
    for (int i = 0; i < binding_count; i++)
    {
        shader_info.binding_types[i] = id_types[binding_types[i]];
    }

    return 0;
}

} // namespace ncnn

#else

namespace ncnn {

int create_gpu_instance(const char* driver_path)
{
    return 0;
}

void destroy_gpu_instance()
{
}

int get_gpu_count()
{
    return 0;
}

} // namespace ncnn

#endif // NCNN_VULKAN