 [WIP] vulkan compute (#618)
* vulkan infrastructure
* vkallocator and vkmat
* layer interface for vulkan compute
* wip...
* default vulkan device, command wrapper, upload model weight in load_model to simplify layer interface
* simplify command api, vkmat holds staging buffer, relu works
* initialize specialization constant, simplify command dispatch, fix staging buffer copy with different shape, convolution works
* init extension functions
* dynamic local size and group count
* group count=1 is invalid
* regard device max workgroup size limit
* fix relu oooops
* decouple command record and staging allocation
* create result blob
* add pooling shader
* buffer is faster than image :)
* fix pooling shader
* add innerproduct shader
* readonly writeonly decoration
* simplify buffer creation
* decouple command and layer, VK_KHR_descriptor_update_template extension makes descriptor binding update easy :D
* fix vulkan building issues in visual studio (#1)
* fix building issues on visual studio
* ignore benchmark
* cancel changes
* ... ...
* decouple paramdict and vulkandevice
* fix staging buffer destroy in model loading
* remove vkdev member in option
* add padding shader
* simplify vulkan layer creation, simplify convolution and pooling shader for no padding, less debug output
* add convolutiondepthwise and softmax shader
* specialization float type, add leakyrelu
* add dropout shader
* add batchnorm shader
* split vulkan forward
* add scale shader
* push constant type can be int or float
* set_optimal_local_size_xyz
* add eltwise shader
* concat vulkan forward
* fix convolution without bias
* add dummy shader for concat and split, more fix ...
* optional VK_KHR_descriptor_update_template and VK_KHR_push_descriptor
* check VK_KHR_push_descriptor for vkCmdPushDescriptorSetWithTemplateKHR
* binaryop and unaryop shader
* hide raw command buffer
* simple vkbenchncnn benchmark
* create device with transfer queue
* rename command to vkcompute, add vktransfer and layer upload_model interface
* external VkMat, copy and map wrt buffer offset
* command copy respect offset and size
* decouple weight upload and load, simplify upload weight api, use one big staging buffer for uploading weights
* fix build on android
* binding count can not vary :(
* barrier check state, fix sub-op destruction
* declare local_size_xyz constant, fix crash on radv
* fix local_size_xyz, second try
* more barrier and state fix
* fix softmax
* reconstruct buffer memory allocator, reuse blob buffer, less verbose output
* find unified memory type index
* weight staging buffer allocator and weight buffer allocator, respect descriptor buffer offset alignment
* use VK_KHR_descriptor_update_template for faster descriptor update if available, multithread pipeline creation
* find more useful vulkan extensions and enable them
* fix msvc build
* respect VK_KHR_dedicated_allocation for weight buffer allocation
* fix android build
* fix bias name conflicts with metal
* decouple pipeline and layer, building shader sources into shader module, dedicated create_pipeline api, simplify pipeline recording
* drop dummy shader, inplace softmax, multiple shader module works
* fix unique queue family index error
* flatten support vulkan
* mnasnet run
* find shader module by name, each entry point per shader module, fix attribute/id conflict on moltenvk
* some minor changes
* add some high level api
* use dedicated transfer queue to upload weight model
* prefer mappable buffer on unified memory
* global pooling and convolution fc, reuse staging buffer
* implement ring-buffer style blob allocator, add VkBufferMemory capacity
* use blob allocator for workspace blob, it works fine :)
* vulkan option off
* Update layer.cpp
* fix build with vulkan off
* less verbose output, fix crash on vulkan_compute off
* merge benchncnn tool
* allocator clear api, use new weight buffer allocator per net
* add default locked allocator
* mapped mat ptr api, persistent mapped memory works generally :)
* travis ci linux vulkan
* travis ci vulkan wip ...
* more gpu wip ...
* more gpu wip ...
* wip...
* wip...
* wip... ...
* wip... ios vulkan build...
* find glslangValidator on ios build
* use dynamic moltenvk library
* travis ci wip ...
* ios simulator does not support metal at all
* fix cpu only extractor
* optimize workgroup size, first try
* optimize workgroup size, second try
* conv1x1s1d1 vec4
* revert build system
* fix ncnn2mem build
* fix ncnn2mem build
7 years ago  new int8 implement,better accuracy (#749)
* add the armv7a conv3x3s1 implementation without overflow, remove old code
* fix the bug of conv3x3s2 packed int8
* new int8 implementation, weight quantization per channel, better accuracy
* fix the bug of conv3x3s1 packed int8 neon
* add the naive c fp32 and int8 winograd F(2,3)
* add the neon intrinsic int8 winograd F(2,3)
* optimize the armv7a int8 winograd F(2,3) with neon assembly
* optimize the armv7a int8 winograd F(2,3) input transform with assembly.
* add the requantize layer and int8 relu implementation
* add graph optimization conv1x1s2 -> conv1x1s1, begin optimizing int8 aarch64
* fix int8 bugs
* add the c naive im2col with sgemm
* add aarch64 int8 winograd F(2,3), conv3x3s2 naive implementation
* add the int8 sgemm conv7x7s2 on x86/armv7a platform
* optimize the int8 sgemm by neon intrinsic and packed kernel
* optimize the int8 sgemm with packed data
* optimize the int8 sgemm with armv7a neon assembly
* add the int8 sgemm on arm64-v8a platform
* prepare to merge latest code from master
* add the int8 param files
* add the fuse_network method to class Net
7 years ago  [WIP] vulkan compute (#618)
* vulkan infrastructure
* vkallocator and vkmat
* layer interface for vulkan compute
* wip...
* default vulkan device, command wrapper, upload model weight in load_model to simplify layer interface
* simplify command api, vkmat holds staging buffer, relu works
* initialize specialization constant, simplify command dispatch, fix staging buffer copy with different shape, convolution works
* init extension functions
* dynamic local size and group count
* group count=1 is invalid
* regard device max workgroup size limit
* fix relu oooops
* decouple command record and staging allocation
* create result blob
* add pooling shader
* buffer is faster than image :)
* fix pooling shader
* add innerproduct shader
* readonly writeonly decoration
* simplify buffer creation
* decouple command and layer, VK_KHR_descriptor_update_template extension makes descriptor binding update easy :D
* fix vulkan building issues in visual studio (#1)
* fix building issues on visual studio
* ignore benchmark
* cancel changes
* ... ...
* decouple paramdict and vulkandevice
* fix staging buffer destroy in model loading
* remove vkdev member in option
* add padding shader
* simplify vulkan layer creation, simplify convolution and pooling shader for no padding, less debug output
* add convolutiondepthwise and softmax shader
* specialization float type, add leakyrelu
* add dropout shader
* add batchnorm shader
* split vulkan forward
* add scale shader
* push constant type can be int or float
* set_optimal_local_size_xyz
* add eltwise shader
* concat vulkan forward
* fix convolution without bias
* add dummy shader for concat and split, more fix ...
* optional VK_KHR_descriptor_update_template and VK_KHR_push_descriptor
* check VK_KHR_push_descriptor for vkCmdPushDescriptorSetWithTemplateKHR
* binaryop and unaryop shader
* hide raw command buffer
* simple vkbenchncnn benchmark
* create device with transfer queue
* rename command to vkcompute, add vktransfer and layer upload_model interface
* external VkMat, copy and map wrt buffer offset
* command copy respect offset and size
* decouple weight upload and load, simplify upload weight api, use one big staging buffer for uploading weights
* fix build on android
* binding count can not vary :(
* barrier check state, fix sub-op destruction
* declare local_size_xyz constant, fix crash on radv
* fix local_size_xyz, second try
* more barrier and state fix
* fix softmax
* reconstruct buffer memory allocator, reuse blob buffer, less verbose output
* find unified memory type index
* weight staging buffer allocator and weight buffer allocator, respect descriptor buffer offset alignment
* use VK_KHR_descriptor_update_template for faster descriptor update if available, multithread pipeline creation
* find more useful vulkan extensions and enable them
* fix msvc build
* respect VK_KHR_dedicated_allocation for weight buffer allocation
* fix android build
* fix bias name conflicts with metal
* decouple pipeline and layer, building shader sources into shader module, dedicated create_pipeline api, simplify pipeline recording
* drop dummy shader, inplace softmax, multiple shader module works
* fix unique queue family index error
* flatten support vulkan
* mnasnet run
* find shader module by name, each entry point per shader module, fix attribute/id conflict on moltenvk
* some minor changes
* add some high level api
* use dedicated transfer queue to upload weight model
* prefer mappable buffer on unified memory
* global pooling and convolution fc, reuse staging buffer
* implement ring-buffer style blob allocator, add VkBufferMemory capacity
* use blob allocator for workspace blob, it works fine :)
* vulkan option off
* Update layer.cpp
* fix build with vulkan off
* less verbose output, fix crash on vulkan_compute off
* merge benchncnn tool
* allocator clear api, use new weight buffer allocator per net
* add default locked allocator
* mapped mat ptr api, persistent mapped memory works generally :)
* travis ci linux vulkan
* travis ci vulkan wip ...
* more gpu wip ...
* more gpu wip ...
* wip...
* wip...
* wip... ...
* wip... ios vulkan build...
* find glslangValidator on ios build
* use dynamic moltenvk library
* travis ci wip ...
* ios simulator does not support metal at all
* fix cpu only extractor
* optimize workgroup size, first try
* optimize workgroup size, second try
* conv1x1s1d1 vec4
* revert build system
* fix ncnn2mem build
* fix ncnn2mem build
7 years ago  [WIP] vulkan compute (#618)
* vulkan infrastructure
* vkallocator and vkmat
* layer interface for vulkan compute
* wip...
* default vulkan device, command wrapper, upload model weight in load_model to simplify layer interface
* simplify command api, vkmat holds staging buffer, relu works
* initialize specialization constant, simplify command dispatch, fix staging buffer copy with different shape, convolution works
* init extension functions
* dynamic local size and group count
* group count=1 is invalid
* regard device max workgroup size limit
* fix relu oooops
* decouple command record and staging allocation
* create result blob
* add pooling shader
* buffer is faster than image :)
* fix pooling shader
* add innerproduct shader
* readonly writeonly decoration
* simplify buffer creation
* decouple command and layer, VK_KHR_descriptor_update_template extension makes descriptor binding update easy :D
* fix vulkan building issues in visual studio (#1)
* fix building issues on visual studio
* ignore benchmark
* cancel changes
* ... ...
* decouple paramdict and vulkandevice
* fix staging buffer destroy in model loading
* remove vkdev member in option
* add padding shader
* simplify vulkan layer creation, simplify convolution and pooling shader for no padding, less debug output
* add convolutiondepthwise and softmax shader
* specialization float type, add leakyrelu
* add dropout shader
* add batchnorm shader
* split vulkan forward
* add scale shader
* push constant type can be int or float
* set_optimal_local_size_xyz
* add eltwise shader
* concat vulkan forward
* fix convolution without bias
* add dummy shader for concat and split, more fix ...
* optional VK_KHR_descriptor_update_template and VK_KHR_push_descriptor
* check VK_KHR_push_descriptor for vkCmdPushDescriptorSetWithTemplateKHR
* binaryop and unaryop shader
* hide raw command buffer
* simple vkbenchncnn benchmark
* create device with transfer queue
* rename command to vkcompute, add vktransfer and layer upload_model interface
* external VkMat, copy and map wrt buffer offset
* command copy respect offset and size
* decouple weight upload and load, simplify upload weight api, use one big staging buffer for uploading weights
* fix build on android
* binding count can not vary :(
* barrier check state, fix sub-op destruction
* declare local_size_xyz constant, fix crash on radv
* fix local_size_xyz, second try
* more barrier and state fix
* fix softmax
* reconstruct buffer memory allocator, reuse blob buffer, less verbose output
* find unified memory type index
* weight staging buffer allocator and weight buffer allocator, respect descriptor buffer offset alignment
* use VK_KHR_descriptor_update_template for faster descriptor update if available, multithread pipeline creation
* find more useful vulkan extensions and enable them
* fix msvc build
* respect VK_KHR_dedicated_allocation for weight buffer allocation
* fix android build
* fix bias name conflicts with metal
* decouple pipeline and layer, building shader sources into shader module, dedicated create_pipeline api, simplify pipeline recording
* drop dummy shader, inplace softmax, multiple shader module works
* fix unique queue family index error
* flatten support vulkan
* mnasnet run
* find shader module by name, each entry point per shader module, fix attribute/id conflict on moltenvk
* some minor changes
* add some high level api
* use dedicated transfer queue to upload weight model
* prefer mappable buffer on unified memory
* global pooling and convolution fc, reuse staging buffer
* implement ring-buffer style blob allocator, add VkBufferMemory capacity
* use blob allocator for workspace blob, it works fine :)
* vulkan option off
* Update layer.cpp
* fix build with vulkan off
* less verbose output, fix crash on vulkan_compute off
* merge benchncnn tool
* allocator clear api, use new weight buffer allocator per net
* add default locked allocator
* mapped mat ptr api, persistent mapped memory works generally :)
* travis ci linux vulkan
* travis ci vulkan wip ...
* more gpu wip ...
* more gpu wip ...
* wip...
* wip...
* wip... ...
* wip... ios vulkan build...
* find glslangValidator on ios build
* use dynamic moltenvk library
* travis ci wip ...
* ios simulator does not support metal at all
* fix cpu only extractor
* optimize workgroup size, first try
* optimize workgroup size, second try
* conv1x1s1d1 vec4
* revert build system
* fix ncnn2mem build
* fix ncnn2mem build
7 years ago
72278227922802281228222832284228522862287228822892290229122922293229422952296229722982299230023012302230323042305230623072308230923102311231223132314231523162317231823192320232123222323232423252326232723282329233023312332233323342335233623372338233923402341234223432344234523462347234823492350235123522353235423552356235723582359236023612362236323642365236623672368236923702371237223732374237523762377237823792380238123822383238423852386238723882389239023912392239323942395239623972398239924002401240224032404240524062407240824092410241124122413241424152416241724182419242024212422242324242425242624272428242924302431243224332434243524362437243824392440244124422443244424452446244724482449245024512452245324542455245624572458245924602461246224632464246524662467246824692470247124722473247424752476247724782479248024812482248324842485248624872488248924902491249224932494249524962497249824992500250125022503250425052506250725082509251025112512251325142515251625172518251925202521252225232524252525262527252825292530253125322533253425352536253725382539254025412542254325442545254625472548254925502551255225532554255525562557255825592560256125622563256425652566256725682569257025712572257325742575257625772578257925802581258225832584258525862587258825892590259125922593259425952596259725982599260026012602260326042605260626072608260926102611261226132614261526162617261826192620262126222623262426252626262726282629263026312632263326342635263626372638263926402641264226432644264526462647264826492650265126522653265426552656265726582659266026612662266326642665266626672668266926702671267226732674267526762677267826792680268126822683268426852686268726882689269026912692269326942695269626972698269927002701270227032704270527062707270827092710271127122713271427152716271727182719272027212722272327242725272627272728272927302731273227332734273527362737273827392740274127422743274427452746274727482749275027512752275327542755275627572758275927602761276227632764276527662767276827692770277127722773277427752776277
72778277927802781278227832784278527862787278827892790279127922793279427952796279727982799280028012802280328042805 |
- // Tencent is pleased to support the open source community by making ncnn available.
- //
- // Copyright (C) 2017 THL A29 Limited, a Tencent company. All rights reserved.
- //
- // Licensed under the BSD 3-Clause License (the "License"); you may not use this file except
- // in compliance with the License. You may obtain a copy of the License at
- //
- // https://opensource.org/licenses/BSD-3-Clause
- //
- // Unless required by applicable law or agreed to in writing, software distributed
- // under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
- // CONDITIONS OF ANY KIND, either express or implied. See the License for the
- // specific language governing permissions and limitations under the License.
-
- #include "net.h"
-
- #include "cpu.h"
- #include "datareader.h"
- #include "layer_type.h"
- #include "modelbin.h"
- #include "paramdict.h"
-
- #include <stdarg.h>
- #include <stdint.h>
- #include <string.h>
-
- #if NCNN_BENCHMARK
- #include "benchmark.h"
- #endif // NCNN_BENCHMARK
-
- #if NCNN_VULKAN
- #include "command.h"
- #include "pipelinecache.h"
- #endif // NCNN_VULKAN
-
- namespace ncnn {
-
- class NetPrivate
- {
- public:
- NetPrivate(Option& _opt);
-
- Option& opt;
-
- #if NCNN_VULKAN
-
- int upload_model();
-
- #endif // NCNN_VULKAN
-
- friend class Extractor;
- int forward_layer(int layer_index, std::vector<Mat>& blob_mats, const Option& opt) const;
-
- #if NCNN_VULKAN
- int forward_layer(int layer_index, std::vector<Mat>& blob_mats, std::vector<VkMat>& blob_mats_gpu, VkCompute& cmd, const Option& opt) const;
- int forward_layer(int layer_index, std::vector<Mat>& blob_mats, std::vector<VkMat>& blob_mats_gpu, std::vector<VkImageMat>& blob_mats_gpu_image, VkCompute& cmd, const Option& opt) const;
- #endif // NCNN_VULKAN
-
- int convert_layout(Mat& bottom_blob, const Layer* layer, const Option& opt) const;
-
- int do_forward_layer(const Layer* layer, std::vector<Mat>& blob_mats, const Option& opt) const;
- #if NCNN_VULKAN
- int do_forward_layer(const Layer* layer, std::vector<VkMat>& blob_mats_gpu, VkCompute& cmd, const Option& opt) const;
- int do_forward_layer(const Layer* layer, std::vector<VkImageMat>& blob_mats_gpu_image, VkCompute& cmd, const Option& opt) const;
- #endif // NCNN_VULKAN
-
- void update_input_output_indexes();
- #if NCNN_STRING
- void update_input_output_names();
- #endif // NCNN_STRING
-
- std::vector<Blob> blobs;
- std::vector<Layer*> layers;
-
- std::vector<int> input_blob_indexes;
- std::vector<int> output_blob_indexes;
- #if NCNN_STRING
- std::vector<const char*> input_blob_names;
- std::vector<const char*> output_blob_names;
- #endif // NCNN_STRING
-
- std::vector<custom_layer_registry_entry> custom_layer_registry;
- std::vector<overwrite_builtin_layer_registry_entry> overwrite_builtin_layer_registry;
-
- PoolAllocator* local_blob_allocator;
- PoolAllocator* local_workspace_allocator;
-
- #if NCNN_VULKAN
- const VulkanDevice* vkdev;
-
- VkAllocator* weight_vkallocator;
- VkAllocator* weight_staging_vkallocator;
-
- PipelineCache* pipeline_cache;
- #endif // NCNN_VULKAN
- };
-
- NetPrivate::NetPrivate(Option& _opt)
- : opt(_opt)
- {
- local_blob_allocator = 0;
- local_workspace_allocator = 0;
-
- #if NCNN_VULKAN
- vkdev = 0;
- weight_vkallocator = 0;
- weight_staging_vkallocator = 0;
- pipeline_cache = 0;
- #endif // NCNN_VULKAN
- }
-
- static Option get_masked_option(const Option& opt, int featmask)
- {
- // mask option usage as layer specific featmask
- Option opt1 = opt;
- opt1.use_fp16_arithmetic = opt1.use_fp16_arithmetic && !(featmask & (1 << 0));
- opt1.use_fp16_storage = opt1.use_fp16_storage && !(featmask & (1 << 1));
- opt1.use_fp16_packed = opt1.use_fp16_packed && !(featmask & (1 << 1));
- opt1.use_bf16_storage = opt1.use_bf16_storage && !(featmask & (1 << 2));
- opt1.use_int8_packed = opt1.use_int8_packed && !(featmask & (1 << 3));
- opt1.use_int8_storage = opt1.use_int8_storage && !(featmask & (1 << 3));
- opt1.use_int8_arithmetic = opt1.use_int8_arithmetic && !(featmask & (1 << 3));
- opt1.use_vulkan_compute = opt1.use_vulkan_compute && !(featmask & (1 << 4));
- opt1.use_image_storage = opt1.use_image_storage && !(featmask & (1 << 4));
- opt1.use_tensor_storage = opt1.use_tensor_storage && !(featmask & (1 << 4));
- opt1.use_sgemm_convolution = opt1.use_sgemm_convolution && !(featmask & (1 << 5));
- opt1.use_winograd_convolution = opt1.use_winograd_convolution && !(featmask & (1 << 6));
-
- return opt1;
- }
-
- #if NCNN_VULKAN
- int NetPrivate::upload_model()
- {
- ncnn::VkTransfer cmd(vkdev);
-
- // create gpu device allocator if null
- if (!weight_vkallocator)
- {
- weight_vkallocator = new VkWeightAllocator(vkdev);
- }
- if (!weight_staging_vkallocator)
- {
- weight_staging_vkallocator = new VkWeightStagingAllocator(vkdev);
- }
-
- Option opt_upload = opt;
- opt_upload.blob_vkallocator = weight_vkallocator;
- opt_upload.workspace_vkallocator = weight_vkallocator;
- opt_upload.staging_vkallocator = weight_staging_vkallocator;
-
- for (size_t i = 0; i < layers.size(); i++)
- {
- if (layers[i]->support_vulkan)
- {
- int uret = layers[i]->upload_model(cmd, get_masked_option(opt_upload, layers[i]->featmask));
- if (uret != 0)
- {
- NCNN_LOGE("layer upload_model %d failed", (int)i);
- return -1;
- }
- }
- }
-
- return cmd.submit_and_wait();
- }
- #endif // NCNN_VULKAN
-
- int NetPrivate::forward_layer(int layer_index, std::vector<Mat>& blob_mats, const Option& opt) const
- {
- const Layer* layer = layers[layer_index];
-
- // NCNN_LOGE("forward_layer %d %s", layer_index, layer->name.c_str());
-
- // load bottom blobs
- for (size_t i = 0; i < layer->bottoms.size(); i++)
- {
- int bottom_blob_index = layer->bottoms[i];
-
- if (blob_mats[bottom_blob_index].dims == 0)
- {
- int ret = forward_layer(blobs[bottom_blob_index].producer, blob_mats, opt);
- if (ret != 0)
- return ret;
- }
- }
-
- #if NCNN_BENCHMARK
- double start = get_current_time();
- Mat bottom_blob;
- if (layer->one_blob_only)
- {
- int bottom_blob_index = layer->bottoms[0];
- bottom_blob.dims = blob_mats[bottom_blob_index].dims;
- bottom_blob.w = blob_mats[bottom_blob_index].w;
- bottom_blob.h = blob_mats[bottom_blob_index].h;
- bottom_blob.c = blob_mats[bottom_blob_index].c;
- bottom_blob.elempack = blob_mats[bottom_blob_index].elempack;
- bottom_blob.elemsize = blob_mats[bottom_blob_index].elemsize;
- }
- #endif
- int ret = 0;
- if (layer->featmask)
- {
- ret = do_forward_layer(layer, blob_mats, get_masked_option(opt, layer->featmask));
- }
- else
- {
- ret = do_forward_layer(layer, blob_mats, opt);
- }
- #if NCNN_BENCHMARK
- double end = get_current_time();
- if (layer->one_blob_only)
- {
- int top_blob_index = layer->tops[0];
- benchmark(layer, bottom_blob, blob_mats[top_blob_index], start, end);
- }
- else
- {
- benchmark(layer, start, end);
- }
- #endif
- if (ret != 0)
- return ret;
-
- // NCNN_LOGE("forward_layer %d %s done", layer_index, layer->name.c_str());
- // const Mat& blob = blob_mats[layer->tops[0]];
- // NCNN_LOGE("[%-2d %-16s %-16s] %d blobs count = %-3d size = %-3d x %-3d", layer_index, layer->type.c_str(), layer->name.c_str(), layer->tops[0], blob.c, blob.h, blob.w);
-
- return 0;
- }
-
- #if NCNN_VULKAN
- int NetPrivate::forward_layer(int layer_index, std::vector<Mat>& blob_mats, std::vector<VkMat>& blob_mats_gpu, VkCompute& cmd, const Option& opt) const
- {
- const Layer* layer = layers[layer_index];
-
- // NCNN_LOGE("forward_layer %d %d %s", layer->support_vulkan, layer_index, layer->name.c_str());
-
- bool cmd_submit_and_wait = false;
-
- // load bottom blobs
- for (size_t i = 0; i < layer->bottoms.size(); i++)
- {
- int bottom_blob_index = layer->bottoms[i];
-
- if (blob_mats_gpu[bottom_blob_index].dims == 0 && blob_mats[bottom_blob_index].dims == 0)
- {
- int ret = forward_layer(blobs[bottom_blob_index].producer, blob_mats, blob_mats_gpu, cmd, opt);
- if (ret != 0)
- return ret;
- }
-
- if (layer->support_vulkan)
- {
- if (blob_mats_gpu[bottom_blob_index].dims == 0)
- {
- // host to buffer
- cmd.record_upload(blob_mats[bottom_blob_index], blob_mats_gpu[bottom_blob_index], opt);
-
- if (opt.lightmode)
- {
- // delete after taken in light mode
- blob_mats[bottom_blob_index].release();
- }
- }
- }
- else
- {
- if (blob_mats[bottom_blob_index].dims == 0)
- {
- Option opt_download = opt;
- opt_download.use_packing_layout = layer->support_packing;
-
- // buffer to host
- cmd.record_download(blob_mats_gpu[bottom_blob_index], blob_mats[bottom_blob_index], opt_download);
-
- if (opt.lightmode)
- {
- // delete after taken in light mode
- blob_mats_gpu[bottom_blob_index].release();
- }
-
- cmd_submit_and_wait = true;
- }
- }
- }
-
- int ret;
- if (cmd_submit_and_wait)
- {
- ret = cmd.submit_and_wait();
-
- #if NCNN_BENCHMARK
- std::vector<uint64_t> results(layer_index * 2);
- cmd.get_query_pool_results(0, layer_index * 2, results);
- for (int i = 0; i < layer_index; i++)
- {
- uint64_t start = results[i * 2];
- uint64_t end = results[i * 2 + 1];
- if (start == 0 || end == 0)
- continue;
-
- double duration_us = (end - start) * vkdev->info.timestamp_period() / 1000;
- NCNN_LOGE("%-24s %-30s %8.2lfus |", layers[i]->type.c_str(), layers[i]->name.c_str(), duration_us);
- }
- #endif // NCNN_BENCHMARK
-
- cmd.reset();
- if (ret != 0)
- return ret;
- }
-
- if (layer->support_vulkan)
- {
- #if NCNN_BENCHMARK
- cmd.record_write_timestamp(layer_index * 2);
- #endif
- if (layer->featmask)
- {
- ret = do_forward_layer(layer, blob_mats_gpu, cmd, get_masked_option(opt, layer->featmask));
- }
- else
- {
- ret = do_forward_layer(layer, blob_mats_gpu, cmd, opt);
- }
- #if NCNN_BENCHMARK
- cmd.record_write_timestamp(layer_index * 2 + 1);
- #endif
- }
- else
- {
- #if NCNN_BENCHMARK
- double start = get_current_time();
- Mat bottom_blob;
- if (layer->one_blob_only)
- {
- int bottom_blob_index = layer->bottoms[0];
- bottom_blob = blob_mats[bottom_blob_index].shape();
- }
- #endif
- if (layer->featmask)
- {
- ret = do_forward_layer(layer, blob_mats, get_masked_option(opt, layer->featmask));
- }
- else
- {
- ret = do_forward_layer(layer, blob_mats, opt);
- }
- #if NCNN_BENCHMARK
- double end = get_current_time();
- if (layer->one_blob_only)
- {
- int top_blob_index = layer->tops[0];
- benchmark(layer, bottom_blob, blob_mats[top_blob_index], start, end);
- }
- else
- {
- benchmark(layer, start, end);
- }
- #endif
- }
- if (ret != 0)
- return ret;
-
- // NCNN_LOGE("forward_layer %d %d %s done", layer->support_vulkan, layer_index, layer->name.c_str());
-
- return 0;
- }
-
- int NetPrivate::forward_layer(int layer_index, std::vector<Mat>& blob_mats, std::vector<VkMat>& blob_mats_gpu, std::vector<VkImageMat>& blob_mats_gpu_image, VkCompute& cmd, const Option& opt) const
- {
- const Layer* layer = layers[layer_index];
-
- // NCNN_LOGE("forward_layer %d %d %s", layer->support_vulkan, layer_index, layer->name.c_str());
-
- bool cmd_submit_and_wait = false;
- bool image_allocation_failed = false;
-
- IMAGE_ALLOCATION_FAILED:
-
- if (image_allocation_failed)
- {
- #if NCNN_STRING
- NCNN_LOGE("forward_layer %d %s image allocation failed, fallback to cpu", layer_index, layer->name.c_str());
- #else
- NCNN_LOGE("forward_layer %d image allocation failed, fallback to cpu", layer_index);
- #endif
- }
-
- // load bottom blobs
- for (size_t i = 0; i < layer->bottoms.size(); i++)
- {
- int bottom_blob_index = layer->bottoms[i];
-
- if (blob_mats_gpu_image[bottom_blob_index].dims == 0 && blob_mats_gpu[bottom_blob_index].dims == 0 && blob_mats[bottom_blob_index].dims == 0)
- {
- int ret = forward_layer(blobs[bottom_blob_index].producer, blob_mats, blob_mats_gpu, blob_mats_gpu_image, cmd, opt);
- if (ret != 0)
- return ret;
- }
-
- if (layer->support_vulkan && !image_allocation_failed)
- {
- if (layer->support_image_storage)
- {
- if (blob_mats_gpu_image[bottom_blob_index].dims == 0)
- {
- if (blob_mats_gpu[bottom_blob_index].dims == 0)
- {
- // host to image
- cmd.record_upload(blob_mats[bottom_blob_index], blob_mats_gpu_image[bottom_blob_index], opt);
-
- if (blob_mats_gpu_image[bottom_blob_index].empty())
- {
- image_allocation_failed = true;
- goto IMAGE_ALLOCATION_FAILED;
- }
-
- if (opt.lightmode)
- {
- // delete after taken in light mode
- blob_mats[bottom_blob_index].release();
- }
- }
- else
- {
- // buffer to image
- cmd.record_buffer_to_image(blob_mats_gpu[bottom_blob_index], blob_mats_gpu_image[bottom_blob_index], opt);
-
- if (blob_mats_gpu_image[bottom_blob_index].empty())
- {
- image_allocation_failed = true;
- goto IMAGE_ALLOCATION_FAILED;
- }
-
- if (opt.lightmode)
- {
- // delete after taken in light mode
- blob_mats_gpu[bottom_blob_index].release();
- }
- }
- }
- }
- else
- {
- if (blob_mats_gpu[bottom_blob_index].dims == 0)
- {
- if (blob_mats_gpu_image[bottom_blob_index].dims == 0)
- {
- // host to buffer
- cmd.record_upload(blob_mats[bottom_blob_index], blob_mats_gpu[bottom_blob_index], opt);
-
- if (opt.lightmode)
- {
- // delete after taken in light mode
- blob_mats[bottom_blob_index].release();
- }
- }
- else
- {
- // image to buffer
- cmd.record_image_to_buffer(blob_mats_gpu_image[bottom_blob_index], blob_mats_gpu[bottom_blob_index], opt);
-
- if (opt.lightmode)
- {
- // delete after taken in light mode
- blob_mats_gpu_image[bottom_blob_index].release();
- }
- }
- }
- }
- }
- else
- {
- if (blob_mats[bottom_blob_index].dims == 0)
- {
- if (blob_mats_gpu_image[bottom_blob_index].dims == 0)
- {
- // buffer to host
- cmd.record_download(blob_mats_gpu[bottom_blob_index], blob_mats[bottom_blob_index], opt);
-
- if (opt.lightmode)
- {
- // delete after taken in light mode
- blob_mats_gpu[bottom_blob_index].release();
- }
-
- cmd_submit_and_wait = true;
- }
- else
- {
- // image to host
- cmd.record_download(blob_mats_gpu_image[bottom_blob_index], blob_mats[bottom_blob_index], opt);
-
- if (opt.lightmode)
- {
- // delete after taken in light mode
- blob_mats_gpu_image[bottom_blob_index].release();
- }
-
- cmd_submit_and_wait = true;
- }
- }
- }
- }
-
- int ret;
- if (cmd_submit_and_wait)
- {
- ret = cmd.submit_and_wait();
-
- #if NCNN_BENCHMARK
- std::vector<uint64_t> results(layer_index * 2);
- cmd.get_query_pool_results(0, layer_index * 2, results);
- for (int i = 0; i < layer_index; i++)
- {
- uint64_t start = results[i * 2];
- uint64_t end = results[i * 2 + 1];
- if (start == 0 || end == 0)
- continue;
-
- double duration_us = (end - start) * vkdev->info.timestamp_period() / 1000;
- NCNN_LOGE("%-24s %-30s %8.2lfus |", layers[i]->type.c_str(), layers[i]->name.c_str(), duration_us);
- }
- #endif // NCNN_BENCHMARK
-
- cmd.reset();
-
- if (ret != 0)
- return ret;
- }
-
- if (layer->support_vulkan && !image_allocation_failed)
- {
- #if NCNN_BENCHMARK
- cmd.record_write_timestamp(layer_index * 2);
- #endif
- if (layer->support_image_storage)
- {
- if (layer->featmask)
- {
- ret = do_forward_layer(layer, blob_mats_gpu_image, cmd, get_masked_option(opt, layer->featmask));
- }
- else
- {
- ret = do_forward_layer(layer, blob_mats_gpu_image, cmd, opt);
- }
- if (ret == -100)
- {
- image_allocation_failed = true;
- goto IMAGE_ALLOCATION_FAILED;
- }
- }
- else
- {
- if (layer->featmask)
- {
- ret = do_forward_layer(layer, blob_mats_gpu, cmd, get_masked_option(opt, layer->featmask));
- }
- else
- {
- ret = do_forward_layer(layer, blob_mats_gpu, cmd, opt);
- }
- }
- #if NCNN_BENCHMARK
- cmd.record_write_timestamp(layer_index * 2 + 1);
- #endif
- }
- else
- {
- #if NCNN_BENCHMARK
- double start = get_current_time();
- Mat bottom_blob;
- if (layer->one_blob_only)
- {
- int bottom_blob_index = layer->bottoms[0];
- bottom_blob = blob_mats[bottom_blob_index].shape();
- }
- #endif
- if (layer->featmask)
- {
- ret = do_forward_layer(layer, blob_mats, get_masked_option(opt, layer->featmask));
- }
- else
- {
- ret = do_forward_layer(layer, blob_mats, opt);
- }
- #if NCNN_BENCHMARK
- double end = get_current_time();
- if (layer->one_blob_only)
- {
- int top_blob_index = layer->tops[0];
- benchmark(layer, bottom_blob, blob_mats[top_blob_index], start, end);
- }
- else
- {
- benchmark(layer, start, end);
- }
- #endif
- }
- if (ret != 0)
- return ret;
-
- // NCNN_LOGE("forward_layer %d %d %s done", layer->support_vulkan, layer_index, layer->name.c_str());
-
- return 0;
- }
- #endif // NCNN_VULKAN
-
- int NetPrivate::convert_layout(Mat& bottom_blob, const Layer* layer, const Option& opt) const
- {
- // clang-format off
- // *INDENT-OFF*
- #if NCNN_ARM82
- if (opt.use_fp16_storage && cpu_support_arm_asimdhp())
- {
- if (bottom_blob.elembits() == 32 && layer->support_fp16_storage)
- {
- Mat bottom_blob_fp16;
- cast_float32_to_float16(bottom_blob, bottom_blob_fp16, opt);
- bottom_blob = bottom_blob_fp16;
- }
- if (bottom_blob.elembits() == 16 && !layer->support_fp16_storage)
- {
- Mat bottom_blob_fp32;
- cast_float16_to_float32(bottom_blob, bottom_blob_fp32, opt);
- bottom_blob = bottom_blob_fp32;
- }
- }
- else
- #endif // NCNN_ARM82
- #if NCNN_RVV
- if (opt.use_fp16_storage && cpu_support_riscv_v() && cpu_support_riscv_zfh())
- {
- if (bottom_blob.elembits() == 32 && layer->support_fp16_storage)
- {
- Mat bottom_blob_fp16;
- cast_float32_to_float16(bottom_blob, bottom_blob_fp16, opt);
- bottom_blob = bottom_blob_fp16;
- }
- if (bottom_blob.elembits() == 16 && !layer->support_fp16_storage)
- {
- Mat bottom_blob_fp32;
- cast_float16_to_float32(bottom_blob, bottom_blob_fp32, opt);
- bottom_blob = bottom_blob_fp32;
- }
- }
- else
- #endif // NCNN_RVV
- #if NCNN_BF16
- if (opt.use_bf16_storage)
- {
- if (bottom_blob.elembits() == 32 && layer->support_bf16_storage)
- {
- Mat bottom_blob_bf16;
- cast_float32_to_bfloat16(bottom_blob, bottom_blob_bf16, opt);
- bottom_blob = bottom_blob_bf16;
- }
- if (bottom_blob.elembits() == 16 && !layer->support_bf16_storage)
- {
- Mat bottom_blob_fp32;
- cast_bfloat16_to_float32(bottom_blob, bottom_blob_fp32, opt);
- bottom_blob = bottom_blob_fp32;
- }
- }
- else
- #endif // NCNN_BF16
- {
- // no type conversion
- }
- // *INDENT-ON*
- // clang-format on
-
- int dst_elempack = 1;
- if (opt.use_packing_layout)
- {
- // resolve dst_elempack
- int dims = bottom_blob.dims;
- int elemcount = 0;
- if (dims == 1) elemcount = bottom_blob.elempack * bottom_blob.w;
- if (dims == 2) elemcount = bottom_blob.elempack * bottom_blob.h;
- if (dims == 3 || dims == 4) elemcount = bottom_blob.elempack * bottom_blob.c;
-
- int elembits = bottom_blob.elembits();
-
- if (layer->support_packing)
- {
- if (elembits == 32)
- {
- #if NCNN_AVX512
- if (elemcount % 16 == 0 && ncnn::cpu_support_x86_avx512())
- dst_elempack = 16;
- else if (elemcount % 8 == 0 && ncnn::cpu_support_x86_avx())
- dst_elempack = 8;
- else if (elemcount % 4 == 0)
- dst_elempack = 4;
- #elif NCNN_AVX
- if (elemcount % 8 == 0 && ncnn::cpu_support_x86_avx())
- dst_elempack = 8;
- else if (elemcount % 4 == 0)
- dst_elempack = 4;
- #elif NCNN_RVV
- const int packn = ncnn::cpu_riscv_vlenb() / 4;
- if (elemcount % packn == 0)
- dst_elempack = packn;
- #else
- if (elemcount % 4 == 0)
- dst_elempack = 4;
- #endif
- }
- if (elembits == 16)
- {
- #if NCNN_ARM82
- if (elemcount % 8 == 0 && ncnn::cpu_support_arm_asimdhp() && opt.use_fp16_arithmetic)
- dst_elempack = 8;
- else if (elemcount % 4 == 0)
- dst_elempack = 4;
- #elif NCNN_RVV
- const int packn = ncnn::cpu_riscv_vlenb() / 2;
- if (elemcount % packn == 0)
- dst_elempack = packn;
- #else
- if (elemcount % 4 == 0)
- dst_elempack = 4;
- #endif
- }
- if (elembits == 8)
- {
- #if NCNN_RVV
- const int packn = ncnn::cpu_riscv_vlenb() / 1;
- if (elemcount % packn == 0)
- dst_elempack = packn;
- #else
- if (elemcount % 8 == 0)
- dst_elempack = 8;
- #endif
- }
- }
- }
-
- if (bottom_blob.elempack != dst_elempack)
- {
- Mat bottom_blob_packed;
- convert_packing(bottom_blob, bottom_blob_packed, dst_elempack, opt);
- bottom_blob = bottom_blob_packed;
- }
-
- return 0;
- }
-
- int NetPrivate::do_forward_layer(const Layer* layer, std::vector<Mat>& blob_mats, const Option& opt) const
- {
- if (layer->one_blob_only)
- {
- int bottom_blob_index = layer->bottoms[0];
- int top_blob_index = layer->tops[0];
-
- Mat& bottom_blob_ref = blob_mats[bottom_blob_index];
- Mat bottom_blob;
-
- if (opt.lightmode)
- {
- // deep copy for inplace forward if data is shared
- if (layer->support_inplace && *bottom_blob_ref.refcount != 1)
- {
- bottom_blob = bottom_blob_ref.clone(opt.blob_allocator);
- }
- }
- if (bottom_blob.dims == 0)
- {
- bottom_blob = bottom_blob_ref;
- }
-
- convert_layout(bottom_blob, layer, opt);
-
- // forward
- if (opt.lightmode && layer->support_inplace)
- {
- Mat& bottom_top_blob = bottom_blob;
- int ret = layer->forward_inplace(bottom_top_blob, opt);
- if (ret != 0)
- return ret;
-
- // store top blob
- blob_mats[top_blob_index] = bottom_top_blob;
- }
- else
- {
- Mat top_blob;
- int ret = layer->forward(bottom_blob, top_blob, opt);
- if (ret != 0)
- return ret;
-
- // store top blob
- blob_mats[top_blob_index] = top_blob;
- }
-
- if (opt.lightmode)
- {
- // delete after taken in light mode
- blob_mats[bottom_blob_index].release();
- }
- }
- else
- {
- std::vector<Mat> bottom_blobs(layer->bottoms.size());
- for (size_t i = 0; i < layer->bottoms.size(); i++)
- {
- int bottom_blob_index = layer->bottoms[i];
-
- Mat& bottom_blob_ref = blob_mats[bottom_blob_index];
- bottom_blobs[i].release();
-
- if (opt.lightmode)
- {
- // deep copy for inplace forward if data is shared
- if (layer->support_inplace && *bottom_blob_ref.refcount != 1)
- {
- bottom_blobs[i] = bottom_blob_ref.clone(opt.blob_allocator);
- }
- }
- if (bottom_blobs[i].dims == 0)
- {
- bottom_blobs[i] = bottom_blob_ref;
- }
-
- convert_layout(bottom_blobs[i], layer, opt);
- }
-
- // forward
- if (opt.lightmode && layer->support_inplace)
- {
- std::vector<Mat>& bottom_top_blobs = bottom_blobs;
- int ret = layer->forward_inplace(bottom_top_blobs, opt);
- if (ret != 0)
- return ret;
-
- // store top blobs
- for (size_t i = 0; i < layer->tops.size(); i++)
- {
- int top_blob_index = layer->tops[i];
-
- blob_mats[top_blob_index] = bottom_top_blobs[i];
- }
- }
- else
- {
- std::vector<Mat> top_blobs(layer->tops.size());
- int ret = layer->forward(bottom_blobs, top_blobs, opt);
- if (ret != 0)
- return ret;
-
- // store top blobs
- for (size_t i = 0; i < layer->tops.size(); i++)
- {
- int top_blob_index = layer->tops[i];
-
- blob_mats[top_blob_index] = top_blobs[i];
- }
- }
-
- if (opt.lightmode)
- {
- for (size_t i = 0; i < layer->bottoms.size(); i++)
- {
- int bottom_blob_index = layer->bottoms[i];
-
- // delete after taken in light mode
- blob_mats[bottom_blob_index].release();
- }
- }
- }
-
- return 0;
- }
-
- #if NCNN_VULKAN
- int NetPrivate::do_forward_layer(const Layer* layer, std::vector<VkMat>& blob_mats_gpu, VkCompute& cmd, const Option& opt) const
- {
- if (layer->one_blob_only)
- {
- // load bottom blob
- int bottom_blob_index = layer->bottoms[0];
- int top_blob_index = layer->tops[0];
-
- VkMat& bottom_blob_ref = blob_mats_gpu[bottom_blob_index];
- VkMat bottom_blob;
-
- if (opt.lightmode)
- {
- // deep copy for inplace forward if data is shared
- if (layer->support_inplace && *bottom_blob_ref.refcount != 1)
- {
- cmd.record_clone(bottom_blob_ref, bottom_blob, opt);
- // NCNN_LOGE("clone %p[+%lu] %p[+%lu]", bottom_blob_ref.buffer(), bottom_blob_ref.buffer_offset(), bottom_blob.buffer(), bottom_blob.buffer_offset());
- }
- }
- if (bottom_blob.dims == 0)
- {
- bottom_blob = bottom_blob_ref;
- }
-
- // forward
- if (opt.lightmode && layer->support_inplace)
- {
- VkMat& bottom_top_blob = bottom_blob;
- int ret = layer->forward_inplace(bottom_top_blob, cmd, opt);
- if (ret != 0)
- return ret;
-
- // store top blob
- blob_mats_gpu[top_blob_index] = bottom_top_blob;
- }
- else
- {
- VkMat top_blob;
- int ret = layer->forward(bottom_blob, top_blob, cmd, opt);
- if (ret != 0)
- return ret;
-
- // store top blob
- blob_mats_gpu[top_blob_index] = top_blob;
- }
-
- if (opt.lightmode)
- {
- // delete after taken in light mode
- blob_mats_gpu[bottom_blob_index].release();
- }
- }
- else
- {
- // load bottom blobs
- std::vector<VkMat> bottom_blobs(layer->bottoms.size());
- for (size_t i = 0; i < layer->bottoms.size(); i++)
- {
- int bottom_blob_index = layer->bottoms[i];
-
- VkMat& bottom_blob_ref = blob_mats_gpu[bottom_blob_index];
- bottom_blobs[i].release();
-
- if (opt.lightmode)
- {
- // deep copy for inplace forward if data is shared
- if (layer->support_inplace && *bottom_blob_ref.refcount != 1)
- {
- cmd.record_clone(bottom_blob_ref, bottom_blobs[i], opt);
- // NCNN_LOGE("clone %p[+%lu] %p[+%lu]", bottom_blob_ref.buffer(), bottom_blob_ref.buffer_offset(), bottom_blobs[i].buffer(), bottom_blobs[i].buffer_offset());
- }
- }
- if (bottom_blobs[i].dims == 0)
- {
- bottom_blobs[i] = bottom_blob_ref;
- }
- }
-
- // forward
- if (opt.lightmode && layer->support_inplace)
- {
- std::vector<VkMat>& bottom_top_blobs = bottom_blobs;
- int ret = layer->forward_inplace(bottom_top_blobs, cmd, opt);
- if (ret != 0)
- return ret;
-
- // store top blobs
- for (size_t i = 0; i < layer->tops.size(); i++)
- {
- int top_blob_index = layer->tops[i];
-
- blob_mats_gpu[top_blob_index] = bottom_top_blobs[i];
- }
- }
- else
- {
- std::vector<VkMat> top_blobs(layer->tops.size());
- int ret = layer->forward(bottom_blobs, top_blobs, cmd, opt);
- if (ret != 0)
- return ret;
-
- // store top blobs
- for (size_t i = 0; i < layer->tops.size(); i++)
- {
- int top_blob_index = layer->tops[i];
-
- blob_mats_gpu[top_blob_index] = top_blobs[i];
- }
- }
-
- if (opt.lightmode)
- {
- for (size_t i = 0; i < layer->bottoms.size(); i++)
- {
- int bottom_blob_index = layer->bottoms[i];
-
- // delete after taken in light mode
- blob_mats_gpu[bottom_blob_index].release();
- }
- }
- }
-
- return 0;
- }
-
- int NetPrivate::do_forward_layer(const Layer* layer, std::vector<VkImageMat>& blob_mats_gpu_image, VkCompute& cmd, const Option& opt) const
- {
- if (layer->one_blob_only)
- {
- // load bottom blob
- int bottom_blob_index = layer->bottoms[0];
- int top_blob_index = layer->tops[0];
-
- VkImageMat& bottom_blob_ref = blob_mats_gpu_image[bottom_blob_index];
- VkImageMat bottom_blob;
-
- if (opt.lightmode)
- {
- // deep copy for inplace forward if data is shared
- if (layer->support_inplace && *bottom_blob_ref.refcount != 1)
- {
- cmd.record_clone(bottom_blob_ref, bottom_blob, opt);
- // NCNN_LOGE("clone %p[+%lu] %p[+%lu]", bottom_blob_ref.buffer(), bottom_blob_ref.buffer_offset(), bottom_blob.buffer(), bottom_blob.buffer_offset());
- }
- }
- if (bottom_blob.dims == 0)
- {
- bottom_blob = bottom_blob_ref;
- }
-
- // forward
- if (opt.lightmode && layer->support_inplace)
- {
- VkImageMat& bottom_top_blob = bottom_blob;
- int ret = layer->forward_inplace(bottom_top_blob, cmd, opt);
- if (ret != 0)
- return ret;
-
- // store top blob
- blob_mats_gpu_image[top_blob_index] = bottom_top_blob;
- }
- else
- {
- VkImageMat top_blob;
- int ret = layer->forward(bottom_blob, top_blob, cmd, opt);
- if (ret != 0)
- return ret;
-
- // store top blob
- blob_mats_gpu_image[top_blob_index] = top_blob;
- }
-
- if (opt.lightmode)
- {
- // delete after taken in light mode
- blob_mats_gpu_image[bottom_blob_index].release();
- }
- }
- else
- {
- // load bottom blobs
- std::vector<VkImageMat> bottom_blobs(layer->bottoms.size());
- for (size_t i = 0; i < layer->bottoms.size(); i++)
- {
- int bottom_blob_index = layer->bottoms[i];
-
- VkImageMat& bottom_blob_ref = blob_mats_gpu_image[bottom_blob_index];
-
- if (opt.lightmode)
- {
- // deep copy for inplace forward if data is shared
- if (layer->support_inplace && *bottom_blob_ref.refcount != 1)
- {
- cmd.record_clone(bottom_blob_ref, bottom_blobs[i], opt);
- // NCNN_LOGE("clone %p[+%lu] %p[+%lu]", bottom_blob_ref.buffer(), bottom_blob_ref.buffer_offset(), bottom_blobs[i].buffer(), bottom_blobs[i].buffer_offset());
- }
- }
- if (bottom_blobs[i].dims == 0)
- {
- bottom_blobs[i] = bottom_blob_ref;
- }
- }
-
- // forward
- if (opt.lightmode && layer->support_inplace)
- {
- std::vector<VkImageMat>& bottom_top_blobs = bottom_blobs;
- int ret = layer->forward_inplace(bottom_top_blobs, cmd, opt);
- if (ret != 0)
- return ret;
-
- // store top blobs
- for (size_t i = 0; i < layer->tops.size(); i++)
- {
- int top_blob_index = layer->tops[i];
-
- blob_mats_gpu_image[top_blob_index] = bottom_top_blobs[i];
- }
- }
- else
- {
- std::vector<VkImageMat> top_blobs(layer->tops.size());
- int ret = layer->forward(bottom_blobs, top_blobs, cmd, opt);
- if (ret != 0)
- return ret;
-
- // store top blobs
- for (size_t i = 0; i < layer->tops.size(); i++)
- {
- int top_blob_index = layer->tops[i];
-
- blob_mats_gpu_image[top_blob_index] = top_blobs[i];
- }
- }
-
- if (opt.lightmode)
- {
- for (size_t i = 0; i < layer->bottoms.size(); i++)
- {
- int bottom_blob_index = layer->bottoms[i];
-
- // delete after taken in light mode
- blob_mats_gpu_image[bottom_blob_index].release();
- }
- }
- }
-
- return 0;
- }
- #endif // NCNN_VULKAN
-
- void NetPrivate::update_input_output_indexes()
- {
- input_blob_indexes.clear();
- output_blob_indexes.clear();
-
- for (size_t i = 0; i < layers.size(); i++)
- {
- if (layers[i]->typeindex == LayerType::Input)
- {
- int blob_index = layers[i]->tops[0];
- input_blob_indexes.push_back(blob_index);
- }
- }
-
- for (size_t i = 0; i < blobs.size(); i++)
- {
- if (blobs[i].producer != -1 && blobs[i].consumer == -1)
- {
- output_blob_indexes.push_back(i);
- }
- }
- }
-
- #if NCNN_STRING
- void NetPrivate::update_input_output_names()
- {
- input_blob_names.clear();
- output_blob_names.clear();
-
- for (size_t i = 0; i < input_blob_indexes.size(); i++)
- {
- int blob_index = input_blob_indexes[i];
- input_blob_names.push_back(blobs[blob_index].name.c_str());
- }
-
- for (size_t i = 0; i < output_blob_indexes.size(); i++)
- {
- int blob_index = output_blob_indexes[i];
- output_blob_names.push_back(blobs[blob_index].name.c_str());
- }
- }
- #endif // NCNN_STRING
-
- Net::Net()
- : d(new NetPrivate(opt))
- {
- }
-
- Net::~Net()
- {
- clear();
-
- delete d;
- }
-
- Net::Net(const Net&)
- : d(0)
- {
- }
-
- Net& Net::operator=(const Net&)
- {
- return *this;
- }
-
- #if NCNN_STRING
- int Net::register_custom_layer(const char* type, layer_creator_func creator, layer_destroyer_func destroyer, void* userdata)
- {
- int typeindex = layer_to_index(type);
- if (typeindex != -1)
- {
- NCNN_LOGE("overwrite built-in layer type %s", type);
-
- for (size_t i = 0; i < d->overwrite_builtin_layer_registry.size(); i++)
- {
- if (d->overwrite_builtin_layer_registry[i].typeindex == typeindex)
- {
- NCNN_LOGE("overwrite existing overwritten built-in layer index %d", typeindex);
-
- d->overwrite_builtin_layer_registry[i].creator = creator;
- d->overwrite_builtin_layer_registry[i].destroyer = destroyer;
- d->overwrite_builtin_layer_registry[i].userdata = userdata;
- return 0;
- }
- }
-
- struct overwrite_builtin_layer_registry_entry entry = {typeindex, creator, destroyer, userdata};
- d->overwrite_builtin_layer_registry.push_back(entry);
- return 0;
- }
-
- int custom_index = custom_layer_to_index(type);
- if (custom_index == -1)
- {
- struct custom_layer_registry_entry entry = {type, creator, destroyer, userdata};
- d->custom_layer_registry.push_back(entry);
- }
- else
- {
- NCNN_LOGE("overwrite existing custom layer type %s", type);
- d->custom_layer_registry[custom_index].name = type;
- d->custom_layer_registry[custom_index].creator = creator;
- d->custom_layer_registry[custom_index].destroyer = destroyer;
- d->custom_layer_registry[custom_index].userdata = userdata;
- }
-
- return 0;
- }
- #endif // NCNN_STRING
-
- int Net::register_custom_layer(int index, layer_creator_func creator, layer_destroyer_func destroyer, void* userdata)
- {
- int custom_index = index & ~LayerType::CustomBit;
- if (index == custom_index)
- {
- NCNN_LOGE("overwrite built-in layer type %d", index);
-
- for (size_t i = 0; i < d->overwrite_builtin_layer_registry.size(); i++)
- {
- if (d->overwrite_builtin_layer_registry[i].typeindex == index)
- {
- NCNN_LOGE("overwrite existing overwritten built-in layer index %d", index);
-
- d->overwrite_builtin_layer_registry[i].creator = creator;
- d->overwrite_builtin_layer_registry[i].destroyer = destroyer;
- d->overwrite_builtin_layer_registry[i].userdata = userdata;
- return 0;
- }
- }
-
- struct overwrite_builtin_layer_registry_entry entry = {index, creator, destroyer, userdata};
- d->overwrite_builtin_layer_registry.push_back(entry);
- return 0;
- }
-
- if ((int)d->custom_layer_registry.size() <= custom_index)
- {
- #if NCNN_STRING
- struct custom_layer_registry_entry dummy = {"", 0, 0, 0};
- #else
- struct custom_layer_registry_entry dummy = {0, 0, 0};
- #endif // NCNN_STRING
- d->custom_layer_registry.resize(custom_index + 1, dummy);
- }
-
- if (d->custom_layer_registry[custom_index].creator)
- {
- NCNN_LOGE("overwrite existing custom layer index %d", custom_index);
- }
-
- d->custom_layer_registry[custom_index].creator = creator;
- d->custom_layer_registry[custom_index].destroyer = destroyer;
- d->custom_layer_registry[custom_index].userdata = userdata;
- return 0;
- }
-
- #if NCNN_STRING
- int Net::load_param(const DataReader& dr)
- {
- #define SCAN_VALUE(fmt, v) \
- if (dr.scan(fmt, &v) != 1) \
- { \
- NCNN_LOGE("parse " #v " failed"); \
- return -1; \
- }
-
- int magic = 0;
- SCAN_VALUE("%d", magic)
- if (magic != 7767517)
- {
- NCNN_LOGE("param is too old, please regenerate");
- return -1;
- }
-
- // parse
- int layer_count = 0;
- int blob_count = 0;
- SCAN_VALUE("%d", layer_count)
- SCAN_VALUE("%d", blob_count)
- if (layer_count <= 0 || blob_count <= 0)
- {
- NCNN_LOGE("invalid layer_count or blob_count");
- return -1;
- }
-
- d->layers.resize((size_t)layer_count);
- d->blobs.resize((size_t)blob_count);
-
- #if NCNN_VULKAN
- // TODO enable gpu when bf16 conversion implemented
- if (opt.use_bf16_storage)
- opt.use_vulkan_compute = false;
-
- if (opt.use_vulkan_compute)
- {
- if (!d->vkdev) d->vkdev = get_gpu_device();
- if (!d->vkdev) opt.use_vulkan_compute = false; // no vulkan device, fallback to cpu
- }
- if (opt.use_vulkan_compute)
- {
- // sanitize use options
- if (!d->vkdev->info.support_fp16_packed()) opt.use_fp16_packed = false;
- if (!d->vkdev->info.support_fp16_storage()) opt.use_fp16_storage = false;
- if (!d->vkdev->info.support_fp16_arithmetic()) opt.use_fp16_arithmetic = false;
- if (!d->vkdev->info.support_int8_storage()) opt.use_int8_storage = false;
- if (!d->vkdev->info.support_int8_arithmetic()) opt.use_int8_arithmetic = false;
- if (!d->vkdev->info.support_cooperative_matrix()) opt.use_cooperative_matrix = false;
-
- if (d->vkdev->info.bug_buffer_image_load_zero()) opt.use_image_storage = false;
-
- // enable local memory optimization on discrete gpu only
- if (d->vkdev->info.type() != 0) opt.use_shader_local_memory = false;
-
- // fp16a makes no sense when fp16 storage disabled
- if (!opt.use_fp16_packed && !opt.use_fp16_storage) opt.use_fp16_arithmetic = false;
- }
- else
- {
- // fp16a makes no sense when fp16 storage disabled
- if (!opt.use_fp16_storage) opt.use_fp16_arithmetic = false;
- }
- #endif // NCNN_VULKAN
-
- ParamDict pd;
-
- int blob_index = 0;
- for (int i = 0; i < layer_count; i++)
- {
- char layer_type[256];
- char layer_name[256];
- int bottom_count = 0;
- int top_count = 0;
- SCAN_VALUE("%255s", layer_type)
- SCAN_VALUE("%255s", layer_name)
- SCAN_VALUE("%d", bottom_count)
- SCAN_VALUE("%d", top_count)
-
- Layer* layer = create_overwrite_builtin_layer(layer_type);
- if (!layer)
- {
- layer = create_layer(layer_type);
- }
- if (!layer)
- {
- layer = create_custom_layer(layer_type);
- }
- if (!layer)
- {
- NCNN_LOGE("layer %s not exists or registered", layer_type);
- clear();
- return -1;
- }
-
- #if NCNN_VULKAN
- if (opt.use_vulkan_compute)
- layer->vkdev = d->vkdev;
- #endif // NCNN_VULKAN
-
- layer->type = std::string(layer_type);
- layer->name = std::string(layer_name);
- // NCNN_LOGE("new layer %d %s", i, layer_name);
-
- layer->bottoms.resize(bottom_count);
-
- for (int j = 0; j < bottom_count; j++)
- {
- char bottom_name[256];
- SCAN_VALUE("%255s", bottom_name)
-
- int bottom_blob_index = find_blob_index_by_name(bottom_name);
- if (bottom_blob_index == -1)
- {
- Blob& blob = d->blobs[blob_index];
-
- bottom_blob_index = blob_index;
-
- blob.name = std::string(bottom_name);
- // NCNN_LOGE("new blob %s", bottom_name);
-
- blob_index++;
- }
-
- Blob& blob = d->blobs[bottom_blob_index];
-
- blob.consumer = i;
-
- layer->bottoms[j] = bottom_blob_index;
- }
-
- layer->tops.resize(top_count);
- for (int j = 0; j < top_count; j++)
- {
- Blob& blob = d->blobs[blob_index];
-
- char blob_name[256];
- SCAN_VALUE("%255s", blob_name)
-
- blob.name = std::string(blob_name);
- // NCNN_LOGE("new blob %s", blob_name);
-
- blob.producer = i;
-
- layer->tops[j] = blob_index;
-
- blob_index++;
- }
-
- // layer specific params
- int pdlr = pd.load_param(dr);
- if (pdlr != 0)
- {
- NCNN_LOGE("ParamDict load_param %d %s failed", i, layer->name.c_str());
- continue;
- }
-
- if (layer->support_int8_storage)
- {
- // no int8 gpu support yet
- opt.use_vulkan_compute = false;
- }
-
- // pull out top shape hints
- Mat shape_hints = pd.get(30, Mat());
- if (!shape_hints.empty())
- {
- const int* psh = shape_hints;
- for (int j = 0; j < top_count; j++)
- {
- Blob& blob = d->blobs[layer->tops[j]];
-
- int dims = psh[0];
- if (dims == 1)
- {
- blob.shape = Mat(psh[1], (void*)0, 4u, 1);
- }
- if (dims == 2)
- {
- blob.shape = Mat(psh[1], psh[2], (void*)0, 4u, 1);
- }
- if (dims == 3)
- {
- blob.shape = Mat(psh[1], psh[2], psh[3], (void*)0, 4u, 1);
- }
-
- psh += 4;
- }
- }
-
- // set bottom and top shape hints
- layer->bottom_shapes.resize(bottom_count);
- for (int j = 0; j < bottom_count; j++)
- {
- layer->bottom_shapes[j] = d->blobs[layer->bottoms[j]].shape;
- }
-
- layer->top_shapes.resize(top_count);
- for (int j = 0; j < top_count; j++)
- {
- layer->top_shapes[j] = d->blobs[layer->tops[j]].shape;
- }
-
- // pull out layer specific feature disabled set
- layer->featmask = pd.get(31, 0);
-
- int lr = layer->load_param(pd);
- if (lr != 0)
- {
- NCNN_LOGE("layer load_param %d %s failed", i, layer->name.c_str());
- continue;
- }
-
- d->layers[i] = layer;
- }
-
- d->update_input_output_indexes();
- d->update_input_output_names();
-
- #undef SCAN_VALUE
- return 0;
- }
- #endif // NCNN_STRING
-
- int Net::load_param_bin(const DataReader& dr)
- {
- #define READ_VALUE(buf) \
- if (dr.read(&buf, sizeof(buf)) != sizeof(buf)) \
- { \
- NCNN_LOGE("read " #buf " failed"); \
- return -1; \
- }
-
- int magic = 0;
- READ_VALUE(magic)
- if (magic != 7767517)
- {
- NCNN_LOGE("param is too old, please regenerate");
- return -1;
- }
-
- int layer_count = 0;
- int blob_count = 0;
- READ_VALUE(layer_count)
- READ_VALUE(blob_count)
- if (layer_count <= 0 || blob_count <= 0)
- {
- NCNN_LOGE("invalid layer_count or blob_count");
- return -1;
- }
-
- d->layers.resize(layer_count);
- d->blobs.resize(blob_count);
-
- #if NCNN_VULKAN
- // TODO enable gpu when bf16 conversion implemented
- if (opt.use_bf16_storage)
- opt.use_vulkan_compute = false;
-
- if (opt.use_vulkan_compute)
- {
- if (!d->vkdev) d->vkdev = get_gpu_device();
- if (!d->vkdev) opt.use_vulkan_compute = false; // no vulkan device, fallback to cpu
- }
- if (opt.use_vulkan_compute)
- {
- // sanitize use options
- if (!d->vkdev->info.support_fp16_packed()) opt.use_fp16_packed = false;
- if (!d->vkdev->info.support_fp16_storage()) opt.use_fp16_storage = false;
- if (!d->vkdev->info.support_fp16_arithmetic()) opt.use_fp16_arithmetic = false;
- if (!d->vkdev->info.support_int8_storage()) opt.use_int8_storage = false;
- if (!d->vkdev->info.support_int8_arithmetic()) opt.use_int8_arithmetic = false;
- if (!d->vkdev->info.support_cooperative_matrix()) opt.use_cooperative_matrix = false;
-
- if (d->vkdev->info.bug_buffer_image_load_zero()) opt.use_image_storage = false;
-
- // enable local memory optimization on discrete gpu only
- if (d->vkdev->info.type() != 0) opt.use_shader_local_memory = false;
-
- // fp16a makes no sense when fp16 storage disabled
- if (!opt.use_fp16_packed && !opt.use_fp16_storage) opt.use_fp16_arithmetic = false;
- }
- else
- {
- // fp16a makes no sense when fp16 storage disabled
- if (!opt.use_fp16_storage) opt.use_fp16_arithmetic = false;
- }
- #endif // NCNN_VULKAN
-
- ParamDict pd;
-
- for (int i = 0; i < layer_count; i++)
- {
- int typeindex;
- int bottom_count;
- int top_count;
- READ_VALUE(typeindex)
- READ_VALUE(bottom_count)
- READ_VALUE(top_count)
-
- Layer* layer = create_overwrite_builtin_layer(typeindex);
- if (!layer)
- {
- layer = create_layer(typeindex);
- }
- if (!layer)
- {
- int custom_index = typeindex & ~LayerType::CustomBit;
- layer = create_custom_layer(custom_index);
- }
- if (!layer)
- {
- NCNN_LOGE("layer %d not exists or registered", typeindex);
- clear();
- return -1;
- }
-
- #if NCNN_VULKAN
- if (opt.use_vulkan_compute)
- layer->vkdev = d->vkdev;
- #endif // NCNN_VULKAN
-
- // layer->type = std::string(layer_type);
- // layer->name = std::string(layer_name);
- // NCNN_LOGE("new layer %d", typeindex);
-
- layer->bottoms.resize(bottom_count);
- for (int j = 0; j < bottom_count; j++)
- {
- int bottom_blob_index;
- READ_VALUE(bottom_blob_index)
-
- Blob& blob = d->blobs[bottom_blob_index];
-
- blob.consumer = i;
-
- layer->bottoms[j] = bottom_blob_index;
- }
-
- layer->tops.resize(top_count);
- for (int j = 0; j < top_count; j++)
- {
- int top_blob_index;
- READ_VALUE(top_blob_index)
-
- Blob& blob = d->blobs[top_blob_index];
-
- // blob.name = std::string(blob_name);
- // NCNN_LOGE("new blob %s", blob_name);
-
- blob.producer = i;
-
- layer->tops[j] = top_blob_index;
- }
-
- // layer specific params
- int pdlr = pd.load_param_bin(dr);
- if (pdlr != 0)
- {
- #if NCNN_STRING
- NCNN_LOGE("ParamDict load_param %d %s failed", i, layer->name.c_str());
- #else
- NCNN_LOGE("ParamDict load_param %d failed", i);
- #endif
- continue;
- }
-
- if (layer->support_int8_storage)
- {
- // no int8 gpu support yet
- opt.use_vulkan_compute = false;
- }
-
- // pull out top blob shape hints
- Mat shape_hints = pd.get(30, Mat());
- if (!shape_hints.empty())
- {
- const int* psh = shape_hints;
- for (int j = 0; j < top_count; j++)
- {
- Blob& blob = d->blobs[layer->tops[j]];
-
- int dims = psh[0];
- if (dims == 1)
- {
- blob.shape = Mat(psh[1], (void*)0, 4u, 1);
- }
- if (dims == 2)
- {
- blob.shape = Mat(psh[1], psh[2], (void*)0, 4u, 1);
- }
- if (dims == 3)
- {
- blob.shape = Mat(psh[1], psh[2], psh[3], (void*)0, 4u, 1);
- }
-
- psh += 4;
- }
- }
-
- // set bottom and top shape hints
- layer->bottom_shapes.resize(bottom_count);
- for (int j = 0; j < bottom_count; j++)
- {
- layer->bottom_shapes[j] = d->blobs[layer->bottoms[j]].shape;
- }
-
- layer->top_shapes.resize(top_count);
- for (int j = 0; j < top_count; j++)
- {
- layer->top_shapes[j] = d->blobs[layer->tops[j]].shape;
- }
-
- // pull out layer specific feature disabled set
- layer->featmask = pd.get(31, 0);
-
- int lr = layer->load_param(pd);
- if (lr != 0)
- {
- #if NCNN_STRING
- NCNN_LOGE("layer load_param %d %s failed", i, layer->name.c_str());
- #else
- NCNN_LOGE("layer load_param %d failed", i);
- #endif
- continue;
- }
-
- d->layers[i] = layer;
- }
-
- d->update_input_output_indexes();
-
- #undef READ_VALUE
- return 0;
- }
-
- int Net::load_model(const DataReader& dr)
- {
- if (d->layers.empty())
- {
- NCNN_LOGE("network graph not ready");
- return -1;
- }
-
- int layer_count = (int)d->layers.size();
-
- // load file
- int ret = 0;
-
- #if NCNN_VULKAN
- if (opt.use_vulkan_compute)
- {
- if (!opt.pipeline_cache)
- {
- if (!d->pipeline_cache)
- d->pipeline_cache = new PipelineCache(d->vkdev);
- opt.pipeline_cache = d->pipeline_cache;
- }
- }
- #endif // NCNN_VULKAN
-
- ModelBinFromDataReader mb(dr);
- for (int i = 0; i < layer_count; i++)
- {
- Layer* layer = d->layers[i];
-
- //Here we found inconsistent content in the parameter file.
- if (!layer)
- {
- NCNN_LOGE("load_model error at layer %d, parameter file has inconsistent content.", i);
- ret = -1;
- break;
- }
-
- int lret = layer->load_model(mb);
- if (lret != 0)
- {
- #if NCNN_STRING
- NCNN_LOGE("layer load_model %d %s failed", i, layer->name.c_str());
- #else
- NCNN_LOGE("layer load_model %d failed", i);
- #endif
- ret = -1;
- break;
- }
-
- if (layer->support_int8_storage)
- {
- // no int8 gpu support yet
- opt.use_vulkan_compute = false;
- }
-
- Option opt1 = get_masked_option(opt, layer->featmask);
- #if NCNN_VULKAN
- if (opt1.use_vulkan_compute)
- {
- if (!layer->support_image_storage) opt1.use_image_storage = false;
- }
- else
- {
- layer->vkdev = 0;
- layer->support_vulkan = false;
- }
- #endif // NCNN_VULKAN
-
- int cret = layer->create_pipeline(opt1);
- if (cret != 0)
- {
- #if NCNN_STRING
- NCNN_LOGE("layer create_pipeline %d %s failed", i, layer->name.c_str());
- #else
- NCNN_LOGE("layer create_pipeline %d failed", i);
- #endif
- ret = -1;
- break;
- }
- }
-
- if (opt.use_local_pool_allocator)
- {
- if (opt.blob_allocator == 0)
- {
- if (!d->local_blob_allocator)
- {
- d->local_blob_allocator = new PoolAllocator;
- d->local_blob_allocator->set_size_compare_ratio(0.f);
- }
- }
- if (opt.workspace_allocator == 0)
- {
- if (!d->local_workspace_allocator)
- {
- d->local_workspace_allocator = new PoolAllocator;
- d->local_workspace_allocator->set_size_compare_ratio(0.f);
- }
- }
- }
-
- #if NCNN_VULKAN
- if (ret == 0 && opt.use_vulkan_compute)
- {
- ret = d->upload_model();
- }
- #endif // NCNN_VULKAN
-
- return ret;
- }
-
- #if NCNN_STDIO
- #if NCNN_STRING
- int Net::load_param(FILE* fp)
- {
- DataReaderFromStdio dr(fp);
- return load_param(dr);
- }
-
- int Net::load_param_mem(const char* _mem)
- {
- const unsigned char* mem = (const unsigned char*)_mem;
- DataReaderFromMemory dr(mem);
- return load_param(dr);
- }
-
- int Net::load_param(const char* protopath)
- {
- FILE* fp = fopen(protopath, "rb");
- if (!fp)
- {
- NCNN_LOGE("fopen %s failed", protopath);
- return -1;
- }
-
- int ret = load_param(fp);
- fclose(fp);
- return ret;
- }
- #endif // NCNN_STRING
-
- int Net::load_param_bin(FILE* fp)
- {
- DataReaderFromStdio dr(fp);
- return load_param_bin(dr);
- }
-
- int Net::load_param_bin(const char* protopath)
- {
- FILE* fp = fopen(protopath, "rb");
- if (!fp)
- {
- NCNN_LOGE("fopen %s failed", protopath);
- return -1;
- }
-
- int ret = load_param_bin(fp);
- fclose(fp);
- return ret;
- }
-
- int Net::load_model(FILE* fp)
- {
- DataReaderFromStdio dr(fp);
- return load_model(dr);
- }
-
- int Net::load_model(const char* modelpath)
- {
- FILE* fp = fopen(modelpath, "rb");
- if (!fp)
- {
- NCNN_LOGE("fopen %s failed", modelpath);
- return -1;
- }
-
- int ret = load_model(fp);
- fclose(fp);
- return ret;
- }
- #endif // NCNN_STDIO
-
- int Net::load_param(const unsigned char* _mem)
- {
- const unsigned char* mem = _mem;
- DataReaderFromMemory dr(mem);
- load_param_bin(dr);
- return static_cast<int>(mem - _mem);
- }
-
- int Net::load_model(const unsigned char* _mem)
- {
- const unsigned char* mem = _mem;
- DataReaderFromMemory dr(mem);
- load_model(dr);
- return static_cast<int>(mem - _mem);
- }
-
- #if NCNN_PLATFORM_API
- #if __ANDROID_API__ >= 9
- #if NCNN_STRING
- int Net::load_param(AAsset* asset)
- {
- DataReaderFromAndroidAsset dr(asset);
- return load_param(dr);
- }
-
- int Net::load_param(AAssetManager* mgr, const char* assetpath)
- {
- AAsset* asset = AAssetManager_open(mgr, assetpath, AASSET_MODE_BUFFER);
- if (!asset)
- {
- NCNN_LOGE("AAssetManager_open %s failed", assetpath);
- return -1;
- }
-
- int ret = load_param(asset);
- AAsset_close(asset);
- return ret;
- }
- #endif // NCNN_STRING
-
- int Net::load_param_bin(AAsset* asset)
- {
- DataReaderFromAndroidAsset dr(asset);
- return load_param_bin(dr);
- }
-
- int Net::load_param_bin(AAssetManager* mgr, const char* assetpath)
- {
- AAsset* asset = AAssetManager_open(mgr, assetpath, AASSET_MODE_BUFFER);
- if (!asset)
- {
- NCNN_LOGE("AAssetManager_open %s failed", assetpath);
- return -1;
- }
-
- int ret = load_param_bin(asset);
- AAsset_close(asset);
- return ret;
- }
-
- int Net::load_model(AAsset* asset)
- {
- DataReaderFromAndroidAsset dr(asset);
- return load_model(dr);
- }
-
- int Net::load_model(AAssetManager* mgr, const char* assetpath)
- {
- AAsset* asset = AAssetManager_open(mgr, assetpath, AASSET_MODE_STREAMING);
- if (!asset)
- {
- NCNN_LOGE("AAssetManager_open %s failed", assetpath);
- return -1;
- }
-
- int ret = load_model(asset);
- AAsset_close(asset);
- return ret;
- }
- #endif // __ANDROID_API__ >= 9
- #endif // NCNN_PLATFORM_API
-
- void Net::clear()
- {
- d->blobs.clear();
- for (size_t i = 0; i < d->layers.size(); i++)
- {
- Layer* layer = d->layers[i];
-
- Option opt1 = get_masked_option(opt, layer->featmask);
- #if NCNN_VULKAN
- if (!layer->support_image_storage)
- {
- opt1.use_image_storage = false;
- }
- #endif // NCNN_VULKAN
-
- int dret = layer->destroy_pipeline(opt1);
- if (dret != 0)
- {
- NCNN_LOGE("layer destroy_pipeline failed");
- // ignore anyway
- }
-
- if (layer->typeindex & ncnn::LayerType::CustomBit)
- {
- int custom_index = layer->typeindex & ~ncnn::LayerType::CustomBit;
- if (d->custom_layer_registry[custom_index].destroyer)
- {
- d->custom_layer_registry[custom_index].destroyer(layer, d->custom_layer_registry[custom_index].userdata);
- }
- else
- {
- delete layer;
- }
- }
- else
- {
- // check overwrite builtin layer destroyer
- int index = -1;
- const size_t overwrite_builtin_layer_registry_entry_count = d->overwrite_builtin_layer_registry.size();
- for (size_t i = 0; i < overwrite_builtin_layer_registry_entry_count; i++)
- {
- if (d->overwrite_builtin_layer_registry[i].typeindex == layer->typeindex)
- {
- index = i;
- break;
- }
- }
-
- if (index != -1 && d->overwrite_builtin_layer_registry[index].destroyer)
- {
- d->overwrite_builtin_layer_registry[index].destroyer(layer, d->overwrite_builtin_layer_registry[index].userdata);
- }
- else
- {
- delete layer;
- }
- }
- }
- d->layers.clear();
-
- if (d->local_blob_allocator)
- {
- delete d->local_blob_allocator;
- d->local_blob_allocator = 0;
- }
- if (d->local_workspace_allocator)
- {
- delete d->local_workspace_allocator;
- d->local_workspace_allocator = 0;
- }
-
- #if NCNN_VULKAN
- if (d->weight_vkallocator)
- {
- delete d->weight_vkallocator;
- d->weight_vkallocator = 0;
- }
- if (d->weight_staging_vkallocator)
- {
- delete d->weight_staging_vkallocator;
- d->weight_staging_vkallocator = 0;
- }
- if (d->pipeline_cache)
- {
- delete d->pipeline_cache;
- d->pipeline_cache = 0;
- opt.pipeline_cache = 0;
- }
- #endif // NCNN_VULKAN
- }
-
- Extractor Net::create_extractor() const
- {
- return Extractor(this, d->blobs.size());
- }
-
- const std::vector<int>& Net::input_indexes() const
- {
- return d->input_blob_indexes;
- }
-
- const std::vector<int>& Net::output_indexes() const
- {
- return d->output_blob_indexes;
- }
-
- #if NCNN_STRING
- const std::vector<const char*>& Net::input_names() const
- {
- return d->input_blob_names;
- }
-
- const std::vector<const char*>& Net::output_names() const
- {
- return d->output_blob_names;
- }
- #endif
-
- const std::vector<Blob>& Net::blobs() const
- {
- return d->blobs;
- }
-
- const std::vector<Layer*>& Net::layers() const
- {
- return d->layers;
- }
-
- std::vector<Blob>& Net::mutable_blobs()
- {
- return d->blobs;
- }
-
- std::vector<Layer*>& Net::mutable_layers()
- {
- return d->layers;
- }
-
- #if NCNN_VULKAN
- void Net::set_vulkan_device(int device_index)
- {
- d->vkdev = get_gpu_device(device_index);
- }
-
- void Net::set_vulkan_device(const VulkanDevice* _vkdev)
- {
- d->vkdev = _vkdev;
- }
-
- const VulkanDevice* Net::vulkan_device() const
- {
- return d->vkdev;
- }
- #endif // NCNN_VULKAN
-
- #if NCNN_STRING
- int Net::find_blob_index_by_name(const char* name) const
- {
- for (size_t i = 0; i < d->blobs.size(); i++)
- {
- const Blob& blob = d->blobs[i];
- if (blob.name == name)
- {
- return static_cast<int>(i);
- }
- }
-
- NCNN_LOGE("find_blob_index_by_name %s failed", name);
- return -1;
- }
-
- int Net::find_layer_index_by_name(const char* name) const
- {
- for (size_t i = 0; i < d->layers.size(); i++)
- {
- const Layer* layer = d->layers[i];
- if (layer->name == name)
- {
- return static_cast<int>(i);
- }
- }
-
- NCNN_LOGE("find_layer_index_by_name %s failed", name);
- return -1;
- }
-
- int Net::custom_layer_to_index(const char* type)
- {
- const size_t custom_layer_registry_entry_count = d->custom_layer_registry.size();
- for (size_t i = 0; i < custom_layer_registry_entry_count; i++)
- {
- if (strcmp(type, d->custom_layer_registry[i].name) == 0)
- return static_cast<int>(i);
- }
-
- return -1;
- }
-
- Layer* Net::create_custom_layer(const char* type)
- {
- int index = custom_layer_to_index(type);
- if (index == -1)
- return 0;
-
- return create_custom_layer(index);
- }
-
- Layer* Net::create_overwrite_builtin_layer(const char* type)
- {
- int typeindex = layer_to_index(type);
- if (typeindex == -1)
- return 0;
-
- return create_overwrite_builtin_layer(typeindex);
- }
- #endif // NCNN_STRING
-
- Layer* Net::create_custom_layer(int index)
- {
- const size_t custom_layer_registry_entry_count = d->custom_layer_registry.size();
- if (index < 0 || static_cast<unsigned int>(index) >= custom_layer_registry_entry_count)
- return 0;
-
- layer_creator_func layer_creator = d->custom_layer_registry[index].creator;
- if (!layer_creator)
- return 0;
-
- Layer* layer = layer_creator(d->custom_layer_registry[index].userdata);
- layer->typeindex = ncnn::LayerType::CustomBit | index;
- return layer;
- }
-
- Layer* Net::create_overwrite_builtin_layer(int typeindex)
- {
- int index = -1;
- const size_t overwrite_builtin_layer_registry_entry_count = d->overwrite_builtin_layer_registry.size();
- for (size_t i = 0; i < overwrite_builtin_layer_registry_entry_count; i++)
- {
- if (d->overwrite_builtin_layer_registry[i].typeindex == typeindex)
- {
- index = i;
- break;
- }
- }
-
- if (index == -1)
- return 0;
-
- layer_creator_func layer_creator = d->overwrite_builtin_layer_registry[index].creator;
- if (!layer_creator)
- return 0;
-
- Layer* layer = layer_creator(d->overwrite_builtin_layer_registry[index].userdata);
- layer->typeindex = typeindex;
- return layer;
- }
-
- class ExtractorPrivate
- {
- public:
- ExtractorPrivate(const Net* _net)
- : net(_net)
- {
- }
- const Net* net;
- std::vector<Mat> blob_mats;
- Option opt;
-
- #if NCNN_VULKAN
- VkAllocator* local_blob_vkallocator;
- VkAllocator* local_staging_vkallocator;
-
- std::vector<VkMat> blob_mats_gpu;
- std::vector<VkImageMat> blob_mats_gpu_image;
- #endif // NCNN_VULKAN
- };
-
- Extractor::Extractor(const Net* _net, size_t blob_count)
- : d(new ExtractorPrivate(_net))
- {
- d->blob_mats.resize(blob_count);
- d->opt = d->net->opt;
-
- #if NCNN_VULKAN
- if (d->net->opt.use_vulkan_compute)
- {
- d->local_blob_vkallocator = 0;
- d->local_staging_vkallocator = 0;
-
- d->blob_mats_gpu.resize(blob_count);
- d->blob_mats_gpu_image.resize(blob_count);
- }
- #endif // NCNN_VULKAN
- }
-
- Extractor::~Extractor()
- {
- clear();
-
- delete d;
- }
-
- Extractor::Extractor(const Extractor& rhs)
- : d(new ExtractorPrivate(0))
- {
- d->net = rhs.d->net;
- d->blob_mats = rhs.d->blob_mats;
- d->opt = rhs.d->opt;
-
- #if NCNN_VULKAN
- d->local_blob_vkallocator = 0;
- d->local_staging_vkallocator = 0;
-
- d->blob_mats_gpu = rhs.d->blob_mats_gpu;
- d->blob_mats_gpu_image = rhs.d->blob_mats_gpu_image;
- #endif // NCNN_VULKAN
- }
-
- Extractor& Extractor::operator=(const Extractor& rhs)
- {
- if (this == &rhs)
- return *this;
-
- d->net = rhs.d->net;
- d->blob_mats = rhs.d->blob_mats;
- d->opt = rhs.d->opt;
-
- #if NCNN_VULKAN
- d->local_blob_vkallocator = 0;
- d->local_staging_vkallocator = 0;
-
- d->blob_mats_gpu = rhs.d->blob_mats_gpu;
- d->blob_mats_gpu_image = rhs.d->blob_mats_gpu_image;
- #endif // NCNN_VULKAN
-
- return *this;
- }
-
- void Extractor::clear()
- {
- d->blob_mats.clear();
-
- #if NCNN_VULKAN
- if (d->opt.use_vulkan_compute)
- {
- d->blob_mats_gpu.clear();
- d->blob_mats_gpu_image.clear();
-
- if (d->local_blob_vkallocator)
- {
- d->net->vulkan_device()->reclaim_blob_allocator(d->local_blob_vkallocator);
- }
- if (d->local_staging_vkallocator)
- {
- d->net->vulkan_device()->reclaim_staging_allocator(d->local_staging_vkallocator);
- }
- }
- #endif // NCNN_VULKAN
- }
-
- void Extractor::set_light_mode(bool enable)
- {
- d->opt.lightmode = enable;
- }
-
- void Extractor::set_num_threads(int num_threads)
- {
- d->opt.num_threads = num_threads;
- }
-
- void Extractor::set_blob_allocator(Allocator* allocator)
- {
- d->opt.blob_allocator = allocator;
- }
-
- void Extractor::set_workspace_allocator(Allocator* allocator)
- {
- d->opt.workspace_allocator = allocator;
- }
-
- #if NCNN_VULKAN
- void Extractor::set_vulkan_compute(bool enable)
- {
- if (d->net->d->opt.use_vulkan_compute)
- {
- d->opt.use_vulkan_compute = enable;
- }
- else
- {
- NCNN_LOGE("set_vulkan_compute failed, network use_vulkan_compute disabled");
- }
- }
-
- void Extractor::set_blob_vkallocator(VkAllocator* allocator)
- {
- d->opt.blob_vkallocator = allocator;
- }
-
- void Extractor::set_workspace_vkallocator(VkAllocator* allocator)
- {
- d->opt.workspace_vkallocator = allocator;
- }
-
- void Extractor::set_staging_vkallocator(VkAllocator* allocator)
- {
- d->opt.staging_vkallocator = allocator;
- }
- #endif // NCNN_VULKAN
-
- #if NCNN_STRING
- int Extractor::input(const char* blob_name, const Mat& in)
- {
- int blob_index = d->net->find_blob_index_by_name(blob_name);
- if (blob_index == -1)
- {
- NCNN_LOGE("Try");
- const std::vector<const char*>& input_names = d->net->input_names();
- for (size_t i = 0; i < input_names.size(); i++)
- {
- NCNN_LOGE(" ex.input(\"%s\", in%d);", input_names[i], (int)i);
- }
-
- return -1;
- }
-
- return input(blob_index, in);
- }
-
- int Extractor::extract(const char* blob_name, Mat& feat, int type)
- {
- int blob_index = d->net->find_blob_index_by_name(blob_name);
- if (blob_index == -1)
- {
- NCNN_LOGE("Try");
- const std::vector<const char*>& output_names = d->net->output_names();
- for (size_t i = 0; i < output_names.size(); i++)
- {
- NCNN_LOGE(" ex.extract(\"%s\", out%d);", output_names[i], (int)i);
- }
-
- return -1;
- }
-
- return extract(blob_index, feat, type);
- }
- #endif // NCNN_STRING
-
- int Extractor::input(int blob_index, const Mat& in)
- {
- if (blob_index < 0 || blob_index >= (int)d->blob_mats.size())
- return -1;
-
- d->blob_mats[blob_index] = in;
-
- return 0;
- }
-
- int Extractor::extract(int blob_index, Mat& feat, int type)
- {
- if (blob_index < 0 || blob_index >= (int)d->blob_mats.size())
- return -1;
-
- int old_blocktime = get_kmp_blocktime();
- set_kmp_blocktime(d->opt.openmp_blocktime);
-
- int old_flush_denormals = get_flush_denormals();
- set_flush_denormals(d->opt.flush_denormals);
-
- int ret = 0;
-
- if (d->blob_mats[blob_index].dims == 0)
- {
- int layer_index = d->net->blobs()[blob_index].producer;
-
- // use local allocator
- if (d->opt.use_local_pool_allocator)
- {
- if (!d->opt.blob_allocator)
- {
- d->opt.blob_allocator = d->net->d->local_blob_allocator;
- }
- if (!d->opt.workspace_allocator)
- {
- d->opt.workspace_allocator = d->net->d->local_workspace_allocator;
- }
- }
-
- #if NCNN_VULKAN
- if (d->opt.use_vulkan_compute)
- {
- // use local allocator
- if (!d->opt.blob_vkallocator)
- {
- d->local_blob_vkallocator = d->net->vulkan_device()->acquire_blob_allocator();
- d->opt.blob_vkallocator = d->local_blob_vkallocator;
- }
- if (!d->opt.workspace_vkallocator)
- {
- d->opt.workspace_vkallocator = d->opt.blob_vkallocator;
- }
- if (!d->opt.staging_vkallocator)
- {
- d->local_staging_vkallocator = d->net->vulkan_device()->acquire_staging_allocator();
- d->opt.staging_vkallocator = d->local_staging_vkallocator;
- }
-
- ncnn::VkCompute cmd(d->net->vulkan_device());
- #if NCNN_BENCHMARK
- cmd.create_query_pool(d->net->layers().size() * 2);
- #endif // NCNN_BENCHMARK
-
- // TODO vkimagemat for adreno
- if (d->opt.use_image_storage)
- {
- VkImageMat feat_gpu;
- ret = extract(blob_index, feat_gpu, cmd);
-
- if (ret == 0 && d->blob_mats[blob_index].dims == 0 && feat_gpu.dims != 0)
- {
- cmd.record_download(feat_gpu, d->blob_mats[blob_index], d->opt);
-
- ret = cmd.submit_and_wait();
-
- #if NCNN_BENCHMARK
- std::vector<uint64_t> results(d->net->layers().size() * 2);
- cmd.get_query_pool_results(0, d->net->layers().size() * 2, results);
- for (size_t i = 0; i < d->net->layers().size(); i++)
- {
- uint64_t start = results[i * 2];
- uint64_t end = results[i * 2 + 1];
- if (start == 0 || end == 0)
- continue;
-
- double duration_us = (end - start) * d->net->vulkan_device()->info.timestamp_period() / 1000;
- NCNN_LOGE("%-24s %-30s %8.2lfus |", d->net->layers()[i]->type.c_str(), d->net->layers()[i]->name.c_str(), duration_us);
- }
- #endif // NCNN_BENCHMARK
- }
- }
- else
- {
- VkMat feat_gpu;
- ret = extract(blob_index, feat_gpu, cmd);
-
- if (ret == 0 && d->blob_mats[blob_index].dims == 0 && feat_gpu.dims != 0)
- {
- cmd.record_download(feat_gpu, d->blob_mats[blob_index], d->opt);
-
- ret = cmd.submit_and_wait();
-
- #if NCNN_BENCHMARK
- std::vector<uint64_t> results(d->net->layers().size() * 2);
- cmd.get_query_pool_results(0, d->net->layers().size() * 2, results);
- for (size_t i = 0; i < d->net->layers().size(); i++)
- {
- uint64_t start = results[i * 2];
- uint64_t end = results[i * 2 + 1];
- if (start == 0 || end == 0)
- continue;
-
- double duration_us = (end - start) * d->net->vulkan_device()->info.timestamp_period() / 1000;
- NCNN_LOGE("%-24s %-30s %8.2lfus |", d->net->layers()[i]->type.c_str(), d->net->layers()[i]->name.c_str(), duration_us);
- }
- #endif // NCNN_BENCHMARK
- }
- }
- }
- else
- {
- ret = d->net->d->forward_layer(layer_index, d->blob_mats, d->opt);
- }
- #else
- ret = d->net->d->forward_layer(layer_index, d->blob_mats, d->opt);
- #endif // NCNN_VULKAN
- }
-
- feat = d->blob_mats[blob_index];
-
- if (d->opt.use_packing_layout && (type == 0) && feat.elempack != 1)
- {
- Mat bottom_blob_unpacked;
- convert_packing(feat, bottom_blob_unpacked, 1, d->opt);
- feat = bottom_blob_unpacked;
- }
-
- // clang-format off
- // *INDENT-OFF*
- #if NCNN_ARM82
- if (d->opt.use_fp16_storage && cpu_support_arm_asimdhp() && (type == 0))
- {
- if (feat.elembits() == 16)
- {
- Mat feat_fp32;
- cast_float16_to_float32(feat, feat_fp32, d->opt);
- feat = feat_fp32;
- }
- }
- else
- #endif // NCNN_ARM82
- #if NCNN_BF16
- if (d->opt.use_bf16_storage && (type == 0))
- {
- if (feat.elembits() == 16)
- {
- Mat feat_fp32;
- cast_bfloat16_to_float32(feat, feat_fp32, d->opt);
- feat = feat_fp32;
- }
- }
- else
- #endif // NCNN_BF16
- if (feat.elembits() == 8 && (type == 0))
- {
- Mat feat_fp32;
- cast_int8_to_float32(feat, feat_fp32, d->opt);
- feat = feat_fp32;
- }
- // *INDENT-ON*
- // clang-format on
-
- if (d->opt.use_local_pool_allocator && feat.allocator == d->net->d->local_blob_allocator)
- {
- // detach the returned mat from the local pool allocator
- // so that the net instance can be destroyed earlier
- feat = feat.clone();
- }
-
- set_kmp_blocktime(old_blocktime);
- set_flush_denormals(old_flush_denormals);
-
- return ret;
- }
-
- #if NCNN_VULKAN
- #if NCNN_STRING
- int Extractor::input(const char* blob_name, const VkMat& in)
- {
- int blob_index = d->net->find_blob_index_by_name(blob_name);
- if (blob_index == -1)
- {
- NCNN_LOGE("Try");
- const std::vector<const char*>& input_names = d->net->input_names();
- for (size_t i = 0; i < input_names.size(); i++)
- {
- NCNN_LOGE(" ex.input(\"%s\", in%d);", input_names[i], (int)i);
- }
-
- return -1;
- }
-
- return input(blob_index, in);
- }
-
- int Extractor::extract(const char* blob_name, VkMat& feat, VkCompute& cmd)
- {
- int blob_index = d->net->find_blob_index_by_name(blob_name);
- if (blob_index == -1)
- {
- NCNN_LOGE("Try");
- const std::vector<const char*>& output_names = d->net->output_names();
- for (size_t i = 0; i < output_names.size(); i++)
- {
- NCNN_LOGE(" ex.extract(\"%s\", out%d);", output_names[i], (int)i);
- }
-
- return -1;
- }
-
- return extract(blob_index, feat, cmd);
- }
-
- int Extractor::input(const char* blob_name, const VkImageMat& in)
- {
- int blob_index = d->net->find_blob_index_by_name(blob_name);
- if (blob_index == -1)
- {
- NCNN_LOGE("Try");
- const std::vector<const char*>& input_names = d->net->input_names();
- for (size_t i = 0; i < input_names.size(); i++)
- {
- NCNN_LOGE(" ex.input(\"%s\", in%d);", input_names[i], (int)i);
- }
-
- return -1;
- }
-
- return input(blob_index, in);
- }
-
- int Extractor::extract(const char* blob_name, VkImageMat& feat, VkCompute& cmd)
- {
- int blob_index = d->net->find_blob_index_by_name(blob_name);
- if (blob_index == -1)
- {
- NCNN_LOGE("Try");
- const std::vector<const char*>& output_names = d->net->output_names();
- for (size_t i = 0; i < output_names.size(); i++)
- {
- NCNN_LOGE(" ex.extract(\"%s\", out%d);", output_names[i], (int)i);
- }
-
- return -1;
- }
-
- return extract(blob_index, feat, cmd);
- }
- #endif // NCNN_STRING
-
- int Extractor::input(int blob_index, const VkMat& in)
- {
- if (blob_index < 0 || blob_index >= (int)d->blob_mats.size())
- return -1;
-
- d->blob_mats_gpu[blob_index] = in;
-
- return 0;
- }
-
- int Extractor::extract(int blob_index, VkMat& feat, VkCompute& cmd)
- {
- if (blob_index < 0 || blob_index >= (int)d->blob_mats.size())
- return -1;
-
- int old_blocktime = get_kmp_blocktime();
- set_kmp_blocktime(d->opt.openmp_blocktime);
-
- int old_flush_denormals = get_flush_denormals();
- set_flush_denormals(d->opt.flush_denormals);
-
- int ret = 0;
-
- if (d->blob_mats_gpu[blob_index].dims == 0)
- {
- if (d->blob_mats_gpu_image[blob_index].dims != 0)
- {
- // image to buffer
- cmd.record_image_to_buffer(d->blob_mats_gpu_image[blob_index], d->blob_mats_gpu[blob_index], d->opt);
- }
- else if (d->blob_mats[blob_index].dims != 0)
- {
- // host to buffer
- cmd.record_upload(d->blob_mats[blob_index], d->blob_mats_gpu[blob_index], d->opt);
- }
- else
- {
- int layer_index = d->net->blobs()[blob_index].producer;
- ret = d->net->d->forward_layer(layer_index, d->blob_mats, d->blob_mats_gpu, cmd, d->opt);
- }
- }
-
- feat = d->blob_mats_gpu[blob_index];
-
- set_kmp_blocktime(old_blocktime);
- set_flush_denormals(old_flush_denormals);
-
- return ret;
- }
-
- int Extractor::input(int blob_index, const VkImageMat& in)
- {
- if (blob_index < 0 || blob_index >= (int)d->blob_mats.size())
- return -1;
-
- d->blob_mats_gpu_image[blob_index] = in;
-
- return 0;
- }
-
- int Extractor::extract(int blob_index, VkImageMat& feat, VkCompute& cmd)
- {
- if (blob_index < 0 || blob_index >= (int)d->blob_mats.size())
- return -1;
-
- int old_blocktime = get_kmp_blocktime();
- set_kmp_blocktime(d->opt.openmp_blocktime);
-
- int old_flush_denormals = get_flush_denormals();
- set_flush_denormals(d->opt.flush_denormals);
-
- int ret = 0;
-
- if (d->blob_mats_gpu_image[blob_index].dims == 0)
- {
- if (d->blob_mats_gpu[blob_index].dims != 0)
- {
- // buffer to image
- cmd.record_buffer_to_image(d->blob_mats_gpu[blob_index], d->blob_mats_gpu_image[blob_index], d->opt);
- }
- else if (d->blob_mats[blob_index].dims != 0)
- {
- // host to image
- cmd.record_upload(d->blob_mats[blob_index], d->blob_mats_gpu_image[blob_index], d->opt);
- }
- else
- {
- int layer_index = d->net->blobs()[blob_index].producer;
- ret = d->net->d->forward_layer(layer_index, d->blob_mats, d->blob_mats_gpu, d->blob_mats_gpu_image, cmd, d->opt);
- }
- }
-
- feat = d->blob_mats_gpu_image[blob_index];
-
- if (feat.empty())
- {
- NCNN_LOGE("extract %d image allocation failed", blob_index);
- ret = -100;
- }
-
- set_kmp_blocktime(old_blocktime);
- set_flush_denormals(old_flush_denormals);
-
- return ret;
- }
- #endif // NCNN_VULKAN
-
- } // namespace ncnn
|