nihui
a961ab992e
arm deconv matmul use gemm ( #4594 )
* arm deconv matmul use gemm
* reduce gemm armv7 register uses
3 years ago
nihui
254eb8d0d4
blacklist fp16a on old adreno driver ( #4587 )
3 years ago
nihui
06b97d7e69
fix exynos 9810 isa detection ( #4585 )
3 years ago
nihui
5ac17df797
arm optimization for packed convolution unified elempack ( #4590 )
3 years ago
nihui
010d6772d6
softmax arm unified elempack and bf16/fp16 optimization ( #4582 )
* mha arm use softmax fp16
3 years ago
nihui
c777bf09dc
arm convolution sgemm unified elempack ( #4572 )
* fuse im2col and packb tile
3 years ago
nihui
6987efd950
fix scale avx512 ( #4580 )
3 years ago
Kenji Mouri
47879ea7ea
Add SSE2 implementation of atan in x86 targets. ( #4575 )
3 years ago
Kenji Mouri
b314b3543d
Add SSE2 implementation of acos in x86 targets. ( #4573 )
3 years ago
Kenji Mouri
328d2ca2c4
Add SSE2 implementation of asin in x86 targets. ( #4570 )
3 years ago
Yoh
7573faae52
move floor and ceil sse_function from unaryOp to sse_mathfun ( #4566 )
3 years ago
nihui
dabc4c065f
arm convolution winograd unified elempack ( #4556 )
* update f43 coeffs
* arm convolution winograd unified elempack
* disable bf16s test atm
* test gnu inline asm off
3 years ago
nihui
6f08ec7397
use full date for macos pypi package ( #4552 )
* use full date for pypi package
* split version date string only for dylib
3 years ago
WuJinxuan
ff80ac2955
[ARM] Multiheadattention ( #4463 )
3 years ago
nihui
bbc770079e
silence fopen error on sysfs cache files
3 years ago
nihui
47ea2877ed
stb and emsdk update ( #4536 )
* stb_image_write 1.16
* stb_image v2.28
* update emsdk 3.1.28
* enable stb arm neon
* update doc
Co-authored-by: ncnnnnn <67086033+ncnnnnn@users.noreply.github.com>
3 years ago
nihui
d0c2738043
update riscv winograd f43 coeffs and fix some warnings ( #4537 )
* update winograd f43 coeffs
* rvv tanh rework
* fix warnings
* rebuild qemu
3 years ago
WuJinxuan
6572da3533
[x86] GroupNorm ( #4471 )
Co-authored-by: EdVince <EdVince@users.noreply.github.com>
3 years ago
nihui
833f6ed8e4
c api for getting output indexes and names ( #4534 )
3 years ago
nihui
1832da8292
concat 4d ( #4528 )
3 years ago
nihui
fb9cf7982d
eltwise 4d ( #4529 )
3 years ago
nihui
32e2de015e
slice 4d ( #4525 )
3 years ago
nihui
fc6ce4a641
copyto operator ( #4522 )
3 years ago
nihui
242e775d21
pnnx convert torch log10, pow 2 as square ( #4518 )
3 years ago
nihui
246e71c526
implement atan2 ( #4516 )
3 years ago
Fangjun Kuang
92e75105c9
Support torch.cumsum ( #4505 )
3 years ago
nihui
ab4cfbf5b0
enrich ncnn binary broadcast rules ( #4513 )
3 years ago
nihui
6869c81ed3
find cpu cache size from sysfs ( #4502 )
* find cpu cache size from sysfs
* android l3
* make g_thread_affinity_mask singleton
* global mask
3 years ago
nihui
17197b3c45
ci build with musl libc ( #4499 )
3 years ago
nihui
ce6b80a16b
pnnx flatten input tuple list ( #4498 )
3 years ago
nihui
3b36656bc8
reduce vulkan winograd f43 transform shader register pressure ( #4496 )
3 years ago
nihui
dfbcd3e69b
improve vulkan winograd f43 fp16 numerical stability ( #4492 )
3 years ago
weirdseed
503a8b921f
fix uninitialized gpu bug_buffer_image_load_zero value ( #4493 )
3 years ago
nihui
d2d012dce5
x86 bfloat16 cast functions ( #4491 )
* simplify cast fp16 avx512 dispatch
* define sse4.1 macro on msvc avx+
3 years ago
nihui
fed99fd35b
gemm output transpose, prepack c ( #4479 )
* mha is now permute and reshape free
* gemm user defined tile mnk param
3 years ago
nihui
2e3e680d77
x86 optimization for packed convolution unified elempack ( #4469 )
3 years ago
nihui
bd5bbe3f2c
x86 optimization for winograd unified elempack part2 ( #4470 )
* improve gemm packb threading
* optimize tile size
* profile winograd condition
* handle threads changes
3 years ago
ws
643285a08c
fix macos vulkan instance create failed when vulkan sdk version >= 1.… ( #4472 )
* enable VK_KHR_portability_subset extension if device support it
Co-authored-by: w1ndseeker <w1ndseeker@users.noreply.github.com>
3 years ago
nihui
88274827da
x86 optimization for winograd unified elempack ( #4456 )
3 years ago
WuJinxuan
ad956c8c9c
[ARM] GELU ( #4464 )
3 years ago
WuJinxuan
10e9d91576
Add x86 MultiHeadAttention ( #4443 )
* fix doc, sync x86 gemm fix
Co-authored-by: EdVince <EdVince@users.noreply.github.com>
Co-authored-by: nihuini <nihuini@tencent.com>
3 years ago
nihui
15761fc1a6
arm vfpv4 asimdhp asimdfhm optimization for gemm ( #4432 )
3 years ago
nihui
c471826da1
fix arm bfloat2float float2bfloat oops ( #4439 )
3 years ago
nihui
88dba58992
fix gemm transpose B wrong result when tile N is not a multiple of 4, optimize load C ( #4430 )
3 years ago
nihui
7b3261dace
gemm arm optimization ( #4426 )
* cmake determine target 32bit and 64bit
* include opt source with non-runtime cpu
* check compiler support gnu style inline assembly
3 years ago
nihui
5da70724b1
matmul x86 use sgemm ( #4421 )
3 years ago
nihui
1f1981052c
convolution deconvolution and deformableconv2d x86 use sgemm ( #4414 )
* drop old sgemm code
* fix convdw test
* fix avx512 gemm
* optimize prefer sgemm condition
3 years ago
nihui
9cc6eb1942
meet gemm x86 transpose alignment
3 years ago
nihui
18fbaebe68
get cpu l2 cache size and resolve gemm tile size ( #4411 )
* get cpu l2 cache size and resolve gemm tile size
* optimize constant tile K
* fix per-core l2 cache detection, better macos cpu cluster topology discovery
3 years ago
nihui
c5640a16c3
gemm x86 multiply alpha beta in post gemm stage, enable one_blob_only ( #4407 )
* gemm x86 multiply alpha beta in post gemm stage, enable one_blob_only
* relax mnk multiple restrictions
* make square tiles in each thread
* sanitize num_threads changes
3 years ago