Kagurazaka Kotori
08ecc94d63
x86: Use _mm_cvtsi128_si{32,64} in float2int8 (#3536)
This patch uses the _mm_cvtsi128_si{32,64} intrinsics when returning the
value from float2int8() to reduce unnecessary memory accesses.
Resolves TODO "use _mm_cvtsi128_si64 on 64bit target".
Signed-off-by: Kagurazaka Kotori <kagurazakakotori@gmail.com>
4 years ago
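The float2int8() change above can be illustrated with a minimal sketch. It assumes an SSE2 target and a hypothetical `float2int8_sketch()` helper that packs 8 floats into 8 int8 values clamped to [-127, 127]; it is not ncnn's actual implementation. The relevant part is the return path: on a 64-bit target `_mm_cvtsi128_si64` moves the low 64 bits straight out of the XMM register, and on a 32-bit target two `_mm_cvtsi128_si32` calls do the same job, so no store to a temporary buffer followed by a reload is needed.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical sketch (not ncnn's code): convert 8 floats to 8 signed int8
 * values clamped to [-127, 127] and return them packed in one 64-bit integer,
 * lane 0 in the lowest byte. */
int64_t float2int8_sketch(__m128 v0, __m128 v1)
{
    /* round to nearest using the current MXCSR mode (default: ties to even) */
    __m128i i0 = _mm_cvtps_epi32(v0);
    __m128i i1 = _mm_cvtps_epi32(v1);
    /* narrow 8 x int32 -> 8 x int16 with signed saturation */
    __m128i s16 = _mm_packs_epi32(i0, i1);
    /* clamp the lower bound to -127 (assumed symmetric int8 range) */
    s16 = _mm_max_epi16(s16, _mm_set1_epi16(-127));
    /* narrow to int8 with signed saturation; the upper bound becomes 127 */
    __m128i s8 = _mm_packs_epi16(s16, s16);

#if defined(__x86_64__) || defined(_M_X64)
    /* 64-bit target: move the low 64 bits directly from the register */
    return _mm_cvtsi128_si64(s8);
#else
    /* 32-bit target: extract the two 32-bit halves, still without a memory round trip */
    uint32_t lo = (uint32_t)_mm_cvtsi128_si32(s8);
    uint32_t hi = (uint32_t)_mm_cvtsi128_si32(_mm_srli_si128(s8, 4));
    return (int64_t)(((uint64_t)hi << 32) | lo);
#endif
}
```

For example, the inputs {1.4, -2.6, 300, -300} and {0.5, 1.5, -0.4, 127} pack to the bytes 1, -3, 127, -127, 0, 2, 0, 127 under the default rounding mode.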
teng
3ff9ae707f
simplify macro (#3530)
4 years ago
Kagurazaka Kotori
5c078016c2
x86/avx_mathfun.h: Remove fallback warnings (#3527)
* x86/avx_mathfun.h: Remove fallback warnings
This patch removes the warning messages indicating a fallback to SSE2
when AVX2 support is disabled, as suggested. It also reorders the
non-AVX2 macros for readability and faster preprocessing.
Suggested-by: nihui <shuizhuyuanluo@126.com>
Signed-off-by: Kagurazaka Kotori <kagurazakakotori@gmail.com>
* apply code-format changes
Co-authored-by: kagurazakakotori <kagurazakakotori@users.noreply.github.com>
4 years ago
nihui
2d46994d2e
wrap avxvnni and avx512vnni build options over cpu feature detector
4 years ago
nihui
bae2ee375f
simplify c api layer forward_n output array type
4 years ago
nihui
c0a94cd9ca
fix armv7 without neon (#3514)
4 years ago
nihui
4e4e0b9cf8
do not link libgcc, as we no longer rely on the builtin cpu feature support intrinsics
4 years ago
nihui
d95213a005
x86 convolution int8 optimization third stage (#3506)
* avx-vnni and avx512-vnni optimization for convolution int8 gemm and 3x3 winograd pack8to4/pack8to1
4 years ago
nihuini
9f7f491885
use the old-style __cpuid_count for old compiler compatibility, fix #3510
4 years ago
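A minimal sketch of the old-style `__cpuid_count` macro from GCC/Clang's `<cpuid.h>`, shown here reading the AVX512-VNNI flag (CPUID leaf 7, subleaf 0, ECX bit 11). The function name `cpu_advertises_avx512vnni()` is hypothetical and this is not ncnn's detector; the reason for preferring the macro is presumably that the newer `__get_cpuid_count` wrapper is missing from some older compilers' headers. It only checks what CPUID advertises; a complete detector must also confirm via XGETBV that the OS saves the AVX-512 register state.

```c
#include <cpuid.h> /* GCC/Clang on x86 only: provides __cpuid and __cpuid_count */

/* Hypothetical sketch: return 1 if CPUID advertises AVX512-VNNI, else 0.
 * OS support for ZMM state (XGETBV) is deliberately not checked here. */
int cpu_advertises_avx512vnni(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* leaf 0 reports the highest supported standard leaf in EAX */
    __cpuid(0, eax, ebx, ecx, edx);
    if (eax < 7)
        return 0; /* leaf 7 (structured extended features) not available */

    /* old-style macro: leaf 7, subleaf 0 */
    __cpuid_count(7, 0, eax, ebx, ecx, edx);
    return (int)((ecx >> 11) & 1u); /* ECX bit 11 = AVX512_VNNI */
}
```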
nihui
930c36ebe2
avx512 infrastructure (#3407)
4 years ago
nihui
c2896bcd4d
x86 convolution int8 optimization second stage (#3495)
* some sse 4.1 optimization
* sse2/avx2 optimization for convolution 3x3 winograd42 int8 pack8to4/pack8to1
4 years ago
teng
13a51fbcf8
add else (#3494)
4 years ago
nihui
e9b8f0a6ef
x86 avx2 optimization for convolution gemm int8 (#3489)
4 years ago
nihui
c5d7f963b9
layer tile (#3491)
4 years ago
nihui
922f8b33c1
reduction4d, merge keepdims arg, add test (#3469)
4 years ago
nihui
713e712ba6
fix slow fp32/int32 crop on arm82 (#3462)
4 years ago
nihui
2d98c86ecd
branchless mat channel (#3452)
4 years ago
nihui
3a83704c38
binary4d, unary4d (#3443)
4 years ago
nihuini
f02b259a15
convert some pnnx reduction family
4 years ago
nihui
6941ec8fc9
arm neon optimization for general packed convolution (#3426)
4 years ago
tpoisonooo
dddcdb97b2
feat(src): add mat default constructor (#3427)
Co-authored-by: MegEngine <megengine@megvii.com>
Co-authored-by: tpoisonooo <tpoisonooo@users.noreply.github.com>
4 years ago
nihui
76f2dddc37
cmake option for c api (#3423)
4 years ago
nihui
878cb713d5
optional arm82 dot source (#3415)
4 years ago
nihui
999e640d43
dynamic convolution weight (#3408)
4 years ago
nihui
426e564b6e
general simd optimization for convolution1d (#3404)
4 years ago
nihui
14c3802de3
c api for 4d mat (#3403)
4 years ago
nihui
f98c396e6b
crop4d (#3402)
4 years ago
nihui
cf20dbc0bd
relu3d, batchnorm3d, reshape4d, flatten4d, permute4d (#3397)
Co-authored-by: ElvisYu <elvisyuovo@gmail.com>
Co-authored-by: 余浩文 <m18107220188@163.com>
Co-authored-by: nihui <nihui@users.noreply.github.com>
Co-authored-by: Zr2223 <67497651+Zr2223@users.noreply.github.com>
Co-authored-by: Zr2223 <Zr2223@users.noreply.github.com>
4 years ago
nihui
f10cc6dd93
initial data structure changes for 3dcnn, conv3d, pooling3d (#3378)
Co-authored-by: ElvisYu <elvisyuovo@gmail.com>
Co-authored-by: 余浩文 <m18107220188@163.com>
Co-authored-by: Zr2223 <67497651+Zr2223@users.noreply.github.com>
4 years ago
nihui
f4f7fabe27
fix python wheel, parallel 3 for macos tasks (#3396)
4 years ago
nihui
24fbb6e8cb
honor thread setting on load and vulkan command, ci avx512 t4 (#3391)
4 years ago
nihui
ed1fb210ea
arm neon optimization for innerproduct int8 gemm (#3367)
4 years ago
nihui
ac3d32aa0d
get_elf_hwcap_from_getauxval (#3301)
4 years ago
nihui
4566ad529a
fix rounding on x86 avx and armv7 neon (#3365)
4 years ago
源源球球✨
f66110b7e0
delete useless variables (#3360)
4 years ago
nihui
f433f86874
fix squeeze expanddims axes and add test (#3359)
4 years ago
nihui
525df8bcc5
rnn/lstm/gru with unequal input output (#3352)
4 years ago
nihui
f448a8f595
implement interp-1d on 2d blob (#3349)
4 years ago
nihui
5eb4a2ccd0
implement convolutiondepthwise1d (#3342)
4 years ago
nihui
b3a521981b
implement interp cubic aligncorner (#3338)
4 years ago
nihui
adfc8b25bc
fix deconv output pad (#3337)
4 years ago
nihui
f86a307ab5
silence code scanning
4 years ago
nihui
6a52e8e5f2
fix potential integer type overflow
4 years ago
nihui
a5b846371a
fix adaptive pooling vulkan, second try
4 years ago
Wang Kangyu
53040f566b
use int ceil/floor div (#3333)
4 years ago
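Integer ceil/floor division avoids a round trip through floating point. A minimal sketch, assuming non-negative numerators and positive divisors; the helper names are hypothetical. Adaptive pooling bin bounds are used as an illustrative caller, on the assumption (not confirmed by this log) that this is where such division is needed: output bin i of an axis of length in_size pooled to out_size bins covers [floor(i * in_size / out_size), ceil((i + 1) * in_size / out_size)).

```c
/* Integer floor division for a >= 0, b > 0: C's '/' truncates toward zero,
 * which equals floor for non-negative operands. */
int floor_div(int a, int b)
{
    return a / b;
}

/* Integer ceil division for a >= 0, b > 0, computed without floats. */
int ceil_div(int a, int b)
{
    return (a + b - 1) / b;
}

/* Hypothetical helper: half-open input range [*start, *end) covered by
 * output bin i when an axis of length in_size is pooled to out_size bins. */
void adaptive_bin(int i, int in_size, int out_size, int* start, int* end)
{
    *start = floor_div(i * in_size, out_size);
    *end = ceil_div((i + 1) * in_size, out_size);
}
```

For in_size = 5 and out_size = 3 this yields the overlapping bins [0, 2), [1, 4) and [3, 5).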
nihuini
514770a37e
fix adaptive pooling vulkan
4 years ago
nihuini
80f2138b0c
fix adaptive pooling
4 years ago
nihuini
6c2cee8186
fix mat clone with atypical source cstep
4 years ago
nihuini
a490f8a533
fix layernorm with affine
4 years ago
Tijmen Verhulsdonck
ac5dc23ccc
added a number of optimized sse layers (#3302)
* added a number of optimized sse layers, specifically to improve the performance of mobilenet-style networks
4 years ago