Martin Kroeker
3d511f0e66
replace spurious avx512 requirement with fma check
4 years ago
Martin Kroeker
2dfb24730d
Use "old" compute(24) function with clang due to register limitations
4 years ago
Martin Kroeker
7b8f580941
Merge pull request #3156 from martin-frbg/omatcopy_d
Move x86_64 DOMATCOPY_RT back to the C implementation
4 years ago
Martin Kroeker
0f5e86a0d9
Remove premature entry for DOMATCOPY_RT
4 years ago
Martin Kroeker
7b294a99fd
Move common.h back to the top of the file so that SKYLAKEX (from config.h) is defined in time
4 years ago
Martin Kroeker
0934568d9c
Move includes under the ifdef for compilers w/o intrinsics support
4 years ago
Martin Kroeker
a9f6f7ad39
Remove spurious AVX512 requirement and add AVX2/FMA3 guard
4 years ago
Martin Kroeker
292d1af1a0
Update omatcopy_rt.c
5 years ago
Martin Kroeker
325b398e3c
Update omatcopy_rt.c
5 years ago
Martin Kroeker
6f5667b4d4
Enable optimized S/D OMATCOPY_RT
5 years ago
Martin Kroeker
cceeee7806
Add optimized omatcopy_rt
5 years ago
Martin Kroeker
47691c031f
Use Haswell optimizations for Zen as well
5 years ago
Martin Kroeker
ce7ddd8921
Use Haswell optimizations for Zen as well
5 years ago
Martin Kroeker
950c047b49
Use Haswell optimizations for Zen as well
5 years ago
Martin Kroeker
46509953a9
Use Haswell optimizations for Zen as well
5 years ago
Martin Kroeker
db348dcff2
Enable optimized srot/drot kernels from Haswell
5 years ago
Martin Kroeker
69a5558203
Merge pull request #3059 from Guobing-Chen/BF16_gemm
Initial code for Cooperlake BF16 GEMM kernel
5 years ago
Alex Henrie
202fc9e8ed
Fix uninitialized argument value in dasum_k
5 years ago
Chen, Guobing
b0beb0b1ca
Initial code for Cooperlake BF16 GEMM kernel
5 years ago
Martin Kroeker
114eb159a4
Disable FMA intrinsics in the srot kernel when the compiler is PGI/NVIDIA
5 years ago
Martin Kroeker
441c08c9ff
Merge pull request #3016 from xiegengxin/complex-asum
Improve the performance of zasum and casum with AVX512 intrinsic
5 years ago
Gengxin Xie
0cb7a403b2
Fix erroneous declaration of function blas_level1_thread_with_return_value
5 years ago
Gengxin Xie
b766c1e9bb
Improve the performance of zasum and casum with AVX512 intrinsic
5 years ago
Martin Kroeker
f1bf040b25
Merge pull request #2988 from xiegengxin/smp-asum
Improve the performance of dasum and sasum when SMP is defined
5 years ago
Gengxin Xie
d6e7e05bb3
Improve the performance of dasum and sasum when SMP is defined
5 years ago
Qiyu8
a87e537b8c
modify macro
5 years ago
Qiyu8
5bc0a7583f
Only FMA3 and vectors larger than 128 bits have positive effects.
5 years ago
Qiyu8
8c0b206d4c
Optimize the performance of rot by using universal intrinsics
5 years ago
Martin Kroeker
ff16329cb7
Merge pull request #2972 from xiegengxin/rot-intrinsic
Improve the performance of rot by using AVX512 and AVX2 intrinsic
5 years ago
Gengxin Xie
725ffbf041
fix typo
5 years ago
Gengxin Xie
d9ba49165a
Improve the performance of rot by using AVX512 and AVX2 intrinsic
5 years ago
Chen, Guobing
a7b1f9b1bb
Implementation of BF16 based gemv
1. Add a new API -- sbgemv to support bfloat16 based gemv
2. Implement a generic kernel for sbgemv
3. Implement an avx512-bf16 based kernel for sbgemv
Signed-off-by: Chen, Guobing <guobing.chen@intel.com>
5 years ago
İsmail Dönmez
4a1d00f589
Fix build with -Werror=return-type
dgemm_tcopy_16_skylakex.c CNAME function should return an int, add a
return 0 similar to other files.
5 years ago
Bart Oldeman
b073d759d0
x86_64: clobber all xmm registers after vzeroupper
As observed with GCC 10 using -march=native -ftree-vectorize
on Knights Landing, the compiler is now smart enough to find clobbers inside
non-inlined static functions.
In particular, sgemv counted on a kernel to preserve the whole
%ymm2 register (since it was not in the clobber list), but the top
part was destroyed by vzeroupper. This caused many tests to fail.
This patch makes sure all xmm (and ymm/zmm by extension) registers
are listed as clobbered to avoid this happening, as most kernels
already did correctly in fact.
5 years ago
Bart Oldeman
03e781b766
sgemm_direct_skylakex: fix 75eeb26 regression.
The `#if defined(SKYLAKEX) || defined (COOPERLAKE)` guard
from that commit came before #include "common.h", so the
compiled function was empty and returned garbage results for
qualifying sgemm calls on those architectures.
Closes #2914
5 years ago
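The ordering problem described in the sgemm_direct_skylakex fix above can be illustrated with a minimal sketch. It is not taken from the repository: `DEMO_TARGET` is a hypothetical stand-in for the SKYLAKEX/COOPERLAKE macros, the `#define` in the middle stands in for `#include "common.h"`, and both function names are invented for illustration.

```c
/* Sketch of the include-order pitfall; every name here is hypothetical. */

/* Guard evaluated BEFORE the macro is defined: the #if branch is compiled out. */
#if defined(DEMO_TARGET)
int early_guard_taken(void) { return 1; }
#else
int early_guard_taken(void) { return 0; }
#endif

/* Stand-in for #include "common.h", assumed to be where the target macro
   becomes visible. */
#define DEMO_TARGET 1

/* Same guard evaluated AFTER the definition: the #if branch is kept. */
#if defined(DEMO_TARGET)
int late_guard_taken(void) { return 1; }
#else
int late_guard_taken(void) { return 0; }
#endif
```

Calling both functions from a small `main` should show the early guard falling through to its `#else` branch while the late guard takes the `#if` branch. That is why a guard of this kind most likely has to follow the header that defines the macro.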
Martin Kroeker
c339c40c01
Silence a redefinition warning
5 years ago
Qiyu8
bfdf4b56da
Add double precision universal intrinsics for X86/ARM
5 years ago
Martin Kroeker
756802df61
Merge pull request #2890 from martin-frbg/s-d-sum
Revert special handling of Windows xNRM2 and enable C+intrinsics kern…
5 years ago
Martin Kroeker
8d2df7d066
Revert special handling of Windows xNRM2 and enable C+intrinsics kernel for SSUM/DSUM
5 years ago
Martin Kroeker
08929430cd
Merge pull request #2886 from martin-frbg/issue_2767
Rename "HALF" precision functions (sh prefix) to "BFLOAT16" with "sb" prefix
5 years ago
Martin Kroeker
0c84ffe05f
Merge pull request #2881 from mattip/fninit
add fninit to reset fpu registers before assembler routines
5 years ago
Matti Picus
403eb513a0
use emms instead, add WIN guards
5 years ago
Martin Kroeker
dc8a1afa63
Rename "HALF" and "sh" to "BFLOAT16" and "sb"
5 years ago
Martin Kroeker
fd94236042
Rename "HALF" and "sh" to "BFLOAT16" and "sb"
5 years ago
Martin Kroeker
68ce719fac
Rename shdot_microk_cooperlake.c to sbdot_microk_cooperlake.c
5 years ago
Martin Kroeker
d7dd9b396c
Rename shdot.c to sbdot.c
5 years ago
Martin Kroeker
7812486091
Use generic C for D/Z nrm2 kernels on Windows to work around fpu exception bug
5 years ago
Matti Picus
a5b164946c
add fninit to reset fpu registers before assembler routines
5 years ago
Qiyu8
14f7dad3b7
performance improved
5 years ago
Qiyu8
325b539c26
Optimize the performance of daxpy by using universal intrinsics
5 years ago