Gengxin Xie
d6e7e05bb3
Improve the performance of dasum and sasum when SMP is defined
5 years ago
Martin Kroeker
ff16329cb7
Merge pull request #2972 from xiegengxin/rot-intrinsic
Improve the performance of rot by using AVX512 and AVX2 intrinsics
5 years ago
Martin Kroeker
110c7a6de0
Merge pull request #2979 from RajalakshmiSR/dot_power10
Optimize sdot/ddot for POWER10
5 years ago
Rajalakshmi Srinivasaraghavan
6e364981a8
Optimize sdot/ddot for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
5 years ago
Martin Kroeker
b976a0bf40
Remove previous workaround for compiler flags related to cpu capabilities in x86_64 DYNAMIC_ARCH builds
5 years ago
Martin Kroeker
ff74319ea5
Merge pull request #2977 from martin-frbg/issue2976
Fix macro name used in ifdef for POWERPC/PGI
5 years ago
Martin Kroeker
28d2dfe2b3
Fix macro name used in ifdef
5 years ago
Gengxin Xie
725ffbf041
fix typo
5 years ago
Gengxin Xie
d9ba49165a
Improve the performance of rot by using AVX512 and AVX2 intrinsics
5 years ago
Rajalakshmi Srinivasaraghavan
dd7a9cc5bf
POWER10: Change dgemm unroll factors
Changing the unroll factors for dgemm to 8 improves performance with the
POWER10 MMA feature. Also made some minor changes in sgemm for edge cases.
5 years ago
Rajalakshmi Srinivasaraghavan
b435491885
Optimize caxpy for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
5 years ago
Chen, Guobing
a7b1f9b1bb
Implementation of BF16 based gemv
1. Add a new API -- sbgemv to support bfloat16 based gemv
2. Implement a generic kernel for sbgemv
3. Implement an avx512-bf16 based kernel for sbgemv
Signed-off-by: Chen, Guobing <guobing.chen@intel.com>
5 years ago
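The generic kernel mentioned in item 2 of the commit above can be pictured with a minimal sketch. It assumes bfloat16 values are stored as the upper 16 bits of an IEEE-754 float32, a column-major matrix, and the no-transpose case only. The names `bf16_t`, `bf16_to_float`, and `sbgemv_n_sketch` are hypothetical and are not the signatures used in OpenBLAS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical storage type: the upper 16 bits of an IEEE-754 float32. */
typedef uint16_t bf16_t;

/* Widen one bfloat16 value to float; the 16 dropped low bits become zeros. */
static float bf16_to_float(bf16_t h) {
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Sketch of y = alpha * A * x + beta * y.
 * A is m x n, column-major with leading dimension lda.
 * A and x hold bfloat16 values; y and the accumulator are float. */
void sbgemv_n_sketch(int m, int n, float alpha, const bf16_t *a, int lda,
                     const bf16_t *x, float beta, float *y) {
    assert(m >= 0 && n >= 0 && lda >= m);
    for (int i = 0; i < m; i++) {
        float acc = 0.0f;
        for (int j = 0; j < n; j++)
            acc += bf16_to_float(a[i + (size_t)j * lda]) * bf16_to_float(x[j]);
        /* With beta == 0, y is treated as write-only so stale values are ignored. */
        y[i] = alpha * acc + (beta == 0.0f ? 0.0f : beta * y[i]);
    }
}
```

Accumulating in float rather than bfloat16 is a design choice of this sketch; a vectorized avx512-bf16 kernel would replace the inner loop with hardware dot-product instructions.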
Martin Kroeker
67f39ad813
Merge pull request #2939 from thrasibule/Makefile_cleanup
reuse variables defined in Makefile.system
5 years ago
Rajalakshmi Srinivasaraghavan
c24ba8b1dd
Optimize saxpy for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
5 years ago
Martin Kroeker
6f9460f0f6
Merge pull request #2937 from martin-frbg/pwr-buffersz
Increase and unify BUFFERSIZE on POWER; fix gcc inline warning
5 years ago
Guillaume Horel
1917a4e7b8
reuse variables defined in Makefile.system
5 years ago
Martin Kroeker
34c3c407ef
label always_inline function as inline to silence a gcc warning
5 years ago
Martin Kroeker
2e48d560ba
Fix compiler version check
5 years ago
Rajalakshmi Srinivasaraghavan
ad745c0bae
Optimize scopy/ccopy for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores. Also reorganized all variants of the copy functions
to make use of the same kernel.
5 years ago
İsmail Dönmez
4a1d00f589
Fix build with -Werror=return-type
The CNAME function in dgemm_tcopy_16_skylakex.c should return an int; add a
return 0, as in other files.
5 years ago
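The class of error fixed by the commit above can be shown with a minimal sketch. The function `copy_sketch` is a hypothetical stand-in, not the actual CNAME kernel; it only illustrates that a function declared to return int needs an explicit return value when building with -Werror=return-type.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical copy routine declared to return int, like the kernel
 * described in the commit message. */
int copy_sketch(size_t n, const double *src, double *dst) {
    assert(n == 0 || (src != NULL && dst != NULL));
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
    /* Omitting this line makes control reach the end of a non-void
     * function, which -Werror=return-type turns into a build error. */
    return 0;
}
```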
Bart Oldeman
b073d759d0
x86_64: clobber all xmm registers after vzeroupper
As observed with GCC 10 using -march=native -ftree-vectorize
on Knights Landing, the compiler is now smart enough to find clobbers
inside non-inlined static functions.
In particular, sgemv counted on a kernel to preserve the whole
%ymm2 register (since it was not in the clobber list), but the top
part was destroyed by vzeroupper. This caused many tests to fail.
This patch makes sure all xmm (and ymm/zmm by extension) registers
are listed as clobbered so that this cannot happen, as most kernels
already did correctly.
5 years ago
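The clobber-list fix described in the commit above can be sketched as follows. This is an illustration under stated assumptions, not the patched OpenBLAS code: the function name `sum_then_vzeroupper` is hypothetical, GCC-style extended inline assembly is assumed, and the instruction is only emitted when the compiler targets x86_64 with AVX enabled.

```c
#include <assert.h>

/* Sum n floats, then issue vzeroupper the way an AVX kernel epilogue might. */
float sum_then_vzeroupper(const float *x, int n) {
    assert(n >= 0);
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += x[i];
#if defined(__x86_64__) && defined(__AVX__)
    /* vzeroupper zeroes the upper part of every ymm register, so every
     * xmm register is listed as clobbered. Per the commit message, this
     * covers the overlapping ymm/zmm registers by extension and stops
     * the compiler from keeping live values in them across the asm. */
    __asm__ __volatile__(
        "vzeroupper"
        : /* no outputs */
        : /* no inputs */
        : "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7",
          "xmm8", "xmm9", "xmm10", "xmm11", "xmm12", "xmm13", "xmm14", "xmm15");
#endif
    return s;
}
```

Without the clobber list, a caller that kept a value in, say, %ymm2 across the call could see its upper half silently zeroed, which is the failure mode the commit reports for sgemv.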
Martin Kroeker
dc6e44c3f8
Merge pull request #2916 from martin-frbg/issue2911
Clean up duplicate definitions in POWER8 kernels and fix power10 option passing
5 years ago
Martin Kroeker
a61c086408
Fix spurious trailing whitespace in comment
5 years ago
Bart Oldeman
03e781b766
sgemm_direct_skylakex: fix 75eeb26 regression.
The `#if defined(SKYLAKEX) || defined (COOPERLAKE)` from that commit
came before #include "common.h", so the compiled function was empty
and returned garbage results for qualifying sgemm calls on those
architectures.
Closes #2914
5 years ago
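The ordering problem behind the regression above can be reproduced in a single file. In this sketch the macro `DEMO_TARGET` and both function names are hypothetical, and the `#define` stands in for the include of "common.h", which is assumed to be where the target macros become visible.

```c
#include <assert.h>

/* Guard evaluated BEFORE the macro exists: the preprocessor drops the
 * body and the function compiles to a no-op. */
void fill_ones_guard_too_early(int n, float *c) {
    assert(n >= 0);
#if defined(DEMO_TARGET)
    for (int i = 0; i < n; i++)
        c[i] = 1.0f;
#else
    (void)c; /* nothing is written, so the caller sees stale data */
#endif
}

/* Stand-in for #include "common.h" (assumed to define the target macro). */
#define DEMO_TARGET 1

/* Same guard, evaluated AFTER the macro is defined: the body is kept. */
void fill_ones_guard_after(int n, float *c) {
    assert(n >= 0);
#if defined(DEMO_TARGET)
    for (int i = 0; i < n; i++)
        c[i] = 1.0f;
#else
    (void)c;
#endif
}
```

The two functions are textually identical apart from their position relative to the definition, which is why this kind of bug compiles cleanly and only shows up as wrong results.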
Martin Kroeker
f1a4071d8c
Clean up STACKSIZE redefinition
5 years ago
Martin Kroeker
97cf10062f
Clean up STACKSIZE redefinition
5 years ago
Martin Kroeker
17e288e18d
Clean up STACKSIZE redefinition
5 years ago
Martin Kroeker
c1422f3e46
Clean up STACKSIZE redefinition
5 years ago
Martin Kroeker
d85b24e103
Clean up STACKSIZE redefinition
5 years ago
Martin Kroeker
df70667043
fix core list for sse/sse2
5 years ago
Martin Kroeker
f071d1207a
add sse2
5 years ago
Martin Kroeker
dc6cefd2f5
Expressly enable -msse for 32bit DYNAMIC_ARCH kernels
5 years ago
Martin Kroeker
c339c40c01
Silence a redefinition warning
5 years ago
Martin Kroeker
10379fc83b
Use ifdef instead of if
5 years ago
Martin Kroeker
4c25910da0
Merge pull request #2896 from martin-frbg/intrin-double
Add compiler flag for SSE4 where available
5 years ago
Martin Kroeker
ae6ac83991
Revert "add double precision SSE"
5 years ago
Qiyu8
4fac91ef37
adapt arm platform
5 years ago
Qiyu8
bfdf4b56da
Add double precision universal intrinsics for X86/ARM
5 years ago
Martin Kroeker
ebf0470fc2
add sse4.1 for DYNAMIC_ARCH kernels
5 years ago
Martin Kroeker
c9c3ae07af
Add double precision operations
5 years ago
Martin Kroeker
756802df61
Merge pull request #2890 from martin-frbg/s-d-sum
Revert special handling of Windows xNRM2 and enable C+intrinsics kern…
5 years ago
Rajalakshmi Srinivasaraghavan
0826d68f93
POWER10: Change the packing format for bfloat16
As the new MMA instructions need the inputs in 4x2 order for bfloat16,
change the format in the copy/packing code. This avoids permute instructions
in the gemm kernel inner loop.
5 years ago
Rajalakshmi Srinivasaraghavan
b5d30b390d
Fix build issues with bfloat16
This patch fixes compilation errors due to recent renaming from SH to SB
with BUILD_BFLOAT16.
5 years ago
Martin Kroeker
fecedc9c69
Add -mssse3
5 years ago
Martin Kroeker
0eacbca85f
Add Haswell and Zen to temporary sse3 whitelist
5 years ago
Martin Kroeker
6999086a2b
whitelist SANDYBRIDGE for SSE3
5 years ago
Martin Kroeker
8d2df7d066
Revert special handling of Windows xNRM2 and enable C+intrinsics kernel for SSUM/DSUM
5 years ago
Martin Kroeker
08929430cd
Merge pull request #2886 from martin-frbg/issue_2767
Rename "HALF" precision functions (sh prefix) to "BFLOAT16" with "sb" prefix
5 years ago
Martin Kroeker
0c84ffe05f
Merge pull request #2881 from mattip/fninit
add fninit to reset fpu registers before assembler routines
5 years ago
Matti Picus
403eb513a0
use emms instead, add WIN guards
5 years ago