Rajalakshmi Srinivasaraghavan
2379abaa5e
POWER10: Improve dgemm performance
This patch uses a vector pair pointer for the input load operation,
which helps the compiler generate POWER10 lxvp instructions.
4 years ago
Rajalakshmi Srinivasaraghavan
55bb9f639a
POWER10: Optimized zgemv
This patch makes use of Matrix-Multiply Assist (MMA)
feature introduced in POWER ISA v3.1 for zgemv_n and zgemv_t.
4 years ago
Rajalakshmi Srinivasaraghavan
2dbcddd83d
POWER10: Adding check for little endian
This patch makes sure that recent POWER10 patches are used
only for little endian.
4 years ago
Martin Kroeker
86c5a0013f
Add workaround for LAPACK testsuite failures with the NVIDIA HPC compiler
4 years ago
Martin Kroeker
ef85c22474
Add workaround for LAPACK test failures with the NVIDIA HPC compiler
4 years ago
Martin Kroeker
d3555d2e50
Add workaround for LAPACK test failures with the NVIDIA HPC compiler
4 years ago
Rajalakshmi Srinivasaraghavan
09d47af2c0
Optimize zscal function for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
4 years ago
Rajalakshmi Srinivasaraghavan
41646ed006
Optimize s/dasum function for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
4 years ago
Rajalakshmi Srinivasaraghavan
0571c3187b
POWER10: Rename mma builtins
The LLVM and GCC teams agreed to rename the __builtin_mma_assemble_pair and
__builtin_mma_disassemble_pair built-ins to __builtin_vsx_assemble_pair and
__builtin_vsx_disassemble_pair respectively. This patch makes the
corresponding changes in the dgemm kernel. It also changes the
inputs to those builtins to avoid some potential typecasting issues.
Reference gcc commit id:77ef995c1fbcab76a2a69b9f4700bcfd005d8e62
4 years ago
Rajalakshmi Srinivasaraghavan
2056ffc227
Optimize cscal function for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
5 years ago
Rajalakshmi Srinivasaraghavan
3ede843d50
Optimize s/dscal function for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
5 years ago
Rajalakshmi Srinivasaraghavan
439b93f6d2
Optimize s/drot function for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
5 years ago
Rajalakshmi Srinivasaraghavan
eff7c9166e
Optimize cdot function for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
5 years ago
Rajalakshmi Srinivasaraghavan
601b711c78
Optimize swap function for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
5 years ago
Rajalakshmi Srinivasaraghavan
2fb11f873b
POWER10: Improve copy performance
This patch aligns the stores to a 32-byte boundary for scopy and dcopy
before entering the vector pair loop. For ccopy, the store
instructions are changed to stxv to improve performance in unaligned cases.
5 years ago
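The alignment step described in the commit above can be illustrated in portable C. This is only a sketch under stated assumptions, not the OpenBLAS kernel: the names head_count and dcopy_sketch are hypothetical, the destination is assumed to have the natural 8-byte alignment of a double, and the 4-element inner loop merely stands in for one 32-byte vector pair store.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not an OpenBLAS function): how many leading doubles
 * must be stored one at a time before dst sits on a 32-byte boundary.
 * Assumes dst has the natural 8-byte alignment of a double. */
static size_t head_count(const double *dst, size_t n)
{
    uintptr_t misalign = (uintptr_t)dst & 31u;
    size_t head = misalign ? (size_t)(32u - misalign) / sizeof(double) : 0;
    return head < n ? head : n;
}

/* Hypothetical dcopy-like routine: y[0..n) = x[0..n). */
static void dcopy_sketch(size_t n, const double *x, double *y)
{
    size_t i = 0;
    size_t head = head_count(y, n);

    assert(x != NULL && y != NULL);

    /* 1. Scalar prologue: advance until the store address is 32-byte aligned. */
    for (; i < head; i++)
        y[i] = x[i];

    /* 2. Main loop: 4 doubles = 32 bytes per step, standing in for one
     *    vector pair store in a real kernel. */
    for (; i + 4 <= n; i += 4) {
        y[i]     = x[i];
        y[i + 1] = x[i + 1];
        y[i + 2] = x[i + 2];
        y[i + 3] = x[i + 3];
    }

    /* 3. Scalar epilogue: whatever is left after the last full 32-byte step. */
    for (; i < n; i++)
        y[i] = x[i];
}
```

Only the stores are aligned here, matching the commit text; the loads from x may still be unaligned.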
Martin Kroeker
043128cbe5
Merge pull request #3029 from RajalakshmiSR/axpyp10
POWER10: Improve axpy performance
5 years ago
Rajalakshmi Srinivasaraghavan
346e30a46a
POWER10: Improve axpy performance
This patch aligns the stores to a 32-byte boundary for saxpy and daxpy
before entering the vector pair loop. For caxpy, the store
instructions are changed to stxv to improve performance in unaligned cases.
5 years ago
Gordon Fossum
213c0e7abb
Added special unrolled vectorized versions of "Solve" for specific sizes,
in DTRSM and STRSM, to improve performance in Power9 and Power10.
5 years ago
Rajalakshmi Srinivasaraghavan
7d46e31de1
POWER10: Optimize dgemv_n
Handling as 4x8 with vector pairs gives better performance than
existing code in POWER10.
5 years ago
Rajalakshmi Srinivasaraghavan
6e364981a8
Optimize sdot/ddot for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
5 years ago
Rajalakshmi Srinivasaraghavan
dd7a9cc5bf
POWER10: Change dgemm unroll factors
Changing the unroll factors for dgemm to 8 shows improved performance with
POWER10 MMA feature. Also made some minor changes in sgemm for edge cases.
5 years ago
Rajalakshmi Srinivasaraghavan
b435491885
Optimize caxpy for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
5 years ago
Rajalakshmi Srinivasaraghavan
c24ba8b1dd
Optimize saxpy for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores.
5 years ago
Martin Kroeker
34c3c407ef
label always_inline function as inline to silence a gcc warning
5 years ago
Rajalakshmi Srinivasaraghavan
ad745c0bae
Optimize scopy/ccopy for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores. Also reorganized all variants of copy functions
to make use of same kernel.
5 years ago
Martin Kroeker
a61c086408
Fix spurious trailing whitespace in comment
5 years ago
Martin Kroeker
f1a4071d8c
Clean up STACKSIZE redefinition
5 years ago
Martin Kroeker
97cf10062f
Clean up STACKSIZE redefinition
5 years ago
Martin Kroeker
17e288e18d
Clean up STACKSIZE redefinition
5 years ago
Martin Kroeker
c1422f3e46
Clean up STACKSIZE redefinition
5 years ago
Martin Kroeker
d85b24e103
Clean up STACKSIZE redefinition
5 years ago
Rajalakshmi Srinivasaraghavan
0826d68f93
POWER10: Change the packing format for bfloat16
The new MMA instructions need bfloat16 inputs in 4x2 order, so this patch
changes the format in the copy/packing code. This avoids permute instructions
in the gemm kernel inner loop.
5 years ago
Martin Kroeker
2061f7fdff
Rename "HALF" and "sh" to "BFLOAT16" and "sb"
5 years ago
Martin Kroeker
9ae80490e0
rename "HALF" and "sh" to "BFLOAT16" and "sb"
5 years ago
Martin Kroeker
d314d1f49f
Rename shgemm_kernel_power10.c to sbgemm_kernel_power10.c
5 years ago
Rajalakshmi Srinivasaraghavan
2df4235e00
Optimize dcopy/zcopy for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores. Tested in simulator and no new failures.
5 years ago
Rajalakshmi Srinivasaraghavan
be43d2cb96
Optimize daxpy/zaxpy for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores. Tested in simulator and no new failures.
5 years ago
Rajalakshmi Srinivasaraghavan
317ff27cda
POWER10: Avoid setting accumulators to zero in gemm kernels
For the first iteration, it is better to use the xvf*ger builtins instead of
xvf*gerpp, which avoids having to set the accumulators to zero. This saves
a few instructions.
5 years ago
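The idea in the commit above can be modelled in portable C. This is a rough sketch under stated assumptions, not the POWER10 kernel: acc4x4, ger_sketch, gerpp_sketch and kernel_4x4 are hypothetical names, and the two helpers only imitate the difference between an outer product that overwrites the accumulator (the xvf*ger form) and one that adds to it (the xvf*gerpp form).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a 4x4 single-precision accumulator. */
typedef struct { float v[4][4]; } acc4x4;

/* Models the overwriting form: acc = a (outer product) b. */
static void ger_sketch(acc4x4 *acc, const float a[4], const float b[4])
{
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            acc->v[i][j] = a[i] * b[j];
}

/* Models the accumulating form: acc += a (outer product) b. */
static void gerpp_sketch(acc4x4 *acc, const float a[4], const float b[4])
{
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            acc->v[i][j] += a[i] * b[j];
}

/* Hypothetical 4x4 micro-kernel over k rank-1 updates. a and b each hold
 * k groups of 4 floats. The first update overwrites acc, so the caller
 * never has to zero it beforehand. */
static void kernel_4x4(size_t k, const float *a, const float *b, acc4x4 *acc)
{
    assert(k >= 1);

    ger_sketch(acc, a, b);                        /* first iteration */
    for (size_t l = 1; l < k; l++)                /* remaining iterations */
        gerpp_sketch(acc, a + 4 * l, b + 4 * l);
}
```

The saving is the separate zeroing pass over the accumulator: whatever acc held on entry is discarded by the first update.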
Rajalakshmi Srinivasaraghavan
f77b6a83f4
dgemv optimization for POWER10
Making use of new vector pair POWER10 instructions in dgemv_n and dgemv_t.
Also adding a new block 4x128 to make use of Matrix-Multiply Assist (MMA)
feature introduced in POWER ISA v3.1. Tested on simulator and there
are no new test failures.
5 years ago
Rajalakshmi Srinivasaraghavan
d557584b71
Fix compilation issues with clang on POWER
As gcc defaults to -malign-power, removing that option. Also
adding -fno-integrated-as to use GNU assembler for powerpc
assembly optimization files. Fixed other compilation errors
reported in dgemv_t.c file.
5 years ago
Rajalakshmi Srinivasaraghavan
9be2688c78
Fix to store results in correct order for POWER10 GEMM kernels
A recent compiler change in __builtin_mma_disassemble_acc() affects the
order in which results are stored on POWER10. Also removing the new LDFLAG
-mno-power10-stub, as it is handled by the linker automatically.
5 years ago
Martin Kroeker
6a2a60038c
Merge pull request #2720 from martin-frbg/issue2694
WIP Further fixes for 32bit POWER8
5 years ago
Martin Kroeker
251a09ec90
Typo fix
5 years ago
Martin Kroeker
95d37e1575
Regroup the 32 and 64bit sections and restore 64bit CAXPY
5 years ago
Martin Kroeker
3523bb778e
Merge pull request #2721 from martin-frbg/p8align
Fix alignment errors in the power8 saxpy kernel
5 years ago
Martin Kroeker
ca3561cab9
Add ifdefs around call to altivec microkernel
5 years ago
Martin Kroeker
21072e502a
Typo fix
5 years ago
Martin Kroeker
661c6bfa5a
Exclude altivec code paths if the compiler does not support them
5 years ago
Martin Kroeker
0033f8be0d
Use vec_vsx_ld/st to fix misaligned accesses flagged by asan
5 years ago
Martin Kroeker
f308e741b2
remove debug output and revert changes to cdot and crot
5 years ago