Chris Sidebottom
ecae1389df
Reduce duplication in kernel definitions
These files are exactly the same, so I believe we can reduce these files
down. Other files require a slightly more complex unpicking.
2 years ago
Chris Sidebottom
60e66725e4
Use numeric labels to allow repeated inlining
2 years ago
Chris Sidebottom
7a4fef4f60
Tweak SVE dot kernel
This changes the SVE dot kernel to only predicate when necessary as well
as streamlining the assembly a bit. The benchmarks seem to indicate this
can improve performance by ~33%.
2 years ago
Martin Kroeker
f06b535566
Use C kernel for dgemv_t due to limitations of the old assembly one
2 years ago
barracuda156
d9653af018
KERNEL.PPC970, KERNEL.PPCG4: unbreak CMake parsing
Fixes: https://github.com/OpenMathLib/OpenBLAS/issues/4366
2 years ago
Chip-Kerchner
93747fb377
Merge remote-tracking branch 'origin/develop' into power10Copies
2 years ago
Chip-Kerchner
4e738e561a
Replace two vector loads with one vector pair load and fix endianness of stores.
2 years ago
yancheng
d32f38fb37
loongarch64: Add optimizations for nrm2.
2 years ago
yancheng
f9b468990e
loongarch64: Add optimizations for rot.
2 years ago
yancheng
c80e7e27d1
loongarch64: Add optimizations for sum and asum.
2 years ago
yancheng
d4c96a35a8
loongarch64: Add optimizations for axpy and axpby.
2 years ago
yancheng
360acc0a41
loongarch64: Add optimizations for swap.
2 years ago
yancheng
174c25766b
loongarch64: Add optimizations for copy.
2 years ago
yancheng
49829b2b7d
loongarch64: Add optimizations for iamin.
2 years ago
yancheng
be83f5e4e0
loongarch64: Add optimizations for iamax.
2 years ago
yancheng
e3fb2b5afa
loongarch64: Add optimizations for imin.
2 years ago
yancheng
e46b48e372
loongarch64: Add optimizations for imax.
2 years ago
yancheng
702fc1d56d
loongarch64: Add optimization for min.
2 years ago
yancheng
346b384d1c
loongarch64: Add optimization for max.
2 years ago
yancheng
ff2ecc6cda
loongarch64: Add optimization for amin.
2 years ago
yancheng
265b5f2e80
loongarch64: Add optimizations for amax.
2 years ago
yancheng
993ede7c70
loongarch64: Add optimizations for scal.
2 years ago
Martin Kroeker
39bf8ece20
Merge pull request #4340 from yinshiyou/la-dev
Add some refinements and optimizations for LoongArch.
2 years ago
Shiyou Yin
9fe07d82fd
loongarch: Add LSX optimization for dot.
2 years ago
Shiyou Yin
13b8c44b44
loongarch: Add optimization for dsdot kernel.
2 years ago
Shiyou Yin
3def6a8143
loongarch: Add LASX optimization for dot.
2 years ago
Bart Oldeman
c34e2cf380
Use _mm_set1_epi{32,64x} to init mask in x86-64 [cz]asum
for skylake kernels. This is the same method as used in [sd]asum.
_mm_set1_epi64x was commented out for zasum, but has the advantage
of avoiding possible undefined behaviour (using an uninitialized
variable), optimized out by NVHPC and icx. The new code works
fine with those compilers.
For GCC 12.3 the generated code is identical: no matter which method
you use, the compiler folds the mask into a compile-time constant,
so there is no performance benefit to using _mm_cmpeq_epi8, since
the corresponding instruction (VPCMPEQB) isn't actually generated!
2 years ago
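The mask initialisation described in the commit above can be illustrated with a minimal sketch. It is not the OpenBLAS Skylake kernel: `sasum_mask_sketch` and `abs_scalar` are hypothetical names, and the sketch uses 128-bit SSE2 intrinsics and real single-precision data (rather than AVX512 and complex data) to stay short. It assumes an x86-64 compiler providing `<emmintrin.h>` and falls back to a scalar loop elsewhere.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical scalar helper used for the tail and for non-SSE2 builds. */
static float abs_scalar(float v)
{
    return v < 0.0f ? -v : v;
}

/* Hypothetical sketch of an asum-style reduction: returns sum of |x[i]|. */
static float sasum_mask_sketch(size_t n, const float *x)
{
    float sum = 0.0f;
    size_t i = 0;
#if defined(__SSE2__)
    /* Build the "clear the sign bit" mask from an explicit constant.
       The alternative of comparing an uninitialised register with itself
       (_mm_cmpeq_epi8(m, m)) and shifting reads an indeterminate value,
       which a compiler is allowed to optimise away. */
    const __m128 abs_mask = _mm_castsi128_ps(_mm_set1_epi32(0x7fffffff));
    __m128 acc = _mm_setzero_ps();
    float lanes[4];

    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(x + i);
        acc = _mm_add_ps(acc, _mm_and_ps(v, abs_mask)); /* |v| per lane */
    }
    _mm_storeu_ps(lanes, acc);
    sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
    /* Scalar tail (or the whole vector when SSE2 is unavailable). */
    for (; i < n; i++)
        sum += abs_scalar(x[i]);
    return sum;
}
```

Calling `sasum_mask_sketch(6, x)` on a six-element array processes the first four elements in the vector loop and the remaining two in the scalar tail.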
Martin Kroeker
22aa401656
Temporarily disable the AVX512 CASUM/ZASUM microkernels for any version of NVIDIA HPC (#4327)
* Temporarily disable the C/ZASUM microkernels for any version of NVHPC
2 years ago
Bart Oldeman
f8ad5344c2
Fix casum fallback kernel.
This kernel is only used on Skylake+ if the kernel with AVX512
intrinsics can't be used, but used the variable x1 incorrectly
in the tail end of the loop, as it is still at the initial
value instead of where x points to.
This caused 55 "other error"s in the LAPACK tests
(https://github.com/OpenMathLib/OpenBLAS/issues/4282).
This change makes casum.c as similar as possible to zasum.c,
because zasum.c does this correctly.
2 years ago
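The class of bug fixed in the commit above (a tail loop reading from a pointer that still holds its initial value) can be sketched as follows. This is an illustration only, not the code in casum.c: `casum_tail_sketch` and `abs_val` are hypothetical names, and the unroll factor of 8 floats is an assumption made for the example.

```c
#include <stddef.h>

/* Hypothetical scalar absolute value, kept local to avoid extra headers. */
static float abs_val(float v)
{
    return v < 0.0f ? -v : v;
}

/* Hypothetical sketch: sum of |re| + |im| over n complex floats stored
   interleaved in x (so 2*n floats in total). */
static float casum_tail_sketch(size_t n, const float *x)
{
    const float *p = x;           /* cursor that advances through the data */
    size_t nfloats = 2 * n;
    size_t blocks = nfloats / 8;  /* main loop handles 8 floats per pass   */
    size_t tail = nfloats % 8;    /* leftover floats after the main loop   */
    float sum = 0.0f;
    size_t b, i;

    for (b = 0; b < blocks; b++) {
        sum += abs_val(p[0]) + abs_val(p[1]) + abs_val(p[2]) + abs_val(p[3])
             + abs_val(p[4]) + abs_val(p[5]) + abs_val(p[6]) + abs_val(p[7]);
        p += 8;
    }

    /* The tail must continue from the advanced cursor p. Indexing from the
       original base pointer here would re-read the first elements, which is
       the kind of mistake the commit message describes. */
    for (i = 0; i < tail; i++)
        sum += abs_val(p[i]);

    return sum;
}
```

With five complex elements (ten floats), the main loop consumes eight floats and the tail must pick up the last two; a tail that restarted from the base pointer would silently return a wrong sum rather than crash.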
Martin Kroeker
04bc801999
(Re)apply fixes for supporting only a subset of precision types from PR 3915
2 years ago
Martin Kroeker
9019bc4945
Use SkylakeX ?ASUM microkernel for Cooperlake/Sapphirerapids as well
2 years ago
Martin Kroeker
3bfa4d4dcc
Fix outdated SVE kernel definitions for Cortex cpus by aliasing to ARMV8SVE
2 years ago
Rajalakshmi Srinivasaraghavan
980f702f72
POWER: AIX: Make use of power10 optimization
POWER10 optimizations are disabled when using the default AIX assembler.
As many issues have been fixed recently, enable the optimization path
for the default assembler.
2 years ago
Rajalakshmi Srinivasaraghavan
9f42570e33
POWER: Increase macro size limit for AIX
This patch increases the macro size limit from 4096 to 16384 to
allow compiling larger assembly files in AIX.
Tested with GCC and IBM Open XL C.
2 years ago
Martin Kroeker
9f49aef91b
Merge pull request #4255 from RajalakshmiSR/AIX-P10
POWER10: Fix compilation issues with Open XL C
2 years ago
Martin Kroeker
e7d05402e0
Fix up S/D GEMM copy function definitions after #4009
2 years ago
Rajalakshmi Srinivasaraghavan
71d733e5f7
POWER: Avoid m4 conversions for C files
This patch removes the intermediate m4 conversions used in sbgemm
compilation, as they are not needed for .c files.
Tested on AIX with gcc and IBM Open XL C.
2 years ago
Rajalakshmi Srinivasaraghavan
82fc29a57a
POWER10: Fallback to POWER8 functions
As the cgemm and zgemm kernels are not optimized for big endian, fall
back to the POWER8 versions. Tested on AIX using gcc and Open XL C.
2 years ago
Rajalakshmi Srinivasaraghavan
db0805906b
powerpc: Fix build errors with Open XL C
This patch fixes errors when using Open XL C compiler on AIX.
Tested with gcc/xlf and ibm-clang/xlf compiler combinations.
2 years ago
Martin Kroeker
675cd551da
fix improper function prototypes (empty parentheses)
2 years ago
gxw
d15e0a055c
LoongArch64: Fixed compilation issues when enabling DYNAMIC_ARCH
2 years ago
gxw
4670eb1462
LoongArch64: Add dtrsm kernel
2 years ago
gxw
f2cf929374
LoongArch64: Add sgemv kernel
2 years ago
Martin Kroeker
8e6d93359d
Merge pull request #4196 from TiborGY/obsolete_inlines
Modernize obsolete inline order
2 years ago
gxw
394a1fd1bf
LoongArch64: Compatible with early internal toolchain
__loongarch_grlen and __loongarch_frlen were introduced in gcc version 8.3.0
(Loongnix 8.3.0-6.lnd.vec.31) internally within Loongson to standardize the
general and floating-point register widths. However, previous versions did
not have them, requiring additional checks to be added.
2 years ago
Martin Kroeker
9c4ae4d4fb
Merge pull request #4206 from martin-frbg/issue4201-2
Work around miscompilation of zdot_thunderx2t99 by the current NVIDIA HPC compiler
2 years ago
Martin Kroeker
88435104c8
Merge pull request #4204 from martin-frbg/llvm17-2
Work around LLVM17 miscompiling the AVX512 microkernels for CASUM/ZASUM
2 years ago
Martin Kroeker
fc8894dd98
Workaround miscompilation by NVIDIA nvc
2 years ago
Martin Kroeker
7a6203ffa1
restore default Neoverse SVE build instructions for non-NVIDIA compilers
2 years ago
Martin Kroeker
2c3034ff7f
Disable the C/ZASUM AVX512 microkernels when compiling with LLVM17 as well
2 years ago