Martin Kroeker
519b40fad9
Merge pull request #4398 from yinshiyou/la-dev
Add Optimizations for LoongArch.
2 years ago
pengxu
a5d0d21378
loongarch64: Add zgemm and cgemm optimization
2 years ago
gxw
546f13558c
loongarch64: Add {c/z}swap and {c/z}sum optimization
2 years ago
Hao Chen
edabb93668
loongarch64: Refine axpby optimization functions.
2 years ago
Hao Chen
1ec5dded43
loongarch64: Add c/zrot optimization functions.
Signed-off-by: Hao Chen <chenhao@loongson.cn>
2 years ago
Hao Chen
3c53ded315
loongarch64: Add c/znrm2 optimization functions.
2 years ago
Hao Chen
fbd612f8c4
loongarch64: Add ic/zamin optimization functions.
2 years ago
Hao Chen
d97272cb35
loongarch64: Add c/zdot optimization functions.
2 years ago
Hao Chen
65a0aeb128
loongarch64: Add c/zcopy optimization functions.
Signed-off-by: Hao Chen <chenhao@loongson.cn>
2 years ago
Hao Chen
2a34fb4b80
loongarch64: Add and refine scal optimization functions.
Signed-off-by: Hao Chen <chenhao@loongson.cn>
2 years ago
Hao Chen
8785e948b5
loongarch64: Add camin optimization function.
2 years ago
Hao Chen
0753848e03
loongarch64: Refine and add axpy optimization functions.
Signed-off-by: Hao Chen <chenhao@loongson.cn>
2 years ago
Hao Chen
06fd5b5995
loongarch64: Add and Refine asum optimization functions.
2 years ago
guxiwei
e771be185e
Optimize copy functions with lsx.
Signed-off-by: Hao Chen <chenhao@loongson.cn>
2 years ago
Hao Chen
179ed51d3b
Add dgemm_kernel_8x4.S file.
2 years ago
Hao Chen
173a65d4e6
loongarch64: Add and refine iamax optimization functions.
2 years ago
zhoupeng
ea70e165c7
loongarch64: Refine rot optimization.
2 years ago
zhoupeng
116aee7527
loongarch64: Refine imin optimization.
2 years ago
zhoupeng
8be2654193
loongarch64: Refine imax optimization.
2 years ago
zhoupeng
154baad454
loongarch64: Refine iamin optimization.
2 years ago
Shiyou Yin
36c12c4971
loongarch64: Refine copy,swap,nrm2,sum optimization.
2 years ago
Shiyou Yin
c6996a80e9
loongarch64: Refine amax,amin,max,min optimization.
2 years ago
Chris Sidebottom
ecae1389df
Reduce duplication in kernel definitions
These files are exactly the same, so I believe we can consolidate
them. Other files require a slightly more complex unpicking.
2 years ago
Chris Sidebottom
60e66725e4
Use numeric labels to allow repeated inlining
2 years ago
Chris Sidebottom
7a4fef4f60
Tweak SVE dot kernel
This changes the SVE dot kernel to only predicate when necessary as well
as streamlining the assembly a bit. The benchmarks seem to indicate this
can improve performance by ~33%.
2 years ago
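The idea of predicating only when necessary can be illustrated with a scalar C analogy. This is a rough sketch, not the SVE kernel from the commit: `dot_sketch` and `LANES` are hypothetical names, and the per-lane `if` in the tail merely stands in for a hardware predicate. The main loop processes full blocks with no per-element guard, and only the final partial block pays for lane masking.

```c
#include <stddef.h>

/* Hypothetical "vector width" used only for this illustration. */
enum { LANES = 4 };

/* Hypothetical helper, not an OpenBLAS function: dot product of x and y. */
static double dot_sketch(const double *x, const double *y, size_t n)
{
    double acc[LANES] = {0.0, 0.0, 0.0, 0.0};
    size_t i = 0;

    /* Main loop: every lane is known to be in range, so no guard is needed. */
    for (; i + LANES <= n; i += LANES) {
        for (size_t l = 0; l < LANES; l++) {
            acc[l] += x[i + l] * y[i + l];
        }
    }

    /* Tail: at most one partial block, where each lane is guarded. */
    if (i < n) {
        for (size_t l = 0; l < LANES; l++) {
            if (i + l < n) { /* stands in for an active predicate lane */
                acc[l] += x[i + l] * y[i + l];
            }
        }
    }

    /* Horizontal reduction of the per-lane partial sums. */
    return acc[0] + acc[1] + acc[2] + acc[3];
}
```

Calling `dot_sketch(x, y, 5)` runs one unguarded block of four elements and then a single guarded tail block for the fifth. Any speedup from this split on real hardware depends on the target and is not demonstrated by the sketch.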
Martin Kroeker
f06b535566
Use C kernel for dgemv_t due to limitations of the old assembly one
2 years ago
barracuda156
d9653af018
KERNEL.PPC970, KERNEL.PPCG4: unbreak CMake parsing
Fixes: https://github.com/OpenMathLib/OpenBLAS/issues/4366
2 years ago
Chip-Kerchner
93747fb377
Merge remote-tracking branch 'origin/develop' into power10Copies
2 years ago
Chip-Kerchner
4e738e561a
Replace two vector loads with one vector pair load and fix endianness of stores.
2 years ago
yancheng
d32f38fb37
loongarch64: Add optimizations for nrm2.
2 years ago
yancheng
f9b468990e
loongarch64: Add optimizations for rot.
2 years ago
yancheng
c80e7e27d1
loongarch64: Add optimizations for sum and asum.
2 years ago
yancheng
d4c96a35a8
loongarch64: Add optimizations for axpy and axpby.
2 years ago
yancheng
360acc0a41
loongarch64: Add optimizations for swap.
2 years ago
yancheng
174c25766b
loongarch64: Add optimizations for copy.
2 years ago
yancheng
49829b2b7d
loongarch64: Add optimizations for iamin.
2 years ago
yancheng
be83f5e4e0
loongarch64: Add optimizations for iamax.
2 years ago
yancheng
e3fb2b5afa
loongarch64: Add optimizations for imin.
2 years ago
yancheng
e46b48e372
loongarch64: Add optimizations for imax.
2 years ago
yancheng
702fc1d56d
loongarch64: Add optimization for min.
2 years ago
yancheng
346b384d1c
loongarch64: Add optimization for max.
2 years ago
yancheng
ff2ecc6cda
loongarch64: Add optimization for amin.
2 years ago
yancheng
265b5f2e80
loongarch64: Add optimizations for amax.
2 years ago
yancheng
993ede7c70
loongarch64: Add optimizations for scal.
2 years ago
Martin Kroeker
39bf8ece20
Merge pull request #4340 from yinshiyou/la-dev
Add some refinements and optimizations for LoongArch.
2 years ago
Shiyou Yin
9fe07d82fd
loongarch: Add LSX optimization for dot.
2 years ago
Shiyou Yin
13b8c44b44
loongarch: Add optimization for dsdot kernel.
2 years ago
Shiyou Yin
3def6a8143
loongarch: Add LASX optimization for dot.
2 years ago
Bart Oldeman
c34e2cf380
Use _mm_set1_epi{32,64x} to init mask in x86-64 [cz]asum
for skylake kernels. This is the same method as used in [sd]asum.
_mm_set1_epi64x was commented out for zasum, but it has the advantage
of avoiding possible undefined behaviour (using an uninitialized
variable), which NVHPC and icx optimize out. The new code works
fine with those compilers.
For GCC 12.3 the generated code is identical: whichever method
you use, the compiler folds the mask into a compile-time
constant, so there is no performance benefit to using mm_cmpeq_epi8,
since the corresponding instruction (VPCMPEQB) isn't actually
generated.
2 years ago
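The mask initialization described in this commit can be sketched as follows. This is a minimal illustration, not the kernel code: `asum_sketch` is a hypothetical name, and it uses 128-bit SSE2 intrinsics (with a scalar fallback) instead of the wider vectors a Skylake kernel would use. The point is that the absolute-value mask comes from an explicit constant via `_mm_set1_epi32`, so no uninitialized register is ever read.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper, not an OpenBLAS function: sum of |x[i]| over n floats. */
static float asum_sketch(const float *x, size_t n)
{
    float total = 0.0f;
    size_t i = 0;

#if defined(__SSE2__)
    /* Explicit constant mask: all bits set except the IEEE-754 sign bit. */
    const __m128 abs_mask = _mm_castsi128_ps(_mm_set1_epi32(0x7fffffff));
    __m128 acc = _mm_setzero_ps();

    /* Clear the sign bit of four floats at a time and accumulate. */
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(x + i);
        acc = _mm_add_ps(acc, _mm_and_ps(v, abs_mask));
    }

    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif

    /* Scalar tail (and the whole array when SSE2 is unavailable). */
    for (; i < n; i++) {
        uint32_t bits;
        float a;
        memcpy(&bits, &x[i], sizeof bits);
        bits &= UINT32_C(0x7fffffff); /* same mask, applied to one element */
        memcpy(&a, &bits, sizeof a);
        total += a;
    }
    return total;
}
```

Because the mask is a plain constant, a compiler is free to fold it at compile time, which matches the commit's observation that the choice of initialization method need not affect the generated code.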
Martin Kroeker
22aa401656
Temporarily disable the AVX512 CASUM/ZASUM microkernels for any version of NVIDIA HPC ( #4327 )
* Temporarily disable the C/ZASUM microkernels for any version of NVHPC
2 years ago