Martin Kroeker
ccfb7ead15
Merge pull request #2072 from martin-frbg/sum
Add (C)BLAS extension ?sum
6 years ago
Rashmica Gupta
bcdf1d4917
Add in runtime CPU detection for POWER.
6 years ago
Martin Kroeker
706dfe263b
Add POWER implementation of ?sum
as trivial copy of ?asum with the fabs replaced by fmr to preserve code structure
6 years ago
Martin Kroeker
7c51cc8527
Merge branch 'develop' into develop
6 years ago
AbdelRauf
853a18bc17
power9 makefile. dgemm based on power8 kernel with following changes : 32x unrolled 16x4 kernel and 8x4 kernel using (lxv stxv butterfly rank1 update). improvement from 17 to 22-23gflops. dtrmm cases were added into dgemm itself
6 years ago
Martin Kroeker
718efcec6f
Fix out-of-bounds memory access in gemm_beta
Fixes #2011 (as suggested by davemq), assuming typo by K.Goto
7 years ago
Martin Kroeker
f9d67bb5e8
Fix out-of-bounds memory access in gemm_beta
Fixes #2011 (as suggested by davemq) presuming typo by K.Goto
7 years ago
Ubuntu
498ac98581
Note for unused kernels
7 years ago
Ubuntu
cd9ea45463
NBMAX=4096 for gemvn, added sgemvn 8x8 for future
7 years ago
Ubuntu
4abc375a91
sgemv cgemv pairs
7 years ago
Ubuntu
43a4572038
crot fix
7 years ago
Abdelrauf
a034e65512
Merge branch 'develop' into develop
7 years ago
Ubuntu
8c3386be87
Added missing Blas1 single fp {saxpy, caxpy, cdot, crot(refactored version of srot),isamax ,isamin, icamax, icamin},
Fixed idamin,icamin choosing the first occurance index of equal minimals
7 years ago
Martin Kroeker
961d25e9c7
Use the new zrot.c on POWER8 for crot as well
fixes #1571 (the old zrot.S assembly does not handle incx=0 correctly)
7 years ago
Martin Kroeker
8a3b6fa108
Use generic zrot.c on ppc64/POWER6 to work around utest failure from … ( #1535 )
* Use generic C implementation of zrot on ppc64/POWER6 to work around utest failure from #1469
7 years ago
QWR QWR
28ca97015d
power8:Added initial zgemv_(t|n) ,i(d|z)amax,i(d|z)amin,dgemv_t(transposed),zrot
z13: improved zgemv_(t|n)_4,zscal,zaxpy
8 years ago
the mslm
2c0a008281
dgemm_ncopy_4_ save/restore
8 years ago
the mslm
c5425daa6b
power8 ?gemm_tcopy save/restore
8 years ago
martin
7a4b3cfbf8
Add trivially optimized DSDOT for POWER8
8 years ago
Martin Kroeker
9c017a2218
Save and restore VSX registers
8 years ago
Matt Brown
bd831a03a8
Optimise sscal for POWER9
Use lxvd2x instruction instead of lxvw4x.
lxvd2x performs far better on the new POWER architecture than lxvw4x.
8 years ago
Matt Brown
edc97918f8
Optimise srot for POWER9
Use lxvd2x instruction instead of lxvw4x.
lxvd2x performs far better on the new POWER architecture than lxvw4x.
8 years ago
Matt Brown
e0034de22d
Optimise sdot for POWER9
Use lxvd2x instruction instead of lxvw4x.
lxvd2x performs far better on the new POWER architecture than lxvw4x.
8 years ago
Matt Brown
32c7fe6bff
Optimise sasum for POWER9
Use lxvd2x instruction instead of lxvw4x.
lxvd2x performs far better on the new POWER architecture than lxvw4x.
8 years ago
Matt Brown
19bdf9d52b
Optimise casum for POWER9
Use lxvd2x instruction instead of lxvw4x.
lxvd2x performs far better on the new POWER architecture than lxvw4x.
8 years ago
Matt Brown
4f09030fdc
Optimise cswap for POWER9
Use lxvd2x instruction instead of lxvw4x.
lxvd2x performs far better on the new POWER architecture than lxvw4x.
8 years ago
Matt Brown
6f4eca5ea4
Optimise sswap for POWER9
Use lxvd2x instruction instead of lxvw4x.
lxvd2x performs far better on the new POWER architecture than lxvw4x.
8 years ago
Matt Brown
be55f96cbd
Optimise scopy for POWER9
Use lxvd2x instruction instead of lxvw4x.
lxvd2x performs far better on the new POWER architecture than lxvw4x.
8 years ago
Matt Brown
96dd0ef4f7
Optimise ccopy for POWER9
Use lxvd2x instruction instead of lxvw4x.
lxvd2x performs far better on the new POWER architecture than lxvw4x.
8 years ago
Alan Modra
dc40bc7368
Power8 inline assembly tweaks
Further fixes on top of 9e2f316ed . Writing some doco for gcc on
inline assembly woke me up to some more errors.
- dgemv_kernel_4x4 asm did not mention *ap as a memory input, and
*y is both read and write.
- sasum_kernel_32 and casum_kernel_16 did not use %x for a vsx insn
operand, a problem if the "=f" sum output was ever allocated a vsx
reg in the altivec set. This might be possible with inlining and
future gcc optimisation.
8 years ago
Martin Kroeker
9e2f316ede
Power8 inline assembly fixes
Quoting patch author amodra from #1078
Lots of issues here.
- The vsx regs weren't listed as clobbered.
- Poor choice of vsx regs, which along with the lack of clobbers led to
trashing v0..v21 and fr14..fr23. Ideally you'd let gcc choose all
temp vsx regs, but asms currently have a limit of 30 i/o parms.
- Other regs were clobbered unnecessarily, seemingly in an attempt to
clobber inputs, with gcc-7 complaining about the clobber of r2.
(Changed inputs should be also listed as outputs or as an i/o.)
- "r" constraint used instead of "b" for gprs used in insns where the
r0 encoding means zero rather than r0.
- There were unused asm inputs too.
- All memory was clobbered rather than hooking up memory outputs with
proper memory constraints, and that and the lack of proper memory
input constraints meant the asms needed to be volatile and their
containing function noinline.
- Some parameters were being passed unnecessarily via memory.
- When a copy of a
9 years ago
Martin Kroeker
a6e9e0b94b
Remove explicit include of complex.h
9 years ago
Zhang Xianyi
515bc56ea9
Refs #946 . Use nrm2 reference implementation for Power8.
9 years ago
Zhang Xianyi
ae70b916f4
Refs #929 . Deal with zero and NaNs for scale.
9 years ago
Werner Saar
412bcd187a
optimized dtrsm_logic_LT_16x4_power8.S and dtrsm_macros_LT_16x4_power8.S
9 years ago
Werner Saar
8b140220c8
optimized dtrsm_kernel_LT for POWER8
9 years ago
Werner Saar
8fb5a1aaff
added optimized dtrsm_LT kernel for POWER8
9 years ago
Werner Saar
6a2bde7a2d
optimized dgemm and dgetrf for POWER8
9 years ago
Werner Saar
8310d4d3f7
optimized dgemm for 20 threads
9 years ago
Werner Saar
56948dbf0f
optimized dgemm for POWER8
9 years ago
Werner Saar
0d0c6f7d7d
optimized dgemm for POWER8
9 years ago
Werner Saar
a3da10662f
added sgemm_tcopy_8_power8.S
9 years ago
Werner Saar
d46f07bb4e
added cgemm_tcopy_8_power8.S
9 years ago
Werner Saar
879a51165f
Optimized zgemm and tested zgemm again
9 years ago
Werner Saar
9276c9012f
Optimized sgemm and dgemm and tested again.
9 years ago
Werner Saar
0001260f4b
optimized sgemm
9 years ago
Werner Saar
3c6294ca3d
added optimized sgemm_tcopy for power8
9 years ago
Werner Saar
e173c51c04
updated zgemm- and ztrmm-kernel for POWER8
9 years ago
Werner Saar
9c42f0374a
Updated cgemm- and sgemm-kernel for POWER8 SMP
9 years ago
Werner Saar
a51102e9b7
bugfixes for sgemm- and cgemm-kernel
9 years ago