Arjan van de Ven
850b73dbb9
saxpy_haswell: Add AVX512 support
avx512 support fits nicely in the C+intrinsics code and gets a
speed improvement for vectors where the saxpy operation is not fully
memory bound
7 years ago
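For readers unfamiliar with the "C + intrinsics" style these commits refer to, here is a minimal sketch of a saxpy loop (y += alpha * x) with an optional AVX512 path. It is not the code from this commit: the function name saxpy_sketch, the preprocessor guards, and the unaligned loads are assumptions made for illustration. The wider paths are compiled only when the compiler enables the matching instruction sets; otherwise the scalar loop does all the work.

```c
#include <stddef.h>
#if defined(__AVX512F__) || (defined(__AVX2__) && defined(__FMA__))
#include <immintrin.h>
#endif

/* Illustrative only: computes y[i] += alpha * x[i] for i in [0, n). */
static void saxpy_sketch(size_t n, float alpha, const float *x, float *y)
{
    size_t i = 0;

#if defined(__AVX512F__)
    /* 16 floats per iteration with 512-bit registers. */
    __m512 va512 = _mm512_set1_ps(alpha);
    for (; i + 16 <= n; i += 16) {
        __m512 vy = _mm512_loadu_ps(y + i);
        vy = _mm512_fmadd_ps(va512, _mm512_loadu_ps(x + i), vy);
        _mm512_storeu_ps(y + i, vy);
    }
#endif

#if defined(__AVX2__) && defined(__FMA__)
    /* 8 floats per iteration with 256-bit registers. */
    __m256 va256 = _mm256_set1_ps(alpha);
    for (; i + 8 <= n; i += 8) {
        __m256 vy = _mm256_loadu_ps(y + i);
        vy = _mm256_fmadd_ps(va256, _mm256_loadu_ps(x + i), vy);
        _mm256_storeu_ps(y + i, vy);
    }
#endif

    /* Scalar tail, and the only path when no SIMD support is enabled. */
    for (; i < n; i++)
        y[i] += alpha * x[i];
}
```

Because the whole loop is ordinary C, the compiler can see through it and inline it, which is the property the commit messages above credit for part of the speedup.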
Arjan van de Ven
06ea72f5a5
write saxpy_haswell kernel using C intrinsics and don't disallow inlining
the intrinsics version of saxpy is more readable than the inline asm version,
and in the intrinsics version there's no reason anymore to ban inlining
(since the compiler has full visibility now) which gives a mid single digits
improvement in performance
7 years ago
Arjan van de Ven
d86604687f
saxpy_haswell: Use named arguments in inline asm
Improves readability
7 years ago
Arjan van de Ven
ef30a7239c
sdot_haswell: similar to ddot: turn into intrinsics based C code that supports AVX512
do the same thing for SDOT that the previous patches did for DDOT; the perf gain
is in the 60% range so at least somewhat interesting
7 years ago
Arjan van de Ven
21c6220d63
fix typo in dsymv avx512 code path
7 years ago
Arjan van de Ven
34d63df4b3
Add AVX512 support to DDOT
now that it's written in C + intrinsics it's easy to add AVX512 support
for DDOT
7 years ago
Arjan van de Ven
ae38fa55c3
Use intrinsics instead of inline asm
Intrinsics based code is generally easier to read for the non-math part
of the algorithm and it's easier to add, say, AVX512 to it later
7 years ago
Arjan van de Ven
847bbd6f4c
use named arguments in the inline asm
makes the asm easier to read
7 years ago
Arjan van de Ven
9c29524f50
various code cleanups and comments
7 years ago
Arjan van de Ven
f2810beafb
Add AVX512 support to dsymv_L_microk_haswell-2.c
Now that the code is written in intrinsics it's relatively easy to add AVX512 support
7 years ago
Arjan van de Ven
c202e06297
Write dsymv_kernel_4x4 for Haswell using intrinsics
intrinsics make the non-math part of the code easier to follow
than all hand coded asm, and it also helps getting ready for
adding avx512 support
7 years ago
Arjan van de Ven
0faba28adb
dsymv_L haswell: use symbol names for inline asm
symbolic names for gcc inline assembly are much easier to read
7 years ago
Arjan van de Ven
df31ec064e
Add AVX512 support to the dgemv_n_microk_haswell-4.c kernel
Now that the kernel is written in C-with-intrinsics, adding
AVX512 support to this kernel is trivial and yields a pretty significant
performance increase
7 years ago
Arjan van de Ven
e52d01cfe7
Also make the kernel_4x2 use intrinsics for readability and consistency
7 years ago
Arjan van de Ven
4a8ae8b8aa
replace the haswell dgemv_kernel_4x4 kernel with the same code written in intrinsics
using intrinsics is a bit easier to read (at least for the non-math part of the code)
and also allows the compiler to be better about register allocation and optimizing the
non-math (loop/setup) code.
It also allows the code to honor the "no fma" flag if the user so desires.
The result of this change is (measured for a size of 16) a 15% performance increase.
And it is a step towards being able to add an AVX512 version of the code.
7 years ago
Arjan van de Ven
350531e76a
dgemv_n_microk_haswell: Use symbolic names for asm inputs to make the code more readable
gcc assembly syntax supports symbolic names in addition to numeric parameter order;
it's generally more readable to have code use the symbolic names
7 years ago
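As a small illustration of the symbolic operand names mentioned in this commit, the sketch below adds two integers with GCC extended inline asm. It is unrelated to the actual dgemv kernel: the function add_named and the operand names sum and addend are made up for the example, and the asm path is only taken on x86 with a GCC-compatible compiler.

```c
/* Illustrative only: returns a + b.
 * With numbered operands the template would read "addl %1, %0";
 * the symbolic form names each operand at its point of use. */
static int add_named(int a, int b)
{
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    __asm__("addl %[addend], %[sum]"
            : [sum] "+r"(a)      /* read-write output operand */
            : [addend] "r"(b)    /* input operand */
            : "cc");             /* the add modifies the flags */
    return a;
#else
    /* Portable fallback for other compilers and architectures. */
    return a + b;
#endif
}
```

With symbolic names, adding or reordering an operand does not force renumbering every reference in the template, which matters more as the asm block grows.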
Martin Kroeker
4e103c822c
typo fix
7 years ago
Martin Kroeker
d2142760e0
Fix precision problem in DSDOT
7 years ago
Martin Kroeker
2fbfc64da8
Use C kernels for default c/zAXPY, xROT, c/zSWAP
7 years ago
Martin Kroeker
ba8388cee0
Merge pull request #1651 from martin-frbg/avx512-nodgemm
Disable the 16x2 DTRMM kernel on SkylakeX as well
7 years ago
Martin Kroeker
6e54b0a027
Disable the 16x2 DTRMM kernel on SkylakeX as well
7 years ago
Martin Kroeker
40c8cbc3bf
Merge pull request #1650 from martin-frbg/avx512-nodgemm
Disable the AVX512 DGEMM kernel for now
7 years ago
Martin Kroeker
f0a8dc2eec
Disable the AVX512 DGEMM kernel for now
due to #1643
7 years ago
Martin Kroeker
b83e4c60c7
Remove premature exit for INC_X or INC_Y zero
8 years ago
Martin Kroeker
e344db269b
Remove premature exit for INC_X or INC_Y zero
8 years ago
Martin Kroeker
545b82efd3
Remove premature exit for INC_X or INC_Y zero
8 years ago
Martin Kroeker
e322a951fe
Remove premature exit for INC_X or INC_Y zero
8 years ago
Martin Kroeker
c628c6fa59
Merge pull request #1612 from oon3m0oo/cpus
Fixed a few more unnecessary calls to num_cpu_avail.
8 years ago
Martin Kroeker
6f71c0fce4
Return a somewhat sane default value for L2 cache size if cpuid retur… ( #1611 )
* Return a somewhat sane default value for L2 cache size if cpuid returned something unexpected
Fixes #1610, the KVM hypervisor on Google Chromebooks returning zero for CPUID 0x80000006, causing DYNAMIC_ARCH
builds of OpenBLAS to hang
8 years ago
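The kind of guard this commit describes can be sketched as follows. This is not the OpenBLAS code: the function name l2_size_or_default and the fallback of 256 KB are hypothetical, since the commit message does not say which default it picks or where the check lives.

```c
/* HYPOTHETICAL fallback in KB; the real value is not given in the log. */
#define FALLBACK_L2_KB 256

/* Illustrative only: treat a zero or negative size reported by CPUID
 * as "unknown" and substitute a default, otherwise pass it through. */
static int l2_size_or_default(int reported_kb)
{
    if (reported_kb <= 0)
        return FALLBACK_L2_KB;
    return reported_kb;
}
```

Callers that derive block sizes from the cache size then never see zero, whatever the hypervisor reports.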
Craig Donner
c2545b0fd6
Fixed a few more unnecessary calls to num_cpu_avail.
I don't have as many benchmarks for these as for gemm, but it should still
make a difference for small matrices.
8 years ago
Arjan van de Ven
89372e0993
Use AVX512 also for DGEMM
this required switching to the generic gemm_beta code (which is faster anyway on SKX)
for both DGEMM and SGEMM
Performance gain for the not-yet-retuned version is in the 30% range
8 years ago
Martin Kroeker
0023515733
Typo fix (misplaced parenthesis)
8 years ago
Arjan van de Ven
99c7bba8e4
Initial support for SkylakeX / AVX512
This patch adds the basic infrastructure for adding the SkylakeX (Intel Skylake server)
target. The SkylakeX target will use the AVX512 (AVX512VL level) instruction set,
which brings 2 basic things:
1) 512 bit wide SIMD (2x width of AVX2)
2) 32 SIMD registers (2x the number on AVX2)
This initial patch only contains a trivial transformation of the Haswell SGEMM kernel
to AVX512VL; more will follow later but this patch aims to get the infrastructure
in place for this "later".
Full performance tuning has not been done yet; with more registers and wider SIMD
it's in theory possible to retune the kernels but even without that there's an
interesting enough performance increase (30-40% range) with just this change.
8 years ago
Martin Kroeker
8562d5787a
Merge pull request #1583 from martin-frbg/issue1575
Handle INCX=0,INCY=0 case
8 years ago
Martin Kroeker
7df8c4f76f
typo fix
8 years ago
Martin Kroeker
2fc748bf72
Restore optimized swap kernel now that we have a proper fix
8 years ago
Martin Kroeker
d1b7be14aa
Handle INCX=0,INCY=0 case
Fixes #1575 (sswap/dswap failing the swap utest on x86) as suggested by atsampson.
8 years ago
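To make the INCX=0, INCY=0 case concrete, here is a plain strided swap loop, assuming non-negative increments. It is a reference sketch only (the name swap_ref is made up) and not the fix applied in this commit. With both increments zero, the loop swaps x[0] and y[0] n times, so the two values end up exchanged for odd n and unchanged for even n; an optimized kernel that assumes distinct elements per iteration can get this wrong.

```c
#include <stddef.h>

/* Illustrative only: swaps n elements of x and y, stepping by incx and
 * incy. Assumes incx >= 0 and incy >= 0 to keep the indexing simple. */
static void swap_ref(size_t n, double *x, size_t incx, double *y, size_t incy)
{
    for (size_t i = 0; i < n; i++) {
        double tmp = x[i * incx];
        x[i * incx] = y[i * incy];
        y[i * incy] = tmp;
    }
}
```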
Martin Kroeker
961d25e9c7
Use the new zrot.c on POWER8 for crot as well
fixes #1571 (the old zrot.S assembly does not handle incx=0 correctly)
8 years ago
Martin Kroeker
f5959f2543
Merge pull request #1567 from martin-frbg/mipstrmm
Revert "Switch mips32 target to USE_TRMM to fix complex TRMM"
8 years ago
Martin Kroeker
82012b960b
Revert "Switch mips32 target to USE_TRMM to fix complex TRMM"
... as it was just a silly workaround for the issue seen in #1563, caused by #1419
8 years ago
Martin Kroeker
8dd3515fa2
Merge pull request #1565 from martin-frbg/mipstypo
Remove extraneous brace from previous commit of mips dsdot fix
8 years ago
Martin Kroeker
95f7f0229c
Remove extraneous brace from previous commit
8 years ago
Martin Kroeker
5082fe4306
Merge pull request #1564 from martin-frbg/issue1563
Revert changes from PR#1419
8 years ago
Martin Kroeker
7a7619af6d
Revert changes from PR#1419
at least one of these changes apparently is an oversimplification, leading to TRMM breakage on some platforms as observed in #1563
8 years ago
Martin Kroeker
893b535540
Use correct data type for initializers of v2f64, v4f32
Fixes #1561
8 years ago
Martin Kroeker
018f2dad27
Switch mips32 target to USE_TRMM to fix complex TRMM
8 years ago
Martin Kroeker
9d5098dbc9
Add MIPS 1004K target (Mediatek MT7621 SOC)
8 years ago
Martin Kroeker
954f1832de
Merge pull request #1540 from martin-frbg/mips32-zasum
Fix typo in MIPS P5600 complex ASUM code selection
8 years ago
Martin Kroeker
941ad280a8
Fix typo in MIPS P5600 complex ASUM code selection
8 years ago
Martin Kroeker
1da365312a
Merge pull request #1538 from martin-frbg/arm7utest
Fix handling of zero INCX, INCY in ArmV7 AXPY and ROT
8 years ago