Arjan van de Ven
ae38fa55c3
Use intrinsics instead of inline asm
Intrinsics-based code is generally easier to read for the non-math part
of the algorithm, and it's easier to add, say, AVX512 to it later
7 years ago
Arjan van de Ven
847bbd6f4c
use named arguments in the inline asm
makes the asm easier to read
7 years ago
Arjan van de Ven
9c29524f50
various code cleanups and comments
7 years ago
Arjan van de Ven
f2810beafb
Add AVX512 support to dsymv_L_microk_haswell-2.c
Now that the code is written in intrinsics it's relatively easy to add AVX512 support
7 years ago
Arjan van de Ven
c202e06297
Write dsymv_kernel_4x4 for Haswell using intrinsics
intrinsics make the non-math part of the code easier to follow
than fully hand-coded asm, and they also help prepare for
adding avx512 support
7 years ago
Arjan van de Ven
0faba28adb
dsymv_L haswell: use symbol names for inline asm
symbolic names for gcc inline assembly are much easier to read
7 years ago
Arjan van de Ven
df31ec064e
Add AVX512 support to the dgemv_n_microk_haswell-4.c kernel
Now that the kernel is written in C-with-intrinsics, adding
AVX512 support to this kernel is trivial and yields a pretty significant
performance increase
7 years ago
Arjan van de Ven
e52d01cfe7
Also make the kernel_4x2 use intrinsics for readability and consistency
7 years ago
Arjan van de Ven
4a8ae8b8aa
replace the haswell dgemv_kernel_4x4 kernel with the same code written in intrinsics
using intrinsics is a bit easier to read (at least for the non-math part of the code)
and also allows the compiler to be better about register allocation and optimizing the
non-math (loop/setup) code.
It also allows the code to honor the "no fma" flag if the user so desires.
The result of this change is (measured for a size of 16) a 15% performance increase.
And it is a step towards being able to add an AVX512 version of the code.
7 years ago
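A minimal sketch of the kind of rewrite the commit above describes, not the actual OpenBLAS kernel. The function name `dgemv_n_4cols_sketch` and its signature are hypothetical. It updates `y` with four matrix columns scaled by four `x` entries, using AVX intrinsics when the compiler enables them and plain C otherwise. The multiply and add are kept as separate intrinsics here, on the assumption that this leaves any FMA fusion to the compiler and its flags.

```c
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Hypothetical sketch: y[i] += a0[i]*x[0] + a1[i]*x[1] + a2[i]*x[2] + a3[i]*x[3]
 * for i in [0, n). Names and signature are illustrative only. */
static void dgemv_n_4cols_sketch(size_t n,
                                 const double *a0, const double *a1,
                                 const double *a2, const double *a3,
                                 const double *x, double *y)
{
    size_t i = 0;
#if defined(__AVX__)
    /* Broadcast each x entry once; the loop/setup code stays ordinary C. */
    const __m256d x0 = _mm256_set1_pd(x[0]);
    const __m256d x1 = _mm256_set1_pd(x[1]);
    const __m256d x2 = _mm256_set1_pd(x[2]);
    const __m256d x3 = _mm256_set1_pd(x[3]);
    for (; i + 4 <= n; i += 4) {
        __m256d acc = _mm256_loadu_pd(y + i);
        acc = _mm256_add_pd(acc, _mm256_mul_pd(_mm256_loadu_pd(a0 + i), x0));
        acc = _mm256_add_pd(acc, _mm256_mul_pd(_mm256_loadu_pd(a1 + i), x1));
        acc = _mm256_add_pd(acc, _mm256_mul_pd(_mm256_loadu_pd(a2 + i), x2));
        acc = _mm256_add_pd(acc, _mm256_mul_pd(_mm256_loadu_pd(a3 + i), x3));
        _mm256_storeu_pd(y + i, acc);
    }
#endif
    /* Scalar tail; also the whole loop when AVX is not enabled. */
    for (; i < n; i++)
        y[i] += a0[i] * x[0] + a1[i] * x[1] + a2[i] * x[2] + a3[i] * x[3];
}
```

A wider variant would follow the same shape with 512-bit loads and a step of 8 doubles, which is presumably why the later commits call the AVX512 port easy once the code is in intrinsics.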
Arjan van de Ven
350531e76a
dgemv_n_microk_haswell: Use symbolic names for asm inputs to make the code more readable
gcc assembly syntax supports symbolic names in addition to numeric parameter order;
it's generally more readable to have code use the symbolic names
7 years ago
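To illustrate the difference between numeric and symbolic operands in gcc inline asm, here is a small sketch. The two helper functions are hypothetical and unrelated to the OpenBLAS kernels; they only show the syntax. On targets other than x86-64 with a GNU-compatible compiler, the sketch falls back to plain C.

```c
#include <stdint.h>

/* Hypothetical helper: operands referenced by position (%0 is a, %1 is b). */
static inline uint64_t add_numbered(uint64_t a, uint64_t b)
{
#if defined(__x86_64__) && defined(__GNUC__)
    __asm__("addq %1, %0" : "+r"(a) : "r"(b) : "cc");
#else
    a += b; /* portable fallback */
#endif
    return a;
}

/* Same instruction, but operands carry names chosen by the programmer. */
static inline uint64_t add_named(uint64_t a, uint64_t b)
{
#if defined(__x86_64__) && defined(__GNUC__)
    __asm__("addq %[addend], %[sum]"
            : [sum] "+r"(a)
            : [addend] "r"(b)
            : "cc");
#else
    a += b; /* portable fallback */
#endif
    return a;
}
```

With named operands, adding or reordering an input does not silently renumber every `%N` reference in the template, which matters more as the asm block grows.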
Martin Kroeker
4e103c822c
typo fix
7 years ago
Martin Kroeker
d2142760e0
Fix precision problem in DSDOT
7 years ago
Martin Kroeker
2fbfc64da8
Use C kernels for default c/zAXPY, xROT, c/zSWAP
7 years ago
Martin Kroeker
ba8388cee0
Merge pull request #1651 from martin-frbg/avx512-nodgemm
Disable the 16x2 DTRMM kernel on SkylakeX as well
7 years ago
Martin Kroeker
6e54b0a027
Disable the 16x2 DTRMM kernel on SkylakeX as well
7 years ago
Martin Kroeker
40c8cbc3bf
Merge pull request #1650 from martin-frbg/avx512-nodgemm
Disable the AVX512 DGEMM kernel for now
7 years ago
Martin Kroeker
f0a8dc2eec
Disable the AVX512 DGEMM kernel for now
due to #1643
7 years ago
Martin Kroeker
b83e4c60c7
Remove premature exit for INC_X or INC_Y zero
8 years ago
Martin Kroeker
e344db269b
Remove premature exit for INC_X or INC_Y zero
8 years ago
Martin Kroeker
545b82efd3
Remove premature exit for INC_X or INC_Y zero
8 years ago
Martin Kroeker
e322a951fe
Remove premature exit for INC_X or INC_Y zero
8 years ago
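The log does not say which kernels the four commits above touch. As an illustration only, assuming an AXPY-like routine, this sketch follows the reference BLAS loop structure, where a zero increment is not a reason to return early: with `incy == 0` every update lands on `y[0]`, and with `incx == 0` the same `x[0]` is reused. The name `daxpy_ref_sketch` is hypothetical.

```c
/* Hypothetical reference-style AXPY: y := alpha*x + y, with no early
 * exit for zero increments. Not the OpenBLAS implementation. */
static void daxpy_ref_sketch(long n, double alpha,
                             const double *x, long incx,
                             double *y, long incy)
{
    if (n <= 0)
        return;

    /* Negative strides start from the far end, as in reference BLAS. */
    long ix = incx < 0 ? (1 - n) * incx : 0;
    long iy = incy < 0 ? (1 - n) * incy : 0;

    for (long i = 0; i < n; i++) {
        y[iy] += alpha * x[ix];
        ix += incx; /* stays at 0 when incx == 0 */
        iy += incy; /* stays at 0 when incy == 0 */
    }
}
```

An optimized kernel that returns immediately when an increment is zero would leave `y` untouched in these cases, which is presumably what the related unit tests caught.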
Martin Kroeker
c628c6fa59
Merge pull request #1612 from oon3m0oo/cpus
Fixed a few more unnecessary calls to num_cpu_avail.
8 years ago
Martin Kroeker
6f71c0fce4
Return a somewhat sane default value for L2 cache size if cpuid retur… (#1611)
* Return a somewhat sane default value for L2 cache size if cpuid returned something unexpected
Fixes #1610, the KVM hypervisor on Google Chromebooks returning zero for CPUID 0x80000006, causing DYNAMIC_ARCH
builds of OpenBLAS to hang
8 years ago
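A sketch of the fallback idea from the commit above, under stated assumptions. CPUID leaf 0x80000006 reports the L2 cache size in KB in bits 31:16 of ECX. The helper name `l2_size_kb_from_ecx` and the 256 KB default are placeholders chosen for illustration; the value OpenBLAS actually uses is not stated in the log.

```c
#include <stdint.h>

/* Assumed placeholder default; the real fallback value is not given in the log. */
#define FALLBACK_L2_KB 256u

/* Hypothetical helper: extract the L2 size (KB) from the raw ECX value of
 * CPUID leaf 0x80000006, substituting a default when the field is zero,
 * as a hypervisor may report. */
static unsigned l2_size_kb_from_ecx(uint32_t ecx)
{
    unsigned kb = (ecx >> 16) & 0xffffu;
    return kb != 0 ? kb : FALLBACK_L2_KB;
}
```

Taking the raw register value as a parameter keeps the sketch testable without executing CPUID itself.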
Craig Donner
c2545b0fd6
Fixed a few more unnecessary calls to num_cpu_avail.
I don't have as many benchmarks for these as for gemm, but it should still
make a difference for small matrices.
8 years ago
Arjan van de Ven
89372e0993
Use AVX512 also for DGEMM
this required switching to the generic gemm_beta code (which is faster anyway on SKX)
for both DGEMM and SGEMM
Performance gain for the not-yet-retuned version is in the 30% range
8 years ago
Martin Kroeker
0023515733
Typo fix (misplaced parenthesis)
8 years ago
Arjan van de Ven
99c7bba8e4
Initial support for SkylakeX / AVX512
This patch adds the basic infrastructure for adding the SkylakeX (Intel Skylake server)
target. The SkylakeX target will use the AVX512 (AVX512VL level) instruction set,
which brings 2 basic things:
1) 512 bit wide SIMD (2x width of AVX2)
2) 32 SIMD registers (2x the number on AVX2)
This initial patch only contains a trivial transformation of the Haswell SGEMM kernel
to AVX512VL; more will follow later but this patch aims to get the infrastructure
in place for this "later".
Full performance tuning has not been done yet; with more registers and wider SIMD
it's in theory possible to retune the kernels but even without that there's an
interesting enough performance increase (30-40% range) with just this change.
8 years ago
Martin Kroeker
8562d5787a
Merge pull request #1583 from martin-frbg/issue1575
Handle INCX=0,INCY=0 case
8 years ago
Martin Kroeker
7df8c4f76f
typo fix
8 years ago
Martin Kroeker
2fc748bf72
Restore optimized swap kernel now that we have a proper fix
8 years ago
Martin Kroeker
d1b7be14aa
Handle INCX=0,INCY=0 case
Fixes #1575 (sswap/dswap failing the swap utest on x86) as suggested by atsampson.
8 years ago
Martin Kroeker
961d25e9c7
Use the new zrot.c on POWER8 for crot as well
fixes #1571 (the old zrot.S assembly does not handle incx=0 correctly)
8 years ago
Martin Kroeker
f5959f2543
Merge pull request #1567 from martin-frbg/mipstrmm
Revert "Switch mips32 target to USE_TRMM to fix complex TRMM"
8 years ago
Martin Kroeker
82012b960b
Revert "Switch mips32 target to USE_TRMM to fix complex TRMM"
... as it was just a silly workaround for the issue seen in #1563, caused by #1419
8 years ago
Martin Kroeker
8dd3515fa2
Merge pull request #1565 from martin-frbg/mipstypo
Remove extraneous brace from previous commit of mips dsdot fix
8 years ago
Martin Kroeker
95f7f0229c
Remove extraneous brace from previous commit
8 years ago
Martin Kroeker
5082fe4306
Merge pull request #1564 from martin-frbg/issue1563
Revert changes from PR#1419
8 years ago
Martin Kroeker
7a7619af6d
Revert changes from PR#1419
at least one of these changes apparently is an oversimplification, leading to TRMM breakage on some platforms as observed in #1563
8 years ago
Martin Kroeker
893b535540
Use correct data type for initializers of v2f64, v4f32
Fixes #1561
8 years ago
Martin Kroeker
018f2dad27
Switch mips32 target to USE_TRMM to fix complex TRMM
8 years ago
Martin Kroeker
9d5098dbc9
Add MIPS 1004K target (Mediatek MT7621 SOC)
8 years ago
Martin Kroeker
954f1832de
Merge pull request #1540 from martin-frbg/mips32-zasum
Fix typo in MIPS P5600 complex ASUM code selection
8 years ago
Martin Kroeker
941ad280a8
Fix typo in MIPS P5600 complex ASUM code selection
8 years ago
Martin Kroeker
1da365312a
Merge pull request #1538 from martin-frbg/arm7utest
Fix handling of zero INCX, INCY in ArmV7 AXPY and ROT
8 years ago
Martin Kroeker
2d0929fa7c
Move the test for zero incx,incy in ARMV7 ROT
to pass the related utest (see #1469)
8 years ago
Martin Kroeker
125343cc88
Drop test for zero incx,incy in armv7 AXPY
...to pass the related utest (see #1469)
8 years ago
Martin Kroeker
8a3b6fa108
Use generic zrot.c on ppc64/POWER6 to work around utest failure from … (#1535)
* Use generic C implementation of zrot on ppc64/POWER6 to work around utest failure from #1469
8 years ago
Martin Kroeker
9c5518319a
Revert "Fix 32bit HASWELL builds"
8 years ago
Martin Kroeker
2ca0faf495
Merge pull request #1515 from martin-frbg/mipsdot
Correct precision of mips dsdot
8 years ago
Martin Kroeker
0fe434598b
Fix precision of mips dsdot
8 years ago