OpenBLAS

Commit Graph

Author	SHA1	Message	Date
Arjan van de Ven	b30b82ce46	Merge `b1cc69e7a8` into `66da7677bd`	8 years ago
Arjan van de Ven	7932ff3ea9	Add an AVX512 enabled DDOT function written in C intrinsics for best readability. (the same C code works for Haswell as well) For logistical reasons the code falls back to the existing haswell AVX2 implementation if the GCC or LLVM compiler is not new enough	8 years ago
Arjan van de Ven	b1cc69e7a8	Convert dscal_haswell to intrinsics and add AVX512 support dscal is a relatively simple function... make it more readable and 50% faster by using C intrinsics and AVX512 support	8 years ago
Arjan van de Ven	93aa18b1a8	daxpy_haswell: Change to C+intrinsics + AVX512 to mimic the change to saxpy_haswell Use the same transformation as was done to saxpy for daxpy gives a low double digit performance increase	8 years ago
Arjan van de Ven	7af8a5445d	saxpy_haswell: Go to a more compact intrinsics notation	8 years ago
Arjan van de Ven	850b73dbb9	saxpy_haswell: Add AVX512 support avx512 support fits nicely in the C+intrinsics code and gets a speed improvement for vectors where the saxpy operation is not fully memory bound	8 years ago
Arjan van de Ven	06ea72f5a5	write saxpy_haswell kernel using C intrinsics and don't disallow inlining the intrinsics version of saxpy is more readable than the inline asm version, and in the intrinsics version there's no reason anymore to ban inlining (since the compiler has full visibility now) which gives a mid single digits improvement in performance	8 years ago
Arjan van de Ven	d86604687f	saxpy_haswell: Use named arguments in inline asm Improves readability	8 years ago
Arjan van de Ven	ef30a7239c	sdot_haswell: similar to ddot: turn into intrinsics based C code that supports AVX512 do the same thing for SDOT that the previous patches did for DDOT; the perf gain is in the 60% range so at least somewhat interesting	8 years ago
Arjan van de Ven	21c6220d63	fix typo in dsymv avx512 code path	8 years ago
Arjan van de Ven	34d63df4b3	Add AVX512 support to DDOT now that it's written in C + intrinsics it's easy to add AVX512 support for DDOT	8 years ago
Arjan van de Ven	ae38fa55c3	Use intrinsics instead of inline asm Intrinsics based code is generally easier to read for the non-math part of the algorithm and it's easier to add, say, AVX512 to it later	8 years ago
Arjan van de Ven	847bbd6f4c	use named arguments in the inline asm makes the asm easier to read	8 years ago
Arjan van de Ven	9c29524f50	various code cleanups and comments	8 years ago
Arjan van de Ven	f2810beafb	Add AVX512 support to dsymv_L_microk_haswell-2.c Now that the code is written in intrinsics it's relatively easy to add AVX512 support	8 years ago
Arjan van de Ven	c202e06297	Write dsymv_kernel_4x4 for Haswell using intrinsics intrinsics make the non-math part of the code easier to follow than all hand coded asm, and it also helps getting ready for adding avx512 support	8 years ago
Arjan van de Ven	0faba28adb	dsymv_L haswell: use symbol names for inline asm symbolic names for gcc inline assembly are much easier to read	8 years ago
Arjan van de Ven	df31ec064e	Add AVX512 support to the dgemv_n_microk_haswell-4.c kernel Now that the kernel is written in C-with-intrinsics, adding AVX512 support to this kernel is trivial and yields a pretty significant performance increase	8 years ago
Arjan van de Ven	e52d01cfe7	Also make the kernel_4x2 use intrinsics for readability and consistency	8 years ago
Arjan van de Ven	4a8ae8b8aa	replace the haswell dgemv_kernel_4x4 kernel with the same code written in intrinsics using intrinsics is a bit easier to read (at least for the non-math part of the code) and also allows the compiler to be better about register allocation and optimizing the non-math (loop/setup) code. It also allows the code to honor the "no fma" flag if the user so desires. The result of this change is (measured for a size of 16) a 15% performance increase. And it is a step towards being able to add an AVX512 version of the code.	8 years ago
Arjan van de Ven	350531e76a	dgemv_n_microk_haswell: Use symbolic names for asm inputs to make the code more readable gcc assembly syntax supports symbolic names in addition to numeric parameter order; it's generally more readable to have code use the symbolic names	8 years ago
Martin Kroeker	4e103c822c	typo fix	8 years ago
Martin Kroeker	d2142760e0	Fix precision problem in DSDOT	8 years ago
Martin Kroeker	2fbfc64da8	Use C kernels for default c/zAXPY, xROT, c/zSWAP	8 years ago
Martin Kroeker	ba8388cee0	Merge pull request #1651 from martin-frbg/avx512-nodgemm Disable the 16x2 DTRMM kernel on SkylakeX as well	8 years ago
Martin Kroeker	6e54b0a027	Disable the 16x2 DTRMM kernel on SkylakeX as well	8 years ago
Martin Kroeker	40c8cbc3bf	Merge pull request #1650 from martin-frbg/avx512-nodgemm Disable the AVX512 DGEMM kernel for now	8 years ago
Martin Kroeker	f0a8dc2eec	Disable the AVX512 DGEMM kernel for now due to #1643	8 years ago
Martin Kroeker	b83e4c60c7	Remove premature exit for INC_X or INC_Y zero	8 years ago
Martin Kroeker	e344db269b	Remove premature exit for INC_X or INC_Y zero	8 years ago
Martin Kroeker	545b82efd3	Remove premature exit for INC_X or INC_Y zero	8 years ago
Martin Kroeker	e322a951fe	Remove premature exit for INC_X or INC_Y zero	8 years ago
Martin Kroeker	c628c6fa59	Merge pull request #1612 from oon3m0oo/cpus Fixed a few more unnecessary calls to num_cpu_avail.	8 years ago
Martin Kroeker	6f71c0fce4	Return a somewhat sane default value for L2 cache size if cpuid retur… (#1611 ) * Return a somewhat sane default value for L2 cache size if cpuid returned something unexpected Fixes #1610, the KVM hypervisor on Google Chromebooks returning zero for CPUID 0x80000006, causing DYNAMIC_ARCH builds of OpenBLAS to hang	8 years ago
Craig Donner	c2545b0fd6	Fixed a few more unnecessary calls to num_cpu_avail. I don't have as many benchmarks for these as for gemm, but it should still make a difference for small matrices.	8 years ago
Arjan van de Ven	89372e0993	Use AVX512 also for DGEMM this required switching to the generic gemm_beta code (which is faster anyway on SKX) for both DGEMM and SGEMM Performance for the not-retuned version is in the 30% range	8 years ago
Martin Kroeker	0023515733	Typo fix (misplaced parenthesis)	8 years ago
Arjan van de Ven	99c7bba8e4	Initial support for SkylakeX / AVX512 This patch adds the basic infrastructure for adding the SkylakeX (Intel Skylake server) target. The SkylakeX target will use the AVX512 (AVX512VL level) instruction set, which brings 2 basic things: 1) 512 bit wide SIMD (2x width of AVX2) 2) 32 SIMD registers (2x the number on AVX2) This initial patch only contains a trivial transformation of the Haswell SGEMM kernel to AVX512VL; more will follow later but this patch aims to get the infrastructure in place for this "later". Full performance tuning has not been done yet; with more registers and wider SIMD it's in theory possible to retune the kernels but even without that there's an interesting enough performance increase (30-40% range) with just this change.	8 years ago
Martin Kroeker	8562d5787a	Merge pull request #1583 from martin-frbg/issue1575 Handle INCX=0,INCY=0 case	8 years ago
Martin Kroeker	7df8c4f76f	typo fix	8 years ago
Martin Kroeker	2fc748bf72	Restore optimized swap kernel now that we have a proper fix	8 years ago
Martin Kroeker	d1b7be14aa	Handle INCX=0,INCY=0 case Fixes #1575 (sswap/dswap failing the swap utest on x86) as suggested by atsampson.	8 years ago
Martin Kroeker	961d25e9c7	Use the new zrot.c on POWER8 for crot as well fixes #1571 (the old zrot.S assembly does not handle incx=0 correctly)	8 years ago
Martin Kroeker	f5959f2543	Merge pull request #1567 from martin-frbg/mipstrmm Revert " Switch mips32 target to USE_TRMM to fix complex TRMM"	8 years ago
Martin Kroeker	82012b960b	Revert " Switch mips32 target to USE_TRMM to fix complex TRMM" ... as it was just a silly workaround for the issue seen in #1563, caused by #1419	8 years ago
Martin Kroeker	8dd3515fa2	Merge pull request #1565 from martin-frbg/mipstypo Remove extraneous brace from previous commit of mips dsdot fix	8 years ago
Martin Kroeker	95f7f0229c	Remove extraneous brace from previous commit	8 years ago
Martin Kroeker	5082fe4306	Merge pull request #1564 from martin-frbg/issue1563 Revert changes from PR#1419	8 years ago
Martin Kroeker	7a7619af6d	Revert changes from PR#1419 at least one of these changes apparently is an oversimplification, leading to TRMM breakage on some platforms as observed in #1563	8 years ago
Martin Kroeker	893b535540	Use correct data type for initializers of v2f64, v4f32 Fixes #1561	8 years ago

1018 Commits (b30b82ce46d0b118e70410e5c7c8ac37409f59f2)