OpenBLAS

Commit Graph

Author	SHA1	Message	Date
Martin Kroeker	c04a729081	Add ?sum definitions for generic kernel	6 years ago
Martin Kroeker	9d717cb5ee	Add x86_64 implementation of ?sum as trivial copy of ?asum with the fabs calls removed	6 years ago
Martin Kroeker	32c7063cb0	Merge pull request #2061 from martin-frbg/martin-frbg-patch-1 Disable the AVX512 DGEMM kernel (again)	6 years ago
Martin Kroeker	e608d4f7fe	Disable the AVX512 DGEMM kernel (again) Due to as yet unresolved errors seen in #1955 and #2029	7 years ago
Celelibi	b7f59da42d	Fix crash in sgemm SSE/nano kernel on x86_64 Fix bug #2047. Signed-off-by: Celelibi <celelibi@gmail.com>	7 years ago
Andrew	6eee1beac5	move fix to right place	7 years ago
Martin Kroeker	e12cdf58ef	Merge pull request #2024 from martin-frbg/gcc9fixes4 Fix inline assembly constraints in Bulldozer TRSM kernels	7 years ago
Martin Kroeker	1860c9456d	Merge pull request #2023 from martin-frbg/gcc9fixes3 Fix inline assembly constraints in various x86_64 GEMVN kernels	7 years ago
Martin Kroeker	f9bb76d29a	Fix inline assembly constraints in Bulldozer TRSM kernels rework indices to allow marking i,as and bs as both input and output (marked operand n1 as well for simplicity). For #2009	7 years ago
Martin Kroeker	efb9038f72	Fix inline assembly constraints	7 years ago
Martin Kroeker	e976557d29	Fix inline assembly constraints rework indices to allow marking argument lda as input and output.	7 years ago
Martin Kroeker	9d8be15789	Fix inline assembly constraints rework indices to allow marking argument lda4 as input and output. For #2009	7 years ago
Martin Kroeker	d752799a0f	Merge pull request #2021 from martin-frbg/gcc9fixes2 Fix wrong constraints in inline assembly of Haswell DTRSM kernel	7 years ago
Martin Kroeker	c26c0b77a7	Fix wrong constraints in inline assembly for #2009	7 years ago
Martin Kroeker	1c6da2d03c	Merge pull request #2019 from martin-frbg/gcc9fixes Fix unannounced modification of input operand 8 (lda4) in Haswell GEMVN microkernel	7 years ago
Martin Kroeker	4255a58cd2	Rename operands to put lda on the input/output constraint list	7 years ago
Martin Kroeker	46e415b140	Save and restore input argument 8 (lda4) Fixes miscompilation with gcc9 -ftree-vectorize (related to issue #2009)	7 years ago
Bart Oldeman	69a97ca7b9	dgemv_kernel_4x4(Haswell): add missing clobbers for xmm0,xmm1,xmm2,xmm3 This fixes a crash in dblat2 when OpenBLAS is compiled using -march=znver1 -ftree-vectorize -O2 See also: https://github.com/easybuilders/easybuild-easyconfigs/issues/7180	7 years ago
Martin Kroeker	ab1630f9fa	Fix declaration of arguments in inline assembly Argument 0 is modified so should be input and output	7 years ago
Martin Kroeker	b824fa70eb	Fix declaration of assembly arguments in SSYMV and DSYMV microkernels Arguments 0 and 1 are both input and output	7 years ago
Martin Kroeker	91481a3e4e	Fix declaration of input arguments in inline assembly Argument 0 is modified as it doubles as a counter	7 years ago
Martin Kroeker	dc6ac9eab0	Fix declaration of input arguments in the x86_64 s/dGEMV_T and s/dGEMV_N kernels Arguments 0 and 1 need to be tagged as both input and output	7 years ago
Martin Kroeker	32b0f1168e	Fix declaration of input arguments in the Sandybridge GER microkernels (#1967) * Tag arguments 0 and 1 as both input and output	7 years ago
Martin Kroeker	b495e54310	Fix declaration of input arguments in the x86_64 SCAL microkernels (#1966) * Tag arguments 0 and 1 as both input and output (see #1964)	7 years ago
Martin Kroeker	d5e6940253	Fix declaration of input arguments in the x86_64 microkernels for DOT and AXPY (#1965) * Tag operands 0 and 1 as both input and output For #1964 (basically a continuation of coding problems first seen in #1292)	7 years ago
Arjan van de Ven	795285c587	Fix thinko in skylake beta handling: casting ints is cheaper, but it has a rounding effect, not a memory casting effect, resulting in an invalid outcome	7 years ago
Arjan van de Ven	d321448a63	dgemm: use dgemm_ncopy_8_skylakex.c also for Haswell The dgemm_ncopy_8_skylakex.c code is not avx512 specific and gives a nice performance boost for medium sized matrices	7 years ago
Arjan van de Ven	c43331ad0a	dgemm: Use the skylakex beta function also for haswell it's more efficient for certain tall/skinny matrices	7 years ago
Arjan van de Ven	69d206440a	Make the skylakex/haswell sgemm code compile and run even with compilers without avx2 support	7 years ago
Arjan van de Ven	0586899a10	Use sgemm_ncopy_4_skylakex.c also for Haswell sgemm_ncopy_4_skylakex.c uses SSE transpose operations where the real perf win happens; this also works great for Haswell. This gives double digit percentage gains on small and skinny matrices	7 years ago
Arjan van de Ven	00dc09ad19	Use the skylake sgemm beta code also for haswell with a few small changes it's possible to use the skylake sgemm code also for haswell, this gives a modest gain (10% range) for smallish matrixes but does wonders for very skinny matrixes	7 years ago
Arjan van de Ven	cdc668d82b	Add a "sgemm direct" mode for small matrixes. OpenBLAS has a fancy algorithm for copying the input data while laying it out in a more CPU friendly memory layout. This is great for large matrixes; the cost of the copy is easily amortized by the gains from the better memory layout. But for small matrixes (on CPUs that can do efficient unaligned loads) this copy can be a net loss. This patch adds (for SKYLAKEX initially) a "sgemm direct" mode that bypasses the whole copy machinery for ALPHA=1/BETA=0/... standard arguments, for small matrixes only. What is small? For the non-threaded case this has been measured to be in the MNK = 28 * 512 * 512 range, while in the threaded case it's less, around MNK = 1 * 512 * 512	7 years ago
Martin Kroeker	701ea88347	Use p2align instead of align for OSX compatibility fixes #1902	7 years ago
Andrew	19c4bdd8b3	Add return value so that freebsd system clang does not err out	7 years ago
Arjan van de Ven	dcc5d6291e	skylakex: Make the sgemm/dgemm beta code robust for a N=0 or M=0 case. In the threading code there are cases where N or M can become 0, and the optimized beta code did not handle this well, leading to a crash. During the audit for the crash, a few edge conditions on the if statements were found and fixed as well	7 years ago
Arjan van de Ven	55b244ca0d	enable the SGEMM/SKX C based kernel. In QA the final bug was found, so now the skylakex sgemm C based kernel can be activated....	7 years ago
Arjan van de Ven	d4bad73834	Add a C+intrinsics version of the SGEMM/skylakex kernel for most sizes this is 1.2x to 1.4x faster than the current code	7 years ago
Arjan van de Ven	582c589727	dgemm/skylakex: replace discrete mul/add with fma very minor gains since it's not super hot code, but general principles	7 years ago
Arjan van de Ven	adbf6afa25	Add vector optimizations for ncopy as well for dgemm/skylakex	7 years ago
Arjan van de Ven	32bec8afbb	add a skylakex optimized dgemm beta function	7 years ago
Arjan van de Ven	20c5d668fe	dgemm/avx512 simplify and speed up the 4x4 kernel	7 years ago
Arjan van de Ven	6d43c51ccf	undo slow dgemm/skylake microoptimization the compare is more costly than the work	7 years ago
Arjan van de Ven	d74dc39b0f	Add optimized *copy versions for skylakex Add optimized n/t copy versions for skylakex; in the patch the tcopy is also rewritten using intrinsics; the ncopy file will be worked on in a future commit	7 years ago
Arjan van de Ven	66b43affbc	Add a 24x8 kernel to the skylakex dgemm implementation Minor gains for small matrixes, but at 512x512 and above the gain gets more significant.	7 years ago
Arjan van de Ven	1938819c25	skylake dgemm: Add a 16x8 kernel The next step for the avx512 dgemm code is adding a 16x8 kernel. In the 8x8 kernel, each FMA has a matching load (the broadcast); in the 16x8 kernel we can reuse this load for 2 FMAs, which in turn reduces pressure on the load ports of the CPU and gives a nice performance boost (in the 25% range).	7 years ago
Martin Kroeker	b7496c3638	Function name needs to be CNAME, set from outside to allow suffixing for dynamic_arch	7 years ago
Arjan van de Ven	45fe8cb0c5	Create a AVX512 enabled version of DGEMM This patch adds dgemm_kernel_4x8_skylakex.c which is * dgemm_kernel_4x8_haswell.s converted to C + intrinsics * 8x8 support added * 8x8 kernel implemented using AVX512 Performance is a work in progress, but already shows a 10% - 20% increase for a wide range of matrix sizes.	7 years ago
Martin Kroeker	375dff54fc	Merge pull request #1733 from fenrus75/dsymv Add an AVX512 enabled DSYMV (L) function	7 years ago
Martin Kroeker	a5f165275a	Merge pull request #1732 from fenrus75/dgemv Add an AVX512 enabled DGEMV (n) function	7 years ago
Martin Kroeker	8c13aa495a	Merge pull request #1730 from fenrus75/fix-sdot Fix typo in sdot function	7 years ago
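
Many of the commits above retag inline-assembly operands as "both input and output". The following is a minimal sketch of that idea, not code taken from any OpenBLAS kernel. The function name `sum_i64` and the loop it implements are hypothetical and chosen only for illustration. It assumes GCC-style extended asm on x86_64, and falls back to plain C elsewhere. The pointer and the counter are modified inside the asm body, so they are declared with the `"+r"` (read-write) constraint rather than being listed as inputs only. Declaring them as pure inputs would let the compiler assume their registers are unchanged afterwards, which is the kind of miscompilation the gcc9 fixes describe.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example: sum n 64-bit integers. Not an OpenBLAS kernel. */
static int64_t sum_i64(const int64_t *x, size_t n)
{
    int64_t acc = 0;
#if defined(__x86_64__) && defined(__GNUC__)
    if (n == 0)
        return 0;                 /* the asm loop below runs at least once */
    __asm__ volatile(
        "1:                       \n\t"
        "addq   (%[ptr]), %[acc]  \n\t"   /* acc += *ptr                  */
        "addq   $8, %[ptr]        \n\t"   /* ptr is advanced: modified    */
        "decq   %[cnt]            \n\t"   /* cnt doubles as loop counter  */
        "jnz    1b                \n\t"
        /* All three operands are read AND written, hence "+r". */
        : [acc] "+r"(acc), [ptr] "+r"(x), [cnt] "+r"(n)
        : /* no input-only operands */
        : "cc", "memory");        /* flags change; memory is read */
#else
    /* Portable fallback for other compilers and architectures. */
    for (size_t i = 0; i < n; i++)
        acc += x[i];
#endif
    return acc;
}
```

Calling `sum_i64(v, 4)` on an array holding 1, 2, 3, 4 yields 10 on either code path. Registers that the asm body overwrites without naming them as operands (for example the xmm registers mentioned in commit 69a97ca7b9) would instead go on the clobber list.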


479 Commits (a6a8cc2b7fa30f46fdaa4fb6e50c19da8c11e335)
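
Commit 9d717cb5ee describes ?sum as a copy of ?asum with the fabs calls removed. A plain C sketch of that relationship for the double-precision case might look as follows. The names `ref_dasum`, `ref_dsum`, and `abs_double` are hypothetical reference helpers, not the OpenBLAS kernels or their signatures, and the sketch assumes a positive element stride.

```c
#include <stddef.h>

/* Local stand-in for fabs, so the sketch does not depend on libm. */
static double abs_double(double v)
{
    return v < 0.0 ? -v : v;
}

/* Sum of absolute values of n elements of x, taken every inc_x elements. */
static double ref_dasum(size_t n, const double *x, size_t inc_x)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += abs_double(x[i * inc_x]);
    return total;
}

/* Same loop with the absolute value removed: a plain sum. */
static double ref_dsum(size_t n, const double *x, size_t inc_x)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += x[i * inc_x];
    return total;
}
```

For the vector 1, -2, 3, -4 with stride 1, `ref_dasum` returns 10 and `ref_dsum` returns -2.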