Wangyang Guo
fdd2d0fc7b
Small Matrix: skylakex: add sgemm nt kernel
5 years ago
Wangyang Guo
5f91668904
Small Matrix: skylakex: sgemm nn: fix n6 conflicts with n4
5 years ago
Wangyang Guo
0ecaa99fc2
Small Matrix: skylakex: sgemm nn: fix error when beta not zero
5 years ago
Wangyang Guo
a1835c8ca2
Small Matrix: skylakex: sgemm nn: add n6 to improve performance
5 years ago
Wangyang Guo
50bd888c73
Small Matrix: skylakex: sgemm nn: reduce store 4 N at a time
5 years ago
Wangyang Guo
95912941ca
Small Matrix: skylakex: sgemm nn: reduce store 4 M at a time
5 years ago
Wangyang Guo
8a4bb07453
Small Matrix: skylakex: sgemm nn: clean up unused code
5 years ago
Wangyang Guo
04ac9c7a13
Small Matrix: skylakex: sgemm_nn: optimize for M <= 8
5 years ago
Wangyang Guo
20befbb2f9
Optimize M < 16 using AVX512 mask
5 years ago
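The idea of handling M < 16 with an AVX512 mask can be sketched in portable C: a 16-bit mask with the low m bits set selects which of the 16 float lanes of a 512-bit register take part in a load or store. The helper names `tail_mask16` and `masked_store16` are hypothetical and only model the lane semantics in scalar code; they are not taken from the kernel itself.

```c
#include <stdint.h>

/* Build a 16-bit lane mask with the low m bits set (0 <= m <= 16).
 * Hypothetical helper: models how a tail mask for 16 float lanes
 * can be formed. */
uint16_t tail_mask16(unsigned m)
{
    return (uint16_t)((1u << m) - 1u);
}

/* Scalar model of a masked 16-lane store: only lanes whose mask bit
 * is set are written, so memory past the m valid rows stays untouched. */
void masked_store16(float *dst, const float *src, uint16_t mask)
{
    for (int lane = 0; lane < 16; lane++) {
        if (mask & (1u << lane))
            dst[lane] = src[lane];
    }
}
```

With real AVX512 code the loop would presumably be a single masked store instruction; the scalar version only shows which lanes are touched.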
Wangyang Guo
c3e4c4db47
small matrix: SkylakeX: add SGEMM NN kernel
5 years ago
Martin Kroeker
b2053239fc
Fix missing dummy parameter (imag part of alpha) of zdot_thread_function
5 years ago
Martin Kroeker
9ee21a0a39
Merge pull request #2780 from Guobing-Chen/CPL_build_support
Enable COOPERLAKE build target
5 years ago
Martin Kroeker
75eeb265d7
[WIP] Refactor the driver code for direct SGEMM (#2782)
Move "direct SGEMM" functionality out of the SkylakeX SGEMM kernel and make it available
(on x86_64 targets only for now) in DYNAMIC_ARCH builds
* Add sgemm_direct targets in the kernel Makefile.L3 and CMakeLists.txt
* Add direct_sgemm functions to the gotoblas struct in common_param.h
* Move sgemm_direct_performant helper to separate file
* Update gemm.c to use macros for sgemm_direct to support dynamic_arch naming via common_s.h
* (Conditionally) add sgemm_direct functions in setparam-ref.c
5 years ago
Chen, Guobing
e740c4873d
Enable COOPERLAKE build target
Enable new build target platform -- COOPERLAKE. This target platform
supports all the SKYLAKEX supported ISAs + avx512bf16. So all the
SKYLAKEX specific kernels/drivers and related code are now extended
to be also active on COOPERLAKE. Besides, new BF16 related kernels
are active under this target.
5 years ago
Martin Kroeker
81dcfdcf39
Multiply by 2 instead of left-shifting a potentially negative number
fixes GCC ubsan warning in the BLAS tests
5 years ago
Martin Kroeker
0ef4b3f1f2
Multiply instead of doing a left shift of a potentially negative number
fixes GCC ubsan report in the BLAS tests
5 years ago
Martin Kroeker
aa53a8a5cb
Multiply by two instead of left-shifting one place
fixes GCC ubsan report of "left shift of negative value -2" in the BLAS tests
5 years ago
Martin Kroeker
aa3a1e7d8c
Multiply by two rather than left shift by one place
fixes GCC ubsan report of "left shift of negative value -2" in the BLAS tests
5 years ago
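The four commits above apply the same fix. In C, left-shifting a negative signed integer is undefined behavior, which is what UBSan reports as "left shift of negative value -2"; multiplying by two is well defined for any value whose product fits in the type. A minimal sketch, using a hypothetical helper name `double_offset` that does not appear in the source:

```c
/* Double a possibly negative offset.
 * `n << 1` would be undefined behavior for n < 0;
 * `n * 2` is well defined as long as the result fits in a long.
 * Hypothetical helper, for illustration only. */
long double_offset(long n)
{
    return n * 2;
}
```

Optimizing compilers generally emit the same shift or add instruction for either form, so the change should not cost performance.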
Martin Kroeker
e30ad0e521
Strip UTF8 byte order marker from source
5 years ago
Martin Kroeker
93592d1260
Merge pull request #2675 from wjc404/develop
AVX512 DGEMM TCOPY_16 Function
6 years ago
wjc404
086d87a302
AVX512 dgemm tcopy_16 function
6 years ago
Martin Kroeker
c3574ffe53
Merge pull request #2646 from wjc404/develop
Optimize AVX512 parallel DGEMM performance
6 years ago
wjc404
0e3ac4a06b
Add files via upload
6 years ago
Martin Kroeker
2271c3506b
Work around excessive LAPACK test failures on Skylake-X
Something in the plain C parts of x86_64 cscal.c and zscal.c appears to be miscompiled by both gfortran9 and ifort when compiling for skylakex-avx512, even when the optimized Haswell microkernel is not in use.
6 years ago
Martin Kroeker
90dba9f716
Duplicate earlier Clang 9.0.0 workaround for corresponding Apple Clang version
As discussed on the original PR #2329, the "Apple Clang 11.0.3" that appears to be based on the same LLVM release produces the same miscompilation of this file.
6 years ago
Martin Kroeker
5b0093b5fe
Convert aligned moves to unaligned
Should have no performance impact on reasonably modern CPUs, and fixes occasional crashes in actual user code.
6 years ago
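The difference between aligned and unaligned moves can be sketched with SSE intrinsics: an aligned load requires the address to be a multiple of 16 bytes and may fault otherwise, while an unaligned load accepts any address. The function name `sum4_unaligned` is hypothetical, and the commit itself changes assembly instructions rather than intrinsics; this is only an illustration of the same distinction.

```c
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Sum four consecutive floats starting at p, which need not be
 * 16-byte aligned. Hypothetical helper, for illustration only. */
float sum4_unaligned(const float *p)
{
#if defined(__SSE__)
    /* _mm_loadu_ps accepts any address; _mm_load_ps would require
     * p to be 16-byte aligned and may fault otherwise. */
    __m128 v = _mm_loadu_ps(p);
    float out[4];
    _mm_storeu_ps(out, v);
    return out[0] + out[1] + out[2] + out[3];
#else
    /* Fallback for targets without SSE: plain scalar loads. */
    return p[0] + p[1] + p[2] + p[3];
#endif
}
```

Calling it with a pointer offset by one float from an aligned buffer (for example `buf + 1`) is the case an aligned load could not handle safely.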
Martin Kroeker
567d2760e6
Merge pull request #2520 from wjc404/develop
Fix avx512 sgemm performance bug when ldc is a multiple of 1024
6 years ago
wjc404
b8307768e2
Add files via upload
6 years ago
Martin Kroeker
af8a619e1f
Merge pull request #2517 from wjc404/develop
Temporary fix for SKX STRSM
6 years ago
wjc404
62b9608986
Update KERNEL.SKYLAKEX
6 years ago
Martin Kroeker
a1b181cea2
Merge pull request #2516 from wjc404/develop
AVX2 STRSM kernels
6 years ago
wjc404
cdc0e9011e
Update KERNEL.ZEN
6 years ago
wjc404
fa049d49c2
AVX2 STRSM kernel
6 years ago
Martin Kroeker
ea8eec5d17
Merge pull request #2422 from wjc404/develop
Adjust SkylakeX GEMM3M parameters, add an AVX512 STRMM kernel and fix performance bugs in AVX2 s/c/z GEMM
6 years ago
wjc404
dd22eb7621
Update cgemm_kernel_8x2_haswell.c
6 years ago
wjc404
2352331e60
Update zgemm_kernel_4x2_haswell.c
6 years ago
wjc404
1b980001dd
Update zgemm_kernel_4x2_haswell.c
6 years ago
wjc404
2515e1152f
Update cgemm_kernel_8x2_haswell.c
6 years ago
wjc404
903854c168
Add files via upload
6 years ago
wjc404
a2ff577a30
Update KERNEL.ZEN
6 years ago
wjc404
97a32cb0a5
Update KERNEL.HASWELL
6 years ago
Martin Liska
aeea14ee40
Come up with LOAD_AND_COMPARE_TO_MXX macro in iamax_sse.S.
6 years ago
Martin Liska
18bcc36a69
Fix implementation of iamax_sse.S as reported in #2116.
There was a typo in iamax_sse.S where one of the comparisons
was cmpeqps instead of cmpeqss. That misdetected the index
for sequences where the minimum value was 0.
6 years ago
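The difference between the two instructions can be modeled in plain C: cmpeqps compares all four float lanes of its operands, while cmpeqss compares only the lowest lane. The sketch below uses hypothetical helper names and scalar code rather than the actual assembly. It assumes, for illustration only, that the unused upper lanes hold zeros; the commit does not say how they were populated in the kernel.

```c
/* Scalar models of two SSE compare instructions. Each returns a 4-bit
 * mask in which bit i set means "lane i compared equal".
 * Hypothetical helpers, for illustration only. */

/* Model of cmpeqps: packed compare, all four lanes take part. */
unsigned eq_mask_packed(const float a[4], const float b[4])
{
    unsigned mask = 0;
    for (int i = 0; i < 4; i++) {
        if (a[i] == b[i])
            mask |= 1u << i;
    }
    return mask;
}

/* Model of cmpeqss: scalar compare, only lane 0 takes part.
 * (The real instruction leaves the upper three lanes of the
 * destination unchanged; this model reports lane 0 only.) */
unsigned eq_mask_scalar(const float a[4], const float b[4])
{
    return (a[0] == b[0]) ? 1u : 0u;
}
```

With `a = {1, 0, 0, 0}` and `b = {2, 0, 0, 0}` the packed model reports matches in the three upper lanes even though the values of interest in lane 0 differ, whereas the scalar model reports no match.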
wjc404
f566787e6e
Update KERNEL.SKYLAKEX
6 years ago
wjc404
e3368cbf18
AVX512 STRMM kernel
6 years ago
Bart Oldeman
7ea5e07d1c
Fix inline asm in dscal: mark x, x1 as clobbered. Fixes #2408
The leaq instructions in dscal_kernel_inc_8 modify x and x1 so they
must be declared as input/output constraints, otherwise the compiler
may assume the corresponding registers are not modified.
6 years ago
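The constraint issue described above can be sketched with GCC-style extended asm. When an asm statement modifies a register that holds a C variable, that variable has to be declared as an input/output operand (`"+r"`); with an input-only constraint (`"r"`) the compiler is free to assume the register still holds the old value. The function name `advance8` is hypothetical and much simpler than the real `dscal_kernel_inc_8`; the inline asm path is only taken on x86-64 with a GCC-compatible compiler.

```c
/* Advance a pointer by 8 doubles (64 bytes).
 * Hypothetical helper, for illustration only. */
double *advance8(double *x)
{
#if defined(__x86_64__) && defined(__GNUC__)
    /* "+r" marks x as read AND written by the asm. Declaring it as a
     * plain input ("r") would let the compiler assume the register
     * still holds the old value after the asm statement. */
    __asm__ ("leaq 64(%0), %0" : "+r" (x));
    return x;
#else
    /* Fallback for other targets or compilers: same arithmetic in C. */
    return x + 8;
#endif
}
```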
wjc404
3447d04eaf
Update dgemm_kernel_16x2_skylakex.c
6 years ago
wjc404
8b5cdcc64c
Update sgemm_kernel_8x4_haswell.c
6 years ago
wjc404
4e00d96a78
Update dgemm_kernel_16x2_skylakex.c
6 years ago
wjc404
096da2f51a
Update dgemm_kernel_16x2_skylakex.c
6 years ago