Martin Kroeker
17e288e18d
Clean up STACKSIZE redefinition
5 years ago
Martin Kroeker
c1422f3e46
Clean up STACKSIZE redefinition
5 years ago
Martin Kroeker
d85b24e103
Clean up STACKSIZE redefinition
5 years ago
Martin Kroeker
df70667043
fix core list for sse/sse2
5 years ago
Martin Kroeker
f071d1207a
add sse2
5 years ago
Martin Kroeker
dc6cefd2f5
Expressly enable -msse for 32bit DYNAMIC_ARCH kernels
5 years ago
Martin Kroeker
c339c40c01
Silence a redefinition warning
5 years ago
Martin Kroeker
10379fc83b
Use ifdef instead of if
5 years ago
Martin Kroeker
4c25910da0
Merge pull request #2896 from martin-frbg/intrin-double
Add compiler flag for SSE4 where available
5 years ago
Martin Kroeker
ae6ac83991
Revert "add double precision SSE"
5 years ago
Qiyu8
4fac91ef37
adapt arm platform
5 years ago
Qiyu8
bfdf4b56da
Add double precision universal intrinsics for X86/ARM
5 years ago
Martin Kroeker
ebf0470fc2
add sse4.1 for DYNAMIC_ARCH kernels
5 years ago
Martin Kroeker
c9c3ae07af
Add double precision operations
5 years ago
Martin Kroeker
756802df61
Merge pull request #2890 from martin-frbg/s-d-sum
Revert special handling of Windows xNRM2 and enable C+intrinsics kern…
5 years ago
Rajalakshmi Srinivasaraghavan
0826d68f93
POWER10: Change the packing format for bfloat16
Because the new MMA instructions need bfloat16 inputs in 4x2 order,
change the format in the copy/packing code. This avoids permute instructions
in the gemm kernel inner loop.
5 years ago
Rajalakshmi Srinivasaraghavan
b5d30b390d
Fix build issues with bfloat16
This patch fixes compilation errors with BUILD_BFLOAT16 caused by the
recent renaming from SH to SB.
5 years ago
Martin Kroeker
fecedc9c69
Add -mssse3
5 years ago
Martin Kroeker
0eacbca85f
Add Haswell and Zen to temporary sse3 whitelist
5 years ago
Martin Kroeker
6999086a2b
whitelist SANDYBRIDGE for SSE3
5 years ago
Martin Kroeker
8d2df7d066
Revert special handling of Windows xNRM2 and enable C+intrinsics kernel for SSUM/DSUM
5 years ago
Martin Kroeker
08929430cd
Merge pull request #2886 from martin-frbg/issue_2767
Rename "HALF" precision functions (sh prefix) to "BFLOAT16" with "sb" prefix
5 years ago
Martin Kroeker
0c84ffe05f
Merge pull request #2881 from mattip/fninit
add fninit to reset fpu registers before assembler routines
5 years ago
Matti Picus
403eb513a0
use emms instead, add WIN guards
5 years ago
Qiyu8
0ed1f07660
Optimize the performance of sum by using universal intrinsics
5 years ago
Martin Kroeker
3aecafad80
Change "HALF" and "sh" to "BFLOAT16" and "sb"
5 years ago
Martin Kroeker
756062afa5
Rename "HALF" and "sh" to "BFLOAT16" and "sb"
5 years ago
Martin Kroeker
2061f7fdff
Rename "HALF" and "sh" to "BFLOAT16" and "sb"
5 years ago
Martin Kroeker
dc8a1afa63
Rename "HALF" and "sh" to "BFLOAT16" and "sb"
5 years ago
Martin Kroeker
fd94236042
Rename "HALF" and "sh" to "BFLOAT16" and "sb"
5 years ago
Martin Kroeker
68ce719fac
Rename shdot_microk_cooperlake.c to sbdot_microk_cooperlake.c
5 years ago
Martin Kroeker
d7dd9b396c
Rename shdot.c to sbdot.c
5 years ago
Martin Kroeker
9ae80490e0
rename "HALF" and "sh" to "BFLOAT16" and "sb"
5 years ago
Martin Kroeker
d314d1f49f
Rename shgemm_kernel_power10.c to sbgemm_kernel_power10.c
5 years ago
Martin Kroeker
c589c3e2a1
Merge pull request #2882 from martin-frbg/issue2709
Use generic C for (D/Z)NRM2 on Windows x86_64
5 years ago
Martin Kroeker
ec638a82bf
Merge pull request #2852 from martin-frbg/issue2588-cmake
Support building only a subset of variable types
5 years ago
Martin Kroeker
6b6adf8a4a
Allow compiling only a subset of kernels for specific variable types
5 years ago
Martin Kroeker
ac653c94f3
Merge branch 'develop' into issue2588-cmake
5 years ago
Martin Kroeker
7a53128481
Add whitelist of DYNAMIC_ARCH kernels for which -msse3 needs to be enabled
5 years ago
Martin Kroeker
e1b7123bbe
Merge pull request #2867 from Qiyu8/usimd-floatdot
Optimize the performance of dot by using universal intrinsics in X86/ARM
5 years ago
Qiyu8
f32d34a015
add sse3 compiler flag
5 years ago
Martin Kroeker
7812486091
Use generic C for D/Z nrm2 kernels on Windows to work around fpu exception bug
5 years ago
Matti Picus
a5b164946c
add fninit to reset fpu registers before assembler routines
5 years ago
User User-User
d2333e7842
aarch64 fix std=c18 compilation
5 years ago
Qiyu8
60e6c68e38
Adapt ARM architecture
5 years ago
Qiyu8
1b1a757f5f
Optimize the performance of dot by using universal intrinsics in X86/ARM
5 years ago
Rajalakshmi Srinivasaraghavan
2df4235e00
Optimize dcopy/zcopy for POWER10
This patch uses the new POWER10 vector pair instructions for
loads and stores. Tested in the simulator with no new failures.
5 years ago
Martin Kroeker
dfbc62ef7e
Support building only a subset of types
5 years ago
Qiyu8
14f7dad3b7
performance improved
5 years ago
Qiyu8
325b539c26
Optimize the performance of daxpy by using universal intrinsics
5 years ago