Masato Nakagawa
80d3c2ad95
Add Improving Load Imbalance in Thread-Parallel GEMM
11 months ago
Martin Kroeker
77c638db67
Revert "Fix potential inaccuracy in multithreaded level3 related to SWITCH_RATIO"
1 year ago
John Hein
6cd9bbe531
fix signedness of pointer to integer type passed to blas_lock()
1 year ago
Martin Kroeker
8a1710dd0d
don't apply switch_ratio to tail of loop
1 year ago
shivammonaka
9e22d70957
Dynamic locking in Pthread Backend to allow multiple BLAS calls to be executed in parallel
1 year ago
Martin Kroeker
db070a9223
add gemm_batch drivers
1 year ago
Martin Kroeker
d0794f88dc
add gemm_batch driver
1 year ago
yamazaki-mitsufumi
51ab1903e7
Expanding the scope of 2D thread distribution
1 year ago
shivammonaka
d49ebc54e1
Merge branch 'shivam-develop' into shivam-Locks
1 year ago
shivammonaka
bc191015e3
Using OpenMP locks with NUM_PARALLEL
2 years ago
Martin Kroeker
c4bd4a2e5d
fix improper function prototypes (empty parentheses)
2 years ago
Chris Sidebottom
32f2fafde7
Propagate SWITCH_RATIO to DYNAMIC_ARCH builds
Previously dynamic builds were either using the default SWITCH_RATIO
or one from the higher level architecture; this patch ensures the
dynamic builds can use this parameter as well.
3 years ago
Honglin Zhu
4989e039a5
Define SBGEMM_ALIGN_K for DYNAMIC_ARCH build
3 years ago
Honglin Zhu
b00d5b9746
New sbgemm implementation for Neoverse N2
1. Use UZP instructions instead of gather load and scatter store instructions to get lower latency.
2. Padding k to a power of 4.
3 years ago
Wangyang Guo
3dc6052c7e
initial support for Sapphire Rapids platform
4 years ago
Martin Kroeker
2f8220d757
Add sbgemm
4 years ago
Martin Kroeker
307c4c0786
Fix typo
4 years ago
Martin Kroeker
e83df93975
Work around another recent macro name collision with winnt.h
4 years ago
Martin Kroeker
a554712439
remove extra/intermediate size step for min_jj introduced in PR747
5 years ago
Martin Kroeker
5d26223f4a
remove extra/intermediate size step of min_jj from PR747
5 years ago
Martin Kroeker
d3ff1f889f
Convert ifndefs to ifneq
5 years ago
Rajalakshmi Srinivasaraghavan
b5d30b390d
Fix build issues with bfloat16
This patch fixes compilation errors due to recent renaming from SH to SB
with BUILD_BFLOAT16.
5 years ago
Martin Kroeker
006c7f6671
Change "HALF" and "sh" to "BFLOAT16" and "sb"
5 years ago
Martin Kroeker
886a8e3190
Adapt for supporting only a subset of variable types
5 years ago
Martin Kroeker
ac653c94f3
Merge branch 'develop' into issue2588-cmake
5 years ago
Martin Kroeker
988a6f429e
Add BUILD_vartype defines
5 years ago
Martin Kroeker
e5e2fbd593
Support building only selected types
5 years ago
y00512012
06cf73a239
fix a bug of trmm
5 years ago
Martin Kroeker
ddec244a5a
Merge pull request #2838 from austinpagan/gordon_trmm
Adding performance patch for trmm, just like trsm (#2836)
5 years ago
fossum
dfeca46098
Adding performance patch for trmm, just like #2836
5 years ago
fossum
274d6e015b
Fixing a performance bug in trsm_[LR].c.
5 years ago
Martin Kroeker
330044d821
Fix potential domain error in sqrt
5 years ago
Chen, Guobing
e740c4873d
Enable COOPERLAKE build target
Enable new build target platform -- COOPERLAKE. This target platform
supports all the SKYLAKEX supported ISAs + avx512bf16. So all the
SKYLAKEX specific kernels/drivers and related code are now extended
to be also active on COOPERLAKE. Besides, new BF16 related kernels
are active under this target.
5 years ago
Martin Kroeker
ce45af8151
Update conditional for atomics to use HAVE_C11
5 years ago
Martin Kroeker
6f38de06d2
Update conditional for atomics to use HAVE_C11
5 years ago
Martin Kroeker
5dd14e3d48
Make building the bfloat16 functions conditional on option BUILD_HALF (#2590)
* make building the bfloat16 BLAS functions conditional on BUILD_HALF
* pass the BUILD_HALF option to gensymbol
* Pass BUILD_HALF as a compiler define for dynamic_arch builds
5 years ago
Rajalakshmi Srinivasaraghavan
7eb55504b1
RFC : Add half precision gemm for bfloat16 in OpenBLAS
This patch adds support for bfloat16 data type matrix multiplication kernel.
For architectures that don't support bfloat16, it is defined as unsigned short
(2 bytes). Default unroll sizes can be changed as per architecture as done for
SGEMM and for now 8 and 4 are used for M and N. Size of ncopy/tcopy can be
changed as per architecture requirement and for now, size 2 is used.
Added shgemm in kernel/power/KERNEL.POWER9 and tested in powerpc64le and
powerpc64. For reference, added a small test compare_sgemm_shgemm.c to compare
sgemm and shgemm output.
This patch does not cover OpenBLAS test, benchmark and lapack tests for shgemm.
Complex type implementation can be discussed and added once this is approved.
5 years ago
Ali Saidi
97ce6bbce2
Fix barriers in level3_thread
6 years ago
wjc404
2f96a2c55b
Update trmm_R.c
6 years ago
wjc404
833bd0f8ff
Update trmm_L.c
6 years ago
wjc404
77b8f49556
Update level3_thread.c
6 years ago
wjc404
1c3e20ce48
Update level3.c
6 years ago
wjc404
e9fb8f62b1
Update level3_gemm3m_thread.c
6 years ago
wjc404
4c35b8dbaa
Update gemm3m_level3.c
6 years ago
Martin Kroeker
f3065a0eed
Fix race conditions in multithreaded GEMM3M
by adding barriers (and a mutex lock for the non-OpenMP case), as was already done for GEMM in level3_thread.c some time ago
6 years ago
Martin Kroeker
f343ed65b5
Avoid taking the root of a negative number
Fixes #1924 where numpy 1.17+ would report the (transient) FE_INVALID exception raised for the domain error.
7 years ago
Martin Kroeker
f72fdf525c
Merge pull request #1875 from martin-frbg/issue1851
Serialize accesses to parallelized level3 functions from multiple cal…
7 years ago
Martin Kroeker
113cb00b95
fix missing parenthesis
7 years ago
Martin Kroeker
5192651706
Add CriticalSection handling instead of mutexes for Windows
7 years ago
Martin Kroeker
2e6fae2aad
Serialize accesses to parallelized level3 functions from multiple callers
for #1851
7 years ago