Martin Kroeker
fbda20c856
Merge pull request #94 from xianyi/develop
rebase
5 years ago
Martin Kroeker
e1b7123bbe
Merge pull request #2867 from Qiyu8/usimd-floatdot
Optimize the performance of dot by using universal intrinsics in X86/ARM
5 years ago
Qiyu8
f32d34a015
add sse3 compiler flag
5 years ago
Martin Kroeker
599777ecb7
Merge pull request #2879 from martin-frbg/issue2839
Default BLAS3_MEM_ALLOC_THRESHOLD on all platforms to 32
5 years ago
Martin Kroeker
a5feea6611
make BLAS3_MEM_ALLOC_THRESHOLD configurable on non-Windows
5 years ago
Martin Kroeker
dc8e4e1959
Reduce the BLAS3 heap allocation threshold to 32 and mark it as configurable
5 years ago
Martin Kroeker
cccd1438da
Merge pull request #93 from xianyi/develop
rebase
5 years ago
Martin Kroeker
f032d8966e
Merge pull request #2874 from Flamefire/memory_fixes
Avoid out of bounds access on invalid memory free
5 years ago
Martin Kroeker
f6e4cf2f9d
Merge pull request #2876 from Flamefire/omp_fork_fix
Lazily reinit threads after a fork in OMP mode
5 years ago
Martin Kroeker
9828343e12
Merge pull request #2878 from brada4/asms
fix clang std=c18 compilation on aarch64
5 years ago
User User-User
d2333e7842
aarch64 fix std=c18 compilation
5 years ago
Alexander Grund
3094fc6c83
Lazily reinit threads after a fork in OMP mode
This initializes the per-thread memory buffers, which get
cleared/released on a fork via pthread_atfork. Not doing so leads to
each thread calling blas_memory_alloc on almost every execution, which
slows down the code significantly because the threads contend for the
locks that serialize memory allocation.
5 years ago
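The lazy re-initialization described in the commit above can be illustrated with a minimal sketch. It is not the OpenBLAS code: every name prefixed with `sketch_` is hypothetical, the buffer size and thread count are arbitrary, and a plain `malloc` stands in for `blas_memory_alloc`. The only real API used is POSIX `pthread_atfork`. The idea is that the child-side fork handler drops the buffers, and each thread re-acquires its buffer once on first use instead of on every call.

```c
#include <pthread.h>
#include <stdlib.h>

#define SKETCH_MAX_THREADS 4      /* hypothetical thread count */
#define SKETCH_BUFFER_BYTES 1024  /* hypothetical buffer size */

/* One cached buffer per thread slot; NULL means "not initialized". */
static void *sketch_buffer[SKETCH_MAX_THREADS];
/* Per-slot allocation counters, so each thread only touches its own slot. */
static int sketch_alloc_calls[SKETCH_MAX_THREADS];

/* Child-side fork handler: release and clear every cached buffer. */
static void sketch_release_after_fork(void) {
    for (int i = 0; i < SKETCH_MAX_THREADS; i++) {
        free(sketch_buffer[i]);
        sketch_buffer[i] = NULL;
    }
}

/* Register the handler so it runs in the child after fork(). */
static int sketch_install_fork_handler(void) {
    return pthread_atfork(NULL, NULL, sketch_release_after_fork);
}

/* Lazy (re)initialization: allocate only if the slot is empty,
 * otherwise reuse the cached buffer without touching the allocator. */
static void *sketch_get_buffer(int tid) {
    if (tid < 0 || tid >= SKETCH_MAX_THREADS) {
        return NULL;
    }
    if (sketch_buffer[tid] == NULL) {
        sketch_buffer[tid] = malloc(SKETCH_BUFFER_BYTES);
        sketch_alloc_calls[tid]++;
    }
    return sketch_buffer[tid];
}
```

With this shape, the allocator is called once per thread after a fork rather than on nearly every execution, which is the slowdown the commit message describes.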
Alexander Grund
3c05f54df8
Avoid out of bounds access on invalid memory free
5 years ago
Alexander Grund
dee7c49938
Fix TABs and trailing space
5 years ago
Martin Kroeker
d3c0d6811b
Merge pull request #2873 from martin-frbg/issue2871
Check for __linux rather than linux in cpuid code and benchmarks
5 years ago
Martin Kroeker
9637cd1fd1
Merge pull request #2865 from thisch/backticks
Consolidate usage of backticks for build options
5 years ago
Martin Kroeker
5464eb13ea
Change ifdef linux to __linux for C11 compatibility
5 years ago
Martin Kroeker
e1574cbc83
Change ifdef linux to __linux for C11 compatibility
and add a fallback for unsupported operating systems in detect()
5 years ago
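As background for the `linux` to `__linux` commits: in strict ISO modes such as `-std=c11`, GCC does not predefine the bare `linux` macro, because that name belongs to the user's namespace, while `__linux` and `__linux__` stay defined. The sketch below shows the kind of check involved, with a fallback branch for unsupported operating systems. The function name `sketch_detect_os` and the returned strings are hypothetical and are not taken from the OpenBLAS `detect()` code.

```c
/* Hypothetical helper: report the host OS using only
 * reserved-namespace macros that survive -std=c11. */
static const char *sketch_detect_os(void) {
#if defined(__linux) || defined(__linux__)
    return "linux";
#elif defined(__APPLE__)
    return "darwin";
#elif defined(_WIN32)
    return "windows";
#else
    /* Fallback so that unsupported systems still get a defined result. */
    return "unknown";
#endif
}
```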
Martin Kroeker
0b2bb5696a
Change ifdef linux to __linux for C11 compatibility
5 years ago
Martin Kroeker
a7d5d0078d
Change ifdef linux to __linux for C11 compatibility
5 years ago
Martin Kroeker
be40440ec5
Change ifdef linux to __linux for C11 compatibility
5 years ago
Martin Kroeker
2bf70c8e3b
Change ifdef linux to __linux for C11 compatibility
5 years ago
Qiyu8
60e6c68e38
Adapt to the ARM architecture
5 years ago
Martin Kroeker
64629cb5c7
Merge pull request #91 from xianyi/develop
rebase
5 years ago
Qiyu8
1b1a757f5f
Optimize the performance of dot by using universal intrinsics in X86/ARM
5 years ago
Martin Kroeker
0d98ce202c
Merge pull request #2866 from RajalakshmiSR/p10_dcopy
Optimize dcopy/zcopy for POWER10
5 years ago
Rajalakshmi Srinivasaraghavan
2df4235e00
Optimize dcopy/zcopy for POWER10
This patch makes use of new POWER10 vector pair instructions for
loads and stores. Tested in a simulator with no new failures.
5 years ago
Thomas Hisch
fe8cd5ae7e
Consolidate usage of backticks for build options
There were some build options in the README that were not
highlighted. Now all are highlighted.
5 years ago
Martin Kroeker
ba31c8f5f9
Merge pull request #2853 from Qiyu8/usimd-daxpy
Optimize the performance of daxpy by using universal intrinsics
5 years ago
Martin Kroeker
e961d4d609
Merge pull request #2864 from martin-frbg/lapack445
Fix underflow/rounding errors in LAPACK (S,D)LANV2
5 years ago
Martin Kroeker
7ed25e9e10
Fix underflow/rounding errors in LAPACK (S,D)LANV2
Reference-LAPACK PR 445, fixing their issue 263
5 years ago
Martin Kroeker
7b169379e0
Merge pull request #2863 from martin-frbg/readmefixes
Readmefixes
5 years ago
Martin Kroeker
7f539fb850
Update cpu list, outline cmake build, clarify scope of set_num_threads extension
5 years ago
Martin Kroeker
caf7a12295
Merge pull request #90 from xianyi/develop
rebase
5 years ago
Martin Kroeker
72b5b73647
Merge pull request #2850 from xiaojiayuan111/develop
fix a bug of trmm
5 years ago
Qiyu8
881c15179f
Remove default support for FMA4 on the Zen architecture
5 years ago
Martin Kroeker
dfaafd3b55
Merge pull request #2854 from martin-frbg/travis-graviton
Add an AWS-Graviton2 build to Travis CI
5 years ago
Martin Kroeker
f2e9a24e1a
Add AWS Graviton2 build
5 years ago
Martin Kroeker
61fae59298
Merge pull request #88 from xianyi/develop
rebase
5 years ago
Martin Kroeker
33d22f99f1
Merge pull request #2851 from martin-frbg/travis-xcode12
Add an OSX build with xcode12
5 years ago
Martin Kroeker
5ba01dd1a8
Add an OSX build with xcode12
5 years ago
Qiyu8
14f7dad3b7
performance improved
5 years ago
y00512012
06cf73a239
fix a bug of trmm
5 years ago
Qiyu8
325b539c26
Optimize the performance of daxpy by using universal intrinsics
5 years ago
Martin Kroeker
0f112077e6
Merge pull request #2847 from mhillenibm/fixup_cscal
s390x: fix cscal and zscal implementations
5 years ago
Marius Hillenbrand
22aa81f3e5
s390x: fix cscal and zscal implementations
The implementation of complex scalar * vector multiplication for Z14
makes some LAPACK tests fail because the numerical differences from the
reference implementation exceed the threshold (as can be seen by running
make lapack-test and replacing kernel/zarch/cscal.c with a generic
implementation for comparison).
The complex multiplication uses terms of the form a * b + c * d for both
real and imaginary parts. The assembly code (and compiler-emitted code
as well) uses fused multiply add operations for the second product and
sum. The results can be "surprising", for example when both terms in the
imaginary part nearly cancel each other out. In that case, the second
product contributes more digits to the sum than the first product that
has been rounded before.
One option is to use separate multiplications (which then round the same
way) and a distinct add. Change the code to pursue that path, by (1)
requesting the compiler not to contract the operations into FMAs and (2)
replacing the assembly kernel with corresponding vectorized C code
(where change 1 also applies).
Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
5 years ago
Marius Hillenbrand
77ea73f5e5
s390x: for clang use fp-contract=on instead of fast
Make clang slightly more cautious when contracting floating-point
operations (e.g., when applying fused multiply add) by setting
-ffp-contract=on (instead of fast).
Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
5 years ago
Marius Hillenbrand
f91057cbad
s390x: move common vector definitions and utils into header
... to facilitate reuse beyond gemm_vec.c and avoid code duplication.
Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
5 years ago
Martin Kroeker
992d7ca63d
Merge pull request #2845 from martin-frbg/lapack443
Fix workspace query in LAPACK xGELQ (Reference-LAPACK 443)
5 years ago
Martin Kroeker
7e4d5c237c
Fix workspace query in xGELQ (Reference-LAPACK PR443)
5 years ago