Rajalakshmi Srinivasaraghavan
571eadb880
powerpc: Optimized SGEMM/DGEMM/CGEMM for POWER10
This patch introduces new optimized versions of SGEMM, CGEMM and DGEMM
using the POWER10 Matrix-Multiply Assist (MMA) feature introduced in
POWER ISA v3.1. It makes use of the new POWER10 compute instructions
for matrix multiplication.
Tested on a simulator; there are no new test failures.
The cycle count is reduced by 30-50% compared to the POWER9 version,
depending on the M/N/K sizes.
MMA GCC patch for reference:
https://gcc.gnu.org/git/?p=gcc.git;a=commit;h=8ee2640bfdc62f835ec9740278f948034bc7d9f1
5 years ago
Martin Kroeker
93592d1260
Merge pull request #2675 from wjc404/develop
AVX512 DGEMM TCOPY_16 Function
5 years ago
Martin Kroeker
6eaeb01263
Merge pull request #2658 from RajalakshmiSR/p10
powerpc: Add support for future processor
5 years ago
wjc404
086d87a302
AVX512 dgemm tcopy_16 function
5 years ago
Martin Kroeker
af501eb753
Merge pull request #2669 from mhillenibm/zarch_fix_gcc_detection
Zarch fix gcc detection
5 years ago
Martin Kroeker
0eb6c4dded
Merge pull request #2672 from mhillenibm/test_num_threads
cpp_thread_test: Change adjustment of concurrency on systems with <52 hw threads
5 years ago
Marius Hillenbrand
de838c38ef
cpp_thread_test/dgemv: fail early if concurrency is zero
The two test cases dgemv_tester and dgemm_tester accept the degree of
concurrency as a command-line argument (among others). Fail early if the
value 0 is specified, instead of failing later with less clear symptoms.
Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
5 years ago
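The fail-early check described above can be sketched as follows. This is a minimal illustration, not the testers' actual argument handling: the helper name parseConcurrency and the choice to throw an exception are assumptions. It rejects 0 (and values too large for a 32-bit thread count) at parse time, so a bad argument produces an immediate, clear error instead of a confusing failure later in the run.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>

// Hypothetical helper (not the code in dgemv_tester/dgemm_tester):
// parse the concurrency argument and reject invalid values immediately.
std::uint32_t parseConcurrency(const std::string& arg) {
    // std::stoul throws std::invalid_argument on non-numeric input
    unsigned long value = std::stoul(arg);
    if (value == 0) {
        // Fail early with a clear message instead of later with odd symptoms
        throw std::invalid_argument("concurrency must be at least 1");
    }
    if (value > std::numeric_limits<std::uint32_t>::max()) {
        throw std::out_of_range("concurrency value too large");
    }
    return static_cast<std::uint32_t>(value);
}
```

A caller would typically wrap parseConcurrency(argv[n]) in a try/catch, print the exception message, and exit with a non-zero status.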
Marius Hillenbrand
478898b37a
cpp_thread_test/dgemv: cap concurrency to number of hw threads on small systems
... instead of (number of hw threads - 4) to avoid invalid numbers on
smaller systems. Currently, systems with 4 or fewer CPUs (e.g., small CI
VMs) would fail the test. Fixes one of the issues discussed in #2668
Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
5 years ago
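The adjustment described in this commit can be sketched as a small function. This is a hedged illustration under assumptions: the name adjustConcurrency is hypothetical, and the hardware thread count is passed in as a parameter so the logic can be checked on any machine. With unsigned arithmetic, the old rule (hw threads - 4) yields 0 on a 4-CPU system and wraps around to a huge value on smaller ones; capping at the hardware thread count always leaves a usable value.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical sketch of the capping rule, not the test's actual code.
// hwThreads would normally come from std::thread::hardware_concurrency().
std::uint32_t adjustConcurrency(std::uint32_t requested, std::uint32_t hwThreads) {
    // hardware_concurrency() may return 0 when the count is unknown;
    // in that case keep the requested value unchanged.
    if (hwThreads == 0) {
        return requested;
    }
    // Old rule: hwThreads - 4, which is 0 or wraps for hwThreads <= 4.
    // New rule: never exceed the number of hardware threads.
    return std::min(requested, hwThreads);
}
```

For example, on a 4-CPU CI VM a requested concurrency of 52 becomes 4, while on a 64-thread machine it stays at 52.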
Marius Hillenbrand
2389291766
Makefile.system: remove duplicate variable GCCVERSIONGT5
... to bring unified gcc version detection with common variables to the
one remaining spot in Makefile.system.
Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
5 years ago
Marius Hillenbrand
a2d13ea611
Fix gcc version detection for zarch
Employ common variables for gcc version detection and fix the broken
check for gcc >= 5.2.
Fixes #2668
Signed-off-by: Marius Hillenbrand <mhillen@linux.ibm.com>
5 years ago
Martin Kroeker
1bd3cd66c2
Increment version to 0.3.10.dev
5 years ago
Martin Kroeker
1c53e1366d
Increment version to 0.3.10.dev
5 years ago
Martin Kroeker
95dbeff66d
Merge branch 'release-0.3.0' into develop
5 years ago
Martin Kroeker
3b673a24b7
Increment version to 0.3.10.dev
5 years ago
Martin Kroeker
1eb1979050
Increment version to 0.3.10.dev
5 years ago
Martin Kroeker
efc53b6e7e
Merge pull request #2665 from martin-frbg/flang-fixes-2a
Fix spelling of flang option -Mrecursive, add -Kieee and workaround for AOCC optimizer bug
5 years ago
Martin Kroeker
72888497e2
Update with 0.3.10 changes
5 years ago
Martin Kroeker
7e3e006af6
Merge pull request #2666 from martin-frbg/blastest
Update BLAS tests to what netlib 3.9.0 uses
5 years ago
Martin Kroeker
d906d14402
Merge pull request #2664 from ACSimon33/exported_symbols
Add missing exported symbols.
5 years ago
Martin Kroeker
3785c0e82b
Merge pull request #2663 from martin-frbg/issue2654
Respect predefined defaults for AR, AS, LD and RANLIB
5 years ago
Martin Kroeker
f2d8879af6
Merge pull request #2661 from martin-frbg/issue2660
Report selected DYNAMIC_ARCH kernel rather than one of its aliases in gotoblas_corename
5 years ago
Martin Kroeker
6876221cf3
Remove optimization level limit for flang again and add -fno-unroll-loops for AOCC flang 2.x instead
5 years ago
Martin Kroeker
79cdcde717
Re-enable higher optimization levels for flang while disabling loop unrolling for AOCC flang
5 years ago
Martin Kroeker
18a11137f1
Update BLAS tests to correspond to Reference-LAPACK 3.9.0
Replaces the calculation of machine precision with a call to the EPSILON intrinsic and removes the requirement to delete previous output files before rerunning the tests
5 years ago
Martin Kroeker
1dd712131e
Fix spelling of flang option -Mrecursive and add -Kieee
5 years ago
Martin Kroeker
0ed2adf0b2
Fix spelling of flang option -Mrecursive and add -Kieee
5 years ago
Martin Kroeker
abf670757b
Respect predefined defaults for AR, AS, LD and RANLIB
5 years ago
Simon Märtens
41fc6f3cd2
Added missing exported symbols.
5 years ago
Martin Kroeker
007d9f97d7
Make gotoblas_corename report the name of the selected TARGET rather than its aliases
5 years ago
Martin Kroeker
63d26090f5
Merge pull request #64 from xianyi/develop
rebase
5 years ago
Rajalakshmi Srinivasaraghavan
9fe930f205
powerpc: Add support for future processor
This is the initial patch adding build infrastructure support
for the POWER10 architecture.
5 years ago
Martin Kroeker
3a1b58d54a
Merge pull request #2653 from craft-zhang/cortex-a53
fix INIT8x4 of SGEMM on Arm Cortex-A53
5 years ago
Martin Kroeker
f7659be4a0
Merge pull request #2652 from martin-frbg/flang-fixes
Fixes for compilation with flang binary release 20190329
5 years ago
ZhangDanfeng
bc6fd20a40
fix INIT8x4
Signed-off-by: ZhangDanfeng <467688405@qq.com>
5 years ago
Martin Kroeker
3ce469a34f
Limit optimization level to O1 for flang and add -frecursive
5 years ago
Martin Kroeker
ba2c5b404d
When building with flang, use it also for the final link step to get dependencies right
5 years ago
Martin Kroeker
f07a80354b
Apply previously AOCC-specific workaround to all versions of flang
5 years ago
Martin Kroeker
fdd1b50263
Merge pull request #63 from xianyi/develop
rebase
5 years ago
Martin Kroeker
430e8b45fe
Merge pull request #2648 from martin-frbg/lapack411
Break out of potentially infinite rescaling loop in LAPACK xLARGV/xLARTG/xLARTGP
5 years ago
Martin Kroeker
88fe85f4e0
Merge pull request #2647 from martin-frbg/aocc-flang
Small fixes for flang in general and the AMD AOCC version of it in particular
5 years ago
Martin Kroeker
89091e6b64
Merge pull request #2645 from martin-frbg/misc_fixes
Miscellaneous fixes
5 years ago
Martin Kroeker
522aaf53bf
Break out of potentially infinite rescaling loop in LAPACK xLARGV/xLARTG/xLARTGP
Reference-LAPACK issue 411
5 years ago
Martin Kroeker
c3574ffe53
Merge pull request #2646 from wjc404/develop
Optimize AVX512 parallel DGEMM performance
5 years ago
Martin Kroeker
4e28dc6353
Use only -O1 with the AMD AOCC version of flang
to prevent miscompilation of LAPACK code and tests on Ryzen
5 years ago
Martin Kroeker
13c28889a2
Update "cosmetic fixes for non-C99 compilers"
5 years ago
wjc404
0e3ac4a06b
Add files via upload
5 years ago
Martin Kroeker
28915eed72
Cosmetic fixes for non-C99 compilers
5 years ago
Martin Kroeker
7f60fb6b91
Delete spurious copy of common_param.h
5 years ago
Martin Kroeker
0464e662ad
make blas_quickdivide unsigned and guard against miscompilation
5 years ago
Martin Kroeker
0f9a935a5a
Merge pull request #62 from xianyi/develop
rebase
5 years ago