OpenBLAS

Commit Graph

Author	SHA1	Message	Date
Rajalakshmi Srinivasaraghavan	601b711c78	Optimize swap function for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores.	5 years ago
Ashwin Sekhar T K	1b2508362b	arm64: Fix nrm2 for input vectors with Inf Fix double precision nrm2 kernels returning NaN when the input vectors contain Inf/-Inf.	5 years ago
Martin Kroeker	3559c5d7a2	Merge pull request #3048 from martin-frbg/issue2998 Temporarily revert to the old NRM2 kernels for ThunderX2/3 and NeoverseN1	5 years ago
Martin Kroeker	8631e2976a	Temporarily revert to the old nrm2 kernels	5 years ago
Martin Kroeker	2768bc1764	Temporarily revert to the old nrm2 kernels	5 years ago
Martin Kroeker	6f4698ee1f	Temporarily revert to the old nrm2 kernel	5 years ago
Martin Kroeker	114eb159a4	Disable FMA intrinsics in the srot kernel when the compiler is PGI/NVIDIA	5 years ago
Martin Kroeker	005cce5507	Amend SkylakeX options to support the NVIDIA compiler	5 years ago
Martin Kroeker	c73d8ee40d	Conditionally add -mfma to compiler options where needed	5 years ago
Rajalakshmi Srinivasaraghavan	2fb11f873b	POWER10: Improve copy performance This patch aligns the stores to 32 byte boundary for scopy and dcopy before entering into vector pair loop. For ccopy, changed the store instructions to stxv to improve performance of unaligned cases.	5 years ago
Martin Kroeker	043128cbe5	Merge pull request #3029 from RajalakshmiSR/axpyp10 POWER10: Improve axpy performance	5 years ago
Martin Kroeker	3331ca492d	Merge pull request #3021 from austinpagan/trsm_p10 POWER: Added special unrolled vectorized versions of "Solve" for specific si…	5 years ago
Rajalakshmi Srinivasaraghavan	346e30a46a	POWER10: Improve axpy performance This patch aligns the stores to 32 byte boundary for saxpy and daxpy before entering into vector pair loop. For caxpy, changed the store instructions to stxv to improve performance of unaligned cases.	5 years ago
gxw	4b548857d6	Add msa support for loongson 1. Using core loongson3r3 and loongson3r4 for loongson 2. Add DYNAMIC_ARCH for loongson Change-Id: I1c6b54dbeca3a0cc31d1222af36a7e9bd6ab54c1	5 years ago
Martin Kroeker	7f11e33e8d	Merge pull request #3025 from TiredNotTear/develop MIPS: Fix two bugs	5 years ago
Martin Kroeker	53e0837809	Merge pull request #3022 from jinboson/develop Fix test errors reported by cblas_cgemm & cblas_ctrmm	5 years ago
Hao Chen	ad38bd0e89	Fix failed cgemv and zgemv test case after using msa optimization The cgemv and zgemv test case will call cgemv_n/t_msa.c zgemv_n/t_msa.c files in MIPS environment. When the macro CONJ is defined, the calculation result will be wrong due to the wrong definition of OP2. This patch updates the value of OP2 and passes the corresponding test.	5 years ago
Hao Chen	47b639cc9b	Fix failed sswap and dswap case by using msa optimization The swap test case will call sswap_msa.c and dswap_msa.c files in MIPS environment. When inc_x or inc_y is equal to zero, the calculation result of the two functions will be wrong. This patch adds the processing of inc_x or inc_y equal to zero, and the swap test case has passed.	5 years ago
Martin Kroeker	b660008c7e	Work around DOT and SWAP test failures	5 years ago
Martin Kroeker	f8346603cf	Fix compilation with SolarisStudio	5 years ago
Jin Bo	65de6f5957	Fix test errors reported by cblas_cgemm & cblas_ctrmm The file cgemm_kernel_8x4_msa.c holds the MSA optimization codes of cblas_cgemm and cblas_ctrmm. It defines two macros: CGEMM_SCALE_1X2 and CGEMM_TRMM_SCALE_1X2. The pc1 array index in the two macros should be 0 and 1.	5 years ago
Gordon Fossum	213c0e7abb	Added special unrolled vectorized versions of "Solve" for specific sizes, in DTRSM and STRSM, to improve performance in Power9 and Power10.	5 years ago
Martin Kroeker	441c08c9ff	Merge pull request #3016 from xiegengxin/complex-asum Improve the performance of zasum and casum with AVX512 intrinsic	5 years ago
Gengxin Xie	0cb7a403b2	fix error declare function blas_level1_thread_with_return_value	5 years ago
Gengxin Xie	b766c1e9bb	Improve the performance of zasum and casum with AVX512 intrinsic	5 years ago
Rajalakshmi Srinivasaraghavan	7d46e31de1	POWER10: Optimize dgemv_n Handling as 4x8 with vector pairs gives better performance than existing code in POWER10.	5 years ago
Martin Kroeker	f1bf040b25	Merge pull request #2988 from xiegengxin/smp-asum Improve the performance of dasum and sasum when SMP is defined	5 years ago
Xianyi Zhang	7037849498	Merge branch 'develop' into risc-v	5 years ago
Martin Kroeker	7e9cb39a25	Merge pull request #2981 from Qiyu8/fix-sum Fix sum optimize issues	5 years ago
Gengxin Xie	d6e7e05bb3	Improve the performance of dasum and sasum when SMP is defined	5 years ago
Qiyu8	ae0b1dea19	modify system.cmake to enable fma flag	5 years ago
Qiyu8	e0dac6b53b	fix the CI failure of target specific option mismatch	5 years ago
Qiyu8	e5c2ceb675	fix the CI failure of lack the head	5 years ago
Qiyu8	a87e537b8c	modify macro	5 years ago
Qiyu8	5bc0a7583f	only FMA3 and vector larger than 128 have positive effects.	5 years ago
Qiyu8	8c0b206d4c	Optimize the performance of rot by using universal intrinsics	5 years ago
Qiyu8	c4c591ac5a	fix sum optimize issues	5 years ago
Xianyi Zhang	fc35b72ae1	Refs #2899 Merge branch 'openblas-open-910' of git://github.com/damonyu1989/OpenBLAS into damonyu1989-openblas-open-910	5 years ago
Xianyi Zhang	913cc9a4ca	Merge branch 'develop' into risc-v	5 years ago
Martin Kroeker	ff16329cb7	Merge pull request #2972 from xiegengxin/rot-intrinsic Improve the performance of rot by using AVX512 and AVX2 intrinsic	5 years ago
Martin Kroeker	110c7a6de0	Merge pull request #2979 from RajalakshmiSR/dot_power10 Optimize sdot/ddot for POWER10	5 years ago
Rajalakshmi Srinivasaraghavan	6e364981a8	Optimize sdot/ddot for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores.	5 years ago
Martin Kroeker	b976a0bf40	Remove previous workaround for compiler flags related to cpu capabilities in x86_64 DYNAMIC_ARCH builds	5 years ago
Martin Kroeker	ff74319ea5	Merge pull request #2977 from martin-frbg/issue2976 Fix macro name used in ifdef for POWERPC/PGI	5 years ago
Martin Kroeker	28d2dfe2b3	Fix macro name used in ifdef	5 years ago
Gengxin Xie	725ffbf041	fix typo	5 years ago
Gengxin Xie	d9ba49165a	Improve the performance of rot by using AVX512 and AVX2 intrinsic	5 years ago
Rajalakshmi Srinivasaraghavan	dd7a9cc5bf	POWER10: Change dgemm unroll factors Changing the unroll factors for dgemm to 8 shows improved performance with POWER10 MMA feature. Also made some minor changes in sgemm for edge cases.	5 years ago
Rajalakshmi Srinivasaraghavan	b435491885	Optimize caxpy for POWER10 This patch makes use of new POWER10 vector pair instructions for loads and stores.	5 years ago
Chen, Guobing	a7b1f9b1bb	Implementation of BF16 based gemv 1. Add a new API -- sbgemv to support bfloat16 based gemv 2. Implement a generic kernel for sbgemv 3. Implement an avx512-bf16 based kernel for sbgemv Signed-off-by: Chen, Guobing <guobing.chen@intel.com>	5 years ago

1594 Commits (ac3e2a3fdd2f2e430ff7b6a58aeb8252afc935de)