User User-User
39ef0880ae
copy conf
4 years ago
Gilles Gouaillardet
9d292d37b2
arm64: add the missing d9 register to the clobber list
Refs. numpy/numpy#18422
Signed-off-by: Gilles Gouaillardet <gilles@rist.or.jp>
4 years ago
CodesWithWolves
d2bda3b56a
Remove Unnecessary/Erroneous Reads In sgemm_tcopy_16.S COPY1x8 Macro
There appears to have been some code leak when copying from the COPY2x8
macro above where we're reading 8 bytes into d4-d7 directly after
reading 4 bytes into s4-s7. These 32 bytes in d4-7 are unused and can
possibly overrun the boundary of allocated memory -- Valgrind detected
this which is what dragged my attention to it for a 128,1 copy.
Additionally, there is no need to update the addresses stored in A0-A7
as the only possible paths after running this macro will overwrite A0-7
if looping to the next 8 rows, or overwrite A0-3 if moving to 4 rows --
in which case A4-7 are unused.
4 years ago
Martin Kroeker
b716c0ef01
Add workaround for NVIDIA HPC
5 years ago
Martin Kroeker
2efa3b70dc
Add workaround for NVIDIA HPC
5 years ago
Martin Kroeker
49959d4f1c
Add workaround for NVIDIA HPC
5 years ago
Martin Kroeker
0f27a03607
Add workaround for NVIDIA HPC mishandling of the asm DOT kernels
5 years ago
Martin Kroeker
c2a8ebfe69
Add workaround for NVIDIA HPC mishandling of the asm DOT kernels
5 years ago
Ashwin Sekhar T K
1b2508362b
arm64: Fix nrm2 for input vectors with Inf
Fix double precision nrm2 kernels returning NaN when the
input vectors contain Inf/-Inf.
5 years ago
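The failure mode described in this commit can be illustrated with a small scalar model. This is a sketch under assumptions, not the arm64 kernel: `nrm2_scaled` and `nrm2_guarded` are hypothetical names, and the textbook scaled sum-of-squares loop is assumed only because it shows one way an Inf input can turn into NaN (through `inf / inf`).

```python
import math

def nrm2_scaled(x):
    """Textbook scaled 2-norm without an Inf guard (hypothetical model).

    Assumes the input contains no NaN values.
    """
    scale, ssq = 0.0, 1.0
    for v in x:
        a = abs(v)
        if a == 0.0:
            continue
        if scale < a:
            # Rescale the running sum of squares to the new maximum.
            ssq = 1.0 + ssq * (scale / a) ** 2
            scale = a
        else:
            # When both a and scale are inf, inf / inf evaluates to NaN.
            ssq += (a / scale) ** 2
    return scale * math.sqrt(ssq)

def nrm2_guarded(x):
    """Same norm, with NaN and Inf inputs handled before the scaled loop."""
    if any(math.isnan(v) for v in x):
        return math.nan
    if any(math.isinf(v) for v in x):
        return math.inf
    return nrm2_scaled(x)
```

In this model, a vector holding two infinite entries makes the unguarded loop return NaN, while the guarded version returns Inf, which is the result the commit message describes as correct.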
Martin Kroeker
8631e2976a
Temporarily revert to the old nrm2 kernels
5 years ago
Martin Kroeker
2768bc1764
Temporarily revert to the old nrm2 kernels
5 years ago
Martin Kroeker
6f4698ee1f
Temporarily revert to the old nrm2 kernel
5 years ago
Martin Kroeker
e1b7123bbe
Merge pull request #2867 from Qiyu8/usimd-floatdot
Optimize the performance of dot by using universal intrinsics in X86/ARM
5 years ago
User User-User
d2333e7842
aarch64 fix std=c18 compilation
5 years ago
Qiyu8
60e6c68e38
Adapt to ARM architecture
5 years ago
Martin Kroeker
775a87242d
Rename KERNEL.SILICON to KERNEL.VORTEX
5 years ago
Martin Kroeker
80794fe8fd
Create KERNEL.SILICON
5 years ago
Ashwin Sekhar T K
4e1be0e481
ARM64: Add THUNDERX3T110 Target
5 years ago
ZhangDanfeng
bc6fd20a40
fix INIT8x4
Signed-off-by: ZhangDanfeng <467688405@qq.com>
5 years ago
ZhangDanfeng
9b7877ccf1
sgemm copy source init
Signed-off-by: ZhangDanfeng <467688405@qq.com>
5 years ago
ZhangDanfeng
f82fa802d1
Insert prefetch
Signed-off-by: ZhangDanfeng <467688405@qq.com>
5 years ago
张丹枫
9df79ae9a3
update sgemm and strmm kernel selecting strategy
5 years ago
张丹枫
a1fc6041cd
use general register to speedup
5 years ago
张丹枫
edb423d772
align general register usage with strmm_kernel_8x8
5 years ago
zhangdanfeng
0e6eb8c247
sgemm kernel use sgemm_kernel_8x8_cortexa53
Signed-off-by: zhangdanfeng <zhangdanfeng@cloudwalk.cn>
5 years ago
zhangdanfeng
d475db29c6
optimized for cortex-a53
Signed-off-by: zhangdanfeng <zhangdanfeng@cloudwalk.cn>
5 years ago
Ashwin Sekhar T K
8353cb245a
ARM64: Improve DAXPY for ThunderX2
Improve performance of DAXPY for ThunderX2
when the vector fits in L1 Cache.
5 years ago
Martin Kroeker
144be81ca1
fix initialization to zero in the NEON SGEMM_BETA kernel as well
5 years ago
Martin Kroeker
07cdd5d05c
Fix zero initialization for beta=0 case
use immediate initialization instead of multiplication in case register content is a NaN
5 years ago
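Why multiplication cannot be relied on to produce zeros can be shown with a scalar model. This is a sketch, not the kernel code: `scale_by_multiplication` and `scale_with_zero_branch` are hypothetical names, and a Python list stands in for the register or memory contents being scaled by beta.

```python
import math

def scale_by_multiplication(c, beta):
    """Scale every element by beta, including when beta is zero."""
    # Under IEEE 754, 0.0 * NaN is NaN and 0.0 * inf is NaN,
    # so stale non-finite content survives the "zeroing".
    return [beta * v for v in c]

def scale_with_zero_branch(c, beta):
    """Write literal zeros when beta is zero, otherwise scale."""
    if beta == 0.0:
        # Immediate initialization: the old content is never read.
        return [0.0] * len(c)
    return [beta * v for v in c]
```

With `c = [1.0, math.nan, math.inf]` and `beta = 0.0`, the first function leaves NaN in the last two slots, while the second returns all zeros.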
s00548429
bec7923a0d
Fix the functional bugs for zamax.
5 years ago
Ali Saidi
c623a965f9
Add Neoverse-N1 core
The implementation is a hybrid of the ARMV8 one with some of the
improved TX2 routines along with specifying -march=v8.2-a
6 years ago
Martin Kroeker
e57b11acca
Add preliminary support for EMAG8180
6 years ago
Martin Kroeker
456ee2e1f0
Merge pull request #2357 from chenxuqiang/dgemm_beta_zero
kernel/arm64/dgemm_beta.S: add beta == zero branch
6 years ago
shengyang
80db5f11e1
update
6 years ago
chenxuqiang
52de4cc8fd
kernel/arm64/dgemm_beta.S: add beta == zero branch
added beta == zero branch, and no need to load C matrix.
Signed by: Xuqiang Chen <chenxuqiang3@hisilicon.com>
6 years ago
Martin Kroeker
44028581cc
Merge pull request #2355 from Zeyiii/dev-zeyi2
Use arm neon instructions to optimize sgemm_beta operation
6 years ago
Martin Kroeker
86ab939936
Merge pull request #2354 from ZuoQ3/develop
[WIP] Use arm neon instructions to optimize tcopy operation
6 years ago
shengyang
8d84403205
Use arm neon instructions to optimize ncopy operation
modified: KERNEL.ARMV8
modified: KERNEL.TSV110
new file: sgemm_ncopy_4.S
6 years ago
w00421467
0833a4846a
Use arm neon instructions to optimize sgemm_beta operation
6 years ago
zq
50f7fc1401
[WIP] Use arm neon instructions to optimize tcopy operation
6 years ago
w00421467
3ccf8885ac
prefetching for dgemm_beta
6 years ago
w00421467
b7cc69ee62
declare DGEMM_BETA in KERNEL.ARMV8 rather than the generic KERNEL
6 years ago
w00421467
aeef942c4f
use arm neon instructions to optimize gemm beta operation
6 years ago
Martin Kroeker
85ccdce8c4
Remove the IOS fallbacks to generic C kernels
6 years ago
Martin Kroeker
a448884a63
Remove automatic label postfixes from macro included only once
6 years ago
Martin Kroeker
3a2df19db6
Fix accidental duplication of jump instruction
6 years ago
Martin Kroeker
56837e9d92
Make local labels in macro compatible with the xcode assembler
... which does not perform the automatic numbering on instantiation that the _@ suffix signifies
6 years ago
Martin Kroeker
3e3ccb9011
Add ARM64 implementations of ?sum
as trivial copies of the respective ?asum kernels with the fabs calls removed
6 years ago
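The relationship between the two kernel families can be sketched for the real-valued case. This is an illustrative model only: `asum` and `plain_sum` are hypothetical Python names, and the strided loop is an assumption about the usual BLAS calling convention (`n` elements, stride `incx`), not a transcription of the assembly.

```python
def asum(n, x, incx=1):
    """Sum of absolute values of n elements of x with stride incx."""
    total = 0.0
    for i in range(n):
        total += abs(x[i * incx])
    return total

def plain_sum(n, x, incx=1):
    """Same loop as asum with the absolute value removed."""
    total = 0.0
    for i in range(n):
        total += x[i * incx]
    return total
```

For `x = [1.0, -2.0, 3.0]`, `asum(3, x)` gives 6.0 and `plain_sum(3, x)` gives 2.0.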
maomao194313
783ba8058f
HiSilicon tsv110 CPUs optimization branch
add HiSilicon tsv110 CPUs optimization branch
7 years ago
Martin Kroeker
7639f2e1f0
Rewrite the conditional for OSX to fix cmake parsing on others
The Makefile variable parser in utils.cmake currently does not handle conditionals. Having the definitions for non-OSX last will at least make cmake builds work again on non-OSX platforms.
7 years ago