This patch has two benefits:
1. Using vfmaddsub231p[sd] instead of vaddsubp[sd] eliminates a vmulp[sd]
instruction, giving a ~10% speedup: from ~33 to ~36 Gflops for cscal with
4096 elements and from ~17 to ~19 Gflops for zscal on my Kaby Lake laptop
(see e.g. OPENBLAS_LOOPS=10000 benchmark/cscal.goto 4096 4096).
2. Using it for both the main loop and the tail end ensures that the same FMA
instruction is used for all loop iterations. That is currently not the case:
the tail loop is implemented in C, where the compiler may emit FMA
instructions if it is allowed to, while the main loop does not use them.
This matters for some LAPACK eigenvalue test cases that rely on bitwise
identical results regardless of how many loop iterations are used.