This patch has two benefits:
1. Using vfmaddsub231p[sd] instead of vaddsubp[sd] eliminates a vmulp[sd]
instruction, giving a ~10% speedup: measured throughput rises from ~33 to
~36 Gflops for cscal with 4096 elements and from ~17 to ~19 Gflops for
zscal on my Kaby Lake laptop. See e.g.
OPENBLAS_LOOPS=10000 benchmark/cscal.goto 4096 4096.
2. Using it for both the main loop and the tail ensures that the same FMA
instruction is used for all loop iterations. That is not the case today:
the tail loop is implemented in C, so it can end up using FMA instructions
(if the compiler is allowed to emit them) while the main loop does not.
This is important for some LAPACK eigenvalue test cases that rely on
bitwise identical results regardless of how many loop iterations are used.
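The effect described in point 2 can be reproduced with a scalar sketch. This is not the kernel itself: the function names, the block size of 4, and the out-of-place interface are all hypothetical, and the fused path is only emulated by doing the arithmetic in double (where the product of two floats is exact) rather than with a real FMA instruction, so it can differ from hardware FMA in rare double-rounding cases.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 4  /* hypothetical number of complex elements per main-loop pass */

/* One complex element y = alpha * x with separate multiply and add:
 * all four products are rounded to float before the add/subtract.
 * volatile keeps the compiler from contracting them into FMAs. */
void scale_unfused(float ar, float ai, const float *x, float *y)
{
    volatile float p0 = ar * x[0], p1 = ar * x[1];
    volatile float q0 = ai * x[1], q1 = ai * x[0];
    y[0] = p0 - q0;
    y[1] = p1 + q1;
}

/* Same element in the fmaddsub pattern: the ai products are rounded,
 * but ar * x is not rounded before the add/subtract. Emulated in
 * double, where the float-by-float product is exact. */
void scale_fused(float ar, float ai, const float *x, float *y)
{
    volatile float q0 = ai * x[1], q1 = ai * x[0];
    y[0] = (float)((double)ar * x[0] - q0);
    y[1] = (float)((double)ar * x[1] + q1);
}

/* Models the old situation: unfused main loop, tail that happens to fuse. */
void cscal_mixed(size_t n, float ar, float ai, const float *x, float *y)
{
    size_t i, n_main = n - n % BLOCK;

    assert(x != y);  /* this sketch is out-of-place only */
    for (i = 0; i < n_main; i++)
        scale_unfused(ar, ai, x + 2 * i, y + 2 * i);
    for (; i < n; i++)
        scale_fused(ar, ai, x + 2 * i, y + 2 * i);
}

/* Models the patched situation: the same arithmetic for every element. */
void cscal_uniform(size_t n, float ar, float ai, const float *x, float *y)
{
    size_t i;

    assert(x != y);
    for (i = 0; i < n; i++)
        scale_fused(ar, ai, x + 2 * i, y + 2 * i);
}
```

With alpha = (1 + 2^-13) + (1 + 2^-12)i and every x element equal to (1 + 2^-13) + 1i, the unfused real part is exactly 0 while the fused one is 2^-26. In cscal_mixed with n = 5, the fifth element therefore differs from the first four even though the inputs are identical, which is the kind of length-dependent result the LAPACK tests trip over.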
Benchmarks should allocate buffers with cache-line alignment (often 64
bytes) to avoid unreliable timings. The technique used here, storing the
offset in the byte before the returned pointer, does not require C11's
aligned_alloc, which keeps the benchmarks compatible with older compilers.
For example, glibc's x86_64 malloc returns 16-byte-aligned buffers, which is
not sufficient for AVX/AVX2 (32-byte alignment preferred) or AVX512 (64-byte).
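A minimal sketch of the offset-byte technique follows. The names bench_malloc, bench_free and BENCH_ALIGN are hypothetical and not taken from the benchmark sources; the sketch assumes the alignment fits in one byte (at most 255) and uses uintptr_t from C99's stdint.h.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define BENCH_ALIGN 64u  /* assumed cache-line size; must fit in one byte */

void *bench_malloc(size_t size)
{
    unsigned char *raw, *aligned;
    size_t offset;

    if (size > (size_t)-1 - BENCH_ALIGN)
        return NULL;                 /* size + BENCH_ALIGN would overflow */
    raw = malloc(size + BENCH_ALIGN);
    if (raw == NULL)
        return NULL;
    /* offset is in 1..BENCH_ALIGN, so aligned[-1] is always inside the block */
    offset = BENCH_ALIGN - (size_t)((uintptr_t)raw % BENCH_ALIGN);
    aligned = raw + offset;
    aligned[-1] = (unsigned char)offset;  /* remember how far we moved */
    return aligned;
}

void bench_free(void *ptr)
{
    unsigned char *aligned = ptr;

    if (aligned == NULL)
        return;
    assert(aligned[-1] >= 1 && aligned[-1] <= BENCH_ALIGN);
    free(aligned - aligned[-1]);     /* step back to the pointer malloc returned */
}
```

Over-allocating by a full BENCH_ALIGN bytes, rather than BENCH_ALIGN - 1, guarantees at least one spare byte in front of the aligned pointer even when malloc already returns an aligned block. Pointers from bench_malloc must be released with bench_free, never with free directly.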
This allows Julia to set a default number of threads (usually `1`) that is
used when no other thread count is specified [0]. Doing so short-circuits
the default OpenBLAS thread initialization routine, which spins up a
different number of threads than Julia would otherwise choose.
The reason to add a new environment variable is that we want to be able
to configure OpenBLAS to avoid performing its initial memory
allocation/thread startup, as that can consume significant amounts of
memory, but we still want to be sensitive to legacy codebases that set
things like `OMP_NUM_THREADS` or `GOTOBLAS_NUM_THREADS`. Creating a new
environment variable that is OpenBLAS-specific, and that is not already
publicly used to control the overall number of threads of programs like
Julia, seems to be the best way forward.
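The intended precedence can be sketched as below. This is an illustration, not the code in this patch: the function names are hypothetical, the text above does not name the new variable (`OPENBLAS_DEFAULT_NUM_THREADS` is assumed here), and the order in which the legacy variables are consulted is also an assumption.

```c
#include <assert.h>
#include <stdlib.h>

/* Parse a thread count; return 0 when unset or not a positive integer. */
int parse_thread_count(const char *s)
{
    char *end;
    long v;

    if (s == NULL || *s == '\0')
        return 0;
    v = strtol(s, &end, 10);
    if (*end != '\0' || v <= 0 || v > 4096)  /* 4096 is an arbitrary sanity cap */
        return 0;
    return (int)v;
}

/* Explicit settings win, in the order given. The default value is only
 * consulted when none of them is usable, and fallback is whatever the
 * library would pick on its own (e.g. the core count). */
int choose_thread_count(const char *const *explicit_vals, size_t n,
                        const char *default_val, int fallback)
{
    size_t i;
    int v;

    assert(fallback > 0);
    for (i = 0; i < n; i++) {
        v = parse_thread_count(explicit_vals[i]);
        if (v > 0)
            return v;
    }
    v = parse_thread_count(default_val);
    return v > 0 ? v : fallback;
}

/* Thin wrapper that reads the process environment. */
int thread_count_from_env(int fallback)
{
    const char *explicit_vals[2];

    explicit_vals[0] = getenv("GOTOBLAS_NUM_THREADS");  /* as named above */
    explicit_vals[1] = getenv("OMP_NUM_THREADS");
    return choose_thread_count(explicit_vals, 2,
                               getenv("OPENBLAS_DEFAULT_NUM_THREADS"), /* assumed name */
                               fallback);
}
```

A host program that sets only the default variable to `1` gets one thread, while a user who exports `OMP_NUM_THREADS=4` still gets four, so legacy setups keep working.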
[0] https://github.com/JuliaLang/julia/pull/46844