It is possible to build a program that calls a non-GEMM OpenBLAS routine from
a static initializer. The initialization order across translation units is
unspecified, even more so when one TU uses __attribute__((constructor)) and
another uses a C++ static initializer, so it can happen (and unfortunately
does) that gotoblas_init has not run before the first BLAS routine is called.
The result is a segfault when indexing into the gotoblas table.
The solution I have here is indirection: rather than using the table directly,
go through an inlined function that first checks whether it has been
initialized. Since the table is uninitialized only on the first call, branch
prediction should hopefully keep things fast.
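As a rough illustration of the indirection described above, here is a minimal C sketch. All names in it (kernel_table_t, active_table, table_init, get_table, scale_vector) are hypothetical stand-ins and not the actual OpenBLAS symbols, and the sketch makes no attempt at thread safety.

```c
#include <stddef.h>

/* Hypothetical stand-in for the real dispatch table. */
typedef struct {
    int (*scal_kernel)(int n, float alpha, float *x);
} kernel_table_t;

/* A trivial "kernel": scales x in place and returns the element count. */
static int generic_scal(int n, float alpha, float *x) {
    for (int i = 0; i < n; i++)
        x[i] *= alpha;
    return n;
}

static kernel_table_t generic_table = { generic_scal };

/* NULL until initialization has run, e.g. when a static initializer in
 * another TU calls in before the library constructor has executed. */
static kernel_table_t *active_table = NULL;
static int init_calls = 0;  /* only here to show that init runs once */

static void table_init(void) {
    active_table = &generic_table;
    init_calls++;
}

/* Every user goes through this accessor instead of touching the table. */
static inline kernel_table_t *get_table(void) {
    if (active_table == NULL)  /* taken only while uninitialized */
        table_init();
    return active_table;
}

/* Example of a routine that is safe to call before any constructor ran. */
int scale_vector(int n, float alpha, float *x) {
    return get_table()->scal_kernel(n, alpha, x);
}
```

After the first call the NULL check always takes the same direction, which is why it should cost very little on the hot path.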
This patch adds the basic infrastructure for the SkylakeX (Intel Skylake server)
target. The SkylakeX target will use the AVX512 (AVX512VL level) instruction set,
which brings two basic things:
1) 512 bit wide SIMD (2x width of AVX2)
2) 32 SIMD registers (2x the number on AVX2)
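To put numbers on the two points above, the following small C sketch works out how many single-precision values fit in one register and in the whole register file. The constants assume 64-bit mode (16 registers for AVX2, 32 for AVX512) and 32-bit floats; the helper names are made up for illustration.

```c
/* Assumed sizes: 64-bit mode, single precision (as in SGEMM). */
enum {
    FLOAT_BITS  = 32,
    AVX2_BITS   = 256, AVX2_REGS   = 16,
    AVX512_BITS = 512, AVX512_REGS = 32
};

/* Number of floats held by one SIMD register of the given width. */
static int lanes_per_register(int simd_bits) {
    return simd_bits / FLOAT_BITS;
}

/* Number of floats held by the whole SIMD register file. */
static int register_file_floats(int simd_bits, int num_regs) {
    return lanes_per_register(simd_bits) * num_regs;
}
```

Under these assumptions a register holds 16 floats instead of 8, and the register file holds 512 floats instead of 128, which is the headroom a retuned kernel could use for larger register blocks.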
This initial patch only contains a trivial transformation of the Haswell SGEMM kernel
to AVX512VL; more will follow, but this patch aims to get the infrastructure
in place for that later work.
Full performance tuning has not been done yet; with more registers and wider SIMD
it's in theory possible to retune the kernels but even without that there's an
interesting enough performance increase (30-40% range) with just this change.
By default, OpenBLAS doesn't output the warning message. You can set
OPENBLAS_VERBOSE (e.g. export OPENBLAS_VERBOSE=1) to enable the warning
message at runtime.
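A minimal sketch of how such an environment-controlled warning could be gated is shown below. Only the variable name OPENBLAS_VERBOSE comes from the text above; the helper functions, the parsing rules and the message prefix are assumptions made for illustration and do not reproduce the actual OpenBLAS code.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: turn the raw variable value into a level.
 * Unset, non-numeric, or negative values are treated as 0 (quiet). */
static int parse_verbose(const char *value) {
    int level = value ? atoi(value) : 0;
    return level < 0 ? 0 : level;
}

/* Hypothetical helper: read the level from the environment. */
static int verbose_level(void) {
    return parse_verbose(getenv("OPENBLAS_VERBOSE"));
}

/* Hypothetical helper: print the warning only when verbosity is enabled.
 * Returns 1 if the message was printed, 0 if it was suppressed. */
static int warn_if_verbose(int level, const char *msg) {
    if (level < 1)
        return 0;
    fprintf(stderr, "OpenBLAS warning: %s\n", msg);
    return 1;
}
```

A caller would write something like warn_if_verbose(verbose_level(), "..."), so the default (variable unset) stays silent.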
The present patch verifies that, on machines declaring an Athlon CPU model and
family, the 3dnow and 3dnowext feature flags are indeed present. If they are
not, it falls back to the most generic x86 kernel. This prevents crashes due to
illegal instructions on qemu guests with a weird configuration.
Closes #272
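The feature check described in this message can be sketched roughly as follows. On AMD CPUs, 3DNow! and extended 3DNow! are reported in EDX of CPUID leaf 0x80000001 (bits 31 and 30). To keep the sketch portable, the EDX value is passed in as a parameter instead of being read with a cpuid instruction; the enum and function names are hypothetical and not taken from the patch.

```c
#include <stdint.h>

/* EDX bits of CPUID leaf 0x80000001 on AMD processors. */
#define FEAT_3DNOWEXT (UINT32_C(1) << 30)
#define FEAT_3DNOW    (UINT32_C(1) << 31)

/* Hypothetical result type: which kernel set to dispatch to. */
typedef enum { KERNEL_GENERIC, KERNEL_ATHLON } kernel_choice_t;

/* For a CPU that already reports an Athlon model and family, pick the
 * Athlon kernels only if both feature flags are really present. */
static kernel_choice_t choose_athlon_kernel(uint32_t ext_edx) {
    const uint32_t needed = FEAT_3DNOW | FEAT_3DNOWEXT;
    return (ext_edx & needed) == needed ? KERNEL_ATHLON : KERNEL_GENERIC;
}
```

A guest that advertises an Athlon model but masks either flag would then get the generic kernels and avoid the illegal-instruction crash.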