It is possible to build a program that calls a non-GEMM OpenBLAS routine from
a static initializer. The order of initialization across translation units is
unspecified, and even less predictable when one TU uses
__attribute__((constructor)) and another uses a C++ static initializer, so it
can happen (and unfortunately does) that gotoblas_init has not run before the
first BLAS routine is called. The result is a segfault when that routine tries
to index into the gotoblas table.

The solution I have here is indirection: rather than using the table directly,
go through an inlined function that first checks whether the table has been
initialized. Since that check fails at most once, branch prediction should
keep the common path fast.
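
The indirection described above can be sketched roughly as follows. This is a minimal illustration, not the actual OpenBLAS code: kernel_table_t, table_init, get_table and my_ddot are hypothetical names standing in for the gotoblas table, gotoblas_init and a BLAS entry point, and the sketch makes no attempt at thread safety.

```c
#include <stddef.h>

/* Hypothetical stand-in for the per-CPU kernel dispatch table. */
typedef struct {
    double (*ddot_k)(size_t n, const double *x, const double *y);
} kernel_table_t;

/* Plain reference kernel used to populate the table. */
static double ddot_generic(size_t n, const double *x, const double *y)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

static kernel_table_t generic_table = { ddot_generic };

/* NULL until initialization has run, whoever triggers it first. */
static kernel_table_t *table = NULL;

/* Counts initializations, only so the sketch can be checked. */
static int init_calls = 0;

/* Hypothetical stand-in for the real initialization routine. */
static void table_init(void)
{
    init_calls++;
    table = &generic_table;
}

/* Accessor used instead of touching `table` directly. The branch is
 * taken only when the table has not been set up yet. */
static inline kernel_table_t *get_table(void)
{
    if (table == NULL)
        table_init();
    return table;
}

/* A routine that is now safe to call from a static initializer. */
double my_ddot(size_t n, const double *x, const double *y)
{
    return get_table()->ddot_k(n, x, y);
}
```

Every call site goes through get_table(), so a call that arrives before the constructor has run initializes the table itself instead of dereferencing a null pointer.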

The Ximatcopy functions create a copy of the input matrix even though they
appear to work in place. The new XIMATCOPY_K_YY routines perform the
operation in place when the leading dimension does not change.
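
To illustrate the idea only, here is a rough sketch of the no-transpose, column-major case. The function name imatcopy_notrans and its signature are assumptions made for this example and are not the XIMATCOPY_K_YY kernels themselves. The caller is assumed to provide a buffer large enough for both layouts.

```c
#include <stddef.h>
#include <stdlib.h>

/* Scale a column-major rows x cols matrix held in `a` by alpha, moving it
 * from leading dimension lda to ldb. Returns 0 on success and -1 if the
 * temporary buffer cannot be allocated. */
int imatcopy_notrans(size_t rows, size_t cols, double alpha,
                     double *a, size_t lda, size_t ldb)
{
    if (rows == 0 || cols == 0)
        return 0;

    if (lda == ldb) {
        /* Same layout before and after: scale each element where it is. */
        for (size_t j = 0; j < cols; j++)
            for (size_t i = 0; i < rows; i++)
                a[j * lda + i] *= alpha;
        return 0;
    }

    /* The layout changes, so stage the result in a temporary copy. */
    double *tmp = malloc(rows * cols * sizeof *tmp);
    if (tmp == NULL)
        return -1;

    for (size_t j = 0; j < cols; j++)
        for (size_t i = 0; i < rows; i++)
            tmp[j * rows + i] = alpha * a[j * lda + i];

    for (size_t j = 0; j < cols; j++)
        for (size_t i = 0; i < rows; i++)
            a[j * ldb + i] = tmp[j * rows + i];

    free(tmp);
    return 0;
}
```

Only the branch where the leading dimension changes needs the allocation and the extra pass over the data.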