When I originally refactored memory.c to reduce locking, I made the (incorrect) assumption that all threads were managed by OpenBLAS. The recent issues we've seen show that any caller can create its own threads and call into OpenBLAS; they don't all come from blas_server. Thus we have to support an arbitrary number of threads that can arrive at any time. The original implementation (before my changes) dealt with this by having a single allocation table, which every thread had to lock in order to read and update; that was expensive. Moving to thread-local allocation tables was much faster, but now we have to deal with the fact that thread-local storage might not be cleaned up.
This change gives each thread its own local allocation table and does away with the global table completely. We clean up allocations using the pthreads key destructor and, on Win32, DLL_THREAD_DETACH.
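For illustration, the pthreads half of that cleanup can be sketched as below. This is a minimal sketch, not the code from memory.c: `struct alloc_table`, `TABLE_SLOTS`, `get_table`, `worker`, `run_workers` and the `tables_freed` counter are all hypothetical names chosen for the example. It relies only on the POSIX guarantee that a key's destructor runs at thread exit when the thread's value for that key is non-NULL.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

#define TABLE_SLOTS 8 /* hypothetical table size, for illustration only */

/* Hypothetical per-thread allocation table. */
struct alloc_table {
    void *slot[TABLE_SLOTS];
};

static pthread_key_t table_key;
static pthread_once_t table_once = PTHREAD_ONCE_INIT;
static atomic_int tables_freed; /* demo-only counter of destructor runs */

/* Called by pthreads at thread exit when the key's value is non-NULL. */
static void table_destroy(void *p) {
    struct alloc_table *t = p;
    for (int i = 0; i < TABLE_SLOTS; i++)
        free(t->slot[i]); /* free(NULL) is a no-op for unused slots */
    free(t);
    atomic_fetch_add(&tables_freed, 1);
}

static void table_key_init(void) {
    int rc = pthread_key_create(&table_key, table_destroy);
    assert(rc == 0);
    (void)rc;
}

/* Returns the calling thread's table, creating it on first use. No lock is
 * needed because each thread only ever touches its own table. */
static struct alloc_table *get_table(void) {
    pthread_once(&table_once, table_key_init);
    struct alloc_table *t = pthread_getspecific(table_key);
    if (t == NULL) {
        t = calloc(1, sizeof *t);
        if (t != NULL)
            pthread_setspecific(table_key, t);
    }
    return t;
}

/* Stands in for a thread created by an arbitrary caller, not by blas_server. */
static void *worker(void *arg) {
    (void)arg;
    struct alloc_table *t = get_table();
    if (t != NULL)
        t->slot[0] = malloc(64);
    return NULL;
}

/* Starts up to 16 threads, joins them, and returns how many tables the
 * destructor released during this call. */
static int run_workers(int n) {
    pthread_t tid[16];
    int started = 0;
    int before = atomic_load(&tables_freed);
    if (n > 16)
        n = 16;
    for (int i = 0; i < n; i++)
        if (pthread_create(&tid[started], NULL, worker, NULL) == 0)
            started++;
    for (int i = 0; i < started; i++)
        pthread_join(tid[i], NULL);
    return atomic_load(&tables_freed) - before;
}
```

Calling `run_workers(4)` should report four freed tables, one per exited thread. On Win32 the analogous hook would be the DLL_THREAD_DETACH case in `DllMain`, which this sketch does not cover.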
This change also removes compiler TLS, which in the end wasn't worth it given the issues with the glibc implementation. The overall performance impact was < 1% anyway, and removing it also simplifies the code.
Support arbitrary numbers of threads for memory allocation.
written in C intrinsics for best readability.
(the same C code works for Haswell as well)
For logistical reasons the code falls back to the existing
Haswell AVX2 implementation if the GCC or LLVM compiler is not new enough.
Fixes two calls that used `fabs` on a `long double` argument rather than `fabsl`, which appears to truncate the value unintentionally to `double` precision.
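A small sketch of the effect, using hypothetical helper names that are not taken from the patched code. It assumes plain `<math.h>` (not `<tgmath.h>`, where `fabs` is type-generic) and only shows a difference on platforms where `long double` is wider than `double`.

```c
#include <assert.h>
#include <float.h>
#include <math.h>

/* The long double argument is converted to double before fabs() sees it. */
static long double abs_via_fabs(long double x) {
    return fabs(x);
}

/* fabsl() takes and returns long double, so no precision is dropped. */
static long double abs_via_fabsl(long double x) {
    return fabsl(x);
}

/* Returns 1 if the two helpers disagree for a value whose lowest mantissa
 * bit does not fit in a double, and 0 otherwise. */
static int fabs_loses_precision(void) {
    long double x = -(1.0L + LDBL_EPSILON);
    return abs_via_fabs(x) != abs_via_fabsl(x);
}
```

Where `long double` and `double` share a representation (for example with MSVC), the two helpers agree and the change makes no numerical difference.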
Since we now use an allocation size that isn't a multiple of PAGESIZE, the loop
that finds the pages for run_bench wasn't terminating properly. Now we detect
when we've found enough pages for the allocation and terminate the loop.
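One way such a loop can fail to terminate, and one way to bound it, is sketched below. This is an assumption about the general shape of the problem, not the actual run_bench code; `count_pages` and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Counts the pages needed to cover alloc_size bytes, one page per step.
 * A condition such as `covered != alloc_size` would never become false when
 * alloc_size is not a multiple of page_size, because `covered` steps past
 * it. Comparing with `<` stops as soon as enough pages have been found. */
static size_t count_pages(size_t alloc_size, size_t page_size) {
    size_t pages = 0;
    size_t covered = 0;
    assert(page_size > 0);
    while (covered < alloc_size) {
        covered += page_size;
        pages++;
    }
    return pages;
}
```

With a 4096-byte page, an 8192-byte allocation needs two pages and an 8193-byte allocation needs three; both cases terminate.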
_snprintf_s takes an additional (size) argument, so it is not a direct replacement for snprintf.
(Note that this code is currently unused - the two instances of snprintf here are within ifdef blocks that are not compiled for MSVC.)
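The difference in argument lists can be seen in a small wrapper like the one below. This is a sketch only: `FORMAT_INT` and `format_value` are hypothetical names, and the MSVC branch assumes the documented `_snprintf_s(buffer, sizeOfBuffer, count, format, ...)` form with `_TRUNCATE` as the count.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#ifdef _MSC_VER
/* MSVC secure variant: buffer size and maximum count are separate arguments. */
#define FORMAT_INT(buf, size, value) \
    _snprintf_s((buf), (size), _TRUNCATE, "%d", (value))
#else
/* C99 snprintf: a single size argument. */
#define FORMAT_INT(buf, size, value) \
    snprintf((buf), (size), "%d", (value))
#endif

/* Formats value into buf and returns the character count reported by the
 * underlying call. */
static int format_value(char *buf, size_t size, int value) {
    assert(buf != NULL && size > 0);
    return FORMAT_INT(buf, size, value);
}
```

The return values also differ on truncation: `snprintf` returns the length that would have been written, while `_snprintf_s` with `_TRUNCATE` returns -1, so callers that inspect the result need care as well.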