OpenBLAS

Commit Graph

Author	SHA1	Message	Date
Martin Kroeker	47bf0dba8f	Add build-time option for OMP scheduler; document MULTITHREAD_THRESHOLD range (#1620) * Allow choosing the OpenMP scheduler and add range hint for GEMM_MULTITHREAD_THRESHOLD * Amended description of GEMM_MULTITHREAD_THRESHOLD to reflect #742 making it track floating point operations rather than matrix size	8 years ago
Craig Donner	bf40f806ef	Remove the need for most locking in memory.c.

Using thread local storage for tracking memory allocations means that threads no longer have to lock at all when doing memory allocations / frees. This particularly helps the gemm driver since it does an allocation per invocation. Even without threading at all, this helps, since even calling a lock with no contention has a cost:

Before this change, no threading:
```
----------------------------------------------------
Benchmark             Time           CPU Iterations
----------------------------------------------------
BM_SGEMM/4          102 ns        102 ns   13504412
BM_SGEMM/6          175 ns        175 ns    7997580
BM_SGEMM/8          205 ns        205 ns    6842073
BM_SGEMM/10         266 ns        266 ns    5294919
BM_SGEMM/16         478 ns        478 ns    2963441
BM_SGEMM/20         690 ns        690 ns    2144755
BM_SGEMM/32        1906 ns       1906 ns     716981
BM_SGEMM/40        2983 ns       2983 ns     473218
BM_SGEMM/64        9421 ns       9422 ns     148450
BM_SGEMM/72       12630 ns      12631 ns     112105
BM_SGEMM/80       15845 ns      15846 ns      89118
BM_SGEMM/90       25675 ns      25676 ns      54332
BM_SGEMM/100      29864 ns      29865 ns      47120
BM_SGEMM/112      37841 ns      37842 ns      36717
BM_SGEMM/128      56531 ns      56532 ns      25361
BM_SGEMM/140      75886 ns      75888 ns      18143
BM_SGEMM/150      98493 ns      98496 ns      14299
BM_SGEMM/160     102620 ns     102622 ns      13381
BM_SGEMM/170     135169 ns     135173 ns      10231
BM_SGEMM/180     146170 ns     146172 ns       9535
BM_SGEMM/189     190226 ns     190231 ns       7397
BM_SGEMM/200     194513 ns     194519 ns       7210
BM_SGEMM/256     396561 ns     396573 ns       3531
```

With this change:
```
----------------------------------------------------
Benchmark             Time           CPU Iterations
----------------------------------------------------
BM_SGEMM/4           95 ns         95 ns   14500387
BM_SGEMM/6          166 ns        166 ns    8381763
BM_SGEMM/8          196 ns        196 ns    7277044
BM_SGEMM/10         256 ns        256 ns    5515721
BM_SGEMM/16         463 ns        463 ns    3025197
BM_SGEMM/20         636 ns        636 ns    2070213
BM_SGEMM/32        1885 ns       1885 ns     739444
BM_SGEMM/40        2969 ns       2969 ns     472152
BM_SGEMM/64        9371 ns       9372 ns     148932
BM_SGEMM/72       12431 ns      12431 ns     112919
BM_SGEMM/80       15615 ns      15616 ns      89978
BM_SGEMM/90       25397 ns      25398 ns      55041
BM_SGEMM/100      29445 ns      29446 ns      47540
BM_SGEMM/112      37530 ns      37531 ns      37286
BM_SGEMM/128      55373 ns      55375 ns      25277
BM_SGEMM/140      76241 ns      76241 ns      18259
BM_SGEMM/150     102196 ns     102200 ns      13736
BM_SGEMM/160     101521 ns     101525 ns      13556
BM_SGEMM/170     136182 ns     136184 ns      10567
BM_SGEMM/180     146861 ns     146864 ns       9035
BM_SGEMM/189     192632 ns     192632 ns       7231
BM_SGEMM/200     198547 ns     198555 ns       6995
BM_SGEMM/256     392316 ns     392330 ns       3539
```

Before, when built with USE_THREAD=1, GEMM_MULTITHREAD_THRESHOLD = 4, the cost of small matrix operations was overshadowed by thread locking (look at sizes smaller than 32) even when not explicitly spawning threads:
```
----------------------------------------------------
Benchmark             Time           CPU Iterations
----------------------------------------------------
BM_SGEMM/4          328 ns        328 ns    4170562
BM_SGEMM/6          396 ns        396 ns    3536400
BM_SGEMM/8          418 ns        418 ns    3330102
BM_SGEMM/10         491 ns        491 ns    2863047
BM_SGEMM/16         710 ns        710 ns    2028314
BM_SGEMM/20         871 ns        871 ns    1581546
BM_SGEMM/32        2132 ns       2132 ns     657089
BM_SGEMM/40        3197 ns       3196 ns     437969
BM_SGEMM/64        9645 ns       9645 ns     144987
BM_SGEMM/72       35064 ns      32881 ns      50264
BM_SGEMM/80       37661 ns      35787 ns      42080
BM_SGEMM/90       36507 ns      36077 ns      40091
BM_SGEMM/100      32513 ns      31850 ns      48607
BM_SGEMM/112      41742 ns      41207 ns      37273
BM_SGEMM/128      67211 ns      65095 ns      21933
BM_SGEMM/140      68263 ns      67943 ns      19245
BM_SGEMM/150     121854 ns     115439 ns      10660
BM_SGEMM/160     116826 ns     115539 ns      10000
BM_SGEMM/170     126566 ns     122798 ns      11960
BM_SGEMM/180     130088 ns     127292 ns      11503
BM_SGEMM/189     120309 ns     116634 ns      13162
BM_SGEMM/200     114559 ns     110993 ns      10000
BM_SGEMM/256     217063 ns     207806 ns       6417
```

and after, it's gone (note this includes my other change which reduces calls to num_cpu_avail):
```
----------------------------------------------------
Benchmark             Time           CPU Iterations
----------------------------------------------------
BM_SGEMM/4           95 ns         95 ns   12347650
BM_SGEMM/6          166 ns        166 ns    8259683
BM_SGEMM/8          193 ns        193 ns    7162210
BM_SGEMM/10         258 ns        258 ns    5415657
BM_SGEMM/16         471 ns        471 ns    2981009
BM_SGEMM/20         666 ns        666 ns    2148002
BM_SGEMM/32        1903 ns       1903 ns     738245
BM_SGEMM/40        2969 ns       2969 ns     473239
BM_SGEMM/64        9440 ns       9440 ns     148442
BM_SGEMM/72       37239 ns      33330 ns      46813
BM_SGEMM/80       57350 ns      55949 ns      32251
BM_SGEMM/90       36275 ns      36249 ns      42259
BM_SGEMM/100      31111 ns      31008 ns      45270
BM_SGEMM/112      43782 ns      40912 ns      34749
BM_SGEMM/128      67375 ns      64406 ns      22443
BM_SGEMM/140      76389 ns      67003 ns      21430
BM_SGEMM/150      72952 ns      71830 ns      19793
BM_SGEMM/160      97039 ns      96858 ns      11498
BM_SGEMM/170     123272 ns     122007 ns      11855
BM_SGEMM/180     126828 ns     126505 ns      11567
BM_SGEMM/189     115179 ns     114665 ns      11044
BM_SGEMM/200      89289 ns      87259 ns      16147
BM_SGEMM/256     226252 ns     222677 ns       7375
```

I've also tested this with ThreadSanitizer and found no data races during execution. I'm not sure why 200 is always faster than its neighbors; we must be hitting some optimal cache size or something.	8 years ago
Martin Kroeker	63f7395fb4	Move some DYNAMIC_ARCH targets to new DYNAMIC_OLDER option	8 years ago
Martin Kroeker	38ad05bd04	Extend loop range to find SkylakeX in force_coretype	8 years ago
Martin Kroeker	8be027e4c6	Update dynamic.c	8 years ago
Martin Kroeker	ac7b6e3e9a	Fix misplaced endif	8 years ago
Martin Kroeker	ef626c6824	typo fix	8 years ago
Martin Kroeker	5a51cf4576	Separate Skylake X from Skylake	8 years ago
Arjan van de Ven	99c7bba8e4	Initial support for SkylakeX / AVX512 This patch adds the basic infrastructure for adding the SkylakeX (Intel Skylake server) target. The SkylakeX target will use the AVX512 (AVX512VL level) instruction set, which brings 2 basic things: 1) 512 bit wide SIMD (2x width of AVX2) 2) 32 SIMD registers (2x the number on AVX2) This initial patch only contains a trivial transformation of the Haswell SGEMM kernel to AVX512VL; more will follow later but this patch aims to get the infrastructure in place for this "later". Full performance tuning has not been done yet; with more registers and wider SIMD it's in theory possible to retune the kernels but even without that there's an interesting enough performance increase (30-40% range) with just this change.	8 years ago
Matthew Brett	a8002e283a	Revert "take out unused variables" This reverts commit `e5752ff9b3`. The variables i and n are used in the `#if !__GLIBC_PREREQ(2, 7)` branch. Closes gh-1586.	8 years ago
Martin Kroeker	a91f1587b9	Work around name clash with Windows10's winnt.h fixes #1503	8 years ago
Martin Kroeker	191746c493	Merge pull request #1557 from martin-frbg/getconfig Add threading and OpenMP information to output	8 years ago
Martin Kroeker	41ae8e8d67	Add threading and OpenMP information to output For #1416 and #1529, more information about the options OpenBLAS was built with is needed. Additionally we may want to add this data to the openblas.pc file (but not all projects use pkgconfig, and as far as I am aware the cmake module for accessing it does not make such "private" declarations available)	8 years ago
zhiyong.dang	53457f222f	move _Atomic define to common.h	8 years ago
Zhiyong Dang	3716267124	Change _STDC_VERSION__ to __STDC_VERSION__ Change-Id: Id3fa4e8d9eedd4ef7230df69b611e7f397301a42	8 years ago
Zhang Xianyi	50acc40613	Merge pull request #1536 from WestAlgo/develop Fix race condition in blas_server_omp.c	8 years ago
Martin Kroeker	802cf6b22d	Merge pull request #1486 from martin-frbg/atomic Use _Atomic instead of volatile for thread safety where C11 is supported	8 years ago
Zhiyong Dang	1b83341d19	Fix race condition in blas_server_omp.c Change-Id: Ic896276cd073d6b41930c7c5a29d66348cd1725d	8 years ago
Martin Kroeker	f29389c7ac	Merge pull request #1520 from martin-frbg/cpucounts Catch invalid cpu count returned by CPU_COUNT_S	8 years ago
Martin Kroeker	7c861605b2	Catch invalid cpu count returned by CPU_COUNT_S mips32 was seen to return zero here, driving nthreads to zero with subsequent fpe in blas_quickdivide	8 years ago
Martin Kroeker	20c6c38e51	Merge branch 'develop' into atomic	8 years ago
Martin Kroeker	d636b418af	Merge pull request #1504 from ararslan/aa/openbsd Allow building on OpenBSD	8 years ago
Alex Arslan	a41d241a0e	Add support for DragonFly BSD	8 years ago
Alex Arslan	8da6b6ae52	Allow building on OpenBSD With this change, OpenBLAS builds and all tests pass on OpenBSD 6.2 using Clang. Tested on x86-64 only, with and without DYNAMIC_ARCH=1.	8 years ago
Martin Kroeker	01c4b82f04	Update memory.c	8 years ago
Martin Kroeker	93db123f7e	Update memory.c	8 years ago
Martin Kroeker	752fdb5dd8	Add workaround for old gcc and clang versions Old gcc and clang do not handle constructor arguments, finally fix #875 as discussed there, using the fedora patch	8 years ago
Martin Kroeker	6a99fcce94	Use _Atomic instead of volatile for thread safety where C11 is supported Suggested by dodomorandi in #660	8 years ago
Martin Kroeker	7646974227	Limit the additional locking from PRs 1052,1299 to non-OpenMP multithreading	8 years ago
Martin Kroeker	8866e393a2	Revert "Add locks only for non-OPENMP multithreading"	8 years ago
Martin Kroeker	3119b2ab4c	Add locks only for non-OPENMP multithreading to mitigate performance problems caused by #1052 and #1299 as seen in #1461	8 years ago
Erik M. Bray	8f5f614615	On Cygwin use mmap instead of Windows native allocation functions, which are not fork-safe.	8 years ago
Erik M. Bray	f5fc109fbd	Perform blas_thread_shutdown with pthread_atfork() on Cygwin Even if we're directly using the win32 threading driver and not pthreads, pthread_atfork still works fine to register a pre-fork handler, and is necessary to restore the threading server to a pre-initialized state.	8 years ago
Martin Kroeker	e388459a27	Merge pull request #1419 from brada4/develop Initialize uninitialized values for repeated calls	8 years ago
Andrew	e5752ff9b3	take out unused variables	8 years ago
Andrew	8a0b086b28	add missing bracket for old glibc (cppcheck)	8 years ago
Martin Kroeker	42285d8e70	Merge pull request #1410 from brada4/develop Address warnings #1357	8 years ago
Andrew	8aafa0473c	address last warnings as seen by gcc7	8 years ago
Andrew	11a627c54e	remove surplus parentheses to silence clang5	8 years ago
Martin Kroeker	cc9500db41	Merge pull request #1403 from brada4/develop Address few more warnings	8 years ago
Andrew	bfc2a88594	remove unused buffer	8 years ago
Martin Kroeker	177b78c8b4	Issue1388 (#1389) * Calculation of chunk range limits was ignoring num_cpu bug introduced by me in #1262 - should fix #1388 * Calculation of range limits was ignoring num_cpu bug introduced by me in #1262 * Calculation of chunk range limits was ignoring num_cpu bug introduced by me in #1262 * Calculation of chunk range limits was ignoring num_cpu bug introduced by me in #1262 * Calculation of chunk range limits was ignoring num_cpu bug introduced by me in #1262 * Calculation of chunk range limits was ignoring num_cpu bug introduced by me in #1262	8 years ago
Andrew	281a2b952f	warning cleanup (#1380) * dead increments in driver/level2 * dead increments in kernel/generic * part dead increments in kernel/x86_64	8 years ago
Martin Kroeker	c49c6b237d	Merge pull request #1382 from martin-frbg/dtrmv-1332 Work around errors in multithreaded dtrmv	8 years ago
Martin Kroeker	28ae3ca76f	Limit MAX_CPU to 1024 for now Some Linux distributions (notably SuSE) have raised CPU_SETSIZE to 4096, apparently disregarding API limitations. From #1348, the highest value to survive array initialization (on a desktop system) is 3232, and 1024 - which is the more usual CPU_SETSIZE limit, was demonstrated to work fine on an actual bignuma system.	8 years ago
Martin Kroeker	b414283f48	Disable gemv unrolling as a (hopefully temporary) workaround for #1332	8 years ago
Andrew	ef95cd471f	eliminate unread variable, after reiteration 3 of them (clang4)	8 years ago
Andrew	e14d50d86e	eliminate Wunused-const gcc7 warning	8 years ago
Martin Kroeker	07e7c36dac	Handle shmem init failures in cpu affinity setup code Failures to obtain or attach shared memory segments would lead to an exit without explanation of the exact cause. This change introduces a more verbose error message and tries to make the code continue without setting cpu affinity. Fixes #1351	8 years ago
Martin Kroeker	2a6fef9a55	Try to handle shmget or shmat failing also replaces one verbatim sched_yield with the YIELDING macro for consistency as suggested in #1351	8 years ago

330 Commits (47bf0dba8f7a9cbd559e2f9cabe0bf2c7d3ee7a8)