* With the Intel compiler on Linux, prefer ifort for the final link step
icc has known problems with mixed-language builds that ifort can handle just fine. Fixes #1956
* Rename operands to put lda on the input/output constraint list
* Fix wrong constraints in inline assembly
for #2009
* Fix inline assembly constraints
rework indices to allow marking argument lda4 as input and output. For #2009
* Fix inline assembly constraints
rework indices to allow marking argument lda as input and output.
* Fix inline assembly constraints
* Fix inline assembly constraints
* Fix inline assembly constraints in Bulldozer TRSM kernels
rework indices to allow marking i, as and bs as both input and output (marked operand n1 as well for simplicity). For #2009
* Correct range_n limiting
same bug as seen in #1388, somehow missed in corresponding PR #1389
* Allow multithreading TRMV again
revert workaround introduced for issue #1332 as the actual cause appears to be my incorrect fix from #1262 (see #1388)
* Fix error introduced during cleanup
* Reduce list of kernels in the dynamic arch build
to make compilation complete reliably within the 1h limit again
* init
* move fix to right place
* Fix missing -c option in AVX512 test
* Fix AVX512 test always returning false due to missing compiler option
* Make x86_32 imply NO_AVX2, NO_AVX512 in addition to NO_AVX
fixes #2033
* Keep xcode8.3 for osx BINARY=32 build
as xcode10 deprecated i386
* Make sure that AVX512 is disabled in 32bit builds
for #2033
* Improve handling of NO_STATIC and NO_SHARED
to avoid surprises from defining either as zero. Fixes #2035 by addressing some concerns from #1422
* init
* address warning introduced with #1814 et al
* Restore locking optimizations for OpenMP case
restore another accidentally dropped part of #1468 that was missed in #2004 to address performance regression reported in #1461
* HiSilicon tsv110 CPUs optimization branch
add HiSilicon tsv110 CPUs optimization branch
* add TARGET support for HiSilicon tsv110 CPUs
* add TARGET support for HiSilicon tsv110 CPUs
* add TARGET support for HiSilicon tsv110 CPUs
* Fix module definition conflicts between LAPACK and ReLAPACK
for #2043
* Do not compile in AVX512 check if AVX support is disabled
the xgetbv function depends on NO_AVX being undefined - we could change that too, but that combo is unlikely to work anyway
* ctest.c : add __POWERPC__ for PowerMac
* Fix crash in sgemm SSE/nano kernel on x86_64
Fix bug #2047.
Signed-off-by: Celelibi <celelibi@gmail.com>
* param.h : enable defines for PPC970 on DarwinOS
fixes:
gemm.c: In function 'sgemm_':
../common_param.h:981:18: error: 'SGEMM_DEFAULT_P' undeclared (first use in this function)
#define SGEMM_P SGEMM_DEFAULT_P
^
* common_power.h: force DCBT_ARG 0 on PPC970 Darwin
without this, we see
../kernel/power/gemv_n.S:427:Parameter syntax error
and many more similar entries
that relates to this assembly command
dcbt 8, r24, r18
this change makes the DCBT_ARG = 0
and openblas builds through to completion on PowerMac 970
Tests pass
* Make TARGET=GENERIC compatible with DYNAMIC_ARCH=1
for issue #2048
* make DYNAMIC_ARCH=1 package work on TSV110.
* make DYNAMIC_ARCH=1 package work on TSV110
* Add Intel Denverton
for #2048
* Add Intel Denverton
* Change 64-bit detection as explained in #2056
* Trivial typo fix
as suggested in #2022
* Disable the AVX512 DGEMM kernel (again)
Due to as yet unresolved errors seen in #1955 and #2029
* Use POSIX getenv on Cygwin
The Windows-native GetEnvironmentVariable cannot be relied on, as
Cygwin does not always copy environment variables set through Cygwin
to the Windows environment block, particularly after fork().
* Fix for #2063: The DllMain used in Cygwin did not run the thread memory
pool cleanup upon THREAD_DETACH which is needed when compiled with
USE_TLS=1.
* Also call CloseHandle on each thread, as well as on the event so as to not leak thread handles.
* AIX asm syntax changes needed for shared object creation
* power9 makefile. dgemm based on the power8 kernel with the following changes: 32x unrolled 16x4 kernel and 8x4 kernel using (lxv stxv butterfly rank1 update). Improvement from 17 to 22-23 GFLOPS. dtrmm cases were added into dgemm itself
* Expose CBLAS interfaces for I?MIN and I?MAX
* Build CBLAS interfaces for I?MIN and I?MAX
* Add declarations for ?sum and cblas_?sum
* Add interface for ?sum (derived from ?asum)
* Add ?sum
* Add implementations of ssum/dsum and csum/zsum
as trivial copies of asum/zsasum with the fabs calls replaced by fmov to preserve code structure
* Add ARM implementations of ?sum
(trivial copies of the respective ?asum with the fabs calls removed)
* Add ARM64 implementations of ?sum
as trivial copies of the respective ?asum kernels with the fabs calls removed
* Add ia64 implementation of ?sum
as trivial copy of asum with the fabs calls removed
* Add MIPS implementation of ?sum
as trivial copy of ?asum with the fabs calls removed
* Add MIPS64 implementation of ?sum
as trivial copy of ?asum with the fabs replaced by mov to preserve code structure
* Add POWER implementation of ?sum
as trivial copy of ?asum with the fabs replaced by fmr to preserve code structure
* Add SPARC implementation of ?sum
as trivial copy of ?asum with the fabs replaced by fmov to preserve code structure
* Add x86 implementation of ?sum
as trivial copy of ?asum with the fabs calls removed
* Add x86_64 implementation of ?sum
as trivial copy of ?asum with the fabs calls removed
* Add ZARCH implementation of ?sum
as trivial copies of the respective ?asum kernels with the ABS and vflpsb calls removed
* Detect 32bit environment on 64bit ARM hardware
for #2056, using same approach as #2058
* Add cmake defaults for ?sum kernels
* Add ?sum
* Add ?sum definitions for generic kernel
* Add declarations for ?sum
* Add -lm and disable EXPRECISION support on *BSD
fixes #2075
* Add in runtime CPU detection for POWER.
* snprintf define consolidated to common.h
* Support INTERFACE64=1
* Add support for INTERFACE64 and fix XERBLA calls
1. Replaced all instances of "int" with "blasint"
2. Added string length as "hidden" third parameter in calls to fortran XERBLA
* Correct length of name string in xerbla call
* Avoid out-of-bounds accesses in LAPACK EIG tests
see https://github.com/Reference-LAPACK/lapack/issues/333
* Correct INFO=4 condition
* Disable reallocation of work array in xSYTRF
as it appears to cause memory management problems (seen in the LAPACK tests)
* Disable repeated recursion on Ab_BR in ReLAPACK xGBTRF
due to crashes in LAPACK tests
* sgemm/strmm
* Update Changelog with changes from 0.3.6
* Increment version to 0.3.7.dev
* Increment version to 0.3.7.dev
* Misc. typo fixes
Found via `codespell -q 3 -w -L ith,als,dum,nd,amin,nto,wis,ba -S ./relapack,./kernel,./lapack-netlib`
* Correct argument of CPU_ISSET for glibc <2.5
fixes #2104
* conflict resolve
* Revert reference/ fixes
* Revert Changelog.txt typos
* Disable the SkyLakeX DGEMMITCOPY kernel as well
as a stopgap measure for https://github.com/numpy/numpy/issues/13401 as mentioned in #1955
* Disable DGEMMINCOPY as well for now
#1955
* init
* Fix errors in cpu enumeration with glibc 2.6
for #2114
* Change two http links to https
Closes #2109
* remove redundant code #2113
* Set up CI with Azure Pipelines
[skip ci]
* TST: add native POWER8 to CI
* add native POWER8 testing to
Travis CI matrix with ppc64le
os entry
* Update link to IBM MASS library, update cpu support status
* first try migrating one of the arm builds from travis
* fix tabbing in azure commands
* Update azure-pipelines.yml
take out offending lines (although stolen from https://github.com/conda-forge/opencv-feedstock azure-pipelines file)
* Update azure-pipelines.yml
* Update azure-pipelines.yml
* Update azure-pipelines.yml
* Update azure-pipelines.yml
* DOC: Add Azure CI status badge
* Add ARMV6 build to azure CI setup (#2122)
using aytekinar's Alpine image and docker script from the Travis setup
[skip ci]
* TST: Azure manylinux1 & clean-up
* remove some of the steps & comments
from the original Azure yml template
* modify the trigger section to use
develop since OpenBLAS primarily uses
this branch; use the same batching
behavior as downstream projects NumPy/
SciPy
* remove Travis emulated ARMv6 gcc build
because this now happens in Azure
* use documented Ubuntu vmImage name for Azure
and add in a manylinux1 test run to the matrix
[skip appveyor]
* Add NO_AFFINITY to available options on Linux, and set it to ON
to match the gmake default. Fixes second part of #2114
* Replace ISMIN and ISAMIN kernels on all x86_64 platforms (#2125)
* Mark iamax_sse.S as unsuitable for MIN due to issue #2116
* Use iamax.S rather than iamax_sse.S for ISMIN/ISAMIN on all x86_64 as workaround for #2116
* Move ARMv8 gcc build from Travis to Azure
* Move ARMv8 gcc build from Travis to Azure
* Update .travis.yml
* Test drone CI
* install make
* remove sudo
* Install gcc
* Install perl
* Install gfortran and add a clang job
* gfortran->gcc-gfortran
* Switch to ubuntu and parallel jobs
* apt update
* Fix typo
* update yes
* no need of gcc in clang build
* Add a cmake build as well
* Add cmake builds and print options
* build without lapack on cmake
* parallel build
* See if ubuntu 19.04 fixes the ICE
* Remove qemu armv8 builds
* arm32 build
* Fix typo
* TST: add SkylakeX AVX512 CI test
* adapt the C-level reproducer code for some
recent SkylakeX AVX512 kernel issues, provided
by Isuru Fernando and modified by Martin Kroeker,
for usage in the utest suite
* add an Intel SDE SkylakeX emulation utest run to
the Azure CI matrix; a custom Docker build was required
because Ubuntu image provided by Azure does not support
AVX512VL instructions
* Add option USE_LOCKING for single-threaded build with locking support
for calling from concurrent threads
* Add option USE_LOCKING for single-threaded build with locking support
* Add option USE_LOCKING for SMP-like locking in USE_THREAD=0 builds
* Add option USE_LOCKING but keep default settings intact
* Remove unrelated change
* Do not try ancient PGI hacks with recent versions of that compiler
should fix #2139
* Build and run utests in any case, they do their own checks for fortran availability
* Add softfp support in min/max kernels
fix for #1912
* Revert "Add softfp support in min/max kernels"
* Separate implementations of AMAX and IAMAX on arm
As noted in #1912 and comment on #1942, the combined implementation happens to "do the right thing" on hardfp, but cannot return both value and index on softfp where they would have to share the return register
* Ensure correct output for DAMAX with softfp
* Use generic kernels for complex (I)AMAX to support softfp
* improved zgemm power9 based on power8
* upload thread safety test folder
* hook up c++ thread safety test (main Makefile)
* add c++ thread test option to Makefile.rule
* Document NO_AVX512
for #2151
* sgemm pipeline improved, zgemm rewritten without inner packs, ABI lxvx v20 fixed with vs52
* Fix detection of AVX512 capable compilers in getarch
21eda8b5 introduced a check in getarch.c to test if the compiler is capable of
AVX512. This check currently fails, since the used __AVX2__ macro is only
defined if getarch itself was compiled with AVX2/AVX512 support. Make sure this
is the case by building getarch with -march=native on x86_64. It is only
supposed to run on the build host anyway.
* c_check: Unlink correct file
* power9 zgemm ztrmm optimized
* conflict resolve
* Add gfortran workaround for ABI violations in LAPACKE
for #2154 (see gcc bug 90329)
* Add gfortran workaround for ABI violations
for #2154 (see gcc bug 90329)
* Add gfortran workaround for potential ABI violation
for #2154
* Update fc.cmake
* Remove any inadvertent use of -march=native from DYNAMIC_ARCH builds
from #2143, -march=native precludes use of more specific options like -march=skylake-avx512 in individual kernels, and defeats the purpose of dynamic arch anyway.
* Avoid unintentional activation of TLS code via USE_TLS=0
fixes #2149
* Do not force gcc options on non-gcc compilers
fixes compile failure with pgi 18.10 as reported on OpenBLAS-users
* Update Makefile.x86_64
* Zero ecx with a mov instruction
PGI assembler does not like the initialization in the constraints.
* Fix mov syntax
* new sgemm 8x16
* Update dtrmm_kernel_16x4_power8.S
* PGI compiler does not like -march=native
* Fix build on FreeBSD/powerpc64.
Signed-off-by: Piotr Kubaj <pkubaj@anongoth.pl>
* Fix build for PPC970 on FreeBSD pt. 1
FreeBSD needs DCBT_ARG=0 as well.
* Fix build for PPC970 on FreeBSD pt.2
FreeBSD needs those macros too.
* cgemm/ctrmm power9
* Utest needs CBLAS but not necessarily FORTRAN
* Add mingw builds to Appveyor config
* Add getarch flags to disable AVX on x86
(and other small fixes to match Makefile behaviour)
* Make disabling DYNAMIC_ARCH on unsupported systems work
needs to be unset in the cache for the change to have any effect
* Mingw32 needs leading underscore on object names
(also copy BUNDERSCORE settings for FORTRAN from the corresponding Makefile)
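
The ?sum commits above describe the new kernels as copies of the respective ?asum kernels with the fabs calls removed (or replaced by a plain move to preserve code structure). A minimal C sketch of that relationship follows. The names ref_dasum and ref_dsum are hypothetical and used only for illustration; this is not the optimized kernel code from the PR, and the handling of non-positive increments is an assumption made to keep the sketch simple.

```c
#include <math.h>
#include <stddef.h>

/* Illustrative only: sum of absolute values, in the spirit of BLAS dasum.
 * Returns 0 for n <= 0 or inc_x <= 0 (an assumption of this sketch). */
double ref_dasum(int n, const double *x, int inc_x)
{
    double acc = 0.0;
    if (n <= 0 || inc_x <= 0)
        return 0.0;
    for (int i = 0; i < n; i++)
        acc += fabs(x[(size_t)i * (size_t)inc_x]);
    return acc;
}

/* Illustrative only: the same loop with the fabs call dropped, which is
 * how the commit messages describe deriving ?sum from ?asum. */
double ref_dsum(int n, const double *x, int inc_x)
{
    double acc = 0.0;
    if (n <= 0 || inc_x <= 0)
        return 0.0;
    for (int i = 0; i < n; i++)
        acc += x[(size_t)i * (size_t)inc_x];
    return acc;
}
```

For a vector holding 1, -2, 3, -4 with unit stride, the first function accumulates the magnitudes while the second lets the signs cancel, which is the only behavioural difference between the two loops.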
| @@ -0,0 +1,143 @@ | |||||
| --- | |||||
| kind: pipeline | |||||
| name: arm64_gcc_make | |||||
| platform: | |||||
| os: linux | |||||
| arch: arm64 | |||||
| steps: | |||||
| - name: Build and Test | |||||
| image: ubuntu:19.04 | |||||
| environment: | |||||
| CC: gcc | |||||
| COMMON_FLAGS: 'DYNAMIC_ARCH=1 TARGET=ARMV8 NUM_THREADS=32' | |||||
| commands: | |||||
| - echo "MAKE_FLAGS:= $COMMON_FLAGS" | |||||
| - apt-get update -y | |||||
| - apt-get install -y make $CC gfortran perl | |||||
| - $CC --version | |||||
| - make QUIET_MAKE=1 $COMMON_FLAGS | |||||
| - make -C test $COMMON_FLAGS | |||||
| - make -C ctest $COMMON_FLAGS | |||||
| - make -C utest $COMMON_FLAGS | |||||
| --- | |||||
| kind: pipeline | |||||
| name: arm32_gcc_make | |||||
| platform: | |||||
| os: linux | |||||
| arch: arm | |||||
| steps: | |||||
| - name: Build and Test | |||||
| image: ubuntu:19.04 | |||||
| environment: | |||||
| CC: gcc | |||||
| COMMON_FLAGS: 'DYNAMIC_ARCH=1 TARGET=ARMV6 NUM_THREADS=32' | |||||
| commands: | |||||
| - echo "MAKE_FLAGS:= $COMMON_FLAGS" | |||||
| - apt-get update -y | |||||
| - apt-get install -y make $CC gfortran perl | |||||
| - $CC --version | |||||
| - make QUIET_MAKE=1 $COMMON_FLAGS | |||||
| - make -C test $COMMON_FLAGS | |||||
| - make -C ctest $COMMON_FLAGS | |||||
| - make -C utest $COMMON_FLAGS | |||||
| --- | |||||
| kind: pipeline | |||||
| name: arm64_clang_make | |||||
| platform: | |||||
| os: linux | |||||
| arch: arm64 | |||||
| steps: | |||||
| - name: Build and Test | |||||
| image: ubuntu:18.04 | |||||
| environment: | |||||
| CC: clang | |||||
| COMMON_FLAGS: 'DYNAMIC_ARCH=1 TARGET=ARMV8 NUM_THREADS=32' | |||||
| commands: | |||||
| - echo "MAKE_FLAGS:= $COMMON_FLAGS" | |||||
| - apt-get update -y | |||||
| - apt-get install -y make $CC gfortran perl | |||||
| - $CC --version | |||||
| - make QUIET_MAKE=1 $COMMON_FLAGS | |||||
| - make -C test $COMMON_FLAGS | |||||
| - make -C ctest $COMMON_FLAGS | |||||
| - make -C utest $COMMON_FLAGS | |||||
| --- | |||||
| kind: pipeline | |||||
| name: arm32_clang_cmake | |||||
| platform: | |||||
| os: linux | |||||
| arch: arm | |||||
| steps: | |||||
| - name: Build and Test | |||||
| image: ubuntu:18.04 | |||||
| environment: | |||||
| CC: clang | |||||
| CMAKE_FLAGS: '-DDYNAMIC_ARCH=1 -DTARGET=ARMV6 -DNUM_THREADS=32 -DNOFORTRAN=ON -DBUILD_WITHOUT_LAPACK=ON' | |||||
| commands: | |||||
| - echo "CMAKE_FLAGS:= $CMAKE_FLAGS" | |||||
| - apt-get update -y | |||||
| - apt-get install -y make $CC g++ perl cmake | |||||
| - $CC --version | |||||
| - mkdir build && cd build | |||||
| - cmake $CMAKE_FLAGS .. | |||||
| - make -j | |||||
| - ctest | |||||
| --- | |||||
| kind: pipeline | |||||
| name: arm64_gcc_cmake | |||||
| platform: | |||||
| os: linux | |||||
| arch: arm64 | |||||
| steps: | |||||
| - name: Build and Test | |||||
| image: ubuntu:18.04 | |||||
| environment: | |||||
| CC: gcc | |||||
| CMAKE_FLAGS: '-DDYNAMIC_ARCH=1 -DTARGET=ARMV8 -DNUM_THREADS=32 -DNOFORTRAN=ON -DBUILD_WITHOUT_LAPACK=ON' | |||||
| commands: | |||||
| - echo "CMAKE_FLAGS:= $CMAKE_FLAGS" | |||||
| - apt-get update -y | |||||
| - apt-get install -y make $CC g++ perl cmake | |||||
| - $CC --version | |||||
| - mkdir build && cd build | |||||
| - cmake $CMAKE_FLAGS .. | |||||
| - make -j | |||||
| - ctest | |||||
| --- | |||||
| kind: pipeline | |||||
| name: arm64_clang_cmake | |||||
| platform: | |||||
| os: linux | |||||
| arch: arm64 | |||||
| steps: | |||||
| - name: Build and Test | |||||
| image: ubuntu:18.04 | |||||
| environment: | |||||
| CC: clang | |||||
| CMAKE_FLAGS: '-DDYNAMIC_ARCH=1 -DTARGET=ARMV8 -DNUM_THREADS=32 -DNOFORTRAN=ON -DBUILD_WITHOUT_LAPACK=ON' | |||||
| commands: | |||||
| - echo "CMAKE_FLAGS:= $CMAKE_FLAGS" | |||||
| - apt-get update -y | |||||
| - apt-get install -y make $CC g++ perl cmake | |||||
| - $CC --version | |||||
| - mkdir build && cd build | |||||
| - cmake $CMAKE_FLAGS .. | |||||
| - make -j | |||||
| - ctest | |||||
| @@ -25,6 +25,15 @@ matrix: | |||||
| - TARGET_BOX=LINUX64 | - TARGET_BOX=LINUX64 | ||||
| - BTYPE="BINARY=64" | - BTYPE="BINARY=64" | ||||
| - <<: *test-ubuntu | |||||
| os: linux-ppc64le | |||||
| before_script: | |||||
| - COMMON_FLAGS="DYNAMIC_ARCH=1 TARGET=POWER8 NUM_THREADS=32" | |||||
| env: | |||||
| # for matrix annotation only | |||||
| - TARGET_BOX=PPC64LE_LINUX | |||||
| - BTYPE="BINARY=64 USE_OPENMP=1" | |||||
| - <<: *test-ubuntu | - <<: *test-ubuntu | ||||
| env: | env: | ||||
| - TARGET_BOX=LINUX64 | - TARGET_BOX=LINUX64 | ||||
| @@ -160,45 +169,10 @@ matrix: | |||||
| - BTYPE="BINARY=64 INTERFACE64=1" | - BTYPE="BINARY=64 INTERFACE64=1" | ||||
| - <<: *test-macos | - <<: *test-macos | ||||
| osx_image: xcode8.3 | |||||
| env: | env: | ||||
| - BTYPE="BINARY=32" | - BTYPE="BINARY=32" | ||||
| - &emulated-arm | |||||
| dist: trusty | |||||
| sudo: required | |||||
| services: docker | |||||
| env: IMAGE_ARCH=arm32 TARGET_ARCH=ARMV6 COMPILER=gcc | |||||
| name: "Emulated Build for ARMV6 with gcc" | |||||
| before_install: sudo docker run --rm --privileged multiarch/qemu-user-static:register --reset | |||||
| script: | | |||||
| echo "FROM openblas/alpine:${IMAGE_ARCH} | |||||
| COPY . /tmp/openblas | |||||
| RUN mkdir /tmp/openblas/build && \ | |||||
| cd /tmp/openblas/build && \ | |||||
| CC=${COMPILER} cmake -D DYNAMIC_ARCH=OFF \ | |||||
| -D TARGET=${TARGET_ARCH} \ | |||||
| -D BUILD_SHARED_LIBS=ON \ | |||||
| -D BUILD_WITHOUT_LAPACK=ON \ | |||||
| -D BUILD_WITHOUT_CBLAS=ON \ | |||||
| -D CMAKE_BUILD_TYPE=Release ../ && \ | |||||
| cmake --build ." > Dockerfile | |||||
| docker build . | |||||
| - <<: *emulated-arm | |||||
| env: IMAGE_ARCH=arm32 TARGET_ARCH=ARMV6 COMPILER=clang | |||||
| name: "Emulated Build for ARMV6 with clang" | |||||
| - <<: *emulated-arm | |||||
| env: IMAGE_ARCH=arm64 TARGET_ARCH=ARMV8 COMPILER=gcc | |||||
| name: "Emulated Build for ARMV8 with gcc" | |||||
| - <<: *emulated-arm | |||||
| env: IMAGE_ARCH=arm64 TARGET_ARCH=ARMV8 COMPILER=clang | |||||
| name: "Emulated Build for ARMV8 with clang" | |||||
| allow_failures: | |||||
| - env: IMAGE_ARCH=arm32 TARGET_ARCH=ARMV6 COMPILER=gcc | |||||
| - env: IMAGE_ARCH=arm32 TARGET_ARCH=ARMV6 COMPILER=clang | |||||
| - env: IMAGE_ARCH=arm64 TARGET_ARCH=ARMV8 COMPILER=gcc | |||||
| - env: IMAGE_ARCH=arm64 TARGET_ARCH=ARMV8 COMPILER=clang | |||||
| # whitelist | # whitelist | ||||
| branches: | branches: | ||||
| only: | only: | ||||
| @@ -6,7 +6,7 @@ cmake_minimum_required(VERSION 2.8.5) | |||||
| project(OpenBLAS C ASM) | project(OpenBLAS C ASM) | ||||
| set(OpenBLAS_MAJOR_VERSION 0) | set(OpenBLAS_MAJOR_VERSION 0) | ||||
| set(OpenBLAS_MINOR_VERSION 3) | set(OpenBLAS_MINOR_VERSION 3) | ||||
| set(OpenBLAS_PATCH_VERSION 6.dev) | |||||
| set(OpenBLAS_PATCH_VERSION 7.dev) | |||||
| set(OpenBLAS_VERSION "${OpenBLAS_MAJOR_VERSION}.${OpenBLAS_MINOR_VERSION}.${OpenBLAS_PATCH_VERSION}") | set(OpenBLAS_VERSION "${OpenBLAS_MAJOR_VERSION}.${OpenBLAS_MINOR_VERSION}.${OpenBLAS_PATCH_VERSION}") | ||||
| # Adhere to GNU filesystem layout conventions | # Adhere to GNU filesystem layout conventions | ||||
| @@ -20,9 +20,14 @@ if(MSVC) | |||||
| option(BUILD_WITHOUT_LAPACK "Do not build LAPACK and LAPACKE (Only BLAS or CBLAS)" ON) | option(BUILD_WITHOUT_LAPACK "Do not build LAPACK and LAPACKE (Only BLAS or CBLAS)" ON) | ||||
| endif() | endif() | ||||
| option(BUILD_WITHOUT_CBLAS "Do not build the C interface (CBLAS) to the BLAS functions" OFF) | option(BUILD_WITHOUT_CBLAS "Do not build the C interface (CBLAS) to the BLAS functions" OFF) | ||||
| option(DYNAMIC_ARCH "Include support for multiple CPU targets, with automatic selection at runtime (x86/x86_64 only)" OFF) | |||||
| option(DYNAMIC_OLDER "Include specific support for older cpu models (Penryn,Dunnington,Atom,Nano,Opteron) with DYNAMIC_ARCH" OFF) | |||||
| option(DYNAMIC_ARCH "Include support for multiple CPU targets, with automatic selection at runtime (x86/x86_64, aarch64 or ppc only)" OFF) | |||||
| option(DYNAMIC_OLDER "Include specific support for older x86 cpu models (Penryn,Dunnington,Atom,Nano,Opteron) with DYNAMIC_ARCH" OFF) | |||||
| option(BUILD_RELAPACK "Build with ReLAPACK (recursive implementation of several LAPACK functions on top of standard LAPACK)" OFF) | option(BUILD_RELAPACK "Build with ReLAPACK (recursive implementation of several LAPACK functions on top of standard LAPACK)" OFF) | ||||
| if(${CMAKE_SYSTEM_NAME} MATCHES "Linux") | |||||
| option(NO_AFFINITY "Disable support for CPU affinity masks to avoid binding processes from e.g. R or numpy/scipy to a single core" ON) | |||||
| else() | |||||
| set(NO_AFFINITY 1) | |||||
| endif() | |||||
| # Add a prefix or suffix to all exported symbol names in the shared library. | # Add a prefix or suffix to all exported symbol names in the shared library. | ||||
| # Avoids conflicts with other BLAS libraries, especially when using | # Avoids conflicts with other BLAS libraries, especially when using | ||||
| @@ -42,6 +47,19 @@ endif() | |||||
| ####### | ####### | ||||
| if(MSVC AND MSVC_STATIC_CRT) | |||||
| set(CompilerFlags | |||||
| CMAKE_CXX_FLAGS | |||||
| CMAKE_CXX_FLAGS_DEBUG | |||||
| CMAKE_CXX_FLAGS_RELEASE | |||||
| CMAKE_C_FLAGS | |||||
| CMAKE_C_FLAGS_DEBUG | |||||
| CMAKE_C_FLAGS_RELEASE | |||||
| ) | |||||
| foreach(CompilerFlag ${CompilerFlags}) | |||||
| string(REPLACE "/MD" "/MT" ${CompilerFlag} "${${CompilerFlag}}") | |||||
| endforeach() | |||||
| endif() | |||||
| message(WARNING "CMake support is experimental. It does not yet support all build options and may not produce the same Makefiles that OpenBLAS ships with.") | message(WARNING "CMake support is experimental. It does not yet support all build options and may not produce the same Makefiles that OpenBLAS ships with.") | ||||
| @@ -62,10 +80,10 @@ endif () | |||||
| set(SUBDIRS ${BLASDIRS}) | set(SUBDIRS ${BLASDIRS}) | ||||
| if (NOT NO_LAPACK) | if (NOT NO_LAPACK) | ||||
| list(APPEND SUBDIRS lapack) | |||||
| if(BUILD_RELAPACK) | if(BUILD_RELAPACK) | ||||
| list(APPEND SUBDIRS relapack/src) | list(APPEND SUBDIRS relapack/src) | ||||
| endif() | endif() | ||||
| list(APPEND SUBDIRS lapack) | |||||
| endif () | endif () | ||||
| # set which float types we want to build for | # set which float types we want to build for | ||||
| @@ -134,7 +152,7 @@ endif () | |||||
| # Only generate .def for dll on MSVC and always produce pdb files for debug and release | # Only generate .def for dll on MSVC and always produce pdb files for debug and release | ||||
| if(MSVC) | if(MSVC) | ||||
| if (${CMAKE_MAJOR_VERSION}.${CMAKE_MINOR_VERSION} LESS 3.4) | |||||
| if (${CMAKE_MAJOR_VERSION}.${CMAKE_MINOR_VERSION} VERSION_LESS 3.4) | |||||
| set(OpenBLAS_DEF_FILE "${PROJECT_BINARY_DIR}/openblas.def") | set(OpenBLAS_DEF_FILE "${PROJECT_BINARY_DIR}/openblas.def") | ||||
| endif() | endif() | ||||
| set(CMAKE_C_FLAGS_RELEASE "${CMAKE_C_FLAGS_RELEASE} /Zi") | set(CMAKE_C_FLAGS_RELEASE "${CMAKE_C_FLAGS_RELEASE} /Zi") | ||||
| @@ -149,15 +167,9 @@ if (${DYNAMIC_ARCH}) | |||||
| endforeach() | endforeach() | ||||
| endif () | endif () | ||||
| # Only build shared libs for MSVC | |||||
| if (MSVC) | |||||
| set(BUILD_SHARED_LIBS ON) | |||||
| endif() | |||||
| # add objects to the openblas lib | # add objects to the openblas lib | ||||
| add_library(${OpenBLAS_LIBNAME} ${LA_SOURCES} ${LAPACKE_SOURCES} ${RELA_SOURCES} ${TARGET_OBJS} ${OpenBLAS_DEF_FILE}) | add_library(${OpenBLAS_LIBNAME} ${LA_SOURCES} ${LAPACKE_SOURCES} ${RELA_SOURCES} ${TARGET_OBJS} ${OpenBLAS_DEF_FILE}) | ||||
| target_include_directories(${OpenBLAS_LIBNAME} INTERFACE $<INSTALL_INTERFACE:include>) | |||||
| target_include_directories(${OpenBLAS_LIBNAME} INTERFACE $<INSTALL_INTERFACE:include/openblas${SUFFIX64}>) | |||||
| # Android needs to explicitly link against libm | # Android needs to explicitly link against libm | ||||
| if(ANDROID) | if(ANDROID) | ||||
| @@ -166,7 +178,7 @@ endif() | |||||
| # Handle MSVC exports | # Handle MSVC exports | ||||
| if(MSVC AND BUILD_SHARED_LIBS) | if(MSVC AND BUILD_SHARED_LIBS) | ||||
| if (${CMAKE_MAJOR_VERSION}.${CMAKE_MINOR_VERSION} LESS 3.4) | |||||
| if (${CMAKE_MAJOR_VERSION}.${CMAKE_MINOR_VERSION} VERSION_LESS 3.4) | |||||
| include("${PROJECT_SOURCE_DIR}/cmake/export.cmake") | include("${PROJECT_SOURCE_DIR}/cmake/export.cmake") | ||||
| else() | else() | ||||
| # Creates verbose .def file (51KB vs 18KB) | # Creates verbose .def file (51KB vs 18KB) | ||||
| @@ -199,7 +211,8 @@ if (USE_THREAD) | |||||
| target_link_libraries(${OpenBLAS_LIBNAME} ${CMAKE_THREAD_LIBS_INIT}) | target_link_libraries(${OpenBLAS_LIBNAME} ${CMAKE_THREAD_LIBS_INIT}) | ||||
| endif() | endif() | ||||
| if (MSVC OR NOT NOFORTRAN) | |||||
| #if (MSVC OR NOT NOFORTRAN) | |||||
| if (NOT NO_CBLAS) | |||||
| # Broken without fortran on unix | # Broken without fortran on unix | ||||
| add_subdirectory(utest) | add_subdirectory(utest) | ||||
| endif() | endif() | ||||
| @@ -217,6 +230,14 @@ set_target_properties(${OpenBLAS_LIBNAME} PROPERTIES | |||||
| SOVERSION ${OpenBLAS_MAJOR_VERSION} | SOVERSION ${OpenBLAS_MAJOR_VERSION} | ||||
| ) | ) | ||||
| if (BUILD_SHARED_LIBS AND BUILD_RELAPACK) | |||||
| if (NOT MSVC) | |||||
| target_link_libraries(${OpenBLAS_LIBNAME} "-Wl,-allow-multiple-definition") | |||||
| else() | |||||
| target_link_libraries(${OpenBLAS_LIBNAME} "/FORCE:MULTIPLE") | |||||
| endif() | |||||
| endif() | |||||
| if (BUILD_SHARED_LIBS AND NOT ${SYMBOLPREFIX}${SYMBOLSUFIX} STREQUAL "") | if (BUILD_SHARED_LIBS AND NOT ${SYMBOLPREFIX}${SYMBOLSUFIX} STREQUAL "") | ||||
| if (NOT DEFINED ARCH) | if (NOT DEFINED ARCH) | ||||
| set(ARCH_IN "x86_64") | set(ARCH_IN "x86_64") | ||||
| @@ -314,7 +335,7 @@ install (FILES ${OPENBLAS_CONFIG_H} DESTINATION ${CMAKE_INSTALL_INCLUDEDIR}) | |||||
| if(NOT NOFORTRAN) | if(NOT NOFORTRAN) | ||||
| message(STATUS "Generating f77blas.h in ${CMAKE_INSTALL_INCLUDEDIR}") | message(STATUS "Generating f77blas.h in ${CMAKE_INSTALL_INCLUDEDIR}") | ||||
| set(F77BLAS_H ${CMAKE_BINARY_DIR}/f77blas.h) | |||||
| set(F77BLAS_H ${CMAKE_BINARY_DIR}/generated/f77blas.h) | |||||
| file(WRITE ${F77BLAS_H} "#ifndef OPENBLAS_F77BLAS_H\n") | file(WRITE ${F77BLAS_H} "#ifndef OPENBLAS_F77BLAS_H\n") | ||||
| file(APPEND ${F77BLAS_H} "#define OPENBLAS_F77BLAS_H\n") | file(APPEND ${F77BLAS_H} "#define OPENBLAS_F77BLAS_H\n") | ||||
| file(APPEND ${F77BLAS_H} "#include \"openblas_config.h\"\n") | file(APPEND ${F77BLAS_H} "#include \"openblas_config.h\"\n") | ||||
| @@ -327,10 +348,11 @@ endif() | |||||
| if(NOT NO_CBLAS) | if(NOT NO_CBLAS) | ||||
| message (STATUS "Generating cblas.h in ${CMAKE_INSTALL_INCLUDEDIR}") | message (STATUS "Generating cblas.h in ${CMAKE_INSTALL_INCLUDEDIR}") | ||||
| set(CBLAS_H ${CMAKE_BINARY_DIR}/generated/cblas.h) | |||||
| file(READ ${CMAKE_CURRENT_SOURCE_DIR}/cblas.h CBLAS_H_CONTENTS) | file(READ ${CMAKE_CURRENT_SOURCE_DIR}/cblas.h CBLAS_H_CONTENTS) | ||||
| string(REPLACE "common" "openblas_config" CBLAS_H_CONTENTS_NEW "${CBLAS_H_CONTENTS}") | string(REPLACE "common" "openblas_config" CBLAS_H_CONTENTS_NEW "${CBLAS_H_CONTENTS}") | ||||
| file(WRITE ${CMAKE_BINARY_DIR}/cblas.tmp "${CBLAS_H_CONTENTS_NEW}") | |||||
| install (FILES ${CMAKE_BINARY_DIR}/cblas.tmp DESTINATION ${CMAKE_INSTALL_INCLUDEDIR} RENAME cblas.h) | |||||
| file(WRITE ${CBLAS_H} "${CBLAS_H_CONTENTS_NEW}") | |||||
| install (FILES ${CBLAS_H} DESTINATION ${CMAKE_INSTALL_INCLUDEDIR}) | |||||
| endif() | endif() | ||||
| if(NOT NO_LAPACKE) | if(NOT NO_LAPACKE) | ||||
| @@ -167,4 +167,7 @@ In chronological order: | |||||
| * [2017-02-26] ztrmm kernel for IBM z13 | * [2017-02-26] ztrmm kernel for IBM z13 | ||||
| * [2017-03-13] strmm and ctrmm kernel for IBM z13 | * [2017-03-13] strmm and ctrmm kernel for IBM z13 | ||||
| * [2017-09-01] initial Blas Level-1,2 (double precision) for IBM z13 | * [2017-09-01] initial Blas Level-1,2 (double precision) for IBM z13 | ||||
| * [2018-03-07] added missing Blas Level 1-2 (double precision) simd codes | |||||
| * [2019-02-01] added missing Blas Level-1,2 (single precision) simd codes | |||||
| * [2019-03-14] power9 dgemm/dtrmm kernel | |||||
| * [2019-04-29] power9 sgemm/strmm kernel | |||||
| @@ -1,4 +1,82 @@ | |||||
| OpenBLAS ChangeLog | OpenBLAS ChangeLog | ||||
| ==================================================================== | |||||
| Version 0.3.6 | |||||
| 29-Apr-2019 | |||||
| common: | |||||
| * the build tools now check that a given cpu TARGET is actually valid | |||||
| * the build-time check of system features (c_check) has been made | |||||
| less dependent on particular perl features (this should mainly | |||||
| benefit building on Windows) | |||||
| * several problems with the ReLAPACK integration were fixed, | |||||
| including INTERFACE64 support and building a shared library | |||||
| * building with CMAKE on BSD systems was improved | |||||
| * a non-absolute SUM function was added based on the | |||||
| existing optimized code for ASUM | |||||
| * CBLAS interfaces to the IxMIN and IxMAX functions were added | |||||
| * a name clash between LAPACKE and BOOST headers was resolved | |||||
| * CMAKE builds with OpenMP failed to include the appropriate getrf_parallel | |||||
| kernels | |||||
| * a crash on thread (key) deletion with the USE_TLS=1 memory management | |||||
| option was fixed | |||||
| * restored several earlier fixes, in particular for OpenMP performance, | |||||
| building on BSD, and calling fork on CYGWIN, which had inadvertently | |||||
| been dropped in the 0.3.3 rewrite of the memory management code. | |||||
| x86_64: | |||||
| * the AVX512 DGEMM kernel has been disabled again due to unsolved problems | |||||
| * building with old versions of MSVC was fixed | |||||
| * it is now possible to build a static library on Windows with CMAKE | |||||
| * accessing environment variables on CYGWIN at run time was fixed | |||||
| * the CMAKE build system now recognizes 32bit userspace on 64bit hardware | |||||
| * Intel "Denverton" atom and Hygon "Dhyana" zen CPUs are now autodetected | |||||
| * building for DYNAMIC_ARCH with a DYNAMIC_LIST of targets is now supported | |||||
| with CMAKE as well | |||||
| * building for DYNAMIC_ARCH with GENERIC as the default target is now supported | |||||
| * a buffer overflow in the SSE GEMM kernel for Intel Nano targets was fixed | |||||
| * assembly bugs involving undeclared modification of input operands were fixed | |||||
| in the AXPY, DOT, GEMV, GER, SCAL, SYMV and TRSM microkernels for Nehalem, | |||||
| Sandybridge, Haswell, Bulldozer and Piledriver. These would typically cause | |||||
| test failures or segfaults when compiled with recent versions of gcc from 8 onward. | |||||
| * a similar bug was fixed in the blas_quickdivide code used to split workloads | |||||
| in most functions | |||||
| * a bug in the IxMIN implementation for the GENERIC target made it return the result of IxMAX | |||||
| * fixed building on SkylakeX systems when either the compiler or the (emulated) operating | |||||
| environment does not support AVX512 | |||||
| * improved GEMM performance on ZEN targets | |||||
| x86: | |||||
| * build failures caused by the recently added checks for AVX512 were fixed | |||||
| * an inline assembly bug involving undeclared modification of an input argument was | |||||
| fixed in the blas_quickdivide code used to split workloads in most functions | |||||
| * a bug in the IMIN implementation for the GENERIC target made it return the result of IMAX | |||||
| MIPS32: | |||||
| * a bug in the IMIN implementation made it return the result of IMAX | |||||
| POWER: | |||||
| * single precision BLAS1/2 functions have received optimized POWER8 kernels | |||||
| * POWER9 is now a separate target, with an optimized DGEMM/DTRMM kernel | |||||
| * building on PPC970 systems under OSX Leopard or Tiger is now supported | |||||
| * out-of-bounds memory accesses in the gemm_beta microkernels were fixed | |||||
| * building a shared library on AIX is now supported for POWER6 | |||||
| * DYNAMIC_ARCH support has been added for POWER6 and newer | |||||
| ARMv7: | |||||
| * corrected xDOT behaviour with zero INC_X or INC_Y | |||||
| * a bug in the IMIN implementation made it return the result of IMAX | |||||
| ARMv8: | |||||
| * added support for HiSilicon TSV110 cpus | |||||
| * the CMAKE build system now recognizes 32bit userspace on 64bit hardware | |||||
| * cross-compilation with CMAKE now works again | |||||
| * a bug in the IMIN implementation made it return the result of IMAX | |||||
| * ARMV8 builds with the BINARY=32 option are now automatically handled as ARMV7 | |||||
| IBM Z: | |||||
| * optimized microkernels for single precision BLAS1/2 functions have been added | |||||
| for both Z13 and Z14 | |||||
| ==================================================================== | ==================================================================== | ||||
| Version 0.3.5 | Version 0.3.5 | ||||
| 31-Dec-2018 | 31-Dec-2018 | ||||
| @@ -34,7 +34,7 @@ endif | |||||
| LAPACK_NOOPT := $(filter-out -O0 -O1 -O2 -O3 -Ofast,$(LAPACK_FFLAGS)) | LAPACK_NOOPT := $(filter-out -O0 -O1 -O2 -O3 -Ofast,$(LAPACK_FFLAGS)) | ||||
| SUBDIRS_ALL = $(SUBDIRS) test ctest utest exports benchmark ../laswp ../bench | |||||
| SUBDIRS_ALL = $(SUBDIRS) test ctest utest exports benchmark ../laswp ../bench cpp_thread_test | |||||
| .PHONY : all libs netlib $(RELA) test ctest shared install | .PHONY : all libs netlib $(RELA) test ctest shared install | ||||
| .NOTPARALLEL : all libs $(RELA) prof lapack-test install blas-test | .NOTPARALLEL : all libs $(RELA) prof lapack-test install blas-test | ||||
| @@ -96,7 +96,7 @@ endif | |||||
| @echo | @echo | ||||
| shared : | shared : | ||||
| ifndef NO_SHARED | |||||
| ifneq ($(NO_SHARED), 1) | |||||
| ifeq ($(OSNAME), $(filter $(OSNAME),Linux SunOS Android Haiku)) | ifeq ($(OSNAME), $(filter $(OSNAME),Linux SunOS Android Haiku)) | ||||
| @$(MAKE) -C exports so | @$(MAKE) -C exports so | ||||
| @ln -fs $(LIBSONAME) $(LIBPREFIX).so | @ln -fs $(LIBSONAME) $(LIBPREFIX).so | ||||
| @@ -123,10 +123,13 @@ ifeq ($(NOFORTRAN), $(filter 0,$(NOFORTRAN))) | |||||
| touch $(LIBNAME) | touch $(LIBNAME) | ||||
| ifndef NO_FBLAS | ifndef NO_FBLAS | ||||
| $(MAKE) -C test all | $(MAKE) -C test all | ||||
| $(MAKE) -C utest all | |||||
| endif | endif | ||||
| $(MAKE) -C utest all | |||||
| ifndef NO_CBLAS | ifndef NO_CBLAS | ||||
| $(MAKE) -C ctest all | $(MAKE) -C ctest all | ||||
| ifeq ($(CPP_THREAD_SAFETY_TEST), 1) | |||||
| $(MAKE) -C cpp_thread_test all | |||||
| endif | |||||
| endif | endif | ||||
| endif | endif | ||||
| @@ -38,3 +38,8 @@ ifeq ($(CORE), THUNDERX2T99) | |||||
| CCOMMON_OPT += -march=armv8.1-a -mtune=thunderx2t99 | CCOMMON_OPT += -march=armv8.1-a -mtune=thunderx2t99 | ||||
| FCOMMON_OPT += -march=armv8.1-a -mtune=thunderx2t99 | FCOMMON_OPT += -march=armv8.1-a -mtune=thunderx2t99 | ||||
| endif | endif | ||||
| ifeq ($(CORE), TSV110) | |||||
| CCOMMON_OPT += -march=armv8.2-a -mtune=tsv110 | |||||
| FCOMMON_OPT += -march=armv8.2-a -mtune=tsv110 | |||||
| endif | |||||
| @@ -58,14 +58,14 @@ ifndef NO_LAPACKE | |||||
| endif | endif | ||||
| #for install static library | #for install static library | ||||
| ifndef NO_STATIC | |||||
| ifneq ($(NO_STATIC),1) | |||||
| @echo Copying the static library to $(DESTDIR)$(OPENBLAS_LIBRARY_DIR) | @echo Copying the static library to $(DESTDIR)$(OPENBLAS_LIBRARY_DIR) | ||||
| @install -pm644 $(LIBNAME) "$(DESTDIR)$(OPENBLAS_LIBRARY_DIR)" | @install -pm644 $(LIBNAME) "$(DESTDIR)$(OPENBLAS_LIBRARY_DIR)" | ||||
| @cd "$(DESTDIR)$(OPENBLAS_LIBRARY_DIR)" ; \ | @cd "$(DESTDIR)$(OPENBLAS_LIBRARY_DIR)" ; \ | ||||
| ln -fs $(LIBNAME) $(LIBPREFIX).$(LIBSUFFIX) | ln -fs $(LIBNAME) $(LIBPREFIX).$(LIBSUFFIX) | ||||
| endif | endif | ||||
| #for install shared library | #for install shared library | ||||
| ifndef NO_SHARED | |||||
| ifneq ($(NO_SHARED),1) | |||||
| @echo Copying the shared library to $(DESTDIR)$(OPENBLAS_LIBRARY_DIR) | @echo Copying the shared library to $(DESTDIR)$(OPENBLAS_LIBRARY_DIR) | ||||
| ifeq ($(OSNAME), $(filter $(OSNAME),Linux SunOS Android Haiku)) | ifeq ($(OSNAME), $(filter $(OSNAME),Linux SunOS Android Haiku)) | ||||
| @install -pm755 $(LIBSONAME) "$(DESTDIR)$(OPENBLAS_LIBRARY_DIR)" | @install -pm755 $(LIBSONAME) "$(DESTDIR)$(OPENBLAS_LIBRARY_DIR)" | ||||
| @@ -106,14 +106,14 @@ ifndef NO_LAPACKE | |||||
| endif | endif | ||||
| #for install static library | #for install static library | ||||
| ifndef NO_STATIC | |||||
| ifneq ($(NO_STATIC),1) | |||||
| @echo Copying the static library to $(DESTDIR)$(OPENBLAS_LIBRARY_DIR) | @echo Copying the static library to $(DESTDIR)$(OPENBLAS_LIBRARY_DIR) | ||||
| @installbsd -c -m 644 $(LIBNAME) "$(DESTDIR)$(OPENBLAS_LIBRARY_DIR)" | @installbsd -c -m 644 $(LIBNAME) "$(DESTDIR)$(OPENBLAS_LIBRARY_DIR)" | ||||
| @cd "$(DESTDIR)$(OPENBLAS_LIBRARY_DIR)" ; \ | @cd "$(DESTDIR)$(OPENBLAS_LIBRARY_DIR)" ; \ | ||||
| ln -fs $(LIBNAME) $(LIBPREFIX).$(LIBSUFFIX) | ln -fs $(LIBNAME) $(LIBPREFIX).$(LIBSUFFIX) | ||||
| endif | endif | ||||
| #for install shared library | #for install shared library | ||||
| ifndef NO_SHARED | |||||
| ifneq ($(NO_SHARED),1) | |||||
| @echo Copying the shared library to $(DESTDIR)$(OPENBLAS_LIBRARY_DIR) | @echo Copying the shared library to $(DESTDIR)$(OPENBLAS_LIBRARY_DIR) | ||||
| @installbsd -c -m 755 $(LIBSONAME) "$(DESTDIR)$(OPENBLAS_LIBRARY_DIR)" | @installbsd -c -m 755 $(LIBSONAME) "$(DESTDIR)$(OPENBLAS_LIBRARY_DIR)" | ||||
| @cd "$(DESTDIR)$(OPENBLAS_LIBRARY_DIR)" ; \ | @cd "$(DESTDIR)$(OPENBLAS_LIBRARY_DIR)" ; \ | ||||
| @@ -138,7 +138,7 @@ endif | |||||
| @echo "SET(OpenBLAS_VERSION \"${VERSION}\")" > "$(DESTDIR)$(OPENBLAS_CMAKE_DIR)/$(OPENBLAS_CMAKE_CONFIG)" | @echo "SET(OpenBLAS_VERSION \"${VERSION}\")" > "$(DESTDIR)$(OPENBLAS_CMAKE_DIR)/$(OPENBLAS_CMAKE_CONFIG)" | ||||
| @echo "SET(OpenBLAS_INCLUDE_DIRS ${OPENBLAS_INCLUDE_DIR})" >> "$(DESTDIR)$(OPENBLAS_CMAKE_DIR)/$(OPENBLAS_CMAKE_CONFIG)" | @echo "SET(OpenBLAS_INCLUDE_DIRS ${OPENBLAS_INCLUDE_DIR})" >> "$(DESTDIR)$(OPENBLAS_CMAKE_DIR)/$(OPENBLAS_CMAKE_CONFIG)" | ||||
| ifndef NO_SHARED | |||||
| ifneq ($(NO_SHARED),1) | |||||
| #ifeq logical or | #ifeq logical or | ||||
| ifeq ($(OSNAME), $(filter $(OSNAME),Linux FreeBSD NetBSD OpenBSD DragonFly)) | ifeq ($(OSNAME), $(filter $(OSNAME),Linux FreeBSD NetBSD OpenBSD DragonFly)) | ||||
| @echo "SET(OpenBLAS_LIBRARIES ${OPENBLAS_LIBRARY_DIR}/$(LIBPREFIX).so)" >> "$(DESTDIR)$(OPENBLAS_CMAKE_DIR)/$(OPENBLAS_CMAKE_CONFIG)" | @echo "SET(OpenBLAS_LIBRARIES ${OPENBLAS_LIBRARY_DIR}/$(LIBPREFIX).so)" >> "$(DESTDIR)$(OPENBLAS_CMAKE_DIR)/$(OPENBLAS_CMAKE_CONFIG)" | ||||
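The switch from `ifndef NO_SHARED` to `ifneq ($(NO_SHARED),1)` in the hunks above changes what happens when the variable is set to `0`. The following is a minimal Python sketch of the two GNU make conditionals, not part of the build system. The helper names `ifndef` and `ifneq_one` are hypothetical, and the model assumes simple literal values with no nested variable expansion.

```python
def ifndef(value):
    """Model of GNU make `ifndef VAR`: true when VAR is undefined or empty."""
    return value is None or value == ""


def ifneq_one(value):
    """Model of GNU make `ifneq ($(VAR),1)`: true unless VAR expands to "1"."""
    expanded = "" if value is None else value
    return expanded != "1"


# Compare both conditionals for the settings a user is likely to try.
for setting in (None, "", "0", "1"):
    print(repr(setting), "old:", ifndef(setting), "new:", ifneq_one(setting))
```

Under the old test, `NO_SHARED=0` counts as "defined" and suppresses the shared library; under the new test only the value `1` does.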
| @@ -9,7 +9,15 @@ else | |||||
| USE_OPENMP = 1 | USE_OPENMP = 1 | ||||
| endif | endif | ||||
| ifeq ($(CORE), POWER9) | |||||
| ifeq ($(USE_OPENMP), 1) | |||||
| COMMON_OPT += -Ofast -mcpu=power9 -mtune=power9 -mvsx -malign-power -DUSE_OPENMP -fno-fast-math -fopenmp | |||||
| FCOMMON_OPT += -O2 -frecursive -mcpu=power9 -mtune=power9 -malign-power -DUSE_OPENMP -fno-fast-math -fopenmp | |||||
| else | |||||
| COMMON_OPT += -Ofast -mcpu=power9 -mtune=power9 -mvsx -malign-power -fno-fast-math | |||||
| FCOMMON_OPT += -O2 -frecursive -mcpu=power9 -mtune=power9 -malign-power -fno-fast-math | |||||
| endif | |||||
| endif | |||||
| ifeq ($(CORE), POWER8) | ifeq ($(CORE), POWER8) | ||||
| ifeq ($(USE_OPENMP), 1) | ifeq ($(USE_OPENMP), 1) | ||||
| @@ -21,6 +29,10 @@ FCOMMON_OPT += -O2 -frecursive -mcpu=power8 -mtune=power8 -malign-power -fno-fas | |||||
| endif | endif | ||||
| endif | endif | ||||
| # workaround for C->FORTRAN ABI violation in LAPACKE | |||||
| ifeq ($(F_COMPILER), GFORTRAN) | |||||
| FCOMMON_OPT += -fno-optimize-sibling-calls | |||||
| endif | |||||
| FLAMEPATH = $(HOME)/flame/lib | FLAMEPATH = $(HOME)/flame/lib | ||||
| @@ -3,7 +3,7 @@ | |||||
| # | # | ||||
| # This library's version | # This library's version | ||||
| VERSION = 0.3.6.dev | |||||
| VERSION = 0.3.7.dev | |||||
| # If you set the suffix, the library name will be libopenblas_$(LIBNAMESUFFIX).a | # If you set the suffix, the library name will be libopenblas_$(LIBNAMESUFFIX).a | ||||
| # and libopenblas_$(LIBNAMESUFFIX).so. Meanwhile, the soname in shared library | # and libopenblas_$(LIBNAMESUFFIX).so. Meanwhile, the soname in shared library | ||||
| @@ -58,6 +58,12 @@ VERSION = 0.3.6.dev | |||||
| # For force setting for multi threaded, specify USE_THREAD = 1 | # For force setting for multi threaded, specify USE_THREAD = 1 | ||||
| # USE_THREAD = 0 | # USE_THREAD = 0 | ||||
| # If you want to build a single-threaded OpenBLAS, but expect to call this | |||||
| # from several concurrent threads in some other program, comment this in for | |||||
| # thread safety. (This is done automatically for USE_THREAD=1 , and should not | |||||
| # be necessary when USE_OPENMP=1) | |||||
| # USE_LOCKING = 1 | |||||
| # If you're going to use this library with OpenMP, please comment it in. | # If you're going to use this library with OpenMP, please comment it in. | ||||
| # This flag is always set for POWER8. Don't set USE_OPENMP = 0 if you're targeting POWER8. | # This flag is always set for POWER8. Don't set USE_OPENMP = 0 if you're targeting POWER8. | ||||
| # USE_OPENMP = 1 | # USE_OPENMP = 1 | ||||
| @@ -157,6 +163,10 @@ NO_AFFINITY = 1 | |||||
| # Don't use Haswell optimizations if binutils is too old (e.g. RHEL6) | # Don't use Haswell optimizations if binutils is too old (e.g. RHEL6) | ||||
| # NO_AVX2 = 1 | # NO_AVX2 = 1 | ||||
| # Don't use SkylakeX optimizations if binutils or compiler are too old (the build | |||||
| # system will try to determine this automatically) | |||||
| # NO_AVX512 = 1 | |||||
| # Don't use parallel make. | # Don't use parallel make. | ||||
| # NO_PARALLEL_MAKE = 1 | # NO_PARALLEL_MAKE = 1 | ||||
| @@ -181,17 +191,17 @@ NO_AFFINITY = 1 | |||||
| # time out to improve performance. This number should be from 4 to 30 | # time out to improve performance. This number should be from 4 to 30 | ||||
| # which corresponds to (1 << n) cycles. For example, if you set to 26, | # which corresponds to (1 << n) cycles. For example, if you set to 26, | ||||
| # thread will be running for (1 << 26) cycles(about 25ms on 3.0GHz | # thread will be running for (1 << 26) cycles(about 25ms on 3.0GHz | ||||
| # system). Also you can control this mumber by THREAD_TIMEOUT | |||||
| # system). Also you can control this number by THREAD_TIMEOUT | |||||
| # CCOMMON_OPT += -DTHREAD_TIMEOUT=26 | # CCOMMON_OPT += -DTHREAD_TIMEOUT=26 | ||||
| # Using special device driver for mapping physically contigous memory | |||||
| # Using special device driver for mapping physically contiguous memory | |||||
| # to the user space. If bigphysarea is enabled, it will use it. | # to the user space. If bigphysarea is enabled, it will use it. | ||||
| # DEVICEDRIVER_ALLOCATION = 1 | # DEVICEDRIVER_ALLOCATION = 1 | ||||
| # If you need to synchronize FP CSR between threads (for x86/x86_64 only). | # If you need to synchronize FP CSR between threads (for x86/x86_64 only). | ||||
| # CONSISTENT_FPCSR = 1 | # CONSISTENT_FPCSR = 1 | ||||
| # If any gemm arguement m, n or k is less or equal this threshold, gemm will be execute | |||||
| # If any gemm argument m, n or k is less than or equal to this threshold, gemm will be executed | |||||
| # with single thread. (Actually in recent versions this is a factor proportional to the | # with single thread. (Actually in recent versions this is a factor proportional to the | ||||
| # number of floating point operations necessary for the given problem size, no longer | # number of floating point operations necessary for the given problem size, no longer | ||||
| # an individual dimension). You can use this setting to avoid the overhead of multi- | # an individual dimension). You can use this setting to avoid the overhead of multi- | ||||
| @@ -199,7 +209,7 @@ NO_AFFINITY = 1 | |||||
| # been reported to be optimal for certain workloads (50 is the recommended value for Julia). | # been reported to be optimal for certain workloads (50 is the recommended value for Julia). | ||||
| # GEMM_MULTITHREAD_THRESHOLD = 4 | # GEMM_MULTITHREAD_THRESHOLD = 4 | ||||
| # If you need santy check by comparing reference BLAS. It'll be very | |||||
| # If you need sanity check by comparing results to reference BLAS. It'll be very | |||||
| # slow (Not implemented yet). | # slow (Not implemented yet). | ||||
| # SANITY_CHECK = 1 | # SANITY_CHECK = 1 | ||||
| @@ -239,6 +249,21 @@ COMMON_PROF = -pg | |||||
| # SYMBOLPREFIX= | # SYMBOLPREFIX= | ||||
| # SYMBOLSUFFIX= | # SYMBOLSUFFIX= | ||||
| # Run a C++ based thread safety tester after the build is done. | |||||
| # This is mostly intended as a developer feature to spot regressions, but users and | |||||
| # package maintainers can enable this if they have doubts about the thread safety of | |||||
| # the library, given the configuration in this file. | |||||
| # By default, the thread safety tester launches 52 concurrent calculations at the same | |||||
| # time. | |||||
| # | |||||
| # Please note that the test uses ~1300 MiB of RAM for the DGEMM test. | |||||
| # | |||||
| # The test requires CBLAS to be built, a C++11 capable compiler and the presence of | |||||
| # an OpenMP implementation. If you are cross-compiling this test will probably not | |||||
| # work at all. | |||||
| # | |||||
| # CPP_THREAD_SAFETY_TEST = 1 | |||||
| # | # | ||||
| # End of user configuration | # End of user configuration | ||||
| # | # | ||||
| @@ -9,6 +9,11 @@ ifndef TOPDIR | |||||
| TOPDIR = . | TOPDIR = . | ||||
| endif | endif | ||||
| # If ARCH is not set, we use the host system's architecture. | |||||
| ifndef ARCH | |||||
| ARCH := $(shell uname -m) | |||||
| endif | |||||
| # Catch conflicting usage of ARCH in some BSD environments | # Catch conflicting usage of ARCH in some BSD environments | ||||
| ifeq ($(ARCH), amd64) | ifeq ($(ARCH), amd64) | ||||
| override ARCH=x86_64 | override ARCH=x86_64 | ||||
| @@ -137,7 +142,12 @@ endif | |||||
| endif | endif | ||||
| # On x86_64 build getarch with march=native. This is required to detect AVX512 support in getarch. | |||||
| ifeq ($(ARCH), x86_64) | |||||
| ifneq ($(C_COMPILER), PGI) | |||||
| GETARCH_FLAGS += -march=native | |||||
| endif | |||||
| endif | |||||
| ifdef INTERFACE64 | ifdef INTERFACE64 | ||||
| ifneq ($(INTERFACE64), 0) | ifneq ($(INTERFACE64), 0) | ||||
| @@ -155,7 +165,8 @@ GETARCH_FLAGS += -DNO_AVX | |||||
| endif | endif | ||||
| ifeq ($(BINARY), 32) | ifeq ($(BINARY), 32) | ||||
| GETARCH_FLAGS += -DNO_AVX | |||||
| GETARCH_FLAGS += -DNO_AVX -DNO_AVX2 -DNO_AVX512 | |||||
| NO_AVX512 = 1 | |||||
| endif | endif | ||||
| ifeq ($(NO_AVX2), 1) | ifeq ($(NO_AVX2), 1) | ||||
| @@ -236,6 +247,10 @@ SMP = 1 | |||||
| endif | endif | ||||
| endif | endif | ||||
| ifeq ($(SMP), 1) | |||||
| USE_LOCKING = | |||||
| endif | |||||
| ifndef NEED_PIC | ifndef NEED_PIC | ||||
| NEED_PIC = 1 | NEED_PIC = 1 | ||||
| endif | endif | ||||
| @@ -387,6 +402,12 @@ ifneq ($(MAX_STACK_ALLOC), 0) | |||||
| CCOMMON_OPT += -DMAX_STACK_ALLOC=$(MAX_STACK_ALLOC) | CCOMMON_OPT += -DMAX_STACK_ALLOC=$(MAX_STACK_ALLOC) | ||||
| endif | endif | ||||
| ifdef USE_LOCKING | |||||
| ifneq ($(USE_LOCKING), 0) | |||||
| CCOMMON_OPT += -DUSE_LOCKING | |||||
| endif | |||||
| endif | |||||
| # | # | ||||
| # Architecture dependent settings | # Architecture dependent settings | ||||
| # | # | ||||
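The two `USE_LOCKING` hunks above interact: the variable is cleared when `SMP` is `1`, and `-DUSE_LOCKING` is added only if the variable is still defined and not `0`. A minimal Python sketch of that decision follows. The function name `locking_flags` is hypothetical, and the model assumes the variables are set in the makefile rather than on the make command line, where an assignment would take precedence over the makefile's own.

```python
def locking_flags(smp, use_locking=None):
    """Return the C flags that the USE_LOCKING logic would contribute."""
    # `ifeq ($(SMP), 1)` resets USE_LOCKING to empty, because a
    # multithreaded build already does its own locking.
    if smp == "1":
        use_locking = ""
    # `ifdef USE_LOCKING` followed by `ifneq ($(USE_LOCKING), 0)`:
    # undefined, empty and "0" all leave the flag off.
    if use_locking not in (None, "", "0"):
        return ["-DUSE_LOCKING"]
    return []


print(locking_flags(smp="0", use_locking="1"))
print(locking_flags(smp="1", use_locking="1"))
```

In this model only a single-threaded build with `USE_LOCKING = 1` yields the define, which matches the intent described in the `Makefile.rule` comment.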
| @@ -527,6 +548,12 @@ DYNAMIC_CORE += THUNDERX | |||||
| DYNAMIC_CORE += THUNDERX2T99 | DYNAMIC_CORE += THUNDERX2T99 | ||||
| endif | endif | ||||
| ifeq ($(ARCH), power) | |||||
| DYNAMIC_CORE = POWER6 | |||||
| DYNAMIC_CORE += POWER8 | |||||
| DYNAMIC_CORE += POWER9 | |||||
| endif | |||||
| # If DYNAMIC_CORE is not set, DYNAMIC_ARCH cannot do anything, so force it to empty | # If DYNAMIC_CORE is not set, DYNAMIC_ARCH cannot do anything, so force it to empty | ||||
| ifndef DYNAMIC_CORE | ifndef DYNAMIC_CORE | ||||
| override DYNAMIC_ARCH= | override DYNAMIC_ARCH= | ||||
| @@ -737,6 +764,8 @@ CCOMMON_OPT += -DF_INTERFACE_GFORT | |||||
| FCOMMON_OPT += -Wall | FCOMMON_OPT += -Wall | ||||
| # make single-threaded LAPACK calls thread-safe #1847 | # make single-threaded LAPACK calls thread-safe #1847 | ||||
| FCOMMON_OPT += -frecursive | FCOMMON_OPT += -frecursive | ||||
| # work around ABI problem with passing single-character arguments | |||||
| FCOMMON_OPT += -fno-optimize-sibling-calls | |||||
| #Don't include -lgfortran, when NO_LAPACK=1 or lsbcc | #Don't include -lgfortran, when NO_LAPACK=1 or lsbcc | ||||
| ifneq ($(NO_LAPACK), 1) | ifneq ($(NO_LAPACK), 1) | ||||
| EXTRALIB += -lgfortran | EXTRALIB += -lgfortran | ||||
| @@ -1042,7 +1071,7 @@ ifdef USE_SIMPLE_THREADED_LEVEL3 | |||||
| CCOMMON_OPT += -DUSE_SIMPLE_THREADED_LEVEL3 | CCOMMON_OPT += -DUSE_SIMPLE_THREADED_LEVEL3 | ||||
| endif | endif | ||||
| ifdef USE_TLS | |||||
| ifeq ($(USE_TLS), 1) | |||||
| CCOMMON_OPT += -DUSE_TLS | CCOMMON_OPT += -DUSE_TLS | ||||
| endif | endif | ||||
| @@ -28,11 +28,15 @@ endif | |||||
| ifeq ($(CORE), HASWELL) | ifeq ($(CORE), HASWELL) | ||||
| ifndef DYNAMIC_ARCH | ifndef DYNAMIC_ARCH | ||||
| ifndef NO_AVX2 | ifndef NO_AVX2 | ||||
| ifeq ($(C_COMPILER), GCC) | |||||
| CCOMMON_OPT += -mavx2 | CCOMMON_OPT += -mavx2 | ||||
| endif | |||||
| ifeq ($(F_COMPILER), GFORTRAN) | |||||
| FCOMMON_OPT += -mavx2 | FCOMMON_OPT += -mavx2 | ||||
| endif | endif | ||||
| endif | endif | ||||
| endif | endif | ||||
| endif | |||||
| @@ -4,3 +4,7 @@ CCOMMON_OPT += -march=z13 -mzvector | |||||
| FCOMMON_OPT += -march=z13 -mzvector | FCOMMON_OPT += -march=z13 -mzvector | ||||
| endif | endif | ||||
| ifeq ($(CORE), Z14) | |||||
| CCOMMON_OPT += -march=z14 -mzvector | |||||
| FCOMMON_OPT += -march=z14 -mzvector | |||||
| endif | |||||
| @@ -6,11 +6,13 @@ Travis CI: [](https://ci.appveyor.com/project/xianyi/openblas/branch/develop) | AppVeyor: [](https://ci.appveyor.com/project/xianyi/openblas/branch/develop) | ||||
| [](https://dev.azure.com/xianyi/OpenBLAS/_build/latest?definitionId=1&branchName=develop) | |||||
| ## Introduction | ## Introduction | ||||
| OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version. | OpenBLAS is an optimized BLAS library based on GotoBLAS2 1.13 BSD version. | ||||
| Please read the documentation on the OpenBLAS wiki pages: <http://github.com/xianyi/OpenBLAS/wiki>. | |||||
| Please read the documentation on the OpenBLAS wiki pages: <https://github.com/xianyi/OpenBLAS/wiki>. | |||||
| ## Binary Packages | ## Binary Packages | ||||
| @@ -22,7 +24,7 @@ You can download them from [file hosting on sourceforge.net](https://sourceforge | |||||
| ## Installation from Source | ## Installation from Source | ||||
| Download from project homepage, http://xianyi.github.com/OpenBLAS/, or check out the code | |||||
| Download from project homepage, https://xianyi.github.com/OpenBLAS/, or check out the code | |||||
| using Git from https://github.com/xianyi/OpenBLAS.git. | using Git from https://github.com/xianyi/OpenBLAS.git. | ||||
| ### Dependencies | ### Dependencies | ||||
| @@ -63,9 +65,7 @@ A debug version can be built using `make DEBUG=1`. | |||||
| ### Compile with MASS support on Power CPU (optional) | ### Compile with MASS support on Power CPU (optional) | ||||
| The [IBM MASS](http://www-01.ibm.com/software/awdtools/mass/linux/mass-linux.html) library | |||||
| consists of a set of mathematical functions for C, C++, and Fortran applications that are | |||||
| are tuned for optimum performance on POWER architectures. | |||||
| The [IBM MASS](https://www.ibm.com/support/home/product/W511326D80541V01/other_software/mathematical_acceleration_subsystem) library consists of a set of mathematical functions for C, C++, and Fortran applications that are tuned for optimum performance on POWER architectures. | |||||
| OpenBLAS with MASS requires a 64-bit, little-endian OS on POWER. | OpenBLAS with MASS requires a 64-bit, little-endian OS on POWER. | ||||
| The library can be installed as shown: | The library can be installed as shown: | ||||
| @@ -115,6 +115,7 @@ Please read `GotoBLAS_01Readme.txt`. | |||||
| - **AMD Bulldozer**: x86-64 ?GEMM FMA4 kernels. (Thanks to Werner Saar) | - **AMD Bulldozer**: x86-64 ?GEMM FMA4 kernels. (Thanks to Werner Saar) | ||||
| - **AMD PILEDRIVER**: Uses Bulldozer codes with some optimizations. | - **AMD PILEDRIVER**: Uses Bulldozer codes with some optimizations. | ||||
| - **AMD STEAMROLLER**: Uses Bulldozer codes with some optimizations. | - **AMD STEAMROLLER**: Uses Bulldozer codes with some optimizations. | ||||
| - **AMD ZEN**: Uses Haswell codes with some optimizations. | |||||
| #### MIPS64 | #### MIPS64 | ||||
| @@ -133,11 +134,13 @@ Please read `GotoBLAS_01Readme.txt`. | |||||
| #### PPC/PPC64 | #### PPC/PPC64 | ||||
| - **POWER8**: Optmized Level-3 BLAS and some Level-1, only with `USE_OPENMP=1` | |||||
| - **POWER8**: Optimized BLAS, only for PPC64LE (Little Endian), only with `USE_OPENMP=1` | |||||
| - **POWER9**: Optimized Level-3 BLAS (real) and some Level-1,2. PPC64LE with OpenMP only. | |||||
| #### IBM zEnterprise System | #### IBM zEnterprise System | ||||
| - **Z13**: Optimized Level-3 BLAS and Level-1,2 (double precision) | - **Z13**: Optimized Level-3 BLAS and Level-1,2 (double precision) | ||||
| - **Z14**: Optimized Level-3 BLAS and Level-1,2 (single precision) | |||||
| ### Supported OS | ### Supported OS | ||||
| @@ -48,6 +48,7 @@ POWER5 | |||||
| POWER6 | POWER6 | ||||
| POWER7 | POWER7 | ||||
| POWER8 | POWER8 | ||||
| POWER9 | |||||
| PPCG4 | PPCG4 | ||||
| PPC970 | PPC970 | ||||
| PPC970MP | PPC970MP | ||||
| @@ -90,7 +91,9 @@ CORTEXA73 | |||||
| FALKOR | FALKOR | ||||
| THUNDERX | THUNDERX | ||||
| THUNDERX2T99 | THUNDERX2T99 | ||||
| TSV110 | |||||
| 9.System Z: | 9.System Z: | ||||
| ZARCH_GENERIC | ZARCH_GENERIC | ||||
| Z13 | Z13 | ||||
| Z14 | |||||
| @@ -35,7 +35,14 @@ environment: | |||||
| DYNAMIC_ARCH: ON | DYNAMIC_ARCH: ON | ||||
| WITH_FORTRAN: no | WITH_FORTRAN: no | ||||
| - COMPILER: cl | - COMPILER: cl | ||||
| - COMPILER: MinGW64-gcc-7.2.0-mingw | |||||
| DYNAMIC_ARCH: OFF | |||||
| WITH_FORTRAN: ignore | |||||
| - COMPILER: MinGW64-gcc-7.2.0 | |||||
| - APPVEYOR_BUILD_WORKER_IMAGE: Visual Studio 2015 | |||||
| COMPILER: MinGW-gcc-5.3.0 | |||||
| WITH_FORTRAN: ignore | |||||
| install: | install: | ||||
| - if [%COMPILER%]==[clang-cl] call %CONDA_INSTALL_LOCN%\Scripts\activate.bat | - if [%COMPILER%]==[clang-cl] call %CONDA_INSTALL_LOCN%\Scripts\activate.bat | ||||
| - if [%COMPILER%]==[clang-cl] conda config --add channels conda-forge --force | - if [%COMPILER%]==[clang-cl] conda config --add channels conda-forge --force | ||||
| @@ -52,10 +59,17 @@ install: | |||||
| before_build: | before_build: | ||||
| - ps: if (-Not (Test-Path .\build)) { mkdir build } | - ps: if (-Not (Test-Path .\build)) { mkdir build } | ||||
| - cd build | - cd build | ||||
| - set PATH=%PATH:C:\Program Files\Git\usr\bin;=% | |||||
| - if [%COMPILER%]==[MinGW-gcc-5.3.0] set PATH=C:\MinGW\bin;C:\msys64\usr\bin;C:\mingw-w64\x86_64-7.2.0-posix-seh-rt_v5-rev1\mingw64\bin;%PATH% | |||||
| - if [%COMPILER%]==[MinGW64-gcc-7.2.0-mingw] set PATH=C:\MinGW\bin;C:\mingw-w64\x86_64-7.2.0-posix-seh-rt_v5-rev1\mingw64\bin;%PATH% | |||||
| - if [%COMPILER%]==[MinGW64-gcc-7.2.0] set PATH=C:\msys64\usr\bin;C:\mingw-w64\x86_64-7.2.0-posix-seh-rt_v5-rev1\mingw64\bin;%PATH% | |||||
| - if [%COMPILER%]==[cl] cmake -G "Visual Studio 15 2017 Win64" .. | - if [%COMPILER%]==[cl] cmake -G "Visual Studio 15 2017 Win64" .. | ||||
| - if [%WITH_FORTRAN%]==[no] cmake -G "Ninja" -DCMAKE_CXX_COMPILER=clang-cl -DCMAKE_C_COMPILER=clang-cl .. | |||||
| - if [%COMPILER%]==[MinGW64-gcc-7.2.0-mingw] cmake -G "MinGW Makefiles" -DNOFORTRAN=1 .. | |||||
| - if [%COMPILER%]==[MinGW64-gcc-7.2.0] cmake -G "MSYS Makefiles" -DBINARY=32 -DNOFORTRAN=1 .. | |||||
| - if [%COMPILER%]==[MinGW-gcc-5.3.0] cmake -G "MSYS Makefiles" -DNOFORTRAN=1 .. | |||||
| - if [%WITH_FORTRAN%]==[no] cmake -G "Ninja" -DCMAKE_CXX_COMPILER=clang-cl -DCMAKE_C_COMPILER=clang-cl -DMSVC_STATIC_CRT=ON .. | |||||
| - if [%WITH_FORTRAN%]==[yes] cmake -G "Ninja" -DCMAKE_CXX_COMPILER=clang-cl -DCMAKE_C_COMPILER=clang-cl -DCMAKE_Fortran_COMPILER=flang -DBUILD_WITHOUT_LAPACK=no -DNOFORTRAN=0 .. | - if [%WITH_FORTRAN%]==[yes] cmake -G "Ninja" -DCMAKE_CXX_COMPILER=clang-cl -DCMAKE_C_COMPILER=clang-cl -DCMAKE_Fortran_COMPILER=flang -DBUILD_WITHOUT_LAPACK=no -DNOFORTRAN=0 .. | ||||
| - if [%DYNAMIC_ARCH%]==[ON] cmake -DDYNAMIC_ARCH=ON .. | |||||
| - if [%DYNAMIC_ARCH%]==[ON] cmake -DDYNAMIC_ARCH=ON -DDYNAMIC_LIST='CORE2;NEHALEM;SANDYBRIDGE;BULLDOZER;HASWELL' .. | |||||
| build_script: | build_script: | ||||
| - cmake --build . | - cmake --build . | ||||
| @@ -64,3 +78,4 @@ test_script: | |||||
| - echo Running Test | - echo Running Test | ||||
| - cd utest | - cd utest | ||||
| - openblas_utest | - openblas_utest | ||||
| @@ -0,0 +1,51 @@ | |||||
| trigger: | |||||
| # start a new build for every push | |||||
| batch: False | |||||
| branches: | |||||
| include: | |||||
| - develop | |||||
| jobs: | |||||
| # manylinux1 is useful to test because the | |||||
| # standard Docker container uses an old version | |||||
| # of gcc / glibc | |||||
| - job: manylinux1_gcc | |||||
| pool: | |||||
| vmImage: 'ubuntu-16.04' | |||||
| steps: | |||||
| - script: | | |||||
| echo "FROM quay.io/pypa/manylinux1_x86_64 | |||||
| COPY . /tmp/openblas | |||||
| RUN cd /tmp/openblas && \ | |||||
| COMMON_FLAGS='DYNAMIC_ARCH=1 TARGET=NEHALEM NUM_THREADS=32' && \ | |||||
| BTYPE='BINARY=64' CC=gcc && \ | |||||
| make QUIET_MAKE=1 $COMMON_FLAGS $BTYPE && \ | |||||
| make -C test $COMMON_FLAGS $BTYPE && \ | |||||
| make -C ctest $COMMON_FLAGS $BTYPE && \ | |||||
| make -C utest $COMMON_FLAGS $BTYPE" > Dockerfile | |||||
| docker build . | |||||
| displayName: Run manylinux1 docker build | |||||
| - job: Intel_SDE_skx | |||||
| pool: | |||||
| vmImage: 'ubuntu-16.04' | |||||
| steps: | |||||
| - script: | | |||||
| # at the time of writing the available Azure Ubuntu vm image | |||||
| # does not support AVX512VL, so use more recent LTS version | |||||
| echo "FROM ubuntu:bionic | |||||
| COPY . /tmp/openblas | |||||
| RUN apt-get -y update && apt-get -y install \\ | |||||
| cmake \\ | |||||
| gfortran \\ | |||||
| make \\ | |||||
| wget | |||||
| RUN mkdir /tmp/SDE && cd /tmp/SDE && \\ | |||||
| mkdir sde-external-8.35.0-2019-03-11-lin && \\ | |||||
| wget --quiet -O sde-external-8.35.0-2019-03-11-lin.tar.bz2 https://www.dropbox.com/s/fopsnzj67572sj5/sde-external-8.35.0-2019-03-11-lin.tar.bz2?dl=0 && \\ | |||||
| tar -xjvf sde-external-8.35.0-2019-03-11-lin.tar.bz2 -C /tmp/SDE/sde-external-8.35.0-2019-03-11-lin --strip-components=1 | |||||
| RUN cd /tmp/openblas && CC=gcc make QUIET_MAKE=1 DYNAMIC_ARCH=1 NUM_THREADS=32 BINARY=64 | |||||
| CMD cd /tmp/openblas && echo 0 > /proc/sys/kernel/yama/ptrace_scope && CC=gcc OPENBLAS_VERBOSE=2 /tmp/SDE/sde-external-8.35.0-2019-03-11-lin/sde64 -cpuid_in /tmp/SDE/sde-external-8.35.0-2019-03-11-lin/misc/cpuid/skx/cpuid.def -- make -C utest DYNAMIC_ARCH=1 NUM_THREADS=32 BINARY=64" > Dockerfile | |||||
| docker build -t intel_sde . | |||||
| # we need a privileged docker run for sde process attachment | |||||
| docker run --privileged intel_sde | |||||
| displayName: 'Run AVX512 SkylakeX docker build / test' | |||||
| @@ -207,7 +207,7 @@ int main(int argc, char *argv[]){ | |||||
| for (i = 0; i < m * n * COMPSIZE; i++) { | for (i = 0; i < m * n * COMPSIZE; i++) { | ||||
| c[i] = ((FLOAT) rand() / (FLOAT) RAND_MAX) - 0.5; | c[i] = ((FLOAT) rand() / (FLOAT) RAND_MAX) - 0.5; | ||||
| } | } | ||||
| fprintf(stderr, " SIZE Flops Time\n"); | fprintf(stderr, " SIZE Flops Time\n"); | ||||
| for (i = from; i <= to; i += step) { | for (i = from; i <= to; i += step) { | ||||
| @@ -2,6 +2,8 @@ | |||||
| argv <- commandArgs(trailingOnly = TRUE) | argv <- commandArgs(trailingOnly = TRUE) | ||||
| if (!is.null(options("matprod")[[1]])) options(matprod = "blas") | |||||
| nfrom <- 128 | nfrom <- 128 | ||||
| nto <- 2048 | nto <- 2048 | ||||
| nstep <- 128 | nstep <- 128 | ||||
| @@ -19,7 +21,6 @@ if (length(argv) > 0) { | |||||
| loops <- as.numeric(argv[z]) | loops <- as.numeric(argv[z]) | ||||
| } | } | ||||
| } | } | ||||
| } | } | ||||
| p <- Sys.getenv("OPENBLAS_LOOPS") | p <- Sys.getenv("OPENBLAS_LOOPS") | ||||
| @@ -27,29 +28,21 @@ if (p != "") { | |||||
| loops <- as.numeric(p) | loops <- as.numeric(p) | ||||
| } | } | ||||
| cat(sprintf( | |||||
| "From %.0f To %.0f Step=%.0f Loops=%.0f\n", | |||||
| nfrom, | |||||
| nto, | |||||
| nstep, | |||||
| loops | |||||
| )) | |||||
| cat(sprintf("From %.0f To %.0f Step=%.0f Loops=%.0f\n", nfrom, nto, nstep, loops)) | |||||
| cat(sprintf(" SIZE Flops Time\n")) | cat(sprintf(" SIZE Flops Time\n")) | ||||
| n <- nfrom | n <- nfrom | ||||
| while (n <= nto) { | while (n <= nto) { | ||||
| A <- matrix(rnorm(n * n), ncol = n, nrow = n) | |||||
| A <- matrix(rnorm(n * n), nrow = n) | |||||
| ev <- 0 | ev <- 0 | ||||
| z <- system.time(for (l in 1:loops) { | z <- system.time(for (l in 1:loops) { | ||||
| ev <- eigen(A) | ev <- eigen(A) | ||||
| }) | }) | ||||
| mflops <- (26.66 * n * n * n) * loops / (z[3] * 1.0e6) | |||||
| mflops <- (26.66 * n * n * n) * loops / (z[3] * 1e+06) | |||||
| st <- sprintf("%.0fx%.0f :", n, n) | st <- sprintf("%.0fx%.0f :", n, n) | ||||
| cat(sprintf("%20s %10.2f MFlops %10.6f sec\n", st, mflops, z[3])) | cat(sprintf("%20s %10.2f MFlops %10.6f sec\n", st, mflops, z[3])) | ||||
| n <- n + nstep | n <- n + nstep | ||||
| } | } | ||||
| @@ -2,6 +2,8 @@ | |||||
| argv <- commandArgs(trailingOnly = TRUE) | argv <- commandArgs(trailingOnly = TRUE) | ||||
| if (!is.null(options("matprod")[[1]])) options(matprod = "blas") | |||||
| nfrom <- 128 | nfrom <- 128 | ||||
| nto <- 2048 | nto <- 2048 | ||||
| nstep <- 128 | nstep <- 128 | ||||
| @@ -19,7 +21,6 @@ if (length(argv) > 0) { | |||||
| loops <- as.numeric(argv[z]) | loops <- as.numeric(argv[z]) | ||||
| } | } | ||||
| } | } | ||||
| } | } | ||||
| p <- Sys.getenv("OPENBLAS_LOOPS") | p <- Sys.getenv("OPENBLAS_LOOPS") | ||||
| @@ -27,26 +28,13 @@ if (p != "") { | |||||
| loops <- as.numeric(p) | loops <- as.numeric(p) | ||||
| } | } | ||||
| cat(sprintf( | |||||
| "From %.0f To %.0f Step=%.0f Loops=%.0f\n", | |||||
| nfrom, | |||||
| nto, | |||||
| nstep, | |||||
| loops | |||||
| )) | |||||
| cat(sprintf("From %.0f To %.0f Step=%.0f Loops=%.0f\n", nfrom, nto, nstep, loops)) | |||||
| cat(sprintf(" SIZE Flops Time\n")) | cat(sprintf(" SIZE Flops Time\n")) | ||||
| n <- nfrom | n <- nfrom | ||||
| while (n <= nto) { | while (n <= nto) { | ||||
| A <- matrix(runif(n * n), | |||||
| ncol = n, | |||||
| nrow = n, | |||||
| byrow = TRUE) | |||||
| B <- matrix(runif(n * n), | |||||
| ncol = n, | |||||
| nrow = n, | |||||
| byrow = TRUE) | |||||
| A <- matrix(runif(n * n), nrow = n) | |||||
| B <- matrix(runif(n * n), nrow = n) | |||||
| C <- 1 | C <- 1 | ||||
| z <- system.time(for (l in 1:loops) { | z <- system.time(for (l in 1:loops) { | ||||
| @@ -54,11 +42,10 @@ while (n <= nto) { | |||||
| l <- l + 1 | l <- l + 1 | ||||
| }) | }) | ||||
| mflops <- (2.0 * n * n * n) * loops / (z[3] * 1.0e6) | |||||
| mflops <- (2.0 * n * n * n) * loops / (z[3] * 1e+06) | |||||
| st <- sprintf("%.0fx%.0f :", n, n) | st <- sprintf("%.0fx%.0f :", n, n) | ||||
| cat(sprintf("%20s %10.2f MFlops %10.6f sec\n", st, mflops, z[3])) | cat(sprintf("%20s %10.2f MFlops %10.6f sec\n", st, mflops, z[3])) | ||||
| n <- n + nstep | n <- n + nstep | ||||
| } | } | ||||
| @@ -2,6 +2,8 @@ | |||||
| argv <- commandArgs(trailingOnly = TRUE) | argv <- commandArgs(trailingOnly = TRUE) | ||||
| if (!is.null(options("matprod")[[1]])) options(matprod = "blas") | |||||
| nfrom <- 128 | nfrom <- 128 | ||||
| nto <- 2048 | nto <- 2048 | ||||
| nstep <- 128 | nstep <- 128 | ||||
| @@ -19,7 +21,6 @@ if (length(argv) > 0) { | |||||
| loops <- as.numeric(argv[z]) | loops <- as.numeric(argv[z]) | ||||
| } | } | ||||
| } | } | ||||
| } | } | ||||
| p <- Sys.getenv("OPENBLAS_LOOPS") | p <- Sys.getenv("OPENBLAS_LOOPS") | ||||
| @@ -27,31 +28,22 @@ if (p != "") { | |||||
| loops <- as.numeric(p) | loops <- as.numeric(p) | ||||
| } | } | ||||
| cat(sprintf( | |||||
| "From %.0f To %.0f Step=%.0f Loops=%.0f\n", | |||||
| nfrom, | |||||
| nto, | |||||
| nstep, | |||||
| loops | |||||
| )) | |||||
| cat(sprintf("From %.0f To %.0f Step=%.0f Loops=%.0f\n", nfrom, nto, nstep, loops)) | |||||
| cat(sprintf(" SIZE Flops Time\n")) | cat(sprintf(" SIZE Flops Time\n")) | ||||
| n <- nfrom | n <- nfrom | ||||
| while (n <= nto) { | while (n <= nto) { | ||||
| A <- matrix(rnorm(n * n), ncol = n, nrow = n) | |||||
| B <- matrix(rnorm(n * n), ncol = n, nrow = n) | |||||
| A <- matrix(rnorm(n * n), nrow = n) | |||||
| B <- matrix(rnorm(n * n), nrow = n) | |||||
| z <- system.time(for (l in 1:loops) { | z <- system.time(for (l in 1:loops) { | ||||
| solve(A, B) | solve(A, B) | ||||
| }) | }) | ||||
| mflops <- | |||||
| (2.0 / 3.0 * n * n * n + 2.0 * n * n * n) * loops / (z[3] * 1.0e6) | |||||
| mflops <- (8.0 / 3 * n * n * n) * loops / (z[3] * 1e+06) | |||||
| st <- sprintf("%.0fx%.0f :", n, n) | st <- sprintf("%.0fx%.0f :", n, n) | ||||
| cat(sprintf("%20s %10.2f MFlops %10.6f sec\n", st, mflops, z[3])) | cat(sprintf("%20s %10.2f MFlops %10.6f sec\n", st, mflops, z[3])) | ||||
| n <- n + nstep | n <- n + nstep | ||||
| } | } | ||||
| @@ -1,7 +1,7 @@ | |||||
| #!/usr/bin/perl | #!/usr/bin/perl | ||||
| use File::Basename; | |||||
| use File::Temp qw(tempfile); | |||||
| #use File::Basename; | |||||
| # use File::Temp qw(tempfile); | |||||
| # Checking cross compile | # Checking cross compile | ||||
| $hostos = `uname -s | sed -e s/\-.*//`; chop($hostos); | $hostos = `uname -s | sed -e s/\-.*//`; chop($hostos); | ||||
| @@ -12,7 +12,7 @@ $hostarch = "arm64" if ($hostarch eq "aarch64"); | |||||
| $hostarch = "power" if ($hostarch =~ /^(powerpc|ppc).*/); | $hostarch = "power" if ($hostarch =~ /^(powerpc|ppc).*/); | ||||
| $hostarch = "zarch" if ($hostarch eq "s390x"); | $hostarch = "zarch" if ($hostarch eq "s390x"); | ||||
| $tmpf = new File::Temp( UNLINK => 1 ); | |||||
| #$tmpf = new File::Temp( UNLINK => 1 ); | |||||
| $binary = $ENV{"BINARY"}; | $binary = $ENV{"BINARY"}; | ||||
| $makefile = shift(@ARGV); | $makefile = shift(@ARGV); | ||||
| @@ -31,12 +31,25 @@ if ($?) { | |||||
| $cross_suffix = ""; | $cross_suffix = ""; | ||||
| if (dirname($compiler_name) ne ".") { | |||||
| $cross_suffix .= dirname($compiler_name) . "/"; | |||||
| } | |||||
| eval "use File::Basename"; | |||||
| if ($@){ | |||||
| warn "could not load PERL module File::Basename, emulating its functionality"; | |||||
| my $dirnam = (rindex($compiler_name, "/") >= 0) ? substr($compiler_name, 0, rindex($compiler_name, "/")) : "."; | |||||
| if ($dirnam ne ".") { | |||||
| $cross_suffix .= $dirnam . "/"; | |||||
| } | |||||
| my $basnam = substr($compiler_name, rindex($compiler_name,"/")+1, length($compiler_name)-rindex($compiler_name,"/")-1); | |||||
| if ($basnam =~ /([^\s]*-)(.*)/) { | |||||
| $cross_suffix .= $1; | |||||
| } | |||||
| } else { | |||||
| if (dirname($compiler_name) ne ".") { | |||||
| $cross_suffix .= dirname($compiler_name) . "/"; | |||||
| } | |||||
| if (basename($compiler_name) =~ /([^\s]*-)(.*)/) { | |||||
| $cross_suffix .= $1; | |||||
| if (basename($compiler_name) =~ /([^\s]*-)(.*)/) { | |||||
| $cross_suffix .= $1; | |||||
| } | |||||
| } | } | ||||
| $compiler = ""; | $compiler = ""; | ||||
| @@ -171,20 +184,26 @@ if ($?) { | |||||
| $have_msa = 0; | $have_msa = 0; | ||||
| if (($architecture eq "mips") || ($architecture eq "mips64")) { | if (($architecture eq "mips") || ($architecture eq "mips64")) { | ||||
| $code = '"addvi.b $w0, $w1, 1"'; | |||||
| $msa_flags = "-mmsa -mfp64 -msched-weight -mload-store-pairs"; | |||||
| print $tmpf "#include <msa.h>\n\n"; | |||||
| print $tmpf "void main(void){ __asm__ volatile($code); }\n"; | |||||
| $args = "$msa_flags -o $tmpf.o -x c $tmpf"; | |||||
| my @cmd = ("$compiler_name $args"); | |||||
| system(@cmd) == 0; | |||||
| if ($? != 0) { | |||||
| $have_msa = 0; | |||||
| eval "use File::Temp qw(tempfile)"; | |||||
| if ($@){ | |||||
| warn "could not load PERL module File::Temp, so could not check MSA capability"; | |||||
| } else { | } else { | ||||
| $have_msa = 1; | |||||
| $tmpf = new File::Temp( UNLINK => 1 ); | |||||
| $code = '"addvi.b $w0, $w1, 1"'; | |||||
| $msa_flags = "-mmsa -mfp64 -msched-weight -mload-store-pairs"; | |||||
| print $tmpf "#include <msa.h>\n\n"; | |||||
| print $tmpf "void main(void){ __asm__ volatile($code); }\n"; | |||||
| $args = "$msa_flags -o $tmpf.o -x c $tmpf"; | |||||
| my @cmd = ("$compiler_name $args"); | |||||
| system(@cmd) == 0; | |||||
| if ($? != 0) { | |||||
| $have_msa = 0; | |||||
| } else { | |||||
| $have_msa = 1; | |||||
| } | |||||
| unlink("$tmpf.o"); | |||||
| } | } | ||||
| unlink("$tmpf.o"); | |||||
| } | } | ||||
| $architecture = x86 if ($data =~ /ARCH_X86/); | $architecture = x86 if ($data =~ /ARCH_X86/); | ||||
| @@ -204,17 +223,25 @@ $binformat = bin64 if ($data =~ /BINARY_64/); | |||||
| $no_avx512= 0; | $no_avx512= 0; | ||||
| if (($architecture eq "x86") || ($architecture eq "x86_64")) { | if (($architecture eq "x86") || ($architecture eq "x86_64")) { | ||||
| $code = '"vbroadcastss -4 * 4(%rsi), %zmm2"'; | |||||
| print $tmpf "#include <immintrin.h>\n\nint main(void){ __asm__ volatile($code); }\n"; | |||||
| $args = " -march=skylake-avx512 -o $tmpf.o -x c $tmpf"; | |||||
| my @cmd = ("$compiler_name $args >/dev/null 2>/dev/null"); | |||||
| system(@cmd) == 0; | |||||
| if ($? != 0) { | |||||
| $no_avx512 = 1; | |||||
| } else { | |||||
| eval "use File::Temp qw(tempfile)"; | |||||
| if ($@){ | |||||
| warn "could not load PERL module File::Temp, so could not check compiler compatibility with AVX512"; | |||||
| $no_avx512 = 0; | $no_avx512 = 0; | ||||
| } else { | |||||
| # $tmpf = new File::Temp( UNLINK => 1 ); | |||||
| ($fh,$tmpf) = tempfile( UNLINK => 1 ); | |||||
| $code = '"vbroadcastss -4 * 4(%rsi), %zmm2"'; | |||||
| print $fh "#include <immintrin.h>\n\nint main(void){ __asm__ volatile($code); }\n"; | |||||
| $args = " -march=skylake-avx512 -c -o $tmpf.o -x c $tmpf"; | |||||
| my @cmd = ("$compiler_name $args >/dev/null 2>/dev/null"); | |||||
| system(@cmd) == 0; | |||||
| if ($? != 0) { | |||||
| $no_avx512 = 1; | |||||
| } else { | |||||
| $no_avx512 = 0; | |||||
| } | |||||
| unlink("$tmpf.o"); | |||||
| } | } | ||||
| unlink("tmpf.o"); | |||||
| } | } | ||||
| $data = `$compiler_name -S ctest1.c && grep globl ctest1.s | head -n 1 && rm -f ctest1.s`; | $data = `$compiler_name -S ctest1.c && grep globl ctest1.s | head -n 1 && rm -f ctest1.s`; | ||||
| @@ -73,6 +73,11 @@ double cblas_dasum (OPENBLAS_CONST blasint n, OPENBLAS_CONST double *x, OPENBLAS | |||||
| float cblas_scasum(OPENBLAS_CONST blasint n, OPENBLAS_CONST void *x, OPENBLAS_CONST blasint incx); | float cblas_scasum(OPENBLAS_CONST blasint n, OPENBLAS_CONST void *x, OPENBLAS_CONST blasint incx); | ||||
| double cblas_dzasum(OPENBLAS_CONST blasint n, OPENBLAS_CONST void *x, OPENBLAS_CONST blasint incx); | double cblas_dzasum(OPENBLAS_CONST blasint n, OPENBLAS_CONST void *x, OPENBLAS_CONST blasint incx); | ||||
| float cblas_ssum (OPENBLAS_CONST blasint n, OPENBLAS_CONST float *x, OPENBLAS_CONST blasint incx); | |||||
| double cblas_dsum (OPENBLAS_CONST blasint n, OPENBLAS_CONST double *x, OPENBLAS_CONST blasint incx); | |||||
| float cblas_scsum(OPENBLAS_CONST blasint n, OPENBLAS_CONST void *x, OPENBLAS_CONST blasint incx); | |||||
| double cblas_dzsum(OPENBLAS_CONST blasint n, OPENBLAS_CONST void *x, OPENBLAS_CONST blasint incx); | |||||
| float cblas_snrm2 (OPENBLAS_CONST blasint N, OPENBLAS_CONST float *X, OPENBLAS_CONST blasint incX); | float cblas_snrm2 (OPENBLAS_CONST blasint N, OPENBLAS_CONST float *X, OPENBLAS_CONST blasint incX); | ||||
| double cblas_dnrm2 (OPENBLAS_CONST blasint N, OPENBLAS_CONST double *X, OPENBLAS_CONST blasint incX); | double cblas_dnrm2 (OPENBLAS_CONST blasint N, OPENBLAS_CONST double *X, OPENBLAS_CONST blasint incX); | ||||
| float cblas_scnrm2(OPENBLAS_CONST blasint N, OPENBLAS_CONST void *X, OPENBLAS_CONST blasint incX); | float cblas_scnrm2(OPENBLAS_CONST blasint N, OPENBLAS_CONST void *X, OPENBLAS_CONST blasint incX); | ||||
| @@ -88,6 +93,16 @@ CBLAS_INDEX cblas_idamin(OPENBLAS_CONST blasint n, OPENBLAS_CONST double *x, OPE | |||||
| CBLAS_INDEX cblas_icamin(OPENBLAS_CONST blasint n, OPENBLAS_CONST void *x, OPENBLAS_CONST blasint incx); | CBLAS_INDEX cblas_icamin(OPENBLAS_CONST blasint n, OPENBLAS_CONST void *x, OPENBLAS_CONST blasint incx); | ||||
| CBLAS_INDEX cblas_izamin(OPENBLAS_CONST blasint n, OPENBLAS_CONST void *x, OPENBLAS_CONST blasint incx); | CBLAS_INDEX cblas_izamin(OPENBLAS_CONST blasint n, OPENBLAS_CONST void *x, OPENBLAS_CONST blasint incx); | ||||
| CBLAS_INDEX cblas_ismax(OPENBLAS_CONST blasint n, OPENBLAS_CONST float *x, OPENBLAS_CONST blasint incx); | |||||
| CBLAS_INDEX cblas_idmax(OPENBLAS_CONST blasint n, OPENBLAS_CONST double *x, OPENBLAS_CONST blasint incx); | |||||
| CBLAS_INDEX cblas_icmax(OPENBLAS_CONST blasint n, OPENBLAS_CONST void *x, OPENBLAS_CONST blasint incx); | |||||
| CBLAS_INDEX cblas_izmax(OPENBLAS_CONST blasint n, OPENBLAS_CONST void *x, OPENBLAS_CONST blasint incx); | |||||
| CBLAS_INDEX cblas_ismin(OPENBLAS_CONST blasint n, OPENBLAS_CONST float *x, OPENBLAS_CONST blasint incx); | |||||
| CBLAS_INDEX cblas_idmin(OPENBLAS_CONST blasint n, OPENBLAS_CONST double *x, OPENBLAS_CONST blasint incx); | |||||
| CBLAS_INDEX cblas_icmin(OPENBLAS_CONST blasint n, OPENBLAS_CONST void *x, OPENBLAS_CONST blasint incx); | |||||
| CBLAS_INDEX cblas_izmin(OPENBLAS_CONST blasint n, OPENBLAS_CONST void *x, OPENBLAS_CONST blasint incx); | |||||
| void cblas_saxpy(OPENBLAS_CONST blasint n, OPENBLAS_CONST float alpha, OPENBLAS_CONST float *x, OPENBLAS_CONST blasint incx, float *y, OPENBLAS_CONST blasint incy); | void cblas_saxpy(OPENBLAS_CONST blasint n, OPENBLAS_CONST float alpha, OPENBLAS_CONST float *x, OPENBLAS_CONST blasint incx, float *y, OPENBLAS_CONST blasint incy); | ||||
| void cblas_daxpy(OPENBLAS_CONST blasint n, OPENBLAS_CONST double alpha, OPENBLAS_CONST double *x, OPENBLAS_CONST blasint incx, double *y, OPENBLAS_CONST blasint incy); | void cblas_daxpy(OPENBLAS_CONST blasint n, OPENBLAS_CONST double alpha, OPENBLAS_CONST double *x, OPENBLAS_CONST blasint incx, double *y, OPENBLAS_CONST blasint incy); | ||||
| void cblas_caxpy(OPENBLAS_CONST blasint n, OPENBLAS_CONST void *alpha, OPENBLAS_CONST void *x, OPENBLAS_CONST blasint incx, void *y, OPENBLAS_CONST blasint incy); | void cblas_caxpy(OPENBLAS_CONST blasint n, OPENBLAS_CONST void *alpha, OPENBLAS_CONST void *x, OPENBLAS_CONST blasint incx, void *y, OPENBLAS_CONST blasint incy); | ||||
| @@ -73,11 +73,16 @@ if (DYNAMIC_ARCH) | |||||
| endif () | endif () | ||||
| if (NOT NO_AVX512) | if (NOT NO_AVX512) | ||||
| set(DYNAMIC_CORE ${DYNAMIC_CORE} SKYLAKEX) | set(DYNAMIC_CORE ${DYNAMIC_CORE} SKYLAKEX) | ||||
| string(REGEX REPLACE "-march=native" "" CMAKE_C_FLAGS ${CMAKE_C_FLAGS}) | |||||
| endif () | |||||
| if (DYNAMIC_LIST) | |||||
| set(DYNAMIC_CORE PRESCOTT ${DYNAMIC_LIST}) | |||||
| endif () | endif () | ||||
| endif () | endif () | ||||
| if (NOT DYNAMIC_CORE) | if (NOT DYNAMIC_CORE) | ||||
| unset(DYNAMIC_ARCH) | |||||
| message (STATUS "DYNAMIC_ARCH is not supported on this architecture, removing from options") | |||||
| unset(DYNAMIC_ARCH CACHE) | |||||
| endif () | endif () | ||||
| endif () | endif () | ||||
| @@ -44,7 +44,10 @@ endif () | |||||
| if (${F_COMPILER} STREQUAL "GFORTRAN") | if (${F_COMPILER} STREQUAL "GFORTRAN") | ||||
| set(CCOMMON_OPT "${CCOMMON_OPT} -DF_INTERFACE_GFORT") | set(CCOMMON_OPT "${CCOMMON_OPT} -DF_INTERFACE_GFORT") | ||||
| # ensure reentrancy of lapack codes | |||||
| set(FCOMMON_OPT "${FCOMMON_OPT} -Wall -frecursive") | set(FCOMMON_OPT "${FCOMMON_OPT} -Wall -frecursive") | ||||
| # work around ABI violation in passing string arguments from C | |||||
| set(FCOMMON_OPT "${FCOMMON_OPT} -fno-optimize-sibling-calls") | |||||
| #Don't include -lgfortran, when NO_LAPACK=1 or lsbcc | #Don't include -lgfortran, when NO_LAPACK=1 or lsbcc | ||||
| if (NOT NO_LAPACK) | if (NOT NO_LAPACK) | ||||
| set(EXTRALIB "${EXTRALIB} -lgfortran") | set(EXTRALIB "${EXTRALIB} -lgfortran") | ||||
| @@ -1,7 +1,7 @@ | |||||
| # helper functions for the kernel CMakeLists.txt | # helper functions for the kernel CMakeLists.txt | ||||
| # Set the default filenames for L1 objects. Most of these will be overriden by the appropriate KERNEL file. | |||||
| # Set the default filenames for L1 objects. Most of these will be overridden by the appropriate KERNEL file. | |||||
| macro(SetDefaultL1) | macro(SetDefaultL1) | ||||
| set(SAMAXKERNEL amax.S) | set(SAMAXKERNEL amax.S) | ||||
| set(DAMAXKERNEL amax.S) | set(DAMAXKERNEL amax.S) | ||||
| @@ -107,6 +107,12 @@ macro(SetDefaultL1) | |||||
| set(DAXPBYKERNEL ../arm/axpby.c) | set(DAXPBYKERNEL ../arm/axpby.c) | ||||
| set(CAXPBYKERNEL ../arm/zaxpby.c) | set(CAXPBYKERNEL ../arm/zaxpby.c) | ||||
| set(ZAXPBYKERNEL ../arm/zaxpby.c) | set(ZAXPBYKERNEL ../arm/zaxpby.c) | ||||
| set(SSUMKERNEL sum.S) | |||||
| set(DSUMKERNEL sum.S) | |||||
| set(CSUMKERNEL zsum.S) | |||||
| set(ZSUMKERNEL zsum.S) | |||||
| set(QSUMKERNEL sum.S) | |||||
| set(XSUMKERNEL zsum.S) | |||||
| endmacro () | endmacro () | ||||
| macro(SetDefaultL2) | macro(SetDefaultL2) | ||||
| @@ -162,4 +168,4 @@ macro(SetDefaultL3) | |||||
| set(DGEADD_KERNEL ../generic/geadd.c) | set(DGEADD_KERNEL ../generic/geadd.c) | ||||
| set(CGEADD_KERNEL ../generic/zgeadd.c) | set(CGEADD_KERNEL ../generic/zgeadd.c) | ||||
| set(ZGEADD_KERNEL ../generic/zgeadd.c) | set(ZGEADD_KERNEL ../generic/zgeadd.c) | ||||
| endmacro () | |||||
| endmacro () | |||||
| @@ -8,6 +8,11 @@ if (${CMAKE_SYSTEM_NAME} STREQUAL "Linux") | |||||
| set(NO_EXPRECISION 1) | set(NO_EXPRECISION 1) | ||||
| endif () | endif () | ||||
| if (${CMAKE_SYSTEM_NAME} MATCHES "FreeBSD|OpenBSD|NetBSD|DragonFly") | |||||
| set(EXTRALIB "${EXTRALIB} -lm") | |||||
| set(NO_EXPRECISION 1) | |||||
| endif () | |||||
| if (${CMAKE_SYSTEM_NAME} STREQUAL "AIX") | if (${CMAKE_SYSTEM_NAME} STREQUAL "AIX") | ||||
| set(EXTRALIB "${EXTRALIB} -lm") | set(EXTRALIB "${EXTRALIB} -lm") | ||||
| endif () | endif () | ||||
| @@ -59,6 +59,9 @@ set(FU "") | |||||
| if (APPLE OR (MSVC AND NOT ${CMAKE_C_COMPILER_ID} MATCHES "Clang")) | if (APPLE OR (MSVC AND NOT ${CMAKE_C_COMPILER_ID} MATCHES "Clang")) | ||||
| set(FU "_") | set(FU "_") | ||||
| endif() | endif() | ||||
| if(MINGW AND NOT MINGW64) | |||||
| set(FU "_") | |||||
| endif() | |||||
| set(COMPILER_ID ${CMAKE_C_COMPILER_ID}) | set(COMPILER_ID ${CMAKE_C_COMPILER_ID}) | ||||
| if (${COMPILER_ID} STREQUAL "GNU") | if (${COMPILER_ID} STREQUAL "GNU") | ||||
| @@ -82,6 +85,11 @@ endif () | |||||
| # f_check | # f_check | ||||
| if (NOT NOFORTRAN) | if (NOT NOFORTRAN) | ||||
| include("${PROJECT_SOURCE_DIR}/cmake/f_check.cmake") | include("${PROJECT_SOURCE_DIR}/cmake/f_check.cmake") | ||||
| else () | |||||
| file(APPEND ${TARGET_CONF_TEMP} | |||||
| "#define BUNDERSCORE _\n" | |||||
| "#define NEEDBUNDERSCORE 1\n") | |||||
| set(BU "_") | |||||
| endif () | endif () | ||||
| # Cannot run getarch on target if we are cross-compiling | # Cannot run getarch on target if we are cross-compiling | ||||
| @@ -65,6 +65,18 @@ if (DEFINED TARGET) | |||||
| set(GETARCH_FLAGS "-DFORCE_${TARGET}") | set(GETARCH_FLAGS "-DFORCE_${TARGET}") | ||||
| endif () | endif () | ||||
| # On x86_64 build getarch with march=native. This is required to detect AVX512 support in getarch. | |||||
| if (X86_64) | |||||
| set(GETARCH_FLAGS "${GETARCH_FLAGS} -march=native") | |||||
| endif () | |||||
| # 32-bit x86 builds do not support AVX, AVX2 or AVX512 | |||||
| if (X86 OR X86_64) | |||||
| if ((DEFINED BINARY AND BINARY EQUAL 32) OR ("${CMAKE_SIZEOF_VOID_P}" EQUAL "4")) | |||||
| set(GETARCH_FLAGS "${GETARCH_FLAGS} -DNO_AVX -DNO_AVX2 -DNO_AVX512") | |||||
| endif () | |||||
| endif () | |||||
| if (INTERFACE64) | if (INTERFACE64) | ||||
| message(STATUS "Using 64-bit integers.") | message(STATUS "Using 64-bit integers.") | ||||
| set(GETARCH_FLAGS "${GETARCH_FLAGS} -DUSE64BITINT") | set(GETARCH_FLAGS "${GETARCH_FLAGS} -DUSE64BITINT") | ||||
| @@ -136,10 +148,16 @@ endif () | |||||
| if (USE_THREAD) | if (USE_THREAD) | ||||
| message(STATUS "Multi-threading enabled with ${NUM_THREADS} threads.") | message(STATUS "Multi-threading enabled with ${NUM_THREADS} threads.") | ||||
| else() | |||||
| if (${USE_LOCKING}) | |||||
| set(CCOMMON_OPT "${CCOMMON_OPT} -DUSE_LOCKING") | |||||
| endif () | |||||
| endif () | endif () | ||||
| include("${PROJECT_SOURCE_DIR}/cmake/prebuild.cmake") | include("${PROJECT_SOURCE_DIR}/cmake/prebuild.cmake") | ||||
| if (DEFINED BINARY) | |||||
| message(STATUS "Compiling a ${BINARY}-bit binary.") | |||||
| endif () | |||||
| if (NOT DEFINED NEED_PIC) | if (NOT DEFINED NEED_PIC) | ||||
| set(NEED_PIC 1) | set(NEED_PIC 1) | ||||
| endif () | endif () | ||||
| @@ -156,6 +174,9 @@ include("${PROJECT_SOURCE_DIR}/cmake/cc.cmake") | |||||
| if (NOT NOFORTRAN) | if (NOT NOFORTRAN) | ||||
| # Fortran Compiler dependent settings | # Fortran Compiler dependent settings | ||||
| include("${PROJECT_SOURCE_DIR}/cmake/fc.cmake") | include("${PROJECT_SOURCE_DIR}/cmake/fc.cmake") | ||||
| else () | |||||
| set(NO_LAPACK 1) | |||||
| set(NO_LAPACKE 1) | |||||
| endif () | endif () | ||||
| if (BINARY64) | if (BINARY64) | ||||
| @@ -181,12 +202,24 @@ if (NEED_PIC) | |||||
| endif () | endif () | ||||
| if (DYNAMIC_ARCH) | if (DYNAMIC_ARCH) | ||||
| set(CCOMMON_OPT "${CCOMMON_OPT} -DDYNAMIC_ARCH") | |||||
| if (DYNAMIC_OLDER) | |||||
| set(CCOMMON_OPT "${CCOMMON_OPT} -DDYNAMIC_OLDER") | |||||
| if (X86 OR X86_64 OR ARM64 OR PPC) | |||||
| set(CCOMMON_OPT "${CCOMMON_OPT} -DDYNAMIC_ARCH") | |||||
| if (DYNAMIC_OLDER) | |||||
| set(CCOMMON_OPT "${CCOMMON_OPT} -DDYNAMIC_OLDER") | |||||
| endif () | |||||
| else () | |||||
| unset (DYNAMIC_ARCH) | |||||
| message (STATUS "DYNAMIC_ARCH is not supported on the target architecture, removing") | |||||
| endif () | endif () | ||||
| endif () | endif () | ||||
| if (DYNAMIC_LIST) | |||||
| set(CCOMMON_OPT "${CCOMMON_OPT} -DDYNAMIC_LIST") | |||||
| foreach(DCORE ${DYNAMIC_LIST}) | |||||
| set(CCOMMON_OPT "${CCOMMON_OPT} -DDYN_${DCORE}") | |||||
| endforeach () | |||||
| endif () | |||||
| if (NO_LAPACK) | if (NO_LAPACK) | ||||
| set(CCOMMON_OPT "${CCOMMON_OPT} -DNO_LAPACK") | set(CCOMMON_OPT "${CCOMMON_OPT} -DNO_LAPACK") | ||||
| #Disable LAPACK C interface | #Disable LAPACK C interface | ||||
| @@ -276,7 +309,7 @@ endif () | |||||
| set(KERNELDIR "${PROJECT_SOURCE_DIR}/kernel/${ARCH}") | set(KERNELDIR "${PROJECT_SOURCE_DIR}/kernel/${ARCH}") | ||||
| # TODO: nead to convert these Makefiles | |||||
| # TODO: need to convert these Makefiles | |||||
| # include ${PROJECT_SOURCE_DIR}/cmake/${ARCH}.cmake | # include ${PROJECT_SOURCE_DIR}/cmake/${ARCH}.cmake | ||||
| if (${CORE} STREQUAL "PPC440") | if (${CORE} STREQUAL "PPC440") | ||||
| @@ -39,13 +39,21 @@ elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "ppc.*|power.*|Power.*") | |||||
| elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "mips64.*") | elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "mips64.*") | ||||
| set(MIPS64 1) | set(MIPS64 1) | ||||
| elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "amd64.*|x86_64.*|AMD64.*") | elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "amd64.*|x86_64.*|AMD64.*") | ||||
| set(X86_64 1) | |||||
| if("${CMAKE_SIZEOF_VOID_P}" EQUAL "8") | |||||
| set(X86_64 1) | |||||
| else() | |||||
| set(X86 1) | |||||
| endif() | |||||
| elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "i686.*|i386.*|x86.*|amd64.*|AMD64.*") | elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "i686.*|i386.*|x86.*|amd64.*|AMD64.*") | ||||
| set(X86 1) | set(X86 1) | ||||
| elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^(arm.*|ARM.*)") | elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^(arm.*|ARM.*)") | ||||
| set(ARM 1) | set(ARM 1) | ||||
| elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^(aarch64.*|AARCH64.*)") | elseif(CMAKE_SYSTEM_PROCESSOR MATCHES "^(aarch64.*|AARCH64.*)") | ||||
| set(ARM64 1) | |||||
| if("${CMAKE_SIZEOF_VOID_P}" EQUAL "8") | |||||
| set(ARM64 1) | |||||
| else() | |||||
| set(ARM 1) | |||||
| endif() | |||||
| endif() | endif() | ||||
| if (X86_64) | if (X86_64) | ||||
| @@ -78,7 +86,7 @@ endif() | |||||
| if (X86_64 OR X86) | if (X86_64 OR X86) | ||||
| file(WRITE ${PROJECT_BINARY_DIR}/avx512.tmp "#include <immintrin.h>\n\nint main(void){ __asm__ volatile(\"vbroadcastss -4 * 4(%rsi), %zmm2\"); }") | file(WRITE ${PROJECT_BINARY_DIR}/avx512.tmp "#include <immintrin.h>\n\nint main(void){ __asm__ volatile(\"vbroadcastss -4 * 4(%rsi), %zmm2\"); }") | ||||
| execute_process(COMMAND ${CMAKE_C_COMPILER} -march=skylake-avx512 -v -o ${PROJECT_BINARY_DIR}/avx512.o -x c ${PROJECT_BINARY_DIR}/avx512.tmp OUTPUT_QUIET ERROR_QUIET RESULT_VARIABLE NO_AVX512) | |||||
| execute_process(COMMAND ${CMAKE_C_COMPILER} -march=skylake-avx512 -c -v -o ${PROJECT_BINARY_DIR}/avx512.o -x c ${PROJECT_BINARY_DIR}/avx512.tmp OUTPUT_QUIET ERROR_QUIET RESULT_VARIABLE NO_AVX512) | |||||
| if (NO_AVX512 EQUAL 1) | if (NO_AVX512 EQUAL 1) | ||||
| set (CCOMMON_OPT "${CCOMMON_OPT} -DNO_AVX512") | set (CCOMMON_OPT "${CCOMMON_OPT} -DNO_AVX512") | ||||
| endif() | endif() | ||||
| @@ -89,7 +89,7 @@ function(AllCombinations list_in absent_codes_in) | |||||
| set(CODES_OUT ${CODES_OUT} PARENT_SCOPE) | set(CODES_OUT ${CODES_OUT} PARENT_SCOPE) | ||||
| endfunction () | endfunction () | ||||
| # generates object files for each of the sources, using the BLAS naming scheme to pass the funciton name as a preprocessor definition | |||||
| # generates object files for each of the sources, using the BLAS naming scheme to pass the function name as a preprocessor definition | |||||
| # @param sources_in the source files to build from | # @param sources_in the source files to build from | ||||
| # @param defines_in (optional) preprocessor definitions that will be applied to all objects | # @param defines_in (optional) preprocessor definitions that will be applied to all objects | ||||
| # @param name_in (optional) if this is set this name will be used instead of the filename. Use a * to indicate where the float character should go, if no star the character will be prepended. | # @param name_in (optional) if this is set this name will be used instead of the filename. Use a * to indicate where the float character should go, if no star the character will be prepended. | ||||
| @@ -85,6 +85,8 @@ extern "C" { | |||||
| #if !defined(_MSC_VER) | #if !defined(_MSC_VER) | ||||
| #include <unistd.h> | #include <unistd.h> | ||||
| #elif _MSC_VER < 1900 | |||||
| #define snprintf _snprintf | |||||
| #endif | #endif | ||||
| #include <time.h> | #include <time.h> | ||||
| @@ -129,7 +131,7 @@ extern "C" { | |||||
| #include <time.h> | #include <time.h> | ||||
| #include <unistd.h> | #include <unistd.h> | ||||
| #include <math.h> | #include <math.h> | ||||
| #ifdef SMP | |||||
| #if defined(SMP) || defined(USE_LOCKING) | |||||
| #include <pthread.h> | #include <pthread.h> | ||||
| #endif | #endif | ||||
| #endif | #endif | ||||
| @@ -198,7 +200,7 @@ extern "C" { | |||||
| #error "You can't specify both LOCK operation!" | #error "You can't specify both LOCK operation!" | ||||
| #endif | #endif | ||||
| #ifdef SMP | |||||
| #if defined(SMP) || defined(USE_LOCKING) | |||||
| #define USE_PTHREAD_LOCK | #define USE_PTHREAD_LOCK | ||||
| #undef USE_PTHREAD_SPINLOCK | #undef USE_PTHREAD_SPINLOCK | ||||
| #endif | #endif | ||||
| @@ -348,6 +350,11 @@ typedef int blasint; | |||||
| #endif | #endif | ||||
| #endif | #endif | ||||
| #ifdef POWER9 | |||||
| #ifndef YIELDING | |||||
| #define YIELDING __asm__ __volatile__ ("nop;nop;nop;nop;nop;nop;nop;nop;\n"); | |||||
| #endif | |||||
| #endif | |||||
| /* | /* | ||||
| #ifdef PILEDRIVER | #ifdef PILEDRIVER | ||||
| @@ -439,7 +446,7 @@ please https://github.com/xianyi/OpenBLAS/issues/246 | |||||
| typedef char env_var_t[MAX_PATH]; | typedef char env_var_t[MAX_PATH]; | ||||
| #define readenv(p, n) 0 | #define readenv(p, n) 0 | ||||
| #else | #else | ||||
| #ifdef OS_WINDOWS | |||||
| #if defined(OS_WINDOWS) && !defined(OS_CYGWIN_NT) | |||||
| typedef char env_var_t[MAX_PATH]; | typedef char env_var_t[MAX_PATH]; | ||||
| #define readenv(p, n) GetEnvironmentVariable((LPCTSTR)(n), (LPTSTR)(p), sizeof(p)) | #define readenv(p, n) GetEnvironmentVariable((LPCTSTR)(n), (LPTSTR)(p), sizeof(p)) | ||||
| #else | #else | ||||
| @@ -19,6 +19,7 @@ | |||||
| #define CDOTC_K cdotc_k | #define CDOTC_K cdotc_k | ||||
| #define CNRM2_K cnrm2_k | #define CNRM2_K cnrm2_k | ||||
| #define CSCAL_K cscal_k | #define CSCAL_K cscal_k | ||||
| #define CSUM_K csum_k | |||||
| #define CSWAP_K cswap_k | #define CSWAP_K cswap_k | ||||
| #define CROT_K csrot_k | #define CROT_K csrot_k | ||||
| @@ -249,6 +250,7 @@ | |||||
| #define CDOTC_K gotoblas -> cdotc_k | #define CDOTC_K gotoblas -> cdotc_k | ||||
| #define CNRM2_K gotoblas -> cnrm2_k | #define CNRM2_K gotoblas -> cnrm2_k | ||||
| #define CSCAL_K gotoblas -> cscal_k | #define CSCAL_K gotoblas -> cscal_k | ||||
| #define CSUM_K gotoblas -> csum_k | |||||
| #define CSWAP_K gotoblas -> cswap_k | #define CSWAP_K gotoblas -> cswap_k | ||||
| #define CROT_K gotoblas -> csrot_k | #define CROT_K gotoblas -> csrot_k | ||||
| @@ -19,6 +19,7 @@ | |||||
| #define DDOTC_K ddot_k | #define DDOTC_K ddot_k | ||||
| #define DNRM2_K dnrm2_k | #define DNRM2_K dnrm2_k | ||||
| #define DSCAL_K dscal_k | #define DSCAL_K dscal_k | ||||
| #define DSUM_K dsum_k | |||||
| #define DSWAP_K dswap_k | #define DSWAP_K dswap_k | ||||
| #define DROT_K drot_k | #define DROT_K drot_k | ||||
| @@ -174,6 +175,7 @@ | |||||
| #define DDOTC_K gotoblas -> ddot_k | #define DDOTC_K gotoblas -> ddot_k | ||||
| #define DNRM2_K gotoblas -> dnrm2_k | #define DNRM2_K gotoblas -> dnrm2_k | ||||
| #define DSCAL_K gotoblas -> dscal_k | #define DSCAL_K gotoblas -> dscal_k | ||||
| #define DSUM_K gotoblas -> dsum_k | |||||
| #define DSWAP_K gotoblas -> dswap_k | #define DSWAP_K gotoblas -> dswap_k | ||||
| #define DROT_K gotoblas -> drot_k | #define DROT_K gotoblas -> drot_k | ||||
| @@ -122,6 +122,13 @@ xdouble BLASFUNC(qasum) (blasint *, xdouble *, blasint *); | |||||
| double BLASFUNC(dzasum)(blasint *, double *, blasint *); | double BLASFUNC(dzasum)(blasint *, double *, blasint *); | ||||
| xdouble BLASFUNC(qxasum)(blasint *, xdouble *, blasint *); | xdouble BLASFUNC(qxasum)(blasint *, xdouble *, blasint *); | ||||
| FLOATRET BLASFUNC(ssum) (blasint *, float *, blasint *); | |||||
| FLOATRET BLASFUNC(scsum)(blasint *, float *, blasint *); | |||||
| double BLASFUNC(dsum) (blasint *, double *, blasint *); | |||||
| xdouble BLASFUNC(qsum) (blasint *, xdouble *, blasint *); | |||||
| double BLASFUNC(dzsum)(blasint *, double *, blasint *); | |||||
| xdouble BLASFUNC(qxsum)(blasint *, xdouble *, blasint *); | |||||
| blasint BLASFUNC(isamax)(blasint *, float *, blasint *); | blasint BLASFUNC(isamax)(blasint *, float *, blasint *); | ||||
| blasint BLASFUNC(idamax)(blasint *, double *, blasint *); | blasint BLASFUNC(idamax)(blasint *, double *, blasint *); | ||||
| blasint BLASFUNC(iqamax)(blasint *, xdouble *, blasint *); | blasint BLASFUNC(iqamax)(blasint *, xdouble *, blasint *); | ||||
| @@ -100,6 +100,13 @@ float casum_k (BLASLONG, float *, BLASLONG); | |||||
| double zasum_k (BLASLONG, double *, BLASLONG); | double zasum_k (BLASLONG, double *, BLASLONG); | ||||
| xdouble xasum_k (BLASLONG, xdouble *, BLASLONG); | xdouble xasum_k (BLASLONG, xdouble *, BLASLONG); | ||||
| float ssum_k (BLASLONG, float *, BLASLONG); | |||||
| double dsum_k (BLASLONG, double *, BLASLONG); | |||||
| xdouble qsum_k (BLASLONG, xdouble *, BLASLONG); | |||||
| float csum_k (BLASLONG, float *, BLASLONG); | |||||
| double zsum_k (BLASLONG, double *, BLASLONG); | |||||
| xdouble xsum_k (BLASLONG, xdouble *, BLASLONG); | |||||
| float samax_k (BLASLONG, float *, BLASLONG); | float samax_k (BLASLONG, float *, BLASLONG); | ||||
| double damax_k (BLASLONG, double *, BLASLONG); | double damax_k (BLASLONG, double *, BLASLONG); | ||||
| xdouble qamax_k (BLASLONG, xdouble *, BLASLONG); | xdouble qamax_k (BLASLONG, xdouble *, BLASLONG); | ||||
| @@ -66,6 +66,7 @@ | |||||
| #define DOTC_K QDOTC_K | #define DOTC_K QDOTC_K | ||||
| #define NRM2_K QNRM2_K | #define NRM2_K QNRM2_K | ||||
| #define SCAL_K QSCAL_K | #define SCAL_K QSCAL_K | ||||
| #define SUM_K QSUM_K | |||||
| #define SWAP_K QSWAP_K | #define SWAP_K QSWAP_K | ||||
| #define ROT_K QROT_K | #define ROT_K QROT_K | ||||
| @@ -356,6 +357,7 @@ | |||||
| #define DOTC_K DDOTC_K | #define DOTC_K DDOTC_K | ||||
| #define NRM2_K DNRM2_K | #define NRM2_K DNRM2_K | ||||
| #define SCAL_K DSCAL_K | #define SCAL_K DSCAL_K | ||||
| #define SUM_K DSUM_K | |||||
| #define SWAP_K DSWAP_K | #define SWAP_K DSWAP_K | ||||
| #define ROT_K DROT_K | #define ROT_K DROT_K | ||||
| @@ -658,6 +660,7 @@ | |||||
| #define DOTC_K SDOTC_K | #define DOTC_K SDOTC_K | ||||
| #define NRM2_K SNRM2_K | #define NRM2_K SNRM2_K | ||||
| #define SCAL_K SSCAL_K | #define SCAL_K SSCAL_K | ||||
| #define SUM_K SSUM_K | |||||
| #define SWAP_K SSWAP_K | #define SWAP_K SSWAP_K | ||||
| #define ROT_K SROT_K | #define ROT_K SROT_K | ||||
| @@ -962,6 +965,7 @@ | |||||
| #define DOTC_K XDOTC_K | #define DOTC_K XDOTC_K | ||||
| #define NRM2_K XNRM2_K | #define NRM2_K XNRM2_K | ||||
| #define SCAL_K XSCAL_K | #define SCAL_K XSCAL_K | ||||
| #define SUM_K XSUM_K | |||||
| #define SWAP_K XSWAP_K | #define SWAP_K XSWAP_K | ||||
| #define ROT_K XROT_K | #define ROT_K XROT_K | ||||
| @@ -1363,6 +1367,7 @@ | |||||
| #define DOTC_K ZDOTC_K | #define DOTC_K ZDOTC_K | ||||
| #define NRM2_K ZNRM2_K | #define NRM2_K ZNRM2_K | ||||
| #define SCAL_K ZSCAL_K | #define SCAL_K ZSCAL_K | ||||
| #define SUM_K ZSUM_K | |||||
| #define SWAP_K ZSWAP_K | #define SWAP_K ZSWAP_K | ||||
| #define ROT_K ZROT_K | #define ROT_K ZROT_K | ||||
| @@ -1785,6 +1790,7 @@ | |||||
| #define DOTC_K CDOTC_K | #define DOTC_K CDOTC_K | ||||
| #define NRM2_K CNRM2_K | #define NRM2_K CNRM2_K | ||||
| #define SCAL_K CSCAL_K | #define SCAL_K CSCAL_K | ||||
| #define SUM_K CSUM_K | |||||
| #define SWAP_K CSWAP_K | #define SWAP_K CSWAP_K | ||||
| #define ROT_K CROT_K | #define ROT_K CROT_K | ||||
| @@ -63,6 +63,7 @@ BLASLONG (*ismin_k) (BLASLONG, float *, BLASLONG); | |||||
| float (*snrm2_k) (BLASLONG, float *, BLASLONG); | float (*snrm2_k) (BLASLONG, float *, BLASLONG); | ||||
| float (*sasum_k) (BLASLONG, float *, BLASLONG); | float (*sasum_k) (BLASLONG, float *, BLASLONG); | ||||
| float (*ssum_k) (BLASLONG, float *, BLASLONG); | |||||
| int (*scopy_k) (BLASLONG, float *, BLASLONG, float *, BLASLONG); | int (*scopy_k) (BLASLONG, float *, BLASLONG, float *, BLASLONG); | ||||
| float (*sdot_k) (BLASLONG, float *, BLASLONG, float *, BLASLONG); | float (*sdot_k) (BLASLONG, float *, BLASLONG, float *, BLASLONG); | ||||
| double (*dsdot_k) (BLASLONG, float *, BLASLONG, float *, BLASLONG); | double (*dsdot_k) (BLASLONG, float *, BLASLONG, float *, BLASLONG); | ||||
| @@ -154,6 +155,7 @@ BLASLONG (*idmin_k) (BLASLONG, double *, BLASLONG); | |||||
| double (*dnrm2_k) (BLASLONG, double *, BLASLONG); | double (*dnrm2_k) (BLASLONG, double *, BLASLONG); | ||||
| double (*dasum_k) (BLASLONG, double *, BLASLONG); | double (*dasum_k) (BLASLONG, double *, BLASLONG); | ||||
| double (*dsum_k) (BLASLONG, double *, BLASLONG); | |||||
| int (*dcopy_k) (BLASLONG, double *, BLASLONG, double *, BLASLONG); | int (*dcopy_k) (BLASLONG, double *, BLASLONG, double *, BLASLONG); | ||||
| double (*ddot_k) (BLASLONG, double *, BLASLONG, double *, BLASLONG); | double (*ddot_k) (BLASLONG, double *, BLASLONG, double *, BLASLONG); | ||||
| int (*drot_k) (BLASLONG, double *, BLASLONG, double *, BLASLONG, double, double); | int (*drot_k) (BLASLONG, double *, BLASLONG, double *, BLASLONG, double, double); | ||||
| @@ -245,6 +247,7 @@ BLASLONG (*iqmin_k) (BLASLONG, xdouble *, BLASLONG); | |||||
| xdouble (*qnrm2_k) (BLASLONG, xdouble *, BLASLONG); | xdouble (*qnrm2_k) (BLASLONG, xdouble *, BLASLONG); | ||||
| xdouble (*qasum_k) (BLASLONG, xdouble *, BLASLONG); | xdouble (*qasum_k) (BLASLONG, xdouble *, BLASLONG); | ||||
| xdouble (*qsum_k) (BLASLONG, xdouble *, BLASLONG); | |||||
| int (*qcopy_k) (BLASLONG, xdouble *, BLASLONG, xdouble *, BLASLONG); | int (*qcopy_k) (BLASLONG, xdouble *, BLASLONG, xdouble *, BLASLONG); | ||||
| xdouble (*qdot_k) (BLASLONG, xdouble *, BLASLONG, xdouble *, BLASLONG); | xdouble (*qdot_k) (BLASLONG, xdouble *, BLASLONG, xdouble *, BLASLONG); | ||||
| int (*qrot_k) (BLASLONG, xdouble *, BLASLONG, xdouble *, BLASLONG, xdouble, xdouble); | int (*qrot_k) (BLASLONG, xdouble *, BLASLONG, xdouble *, BLASLONG, xdouble, xdouble); | ||||
| @@ -332,6 +335,7 @@ BLASLONG (*icamin_k)(BLASLONG, float *, BLASLONG); | |||||
| float (*cnrm2_k) (BLASLONG, float *, BLASLONG); | float (*cnrm2_k) (BLASLONG, float *, BLASLONG); | ||||
| float (*casum_k) (BLASLONG, float *, BLASLONG); | float (*casum_k) (BLASLONG, float *, BLASLONG); | ||||
| float (*csum_k) (BLASLONG, float *, BLASLONG); | |||||
| int (*ccopy_k) (BLASLONG, float *, BLASLONG, float *, BLASLONG); | int (*ccopy_k) (BLASLONG, float *, BLASLONG, float *, BLASLONG); | ||||
| openblas_complex_float (*cdotu_k) (BLASLONG, float *, BLASLONG, float *, BLASLONG); | openblas_complex_float (*cdotu_k) (BLASLONG, float *, BLASLONG, float *, BLASLONG); | ||||
| openblas_complex_float (*cdotc_k) (BLASLONG, float *, BLASLONG, float *, BLASLONG); | openblas_complex_float (*cdotc_k) (BLASLONG, float *, BLASLONG, float *, BLASLONG); | ||||
| @@ -495,6 +499,7 @@ BLASLONG (*izamin_k)(BLASLONG, double *, BLASLONG); | |||||
| double (*znrm2_k) (BLASLONG, double *, BLASLONG); | double (*znrm2_k) (BLASLONG, double *, BLASLONG); | ||||
| double (*zasum_k) (BLASLONG, double *, BLASLONG); | double (*zasum_k) (BLASLONG, double *, BLASLONG); | ||||
| double (*zsum_k) (BLASLONG, double *, BLASLONG); | |||||
| int (*zcopy_k) (BLASLONG, double *, BLASLONG, double *, BLASLONG); | int (*zcopy_k) (BLASLONG, double *, BLASLONG, double *, BLASLONG); | ||||
| openblas_complex_double (*zdotu_k) (BLASLONG, double *, BLASLONG, double *, BLASLONG); | openblas_complex_double (*zdotu_k) (BLASLONG, double *, BLASLONG, double *, BLASLONG); | ||||
| openblas_complex_double (*zdotc_k) (BLASLONG, double *, BLASLONG, double *, BLASLONG); | openblas_complex_double (*zdotc_k) (BLASLONG, double *, BLASLONG, double *, BLASLONG); | ||||
| @@ -660,6 +665,7 @@ BLASLONG (*ixamin_k)(BLASLONG, xdouble *, BLASLONG); | |||||
| xdouble (*xnrm2_k) (BLASLONG, xdouble *, BLASLONG); | xdouble (*xnrm2_k) (BLASLONG, xdouble *, BLASLONG); | ||||
| xdouble (*xasum_k) (BLASLONG, xdouble *, BLASLONG); | xdouble (*xasum_k) (BLASLONG, xdouble *, BLASLONG); | ||||
| xdouble (*xsum_k) (BLASLONG, xdouble *, BLASLONG); | |||||
| int (*xcopy_k) (BLASLONG, xdouble *, BLASLONG, xdouble *, BLASLONG); | int (*xcopy_k) (BLASLONG, xdouble *, BLASLONG, xdouble *, BLASLONG); | ||||
| openblas_complex_xdouble (*xdotu_k) (BLASLONG, xdouble *, BLASLONG, xdouble *, BLASLONG); | openblas_complex_xdouble (*xdotu_k) (BLASLONG, xdouble *, BLASLONG, xdouble *, BLASLONG); | ||||
| openblas_complex_xdouble (*xdotc_k) (BLASLONG, xdouble *, BLASLONG, xdouble *, BLASLONG); | openblas_complex_xdouble (*xdotc_k) (BLASLONG, xdouble *, BLASLONG, xdouble *, BLASLONG); | ||||
| @@ -39,7 +39,7 @@ | |||||
| #ifndef COMMON_POWER | #ifndef COMMON_POWER | ||||
| #define COMMON_POWER | #define COMMON_POWER | ||||
| #if defined(POWER8) | |||||
| #if defined(POWER8) || defined(POWER9) | |||||
| #define MB __asm__ __volatile__ ("eieio":::"memory") | #define MB __asm__ __volatile__ ("eieio":::"memory") | ||||
| #define WMB __asm__ __volatile__ ("eieio":::"memory") | #define WMB __asm__ __volatile__ ("eieio":::"memory") | ||||
| #else | #else | ||||
| @@ -241,7 +241,7 @@ static inline int blas_quickdivide(blasint x, blasint y){ | |||||
| #define HAVE_PREFETCH | #define HAVE_PREFETCH | ||||
| #endif | #endif | ||||
| #if defined(POWER3) || defined(POWER6) || defined(PPCG4) || defined(CELL) || defined(POWER8) | |||||
| #if defined(POWER3) || defined(POWER6) || defined(PPCG4) || defined(CELL) || defined(POWER8) || defined(POWER9) || ( defined(PPC970) && ( defined(OS_DARWIN) || defined(OS_FREEBSD) ) ) | |||||
| #define DCBT_ARG 0 | #define DCBT_ARG 0 | ||||
| #else | #else | ||||
| #define DCBT_ARG 8 | #define DCBT_ARG 8 | ||||
| @@ -263,7 +263,7 @@ static inline int blas_quickdivide(blasint x, blasint y){ | |||||
| #define L1_PREFETCH dcbtst | #define L1_PREFETCH dcbtst | ||||
| #endif | #endif | ||||
| #if defined(POWER8) | |||||
| #if defined(POWER8) || defined(POWER9) | |||||
| #define L1_DUALFETCH | #define L1_DUALFETCH | ||||
| #define L1_PREFETCHSIZE (16 + 128 * 100) | #define L1_PREFETCHSIZE (16 + 128 * 100) | ||||
| #define L1_PREFETCH dcbtst | #define L1_PREFETCH dcbtst | ||||
| @@ -499,7 +499,7 @@ static inline int blas_quickdivide(blasint x, blasint y){ | |||||
| #if defined(ASSEMBLER) && !defined(NEEDPARAM) | #if defined(ASSEMBLER) && !defined(NEEDPARAM) | ||||
| #ifdef OS_LINUX | |||||
| #if defined(OS_LINUX) || defined(OS_FREEBSD) | |||||
| #ifndef __64BIT__ | #ifndef __64BIT__ | ||||
| #define PROLOGUE \ | #define PROLOGUE \ | ||||
| .section .text;\ | .section .text;\ | ||||
| @@ -598,9 +598,14 @@ REALNAME:;\ | |||||
| #ifndef __64BIT__ | #ifndef __64BIT__ | ||||
| #define PROLOGUE \ | #define PROLOGUE \ | ||||
| .machine "any";\ | .machine "any";\ | ||||
| .toc;\ | |||||
| .globl .REALNAME;\ | .globl .REALNAME;\ | ||||
| .globl REALNAME;\ | |||||
| .csect REALNAME[DS],3;\ | |||||
| REALNAME:;\ | |||||
| .long .REALNAME, TOC[tc0], 0;\ | |||||
| .csect .text[PR],5;\ | .csect .text[PR],5;\ | ||||
| .REALNAME:; | |||||
| .REALNAME: | |||||
| #define EPILOGUE \ | #define EPILOGUE \ | ||||
| _section_.text:;\ | _section_.text:;\ | ||||
| @@ -611,9 +616,14 @@ _section_.text:;\ | |||||
| #define PROLOGUE \ | #define PROLOGUE \ | ||||
| .machine "any";\ | .machine "any";\ | ||||
| .toc;\ | |||||
| .globl .REALNAME;\ | .globl .REALNAME;\ | ||||
| .globl REALNAME;\ | |||||
| .csect REALNAME[DS],3;\ | |||||
| REALNAME:;\ | |||||
| .llong .REALNAME, TOC[tc0], 0;\ | |||||
| .csect .text[PR], 5;\ | .csect .text[PR], 5;\ | ||||
| .REALNAME:; | |||||
| .REALNAME: | |||||
| #define EPILOGUE \ | #define EPILOGUE \ | ||||
| _section_.text:;\ | _section_.text:;\ | ||||
| @@ -774,7 +784,7 @@ Lmcount$lazy_ptr: | |||||
| #define HALT mfspr r0, 1023 | #define HALT mfspr r0, 1023 | ||||
| #ifdef OS_LINUX | |||||
| #if defined(OS_LINUX) || defined(OS_FREEBSD) | |||||
| #if defined(PPC440) || defined(PPC440FP2) | #if defined(PPC440) || defined(PPC440FP2) | ||||
| #undef MAX_CPU_NUMBER | #undef MAX_CPU_NUMBER | ||||
| #define MAX_CPU_NUMBER 1 | #define MAX_CPU_NUMBER 1 | ||||
| @@ -802,7 +812,7 @@ Lmcount$lazy_ptr: | |||||
| #define BUFFER_SIZE ( 2 << 20) | #define BUFFER_SIZE ( 2 << 20) | ||||
| #elif defined(PPC440FP2) | #elif defined(PPC440FP2) | ||||
| #define BUFFER_SIZE ( 16 << 20) | #define BUFFER_SIZE ( 16 << 20) | ||||
| #elif defined(POWER8) | |||||
| #elif defined(POWER8) || defined(POWER9) | |||||
| #define BUFFER_SIZE ( 64 << 20) | #define BUFFER_SIZE ( 64 << 20) | ||||
| #else | #else | ||||
| #define BUFFER_SIZE ( 16 << 20) | #define BUFFER_SIZE ( 16 << 20) | ||||
| @@ -819,7 +829,7 @@ Lmcount$lazy_ptr: | |||||
| #define MAP_ANONYMOUS MAP_ANON | #define MAP_ANONYMOUS MAP_ANON | ||||
| #endif | #endif | ||||
| #ifdef OS_LINUX | |||||
| #if defined(OS_LINUX) || defined(OS_FREEBSD) | |||||
| #ifndef __64BIT__ | #ifndef __64BIT__ | ||||
| #define FRAMESLOT(X) (((X) * 4) + 8) | #define FRAMESLOT(X) (((X) * 4) + 8) | ||||
| #else | #else | ||||
| @@ -19,6 +19,7 @@ | |||||
| #define QDOTC_K qdot_k | #define QDOTC_K qdot_k | ||||
| #define QNRM2_K qnrm2_k | #define QNRM2_K qnrm2_k | ||||
| #define QSCAL_K qscal_k | #define QSCAL_K qscal_k | ||||
| #define QSUM_K qsum_k | |||||
| #define QSWAP_K qswap_k | #define QSWAP_K qswap_k | ||||
| #define QROT_K qrot_k | #define QROT_K qrot_k | ||||
| @@ -161,6 +162,7 @@ | |||||
| #define QDOTC_K gotoblas -> qdot_k | #define QDOTC_K gotoblas -> qdot_k | ||||
| #define QNRM2_K gotoblas -> qnrm2_k | #define QNRM2_K gotoblas -> qnrm2_k | ||||
| #define QSCAL_K gotoblas -> qscal_k | #define QSCAL_K gotoblas -> qscal_k | ||||
| #define QSUM_K gotoblas -> qsum_k | |||||
| #define QSWAP_K gotoblas -> qswap_k | #define QSWAP_K gotoblas -> qswap_k | ||||
| #define QROT_K gotoblas -> qrot_k | #define QROT_K gotoblas -> qrot_k | ||||
| @@ -12,6 +12,7 @@ | |||||
| #define ISMAX_K ismax_k | #define ISMAX_K ismax_k | ||||
| #define ISMIN_K ismin_k | #define ISMIN_K ismin_k | ||||
| #define SASUM_K sasum_k | #define SASUM_K sasum_k | ||||
| #define SSUM_K ssum_k | |||||
| #define SAXPYU_K saxpy_k | #define SAXPYU_K saxpy_k | ||||
| #define SAXPYC_K saxpy_k | #define SAXPYC_K saxpy_k | ||||
| #define SCOPY_K scopy_k | #define SCOPY_K scopy_k | ||||
| @@ -170,6 +171,7 @@ | |||||
| #define ISMAX_K gotoblas -> ismax_k | #define ISMAX_K gotoblas -> ismax_k | ||||
| #define ISMIN_K gotoblas -> ismin_k | #define ISMIN_K gotoblas -> ismin_k | ||||
| #define SASUM_K gotoblas -> sasum_k | #define SASUM_K gotoblas -> sasum_k | ||||
| #define SSUM_K gotoblas -> ssum_k | |||||
| #define SAXPYU_K gotoblas -> saxpy_k | #define SAXPYU_K gotoblas -> saxpy_k | ||||
| #define SAXPYC_K gotoblas -> saxpy_k | #define SAXPYC_K gotoblas -> saxpy_k | ||||
| #define SCOPY_K gotoblas -> scopy_k | #define SCOPY_K gotoblas -> scopy_k | ||||
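The two hunks above add `SSUM_K` twice: once as a direct symbol (`ssum_k`) and once routed through the `gotoblas` table (`gotoblas -> ssum_k`). The sketch below illustrates that two-mode dispatch under stated assumptions. The `DEMO_DYNAMIC_ARCH` switch, the `demo_` names, the one-member struct, and the reference kernel (a plain strided sum with no absolute value) are all hypothetical and are not OpenBLAS's actual definitions.

```cpp
#include <cassert>

// Hypothetical stand-in for a ?sum kernel: plain strided sum of n elements.
static float demo_ssum_k(long n, const float *x, long incx) {
    assert(incx > 0);
    float acc = 0.0f;
    for (long i = 0; i < n; ++i) acc += x[i * incx];
    return acc;
}

// Hypothetical per-architecture table holding one function pointer per kernel.
struct demo_gotoblas_t {
    float (*ssum_k)(long, const float *, long);
};

demo_gotoblas_t demo_table = { demo_ssum_k };
demo_gotoblas_t *demo_gotoblas = &demo_table;

// Static builds name the kernel directly; dynamic builds go through the table
// so the kernel can be chosen at run time.
#ifdef DEMO_DYNAMIC_ARCH
#define DEMO_SSUM_K demo_gotoblas->ssum_k
#else
#define DEMO_SSUM_K demo_ssum_k
#endif
```

Callers write `DEMO_SSUM_K(n, x, incx)` in both modes, which is why a new kernel needs an entry in each macro block as well as a slot in the table.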
| @@ -45,7 +45,7 @@ USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |||||
| * SIZE must be carefully chosen to be: | * SIZE must be carefully chosen to be: | ||||
| * - as small as possible to maximize the number of stack allocation | * - as small as possible to maximize the number of stack allocation | ||||
| * - large enough to support all architectures and kernel | * - large enough to support all architectures and kernel | ||||
| * Chosing a too small SIZE will lead to a stack smashing. | |||||
| * Choosing a SIZE too small will lead to a stack smashing. | |||||
| */ | */ | ||||
| #define STACK_ALLOC(SIZE, TYPE, BUFFER) \ | #define STACK_ALLOC(SIZE, TYPE, BUFFER) \ | ||||
| /* make it volatile because some function (ex: dgemv_n.S) */ \ | /* make it volatile because some function (ex: dgemv_n.S) */ \ | ||||
| @@ -19,6 +19,7 @@ | |||||
| #define XDOTC_K xdotc_k | #define XDOTC_K xdotc_k | ||||
| #define XNRM2_K xnrm2_k | #define XNRM2_K xnrm2_k | ||||
| #define XSCAL_K xscal_k | #define XSCAL_K xscal_k | ||||
| #define XSUM_K xsum_k | |||||
| #define XSWAP_K xswap_k | #define XSWAP_K xswap_k | ||||
| #define XROT_K xqrot_k | #define XROT_K xqrot_k | ||||
| @@ -227,6 +228,7 @@ | |||||
| #define XDOTC_K gotoblas -> xdotc_k | #define XDOTC_K gotoblas -> xdotc_k | ||||
| #define XNRM2_K gotoblas -> xnrm2_k | #define XNRM2_K gotoblas -> xnrm2_k | ||||
| #define XSCAL_K gotoblas -> xscal_k | #define XSCAL_K gotoblas -> xscal_k | ||||
| #define XSUM_K gotoblas -> xsum_k | |||||
| #define XSWAP_K gotoblas -> xswap_k | #define XSWAP_K gotoblas -> xswap_k | ||||
| #define XROT_K gotoblas -> xqrot_k | #define XROT_K gotoblas -> xqrot_k | ||||
| @@ -187,7 +187,7 @@ static __inline int blas_quickdivide(unsigned int x, unsigned int y){ | |||||
| y = blas_quick_divide_table[y]; | y = blas_quick_divide_table[y]; | ||||
| __asm__ __volatile__ ("mull %0" :"=d" (result) :"a"(x), "0" (y)); | |||||
| __asm__ __volatile__ ("mull %0" :"=d" (result), "+a"(x): "0" (y)); | |||||
| return result; | return result; | ||||
| #endif | #endif | ||||
| @@ -214,7 +214,7 @@ static __inline int blas_quickdivide(unsigned int x, unsigned int y){ | |||||
| #endif | #endif | ||||
| #if defined(PILEDRIVER) || defined(BULLDOZER) || defined(STEAMROLLER) || defined(EXCAVATOR) | #if defined(PILEDRIVER) || defined(BULLDOZER) || defined(STEAMROLLER) || defined(EXCAVATOR) | ||||
| //Enable some optimazation for barcelona. | |||||
| //Enable some optimization for barcelona. | |||||
| #define BARCELONA_OPTIMIZATION | #define BARCELONA_OPTIMIZATION | ||||
| #endif | #endif | ||||
| @@ -129,12 +129,13 @@ static __inline void cpuid(int op, int *eax, int *ebx, int *ecx, int *edx){ | |||||
| *ecx=cpuinfo[2]; | *ecx=cpuinfo[2]; | ||||
| *edx=cpuinfo[3]; | *edx=cpuinfo[3]; | ||||
| #else | #else | ||||
| __asm__ __volatile__("cpuid" | |||||
| __asm__ __volatile__("mov $0, %%ecx;" | |||||
| "cpuid" | |||||
| : "=a" (*eax), | : "=a" (*eax), | ||||
| "=b" (*ebx), | "=b" (*ebx), | ||||
| "=c" (*ecx), | "=c" (*ecx), | ||||
| "=d" (*edx) | "=d" (*edx) | ||||
| : "0" (op), "c"(0)); | |||||
| : "0" (op)); | |||||
| #endif | #endif | ||||
| } | } | ||||
| @@ -210,7 +211,7 @@ static __inline int blas_quickdivide(unsigned int x, unsigned int y){ | |||||
| y = blas_quick_divide_table[y]; | y = blas_quick_divide_table[y]; | ||||
| __asm__ __volatile__ ("mull %0" :"=d" (result) :"a"(x), "0" (y)); | |||||
| __asm__ __volatile__ ("mull %0" :"=d" (result), "+a"(x) : "0" (y)); | |||||
| return result; | return result; | ||||
| } | } | ||||
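The constraint change above matters because the one-operand `mull` multiplies EAX by its operand and writes the 64-bit product to EDX:EAX, so EAX is overwritten. Marking `x` as `"+a"` tells the compiler the register is both read and modified. A portable sketch of the same computation follows. It assumes the table entry for `y` is `ceil(2^32 / y)` and that `y <= 1` returns `x` unchanged; the table contents are not shown in this diff, so treat both as assumptions, and the `demo_` names as hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumed reciprocal: ceil(2^32 / y). Only fits in 32 bits for y >= 2.
static std::uint32_t demo_reciprocal(std::uint32_t y) {
    assert(y >= 2);
    return static_cast<std::uint32_t>(((std::uint64_t{1} << 32) + y - 1) / y);
}

// Portable equivalent of the `mull` sequence: form the 64-bit product and
// keep the high 32 bits (what the asm reads back from EDX).
static std::uint32_t demo_quickdivide(std::uint32_t x, std::uint32_t y) {
    if (y <= 1) return x;  // assumed guard for trivial divisors
    const std::uint64_t product =
        static_cast<std::uint64_t>(x) * demo_reciprocal(y);
    return static_cast<std::uint32_t>(product >> 32);
}
```

With this reciprocal the result equals `x / y` only while `x` times the rounding excess stays below 2^32, so the sketch is intended for small operands such as thread or block counts.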
| @@ -276,7 +277,7 @@ static __inline int blas_quickdivide(unsigned int x, unsigned int y){ | |||||
| #ifdef ASSEMBLER | #ifdef ASSEMBLER | ||||
| #if defined(PILEDRIVER) || defined(BULLDOZER) || defined(STEAMROLLER) || defined(EXCAVATOR) | #if defined(PILEDRIVER) || defined(BULLDOZER) || defined(STEAMROLLER) || defined(EXCAVATOR) | ||||
| //Enable some optimazation for barcelona. | |||||
| //Enable some optimization for barcelona. | |||||
| #define BARCELONA_OPTIMIZATION | #define BARCELONA_OPTIMIZATION | ||||
| #endif | #endif | ||||
| @@ -19,6 +19,7 @@ | |||||
| #define ZDOTC_K zdotc_k | #define ZDOTC_K zdotc_k | ||||
| #define ZNRM2_K znrm2_k | #define ZNRM2_K znrm2_k | ||||
| #define ZSCAL_K zscal_k | #define ZSCAL_K zscal_k | ||||
| #define ZSUM_K zsum_k | |||||
| #define ZSWAP_K zswap_k | #define ZSWAP_K zswap_k | ||||
| #define ZROT_K zdrot_k | #define ZROT_K zdrot_k | ||||
| @@ -249,6 +250,7 @@ | |||||
| #define ZDOTC_K gotoblas -> zdotc_k | #define ZDOTC_K gotoblas -> zdotc_k | ||||
| #define ZNRM2_K gotoblas -> znrm2_k | #define ZNRM2_K gotoblas -> znrm2_k | ||||
| #define ZSCAL_K gotoblas -> zscal_k | #define ZSCAL_K gotoblas -> zscal_k | ||||
| #define ZSUM_K gotoblas -> zsum_k | |||||
| #define ZSWAP_K gotoblas -> zswap_k | #define ZSWAP_K gotoblas -> zswap_k | ||||
| #define ZROT_K gotoblas -> zdrot_k | #define ZROT_K gotoblas -> zdrot_k | ||||
| @@ -0,0 +1,14 @@ | |||||
| include ../Makefile.rule | |||||
| all :: dgemv_tester dgemm_tester | |||||
| dgemv_tester : | |||||
| $(CXX) $(COMMON_OPT) -Wall -Wextra -Wshadow -fopenmp -std=c++11 dgemv_thread_safety.cpp ../libopenblas.a -lpthread -o dgemv_tester | |||||
| ./dgemv_tester | |||||
| dgemm_tester : dgemv_tester | |||||
| $(CXX) $(COMMON_OPT) -Wall -Wextra -Wshadow -fopenmp -std=c++11 dgemm_thread_safety.cpp ../libopenblas.a -lpthread -o dgemm_tester | |||||
| ./dgemm_tester | |||||
| clean :: | |||||
| rm -f dgemv_tester dgemm_tester | |||||
| @@ -0,0 +1,55 @@ | |||||
| inline void pauser(){ | |||||
| /// a portable way to pause a program | |||||
| std::string dummy; | |||||
| std::cout << "Press enter to continue..."; | |||||
| std::getline(std::cin, dummy); | |||||
| } | |||||
| void FillMatrices(std::vector<std::vector<double>>& matBlock, std::mt19937_64& PRNG, std::uniform_real_distribution<double>& rngdist, const blasint randomMatSize, const uint32_t numConcurrentThreads, const uint32_t numMat){ | |||||
| for(uint32_t i=0; i<numMat; i++){ | |||||
| for(uint32_t j = 0; j < static_cast<uint32_t>(randomMatSize*randomMatSize); j++){ | |||||
| matBlock[i][j] = rngdist(PRNG); | |||||
| } | |||||
| } | |||||
| for(uint32_t i=numMat; i<(numConcurrentThreads*numMat); i+=numMat){ | |||||
| for(uint32_t j=0; j<numMat; j++){ | |||||
| matBlock[i+j] = matBlock[j]; | |||||
| } | |||||
| } | |||||
| } | |||||
| void FillVectors(std::vector<std::vector<double>>& vecBlock, std::mt19937_64& PRNG, std::uniform_real_distribution<double>& rngdist, const blasint randomMatSize, const uint32_t numConcurrentThreads, const uint32_t numVec){ | |||||
| for(uint32_t i=0; i<numVec; i++){ | |||||
| for(uint32_t j = 0; j < static_cast<uint32_t>(randomMatSize); j++){ | |||||
| vecBlock[i][j] = rngdist(PRNG); | |||||
| } | |||||
| } | |||||
| for(uint32_t i=numVec; i<(numConcurrentThreads*numVec); i+=numVec){ | |||||
| for(uint32_t j=0; j<numVec; j++){ | |||||
| vecBlock[i+j] = vecBlock[j]; | |||||
| } | |||||
| } | |||||
| } | |||||
| std::mt19937_64 InitPRNG(){ | |||||
| std::random_device rd; | |||||
| std::mt19937_64 PRNG(rd()); //seed PRNG using /dev/urandom or similar OS provided RNG | |||||
| std::uniform_real_distribution<double> rngdist{-1.0, 1.0}; | |||||
| //make sure the internal state of the PRNG is properly mixed by generating 10M random numbers | |||||
| //PRNGs often have unreliable distribution uniformity and other statistical properties before their internal state is sufficiently mixed | |||||
| for (uint32_t i=0;i<10000000;i++) rngdist(PRNG); | |||||
| return PRNG; | |||||
| } | |||||
| void PrintMatrices(const std::vector<std::vector<double>>& matBlock, const blasint randomMatSize, const uint32_t numConcurrentThreads, const uint32_t numMat){ | |||||
| for (uint32_t i=0;i<numConcurrentThreads*numMat;i++){ | |||||
| std::cout<<i<<std::endl; | |||||
| for (uint32_t j = 0; j < static_cast<uint32_t>(randomMatSize); j++){ | |||||
| for (uint32_t k = 0; k < static_cast<uint32_t>(randomMatSize); k++){ | |||||
| std::cout<<matBlock[i][j*randomMatSize + k]<<" "; | |||||
| } | |||||
| std::cout<<std::endl; | |||||
| } | |||||
| std::cout<<std::endl; | |||||
| } | |||||
| } | |||||
| @@ -0,0 +1,92 @@ | |||||
| #include <iostream> | |||||
| #include <vector> | |||||
| #include <random> | |||||
| #include <future> | |||||
| #include <omp.h> | |||||
| #include "../cblas.h" | |||||
| #include "cpp_thread_safety_common.h" | |||||
| void launch_cblas_dgemm(double* A, double* B, double* C, const blasint randomMatSize){ | |||||
| cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, randomMatSize, randomMatSize, randomMatSize, 1.0, A, randomMatSize, B, randomMatSize, 0.1, C, randomMatSize); | |||||
| } | |||||
| int main(int argc, char* argv[]){ | |||||
| blasint randomMatSize = 1024; //dimension of the random square matrices used | |||||
| uint32_t numConcurrentThreads = 52; //number of concurrent calls of the functions being tested | |||||
| uint32_t numTestRounds = 16; //number of testing rounds before success exit | |||||
| if (argc > 4){ | |||||
| std::cout<<"ERROR: too many arguments for thread safety tester"<<std::endl; | |||||
| abort(); | |||||
| } | |||||
| if(argc == 4){ | |||||
| std::vector<std::string> cliArgs; | |||||
| for (int i = 1; i < argc; i++){ | |||||
| cliArgs.push_back(argv[i]); | |||||
| std::cout<<argv[i]<<std::endl; | |||||
| } | |||||
| randomMatSize = std::stoul(cliArgs[0]); | |||||
| numConcurrentThreads = std::stoul(cliArgs[1]); | |||||
| numTestRounds = std::stoul(cliArgs[2]); | |||||
| } | |||||
| std::uniform_real_distribution<double> rngdist{-1.0, 1.0}; | |||||
| std::vector<std::vector<double>> matBlock(numConcurrentThreads*3); | |||||
| std::vector<std::future<void>> futureBlock(numConcurrentThreads); | |||||
| std::cout<<"*----------------------------*\n"; | |||||
| std::cout<<"| DGEMM thread safety tester |\n"; | |||||
| std::cout<<"*----------------------------*\n"; | |||||
| std::cout<<"Size of random matrices(N=M=K): "<<randomMatSize<<'\n'; | |||||
| std::cout<<"Number of concurrent calls into OpenBLAS : "<<numConcurrentThreads<<'\n'; | |||||
| std::cout<<"Number of testing rounds : "<<numTestRounds<<'\n'; | |||||
| std::cout<<"This test will need "<<(static_cast<uint64_t>(randomMatSize*randomMatSize)*numConcurrentThreads*3*8)/static_cast<double>(1024*1024)<<" MiB of RAM\n"<<std::endl; | |||||
| std::cout<<"Initializing random number generator..."<<std::flush; | |||||
| std::mt19937_64 PRNG = InitPRNG(); | |||||
| std::cout<<"done\n"; | |||||
| std::cout<<"Preparing to test CBLAS DGEMM thread safety\n"; | |||||
| std::cout<<"Allocating matrices..."<<std::flush; | |||||
| for(uint32_t i=0; i<(numConcurrentThreads*3); i++){ | |||||
| matBlock[i].resize(randomMatSize*randomMatSize); | |||||
| } | |||||
| std::cout<<"done\n"; | |||||
| //pauser(); | |||||
| std::cout<<"Filling matrices with random numbers..."<<std::flush; | |||||
| FillMatrices(matBlock, PRNG, rngdist, randomMatSize, numConcurrentThreads, 3); | |||||
| //PrintMatrices(matBlock, randomMatSize, numConcurrentThreads, 3); | |||||
| std::cout<<"done\n"; | |||||
| std::cout<<"Testing CBLAS DGEMM thread safety\n"; | |||||
| omp_set_num_threads(numConcurrentThreads); | |||||
| for(uint32_t R=0; R<numTestRounds; R++){ | |||||
| std::cout<<"DGEMM round #"<<R<<std::endl; | |||||
| std::cout<<"Launching "<<numConcurrentThreads<<" threads simultaneously using OpenMP..."<<std::flush; | |||||
| #pragma omp parallel for default(none) shared(futureBlock, matBlock, randomMatSize, numConcurrentThreads) | |||||
| for(uint32_t i=0; i<numConcurrentThreads; i++){ | |||||
| futureBlock[i] = std::async(std::launch::async, launch_cblas_dgemm, &matBlock[i*3][0], &matBlock[i*3+1][0], &matBlock[i*3+2][0], randomMatSize); | |||||
| //launch_cblas_dgemm( &matBlock[i][0], &matBlock[i+1][0], &matBlock[i+2][0]); | |||||
| } | |||||
| std::cout<<"done\n"; | |||||
| std::cout<<"Waiting for threads to finish..."<<std::flush; | |||||
| for(uint32_t i=0; i<numConcurrentThreads; i++){ | |||||
| futureBlock[i].get(); | |||||
| } | |||||
| std::cout<<"done\n"; | |||||
| //PrintMatrices(matBlock, randomMatSize, numConcurrentThreads, 3); | |||||
| std::cout<<"Comparing results from different threads..."<<std::flush; | |||||
| for(uint32_t i=3; i<(numConcurrentThreads*3); i+=3){ //i is the index of matrix A, for a given thread | |||||
| for(uint32_t j = 0; j < static_cast<uint32_t>(randomMatSize*randomMatSize); j++){ | |||||
| if (std::abs(matBlock[i+2][j] - matBlock[2][j]) > 1.0E-13){ //i+2 is the index of matrix C, for a given thread | |||||
| std::cout<<"ERROR: one of the threads returned a different result! Index : "<<i+2<<std::endl; | |||||
| std::cout<<"CBLAS DGEMM thread safety test FAILED!"<<std::endl; | |||||
| return -1; | |||||
| } | |||||
| } | |||||
| } | |||||
| std::cout<<"OK!\n"<<std::endl; | |||||
| } | |||||
| std::cout<<"CBLAS DGEMM thread safety test PASSED!\n"<<std::endl; | |||||
| return 0; | |||||
| } | |||||
| @@ -0,0 +1,101 @@ | |||||
| #include <iostream> | |||||
| #include <vector> | |||||
| #include <random> | |||||
| #include <future> | |||||
| #include <omp.h> | |||||
| #include "../cblas.h" | |||||
| #include "cpp_thread_safety_common.h" | |||||
| void launch_cblas_dgemv(double* A, double* x, double* y, const blasint randomMatSize){ | |||||
| const blasint inc = 1; | |||||
| cblas_dgemv(CblasColMajor, CblasNoTrans, randomMatSize, randomMatSize, 1.0, A, randomMatSize, x, inc, 0.1, y, inc); | |||||
| } | |||||
| int main(int argc, char* argv[]){ | |||||
| blasint randomMatSize = 1024; //dimension of the random square matrices and vectors being used | |||||
| uint32_t numConcurrentThreads = 52; //number of concurrent calls of the functions being tested | |||||
| uint32_t numTestRounds = 16; //number of testing rounds before success exit | |||||
| if (argc > 4){ | |||||
| std::cout<<"ERROR: too many arguments for thread safety tester"<<std::endl; | |||||
| abort(); | |||||
| } | |||||
| if(argc == 4){ | |||||
| std::vector<std::string> cliArgs; | |||||
| for (int i = 1; i < argc; i++){ | |||||
| cliArgs.push_back(argv[i]); | |||||
| std::cout<<argv[i]<<std::endl; | |||||
| } | |||||
| randomMatSize = std::stoul(cliArgs.at(0)); | |||||
| numConcurrentThreads = std::stoul(cliArgs.at(1)); | |||||
| numTestRounds = std::stoul(cliArgs.at(2)); | |||||
| } | |||||
| std::uniform_real_distribution<double> rngdist{-1.0, 1.0}; | |||||
| std::vector<std::vector<double>> matBlock(numConcurrentThreads); | |||||
| std::vector<std::vector<double>> vecBlock(numConcurrentThreads*2); | |||||
| std::vector<std::future<void>> futureBlock(numConcurrentThreads); | |||||
| std::cout<<"*----------------------------*\n"; | |||||
| std::cout<<"| DGEMV thread safety tester |\n"; | |||||
| std::cout<<"*----------------------------*\n"; | |||||
| std::cout<<"Size of random matrices and vectors(N=M): "<<randomMatSize<<'\n'; | |||||
| std::cout<<"Number of concurrent calls into OpenBLAS : "<<numConcurrentThreads<<'\n'; | |||||
| std::cout<<"Number of testing rounds : "<<numTestRounds<<'\n'; | |||||
| std::cout<<"This test will need "<<((static_cast<uint64_t>(randomMatSize*randomMatSize)*numConcurrentThreads*8)+(static_cast<uint64_t>(randomMatSize)*numConcurrentThreads*8*2))/static_cast<double>(1024*1024)<<" MiB of RAM\n"<<std::endl; | |||||
| std::cout<<"Initializing random number generator..."<<std::flush; | |||||
| std::mt19937_64 PRNG = InitPRNG(); | |||||
| std::cout<<"done\n"; | |||||
| std::cout<<"Preparing to test CBLAS DGEMV thread safety\n"; | |||||
| std::cout<<"Allocating matrices..."<<std::flush; | |||||
| for(uint32_t i=0; i<numConcurrentThreads; i++){ | |||||
| matBlock.at(i).resize(randomMatSize*randomMatSize); | |||||
| } | |||||
| std::cout<<"done\n"; | |||||
| std::cout<<"Allocating vectors..."<<std::flush; | |||||
| for(uint32_t i=0; i<(numConcurrentThreads*2); i++){ | |||||
| vecBlock.at(i).resize(randomMatSize); | |||||
| } | |||||
| std::cout<<"done\n"; | |||||
| //pauser(); | |||||
| std::cout<<"Filling matrices with random numbers..."<<std::flush; | |||||
| FillMatrices(matBlock, PRNG, rngdist, randomMatSize, numConcurrentThreads, 1); | |||||
| //PrintMatrices(matBlock, randomMatSize, numConcurrentThreads); | |||||
| std::cout<<"done\n"; | |||||
| std::cout<<"Filling vectors with random numbers..."<<std::flush; | |||||
| FillVectors(vecBlock, PRNG, rngdist, randomMatSize, numConcurrentThreads, 2); | |||||
| std::cout<<"done\n"; | |||||
| std::cout<<"Testing CBLAS DGEMV thread safety"<<std::endl; | |||||
| omp_set_num_threads(numConcurrentThreads); | |||||
| for(uint32_t R=0; R<numTestRounds; R++){ | |||||
| std::cout<<"DGEMV round #"<<R<<std::endl; | |||||
| std::cout<<"Launching "<<numConcurrentThreads<<" threads simultaneously using OpenMP..."<<std::flush; | |||||
| #pragma omp parallel for default(none) shared(futureBlock, matBlock, vecBlock, randomMatSize, numConcurrentThreads) | |||||
| for(uint32_t i=0; i<numConcurrentThreads; i++){ | |||||
| futureBlock[i] = std::async(std::launch::async, launch_cblas_dgemv, &matBlock[i][0], &vecBlock[i*2][0], &vecBlock[i*2+1][0], randomMatSize); | |||||
| } | |||||
| std::cout<<"done\n"; | |||||
| std::cout<<"Waiting for threads to finish..."<<std::flush; | |||||
| for(uint32_t i=0; i<numConcurrentThreads; i++){ | |||||
| futureBlock[i].get(); | |||||
| } | |||||
| std::cout<<"done\n"; | |||||
| std::cout<<"Comparing results from different threads..."<<std::flush; | |||||
| for(uint32_t i=2; i<(numConcurrentThreads*2); i+=2){ //i is the index of vector x, for a given thread | |||||
| for(uint32_t j = 0; j < static_cast<uint32_t>(randomMatSize); j++){ | |||||
| if (std::abs(vecBlock[i+1][j] - vecBlock[1][j]) > 1.0E-13){ //i+1 is the index of vector y, for a given thread | |||||
| std::cout<<"ERROR: one of the threads returned a different result! Index : "<<i+1<<std::endl; | |||||
| std::cout<<"CBLAS DGEMV thread safety test FAILED!"<<std::endl; | |||||
| return -1; | |||||
| } | |||||
| } | |||||
| } | |||||
| std::cout<<"OK!\n"<<std::endl; | |||||
| } | |||||
| std::cout<<"CBLAS DGEMV thread safety test PASSED!\n"<<std::endl; | |||||
| return 0; | |||||
| } | |||||
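Both testers finish each round by comparing every thread's output with the first thread's output, using an absolute tolerance of `1.0E-13`. That comparison step can be sketched on its own as below. The helper name `demo_first_mismatch` is hypothetical, and unlike the testers it writes the check as `!(diff <= tol)` so that a NaN result also counts as a mismatch.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Returns the index of the first result block that differs from block 0 by
// more than tol in any element (or in length), or -1 if all blocks agree.
static long demo_first_mismatch(
        const std::vector<std::vector<double>> &results, double tol) {
    assert(tol >= 0.0);
    for (std::size_t i = 1; i < results.size(); ++i) {
        if (results[i].size() != results[0].size()) return static_cast<long>(i);
        for (std::size_t j = 0; j < results[0].size(); ++j) {
            const double diff = std::abs(results[i][j] - results[0][j]);
            if (!(diff <= tol)) return static_cast<long>(i);  // also true for NaN
        }
    }
    return -1;
}
```

Because every thread starts from copies of the same inputs, any nonzero return value points at a block that was corrupted during the concurrent calls.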
| @@ -39,6 +39,8 @@ | |||||
| // Cavium | // Cavium | ||||
| #define CPU_THUNDERX 7 | #define CPU_THUNDERX 7 | ||||
| #define CPU_THUNDERX2T99 8 | #define CPU_THUNDERX2T99 8 | ||||
| // HiSilicon | |||||
| #define CPU_TSV110 9 | |||||
| static char *cpuname[] = { | static char *cpuname[] = { | ||||
| "UNKNOWN", | "UNKNOWN", | ||||
| @@ -49,7 +51,8 @@ static char *cpuname[] = { | |||||
| "CORTEXA73", | "CORTEXA73", | ||||
| "FALKOR", | "FALKOR", | ||||
| "THUNDERX", | "THUNDERX", | ||||
| "THUNDERX2T99" | |||||
| "THUNDERX2T99", | |||||
| "TSV110" | |||||
| }; | }; | ||||
| static char *cpuname_lower[] = { | static char *cpuname_lower[] = { | ||||
| @@ -61,7 +64,8 @@ static char *cpuname_lower[] = { | |||||
| "cortexa73", | "cortexa73", | ||||
| "falkor", | "falkor", | ||||
| "thunderx", | "thunderx", | ||||
| "thunderx2t99" | |||||
| "thunderx2t99", | |||||
| "tsv110" | |||||
| }; | }; | ||||
| int get_feature(char *search) | int get_feature(char *search) | ||||
| @@ -145,6 +149,9 @@ int detect(void) | |||||
| return CPU_THUNDERX; | return CPU_THUNDERX; | ||||
| else if (strstr(cpu_implementer, "0x43") && strstr(cpu_part, "0x0af")) | else if (strstr(cpu_implementer, "0x43") && strstr(cpu_part, "0x0af")) | ||||
| return CPU_THUNDERX2T99; | return CPU_THUNDERX2T99; | ||||
| // HiSilicon | |||||
| else if (strstr(cpu_implementer, "0x48") && strstr(cpu_part, "0xd01")) | |||||
| return CPU_TSV110; | |||||
| } | } | ||||
| p = (char *) NULL ; | p = (char *) NULL ; | ||||
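The detection hunk above matches substrings of the implementer and part fields (`0x48` and `0xd01` for TSV110) and returns an id that also indexes the name tables. A reduced sketch of that pattern follows. The `Demo` enum is renumbered for brevity and, like the function and table names, is hypothetical; only the two string pairs are taken from the diff.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical, renumbered subset of the CPU ids.
enum DemoCpu { DEMO_UNKNOWN = 0, DEMO_THUNDERX2T99 = 1, DEMO_TSV110 = 2 };

// Name table indexed by DemoCpu. Each entry needs its own comma: without it,
// adjacent string literals would be concatenated into a single entry.
static const char *const demo_cpuname[] = {
    "UNKNOWN",
    "THUNDERX2T99",
    "TSV110"
};

// Substring match on the two cpuinfo fields, as in the hunk above.
static DemoCpu demo_detect(const char *implementer, const char *part) {
    assert(implementer != nullptr && part != nullptr);
    if (std::strstr(implementer, "0x43") && std::strstr(part, "0x0af"))
        return DEMO_THUNDERX2T99;
    if (std::strstr(implementer, "0x48") && std::strstr(part, "0xd01"))
        return DEMO_TSV110;
    return DEMO_UNKNOWN;
}
```

Keeping the enum values and the table order in step is the reason the diff touches both name arrays whenever a CPU id is added.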
| @@ -286,6 +293,21 @@ void get_cpuconfig(void) | |||||
| printf("#define DTB_DEFAULT_ENTRIES 64 \n"); | printf("#define DTB_DEFAULT_ENTRIES 64 \n"); | ||||
| printf("#define DTB_SIZE 4096 \n"); | printf("#define DTB_SIZE 4096 \n"); | ||||
| break; | break; | ||||
| case CPU_TSV110: | |||||
| printf("#define TSV110 \n"); | |||||
| printf("#define L1_CODE_SIZE 65536 \n"); | |||||
| printf("#define L1_CODE_LINESIZE 64 \n"); | |||||
| printf("#define L1_CODE_ASSOCIATIVE 4 \n"); | |||||
| printf("#define L1_DATA_SIZE 65536 \n"); | |||||
| printf("#define L1_DATA_LINESIZE 64 \n"); | |||||
| printf("#define L1_DATA_ASSOCIATIVE 4 \n"); | |||||
| printf("#define L2_SIZE 524288 \n"); | |||||
| printf("#define L2_LINESIZE 64 \n"); | |||||
| printf("#define L2_ASSOCIATIVE 8 \n"); | |||||
| printf("#define DTB_DEFAULT_ENTRIES 64 \n"); | |||||
| printf("#define DTB_SIZE 4096 \n"); | |||||
| break; | |||||
| } | } | ||||
| } | } | ||||
| @@ -94,7 +94,7 @@ char *corename[] = { | |||||
| "CELL", | "CELL", | ||||
| "PPCG4", | "PPCG4", | ||||
| "POWER8", | "POWER8", | ||||
| "POWER8" | |||||
| "POWER9" | |||||
| }; | }; | ||||
| int detect(void){ | int detect(void){ | ||||
| @@ -124,7 +124,7 @@ int detect(void){ | |||||
| if (!strncasecmp(p, "POWER6", 6)) return CPUTYPE_POWER6; | if (!strncasecmp(p, "POWER6", 6)) return CPUTYPE_POWER6; | ||||
| if (!strncasecmp(p, "POWER7", 6)) return CPUTYPE_POWER6; | if (!strncasecmp(p, "POWER7", 6)) return CPUTYPE_POWER6; | ||||
| if (!strncasecmp(p, "POWER8", 6)) return CPUTYPE_POWER8; | if (!strncasecmp(p, "POWER8", 6)) return CPUTYPE_POWER8; | ||||
| if (!strncasecmp(p, "POWER9", 6)) return CPUTYPE_POWER8; | |||||
| if (!strncasecmp(p, "POWER9", 6)) return CPUTYPE_POWER9; | |||||
| if (!strncasecmp(p, "Cell", 4)) return CPUTYPE_CELL; | if (!strncasecmp(p, "Cell", 4)) return CPUTYPE_CELL; | ||||
| if (!strncasecmp(p, "7447", 4)) return CPUTYPE_PPCG4; | if (!strncasecmp(p, "7447", 4)) return CPUTYPE_PPCG4; | ||||
| @@ -156,7 +156,7 @@ int detect(void){ | |||||
| if (!strncasecmp(p, "POWER6", 6)) return CPUTYPE_POWER6; | if (!strncasecmp(p, "POWER6", 6)) return CPUTYPE_POWER6; | ||||
| if (!strncasecmp(p, "POWER7", 6)) return CPUTYPE_POWER6; | if (!strncasecmp(p, "POWER7", 6)) return CPUTYPE_POWER6; | ||||
| if (!strncasecmp(p, "POWER8", 6)) return CPUTYPE_POWER8; | if (!strncasecmp(p, "POWER8", 6)) return CPUTYPE_POWER8; | ||||
| if (!strncasecmp(p, "POWER9", 6)) return CPUTYPE_POWER8; | |||||
| if (!strncasecmp(p, "POWER9", 6)) return CPUTYPE_POWER9; | |||||
| if (!strncasecmp(p, "Cell", 4)) return CPUTYPE_CELL; | if (!strncasecmp(p, "Cell", 4)) return CPUTYPE_CELL; | ||||
| if (!strncasecmp(p, "7447", 4)) return CPUTYPE_PPCG4; | if (!strncasecmp(p, "7447", 4)) return CPUTYPE_PPCG4; | ||||
| return CPUTYPE_POWER5; | return CPUTYPE_POWER5; | ||||
| @@ -180,7 +180,7 @@ int id; | |||||
| __asm __volatile("mfpvr %0" : "=r"(id)); | __asm __volatile("mfpvr %0" : "=r"(id)); | ||||
| switch ( id >> 16 ) { | switch ( id >> 16 ) { | ||||
| case 0x4e: // POWER9 | case 0x4e: // POWER9 | ||||
| return CPUTYPE_POWER8; | |||||
| return CPUTYPE_POWER9; | |||||
| break; | break; | ||||
| case 0x4d: | case 0x4d: | ||||
| case 0x4b: // POWER8/8E | case 0x4b: // POWER8/8E | ||||
| @@ -1359,6 +1359,8 @@ int get_cpuname(void){ | |||||
| return CPUTYPE_NEHALEM; | return CPUTYPE_NEHALEM; | ||||
| case 12: | case 12: | ||||
| // Apollo Lake | // Apollo Lake | ||||
| case 15: | |||||
| // Denverton | |||||
| return CPUTYPE_NEHALEM; | return CPUTYPE_NEHALEM; | ||||
| } | } | ||||
| break; | break; | ||||
| @@ -1376,9 +1378,9 @@ int get_cpuname(void){ | |||||
| } | } | ||||
| break; | break; | ||||
| case 9: | case 9: | ||||
| case 8: | |||||
| case 8: | |||||
| switch (model) { | switch (model) { | ||||
| case 14: // Kaby Lake | |||||
| case 14: // Kaby Lake and refreshes | |||||
| if(support_avx2()) | if(support_avx2()) | ||||
| return CPUTYPE_HASWELL; | return CPUTYPE_HASWELL; | ||||
| if(support_avx()) | if(support_avx()) | ||||
| @@ -27,9 +27,9 @@ | |||||
| #include <string.h> | #include <string.h> | ||||
| #define CPU_GENERIC 0 | |||||
| #define CPU_Z13 1 | |||||
| #define CPU_Z14 2 | |||||
| #define CPU_GENERIC 0 | |||||
| #define CPU_Z13 1 | |||||
| #define CPU_Z14 2 | |||||
| static char *cpuname[] = { | static char *cpuname[] = { | ||||
| "ZARCH_GENERIC", | "ZARCH_GENERIC", | ||||
| @@ -64,10 +64,8 @@ int detect(void) | |||||
| if (strstr(p, "2964")) return CPU_Z13; | if (strstr(p, "2964")) return CPU_Z13; | ||||
| if (strstr(p, "2965")) return CPU_Z13; | if (strstr(p, "2965")) return CPU_Z13; | ||||
| /* detect z14, but fall back to z13 */ | |||||
| if (strstr(p, "3906")) return CPU_Z13; | |||||
| if (strstr(p, "3907")) return CPU_Z13; | |||||
| if (strstr(p, "3906")) return CPU_Z14; | |||||
| if (strstr(p, "3907")) return CPU_Z14; | |||||
| return CPU_GENERIC; | return CPU_GENERIC; | ||||
| } | } | ||||
| @@ -116,7 +114,14 @@ void get_cpuconfig(void) | |||||
| break; | break; | ||||
| case CPU_Z14: | case CPU_Z14: | ||||
| printf("#define Z14\n"); | printf("#define Z14\n"); | ||||
| printf("#define L1_DATA_SIZE 131072\n"); | |||||
| printf("#define L1_DATA_LINESIZE 256\n"); | |||||
| printf("#define L1_DATA_ASSOCIATIVE 8\n"); | |||||
| printf("#define L2_SIZE 4194304\n"); | |||||
| printf("#define L2_LINESIZE 256\n"); | |||||
| printf("#define L2_ASSOCIATIVE 8\n"); | |||||
| printf("#define DTB_DEFAULT_ENTRIES 64\n"); | printf("#define DTB_DEFAULT_ENTRIES 64\n"); | ||||
| printf("#define DTB_SIZE 4096\n"); | |||||
| break; | break; | ||||
| } | } | ||||
| } | } | ||||
| @@ -113,7 +113,7 @@ ARCH_X86 | |||||
| ARCH_X86_64 | ARCH_X86_64 | ||||
| #endif | #endif | ||||
| #if defined(__powerpc___) || defined(__PPC__) || defined(_POWER) | |||||
| #if defined(__powerpc___) || defined(__PPC__) || defined(_POWER) || defined(__POWERPC__) | |||||
| ARCH_POWER | ARCH_POWER | ||||
| #endif | #endif | ||||
| @@ -577,7 +577,7 @@ | |||||
| SUBROUTINE STEST1(SCOMP1,STRUE1,SSIZE,SFAC) | SUBROUTINE STEST1(SCOMP1,STRUE1,SSIZE,SFAC) | ||||
| * ************************* STEST1 ***************************** | * ************************* STEST1 ***************************** | ||||
| * | * | ||||
| * THIS IS AN INTERFACE SUBROUTINE TO ACCOMODATE THE FORTRAN | |||||
| * THIS IS AN INTERFACE SUBROUTINE TO ACCOMMODATE THE FORTRAN | |||||
| * REQUIREMENT THAT WHEN A DUMMY ARGUMENT IS AN ARRAY, THE | * REQUIREMENT THAT WHEN A DUMMY ARGUMENT IS AN ARRAY, THE | ||||
| * ACTUAL ARGUMENT MUST ALSO BE AN ARRAY OR AN ARRAY ELEMENT. | * ACTUAL ARGUMENT MUST ALSO BE AN ARRAY OR AN ARRAY ELEMENT. | ||||
| * | * | ||||
| @@ -653,7 +653,7 @@ | |||||
| SUBROUTINE STEST1(SCOMP1,STRUE1,SSIZE,SFAC) | SUBROUTINE STEST1(SCOMP1,STRUE1,SSIZE,SFAC) | ||||
| * ************************* STEST1 ***************************** | * ************************* STEST1 ***************************** | ||||
| * | * | ||||
| * THIS IS AN INTERFACE SUBROUTINE TO ACCOMODATE THE FORTRAN | |||||
| * THIS IS AN INTERFACE SUBROUTINE TO ACCOMMODATE THE FORTRAN | |||||
| * REQUIREMENT THAT WHEN A DUMMY ARGUMENT IS AN ARRAY, THE | * REQUIREMENT THAT WHEN A DUMMY ARGUMENT IS AN ARRAY, THE | ||||
| * ACTUAL ARGUMENT MUST ALSO BE AN ARRAY OR AN ARRAY ELEMENT. | * ACTUAL ARGUMENT MUST ALSO BE AN ARRAY OR AN ARRAY ELEMENT. | ||||
| * | * | ||||
| @@ -653,7 +653,7 @@ | |||||
| SUBROUTINE STEST1(SCOMP1,STRUE1,SSIZE,SFAC) | SUBROUTINE STEST1(SCOMP1,STRUE1,SSIZE,SFAC) | ||||
| * ************************* STEST1 ***************************** | * ************************* STEST1 ***************************** | ||||
| * | * | ||||
| * THIS IS AN INTERFACE SUBROUTINE TO ACCOMODATE THE FORTRAN | |||||
| * THIS IS AN INTERFACE SUBROUTINE TO ACCOMMODATE THE FORTRAN | |||||
| * REQUIREMENT THAT WHEN A DUMMY ARGUMENT IS AN ARRAY, THE | * REQUIREMENT THAT WHEN A DUMMY ARGUMENT IS AN ARRAY, THE | ||||
| * ACTUAL ARGUMENT MUST ALSO BE AN ARRAY OR AN ARRAY ELEMENT. | * ACTUAL ARGUMENT MUST ALSO BE AN ARRAY OR AN ARRAY ELEMENT. | ||||
| * | * | ||||
| @@ -577,7 +577,7 @@ | |||||
| SUBROUTINE STEST1(SCOMP1,STRUE1,SSIZE,SFAC) | SUBROUTINE STEST1(SCOMP1,STRUE1,SSIZE,SFAC) | ||||
| * ************************* STEST1 ***************************** | * ************************* STEST1 ***************************** | ||||
| * | * | ||||
| * THIS IS AN INTERFACE SUBROUTINE TO ACCOMODATE THE FORTRAN | |||||
| * THIS IS AN INTERFACE SUBROUTINE TO ACCOMMODATE THE FORTRAN | |||||
| * REQUIREMENT THAT WHEN A DUMMY ARGUMENT IS AN ARRAY, THE | * REQUIREMENT THAT WHEN A DUMMY ARGUMENT IS AN ARRAY, THE | ||||
| * ACTUAL ARGUMENT MUST ALSO BE AN ARRAY OR AN ARRAY ELEMENT. | * ACTUAL ARGUMENT MUST ALSO BE AN ARRAY OR AN ARRAY ELEMENT. | ||||
| * | * | ||||
| @@ -346,7 +346,7 @@ int CNAME(BLASLONG m, FLOAT *a, BLASLONG lda, FLOAT *x, BLASLONG incx, FLOAT *bu | |||||
| range_m[MAX_CPU_NUMBER - num_cpu - 1] = range_m[MAX_CPU_NUMBER - num_cpu] - width; | range_m[MAX_CPU_NUMBER - num_cpu - 1] = range_m[MAX_CPU_NUMBER - num_cpu] - width; | ||||
| range_n[num_cpu] = num_cpu * (((m + 15) & ~15) + 16); | range_n[num_cpu] = num_cpu * (((m + 15) & ~15) + 16); | ||||
| if (range_n[num_cpu] > m) range_n[num_cpu] = m; | |||||
| if (range_n[num_cpu] > m * num_cpu) range_n[num_cpu] = m * num_cpu; | |||||
| queue[num_cpu].mode = mode; | queue[num_cpu].mode = mode; | ||||
| queue[num_cpu].routine = trmv_kernel; | queue[num_cpu].routine = trmv_kernel; | ||||
| @@ -386,7 +386,7 @@ int CNAME(BLASLONG m, FLOAT *a, BLASLONG lda, FLOAT *x, BLASLONG incx, FLOAT *bu | |||||
| range_m[num_cpu + 1] = range_m[num_cpu] + width; | range_m[num_cpu + 1] = range_m[num_cpu] + width; | ||||
| range_n[num_cpu] = num_cpu * (((m + 15) & ~15) + 16); | range_n[num_cpu] = num_cpu * (((m + 15) & ~15) + 16); | ||||
| if (range_n[num_cpu] > m) range_n[num_cpu] = m; | |||||
| if (range_n[num_cpu] > m * num_cpu) range_n[num_cpu] = m * num_cpu; | |||||
| queue[num_cpu].mode = mode; | queue[num_cpu].mode = mode; | ||||
| queue[num_cpu].routine = trmv_kernel; | queue[num_cpu].routine = trmv_kernel; | ||||
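The corrected limit in the two hunks above can be checked with a small standalone sketch. It assumes, as a reading of the diff rather than something the diff states, that `range_n[num_cpu]` is the starting offset of thread `num_cpu`'s slice in a shared buffer and that each slice holds up to `m` elements. The helper names `offset_old` and `offset_new` are hypothetical and not part of OpenBLAS.

```c
#include <assert.h>

/* Hypothetical helpers, not OpenBLAS functions: they reproduce only the
 * offset arithmetic from the diff for a problem size m and thread index t. */

/* Old limit: the padded offset is capped at m, regardless of t. */
static long offset_old(long m, long t) {
    assert(m >= 0 && t >= 0);
    long off = t * (((m + 15) & ~15L) + 16);  /* m rounded up to 16, plus padding */
    if (off > m) off = m;
    return off;
}

/* New limit: the cap scales with the thread index, so offsets stay distinct. */
static long offset_new(long m, long t) {
    assert(m >= 0 && t >= 0);
    long off = t * (((m + 15) & ~15L) + 16);
    if (off > m * t) off = m * t;
    return off;
}
```

For `m = 100`, the old limit gives threads 1 and 2 the same offset (100), whereas the new limit gives them 100 and 200. Under the slice assumption stated above, the old limit would let every thread after the first start at the same position.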
| @@ -18,8 +18,12 @@ ifeq ($(DYNAMIC_ARCH), 1) | |||||
| ifeq ($(ARCH),arm64) | ifeq ($(ARCH),arm64) | ||||
| COMMONOBJS += dynamic_arm64.$(SUFFIX) | COMMONOBJS += dynamic_arm64.$(SUFFIX) | ||||
| else | else | ||||
| ifeq ($(ARCH),power) | |||||
| COMMONOBJS += dynamic_power.$(SUFFIX) | |||||
| else | |||||
| COMMONOBJS += dynamic.$(SUFFIX) | COMMONOBJS += dynamic.$(SUFFIX) | ||||
| endif | endif | ||||
| endif | |||||
| else | else | ||||
| COMMONOBJS += parameter.$(SUFFIX) | COMMONOBJS += parameter.$(SUFFIX) | ||||
| endif | endif | ||||
| @@ -78,8 +82,12 @@ ifeq ($(DYNAMIC_ARCH), 1) | |||||
| ifeq ($(ARCH),arm64) | ifeq ($(ARCH),arm64) | ||||
| HPLOBJS = memory.$(SUFFIX) xerbla.$(SUFFIX) dynamic_arm64.$(SUFFIX) | HPLOBJS = memory.$(SUFFIX) xerbla.$(SUFFIX) dynamic_arm64.$(SUFFIX) | ||||
| else | else | ||||
| ifeq ($(ARCH),power) | |||||
| HPLOBJS = memory.$(SUFFIX) xerbla.$(SUFFIX) dynamic_power.$(SUFFIX) | |||||
| else | |||||
| HPLOBJS = memory.$(SUFFIX) xerbla.$(SUFFIX) dynamic.$(SUFFIX) | HPLOBJS = memory.$(SUFFIX) xerbla.$(SUFFIX) dynamic.$(SUFFIX) | ||||
| endif | endif | ||||
| endif | |||||
| else | else | ||||
| HPLOBJS = memory.$(SUFFIX) xerbla.$(SUFFIX) parameter.$(SUFFIX) | HPLOBJS = memory.$(SUFFIX) xerbla.$(SUFFIX) parameter.$(SUFFIX) | ||||
| endif | endif | ||||
| @@ -109,7 +109,7 @@ extern unsigned int openblas_thread_timeout(); | |||||
| /* equal to "OMP_NUM_THREADS - 1" and thread only wakes up when */ | /* equal to "OMP_NUM_THREADS - 1" and thread only wakes up when */ | ||||
| /* jobs is queued. */ | /* jobs is queued. */ | ||||
| /* We need this grobal for cheking if initialization is finished. */ | |||||
| /* We need this global for checking if initialization is finished. */ | |||||
| int blas_server_avail __attribute__((aligned(ATTRIBUTE_SIZE))) = 0; | int blas_server_avail __attribute__((aligned(ATTRIBUTE_SIZE))) = 0; | ||||
| /* Local Variables */ | /* Local Variables */ | ||||
| @@ -150,8 +150,8 @@ static unsigned int thread_timeout = (1U << (THREAD_TIMEOUT)); | |||||
| #ifdef MONITOR | #ifdef MONITOR | ||||
| /* Monitor is a function to see thread's status for every seconds. */ | |||||
| /* Usually it turns off and it's for debugging. */ | |||||
| /* Monitor is a function that reports each thread's status every second. */ | |||||
| /* It is normally disabled and is intended for debugging. */ | |||||
| static pthread_t monitor_thread; | static pthread_t monitor_thread; | ||||
| static int main_status[MAX_CPU_NUMBER]; | static int main_status[MAX_CPU_NUMBER]; | ||||
| @@ -50,7 +50,7 @@ | |||||
| /* This is a thread implementation for Win32 lazy implementation */ | /* This is a thread implementation for Win32 lazy implementation */ | ||||
| /* Thread server common infomation */ | |||||
| /* Thread server common information */ | |||||
| typedef struct{ | typedef struct{ | ||||
| CRITICAL_SECTION lock; | CRITICAL_SECTION lock; | ||||
| HANDLE filled; | HANDLE filled; | ||||
| @@ -61,7 +61,7 @@ typedef struct{ | |||||
| } blas_pool_t; | } blas_pool_t; | ||||
| /* We need this global for cheking if initialization is finished. */ | |||||
| /* We need this global for checking if initialization is finished. */ | |||||
| int blas_server_avail = 0; | int blas_server_avail = 0; | ||||
| /* Local Variables */ | /* Local Variables */ | ||||
| @@ -461,13 +461,18 @@ int BLASFUNC(blas_thread_shutdown)(void){ | |||||
| SetEvent(pool.killed); | SetEvent(pool.killed); | ||||
| for(i = 0; i < blas_num_threads - 1; i++){ | for(i = 0; i < blas_num_threads - 1; i++){ | ||||
| // Could also just use WaitForMultipleObjects | |||||
| WaitForSingleObject(blas_threads[i], 5); //INFINITE); | WaitForSingleObject(blas_threads[i], 5); //INFINITE); | ||||
| #ifndef OS_WINDOWSSTORE | #ifndef OS_WINDOWSSTORE | ||||
| // TerminateThread is only available with WINAPI_DESKTOP and WINAPI_SYSTEM not WINAPI_APP in UWP | // TerminateThread is only available with WINAPI_DESKTOP and WINAPI_SYSTEM not WINAPI_APP in UWP | ||||
| TerminateThread(blas_threads[i],0); | TerminateThread(blas_threads[i],0); | ||||
| #endif | #endif | ||||
| CloseHandle(blas_threads[i]); | |||||
| } | } | ||||
| CloseHandle(pool.filled); | |||||
| CloseHandle(pool.killed); | |||||
| blas_server_avail = 0; | blas_server_avail = 0; | ||||
| } | } | ||||
| @@ -322,7 +322,7 @@ int support_avx2(){ | |||||
| } | } | ||||
| int support_avx512(){ | int support_avx512(){ | ||||
| #ifndef NO_AVX512 | |||||
| #if !defined(NO_AVX) && !defined(NO_AVX512) | |||||
| int eax, ebx, ecx, edx; | int eax, ebx, ecx, edx; | ||||
| int ret=0; | int ret=0; | ||||
| @@ -566,8 +566,8 @@ static gotoblas_t *get_coretype(void){ | |||||
| return &gotoblas_NEHALEM; //OS doesn't support AVX. Use old kernels. | return &gotoblas_NEHALEM; //OS doesn't support AVX. Use old kernels. | ||||
| } | } | ||||
| } | } | ||||
| //Apollo Lake | |||||
| if (model == 12) { | |||||
| //Apollo Lake or Denverton | |||||
| if (model == 12 || model == 15) { | |||||
| return &gotoblas_NEHALEM; | return &gotoblas_NEHALEM; | ||||
| } | } | ||||
| return NULL; | return NULL; | ||||
| @@ -0,0 +1,102 @@ | |||||
| #include "common.h" | |||||
| extern gotoblas_t gotoblas_POWER6; | |||||
| extern gotoblas_t gotoblas_POWER8; | |||||
| extern gotoblas_t gotoblas_POWER9; | |||||
| extern void openblas_warning(int verbose, const char *msg); | |||||
| static char *corename[] = { | |||||
| "unknown", | |||||
| "POWER6", | |||||
| "POWER8", | |||||
| "POWER9" | |||||
| }; | |||||
| #define NUM_CORETYPES 4 | |||||
| char *gotoblas_corename(void) { | |||||
| if (gotoblas == &gotoblas_POWER6) return corename[1]; | |||||
| if (gotoblas == &gotoblas_POWER8) return corename[2]; | |||||
| if (gotoblas == &gotoblas_POWER9) return corename[3]; | |||||
| return corename[0]; | |||||
| } | |||||
| static gotoblas_t *get_coretype(void) { | |||||
| if (__builtin_cpu_is("power6") || __builtin_cpu_is("power6x")) | |||||
| return &gotoblas_POWER6; | |||||
| if (__builtin_cpu_is("power8")) | |||||
| return &gotoblas_POWER8; | |||||
| if (__builtin_cpu_is("power9")) | |||||
| return &gotoblas_POWER9; | |||||
| return NULL; | |||||
| } | |||||
| static gotoblas_t *force_coretype(char * coretype) { | |||||
| int i ; | |||||
| int found = -1; | |||||
| char message[128]; | |||||
| for ( i = 0 ; i < NUM_CORETYPES; i++) | |||||
| { | |||||
| if (!strncasecmp(coretype, corename[i], 20)) | |||||
| { | |||||
| found = i; | |||||
| break; | |||||
| } | |||||
| } | |||||
| switch (found) | |||||
| { | |||||
| case 1: return (&gotoblas_POWER6); | |||||
| case 2: return (&gotoblas_POWER8); | |||||
| case 3: return (&gotoblas_POWER9); | |||||
| default: break; | |||||
| } | |||||
| /* Unknown core name: warn, then let the caller fall back. */ | |||||
| snprintf(message, 128, "Core not found: %s\n", coretype); | |||||
| openblas_warning(1, message); | |||||
| return NULL; | |||||
| } | |||||
| void gotoblas_dynamic_init(void) { | |||||
| char coremsg[128]; | |||||
| char coren[22]; | |||||
| char *p; | |||||
| if (gotoblas) return; | |||||
| p = getenv("OPENBLAS_CORETYPE"); | |||||
| if ( p ) | |||||
| { | |||||
| gotoblas = force_coretype(p); | |||||
| } | |||||
| else | |||||
| { | |||||
| gotoblas = get_coretype(); | |||||
| } | |||||
| if (gotoblas == NULL) | |||||
| { | |||||
| snprintf(coremsg, 128, "Falling back to POWER8 core\n"); | |||||
| openblas_warning(1, coremsg); | |||||
| gotoblas = &gotoblas_POWER8; | |||||
| } | |||||
| if (gotoblas && gotoblas -> init) { | |||||
| strncpy(coren,gotoblas_corename(),20); | |||||
| sprintf(coremsg, "Core: %s\n",coren); | |||||
| openblas_warning(2, coremsg); | |||||
| gotoblas -> init(); | |||||
| } else { | |||||
| openblas_warning(0, "OpenBLAS : Architecture Initialization failed. No initialization function found.\n"); | |||||
| exit(1); | |||||
| } | |||||
| } | |||||
| void gotoblas_dynamic_quit(void) { | |||||
| gotoblas = NULL; | |||||
| } | |||||
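The selection order in `gotoblas_dynamic_init` above (a name forced through `OPENBLAS_CORETYPE` first, then runtime detection, then a POWER8 fallback) can be sketched independently of OpenBLAS. This is a minimal illustration, not the library's code: `find_core`, `pick_core` and `equal_ignore_case` are hypothetical names, integer indices stand in for the `gotoblas_t` pointers, and a hand-written comparison stands in for `strncasecmp` so that the sketch needs only standard C.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Same order as the corename[] table in the diff; index 0 means "unknown". */
static const char *core_names[] = { "unknown", "POWER6", "POWER8", "POWER9" };
#define NUM_CORES (sizeof core_names / sizeof core_names[0])

/* Case-insensitive string equality (stand-in for strncasecmp). */
static int equal_ignore_case(const char *a, const char *b) {
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b)) return 0;
        a++;
        b++;
    }
    return *a == *b;  /* equal only if both strings ended together */
}

/* Returns the table index of a requested core name, or -1 if it is not known. */
static int find_core(const char *requested) {
    if (requested == NULL) return -1;
    for (size_t i = 1; i < NUM_CORES; i++)
        if (equal_ignore_case(requested, core_names[i])) return (int)i;
    return -1;
}

/* forced: value of the override variable, or NULL if unset.
 * detected: index found by runtime detection, or 0 if detection failed.
 * fallback: index to use when neither source yields a known core. */
static int pick_core(const char *forced, int detected, int fallback) {
    assert(fallback > 0);
    if (forced != NULL) {
        int idx = find_core(forced);
        return idx > 0 ? idx : fallback;  /* an unknown forced name falls back */
    }
    return detected > 0 ? detected : fallback;
}
```

As in the diff, a forced name takes precedence over detection, and an unrecognised forced name ends up at the fallback rather than at the detected core.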
| @@ -765,7 +765,7 @@ int gotoblas_set_affinity(int pos) { | |||||
| int mynode = 1; | int mynode = 1; | ||||
| /* if number of threads is larger than inital condition */ | |||||
| /* if number of threads is larger than initial condition */ | |||||
| if (pos < 0) { | if (pos < 0) { | ||||
| sched_setaffinity(0, sizeof(cpu_orig_mask), &cpu_orig_mask[0]); | sched_setaffinity(0, sizeof(cpu_orig_mask), &cpu_orig_mask[0]); | ||||
| return 0; | return 0; | ||||
| @@ -857,7 +857,14 @@ void gotoblas_affinity_init(void) { | |||||
| common -> shmid = pshmid; | common -> shmid = pshmid; | ||||
| if (common -> magic != SH_MAGIC) { | if (common -> magic != SH_MAGIC) { | ||||
| #if defined(__GLIBC_PREREQ) | |||||
| #if __GLIBC_PREREQ(2, 7) | |||||
| cpu_set_t *cpusetp; | cpu_set_t *cpusetp; | ||||
| #else | |||||
| cpu_set_t cpuset; | |||||
| #endif | |||||
| #endif | |||||
| int nums; | int nums; | ||||
| int ret; | int ret; | ||||
| @@ -890,7 +897,7 @@ void gotoblas_affinity_init(void) { | |||||
| } | } | ||||
| CPU_FREE(cpusetp); | CPU_FREE(cpusetp); | ||||
| #else | #else | ||||
| ret = sched_getaffinity(0,sizeof(cpu_set_t), cpusetp); | |||||
| ret = sched_getaffinity(0,sizeof(cpu_set_t), &cpuset); | |||||
| if (ret!=0) { | if (ret!=0) { | ||||
| common->num_procs = nums; | common->num_procs = nums; | ||||
| } else { | } else { | ||||
| @@ -898,11 +905,11 @@ void gotoblas_affinity_init(void) { | |||||
| int i; | int i; | ||||
| int n = 0; | int n = 0; | ||||
| for (i=0;i<nums;i++) | for (i=0;i<nums;i++) | ||||
| if (CPU_ISSET(i,cpusetp)) n++; | |||||
| if (CPU_ISSET(i,&cpuset)) n++; | |||||
| common->num_procs = n; | common->num_procs = n; | ||||
| } | } | ||||
| #else | #else | ||||
| common->num_procs = CPU_COUNT(sizeof(cpu_set_t),cpusetp); | |||||
| common->num_procs = CPU_COUNT(&cpuset); | |||||
| } | } | ||||
| #endif | #endif | ||||
| @@ -198,45 +198,68 @@ int get_num_procs(void); | |||||
| #else | #else | ||||
| int get_num_procs(void) { | int get_num_procs(void) { | ||||
| static int nums = 0; | static int nums = 0; | ||||
| cpu_set_t *cpusetp; | |||||
| size_t size; | |||||
| int ret; | |||||
| int i,n; | |||||
| cpu_set_t cpuset,*cpusetp; | |||||
| size_t size; | |||||
| int ret; | |||||
| #if defined(__GLIBC_PREREQ) | |||||
| #if !__GLIBC_PREREQ(2, 7) | |||||
| int i; | |||||
| #if !__GLIBC_PREREQ(2, 6) | |||||
| int n; | |||||
| #endif | |||||
| #endif | |||||
| #endif | |||||
| if (!nums) nums = sysconf(_SC_NPROCESSORS_CONF); | if (!nums) nums = sysconf(_SC_NPROCESSORS_CONF); | ||||
| #if !defined(OS_LINUX) | #if !defined(OS_LINUX) | ||||
| return nums; | |||||
| return nums; | |||||
| #endif | #endif | ||||
| #if !defined(__GLIBC_PREREQ) | #if !defined(__GLIBC_PREREQ) | ||||
| return nums; | |||||
| return nums; | |||||
| #else | #else | ||||
| #if !__GLIBC_PREREQ(2, 3) | #if !__GLIBC_PREREQ(2, 3) | ||||
| return nums; | |||||
| return nums; | |||||
| #endif | #endif | ||||
| #if !__GLIBC_PREREQ(2, 7) | #if !__GLIBC_PREREQ(2, 7) | ||||
| ret = sched_getaffinity(0,sizeof(cpu_set_t), cpusetp); | |||||
| ret = sched_getaffinity(0,sizeof(cpuset), &cpuset); | |||||
| if (ret!=0) return nums; | if (ret!=0) return nums; | ||||
| n=0; | n=0; | ||||
| #if !__GLIBC_PREREQ(2, 6) | #if !__GLIBC_PREREQ(2, 6) | ||||
| for (i=0;i<nums;i++) | for (i=0;i<nums;i++) | ||||
| if (CPU_ISSET(i,cpusetp)) n++; | |||||
| if (CPU_ISSET(i,&cpuset)) n++; | |||||
| nums=n; | nums=n; | ||||
| #else | #else | ||||
| nums = CPU_COUNT(sizeof(cpu_set_t),cpusetp); | |||||
| nums = CPU_COUNT(&cpuset); | |||||
| #endif | #endif | ||||
| return nums; | return nums; | ||||
| #else | #else | ||||
| cpusetp = CPU_ALLOC(nums); | |||||
| if (cpusetp == NULL) return nums; | |||||
| size = CPU_ALLOC_SIZE(nums); | |||||
| ret = sched_getaffinity(0,size,cpusetp); | |||||
| if (ret!=0) return nums; | |||||
| ret = CPU_COUNT_S(size,cpusetp); | |||||
| if (ret > 0 && ret < nums) nums = ret; | |||||
| CPU_FREE(cpusetp); | |||||
| return nums; | |||||
| if (nums >= CPU_SETSIZE) { | |||||
| cpusetp = CPU_ALLOC(nums); | |||||
| if (cpusetp == NULL) { | |||||
| return nums; | |||||
| } | |||||
| size = CPU_ALLOC_SIZE(nums); | |||||
| ret = sched_getaffinity(0,size,cpusetp); | |||||
| if (ret!=0) { | |||||
| CPU_FREE(cpusetp); | |||||
| return nums; | |||||
| } | |||||
| ret = CPU_COUNT_S(size,cpusetp); | |||||
| if (ret > 0 && ret < nums) nums = ret; | |||||
| CPU_FREE(cpusetp); | |||||
| return nums; | |||||
| } else { | |||||
| ret = sched_getaffinity(0,sizeof(cpuset),&cpuset); | |||||
| if (ret!=0) { | |||||
| return nums; | |||||
| } | |||||
| ret = CPU_COUNT(&cpuset); | |||||
| if (ret > 0 && ret < nums) nums = ret; | |||||
| return nums; | |||||
| } | |||||
| #endif | #endif | ||||
| #endif | #endif | ||||
| } | } | ||||
| @@ -1290,6 +1313,13 @@ void blas_memory_free_nolock(void * map_address) { | |||||
| free(map_address); | free(map_address); | ||||
| } | } | ||||
| #ifdef SMP | |||||
| void blas_thread_memory_cleanup(void) { | |||||
| blas_memory_cleanup((void*)get_memory_table()); | |||||
| } | |||||
| #endif | |||||
| void blas_shutdown(void){ | void blas_shutdown(void){ | ||||
| #ifdef SMP | #ifdef SMP | ||||
| BLASFUNC(blas_thread_shutdown)(); | BLASFUNC(blas_thread_shutdown)(); | ||||
| @@ -1299,7 +1329,7 @@ void blas_shutdown(void){ | |||||
| /* Only clean up if we were built for threading and TLS was initialized */ | /* Only clean up if we were built for threading and TLS was initialized */ | ||||
| if (local_storage_key) | if (local_storage_key) | ||||
| #endif | #endif | ||||
| blas_memory_cleanup((void*)get_memory_table()); | |||||
| blas_thread_memory_cleanup(); | |||||
| #ifdef SEEK_ADDRESS | #ifdef SEEK_ADDRESS | ||||
| base_address = 0UL; | base_address = 0UL; | ||||
| @@ -1529,7 +1559,7 @@ BOOL APIENTRY DllMain(HMODULE hModule, DWORD ul_reason_for_call, LPVOID lpReser | |||||
| break; | break; | ||||
| case DLL_THREAD_DETACH: | case DLL_THREAD_DETACH: | ||||
| #if defined(SMP) | #if defined(SMP) | ||||
| blas_memory_cleanup((void*)get_memory_table()); | |||||
| blas_thread_memory_cleanup(); | |||||
| #endif | #endif | ||||
| break; | break; | ||||
| case DLL_PROCESS_DETACH: | case DLL_PROCESS_DETACH: | ||||
| @@ -1592,6 +1622,7 @@ void gotoblas_dummy_for_PGI(void) { | |||||
| gotoblas_init(); | gotoblas_init(); | ||||
| gotoblas_quit(); | gotoblas_quit(); | ||||
| #if __PGIC__ < 19 | |||||
| #if 0 | #if 0 | ||||
| asm ("\t.section\t.ctors,\"aw\",@progbits; .align 8; .quad gotoblas_init; .section .text"); | asm ("\t.section\t.ctors,\"aw\",@progbits; .align 8; .quad gotoblas_init; .section .text"); | ||||
| asm ("\t.section\t.dtors,\"aw\",@progbits; .align 8; .quad gotoblas_quit; .section .text"); | asm ("\t.section\t.dtors,\"aw\",@progbits; .align 8; .quad gotoblas_quit; .section .text"); | ||||
| @@ -1599,13 +1630,16 @@ void gotoblas_dummy_for_PGI(void) { | |||||
| asm (".section .init,\"ax\"; call gotoblas_init@PLT; .section .text"); | asm (".section .init,\"ax\"; call gotoblas_init@PLT; .section .text"); | ||||
| asm (".section .fini,\"ax\"; call gotoblas_quit@PLT; .section .text"); | asm (".section .fini,\"ax\"; call gotoblas_quit@PLT; .section .text"); | ||||
| #endif | #endif | ||||
| #endif | |||||
| } | } | ||||
| #endif | #endif | ||||
| #else | #else | ||||
| /* USE_TLS / COMPILE_TLS not set */ | |||||
| #include <errno.h> | #include <errno.h> | ||||
| #ifdef OS_WINDOWS | |||||
| #if defined(OS_WINDOWS) && !defined(OS_CYGWIN_NT) | |||||
| #define ALLOC_WINDOWS | #define ALLOC_WINDOWS | ||||
| #ifndef MEM_LARGE_PAGES | #ifndef MEM_LARGE_PAGES | ||||
| #define MEM_LARGE_PAGES 0x20000000 | #define MEM_LARGE_PAGES 0x20000000 | ||||
| @@ -1619,7 +1653,7 @@ void gotoblas_dummy_for_PGI(void) { | |||||
| #include <stdio.h> | #include <stdio.h> | ||||
| #include <fcntl.h> | #include <fcntl.h> | ||||
| #ifndef OS_WINDOWS | |||||
| #if !defined(OS_WINDOWS) || defined(OS_CYGWIN_NT) | |||||
| #include <sys/mman.h> | #include <sys/mman.h> | ||||
| #ifndef NO_SYSV_IPC | #ifndef NO_SYSV_IPC | ||||
| #include <sys/shm.h> | #include <sys/shm.h> | ||||
| @@ -1639,7 +1673,7 @@ void gotoblas_dummy_for_PGI(void) { | |||||
| #include <sys/resource.h> | #include <sys/resource.h> | ||||
| #endif | #endif | ||||
| #if defined(OS_FREEBSD) || defined(OS_DARWIN) | |||||
| #if defined(OS_FREEBSD) || defined(OS_OPENBSD) || defined(OS_DRAGONFLY) || defined(OS_DARWIN) | |||||
| #include <sys/sysctl.h> | #include <sys/sysctl.h> | ||||
| #include <sys/resource.h> | #include <sys/resource.h> | ||||
| #endif | #endif | ||||
| @@ -1678,9 +1712,12 @@ void gotoblas_dummy_for_PGI(void) { | |||||
| #elif (defined(OS_DARWIN) || defined(OS_SUNOS)) && defined(C_GCC) | #elif (defined(OS_DARWIN) || defined(OS_SUNOS)) && defined(C_GCC) | ||||
| #define CONSTRUCTOR __attribute__ ((constructor)) | #define CONSTRUCTOR __attribute__ ((constructor)) | ||||
| #define DESTRUCTOR __attribute__ ((destructor)) | #define DESTRUCTOR __attribute__ ((destructor)) | ||||
| #else | |||||
| #elif __GNUC__ && INIT_PRIORITY && ((GCC_VERSION >= 40300) || (CLANG_VERSION >= 20900)) | |||||
| #define CONSTRUCTOR __attribute__ ((constructor(101))) | #define CONSTRUCTOR __attribute__ ((constructor(101))) | ||||
| #define DESTRUCTOR __attribute__ ((destructor(101))) | #define DESTRUCTOR __attribute__ ((destructor(101))) | ||||
| #else | |||||
| #define CONSTRUCTOR __attribute__ ((constructor)) | |||||
| #define DESTRUCTOR __attribute__ ((destructor)) | |||||
| #endif | #endif | ||||
| #ifdef DYNAMIC_ARCH | #ifdef DYNAMIC_ARCH | ||||
| @@ -1704,45 +1741,70 @@ void goto_set_num_threads(int num_threads) {}; | |||||
| int get_num_procs(void); | int get_num_procs(void); | ||||
| #else | #else | ||||
| int get_num_procs(void) { | int get_num_procs(void) { | ||||
| static int nums = 0; | static int nums = 0; | ||||
| cpu_set_t *cpusetp; | |||||
| size_t size; | |||||
| int ret; | |||||
| int i,n; | |||||
| cpu_set_t cpuset,*cpusetp; | |||||
| size_t size; | |||||
| int ret; | |||||
| #if defined(__GLIBC_PREREQ) | |||||
| #if !__GLIBC_PREREQ(2, 7) | |||||
| int i; | |||||
| #if !__GLIBC_PREREQ(2, 6) | |||||
| int n; | |||||
| #endif | |||||
| #endif | |||||
| #endif | |||||
| if (!nums) nums = sysconf(_SC_NPROCESSORS_CONF); | if (!nums) nums = sysconf(_SC_NPROCESSORS_CONF); | ||||
| #if !defined(OS_LINUX) | #if !defined(OS_LINUX) | ||||
| return nums; | |||||
| return nums; | |||||
| #endif | #endif | ||||
| #if !defined(__GLIBC_PREREQ) | #if !defined(__GLIBC_PREREQ) | ||||
| return nums; | |||||
| return nums; | |||||
| #else | #else | ||||
| #if !__GLIBC_PREREQ(2, 3) | #if !__GLIBC_PREREQ(2, 3) | ||||
| return nums; | |||||
| return nums; | |||||
| #endif | #endif | ||||
| #if !__GLIBC_PREREQ(2, 7) | #if !__GLIBC_PREREQ(2, 7) | ||||
| ret = sched_getaffinity(0,sizeof(cpu_set_t), cpusetp); | |||||
| ret = sched_getaffinity(0,sizeof(cpuset), &cpuset); | |||||
| if (ret!=0) return nums; | if (ret!=0) return nums; | ||||
| n=0; | n=0; | ||||
| #if !__GLIBC_PREREQ(2, 6) | #if !__GLIBC_PREREQ(2, 6) | ||||
| for (i=0;i<nums;i++) | for (i=0;i<nums;i++) | ||||
| if (CPU_ISSET(i,cpusetp)) n++; | |||||
| if (CPU_ISSET(i,&cpuset)) n++; | |||||
| nums=n; | nums=n; | ||||
| #else | #else | ||||
| nums = CPU_COUNT(sizeof(cpu_set_t),cpusetp); | |||||
| nums = CPU_COUNT(&cpuset); | |||||
| #endif | #endif | ||||
| return nums; | return nums; | ||||
| #else | #else | ||||
| cpusetp = CPU_ALLOC(nums); | |||||
| if (cpusetp == NULL) return nums; | |||||
| size = CPU_ALLOC_SIZE(nums); | |||||
| ret = sched_getaffinity(0,size,cpusetp); | |||||
| if (ret!=0) return nums; | |||||
| nums = CPU_COUNT_S(size,cpusetp); | |||||
| CPU_FREE(cpusetp); | |||||
| return nums; | |||||
| if (nums >= CPU_SETSIZE) { | |||||
| cpusetp = CPU_ALLOC(nums); | |||||
| if (cpusetp == NULL) { | |||||
| return nums; | |||||
| } | |||||
| size = CPU_ALLOC_SIZE(nums); | |||||
| ret = sched_getaffinity(0,size,cpusetp); | |||||
| if (ret!=0) { | |||||
| CPU_FREE(cpusetp); | |||||
| return nums; | |||||
| } | |||||
| ret = CPU_COUNT_S(size,cpusetp); | |||||
| if (ret > 0 && ret < nums) nums = ret; | |||||
| CPU_FREE(cpusetp); | |||||
| return nums; | |||||
| } else { | |||||
| ret = sched_getaffinity(0,sizeof(cpuset),&cpuset); | |||||
| if (ret!=0) { | |||||
| return nums; | |||||
| } | |||||
| ret = CPU_COUNT(&cpuset); | |||||
| if (ret > 0 && ret < nums) nums = ret; | |||||
| return nums; | |||||
| } | |||||
| #endif | #endif | ||||
| #endif | #endif | ||||
| } | } | ||||
| @@ -1756,7 +1818,7 @@ int get_num_procs(void) { | |||||
| return nums; | return nums; | ||||
| } | } | ||||
| #endif | #endif | ||||
| #ifdef OS_HAIKU | #ifdef OS_HAIKU | ||||
| int get_num_procs(void) { | int get_num_procs(void) { | ||||
| static int nums = 0; | static int nums = 0; | ||||
| @@ -1793,7 +1855,7 @@ int get_num_procs(void) { | |||||
| #endif | #endif | ||||
| #if defined(OS_FREEBSD) | |||||
| #if defined(OS_FREEBSD) || defined(OS_OPENBSD) || defined(OS_DRAGONFLY) | |||||
| int get_num_procs(void) { | int get_num_procs(void) { | ||||
| @@ -1870,7 +1932,7 @@ void openblas_fork_handler() | |||||
| // http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60035 | // http://gcc.gnu.org/bugzilla/show_bug.cgi?id=60035 | ||||
| // In the mean time build with USE_OPENMP=0 or link against another | // In the mean time build with USE_OPENMP=0 or link against another | ||||
| // implementation of OpenMP. | // implementation of OpenMP. | ||||
| #if !(defined(OS_WINDOWS) || defined(OS_ANDROID)) && defined(SMP_SERVER) | |||||
| #if !((defined(OS_WINDOWS) && !defined(OS_CYGWIN_NT)) || defined(OS_ANDROID)) && defined(SMP_SERVER) | |||||
| int err; | int err; | ||||
| err = pthread_atfork ((void (*)(void)) BLASFUNC(blas_thread_shutdown), NULL, NULL); | err = pthread_atfork ((void (*)(void)) BLASFUNC(blas_thread_shutdown), NULL, NULL); | ||||
| if(err != 0) | if(err != 0) | ||||
| @@ -1883,7 +1945,7 @@ extern int openblas_goto_num_threads_env(); | |||||
| extern int openblas_omp_num_threads_env(); | extern int openblas_omp_num_threads_env(); | ||||
| int blas_get_cpu_number(void){ | int blas_get_cpu_number(void){ | ||||
| #if defined(OS_LINUX) || defined(OS_WINDOWS) || defined(OS_FREEBSD) || defined(OS_DARWIN) || defined(OS_ANDROID) | |||||
| #if defined(OS_LINUX) || defined(OS_WINDOWS) || defined(OS_FREEBSD) || defined(OS_OPENBSD) || defined(OS_DRAGONFLY) || defined(OS_DARWIN) || defined(OS_ANDROID) | |||||
| int max_num; | int max_num; | ||||
| #endif | #endif | ||||
| int blas_goto_num = 0; | int blas_goto_num = 0; | ||||
| @@ -1891,11 +1953,11 @@ int blas_get_cpu_number(void){ | |||||
| if (blas_num_threads) return blas_num_threads; | if (blas_num_threads) return blas_num_threads; | ||||
| #if defined(OS_LINUX) || defined(OS_WINDOWS) || defined(OS_FREEBSD) || defined(OS_DARWIN) || defined(OS_ANDROID) | |||||
| #if defined(OS_LINUX) || defined(OS_WINDOWS) || defined(OS_FREEBSD) || defined(OS_OPENBSD) || defined(OS_DRAGONFLY) || defined(OS_DARWIN) || defined(OS_ANDROID) | |||||
| max_num = get_num_procs(); | max_num = get_num_procs(); | ||||
| #endif | #endif | ||||
| blas_goto_num = 0; | |||||
| // blas_goto_num = 0; | |||||
| #ifndef USE_OPENMP | #ifndef USE_OPENMP | ||||
| blas_goto_num=openblas_num_threads_env(); | blas_goto_num=openblas_num_threads_env(); | ||||
| if (blas_goto_num < 0) blas_goto_num = 0; | if (blas_goto_num < 0) blas_goto_num = 0; | ||||
| @@ -1907,7 +1969,7 @@ int blas_get_cpu_number(void){ | |||||
| #endif | #endif | ||||
| blas_omp_num = 0; | |||||
| // blas_omp_num = 0; | |||||
| blas_omp_num=openblas_omp_num_threads_env(); | blas_omp_num=openblas_omp_num_threads_env(); | ||||
| if (blas_omp_num < 0) blas_omp_num = 0; | if (blas_omp_num < 0) blas_omp_num = 0; | ||||
| @@ -1915,7 +1977,7 @@ int blas_get_cpu_number(void){ | |||||
| else if (blas_omp_num > 0) blas_num_threads = blas_omp_num; | else if (blas_omp_num > 0) blas_num_threads = blas_omp_num; | ||||
| else blas_num_threads = MAX_CPU_NUMBER; | else blas_num_threads = MAX_CPU_NUMBER; | ||||
| #if defined(OS_LINUX) || defined(OS_WINDOWS) || defined(OS_FREEBSD) || defined(OS_DARWIN) || defined(OS_ANDROID) | |||||
| #if defined(OS_LINUX) || defined(OS_WINDOWS) || defined(OS_FREEBSD) || defined(OS_OPENBSD) || defined(OS_DRAGONFLY) || defined(OS_DARWIN) || defined(OS_ANDROID) | |||||
| if (blas_num_threads > max_num) blas_num_threads = max_num; | if (blas_num_threads > max_num) blas_num_threads = max_num; | ||||
| #endif | #endif | ||||
| @@ -2002,11 +2064,15 @@ static void *alloc_mmap(void *address){ | |||||
| } | } | ||||
| if (map_address != (void *)-1) { | if (map_address != (void *)-1) { | ||||
| #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP) | |||||
| LOCK_COMMAND(&alloc_lock); | LOCK_COMMAND(&alloc_lock); | ||||
| #endif | |||||
| release_info[release_pos].address = map_address; | release_info[release_pos].address = map_address; | ||||
| release_info[release_pos].func = alloc_mmap_free; | release_info[release_pos].func = alloc_mmap_free; | ||||
| release_pos ++; | release_pos ++; | ||||
| #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP) | |||||
| UNLOCK_COMMAND(&alloc_lock); | UNLOCK_COMMAND(&alloc_lock); | ||||
| #endif | |||||
| } | } | ||||
| #ifdef OS_LINUX | #ifdef OS_LINUX | ||||
| @@ -2148,14 +2214,18 @@ static void *alloc_mmap(void *address){ | |||||
| #if defined(OS_LINUX) && !defined(NO_WARMUP) | #if defined(OS_LINUX) && !defined(NO_WARMUP) | ||||
| } | } | ||||
| #endif | #endif | ||||
| LOCK_COMMAND(&alloc_lock); | |||||
| if (map_address != (void *)-1) { | if (map_address != (void *)-1) { | ||||
| #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP) | |||||
| LOCK_COMMAND(&alloc_lock); | |||||
| #endif | |||||
| release_info[release_pos].address = map_address; | release_info[release_pos].address = map_address; | ||||
| release_info[release_pos].func = alloc_mmap_free; | release_info[release_pos].func = alloc_mmap_free; | ||||
| release_pos ++; | release_pos ++; | ||||
| #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP) | |||||
| UNLOCK_COMMAND(&alloc_lock); | |||||
| #endif | |||||
| } | } | ||||
| UNLOCK_COMMAND(&alloc_lock); | |||||
| return map_address; | return map_address; | ||||
| } | } | ||||
| @@ -2523,7 +2593,7 @@ void *blas_memory_alloc(int procpos){ | |||||
| int position; | int position; | ||||
| #if defined(WHEREAMI) && !defined(USE_OPENMP) | #if defined(WHEREAMI) && !defined(USE_OPENMP) | ||||
| int mypos; | |||||
| int mypos = 0; | |||||
| #endif | #endif | ||||
| void *map_address; | void *map_address; | ||||
| @@ -2554,6 +2624,11 @@ void *blas_memory_alloc(int procpos){ | |||||
| NULL, | NULL, | ||||
| }; | }; | ||||
| void *(**func)(void *address); | void *(**func)(void *address); | ||||
| #if defined(USE_OPENMP) | |||||
| if (!memory_initialized) { | |||||
| #endif | |||||
| LOCK_COMMAND(&alloc_lock); | LOCK_COMMAND(&alloc_lock); | ||||
| if (!memory_initialized) { | if (!memory_initialized) { | ||||
| @@ -2589,6 +2664,9 @@ void *blas_memory_alloc(int procpos){ | |||||
| } | } | ||||
| UNLOCK_COMMAND(&alloc_lock); | UNLOCK_COMMAND(&alloc_lock); | ||||
| #if defined(USE_OPENMP) | |||||
| } | |||||
| #endif | |||||
| #ifdef DEBUG | #ifdef DEBUG | ||||
| printf("Alloc Start ...\n"); | printf("Alloc Start ...\n"); | ||||
| @@ -2603,13 +2681,17 @@ void *blas_memory_alloc(int procpos){ | |||||
| do { | do { | ||||
| if (!memory[position].used && (memory[position].pos == mypos)) { | if (!memory[position].used && (memory[position].pos == mypos)) { | ||||
| #if defined(SMP) && !defined(USE_OPENMP) | |||||
| LOCK_COMMAND(&alloc_lock); | LOCK_COMMAND(&alloc_lock); | ||||
| // blas_lock(&memory[position].lock); | |||||
| #else | |||||
| blas_lock(&memory[position].lock); | |||||
| #endif | |||||
| if (!memory[position].used) goto allocation; | if (!memory[position].used) goto allocation; | ||||
| #if defined(SMP) && !defined(USE_OPENMP) | |||||
| UNLOCK_COMMAND(&alloc_lock); | UNLOCK_COMMAND(&alloc_lock); | ||||
| // blas_unlock(&memory[position].lock); | |||||
| #else | |||||
| blas_unlock(&memory[position].lock); | |||||
| #endif | |||||
| } | } | ||||
| position ++; | position ++; | ||||
| @@ -2621,21 +2703,26 @@ void *blas_memory_alloc(int procpos){ | |||||
| position = 0; | position = 0; | ||||
| #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP) | |||||
| LOCK_COMMAND(&alloc_lock); | LOCK_COMMAND(&alloc_lock); | ||||
| #endif | |||||
| do { | do { | ||||
| /* if (!memory[position].used) { */ | |||||
| /* blas_lock(&memory[position].lock);*/ | |||||
| #if defined(USE_OPENMP) | |||||
| if (!memory[position].used) { | |||||
| blas_lock(&memory[position].lock); | |||||
| #endif | |||||
| if (!memory[position].used) goto allocation; | if (!memory[position].used) goto allocation; | ||||
| /* blas_unlock(&memory[position].lock);*/ | |||||
| /* } */ | |||||
| #if defined(USE_OPENMP) | |||||
| blas_unlock(&memory[position].lock); | |||||
| } | |||||
| #endif | |||||
| position ++; | position ++; | ||||
| } while (position < NUM_BUFFERS); | } while (position < NUM_BUFFERS); | ||||
| UNLOCK_COMMAND(&alloc_lock); | |||||
| #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP) | |||||
| UNLOCK_COMMAND(&alloc_lock); | |||||
| #endif | |||||
| goto error; | goto error; | ||||
| allocation : | allocation : | ||||
| @@ -2645,10 +2732,11 @@ void *blas_memory_alloc(int procpos){ | |||||
| #endif | #endif | ||||
| memory[position].used = 1; | memory[position].used = 1; | ||||
| #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP) | |||||
| UNLOCK_COMMAND(&alloc_lock); | UNLOCK_COMMAND(&alloc_lock); | ||||
| /* blas_unlock(&memory[position].lock);*/ | |||||
| #else | |||||
| blas_unlock(&memory[position].lock); | |||||
| #endif | |||||
| if (!memory[position].addr) { | if (!memory[position].addr) { | ||||
| do { | do { | ||||
| #ifdef DEBUG | #ifdef DEBUG | ||||
| @@ -2665,7 +2753,7 @@ void *blas_memory_alloc(int procpos){ | |||||
| #ifdef ALLOC_DEVICEDRIVER | #ifdef ALLOC_DEVICEDRIVER | ||||
| if ((*func == alloc_devicedirver) && (map_address == (void *)-1)) { | if ((*func == alloc_devicedirver) && (map_address == (void *)-1)) { | ||||
| fprintf(stderr, "OpenBLAS Warning ... Physically contigous allocation was failed.\n"); | |||||
| fprintf(stderr, "OpenBLAS Warning ... Physically contiguous allocation was failed.\n"); | |||||
| } | } | ||||
| #endif | #endif | ||||
| @@ -2693,9 +2781,13 @@ void *blas_memory_alloc(int procpos){ | |||||
| } while ((BLASLONG)map_address == -1); | } while ((BLASLONG)map_address == -1); | ||||
| #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP) | |||||
| LOCK_COMMAND(&alloc_lock); | LOCK_COMMAND(&alloc_lock); | ||||
| #endif | |||||
| memory[position].addr = map_address; | memory[position].addr = map_address; | ||||
| #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP) | |||||
| UNLOCK_COMMAND(&alloc_lock); | UNLOCK_COMMAND(&alloc_lock); | ||||
| #endif | |||||
| #ifdef DEBUG | #ifdef DEBUG | ||||
| printf(" Mapping Succeeded. %p(%d)\n", (void *)memory[position].addr, position); | printf(" Mapping Succeeded. %p(%d)\n", (void *)memory[position].addr, position); | ||||
| @@ -2749,8 +2841,9 @@ void blas_memory_free(void *free_area){ | |||||
| #endif | #endif | ||||
| position = 0; | position = 0; | ||||
| #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP) | |||||
| LOCK_COMMAND(&alloc_lock); | LOCK_COMMAND(&alloc_lock); | ||||
| #endif | |||||
| while ((position < NUM_BUFFERS) && (memory[position].addr != free_area)) | while ((position < NUM_BUFFERS) && (memory[position].addr != free_area)) | ||||
| position++; | position++; | ||||
| @@ -2764,7 +2857,9 @@ void blas_memory_free(void *free_area){ | |||||
| WMB; | WMB; | ||||
| memory[position].used = 0; | memory[position].used = 0; | ||||
| #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP) | |||||
| UNLOCK_COMMAND(&alloc_lock); | UNLOCK_COMMAND(&alloc_lock); | ||||
| #endif | |||||
| #ifdef DEBUG | #ifdef DEBUG | ||||
| printf("Unmap Succeeded.\n\n"); | printf("Unmap Succeeded.\n\n"); | ||||
| @@ -2779,8 +2874,9 @@ void blas_memory_free(void *free_area){ | |||||
| for (position = 0; position < NUM_BUFFERS; position++) | for (position = 0; position < NUM_BUFFERS; position++) | ||||
| printf("%4ld %p : %d\n", position, memory[position].addr, memory[position].used); | printf("%4ld %p : %d\n", position, memory[position].addr, memory[position].used); | ||||
| #endif | #endif | ||||
| #if (defined(SMP) || defined(USE_LOCKING)) && !defined(USE_OPENMP) | |||||
| UNLOCK_COMMAND(&alloc_lock); | UNLOCK_COMMAND(&alloc_lock); | ||||
| #endif | |||||
| return; | return; | ||||
| } | } | ||||
| @@ -2830,7 +2926,7 @@ void blas_shutdown(void){ | |||||
| #if defined(OS_LINUX) && !defined(NO_WARMUP) | #if defined(OS_LINUX) && !defined(NO_WARMUP) | ||||
| #ifdef SMP | |||||
| #if defined(SMP) || defined(USE_LOCKING) | |||||
| #if defined(USE_PTHREAD_LOCK) | #if defined(USE_PTHREAD_LOCK) | ||||
| static pthread_mutex_t init_lock = PTHREAD_MUTEX_INITIALIZER; | static pthread_mutex_t init_lock = PTHREAD_MUTEX_INITIALIZER; | ||||
| #elif defined(USE_PTHREAD_SPINLOCK) | #elif defined(USE_PTHREAD_SPINLOCK) | ||||
| @@ -2855,7 +2951,7 @@ static void _touch_memory(blas_arg_t *arg, BLASLONG *range_m, BLASLONG *range_n, | |||||
| if (hot_alloc != 2) { | if (hot_alloc != 2) { | ||||
| #endif | #endif | ||||
| #ifdef SMP | |||||
| #if defined(SMP) || defined(USE_LOCKING) | |||||
| LOCK_COMMAND(&init_lock); | LOCK_COMMAND(&init_lock); | ||||
| #endif | #endif | ||||
| @@ -2865,7 +2961,7 @@ static void _touch_memory(blas_arg_t *arg, BLASLONG *range_m, BLASLONG *range_n, | |||||
| size -= PAGESIZE; | size -= PAGESIZE; | ||||
| } | } | ||||
| #ifdef SMP | |||||
| #if defined(SMP) || defined(USE_LOCKING) | |||||
| UNLOCK_COMMAND(&init_lock); | UNLOCK_COMMAND(&init_lock); | ||||
| #endif | #endif | ||||
| @@ -3098,7 +3194,7 @@ void gotoblas_dummy_for_PGI(void) { | |||||
| gotoblas_init(); | gotoblas_init(); | ||||
| gotoblas_quit(); | gotoblas_quit(); | ||||
| #if __PGIC__ < 19 | |||||
| #if 0 | #if 0 | ||||
| asm ("\t.section\t.ctors,\"aw\",@progbits; .align 8; .quad gotoblas_init; .section .text"); | asm ("\t.section\t.ctors,\"aw\",@progbits; .align 8; .quad gotoblas_init; .section .text"); | ||||
| asm ("\t.section\t.dtors,\"aw\",@progbits; .align 8; .quad gotoblas_quit; .section .text"); | asm ("\t.section\t.dtors,\"aw\",@progbits; .align 8; .quad gotoblas_quit; .section .text"); | ||||
| @@ -3106,6 +3202,7 @@ void gotoblas_dummy_for_PGI(void) { | |||||
| asm (".section .init,\"ax\"; call gotoblas_init@PLT; .section .text"); | asm (".section .init,\"ax\"; call gotoblas_init@PLT; .section .text"); | ||||
| asm (".section .fini,\"ax\"; call gotoblas_quit@PLT; .section .text"); | asm (".section .fini,\"ax\"; call gotoblas_quit@PLT; .section .text"); | ||||
| #endif | #endif | ||||
| #endif | |||||
| } | } | ||||
| #endif | #endif | ||||
| @@ -35,12 +35,6 @@ USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |||||
| #include <string.h> | #include <string.h> | ||||
| #if defined(_WIN32) && defined(_MSC_VER) | |||||
| #if _MSC_VER < 1900 | |||||
| #define snprintf _snprintf | |||||
| #endif | |||||
| #endif | |||||
| static char* openblas_config_str="" | static char* openblas_config_str="" | ||||
| "OpenBLAS " | "OpenBLAS " | ||||
| VERSION | VERSION | ||||
| @@ -141,6 +141,14 @@ else | |||||
| $(OBJCOPY) --redefine-syms objcopy.def ../$(LIBNAME) ../$(LIBNAME).renamed | $(OBJCOPY) --redefine-syms objcopy.def ../$(LIBNAME) ../$(LIBNAME).renamed | ||||
| ../$(LIBSONAME) : ../$(LIBNAME).renamed linktest.c | ../$(LIBSONAME) : ../$(LIBNAME).renamed linktest.c | ||||
| endif | endif | ||||
| ifeq ($(F_COMPILER), INTEL) | |||||
| $(FC) $(FFLAGS) $(LDFLAGS) -shared -o ../$(LIBSONAME) \ | |||||
| -Wl,--whole-archive $< -Wl,--no-whole-archive \ | |||||
| -Wl,-soname,$(INTERNALNAME) $(EXTRALIB) | |||||
| $(CC) $(CFLAGS) $(LDFLAGS) -w -o linktest linktest.c ../$(LIBSONAME) $(FEXTRALIB) && echo OK. | |||||
| else | |||||
| ifneq ($(C_COMPILER), LSB) | ifneq ($(C_COMPILER), LSB) | ||||
| $(CC) $(CFLAGS) $(LDFLAGS) -shared -o ../$(LIBSONAME) \ | $(CC) $(CFLAGS) $(LDFLAGS) -shared -o ../$(LIBSONAME) \ | ||||
| -Wl,--whole-archive $< -Wl,--no-whole-archive \ | -Wl,--whole-archive $< -Wl,--no-whole-archive \ | ||||
| @@ -152,6 +160,7 @@ else | |||||
| -Wl,--whole-archive $< -Wl,--no-whole-archive \ | -Wl,--whole-archive $< -Wl,--no-whole-archive \ | ||||
| -Wl,-soname,$(INTERNALNAME) $(EXTRALIB) | -Wl,-soname,$(INTERNALNAME) $(EXTRALIB) | ||||
| $(FC) $(CFLAGS) $(LDFLAGS) -w -o linktest linktest.c ../$(LIBSONAME) $(FEXTRALIB) && echo OK. | $(FC) $(CFLAGS) $(LDFLAGS) -w -o linktest linktest.c ../$(LIBSONAME) $(FEXTRALIB) && echo OK. | ||||
| endif | |||||
| endif | endif | ||||
| rm -f linktest | rm -f linktest | ||||
| @@ -40,15 +40,25 @@ | |||||
| void gotoblas_init(void); | void gotoblas_init(void); | ||||
| void gotoblas_quit(void); | void gotoblas_quit(void); | ||||
| #if defined(SMP) && defined(USE_TLS) | |||||
| void blas_thread_memory_cleanup(void); | |||||
| #endif | |||||
| BOOL APIENTRY DllMain(HINSTANCE hInst, DWORD reason, LPVOID reserved) { | BOOL APIENTRY DllMain(HINSTANCE hInst, DWORD reason, LPVOID reserved) { | ||||
| if (reason == DLL_PROCESS_ATTACH) { | |||||
| gotoblas_init(); | |||||
| } | |||||
| if (reason == DLL_PROCESS_DETACH) { | |||||
| gotoblas_quit(); | |||||
| switch(reason) { | |||||
| case DLL_PROCESS_ATTACH: | |||||
| gotoblas_init(); | |||||
| break; | |||||
| case DLL_PROCESS_DETACH: | |||||
| gotoblas_quit(); | |||||
| break; | |||||
| case DLL_THREAD_ATTACH: | |||||
| break; | |||||
| case DLL_THREAD_DETACH: | |||||
| #if defined(SMP) && defined(USE_TLS) | |||||
| blas_thread_memory_cleanup(); | |||||
| #endif | |||||
| break; | |||||
| } | } | ||||
| return TRUE; | return TRUE; | ||||
| @@ -125,7 +125,7 @@ if ($compiler eq "") { | |||||
| $openmp = "-openmp"; | $openmp = "-openmp"; | ||||
| } | } | ||||
| # for embeded underscore name, e.g. zho_ge, it may append 2 underscores. | |||||
| # for embedded underscore name, e.g. zho_ge, it may append 2 underscores. | |||||
| $data = `$compiler -O2 -S ftest3.f > /dev/null 2>&1 && cat ftest3.s && rm -f ftest3.s`; | $data = `$compiler -O2 -S ftest3.f > /dev/null 2>&1 && cat ftest3.s && rm -f ftest3.s`; | ||||
| if ($data =~ / zho_ge__/) { | if ($data =~ / zho_ge__/) { | ||||
| $need2bu = 1; | $need2bu = 1; | ||||
| @@ -637,6 +637,18 @@ USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |||||
| #define CORENAME "POWER8" | #define CORENAME "POWER8" | ||||
| #endif | #endif | ||||
| #if defined(FORCE_POWER9) | |||||
| #define FORCE | |||||
| #define ARCHITECTURE "POWER" | |||||
| #define SUBARCHITECTURE "POWER9" | |||||
| #define SUBDIRNAME "power" | |||||
| #define ARCHCONFIG "-DPOWER9 " \ | |||||
| "-DL1_DATA_SIZE=32768 -DL1_DATA_LINESIZE=128 " \ | |||||
| "-DL2_SIZE=4194304 -DL2_LINESIZE=128 " \ | |||||
| "-DDTB_DEFAULT_ENTRIES=128 -DDTB_SIZE=4096 -DL2_ASSOCIATIVE=8 " | |||||
| #define LIBNAME "power9" | |||||
| #define CORENAME "POWER9" | |||||
| #endif | |||||
| #ifdef FORCE_PPCG4 | #ifdef FORCE_PPCG4 | ||||
| #define FORCE | #define FORCE | ||||
| @@ -1065,6 +1077,23 @@ USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |||||
| #else | #else | ||||
| #endif | #endif | ||||
| #ifdef FORCE_TSV110 | |||||
| #define FORCE | |||||
| #define ARCHITECTURE "ARM64" | |||||
| #define SUBARCHITECTURE "TSV110" | |||||
| #define SUBDIRNAME "arm64" | |||||
| #define ARCHCONFIG "-DTSV110 " \ | |||||
| "-DL1_CODE_SIZE=65536 -DL1_CODE_LINESIZE=64 -DL1_CODE_ASSOCIATIVE=4 " \ | |||||
| "-DL1_DATA_SIZE=65536 -DL1_DATA_LINESIZE=64 -DL1_DATA_ASSOCIATIVE=4 " \ | |||||
| "-DL2_SIZE=524288 -DL2_LINESIZE=64 -DL2_ASSOCIATIVE=8 " \ | |||||
| "-DDTB_DEFAULT_ENTRIES=64 -DDTB_SIZE=4096 " \ | |||||
| "-DHAVE_VFPV4 -DHAVE_VFPV3 -DHAVE_VFP -DHAVE_NEON -DARMV8" | |||||
| #define LIBNAME "tsv110" | |||||
| #define CORENAME "TSV110" | |||||
| #else | |||||
| #endif | |||||
| #ifdef FORCE_ZARCH_GENERIC | #ifdef FORCE_ZARCH_GENERIC | ||||
| #define FORCE | #define FORCE | ||||
| #define ARCHITECTURE "ZARCH" | #define ARCHITECTURE "ZARCH" | ||||
| @@ -1085,6 +1114,16 @@ USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |||||
| #define CORENAME "Z13" | #define CORENAME "Z13" | ||||
| #endif | #endif | ||||
| #ifdef FORCE_Z14 | |||||
| #define FORCE | |||||
| #define ARCHITECTURE "ZARCH" | |||||
| #define SUBARCHITECTURE "Z14" | |||||
| #define ARCHCONFIG "-DZ14 " \ | |||||
| "-DDTB_DEFAULT_ENTRIES=64" | |||||
| #define LIBNAME "z14" | |||||
| #define CORENAME "Z14" | |||||
| #endif | |||||
| #ifndef FORCE | #ifndef FORCE | ||||
| #ifdef USER_TARGET | #ifdef USER_TARGET | ||||
| @@ -12,6 +12,7 @@ set(BLAS1_REAL_ONLY_SOURCES | |||||
| rotm.c rotmg.c # N.B. these do not have complex counterparts | rotm.c rotmg.c # N.B. these do not have complex counterparts | ||||
| rot.c | rot.c | ||||
| asum.c | asum.c | ||||
| sum.c | |||||
| ) | ) | ||||
| # these will have 'z' prepended for the complex version | # these will have 'z' prepended for the complex version | ||||
| @@ -23,7 +24,7 @@ set(BLAS1_MANGLED_SOURCES | |||||
| axpby.c | axpby.c | ||||
| ) | ) | ||||
| # TODO: USE_NETLIB_GEMV shoudl switch gemv.c to netlib/*gemv.f | |||||
| # TODO: USE_NETLIB_GEMV should switch gemv.c to netlib/*gemv.f | |||||
| # these all have 'z' sources for complex versions | # these all have 'z' sources for complex versions | ||||
| set(BLAS2_SOURCES | set(BLAS2_SOURCES | ||||
| gemv.c ger.c | gemv.c ger.c | ||||
| @@ -124,6 +125,7 @@ foreach (float_type ${FLOAT_TYPES}) | |||||
| GenerateNamedObjects("max.c" "USE_ABS;USE_MIN" "scamin" ${CBLAS_FLAG} "" "" true "COMPLEX") | GenerateNamedObjects("max.c" "USE_ABS;USE_MIN" "scamin" ${CBLAS_FLAG} "" "" true "COMPLEX") | ||||
| GenerateNamedObjects("max.c" "USE_ABS" "scamax" ${CBLAS_FLAG} "" "" true "COMPLEX") | GenerateNamedObjects("max.c" "USE_ABS" "scamax" ${CBLAS_FLAG} "" "" true "COMPLEX") | ||||
| GenerateNamedObjects("asum.c" "" "scasum" ${CBLAS_FLAG} "" "" true "COMPLEX") | GenerateNamedObjects("asum.c" "" "scasum" ${CBLAS_FLAG} "" "" true "COMPLEX") | ||||
| GenerateNamedObjects("sum.c" "" "scsum" ${CBLAS_FLAG} "" "" true "COMPLEX") | |||||
| endif () | endif () | ||||
| if (${float_type} STREQUAL "ZCOMPLEX") | if (${float_type} STREQUAL "ZCOMPLEX") | ||||
| GenerateNamedObjects("zscal.c" "SSCAL" "dscal" ${CBLAS_FLAG} "" "" false "ZCOMPLEX") | GenerateNamedObjects("zscal.c" "SSCAL" "dscal" ${CBLAS_FLAG} "" "" false "ZCOMPLEX") | ||||
| @@ -132,6 +134,7 @@ foreach (float_type ${FLOAT_TYPES}) | |||||
| GenerateNamedObjects("max.c" "USE_ABS;USE_MIN" "dzamin" ${CBLAS_FLAG} "" "" true "ZCOMPLEX") | GenerateNamedObjects("max.c" "USE_ABS;USE_MIN" "dzamin" ${CBLAS_FLAG} "" "" true "ZCOMPLEX") | ||||
| GenerateNamedObjects("max.c" "USE_ABS" "dzamax" ${CBLAS_FLAG} "" "" true "ZCOMPLEX") | GenerateNamedObjects("max.c" "USE_ABS" "dzamax" ${CBLAS_FLAG} "" "" true "ZCOMPLEX") | ||||
| GenerateNamedObjects("asum.c" "" "dzasum" ${CBLAS_FLAG} "" "" true "ZCOMPLEX") | GenerateNamedObjects("asum.c" "" "dzasum" ${CBLAS_FLAG} "" "" true "ZCOMPLEX") | ||||
| GenerateNamedObjects("sum.c" "" "dzsum" ${CBLAS_FLAG} "" "" true "ZCOMPLEX") | |||||
| endif () | endif () | ||||
| endforeach () | endforeach () | ||||
| @@ -25,7 +25,7 @@ SBLAS1OBJS = \ | |||||
| saxpy.$(SUFFIX) sswap.$(SUFFIX) \ | saxpy.$(SUFFIX) sswap.$(SUFFIX) \ | ||||
| scopy.$(SUFFIX) sscal.$(SUFFIX) \ | scopy.$(SUFFIX) sscal.$(SUFFIX) \ | ||||
| sdot.$(SUFFIX) sdsdot.$(SUFFIX) dsdot.$(SUFFIX) \ | sdot.$(SUFFIX) sdsdot.$(SUFFIX) dsdot.$(SUFFIX) \ | ||||
| sasum.$(SUFFIX) snrm2.$(SUFFIX) \ | |||||
| sasum.$(SUFFIX) ssum.$(SUFFIX) snrm2.$(SUFFIX) \ | |||||
| smax.$(SUFFIX) samax.$(SUFFIX) ismax.$(SUFFIX) isamax.$(SUFFIX) \ | smax.$(SUFFIX) samax.$(SUFFIX) ismax.$(SUFFIX) isamax.$(SUFFIX) \ | ||||
| smin.$(SUFFIX) samin.$(SUFFIX) ismin.$(SUFFIX) isamin.$(SUFFIX) \ | smin.$(SUFFIX) samin.$(SUFFIX) ismin.$(SUFFIX) isamin.$(SUFFIX) \ | ||||
| srot.$(SUFFIX) srotg.$(SUFFIX) srotm.$(SUFFIX) srotmg.$(SUFFIX) \ | srot.$(SUFFIX) srotg.$(SUFFIX) srotm.$(SUFFIX) srotmg.$(SUFFIX) \ | ||||
| @@ -51,7 +51,7 @@ DBLAS1OBJS = \ | |||||
| daxpy.$(SUFFIX) dswap.$(SUFFIX) \ | daxpy.$(SUFFIX) dswap.$(SUFFIX) \ | ||||
| dcopy.$(SUFFIX) dscal.$(SUFFIX) \ | dcopy.$(SUFFIX) dscal.$(SUFFIX) \ | ||||
| ddot.$(SUFFIX) \ | ddot.$(SUFFIX) \ | ||||
| dasum.$(SUFFIX) dnrm2.$(SUFFIX) \ | |||||
| dasum.$(SUFFIX) dsum.$(SUFFIX) dnrm2.$(SUFFIX) \ | |||||
| dmax.$(SUFFIX) damax.$(SUFFIX) idmax.$(SUFFIX) idamax.$(SUFFIX) \ | dmax.$(SUFFIX) damax.$(SUFFIX) idmax.$(SUFFIX) idamax.$(SUFFIX) \ | ||||
| dmin.$(SUFFIX) damin.$(SUFFIX) idmin.$(SUFFIX) idamin.$(SUFFIX) \ | dmin.$(SUFFIX) damin.$(SUFFIX) idmin.$(SUFFIX) idamin.$(SUFFIX) \ | ||||
| drot.$(SUFFIX) drotg.$(SUFFIX) drotm.$(SUFFIX) drotmg.$(SUFFIX) \ | drot.$(SUFFIX) drotg.$(SUFFIX) drotm.$(SUFFIX) drotmg.$(SUFFIX) \ | ||||
| @@ -76,7 +76,7 @@ CBLAS1OBJS = \ | |||||
| caxpy.$(SUFFIX) caxpyc.$(SUFFIX) cswap.$(SUFFIX) \ | caxpy.$(SUFFIX) caxpyc.$(SUFFIX) cswap.$(SUFFIX) \ | ||||
| ccopy.$(SUFFIX) cscal.$(SUFFIX) csscal.$(SUFFIX) \ | ccopy.$(SUFFIX) cscal.$(SUFFIX) csscal.$(SUFFIX) \ | ||||
| cdotc.$(SUFFIX) cdotu.$(SUFFIX) \ | cdotc.$(SUFFIX) cdotu.$(SUFFIX) \ | ||||
| scasum.$(SUFFIX) scnrm2.$(SUFFIX) \ | |||||
| scasum.$(SUFFIX) scsum.$(SUFFIX) scnrm2.$(SUFFIX) \ | |||||
| scamax.$(SUFFIX) icamax.$(SUFFIX) \ | scamax.$(SUFFIX) icamax.$(SUFFIX) \ | ||||
| scamin.$(SUFFIX) icamin.$(SUFFIX) \ | scamin.$(SUFFIX) icamin.$(SUFFIX) \ | ||||
| csrot.$(SUFFIX) crotg.$(SUFFIX) \ | csrot.$(SUFFIX) crotg.$(SUFFIX) \ | ||||
| @@ -105,7 +105,7 @@ ZBLAS1OBJS = \ | |||||
| zaxpy.$(SUFFIX) zaxpyc.$(SUFFIX) zswap.$(SUFFIX) \ | zaxpy.$(SUFFIX) zaxpyc.$(SUFFIX) zswap.$(SUFFIX) \ | ||||
| zcopy.$(SUFFIX) zscal.$(SUFFIX) zdscal.$(SUFFIX) \ | zcopy.$(SUFFIX) zscal.$(SUFFIX) zdscal.$(SUFFIX) \ | ||||
| zdotc.$(SUFFIX) zdotu.$(SUFFIX) \ | zdotc.$(SUFFIX) zdotu.$(SUFFIX) \ | ||||
| dzasum.$(SUFFIX) dznrm2.$(SUFFIX) \ | |||||
| dzasum.$(SUFFIX) dzsum.$(SUFFIX) dznrm2.$(SUFFIX) \ | |||||
| dzamax.$(SUFFIX) izamax.$(SUFFIX) \ | dzamax.$(SUFFIX) izamax.$(SUFFIX) \ | ||||
| dzamin.$(SUFFIX) izamin.$(SUFFIX) \ | dzamin.$(SUFFIX) izamin.$(SUFFIX) \ | ||||
| zdrot.$(SUFFIX) zrotg.$(SUFFIX) \ | zdrot.$(SUFFIX) zrotg.$(SUFFIX) \ | ||||
| @@ -146,7 +146,7 @@ QBLAS1OBJS = \ | |||||
| qaxpy.$(SUFFIX) qswap.$(SUFFIX) \ | qaxpy.$(SUFFIX) qswap.$(SUFFIX) \ | ||||
| qcopy.$(SUFFIX) qscal.$(SUFFIX) \ | qcopy.$(SUFFIX) qscal.$(SUFFIX) \ | ||||
| qdot.$(SUFFIX) \ | qdot.$(SUFFIX) \ | ||||
| qasum.$(SUFFIX) qnrm2.$(SUFFIX) \ | |||||
| qasum.$(SUFFIX) qsum.$(SUFFIX) qnrm2.$(SUFFIX) \ | |||||
| qmax.$(SUFFIX) qamax.$(SUFFIX) iqmax.$(SUFFIX) iqamax.$(SUFFIX) \ | qmax.$(SUFFIX) qamax.$(SUFFIX) iqmax.$(SUFFIX) iqamax.$(SUFFIX) \ | ||||
| qmin.$(SUFFIX) qamin.$(SUFFIX) iqmin.$(SUFFIX) iqamin.$(SUFFIX) \ | qmin.$(SUFFIX) qamin.$(SUFFIX) iqmin.$(SUFFIX) iqamin.$(SUFFIX) \ | ||||
| qrot.$(SUFFIX) qrotg.$(SUFFIX) qrotm.$(SUFFIX) qrotmg.$(SUFFIX) \ | qrot.$(SUFFIX) qrotg.$(SUFFIX) qrotm.$(SUFFIX) qrotmg.$(SUFFIX) \ | ||||
| @@ -168,7 +168,7 @@ XBLAS1OBJS = \ | |||||
| xaxpy.$(SUFFIX) xaxpyc.$(SUFFIX) xswap.$(SUFFIX) \ | xaxpy.$(SUFFIX) xaxpyc.$(SUFFIX) xswap.$(SUFFIX) \ | ||||
| xcopy.$(SUFFIX) xscal.$(SUFFIX) xqscal.$(SUFFIX) \ | xcopy.$(SUFFIX) xscal.$(SUFFIX) xqscal.$(SUFFIX) \ | ||||
| xdotc.$(SUFFIX) xdotu.$(SUFFIX) \ | xdotc.$(SUFFIX) xdotu.$(SUFFIX) \ | ||||
| qxasum.$(SUFFIX) qxnrm2.$(SUFFIX) \ | |||||
| qxasum.$(SUFFIX) qxsum.$(SUFFIX) qxnrm2.$(SUFFIX) \ | |||||
| qxamax.$(SUFFIX) ixamax.$(SUFFIX) \ | qxamax.$(SUFFIX) ixamax.$(SUFFIX) \ | ||||
| qxamin.$(SUFFIX) ixamin.$(SUFFIX) \ | qxamin.$(SUFFIX) ixamin.$(SUFFIX) \ | ||||
| xqrot.$(SUFFIX) xrotg.$(SUFFIX) \ | xqrot.$(SUFFIX) xrotg.$(SUFFIX) \ | ||||
| @@ -203,7 +203,7 @@ ifdef QUAD_PRECISION | |||||
| QBLAS1OBJS = \ | QBLAS1OBJS = \ | ||||
| qaxpy.$(SUFFIX) qswap.$(SUFFIX) \ | qaxpy.$(SUFFIX) qswap.$(SUFFIX) \ | ||||
| qcopy.$(SUFFIX) qscal.$(SUFFIX) \ | qcopy.$(SUFFIX) qscal.$(SUFFIX) \ | ||||
| qasum.$(SUFFIX) qnrm2.$(SUFFIX) \ | |||||
| qasum.$(SUFFIX) qsum.$(SUFFIX) qnrm2.$(SUFFIX) \ | |||||
| qmax.$(SUFFIX) qamax.$(SUFFIX) iqmax.$(SUFFIX) iqamax.$(SUFFIX) \ | qmax.$(SUFFIX) qamax.$(SUFFIX) iqmax.$(SUFFIX) iqamax.$(SUFFIX) \ | ||||
| qmin.$(SUFFIX) qamin.$(SUFFIX) iqmin.$(SUFFIX) iqamin.$(SUFFIX) \ | qmin.$(SUFFIX) qamin.$(SUFFIX) iqmin.$(SUFFIX) iqamin.$(SUFFIX) \ | ||||
| qrot.$(SUFFIX) qrotg.$(SUFFIX) qrotm.$(SUFFIX) qrotmg.$(SUFFIX) \ | qrot.$(SUFFIX) qrotg.$(SUFFIX) qrotm.$(SUFFIX) qrotmg.$(SUFFIX) \ | ||||
| @@ -224,7 +224,7 @@ QBLAS3OBJS = \ | |||||
| XBLAS1OBJS = \ | XBLAS1OBJS = \ | ||||
| xaxpy.$(SUFFIX) xaxpyc.$(SUFFIX) xswap.$(SUFFIX) \ | xaxpy.$(SUFFIX) xaxpyc.$(SUFFIX) xswap.$(SUFFIX) \ | ||||
| xcopy.$(SUFFIX) xscal.$(SUFFIX) xqscal.$(SUFFIX) \ | xcopy.$(SUFFIX) xscal.$(SUFFIX) xqscal.$(SUFFIX) \ | ||||
| qxasum.$(SUFFIX) qxnrm2.$(SUFFIX) \ | |||||
| qxasum.$(SUFFIX) qxsum.$(SUFFIX) qxnrm2.$(SUFFIX) \ | |||||
| qxamax.$(SUFFIX) ixamax.$(SUFFIX) \ | qxamax.$(SUFFIX) ixamax.$(SUFFIX) \ | ||||
| qxamin.$(SUFFIX) ixamin.$(SUFFIX) \ | qxamin.$(SUFFIX) ixamin.$(SUFFIX) \ | ||||
| xqrot.$(SUFFIX) xrotg.$(SUFFIX) \ | xqrot.$(SUFFIX) xrotg.$(SUFFIX) \ | ||||
| @@ -263,7 +263,8 @@ CSBLAS1OBJS = \ | |||||
| cblas_isamax.$(SUFFIX) cblas_isamin.$(SUFFIX) cblas_sasum.$(SUFFIX) cblas_saxpy.$(SUFFIX) \ | cblas_isamax.$(SUFFIX) cblas_isamin.$(SUFFIX) cblas_sasum.$(SUFFIX) cblas_saxpy.$(SUFFIX) \ | ||||
| cblas_scopy.$(SUFFIX) cblas_sdot.$(SUFFIX) cblas_sdsdot.$(SUFFIX) cblas_dsdot.$(SUFFIX) \ | cblas_scopy.$(SUFFIX) cblas_sdot.$(SUFFIX) cblas_sdsdot.$(SUFFIX) cblas_dsdot.$(SUFFIX) \ | ||||
| cblas_srot.$(SUFFIX) cblas_srotg.$(SUFFIX) cblas_srotm.$(SUFFIX) cblas_srotmg.$(SUFFIX) \ | cblas_srot.$(SUFFIX) cblas_srotg.$(SUFFIX) cblas_srotm.$(SUFFIX) cblas_srotmg.$(SUFFIX) \ | ||||
| cblas_sscal.$(SUFFIX) cblas_sswap.$(SUFFIX) cblas_snrm2.$(SUFFIX) cblas_saxpby.$(SUFFIX) | |||||
| cblas_sscal.$(SUFFIX) cblas_sswap.$(SUFFIX) cblas_snrm2.$(SUFFIX) cblas_saxpby.$(SUFFIX) \ | |||||
| cblas_ismin.$(SUFFIX) cblas_ismax.$(SUFFIX) cblas_ssum.$(SUFFIX) | |||||
| CSBLAS2OBJS = \ | CSBLAS2OBJS = \ | ||||
| cblas_sgemv.$(SUFFIX) cblas_sger.$(SUFFIX) cblas_ssymv.$(SUFFIX) cblas_strmv.$(SUFFIX) \ | cblas_sgemv.$(SUFFIX) cblas_sger.$(SUFFIX) cblas_ssymv.$(SUFFIX) cblas_strmv.$(SUFFIX) \ | ||||
| @@ -280,7 +281,8 @@ CDBLAS1OBJS = \ | |||||
| cblas_idamax.$(SUFFIX) cblas_idamin.$(SUFFIX) cblas_dasum.$(SUFFIX) cblas_daxpy.$(SUFFIX) \ | cblas_idamax.$(SUFFIX) cblas_idamin.$(SUFFIX) cblas_dasum.$(SUFFIX) cblas_daxpy.$(SUFFIX) \ | ||||
| cblas_dcopy.$(SUFFIX) cblas_ddot.$(SUFFIX) \ | cblas_dcopy.$(SUFFIX) cblas_ddot.$(SUFFIX) \ | ||||
| cblas_drot.$(SUFFIX) cblas_drotg.$(SUFFIX) cblas_drotm.$(SUFFIX) cblas_drotmg.$(SUFFIX) \ | cblas_drot.$(SUFFIX) cblas_drotg.$(SUFFIX) cblas_drotm.$(SUFFIX) cblas_drotmg.$(SUFFIX) \ | ||||
| cblas_dscal.$(SUFFIX) cblas_dswap.$(SUFFIX) cblas_dnrm2.$(SUFFIX) cblas_daxpby.$(SUFFIX) | |||||
| cblas_dscal.$(SUFFIX) cblas_dswap.$(SUFFIX) cblas_dnrm2.$(SUFFIX) cblas_daxpby.$(SUFFIX) \ | |||||
| cblas_idmin.$(SUFFIX) cblas_idmax.$(SUFFIX) cblas_dsum.$(SUFFIX) | |||||
| CDBLAS2OBJS = \ | CDBLAS2OBJS = \ | ||||
| cblas_dgemv.$(SUFFIX) cblas_dger.$(SUFFIX) cblas_dsymv.$(SUFFIX) cblas_dtrmv.$(SUFFIX) \ | cblas_dgemv.$(SUFFIX) cblas_dger.$(SUFFIX) cblas_dsymv.$(SUFFIX) cblas_dtrmv.$(SUFFIX) \ | ||||
| @@ -300,7 +302,8 @@ CCBLAS1OBJS = \ | |||||
| cblas_cdotc_sub.$(SUFFIX) cblas_cdotu_sub.$(SUFFIX) \ | cblas_cdotc_sub.$(SUFFIX) cblas_cdotu_sub.$(SUFFIX) \ | ||||
| cblas_cscal.$(SUFFIX) cblas_csscal.$(SUFFIX) \ | cblas_cscal.$(SUFFIX) cblas_csscal.$(SUFFIX) \ | ||||
| cblas_cswap.$(SUFFIX) cblas_scnrm2.$(SUFFIX) \ | cblas_cswap.$(SUFFIX) cblas_scnrm2.$(SUFFIX) \ | ||||
| cblas_caxpby.$(SUFFIX) | |||||
| cblas_caxpby.$(SUFFIX) \ | |||||
| cblas_icmin.$(SUFFIX) cblas_icmax.$(SUFFIX) cblas_scsum.$(SUFFIX) | |||||
| CCBLAS2OBJS = \ | CCBLAS2OBJS = \ | ||||
| cblas_cgemv.$(SUFFIX) cblas_cgerc.$(SUFFIX) cblas_cgeru.$(SUFFIX) \ | cblas_cgemv.$(SUFFIX) cblas_cgerc.$(SUFFIX) cblas_cgeru.$(SUFFIX) \ | ||||
| @@ -326,7 +329,9 @@ CZBLAS1OBJS = \ | |||||
| cblas_zdotc_sub.$(SUFFIX) cblas_zdotu_sub.$(SUFFIX) \ | cblas_zdotc_sub.$(SUFFIX) cblas_zdotu_sub.$(SUFFIX) \ | ||||
| cblas_zscal.$(SUFFIX) cblas_zdscal.$(SUFFIX) \ | cblas_zscal.$(SUFFIX) cblas_zdscal.$(SUFFIX) \ | ||||
| cblas_zswap.$(SUFFIX) cblas_dznrm2.$(SUFFIX) \ | cblas_zswap.$(SUFFIX) cblas_dznrm2.$(SUFFIX) \ | ||||
| cblas_zaxpby.$(SUFFIX) | |||||
| cblas_zaxpby.$(SUFFIX) \ | |||||
| cblas_izmin.$(SUFFIX) cblas_izmax.$(SUFFIX) cblas_dzsum.$(SUFFIX) | |||||
| CZBLAS2OBJS = \ | CZBLAS2OBJS = \ | ||||
| cblas_zgemv.$(SUFFIX) cblas_zgerc.$(SUFFIX) cblas_zgeru.$(SUFFIX) \ | cblas_zgemv.$(SUFFIX) cblas_zgerc.$(SUFFIX) cblas_zgeru.$(SUFFIX) \ | ||||
| @@ -560,6 +565,24 @@ dzasum.$(SUFFIX) dzasum.$(PSUFFIX) : asum.c | |||||
| qxasum.$(SUFFIX) qxasum.$(PSUFFIX) : asum.c | qxasum.$(SUFFIX) qxasum.$(PSUFFIX) : asum.c | ||||
| $(CC) $(CFLAGS) -c $< -o $(@F) | $(CC) $(CFLAGS) -c $< -o $(@F) | ||||
| ssum.$(SUFFIX) ssum.$(PSUFFIX) : sum.c | |||||
| $(CC) $(CFLAGS) -c $< -o $(@F) | |||||
| dsum.$(SUFFIX) dsum.$(PSUFFIX) : sum.c | |||||
| $(CC) $(CFLAGS) -c $< -o $(@F) | |||||
| qsum.$(SUFFIX) qsum.$(PSUFFIX) : sum.c | |||||
| $(CC) $(CFLAGS) -c $< -o $(@F) | |||||
| scsum.$(SUFFIX) scsum.$(PSUFFIX) : sum.c | |||||
| $(CC) $(CFLAGS) -c $< -o $(@F) | |||||
| dzsum.$(SUFFIX) dzsum.$(PSUFFIX) : sum.c | |||||
| $(CC) $(CFLAGS) -c $< -o $(@F) | |||||
| qxsum.$(SUFFIX) qxsum.$(PSUFFIX) : sum.c | |||||
| $(CC) $(CFLAGS) -c $< -o $(@F) | |||||
| snrm2.$(SUFFIX) snrm2.$(PSUFFIX) : nrm2.c | snrm2.$(SUFFIX) snrm2.$(PSUFFIX) : nrm2.c | ||||
| $(CC) $(CFLAGS) -c $< -o $(@F) | $(CC) $(CFLAGS) -c $< -o $(@F) | ||||
| @@ -1383,6 +1406,18 @@ cblas_ismin.$(SUFFIX) cblas_ismin.$(PSUFFIX) : imax.c | |||||
| cblas_idmin.$(SUFFIX) cblas_idmin.$(PSUFFIX) : imax.c | cblas_idmin.$(SUFFIX) cblas_idmin.$(PSUFFIX) : imax.c | ||||
| $(CC) $(CFLAGS) -DCBLAS -c -UUSE_ABS -DUSE_MIN $< -o $(@F) | $(CC) $(CFLAGS) -DCBLAS -c -UUSE_ABS -DUSE_MIN $< -o $(@F) | ||||
| cblas_icmax.$(SUFFIX) cblas_icmax.$(PSUFFIX) : imax.c | |||||
| $(CC) $(CFLAGS) -DCBLAS -c -UUSE_ABS -UUSE_MIN $< -o $(@F) | |||||
| cblas_izmax.$(SUFFIX) cblas_izmax.$(PSUFFIX) : imax.c | |||||
| $(CC) $(CFLAGS) -DCBLAS -c -UUSE_ABS -UUSE_MIN $< -o $(@F) | |||||
| cblas_icmin.$(SUFFIX) cblas_icmin.$(PSUFFIX) : imax.c | |||||
| $(CC) $(CFLAGS) -DCBLAS -c -UUSE_ABS -DUSE_MIN $< -o $(@F) | |||||
| cblas_izmin.$(SUFFIX) cblas_izmin.$(PSUFFIX) : imax.c | |||||
| $(CC) $(CFLAGS) -DCBLAS -c -UUSE_ABS -DUSE_MIN $< -o $(@F) | |||||
| cblas_sasum.$(SUFFIX) cblas_sasum.$(PSUFFIX) : asum.c | cblas_sasum.$(SUFFIX) cblas_sasum.$(PSUFFIX) : asum.c | ||||
| $(CC) $(CFLAGS) -DCBLAS -c $< -o $(@F) | $(CC) $(CFLAGS) -DCBLAS -c $< -o $(@F) | ||||
| @@ -1395,6 +1430,18 @@ cblas_scasum.$(SUFFIX) cblas_scasum.$(PSUFFIX) : asum.c | |||||
| cblas_dzasum.$(SUFFIX) cblas_dzasum.$(PSUFFIX) : asum.c | cblas_dzasum.$(SUFFIX) cblas_dzasum.$(PSUFFIX) : asum.c | ||||
| $(CC) $(CFLAGS) -DCBLAS -c $< -o $(@F) | $(CC) $(CFLAGS) -DCBLAS -c $< -o $(@F) | ||||
| cblas_ssum.$(SUFFIX) cblas_ssum.$(PSUFFIX) : sum.c | |||||
| $(CC) $(CFLAGS) -DCBLAS -c $< -o $(@F) | |||||
| cblas_dsum.$(SUFFIX) cblas_dsum.$(PSUFFIX) : sum.c | |||||
| $(CC) $(CFLAGS) -DCBLAS -c $< -o $(@F) | |||||
| cblas_scsum.$(SUFFIX) cblas_scsum.$(PSUFFIX) : sum.c | |||||
| $(CC) $(CFLAGS) -DCBLAS -c $< -o $(@F) | |||||
| cblas_dzsum.$(SUFFIX) cblas_dzsum.$(PSUFFIX) : sum.c | |||||
| $(CC) $(CFLAGS) -DCBLAS -c $< -o $(@F) | |||||
| cblas_sdsdot.$(SUFFIX) cblas_sdsdot.$(PSUFFIX) : sdsdot.c | cblas_sdsdot.$(SUFFIX) cblas_sdsdot.$(PSUFFIX) : sdsdot.c | ||||
| $(CC) $(CFLAGS) -DCBLAS -c $< -o $(@F) | $(CC) $(CFLAGS) -DCBLAS -c $< -o $(@F) | ||||
| @@ -1402,7 +1449,7 @@ cblas_dsdot.$(SUFFIX) cblas_dsdot.$(PSUFFIX) : dsdot.c | |||||
| $(CC) $(CFLAGS) -DCBLAS -c $< -o $(@F) | $(CC) $(CFLAGS) -DCBLAS -c $< -o $(@F) | ||||
| cblas_sdot.$(SUFFIX) cblas_sdot.$(PSUFFIX) : dot.c | cblas_sdot.$(SUFFIX) cblas_sdot.$(PSUFFIX) : dot.c | ||||
| $(CC) $(CFLAGS) -DCBLAS -c $< -o $(@F) | |||||
| $(CC) $(CFLAGS) -DCBLAS -c $< -o $(@F) | |||||
| cblas_ddot.$(SUFFIX) cblas_ddot.$(PSUFFIX) : dot.c | cblas_ddot.$(SUFFIX) cblas_ddot.$(PSUFFIX) : dot.c | ||||
| $(CC) $(CFLAGS) -DCBLAS -c $< -o $(@F) | $(CC) $(CFLAGS) -DCBLAS -c $< -o $(@F) | ||||
| @@ -91,7 +91,7 @@ void CNAME(blasint n, FLOAT alpha, FLOAT *x, blasint incx, FLOAT *y, blasint inc | |||||
| //disable multi-thread when incx==0 or incy==0 | //disable multi-thread when incx==0 or incy==0 | ||||
| //In that case, the threads would be dependent. | //In that case, the threads would be dependent. | ||||
| // | // | ||||
| //Temporarily work-around the low performance issue with small imput size & | |||||
| //Temporarily work-around the low performance issue with small input size & | |||||
| //multithreads. | //multithreads. | ||||
| if (incx == 0 || incy == 0 || n <= MULTI_THREAD_MINIMAL) | if (incx == 0 || incy == 0 || n <= MULTI_THREAD_MINIMAL) | ||||
| nthreads = 1; | nthreads = 1; | ||||
| @@ -0,0 +1,97 @@ | |||||
| /*********************************************************************/ | |||||
| /* Copyright 2009, 2010 The University of Texas at Austin. */ | |||||
| /* All rights reserved. */ | |||||
| /* */ | |||||
| /* Redistribution and use in source and binary forms, with or */ | |||||
| /* without modification, are permitted provided that the following */ | |||||
| /* conditions are met: */ | |||||
| /* */ | |||||
| /* 1. Redistributions of source code must retain the above */ | |||||
| /* copyright notice, this list of conditions and the following */ | |||||
| /* disclaimer. */ | |||||
| /* */ | |||||
| /* 2. Redistributions in binary form must reproduce the above */ | |||||
| /* copyright notice, this list of conditions and the following */ | |||||
| /* disclaimer in the documentation and/or other materials */ | |||||
| /* provided with the distribution. */ | |||||
| /* */ | |||||
| /* THIS SOFTWARE IS PROVIDED BY THE UNIVERSITY OF TEXAS AT */ | |||||
| /* AUSTIN ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, */ | |||||
| /* INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF */ | |||||
| /* MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE */ | |||||
| /* DISCLAIMED. IN NO EVENT SHALL THE UNIVERSITY OF TEXAS AT */ | |||||
| /* AUSTIN OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, */ | |||||
| /* INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES */ | |||||
| /* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE */ | |||||
| /* GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR */ | |||||
| /* BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF */ | |||||
| /* LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT */ | |||||
| /* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT */ | |||||
| /* OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE */ | |||||
| /* POSSIBILITY OF SUCH DAMAGE. */ | |||||
| /* */ | |||||
| /* The views and conclusions contained in the software and */ | |||||
| /* documentation are those of the authors and should not be */ | |||||
| /* interpreted as representing official policies, either expressed */ | |||||
| /* or implied, of The University of Texas at Austin. */ | |||||
| /*********************************************************************/ | |||||
| #include <stdio.h> | |||||
| #include "common.h" | |||||
| #ifdef FUNCTION_PROFILE | |||||
| #include "functable.h" | |||||
| #endif | |||||
| #ifndef CBLAS | |||||
| FLOATRET NAME(blasint *N, FLOAT *x, blasint *INCX){ | |||||
| BLASLONG n = *N; | |||||
| BLASLONG incx = *INCX; | |||||
| FLOATRET ret; | |||||
| PRINT_DEBUG_NAME; | |||||
| if (n <= 0) return 0; | |||||
| IDEBUG_START; | |||||
| FUNCTION_PROFILE_START(); | |||||
| ret = (FLOATRET)SUM_K(n, x, incx); | |||||
| FUNCTION_PROFILE_END(COMPSIZE, n, n); | |||||
| IDEBUG_END; | |||||
| return ret; | |||||
| } | |||||
| #else | |||||
| #ifdef COMPLEX | |||||
| FLOAT CNAME(blasint n, void *vx, blasint incx){ | |||||
| FLOAT *x = (FLOAT*) vx; | |||||
| #else | |||||
| FLOAT CNAME(blasint n, FLOAT *x, blasint incx){ | |||||
| #endif | |||||
| FLOAT ret; | |||||
| PRINT_DEBUG_CNAME; | |||||
| if (n <= 0) return 0; | |||||
| IDEBUG_START; | |||||
| FUNCTION_PROFILE_START(); | |||||
| ret = SUM_K(n, x, incx); | |||||
| FUNCTION_PROFILE_END(COMPSIZE, n, n); | |||||
| IDEBUG_END; | |||||
| return ret; | |||||
| } | |||||
| #endif | |||||
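For orientation: the interface above rejects `n <= 0` and otherwise forwards to `SUM_K(n, x, incx)`. A minimal sketch of what such a kernel might compute for real data follows, assuming it is a plain strided sum with no absolute values taken (the Alpha kernel later in this diff copies each element with `fmov` before adding it). The name `ref_sum` is hypothetical and does not appear in the source.

```c
/* Hypothetical reference, not the project's kernel:
 * add up n elements of x, advancing incx elements per step. */
static double ref_sum(long n, const double *x, long incx)
{
    double s = 0.0;
    long i;

    if (n <= 0) return 0.0;      /* same early exit as the interface */

    for (i = 0; i < n; i++) {
        s += x[i * incx];        /* signed value, no fabs() */
    }
    return s;
}
```

Unlike an `asum`-style routine, negative entries cancel positive ones here, so the two routines agree only for non-negative input.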
| @@ -218,11 +218,8 @@ void CNAME(enum CBLAS_ORDER order, enum CBLAS_UPLO Uplo, | |||||
| buffer = (FLOAT *)blas_memory_alloc(1); | buffer = (FLOAT *)blas_memory_alloc(1); | ||||
| #ifdef SMP | #ifdef SMP | ||||
| /* nthreads = num_cpu_avail(2); | |||||
| nthreads = num_cpu_avail(2); | |||||
| FIXME trmv_thread was found to be broken, see issue 1332 */ | |||||
| nthreads = 1; | |||||
| if (nthreads == 1) { | if (nthreads == 1) { | ||||
| #endif | #endif | ||||
| @@ -204,7 +204,7 @@ void NAME(char *SIDE, char *UPLO, char *TRANS, char *DIAG, | |||||
| if (side < 0) info = 1; | if (side < 0) info = 1; | ||||
| if (info != 0) { | if (info != 0) { | ||||
| BLASFUNC(xerbla)(ERROR_NAME, &info, sizeof(ERROR_NAME)); | |||||
| BLASFUNC(xerbla)(ERROR_NAME, &info, sizeof(ERROR_NAME)-1); | |||||
| return; | return; | ||||
| } | } | ||||
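The hunk above changes the length passed to `xerbla` from `sizeof(ERROR_NAME)` to `sizeof(ERROR_NAME)-1`. Assuming `ERROR_NAME` expands to a string literal (the diff does not show its definition), `sizeof` counts the terminating NUL, so subtracting one presumably yields the visible character count that a Fortran-style length argument expects. A small sketch with a hypothetical stand-in macro `DEMO_NAME`:

```c
#include <stddef.h>
#include <string.h>

#define DEMO_NAME "DTRSM "   /* hypothetical stand-in for ERROR_NAME */

/* sizeof on a string literal includes the trailing '\0';
 * subtracting 1 gives the number of visible characters. */
static size_t demo_name_len(void)
{
    return sizeof(DEMO_NAME) - 1;
}
```

With this literal, `sizeof(DEMO_NAME)` is 7 while `demo_name_len()` and `strlen(DEMO_NAME)` are both 6.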
| @@ -99,7 +99,7 @@ void CNAME(blasint n, FLOAT *ALPHA, FLOAT *x, blasint incx, FLOAT *y, blasint in | |||||
| //disable multi-thread when incx==0 or incy==0 | //disable multi-thread when incx==0 or incy==0 | ||||
| //In that case, the threads would be dependent. | //In that case, the threads would be dependent. | ||||
| // | // | ||||
| //Temporarily work-around the low performance issue with small imput size & | |||||
| //Temporarily work-around the low performance issue with small input size & | |||||
| //multithreads. | //multithreads. | ||||
| if (incx == 0 || incy == 0 || n <= MULTI_THREAD_MINIMAL) | if (incx == 0 || incy == 0 || n <= MULTI_THREAD_MINIMAL) | ||||
| nthreads = 1; | nthreads = 1; | ||||
| @@ -239,9 +239,6 @@ void CNAME(enum CBLAS_ORDER order, enum CBLAS_UPLO Uplo, | |||||
| } else | } else | ||||
| nthreads = 1; | nthreads = 1; | ||||
| /* FIXME TRMV multithreading appears to be broken, see issue 1332*/ | |||||
| nthreads = 1; | |||||
| if(nthreads > 1) { | if(nthreads > 1) { | ||||
| buffer_size = n > 16 ? 0 : n * 4 + 40; | buffer_size = n > 16 ? 0 : n * 4 + 40; | ||||
| } | } | ||||
| @@ -65,6 +65,7 @@ function (build_core TARGET_CORE KDIR TSUFFIX KERNEL_DEFINITIONS) | |||||
| GenerateNamedObjects("${KERNELDIR}/${${float_char}SCALKERNEL}" "" "scal_k" false "" "" false ${float_type}) | GenerateNamedObjects("${KERNELDIR}/${${float_char}SCALKERNEL}" "" "scal_k" false "" "" false ${float_type}) | ||||
| GenerateNamedObjects("${KERNELDIR}/${${float_char}SWAPKERNEL}" "" "swap_k" false "" "" false ${float_type}) | GenerateNamedObjects("${KERNELDIR}/${${float_char}SWAPKERNEL}" "" "swap_k" false "" "" false ${float_type}) | ||||
| GenerateNamedObjects("${KERNELDIR}/${${float_char}AXPBYKERNEL}" "" "axpby_k" false "" "" false ${float_type}) | GenerateNamedObjects("${KERNELDIR}/${${float_char}AXPBYKERNEL}" "" "axpby_k" false "" "" false ${float_type}) | ||||
| GenerateNamedObjects("${KERNELDIR}/${${float_char}SUMKERNEL}" "" "sum_k" false "" "" false ${float_type}) | |||||
| if (${float_type} STREQUAL "COMPLEX" OR ${float_type} STREQUAL "ZCOMPLEX") | if (${float_type} STREQUAL "COMPLEX" OR ${float_type} STREQUAL "ZCOMPLEX") | ||||
| GenerateNamedObjects("${KERNELDIR}/${${float_char}AXPYKERNEL}" "CONJ" "axpyc_k" false "" "" false ${float_type}) | GenerateNamedObjects("${KERNELDIR}/${${float_char}AXPYKERNEL}" "CONJ" "axpyc_k" false "" "" false ${float_type}) | ||||
| @@ -340,6 +340,32 @@ ifndef XSCALKERNEL | |||||
| XSCALKERNEL = zscal.S | XSCALKERNEL = zscal.S | ||||
| endif | endif | ||||
| ### SUM ### | |||||
| ifndef SSUMKERNEL | |||||
| SSUMKERNEL = sum.S | |||||
| endif | |||||
| ifndef DSUMKERNEL | |||||
| DSUMKERNEL = sum.S | |||||
| endif | |||||
| ifndef CSUMKERNEL | |||||
| CSUMKERNEL = zsum.S | |||||
| endif | |||||
| ifndef ZSUMKERNEL | |||||
| ZSUMKERNEL = zsum.S | |||||
| endif | |||||
| ifndef QSUMKERNEL | |||||
| QSUMKERNEL = sum.S | |||||
| endif | |||||
| ifndef XSUMKERNEL | |||||
| XSUMKERNEL = zsum.S | |||||
| endif | |||||
| ### SWAP ### | ### SWAP ### | ||||
| ifndef SSWAPKERNEL | ifndef SSWAPKERNEL | ||||
| @@ -453,7 +479,7 @@ endif | |||||
| SBLASOBJS += \ | SBLASOBJS += \ | ||||
| samax_k$(TSUFFIX).$(SUFFIX) samin_k$(TSUFFIX).$(SUFFIX) smax_k$(TSUFFIX).$(SUFFIX) smin_k$(TSUFFIX).$(SUFFIX) \ | samax_k$(TSUFFIX).$(SUFFIX) samin_k$(TSUFFIX).$(SUFFIX) smax_k$(TSUFFIX).$(SUFFIX) smin_k$(TSUFFIX).$(SUFFIX) \ | ||||
| isamax_k$(TSUFFIX).$(SUFFIX) isamin_k$(TSUFFIX).$(SUFFIX) ismax_k$(TSUFFIX).$(SUFFIX) ismin_k$(TSUFFIX).$(SUFFIX) \ | isamax_k$(TSUFFIX).$(SUFFIX) isamin_k$(TSUFFIX).$(SUFFIX) ismax_k$(TSUFFIX).$(SUFFIX) ismin_k$(TSUFFIX).$(SUFFIX) \ | ||||
| sasum_k$(TSUFFIX).$(SUFFIX) saxpy_k$(TSUFFIX).$(SUFFIX) scopy_k$(TSUFFIX).$(SUFFIX) \ | |||||
| sasum_k$(TSUFFIX).$(SUFFIX) ssum_k$(TSUFFIX).$(SUFFIX) saxpy_k$(TSUFFIX).$(SUFFIX) scopy_k$(TSUFFIX).$(SUFFIX) \ | |||||
| sdot_k$(TSUFFIX).$(SUFFIX) sdsdot_k$(TSUFFIX).$(SUFFIX) dsdot_k$(TSUFFIX).$(SUFFIX) \ | sdot_k$(TSUFFIX).$(SUFFIX) sdsdot_k$(TSUFFIX).$(SUFFIX) dsdot_k$(TSUFFIX).$(SUFFIX) \ | ||||
| snrm2_k$(TSUFFIX).$(SUFFIX) srot_k$(TSUFFIX).$(SUFFIX) sscal_k$(TSUFFIX).$(SUFFIX) sswap_k$(TSUFFIX).$(SUFFIX) \ | snrm2_k$(TSUFFIX).$(SUFFIX) srot_k$(TSUFFIX).$(SUFFIX) sscal_k$(TSUFFIX).$(SUFFIX) sswap_k$(TSUFFIX).$(SUFFIX) \ | ||||
| saxpby_k$(TSUFFIX).$(SUFFIX) | saxpby_k$(TSUFFIX).$(SUFFIX) | ||||
| @@ -463,31 +489,32 @@ DBLASOBJS += \ | |||||
| idamax_k$(TSUFFIX).$(SUFFIX) idamin_k$(TSUFFIX).$(SUFFIX) idmax_k$(TSUFFIX).$(SUFFIX) idmin_k$(TSUFFIX).$(SUFFIX) \ | idamax_k$(TSUFFIX).$(SUFFIX) idamin_k$(TSUFFIX).$(SUFFIX) idmax_k$(TSUFFIX).$(SUFFIX) idmin_k$(TSUFFIX).$(SUFFIX) \ | ||||
| dasum_k$(TSUFFIX).$(SUFFIX) daxpy_k$(TSUFFIX).$(SUFFIX) dcopy_k$(TSUFFIX).$(SUFFIX) ddot_k$(TSUFFIX).$(SUFFIX) \ | dasum_k$(TSUFFIX).$(SUFFIX) daxpy_k$(TSUFFIX).$(SUFFIX) dcopy_k$(TSUFFIX).$(SUFFIX) ddot_k$(TSUFFIX).$(SUFFIX) \ | ||||
| dnrm2_k$(TSUFFIX).$(SUFFIX) drot_k$(TSUFFIX).$(SUFFIX) dscal_k$(TSUFFIX).$(SUFFIX) dswap_k$(TSUFFIX).$(SUFFIX) \ | dnrm2_k$(TSUFFIX).$(SUFFIX) drot_k$(TSUFFIX).$(SUFFIX) dscal_k$(TSUFFIX).$(SUFFIX) dswap_k$(TSUFFIX).$(SUFFIX) \ | ||||
| daxpby_k$(TSUFFIX).$(SUFFIX) | |||||
| daxpby_k$(TSUFFIX).$(SUFFIX) dsum_k$(TSUFFIX).$(SUFFIX) | |||||
| QBLASOBJS += \ | QBLASOBJS += \ | ||||
| qamax_k$(TSUFFIX).$(SUFFIX) qamin_k$(TSUFFIX).$(SUFFIX) qmax_k$(TSUFFIX).$(SUFFIX) qmin_k$(TSUFFIX).$(SUFFIX) \ | qamax_k$(TSUFFIX).$(SUFFIX) qamin_k$(TSUFFIX).$(SUFFIX) qmax_k$(TSUFFIX).$(SUFFIX) qmin_k$(TSUFFIX).$(SUFFIX) \ | ||||
| iqamax_k$(TSUFFIX).$(SUFFIX) iqamin_k$(TSUFFIX).$(SUFFIX) iqmax_k$(TSUFFIX).$(SUFFIX) iqmin_k$(TSUFFIX).$(SUFFIX) \ | iqamax_k$(TSUFFIX).$(SUFFIX) iqamin_k$(TSUFFIX).$(SUFFIX) iqmax_k$(TSUFFIX).$(SUFFIX) iqmin_k$(TSUFFIX).$(SUFFIX) \ | ||||
| qasum_k$(TSUFFIX).$(SUFFIX) qaxpy_k$(TSUFFIX).$(SUFFIX) qcopy_k$(TSUFFIX).$(SUFFIX) qdot_k$(TSUFFIX).$(SUFFIX) \ | qasum_k$(TSUFFIX).$(SUFFIX) qaxpy_k$(TSUFFIX).$(SUFFIX) qcopy_k$(TSUFFIX).$(SUFFIX) qdot_k$(TSUFFIX).$(SUFFIX) \ | ||||
| qnrm2_k$(TSUFFIX).$(SUFFIX) qrot_k$(TSUFFIX).$(SUFFIX) qscal_k$(TSUFFIX).$(SUFFIX) qswap_k$(TSUFFIX).$(SUFFIX) | |||||
| qnrm2_k$(TSUFFIX).$(SUFFIX) qrot_k$(TSUFFIX).$(SUFFIX) qscal_k$(TSUFFIX).$(SUFFIX) qswap_k$(TSUFFIX).$(SUFFIX) \ | |||||
| qsum_k$(TSUFFIX).$(SUFFIX) | |||||
| CBLASOBJS += \ | CBLASOBJS += \ | ||||
| camax_k$(TSUFFIX).$(SUFFIX) camin_k$(TSUFFIX).$(SUFFIX) icamax_k$(TSUFFIX).$(SUFFIX) icamin_k$(TSUFFIX).$(SUFFIX) \ | camax_k$(TSUFFIX).$(SUFFIX) camin_k$(TSUFFIX).$(SUFFIX) icamax_k$(TSUFFIX).$(SUFFIX) icamin_k$(TSUFFIX).$(SUFFIX) \ | ||||
| casum_k$(TSUFFIX).$(SUFFIX) caxpy_k$(TSUFFIX).$(SUFFIX) caxpyc_k$(TSUFFIX).$(SUFFIX) ccopy_k$(TSUFFIX).$(SUFFIX) \ | casum_k$(TSUFFIX).$(SUFFIX) caxpy_k$(TSUFFIX).$(SUFFIX) caxpyc_k$(TSUFFIX).$(SUFFIX) ccopy_k$(TSUFFIX).$(SUFFIX) \ | ||||
| cdotc_k$(TSUFFIX).$(SUFFIX) cdotu_k$(TSUFFIX).$(SUFFIX) cnrm2_k$(TSUFFIX).$(SUFFIX) csrot_k$(TSUFFIX).$(SUFFIX) \ | cdotc_k$(TSUFFIX).$(SUFFIX) cdotu_k$(TSUFFIX).$(SUFFIX) cnrm2_k$(TSUFFIX).$(SUFFIX) csrot_k$(TSUFFIX).$(SUFFIX) \ | ||||
| cscal_k$(TSUFFIX).$(SUFFIX) cswap_k$(TSUFFIX).$(SUFFIX) caxpby_k$(TSUFFIX).$(SUFFIX) | |||||
| cscal_k$(TSUFFIX).$(SUFFIX) cswap_k$(TSUFFIX).$(SUFFIX) caxpby_k$(TSUFFIX).$(SUFFIX) csum_k$(TSUFFIX).$(SUFFIX) | |||||
| ZBLASOBJS += \ | ZBLASOBJS += \ | ||||
| zamax_k$(TSUFFIX).$(SUFFIX) zamin_k$(TSUFFIX).$(SUFFIX) izamax_k$(TSUFFIX).$(SUFFIX) izamin_k$(TSUFFIX).$(SUFFIX) \ | zamax_k$(TSUFFIX).$(SUFFIX) zamin_k$(TSUFFIX).$(SUFFIX) izamax_k$(TSUFFIX).$(SUFFIX) izamin_k$(TSUFFIX).$(SUFFIX) \ | ||||
| zasum_k$(TSUFFIX).$(SUFFIX) zaxpy_k$(TSUFFIX).$(SUFFIX) zaxpyc_k$(TSUFFIX).$(SUFFIX) zcopy_k$(TSUFFIX).$(SUFFIX) \ | zasum_k$(TSUFFIX).$(SUFFIX) zaxpy_k$(TSUFFIX).$(SUFFIX) zaxpyc_k$(TSUFFIX).$(SUFFIX) zcopy_k$(TSUFFIX).$(SUFFIX) \ | ||||
| zdotc_k$(TSUFFIX).$(SUFFIX) zdotu_k$(TSUFFIX).$(SUFFIX) znrm2_k$(TSUFFIX).$(SUFFIX) zdrot_k$(TSUFFIX).$(SUFFIX) \ | zdotc_k$(TSUFFIX).$(SUFFIX) zdotu_k$(TSUFFIX).$(SUFFIX) znrm2_k$(TSUFFIX).$(SUFFIX) zdrot_k$(TSUFFIX).$(SUFFIX) \ | ||||
| zscal_k$(TSUFFIX).$(SUFFIX) zswap_k$(TSUFFIX).$(SUFFIX) zaxpby_k$(TSUFFIX).$(SUFFIX) | |||||
| zscal_k$(TSUFFIX).$(SUFFIX) zswap_k$(TSUFFIX).$(SUFFIX) zaxpby_k$(TSUFFIX).$(SUFFIX) zsum_k$(TSUFFIX).$(SUFFIX) | |||||
| XBLASOBJS += \ | XBLASOBJS += \ | ||||
| xamax_k$(TSUFFIX).$(SUFFIX) xamin_k$(TSUFFIX).$(SUFFIX) ixamax_k$(TSUFFIX).$(SUFFIX) ixamin_k$(TSUFFIX).$(SUFFIX) \ | xamax_k$(TSUFFIX).$(SUFFIX) xamin_k$(TSUFFIX).$(SUFFIX) ixamax_k$(TSUFFIX).$(SUFFIX) ixamin_k$(TSUFFIX).$(SUFFIX) \ | ||||
| xasum_k$(TSUFFIX).$(SUFFIX) xaxpy_k$(TSUFFIX).$(SUFFIX) xaxpyc_k$(TSUFFIX).$(SUFFIX) xcopy_k$(TSUFFIX).$(SUFFIX) \ | xasum_k$(TSUFFIX).$(SUFFIX) xaxpy_k$(TSUFFIX).$(SUFFIX) xaxpyc_k$(TSUFFIX).$(SUFFIX) xcopy_k$(TSUFFIX).$(SUFFIX) \ | ||||
| xdotc_k$(TSUFFIX).$(SUFFIX) xdotu_k$(TSUFFIX).$(SUFFIX) xnrm2_k$(TSUFFIX).$(SUFFIX) xqrot_k$(TSUFFIX).$(SUFFIX) \ | xdotc_k$(TSUFFIX).$(SUFFIX) xdotu_k$(TSUFFIX).$(SUFFIX) xnrm2_k$(TSUFFIX).$(SUFFIX) xqrot_k$(TSUFFIX).$(SUFFIX) \ | ||||
| xscal_k$(TSUFFIX).$(SUFFIX) xswap_k$(TSUFFIX).$(SUFFIX) | |||||
| xscal_k$(TSUFFIX).$(SUFFIX) xswap_k$(TSUFFIX).$(SUFFIX) xsum_k$(TSUFFIX).$(SUFFIX) | |||||
| ### AMAX ### | ### AMAX ### | ||||
| @@ -617,7 +644,7 @@ $(KDIR)idmin_k$(TSUFFIX).$(SUFFIX) $(KDIR)idmin_k$(TPSUFFIX).$(PSUFFIX) : $(KE | |||||
| $(KDIR)iqmin_k$(TSUFFIX).$(SUFFIX) $(KDIR)iqmin_k$(TPSUFFIX).$(PSUFFIX) : $(KERNELDIR)/$(IQMINKERNEL) | $(KDIR)iqmin_k$(TSUFFIX).$(SUFFIX) $(KDIR)iqmin_k$(TPSUFFIX).$(PSUFFIX) : $(KERNELDIR)/$(IQMINKERNEL) | ||||
| $(CC) -c $(CFLAGS) -UCOMPLEX -DXDOUBLE -UUSE_ABS -DUSE_MIN $< -o $@ | $(CC) -c $(CFLAGS) -UCOMPLEX -DXDOUBLE -UUSE_ABS -DUSE_MIN $< -o $@ | ||||
| ### ASUM ### | |||||
| $(KDIR)sasum_k$(TSUFFIX).$(SUFFIX) $(KDIR)sasum_k$(TPSUFFIX).$(PSUFFIX) : $(KERNELDIR)/$(SASUMKERNEL) | $(KDIR)sasum_k$(TSUFFIX).$(SUFFIX) $(KDIR)sasum_k$(TPSUFFIX).$(PSUFFIX) : $(KERNELDIR)/$(SASUMKERNEL) | ||||
| $(CC) -c $(CFLAGS) -UCOMPLEX -UDOUBLE $< -o $@ | $(CC) -c $(CFLAGS) -UCOMPLEX -UDOUBLE $< -o $@ | ||||
| @@ -636,6 +663,26 @@ $(KDIR)zasum_k$(TSUFFIX).$(SUFFIX) $(KDIR)zasum_k$(TPSUFFIX).$(PSUFFIX) : $(KE | |||||
| $(KDIR)xasum_k$(TSUFFIX).$(SUFFIX) $(KDIR)xasum_k$(TPSUFFIX).$(PSUFFIX) : $(KERNELDIR)/$(XASUMKERNEL) | $(KDIR)xasum_k$(TSUFFIX).$(SUFFIX) $(KDIR)xasum_k$(TPSUFFIX).$(PSUFFIX) : $(KERNELDIR)/$(XASUMKERNEL) | ||||
| $(CC) -c $(CFLAGS) -DCOMPLEX -DXDOUBLE $< -o $@ | $(CC) -c $(CFLAGS) -DCOMPLEX -DXDOUBLE $< -o $@ | ||||
| ### SUM ### | |||||
| $(KDIR)ssum_k$(TSUFFIX).$(SUFFIX) $(KDIR)ssum_k$(TPSUFFIX).$(PSUFFIX) : $(KERNELDIR)/$(SSUMKERNEL) | |||||
| $(CC) -c $(CFLAGS) -UCOMPLEX -UDOUBLE $< -o $@ | |||||
| $(KDIR)dsum_k$(TSUFFIX).$(SUFFIX) $(KDIR)dsum_k$(TPSUFFIX).$(PSUFFIX) : $(KERNELDIR)/$(DSUMKERNEL) | |||||
| $(CC) -c $(CFLAGS) -UCOMPLEX -DDOUBLE $< -o $@ | |||||
| $(KDIR)qsum_k$(TSUFFIX).$(SUFFIX) $(KDIR)qsum_k$(TPSUFFIX).$(PSUFFIX) : $(KERNELDIR)/$(QSUMKERNEL) | |||||
| $(CC) -c $(CFLAGS) -UCOMPLEX -DXDOUBLE $< -o $@ | |||||
| $(KDIR)csum_k$(TSUFFIX).$(SUFFIX) $(KDIR)csum_k$(TPSUFFIX).$(PSUFFIX) : $(KERNELDIR)/$(CSUMKERNEL) | |||||
| $(CC) -c $(CFLAGS) -DCOMPLEX -UDOUBLE $< -o $@ | |||||
| $(KDIR)zsum_k$(TSUFFIX).$(SUFFIX) $(KDIR)zsum_k$(TPSUFFIX).$(PSUFFIX) : $(KERNELDIR)/$(ZSUMKERNEL) | |||||
| $(CC) -c $(CFLAGS) -DCOMPLEX -DDOUBLE $< -o $@ | |||||
| $(KDIR)xsum_k$(TSUFFIX).$(SUFFIX) $(KDIR)xsum_k$(TPSUFFIX).$(PSUFFIX) : $(KERNELDIR)/$(XSUMKERNEL) | |||||
| $(CC) -c $(CFLAGS) -DCOMPLEX -DXDOUBLE $< -o $@ | |||||
| ### AXPY ### | |||||
| $(KDIR)saxpy_k$(TSUFFIX).$(SUFFIX) $(KDIR)saxpy_k$(TPSUFFIX).$(PSUFFIX) : $(KERNELDIR)/$(SAXPYKERNEL) | $(KDIR)saxpy_k$(TSUFFIX).$(SUFFIX) $(KDIR)saxpy_k$(TPSUFFIX).$(PSUFFIX) : $(KERNELDIR)/$(SAXPYKERNEL) | ||||
| $(CC) -c $(CFLAGS) -UCOMPLEX -UDOUBLE $< -o $@ | $(CC) -c $(CFLAGS) -UCOMPLEX -UDOUBLE $< -o $@ | ||||
| @@ -24,7 +24,7 @@ ifeq ($(TARGET), LOONGSON3B) | |||||
| USE_TRMM = 1 | USE_TRMM = 1 | ||||
| endif | endif | ||||
| ifeq ($(TARGET), GENERIC) | |||||
| ifeq ($(CORE), GENERIC) | |||||
| USE_TRMM = 1 | USE_TRMM = 1 | ||||
| endif | endif | ||||
| @@ -44,10 +44,18 @@ ifeq ($(CORE), POWER8) | |||||
| USE_TRMM = 1 | USE_TRMM = 1 | ||||
| endif | endif | ||||
| ifeq ($(CORE), POWER9) | |||||
| USE_TRMM = 1 | |||||
| endif | |||||
| ifeq ($(ARCH), zarch) | ifeq ($(ARCH), zarch) | ||||
| USE_TRMM = 1 | USE_TRMM = 1 | ||||
| endif | endif | ||||
| ifeq ($(CORE), Z14) | |||||
| USE_TRMM = 1 | |||||
| endif | |||||
| @@ -0,0 +1,206 @@ | |||||
| /*********************************************************************/ | |||||
| /* Copyright 2009, 2010 The University of Texas at Austin. */ | |||||
| /* All rights reserved. */ | |||||
| /* */ | |||||
| /* Redistribution and use in source and binary forms, with or */ | |||||
| /* without modification, are permitted provided that the following */ | |||||
| /* conditions are met: */ | |||||
| /* */ | |||||
| /* 1. Redistributions of source code must retain the above */ | |||||
| /* copyright notice, this list of conditions and the following */ | |||||
| /* disclaimer. */ | |||||
| /* */ | |||||
| /* 2. Redistributions in binary form must reproduce the above */ | |||||
| /* copyright notice, this list of conditions and the following */ | |||||
| /* disclaimer in the documentation and/or other materials */ | |||||
| /* provided with the distribution. */ | |||||
| /* */ | |||||
| /* THIS SOFTWARE IS PROVIDED BY THE UNIVERSITY OF TEXAS AT */ | |||||
| /* AUSTIN ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, */ | |||||
| /* INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF */ | |||||
| /* MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE */ | |||||
| /* DISCLAIMED. IN NO EVENT SHALL THE UNIVERSITY OF TEXAS AT */ | |||||
| /* AUSTIN OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, */ | |||||
| /* INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES */ | |||||
| /* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE */ | |||||
| /* GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR */ | |||||
| /* BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF */ | |||||
| /* LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT */ | |||||
| /* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT */ | |||||
| /* OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE */ | |||||
| /* POSSIBILITY OF SUCH DAMAGE. */ | |||||
| /* */ | |||||
| /* The views and conclusions contained in the software and */ | |||||
| /* documentation are those of the authors and should not be */ | |||||
| /* interpreted as representing official policies, either expressed */ | |||||
| /* or implied, of The University of Texas at Austin. */ | |||||
| /*********************************************************************/ | |||||
| #define ASSEMBLER | |||||
| #include "common.h" | |||||
| #include "version.h" | |||||
| #define PREFETCHSIZE 88 | |||||
| #define N $16 | |||||
| #define X $17 | |||||
| #define INCX $18 | |||||
| #define I $19 | |||||
| #define s0 $f0 | |||||
| #define s1 $f1 | |||||
| #define s2 $f10 | |||||
| #define s3 $f11 | |||||
| #define a0 $f12 | |||||
| #define a1 $f13 | |||||
| #define a2 $f14 | |||||
| #define a3 $f15 | |||||
| #define a4 $f16 | |||||
| #define a5 $f17 | |||||
| #define a6 $f18 | |||||
| #define a7 $f19 | |||||
| #define t0 $f20 | |||||
| #define t1 $f21 | |||||
| #define t2 $f22 | |||||
| #define t3 $f23 | |||||
| PROLOGUE | |||||
| PROFCODE | |||||
| fclr s0 | |||||
| unop | |||||
| fclr t0 | |||||
| ble N, $L999 | |||||
| sra N, 3, I | |||||
| fclr s1 | |||||
| fclr s2 | |||||
| ble I, $L15 | |||||
| LD a0, 0 * SIZE(X) | |||||
| fclr t1 | |||||
| SXADDQ INCX, X, X | |||||
| fclr t2 | |||||
| LD a1, 0 * SIZE(X) | |||||
| fclr t3 | |||||
| SXADDQ INCX, X, X | |||||
| fclr s3 | |||||
| LD a2, 0 * SIZE(X) | |||||
| SXADDQ INCX, X, X | |||||
| LD a3, 0 * SIZE(X) | |||||
| SXADDQ INCX, X, X | |||||
| LD a4, 0 * SIZE(X) | |||||
| SXADDQ INCX, X, X | |||||
| LD a5, 0 * SIZE(X) | |||||
| SXADDQ INCX, X, X | |||||
| lda I, -1(I) | |||||
| ble I, $L13 | |||||
| .align 4 | |||||
| $L12: | |||||
| ADD s0, t0, s0 | |||||
| ldl $31, PREFETCHSIZE * 2 * SIZE(X) | |||||
| fmov a0, t0 | |||||
| lda I, -1(I) | |||||
| ADD s1, t1, s1 | |||||
| LD a6, 0 * SIZE(X) | |||||
| fmov a1, t1 | |||||
| SXADDQ INCX, X, X | |||||
| ADD s2, t2, s2 | |||||
| LD a7, 0 * SIZE(X) | |||||
| fmov a2, t2 | |||||
| SXADDQ INCX, X, X | |||||
| ADD s3, t3, s3 | |||||
| LD a0, 0 * SIZE(X) | |||||
| fmov a3, t3 | |||||
| SXADDQ INCX, X, X | |||||
| ADD s0, t0, s0 | |||||
| LD a1, 0 * SIZE(X) | |||||
| fmov a4, t0 | |||||
| SXADDQ INCX, X, X | |||||
| ADD s1, t1, s1 | |||||
| LD a2, 0 * SIZE(X) | |||||
| fmov a5, t1 | |||||
| SXADDQ INCX, X, X | |||||
| ADD s2, t2, s2 | |||||
| LD a3, 0 * SIZE(X) | |||||
| fmov a6, t2 | |||||
| SXADDQ INCX, X, X | |||||
| ADD s3, t3, s3 | |||||
| LD a4, 0 * SIZE(X) | |||||
| fmov a7, t3 | |||||
| SXADDQ INCX, X, X | |||||
| LD a5, 0 * SIZE(X) | |||||
| unop | |||||
| SXADDQ INCX, X, X | |||||
| bne I, $L12 | |||||
| .align 4 | |||||
| $L13: | |||||
| ADD s0, t0, s0 | |||||
| LD a6, 0 * SIZE(X) | |||||
| fmov a0, t0 | |||||
| SXADDQ INCX, X, X | |||||
| ADD s1, t1, s1 | |||||
| LD a7, 0 * SIZE(X) | |||||
| fmov a1, t1 | |||||
| SXADDQ INCX, X, X | |||||
| ADD s2, t2, s2 | |||||
| fmov a2, t2 | |||||
| ADD s3, t3, s3 | |||||
| fmov a3, t3 | |||||
| ADD s0, t0, s0 | |||||
| fmov a4, t0 | |||||
| ADD s1, t1, s1 | |||||
| fmov a5, t1 | |||||
| ADD s2, t2, s2 | |||||
| fmov a6, t2 | |||||
| ADD s3, t3, s3 | |||||
| fmov a7, t3 | |||||
| ADD s1, t1, s1 | |||||
| ADD s2, t2, s2 | |||||
| ADD s3, t3, s3 | |||||
| ADD s0, s1, s0 | |||||
| ADD s2, s3, s2 | |||||
| .align 4 | |||||
| $L15: | |||||
| and N, 7, I | |||||
| ADD s0, s2, s0 | |||||
| unop | |||||
| ble I, $L999 | |||||
| .align 4 | |||||
| $L17: | |||||
| ADD s0, t0, s0 | |||||
| LD a0, 0 * SIZE(X) | |||||
| SXADDQ INCX, X, X | |||||
| fmov a0, t0 | |||||
| lda I, -1(I) | |||||
| bne I, $L17 | |||||
| .align 4 | |||||
| $L999: | |||||
| ADD s0, t0, s0 | |||||
| ret | |||||
| EPILOGUE | |||||
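The structure of the Alpha kernel above can be read as: process `N >> 3` blocks of eight elements into four independent accumulators (`s0` to `s3`), fold the accumulators together, then handle the remaining `N & 7` elements one at a time. A rough C sketch of that shape, ignoring the software pipelining and prefetch, is given below. The function name `unrolled_sum` is hypothetical, and the order of additions is only approximately that of the assembly.

```c
/* Hypothetical C rendering of the unroll-by-8, four-accumulator pattern. */
static double unrolled_sum(long n, const double *x, long incx)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    long i;

    if (n <= 0) return 0.0;

    /* Main loop: eight elements per iteration, two per accumulator. */
    for (i = n >> 3; i > 0; i--) {
        s0 += x[0 * incx];
        s1 += x[1 * incx];
        s2 += x[2 * incx];
        s3 += x[3 * incx];
        s0 += x[4 * incx];
        s1 += x[5 * incx];
        s2 += x[6 * incx];
        s3 += x[7 * incx];
        x += 8 * incx;
    }

    /* Fold the partial sums into one. */
    s0 = (s0 + s1) + (s2 + s3);

    /* Tail loop: the remaining n & 7 elements. */
    for (i = n & 7; i > 0; i--) {
        s0 += *x;
        x += incx;
    }
    return s0;
}
```

Keeping several accumulators breaks the dependency chain between consecutive additions. Because floating-point addition is not associative, the result can differ in the last bits from a strictly sequential sum.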
| @@ -0,0 +1,208 @@ | |||||
| /*********************************************************************/ | |||||
| /* Copyright 2009, 2010 The University of Texas at Austin. */ | |||||
| /* All rights reserved. */ | |||||
| /* */ | |||||
| /* Redistribution and use in source and binary forms, with or */ | |||||
| /* without modification, are permitted provided that the following */ | |||||
| /* conditions are met: */ | |||||
| /* */ | |||||
| /* 1. Redistributions of source code must retain the above */ | |||||
| /* copyright notice, this list of conditions and the following */ | |||||
| /* disclaimer. */ | |||||
| /* */ | |||||
| /* 2. Redistributions in binary form must reproduce the above */ | |||||
| /* copyright notice, this list of conditions and the following */ | |||||
| /* disclaimer in the documentation and/or other materials */ | |||||
| /* provided with the distribution. */ | |||||
| /* */ | |||||
| /* THIS SOFTWARE IS PROVIDED BY THE UNIVERSITY OF TEXAS AT */ | |||||
| /* AUSTIN ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, */ | |||||
| /* INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF */ | |||||
| /* MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE */ | |||||
| /* DISCLAIMED. IN NO EVENT SHALL THE UNIVERSITY OF TEXAS AT */ | |||||
| /* AUSTIN OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, */ | |||||
| /* INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES */ | |||||
| /* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE */ | |||||
| /* GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR */ | |||||
| /* BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF */ | |||||
| /* LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT */ | |||||
| /* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT */ | |||||
| /* OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE */ | |||||
| /* POSSIBILITY OF SUCH DAMAGE. */ | |||||
| /* */ | |||||
| /* The views and conclusions contained in the software and */ | |||||
| /* documentation are those of the authors and should not be */ | |||||
| /* interpreted as representing official policies, either expressed */ | |||||
| /* or implied, of The University of Texas at Austin. */ | |||||
| /*********************************************************************/ | |||||
| #define ASSEMBLER | |||||
| #include "common.h" | |||||
| #include "version.h" | |||||
| #define PREFETCHSIZE 88 | |||||
| #define N $16 | |||||
| #define X $17 | |||||
| #define INCX $18 | |||||
| #define I $19 | |||||
| #define s0 $f0 | |||||
| #define s1 $f1 | |||||
| #define s2 $f10 | |||||
| #define s3 $f11 | |||||
| #define a0 $f12 | |||||
| #define a1 $f13 | |||||
| #define a2 $f14 | |||||
| #define a3 $f15 | |||||
| #define a4 $f16 | |||||
| #define a5 $f17 | |||||
| #define a6 $f18 | |||||
| #define a7 $f19 | |||||
| #define t0 $f20 | |||||
| #define t1 $f21 | |||||
| #define t2 $f22 | |||||
| #define t3 $f23 | |||||
| PROLOGUE | |||||
| PROFCODE | |||||
| fclr s0 | |||||
| unop | |||||
| fclr t0 | |||||
| addq INCX, INCX, INCX | |||||
| fclr s1 | |||||
| unop | |||||
| fclr t1 | |||||
| ble N, $L999 | |||||
| fclr s2 | |||||
| sra N, 2, I | |||||
| fclr s3 | |||||
| ble I, $L15 | |||||
| LD a0, 0 * SIZE(X) | |||||
| fclr t2 | |||||
| LD a1, 1 * SIZE(X) | |||||
| SXADDQ INCX, X, X | |||||
| LD a2, 0 * SIZE(X) | |||||
| fclr t3 | |||||
| LD a3, 1 * SIZE(X) | |||||
| SXADDQ INCX, X, X | |||||
| LD a4, 0 * SIZE(X) | |||||
| LD a5, 1 * SIZE(X) | |||||
| SXADDQ INCX, X, X | |||||
| lda I, -1(I) | |||||
| ble I, $L13 | |||||
| .align 4 | |||||
| $L12: | |||||
| ADD s0, t0, s0 | |||||
| ldl $31, PREFETCHSIZE * SIZE(X) | |||||
| fmov a0, t0 | |||||
| lda I, -1(I) | |||||
| ADD s1, t1, s1 | |||||
| LD a6, 0 * SIZE(X) | |||||
| fmov a1, t1 | |||||
| unop | |||||
| ADD s2, t2, s2 | |||||
| LD a7, 1 * SIZE(X) | |||||
| fmov a2, t2 | |||||
| SXADDQ INCX, X, X | |||||
| ADD s3, t3, s3 | |||||
| LD a0, 0 * SIZE(X) | |||||
| fmov a3, t3 | |||||
| unop | |||||
| ADD s0, t0, s0 | |||||
| LD a1, 1 * SIZE(X) | |||||
| fmov a4, t0 | |||||
| SXADDQ INCX, X, X | |||||
| ADD s1, t1, s1 | |||||
| LD a2, 0 * SIZE(X) | |||||
| fmov a5, t1 | |||||
| unop | |||||
| ADD s2, t2, s2 | |||||
| LD a3, 1 * SIZE(X) | |||||
| fmov a6, t2 | |||||
| SXADDQ INCX, X, X | |||||
| ADD s3, t3, s3 | |||||
| LD a4, 0 * SIZE(X) | |||||
| fmov a7, t3 | |||||
| unop | |||||
| LD a5, 1 * SIZE(X) | |||||
| unop | |||||
| SXADDQ INCX, X, X | |||||
| bne I, $L12 | |||||
| .align 4 | |||||
| $L13: | |||||
| ADD s0, t0, s0 | |||||
| LD a6, 0 * SIZE(X) | |||||
| fmov a0, t0 | |||||
| ADD s1, t1, s1 | |||||
| LD a7, 1 * SIZE(X) | |||||
| fmov a1, t1 | |||||
| SXADDQ INCX, X, X | |||||
| ADD s2, t2, s2 | |||||
| fmov a2, t2 | |||||
| ADD s3, t3, s3 | |||||
| fmov a3, t3 | |||||
| ADD s0, t0, s0 | |||||
| fmov a4, t0 | |||||
| ADD s1, t1, s1 | |||||
| fmov a5, t1 | |||||
| ADD s2, t2, s2 | |||||
| fmov a6, t2 | |||||
| ADD s3, t3, s3 | |||||
| fmov a7, t3 | |||||
| ADD s2, t2, s2 | |||||
| ADD s3, t3, s3 | |||||
| .align 4 | |||||
| $L15: | |||||
| ADD s0, s2, s0 | |||||
| and N, 3, I | |||||
| ADD s1, s3, s1 | |||||
| ble I, $L999 | |||||
| .align 4 | |||||
| $L17: | |||||
| ADD s0, t0, s0 | |||||
| LD a0, 0 * SIZE(X) | |||||
| fmov a0, t0 | |||||
| lda I, -1(I) | |||||
| ADD s1, t1, s1 | |||||
| LD a1, 1 * SIZE(X) | |||||
| fmov a1, t1 | |||||
| SXADDQ INCX, X, X | |||||
| bne I, $L17 | |||||
| .align 4 | |||||
| $L999: | |||||
| ADD s0, t0, s0 | |||||
| ADD s1, t1, s1 | |||||
| ADD s0, s1, s0 | |||||
| ret | |||||
| EPILOGUE | |||||
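Reading the complex kernel above: `INCX` is doubled up front, each step loads the values at offsets `0 * SIZE` and `1 * SIZE`, and the two running totals are combined by `ADD s0, s1, s0` before returning. That suggests the result is a single real number, the sum of all real parts plus the sum of all imaginary parts. A minimal sketch under that reading follows. The name `ref_zsum` is hypothetical, and the interleaved (re, im) layout is assumed.

```c
/* Hypothetical reference for the complex case:
 * x holds interleaved (re, im) pairs and incx counts complex elements. */
static double ref_zsum(long n, const double *x, long incx)
{
    double re = 0.0, im = 0.0;
    long i;

    if (n <= 0) return 0.0;

    for (i = 0; i < n; i++) {
        const double *p = x + 2 * i * incx;  /* doubled stride, as in the kernel */
        re += p[0];
        im += p[1];
    }
    return re + im;  /* both partial sums folded into one scalar */
}
```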
| @@ -35,6 +35,11 @@ DASUMKERNEL = ../arm/asum.c | |||||
| CASUMKERNEL = ../arm/zasum.c | CASUMKERNEL = ../arm/zasum.c | ||||
| ZASUMKERNEL = ../arm/zasum.c | ZASUMKERNEL = ../arm/zasum.c | ||||
| SSUMKERNEL = ../arm/sum.c | |||||
| DSUMKERNEL = ../arm/sum.c | |||||
| CSUMKERNEL = ../arm/zsum.c | |||||
| ZSUMKERNEL = ../arm/zsum.c | |||||
| SAXPYKERNEL = ../arm/axpy.c | SAXPYKERNEL = ../arm/axpy.c | ||||
| DAXPYKERNEL = ../arm/axpy.c | DAXPYKERNEL = ../arm/axpy.c | ||||
| CAXPYKERNEL = ../arm/zaxpy.c | CAXPYKERNEL = ../arm/zaxpy.c | ||||
| @@ -1,30 +1,30 @@ | |||||
| include $(KERNELDIR)/KERNEL.ARMV5 | include $(KERNELDIR)/KERNEL.ARMV5 | ||||
| SAMAXKERNEL = iamax_vfp.S | |||||
| DAMAXKERNEL = iamax_vfp.S | |||||
| CAMAXKERNEL = iamax_vfp.S | |||||
| ZAMAXKERNEL = iamax_vfp.S | |||||
| SAMAXKERNEL = amax_vfp.S | |||||
| DAMAXKERNEL = amax_vfp.S | |||||
| #CAMAXKERNEL = amax_vfp.S | |||||
| #ZAMAXKERNEL = amax_vfp.S | |||||
| SAMINKERNEL = iamax_vfp.S | |||||
| DAMINKERNEL = iamax_vfp.S | |||||
| CAMINKERNEL = iamax_vfp.S | |||||
| ZAMINKERNEL = iamax_vfp.S | |||||
| SAMINKERNEL = amax_vfp.S | |||||
| DAMINKERNEL = amax_vfp.S | |||||
| #CAMINKERNEL = amax_vfp.S | |||||
| #ZAMINKERNEL = amax_vfp.S | |||||
| SMAXKERNEL = iamax_vfp.S | |||||
| DMAXKERNEL = iamax_vfp.S | |||||
| SMAXKERNEL = amax_vfp.S | |||||
| DMAXKERNEL = amax_vfp.S | |||||
| SMINKERNEL = iamax_vfp.S | |||||
| DMINKERNEL = iamax_vfp.S | |||||
| SMINKERNEL = amax_vfp.S | |||||
| DMINKERNEL = amax_vfp.S | |||||
| ISAMAXKERNEL = iamax_vfp.S | ISAMAXKERNEL = iamax_vfp.S | ||||
| IDAMAXKERNEL = iamax_vfp.S | IDAMAXKERNEL = iamax_vfp.S | ||||
| ICAMAXKERNEL = iamax_vfp.S | |||||
| IZAMAXKERNEL = iamax_vfp.S | |||||
| #ICAMAXKERNEL = iamax_vfp.S | |||||
| #IZAMAXKERNEL = iamax_vfp.S | |||||
| ISAMINKERNEL = iamax_vfp.S | ISAMINKERNEL = iamax_vfp.S | ||||
| IDAMINKERNEL = iamax_vfp.S | IDAMINKERNEL = iamax_vfp.S | ||||
| ICAMINKERNEL = iamax_vfp.S | |||||
| IZAMINKERNEL = iamax_vfp.S | |||||
| #ICAMINKERNEL = iamax_vfp.S | |||||
| #IZAMINKERNEL = iamax_vfp.S | |||||
| ISMAXKERNEL = iamax_vfp.S | ISMAXKERNEL = iamax_vfp.S | ||||
| IDMAXKERNEL = iamax_vfp.S | IDMAXKERNEL = iamax_vfp.S | ||||
| @@ -37,6 +37,9 @@ DASUMKERNEL = asum_vfp.S | |||||
| CASUMKERNEL = asum_vfp.S | CASUMKERNEL = asum_vfp.S | ||||
| ZASUMKERNEL = asum_vfp.S | ZASUMKERNEL = asum_vfp.S | ||||
| SSUMKERNEL = sum_vfp.S | |||||
| DSUMKERNEL = sum_vfp.S | |||||
| SAXPYKERNEL = axpy_vfp.S | SAXPYKERNEL = axpy_vfp.S | ||||
| DAXPYKERNEL = axpy_vfp.S | DAXPYKERNEL = axpy_vfp.S | ||||
| CAXPYKERNEL = axpy_vfp.S | CAXPYKERNEL = axpy_vfp.S | ||||
| @@ -0,0 +1,445 @@ | |||||
| /*************************************************************************** | |||||
| Copyright (c) 2013, The OpenBLAS Project | |||||
| All rights reserved. | |||||
| Redistribution and use in source and binary forms, with or without | |||||
| modification, are permitted provided that the following conditions are | |||||
| met: | |||||
| 1. Redistributions of source code must retain the above copyright | |||||
| notice, this list of conditions and the following disclaimer. | |||||
| 2. Redistributions in binary form must reproduce the above copyright | |||||
| notice, this list of conditions and the following disclaimer in | |||||
| the documentation and/or other materials provided with the | |||||
| distribution. | |||||
| 3. Neither the name of the OpenBLAS project nor the names of | |||||
| its contributors may be used to endorse or promote products | |||||
| derived from this software without specific prior written permission. | |||||
| THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" | |||||
| AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE | |||||
| IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE | |||||
| ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE | |||||
| LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL | |||||
| DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR | |||||
| SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER | |||||
| CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, | |||||
| OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE | |||||
| USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |||||
| *****************************************************************************/ | |||||
| /************************************************************************************** | |||||
| * 2013/11/14 Saar | |||||
| * BLASTEST : OK | |||||
| * CTEST : OK | |||||
| * TEST : OK | |||||
| * | |||||
| **************************************************************************************/ | |||||
| #define ASSEMBLER | |||||
| #include "common.h" | |||||
| #define STACKSIZE 256 | |||||
| #define N r0 | |||||
| #define X r1 | |||||
| #define INC_X r2 | |||||
| #define I r12 | |||||
| #define X_PRE 512 | |||||
| /************************************************************************************** | |||||
| * Macro definitions | |||||
| **************************************************************************************/ | |||||
| #if defined(USE_ABS) | |||||
| #if defined(DOUBLE) | |||||
| #define VABS(x0,x1) vabs.f64 x0, x1 | |||||
| #else | |||||
| #define VABS(x0,x1) vabs.f32 x0, x1 | |||||
| #endif | |||||
| #else | |||||
| #define VABS(x0,x1) nop | |||||
| #endif | |||||
| /*****************************************************************************************/ | |||||
| #if defined(USE_MIN) | |||||
| #define MOVCOND movlt | |||||
| #if defined(DOUBLE) | |||||
| #define VMOVCOND vmovlt.f64 | |||||
| #else | |||||
| #define VMOVCOND vmovlt.f32 | |||||
| #endif | |||||
| #else | |||||
| #define MOVCOND movgt | |||||
| #if defined(DOUBLE) | |||||
| #define VMOVCOND vmovgt.f64 | |||||
| #else | |||||
| #define VMOVCOND vmovgt.f32 | |||||
| #endif | |||||
| #endif | |||||
| /*****************************************************************************************/ | |||||
| #if !defined(COMPLEX) | |||||
| #if defined(DOUBLE) | |||||
| .macro INIT_F | |||||
| vldmia.f64 X!, { d0 } | |||||
| VABS( d0, d0 ) | |||||
| .endm | |||||
| .macro KERNEL_F1 | |||||
| vldmia.f64 X!, { d4 } | |||||
| VABS( d4, d4 ) | |||||
| vcmpe.f64 d4, d0 | |||||
| vmrs APSR_nzcv, fpscr | |||||
| VMOVCOND d0, d4 | |||||
| .endm | |||||
| .macro INIT_S | |||||
| vldmia.f64 X, { d0 } | |||||
| VABS( d0, d0 ) | |||||
| add X, X, INC_X | |||||
| .endm | |||||
| .macro KERNEL_S1 | |||||
| vldmia.f64 X, { d4 } | |||||
| VABS( d4, d4 ) | |||||
| vcmpe.f64 d4, d0 | |||||
| vmrs APSR_nzcv, fpscr | |||||
| VMOVCOND d0, d4 | |||||
| add X, X, INC_X | |||||
| .endm | |||||
| #else | |||||
| .macro INIT_F | |||||
| vldmia.f32 X!, { s0 } | |||||
| VABS( s0, s0 ) | |||||
| .endm | |||||
| .macro KERNEL_F1 | |||||
| vldmia.f32 X!, { s4 } | |||||
| VABS( s4, s4 ) | |||||
| vcmpe.f32 s4, s0 | |||||
| vmrs APSR_nzcv, fpscr | |||||
| VMOVCOND s0, s4 | |||||
| .endm | |||||
| .macro INIT_S | |||||
| vldmia.f32 X, { s0 } | |||||
| VABS( s0, s0 ) | |||||
| add X, X, INC_X | |||||
| .endm | |||||
| .macro KERNEL_S1 | |||||
| vldmia.f32 X, { s4 } | |||||
| VABS( s4, s4 ) | |||||
| vcmpe.f32 s4, s0 | |||||
| vmrs APSR_nzcv, fpscr | |||||
| VMOVCOND s0, s4 | |||||
| add X, X, INC_X | |||||
| .endm | |||||
| #endif | |||||
| #else | |||||
| #if defined(DOUBLE) | |||||
| .macro INIT_F | |||||
| vldmia.f64 X!, { d0 -d1 } | |||||
| vabs.f64 d0, d0 | |||||
| vabs.f64 d1, d1 | |||||
| vadd.f64 d0 , d0, d1 | |||||
| .endm | |||||
| .macro KERNEL_F1 | |||||
| vldmia.f64 X!, { d4 - d5 } | |||||
| vabs.f64 d4, d4 | |||||
| vabs.f64 d5, d5 | |||||
| vadd.f64 d4 , d4, d5 | |||||
| vcmpe.f64 d4, d0 | |||||
| vmrs APSR_nzcv, fpscr | |||||
| VMOVCOND d0, d4 | |||||
| .endm | |||||
| .macro INIT_S | |||||
| vldmia.f64 X, { d0 -d1 } | |||||
| vabs.f64 d0, d0 | |||||
| vabs.f64 d1, d1 | |||||
| vadd.f64 d0 , d0, d1 | |||||
| add X, X, INC_X | |||||
| .endm | |||||
| .macro KERNEL_S1 | |||||
| vldmia.f64 X, { d4 - d5 } | |||||
| vabs.f64 d4, d4 | |||||
| vabs.f64 d5, d5 | |||||
| vadd.f64 d4 , d4, d5 | |||||
| vcmpe.f64 d4, d0 | |||||
| vmrs APSR_nzcv, fpscr | |||||
| VMOVCOND d0, d4 | |||||
| add X, X, INC_X | |||||
| .endm | |||||
| #else | |||||
| .macro INIT_F | |||||
| vldmia.f32 X!, { s0 -s1 } | |||||
| vabs.f32 s0, s0 | |||||
| vabs.f32 s1, s1 | |||||
| vadd.f32 s0 , s0, s1 | |||||
| .endm | |||||
| .macro KERNEL_F1 | |||||
| vldmia.f32 X!, { s4 - s5 } | |||||
| vabs.f32 s4, s4 | |||||
| vabs.f32 s5, s5 | |||||
| vadd.f32 s4 , s4, s5 | |||||
| vcmpe.f32 s4, s0 | |||||
| vmrs APSR_nzcv, fpscr | |||||
| VMOVCOND s0, s4 | |||||
| .endm | |||||
| .macro INIT_S | |||||
| vldmia.f32 X, { s0 -s1 } | |||||
| vabs.f32 s0, s0 | |||||
| vabs.f32 s1, s1 | |||||
| vadd.f32 s0 , s0, s1 | |||||
| add X, X, INC_X | |||||
| .endm | |||||
| .macro KERNEL_S1 | |||||
| vldmia.f32 X, { s4 - s5 } | |||||
| vabs.f32 s4, s4 | |||||
| vabs.f32 s5, s5 | |||||
| vadd.f32 s4 , s4, s5 | |||||
| vcmpe.f32 s4, s0 | |||||
| vmrs APSR_nzcv, fpscr | |||||
| VMOVCOND s0, s4 | |||||
| add X, X, INC_X | |||||
| .endm | |||||
| #endif | |||||
| #endif | |||||
| /************************************************************************************** | |||||
| * End of macro definitions | |||||
| **************************************************************************************/ | |||||
| PROLOGUE | |||||
| .align 5 | |||||
| movs r12, #0 // clear floating point register | |||||
| vmov s0, r12 | |||||
| #if defined(DOUBLE) | |||||
| vcvt.f64.f32 d0, s0 | |||||
| #endif | |||||
| cmp N, #0 | |||||
| ble amax_kernel_L999 | |||||
| cmp INC_X, #0 | |||||
| beq amax_kernel_L999 | |||||
| cmp INC_X, #1 | |||||
| bne amax_kernel_S_BEGIN | |||||
| amax_kernel_F_BEGIN: | |||||
| INIT_F | |||||
| subs N, N , #1 | |||||
| ble amax_kernel_L999 | |||||
| asrs I, N, #2 // I = N / 4 | |||||
| ble amax_kernel_F1 | |||||
| .align 5 | |||||
| amax_kernel_F4: | |||||
| pld [ X, #X_PRE ] | |||||
| KERNEL_F1 | |||||
| KERNEL_F1 | |||||
| #if defined(COMPLEX) && defined(DOUBLE) | |||||
| pld [ X, #X_PRE ] | |||||
| #endif | |||||
| KERNEL_F1 | |||||
| KERNEL_F1 | |||||
| subs I, I, #1 | |||||
| ble amax_kernel_F1 | |||||
| #if defined(COMPLEX) || defined(DOUBLE) | |||||
| pld [ X, #X_PRE ] | |||||
| #endif | |||||
| KERNEL_F1 | |||||
| KERNEL_F1 | |||||
| #if defined(COMPLEX) && defined(DOUBLE) | |||||
| pld [ X, #X_PRE ] | |||||
| #endif | |||||
| KERNEL_F1 | |||||
| KERNEL_F1 | |||||
| subs I, I, #1 | |||||
| bne amax_kernel_F4 | |||||
| amax_kernel_F1: | |||||
| ands I, N, #3 | |||||
| ble amax_kernel_L999 | |||||
| amax_kernel_F10: | |||||
| KERNEL_F1 | |||||
| subs I, I, #1 | |||||
| bne amax_kernel_F10 | |||||
| b amax_kernel_L999 | |||||
| amax_kernel_S_BEGIN: | |||||
| #if defined(COMPLEX) | |||||
| #if defined(DOUBLE) | |||||
| lsl INC_X, INC_X, #4 // INC_X * SIZE * 2 | |||||
| #else | |||||
| lsl INC_X, INC_X, #3 // INC_X * SIZE * 2 | |||||
| #endif | |||||
| #else | |||||
| #if defined(DOUBLE) | |||||
| lsl INC_X, INC_X, #3 // INC_X * SIZE | |||||
| #else | |||||
| lsl INC_X, INC_X, #2 // INC_X * SIZE | |||||
| #endif | |||||
| #endif | |||||
| INIT_S | |||||
| subs N, N , #1 | |||||
| ble amax_kernel_L999 | |||||
| asrs I, N, #2 // I = N / 4 | |||||
| ble amax_kernel_S1 | |||||
| .align 5 | |||||
| amax_kernel_S4: | |||||
| KERNEL_S1 | |||||
| KERNEL_S1 | |||||
| KERNEL_S1 | |||||
| KERNEL_S1 | |||||
| subs I, I, #1 | |||||
| bne amax_kernel_S4 | |||||
| amax_kernel_S1: | |||||
| ands I, N, #3 | |||||
| ble amax_kernel_L999 | |||||
| amax_kernel_S10: | |||||
| KERNEL_S1 | |||||
| subs I, I, #1 | |||||
| bne amax_kernel_S10 | |||||
| amax_kernel_L999: | |||||
| #if !defined(__ARM_PCS_VFP) | |||||
| #if defined(DOUBLE) | |||||
| vmov r0, r1, d0 | |||||
| #else | |||||
| vmov r0, s0 | |||||
| #endif | |||||
| #endif | |||||
| bx lr | |||||
| EPILOGUE | |||||
| @@ -53,7 +53,7 @@ BLASLONG CNAME(BLASLONG n, FLOAT *x, BLASLONG inc_x) | |||||
| while(i < n) | while(i < n) | ||||
| { | { | ||||
| if( x[ix] > minf ) | |||||
| if( x[ix] < minf ) | |||||
| { | { | ||||
| min = i; | min = i; | ||||
| minf = x[ix]; | minf = x[ix]; | ||||
| @@ -0,0 +1,51 @@ | |||||
| /*************************************************************************** | |||||
| Copyright (c) 2013, The OpenBLAS Project | |||||
| All rights reserved. | |||||
| Redistribution and use in source and binary forms, with or without | |||||
| modification, are permitted provided that the following conditions are | |||||
| met: | |||||
| 1. Redistributions of source code must retain the above copyright | |||||
| notice, this list of conditions and the following disclaimer. | |||||
| 2. Redistributions in binary form must reproduce the above copyright | |||||
| notice, this list of conditions and the following disclaimer in | |||||
| the documentation and/or other materials provided with the | |||||
| distribution. | |||||
| 3. Neither the name of the OpenBLAS project nor the names of | |||||
| its contributors may be used to endorse or promote products | |||||
| derived from this software without specific prior written permission. | |||||
| THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" | |||||
| AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE | |||||
| IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE | |||||
| ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE | |||||
| LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL | |||||
| DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR | |||||
| SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER | |||||
| CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, | |||||
| OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE | |||||
| USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |||||
| *****************************************************************************/ | |||||
| /************************************************************************************** | |||||
| * trivial copy of asum.c with the ABS() removed * | |||||
| **************************************************************************************/ | |||||
| #include "common.h" | |||||
| #include <math.h> | |||||
| FLOAT CNAME(BLASLONG n, FLOAT *x, BLASLONG inc_x) | |||||
| { | |||||
| BLASLONG i=0; | |||||
| FLOAT sumf = 0.0; | |||||
| if (n <= 0 || inc_x <= 0) return(sumf); | |||||
| n *= inc_x; | |||||
| while(i < n) | |||||
| { | |||||
| sumf += x[i]; | |||||
| i += inc_x; | |||||
| } | |||||
| return(sumf); | |||||
| } | |||||
| @@ -0,0 +1,425 @@ | |||||
| /*************************************************************************** | |||||
| Copyright (c) 2013, The OpenBLAS Project | |||||
| All rights reserved. | |||||
| Redistribution and use in source and binary forms, with or without | |||||
| modification, are permitted provided that the following conditions are | |||||
| met: | |||||
| 1. Redistributions of source code must retain the above copyright | |||||
| notice, this list of conditions and the following disclaimer. | |||||
| 2. Redistributions in binary form must reproduce the above copyright | |||||
| notice, this list of conditions and the following disclaimer in | |||||
| the documentation and/or other materials provided with the | |||||
| distribution. | |||||
| 3. Neither the name of the OpenBLAS project nor the names of | |||||
| its contributors may be used to endorse or promote products | |||||
| derived from this software without specific prior written permission. | |||||
| THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" | |||||
| AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE | |||||
| IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE | |||||
| ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE | |||||
| LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL | |||||
| DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR | |||||
| SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER | |||||
| CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, | |||||
| OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE | |||||
| USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |||||
| *****************************************************************************/ | |||||
| /************************************************************************************** | |||||
| * trivial copy of asum_vfp.S with the in-place vabs.f32/vabs.f64 calls removed * | |||||
| **************************************************************************************/ | |||||
| #define ASSEMBLER | |||||
| #include "common.h" | |||||
| #define STACKSIZE 256 | |||||
| #define N r0 | |||||
| #define X r1 | |||||
| #define INC_X r2 | |||||
| #define I r12 | |||||
| #define X_PRE 512 | |||||
| /************************************************************************************** | |||||
| * Macro definitions | |||||
| **************************************************************************************/ | |||||
| #if !defined(COMPLEX) | |||||
| #if defined(DOUBLE) | |||||
| .macro KERNEL_F4 | |||||
| pld [ X, #X_PRE ] | |||||
| vldmia.f64 X!, { d4 - d5 } | |||||
| vadd.f64 d0 , d0, d4 | |||||
| vldmia.f64 X!, { d6 - d7 } | |||||
| vadd.f64 d1 , d1, d5 | |||||
| vadd.f64 d0 , d0, d6 | |||||
| vadd.f64 d1 , d1, d7 | |||||
| .endm | |||||
| .macro KERNEL_F1 | |||||
| vldmia.f64 X!, { d4 } | |||||
| vadd.f64 d0 , d0, d4 | |||||
| .endm | |||||
| .macro KERNEL_S4 | |||||
| vldmia.f64 X, { d4 } | |||||
| vadd.f64 d0 , d0, d4 | |||||
| add X, X, INC_X | |||||
| vldmia.f64 X, { d4 } | |||||
| vadd.f64 d0 , d0, d4 | |||||
| add X, X, INC_X | |||||
| vldmia.f64 X, { d4 } | |||||
| vadd.f64 d0 , d0, d4 | |||||
| add X, X, INC_X | |||||
| vldmia.f64 X, { d4 } | |||||
| vadd.f64 d0 , d0, d4 | |||||
| add X, X, INC_X | |||||
| .endm | |||||
| .macro KERNEL_S1 | |||||
| vldmia.f64 X, { d4 } | |||||
| vadd.f64 d0 , d0, d4 | |||||
| add X, X, INC_X | |||||
| .endm | |||||
| #else | |||||
| .macro KERNEL_F4 | |||||
| vldmia.f32 X!, { s4 - s5 } | |||||
| vadd.f32 s0 , s0, s4 | |||||
| vldmia.f32 X!, { s6 - s7 } | |||||
| vadd.f32 s1 , s1, s5 | |||||
| vadd.f32 s0 , s0, s6 | |||||
| vadd.f32 s1 , s1, s7 | |||||
| .endm | |||||
| .macro KERNEL_F1 | |||||
| vldmia.f32 X!, { s4 } | |||||
| vadd.f32 s0 , s0, s4 | |||||
| .endm | |||||
| .macro KERNEL_S4 | |||||
| vldmia.f32 X, { s4 } | |||||
| vadd.f32 s0 , s0, s4 | |||||
| add X, X, INC_X | |||||
| vldmia.f32 X, { s4 } | |||||
| vadd.f32 s0 , s0, s4 | |||||
| add X, X, INC_X | |||||
| vldmia.f32 X, { s4 } | |||||
| vadd.f32 s0 , s0, s4 | |||||
| add X, X, INC_X | |||||
| vldmia.f32 X, { s4 } | |||||
| vadd.f32 s0 , s0, s4 | |||||
| add X, X, INC_X | |||||
| .endm | |||||
| .macro KERNEL_S1 | |||||
| vldmia.f32 X, { s4 } | |||||
| vadd.f32 s0 , s0, s4 | |||||
| add X, X, INC_X | |||||
| .endm | |||||
| #endif | |||||
| #else | |||||
| #if defined(DOUBLE) | |||||
| .macro KERNEL_F4 | |||||
| pld [ X, #X_PRE ] | |||||
| vldmia.f64 X!, { d4 - d5 } | |||||
| vadd.f64 d0 , d0, d4 | |||||
| vldmia.f64 X!, { d6 - d7 } | |||||
| vadd.f64 d1 , d1, d5 | |||||
| vadd.f64 d0 , d0, d6 | |||||
| vadd.f64 d1 , d1, d7 | |||||
| pld [ X, #X_PRE ] | |||||
| vldmia.f64 X!, { d4 - d5 } | |||||
| vadd.f64 d0 , d0, d4 | |||||
| vldmia.f64 X!, { d6 - d7 } | |||||
| vadd.f64 d1 , d1, d5 | |||||
| vadd.f64 d0 , d0, d6 | |||||
| vadd.f64 d1 , d1, d7 | |||||
| .endm | |||||
| .macro KERNEL_F1 | |||||
| vldmia.f64 X!, { d4 } | |||||
| vadd.f64 d0 , d0, d4 | |||||
| vldmia.f64 X!, { d4 } | |||||
| vadd.f64 d0 , d0, d4 | |||||
| .endm | |||||
| .macro KERNEL_S4 | |||||
| vldmia.f64 X, { d4 -d5 } | |||||
| vadd.f64 d0 , d0, d4 | |||||
| vadd.f64 d0 , d0, d5 | |||||
| add X, X, INC_X | |||||
| vldmia.f64 X, { d4 -d5 } | |||||
| vadd.f64 d0 , d0, d4 | |||||
| vadd.f64 d0 , d0, d5 | |||||
| add X, X, INC_X | |||||
| vldmia.f64 X, { d4 -d5 } | |||||
| vadd.f64 d0 , d0, d4 | |||||
| vadd.f64 d0 , d0, d5 | |||||
| add X, X, INC_X | |||||
| vldmia.f64 X, { d4 -d5 } | |||||
| vadd.f64 d0 , d0, d4 | |||||
| vadd.f64 d0 , d0, d5 | |||||
| add X, X, INC_X | |||||
| .endm | |||||
| .macro KERNEL_S1 | |||||
| vldmia.f64 X, { d4 -d5 } | |||||
| vadd.f64 d0 , d0, d4 | |||||
| vadd.f64 d0 , d0, d5 | |||||
| add X, X, INC_X | |||||
| .endm | |||||
| #else | |||||
| .macro KERNEL_F4 | |||||
| pld [ X, #X_PRE ] | |||||
| vldmia.f32 X!, { s4 - s5 } | |||||
| vadd.f32 s0 , s0, s4 | |||||
| vldmia.f32 X!, { s6 - s7 } | |||||
| vadd.f32 s1 , s1, s5 | |||||
| vadd.f32 s0 , s0, s6 | |||||
| vadd.f32 s1 , s1, s7 | |||||
| vldmia.f32 X!, { s4 - s5 } | |||||
| vadd.f32 s0 , s0, s4 | |||||
| vldmia.f32 X!, { s6 - s7 } | |||||
| vadd.f32 s1 , s1, s5 | |||||
| vadd.f32 s0 , s0, s6 | |||||
| vadd.f32 s1 , s1, s7 | |||||
| .endm | |||||
| .macro KERNEL_F1 | |||||
| vldmia.f32 X!, { s4 } | |||||
| vadd.f32 s0 , s0, s4 | |||||
| vldmia.f32 X!, { s4 } | |||||
| vadd.f32 s0 , s0, s4 | |||||
| .endm | |||||
| .macro KERNEL_S4 | |||||
| vldmia.f32 X, { s4 -s5 } | |||||
| vadd.f32 s0 , s0, s4 | |||||
| vadd.f32 s0 , s0, s5 | |||||
| add X, X, INC_X | |||||
| vldmia.f32 X, { s4 -s5 } | |||||
| vadd.f32 s0 , s0, s4 | |||||
| vadd.f32 s0 , s0, s5 | |||||
| add X, X, INC_X | |||||
| vldmia.f32 X, { s4 -s5 } | |||||
| vadd.f32 s0 , s0, s4 | |||||
| vadd.f32 s0 , s0, s5 | |||||
| add X, X, INC_X | |||||
| vldmia.f32 X, { s4 -s5 } | |||||
| vadd.f32 s0 , s0, s4 | |||||
| vadd.f32 s0 , s0, s5 | |||||
| add X, X, INC_X | |||||
| .endm | |||||
| .macro KERNEL_S1 | |||||
| vldmia.f32 X, { s4 -s5 } | |||||
| vadd.f32 s0 , s0, s4 | |||||
| vadd.f32 s0 , s0, s5 | |||||
| add X, X, INC_X | |||||
| .endm | |||||
| #endif | |||||
| #endif | |||||
| /************************************************************************************** | |||||
| * End of macro definitions | |||||
| **************************************************************************************/ | |||||
| PROLOGUE | |||||
| .align 5 | |||||
| movs r12, #0 // clear floating point register | |||||
| vmov s0, r12 | |||||
| vmov s1, r12 | |||||
| #if defined(DOUBLE) | |||||
| vcvt.f64.f32 d0, s0 | |||||
| vcvt.f64.f32 d1, s1 | |||||
| #endif | |||||
| cmp N, #0 | |||||
| ble asum_kernel_L999 | |||||
| cmp INC_X, #0 | |||||
| beq asum_kernel_L999 | |||||
| cmp INC_X, #1 | |||||
| bne asum_kernel_S_BEGIN | |||||
| asum_kernel_F_BEGIN: | |||||
| asrs I, N, #2 // I = N / 4 | |||||
| ble asum_kernel_F1 | |||||
| .align 5 | |||||
| asum_kernel_F4: | |||||
| #if !defined(DOUBLE) && !defined(COMPLEX) | |||||
| pld [ X, #X_PRE ] | |||||
| #endif | |||||
| KERNEL_F4 | |||||
| subs I, I, #1 | |||||
| ble asum_kernel_F1 | |||||
| KERNEL_F4 | |||||
| subs I, I, #1 | |||||
| bne asum_kernel_F4 | |||||
| asum_kernel_F1: | |||||
| ands I, N, #3 | |||||
| ble asum_kernel_L999 | |||||
| asum_kernel_F10: | |||||
| KERNEL_F1 | |||||
| subs I, I, #1 | |||||
| bne asum_kernel_F10 | |||||
| b asum_kernel_L999 | |||||
| asum_kernel_S_BEGIN: | |||||
| #if defined(COMPLEX) | |||||
| #if defined(DOUBLE) | |||||
| lsl INC_X, INC_X, #4 // INC_X * SIZE * 2 | |||||
| #else | |||||
| lsl INC_X, INC_X, #3 // INC_X * SIZE * 2 | |||||
| #endif | |||||
| #else | |||||
| #if defined(DOUBLE) | |||||
| lsl INC_X, INC_X, #3 // INC_X * SIZE | |||||
| #else | |||||
| lsl INC_X, INC_X, #2 // INC_X * SIZE | |||||
| #endif | |||||
| #endif | |||||
| asrs I, N, #2 // I = N / 4 | |||||
| ble asum_kernel_S1 | |||||
| .align 5 | |||||
| asum_kernel_S4: | |||||
| KERNEL_S4 | |||||
| subs I, I, #1 | |||||
| bne asum_kernel_S4 | |||||
| asum_kernel_S1: | |||||
| ands I, N, #3 | |||||
| ble asum_kernel_L999 | |||||
| asum_kernel_S10: | |||||
| KERNEL_S1 | |||||
| subs I, I, #1 | |||||
| bne asum_kernel_S10 | |||||
| asum_kernel_L999: | |||||
| #if defined(DOUBLE) | |||||
| vadd.f64 d0 , d0, d1 // set return value | |||||
| #else | |||||
| vadd.f32 s0 , s0, s1 // set return value | |||||
| #endif | |||||
| #if !defined(__ARM_PCS_VFP) | |||||
| #if !defined(DOUBLE) | |||||
| vmov r0, s0 | |||||
| #else | |||||
| vmov r0, r1, d0 | |||||
| #endif | |||||
| #endif | |||||
| bx lr | |||||
| EPILOGUE | |||||
| @@ -0,0 +1,57 @@ | |||||
| /*************************************************************************** | |||||
| Copyright (c) 2013, The OpenBLAS Project | |||||
| All rights reserved. | |||||
| Redistribution and use in source and binary forms, with or without | |||||
| modification, are permitted provided that the following conditions are | |||||
| met: | |||||
| 1. Redistributions of source code must retain the above copyright | |||||
| notice, this list of conditions and the following disclaimer. | |||||
| 2. Redistributions in binary form must reproduce the above copyright | |||||
| notice, this list of conditions and the following disclaimer in | |||||
| the documentation and/or other materials provided with the | |||||
| distribution. | |||||
| 3. Neither the name of the OpenBLAS project nor the names of | |||||
| its contributors may be used to endorse or promote products | |||||
| derived from this software without specific prior written permission. | |||||
| THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" | |||||
| AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE | |||||
| IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE | |||||
| ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE | |||||
| LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL | |||||
| DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR | |||||
| SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER | |||||
| CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, | |||||
| OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE | |||||
| USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |||||
| *****************************************************************************/ | |||||
| /************************************************************************************** | |||||
| * trivial copy of zasum.c with the ABS() removed * | |||||
| **************************************************************************************/ | |||||
| #include "common.h" | |||||
| #include <math.h> | |||||
| #define CSUM1(x,i) (x[i]+x[i+1]) | |||||
| FLOAT CNAME(BLASLONG n, FLOAT *x, BLASLONG inc_x) | |||||
| { | |||||
| BLASLONG i=0; | |||||
| FLOAT sumf = 0.0; | |||||
| BLASLONG inc_x2; | |||||
| if (n <= 0 || inc_x <= 0) return(sumf); | |||||
| inc_x2 = 2 * inc_x; | |||||
| n *= inc_x2; | |||||
| while(i < n) | |||||
| { | |||||
| sumf += CSUM1(x,i); | |||||
| i += inc_x2; | |||||
| } | |||||
| return(sumf); | |||||
| } | |||||
| @@ -0,0 +1,175 @@ | |||||
| SAMINKERNEL = ../arm/amin.c | |||||
| DAMINKERNEL = ../arm/amin.c | |||||
| CAMINKERNEL = ../arm/zamin.c | |||||
| ZAMINKERNEL = ../arm/zamin.c | |||||
| SMAXKERNEL = ../arm/max.c | |||||
| DMAXKERNEL = ../arm/max.c | |||||
| SMINKERNEL = ../arm/min.c | |||||
| DMINKERNEL = ../arm/min.c | |||||
| ISAMINKERNEL = ../arm/iamin.c | |||||
| IDAMINKERNEL = ../arm/iamin.c | |||||
| ICAMINKERNEL = ../arm/izamin.c | |||||
| IZAMINKERNEL = ../arm/izamin.c | |||||
| ISMAXKERNEL = ../arm/imax.c | |||||
| IDMAXKERNEL = ../arm/imax.c | |||||
| ISMINKERNEL = ../arm/imin.c | |||||
| IDMINKERNEL = ../arm/imin.c | |||||
| STRMMKERNEL = ../generic/trmmkernel_4x4.c | |||||
| DTRMMKERNEL = ../generic/trmmkernel_2x2.c | |||||
| CTRMMKERNEL = ../generic/ztrmmkernel_2x2.c | |||||
| ZTRMMKERNEL = ../generic/ztrmmkernel_2x2.c | |||||
| STRSMKERNEL_LN = ../generic/trsm_kernel_LN.c | |||||
| STRSMKERNEL_LT = ../generic/trsm_kernel_LT.c | |||||
| STRSMKERNEL_RN = ../generic/trsm_kernel_RN.c | |||||
| STRSMKERNEL_RT = ../generic/trsm_kernel_RT.c | |||||
| DTRSMKERNEL_LN = ../generic/trsm_kernel_LN.c | |||||
| DTRSMKERNEL_LT = ../generic/trsm_kernel_LT.c | |||||
| DTRSMKERNEL_RN = ../generic/trsm_kernel_RN.c | |||||
| DTRSMKERNEL_RT = ../generic/trsm_kernel_RT.c | |||||
| CTRSMKERNEL_LN = ../generic/trsm_kernel_LN.c | |||||
| CTRSMKERNEL_LT = ../generic/trsm_kernel_LT.c | |||||
| CTRSMKERNEL_RN = ../generic/trsm_kernel_RN.c | |||||
| CTRSMKERNEL_RT = ../generic/trsm_kernel_RT.c | |||||
| ZTRSMKERNEL_LN = ../generic/trsm_kernel_LN.c | |||||
| ZTRSMKERNEL_LT = ../generic/trsm_kernel_LT.c | |||||
| ZTRSMKERNEL_RN = ../generic/trsm_kernel_RN.c | |||||
| ZTRSMKERNEL_RT = ../generic/trsm_kernel_RT.c | |||||
| SAMAXKERNEL = amax.S | |||||
| DAMAXKERNEL = amax.S | |||||
| CAMAXKERNEL = zamax.S | |||||
| ZAMAXKERNEL = zamax.S | |||||
| ISAMAXKERNEL = iamax.S | |||||
| IDAMAXKERNEL = iamax.S | |||||
| ICAMAXKERNEL = izamax.S | |||||
| IZAMAXKERNEL = izamax.S | |||||
| SASUMKERNEL = asum.S | |||||
| DASUMKERNEL = asum.S | |||||
| CASUMKERNEL = casum.S | |||||
| ZASUMKERNEL = zasum.S | |||||
| SAXPYKERNEL = axpy.S | |||||
| DAXPYKERNEL = axpy.S | |||||
| CAXPYKERNEL = zaxpy.S | |||||
| ZAXPYKERNEL = zaxpy.S | |||||
| SCOPYKERNEL = copy.S | |||||
| DCOPYKERNEL = copy.S | |||||
| CCOPYKERNEL = copy.S | |||||
| ZCOPYKERNEL = copy.S | |||||
| SDOTKERNEL = dot.S | |||||
| DDOTKERNEL = dot.S | |||||
| CDOTKERNEL = zdot.S | |||||
| ZDOTKERNEL = zdot.S | |||||
| DSDOTKERNEL = dot.S | |||||
| SNRM2KERNEL = nrm2.S | |||||
| DNRM2KERNEL = nrm2.S | |||||
| CNRM2KERNEL = znrm2.S | |||||
| ZNRM2KERNEL = znrm2.S | |||||
| SROTKERNEL = rot.S | |||||
| DROTKERNEL = rot.S | |||||
| CROTKERNEL = zrot.S | |||||
| ZROTKERNEL = zrot.S | |||||
| SSCALKERNEL = scal.S | |||||
| DSCALKERNEL = scal.S | |||||
| CSCALKERNEL = zscal.S | |||||
| ZSCALKERNEL = zscal.S | |||||
| SSWAPKERNEL = swap.S | |||||
| DSWAPKERNEL = swap.S | |||||
| CSWAPKERNEL = swap.S | |||||
| ZSWAPKERNEL = swap.S | |||||
| SGEMVNKERNEL = gemv_n.S | |||||
| DGEMVNKERNEL = gemv_n.S | |||||
| CGEMVNKERNEL = zgemv_n.S | |||||
| ZGEMVNKERNEL = zgemv_n.S | |||||
| SGEMVTKERNEL = gemv_t.S | |||||
| DGEMVTKERNEL = gemv_t.S | |||||
| CGEMVTKERNEL = zgemv_t.S | |||||
| ZGEMVTKERNEL = zgemv_t.S | |||||
| SGEMMKERNEL = sgemm_kernel_$(SGEMM_UNROLL_M)x$(SGEMM_UNROLL_N).S | |||||
| STRMMKERNEL = strmm_kernel_$(SGEMM_UNROLL_M)x$(SGEMM_UNROLL_N).S | |||||
| ifneq ($(SGEMM_UNROLL_M), $(SGEMM_UNROLL_N)) | |||||
| SGEMMINCOPY = ../generic/gemm_ncopy_$(SGEMM_UNROLL_M).c | |||||
| SGEMMITCOPY = ../generic/gemm_tcopy_$(SGEMM_UNROLL_M).c | |||||
| SGEMMINCOPYOBJ = sgemm_incopy$(TSUFFIX).$(SUFFIX) | |||||
| SGEMMITCOPYOBJ = sgemm_itcopy$(TSUFFIX).$(SUFFIX) | |||||
| endif | |||||
| SGEMMONCOPY = ../generic/gemm_ncopy_$(SGEMM_UNROLL_N).c | |||||
| SGEMMOTCOPY = ../generic/gemm_tcopy_$(SGEMM_UNROLL_N).c | |||||
| SGEMMONCOPYOBJ = sgemm_oncopy$(TSUFFIX).$(SUFFIX) | |||||
| SGEMMOTCOPYOBJ = sgemm_otcopy$(TSUFFIX).$(SUFFIX) | |||||
| DGEMMKERNEL = dgemm_kernel_$(DGEMM_UNROLL_M)x$(DGEMM_UNROLL_N).S | |||||
| DTRMMKERNEL = dtrmm_kernel_$(DGEMM_UNROLL_M)x$(DGEMM_UNROLL_N).S | |||||
| ifneq ($(DGEMM_UNROLL_M), $(DGEMM_UNROLL_N)) | |||||
| ifeq ($(DGEMM_UNROLL_M), 8) | |||||
| DGEMMINCOPY = dgemm_ncopy_$(DGEMM_UNROLL_M).S | |||||
| DGEMMITCOPY = dgemm_tcopy_$(DGEMM_UNROLL_M).S | |||||
| else | |||||
| DGEMMINCOPY = ../generic/gemm_ncopy_$(DGEMM_UNROLL_M).c | |||||
| DGEMMITCOPY = ../generic/gemm_tcopy_$(DGEMM_UNROLL_M).c | |||||
| endif | |||||
| DGEMMINCOPYOBJ = dgemm_incopy$(TSUFFIX).$(SUFFIX) | |||||
| DGEMMITCOPYOBJ = dgemm_itcopy$(TSUFFIX).$(SUFFIX) | |||||
| endif | |||||
| ifeq ($(DGEMM_UNROLL_N), 4) | |||||
| DGEMMONCOPY = dgemm_ncopy_$(DGEMM_UNROLL_N).S | |||||
| DGEMMOTCOPY = dgemm_tcopy_$(DGEMM_UNROLL_N).S | |||||
| else | |||||
| DGEMMONCOPY = ../generic/gemm_ncopy_$(DGEMM_UNROLL_N).c | |||||
| DGEMMOTCOPY = ../generic/gemm_tcopy_$(DGEMM_UNROLL_N).c | |||||
| endif | |||||
| DGEMMONCOPYOBJ = dgemm_oncopy$(TSUFFIX).$(SUFFIX) | |||||
| DGEMMOTCOPYOBJ = dgemm_otcopy$(TSUFFIX).$(SUFFIX) | |||||
| CGEMMKERNEL = cgemm_kernel_$(CGEMM_UNROLL_M)x$(CGEMM_UNROLL_N).S | |||||
| CTRMMKERNEL = ctrmm_kernel_$(CGEMM_UNROLL_M)x$(CGEMM_UNROLL_N).S | |||||
| ifneq ($(CGEMM_UNROLL_M), $(CGEMM_UNROLL_N)) | |||||
| CGEMMINCOPY = ../generic/zgemm_ncopy_$(CGEMM_UNROLL_M).c | |||||
| CGEMMITCOPY = ../generic/zgemm_tcopy_$(CGEMM_UNROLL_M).c | |||||
| CGEMMINCOPYOBJ = cgemm_incopy$(TSUFFIX).$(SUFFIX) | |||||
| CGEMMITCOPYOBJ = cgemm_itcopy$(TSUFFIX).$(SUFFIX) | |||||
| endif | |||||
| CGEMMONCOPY = ../generic/zgemm_ncopy_$(CGEMM_UNROLL_N).c | |||||
| CGEMMOTCOPY = ../generic/zgemm_tcopy_$(CGEMM_UNROLL_N).c | |||||
| CGEMMONCOPYOBJ = cgemm_oncopy$(TSUFFIX).$(SUFFIX) | |||||
| CGEMMOTCOPYOBJ = cgemm_otcopy$(TSUFFIX).$(SUFFIX) | |||||
| ZGEMMKERNEL = zgemm_kernel_$(ZGEMM_UNROLL_M)x$(ZGEMM_UNROLL_N).S | |||||
| ZTRMMKERNEL = ztrmm_kernel_$(ZGEMM_UNROLL_M)x$(ZGEMM_UNROLL_N).S | |||||
| ifneq ($(ZGEMM_UNROLL_M), $(ZGEMM_UNROLL_N)) | |||||
| ZGEMMINCOPY = ../generic/zgemm_ncopy_$(ZGEMM_UNROLL_M).c | |||||
| ZGEMMITCOPY = ../generic/zgemm_tcopy_$(ZGEMM_UNROLL_M).c | |||||
| ZGEMMINCOPYOBJ = zgemm_incopy$(TSUFFIX).$(SUFFIX) | |||||
| ZGEMMITCOPYOBJ = zgemm_itcopy$(TSUFFIX).$(SUFFIX) | |||||
| endif | |||||
| ZGEMMONCOPY = ../generic/zgemm_ncopy_$(ZGEMM_UNROLL_N).c | |||||
| ZGEMMOTCOPY = ../generic/zgemm_tcopy_$(ZGEMM_UNROLL_N).c | |||||
| ZGEMMONCOPYOBJ = zgemm_oncopy$(TSUFFIX).$(SUFFIX) | |||||
| ZGEMMOTCOPYOBJ = zgemm_otcopy$(TSUFFIX).$(SUFFIX) | |||||
| @@ -0,0 +1,164 @@ | |||||
| /******************************************************************************* | |||||
| Copyright (c) 2019, The OpenBLAS Project | |||||
| All rights reserved. | |||||
| Redistribution and use in source and binary forms, with or without | |||||
| modification, are permitted provided that the following conditions are | |||||
| met: | |||||
| 1. Redistributions of source code must retain the above copyright | |||||
| notice, this list of conditions and the following disclaimer. | |||||
| 2. Redistributions in binary form must reproduce the above copyright | |||||
| notice, this list of conditions and the following disclaimer in | |||||
| the documentation and/or other materials provided with the | |||||
| distribution. | |||||
| 3. Neither the name of the OpenBLAS project nor the names of | |||||
| its contributors may be used to endorse or promote products | |||||
| derived from this software without specific prior written permission. | |||||
| THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" | |||||
| AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE | |||||
| IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE | |||||
| ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE | |||||
| LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL | |||||
| DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR | |||||
| SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER | |||||
| CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, | |||||
| OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE | |||||
| USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |||||
| *******************************************************************************/ | |||||
| #define ASSEMBLER | |||||
| #include "common.h" | |||||
| #define N x0 /* vector length */ | |||||
| #define X x1 /* X vector address */ | |||||
| #define INC_X x2 /* X stride */ | |||||
| #define I x5 /* loop variable */ | |||||
| /******************************************************************************* | |||||
| * Macro definitions | |||||
| *******************************************************************************/ | |||||
| #define REG0 wzr | |||||
| #define SUMF s0 | |||||
| #define TMPF s1 | |||||
| #define TMPVF {v1.s}[0] | |||||
| #define SZ 4 | |||||
| /******************************************************************************/ | |||||
| .macro KERNEL_F1 | |||||
| ld1 {v1.2s}, [X], #8 | |||||
| ext v2.8b, v1.8b, v1.8b, #4 | |||||
| fadd TMPF, TMPF, s2 | |||||
| fadd SUMF, SUMF, TMPF | |||||
| .endm | |||||
| .macro KERNEL_F8 | |||||
| ld1 {v1.4s, v2.4s, v3.4s, v4.4s}, [X] | |||||
| add X, X, #64 | |||||
| PRFM PLDL1KEEP, [X, #1024] | |||||
| fadd v1.4s, v1.4s, v2.4s | |||||
| fadd v3.4s, v3.4s, v4.4s | |||||
| fadd v0.4s, v0.4s, v1.4s | |||||
| fadd v0.4s, v0.4s, v3.4s | |||||
| .endm | |||||
| .macro KERNEL_F8_FINALIZE | |||||
| ext v1.16b, v0.16b, v0.16b, #8 | |||||
| fadd v0.2s, v0.2s, v1.2s | |||||
| faddp SUMF, v0.2s | |||||
| .endm | |||||
| .macro INIT_S | |||||
| lsl INC_X, INC_X, #3 | |||||
| .endm | |||||
| .macro KERNEL_S1 | |||||
| ld1 {v1.2s}, [X], INC_X | |||||
| ext v2.8b, v1.8b, v1.8b, #4 | |||||
| fadd TMPF, TMPF, s2 | |||||
| fadd SUMF, SUMF, TMPF | |||||
| .endm | |||||
| /******************************************************************************* | |||||
| * End of macro definitions | |||||
| *******************************************************************************/ | |||||
| PROLOGUE | |||||
| fmov SUMF, REG0 | |||||
| fmov s1, SUMF | |||||
| cmp N, xzr | |||||
| ble .Lcsum_kernel_L999 | |||||
| cmp INC_X, xzr | |||||
| ble .Lcsum_kernel_L999 | |||||
| cmp INC_X, #1 | |||||
| bne .Lcsum_kernel_S_BEGIN | |||||
| .Lcsum_kernel_F_BEGIN: | |||||
| asr I, N, #3 | |||||
| cmp I, xzr | |||||
| beq .Lcsum_kernel_F1 | |||||
| .Lcsum_kernel_F8: | |||||
| KERNEL_F8 | |||||
| subs I, I, #1 | |||||
| bne .Lcsum_kernel_F8 | |||||
| KERNEL_F8_FINALIZE | |||||
| .Lcsum_kernel_F1: | |||||
| ands I, N, #7 | |||||
| ble .Lcsum_kernel_L999 | |||||
| .Lcsum_kernel_F10: | |||||
| KERNEL_F1 | |||||
| subs I, I, #1 | |||||
| bne .Lcsum_kernel_F10 | |||||
| .Lcsum_kernel_L999: | |||||
| ret | |||||
| .Lcsum_kernel_S_BEGIN: | |||||
| INIT_S | |||||
| asr I, N, #2 | |||||
| cmp I, xzr | |||||
| ble .Lcsum_kernel_S1 | |||||
| .Lcsum_kernel_S4: | |||||
| KERNEL_S1 | |||||
| KERNEL_S1 | |||||
| KERNEL_S1 | |||||
| KERNEL_S1 | |||||
| subs I, I, #1 | |||||
| bne .Lcsum_kernel_S4 | |||||
| .Lcsum_kernel_S1: | |||||
| ands I, N, #3 | |||||
| ble .Lcsum_kernel_L999 | |||||
| .Lcsum_kernel_S10: | |||||
| KERNEL_S1 | |||||
| subs I, I, #1 | |||||
| bne .Lcsum_kernel_S10 | |||||
| ret | |||||
| EPILOGUE | |||||
| @@ -0,0 +1,186 @@ | |||||
| /******************************************************************************* | |||||
| Copyright (c) 2019, The OpenBLAS Project | |||||
| All rights reserved. | |||||
| Redistribution and use in source and binary forms, with or without | |||||
| modification, are permitted provided that the following conditions are | |||||
| met: | |||||
| 1. Redistributions of source code must retain the above copyright | |||||
| notice, this list of conditions and the following disclaimer. | |||||
| 2. Redistributions in binary form must reproduce the above copyright | |||||
| notice, this list of conditions and the following disclaimer in | |||||
| the documentation and/or other materials provided with the | |||||
| distribution. | |||||
| 3. Neither the name of the OpenBLAS project nor the names of | |||||
| its contributors may be used to endorse or promote products | |||||
| derived from this software without specific prior written permission. | |||||
| THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" | |||||
| AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE | |||||
| IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE | |||||
| ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE | |||||
| LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL | |||||
| DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR | |||||
| SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER | |||||
| CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, | |||||
| OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE | |||||
| USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |||||
| *******************************************************************************/ | |||||
| #define ASSEMBLER | |||||
| #include "common.h" | |||||
| #define N x0 /* vector length */ | |||||
| #define X x1 /* X vector address */ | |||||
| #define INC_X x2 /* X stride */ | |||||
| #define I x5 /* loop variable */ | |||||
| /******************************************************************************* | |||||
| * Macro definitions | |||||
| *******************************************************************************/ | |||||
| #if !defined(DOUBLE) | |||||
| #define REG0 wzr | |||||
| #define SUMF s0 | |||||
| #define TMPF s1 | |||||
| #define TMPVF {v1.s}[0] | |||||
| #define SZ 4 | |||||
| #else | |||||
| #define REG0 xzr | |||||
| #define SUMF d0 | |||||
| #define TMPF d1 | |||||
| #define TMPVF {v1.d}[0] | |||||
| #define SZ 8 | |||||
| #endif | |||||
| /******************************************************************************/ | |||||
| .macro KERNEL_F1 | |||||
| ldr TMPF, [X], #SZ | |||||
| fadd SUMF, SUMF, TMPF | |||||
| .endm | |||||
| .macro KERNEL_F8 | |||||
| #if !defined(DOUBLE) | |||||
| ld1 {v1.4s, v2.4s}, [X], #32 // Load X0..X3 into v1, X4..X7 into v2 | |||||
| fadd v1.4s, v1.4s, v2.4s // v1 = [X3+X7, X2+X6, X1+X5, X0+X4] | |||||
| fadd v0.4s, v0.4s, v1.4s // Accumulate the four partial sums in v0 | |||||
| PRFM PLDL1KEEP, [X, #1024] | |||||
| #else // DOUBLE | |||||
| ld1 {v2.2d, v3.2d, v4.2d, v5.2d}, [X] | |||||
| add X, X, #64 | |||||
| PRFM PLDL1KEEP, [X, #1024] | |||||
| fadd v2.2d, v2.2d, v3.2d | |||||
| fadd v4.2d, v4.2d, v5.2d | |||||
| fadd v0.2d, v0.2d, v2.2d | |||||
| fadd v0.2d, v0.2d, v4.2d | |||||
| #endif | |||||
| .endm | |||||
| .macro KERNEL_F8_FINALIZE | |||||
| #if !defined(DOUBLE) | |||||
| ext v1.16b, v0.16b, v0.16b, #8 | |||||
| fadd v0.2s, v0.2s, v1.2s | |||||
| faddp SUMF, v0.2s | |||||
| #else | |||||
| faddp SUMF, v0.2d | |||||
| #endif | |||||
| .endm | |||||
| .macro INIT_S | |||||
| #if !defined(DOUBLE) | |||||
| lsl INC_X, INC_X, #2 | |||||
| #else | |||||
| lsl INC_X, INC_X, #3 | |||||
| #endif | |||||
| .endm | |||||
| .macro KERNEL_S1 | |||||
| ld1 TMPVF, [X], INC_X | |||||
| fadd SUMF, SUMF, TMPF | |||||
| .endm | |||||
| /******************************************************************************* | |||||
| * End of macro definitions | |||||
| *******************************************************************************/ | |||||
| PROLOGUE | |||||
| fmov SUMF, REG0 | |||||
| #if !defined(DOUBLE) | |||||
| fmov s1, SUMF | |||||
| #else | |||||
| fmov d1, SUMF | |||||
| #endif | |||||
| cmp N, xzr | |||||
| ble .Lsum_kernel_L999 | |||||
| cmp INC_X, xzr | |||||
| ble .Lsum_kernel_L999 | |||||
| cmp INC_X, #1 | |||||
| bne .Lsum_kernel_S_BEGIN | |||||
| .Lsum_kernel_F_BEGIN: | |||||
| asr I, N, #3 | |||||
| cmp I, xzr | |||||
| beq .Lsum_kernel_F1 | |||||
| .Lsum_kernel_F8: | |||||
| KERNEL_F8 | |||||
| subs I, I, #1 | |||||
| bne .Lsum_kernel_F8 | |||||
| KERNEL_F8_FINALIZE | |||||
| .Lsum_kernel_F1: | |||||
| ands I, N, #7 | |||||
| ble .Lsum_kernel_L999 | |||||
| .Lsum_kernel_F10: | |||||
| KERNEL_F1 | |||||
| subs I, I, #1 | |||||
| bne .Lsum_kernel_F10 | |||||
| .Lsum_kernel_L999: | |||||
| ret | |||||
| .Lsum_kernel_S_BEGIN: | |||||
| INIT_S | |||||
| asr I, N, #2 | |||||
| cmp I, xzr | |||||
| ble .Lsum_kernel_S1 | |||||
| .Lsum_kernel_S4: | |||||
| KERNEL_S1 | |||||
| KERNEL_S1 | |||||
| KERNEL_S1 | |||||
| KERNEL_S1 | |||||
| subs I, I, #1 | |||||
| bne .Lsum_kernel_S4 | |||||
| .Lsum_kernel_S1: | |||||
| ands I, N, #3 | |||||
| ble .Lsum_kernel_L999 | |||||
| .Lsum_kernel_S10: | |||||
| KERNEL_S1 | |||||
| subs I, I, #1 | |||||
| bne .Lsum_kernel_S10 | |||||
| ret | |||||
| EPILOGUE | |||||
| @@ -0,0 +1,158 @@ | |||||
| /******************************************************************************* | |||||
| Copyright (c) 2015, The OpenBLAS Project | |||||
| All rights reserved. | |||||
| Redistribution and use in source and binary forms, with or without | |||||
| modification, are permitted provided that the following conditions are | |||||
| met: | |||||
| 1. Redistributions of source code must retain the above copyright | |||||
| notice, this list of conditions and the following disclaimer. | |||||
| 2. Redistributions in binary form must reproduce the above copyright | |||||
| notice, this list of conditions and the following disclaimer in | |||||
| the documentation and/or other materials provided with the | |||||
| distribution. | |||||
| 3. Neither the name of the OpenBLAS project nor the names of | |||||
| its contributors may be used to endorse or promote products | |||||
| derived from this software without specific prior written permission. | |||||
| THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" | |||||
| AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE | |||||
| IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE | |||||
| ARE DISCLAIMED. IN NO EVENT SHALL THE OPENBLAS PROJECT OR CONTRIBUTORS BE | |||||
| LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL | |||||
| DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR | |||||
| SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER | |||||
| CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, | |||||
| OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE | |||||
| USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | |||||
| *******************************************************************************/ | |||||
| #define ASSEMBLER | |||||
| #include "common.h" | |||||
| #define N x0 /* vector length */ | |||||
| #define X x1 /* X vector address */ | |||||
| #define INC_X x2 /* X stride */ | |||||
| #define I x5 /* loop variable */ | |||||
| /******************************************************************************* | |||||
| * Macro definitions | |||||
| *******************************************************************************/ | |||||
| #define REG0 xzr | |||||
| #define SUMF d0 | |||||
| #define TMPF d1 | |||||
| #define TMPVF {v1.d}[0] | |||||
| #define SZ 8 | |||||
| /******************************************************************************/ | |||||
| .macro KERNEL_F1 | |||||
| ld1 {v1.2d}, [X], #16 | |||||
| faddp TMPF, v1.2d | |||||
| fadd SUMF, SUMF, TMPF | |||||
| .endm | |||||
| .macro KERNEL_F4 | |||||
| ld1 {v1.2d, v2.2d, v3.2d, v4.2d}, [X], #64 | |||||
| fadd v1.2d, v1.2d, v2.2d | |||||
| fadd v3.2d, v3.2d, v4.2d | |||||
| fadd v0.2d, v0.2d, v1.2d | |||||
| fadd v0.2d, v0.2d, v3.2d | |||||
| PRFM PLDL1KEEP, [X, #1024] | |||||
| .endm | |||||
| .macro KERNEL_F4_FINALIZE | |||||
| faddp SUMF, v0.2d | |||||
| .endm | |||||
| .macro INIT_S | |||||
| lsl INC_X, INC_X, #4 | |||||
| .endm | |||||
| .macro KERNEL_S1 | |||||
| ld1 {v1.2d}, [X], INC_X | |||||
| faddp TMPF, v1.2d | |||||
| fadd SUMF, SUMF, TMPF | |||||
| .endm | |||||
| /******************************************************************************* | |||||
| * End of macro definitions | |||||
| *******************************************************************************/ | |||||
| PROLOGUE | |||||
| fmov SUMF, REG0 | |||||
| cmp N, xzr | |||||
| ble .Lzsum_kernel_L999 | |||||
| cmp INC_X, xzr | |||||
| ble .Lzsum_kernel_L999 | |||||
| cmp INC_X, #1 | |||||
| bne .Lzsum_kernel_S_BEGIN | |||||
| .Lzsum_kernel_F_BEGIN: | |||||
| asr I, N, #2 | |||||
| cmp I, xzr | |||||
| beq .Lzsum_kernel_F1 | |||||
| .Lzsum_kernel_F4: | |||||
| KERNEL_F4 | |||||
| subs I, I, #1 | |||||
| bne .Lzsum_kernel_F4 | |||||
| KERNEL_F4_FINALIZE | |||||
| .Lzsum_kernel_F1: | |||||
| ands I, N, #3 | |||||
| ble .Lzsum_kernel_L999 | |||||
| .Lzsum_kernel_F10: | |||||
| KERNEL_F1 | |||||
| subs I, I, #1 | |||||
| bne .Lzsum_kernel_F10 | |||||
| .Lzsum_kernel_L999: | |||||
| ret | |||||
| .Lzsum_kernel_S_BEGIN: | |||||
| INIT_S | |||||
| asr I, N, #2 | |||||
| cmp I, xzr | |||||
| ble .Lzsum_kernel_S1 | |||||
| .Lzsum_kernel_S4: | |||||
| KERNEL_S1 | |||||
| KERNEL_S1 | |||||
| KERNEL_S1 | |||||
| KERNEL_S1 | |||||
| subs I, I, #1 | |||||
| bne .Lzsum_kernel_S4 | |||||
| .Lzsum_kernel_S1: | |||||
| ands I, N, #3 | |||||
| ble .Lzsum_kernel_L999 | |||||
| .Lzsum_kernel_S10: | |||||
| KERNEL_S1 | |||||
| subs I, I, #1 | |||||
| bne .Lzsum_kernel_S10 | |||||
| ret | |||||
| EPILOGUE | |||||
| @@ -60,6 +60,10 @@ CASUMKERNEL = asum.S | |||||
| ZASUMKERNEL = asum.S | ZASUMKERNEL = asum.S | ||||
| XASUMKERNEL = asum.S | XASUMKERNEL = asum.S | ||||
| CSUMKERNEL = sum.S | |||||
| ZSUMKERNEL = sum.S | |||||
| XSUMKERNEL = sum.S | |||||
| CNRM2KERNEL = nrm2.S | CNRM2KERNEL = nrm2.S | ||||
| ZNRM2KERNEL = nrm2.S | ZNRM2KERNEL = nrm2.S | ||||
| XNRM2KERNEL = nrm2.S | XNRM2KERNEL = nrm2.S | ||||
| @@ -0,0 +1,358 @@ | |||||
| /*********************************************************************/ | |||||
| /* Copyright 2009, 2010 The University of Texas at Austin. */ | |||||
| /* Copyright 2019, The OpenBLAS project */ | |||||
| /* All rights reserved. */ | |||||
| /* */ | |||||
| /* Redistribution and use in source and binary forms, with or */ | |||||
| /* without modification, are permitted provided that the following */ | |||||
| /* conditions are met: */ | |||||
| /* */ | |||||
| /* 1. Redistributions of source code must retain the above */ | |||||
| /* copyright notice, this list of conditions and the following */ | |||||
| /* disclaimer. */ | |||||
| /* */ | |||||
| /* 2. Redistributions in binary form must reproduce the above */ | |||||
| /* copyright notice, this list of conditions and the following */ | |||||
| /* disclaimer in the documentation and/or other materials */ | |||||
| /* provided with the distribution. */ | |||||
| /* */ | |||||
| /* THIS SOFTWARE IS PROVIDED BY THE UNIVERSITY OF TEXAS AT */ | |||||
| /* AUSTIN ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, */ | |||||
| /* INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF */ | |||||
| /* MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE */ | |||||
| /* DISCLAIMED. IN NO EVENT SHALL THE UNIVERSITY OF TEXAS AT */ | |||||
| /* AUSTIN OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, */ | |||||
| /* INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES */ | |||||
| /* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE */ | |||||
| /* GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR */ | |||||
| /* BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF */ | |||||
| /* LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT */ | |||||
| /* (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT */ | |||||
| /* OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE */ | |||||
| /* POSSIBILITY OF SUCH DAMAGE. */ | |||||
| /* */ | |||||
| /* The views and conclusions contained in the software and */ | |||||
| /* documentation are those of the authors and should not be */ | |||||
| /* interpreted as representing official policies, either expressed */ | |||||
| /* or implied, of The University of Texas at Austin. */ | |||||
| /*********************************************************************/ | |||||
| #define ASSEMBLER | |||||
| #include "common.h" | |||||
| #ifdef XDOUBLE | |||||
| #define PREFETCH_SIZE ( 8 * 16 + 4) | |||||
| #elif defined(DOUBLE) | |||||
| #define PREFETCH_SIZE (16 * 16 + 8) | |||||
| #else | |||||
| #define PREFETCH_SIZE (32 * 16 + 16) | |||||
| #endif | |||||
| #ifndef COMPLEX | |||||
| #define COMPADD 0 | |||||
| #define STRIDE INCX | |||||
| #else | |||||
| #define COMPADD 1 | |||||
| #define STRIDE SIZE | |||||
| #endif | |||||
| #define PRE1 r2 | |||||
| #define I r17 | |||||
| #define J r18 | |||||
| #define INCX16 r21 | |||||
| #define PR r30 | |||||
| #define ARLC r31 | |||||
| #define N r32 | |||||
| #define X r33 | |||||
| #define INCX r34 | |||||
| PROLOGUE | |||||
| .prologue | |||||
| PROFCODE | |||||
| { .mfi | |||||
| adds PRE1 = PREFETCH_SIZE * SIZE, X | |||||
| mov f8 = f0 | |||||
| .save ar.lc, ARLC | |||||
| mov ARLC = ar.lc | |||||
| } | |||||
| ;; | |||||
| .body | |||||
| #ifdef F_INTERFACE | |||||
| { .mmi | |||||
| LDINT N = [N] | |||||
| LDINT INCX = [INCX] | |||||
| nop.i 0 | |||||
| } | |||||
| ;; | |||||
| #ifndef USE64BITINT | |||||
| { .mii | |||||
| nop.m 0 | |||||
| sxt4 N = N | |||||
| sxt4 INCX = INCX | |||||
| } | |||||
| ;; | |||||
| #endif | |||||
| #endif | |||||
| { .mmi | |||||
| cmp.lt p0, p6 = r0, INCX | |||||
| cmp.lt p0, p7 = r0, N | |||||
| shr I = N, (4 - COMPADD) | |||||
| } | |||||
| { .mbb | |||||
| and J = ((1 << (4 - COMPADD)) - 1), N | |||||
| (p6) br.ret.sptk.many b0 | |||||
| (p7) br.ret.sptk.many b0 | |||||
| } | |||||
| ;; | |||||
| { .mfi | |||||
| adds I = -1, I | |||||
| mov f10 = f0 | |||||
| mov PR = pr | |||||
| } | |||||
| { .mfi | |||||
| cmp.eq p9, p0 = r0, J | |||||
| mov f9 = f0 | |||||
| tbit.z p0, p12 = N, 3 - COMPADD | |||||
| } | |||||
| ;; | |||||
| { .mmi | |||||
| cmp.eq p16, p0 = r0, r0 | |||||
| cmp.ne p17, p0 = r0, r0 | |||||
| mov ar.ec= 3 | |||||
| } | |||||
| { .mfi | |||||
| cmp.ne p18, p0 = r0, r0 | |||||
| mov f11 = f0 | |||||
| shl INCX = INCX, BASE_SHIFT + COMPADD | |||||
| } | |||||
| ;; | |||||
| { .mmi | |||||
| #ifdef XDOUBLE | |||||
| shladd INCX16 = INCX, (3 - COMPADD), r0 | |||||
| #else | |||||
| shladd INCX16 = INCX, (4 - COMPADD), r0 | |||||
| #endif | |||||
| cmp.ne p19, p0 = r0, r0 | |||||
| mov ar.lc = I | |||||
| } | |||||
| { .mmb | |||||
| cmp.gt p8 ,p0 = r0, I | |||||
| #ifdef COMPLEX | |||||
| adds INCX = - SIZE, INCX | |||||
| #else | |||||
| nop.m 0 | |||||
| #endif | |||||
| (p8) br.cond.dpnt .L55 | |||||
| } | |||||
| ;; | |||||
| .align 32 | |||||
| .L52: | |||||
| { .mmf | |||||
| (p16) lfetch.nt1 [PRE1], INCX16 | |||||
| (p16) LDFD f32 = [X], STRIDE | |||||
| } | |||||
| { .mfb | |||||
| (p19) FADD f8 = f8, f71 | |||||
| } | |||||
| ;; | |||||
| { .mmf | |||||
| (p16) LDFD f35 = [X], INCX | |||||
| } | |||||
| { .mfb | |||||
| (p19) FADD f9 = f9, f74 | |||||
| } | |||||
| ;; | |||||
| { .mmf | |||||
| (p16) LDFD f38 = [X], STRIDE | |||||
| } | |||||
| { .mfb | |||||
| (p19) FADD f10 = f10, f77 | |||||
| } | |||||
| ;; | |||||
| { .mmf | |||||
| (p16) LDFD f41 = [X], INCX | |||||
| } | |||||
| { .mfb | |||||
| (p19) FADD f11 = f11, f80 | |||||
| } | |||||
| ;; | |||||
| { .mmf | |||||
| (p16) LDFD f44 = [X], STRIDE | |||||
| } | |||||
| { .mfb | |||||
| (p18) FADD f8 = f8, f34 | |||||
| } | |||||
| ;; | |||||
| { .mmf | |||||
| (p16) LDFD f47 = [X], INCX | |||||
| } | |||||
| { .mfb | |||||
| (p18) FADD f9 = f9, f37 | |||||
| } | |||||
| ;; | |||||
| { .mmf | |||||
| (p16) LDFD f50 = [X], STRIDE | |||||
| } | |||||
| { .mfb | |||||
| (p18) FADD f10 = f10, f40 | |||||
| } | |||||
| ;; | |||||
| { .mmf | |||||
| (p16) LDFD f53 = [X], INCX | |||||
| } | |||||
| { .mfb | |||||
| (p18) FADD f11 = f11, f43 | |||||
| } | |||||
| ;; | |||||
| { .mmf | |||||
| #ifdef XDOUBLE | |||||
| (p16) lfetch.nt1 [PRE1], INCX16 | |||||
| #endif | |||||
| (p16) LDFD f56 = [X], STRIDE | |||||
| } | |||||
| { .mfb | |||||
| (p18) FADD f8 = f8, f46 | |||||
| } | |||||
| ;; | |||||
| { .mmf | |||||
| (p16) LDFD f59 = [X], INCX | |||||
| } | |||||
| { .mfb | |||||
| (p18) FADD f9 = f9, f49 | |||||
| } | |||||
| ;; | |||||
| { .mmf | |||||
| (p16) LDFD f62 = [X], STRIDE | |||||
| } | |||||
| { .mfb | |||||
| (p18) FADD f10 = f10, f52 | |||||
| } | |||||
| ;; | |||||
| { .mmf | |||||
| (p16) LDFD f65 = [X], INCX | |||||
| } | |||||
| { .mfb | |||||
| (p18) FADD f11 = f11, f55 | |||||
| } | |||||
| ;; | |||||
| { .mmf | |||||
| (p16) LDFD f68 = [X], STRIDE | |||||
| } | |||||
| { .mfb | |||||
| (p18) FADD f8 = f8, f58 | |||||
| } | |||||
| ;; | |||||
| { .mmf | |||||
| (p16) LDFD f71 = [X], INCX | |||||
| } | |||||
| { .mfb | |||||
| (p18) FADD f9 = f9, f61 | |||||
| } | |||||
| ;; | |||||
| { .mmf | |||||
| (p16) LDFD f74 = [X], STRIDE | |||||
| } | |||||
| { .mfb | |||||
| (p18) FADD f10 = f10, f64 | |||||
| } | |||||
| ;; | |||||
| { .mmf | |||||
| (p16) LDFD f77 = [X], INCX | |||||
| } | |||||
| { .mfb | |||||
| (p18) FADD f11 = f11, f67 | |||||
| br.ctop.sptk.few .L52 | |||||
| } | |||||
| ;; | |||||
| FADD f8 = f8, f71 | |||||
| FADD f9 = f9, f74 | |||||
| FADD f10 = f10, f77 | |||||
| FADD f11 = f11, f80 | |||||
| .align 32 | |||||
| ;; | |||||
| .L55: | |||||
| (p12) LDFD f32 = [X], STRIDE | |||||
| (p9) br.cond.dptk .L998 | |||||
| ;; | |||||
| (p12) LDFD f33 = [X], INCX | |||||
| ;; | |||||
| (p12) LDFD f34 = [X], STRIDE | |||||
| ;; | |||||
| (p12) LDFD f35 = [X], INCX | |||||
| tbit.z p0, p13 = N, (2 - COMPADD) | |||||
| ;; | |||||
| (p12) LDFD f36 = [X], STRIDE | |||||
| tbit.z p0, p14 = N, (1 - COMPADD) | |||||
| ;; | |||||
| (p12) LDFD f37 = [X], INCX | |||||
| #ifndef COMPLEX | |||||
| tbit.z p0, p15 = N, 0 | |||||
| #endif | |||||
| ;; | |||||
| (p12) LDFD f38 = [X], STRIDE | |||||
| ;; | |||||
| (p12) LDFD f39 = [X], INCX | |||||
| ;; | |||||
| (p13) LDFD f40 = [X], STRIDE | |||||
| ;; | |||||
| (p13) LDFD f41 = [X], INCX | |||||
| ;; | |||||
| (p13) LDFD f42 = [X], STRIDE | |||||
| (p12) FADD f8 = f8, f32 | |||||
| ;; | |||||
| (p13) LDFD f43 = [X], INCX | |||||
| (p12) FADD f9 = f9, f33 | |||||
| ;; | |||||
| (p14) LDFD f44 = [X], STRIDE | |||||
| (p12) FADD f10 = f10, f34 | |||||
| ;; | |||||
| (p14) LDFD f45 = [X], INCX | |||||
| (p12) FADD f11 = f11, f35 | |||||
| ;; | |||||
| #ifndef COMPLEX | |||||
| (p15) LDFD f46 = [X] | |||||
| #endif | |||||
| (p12) FADD f8 = f8, f36 | |||||
| ;; | |||||
| (p12) FADD f9 = f9, f37 | |||||
| (p12) FADD f10 = f10, f38 | |||||
| (p12) FADD f11 = f11, f39 | |||||
| ;; | |||||
| (p13) FADD f8 = f8, f40 | |||||
| (p13) FADD f9 = f9, f41 | |||||
| (p13) FADD f10 = f10, f42 | |||||
| ;; | |||||
| (p13) FADD f11 = f11, f43 | |||||
| (p14) FADD f8 = f8, f44 | |||||
| (p14) FADD f9 = f9, f45 | |||||
| #ifndef COMPLEX | |||||
| (p15) FADD f10 = f10, f46 | |||||
| #endif | |||||
| ;; | |||||
| .align 32 | |||||
| .L998: | |||||
| { .mfi | |||||
| FADD f8 = f8, f9 | |||||
| mov ar.lc = ARLC | |||||
| } | |||||
| { .mmf | |||||
| FADD f10 = f10, f11 | |||||
| } | |||||
| ;; | |||||
| { .mii | |||||
| mov pr = PR, -65474 | |||||
| } | |||||
| ;; | |||||
| { .mfb | |||||
| FADD f8 = f8, f10 | |||||
| br.ret.sptk.many b0 | |||||
| } | |||||
| EPILOGUE | |||||
| @@ -30,6 +30,11 @@ IDMAXKERNEL = ../mips/imax.c | |||||
| ISMINKERNEL = ../mips/imin.c | ISMINKERNEL = ../mips/imin.c | ||||
| IDMINKERNEL = ../mips/imin.c | IDMINKERNEL = ../mips/imin.c | ||||
| SSUMKERNEL = ../mips/sum.c | |||||
| DSUMKERNEL = ../mips/sum.c | |||||
| CSUMKERNEL = ../mips/zsum.c | |||||
| ZSUMKERNEL = ../mips/zsum.c | |||||
| ifdef HAVE_MSA | ifdef HAVE_MSA | ||||
| SASUMKERNEL = ../mips/sasum_msa.c | SASUMKERNEL = ../mips/sasum_msa.c | ||||
| DASUMKERNEL = ../mips/dasum_msa.c | DASUMKERNEL = ../mips/dasum_msa.c | ||||
| @@ -45,7 +45,7 @@ BLASLONG CNAME(BLASLONG n, FLOAT *x, BLASLONG inc_x) | |||||
| while(i < n) | while(i < n) | ||||
| { | { | ||||
| - if( x[ix] > minf ) | |||||
| + if( x[ix] < minf ) | |||||
| { | { | ||||
| min = i; | min = i; | ||||
| minf = x[ix]; | minf = x[ix]; | ||||