Add an implementation of SGEMM based on the Arm®v9-A architecture
Scalable Matrix Extension (SME) [1], using the Arm C Language Extensions
(ACLE) [2].
Add SME2 compute & packing kernels for SGEMM and enable them under the
ARMV9SME target.
The compute kernel performs outer products on panels of A and B,
accumulating into 2x2 inner blocks of C via the SME two-dimensional
architectural register, ZA.
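
As an illustration of the outer-product formulation described above, here
is a minimal portable C sketch. It is not the SME2 kernel from this commit:
the function name, the packed panel layout and the row-major C are
assumptions made for the example, and the scalar inner loops stand in for
what the real kernel does with outer-product accumulation into ZA tiles.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference kernel (not the code added by this commit).
 * Computes C += A_panel * B_panel as a sum of k rank-1 (outer product)
 * updates.
 *   a_panel: k packed columns of A, each m contiguous floats
 *   b_panel: k packed rows of B, each n contiguous floats
 *   c:       m x n block of C, row-major with leading dimension ldc
 */
static void sgemm_outer_product_ref(size_t m, size_t n, size_t k,
                                    const float *a_panel,
                                    const float *b_panel,
                                    float *c, size_t ldc)
{
    assert(ldc >= n);
    for (size_t p = 0; p < k; ++p) {
        const float *a_col = a_panel + p * m; /* p-th column of A */
        const float *b_row = b_panel + p * n; /* p-th row of B */
        /* One outer product: every a_col[i] times every b_row[j]. */
        for (size_t i = 0; i < m; ++i)
            for (size_t j = 0; j < n; ++j)
                c[i * ldc + j] += a_col[i] * b_row[j];
    }
}
```

In this sketch the whole m x n block plays the role of the accumulator;
the kernel described above instead splits that block across a 2x2
arrangement of ZA tiles, so each step of the p loop updates four tiles.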
The non-transpose packing kernel performs a copy into a contiguous
buffer using SVE loads & stores in Streaming SVE mode. Streaming SVE is
an execution mode introduced by SME that supports execution of SVE code
with the SME defined vector length, known as the Streaming SVE vector
length (SVL).
The transpose packing kernel performs on-the-fly transposition by
utilizing horizontal & vertical tile slice access to the SME ZA
register.
Includes an update to the driver to account for the expanded inner block.
Note: this places the ARMV9SME target in WIP state. It is functional for
SGEMM, and all GEMM tests are passing. Other BLAS3 routines have not
been updated to match the larger kernel size, so SYMM/TRMM tests are
currently expected to fail in this WIP state.
[1] https://developer.arm.com/documentation/109246/0100/SME-Overview/SME-and-SME2
[2] https://arm-software.github.io/acle/main/acle.html
Add a new target, ARMV9SME, for Arm®v9-A architecture systems that
support the Scalable Matrix Extension (SME) [1].
Initially inherits ARMV8SVE settings with updated compiler flags. This
target can only be built with an SME-capable toolchain such as GCC 14 or
LLVM 19.
Includes some initial FEAT_SME2 feature detection on Linux targets via
hwcaps. Target is disabled in DYNAMIC_ARCH builds by default.
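
For illustration, a hwcap-based FEAT_SME2 check could look roughly like the
sketch below. It is not the detection code from this commit: the helper and
macro names are hypothetical, and the bit position is an assumption based
on HWCAP2_SME2 in the Linux arm64 uapi header <asm/hwcap.h>, which should
be checked against the kernel headers actually in use.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to match HWCAP2_SME2 from the Linux arm64 <asm/hwcap.h>;
 * verify against your kernel headers before relying on it. */
#define SKETCH_HWCAP2_SME2 (UINT64_C(1) << 37)

/* Hypothetical helper: returns 1 if the given AT_HWCAP2 value
 * advertises SME2, 0 otherwise. */
static int sketch_has_sme2(uint64_t hwcap2)
{
    return (hwcap2 & SKETCH_HWCAP2_SME2) != 0;
}
```

On a Linux AArch64 system the argument would typically come from
getauxval(AT_HWCAP2), declared in <sys/auxv.h>.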
This is intended as a base target for SME2 kernels.
[1] https://developer.arm.com/documentation/109246/0100/SME-Overview/SME-and-SME2
Name the cores after their microarchitecture instead of using meaningless
strings. The legacy core names are still retained.
1. Rename LOONGSONGENERIC to LA64_GENERIC
2. Rename LOONGSON3R5 to LA464
3. Rename LOONGSON2K1000 to LA264
Implement DYNAMIC_ARCH support for riscv64. Three CPU types are
supported: riscv64_generic, riscv64_zvl256b and riscv64_zvl128b.
The two non-generic kernels require CPU support for RVV 1.0 to
function correctly. Detecting that a riscv64 device supports
RVV 1.0 is a little complicated as there are some boards on the
market that advertise support for V via hwcap but only support
RVV 0.7.1, which is not binary compatible with RVV 1.0. The
approach taken is to first try hwprobe. If hwprobe is not
available, we fall back to hwcap + an additional check to distinguish
between RVV 1.0 and RVV 0.7.1.
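
The decision order described above can be sketched as follows. This is a
simplified model, not the code from this commit: the struct, enum and
function names are hypothetical, and the probe results are passed in as
plain booleans rather than being read from hwprobe or hwcap, so the sketch
stays portable. The nature of the additional RVV 1.0 vs 0.7.1 check is not
specified here and is represented only by its outcome.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum rvv_support { RVV_NONE = 0, RVV_1_0 = 1 };

/* Hypothetical container for the results of the individual probes. */
struct rvv_probe_inputs {
    bool hwprobe_available;    /* the hwprobe query succeeded */
    bool hwprobe_reports_v;    /* hwprobe reports the V extension */
    bool hwcap_reports_v;      /* hwcap advertises V */
    bool extra_check_is_rvv10; /* outcome of the additional check that
                                  separates RVV 1.0 from RVV 0.7.1 */
};

static enum rvv_support detect_rvv(const struct rvv_probe_inputs *in)
{
    assert(in != NULL);

    /* Preferred path: trust hwprobe whenever it is available. */
    if (in->hwprobe_available)
        return in->hwprobe_reports_v ? RVV_1_0 : RVV_NONE;

    /* Fallback: hwcap alone is not enough, because some RVV 0.7.1
     * boards also advertise V, so the extra check must pass too. */
    if (in->hwcap_reports_v && in->extra_check_is_rvv10)
        return RVV_1_0;

    return RVV_NONE;
}
```

In this model a board that advertises V via hwcap but only implements
RVV 0.7.1 ends up as RVV_NONE and would therefore use the generic kernels.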
Tested on a VM with VLEN=256, a CanMV K230 with VLEN=128 (with only
the big core enabled), a Lichee Pi with RVV 0.7.1 and a VF2 with no
vector support.
A compiler with RVV 1.0 support must be used to build OpenBLAS for
riscv64 when DYNAMIC_ARCH=1.
Signed-off-by: Mark Ryan <markdryan@rivosinc.com>