This re-uses the existing NEOVERSEN2 8x4 `sbgemm` kernel to implement `bgemm`.
This improves performance for `sbgemv_t` by up to 100x on NEOVERSEV1. The geometric mean speedup is ~61x for square problem sizes with M = N in the range [2, 512].
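
For reference, a minimal sketch of how a geometric mean speedup over a sweep of sizes can be computed. The `geometric_mean` helper and the `timings` table are hypothetical and only illustrate the arithmetic; the numbers are placeholders, not measurements from this change.

```python
import math


def geometric_mean(values):
    """Geometric mean computed as exp(mean(log(v))) to avoid overflow on long sweeps."""
    values = list(values)
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs a non-empty list of positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Placeholder timings in seconds (NOT real benchmark data):
# problem size M = N -> (baseline time, time with the new kernel)
timings = {
    2: (4.0, 2.0),
    64: (8.0, 1.0),
    512: (32.0, 1.0),
}

# Per-size speedup is baseline time divided by new time.
speedups = [old / new for old, new in timings.values()]

print(f"max speedup:     {max(speedups):.2f}x")
print(f"geomean speedup: {geometric_mean(speedups):.2f}x")
```

The geometric mean is the usual choice for averaging ratios such as speedups, because a single very large ratio does not dominate it the way it would dominate an arithmetic mean.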