This re-uses the existing NEOVERSEN2 8x4 `sbgemm` kernel to implement `bgemm`.
1. Modify the algorithm to resolve multithreading failures
2. No memory allocation in the `sbgemm` kernel
3. Optimize when `alpha == 1.0f`
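
To illustrate the `alpha == 1.0f` item, here is a minimal sketch of the general idea, not the actual kernel code. It assumes the optimization amounts to skipping the per-element multiply when writing accumulators back to the output. The helper `store_tile` and its parameters are hypothetical names used only for illustration, and plain `float` stands in for whatever types the real kernel uses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an OpenBLAS function): adds a tile of float
 * accumulators into the output, i.e. c[i] += alpha * acc[i]. */
static void store_tile(float *c, const float *acc, size_t n, float alpha)
{
    assert(c != NULL && acc != NULL);

    if (alpha == 1.0f) {
        /* Fast path: scaling by 1.0f is a no-op, so skip the multiply. */
        for (size_t i = 0; i < n; i++)
            c[i] += acc[i];
    } else {
        /* General path: scale each accumulator before adding. */
        for (size_t i = 0; i < n; i++)
            c[i] += alpha * acc[i];
    }
}
```

Both paths work on caller-provided buffers only, so the sketch also allocates no memory.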