1. Modify the algorithm to resolve multithreading failures
2. No memory allocation in sbgemm kernel
3. Optimize when alpha == 1.0f
Implement ncopy, tcopy, and kernel_8x4
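
For illustration only, a minimal sketch of what an alpha == 1.0f fast path in a tile write-back might look like. The function name `store_tile`, its parameters, and the column-major layout are assumptions made for this example, not the actual kernel code in this change.

```c
#include <stddef.h>

/*
 * Hypothetical write-back of an m x n accumulator tile into C:
 *     C += alpha * acc
 * Both C (leading dimension ldc) and acc (leading dimension m) are
 * assumed to be column-major. When alpha == 1.0f the multiply is
 * skipped, so the inner loop is a plain add.
 */
void store_tile(float *c, size_t ldc, const float *acc,
                size_t m, size_t n, float alpha)
{
    if (alpha == 1.0f) {
        /* Fast path: no scaling needed. */
        for (size_t j = 0; j < n; j++)
            for (size_t i = 0; i < m; i++)
                c[i + j * ldc] += acc[i + j * m];
    } else {
        /* General path: scale each accumulated value by alpha. */
        for (size_t j = 0; j < n; j++)
            for (size_t i = 0; i < m; i++)
                c[i + j * ldc] += alpha * acc[i + j * m];
    }
}
```

The branch is taken once per tile rather than once per element, so the check itself costs little compared with the multiplies it avoids.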