I partially reverted the changes in #2361 and measured the following speedup with:

    ./xsl3blastst -R gemm -N 2048 2048 1 -a 5 1 1 1 1 1

AMD Ryzen 7 2700X (Zen+): 61400 to 63300 MFlops
AMD EPYC 7742 (Zen 2): 91400 to 94500 MFlops

These numbers are single-threaded performance.
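
For reference, the relative gain implied by these figures can be worked out with a few lines of Python. This is only a sketch of the arithmetic: the function and variable names are hypothetical, and the MFlops values are copied from the measurements above.

```python
def speedup_percent(before_mflops: float, after_mflops: float) -> float:
    """Return the relative improvement, in percent, from before to after."""
    return (after_mflops - before_mflops) / before_mflops * 100.0


# Single-threaded gemm results reported above: (before, after) in MFlops.
results = {
    "AMD Ryzen 7 2700X (Zen+)": (61400, 63300),
    "AMD EPYC 7742 (Zen 2)": (91400, 94500),
}

for cpu, (before, after) in results.items():
    print(f"{cpu}: {speedup_percent(before, after):.2f}% faster")
```

Both CPUs come out at a little over 3%, so the gain is modest but consistent across the two Zen generations tested.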