* print benchmark information for every layer, especially for CONVOLUTION
* print benchmark information for every layer, especially for CONVOLUTION, in a cross-platform way
* move the function implementation to a .cpp file to avoid multiple definitions