update gpu train scripts

3 years ago · 46cc2d1745
--- a/gpu_new/README.md
+++ b/gpu_new/README.md
@@ -0,0 +1,112 @@
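 # How to Train Models on the OpenI (Qizhi) Platform - GPU Version

 - Single-dataset training on the Qizhi cluster, multi-dataset training on the Qizhi cluster, and single-dataset training on the C2Net cluster are each set up differently, so take care to distinguish them:

  - For single-dataset training on the Qizhi cluster, see the code comments in [train.py](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/src/branch/master/train.py)
  - For single-dataset training on the Qizhi cluster that **loads a model**, see the code comments in [pretrain.py](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/src/branch/master/pretrain.py)
  - For multi-dataset training on the Qizhi cluster, see the code comments in [train_for_multidataset.py](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/src/branch/master/train_for_multidataset.py)
  - For single-dataset training on the C2Net cluster, see the code comments in [train_for_c2net.py](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/src/branch/master/train_for_c2net.py)
  - For single-dataset training on the C2Net cluster that **loads a model**, see the code comments in [pretrain_for_c2net.py](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/src/branch/master/pretrain_for_c2net.py)
 - On the Qizhi cluster, single-dataset and multi-dataset jobs differ in where the dataset is placed:
  With a single dataset, as in this example, the contents of MNISTDataset_torch.zip are placed under /dataset/
  With multiple datasets, the contents of MNISTDataset_torch.zip are placed under /dataset/MNISTDataset_torch/
 - On C2Net, if you need the training results returned after every epoch, you can use the upload tool to send the contents of the /tmp/output folder back to Qizhi for download while training is still running. The call is: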

  ```
  os.system("cd /tmp/script_for_grampus/ &&./uploader_for_gpu " + "/tmp/output/")
  ```
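
The `os.system` call above ignores failures, and it only works inside a C2Net job where the tool exists. The following is a minimal sketch of a more defensive wrapper, not the platform's own API: it swaps `os.system` for `subprocess.run`, and the helper name `upload_output` and its skip-if-missing behavior are assumptions made for illustration. The tool path and the `/tmp/output/` argument are taken from the call above.

```python
import os
import subprocess

# Paths taken from the README call above; they exist only inside a C2Net job.
UPLOADER_DIR = "/tmp/script_for_grampus"
UPLOADER_NAME = "uploader_for_gpu"


def upload_output(output_dir="/tmp/output/", uploader_dir=UPLOADER_DIR):
    """Run the upload tool on output_dir; return True only if it exited with 0.

    Hypothetical helper: the name and the skip-if-missing behavior are
    illustrative choices, not something the platform provides.
    """
    uploader = os.path.join(uploader_dir, UPLOADER_NAME)
    if not os.path.isfile(uploader):
        # Not inside a C2Net job (or the tool has moved): skip instead of failing.
        print("uploader not found, skipping upload: {}".format(uploader))
        return False
    # A list argument avoids shell quoting issues; cwd replaces the `cd ... &&` part.
    result = subprocess.run(["./" + UPLOADER_NAME, output_dir], cwd=uploader_dir)
    if result.returncode != 0:
        print("upload failed with exit code {}".format(result.returncode))
    return result.returncode == 0
```

Calling `upload_output()` at the end of each epoch would then log a failed upload instead of silently continuing, and the same script can still be run outside C2Net for local testing.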

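 ## 1 Overview

 - Using LeNet5-MNIST-PyTorch as an example, this project briefly shows how to run PyTorch training jobs on the OpenI AI collaboration platform, covering single-dataset training, multi-dataset training, and training on C2Net. It is intended as a Qizhi training example for AI developers.
 - Users can create their own training jobs directly from the dataset and code files provided in this project.

 ## 2 Preparation

 - Qizhi platform setup: this project requires you to create a Qizhi platform account, clone the code to your own account, and upload the dataset. For step-by-step instructions, see the beginner training camp (小白训练营) course series in the [OpenI_Learning](https://git.openi.org.cn/zeizei/OpenI_Learning) project.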

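 ### 2.1 Data Preparation

 #### Obtaining the Dataset

 - To try this example you do not need to upload the dataset again: MnistDataset_torch.zip has already been set as a public dataset and can be referenced directly. The dataset can also be downloaded from this project's dataset page to inspect its structure: [MNISTDataset_torch.zip download](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/datasets?type=0), [mnist_epoch1_0.73.pkl.zip download](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/datasets?type=0).
 - Data file description
  - The MNIST dataset consists of 28×28 grayscale images in 10 classes. The training set contains 60,000 images and the test set contains 10,000 images.
  - The directory structure of the dataset archives is as follows: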

    > MNISTDataset_torch.zip
    > ├── test
    > │     └── MNIST
    > │           ├── raw
    > │           │    ├── t10k-images-idx3-ubyte
    > │           │    ├── t10k-labels-idx1-ubyte
    > │           │    ├── train-images-idx3-ubyte
    > │           │    └── train-labels-idx1-ubyte
    > │           └── processed
    > │                ├── test.pt
    > │                └── training.pt
    > └── train
    >       └── MNIST
    >             ├── raw
    >             │    ├── t10k-images-idx3-ubyte
    >             │    ├── t10k-labels-idx1-ubyte
    >             │    ├── train-images-idx3-ubyte
    >             │    └── train-labels-idx1-ubyte
    >             └── processed
    >                  ├── test.pt
    >                  └── training.pt
    >

    > mnist_epoch1_0.73.pkl.zip
    > ├── mnist_epoch1_0.73.pkl
    >
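
Because the unpacked location depends on whether one or several datasets are selected, a script can probe both layouts instead of hard-coding one. The sketch below is an illustration under assumptions: the function `resolve_dataset_root` is hypothetical, and the default folder name `MnistDataset_torch` follows the spelling used in the example scripts' default arguments.

```python
import os


def resolve_dataset_root(base="/dataset", name="MnistDataset_torch"):
    """Return the directory that holds train/ and test/, or None if neither layout matches.

    Hypothetical helper for illustration; base and name are assumptions
    that should be adjusted to the job (for example /tmp/dataset on C2Net).
    """
    candidates = [
        base,                      # single dataset: <base>/train, <base>/test
        os.path.join(base, name),  # multiple datasets: <base>/<name>/train, <base>/<name>/test
    ]
    for root in candidates:
        if all(os.path.isdir(os.path.join(root, split)) for split in ("train", "test")):
            return root
    return None
```

A training script could call this once at startup and use the result to build its `--traindata` and `--testdata` defaults, failing early with a clear message when it returns `None`.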

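 #### Uploading the Dataset

 GPU training runs on GPU hardware, so a dataset you upload must be uploaded through the GPU page of the dataset interface. (This step is not needed for this example; you can select the public dataset MNISTDataset_torch.zip directly.)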

 ### 2.2 Preparing the Scripts

 #### Example Code

 - The example code can be downloaded from this repository: [code download](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU)
 - Code file description
  - [train.py](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/src/branch/master/train.py), the script for single-dataset training. For details, see the comments in [train.py](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/src/branch/master/train.py)
  - [train_for_multidataset.py](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/src/branch/master/train_for_multidataset.py), the script for multi-dataset training. For details, see the comments in [train_for_multidataset.py](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/src/branch/master/train_for_multidataset.py)
  - [train_for_c2net.py](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/src/branch/master/train_for_c2net.py), the script for training on C2Net. For details, see the comments in [train_for_c2net.py](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/src/branch/master/train_for_c2net.py)
  - [model.py](https://git.openi.org.cn/OpenIOSSG/MNIST_PytorchExample_GPU/src/branch/master/model.py), the network used for training. It is shared by single-dataset, multi-dataset, and C2Net training.
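
The network in model.py feeds its flattened convolution output into `nn.Linear(256, 120)`. The short sketch below works through where 256 comes from for 28×28 MNIST input, using the layer sizes in model.py (two 5×5 convolutions without padding, each followed by 2×2 max pooling, ending with 16 channels). The helper names `conv_out` and `lenet_flat_features` are illustrative and not part of the repository.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a square convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1


def lenet_flat_features(image_size=28):
    """Number of features entering fc1 for the layer sizes used in model.py."""
    side = conv_out(image_size, kernel=5)      # conv1: 28 -> 24
    side = conv_out(side, kernel=2, stride=2)  # pool1: 24 -> 12
    side = conv_out(side, kernel=5)            # conv2: 12 -> 8
    side = conv_out(side, kernel=2, stride=2)  # pool2: 8 -> 4
    return 16 * side * side                    # 16 channels after conv2


print(lenet_flat_features())  # 256
```

If the input resolution changes, the first argument of `fc1` has to change with it; for 32×32 input the same arithmetic gives 400.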

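 ## 3 Creating a Training Job

 Once the data and scripts are ready, create a training job to run the PyTorch script. First-time users can follow this example code.

 ### Training Page Example

 Because of A100 compatibility requirements, the A100 needs CUDA 11 or later. The platform already provides a CUDA base image for the A100, so you only need to select the corresponding public image:
 ![avatar](Example_picture/适用A100的基础镜像.png)
 Reference settings for the training page:
 ![avatar](Example_picture/基础镜像.png)

 Table 1. Parameters on the training job creation page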

 | Parameter | Description |
 | --------- | ----------- |
 | Compute resource | Select CPU/GPU |
 | Code branch | Select the branch of the repository code to use; master is the default choice |
 | Image | Select an image that has already been debugged in the debugging environment. For the current version, select a base image: the platform provides CUDA base images for the A100, such as dockerhub.pcl.ac.cn:5000/user-images/openi:cuda111_python37_pytorch191 |
 | Boot file | Select the launch script train.py in the code directory |
 | Dataset | Select the public dataset MnistDataset_torch.zip already uploaded to the Qizhi platform |
 | Run parameters | Adding run parameters passes values to other arguments of the script, such as epoch_size |
 | Resource specification | Select a specification that includes GPUs |
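
Run parameters such as `epoch_size` reach the script as command-line arguments. The example scripts read them with `argparse` and `parse_known_args`, which keeps the script from exiting when it receives an argument it did not declare. The sketch below simulates that with a made-up command line; `--some_platform_flag` is a hypothetical extra argument, not one the platform is known to send.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="PyTorch MNIST Example")
    parser.add_argument("--epoch_size", type=int, default=1, help="number of epochs to train")
    parser.add_argument("--batch_size", type=int, default=256, help="samples per batch")
    return parser


# Simulated command line; "--some_platform_flag" is a hypothetical extra argument.
argv = ["--epoch_size", "5", "--some_platform_flag", "x"]
args, unknown = build_parser().parse_known_args(argv)
print(args.epoch_size, args.batch_size, unknown)
```

Here `epoch_size` takes the value passed in, `batch_size` keeps its default, and the undeclared argument ends up in `unknown` rather than raising an error, which `parse_args` would do.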

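 ## 4 Viewing the Results

 ### 4.1 View the run log on the training job page

 At present, training job logs can only be produced with print statements in the code; see the print calls in the example train.py.

 ### 4.2 Download the model files after training finishes

 ![avatar](Example_picture/结果下载.png)

 ## If you have any questions about the example code, feel free to open an issue in this project.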
--- a/gpu_new/inference.py
+++ b/gpu_new/inference.py
@@ -0,0 +1,76 @@
 #!/usr/bin/python
 #coding=utf-8
 '''
 GPU INFERENCE  INSTANCE

 If there are Chinese comments in the code, please add at the beginning:
 #!/usr/bin/python
 #coding=utf-8
 Because of A100 compatibility requirements, please use the platform's recommended
 CUDA 11 image, then adjust the code and submit the image.
 The image of this example is: dockerhub.pcl.ac.cn:5000/user-images/openi:cuda111_python37_pytorch191
 In the environment, the selected dataset is automatically placed in the /dataset directory.
 If MnistDataset_torch.zip is selected, the test dataset directory is /dataset/test.

 The selected model file is placed in the /model directory.
 Write results to /result; the Qizhi platform provides file downloads from the /result directory.

 !!! Note: the inference environment currently has no internet access, so images from public
 registries cannot be used. The image must first be submitted to the Qizhi platform, and the
 inference dataset must likewise be uploaded to the Qizhi platform first.

 '''


 import argparse
 import os

 import torch
 from torch.utils.data import DataLoader
 from torchvision.datasets import mnist
 from torchvision.transforms import ToTensor

 # Needed both to rebuild the network from a state_dict and to unpickle a whole saved model
 from model import Model


 # Inference settings
 parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
 # Name of the model file under /model
 parser.add_argument('--modelname', required=True, help='model name')


 if __name__ == '__main__':
    args, unknown = parser.parse_known_args()
    print('cuda is available:{}'.format(torch.cuda.is_available()))
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    test_dataset = mnist.MNIST(root='/dataset/test', train=False, transform=ToTensor(),
                               download=False)
    test_loader = DataLoader(test_dataset, batch_size=256)
    # If the file name is fixed, model_path can be hard-coded instead
    model_path = os.path.join('/model', args.modelname)

    # train.py and pretrain.py save a dict holding a 'model' state_dict, while
    # train_for_multidataset.py pickles the whole model. Handle both formats.
    checkpoint = torch.load(model_path, map_location=device)
    if isinstance(checkpoint, dict) and 'model' in checkpoint:
        model = Model()
        model.load_state_dict(checkpoint['model'])
    else:
        model = checkpoint
    model = model.to(device)
    model.eval()

    correct = 0
    total = 0

    with torch.no_grad():
        for test_x, test_label in test_loader:
            predict_y = model(test_x.to(device).float())
            predict_ys = predict_y.argmax(dim=-1).cpu()
            correct += (predict_ys == test_label).sum().item()
            total += test_label.shape[0]
    accuracy = correct / total
    print('accuracy: {:.2f}'.format(accuracy))
    # Write the result to /result
    file_path = os.path.join('/result', 'result.txt')
    with open(file_path, 'w') as file:
        file.write('accuracy: {:.2f}'.format(accuracy))
--- a/gpu_new/model.py
+++ b/gpu_new/model.py
@@ -0,0 +1,35 @@
 from torch.nn import Module
 from torch import nn


 class Model(Module):
    def __init__(self):
        super(Model, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.relu1 = nn.ReLU()
        self.pool1 = nn.MaxPool2d(2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.relu2 = nn.ReLU()
        self.pool2 = nn.MaxPool2d(2)
        self.fc1 = nn.Linear(256, 120)
        self.relu3 = nn.ReLU()
        self.fc2 = nn.Linear(120, 84)
        self.relu4 = nn.ReLU()
        self.fc3 = nn.Linear(84, 10)
        self.relu5 = nn.ReLU()

    def forward(self, x):
        y = self.conv1(x)
        y = self.relu1(y)
        y = self.pool1(y)
        y = self.conv2(y)
        y = self.relu2(y)
        y = self.pool2(y)
        y = y.view(y.shape[0], -1)
        y = self.fc1(y)
        y = self.relu3(y)
        y = self.fc2(y)
        y = self.relu4(y)
        y = self.fc3(y)
        y = self.relu5(y)
        return y
--- a/gpu_new/pretrain.py
+++ b/gpu_new/pretrain.py
@@ -0,0 +1,125 @@
 #!/usr/bin/python
 #coding=utf-8
 '''
 If there are Chinese comments in the code, please add at the beginning:
 #!/usr/bin/python
 #coding=utf-8

 1. The dataset structure of the single dataset in this example
 MnistDataset_torch.zip
  ├── test
  └── train

 2. Because of A100 compatibility requirements, before using the training environment, please use the
 platform's recommended CUDA 11 image, then adjust the code and submit the image.
 The image of this example is: dockerhub.pcl.ac.cn:5000/user-images/openi:cuda111_python37_pytorch191
 In the training environment, the uploaded dataset is automatically placed in the /dataset directory.
 Note: the paths differ between selecting a single dataset and selecting multiple datasets.
 (1) Single dataset: if MnistDataset_torch.zip is selected,
   the dataset directories are /dataset/train and /dataset/test.
   Multiple datasets: if MnistDataset_torch.zip is selected,
   the dataset directories are /dataset/MnistDataset_torch/train and /dataset/MnistDataset_torch/test.

 (2) If a pre-trained model file is selected, its path is passed to the script as the ckpt_url parameter.

 The model download path is /model by default. Please write the model output to /model;
 the Qizhi platform provides file downloads from the /model directory.
 '''


 from model import Model
 import numpy as np
 import torch
 from torchvision.datasets import mnist
 from torch.nn import CrossEntropyLoss
 from torch.optim import SGD
 from torch.utils.data import DataLoader
 from torchvision.transforms import ToTensor
 import argparse
 import os

 # Training settings
 parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
 #The dataset location is placed under /dataset
 parser.add_argument('--traindata', default="/dataset/train" ,help='path to train dataset')
 parser.add_argument('--testdata', default="/dataset/test" ,help='path to test dataset')
 parser.add_argument('--epoch_size', type=int, default=10, help='how much epoch to train')
 parser.add_argument('--batch_size', type=int, default=256, help='how much batch_size in epoch')
 # Path of the pre-trained model file
 parser.add_argument('--ckpt_url', default="", help='pretrain model path')

 # Global settings
 WORKERS = 0   # number of DataLoader worker processes
 device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
 model = Model().to(device)
 optimizer = SGD(model.parameters(), lr=1e-1)
 cost = CrossEntropyLoss()

 # Train for one epoch
 def train(model, train_loader, epoch):
    model.train()
    train_loss = 0.0
    for i, data in enumerate(train_loader, 0):
        x, y = data
        x = x.to(device)
        y = y.to(device)
        optimizer.zero_grad()
        y_hat = model(x)
        loss = cost(y_hat, y)
        loss.backward()
        optimizer.step()
        # .item() takes the plain number so the autograd graph is not kept alive
        train_loss += loss.item()
    loss_mean = train_loss / (i+1)
    print('Train Epoch: {}\t Loss: {:.6f}'.format(epoch, loss_mean))

 # Evaluate on the test set
 def test(model, test_loader, test_data):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for i, data in enumerate(test_loader, 0):
            x, y = data
            x = x.to(device)
            y = y.to(device)
            y_hat = model(x)
            test_loss += cost(y_hat, y).item()
            pred = y_hat.max(1, keepdim=True)[1]
            correct += pred.eq(y.view_as(pred)).sum().item()
        test_loss /= (i+1)
        print('Test set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
            test_loss, correct, len(test_data), 100. * correct / len(test_data)))

 def main():
    # If a saved checkpoint exists, load it and continue training from it
    if os.path.exists(args.ckpt_url):
        checkpoint = torch.load(args.ckpt_url, map_location=device)
        model.load_state_dict(checkpoint['model'])
        optimizer.load_state_dict(checkpoint['optimizer'])
        start_epoch = checkpoint['epoch']
        print('Loaded weights from epoch {}.'.format(start_epoch))
    else:
        start_epoch = 0
        print('No saved model found; training from scratch.')

    # epoch_size is the total number of epochs, so the last epoch index is `epochs`
    for epoch in range(start_epoch+1, epochs+1):
        train(model, train_loader, epoch)
        test(model, test_loader, test_dataset)
        # Save the checkpoint
        state = {'model':model.state_dict(), 'optimizer':optimizer.state_dict(), 'epoch':epoch}
        torch.save(state, '/model/mnist_epoch{}.pkl'.format(epoch))

 if __name__ == '__main__':
    args, unknown = parser.parse_known_args()
    #log output
    print('cuda is available:{}'.format(torch.cuda.is_available()))  
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    batch_size = args.batch_size
    epochs = args.epoch_size
    train_dataset = mnist.MNIST(root=args.traindata, train=True, transform=ToTensor(),download=False)
    test_dataset = mnist.MNIST(root=args.testdata, train=False, transform=ToTensor(),download=False)
    train_loader = DataLoader(train_dataset, batch_size=batch_size)
    test_loader = DataLoader(test_dataset, batch_size=batch_size)
    main()


--- a/gpu_new/pretrain_for_c2net.py
+++ b/gpu_new/pretrain_for_c2net.py
@@ -0,0 +1,141 @@
 #!/usr/bin/python
 #coding=utf-8
 '''
 If there are Chinese comments in the code, please add at the beginning:
 #!/usr/bin/python
 #coding=utf-8

 In the training environment,
 (1) the code is automatically placed in the /tmp/code directory;
 (2) the uploaded dataset is automatically placed in the /tmp/dataset directory.
 Note: the paths differ between selecting a single dataset and selecting multiple datasets.
 If a single dataset is selected, such as MnistDataset_torch.zip,
   the dataset directories are /tmp/dataset/train and /tmp/dataset/test.

 The dataset structure of the single dataset in the training image in this example:
  tmp
    ├──dataset 
         ├── test
         └── train 

 If multiple datasets are selected, such as MnistDataset_torch.zip and checkpoint_epoch1_0.73.zip, 
 the dataset directory is /tmp/dataset/MnistDataset_torch/train, /tmp/dataset/MnistDataset_torch/test
 and /tmp/dataset/checkpoint_epoch1_0.73/mnist_epoch1_0.73.pkl
 The dataset structure in the training image for multiple datasets in this example:
 tmp
  ├──dataset
     ├── MnistDataset_torch
     |     ├── test
     |     └── train 
     └── checkpoint_epoch1_0.73 
           ├── mnist_epoch1_0.73.pkl
 (3) the model download path is /tmp/output by default; please write the model output to /tmp/output,
 and the Qizhi platform provides file downloads from the /tmp/output directory;
 (4) if a pre-trained model file is selected, its path is passed to the script as the ckpt_url parameter.

 In addition, if you want to get the model file after each epoch, you can call the uploader_for_gpu tool
 as follows:
 import os
 os.system("cd /tmp/script_for_grampus/ &&./uploader_for_gpu " + "/tmp/output/")
 '''


 from model import Model
 import numpy as np
 import torch
 from torchvision.datasets import mnist
 from torch.nn import CrossEntropyLoss
 from torch.optim import SGD
 from torch.utils.data import DataLoader
 from torchvision.transforms import ToTensor
 import argparse
 import os

 # Training settings
 parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
 # The dataset is placed under /tmp/dataset
 parser.add_argument('--traindata', default="/tmp/dataset/train" ,help='path to train dataset')
 parser.add_argument('--testdata', default="/tmp/dataset/test" ,help='path to test dataset')
 parser.add_argument('--epoch_size', type=int, default=10, help='how much epoch to train')
 parser.add_argument('--batch_size', type=int, default=256, help='how much batch_size in epoch')
 # Path of the pre-trained model file
 parser.add_argument('--ckpt_url', default="", help='pretrain model path')

 # Global settings
 WORKERS = 0   # number of DataLoader worker processes
 device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
 model = Model().to(device)
 optimizer = SGD(model.parameters(), lr=1e-1)
 cost = CrossEntropyLoss()

 # Train for one epoch
 def train(model, train_loader, epoch):
    model.train()
    train_loss = 0.0
    for i, data in enumerate(train_loader, 0):
        x, y = data
        x = x.to(device)
        y = y.to(device)
        optimizer.zero_grad()
        y_hat = model(x)
        loss = cost(y_hat, y)
        loss.backward()
        optimizer.step()
        # .item() takes the plain number so the autograd graph is not kept alive
        train_loss += loss.item()
    loss_mean = train_loss / (i+1)
    print('Train Epoch: {}\t Loss: {:.6f}'.format(epoch, loss_mean))

 # Evaluate on the test set
 def test(model, test_loader, test_data):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for i, data in enumerate(test_loader, 0):
            x, y = data
            x = x.to(device)
            y = y.to(device)
            y_hat = model(x)
            test_loss += cost(y_hat, y).item()
            pred = y_hat.max(1, keepdim=True)[1]
            correct += pred.eq(y.view_as(pred)).sum().item()
        test_loss /= (i+1)
        print('Test set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
            test_loss, correct, len(test_data), 100. * correct / len(test_data)))

 def main():
    # If a saved checkpoint exists, load it and continue training from it
    if os.path.exists(args.ckpt_url):
        checkpoint = torch.load(args.ckpt_url, map_location=device)
        model.load_state_dict(checkpoint['model'])
        optimizer.load_state_dict(checkpoint['optimizer'])
        start_epoch = checkpoint['epoch']
        print('Loaded weights from epoch {}.'.format(start_epoch))
    else:
        start_epoch = 0
        print('No saved model found; training from scratch.')

    # epoch_size is the total number of epochs, so the last epoch index is `epochs`
    for epoch in range(start_epoch+1, epochs+1):
        train(model, train_loader, epoch)
        test(model, test_loader, test_dataset)
        # Save the checkpoint
        state = {'model':model.state_dict(), 'optimizer':optimizer.state_dict(), 'epoch':epoch}
        torch.save(state, '/tmp/output/mnist_epoch{}.pkl'.format(epoch))
        #After calling uploader_for_gpu, after each epoch training, the result file under /tmp/output will be sent back to Qizhi
        os.system("cd /tmp/script_for_grampus/ &&./uploader_for_gpu " + "/tmp/output/")

 if __name__ == '__main__':
    args, unknown = parser.parse_known_args()
    #log output
    print('cuda is available:{}'.format(torch.cuda.is_available()))  
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    batch_size = args.batch_size
    epochs = args.epoch_size
    train_dataset = mnist.MNIST(root=args.traindata, train=True, transform=ToTensor(),download=False)
    test_dataset = mnist.MNIST(root=args.testdata, train=False, transform=ToTensor(),download=False)
    train_loader = DataLoader(train_dataset, batch_size=batch_size)
    test_loader = DataLoader(test_dataset, batch_size=batch_size)
    main()


        
--- a/gpu_new/train.py
+++ b/gpu_new/train.py
@@ -0,0 +1,87 @@
 #!/usr/bin/python
 #coding=utf-8    
 '''
 If there are Chinese comments in the code, please add at the beginning:
 #!/usr/bin/python
 #coding=utf-8

 Because of A100 compatibility requirements, before using the training environment, please use the
 platform's recommended CUDA 11 image, then adjust the code and submit the image.
 The image of this example is: dockerhub.pcl.ac.cn:5000/user-images/openi:cuda111_python37_pytorch191
 In the training environment, the uploaded dataset is automatically placed in the /dataset directory.
 Single dataset:
 if MnistDataset_torch.zip is selected, the dataset directories are /dataset/train and /dataset/test.
 Multiple datasets:
 if MnistDataset_torch.zip and checkpoint_epoch1_0.73.zip are selected,
 the dataset directories are /dataset/MnistDataset_torch/train and /dataset/MnistDataset_torch/test,
 and the checkpoint is at /dataset/checkpoint_epoch1_0.73/mnist_epoch1_0.73.pkl.

 The model download path is /model by default. Please write the model output to /model;
 the Qizhi platform provides file downloads from the /model directory.
 '''


 from model import Model
 import numpy as np
 import torch
 from torchvision.datasets import mnist
 from torch.nn import CrossEntropyLoss
 from torch.optim import SGD
 from torch.utils.data import DataLoader
 from torchvision.transforms import ToTensor
 import argparse

 # Training settings
 parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
 #The dataset location is placed under /dataset
 parser.add_argument('--traindata', default="/dataset/train" ,help='path to train dataset')
 parser.add_argument('--testdata', default="/dataset/test" ,help='path to test dataset')
 parser.add_argument('--epoch_size', type=int, default=1, help='how much epoch to train')
 parser.add_argument('--batch_size', type=int, default=256, help='how much batch_size in epoch')

 if __name__ == '__main__':
    args, unknown = parser.parse_known_args()
    #log output
    print('cuda is available:{}'.format(torch.cuda.is_available()))  
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    batch_size = args.batch_size
    train_dataset = mnist.MNIST(root=args.traindata, train=True, transform=ToTensor(),download=False)
    test_dataset = mnist.MNIST(root=args.testdata, train=False, transform=ToTensor(),download=False)
    train_loader = DataLoader(train_dataset, batch_size=batch_size)
    test_loader = DataLoader(test_dataset, batch_size=batch_size)
    model = Model().to(device)
    sgd = SGD(model.parameters(), lr=1e-1)
    cost = CrossEntropyLoss()
    epoch = args.epoch_size
    print('epoch_size is:{}'.format(epoch))
    for _epoch in range(epoch):
        print('the {} epoch_size begin'.format(_epoch + 1))
        model.train()
        for idx, (train_x, train_label) in enumerate(train_loader):
            train_x = train_x.to(device)
            train_label = train_label.to(device)
            label_np = np.zeros((train_label.shape[0], 10))
            sgd.zero_grad()
            predict_y = model(train_x.float())
            loss = cost(predict_y, train_label.long())
            if idx % 10 == 0:
                print('idx: {}, loss: {}'.format(idx, loss.sum().item()))
            loss.backward()
            sgd.step()

        correct = 0
        _sum = 0
        model.eval()
        for idx, (test_x, test_label) in enumerate(test_loader):
            test_x = test_x
            test_label = test_label
            predict_y = model(test_x.to(device).float()).detach()
            predict_ys = np.argmax(predict_y.cpu(), axis=-1)
            label_np = test_label.numpy()
            _ = predict_ys == test_label
            correct += np.sum(_.numpy(), axis=-1)
            _sum += _.shape[0]
        print('accuracy: {:.2f}'.format(correct / _sum))
        #The model output location is placed under /model
        # The optimizer in this script is named `sgd`; store the number of the epoch just completed
        state = {'model':model.state_dict(), 'optimizer':sgd.state_dict(), 'epoch':_epoch+1}
        torch.save(state, '/model/mnist_epoch{}_{:.2f}.pkl'.format(_epoch+1, correct / _sum))
--- a/gpu_new/train_for_c2net.py
+++ b/gpu_new/train_for_c2net.py
@@ -0,0 +1,111 @@
 #!/usr/bin/python
 #coding=utf-8
 '''
 If there are Chinese comments in the code, please add at the beginning:
 #!/usr/bin/python
 #coding=utf-8

 In the training environment,
 the code is automatically placed in the /tmp/code directory, and
 the uploaded dataset is automatically placed in the /tmp/dataset directory.

 Note: the paths differ between selecting a single dataset and selecting multiple datasets.
 If a single dataset is selected, such as MnistDataset_torch.zip,
   the dataset directories are /tmp/dataset/train and /tmp/dataset/test.

 The dataset structure of the single dataset in the training image in this example:
  tmp
    ├──dataset 
         ├── test
         └── train 

 If multiple datasets are selected, such as MnistDataset_torch.zip and checkpoint_epoch1_0.73.zip, 
 the dataset directory is /tmp/dataset/MnistDataset_torch/train, /tmp/dataset/MnistDataset_torch/test
 and /tmp/dataset/checkpoint_epoch1_0.73/mnist_epoch1_0.73.pkl
 The dataset structure in the training image for multiple datasets in this example:
 tmp
  ├──dataset
     ├── MnistDataset_torch
     |     ├── test
     |     └── train 
     └── checkpoint_epoch1_0.73 
           ├── mnist_epoch1_0.73.pkl


 The model download path is /tmp/output by default. Please write the model output to /tmp/output;
 the Qizhi platform provides file downloads from the /tmp/output directory.

 In addition, if you want to get the model file after each epoch, you can call the uploader_for_gpu tool
 as follows:
 import os
 os.system("cd /tmp/script_for_grampus/ &&./uploader_for_gpu " + "/tmp/output/")
 '''


 from model import Model
 import numpy as np
 import torch
 from torchvision.datasets import mnist
 from torch.nn import CrossEntropyLoss
 from torch.optim import SGD
 from torch.utils.data import DataLoader
 from torchvision.transforms import ToTensor
 import argparse
 import os

 # Training settings
 parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
 # The dataset is placed under /tmp/dataset
 parser.add_argument('--traindata', default="/tmp/dataset/train" ,help='path to train dataset')
 parser.add_argument('--testdata', default="/tmp/dataset/test" ,help='path to test dataset')
 parser.add_argument('--epoch_size', type=int, default=1, help='how much epoch to train')
 parser.add_argument('--batch_size', type=int, default=256, help='how much batch_size in epoch')

 if __name__ == '__main__':
    args, unknown = parser.parse_known_args()
    #log output
    print('cuda is available:{}'.format(torch.cuda.is_available()))  
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    batch_size = args.batch_size
    train_dataset = mnist.MNIST(root=args.traindata, train=True, transform=ToTensor(),download=False)
    test_dataset = mnist.MNIST(root=args.testdata, train=False, transform=ToTensor(),download=False)
    train_loader = DataLoader(train_dataset, batch_size=batch_size)
    test_loader = DataLoader(test_dataset, batch_size=batch_size)
    model = Model().to(device)
    sgd = SGD(model.parameters(), lr=1e-1)
    cost = CrossEntropyLoss()
    epoch = args.epoch_size
    print('epoch_size is:{}'.format(epoch))
    for _epoch in range(epoch):
        print('the {} epoch_size begin'.format(_epoch + 1))
        model.train()
        for idx, (train_x, train_label) in enumerate(train_loader):
            train_x = train_x.to(device)
            train_label = train_label.to(device)
            label_np = np.zeros((train_label.shape[0], 10))
            sgd.zero_grad()
            predict_y = model(train_x.float())
            loss = cost(predict_y, train_label.long())
            if idx % 10 == 0:
                print('idx: {}, loss: {}'.format(idx, loss.sum().item()))
            loss.backward()
            sgd.step()

        correct = 0
        _sum = 0
        model.eval()
        for idx, (test_x, test_label) in enumerate(test_loader):
            test_x = test_x
            test_label = test_label
            predict_y = model(test_x.to(device).float()).detach()
            predict_ys = np.argmax(predict_y.cpu(), axis=-1)
            label_np = test_label.numpy()
            _ = predict_ys == test_label
            correct += np.sum(_.numpy(), axis=-1)
            _sum += _.shape[0]
        print('accuracy: {:.2f}'.format(correct / _sum))
        # The model output is written to /tmp/output
        # The optimizer in this script is named `sgd`; store the number of the epoch just completed
        state = {'model':model.state_dict(), 'optimizer':sgd.state_dict(), 'epoch':_epoch+1}
        torch.save(state, '/tmp/output/mnist_epoch{}_{:.2f}.pkl'.format(_epoch+1, correct / _sum))
        #After calling uploader_for_gpu, after each epoch training, the result file under /tmp/output will be sent back to Qizhi
        os.system("cd /tmp/script_for_grampus/ &&./uploader_for_gpu " + "/tmp/output/")
--- a/gpu_new/train_for_multidataset.py
+++ b/gpu_new/train_for_multidataset.py
@@ -0,0 +1,113 @@
 #!/usr/bin/python
 #coding=utf-8
 '''
 If there are Chinese comments in the code, please add at the beginning:
 #!/usr/bin/python
 #coding=utf-8

 1. The dataset structure of the multiple datasets in this example
 MnistDataset_torch.zip
  ├── test
  └── train

 checkpoint_epoch1_0.73.zip
  ├── mnist_epoch1_0.73.pkl

 2. Because of A100 compatibility requirements, before using the training environment, please use the
 platform's recommended CUDA 11 image, then adjust the code and submit the image.
 The image of this example is: dockerhub.pcl.ac.cn:5000/user-images/openi:cuda111_python37_pytorch191
 In the training environment, the uploaded dataset is automatically placed in the /dataset directory.
 Note: the paths differ between selecting a single dataset and selecting multiple datasets.
 (1) Single dataset: if MnistDataset_torch.zip is selected,
   the dataset directories are /dataset/train and /dataset/test.

 The dataset structure of the single dataset in the training image in this example:
  dataset
   ├── test
   └── train 
 (2)If multiple datasets are selected, such as MnistDataset_torch.zip and checkpoint_epoch1_0.73.zip, 
 the dataset directory is /dataset/MnistDataset_torch/train, /dataset/MnistDataset_torch/test
 and /dataset/checkpoint_epoch1_0.73/mnist_epoch1_0.73.pkl

 The dataset structure in the training image for multiple datasets in this example:
  dataset
   ├── MnistDataset_torch
   |     ├── test
   |     └── train 
   └── checkpoint_epoch1_0.73 
         ├── mnist_epoch1_0.73.pkl


 The model download path is under /model by default. Please specify the model output location to /model, 
 and the Qizhi platform will provide file downloads under the /model directory.
 '''


 from model import Model
 import numpy as np
 import torch
 from torchvision.datasets import mnist
 from torch.nn import CrossEntropyLoss
 from torch.optim import SGD
 from torch.utils.data import DataLoader
 from torchvision.transforms import ToTensor
 import argparse

 # Training settings
 parser = argparse.ArgumentParser(description='PyTorch MNIST Example')
 #The dataset location is placed under /dataset
 parser.add_argument('--traindata', default="/dataset/MnistDataset_torch/train" ,help='path to train dataset')
 parser.add_argument('--testdata', default="/dataset/MnistDataset_torch/test" ,help='path to test dataset')
 parser.add_argument('--checkpoint', default="/dataset/checkpoint_epoch1_0.73/mnist_epoch1_0.73.pkl" ,help='checkpoint file')
 parser.add_argument('--epoch_size', type=int, default=1, help='how much epoch to train')
 parser.add_argument('--batch_size', type=int, default=256, help='how much batch_size in epoch')

 if __name__ == '__main__':
    args, unknown = parser.parse_known_args()
    #log output
    print('cuda is available:{}'.format(torch.cuda.is_available()))  
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
    batch_size = args.batch_size
    train_dataset = mnist.MNIST(root=args.traindata, train=True, transform=ToTensor(),download=False)
    test_dataset = mnist.MNIST(root=args.testdata, train=False, transform=ToTensor(),download=False)
    train_loader = DataLoader(train_dataset, batch_size=batch_size)
    test_loader = DataLoader(test_dataset, batch_size=batch_size)
    model = Model().to(device)
    sgd = SGD(model.parameters(), lr=1e-1)
    cost = CrossEntropyLoss()
    epoch = args.epoch_size
    print('epoch_size is:{}'.format(epoch))
    # Load the trained model
    # path = args.checkpoint
    # checkpoint = torch.load(path, map_location=device)
    # model.load_state_dict(checkpoint)
    for _epoch in range(epoch):
        print('the {} epoch_size begin'.format(_epoch + 1))
        model.train()
        for idx, (train_x, train_label) in enumerate(train_loader):
            train_x = train_x.to(device)
            train_label = train_label.to(device)
            label_np = np.zeros((train_label.shape[0], 10))
            sgd.zero_grad()
            predict_y = model(train_x.float())
            loss = cost(predict_y, train_label.long())
            if idx % 10 == 0:
                print('idx: {}, loss: {}'.format(idx, loss.sum().item()))
            loss.backward()
            sgd.step()

        correct = 0
        _sum = 0
        model.eval()
        for idx, (test_x, test_label) in enumerate(test_loader):
            test_x = test_x
            test_label = test_label
            predict_y = model(test_x.to(device).float()).detach()
            predict_ys = np.argmax(predict_y.cpu(), axis=-1)
            label_np = test_label.numpy()
            _ = predict_ys == test_label
            correct += np.sum(_.numpy(), axis=-1)
            _sum += _.shape[0]
        print('accuracy: {:.2f}'.format(correct / _sum))
        #The model output location is placed under /model
        torch.save(model, '/model/mnist_epoch{}_{:.2f}.pkl'.format(_epoch+1, correct / _sum))