!5230 modify readme.md for deepfm

Merge pull request !5230 from yangyongjie/master
5 years ago · 8d3ce090e4
--- a/model_zoo/official/recommend/deepfm/README.md
+++ b/model_zoo/official/recommend/deepfm/README.md
@@ -1,147 +1,287 @@
 # DeepFM Description
 # Contents

 This is an example of training DeepFM with Criteo dataset in MindSpore.
 - [DeepFM Description](#deepfm-description)
 - [Model Architecture](#model-architecture)
 - [Dataset](#dataset)
 - [Environment Requirements](#environment-requirements)
 - [Quick Start](#quick-start)    
 - [Script Description](#script-description)
    - [Script and Sample Code](#script-and-sample-code)
    - [Script Parameters](#script-parameters)
    - [Training Process](#training-process)
        - [Training](#training)
        - [Distributed Training](#distributed-training)  
    - [Evaluation Process](#evaluation-process)
        - [Evaluation](#evaluation)
 - [Model Description](#model-description)
    - [Performance](#performance)  
        - [Evaluation Performance](#evaluation-performance)
         - [Inference Performance](#inference-performance)
 - [Description of Random Situation](#description-of-random-situation)
 - [ModelZoo Homepage](#modelzoo-homepage)

 [Paper](https://arxiv.org/pdf/1703.04247.pdf) Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, Xiuqiang He

 # [DeepFM Description](#contents)

 # Model architecture
 Learning sophisticated feature interactions behind user behaviors is critical in maximizing CTR for recommender systems. Despite great progress, existing methods seem to have a strong bias towards low- or high-order interactions, or require expertise feature engineering. In this paper, we show that it is possible to derive an end-to-end learning model that emphasizes both low- and high-order feature interactions. The proposed model, DeepFM, combines the power of factorization machines for recommendation and deep learning for feature learning in a new neural network architecture. 

 The overall network architecture of DeepFM is shown below:
 [Paper](https://arxiv.org/abs/1703.04247):  Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, Xiuqiang He. DeepFM: A Factorization-Machine based Neural Network for CTR Prediction

 [Link](https://arxiv.org/pdf/1703.04247.pdf)
 # [Model Architecture](#contents)

 DeepFM consists of two components. The FM component is a factorization machine, which was proposed to learn feature interactions for recommendation. The deep component is a feed-forward neural network, which is used to learn high-order feature interactions.
 The FM and deep components share the same raw input feature vector, which enables DeepFM to learn low- and high-order feature interactions simultaneously from the raw input features.
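
The two-component structure described above can be sketched in plain Python. This is a minimal illustration that follows the formulation in the DeepFM paper, not the network in `src/deepfm.py`; the function names (`fm_second_order`, `mlp`, `deepfm_predict`), the layer sizes, and the random weights are all assumptions made for the example.

```python
import math
import random


def fm_second_order(embeddings):
    """Sum of pairwise dot products: 0.5 * sum_k((sum_i v_ik)^2 - sum_i v_ik^2)."""
    dim = len(embeddings[0])
    total = 0.0
    for k in range(dim):
        column = [vector[k] for vector in embeddings]
        total += sum(column) ** 2 - sum(x * x for x in column)
    return 0.5 * total


def mlp(inputs, layers):
    """Feed-forward network: ReLU on hidden layers, linear output layer."""
    activations = inputs
    for index, (weights, biases) in enumerate(layers):
        outputs = [sum(w * a for w, a in zip(row, activations)) + b
                   for row, b in zip(weights, biases)]
        is_last = index == len(layers) - 1
        activations = outputs if is_last else [max(0.0, o) for o in outputs]
    return activations


def deepfm_predict(field_ids, first_order, embedding_table, layers):
    """Both components read the same embeddings of the active features."""
    embeddings = [embedding_table[i] for i in field_ids]
    fm_logit = sum(first_order[i] for i in field_ids) + fm_second_order(embeddings)
    deep_input = [x for vector in embeddings for x in vector]
    deep_logit = mlp(deep_input, layers)[0]
    return 1.0 / (1.0 + math.exp(-(fm_logit + deep_logit)))


# Hypothetical sizes and random weights, for illustration only.
random.seed(1)
num_features, dim, num_fields, hidden = 10, 4, 3, 8


def rand_row(n):
    return [random.uniform(-0.1, 0.1) for _ in range(n)]


first_order = rand_row(num_features)
embedding_table = [rand_row(dim) for _ in range(num_features)]
layers = [
    ([rand_row(num_fields * dim) for _ in range(hidden)], [0.0] * hidden),
    ([rand_row(hidden)], [0.0]),
]
ctr = deepfm_predict([0, 4, 7], first_order, embedding_table, layers)
print(ctr)
```

The FM term uses the usual square-of-sum minus sum-of-squares identity, so the pairwise interactions cost time linear in the number of fields instead of quadratic.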

 # Requirements
 - Install [MindSpore](https://www.mindspore.cn/install/en).
 - Download the criteo dataset for pre-training. Extract and clean text in the dataset with [WikiExtractor](https://github.com/attardi/wikiextractor). Convert the dataset to TFRecord format and move the files to a specified path.
 # [Dataset](#contents)

 - [1] A dataset used in Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, Xiuqiang He. DeepFM: A Factorization-Machine based Neural Network for CTR Prediction[J]. 2017.
  

 # [Environment Requirements](#contents)

 - Hardware (Ascend/GPU)
  - Prepare hardware environment with Ascend or GPU processor. If you want to try Ascend, please send the [application form](https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/file/other/Ascend%20Model%20Zoo%E4%BD%93%E9%AA%8C%E8%B5%84%E6%BA%90%E7%94%B3%E8%AF%B7%E8%A1%A8.docx) to ascend@huawei.com. Once approved, you can get the resources. 
 - Framework
  - [MindSpore](https://www.mindspore.cn/install/en)
 - For more information, please check the resources below:
  - [MindSpore tutorials](https://www.mindspore.cn/tutorial/zh-CN/master/index.html) 
  - [MindSpore API](https://www.mindspore.cn/api/zh-CN/master/index.html)

 # Script description

 ## Script and sample code

 ```shell
 ├── deepfm       
  ├── README.md                      
  ├── scripts 
  │   ├──run_distribute_train.sh                
  │   ├──run_distribute_train_gpu.sh
  │   ├──run_standalone_train.sh                    
  │   ├──run_eval.sh                   
  ├── src
  │   ├──__init__.py                     
  │   ├──config.py                     
  │   ├──dataset.py
  │   ├──callback.py                                    
  │   ├──deepfm.py
  ├── train.py
  ├── eval.py
 ```

 ## Training process

 ### Usage

 - sh run_distribute_train.sh [DEVICE_NUM] [DATASET_PATH] [RANK_TABLE_FILE]
 - sh run_distribute_train_gpu.sh [DEVICE_NUM] [DATASET_PATH]
 - sh run_standalone_train.sh [DEVICE_ID] [DEVICE_TARGET] [DATASET_PATH]
 - python train.py --dataset_path [DATASET_PATH] --device_target [DEVICE_TARGET]
 # [Quick Start](#contents)

 After installing MindSpore via the official website, you can start training and evaluation as follows: 

 - running on Ascend

  ```
  # run training example
  python train.py \
    --dataset_path='dataset/train' \
    --ckpt_path='./checkpoint' \
    --eval_file_name='auc.log' \
    --loss_file_name='loss.log' \
    --device_target='Ascend' \
    --do_eval=True > ms_log/output.log 2>&1 &
  
  # run distributed training example
  sh scripts/run_distribute_train.sh 8 /dataset_path /rank_table_8p.json
  
  # run evaluation example
  python eval.py \
    --dataset_path='dataset/test' \
    --checkpoint_path='./checkpoint/deepfm.ckpt' \
    --device_target='Ascend' > ms_log/eval_output.log 2>&1 &
  # OR
  sh scripts/run_eval.sh 0 Ascend /dataset_path /checkpoint_path/deepfm.ckpt
  ```

  For distributed training, an hccl configuration file in JSON format must be created in advance.

  Please follow the instructions at the link below:

  https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools

 - running on GPU

  For running on GPU, please change `device_target` from `Ascend` to `GPU` in configuration file src/config.py

  ```
  # run training example
  python train.py \
    --dataset_path='dataset/train' \
    --ckpt_path='./checkpoint' \
    --eval_file_name='auc.log' \
    --loss_file_name='loss.log' \
    --device_target='GPU' \
    --do_eval=True > ms_log/output.log 2>&1 &
  
  # run distributed training example
  sh scripts/run_distribute_train_gpu.sh 8 /dataset_path
  
  # run evaluation example
  python eval.py \
    --dataset_path='dataset/test' \
    --checkpoint_path='./checkpoint/deepfm.ckpt' \
    --device_target='GPU' > ms_log/eval_output.log 2>&1 &
  # OR
  sh scripts/run_eval.sh 0 GPU /dataset_path /checkpoint_path/deepfm.ckpt
  ```

 # [Script Description](#contents)

 ## [Script and Sample Code](#contents)

 ### Launch

 ``` 
 # distribute training example
  sh scripts/run_distribute_train.sh 8 /opt/dataset/criteo /opt/mindspore_hccl_file.json
  sh scripts/run_distribute_train_gpu.sh 8 /opt/dataset/criteo
 # standalone training example
  sh scripts/run_standalone_train.sh 0 Ascend /opt/dataset/criteo
  or
  python train.py --dataset_path /opt/dataset/criteo --device_target Ascend > output.log 2>&1 &
 ```

 ### Result

 Training result will be stored in the example path. 
 Checkpoints will be stored at `./checkpoint` by default, 
 and training log  will be redirected to `./output.log` by default,
 and loss log will be redirected to `./loss.log` by default,
 and eval log will be redirected to `./auc.log` by default. 


 ## Eval process

 ### Usage

 - sh run_eval.sh [DEVICE_ID] [DEVICE_TARGET] [DATASET_PATH] [CHECKPOINT_PATH]

 ### Launch

 ``` 
 # infer example
    sh scripts/run_eval.sh 0 Ascend ~/criteo/eval/ ~/train/deepfm-15_41257.ckpt
 .
 └─deepfm      
  ├─README.md
  ├─scripts      
    ├─run_standalone_train.sh         # launch standalone training(1p) in Ascend or GPU
    ├─run_distribute_train.sh         # launch distributed training(8p) in Ascend
    ├─run_distribute_train_gpu.sh     # launch distributed training(8p) in GPU
    └─run_eval.sh                     # launch evaluating in Ascend or GPU
  ├─src
    ├─__init__.py                     # python init file
    ├─config.py                       # parameter configuration
    ├─callback.py                     # define callback function
    ├─deepfm.py                       # deepfm network
    ├─dataset.py                      # create dataset for deepfm
  ├─eval.py                           # eval net
  └─train.py                          # train net
 ```

 > checkpoint can be produced in training process. 

 ### Result

 The inference result will be stored in the example path. You can find results like the following in `auc.log`.

 ``` 
 2020-05-27 20:51:35 AUC: 0.80577889065281, eval time: 35.55999s.
 ```
 ## [Script Parameters](#contents)

 Parameters for both training and evaluation can be set in config.py

 - train parameters
  ```
  optional arguments:
  -h, --help            show this help message and exit
  --dataset_path DATASET_PATH
                        Dataset path
  --ckpt_path CKPT_PATH
                        Checkpoint path
  --eval_file_name EVAL_FILE_NAME
                        Auc log file path. Default: "./auc.log"
  --loss_file_name LOSS_FILE_NAME
                        Loss log file path. Default: "./loss.log"
  --do_eval DO_EVAL     Do evaluation or not. Default: True
  --device_target DEVICE_TARGET
                        Ascend or GPU. Default: Ascend
  ```
 - eval parameters
  ```
  optional arguments:
  -h, --help            show this help message and exit
  --checkpoint_path CHECKPOINT_PATH
                        Checkpoint file path
  --dataset_path DATASET_PATH
                        Dataset path
  --device_target DEVICE_TARGET
                        Ascend or GPU. Default: Ascend
  ```


 ## [Training Process](#contents)

 ### Training 

 - running on Ascend

  ```
  python train.py \
    --dataset_path='dataset/train' \
    --ckpt_path='./checkpoint' \
    --eval_file_name='auc.log' \
    --loss_file_name='loss.log' \
    --device_target='Ascend' \
    --do_eval=True > ms_log/output.log 2>&1 &
  ```
  
  The python command above runs in the background; you can view the results in the file `ms_log/output.log`.
  
  After training, checkpoint files are saved under the `./checkpoint` folder by default. The loss values are saved in the `loss.log` file.
  
  ```
  2020-05-27 15:26:29 epoch: 1 step: 41257, loss is 0.498953253030777
  2020-05-27 15:32:32 epoch: 2 step: 41257, loss is 0.45545706152915955
  ...
  ```
  
  The model checkpoint will be saved in the current directory. 

 - running on GPU
  To do.
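
The `loss.log` lines shown above can be read back for plotting or sanity checks. The following is a small sketch, assuming every record follows exactly the format of the two sample lines; the names `LOSS_LINE` and `parse_loss_log` are hypothetical and not part of the repository.

```python
import re

# Matches lines such as:
# 2020-05-27 15:26:29 epoch: 1 step: 41257, loss is 0.498953253030777
LOSS_LINE = re.compile(
    r"epoch: (\d+) step: (\d+), loss is (\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"
)


def parse_loss_log(lines):
    """Return (epoch, step, loss) tuples; lines that do not match are skipped."""
    records = []
    for line in lines:
        match = LOSS_LINE.search(line)
        if match:
            epoch, step, loss = match.groups()
            records.append((int(epoch), int(step), float(loss)))
    return records


sample = [
    "2020-05-27 15:26:29 epoch: 1 step: 41257, loss is 0.498953253030777",
    "2020-05-27 15:32:32 epoch: 2 step: 41257, loss is 0.45545706152915955",
    "...",
]
records = parse_loss_log(sample)
print(records)
```

For a real run, pass an open file object instead of `sample`, since the function only iterates over lines.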

 ### Distributed Training

 - running on Ascend

  ```
  sh scripts/run_distribute_train.sh 8 /dataset_path /rank_table_8p.json
  ```
  
  The above shell script runs distributed training in the background. You can view the results in the file `log[X]/output.log`. The loss values are saved in the `loss.log` file.
  

 - running on GPU
  To do.


 ## [Evaluation Process](#contents)

 ### Evaluation

 - evaluation on dataset when running on Ascend

  Before running the command below, please check the checkpoint path used for evaluation.
  
  ```
  python eval.py \
    --dataset_path='dataset/test' \
    --checkpoint_path='./checkpoint/deepfm.ckpt' \
    --device_target='Ascend' > ms_log/eval_output.log 2>&1 &
  # OR
  sh scripts/run_eval.sh 0 Ascend /dataset_path /checkpoint_path/deepfm.ckpt
  ```
  
  The above python command runs in the background. You can view the results in the file `ms_log/eval_output.log`. The AUC is saved in the `auc.log` file.
  
  ```
  {'result': {'AUC': 0.8057789065281104, 'eval_time': 35.64779996871948}}
  ```


 - evaluation on dataset when running on GPU
  To do.
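
The evaluation result shown above is printed as a Python dict literal, so it can be read back without a custom parser. This is a sketch under the assumption that each result line has exactly that shape; `parse_eval_result` is a hypothetical helper, not part of `eval.py`.

```python
import ast


def parse_eval_result(line):
    """Return (auc, eval_time_seconds) from one printed result line."""
    # literal_eval only accepts Python literals, so it is safe on log text.
    payload = ast.literal_eval(line.strip())
    result = payload["result"]
    return float(result["AUC"]), float(result["eval_time"])


auc, seconds = parse_eval_result(
    "{'result': {'AUC': 0.8057789065281104, 'eval_time': 35.64779996871948}}"
)
print(f"AUC={auc:.4f}, eval_time={seconds:.1f}s")  # prints AUC=0.8058, eval_time=35.6s
```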


 # [Model Description](#contents)
 ## [Performance](#contents)

 ### Evaluation Performance 

 | Parameters                 | Ascend                                                      | GPU                    |
 | -------------------------- | ----------------------------------------------------------- | ---------------------- |
 | Model Version              | DeepFM                                                      | To do                  |
 | Resource                   | Ascend 910; CPU 2.60GHz, 192cores; Memory 314G              | To do                  |
 | uploaded Date              | 05/17/2020 (month/day/year)                                 | To do                  |
 | MindSpore Version          | 0.3.0-alpha                                                 | To do                  |
 | Dataset                    | [1]                                                         | To do                  |
 | Training Parameters        | epoch=15, batch_size=1000, lr=1e-5                          | To do                  |
 | Optimizer                  | Adam                                                        | To do                  |
 | Loss Function              | Sigmoid Cross Entropy With Logits                           | To do                  |
 | outputs                    | Accuracy                                                    | To do                  |
 | Loss                       | 0.45                                                        | To do                  |
 | Speed                      | 1pc: 8.16 ms/step;                                          | To do                  |
 | Total time                 | 1pc: 90 mins;                                               | To do                  |
 | Parameters (M)             | 16.5                                                        | To do                  |
 | Checkpoint for Fine tuning | 190M (.ckpt file)                                           | To do                  |
 | Scripts                    | [deepfm script](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/recommend/deepfm) | To do                  |


 ### Inference Performance

 | Parameters          | Ascend                      | GPU                         |
 | ------------------- | --------------------------- | --------------------------- |
 | Model Version       | DeepFM                      | To do                       |
 | Resource            | Ascend 910                  | To do                       |
 | Uploaded Date       | 05/27/2020 (month/day/year) | To do                       |
 | MindSpore Version   | 0.3.0-alpha                 | To do                       |
 | Dataset             | [1]                         | To do                       |
 | batch_size          | 1000                        | To do                       |
 | outputs             | accuracy                    | To do                       |
 | Accuracy            | 1pc: 80.55%;                | To do                       |
 | Model for inference | 190M (.ckpt file)           | To do                       |


 # [Description of Random Situation](#contents)

 We set the random seed before training in train.py.

 # [ModelZoo Homepage](#contents)  
 Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).  

 # Model description

 ## Learning Rate

 | Number of Devices      | Learning Rate      |
 | ---------------------- | ------------------ |
 | 1                      | 1e-5               |
 | 8                      | 1e-4               |

 > Change the learning rate at src/config.py accordingly.

 ## Performance

 ### Training Performance

 | Parameters                 | DeepFM                                                |
 | -------------------------- | ------------------------------------------------------|
 | Model Version              |                                                       |
 | Resource                   | Ascend 910, cpu:2.60GHz 96cores, memory:1.5T          |
 | uploaded Date              | 05/27/2020                                            |
 | MindSpore Version          | 0.2.0                                                 |
 | Dataset                    | Criteo                                                |
 | Training Parameters        | src/config.py                                         |
 | Optimizer                  | Adam                                                  |
 | Loss Function              | SoftmaxCrossEntropyWithLogits                         |
 | outputs                    |                                                       |
 | Loss                       | 0.4234                                                |
 | Accuracy                   | AUC[0.8055]                                           |
 | Total time                 | 91 min                                                |
 | Params (M)                 |                                                       |
 | Checkpoint for Fine tuning |                                                       |
 | Model for inference        |                                                       |

 #### Inference Performance

 | Parameters                 |                               |                           |
 | -------------------------- | ----------------------------- | ------------------------- |
 | Model Version              |                               |                           |   
 | Resource                   | Ascend 910                    | Ascend 310                | 
 | uploaded Date              | 05/27/2020                    | 05/27/2020                | 
 | MindSpore Version          | 0.2.0                         | 0.2.0                     |  
 | Dataset                    | Criteo                        |                           |
 | batch_size                 | 1000                          |                           |
 | outputs                    |                               |                           |
 | Accuracy                   | AUC[0.8055]                   |                           |                      
 | Speed                      |                               |                           |                     
 | Total time                 | 35.559s                       |                           |                      
 | Model for inference        |                               |                           |                 

 # ModelZoo Homepage  
 [Link](https://gitee.com/mindspore/mindspore/tree/master/mindspore/model_zoo)  
--- a/model_zoo/official/recommend/deepfm/eval.py
+++ b/model_zoo/official/recommend/deepfm/eval.py
@@ -30,7 +30,7 @@ sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
 parser = argparse.ArgumentParser(description='CTR Prediction')
 parser.add_argument('--checkpoint_path', type=str, default=None, help='Checkpoint file path')
 parser.add_argument('--dataset_path', type=str, default=None, help='Dataset path')
 parser.add_argument('--device_target', type=str, default="Ascend", help='Ascend, GPU, or CPU')
 parser.add_argument('--device_target', type=str, default="Ascend", help='Ascend or GPU. Default: Ascend')
 args_opt, _ = parser.parse_known_args()
 device_id = int(os.getenv('DEVICE_ID'))
 context.set_context(mode=context.GRAPH_MODE, device_target=args_opt.device_target, device_id=device_id)
--- a/model_zoo/official/recommend/deepfm/train.py
+++ b/model_zoo/official/recommend/deepfm/train.py
@@ -34,11 +34,15 @@ sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
 parser = argparse.ArgumentParser(description='CTR Prediction')
 parser.add_argument('--dataset_path', type=str, default=None, help='Dataset path')
 parser.add_argument('--ckpt_path', type=str, default=None, help='Checkpoint path')
 parser.add_argument('--eval_file_name', type=str, default="./auc.log", help='eval file path')
 parser.add_argument('--loss_file_name', type=str, default="./loss.log", help='loss file path')
 parser.add_argument('--do_eval', type=bool, default=True, help='Do evaluation or not.')
 parser.add_argument('--device_target', type=str, default="Ascend", help='Ascend, GPU, or CPU')
 parser.add_argument('--eval_file_name', type=str, default="./auc.log",
                    help='Auc log file path. Default: "./auc.log"')
 parser.add_argument('--loss_file_name', type=str, default="./loss.log",
                    help='Loss log file path. Default: "./loss.log"')
 parser.add_argument('--do_eval', type=str, default='True',
                    help='Do evaluation or not, only support "True" or "False". Default: "True"')
 parser.add_argument('--device_target', type=str, default="Ascend", help='Ascend or GPU. Default: Ascend')
 args_opt, _ = parser.parse_known_args()
 args_opt.do_eval = args_opt.do_eval == 'True'
 rank_size = int(os.environ.get("RANK_SIZE", 1))

 random.seed(1)
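
The `--do_eval` change in the train.py hunk above replaces `type=bool` with a string argument compared against `'True'`. A likely motivation, stated here as an assumption since the PR does not say so, is that `argparse` applies `bool()` to the raw string, and any non-empty string is truthy. The standalone sketch below reproduces both declarations outside of train.py:

```python
import argparse

# Old declaration: bool('False') is True because the string is non-empty.
old_parser = argparse.ArgumentParser()
old_parser.add_argument('--do_eval', type=bool, default=True)
old_value = old_parser.parse_args(['--do_eval=False']).do_eval

# New declaration: keep the raw string and compare it explicitly.
new_parser = argparse.ArgumentParser()
new_parser.add_argument('--do_eval', type=str, default='True')
new_value = new_parser.parse_args(['--do_eval=False']).do_eval == 'True'

print(old_value, new_value)  # prints: True False
```

With the string comparison, only the exact value `True` enables evaluation; anything else, including `true` or `1`, disables it, which matches the help text "only support "True" or "False"".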