# Contents

- [NCF Description](#ncf-description)
- [Model Architecture](#model-architecture)
- [Dataset](#dataset)
- [Features](#features)
    - [Mixed Precision](#mixed-precision)
- [Environment Requirements](#environment-requirements)
- [Quick Start](#quick-start)
- [Script Description](#script-description)
    - [Script and Sample Code](#script-and-sample-code)
    - [Script Parameters](#script-parameters)
    - [Training Process](#training-process)
        - [Training](#training)
        - [Distributed Training](#distributed-training)
    - [Evaluation Process](#evaluation-process)
        - [Evaluation](#evaluation)
- [Model Description](#model-description)
    - [Performance](#performance)
        - [Evaluation Performance](#evaluation-performance)
        - [Inference Performance](#inference-performance)
    - [How to use](#how-to-use)
        - [Inference](#inference)
        - [Continue Training on the Pretrained Model](#continue-training-on-the-pretrained-model)
        - [Transfer Learning](#transfer-learning)
- [Description of Random Situation](#description-of-random-situation)
- [ModelZoo Homepage](#modelzoo-homepage)
# [NCF Description](#contents)

NCF is a general framework for collaborative filtering of recommendations in which a neural network architecture is used to model user-item interactions. Unlike traditional models, NCF does not resort to Matrix Factorization (MF) with an inner product on latent features of users and items. It replaces the inner product with a multi-layer perceptron that can learn an arbitrary function from data.

[Paper](https://arxiv.org/abs/1708.05031): He X, Liao L, Zhang H, et al. Neural collaborative filtering[C]//Proceedings of the 26th International Conference on World Wide Web. 2017: 173-182.

# [Model Architecture](#contents)

Two instantiations of NCF are Generalized Matrix Factorization (GMF) and Multi-Layer Perceptron (MLP). GMF applies a linear kernel to model the latent feature interactions, and MLP uses a nonlinear kernel to learn the interaction function from data. NeuMF is a fused model of GMF and MLP that better models the complex user-item interactions, and unifies the strengths of the linearity of MF and the non-linearity of MLP for modeling the user-item latent structures. NeuMF allows GMF and MLP to learn separate embeddings, and combines the two models by concatenating their last hidden layers. [neumf_model.py](neumf_model.py) defines the architecture details.
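
For intuition only, the following NumPy sketch shows how a NeuMF score combines the two branches. It is not taken from neumf_model.py; layer sizes, names, and weights are illustrative:

```python
import numpy as np

def neumf_score(user_gmf, item_gmf, user_mlp, item_mlp, mlp_weights, fusion_w):
    """Illustrative NeuMF scoring: the GMF branch (element-wise product) and the
    MLP branch (concatenated embeddings passed through dense layers) are
    concatenated and projected to a single logit."""
    gmf_out = user_gmf * item_gmf                    # GMF: element-wise product of embeddings
    mlp_out = np.concatenate([user_mlp, item_mlp])   # MLP: concatenate user/item embeddings
    for w, b in mlp_weights:                         # pass through the MLP tower
        mlp_out = np.maximum(w @ mlp_out + b, 0.0)   # ReLU activation
    fused = np.concatenate([gmf_out, mlp_out])       # fuse the last hidden layers of both branches
    logit = fusion_w @ fused                         # final prediction layer
    return 1.0 / (1.0 + np.exp(-logit))              # sigmoid -> interaction probability

# Toy example: embedding size 8, one MLP layer mapping 16 -> 8 units.
rng = np.random.default_rng(0)
emb = lambda n: rng.normal(size=n)
mlp_weights = [(rng.normal(size=(8, 16)), np.zeros(8))]
fusion_w = rng.normal(size=16)
print(neumf_score(emb(8), emb(8), emb(8), emb(8), mlp_weights, fusion_w))
```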
# [Dataset](#contents)

The [MovieLens datasets](http://files.grouplens.org/datasets/movielens/) are used for model training and evaluation. Specifically, we use two datasets: **ml-1m** (short for MovieLens 1 million) and **ml-20m** (short for MovieLens 20 million).

## ml-1m

The ml-1m dataset contains 1,000,209 anonymous ratings of approximately 3,706 movies made by 6,040 users who joined MovieLens in 2000. All ratings are contained in the file "ratings.dat" without a header row, and are in the following format:

```text
UserID::MovieID::Rating::Timestamp
```

- UserIDs range between 1 and 6040.
- MovieIDs range between 1 and 3952.
- Ratings are made on a 5-star scale (whole-star ratings only).
## ml-20m

The ml-20m dataset contains 20,000,263 ratings of 26,744 movies by 138,493 users. All ratings are contained in the file "ratings.csv". Each line of this file after the header row represents one rating of one movie by one user, and has the following format:

```text
userId,movieId,rating,timestamp
```

- The lines within this file are ordered first by userId, then, within each user, by movieId.
- Ratings are made on a 5-star scale, with half-star increments (0.5 stars - 5.0 stars).

In both datasets, the timestamp is represented in seconds since midnight Coordinated Universal Time (UTC) of January 1, 1970. Each user has at least 20 ratings.
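
As an illustration (this is not part of the repository scripts; the file paths and the ml-1m column names are assumptions, since ratings.dat has no header), the two rating files could be loaded with pandas as follows:

```python
import pandas as pd

# ml-1m: "::"-separated, no header row.
ml1m = pd.read_csv(
    "ml-1m/ratings.dat", sep="::", engine="python", header=None,
    names=["userId", "movieId", "rating", "timestamp"])

# ml-20m: comma-separated with a header row.
ml20m = pd.read_csv("ml-20m/ratings.csv")

# Timestamps are seconds since 1970-01-01 00:00:00 UTC.
ml1m["timestamp"] = pd.to_datetime(ml1m["timestamp"], unit="s")
print(ml1m.head())
```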
# [Features](#contents)

## Mixed Precision

The [mixed precision](https://www.mindspore.cn/tutorial/training/en/master/advanced_use/enable_mixed_precision.html) training method accelerates the deep learning neural network training process by using both single-precision and half-precision data formats, while maintaining the network accuracy achieved with single-precision training. Mixed precision training accelerates computation, reduces memory usage, and enables a larger model or batch size to be trained on specific hardware.

For FP16 operators, if the input data type is FP32, the MindSpore backend will automatically run them with reduced precision. Users can check the reduced-precision operators by enabling the INFO log level and then searching for "reduce precision".
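
As a rough sketch (the actual training setup lives in train.py and may differ; the network, optimizer, and learning rate below are placeholders), mixed precision in MindSpore can be enabled through the `amp_level` argument of `Model`:

```python
import mindspore.nn as nn
from mindspore import Model

# A stand-in network; the real NCF model is defined in src/ncf.py.
net = nn.Dense(16, 2)

loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean')
opt = nn.Adam(net.trainable_params(), learning_rate=0.00382059)

# amp_level="O2" casts most of the network to FP16 while keeping
# numerically sensitive parts (e.g. BatchNorm) in FP32.
model = Model(net, loss_fn=loss, optimizer=opt, amp_level="O2")
```

The same idea appears in the continue-training snippet later in this README, where `Model` is created with `amp_level="O2"`.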
# [Environment Requirements](#contents)

- Hardware (Ascend/GPU)
    - Prepare a hardware environment with an Ascend or GPU processor.
- Framework
    - [MindSpore](https://www.mindspore.cn/install/en)
- For more information, please check the resources below:
    - [MindSpore tutorials](https://www.mindspore.cn/tutorial/training/en/master/index.html)
    - [MindSpore Python API](https://www.mindspore.cn/doc/api_python/en/master/index.html)
# [Quick Start](#contents)

After installing MindSpore via the official website, you can start training and evaluation as follows:

```bash
# download and preprocess the dataset
bash scripts/run_download_dataset.sh
# run the training example
bash scripts/run_train.sh
# run the distributed training example
bash scripts/run_train.sh rank_table.json
# run the evaluation example
bash scripts/run_eval.sh
```
If you want to run on ModelArts, please check the official documentation of [ModelArts](https://support.huaweicloud.com/modelarts/), and you can start training and evaluation as follows:

```python
# run the distributed training example on ModelArts
# (1) Perform a or b.
#     a. Set "enable_modelarts=True" in the default_config.yaml file.
#        Set the other parameters you need in the default_config.yaml file.
#     b. Add "enable_modelarts=True" on the website UI interface.
#        Add the other parameters on the website UI interface.
# (2) Set the code directory to "/path/ncf" on the website UI interface.
# (3) Set the startup file to "train.py" on the website UI interface.
# (4) Set the "Dataset path", "Output file path" and "Job log path" to your paths on the website UI interface.
# (5) Create your job.

# run the evaluation example on ModelArts
# (1) Copy or upload your trained model to an S3 bucket.
# (2) Perform a or b.
#     a. Set "checkpoint_file_path='/cache/checkpoint_path/model.ckpt'" in the default_config.yaml file.
#        Set "checkpoint_url=/The path of checkpoint in S3/" in the default_config.yaml file.
#     b. Add "checkpoint_file_path='/cache/checkpoint_path/model.ckpt'" on the website UI interface.
#        Add "checkpoint_url=/The path of checkpoint in S3/" on the website UI interface.
# (3) Set the code directory to "/path/ncf" on the website UI interface.
# (4) Set the startup file to "eval.py" on the website UI interface.
# (5) Set the "Dataset path", "Output file path" and "Job log path" to your paths on the website UI interface.
# (6) Create your job.
```
# [Script Description](#contents)

## [Script and Sample Code](#contents)

```text
├── ModelZoo_NCF_ME
    ├── README.md                          // descriptions about NCF
    ├── scripts
    │   ├── ascend_distributed_launcher
    │   │   ├── __init__.py                // init file
    │   │   ├── get_distribute_pretrain_cmd.py   // creates the distributed training shell script
    │   ├── run_train.sh                   // shell script for training
    │   ├── run_distribute_train.sh        // shell script for distributed training
    │   ├── run_eval.sh                    // shell script for evaluation
    │   ├── run_download_dataset.sh        // shell script for dataset download and preprocessing
    │   ├── run_transfer_ckpt_to_air.sh    // shell script for converting a checkpoint to AIR format
    ├── src
    │   ├── dataset.py                     // dataset creation
    │   ├── ncf.py                         // NCF architecture
    │   ├── config.py                      // parameter parsing
    │   ├── device_adapter.py              // device adapter
    │   ├── local_adapter.py               // local adapter
    │   ├── moxing_adapter.py              // moxing adapter
    │   ├── movielens.py                   // data download
    │   ├── callbacks.py                   // model loss and evaluation callbacks
    │   ├── constants.py                   // model constants
    │   ├── export.py                      // export checkpoint files into geir/onnx
    │   ├── metrics.py                     // AUC computation
    │   ├── stat_utils.py                  // data processing utility functions
    ├── default_config.yaml                // parameter configuration
    ├── train.py                           // training script
    ├── eval.py                            // evaluation script
```
## [Script Parameters](#contents)

Parameters for both training and evaluation can be set in config.py.

- config for NCF, ml-1m dataset

```text
* `--data_path`: This should be set to the same directory given to the data_download data_dir argument.
* `--dataset`: The dataset name to be downloaded and preprocessed. By default, it is ml-1m.
* `--train_epochs`: Total number of training epochs.
* `--batch_size`: Training batch size.
* `--eval_batch_size`: Evaluation batch size.
* `--num_neg`: The number of negative instances to pair with each positive instance.
* `--layers`: The sizes of the hidden layers for the MLP.
* `--num_factors`: The embedding size of the MF model.
* `--output_path`: The location of the output file.
* `--eval_file_name`: Evaluation output file.
```
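
To make the `--num_neg` option concrete, the sketch below pairs each observed user-item interaction with randomly drawn negative items. It is illustrative only and is not the repository's actual sampler in src/dataset.py:

```python
import random

def sample_negatives(positive_items, all_item_ids, num_neg=4, seed=0):
    """For one user, pair every positive item (label 1) with `num_neg`
    items the user has not interacted with (label 0)."""
    rng = random.Random(seed)
    positives = set(positive_items)
    samples = []
    for item in positive_items:
        samples.append((item, 1))            # observed interaction
        for _ in range(num_neg):
            neg = rng.choice(all_item_ids)
            while neg in positives:          # resample until an unseen item is drawn
                neg = rng.choice(all_item_ids)
            samples.append((neg, 0))         # sampled negative
    return samples

# Example: a user who rated items 3, 7 and 42 out of 100 items, with num_neg=4.
print(sample_negatives([3, 7, 42], list(range(100)), num_neg=4))
```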
## [Training Process](#contents)

### Training

```bash
bash scripts/run_train.sh
```

The command above runs in the background; you can view the results through the file `train.log`. After training, you will find checkpoint files under the script folder by default. The loss values will look as follows:

```text
# grep "loss is " train.log
ds_train.size: 95
epoch: 1 step: 95, loss is 0.25074288
epoch: 2 step: 95, loss is 0.23324402
epoch: 3 step: 95, loss is 0.18286772
...
```
The model checkpoint will be saved in the current directory.

## [Evaluation Process](#contents)

### Evaluation

- evaluation on the ml-1m dataset when running on Ascend

Before running the command below, please check the checkpoint path used for evaluation. Please set the checkpoint path to an absolute path, e.g., "checkpoint/ncf-125_390.ckpt".

```bash
bash scripts/run_eval.sh
```

The command above runs in the background. You can view the results through the file "eval.log". The accuracy on the test dataset will be as follows:

```text
# grep "accuracy: " eval.log
HR:0.6846,NDCG:0.410
```
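
The reported HR and NDCG follow the leave-one-out evaluation commonly used for NCF: each user's held-out positive item is ranked against candidate items, and the metrics are computed over the top-k list. The following minimal computation is illustrative and is not the repository's metrics.py:

```python
import math

def hit_ratio_and_ndcg(ranked_items, positive_item, k=10):
    """HR@k is 1 if the held-out positive appears in the top-k list;
    NDCG@k rewards it more the closer it sits to the top."""
    topk = ranked_items[:k]
    if positive_item not in topk:
        return 0.0, 0.0
    rank = topk.index(positive_item)                 # 0-based position in the top-k list
    return 1.0, math.log(2) / math.log(rank + 2)     # single-relevant-item NDCG

# Example: the positive item 17 is ranked 3rd among 100 candidates.
print(hit_ratio_and_ndcg([5, 9, 17] + list(range(50, 147)), 17, k=10))
```

The per-user values are then averaged over all users to give the HR and NDCG numbers shown in the log above.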
# [Model Description](#contents)

## [Performance](#contents)

### Evaluation Performance

| Parameters                 | Ascend                                                       |
| -------------------------- | ------------------------------------------------------------ |
| Model Version              | NCF                                                          |
| Resource                   | Ascend 910; CPU 2.60 GHz, 56 cores; Memory 314 GB; OS Euler2.8 |
| Uploaded Date              | 10/23/2020 (month/day/year)                                  |
| MindSpore Version          | 1.0.0                                                        |
| Dataset                    | ml-1m                                                        |
| Training Parameters        | epoch=25, steps=19418, batch_size=256, lr=0.00382059         |
| Optimizer                  | GradOperation                                                |
| Loss Function              | Softmax Cross Entropy                                        |
| Outputs                    | probability                                                  |
| Speed                      | 1pc: 0.575 ms/step                                           |
| Total Time                 | 1pc: 5 mins                                                  |

### Inference Performance

| Parameters          | Ascend                      |
| ------------------- | --------------------------- |
| Model Version       | NCF                         |
| Resource            | Ascend 910; OS Euler2.8     |
| Uploaded Date       | 10/23/2020 (month/day/year) |
| MindSpore Version   | 1.0.0                       |
| Dataset             | ml-1m                       |
| batch_size          | 256                         |
| Outputs             | probability                 |
| Accuracy            | HR: 0.6846, NDCG: 0.410     |
## [How to use](#contents)

### Inference

If you need to use the trained model to perform inference on multiple hardware platforms, such as GPU, Ascend 910 or Ascend 310, you can refer to this [Link](https://www.mindspore.cn/tutorial/training/en/master/advanced_use/migrate_3rd_scripts.html) and to <https://www.mindspore.cn/tutorial/inference/en/master/multi_platform_inference.html>. Below is a simple example of the steps:

```python
# Load the unseen dataset for inference
dataset = dataset.create_dataset(cfg.data_path, 1, False)

# Define the model
net = GoogleNet(num_classes=cfg.num_classes)
opt = Momentum(filter(lambda x: x.requires_grad, net.get_parameters()), 0.01,
               cfg.momentum, weight_decay=cfg.weight_decay)
loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean')
model = Model(net, loss_fn=loss, optimizer=opt, metrics={'acc'})

# Load the pre-trained model
param_dict = load_checkpoint(cfg.checkpoint_path)
load_param_into_net(net, param_dict)
net.set_train(False)

# Make predictions on the unseen dataset
acc = model.eval(dataset)
print("accuracy: ", acc)
```
### Continue Training on the Pretrained Model

```python
# Load the dataset
dataset = create_dataset(cfg.data_path, cfg.epoch_size)
batch_num = dataset.get_dataset_size()

# Define the model
net = GoogleNet(num_classes=cfg.num_classes)

# Continue training if pre_trained is set to True
if cfg.pre_trained:
    param_dict = load_checkpoint(cfg.checkpoint_path)
    load_param_into_net(net, param_dict)

lr = lr_steps(0, lr_max=cfg.lr_init, total_epochs=cfg.epoch_size,
              steps_per_epoch=batch_num)
opt = Momentum(filter(lambda x: x.requires_grad, net.get_parameters()),
               Tensor(lr), cfg.momentum, weight_decay=cfg.weight_decay)
loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean')
model = Model(net, loss_fn=loss, optimizer=opt, metrics={'acc'},
              amp_level="O2", keep_batchnorm_fp32=False, loss_scale_manager=None)

# Set callbacks
config_ck = CheckpointConfig(save_checkpoint_steps=batch_num * 5,
                             keep_checkpoint_max=cfg.keep_checkpoint_max)
time_cb = TimeMonitor(data_size=batch_num)
ckpoint_cb = ModelCheckpoint(prefix="train_googlenet_cifar10", directory="./",
                             config=config_ck)
loss_cb = LossMonitor()

# Start training
model.train(cfg.epoch_size, dataset, callbacks=[time_cb, ckpoint_cb, loss_cb])
print("train success")
```
# [Description of Random Situation](#contents)

In dataset.py, we set the seed inside the "create_dataset" function. We also use a random seed in train.py.
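
For reference, a typical seeding setup in MindSpore looks like the sketch below; this is illustrative, and the exact calls and seed values used by this repository may differ:

```python
import mindspore.dataset as ds
from mindspore import set_seed

set_seed(1)            # seeds MindSpore's global random state (weight init, dropout, etc.)
ds.config.set_seed(1)  # seeds dataset shuffling and samplers
```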
# [ModelZoo Homepage](#contents)

Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).