# Guideline to Efficiently Generating MindRecord

<!-- TOC -->

- [What does the example do](#what-does-the-example-do)
- [Example test for ImageNet](#example-test-for-imagenet)
- [How to use the example for other dataset](#how-to-use-the-example-for-other-dataset)
    - [Create work space](#create-work-space)
    - [Implement data generator](#implement-data-generator)
    - [Run data generator](#run-data-generator)

<!-- /TOC -->

## What does the example do

This example provides an efficient way to generate MindRecord. Users only need to define the parallel granularity for reading the training data and the function that reads the data for a single task. The example then converts the training data into MindRecord in parallel.

1. run_template.sh: entry script. Users need to modify its parameters to match their own training data.
2. writer.py: main script, called by run_template.sh. It reads the user's training data in parallel and generates MindRecord.
3. template/mr_api.py: template in which users define the parallel granularity for reading the training data and the reading function for a single task.
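
To make the division of labor concrete, the following is a minimal sketch of how a driver could consume the two user-defined functions. It is not the actual writer.py: the helper names `collect_rows` and `run_all_tasks` are hypothetical, the two `mindrecord_*` functions are toy stand-ins, and a thread pool is used only to keep the sketch self-contained (a real converter may well use processes and would write rows to MindRecord rather than collect them).

```python
from concurrent.futures import ThreadPoolExecutor


# Toy stand-ins for the two functions a user would define in mr_api.py.
def mindrecord_task_number():
    """Return how many independent tasks the data can be split into."""
    return 3


def mindrecord_dict_data(task_id):
    """Yield one dictionary per data row for the given task."""
    for i in range(2):
        yield {"task": task_id, "row": i}


def collect_rows(task_id):
    """Drain one task's generator (hypothetical helper).

    A real writer would write each row out instead of keeping it in memory.
    """
    return list(mindrecord_dict_data(task_id))


def run_all_tasks(max_workers=4):
    """Run every task concurrently and return the rows grouped by task ID."""
    task_ids = range(mindrecord_task_number())
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as task_ids.
        return list(pool.map(collect_rows, task_ids))


rows_by_task = run_all_tasks()
```

The point of the sketch is that the driver only needs the task count and a per-task generator; how the data is partitioned stays entirely in the user's code.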

## Example test for ImageNet

1. Download and prepare the ImageNet dataset as required.

    > [ImageNet dataset download address](http://image-net.org/download)

    Store the downloaded ImageNet dataset in a folder. The folder contains all images and a mapping file that records the labels of the images.

    The mapping file has three columns separated by spaces. They indicate the image class, the label ID, and the label name. The following is an example of the mapping file:

    ```
    n02119789 1 pen
    n02100735 2 notebook
    n02110185 3 mouse
    n02096294 4 orange
    ```

2. Edit run_imagenet.sh and modify the parameters.

3. Run the bash script.

    ```bash
    bash run_imagenet.sh
    ```

4. Performance result

    | Training Data | General API | Current Example | Env |
    | ---- | ---- | ---- | ---- |
    | ImageNet (140G) | 2h40m | 50m | CPU: Intel Xeon Gold 6130 x 64, Memory: 256G, Storage: HDD |
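
The mapping file described in step 1 is simple enough to parse by hand. The following is a minimal sketch, assuming exactly three space-separated columns per line as in the example above. The function name `parse_label_map` is hypothetical and is not part of the example scripts.

```python
def parse_label_map(lines):
    """Parse 'image_class label_id label_name' lines into a dictionary.

    Returns {image_class: (label_id, label_name)}.
    """
    label_map = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        image_class, label_id, label_name = parts
        label_map[image_class] = (int(label_id), label_name)
    return label_map


# Two rows taken from the example mapping file.
sample = """\
n02119789 1 pen
n02110185 3 mouse
"""
label_map = parse_label_map(sample.splitlines())
```

Note that this sketch silently skips any line that does not have three fields, so a label name containing spaces would be dropped rather than parsed.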

## How to use the example for other dataset

### Create work space

Assume the dataset name is 'xyz'.

* Create the work space from the template

    ```shell
    cd ${your_mindspore_home}/example/convert_to_mindrecord
    cp -r template xyz
    ```

### Implement data generator

Edit the dictionary data generator.

* Edit the file

    ```shell
    cd ${your_mindspore_home}/example/convert_to_mindrecord
    vi xyz/mr_api.py
    ```

Two APIs, 'mindrecord_task_number' and 'mindrecord_dict_data', must be implemented.

- 'mindrecord_task_number()' returns the number of tasks. Return 1 if the data rows are generated serially. Return N if the generator can be split into N tasks that run in parallel.
- 'mindrecord_dict_data(task_id)' yields dictionary data row by row. 'task_id' ranges from 0 to N-1, where N is the return value of 'mindrecord_task_number()'.
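
As an illustration of the two APIs, the following is a minimal sketch of what an mr_api.py could look like when each input file is treated as one task. Everything except the two required function names is an assumption: the temporary `DATA_DIR`, the sample files, the helper `_task_files`, and the dictionary keys `file_name` and `text` are hypothetical, and a real implementation would have to yield the fields its own MindRecord schema expects.

```python
import os
import tempfile

# Hypothetical input directory with two small text files, created here only
# so that the sketch runs on its own.
DATA_DIR = tempfile.mkdtemp()
for name in ("part0.txt", "part1.txt"):
    with open(os.path.join(DATA_DIR, name), "w") as f:
        f.write("a\nb\n")


def _task_files():
    """Return the input files in a stable order; one file is one task."""
    return sorted(os.listdir(DATA_DIR))


def mindrecord_task_number():
    """Return N, the number of tasks that can run in parallel."""
    return len(_task_files())


def mindrecord_dict_data(task_id):
    """Yield one dictionary per row of the file assigned to task_id."""
    file_name = _task_files()[task_id]
    with open(os.path.join(DATA_DIR, file_name)) as f:
        for line in f:
            yield {"file_name": file_name, "text": line.rstrip("\n")}
```

Sorting the file list matters here: every task must see the same ordering, otherwise two tasks could end up reading the same file.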

Tricks for running in parallel:

- For ImageNet, one directory can be a task.
- For TFRecord with multiple files, each file can be a task.
- For TFRecord with only one file, the file can still be split into N tasks: the task with task_id K picks a data row only if (count % N == K), where count is the index of the row.
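
The modulo rule for a single input file can be sketched as follows. The reader `read_rows` is a hypothetical stand-in that yields ten numbered rows; in practice it would iterate over the records of the single input file, and `NUM_TASKS` would be chosen to suit the machine.

```python
NUM_TASKS = 3  # N, the value returned by mindrecord_task_number()


def read_rows():
    """Hypothetical reader for the single input file; yields rows in a fixed order."""
    for i in range(10):
        yield {"id": i}


def mindrecord_task_number():
    return NUM_TASKS


def mindrecord_dict_data(task_id):
    """Yield only the rows whose index satisfies count % N == task_id."""
    for count, row in enumerate(read_rows()):
        if count % NUM_TASKS == task_id:
            yield row


# Row IDs picked by each task; together they cover every row exactly once.
picked = [[row["id"] for row in mindrecord_dict_data(k)] for k in range(NUM_TASKS)]
```

With this scheme every task still scans the whole file and only skips the rows it does not own, so the gain comes from parallelizing the per-row processing rather than the read itself.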

### Run data generator

* Run the Python script

    ```shell
    cd ${your_mindspore_home}/example/convert_to_mindrecord
    python writer.py --mindrecord_script xyz [...]
    ```

> You can put this command in the script **run_xyz.sh** for easy execution.