# Guideline for Efficiently Generating MindRecord

<!-- TOC -->

- [What does the example do](#what-does-the-example-do)
- [Example test for Cora](#example-test-for-cora)
- [How to use the example for other dataset](#how-to-use-the-example-for-other-dataset)
    - [Create work space](#create-work-space)
    - [Implement data generator](#implement-data-generator)
    - [Run data generator](#run-data-generator)

<!-- /TOC -->

## What does the example do

This example provides an efficient way to generate MindRecord. Users only need to define the parallel granularity for reading the training data and the function that reads the data for a single task. The example then converts the user's training data into MindRecord in parallel.

1. write_cora.sh: the entry script. Users need to modify its parameters according to their own training data.
2. writer.py: the main script, called by write_cora.sh. It reads the user's training data in parallel and generates MindRecord.
3. cora/mr_api.py: where users define the parallel granularity for reading the training data and the single-task reading function, shown here for the Cora dataset.

## Example test for Cora

1. Download and prepare the Cora dataset as required.

    > [Cora dataset download address](https://github.com/jzaldi/datasets/tree/master/cora)

2. Edit write_cora.sh and modify the parameters.

    ```
    --mindrecord_file: output MindRecord file.
    --mindrecord_partitions: the partitions for MindRecord.
    ```

3. Run the bash script.

    ```bash
    bash write_cora.sh
    ```

## How to use the example for other dataset

### Create work space

Assume the dataset name is 'xyz'.

* Create work space from cora

    ```shell
    cd ${your_mindspore_home}/example/graph_to_mindrecord
    cp -r cora xyz
    ```

### Implement data generator

Edit the dictionary data generator.

* Edit file

    ```shell
    cd ${your_mindspore_home}/example/graph_to_mindrecord
    vi xyz/mr_api.py
    ```

Two APIs, 'mindrecord_task_number' and 'mindrecord_dict_data', must be implemented.

- 'mindrecord_task_number()' returns the number of tasks. Return 1 if the data rows are generated serially. Return N if the generator can be split into N tasks that run in parallel.
- 'mindrecord_dict_data(task_id)' yields dictionary data row by row. 'task_id' ranges from 0 to N-1, where N is the return value of mindrecord_task_number().
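
The two functions can be sketched as follows. This is a minimal illustration, not the contents of cora/mr_api.py: only the two function names come from this guide, while the toy in-memory data, the field names `id` and `label`, the task count of 4, and the round-robin split are assumptions made for the example. Compare with cora/mr_api.py for the fields the writer actually expects.

```python
# Hypothetical sketch of xyz/mr_api.py. Only the two function names are
# taken from this guide; the data, field names and sharding are assumptions.

# Toy in-memory "dataset": (node_id, label) pairs standing in for real files.
_NODES = [(i, i % 3) for i in range(10)]

# Assumed parallel granularity: split the rows into this many tasks.
_TASK_NUMBER = 4


def mindrecord_task_number():
    """Return how many independent tasks the generator can be split into."""
    return _TASK_NUMBER


def mindrecord_dict_data(task_id):
    """Yield one dictionary per data row for the given task (0..N-1)."""
    if not 0 <= task_id < _TASK_NUMBER:
        raise ValueError(f"task_id must be in [0, {_TASK_NUMBER}), got {task_id}")
    # Round-robin sharding: task k handles rows k, k+N, k+2N, ...
    for node_id, label in _NODES[task_id::_TASK_NUMBER]:
        yield {"id": node_id, "label": label}
```

With this split, every row is produced by exactly one task, so the tasks can run independently. Returning 1 from 'mindrecord_task_number()' and yielding all rows for 'task_id' 0 gives the serial case.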

### Run data generator

* Run the Python script

    ```shell
    cd ${your_mindspore_home}/example/graph_to_mindrecord
    python writer.py --mindrecord_script xyz [...]
    ```

> You can put this command in a script **write_xyz.sh** for easy execution.
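
To illustrate how a driver could use the two functions from mr_api.py, the sketch below runs every task and gathers the rows. It is not the actual writer.py, which writes MindRecord files. The helper `collect_rows`, the thread pool, and the toy stand-in functions are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


# Stand-ins for the two functions a dataset module would provide
# (toy data, for illustration only).
def mindrecord_task_number():
    return 3


def mindrecord_dict_data(task_id):
    for i in range(task_id, 9, 3):
        yield {"id": i}


def collect_rows(task_number_fn, dict_data_fn, max_workers=2):
    """Hypothetical driver: run every task and return all rows in task order."""
    n_tasks = task_number_fn()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each task drains its own generator; map() keeps results in task order.
        per_task = pool.map(lambda t: list(dict_data_fn(t)), range(n_tasks))
        return [row for rows in per_task for row in rows]


rows = collect_rows(mindrecord_task_number, mindrecord_dict_data)
print(len(rows))  # 9 rows in total, 3 from each task
```

Because each task only depends on its own 'task_id', the driver can schedule tasks in any order or on any number of workers without changing the set of rows produced.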