
datasets.h 104 kB

added python api based on cpp api (squashed commit history):

- First draft of the Python iterator; added Cifar10 and Cifar100 pybind ports.
- Changed pybind to use IR for Skip and Manifest; made DatasetNode the base for all IR nodes; namespace change and fix so the UT tests work.
- Added VOCDataset, ManifestDataset IR, and ImageFolder IR.
- !63 Added RandomDataset.
- Pybind switch: CelebA and UT.
- !61 CLUE example with class definition (merged branch 'python-api' of gitee.com:ezphlow/mindspore into clue_class_pybind; test cases passing).
- Updated Coco, VOC and TFReader; updated clang-format; reordered datasets_binding.
- !69 Added Generator and moved c_dataset.Iterator to dataset.Iterator (GeneratorDataset added to c_dataset).
- !67 Moved c_datasets and added a sampler wrapper (a create() method is still needed in datasets.py; migration from c_dataset to dataset, part 1).
- !71 Fixed an indentation error; !72 fixed c_api test cases.
- !73 Added CSVDataset.
- Pybind switch: Take; CelebA fixes.
- !75 Moved c_dataset functionality into datasets (fixed existing test cases; working CLUE and ImageFolder; sampler conversion and creation from pybind).
- !77 Added the Python API tree.
- Added MindDataset and TextFileDataset pybind; renamed test_concat.py and test_minddataset_exception.py so they are skipped.
- !80 Added batch IR to the python-api branch; most test cases work (staging, add pybind).
- Enabled more c_api Take and CelebA tests; deleted util_c_api.
- !84 Schema changes in datasets.py.
- !85 Removed input_indexes from the sub-classes.
- !83 Removed the c_dataset package.
- !82 Pybind switch: shuffle.
- !86 Added build_vocab.
- Rebased with upstream/master (_shuffle conflict, BatchNode error); !88 fixed the rebase problem.
- Enabled more unit tests; fixed code typos and nits.
- !91 Fixed a Python vocab hang.
- !89 Added the BucketBatchByLength pybind switch; updated and enabled more test_c_api_*.py tests.
- !95 Added BuildSentencePeiceVocab.
- !96 Fixed more tests; enabled more test_c_api_*; added syncwait.
- !99 Pybind switch for the device op.
- !93 Added getters to the Python API.
- !101 Validate the tree, error if it is a graph; added sync wait.
- !103 Fixed a TFRecord/Random dataset schema problem.
- !102 Added the Filter pybind switch.
- !104 Fixed num_samples.
- !105 Fixed a to_device hang.
- !94 Added Cache support for the CLUE dataset (cache for all dataset ops; format change; Cache conversion).
- Added save pybind; fixed a compile error; modified concat_node.
- !107 Fixed some test cases; enabled and fixed more tests.
- !109 Pybind switch for get_dataset_size.
- Check-code fixes for pylint, cpplint and clang-format.
- !113 Added callback support (revert; one-line dataset_sz; typo fix; got the callback to work).
- !114 Made the Android build compile cleanly; fixed build issues due to rebase.
- !115 Fixed more tests; fixed test_profiling.py.
- !116 Fixed get_dataset_size.
- !117 Added the GetColumnNames pybind switch.
- Code-check fixes (clang-format, cppcheck, cpplint, pylint); deleted duplicate test_c_api_*.py files; more lint fixes.
- !121 Removed an extra call to GetNext in the C++ tests.
- !122 Fixed Schema with Generator; fixed some CSV and MindRecord cases.
- !124 Fixed TFRecord get_dataset_size and added UTs for get_dataset_size.
- !125 Getter separation.
- !126 Fixed sampler.GetNumSamples.
- !127 Assigned a runtime getter to each get function; fixed compile issues.
- !128 Matched master code.
- !129 Cleaned up the ToDevice/Save code.
- !130 Added a cache fix for map and ImageFolder.
- !132 Fixed testing-team issues (queue_name now passed from Python to C++; added Schema.from_json).
- !131 Fixed Cache op issues and deleted de_pipeline (rolled back a C++ change; all cache tests passing).
- !134 Cleaned up datasets.py, part 1.
- !133 Updated validation for SentencePieceVocab.from_dataset (added a type_check for column names).
- Rebased on master 181120 10:20; fixed profiling; temporary solution for catching the status from Node.Build().
- !141 ToDevice termination; pylint fixes.
- !137 Fixed test-team issues and added corresponding tests.
- !138 TreeGetter changes to use OptPass (Zirui); rebase fix.
- !143 Fixed a cpplint issue; pylint fixes in updated test cases.
- !145 Reset the exceptions test case to master.
- !146 Fixed a Check_Pylint error.
- !147 Fixed the Android build.
- !148, !149, !150 Added ToDevice to the iterator list for cleanup at exit (with pylint fixes).
- !152 Fixed an ExecutionTree destructor error.
- !153 The getter pass now only removes the callback and no longer deletes the map op.
- !156 Early __del__ of iterator/to_device.
- !155 Addressed review comments (one-line fix to validators.py; rolled back a signature fix; C++ and Python lint fixes).
- !158 Review rework for dataset bindings, part 1 (reordered the repeat and rename nodes).
- !154 Fixed minor problems in comments (datasets.py, python_tree_consumer.cc, iterators_bindings.cc, iterators.py).
- !157 Added replace_none to datasets.py; addressed comments in tests.
- Resolved a copy problem: overrode the deepcopy method of the device op; added the create_ir_tree method; delete to_device if it already exists.
- Cached getters for shapes and types; get shapes and types together.
- Temporary YOLOv3 relaxation (to be rolled back); bypassed YOLO; set NumWorkers for MapOp; reverted the YOLO and Thor changes.
- Debug code: print more info; updated LOG INFO to LOG ERROR; do not remove epochctrl in the getter pass; removed repeat(1); print batch size; added logs to tree_consumer and the device_queue op.
- Reverted PR 8744.
- __del__ of ToDevice.
- !165 Added #ifndef ENABLE_ANDROID around the device queue print; reverted some changes.
- !166 Getter: get_data_info.
- !168 Added back the tree print (reverted info to warning in one log).
- Release the GIL in GetDataInfo.

Signed-off-by: alex-yuyue <yue.yu1@huawei.com>
5 years ago
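The log above tracks the port of the dataset pipeline from the old de_pipeline to pybind bindings over an IR tree (DatasetNode, BatchNode, MapOp, and so on). A minimal sketch of the Python-side usage this enables, assuming a standard MindSpore install; the ./cifar10 path and the sample count are illustrative, not taken from the log:

    import numpy as np
    import mindspore.dataset as ds

    # Hypothetical directory holding the CIFAR-10 binary files.
    data = ds.Cifar10Dataset("./cifar10", num_samples=32, shuffle=False)

    # map/batch build IR nodes rather than going through the removed de_pipeline.
    data = data.map(operations=lambda img: (img / 255.0).astype(np.float32),
                    input_columns=["image"])
    data = data.batch(8, drop_remainder=True)

    # The Python iterator walks the compiled execution tree.
    for row in data.create_dict_iterator():
        print(row["image"].shape, row["label"].shape)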
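The getter MRs (!93, !109, !117, !125, !127) expose pipeline metadata without iterating a full epoch; per !127, each Python get function is bound to a runtime tree getter. A sketch using the same hypothetical path as above:

    import mindspore.dataset as ds

    data = ds.Cifar10Dataset("./cifar10", num_samples=32).batch(8)

    print(data.get_dataset_size())  # number of batches in one epoch
    print(data.get_col_names())     # ['image', 'label']
    print(data.output_shapes())     # per-column tensor shapes
    print(data.output_types())      # per-column dtypes
    print(data.get_batch_size())    # 8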
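For the vocab-related entries (!86 build_vocab, !91 vocab hang, !133 from_dataset validation), the Python surface is Vocab.from_dataset plus a Lookup map op. A hedged sketch: ./corpus.txt is a hypothetical one-sentence-per-line text file, and the exact Lookup signature varied across releases:

    import mindspore.dataset as ds
    import mindspore.dataset.text as text

    # Hypothetical plain-text corpus; TextFileDataset yields a 'text' column.
    corpus = ds.TextFileDataset("./corpus.txt", shuffle=False)
    corpus = corpus.map(operations=text.WhitespaceTokenizer(),
                        input_columns=["text"])

    # build_vocab runs the pipeline once to collect token frequencies.
    vocab = text.Vocab.from_dataset(corpus, columns=["text"])
    corpus = corpus.map(operations=text.Lookup(vocab), input_columns=["text"])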
added python api based on cpp api 1st draft of python iterator Added Cifar10 and Cifar100 pybind port Change pybind to use IR for Skip and Manifest Signed-off-by: alex-yuyue <yue.yu1@huawei.com> DatasetNode as a base for all IR nodes namespace change Fix the namespace issue and make ut tests work Signed-off-by: alex-yuyue <yue.yu1@huawei.com> Add VOCDataset !63 Added RandomDataset * Added RandomDataset add imagefolder ir Pybind switch: CelebA and UT !61 CLUE example with class definition * Merge branch 'python-api' of gitee.com:ezphlow/mindspore into clue_class_pybind * Passing testcases * Added CLUE, not working add ManifestDataset IR Signed-off-by: alex-yuyue <yue.yu1@huawei.com> Update Coco & VOC & TFReader, Update clang-format, Reorder datasets_binding !69 Add Generator and move c_dataset.Iterator to dataset.Iterator * Add GeneratorDataset to c_dataset * Add GeneratorDataset to c_dataset !67 Moving c_datasets and adding sampler wrapper * Need to add create() method in datasets.py * migration from c_dataset to dataset part 1 !71 Fix indent error * Fix indentation error !72 Fix c_api tests cases * Fix c_api tests cases !73 Added CSV Dataset * Added CSVDataset pybind switch: Take and CelebA fixes !75 move c_dataset functionality to datasets * Fixed existing testcases * Added working clue and imagefolder * Added sampler conversion from pybind * Added sampler creation !77 Add Python API tree * Python API tree add minddataset TextFileDataset pybind Rename to skip test_concat.py and test_minddataset_exception.py !80 Add batch IR to python-api branch, most test cases work * staging III * staging, add pybind Enable more c_api take and CelebA tests; delete util_c_api !84 Schema changes in datasets.py * Schema changes !85 Remove input_indexes from sub-classes * remove input_index from each subclass !83 Remove C datasets * Removed c_dataset package * Remove c_datasets !82 pybind switch: shuffle * pybind switch: shuffle !86 Add build_vocab * Add build_vocab Rebase with upstream/master _shuffle conflict BatchNode error !88 Fix rebase problem * fix rebase problem Enable more unit tests; code typo/nit fixes !91 Fix python vocag hang * Fix python vocab hang !89 Added BucketBatchByLength Pybind switch * Added BucketBatchByLength Update and enable more tet_c_api_*.py tests !95 Add BuildSentencePeiceVocab * - Add BuildSentencePeiceVocab !96 Fix more tests * - Fix some tests - Enable more test_c_api_* - Add syncwait !99 pybind switch for device op * pybind switch for device op !93 Add getters to python API * Add getters to python API !101 Validate tree, error if graph * - Add sync wait !103 TFrecord/Random Datasets schema problem * - TfRecord/Random schem aproblem !102 Added filter pybind switch * Added Filter pybind switch !104 Fix num_samples * - TfRecord/Random schem aproblem !105 Fix to_device hang * Fix to_device hang !94 Adds Cache support for CLUE dataset * Added cache for all dataset ops * format change * Added CLUE cache support * Added Cache conversion Add save pybind fix compile err init modify concat_node !107 Fix some tests cases * Fix tests cases Enable and fix more tests !109 pybind switch for get dataset size * pybind_get_dataset_size some check-code fixes for pylint, cpplint and clang-format !113 Add callback * revert * dataset_sz 1 line * fix typo * get callback to work !114 Make Android compile clean * Make Android Compile Clean Fix build issues due to rebase !115 Fix more tests * Fix tests cases * !93 Add getters to python API fix test_profiling.py !116 fix get dataset size * fix get 
dataset size !117 GetColumnNames pybind switch * Added GetColumnNames pybind switch code-check fixes: clangformat, cppcheck, cpplint, pylint Delete duplicate test_c_api_*.py files; more lint fixes !121 Fix cpp tests * Remove extra call to getNext in cpp tests !122 Fix Schema with Generator * Fix Schema with Generator fix some cases of csv & mindrecord !124 fix tfrecord get_dataset_size and add some UTs * fix tfrecord get dataset size and add some ut for get_dataset_size !125 getter separation * Getter separation !126 Fix sampler.GetNumSamples * Fix sampler.GetNumSampler !127 Assign runtime getter to each get function * Assign runtime getter to each get function Fix compile issues !128 Match master code * Match master code !129 Cleanup DeviceOp/save code * Cleanup ToDevice/Save code !130 Add cache fix * Added cache fix for map and image folder !132 Fix testing team issues * Pass queue_name from python to C++ * Add Schema.from_json !131 Fix Cache op issues and delete de_pipeline * Roll back C++ change * Removed de_pipeline and passing all cache tests. * fixed cache tests !134 Cleanup datasets.py part1 * Cleanup dataset.py part1 !133 Updated validation for SentencePieceVocab.from_dataset * Added type_check for column names in SentencePieceVocab.from_dataset Rebase on master 181120 10:20 fix profiling temporary solution of catching stauts from Node.Build() !141 ToDevice Termination * ToDevice termination pylint fixes !137 Fix test team issues and add some corresponding tests * Fix test team issues and add some corresponding tests !138 TreeGetter changes to use OptPass * Getter changes to use OptPass (Zirui) Rebase fix !143 Fix cpplint issue * Fix cpplint issue pylint fixes in updated testcases !145 Reset exceptions testcase * reset exception test to master !146 Fix Check_Pylint Error * Fix Check_Pylint Error !147 fix android * fix android !148 ToDevice changes * Add ToDevice to the iterator List for cleanup at exit !149 Pylint issue * Add ToDevice to the iterator List for cleanup at exit !150 Pylint 2 * Add ToDevice to the iterator List for cleanup at exit !152 ExecutionTree error * ET destructor error !153 in getter_pass, only remove callback, without deleting map op * getter pass no longer removes map !156 early __del__ of iterator/to_device * early __del__ of iterator !155 Address review comments Eric 1 * Added one liner fix to validators.py * roll back signature fix * lint fix * Eric Address comments 2 * C++ lint fix * Address comments Eric 1 !158 Review rework for dataset bindings - part 1 * Reorder nodes repeat and rename * Review rework for dataset bindings - part 1 !154 Fixing minor problems in the comments (datasets.py, python_tree_consumer.cc, iterators_bindings.cc, and iterators.py) * Fixing minor problems in the comments (datasets.py, python_tree_consum… !157 add replace none * Add replace_none to datasets.py, address comments in tests Trying to resolve copy Override the deepcopy method of deviceop Create_ir_tree method Create_ir_tree method 2 Create_ir_tree method 2 del to_device if already exists del to_device if already exists cache getters shapes and types Added yolov3 relaxation, to be rolled back Get shapes and types together bypass yolo NumWorkers for MapOp revert Yolo revert Thor Print more info Debug code: Update LOG INFO to LOG ERROR do not remove epochctrl for getter pass Remove repeat(1) pritn batch size add log to tree_consumer and device_queue op Revert PR 8744 Signed-off-by: alex-yuyue <yue.yu1@huawei.com> __del__ toDEvice __del__ toDevice2 !165 add ifndef 
ENABLE_ANDROID to device queue print * Add ifndef ENABLE_ANDROID to device queue print revert some changes !166 getter: get_data_info * getter: get_data_info !168 add back tree print * revert info to warnning in one log * add back the missed print tree log Release GIL in GetDataInfo
5 years ago
added python api based on cpp api 1st draft of python iterator Added Cifar10 and Cifar100 pybind port Change pybind to use IR for Skip and Manifest Signed-off-by: alex-yuyue <yue.yu1@huawei.com> DatasetNode as a base for all IR nodes namespace change Fix the namespace issue and make ut tests work Signed-off-by: alex-yuyue <yue.yu1@huawei.com> Add VOCDataset !63 Added RandomDataset * Added RandomDataset add imagefolder ir Pybind switch: CelebA and UT !61 CLUE example with class definition * Merge branch 'python-api' of gitee.com:ezphlow/mindspore into clue_class_pybind * Passing testcases * Added CLUE, not working add ManifestDataset IR Signed-off-by: alex-yuyue <yue.yu1@huawei.com> Update Coco & VOC & TFReader, Update clang-format, Reorder datasets_binding !69 Add Generator and move c_dataset.Iterator to dataset.Iterator * Add GeneratorDataset to c_dataset * Add GeneratorDataset to c_dataset !67 Moving c_datasets and adding sampler wrapper * Need to add create() method in datasets.py * migration from c_dataset to dataset part 1 !71 Fix indent error * Fix indentation error !72 Fix c_api tests cases * Fix c_api tests cases !73 Added CSV Dataset * Added CSVDataset pybind switch: Take and CelebA fixes !75 move c_dataset functionality to datasets * Fixed existing testcases * Added working clue and imagefolder * Added sampler conversion from pybind * Added sampler creation !77 Add Python API tree * Python API tree add minddataset TextFileDataset pybind Rename to skip test_concat.py and test_minddataset_exception.py !80 Add batch IR to python-api branch, most test cases work * staging III * staging, add pybind Enable more c_api take and CelebA tests; delete util_c_api !84 Schema changes in datasets.py * Schema changes !85 Remove input_indexes from sub-classes * remove input_index from each subclass !83 Remove C datasets * Removed c_dataset package * Remove c_datasets !82 pybind switch: shuffle * pybind switch: shuffle !86 Add build_vocab * Add build_vocab Rebase with upstream/master _shuffle conflict BatchNode error !88 Fix rebase problem * fix rebase problem Enable more unit tests; code typo/nit fixes !91 Fix python vocag hang * Fix python vocab hang !89 Added BucketBatchByLength Pybind switch * Added BucketBatchByLength Update and enable more tet_c_api_*.py tests !95 Add BuildSentencePeiceVocab * - Add BuildSentencePeiceVocab !96 Fix more tests * - Fix some tests - Enable more test_c_api_* - Add syncwait !99 pybind switch for device op * pybind switch for device op !93 Add getters to python API * Add getters to python API !101 Validate tree, error if graph * - Add sync wait !103 TFrecord/Random Datasets schema problem * - TfRecord/Random schem aproblem !102 Added filter pybind switch * Added Filter pybind switch !104 Fix num_samples * - TfRecord/Random schem aproblem !105 Fix to_device hang * Fix to_device hang !94 Adds Cache support for CLUE dataset * Added cache for all dataset ops * format change * Added CLUE cache support * Added Cache conversion Add save pybind fix compile err init modify concat_node !107 Fix some tests cases * Fix tests cases Enable and fix more tests !109 pybind switch for get dataset size * pybind_get_dataset_size some check-code fixes for pylint, cpplint and clang-format !113 Add callback * revert * dataset_sz 1 line * fix typo * get callback to work !114 Make Android compile clean * Make Android Compile Clean Fix build issues due to rebase !115 Fix more tests * Fix tests cases * !93 Add getters to python API fix test_profiling.py !116 fix get dataset size * fix get 
dataset size !117 GetColumnNames pybind switch * Added GetColumnNames pybind switch code-check fixes: clangformat, cppcheck, cpplint, pylint Delete duplicate test_c_api_*.py files; more lint fixes !121 Fix cpp tests * Remove extra call to getNext in cpp tests !122 Fix Schema with Generator * Fix Schema with Generator fix some cases of csv & mindrecord !124 fix tfrecord get_dataset_size and add some UTs * fix tfrecord get dataset size and add some ut for get_dataset_size !125 getter separation * Getter separation !126 Fix sampler.GetNumSamples * Fix sampler.GetNumSampler !127 Assign runtime getter to each get function * Assign runtime getter to each get function Fix compile issues !128 Match master code * Match master code !129 Cleanup DeviceOp/save code * Cleanup ToDevice/Save code !130 Add cache fix * Added cache fix for map and image folder !132 Fix testing team issues * Pass queue_name from python to C++ * Add Schema.from_json !131 Fix Cache op issues and delete de_pipeline * Roll back C++ change * Removed de_pipeline and passing all cache tests. * fixed cache tests !134 Cleanup datasets.py part1 * Cleanup dataset.py part1 !133 Updated validation for SentencePieceVocab.from_dataset * Added type_check for column names in SentencePieceVocab.from_dataset Rebase on master 181120 10:20 fix profiling temporary solution of catching stauts from Node.Build() !141 ToDevice Termination * ToDevice termination pylint fixes !137 Fix test team issues and add some corresponding tests * Fix test team issues and add some corresponding tests !138 TreeGetter changes to use OptPass * Getter changes to use OptPass (Zirui) Rebase fix !143 Fix cpplint issue * Fix cpplint issue pylint fixes in updated testcases !145 Reset exceptions testcase * reset exception test to master !146 Fix Check_Pylint Error * Fix Check_Pylint Error !147 fix android * fix android !148 ToDevice changes * Add ToDevice to the iterator List for cleanup at exit !149 Pylint issue * Add ToDevice to the iterator List for cleanup at exit !150 Pylint 2 * Add ToDevice to the iterator List for cleanup at exit !152 ExecutionTree error * ET destructor error !153 in getter_pass, only remove callback, without deleting map op * getter pass no longer removes map !156 early __del__ of iterator/to_device * early __del__ of iterator !155 Address review comments Eric 1 * Added one liner fix to validators.py * roll back signature fix * lint fix * Eric Address comments 2 * C++ lint fix * Address comments Eric 1 !158 Review rework for dataset bindings - part 1 * Reorder nodes repeat and rename * Review rework for dataset bindings - part 1 !154 Fixing minor problems in the comments (datasets.py, python_tree_consumer.cc, iterators_bindings.cc, and iterators.py) * Fixing minor problems in the comments (datasets.py, python_tree_consum… !157 add replace none * Add replace_none to datasets.py, address comments in tests Trying to resolve copy Override the deepcopy method of deviceop Create_ir_tree method Create_ir_tree method 2 Create_ir_tree method 2 del to_device if already exists del to_device if already exists cache getters shapes and types Added yolov3 relaxation, to be rolled back Get shapes and types together bypass yolo NumWorkers for MapOp revert Yolo revert Thor Print more info Debug code: Update LOG INFO to LOG ERROR do not remove epochctrl for getter pass Remove repeat(1) pritn batch size add log to tree_consumer and device_queue op Revert PR 8744 Signed-off-by: alex-yuyue <yue.yu1@huawei.com> __del__ toDEvice __del__ toDevice2 !165 add ifndef 
ENABLE_ANDROID to device queue print * Add ifndef ENABLE_ANDROID to device queue print revert some changes !166 getter: get_data_info * getter: get_data_info !168 add back tree print * revert info to warnning in one log * add back the missed print tree log Release GIL in GetDataInfo
5 years ago
5 years ago
5 years ago
5 years ago
added python api based on cpp api 1st draft of python iterator Added Cifar10 and Cifar100 pybind port Change pybind to use IR for Skip and Manifest Signed-off-by: alex-yuyue <yue.yu1@huawei.com> DatasetNode as a base for all IR nodes namespace change Fix the namespace issue and make ut tests work Signed-off-by: alex-yuyue <yue.yu1@huawei.com> Add VOCDataset !63 Added RandomDataset * Added RandomDataset add imagefolder ir Pybind switch: CelebA and UT !61 CLUE example with class definition * Merge branch 'python-api' of gitee.com:ezphlow/mindspore into clue_class_pybind * Passing testcases * Added CLUE, not working add ManifestDataset IR Signed-off-by: alex-yuyue <yue.yu1@huawei.com> Update Coco & VOC & TFReader, Update clang-format, Reorder datasets_binding !69 Add Generator and move c_dataset.Iterator to dataset.Iterator * Add GeneratorDataset to c_dataset * Add GeneratorDataset to c_dataset !67 Moving c_datasets and adding sampler wrapper * Need to add create() method in datasets.py * migration from c_dataset to dataset part 1 !71 Fix indent error * Fix indentation error !72 Fix c_api tests cases * Fix c_api tests cases !73 Added CSV Dataset * Added CSVDataset pybind switch: Take and CelebA fixes !75 move c_dataset functionality to datasets * Fixed existing testcases * Added working clue and imagefolder * Added sampler conversion from pybind * Added sampler creation !77 Add Python API tree * Python API tree add minddataset TextFileDataset pybind Rename to skip test_concat.py and test_minddataset_exception.py !80 Add batch IR to python-api branch, most test cases work * staging III * staging, add pybind Enable more c_api take and CelebA tests; delete util_c_api !84 Schema changes in datasets.py * Schema changes !85 Remove input_indexes from sub-classes * remove input_index from each subclass !83 Remove C datasets * Removed c_dataset package * Remove c_datasets !82 pybind switch: shuffle * pybind switch: shuffle !86 Add build_vocab * Add build_vocab Rebase with upstream/master _shuffle conflict BatchNode error !88 Fix rebase problem * fix rebase problem Enable more unit tests; code typo/nit fixes !91 Fix python vocag hang * Fix python vocab hang !89 Added BucketBatchByLength Pybind switch * Added BucketBatchByLength Update and enable more tet_c_api_*.py tests !95 Add BuildSentencePeiceVocab * - Add BuildSentencePeiceVocab !96 Fix more tests * - Fix some tests - Enable more test_c_api_* - Add syncwait !99 pybind switch for device op * pybind switch for device op !93 Add getters to python API * Add getters to python API !101 Validate tree, error if graph * - Add sync wait !103 TFrecord/Random Datasets schema problem * - TfRecord/Random schem aproblem !102 Added filter pybind switch * Added Filter pybind switch !104 Fix num_samples * - TfRecord/Random schem aproblem !105 Fix to_device hang * Fix to_device hang !94 Adds Cache support for CLUE dataset * Added cache for all dataset ops * format change * Added CLUE cache support * Added Cache conversion Add save pybind fix compile err init modify concat_node !107 Fix some tests cases * Fix tests cases Enable and fix more tests !109 pybind switch for get dataset size * pybind_get_dataset_size some check-code fixes for pylint, cpplint and clang-format !113 Add callback * revert * dataset_sz 1 line * fix typo * get callback to work !114 Make Android compile clean * Make Android Compile Clean Fix build issues due to rebase !115 Fix more tests * Fix tests cases * !93 Add getters to python API fix test_profiling.py !116 fix get dataset size * fix get 
dataset size !117 GetColumnNames pybind switch * Added GetColumnNames pybind switch code-check fixes: clangformat, cppcheck, cpplint, pylint Delete duplicate test_c_api_*.py files; more lint fixes !121 Fix cpp tests * Remove extra call to getNext in cpp tests !122 Fix Schema with Generator * Fix Schema with Generator fix some cases of csv & mindrecord !124 fix tfrecord get_dataset_size and add some UTs * fix tfrecord get dataset size and add some ut for get_dataset_size !125 getter separation * Getter separation !126 Fix sampler.GetNumSamples * Fix sampler.GetNumSampler !127 Assign runtime getter to each get function * Assign runtime getter to each get function Fix compile issues !128 Match master code * Match master code !129 Cleanup DeviceOp/save code * Cleanup ToDevice/Save code !130 Add cache fix * Added cache fix for map and image folder !132 Fix testing team issues * Pass queue_name from python to C++ * Add Schema.from_json !131 Fix Cache op issues and delete de_pipeline * Roll back C++ change * Removed de_pipeline and passing all cache tests. * fixed cache tests !134 Cleanup datasets.py part1 * Cleanup dataset.py part1 !133 Updated validation for SentencePieceVocab.from_dataset * Added type_check for column names in SentencePieceVocab.from_dataset Rebase on master 181120 10:20 fix profiling temporary solution of catching stauts from Node.Build() !141 ToDevice Termination * ToDevice termination pylint fixes !137 Fix test team issues and add some corresponding tests * Fix test team issues and add some corresponding tests !138 TreeGetter changes to use OptPass * Getter changes to use OptPass (Zirui) Rebase fix !143 Fix cpplint issue * Fix cpplint issue pylint fixes in updated testcases !145 Reset exceptions testcase * reset exception test to master !146 Fix Check_Pylint Error * Fix Check_Pylint Error !147 fix android * fix android !148 ToDevice changes * Add ToDevice to the iterator List for cleanup at exit !149 Pylint issue * Add ToDevice to the iterator List for cleanup at exit !150 Pylint 2 * Add ToDevice to the iterator List for cleanup at exit !152 ExecutionTree error * ET destructor error !153 in getter_pass, only remove callback, without deleting map op * getter pass no longer removes map !156 early __del__ of iterator/to_device * early __del__ of iterator !155 Address review comments Eric 1 * Added one liner fix to validators.py * roll back signature fix * lint fix * Eric Address comments 2 * C++ lint fix * Address comments Eric 1 !158 Review rework for dataset bindings - part 1 * Reorder nodes repeat and rename * Review rework for dataset bindings - part 1 !154 Fixing minor problems in the comments (datasets.py, python_tree_consumer.cc, iterators_bindings.cc, and iterators.py) * Fixing minor problems in the comments (datasets.py, python_tree_consum… !157 add replace none * Add replace_none to datasets.py, address comments in tests Trying to resolve copy Override the deepcopy method of deviceop Create_ir_tree method Create_ir_tree method 2 Create_ir_tree method 2 del to_device if already exists del to_device if already exists cache getters shapes and types Added yolov3 relaxation, to be rolled back Get shapes and types together bypass yolo NumWorkers for MapOp revert Yolo revert Thor Print more info Debug code: Update LOG INFO to LOG ERROR do not remove epochctrl for getter pass Remove repeat(1) pritn batch size add log to tree_consumer and device_queue op Revert PR 8744 Signed-off-by: alex-yuyue <yue.yu1@huawei.com> __del__ toDEvice __del__ toDevice2 !165 add ifndef 
ENABLE_ANDROID to device queue print * Add ifndef ENABLE_ANDROID to device queue print revert some changes !166 getter: get_data_info * getter: get_data_info !168 add back tree print * revert info to warnning in one log * add back the missed print tree log Release GIL in GetDataInfo
5 years ago
added python api based on cpp api 1st draft of python iterator Added Cifar10 and Cifar100 pybind port Change pybind to use IR for Skip and Manifest Signed-off-by: alex-yuyue <yue.yu1@huawei.com> DatasetNode as a base for all IR nodes namespace change Fix the namespace issue and make ut tests work Signed-off-by: alex-yuyue <yue.yu1@huawei.com> Add VOCDataset !63 Added RandomDataset * Added RandomDataset add imagefolder ir Pybind switch: CelebA and UT !61 CLUE example with class definition * Merge branch 'python-api' of gitee.com:ezphlow/mindspore into clue_class_pybind * Passing testcases * Added CLUE, not working add ManifestDataset IR Signed-off-by: alex-yuyue <yue.yu1@huawei.com> Update Coco & VOC & TFReader, Update clang-format, Reorder datasets_binding !69 Add Generator and move c_dataset.Iterator to dataset.Iterator * Add GeneratorDataset to c_dataset * Add GeneratorDataset to c_dataset !67 Moving c_datasets and adding sampler wrapper * Need to add create() method in datasets.py * migration from c_dataset to dataset part 1 !71 Fix indent error * Fix indentation error !72 Fix c_api tests cases * Fix c_api tests cases !73 Added CSV Dataset * Added CSVDataset pybind switch: Take and CelebA fixes !75 move c_dataset functionality to datasets * Fixed existing testcases * Added working clue and imagefolder * Added sampler conversion from pybind * Added sampler creation !77 Add Python API tree * Python API tree add minddataset TextFileDataset pybind Rename to skip test_concat.py and test_minddataset_exception.py !80 Add batch IR to python-api branch, most test cases work * staging III * staging, add pybind Enable more c_api take and CelebA tests; delete util_c_api !84 Schema changes in datasets.py * Schema changes !85 Remove input_indexes from sub-classes * remove input_index from each subclass !83 Remove C datasets * Removed c_dataset package * Remove c_datasets !82 pybind switch: shuffle * pybind switch: shuffle !86 Add build_vocab * Add build_vocab Rebase with upstream/master _shuffle conflict BatchNode error !88 Fix rebase problem * fix rebase problem Enable more unit tests; code typo/nit fixes !91 Fix python vocag hang * Fix python vocab hang !89 Added BucketBatchByLength Pybind switch * Added BucketBatchByLength Update and enable more tet_c_api_*.py tests !95 Add BuildSentencePeiceVocab * - Add BuildSentencePeiceVocab !96 Fix more tests * - Fix some tests - Enable more test_c_api_* - Add syncwait !99 pybind switch for device op * pybind switch for device op !93 Add getters to python API * Add getters to python API !101 Validate tree, error if graph * - Add sync wait !103 TFrecord/Random Datasets schema problem * - TfRecord/Random schem aproblem !102 Added filter pybind switch * Added Filter pybind switch !104 Fix num_samples * - TfRecord/Random schem aproblem !105 Fix to_device hang * Fix to_device hang !94 Adds Cache support for CLUE dataset * Added cache for all dataset ops * format change * Added CLUE cache support * Added Cache conversion Add save pybind fix compile err init modify concat_node !107 Fix some tests cases * Fix tests cases Enable and fix more tests !109 pybind switch for get dataset size * pybind_get_dataset_size some check-code fixes for pylint, cpplint and clang-format !113 Add callback * revert * dataset_sz 1 line * fix typo * get callback to work !114 Make Android compile clean * Make Android Compile Clean Fix build issues due to rebase !115 Fix more tests * Fix tests cases * !93 Add getters to python API fix test_profiling.py !116 fix get dataset size * fix get 
dataset size !117 GetColumnNames pybind switch * Added GetColumnNames pybind switch code-check fixes: clangformat, cppcheck, cpplint, pylint Delete duplicate test_c_api_*.py files; more lint fixes !121 Fix cpp tests * Remove extra call to getNext in cpp tests !122 Fix Schema with Generator * Fix Schema with Generator fix some cases of csv & mindrecord !124 fix tfrecord get_dataset_size and add some UTs * fix tfrecord get dataset size and add some ut for get_dataset_size !125 getter separation * Getter separation !126 Fix sampler.GetNumSamples * Fix sampler.GetNumSampler !127 Assign runtime getter to each get function * Assign runtime getter to each get function Fix compile issues !128 Match master code * Match master code !129 Cleanup DeviceOp/save code * Cleanup ToDevice/Save code !130 Add cache fix * Added cache fix for map and image folder !132 Fix testing team issues * Pass queue_name from python to C++ * Add Schema.from_json !131 Fix Cache op issues and delete de_pipeline * Roll back C++ change * Removed de_pipeline and passing all cache tests. * fixed cache tests !134 Cleanup datasets.py part1 * Cleanup dataset.py part1 !133 Updated validation for SentencePieceVocab.from_dataset * Added type_check for column names in SentencePieceVocab.from_dataset Rebase on master 181120 10:20 fix profiling temporary solution of catching stauts from Node.Build() !141 ToDevice Termination * ToDevice termination pylint fixes !137 Fix test team issues and add some corresponding tests * Fix test team issues and add some corresponding tests !138 TreeGetter changes to use OptPass * Getter changes to use OptPass (Zirui) Rebase fix !143 Fix cpplint issue * Fix cpplint issue pylint fixes in updated testcases !145 Reset exceptions testcase * reset exception test to master !146 Fix Check_Pylint Error * Fix Check_Pylint Error !147 fix android * fix android !148 ToDevice changes * Add ToDevice to the iterator List for cleanup at exit !149 Pylint issue * Add ToDevice to the iterator List for cleanup at exit !150 Pylint 2 * Add ToDevice to the iterator List for cleanup at exit !152 ExecutionTree error * ET destructor error !153 in getter_pass, only remove callback, without deleting map op * getter pass no longer removes map !156 early __del__ of iterator/to_device * early __del__ of iterator !155 Address review comments Eric 1 * Added one liner fix to validators.py * roll back signature fix * lint fix * Eric Address comments 2 * C++ lint fix * Address comments Eric 1 !158 Review rework for dataset bindings - part 1 * Reorder nodes repeat and rename * Review rework for dataset bindings - part 1 !154 Fixing minor problems in the comments (datasets.py, python_tree_consumer.cc, iterators_bindings.cc, and iterators.py) * Fixing minor problems in the comments (datasets.py, python_tree_consum… !157 add replace none * Add replace_none to datasets.py, address comments in tests Trying to resolve copy Override the deepcopy method of deviceop Create_ir_tree method Create_ir_tree method 2 Create_ir_tree method 2 del to_device if already exists del to_device if already exists cache getters shapes and types Added yolov3 relaxation, to be rolled back Get shapes and types together bypass yolo NumWorkers for MapOp revert Yolo revert Thor Print more info Debug code: Update LOG INFO to LOG ERROR do not remove epochctrl for getter pass Remove repeat(1) pritn batch size add log to tree_consumer and device_queue op Revert PR 8744 Signed-off-by: alex-yuyue <yue.yu1@huawei.com> __del__ toDEvice __del__ toDevice2 !165 add ifndef 
ENABLE_ANDROID to device queue print * Add ifndef ENABLE_ANDROID to device queue print revert some changes !166 getter: get_data_info * getter: get_data_info !168 add back tree print * revert info to warnning in one log * add back the missed print tree log Release GIL in GetDataInfo
5 years ago
added python api based on cpp api 1st draft of python iterator Added Cifar10 and Cifar100 pybind port Change pybind to use IR for Skip and Manifest Signed-off-by: alex-yuyue <yue.yu1@huawei.com> DatasetNode as a base for all IR nodes namespace change Fix the namespace issue and make ut tests work Signed-off-by: alex-yuyue <yue.yu1@huawei.com> Add VOCDataset !63 Added RandomDataset * Added RandomDataset add imagefolder ir Pybind switch: CelebA and UT !61 CLUE example with class definition * Merge branch 'python-api' of gitee.com:ezphlow/mindspore into clue_class_pybind * Passing testcases * Added CLUE, not working add ManifestDataset IR Signed-off-by: alex-yuyue <yue.yu1@huawei.com> Update Coco & VOC & TFReader, Update clang-format, Reorder datasets_binding !69 Add Generator and move c_dataset.Iterator to dataset.Iterator * Add GeneratorDataset to c_dataset * Add GeneratorDataset to c_dataset !67 Moving c_datasets and adding sampler wrapper * Need to add create() method in datasets.py * migration from c_dataset to dataset part 1 !71 Fix indent error * Fix indentation error !72 Fix c_api tests cases * Fix c_api tests cases !73 Added CSV Dataset * Added CSVDataset pybind switch: Take and CelebA fixes !75 move c_dataset functionality to datasets * Fixed existing testcases * Added working clue and imagefolder * Added sampler conversion from pybind * Added sampler creation !77 Add Python API tree * Python API tree add minddataset TextFileDataset pybind Rename to skip test_concat.py and test_minddataset_exception.py !80 Add batch IR to python-api branch, most test cases work * staging III * staging, add pybind Enable more c_api take and CelebA tests; delete util_c_api !84 Schema changes in datasets.py * Schema changes !85 Remove input_indexes from sub-classes * remove input_index from each subclass !83 Remove C datasets * Removed c_dataset package * Remove c_datasets !82 pybind switch: shuffle * pybind switch: shuffle !86 Add build_vocab * Add build_vocab Rebase with upstream/master _shuffle conflict BatchNode error !88 Fix rebase problem * fix rebase problem Enable more unit tests; code typo/nit fixes !91 Fix python vocag hang * Fix python vocab hang !89 Added BucketBatchByLength Pybind switch * Added BucketBatchByLength Update and enable more tet_c_api_*.py tests !95 Add BuildSentencePeiceVocab * - Add BuildSentencePeiceVocab !96 Fix more tests * - Fix some tests - Enable more test_c_api_* - Add syncwait !99 pybind switch for device op * pybind switch for device op !93 Add getters to python API * Add getters to python API !101 Validate tree, error if graph * - Add sync wait !103 TFrecord/Random Datasets schema problem * - TfRecord/Random schem aproblem !102 Added filter pybind switch * Added Filter pybind switch !104 Fix num_samples * - TfRecord/Random schem aproblem !105 Fix to_device hang * Fix to_device hang !94 Adds Cache support for CLUE dataset * Added cache for all dataset ops * format change * Added CLUE cache support * Added Cache conversion Add save pybind fix compile err init modify concat_node !107 Fix some tests cases * Fix tests cases Enable and fix more tests !109 pybind switch for get dataset size * pybind_get_dataset_size some check-code fixes for pylint, cpplint and clang-format !113 Add callback * revert * dataset_sz 1 line * fix typo * get callback to work !114 Make Android compile clean * Make Android Compile Clean Fix build issues due to rebase !115 Fix more tests * Fix tests cases * !93 Add getters to python API fix test_profiling.py !116 fix get dataset size * fix get 
dataset size !117 GetColumnNames pybind switch * Added GetColumnNames pybind switch code-check fixes: clangformat, cppcheck, cpplint, pylint Delete duplicate test_c_api_*.py files; more lint fixes !121 Fix cpp tests * Remove extra call to getNext in cpp tests !122 Fix Schema with Generator * Fix Schema with Generator fix some cases of csv & mindrecord !124 fix tfrecord get_dataset_size and add some UTs * fix tfrecord get dataset size and add some ut for get_dataset_size !125 getter separation * Getter separation !126 Fix sampler.GetNumSamples * Fix sampler.GetNumSampler !127 Assign runtime getter to each get function * Assign runtime getter to each get function Fix compile issues !128 Match master code * Match master code !129 Cleanup DeviceOp/save code * Cleanup ToDevice/Save code !130 Add cache fix * Added cache fix for map and image folder !132 Fix testing team issues * Pass queue_name from python to C++ * Add Schema.from_json !131 Fix Cache op issues and delete de_pipeline * Roll back C++ change * Removed de_pipeline and passing all cache tests. * fixed cache tests !134 Cleanup datasets.py part1 * Cleanup dataset.py part1 !133 Updated validation for SentencePieceVocab.from_dataset * Added type_check for column names in SentencePieceVocab.from_dataset Rebase on master 181120 10:20 fix profiling temporary solution of catching stauts from Node.Build() !141 ToDevice Termination * ToDevice termination pylint fixes !137 Fix test team issues and add some corresponding tests * Fix test team issues and add some corresponding tests !138 TreeGetter changes to use OptPass * Getter changes to use OptPass (Zirui) Rebase fix !143 Fix cpplint issue * Fix cpplint issue pylint fixes in updated testcases !145 Reset exceptions testcase * reset exception test to master !146 Fix Check_Pylint Error * Fix Check_Pylint Error !147 fix android * fix android !148 ToDevice changes * Add ToDevice to the iterator List for cleanup at exit !149 Pylint issue * Add ToDevice to the iterator List for cleanup at exit !150 Pylint 2 * Add ToDevice to the iterator List for cleanup at exit !152 ExecutionTree error * ET destructor error !153 in getter_pass, only remove callback, without deleting map op * getter pass no longer removes map !156 early __del__ of iterator/to_device * early __del__ of iterator !155 Address review comments Eric 1 * Added one liner fix to validators.py * roll back signature fix * lint fix * Eric Address comments 2 * C++ lint fix * Address comments Eric 1 !158 Review rework for dataset bindings - part 1 * Reorder nodes repeat and rename * Review rework for dataset bindings - part 1 !154 Fixing minor problems in the comments (datasets.py, python_tree_consumer.cc, iterators_bindings.cc, and iterators.py) * Fixing minor problems in the comments (datasets.py, python_tree_consum… !157 add replace none * Add replace_none to datasets.py, address comments in tests Trying to resolve copy Override the deepcopy method of deviceop Create_ir_tree method Create_ir_tree method 2 Create_ir_tree method 2 del to_device if already exists del to_device if already exists cache getters shapes and types Added yolov3 relaxation, to be rolled back Get shapes and types together bypass yolo NumWorkers for MapOp revert Yolo revert Thor Print more info Debug code: Update LOG INFO to LOG ERROR do not remove epochctrl for getter pass Remove repeat(1) pritn batch size add log to tree_consumer and device_queue op Revert PR 8744 Signed-off-by: alex-yuyue <yue.yu1@huawei.com> __del__ toDEvice __del__ toDevice2 !165 add ifndef 
ENABLE_ANDROID to device queue print * Add ifndef ENABLE_ANDROID to device queue print revert some changes !166 getter: get_data_info * getter: get_data_info !168 add back tree print * revert info to warnning in one log * add back the missed print tree log Release GIL in GetDataInfo
5 years ago
added python api based on cpp api 1st draft of python iterator Added Cifar10 and Cifar100 pybind port Change pybind to use IR for Skip and Manifest Signed-off-by: alex-yuyue <yue.yu1@huawei.com> DatasetNode as a base for all IR nodes namespace change Fix the namespace issue and make ut tests work Signed-off-by: alex-yuyue <yue.yu1@huawei.com> Add VOCDataset !63 Added RandomDataset * Added RandomDataset add imagefolder ir Pybind switch: CelebA and UT !61 CLUE example with class definition * Merge branch 'python-api' of gitee.com:ezphlow/mindspore into clue_class_pybind * Passing testcases * Added CLUE, not working add ManifestDataset IR Signed-off-by: alex-yuyue <yue.yu1@huawei.com> Update Coco & VOC & TFReader, Update clang-format, Reorder datasets_binding !69 Add Generator and move c_dataset.Iterator to dataset.Iterator * Add GeneratorDataset to c_dataset * Add GeneratorDataset to c_dataset !67 Moving c_datasets and adding sampler wrapper * Need to add create() method in datasets.py * migration from c_dataset to dataset part 1 !71 Fix indent error * Fix indentation error !72 Fix c_api tests cases * Fix c_api tests cases !73 Added CSV Dataset * Added CSVDataset pybind switch: Take and CelebA fixes !75 move c_dataset functionality to datasets * Fixed existing testcases * Added working clue and imagefolder * Added sampler conversion from pybind * Added sampler creation !77 Add Python API tree * Python API tree add minddataset TextFileDataset pybind Rename to skip test_concat.py and test_minddataset_exception.py !80 Add batch IR to python-api branch, most test cases work * staging III * staging, add pybind Enable more c_api take and CelebA tests; delete util_c_api !84 Schema changes in datasets.py * Schema changes !85 Remove input_indexes from sub-classes * remove input_index from each subclass !83 Remove C datasets * Removed c_dataset package * Remove c_datasets !82 pybind switch: shuffle * pybind switch: shuffle !86 Add build_vocab * Add build_vocab Rebase with upstream/master _shuffle conflict BatchNode error !88 Fix rebase problem * fix rebase problem Enable more unit tests; code typo/nit fixes !91 Fix python vocag hang * Fix python vocab hang !89 Added BucketBatchByLength Pybind switch * Added BucketBatchByLength Update and enable more tet_c_api_*.py tests !95 Add BuildSentencePeiceVocab * - Add BuildSentencePeiceVocab !96 Fix more tests * - Fix some tests - Enable more test_c_api_* - Add syncwait !99 pybind switch for device op * pybind switch for device op !93 Add getters to python API * Add getters to python API !101 Validate tree, error if graph * - Add sync wait !103 TFrecord/Random Datasets schema problem * - TfRecord/Random schem aproblem !102 Added filter pybind switch * Added Filter pybind switch !104 Fix num_samples * - TfRecord/Random schem aproblem !105 Fix to_device hang * Fix to_device hang !94 Adds Cache support for CLUE dataset * Added cache for all dataset ops * format change * Added CLUE cache support * Added Cache conversion Add save pybind fix compile err init modify concat_node !107 Fix some tests cases * Fix tests cases Enable and fix more tests !109 pybind switch for get dataset size * pybind_get_dataset_size some check-code fixes for pylint, cpplint and clang-format !113 Add callback * revert * dataset_sz 1 line * fix typo * get callback to work !114 Make Android compile clean * Make Android Compile Clean Fix build issues due to rebase !115 Fix more tests * Fix tests cases * !93 Add getters to python API fix test_profiling.py !116 fix get dataset size * fix get 
dataset size !117 GetColumnNames pybind switch * Added GetColumnNames pybind switch code-check fixes: clangformat, cppcheck, cpplint, pylint Delete duplicate test_c_api_*.py files; more lint fixes !121 Fix cpp tests * Remove extra call to getNext in cpp tests !122 Fix Schema with Generator * Fix Schema with Generator fix some cases of csv & mindrecord !124 fix tfrecord get_dataset_size and add some UTs * fix tfrecord get dataset size and add some ut for get_dataset_size !125 getter separation * Getter separation !126 Fix sampler.GetNumSamples * Fix sampler.GetNumSampler !127 Assign runtime getter to each get function * Assign runtime getter to each get function Fix compile issues !128 Match master code * Match master code !129 Cleanup DeviceOp/save code * Cleanup ToDevice/Save code !130 Add cache fix * Added cache fix for map and image folder !132 Fix testing team issues * Pass queue_name from python to C++ * Add Schema.from_json !131 Fix Cache op issues and delete de_pipeline * Roll back C++ change * Removed de_pipeline and passing all cache tests. * fixed cache tests !134 Cleanup datasets.py part1 * Cleanup dataset.py part1 !133 Updated validation for SentencePieceVocab.from_dataset * Added type_check for column names in SentencePieceVocab.from_dataset Rebase on master 181120 10:20 fix profiling temporary solution of catching stauts from Node.Build() !141 ToDevice Termination * ToDevice termination pylint fixes !137 Fix test team issues and add some corresponding tests * Fix test team issues and add some corresponding tests !138 TreeGetter changes to use OptPass * Getter changes to use OptPass (Zirui) Rebase fix !143 Fix cpplint issue * Fix cpplint issue pylint fixes in updated testcases !145 Reset exceptions testcase * reset exception test to master !146 Fix Check_Pylint Error * Fix Check_Pylint Error !147 fix android * fix android !148 ToDevice changes * Add ToDevice to the iterator List for cleanup at exit !149 Pylint issue * Add ToDevice to the iterator List for cleanup at exit !150 Pylint 2 * Add ToDevice to the iterator List for cleanup at exit !152 ExecutionTree error * ET destructor error !153 in getter_pass, only remove callback, without deleting map op * getter pass no longer removes map !156 early __del__ of iterator/to_device * early __del__ of iterator !155 Address review comments Eric 1 * Added one liner fix to validators.py * roll back signature fix * lint fix * Eric Address comments 2 * C++ lint fix * Address comments Eric 1 !158 Review rework for dataset bindings - part 1 * Reorder nodes repeat and rename * Review rework for dataset bindings - part 1 !154 Fixing minor problems in the comments (datasets.py, python_tree_consumer.cc, iterators_bindings.cc, and iterators.py) * Fixing minor problems in the comments (datasets.py, python_tree_consum… !157 add replace none * Add replace_none to datasets.py, address comments in tests Trying to resolve copy Override the deepcopy method of deviceop Create_ir_tree method Create_ir_tree method 2 Create_ir_tree method 2 del to_device if already exists del to_device if already exists cache getters shapes and types Added yolov3 relaxation, to be rolled back Get shapes and types together bypass yolo NumWorkers for MapOp revert Yolo revert Thor Print more info Debug code: Update LOG INFO to LOG ERROR do not remove epochctrl for getter pass Remove repeat(1) pritn batch size add log to tree_consumer and device_queue op Revert PR 8744 Signed-off-by: alex-yuyue <yue.yu1@huawei.com> __del__ toDEvice __del__ toDevice2 !165 add ifndef 
ENABLE_ANDROID to device queue print * Add ifndef ENABLE_ANDROID to device queue print revert some changes !166 getter: get_data_info * getter: get_data_info !168 add back tree print * revert info to warnning in one log * add back the missed print tree log Release GIL in GetDataInfo
5 years ago
added python api based on cpp api 1st draft of python iterator Added Cifar10 and Cifar100 pybind port Change pybind to use IR for Skip and Manifest Signed-off-by: alex-yuyue <yue.yu1@huawei.com> DatasetNode as a base for all IR nodes namespace change Fix the namespace issue and make ut tests work Signed-off-by: alex-yuyue <yue.yu1@huawei.com> Add VOCDataset !63 Added RandomDataset * Added RandomDataset add imagefolder ir Pybind switch: CelebA and UT !61 CLUE example with class definition * Merge branch 'python-api' of gitee.com:ezphlow/mindspore into clue_class_pybind * Passing testcases * Added CLUE, not working add ManifestDataset IR Signed-off-by: alex-yuyue <yue.yu1@huawei.com> Update Coco & VOC & TFReader, Update clang-format, Reorder datasets_binding !69 Add Generator and move c_dataset.Iterator to dataset.Iterator * Add GeneratorDataset to c_dataset * Add GeneratorDataset to c_dataset !67 Moving c_datasets and adding sampler wrapper * Need to add create() method in datasets.py * migration from c_dataset to dataset part 1 !71 Fix indent error * Fix indentation error !72 Fix c_api tests cases * Fix c_api tests cases !73 Added CSV Dataset * Added CSVDataset pybind switch: Take and CelebA fixes !75 move c_dataset functionality to datasets * Fixed existing testcases * Added working clue and imagefolder * Added sampler conversion from pybind * Added sampler creation !77 Add Python API tree * Python API tree add minddataset TextFileDataset pybind Rename to skip test_concat.py and test_minddataset_exception.py !80 Add batch IR to python-api branch, most test cases work * staging III * staging, add pybind Enable more c_api take and CelebA tests; delete util_c_api !84 Schema changes in datasets.py * Schema changes !85 Remove input_indexes from sub-classes * remove input_index from each subclass !83 Remove C datasets * Removed c_dataset package * Remove c_datasets !82 pybind switch: shuffle * pybind switch: shuffle !86 Add build_vocab * Add build_vocab Rebase with upstream/master _shuffle conflict BatchNode error !88 Fix rebase problem * fix rebase problem Enable more unit tests; code typo/nit fixes !91 Fix python vocag hang * Fix python vocab hang !89 Added BucketBatchByLength Pybind switch * Added BucketBatchByLength Update and enable more tet_c_api_*.py tests !95 Add BuildSentencePeiceVocab * - Add BuildSentencePeiceVocab !96 Fix more tests * - Fix some tests - Enable more test_c_api_* - Add syncwait !99 pybind switch for device op * pybind switch for device op !93 Add getters to python API * Add getters to python API !101 Validate tree, error if graph * - Add sync wait !103 TFrecord/Random Datasets schema problem * - TfRecord/Random schem aproblem !102 Added filter pybind switch * Added Filter pybind switch !104 Fix num_samples * - TfRecord/Random schem aproblem !105 Fix to_device hang * Fix to_device hang !94 Adds Cache support for CLUE dataset * Added cache for all dataset ops * format change * Added CLUE cache support * Added Cache conversion Add save pybind fix compile err init modify concat_node !107 Fix some tests cases * Fix tests cases Enable and fix more tests !109 pybind switch for get dataset size * pybind_get_dataset_size some check-code fixes for pylint, cpplint and clang-format !113 Add callback * revert * dataset_sz 1 line * fix typo * get callback to work !114 Make Android compile clean * Make Android Compile Clean Fix build issues due to rebase !115 Fix more tests * Fix tests cases * !93 Add getters to python API fix test_profiling.py !116 fix get dataset size * fix get 
dataset size !117 GetColumnNames pybind switch * Added GetColumnNames pybind switch code-check fixes: clang-format, cppcheck, cpplint, pylint Delete duplicate test_c_api_*.py files; more lint fixes !121 Fix cpp tests * Remove extra call to getNext in cpp tests !122 Fix Schema with Generator * Fix Schema with Generator fix some cases of csv & mindrecord !124 fix tfrecord get_dataset_size and add some UTs * fix tfrecord get dataset size and add some UTs for get_dataset_size !125 getter separation * Getter separation !126 Fix sampler.GetNumSamples * Fix sampler.GetNumSamples !127 Assign runtime getter to each get function * Assign runtime getter to each get function Fix compile issues !128 Match master code * Match master code !129 Cleanup DeviceOp/save code * Cleanup ToDevice/Save code !130 Add cache fix * Added cache fix for map and image folder !132 Fix testing team issues * Pass queue_name from python to C++ * Add Schema.from_json !131 Fix Cache op issues and delete de_pipeline * Roll back C++ change * Removed de_pipeline and passing all cache tests * Fixed cache tests !134 Cleanup datasets.py part 1 * Cleanup datasets.py part 1 !133 Updated validation for SentencePieceVocab.from_dataset * Added type_check for column names in SentencePieceVocab.from_dataset Rebase on master 181120 10:20 fix profiling temporary solution for catching status from Node.Build() !141 ToDevice termination * ToDevice termination pylint fixes !137 Fix test team issues and add some corresponding tests * Fix test team issues and add some corresponding tests !138 TreeGetter changes to use OptPass * Getter changes to use OptPass (Zirui) Rebase fix !143 Fix cpplint issue * Fix cpplint issue pylint fixes in updated testcases !145 Reset exceptions testcase * reset exception test to master !146 Fix Check_Pylint error * Fix Check_Pylint error !147 fix android * fix android !148 ToDevice changes * Add ToDevice to the iterator list for cleanup at exit !149 Pylint issue * Add ToDevice to the iterator list for cleanup at exit !150 Pylint 2 * Add ToDevice to the iterator list for cleanup at exit !152 ExecutionTree error * ET destructor error !153 in getter_pass, only remove callback, without deleting map op * getter pass no longer removes map !156 early __del__ of iterator/to_device * early __del__ of iterator !155 Address review comments Eric 1 * Added one-liner fix to validators.py * roll back signature fix * lint fix * Eric Address comments 2 * C++ lint fix * Address comments Eric 1 !158 Review rework for dataset bindings - part 1 * Reorder nodes repeat and rename * Review rework for dataset bindings - part 1 !154 Fixing minor problems in the comments (datasets.py, python_tree_consumer.cc, iterators_bindings.cc, and iterators.py) * Fixing minor problems in the comments (datasets.py, python_tree_consum… !157 add replace_none * Add replace_none to datasets.py, address comments in tests Trying to resolve copy Override the deepcopy method of deviceop Create_ir_tree method Create_ir_tree method 2 Create_ir_tree method 2 del to_device if already exists del to_device if already exists cache getters shapes and types Added yolov3 relaxation, to be rolled back Get shapes and types together bypass yolo NumWorkers for MapOp revert Yolo revert Thor Print more info Debug code: update LOG INFO to LOG ERROR do not remove epochctrl for getter pass Remove repeat(1) print batch size add log to tree_consumer and device_queue op Revert PR 8744 Signed-off-by: alex-yuyue <yue.yu1@huawei.com> __del__ toDevice __del__ toDevice 2 !165 add ifndef ENABLE_ANDROID to device queue print * Add ifndef ENABLE_ANDROID to device queue print revert some changes !166 getter: get_data_info * getter: get_data_info !168 add back tree print * revert info to warning in one log * add back the missed print tree log Release GIL in GetDataInfo
5 years ago
/**
 * Copyright 2020-2021 Huawei Technologies Co., Ltd
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
#ifndef MINDSPORE_CCSRC_MINDDATA_DATASET_INCLUDE_DATASETS_H_
#define MINDSPORE_CCSRC_MINDDATA_DATASET_INCLUDE_DATASETS_H_

#include <sys/stat.h>
#include <unistd.h>

#include <algorithm>
#include <functional>
#include <map>
#include <memory>
#include <set>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

#include "include/api/dual_abi_helper.h"
#include "include/api/types.h"
#include "minddata/dataset/include/iterator.h"
#include "minddata/dataset/include/json_fwd.hpp"
#include "minddata/dataset/include/samplers.h"
#include "minddata/dataset/include/text.h"
namespace mindspore {
namespace dataset {

class Tensor;
class TensorShape;
class TreeAdapter;
class TreeAdapterLite;
class TreeGetters;
class Vocab;

class DatasetCache;
class DatasetNode;

class Iterator;
class PullBasedIterator;

class TensorOperation;
class SchemaObj;
class SamplerObj;
class CsvBase;

// Dataset classes
class BatchDataset;
class MapDataset;
class ProjectDataset;
class ShuffleDataset;
class BucketBatchByLengthDataset;
class FilterDataset;
class CSVDataset;
class TransferDataset;
class ConcatDataset;
class RenameDataset;

class SentencePieceVocab;
enum class SentencePieceModel;

class DSCallback;

class RepeatDataset;
class SkipDataset;
class TakeDataset;
class ZipDataset;
/// \class Dataset datasets.h
/// \brief A base class to represent a dataset in the data pipeline.
class Dataset : public std::enable_shared_from_this<Dataset> {
 public:
  // Need friend classes so they can access the children_ field
  friend class Iterator;
  friend class TransferNode;

  /// \brief Constructor
  Dataset();

  /// \brief Destructor
  ~Dataset() = default;

  /// \brief Gets the dataset size
  /// \param[in] estimate This is only supported by some of the ops and it's used to speed up the process of getting
  ///     the dataset size at the expense of accuracy.
  /// \return Dataset size. If failed, return -1
  int64_t GetDatasetSize(bool estimate = false);

  /// \brief Gets the output type
  /// \return A vector of DataType. If failed, return an empty vector
  std::vector<mindspore::DataType> GetOutputTypes();

  /// \brief Gets the output shape
  /// \return A vector of TensorShape. If failed, return an empty vector
  std::vector<std::vector<int64_t>> GetOutputShapes();

  /// \brief Gets the batch size
  /// \return int64_t
  int64_t GetBatchSize();

  /// \brief Gets the repeat count
  /// \return int64_t
  int64_t GetRepeatCount();

  /// \brief Gets the number of classes
  /// \return Number of classes. If failed, return -1
  int64_t GetNumClasses();

  /// \brief Gets the column names
  /// \return Names of the columns. If failed, return an empty vector
  std::vector<std::string> GetColumnNames() { return VectorCharToString(GetColumnNamesCharIF()); }

  /// \brief Gets the class indexing
  /// \return A map of ClassIndexing. If failed, return an empty map
  std::vector<std::pair<std::string, std::vector<int32_t>>> GetClassIndexing() {
    return ClassIndexCharToString(GetClassIndexingCharIF());
  }

  /// \brief Setter function for the runtime number of workers
  /// \param[in] num_workers The number of threads in this operator
  /// \return Shared pointer to the original object
  std::shared_ptr<Dataset> SetNumWorkers(int32_t num_workers);

  /// \brief Function to create a PullBasedIterator over the Dataset
  /// \param[in] columns List of columns to be used to specify the order of columns
  /// \return Shared pointer to the PullIterator
  std::shared_ptr<PullIterator> CreatePullBasedIterator(std::vector<std::vector<char>> columns = {});

  /// \brief Function to create an Iterator over the Dataset pipeline
  /// \param[in] columns List of columns to be used to specify the order of columns
  /// \param[in] num_epochs Number of epochs to run through the pipeline (default=-1, which means infinite epochs).
  ///     An empty row is returned at the end of each epoch
  /// \return Shared pointer to the Iterator
  std::shared_ptr<Iterator> CreateIterator(std::vector<std::string> columns = {}, int32_t num_epochs = -1) {
    return CreateIteratorCharIF(VectorStringToChar(columns), num_epochs);
  }
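
  /// \par Example (editor's sketch, not part of the original header)
  /// A minimal way to drain one epoch, assuming Iterator exposes GetNextRow(std::unordered_map<std::string,
  /// mindspore::MSTensor> *) and Stop() as declared in "minddata/dataset/include/iterator.h"; adjust to the
  /// signatures in your version. The Cifar10 factory used here is declared later in this file.
  /// \code
  ///   std::shared_ptr<Dataset> ds = Cifar10("/path/to/cifar10", "test");
  ///   std::shared_ptr<Iterator> iter = ds->CreateIterator({}, /*num_epochs=*/1);
  ///   std::unordered_map<std::string, mindspore::MSTensor> row;
  ///   iter->GetNextRow(&row);
  ///   while (!row.empty()) {  // an empty row marks the end of the epoch
  ///     // consume row["image"] / row["label"] here
  ///     iter->GetNextRow(&row);
  ///   }
  ///   iter->Stop();
  /// \endcode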
  /// \brief Function to transfer data through a device.
  /// \notes If the device is Ascend, features of the data will be transferred one by one. The limitation
  ///     of data transmission per transfer is 256M.
  /// \param[in] queue_name Channel name (default="", create a new unique name).
  /// \param[in] device_type Type of device (default="", get from MSContext).
  /// \param[in] device_id Id of device (default=0, get from MSContext).
  /// \param[in] num_epochs Number of epochs (default=-1, infinite epochs).
  /// \param[in] send_epoch_end Whether to send the end-of-epoch flag to the device or not (default=true).
  /// \param[in] total_batches Number of batches to be sent to the device (default=0, all data).
  /// \param[in] create_data_info_queue Whether to create a queue that stores the types and shapes
  ///     of the data or not (default=false).
  /// \return Returns true if no error was encountered, else false.
  bool DeviceQueue(std::string queue_name = "", std::string device_type = "", int32_t device_id = 0,
                   int32_t num_epochs = -1, bool send_epoch_end = true, int32_t total_batches = 0,
                   bool create_data_info_queue = false) {
    return DeviceQueueCharIF(StringToChar(queue_name), StringToChar(device_type), device_id, num_epochs, send_epoch_end,
                             total_batches, create_data_info_queue);
  }
  /// \brief Function to create a Saver to save the dynamic data processed by the dataset pipeline
  /// \note Usage restrictions:
  ///     1. Supported dataset format: 'mindrecord' only.
  ///     2. To save the samples in order, set the dataset's shuffle to false and num_files to 1.
  ///     3. Before calling the function, do not use a batch operator, repeat operator or data augmentation operators
  ///        with a random attribute in a map operator.
  ///     4. Mindrecord does not support bool, uint64, multi-dimensional uint8 (drop dimension) or
  ///        multi-dimensional string.
  /// \param[in] dataset_path Path to the dataset file
  /// \param[in] num_files Number of dataset files (default=1)
  /// \param[in] dataset_type Dataset format (default="mindrecord")
  /// \return Returns true if no error was encountered, else false
  bool Save(std::string dataset_path, int32_t num_files = 1, std::string dataset_type = "mindrecord") {
    return SaveCharIF(StringToChar(dataset_path), num_files, StringToChar(dataset_type));
  }
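
  /// \par Example (editor's sketch, not part of the original header)
  /// Saving a pipeline that respects the restrictions above (no batch/repeat/random maps, a single
  /// output file); the output path is hypothetical:
  /// \code
  ///   std::shared_ptr<Dataset> ds = Cifar10("/path/to/cifar10", "train");
  ///   bool ok = ds->Save("/tmp/cifar10_train.mindrecord", /*num_files=*/1, "mindrecord");
  /// \endcode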
  /// \brief Function to create a BatchDataset
  /// \notes Combines batch_size number of consecutive rows into batches
  /// \param[in] batch_size The number of rows each batch is created with
  /// \param[in] drop_remainder Determines whether or not to drop the last possibly incomplete
  ///     batch. If true, and if there are fewer than batch_size rows
  ///     available to make the last batch, then those rows will
  ///     be dropped and not propagated to the next node
  /// \return Shared pointer to the current BatchDataset
  std::shared_ptr<BatchDataset> Batch(int32_t batch_size, bool drop_remainder = false);
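
  /// \par Example (editor's sketch, not part of the original header)
  /// Given any pipeline handle ds with 100 rows, batch_size=32 with drop_remainder=false yields four
  /// batches (32, 32, 32, 4), while drop_remainder=true yields only the three full ones:
  /// \code
  ///   ds = ds->Batch(32, /*drop_remainder=*/true);
  /// \endcode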
  /// \brief Function to create a BucketBatchByLengthDataset
  /// \notes Bucket elements according to their lengths. Each bucket will be padded and batched when
  ///     it is full.
  /// \param[in] column_names Columns passed to element_length_function
  /// \param[in] bucket_boundaries A list consisting of the upper boundaries of the buckets.
  ///     Must be strictly increasing. If there are n boundaries, n+1 buckets are created: one bucket for
  ///     [0, bucket_boundaries[0]), one bucket for [bucket_boundaries[i], bucket_boundaries[i+1]) for each
  ///     0<i<n, and one bucket for [bucket_boundaries[n-1], inf).
  /// \param[in] bucket_batch_sizes A list consisting of the batch sizes for each bucket.
  ///     Must contain bucket_boundaries.size() + 1 elements.
  /// \param[in] element_length_function A function pointer that takes in an MSTensorVec and outputs an MSTensorVec.
  ///     The output must contain a single tensor containing a single int32_t. If no value is provided,
  ///     then the size of column_names must be 1, and the size of the first dimension of that column will be taken
  ///     as the length (default=nullptr)
  /// \param[in] pad_info Represents how to batch each column. The key corresponds to the column name; the value must
  ///     be a tuple of 2 elements. The first element corresponds to the shape to pad to, and the second element
  ///     corresponds to the value to pad with. If a column is not specified, then that column will be padded to the
  ///     longest in the current batch, and 0 will be used as the padding value. Any unspecified dimensions will be
  ///     padded to the longest in the current batch, unless pad_to_bucket_boundary is true. If no padding is
  ///     wanted, set pad_info to None (default=empty dictionary).
  /// \param[in] pad_to_bucket_boundary If true, will pad each unspecified dimension in pad_info to the
  ///     bucket_boundary minus 1. If there are any elements that fall into the last bucket,
  ///     an error will occur (default=false).
  /// \param[in] drop_remainder If true, will drop the last batch for each bucket if it is not a full batch
  ///     (default=false).
  /// \return Shared pointer to the current BucketBatchByLengthDataset
  std::shared_ptr<BucketBatchByLengthDataset> BucketBatchByLength(
      const std::vector<std::string> &column_names, const std::vector<int32_t> &bucket_boundaries,
      const std::vector<int32_t> &bucket_batch_sizes,
      std::function<MSTensorVec(MSTensorVec)> element_length_function = nullptr,
      const std::map<std::string, std::pair<std::vector<int64_t>, MSTensor>> &pad_info = {},
      bool pad_to_bucket_boundary = false, bool drop_remainder = false) {
    return std::make_shared<BucketBatchByLengthDataset>(
        shared_from_this(), VectorStringToChar(column_names), bucket_boundaries, bucket_batch_sizes,
        element_length_function, PadInfoStringToChar(pad_info), pad_to_bucket_boundary, drop_remainder);
  }
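
  /// \par Example (editor's sketch, not part of the original header)
  /// With no element_length_function, column_names must hold exactly one column whose first dimension is
  /// used as the length. Boundaries {10, 20} create the buckets [0, 10), [10, 20) and [20, inf), batched
  /// with sizes 32, 16 and 8 respectively (a hypothetical "text" column is assumed):
  /// \code
  ///   ds = ds->BucketBatchByLength({"text"}, {10, 20}, {32, 16, 8});
  /// \endcode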
  /// \brief Function to create a SentencePieceVocab from a source dataset
  /// \notes Build a SentencePieceVocab from a dataset.
  /// \param[in] col_names Column names to get words from. It can be a vector of column names
  /// \param[in] vocab_size Vocabulary size.
  /// \param[in] character_coverage Percentage of characters covered by the model. Must be between
  ///     0.98 and 1.0. Good defaults are 0.9995 for languages with rich character sets like
  ///     Japanese or Chinese, and 1.0 for other languages with small character sets.
  /// \param[in] model_type Model type. Choose from unigram (default), bpe, char, or word.
  ///     The input sentence must be pre-tokenized when using the word type.
  /// \param[in] params A map of additional optional parameters for the sentencepiece library
  /// \return Shared pointer to the SentencePieceVocab
  std::shared_ptr<SentencePieceVocab> BuildSentencePieceVocab(
      const std::vector<std::string> &col_names, int32_t vocab_size, float character_coverage,
      SentencePieceModel model_type, const std::unordered_map<std::string, std::string> &params) {
    return BuildSentencePieceVocabCharIF(VectorStringToChar(col_names), vocab_size, character_coverage, model_type,
                                         UnorderedMapStringToChar(params));
  }
  /// \brief Function to create a Vocab from a source dataset
  /// \notes Build a vocab from a dataset. This collects all the unique words in the dataset and returns a vocab
  ///     which contains the top_k most frequent words (if top_k is specified)
  /// \param[in] columns Column names to get words from. It can be a vector of column names
  /// \param[in] freq_range A tuple of integers (min_frequency, max_frequency). Words within the frequency
  ///     range will be kept. 0 <= min_frequency <= max_frequency <= total_words. min_frequency/max_frequency
  ///     can be left at the defaults, which correspond to 0/total_words respectively
  /// \param[in] top_k Number of words to be built into the vocab. The top_k most frequent words are
  ///     taken. top_k is applied after freq_range. If there are fewer than top_k words, all words are taken
  /// \param[in] special_tokens A list of strings, each one a special token
  /// \param[in] special_first Whether special_tokens will be prepended or appended to the vocab. If special_tokens
  ///     is specified and special_first is left at the default, special_tokens will be prepended
  /// \return Shared pointer to the current Vocab
  std::shared_ptr<Vocab> BuildVocab(const std::vector<std::string> &columns = {},
                                    const std::pair<int64_t, int64_t> &freq_range = {0, kDeMaxFreq},
                                    int64_t top_k = kDeMaxTopk, const std::vector<std::string> &special_tokens = {},
                                    bool special_first = true) {
    return BuildVocabCharIF(VectorStringToChar(columns), freq_range, top_k, VectorStringToChar(special_tokens),
                            special_first);
  }
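
  /// \par Example (editor's sketch, not part of the original header)
  /// Building a vocab of the 5000 most frequent words from a hypothetical "text" column, with two special
  /// tokens prepended:
  /// \code
  ///   std::shared_ptr<Vocab> vocab =
  ///       ds->BuildVocab({"text"}, {0, kDeMaxFreq}, /*top_k=*/5000, {"<pad>", "<unk>"}, /*special_first=*/true);
  /// \endcode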
  /// \brief Function to create a ConcatDataset
  /// \notes Concatenates the datasets in the input
  /// \param[in] datasets List of shared pointers to the datasets that should be concatenated together
  /// \return Shared pointer to the current ConcatDataset
  std::shared_ptr<ConcatDataset> Concat(const std::vector<std::shared_ptr<Dataset>> &datasets) {
    std::vector<std::shared_ptr<Dataset>> all_datasets{shared_from_this()};
    all_datasets.insert(std::end(all_datasets), std::begin(datasets), std::end(datasets));
    return std::make_shared<ConcatDataset>(all_datasets);
  }

  /// \brief Function to filter the dataset by a predicate
  /// \notes If input_columns is not provided or empty, all columns will be used
  /// \param[in] predicate Function callable which returns a boolean value; rows for which it returns false
  ///     are filtered out
  /// \param[in] input_columns List of names of the input columns to filter on
  /// \return Shared pointer to the current FilterDataset
  std::shared_ptr<FilterDataset> Filter(std::function<MSTensorVec(MSTensorVec)> predicate,
                                        const std::vector<std::string> &input_columns = {}) {
    return std::make_shared<FilterDataset>(shared_from_this(), predicate, VectorStringToChar(input_columns));
  }
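
  /// \par Example (editor's sketch, not part of the original header)
  /// The wiring of a predicate: it receives the selected columns as an MSTensorVec and must return a
  /// single tensor holding one boolean. Constructing that boolean tensor is version-specific, so the body
  /// below is only a placeholder:
  /// \code
  ///   MSTensorVec KeepAll(MSTensorVec cols) {
  ///     return cols;  // a real predicate returns one boolean tensor instead
  ///   }
  ///   ds = ds->Filter(KeepAll, {"label"});
  /// \endcode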
  /// \brief Function to create a MapDataset
  /// \notes Applies each operation in operations to this dataset
  /// \param[in] operations Vector of raw pointers to TensorTransform objects to be applied on the dataset. Operations
  ///     are applied in the order they appear in this list
  /// \param[in] input_columns Vector of the names of the columns that will be passed to the first
  ///     operation as input. The size of this list must match the number of
  ///     input columns expected by the first operator. The default input_columns
  ///     is the first column
  /// \param[in] output_columns Vector of names assigned to the columns output by the last operation.
  ///     This parameter is mandatory if len(input_columns) != len(output_columns).
  ///     The size of this list must match the number of output columns of the
  ///     last operation. The default output_columns will have the same
  ///     name as the input columns, i.e., the columns will be replaced
  /// \param[in] project_columns A list of column names to project
  /// \param[in] cache Tensor cache to use (default=nullptr, which means no cache is used).
  /// \param[in] callbacks List of Dataset callbacks to be called (default={}).
  /// \return Shared pointer to the current MapDataset
  std::shared_ptr<MapDataset> Map(std::vector<TensorTransform *> operations,
                                  const std::vector<std::string> &input_columns = {},
                                  const std::vector<std::string> &output_columns = {},
                                  const std::vector<std::string> &project_columns = {},
                                  const std::shared_ptr<DatasetCache> &cache = nullptr,
                                  std::vector<std::shared_ptr<DSCallback>> callbacks = {}) {
    std::vector<std::shared_ptr<TensorOperation>> transform_ops;
    (void)std::transform(
        operations.begin(), operations.end(), std::back_inserter(transform_ops),
        [](TensorTransform *op) -> std::shared_ptr<TensorOperation> { return op != nullptr ? op->Parse() : nullptr; });
    return std::make_shared<MapDataset>(shared_from_this(), transform_ops, VectorStringToChar(input_columns),
                                        VectorStringToChar(output_columns), VectorStringToChar(project_columns), cache,
                                        callbacks);
  }
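
  /// \par Example (editor's sketch, not part of the original header)
  /// Applying two transforms to the "image" column via the raw-pointer overload; vision::Decode and
  /// vision::Resize are assumed from the companion vision header and can be replaced by any
  /// TensorTransform subclasses available in your build:
  /// \code
  ///   vision::Decode decode;
  ///   vision::Resize resize({224, 224});
  ///   ds = ds->Map({&decode, &resize}, /*input_columns=*/{"image"});
  /// \endcode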
  /// \brief Function to create a MapDataset
  /// \notes Applies each operation in operations to this dataset
  /// \param[in] operations Vector of shared pointers to TensorTransform objects to be applied on the dataset.
  ///     Operations are applied in the order they appear in this list
  /// \param[in] input_columns Vector of the names of the columns that will be passed to the first
  ///     operation as input. The size of this list must match the number of
  ///     input columns expected by the first operator. The default input_columns
  ///     is the first column
  /// \param[in] output_columns Vector of names assigned to the columns output by the last operation.
  ///     This parameter is mandatory if len(input_columns) != len(output_columns).
  ///     The size of this list must match the number of output columns of the
  ///     last operation. The default output_columns will have the same
  ///     name as the input columns, i.e., the columns will be replaced
  /// \param[in] project_columns A list of column names to project
  /// \param[in] cache Tensor cache to use (default=nullptr, which means no cache is used).
  /// \param[in] callbacks List of Dataset callbacks to be called (default={}).
  /// \return Shared pointer to the current MapDataset
  std::shared_ptr<MapDataset> Map(std::vector<std::shared_ptr<TensorTransform>> operations,
                                  const std::vector<std::string> &input_columns = {},
                                  const std::vector<std::string> &output_columns = {},
                                  const std::vector<std::string> &project_columns = {},
                                  const std::shared_ptr<DatasetCache> &cache = nullptr,
                                  std::vector<std::shared_ptr<DSCallback>> callbacks = {}) {
    std::vector<std::shared_ptr<TensorOperation>> transform_ops;
    (void)std::transform(operations.begin(), operations.end(), std::back_inserter(transform_ops),
                         [](std::shared_ptr<TensorTransform> op) -> std::shared_ptr<TensorOperation> {
                           return op != nullptr ? op->Parse() : nullptr;
                         });
    return std::make_shared<MapDataset>(shared_from_this(), transform_ops, VectorStringToChar(input_columns),
                                        VectorStringToChar(output_columns), VectorStringToChar(project_columns), cache,
                                        callbacks);
  }

  /// \brief Function to create a MapDataset
  /// \notes Applies each operation in operations to this dataset
  /// \param[in] operations Vector of TensorTransform objects to be applied on the dataset. Operations are applied in
  ///     the order they appear in this list
  /// \param[in] input_columns Vector of the names of the columns that will be passed to the first
  ///     operation as input. The size of this list must match the number of
  ///     input columns expected by the first operator. The default input_columns
  ///     is the first column
  /// \param[in] output_columns Vector of names assigned to the columns output by the last operation.
  ///     This parameter is mandatory if len(input_columns) != len(output_columns).
  ///     The size of this list must match the number of output columns of the
  ///     last operation. The default output_columns will have the same
  ///     name as the input columns, i.e., the columns will be replaced
  /// \param[in] project_columns A list of column names to project
  /// \param[in] cache Tensor cache to use (default=nullptr, which means no cache is used).
  /// \param[in] callbacks List of Dataset callbacks to be called (default={}).
  /// \return Shared pointer to the current MapDataset
  std::shared_ptr<MapDataset> Map(const std::vector<std::reference_wrapper<TensorTransform>> operations,
                                  const std::vector<std::string> &input_columns = {},
                                  const std::vector<std::string> &output_columns = {},
                                  const std::vector<std::string> &project_columns = {},
                                  const std::shared_ptr<DatasetCache> &cache = nullptr,
                                  std::vector<std::shared_ptr<DSCallback>> callbacks = {}) {
    std::vector<std::shared_ptr<TensorOperation>> transform_ops;
    (void)std::transform(operations.begin(), operations.end(), std::back_inserter(transform_ops),
                         [](TensorTransform &op) -> std::shared_ptr<TensorOperation> { return op.Parse(); });
    return std::make_shared<MapDataset>(shared_from_this(), transform_ops, VectorStringToChar(input_columns),
                                        VectorStringToChar(output_columns), VectorStringToChar(project_columns), cache,
                                        callbacks);
  }

  /// \brief Function to create a ProjectDataset
  /// \notes Applies project to the dataset
  /// \param[in] columns The names of the columns to project
  /// \return Shared pointer to the current Dataset
  std::shared_ptr<ProjectDataset> Project(const std::vector<std::string> &columns) {
    return std::make_shared<ProjectDataset>(shared_from_this(), VectorStringToChar(columns));
  }

  /// \brief Function to create a RenameDataset
  /// \notes Renames the columns in the input dataset
  /// \param[in] input_columns List of the input columns to rename
  /// \param[in] output_columns List of the output columns
  /// \return Shared pointer to the current Dataset
  std::shared_ptr<RenameDataset> Rename(const std::vector<std::string> &input_columns,
                                        const std::vector<std::string> &output_columns) {
    return std::make_shared<RenameDataset>(shared_from_this(), VectorStringToChar(input_columns),
                                           VectorStringToChar(output_columns));
  }

  /// \brief Function to create a RepeatDataset
  /// \notes Repeats this dataset count times. Repeats indefinitely if count is -1
  /// \param[in] count Number of times the dataset should be repeated (default=-1, repeat indefinitely)
  /// \return Shared pointer to the current RepeatDataset
  std::shared_ptr<RepeatDataset> Repeat(int32_t count = -1) {
    return std::make_shared<RepeatDataset>(shared_from_this(), count);
  }

  /// \brief Function to create a ShuffleDataset
  /// \notes Randomly shuffles the rows of this dataset
  /// \param[in] buffer_size The size of the buffer (must be larger than 1) for shuffling
  /// \return Shared pointer to the current ShuffleDataset
  std::shared_ptr<ShuffleDataset> Shuffle(int32_t buffer_size) {
    return std::make_shared<ShuffleDataset>(shared_from_this(), buffer_size);
  }

  /// \brief Function to create a SkipDataset
  /// \notes Skips count elements in this dataset.
  /// \param[in] count Number of elements to be skipped.
  /// \return Shared pointer to the current SkipDataset
  std::shared_ptr<SkipDataset> Skip(int32_t count) { return std::make_shared<SkipDataset>(shared_from_this(), count); }

  /// \brief Function to create a TakeDataset
  /// \notes Takes count elements from this dataset.
  /// \param[in] count Number of elements to be taken (default=-1, take the whole dataset).
  /// \return Shared pointer to the current Dataset
  std::shared_ptr<TakeDataset> Take(int32_t count = -1) {
    return std::make_shared<TakeDataset>(shared_from_this(), count);
  }

  /// \brief Function to create a ZipDataset
  /// \notes Applies zip to the dataset
  /// \param[in] datasets A list of shared pointers to the datasets that we want to zip
  /// \return Shared pointer to the current Dataset
  std::shared_ptr<ZipDataset> Zip(const std::vector<std::shared_ptr<Dataset>> &datasets) {
    std::vector<std::shared_ptr<Dataset>> all_datasets = datasets;
    all_datasets.push_back(shared_from_this());
    return std::make_shared<ZipDataset>(all_datasets);
  }
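
  /// \par Example (editor's sketch, not part of the original header)
  /// Because every op returns a shared pointer to a Dataset subclass, the ops above compose left to right:
  /// \code
  ///   auto ds = Cifar10("/path/to/cifar10", "train")
  ///                 ->Skip(100)     // drop the first 100 rows
  ///                 ->Take(1000)    // keep the next 1000
  ///                 ->Shuffle(512)  // shuffle with a 512-row buffer
  ///                 ->Repeat(2);    // run through the selection twice
  /// \endcode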
  std::shared_ptr<DatasetNode> IRNode() { return ir_node_; }

 protected:
  std::shared_ptr<TreeGetters> tree_getters_;
  std::shared_ptr<DatasetNode> ir_node_;

 private:
  // Char interface (CharIF) of GetColumnNames
  std::vector<std::vector<char>> GetColumnNamesCharIF();

  // Char interface (CharIF) of GetClassIndexing
  std::vector<std::pair<std::vector<char>, std::vector<int32_t>>> GetClassIndexingCharIF();

  // Char interface (CharIF) of CreateIterator
  std::shared_ptr<Iterator> CreateIteratorCharIF(std::vector<std::vector<char>> columns, int32_t num_epochs);

  // Char interface (CharIF) of DeviceQueue
  bool DeviceQueueCharIF(const std::vector<char> &queue_name, const std::vector<char> &device_type, int32_t device_id,
                         int32_t num_epochs, bool send_epoch_end, int32_t total_batches, bool create_data_info_queue);

  // Char interface (CharIF) of Save
  bool SaveCharIF(const std::vector<char> &dataset_path, int32_t num_files, const std::vector<char> &dataset_type);

  // Char interface (CharIF) of BuildSentencePieceVocab
  std::shared_ptr<SentencePieceVocab> BuildSentencePieceVocabCharIF(
      const std::vector<std::vector<char>> &col_names, int32_t vocab_size, float character_coverage,
      SentencePieceModel model_type, const std::map<std::vector<char>, std::vector<char>> &params);

  // Char interface (CharIF) of BuildVocab
  std::shared_ptr<Vocab> BuildVocabCharIF(const std::vector<std::vector<char>> &columns,
                                          const std::pair<int64_t, int64_t> &freq_range, int64_t top_k,
                                          const std::vector<std::vector<char>> &special_tokens, bool special_first);
};
class SchemaObj {
 public:
  /// \brief Constructor
  explicit SchemaObj(const std::string &schema_file = "") : SchemaObj(StringToChar(schema_file)) {}

  /// \brief Destructor
  ~SchemaObj() = default;

  /// \brief SchemaObj Init function
  /// \return Status code, OK if schema initialization was successful
  Status Init();

  /// \brief Add a new column to the schema with an unknown shape of rank 1
  /// \param[in] name Name of the column.
  /// \param[in] ms_type Data type of the column (mindspore::DataType).
  /// \return Status code
  Status add_column(const std::string &name, mindspore::DataType ms_type) {
    return add_column_char(StringToChar(name), ms_type);
  }

  /// \brief Add a new column to the schema with an unknown shape of rank 1
  /// \param[in] name Name of the column.
  /// \param[in] ms_type Data type of the column (std::string).
  /// \return Status code
  Status add_column(const std::string &name, const std::string &ms_type) {
    return add_column_char(StringToChar(name), StringToChar(ms_type));
  }

  /// \brief Add a new column to the schema
  /// \param[in] name Name of the column.
  /// \param[in] ms_type Data type of the column (mindspore::DataType).
  /// \param[in] shape Shape of the column.
  /// \return Status code
  Status add_column(const std::string &name, mindspore::DataType ms_type, const std::vector<int32_t> &shape) {
    return add_column_char(StringToChar(name), ms_type, shape);
  }

  /// \brief Add a new column to the schema
  /// \param[in] name Name of the column.
  /// \param[in] ms_type Data type of the column (std::string).
  /// \param[in] shape Shape of the column.
  /// \return Status code
  Status add_column(const std::string &name, const std::string &ms_type, const std::vector<int32_t> &shape) {
    return add_column_char(StringToChar(name), StringToChar(ms_type), shape);
  }

  /// \brief Get a JSON string of the schema
  /// \return JSON string of the schema
  std::string to_json() { return CharToString(to_json_char()); }

  /// \brief Get a JSON string of the schema
  std::string to_string() { return to_json(); }

  /// \brief Set a new value to dataset_type
  void set_dataset_type(std::string dataset_type);

  /// \brief Set a new value to num_rows
  void set_num_rows(int32_t num_rows);

  /// \brief Get the current num_rows
  int32_t get_num_rows() const;

  /// \brief Populate the schema from a JSON string
  /// \param[in] json_string JSON string to be parsed.
  /// \return Status code
  Status FromJSONString(const std::string &json_string) { return FromJSONStringCharIF(StringToChar(json_string)); }

  /// \brief Parse and add column information
  /// \param[in] json_string JSON string holding the column attribute information, decoded from the schema file.
  /// \return Status code
  Status ParseColumnString(const std::string &json_string) {
    return ParseColumnStringCharIF(StringToChar(json_string));
  }

 private:
  /// \brief Parse the columns and add them to the schema
  /// \param[in] columns Dataset attribute information, decoded from the schema file.
  ///     Supports both nlohmann::json::value_t::array and nlohmann::json::value_t::object.
  /// \return Status code
  Status parse_column(nlohmann::json columns);

  /// \brief Populate the schema from a parsed JSON object
  /// \param[in] json_obj Parsed JSON object
  /// \return Status code
  Status from_json(nlohmann::json json_obj);

  // Char constructor of SchemaObj
  explicit SchemaObj(const std::vector<char> &schema_file);

  // Char interface of add_column
  Status add_column_char(const std::vector<char> &name, mindspore::DataType ms_type);
  Status add_column_char(const std::vector<char> &name, const std::vector<char> &ms_type);
  Status add_column_char(const std::vector<char> &name, mindspore::DataType ms_type, const std::vector<int32_t> &shape);
  Status add_column_char(const std::vector<char> &name, const std::vector<char> &ms_type,
                         const std::vector<int32_t> &shape);

  // Char interface of to_json
  const std::vector<char> to_json_char();

  // Char interface of FromJSONString
  Status FromJSONStringCharIF(const std::vector<char> &json_string);

  // Char interface of ParseColumnString
  Status ParseColumnStringCharIF(const std::vector<char> &json_string);

  struct Data;
  std::shared_ptr<Data> data_;
};
class BatchDataset : public Dataset {
 public:
  BatchDataset(std::shared_ptr<Dataset> input, int32_t batch_size, bool drop_remainder = false);
  ~BatchDataset() = default;
};

class BucketBatchByLengthDataset : public Dataset {
 public:
  BucketBatchByLengthDataset(
      std::shared_ptr<Dataset> input, const std::vector<std::vector<char>> &column_names,
      const std::vector<int32_t> &bucket_boundaries, const std::vector<int32_t> &bucket_batch_sizes,
      std::function<MSTensorVec(MSTensorVec)> element_length_function = nullptr,
      const std::map<std::vector<char>, std::pair<std::vector<int64_t>, MSTensor>> &pad_info = {},
      bool pad_to_bucket_boundary = false, bool drop_remainder = false);
  ~BucketBatchByLengthDataset() = default;
};

class ConcatDataset : public Dataset {
 public:
  explicit ConcatDataset(const std::vector<std::shared_ptr<Dataset>> &input);
  ~ConcatDataset() = default;
};

class FilterDataset : public Dataset {
 public:
  FilterDataset(std::shared_ptr<Dataset> input, std::function<MSTensorVec(MSTensorVec)> predicate,
                const std::vector<std::vector<char>> &input_columns);
  ~FilterDataset() = default;
};

class MapDataset : public Dataset {
 public:
  MapDataset(std::shared_ptr<Dataset> input, std::vector<std::shared_ptr<TensorOperation>> operations,
             const std::vector<std::vector<char>> &input_columns, const std::vector<std::vector<char>> &output_columns,
             const std::vector<std::vector<char>> &project_columns, const std::shared_ptr<DatasetCache> &cache,
             std::vector<std::shared_ptr<DSCallback>> callbacks);
  ~MapDataset() = default;
};

class ProjectDataset : public Dataset {
 public:
  ProjectDataset(std::shared_ptr<Dataset> input, const std::vector<std::vector<char>> &columns);
  ~ProjectDataset() = default;
};

class RenameDataset : public Dataset {
 public:
  RenameDataset(std::shared_ptr<Dataset> input, const std::vector<std::vector<char>> &input_columns,
                const std::vector<std::vector<char>> &output_columns);
  ~RenameDataset() = default;
};

class RepeatDataset : public Dataset {
 public:
  RepeatDataset(std::shared_ptr<Dataset> input, int32_t count);
  ~RepeatDataset() = default;
};

class ShuffleDataset : public Dataset {
 public:
  ShuffleDataset(std::shared_ptr<Dataset> input, int32_t buffer_size);
  ~ShuffleDataset() = default;
};

class SkipDataset : public Dataset {
 public:
  SkipDataset(std::shared_ptr<Dataset> input, int32_t count);
  ~SkipDataset() = default;
};

class TakeDataset : public Dataset {
 public:
  TakeDataset(std::shared_ptr<Dataset> input, int32_t count);
  ~TakeDataset() = default;
};

class ZipDataset : public Dataset {
 public:
  explicit ZipDataset(const std::vector<std::shared_ptr<Dataset>> &inputs);
  ~ZipDataset() = default;
};

/// \brief Function to create a SchemaObj
/// \param[in] schema_file Path of the schema file
/// \note This API exists because std::string is constrained by the ABI compile macro while char is not.
/// \return Shared pointer to the current schema
std::shared_ptr<SchemaObj> SchemaCharIF(const std::vector<char> &schema_file);

/// \brief Function to create a SchemaObj
/// \param[in] schema_file Path of the schema file
/// \return Shared pointer to the current schema
inline std::shared_ptr<SchemaObj> Schema(const std::string &schema_file = "") {
  return SchemaCharIF(StringToChar(schema_file));
}
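
/// \par Example (editor's sketch, not part of the original header)
/// Building a schema by hand for a schema-driven source; "image" gets an explicit shape while "label"
/// keeps the default unknown rank-1 shape:
/// \code
///   std::shared_ptr<SchemaObj> schema = Schema();
///   schema->add_column("image", mindspore::DataType::kNumberTypeUInt8, {32, 32, 3});
///   schema->add_column("label", mindspore::DataType::kNumberTypeUInt32);
/// \endcode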
class AlbumDataset : public Dataset {
 public:
  AlbumDataset(const std::vector<char> &dataset_dir, const std::vector<char> &data_schema,
               const std::vector<std::vector<char>> &column_names, bool decode,
               const std::shared_ptr<Sampler> &sampler, const std::shared_ptr<DatasetCache> &cache);
  AlbumDataset(const std::vector<char> &dataset_dir, const std::vector<char> &data_schema,
               const std::vector<std::vector<char>> &column_names, bool decode, Sampler *sampler,
               const std::shared_ptr<DatasetCache> &cache);
  AlbumDataset(const std::vector<char> &dataset_dir, const std::vector<char> &data_schema,
               const std::vector<std::vector<char>> &column_names, bool decode,
               const std::reference_wrapper<Sampler> sampler, const std::shared_ptr<DatasetCache> &cache);
  ~AlbumDataset() = default;
};

/// \brief Function to create an AlbumDataset
/// \notes The generated dataset is specified through setting a schema
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] data_schema Path to the dataset schema file
/// \param[in] column_names Column names used to specify the columns to load; if empty, all columns are read
///     (default={})
/// \param[in] decode Whether to decode the images in the dataset (default=false)
/// \param[in] sampler Shared pointer to a sampler object used to choose samples from the dataset. If sampler is not
///     given, a `RandomSampler` will be used to randomly iterate the entire dataset (default=RandomSampler())
/// \param[in] cache Tensor cache to use (default=nullptr, which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<AlbumDataset> Album(const std::string &dataset_dir, const std::string &data_schema,
                                           const std::vector<std::string> &column_names = {}, bool decode = false,
                                           const std::shared_ptr<Sampler> &sampler = std::make_shared<RandomSampler>(),
                                           const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<AlbumDataset>(StringToChar(dataset_dir), StringToChar(data_schema),
                                        VectorStringToChar(column_names), decode, sampler, cache);
}
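
/// \par Example (editor's sketch, not part of the original header)
/// Loading an album with a hypothetical on-disk layout and schema file:
/// \code
///   auto album = Album("/path/to/album", "/path/to/album/schema.json",
///                      /*column_names=*/{"image", "label"}, /*decode=*/true);
/// \endcode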
/// \brief Function to create an AlbumDataset
/// \notes The generated dataset is specified through setting a schema
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] data_schema Path to the dataset schema file
/// \param[in] column_names Column names used to specify the columns to load
/// \param[in] decode Whether to decode the images in the dataset
/// \param[in] sampler Raw pointer to a sampler object used to choose samples from the dataset.
/// \param[in] cache Tensor cache to use (default=nullptr, which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<AlbumDataset> Album(const std::string &dataset_dir, const std::string &data_schema,
                                           const std::vector<std::string> &column_names, bool decode, Sampler *sampler,
                                           const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<AlbumDataset>(StringToChar(dataset_dir), StringToChar(data_schema),
                                        VectorStringToChar(column_names), decode, sampler, cache);
}

/// \brief Function to create an AlbumDataset
/// \notes The generated dataset is specified through setting a schema
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] data_schema Path to the dataset schema file
/// \param[in] column_names Column names used to specify the columns to load
/// \param[in] decode Whether to decode the images in the dataset
/// \param[in] sampler Sampler object used to choose samples from the dataset.
/// \param[in] cache Tensor cache to use (default=nullptr, which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<AlbumDataset> Album(const std::string &dataset_dir, const std::string &data_schema,
                                           const std::vector<std::string> &column_names, bool decode,
                                           const std::reference_wrapper<Sampler> sampler,
                                           const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<AlbumDataset>(StringToChar(dataset_dir), StringToChar(data_schema),
                                        VectorStringToChar(column_names), decode, sampler, cache);
}
class CelebADataset : public Dataset {
 public:
  explicit CelebADataset(const std::vector<char> &dataset_dir, const std::vector<char> &usage,
                         const std::shared_ptr<Sampler> &sampler, bool decode,
                         const std::set<std::vector<char>> &extensions, const std::shared_ptr<DatasetCache> &cache);
  explicit CelebADataset(const std::vector<char> &dataset_dir, const std::vector<char> &usage, Sampler *sampler,
                         bool decode, const std::set<std::vector<char>> &extensions,
                         const std::shared_ptr<DatasetCache> &cache);
  explicit CelebADataset(const std::vector<char> &dataset_dir, const std::vector<char> &usage,
                         const std::reference_wrapper<Sampler> sampler, bool decode,
                         const std::set<std::vector<char>> &extensions, const std::shared_ptr<DatasetCache> &cache);
  ~CelebADataset() = default;
};

/// \brief Function to create a CelebADataset
/// \notes The generated dataset has two columns ['image', 'attr'].
///     The type of the image tensor is uint8. The attr tensor is uint32 and one-hot encoded.
/// \param[in] dataset_dir Path to the root directory that contains the dataset.
/// \param[in] usage One of "all", "train", "valid" or "test" (default="all").
/// \param[in] sampler Shared pointer to a sampler object used to choose samples from the dataset. If sampler is not
///     given, a `RandomSampler` will be used to randomly iterate the entire dataset (default=RandomSampler())
/// \param[in] decode Decode the images after reading (default=false).
/// \param[in] extensions Set of file extensions to be included in the dataset (default={}).
/// \param[in] cache Tensor cache to use (default=nullptr, which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<CelebADataset> CelebA(
    const std::string &dataset_dir, const std::string &usage = "all",
    const std::shared_ptr<Sampler> &sampler = std::make_shared<RandomSampler>(), bool decode = false,
    const std::set<std::string> &extensions = {}, const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<CelebADataset>(StringToChar(dataset_dir), StringToChar(usage), sampler, decode,
                                         SetStringToChar(extensions), cache);
}

/// \brief Function to create a CelebADataset
/// \notes The generated dataset has two columns ['image', 'attr'].
///     The type of the image tensor is uint8. The attr tensor is uint32 and one-hot encoded.
/// \param[in] dataset_dir Path to the root directory that contains the dataset.
/// \param[in] usage One of "all", "train", "valid" or "test"
/// \param[in] sampler Raw pointer to a sampler object used to choose samples from the dataset.
/// \param[in] decode Decode the images after reading (default=false).
/// \param[in] extensions Set of file extensions to be included in the dataset (default={}).
/// \param[in] cache Tensor cache to use (default=nullptr, which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<CelebADataset> CelebA(const std::string &dataset_dir, const std::string &usage, Sampler *sampler,
                                             bool decode = false, const std::set<std::string> &extensions = {},
                                             const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<CelebADataset>(StringToChar(dataset_dir), StringToChar(usage), sampler, decode,
                                         SetStringToChar(extensions), cache);
}

/// \brief Function to create a CelebADataset
/// \notes The generated dataset has two columns ['image', 'attr'].
///     The type of the image tensor is uint8. The attr tensor is uint32 and one-hot encoded.
/// \param[in] dataset_dir Path to the root directory that contains the dataset.
/// \param[in] usage One of "all", "train", "valid" or "test"
/// \param[in] sampler Sampler object used to choose samples from the dataset.
/// \param[in] decode Decode the images after reading (default=false).
/// \param[in] extensions Set of file extensions to be included in the dataset (default={}).
/// \param[in] cache Tensor cache to use (default=nullptr, which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<CelebADataset> CelebA(const std::string &dataset_dir, const std::string &usage,
                                             const std::reference_wrapper<Sampler> sampler, bool decode = false,
                                             const std::set<std::string> &extensions = {},
                                             const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<CelebADataset>(StringToChar(dataset_dir), StringToChar(usage), sampler, decode,
                                         SetStringToChar(extensions), cache);
}
class Cifar10Dataset : public Dataset {
 public:
  explicit Cifar10Dataset(const std::vector<char> &dataset_dir, const std::vector<char> &usage,
                          const std::shared_ptr<Sampler> &sampler, const std::shared_ptr<DatasetCache> &cache);
  explicit Cifar10Dataset(const std::vector<char> &dataset_dir, const std::vector<char> &usage, Sampler *sampler,
                          const std::shared_ptr<DatasetCache> &cache);
  explicit Cifar10Dataset(const std::vector<char> &dataset_dir, const std::vector<char> &usage,
                          const std::reference_wrapper<Sampler> sampler, const std::shared_ptr<DatasetCache> &cache);
  ~Cifar10Dataset() = default;
};

/// \brief Function to create a Cifar10 Dataset
/// \notes The generated dataset has two columns ["image", "label"]
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] usage Usage of CIFAR10, can be "train", "test" or "all" (default="all").
/// \param[in] sampler Shared pointer to a sampler object used to choose samples from the dataset. If sampler is not
///     given, a `RandomSampler` will be used to randomly iterate the entire dataset (default=RandomSampler())
/// \param[in] cache Tensor cache to use (default=nullptr, which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<Cifar10Dataset> Cifar10(
    const std::string &dataset_dir, const std::string &usage = "all",
    const std::shared_ptr<Sampler> &sampler = std::make_shared<RandomSampler>(),
    const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<Cifar10Dataset>(StringToChar(dataset_dir), StringToChar(usage), sampler, cache);
}
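
/// \par Example (editor's sketch, not part of the original header)
/// The three overloads differ only in how the sampler is passed; SequentialSampler(start_index,
/// num_samples) is assumed from "minddata/dataset/include/samplers.h":
/// \code
///   SequentialSampler sampler(0, 64);                             // first 64 samples
///   auto by_ref = Cifar10("/path/to/cifar10", "train", sampler);  // reference_wrapper overload
///   auto by_ptr = Cifar10("/path/to/cifar10", "train", &sampler); // raw-pointer overload
/// \endcode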
/// \brief Function to create a Cifar10 Dataset
/// \notes The generated dataset has two columns ["image", "label"]
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] usage Usage of CIFAR10, can be "train", "test" or "all"
/// \param[in] sampler Raw pointer to a sampler object used to choose samples from the dataset.
/// \param[in] cache Tensor cache to use (default=nullptr, which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<Cifar10Dataset> Cifar10(const std::string &dataset_dir, const std::string &usage,
                                               Sampler *sampler,
                                               const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<Cifar10Dataset>(StringToChar(dataset_dir), StringToChar(usage), sampler, cache);
}

/// \brief Function to create a Cifar10 Dataset
/// \notes The generated dataset has two columns ["image", "label"]
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] usage Usage of CIFAR10, can be "train", "test" or "all"
/// \param[in] sampler Sampler object used to choose samples from the dataset.
/// \param[in] cache Tensor cache to use (default=nullptr, which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<Cifar10Dataset> Cifar10(const std::string &dataset_dir, const std::string &usage,
                                               const std::reference_wrapper<Sampler> sampler,
                                               const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<Cifar10Dataset>(StringToChar(dataset_dir), StringToChar(usage), sampler, cache);
}
  771. class Cifar100Dataset : public Dataset {
  772. public:
  773. explicit Cifar100Dataset(const std::vector<char> &dataset_dir, const std::vector<char> &usage,
  774. const std::shared_ptr<Sampler> &sampler, const std::shared_ptr<DatasetCache> &cache);
  775. explicit Cifar100Dataset(const std::vector<char> &dataset_dir, const std::vector<char> &usage, Sampler *sampler,
  776. const std::shared_ptr<DatasetCache> &cache);
  777. explicit Cifar100Dataset(const std::vector<char> &dataset_dir, const std::vector<char> &usage,
  778. const std::reference_wrapper<Sampler> sampler, const std::shared_ptr<DatasetCache> &cache);
  779. ~Cifar100Dataset() = default;
  780. };
/// \brief Function to create a Cifar100 Dataset
/// \notes The generated dataset has three columns ["image", "coarse_label", "fine_label"]
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] usage Usage of CIFAR100, can be "train", "test" or "all" (default = "all").
/// \param[in] sampler Shared pointer to a sampler object used to choose samples from the dataset. If sampler is
///     not given, a `RandomSampler` will be used to randomly iterate the entire dataset (default = RandomSampler())
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<Cifar100Dataset> Cifar100(
  const std::string &dataset_dir, const std::string &usage = "all",
  const std::shared_ptr<Sampler> &sampler = std::make_shared<RandomSampler>(),
  const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<Cifar100Dataset>(StringToChar(dataset_dir), StringToChar(usage), sampler, cache);
}

/// \brief Function to create a Cifar100 Dataset
/// \notes The generated dataset has three columns ["image", "coarse_label", "fine_label"]
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] usage Usage of CIFAR100, can be "train", "test" or "all".
/// \param[in] sampler Raw pointer to a sampler object used to choose samples from the dataset.
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<Cifar100Dataset> Cifar100(const std::string &dataset_dir, const std::string &usage,
                                                 Sampler *sampler,
                                                 const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<Cifar100Dataset>(StringToChar(dataset_dir), StringToChar(usage), sampler, cache);
}

/// \brief Function to create a Cifar100 Dataset
/// \notes The generated dataset has three columns ["image", "coarse_label", "fine_label"]
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] usage Usage of CIFAR100, can be "train", "test" or "all".
/// \param[in] sampler Sampler object used to choose samples from the dataset.
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<Cifar100Dataset> Cifar100(const std::string &dataset_dir, const std::string &usage,
                                                 const std::reference_wrapper<Sampler> sampler,
                                                 const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<Cifar100Dataset>(StringToChar(dataset_dir), StringToChar(usage), sampler, cache);
}
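
// Illustrative usage sketch (not from the original docs; path is a placeholder).
// CIFAR-100 yields three columns, so both coarse and fine labels are available:
//
//   std::shared_ptr<Dataset> ds = Cifar100("/path/to/cifar100_data", "all");
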
class CLUEDataset : public Dataset {
 public:
  explicit CLUEDataset(const std::vector<std::vector<char>> &dataset_files, const std::vector<char> &task,
                       const std::vector<char> &usage, int64_t num_samples, ShuffleMode shuffle, int32_t num_shards,
                       int32_t shard_id, const std::shared_ptr<DatasetCache> &cache);
  ~CLUEDataset() = default;
};

/// \brief Function to create a CLUEDataset
/// \notes The generated dataset has a variable number of columns depending on the task and usage
/// \param[in] dataset_files List of files to be read to search for a pattern of files. The list
///     will be sorted in a lexicographical order.
/// \param[in] task The kind of task, one of "AFQMC", "TNEWS", "IFLYTEK", "CMNLI", "WSC" and "CSL" (default="AFQMC").
/// \param[in] usage Part of the dataset to be used, can be "train", "test" or "eval" (default="train").
/// \param[in] num_samples The number of samples to be included in the dataset.
///     (Default = 0 means all samples.)
/// \param[in] shuffle The mode for shuffling data every epoch. (Default=ShuffleMode::kGlobal)
///     Can be any of:
///     ShuffleMode::kFalse - No shuffling is performed.
///     ShuffleMode::kFiles - Shuffle files only.
///     ShuffleMode::kGlobal - Shuffle both the files and samples.
/// \param[in] num_shards Number of shards that the dataset should be divided into. (Default = 1)
/// \param[in] shard_id The shard ID within num_shards. This argument should be
///     specified only when num_shards is also specified. (Default = 0)
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current CLUEDataset
inline std::shared_ptr<CLUEDataset> CLUE(const std::vector<std::string> &dataset_files,
                                         const std::string &task = "AFQMC", const std::string &usage = "train",
                                         int64_t num_samples = 0, ShuffleMode shuffle = ShuffleMode::kGlobal,
                                         int32_t num_shards = 1, int32_t shard_id = 0,
                                         const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<CLUEDataset>(VectorStringToChar(dataset_files), StringToChar(task), StringToChar(usage),
                                       num_samples, shuffle, num_shards, shard_id, cache);
}
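
// Illustrative usage sketch (not from the original docs; file names are
// placeholders): reads the AFQMC training split with global shuffling,
// taking all samples in a single shard.
//
//   std::shared_ptr<Dataset> ds =
//     CLUE({"/path/to/afqmc/train.json"}, "AFQMC", "train", 0, ShuffleMode::kGlobal);
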
class CocoDataset : public Dataset {
 public:
  CocoDataset(const std::vector<char> &dataset_dir, const std::vector<char> &annotation_file,
              const std::vector<char> &task, const bool &decode, const std::shared_ptr<Sampler> &sampler,
              const std::shared_ptr<DatasetCache> &cache);
  CocoDataset(const std::vector<char> &dataset_dir, const std::vector<char> &annotation_file,
              const std::vector<char> &task, const bool &decode, Sampler *sampler,
              const std::shared_ptr<DatasetCache> &cache);
  CocoDataset(const std::vector<char> &dataset_dir, const std::vector<char> &annotation_file,
              const std::vector<char> &task, const bool &decode, const std::reference_wrapper<Sampler> sampler,
              const std::shared_ptr<DatasetCache> &cache);
  ~CocoDataset() = default;
};

/// \brief Function to create a CocoDataset
/// \notes The generated dataset has multi-columns :
///     - task='Detection', column: [['image', dtype=uint8], ['bbox', dtype=float32], ['category_id', dtype=uint32],
///       ['iscrowd', dtype=uint32]].
///     - task='Stuff', column: [['image', dtype=uint8], ['segmentation', dtype=float32], ['iscrowd', dtype=uint32]].
///     - task='Keypoint', column: [['image', dtype=uint8], ['keypoints', dtype=float32],
///       ['num_keypoints', dtype=uint32]].
///     - task='Panoptic', column: [['image', dtype=uint8], ['bbox', dtype=float32], ['category_id', dtype=uint32],
///       ['iscrowd', dtype=uint32], ['area', dtype=uint32]].
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] annotation_file Path to the annotation json
/// \param[in] task Set the task type of reading coco data, now support 'Detection'/'Stuff'/'Panoptic'/'Keypoint'
/// \param[in] decode Decode the images after reading
/// \param[in] sampler Shared pointer to a sampler object used to choose samples from the dataset. If sampler is
///     not given, a `RandomSampler` will be used to randomly iterate the entire dataset (default = RandomSampler())
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<CocoDataset> Coco(const std::string &dataset_dir, const std::string &annotation_file,
                                         const std::string &task = "Detection", const bool &decode = false,
                                         const std::shared_ptr<Sampler> &sampler = std::make_shared<RandomSampler>(),
                                         const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<CocoDataset>(StringToChar(dataset_dir), StringToChar(annotation_file), StringToChar(task),
                                       decode, sampler, cache);
}

/// \brief Function to create a CocoDataset
/// \notes The generated dataset has multi-columns :
///     - task='Detection', column: [['image', dtype=uint8], ['bbox', dtype=float32], ['category_id', dtype=uint32],
///       ['iscrowd', dtype=uint32]].
///     - task='Stuff', column: [['image', dtype=uint8], ['segmentation', dtype=float32], ['iscrowd', dtype=uint32]].
///     - task='Keypoint', column: [['image', dtype=uint8], ['keypoints', dtype=float32],
///       ['num_keypoints', dtype=uint32]].
///     - task='Panoptic', column: [['image', dtype=uint8], ['bbox', dtype=float32], ['category_id', dtype=uint32],
///       ['iscrowd', dtype=uint32], ['area', dtype=uint32]].
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] annotation_file Path to the annotation json
/// \param[in] task Set the task type of reading coco data, now support 'Detection'/'Stuff'/'Panoptic'/'Keypoint'
/// \param[in] decode Decode the images after reading
/// \param[in] sampler Raw pointer to a sampler object used to choose samples from the dataset.
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<CocoDataset> Coco(const std::string &dataset_dir, const std::string &annotation_file,
                                         const std::string &task, const bool &decode, Sampler *sampler,
                                         const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<CocoDataset>(StringToChar(dataset_dir), StringToChar(annotation_file), StringToChar(task),
                                       decode, sampler, cache);
}

/// \brief Function to create a CocoDataset
/// \notes The generated dataset has multi-columns :
///     - task='Detection', column: [['image', dtype=uint8], ['bbox', dtype=float32], ['category_id', dtype=uint32],
///       ['iscrowd', dtype=uint32]].
///     - task='Stuff', column: [['image', dtype=uint8], ['segmentation', dtype=float32], ['iscrowd', dtype=uint32]].
///     - task='Keypoint', column: [['image', dtype=uint8], ['keypoints', dtype=float32],
///       ['num_keypoints', dtype=uint32]].
///     - task='Panoptic', column: [['image', dtype=uint8], ['bbox', dtype=float32], ['category_id', dtype=uint32],
///       ['iscrowd', dtype=uint32], ['area', dtype=uint32]].
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] annotation_file Path to the annotation json
/// \param[in] task Set the task type of reading coco data, now support 'Detection'/'Stuff'/'Panoptic'/'Keypoint'
/// \param[in] decode Decode the images after reading
/// \param[in] sampler Sampler object used to choose samples from the dataset.
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<CocoDataset> Coco(const std::string &dataset_dir, const std::string &annotation_file,
                                         const std::string &task, const bool &decode,
                                         const std::reference_wrapper<Sampler> sampler,
                                         const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<CocoDataset>(StringToChar(dataset_dir), StringToChar(annotation_file), StringToChar(task),
                                       decode, sampler, cache);
}
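
// Illustrative usage sketch (not from the original docs; paths are placeholders):
// a detection-task pipeline that decodes images while reading.
//
//   std::shared_ptr<Dataset> ds = Coco("/path/to/coco/train2017",
//                                      "/path/to/annotations/instances.json",
//                                      "Detection", true);
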
class CSVDataset : public Dataset {
 public:
  explicit CSVDataset(const std::vector<std::vector<char>> &dataset_files, char field_delim,
                      const std::vector<std::shared_ptr<CsvBase>> &column_defaults,
                      const std::vector<std::vector<char>> &column_names, int64_t num_samples, ShuffleMode shuffle,
                      int32_t num_shards, int32_t shard_id, const std::shared_ptr<DatasetCache> &cache);
  ~CSVDataset() = default;
};

/// \brief Function to create a CSVDataset
/// \notes The generated dataset has a variable number of columns
/// \param[in] dataset_files List of files to be read to search for a pattern of files. The list
///     will be sorted in a lexicographical order.
/// \param[in] field_delim A char that indicates the delimiter to separate fields (default=',').
/// \param[in] column_defaults List of default values for the CSV fields (default={}). Each item in the list must be
///     of a valid type (float, int, or string). If this is not provided, all columns are treated as string type.
/// \param[in] column_names List of column names of the dataset (default={}). If this is not provided, the
///     column_names are inferred from the first row of the CSV file.
/// \param[in] num_samples The number of samples to be included in the dataset.
///     (Default = 0 means all samples.)
/// \param[in] shuffle The mode for shuffling data every epoch. (Default=ShuffleMode::kGlobal)
///     Can be any of:
///     ShuffleMode::kFalse - No shuffling is performed.
///     ShuffleMode::kFiles - Shuffle files only.
///     ShuffleMode::kGlobal - Shuffle both the files and samples.
/// \param[in] num_shards Number of shards that the dataset should be divided into. (Default = 1)
/// \param[in] shard_id The shard ID within num_shards. This argument should be
///     specified only when num_shards is also specified. (Default = 0)
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<CSVDataset> CSV(const std::vector<std::string> &dataset_files, char field_delim = ',',
                                       const std::vector<std::shared_ptr<CsvBase>> &column_defaults = {},
                                       const std::vector<std::string> &column_names = {}, int64_t num_samples = 0,
                                       ShuffleMode shuffle = ShuffleMode::kGlobal, int32_t num_shards = 1,
                                       int32_t shard_id = 0, const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<CSVDataset>(VectorStringToChar(dataset_files), field_delim, column_defaults,
                                      VectorStringToChar(column_names), num_samples, shuffle, num_shards, shard_id,
                                      cache);
}
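
// Illustrative usage sketch (not from the original docs; file and column names
// are placeholders): with empty column_defaults every field is read as a string.
//
//   std::shared_ptr<Dataset> ds =
//     CSV({"/path/to/data.csv"}, ',', {}, {"col1", "col2"});
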
class ImageFolderDataset : public Dataset {
 public:
  explicit ImageFolderDataset(const std::vector<char> &dataset_dir, bool decode,
                              const std::shared_ptr<Sampler> &sampler, const std::set<std::vector<char>> &extensions,
                              const std::map<std::vector<char>, int32_t> &class_indexing,
                              const std::shared_ptr<DatasetCache> &cache);
  explicit ImageFolderDataset(const std::vector<char> &dataset_dir, bool decode, Sampler *sampler,
                              const std::set<std::vector<char>> &extensions,
                              const std::map<std::vector<char>, int32_t> &class_indexing,
                              const std::shared_ptr<DatasetCache> &cache);
  explicit ImageFolderDataset(const std::vector<char> &dataset_dir, bool decode,
                              const std::reference_wrapper<Sampler> sampler,
                              const std::set<std::vector<char>> &extensions,
                              const std::map<std::vector<char>, int32_t> &class_indexing,
                              const std::shared_ptr<DatasetCache> &cache);
  ~ImageFolderDataset() = default;
};

/// \brief Function to create an ImageFolderDataset
/// \notes A source dataset that reads images from a tree of directories
///     All images within one folder have the same label
///     The generated dataset has two columns ["image", "label"]
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] decode A flag to decode in ImageFolder
/// \param[in] sampler Shared pointer to a sampler object used to choose samples from the dataset. If sampler is
///     not given, a `RandomSampler` will be used to randomly iterate the entire dataset (default = RandomSampler())
/// \param[in] extensions File extensions to be read
/// \param[in] class_indexing A class name to label map
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current ImageFolderDataset
inline std::shared_ptr<ImageFolderDataset> ImageFolder(
  const std::string &dataset_dir, bool decode = false,
  const std::shared_ptr<Sampler> &sampler = std::make_shared<RandomSampler>(),
  const std::set<std::string> &extensions = {}, const std::map<std::string, int32_t> &class_indexing = {},
  const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<ImageFolderDataset>(StringToChar(dataset_dir), decode, sampler, SetStringToChar(extensions),
                                              MapStringToChar(class_indexing), cache);
}

/// \brief Function to create an ImageFolderDataset
/// \notes A source dataset that reads images from a tree of directories
///     All images within one folder have the same label
///     The generated dataset has two columns ["image", "label"]
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] decode A flag to decode in ImageFolder
/// \param[in] sampler Raw pointer to a sampler object used to choose samples from the dataset.
/// \param[in] extensions File extensions to be read
/// \param[in] class_indexing A class name to label map
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current ImageFolderDataset
inline std::shared_ptr<ImageFolderDataset> ImageFolder(const std::string &dataset_dir, bool decode, Sampler *sampler,
                                                       const std::set<std::string> &extensions = {},
                                                       const std::map<std::string, int32_t> &class_indexing = {},
                                                       const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<ImageFolderDataset>(StringToChar(dataset_dir), decode, sampler, SetStringToChar(extensions),
                                              MapStringToChar(class_indexing), cache);
}

/// \brief Function to create an ImageFolderDataset
/// \notes A source dataset that reads images from a tree of directories
///     All images within one folder have the same label
///     The generated dataset has two columns ["image", "label"]
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] decode A flag to decode in ImageFolder
/// \param[in] sampler Sampler object used to choose samples from the dataset.
/// \param[in] extensions File extensions to be read
/// \param[in] class_indexing A class name to label map
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current ImageFolderDataset
inline std::shared_ptr<ImageFolderDataset> ImageFolder(const std::string &dataset_dir, bool decode,
                                                       const std::reference_wrapper<Sampler> sampler,
                                                       const std::set<std::string> &extensions = {},
                                                       const std::map<std::string, int32_t> &class_indexing = {},
                                                       const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<ImageFolderDataset>(StringToChar(dataset_dir), decode, sampler, SetStringToChar(extensions),
                                              MapStringToChar(class_indexing), cache);
}
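
// Illustrative usage sketch (not from the original docs; path and class names
// are placeholders): labels follow the folder layout, optionally remapped
// through class_indexing.
//
//   std::shared_ptr<Dataset> ds =
//     ImageFolder("/path/to/images", true, std::make_shared<RandomSampler>(),
//                 {".jpg", ".png"}, {{"cat", 0}, {"dog", 1}});
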
class ManifestDataset : public Dataset {
 public:
  explicit ManifestDataset(const std::vector<char> &dataset_file, const std::vector<char> &usage,
                           const std::shared_ptr<Sampler> &sampler,
                           const std::map<std::vector<char>, int32_t> &class_indexing, bool decode,
                           const std::shared_ptr<DatasetCache> &cache);
  explicit ManifestDataset(const std::vector<char> &dataset_file, const std::vector<char> &usage, Sampler *sampler,
                           const std::map<std::vector<char>, int32_t> &class_indexing, bool decode,
                           const std::shared_ptr<DatasetCache> &cache);
  explicit ManifestDataset(const std::vector<char> &dataset_file, const std::vector<char> &usage,
                           const std::reference_wrapper<Sampler> sampler,
                           const std::map<std::vector<char>, int32_t> &class_indexing, bool decode,
                           const std::shared_ptr<DatasetCache> &cache);
  ~ManifestDataset() = default;
};

/// \brief Function to create a ManifestDataset
/// \notes The generated dataset has two columns ["image", "label"]
/// \param[in] dataset_file The dataset file to be read
/// \param[in] usage Part of the dataset to be used, can be "train", "eval" or "inference" (default="train")
/// \param[in] sampler Shared pointer to a sampler object used to choose samples from the dataset. If sampler is
///     not given, a `RandomSampler` will be used to randomly iterate the entire dataset (default = RandomSampler())
/// \param[in] class_indexing A str-to-int mapping from label name to index (default={}, the folder
///     names will be sorted alphabetically and each class will be given a unique index starting from 0).
/// \param[in] decode Decode the images after reading (default=false).
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current ManifestDataset
inline std::shared_ptr<ManifestDataset> Manifest(
  const std::string &dataset_file, const std::string &usage = "train",
  const std::shared_ptr<Sampler> &sampler = std::make_shared<RandomSampler>(),
  const std::map<std::string, int32_t> &class_indexing = {}, bool decode = false,
  const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<ManifestDataset>(StringToChar(dataset_file), StringToChar(usage), sampler,
                                           MapStringToChar(class_indexing), decode, cache);
}

/// \brief Function to create a ManifestDataset
/// \notes The generated dataset has two columns ["image", "label"]
/// \param[in] dataset_file The dataset file to be read
/// \param[in] usage Part of the dataset to be used, can be "train", "eval" or "inference"
/// \param[in] sampler Raw pointer to a sampler object used to choose samples from the dataset.
/// \param[in] class_indexing A str-to-int mapping from label name to index (default={}, the folder
///     names will be sorted alphabetically and each class will be given a unique index starting from 0).
/// \param[in] decode Decode the images after reading (default=false).
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current ManifestDataset
inline std::shared_ptr<ManifestDataset> Manifest(const std::string &dataset_file, const std::string &usage,
                                                 Sampler *sampler,
                                                 const std::map<std::string, int32_t> &class_indexing = {},
                                                 bool decode = false,
                                                 const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<ManifestDataset>(StringToChar(dataset_file), StringToChar(usage), sampler,
                                           MapStringToChar(class_indexing), decode, cache);
}

/// \brief Function to create a ManifestDataset
/// \notes The generated dataset has two columns ["image", "label"]
/// \param[in] dataset_file The dataset file to be read
/// \param[in] usage Part of the dataset to be used, can be "train", "eval" or "inference"
/// \param[in] sampler Sampler object used to choose samples from the dataset.
/// \param[in] class_indexing A str-to-int mapping from label name to index (default={}, the folder
///     names will be sorted alphabetically and each class will be given a unique index starting from 0).
/// \param[in] decode Decode the images after reading (default=false).
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current ManifestDataset
inline std::shared_ptr<ManifestDataset> Manifest(const std::string &dataset_file, const std::string &usage,
                                                 const std::reference_wrapper<Sampler> sampler,
                                                 const std::map<std::string, int32_t> &class_indexing = {},
                                                 bool decode = false,
                                                 const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<ManifestDataset>(StringToChar(dataset_file), StringToChar(usage), sampler,
                                           MapStringToChar(class_indexing), decode, cache);
}
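
// Illustrative usage sketch (not from the original docs; the manifest path is a
// placeholder):
//
//   std::shared_ptr<Dataset> ds = Manifest("/path/to/data.manifest", "train");
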
class MindDataDataset : public Dataset {
 public:
  explicit MindDataDataset(const std::vector<char> &dataset_file, const std::vector<std::vector<char>> &columns_list,
                           const std::shared_ptr<Sampler> &sampler, nlohmann::json padded_sample, int64_t num_padded);
  explicit MindDataDataset(const std::vector<char> &dataset_file, const std::vector<std::vector<char>> &columns_list,
                           Sampler *sampler, nlohmann::json padded_sample, int64_t num_padded);
  explicit MindDataDataset(const std::vector<char> &dataset_file, const std::vector<std::vector<char>> &columns_list,
                           const std::reference_wrapper<Sampler> sampler, nlohmann::json padded_sample,
                           int64_t num_padded);
  explicit MindDataDataset(const std::vector<std::vector<char>> &dataset_files,
                           const std::vector<std::vector<char>> &columns_list, const std::shared_ptr<Sampler> &sampler,
                           nlohmann::json padded_sample, int64_t num_padded);
  explicit MindDataDataset(const std::vector<std::vector<char>> &dataset_files,
                           const std::vector<std::vector<char>> &columns_list, Sampler *sampler,
                           nlohmann::json padded_sample, int64_t num_padded);
  explicit MindDataDataset(const std::vector<std::vector<char>> &dataset_files,
                           const std::vector<std::vector<char>> &columns_list,
                           const std::reference_wrapper<Sampler> sampler, nlohmann::json padded_sample,
                           int64_t num_padded);
  ~MindDataDataset() = default;
};

/// \brief Function to create a MindDataDataset
/// \param[in] dataset_file File name of one component of a mindrecord source. Other files with identical source
///     in the same path will be found and loaded automatically.
/// \param[in] columns_list List of columns to be read (default={})
/// \param[in] sampler Shared pointer to a sampler object used to choose samples from the dataset. If sampler is
///     not given, a `RandomSampler` will be used to randomly iterate the entire dataset (default = RandomSampler()),
///     supported sampler list: SubsetRandomSampler, PkSampler, RandomSampler, SequentialSampler, DistributedSampler.
/// \param[in] padded_sample Samples will be appended to the dataset, where keys are the same as columns_list.
/// \param[in] num_padded Number of padding samples. Dataset size plus num_padded should be divisible by num_shards.
/// \return Shared pointer to the current MindDataDataset
inline std::shared_ptr<MindDataDataset> MindData(
  const std::string &dataset_file, const std::vector<std::string> &columns_list = {},
  const std::shared_ptr<Sampler> &sampler = std::make_shared<RandomSampler>(), nlohmann::json padded_sample = nullptr,
  int64_t num_padded = 0) {
  return std::make_shared<MindDataDataset>(StringToChar(dataset_file), VectorStringToChar(columns_list), sampler,
                                           padded_sample, num_padded);
}

/// \brief Function to create a MindDataDataset
/// \param[in] dataset_file File name of one component of a mindrecord source. Other files with identical source
///     in the same path will be found and loaded automatically.
/// \param[in] columns_list List of columns to be read
/// \param[in] sampler Raw pointer to a sampler object used to choose samples from the dataset.
///     supported sampler list: SubsetRandomSampler, PkSampler, RandomSampler, SequentialSampler, DistributedSampler.
/// \param[in] padded_sample Samples will be appended to the dataset, where keys are the same as columns_list.
/// \param[in] num_padded Number of padding samples. Dataset size plus num_padded should be divisible by num_shards.
/// \return Shared pointer to the current MindDataDataset
inline std::shared_ptr<MindDataDataset> MindData(const std::string &dataset_file,
                                                 const std::vector<std::string> &columns_list, Sampler *sampler,
                                                 nlohmann::json padded_sample = nullptr, int64_t num_padded = 0) {
  return std::make_shared<MindDataDataset>(StringToChar(dataset_file), VectorStringToChar(columns_list), sampler,
                                           padded_sample, num_padded);
}

/// \brief Function to create a MindDataDataset
/// \param[in] dataset_file File name of one component of a mindrecord source. Other files with identical source
///     in the same path will be found and loaded automatically.
/// \param[in] columns_list List of columns to be read
/// \param[in] sampler Sampler object used to choose samples from the dataset.
///     supported sampler list: SubsetRandomSampler, PkSampler, RandomSampler, SequentialSampler, DistributedSampler.
/// \param[in] padded_sample Samples will be appended to the dataset, where keys are the same as columns_list.
/// \param[in] num_padded Number of padding samples. Dataset size plus num_padded should be divisible by num_shards.
/// \return Shared pointer to the current MindDataDataset
inline std::shared_ptr<MindDataDataset> MindData(const std::string &dataset_file,
                                                 const std::vector<std::string> &columns_list,
                                                 const std::reference_wrapper<Sampler> sampler,
                                                 nlohmann::json padded_sample = nullptr, int64_t num_padded = 0) {
  return std::make_shared<MindDataDataset>(StringToChar(dataset_file), VectorStringToChar(columns_list), sampler,
                                           padded_sample, num_padded);
}

/// \brief Function to create a MindDataDataset
/// \param[in] dataset_files List of dataset files to be read directly.
/// \param[in] columns_list List of columns to be read (default={})
/// \param[in] sampler Shared pointer to a sampler object used to choose samples from the dataset. If sampler is
///     not given, a `RandomSampler` will be used to randomly iterate the entire dataset (default = RandomSampler()),
///     supported sampler list: SubsetRandomSampler, PkSampler, RandomSampler, SequentialSampler, DistributedSampler.
/// \param[in] padded_sample Samples will be appended to the dataset, where keys are the same as columns_list.
/// \param[in] num_padded Number of padding samples. Dataset size plus num_padded should be divisible by num_shards.
/// \return Shared pointer to the current MindDataDataset
inline std::shared_ptr<MindDataDataset> MindData(
  const std::vector<std::string> &dataset_files, const std::vector<std::string> &columns_list = {},
  const std::shared_ptr<Sampler> &sampler = std::make_shared<RandomSampler>(), nlohmann::json padded_sample = nullptr,
  int64_t num_padded = 0) {
  return std::make_shared<MindDataDataset>(VectorStringToChar(dataset_files), VectorStringToChar(columns_list), sampler,
                                           padded_sample, num_padded);
}

/// \brief Function to create a MindDataDataset
/// \param[in] dataset_files List of dataset files to be read directly.
/// \param[in] columns_list List of columns to be read
/// \param[in] sampler Raw pointer to a sampler object used to choose samples from the dataset.
///     supported sampler list: SubsetRandomSampler, PkSampler, RandomSampler, SequentialSampler, DistributedSampler.
/// \param[in] padded_sample Samples will be appended to the dataset, where keys are the same as columns_list.
/// \param[in] num_padded Number of padding samples. Dataset size plus num_padded should be divisible by num_shards.
/// \return Shared pointer to the current MindDataDataset
inline std::shared_ptr<MindDataDataset> MindData(const std::vector<std::string> &dataset_files,
                                                 const std::vector<std::string> &columns_list, Sampler *sampler,
                                                 nlohmann::json padded_sample = nullptr, int64_t num_padded = 0) {
  return std::make_shared<MindDataDataset>(VectorStringToChar(dataset_files), VectorStringToChar(columns_list), sampler,
                                           padded_sample, num_padded);
}

/// \brief Function to create a MindDataDataset
/// \param[in] dataset_files List of dataset files to be read directly.
/// \param[in] columns_list List of columns to be read
/// \param[in] sampler Sampler object used to choose samples from the dataset.
///     supported sampler list: SubsetRandomSampler, PkSampler, RandomSampler, SequentialSampler, DistributedSampler.
/// \param[in] padded_sample Samples will be appended to the dataset, where keys are the same as columns_list.
/// \param[in] num_padded Number of padding samples. Dataset size plus num_padded should be divisible by num_shards.
/// \return Shared pointer to the current MindDataDataset
inline std::shared_ptr<MindDataDataset> MindData(const std::vector<std::string> &dataset_files,
                                                 const std::vector<std::string> &columns_list,
                                                 const std::reference_wrapper<Sampler> sampler,
                                                 nlohmann::json padded_sample = nullptr, int64_t num_padded = 0) {
  return std::make_shared<MindDataDataset>(VectorStringToChar(dataset_files), VectorStringToChar(columns_list), sampler,
                                           padded_sample, num_padded);
}
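
// Illustrative usage sketch (not from the original docs; file and column names
// are placeholders): the single-file form locates sibling mindrecord files
// automatically, while the list form reads exactly the files given.
//
//   std::shared_ptr<Dataset> ds1 = MindData("/path/to/data.mindrecord", {"image", "label"});
//   std::shared_ptr<Dataset> ds2 = MindData(std::vector<std::string>{
//     "/path/to/part0.mindrecord", "/path/to/part1.mindrecord"});
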
class MnistDataset : public Dataset {
 public:
  explicit MnistDataset(const std::vector<char> &dataset_dir, const std::vector<char> &usage,
                        const std::shared_ptr<Sampler> &sampler, const std::shared_ptr<DatasetCache> &cache);
  explicit MnistDataset(const std::vector<char> &dataset_dir, const std::vector<char> &usage, Sampler *sampler,
                        const std::shared_ptr<DatasetCache> &cache);
  explicit MnistDataset(const std::vector<char> &dataset_dir, const std::vector<char> &usage,
                        const std::reference_wrapper<Sampler> sampler, const std::shared_ptr<DatasetCache> &cache);
  ~MnistDataset() = default;
};

/// \brief Function to create a MnistDataset
/// \notes The generated dataset has two columns ["image", "label"]
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] usage Usage of MNIST, can be "train", "test" or "all" (default = "all").
/// \param[in] sampler Shared pointer to a sampler object used to choose samples from the dataset. If sampler is
///     not given, a `RandomSampler` will be used to randomly iterate the entire dataset (default = RandomSampler())
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current MnistDataset
inline std::shared_ptr<MnistDataset> Mnist(const std::string &dataset_dir, const std::string &usage = "all",
                                           const std::shared_ptr<Sampler> &sampler = std::make_shared<RandomSampler>(),
                                           const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<MnistDataset>(StringToChar(dataset_dir), StringToChar(usage), sampler, cache);
}

/// \brief Function to create a MnistDataset
/// \notes The generated dataset has two columns ["image", "label"]
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] usage Usage of MNIST, can be "train", "test" or "all"
/// \param[in] sampler Raw pointer to a sampler object used to choose samples from the dataset.
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current MnistDataset
inline std::shared_ptr<MnistDataset> Mnist(const std::string &dataset_dir, const std::string &usage, Sampler *sampler,
                                           const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<MnistDataset>(StringToChar(dataset_dir), StringToChar(usage), sampler, cache);
}

/// \brief Function to create a MnistDataset
/// \notes The generated dataset has two columns ["image", "label"]
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] usage Usage of MNIST, can be "train", "test" or "all"
/// \param[in] sampler Sampler object used to choose samples from the dataset.
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current MnistDataset
inline std::shared_ptr<MnistDataset> Mnist(const std::string &dataset_dir, const std::string &usage,
                                           const std::reference_wrapper<Sampler> sampler,
                                           const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<MnistDataset>(StringToChar(dataset_dir), StringToChar(usage), sampler, cache);
}
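
// Illustrative usage sketch (not from the original docs; path is a placeholder):
//
//   std::shared_ptr<Dataset> ds = Mnist("/path/to/mnist_data", "train");
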
/// \brief Function to create a ConcatDataset
/// \notes Overload "+" operator to concat two datasets
/// \param[in] datasets1 Shared pointer to the first dataset to be concatenated
/// \param[in] datasets2 Shared pointer to the second dataset to be concatenated
/// \return Shared pointer to the current ConcatDataset
inline std::shared_ptr<ConcatDataset> operator+(const std::shared_ptr<Dataset> &datasets1,
                                                const std::shared_ptr<Dataset> &datasets2) {
  return std::make_shared<ConcatDataset>(std::vector({datasets1, datasets2}));
}
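
// Illustrative usage sketch (not from the original docs): with the overloaded
// operator, the two Mnist placeholders below are concatenated back to back.
//
//   std::shared_ptr<Dataset> train = Mnist("/path/to/mnist_data", "train");
//   std::shared_ptr<Dataset> test = Mnist("/path/to/mnist_data", "test");
//   std::shared_ptr<ConcatDataset> both = train + test;
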
class RandomDataDataset : public Dataset {
 public:
  RandomDataDataset(const int32_t &total_rows, std::shared_ptr<SchemaObj> schema,
                    const std::vector<std::vector<char>> &columns_list, std::shared_ptr<DatasetCache> cache);
  RandomDataDataset(const int32_t &total_rows, const std::vector<char> &schema_path,
                    const std::vector<std::vector<char>> &columns_list, std::shared_ptr<DatasetCache> cache);
  ~RandomDataDataset() = default;
};

/// \brief Function to create a RandomDataset
/// \param[in] total_rows Number of rows for the dataset to generate (default=0, number of rows is random)
/// \param[in] schema SchemaObj to set column type, data type and data shape
/// \param[in] columns_list List of columns to be read (default={}, read all columns)
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current Dataset
template <typename T = std::shared_ptr<SchemaObj>>
std::shared_ptr<RandomDataDataset> RandomData(const int32_t &total_rows = 0, const T &schema = nullptr,
                                              const std::vector<std::string> &columns_list = {},
                                              const std::shared_ptr<DatasetCache> &cache = nullptr) {
  std::shared_ptr<RandomDataDataset> ds;
  if constexpr (std::is_same<T, std::nullptr_t>::value || std::is_same<T, std::shared_ptr<SchemaObj>>::value) {
    std::shared_ptr<SchemaObj> schema_obj = schema;
    ds =
      std::make_shared<RandomDataDataset>(total_rows, std::move(schema_obj), VectorStringToChar(columns_list), cache);
  } else {
    ds = std::make_shared<RandomDataDataset>(total_rows, StringToChar(schema), VectorStringToChar(columns_list), cache);
  }
  return ds;
}
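
// Illustrative usage sketch (not from the original docs): the template accepts
// either a SchemaObj or a schema file path; the JSON path below is a placeholder.
//
//   std::shared_ptr<RandomDataDataset> ds =
//     RandomData(50, std::string("/path/to/schema.json"), {"image"});
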
class TextFileDataset : public Dataset {
 public:
  explicit TextFileDataset(const std::vector<std::vector<char>> &dataset_files, int64_t num_samples,
                           ShuffleMode shuffle, int32_t num_shards, int32_t shard_id,
                           const std::shared_ptr<DatasetCache> &cache);
  ~TextFileDataset() = default;
};

/// \brief Function to create a TextFileDataset
/// \notes The generated dataset has one column ['text']
/// \param[in] dataset_files List of files to be read to search for a pattern of files. The list
///     will be sorted in a lexicographical order.
/// \param[in] num_samples The number of samples to be included in the dataset.
///     (Default = 0 means all samples.)
/// \param[in] shuffle The mode for shuffling data every epoch. (Default=ShuffleMode::kGlobal)
///     Can be any of:
///     ShuffleMode::kFalse - No shuffling is performed.
///     ShuffleMode::kFiles - Shuffle files only.
///     ShuffleMode::kGlobal - Shuffle both the files and samples.
/// \param[in] num_shards Number of shards that the dataset should be divided into. (Default = 1)
/// \param[in] shard_id The shard ID within num_shards. This argument should be
///     specified only when num_shards is also specified. (Default = 0)
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current TextFileDataset
inline std::shared_ptr<TextFileDataset> TextFile(const std::vector<std::string> &dataset_files, int64_t num_samples = 0,
                                                 ShuffleMode shuffle = ShuffleMode::kGlobal, int32_t num_shards = 1,
                                                 int32_t shard_id = 0,
                                                 const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<TextFileDataset>(VectorStringToChar(dataset_files), num_samples, shuffle, num_shards,
                                           shard_id, cache);
}
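
// Illustrative usage sketch (not from the original docs; file name is a
// placeholder): reads every line of the file into the single 'text' column
// without shuffling.
//
//   std::shared_ptr<Dataset> ds = TextFile({"/path/to/corpus.txt"}, 0, ShuffleMode::kFalse);
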
class TFRecordDataset : public Dataset {
 public:
  TFRecordDataset(const std::vector<std::vector<char>> &dataset_files, const std::vector<char> &schema,
                  const std::vector<std::vector<char>> &columns_list, int64_t num_samples, ShuffleMode shuffle,
                  int32_t num_shards, int32_t shard_id, bool shard_equal_rows, std::shared_ptr<DatasetCache> cache);
  /// \brief Constructor
  /// \note Parameter 'schema' is shared pointer to Schema object
  TFRecordDataset(const std::vector<std::vector<char>> &dataset_files, std::shared_ptr<SchemaObj> schema,
                  const std::vector<std::vector<char>> &columns_list, int64_t num_samples, ShuffleMode shuffle,
                  int32_t num_shards, int32_t shard_id, bool shard_equal_rows, std::shared_ptr<DatasetCache> cache);
  ~TFRecordDataset() = default;
};

/// \brief Function to create a TFRecordDataset
/// \param[in] dataset_files List of files to be read to search for a pattern of files. The list
///     will be sorted in a lexicographical order.
/// \param[in] schema SchemaObj or string to schema path. (Default = nullptr, which means that the
///     metadata from the TFData file is considered the schema.)
/// \param[in] columns_list List of columns to be read. (Default = {}, read all columns)
/// \param[in] num_samples The number of samples to be included in the dataset.
///     (Default = 0 means all samples.)
///     If num_samples is 0 and numRows(parsed from schema) does not exist, read the full dataset;
///     If num_samples is 0 and numRows(parsed from schema) is greater than 0, read numRows rows;
///     If both num_samples and numRows(parsed from schema) are greater than 0, read num_samples rows.
/// \param[in] shuffle The mode for shuffling data every epoch. (Default = ShuffleMode::kGlobal)
///     Can be any of:
///     ShuffleMode::kFalse - No shuffling is performed.
///     ShuffleMode::kFiles - Shuffle files only.
///     ShuffleMode::kGlobal - Shuffle both the files and samples.
/// \param[in] num_shards Number of shards that the dataset should be divided into. (Default = 1)
/// \param[in] shard_id The shard ID within num_shards. This argument should be specified only
///     when num_shards is also specified. (Default = 0)
/// \param[in] shard_equal_rows Get equal rows for all shards. (Default = false; the number of rows in
///     each shard may not be equal)
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current TFRecordDataset
template <typename T = std::shared_ptr<SchemaObj>>
std::shared_ptr<TFRecordDataset> TFRecord(const std::vector<std::string> &dataset_files, const T &schema = nullptr,
                                          const std::vector<std::string> &columns_list = {}, int64_t num_samples = 0,
                                          ShuffleMode shuffle = ShuffleMode::kGlobal, int32_t num_shards = 1,
                                          int32_t shard_id = 0, bool shard_equal_rows = false,
                                          const std::shared_ptr<DatasetCache> &cache = nullptr) {
  std::shared_ptr<TFRecordDataset> ds = nullptr;
  if constexpr (std::is_same<T, std::nullptr_t>::value || std::is_same<T, std::shared_ptr<SchemaObj>>::value) {
    std::shared_ptr<SchemaObj> schema_obj = schema;
    ds = std::make_shared<TFRecordDataset>(VectorStringToChar(dataset_files), std::move(schema_obj),
                                           VectorStringToChar(columns_list), num_samples, shuffle, num_shards, shard_id,
                                           shard_equal_rows, cache);
  } else {
    std::string schema_path = schema;
    if (!schema_path.empty()) {
      struct stat sb;
      int rc = stat(schema_path.c_str(), &sb);
      if (rc != 0) {
        return nullptr;
      }
    }
    ds = std::make_shared<TFRecordDataset>(VectorStringToChar(dataset_files), StringToChar(schema_path),
                                           VectorStringToChar(columns_list), num_samples, shuffle, num_shards, shard_id,
                                           shard_equal_rows, cache);
  }
  return ds;
}
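
// Illustrative usage sketch (not from the original docs; paths are placeholders).
// Note that the schema-path variant stats the file first and returns nullptr if
// it does not exist, so the result should be checked before use.
//
//   std::shared_ptr<TFRecordDataset> ds =
//     TFRecord({"/path/to/data.tfrecord"}, std::string("/path/to/schema.json"), {"image", "label"});
//   if (ds == nullptr) { /* handle the missing schema file */ }
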
class VOCDataset : public Dataset {
 public:
  explicit VOCDataset(const std::vector<char> &dataset_dir, const std::vector<char> &task,
                      const std::vector<char> &usage, const std::map<std::vector<char>, int32_t> &class_indexing,
                      bool decode, const std::shared_ptr<Sampler> &sampler, const std::shared_ptr<DatasetCache> &cache);
  explicit VOCDataset(const std::vector<char> &dataset_dir, const std::vector<char> &task,
                      const std::vector<char> &usage, const std::map<std::vector<char>, int32_t> &class_indexing,
                      bool decode, Sampler *sampler, const std::shared_ptr<DatasetCache> &cache);
  explicit VOCDataset(const std::vector<char> &dataset_dir, const std::vector<char> &task,
                      const std::vector<char> &usage, const std::map<std::vector<char>, int32_t> &class_indexing,
                      bool decode, const std::reference_wrapper<Sampler> sampler,
                      const std::shared_ptr<DatasetCache> &cache);
  ~VOCDataset() = default;
};

/// \brief Function to create a VOCDataset
/// \notes The generated dataset has multi-columns :
///     - task='Detection', column: [['image', dtype=uint8], ['bbox', dtype=float32], ['label', dtype=uint32],
///       ['difficult', dtype=uint32], ['truncate', dtype=uint32]].
///     - task='Segmentation', column: [['image', dtype=uint8], ['target', dtype=uint8]].
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] task Set the task type of reading voc data, currently only "Segmentation" and "Detection" are supported
/// \param[in] usage The type of data list text file to be read (default = "train").
/// \param[in] class_indexing A str-to-int mapping from label name to index, only valid in "Detection" task
/// \param[in] decode Decode the images after reading
/// \param[in] sampler Shared pointer to a sampler object used to choose samples from the dataset. If sampler is
///     not given, a `RandomSampler` will be used to randomly iterate the entire dataset (default = RandomSampler())
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<VOCDataset> VOC(const std::string &dataset_dir, const std::string &task = "Segmentation",
                                       const std::string &usage = "train",
                                       const std::map<std::string, int32_t> &class_indexing = {}, bool decode = false,
                                       const std::shared_ptr<Sampler> &sampler = std::make_shared<RandomSampler>(),
                                       const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<VOCDataset>(StringToChar(dataset_dir), StringToChar(task), StringToChar(usage),
                                      MapStringToChar(class_indexing), decode, sampler, cache);
}

/// \brief Function to create a VOCDataset
/// \notes The generated dataset has multi-columns :
///     - task='Detection', column: [['image', dtype=uint8], ['bbox', dtype=float32], ['label', dtype=uint32],
///       ['difficult', dtype=uint32], ['truncate', dtype=uint32]].
///     - task='Segmentation', column: [['image', dtype=uint8], ['target', dtype=uint8]].
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] task Set the task type of reading voc data, currently only "Segmentation" and "Detection" are supported
/// \param[in] usage The type of data list text file to be read.
/// \param[in] class_indexing A str-to-int mapping from label name to index, only valid in "Detection" task
/// \param[in] decode Decode the images after reading
/// \param[in] sampler Raw pointer to a sampler object used to choose samples from the dataset.
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<VOCDataset> VOC(const std::string &dataset_dir, const std::string &task,
                                       const std::string &usage, const std::map<std::string, int32_t> &class_indexing,
                                       bool decode, Sampler *sampler,
                                       const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<VOCDataset>(StringToChar(dataset_dir), StringToChar(task), StringToChar(usage),
                                      MapStringToChar(class_indexing), decode, sampler, cache);
}

/// \brief Function to create a VOCDataset
/// \notes The generated dataset has multi-columns :
///     - task='Detection', column: [['image', dtype=uint8], ['bbox', dtype=float32], ['label', dtype=uint32],
///       ['difficult', dtype=uint32], ['truncate', dtype=uint32]].
///     - task='Segmentation', column: [['image', dtype=uint8], ['target', dtype=uint8]].
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] task Set the task type of reading voc data, currently only "Segmentation" and "Detection" are supported
/// \param[in] usage The type of data list text file to be read.
/// \param[in] class_indexing A str-to-int mapping from label name to index, only valid in "Detection" task
/// \param[in] decode Decode the images after reading
/// \param[in] sampler Sampler object used to choose samples from the dataset.
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<VOCDataset> VOC(const std::string &dataset_dir, const std::string &task,
                                       const std::string &usage, const std::map<std::string, int32_t> &class_indexing,
                                       bool decode, const std::reference_wrapper<Sampler> sampler,
                                       const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<VOCDataset>(StringToChar(dataset_dir), StringToChar(task), StringToChar(usage),
                                      MapStringToChar(class_indexing), decode, sampler, cache);
}
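
// Illustrative usage sketch (not from the original docs; the VOC root is a
// placeholder): a detection-task read where class_indexing maps label names to
// integer ids.
//
//   std::shared_ptr<Dataset> ds =
//     VOC("/path/to/VOC2012", "Detection", "train", {{"car", 0}, {"person", 1}});
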
// Char interface (CharIF) overload of CreateDatasetCache; called through the inline wrapper below.
std::shared_ptr<DatasetCache> CreateDatasetCacheCharIF(session_id_type id, uint64_t mem_sz, bool spill,
                                                       std::optional<std::vector<char>> hostname = std::nullopt,
                                                       std::optional<int32_t> port = std::nullopt,
                                                       std::optional<int32_t> num_connections = std::nullopt,
                                                       std::optional<int32_t> prefetch_sz = std::nullopt);

/// \brief Function to create a cache to be attached to a dataset
/// \param[in] id A user assigned session id for the current pipeline.
/// \param[in] mem_sz Size of the memory set aside for the row caching (default=0 which means unlimited,
///     note that it might bring in the risk of running out of memory on the machine).
/// \param[in] spill Spill to disk if out of memory (default=false).
/// \param[in] hostname Optional host name (default="127.0.0.1").
/// \param[in] port Optional port (default=50052).
/// \param[in] num_connections Optional number of connections (default=12).
/// \param[in] prefetch_sz Optional prefetch size (default=20).
/// \return Shared pointer to DatasetCache. If error, nullptr is returned.
inline std::shared_ptr<DatasetCache> CreateDatasetCache(session_id_type id, uint64_t mem_sz, bool spill,
                                                        std::optional<std::string> hostname = std::nullopt,
                                                        std::optional<int32_t> port = std::nullopt,
                                                        std::optional<int32_t> num_connections = std::nullopt,
                                                        std::optional<int32_t> prefetch_sz = std::nullopt) {
  return CreateDatasetCacheCharIF(id, mem_sz, spill, OptionalStringToChar(hostname), port, num_connections,
                                  prefetch_sz);
}
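
// Illustrative usage sketch (not from the original docs; the session id below is
// a placeholder that must come from a cache server session created out-of-band):
// the cache handle is passed to a source dataset's cache parameter.
//
//   std::shared_ptr<DatasetCache> cache = CreateDatasetCache(1456416665, 0, false);
//   std::shared_ptr<Dataset> ds =
//     ImageFolder("/path/to/images", true, std::make_shared<RandomSampler>(), {}, {}, cache);
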
/// \brief Function to create a ZipDataset
/// \notes Applies zip to the datasets
/// \param[in] datasets List of shared pointers to the datasets that we want to zip
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<ZipDataset> Zip(const std::vector<std::shared_ptr<Dataset>> &datasets) {
  return std::make_shared<ZipDataset>(datasets);
}
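
// Illustrative usage sketch (not from the original docs; paths are placeholders):
// zips two sources row by row. Column names across zipped datasets typically must
// not overlap.
//
//   std::shared_ptr<Dataset> images = ImageFolder("/path/to/images", true);
//   std::shared_ptr<Dataset> text = TextFile({"/path/to/captions.txt"});
//   std::shared_ptr<ZipDataset> ds = Zip({images, text});
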
}  // namespace dataset
}  // namespace mindspore
#endif  // MINDSPORE_CCSRC_MINDDATA_DATASET_INCLUDE_DATASETS_H_