
datasets.h 104 kB

Commit history:

- Added Python API based on the C++ API; first draft of the Python iterator
- Added Cifar10 and Cifar100 pybind port
- Changed pybind to use IR for Skip and Manifest (Signed-off-by: alex-yuyue <yue.yu1@huawei.com>)
- Made DatasetNode the base for all IR nodes; namespace change
- Fixed the namespace issue and made unit tests work (Signed-off-by: alex-yuyue <yue.yu1@huawei.com>)
- Added VOCDataset
- !63 Added RandomDataset
- Added ImageFolder IR
- Pybind switch: CelebA and unit tests
- !61 CLUE example with class definition (merged branch 'python-api' of gitee.com:ezphlow/mindspore into clue_class_pybind; passing test cases; added CLUE, not working)
- Added ManifestDataset IR (Signed-off-by: alex-yuyue <yue.yu1@huawei.com>)
- Updated Coco, VOC, and TFReader; updated clang-format; reordered datasets_binding
- !69 Added Generator and moved c_dataset.Iterator to dataset.Iterator (added GeneratorDataset to c_dataset)
- !67 Moved c_datasets and added a sampler wrapper (need to add a create() method in datasets.py; migration from c_dataset to dataset, part 1)
- !71 Fixed indentation error
- !72 Fixed c_api test cases
- !73 Added CSVDataset
- Pybind switch: Take, and CelebA fixes
- !75 Moved c_dataset functionality to datasets (fixed existing test cases; added working CLUE and ImageFolder; added sampler conversion and creation from pybind)
- !77 Added Python API tree
- Added MindDataset and TextFileDataset pybind
- Renamed test_concat.py and test_minddataset_exception.py to skip them
- !80 Added batch IR to the python-api branch; most test cases work
- Enabled more c_api Take and CelebA tests; deleted util_c_api
- !84 Schema changes in datasets.py
- !85 Removed input_indexes from sub-classes
- !83 Removed the c_dataset package
- !82 Pybind switch: shuffle
- !86 Added build_vocab
- Rebased with upstream/master (_shuffle conflict; BatchNode error)
- !88 Fixed rebase problem
- Enabled more unit tests; code typo/nit fixes
- !91 Fixed Python vocab hang
- !89 Added BucketBatchByLength pybind switch
- Updated and enabled more test_c_api_*.py tests
- !95 Added BuildSentencePeiceVocab
- !96 Fixed more tests (enabled more test_c_api_*; added syncwait)
- !99 Pybind switch for device op
- !93 Added getters to the Python API
- !101 Validated the tree; error if it is a graph (added sync wait)
- !103 TFRecord/Random dataset schema problem
- !102 Added Filter pybind switch
- !104 Fixed num_samples
- !105 Fixed to_device hang
- !94 Added cache support for the CLUE dataset (added cache for all dataset ops; format change; added cache conversion)
- Added save pybind; fixed compile error; modified concat_node
- !107 Fixed some test cases
- Enabled and fixed more tests
- !109 Pybind switch for get_dataset_size
- Check-code fixes for pylint, cpplint, and clang-format
- !113 Added callback support (dataset_sz one-liner; typo fix; got the callback to work)
- !114 Made the Android build compile cleanly
- Fixed build issues due to rebase
- !115 Fixed more tests; fixed test_profiling.py
- !116 Fixed get_dataset_size
- !117 Added GetColumnNames pybind switch
- Code-check fixes: clang-format, cppcheck, cpplint, pylint
- Deleted duplicate test_c_api_*.py files; more lint fixes
- !121 Removed an extra call to getNext in the C++ tests
- !122 Fixed Schema with Generator
- Fixed some cases of CSV and MindRecord
- !124 Fixed tfrecord get_dataset_size and added some unit tests
- !125 Getter separation
- !126 Fixed sampler.GetNumSamples
- !127 Assigned a runtime getter to each get function; fixed compile issues
- !128 Matched master code
- !129 Cleaned up ToDevice/Save code
- !130 Added cache fix for map and ImageFolder
- !132 Fixed testing team issues (passed queue_name from Python to C++; added Schema.from_json)
- !131 Fixed cache op issues and deleted de_pipeline (rolled back the C++ change; passing all cache tests)
- !134 Cleaned up datasets.py, part 1
- !133 Updated validation for SentencePieceVocab.from_dataset (added type_check for column names)
- Rebased on master (181120 10:20); fixed profiling; temporary solution for catching status from Node.Build()
- !141 ToDevice termination; pylint fixes
- !137 Fixed test team issues and added corresponding tests
- !138 Changed TreeGetter to use OptPass (Zirui); rebase fix
- !143 Fixed a cpplint issue; pylint fixes in updated test cases
- !145 Reset the exceptions test case to master
- !146 Fixed Check_Pylint error
- !147 Fixed Android
- !148 Added ToDevice to the iterator list for cleanup at exit
- !149, !150 Pylint fixes
- !152 Fixed an ExecutionTree destructor error
- !153 Made the getter pass remove only the callback, without deleting the map op
- !156 Early __del__ of iterator/to_device
- !155 Addressed review comments (one-line fix to validators.py; rolled back the signature fix; lint fixes; C++ lint fix)
- !158 Review rework for dataset bindings, part 1 (reordered nodes Repeat and Rename)
- !154 Fixed minor problems in the comments (datasets.py, python_tree_consumer.cc, iterators_bindings.cc, iterators.py)
- !157 Added replace_none to datasets.py; addressed comments in tests
- Overrode the deepcopy method of the device op to resolve a copy issue; added a create_ir_tree method; deleted to_device if it already exists; cached getters for shapes and types
- Added yolov3 relaxation (to be rolled back); got shapes and types together; bypassed YOLO; NumWorkers for MapOp; reverted the YOLO and Thor changes
- Debug code: updated LOG INFO to LOG ERROR; do not remove EpochCtrl in the getter pass; removed repeat(1); printed batch size; added logging to tree_consumer and the device_queue op
- Reverted PR 8744 (Signed-off-by: alex-yuyue <yue.yu1@huawei.com>); __del__ of to_device
- !165 Added #ifndef ENABLE_ANDROID around the device queue print; reverted some changes
- !166 Getter: get_data_info
- !168 Added back the tree print (reverted info to warning in one log; added back the missed print-tree log); released the GIL in GetDataInfo
5 years ago
5 years ago
/**
 * Copyright 2020-2021 Huawei Technologies Co., Ltd
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
#ifndef MINDSPORE_CCSRC_MINDDATA_DATASET_INCLUDE_DATASETS_H_
#define MINDSPORE_CCSRC_MINDDATA_DATASET_INCLUDE_DATASETS_H_

#include <sys/stat.h>
#include <unistd.h>
#include <algorithm>
#include <map>
#include <memory>
#include <set>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

#include "include/api/dual_abi_helper.h"
#include "minddata/dataset/include/iterator.h"
#include "minddata/dataset/include/samplers.h"
#include "minddata/dataset/include/tensor.h"
#include "minddata/dataset/include/text.h"
#include "minddata/dataset/include/type_id.h"

namespace mindspore {
namespace dataset {

class Tensor;
class TensorRow;
class TensorShape;
class TreeAdapter;
class TreeGetters;
class Vocab;

class DatasetCache;
class DatasetNode;

class Iterator;

class TensorOperation;
class SchemaObj;
class SamplerObj;
class CsvBase;

// Dataset classes
class BatchDataset;
class MapDataset;
class ProjectDataset;
class ShuffleDataset;
class BucketBatchByLengthDataset;
class FilterDataset;
class CSVDataset;
class TransferDataset;
class ConcatDataset;
class RenameDataset;

class SentencePieceVocab;
enum class SentencePieceModel;

class DSCallback;

class RepeatDataset;
class SkipDataset;
class TakeDataset;
class ZipDataset;

/// \class Dataset datasets.h
/// \brief A base class to represent a dataset in the data pipeline.
class Dataset : public std::enable_shared_from_this<Dataset> {
 public:
  // Friend classes are needed so they can access the children_ field.
  friend class Iterator;
  friend class TransferNode;

  /// \brief Constructor
  Dataset();

  /// \brief Destructor
  ~Dataset() = default;

  /// \brief Gets the dataset size
  /// \param[in] estimate This is only supported by some of the ops; it is used to speed up the process of
  ///     getting the dataset size at the expense of accuracy.
  /// \return Dataset size. If failed, return -1
  int64_t GetDatasetSize(bool estimate = false);

  /// \brief Gets the output type
  /// \return A vector of DataType. If failed, return an empty vector
  std::vector<DataType> GetOutputTypes();

  /// \brief Gets the output shape
  /// \return A vector of TensorShape. If failed, return an empty vector
  std::vector<TensorShape> GetOutputShapes();

  /// \brief Gets the batch size
  /// \return The batch size of the data within the pipeline
  int64_t GetBatchSize();

  /// \brief Gets the repeat count
  /// \return The repeat count of the pipeline
  int64_t GetRepeatCount();

  /// \brief Gets the number of classes
  /// \return Number of classes. If failed, return -1
  int64_t GetNumClasses();

  /// \brief Gets the column names
  /// \return Names of the columns. If failed, return an empty vector
  std::vector<std::string> GetColumnNames() { return VectorCharToString(GetColumnNamesCharIF()); }

  /// \brief Gets the class indexing
  /// \return A map of ClassIndexing. If failed, return an empty map
  std::vector<std::pair<std::string, std::vector<int32_t>>> GetClassIndexing() {
    return ClassIndexCharToString(GetClassIndexingCharIF());
  }
  /// \brief Setter function for the runtime number of workers
  /// \param[in] num_workers The number of threads in this operator
  /// \return Shared pointer to the original object
  std::shared_ptr<Dataset> SetNumWorkers(int32_t num_workers);

  /// \brief Function to create an Iterator over the Dataset pipeline
  /// \param[in] columns List of columns to be used to specify the order of columns
  /// \param[in] num_epochs Number of epochs to run through the pipeline (default=-1, which means infinite epochs).
  ///     An empty row is returned at the end of each epoch
  /// \return Shared pointer to the Iterator
  std::shared_ptr<Iterator> CreateIterator(std::vector<std::string> columns = {}, int32_t num_epochs = -1) {
    return CreateIteratorCharIF(VectorStringToChar(columns), num_epochs);
  }
  /// \brief Function to transfer data through a device.
  /// \notes If the device is Ascend, features of the data will be transferred one by one; each transfer
  ///     is limited to 256 MB.
  /// \param[in] queue_name Channel name (default="", create a new unique name).
  /// \param[in] device_type Type of device (default="", get from MSContext).
  /// \param[in] num_epochs Number of epochs (default=-1, infinite epochs).
  /// \param[in] send_epoch_end Whether or not to send the end-of-epoch signal to the device (default=true).
  /// \param[in] total_batches Number of batches to be sent to the device (default=0, all data).
  /// \param[in] create_data_info_queue Whether or not to create a queue which stores the types and shapes
  ///     of the data (default=false).
  /// \return Returns true if no error was encountered, else false.
  bool DeviceQueue(std::string queue_name = "", std::string device_type = "", int32_t num_epochs = -1,
                   bool send_epoch_end = true, int32_t total_batches = 0, bool create_data_info_queue = false) {
    return DeviceQueueCharIF(StringToChar(queue_name), StringToChar(device_type), num_epochs, send_epoch_end,
                             total_batches, create_data_info_queue);
  }

  /// \brief Function to create a Saver to save the dynamic data processed by the dataset pipeline
  /// \note Usage restrictions:
  ///     1. Supported dataset format: 'mindrecord' only.
  ///     2. To save the samples in order, set the dataset's shuffle to false and num_files to 1.
  ///     3. Before calling the function, do not use a batch operator, repeat operator, or data augmentation
  ///        operators with a random attribute in a map operator.
  ///     4. Mindrecord does not support bool, uint64, multi-dimensional uint8 (drop dimension), or
  ///        multi-dimensional string.
  /// \param[in] dataset_path Path to the dataset file
  /// \param[in] num_files Number of dataset files (default=1)
  /// \param[in] dataset_type Dataset format (default="mindrecord")
  /// \return Returns true if no error was encountered, else false
  bool Save(std::string dataset_path, int32_t num_files = 1, std::string dataset_type = "mindrecord") {
    return SaveCharIF(StringToChar(dataset_path), num_files, StringToChar(dataset_type));
  }
  /// \brief Function to create a BatchDataset
  /// \notes Combines batch_size number of consecutive rows into batches
  /// \param[in] batch_size The number of rows each batch is created with
  /// \param[in] drop_remainder Determines whether or not to drop the last possibly incomplete batch.
  ///     If true, and if there are fewer than batch_size rows available to make the last batch,
  ///     then those rows will be dropped and not propagated to the next node
  /// \return Shared pointer to the current BatchDataset
  std::shared_ptr<BatchDataset> Batch(int32_t batch_size, bool drop_remainder = false);
  /// \brief Function to create a BucketBatchByLengthDataset
  /// \note Bucket elements according to their lengths. Each bucket will be padded and batched when
  ///    it is full.
  /// \param[in] column_names Columns passed to element_length_function
  /// \param[in] bucket_boundaries A list consisting of the upper boundaries of the buckets.
  ///    Must be strictly increasing. If there are n boundaries, n+1 buckets are created: one bucket for
  ///    [0, bucket_boundaries[0]), one bucket for [bucket_boundaries[i], bucket_boundaries[i+1]) for each
  ///    0<i<n, and one bucket for [bucket_boundaries[n-1], inf).
  /// \param[in] bucket_batch_sizes A list consisting of the batch sizes for each bucket.
  ///    Must contain a number of elements equal to the size of bucket_boundaries + 1.
  /// \param[in] element_length_function A function pointer that takes in a TensorRow and outputs a TensorRow.
  ///    The output must contain a single tensor containing a single int32_t. If no value is provided,
  ///    then the size of column_names must be 1, and the size of the first dimension of that column will be taken
  ///    as the length (default=nullptr)
  /// \param[in] pad_info Represents how to batch each column. The key corresponds to the column name, the value must
  ///    be a tuple of 2 elements. The first element corresponds to the shape to pad to, and the second element
  ///    corresponds to the value to pad with. If a column is not specified, then that column will be padded to the
  ///    longest in the current batch, and 0 will be used as the padding value. Any unspecified dimensions will be
  ///    padded to the longest in the current batch, unless pad_to_bucket_boundary is true. If no padding is
  ///    wanted, set pad_info to None (default=empty dictionary).
  /// \param[in] pad_to_bucket_boundary If true, will pad each unspecified dimension in pad_info to the
  ///    bucket_boundary minus 1. If there are any elements that fall into the last bucket,
  ///    an error will occur (default=false).
  /// \param[in] drop_remainder If true, will drop the last batch for each bucket if it is not a full batch
  ///    (default=false).
  /// \return Shared pointer to the current BucketBatchByLengthDataset
  std::shared_ptr<BucketBatchByLengthDataset> BucketBatchByLength(
    const std::vector<std::string> &column_names, const std::vector<int32_t> &bucket_boundaries,
    const std::vector<int32_t> &bucket_batch_sizes,
    std::function<TensorRow(TensorRow)> element_length_function = nullptr,
    const std::map<std::string, std::pair<TensorShape, std::shared_ptr<Tensor>>> &pad_info = {},
    bool pad_to_bucket_boundary = false, bool drop_remainder = false) {
    return std::make_shared<BucketBatchByLengthDataset>(
      shared_from_this(), VectorStringToChar(column_names), bucket_boundaries, bucket_batch_sizes,
      element_length_function, PadInfoStringToChar(pad_info), pad_to_bucket_boundary, drop_remainder);
  }
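The bucket layout described for bucket_boundaries (n strictly increasing boundaries define n+1 half-open intervals) maps an element length to a bucket index as follows. This is a sketch of the documented interval rule only, with an invented function name, not the library's internal routine.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Bucket index for an element of length len, given strictly increasing upper
// boundaries. The index equals the number of boundaries <= len, so the
// buckets are [0, b[0]), [b[i-1], b[i]), and finally [b[n-1], inf).
size_t BucketIndex(int32_t len, const std::vector<int32_t> &boundaries) {
  return static_cast<size_t>(
    std::upper_bound(boundaries.begin(), boundaries.end(), len) - boundaries.begin());
}
```

With boundaries {5, 10}, a length of 4 falls in bucket 0, a length of exactly 5 in bucket 1, and a length of 10 or more in the final bucket 2.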
  /// \brief Function to create a SentencePieceVocab from source dataset
  /// \note Build a SentencePieceVocab from a dataset.
  /// \param[in] col_names Column names to get words from. It can be a vector of column names
  /// \param[in] vocab_size Vocabulary size.
  /// \param[in] character_coverage Percentage of characters covered by the model; must be between
  ///    0.98 and 1.0. Good defaults are 0.9995 for languages with rich character sets like
  ///    Japanese or Chinese, and 1.0 for other languages with small character sets.
  /// \param[in] model_type Model type. Choose from unigram (default), bpe, char, or word.
  ///    The input sentence must be pretokenized when using the word type.
  /// \param[in] params A map containing additional parameters for the sentencepiece library
  /// \return Shared pointer to the SentencePieceVocab
  std::shared_ptr<SentencePieceVocab> BuildSentencePieceVocab(
    const std::vector<std::string> &col_names, int32_t vocab_size, float character_coverage,
    SentencePieceModel model_type, const std::unordered_map<std::string, std::string> &params) {
    return BuildSentencePieceVocabCharIF(VectorStringToChar(col_names), vocab_size, character_coverage, model_type,
                                         UnorderedMapStringToChar(params));
  }
  /// \brief Function to create a Vocab from source dataset
  /// \note Build a vocab from a dataset. This collects all the unique words in a dataset and returns a vocab
  ///    which contains the top_k most frequent words (if top_k is specified)
  /// \param[in] columns Column names to get words from. It can be a vector of column names
  /// \param[in] freq_range A tuple of integers (min_frequency, max_frequency). Words within the frequency
  ///    range are kept. 0 <= min_frequency <= max_frequency <= total_words. min_frequency/max_frequency
  ///    can be left at the default, which corresponds to 0/total_words respectively
  /// \param[in] top_k Number of words to be built into the vocab. The top_k most frequent words are
  ///    taken. top_k is applied after freq_range. If fewer than top_k words remain, all of them are taken
  /// \param[in] special_tokens A list of strings, each one a special token
  /// \param[in] special_first Whether special_tokens are prepended or appended to the vocab. If special_tokens
  ///    is specified and special_first is left at the default, special_tokens are prepended
  /// \return Shared pointer to the current Vocab
  std::shared_ptr<Vocab> BuildVocab(const std::vector<std::string> &columns = {},
                                    const std::pair<int64_t, int64_t> &freq_range = {0, kDeMaxFreq},
                                    int64_t top_k = kDeMaxTopk, const std::vector<std::string> &special_tokens = {},
                                    bool special_first = true) {
    return BuildVocabCharIF(VectorStringToChar(columns), freq_range, top_k, VectorStringToChar(special_tokens),
                            special_first);
  }
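The freq_range and top_k rules above (filter by frequency first, then keep the most frequent top_k) can be modeled with ordinary containers. This is a behavioral sketch with invented names, not the BuildVocab implementation.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Model of the documented vocab selection: keep words whose frequency lies in
// [min_freq, max_freq], sort by descending frequency, then truncate to top_k.
std::vector<std::string> SelectVocab(const std::vector<std::pair<std::string, int64_t>> &counts,
                                     int64_t min_freq, int64_t max_freq, int64_t top_k) {
  std::vector<std::pair<std::string, int64_t>> kept;
  for (const auto &wc : counts) {
    if (wc.second >= min_freq && wc.second <= max_freq) kept.push_back(wc);
  }
  // Stable sort preserves input order among equal frequencies.
  std::stable_sort(kept.begin(), kept.end(),
                   [](const auto &a, const auto &b) { return a.second > b.second; });
  if (top_k >= 0 && static_cast<int64_t>(kept.size()) > top_k) kept.resize(top_k);
  std::vector<std::string> words;
  for (const auto &wc : kept) words.push_back(wc.first);
  return words;
}
```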
  /// \brief Function to create a ConcatDataset
  /// \note Concatenates the datasets in the input
  /// \param[in] datasets List of shared pointers to the datasets that should be concatenated together
  /// \return Shared pointer to the current ConcatDataset
  std::shared_ptr<ConcatDataset> Concat(const std::vector<std::shared_ptr<Dataset>> &datasets) {
    std::vector<std::shared_ptr<Dataset>> all_datasets{shared_from_this()};
    all_datasets.insert(std::end(all_datasets), std::begin(datasets), std::end(datasets));
    return std::make_shared<ConcatDataset>(all_datasets);
  }
  /// \brief Function to filter dataset by predicate
  /// \note If input_columns is not provided or empty, all columns will be used
  /// \param[in] predicate Function callable which returns a boolean value. If false, filter the element out
  /// \param[in] input_columns List of names of the input columns to filter
  /// \return Shared pointer to the current FilterDataset
  std::shared_ptr<FilterDataset> Filter(std::function<TensorRow(TensorRow)> predicate,
                                        const std::vector<std::string> &input_columns = {}) {
    return std::make_shared<FilterDataset>(shared_from_this(), predicate, VectorStringToChar(input_columns));
  }
  /// \brief Function to create a MapDataset
  /// \note Applies each operation in operations to this dataset
  /// \param[in] operations Vector of raw pointers to TensorTransform objects to be applied on the dataset. Operations
  ///    are applied in the order they appear in this list
  /// \param[in] input_columns Vector of the names of the columns that will be passed to the first
  ///    operation as input. The size of this list must match the number of
  ///    input columns expected by the first operator. The default input_columns
  ///    is the first column
  /// \param[in] output_columns Vector of names assigned to the columns output by the last operation.
  ///    This parameter is mandatory if len(input_columns) != len(output_columns).
  ///    The size of this list must match the number of output columns of the
  ///    last operation. The default output_columns will have the same
  ///    names as the input columns, i.e., the columns will be replaced
  /// \param[in] project_columns A list of column names to project
  /// \param[in] cache Tensor cache to use (default=nullptr, which means no cache is used)
  /// \return Shared pointer to the current MapDataset
  std::shared_ptr<MapDataset> Map(std::vector<TensorTransform *> operations,
                                  const std::vector<std::string> &input_columns = {},
                                  const std::vector<std::string> &output_columns = {},
                                  const std::vector<std::string> &project_columns = {},
                                  const std::shared_ptr<DatasetCache> &cache = nullptr,
                                  std::vector<std::shared_ptr<DSCallback>> callbacks = {}) {
    std::vector<std::shared_ptr<TensorOperation>> transform_ops;
    (void)std::transform(
      operations.begin(), operations.end(), std::back_inserter(transform_ops),
      [](TensorTransform *op) -> std::shared_ptr<TensorOperation> { return op != nullptr ? op->Parse() : nullptr; });
    return std::make_shared<MapDataset>(shared_from_this(), transform_ops, VectorStringToChar(input_columns),
                                        VectorStringToChar(output_columns), VectorStringToChar(project_columns), cache,
                                        callbacks);
  }
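The key contract above is that Map applies its operations in list order, feeding each operation's output to the next. The chaining can be illustrated with plain int-to-int functions standing in for TensorTransform objects (a simplification for illustration; the names are invented):

```cpp
#include <functional>
#include <vector>

// Apply each operation to the value in list order; the output of one
// operation becomes the input of the next, mirroring Map's ordering rule.
int ApplyInOrder(int value, const std::vector<std::function<int(int)>> &ops) {
  for (const auto &op : ops) value = op(value);
  return value;
}
```

With ops = {+1, *2}, starting from 3 the result is (3+1)*2 = 8, not 3*2+1 = 7, which is why the order of the operations list matters.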
  /// \brief Function to create a MapDataset
  /// \note Applies each operation in operations to this dataset
  /// \param[in] operations Vector of shared pointers to TensorTransform objects to be applied on the dataset.
  ///    Operations are applied in the order they appear in this list
  /// \param[in] input_columns Vector of the names of the columns that will be passed to the first
  ///    operation as input. The size of this list must match the number of
  ///    input columns expected by the first operator. The default input_columns
  ///    is the first column
  /// \param[in] output_columns Vector of names assigned to the columns output by the last operation.
  ///    This parameter is mandatory if len(input_columns) != len(output_columns).
  ///    The size of this list must match the number of output columns of the
  ///    last operation. The default output_columns will have the same
  ///    names as the input columns, i.e., the columns will be replaced
  /// \param[in] project_columns A list of column names to project
  /// \param[in] cache Tensor cache to use (default=nullptr, which means no cache is used)
  /// \return Shared pointer to the current MapDataset
  std::shared_ptr<MapDataset> Map(std::vector<std::shared_ptr<TensorTransform>> operations,
                                  const std::vector<std::string> &input_columns = {},
                                  const std::vector<std::string> &output_columns = {},
                                  const std::vector<std::string> &project_columns = {},
                                  const std::shared_ptr<DatasetCache> &cache = nullptr,
                                  std::vector<std::shared_ptr<DSCallback>> callbacks = {}) {
    std::vector<std::shared_ptr<TensorOperation>> transform_ops;
    (void)std::transform(operations.begin(), operations.end(), std::back_inserter(transform_ops),
                         [](std::shared_ptr<TensorTransform> op) -> std::shared_ptr<TensorOperation> {
                           return op != nullptr ? op->Parse() : nullptr;
                         });
    return std::make_shared<MapDataset>(shared_from_this(), transform_ops, VectorStringToChar(input_columns),
                                        VectorStringToChar(output_columns), VectorStringToChar(project_columns), cache,
                                        callbacks);
  }
  /// \brief Function to create a MapDataset
  /// \note Applies each operation in operations to this dataset
  /// \param[in] operations Vector of TensorTransform objects to be applied on the dataset. Operations are applied in
  ///    the order they appear in this list
  /// \param[in] input_columns Vector of the names of the columns that will be passed to the first
  ///    operation as input. The size of this list must match the number of
  ///    input columns expected by the first operator. The default input_columns
  ///    is the first column
  /// \param[in] output_columns Vector of names assigned to the columns output by the last operation.
  ///    This parameter is mandatory if len(input_columns) != len(output_columns).
  ///    The size of this list must match the number of output columns of the
  ///    last operation. The default output_columns will have the same
  ///    names as the input columns, i.e., the columns will be replaced
  /// \param[in] project_columns A list of column names to project
  /// \param[in] cache Tensor cache to use (default=nullptr, which means no cache is used)
  /// \return Shared pointer to the current MapDataset
  std::shared_ptr<MapDataset> Map(const std::vector<std::reference_wrapper<TensorTransform>> operations,
                                  const std::vector<std::string> &input_columns = {},
                                  const std::vector<std::string> &output_columns = {},
                                  const std::vector<std::string> &project_columns = {},
                                  const std::shared_ptr<DatasetCache> &cache = nullptr,
                                  std::vector<std::shared_ptr<DSCallback>> callbacks = {}) {
    std::vector<std::shared_ptr<TensorOperation>> transform_ops;
    (void)std::transform(operations.begin(), operations.end(), std::back_inserter(transform_ops),
                         [](TensorTransform &op) -> std::shared_ptr<TensorOperation> { return op.Parse(); });
    return std::make_shared<MapDataset>(shared_from_this(), transform_ops, VectorStringToChar(input_columns),
                                        VectorStringToChar(output_columns), VectorStringToChar(project_columns), cache,
                                        callbacks);
  }
  /// \brief Function to create a ProjectDataset
  /// \note Applies project to the dataset
  /// \param[in] columns The names of the columns to project
  /// \return Shared pointer to the current ProjectDataset
  std::shared_ptr<ProjectDataset> Project(const std::vector<std::string> &columns) {
    return std::make_shared<ProjectDataset>(shared_from_this(), VectorStringToChar(columns));
  }
  /// \brief Function to create a RenameDataset
  /// \note Renames the columns in the input dataset
  /// \param[in] input_columns List of the input columns to rename
  /// \param[in] output_columns List of the output columns
  /// \return Shared pointer to the current RenameDataset
  std::shared_ptr<RenameDataset> Rename(const std::vector<std::string> &input_columns,
                                        const std::vector<std::string> &output_columns) {
    return std::make_shared<RenameDataset>(shared_from_this(), VectorStringToChar(input_columns),
                                           VectorStringToChar(output_columns));
  }
  /// \brief Function to create a RepeatDataset
  /// \note Repeats this dataset count times. Repeats indefinitely if count is -1
  /// \param[in] count Number of times the dataset should be repeated
  /// \return Shared pointer to the current RepeatDataset
  std::shared_ptr<RepeatDataset> Repeat(int32_t count = -1) {
    return std::make_shared<RepeatDataset>(shared_from_this(), count);
  }
  /// \brief Function to create a ShuffleDataset
  /// \note Randomly shuffles the rows of this dataset
  /// \param[in] buffer_size The size of the buffer (must be larger than 1) for shuffling
  /// \return Shared pointer to the current ShuffleDataset
  std::shared_ptr<ShuffleDataset> Shuffle(int32_t buffer_size) {
    return std::make_shared<ShuffleDataset>(shared_from_this(), buffer_size);
  }
  /// \brief Function to create a SkipDataset
  /// \note Skips count elements in this dataset.
  /// \param[in] count Number of elements of this dataset to be skipped.
  /// \return Shared pointer to the current SkipDataset
  std::shared_ptr<SkipDataset> Skip(int32_t count) { return std::make_shared<SkipDataset>(shared_from_this(), count); }
  /// \brief Function to create a TakeDataset
  /// \note Takes count elements from this dataset.
  /// \param[in] count Number of elements of this dataset to be taken.
  /// \return Shared pointer to the current TakeDataset
  std::shared_ptr<TakeDataset> Take(int32_t count = -1) {
    return std::make_shared<TakeDataset>(shared_from_this(), count);
  }
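The effect of Skip and Take on the row count follows directly from the docs above: Skip drops the first count rows, and Take's default of -1 keeps the whole dataset. A sketch of that arithmetic, with invented helper names:

```cpp
#include <cstdint>

// Rows remaining after Skip(count): the first count rows are dropped;
// skipping more rows than exist leaves an empty dataset.
int64_t RowsAfterSkip(int64_t num_rows, int32_t count) {
  return count >= num_rows ? 0 : num_rows - count;
}

// Rows remaining after Take(count): count = -1 (the default) takes
// everything, and taking more rows than exist is capped at num_rows.
int64_t RowsAfterTake(int64_t num_rows, int32_t count) {
  return (count < 0 || count >= num_rows) ? num_rows : count;
}
```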
  /// \brief Function to create a ZipDataset
  /// \note Applies zip to the datasets
  /// \param[in] datasets A list of shared pointers to the datasets that we want to zip
  /// \return Shared pointer to the current ZipDataset
  std::shared_ptr<ZipDataset> Zip(const std::vector<std::shared_ptr<Dataset>> &datasets) {
    std::vector<std::shared_ptr<Dataset>> all_datasets = datasets;
    all_datasets.push_back(shared_from_this());
    return std::make_shared<ZipDataset>(all_datasets);
  }
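Zip combines datasets column-wise, so under the usual zip semantics (an assumption here, not stated in this header) the resulting row count is the smallest row count among the inputs. A minimal sketch with an invented function name:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Row count of a zipped dataset, assuming standard zip semantics: iteration
// stops when the shortest input is exhausted.
int64_t RowsAfterZip(const std::vector<int64_t> &input_rows) {
  if (input_rows.empty()) return 0;
  return *std::min_element(input_rows.begin(), input_rows.end());
}
```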
  std::shared_ptr<DatasetNode> IRNode() { return ir_node_; }

 protected:
  std::shared_ptr<TreeGetters> tree_getters_;
  std::shared_ptr<DatasetNode> ir_node_;

 private:
  // Char interface (CharIF) of GetColumnNames
  std::vector<std::vector<char>> GetColumnNamesCharIF();
  // Char interface (CharIF) of GetClassIndexing
  std::vector<std::pair<std::vector<char>, std::vector<int32_t>>> GetClassIndexingCharIF();
  // Char interface (CharIF) of CreateIterator
  std::shared_ptr<Iterator> CreateIteratorCharIF(std::vector<std::vector<char>> columns, int32_t num_epochs);
  // Char interface (CharIF) of DeviceQueue
  bool DeviceQueueCharIF(const std::vector<char> &queue_name, const std::vector<char> &device_type, int32_t num_epochs,
                         bool send_epoch_end, int32_t total_batches, bool create_data_info_queue);
  // Char interface (CharIF) of Save
  bool SaveCharIF(const std::vector<char> &dataset_path, int32_t num_files, const std::vector<char> &dataset_type);
  // Char interface (CharIF) of BuildSentencePieceVocab
  std::shared_ptr<SentencePieceVocab> BuildSentencePieceVocabCharIF(
    const std::vector<std::vector<char>> &col_names, int32_t vocab_size, float character_coverage,
    SentencePieceModel model_type, const std::map<std::vector<char>, std::vector<char>> &params);
  // Char interface (CharIF) of BuildVocab
  std::shared_ptr<Vocab> BuildVocabCharIF(const std::vector<std::vector<char>> &columns,
                                          const std::pair<int64_t, int64_t> &freq_range, int64_t top_k,
                                          const std::vector<std::vector<char>> &special_tokens, bool special_first);
};
class SchemaObj {
 public:
  /// \brief Constructor
  explicit SchemaObj(const std::string &schema_file = "") : SchemaObj(StringToChar(schema_file)) {}
  /// \brief Destructor
  ~SchemaObj() = default;
  /// \brief SchemaObj Init function
  /// \return Status code, OK if schema initialization is successful
  Status Init();
  /// \brief Add a new column to the schema with an unknown shape of rank 1
  /// \param[in] name Name of the column.
  /// \param[in] de_type Data type of the column (TypeId).
  /// \return Status code
  Status add_column(const std::string &name, TypeId de_type) { return add_column_char(StringToChar(name), de_type); }
  /// \brief Add a new column to the schema with an unknown shape of rank 1
  /// \param[in] name Name of the column.
  /// \param[in] de_type Data type of the column (std::string).
  /// \return Status code
  Status add_column(const std::string &name, const std::string &de_type) {
    return add_column_char(StringToChar(name), StringToChar(de_type));
  }
  /// \brief Add a new column to the schema
  /// \param[in] name Name of the column.
  /// \param[in] de_type Data type of the column (TypeId).
  /// \param[in] shape Shape of the column.
  /// \return Status code
  Status add_column(const std::string &name, TypeId de_type, const std::vector<int32_t> &shape) {
    return add_column_char(StringToChar(name), de_type, shape);
  }
  /// \brief Add a new column to the schema
  /// \param[in] name Name of the column.
  /// \param[in] de_type Data type of the column (std::string).
  /// \param[in] shape Shape of the column.
  /// \return Status code
  Status add_column(const std::string &name, const std::string &de_type, const std::vector<int32_t> &shape) {
    return add_column_char(StringToChar(name), StringToChar(de_type), shape);
  }
  /// \brief Get a JSON string of the schema
  /// \return JSON string of the schema
  std::string to_json() { return CharToString(to_json_char()); }
  /// \brief Get a JSON string of the schema
  std::string to_string() { return to_json(); }
  /// \brief Set a new value for dataset_type
  void set_dataset_type(std::string dataset_type);
  /// \brief Set a new value for num_rows
  void set_num_rows(int32_t num_rows);
  /// \brief Get the current num_rows
  int32_t get_num_rows() const;
  /// \brief Get the schema from a JSON string
  /// \param[in] json_string JSON string to be parsed.
  /// \return Status code
  Status FromJSONString(const std::string &json_string) { return FromJSONStringCharIF(StringToChar(json_string)); }
  /// \brief Parse and add column information
  /// \param[in] json_string JSON string of the column's dataset attribute information, decoded from the schema file.
  /// \return Status code
  Status ParseColumnString(const std::string &json_string) {
    return ParseColumnStringCharIF(StringToChar(json_string));
  }
 private:
  /// \brief Parse the columns and add them to columns
  /// \param[in] columns Dataset attribute information, decoded from the schema file.
  ///    Supports both nlohmann::json::value_t::array and nlohmann::json::value_t::object.
  /// \return Status code
  Status parse_column(nlohmann::json columns);
  /// \brief Get the schema from a parsed JSON object
  /// \param[in] json_obj Parsed JSON object
  /// \return Status code
  Status from_json(nlohmann::json json_obj);
  // Char constructor of SchemaObj
  explicit SchemaObj(const std::vector<char> &schema_file);
  // Char interface of add_column
  Status add_column_char(const std::vector<char> &name, TypeId de_type);
  Status add_column_char(const std::vector<char> &name, const std::vector<char> &de_type);
  Status add_column_char(const std::vector<char> &name, TypeId de_type, const std::vector<int32_t> &shape);
  Status add_column_char(const std::vector<char> &name, const std::vector<char> &de_type,
                         const std::vector<int32_t> &shape);
  // Char interface of to_json
  const std::vector<char> to_json_char();
  // Char interface of FromJSONString
  Status FromJSONStringCharIF(const std::vector<char> &json_string);
  // Char interface of ParseColumnString
  Status ParseColumnStringCharIF(const std::vector<char> &json_string);
  struct Data;
  std::shared_ptr<Data> data_;
};
class BatchDataset : public Dataset {
 public:
  BatchDataset(std::shared_ptr<Dataset> input, int32_t batch_size, bool drop_remainder = false);
  ~BatchDataset() = default;
};

class BucketBatchByLengthDataset : public Dataset {
 public:
  BucketBatchByLengthDataset(
    std::shared_ptr<Dataset> input, const std::vector<std::vector<char>> &column_names,
    const std::vector<int32_t> &bucket_boundaries, const std::vector<int32_t> &bucket_batch_sizes,
    std::function<TensorRow(TensorRow)> element_length_function = nullptr,
    const std::map<std::vector<char>, std::pair<TensorShape, std::shared_ptr<Tensor>>> &pad_info = {},
    bool pad_to_bucket_boundary = false, bool drop_remainder = false);
  ~BucketBatchByLengthDataset() = default;
};

class ConcatDataset : public Dataset {
 public:
  explicit ConcatDataset(const std::vector<std::shared_ptr<Dataset>> &input);
  ~ConcatDataset() = default;
};

class FilterDataset : public Dataset {
 public:
  FilterDataset(std::shared_ptr<Dataset> input, std::function<TensorRow(TensorRow)> predicate,
                const std::vector<std::vector<char>> &input_columns);
  ~FilterDataset() = default;
};

class MapDataset : public Dataset {
 public:
  MapDataset(std::shared_ptr<Dataset> input, std::vector<std::shared_ptr<TensorOperation>> operations,
             const std::vector<std::vector<char>> &input_columns, const std::vector<std::vector<char>> &output_columns,
             const std::vector<std::vector<char>> &project_columns, const std::shared_ptr<DatasetCache> &cache,
             std::vector<std::shared_ptr<DSCallback>> callbacks);
  ~MapDataset() = default;
};

class ProjectDataset : public Dataset {
 public:
  ProjectDataset(std::shared_ptr<Dataset> input, const std::vector<std::vector<char>> &columns);
  ~ProjectDataset() = default;
};

class RenameDataset : public Dataset {
 public:
  RenameDataset(std::shared_ptr<Dataset> input, const std::vector<std::vector<char>> &input_columns,
                const std::vector<std::vector<char>> &output_columns);
  ~RenameDataset() = default;
};

class RepeatDataset : public Dataset {
 public:
  RepeatDataset(std::shared_ptr<Dataset> input, int32_t count);
  ~RepeatDataset() = default;
};

class ShuffleDataset : public Dataset {
 public:
  ShuffleDataset(std::shared_ptr<Dataset> input, int32_t buffer_size);
  ~ShuffleDataset() = default;
};

class SkipDataset : public Dataset {
 public:
  SkipDataset(std::shared_ptr<Dataset> input, int32_t count);
  ~SkipDataset() = default;
};

class TakeDataset : public Dataset {
 public:
  TakeDataset(std::shared_ptr<Dataset> input, int32_t count);
  ~TakeDataset() = default;
};

class ZipDataset : public Dataset {
 public:
  explicit ZipDataset(const std::vector<std::shared_ptr<Dataset>> &inputs);
  ~ZipDataset() = default;
};
/// \brief Function to create a SchemaObj
/// \param[in] schema_file Path of the schema file
/// \note This API exists because std::string is constrained by the ABI compile macro while char is not.
/// \return Shared pointer to the current schema
std::shared_ptr<SchemaObj> SchemaCharIF(const std::vector<char> &schema_file);

/// \brief Function to create a SchemaObj
/// \param[in] schema_file Path of the schema file
/// \return Shared pointer to the current schema
inline std::shared_ptr<SchemaObj> Schema(const std::string &schema_file = "") {
  return SchemaCharIF(StringToChar(schema_file));
}
class AlbumDataset : public Dataset {
 public:
  AlbumDataset(const std::vector<char> &dataset_dir, const std::vector<char> &data_schema,
               const std::vector<std::vector<char>> &column_names, bool decode, const std::shared_ptr<Sampler> &sampler,
               const std::shared_ptr<DatasetCache> &cache);
  AlbumDataset(const std::vector<char> &dataset_dir, const std::vector<char> &data_schema,
               const std::vector<std::vector<char>> &column_names, bool decode, Sampler *sampler,
               const std::shared_ptr<DatasetCache> &cache);
  AlbumDataset(const std::vector<char> &dataset_dir, const std::vector<char> &data_schema,
               const std::vector<std::vector<char>> &column_names, bool decode,
               const std::reference_wrapper<Sampler> sampler, const std::shared_ptr<DatasetCache> &cache);
  ~AlbumDataset() = default;
};

/// \brief Function to create an AlbumDataset
/// \note The generated dataset is specified through setting a schema
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] data_schema Path to the dataset schema file
/// \param[in] column_names Column names used to specify the columns to load; if empty, all columns will be read
///    (default = {})
/// \param[in] decode The option to decode the images in the dataset (default = false)
/// \param[in] sampler Shared pointer to a sampler object used to choose samples from the dataset. If sampler is not
///    given, a `RandomSampler` will be used to randomly iterate the entire dataset (default = RandomSampler())
/// \param[in] cache Tensor cache to use (default=nullptr, which means no cache is used)
/// \return Shared pointer to the current AlbumDataset
inline std::shared_ptr<AlbumDataset> Album(const std::string &dataset_dir, const std::string &data_schema,
                                           const std::vector<std::string> &column_names = {}, bool decode = false,
                                           const std::shared_ptr<Sampler> &sampler = std::make_shared<RandomSampler>(),
                                           const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<AlbumDataset>(StringToChar(dataset_dir), StringToChar(data_schema),
                                        VectorStringToChar(column_names), decode, sampler, cache);
}

/// \brief Function to create an AlbumDataset
/// \note The generated dataset is specified through setting a schema
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] data_schema Path to the dataset schema file
/// \param[in] column_names Column names used to specify the columns to load
/// \param[in] decode The option to decode the images in the dataset
/// \param[in] sampler Raw pointer to a sampler object used to choose samples from the dataset.
/// \param[in] cache Tensor cache to use (default=nullptr, which means no cache is used)
/// \return Shared pointer to the current AlbumDataset
inline std::shared_ptr<AlbumDataset> Album(const std::string &dataset_dir, const std::string &data_schema,
                                           const std::vector<std::string> &column_names, bool decode, Sampler *sampler,
                                           const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<AlbumDataset>(StringToChar(dataset_dir), StringToChar(data_schema),
                                        VectorStringToChar(column_names), decode, sampler, cache);
}

/// \brief Function to create an AlbumDataset
/// \note The generated dataset is specified through setting a schema
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] data_schema Path to the dataset schema file
/// \param[in] column_names Column names used to specify the columns to load
/// \param[in] decode The option to decode the images in the dataset
/// \param[in] sampler Sampler object used to choose samples from the dataset.
/// \param[in] cache Tensor cache to use (default=nullptr, which means no cache is used)
/// \return Shared pointer to the current AlbumDataset
inline std::shared_ptr<AlbumDataset> Album(const std::string &dataset_dir, const std::string &data_schema,
                                           const std::vector<std::string> &column_names, bool decode,
                                           const std::reference_wrapper<Sampler> sampler,
                                           const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<AlbumDataset>(StringToChar(dataset_dir), StringToChar(data_schema),
                                        VectorStringToChar(column_names), decode, sampler, cache);
}
class CelebADataset : public Dataset {
 public:
  explicit CelebADataset(const std::vector<char> &dataset_dir, const std::vector<char> &usage,
                         const std::shared_ptr<Sampler> &sampler, bool decode,
                         const std::set<std::vector<char>> &extensions, const std::shared_ptr<DatasetCache> &cache);
  explicit CelebADataset(const std::vector<char> &dataset_dir, const std::vector<char> &usage, Sampler *sampler,
                         bool decode, const std::set<std::vector<char>> &extensions,
                         const std::shared_ptr<DatasetCache> &cache);
  explicit CelebADataset(const std::vector<char> &dataset_dir, const std::vector<char> &usage,
                         const std::reference_wrapper<Sampler> sampler, bool decode,
                         const std::set<std::vector<char>> &extensions, const std::shared_ptr<DatasetCache> &cache);
  ~CelebADataset() = default;
};
/// \brief Function to create a CelebADataset
/// \notes The generated dataset has two columns ['image', 'attr'].
///     The type of the image tensor is uint8. The attr tensor is uint32 and one hot type.
/// \param[in] dataset_dir Path to the root directory that contains the dataset.
/// \param[in] usage One of "all", "train", "valid" or "test" (default = "all").
/// \param[in] sampler Shared pointer to a sampler object used to choose samples from the dataset. If sampler is not
///     given, a `RandomSampler` will be used to randomly iterate the entire dataset (default = RandomSampler())
/// \param[in] decode Decode the images after reading (default=false).
/// \param[in] extensions Set of file extensions to be included in the dataset (default={}).
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<CelebADataset> CelebA(
  const std::string &dataset_dir, const std::string &usage = "all",
  const std::shared_ptr<Sampler> &sampler = std::make_shared<RandomSampler>(), bool decode = false,
  const std::set<std::string> &extensions = {}, const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<CelebADataset>(StringToChar(dataset_dir), StringToChar(usage), sampler, decode,
                                         SetStringToChar(extensions), cache);
}

/// \brief Function to create a CelebADataset
/// \notes The generated dataset has two columns ['image', 'attr'].
///     The type of the image tensor is uint8. The attr tensor is uint32 and one hot type.
/// \param[in] dataset_dir Path to the root directory that contains the dataset.
/// \param[in] usage One of "all", "train", "valid" or "test"
/// \param[in] sampler Raw pointer to a sampler object used to choose samples from the dataset.
/// \param[in] decode Decode the images after reading (default=false).
/// \param[in] extensions Set of file extensions to be included in the dataset (default={}).
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<CelebADataset> CelebA(const std::string &dataset_dir, const std::string &usage, Sampler *sampler,
                                             bool decode = false, const std::set<std::string> &extensions = {},
                                             const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<CelebADataset>(StringToChar(dataset_dir), StringToChar(usage), sampler, decode,
                                         SetStringToChar(extensions), cache);
}

/// \brief Function to create a CelebADataset
/// \notes The generated dataset has two columns ['image', 'attr'].
///     The type of the image tensor is uint8. The attr tensor is uint32 and one hot type.
/// \param[in] dataset_dir Path to the root directory that contains the dataset.
/// \param[in] usage One of "all", "train", "valid" or "test"
/// \param[in] sampler Sampler object used to choose samples from the dataset.
/// \param[in] decode Decode the images after reading (default=false).
/// \param[in] extensions Set of file extensions to be included in the dataset (default={}).
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<CelebADataset> CelebA(const std::string &dataset_dir, const std::string &usage,
                                             const std::reference_wrapper<Sampler> sampler, bool decode = false,
                                             const std::set<std::string> &extensions = {},
                                             const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<CelebADataset>(StringToChar(dataset_dir), StringToChar(usage), sampler, decode,
                                         SetStringToChar(extensions), cache);
}
class Cifar10Dataset : public Dataset {
 public:
  explicit Cifar10Dataset(const std::vector<char> &dataset_dir, const std::vector<char> &usage,
                          const std::shared_ptr<Sampler> &sampler, const std::shared_ptr<DatasetCache> &cache);
  explicit Cifar10Dataset(const std::vector<char> &dataset_dir, const std::vector<char> &usage, Sampler *sampler,
                          const std::shared_ptr<DatasetCache> &cache);
  explicit Cifar10Dataset(const std::vector<char> &dataset_dir, const std::vector<char> &usage,
                          const std::reference_wrapper<Sampler> sampler, const std::shared_ptr<DatasetCache> &cache);
  ~Cifar10Dataset() = default;
};
/// \brief Function to create a Cifar10 Dataset
/// \notes The generated dataset has two columns ["image", "label"]
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] usage Usage of CIFAR10, can be "train", "test" or "all" (default = "all").
/// \param[in] sampler Shared pointer to a sampler object used to choose samples from the dataset. If sampler is not
///     given, a `RandomSampler` will be used to randomly iterate the entire dataset (default = RandomSampler())
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<Cifar10Dataset> Cifar10(
  const std::string &dataset_dir, const std::string &usage = "all",
  const std::shared_ptr<Sampler> &sampler = std::make_shared<RandomSampler>(),
  const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<Cifar10Dataset>(StringToChar(dataset_dir), StringToChar(usage), sampler, cache);
}

/// \brief Function to create a Cifar10 Dataset
/// \notes The generated dataset has two columns ["image", "label"]
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] usage Usage of CIFAR10, can be "train", "test" or "all"
/// \param[in] sampler Raw pointer to a sampler object used to choose samples from the dataset.
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<Cifar10Dataset> Cifar10(const std::string &dataset_dir, const std::string &usage,
                                               Sampler *sampler, const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<Cifar10Dataset>(StringToChar(dataset_dir), StringToChar(usage), sampler, cache);
}

/// \brief Function to create a Cifar10 Dataset
/// \notes The generated dataset has two columns ["image", "label"]
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] usage Usage of CIFAR10, can be "train", "test" or "all"
/// \param[in] sampler Sampler object used to choose samples from the dataset.
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<Cifar10Dataset> Cifar10(const std::string &dataset_dir, const std::string &usage,
                                               const std::reference_wrapper<Sampler> sampler,
                                               const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<Cifar10Dataset>(StringToChar(dataset_dir), StringToChar(usage), sampler, cache);
}
class Cifar100Dataset : public Dataset {
 public:
  explicit Cifar100Dataset(const std::vector<char> &dataset_dir, const std::vector<char> &usage,
                           const std::shared_ptr<Sampler> &sampler, const std::shared_ptr<DatasetCache> &cache);
  explicit Cifar100Dataset(const std::vector<char> &dataset_dir, const std::vector<char> &usage, Sampler *sampler,
                           const std::shared_ptr<DatasetCache> &cache);
  explicit Cifar100Dataset(const std::vector<char> &dataset_dir, const std::vector<char> &usage,
                           const std::reference_wrapper<Sampler> sampler, const std::shared_ptr<DatasetCache> &cache);
  ~Cifar100Dataset() = default;
};
/// \brief Function to create a Cifar100 Dataset
/// \notes The generated dataset has three columns ["image", "coarse_label", "fine_label"]
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] usage Usage of CIFAR100, can be "train", "test" or "all" (default = "all").
/// \param[in] sampler Shared pointer to a sampler object used to choose samples from the dataset. If sampler is not
///     given, a `RandomSampler` will be used to randomly iterate the entire dataset (default = RandomSampler())
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<Cifar100Dataset> Cifar100(
  const std::string &dataset_dir, const std::string &usage = "all",
  const std::shared_ptr<Sampler> &sampler = std::make_shared<RandomSampler>(),
  const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<Cifar100Dataset>(StringToChar(dataset_dir), StringToChar(usage), sampler, cache);
}

/// \brief Function to create a Cifar100 Dataset
/// \notes The generated dataset has three columns ["image", "coarse_label", "fine_label"]
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] usage Usage of CIFAR100, can be "train", "test" or "all".
/// \param[in] sampler Raw pointer to a sampler object used to choose samples from the dataset.
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<Cifar100Dataset> Cifar100(const std::string &dataset_dir, const std::string &usage,
                                                 Sampler *sampler,
                                                 const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<Cifar100Dataset>(StringToChar(dataset_dir), StringToChar(usage), sampler, cache);
}

/// \brief Function to create a Cifar100 Dataset
/// \notes The generated dataset has three columns ["image", "coarse_label", "fine_label"]
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] usage Usage of CIFAR100, can be "train", "test" or "all".
/// \param[in] sampler Sampler object used to choose samples from the dataset.
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<Cifar100Dataset> Cifar100(const std::string &dataset_dir, const std::string &usage,
                                                 const std::reference_wrapper<Sampler> sampler,
                                                 const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<Cifar100Dataset>(StringToChar(dataset_dir), StringToChar(usage), sampler, cache);
}
class CLUEDataset : public Dataset {
 public:
  explicit CLUEDataset(const std::vector<std::vector<char>> &dataset_files, const std::vector<char> &task,
                       const std::vector<char> &usage, int64_t num_samples, ShuffleMode shuffle, int32_t num_shards,
                       int32_t shard_id, const std::shared_ptr<DatasetCache> &cache);
  ~CLUEDataset() = default;
};

/// \brief Function to create a CLUEDataset
/// \notes The generated dataset has a variable number of columns depending on the task and usage
/// \param[in] dataset_files List of files to be read to search for a pattern of files. The list
///     will be sorted in a lexicographical order.
/// \param[in] task The kind of task, one of "AFQMC", "TNEWS", "IFLYTEK", "CMNLI", "WSC" and "CSL" (default="AFQMC").
/// \param[in] usage Part of dataset to be used, can be "train", "test" or "eval" (default="train").
/// \param[in] num_samples The number of samples to be included in the dataset
///     (Default = 0, which means all samples).
/// \param[in] shuffle The mode for shuffling data every epoch (Default=ShuffleMode::kGlobal).
///     Can be any of:
///     ShuffleMode::kFalse - No shuffling is performed.
///     ShuffleMode::kFiles - Shuffle files only.
///     ShuffleMode::kGlobal - Shuffle both the files and samples.
/// \param[in] num_shards Number of shards that the dataset should be divided into. (Default = 1)
/// \param[in] shard_id The shard ID within num_shards. This argument should be
///     specified only when num_shards is also specified. (Default = 0)
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current CLUEDataset
inline std::shared_ptr<CLUEDataset> CLUE(const std::vector<std::string> &dataset_files,
                                         const std::string &task = "AFQMC", const std::string &usage = "train",
                                         int64_t num_samples = 0, ShuffleMode shuffle = ShuffleMode::kGlobal,
                                         int32_t num_shards = 1, int32_t shard_id = 0,
                                         const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<CLUEDataset>(VectorStringToChar(dataset_files), StringToChar(task), StringToChar(usage),
                                       num_samples, shuffle, num_shards, shard_id, cache);
}
class CocoDataset : public Dataset {
 public:
  CocoDataset(const std::vector<char> &dataset_dir, const std::vector<char> &annotation_file,
              const std::vector<char> &task, const bool &decode, const std::shared_ptr<Sampler> &sampler,
              const std::shared_ptr<DatasetCache> &cache);
  CocoDataset(const std::vector<char> &dataset_dir, const std::vector<char> &annotation_file,
              const std::vector<char> &task, const bool &decode, Sampler *sampler,
              const std::shared_ptr<DatasetCache> &cache);
  CocoDataset(const std::vector<char> &dataset_dir, const std::vector<char> &annotation_file,
              const std::vector<char> &task, const bool &decode, const std::reference_wrapper<Sampler> sampler,
              const std::shared_ptr<DatasetCache> &cache);
  ~CocoDataset() = default;
};
/// \brief Function to create a CocoDataset
/// \notes The generated dataset has multi-columns :
///     - task='Detection', column: [['image', dtype=uint8], ['bbox', dtype=float32], ['category_id', dtype=uint32],
///       ['iscrowd', dtype=uint32]].
///     - task='Stuff', column: [['image', dtype=uint8], ['segmentation', dtype=float32], ['iscrowd', dtype=uint32]].
///     - task='Keypoint', column: [['image', dtype=uint8], ['keypoints', dtype=float32],
///       ['num_keypoints', dtype=uint32]].
///     - task='Panoptic', column: [['image', dtype=uint8], ['bbox', dtype=float32], ['category_id', dtype=uint32],
///       ['iscrowd', dtype=uint32], ['area', dtype=uint32]].
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] annotation_file Path to the annotation json
/// \param[in] task Set the task type of reading coco data, now support 'Detection'/'Stuff'/'Panoptic'/'Keypoint'
/// \param[in] decode Decode the images after reading
/// \param[in] sampler Shared pointer to a sampler object used to choose samples from the dataset. If sampler is not
///     given, a `RandomSampler` will be used to randomly iterate the entire dataset (default = RandomSampler())
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<CocoDataset> Coco(const std::string &dataset_dir, const std::string &annotation_file,
                                         const std::string &task = "Detection", const bool &decode = false,
                                         const std::shared_ptr<Sampler> &sampler = std::make_shared<RandomSampler>(),
                                         const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<CocoDataset>(StringToChar(dataset_dir), StringToChar(annotation_file), StringToChar(task),
                                       decode, sampler, cache);
}

/// \brief Function to create a CocoDataset
/// \notes The generated dataset has multi-columns :
///     - task='Detection', column: [['image', dtype=uint8], ['bbox', dtype=float32], ['category_id', dtype=uint32],
///       ['iscrowd', dtype=uint32]].
///     - task='Stuff', column: [['image', dtype=uint8], ['segmentation', dtype=float32], ['iscrowd', dtype=uint32]].
///     - task='Keypoint', column: [['image', dtype=uint8], ['keypoints', dtype=float32],
///       ['num_keypoints', dtype=uint32]].
///     - task='Panoptic', column: [['image', dtype=uint8], ['bbox', dtype=float32], ['category_id', dtype=uint32],
///       ['iscrowd', dtype=uint32], ['area', dtype=uint32]].
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] annotation_file Path to the annotation json
/// \param[in] task Set the task type of reading coco data, now support 'Detection'/'Stuff'/'Panoptic'/'Keypoint'
/// \param[in] decode Decode the images after reading
/// \param[in] sampler Raw pointer to a sampler object used to choose samples from the dataset.
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<CocoDataset> Coco(const std::string &dataset_dir, const std::string &annotation_file,
                                         const std::string &task, const bool &decode, Sampler *sampler,
                                         const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<CocoDataset>(StringToChar(dataset_dir), StringToChar(annotation_file), StringToChar(task),
                                       decode, sampler, cache);
}

/// \brief Function to create a CocoDataset
/// \notes The generated dataset has multi-columns :
///     - task='Detection', column: [['image', dtype=uint8], ['bbox', dtype=float32], ['category_id', dtype=uint32],
///       ['iscrowd', dtype=uint32]].
///     - task='Stuff', column: [['image', dtype=uint8], ['segmentation', dtype=float32], ['iscrowd', dtype=uint32]].
///     - task='Keypoint', column: [['image', dtype=uint8], ['keypoints', dtype=float32],
///       ['num_keypoints', dtype=uint32]].
///     - task='Panoptic', column: [['image', dtype=uint8], ['bbox', dtype=float32], ['category_id', dtype=uint32],
///       ['iscrowd', dtype=uint32], ['area', dtype=uint32]].
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] annotation_file Path to the annotation json
/// \param[in] task Set the task type of reading coco data, now support 'Detection'/'Stuff'/'Panoptic'/'Keypoint'
/// \param[in] decode Decode the images after reading
/// \param[in] sampler Sampler object used to choose samples from the dataset.
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<CocoDataset> Coco(const std::string &dataset_dir, const std::string &annotation_file,
                                         const std::string &task, const bool &decode,
                                         const std::reference_wrapper<Sampler> sampler,
                                         const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<CocoDataset>(StringToChar(dataset_dir), StringToChar(annotation_file), StringToChar(task),
                                       decode, sampler, cache);
}
class CSVDataset : public Dataset {
 public:
  explicit CSVDataset(const std::vector<std::vector<char>> &dataset_files, char field_delim,
                      const std::vector<std::shared_ptr<CsvBase>> &column_defaults,
                      const std::vector<std::vector<char>> &column_names, int64_t num_samples, ShuffleMode shuffle,
                      int32_t num_shards, int32_t shard_id, const std::shared_ptr<DatasetCache> &cache);
  ~CSVDataset() = default;
};

/// \brief Function to create a CSVDataset
/// \notes The generated dataset has a variable number of columns
/// \param[in] dataset_files List of files to be read to search for a pattern of files. The list
///     will be sorted in a lexicographical order.
/// \param[in] field_delim A char that indicates the delimiter to separate fields (default=',').
/// \param[in] column_defaults List of default values for the CSV field (default={}). Each item in the list is
///     either a valid type (float, int, or string). If this is not provided, treats all columns as string type.
/// \param[in] column_names List of column names of the dataset (default={}). If this is not provided, infers the
///     column_names from the first row of CSV file.
/// \param[in] num_samples The number of samples to be included in the dataset
///     (Default = 0, which means all samples).
/// \param[in] shuffle The mode for shuffling data every epoch (Default=ShuffleMode::kGlobal).
///     Can be any of:
///     ShuffleMode::kFalse - No shuffling is performed.
///     ShuffleMode::kFiles - Shuffle files only.
///     ShuffleMode::kGlobal - Shuffle both the files and samples.
/// \param[in] num_shards Number of shards that the dataset should be divided into. (Default = 1)
/// \param[in] shard_id The shard ID within num_shards. This argument should be
///     specified only when num_shards is also specified. (Default = 0)
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<CSVDataset> CSV(const std::vector<std::string> &dataset_files, char field_delim = ',',
                                       const std::vector<std::shared_ptr<CsvBase>> &column_defaults = {},
                                       const std::vector<std::string> &column_names = {}, int64_t num_samples = 0,
                                       ShuffleMode shuffle = ShuffleMode::kGlobal, int32_t num_shards = 1,
                                       int32_t shard_id = 0, const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<CSVDataset>(VectorStringToChar(dataset_files), field_delim, column_defaults,
                                      VectorStringToChar(column_names), num_samples, shuffle, num_shards, shard_id,
                                      cache);
}
class ImageFolderDataset : public Dataset {
 public:
  explicit ImageFolderDataset(const std::vector<char> &dataset_dir, bool decode,
                              const std::shared_ptr<Sampler> &sampler, const std::set<std::vector<char>> &extensions,
                              const std::map<std::vector<char>, int32_t> &class_indexing,
                              const std::shared_ptr<DatasetCache> &cache);
  explicit ImageFolderDataset(const std::vector<char> &dataset_dir, bool decode, Sampler *sampler,
                              const std::set<std::vector<char>> &extensions,
                              const std::map<std::vector<char>, int32_t> &class_indexing,
                              const std::shared_ptr<DatasetCache> &cache);
  explicit ImageFolderDataset(const std::vector<char> &dataset_dir, bool decode,
                              const std::reference_wrapper<Sampler> sampler,
                              const std::set<std::vector<char>> &extensions,
                              const std::map<std::vector<char>, int32_t> &class_indexing,
                              const std::shared_ptr<DatasetCache> &cache);
  ~ImageFolderDataset() = default;
};
/// \brief Function to create an ImageFolderDataset
/// \notes A source dataset that reads images from a tree of directories.
///     All images within one folder have the same label.
///     The generated dataset has two columns ["image", "label"]
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] decode A flag to decode in ImageFolder
/// \param[in] sampler Shared pointer to a sampler object used to choose samples from the dataset. If sampler is not
///     given, a `RandomSampler` will be used to randomly iterate the entire dataset (default = RandomSampler())
/// \param[in] extensions File extensions to be read
/// \param[in] class_indexing A class name to label map
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current ImageFolderDataset
inline std::shared_ptr<ImageFolderDataset> ImageFolder(
  const std::string &dataset_dir, bool decode = false,
  const std::shared_ptr<Sampler> &sampler = std::make_shared<RandomSampler>(),
  const std::set<std::string> &extensions = {}, const std::map<std::string, int32_t> &class_indexing = {},
  const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<ImageFolderDataset>(StringToChar(dataset_dir), decode, sampler, SetStringToChar(extensions),
                                              MapStringToChar(class_indexing), cache);
}

/// \brief Function to create an ImageFolderDataset
/// \notes A source dataset that reads images from a tree of directories.
///     All images within one folder have the same label.
///     The generated dataset has two columns ["image", "label"]
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] decode A flag to decode in ImageFolder
/// \param[in] sampler Raw pointer to a sampler object used to choose samples from the dataset.
/// \param[in] extensions File extensions to be read
/// \param[in] class_indexing A class name to label map
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current ImageFolderDataset
inline std::shared_ptr<ImageFolderDataset> ImageFolder(const std::string &dataset_dir, bool decode, Sampler *sampler,
                                                       const std::set<std::string> &extensions = {},
                                                       const std::map<std::string, int32_t> &class_indexing = {},
                                                       const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<ImageFolderDataset>(StringToChar(dataset_dir), decode, sampler, SetStringToChar(extensions),
                                              MapStringToChar(class_indexing), cache);
}

/// \brief Function to create an ImageFolderDataset
/// \notes A source dataset that reads images from a tree of directories.
///     All images within one folder have the same label.
///     The generated dataset has two columns ["image", "label"]
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] decode A flag to decode in ImageFolder
/// \param[in] sampler Sampler object used to choose samples from the dataset.
/// \param[in] extensions File extensions to be read
/// \param[in] class_indexing A class name to label map
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current ImageFolderDataset
inline std::shared_ptr<ImageFolderDataset> ImageFolder(const std::string &dataset_dir, bool decode,
                                                       const std::reference_wrapper<Sampler> sampler,
                                                       const std::set<std::string> &extensions = {},
                                                       const std::map<std::string, int32_t> &class_indexing = {},
                                                       const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<ImageFolderDataset>(StringToChar(dataset_dir), decode, sampler, SetStringToChar(extensions),
                                              MapStringToChar(class_indexing), cache);
}
class ManifestDataset : public Dataset {
 public:
  explicit ManifestDataset(const std::vector<char> &dataset_file, const std::vector<char> &usage,
                           const std::shared_ptr<Sampler> &sampler,
                           const std::map<std::vector<char>, int32_t> &class_indexing, bool decode,
                           const std::shared_ptr<DatasetCache> &cache);
  explicit ManifestDataset(const std::vector<char> &dataset_file, const std::vector<char> &usage, Sampler *sampler,
                           const std::map<std::vector<char>, int32_t> &class_indexing, bool decode,
                           const std::shared_ptr<DatasetCache> &cache);
  explicit ManifestDataset(const std::vector<char> &dataset_file, const std::vector<char> &usage,
                           const std::reference_wrapper<Sampler> sampler,
                           const std::map<std::vector<char>, int32_t> &class_indexing, bool decode,
                           const std::shared_ptr<DatasetCache> &cache);
  ~ManifestDataset() = default;
};
/// \brief Function to create a ManifestDataset
/// \notes The generated dataset has two columns ["image", "label"]
/// \param[in] dataset_file The dataset file to be read
/// \param[in] usage Part of dataset to be used, can be "train", "eval" or "inference" (default="train")
/// \param[in] sampler Shared pointer to a sampler object used to choose samples from the dataset. If sampler is not
///     given, a `RandomSampler` will be used to randomly iterate the entire dataset (default = RandomSampler())
/// \param[in] class_indexing A str-to-int mapping from label name to index (default={}, the folder
///     names will be sorted alphabetically and each class will be given a unique index starting from 0).
/// \param[in] decode Decode the images after reading (default=false).
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current ManifestDataset
inline std::shared_ptr<ManifestDataset> Manifest(
  const std::string &dataset_file, const std::string &usage = "train",
  const std::shared_ptr<Sampler> &sampler = std::make_shared<RandomSampler>(),
  const std::map<std::string, int32_t> &class_indexing = {}, bool decode = false,
  const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<ManifestDataset>(StringToChar(dataset_file), StringToChar(usage), sampler,
                                           MapStringToChar(class_indexing), decode, cache);
}

/// \brief Function to create a ManifestDataset
/// \notes The generated dataset has two columns ["image", "label"]
/// \param[in] dataset_file The dataset file to be read
/// \param[in] usage Part of dataset to be used, can be "train", "eval" or "inference"
/// \param[in] sampler Raw pointer to a sampler object used to choose samples from the dataset.
/// \param[in] class_indexing A str-to-int mapping from label name to index (default={}, the folder
///     names will be sorted alphabetically and each class will be given a unique index starting from 0).
/// \param[in] decode Decode the images after reading (default=false).
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current ManifestDataset
inline std::shared_ptr<ManifestDataset> Manifest(const std::string &dataset_file, const std::string &usage,
                                                 Sampler *sampler,
                                                 const std::map<std::string, int32_t> &class_indexing = {},
                                                 bool decode = false,
                                                 const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<ManifestDataset>(StringToChar(dataset_file), StringToChar(usage), sampler,
                                           MapStringToChar(class_indexing), decode, cache);
}

/// \brief Function to create a ManifestDataset
/// \notes The generated dataset has two columns ["image", "label"]
/// \param[in] dataset_file The dataset file to be read
/// \param[in] usage Part of dataset to be used, can be "train", "eval" or "inference"
/// \param[in] sampler Sampler object used to choose samples from the dataset.
/// \param[in] class_indexing A str-to-int mapping from label name to index (default={}, the folder
///     names will be sorted alphabetically and each class will be given a unique index starting from 0).
/// \param[in] decode Decode the images after reading (default=false).
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current ManifestDataset
inline std::shared_ptr<ManifestDataset> Manifest(const std::string &dataset_file, const std::string &usage,
                                                 const std::reference_wrapper<Sampler> sampler,
                                                 const std::map<std::string, int32_t> &class_indexing = {},
                                                 bool decode = false,
                                                 const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<ManifestDataset>(StringToChar(dataset_file), StringToChar(usage), sampler,
                                           MapStringToChar(class_indexing), decode, cache);
}
class MindDataDataset : public Dataset {
 public:
  explicit MindDataDataset(const std::vector<char> &dataset_file, const std::vector<std::vector<char>> &columns_list,
                           const std::shared_ptr<Sampler> &sampler, nlohmann::json padded_sample, int64_t num_padded);
  explicit MindDataDataset(const std::vector<char> &dataset_file, const std::vector<std::vector<char>> &columns_list,
                           Sampler *sampler, nlohmann::json padded_sample, int64_t num_padded);
  explicit MindDataDataset(const std::vector<char> &dataset_file, const std::vector<std::vector<char>> &columns_list,
                           const std::reference_wrapper<Sampler> sampler, nlohmann::json padded_sample,
                           int64_t num_padded);
  explicit MindDataDataset(const std::vector<std::vector<char>> &dataset_files,
                           const std::vector<std::vector<char>> &columns_list, const std::shared_ptr<Sampler> &sampler,
                           nlohmann::json padded_sample, int64_t num_padded);
  explicit MindDataDataset(const std::vector<std::vector<char>> &dataset_files,
                           const std::vector<std::vector<char>> &columns_list, Sampler *sampler,
                           nlohmann::json padded_sample, int64_t num_padded);
  explicit MindDataDataset(const std::vector<std::vector<char>> &dataset_files,
                           const std::vector<std::vector<char>> &columns_list,
                           const std::reference_wrapper<Sampler> sampler, nlohmann::json padded_sample,
                           int64_t num_padded);
  ~MindDataDataset() = default;
};
/// \brief Function to create a MindDataDataset
/// \param[in] dataset_file File name of one component of a mindrecord source. Other files with identical source
/// in the same path will be found and loaded automatically.
/// \param[in] columns_list List of columns to be read (default={})
/// \param[in] sampler Shared pointer to a sampler object used to choose samples from the dataset. If sampler is not
/// given, a `RandomSampler` will be used to randomly iterate the entire dataset (default = RandomSampler()).
/// Supported sampler list: SubsetRandomSampler, PkSampler, RandomSampler, SequentialSampler, DistributedSampler.
/// \param[in] padded_sample Samples will be appended to the dataset, where keys are the same as columns_list.
/// \param[in] num_padded Number of padding samples. Dataset size plus num_padded should be divisible by num_shards.
/// \return Shared pointer to the current MindDataDataset
inline std::shared_ptr<MindDataDataset> MindData(
  const std::string &dataset_file, const std::vector<std::string> &columns_list = {},
  const std::shared_ptr<Sampler> &sampler = std::make_shared<RandomSampler>(), nlohmann::json padded_sample = nullptr,
  int64_t num_padded = 0) {
  return std::make_shared<MindDataDataset>(StringToChar(dataset_file), VectorStringToChar(columns_list), sampler,
                                           padded_sample, num_padded);
}
/// \brief Function to create a MindDataDataset
/// \param[in] dataset_file File name of one component of a mindrecord source. Other files with identical source
/// in the same path will be found and loaded automatically.
/// \param[in] columns_list List of columns to be read
/// \param[in] sampler Raw pointer to a sampler object used to choose samples from the dataset.
/// Supported sampler list: SubsetRandomSampler, PkSampler, RandomSampler, SequentialSampler, DistributedSampler.
/// \param[in] padded_sample Samples will be appended to the dataset, where keys are the same as columns_list.
/// \param[in] num_padded Number of padding samples. Dataset size plus num_padded should be divisible by num_shards.
/// \return Shared pointer to the current MindDataDataset
inline std::shared_ptr<MindDataDataset> MindData(const std::string &dataset_file,
                                                 const std::vector<std::string> &columns_list, Sampler *sampler,
                                                 nlohmann::json padded_sample = nullptr, int64_t num_padded = 0) {
  return std::make_shared<MindDataDataset>(StringToChar(dataset_file), VectorStringToChar(columns_list), sampler,
                                           padded_sample, num_padded);
}
/// \brief Function to create a MindDataDataset
/// \param[in] dataset_file File name of one component of a mindrecord source. Other files with identical source
/// in the same path will be found and loaded automatically.
/// \param[in] columns_list List of columns to be read
/// \param[in] sampler Sampler object used to choose samples from the dataset.
/// Supported sampler list: SubsetRandomSampler, PkSampler, RandomSampler, SequentialSampler, DistributedSampler.
/// \param[in] padded_sample Samples will be appended to the dataset, where keys are the same as columns_list.
/// \param[in] num_padded Number of padding samples. Dataset size plus num_padded should be divisible by num_shards.
/// \return Shared pointer to the current MindDataDataset
inline std::shared_ptr<MindDataDataset> MindData(const std::string &dataset_file,
                                                 const std::vector<std::string> &columns_list,
                                                 const std::reference_wrapper<Sampler> sampler,
                                                 nlohmann::json padded_sample = nullptr, int64_t num_padded = 0) {
  return std::make_shared<MindDataDataset>(StringToChar(dataset_file), VectorStringToChar(columns_list), sampler,
                                           padded_sample, num_padded);
}
/// \brief Function to create a MindDataDataset
/// \param[in] dataset_files List of dataset files to be read directly.
/// \param[in] columns_list List of columns to be read (default={})
/// \param[in] sampler Shared pointer to a sampler object used to choose samples from the dataset. If sampler is not
/// given, a `RandomSampler` will be used to randomly iterate the entire dataset (default = RandomSampler()).
/// Supported sampler list: SubsetRandomSampler, PkSampler, RandomSampler, SequentialSampler, DistributedSampler.
/// \param[in] padded_sample Samples will be appended to the dataset, where keys are the same as columns_list.
/// \param[in] num_padded Number of padding samples. Dataset size plus num_padded should be divisible by num_shards.
/// \return Shared pointer to the current MindDataDataset
inline std::shared_ptr<MindDataDataset> MindData(
  const std::vector<std::string> &dataset_files, const std::vector<std::string> &columns_list = {},
  const std::shared_ptr<Sampler> &sampler = std::make_shared<RandomSampler>(), nlohmann::json padded_sample = nullptr,
  int64_t num_padded = 0) {
  return std::make_shared<MindDataDataset>(VectorStringToChar(dataset_files), VectorStringToChar(columns_list), sampler,
                                           padded_sample, num_padded);
}
/// \brief Function to create a MindDataDataset
/// \param[in] dataset_files List of dataset files to be read directly.
/// \param[in] columns_list List of columns to be read
/// \param[in] sampler Raw pointer to a sampler object used to choose samples from the dataset.
/// Supported sampler list: SubsetRandomSampler, PkSampler, RandomSampler, SequentialSampler, DistributedSampler.
/// \param[in] padded_sample Samples will be appended to the dataset, where keys are the same as columns_list.
/// \param[in] num_padded Number of padding samples. Dataset size plus num_padded should be divisible by num_shards.
/// \return Shared pointer to the current MindDataDataset
inline std::shared_ptr<MindDataDataset> MindData(const std::vector<std::string> &dataset_files,
                                                 const std::vector<std::string> &columns_list, Sampler *sampler,
                                                 nlohmann::json padded_sample = nullptr, int64_t num_padded = 0) {
  return std::make_shared<MindDataDataset>(VectorStringToChar(dataset_files), VectorStringToChar(columns_list), sampler,
                                           padded_sample, num_padded);
}
/// \brief Function to create a MindDataDataset
/// \param[in] dataset_files List of dataset files to be read directly.
/// \param[in] columns_list List of columns to be read
/// \param[in] sampler Sampler object used to choose samples from the dataset.
/// Supported sampler list: SubsetRandomSampler, PkSampler, RandomSampler, SequentialSampler, DistributedSampler.
/// \param[in] padded_sample Samples will be appended to the dataset, where keys are the same as columns_list.
/// \param[in] num_padded Number of padding samples. Dataset size plus num_padded should be divisible by num_shards.
/// \return Shared pointer to the current MindDataDataset
inline std::shared_ptr<MindDataDataset> MindData(const std::vector<std::string> &dataset_files,
                                                 const std::vector<std::string> &columns_list,
                                                 const std::reference_wrapper<Sampler> sampler,
                                                 nlohmann::json padded_sample = nullptr, int64_t num_padded = 0) {
  return std::make_shared<MindDataDataset>(VectorStringToChar(dataset_files), VectorStringToChar(columns_list), sampler,
                                           padded_sample, num_padded);
}
class MnistDataset : public Dataset {
 public:
  explicit MnistDataset(const std::vector<char> &dataset_dir, const std::vector<char> &usage,
                        const std::shared_ptr<Sampler> &sampler, const std::shared_ptr<DatasetCache> &cache);
  explicit MnistDataset(const std::vector<char> &dataset_dir, const std::vector<char> &usage, Sampler *sampler,
                        const std::shared_ptr<DatasetCache> &cache);
  explicit MnistDataset(const std::vector<char> &dataset_dir, const std::vector<char> &usage,
                        const std::reference_wrapper<Sampler> sampler, const std::shared_ptr<DatasetCache> &cache);
  ~MnistDataset() = default;
};
/// \brief Function to create a MnistDataset
/// \note The generated dataset has two columns ["image", "label"]
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] usage Usage of MNIST, can be "train", "test" or "all" (default="all").
/// \param[in] sampler Shared pointer to a sampler object used to choose samples from the dataset. If sampler is not
/// given, a `RandomSampler` will be used to randomly iterate the entire dataset (default = RandomSampler())
/// \param[in] cache Tensor cache to use (default=nullptr, which means no cache is used).
/// \return Shared pointer to the current MnistDataset
inline std::shared_ptr<MnistDataset> Mnist(const std::string &dataset_dir, const std::string &usage = "all",
                                           const std::shared_ptr<Sampler> &sampler = std::make_shared<RandomSampler>(),
                                           const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<MnistDataset>(StringToChar(dataset_dir), StringToChar(usage), sampler, cache);
}
/// \brief Function to create a MnistDataset
/// \note The generated dataset has two columns ["image", "label"]
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] usage Usage of MNIST, can be "train", "test" or "all"
/// \param[in] sampler Raw pointer to a sampler object used to choose samples from the dataset.
/// \param[in] cache Tensor cache to use (default=nullptr, which means no cache is used).
/// \return Shared pointer to the current MnistDataset
inline std::shared_ptr<MnistDataset> Mnist(const std::string &dataset_dir, const std::string &usage, Sampler *sampler,
                                           const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<MnistDataset>(StringToChar(dataset_dir), StringToChar(usage), sampler, cache);
}
/// \brief Function to create a MnistDataset
/// \note The generated dataset has two columns ["image", "label"]
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] usage Usage of MNIST, can be "train", "test" or "all"
/// \param[in] sampler Sampler object used to choose samples from the dataset.
/// \param[in] cache Tensor cache to use (default=nullptr, which means no cache is used).
/// \return Shared pointer to the current MnistDataset
inline std::shared_ptr<MnistDataset> Mnist(const std::string &dataset_dir, const std::string &usage,
                                           const std::reference_wrapper<Sampler> sampler,
                                           const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<MnistDataset>(StringToChar(dataset_dir), StringToChar(usage), sampler, cache);
}
/// \brief Function to create a ConcatDataset
/// \note Overloads the "+" operator to concat two datasets
/// \param[in] datasets1 Shared pointer to the first dataset to be concatenated
/// \param[in] datasets2 Shared pointer to the second dataset to be concatenated
/// \return Shared pointer to the current ConcatDataset
inline std::shared_ptr<ConcatDataset> operator+(const std::shared_ptr<Dataset> &datasets1,
                                                const std::shared_ptr<Dataset> &datasets2) {
  return std::make_shared<ConcatDataset>(std::vector({datasets1, datasets2}));
}
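// The "+" overload above lets two pipelines be concatenated with natural syntax
// (ds1 + ds2). A self-contained sketch of the same pattern, with hypothetical
// Node/ConcatNode stand-ins rather than the library's Dataset/ConcatDataset:

```cpp
#include <memory>
#include <utility>
#include <vector>

// Hypothetical minimal stand-ins to demonstrate the operator+ concat pattern.
struct Node {
  virtual ~Node() = default;
};
struct ConcatNode : Node {
  explicit ConcatNode(std::vector<std::shared_ptr<Node>> children) : children_(std::move(children)) {}
  std::vector<std::shared_ptr<Node>> children_;
};

// Overloading '+' on shared pointers builds a concat node from two pipelines,
// exactly the shape of the overload in this header.
inline std::shared_ptr<ConcatNode> operator+(const std::shared_ptr<Node> &a, const std::shared_ptr<Node> &b) {
  return std::make_shared<ConcatNode>(std::vector<std::shared_ptr<Node>>{a, b});
}
```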
class RandomDataDataset : public Dataset {
 public:
  RandomDataDataset(const int32_t &total_rows, std::shared_ptr<SchemaObj> schema,
                    const std::vector<std::vector<char>> &columns_list, std::shared_ptr<DatasetCache> cache);
  RandomDataDataset(const int32_t &total_rows, const std::vector<char> &schema_path,
                    const std::vector<std::vector<char>> &columns_list, std::shared_ptr<DatasetCache> cache);
  ~RandomDataDataset() = default;
};
/// \brief Function to create a RandomDataset
/// \param[in] total_rows Number of rows for the dataset to generate (default=0, number of rows is random)
/// \param[in] schema SchemaObj to set column type, data type and data shape
/// \param[in] columns_list List of columns to be read (default={}, read all columns)
/// \param[in] cache Tensor cache to use (default=nullptr, which means no cache is used).
/// \return Shared pointer to the current Dataset
template <typename T = std::shared_ptr<SchemaObj>>
std::shared_ptr<RandomDataDataset> RandomData(const int32_t &total_rows = 0, const T &schema = nullptr,
                                              const std::vector<std::string> &columns_list = {},
                                              const std::shared_ptr<DatasetCache> &cache = nullptr) {
  std::shared_ptr<RandomDataDataset> ds;
  if constexpr (std::is_same<T, std::nullptr_t>::value || std::is_same<T, std::shared_ptr<SchemaObj>>::value) {
    std::shared_ptr<SchemaObj> schema_obj = schema;
    ds = std::make_shared<RandomDataDataset>(total_rows, std::move(schema_obj), VectorStringToChar(columns_list),
                                             cache);
  } else {
    ds = std::make_shared<RandomDataDataset>(total_rows, StringToChar(schema), VectorStringToChar(columns_list),
                                             cache);
  }
  return ds;
}
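// RandomData (and TFRecord below) accept the schema either as a SchemaObj pointer
// or as a path string, dispatching at compile time with `if constexpr` on the
// deduced template parameter. A self-contained sketch of that dispatch technique,
// using an illustrative SchemaObj stand-in and a hypothetical helper name:

```cpp
#include <memory>
#include <string>
#include <type_traits>

struct SchemaObj {};  // stand-in for the real schema type

// Hypothetical helper: reports which compile-time branch a given schema argument
// selects -- the same is_same/if-constexpr test used by RandomData and TFRecord.
template <typename T = std::shared_ptr<SchemaObj>>
std::string DescribeSchemaArg(const T &schema = nullptr) {
  if constexpr (std::is_same<T, std::nullptr_t>::value || std::is_same<T, std::shared_ptr<SchemaObj>>::value) {
    // Object branch: the argument is a (possibly null) schema object.
    (void)schema;
    return "object";
  } else {
    // Path branch: the argument is convertible to a string path.
    return std::string("path:") + schema;
  }
}
```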
class TextFileDataset : public Dataset {
 public:
  explicit TextFileDataset(const std::vector<std::vector<char>> &dataset_files, int64_t num_samples,
                           ShuffleMode shuffle, int32_t num_shards, int32_t shard_id,
                           const std::shared_ptr<DatasetCache> &cache);
  ~TextFileDataset() = default;
};
/// \brief Function to create a TextFileDataset
/// \note The generated dataset has one column ["text"]
/// \param[in] dataset_files List of files to be read to search for a pattern of files. The list
/// will be sorted in a lexicographical order.
/// \param[in] num_samples The number of samples to be included in the dataset
/// (default=0, which means all samples).
/// \param[in] shuffle The mode for shuffling data every epoch (default=ShuffleMode::kGlobal).
/// Can be any of:
/// ShuffleMode::kFalse - No shuffling is performed.
/// ShuffleMode::kFiles - Shuffle files only.
/// ShuffleMode::kGlobal - Shuffle both the files and samples.
/// \param[in] num_shards Number of shards that the dataset should be divided into (default=1).
/// \param[in] shard_id The shard ID within num_shards. This argument should be
/// specified only when num_shards is also specified (default=0).
/// \param[in] cache Tensor cache to use (default=nullptr, which means no cache is used).
/// \return Shared pointer to the current TextFileDataset
inline std::shared_ptr<TextFileDataset> TextFile(const std::vector<std::string> &dataset_files, int64_t num_samples = 0,
                                                 ShuffleMode shuffle = ShuffleMode::kGlobal, int32_t num_shards = 1,
                                                 int32_t shard_id = 0,
                                                 const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<TextFileDataset>(VectorStringToChar(dataset_files), num_samples, shuffle, num_shards,
                                           shard_id, cache);
}
class TFRecordDataset : public Dataset {
 public:
  TFRecordDataset(const std::vector<std::vector<char>> &dataset_files, const std::vector<char> &schema,
                  const std::vector<std::vector<char>> &columns_list, int64_t num_samples, ShuffleMode shuffle,
                  int32_t num_shards, int32_t shard_id, bool shard_equal_rows, std::shared_ptr<DatasetCache> cache);
  /// \brief Constructor
  /// \note Parameter 'schema' is a shared pointer to a Schema object
  TFRecordDataset(const std::vector<std::vector<char>> &dataset_files, std::shared_ptr<SchemaObj> schema,
                  const std::vector<std::vector<char>> &columns_list, int64_t num_samples, ShuffleMode shuffle,
                  int32_t num_shards, int32_t shard_id, bool shard_equal_rows, std::shared_ptr<DatasetCache> cache);
  ~TFRecordDataset() = default;
};
/// \brief Function to create a TFRecordDataset
/// \param[in] dataset_files List of files to be read to search for a pattern of files. The list
/// will be sorted in a lexicographical order.
/// \param[in] schema SchemaObj or string to schema path (default=nullptr, which means that the
/// meta data from the TFData file is considered the schema).
/// \param[in] columns_list List of columns to be read (default={}, read all columns).
/// \param[in] num_samples The number of samples to be included in the dataset
/// (default=0, which means all samples).
/// If num_samples is 0 and numRows (parsed from schema) does not exist, read the full dataset;
/// If num_samples is 0 and numRows (parsed from schema) is greater than 0, read numRows rows;
/// If both num_samples and numRows (parsed from schema) are greater than 0, read num_samples rows.
/// \param[in] shuffle The mode for shuffling data every epoch (default=ShuffleMode::kGlobal).
/// Can be any of:
/// ShuffleMode::kFalse - No shuffling is performed.
/// ShuffleMode::kFiles - Shuffle files only.
/// ShuffleMode::kGlobal - Shuffle both the files and samples.
/// \param[in] num_shards Number of shards that the dataset should be divided into (default=1).
/// \param[in] shard_id The shard ID within num_shards. This argument should be specified only
/// when num_shards is also specified (default=0).
/// \param[in] shard_equal_rows Get equal rows for all shards (default=false, the number of rows of
/// each shard may not be equal).
/// \param[in] cache Tensor cache to use (default=nullptr, which means no cache is used).
/// \return Shared pointer to the current TFRecordDataset
template <typename T = std::shared_ptr<SchemaObj>>
std::shared_ptr<TFRecordDataset> TFRecord(const std::vector<std::string> &dataset_files, const T &schema = nullptr,
                                          const std::vector<std::string> &columns_list = {}, int64_t num_samples = 0,
                                          ShuffleMode shuffle = ShuffleMode::kGlobal, int32_t num_shards = 1,
                                          int32_t shard_id = 0, bool shard_equal_rows = false,
                                          const std::shared_ptr<DatasetCache> &cache = nullptr) {
  std::shared_ptr<TFRecordDataset> ds = nullptr;
  if constexpr (std::is_same<T, std::nullptr_t>::value || std::is_same<T, std::shared_ptr<SchemaObj>>::value) {
    std::shared_ptr<SchemaObj> schema_obj = schema;
    ds = std::make_shared<TFRecordDataset>(VectorStringToChar(dataset_files), std::move(schema_obj),
                                           VectorStringToChar(columns_list), num_samples, shuffle, num_shards, shard_id,
                                           shard_equal_rows, cache);
  } else {
    std::string schema_path = schema;
    if (!schema_path.empty()) {
      struct stat sb;
      int rc = stat(common::SafeCStr(schema_path), &sb);
      if (rc == -1 && errno != ENOENT) {
        MS_LOG(WARNING) << "Unable to query the status of [" << schema_path << "]. Errno = " << errno << ".";
      }
      if (rc != 0) {
        MS_LOG(ERROR) << "TFRecordDataset: schema path [" << schema_path << "] is invalid or does not exist.";
        return nullptr;
      }
    }
    ds = std::make_shared<TFRecordDataset>(VectorStringToChar(dataset_files), StringToChar(schema_path),
                                           VectorStringToChar(columns_list), num_samples, shuffle, num_shards, shard_id,
                                           shard_equal_rows, cache);
  }
  return ds;
}
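// The string branch of TFRecord() validates the schema path with POSIX stat()
// before constructing the node, distinguishing "not found" (ENOENT) from other
// query failures. The same check in isolation, as a hypothetical helper:

```cpp
#include <cerrno>
#include <string>
#include <sys/stat.h>

// Hypothetical helper mirroring the schema-path check inside TFRecord():
// returns true only if the path exists and can be stat'ed.
inline bool SchemaPathExists(const std::string &path) {
  struct stat sb;
  int rc = stat(path.c_str(), &sb);
  if (rc == -1 && errno != ENOENT) {
    // The path could not be queried for a reason other than "does not exist"
    // (e.g. a permission problem); the header logs a warning in this case.
    return false;
  }
  return rc == 0;
}
```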
class VOCDataset : public Dataset {
 public:
  explicit VOCDataset(const std::vector<char> &dataset_dir, const std::vector<char> &task,
                      const std::vector<char> &usage, const std::map<std::vector<char>, int32_t> &class_indexing,
                      bool decode, const std::shared_ptr<Sampler> &sampler, const std::shared_ptr<DatasetCache> &cache);
  explicit VOCDataset(const std::vector<char> &dataset_dir, const std::vector<char> &task,
                      const std::vector<char> &usage, const std::map<std::vector<char>, int32_t> &class_indexing,
                      bool decode, Sampler *sampler, const std::shared_ptr<DatasetCache> &cache);
  explicit VOCDataset(const std::vector<char> &dataset_dir, const std::vector<char> &task,
                      const std::vector<char> &usage, const std::map<std::vector<char>, int32_t> &class_indexing,
                      bool decode, const std::reference_wrapper<Sampler> sampler,
                      const std::shared_ptr<DatasetCache> &cache);
  ~VOCDataset() = default;
};
/// \brief Function to create a VOCDataset
/// \note The generated dataset has multiple columns:
/// - task='Detection', columns: [['image', dtype=uint8], ['bbox', dtype=float32], ['label', dtype=uint32],
/// ['difficult', dtype=uint32], ['truncate', dtype=uint32]].
/// - task='Segmentation', columns: [['image', dtype=uint8], ['target', dtype=uint8]].
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] task Set the task type of reading VOC data; only "Segmentation" or "Detection" is supported
/// \param[in] usage The type of data list text file to be read (default="train").
/// \param[in] class_indexing A str-to-int mapping from label name to index, only valid in the "Detection" task
/// \param[in] decode Decode the images after reading
/// \param[in] sampler Shared pointer to a sampler object used to choose samples from the dataset. If sampler is not
/// given, a `RandomSampler` will be used to randomly iterate the entire dataset (default = RandomSampler())
/// \param[in] cache Tensor cache to use (default=nullptr, which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<VOCDataset> VOC(const std::string &dataset_dir, const std::string &task = "Segmentation",
                                       const std::string &usage = "train",
                                       const std::map<std::string, int32_t> &class_indexing = {}, bool decode = false,
                                       const std::shared_ptr<Sampler> &sampler = std::make_shared<RandomSampler>(),
                                       const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<VOCDataset>(StringToChar(dataset_dir), StringToChar(task), StringToChar(usage),
                                      MapStringToChar(class_indexing), decode, sampler, cache);
}
/// \brief Function to create a VOCDataset
/// \note The generated dataset has multiple columns:
/// - task='Detection', columns: [['image', dtype=uint8], ['bbox', dtype=float32], ['label', dtype=uint32],
/// ['difficult', dtype=uint32], ['truncate', dtype=uint32]].
/// - task='Segmentation', columns: [['image', dtype=uint8], ['target', dtype=uint8]].
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] task Set the task type of reading VOC data; only "Segmentation" or "Detection" is supported
/// \param[in] usage The type of data list text file to be read.
/// \param[in] class_indexing A str-to-int mapping from label name to index, only valid in the "Detection" task
/// \param[in] decode Decode the images after reading
/// \param[in] sampler Raw pointer to a sampler object used to choose samples from the dataset.
/// \param[in] cache Tensor cache to use (default=nullptr, which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<VOCDataset> VOC(const std::string &dataset_dir, const std::string &task,
                                       const std::string &usage, const std::map<std::string, int32_t> &class_indexing,
                                       bool decode, Sampler *sampler,
                                       const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<VOCDataset>(StringToChar(dataset_dir), StringToChar(task), StringToChar(usage),
                                      MapStringToChar(class_indexing), decode, sampler, cache);
}
/// \brief Function to create a VOCDataset
/// \note The generated dataset has multiple columns:
/// - task='Detection', columns: [['image', dtype=uint8], ['bbox', dtype=float32], ['label', dtype=uint32],
/// ['difficult', dtype=uint32], ['truncate', dtype=uint32]].
/// - task='Segmentation', columns: [['image', dtype=uint8], ['target', dtype=uint8]].
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] task Set the task type of reading VOC data; only "Segmentation" or "Detection" is supported
/// \param[in] usage The type of data list text file to be read.
/// \param[in] class_indexing A str-to-int mapping from label name to index, only valid in the "Detection" task
/// \param[in] decode Decode the images after reading
/// \param[in] sampler Sampler object used to choose samples from the dataset.
/// \param[in] cache Tensor cache to use (default=nullptr, which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<VOCDataset> VOC(const std::string &dataset_dir, const std::string &task,
                                       const std::string &usage, const std::map<std::string, int32_t> &class_indexing,
                                       bool decode, const std::reference_wrapper<Sampler> sampler,
                                       const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<VOCDataset>(StringToChar(dataset_dir), StringToChar(task), StringToChar(usage),
                                      MapStringToChar(class_indexing), decode, sampler, cache);
}
std::shared_ptr<DatasetCache> CreateDatasetCacheCharIF(session_id_type id, uint64_t mem_sz, bool spill,
                                                       std::optional<std::vector<char>> hostname = std::nullopt,
                                                       std::optional<int32_t> port = std::nullopt,
                                                       std::optional<int32_t> num_connections = std::nullopt,
                                                       std::optional<int32_t> prefetch_sz = std::nullopt);
/// \brief Function to create a cache to be attached to a dataset
/// \param[in] id A user assigned session id for the current pipeline.
/// \param[in] mem_sz Size of the memory set aside for the row caching (default=0, which means unlimited;
/// note that it might bring in the risk of running out of memory on the machine).
/// \param[in] spill Spill to disk if out of memory (default=false).
/// \param[in] hostname Optional host name (default="127.0.0.1").
/// \param[in] port Optional port (default=50052).
/// \param[in] num_connections Optional number of connections (default=12).
/// \param[in] prefetch_sz Optional prefetch size (default=20).
/// \return Shared pointer to DatasetCache. If error, nullptr is returned.
inline std::shared_ptr<DatasetCache> CreateDatasetCache(session_id_type id, uint64_t mem_sz, bool spill,
                                                        std::optional<std::string> hostname = std::nullopt,
                                                        std::optional<int32_t> port = std::nullopt,
                                                        std::optional<int32_t> num_connections = std::nullopt,
                                                        std::optional<int32_t> prefetch_sz = std::nullopt) {
  return CreateDatasetCacheCharIF(id, mem_sz, spill, OptionalStringToChar(hostname), port, num_connections,
                                  prefetch_sz);
}
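// CreateDatasetCache takes its connection settings as std::optional so that
// unset values fall back to the documented defaults ("127.0.0.1", port 50052,
// and so on). The fallback idiom in isolation, with hypothetical helper names:

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Illustrative only: resolve optional cache-client settings against the
// defaults documented above (hostname "127.0.0.1", port 50052).
inline std::string ResolveHostname(const std::optional<std::string> &hostname) {
  return hostname.value_or("127.0.0.1");
}
inline int32_t ResolvePort(std::optional<int32_t> port) { return port.value_or(50052); }
```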
/// \brief Function to create a ZipDataset
/// \note Applies zip to the datasets
/// \param[in] datasets List of shared pointers to the datasets that we want to zip
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<ZipDataset> Zip(const std::vector<std::shared_ptr<Dataset>> &datasets) {
  return std::make_shared<ZipDataset>(datasets);
}
}  // namespace dataset
}  // namespace mindspore
#endif  // MINDSPORE_CCSRC_MINDDATA_DATASET_INCLUDE_DATASETS_H_