
datasets.h 104 kB

- added python api based on cpp api
- 1st draft of python iterator
- Added Cifar10 and Cifar100 pybind port
- Change pybind to use IR for Skip and Manifest (Signed-off-by: alex-yuyue <yue.yu1@huawei.com>)
- DatasetNode as a base for all IR nodes
- namespace change
- Fix the namespace issue and make ut tests work (Signed-off-by: alex-yuyue <yue.yu1@huawei.com>)
- Add VOCDataset
- !63 Added RandomDataset
- add imagefolder ir
- Pybind switch: CelebA and UT
- !61 CLUE example with class definition (merge branch 'python-api' of gitee.com:ezphlow/mindspore into clue_class_pybind; passing testcases; added CLUE, not working)
- add ManifestDataset IR (Signed-off-by: alex-yuyue <yue.yu1@huawei.com>)
- Update Coco & VOC & TFReader; update clang-format; reorder datasets_binding
- !69 Add Generator and move c_dataset.Iterator to dataset.Iterator (add GeneratorDataset to c_dataset)
- !67 Moving c_datasets and adding sampler wrapper (need to add create() method in datasets.py; migration from c_dataset to dataset, part 1)
- !71 Fix indentation error
- !72 Fix c_api test cases
- !73 Added CSVDataset
- pybind switch: Take and CelebA fixes
- !75 Move c_dataset functionality to datasets (fixed existing testcases; added working clue and imagefolder; added sampler conversion from pybind; added sampler creation)
- !77 Add Python API tree
- add minddataset
- TextFileDataset pybind
- Rename to skip test_concat.py and test_minddataset_exception.py
- !80 Add batch IR to python-api branch, most test cases work (staging III; staging, add pybind)
- Enable more c_api Take and CelebA tests; delete util_c_api
- !84 Schema changes in datasets.py
- !85 Remove input_indexes from sub-classes (remove input_index from each subclass)
- !83 Remove C datasets (removed c_dataset package)
- !82 pybind switch: shuffle
- !86 Add build_vocab
- Rebase with upstream/master; _shuffle conflict; BatchNode error
- !88 Fix rebase problem
- Enable more unit tests; code typo/nit fixes
- !91 Fix python vocab hang
- !89 Added BucketBatchByLength pybind switch
- Update and enable more test_c_api_*.py tests
- !95 Add BuildSentencePieceVocab
- !96 Fix more tests (fix some tests; enable more test_c_api_*; add syncwait)
- !99 pybind switch for device op
- !93 Add getters to python API
- !101 Validate tree, error if graph (add sync wait)
- !103 TFRecord/Random datasets schema problem
- !102 Added Filter pybind switch
- !104 Fix num_samples
- !105 Fix to_device hang
- !94 Add Cache support for CLUE dataset (added cache for all dataset ops; format change; added CLUE cache support; added Cache conversion)
- Add save pybind; fix compile error; init; modify concat_node
- !107 Fix some test cases
- Enable and fix more tests
- !109 pybind switch for get_dataset_size
- Some check-code fixes for pylint, cpplint and clang-format
- !113 Add callback (revert; dataset_sz 1 line; fix typo; get callback to work)
- !114 Make Android compile clean
- Fix build issues due to rebase
- !115 Fix more tests (fix test cases; !93 Add getters to python API)
- Fix test_profiling.py
- !116 Fix get_dataset_size
- !117 GetColumnNames pybind switch
- Code-check fixes: clang-format, cppcheck, cpplint, pylint
- Delete duplicate test_c_api_*.py files; more lint fixes
- !121 Fix cpp tests (remove extra call to getNext in cpp tests)
- !122 Fix Schema with Generator
- Fix some cases of csv & mindrecord
- !124 Fix tfrecord get_dataset_size and add some UTs
- !125 Getter separation
- !126 Fix sampler.GetNumSamples
- !127 Assign runtime getter to each get function
- Fix compile issues
- !128 Match master code
- !129 Cleanup DeviceOp/save code (cleanup ToDevice/Save code)
- !130 Add cache fix (cache fix for map and image folder)
- !132 Fix testing team issues (pass queue_name from python to C++; add Schema.from_json)
- !131 Fix Cache op issues and delete de_pipeline (roll back C++ change; removed de_pipeline, passing all cache tests; fixed cache tests)
- !134 Cleanup datasets.py, part 1
- !133 Updated validation for SentencePieceVocab.from_dataset (added type_check for column names)
- Rebase on master 181120 10:20
- Fix profiling
- Temporary solution for catching status from Node.Build()
- !141 ToDevice termination
- pylint fixes
- !137 Fix test team issues and add some corresponding tests
- !138 TreeGetter changes to use OptPass (Zirui)
- Rebase fix
- !143 Fix cpplint issue
- pylint fixes in updated testcases
- !145 Reset exceptions testcase (reset exception test to master)
- !146 Fix Check_Pylint error
- !147 Fix android
- !148 ToDevice changes (add ToDevice to the iterator list for cleanup at exit)
- !149 Pylint issue
- !150 Pylint 2
- !152 ExecutionTree error (ET destructor error)
- !153 In getter_pass, only remove callback, without deleting map op (getter pass no longer removes map)
- !156 Early __del__ of iterator/to_device
- !155 Address review comments, Eric, round 1 (one-liner fix to validators.py; roll back signature fix; lint fix)
- Address review comments, Eric, round 2 (C++ lint fix)
- !158 Review rework for dataset bindings, part 1 (reorder nodes Repeat and Rename)
- !154 Fix minor problems in the comments (datasets.py, python_tree_consumer.cc, iterators_bindings.cc, iterators.py)
- !157 Add replace_none to datasets.py; address comments in tests
- Trying to resolve copy: override the deepcopy method of DeviceOp
- create_ir_tree method (take 2)
- del to_device if it already exists
- Cache getters' shapes and types
- Added yolov3 relaxation, to be rolled back
- Get shapes and types together
- Bypass yolo; NumWorkers for MapOp; revert Yolo; revert Thor
- Print more info; debug code: update LOG INFO to LOG ERROR
- Do not remove EpochCtrl for getter pass
- Remove repeat(1)
- Print batch size
- Add log to tree_consumer and device_queue op
- Revert PR 8744 (Signed-off-by: alex-yuyue <yue.yu1@huawei.com>)
- __del__ toDevice (take 2)
- !165 Add ifndef ENABLE_ANDROID to device queue print
- Revert some changes
- !166 Getter: get_data_info
- !168 Add back tree print (revert info to warning in one log; add back the missed print-tree log)
- Release GIL in GetDataInfo
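One commit above overrides the deepcopy method of the device op so that copying a pipeline does not duplicate the underlying device queue. A minimal sketch of that pattern (class and attribute names here are hypothetical, not the actual MindSpore code):

```python
import copy

class ToDevice:
    """Hypothetical stand-in for the device-queue op referenced in the log."""
    def __init__(self, queue_name):
        self.queue_name = queue_name

    def __deepcopy__(self, memo):
        # Return the same instance instead of a copy: the device queue
        # is a unique runtime resource and must not be duplicated when
        # the surrounding pipeline object is deep-copied.
        return self

class Pipeline:
    """Hypothetical pipeline holding a reference to the device op."""
    def __init__(self, to_device):
        self.to_device = to_device

dev = ToDevice("q0")
clone = copy.deepcopy(Pipeline(dev))
assert clone.to_device is dev  # the device op is shared, not copied
```

Deep-copying the `Pipeline` still copies every other attribute; only the object that defines `__deepcopy__` short-circuits the recursion.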
5 years ago
5 years ago
/**
 * Copyright 2020-2021 Huawei Technologies Co., Ltd
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
#ifndef MINDSPORE_CCSRC_MINDDATA_DATASET_INCLUDE_DATASETS_H_
#define MINDSPORE_CCSRC_MINDDATA_DATASET_INCLUDE_DATASETS_H_

#include <sys/stat.h>
#include <unistd.h>
#include <algorithm>
#include <map>
#include <memory>
#include <set>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

#include "include/api/dual_abi_helper.h"
#include "minddata/dataset/include/iterator.h"
#include "minddata/dataset/include/samplers.h"
#include "minddata/dataset/include/tensor.h"
#include "minddata/dataset/include/text.h"
#include "minddata/dataset/include/type_id.h"

namespace mindspore {
namespace dataset {

class Tensor;
class TensorRow;
class TensorShape;
class TreeAdapter;
class TreeGetters;
#ifndef ENABLE_ANDROID
class Vocab;
#endif
class DatasetCache;
class DatasetNode;
class Iterator;
class TensorOperation;
class SchemaObj;
class SamplerObj;
class CsvBase;

// Dataset classes (in alphabetical order)
class BatchDataset;
class MapDataset;
class ProjectDataset;
class ShuffleDataset;
#ifndef ENABLE_ANDROID
class BucketBatchByLengthDataset;
class FilterDataset;
class CSVDataset;
class TransferDataset;
class ConcatDataset;
class RenameDataset;
#endif
#ifndef ENABLE_ANDROID
class SentencePieceVocab;
enum class SentencePieceModel;
#endif
class DSCallback;
class RepeatDataset;
#ifndef ENABLE_ANDROID
class SkipDataset;
class TakeDataset;
class ZipDataset;
#endif
/// \class Dataset datasets.h
/// \brief A base class to represent a dataset in the data pipeline.
class Dataset : public std::enable_shared_from_this<Dataset> {
 public:
  // Need friend classes so they can access the children_ field
  friend class Iterator;
  friend class TransferNode;

  /// \brief Constructor
  Dataset();

  /// \brief Destructor
  ~Dataset() = default;

  /// \brief Gets the dataset size
  /// \param[in] estimate This is only supported by some of the ops and it's used to speed up the process of getting
  ///     the dataset size at the expense of accuracy.
  /// \return Dataset size. If failed, return -1
  int64_t GetDatasetSize(bool estimate = false);

  /// \brief Gets the output type
  /// \return A vector of DataType. If failed, return an empty vector
  std::vector<DataType> GetOutputTypes();

  /// \brief Gets the output shape
  /// \return A vector of TensorShape. If failed, return an empty vector
  std::vector<TensorShape> GetOutputShapes();

  /// \brief Gets the batch size
  /// \return The batch size
  int64_t GetBatchSize();

  /// \brief Gets the repeat count
  /// \return The repeat count
  int64_t GetRepeatCount();

  /// \brief Gets the number of classes
  /// \return Number of classes. If failed, return -1
  int64_t GetNumClasses();

  /// \brief Gets the column names
  /// \return Names of the columns. If failed, return an empty vector
  std::vector<std::string> GetColumnNames() { return VectorCharToString(GetColumnNamesCharIF()); }

  /// \brief Gets the class indexing
  /// \return A vector of pairs mapping each class name to its indices. If failed, return an empty vector
  std::vector<std::pair<std::string, std::vector<int32_t>>> GetClassIndexing() {
    return ClassIndexCharToString(GetClassIndexingCharIF());
  }

  /// \brief Setter function for runtime number of workers
  /// \param[in] num_workers The number of threads in this operator
  /// \return Shared pointer to the original object
  std::shared_ptr<Dataset> SetNumWorkers(int32_t num_workers);
  /// \brief Function to create an Iterator over the Dataset pipeline
  /// \param[in] columns List of columns to be used to specify the order of columns
  /// \param[in] num_epochs Number of epochs to run through the pipeline (default=-1, which means infinite epochs).
  ///     An empty row is returned at the end of each epoch
  /// \return Shared pointer to the Iterator
  std::shared_ptr<Iterator> CreateIterator(std::vector<std::string> columns = {}, int32_t num_epochs = -1) {
    return CreateIteratorCharIF(VectorStringToChar(columns), num_epochs);
  }
#ifndef ENABLE_ANDROID
  /// \brief Function to transfer data through a device.
  /// \notes If the device is Ascend, features of data will be transferred one by one. The limitation
  ///     of data transmission per time is 256M.
  /// \param[in] queue_name Channel name (default="", create a new unique name).
  /// \param[in] device_type Type of device (default="", get from MSContext).
  /// \param[in] num_epochs Number of epochs (default=-1, infinite epochs).
  /// \param[in] send_epoch_end Whether to send the end of sequence to the device or not (default=true).
  /// \param[in] total_batches Number of batches to be sent to the device (default=0, all data).
  /// \param[in] create_data_info_queue Whether to create a queue which stores the types and shapes
  ///     of data or not (default=false).
  /// \return Returns true if no error was encountered, else false.
  bool DeviceQueue(std::string queue_name = "", std::string device_type = "", int32_t num_epochs = -1,
                   bool send_epoch_end = true, int32_t total_batches = 0, bool create_data_info_queue = false) {
    return DeviceQueueCharIF(StringToChar(queue_name), StringToChar(device_type), num_epochs, send_epoch_end,
                             total_batches, create_data_info_queue);
  }

  /// \brief Function to create a Saver to save the dynamic data processed by the dataset pipeline
  /// \note Usage restrictions:
  ///     1. Supported dataset format: 'mindrecord' only.
  ///     2. To save the samples in order, set the dataset's shuffle to false and num_files to 1.
  ///     3. Before calling the function, do not use the batch operator, repeat operator or data augmentation
  ///        operators with a random attribute in the map operator.
  ///     4. Mindrecord does not support bool, uint64, multi-dimensional uint8 (drop dimension) or
  ///        multi-dimensional string.
  /// \param[in] dataset_path Path to the dataset file
  /// \param[in] num_files Number of dataset files (default=1)
  /// \param[in] dataset_type Dataset format (default="mindrecord")
  /// \return Returns true if no error was encountered, else false
  bool Save(std::string dataset_path, int32_t num_files = 1, std::string dataset_type = "mindrecord") {
    return SaveCharIF(StringToChar(dataset_path), num_files, StringToChar(dataset_type));
  }
#endif
  /// \brief Function to create a BatchDataset
  /// \notes Combines batch_size number of consecutive rows into batches
  /// \param[in] batch_size The number of rows each batch is created with
  /// \param[in] drop_remainder Determines whether or not to drop the last possibly incomplete
  ///     batch. If true, and if there are fewer than batch_size rows
  ///     available to make the last batch, then those rows will
  ///     be dropped and not propagated to the next node
  /// \return Shared pointer to the current BatchDataset
  std::shared_ptr<BatchDataset> Batch(int32_t batch_size, bool drop_remainder = false);
#ifndef ENABLE_ANDROID
  /// \brief Function to create a BucketBatchByLengthDataset
  /// \note Buckets elements according to their lengths. Each bucket will be padded and batched when
  ///     it is full.
  /// \param[in] column_names Columns passed to element_length_function
  /// \param[in] bucket_boundaries A list consisting of the upper boundaries of the buckets.
  ///     Must be strictly increasing. If there are n boundaries, n+1 buckets are created: one bucket for
  ///     [0, bucket_boundaries[0]), one bucket for [bucket_boundaries[i], bucket_boundaries[i+1]) for each
  ///     0<i<n, and one bucket for [bucket_boundaries[n-1], inf).
  /// \param[in] bucket_batch_sizes A list consisting of the batch sizes for each bucket.
  ///     Must contain bucket_boundaries.size() + 1 elements.
  /// \param[in] element_length_function A function pointer that takes in a TensorRow and outputs a TensorRow.
  ///     The output must contain a single tensor containing a single int32_t. If no value is provided,
  ///     the size of column_names must be 1, and the size of the first dimension of that column will be taken
  ///     as the length (default=nullptr)
  /// \param[in] pad_info Represents how to batch each column. The key corresponds to the column name; the value must
  ///     be a tuple of 2 elements. The first element corresponds to the shape to pad to, and the second element
  ///     corresponds to the value to pad with. If a column is not specified, then that column will be padded to the
  ///     longest in the current batch, and 0 will be used as the padding value. Any unspecified dimensions will be
  ///     padded to the longest in the current batch, unless pad_to_bucket_boundary is true. If no padding is
  ///     wanted, leave pad_info empty (default=empty map).
  /// \param[in] pad_to_bucket_boundary If true, will pad each unspecified dimension in pad_info to the
  ///     bucket_boundary minus 1. If there are any elements that fall into the last bucket,
  ///     an error will occur (default=false).
  /// \param[in] drop_remainder If true, will drop the last batch for each bucket if it is not a full batch
  ///     (default=false).
  /// \return Shared pointer to the current BucketBatchByLengthDataset
  std::shared_ptr<BucketBatchByLengthDataset> BucketBatchByLength(
    const std::vector<std::string> &column_names, const std::vector<int32_t> &bucket_boundaries,
    const std::vector<int32_t> &bucket_batch_sizes,
    std::function<TensorRow(TensorRow)> element_length_function = nullptr,
    const std::map<std::string, std::pair<TensorShape, std::shared_ptr<Tensor>>> &pad_info = {},
    bool pad_to_bucket_boundary = false, bool drop_remainder = false) {
    return std::make_shared<BucketBatchByLengthDataset>(
      shared_from_this(), VectorStringToChar(column_names), bucket_boundaries, bucket_batch_sizes,
      element_length_function, PadInfoStringToChar(pad_info), pad_to_bucket_boundary, drop_remainder);
  }
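How n boundaries induce n+1 half-open buckets can be shown with a small standalone sketch (illustrative only; `BucketIndex` is a hypothetical helper, not part of this API):

```cpp
#include <cstddef>
#include <vector>

// Map an element length to its bucket: [0, b0), [b0, b1), ..., [b_{n-1}, inf).
// Boundaries must be strictly increasing, mirroring bucket_boundaries above.
std::size_t BucketIndex(const std::vector<int> &boundaries, int length) {
  std::size_t idx = 0;
  while (idx < boundaries.size() && length >= boundaries[idx]) ++idx;
  return idx;
}
```

For boundaries {2, 5, 9}: a length of 1 falls in bucket 0, a length of 5 in bucket 2, and any length >= 9 in the last bucket (index 3) — the bucket that pad_to_bucket_boundary disallows.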
  /// \brief Function to create a SentencePieceVocab from source dataset
  /// \note Builds a SentencePieceVocab from a dataset.
  /// \param[in] col_names Column names to get words from. It can be a vector of column names
  /// \param[in] vocab_size Vocabulary size.
  /// \param[in] character_coverage Percentage of characters covered by the model. Must be between
  ///     0.98 and 1.0. Good defaults are 0.9995 for languages with rich character sets like
  ///     Japanese or Chinese, and 1.0 for languages with small character sets.
  /// \param[in] model_type Model type. Choose from unigram (default), bpe, char, or word.
  ///     The input sentence must be pretokenized when using the word type.
  /// \param[in] params A map of optional parameters passed to the sentencepiece library
  /// \return Shared pointer to the SentencePieceVocab
  std::shared_ptr<SentencePieceVocab> BuildSentencePieceVocab(
    const std::vector<std::string> &col_names, int32_t vocab_size, float character_coverage,
    SentencePieceModel model_type, const std::unordered_map<std::string, std::string> &params) {
    return BuildSentencePieceVocabCharIF(VectorStringToChar(col_names), vocab_size, character_coverage, model_type,
                                         UnorderedMapStringToChar(params));
  }

  /// \brief Function to create a Vocab from source dataset
  /// \note Builds a vocab from a dataset. This collects all the unique words in the dataset and returns a vocab
  ///     containing the top_k most frequent words (if top_k is specified)
  /// \param[in] columns Column names to get words from. It can be a vector of column names
  /// \param[in] freq_range A pair of integers (min_frequency, max_frequency). Words within the frequency
  ///     range will be kept. 0 <= min_frequency <= max_frequency <= total_words. min_frequency/max_frequency
  ///     can be left at the defaults, which correspond to 0/total_words respectively
  /// \param[in] top_k Number of words to be built into the vocab. The top_k most frequent words are
  ///     taken. top_k is applied after freq_range. If fewer than top_k words remain, all words will be taken
  /// \param[in] special_tokens A list of strings, each one a special token
  /// \param[in] special_first Whether special_tokens will be prepended or appended to the vocab. If special_tokens
  ///     is specified and special_first is left at the default, special_tokens will be prepended
  /// \return Shared pointer to the current Vocab
  std::shared_ptr<Vocab> BuildVocab(const std::vector<std::string> &columns = {},
                                    const std::pair<int64_t, int64_t> &freq_range = {0, kDeMaxFreq},
                                    int64_t top_k = kDeMaxTopk, const std::vector<std::string> &special_tokens = {},
                                    bool special_first = true) {
    return BuildVocabCharIF(VectorStringToChar(columns), freq_range, top_k, VectorStringToChar(special_tokens),
                            special_first);
  }
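The freq_range-then-top_k selection order described above can be sketched with plain standard C++ (a conceptual illustration under the documented semantics; `SelectVocab` and its tie-breaking rule are hypothetical, not the library's implementation):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Count words, keep those whose counts fall within [min_freq, max_freq],
// then take the top_k most frequent. top_k < 0 means "take everything".
std::vector<std::string> SelectVocab(const std::vector<std::string> &words, int64_t min_freq,
                                     int64_t max_freq, int64_t top_k) {
  std::unordered_map<std::string, int64_t> counts;
  for (const auto &w : words) ++counts[w];
  std::vector<std::pair<std::string, int64_t>> kept;
  for (const auto &kv : counts)
    if (kv.second >= min_freq && kv.second <= max_freq) kept.push_back(kv);
  // Sort by descending frequency; ties broken alphabetically for determinism.
  std::sort(kept.begin(), kept.end(), [](const auto &a, const auto &b) {
    return a.second != b.second ? a.second > b.second : a.first < b.first;
  });
  if (top_k >= 0 && static_cast<std::size_t>(top_k) < kept.size()) kept.resize(top_k);
  std::vector<std::string> vocab;
  for (const auto &kv : kept) vocab.push_back(kv.first);
  return vocab;
}
```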
  /// \brief Function to create a ConcatDataset
  /// \note Concatenates the datasets in the input
  /// \param[in] datasets List of shared pointers to the datasets that should be concatenated together
  /// \return Shared pointer to the current ConcatDataset
  std::shared_ptr<ConcatDataset> Concat(const std::vector<std::shared_ptr<Dataset>> &datasets) {
    std::vector<std::shared_ptr<Dataset>> all_datasets{shared_from_this()};
    all_datasets.insert(std::end(all_datasets), std::begin(datasets), std::end(datasets));
    return std::make_shared<ConcatDataset>(all_datasets);
  }

  /// \brief Function to filter dataset by predicate
  /// \note If input_columns is not provided or empty, all columns will be used
  /// \param[in] predicate Function callable which returns a boolean value. If false, filter out the element
  /// \param[in] input_columns List of names of the input columns to filter
  /// \return Shared pointer to the current FilterDataset
  std::shared_ptr<FilterDataset> Filter(std::function<TensorRow(TensorRow)> predicate,
                                        const std::vector<std::string> &input_columns = {}) {
    return std::make_shared<FilterDataset>(shared_from_this(), predicate, VectorStringToChar(input_columns));
  }
#endif
  /// \brief Function to create a MapDataset
  /// \note Applies each operation in operations to this dataset
  /// \param[in] operations Vector of raw pointers to TensorTransform objects to be applied on the dataset. Operations
  ///     are applied in the order they appear in this list
  /// \param[in] input_columns Vector of the names of the columns that will be passed to the first
  ///     operation as input. The size of this list must match the number of
  ///     input columns expected by the first operation. The default input_columns
  ///     is the first column
  /// \param[in] output_columns Vector of names assigned to the columns output by the last operation.
  ///     This parameter is mandatory if len(input_columns) != len(output_columns).
  ///     The size of this list must match the number of output columns of the
  ///     last operation. The default output_columns will have the same
  ///     name as the input columns, i.e., the columns will be replaced
  /// \param[in] project_columns A list of column names to project
  /// \param[in] cache Tensor cache to use (default=nullptr, which means no cache is used)
  /// \param[in] callbacks List of Dataset callbacks to be called (default={})
  /// \return Shared pointer to the current MapDataset
  std::shared_ptr<MapDataset> Map(std::vector<TensorTransform *> operations,
                                  const std::vector<std::string> &input_columns = {},
                                  const std::vector<std::string> &output_columns = {},
                                  const std::vector<std::string> &project_columns = {},
                                  const std::shared_ptr<DatasetCache> &cache = nullptr,
                                  std::vector<std::shared_ptr<DSCallback>> callbacks = {}) {
    std::vector<std::shared_ptr<TensorOperation>> transform_ops;
    (void)std::transform(
      operations.begin(), operations.end(), std::back_inserter(transform_ops),
      [](TensorTransform *op) -> std::shared_ptr<TensorOperation> { return op != nullptr ? op->Parse() : nullptr; });
    return std::make_shared<MapDataset>(shared_from_this(), transform_ops, VectorStringToChar(input_columns),
                                        VectorStringToChar(output_columns), VectorStringToChar(project_columns), cache,
                                        callbacks);
  }
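The core idea of Map — parse each transform into a callable and apply them in list order to every row — can be illustrated standalone (a conceptual sketch only; `Row`, `ApplyAll`, and `DoubleThenIncrement` are hypothetical stand-ins for TensorRow and TensorOperation):

```cpp
#include <functional>
#include <utility>
#include <vector>

using Row = std::vector<int>;  // stand-in for a TensorRow

// Apply each operation to the row, in the order given in the list.
Row ApplyAll(Row row, const std::vector<std::function<Row(Row)>> &ops) {
  for (const auto &op : ops) row = op(std::move(row));
  return row;
}

// Example pipeline: order matters, exactly as it does for Map's operations.
Row DoubleThenIncrement(Row row) {
  return ApplyAll(std::move(row), {[](Row r) { for (int &v : r) v *= 2; return r; },
                                   [](Row r) { for (int &v : r) v += 1; return r; }});
}
```

{1, 2} becomes {2, 4} after the first operation and {3, 5} after the second; reversing the list would give {4, 6} instead.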
  /// \brief Function to create a MapDataset
  /// \note Applies each operation in operations to this dataset
  /// \param[in] operations Vector of shared pointers to TensorTransform objects to be applied on the dataset.
  ///     Operations are applied in the order they appear in this list
  /// \param[in] input_columns Vector of the names of the columns that will be passed to the first
  ///     operation as input. The size of this list must match the number of
  ///     input columns expected by the first operation. The default input_columns
  ///     is the first column
  /// \param[in] output_columns Vector of names assigned to the columns output by the last operation.
  ///     This parameter is mandatory if len(input_columns) != len(output_columns).
  ///     The size of this list must match the number of output columns of the
  ///     last operation. The default output_columns will have the same
  ///     name as the input columns, i.e., the columns will be replaced
  /// \param[in] project_columns A list of column names to project
  /// \param[in] cache Tensor cache to use (default=nullptr, which means no cache is used)
  /// \param[in] callbacks List of Dataset callbacks to be called (default={})
  /// \return Shared pointer to the current MapDataset
  std::shared_ptr<MapDataset> Map(std::vector<std::shared_ptr<TensorTransform>> operations,
                                  const std::vector<std::string> &input_columns = {},
                                  const std::vector<std::string> &output_columns = {},
                                  const std::vector<std::string> &project_columns = {},
                                  const std::shared_ptr<DatasetCache> &cache = nullptr,
                                  std::vector<std::shared_ptr<DSCallback>> callbacks = {}) {
    std::vector<std::shared_ptr<TensorOperation>> transform_ops;
    (void)std::transform(operations.begin(), operations.end(), std::back_inserter(transform_ops),
                         [](std::shared_ptr<TensorTransform> op) -> std::shared_ptr<TensorOperation> {
                           return op != nullptr ? op->Parse() : nullptr;
                         });
    return std::make_shared<MapDataset>(shared_from_this(), transform_ops, VectorStringToChar(input_columns),
                                        VectorStringToChar(output_columns), VectorStringToChar(project_columns), cache,
                                        callbacks);
  }

  /// \brief Function to create a MapDataset
  /// \note Applies each operation in operations to this dataset
  /// \param[in] operations Vector of TensorTransform objects to be applied on the dataset. Operations are applied in
  ///     the order they appear in this list
  /// \param[in] input_columns Vector of the names of the columns that will be passed to the first
  ///     operation as input. The size of this list must match the number of
  ///     input columns expected by the first operation. The default input_columns
  ///     is the first column
  /// \param[in] output_columns Vector of names assigned to the columns output by the last operation.
  ///     This parameter is mandatory if len(input_columns) != len(output_columns).
  ///     The size of this list must match the number of output columns of the
  ///     last operation. The default output_columns will have the same
  ///     name as the input columns, i.e., the columns will be replaced
  /// \param[in] project_columns A list of column names to project
  /// \param[in] cache Tensor cache to use (default=nullptr, which means no cache is used)
  /// \param[in] callbacks List of Dataset callbacks to be called (default={})
  /// \return Shared pointer to the current MapDataset
  std::shared_ptr<MapDataset> Map(const std::vector<std::reference_wrapper<TensorTransform>> operations,
                                  const std::vector<std::string> &input_columns = {},
                                  const std::vector<std::string> &output_columns = {},
                                  const std::vector<std::string> &project_columns = {},
                                  const std::shared_ptr<DatasetCache> &cache = nullptr,
                                  std::vector<std::shared_ptr<DSCallback>> callbacks = {}) {
    std::vector<std::shared_ptr<TensorOperation>> transform_ops;
    (void)std::transform(operations.begin(), operations.end(), std::back_inserter(transform_ops),
                         [](TensorTransform &op) -> std::shared_ptr<TensorOperation> { return op.Parse(); });
    return std::make_shared<MapDataset>(shared_from_this(), transform_ops, VectorStringToChar(input_columns),
                                        VectorStringToChar(output_columns), VectorStringToChar(project_columns), cache,
                                        callbacks);
  }
  /// \brief Function to create a ProjectDataset
  /// \note Applies project to the dataset
  /// \param[in] columns The names of the columns to project
  /// \return Shared pointer to the current ProjectDataset
  std::shared_ptr<ProjectDataset> Project(const std::vector<std::string> &columns) {
    return std::make_shared<ProjectDataset>(shared_from_this(), VectorStringToChar(columns));
  }

#ifndef ENABLE_ANDROID
  /// \brief Function to create a RenameDataset
  /// \note Renames the columns in the input dataset
  /// \param[in] input_columns List of the input columns to rename
  /// \param[in] output_columns List of the output columns
  /// \return Shared pointer to the current RenameDataset
  std::shared_ptr<RenameDataset> Rename(const std::vector<std::string> &input_columns,
                                        const std::vector<std::string> &output_columns) {
    return std::make_shared<RenameDataset>(shared_from_this(), VectorStringToChar(input_columns),
                                           VectorStringToChar(output_columns));
  }
#endif

  /// \brief Function to create a RepeatDataset
  /// \note Repeats this dataset count times. Repeats indefinitely if count is -1
  /// \param[in] count Number of times the dataset should be repeated
  /// \return Shared pointer to the current RepeatDataset
  std::shared_ptr<RepeatDataset> Repeat(int32_t count = -1) {
    return std::make_shared<RepeatDataset>(shared_from_this(), count);
  }

#ifndef ENABLE_ANDROID
  /// \brief Function to create a ShuffleDataset
  /// \note Randomly shuffles the rows of this dataset
  /// \param[in] buffer_size The size of the buffer (must be larger than 1) for shuffling
  /// \return Shared pointer to the current ShuffleDataset
  std::shared_ptr<ShuffleDataset> Shuffle(int32_t buffer_size) {
    return std::make_shared<ShuffleDataset>(shared_from_this(), buffer_size);
  }

  /// \brief Function to create a SkipDataset
  /// \note Skips count elements in this dataset
  /// \param[in] count Number of elements in the dataset to be skipped
  /// \return Shared pointer to the current SkipDataset
  std::shared_ptr<SkipDataset> Skip(int32_t count) { return std::make_shared<SkipDataset>(shared_from_this(), count); }

  /// \brief Function to create a TakeDataset
  /// \note Takes count elements from this dataset
  /// \param[in] count Number of elements in the dataset to be taken
  /// \return Shared pointer to the current TakeDataset
  std::shared_ptr<TakeDataset> Take(int32_t count = -1) {
    return std::make_shared<TakeDataset>(shared_from_this(), count);
  }
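Skip and Take semantics over a plain sequence can be sketched as follows (illustrative only; `SkipThenTake` is a hypothetical helper, and the "take = -1 keeps everything" convention mirrors Take's default above):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Skip drops the first `skip` elements; Take then keeps at most `take`
// elements (take = -1 means keep everything that remains).
std::vector<int> SkipThenTake(const std::vector<int> &rows, std::size_t skip, std::int64_t take) {
  std::vector<int> out;
  std::size_t start = std::min(skip, rows.size());
  for (std::size_t i = start; i < rows.size(); ++i) {
    if (take >= 0 && out.size() >= static_cast<std::size_t>(take)) break;
    out.push_back(rows[i]);
  }
  return out;
}
```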
  /// \brief Function to create a ZipDataset
  /// \note Applies zip to the dataset
  /// \param[in] datasets A list of shared pointers to the datasets that we want to zip
  /// \return Shared pointer to the current ZipDataset
  std::shared_ptr<ZipDataset> Zip(const std::vector<std::shared_ptr<Dataset>> &datasets) {
    std::vector<std::shared_ptr<Dataset>> all_datasets = datasets;
    all_datasets.push_back(shared_from_this());
    return std::make_shared<ZipDataset>(all_datasets);
  }
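Conceptually, zipping combines rows from the inputs position-wise. A standalone sketch using two int columns in place of datasets (illustrative only; `ZipRows` is hypothetical, and the stop-at-the-shortest-input behavior shown here is an assumption about zip semantics, not a guarantee stated in this header):

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Pair up elements position-wise; iteration ends at the shorter input.
std::vector<std::pair<int, int>> ZipRows(const std::vector<int> &a, const std::vector<int> &b) {
  std::vector<std::pair<int, int>> out;
  const std::size_t n = std::min(a.size(), b.size());
  for (std::size_t i = 0; i < n; ++i) out.emplace_back(a[i], b[i]);
  return out;
}
```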
#endif

  std::shared_ptr<DatasetNode> IRNode() { return ir_node_; }

 protected:
  std::shared_ptr<TreeGetters> tree_getters_;
  std::shared_ptr<DatasetNode> ir_node_;

 private:
  // Char interface (CharIF) of GetColumnNames
  std::vector<std::vector<char>> GetColumnNamesCharIF();

  // Char interface (CharIF) of GetClassIndexing
  std::vector<std::pair<std::vector<char>, std::vector<int32_t>>> GetClassIndexingCharIF();

  // Char interface (CharIF) of CreateIterator
  std::shared_ptr<Iterator> CreateIteratorCharIF(std::vector<std::vector<char>> columns, int32_t num_epochs);

  // Char interface (CharIF) of DeviceQueue
  bool DeviceQueueCharIF(const std::vector<char> &queue_name, const std::vector<char> &device_type, int32_t num_epochs,
                         bool send_epoch_end, int32_t total_batches, bool create_data_info_queue);

  // Char interface (CharIF) of Save
  bool SaveCharIF(const std::vector<char> &dataset_path, int32_t num_files, const std::vector<char> &dataset_type);

  // Char interface (CharIF) of BuildSentencePieceVocab
  std::shared_ptr<SentencePieceVocab> BuildSentencePieceVocabCharIF(
    const std::vector<std::vector<char>> &col_names, int32_t vocab_size, float character_coverage,
    SentencePieceModel model_type, const std::map<std::vector<char>, std::vector<char>> &params);

  // Char interface (CharIF) of BuildVocab
  std::shared_ptr<Vocab> BuildVocabCharIF(const std::vector<std::vector<char>> &columns,
                                          const std::pair<int64_t, int64_t> &freq_range, int64_t top_k,
                                          const std::vector<std::vector<char>> &special_tokens, bool special_first);
};
class SchemaObj {
 public:
  /// \brief Constructor
  explicit SchemaObj(const std::string &schema_file = "") : SchemaObj(StringToChar(schema_file)) {}

  /// \brief Destructor
  ~SchemaObj() = default;

  /// \brief SchemaObj Init function
  /// \return Status code, OK if schema initialization is successful
  Status Init();

  /// \brief Add a new column to the schema with unknown shape of rank 1
  /// \param[in] name Name of the column.
  /// \param[in] de_type Data type of the column (TypeId).
  /// \return Status code
  Status add_column(const std::string &name, TypeId de_type) { return add_column_char(StringToChar(name), de_type); }

  /// \brief Add a new column to the schema with unknown shape of rank 1
  /// \param[in] name Name of the column.
  /// \param[in] de_type Data type of the column (std::string).
  /// \return Status code
  Status add_column(const std::string &name, const std::string &de_type) {
    return add_column_char(StringToChar(name), StringToChar(de_type));
  }

  /// \brief Add a new column to the schema
  /// \param[in] name Name of the column.
  /// \param[in] de_type Data type of the column (TypeId).
  /// \param[in] shape Shape of the column.
  /// \return Status code
  Status add_column(const std::string &name, TypeId de_type, const std::vector<int32_t> &shape) {
    return add_column_char(StringToChar(name), de_type, shape);
  }

  /// \brief Add a new column to the schema
  /// \param[in] name Name of the column.
  /// \param[in] de_type Data type of the column (std::string).
  /// \param[in] shape Shape of the column.
  /// \return Status code
  Status add_column(const std::string &name, const std::string &de_type, const std::vector<int32_t> &shape) {
    return add_column_char(StringToChar(name), StringToChar(de_type), shape);
  }

  /// \brief Get a JSON string of the schema
  /// \return JSON string of the schema
  std::string to_json() { return CharToString(to_json_char()); }

  /// \brief Get a string representation of the schema (same as to_json)
  std::string to_string() { return to_json(); }

  /// \brief Set a new value to dataset_type
  void set_dataset_type(std::string dataset_type);

  /// \brief Set a new value to num_rows
  void set_num_rows(int32_t num_rows);

  /// \brief Get the current num_rows
  int32_t get_num_rows() const;

  /// \brief Get the schema from a JSON string
  /// \param[in] json_string JSON string to be parsed.
  /// \return Status code
  Status FromJSONString(const std::string &json_string) { return FromJSONStringCharIF(StringToChar(json_string)); }

  /// \brief Parse and add column information
  /// \param[in] json_string JSON string with column attribute information, decoded from the schema file.
  /// \return Status code
  Status ParseColumnString(const std::string &json_string) {
    return ParseColumnStringCharIF(StringToChar(json_string));
  }

 private:
  /// \brief Parse the columns and add them to the schema
  /// \param[in] columns Column attribute information, decoded from the schema file.
  ///     Supports both nlohmann::json::value_t::array and nlohmann::json::value_t::object.
  /// \return Status code
  Status parse_column(nlohmann::json columns);

  /// \brief Get the schema from a parsed JSON object
  /// \param[in] json_obj Parsed JSON object
  /// \return Status code
  Status from_json(nlohmann::json json_obj);

  // Char constructor of SchemaObj
  explicit SchemaObj(const std::vector<char> &schema_file);

  // Char interface of add_column
  Status add_column_char(const std::vector<char> &name, TypeId de_type);
  Status add_column_char(const std::vector<char> &name, const std::vector<char> &de_type);
  Status add_column_char(const std::vector<char> &name, TypeId de_type, const std::vector<int32_t> &shape);
  Status add_column_char(const std::vector<char> &name, const std::vector<char> &de_type,
                         const std::vector<int32_t> &shape);

  // Char interface of to_json
  const std::vector<char> to_json_char();

  // Char interface of FromJSONString
  Status FromJSONStringCharIF(const std::vector<char> &json_string);

  // Char interface of ParseColumnString
  Status ParseColumnStringCharIF(const std::vector<char> &json_string);

  struct Data;
  std::shared_ptr<Data> data_;
};
class BatchDataset : public Dataset {
 public:
  BatchDataset(std::shared_ptr<Dataset> input, int32_t batch_size, bool drop_remainder = false);
  ~BatchDataset() = default;
};

#ifndef ENABLE_ANDROID
class BucketBatchByLengthDataset : public Dataset {
 public:
  BucketBatchByLengthDataset(
    std::shared_ptr<Dataset> input, const std::vector<std::vector<char>> &column_names,
    const std::vector<int32_t> &bucket_boundaries, const std::vector<int32_t> &bucket_batch_sizes,
    std::function<TensorRow(TensorRow)> element_length_function = nullptr,
    const std::map<std::vector<char>, std::pair<TensorShape, std::shared_ptr<Tensor>>> &pad_info = {},
    bool pad_to_bucket_boundary = false, bool drop_remainder = false);
  ~BucketBatchByLengthDataset() = default;
};

class ConcatDataset : public Dataset {
 public:
  explicit ConcatDataset(const std::vector<std::shared_ptr<Dataset>> &input);
  ~ConcatDataset() = default;
};

class FilterDataset : public Dataset {
 public:
  FilterDataset(std::shared_ptr<Dataset> input, std::function<TensorRow(TensorRow)> predicate,
                const std::vector<std::vector<char>> &input_columns);
  ~FilterDataset() = default;
};
#endif

class MapDataset : public Dataset {
 public:
  MapDataset(std::shared_ptr<Dataset> input, std::vector<std::shared_ptr<TensorOperation>> operations,
             const std::vector<std::vector<char>> &input_columns, const std::vector<std::vector<char>> &output_columns,
             const std::vector<std::vector<char>> &project_columns, const std::shared_ptr<DatasetCache> &cache,
             std::vector<std::shared_ptr<DSCallback>> callbacks);
  ~MapDataset() = default;
};

class ProjectDataset : public Dataset {
 public:
  ProjectDataset(std::shared_ptr<Dataset> input, const std::vector<std::vector<char>> &columns);
  ~ProjectDataset() = default;
};

#ifndef ENABLE_ANDROID
class RenameDataset : public Dataset {
 public:
  RenameDataset(std::shared_ptr<Dataset> input, const std::vector<std::vector<char>> &input_columns,
                const std::vector<std::vector<char>> &output_columns);
  ~RenameDataset() = default;
};
#endif

class RepeatDataset : public Dataset {
 public:
  RepeatDataset(std::shared_ptr<Dataset> input, int32_t count);
  ~RepeatDataset() = default;
};

class ShuffleDataset : public Dataset {
 public:
  ShuffleDataset(std::shared_ptr<Dataset> input, int32_t buffer_size);
  ~ShuffleDataset() = default;
};

#ifndef ENABLE_ANDROID
class SkipDataset : public Dataset {
 public:
  SkipDataset(std::shared_ptr<Dataset> input, int32_t count);
  ~SkipDataset() = default;
};

class TakeDataset : public Dataset {
 public:
  TakeDataset(std::shared_ptr<Dataset> input, int32_t count);
  ~TakeDataset() = default;
};

class ZipDataset : public Dataset {
 public:
  explicit ZipDataset(const std::vector<std::shared_ptr<Dataset>> &inputs);
  ~ZipDataset() = default;
};
#endif
/// \brief Function to create a SchemaObj
/// \param[in] schema_file Path of the schema file
/// \note This API exists because std::string is constrained by the ABI compile macro while char is not.
/// \return Shared pointer to the current schema
std::shared_ptr<SchemaObj> SchemaCharIF(const std::vector<char> &schema_file);

/// \brief Function to create a SchemaObj
/// \param[in] schema_file Path of the schema file
/// \return Shared pointer to the current schema
inline std::shared_ptr<SchemaObj> Schema(const std::string &schema_file = "") {
  return SchemaCharIF(StringToChar(schema_file));
}
class AlbumDataset : public Dataset {
 public:
  AlbumDataset(const std::vector<char> &dataset_dir, const std::vector<char> &data_schema,
               const std::vector<std::vector<char>> &column_names, bool decode,
               const std::shared_ptr<Sampler> &sampler, const std::shared_ptr<DatasetCache> &cache);
  AlbumDataset(const std::vector<char> &dataset_dir, const std::vector<char> &data_schema,
               const std::vector<std::vector<char>> &column_names, bool decode, Sampler *sampler,
               const std::shared_ptr<DatasetCache> &cache);
  AlbumDataset(const std::vector<char> &dataset_dir, const std::vector<char> &data_schema,
               const std::vector<std::vector<char>> &column_names, bool decode,
               const std::reference_wrapper<Sampler> sampler, const std::shared_ptr<DatasetCache> &cache);
  ~AlbumDataset() = default;
};

/// \brief Function to create an AlbumDataset
/// \note The generated dataset is specified through setting a schema
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] data_schema Path to the dataset schema file
/// \param[in] column_names Column names used to specify the columns to load; if empty, all columns will be read
///     (default = {})
/// \param[in] decode The option to decode the images in the dataset (default = false)
/// \param[in] sampler Shared pointer to a sampler object used to choose samples from the dataset. If sampler is not
///     given, a `RandomSampler` will be used to randomly iterate the entire dataset (default = RandomSampler())
/// \param[in] cache Tensor cache to use (default=nullptr, which means no cache is used)
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<AlbumDataset> Album(const std::string &dataset_dir, const std::string &data_schema,
                                           const std::vector<std::string> &column_names = {}, bool decode = false,
                                           const std::shared_ptr<Sampler> &sampler = std::make_shared<RandomSampler>(),
                                           const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<AlbumDataset>(StringToChar(dataset_dir), StringToChar(data_schema),
                                        VectorStringToChar(column_names), decode, sampler, cache);
}

/// \brief Function to create an AlbumDataset
/// \note The generated dataset is specified through setting a schema
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] data_schema Path to the dataset schema file
/// \param[in] column_names Column names used to specify the columns to load
/// \param[in] decode The option to decode the images in the dataset
/// \param[in] sampler Raw pointer to a sampler object used to choose samples from the dataset
/// \param[in] cache Tensor cache to use (default=nullptr, which means no cache is used)
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<AlbumDataset> Album(const std::string &dataset_dir, const std::string &data_schema,
                                           const std::vector<std::string> &column_names, bool decode, Sampler *sampler,
                                           const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<AlbumDataset>(StringToChar(dataset_dir), StringToChar(data_schema),
                                        VectorStringToChar(column_names), decode, sampler, cache);
}

/// \brief Function to create an AlbumDataset
/// \note The generated dataset is specified through setting a schema
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] data_schema Path to the dataset schema file
/// \param[in] column_names Column names used to specify the columns to load
/// \param[in] decode The option to decode the images in the dataset
/// \param[in] sampler Sampler object used to choose samples from the dataset
/// \param[in] cache Tensor cache to use (default=nullptr, which means no cache is used)
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<AlbumDataset> Album(const std::string &dataset_dir, const std::string &data_schema,
                                           const std::vector<std::string> &column_names, bool decode,
                                           const std::reference_wrapper<Sampler> sampler,
                                           const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<AlbumDataset>(StringToChar(dataset_dir), StringToChar(data_schema),
                                        VectorStringToChar(column_names), decode, sampler, cache);
}
#ifndef ENABLE_ANDROID
class CelebADataset : public Dataset {
 public:
  explicit CelebADataset(const std::vector<char> &dataset_dir, const std::vector<char> &usage,
                         const std::shared_ptr<Sampler> &sampler, bool decode,
                         const std::set<std::vector<char>> &extensions, const std::shared_ptr<DatasetCache> &cache);
  explicit CelebADataset(const std::vector<char> &dataset_dir, const std::vector<char> &usage, Sampler *sampler,
                         bool decode, const std::set<std::vector<char>> &extensions,
                         const std::shared_ptr<DatasetCache> &cache);
  explicit CelebADataset(const std::vector<char> &dataset_dir, const std::vector<char> &usage,
                         const std::reference_wrapper<Sampler> sampler, bool decode,
                         const std::set<std::vector<char>> &extensions, const std::shared_ptr<DatasetCache> &cache);
  ~CelebADataset() = default;
};
/// \brief Function to create a CelebADataset
/// \notes The generated dataset has two columns ['image', 'attr'].
///     The type of the image tensor is uint8. The attr tensor is uint32 and one-hot encoded.
/// \param[in] dataset_dir Path to the root directory that contains the dataset.
/// \param[in] usage One of "all", "train", "valid" or "test" (default = "all").
/// \param[in] sampler Shared pointer to a sampler object used to choose samples from the dataset. If sampler is not
///     given, a `RandomSampler` will be used to randomly iterate the entire dataset (default = RandomSampler()).
/// \param[in] decode Decode the images after reading (default=false).
/// \param[in] extensions Set of file extensions to be included in the dataset (default={}).
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<CelebADataset> CelebA(
  const std::string &dataset_dir, const std::string &usage = "all",
  const std::shared_ptr<Sampler> &sampler = std::make_shared<RandomSampler>(), bool decode = false,
  const std::set<std::string> &extensions = {}, const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<CelebADataset>(StringToChar(dataset_dir), StringToChar(usage), sampler, decode,
                                         SetStringToChar(extensions), cache);
}
/// \brief Function to create a CelebADataset
/// \notes The generated dataset has two columns ['image', 'attr'].
///     The type of the image tensor is uint8. The attr tensor is uint32 and one-hot encoded.
/// \param[in] dataset_dir Path to the root directory that contains the dataset.
/// \param[in] usage One of "all", "train", "valid" or "test".
/// \param[in] sampler Raw pointer to a sampler object used to choose samples from the dataset.
/// \param[in] decode Decode the images after reading (default=false).
/// \param[in] extensions Set of file extensions to be included in the dataset (default={}).
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<CelebADataset> CelebA(const std::string &dataset_dir, const std::string &usage, Sampler *sampler,
                                             bool decode = false, const std::set<std::string> &extensions = {},
                                             const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<CelebADataset>(StringToChar(dataset_dir), StringToChar(usage), sampler, decode,
                                         SetStringToChar(extensions), cache);
}
/// \brief Function to create a CelebADataset
/// \notes The generated dataset has two columns ['image', 'attr'].
///     The type of the image tensor is uint8. The attr tensor is uint32 and one-hot encoded.
/// \param[in] dataset_dir Path to the root directory that contains the dataset.
/// \param[in] usage One of "all", "train", "valid" or "test".
/// \param[in] sampler Sampler object used to choose samples from the dataset.
/// \param[in] decode Decode the images after reading (default=false).
/// \param[in] extensions Set of file extensions to be included in the dataset (default={}).
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<CelebADataset> CelebA(const std::string &dataset_dir, const std::string &usage,
                                             const std::reference_wrapper<Sampler> sampler, bool decode = false,
                                             const std::set<std::string> &extensions = {},
                                             const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<CelebADataset>(StringToChar(dataset_dir), StringToChar(usage), sampler, decode,
                                         SetStringToChar(extensions), cache);
}
class Cifar10Dataset : public Dataset {
 public:
  explicit Cifar10Dataset(const std::vector<char> &dataset_dir, const std::vector<char> &usage,
                          const std::shared_ptr<Sampler> &sampler, const std::shared_ptr<DatasetCache> &cache);
  explicit Cifar10Dataset(const std::vector<char> &dataset_dir, const std::vector<char> &usage, Sampler *sampler,
                          const std::shared_ptr<DatasetCache> &cache);
  explicit Cifar10Dataset(const std::vector<char> &dataset_dir, const std::vector<char> &usage,
                          const std::reference_wrapper<Sampler> sampler, const std::shared_ptr<DatasetCache> &cache);
  ~Cifar10Dataset() = default;
};
/// \brief Function to create a Cifar10 Dataset
/// \notes The generated dataset has two columns ["image", "label"]
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] usage Usage of CIFAR10, can be "train", "test" or "all" (default = "all").
/// \param[in] sampler Shared pointer to a sampler object used to choose samples from the dataset. If sampler is not
///     given, a `RandomSampler` will be used to randomly iterate the entire dataset (default = RandomSampler()).
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<Cifar10Dataset> Cifar10(
  const std::string &dataset_dir, const std::string &usage = "all",
  const std::shared_ptr<Sampler> &sampler = std::make_shared<RandomSampler>(),
  const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<Cifar10Dataset>(StringToChar(dataset_dir), StringToChar(usage), sampler, cache);
}
/// \brief Function to create a Cifar10 Dataset
/// \notes The generated dataset has two columns ["image", "label"]
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] usage Usage of CIFAR10, can be "train", "test" or "all".
/// \param[in] sampler Raw pointer to a sampler object used to choose samples from the dataset.
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<Cifar10Dataset> Cifar10(const std::string &dataset_dir, const std::string &usage,
                                               Sampler *sampler, const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<Cifar10Dataset>(StringToChar(dataset_dir), StringToChar(usage), sampler, cache);
}
/// \brief Function to create a Cifar10 Dataset
/// \notes The generated dataset has two columns ["image", "label"]
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] usage Usage of CIFAR10, can be "train", "test" or "all".
/// \param[in] sampler Sampler object used to choose samples from the dataset.
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<Cifar10Dataset> Cifar10(const std::string &dataset_dir, const std::string &usage,
                                               const std::reference_wrapper<Sampler> sampler,
                                               const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<Cifar10Dataset>(StringToChar(dataset_dir), StringToChar(usage), sampler, cache);
}
class Cifar100Dataset : public Dataset {
 public:
  explicit Cifar100Dataset(const std::vector<char> &dataset_dir, const std::vector<char> &usage,
                           const std::shared_ptr<Sampler> &sampler, const std::shared_ptr<DatasetCache> &cache);
  explicit Cifar100Dataset(const std::vector<char> &dataset_dir, const std::vector<char> &usage, Sampler *sampler,
                           const std::shared_ptr<DatasetCache> &cache);
  explicit Cifar100Dataset(const std::vector<char> &dataset_dir, const std::vector<char> &usage,
                           const std::reference_wrapper<Sampler> sampler, const std::shared_ptr<DatasetCache> &cache);
  ~Cifar100Dataset() = default;
};
/// \brief Function to create a Cifar100 Dataset
/// \notes The generated dataset has three columns ["image", "coarse_label", "fine_label"]
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] usage Usage of CIFAR100, can be "train", "test" or "all" (default = "all").
/// \param[in] sampler Shared pointer to a sampler object used to choose samples from the dataset. If sampler is not
///     given, a `RandomSampler` will be used to randomly iterate the entire dataset (default = RandomSampler()).
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<Cifar100Dataset> Cifar100(
  const std::string &dataset_dir, const std::string &usage = "all",
  const std::shared_ptr<Sampler> &sampler = std::make_shared<RandomSampler>(),
  const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<Cifar100Dataset>(StringToChar(dataset_dir), StringToChar(usage), sampler, cache);
}
/// \brief Function to create a Cifar100 Dataset
/// \notes The generated dataset has three columns ["image", "coarse_label", "fine_label"]
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] usage Usage of CIFAR100, can be "train", "test" or "all".
/// \param[in] sampler Raw pointer to a sampler object used to choose samples from the dataset.
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<Cifar100Dataset> Cifar100(const std::string &dataset_dir, const std::string &usage,
                                                 Sampler *sampler,
                                                 const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<Cifar100Dataset>(StringToChar(dataset_dir), StringToChar(usage), sampler, cache);
}
/// \brief Function to create a Cifar100 Dataset
/// \notes The generated dataset has three columns ["image", "coarse_label", "fine_label"]
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] usage Usage of CIFAR100, can be "train", "test" or "all".
/// \param[in] sampler Sampler object used to choose samples from the dataset.
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<Cifar100Dataset> Cifar100(const std::string &dataset_dir, const std::string &usage,
                                                 const std::reference_wrapper<Sampler> sampler,
                                                 const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<Cifar100Dataset>(StringToChar(dataset_dir), StringToChar(usage), sampler, cache);
}
class CLUEDataset : public Dataset {
 public:
  explicit CLUEDataset(const std::vector<std::vector<char>> &dataset_files, const std::vector<char> &task,
                       const std::vector<char> &usage, int64_t num_samples, ShuffleMode shuffle, int32_t num_shards,
                       int32_t shard_id, const std::shared_ptr<DatasetCache> &cache);
  ~CLUEDataset() = default;
};
/// \brief Function to create a CLUEDataset
/// \notes The generated dataset has a variable number of columns depending on the task and usage
/// \param[in] dataset_files List of files to be read to search for a pattern of files. The list
///     will be sorted in a lexicographical order.
/// \param[in] task The kind of task, one of "AFQMC", "TNEWS", "IFLYTEK", "CMNLI", "WSC" and "CSL" (default="AFQMC").
/// \param[in] usage Specifies whether to load "train", "test" or "eval" data (default="train").
/// \param[in] num_samples The number of samples to be included in the dataset.
///     (Default = 0 means all samples.)
/// \param[in] shuffle The mode for shuffling data every epoch. (Default=ShuffleMode::kGlobal)
///     Can be any of:
///     ShuffleMode::kFalse - No shuffling is performed.
///     ShuffleMode::kFiles - Shuffle files only.
///     ShuffleMode::kGlobal - Shuffle both the files and samples.
/// \param[in] num_shards Number of shards that the dataset should be divided into. (Default = 1)
/// \param[in] shard_id The shard ID within num_shards. This argument should be
///     specified only when num_shards is also specified. (Default = 0)
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current CLUEDataset
inline std::shared_ptr<CLUEDataset> CLUE(const std::vector<std::string> &dataset_files,
                                         const std::string &task = "AFQMC", const std::string &usage = "train",
                                         int64_t num_samples = 0, ShuffleMode shuffle = ShuffleMode::kGlobal,
                                         int32_t num_shards = 1, int32_t shard_id = 0,
                                         const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<CLUEDataset>(VectorStringToChar(dataset_files), StringToChar(task), StringToChar(usage),
                                       num_samples, shuffle, num_shards, shard_id, cache);
}
class CocoDataset : public Dataset {
 public:
  CocoDataset(const std::vector<char> &dataset_dir, const std::vector<char> &annotation_file,
              const std::vector<char> &task, const bool &decode, const std::shared_ptr<Sampler> &sampler,
              const std::shared_ptr<DatasetCache> &cache);
  CocoDataset(const std::vector<char> &dataset_dir, const std::vector<char> &annotation_file,
              const std::vector<char> &task, const bool &decode, Sampler *sampler,
              const std::shared_ptr<DatasetCache> &cache);
  CocoDataset(const std::vector<char> &dataset_dir, const std::vector<char> &annotation_file,
              const std::vector<char> &task, const bool &decode, const std::reference_wrapper<Sampler> sampler,
              const std::shared_ptr<DatasetCache> &cache);
  ~CocoDataset() = default;
};
/// \brief Function to create a CocoDataset
/// \notes The generated dataset has multiple columns:
///     - task='Detection', columns: [['image', dtype=uint8], ['bbox', dtype=float32], ['category_id', dtype=uint32],
///     ['iscrowd', dtype=uint32]].
///     - task='Stuff', columns: [['image', dtype=uint8], ['segmentation', dtype=float32], ['iscrowd', dtype=uint32]].
///     - task='Keypoint', columns: [['image', dtype=uint8], ['keypoints', dtype=float32],
///     ['num_keypoints', dtype=uint32]].
///     - task='Panoptic', columns: [['image', dtype=uint8], ['bbox', dtype=float32], ['category_id', dtype=uint32],
///     ['iscrowd', dtype=uint32], ['area', dtype=uint32]].
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] annotation_file Path to the annotation json
/// \param[in] task Set the task type of reading coco data, currently supports 'Detection'/'Stuff'/'Panoptic'/'Keypoint'
/// \param[in] decode Decode the images after reading
/// \param[in] sampler Shared pointer to a sampler object used to choose samples from the dataset. If sampler is not
///     given, a `RandomSampler` will be used to randomly iterate the entire dataset (default = RandomSampler()).
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<CocoDataset> Coco(const std::string &dataset_dir, const std::string &annotation_file,
                                         const std::string &task = "Detection", const bool &decode = false,
                                         const std::shared_ptr<Sampler> &sampler = std::make_shared<RandomSampler>(),
                                         const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<CocoDataset>(StringToChar(dataset_dir), StringToChar(annotation_file), StringToChar(task),
                                       decode, sampler, cache);
}
/// \brief Function to create a CocoDataset
/// \notes The generated dataset has multiple columns:
///     - task='Detection', columns: [['image', dtype=uint8], ['bbox', dtype=float32], ['category_id', dtype=uint32],
///     ['iscrowd', dtype=uint32]].
///     - task='Stuff', columns: [['image', dtype=uint8], ['segmentation', dtype=float32], ['iscrowd', dtype=uint32]].
///     - task='Keypoint', columns: [['image', dtype=uint8], ['keypoints', dtype=float32],
///     ['num_keypoints', dtype=uint32]].
///     - task='Panoptic', columns: [['image', dtype=uint8], ['bbox', dtype=float32], ['category_id', dtype=uint32],
///     ['iscrowd', dtype=uint32], ['area', dtype=uint32]].
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] annotation_file Path to the annotation json
/// \param[in] task Set the task type of reading coco data, currently supports 'Detection'/'Stuff'/'Panoptic'/'Keypoint'
/// \param[in] decode Decode the images after reading
/// \param[in] sampler Raw pointer to a sampler object used to choose samples from the dataset.
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<CocoDataset> Coco(const std::string &dataset_dir, const std::string &annotation_file,
                                         const std::string &task, const bool &decode, Sampler *sampler,
                                         const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<CocoDataset>(StringToChar(dataset_dir), StringToChar(annotation_file), StringToChar(task),
                                       decode, sampler, cache);
}
/// \brief Function to create a CocoDataset
/// \notes The generated dataset has multiple columns:
///     - task='Detection', columns: [['image', dtype=uint8], ['bbox', dtype=float32], ['category_id', dtype=uint32],
///     ['iscrowd', dtype=uint32]].
///     - task='Stuff', columns: [['image', dtype=uint8], ['segmentation', dtype=float32], ['iscrowd', dtype=uint32]].
///     - task='Keypoint', columns: [['image', dtype=uint8], ['keypoints', dtype=float32],
///     ['num_keypoints', dtype=uint32]].
///     - task='Panoptic', columns: [['image', dtype=uint8], ['bbox', dtype=float32], ['category_id', dtype=uint32],
///     ['iscrowd', dtype=uint32], ['area', dtype=uint32]].
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] annotation_file Path to the annotation json
/// \param[in] task Set the task type of reading coco data, currently supports 'Detection'/'Stuff'/'Panoptic'/'Keypoint'
/// \param[in] decode Decode the images after reading
/// \param[in] sampler Sampler object used to choose samples from the dataset.
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<CocoDataset> Coco(const std::string &dataset_dir, const std::string &annotation_file,
                                         const std::string &task, const bool &decode,
                                         const std::reference_wrapper<Sampler> sampler,
                                         const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<CocoDataset>(StringToChar(dataset_dir), StringToChar(annotation_file), StringToChar(task),
                                       decode, sampler, cache);
}
class CSVDataset : public Dataset {
 public:
  explicit CSVDataset(const std::vector<std::vector<char>> &dataset_files, char field_delim,
                      const std::vector<std::shared_ptr<CsvBase>> &column_defaults,
                      const std::vector<std::vector<char>> &column_names, int64_t num_samples, ShuffleMode shuffle,
                      int32_t num_shards, int32_t shard_id, const std::shared_ptr<DatasetCache> &cache);
  ~CSVDataset() = default;
};
/// \brief Function to create a CSVDataset
/// \notes The generated dataset has a variable number of columns
/// \param[in] dataset_files List of files to be read to search for a pattern of files. The list
///     will be sorted in a lexicographical order.
/// \param[in] field_delim A char that indicates the delimiter to separate fields (default=',').
/// \param[in] column_defaults List of default values for the CSV field (default={}). Each item in the list is
///     a valid type (float, int, or string). If this is not provided, all columns are treated as string type.
/// \param[in] column_names List of column names of the dataset (default={}). If this is not provided, the
///     column_names are inferred from the first row of the CSV file.
/// \param[in] num_samples The number of samples to be included in the dataset.
///     (Default = 0 means all samples.)
/// \param[in] shuffle The mode for shuffling data every epoch. (Default=ShuffleMode::kGlobal)
///     Can be any of:
///     ShuffleMode::kFalse - No shuffling is performed.
///     ShuffleMode::kFiles - Shuffle files only.
///     ShuffleMode::kGlobal - Shuffle both the files and samples.
/// \param[in] num_shards Number of shards that the dataset should be divided into. (Default = 1)
/// \param[in] shard_id The shard ID within num_shards. This argument should be
///     specified only when num_shards is also specified. (Default = 0)
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<CSVDataset> CSV(const std::vector<std::string> &dataset_files, char field_delim = ',',
                                       const std::vector<std::shared_ptr<CsvBase>> &column_defaults = {},
                                       const std::vector<std::string> &column_names = {}, int64_t num_samples = 0,
                                       ShuffleMode shuffle = ShuffleMode::kGlobal, int32_t num_shards = 1,
                                       int32_t shard_id = 0, const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<CSVDataset>(VectorStringToChar(dataset_files), field_delim, column_defaults,
                                      VectorStringToChar(column_names), num_samples, shuffle, num_shards, shard_id,
                                      cache);
}
class ImageFolderDataset : public Dataset {
 public:
  explicit ImageFolderDataset(const std::vector<char> &dataset_dir, bool decode,
                              const std::shared_ptr<Sampler> &sampler, const std::set<std::vector<char>> &extensions,
                              const std::map<std::vector<char>, int32_t> &class_indexing,
                              const std::shared_ptr<DatasetCache> &cache);
  explicit ImageFolderDataset(const std::vector<char> &dataset_dir, bool decode, Sampler *sampler,
                              const std::set<std::vector<char>> &extensions,
                              const std::map<std::vector<char>, int32_t> &class_indexing,
                              const std::shared_ptr<DatasetCache> &cache);
  explicit ImageFolderDataset(const std::vector<char> &dataset_dir, bool decode,
                              const std::reference_wrapper<Sampler> sampler,
                              const std::set<std::vector<char>> &extensions,
                              const std::map<std::vector<char>, int32_t> &class_indexing,
                              const std::shared_ptr<DatasetCache> &cache);
  ~ImageFolderDataset() = default;
};
/// \brief Function to create an ImageFolderDataset
/// \notes A source dataset that reads images from a tree of directories.
///     All images within one folder have the same label.
///     The generated dataset has two columns ["image", "label"]
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] decode A flag to decode the images in ImageFolder
/// \param[in] sampler Shared pointer to a sampler object used to choose samples from the dataset. If sampler is not
///     given, a `RandomSampler` will be used to randomly iterate the entire dataset (default = RandomSampler()).
/// \param[in] extensions File extensions to be read
/// \param[in] class_indexing A class name to label map
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current ImageFolderDataset
inline std::shared_ptr<ImageFolderDataset> ImageFolder(
  const std::string &dataset_dir, bool decode = false,
  const std::shared_ptr<Sampler> &sampler = std::make_shared<RandomSampler>(),
  const std::set<std::string> &extensions = {}, const std::map<std::string, int32_t> &class_indexing = {},
  const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<ImageFolderDataset>(StringToChar(dataset_dir), decode, sampler, SetStringToChar(extensions),
                                              MapStringToChar(class_indexing), cache);
}
/// \brief Function to create an ImageFolderDataset
/// \notes A source dataset that reads images from a tree of directories.
///     All images within one folder have the same label.
///     The generated dataset has two columns ["image", "label"]
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] decode A flag to decode the images in ImageFolder
/// \param[in] sampler Raw pointer to a sampler object used to choose samples from the dataset.
/// \param[in] extensions File extensions to be read
/// \param[in] class_indexing A class name to label map
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current ImageFolderDataset
inline std::shared_ptr<ImageFolderDataset> ImageFolder(const std::string &dataset_dir, bool decode, Sampler *sampler,
                                                       const std::set<std::string> &extensions = {},
                                                       const std::map<std::string, int32_t> &class_indexing = {},
                                                       const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<ImageFolderDataset>(StringToChar(dataset_dir), decode, sampler, SetStringToChar(extensions),
                                              MapStringToChar(class_indexing), cache);
}
/// \brief Function to create an ImageFolderDataset
/// \notes A source dataset that reads images from a tree of directories.
///     All images within one folder have the same label.
///     The generated dataset has two columns ["image", "label"]
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] decode A flag to decode the images in ImageFolder
/// \param[in] sampler Sampler object used to choose samples from the dataset.
/// \param[in] extensions File extensions to be read
/// \param[in] class_indexing A class name to label map
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current ImageFolderDataset
inline std::shared_ptr<ImageFolderDataset> ImageFolder(const std::string &dataset_dir, bool decode,
                                                       const std::reference_wrapper<Sampler> sampler,
                                                       const std::set<std::string> &extensions = {},
                                                       const std::map<std::string, int32_t> &class_indexing = {},
                                                       const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<ImageFolderDataset>(StringToChar(dataset_dir), decode, sampler, SetStringToChar(extensions),
                                              MapStringToChar(class_indexing), cache);
}
class ManifestDataset : public Dataset {
 public:
  explicit ManifestDataset(const std::vector<char> &dataset_file, const std::vector<char> &usage,
                           const std::shared_ptr<Sampler> &sampler,
                           const std::map<std::vector<char>, int32_t> &class_indexing, bool decode,
                           const std::shared_ptr<DatasetCache> &cache);
  explicit ManifestDataset(const std::vector<char> &dataset_file, const std::vector<char> &usage, Sampler *sampler,
                           const std::map<std::vector<char>, int32_t> &class_indexing, bool decode,
                           const std::shared_ptr<DatasetCache> &cache);
  explicit ManifestDataset(const std::vector<char> &dataset_file, const std::vector<char> &usage,
                           const std::reference_wrapper<Sampler> sampler,
                           const std::map<std::vector<char>, int32_t> &class_indexing, bool decode,
                           const std::shared_ptr<DatasetCache> &cache);
  ~ManifestDataset() = default;
};
/// \brief Function to create a ManifestDataset
/// \notes The generated dataset has two columns ["image", "label"]
/// \param[in] dataset_file The dataset file to be read
/// \param[in] usage Specifies whether to load "train", "eval" or "inference" data (default="train")
/// \param[in] sampler Shared pointer to a sampler object used to choose samples from the dataset. If sampler is not
///     given, a `RandomSampler` will be used to randomly iterate the entire dataset (default = RandomSampler()).
/// \param[in] class_indexing A str-to-int mapping from label name to index (default={}, the folder
///     names will be sorted alphabetically and each class will be given a unique index starting from 0).
/// \param[in] decode Decode the images after reading (default=false).
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current ManifestDataset
inline std::shared_ptr<ManifestDataset> Manifest(
  const std::string &dataset_file, const std::string &usage = "train",
  const std::shared_ptr<Sampler> &sampler = std::make_shared<RandomSampler>(),
  const std::map<std::string, int32_t> &class_indexing = {}, bool decode = false,
  const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<ManifestDataset>(StringToChar(dataset_file), StringToChar(usage), sampler,
                                           MapStringToChar(class_indexing), decode, cache);
}
/// \brief Function to create a ManifestDataset
/// \notes The generated dataset has two columns ["image", "label"]
/// \param[in] dataset_file The dataset file to be read
/// \param[in] usage Specifies whether to load "train", "eval" or "inference" data
/// \param[in] sampler Raw pointer to a sampler object used to choose samples from the dataset.
/// \param[in] class_indexing A str-to-int mapping from label name to index (default={}, the folder
///     names will be sorted alphabetically and each class will be given a unique index starting from 0).
/// \param[in] decode Decode the images after reading (default=false).
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current ManifestDataset
inline std::shared_ptr<ManifestDataset> Manifest(const std::string &dataset_file, const std::string &usage,
                                                 Sampler *sampler,
                                                 const std::map<std::string, int32_t> &class_indexing = {},
                                                 bool decode = false,
                                                 const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<ManifestDataset>(StringToChar(dataset_file), StringToChar(usage), sampler,
                                           MapStringToChar(class_indexing), decode, cache);
}
  1115. /// \brief Function to create a ManifestDataset
  1116. /// \notes The generated dataset has two columns ["image", "label"]
  1117. /// \param[in] dataset_file The dataset file to be read
  1118. /// \param[in] usage Need "train", "eval" or "inference" data
  1119. /// \param[in] sampler Sampler object used to choose samples from the dataset.
  1120. /// \param[in] class_indexing A str-to-int mapping from label name to index (default={}, the folder
  1121. /// names will be sorted alphabetically and each class will be given a unique index starting from 0).
  1122. /// \param[in] decode Decode the images after reading (default=false).
  1123. /// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
  1124. /// \return Shared pointer to the current ManifestDataset
  1125. inline std::shared_ptr<ManifestDataset> Manifest(const std::string &dataset_file, const std::string &usage,
  1126. const std::reference_wrapper<Sampler> sampler,
  1127. const std::map<std::string, int32_t> &class_indexing = {},
  1128. bool decode = false,
  1129. const std::shared_ptr<DatasetCache> &cache = nullptr) {
  1130. return std::make_shared<ManifestDataset>(StringToChar(dataset_file), StringToChar(usage), sampler,
  1131. MapStringToChar(class_indexing), decode, cache);
  1132. }
  1133. class MindDataDataset : public Dataset {
  1134. public:
  1135. explicit MindDataDataset(const std::vector<char> &dataset_file, const std::vector<std::vector<char>> &columns_list,
  1136. const std::shared_ptr<Sampler> &sampler, nlohmann::json padded_sample, int64_t num_padded);
  1137. explicit MindDataDataset(const std::vector<char> &dataset_file, const std::vector<std::vector<char>> &columns_list,
  1138. Sampler *sampler, nlohmann::json padded_sample, int64_t num_padded);
  1139. explicit MindDataDataset(const std::vector<char> &dataset_file, const std::vector<std::vector<char>> &columns_list,
  1140. const std::reference_wrapper<Sampler> sampler, nlohmann::json padded_sample,
  1141. int64_t num_padded);
  1142. explicit MindDataDataset(const std::vector<std::vector<char>> &dataset_files,
  1143. const std::vector<std::vector<char>> &columns_list, const std::shared_ptr<Sampler> &sampler,
  1144. nlohmann::json padded_sample, int64_t num_padded);
  1145. explicit MindDataDataset(const std::vector<std::vector<char>> &dataset_files,
  1146. const std::vector<std::vector<char>> &columns_list, Sampler *sampler,
  1147. nlohmann::json padded_sample, int64_t num_padded);
  1148. explicit MindDataDataset(const std::vector<std::vector<char>> &dataset_files,
  1149. const std::vector<std::vector<char>> &columns_list,
  1150. const std::reference_wrapper<Sampler> sampler, nlohmann::json padded_sample,
  1151. int64_t num_padded);
  1152. ~MindDataDataset() = default;
  1153. };
  1154. /// \brief Function to create a MindDataDataset
  1155. /// \param[in] dataset_file File name of one component of a mindrecord source. Other files with identical source
  1156. /// in the same path will be found and loaded automatically.
  1157. /// \param[in] columns_list List of columns to be read (default={})
  1158. /// \param[in] sampler Shared pointer to a sampler object used to choose samples from the dataset. If sampler is not
  1159. /// given,
  1160. /// a `RandomSampler` will be used to randomly iterate the entire dataset (default = RandomSampler()),
  1161. /// supported sampler list: SubsetRandomSampler, PkSampler, RandomSampler, SequentialSampler, DistributedSampler.
  1162. /// \param[in] padded_sample Samples will be appended to dataset, where keys are the same as column_list.
  1163. /// \param[in] num_padded Number of padding samples. Dataset size plus num_padded should be divisible by num_shards.
  1164. /// \return Shared pointer to the current MindDataDataset
  1165. inline std::shared_ptr<MindDataDataset> MindData(
  1166. const std::string &dataset_file, const std::vector<std::string> &columns_list = {},
  1167. const std::shared_ptr<Sampler> &sampler = std::make_shared<RandomSampler>(), nlohmann::json padded_sample = nullptr,
  1168. int64_t num_padded = 0) {
  1169. return std::make_shared<MindDataDataset>(StringToChar(dataset_file), VectorStringToChar(columns_list), sampler,
  1170. padded_sample, num_padded);
  1171. }
  1172. /// \brief Function to create a MindDataDataset
  1173. /// \param[in] dataset_file File name of one component of a mindrecord source. Other files with identical source
  1174. /// in the same path will be found and loaded automatically.
  1175. /// \param[in] columns_list List of columns to be read
  1176. /// \param[in] sampler Raw pointer to a sampler object used to choose samples from the dataset.
  1177. /// supported sampler list: SubsetRandomSampler, PkSampler, RandomSampler, SequentialSampler, DistributedSampler.
  1178. /// \param[in] padded_sample Samples will be appended to dataset, where keys are the same as column_list.
  1179. /// \param[in] num_padded Number of padding samples. Dataset size plus num_padded should be divisible by num_shards.
  1180. /// \return Shared pointer to the current MindDataDataset
  1181. inline std::shared_ptr<MindDataDataset> MindData(const std::string &dataset_file,
  1182. const std::vector<std::string> &columns_list, Sampler *sampler,
  1183. nlohmann::json padded_sample = nullptr, int64_t num_padded = 0) {
  1184. return std::make_shared<MindDataDataset>(StringToChar(dataset_file), VectorStringToChar(columns_list), sampler,
  1185. padded_sample, num_padded);
  1186. }
  1187. /// \brief Function to create a MindDataDataset
  1188. /// \param[in] dataset_file File name of one component of a mindrecord source. Other files with identical source
  1189. /// in the same path will be found and loaded automatically.
  1190. /// \param[in] columns_list List of columns to be read
  1191. /// \param[in] sampler Sampler object used to choose samples from the dataset.
  1192. /// supported sampler list: SubsetRandomSampler, PkSampler, RandomSampler, SequentialSampler, DistributedSampler.
  1193. /// \param[in] padded_sample Samples will be appended to dataset, where keys are the same as column_list.
  1194. /// \param[in] num_padded Number of padding samples. Dataset size plus num_padded should be divisible by num_shards.
  1195. /// \return Shared pointer to the current MindDataDataset
  1196. inline std::shared_ptr<MindDataDataset> MindData(const std::string &dataset_file,
  1197. const std::vector<std::string> &columns_list,
  1198. const std::reference_wrapper<Sampler> sampler,
  1199. nlohmann::json padded_sample = nullptr, int64_t num_padded = 0) {
  1200. return std::make_shared<MindDataDataset>(StringToChar(dataset_file), VectorStringToChar(columns_list), sampler,
  1201. padded_sample, num_padded);
  1202. }
  1203. /// \brief Function to create a MindDataDataset
  1204. /// \param[in] dataset_files List of dataset files to be read directly.
  1205. /// \param[in] columns_list List of columns to be read (default={})
  1206. /// \param[in] sampler Shared pointer to a sampler object used to choose samples from the dataset. If sampler is not
  1207. /// given,
  1208. /// a `RandomSampler` will be used to randomly iterate the entire dataset (default = RandomSampler()),
  1209. /// supported sampler list: SubsetRandomSampler, PkSampler, RandomSampler, SequentialSampler, DistributedSampler.
  1210. /// \param[in] padded_sample Samples will be appended to dataset, where keys are the same as column_list.
  1211. /// \param[in] num_padded Number of padding samples. Dataset size plus num_padded should be divisible by num_shards.
  1212. /// \return Shared pointer to the current MindDataDataset
  1213. inline std::shared_ptr<MindDataDataset> MindData(
  1214. const std::vector<std::string> &dataset_files, const std::vector<std::string> &columns_list = {},
  1215. const std::shared_ptr<Sampler> &sampler = std::make_shared<RandomSampler>(), nlohmann::json padded_sample = nullptr,
  1216. int64_t num_padded = 0) {
  1217. return std::make_shared<MindDataDataset>(VectorStringToChar(dataset_files), VectorStringToChar(columns_list), sampler,
  1218. padded_sample, num_padded);
  1219. }
  1220. /// \brief Function to create a MindDataDataset
  1221. /// \param[in] dataset_files List of dataset files to be read directly.
  1222. /// \param[in] columns_list List of columns to be read
  1223. /// \param[in] sampler Raw pointer to a sampler object used to choose samples from the dataset.
  1224. /// supported sampler list: SubsetRandomSampler, PkSampler, RandomSampler, SequentialSampler, DistributedSampler.
  1225. /// \param[in] padded_sample Samples will be appended to dataset, where keys are the same as column_list.
  1226. /// \param[in] num_padded Number of padding samples. Dataset size plus num_padded should be divisible by num_shards.
  1227. /// \return Shared pointer to the current MindDataDataset
  1228. inline std::shared_ptr<MindDataDataset> MindData(const std::vector<std::string> &dataset_files,
  1229. const std::vector<std::string> &columns_list, Sampler *sampler,
  1230. nlohmann::json padded_sample = nullptr, int64_t num_padded = 0) {
  1231. return std::make_shared<MindDataDataset>(VectorStringToChar(dataset_files), VectorStringToChar(columns_list), sampler,
  1232. padded_sample, num_padded);
  1233. }
  1234. /// \brief Function to create a MindDataDataset
  1235. /// \param[in] dataset_files List of dataset files to be read directly.
  1236. /// \param[in] columns_list List of columns to be read
  1237. /// \param[in] sampler Sampler object used to choose samples from the dataset.
  1238. /// supported sampler list: SubsetRandomSampler, PkSampler, RandomSampler, SequentialSampler, DistributedSampler.
  1239. /// \param[in] padded_sample Samples will be appended to dataset, where keys are the same as column_list.
  1240. /// \param[in] num_padded Number of padding samples. Dataset size plus num_padded should be divisible by num_shards.
  1241. /// \return Shared pointer to the current MindDataDataset
  1242. inline std::shared_ptr<MindDataDataset> MindData(const std::vector<std::string> &dataset_files,
  1243. const std::vector<std::string> &columns_list,
  1244. const std::reference_wrapper<Sampler> sampler,
  1245. nlohmann::json padded_sample = nullptr, int64_t num_padded = 0) {
  1246. return std::make_shared<MindDataDataset>(VectorStringToChar(dataset_files), VectorStringToChar(columns_list), sampler,
  1247. padded_sample, num_padded);
  1248. }
  1249. #endif
  1250. class MnistDataset : public Dataset {
  1251. public:
  1252. explicit MnistDataset(const std::vector<char> &dataset_dir, const std::vector<char> &usage,
  1253. const std::shared_ptr<Sampler> &sampler, const std::shared_ptr<DatasetCache> &cache);
  1254. explicit MnistDataset(const std::vector<char> &dataset_dir, const std::vector<char> &usage, Sampler *sampler,
  1255. const std::shared_ptr<DatasetCache> &cache);
  1256. explicit MnistDataset(const std::vector<char> &dataset_dir, const std::vector<char> &usage,
  1257. const std::reference_wrapper<Sampler> sampler, const std::shared_ptr<DatasetCache> &cache);
  1258. ~MnistDataset() = default;
  1259. };
  1260. /// \brief Function to create a MnistDataset
  1261. /// \notes The generated dataset has two columns ["image", "label"]
  1262. /// \param[in] dataset_dir Path to the root directory that contains the dataset
  1263. /// \param[in] usage of MNIST, can be "train", "test" or "all" (default = "all").
  1264. /// \param[in] sampler Shared pointer to a sampler object used to choose samples from the dataset. If sampler is not
  1265. /// given,
  1266. /// a `RandomSampler` will be used to randomly iterate the entire dataset (default = RandomSampler())
  1267. /// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
  1268. /// \return Shared pointer to the current MnistDataset
  1269. inline std::shared_ptr<MnistDataset> Mnist(const std::string &dataset_dir, const std::string &usage = "all",
  1270. const std::shared_ptr<Sampler> &sampler = std::make_shared<RandomSampler>(),
  1271. const std::shared_ptr<DatasetCache> &cache = nullptr) {
  1272. return std::make_shared<MnistDataset>(StringToChar(dataset_dir), StringToChar(usage), sampler, cache);
  1273. }
  1274. /// \brief Function to create a MnistDataset
  1275. /// \notes The generated dataset has two columns ["image", "label"]
  1276. /// \param[in] dataset_dir Path to the root directory that contains the dataset
  1277. /// \param[in] usage of MNIST, can be "train", "test" or "all"
  1278. /// \param[in] sampler Raw pointer to a sampler object used to choose samples from the dataset.
  1279. /// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
  1280. /// \return Shared pointer to the current MnistDataset
  1281. inline std::shared_ptr<MnistDataset> Mnist(const std::string &dataset_dir, const std::string &usage, Sampler *sampler,
  1282. const std::shared_ptr<DatasetCache> &cache = nullptr) {
  1283. return std::make_shared<MnistDataset>(StringToChar(dataset_dir), StringToChar(usage), sampler, cache);
  1284. }
  1285. /// \brief Function to create a MnistDataset
  1286. /// \notes The generated dataset has two columns ["image", "label"]
  1287. /// \param[in] dataset_dir Path to the root directory that contains the dataset
  1288. /// \param[in] usage of MNIST, can be "train", "test" or "all"
  1289. /// \param[in] sampler Sampler object used to choose samples from the dataset.
  1290. /// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
  1291. /// \return Shared pointer to the current MnistDataset
  1292. inline std::shared_ptr<MnistDataset> Mnist(const std::string &dataset_dir, const std::string &usage,
  1293. const std::reference_wrapper<Sampler> sampler,
  1294. const std::shared_ptr<DatasetCache> &cache = nullptr) {
  1295. return std::make_shared<MnistDataset>(StringToChar(dataset_dir), StringToChar(usage), sampler, cache);
  1296. }
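// The factory functions in this header (Mnist, Manifest, MindData, ...) each come in
// three overloads that differ only in how the sampler is passed: by std::shared_ptr,
// by raw pointer, and by std::reference_wrapper. A minimal, self-contained sketch of
// that overload convention follows; `Sampler` and `SamplerId` here are hypothetical
// stand-ins, not the real MindSpore types.

```cpp
#include <cassert>
#include <functional>  // std::reference_wrapper, std::ref
#include <memory>

namespace sketch {
// Hypothetical stand-in for mindspore::dataset::Sampler.
struct Sampler {
  int id = 0;
};

// Three overloads mirroring the shared_ptr / raw-pointer / reference_wrapper
// conventions accepted by the dataset factory functions above.
inline int SamplerId(const std::shared_ptr<Sampler> &sampler) { return sampler ? sampler->id : -1; }
inline int SamplerId(Sampler *sampler) { return sampler ? sampler->id : -1; }
inline int SamplerId(std::reference_wrapper<Sampler> sampler) { return sampler.get().id; }
}  // namespace sketch
```

// All three spellings resolve to the same logical argument; callers pick whichever
// ownership model fits their code (shared ownership, borrowed pointer, or a
// stack-allocated sampler wrapped with std::ref).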
#ifndef ENABLE_ANDROID
/// \brief Function to create a ConcatDataset
/// \notes Overloads the "+" operator to concat two datasets
/// \param[in] datasets1 Shared pointer to the first dataset to be concatenated
/// \param[in] datasets2 Shared pointer to the second dataset to be concatenated
/// \return Shared pointer to the current ConcatDataset
inline std::shared_ptr<ConcatDataset> operator+(const std::shared_ptr<Dataset> &datasets1,
                                                const std::shared_ptr<Dataset> &datasets2) {
  return std::make_shared<ConcatDataset>(std::vector({datasets1, datasets2}));
}
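// The operator+ overload above lets users write `ds1 + ds2` to build a concat node.
// A compilable sketch of the same idiom with hypothetical stand-in types (not the
// real MindSpore Dataset/ConcatDataset classes):

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

namespace sketch {
// Hypothetical stand-ins for Dataset / ConcatDataset.
struct Dataset {
  virtual ~Dataset() = default;
};

struct ConcatDataset : Dataset {
  explicit ConcatDataset(std::vector<std::shared_ptr<Dataset>> children) : children_(std::move(children)) {}
  std::vector<std::shared_ptr<Dataset>> children_;
};

// "+" on two dataset handles yields a node that concatenates them; argument-dependent
// lookup finds this overload because Dataset lives in this namespace.
inline std::shared_ptr<ConcatDataset> operator+(const std::shared_ptr<Dataset> &lhs,
                                                const std::shared_ptr<Dataset> &rhs) {
  return std::make_shared<ConcatDataset>(std::vector<std::shared_ptr<Dataset>>{lhs, rhs});
}
}  // namespace sketch
```

// Note the result is a new pipeline node holding both inputs; neither operand is
// mutated, which matches the functional, tree-building style of this API.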
class RandomDataDataset : public Dataset {
 public:
  RandomDataDataset(const int32_t &total_rows, std::shared_ptr<SchemaObj> schema,
                    const std::vector<std::vector<char>> &columns_list, std::shared_ptr<DatasetCache> cache);
  RandomDataDataset(const int32_t &total_rows, const std::vector<char> &schema_path,
                    const std::vector<std::vector<char>> &columns_list, std::shared_ptr<DatasetCache> cache);
  ~RandomDataDataset() = default;
};

/// \brief Function to create a RandomDataset
/// \param[in] total_rows Number of rows for the dataset to generate (default=0, number of rows is random)
/// \param[in] schema SchemaObj to set column type, data type and data shape
/// \param[in] columns_list List of columns to be read (default={}, read all columns)
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current Dataset
template <typename T = std::shared_ptr<SchemaObj>>
std::shared_ptr<RandomDataDataset> RandomData(const int32_t &total_rows = 0, const T &schema = nullptr,
                                              const std::vector<std::string> &columns_list = {},
                                              const std::shared_ptr<DatasetCache> &cache = nullptr) {
  std::shared_ptr<RandomDataDataset> ds;
  if constexpr (std::is_same<T, std::nullptr_t>::value || std::is_same<T, std::shared_ptr<SchemaObj>>::value) {
    std::shared_ptr<SchemaObj> schema_obj = schema;
    ds =
      std::make_shared<RandomDataDataset>(total_rows, std::move(schema_obj), VectorStringToChar(columns_list), cache);
  } else {
    ds = std::make_shared<RandomDataDataset>(total_rows, StringToChar(schema), VectorStringToChar(columns_list), cache);
  }
  return ds;
}
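// RandomData() above (and TFRecord() below) accept the schema either as a
// shared_ptr<SchemaObj> (or nullptr) or as a schema-file path, with the branch
// resolved at compile time via `if constexpr`. A self-contained sketch of that
// dispatch pattern; `SchemaObj` and `DescribeSchema` here are hypothetical
// stand-ins used only to illustrate the template machinery.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <type_traits>

namespace sketch {
struct SchemaObj {};  // hypothetical stand-in for the real SchemaObj

// One template handles shared_ptr<SchemaObj>, nullptr, and path-like arguments;
// the dead branch is discarded at compile time, so each instantiation only
// type-checks the code it actually uses.
template <typename T = std::shared_ptr<SchemaObj>>
std::string DescribeSchema(const T &schema = nullptr) {
  if constexpr (std::is_same<T, std::nullptr_t>::value || std::is_same<T, std::shared_ptr<SchemaObj>>::value) {
    std::shared_ptr<SchemaObj> schema_obj = schema;  // works for both nullptr and shared_ptr
    return schema_obj ? "schema object" : "no schema";
  } else {
    return "schema path: " + std::string(schema);  // string, or anything convertible to one
  }
}
}  // namespace sketch
```

// Without `if constexpr`, the shared_ptr branch would fail to compile when T is a
// string type (and vice versa); the compile-time branch is what lets one default
// argument serve two unrelated parameter types.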
class TextFileDataset : public Dataset {
 public:
  explicit TextFileDataset(const std::vector<std::vector<char>> &dataset_files, int64_t num_samples,
                           ShuffleMode shuffle, int32_t num_shards, int32_t shard_id,
                           const std::shared_ptr<DatasetCache> &cache);
  ~TextFileDataset() = default;
};

/// \brief Function to create a TextFileDataset
/// \notes The generated dataset has one column ['text']
/// \param[in] dataset_files List of files to be read to search for a pattern of files. The list
///     will be sorted in a lexicographical order.
/// \param[in] num_samples The number of samples to be included in the dataset.
///     (Default = 0 means all samples.)
/// \param[in] shuffle The mode for shuffling data every epoch. (Default = ShuffleMode::kGlobal)
///     Can be any of:
///     ShuffleMode::kFalse - No shuffling is performed.
///     ShuffleMode::kFiles - Shuffle files only.
///     ShuffleMode::kGlobal - Shuffle both the files and samples.
/// \param[in] num_shards Number of shards that the dataset should be divided into. (Default = 1)
/// \param[in] shard_id The shard ID within num_shards. This argument should be
///     specified only when num_shards is also specified. (Default = 0)
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current TextFileDataset
inline std::shared_ptr<TextFileDataset> TextFile(const std::vector<std::string> &dataset_files, int64_t num_samples = 0,
                                                 ShuffleMode shuffle = ShuffleMode::kGlobal, int32_t num_shards = 1,
                                                 int32_t shard_id = 0,
                                                 const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<TextFileDataset>(VectorStringToChar(dataset_files), num_samples, shuffle, num_shards,
                                           shard_id, cache);
}

class TFRecordDataset : public Dataset {
 public:
  TFRecordDataset(const std::vector<std::vector<char>> &dataset_files, const std::vector<char> &schema,
                  const std::vector<std::vector<char>> &columns_list, int64_t num_samples, ShuffleMode shuffle,
                  int32_t num_shards, int32_t shard_id, bool shard_equal_rows, std::shared_ptr<DatasetCache> cache);

  /// \brief Constructor
  /// \note Parameter 'schema' is a shared pointer to a Schema object
  TFRecordDataset(const std::vector<std::vector<char>> &dataset_files, std::shared_ptr<SchemaObj> schema,
                  const std::vector<std::vector<char>> &columns_list, int64_t num_samples, ShuffleMode shuffle,
                  int32_t num_shards, int32_t shard_id, bool shard_equal_rows, std::shared_ptr<DatasetCache> cache);
  ~TFRecordDataset() = default;
};

/// \brief Function to create a TFRecordDataset
/// \param[in] dataset_files List of files to be read to search for a pattern of files. The list
///     will be sorted in a lexicographical order.
/// \param[in] schema SchemaObj or string to schema path. (Default = nullptr, which means that the
///     meta data from the TFData file is considered the schema.)
/// \param[in] columns_list List of columns to be read. (Default = {}, read all columns)
/// \param[in] num_samples The number of samples to be included in the dataset.
///     (Default = 0 means all samples.)
///     If num_samples is 0 and numRows (parsed from schema) does not exist, read the full dataset;
///     If num_samples is 0 and numRows (parsed from schema) is greater than 0, read numRows rows;
///     If both num_samples and numRows (parsed from schema) are greater than 0, read num_samples rows.
/// \param[in] shuffle The mode for shuffling data every epoch. (Default = ShuffleMode::kGlobal)
///     Can be any of:
///     ShuffleMode::kFalse - No shuffling is performed.
///     ShuffleMode::kFiles - Shuffle files only.
///     ShuffleMode::kGlobal - Shuffle both the files and samples.
/// \param[in] num_shards Number of shards that the dataset should be divided into. (Default = 1)
/// \param[in] shard_id The shard ID within num_shards. This argument should be specified only
///     when num_shards is also specified. (Default = 0)
/// \param[in] shard_equal_rows Get equal rows for all shards. (Default = false, the number of rows in
///     each shard may not be equal)
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current TFRecordDataset
template <typename T = std::shared_ptr<SchemaObj>>
std::shared_ptr<TFRecordDataset> TFRecord(const std::vector<std::string> &dataset_files, const T &schema = nullptr,
                                          const std::vector<std::string> &columns_list = {}, int64_t num_samples = 0,
                                          ShuffleMode shuffle = ShuffleMode::kGlobal, int32_t num_shards = 1,
                                          int32_t shard_id = 0, bool shard_equal_rows = false,
                                          const std::shared_ptr<DatasetCache> &cache = nullptr) {
  std::shared_ptr<TFRecordDataset> ds = nullptr;
  if constexpr (std::is_same<T, std::nullptr_t>::value || std::is_same<T, std::shared_ptr<SchemaObj>>::value) {
    std::shared_ptr<SchemaObj> schema_obj = schema;
    ds = std::make_shared<TFRecordDataset>(VectorStringToChar(dataset_files), std::move(schema_obj),
                                           VectorStringToChar(columns_list), num_samples, shuffle, num_shards, shard_id,
                                           shard_equal_rows, cache);
  } else {
    std::string schema_path = schema;
    if (!schema_path.empty()) {
      struct stat sb;
      int rc = stat(common::SafeCStr(schema_path), &sb);
      if (rc == -1 && errno != ENOENT) {
        MS_LOG(WARNING) << "Unable to query the status of [" << schema_path << "]. Errno = " << errno << ".";
      }
      if (rc != 0) {
        MS_LOG(ERROR) << "TFRecordDataset: schema path [" << schema_path << "] is invalid or does not exist.";
        return nullptr;
      }
    }
    ds = std::make_shared<TFRecordDataset>(VectorStringToChar(dataset_files), StringToChar(schema_path),
                                           VectorStringToChar(columns_list), num_samples, shuffle, num_shards, shard_id,
                                           shard_equal_rows, cache);
  }
  return ds;
}
class VOCDataset : public Dataset {
 public:
  explicit VOCDataset(const std::vector<char> &dataset_dir, const std::vector<char> &task,
                      const std::vector<char> &usage, const std::map<std::vector<char>, int32_t> &class_indexing,
                      bool decode, const std::shared_ptr<Sampler> &sampler, const std::shared_ptr<DatasetCache> &cache);
  explicit VOCDataset(const std::vector<char> &dataset_dir, const std::vector<char> &task,
                      const std::vector<char> &usage, const std::map<std::vector<char>, int32_t> &class_indexing,
                      bool decode, Sampler *sampler, const std::shared_ptr<DatasetCache> &cache);
  explicit VOCDataset(const std::vector<char> &dataset_dir, const std::vector<char> &task,
                      const std::vector<char> &usage, const std::map<std::vector<char>, int32_t> &class_indexing,
                      bool decode, const std::reference_wrapper<Sampler> sampler,
                      const std::shared_ptr<DatasetCache> &cache);
  ~VOCDataset() = default;
};

/// \brief Function to create a VOCDataset
/// \notes The generated dataset has multiple columns:
///     - task='Detection', columns: [['image', dtype=uint8], ['bbox', dtype=float32], ['label', dtype=uint32],
///       ['difficult', dtype=uint32], ['truncate', dtype=uint32]].
///     - task='Segmentation', columns: [['image', dtype=uint8], ['target', dtype=uint8]].
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] task Set the task type of reading voc data, currently only "Segmentation" or "Detection" are supported
/// \param[in] usage The type of data list text file to be read (default = "train").
/// \param[in] class_indexing A str-to-int mapping from label name to index, only valid in the "Detection" task
/// \param[in] decode Decode the images after reading
/// \param[in] sampler Shared pointer to a sampler object used to choose samples from the dataset. If sampler is not
///     given, a `RandomSampler` will be used to randomly iterate the entire dataset (default = RandomSampler())
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<VOCDataset> VOC(const std::string &dataset_dir, const std::string &task = "Segmentation",
                                       const std::string &usage = "train",
                                       const std::map<std::string, int32_t> &class_indexing = {}, bool decode = false,
                                       const std::shared_ptr<Sampler> &sampler = std::make_shared<RandomSampler>(),
                                       const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<VOCDataset>(StringToChar(dataset_dir), StringToChar(task), StringToChar(usage),
                                      MapStringToChar(class_indexing), decode, sampler, cache);
}

/// \brief Function to create a VOCDataset
/// \notes The generated dataset has multiple columns:
///     - task='Detection', columns: [['image', dtype=uint8], ['bbox', dtype=float32], ['label', dtype=uint32],
///       ['difficult', dtype=uint32], ['truncate', dtype=uint32]].
///     - task='Segmentation', columns: [['image', dtype=uint8], ['target', dtype=uint8]].
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] task Set the task type of reading voc data, currently only "Segmentation" or "Detection" are supported
/// \param[in] usage The type of data list text file to be read.
/// \param[in] class_indexing A str-to-int mapping from label name to index, only valid in the "Detection" task
/// \param[in] decode Decode the images after reading
/// \param[in] sampler Raw pointer to a sampler object used to choose samples from the dataset.
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<VOCDataset> VOC(const std::string &dataset_dir, const std::string &task,
                                       const std::string &usage, const std::map<std::string, int32_t> &class_indexing,
                                       bool decode, Sampler *sampler,
                                       const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<VOCDataset>(StringToChar(dataset_dir), StringToChar(task), StringToChar(usage),
                                      MapStringToChar(class_indexing), decode, sampler, cache);
}

/// \brief Function to create a VOCDataset
/// \notes The generated dataset has multiple columns:
///     - task='Detection', columns: [['image', dtype=uint8], ['bbox', dtype=float32], ['label', dtype=uint32],
///       ['difficult', dtype=uint32], ['truncate', dtype=uint32]].
///     - task='Segmentation', columns: [['image', dtype=uint8], ['target', dtype=uint8]].
/// \param[in] dataset_dir Path to the root directory that contains the dataset
/// \param[in] task Set the task type of reading voc data, currently only "Segmentation" or "Detection" are supported
/// \param[in] usage The type of data list text file to be read.
/// \param[in] class_indexing A str-to-int mapping from label name to index, only valid in the "Detection" task
/// \param[in] decode Decode the images after reading
/// \param[in] sampler Sampler object used to choose samples from the dataset.
/// \param[in] cache Tensor cache to use. (default=nullptr which means no cache is used).
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<VOCDataset> VOC(const std::string &dataset_dir, const std::string &task,
                                       const std::string &usage, const std::map<std::string, int32_t> &class_indexing,
                                       bool decode, const std::reference_wrapper<Sampler> sampler,
                                       const std::shared_ptr<DatasetCache> &cache = nullptr) {
  return std::make_shared<VOCDataset>(StringToChar(dataset_dir), StringToChar(task), StringToChar(usage),
                                      MapStringToChar(class_indexing), decode, sampler, cache);
}

std::shared_ptr<DatasetCache> CreateDatasetCacheCharIF(session_id_type id, uint64_t mem_sz, bool spill,
                                                       std::optional<std::vector<char>> hostname = std::nullopt,
                                                       std::optional<int32_t> port = std::nullopt,
                                                       std::optional<int32_t> num_connections = std::nullopt,
                                                       std::optional<int32_t> prefetch_sz = std::nullopt);

/// \brief Function to create a cache to be attached to a dataset
/// \param[in] id A user assigned session id for the current pipeline.
/// \param[in] mem_sz Size of the memory set aside for the row caching (default=0 which means unlimited,
///     note that it might bring in the risk of running out of memory on the machine).
/// \param[in] spill Spill to disk if out of memory (default=False).
/// \param[in] hostname Optional host name (default="127.0.0.1").
/// \param[in] port Optional port (default=50052).
/// \param[in] num_connections Optional number of connections (default=12).
/// \param[in] prefetch_sz Optional prefetch size (default=20).
/// \return Shared pointer to DatasetCache. If error, nullptr is returned.
inline std::shared_ptr<DatasetCache> CreateDatasetCache(session_id_type id, uint64_t mem_sz, bool spill,
                                                        std::optional<std::string> hostname = std::nullopt,
                                                        std::optional<int32_t> port = std::nullopt,
                                                        std::optional<int32_t> num_connections = std::nullopt,
                                                        std::optional<int32_t> prefetch_sz = std::nullopt) {
  return CreateDatasetCacheCharIF(id, mem_sz, spill, OptionalStringToChar(hostname), port, num_connections,
                                  prefetch_sz);
}
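// CreateDatasetCache() threads its optional arguments through as std::nullopt so the
// cache layer can substitute the documented defaults (hostname "127.0.0.1", port 50052,
// per the comments above). A small self-contained sketch of that std::optional
// defaulting idiom; the ResolveHostname/ResolvePort names are hypothetical helpers,
// not part of this API.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>

namespace sketch {
// Unset arguments stay nullopt; value_or() supplies the documented default,
// mirroring how CreateDatasetCache() forwards its optional parameters.
inline std::string ResolveHostname(std::optional<std::string> hostname = std::nullopt) {
  return hostname.value_or("127.0.0.1");
}

inline int32_t ResolvePort(std::optional<int32_t> port = std::nullopt) { return port.value_or(50052); }
}  // namespace sketch
```

// Using std::optional rather than sentinel values (empty string, port 0) keeps
// "not provided" distinct from any legitimate user-supplied value.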
/// \brief Function to create a ZipDataset
/// \notes Applies zip to the dataset
/// \param[in] datasets List of shared pointers to the datasets that we want to zip
/// \return Shared pointer to the current Dataset
inline std::shared_ptr<ZipDataset> Zip(const std::vector<std::shared_ptr<Dataset>> &datasets) {
  return std::make_shared<ZipDataset>(datasets);
}
#endif
}  // namespace dataset
}  // namespace mindspore
#endif  // MINDSPORE_CCSRC_MINDDATA_DATASET_INCLUDE_DATASETS_H_