

[to #42322933] add/refactor nlp models source code and finetune
1. add sbert, veco, palm, space source code
2. support sbert sequence classification and token classification finetune
3. support veco sequence classification finetune
4. support palm nlg finetune; evaluation result: https://sheet.alibaba-inc.com/#/sheet/f7fdcc7f22bd5105 sheet: Maas
5. add ut for finetunes
6. add veco's taskdataset processor
7. add a common trainer for nlp, and a specific trainer for veco
8. merge some duplicate code across models, preprocessors, and pipelines
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9574105
* add basic class of hook & metrics
* pre-commit passed
* change some comments
* pre-commit passed
* 1. remove accuracy's groups 2. remove useless hooks 3. simplify priorities
* pre-commit passed
* fix a comment
* Merge branch 'master' into finetune_hooks_metrics (conflicts: modelscope/metainfo.py)
* pre-commit passed
* add basic class of hook & metrics
* pre-commit passed
* change some comments
* pre-commit passed
* 1. remove accuracy's groups 2. remove useless hooks 3. simplify priorities
* pre-commit passed
* fix a comment
* Merge branch 'feat/finetune' of gitlab.alibaba-inc.com:Ali-MaaS/MaaS-lib into feat/finetune
* mv hooks-related code to modelscope/trainers/hooks
* mv priority back
* add torch model base and test
* update hooks, trainer, import_util
* add torch epoch-based trainer and dist utils
* add hooks
* fix warmup
* format code style, fix warmup, and add warmup unittest
* fix impls
* pre-commit check passed
* update hook and add EpochBasedTrainer
* add trainer unittest
* Merge branch 'feat/add_hooks' into feat/add_task (conflicts: modelscope/models/base_torch.py, modelscope/trainers/hooks/hook.py, modelscope/trainers/trainer.py)
* update unittest name
* rewrite taskdataset to trainer
* fix trainer and add unittest
* add unittest
* code: run to forward
* run through... but ugly code
* arrange some cls
* fix some errs
* revert some mistakes
* init check-in
* Merge branch 'feat/add_hooks' into feat/add_task (conflicts: modelscope/trainers/trainer.py)
* test with bigger epoch and size
* add the default metrics class
* move build-metrics code to a method
* merge add_task
* merge origin add_task
* add device initialization
* remove preprocessor arg for bool
* add task models
* move metric-collect logic to metrics class
* pre-commit passed
* fix cr comments
* pre-commit passed
* add task models
* Merge remote-tracking branch 'origin/feat/add_task' into feat/backbone_head
* add comment
* change comment formats
* fix comments
* fix ut bug
* fix comments
* add wrapper check
* fix comments
* pre-commit passed
* fix cr comments
* solve a loop-import problem
* fix ut bug
* fix ut errors
* change dummydataset to msdataset
* pre-commit passed
* merge add_task
* backbone-head is built, model is not correctly loaded
* model load states matched
* result matched
* lint
* add veco/palm_v2 code
* merge master
* merge master, running successfully
* add repr model name level
* Merge branch 'feat/veco_palm' into feat/finetune_sbert_veco
* model test for training
* add token-classification metric, add formal ut
* fix running bug
* finetune and pipeline are working with backbone-head
* add nli
* add missing code
* finetune and pipeline are working with backbone-head
* Merge branch 'feat/backbone_head' of http://gitlab.alibaba-inc.com/Ali-MaaS/MaaS-lib into feat/backbone_head
* add a test repo for pr
* remove merge-conflicted file
* remove merge-conflicted file 1
* lint check
* import error
* none-type bug fix
* forward input unpacking or dict bug
* move head into models, add build_backbone with registry, no base method
* merge master
* feat: 1. add interleave dataset method 2. support multiple datasets in trainer.build_dataset 3. support 3 sub-tasks in sequence_classification task
* unfinished
* update the task model structure in the NLP field
* merge master
* update by comments
* keep the default model id as current on production
* unfinished
* unfinished
* veco can run
* Merge remote-tracking branch 'origin/master' into feat/backbone_head
* add taskmodel for module management
* remove forward_input_is_dict
* unfinished
* token classification started
* update base model structure
* move space to backbone
* remove 'type' in build_from_cfg method
* test update
* bug fix
* on testing, mess code
* Merge branch 'feat/backbone_head' into feat/refactor_nlp_730 (conflicts: modelscope/metrics/builder.py, modelscope/models/__init__.py, modelscope/models/nlp/__init__.py, modelscope/preprocessors/nlp.py, modelscope/trainers/trainer.py, requirements/multi-modal.txt)
* add missing merge
* add sofa source code
* refactor
* add veco task dataset
* add veco task dataset
* pre-commit passed
* fix bug of log
* add some features
* merge master
* bug fix
* refine nlp models
* fix the training error
* unfinished
* refactor pipeline
* Merge branch 'feat/backbone_head' into feat/refactor_nlp_730 (conflicts: modelscope/metrics/builder.py, modelscope/models/nlp/__init__.py, modelscope/models/nlp/backbones/structbert/modeling_sbert.py, modelscope/models/nlp/palm_v2/palm_for_text_generation.py, modelscope/preprocessors/base.py, modelscope/preprocessors/nlp.py, modelscope/trainers/trainer.py)
* Merge commit 'ab04ceafc5453ce7daa9aa09e37a55f703072a10' into feat/refactor_nlp_730 (conflicts: modelscope/metainfo.py, modelscope/metrics/builder.py, modelscope/models/__init__.py, modelscope/models/base/base_torch_model.py, modelscope/models/nlp/__init__.py, modelscope/models/nlp/backbones/space/model/intent_unified_transformer.py, modelscope/models/nlp/backbones/space/model/model_base.py, modelscope/models/nlp/palm_v2/palm_for_text_generation.py, modelscope/models/nlp/sbert_for_sequence_classification.py, modelscope/models/nlp/sequence_classification.py, modelscope/models/nlp/space/__init__.py, modelscope/models/nlp/space_for_dialog_intent_prediction.py, modelscope/models/nlp/space_for_dialog_modeling.py, modelscope/models/nlp/space_for_dialog_state_tracking.py, modelscope/models/nlp/task_model.py, modelscope/pipelines/nlp/sentiment_classification_pipeline.py, modelscope/preprocessors/base.py, modelscope/preprocessors/nlp.py, modelscope/trainers/trainer.py)
* revert changes
* unify sentence classification postprocess
* revert some changes, move some model files
* pipeline first case run through
* ws pipeline passed
* Merge branch 'feat/refactor_nlp_730' into feat/finetune_sbert_veco
* finetune
* revert code
* revert some code
* ws finetune started, only the accuracy is weird
* Merge branch 'feat/veco_taskdataset' into feat/finetune_sbert_veco (conflicts: modelscope/task_datasets/veco_dataset.py, tests/taskdataset/test_veco_dataset.py)
* veco+nli finetune started
* Merge branch 'master' into feat/finetune_sbert_veco (conflicts: modelscope/models/nlp/sbert_for_sequence_classification.py, modelscope/models/nlp/sbert_for_token_classification.py, modelscope/models/nlp/sbert_for_zero_shot_classification.py, modelscope/models/nlp/space/space_for_dialog_intent_prediction.py, modelscope/models/nlp/space/space_for_dialog_modeling.py, modelscope/trainers/trainer.py)
* add trainer for nlp
* trainer: dataset params passed into preprocessor
* test passed by nlptrainer
* fix some bugs
* fix some bugs
* add backbone/head subclass
* fix regression bugs
* fix bug in token-cls finetune
* support cfg modification
* fix bug
* fix bug
* update requirements
* add some comments and fix some t
* add some comments and revert an argument
* split into two test files
* revert code
* fix bug in preprocessor (cherry picked from commit 7a648d096ef8500c694d3255dabe29e6f4bfc3e5)
* fix ut bug
* support sbert models
* unfinished
* Merge branch 'feat/finetune_sbert_veco' into sly_tmp_veco_finetune (conflicts: tests/trainers/test_finetune_sequence_classification.py)
* fix bug in veco
* fix bug
* fix bug
* correct running params
* remove useless files
* add palm finetuning with cnn_dailymail dataset
* copy space model from sofa
* Merge branch 'feat/finetune_sbert_veco' of gitlab.alibaba-inc.com:Ali-MaaS/MaaS-lib into feat/finetune_sbert_veco
* Merge branch 'master' into feat/finetune_sbert_veco (conflicts: modelscope/metrics/__init__.py, modelscope/models/__init__.py, modelscope/models/nlp/__init__.py, modelscope/models/nlp/backbones/__init__.py, modelscope/models/nlp/backbones/structbert/modeling_sbert.py, modelscope/models/nlp/heads/__init__.py, modelscope/models/nlp/masked_language.py, modelscope/models/nlp/palm_v2/palm_for_text_generation.py, modelscope/models/nlp/sbert_for_nli.py, modelscope/models/nlp/sbert_for_sentence_similarity.py, modelscope/models/nlp/sbert_for_sentiment_classification.py, modelscope/models/nlp/sbert_for_sequence_classification.py, modelscope/models/nlp/sbert_for_token_classification.py, modelscope/models/nlp/sbert_for_zero_shot_classification.py, modelscope/models/nlp/sequence_classification.py, modelscope/models/nlp/space/space_for_dialog_intent_prediction.py, modelscope/models/nlp/space/space_for_dialog_modeling.py, modelscope/models/nlp/space/space_for_dialog_state_tracking.py, modelscope/models/nlp/structbert/adv_utils.py, modelscope/models/nlp/structbert/configuration_sbert.py, modelscope/models/nlp/task_models/task_model.py, modelscope/pipelines/__init__.py, modelscope/pipelines/nlp/__init__.py, modelscope/pipelines/nlp/fill_mask_pipeline.py, modelscope/pipelines/nlp/named_entity_recognition_pipeline.py, modelscope/pipelines/nlp/nli_pipeline.py, modelscope/pipelines/nlp/sentence_similarity_pipeline.py, modelscope/pipelines/nlp/sentiment_classification_pipeline.py, modelscope/pipelines/nlp/text_generation_pipeline.py, modelscope/pipelines/nlp/word_segmentation_pipeline.py, modelscope/pipelines/nlp/zero_shot_classification_pipeline.py, modelscope/preprocessors/nlp.py, modelscope/task_datasets/__init__.py, modelscope/trainers/trainer.py, modelscope/trainers/utils/inference.py, modelscope/utils/file_utils.py, requirements/nlp.txt, tests/pipelines/test_nli.py, tests/pipelines/test_sentence_similarity.py, tests/pipelines/test_sentiment_classification.py)
* fix imports
* mark backbones in their own modeling
* pre-commit check passed
* pre-commit passed, remove roberta model
* fix a bug in ast import
* skip all finetune uts
* fix bugs
* pre-commit passed
* bug fixed
* bug fixed
* bug fixed
* bug fixed
* fix ut bug
* fix bug
* fix ut bug
* fix bug
* fix bug
* fix bugs
* fix bug
* revert veco
* revert veco because of core dump
* fix palm bug
* revert veco
* revert mistaken code
* add a test print
* pre-commit check
* test exception
* add test code
* for test
* fix bug and test
* remove test code
* remove useless file
* 1. fix some bugs 2. add backbone ut
* Merge branch 'master' into feat/finetune_refactor_730 (conflicts: modelscope/metainfo.py, modelscope/metrics/sequence_classification_metric.py, modelscope/models/nlp/__init__.py, modelscope/models/nlp/task_models/task_model.py, modelscope/preprocessors/__init__.py, modelscope/preprocessors/nlp.py, modelscope/trainers/trainer.py, modelscope/trainers/utils/inference.py, modelscope/utils/file_utils.py, tests/trainers/test_trainer_with_nlp.py)
* pre-commit passed
* revert files
* increase test level
* unregister models
* fix bugs
* fix cr comments
* fix bug in backbone-head
* add sbert backbone
* fix bug
* add test for token-cls-metric
* pre-commit passed
* fix ut comments
* revert normal tokenizer to fast tokenizer
* Merge branch 'master' into feat/finetune_refactor_730 (conflicts: modelscope/models/nlp/__init__.py, modelscope/models/nlp/backbones/__init__.py, modelscope/models/nlp/backbones/structbert/__init__.py, modelscope/models/nlp/masked_language.py, modelscope/models/nlp/palm_v2/palm_for_text_generation.py, modelscope/models/nlp/sbert_for_sequence_classification.py, modelscope/models/nlp/sbert_for_token_classification.py, modelscope/models/nlp/sbert_for_zero_shot_classification.py, modelscope/pipelines/nlp/text_generation_pipeline.py, modelscope/preprocessors/nlp.py, modelscope/trainers/trainer.py, modelscope/trainers/utils/inference.py)
* fix merge bugs
* pre-commit passed
* fix bug
* fix bug
* fix bug
* fix bug from master
* add print
* fix ut bug
* fix bug
* Merge branch 'master' into feat/finetune_refactor_730
* skip task model test
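Several of the squashed commits above ("support cfg modification", "add trainer for nlp") revolve around letting callers patch the training configuration before the trainer runs. The following is a minimal, illustrative sketch of that pattern, not the actual MaaS-lib implementation; the config keys (`train`, `max_epochs`, `optimizer`) are assumptions chosen for the example:

```python
def cfg_modify_fn(cfg: dict) -> dict:
    """Override a few training settings before the trainer starts.

    Hypothetical callback: the trainer loads its base configuration,
    hands it to this function, and trains with whatever is returned.
    """
    train = cfg.setdefault("train", {})
    train["max_epochs"] = 3                              # shorten training
    train["optimizer"] = {"type": "AdamW", "lr": 2e-5}   # swap optimizer
    return cfg


# A trainer would apply the callback roughly like this:
base_cfg = {"model": {"type": "structbert"}, "train": {"max_epochs": 10}}
effective_cfg = cfg_modify_fn(base_cfg)
```

The appeal of the pattern is that the base config stays in the model repo while each finetune run expresses only its deltas in code.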
3 years ago
[to #42322933] NLP 1030 Refactor
Features:
1. Refactor the directory structure of the nlp models: all model files are placed into either the model folder or the task_model folder.
2. Refactor all comments to Google style.
3. Add detailed comments to important tasks and nlp models, listing each model's description and its preprocessor & trainer.
4. Model exporting now supports a direct call to TorchModelExporter (no need to derive from it).
5. Refactor the model's save_pretrained method to support direct running (independent from the trainer).
6. Remove the Model type check in the pipeline base class, so externally registered models can run in our pipelines.
7. The nlp trainer now has an NLPTrainingArguments class: users can pass arguments into the dataclass and use it as a normal cfg_modify_fn, simplifying cfg modification.
8. Merge BACKBONES into MODELS, so users can get a backbone with a Model.from_pretrained call.
9. Model.from_pretrained now supports a task argument, so users can load a backbone with a specific task class.
10. Support the Preprocessor.from_pretrained method.
11. Add standard return classes to important nlp tasks, so some pipelines and models are now independent: the return values of the models are always tensors, and the pipelines take care of the conversion to numpy and subsequent postprocessing.
12. Split the file of the nlp preprocessors to make the directory structure clearer.
Bug fixes:
1. Fix a bug where the lr_scheduler could be called earlier than the optimizer's step.
2. Fix a bug where calling a Pipeline directly (not created via pipeline(xxx)) threw an error.
3. Fix a bug where the trainer would not call the correct TaskDataset class.
4. Fix a bug where the internal loading of a dataset threw an error in the trainer class.
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/10490585
3 years ago
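Feature 7 above describes a dataclass of training arguments that doubles as a cfg_modify_fn. A rough sketch of how such a class could work, with field names and config keys invented for illustration (the real NLPTrainingArguments in modelscope may differ):

```python
from dataclasses import dataclass


@dataclass
class TrainingArgsSketch:
    """Hypothetical training-arguments dataclass usable as a cfg_modify_fn."""

    max_epochs: int = 5
    lr: float = 1e-5

    def __call__(self, cfg: dict) -> dict:
        # Acting as a cfg_modify_fn: project the dataclass fields onto
        # the nested configuration the trainer actually consumes.
        train = cfg.setdefault("train", {})
        train["max_epochs"] = self.max_epochs
        train.setdefault("optimizer", {})["lr"] = self.lr
        return cfg


# Usage: construct the arguments, then pass the instance wherever a
# cfg_modify_fn is accepted.
args = TrainingArgsSketch(max_epochs=3, lr=2e-5)
cfg = args({"train": {"max_epochs": 10}})
```

Exposing flat, typed fields this way spares users from knowing the nested config layout, while the trainer still sees an ordinary cfg_modify_fn.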
[to #42322933] NLP 1030 Refactor Features: 1. Refactor the directory structure of nlp models. All model files are placed into either the model folder or the task_model folder 2. Refactor all the comments to google style 3. Add detail comments to important tasks and nlp models, to list the description of the model, and its preprocessor&trainer 4. Model Exporting now supports a direct all to TorchModelExporter(no need to derive from it) 5. Refactor model save_pretrained method to support direct running(independent from trainer) 6. Remove the judgement of Model in the pipeline base class, to support outer register models running in our pipelines 7. Nlp trainer now has a NLPTrainingArguments class , user can pass arguments into the dataclass, and use it as a normal cfg_modify_fn, to simplify the operation of modify cfg. 8. Merge the BACKBONES and the MODELS, so user can get a backbone with the Model.from_pretrained call 9. Model.from_pretrained now support a task argument, so user can use a backbone and load it with a specific task class. 10. Support Preprocessor.from_pretrained method 11. Add standard return classes to important nlp tasks, so some of the pipelines and the models are independent now, the return values of the models will always be tensors, and the pipelines will take care of the conversion to numpy and the following stuffs. 12. Split the file of the nlp preprocessors, to make the dir structure more clear. Bugs Fixing: 1. Fix a bug that lr_scheduler can be called earlier than the optimizer's step 2. Fix a bug that the direct call of Pipelines (not from pipeline(xxx)) throws error 3. Fix a bug that the trainer will not call the correct TaskDataset class 4. Fix a bug that the internal loading of dataset will throws error in the trainer class Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/10490585
3 years ago
[to #42322933] NLP 1030 Refactor Features: 1. Refactor the directory structure of nlp models. All model files are placed into either the model folder or the task_model folder 2. Refactor all the comments to google style 3. Add detail comments to important tasks and nlp models, to list the description of the model, and its preprocessor&trainer 4. Model Exporting now supports a direct all to TorchModelExporter(no need to derive from it) 5. Refactor model save_pretrained method to support direct running(independent from trainer) 6. Remove the judgement of Model in the pipeline base class, to support outer register models running in our pipelines 7. Nlp trainer now has a NLPTrainingArguments class , user can pass arguments into the dataclass, and use it as a normal cfg_modify_fn, to simplify the operation of modify cfg. 8. Merge the BACKBONES and the MODELS, so user can get a backbone with the Model.from_pretrained call 9. Model.from_pretrained now support a task argument, so user can use a backbone and load it with a specific task class. 10. Support Preprocessor.from_pretrained method 11. Add standard return classes to important nlp tasks, so some of the pipelines and the models are independent now, the return values of the models will always be tensors, and the pipelines will take care of the conversion to numpy and the following stuffs. 12. Split the file of the nlp preprocessors, to make the dir structure more clear. Bugs Fixing: 1. Fix a bug that lr_scheduler can be called earlier than the optimizer's step 2. Fix a bug that the direct call of Pipelines (not from pipeline(xxx)) throws error 3. Fix a bug that the trainer will not call the correct TaskDataset class 4. Fix a bug that the internal loading of dataset will throws error in the trainer class Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/10490585
3 years ago
[to #42322933] add/refactor nlp models source code and finetune 1. add sbert,veco,palm,space source code 2. support sbert sequence classification, token classification finetune 3. support veco sequence classification finetune 4. support palm nlg finetune evaluation result: https://sheet.alibaba-inc.com/#/sheet/f7fdcc7f22bd5105 sheet:Maas 5. add ut for finetunes 6. add veco's taskdataset processor 7. add a common trainer for nlp, and a specific trainer for veco 8. merge some duplicate codes of models, preprocessors, pipelines Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/9574105 * add basic class of hook&metrics * pre-commit passed * change some comments * pre commit passed * 1. remove accuracy's groups 2. remove useless hooks 3. simplify priorities * pre-commit passed * fix a comment * Merge branch 'master' into finetune_hooks_metrics # Conflicts: # modelscope/metainfo.py * pre-commit passed * add basic class of hook&metrics * pre-commit passed * change some comments * pre commit passed * 1. remove accuracy's groups 2. remove useless hooks 3. simplify priorities * pre-commit passed * fix a comment * Merge branch 'feat/finetune' of gitlab.alibaba-inc.com:Ali-MaaS/MaaS-lib into feat/finetune * mv hooks related to modelscope/trainers/hooks * mv priority back * add torch mdoel base and test * update hooks, trainer, import_util * add torch epoch based trainer and dis utils * add hooks * fix warmup * format code stype and fix warmup and add warmup unittest * fix impls * pre-commit check passed * update hook and add EpochBasedTrainer * add trainer unittest * Merge branch 'feat/add_hooks' into feat/add_task # Conflicts: # modelscope/models/base_torch.py # modelscope/trainers/hooks/hook.py # modelscope/trainers/trainer.py * update unittest name * rewrite taskdataset to trainer * fix trainer and add unittest * add unittest * code: run to forward * run through... 
but ugly code * arrange some cls * fix some errs * revert some mistakes * init check in * Merge branch 'feat/add_hooks' into feat/add_task # Conflicts: # modelscope/trainers/trainer.py * test with bigger epoch and size * add the default metrics class * move build metrics code to a method * merge add_task * merge origin add_task * add device initialization * remove preprocessor arg for bool * add task models * move metric collect logic to metrics class * pre-commit passed * fix cr comments * precommit passed * add task models * Merge remote-tracking branch 'origin/feat/add_task' into feat/backbone_head * add comment * change comment formats. * fix comments * fix ut bug * fix comments * add wrapper check * fix comments * pre commit passed * fix cr comments * solve a loop import problem * fix ut bug * fix ut errors * change dummydataset to msdataset * precommit passed * merge add task * backbone-head is build, model is not correctly loaded * model load states matched * result matched * lint * add veco/palm_v2 code * merge master * merge master success running * add repr model name level * Merge branch 'feat/veco_palm' into feat/finetune_sbert_veco * model test for training * add token-classification metric add formal ut * fix running bug * finetune and pipeline are working with backbone-head * add nli * add missing code * finetune and pipeline are working with backbone-head * Merge branch 'feat/backbone_head' of http://gitlab.alibaba-inc.com/Ali-MaaS/MaaS-lib into feat/backbone_head * add a test repo for pr * remove merge conflicted file * remove merge conflicted file 1 * lint check * import error * none type bug fix * forward input unpacking or dict bug * move head into models, add build_backbone with registry, no base method * merge master * feat: 1. add interleave dataset method 2. support multiple dataset in trainer.build_dataset 3. 
support 3 sub tasks in sequence_classification task * unfinished * update the task model structure in NLP field * merge master * update by comments * keep the default model id as current on production * unfinished * unfinished * veco can run * Merge remote-tracking branch 'origin/master' into feat/backbone_head * add taskmodel for module management * remove forward_input_is_dict * unfinished * token classification started * update base model structure * move space to backbone * remove 'type' in build_from_cfg method * test update * bug fix * on tesing, mess code * Merge branch 'feat/backbone_head' into feat/refactor_nlp_730 # Conflicts: # modelscope/metrics/builder.py # modelscope/models/__init__.py # modelscope/models/nlp/__init__.py # modelscope/preprocessors/nlp.py # modelscope/trainers/trainer.py # requirements/multi-modal.txt * add missing merge * add sofa source code * refactor * add veco task dataset * add veco task dataset * pre-commit passed * fix bug of log * add some features * merge master * bug fix * refine nlp models * fix the training error * unfinished * refactor pipeline * Merge branch 'feat/backbone_head' into feat/refactor_nlp_730 # Conflicts: # modelscope/metrics/builder.py # modelscope/models/nlp/__init__.py # modelscope/models/nlp/backbones/structbert/modeling_sbert.py # modelscope/models/nlp/palm_v2/palm_for_text_generation.py # modelscope/preprocessors/base.py # modelscope/preprocessors/nlp.py # modelscope/trainers/trainer.py * Merge commit 'ab04ceafc5453ce7daa9aa09e37a55f703072a10' into feat/refactor_nlp_730 # Conflicts: # modelscope/metainfo.py # modelscope/metrics/builder.py # modelscope/models/__init__.py # modelscope/models/base/base_torch_model.py # modelscope/models/nlp/__init__.py # modelscope/models/nlp/backbones/space/model/intent_unified_transformer.py # modelscope/models/nlp/backbones/space/model/model_base.py # modelscope/models/nlp/palm_v2/palm_for_text_generation.py # modelscope/models/nlp/sbert_for_sequence_classification.py 
# modelscope/models/nlp/sequence_classification.py # modelscope/models/nlp/space/__init__.py # modelscope/models/nlp/space_for_dialog_intent_prediction.py # modelscope/models/nlp/space_for_dialog_modeling.py # modelscope/models/nlp/space_for_dialog_state_tracking.py # modelscope/models/nlp/task_model.py # modelscope/pipelines/nlp/sentiment_classification_pipeline.py # modelscope/preprocessors/base.py # modelscope/preprocessors/nlp.py # modelscope/trainers/trainer.py * revert changes * unify sentnece classification postprocess * revert some changes, move some model files * pipeline first case run through * ws pipeline passed * Merge branch 'feat/refactor_nlp_730' into feat/finetune_sbert_veco * finetune * revert code * revert some code * ws finetune started, only the accuracy is weird * Merge branch 'feat/veco_taskdataset' into feat/finetune_sbert_veco # Conflicts: # modelscope/task_datasets/veco_dataset.py # tests/taskdataset/test_veco_dataset.py * veco+nli finetune started * Merge branch 'master' into feat/finetune_sbert_veco # Conflicts: # modelscope/models/nlp/sbert_for_sequence_classification.py # modelscope/models/nlp/sbert_for_token_classification.py # modelscope/models/nlp/sbert_for_zero_shot_classification.py # modelscope/models/nlp/space/space_for_dialog_intent_prediction.py # modelscope/models/nlp/space/space_for_dialog_modeling.py # modelscope/trainers/trainer.py * add trainer for nlp * trainer: dataset params passed into preprocessor * test passed by nlptrainer * fix some bugs * fix some bugs * add backbone/head subclass * fix regression bugs * fix bug in token-cls finetune * support cfg modification * fix bug * fix bug * update requirements * add some comments and fix some t * add some comments and revert a argument * split to two test files * revert code * fixbug in precessor (cherry picked from commit 7a648d096ef8500c694d3255dabe29e6f4bfc3e5) * fix ut bug * support sbert models * unfinished * Merge branch 'feat/finetune_sbert_veco' into 
sly_tmp_veco_finetune # Conflicts: # tests/trainers/test_finetune_sequence_classification.py * fixbug in veco * fix bug * fixbug * correct running params * remove useless files * add palm finetuning with cnn_dailymail dataset * copy space model from sofa * Merge branch 'feat/finetune_sbert_veco' of gitlab.alibaba-inc.com:Ali-MaaS/MaaS-lib into feat/finetune_sbert_veco * Merge branch 'master' into feat/finetune_sbert_veco # Conflicts: # modelscope/metrics/__init__.py # modelscope/models/__init__.py # modelscope/models/nlp/__init__.py # modelscope/models/nlp/backbones/__init__.py # modelscope/models/nlp/backbones/structbert/modeling_sbert.py # modelscope/models/nlp/heads/__init__.py # modelscope/models/nlp/masked_language.py # modelscope/models/nlp/palm_v2/palm_for_text_generation.py # modelscope/models/nlp/sbert_for_nli.py # modelscope/models/nlp/sbert_for_sentence_similarity.py # modelscope/models/nlp/sbert_for_sentiment_classification.py # modelscope/models/nlp/sbert_for_sequence_classification.py # modelscope/models/nlp/sbert_for_token_classification.py # modelscope/models/nlp/sbert_for_zero_shot_classification.py # modelscope/models/nlp/sequence_classification.py # modelscope/models/nlp/space/space_for_dialog_intent_prediction.py # modelscope/models/nlp/space/space_for_dialog_modeling.py # modelscope/models/nlp/space/space_for_dialog_state_tracking.py # modelscope/models/nlp/structbert/adv_utils.py # modelscope/models/nlp/structbert/configuration_sbert.py # modelscope/models/nlp/task_models/task_model.py # modelscope/pipelines/__init__.py # modelscope/pipelines/nlp/__init__.py # modelscope/pipelines/nlp/fill_mask_pipeline.py # modelscope/pipelines/nlp/named_entity_recognition_pipeline.py # modelscope/pipelines/nlp/nli_pipeline.py # modelscope/pipelines/nlp/sentence_similarity_pipeline.py # modelscope/pipelines/nlp/sentiment_classification_pipeline.py # modelscope/pipelines/nlp/text_generation_pipeline.py # modelscope/pipelines/nlp/word_segmentation_pipeline.py # 
modelscope/pipelines/nlp/zero_shot_classification_pipeline.py # modelscope/preprocessors/nlp.py # modelscope/task_datasets/__init__.py # modelscope/trainers/trainer.py # modelscope/trainers/utils/inference.py # modelscope/utils/file_utils.py # requirements/nlp.txt # tests/pipelines/test_nli.py # tests/pipelines/test_sentence_similarity.py # tests/pipelines/test_sentiment_classification.py * fix imports * mark backbone in their own modeling * pre-commit check passed * pre-commit passed, remove roberta model * fix a bug in ast import * skip all finetune uts * fix bugs * pre-commit passed * bug fixed * bug fixed * bug fixed * bug fixed * fix ut bug * fix bug * fix ut bug * fix bug * fix bug * fixbugs * fixbug * revert veco * revert veco because of core dump * fix palm bug * revert veco * revert mistaken code * add a test print * pre-commit check * test exception * add test code * for test * fix bug and test * remove test code * remove useless file * 1. fix some bugs 2. add backbone ut * Merge branch 'master' into feat/finetune_refactor_730 # Conflicts: # modelscope/metainfo.py # modelscope/metrics/sequence_classification_metric.py # modelscope/models/nlp/__init__.py # modelscope/models/nlp/task_models/task_model.py # modelscope/preprocessors/__init__.py # modelscope/preprocessors/nlp.py # modelscope/trainers/trainer.py # modelscope/trainers/utils/inference.py # modelscope/utils/file_utils.py # tests/trainers/test_trainer_with_nlp.py * pre-commit passed * revert files * increase test level * unregister models * fix bugs * fix cr comments * fix bug in backbone-head * add sbert backbone * fix bug * add test for token-cls-metric * pre-commit passed * fix ut comments * revert normal tokenizer to fast tokenizer * Merge branch 'master' into feat/finetune_refactor_730 # Conflicts: # modelscope/models/nlp/__init__.py # modelscope/models/nlp/backbones/__init__.py # modelscope/models/nlp/backbones/structbert/__init__.py # modelscope/models/nlp/masked_language.py # 
modelscope/models/nlp/palm_v2/palm_for_text_generation.py # modelscope/models/nlp/sbert_for_sequence_classification.py # modelscope/models/nlp/sbert_for_token_classification.py # modelscope/models/nlp/sbert_for_zero_shot_classification.py # modelscope/pipelines/nlp/text_generation_pipeline.py # modelscope/preprocessors/nlp.py # modelscope/trainers/trainer.py # modelscope/trainers/utils/inference.py * fix merge bugs * pre commit passed * fix bug * fix bug * fix bug * fix bug from master * add print * fix ut bug * fix bug * Merge branch 'master' into feat/finetune_refactor_730 * skip task model test
3 years ago
[to #42322933] NLP 1030 Refactor Features: 1. Refactor the directory structure of nlp models. All model files are placed into either the model folder or the task_model folder 2. Refactor all the comments to google style 3. Add detail comments to important tasks and nlp models, to list the description of the model, and its preprocessor&trainer 4. Model Exporting now supports a direct all to TorchModelExporter(no need to derive from it) 5. Refactor model save_pretrained method to support direct running(independent from trainer) 6. Remove the judgement of Model in the pipeline base class, to support outer register models running in our pipelines 7. Nlp trainer now has a NLPTrainingArguments class , user can pass arguments into the dataclass, and use it as a normal cfg_modify_fn, to simplify the operation of modify cfg. 8. Merge the BACKBONES and the MODELS, so user can get a backbone with the Model.from_pretrained call 9. Model.from_pretrained now support a task argument, so user can use a backbone and load it with a specific task class. 10. Support Preprocessor.from_pretrained method 11. Add standard return classes to important nlp tasks, so some of the pipelines and the models are independent now, the return values of the models will always be tensors, and the pipelines will take care of the conversion to numpy and the following stuffs. 12. Split the file of the nlp preprocessors, to make the dir structure more clear. Bugs Fixing: 1. Fix a bug that lr_scheduler can be called earlier than the optimizer's step 2. Fix a bug that the direct call of Pipelines (not from pipeline(xxx)) throws error 3. Fix a bug that the trainer will not call the correct TaskDataset class 4. Fix a bug that the internal loading of dataset will throws error in the trainer class Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/10490585
3 years ago
# Copyright (c) Alibaba, Inc. and its affiliates.
import os
from abc import ABC, abstractmethod
from copy import deepcopy
from typing import Any, Dict, Optional, Sequence

from modelscope.metainfo import Models, Preprocessors
from modelscope.utils.config import Config, ConfigDict
from modelscope.utils.constant import DEFAULT_MODEL_REVISION, ModeKeys, Tasks
from modelscope.utils.hub import read_config, snapshot_download
from modelscope.utils.logger import get_logger
from .builder import build_preprocessor

logger = get_logger(__name__)

PREPROCESSOR_MAP = {
    # nlp
    # bart
    (Models.bart, Tasks.text_error_correction):
    Preprocessors.text_error_correction,

    # bert
    (Models.bert, Tasks.backbone):
    Preprocessors.sen_cls_tokenizer,
    (Models.bert, Tasks.document_segmentation):
    Preprocessors.document_segmentation,
    (Models.bert, Tasks.fill_mask):
    Preprocessors.fill_mask,
    (Models.bert, Tasks.sentence_embedding):
    Preprocessors.sentence_embedding,
    (Models.bert, Tasks.text_classification):
    Preprocessors.sen_cls_tokenizer,
    (Models.bert, Tasks.nli):
    Preprocessors.sen_cls_tokenizer,
    (Models.bert, Tasks.sentiment_classification):
    Preprocessors.sen_cls_tokenizer,
    (Models.bert, Tasks.sentence_similarity):
    Preprocessors.sen_cls_tokenizer,
    (Models.bert, Tasks.zero_shot_classification):
    Preprocessors.sen_cls_tokenizer,
    (Models.bert, Tasks.text_ranking):
    Preprocessors.text_ranking,
    (Models.bert, Tasks.part_of_speech):
    Preprocessors.token_cls_tokenizer,
    (Models.bert, Tasks.token_classification):
    Preprocessors.token_cls_tokenizer,
    (Models.bert, Tasks.word_segmentation):
    Preprocessors.token_cls_tokenizer,

    # bloom
    (Models.bloom, Tasks.backbone):
    Preprocessors.text_gen_tokenizer,

    # gpt_neo
    # gpt_neo may have different preprocessors, but now only one
    (Models.gpt_neo, Tasks.backbone):
    Preprocessors.sentence_piece,

    # gpt3 has different preprocessors by different sizes of models, so they are not listed here.

    # palm_v2
    (Models.palm, Tasks.backbone):
    Preprocessors.text_gen_tokenizer,

    # T5
    (Models.T5, Tasks.backbone):
    Preprocessors.text2text_gen_preprocessor,
    (Models.T5, Tasks.text2text_generation):
    Preprocessors.text2text_gen_preprocessor,

    # deberta_v2
    (Models.deberta_v2, Tasks.backbone):
    Preprocessors.sen_cls_tokenizer,
    (Models.deberta_v2, Tasks.fill_mask):
    Preprocessors.fill_mask,

    # ponet
    (Models.ponet, Tasks.fill_mask):
    Preprocessors.fill_mask_ponet,

    # structbert
    (Models.structbert, Tasks.backbone):
    Preprocessors.sen_cls_tokenizer,
    (Models.structbert, Tasks.fill_mask):
    Preprocessors.fill_mask,
    (Models.structbert, Tasks.faq_question_answering):
    Preprocessors.faq_question_answering_preprocessor,
    (Models.structbert, Tasks.text_classification):
    Preprocessors.sen_cls_tokenizer,
    (Models.structbert, Tasks.nli):
    Preprocessors.sen_cls_tokenizer,
    (Models.structbert, Tasks.sentiment_classification):
    Preprocessors.sen_cls_tokenizer,
    (Models.structbert, Tasks.sentence_similarity):
    Preprocessors.sen_cls_tokenizer,
    (Models.structbert, Tasks.zero_shot_classification):
    Preprocessors.sen_cls_tokenizer,
    (Models.structbert, Tasks.part_of_speech):
    Preprocessors.token_cls_tokenizer,
    (Models.structbert, Tasks.token_classification):
    Preprocessors.token_cls_tokenizer,
    (Models.structbert, Tasks.word_segmentation):
    Preprocessors.token_cls_tokenizer,

    # veco
    (Models.veco, Tasks.backbone):
    Preprocessors.sen_cls_tokenizer,
    (Models.veco, Tasks.fill_mask):
    Preprocessors.fill_mask,
    (Models.veco, Tasks.text_classification):
    Preprocessors.sen_cls_tokenizer,
    (Models.veco, Tasks.nli):
    Preprocessors.sen_cls_tokenizer,
    (Models.veco, Tasks.sentiment_classification):
    Preprocessors.sen_cls_tokenizer,
    (Models.veco, Tasks.sentence_similarity):
    Preprocessors.sen_cls_tokenizer,

    # space
}
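The fallback path in `from_pretrained` below looks preprocessor types up in this map by `(model_type, task)` pair and skips building when the pair is unknown. A minimal standalone sketch of that registry pattern (the string keys and values are hypothetical stand-ins for the `Models`/`Tasks`/`Preprocessors` constants):

```python
# Hypothetical stand-in for PREPROCESSOR_MAP: (model_type, task) -> preprocessor type.
PREPROCESSOR_MAP_SKETCH = {
    ('bert', 'text-classification'): 'sen-cls-tokenizer',
    ('bert', 'token-classification'): 'token-cls-tokenizer',
    ('palm-v2', 'backbone'): 'text-gen-tokenizer',
}


def resolve_preprocessor_type(model_type, task):
    """Return the registered preprocessor type, or None when the
    (model_type, task) pair is unknown, so the caller can skip building."""
    return PREPROCESSOR_MAP_SKETCH.get((model_type, task))
```

Keying on a tuple keeps one flat registry instead of nested per-model dicts, and `dict.get` gives the `None`-on-miss behavior the real code logs a warning for.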
class Preprocessor(ABC):

    def __init__(self, mode=ModeKeys.INFERENCE, *args, **kwargs):
        self._mode = mode
        self.device = int(
            os.environ['LOCAL_RANK']) if 'LOCAL_RANK' in os.environ else None

    @abstractmethod
    def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
        pass

    @property
    def mode(self):
        return self._mode

    @mode.setter
    def mode(self, value):
        self._mode = value

    @classmethod
    def from_pretrained(cls,
                        model_name_or_path: str,
                        revision: Optional[str] = DEFAULT_MODEL_REVISION,
                        cfg_dict: Config = None,
                        preprocessor_mode=ModeKeys.INFERENCE,
                        **kwargs):
        """Instantiate a preprocessor from a local directory or a remote model repo. Note
        that when loading from remote, the model revision can be specified.

        Args:
            model_name_or_path (str): A model dir or a model id used to load the preprocessor.
            revision (str, `optional`): The revision used when `model_name_or_path` is
                a model id of the remote hub. Defaults to `master`.
            cfg_dict (Config, `optional`): An optional config. If provided, it will replace
                the config read out of `model_name_or_path`.
            preprocessor_mode (str, `optional`): The working mode of the preprocessor; can be
                `train`, `eval`, or `inference`. Defaults to `inference`.

                The preprocessor field in the config may contain two sub preprocessors:

                >>> {
                >>>     "train": {
                >>>         "type": "some-train-preprocessor"
                >>>     },
                >>>     "val": {
                >>>         "type": "some-eval-preprocessor"
                >>>     }
                >>> }

                In this scenario, the `train` preprocessor will be loaded in the `train` mode, and
                the `val` preprocessor will be loaded in the `eval` or `inference` mode. The `mode`
                field in the preprocessor class will be assigned in all the modes.

                Or just one:

                >>> {
                >>>     "type": "some-train-preprocessor"
                >>> }

                In this scenario, the sole preprocessor will be loaded in all the modes,
                and the `mode` field in the preprocessor class will be assigned.
            **kwargs:
                task (str, `optional`): The `Tasks` enumeration value to replace the task value
                    read out of the config in `model_name_or_path`.
                    This is useful when the preprocessor does not have a `type` field and the task
                    to be used is not equal to the task with which the model was saved.
                Other kwargs will be directly fed into the preprocessor to replace the default configs.

        Returns:
            The preprocessor instance.

        Examples:
            >>> from modelscope.preprocessors import Preprocessor
            >>> Preprocessor.from_pretrained('damo/nlp_debertav2_fill-mask_chinese-base')
        """
        if not os.path.exists(model_name_or_path):
            model_dir = snapshot_download(
                model_name_or_path, revision=revision)
        else:
            model_dir = model_name_or_path
        if cfg_dict is None:
            cfg = read_config(model_dir)
        else:
            cfg = cfg_dict
        task = cfg.task
        if 'task' in kwargs:
            task = kwargs.pop('task')
        field_name = Tasks.find_field_by_task(task)
        if 'field' in kwargs:
            field_name = kwargs.pop('field')
        sub_key = 'train' if preprocessor_mode == ModeKeys.TRAIN else 'val'

        if not hasattr(cfg, 'preprocessor') or len(cfg.preprocessor) == 0:
            logger.warning('No preprocessor field found in cfg.')
            preprocessor_cfg = ConfigDict()
        else:
            preprocessor_cfg = cfg.preprocessor

        if 'type' not in preprocessor_cfg:
            if sub_key in preprocessor_cfg:
                sub_cfg = getattr(preprocessor_cfg, sub_key)
            else:
                logger.warning(f'No {sub_key} key and type key found in '
                               f'preprocessor domain of configuration.json file.')
                sub_cfg = preprocessor_cfg
        else:
            sub_cfg = preprocessor_cfg

        sub_cfg.update({'model_dir': model_dir})
        sub_cfg.update(kwargs)

        if 'type' in sub_cfg:
            if isinstance(sub_cfg, Sequence):
                # TODO: for Sequence, need to adapt to `mode` and `mode_dir` args,
                # and add mode for Compose or other plans
                raise NotImplementedError('Not supported yet!')
            sub_cfg = deepcopy(sub_cfg)
            preprocessor = build_preprocessor(sub_cfg, field_name)
        else:
            logger.warning(
                f'Cannot find available config to build preprocessor at mode {preprocessor_mode}, '
                f'current config: {sub_cfg}. Trying to build by task and model information.')
            model_cfg = getattr(cfg, 'model', ConfigDict())
            model_type = model_cfg.type if hasattr(
                model_cfg, 'type') else getattr(model_cfg, 'model_type', None)
            if task is None or model_type is None:
                logger.warning(
                    f'Found task: {task}, model type: {model_type}. '
                    f'Insufficient information to build the preprocessor, skip building.')
                return None
            if (model_type, task) not in PREPROCESSOR_MAP:
                logger.warning(
                    f'No preprocessor key {(model_type, task)} found in PREPROCESSOR_MAP, '
                    f'skip building preprocessor.')
                return None
            sub_cfg = ConfigDict({
                'type': PREPROCESSOR_MAP[(model_type, task)],
                **sub_cfg
            })
            preprocessor = build_preprocessor(sub_cfg, field_name)
        preprocessor.mode = preprocessor_mode
        return preprocessor
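The mode-to-sub-config resolution above (pick the `train` sub-config in `TRAIN` mode, otherwise `val`, and fall back to the whole `preprocessor` section when it carries a `type` directly or the sub key is missing) can be sketched standalone with plain dicts; `resolve_sub_cfg` is a hypothetical helper, not part of the ModelScope API:

```python
def resolve_sub_cfg(preprocessor_cfg, mode):
    """Pick the sub-config matching the working mode, mirroring the branch
    logic in Preprocessor.from_pretrained. `preprocessor_cfg` is a plain
    dict standing in for the ConfigDict read from configuration.json."""
    sub_key = 'train' if mode == 'train' else 'val'
    if 'type' not in preprocessor_cfg and sub_key in preprocessor_cfg:
        # Two sub-preprocessors: choose the one matching the mode.
        return preprocessor_cfg[sub_key]
    # A single preprocessor (or no usable sub key): use the section as-is.
    return preprocessor_cfg
```

Note that `eval` and `inference` both map to the `val` sub key, so a config only ever needs `train` and `val` entries.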