1. Abstract the dict keys needed by the NLP metric classes into arguments of their __init__ methods
2. Add Preprocessor.save_pretrained to save preprocessor information
3. Abstract the config-saving function, so that the config is saved as-is when save_pretrained is called directly, and can be modified entry by entry during training
4. Remove SbertTokenizer and VecoTokenizer, use transformers' tokenizers instead
5. Use model/preprocessor's from_pretrained in all nlp pipeline classes.
6. Add model_kwargs and preprocessor_kwargs to all NLP pipeline classes (see the first sketch after this list)
7. Add base classes for the fill-mask and text-classification preprocessors, as a demo for later changes
8. Fix user feedback: the model was re-trained from scratch in the continue-training scenario
9. Fix user feedback: too many checkpoints saved
10. Simplify the nlp-trainer
11. Fix user feedback: Split the default trainer's __init__ method, making it easier for users to override
12. Add safe_get to the Config class (see the second sketch after this list)
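A minimal sketch of items 5 and 6 (the model id and the max_length key are illustrative, and it is assumed the pipeline factory forwards these kwargs to the pipeline class):

    from modelscope.pipelines import pipeline
    from modelscope.utils.constant import Tasks

    pipe = pipeline(
        Tasks.text_classification,
        model='damo/nlp_structbert_sentence-similarity_chinese-base',
        model_kwargs={},                           # forwarded to Model.from_pretrained
        preprocessor_kwargs={'max_length': 128})   # forwarded to the preprocessor
    print(pipe(('今天天气不错', '今天天气很好')))

And a sketch of item 12, assuming safe_get takes a dot-separated key path and an optional default:

    from modelscope.utils.config import Config

    cfg = Config(dict(train=dict(hooks=[dict(type='CheckpointHook')])))
    cfg.safe_get('train.hooks')                # -> [{'type': 'CheckpointHook'}]
    cfg.safe_get('train.optimizer.lr', 0.001)  # missing path -> default, no KeyError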
---------------------------- Another refactor from version 36 -------------------------
13. Rename all NLP transformers preprocessors from TaskNamePreprocessor to TaskNameTransformersPreprocessor (see the sketch after this list), for example:
TextClassificationPreprocessor -> TextClassificationTransformersPreprocessor
14. Add a per-task base class for every NLP task whose preprocessors have at least two sub-preprocessors
15. Add output classes for NLP models
16. Refactor the logic for token-classification
17. Fix bug: checkpoint_hook does not support pytorch_model.pt
18. Fix bug: the pipeline name does not match the task name, so inference fails after training
NOTE: This is only a stopgap; the root cause is the unclear relationship between models and pipelines
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/10723513
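A minimal sketch of item 13 under the new naming, using the Preprocessor.from_pretrained interface added above (the model id is illustrative):

    from modelscope.preprocessors import TextClassificationTransformersPreprocessor

    preprocessor = TextClassificationTransformersPreprocessor.from_pretrained(
        'damo/nlp_structbert_sentence-similarity_chinese-base')
    inputs = preprocessor(('今天天气不错', '今天天气很好'))  # ready-to-run model inputs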
* add save_pretrained to preprocessor
* save preprocessor config in hook
* refactor label-id mapping fetching logic
* test ok on sentence-similarity
* run on finetuning
* fix bug
* pre-commit passed
* fix bug
* Merge branch 'master' into feat/refactor_config
# Conflicts:
# modelscope/preprocessors/nlp/nlp_base.py
* add params to init
* 1. support max ckpt num 2. support ignoring all files except the bin file in continue training 3. add arguments to some nlp metrics
* Split trainer init impls to overridable methods
* remove some obsolete tokenizers
* unfinished
* support input params in pipeline
* fix bugs
* fix ut bug
* fix bug
* fix ut bug
* fix ut bug
* fix ut bug
* add base class for some preprocessors
* Merge commit '379867739548f394d0fa349ba07afe04adf4c8b6' into feat/refactor_config
* compatible with old code
* fix ut bug
* fix ut bugs
* fix bug
* add some comments
* fix ut bug
* add a requirement
* fix pre-commit
* Merge commit '0451b3d3cb2bebfef92ec2c227b2a3dd8d01dc6a' into feat/refactor_config
* fix bug
* Support function type in registry
* fix ut bug
* fix bug
* Merge commit '5f719e542b963f0d35457e5359df879a5eb80b82' into feat/refactor_config
# Conflicts:
# modelscope/pipelines/nlp/multilingual_word_segmentation_pipeline.py
# modelscope/pipelines/nlp/named_entity_recognition_pipeline.py
# modelscope/pipelines/nlp/word_segmentation_pipeline.py
# modelscope/utils/hub.py
* remove obsolete file
* rename init args
* rename params
* fix merge bug
* add default preprocessor config for ner-model
* move a method to a util file
* remove unused config
* Fix a bug in pbar
* bestckptsaver: change default ckpt number to 1
* 1. Add assert to max_epoch 2. split init_dist and get_device 3. change cmp func name
* Fix bug
* fix bug
* fix bug
* unfinished refactoring
* unfinished
* uw
* uw
* uw
* uw
* Merge branch 'feat/refactor_config' into feat/refactor_trainer
# Conflicts:
# modelscope/preprocessors/nlp/document_segmentation_preprocessor.py
# modelscope/preprocessors/nlp/faq_question_answering_preprocessor.py
# modelscope/preprocessors/nlp/relation_extraction_preprocessor.py
# modelscope/preprocessors/nlp/text_generation_preprocessor.py
* uw
* uw
* unify nlp task outputs
* uw
* uw
* uw
* uw
* change the order of text cls pipeline
* refactor t5
* refactor tg task preprocessor
* fix
* unfinished
* temp
* refactor code
* unfinished
* unfinished
* unfinished
* unfinished
* uw
* Merge branch 'feat/refactor_config' into feat/refactor_trainer
* smoke test pass
* ut testing
* pre-commit passed
* Merge branch 'master' into feat/refactor_config
# Conflicts:
# modelscope/models/nlp/bert/document_segmentation.py
# modelscope/pipelines/nlp/__init__.py
# modelscope/pipelines/nlp/document_segmentation_pipeline.py
* merge master
* unfinished
* Merge branch 'feat/fix_bug_pipeline_name' into feat/refactor_config
* fix bug
* fix ut bug
* support ner batch inference
* fix ut bug
* fix bug
* support batch inference on three nlp tasks
* unfinished
* fix bug
* fix bug
* Merge branch 'master' into feat/refactor_config
# Conflicts:
# modelscope/models/base/base_model.py
# modelscope/pipelines/nlp/conversational_text_to_sql_pipeline.py
# modelscope/pipelines/nlp/dialog_intent_prediction_pipeline.py
# modelscope/pipelines/nlp/dialog_modeling_pipeline.py
# modelscope/pipelines/nlp/dialog_state_tracking_pipeline.py
# modelscope/pipelines/nlp/document_segmentation_pipeline.py
# modelscope/pipelines/nlp/faq_question_answering_pipeline.py
# modelscope/pipelines/nlp/feature_extraction_pipeline.py
# modelscope/pipelines/nlp/fill_mask_pipeline.py
# modelscope/pipelines/nlp/information_extraction_pipeline.py
# modelscope/pipelines/nlp/named_entity_recognition_pipeline.py
# modelscope/pipelines/nlp/sentence_embedding_pipeline.py
# modelscope/pipelines/nlp/summarization_pipeline.py
# modelscope/pipelines/nlp/table_question_answering_pipeline.py
# modelscope/pipelines/nlp/text2text_generation_pipeline.py
# modelscope/pipelines/nlp/text_classification_pipeline.py
# modelscope/pipelines/nlp/text_error_correction_pipeline.py
# modelscope/pipelines/nlp/text_generation_pipeline.py
# modelscope/pipelines/nlp/text_ranking_pipeline.py
# modelscope/pipelines/nlp/token_classification_pipeline.py
# modelscope/pipelines/nlp/word_segmentation_pipeline.py
# modelscope/pipelines/nlp/zero_shot_classification_pipeline.py
# modelscope/trainers/nlp_trainer.py
* pre-commit passed
* fix bug
* Merge branch 'master' into feat/refactor_config
# Conflicts:
# modelscope/preprocessors/__init__.py
* fix bug
* fix bug
* fix bug
* fix bug
* fix bug
* fix bug
* pre-commit passed
* fix bug
* fix bug
* fix bug
* fix bug
* fix bug
* fix bug
* self review done
* fix bug
* fix bug
* fix bug
* fix bugs
* remove sub-token offset mapping
* fix name bug
* add some tests
* 1. support batch inference of text-generation, text2text-generation, token-classification and text-classification 2. add corresponding UTs
* add old logic back
* tmp save
* add tokenize by words logic back
* move outputs file back
* revert veco token-classification back
* fix typo
* Fix description
* Merge commit '4dd99b8f6e4e7aefe047c68a1bedd95d3ec596d6' into feat/refactor_config
* Merge branch 'master' into feat/refactor_config
# Conflicts:
# modelscope/pipelines/builder.py
A Git LFS pointer (binary file) update:

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:7d98ac11a4e9e2744a7402a5cc912da991a41938bbc5dd60f15ee5c6b3196030
-size 63349
+oid sha256:3b38bfb5a851d35d5fba4d59eda926557666dbd62c70e3e3b24c22605e7d9c4a
+size 40771
@@ -7,7 +7,8 @@ from torch.utils.data.dataloader import default_collate
 from modelscope.exporters.builder import EXPORTERS
 from modelscope.exporters.torch_model_exporter import TorchModelExporter
 from modelscope.metainfo import Models
-from modelscope.preprocessors import Preprocessor, build_preprocessor
+from modelscope.preprocessors import (
+    TextClassificationTransformersPreprocessor, build_preprocessor)
 from modelscope.utils.config import Config
 from modelscope.utils.constant import ModeKeys, Tasks
@@ -59,12 +60,13 @@ class SbertForSequenceClassificationExporter(TorchModelExporter):
             'mode': ModeKeys.TRAIN,
             **sequence_length
         })
-        preprocessor: Preprocessor = build_preprocessor(cfg, field_name)
+        preprocessor: TextClassificationTransformersPreprocessor = build_preprocessor(
+            cfg, field_name)
         if pair:
-            first_sequence = preprocessor.tokenizer.unk_token
-            second_sequence = preprocessor.tokenizer.unk_token
+            first_sequence = preprocessor.nlp_tokenizer.tokenizer.unk_token
+            second_sequence = preprocessor.nlp_tokenizer.tokenizer.unk_token
         else:
-            first_sequence = preprocessor.tokenizer.unk_token
+            first_sequence = preprocessor.nlp_tokenizer.tokenizer.unk_token
             second_sequence = None
         batched = []
@@ -19,18 +19,27 @@ from .builder import METRICS, MetricKeys
 class SequenceClassificationMetric(Metric):
     """The metric computation class for sequence classification tasks.

-    This metric class calculates accuracy of the whole input batches.
+    This metric class calculates accuracy/F1 of all the input batches.
+
+    Args:
+        label_name: The key of the label column in the 'inputs' arg.
+        logit_name: The key of the logits column in the 'outputs' arg.
     """

-    def __init__(self, *args, **kwargs):
+    def __init__(self,
+                 label_name=OutputKeys.LABELS,
+                 logit_name=OutputKeys.LOGITS,
+                 *args,
+                 **kwargs):
         super().__init__(*args, **kwargs)
         self.preds = []
         self.labels = []
+        self.label_name = label_name
+        self.logit_name = logit_name

     def add(self, outputs: Dict, inputs: Dict):
-        label_name = OutputKeys.LABEL if OutputKeys.LABEL in inputs else OutputKeys.LABELS
-        ground_truths = inputs[label_name]
-        eval_results = outputs[OutputKeys.LOGITS]
+        ground_truths = inputs[self.label_name]
+        eval_results = outputs[self.logit_name]
         self.preds.append(
             torch_nested_numpify(torch_nested_detach(eval_results)))
         self.labels.append(
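A quick usage sketch of the new metric arguments; the module path and the exact keys returned by evaluate() are assumptions based on the code above:

    import torch
    from modelscope.metrics.sequence_classification_metric import \
        SequenceClassificationMetric

    metric = SequenceClassificationMetric(label_name='labels', logit_name='logits')
    metric.add(
        outputs={'logits': torch.tensor([[2.0, 0.1], [0.2, 1.5]])},
        inputs={'labels': torch.tensor([0, 1])})
    print(metric.evaluate())  # accuracy/F1 over everything added so far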
@@ -18,16 +18,22 @@ class TextGenerationMetric(Metric):
 class TextGenerationMetric(Metric):
     """The metric computation class for text generation classes.

     This metric class calculates F1 of the rouge scores for the whole evaluation dataset.
+
+    Args:
+        target_text: The key of the target text column in the `inputs` arg.
+        pred_text: The key of the predicted text column in the `outputs` arg.
     """

-    def __init__(self):
+    def __init__(self, target_text='tgts', pred_text='preds'):
         self.preds: List[str] = []
         self.tgts: List[str] = []
         self.rouge = Rouge()
+        self.target_text = target_text
+        self.pred_text = pred_text

     def add(self, outputs: Dict[str, List[str]], inputs: Dict[str, List[str]]):
-        ground_truths = inputs['tgts']
-        eval_results = outputs['preds']
+        ground_truths = inputs[self.target_text]
+        eval_results = outputs[self.pred_text]
         for truth in ground_truths:
             self.tgts.append(rebuild_chinese_str(truth))
         for result in eval_results:
@@ -21,20 +21,16 @@ class TokenClassificationMetric(Metric):
     This metric class uses seqeval to calculate the scores.

     Args:
-        return_entity_level_metrics (bool, *optional*):
+        label_name (str, `optional`): The key of the label column in the 'inputs' arg.
+        logit_name (str, `optional`): The key of the logits column in the 'outputs' arg.
+        return_entity_level_metrics (bool, `optional`):
             Whether to return every label's detail metrics, default False.
+        label2id (dict, `optional`): The label2id information to get the token labels.
     """

-    def add(self, outputs: Dict, inputs: Dict):
-        label_name = OutputKeys.LABEL if OutputKeys.LABEL in inputs else OutputKeys.LABELS
-        ground_truths = inputs[label_name]
-        eval_results = outputs[OutputKeys.LOGITS]
-        self.preds.append(
-            torch_nested_numpify(torch_nested_detach(eval_results)))
-        self.labels.append(
-            torch_nested_numpify(torch_nested_detach(ground_truths)))
-
     def __init__(self,
+                 label_name=OutputKeys.LABELS,
+                 logit_name=OutputKeys.LOGITS,
                  return_entity_level_metrics=False,
                  label2id=None,
                  *args,
@@ -44,6 +40,16 @@ class TokenClassificationMetric(Metric):
         self.preds = []
         self.labels = []
         self.label2id = label2id
+        self.label_name = label_name
+        self.logit_name = logit_name
+
+    def add(self, outputs: Dict, inputs: Dict):
+        ground_truths = inputs[self.label_name]
+        eval_results = outputs[self.logit_name]
+        self.preds.append(
+            torch_nested_numpify(torch_nested_detach(eval_results)))
+        self.labels.append(
+            torch_nested_numpify(torch_nested_detach(ground_truths)))

     def evaluate(self):
         label2id = self.label2id
@@ -6,7 +6,8 @@ from typing import Any, Callable, Dict, List, Optional, Union
 from modelscope.hub.snapshot_download import snapshot_download
 from modelscope.models.builder import build_model
-from modelscope.utils.checkpoint import save_checkpoint, save_pretrained
+from modelscope.utils.checkpoint import (save_checkpoint, save_configuration,
+                                         save_pretrained)
 from modelscope.utils.config import Config
 from modelscope.utils.constant import DEFAULT_MODEL_REVISION, Invoke, ModelFile
 from modelscope.utils.device import verify_device
@@ -129,11 +130,9 @@ class Model(ABC):
                     model_cfg[k] = v
             if device is not None:
                 model_cfg.device = device
-            model = build_model(
-                model_cfg, task_name=task_name, default_args=kwargs)
+            model = build_model(model_cfg, task_name=task_name)
         else:
-            model = build_model(
-                model_cfg, task_name=task_name, default_args=kwargs)
+            model = build_model(model_cfg, task_name=task_name)

         # dynamically add pipeline info to model for pipeline inference
         if hasattr(cfg, 'pipeline'):
@@ -142,6 +141,7 @@ class Model(ABC):
         if not hasattr(model, 'cfg'):
             model.cfg = cfg

+        model_cfg.pop('model_dir', None)
         model.name = model_name_or_path
         model.model_dir = local_model_dir
         return model
@@ -151,6 +151,7 @@ class Model(ABC):
                         save_checkpoint_names: Union[str, List[str]] = None,
                         save_function: Callable = save_checkpoint,
                         config: Optional[dict] = None,
+                        save_config_function: Callable = save_configuration,
                         **kwargs):
         """save the pretrained model, its configuration and other related files to a directory,
            so that it can be re-loaded
@@ -168,18 +169,15 @@ class Model(ABC):
             config (Optional[dict], optional):
                 The config for the configuration.json, might not be identical with model.config
+            save_config_function (Callable, optional):
+                The function to use to save the configuration.
         """
         if config is None and hasattr(self, 'cfg'):
             config = self.cfg
-        assert config is not None, 'Cannot save the model because the model config is empty.'
-        if isinstance(config, Config):
-            config = config.to_dict()
-        if 'preprocessor' in config and config['preprocessor'] is not None:
-            if 'mode' in config['preprocessor']:
-                config['preprocessor']['mode'] = 'inference'
-            elif 'val' in config['preprocessor'] and 'mode' in config[
-                    'preprocessor']['val']:
-                config['preprocessor']['val']['mode'] = 'inference'
+        if config is not None:
+            save_config_function(target_folder, config)
         save_pretrained(self, target_folder, save_checkpoint_names,
-                        save_function, config, **kwargs)
+                        save_function, **kwargs)
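A hedged sketch of the new hook point: a custom configuration writer can be swapped in while the default checkpoint saving stays untouched. The function name and the mode tweak are illustrative, and model is assumed to be an already-built Model instance; only the save_config_function(target_folder, config) calling convention comes from the code above:

    from modelscope.utils.checkpoint import save_configuration

    def save_inference_config(target_folder, config):
        # illustrative: force inference mode before writing configuration.json
        if isinstance(config, dict) and config.get('preprocessor'):
            config['preprocessor']['mode'] = 'inference'
        save_configuration(target_folder, config)

    model.save_pretrained('./output', save_config_function=save_inference_config)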
@@ -6,6 +6,7 @@ import torch
 from torch import nn

 from modelscope.utils.file_utils import func_receive_dict_inputs
+from modelscope.utils.hub import parse_label_mapping
 from modelscope.utils.logger import get_logger
 from .base_model import Model
@@ -36,9 +36,7 @@ from transformers.utils.model_parallel_utils import (assert_device_map,
 from modelscope.metainfo import Models
 from modelscope.models.base import Model, Tensor, TorchModel
 from modelscope.models.builder import MODELS
-from modelscope.outputs import (BaseModelOutput,
-                                BaseModelOutputWithPastAndCrossAttentions,
-                                Seq2SeqModelOutput)
+from modelscope.outputs import AttentionBackboneModelOutput, Seq2SeqModelOutput
 from modelscope.utils.constant import Tasks
 from modelscope.utils.logger import get_logger
 from .configuration import T5Config
@@ -1182,7 +1180,7 @@ class T5Stack(T5PreTrainedModel):
                 all_attentions,
                 all_cross_attentions,
             ] if v is not None)
-        return BaseModelOutputWithPastAndCrossAttentions(
+        return AttentionBackboneModelOutput(
             last_hidden_state=hidden_states,
             past_key_values=present_key_value_states,
             hidden_states=all_hidden_states,
@@ -1475,8 +1473,9 @@ class T5Model(T5PreTrainedModel):
                 output_hidden_states=output_hidden_states,
                 return_dict=return_dict,
             )
-        elif return_dict and not isinstance(encoder_outputs, BaseModelOutput):
-            encoder_outputs = BaseModelOutput(
+        elif return_dict and not isinstance(encoder_outputs,
+                                            AttentionBackboneModelOutput):
+            encoder_outputs = AttentionBackboneModelOutput(
                 last_hidden_state=encoder_outputs[0],
                 hidden_states=encoder_outputs[1]
                 if len(encoder_outputs) > 1 else None,
@@ -24,7 +24,8 @@ from transformers.utils.model_parallel_utils import (assert_device_map,
 from modelscope.metainfo import Models
 from modelscope.models.builder import MODELS
-from modelscope.outputs import BaseModelOutput, Seq2SeqLMOutput
+from modelscope.outputs import (AttentionBackboneModelOutput, Seq2SeqLMOutput,
+                                TokenGeneratorOutput)
 from modelscope.utils.constant import Tasks
 from modelscope.utils.logger import get_logger
 from .backbone import T5PreTrainedModel, T5Stack
@@ -311,8 +312,9 @@ class T5ForConditionalGeneration(T5PreTrainedModel):
                 output_hidden_states=output_hidden_states,
                 return_dict=return_dict,
             )
-        elif return_dict and not isinstance(encoder_outputs, BaseModelOutput):
-            encoder_outputs = BaseModelOutput(
+        elif return_dict and not isinstance(encoder_outputs,
+                                            AttentionBackboneModelOutput):
+            encoder_outputs = AttentionBackboneModelOutput(
                 last_hidden_state=encoder_outputs[0],
                 hidden_states=encoder_outputs[1]
                 if len(encoder_outputs) > 1 else None,
@@ -426,6 +428,16 @@ class T5ForConditionalGeneration(T5PreTrainedModel):
     def prepare_decoder_input_ids_from_labels(self, labels: torch.Tensor):
         return self._shift_right(labels)

+    def generate(
+            self,
+            *args,
+            **kwargs,
+    ):
+        output = super().generate(*args, **kwargs)
+        return TokenGeneratorOutput(
+            sequences=output if isinstance(output, torch.Tensor) else output[0]
+        )
+
     def _reorder_cache(self, past, beam_idx):
         # if decoder past is not included in output
         # speedy decoding is disabled and no need to reorder
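With this wrapper, callers read generated ids from one field regardless of whether the underlying transformers generate() returned a bare tensor or a structured output; a minimal sketch, assuming inputs came from a preprocessor:

    output = model.generate(**inputs, max_length=64)
    token_ids = output.sequences  # always the generated id tensor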
@@ -30,9 +30,7 @@ if TYPE_CHECKING:
         SbertForMaskedLM,
         SbertForSequenceClassification,
         SbertForTokenClassification,
-        SbertTokenizer,
         SbertModel,
-        SbertTokenizerFast,
     )
     from .T5 import T5ForConditionalGeneration
     from .mglm import MGLMForTextSummarization
@@ -51,8 +49,7 @@ if TYPE_CHECKING:
     )
     from .veco import (VecoConfig, VecoForMaskedLM,
                        VecoForSequenceClassification,
-                       VecoForTokenClassification, VecoModel, VecoTokenizer,
-                       VecoTokenizerFast)
+                       VecoForTokenClassification, VecoModel)
     from .bloom import BloomModel
 else:
     _import_structure = {
@@ -66,8 +63,6 @@ else:
             'SbertForMaskedLM',
             'SbertForSequenceClassification',
             'SbertForTokenClassification',
-            'SbertTokenizer',
-            'SbertTokenizerFast',
             'SbertModel',
         ],
         'veco': [
@@ -76,8 +71,6 @@ else:
             'VecoForSequenceClassification',
             'VecoForTokenClassification',
             'VecoModel',
-            'VecoTokenizer',
-            'VecoTokenizerFast',
         ],
         'bert': [
             'BertForMaskedLM',
@@ -7,6 +7,7 @@ import torch.cuda
 from modelscope.metainfo import Models
 from modelscope.models.base import TorchModel
 from modelscope.models.builder import MODELS
+from modelscope.outputs import TextErrorCorrectionOutput
 from modelscope.utils.constant import ModelFile, Tasks

 __all__ = ['BartForTextErrorCorrection']
@@ -55,7 +56,7 @@ class BartForTextErrorCorrection(TorchModel):
         self.task = task

-    def forward(self, input: Dict[str, Dict]) -> Dict[str, Any]:
+    def forward(self, input: Dict[str, Dict]) -> TextErrorCorrectionOutput:
         """return the result by the model

         Args:
@@ -91,4 +92,4 @@ class BartForTextErrorCorrection(TorchModel):
         # get 1-best List[Tensor]
         preds = translations[0][0]['tokens']
-        return {'predictions': preds}
+        return TextErrorCorrectionOutput(predictions=preds)
@@ -16,9 +16,6 @@
 """PyTorch BERT model. """

 import math
-import os
-from dataclasses import dataclass
-from typing import Optional, Tuple

 import torch
 import torch.utils.checkpoint
@@ -33,11 +30,10 @@ from transformers.modeling_utils import (PreTrainedModel,
 from modelscope.metainfo import Models
 from modelscope.models import Model, TorchModel
 from modelscope.models.builder import MODELS
-from modelscope.outputs import (BaseModelOutputWithPastAndCrossAttentions,
-                                BaseModelOutputWithPoolingAndCrossAttentions)
+from modelscope.outputs import AttentionBackboneModelOutput
 from modelscope.utils.constant import Tasks
-from modelscope.utils.hub import parse_label_mapping
 from modelscope.utils.logger import get_logger
+from modelscope.utils.nlp.utils import parse_labels_in_order
 from .configuration import BertConfig

 logger = get_logger(__name__)
@@ -562,7 +558,7 @@ class BertEncoder(nn.Module):
                 all_self_attentions,
                 all_cross_attentions,
             ] if v is not None)
-        return BaseModelOutputWithPastAndCrossAttentions(
+        return AttentionBackboneModelOutput(
             last_hidden_state=hidden_states,
             past_key_values=next_decoder_cache,
             hidden_states=all_hidden_states,
@@ -639,30 +635,15 @@ class BertPreTrainedModel(TorchModel, PreTrainedModel):
             The loaded model, which is initialized by transformers.PreTrainedModel.from_pretrained
         """
-        model_dir = kwargs.get('model_dir', None)
+        model_dir = kwargs.pop('model_dir', None)
+        cfg = kwargs.pop('cfg', None)
+        model_args = parse_labels_in_order(model_dir, cfg, **kwargs)
         if model_dir is None:
-            config = BertConfig(**kwargs)
+            config = BertConfig(**model_args)
             model = cls(config)
         else:
-            model_kwargs = {}
-            label2id = kwargs.get('label2id', parse_label_mapping(model_dir))
-            id2label = kwargs.get(
-                'id2label', None if label2id is None else
-                {id: label
-                 for label, id in label2id.items()})
-            if id2label is not None and label2id is None:
-                label2id = {label: id for id, label in id2label.items()}
-            num_labels = kwargs.get(
-                'num_labels', None if label2id is None else len(label2id))
-            if num_labels is not None:
-                model_kwargs['num_labels'] = num_labels
-            if label2id is not None:
-                model_kwargs['label2id'] = label2id
-            if id2label is not None:
-                model_kwargs['id2label'] = id2label
             model = super(Model, cls).from_pretrained(
-                pretrained_model_name_or_path=model_dir, **model_kwargs)
+                pretrained_model_name_or_path=model_dir, **model_args)
         model.model_dir = model_dir
         return model
@@ -750,7 +731,7 @@ class BertModel(BertPreTrainedModel):
                 output_attentions=None,
                 output_hidden_states=None,
                 return_dict=None,
-                **kwargs):
+                **kwargs) -> AttentionBackboneModelOutput:
         r"""
         Args:
             input_ids (`torch.LongTensor` of shape `((batch_size, sequence_length)`):
@@ -936,7 +917,7 @@ class BertModel(BertPreTrainedModel):
         if not return_dict:
             return (sequence_output, pooled_output) + encoder_outputs[1:]

-        return BaseModelOutputWithPoolingAndCrossAttentions(
+        return AttentionBackboneModelOutput(
             last_hidden_state=sequence_output,
             pooler_output=pooled_output,
             past_key_values=encoder_outputs.past_key_values,
@@ -5,37 +5,22 @@ from typing import Any, Dict
 import torch
 from torch import nn
 from torch.nn import CrossEntropyLoss
-from transformers.modeling_outputs import TokenClassifierOutput
-from transformers.models.bert.modeling_bert import (BertModel,
-                                                    BertPreTrainedModel)

 from modelscope.metainfo import Models
-from modelscope.models.base import Model
+from modelscope.models import Model
 from modelscope.models.builder import MODELS
+from modelscope.models.nlp.ponet import PoNetConfig
+from modelscope.outputs import AttentionTokenClassificationModelOutput
 from modelscope.utils.constant import Tasks
+from .backbone import BertModel, BertPreTrainedModel
+from .configuration import BertConfig

 __all__ = ['BertForDocumentSegmentation']


 @MODELS.register_module(
     Tasks.document_segmentation, module_name=Models.bert_for_ds)
-class BertForDocumentSegmentation(Model):
-
-    def __init__(self, model_dir: str, model_config: Dict[str, Any], *args,
-                 **kwargs):
-        super().__init__(model_dir, model_config, *args, **kwargs)
-        self.model_cfg = model_config
-
-    def build_with_config(self, config):
-        self.bert_model = BertForDocumentSegmentationBase.from_pretrained(
-            self.model_dir, from_tf=False, config=config)
-        return self.bert_model
-
-    def forward(self) -> Dict[str, Any]:
-        return self.model_cfg
-
-
-class BertForDocumentSegmentationBase(BertPreTrainedModel):
+class BertForDocumentSegmentation(BertPreTrainedModel):
     _keys_to_ignore_on_load_unexpected = [r'pooler']
@@ -103,9 +88,25 @@ class BertForDocumentSegmentationBase(BertPreTrainedModel):
             output = (logits, ) + outputs[2:]
             return ((loss, ) + output) if loss is not None else output

-        return TokenClassifierOutput(
+        return AttentionTokenClassificationModelOutput(
             loss=loss,
             logits=logits,
             hidden_states=outputs.hidden_states,
             attentions=outputs.attentions,
         )
+
+    @classmethod
+    def _instantiate(cls, model_dir, model_config: Dict[str, Any], **kwargs):
+        if model_config['type'] == 'bert':
+            config = BertConfig.from_pretrained(model_dir, num_labels=2)
+        elif model_config['type'] == 'ponet':
+            config = PoNetConfig.from_pretrained(model_dir, num_labels=2)
+        else:
+            raise ValueError(
+                f'Expected config type bert and ponet, which is : {model_config["type"]}'
+            )
+        model = super(Model, cls).from_pretrained(
+            model_dir, from_tf=False, config=config)
+        model.model_dir = model_dir
+        model.model_cfg = model_config
+        return model
@@ -121,7 +121,7 @@ class BertForMaskedLM(BertPreTrainedModel):
     Preprocessor:
         This is the fill_mask model of Structbert, the preprocessor of this model
-        is `modelscope.preprocessors.NLPPreprocessor`.
+        is `modelscope.preprocessors.FillMaskTransformersPreprocessor`.

     Parameters:
         config (:class:`~modelscope.models.nlp.structbert.SbertConfig`): Model configuration class with
@@ -51,7 +51,7 @@ class BertForSequenceClassification(BertPreTrainedModel):
     Preprocessor:
         This is the fill_mask model of Bert, the preprocessor of this model
-        is `modelscope.preprocessors.SequenceClassificationPreprocessor`.
+        is `modelscope.preprocessors.TextClassificationTransformersPreprocessor`.

     Trainer:
         This model is a normal PyTorch model, and can be trained by variable trainers, like EpochBasedTrainer,
@@ -22,7 +22,7 @@ from torch.nn import CrossEntropyLoss
 from modelscope.metainfo import Models
 from modelscope.models.builder import MODELS
-from modelscope.outputs import TokenClassifierOutput
+from modelscope.outputs import AttentionTokenClassificationModelOutput
 from modelscope.utils import logger as logging
 from modelscope.utils.constant import Tasks
 from .backbone import BertModel, BertPreTrainedModel
@@ -47,7 +47,7 @@ class BertForTokenClassification(BertPreTrainedModel):
     Preprocessor:
         This is the fill_mask model of Bert, the preprocessor of this model
-        is `modelscope.preprocessors.SequenceClassificationPreprocessor`.
+        is `modelscope.preprocessors.TokenClassificationTransformersPreprocessor`.

     Trainer:
         This model is a normal PyTorch model, and can be trained by variable trainers, like EpochBasedTrainer,
@@ -169,7 +169,7 @@ class BertForTokenClassification(BertPreTrainedModel):
             - 0 for tokens that are **masked**.

     Returns:
-        Returns `modelscope.outputs.TokenClassifierOutput`
+        Returns `modelscope.outputs.AttentionTokenClassificationModelOutput`

     Examples:
         >>> from modelscope.models import Model
@@ -212,14 +212,25 @@ class BertForTokenClassification(BertPreTrainedModel):
                 loss = loss_fct(
                     logits.view(-1, self.num_labels), labels.view(-1))

+        if label_mask is not None:
+            mask = label_mask
+            masked_lengths = mask.sum(-1).long()
+            masked_logits = torch.zeros_like(logits)
+            for i in range(len(mask)):
+                masked_logits[
+                    i, :masked_lengths[i], :] = logits[i].masked_select(
+                        mask[i].unsqueeze(-1)).view(masked_lengths[i], -1)
+            logits = masked_logits
+
         if not return_dict:
             output = (logits, ) + outputs[2:]
             return ((loss, ) + output) if loss is not None else output

-        return TokenClassifierOutput(
+        return AttentionTokenClassificationModelOutput(
             loss=loss,
             logits=logits,
             hidden_states=outputs.hidden_states,
             attentions=outputs.attentions,
             offset_mapping=offset_mapping,
+            label_mask=label_mask,
         )
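Illustrative only, using plain torch: how the label_mask packing above behaves on a batch of one sequence with 5 sub-tokens where positions 0, 2 and 3 start words:

    import torch

    logits = torch.arange(10.0).view(1, 5, 2)  # (batch, seq_len, num_labels)
    label_mask = torch.tensor([[1, 0, 1, 1, 0]], dtype=torch.bool)

    masked_lengths = label_mask.sum(-1).long()  # tensor([3])
    masked_logits = torch.zeros_like(logits)
    for i in range(len(label_mask)):
        masked_logits[i, :masked_lengths[i], :] = logits[i].masked_select(
            label_mask[i].unsqueeze(-1)).view(masked_lengths[i], -1)
    # masked_logits[0, :3] now holds the logits of positions 0, 2 and 3;
    # the padding tail stays zero, so word-level labels line up with row order.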
@@ -22,7 +22,6 @@ import torch.utils.checkpoint
 from torch import nn
 from torch.nn import LayerNorm
 from transformers.activations import ACT2FN
-from transformers.modeling_outputs import BaseModelOutput
 from transformers.modeling_utils import PreTrainedModel
 from transformers.pytorch_utils import softmax_backward_data
@@ -574,7 +573,7 @@ class DebertaV2Encoder(nn.Module):
             return tuple(
                 v for v in [output_states, all_hidden_states, all_attentions]
                 if v is not None)
-        return BaseModelOutput(
+        return AttentionBackboneModelOutput(
             last_hidden_state=output_states,
             hidden_states=all_hidden_states,
             attentions=all_attentions)
@@ -44,7 +44,7 @@ class DebertaV2ForMaskedLM(DebertaV2PreTrainedModel):
     Preprocessor:
         This is the fill_mask model of Deberta_v2, the preprocessor of this model
-        is `modelscope.preprocessors.NLPPreprocessor`.
+        is `modelscope.preprocessors.FillMaskTransformersPreprocessor`.

     Parameters:
         config (`DebertaV2Config`): Model configuration class with all the parameters of the model.
@@ -18,18 +18,16 @@ from modelscope.utils.import_utils import LazyImportModule
 if TYPE_CHECKING:
     from .configuration import PalmConfig
-    from .backbone import (
+    from .text_generation import (
         AbsSummarizer,
-        PalmForConditionalGeneration,
+        PalmForTextGeneration,
         Translator,
     )
-    from .text_generation import PalmForTextGeneration
 else:
     _import_structure = {
         'configuration': ['PalmConfig'],
-        'backbone':
-        ['AbsSummarizer', 'PalmForConditionalGeneration', 'Translator'],
-        'text_generation': ['PalmForTextGeneration'],
+        'text_generation':
+        ['AbsSummarizer', 'Translator', 'PalmForTextGeneration'],
     }

     import sys
@@ -23,8 +23,6 @@ import torch.utils.checkpoint
 from packaging import version
 from torch import nn
 from transformers.activations import ACT2FN
-from transformers.modeling_outputs import \
-    BaseModelOutputWithPastAndCrossAttentions
 from transformers.modeling_utils import (PreTrainedModel,
                                          apply_chunking_to_forward,
                                          find_pruneable_heads_and_indices,
@@ -573,7 +571,7 @@ class PoNetEncoder(nn.Module):
                 all_self_attentions,
                 all_cross_attentions,
             ] if v is not None)
-        return BaseModelOutputWithPastAndCrossAttentions(
+        return AttentionBackboneModelOutput(
             last_hidden_state=hidden_states,
             past_key_values=next_decoder_cache,
             hidden_states=all_hidden_states,
@@ -642,34 +640,6 @@ class PoNetPreTrainedModel(TorchModel, PreTrainedModel):
         return model


-class PoNetPreTrainedModelV2(PreTrainedModel):
-    """
-    A base class to handle weights initialization and a simple interface for loading pretrained models.
-    """
-
-    config_class = PoNetConfig
-    base_model_prefix = 'ponet'
-    _keys_to_ignore_on_load_missing = [r'position_ids']
-
-    def _init_weights(self, module):
-        """Initialize the weights"""
-        if isinstance(module, nn.Linear):
-            # Slightly different from the TF version which uses truncated_normal for initialization
-            # cf https://github.com/pytorch/pytorch/pull/5617
-            module.weight.data.normal_(
-                mean=0.0, std=self.config.initializer_range)
-            if module.bias is not None:
-                module.bias.data.zero_()
-        elif isinstance(module, nn.Embedding):
-            module.weight.data.normal_(
-                mean=0.0, std=self.config.initializer_range)
-            if module.padding_idx is not None:
-                module.weight.data[module.padding_idx].zero_()
-        elif isinstance(module, nn.LayerNorm):
-            module.bias.data.zero_()
-            module.weight.data.fill_(1.0)
-
-
 @MODELS.register_module(Tasks.backbone, module_name=Models.ponet)
 class PoNetModel(PoNetPreTrainedModel):
     """The bare PoNet Model transformer outputting raw hidden-states without any specific head on top.
@@ -5,13 +5,15 @@ from typing import Any, Dict
 import torch
 from torch import nn
 from torch.nn import CrossEntropyLoss
-from transformers.modeling_outputs import TokenClassifierOutput

 from modelscope.metainfo import Models
 from modelscope.models.base import Model
 from modelscope.models.builder import MODELS
+from modelscope.models.nlp.bert import BertConfig
+from modelscope.outputs import AttentionTokenClassificationModelOutput
 from modelscope.utils.constant import Tasks
-from .backbone import PoNetModel, PoNetPreTrainedModelV2
+from .backbone import PoNetModel, PoNetPreTrainedModel
+from .configuration import PoNetConfig

 __all__ = ['PoNetForDocumentSegmentation']
@@ -20,23 +22,7 @@ __all__ = ['PoNetForDocumentSegmentation']
     Tasks.document_segmentation, module_name=Models.ponet_for_ds)
 @MODELS.register_module(
     Tasks.extractive_summarization, module_name=Models.ponet_for_ds)
-class PoNetForDocumentSegmentation(Model):
-
-    def __init__(self, model_dir: str, model_config: Dict[str, Any], *args,
-                 **kwargs):
-        super().__init__(model_dir, model_config, *args, **kwargs)
-        self.model_cfg = model_config
-
-    def build_with_config(self, config):
-        self.ponet_model = PoNetForDocumentSegmentationBase.from_pretrained(
-            self.model_dir, config=config)
-        return self.ponet_model
-
-    def forward(self) -> Dict[str, Any]:
-        return self.model_cfg
-
-
-class PoNetForDocumentSegmentationBase(PoNetPreTrainedModelV2):
+class PoNetForDocumentSegmentation(PoNetPreTrainedModel):
     _keys_to_ignore_on_load_unexpected = [r'pooler']

     def __init__(self, config):
@@ -107,9 +93,24 @@ class PoNetForDocumentSegmentationBase(PoNetPreTrainedModelV2):
             output = (logits, ) + outputs[2:]
             return ((loss, ) + output) if loss is not None else output

-        return TokenClassifierOutput(
+        return AttentionTokenClassificationModelOutput(
             loss=loss,
             logits=logits,
             hidden_states=outputs.hidden_states,
             attentions=outputs.attentions,
         )
+
+    @classmethod
+    def _instantiate(cls, model_dir, model_config: Dict[str, Any], **kwargs):
+        if model_config['type'] == 'bert':
+            config = BertConfig.from_pretrained(model_dir, num_labels=2)
+        elif model_config['type'] == 'ponet':
+            config = PoNetConfig.from_pretrained(model_dir, num_labels=2)
+        else:
+            raise ValueError(
+                f'Expected config type bert and ponet, which is : {model_config["type"]}'
+            )
+        model = super(Model, cls).from_pretrained(model_dir, config=config)
+        model.model_dir = model_dir
+        model.model_cfg = model_config
+        return model
@@ -15,14 +15,14 @@
 # limitations under the License
 """Tokenization classes for Space. mainly copied from :module:`~transformers.tokenization_xlm_roberta`"""

-from modelscope.models.nlp.structbert import (BasicTokenizer, SbertTokenizer,
-                                              WordpieceTokenizer)
+from transformers import BasicTokenizer, BertTokenizer, WordpieceTokenizer

 from modelscope.utils import logger as logging

 logger = logging.get_logger(__name__)


-class SpaceTokenizer(SbertTokenizer):
+class SpaceTokenizer(BertTokenizer):
     """
     This class overrides [`SpaceTokenizer`]. Please check the superclass for the appropriate
     documentation alongside usage examples.
@@ -24,9 +24,6 @@ if TYPE_CHECKING:
     from .fill_mask import SbertForMaskedLM
     from .text_classification import SbertForSequenceClassification
     from .token_classification import SbertForTokenClassification
-    from .tokenization import (BasicTokenizer, SbertTokenizer,
-                               WordpieceTokenizer)
-    from .tokenization_fast import SbertTokenizerFast
 else:
     _import_structure = {
         'backbone': ['SbertModel', 'SbertPreTrainedModel'],
@@ -35,9 +32,6 @@ else:
         'faq_question_answering': ['SbertForFaqQuestionAnswering'],
         'text_classification': ['SbertForSequenceClassification'],
         'token_classification': ['SbertForTokenClassification'],
-        'tokenization':
-        ['BasicTokenizer', 'SbertTokenizer', 'WordpieceTokenizer'],
-        'tokenization_fast': ['SbertTokenizerFast'],
     }

     import sys
@@ -18,15 +18,13 @@
 import math
 from dataclasses import dataclass
-from typing import Optional, Tuple, Union
+from typing import Optional, Union

 import torch
 import torch.nn as nn
 import torch.utils.checkpoint
 from packaging import version
 from transformers.activations import ACT2FN
-from transformers.modeling_outputs import \
-    BaseModelOutputWithPastAndCrossAttentions
 from transformers.modeling_utils import (PreTrainedModel,
                                          apply_chunking_to_forward,
                                          find_pruneable_heads_and_indices,
@@ -37,8 +35,8 @@ from modelscope.models import Model, TorchModel
 from modelscope.models.builder import MODELS
 from modelscope.outputs import AttentionBackboneModelOutput
 from modelscope.utils.constant import Tasks
-from modelscope.utils.hub import parse_label_mapping
 from modelscope.utils.logger import get_logger
+from modelscope.utils.nlp.utils import parse_labels_in_order
 from .configuration import SbertConfig

 logger = get_logger(__name__)
@@ -563,7 +561,7 @@ class SbertEncoder(nn.Module):
                 all_self_attentions,
                 all_cross_attentions,
             ] if v is not None)
-        return BaseModelOutputWithPastAndCrossAttentions(
+        return AttentionBackboneModelOutput(
             last_hidden_state=hidden_states,
             past_key_values=next_decoder_cache,
             hidden_states=all_hidden_states,
@@ -641,29 +639,15 @@ class SbertPreTrainedModel(TorchModel, PreTrainedModel):
         """
         model_dir = kwargs.pop('model_dir', None)
+        cfg = kwargs.pop('cfg', None)
+        model_args = parse_labels_in_order(model_dir, cfg, **kwargs)
         if model_dir is None:
-            config = SbertConfig(**kwargs)
+            config = SbertConfig(**model_args)
             model = cls(config)
         else:
-            model_kwargs = {}
-            label2id = kwargs.get('label2id', parse_label_mapping(model_dir))
-            id2label = kwargs.get(
-                'id2label', None if label2id is None else
-                {id: label
-                 for label, id in label2id.items()})
-            if id2label is not None and label2id is None:
-                label2id = {label: id for id, label in id2label.items()}
-            num_labels = kwargs.get(
-                'num_labels', None if label2id is None else len(label2id))
-            if num_labels is not None:
-                model_kwargs['num_labels'] = num_labels
-            if label2id is not None:
-                model_kwargs['label2id'] = label2id
-            if id2label is not None:
-                model_kwargs['id2label'] = id2label
             model = super(Model, cls).from_pretrained(
-                pretrained_model_name_or_path=model_dir, **model_kwargs)
+                pretrained_model_name_or_path=model_dir, **model_args)
         return model
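A sketch of the unified label handling: an explicit label2id passed as a kwarg should now flow through parse_labels_in_order as the removed inline logic did (deriving id2label and num_labels); the model dir is illustrative:

    from modelscope.models.nlp import SbertForSequenceClassification

    model = SbertForSequenceClassification.from_pretrained(
        'path/to/model_dir', label2id={'negative': 0, 'positive': 1})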
@@ -14,6 +14,7 @@ from modelscope.metainfo import Models
 from modelscope.models.builder import MODELS
 from modelscope.models.nlp.structbert import SbertConfig, SbertModel
 from modelscope.models.nlp.task_models.task_model import BaseTaskModel
+from modelscope.outputs import FaqQuestionAnsweringOutput
 from modelscope.utils.config import Config, ConfigFields
 from modelscope.utils.constant import ModelFile, Tasks
@@ -208,10 +209,10 @@ class SbertForFaqQuestionAnswering(BaseTaskModel):
                 Predicted scores of all classes for each query.
         Examples:
             >>> from modelscope.hub.snapshot_download import snapshot_download
-            >>> from modelscope.preprocessors import FaqQuestionAnsweringPreprocessor
+            >>> from modelscope.preprocessors import FaqQuestionAnsweringTransformersPreprocessor
             >>> from modelscope.models.nlp import SbertForFaqQuestionAnswering
             >>> cache_path = snapshot_download('damo/nlp_structbert_faq-question-answering_chinese-base')
-            >>> preprocessor = FaqQuestionAnsweringPreprocessor.from_pretrained(cache_path)
+            >>> preprocessor = FaqQuestionAnsweringTransformersPreprocessor.from_pretrained(cache_path)
             >>> model = SbertForFaqQuestionAnswering.from_pretrained(cache_path)
             >>> param = {
             >>>     'query_set': ['如何使用优惠券', '在哪里领券', '在哪里领券'],
@@ -270,7 +271,7 @@ class SbertForFaqQuestionAnswering(BaseTaskModel):
         scores = self.metrics_layer(z_query, protos).view([n_query, num_cls])
         if self.metrics_layer.name == 'relation':
             scores = torch.sigmoid(scores)
-        return {'scores': scores}
+        return FaqQuestionAnsweringOutput(scores=scores)
     def _get_onehot_labels(self, labels, support_size, num_cls):
         labels_ = labels.view(support_size, 1)
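With the bare dict replaced by `FaqQuestionAnsweringOutput`, callers get attribute access on the forward result; assuming ModelScope's output classes keep HuggingFace-style indexing (as ModelOutput-like dataclasses usually do), the old key-based access should keep working too:

outputs = model(**param)
scores = outputs.scores       # attribute access via the new output class
scores = outputs['scores']    # dict-style access, assuming ModelOutput-like indexing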
@@ -105,7 +105,7 @@ class SbertForMaskedLM(SbertPreTrainedModel):
     Preprocessor:
         This is the fill_mask model of StructBERT, the preprocessor of this model
-        is `modelscope.preprocessors.NLPPreprocessor`.
+        is `modelscope.preprocessors.FillMaskTransformersPreprocessor`.
     Parameters:
         config (:class:`~modelscope.models.nlp.structbert.SbertConfig`): Model configuration class with
@@ -213,9 +213,9 @@ class SbertForMaskedLM(SbertPreTrainedModel):
         Examples:
             >>> from modelscope.models import Model
-            >>> from modelscope.preprocessors import Preprocessor, NLPPreprocessor
+            >>> from modelscope.preprocessors import Preprocessor, FillMaskTransformersPreprocessor
             >>> model = Model.from_pretrained('damo/nlp_structbert_fill-mask_chinese-large')
-            >>> preprocessor = NLPPreprocessor('damo/nlp_structbert_fill-mask_chinese-large')
+            >>> preprocessor = FillMaskTransformersPreprocessor('damo/nlp_structbert_fill-mask_chinese-large')
             >>> # Call the model, return some tensors
             >>> print(model(**preprocessor('你师父差得动你,你师父可[MASK]不动我。')))
             >>> # Call the pipeline
@@ -55,7 +55,7 @@ class SbertForSequenceClassification(SbertPreTrainedModel):
     Preprocessor:
         This is the text classification model of StructBERT, the preprocessor of this model
-        is `modelscope.preprocessors.SequenceClassificationPreprocessor`.
+        is `modelscope.preprocessors.TextClassificationTransformersPreprocessor`.
     Trainer:
         This model is a normal PyTorch model, and can be trained by variable trainers, like EpochBasedTrainer,
@@ -22,7 +22,7 @@ from torch.nn import CrossEntropyLoss
 from modelscope.metainfo import Models
 from modelscope.models.builder import MODELS
-from modelscope.outputs import TokenClassifierOutput
+from modelscope.outputs import AttentionTokenClassificationModelOutput
 from modelscope.utils import logger as logging
 from modelscope.utils.constant import Tasks
 from .adv_utils import compute_adv_loss
@@ -50,7 +50,7 @@ class SbertForTokenClassification(SbertPreTrainedModel):
     Preprocessor:
         This is the token-classification model of StructBERT, the preprocessor of this model
-        is `modelscope.preprocessors.TokenClassificationPreprocessor`.
+        is `modelscope.preprocessors.TokenClassificationTransformersPreprocessor`.
     Trainer:
         This model is a normal PyTorch model, and can be trained by variable trainers, like EpochBasedTrainer,
@@ -168,7 +168,7 @@ class SbertForTokenClassification(SbertPreTrainedModel):
                 - 0 for tokens that are **masked**.
         Returns:
-            Returns `modelscope.outputs.TokenClassifierOutput`
+            Returns `modelscope.outputs.AttentionTokenClassificationModelOutput`
         Examples:
             >>> from modelscope.models import Model
@@ -220,10 +220,21 @@ class SbertForTokenClassification(SbertPreTrainedModel):
                 with_attention_mask=attention_mask is not None,
                 **outputs.kwargs)
-        return TokenClassifierOutput(
+        if label_mask is not None:
+            mask = label_mask
+            masked_lengths = mask.sum(-1).long()
+            masked_logits = torch.zeros_like(logits)
+            for i in range(len(mask)):
+                masked_logits[
+                    i, :masked_lengths[i], :] = logits[i].masked_select(
+                        mask[i].unsqueeze(-1)).view(masked_lengths[i], -1)
+            logits = masked_logits
+        return AttentionTokenClassificationModelOutput(
             loss=loss,
             logits=logits,
             hidden_states=outputs.hidden_states,
             attentions=outputs.attentions,
             offset_mapping=offset_mapping,
+            label_mask=label_mask,
         )
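The inserted block left-aligns the logits of non-masked positions within each sequence, zero-padding the tail, instead of flattening valid positions across the batch (the old task-model implementation removed later in this diff did `logits[input['label_mask']].unsqueeze(0)`, which collapses the batch dimension). A small self-contained walk-through of the same loop:

import torch

# Two sequences of length 4 with 2 labels each; label_mask marks real tokens.
logits = torch.arange(16, dtype=torch.float).view(2, 4, 2)
label_mask = torch.tensor([[1, 1, 0, 0], [1, 0, 1, 0]]).bool()

masked_lengths = label_mask.sum(-1).long()      # tensor([2, 2])
masked_logits = torch.zeros_like(logits)
for i in range(len(label_mask)):
    masked_logits[i, :masked_lengths[i], :] = logits[i].masked_select(
        label_mask[i].unsqueeze(-1)).view(masked_lengths[i], -1)
# Sequence 0 keeps positions 0-1; sequence 1 keeps positions 0 and 2,
# each packed to the front and zero-padded to the original length.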
@@ -1,519 +0,0 @@
-# Copyright 2021-2022 The Alibaba DAMO NLP Team Authors.
-# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Tokenization classes for Sbert. mainly copied from :module:`~transformers.tokenization_bert`"""
-import collections
-import os
-import unicodedata
-from typing import List, Optional, Tuple
-from transformers.tokenization_utils import (PreTrainedTokenizer, _is_control,
-                                             _is_punctuation, _is_whitespace)
-from modelscope.utils.constant import ModelFile
-from modelscope.utils.logger import get_logger
-logger = get_logger(__name__)
-VOCAB_FILES_NAMES = {'vocab_file': ModelFile.VOCAB_FILE}
-PRETRAINED_VOCAB_FILES_MAP = {'vocab_file': {}}
-PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
-    'nlp_structbert_backbone_large_std': 512,
-    'nlp_structbert_backbone_base_std': 512,
-    'nlp_structbert_backbone_lite_std': 512,
-    'nlp_structbert_backbone_tiny_std': 512,
-}
-PRETRAINED_INIT_CONFIGURATION = {
-    'english_sbert-large-std-512': {
-        'do_lower_case': True
-    },
-}
-def load_vocab(vocab_file):
-    """Loads a vocabulary file into a dictionary."""
-    vocab = collections.OrderedDict()
-    with open(vocab_file, 'r', encoding='utf-8') as reader:
-        tokens = reader.readlines()
-    for index, token in enumerate(tokens):
-        token = token.rstrip('\n')
-        vocab[token] = index
-    return vocab
-def whitespace_tokenize(text):
-    """Runs basic whitespace cleaning and splitting on a piece of text."""
-    text = text.strip()
-    if not text:
-        return []
-    tokens = text.split()
-    return tokens
-class SbertTokenizer(PreTrainedTokenizer):
-    r"""
-    Construct a SBERT tokenizer. Based on WordPiece.
-    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the main methods.
-    Users should refer to this superclass for more information regarding those methods.
-    Args:
-        vocab_file (:obj:`str`):
-            File containing the vocabulary.
-        do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):
-            Whether or not to lowercase the input when tokenizing.
-        do_basic_tokenize (:obj:`bool`, `optional`, defaults to :obj:`True`):
-            Whether or not to do basic tokenization before WordPiece.
-        never_split (:obj:`Iterable`, `optional`):
-            Collection of tokens which will never be split during tokenization. Only has an effect when
-            :obj:`do_basic_tokenize=True`
-        unk_token (:obj:`str`, `optional`, defaults to :obj:`"[UNK]"`):
-            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
-            token instead.
-        sep_token (:obj:`str`, `optional`, defaults to :obj:`"[SEP]"`):
-            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
-            sequence classification or for a text and a question for question answering. It is also used as the last
-            token of a sequence built with special tokens.
-        pad_token (:obj:`str`, `optional`, defaults to :obj:`"[PAD]"`):
-            The token used for padding, for example when batching sequences of different lengths.
-        cls_token (:obj:`str`, `optional`, defaults to :obj:`"[CLS]"`):
-            The classifier token which is used when doing sequence classification (classification of the whole sequence
-            instead of per-token classification). It is the first token of the sequence when built with special tokens.
-        mask_token (:obj:`str`, `optional`, defaults to :obj:`"[MASK]"`):
-            The token used for masking values. This is the token used when training this model with masked language
-            modeling. This is the token which the model will try to predict.
-        tokenize_chinese_chars (:obj:`bool`, `optional`, defaults to :obj:`True`):
-            Whether or not to tokenize Chinese characters.
-            This should likely be deactivated for Japanese (see this `issue
-            <https://github.com/huggingface/transformers/issues/328>`__).
-        strip_accents: (:obj:`bool`, `optional`):
-            Whether or not to strip all accents. If this option is not specified, then it will be determined by the
-            value for :obj:`lowercase` (as in the original BERT).
-    """
-    vocab_files_names = VOCAB_FILES_NAMES
-    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
-    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION
-    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
-    def __init__(self,
-                 vocab_file,
-                 do_lower_case=True,
-                 do_basic_tokenize=True,
-                 never_split=None,
-                 unk_token='[UNK]',
-                 sep_token='[SEP]',
-                 pad_token='[PAD]',
-                 cls_token='[CLS]',
-                 mask_token='[MASK]',
-                 tokenize_chinese_chars=True,
-                 strip_accents=None,
-                 **kwargs):
-        super().__init__(
-            do_lower_case=do_lower_case,
-            do_basic_tokenize=do_basic_tokenize,
-            never_split=never_split,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            pad_token=pad_token,
-            cls_token=cls_token,
-            mask_token=mask_token,
-            tokenize_chinese_chars=tokenize_chinese_chars,
-            strip_accents=strip_accents,
-            **kwargs,
-        )
-        if not os.path.isfile(vocab_file):
-            raise ValueError(
-                f"Can't find a vocabulary file at path '{vocab_file}'. To load the vocabulary from a Google pretrained "
-                'model use `tokenizer = SbertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`'
-            )
-        self.vocab = load_vocab(vocab_file)
-        self.ids_to_tokens = collections.OrderedDict([
-            (ids, tok) for tok, ids in self.vocab.items()
-        ])
-        self.do_basic_tokenize = do_basic_tokenize
-        if do_basic_tokenize:
-            self.basic_tokenizer = BasicTokenizer(
-                do_lower_case=do_lower_case,
-                never_split=never_split,
-                tokenize_chinese_chars=tokenize_chinese_chars,
-                strip_accents=strip_accents,
-            )
-        self.wordpiece_tokenizer = WordpieceTokenizer(
-            vocab=self.vocab, unk_token=self.unk_token)
-    @property
-    def do_lower_case(self):
-        return self.basic_tokenizer.do_lower_case
-    @property
-    def vocab_size(self):
-        return len(self.vocab)
-    def get_vocab(self):
-        return dict(self.vocab, **self.added_tokens_encoder)
-    def _tokenize(self, text):
-        split_tokens = []
-        if self.do_basic_tokenize:
-            for token in self.basic_tokenizer.tokenize(
-                    text, never_split=self.all_special_tokens):
-                # If the token is part of the never_split set
-                if token in self.basic_tokenizer.never_split:
-                    split_tokens.append(token)
-                else:
-                    split_tokens += self.wordpiece_tokenizer.tokenize(token)
-        else:
-            split_tokens = self.wordpiece_tokenizer.tokenize(text)
-        return split_tokens
-    def _convert_token_to_id(self, token):
-        """Converts a token (str) in an id using the vocab."""
-        return self.vocab.get(token, self.vocab.get(self.unk_token))
-    def _convert_id_to_token(self, index):
-        """Converts an index (integer) in a token (str) using the vocab."""
-        return self.ids_to_tokens.get(index, self.unk_token)
-    def convert_tokens_to_string(self, tokens):
-        """Converts a sequence of tokens (string) in a single string."""
-        out_string = ' '.join(tokens).replace(' ##', '').strip()
-        return out_string
-    def build_inputs_with_special_tokens(
-            self,
-            token_ids_0: List[int],
-            token_ids_1: Optional[List[int]] = None) -> List[int]:
-        """
-        Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
-        adding special tokens. A SBERT sequence has the following format:
-        - single sequence: ``[CLS] X [SEP]``
-        - pair of sequences: ``[CLS] A [SEP] B [SEP]``
-        Args:
-            token_ids_0 (:obj:`List[int]`):
-                List of IDs to which the special tokens will be added.
-            token_ids_1 (:obj:`List[int]`, `optional`):
-                Optional second list of IDs for sequence pairs.
-        Returns:
-            :obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
-        """
-        if token_ids_1 is None:
-            return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
-        cls = [self.cls_token_id]
-        sep = [self.sep_token_id]
-        return cls + token_ids_0 + sep + token_ids_1 + sep
-    def get_special_tokens_mask(
-            self,
-            token_ids_0: List[int],
-            token_ids_1: Optional[List[int]] = None,
-            already_has_special_tokens: bool = False) -> List[int]:
-        """
-        Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
-        special tokens using the tokenizer ``prepare_for_model`` method.
-        Args:
-            token_ids_0 (:obj:`List[int]`):
-                List of IDs.
-            token_ids_1 (:obj:`List[int]`, `optional`):
-                Optional second list of IDs for sequence pairs.
-            already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
-                Whether or not the token list is already formatted with special tokens for the model.
-        Returns:
-            :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
-        """
-        if already_has_special_tokens:
-            return super().get_special_tokens_mask(
-                token_ids_0=token_ids_0,
-                token_ids_1=token_ids_1,
-                already_has_special_tokens=True)
-        if token_ids_1 is not None:
-            return [1] + ([0] * len(token_ids_0)) + [1] + (
-                [0] * len(token_ids_1)) + [1]
-        return [1] + ([0] * len(token_ids_0)) + [1]
-    def create_token_type_ids_from_sequences(
-            self,
-            token_ids_0: List[int],
-            token_ids_1: Optional[List[int]] = None) -> List[int]:
-        """
-        Create a mask from the two sequences passed to be used in a sequence-pair classification task. A SBERT sequence
-        pair mask has the following format:
-        ::
-            0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
-            | first sequence | second sequence |
-        If :obj:`token_ids_1` is :obj:`None`, this method only returns the first portion of the mask (0s).
-        Args:
-            token_ids_0 (:obj:`List[int]`):
-                List of IDs.
-            token_ids_1 (:obj:`List[int]`, `optional`):
-                Optional second list of IDs for sequence pairs.
-        Returns:
-            :obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given
-            sequence(s).
-        """
-        sep = [self.sep_token_id]
-        cls = [self.cls_token_id]
-        if token_ids_1 is None:
-            return len(cls + token_ids_0 + sep) * [0]
-        return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1
-                                                        + sep) * [1]
-    def save_vocabulary(self,
-                        save_directory: str,
-                        filename_prefix: Optional[str] = None) -> Tuple[str]:
-        index = 0
-        if os.path.isdir(save_directory):
-            vocab_file = os.path.join(
-                save_directory,
-                (filename_prefix + '-' if filename_prefix else '')
-                + VOCAB_FILES_NAMES['vocab_file'])
-        else:
-            vocab_file = (filename_prefix
-                          + '-' if filename_prefix else '') + save_directory
-        with open(vocab_file, 'w', encoding='utf-8') as writer:
-            for token, token_index in sorted(
-                    self.vocab.items(), key=lambda kv: kv[1]):
-                if index != token_index:
-                    logger.warning(
-                        f'Saving vocabulary to {vocab_file}: vocabulary indices are not consecutive.'
-                        ' Please check that the vocabulary is not corrupted!')
-                    index = token_index
-                writer.write(token + '\n')
-                index += 1
-        return (vocab_file, )
-class BasicTokenizer(object):
-    """
-    Constructs a BasicTokenizer that will run basic tokenization (punctuation splitting, lower casing, etc.).
-    Args:
-        do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):
-            Whether or not to lowercase the input when tokenizing.
-        never_split (:obj:`Iterable`, `optional`):
-            Collection of tokens which will never be split during tokenization. Only has an effect when
-            :obj:`do_basic_tokenize=True`
-        tokenize_chinese_chars (:obj:`bool`, `optional`, defaults to :obj:`True`):
-            Whether or not to tokenize Chinese characters.
-            This should likely be deactivated for Japanese (see this `issue
-            <https://github.com/huggingface/transformers/issues/328>`__).
-        strip_accents: (:obj:`bool`, `optional`):
-            Whether or not to strip all accents. If this option is not specified, then it will be determined by the
-            value for :obj:`lowercase` (as in the original BERT).
-    """
-    def __init__(self,
-                 do_lower_case=True,
-                 never_split=None,
-                 tokenize_chinese_chars=True,
-                 strip_accents=None):
-        if never_split is None:
-            never_split = []
-        self.do_lower_case = do_lower_case
-        self.never_split = set(never_split)
-        self.tokenize_chinese_chars = tokenize_chinese_chars
-        self.strip_accents = strip_accents
-    def tokenize(self, text, never_split=None):
-        """
-        Basic Tokenization of a piece of text. Split on "white spaces" only, for sub-word tokenization, see
-        WordPieceTokenizer.
-        Args:
-            **never_split**: (`optional`) list of str
-                Kept for backward compatibility purposes. Now implemented directly at the base class level (see
-                :func:`PreTrainedTokenizer.tokenize`) List of token not to split.
-        """
-        # union() returns a new set by concatenating the two sets.
-        never_split = self.never_split.union(
-            set(never_split)) if never_split else self.never_split
-        text = self._clean_text(text)
-        # This was added on November 1st, 2018 for the multilingual and Chinese
-        # models. This is also applied to the English models now, but it doesn't
-        # matter since the English models were not trained on any Chinese data
-        # and generally don't have any Chinese data in them (there are Chinese
-        # characters in the vocabulary because Wikipedia does have some Chinese
-        # words in the English Wikipedia.).
-        if self.tokenize_chinese_chars:
-            text = self._tokenize_chinese_chars(text)
-        orig_tokens = whitespace_tokenize(text)
-        split_tokens = []
-        for token in orig_tokens:
-            if token not in never_split:
-                if self.do_lower_case:
-                    token = token.lower()
-                    if self.strip_accents is not False:
-                        token = self._run_strip_accents(token)
-                elif self.strip_accents:
-                    token = self._run_strip_accents(token)
-            split_tokens.extend(self._run_split_on_punc(token, never_split))
-        output_tokens = whitespace_tokenize(' '.join(split_tokens))
-        return output_tokens
-    def _run_strip_accents(self, text):
-        """Strips accents from a piece of text."""
-        text = unicodedata.normalize('NFD', text)
-        output = []
-        for char in text:
-            cat = unicodedata.category(char)
-            if cat == 'Mn':
-                continue
-            output.append(char)
-        return ''.join(output)
-    def _run_split_on_punc(self, text, never_split=None):
-        """Splits punctuation on a piece of text."""
-        if never_split is not None and text in never_split:
-            return [text]
-        chars = list(text)
-        i = 0
-        start_new_word = True
-        output = []
-        while i < len(chars):
-            char = chars[i]
-            if _is_punctuation(char):
-                output.append([char])
-                start_new_word = True
-            else:
-                if start_new_word:
-                    output.append([])
-                start_new_word = False
-                output[-1].append(char)
-            i += 1
-        return [''.join(x) for x in output]
-    def _tokenize_chinese_chars(self, text):
-        """Adds whitespace around any CJK character."""
-        output = []
-        for char in text:
-            cp = ord(char)
-            if self._is_chinese_char(cp):
-                output.append(' ')
-                output.append(char)
-                output.append(' ')
-            else:
-                output.append(char)
-        return ''.join(output)
-    def _is_chinese_char(self, cp):
-        """Checks whether CP is the codepoint of a CJK character."""
-        # This defines a "chinese character" as anything in the CJK Unicode block:
-        # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
-        #
-        # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
-        # despite its name. The modern Korean Hangul alphabet is a different block,
-        # as is Japanese Hiragana and Katakana. Those alphabets are used to write
-        # space-separated words, so they are not treated specially and handled
-        # like the all of the other languages.
-        if ((0x4E00 <= cp <= 0x9FFF) or (0x3400 <= cp <= 0x4DBF)
-                or (0x20000 <= cp <= 0x2A6DF) or (0x2A700 <= cp <= 0x2B73F)
-                or (0x2B740 <= cp <= 0x2B81F) or (0x2B820 <= cp <= 0x2CEAF)
-                or (0xF900 <= cp <= 0xFAFF) or (0x2F800 <= cp <= 0x2FA1F)):
-            return True
-        return False
-    def _clean_text(self, text):
-        """Performs invalid character removal and whitespace cleanup on text."""
-        output = []
-        for char in text:
-            cp = ord(char)
-            if cp == 0 or cp == 0xFFFD or _is_control(char):
-                continue
-            if _is_whitespace(char):
-                output.append(' ')
-            else:
-                output.append(char)
-        return ''.join(output)
-class WordpieceTokenizer(object):
-    """Runs WordPiece tokenization."""
-    def __init__(self, vocab, unk_token, max_input_chars_per_word=100):
-        self.vocab = vocab
-        self.unk_token = unk_token
-        self.max_input_chars_per_word = max_input_chars_per_word
-    def tokenize(self, text):
-        """
-        Tokenizes a piece of text into its word pieces. This uses a greedy longest-match-first algorithm to perform
-        tokenization using the given vocabulary.
-        For example, :obj:`input = "unaffable"` wil return as output :obj:`["un", "##aff", "##able"]`.
-        Args:
-            text: A single token or whitespace separated tokens. This should have
-                already been passed through `BasicTokenizer`.
-        Returns:
-            A list of wordpiece tokens.
-        """
-        output_tokens = []
-        for token in whitespace_tokenize(text):
-            chars = list(token)
-            if len(chars) > self.max_input_chars_per_word:
-                output_tokens.append(self.unk_token)
-                continue
-            is_bad = False
-            start = 0
-            sub_tokens = []
-            while start < len(chars):
-                end = len(chars)
-                cur_substr = None
-                while start < end:
-                    substr = ''.join(chars[start:end])
-                    if start > 0:
-                        substr = '##' + substr
-                    if substr in self.vocab:
-                        cur_substr = substr
-                        break
-                    end -= 1
-                if cur_substr is None:
-                    is_bad = True
-                    break
-                sub_tokens.append(cur_substr)
-                start = end
-            if is_bad:
-                output_tokens.append(self.unk_token)
-            else:
-                output_tokens.extend(sub_tokens)
-        return output_tokens
@@ -1,203 +0,0 @@
-# Copyright 2021-2022 The Alibaba DAMO NLP Team Authors.
-# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Fast Tokenization classes for Sbert. mainly copied from :module:`~transformers.tokenization_bert_fast`"""
-from typing import List, Optional, Tuple
-import json
-import transformers
-from tokenizers import normalizers
-from transformers.tokenization_utils_fast import PreTrainedTokenizerFast
-from modelscope.utils.constant import ModelFile
-from modelscope.utils.logger import get_logger
-from .tokenization import SbertTokenizer
-logger = get_logger(__name__)
-VOCAB_FILES_NAMES = {
-    'vocab_file': ModelFile.VOCAB_FILE,
-    'tokenizer_file': 'tokenizer.json'
-}
-PRETRAINED_VOCAB_FILES_MAP = {
-    'vocab_file': {},
-    'tokenizer_file': {},
-}
-PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
-    'nlp_structbert_backbone_large_std': 512,
-    'nlp_structbert_backbone_base_std': 512,
-    'nlp_structbert_backbone_lite_std': 512,
-    'nlp_structbert_backbone_tiny_std': 512,
-}
-PRETRAINED_INIT_CONFIGURATION = {
-    'english_sbert-large-std-512': {
-        'do_lower_case': True
-    },
-}
-transformers.SLOW_TO_FAST_CONVERTERS[
-    'SbertTokenizer'] = transformers.SLOW_TO_FAST_CONVERTERS['BertTokenizer']
-class SbertTokenizerFast(PreTrainedTokenizerFast):
-    r"""
-    Construct a "fast" SBERT tokenizer (backed by HuggingFace's `tokenizers` library). Based on WordPiece.
-    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizerFast` which contains most of the main
-    methods. Users should refer to this superclass for more information regarding those methods.
-    Args:
-        vocab_file (:obj:`str`):
-            File containing the vocabulary.
-        do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):
-            Whether or not to lowercase the input when tokenizing.
-        unk_token (:obj:`str`, `optional`, defaults to :obj:`"[UNK]"`):
-            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
-            token instead.
-        sep_token (:obj:`str`, `optional`, defaults to :obj:`"[SEP]"`):
-            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
-            sequence classification or for a text and a question for question answering. It is also used as the last
-            token of a sequence built with special tokens.
-        pad_token (:obj:`str`, `optional`, defaults to :obj:`"[PAD]"`):
-            The token used for padding, for example when batching sequences of different lengths.
-        cls_token (:obj:`str`, `optional`, defaults to :obj:`"[CLS]"`):
-            The classifier token which is used when doing sequence classification (classification of the whole sequence
-            instead of per-token classification). It is the first token of the sequence when built with special tokens.
-        mask_token (:obj:`str`, `optional`, defaults to :obj:`"[MASK]"`):
-            The token used for masking values. This is the token used when training this model with masked language
-            modeling. This is the token which the model will try to predict.
-        clean_text (:obj:`bool`, `optional`, defaults to :obj:`True`):
-            Whether or not to clean the text before tokenization by removing any control characters and replacing all
-            whitespaces by the classic one.
-        tokenize_chinese_chars (:obj:`bool`, `optional`, defaults to :obj:`True`):
-            Whether or not to tokenize Chinese characters. This should likely be deactivated for Japanese (see `this
-            issue <https://github.com/huggingface/transformers/issues/328>`__).
-        strip_accents: (:obj:`bool`, `optional`):
-            Whether or not to strip all accents. If this option is not specified, then it will be determined by the
-            value for :obj:`lowercase` (as in the original BERT).
-        wordpieces_prefix: (:obj:`str`, `optional`, defaults to :obj:`"##"`):
-            The prefix for subwords.
-    """
-    vocab_files_names = VOCAB_FILES_NAMES
-    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
-    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION
-    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
-    slow_tokenizer_class = SbertTokenizer
-    def __init__(self,
-                 vocab_file=None,
-                 tokenizer_file=None,
-                 do_lower_case=True,
-                 unk_token='[UNK]',
-                 sep_token='[SEP]',
-                 pad_token='[PAD]',
-                 cls_token='[CLS]',
-                 mask_token='[MASK]',
-                 tokenize_chinese_chars=True,
-                 strip_accents=None,
-                 **kwargs):
-        super().__init__(
-            vocab_file,
-            tokenizer_file=tokenizer_file,
-            do_lower_case=do_lower_case,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            pad_token=pad_token,
-            cls_token=cls_token,
-            mask_token=mask_token,
-            tokenize_chinese_chars=tokenize_chinese_chars,
-            strip_accents=strip_accents,
-            **kwargs,
-        )
-        pre_tok_state = json.loads(
-            self.backend_tokenizer.normalizer.__getstate__())
-        if (pre_tok_state.get('lowercase', do_lower_case) != do_lower_case
-                or pre_tok_state.get('strip_accents',
-                                     strip_accents) != strip_accents):
-            pre_tok_class = getattr(normalizers, pre_tok_state.pop('type'))
-            pre_tok_state['lowercase'] = do_lower_case
-            pre_tok_state['strip_accents'] = strip_accents
-            self.backend_tokenizer.normalizer = pre_tok_class(**pre_tok_state)
-        self.do_lower_case = do_lower_case
-    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
-        """
-        Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
-        adding special tokens. A SBERT sequence has the following format:
-        - single sequence: ``[CLS] X [SEP]``
-        - pair of sequences: ``[CLS] A [SEP] B [SEP]``
-        Args:
-            token_ids_0 (:obj:`List[int]`):
-                List of IDs to which the special tokens will be added.
-            token_ids_1 (:obj:`List[int]`, `optional`):
-                Optional second list of IDs for sequence pairs.
-        Returns:
-            :obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
-        """
-        output = [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
-        if token_ids_1:
-            output += token_ids_1 + [self.sep_token_id]
-        return output
-    def create_token_type_ids_from_sequences(
-            self,
-            token_ids_0: List[int],
-            token_ids_1: Optional[List[int]] = None) -> List[int]:
-        """
-        Create a mask from the two sequences passed to be used in a sequence-pair classification task. A SBERT sequence
-        pair mask has the following format:
-        ::
-            0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
-            | first sequence | second sequence |
-        If :obj:`token_ids_1` is :obj:`None`, this method only returns the first portion of the mask (0s).
-        Args:
-            token_ids_0 (:obj:`List[int]`):
-                List of IDs.
-            token_ids_1 (:obj:`List[int]`, `optional`):
-                Optional second list of IDs for sequence pairs.
-        Returns:
-            :obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given
-            sequence(s).
-        """
-        sep = [self.sep_token_id]
-        cls = [self.cls_token_id]
-        if token_ids_1 is None:
-            return len(cls + token_ids_0 + sep) * [0]
-        return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1
-                                                        + sep) * [1]
-    def save_vocabulary(self,
-                        save_directory: str,
-                        filename_prefix: Optional[str] = None) -> Tuple[str]:
-        files = self._tokenizer.model.save(
-            save_directory, name=filename_prefix)
-        return tuple(files)
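Both Sbert tokenizer modules above are deleted outright. Since `SbertTokenizer` was a straight copy of the BERT WordPiece tokenizer (the fast variant even registered itself under transformers' `BertTokenizer` slow-to-fast converter), a stock transformers tokenizer loaded from the model directory should be a drop-in replacement. A minimal sketch, assuming the model directory contains the usual vocab.txt/tokenizer.json files:

from transformers import BertTokenizerFast

# '/path/to/model_dir' is a placeholder for a downloaded ModelScope model dir.
tokenizer = BertTokenizerFast.from_pretrained('/path/to/model_dir')
inputs = tokenizer('今天天气不错', return_tensors='pt')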
@@ -5,12 +5,10 @@ import numpy as np
 from modelscope.metainfo import TaskModels
 from modelscope.models.builder import MODELS
-from modelscope.models.nlp.bert import BertConfig
 from modelscope.models.nlp.task_models.task_model import \
     SingleBackboneTaskModelBase
-from modelscope.outputs import OutputKeys
+from modelscope.outputs import FeatureExtractionOutput, OutputKeys
 from modelscope.utils.constant import Tasks
-from modelscope.utils.hub import parse_label_mapping
 __all__ = ['FeatureExtractionModel']
@@ -31,9 +29,9 @@ class FeatureExtractionModel(SingleBackboneTaskModelBase):
         self.build_backbone(self.backbone_cfg)
-    def forward(self, **input: Dict[str, Any]) -> Dict[str, np.ndarray]:
+    def forward(self, **input: Dict[str, Any]) -> FeatureExtractionOutput:
         # backbone do not need labels, only head need for loss compute
         input.pop(OutputKeys.LABELS, None)
         outputs = super().forward(input)
         sequence_output = outputs.last_hidden_state
-        return {OutputKeys.TEXT_EMBEDDING: sequence_output}
+        return FeatureExtractionOutput(text_embedding=sequence_output)
@@ -7,7 +7,7 @@ from modelscope.metainfo import TaskModels
 from modelscope.models.builder import MODELS
 from modelscope.models.nlp.task_models.task_model import \
     SingleBackboneTaskModelBase
-from modelscope.outputs import OutputKeys
+from modelscope.outputs import InformationExtractionOutput, OutputKeys
 from modelscope.utils.constant import Tasks
 __all__ = ['InformationExtractionModel']
@@ -31,9 +31,9 @@ class InformationExtractionModel(SingleBackboneTaskModelBase):
         self.build_backbone(self.backbone_cfg)
         self.build_head(self.head_cfg)
-    def forward(self, **input: Dict[str, Any]) -> Dict[str, np.ndarray]:
+    def forward(self, **input: Dict[str, Any]) -> InformationExtractionOutput:
         outputs = super().forward(input)
         sequence_output = outputs.last_hidden_state
         outputs = self.head.forward(sequence_output, input['text'],
                                     input['offsets'])
-        return {OutputKeys.SPO_LIST: outputs}
+        return InformationExtractionOutput(spo_list=outputs)
@@ -12,7 +12,7 @@ from transformers import AutoConfig, AutoModel
 from modelscope.metainfo import Models
 from modelscope.models import TorchModel
 from modelscope.models.builder import MODELS
-from modelscope.outputs import TokenClassifierWithPredictionsOutput
+from modelscope.outputs import AttentionTokenClassificationModelOutput
 from modelscope.utils.constant import ModelFile, Tasks
 __all__ = [
@@ -115,7 +115,7 @@ class SequenceLabelingForNamedEntityRecognition(TorchModel):
                 - 0 for tokens that are **masked**.
         Returns:
-            Returns `modelscope.outputs.TokenClassifierOutput`
+            Returns `modelscope.outputs.AttentionTokenClassificationModelOutput`
         Examples:
             >>> from modelscope.models import Model
@@ -138,17 +138,16 @@ class SequenceLabelingForNamedEntityRecognition(TorchModel):
     def postprocess(self, input: Dict[str, Any], **kwargs):
         predicts = self.model.decode(input)
-        offset_len = len(input['offset_mapping'])
-        predictions = torch.narrow(
-            predicts, 1, 0,
-            offset_len)  # index_select only move loc, not resize
-        return TokenClassifierWithPredictionsOutput(
+        offset_mapping = input.get('offset_mapping')
+        mask = input.get('label_mask')
+        return AttentionTokenClassificationModelOutput(
             loss=None,
             logits=None,
             hidden_states=None,
             attentions=None,
-            offset_mapping=input['offset_mapping'],
-            predictions=predictions,
+            label_mask=mask,
+            offset_mapping=offset_mapping,
+            predictions=predicts,
         )
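Note the behavioral change here: `predicts` is no longer narrowed to the length of `offset_mapping`; the full decode result is returned together with `label_mask`, and consumers are expected to drop padded positions themselves. A hedged sketch of that trimming (the actual pipeline postprocess may differ):

# Assumes output.predictions is (batch, seq_len) and output.label_mask is a
# 0/1 mask aligned with it; keep only the positions the mask marks as valid.
valid_predictions = [
    pred[mask.bool()]
    for pred, mask in zip(output.predictions, output.label_mask)
]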
@@ -1,18 +1,16 @@
 # Copyright (c) Alibaba, Inc. and its affiliates.
 from typing import Any, Dict
-import numpy as np
 import torch
 from modelscope.metainfo import Models, TaskModels
 from modelscope.models.builder import MODELS
 from modelscope.models.nlp.task_models.task_model import \
     SingleBackboneTaskModelBase
-from modelscope.outputs import OutputKeys, TokenClassifierOutput
+from modelscope.outputs import (AttentionTokenClassificationModelOutput,
+                                OutputKeys)
 from modelscope.utils.constant import Tasks
 from modelscope.utils.hub import parse_label_mapping
-from modelscope.utils.tensor_utils import (torch_nested_detach,
-                                           torch_nested_numpify)
 __all__ = ['TokenClassificationModel']
@@ -48,7 +46,10 @@ class TokenClassificationModel(SingleBackboneTaskModelBase):
         self.build_backbone(self.backbone_cfg)
         self.build_head(self.head_cfg)
-    def forward(self, **input: Dict[str, Any]) -> Dict[str, np.ndarray]:
+    def forward(
+            self,
+            **input: Dict[str,
+                          Any]) -> AttentionTokenClassificationModelOutput:
         labels = None
         if OutputKeys.LABEL in input:
             labels = input.pop(OutputKeys.LABEL)
@@ -62,16 +63,23 @@ class TokenClassificationModel(SingleBackboneTaskModelBase):
         if labels in input:
             loss = self.compute_loss(outputs, labels)
-        # apply label mask to logits
-        logits = logits[input['label_mask']].unsqueeze(0)
+        if 'label_mask' in input:
+            mask = input['label_mask']
+            masked_lengths = mask.sum(-1).long()
+            masked_logits = torch.zeros_like(logits)
+            for i in range(len(mask)):
+                masked_logits[
+                    i, :masked_lengths[i], :] = logits[i].masked_select(
+                        mask[i].unsqueeze(-1)).view(masked_lengths[i], -1)
+            logits = masked_logits
-        return TokenClassifierOutput(
+        return AttentionTokenClassificationModelOutput(
             loss=loss,
             logits=logits,
             hidden_states=outputs.hidden_states,
             attentions=outputs.attentions,
-            offset_mapping=input['offset_mapping'],
-        )
+            offset_mapping=input.get('offset_mapping'),
+            label_mask=input.get('label_mask'))
     def extract_logits(self, outputs):
         return outputs[OutputKeys.LOGITS].cpu().detach()
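This is the second copy of the packing loop added to `SbertForTokenClassification` earlier in this diff. If a third model ever needs it, a shared helper would keep the two from drifting apart; a sketch of what that could look like (not part of this change):

import torch

def pack_masked_logits(logits: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Left-align logits at non-masked positions, zero-padding the tail."""
    masked_lengths = mask.sum(-1).long()
    packed = torch.zeros_like(logits)
    for i in range(len(mask)):
        packed[i, :masked_lengths[i], :] = logits[i].masked_select(
            mask[i].unsqueeze(-1)).view(masked_lengths[i], -1)
    return packed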
@@ -23,8 +23,6 @@ if TYPE_CHECKING:
     from .text_classification import VecoForSequenceClassification
     from .token_classification import VecoForTokenClassification
     from .fill_mask import VecoForMaskedLM
-    from .tokenization import VecoTokenizer
-    from .tokenization_fast import VecoTokenizerFast
 else:
     _import_structure = {
         'configuration': ['VecoConfig'],
@@ -32,8 +30,6 @@ else:
         'text_classification': ['VecoForSequenceClassification'],
         'fill_mask': ['VecoForMaskedLM'],
         'token_classification': ['VecoForTokenClassification'],
-        'tokenization': ['VecoTokenizer'],
-        'tokenization_fast': ['VecoTokenizerFast'],
     }
     import sys
@@ -40,7 +40,7 @@ class VecoForMaskedLM(TorchModel, RobertaForMaskedLM):
     Preprocessor:
         This is the fill_mask model of StructBERT, the preprocessor of this model
-        is `modelscope.preprocessors.NLPPreprocessor`.
+        is `modelscope.preprocessors.FillMaskTransformersPreprocessor`.
     Parameters:
         config ([`VecoConfig`]): Model configuration class with all the parameters of the
@@ -22,7 +22,7 @@ from modelscope.models import Model, TorchModel
 from modelscope.models.builder import MODELS
 from modelscope.outputs import AttentionTextClassificationModelOutput
 from modelscope.utils.constant import Tasks
-from modelscope.utils.hub import parse_label_mapping
+from modelscope.utils.nlp.utils import parse_labels_in_order
 from .configuration import VecoConfig
@@ -46,7 +46,7 @@ class VecoForSequenceClassification(TorchModel,
     Preprocessor:
         This is the text classification model of Veco, the preprocessor of this model
-        is `modelscope.preprocessors.SequenceClassificationPreprocessor`.
+        is `modelscope.preprocessors.TextClassificationTransformersPreprocessor`.
     Trainer:
         This model should be trained by dataset which has mixed languages,
@@ -124,27 +124,13 @@ class VecoForSequenceClassification(TorchModel,
         """
         model_dir = kwargs.pop('model_dir', None)
+        cfg = kwargs.pop('cfg', None)
+        model_args = parse_labels_in_order(model_dir, cfg, **kwargs)
         if model_dir is None:
-            config = VecoConfig(**kwargs)
+            config = VecoConfig(**model_args)
             model = cls(config)
         else:
-            model_kwargs = {}
-            label2id = kwargs.get('label2id', parse_label_mapping(model_dir))
-            id2label = kwargs.get(
-                'id2label', None if label2id is None else
-                {id: label
-                 for label, id in label2id.items()})
-            if id2label is not None and label2id is None:
-                label2id = {label: id for id, label in id2label.items()}
-            num_labels = kwargs.get(
-                'num_labels', None if label2id is None else len(label2id))
-            if num_labels is not None:
-                model_kwargs['num_labels'] = num_labels
-            if label2id is not None:
-                model_kwargs['label2id'] = label2id
-            if id2label is not None:
-                model_kwargs['id2label'] = id2label
             model = super(Model, cls).from_pretrained(
-                pretrained_model_name_or_path=model_dir, **model_kwargs)
+                pretrained_model_name_or_path=model_dir, **model_args)
         return model
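After this refactor the Veco and Sbert classification heads resolve labels through the same `parse_labels_in_order` path, so explicit overrides continue to win over whatever is found in the model directory. For illustration only (the model path is a placeholder, and the exact kwargs plumbing through from_pretrained may differ):

model = VecoForSequenceClassification.from_pretrained(
    model_dir='/path/to/veco_model',
    label2id={'negative': 0, 'positive': 1})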
@@ -15,6 +15,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
+import torch
 from transformers import RobertaForTokenClassification
 from modelscope.metainfo import Models
@@ -22,7 +23,7 @@ from modelscope.models import Model, TorchModel
 from modelscope.models.builder import MODELS
 from modelscope.outputs import AttentionTokenClassificationModelOutput
 from modelscope.utils.constant import Tasks
-from modelscope.utils.hub import parse_label_mapping
+from modelscope.utils.nlp.utils import parse_labels_in_order
 from .configuration import VecoConfig
@@ -58,6 +59,7 @@ class VecoForTokenClassification(TorchModel, RobertaForTokenClassification):
     def forward(self, *args, **kwargs):
         kwargs['return_dict'] = True
         outputs = super(Model, self).forward(*args, **kwargs)
         return AttentionTokenClassificationModelOutput(
             loss=outputs.loss,
             logits=outputs.logits,
@@ -81,27 +83,13 @@ class VecoForTokenClassification(TorchModel, RobertaForTokenClassification):
         """
         model_dir = kwargs.pop('model_dir', None)
+        cfg = kwargs.pop('cfg', None)
+        model_args = parse_labels_in_order(model_dir, cfg, **kwargs)
         if model_dir is None:
-            config = VecoConfig(**kwargs)
+            config = VecoConfig(**model_args)
             model = cls(config)
         else:
-            model_kwargs = {}
-            label2id = kwargs.get('label2id', parse_label_mapping(model_dir))
-            id2label = kwargs.get(
-                'id2label', None if label2id is None else
-                {id: label
-                 for label, id in label2id.items()})
-            if id2label is not None and label2id is None:
-                label2id = {label: id for id, label in id2label.items()}
-            num_labels = kwargs.get(
-                'num_labels', None if label2id is None else len(label2id))
-            if num_labels is not None:
-                model_kwargs['num_labels'] = num_labels
-            if label2id is not None:
-                model_kwargs['label2id'] = label2id
-            if id2label is not None:
-                model_kwargs['id2label'] = id2label
             model = super(Model, cls).from_pretrained(
-                pretrained_model_name_or_path=model_dir, **model_kwargs)
+                pretrained_model_name_or_path=model_dir, **model_args)
         return model
@@ -1,321 +0,0 @@
-# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.
-# Copyright 2021-2022 The Alibaba DAMO NLP Team Authors.
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License
-"""Tokenization classes for Veco. mainly copied from :module:`~transformers.tokenization_xlm_roberta`"""
-import os
-from shutil import copyfile
-from typing import Any, Dict, List, Optional, Tuple
-
-import sentencepiece as spm
-from transformers.tokenization_utils import AddedToken, PreTrainedTokenizer
-
-from modelscope.utils import logger as logging
-
-logger = logging.get_logger(__name__)
-
-SPIECE_UNDERLINE = '▁'
-
-VOCAB_FILES_NAMES = {'vocab_file': 'sentencepiece.bpe.model'}
-
-PRETRAINED_VOCAB_FILES_MAP = {'vocab_file': {}}
-
-PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {}
-
-
-class VecoTokenizer(PreTrainedTokenizer):
-    """
-    Adapted from [`RobertaTokenizer`] and [`XLNetTokenizer`]. Based on
-    [SentencePiece](https://github.com/google/sentencepiece).
-
-    This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods.
-    Users should refer to this superclass for more information regarding those methods.
-
-    Args:
-        vocab_file (`str`):
-            Path to the vocabulary file.
-        bos_token (`str`, *optional*, defaults to `"<s>"`):
-            The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
-
-            <Tip>
-
-            When building a sequence using special tokens, this is not the token that is used for the beginning of
-            sequence. The token used is the `cls_token`.
-
-            </Tip>
-
-        eos_token (`str`, *optional*, defaults to `"</s>"`):
-            The end of sequence token.
-
-            <Tip>
-
-            When building a sequence using special tokens, this is not the token that is used for the end of
-            sequence. The token used is the `sep_token`.
-
-            </Tip>
-
-        sep_token (`str`, *optional*, defaults to `"</s>"`):
-            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
-            sequence classification or for a text and a question for question answering. It is also used as the last
-            token of a sequence built with special tokens.
-        cls_token (`str`, *optional*, defaults to `"<s>"`):
-            The classifier token which is used when doing sequence classification (classification of the whole sequence
-            instead of per-token classification). It is the first token of the sequence when built with special tokens.
-        unk_token (`str`, *optional*, defaults to `"<unk>"`):
-            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
-            token instead.
-        pad_token (`str`, *optional*, defaults to `"<pad>"`):
-            The token used for padding, for example when batching sequences of different lengths.
-        mask_token (`str`, *optional*, defaults to `"<mask>"`):
-            The token used for masking values. This is the token used when training this model with masked language
-            modeling. This is the token which the model will try to predict.
-        additional_special_tokens (`List[str]`, *optional*, defaults to `["<s>NOTUSED", "</s>NOTUSED"]`):
-            Additional special tokens used by the tokenizer.
-        sp_model_kwargs (`dict`, *optional*):
-            Will be passed to the `SentencePieceProcessor.__init__()` method.
-            The [Python wrapper for SentencePiece](https://github.com/google/sentencepiece/tree/master/python)
-            can be used, among other things, to set:
-
-            - `enable_sampling`: Enable subword regularization.
-            - `nbest_size`: Sampling parameters for unigram. Invalid for BPE-Dropout.
-
-              - `nbest_size = {0,1}`: No sampling is performed.
-              - `nbest_size > 1`: samples from the nbest_size results.
-              - `nbest_size < 0`: assuming that nbest_size is infinite and samples from the all hypothesis (lattice)
-                using forward-filtering-and-backward-sampling algorithm.
-
-            - `alpha`: Smoothing parameter for unigram sampling, and dropout probability of merge operations for
-              BPE-dropout.
-
-    Attributes:
-        sp_model (`SentencePieceProcessor`):
-            The *SentencePiece* processor that is used for every conversion (string, tokens and IDs).
-    """
-
-    vocab_files_names = VOCAB_FILES_NAMES
-    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
-    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
-    model_input_names = ['input_ids', 'attention_mask']
-
-    def __init__(self,
-                 vocab_file,
-                 bos_token='<s>',
-                 eos_token='</s>',
-                 sep_token='</s>',
-                 cls_token='<s>',
-                 unk_token='<unk>',
-                 pad_token='<pad>',
-                 mask_token='<mask>',
-                 sp_model_kwargs: Optional[Dict[str, Any]] = None,
-                 **kwargs) -> None:
-        # Mask token behave like a normal word, i.e. include the space before it
-        mask_token = AddedToken(
-            mask_token, lstrip=True, rstrip=False) if isinstance(
-                mask_token, str) else mask_token
-
-        self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
-
-        super().__init__(
-            bos_token=bos_token,
-            eos_token=eos_token,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            cls_token=cls_token,
-            pad_token=pad_token,
-            mask_token=mask_token,
-            sp_model_kwargs=self.sp_model_kwargs,
-            **kwargs,
-        )
-
-        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
-        self.sp_model.Load(str(vocab_file))
-        self.vocab_file = vocab_file
-
-        # Original fairseq vocab and spm vocab must be "aligned":
-        # Vocab    |    0    |    1    |   2    |    3    |  4  |  5  |  6  |   7   |   8   |  9
-        # -------- | ------- | ------- | ------ | ------- | --- | --- | --- | ----- | ----- | ----
-        # fairseq  | '<s>'   | '<pad>' | '</s>' | '<unk>' | ',' | '.' | '▁' | 's'   | '▁de' | '-'
-        # spm      | '<unk>' | '<s>'   | '</s>' | ','     | '.' | '▁' | 's' | '▁de' | '-'   | '▁a'
-
-        # Mimic fairseq token-to-id alignment for the first 4 token
-        self.fairseq_tokens_to_ids = {
-            '<s>': 0,
-            '<pad>': 1,
-            '</s>': 2,
-            '<unk>': 3
-        }
-
-        # The first "real" token "," has position 4 in the original fairseq vocab and position 3 in the spm vocab
-        self.fairseq_offset = 1
-
-        self.fairseq_tokens_to_ids['<mask>'] = len(
-            self.sp_model) + self.fairseq_offset
-        self.fairseq_ids_to_tokens = {
-            v: k
-            for k, v in self.fairseq_tokens_to_ids.items()
-        }
-
-    def __getstate__(self):
-        state = self.__dict__.copy()
-        state['sp_model'] = None
-        state['sp_model_proto'] = self.sp_model.serialized_model_proto()
-        return state
-
-    def __setstate__(self, d):
-        self.__dict__ = d
-
-        # for backward compatibility
-        if not hasattr(self, 'sp_model_kwargs'):
-            self.sp_model_kwargs = {}
-
-        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
-        self.sp_model.LoadFromSerializedProto(self.sp_model_proto)
-
-    def build_inputs_with_special_tokens(
-            self,
-            token_ids_0: List[int],
-            token_ids_1: Optional[List[int]] = None) -> List[int]:
-        """
-        Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
-        adding special tokens. An Veco sequence has the following format:
-
-        - single sequence: `<s> X </s>`
-        - pair of sequences: `<s> A </s></s> B </s>`
-
-        Args:
-            token_ids_0 (`List[int]`):
-                List of IDs to which the special tokens will be added.
-            token_ids_1 (`List[int]`, *optional*):
-                Optional second list of IDs for sequence pairs.
-
-        Returns:
-            `List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
-        """
-        if token_ids_1 is None:
-            return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
-        cls = [self.cls_token_id]
-        sep = [self.sep_token_id]
-        return cls + token_ids_0 + sep + sep + token_ids_1 + sep
-
-    def get_special_tokens_mask(
-            self,
-            token_ids_0: List[int],
-            token_ids_1: Optional[List[int]] = None,
-            already_has_special_tokens: bool = False) -> List[int]:
-        """
-        Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
-        special tokens using the tokenizer `prepare_for_model` method.
-
-        Args:
-            token_ids_0 (`List[int]`):
-                List of IDs.
-            token_ids_1 (`List[int]`, *optional*):
-                Optional second list of IDs for sequence pairs.
-            already_has_special_tokens (`bool`, *optional*, defaults to `False`):
-                Whether or not the token list is already formatted with special tokens for the model.
-
-        Returns:
-            `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
-        """
-        if already_has_special_tokens:
-            return super().get_special_tokens_mask(
-                token_ids_0=token_ids_0,
-                token_ids_1=token_ids_1,
-                already_has_special_tokens=True)
-
-        if token_ids_1 is None:
-            return [1] + ([0] * len(token_ids_0)) + [1]
-        return [1] + ([0] * len(token_ids_0)) + [1, 1] + (
-            [0] * len(token_ids_1)) + [1]
-
-    def create_token_type_ids_from_sequences(
-            self,
-            token_ids_0: List[int],
-            token_ids_1: Optional[List[int]] = None) -> List[int]:
-        """
-        Create a mask from the two sequences passed to be used in a sequence-pair classification task. Veco does
-        not make use of token type ids, therefore a list of zeros is returned.
-
-        Args:
-            token_ids_0 (`List[int]`):
-                List of IDs.
-            token_ids_1 (`List[int]`, *optional*):
-                Optional second list of IDs for sequence pairs.
-
-        Returns:
-            `List[int]`: List of zeros.
-        """
-        sep = [self.sep_token_id]
-        cls = [self.cls_token_id]
-
-        if token_ids_1 is None:
-            return len(cls + token_ids_0 + sep) * [0]
-        return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0]
-
-    @property
-    def vocab_size(self):
-        return len(
-            self.sp_model) + self.fairseq_offset + 1  # Add the <mask> token
-
-    def get_vocab(self):
-        vocab = {
-            self.convert_ids_to_tokens(i): i
-            for i in range(self.vocab_size)
-        }
-        vocab.update(self.added_tokens_encoder)
-        return vocab
-
-    def _tokenize(self, text: str) -> List[str]:
-        return self.sp_model.encode(text, out_type=str)
-
-    def _convert_token_to_id(self, token):
-        """Converts a token (str) in an id using the vocab."""
-        if token in self.fairseq_tokens_to_ids:
-            return self.fairseq_tokens_to_ids[token]
-        spm_id = self.sp_model.PieceToId(token)
-
-        # Need to return unknown token if the SP model returned 0
-        return spm_id + self.fairseq_offset if spm_id else self.unk_token_id
-
-    def _convert_id_to_token(self, index):
-        """Converts an index (integer) in a token (str) using the vocab."""
-        if index in self.fairseq_ids_to_tokens:
-            return self.fairseq_ids_to_tokens[index]
-        return self.sp_model.IdToPiece(index - self.fairseq_offset)
-
-    def convert_tokens_to_string(self, tokens):
-        """Converts a sequence of tokens (strings for sub-words) in a single string."""
-        out_string = ''.join(tokens).replace(SPIECE_UNDERLINE, ' ').strip()
-        return out_string
-
-    def save_vocabulary(self,
-                        save_directory: str,
-                        filename_prefix: Optional[str] = None) -> Tuple[str]:
-        if not os.path.isdir(save_directory):
-            logger.error(
-                f'Vocabulary path ({save_directory}) should be a directory')
-            return
-        out_vocab_file = os.path.join(
-            save_directory, (filename_prefix + '-' if filename_prefix else '')
-            + VOCAB_FILES_NAMES['vocab_file'])
-
-        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):
-            copyfile(self.vocab_file, out_vocab_file)
-
-        return (out_vocab_file, )
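The main subtlety in the deleted class is the fairseq/spm id alignment documented in its `__init__`. A self-contained sketch of the same conversion rule, for reference:

    # Replicates the removed _convert_token_to_id logic: reserved tokens keep
    # fixed fairseq ids, ordinary pieces shift by fairseq_offset, and spm id 0
    # (spm's '<unk>') falls back to the unknown token id.
    fairseq_tokens_to_ids = {'<s>': 0, '<pad>': 1, '</s>': 2, '<unk>': 3}
    fairseq_offset = 1

    def convert_token_to_id(token: str, piece_to_id) -> int:
        if token in fairseq_tokens_to_ids:
            return fairseq_tokens_to_ids[token]
        spm_id = piece_to_id(token)
        return spm_id + fairseq_offset if spm_id else fairseq_tokens_to_ids['<unk>']

    # ',' has spm id 3 in the alignment table above, so it maps to fairseq id 4.
    assert convert_token_to_id(',', lambda t: 3) == 4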
@@ -1,213 +0,0 @@
-# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.
-# Copyright 2021-2022 The Alibaba DAMO NLP Team Authors.
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License
-"""Fast Tokenization classes for Veco. mainly copied from :module:`~transformers.tokenization_xlm_roberta_fast`"""
-import os
-from shutil import copyfile
-from typing import List, Optional, Tuple
-
-import transformers
-from transformers.file_utils import is_sentencepiece_available
-from transformers.tokenization_utils import AddedToken
-from transformers.tokenization_utils_fast import PreTrainedTokenizerFast
-
-from modelscope.utils import logger as logging
-
-if is_sentencepiece_available():
-    from .tokenization import VecoTokenizer
-else:
-    VecoTokenizer = None
-
-logger = logging.get_logger(__name__)
-
-VOCAB_FILES_NAMES = {
-    'vocab_file': 'sentencepiece.bpe.model',
-    'tokenizer_file': 'tokenizer.json'
-}
-
-PRETRAINED_VOCAB_FILES_MAP = {
-    'vocab_file': {},
-    'tokenizer_file': {},
-}
-
-PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {}
-
-transformers.SLOW_TO_FAST_CONVERTERS[
-    'VecoTokenizer'] = transformers.SLOW_TO_FAST_CONVERTERS[
-        'XLMRobertaTokenizer']
-
-
-class VecoTokenizerFast(PreTrainedTokenizerFast):
-    """
-    Adapted from [`RobertaTokenizer`] and [`XLNetTokenizer`].
-    Based on [BPE](https://huggingface.co/docs/tokenizers/python/latest/components.html?highlight=BPE#models).
-
-    This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main
-    methods. Users should refer to this superclass for more information regarding those methods.
-
-    Args:
-        vocab_file (`str`):
-            Path to the vocabulary file.
-        bos_token (`str`, *optional*, defaults to `"<s>"`):
-            The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
-
-            <Tip>
-
-            When building a sequence using special tokens, this is not the token that is used for the beginning of
-            sequence. The token used is the `cls_token`.
-
-            </Tip>
-
-        eos_token (`str`, *optional*, defaults to `"</s>"`):
-            The end of sequence token.
-
-            <Tip>
-
-            When building a sequence using special tokens, this is not the token that is used for the end of
-            sequence. The token used is the `sep_token`.
-
-            </Tip>
-
-        sep_token (`str`, *optional*, defaults to `"</s>"`):
-            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
-            sequence classification or for a text and a question for question answering. It is also used as the last
-            token of a sequence built with special tokens.
-        cls_token (`str`, *optional*, defaults to `"<s>"`):
-            The classifier token which is used when doing sequence classification (classification of the whole sequence
-            instead of per-token classification). It is the first token of the sequence when built with special tokens.
-        unk_token (`str`, *optional*, defaults to `"<unk>"`):
-            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
-            token instead.
-        pad_token (`str`, *optional*, defaults to `"<pad>"`):
-            The token used for padding, for example when batching sequences of different lengths.
-        mask_token (`str`, *optional*, defaults to `"<mask>"`):
-            The token used for masking values. This is the token used when training this model with masked language
-            modeling. This is the token which the model will try to predict.
-        additional_special_tokens (`List[str]`, *optional*, defaults to `["<s>NOTUSED", "</s>NOTUSED"]`):
-            Additional special tokens used by the tokenizer.
-    """
-
-    vocab_files_names = VOCAB_FILES_NAMES
-    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
-    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
-    model_input_names = ['input_ids', 'attention_mask']
-    slow_tokenizer_class = VecoTokenizer
-
-    def __init__(self,
-                 vocab_file=None,
-                 tokenizer_file=None,
-                 bos_token='<s>',
-                 eos_token='</s>',
-                 sep_token='</s>',
-                 cls_token='<s>',
-                 unk_token='<unk>',
-                 pad_token='<pad>',
-                 mask_token='<mask>',
-                 **kwargs):
-        # Mask token behave like a normal word, i.e. include the space before it
-        mask_token = AddedToken(
-            mask_token, lstrip=True, rstrip=False) if isinstance(
-                mask_token, str) else mask_token
-
-        super().__init__(
-            vocab_file,
-            tokenizer_file=tokenizer_file,
-            bos_token=bos_token,
-            eos_token=eos_token,
-            sep_token=sep_token,
-            cls_token=cls_token,
-            unk_token=unk_token,
-            pad_token=pad_token,
-            mask_token=mask_token,
-            **kwargs,
-        )
-
-        self.vocab_file = vocab_file
-        self.can_save_slow_tokenizer = False if not self.vocab_file else True
-
-    def build_inputs_with_special_tokens(
-            self,
-            token_ids_0: List[int],
-            token_ids_1: Optional[List[int]] = None) -> List[int]:
-        """
-        Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
-        adding special tokens. An Veco sequence has the following format:
-
-        - single sequence: `<s> X </s>`
-        - pair of sequences: `<s> A </s></s> B </s>`
-
-        Args:
-            token_ids_0 (`List[int]`):
-                List of IDs to which the special tokens will be added.
-            token_ids_1 (`List[int]`, *optional*):
-                Optional second list of IDs for sequence pairs.
-
-        Returns:
-            `List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
-        """
-        if token_ids_1 is None:
-            return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
-        cls = [self.cls_token_id]
-        sep = [self.sep_token_id]
-        return cls + token_ids_0 + sep + sep + token_ids_1 + sep
-
-    def create_token_type_ids_from_sequences(
-            self,
-            token_ids_0: List[int],
-            token_ids_1: Optional[List[int]] = None) -> List[int]:
-        """
-        Create a mask from the two sequences passed to be used in a sequence-pair classification task. Veco does
-        not make use of token type ids, therefore a list of zeros is returned.
-
-        Args:
-            token_ids_0 (`List[int]`):
-                List of IDs.
-            token_ids_1 (`List[int]`, *optional*):
-                Optional second list of IDs for sequence pairs.
-
-        Returns:
-            `List[int]`: List of zeros.
-        """
-        sep = [self.sep_token_id]
-        cls = [self.cls_token_id]
-
-        if token_ids_1 is None:
-            return len(cls + token_ids_0 + sep) * [0]
-        return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0]
-
-    def save_vocabulary(self,
-                        save_directory: str,
-                        filename_prefix: Optional[str] = None) -> Tuple[str]:
-        if not self.can_save_slow_tokenizer:
-            raise ValueError(
-                'Your fast tokenizer does not have the necessary information to save the vocabulary for a slow '
-                'tokenizer.')
-
-        if not os.path.isdir(save_directory):
-            logger.error(
-                f'Vocabulary path ({save_directory}) should be a directory.')
-            return
-        out_vocab_file = os.path.join(
-            save_directory, (filename_prefix + '-' if filename_prefix else '')
-            + VOCAB_FILES_NAMES['vocab_file'])
-
-        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):
-            copyfile(self.vocab_file, out_vocab_file)
-
-        return (out_vocab_file, )
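Both deleted files registered XLM-Roberta's slow-to-fast converter for `VecoTokenizer`, so the removed classes were behaviourally XLM-Roberta tokenizers. A hedged sketch of the drop-in replacement, assuming the model directory still ships `sentencepiece.bpe.model`/`tokenizer.json` as before:

    from transformers import XLMRobertaTokenizerFast

    # '/path/to/veco_model' is a placeholder for a local model directory.
    tokenizer = XLMRobertaTokenizerFast.from_pretrained('/path/to/veco_model')
    print(tokenizer('Hello world')['input_ids'])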
| @@ -1,179 +1,13 @@ | |||||
| from dataclasses import dataclass | from dataclasses import dataclass | ||||
| from typing import List, Optional, Tuple, Union | |||||
| from typing import Optional, Tuple, Union | |||||
| import numpy as np | |||||
| from modelscope.outputs.outputs import ModelOutputBase | from modelscope.outputs.outputs import ModelOutputBase | ||||
| Tensor = Union['torch.Tensor', 'tf.Tensor'] | Tensor = Union['torch.Tensor', 'tf.Tensor'] | ||||
| @dataclass | |||||
| class TextClassificationModelOutput(ModelOutputBase): | |||||
| """The output class for text classification models. | |||||
| Args: | |||||
| logits (`Tensor`): The logits output of the model. loss (`Tensor`, | |||||
| *optional*) The loss of the model, available when training. | |||||
| hidden_states (`Tensor`, *optional*) Hidden-states of the model at the | |||||
| output of each layer plus the optional initial embedding outputs. | |||||
| """ | |||||
| logits: Tensor = None | |||||
| loss: Tensor = None | |||||
| @dataclass | |||||
| class TokenClassificationModelOutput(ModelOutputBase): | |||||
| """The output class for token classification models. | |||||
| logits (`Tensor`): The logits output of the model. | |||||
| loss (`Tensor`, *optional*) The loss of the model, available when training. | |||||
| """ | |||||
| logits: Tensor = None | |||||
| loss: Tensor = None | |||||
| offset_mapping: Tensor = None | |||||
| @dataclass | |||||
| class FillMaskModelOutput(ModelOutputBase): | |||||
| """The output class for text classification models. | |||||
| Args: | |||||
| logits (`Tensor`): The logits output of the model. | |||||
| loss (`Tensor`, *optional*) The loss of the model, available when training. | |||||
| input_ids (`Tensor`, *optional*) The input id tensor fed into the model. | |||||
| hidden_states (`Tensor`, *optional*) Hidden-states of the model at the | |||||
| output of each layer plus the optional initial embedding outputs. | |||||
| """ | |||||
| logits: Tensor = None | |||||
| loss: Tensor = None | |||||
| input_ids: Tensor = None | |||||
| hidden_states: Tensor = None | |||||
| @dataclass | |||||
| class TokenClassifierOutput(ModelOutputBase): | |||||
| """ | |||||
| Base class for outputs of token classification models. | |||||
| Args: | |||||
| loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when | |||||
| `labels` is provided) : | |||||
| Classification loss. | |||||
| logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, | |||||
| config.num_labels)`): | |||||
| Classification scores (before SoftMax). | |||||
| hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when | |||||
| `output_hidden_states=True` is passed or when | |||||
| `config.output_hidden_states=True`): | |||||
| Tuple of `torch.FloatTensor` (one for the output of the embeddings, | |||||
| if the model has an embedding layer, + one for the output of each | |||||
| layer) of shape `(batch_size, sequence_length, hidden_size)`. | |||||
| Hidden-states of the model at the output of each layer plus the | |||||
| optional initial embedding outputs. | |||||
| attentions (`tuple(torch.FloatTensor)`, *optional*, returned when | |||||
| `output_attentions=True` is passed or when | |||||
| `config.output_attentions=True`): | |||||
| Tuple of `torch.FloatTensor` (one for each layer) of shape | |||||
| `(batch_size, num_heads, sequence_length, sequence_length)`. | |||||
| Attentions weights after the attention softmax, used to compute the | |||||
| weighted average in the self-attention heads. | |||||
| offset_mapping (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, | |||||
| sequence_length)`, `optional`): | |||||
| Indices of positions of each input sequence tokens in the sentence. | |||||
| Selected in the range ``[0, sequence_length - 1]``. | |||||
| """ | |||||
| loss: Tensor = None | |||||
| logits: Tensor = None | |||||
| hidden_states: Tensor = None | |||||
| attentions: Tensor = None | |||||
| offset_mapping: Tensor = None | |||||
| @dataclass | |||||
| class TokenClassifierWithPredictionsOutput(ModelOutputBase): | |||||
| """ | |||||
| Base class for outputs of token classification models. | |||||
| Args: | |||||
| loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when | |||||
| `labels` is provided) : | |||||
| Classification loss. | |||||
| logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, | |||||
| config.num_labels)`): | |||||
| Classification scores (before SoftMax). | |||||
| hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when | |||||
| `output_hidden_states=True` is passed or when | |||||
| `config.output_hidden_states=True`): | |||||
| Tuple of `torch.FloatTensor` (one for the output of the embeddings, | |||||
| if the model has an embedding layer, + one for the output of each | |||||
| layer) of shape `(batch_size, sequence_length, hidden_size)`. | |||||
| Hidden-states of the model at the output of each layer plus the | |||||
| optional initial embedding outputs. | |||||
| attentions (`tuple(torch.FloatTensor)`, *optional*, returned when | |||||
| `output_attentions=True` is passed or when | |||||
| `config.output_attentions=True`): | |||||
| Tuple of `torch.FloatTensor` (one for each layer) of shape | |||||
| `(batch_size, num_heads, sequence_length, sequence_length)`. | |||||
| Attentions weights after the attention softmax, used to compute the | |||||
| weighted average in the self-attention heads. | |||||
| offset_mapping (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, | |||||
| sequence_length)`, `optional`): | |||||
| Indices of positions of each input sequence tokens in the sentence. | |||||
| Selected in the range ``[0, sequence_length - 1]``. | |||||
| predictions: A PyTorch tensor of the best tag sequence for each batch of shape | |||||
| (nbest, batch_size, seq_length) | |||||
| """ | |||||
| loss: Tensor = None | |||||
| logits: Tensor = None | |||||
| hidden_states: Tensor = None | |||||
| attentions: Tensor = None | |||||
| offset_mapping: Tensor = None | |||||
| predictions: Tensor = None | |||||
| @dataclass | |||||
| class BaseModelOutput(ModelOutputBase): | |||||
| """ | |||||
| Base class for model's outputs, with potential hidden states and attentions. | |||||
| Args: | |||||
| last_hidden_state (`torch.FloatTensor` of shape `(batch_size, | |||||
| sequence_length, hidden_size)`): | |||||
| Sequence of hidden-states at the output of the last layer of the | |||||
| model. | |||||
| hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when | |||||
| `output_hidden_states=True` is passed or when | |||||
| `config.output_hidden_states=True`): | |||||
| Tuple of `torch.FloatTensor` (one for the output of the embeddings, | |||||
| if the model has an embedding layer, + one for the output of each | |||||
| layer) of shape `(batch_size, sequence_length, hidden_size)`. | |||||
| Hidden-states of the model at the output of each layer plus the | |||||
| optional initial embedding outputs. | |||||
| attentions (`tuple(torch.FloatTensor)`, *optional*, returned when | |||||
| `output_attentions=True` is passed or when | |||||
| `config.output_attentions=True`): | |||||
| Tuple of `torch.FloatTensor` (one for each layer) of shape | |||||
| `(batch_size, num_heads, sequence_length, sequence_length)`. | |||||
| Attentions weights after the attention softmax, used to compute the | |||||
| weighted average in the self-attention heads. | |||||
| """ | |||||
| last_hidden_state: Tensor = None | |||||
| hidden_states: Optional[Tuple[Tensor]] = None | |||||
| attentions: Optional[Tuple[Tensor]] = None | |||||
| @dataclass | @dataclass | ||||
| class BackboneModelOutput(ModelOutputBase): | class BackboneModelOutput(ModelOutputBase): | ||||
| """The output class for text classification models. | """The output class for text classification models. | ||||
@@ -196,81 +30,6 @@ class AttentionBackboneModelOutput(BackboneModelOutput):
     """The output class for backbones of attention based models.
 
     Args:
-        attentions (`tuple(Tensor)`, *optional* Attentions weights after the
-            attention softmax, used to compute the weighted average in the
-            self-attention heads.
-    """
-
-    attentions: Tensor = None
-    past_key_values: Tensor = None
-    cross_attentions: Tensor = None
-
-
-@dataclass
-class AttentionTextClassificationModelOutput(TextClassificationModelOutput):
-    """The output class for backbones of attention based models.
-
-    Args:
-        attentions (`tuple(Tensor)`, *optional* Attentions weights after the
-            attention softmax, used to compute the weighted average in the
-            self-attention heads.
-    """
-
-    attentions: Tensor = None
-    hidden_states: Tensor = None
-
-
-@dataclass
-class AttentionTokenClassificationModelOutput(TokenClassificationModelOutput):
-    """The output class for backbones of attention based models.
-
-    Args:
-        attentions (`tuple(Tensor)`, *optional* Attentions weights after the attention softmax,
-            used to compute the weighted average in the self-attention heads.
-    """
-
-    attentions: Tensor = None
-    hidden_states: Tensor = None
-
-
-@dataclass
-class AttentionFillMaskModelOutput(FillMaskModelOutput):
-    """The output class for the fill mask and attention based models.
-
-    Args:
-        attentions (`tuple(Tensor)`, *optional* Attentions weights after the
-            attention softmax, used to compute the weighted average in the
-            self-attention heads.
-    """
-
-    attentions: Tensor = None
-
-
-@dataclass
-class BaseModelOutputWithPoolingAndCrossAttentions(ModelOutputBase):
-    """
-    Base class for model's outputs that also contains a pooling of the last
-    hidden states.
-
-    Args:
-        last_hidden_state (`torch.FloatTensor` of shape `(batch_size,
-            sequence_length, hidden_size)`):
-            Sequence of hidden-states at the output of the last layer of the
-            model.
-        pooler_output (`torch.FloatTensor` of shape `(batch_size,
-            hidden_size)`):
-            Last layer hidden-state of the first token of the sequence
-            (classification token) after further processing through the layers
-            used for the auxiliary pretraining task. E.g. for BERT-family of
-            models, this returns the classification token after processing
-            through a linear layer and a tanh activation function. The linear
-            layer weights are trained from the next sentence prediction
-            (classification) objective during pretraining.
-        hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when
-            `output_hidden_states=True` is passed or when
-            `config.output_hidden_states=True`):
-            Tuple of `torch.FloatTensor` (one for the output of the embeddings,
-            if the model has an embedding layer, + one for the output of each
-            layer) of shape `(batch_size, sequence_length, hidden_size)`.
-            Hidden-states of the model at the output of each layer plus the
-            optional initial embedding outputs.
         attentions (`tuple(torch.FloatTensor)`, *optional*, returned when
             `output_attentions=True` is passed or when
             `config.output_attentions=True`):
@@ -303,75 +62,8 @@ class BaseModelOutputWithPoolingAndCrossAttentions(ModelOutputBase):
             can be used (see `past_key_values` input) to speed up sequential
             decoding.
     """
-
-    last_hidden_state: Tensor = None
-    pooler_output: Tensor = None
-    hidden_states: Tensor = None
-    past_key_values: Tensor = None
     attentions: Tensor = None
-    cross_attentions: Tensor = None
-
-
-@dataclass
-class BaseModelOutputWithPastAndCrossAttentions(ModelOutputBase):
-    """
-    Base class for model's outputs that may also contain a past key/values (to
-    speed up sequential decoding).
-
-    Args:
-        last_hidden_state (`torch.FloatTensor` of shape `(batch_size,
-            sequence_length, hidden_size)`):
-            Sequence of hidden-states at the output of the last layer of the
-            model.
-
-            If `past_key_values` is used only the last hidden-state of the
-            sequences of shape `(batch_size, 1, hidden_size)` is output.
-        past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned
-            when `use_cache=True` is passed or when `config.use_cache=True`):
-            Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`,
-            with each tuple having 2 tensors of shape `(batch_size, num_heads,
-            sequence_length, embed_size_per_head)`) and optionally if
-            `config.is_encoder_decoder=True` 2 additional tensors of shape
-            `(batch_size, num_heads, encoder_sequence_length,
-            embed_size_per_head)`.
-
-            Contains pre-computed hidden-states (key and values in the
-            self-attention blocks and optionally if
-            `config.is_encoder_decoder=True` in the cross-attention blocks) that
-            can be used (see `past_key_values` input) to speed up sequential
-            decoding.
-        hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when
-            `output_hidden_states=True` is passed or when
-            `config.output_hidden_states=True`):
-            Tuple of `torch.FloatTensor` (one for the output of the embeddings,
-            if the model has an embedding layer, + one for the output of each
-            layer) of shape `(batch_size, sequence_length, hidden_size)`.
-            Hidden-states of the model at the output of each layer plus the
-            optional initial embedding outputs.
-        attentions (`tuple(torch.FloatTensor)`, *optional*, returned when
-            `output_attentions=True` is passed or when
-            `config.output_attentions=True`):
-            Tuple of `torch.FloatTensor` (one for each layer) of shape
-            `(batch_size, num_heads, sequence_length, sequence_length)`.
-            Attentions weights after the attention softmax, used to compute the
-            weighted average in the self-attention heads.
-        cross_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when
-            `output_attentions=True` and `config.add_cross_attention=True` is passed
-            or when `config.output_attentions=True`):
-            Tuple of `torch.FloatTensor` (one for each layer) of shape
-            `(batch_size, num_heads, sequence_length, sequence_length)`.
-            Attentions weights of the decoder's cross-attention layer, after the
-            attention softmax, used to compute the weighted average in the
-            cross-attention heads.
-    """
-
-    last_hidden_state: Tensor = None
     past_key_values: Tensor = None
-    hidden_states: Tensor = None
-    attentions: Tensor = None
     cross_attentions: Tensor = None
@@ -459,6 +151,60 @@ class Seq2SeqModelOutput(ModelOutputBase):
     encoder_attentions: Optional[Tuple[Tensor]] = None
 
 
+@dataclass
+class FaqQuestionAnsweringOutput(ModelOutputBase):
+    """The output class for faq QA models.
+    """
+
+    scores: Tensor = None
+
+
+@dataclass
+class FeatureExtractionOutput(ModelOutputBase):
+    """The output class for feature extraction models.
+    """
+
+    text_embedding: Tensor = None
+
+
+@dataclass
+class FillMaskModelOutput(ModelOutputBase):
+    """The output class for fill mask models.
+
+    Args:
+        logits (`Tensor`): The logits output of the model.
+        loss (`Tensor`, *optional*) The loss of the model, available when training.
+        input_ids (`Tensor`, *optional*) The input id tensor fed into the model.
+        hidden_states (`Tensor`, *optional*) Hidden-states of the model at the
+            output of each layer plus the optional initial embedding outputs.
+    """
+
+    logits: Tensor = None
+    loss: Tensor = None
+    input_ids: Tensor = None
+    hidden_states: Tensor = None
+
+
+@dataclass
+class AttentionFillMaskModelOutput(FillMaskModelOutput):
+    """The output class for the fill mask and attention based models.
+
+    Args:
+        attentions (`tuple(Tensor)`, *optional*) Attentions weights after the
+            attention softmax, used to compute the weighted average in the
+            self-attention heads.
+    """
+
+    attentions: Tensor = None
+
+
+@dataclass
+class InformationExtractionOutput(ModelOutputBase):
+    """The output class for information extraction models.
+    """
+
+    spo_list: np.ndarray = None
+
+
 @dataclass
 class Seq2SeqLMOutput(ModelOutputBase):
     """
@@ -543,6 +289,42 @@ class Seq2SeqLMOutput(ModelOutputBase):
     encoder_attentions: Optional[Tuple[Tensor]] = None
 
 
+@dataclass
+class TextClassificationModelOutput(ModelOutputBase):
+    """The output class for text classification models.
+
+    Args:
+        logits (`Tensor`): The logits output of the model.
+        loss (`Tensor`, *optional*) The loss of the model, available when training.
+        hidden_states (`Tensor`, *optional*) Hidden-states of the model at the
+            output of each layer plus the optional initial embedding outputs.
+    """
+
+    logits: Tensor = None
+    loss: Tensor = None
+
+
+@dataclass
+class AttentionTextClassificationModelOutput(TextClassificationModelOutput):
+    """The output class for attention based text classification models.
+
+    Args:
+        attentions (`tuple(Tensor)`, *optional*) Attentions weights after the
+            attention softmax, used to compute the weighted average in the
+            self-attention heads.
+    """
+
+    attentions: Tensor = None
+    hidden_states: Tensor = None
+
+
+@dataclass
+class TextErrorCorrectionOutput(ModelOutputBase):
+    """The output class for text error correction models.
+    """
+
+    predictions: np.ndarray = None
+
+
 @dataclass
 class TextGenerationModelOutput(ModelOutputBase):
     """The output class for text generation models.
@@ -588,3 +370,35 @@ class TokenGeneratorOutput(ModelOutputBase):
     scores: Optional[Tuple[Tensor]] = None
     attentions: Optional[Tuple[Tuple[Tensor]]] = None
     hidden_states: Optional[Tuple[Tuple[Tensor]]] = None
+
+
+@dataclass
+class TokenClassificationModelOutput(ModelOutputBase):
+    """The output class for token classification models.
+
+    Args:
+        logits (`Tensor`): The logits output of the model.
+        loss (`Tensor`, *optional*) The loss of the model, available when training.
+        predictions (`Tensor`, *optional*) A PyTorch tensor of the best tag sequence for each batch of shape
+            (nbest, batch_size, seq_length)
+        offset_mapping (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,
+            sequence_length)`, `optional`):
+            Indices of positions of each input sequence tokens in the sentence.
+            Selected in the range ``[0, sequence_length - 1]``.
+    """
+
+    logits: Tensor = None
+    loss: Tensor = None
+    offset_mapping: Tensor = None
+    predictions: Tensor = None
+    label_mask: Tensor = None
+
+
+@dataclass
+class AttentionTokenClassificationModelOutput(TokenClassificationModelOutput):
+    """The output class for attention based token classification models.
+
+    Args:
+        attentions (`tuple(Tensor)`, *optional*) Attentions weights after the attention softmax,
+            used to compute the weighted average in the self-attention heads.
+    """
+
+    attentions: Tensor = None
+    hidden_states: Tensor = None
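A short sketch of how these output dataclasses are typically consumed; it assumes `ModelOutputBase` provides the dict-like access (`keys()`, membership tests) that the pipeline check in the next hunk relies on, mirroring transformers' `ModelOutput` behaviour:

    import torch
    from modelscope.outputs import TokenClassificationModelOutput

    output = TokenClassificationModelOutput(logits=torch.randn(1, 8, 5))
    print(output.logits.shape)         # attribute access
    print('logits' in output.keys())   # dict-style access used by Pipeline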
@@ -12,7 +12,7 @@ import numpy as np
 from modelscope.models.base import Model
 from modelscope.msdatasets import MsDataset
-from modelscope.outputs import TASK_OUTPUTS
+from modelscope.outputs import TASK_OUTPUTS, ModelOutputBase
 from modelscope.pipeline_inputs import TASK_INPUTS, check_input_type
 from modelscope.preprocessors import Preprocessor
 from modelscope.utils.config import Config
@@ -321,6 +321,8 @@ class Pipeline(ABC):
             return
         output_keys = TASK_OUTPUTS[task_name]
         missing_keys = []
+        input = input.keys() if isinstance(input,
+                                           (dict, ModelOutputBase)) else input
         for k in output_keys:
             if k not in input:
                 missing_keys.append(k)
@@ -298,6 +298,7 @@ def pipeline(task: str = None,
         raise ValueError('task or pipeline_name is required')
 
     model = normalize_model_input(model, model_revision)
+    pipeline_props = {'type': pipeline_name}
     if pipeline_name is None:
         # get default pipeline for this task
         if isinstance(model, str) \
@@ -309,7 +310,7 @@ def pipeline(task: str = None,
                 model, str) else read_config(
                     model[0], revision=model_revision)
             check_config(cfg)
-            pipeline_name = cfg.pipeline.type
+            pipeline_props = cfg.pipeline
         elif model is not None:
             # get pipeline info from Model object
             first_model = model[0] if isinstance(model, list) else model
@@ -318,13 +319,15 @@ def pipeline(task: str = None,
                 cfg = read_config(first_model.model_dir)
                 check_config(cfg)
                 first_model.pipeline = cfg.pipeline
-            pipeline_name = first_model.pipeline.type
+            pipeline_props = first_model.pipeline
         else:
             pipeline_name, default_model_repo = get_default_pipeline_info(task)
             model = normalize_model_input(default_model_repo, model_revision)
+            pipeline_props = {'type': pipeline_name}
 
-    cfg = ConfigDict(type=pipeline_name, model=model)
-    cfg.device = device
+    pipeline_props['model'] = model
+    pipeline_props['device'] = device
+    cfg = ConfigDict(pipeline_props)
 
     if kwargs:
         cfg.update(kwargs)
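The effect of the builder change above: the model's whole `pipeline` configuration section now flows into the pipeline constructor, not just its `type`. A hedged sketch with illustrative field names:

    # Suppose configuration.json contains:
    #   "pipeline": {"type": "text-classification", "sequence_length": 128}
    pipeline_props = {'type': 'text-classification', 'sequence_length': 128}
    pipeline_props['model'] = 'damo/example-model-id'  # hypothetical model id
    pipeline_props['device'] = 'gpu'
    # ConfigDict(pipeline_props) is then built, so `sequence_length` reaches
    # the pipeline's __init__ as an extra keyword argument.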
@@ -61,6 +61,8 @@ class EasyCVPipeline(object):
         self.cfg = Config.from_file(self.config_file)
         if 'device' in kwargs:
             kwargs['device'] = create_device(kwargs['device'])
+        if 'predictor_config' in kwargs:
+            kwargs.pop('predictor_config')
 
         self.predict_op = self._build_predict_op(**kwargs)
 
     def _build_predict_op(self, **kwargs):
@@ -12,22 +12,19 @@ if TYPE_CHECKING:
     from .dialog_state_tracking_pipeline import DialogStateTrackingPipeline
     from .document_segmentation_pipeline import DocumentSegmentationPipeline
     from .extractive_summarization_pipeline import ExtractiveSummarizationPipeline
-    from .fasttext_sequence_classification_pipeline import FasttextSequenceClassificationPipeline
+    from .fasttext_text_classification_pipeline import FasttextSequenceClassificationPipeline
    from .faq_question_answering_pipeline import FaqQuestionAnsweringPipeline
     from .feature_extraction_pipeline import FeatureExtractionPipeline
     from .fill_mask_pipeline import FillMaskPipeline
     from .information_extraction_pipeline import InformationExtractionPipeline
-    from .named_entity_recognition_pipeline import NamedEntityRecognitionPipeline, \
-        NamedEntityRecognitionThaiPipeline, \
-        NamedEntityRecognitionVietPipeline
+    from .named_entity_recognition_pipeline import NamedEntityRecognitionPipeline
     from .text_ranking_pipeline import TextRankingPipeline
     from .sentence_embedding_pipeline import SentenceEmbeddingPipeline
     from .text_classification_pipeline import TextClassificationPipeline
     from .summarization_pipeline import SummarizationPipeline
     from .translation_quality_estimation_pipeline import TranslationQualityEstimationPipeline
     from .text_error_correction_pipeline import TextErrorCorrectionPipeline
-    from .text_generation_pipeline import TextGenerationPipeline
+    from .text2text_generation_pipeline import Text2TextGenerationPipeline
+    from .text_generation_pipeline import TextGenerationPipeline, TextGenerationT5Pipeline
     from .token_classification_pipeline import TokenClassificationPipeline
     from .translation_pipeline import TranslationPipeline
     from .word_segmentation_pipeline import WordSegmentationPipeline, WordSegmentationThaiPipeline
@@ -56,8 +53,6 @@ else:
         'information_extraction_pipeline': ['InformationExtractionPipeline'],
         'named_entity_recognition_pipeline': [
             'NamedEntityRecognitionPipeline',
-            'NamedEntityRecognitionThaiPipeline',
-            'NamedEntityRecognitionVietPipeline'
         ],
         'text_ranking_pipeline': ['TextRankingPipeline'],
         'sentence_embedding_pipeline': ['SentenceEmbeddingPipeline'],
@@ -66,7 +61,8 @@ else:
         ['TableQuestionAnsweringPipeline'],
         'text_classification_pipeline': ['TextClassificationPipeline'],
         'text_error_correction_pipeline': ['TextErrorCorrectionPipeline'],
-        'text_generation_pipeline': ['TextGenerationPipeline'],
+        'text_generation_pipeline':
+        ['TextGenerationPipeline', 'TextGenerationT5Pipeline'],
         'text2text_generation_pipeline': ['Text2TextGenerationPipeline'],
         'token_classification_pipeline': ['TokenClassificationPipeline'],
         'translation_pipeline': ['TranslationPipeline'],
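For context, a hedged sketch of how such an `_import_structure` mapping is usually consumed in these `__init__` modules (the `LazyImportModule` wiring itself is outside this hunk):

    import sys

    from modelscope.utils.import_utils import LazyImportModule

    _import_structure = {
        'text_generation_pipeline':
        ['TextGenerationPipeline', 'TextGenerationT5Pipeline'],
    }
    sys.modules[__name__] = LazyImportModule(
        __name__, globals()['__file__'], _import_structure, module_spec=__spec__)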
@@ -24,18 +24,27 @@ class ConversationalTextToSqlPipeline(Pipeline):
     def __init__(self,
                  model: Union[StarForTextToSql, str],
                  preprocessor: ConversationalTextToSqlPreprocessor = None,
+                 config_file: str = None,
+                 device: str = 'gpu',
+                 auto_collate=True,
                  **kwargs):
         """use `model` and `preprocessor` to create a conversational text-to-sql prediction pipeline
 
         Args:
-            model (StarForTextToSql): a model instance
-            preprocessor (ConversationalTextToSqlPreprocessor):
-                a preprocessor instance
+            model (StarForTextToSql): A model instance
+            preprocessor (ConversationalTextToSqlPreprocessor): A preprocessor instance
+            kwargs (dict, `optional`):
+                Extra kwargs passed into the preprocessor's constructor.
         """
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
+        super().__init__(
+            model=model,
+            preprocessor=preprocessor,
+            config_file=config_file,
+            device=device,
+            auto_collate=auto_collate)
         if preprocessor is None:
             self.preprocessor = ConversationalTextToSqlPreprocessor(
-                self.model.model_dir)
+                self.model.model_dir, **kwargs)
 
     def postprocess(self, inputs: Dict[str, Any]) -> Dict[str, str]:
         """process the prediction results
@@ -22,6 +22,9 @@ class DialogIntentPredictionPipeline(Pipeline):
     def __init__(self,
                  model: Union[SpaceForDialogIntent, str],
                  preprocessor: DialogIntentPredictionPreprocessor = None,
+                 config_file: str = None,
+                 device: str = 'gpu',
+                 auto_collate=True,
                  **kwargs):
         """Use `model` and `preprocessor` to create a dialog intent prediction pipeline
 
@@ -29,11 +32,18 @@ class DialogIntentPredictionPipeline(Pipeline):
             model (str or SpaceForDialogIntent): Supply either a local model dir or a model id from the model hub,
                 or a SpaceForDialogIntent instance.
             preprocessor (DialogIntentPredictionPreprocessor): An optional preprocessor instance.
+            kwargs (dict, `optional`):
+                Extra kwargs passed into the preprocessor's constructor.
         """
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
+        super().__init__(
+            model=model,
+            preprocessor=preprocessor,
+            config_file=config_file,
+            device=device,
+            auto_collate=auto_collate)
         if preprocessor is None:
             self.preprocessor = DialogIntentPredictionPreprocessor(
-                self.model.model_dir)
+                self.model.model_dir, **kwargs)
         self.categories = self.preprocessor.categories
 
     def postprocess(self, inputs: Dict[str, Any]) -> Dict[str, str]:
@@ -21,6 +21,9 @@ class DialogModelingPipeline(Pipeline):
     def __init__(self,
                  model: Union[SpaceForDialogModeling, str],
                  preprocessor: DialogModelingPreprocessor = None,
+                 config_file: str = None,
+                 device: str = 'gpu',
+                 auto_collate=True,
                  **kwargs):
         """Use `model` and `preprocessor` to create a dialog modeling pipeline for dialog response generation
 
@@ -28,11 +31,18 @@ class DialogModelingPipeline(Pipeline):
             model (str or SpaceForDialogModeling): Supply either a local model dir or a model id from the model hub,
                 or a SpaceForDialogModeling instance.
             preprocessor (DialogModelingPreprocessor): An optional preprocessor instance.
+            kwargs (dict, `optional`):
+                Extra kwargs passed into the preprocessor's constructor.
         """
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
+        super().__init__(
+            model=model,
+            preprocessor=preprocessor,
+            config_file=config_file,
+            device=device,
+            auto_collate=auto_collate)
         if preprocessor is None:
             self.preprocessor = DialogModelingPreprocessor(
-                self.model.model_dir)
+                self.model.model_dir, **kwargs)
 
     def postprocess(self, inputs: Dict[str, Tensor]) -> Dict[str, str]:
         """process the prediction results
@@ -22,6 +22,9 @@ class DialogStateTrackingPipeline(Pipeline):
     def __init__(self,
                  model: Union[SpaceForDST, str],
                  preprocessor: DialogStateTrackingPreprocessor = None,
+                 config_file: str = None,
+                 device: str = 'gpu',
+                 auto_collate=True,
                  **kwargs):
         """use `model` and `preprocessor` to create a dialog state tracking pipeline for
         observation of dialog states tracking after many turns of open domain dialogue
 
@@ -30,11 +33,20 @@ class DialogStateTrackingPipeline(Pipeline):
             model (str or SpaceForDialogStateTracking): Supply either a local model dir or a model id
                 from the model hub, or a SpaceForDialogStateTracking instance.
             preprocessor (DialogStateTrackingPreprocessor): An optional preprocessor instance.
+            kwargs (dict, `optional`):
+                Extra kwargs passed into the preprocessor's constructor.
         """
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
+        super().__init__(
+            model=model,
+            preprocessor=preprocessor,
+            config_file=config_file,
+            device=device,
+            auto_collate=auto_collate)
         if preprocessor is None:
             self.preprocessor = DialogStateTrackingPreprocessor(
-                self.model.model_dir)
+                self.model.model_dir, **kwargs)
         self.tokenizer = self.preprocessor.tokenizer
         self.config = self.preprocessor.config
@@ -21,8 +21,16 @@ class DistributedGPT3Pipeline(DistributedPipeline):
     model = None

     def __init__(self, model, preprocessor=None, **kwargs):
+        """
+        Args:
+            model: The model piece, str is not supported.
+            preprocessor: The preprocessor matched with the model.
+            kwargs (dict, `optional`):
+                Extra kwargs passed into the preprocessor's constructor.
+        """
         if preprocessor is None:
-            preprocessor = TextGenerationJiebaPreprocessor(model)
+            preprocessor = TextGenerationJiebaPreprocessor(model, **kwargs)
         super().__init__(model, preprocessor=preprocessor, **kwargs)
         assert hasattr(preprocessor, 'tokenizer')
@@ -8,7 +8,7 @@ from modelscope.metainfo import Pipelines
 from modelscope.models.nlp.plug import DistributedPlug
 from modelscope.pipelines.base import DistributedPipeline
 from modelscope.pipelines.builder import PIPELINES
-from modelscope.preprocessors import TextGenerationPreprocessor
+from modelscope.preprocessors import TextGenerationTransformersPreprocessor
 from modelscope.utils.constant import Tasks
@@ -24,11 +24,12 @@ class DistributedPlugPipeline(DistributedPipeline):
                  model,
                  preprocessor=None,
                  first_sequence='sentence',
+                 sequence_length=512,
                  **kwargs):
         """Create a plug pipeline instance.

         Args:
             model: The model_id of plug(damo/nlp_plug_text-generation_27B).
                 The default path to damo/nlp_plug_text-generation_27B can be obtained by function
                 get_cache_dir("damo/nlp_plug_text-generation_27B"), the model should be downloaded to
                 this path before calling this class by model_id.
@@ -53,17 +54,16 @@ class DistributedPlugPipeline(DistributedPipeline):
                     |_ mp_rank_05_model_states.pt
                     |_ mp_rank_06_model_states.pt
                     |_ mp_rank_07_model_states.pt
             preprocessor: The optional preprocessor, if not passed in, a TextGenerationPreprocessor will
                 be used as default.
-            first_sequence: The first_sequence key name if the input format is a dict.
-            kwargs:
-                sequence_length: The input sequence_length.
+            sequence_length: The input sequence_length.
+            kwargs (dict, `optional`): Extra kwargs passed into the preprocessor's constructor.
         """
         if preprocessor is None:
-            preprocessor = TextGenerationPreprocessor(
+            preprocessor = TextGenerationTransformersPreprocessor(
                 model,
                 first_sequence=first_sequence,
-                sequence_length=kwargs.pop('sequence_length', 512))
+                sequence_length=sequence_length,
+                **kwargs)
         super().__init__(model, preprocessor=preprocessor, **kwargs)
         assert hasattr(preprocessor, 'tokenizer')
         self.cls_token_id = preprocessor.tokenizer.cls_token_id
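The `kwargs.pop('sequence_length', 512)` idiom disappears throughout this review in favor of an explicit keyword parameter. A standalone sketch of the difference (function names are mine, for illustration only):

    # With kwargs.pop the default silently applies even when the caller
    # mistypes the key; an explicit parameter keeps the same default while
    # staying visible in the signature, help() and IDE completion.
    def old_style(model, **kwargs):
        return kwargs.pop('sequence_length', 512)

    def new_style(model, sequence_length=512, **kwargs):
        return sequence_length

    assert old_style('m', sequnce_length=128) == 512  # typo swallowed silently
    assert new_style('m', sequence_length=128) == 128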
@@ -14,7 +14,8 @@ from modelscope.models.nlp.ponet.configuration import PoNetConfig
 from modelscope.outputs import OutputKeys
 from modelscope.pipelines.base import Pipeline, Tensor
 from modelscope.pipelines.builder import PIPELINES
-from modelscope.preprocessors import DocumentSegmentationPreprocessor
+from modelscope.preprocessors import \
+    DocumentSegmentationTransformersPreprocessor
 from modelscope.utils.constant import Tasks
 from modelscope.utils.logger import get_logger
@@ -27,26 +28,34 @@ __all__ = ['DocumentSegmentationPipeline']
     Tasks.document_segmentation, module_name=Pipelines.document_segmentation)
 class DocumentSegmentationPipeline(Pipeline):

-    def __init__(self,
-                 model: Union[Model, str],
-                 preprocessor: DocumentSegmentationPreprocessor = None,
-                 **kwargs):
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
-
-        self.model_dir = self.model.model_dir
-        self.model_cfg = self.model.forward()
-        if self.model_cfg['type'] == 'bert':
-            config = BertConfig.from_pretrained(self.model_dir, num_labels=2)
-        elif self.model_cfg['type'] == 'ponet':
-            config = PoNetConfig.from_pretrained(self.model_dir, num_labels=2)
-        self.document_segmentation_model = self.model.build_with_config(
-            config=config)
+    def __init__(
+            self,
+            model: Union[Model, str],
+            preprocessor: DocumentSegmentationTransformersPreprocessor = None,
+            config_file: str = None,
+            device: str = 'gpu',
+            auto_collate=True,
+            **kwargs):
+        """The document segmentation pipeline.
+
+        Args:
+            model (str or Model): Supply either a local model dir or a model id from the model hub
+            preprocessor (Preprocessor): An optional preprocessor instance, please make sure the preprocessor fits for
+                the model if supplied.
+        """
+        super().__init__(
+            model=model,
+            preprocessor=preprocessor,
+            config_file=config_file,
+            device=device,
+            auto_collate=auto_collate)
+        self.model_dir = self.model.model_dir
+        self.model_cfg = self.model.model_cfg
         if preprocessor is None:
-            self.preprocessor = DocumentSegmentationPreprocessor(
-                self.model.model_dir, config)
+            self.preprocessor = DocumentSegmentationTransformersPreprocessor(
+                self.model_dir, self.model.config.max_position_embeddings,
+                **kwargs)

     def __call__(
         self, documents: Union[List[List[str]], List[str],
@@ -85,8 +94,7 @@ class DocumentSegmentationPipeline(Pipeline):
                 key: torch.tensor(val)
                 for key, val in predict_dataset.items()
             }
-            predictions = self.document_segmentation_model.forward(
-                **input).logits
+            predictions = self.model.forward(**input).logits
             predictions = np.argmax(predictions, axis=2)

             assert len(sentences) == len(
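With the backbone config (BertConfig/PoNetConfig) now handled inside the model, the preprocessor only needs the model dir and the maximum length. A hedged construction sketch (the path is illustrative; the argument order mirrors the call above):

    from modelscope.preprocessors import \
        DocumentSegmentationTransformersPreprocessor

    # the second positional argument is the model's max_position_embeddings
    preprocessor = DocumentSegmentationTransformersPreprocessor(
        '/path/to/model_dir', 512)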
@@ -6,15 +6,14 @@ from typing import Any, Dict, List, Union
 import numpy as np
 import torch
 from datasets import Dataset
-from transformers.models.bert.modeling_bert import BertConfig

 from modelscope.metainfo import Pipelines
 from modelscope.models import Model
-from modelscope.models.nlp.ponet.configuration import PoNetConfig
 from modelscope.outputs import OutputKeys
 from modelscope.pipelines.base import Pipeline, Tensor
 from modelscope.pipelines.builder import PIPELINES
-from modelscope.preprocessors import DocumentSegmentationPreprocessor
+from modelscope.preprocessors import \
+    DocumentSegmentationTransformersPreprocessor
 from modelscope.utils.constant import Tasks
 from modelscope.utils.logger import get_logger
@@ -28,31 +27,29 @@ __all__ = ['ExtractiveSummarizationPipeline']
     module_name=Pipelines.extractive_summarization)
 class ExtractiveSummarizationPipeline(Pipeline):

-    def __init__(self,
-                 model: Union[Model, str],
-                 preprocessor: DocumentSegmentationPreprocessor = None,
-                 **kwargs):
-        model = model if isinstance(model,
-                                    Model) else Model.from_pretrained(model)
-        self.model_dir = model.model_dir
-        self.model_cfg = model.forward()
-        if self.model_cfg['type'] == 'bert':
-            config = BertConfig.from_pretrained(model.model_dir, num_labels=2)
-        elif self.model_cfg['type'] == 'ponet':
-            config = PoNetConfig.from_pretrained(model.model_dir, num_labels=2)
-        self.extractive_summarization_model = model.build_with_config(
-            config=config)
+    def __init__(
+            self,
+            model: Union[Model, str],
+            preprocessor: DocumentSegmentationTransformersPreprocessor = None,
+            config_file: str = None,
+            device: str = 'gpu',
+            auto_collate=True,
+            **kwargs):
+        super().__init__(
+            model=model,
+            preprocessor=preprocessor,
+            config_file=config_file,
+            device=device,
+            auto_collate=auto_collate)
+        self.model_dir = self.model.model_dir
+        self.model_cfg = self.model.model_cfg
         if preprocessor is None:
-            preprocessor = DocumentSegmentationPreprocessor(
-                self.model_dir, config)
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
-        self.preprocessor = preprocessor
+            self.preprocessor = DocumentSegmentationTransformersPreprocessor(
+                self.model_dir, self.model.config.max_position_embeddings,
+                **kwargs)

     def __call__(self, documents: Union[List[str], str]) -> Dict[str, Any]:
         output = self.predict(documents)
@@ -80,8 +77,7 @@ class ExtractiveSummarizationPipeline(Pipeline):
                 key: torch.tensor(val)
                 for key, val in predict_dataset.items()
             }
-            logits = self.extractive_summarization_model.forward(
-                **input).logits
+            logits = self.model.forward(**input).logits
             predictions = np.argmax(logits, axis=2)

             assert len(sentences) == len(
@@ -20,8 +20,24 @@ class FaqQuestionAnsweringPipeline(Pipeline):
     def __init__(self,
                  model: Union[str, Model],
                  preprocessor: Preprocessor = None,
+                 config_file: str = None,
+                 device: str = 'gpu',
+                 auto_collate=True,
                  **kwargs):
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
+        """The faq question answering pipeline.
+
+        Args:
+            model (str or Model): A model instance or a model local dir or a model id in the model hub.
+            preprocessor (Preprocessor, `optional`): a preprocessor instance
+            kwargs (dict, `optional`):
+                The preprocessor kwargs passed into the preprocessor's constructor.
+        """
+        super().__init__(
+            model=model,
+            preprocessor=preprocessor,
+            config_file=config_file,
+            device=device,
+            auto_collate=auto_collate)
         if preprocessor is None:
             self.preprocessor = Preprocessor.from_pretrained(
                 self.model.model_dir, **kwargs)
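`Preprocessor.from_pretrained` is the load path this refactor standardizes on: the preprocessor type comes from the configuration saved next to the model instead of being hard-coded per pipeline class. Sketch (path illustrative):

    from modelscope.preprocessors import Preprocessor

    # Builds whatever preprocessor the model's configuration declares;
    # extra kwargs override the saved preprocessor parameters.
    preprocessor = Preprocessor.from_pretrained(
        '/path/to/model_dir', sequence_length=256)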
@@ -9,11 +9,9 @@ from fasttext import load_model
 from fasttext.FastText import _FastText

 from modelscope.metainfo import Pipelines
-from modelscope.models import Model
 from modelscope.outputs import OutputKeys
 from modelscope.pipelines.base import Input, Pipeline
 from modelscope.pipelines.builder import PIPELINES
-from modelscope.preprocessors import SequenceClassificationPreprocessor
 from modelscope.utils.constant import ModelFile, Tasks

 __all__ = ['FasttextSequenceClassificationPipeline']
@@ -36,8 +34,7 @@ class FasttextSequenceClassificationPipeline(Pipeline):
         """use `model` and `preprocessor` to create a nlp text classification pipeline for prediction

         Args:
-            model: a model directory including model.bin and spm.model
-            preprocessor (SequenceClassificationPreprocessor): a preprocessor instance
+            model: A model directory including model.bin and spm.model
         """
         super().__init__(model=model)
         model_file = os.path.join(model, ModelFile.TORCH_MODEL_BIN_FILE)
@@ -53,8 +50,11 @@ class FasttextSequenceClassificationPipeline(Pipeline):
         text_sp = sentencepiece_tokenize(self.spm, text)
         return {'text_sp': text_sp, 'text': text}

-    def forward(self, inputs: Dict[str, Any]) -> Dict[str, Any]:
-        topk = inputs.get('topk', -1)
+    def forward(self,
+                inputs: Dict[str, Any],
+                topk: int = None) -> Dict[str, Any]:
+        if topk is None:
+            topk = inputs.get('topk', -1)
         label, probs = self.model.predict(inputs['text_sp'], k=topk)
         label = [x.replace('__label__', '') for x in label]
         result = {
@@ -9,7 +9,8 @@ from modelscope.models import Model
 from modelscope.outputs import OutputKeys
 from modelscope.pipelines.base import Pipeline, Tensor
 from modelscope.pipelines.builder import PIPELINES
-from modelscope.preprocessors import NLPPreprocessor, Preprocessor
+from modelscope.preprocessors import (FillMaskTransformersPreprocessor,
+                                      Preprocessor)
 from modelscope.utils.config import Config
 from modelscope.utils.constant import ModelFile, Tasks
@@ -23,7 +24,11 @@ class FeatureExtractionPipeline(Pipeline):
     def __init__(self,
                  model: Union[Model, str],
                  preprocessor: Optional[Preprocessor] = None,
-                 first_sequence='sentence',
+                 config_file: str = None,
+                 device: str = 'gpu',
+                 auto_collate=True,
+                 padding=False,
+                 sequence_length=128,
                  **kwargs):
         """Use `model` and `preprocessor` to create a nlp feature extraction pipeline for prediction
@@ -32,11 +37,8 @@ class FeatureExtractionPipeline(Pipeline):
                 no-head model id from the model hub, or a torch model instance.
             preprocessor (Preprocessor): An optional preprocessor instance, please make sure the preprocessor fits for
                 the model if supplied.
-            first_sequence: The key to read the sentence in.
-            sequence_length: Max sequence length in the user's custom scenario. 128 will be used as a default value.
-            NOTE: Inputs of type 'str' are also supported. In this scenario, the 'first_sequence'
-            param will have no effect.
+            kwargs (dict, `optional`):
+                Extra kwargs passed into the preprocessor's constructor.

         Example:
             >>> from modelscope.pipelines import pipeline
@@ -46,19 +48,21 @@ class FeatureExtractionPipeline(Pipeline):
         """
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
+        super().__init__(
+            model=model,
+            preprocessor=preprocessor,
+            config_file=config_file,
+            device=device,
+            auto_collate=auto_collate)
         if preprocessor is None:
-            self.preprocessor = NLPPreprocessor(
+            self.preprocessor = Preprocessor.from_pretrained(
                 self.model.model_dir,
-                padding=kwargs.pop('padding', False),
-                sequence_length=kwargs.pop('sequence_length', 128))
+                padding=padding,
+                sequence_length=sequence_length,
+                **kwargs)
         self.model.eval()
-        self.config = Config.from_file(
-            os.path.join(self.model.model_dir, ModelFile.CONFIGURATION))
-        self.tokenizer = self.preprocessor.tokenizer

     def forward(self, inputs: Dict[str, Any],
                 **forward_params) -> Dict[str, Any]:
         with torch.no_grad():
@@ -23,7 +23,11 @@ class FillMaskPipeline(Pipeline):
     def __init__(self,
                  model: Union[Model, str],
                  preprocessor: Optional[Preprocessor] = None,
-                 first_sequence: str = 'sentence',
+                 config_file: str = None,
+                 device: str = 'gpu',
+                 auto_collate=True,
+                 first_sequence='sentence',
+                 sequence_length=128,
                  **kwargs):
         """The inference pipeline for all the fill mask sub-tasks.
@@ -31,11 +35,8 @@ class FillMaskPipeline(Pipeline):
             model (`str` or `Model` or module instance): A model instance or a model local dir
                 or a model id in the model hub.
             preprocessor (`Preprocessor`, `optional`): A Preprocessor instance.
-            first_sequence (`str`, `optional`): The key to read the sentence in.
-            sequence_length (`int`, `optional`): Max sequence length in the user's custom scenario, default 128.
-            NOTE1: Inputs of type 'str' are also supported. In this scenario, the 'first_sequence'
-            param will have no effect.
+            kwargs (dict, `optional`):
+                Extra kwargs passed into the preprocessor's constructor.

         Example1:
             >>> from modelscope.pipelines import pipeline
@@ -51,20 +52,25 @@ class FillMaskPipeline(Pipeline):
         NOTE2: Please pay attention to the model's special tokens.
         If bert based model(bert, structbert, etc.) is used, the mask token is '[MASK]'.
         If the xlm-roberta(xlm-roberta, veco, etc.) based model is used, the mask token is '<mask>'.
-        To view other examples please check the tests/pipelines/test_fill_mask.py.
+        To view other examples please check tests/pipelines/test_fill_mask.py.
         """
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
+        super().__init__(
+            model=model,
+            preprocessor=preprocessor,
+            config_file=config_file,
+            device=device,
+            auto_collate=auto_collate)
         if preprocessor is None:
             self.preprocessor = Preprocessor.from_pretrained(
                 self.model.model_dir,
                 first_sequence=first_sequence,
-                second_sequence=None,
-                sequence_length=kwargs.pop('sequence_length', 128))
-        assert hasattr(
-            self.preprocessor, 'mask_id'
-        ), 'The input preprocessor should have the mask_id attribute.'
+                sequence_length=sequence_length,
+                **kwargs)
         self.model.eval()
+        assert hasattr(
+            self.preprocessor, 'mask_id'
+        ), 'The input preprocessor should have the mask_id attribute.'

     def forward(self, inputs: Dict[str, Any],
                 **forward_params) -> Dict[str, Any]:
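Note that the `mask_id` assertion now runs unconditionally, so it also guards user-supplied preprocessors instead of only the default one. Minimal usage sketch (model id illustrative; the mask token depends on the backbone as NOTE2 above explains):

    from modelscope.pipelines import pipeline

    fill_mask = pipeline(
        'fill-mask',
        model='damo/nlp_structbert_fill-mask_chinese-base')  # illustrative
    print(fill_mask('生活的真谛是[MASK]。'))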
@@ -8,8 +8,7 @@ from modelscope.metainfo import Pipelines
 from modelscope.models import Model
 from modelscope.pipelines.base import Pipeline
 from modelscope.pipelines.builder import PIPELINES
-from modelscope.preprocessors import (Preprocessor,
-                                      RelationExtractionPreprocessor)
+from modelscope.preprocessors import Preprocessor
 from modelscope.utils.constant import Tasks

 __all__ = ['InformationExtractionPipeline']
@@ -24,12 +23,33 @@ class InformationExtractionPipeline(Pipeline):
     def __init__(self,
                  model: Union[Model, str],
                  preprocessor: Optional[Preprocessor] = None,
+                 config_file: str = None,
+                 device: str = 'gpu',
+                 auto_collate=True,
+                 sequence_length=512,
                  **kwargs):
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
-        if preprocessor is None:
-            self.preprocessor = RelationExtractionPreprocessor(
+        """
+        Args:
+            model (str or Model): Supply either a local model dir which supported information extraction task, or a
+                model id from the model hub, or a torch model instance.
+            preprocessor (Preprocessor): An optional preprocessor instance, please make sure the preprocessor fits for
+                the model if supplied.
+            kwargs (dict, `optional`):
+                Extra kwargs passed into the preprocessor's constructor.
+        """
+        super().__init__(
+            model=model,
+            preprocessor=preprocessor,
+            config_file=config_file,
+            device=device,
+            auto_collate=auto_collate)
+        if self.preprocessor is None:
+            self.preprocessor = Preprocessor.from_pretrained(
                 self.model.model_dir,
-                sequence_length=kwargs.pop('sequence_length', 512))
+                sequence_length=sequence_length,
+                **kwargs)
         self.model.eval()

     def forward(self, inputs: Dict[str, Any],
@@ -1,36 +1,35 @@
 # Copyright (c) Alibaba, Inc. and its affiliates.
-from typing import Any, Dict, Optional, Union
-
-import torch
+from typing import Optional, Union

 from modelscope.metainfo import Pipelines
 from modelscope.models import Model
-from modelscope.outputs import OutputKeys
-from modelscope.pipelines.base import Pipeline
 from modelscope.pipelines.builder import PIPELINES
 from modelscope.pipelines.nlp import TokenClassificationPipeline
-from modelscope.preprocessors import (NERPreprocessorThai, NERPreprocessorViet,
-                                      Preprocessor,
-                                      TokenClassificationPreprocessor)
+from modelscope.preprocessors import Preprocessor
 from modelscope.utils.constant import Tasks
-from modelscope.utils.tensor_utils import (torch_nested_detach,
-                                           torch_nested_numpify)

-__all__ = [
-    'NamedEntityRecognitionPipeline', 'NamedEntityRecognitionThaiPipeline',
-    'NamedEntityRecognitionVietPipeline'
-]
+__all__ = ['NamedEntityRecognitionPipeline']


 @PIPELINES.register_module(
     Tasks.named_entity_recognition,
     module_name=Pipelines.named_entity_recognition)
+@PIPELINES.register_module(
+    Tasks.named_entity_recognition,
+    module_name=Pipelines.named_entity_recognition_thai)
+@PIPELINES.register_module(
+    Tasks.named_entity_recognition,
+    module_name=Pipelines.named_entity_recognition_viet)
 class NamedEntityRecognitionPipeline(TokenClassificationPipeline):

     def __init__(self,
                  model: Union[Model, str],
                  preprocessor: Optional[Preprocessor] = None,
+                 config_file: str = None,
+                 device: str = 'gpu',
+                 auto_collate=True,
+                 sequence_length=128,
                  **kwargs):
         """Use `model` and `preprocessor` to create a nlp NER pipeline for prediction
@@ -39,8 +38,8 @@ class NamedEntityRecognitionPipeline(TokenClassificationPipeline):
                 model id from the model hub, or a torch model instance.
             preprocessor (Preprocessor): An optional preprocessor instance, please make sure the preprocessor fits for
                 the model if supplied.
-            sequence_length: Max sequence length in the user's custom scenario. 512 will be used as a default value.
+            kwargs (dict, `optional`):
+                Extra kwargs passed into the preprocessor's constructor.

         Example:
             >>> from modelscope.pipelines import pipeline
             >>> pipeline_ins = pipeline(task='named-entity-recognition',
@@ -50,44 +49,17 @@ class NamedEntityRecognitionPipeline(TokenClassificationPipeline):
         To view other examples please check the tests/pipelines/test_named_entity_recognition.py.
         """
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
+        super().__init__(
+            model=model,
+            preprocessor=preprocessor,
+            config_file=config_file,
+            device=device,
+            auto_collate=auto_collate)
         if preprocessor is None:
-            self.preprocessor = TokenClassificationPreprocessor(
+            self.preprocessor = Preprocessor.from_pretrained(
                 self.model.model_dir,
-                sequence_length=kwargs.pop('sequence_length', 128))
+                sequence_length=sequence_length,
+                **kwargs)
         self.model.eval()
-        self.id2label = kwargs.get('id2label')
-        if self.id2label is None and hasattr(self.preprocessor, 'id2label'):
-            self.id2label = self.preprocessor.id2label
-
-
-@PIPELINES.register_module(
-    Tasks.named_entity_recognition,
-    module_name=Pipelines.named_entity_recognition_thai)
-class NamedEntityRecognitionThaiPipeline(NamedEntityRecognitionPipeline):
-
-    def __init__(self,
-                 model: Union[Model, str],
-                 preprocessor: Optional[Preprocessor] = None,
-                 **kwargs):
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
-        if preprocessor is None:
-            self.preprocessor = NERPreprocessorThai(
-                self.model.model_dir,
-                sequence_length=kwargs.pop('sequence_length', 512))
-
-
-@PIPELINES.register_module(
-    Tasks.named_entity_recognition,
-    module_name=Pipelines.named_entity_recognition_viet)
-class NamedEntityRecognitionVietPipeline(NamedEntityRecognitionPipeline):
-
-    def __init__(self,
-                 model: Union[Model, str],
-                 preprocessor: Optional[Preprocessor] = None,
-                 **kwargs):
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
-        if preprocessor is None:
-            self.preprocessor = NERPreprocessorViet(
-                self.model.model_dir,
-                sequence_length=kwargs.pop('sequence_length', 512))
+        assert hasattr(self.preprocessor, 'id2label')
+        self.id2label = self.preprocessor.id2label
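The Thai/Viet subclasses existed only to pin `NERPreprocessorThai`/`NERPreprocessorViet`; once `Preprocessor.from_pretrained` resolves the preprocessor from the model's own configuration, a single class can back all three registered pipeline names. Sketch (model id illustrative):

    from modelscope.pipelines import pipeline

    # Pipelines.named_entity_recognition_thai/viet remain registered, but
    # both now resolve to NamedEntityRecognitionPipeline; the language-
    # specific preprocessor comes from the model's configuration.
    ner = pipeline(
        'named-entity-recognition',
        model='damo/nlp_xlmr_named-entity-recognition_thai-ecommerce-title')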
@@ -22,7 +22,10 @@ class SentenceEmbeddingPipeline(Pipeline):
     def __init__(self,
                  model: Union[Model, str],
                  preprocessor: Optional[Preprocessor] = None,
-                 first_sequence='first_sequence',
+                 config_file: str = None,
+                 device: str = 'gpu',
+                 auto_collate=True,
+                 sequence_length=128,
                  **kwargs):
         """Use `model` and `preprocessor` to create a nlp text dual encoder then generates the text representation.
         Args:
@@ -30,15 +33,20 @@ class SentenceEmbeddingPipeline(Pipeline):
                 or a model id from the model hub, or a torch model instance.
             preprocessor (Preprocessor): An optional preprocessor instance, please make sure the preprocessor fits for
                 the model if supplied.
-            sequence_length: Max sequence length in the user's custom scenario. 128 will be used as a default value.
+            kwargs (dict, `optional`):
+                Extra kwargs passed into the preprocessor's constructor.
         """
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
+        super().__init__(
+            model=model,
+            preprocessor=preprocessor,
+            config_file=config_file,
+            device=device,
+            auto_collate=auto_collate)
         if preprocessor is None:
             self.preprocessor = Preprocessor.from_pretrained(
-                self.model.model_dir
-                if isinstance(self.model, Model) else model,
-                first_sequence=first_sequence,
-                sequence_length=kwargs.pop('sequence_length', 128))
+                self.model.model_dir,
+                sequence_length=sequence_length,
+                **kwargs)

     def forward(self, inputs: Dict[str, Any],
                 **forward_params) -> Dict[str, Any]:
@@ -1,12 +1,11 @@
 # Copyright (c) Alibaba, Inc. and its affiliates.
 from typing import Any, Dict, Optional, Union

-from modelscope.metainfo import Pipelines
-from modelscope.models.multi_modal import OfaForAllTasks
+from modelscope.metainfo import Pipelines, Preprocessors
 from modelscope.pipelines.base import Model, Pipeline
 from modelscope.pipelines.builder import PIPELINES
-from modelscope.preprocessors import OfaPreprocessor, Preprocessor
-from modelscope.utils.constant import Tasks
+from modelscope.preprocessors import Preprocessor
+from modelscope.utils.constant import Fields, Tasks
 from modelscope.utils.logger import get_logger

 logger = get_logger()
@@ -19,6 +18,9 @@ class SummarizationPipeline(Pipeline):
     def __init__(self,
                  model: Union[Model, str],
                  preprocessor: Optional[Preprocessor] = None,
+                 config_file: str = None,
+                 device: str = 'gpu',
+                 auto_collate=True,
                  **kwargs):
         """Use `model` and `preprocessor` to create a Summarization pipeline for prediction.
@@ -26,11 +28,25 @@ class SummarizationPipeline(Pipeline):
             model (str or Model): Supply either a local model dir which supported the summarization task,
                 or a model id from the model hub, or a model instance.
             preprocessor (Preprocessor): An optional preprocessor instance.
+            kwargs (dict, `optional`):
+                Extra kwargs passed into the preprocessor's constructor.
         """
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
+        super().__init__(
+            model=model,
+            preprocessor=preprocessor,
+            config_file=config_file,
+            device=device,
+            auto_collate=auto_collate)
         self.model.eval()
-        if preprocessor is None and isinstance(self.model, OfaForAllTasks):
-            self.preprocessor = OfaPreprocessor(model_dir=self.model.model_dir)
+        if preprocessor is None:
+            if self.model.__class__.__name__ == 'OfaForAllTasks':
+                self.preprocessor = Preprocessor.from_pretrained(
+                    self.model.model_dir,
+                    type=Preprocessors.ofa_tasks_preprocessor,
+                    field=Fields.multi_modal)
+            else:
+                self.preprocessor = Preprocessor.from_pretrained(
+                    self.model.model_dir, **kwargs)

     def postprocess(self, inputs: Dict[str, Any]) -> Dict[str, Any]:
         return inputs
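The explicit `type`/`field` arguments mirror how the OFA branch above pins its preprocessor, and the same override works anywhere a model's configuration lacks a usable preprocessor entry. Sketch (the local dir is illustrative):

    from modelscope.metainfo import Preprocessors
    from modelscope.preprocessors import Preprocessor
    from modelscope.utils.constant import Fields

    p = Preprocessor.from_pretrained(
        '/path/to/ofa_model_dir',  # illustrative path
        type=Preprocessors.ofa_tasks_preprocessor,
        field=Fields.multi_modal)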
@@ -33,6 +33,9 @@ class TableQuestionAnsweringPipeline(Pipeline):
                  model: Union[TableQuestionAnswering, str],
                  preprocessor: TableQuestionAnsweringPreprocessor = None,
                  db: Database = None,
+                 config_file: str = None,
+                 device: str = 'gpu',
+                 auto_collate=True,
                  **kwargs):
         """use `model` and `preprocessor` to create a table question answering prediction pipeline
@@ -40,11 +43,19 @@ class TableQuestionAnsweringPipeline(Pipeline):
             model (TableQuestionAnswering): a model instance
             preprocessor (TableQuestionAnsweringPreprocessor): a preprocessor instance
             db (Database): a database to store tables in the database
+            kwargs (dict, `optional`):
+                Extra kwargs passed into the preprocessor's constructor.
         """
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
+        super().__init__(
+            model=model,
+            preprocessor=preprocessor,
+            config_file=config_file,
+            device=device,
+            auto_collate=auto_collate)
         if preprocessor is None:
             self.preprocessor = TableQuestionAnsweringPreprocessor(
-                self.model.model_dir)
+                self.model.model_dir, **kwargs)

         # initialize tokenizer
         self.tokenizer = BertTokenizer(
@@ -1,113 +0,0 @@
-# Copyright (c) Alibaba, Inc. and its affiliates.
-from typing import Any, Dict, List, Optional, Union
-
-import torch
-from numpy import isin
-
-from modelscope.metainfo import Pipelines
-from modelscope.models.base import Model
-from modelscope.outputs import OutputKeys
-from modelscope.pipelines.base import Input, Pipeline, Tensor
-from modelscope.pipelines.builder import PIPELINES
-from modelscope.preprocessors import Text2TextGenerationPreprocessor
-from modelscope.utils.config import use_task_specific_params
-from modelscope.utils.constant import Tasks
-
-__all__ = ['Text2TextGenerationPipeline']
-
-TRANSLATE_PIPELINES = [
-    Pipelines.translation_en_to_de,
-    Pipelines.translation_en_to_ro,
-    Pipelines.translation_en_to_fr,
-]
-
-
-@PIPELINES.register_module(
-    Tasks.text2text_generation, module_name=Pipelines.text2text_generation)
-@PIPELINES.register_module(
-    Tasks.text2text_generation, module_name=Pipelines.translation_en_to_de)
-@PIPELINES.register_module(
-    Tasks.text2text_generation, module_name=Pipelines.translation_en_to_ro)
-@PIPELINES.register_module(
-    Tasks.text2text_generation, module_name=Pipelines.translation_en_to_fr)
-class Text2TextGenerationPipeline(Pipeline):
-
-    def __init__(
-            self,
-            model: Union[Model, str],
-            preprocessor: Optional[Text2TextGenerationPreprocessor] = None,
-            first_sequence='sentence',
-            **kwargs):
-        """Use `model` and `preprocessor` to create a text to text generation pipeline for prediction.
-
-        Args:
-            model (str or Model): Supply either a local model dir which supported the text generation task,
-                or a model id from the model hub, or a torch model instance.
-            preprocessor (Preprocessor): An optional preprocessor instance, please make sure the preprocessor fits for
-                the model if supplied.
-            first_sequence: The key to read the first sentence in.
-            sequence_length: Max sequence length in the user's custom scenario. 128 will be used as a default value.
-
-            NOTE: Inputs of type 'str' are also supported. In this scenario, the 'first_sequence'
-            param will have no effect.
-
-        Example:
-            >>> from modelscope.pipelines import pipeline
-            >>> pipeline_ins = pipeline(task='text2text-generation',
-            >>>     model='damo/nlp_t5_text2text-generation_chinese-base')
-            >>> sentence1 = '中国的首都位于<extra_id_0>。'
-            >>> print(pipeline_ins(sentence1))
-            >>> # Or use the dict input:
-            >>> print(pipeline_ins({'sentence': sentence1}))
-            >>> # 北京
-
-        To view other examples plese check the tests/pipelines/test_text_generation.py.
-        """
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
-
-        if preprocessor is None:
-            self.preprocessor = Text2TextGenerationPreprocessor(
-                self.model.model_dir,
-                sequence_length=kwargs.pop('sequence_length', 128))
-        self.tokenizer = self.preprocessor.tokenizer
-        self.pipeline = self.model.pipeline.type
-        self.model.eval()
-
-    def preprocess(self, inputs: Input, **preprocess_params) -> Dict[str, Any]:
-        """ Provide specific preprocess for text2text generation pipeline in order to handl multi tasks
-        """
-        if not isinstance(inputs, str):
-            raise ValueError(f'Not supported input type: {type(inputs)}')
-        if self.pipeline in TRANSLATE_PIPELINES:
-            use_task_specific_params(self.model, self.pipeline)
-            inputs = self.model.config.prefix + inputs
-        return super().preprocess(inputs, **preprocess_params)
-
-    def forward(self, inputs: Dict[str, Any],
-                **forward_params) -> Dict[str, Any]:
-        forward_params['min_length'] = forward_params.get(
-            'min_length', self.model.config.min_length)
-        forward_params['max_length'] = forward_params.get(
-            'max_length', self.model.config.max_length)
-        with torch.no_grad():
-            output_ids = self.model.generate(**inputs, **forward_params)
-        return {'output_ids': output_ids}
-
-    def postprocess(self, inputs: Dict[str, Tensor],
-                    **postprocess_params) -> Dict[str, str]:
-        """process the prediction results
-
-        Args:
-            inputs (Dict[str, Any]): _description_
-
-        Returns:
-            Dict[str, str]: the prediction results
-        """
-        output = self.tokenizer.decode(
-            inputs['output_ids'][0],
-            skip_special_tokens=True,
-        )
-        return {OutputKeys.TEXT: output}
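The deleted Text2TextGenerationPipeline is superseded by the text-generation stack; note the new `TextGenerationT5Pipeline` export in the `__all__` change below. Usage stays as in the removed docstring (model id taken from there):

    from modelscope.pipelines import pipeline

    t5 = pipeline('text2text-generation',
                  model='damo/nlp_t5_text2text-generation_chinese-base')
    print(t5('中国的首都位于<extra_id_0>。'))  # expected: 北京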
@@ -5,11 +5,14 @@ import numpy as np

 from modelscope.metainfo import Pipelines, Preprocessors
 from modelscope.models.base import Model
-from modelscope.outputs import OutputKeys
+from modelscope.outputs import OutputKeys, TextClassificationModelOutput
 from modelscope.pipelines.base import Pipeline
 from modelscope.pipelines.builder import PIPELINES
 from modelscope.preprocessors import Preprocessor
 from modelscope.utils.constant import Fields, Tasks
+from modelscope.utils.logger import get_logger
+
+logger = get_logger(__name__)


 @PIPELINES.register_module(
@@ -31,6 +34,9 @@ class TextClassificationPipeline(Pipeline):
     def __init__(self,
                  model: Union[Model, str],
                  preprocessor: Preprocessor = None,
+                 config_file: str = None,
+                 device: str = 'gpu',
+                 auto_collate=True,
                  **kwargs):
         """The inference pipeline for all the text classification sub-tasks.
@@ -38,10 +44,8 @@ class TextClassificationPipeline(Pipeline):
             model (`str` or `Model` or module instance): A model instance or a model local dir
                 or a model id in the model hub.
             preprocessor (`Preprocessor`, `optional`): A Preprocessor instance.
-            first_sequence (`str`, `optional`): The key of the first sentence.
-            second_sequence (`str`, `optional`): The key of the second sentence.
-            sequence_length (`int`, `optional`): The sequence length.
-            id2label (`dict`, `optional`): The id-label mapping.
+            kwargs (dict, `optional`):
+                Extra kwargs passed into the preprocessor's constructor.

         Example:
             >>> from modelscope.pipelines import pipeline
@@ -49,31 +53,38 @@ class TextClassificationPipeline(Pipeline):
                 model='damo/nlp_structbert_sentence-similarity_chinese-base')
             >>> input = ('这是个测试', '这也是个测试')
             >>> print(pipeline_ins(input))
-        NOTE: Inputs of type 'str' are also supported. In this scenario, the 'first_sequence' and 'second_sequence'
-        param will have no affection.
         """
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
+        super().__init__(
+            model=model,
+            preprocessor=preprocessor,
+            config_file=config_file,
+            device=device,
+            auto_collate=auto_collate)
         if preprocessor is None:
             if self.model.__class__.__name__ == 'OfaForAllTasks':
                 self.preprocessor = Preprocessor.from_pretrained(
                     model_name_or_path=self.model.model_dir,
                     type=Preprocessors.ofa_tasks_preprocessor,
-                    field=Fields.multi_modal)
+                    field=Fields.multi_modal,
+                    **kwargs)
             else:
                 first_sequence = kwargs.pop('first_sequence', 'first_sequence')
                 second_sequence = kwargs.pop('second_sequence', None)
+                sequence_length = kwargs.pop('sequence_length', 512)
                 self.preprocessor = Preprocessor.from_pretrained(
-                    self.model
-                    if isinstance(self.model, str) else self.model.model_dir,
-                    first_sequence=first_sequence,
-                    second_sequence=second_sequence,
-                    sequence_length=kwargs.pop('sequence_length', 512))
-        self.id2label = kwargs.get('id2label')
-        if self.id2label is None and hasattr(self.preprocessor, 'id2label'):
-            self.id2label = self.preprocessor.id2label
+                    self.model.model_dir, **{
+                        'first_sequence': first_sequence,
+                        'second_sequence': second_sequence,
+                        'sequence_length': sequence_length,
+                        **kwargs
+                    })
+        assert hasattr(self.preprocessor, 'id2label')
+        self.id2label = self.preprocessor.id2label
+        if self.id2label is None:
+            logger.warn(
+                'The id2label mapping is None, will return original ids.')

     def forward(self, inputs: Dict[str, Any],
                 **forward_params) -> Dict[str, Any]:
@@ -82,16 +93,17 @@ class TextClassificationPipeline(Pipeline):
         return self.model(**inputs, **forward_params)

     def postprocess(self,
-                    inputs: Dict[str, Any],
-                    topk: int = 5) -> Dict[str, str]:
-        """process the prediction results
+                    inputs: Union[Dict[str, Any],
+                                  TextClassificationModelOutput],
+                    topk: int = None) -> Dict[str, Any]:
+        """Process the prediction results

         Args:
             inputs (`Dict[str, Any]` or `TextClassificationModelOutput`): The model output, please check
                 the `TextClassificationModelOutput` class for details.
             topk (int): The topk probs to take

         Returns:
-            Dict[str, str]: the prediction results.
+            Dict[str, Any]: the prediction results.
                 scores: The probabilities of each label.
                 labels: The real labels.
             Label at index 0 is the smallest probability.
@@ -99,8 +111,6 @@ class TextClassificationPipeline(Pipeline):
         if self.model.__class__.__name__ == 'OfaForAllTasks':
             return inputs
         else:
-            assert self.id2label is not None, 'Cannot convert id to the original label, please pass in the mapping ' \
-                'as a parameter or make sure the preprocessor has the attribute.'
             logits = inputs[OutputKeys.LOGITS].cpu().numpy()
             if logits.shape[0] == 1:
                 logits = logits[0]
@@ -111,20 +121,24 @@ class TextClassificationPipeline(Pipeline):
             probs = softmax(logits)
             num_classes = probs.shape[-1]
-            topk = min(topk, num_classes)
+            topk = min(topk, num_classes) if topk is not None else num_classes
             top_indices = np.argpartition(probs, -topk)[-topk:]
             probs = np.take_along_axis(probs, top_indices, axis=-1).tolist()

             def map_to_label(id):
-                if id in self.id2label:
-                    return self.id2label[id]
-                elif str(id) in self.id2label:
-                    return self.id2label[str(id)]
+                if self.id2label is not None:
+                    if id in self.id2label:
+                        return self.id2label[id]
+                    elif str(id) in self.id2label:
+                        return self.id2label[str(id)]
+                    else:
+                        raise Exception(
+                            f'id {id} not found in id2label: {self.id2label}')
                 else:
-                    raise Exception('id not found in id2label')
+                    return id

             v_func = np.vectorize(map_to_label)
-            return {
-                OutputKeys.SCORES: probs,
-                OutputKeys.LABELS: v_func(top_indices).tolist()
-            }
+            top_indices = v_func(top_indices).tolist()
+            probs = list(reversed(probs))
+            top_indices = list(reversed(top_indices))
+            return {OutputKeys.SCORES: probs, OutputKeys.LABELS: top_indices}
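Three behavior changes worth flagging in `postprocess`: `topk=None` now means "all classes", a missing id2label no longer raises (raw ids are returned after the constructor's warning), and the output lists are reversed so the highest-probability label comes first, even though the unchanged docstring line "Label at index 0 is the smallest probability" still describes the old order. A standalone numpy sketch of the new logic (the function name is mine, not from the diff):

    import numpy as np

    def postprocess_topk(logits, id2label=None, topk=None):
        probs = np.exp(logits - logits.max())
        probs = probs / probs.sum()  # softmax
        num_classes = probs.shape[-1]
        topk = min(topk, num_classes) if topk is not None else num_classes
        # np.argpartition only guarantees the k-th position, as in the diff
        top = np.argpartition(probs, -topk)[-topk:]
        scores = np.take_along_axis(probs, top, axis=-1).tolist()
        labels = [id2label[int(i)] if id2label else int(i) for i in top]
        return list(reversed(scores)), list(reversed(labels))

    print(postprocess_topk(
        np.array([0.1, 2.0, -1.0]), {0: 'neg', 1: 'pos', 2: 'neu'}, topk=2))
    # the highest-probability label ('pos') comes first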
@@ -10,7 +10,7 @@ from modelscope.models.nlp import BartForTextErrorCorrection
 from modelscope.outputs import OutputKeys
 from modelscope.pipelines.base import Pipeline, Tensor
 from modelscope.pipelines.builder import PIPELINES
-from modelscope.preprocessors import TextErrorCorrectionPreprocessor
+from modelscope.preprocessors import Preprocessor
 from modelscope.utils.constant import Tasks

 __all__ = ['TextErrorCorrectionPipeline']
@@ -20,17 +20,20 @@ __all__ = ['TextErrorCorrectionPipeline']
     Tasks.text_error_correction, module_name=Pipelines.text_error_correction)
 class TextErrorCorrectionPipeline(Pipeline):

-    def __init__(
-            self,
-            model: Union[BartForTextErrorCorrection, str],
-            preprocessor: Optional[TextErrorCorrectionPreprocessor] = None,
-            **kwargs):
+    def __init__(self,
+                 model: Union[Model, str],
+                 preprocessor: Optional[Preprocessor] = None,
+                 config_file: str = None,
+                 device: str = 'gpu',
+                 auto_collate=True,
+                 **kwargs):
         """use `model` and `preprocessor` to create a nlp text correction pipeline.

         Args:
             model (BartForTextErrorCorrection): A model instance, or a model local dir, or a model id in the model hub.
             preprocessor (TextErrorCorrectionPreprocessor): An optional preprocessor instance.
+            kwargs (dict, `optional`):
+                Extra kwargs passed into the preprocessor's constructor.

         Example:
         >>> from modelscope.pipelines import pipeline
         >>> pipeline_ins = pipeline(
@@ -38,13 +41,17 @@ class TextErrorCorrectionPipeline(Pipeline):
         >>> sentence1 = '随着中国经济突飞猛近,建造工业与日俱增'
         >>> print(pipeline_ins(sentence1))

-        To view other examples plese check the tests/pipelines/test_text_error_correction.py.
+        To view other examples please check tests/pipelines/test_text_error_correction.py.
         """
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
+        super().__init__(
+            model=model,
+            preprocessor=preprocessor,
+            config_file=config_file,
+            device=device,
+            auto_collate=auto_collate)
         if preprocessor is None:
-            self.preprocessor = TextErrorCorrectionPreprocessor(
-                self.model.model_dir)
+            self.preprocessor = Preprocessor.from_pretrained(
+                self.model.model_dir, **kwargs)
         self.vocab = self.preprocessor.vocab

     def forward(self, inputs: Dict[str, Any],
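With the rewritten `__init__`, extra keyword arguments flow straight into `Preprocessor.from_pretrained`. A usage sketch; the model dir is a placeholder, and `sequence_length` is just an example of a preprocessor kwarg:

```python
from modelscope.pipelines.nlp import TextErrorCorrectionPipeline

# '<model-dir>' stands for a local model directory or a hub model id
pipe = TextErrorCorrectionPipeline(
    '<model-dir>', device='cpu', sequence_length=128)
print(pipe('随着中国经济突飞猛近,建造工业与日俱增'))
```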
@@ -1,20 +1,22 @@
 # Copyright (c) Alibaba, Inc. and its affiliates.
+import os
 from typing import Any, Dict, Optional, Union

 import torch

 from modelscope.metainfo import Pipelines
 from modelscope.models.base import Model
-from modelscope.outputs import OutputKeys
+from modelscope.outputs import (ModelOutputBase, OutputKeys,
+                                TokenGeneratorOutput)
 from modelscope.pipelines.base import Pipeline, Tensor
 from modelscope.pipelines.builder import PIPELINES
-from modelscope.preprocessors import Preprocessor, build_preprocessor
+from modelscope.preprocessors import Preprocessor
 from modelscope.utils.chinese_utils import remove_space_between_chinese_chars
-from modelscope.utils.constant import Fields, Tasks
-from modelscope.utils.hub import read_config
+from modelscope.utils.constant import Tasks
+from modelscope.utils.hub import Config, read_config

-__all__ = ['TextGenerationPipeline']
+__all__ = ['TextGenerationPipeline', 'TextGenerationT5Pipeline']

 @PIPELINES.register_module(
@@ -24,7 +26,11 @@ class TextGenerationPipeline(Pipeline):
     def __init__(self,
                  model: Union[Model, str],
                  preprocessor: Optional[Preprocessor] = None,
+                 config_file: str = None,
+                 device: str = 'gpu',
+                 auto_collate=True,
                  first_sequence='sentence',
+                 sequence_length=128,
                  **kwargs):
         """Use `model` and `preprocessor` to create a generation pipeline for prediction.
@@ -33,11 +39,8 @@
             or a model id from the model hub, or a torch model instance.
             preprocessor (Preprocessor): An optional preprocessor instance, please make sure the preprocessor fits for
             the model if supplied.
-            first_sequence: The key to read the first sentence in.
-            sequence_length: Max sequence length in the user's custom scenario. 128 will be used as a default value.
-            NOTE: Inputs of type 'str' are also supported. In this scenario, the 'first_sequence'
-            param will have no effect.
+            kwargs (dict, `optional`):
+                Extra kwargs passed into the preprocessor's constructor.

         Example:
         >>> from modelscope.pipelines import pipeline
@@ -49,26 +52,29 @@
         >>> # Or use the dict input:
         >>> print(pipeline_ins({'sentence': sentence1}))

-        To view other examples plese check the tests/pipelines/test_text_generation.py.
+        To view other examples please check tests/pipelines/test_text_generation.py.
         """
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
-        cfg = read_config(self.model.model_dir)
-        self.postprocessor = cfg.pop('postprocessor', 'decode')
+        super().__init__(
+            model=model,
+            preprocessor=preprocessor,
+            config_file=config_file,
+            device=device,
+            auto_collate=auto_collate)
         if preprocessor is None:
-            preprocessor_cfg = cfg.preprocessor
-            preprocessor_cfg.update({
-                'model_dir':
-                self.model.model_dir,
-                'first_sequence':
-                first_sequence,
-                'second_sequence':
-                None,
-                'sequence_length':
-                kwargs.pop('sequence_length', 128)
-            })
-            self.preprocessor = build_preprocessor(preprocessor_cfg,
-                                                   Fields.nlp)
+            self.preprocessor = Preprocessor.from_pretrained(
+                self.model.model_dir,
+                first_sequence=first_sequence,
+                sequence_length=sequence_length,
+                **kwargs)
         self.model.eval()
+        self.postprocessor = kwargs.pop('postprocessor', None)
+        if self.postprocessor is None and hasattr(self.model, 'model_dir'):
+            # Compatible with old code
+            cfg = read_config(self.model.model_dir)
+            self.postprocessor = cfg.get('postprocessor')
+        if self.postprocessor is None:
+            self.postprocessor = 'decode'

     def _sanitize_parameters(self, **pipeline_parameters):
         return {}, pipeline_parameters, {}
@@ -79,20 +85,19 @@
         return self.model.generate(inputs, **forward_params)

     def decode(self, inputs) -> str:
-        tokenizer = self.preprocessor.tokenizer
-        return tokenizer.decode(inputs.tolist(), skip_special_tokens=True)
+        return self.preprocessor.decode(
+            inputs.tolist(), skip_special_tokens=True)

     def sentence_piece(self, inputs) -> str:
-        tokenizer = self.preprocessor.tokenizer
-        return tokenizer.decode(inputs.tolist())
+        return self.preprocessor.decode(inputs.tolist())

     def roberta(self, inputs) -> str:
-        tokenizer = self.preprocessor.tokenizer
-        decoded = tokenizer.decode(inputs.tolist())
+        decoded = self.preprocessor.decode(inputs.tolist())
         return decoded.replace('<q>', '. ').replace('<mask>',
                                                     '. ').replace('</s>', '')

-    def postprocess(self, inputs: Dict[str, Tensor],
+    def postprocess(self, inputs: Union[Dict[str, Tensor],
+                                        TokenGeneratorOutput],
                     **postprocess_params) -> Dict[str, str]:
         """process the prediction results
@@ -102,9 +107,72 @@
         Returns:
             Dict[str, str]: the prediction results
         """
-        inputs = inputs['sequences']
+        if isinstance(inputs, (dict, ModelOutputBase)):
+            inputs = inputs['sequences']
         if isinstance(inputs, list) or len(inputs.shape) > 1:
             inputs = inputs[0]
         decoded = getattr(self, self.postprocessor)(inputs)
         text = remove_space_between_chinese_chars(decoded)
         return {OutputKeys.TEXT: text}
+
+@PIPELINES.register_module(
+    Tasks.text2text_generation, module_name=Pipelines.translation_en_to_de)
+@PIPELINES.register_module(
+    Tasks.text2text_generation, module_name=Pipelines.translation_en_to_ro)
+@PIPELINES.register_module(
+    Tasks.text2text_generation, module_name=Pipelines.translation_en_to_fr)
+@PIPELINES.register_module(
+    Tasks.text2text_generation, module_name=Pipelines.text2text_generation)
+class TextGenerationT5Pipeline(TextGenerationPipeline):
+
+    def __init__(self,
+                 model: Union[Model, str],
+                 preprocessor: Optional[Preprocessor] = None,
+                 sub_task=None,
+                 **kwargs):
+        super().__init__(model, preprocessor, **kwargs)
+        self.sub_task = sub_task
+        self.task_specific_params = self._parse_specific_model_params(
+            getattr(self.model, 'model_dir', None), 'task_specific_params')
+        self.min_length = self._parse_specific_model_params(
+            getattr(self.model, 'model_dir', None), 'min_length')
+        self.max_length = self._parse_specific_model_params(
+            getattr(self.model, 'model_dir', None), 'max_length')
+
+    def _parse_specific_model_params(self, model_dir, key):
+        if model_dir is None:
+            return
+        cfg: Config = read_config(model_dir)
+        params = cfg.safe_get(f'model.{key}')
+        if params is None:
+            cfg: Config = read_config(os.path.join(model_dir, 'config.json'))
+            params = cfg.safe_get(key)
+        return params
+
+    def preprocess(self, inputs, **preprocess_params) -> Dict[str, Any]:
+        if not isinstance(inputs, str):
+            raise ValueError(f'Not supported input type: {type(inputs)}')
+        if self.task_specific_params is not None:
+            sub_task = self.sub_task or self.model.pipeline.type
+            if sub_task in self.task_specific_params:
+                self.model.config.update(self.task_specific_params[sub_task])
+                if 'prefix' in self.task_specific_params[sub_task]:
+                    inputs = self.task_specific_params[sub_task].prefix + inputs
+        return super().preprocess(inputs, **preprocess_params)
+
+    def forward(self, inputs: Dict[str, Any],
+                **forward_params) -> Dict[str, Any]:
+        min_length = forward_params.get('min_length', self.min_length)
+        max_length = forward_params.get('max_length', self.max_length)
+        if min_length is not None:
+            forward_params['min_length'] = min_length
+        if max_length is not None:
+            forward_params['max_length'] = max_length
+        with torch.no_grad():
+            return self.model.generate(**inputs, **forward_params)
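`_parse_specific_model_params` prefers `model.<key>` from the ModelScope configuration and falls back to the raw HF `config.json`. A dict-based sketch of that two-level lookup (plain dicts stand in for `Config.safe_get`):

```python
def parse_specific_model_params(ms_cfg: dict, hf_cfg: dict, key: str):
    # prefer the ModelScope config's model section, then fall back to config.json
    value = ms_cfg.get('model', {}).get(key)
    if value is None:
        value = hf_cfg.get(key)
    return value

ms_cfg = {'model': {'min_length': 10}}
hf_cfg = {'min_length': 5, 'max_length': 50}
print(parse_specific_model_params(ms_cfg, hf_cfg, 'min_length'))  # 10, ModelScope wins
print(parse_specific_model_params(ms_cfg, hf_cfg, 'max_length'))  # 50, fallback
```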
@@ -9,7 +9,8 @@ from modelscope.models import Model
 from modelscope.outputs import OutputKeys
 from modelscope.pipelines.base import Pipeline
 from modelscope.pipelines.builder import PIPELINES
-from modelscope.preprocessors import Preprocessor, TextRankingPreprocessor
+from modelscope.preprocessors import (Preprocessor,
+                                      TextRankingTransformersPreprocessor)
 from modelscope.utils.constant import Tasks

 __all__ = ['TextRankingPipeline']
@@ -22,6 +23,10 @@ class TextRankingPipeline(Pipeline):
     def __init__(self,
                  model: Union[Model, str],
                  preprocessor: Optional[Preprocessor] = None,
+                 config_file: str = None,
+                 device: str = 'gpu',
+                 auto_collate=True,
+                 sequence_length=128,
                  **kwargs):
         """Use `model` and `preprocessor` to create a nlp text ranking pipeline for prediction.
@@ -30,14 +35,21 @@
             or a model id from the model hub, or a torch model instance.
             preprocessor (Preprocessor): An optional preprocessor instance, please make sure the preprocessor fits for
             the model if supplied.
+            sequence_length: Max sequence length in the user's custom scenario. 128 will be used as a default value.
+            kwargs (dict, `optional`):
+                Extra kwargs passed into the preprocessor's constructor.
         """
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
+        super().__init__(
+            model=model,
+            preprocessor=preprocessor,
+            config_file=config_file,
+            device=device,
+            auto_collate=auto_collate)
         if preprocessor is None:
             self.preprocessor = Preprocessor.from_pretrained(
                 self.model.model_dir,
-                sequence_length=kwargs.pop('sequence_length', 128))
+                sequence_length=sequence_length,
+                **kwargs)

     def forward(self, inputs: Dict[str, Any],
                 **forward_params) -> Dict[str, Any]:
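The same pattern applies here: `sequence_length` is now an explicit argument, and remaining kwargs still reach `Preprocessor.from_pretrained`. A hypothetical instantiation (placeholder model dir; the input keys follow the usual text-ranking convention and may differ per model):

```python
from modelscope.pipelines.nlp import TextRankingPipeline

pipe = TextRankingPipeline('<model-dir>', sequence_length=256, device='cpu')
result = pipe({
    'source_sentence': ['功和功率的区别'],
    'sentences_to_compare': ['功反映做功多少', '功率反映做功快慢'],
})
```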
@@ -1,7 +1,8 @@
 # Copyright (c) Alibaba, Inc. and its affiliates.
-from typing import Any, Dict, Optional, Union
+from typing import Any, Dict, List, Optional, Union

+import numpy as np
 import torch

 from modelscope.metainfo import Pipelines
@@ -32,24 +33,35 @@ class TokenClassificationPipeline(Pipeline):
     def __init__(self,
                  model: Union[Model, str],
                  preprocessor: Optional[Preprocessor] = None,
+                 config_file: str = None,
+                 device: str = 'gpu',
+                 auto_collate=True,
+                 sequence_length=128,
                  **kwargs):
         """use `model` and `preprocessor` to create a token classification pipeline for prediction

         Args:
             model (str or Model): A model instance or a model local dir or a model id in the model hub.
             preprocessor (Preprocessor): a preprocessor instance, must not be None.
+            kwargs (dict, `optional`):
+                Extra kwargs passed into the preprocessor's constructor.
         """
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
+        super().__init__(
+            model=model,
+            preprocessor=preprocessor,
+            config_file=config_file,
+            device=device,
+            auto_collate=auto_collate)
         if preprocessor is None:
             self.preprocessor = Preprocessor.from_pretrained(
                 self.model.model_dir,
-                sequence_length=kwargs.pop('sequence_length', 128))
+                sequence_length=sequence_length,
+                **kwargs)
         self.model.eval()
-        self.id2label = kwargs.get('id2label')
-        if self.id2label is None and hasattr(self.preprocessor, 'id2label'):
-            self.id2label = self.preprocessor.id2label
+        assert hasattr(self.preprocessor, 'id2label')
+        self.id2label = self.preprocessor.id2label

     def forward(self, inputs: Dict[str, Any],
                 **forward_params) -> Dict[str, Any]:
@@ -60,53 +72,59 @@ class TokenClassificationPipeline(Pipeline):
         }

     def postprocess(self, inputs: Dict[str, Any],
-                    **postprocess_params) -> Dict[str, str]:
-        """process the prediction results
+                    **postprocess_params) -> Dict[str, Any]:
+        """Process the prediction results

         Args:
             inputs (Dict[str, Any]): should be tensors from model

         Returns:
-            Dict[str, str]: the prediction results
+            Dict[str, Any]: the prediction results
         """
         chunks = self._chunk_process(inputs, **postprocess_params)
-        # for cws outputs
-        if len(chunks) > 0 and chunks[0]['type'].lower() == 'cws':
-            spans = [
-                chunk['span'] for chunk in chunks if chunk['span'].strip()
-            ]
-            seg_result = [span for span in spans]
-            outputs = {OutputKeys.OUTPUT: seg_result}
-        # for ner outputs
-        else:
-            outputs = {OutputKeys.OUTPUT: chunks}
-        return outputs
+        return {OutputKeys.OUTPUT: chunks}

     def _chunk_process(self, inputs: Dict[str, Any],
-                       **postprocess_params) -> Dict[str, str]:
+                       **postprocess_params) -> List:
         """process the prediction results and output as chunks

         Args:
             inputs (Dict[str, Any]): should be tensors from model

         Returns:
-            Dict[str, str]: the prediction results
+            List: The output chunks
         """
         text = inputs['text']
+        # TODO post_process does not support batch for now.
         if OutputKeys.PREDICTIONS not in inputs:
             logits = inputs[OutputKeys.LOGITS]
-            predictions = torch.argmax(logits[0], dim=-1)
+            if len(logits.shape) == 3:
+                logits = logits[0]
+            predictions = torch.argmax(logits, dim=-1)
         else:
-            predictions = inputs[OutputKeys.PREDICTIONS].squeeze(
-                0).cpu().numpy()
+            predictions = inputs[OutputKeys.PREDICTIONS]
+            if len(predictions.shape) == 2:
+                predictions = predictions[0]
+
+        offset_mapping = inputs['offset_mapping']
+        if len(offset_mapping.shape) == 3:
+            offset_mapping = offset_mapping[0]
+
+        label_mask = inputs.get('label_mask')
+        if label_mask is not None:
+            masked_lengths = label_mask.sum(-1).long().cpu().item()
+            offset_mapping = torch.narrow(
+                offset_mapping, 0, 0,
+                masked_lengths)  # narrow only moves the offset, it does not resize
+            predictions = torch.narrow(
+                predictions, 0, 0,
+                masked_lengths)  # narrow only moves the offset, it does not resize
+
+        offset_mapping = torch_nested_numpify(
+            torch_nested_detach(offset_mapping))
         predictions = torch_nested_numpify(torch_nested_detach(predictions))
-        offset_mapping = [x.cpu().tolist() for x in inputs['offset_mapping']]
         labels = [self.id2label[x] for x in predictions]
+        if len(labels) > len(offset_mapping):
+            labels = labels[1:-1]

         chunks = []
         chunk = {}
         for label, offsets in zip(labels, offset_mapping):
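The new `label_mask` branch trims padded positions before decoding. A small torch sketch of the `narrow` logic on toy tensors:

```python
import torch

# batch of 1, sequence of 6, of which only the first 4 positions carry labels
predictions = torch.tensor([[1, 2, 2, 0, 0, 0]])[0]
offset_mapping = torch.tensor([[[0, 1], [1, 2], [2, 3], [3, 4], [0, 0], [0, 0]]])[0]
label_mask = torch.tensor([[1, 1, 1, 1, 0, 0]])

masked_length = label_mask.sum(-1).long().item()
# narrow returns a view of the first `masked_length` entries, no copy
predictions = torch.narrow(predictions, 0, 0, masked_length)
offset_mapping = torch.narrow(offset_mapping, 0, 0, masked_length)
print(predictions.tolist())     # [1, 2, 2, 0]
print(offset_mapping.tolist())  # [[0, 1], [1, 2], [2, 3], [3, 4]]
```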
@@ -2,19 +2,15 @@
 import io
 import os
-from typing import Any, Dict, Union
+from typing import Any, Dict

-import numpy as np
 import torch
 from transformers import XLMRobertaTokenizer

 from modelscope.metainfo import Pipelines
-from modelscope.models import Model
-from modelscope.models.nlp import BertForSequenceClassification
 from modelscope.outputs import OutputKeys
-from modelscope.pipelines.base import Input, Pipeline
+from modelscope.pipelines.base import Pipeline
 from modelscope.pipelines.builder import PIPELINES
-from modelscope.preprocessors import SequenceClassificationPreprocessor
 from modelscope.utils.constant import ModelFile, Tasks

 __all__ = ['TranslationQualityEstimationPipeline']
@@ -10,9 +10,9 @@ from modelscope.outputs import OutputKeys
 from modelscope.pipelines.base import Pipeline
 from modelscope.pipelines.builder import PIPELINES
 from modelscope.pipelines.nlp import TokenClassificationPipeline
-from modelscope.preprocessors import (Preprocessor,
-                                      TokenClassificationPreprocessor,
-                                      WordSegmentationPreprocessorThai)
+from modelscope.preprocessors import (
+    Preprocessor, TokenClassificationTransformersPreprocessor,
+    WordSegmentationPreprocessorThai)
 from modelscope.utils.constant import Tasks
 from modelscope.utils.tensor_utils import (torch_nested_detach,
                                            torch_nested_numpify)
@@ -23,42 +23,49 @@ __all__ = ['WordSegmentationPipeline', 'WordSegmentationThaiPipeline']
 @PIPELINES.register_module(
     Tasks.word_segmentation, module_name=Pipelines.word_segmentation)
 class WordSegmentationPipeline(TokenClassificationPipeline):
+    """Use `model` and `preprocessor` to create a nlp word segment pipeline for prediction.
+
+    NOTE: The preprocessor will first split the sentence into single characters,
+    then feed them into the tokenizer with the parameter is_split_into_words=True.
+
+    Example:
+    >>> from modelscope.pipelines import pipeline
+    >>> pipeline_ins = pipeline(task='word-segmentation',
+    >>>    model='damo/nlp_structbert_word-segmentation_chinese-base')
+    >>> sentence1 = '今天天气不错,适合出去游玩'
+    >>> print(pipeline_ins(sentence1))
+
+    To view other examples please check tests/pipelines/test_word_segmentation.py.
+    """

-    def __init__(self,
-                 model: Union[Model, str],
-                 preprocessor: Optional[Preprocessor] = None,
-                 **kwargs):
-        """Use `model` and `preprocessor` to create a nlp word segment pipeline for prediction.
-
-        Args:
-            model (str or Model): Supply either a local model dir which supported the WS task,
-            or a model id from the model hub, or a torch model instance.
-            preprocessor (Preprocessor): An optional preprocessor instance, please make sure the preprocessor fits for
-            the model if supplied.
-            sequence_length: Max sequence length in the user's custom scenario. 128 will be used as a default value.
-
-        NOTE: The preprocessor will first split the sentence into single characters,
-        then feed them into the tokenizer with the parameter is_split_into_words=True.
-
-        Example:
-        >>> from modelscope.pipelines import pipeline
-        >>> pipeline_ins = pipeline(task='word-segmentation',
-        >>>    model='damo/nlp_structbert_word-segmentation_chinese-base')
-        >>> sentence1 = '今天天气不错,适合出去游玩'
-        >>> print(pipeline_ins(sentence1))
-
-        To view other examples plese check the tests/pipelines/test_word_segmentation.py.
-        """
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
-        if preprocessor is None:
-            self.preprocessor = TokenClassificationPreprocessor(
-                self.model.model_dir,
-                sequence_length=kwargs.pop('sequence_length', 128))
-        self.model.eval()
-        self.id2label = kwargs.get('id2label')
-        if self.id2label is None and hasattr(self.preprocessor, 'id2label'):
-            self.id2label = self.preprocessor.id2label
+    def postprocess(self,
+                    inputs: Dict[str, Any],
+                    output_final_sentence=True,
+                    **postprocess_params) -> Dict[str, Any]:
+        """Process the prediction results
+
+        Args:
+            inputs (Dict[str, Any]): should be tensors from model
+            output_final_sentence (bool): Whether to output the cut sentence split by blanks.
+                If False, the pipeline will output the original token-label information.
+
+        Returns:
+            Dict[str, Any]: The prediction results.
+        """
+        chunks = self._chunk_process(inputs, **postprocess_params)
+        # for cws outputs
+        if output_final_sentence:
+            spans = [
+                chunk['span'] for chunk in chunks if chunk['span'].strip()
+            ]
+            seg_result = [span for span in spans]
+            outputs = {OutputKeys.OUTPUT: seg_result}
+        # for ner outputs
+        else:
+            outputs = {OutputKeys.OUTPUT: chunks}
+        return outputs

 @PIPELINES.register_module(
@@ -66,8 +73,10 @@ class WordSegmentationPipeline(TokenClassificationPipeline):
     module_name=Pipelines.multilingual_word_segmentation)
 class MultilingualWordSegmentationPipeline(WordSegmentationPipeline):

-    def postprocess(self, inputs: Dict[str, Any],
-                    **postprocess_params) -> Dict[str, str]:
+    def postprocess(self,
+                    inputs: Dict[str, Any],
+                    output_final_sentence=True,
+                    **postprocess_params) -> Dict[str, Any]:
         chunks = self._chunk_process(inputs, **postprocess_params)
         word_segments = [entity['span'] for entity in chunks]
         return {OutputKeys.OUTPUT: word_segments}
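`output_final_sentence` now controls whether CWS output is collapsed into a token list or left as raw chunks. A sketch of the two output shapes, assuming each chunk is a dict with a `span` field:

```python
chunks = [{'span': '今天'}, {'span': '天气'}, {'span': ' '}]

# output_final_sentence=True: keep only non-blank spans
print({'output': [c['span'] for c in chunks if c['span'].strip()]})
# {'output': ['今天', '天气']}

# output_final_sentence=False: the raw chunk dicts pass through
print({'output': chunks})
```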
@@ -80,14 +89,22 @@ class WordSegmentationThaiPipeline(MultilingualWordSegmentationPipeline):
     def __init__(self,
                  model: Union[Model, str],
                  preprocessor: Optional[Preprocessor] = None,
+                 config_file: str = None,
+                 device: str = 'gpu',
+                 auto_collate=True,
+                 sequence_length=512,
                  **kwargs):
-        model = model if isinstance(model,
-                                    Model) else Model.from_pretrained(model)
+        super().__init__(
+            model=model,
+            preprocessor=preprocessor,
+            config_file=config_file,
+            device=device,
+            auto_collate=auto_collate)
         if preprocessor is None:
-            preprocessor = WordSegmentationPreprocessorThai(
-                model.model_dir,
-                sequence_length=kwargs.pop('sequence_length', 512))
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
+            self.preprocessor = WordSegmentationPreprocessorThai(
+                self.model.model_dir,
+                sequence_length=sequence_length,
+                **kwargs)

     def postprocess(self, inputs: Dict[str, Any],
                     **postprocess_params) -> Dict[str, str]:
@@ -10,8 +10,7 @@
 from modelscope.outputs import OutputKeys
 from modelscope.pipelines.base import Pipeline
 from modelscope.pipelines.builder import PIPELINES
-from modelscope.preprocessors import (Preprocessor,
-                                      ZeroShotClassificationPreprocessor)
+from modelscope.preprocessors import Preprocessor
 from modelscope.utils.constant import Tasks

 __all__ = ['ZeroShotClassificationPipeline']
@@ -25,6 +24,10 @@ class ZeroShotClassificationPipeline(Pipeline):
     def __init__(self,
                  model: Union[Model, str],
                  preprocessor: Preprocessor = None,
+                 config_file: str = None,
+                 device: str = 'gpu',
+                 auto_collate=True,
+                 sequence_length=512,
                  **kwargs):
         """Use `model` and `preprocessor` to create a nlp zero-shot classification pipeline for prediction.
@@ -44,7 +47,8 @@
             or a model id from the model hub, or a torch model instance.
             preprocessor (Preprocessor): An optional preprocessor instance, please make sure the preprocessor fits for
             the model if supplied.
-            sequence_length: Max sequence length in the user's custom scenario. 512 will be used as a default value.
+            kwargs (dict, `optional`):
+                Extra kwargs passed into the preprocessor's constructor.

         Example:
         >>> from modelscope.pipelines import pipeline
@@ -55,17 +59,22 @@
         >>> template = '这篇文章的标题是{}'
         >>> print(pipeline_ins(sentence1, candidate_labels=labels, hypothesis_template=template))

-        To view other examples plese check the tests/pipelines/test_zero_shot_classification.py.
+        To view other examples please check tests/pipelines/test_zero_shot_classification.py.
         """
-        assert isinstance(model, str) or isinstance(model, Model), \
-            'model must be a single str or Model'
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
+        super().__init__(
+            model=model,
+            preprocessor=preprocessor,
+            config_file=config_file,
+            device=device,
+            auto_collate=auto_collate)
         self.entailment_id = 0
         self.contradiction_id = 2
         if preprocessor is None:
-            self.preprocessor = ZeroShotClassificationPreprocessor(
-                self.model.model_dir,
-                sequence_length=kwargs.pop('sequence_length', 512))
+            sequence_length = kwargs.pop('sequence_length', 512)
+            self.preprocessor = Preprocessor.from_pretrained(
+                self.model.model_dir,
+                sequence_length=sequence_length,
+                **kwargs)
         self.model.eval()

     def _sanitize_parameters(self, **kwargs):
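Zero-shot classification scores each candidate label by pairing the premise with a hypothesis built from the template and reading the NLI logits; `entailment_id = 0` and `contradiction_id = 2` record which logit columns to use. A toy sketch of turning entailment logits into label scores (the numbers are made up):

```python
import numpy as np

logits = np.array([[2.0, 0.1, -1.0],   # label A: [entailment, neutral, contradiction]
                   [-0.5, 0.2, 1.5]])  # label B
entailment_id = 0

scores = logits[:, entailment_id]
probs = np.exp(scores) / np.exp(scores).sum()
print(probs.round(3))  # label A clearly wins
```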
@@ -16,15 +16,19 @@ if TYPE_CHECKING:
     from .kws import WavToLists
     from .multi_modal import (OfaPreprocessor, MPlugPreprocessor)
     from .nlp import (
-        DocumentSegmentationPreprocessor, FaqQuestionAnsweringPreprocessor,
-        FillMaskPoNetPreprocessor, NLPPreprocessor,
-        NLPTokenizerPreprocessorBase, PassageRankingPreprocessor,
-        TextRankingPreprocessor, RelationExtractionPreprocessor,
-        SentenceEmbeddingPreprocessor, SequenceClassificationPreprocessor,
-        TokenClassificationPreprocessor, TextErrorCorrectionPreprocessor,
-        TextGenerationPreprocessor, Text2TextGenerationPreprocessor, Tokenize,
+        DocumentSegmentationTransformersPreprocessor,
+        FaqQuestionAnsweringTransformersPreprocessor,
+        FillMaskPoNetPreprocessor, FillMaskTransformersPreprocessor,
+        TextRankingTransformersPreprocessor,
+        RelationExtractionTransformersPreprocessor,
+        SentenceEmbeddingTransformersPreprocessor,
+        TextClassificationTransformersPreprocessor,
+        TokenClassificationTransformersPreprocessor,
+        TextErrorCorrectionPreprocessor, TextGenerationT5Preprocessor,
+        TextGenerationTransformersPreprocessor, Tokenize,
         WordSegmentationBlankSetToLabelPreprocessor, CodeGeeXPreprocessor,
-        MGLMSummarizationPreprocessor, ZeroShotClassificationPreprocessor,
+        MGLMSummarizationPreprocessor,
+        ZeroShotClassificationTransformersPreprocessor,
         TextGenerationJiebaPreprocessor, SentencePiecePreprocessor,
         DialogIntentPredictionPreprocessor, DialogModelingPreprocessor,
         DialogStateTrackingPreprocessor, ConversationalTextToSqlPreprocessor,
@@ -47,18 +51,21 @@ else:
         'kws': ['WavToLists'],
         'multi_modal': ['OfaPreprocessor', 'MPlugPreprocessor'],
         'nlp': [
-            'DocumentSegmentationPreprocessor',
-            'FaqQuestionAnsweringPreprocessor', 'FillMaskPoNetPreprocessor',
-            'NLPPreprocessor', 'NLPTokenizerPreprocessorBase',
-            'TextRankingPreprocessor', 'RelationExtractionPreprocessor',
-            'SentenceEmbeddingPreprocessor',
-            'SequenceClassificationPreprocessor',
-            'TokenClassificationPreprocessor',
-            'TextErrorCorrectionPreprocessor', 'TextGenerationPreprocessor',
-            'Tokenize', 'Text2TextGenerationPreprocessor',
+            'DocumentSegmentationTransformersPreprocessor',
+            'FaqQuestionAnsweringTransformersPreprocessor',
+            'FillMaskPoNetPreprocessor', 'FillMaskTransformersPreprocessor',
+            'NLPTokenizerPreprocessorBase',
+            'TextRankingTransformersPreprocessor',
+            'RelationExtractionTransformersPreprocessor',
+            'SentenceEmbeddingTransformersPreprocessor',
+            'TextClassificationTransformersPreprocessor',
+            'TokenClassificationTransformersPreprocessor',
+            'TextErrorCorrectionPreprocessor',
+            'TextGenerationTransformersPreprocessor', 'Tokenize',
+            'TextGenerationT5Preprocessor',
             'WordSegmentationBlankSetToLabelPreprocessor',
             'MGLMSummarizationPreprocessor', 'CodeGeeXPreprocessor',
-            'ZeroShotClassificationPreprocessor',
+            'ZeroShotClassificationTransformersPreprocessor',
             'TextGenerationJiebaPreprocessor', 'SentencePiecePreprocessor',
             'NERPreprocessorViet', 'NERPreprocessorThai',
             'WordSegmentationPreprocessorThai',
@@ -2,9 +2,10 @@
 import os
 from abc import ABC, abstractmethod
 from copy import deepcopy
-from typing import Any, Dict, Optional, Sequence
+from typing import Any, Callable, Dict, Optional, Sequence, Union

 from modelscope.metainfo import Models, Preprocessors
+from modelscope.utils.checkpoint import save_configuration
 from modelscope.utils.config import Config, ConfigDict
 from modelscope.utils.constant import (DEFAULT_MODEL_REVISION, Invoke,
                                        ModeKeys, Tasks)
@@ -98,6 +99,8 @@ PREPROCESSOR_MAP = {
     Preprocessors.sen_cls_tokenizer,
     (Models.structbert, Tasks.part_of_speech):
     Preprocessors.token_cls_tokenizer,
+    (Models.token_classification_for_ner, Tasks.named_entity_recognition):
+    Preprocessors.token_cls_tokenizer,
     (Models.structbert, Tasks.token_classification):
     Preprocessors.token_cls_tokenizer,
     (Models.structbert, Tasks.word_segmentation):
@@ -117,7 +120,15 @@ PREPROCESSOR_MAP = {
     (Models.veco, Tasks.sentence_similarity):
     Preprocessors.sen_cls_tokenizer,

-    # space
+    # taskmodels
+    (Models.lcrf, Tasks.named_entity_recognition):
+    Preprocessors.sequence_labeling_tokenizer,
+    (Models.lcrf_wseg, Tasks.word_segmentation):
+    Preprocessors.sequence_labeling_tokenizer,
+    (Models.tcrf_wseg, Tasks.word_segmentation):
+    Preprocessors.sequence_labeling_tokenizer,
+    (Models.tcrf, Tasks.named_entity_recognition):
+    Preprocessors.sequence_labeling_tokenizer,
 }
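`PREPROCESSOR_MAP` keys preprocessor defaults by `(model type, task)`, so a taskmodel such as a CRF head resolves to the sequence-labeling tokenizer even when its config has no preprocessor section. An illustrative lookup with stand-in string constants (the real keys are `Models`/`Tasks`/`Preprocessors` members):

```python
PREPROCESSOR_MAP = {
    ('lstm-crf', 'named-entity-recognition'): 'sequence-labeling-tokenizer',
    ('structbert', 'word-segmentation'): 'token-cls-tokenizer',
}

model_type, task = 'lstm-crf', 'named-entity-recognition'
print(PREPROCESSOR_MAP.get((model_type, task)))  # 'sequence-labeling-tokenizer'
```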
@@ -125,6 +136,8 @@ class Preprocessor(ABC):

     def __init__(self, mode=ModeKeys.INFERENCE, *args, **kwargs):
         self._mode = mode
+        assert self._mode in (ModeKeys.INFERENCE, ModeKeys.TRAIN,
+                              ModeKeys.EVAL)
         self.device = int(
             os.environ['LOCAL_RANK']) if 'LOCAL_RANK' in os.environ else None
         pass
@@ -264,4 +277,41 @@ class Preprocessor(ABC):
             })
             preprocessor = build_preprocessor(sub_cfg, field_name)
         preprocessor.mode = preprocessor_mode
+        sub_cfg.pop('model_dir', None)
+        if not hasattr(preprocessor, 'cfg'):
+            preprocessor.cfg = cfg
         return preprocessor
+
+    def save_pretrained(self,
+                        target_folder: Union[str, os.PathLike],
+                        config: Optional[dict] = None,
+                        save_config_function: Callable = save_configuration):
+        """Save the preprocessor, its configuration and other related files to a directory,
+        so that it can be re-loaded.
+
+        By default, this method will save the preprocessor's config with mode `inference`.
+
+        Args:
+            target_folder (Union[str, os.PathLike]):
+                Directory to which to save. Will be created if it doesn't exist.
+            config (Optional[dict], optional):
+                The config to be saved into configuration.json.
+            save_config_function (Callable): The function used to save the configuration,
+                called after the config is updated.
+        """
+        if config is None and hasattr(self, 'cfg'):
+            config = self.cfg
+
+        if config is not None:
+            # Update the mode to `inference` in the preprocessor field.
+            if 'preprocessor' in config and config['preprocessor'] is not None:
+                if 'mode' in config['preprocessor']:
+                    config['preprocessor']['mode'] = 'inference'
+                elif 'val' in config['preprocessor'] and 'mode' in config[
+                        'preprocessor']['val']:
+                    config['preprocessor']['val']['mode'] = 'inference'
+            save_config_function(target_folder, config)
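A usage sketch of the new `save_pretrained` (directories are placeholders): `from_pretrained` attaches the model's config as `preprocessor.cfg`, and saving writes that config back with the mode forced to `inference`, so a later `from_pretrained` on the target folder reconstructs the same preprocessor:

```python
from modelscope.preprocessors import Preprocessor

preprocessor = Preprocessor.from_pretrained('<model-dir>')
# writes configuration.json into '<target-dir>' via save_configuration,
# with the preprocessor mode reset to 'inference'
preprocessor.save_pretrained('<target-dir>')
```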
@@ -5,24 +5,22 @@ from modelscope.utils.import_utils import LazyImportModule

 if TYPE_CHECKING:
     from .text_error_correction import TextErrorCorrectionPreprocessor
-    from .nlp_base import (NLPTokenizerPreprocessorBase, NLPBasePreprocessor)
-    from .text_generation_jieba_preprocessor import TextGenerationJiebaPreprocessor
+    from .text_generation_preprocessor import TextGenerationJiebaPreprocessor
     from .sentence_piece_preprocessor import SentencePiecePreprocessor
     from .bert_seq_cls_tokenizer import Tokenize
-    from .document_segmentation_preprocessor import DocumentSegmentationPreprocessor
-    from .faq_question_answering_preprocessor import FaqQuestionAnsweringPreprocessor
-    from .fill_mask_preprocessor import FillMaskPoNetPreprocessor, NLPPreprocessor
-    from .text_ranking_preprocessor import TextRankingPreprocessor
-    from .relation_extraction_preprocessor import RelationExtractionPreprocessor
-    from .sentence_classification_preprocessor import SequenceClassificationPreprocessor
-    from .sentence_embedding_preprocessor import SentenceEmbeddingPreprocessor
-    from .text_generation_preprocessor import TextGenerationPreprocessor
-    from .text2text_generation_preprocessor import Text2TextGenerationPreprocessor
-    from .token_classification_preprocessor import TokenClassificationPreprocessor, \
+    from .document_segmentation_preprocessor import DocumentSegmentationTransformersPreprocessor
+    from .faq_question_answering_preprocessor import FaqQuestionAnsweringTransformersPreprocessor
+    from .fill_mask_preprocessor import FillMaskPoNetPreprocessor, FillMaskTransformersPreprocessor
+    from .text_ranking_preprocessor import TextRankingTransformersPreprocessor
+    from .relation_extraction_preprocessor import RelationExtractionTransformersPreprocessor
+    from .text_classification_preprocessor import TextClassificationTransformersPreprocessor
+    from .sentence_embedding_preprocessor import SentenceEmbeddingTransformersPreprocessor
+    from .text_generation_preprocessor import TextGenerationTransformersPreprocessor, TextGenerationT5Preprocessor
+    from .token_classification_preprocessor import TokenClassificationTransformersPreprocessor, \
         WordSegmentationBlankSetToLabelPreprocessor
     from .token_classification_thai_preprocessor import WordSegmentationPreprocessorThai, NERPreprocessorThai
     from .token_classification_viet_preprocessor import NERPreprocessorViet
-    from .zero_shot_classification_reprocessor import ZeroShotClassificationPreprocessor
+    from .zero_shot_classification_preprocessor import ZeroShotClassificationTransformersPreprocessor
     from .space import (DialogIntentPredictionPreprocessor,
                         DialogModelingPreprocessor,
                         DialogStateTrackingPreprocessor, InputFeatures,
@@ -36,30 +34,31 @@ else:
             'NLPTokenizerPreprocessorBase',
             'NLPBasePreprocessor',
         ],
-        'text_generation_jieba_preprocessor':
-        ['TextGenerationJiebaPreprocessor'],
         'sentence_piece_preprocessor': ['SentencePiecePreprocessor'],
         'bert_seq_cls_tokenizer': ['Tokenize'],
         'document_segmentation_preprocessor':
-        ['DocumentSegmentationPreprocessor'],
+        ['DocumentSegmentationTransformersPreprocessor'],
         'faq_question_answering_preprocessor':
-        ['FaqQuestionAnsweringPreprocessor'],
+        ['FaqQuestionAnsweringTransformersPreprocessor'],
         'fill_mask_preprocessor':
-        ['FillMaskPoNetPreprocessor', 'NLPPreprocessor'],
-        'text_ranking_preprocessor': ['TextRankingPreprocessor'],
-        'relation_extraction_preprocessor': ['RelationExtractionPreprocessor'],
-        'sentence_classification_preprocessor':
-        ['SequenceClassificationPreprocessor'],
-        'sentence_embedding_preprocessor': ['SentenceEmbeddingPreprocessor'],
-        'text_generation_preprocessor': ['TextGenerationPreprocessor'],
-        'text2text_generation_preprocessor':
-        ['Text2TextGenerationPreprocessor'],
+        ['FillMaskPoNetPreprocessor', 'FillMaskTransformersPreprocessor'],
+        'text_ranking_preprocessor': ['TextRankingTransformersPreprocessor'],
+        'relation_extraction_preprocessor':
+        ['RelationExtractionTransformersPreprocessor'],
+        'text_classification_preprocessor':
+        ['TextClassificationTransformersPreprocessor'],
+        'sentence_embedding_preprocessor':
+        ['SentenceEmbeddingTransformersPreprocessor'],
+        'text_generation_preprocessor': [
+            'TextGenerationTransformersPreprocessor',
+            'TextGenerationJiebaPreprocessor', 'TextGenerationT5Preprocessor'
+        ],
         'token_classification_preprocessor': [
-            'TokenClassificationPreprocessor',
+            'TokenClassificationTransformersPreprocessor',
             'WordSegmentationBlankSetToLabelPreprocessor'
         ],
-        'zero_shot_classification_reprocessor':
-        ['ZeroShotClassificationPreprocessor'],
+        'zero_shot_classification_preprocessor':
+        ['ZeroShotClassificationTransformersPreprocessor'],
         'text_error_correction': [
             'TextErrorCorrectionPreprocessor',
         ],
@@ -3,39 +3,52 @@
 from typing import Any, Dict

 from modelscope.metainfo import Preprocessors
+from modelscope.preprocessors import Preprocessor
 from modelscope.preprocessors.builder import PREPROCESSORS
-from modelscope.utils.constant import Fields
+from modelscope.utils.constant import Fields, ModeKeys
 from modelscope.utils.logger import get_logger
-from .nlp_base import NLPBasePreprocessor

 logger = get_logger()

 @PREPROCESSORS.register_module(
     Fields.nlp, module_name=Preprocessors.document_segmentation)
-class DocumentSegmentationPreprocessor(NLPBasePreprocessor):
-
-    def __init__(self, model_dir: str, config, *args, **kwargs):
-        """preprocess the data
+class DocumentSegmentationTransformersPreprocessor(Preprocessor):
+
+    def __init__(self,
+                 model_dir: str,
+                 model_max_length: int,
+                 mode: str = ModeKeys.INFERENCE,
+                 question_column_name='labels',
+                 context_column_name='sentences',
+                 example_id_column_name='example_id',
+                 label_list=['B-EOP', 'O']):
+        """The preprocessor for the document segmentation task, based on transformers' tokenizer.

         Args:
-            model_dir (str): model path
+            model_dir: The model dir containing the essential files to build the tokenizer.
+            model_max_length: The max length the model supports.
+            mode: The mode for this preprocessor.
+            question_column_name: The key for the question column, default `labels`.
+            context_column_name: The key for the context column, default `sentences`.
+            example_id_column_name: The key for the example id column, default `example_id`.
+            label_list: The label list, default `['B-EOP', 'O']`.
         """
-        super().__init__(model_dir, *args, **kwargs)
+        super().__init__(mode)
         from transformers import BertTokenizerFast
-        self.tokenizer = BertTokenizerFast.from_pretrained(
-            model_dir,
-            use_fast=True,
-        )
-        self.question_column_name = 'labels'
-        self.context_column_name = 'sentences'
-        self.example_id_column_name = 'example_id'
-        self.label_to_id = {'B-EOP': 0, 'O': 1}
+        self.tokenizer = BertTokenizerFast.from_pretrained(model_dir)
+        self.question_column_name = question_column_name
+        self.context_column_name = context_column_name
+        self.example_id_column_name = example_id_column_name
+        self.label_list = label_list
+        self.label_to_id = {
+            label: id
+            for id, label in enumerate(self.label_list)
+        }
         self.target_specical_ids = set()
         self.target_specical_ids.add(self.tokenizer.eos_token_id)
-        self.max_seq_length = config.max_position_embeddings
-        self.label_list = ['B-EOP', 'O']
+        self.max_seq_length = model_max_length

     def __call__(self, examples, model_cfg=None) -> Dict[str, Any]:
         questions = examples[self.question_column_name]
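The hard-coded `{'B-EOP': 0, 'O': 1}` mapping is now derived from the `label_list` argument, so the default stays identical while custom label sets become possible:

```python
label_list = ['B-EOP', 'O']
label_to_id = {label: idx for idx, label in enumerate(label_list)}
print(label_to_id)  # {'B-EOP': 0, 'O': 1} (matches the old constant)
```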
| @@ -1,38 +1,58 @@ | |||||
| # Copyright (c) Alibaba, Inc. and its affiliates. | # Copyright (c) Alibaba, Inc. and its affiliates. | ||||
| import os | |||||
| from typing import Any, Dict | from typing import Any, Dict | ||||
| from modelscope.metainfo import Preprocessors | from modelscope.metainfo import Preprocessors | ||||
| from modelscope.preprocessors import Preprocessor | |||||
| from modelscope.preprocessors.builder import PREPROCESSORS | from modelscope.preprocessors.builder import PREPROCESSORS | ||||
| from modelscope.utils.config import Config, ConfigFields | |||||
| from modelscope.utils.constant import Fields, ModeKeys, ModelFile | |||||
| from modelscope.utils.constant import Fields, ModeKeys | |||||
| from modelscope.utils.type_assert import type_assert | from modelscope.utils.type_assert import type_assert | ||||
| from .nlp_base import NLPBasePreprocessor | |||||
| @PREPROCESSORS.register_module( | @PREPROCESSORS.register_module( | ||||
| Fields.nlp, module_name=Preprocessors.faq_question_answering_preprocessor) | Fields.nlp, module_name=Preprocessors.faq_question_answering_preprocessor) | ||||
| class FaqQuestionAnsweringPreprocessor(NLPBasePreprocessor): | |||||
| def __init__(self, model_dir: str, *args, **kwargs): | |||||
| super(FaqQuestionAnsweringPreprocessor, self).__init__( | |||||
| model_dir, mode=ModeKeys.INFERENCE, **kwargs) | |||||
| from transformers import BertTokenizer | |||||
| preprocessor_config = Config.from_file( | |||||
| os.path.join(model_dir, ModelFile.CONFIGURATION)).get( | |||||
| ConfigFields.preprocessor, {}) | |||||
| if preprocessor_config.get('tokenizer', | |||||
| 'BertTokenizer') == 'XLMRoberta': | |||||
+class FaqQuestionAnsweringTransformersPreprocessor(Preprocessor):
+
+    def __init__(self,
+                 model_dir: str,
+                 mode: str = ModeKeys.INFERENCE,
+                 tokenizer='BertTokenizer',
+                 query_set='query_set',
+                 support_set='support_set',
+                 label_in_support_set='label',
+                 text_in_support_set='text',
+                 sequence_length=None,
+                 **kwargs):
+        """The preprocessor for the FAQ question answering task, based on transformers' tokenizer.
+
+        Args:
+            model_dir: The model dir containing the essential files to build the tokenizer.
+            mode: The mode for this preprocessor.
+            tokenizer: The tokenizer type to use. Supported types are `BertTokenizer`
+                and `XLMRoberta`, default `BertTokenizer`.
+            query_set: The key of the query set.
+            support_set: The key of the support set.
+            label_in_support_set: The key of the label in the support set.
+            text_in_support_set: The key of the text in the support set.
+            sequence_length: The sequence length for the preprocessor.
+        """
+        super().__init__(mode)
+        if tokenizer == 'XLMRoberta':
             from transformers import XLMRobertaTokenizer
             self.tokenizer = XLMRobertaTokenizer.from_pretrained(model_dir)
         else:
+            from transformers import BertTokenizer
             self.tokenizer = BertTokenizer.from_pretrained(model_dir)
-        self.MAX_LEN = preprocessor_config.get('max_seq_length', 50)
+        if sequence_length is not None:
+            self.max_len = sequence_length
+        else:
+            self.max_len = kwargs.get('max_seq_length', 50)
         self.label_dict = None
+        self.query_set = query_set
+        self.support_set = support_set
+        self.label_in_support_set = label_in_support_set
+        self.text_in_support_set = text_in_support_set

     def pad(self, samples, max_len):
         result = []

@@ -58,25 +78,31 @@ class FaqQuestionAnsweringPreprocessor(NLPBasePreprocessor):
     @type_assert(object, Dict)
     def __call__(self, data: Dict[str, Any],
                  **preprocessor_param) -> Dict[str, Any]:
-        TMP_MAX_LEN = preprocessor_param.get('max_seq_length', self.MAX_LEN)
-        queryset = data['query_set']
+        tmp_max_len = preprocessor_param.get(
+            'sequence_length',
+            preprocessor_param.get('max_seq_length', self.max_len))
+        queryset = data[self.query_set]
         if not isinstance(queryset, list):
             queryset = [queryset]
-        supportset = data['support_set']
-        supportset = sorted(supportset, key=lambda d: d['label'])
+        supportset = data[self.support_set]
+        supportset = sorted(
+            supportset, key=lambda d: d[self.label_in_support_set])
         queryset_tokenized = [self.encode_plus(text) for text in queryset]
         supportset_tokenized = [
-            self.encode_plus(item['text']) for item in supportset
+            self.encode_plus(item[self.text_in_support_set])
+            for item in supportset
         ]
         max_len = max(
             [len(seq) for seq in queryset_tokenized + supportset_tokenized])
-        max_len = min(TMP_MAX_LEN, max_len)
+        max_len = min(tmp_max_len, max_len)
         queryset_padded = self.pad(queryset_tokenized, max_len)
         supportset_padded = self.pad(supportset_tokenized, max_len)
-        supportset_labels_ori = [item['label'] for item in supportset]
+        supportset_labels_ori = [
+            item[self.label_in_support_set] for item in supportset
+        ]
         label_dict = []
         for label in supportset_labels_ori:
             if label not in label_dict:
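A minimal usage sketch of the new keyword arguments (the texts and labels below are made up for illustration):

    preprocessor = FaqQuestionAnsweringTransformersPreprocessor(model_dir)
    inputs = preprocessor({
        'query_set': ['How do I reset my password?'],
        'support_set': [
            {'text': 'I forgot my password', 'label': 'account'},
            {'text': 'Where is my order', 'label': 'logistics'},
        ],
    })

Callers whose data uses different field names can now pass, e.g., query_set='questions' instead of renaming the data.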
@@ -0,0 +1,78 @@
+# Copyright (c) Alibaba, Inc. and its affiliates.
+from typing import Any, Dict, Tuple, Union
+
+import numpy as np
+
+from modelscope.metainfo import Preprocessors
+from modelscope.preprocessors import Preprocessor
+from modelscope.preprocessors.builder import PREPROCESSORS
+from modelscope.utils.constant import Fields, ModeKeys
+from modelscope.utils.hub import get_model_type
+from .transformers_tokenizer import NLPTokenizer
+from .utils import parse_text_and_label
+
+
+@PREPROCESSORS.register_module(
+    Fields.nlp, module_name=Preprocessors.feature_extraction)
+class FeatureExtractionTransformersPreprocessor(Preprocessor):
+
+    def __init__(self,
+                 model_dir: str = None,
+                 first_sequence: str = None,
+                 second_sequence: str = None,
+                 mode: str = ModeKeys.INFERENCE,
+                 sequence_length: int = 128,
+                 use_fast: bool = None,
+                 **kwargs):
+        """The preprocessor for the feature extraction task, based on transformers' tokenizer.
+
+        Args:
+            model_dir: The model dir used to initialize the tokenizer.
+            use_fast: Use the fast tokenizer or not.
+            sequence_length: The max sequence length which the model supports,
+                will be passed into the tokenizer as the 'max_length' param.
+            **kwargs: Extra args input into the tokenizer's __call__ method.
+        """
+        self.first_sequence = first_sequence
+        self.second_sequence = second_sequence
+        kwargs['truncation'] = kwargs.get('truncation', True)
+        kwargs['padding'] = kwargs.get('padding', 'max_length')
+        kwargs['max_length'] = sequence_length
+        kwargs['return_token_type_ids'] = kwargs.get('return_token_type_ids',
+                                                     True)
+        super().__init__(mode)
+        model_type = None
+        if model_dir is not None:
+            model_type = get_model_type(model_dir)
+        self.nlp_tokenizer = NLPTokenizer(
+            model_dir, model_type, use_fast=use_fast, tokenize_kwargs=kwargs)
+
+    def __call__(self, data: Union[str, Tuple, Dict],
+                 **kwargs) -> Dict[str, Any]:
+        """Process the raw input data.
+
+        Args:
+            data (tuple): [sentence1, sentence2]
+                sentence1 (str): a sentence
+                    Example:
+                        'you are so handsome.'
+
+        Returns:
+            Dict[str, Any]: the preprocessed data
+        """
+        text_a, text_b, _ = parse_text_and_label(data, self.mode,
+                                                 self.first_sequence,
+                                                 self.second_sequence)
+        output = self._tokenize_text(text_a, text_b, **kwargs)
+        output = {
+            k: np.array(v) if isinstance(v, list) else v
+            for k, v in output.items()
+        }
+        return output
+
+    def _tokenize_text(self, sequence1, sequence2=None, **kwargs):
+        if 'return_tensors' not in kwargs:
+            kwargs[
+                'return_tensors'] = 'pt' if self.mode == ModeKeys.INFERENCE else None
+        return self.nlp_tokenizer(sequence1, sequence2, **kwargs)
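A usage sketch, assuming a transformers-compatible model dir (the sentence comes from the docstring example):

    preprocessor = FeatureExtractionTransformersPreprocessor(
        model_dir, first_sequence='sentence')
    features = preprocessor({'sentence': 'you are so handsome.'})
    # At inference time the values are 'pt' tensors padded/truncated to sequence_length.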
@@ -2,60 +2,207 @@
 import os.path as osp
 import re
+from abc import abstractmethod
 from typing import Any, Dict, Tuple, Union

 import numpy as np
 import torch

 from modelscope.metainfo import Preprocessors
+from modelscope.preprocessors import Preprocessor
 from modelscope.preprocessors.builder import PREPROCESSORS
 from modelscope.utils.config import Config
 from modelscope.utils.constant import Fields, ModeKeys, ModelFile
+from modelscope.utils.hub import get_model_type
 from modelscope.utils.nlp import import_external_nltk_data
-from .nlp_base import NLPTokenizerPreprocessorBase
+from .transformers_tokenizer import NLPTokenizer
+from .utils import parse_text_and_label
+
+
+class FillMaskPreprocessorBase(Preprocessor):
+
+    def __init__(self,
+                 first_sequence: str = None,
+                 second_sequence: str = None,
+                 mode: str = ModeKeys.INFERENCE):
+        """The base constructor for all the fill-mask preprocessors.
+
+        Args:
+            first_sequence: The key of the first sequence.
+            second_sequence: The key of the second sequence.
+            mode: The mode for the preprocessor.
+        """
+        super().__init__(mode)
+        self.first_sequence = first_sequence
+        self.second_sequence = second_sequence
+
+    def __call__(self, data: Union[str, Tuple, Dict],
+                 **kwargs) -> Dict[str, Any]:
+        """Process the raw input data.
+
+        Args:
+            data (tuple): [sentence1, sentence2]
+                sentence1 (str): a sentence
+                    Example:
+                        'you are so handsome.'
+
+        Returns:
+            Dict[str, Any]: the preprocessed data
+        """
+        text_a, text_b, _ = parse_text_and_label(data, self.mode,
+                                                 self.first_sequence,
+                                                 self.second_sequence)
+        output = self._tokenize_text(text_a, text_b, **kwargs)
+        output = {
+            k: np.array(v) if isinstance(v, list) else v
+            for k, v in output.items()
+        }
+        return output
+
+    def _tokenize_text(self, sequence1, sequence2=None, **kwargs):
+        """Tokenize the text.
+
+        Args:
+            sequence1: The first sequence.
+            sequence2: The second sequence, which may be None.
+
+        Returns:
+            The encoded sequence.
+        """
+        raise NotImplementedError()
+
+    @property
+    def mask_id(self):
+        """Return the id of the mask token.
+
+        Returns:
+            The id of the mask token.
+        """
+        return None
+
+    @abstractmethod
+    def decode(self,
+               token_ids,
+               skip_special_tokens: bool = False,
+               clean_up_tokenization_spaces: bool = True,
+               **kwargs):
+        """Turn the token_ids into a real sentence.
+
+        Args:
+            token_ids (`Union[int, List[int], np.ndarray, torch.Tensor, tf.Tensor]`):
+                List of tokenized input ids. Can be obtained using the `__call__` method.
+            skip_special_tokens (`bool`, *optional*, defaults to `False`):
+                Whether or not to remove special tokens in the decoding.
+            clean_up_tokenization_spaces (`bool`, *optional*, defaults to `True`):
+                Whether or not to clean up the tokenization spaces.
+            kwargs (additional keyword arguments, *optional*):
+                Will be passed to the underlying model specific decode method.
+
+        Returns:
+            The real sentence decoded by the preprocessor.
+        """
+        pass


 @PREPROCESSORS.register_module(Fields.nlp, module_name=Preprocessors.fill_mask)
-@PREPROCESSORS.register_module(
-    Fields.nlp, module_name=Preprocessors.feature_extraction)
-class NLPPreprocessor(NLPTokenizerPreprocessorBase):
-    """The tokenizer preprocessor used in MLM task.
-    """
+class FillMaskTransformersPreprocessor(FillMaskPreprocessorBase):

-    def __init__(self, model_dir: str, mode=ModeKeys.INFERENCE, **kwargs):
+    def __init__(self,
+                 model_dir: str = None,
+                 first_sequence: str = None,
+                 second_sequence: str = None,
+                 mode: str = ModeKeys.INFERENCE,
+                 sequence_length: int = 128,
+                 use_fast: bool = None,
+                 **kwargs):
+        """The preprocessor for the fill-mask task, based on transformers' tokenizer.
+
+        Args:
+            model_dir: The model dir used to initialize the tokenizer.
+            use_fast: Use the fast tokenizer or not.
+            sequence_length: The max sequence length which the model supports,
+                will be passed into the tokenizer as the 'max_length' param.
+            **kwargs: Extra args input into the tokenizer's __call__ method.
+        """
         kwargs['truncation'] = kwargs.get('truncation', True)
         kwargs['padding'] = kwargs.get('padding', 'max_length')
-        kwargs['max_length'] = kwargs.pop('sequence_length', 128)
+        kwargs['max_length'] = sequence_length
         kwargs['return_token_type_ids'] = kwargs.get('return_token_type_ids',
                                                      True)
-        super().__init__(model_dir, mode=mode, **kwargs)
+        super().__init__(first_sequence, second_sequence, mode)
+        model_type = None
+        if model_dir is not None:
+            model_type = get_model_type(model_dir)
+        self.nlp_tokenizer = NLPTokenizer(
+            model_dir, model_type, use_fast=use_fast, tokenize_kwargs=kwargs)
+
+    def _tokenize_text(self, sequence1, sequence2=None, **kwargs):
+        if 'return_tensors' not in kwargs:
+            kwargs[
+                'return_tensors'] = 'pt' if self.mode == ModeKeys.INFERENCE else None
+        return self.nlp_tokenizer(sequence1, sequence2, **kwargs)

     @property
     def mask_id(self):
-        return self.tokenizer.mask_token_id
+        """Return the id of the mask token.
+
+        Returns:
+            The id of the mask token.
+        """
+        return self.nlp_tokenizer.tokenizer.mask_token_id

     def decode(self,
                token_ids,
                skip_special_tokens: bool = False,
                clean_up_tokenization_spaces: bool = True,
                **kwargs):
-        return self.tokenizer.decode(token_ids, skip_special_tokens,
-                                     clean_up_tokenization_spaces, **kwargs)
+        """Turn the token_ids into a real sentence.
+
+        Args:
+            token_ids (`Union[int, List[int], np.ndarray, torch.Tensor, tf.Tensor]`):
+                List of tokenized input ids. Can be obtained using the `__call__` method.
+            skip_special_tokens (`bool`, *optional*, defaults to `False`):
+                Whether or not to remove special tokens in the decoding.
+            clean_up_tokenization_spaces (`bool`, *optional*, defaults to `True`):
+                Whether or not to clean up the tokenization spaces.
+            kwargs (additional keyword arguments, *optional*):
+                Will be passed to the underlying model specific decode method.
+
+        Returns:
+            The real sentence decoded by the preprocessor.
+        """
+        return self.nlp_tokenizer.tokenizer.decode(
+            token_ids, skip_special_tokens, clean_up_tokenization_spaces,
+            **kwargs)


 @PREPROCESSORS.register_module(
     Fields.nlp, module_name=Preprocessors.fill_mask_ponet)
-class FillMaskPoNetPreprocessor(NLPTokenizerPreprocessorBase):
-    """The tokenizer preprocessor used in PoNet model's MLM task.
-    """
+class FillMaskPoNetPreprocessor(FillMaskPreprocessorBase):

-    def __init__(self, model_dir: str, mode=ModeKeys.INFERENCE, **kwargs):
+    def __init__(self,
+                 model_dir,
+                 first_sequence: str = None,
+                 second_sequence: str = None,
+                 mode: str = ModeKeys.INFERENCE,
+                 sequence_length: int = 512,
+                 use_fast: bool = None,
+                 **kwargs):
+        """The tokenizer preprocessor used in the PoNet model's MLM task.
+
+        Args:
+            model_dir: The model dir used to initialize the tokenizer.
+            use_fast: Use the fast tokenizer or not.
+            sequence_length: The max sequence length which the model supports,
+                will be passed into the tokenizer as the 'max_length' param.
+            **kwargs: Extra args input into the tokenizer's __call__ method.
+        """
         kwargs['truncation'] = kwargs.get('truncation', True)
         kwargs['padding'] = kwargs.get('padding', 'max_length')
-        kwargs['max_length'] = kwargs.pop('sequence_length', 512)
+        kwargs['max_length'] = sequence_length
         kwargs['return_token_type_ids'] = kwargs.get('return_token_type_ids',
                                                      True)
-        super().__init__(model_dir, mode=mode, **kwargs)
+        super().__init__(first_sequence, second_sequence, mode)
+        self.nlp_tokenizer = NLPTokenizer(
+            model_dir, use_fast=use_fast, tokenize_kwargs=kwargs)
         self.cfg = Config.from_file(
             osp.join(model_dir, ModelFile.CONFIGURATION))

@@ -80,27 +227,15 @@ class FillMaskPoNetPreprocessor(NLPTokenizerPreprocessorBase):
         self.sent_tokenize = sent_tokenize
         self.max_length = kwargs['max_length']

-    def __call__(self, data: Union[str, Tuple, Dict]) -> Dict[str, Any]:
-        """process the raw input data
-
-        Args:
-            data (tuple): [sentence1, sentence2]
-                sentence1 (str): a sentence
-                    Example:
-                        'you are so handsome.'
-                sentence2 (str): a sentence
-                    Example:
-                        'you are so beautiful.'
-
-        Returns:
-            Dict[str, Any]: the preprocessed data
-        """
-        text_a, text_b, labels = self.parse_text_and_label(data)
-        output = self.tokenizer(
-            text_a,
-            text_b,
-            return_tensors='pt' if self._mode == ModeKeys.INFERENCE else None,
-            **self.tokenize_kwargs)
+    def __call__(self, data: Union[str, Tuple, Dict],
+                 **kwargs) -> Dict[str, Any]:
+        text_a, text_b, _ = parse_text_and_label(data, self.mode,
+                                                 self.first_sequence,
+                                                 self.second_sequence)
+        if 'return_tensors' not in kwargs:
+            kwargs[
+                'return_tensors'] = 'pt' if self.mode == ModeKeys.INFERENCE else None
+        output = self.nlp_tokenizer(text_a, text_b, **kwargs)
         max_seq_length = self.max_length
         if text_b is None:

@@ -108,7 +243,7 @@ class FillMaskPoNetPreprocessor(NLPTokenizerPreprocessorBase):
         seg_lens = list(
             map(
                 len,
-                self.tokenizer(
+                self.nlp_tokenizer.tokenizer(
                     self.sent_tokenize(text_a),
                     add_special_tokens=False,
                     truncation=True)['input_ids']))

@@ -125,18 +260,36 @@ class FillMaskPoNetPreprocessor(NLPTokenizerPreprocessorBase):
             k: np.array(v) if isinstance(v, list) else v
             for k, v in output.items()
         }
-        self.labels_to_id(labels, output)
         return output

     @property
     def mask_id(self):
-        return self.tokenizer.mask_token_id
+        """Return the id of the mask token.
+
+        Returns:
+            The id of the mask token.
+        """
+        return self.nlp_tokenizer.tokenizer.mask_token_id

     def decode(self,
                token_ids,
                skip_special_tokens: bool = False,
                clean_up_tokenization_spaces: bool = True,
                **kwargs):
-        return self.tokenizer.decode(token_ids, skip_special_tokens,
-                                     clean_up_tokenization_spaces, **kwargs)
+        """Turn the token_ids into a real sentence.
+
+        Args:
+            token_ids (`Union[int, List[int], np.ndarray, torch.Tensor, tf.Tensor]`):
+                List of tokenized input ids. Can be obtained using the `__call__` method.
+            skip_special_tokens (`bool`, *optional*, defaults to `False`):
+                Whether or not to remove special tokens in the decoding.
+            clean_up_tokenization_spaces (`bool`, *optional*, defaults to `True`):
+                Whether or not to clean up the tokenization spaces.
+            kwargs (additional keyword arguments, *optional*):
+                Will be passed to the underlying model specific decode method.
+
+        Returns:
+            The real sentence decoded by the preprocessor.
+        """
+        return self.nlp_tokenizer.tokenizer.decode(
+            token_ids, skip_special_tokens, clean_up_tokenization_spaces,
+            **kwargs)
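A round-trip sketch for the refactored fill-mask preprocessor (the mask token string depends on the tokenizer; '[MASK]' assumes a BERT-style vocab):

    preprocessor = FillMaskTransformersPreprocessor(model_dir)
    inputs = preprocessor('you are so [MASK].')
    # pick the predicted ids at the masked position from the model output, then:
    # text = preprocessor.decode(token_ids, skip_special_tokens=True)
    # preprocessor.mask_id gives the id used to locate the masked position.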
@@ -1,291 +0,0 @@
-# Copyright (c) Alibaba, Inc. and its affiliates.
-import os
-from abc import ABC
-from collections.abc import Mapping
-from typing import Any, Dict, List, Tuple, Union
-
-import json
-import numpy as np
-import torch
-from transformers import AutoTokenizer
-
-from modelscope.metainfo import Models
-from modelscope.outputs import OutputKeys
-from modelscope.preprocessors.base import Preprocessor
-from modelscope.utils.constant import ModeKeys
-from modelscope.utils.hub import get_model_type, parse_label_mapping
-from modelscope.utils.logger import get_logger
-
-logger = get_logger()
-
-__all__ = [
-    'NLPBasePreprocessor',
-    'NLPTokenizerPreprocessorBase',
-]
-
-
-class NLPBasePreprocessor(Preprocessor, ABC):
-
-    def __init__(self,
-                 model_dir: str,
-                 first_sequence=None,
-                 second_sequence=None,
-                 label=None,
-                 label2id=None,
-                 mode=ModeKeys.INFERENCE,
-                 use_fast=None,
-                 **kwargs):
-        """The NLP preprocessor base class.
-
-        Args:
-            model_dir (str): The local model path
-            first_sequence: The key for the first sequence
-            second_sequence: The key for the second sequence
-            label: The label key
-            label2id: An optional label2id mapping, the class will try to call utils.parse_label_mapping
-                if this mapping is not supplied.
-            mode: Run this preprocessor in either 'train'/'eval'/'inference' mode
-            use_fast: use the fast version of tokenizer
-        """
-        self.model_dir = model_dir
-        self.first_sequence = first_sequence
-        self.second_sequence = second_sequence
-        self.label = label
-
-        self.use_fast = use_fast
-        if self.use_fast is None and model_dir is None:
-            self.use_fast = False
-        elif self.use_fast is None and os.path.isfile(
-                os.path.join(model_dir, 'tokenizer_config.json')):
-            with open(
-                    os.path.join(model_dir, 'tokenizer_config.json'),
-                    'r',
-                    encoding='utf-8') as f:
-                json_config = json.load(f)
-                self.use_fast = json_config.get('use_fast')
-        self.use_fast = False if self.use_fast is None else self.use_fast
-
-        self.label2id = label2id
-        if self.label2id is None and model_dir is not None:
-            self.label2id = parse_label_mapping(model_dir)
-        super().__init__(mode, **kwargs)
-
-    @property
-    def mask_id(self):
-        """Child preprocessor can override this property to return the id of mask token.
-
-        Returns:
-            The id of mask token, default None.
-        """
-        return None
-
-    def decode(self,
-               token_ids: Union[int, List[int], 'np.ndarray', 'torch.Tensor',
-                                'tf.Tensor'],
-               skip_special_tokens: bool = False,
-               clean_up_tokenization_spaces: bool = True,
-               **kwargs):
-        """Turn the token_ids to real sentence.
-
-        Args:
-            token_ids (`Union[int, List[int], np.ndarray, torch.Tensor, tf.Tensor]`):
-                List of tokenized input ids. Can be obtained using the `__call__` method.
-            skip_special_tokens (`bool`, *optional*, defaults to `False`):
-                Whether or not to remove special tokens in the decoding.
-            clean_up_tokenization_spaces (`bool`, *optional*, defaults to `True`):
-                Whether or not to clean up the tokenization spaces.
-            kwargs (additional keyword arguments, *optional*):
-                Will be passed to the underlying model specific decode method.
-
-        Returns:
-            The real sentence decoded by the preprocessor.
-        """
-        raise NotImplementedError()
-
-
-class NLPTokenizerPreprocessorBase(NLPBasePreprocessor):
-
-    def __init__(self,
-                 model_dir: str,
-                 first_sequence: str = None,
-                 second_sequence: str = None,
-                 label: str = 'label',
-                 label2id: dict = None,
-                 mode: str = ModeKeys.INFERENCE,
-                 use_fast: bool = None,
-                 **kwargs):
-        """The NLP tokenizer preprocessor base class.
-
-        Any nlp preprocessor which uses the hf tokenizer can inherit from this class.
-
-        Args:
-            model_dir (str): The local model path
-            first_sequence: The key for the first sequence
-            second_sequence: The key for the second sequence
-            label: The key for the label
-            label2id: An optional label2id dict.
-                If label2id is None, the preprocessor will try to parse label-id mapping from:
-                - configuration.json model.label2id/model.id2label
-                - config.json label2id/id2label
-                - label_mapping.json
-            mode: Run this preprocessor in either 'train'/'eval'/'inference' mode, the behavior may be different.
-            use_fast: use the fast version of tokenizer
-            kwargs: These kwargs will be directly fed into the tokenizer.
-        """
-        super().__init__(model_dir, first_sequence, second_sequence, label,
-                         label2id, mode, use_fast, **kwargs)
-        self.model_dir = model_dir
-        self.tokenize_kwargs = kwargs
-        self.tokenizer = self.build_tokenizer(model_dir)
-        logger.info(f'The key of sentence1: {self.first_sequence}, '
-                    f'The key of sentence2: {self.second_sequence}, '
-                    f'The key of label: {self.label}')
-        if self.first_sequence is None:
-            logger.warning('[Important] first_sequence attribute is not set, '
-                           'this will cause an error if your input is a dict.')
-
-    @property
-    def id2label(self):
-        """Return the id2label mapping according to the label2id mapping.
-
-        @return: The id2label mapping if exists.
-        """
-        if self.label2id is not None:
-            return {id: label for label, id in self.label2id.items()}
-        return None
-
-    def build_tokenizer(self, model_dir):
-        """Build a tokenizer by the model type.
-
-        NOTE: This default implementation only returns slow tokenizer, because the fast tokenizers have a
-        multi-thread problem.
-
-        Args:
-            model_dir: The local model dir.
-
-        Returns:
-            The initialized tokenizer.
-        """
-        self.is_transformer_based_model = 'lstm' not in model_dir
-        # fast version lead to parallel inference failed
-        model_type = get_model_type(model_dir)
-        if model_type in (Models.structbert, Models.gpt3, Models.palm,
-                          Models.plug):
-            from modelscope.models.nlp.structbert import SbertTokenizer, SbertTokenizerFast
-            tokenizer = SbertTokenizerFast if self.use_fast else SbertTokenizer
-            return tokenizer.from_pretrained(model_dir)
-        elif model_type == Models.veco:
-            from modelscope.models.nlp.veco import VecoTokenizer, VecoTokenizerFast
-            tokenizer = VecoTokenizerFast if self.use_fast else VecoTokenizer
-            return tokenizer.from_pretrained(model_dir)
-        elif model_type == Models.deberta_v2:
-            from modelscope.models.nlp.deberta_v2 import DebertaV2Tokenizer, DebertaV2TokenizerFast
-            tokenizer = DebertaV2TokenizerFast if self.use_fast else DebertaV2Tokenizer
-            return tokenizer.from_pretrained(model_dir)
-        elif not self.is_transformer_based_model:
-            from transformers import BertTokenizer, BertTokenizerFast
-            tokenizer = BertTokenizerFast if self.use_fast else BertTokenizer
-            return tokenizer.from_pretrained(model_dir)
-        else:
-            return AutoTokenizer.from_pretrained(
-                model_dir, use_fast=self.use_fast)
-
-    def __call__(self, data: Union[str, Tuple, Dict]) -> Dict[str, Any]:
-        """process the raw input data
-
-        Args:
-            data (tuple): [sentence1, sentence2]
-                sentence1 (str): a sentence
-                    Example:
-                        'you are so handsome.'
-                sentence2 (str): a sentence
-                    Example:
-                        'you are so beautiful.'
-
-        Returns:
-            Dict[str, Any]: the preprocessed data
-        """
-        text_a, text_b, labels = self.parse_text_and_label(data)
-        output = self.tokenizer(
-            text_a,
-            text_b,
-            return_tensors='pt' if self._mode == ModeKeys.INFERENCE else None,
-            **self.tokenize_kwargs)
-        output = {
-            k: np.array(v) if isinstance(v, list) else v
-            for k, v in output.items()
-        }
-        self.labels_to_id(labels, output)
-        return output
-
-    def parse_text_and_label(self, data):
-        """Parse the input and return the sentences and labels.
-
-        When input type is tuple or list and its size is 2:
-        If the pair param is False, data will be parsed as the first_sentence and the label,
-        else it will be parsed as the first_sentence and the second_sentence.
-
-        Args:
-            data: The input data.
-
-        Returns:
-            The sentences and labels tuple.
-        """
-        text_a, text_b, labels = None, None, None
-        if isinstance(data, str):
-            text_a = data
-        elif isinstance(data, tuple) or isinstance(data, list):
-            if len(data) == 3:
-                text_a, text_b, labels = data
-            elif len(data) == 2:
-                if self._mode == ModeKeys.INFERENCE:
-                    text_a, text_b = data
-                else:
-                    text_a, labels = data
-        elif isinstance(data, Mapping):
-            text_a = data.get(self.first_sequence)
-            text_b = data.get(self.second_sequence)
-            labels = data.get(self.label)
-        return text_a, text_b, labels
-
-    def labels_to_id(self, labels, output):
-        """Turn the labels to id with the type int or float.
-
-        If the original label's type is str or int, the label2id mapping will try to convert it to the final label.
-        If the original label's type is float, or the label2id mapping does not exist,
-        the original label will be returned.
-
-        Args:
-            labels: The input labels.
-            output: The label id.
-
-        Returns:
-            The final labels.
-        """
-
-        def label_can_be_mapped(label):
-            return isinstance(label, str) or isinstance(label, int)
-
-        try:
-            if isinstance(labels, (tuple, list)) and all([label_can_be_mapped(label) for label in labels]) \
-                    and self.label2id is not None:
-                output[OutputKeys.LABELS] = [
-                    self.label2id[label]
-                    if label in self.label2id else self.label2id[str(label)]
-                    for label in labels
-                ]
-            elif label_can_be_mapped(labels) and self.label2id is not None:
-                output[OutputKeys.LABELS] = self.label2id[
-                    labels] if labels in self.label2id else self.label2id[str(
-                        labels)]
-            elif labels is not None:
-                output[OutputKeys.LABELS] = labels
-        except KeyError as e:
-            logger.error(
-                f'Label {labels} cannot be found in the label mapping {self.label2id},'
-                f'which comes from the user input or the configuration files. '
-                f'Please consider matching your labels with this mapping.')
-            raise e
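For reference, the label2id fallback order that `parse_label_mapping` implements (per the docstring in the deleted base class above) is roughly the following; this is a simplified sketch for reviewers, not the actual modelscope.utils.hub code:

    import json
    import os

    def parse_label_mapping_sketch(model_dir):
        # 1. configuration.json: model.label2id / model.id2label
        # 2. config.json: label2id / id2label
        # 3. label_mapping.json
        for file_name, get_section in (
                ('configuration.json', lambda cfg: cfg.get('model', {})),
                ('config.json', lambda cfg: cfg)):
            path = os.path.join(model_dir, file_name)
            if os.path.isfile(path):
                with open(path, encoding='utf-8') as f:
                    section = get_section(json.load(f))
                if 'label2id' in section:
                    return {label: int(id) for label, id in section['label2id'].items()}
                if 'id2label' in section:
                    return {label: int(id) for id, label in section['id2label'].items()}
        path = os.path.join(model_dir, 'label_mapping.json')
        if os.path.isfile(path):
            with open(path, encoding='utf-8') as f:
                return json.load(f)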
@@ -5,34 +5,36 @@ from typing import Any, Dict
 from transformers import AutoTokenizer

 from modelscope.metainfo import Preprocessors
+from modelscope.preprocessors import Preprocessor
 from modelscope.preprocessors.builder import PREPROCESSORS
-from modelscope.utils.constant import Fields
+from modelscope.utils.constant import Fields, ModeKeys
 from modelscope.utils.type_assert import type_assert
-from .nlp_base import NLPBasePreprocessor


 @PREPROCESSORS.register_module(
     Fields.nlp, module_name=Preprocessors.re_tokenizer)
-class RelationExtractionPreprocessor(NLPBasePreprocessor):
-    """The relation extraction preprocessor used in normal RE task.
-    """
+class RelationExtractionTransformersPreprocessor(Preprocessor):

-    def __init__(self, model_dir: str, *args, **kwargs):
-        """preprocess the data
+    def __init__(
+            self,
+            model_dir: str,
+            mode: str = ModeKeys.INFERENCE,
+            **kwargs,
+    ):
+        """The preprocessor for the relation extraction task, based on transformers' tokenizer.

         Args:
-            model_dir (str): model path
+            model_dir: The model dir used to initialize the tokenizer.
+            mode: The mode for the preprocessor.
         """
-        super().__init__(model_dir, *args, **kwargs)
+        super().__init__(mode)
         self.model_dir: str = model_dir
+        self.sequence_length = kwargs.pop('sequence_length', 512)
         self.tokenizer = AutoTokenizer.from_pretrained(
             model_dir, use_fast=True)

     @type_assert(object, str)
-    def __call__(self, data: str) -> Dict[str, Any]:
+    def __call__(self, data: str, **kwargs) -> Dict[str, Any]:
         """process the raw input data

         Args:

@@ -46,7 +48,9 @@ class RelationExtractionPreprocessor(NLPBasePreprocessor):
         # preprocess the data for the model input
         text = data
-        output = self.tokenizer([text], return_tensors='pt')
+        if 'return_tensors' not in kwargs:
+            kwargs['return_tensors'] = 'pt'
+        output = self.tokenizer([text], **kwargs)
         return {
             'text': text,
             'input_ids': output['input_ids'],
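A usage sketch (the input sentence is made up):

    preprocessor = RelationExtractionTransformersPreprocessor(model_dir)
    inputs = preprocessor('In 1997, Steve Jobs returned to Apple.')
    # inputs keeps the raw 'text' plus the tokenizer outputs such as 'input_ids'.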
@@ -1,25 +0,0 @@
-# Copyright (c) Alibaba, Inc. and its affiliates.
-from modelscope.metainfo import Preprocessors
-from modelscope.preprocessors.builder import PREPROCESSORS
-from modelscope.utils.constant import Fields, ModeKeys
-from .nlp_base import NLPTokenizerPreprocessorBase
-
-
-@PREPROCESSORS.register_module(
-    Fields.nlp, module_name=Preprocessors.nli_tokenizer)
-@PREPROCESSORS.register_module(
-    Fields.nlp, module_name=Preprocessors.sen_sim_tokenizer)
-@PREPROCESSORS.register_module(
-    Fields.nlp, module_name=Preprocessors.bert_seq_cls_tokenizer)
-@PREPROCESSORS.register_module(
-    Fields.nlp, module_name=Preprocessors.sen_cls_tokenizer)
-class SequenceClassificationPreprocessor(NLPTokenizerPreprocessorBase):
-    """The tokenizer preprocessor used in sequence classification.
-    """
-
-    def __init__(self, model_dir: str, mode=ModeKeys.INFERENCE, **kwargs):
-        kwargs['truncation'] = kwargs.get('truncation', True)
-        kwargs['padding'] = kwargs.get('padding', 'max_length')
-        kwargs['max_length'] = kwargs.pop('sequence_length', 128)
-        super().__init__(model_dir, mode=mode, **kwargs)
@@ -1,31 +1,61 @@
 # Copyright (c) Alibaba, Inc. and its affiliates.
-from typing import Any, Dict, Union
+from typing import Any, Dict

 from modelscope.metainfo import Preprocessors
+from modelscope.preprocessors import Preprocessor
 from modelscope.preprocessors.builder import PREPROCESSORS
 from modelscope.utils.constant import Fields, ModeKeys
-from .nlp_base import NLPTokenizerPreprocessorBase
+from modelscope.utils.hub import get_model_type
+from .transformers_tokenizer import NLPTokenizer


 @PREPROCESSORS.register_module(
     Fields.nlp, module_name=Preprocessors.sentence_embedding)
-class SentenceEmbeddingPreprocessor(NLPTokenizerPreprocessorBase):
+class SentenceEmbeddingTransformersPreprocessor(Preprocessor):
     """The tokenizer preprocessor used in sentence embedding.
     """

-    def __init__(self, model_dir: str, mode=ModeKeys.INFERENCE, **kwargs):
-        kwargs['truncation'] = kwargs.get('truncation', True)
-        kwargs['padding'] = kwargs.get('padding', 'max_length')
-        kwargs['max_length'] = kwargs.pop('sequence_length', 128)
-        super().__init__(model_dir, mode=mode, **kwargs)
+    def __init__(self,
+                 model_dir: str,
+                 first_sequence='source_sentence',
+                 second_sequence='sentences_to_compare',
+                 mode=ModeKeys.INFERENCE,
+                 use_fast: bool = None,
+                 sequence_length: int = 128,
+                 **kwargs):
+        """The preprocessor for the sentence embedding task, based on transformers' tokenizer.
+
+        Args:
+            model_dir: The model dir used to initialize the tokenizer.
+            first_sequence: The key of the first sequence.
+            second_sequence: The key of the second sequence.
+            mode: The mode for the preprocessor.
+            use_fast: Use the fast tokenizer or not.
+            sequence_length: The max sequence length which the model supports,
+                will be passed into the tokenizer as the 'max_length' param.
+            **kwargs: Extra args input into the tokenizer's __call__ method.
+        """
+        self.first_sequence = first_sequence
+        self.second_sequence = second_sequence
+        kwargs['max_length'] = sequence_length
+        model_type = None
+        if model_dir is not None:
+            model_type = get_model_type(model_dir)
+        self.nlp_tokenizer = NLPTokenizer(
+            model_dir, model_type, use_fast=use_fast, tokenize_kwargs=kwargs)
+        super().__init__(mode=mode)

-    def __call__(self, data: Union[str, Dict]) -> Dict[str, Any]:
+    def __call__(self,
+                 data: Dict,
+                 padding=True,
+                 truncation=True,
+                 **kwargs) -> Dict[str, Any]:
         """process the raw input data

         Args:
             data Dict:
-                keys: "source_sentence" && "sentences_to_compare"
+                keys: the source sentence key and the sentences-to-compare key
                 values: list of sentences
                 Example:
                     {"source_sentence": ["how long it take to get a master's degree"],

@@ -37,16 +67,16 @@ class SentenceEmbeddingPreprocessor(NLPTokenizerPreprocessorBase):
         Returns:
             Dict[str, Any]: the preprocessed data
         """
-        source_sentence = data['source_sentence']
-        compare_sentences = data['sentences_to_compare']
-        sentences = []
-        sentences.append(source_sentence[0])
+        source_sentence = data[self.first_sequence]
+        compare_sentences = data[self.second_sequence]
+        sentences = [source_sentence[0]]
         for sent in compare_sentences:
             sentences.append(sent)
-        tokenized_inputs = self.tokenizer(
-            sentences,
-            return_tensors='pt' if self._mode == ModeKeys.INFERENCE else None,
-            padding=True,
-            truncation=True)
+        if 'return_tensors' not in kwargs:
+            kwargs[
+                'return_tensors'] = 'pt' if self.mode == ModeKeys.INFERENCE else None
+        tokenized_inputs = self.nlp_tokenizer(
+            sentences, padding=padding, truncation=truncation, **kwargs)
         return tokenized_inputs
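A usage sketch based on the docstring example (the compare sentences are made up):

    preprocessor = SentenceEmbeddingTransformersPreprocessor(model_dir)
    inputs = preprocessor({
        'source_sentence': ["how long it take to get a master's degree"],
        'sentences_to_compare': [
            'Most masters degree programs take 18 to 24 months.',
            'A bachelor degree usually takes four years.',
        ],
    })
    # Row 0 of inputs['input_ids'] is the source sentence; the rest are the candidates.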
@@ -1,7 +1,6 @@
 # Copyright (c) Alibaba, Inc. and its affiliates.
+import os
 import os.path as osp
-from typing import Any, Dict

 import sentencepiece as spm
 import torch

@@ -9,17 +8,26 @@ import torch
 from modelscope.metainfo import Preprocessors
 from modelscope.preprocessors.base import Preprocessor
 from modelscope.preprocessors.builder import PREPROCESSORS
-from modelscope.utils.constant import Fields
+from modelscope.utils.constant import Fields, ModeKeys


 @PREPROCESSORS.register_module(
     Fields.nlp, module_name=Preprocessors.sentence_piece)
 class SentencePiecePreprocessor(Preprocessor):

-    def __init__(self, model_dir: str, *args, **kwargs):
-        import os
-        super().__init__(*args, **kwargs)
+    def __init__(self,
+                 model_dir: str,
+                 mode=ModeKeys.INFERENCE,
+                 *args,
+                 **kwargs):
+        """The preprocessor for the sentence piece tokenizer.
+
+        Args:
+            model_dir: The model dir that contains the essential files used by the `SentencePieceProcessor`.
+            mode: The mode for the preprocessor.
+        """
+        super().__init__(mode)
         self.tokenizer = None
         for file_name in os.listdir(model_dir):
             if file_name.endswith('.model'):

@@ -28,5 +36,5 @@ class SentencePiecePreprocessor(Preprocessor):
                 break
         assert self.tokenizer is not None, 'Can not find .model file'

-    def __call__(self, data: str) -> Dict[str, Any]:
+    def __call__(self, data: str) -> torch.Tensor:
         return torch.tensor(self.tokenizer.encode([data]), dtype=torch.long)
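A usage sketch; the model dir is assumed to contain a *.model sentencepiece file:

    preprocessor = SentencePiecePreprocessor(model_dir)
    ids = preprocessor('深蓝的天空中挂着一轮金黄的圆月')
    # ids is a torch.LongTensor of shape [1, num_tokens], matching the new return annotation.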
@@ -1,40 +0,0 @@
-# Copyright (c) Alibaba, Inc. and its affiliates.
-from typing import Any, Dict, Union
-
-from modelscope.metainfo import Preprocessors
-from modelscope.preprocessors.builder import PREPROCESSORS
-from modelscope.utils.constant import Fields, ModeKeys
-from .nlp_base import NLPTokenizerPreprocessorBase
-
-
-@PREPROCESSORS.register_module(
-    Fields.nlp, module_name=Preprocessors.text2text_gen_preprocessor)
-class Text2TextGenerationPreprocessor(NLPTokenizerPreprocessorBase):
-    """The tokenizer preprocessor used in text generation.
-    """
-
-    def __init__(self,
-                 model_dir: str,
-                 tokenizer=None,
-                 mode=ModeKeys.INFERENCE,
-                 **kwargs):
-        kwargs['truncation'] = kwargs.get('truncation', 'do_not_truncate')
-        kwargs['padding'] = kwargs.get('padding', False)
-        kwargs['return_token_type_ids'] = kwargs.get('return_token_type_ids',
-                                                     False)
-        kwargs['max_length'] = kwargs.pop('sequence_length', 128)
-        super().__init__(model_dir, mode=mode, **kwargs)
-
-    def __call__(self, data: Union[Dict, str]) -> Dict[str, Any]:
-        text_a, _, _ = self.parse_text_and_label(data)
-        inputs = self.tokenizer(
-            text_a,
-            return_tensors='pt' if self._mode == ModeKeys.INFERENCE else None,
-            **self.tokenize_kwargs)
-
-        # This is produced by tokenizers but is an invalid generate kwargs
-        if 'token_type_ids' in inputs:
-            del inputs['token_type_ids']
-        return inputs
@@ -0,0 +1,152 @@
+# Copyright (c) Alibaba, Inc. and its affiliates.
+from abc import abstractmethod
+from typing import Any, Dict, List, Tuple, Union
+
+import numpy as np
+
+from modelscope.metainfo import Preprocessors
+from modelscope.preprocessors import Preprocessor
+from modelscope.preprocessors.builder import PREPROCESSORS
+from modelscope.utils.constant import Fields, ModeKeys
+from modelscope.utils.hub import get_model_type, parse_label_mapping
+from modelscope.utils.logger import get_logger
+from .transformers_tokenizer import NLPTokenizer
+from .utils import labels_to_id, parse_text_and_label
+
+logger = get_logger(__name__)
+
+
+class TextClassificationPreprocessorBase(Preprocessor):
+
+    def __init__(
+            self,
+            model_dir=None,
+            first_sequence: str = None,
+            second_sequence: str = None,
+            label: str = 'label',
+            label2id: Dict = None,
+            mode: str = ModeKeys.INFERENCE,
+    ):
+        """The base class for the text classification preprocessors.
+
+        Args:
+            model_dir(str, `optional`): The model dir used to parse the label mapping, can be None.
+            first_sequence(str, `optional`): The key of the first sequence.
+            second_sequence(str, `optional`): The key of the second sequence.
+            label(str, `optional`): The key of the label column, default `label`.
+            label2id(dict, `optional`): The optional label2id mapping.
+            mode: The mode for the preprocessor.
+        """
+        super().__init__(mode)
+        self.model_dir = model_dir
+        self.first_sequence = first_sequence
+        self.second_sequence = second_sequence
+        self.label = label
+        self.label2id = label2id
+        if self.label2id is None and self.model_dir is not None:
+            self.label2id = parse_label_mapping(self.model_dir)
+        logger.info(f'The key of sentence1: {self.first_sequence}, '
+                    f'The key of sentence2: {self.second_sequence}, '
+                    f'The key of label: {self.label}')
+        if self.first_sequence is None:
+            logger.warning('[Important] first_sequence attribute is not set, '
+                           'this will cause an error if your input is a dict.')
+
+    @property
+    def id2label(self):
+        """Return the id2label mapping according to the label2id mapping.
+
+        @return: The id2label mapping if it exists.
+        """
+        if self.label2id is not None:
+            return {id: label for label, id in self.label2id.items()}
+        return None
+
+    def __call__(self, data: Union[str, Tuple, Dict],
+                 **kwargs) -> Dict[str, Any]:
+        """Process the raw input data.
+
+        Args:
+            data (tuple): [sentence1, sentence2]
+                sentence1 (str): a sentence
+                    Example:
+                        'you are so handsome.'
+                sentence2 (str): a sentence
+                    Example:
+                        'you are so beautiful.'
+
+        Returns:
+            Dict[str, Any]: the preprocessed data
+        """
+        text_a, text_b, labels = parse_text_and_label(data, self.mode,
+                                                      self.first_sequence,
+                                                      self.second_sequence,
+                                                      self.label)
+        output = self._tokenize_text(text_a, text_b, **kwargs)
+        output = {
+            k: np.array(v) if isinstance(v, list) else v
+            for k, v in output.items()
+        }
+        labels_to_id(labels, output, self.label2id)
+        return output
+
+    def _tokenize_text(self, sequence1, sequence2=None, **kwargs):
+        """Tokenize the text.
+
+        Args:
+            sequence1: The first sequence.
+            sequence2: The second sequence, which may be None.
+
+        Returns:
+            The encoded sequence.
+        """
+        raise NotImplementedError()
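The `labels_to_id` helper imported from `.utils` replaces the method of the removed `NLPTokenizerPreprocessorBase`; a sketch of its behavior, based on the deleted method shown earlier (the util's exact code may differ):

    def labels_to_id(labels, output, label2id):
        # Map str/int labels through label2id when possible, otherwise keep them as-is.
        def can_map(label):
            return isinstance(label, (str, int))

        if isinstance(labels, (tuple, list)) and label2id is not None \
                and all(can_map(label) for label in labels):
            output[OutputKeys.LABELS] = [
                label2id[label] if label in label2id else label2id[str(label)]
                for label in labels
            ]
        elif can_map(labels) and label2id is not None:
            output[OutputKeys.LABELS] = label2id[
                labels] if labels in label2id else label2id[str(labels)]
        elif labels is not None:
            output[OutputKeys.LABELS] = labels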
+@PREPROCESSORS.register_module(
+    Fields.nlp, module_name=Preprocessors.nli_tokenizer)
+@PREPROCESSORS.register_module(
+    Fields.nlp, module_name=Preprocessors.sen_sim_tokenizer)
+@PREPROCESSORS.register_module(
+    Fields.nlp, module_name=Preprocessors.bert_seq_cls_tokenizer)
+@PREPROCESSORS.register_module(
+    Fields.nlp, module_name=Preprocessors.sen_cls_tokenizer)
+class TextClassificationTransformersPreprocessor(
+        TextClassificationPreprocessorBase):
+
+    def _tokenize_text(self, sequence1, sequence2=None, **kwargs):
+        if 'return_tensors' not in kwargs:
+            kwargs[
+                'return_tensors'] = 'pt' if self.mode == ModeKeys.INFERENCE else None
+        return self.nlp_tokenizer(sequence1, sequence2, **kwargs)
+
+    def __init__(self,
+                 model_dir=None,
+                 first_sequence: str = None,
+                 second_sequence: str = None,
+                 label: Union[str, List] = 'label',
+                 label2id: Dict = None,
+                 mode: str = ModeKeys.INFERENCE,
+                 sequence_length: int = 128,
+                 use_fast: bool = None,
+                 **kwargs):
+        """The tokenizer preprocessor used in sequence classification.
+
+        Args:
+            use_fast: Whether to use the fast tokenizer or not.
+            sequence_length: The max sequence length which the model supports,
+                will be passed into the tokenizer as the 'max_length' param.
+            **kwargs: Extra args input into the tokenizer's __call__ method.
+        """
+        kwargs['truncation'] = kwargs.get('truncation', True)
+        kwargs['padding'] = kwargs.get('padding', 'max_length')
+        kwargs['max_length'] = sequence_length
+        model_type = None
+        if model_dir is not None:
+            model_type = get_model_type(model_dir)
+        self.nlp_tokenizer = NLPTokenizer(
+            model_dir, model_type, use_fast=use_fast, tokenize_kwargs=kwargs)
+        super().__init__(model_dir, first_sequence, second_sequence, label,
+                         label2id, mode)
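A usage sketch covering both modes (the keys and label mapping are made up):

    preprocessor = TextClassificationTransformersPreprocessor(
        model_dir,
        first_sequence='sentence1',
        second_sequence='sentence2',
        label2id={'negative': 0, 'positive': 1})
    features = preprocessor({
        'sentence1': 'you are so handsome.',
        'sentence2': 'you are so beautiful.'
    })
    # In train/eval mode, data['label'] is additionally mapped into features['labels'].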
@@ -7,12 +7,11 @@ from modelscope.metainfo import Preprocessors
 from modelscope.preprocessors.base import Preprocessor
 from modelscope.preprocessors.builder import PREPROCESSORS
 from modelscope.utils.constant import Fields
-from .nlp_base import NLPBasePreprocessor


 @PREPROCESSORS.register_module(
     Fields.nlp, module_name=Preprocessors.text_error_correction)
-class TextErrorCorrectionPreprocessor(NLPBasePreprocessor):
+class TextErrorCorrectionPreprocessor(Preprocessor):
     """The preprocessor used in text correction task.
     """

@@ -23,7 +22,7 @@ class TextErrorCorrectionPreprocessor(NLPBasePreprocessor):
         Args:
             model_dir (str): model path
         """
-        super().__init__(model_dir, *args, **kwargs)
+        super().__init__(*args, **kwargs)
         self.vocab = Dictionary.load(osp.join(model_dir, 'dict.src.txt'))

     def __call__(self, data: str) -> Dict[str, Any]:
@@ -1,44 +0,0 @@
-# Copyright (c) Alibaba, Inc. and its affiliates.
-import os.path as osp
-from typing import Any, Dict
-
-from modelscope.metainfo import Preprocessors
-from modelscope.preprocessors.base import Preprocessor
-from modelscope.preprocessors.builder import PREPROCESSORS
-from modelscope.utils.constant import Fields
-
-
-@PREPROCESSORS.register_module(
-    Fields.nlp, module_name=Preprocessors.text_gen_jieba_tokenizer)
-class TextGenerationJiebaPreprocessor(Preprocessor):
-    """The jieba tokenizer preprocessor used in text generation.
-    """
-
-    def __init__(self, model_dir: str, *args, **kwargs):
-        from modelscope.models.nlp.gpt3 import JiebaBPETokenizer
-        super().__init__(*args, **kwargs)
-        self.tokenizer = JiebaBPETokenizer(
-            osp.join(model_dir, 'tokenizer.json'))
-
-    def __call__(self, data: str) -> Dict[str, Any]:
-        """process the raw input data
-
-        Args:
-            data (str): a sentence
-                Example:
-                    '深蓝的天空中挂着一轮金黄的圆月,下面是海边的沙地'
-
-        Returns:
-            Dict[str, Any]: the preprocessed data
-            Example:
-                {'net_input':
-                    {'src_tokens': tensor([1, 2, 3, 4]),
-                     'src_lengths': tensor([4])}
-                }
-        """
-        import torch
-        return {
-            'input_ids':
-            torch.tensor(self.tokenizer.tokenize(data)).unsqueeze_(0)
-        }
@@ -1,62 +1,257 @@
 # Copyright (c) Alibaba, Inc. and its affiliates.
-import os.path as osp
 from typing import Any, Dict, Optional, Union

+import numpy as np
+import torch
+
 from modelscope.metainfo import Preprocessors
+from modelscope.preprocessors.base import Preprocessor
 from modelscope.preprocessors.builder import PREPROCESSORS
 from modelscope.utils.constant import Fields, ModeKeys
-from .nlp_base import NLPTokenizerPreprocessorBase
+from modelscope.utils.hub import get_model_type
+from modelscope.utils.logger import get_logger
+from .transformers_tokenizer import NLPTokenizer
+from .utils import parse_text_and_label
+
+logger = get_logger(__name__)
+
+
+class TextGenerationPreprocessorBase(Preprocessor):
+
+    def __init__(self,
+                 mode: str = ModeKeys.INFERENCE,
+                 src_txt='src_txt',
+                 tgt_txt='tgt_txt'):
+        """The base class for all the text generation task's preprocessors.
+
+        Args:
+            mode: The preprocessor mode.
+            src_txt: The key of the source text.
+            tgt_txt: The key of the target text.
+        """
+        super().__init__(mode)
+        self.src_txt = src_txt
+        self.tgt_txt = tgt_txt
+
+    def _tokenize_text(self, sequence1, sequence2=None, **kwargs):
+        """Tokenize the text.
+
+        Args:
+            sequence1: The first sequence.
+            sequence2: The second sequence, which may be None.
+
+        Returns:
+            The encoded sequence.
+        """
+        raise NotImplementedError()
+
+    def __call__(self, data: Union[Dict, str], **kwargs) -> Dict[str, Any]:
+        text_a, text_b = parse_text_and_label(data, self.mode, self.src_txt,
+                                              self.tgt_txt)[0:2]
+        output = self._tokenize_text(text_a, text_b, **kwargs)
+        output = {
+            k: np.array(v) if isinstance(v, list) else v
+            for k, v in output.items()
+        }
+        return output
+
+    def decode(self, tokens, **kwargs):
+        """Decode the tokens to real text.
+
+        Args:
+            tokens: The output tokens from the model's `forward` and `generate`.
+
+        Returns:
+            The actual text.
+        """
+        raise NotImplementedError()
+
+
+class NLPTokenizerForRoberta(NLPTokenizer):
+
+    def build_tokenizer(self):
+
+        def get_roberta_tokenizer_dir(model_dir: str) -> Optional[str]:
+            import os
+            for name in os.listdir(model_dir):
+                full_name = os.path.join(model_dir, name)
+                if 'roberta' in name and os.path.isdir(full_name):
+                    return full_name
+
+        roberta_tokenizer_dir = get_roberta_tokenizer_dir(self.model_dir)
+        if roberta_tokenizer_dir:
+            from transformers import RobertaTokenizer
+            return RobertaTokenizer.from_pretrained(
+                roberta_tokenizer_dir, do_lower_case=False)
+        return super().build_tokenizer()


 @PREPROCESSORS.register_module(
     Fields.nlp, module_name=Preprocessors.text_gen_tokenizer)
-class TextGenerationPreprocessor(NLPTokenizerPreprocessorBase):
-    """The tokenizer preprocessor used in text generation.
-    """
+class TextGenerationTransformersPreprocessor(TextGenerationPreprocessorBase):

     def __init__(self,
                  model_dir: str,
                  tokenizer=None,
-                 mode=ModeKeys.INFERENCE,
+                 mode: str = ModeKeys.INFERENCE,
+                 src_txt='src_txt',
+                 tgt_txt='tgt_txt',
+                 sequence_length: int = 128,
+                 use_fast: bool = None,
                  **kwargs):
+        """The tokenizer preprocessor used in text generation.
+
+        Args:
+            model_dir: The model dir used to initialize the tokenizer.
+            mode: The mode for the preprocessor.
+            src_txt: The key of the source sentence.
+            tgt_txt: The key of the generated sentence.
+            sequence_length: The max sequence length which the model supports,
+                will be passed into the tokenizer as the 'max_length' param.
+            use_fast: Whether to use the fast tokenizer or not.
+            **kwargs: Extra args input into the tokenizer's __call__ method.
+        """
+        if 'first_sequence' in kwargs:
+            src_txt = kwargs.pop('first_sequence')
+        super().__init__(mode, src_txt, tgt_txt)
         kwargs['truncation'] = kwargs.get('truncation', True)
         kwargs['padding'] = kwargs.get('padding', 'max_length')
         kwargs['return_token_type_ids'] = kwargs.get('return_token_type_ids',
                                                      False)
-        kwargs['max_length'] = kwargs.pop('sequence_length', 128)
-        super().__init__(model_dir, mode=mode, **kwargs)
-
-    @staticmethod
-    def get_roberta_tokenizer_dir(model_dir: str) -> Optional[str]:
-        import os
-        for name in os.listdir(model_dir):
-            full_name = os.path.join(model_dir, name)
-            if 'roberta' in name and os.path.isdir(full_name):
-                return full_name
-
-    def build_tokenizer(self, model_dir: str):
-        roberta_tokenizer_dir = self.get_roberta_tokenizer_dir(model_dir)
-        if roberta_tokenizer_dir:
-            from transformers import RobertaTokenizer
-            return RobertaTokenizer.from_pretrained(
-                roberta_tokenizer_dir, do_lower_case=False)
-        return super().build_tokenizer(model_dir)
-
-    def __call__(self, data: Union[Dict, str]) -> Dict[str, Any]:
-        if self._mode == ModeKeys.INFERENCE:
-            return super().__call__(data)
-        src_rst = super().__call__(data['src_txt'])
-        src_input_ids = src_rst['input_ids']
-        src_attention_mask = src_rst['attention_mask']
-        if 'tgt_txt' in data:
-            labels = super().__call__(data['tgt_txt'])['input_ids']
-        else:
-            labels = src_input_ids[1:]
-            src_input_ids = src_input_ids[:-1]
-            src_attention_mask = src_attention_mask[:-1]
+        kwargs['max_length'] = sequence_length
+        model_type = None
+        if model_dir is not None:
+            model_type = get_model_type(model_dir)
+        self.nlp_tokenizer = NLPTokenizerForRoberta(
+            model_dir, model_type, use_fast=use_fast, tokenize_kwargs=kwargs)

+    def decode(self, tokens, **kwargs):
+        """Decode the tokens to real text.
+
+        Args:
+            tokens: The output tokens from the model's `forward` and `generate`.
+
+        Returns:
+            The actual text.
+        """
+        return self.nlp_tokenizer.tokenizer.decode(tokens, **kwargs)
+
+    def _tokenize_text(self, sequence1, sequence2=None, **kwargs):
+        """Tokenize the text.
+
+        Args:
+            sequence1: The first sequence.
+            sequence2: The second sequence, which may be None.
+
+        Returns:
+            The encoded sequence.
+        """
+        if 'return_tensors' not in kwargs:
+            kwargs[
+                'return_tensors'] = 'pt' if self.mode == ModeKeys.INFERENCE else None
+        output = self.nlp_tokenizer(sequence1, **kwargs)
+        if self.mode != ModeKeys.INFERENCE:
+            if sequence2 is not None:
+                labels = self.nlp_tokenizer(sequence2)['input_ids']
+                src_input_ids = output['input_ids']
+                src_attention_mask = output['attention_mask']
+            else:
+                labels = output['input_ids'][1:]
+                src_input_ids = output['input_ids'][:-1]
+                src_attention_mask = output['attention_mask'][:-1]
+            output = {
+                'input_ids': src_input_ids,
+                'attention_mask': src_attention_mask,
+                'labels': labels,
+            }
+        return output
+@PREPROCESSORS.register_module(
+    Fields.nlp, module_name=Preprocessors.text_gen_jieba_tokenizer)
+class TextGenerationJiebaPreprocessor(TextGenerationPreprocessorBase):
+    """The jieba tokenizer preprocessor used in text generation.
+    """
+
+    def __init__(self,
+                 model_dir: str,
+                 mode: str = ModeKeys.INFERENCE,
+                 src_txt='src_txt',
+                 tgt_txt=None):
+        from modelscope.models.nlp.gpt3 import JiebaBPETokenizer
+        super().__init__(mode, src_txt, tgt_txt)
+        if self.tgt_txt is not None:
+            logger.warn(
+                f'TextGenerationJiebaPreprocessor currently does not support training, '
+                f'the {self.tgt_txt} of the tgt_txt field will be ignored.')
+        self.src_txt = src_txt
+        self.tokenizer = JiebaBPETokenizer(
+            osp.join(model_dir, 'tokenizer.json'))
+
+    def decode(self, tokens, **kwargs):
+        """Decode the tokens to real text.
+
+        Args:
+            tokens: The output tokens from model's `forward` and `generate`
+
+        Returns:
+            The actual text.
+        """
+        return self.tokenizer.detokenize(tokens)
+
+    def _tokenize_text(self, sequence1, sequence2=None, **kwargs):
+        """Tokenize the text.
+
+        Args:
+            sequence1: The first sequence.
+            sequence2: The second sequence which may be None.
+
+        Returns:
+            The encoded sequence.
+        """
+        return {
+            'input_ids':
+            torch.tensor(self.tokenizer.tokenize(sequence1)).unsqueeze_(0)
+        }
+
+
+@PREPROCESSORS.register_module(
+    Fields.nlp, module_name=Preprocessors.text2text_gen_preprocessor)
+class TextGenerationT5Preprocessor(TextGenerationTransformersPreprocessor):
+
+    def __init__(self,
+                 model_dir: str,
+                 mode: str = ModeKeys.INFERENCE,
+                 src_txt='src_txt',
+                 tgt_txt='tgt_txt',
+                 use_fast: bool = None,
+                 sequence_length: int = 128,
+                 **kwargs):
+        """The preprocessor for text to text generation task, based on transformers' tokenizer.
+
+        Args:
+            model_dir: The model dir used to initialize the tokenizer.
+            src_txt: The key of the first sequence.
+            tgt_txt: The key of the generated sentence.
+            use_fast: Use the fast tokenizer or not.
+            sequence_length: The max sequence length which the model supported,
+                will be passed into tokenizer as the 'max_length' param.
+            mode: The mode for the preprocessor.
+            **kwargs: Extra args input into the tokenizer's __call__ method.
+        """
+        super().__init__(
+            model_dir,
+            mode=mode,
+            src_txt=src_txt,
+            tgt_txt=tgt_txt,
+            sequence_length=sequence_length,
+            use_fast=use_fast,
+            truncation=kwargs.pop('truncation', True),
+            padding=kwargs.pop('padding', 'max_length'),
+            return_token_type_ids=kwargs.pop('return_token_type_ids', False),
+            **kwargs)
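Note: a minimal usage sketch of the refactored text generation preprocessor (the local checkpoint path and sample texts are made up for illustration):

    from modelscope.utils.constant import ModeKeys

    preprocessor = TextGenerationTransformersPreprocessor(
        model_dir='/path/to/local/model',  # hypothetical checkpoint dir
        mode=ModeKeys.TRAIN,
        sequence_length=128)

    # In training/evaluation mode both sequences are tokenized and the
    # target ids come back under 'labels'; in inference mode only the
    # source text is tokenized and returned as tensors.
    features = preprocessor({
        'src_txt': 'The quick brown fox jumps over the lazy dog.',
        'tgt_txt': 'A fox jumps over a dog.'
    })
    print(sorted(features.keys()))  # ['attention_mask', 'input_ids', 'labels']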
@@ -1,67 +1,78 @@
 # Copyright (c) Alibaba, Inc. and its affiliates.

-from typing import Any, Dict, Union
+from typing import Any, Dict

 from transformers import AutoTokenizer

 from modelscope.metainfo import Preprocessors
+from modelscope.preprocessors import Preprocessor
 from modelscope.preprocessors.builder import PREPROCESSORS
 from modelscope.utils.constant import Fields, ModeKeys
 from modelscope.utils.type_assert import type_assert
-from .nlp_base import NLPTokenizerPreprocessorBase


 @PREPROCESSORS.register_module(
     Fields.nlp, module_name=Preprocessors.text_ranking)
-class TextRankingPreprocessor(NLPTokenizerPreprocessorBase):
-    """The tokenizer preprocessor used in passage ranking model.
-    """
+class TextRankingTransformersPreprocessor(Preprocessor):

     def __init__(self,
                  model_dir: str,
-                 mode=ModeKeys.INFERENCE,
-                 *args,
+                 mode: str = ModeKeys.INFERENCE,
+                 first_sequence='source_sentence',
+                 second_sequence='sentences_to_compare',
+                 label='labels',
+                 qid='qid',
+                 sequence_length=128,
                  **kwargs):
-        """preprocess the data
+        """The tokenizer preprocessor used in text ranking.

         Args:
-            model_dir (str): model path
+            model_dir(str): The model dir used to initialize the tokenizer.
+            first_sequence(str, `optional`): The key of the first sequence.
+            second_sequence(str, `optional`): The key of the second sequence.
+            label(str, `optional`): The key of the label column, default `labels`.
+            qid(str, `optional`): The key of the query id column.
+            mode: The mode for the preprocessor.
+            sequence_length: The max sequence length which the model supported,
+                will be passed into tokenizer as the 'max_length' param.
         """
-        super().__init__(model_dir, mode=mode, *args, **kwargs)
-        self.model_dir: str = model_dir
-        self.first_sequence: str = kwargs.pop('first_sequence',
-                                              'source_sentence')
-        self.second_sequence = kwargs.pop('second_sequence',
-                                          'sentences_to_compare')
-        self.sequence_length = kwargs.pop('sequence_length', 128)
+        super().__init__(mode)
+        self.model_dir = model_dir
+        self.first_sequence = first_sequence
+        self.second_sequence = second_sequence
+        self.label = label
+        self.qid = qid
+        self.sequence_length = sequence_length
         self.tokenizer = AutoTokenizer.from_pretrained(self.model_dir)

-    @type_assert(object, (str, tuple, Dict))
-    def __call__(self, data: Union[tuple, Dict]) -> Dict[str, Any]:
-        if isinstance(data, tuple):
-            sentence1, sentence2 = data
-        elif isinstance(data, dict):
-            sentence1 = data.get(self.first_sequence)
-            sentence2 = data.get(self.second_sequence)
+    @type_assert(object, dict)
+    def __call__(self,
+                 data: Dict,
+                 padding='max_length',
+                 truncation=True,
+                 **kwargs) -> Dict[str, Any]:
+        sentence1 = data.get(self.first_sequence)
+        sentence2 = data.get(self.second_sequence)
+        labels = data.get(self.label)
+        qid = data.get(self.qid)
         if isinstance(sentence2, str):
             sentence2 = [sentence2]
         if isinstance(sentence1, str):
             sentence1 = [sentence1]
             sentence1 = sentence1 * len(sentence2)
-        max_seq_length = self.sequence_length
+        kwargs['max_length'] = kwargs.get(
+            'max_length', kwargs.pop('sequence_length', self.sequence_length))
+        if 'return_tensors' not in kwargs:
+            kwargs['return_tensors'] = 'pt'
         feature = self.tokenizer(
             sentence1,
             sentence2,
-            padding='max_length',
-            truncation=True,
-            max_length=max_seq_length,
-            return_tensors='pt')
-        if 'labels' in data:
-            labels = data['labels']
+            padding=padding,
+            truncation=truncation,
+            **kwargs)
+        if labels is not None:
             feature['labels'] = labels
-        if 'qid' in data:
-            qid = data['qid']
+        if qid is not None:
             feature['qid'] = qid
         return feature
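Note: a hedged sketch of the expected input layout for the new text ranking preprocessor (model path and sentences are made up):

    preprocessor = TextRankingTransformersPreprocessor(
        model_dir='/path/to/ranking/model')  # hypothetical checkpoint dir

    features = preprocessor({
        'source_sentence': ['how long is the great wall'],
        'sentences_to_compare': [
            'The Great Wall is about 21,196 kilometers long.',
            'The Great Wall was built across several dynasties.',
        ],
    })
    # The single source sentence is repeated to pair with every candidate,
    # so input_ids contains one row per candidate sentence.
    print(features['input_ids'].shape)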
@@ -1,28 +1,35 @@
 # Copyright (c) Alibaba, Inc. and its affiliates.

-from typing import Any, Dict, Tuple, Union
+from typing import Any, Dict, List, Tuple, Union

 import numpy as np
 import torch

 from modelscope.metainfo import Preprocessors
 from modelscope.outputs import OutputKeys
+from modelscope.preprocessors import Preprocessor
 from modelscope.preprocessors.builder import PREPROCESSORS
 from modelscope.utils.constant import Fields, ModeKeys
+from modelscope.utils.hub import get_model_type, parse_label_mapping
+from modelscope.utils.logger import get_logger
 from modelscope.utils.type_assert import type_assert
-from .nlp_base import NLPBasePreprocessor, NLPTokenizerPreprocessorBase
+from .transformers_tokenizer import NLPTokenizer
+from .utils import parse_text_and_label
+
+logger = get_logger(__name__)


 @PREPROCESSORS.register_module(
     Fields.nlp,
     module_name=Preprocessors.word_segment_text_to_label_preprocessor)
-class WordSegmentationBlankSetToLabelPreprocessor(NLPBasePreprocessor):
+class WordSegmentationBlankSetToLabelPreprocessor(Preprocessor):
     """The preprocessor used to turn a single sentence to a labeled token-classification dict.
     """

-    def __init__(self, **kwargs):
-        self.first_sequence: str = kwargs.pop('first_sequence', 'tokens')
-        self.label = kwargs.pop('label', OutputKeys.LABELS)
+    def __init__(self, generated_sentence='tokens', generated_label='labels'):
+        super().__init__()
+        self.generated_sentence = generated_sentence
+        self.generated_label = generated_label

     def __call__(self, data: str) -> Union[Dict[str, Any], Tuple]:
         data = data.split(' ')

@@ -43,9 +50,134 @@ class WordSegmentationBlankSetToLabelPreprocessor(NLPBasePreprocessor):
         chars, labels = produce_train_sample(data)
         return {
-            self.first_sequence: chars,
-            self.label: labels,
+            self.generated_sentence: chars,
+            self.generated_label: labels,
         }
+
+
+class TokenClassificationPreprocessorBase(Preprocessor):
+
+    def __init__(
+        self,
+        model_dir: str = None,
+        first_sequence: str = None,
+        label: str = 'label',
+        label2id: Dict = None,
+        label_all_tokens: bool = False,
+        mode: str = ModeKeys.INFERENCE,
+    ):
+        """The base class for all the token-classification tasks.
+
+        Args:
+            model_dir: The model dir used to build the label2id mapping.
+                If None, the user needs to pass in the `label2id` param.
+            first_sequence: The key for the text(token) column if the input type is a dict.
+            label: The key for the label column if the input type is a dict and the mode
+                is `training` or `evaluation`.
+            label2id: The label2id mapping. If not provided, you need to specify the
+                model_dir to search the mapping from config files.
+            label_all_tokens: If a label exists in the dataset, the preprocessor will try
+                to label the tokens. If label_all_tokens is True, all non-initial sub-tokens
+                will get labels like `I-xxx`, or else the labels will be filled with -100,
+                default False.
+            mode: The preprocessor mode.
+        """
+        super().__init__(mode)
+        self.model_dir = model_dir
+        self.first_sequence = first_sequence
+        self.label = label
+        self.label2id = label2id
+        self.label_all_tokens = label_all_tokens
+        if self.label2id is None and self.model_dir is not None:
+            self.label2id = parse_label_mapping(self.model_dir)
+
+    @property
+    def id2label(self):
+        """Return the id2label mapping according to the label2id mapping.
+
+        @return: The id2label mapping if exists.
+        """
+        if self.label2id is not None:
+            return {id: label for label, id in self.label2id.items()}
+        return None
+
+    def labels_to_id(self, labels_list, word_ids):
+        # align the labels with tokenized text
+        assert self.label2id is not None
+        # Map that sends B-Xxx label to its I-Xxx counterpart
+        b_to_i_label = []
+        label_enumerate_values = [
+            k for k, v in sorted(
+                self.label2id.items(), key=lambda item: item[1])
+        ]
+        for idx, label in enumerate(label_enumerate_values):
+            if label.startswith('B-') and label.replace(
+                    'B-', 'I-') in label_enumerate_values:
+                b_to_i_label.append(
+                    label_enumerate_values.index(label.replace('B-', 'I-')))
+            else:
+                b_to_i_label.append(idx)
+
+        label_row = [self.label2id[lb] for lb in labels_list]
+        previous_word_idx = None
+        label_ids = []
+        for word_idx in word_ids:
+            if word_idx is None:
+                label_ids.append(-100)
+            elif word_idx != previous_word_idx:
+                label_ids.append(label_row[word_idx])
+            else:
+                if self.label_all_tokens:
+                    label_ids.append(b_to_i_label[label_row[word_idx]])
+                else:
+                    label_ids.append(-100)
+            previous_word_idx = word_idx
+        return label_ids
+
+    def _tokenize_text(self, sequence1, **kwargs):
+        """Tokenize the text.
+
+        Args:
+            sequence1: The only input sequence, a str or a list of tokens.
+
+        Returns:
+            The encoded sequence and the word ids.
+        """
+        raise NotImplementedError()
+
+    @type_assert(object, (str, tuple, dict))
+    def __call__(self, data: Union[dict, tuple, str],
+                 **kwargs) -> Dict[str, Any]:
+        text, _, label = parse_text_and_label(
+            data, self.mode, self.first_sequence, label=self.label)
+        outputs, word_ids = self._tokenize_text(text, **kwargs)
+        if label is not None:
+            label_ids = self.labels_to_id(label, word_ids)
+            outputs[OutputKeys.LABELS] = label_ids
+        outputs = {
+            k: np.array(v) if isinstance(v, list) else v
+            for k, v in outputs.items()
+        }
+        if self.mode == ModeKeys.INFERENCE:
+            outputs['text'] = text
+        return outputs
+
+
+class NLPTokenizerForLSTM(NLPTokenizer):
+
+    def build_tokenizer(self):
+        if self.model_type == 'lstm':
+            from transformers import AutoTokenizer
+            return AutoTokenizer.from_pretrained(
+                self.model_dir, use_fast=self.use_fast, tokenizer_type='bert')
+        else:
+            return super().build_tokenizer()
+
+    def get_tokenizer_class(self):
+        tokenizer_class = self.tokenizer.__class__.__name__
+        if tokenizer_class.endswith(
+                'Fast') and tokenizer_class != 'PreTrainedTokenizerFast':
+            tokenizer_class = tokenizer_class[:-4]
+        return tokenizer_class
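Note: the -100/`I-xxx` alignment rule implemented in `labels_to_id` above can be seen with a tiny made-up mapping (label set and word ids are illustrative only):

    base = TokenClassificationPreprocessorBase(
        label2id={'O': 0, 'B-LOC': 1, 'I-LOC': 2}, label_all_tokens=True)

    # word_ids as a fast tokenizer reports them: None for special tokens,
    # a repeated index for the second sub-token of word 1.
    word_ids = [None, 0, 1, 1, None]
    print(base.labels_to_id(['B-LOC', 'O'], word_ids))
    # [-100, 1, 0, 0, -100]; with label_all_tokens=False the repeated
    # sub-token would be -100 instead, and a repeated 'B-LOC' would map
    # to the id of its 'I-LOC' counterpart.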
@@ -54,227 +186,238 @@ class WordSegmentationBlankSetToLabelPreprocessor(NLPBasePreprocessor):
 @PREPROCESSORS.register_module(
     Fields.nlp, module_name=Preprocessors.token_cls_tokenizer)
 @PREPROCESSORS.register_module(
     Fields.nlp, module_name=Preprocessors.sequence_labeling_tokenizer)
-class TokenClassificationPreprocessor(NLPTokenizerPreprocessorBase):
+class TokenClassificationTransformersPreprocessor(
+        TokenClassificationPreprocessorBase):
     """The tokenizer preprocessor used in normal NER task.
     """

-    def __init__(self, model_dir: str, mode=ModeKeys.INFERENCE, **kwargs):
-        """preprocess the data
+    def __init__(self,
+                 model_dir: str = None,
+                 first_sequence: str = None,
+                 label: str = 'label',
+                 label2id: Dict = None,
+                 label_all_tokens: bool = False,
+                 mode: str = ModeKeys.INFERENCE,
+                 sequence_length=128,
+                 use_fast=None,
+                 **kwargs):
+        """

         Args:
-            model_dir (str): model path
+            use_fast: Whether to use the fast tokenizer or not.
+            sequence_length: The max sequence length which the model supported,
+                will be passed into tokenizer as the 'max_length' param.
+            **kwargs: Extra args input into the tokenizer's __call__ method.
+                Other args are documented in the base class.
         """
+        super().__init__(model_dir, first_sequence, label, label2id,
+                         label_all_tokens, mode)
+        self.is_lstm_model = model_dir is not None and 'lstm' in model_dir
+        model_type = None
+        if self.is_lstm_model:
+            model_type = 'lstm'
+        elif model_dir is not None:
+            model_type = get_model_type(model_dir)
         kwargs['truncation'] = kwargs.get('truncation', True)
-        kwargs['padding'] = kwargs.get(
-            'padding', False if mode == ModeKeys.INFERENCE else 'max_length')
-        kwargs['max_length'] = kwargs.pop('sequence_length', 128)
-        self.sequence_length = kwargs['max_length']
-        self.label_all_tokens = kwargs.pop('label_all_tokens', False)
-        super().__init__(model_dir, mode=mode, **kwargs)
-
-        if 'is_split_into_words' in kwargs:
-            self.tokenize_kwargs['is_split_into_words'] = kwargs.pop(
-                'is_split_into_words')
-        else:
-            self.tokenize_kwargs[
-                'is_split_into_words'] = self.tokenizer.init_kwargs.get(
-                    'is_split_into_words', False)
-        if 'label2id' in kwargs:
-            kwargs.pop('label2id')
+        kwargs['padding'] = kwargs.get('padding', 'max_length')
+        kwargs['max_length'] = sequence_length
+        kwargs['add_special_tokens'] = model_type != 'lstm'
+        self.nlp_tokenizer = NLPTokenizerForLSTM(
+            model_dir=model_dir,
+            model_type=model_type,
+            use_fast=use_fast,
+            tokenize_kwargs=kwargs)
-    @type_assert(object, (str, dict))
-    def __call__(self, data: Union[dict, str]) -> Dict[str, Any]:
-        """process the raw input data
-
-        Args:
-            data (str): a sentence
-                Example:
-                    'you are so handsome.'
-
-        Returns:
-            Dict[str, Any]: the preprocessed data
-        """
+    def _tokenize_text(self, text: Union[str, List[str]], **kwargs):
+        tokens = text
+        if self.mode != ModeKeys.INFERENCE:
+            assert isinstance(tokens, list), 'Input needs to be lists in training and evaluating, ' \
+                                             'because the length of the words and the labels need to be equal.'
+        is_split_into_words = self.nlp_tokenizer.get_tokenizer_kwarg(
+            'is_split_into_words', False)
+        if is_split_into_words:
+            tokens = list(tokens)
+        if is_split_into_words and self.mode == ModeKeys.INFERENCE:
+            encodings, word_ids = self._tokenize_text_by_words(
+                tokens, **kwargs)
+        elif self.nlp_tokenizer.tokenizer.is_fast:
+            encodings, word_ids = self._tokenize_text_with_fast_tokenizer(
+                tokens, **kwargs)
+        else:
+            encodings, word_ids = self._tokenize_text_with_slow_tokenizer(
+                tokens, **kwargs)
-        # preprocess the data for the model input
-        text = None
-        labels_list = None
-        if isinstance(data, str):
-            # for inference inputs without label
-            text = data
-        elif isinstance(data, dict):
-            # for finetune inputs with label
-            text = data.get(self.first_sequence)
-            labels_list = data.get(self.label)
-        if isinstance(text, list):
-            self.tokenize_kwargs['is_split_into_words'] = True
-        if self._mode == ModeKeys.INFERENCE:
-            self.tokenize_kwargs['add_special_tokens'] = False
+        if self.mode == ModeKeys.INFERENCE:
+            for key in encodings.keys():
+                encodings[key] = torch.tensor(encodings[key]).unsqueeze(0)
+        else:
+            encodings.pop('offset_mapping', None)
+        return encodings, word_ids

+    def _tokenize_text_by_words(self, tokens, **kwargs):
         input_ids = []
         label_mask = []
         offset_mapping = []
-        token_type_ids = []
-        if self.tokenize_kwargs[
-                'is_split_into_words'] and self._mode == ModeKeys.INFERENCE:
-            for offset, token in enumerate(list(text)):
-                subtoken_ids = self.tokenizer.encode(token,
-                                                     **self.tokenize_kwargs)
-                if len(subtoken_ids) == 0:
-                    subtoken_ids = [self.tokenizer.unk_token_id]
-                input_ids.extend(subtoken_ids)
-                label_mask.extend([1] + [0] * (len(subtoken_ids) - 1))
-                offset_mapping.extend([(offset, offset + 1)])
+        attention_mask = []
+        for offset, token in enumerate(tokens):
+            subtoken_ids = self.nlp_tokenizer.tokenizer.encode(
+                token, add_special_tokens=False)
+            if len(subtoken_ids) == 0:
+                subtoken_ids = [self.nlp_tokenizer.tokenizer.unk_token_id]
+            input_ids.extend(subtoken_ids)
+            attention_mask.extend([1] * len(subtoken_ids))
+            label_mask.extend([True] + [False] * (len(subtoken_ids) - 1))
+            offset_mapping.extend([(offset, offset + 1)])
+        padding = kwargs.get('padding',
+                             self.nlp_tokenizer.get_tokenizer_kwarg('padding'))
+        max_length = kwargs.get(
+            'max_length',
+            kwargs.get('sequence_length',
+                       self.nlp_tokenizer.get_tokenizer_kwarg('max_length')))
+        special_token = 1 if self.nlp_tokenizer.get_tokenizer_kwarg(
+            'add_special_tokens') else 0
+        if len(label_mask) > max_length - 2 * special_token:
+            label_mask = label_mask[:(max_length - 2 * special_token)]
+            input_ids = input_ids[:(max_length - 2 * special_token)]
+            offset_mapping = offset_mapping[:sum(label_mask)]
+        if padding == 'max_length':
+            label_mask = [False] * special_token + label_mask + \
+                         [False] * (max_length - len(label_mask) - special_token)
+            offset_mapping = offset_mapping + [(0, 0)] * (
+                max_length - len(offset_mapping))
+            input_ids = [self.nlp_tokenizer.tokenizer.cls_token_id] * special_token + input_ids + \
+                        [self.nlp_tokenizer.tokenizer.sep_token_id] * special_token + \
+                        [self.nlp_tokenizer.tokenizer.pad_token_id] * (max_length - len(input_ids) - 2 * special_token)
+            attention_mask = attention_mask + [1] * (
+                special_token * 2) + [0] * (
+                    max_length - len(attention_mask) - 2 * special_token)
         else:
-            if self.tokenizer.is_fast:
-                encodings = self.tokenizer(
-                    text, return_offsets_mapping=True, **self.tokenize_kwargs)
-                attention_mask = encodings['attention_mask']
-                if 'token_type_ids' in encodings:
-                    token_type_ids = encodings['token_type_ids']
-                input_ids = encodings['input_ids']
-                word_ids = encodings.word_ids()
-                for i in range(len(word_ids)):
-                    if word_ids[i] is None:
-                        label_mask.append(0)
-                    elif word_ids[i] == word_ids[i - 1]:
-                        label_mask.append(0)
-                        offset_mapping[-1] = (
-                            offset_mapping[-1][0],
-                            encodings['offset_mapping'][i][1])
-                    else:
-                        label_mask.append(1)
-                        offset_mapping.append(encodings['offset_mapping'][i])
+            label_mask = [False] * special_token + label_mask + \
+                         [False] * special_token
+            input_ids = [self.nlp_tokenizer.tokenizer.cls_token_id] * special_token + input_ids + \
+                        [self.nlp_tokenizer.tokenizer.sep_token_id] * special_token
+            attention_mask = attention_mask + [1] * (special_token * 2)
+        encodings = {
+            'input_ids': input_ids,
+            'attention_mask': attention_mask,
+            'label_mask': label_mask,
+            'offset_mapping': offset_mapping,
+        }
+        return encodings, None
+    def _tokenize_text_with_fast_tokenizer(self, tokens, **kwargs):
+        is_split_into_words = isinstance(tokens, list)
+        encodings = self.nlp_tokenizer(
+            tokens,
+            return_offsets_mapping=True,
+            is_split_into_words=is_split_into_words,
+            **kwargs)
+        label_mask = []
+        word_ids = encodings.word_ids()
+        offset_mapping = []
+        for i in range(len(word_ids)):
+            if word_ids[i] is None:
+                label_mask.append(False)
+            elif word_ids[i] == word_ids[i - 1]:
+                label_mask.append(False)
+                if not is_split_into_words:
+                    offset_mapping[-1] = (offset_mapping[-1][0],
+                                          encodings['offset_mapping'][i][1])
-            else:
-                encodings = self.tokenizer(text, **self.tokenize_kwargs)
-                input_ids = encodings['input_ids']
-                label_mask, offset_mapping = self.get_label_mask_and_offset_mapping(
-                    text)
-
-        if self._mode == ModeKeys.INFERENCE:
-            if len(input_ids) >= self.sequence_length - 2:
-                input_ids = input_ids[:self.sequence_length - 2]
-                label_mask = label_mask[:self.sequence_length - 2]
-            input_ids = [self.tokenizer.cls_token_id
-                         ] + input_ids + [self.tokenizer.sep_token_id]
-            label_mask = [0] + label_mask + [0]
-            attention_mask = [1] * len(input_ids)
-            offset_mapping = offset_mapping[:sum(label_mask)]
-            if not self.is_transformer_based_model:
-                input_ids = input_ids[1:-1]
-                attention_mask = attention_mask[1:-1]
-                label_mask = label_mask[1:-1]
-            input_ids = torch.tensor(input_ids).unsqueeze(0)
-            attention_mask = torch.tensor(attention_mask).unsqueeze(0)
-            label_mask = torch.tensor(
-                label_mask, dtype=torch.bool).unsqueeze(0)
-            # the token classification
-            output = {
-                'text': text,
-                'input_ids': input_ids,
-                'attention_mask': attention_mask,
-                'label_mask': label_mask,
-                'offset_mapping': offset_mapping
-            }
+            else:
+                label_mask.append(True)
+                if is_split_into_words:
+                    offset_mapping.append((word_ids[i], word_ids[i] + 1))
+                else:
+                    offset_mapping.append(encodings['offset_mapping'][i])
+        padding = self.nlp_tokenizer.get_tokenizer_kwarg('padding')
+        if padding == 'max_length':
+            offset_mapping = offset_mapping + [(0, 0)] * (
+                len(label_mask) - len(offset_mapping))
+        encodings['offset_mapping'] = offset_mapping
+        encodings['label_mask'] = label_mask
+        return encodings, word_ids
+    def _tokenize_text_with_slow_tokenizer(self, tokens, **kwargs):
+        assert self.mode == ModeKeys.INFERENCE and isinstance(tokens, str), \
+            'Slow tokenizer now only support str input in inference mode. If you are training models, ' \
+            'please consider using the fast tokenizer.'
+        word_ids = None
+        encodings = self.nlp_tokenizer(
+            tokens, is_split_into_words=False, **kwargs)
+        tokenizer_name = self.nlp_tokenizer.get_tokenizer_class()
+        method = 'get_label_mask_and_offset_mapping_' + tokenizer_name
+        if not hasattr(self, method):
+            raise RuntimeError(
+                f'No `{method}` method defined for '
+                f'tokenizer {tokenizer_name}, please use a fast tokenizer instead, or '
+                f'try to implement a `{method}` method')
+        label_mask, offset_mapping = getattr(self, method)(tokens)
+        padding = self.nlp_tokenizer.get_tokenizer_kwarg('padding')
+        max_length = self.nlp_tokenizer.get_tokenizer_kwarg('max_length')
+        special_token = 1 if self.nlp_tokenizer.get_tokenizer_kwarg(
+            'add_special_tokens') else 0
+        if len(label_mask) > max_length - 2 * special_token:
+            label_mask = label_mask[:(max_length - 2 * special_token)]
+        offset_mapping = offset_mapping[:sum(label_mask)]
+        if padding == 'max_length':
+            label_mask = [False] * special_token + label_mask + \
+                         [False] * (max_length - len(label_mask) - special_token)
+            offset_mapping = offset_mapping + [(0, 0)] * (
+                max_length - len(offset_mapping))
         else:
-            output = {
-                'input_ids': input_ids,
-                'token_type_ids': token_type_ids,
-                'attention_mask': attention_mask,
-                'label_mask': label_mask,
-            }
-
-        # align the labels with tokenized text
-        if labels_list is not None:
-            assert self.label2id is not None
-            # Map that sends B-Xxx label to its I-Xxx counterpart
-            b_to_i_label = []
-            label_enumerate_values = [
-                k for k, v in sorted(
-                    self.label2id.items(), key=lambda item: item[1])
-            ]
-            for idx, label in enumerate(label_enumerate_values):
-                if label.startswith('B-') and label.replace(
-                        'B-', 'I-') in label_enumerate_values:
-                    b_to_i_label.append(
-                        label_enumerate_values.index(
-                            label.replace('B-', 'I-')))
-                else:
-                    b_to_i_label.append(idx)
-
-            label_row = [self.label2id[lb] for lb in labels_list]
-            previous_word_idx = None
-            label_ids = []
-            for word_idx in word_ids:
-                if word_idx is None:
-                    label_ids.append(-100)
-                elif word_idx != previous_word_idx:
-                    label_ids.append(label_row[word_idx])
-                else:
-                    if self.label_all_tokens:
-                        label_ids.append(b_to_i_label[label_row[word_idx]])
-                    else:
-                        label_ids.append(-100)
-                previous_word_idx = word_idx
-            labels = label_ids
-            output['labels'] = labels
-        output = {
-            k: np.array(v) if isinstance(v, list) else v
-            for k, v in output.items()
-        }
-        return output
+            label_mask = [False] * special_token + label_mask + \
+                         [False] * special_token
+        encodings['offset_mapping'] = offset_mapping
+        encodings['label_mask'] = label_mask
+        return encodings, word_ids

-    def get_tokenizer_class(self):
-        tokenizer_class = self.tokenizer.__class__.__name__
-        if tokenizer_class.endswith(
-                'Fast') and tokenizer_class != 'PreTrainedTokenizerFast':
-            tokenizer_class = tokenizer_class[:-4]
-        return tokenizer_class
+    def get_label_mask_and_offset_mapping_BertTokenizer(self, text):
+        label_mask = []
+        offset_mapping = []
+        tokens = self.nlp_tokenizer.tokenizer.tokenize(text)
+        offset = 0
+        for token in tokens:
+            is_start = (token[:2] != '##')
+            if is_start:
+                label_mask.append(True)
+            else:
+                token = token[2:]
+                label_mask.append(False)
+            start = offset + text[offset:].index(token)
+            end = start + len(token)
+            if is_start:
+                offset_mapping.append((start, end))
+            else:
+                offset_mapping[-1] = (offset_mapping[-1][0], end)
+            offset = end
+        return label_mask, offset_mapping

-    def get_label_mask_and_offset_mapping(self, text):
+    def get_label_mask_and_offset_mapping_XLMRobertaTokenizer(self, text):
         label_mask = []
         offset_mapping = []
-        tokens = self.tokenizer.tokenize(text)
+        tokens = self.nlp_tokenizer.tokenizer.tokenize(text)
         offset = 0
-        if self.get_tokenizer_class() == 'BertTokenizer':
-            for token in tokens:
-                is_start = (token[:2] != '##')
-                if is_start:
-                    label_mask.append(True)
-                else:
-                    token = token[2:]
-                    label_mask.append(False)
-                start = offset + text[offset:].index(token)
-                end = start + len(token)
-                if is_start:
-                    offset_mapping.append((start, end))
-                else:
-                    offset_mapping[-1] = (offset_mapping[-1][0], end)
-                offset = end
-        elif self.get_tokenizer_class() == 'XLMRobertaTokenizer':
-            last_is_blank = False
-            for token in tokens:
-                is_start = (token[0] == '▁')
-                if is_start:
-                    token = token[1:]
-                    label_mask.append(True)
-                    if len(token) == 0:
-                        last_is_blank = True
-                        continue
-                else:
-                    label_mask.append(False)
-                start = offset + text[offset:].index(token)
-                end = start + len(token)
-                if last_is_blank or is_start:
-                    offset_mapping.append((start, end))
-                else:
-                    offset_mapping[-1] = (offset_mapping[-1][0], end)
-                offset = end
-                last_is_blank = False
-        else:
-            raise NotImplementedError
+        last_is_blank = False
+        for token in tokens:
+            is_start = (token[0] == '▁')
+            if is_start:
+                token = token[1:]
+                label_mask.append(True)
+                if len(token) == 0:
+                    last_is_blank = True
+                    continue
+            else:
+                label_mask.append(False)
+            start = offset + text[offset:].index(token)
+            end = start + len(token)
+            if last_is_blank or is_start:
+                offset_mapping.append((start, end))
+            else:
+                offset_mapping[-1] = (offset_mapping[-1][0], end)
+            offset = end
+            last_is_blank = False
         return label_mask, offset_mapping
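Note: for slow tokenizers the fallback above is resolved by naming convention, `get_label_mask_and_offset_mapping_` plus the tokenizer class name (minus any `Fast` suffix). A hedged sketch of extending the preprocessor for another slow tokenizer, using a hypothetical whitespace-only tokenizer class named `MyWordTokenizer`:

    class MyTokenClassificationPreprocessor(
            TokenClassificationTransformersPreprocessor):

        def get_label_mask_and_offset_mapping_MyWordTokenizer(self, text):
            # Every whitespace-separated token starts a word, so the mask is
            # all True and the spans are plain character offsets.
            label_mask, offset_mapping = [], []
            offset = 0
            for token in text.split():
                start = offset + text[offset:].index(token)
                end = start + len(token)
                label_mask.append(True)
                offset_mapping.append((start, end))
                offset = end
            return label_mask, offset_mapping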
@@ -9,19 +9,23 @@ from modelscope.outputs import OutputKeys
 from modelscope.preprocessors.builder import PREPROCESSORS
 from modelscope.utils.constant import Fields, ModeKeys
 from modelscope.utils.type_assert import type_assert
-from .token_classification_preprocessor import TokenClassificationPreprocessor
+from .token_classification_preprocessor import \
+    TokenClassificationTransformersPreprocessor


 @PREPROCESSORS.register_module(
     Fields.nlp, module_name=Preprocessors.thai_ner_tokenizer)
-class NERPreprocessorThai(TokenClassificationPreprocessor):
+class NERPreprocessorThai(TokenClassificationTransformersPreprocessor):

-    @type_assert(object, str)
-    def __call__(self, data: str) -> Dict[str, Any]:
+    @type_assert(object, (str, dict))
+    def __call__(self, data: Union[Dict, str]) -> Dict[str, Any]:
         from pythainlp import word_tokenize
+        if isinstance(data, str):
+            text = data
+        else:
+            text = data[self.first_sequence]
         segmented_data = ' '.join([
-            w.strip(' ') for w in word_tokenize(text=data, engine='newmm')
+            w.strip(' ') for w in word_tokenize(text=text, engine='newmm')
             if w.strip(' ') != ''
         ])
         output = super().__call__(segmented_data)

@@ -31,12 +35,17 @@ class NERPreprocessorThai(TokenClassificationPreprocessor):
 @PREPROCESSORS.register_module(
     Fields.nlp, module_name=Preprocessors.thai_wseg_tokenizer)
-class WordSegmentationPreprocessorThai(TokenClassificationPreprocessor):
+class WordSegmentationPreprocessorThai(
+        TokenClassificationTransformersPreprocessor):

-    @type_assert(object, str)
-    def __call__(self, data: str) -> Dict[str, Any]:
+    @type_assert(object, (str, dict))
+    def __call__(self, data: Union[Dict, str]) -> Dict[str, Any]:
         import regex
-        data = regex.findall(r'\X', data)
+        if isinstance(data, str):
+            text = data
+        else:
+            text = data[self.first_sequence]
+        data = regex.findall(r'\X', text)
         data = ' '.join([char for char in data])
         output = super().__call__(data)
@@ -9,19 +9,23 @@ from modelscope.outputs import OutputKeys
 from modelscope.preprocessors.builder import PREPROCESSORS
 from modelscope.utils.constant import Fields, ModeKeys
 from modelscope.utils.type_assert import type_assert
-from .token_classification_preprocessor import TokenClassificationPreprocessor
+from .token_classification_preprocessor import \
+    TokenClassificationTransformersPreprocessor


 @PREPROCESSORS.register_module(
     Fields.nlp, module_name=Preprocessors.viet_ner_tokenizer)
-class NERPreprocessorViet(TokenClassificationPreprocessor):
+class NERPreprocessorViet(TokenClassificationTransformersPreprocessor):

-    @type_assert(object, str)
-    def __call__(self, data: str) -> Dict[str, Any]:
+    @type_assert(object, (str, dict))
+    def __call__(self, data: Union[Dict, str]) -> Dict[str, Any]:
         from pyvi import ViTokenizer
+        if isinstance(data, str):
+            text = data
+        else:
+            text = data[self.first_sequence]
         seg_words = [
-            t.strip(' ') for t in ViTokenizer.tokenize(data).split(' ')
+            t.strip(' ') for t in ViTokenizer.tokenize(text).split(' ')
             if t.strip(' ') != ''
         ]
         raw_words = []
@@ -0,0 +1,112 @@
+# Copyright (c) Alibaba, Inc. and its affiliates.
+import os
+from collections.abc import Mapping
+
+import json
+from transformers import AutoTokenizer
+
+from modelscope.metainfo import Models
+from modelscope.outputs import OutputKeys
+from modelscope.utils.constant import ModeKeys
+from modelscope.utils.logger import get_logger
+
+logger = get_logger()
+
+__all__ = [
+    'NLPTokenizer',
+]
+
+
+class NLPTokenizer:
+
+    def __init__(self,
+                 model_dir: str = None,
+                 model_type=None,
+                 use_fast: bool = None,
+                 tokenize_kwargs=None):
+        """The transformers tokenizer preprocessor base class.
+
+        Any nlp preprocessor which uses the huggingface tokenizer can inherit from this class.
+
+        Args:
+            model_dir (str, `optional`): The local path containing the files used to create a preprocessor.
+            model_type (str, `optional`): The model type used to pick a proper tokenizer class.
+            use_fast (bool, `optional`): Use the fast version of tokenizer
+            tokenize_kwargs (dict, `optional`): These args will be directly fed into the tokenizer.
+        """
+        self.model_dir = model_dir
+        self.model_type = model_type
+        self.tokenize_kwargs = tokenize_kwargs
+        if self.tokenize_kwargs is None:
+            self.tokenize_kwargs = {}
+        self._use_fast = use_fast
+        self._tokenizer = None
+
+    @property
+    def tokenizer(self):
+        if self._tokenizer is None:
+            self._tokenizer = self.build_tokenizer()
+        return self._tokenizer
+
+    @property
+    def use_fast(self):
+        if self._use_fast is None:
+            if self.model_dir is None:
+                self._use_fast = False
+            elif os.path.isfile(
+                    os.path.join(self.model_dir, 'tokenizer_config.json')):
+                with open(
+                        os.path.join(self.model_dir, 'tokenizer_config.json'),
+                        'r',
+                        encoding='utf-8') as f:
+                    json_config = json.load(f)
+                self._use_fast = json_config.get('use_fast')
+            self._use_fast = False if self._use_fast is None else self._use_fast
+        return self._use_fast
+
+    def build_tokenizer(self):
+        """Build a tokenizer by the model type.
+
+        NOTE: The fast tokenizers have a multi-thread problem, use it carefully.
+
+        Returns:
+            The initialized tokenizer.
+        """
+        # the fast version may lead to parallel inference failures
+        model_type = self.model_type
+        model_dir = self.model_dir
+        if model_type == Models.deberta_v2:
+            from modelscope.models.nlp.deberta_v2 import DebertaV2Tokenizer, DebertaV2TokenizerFast
+            tokenizer = DebertaV2TokenizerFast if self.use_fast else DebertaV2Tokenizer
+            return tokenizer.from_pretrained(
+                model_dir) if model_dir is not None else tokenizer()
+        if model_type in (Models.structbert, Models.gpt3, Models.palm,
+                          Models.plug):
+            from transformers import BertTokenizer, BertTokenizerFast
+            tokenizer = BertTokenizerFast if self.use_fast else BertTokenizer
+            return tokenizer.from_pretrained(
+                model_dir) if model_dir is not None else tokenizer()
+        elif model_type == Models.veco:
+            from transformers import XLMRobertaTokenizer, XLMRobertaTokenizerFast
+            tokenizer = XLMRobertaTokenizerFast if self.use_fast else XLMRobertaTokenizer
+            return tokenizer.from_pretrained(
+                model_dir) if model_dir is not None else tokenizer()
+        assert model_dir is not None
+        return AutoTokenizer.from_pretrained(model_dir, use_fast=self.use_fast)
+
+    def __call__(self, text, text_pair=None, **kwargs):
+        # `sequence_length` is accepted as an alias of `max_length`
+        kwargs['max_length'] = kwargs.get('max_length',
+                                          kwargs.pop('sequence_length', None))
+        if kwargs['max_length'] is None:
+            kwargs.pop('max_length')
+        # per-call kwargs override the defaults captured at init time
+        tokenize_kwargs = {k: v for k, v in self.tokenize_kwargs.items()}
+        tokenize_kwargs.update(kwargs)
+        return self.tokenizer(text, text_pair, **tokenize_kwargs)
+
+    def get_tokenizer_kwarg(self, key, default_value=None):
+        if key in self.tokenize_kwargs:
+            return self.tokenize_kwargs[key]
+        return self.tokenizer.init_kwargs.get(key, default_value)
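Note: a short usage sketch of the wrapper (the checkpoint path and values are illustrative; 'structbert' matches Models.structbert, so a BertTokenizer is built):

    tokenizer = NLPTokenizer(
        model_dir='/path/to/structbert/model',  # hypothetical checkpoint dir
        model_type='structbert',
        use_fast=False,
        tokenize_kwargs={'padding': 'max_length', 'max_length': 64,
                         'truncation': True})

    # Per-call kwargs override the defaults captured in tokenize_kwargs.
    features = tokenizer('How is the weather today?', return_tensors='pt')
    print(tokenizer.get_tokenizer_kwarg('max_length'))  # 64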
@@ -0,0 +1,100 @@
+# Copyright (c) Alibaba, Inc. and its affiliates.
+import os
+from collections.abc import Mapping
+from typing import Any, Dict, List, Tuple, Union
+
+import json
+import numpy as np
+from transformers import AutoTokenizer
+
+from modelscope.metainfo import Models
+from modelscope.outputs import OutputKeys
+from modelscope.preprocessors.base import Preprocessor
+from modelscope.utils.constant import ModeKeys
+from modelscope.utils.hub import get_model_type, parse_label_mapping
+from modelscope.utils.logger import get_logger
+
+logger = get_logger()
+
+__all__ = ['parse_text_and_label', 'labels_to_id']
+
+
+def parse_text_and_label(data,
+                         mode,
+                         first_sequence=None,
+                         second_sequence=None,
+                         label=None):
+    """Parse the input and return the sentences and labels.
+
+    When the input type is tuple or list and its size is 2, the mode decides
+    the parse: at inference time data will be parsed as the first_sentence
+    and the second_sentence, otherwise as the first_sentence and the label.
+
+    Args:
+        data: The input data.
+        mode: The mode of the preprocessor
+        first_sequence: The key of the first sequence
+        second_sequence: The key of the second sequence
+        label: The key of the label
+
+    Returns:
+        The sentences and labels tuple.
+    """
+    text_a, text_b, labels = None, None, None
+    if isinstance(data, str):
+        text_a = data
+    elif isinstance(data, tuple) or isinstance(data, list):
+        if len(data) == 3:
+            text_a, text_b, labels = data
+        elif len(data) == 2:
+            if mode == ModeKeys.INFERENCE:
+                text_a, text_b = data
+            else:
+                text_a, labels = data
+    elif isinstance(data, Mapping):
+        text_a = data.get(first_sequence)
+        text_b = data.get(second_sequence)
+        if label is None or isinstance(label, str):
+            labels = data.get(label)
+        else:
+            labels = [data.get(lb) for lb in label]
+
+    return text_a, text_b, labels
+
+
+def labels_to_id(labels, output, label2id=None):
+    """Turn the labels to id with the type int or float.
+
+    If the original label's type is str or int, the label2id mapping will try to convert it to the final label.
+    If the original label's type is float, or the label2id mapping does not exist,
+    the original label will be returned.
+
+    Args:
+        labels: The input labels.
+        output: The output dict which the converted labels will be filled into.
+        label2id: An extra label2id mapping. If not provided, the labels will not be translated to ids.
+
+    Returns:
+        None. The converted labels are written into `output[OutputKeys.LABELS]`.
+    """
+
+    def label_can_be_mapped(label):
+        return isinstance(label, str) or isinstance(label, int)
+
+    try:
+        if isinstance(labels, (tuple, list)) and all([label_can_be_mapped(label) for label in labels]) \
+                and label2id is not None:
+            output[OutputKeys.LABELS] = [
+                label2id[label] if label in label2id else label2id[str(label)]
+                for label in labels
+            ]
+        elif label_can_be_mapped(labels) and label2id is not None:
+            output[OutputKeys.LABELS] = label2id[
+                labels] if labels in label2id else label2id[str(labels)]
+        elif labels is not None:
+            output[OutputKeys.LABELS] = labels
+    except KeyError as e:
+        logger.error(
+            f'Label {labels} cannot be found in the label mapping {label2id}, '
+            f'which comes from the user input or the configuration files. '
+            f'Please consider matching your labels with this mapping.')
+        raise e
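Note: how `parse_text_and_label` resolves the different input shapes (all values made up):

    from modelscope.utils.constant import ModeKeys

    # A plain string only fills the first sentence.
    parse_text_and_label('a sentence', ModeKeys.INFERENCE)
    # -> ('a sentence', None, None)

    # Two-element inputs are ambiguous, so the mode decides:
    parse_text_and_label(('s1', 's2'), ModeKeys.INFERENCE)
    # -> ('s1', 's2', None), a sentence pair at inference time
    parse_text_and_label(('s1', 1), ModeKeys.TRAIN)
    # -> ('s1', None, 1), a sentence and its label during training

    # Dicts are resolved through the configured keys.
    parse_text_and_label({'text': 's1', 'label': 0}, ModeKeys.TRAIN,
                         first_sequence='text', label='label')
    # -> ('s1', None, 0)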
@@ -0,0 +1,74 @@
+# Copyright (c) Alibaba, Inc. and its affiliates.
+
+from typing import Any, Dict, Union
+
+from modelscope.metainfo import Preprocessors
+from modelscope.preprocessors import Preprocessor
+from modelscope.preprocessors.builder import PREPROCESSORS
+from modelscope.utils.constant import Fields, ModeKeys
+from modelscope.utils.hub import get_model_type
+from .transformers_tokenizer import NLPTokenizer
+
+
+@PREPROCESSORS.register_module(
+    Fields.nlp, module_name=Preprocessors.zero_shot_cls_tokenizer)
+class ZeroShotClassificationTransformersPreprocessor(Preprocessor):
+    """The tokenizer preprocessor used in zero shot classification.
+    """
+
+    def __init__(self,
+                 model_dir: str,
+                 first_sequence=None,
+                 mode=ModeKeys.INFERENCE,
+                 sequence_length=512,
+                 use_fast=None,
+                 **kwargs):
+        """preprocess the data
+
+        Args:
+            model_dir (str): The model dir used to initialize the tokenizer.
+            first_sequence (str, `optional`): The key of the sentence if the input type is a dict.
+            mode: The mode for the preprocessor.
+            sequence_length: The max sequence length which the model supported.
+            use_fast: Whether to use the fast tokenizer or not.
+            **kwargs: Extra args input into the tokenizer's __call__ method.
+        """
+        self.sequence_length = sequence_length
+        model_type = None
+        if model_dir is not None:
+            model_type = get_model_type(model_dir)
+        self.nlp_tokenizer = NLPTokenizer(
+            model_dir, model_type, use_fast=use_fast, tokenize_kwargs=kwargs)
+        self.first_sequence = first_sequence
+        super().__init__(mode=mode)
+
+    def __call__(self,
+                 data: Union[str, Dict],
+                 hypothesis_template: str,
+                 candidate_labels: list,
+                 padding=True,
+                 truncation=True,
+                 truncation_strategy='only_first',
+                 **kwargs) -> Dict[str, Any]:
+        """process the raw input data
+
+        Args:
+            data (str or dict): a sentence
+                Example:
+                    'you are so handsome.'
+
+        Returns:
+            Dict[str, Any]: the preprocessed data
+        """
+        if isinstance(data, dict):
+            data = data.get(self.first_sequence)
+
+        pairs = [[data, hypothesis_template.format(label)]
+                 for label in candidate_labels]
+        if 'return_tensors' not in kwargs:
+            kwargs[
+                'return_tensors'] = 'pt' if self._mode == ModeKeys.INFERENCE else None
+        features = self.nlp_tokenizer(
+            pairs,
+            padding=padding,
+            truncation=truncation,
+            truncation_strategy=truncation_strategy,
+            **kwargs)
+        return features
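Note: a hedged usage sketch (the model directory is hypothetical; any NLI-style model applies):

    preprocessor = ZeroShotClassificationTransformersPreprocessor(
        model_dir='/path/to/nli/model')

    features = preprocessor(
        'I will book a flight to Shanghai tomorrow.',
        hypothesis_template='This text is about {}.',
        candidate_labels=['travel', 'cooking', 'sports'])
    # One premise/hypothesis pair per candidate label, padded and returned
    # as tensors in inference mode.
    print(features['input_ids'].shape)  # (3, seq_len)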
@@ -1,51 +0,0 @@
-# Copyright (c) Alibaba, Inc. and its affiliates.
-
-from typing import Any, Dict, Union
-
-from modelscope.metainfo import Preprocessors
-from modelscope.preprocessors.builder import PREPROCESSORS
-from modelscope.utils.constant import Fields, ModeKeys
-from .nlp_base import NLPTokenizerPreprocessorBase
-
-
-@PREPROCESSORS.register_module(
-    Fields.nlp, module_name=Preprocessors.zero_shot_cls_tokenizer)
-class ZeroShotClassificationPreprocessor(NLPTokenizerPreprocessorBase):
-    """The tokenizer preprocessor used in zero shot classification.
-    """
-
-    def __init__(self, model_dir: str, mode=ModeKeys.INFERENCE, **kwargs):
-        """preprocess the data
-
-        Args:
-            model_dir (str): model path
-        """
-        self.sequence_length = kwargs.pop('sequence_length', 512)
-        super().__init__(model_dir, mode=mode, **kwargs)
-
-    def __call__(self, data: Union[str, Dict], hypothesis_template: str,
-                 candidate_labels: list) -> Dict[str, Any]:
-        """process the raw input data
-
-        Args:
-            data (str or dict): a sentence
-                Example:
-                    'you are so handsome.'
-
-        Returns:
-            Dict[str, Any]: the preprocessed data
-        """
-        if isinstance(data, dict):
-            data = data.get(self.first_sequence)
-
-        pairs = [[data, hypothesis_template.format(label)]
-                 for label in candidate_labels]
-
-        features = self.tokenizer(
-            pairs,
-            padding=True,
-            truncation=True,
-            max_length=self.sequence_length,
-            truncation_strategy='only_first',
-            return_tensors='pt' if self._mode == ModeKeys.INFERENCE else None)
-        return features
| @@ -6,11 +6,12 @@ import numpy as np | |||||
| import torch | import torch | ||||
| from modelscope import __version__ | from modelscope import __version__ | ||||
| from modelscope.metainfo import Hooks | |||||
| from modelscope.utils.checkpoint import load_checkpoint, save_checkpoint | |||||
| from modelscope.metainfo import Hooks, Pipelines | |||||
| from modelscope.utils.checkpoint import (load_checkpoint, save_checkpoint, | |||||
| save_configuration) | |||||
| from modelscope.utils.constant import LogKeys, ModelFile | from modelscope.utils.constant import LogKeys, ModelFile | ||||
| from modelscope.utils.logger import get_logger | from modelscope.utils.logger import get_logger | ||||
| from modelscope.utils.torch_utils import get_dist_info, is_master | |||||
| from modelscope.utils.torch_utils import is_master | |||||
| from .builder import HOOKS | from .builder import HOOKS | ||||
| from .hook import Hook | from .hook import Hook | ||||
| from .priority import Priority | from .priority import Priority | ||||
@@ -28,17 +29,25 @@ class CheckpointHook(Hook):
         save_dir (str): The directory to save checkpoints. If None, `trainer.work_dir` is used.
         save_last (bool): Whether to save the last checkpoint. Default: True.
         checkpoint_file (str): The checkpoint file to be loaded.
+        load_all_state (bool): Whether to load all states (optimizer, epoch, lr_scheduler, random_state,
+            etc.) from the training state file. If False, only the model's state dict is loaded.
+        max_checkpoint_num (int): The max number of checkpoint files to keep. Default: None, which means
+            never delete anything. Once the limit is exceeded, earlier checkpoints are deleted first.
     """

     PRIORITY = Priority.LOW

-    def __init__(self,
-                 interval=0,
-                 by_epoch=True,
-                 save_optimizer=True,
-                 save_dir=None,
-                 save_last=True,
-                 checkpoint_file=None):
+    def __init__(
+            self,
+            interval=0,
+            by_epoch=True,
+            save_optimizer=True,
+            save_dir=None,
+            save_last=True,
+            checkpoint_file=None,
+            load_all_state=True,
+            max_checkpoint_num=None,
+    ):
         self.interval = interval
         self.by_epoch = by_epoch
         self.save_optimizer = save_optimizer
@@ -47,6 +56,11 @@ class CheckpointHook(Hook):
         self.save_last = save_last
         self.rng_state = None
         self.need_load_rng_state = False
+        self.load_all_state = load_all_state
+        self.max_checkpoint_num = None
+        if max_checkpoint_num is not None:
+            self.max_checkpoint_num = max(int(max_checkpoint_num), 1)
+        self.history_checkpoints = []

     def before_run(self, trainer):
         if not self.save_dir:
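A minimal sketch of how the two new arguments might be supplied through the trainer's hook configuration; the surrounding cfg layout (a train.hooks list of dicts) is assumed from common modelscope configurations and is not shown in this diff:

    checkpoint_hook_cfg = {
        'type': 'CheckpointHook',
        'interval': 1,             # save once per epoch when by_epoch=True
        'max_checkpoint_num': 3,   # keep only the 3 most recent checkpoint files
        'load_all_state': False,   # reuse weights only; do not resume optimizer/epoch
    }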
@@ -65,9 +79,10 @@ class CheckpointHook(Hook):
         if self.checkpoint_file is not None and os.path.isfile(
                 self.checkpoint_file):
-            meta = self.load_checkpoint(self.checkpoint_file, trainer)
+            meta = self.load_checkpoint(self.checkpoint_file, trainer,
+                                        self.load_all_state)
             self.rng_state = meta.get('rng_state')
-            self.need_load_rng_state = True
+            self.need_load_rng_state = self.load_all_state

     def before_train_iter(self, trainer):
         if self.need_load_rng_state:
@@ -95,28 +110,30 @@ class CheckpointHook(Hook):
             self._save_checkpoint(trainer)

     @classmethod
-    def load_checkpoint(cls, filename, trainer):
+    def load_checkpoint(cls, filename, trainer, load_all_state=True):
         from modelscope.trainers.parallel.utils import is_parallel
         if is_parallel(trainer.model):
             model = trainer.model.module
         else:
             model = trainer.model
-        meta = load_checkpoint(filename, model,
-                               getattr(trainer, 'optimizer', None),
-                               getattr(trainer, 'lr_scheduler', None))
-        trainer._epoch = meta.get('epoch', trainer._epoch)
-        trainer._iter = meta.get('iter', trainer._iter)
-        trainer._inner_iter = meta.get('inner_iter', trainer._inner_iter)
-        for i, hook in enumerate(trainer.hooks):
-            # hook: Hook
-            key = f'{hook.__class__}-{i}'
-            if key in meta and hasattr(hook, 'load_state_dict'):
-                hook.load_state_dict(meta.get(key, {}))
-            else:
-                trainer.logger.warn(
-                    f'The state_dict of hook {hook.__class__} at index {i} is not found in the checkpoint file.'
-                )
+        meta = load_checkpoint(
+            filename, model,
+            getattr(trainer, 'optimizer', None) if load_all_state else None,
+            getattr(trainer, 'lr_scheduler', None) if load_all_state else None)
+        if load_all_state:
+            trainer._epoch = meta.get('epoch', trainer._epoch)
+            trainer._iter = meta.get('iter', trainer._iter)
+            trainer._inner_iter = meta.get('inner_iter', trainer._inner_iter)
+            for i, hook in enumerate(trainer.hooks):
+                # hook: Hook
+                key = f'{hook.__class__}-{i}'
+                if key in meta and hasattr(hook, 'load_state_dict'):
+                    hook.load_state_dict(meta.get(key, {}))
+                else:
+                    trainer.logger.warn(
+                        f'The state_dict of hook {hook.__class__} at index {i} is not found in the checkpoint file.'
+                    )

         version = meta.get('modelscope')
         if version != __version__:
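With load_all_state=False, load_checkpoint restores only the model weights and skips the optimizer, lr_scheduler, counters, and hook states. A hedged sketch of the two resulting modes, constructing the hook directly (the import path and checkpoint file names are assumptions):

    # from modelscope.trainers.hooks.checkpoint_hook import CheckpointHook  # assumed path

    # Continue an interrupted run: restore optimizer, lr_scheduler,
    # epoch/iter counters and rng state along with the weights.
    resume_hook = CheckpointHook(
        checkpoint_file='work_dir/epoch_3.pth', load_all_state=True)

    # Start a fresh training run from the saved weights only.
    warm_start_hook = CheckpointHook(
        checkpoint_file='work_dir/epoch_3.pth', load_all_state=False)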
@@ -163,6 +180,21 @@ class CheckpointHook(Hook):
                 and not self.by_epoch):
             self._save_pretrained(trainer)

+        self.history_checkpoints.append(cur_save_name)
+        self.remove_obsolete_checkpoints()
+
+    def remove_obsolete_checkpoints(self):
+        if self.max_checkpoint_num is not None and \
+                len(self.history_checkpoints) > self.max_checkpoint_num:
+            history_checkpoints = list(self.history_checkpoints)
+            self.history_checkpoints.clear()
+            for i, ckpt_file in enumerate(history_checkpoints):
+                if i < len(history_checkpoints) - self.max_checkpoint_num:
+                    if os.path.isfile(ckpt_file):
+                        os.remove(ckpt_file)
+                else:
+                    self.history_checkpoints.append(ckpt_file)
+
     def _save_pretrained(self, trainer):
         output_dir = os.path.join(self.save_dir, ModelFile.TRAIN_OUTPUT_DIR)
         from modelscope.trainers.parallel.utils import is_parallel
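The rotation above is plain FIFO: history_checkpoints preserves save order, so everything older than the newest max_checkpoint_num files is removed. A self-contained illustration of the pruning arithmetic (file names are illustrative):

    history = ['epoch_1.pth', 'epoch_2.pth', 'epoch_3.pth']
    max_checkpoint_num = 2
    cutoff = len(history) - max_checkpoint_num
    deleted, kept = history[:cutoff], history[cutoff:]
    assert deleted == ['epoch_1.pth']
    assert kept == ['epoch_2.pth', 'epoch_3.pth']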
@@ -175,15 +207,53 @@ class CheckpointHook(Hook):
         config = trainer.cfg.to_dict()
         # Override the pipeline with the task name after finetuning is done,
         # to avoid cases like a fill-mask pipeline paired with a text-cls task.
-        config['pipeline'] = {'type': config['task']}
+        if config['task'] in [
+                getattr(Pipelines, attr) for attr in dir(Pipelines)
+                if not attr.startswith('__')
+        ]:
+            # TODO a temp fix to avoid pipeline_name and task mismatch
+            config['pipeline'] = {'type': config['task']}
+
+        class SaveConfig:
+
+            def __init__(self, output_dir, config):
+                self.output_dir = output_dir
+                self.config = config
+
+            def __call__(self, _output_dir, _config):
+                self.config = _config
+
+            def save_config(self):
+                save_configuration(self.output_dir, self.config)
+
+        save_config_fn = SaveConfig(output_dir, config)
+
         if hasattr(model, 'save_pretrained'):
+            # Two binary file names are supported now:
+            # pytorch_model.bin and pytorch_model.pt
+            default_bin_file = ModelFile.TORCH_MODEL_BIN_FILE
+            if hasattr(
+                    model,
+                    'model_dir') and ModelFile.TORCH_MODEL_FILE in os.listdir(
+                        model.model_dir):
+                default_bin_file = ModelFile.TORCH_MODEL_FILE
             model.save_pretrained(
                 output_dir,
-                ModelFile.TORCH_MODEL_BIN_FILE,
+                default_bin_file,
                 save_function=save_checkpoint,
-                config=config,
+                config=save_config_fn.config,
+                save_config_function=save_config_fn,
                 with_meta=False)
+        if trainer.train_preprocessor is not None:
+            trainer.train_preprocessor.save_pretrained(
+                output_dir,
+                save_config_fn.config,
+                save_config_function=save_config_fn)
+        if trainer.eval_preprocessor is not None:
+            trainer.eval_preprocessor.save_pretrained(
+                output_dir,
+                save_config_fn.config,
+                save_config_function=save_config_fn)
+        save_config_fn.save_config()

     def after_train_iter(self, trainer):
         if self.by_epoch:
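SaveConfig is a small accumulator: the model and both preprocessors receive it as save_config_function and, instead of writing the configuration to disk themselves, hand back their (possibly modified) config through __call__; the file is then written exactly once by the final save_config(). A runnable sketch of the pattern with a stand-in writer and illustrative values:

    def save_configuration(output_dir, config):
        # Stand-in for the real modelscope.utils.checkpoint.save_configuration.
        print(f'writing {sorted(config)} to {output_dir}')

    class SaveConfig:  # mirrors the local class defined above

        def __init__(self, output_dir, config):
            self.output_dir = output_dir
            self.config = config

        def __call__(self, _output_dir, _config):
            self.config = _config  # accumulate, do not write yet

        def save_config(self):
            save_configuration(self.output_dir, self.config)

    saver = SaveConfig('output_dir', {'task': 'text-classification'})
    saver('output_dir', {**saver.config, 'preprocessor': {'type': 'tokenizer'}})
    saver.save_config()  # a single write, after all parties have contributed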
@@ -222,6 +292,9 @@ class BestCkptSaverHook(CheckpointHook):
         save_optimizer (bool): Whether to save optimizer state dict. Default: True.
         save_dir (str): Output directory to save the best checkpoint.
         restore_best (bool): Whether to restore the best checkpoint after training.
+        max_checkpoint_num (int): The max number of checkpoint files to keep. Default: 1, i.e. only the
+            best checkpoint is kept. Once the limit is exceeded, the checkpoints with the worst metrics
+            are deleted first, as judged by the `rule` and `metric_key` arguments.
     """

     PRIORITY = Priority.LOW
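Since max_checkpoint_num now defaults to 1 for this hook, only the single best checkpoint survives out of the box. A hedged construction example (argument values are illustrative):

    best_saver = BestCkptSaverHook(
        metric_key='accuracy',
        rule='max',             # a larger metric value is better
        max_checkpoint_num=1)   # prune everything but the best file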
@@ -235,13 +308,17 @@ class BestCkptSaverHook(CheckpointHook):
                  save_dir=None,
                  save_file_name=None,
                  restore_best=False,
-                 interval=0):
+                 max_checkpoint_num=1,
+                 interval=0,
+                 **kwargs):
         assert rule in ['max', 'min'], 'Only support "max" or "min" rule now.'
         super().__init__(
             interval=interval,
             by_epoch=by_epoch,
             save_optimizer=save_optimizer,
             save_dir=save_dir,
+            max_checkpoint_num=max_checkpoint_num,
+            **kwargs,
         )
         self.metric_key = metric_key
         self.rule = rule
@@ -249,6 +326,7 @@ class BestCkptSaverHook(CheckpointHook):
         self._best_ckpt_file = None
         self.save_file_name = save_file_name
         self.restore_best = restore_best
+        self.history_checkpoints = set()

     def _should_save(self, trainer):
         return self._is_best_metric(trainer.metric_values)
@@ -284,6 +362,10 @@ class BestCkptSaverHook(CheckpointHook):
                 self.save_dir,
                 f'best_{LogKeys.ITER}{trainer.iter + 1}_{self.metric_key}{self._best_metric}.pth'
             )
+        else:
+            if '.' not in cur_save_name:
+                cur_save_name = f'{cur_save_name}.pth'
+            cur_save_name = os.path.join(self.save_dir, cur_save_name)

         meta = {
             'epoch': trainer.epoch,
@@ -300,6 +382,28 @@ class BestCkptSaverHook(CheckpointHook):
                                trainer.lr_scheduler, meta)
         self._best_ckpt_file = cur_save_name
         self._save_pretrained(trainer)
+        self.history_checkpoints.add(cur_save_name)
+        self.remove_obsolete_checkpoints()
+
+    def remove_obsolete_checkpoints(self):
+
+        def extract_metric_from_filename(name):
+            # Strip only the trailing file extension; a plain split('.')
+            # would truncate a decimal metric like 0.85 down to 0.
+            metric = float(name.split(self.metric_key)[1].rsplit('.', 1)[0])
+            # Negate for 'max' so an ascending sort places the best first.
+            return -metric if self.rule == 'max' else metric
+
+        if self.max_checkpoint_num is not None and \
+                len(self.history_checkpoints) > self.max_checkpoint_num:
+            history_checkpoints = sorted(
+                self.history_checkpoints, key=extract_metric_from_filename)
+            self.history_checkpoints.clear()
+            for i, ckpt_file in enumerate(history_checkpoints):
+                if i < self.max_checkpoint_num:
+                    self.history_checkpoints.add(ckpt_file)
+                elif os.path.isfile(ckpt_file):
+                    os.remove(ckpt_file)

     def state_dict(self):
         return {
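The pruning key is parsed straight out of the checkpoint filename, which works because the save path embeds metric_key and the metric value in the name. A worked example of the parsing (the filename is illustrative):

    name = 'best_epoch3_accuracy0.85.pth'
    metric = float(name.split('accuracy')[1].rsplit('.', 1)[0])
    assert metric == 0.85  # a plain split('.') would have yielded 0.0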
@@ -14,8 +14,8 @@ from modelscope.utils.file_utils import func_receive_dict_inputs
 class TextGenerationTrainer(NlpEpochBasedTrainer):

     def _decode(self, tokens):
-        tokenizer = self.eval_preprocessor.tokenizer
-        return tokenizer.decode(tokens.tolist(), skip_special_tokens=True)
+        return self.eval_preprocessor.decode(
+            tokens.tolist(), skip_special_tokens=True)

     def evaluation_step(self, data):
         model = self.model.module if self._dist else self.model
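The trainer no longer reaches into the preprocessor's tokenizer; decoding becomes part of the preprocessor's own interface. A hedged sketch of the delegation this change relies on (the class name and constructor here are assumptions, not taken from this diff):

    class TextGenerationTransformersPreprocessor:

        def __init__(self, tokenizer):
            self.tokenizer = tokenizer

        def decode(self, token_ids, **kwargs):
            # Thin wrapper so callers never depend on the tokenizer directly.
            return self.tokenizer.decode(token_ids, **kwargs)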