1. Abstract the dict keys needed by the NLP metric classes into arguments of their __init__ methods
2. Add Preprocessor.save_pretrained to save preprocessor information
3. Abstract the config-saving function, so that the config is saved as-is when save_pretrained is called directly, and can be modified entry by entry during training
4. Remove SbertTokenizer and VecoTokenizer, use transformers' tokenizers instead
5. Use model/preprocessor's from_pretrained in all nlp pipeline classes.
6. Add model_kwargs and preprocessor_kwargs to all NLP pipeline classes (see the first sketch after this list)
7. Add base classes for the fill-mask and text-classification preprocessors, as a demo for later changes
8. Fix user feedback: the model was re-trained from scratch in the continue-training scenario
9. Fix user feedback: too many checkpoints saved
10. Simplify the nlp-trainer
11. Fix user feedback: Split the default trainer's __init__ method, making it easier for users to override
12. Add safe_get to the Config class (see the second sketch after this list)
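A minimal sketch of items 5 and 6 (the model id and the max_length key are illustrative, and it is assumed the pipeline factory forwards these kwargs to the pipeline class):

    from modelscope.pipelines import pipeline
    from modelscope.utils.constant import Tasks

    pipe = pipeline(
        Tasks.text_classification,
        model='damo/nlp_structbert_sentence-similarity_chinese-base',
        model_kwargs={},                           # forwarded to Model.from_pretrained
        preprocessor_kwargs={'max_length': 128})   # forwarded to the preprocessor
    print(pipe(('今天天气不错', '今天天气很好')))

And a sketch of item 12, assuming safe_get takes a dot-separated key path and an optional default:

    from modelscope.utils.config import Config

    cfg = Config(dict(train=dict(hooks=[dict(type='CheckpointHook')])))
    cfg.safe_get('train.hooks')                # -> [{'type': 'CheckpointHook'}]
    cfg.safe_get('train.optimizer.lr', 0.001)  # missing path -> default, no KeyError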
---------------------------- Another refactor from version 36 -------------------------
13. Rename all NLP transformers preprocessors from TaskNamePreprocessor to TaskNameTransformersPreprocessor (see the sketch after this list), for example:
TextClassificationPreprocessor -> TextClassificationTransformersPreprocessor
14. Add a per-task base class for every NLP task whose preprocessors have at least two sub-preprocessors
15. Add output classes for NLP models
16. Refactor the logic for token-classification
17. Fix bug: checkpoint_hook does not support pytorch_model.pt
18. Fix bug: the pipeline name does not match the task name, so inference fails after training
NOTE: This is only a stopgap; the root cause is the unclear relationship between models and pipelines
Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/10723513
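A minimal sketch of item 13 under the new naming, using the Preprocessor.from_pretrained interface added above (the model id is illustrative):

    from modelscope.preprocessors import TextClassificationTransformersPreprocessor

    preprocessor = TextClassificationTransformersPreprocessor.from_pretrained(
        'damo/nlp_structbert_sentence-similarity_chinese-base')
    inputs = preprocessor(('今天天气不错', '今天天气很好'))  # ready-to-run model inputs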
* add save_pretrained to preprocessor
* save preprocessor config in hook
* refactor label-id mapping fetching logic
* test ok on sentence-similarity
* run on finetuning
* fix bug
* pre-commit passed
* fix bug
* Merge branch 'master' into feat/refactor_config
# Conflicts:
# modelscope/preprocessors/nlp/nlp_base.py
* add params to init
* 1. support max ckpt num 2. support ignoring all files except the bin file in continue training 3. add arguments to some nlp metrics
* Split trainer init impls to overridable methods
* remove some obsolete tokenizers
* unfinished
* support input params in pipeline
* fix bugs
* fix ut bug
* fix bug
* fix ut bug
* fix ut bug
* fix ut bug
* add base class for some preprocessors
* Merge commit '379867739548f394d0fa349ba07afe04adf4c8b6' into feat/refactor_config
* compatible with old code
* fix ut bug
* fix ut bugs
* fix bug
* add some comments
* fix ut bug
* add a requirement
* fix pre-commit
* Merge commit '0451b3d3cb2bebfef92ec2c227b2a3dd8d01dc6a' into feat/refactor_config
* fix bug
* Support function type in registry
* fix ut bug
* fix bug
* Merge commit '5f719e542b963f0d35457e5359df879a5eb80b82' into feat/refactor_config
# Conflicts:
# modelscope/pipelines/nlp/multilingual_word_segmentation_pipeline.py
# modelscope/pipelines/nlp/named_entity_recognition_pipeline.py
# modelscope/pipelines/nlp/word_segmentation_pipeline.py
# modelscope/utils/hub.py
* remove obsolete file
* rename init args
* rename params
* fix merge bug
* add default preprocessor config for ner-model
* move a method to a util file
* remove unused config
* Fix a bug in pbar
* bestckptsaver: change default ckpt number to 1
* 1. Add assert to max_epoch 2. split init_dist and get_device 3. change cmp func name
* Fix bug
* fix bug
* fix bug
* unfinished refactoring
* unfinished
* uw
* uw
* uw
* uw
* Merge branch 'feat/refactor_config' into feat/refactor_trainer
# Conflicts:
# modelscope/preprocessors/nlp/document_segmentation_preprocessor.py
# modelscope/preprocessors/nlp/faq_question_answering_preprocessor.py
# modelscope/preprocessors/nlp/relation_extraction_preprocessor.py
# modelscope/preprocessors/nlp/text_generation_preprocessor.py
* uw
* uw
* unify nlp task outputs
* uw
* uw
* uw
* uw
* change the order of text cls pipeline
* refactor t5
* refactor tg task preprocessor
* fix
* unfinished
* temp
* refactor code
* unfinished
* unfinished
* unfinished
* unfinished
* uw
* Merge branch 'feat/refactor_config' into feat/refactor_trainer
* smoke test pass
* ut testing
* pre-commit passed
* Merge branch 'master' into feat/refactor_config
# Conflicts:
# modelscope/models/nlp/bert/document_segmentation.py
# modelscope/pipelines/nlp/__init__.py
# modelscope/pipelines/nlp/document_segmentation_pipeline.py
* merge master
* unfinished
* Merge branch 'feat/fix_bug_pipeline_name' into feat/refactor_config
* fix bug
* fix ut bug
* support ner batch inference
* fix ut bug
* fix bug
* support batch inference on three nlp tasks
* unfinished
* fix bug
* fix bug
* Merge branch 'master' into feat/refactor_config
# Conflicts:
# modelscope/models/base/base_model.py
# modelscope/pipelines/nlp/conversational_text_to_sql_pipeline.py
# modelscope/pipelines/nlp/dialog_intent_prediction_pipeline.py
# modelscope/pipelines/nlp/dialog_modeling_pipeline.py
# modelscope/pipelines/nlp/dialog_state_tracking_pipeline.py
# modelscope/pipelines/nlp/document_segmentation_pipeline.py
# modelscope/pipelines/nlp/faq_question_answering_pipeline.py
# modelscope/pipelines/nlp/feature_extraction_pipeline.py
# modelscope/pipelines/nlp/fill_mask_pipeline.py
# modelscope/pipelines/nlp/information_extraction_pipeline.py
# modelscope/pipelines/nlp/named_entity_recognition_pipeline.py
# modelscope/pipelines/nlp/sentence_embedding_pipeline.py
# modelscope/pipelines/nlp/summarization_pipeline.py
# modelscope/pipelines/nlp/table_question_answering_pipeline.py
# modelscope/pipelines/nlp/text2text_generation_pipeline.py
# modelscope/pipelines/nlp/text_classification_pipeline.py
# modelscope/pipelines/nlp/text_error_correction_pipeline.py
# modelscope/pipelines/nlp/text_generation_pipeline.py
# modelscope/pipelines/nlp/text_ranking_pipeline.py
# modelscope/pipelines/nlp/token_classification_pipeline.py
# modelscope/pipelines/nlp/word_segmentation_pipeline.py
# modelscope/pipelines/nlp/zero_shot_classification_pipeline.py
# modelscope/trainers/nlp_trainer.py
* pre-commit passed
* fix bug
* Merge branch 'master' into feat/refactor_config
# Conflicts:
# modelscope/preprocessors/__init__.py
* fix bug
* fix bug
* fix bug
* fix bug
* fix bug
* fix bug
* pre-commit passed
* fix bug
* fix bug
* fix bug
* fix bug
* fix bug
* fix bug
* self review done
* fix bug
* fix bug
* fix bug
* fix bugs
* remove sub-token offset mapping
* fix name bug
* add some tests
* 1. support batch inference of text-generation, text2text-generation, token-classification and text-classification 2. add corresponding UTs
* add old logic back
* tmp save
* add tokenize by words logic back
* move outputs file back
* revert veco token-classification back
* fix typo
* Fix description
* Merge commit '4dd99b8f6e4e7aefe047c68a1bedd95d3ec596d6' into feat/refactor_config
* Merge branch 'master' into feat/refactor_config
# Conflicts:
# modelscope/pipelines/builder.py
A Git LFS pointer (binary file) update:

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:7d98ac11a4e9e2744a7402a5cc912da991a41938bbc5dd60f15ee5c6b3196030
-size 63349
+oid sha256:3b38bfb5a851d35d5fba4d59eda926557666dbd62c70e3e3b24c22605e7d9c4a
+size 40771
@@ -7,7 +7,8 @@ from torch.utils.data.dataloader import default_collate
 from modelscope.exporters.builder import EXPORTERS
 from modelscope.exporters.torch_model_exporter import TorchModelExporter
 from modelscope.metainfo import Models
-from modelscope.preprocessors import Preprocessor, build_preprocessor
+from modelscope.preprocessors import (
+    TextClassificationTransformersPreprocessor, build_preprocessor)
 from modelscope.utils.config import Config
 from modelscope.utils.constant import ModeKeys, Tasks
@@ -59,12 +60,13 @@ class SbertForSequenceClassificationExporter(TorchModelExporter):
             'mode': ModeKeys.TRAIN,
             **sequence_length
         })
-        preprocessor: Preprocessor = build_preprocessor(cfg, field_name)
+        preprocessor: TextClassificationTransformersPreprocessor = build_preprocessor(
+            cfg, field_name)
         if pair:
-            first_sequence = preprocessor.tokenizer.unk_token
-            second_sequence = preprocessor.tokenizer.unk_token
+            first_sequence = preprocessor.nlp_tokenizer.tokenizer.unk_token
+            second_sequence = preprocessor.nlp_tokenizer.tokenizer.unk_token
         else:
-            first_sequence = preprocessor.tokenizer.unk_token
+            first_sequence = preprocessor.nlp_tokenizer.tokenizer.unk_token
             second_sequence = None
         batched = []
@@ -19,18 +19,27 @@ from .builder import METRICS, MetricKeys
 class SequenceClassificationMetric(Metric):
     """The metric computation class for sequence classification tasks.

-    This metric class calculates accuracy of the whole input batches.
+    This metric class calculates accuracy/F1 of all the input batches.
+
+    Args:
+        label_name: The key of the label column in the 'inputs' arg.
+        logit_name: The key of the logits column in the 'outputs' arg.
     """

-    def __init__(self, *args, **kwargs):
+    def __init__(self,
+                 label_name=OutputKeys.LABELS,
+                 logit_name=OutputKeys.LOGITS,
+                 *args,
+                 **kwargs):
         super().__init__(*args, **kwargs)
         self.preds = []
         self.labels = []
+        self.label_name = label_name
+        self.logit_name = logit_name

     def add(self, outputs: Dict, inputs: Dict):
-        label_name = OutputKeys.LABEL if OutputKeys.LABEL in inputs else OutputKeys.LABELS
-        ground_truths = inputs[label_name]
-        eval_results = outputs[OutputKeys.LOGITS]
+        ground_truths = inputs[self.label_name]
+        eval_results = outputs[self.logit_name]
         self.preds.append(
             torch_nested_numpify(torch_nested_detach(eval_results)))
         self.labels.append(
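A quick usage sketch of the new metric arguments; the module path and the exact keys returned by evaluate() are assumptions based on the code above:

    import torch
    from modelscope.metrics.sequence_classification_metric import \
        SequenceClassificationMetric

    metric = SequenceClassificationMetric(label_name='labels', logit_name='logits')
    metric.add(
        outputs={'logits': torch.tensor([[2.0, 0.1], [0.2, 1.5]])},
        inputs={'labels': torch.tensor([0, 1])})
    print(metric.evaluate())  # accuracy/F1 over everything added so far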
@@ -18,16 +18,22 @@ class TextGenerationMetric(Metric):
 class TextGenerationMetric(Metric):
     """The metric computation class for text generation classes.

     This metric class calculates F1 of the rouge scores for the whole evaluation dataset.
+
+    Args:
+        target_text: The key of the target text column in the `inputs` arg.
+        pred_text: The key of the predicted text column in the `outputs` arg.
     """

-    def __init__(self):
+    def __init__(self, target_text='tgts', pred_text='preds'):
         self.preds: List[str] = []
         self.tgts: List[str] = []
         self.rouge = Rouge()
+        self.target_text = target_text
+        self.pred_text = pred_text

     def add(self, outputs: Dict[str, List[str]], inputs: Dict[str, List[str]]):
-        ground_truths = inputs['tgts']
-        eval_results = outputs['preds']
+        ground_truths = inputs[self.target_text]
+        eval_results = outputs[self.pred_text]
         for truth in ground_truths:
             self.tgts.append(rebuild_chinese_str(truth))
         for result in eval_results:
@@ -21,20 +21,16 @@ class TokenClassificationMetric(Metric):
     This metric class uses seqeval to calculate the scores.

     Args:
-        return_entity_level_metrics (bool, *optional*):
+        label_name (str, `optional`): The key of the label column in the 'inputs' arg.
+        logit_name (str, `optional`): The key of the logits column in the 'outputs' arg.
+        return_entity_level_metrics (bool, `optional`):
             Whether to return every label's detail metrics, default False.
+        label2id (dict, `optional`): The label2id information to get the token labels.
     """

-    def add(self, outputs: Dict, inputs: Dict):
-        label_name = OutputKeys.LABEL if OutputKeys.LABEL in inputs else OutputKeys.LABELS
-        ground_truths = inputs[label_name]
-        eval_results = outputs[OutputKeys.LOGITS]
-        self.preds.append(
-            torch_nested_numpify(torch_nested_detach(eval_results)))
-        self.labels.append(
-            torch_nested_numpify(torch_nested_detach(ground_truths)))
-
     def __init__(self,
+                 label_name=OutputKeys.LABELS,
+                 logit_name=OutputKeys.LOGITS,
                  return_entity_level_metrics=False,
                  label2id=None,
                  *args,
@@ -44,6 +40,16 @@ class TokenClassificationMetric(Metric):
         self.preds = []
         self.labels = []
         self.label2id = label2id
+        self.label_name = label_name
+        self.logit_name = logit_name
+
+    def add(self, outputs: Dict, inputs: Dict):
+        ground_truths = inputs[self.label_name]
+        eval_results = outputs[self.logit_name]
+        self.preds.append(
+            torch_nested_numpify(torch_nested_detach(eval_results)))
+        self.labels.append(
+            torch_nested_numpify(torch_nested_detach(ground_truths)))

     def evaluate(self):
         label2id = self.label2id
@@ -6,7 +6,8 @@ from typing import Any, Callable, Dict, List, Optional, Union
 from modelscope.hub.snapshot_download import snapshot_download
 from modelscope.models.builder import build_model
-from modelscope.utils.checkpoint import save_checkpoint, save_pretrained
+from modelscope.utils.checkpoint import (save_checkpoint, save_configuration,
+                                         save_pretrained)
 from modelscope.utils.config import Config
 from modelscope.utils.constant import DEFAULT_MODEL_REVISION, Invoke, ModelFile
 from modelscope.utils.device import verify_device
@@ -129,11 +130,9 @@ class Model(ABC):
                     model_cfg[k] = v
             if device is not None:
                 model_cfg.device = device
-            model = build_model(
-                model_cfg, task_name=task_name, default_args=kwargs)
+            model = build_model(model_cfg, task_name=task_name)
         else:
-            model = build_model(
-                model_cfg, task_name=task_name, default_args=kwargs)
+            model = build_model(model_cfg, task_name=task_name)

         # dynamically add pipeline info to model for pipeline inference
         if hasattr(cfg, 'pipeline'):
@@ -142,6 +141,7 @@ class Model(ABC):
         if not hasattr(model, 'cfg'):
             model.cfg = cfg

+        model_cfg.pop('model_dir', None)
         model.name = model_name_or_path
         model.model_dir = local_model_dir
         return model
@@ -151,6 +151,7 @@ class Model(ABC):
                         save_checkpoint_names: Union[str, List[str]] = None,
                         save_function: Callable = save_checkpoint,
                         config: Optional[dict] = None,
+                        save_config_function: Callable = save_configuration,
                         **kwargs):
         """save the pretrained model, its configuration and other related files to a directory,
            so that it can be re-loaded
@@ -168,18 +169,15 @@ class Model(ABC):
             config (Optional[dict], optional):
                 The config for the configuration.json, might not be identical with model.config
+            save_config_function (Callable, optional):
+                The function to use to save the configuration.
         """
         if config is None and hasattr(self, 'cfg'):
             config = self.cfg
-        assert config is not None, 'Cannot save the model because the model config is empty.'
-        if isinstance(config, Config):
-            config = config.to_dict()
-        if 'preprocessor' in config and config['preprocessor'] is not None:
-            if 'mode' in config['preprocessor']:
-                config['preprocessor']['mode'] = 'inference'
-            elif 'val' in config['preprocessor'] and 'mode' in config[
-                    'preprocessor']['val']:
-                config['preprocessor']['val']['mode'] = 'inference'
+        if config is not None:
+            save_config_function(target_folder, config)
         save_pretrained(self, target_folder, save_checkpoint_names,
-                        save_function, config, **kwargs)
+                        save_function, **kwargs)
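A hedged sketch of the new hook point: a custom configuration writer can be swapped in while the default checkpoint saving stays untouched. The function name and the mode tweak are illustrative, and model is assumed to be an already-built Model instance; only the save_config_function(target_folder, config) calling convention comes from the code above:

    from modelscope.utils.checkpoint import save_configuration

    def save_inference_config(target_folder, config):
        # illustrative: force inference mode before writing configuration.json
        if isinstance(config, dict) and config.get('preprocessor'):
            config['preprocessor']['mode'] = 'inference'
        save_configuration(target_folder, config)

    model.save_pretrained('./output', save_config_function=save_inference_config)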
@@ -6,6 +6,7 @@ import torch
 from torch import nn

 from modelscope.utils.file_utils import func_receive_dict_inputs
+from modelscope.utils.hub import parse_label_mapping
 from modelscope.utils.logger import get_logger
 from .base_model import Model
@@ -36,9 +36,7 @@ from transformers.utils.model_parallel_utils import (assert_device_map,
 from modelscope.metainfo import Models
 from modelscope.models.base import Model, Tensor, TorchModel
 from modelscope.models.builder import MODELS
-from modelscope.outputs import (BaseModelOutput,
-                                BaseModelOutputWithPastAndCrossAttentions,
-                                Seq2SeqModelOutput)
+from modelscope.outputs import AttentionBackboneModelOutput, Seq2SeqModelOutput
 from modelscope.utils.constant import Tasks
 from modelscope.utils.logger import get_logger
 from .configuration import T5Config
@@ -1182,7 +1180,7 @@ class T5Stack(T5PreTrainedModel):
                 all_attentions,
                 all_cross_attentions,
             ] if v is not None)
-        return BaseModelOutputWithPastAndCrossAttentions(
+        return AttentionBackboneModelOutput(
             last_hidden_state=hidden_states,
             past_key_values=present_key_value_states,
             hidden_states=all_hidden_states,
@@ -1475,8 +1473,9 @@ class T5Model(T5PreTrainedModel):
                 output_hidden_states=output_hidden_states,
                 return_dict=return_dict,
             )
-        elif return_dict and not isinstance(encoder_outputs, BaseModelOutput):
-            encoder_outputs = BaseModelOutput(
+        elif return_dict and not isinstance(encoder_outputs,
+                                            AttentionBackboneModelOutput):
+            encoder_outputs = AttentionBackboneModelOutput(
                 last_hidden_state=encoder_outputs[0],
                 hidden_states=encoder_outputs[1]
                 if len(encoder_outputs) > 1 else None,
@@ -24,7 +24,8 @@ from transformers.utils.model_parallel_utils import (assert_device_map,
 from modelscope.metainfo import Models
 from modelscope.models.builder import MODELS
-from modelscope.outputs import BaseModelOutput, Seq2SeqLMOutput
+from modelscope.outputs import (AttentionBackboneModelOutput, Seq2SeqLMOutput,
+                                TokenGeneratorOutput)
 from modelscope.utils.constant import Tasks
 from modelscope.utils.logger import get_logger
 from .backbone import T5PreTrainedModel, T5Stack
@@ -311,8 +312,9 @@ class T5ForConditionalGeneration(T5PreTrainedModel):
                 output_hidden_states=output_hidden_states,
                 return_dict=return_dict,
             )
-        elif return_dict and not isinstance(encoder_outputs, BaseModelOutput):
-            encoder_outputs = BaseModelOutput(
+        elif return_dict and not isinstance(encoder_outputs,
+                                            AttentionBackboneModelOutput):
+            encoder_outputs = AttentionBackboneModelOutput(
                 last_hidden_state=encoder_outputs[0],
                 hidden_states=encoder_outputs[1]
                 if len(encoder_outputs) > 1 else None,
@@ -426,6 +428,16 @@ class T5ForConditionalGeneration(T5PreTrainedModel):
     def prepare_decoder_input_ids_from_labels(self, labels: torch.Tensor):
         return self._shift_right(labels)

+    def generate(
+            self,
+            *args,
+            **kwargs,
+    ):
+        output = super().generate(*args, **kwargs)
+        return TokenGeneratorOutput(
+            sequences=output if isinstance(output, torch.Tensor) else output[0]
+        )
+
     def _reorder_cache(self, past, beam_idx):
         # if decoder past is not included in output
         # speedy decoding is disabled and no need to reorder
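With this wrapper, callers read generated ids from one field regardless of whether the underlying transformers generate() returned a bare tensor or a structured output; a minimal sketch, assuming inputs came from a preprocessor:

    output = model.generate(**inputs, max_length=64)
    token_ids = output.sequences  # always the generated id tensor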
@@ -30,9 +30,7 @@ if TYPE_CHECKING:
         SbertForMaskedLM,
         SbertForSequenceClassification,
         SbertForTokenClassification,
-        SbertTokenizer,
         SbertModel,
-        SbertTokenizerFast,
     )
     from .T5 import T5ForConditionalGeneration
     from .mglm import MGLMForTextSummarization
@@ -51,8 +49,7 @@ if TYPE_CHECKING:
     )
     from .veco import (VecoConfig, VecoForMaskedLM,
                        VecoForSequenceClassification,
-                       VecoForTokenClassification, VecoModel, VecoTokenizer,
-                       VecoTokenizerFast)
+                       VecoForTokenClassification, VecoModel)
     from .bloom import BloomModel
 else:
     _import_structure = {
@@ -66,8 +63,6 @@ else:
             'SbertForMaskedLM',
             'SbertForSequenceClassification',
             'SbertForTokenClassification',
-            'SbertTokenizer',
-            'SbertTokenizerFast',
             'SbertModel',
         ],
         'veco': [
@@ -76,8 +71,6 @@ else:
             'VecoForSequenceClassification',
             'VecoForTokenClassification',
             'VecoModel',
-            'VecoTokenizer',
-            'VecoTokenizerFast',
         ],
         'bert': [
             'BertForMaskedLM',
@@ -7,6 +7,7 @@ import torch.cuda
 from modelscope.metainfo import Models
 from modelscope.models.base import TorchModel
 from modelscope.models.builder import MODELS
+from modelscope.outputs import TextErrorCorrectionOutput
 from modelscope.utils.constant import ModelFile, Tasks

 __all__ = ['BartForTextErrorCorrection']
@@ -55,7 +56,7 @@ class BartForTextErrorCorrection(TorchModel):
         self.task = task

-    def forward(self, input: Dict[str, Dict]) -> Dict[str, Any]:
+    def forward(self, input: Dict[str, Dict]) -> TextErrorCorrectionOutput:
         """return the result by the model

         Args:
@@ -91,4 +92,4 @@ class BartForTextErrorCorrection(TorchModel):
         # get 1-best List[Tensor]
         preds = translations[0][0]['tokens']
-        return {'predictions': preds}
+        return TextErrorCorrectionOutput(predictions=preds)
@@ -16,9 +16,6 @@
 """PyTorch BERT model. """

 import math
-import os
-from dataclasses import dataclass
-from typing import Optional, Tuple

 import torch
 import torch.utils.checkpoint
@@ -33,11 +30,10 @@ from transformers.modeling_utils import (PreTrainedModel,
 from modelscope.metainfo import Models
 from modelscope.models import Model, TorchModel
 from modelscope.models.builder import MODELS
-from modelscope.outputs import (BaseModelOutputWithPastAndCrossAttentions,
-                                BaseModelOutputWithPoolingAndCrossAttentions)
+from modelscope.outputs import AttentionBackboneModelOutput
 from modelscope.utils.constant import Tasks
-from modelscope.utils.hub import parse_label_mapping
 from modelscope.utils.logger import get_logger
+from modelscope.utils.nlp.utils import parse_labels_in_order
 from .configuration import BertConfig

 logger = get_logger(__name__)
@@ -562,7 +558,7 @@ class BertEncoder(nn.Module):
                 all_self_attentions,
                 all_cross_attentions,
             ] if v is not None)
-        return BaseModelOutputWithPastAndCrossAttentions(
+        return AttentionBackboneModelOutput(
             last_hidden_state=hidden_states,
             past_key_values=next_decoder_cache,
             hidden_states=all_hidden_states,
@@ -639,30 +635,15 @@ class BertPreTrainedModel(TorchModel, PreTrainedModel):
             The loaded model, which is initialized by transformers.PreTrainedModel.from_pretrained
         """
-        model_dir = kwargs.get('model_dir', None)
+        model_dir = kwargs.pop('model_dir', None)
+        cfg = kwargs.pop('cfg', None)
+        model_args = parse_labels_in_order(model_dir, cfg, **kwargs)
         if model_dir is None:
-            config = BertConfig(**kwargs)
+            config = BertConfig(**model_args)
             model = cls(config)
         else:
-            model_kwargs = {}
-            label2id = kwargs.get('label2id', parse_label_mapping(model_dir))
-            id2label = kwargs.get(
-                'id2label', None if label2id is None else
-                {id: label
-                 for label, id in label2id.items()})
-            if id2label is not None and label2id is None:
-                label2id = {label: id for id, label in id2label.items()}
-            num_labels = kwargs.get(
-                'num_labels', None if label2id is None else len(label2id))
-            if num_labels is not None:
-                model_kwargs['num_labels'] = num_labels
-            if label2id is not None:
-                model_kwargs['label2id'] = label2id
-            if id2label is not None:
-                model_kwargs['id2label'] = id2label
             model = super(Model, cls).from_pretrained(
-                pretrained_model_name_or_path=model_dir, **model_kwargs)
+                pretrained_model_name_or_path=model_dir, **model_args)
         model.model_dir = model_dir
         return model
@@ -750,7 +731,7 @@ class BertModel(BertPreTrainedModel):
                 output_attentions=None,
                 output_hidden_states=None,
                 return_dict=None,
-                **kwargs):
+                **kwargs) -> AttentionBackboneModelOutput:
         r"""
         Args:
             input_ids (`torch.LongTensor` of shape `((batch_size, sequence_length)`):
@@ -936,7 +917,7 @@ class BertModel(BertPreTrainedModel):
         if not return_dict:
             return (sequence_output, pooled_output) + encoder_outputs[1:]

-        return BaseModelOutputWithPoolingAndCrossAttentions(
+        return AttentionBackboneModelOutput(
             last_hidden_state=sequence_output,
             pooler_output=pooled_output,
             past_key_values=encoder_outputs.past_key_values,
@@ -5,37 +5,22 @@ from typing import Any, Dict
 import torch
 from torch import nn
 from torch.nn import CrossEntropyLoss
-from transformers.modeling_outputs import TokenClassifierOutput
-from transformers.models.bert.modeling_bert import (BertModel,
-                                                    BertPreTrainedModel)

 from modelscope.metainfo import Models
-from modelscope.models.base import Model
+from modelscope.models import Model
 from modelscope.models.builder import MODELS
+from modelscope.models.nlp.ponet import PoNetConfig
+from modelscope.outputs import AttentionTokenClassificationModelOutput
 from modelscope.utils.constant import Tasks
+from .backbone import BertModel, BertPreTrainedModel
+from .configuration import BertConfig

 __all__ = ['BertForDocumentSegmentation']


 @MODELS.register_module(
     Tasks.document_segmentation, module_name=Models.bert_for_ds)
-class BertForDocumentSegmentation(Model):
-
-    def __init__(self, model_dir: str, model_config: Dict[str, Any], *args,
-                 **kwargs):
-        super().__init__(model_dir, model_config, *args, **kwargs)
-        self.model_cfg = model_config
-
-    def build_with_config(self, config):
-        self.bert_model = BertForDocumentSegmentationBase.from_pretrained(
-            self.model_dir, from_tf=False, config=config)
-        return self.bert_model
-
-    def forward(self) -> Dict[str, Any]:
-        return self.model_cfg
-
-
-class BertForDocumentSegmentationBase(BertPreTrainedModel):
+class BertForDocumentSegmentation(BertPreTrainedModel):
     _keys_to_ignore_on_load_unexpected = [r'pooler']
@@ -103,9 +88,25 @@ class BertForDocumentSegmentationBase(BertPreTrainedModel):
             output = (logits, ) + outputs[2:]
             return ((loss, ) + output) if loss is not None else output

-        return TokenClassifierOutput(
+        return AttentionTokenClassificationModelOutput(
             loss=loss,
             logits=logits,
             hidden_states=outputs.hidden_states,
             attentions=outputs.attentions,
         )
+
+    @classmethod
+    def _instantiate(cls, model_dir, model_config: Dict[str, Any], **kwargs):
+        if model_config['type'] == 'bert':
+            config = BertConfig.from_pretrained(model_dir, num_labels=2)
+        elif model_config['type'] == 'ponet':
+            config = PoNetConfig.from_pretrained(model_dir, num_labels=2)
+        else:
+            raise ValueError(
+                f'Expected config type bert and ponet, which is : {model_config["type"]}'
+            )
+        model = super(Model, cls).from_pretrained(
+            model_dir, from_tf=False, config=config)
+        model.model_dir = model_dir
+        model.model_cfg = model_config
+        return model
@@ -121,7 +121,7 @@ class BertForMaskedLM(BertPreTrainedModel):
     Preprocessor:
         This is the fill_mask model of Structbert, the preprocessor of this model
-        is `modelscope.preprocessors.NLPPreprocessor`.
+        is `modelscope.preprocessors.FillMaskTransformersPreprocessor`.

     Parameters:
         config (:class:`~modelscope.models.nlp.structbert.SbertConfig`): Model configuration class with
@@ -51,7 +51,7 @@ class BertForSequenceClassification(BertPreTrainedModel):
     Preprocessor:
         This is the fill_mask model of Bert, the preprocessor of this model
-        is `modelscope.preprocessors.SequenceClassificationPreprocessor`.
+        is `modelscope.preprocessors.TextClassificationTransformersPreprocessor`.

     Trainer:
         This model is a normal PyTorch model, and can be trained by variable trainers, like EpochBasedTrainer,
@@ -22,7 +22,7 @@ from torch.nn import CrossEntropyLoss
 from modelscope.metainfo import Models
 from modelscope.models.builder import MODELS
-from modelscope.outputs import TokenClassifierOutput
+from modelscope.outputs import AttentionTokenClassificationModelOutput
 from modelscope.utils import logger as logging
 from modelscope.utils.constant import Tasks
 from .backbone import BertModel, BertPreTrainedModel
@@ -47,7 +47,7 @@ class BertForTokenClassification(BertPreTrainedModel):
     Preprocessor:
         This is the fill_mask model of Bert, the preprocessor of this model
-        is `modelscope.preprocessors.SequenceClassificationPreprocessor`.
+        is `modelscope.preprocessors.TokenClassificationTransformersPreprocessor`.

     Trainer:
         This model is a normal PyTorch model, and can be trained by variable trainers, like EpochBasedTrainer,
@@ -169,7 +169,7 @@ class BertForTokenClassification(BertPreTrainedModel):
             - 0 for tokens that are **masked**.

     Returns:
-        Returns `modelscope.outputs.TokenClassifierOutput`
+        Returns `modelscope.outputs.AttentionTokenClassificationModelOutput`

     Examples:
         >>> from modelscope.models import Model
@@ -212,14 +212,25 @@ class BertForTokenClassification(BertPreTrainedModel):
                 loss = loss_fct(
                     logits.view(-1, self.num_labels), labels.view(-1))

+        if label_mask is not None:
+            mask = label_mask
+            masked_lengths = mask.sum(-1).long()
+            masked_logits = torch.zeros_like(logits)
+            for i in range(len(mask)):
+                masked_logits[
+                    i, :masked_lengths[i], :] = logits[i].masked_select(
+                        mask[i].unsqueeze(-1)).view(masked_lengths[i], -1)
+            logits = masked_logits
+
         if not return_dict:
             output = (logits, ) + outputs[2:]
             return ((loss, ) + output) if loss is not None else output

-        return TokenClassifierOutput(
+        return AttentionTokenClassificationModelOutput(
             loss=loss,
             logits=logits,
             hidden_states=outputs.hidden_states,
             attentions=outputs.attentions,
             offset_mapping=offset_mapping,
+            label_mask=label_mask,
         )
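Illustrative only, using plain torch: how the label_mask packing above behaves on a batch of one sequence with 5 sub-tokens where positions 0, 2 and 3 start words:

    import torch

    logits = torch.arange(10.0).view(1, 5, 2)  # (batch, seq_len, num_labels)
    label_mask = torch.tensor([[1, 0, 1, 1, 0]], dtype=torch.bool)

    masked_lengths = label_mask.sum(-1).long()  # tensor([3])
    masked_logits = torch.zeros_like(logits)
    for i in range(len(label_mask)):
        masked_logits[i, :masked_lengths[i], :] = logits[i].masked_select(
            label_mask[i].unsqueeze(-1)).view(masked_lengths[i], -1)
    # masked_logits[0, :3] now holds the logits of positions 0, 2 and 3;
    # the padding tail stays zero, so word-level labels line up with row order.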
@@ -22,7 +22,6 @@ import torch.utils.checkpoint
 from torch import nn
 from torch.nn import LayerNorm
 from transformers.activations import ACT2FN
-from transformers.modeling_outputs import BaseModelOutput
 from transformers.modeling_utils import PreTrainedModel
 from transformers.pytorch_utils import softmax_backward_data
@@ -574,7 +573,7 @@ class DebertaV2Encoder(nn.Module):
             return tuple(
                 v for v in [output_states, all_hidden_states, all_attentions]
                 if v is not None)
-        return BaseModelOutput(
+        return AttentionBackboneModelOutput(
             last_hidden_state=output_states,
             hidden_states=all_hidden_states,
             attentions=all_attentions)
@@ -44,7 +44,7 @@ class DebertaV2ForMaskedLM(DebertaV2PreTrainedModel):
     Preprocessor:
         This is the fill_mask model of Deberta_v2, the preprocessor of this model
-        is `modelscope.preprocessors.NLPPreprocessor`.
+        is `modelscope.preprocessors.FillMaskTransformersPreprocessor`.

     Parameters:
         config (`DebertaV2Config`): Model configuration class with all the parameters of the model.
@@ -18,18 +18,16 @@ from modelscope.utils.import_utils import LazyImportModule
 if TYPE_CHECKING:
     from .configuration import PalmConfig
-    from .backbone import (
+    from .text_generation import (
         AbsSummarizer,
-        PalmForConditionalGeneration,
+        PalmForTextGeneration,
         Translator,
     )
-    from .text_generation import PalmForTextGeneration
 else:
     _import_structure = {
         'configuration': ['PalmConfig'],
-        'backbone':
-        ['AbsSummarizer', 'PalmForConditionalGeneration', 'Translator'],
-        'text_generation': ['PalmForTextGeneration'],
+        'text_generation':
+        ['AbsSummarizer', 'Translator', 'PalmForTextGeneration'],
     }

     import sys
@@ -23,8 +23,6 @@ import torch.utils.checkpoint
 from packaging import version
 from torch import nn
 from transformers.activations import ACT2FN
-from transformers.modeling_outputs import \
-    BaseModelOutputWithPastAndCrossAttentions
 from transformers.modeling_utils import (PreTrainedModel,
                                          apply_chunking_to_forward,
                                          find_pruneable_heads_and_indices,
@@ -573,7 +571,7 @@ class PoNetEncoder(nn.Module):
                 all_self_attentions,
                 all_cross_attentions,
             ] if v is not None)
-        return BaseModelOutputWithPastAndCrossAttentions(
+        return AttentionBackboneModelOutput(
             last_hidden_state=hidden_states,
             past_key_values=next_decoder_cache,
             hidden_states=all_hidden_states,
@@ -642,34 +640,6 @@ class PoNetPreTrainedModel(TorchModel, PreTrainedModel):
         return model


-class PoNetPreTrainedModelV2(PreTrainedModel):
-    """
-    A base class to handle weights initialization and a simple interface for loading pretrained models.
-    """
-
-    config_class = PoNetConfig
-    base_model_prefix = 'ponet'
-    _keys_to_ignore_on_load_missing = [r'position_ids']
-
-    def _init_weights(self, module):
-        """Initialize the weights"""
-        if isinstance(module, nn.Linear):
-            # Slightly different from the TF version which uses truncated_normal for initialization
-            # cf https://github.com/pytorch/pytorch/pull/5617
-            module.weight.data.normal_(
-                mean=0.0, std=self.config.initializer_range)
-            if module.bias is not None:
-                module.bias.data.zero_()
-        elif isinstance(module, nn.Embedding):
-            module.weight.data.normal_(
-                mean=0.0, std=self.config.initializer_range)
-            if module.padding_idx is not None:
-                module.weight.data[module.padding_idx].zero_()
-        elif isinstance(module, nn.LayerNorm):
-            module.bias.data.zero_()
-            module.weight.data.fill_(1.0)
-
-
 @MODELS.register_module(Tasks.backbone, module_name=Models.ponet)
 class PoNetModel(PoNetPreTrainedModel):
     """The bare PoNet Model transformer outputting raw hidden-states without any specific head on top.
@@ -5,13 +5,15 @@ from typing import Any, Dict
 import torch
 from torch import nn
 from torch.nn import CrossEntropyLoss
-from transformers.modeling_outputs import TokenClassifierOutput

 from modelscope.metainfo import Models
 from modelscope.models.base import Model
 from modelscope.models.builder import MODELS
+from modelscope.models.nlp.bert import BertConfig
+from modelscope.outputs import AttentionTokenClassificationModelOutput
 from modelscope.utils.constant import Tasks
-from .backbone import PoNetModel, PoNetPreTrainedModelV2
+from .backbone import PoNetModel, PoNetPreTrainedModel
+from .configuration import PoNetConfig

 __all__ = ['PoNetForDocumentSegmentation']
@@ -20,23 +22,7 @@ __all__ = ['PoNetForDocumentSegmentation']
     Tasks.document_segmentation, module_name=Models.ponet_for_ds)
 @MODELS.register_module(
     Tasks.extractive_summarization, module_name=Models.ponet_for_ds)
-class PoNetForDocumentSegmentation(Model):
-
-    def __init__(self, model_dir: str, model_config: Dict[str, Any], *args,
-                 **kwargs):
-        super().__init__(model_dir, model_config, *args, **kwargs)
-        self.model_cfg = model_config
-
-    def build_with_config(self, config):
-        self.ponet_model = PoNetForDocumentSegmentationBase.from_pretrained(
-            self.model_dir, config=config)
-        return self.ponet_model
-
-    def forward(self) -> Dict[str, Any]:
-        return self.model_cfg
-
-
-class PoNetForDocumentSegmentationBase(PoNetPreTrainedModelV2):
+class PoNetForDocumentSegmentation(PoNetPreTrainedModel):
     _keys_to_ignore_on_load_unexpected = [r'pooler']

     def __init__(self, config):
@@ -107,9 +93,24 @@ class PoNetForDocumentSegmentationBase(PoNetPreTrainedModelV2):
             output = (logits, ) + outputs[2:]
             return ((loss, ) + output) if loss is not None else output

-        return TokenClassifierOutput(
+        return AttentionTokenClassificationModelOutput(
             loss=loss,
             logits=logits,
             hidden_states=outputs.hidden_states,
             attentions=outputs.attentions,
         )
+
+    @classmethod
+    def _instantiate(cls, model_dir, model_config: Dict[str, Any], **kwargs):
+        if model_config['type'] == 'bert':
+            config = BertConfig.from_pretrained(model_dir, num_labels=2)
+        elif model_config['type'] == 'ponet':
+            config = PoNetConfig.from_pretrained(model_dir, num_labels=2)
+        else:
+            raise ValueError(
+                f'Expected config type bert and ponet, which is : {model_config["type"]}'
+            )
+        model = super(Model, cls).from_pretrained(model_dir, config=config)
+        model.model_dir = model_dir
+        model.model_cfg = model_config
+        return model
@@ -15,14 +15,14 @@
 # limitations under the License
 """Tokenization classes for Space. mainly copied from :module:`~transformers.tokenization_xlm_roberta`"""

-from modelscope.models.nlp.structbert import (BasicTokenizer, SbertTokenizer,
-                                              WordpieceTokenizer)
+from transformers import BasicTokenizer, BertTokenizer, WordpieceTokenizer

 from modelscope.utils import logger as logging

 logger = logging.get_logger(__name__)


-class SpaceTokenizer(SbertTokenizer):
+class SpaceTokenizer(BertTokenizer):
     """
     This class overrides [`SpaceTokenizer`]. Please check the superclass for the appropriate
     documentation alongside usage examples.
@@ -24,9 +24,6 @@ if TYPE_CHECKING:
     from .fill_mask import SbertForMaskedLM
     from .text_classification import SbertForSequenceClassification
     from .token_classification import SbertForTokenClassification
-    from .tokenization import (BasicTokenizer, SbertTokenizer,
-                               WordpieceTokenizer)
-    from .tokenization_fast import SbertTokenizerFast
 else:
     _import_structure = {
         'backbone': ['SbertModel', 'SbertPreTrainedModel'],
@@ -35,9 +32,6 @@ else:
         'faq_question_answering': ['SbertForFaqQuestionAnswering'],
         'text_classification': ['SbertForSequenceClassification'],
         'token_classification': ['SbertForTokenClassification'],
-        'tokenization':
-        ['BasicTokenizer', 'SbertTokenizer', 'WordpieceTokenizer'],
-        'tokenization_fast': ['SbertTokenizerFast'],
     }

     import sys
@@ -18,15 +18,13 @@
 import math
 from dataclasses import dataclass
-from typing import Optional, Tuple, Union
+from typing import Optional, Union

 import torch
 import torch.nn as nn
 import torch.utils.checkpoint
 from packaging import version
 from transformers.activations import ACT2FN
-from transformers.modeling_outputs import \
-    BaseModelOutputWithPastAndCrossAttentions
 from transformers.modeling_utils import (PreTrainedModel,
                                          apply_chunking_to_forward,
                                          find_pruneable_heads_and_indices,
@@ -37,8 +35,8 @@ from modelscope.models import Model, TorchModel
 from modelscope.models.builder import MODELS
 from modelscope.outputs import AttentionBackboneModelOutput
 from modelscope.utils.constant import Tasks
-from modelscope.utils.hub import parse_label_mapping
 from modelscope.utils.logger import get_logger
+from modelscope.utils.nlp.utils import parse_labels_in_order
 from .configuration import SbertConfig

 logger = get_logger(__name__)
@@ -563,7 +561,7 @@ class SbertEncoder(nn.Module):
                 all_self_attentions,
                 all_cross_attentions,
             ] if v is not None)
-        return BaseModelOutputWithPastAndCrossAttentions(
+        return AttentionBackboneModelOutput(
             last_hidden_state=hidden_states,
             past_key_values=next_decoder_cache,
             hidden_states=all_hidden_states,
@@ -641,29 +639,15 @@ class SbertPreTrainedModel(TorchModel, PreTrainedModel):
         """
         model_dir = kwargs.pop('model_dir', None)
+        cfg = kwargs.pop('cfg', None)
+        model_args = parse_labels_in_order(model_dir, cfg, **kwargs)
         if model_dir is None:
-            config = SbertConfig(**kwargs)
+            config = SbertConfig(**model_args)
             model = cls(config)
         else:
-            model_kwargs = {}
-            label2id = kwargs.get('label2id', parse_label_mapping(model_dir))
-            id2label = kwargs.get(
-                'id2label', None if label2id is None else
-                {id: label
-                 for label, id in label2id.items()})
-            if id2label is not None and label2id is None:
-                label2id = {label: id for id, label in id2label.items()}
-            num_labels = kwargs.get(
-                'num_labels', None if label2id is None else len(label2id))
-            if num_labels is not None:
-                model_kwargs['num_labels'] = num_labels
-            if label2id is not None:
-                model_kwargs['label2id'] = label2id
-            if id2label is not None:
-                model_kwargs['id2label'] = id2label
             model = super(Model, cls).from_pretrained(
-                pretrained_model_name_or_path=model_dir, **model_kwargs)
+                pretrained_model_name_or_path=model_dir, **model_args)
         return model
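A sketch of the unified label handling: an explicit label2id passed as a kwarg should now flow through parse_labels_in_order as the removed inline logic did (deriving id2label and num_labels); the model dir is illustrative:

    from modelscope.models.nlp import SbertForSequenceClassification

    model = SbertForSequenceClassification.from_pretrained(
        'path/to/model_dir', label2id={'negative': 0, 'positive': 1})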
@@ -14,6 +14,7 @@ from modelscope.metainfo import Models
 from modelscope.models.builder import MODELS
 from modelscope.models.nlp.structbert import SbertConfig, SbertModel
 from modelscope.models.nlp.task_models.task_model import BaseTaskModel
+from modelscope.outputs import FaqQuestionAnsweringOutput
 from modelscope.utils.config import Config, ConfigFields
 from modelscope.utils.constant import ModelFile, Tasks
@@ -208,10 +209,10 @@ class SbertForFaqQuestionAnswering(BaseTaskModel):
                 Predicted scores of all classes for each query.
         Examples:
             >>> from modelscope.hub.snapshot_download import snapshot_download
-            >>> from modelscope.preprocessors import FaqQuestionAnsweringPreprocessor
+            >>> from modelscope.preprocessors import FaqQuestionAnsweringTransformersPreprocessor
             >>> from modelscope.models.nlp import SbertForFaqQuestionAnswering
             >>> cache_path = snapshot_download('damo/nlp_structbert_faq-question-answering_chinese-base')
-            >>> preprocessor = FaqQuestionAnsweringPreprocessor.from_pretrained(cache_path)
+            >>> preprocessor = FaqQuestionAnsweringTransformersPreprocessor.from_pretrained(cache_path)
             >>> model = SbertForFaqQuestionAnswering.from_pretrained(cache_path)
             >>> param = {
             >>>     'query_set': ['如何使用优惠券', '在哪里领券', '在哪里领券'],
@@ -270,7 +271,7 @@ class SbertForFaqQuestionAnswering(BaseTaskModel):
         scores = self.metrics_layer(z_query, protos).view([n_query, num_cls])
         if self.metrics_layer.name == 'relation':
             scores = torch.sigmoid(scores)
-        return {'scores': scores}
+        return FaqQuestionAnsweringOutput(scores=scores)
     def _get_onehot_labels(self, labels, support_size, num_cls):
         labels_ = labels.view(support_size, 1)
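With the bare dict replaced by `FaqQuestionAnsweringOutput`, callers get attribute access on the forward result; assuming ModelScope's output classes keep HuggingFace-style indexing (as ModelOutput-like dataclasses usually do), the old key-based access should keep working too:

outputs = model(**param)
scores = outputs.scores       # attribute access via the new output class
scores = outputs['scores']    # dict-style access, assuming ModelOutput-like indexing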
@@ -105,7 +105,7 @@ class SbertForMaskedLM(SbertPreTrainedModel):
     Preprocessor:
         This is the fill_mask model of StructBERT, the preprocessor of this model
-        is `modelscope.preprocessors.NLPPreprocessor`.
+        is `modelscope.preprocessors.FillMaskTransformersPreprocessor`.
     Parameters:
         config (:class:`~modelscope.models.nlp.structbert.SbertConfig`): Model configuration class with
@@ -213,9 +213,9 @@ class SbertForMaskedLM(SbertPreTrainedModel):
         Examples:
             >>> from modelscope.models import Model
-            >>> from modelscope.preprocessors import Preprocessor, NLPPreprocessor
+            >>> from modelscope.preprocessors import Preprocessor, FillMaskTransformersPreprocessor
             >>> model = Model.from_pretrained('damo/nlp_structbert_fill-mask_chinese-large')
-            >>> preprocessor = NLPPreprocessor('damo/nlp_structbert_fill-mask_chinese-large')
+            >>> preprocessor = FillMaskTransformersPreprocessor('damo/nlp_structbert_fill-mask_chinese-large')
             >>> # Call the model, return some tensors
             >>> print(model(**preprocessor('你师父差得动你,你师父可[MASK]不动我。')))
             >>> # Call the pipeline
@@ -55,7 +55,7 @@ class SbertForSequenceClassification(SbertPreTrainedModel):
     Preprocessor:
         This is the text classification model of StructBERT, the preprocessor of this model
-        is `modelscope.preprocessors.SequenceClassificationPreprocessor`.
+        is `modelscope.preprocessors.TextClassificationTransformersPreprocessor`.
     Trainer:
         This model is a normal PyTorch model, and can be trained by variable trainers, like EpochBasedTrainer,
@@ -22,7 +22,7 @@ from torch.nn import CrossEntropyLoss
 from modelscope.metainfo import Models
 from modelscope.models.builder import MODELS
-from modelscope.outputs import TokenClassifierOutput
+from modelscope.outputs import AttentionTokenClassificationModelOutput
 from modelscope.utils import logger as logging
 from modelscope.utils.constant import Tasks
 from .adv_utils import compute_adv_loss
@@ -50,7 +50,7 @@ class SbertForTokenClassification(SbertPreTrainedModel):
     Preprocessor:
         This is the token-classification model of StructBERT, the preprocessor of this model
-        is `modelscope.preprocessors.TokenClassificationPreprocessor`.
+        is `modelscope.preprocessors.TokenClassificationTransformersPreprocessor`.
     Trainer:
         This model is a normal PyTorch model, and can be trained by variable trainers, like EpochBasedTrainer,
@@ -168,7 +168,7 @@ class SbertForTokenClassification(SbertPreTrainedModel):
                 - 0 for tokens that are **masked**.
         Returns:
-            Returns `modelscope.outputs.TokenClassifierOutput`
+            Returns `modelscope.outputs.AttentionTokenClassificationModelOutput`
         Examples:
             >>> from modelscope.models import Model
@@ -220,10 +220,21 @@ class SbertForTokenClassification(SbertPreTrainedModel):
                 with_attention_mask=attention_mask is not None,
                 **outputs.kwargs)
-        return TokenClassifierOutput(
+        if label_mask is not None:
+            mask = label_mask
+            masked_lengths = mask.sum(-1).long()
+            masked_logits = torch.zeros_like(logits)
+            for i in range(len(mask)):
+                masked_logits[
+                    i, :masked_lengths[i], :] = logits[i].masked_select(
+                        mask[i].unsqueeze(-1)).view(masked_lengths[i], -1)
+            logits = masked_logits
+        return AttentionTokenClassificationModelOutput(
             loss=loss,
             logits=logits,
             hidden_states=outputs.hidden_states,
             attentions=outputs.attentions,
             offset_mapping=offset_mapping,
+            label_mask=label_mask,
         )
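The inserted block left-aligns the logits of non-masked positions within each sequence, zero-padding the tail, instead of flattening valid positions across the batch (the old task-model implementation removed later in this diff did `logits[input['label_mask']].unsqueeze(0)`, which collapses the batch dimension). A small self-contained walk-through of the same loop:

import torch

# Two sequences of length 4 with 2 labels each; label_mask marks real tokens.
logits = torch.arange(16, dtype=torch.float).view(2, 4, 2)
label_mask = torch.tensor([[1, 1, 0, 0], [1, 0, 1, 0]]).bool()

masked_lengths = label_mask.sum(-1).long()      # tensor([2, 2])
masked_logits = torch.zeros_like(logits)
for i in range(len(label_mask)):
    masked_logits[i, :masked_lengths[i], :] = logits[i].masked_select(
        label_mask[i].unsqueeze(-1)).view(masked_lengths[i], -1)
# Sequence 0 keeps positions 0-1; sequence 1 keeps positions 0 and 2,
# each packed to the front and zero-padded to the original length.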
@@ -1,519 +0,0 @@
-# Copyright 2021-2022 The Alibaba DAMO NLP Team Authors.
-# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Tokenization classes for Sbert. mainly copied from :module:`~transformers.tokenization_bert`"""
-import collections
-import os
-import unicodedata
-from typing import List, Optional, Tuple
-from transformers.tokenization_utils import (PreTrainedTokenizer, _is_control,
-                                             _is_punctuation, _is_whitespace)
-from modelscope.utils.constant import ModelFile
-from modelscope.utils.logger import get_logger
-logger = get_logger(__name__)
-VOCAB_FILES_NAMES = {'vocab_file': ModelFile.VOCAB_FILE}
-PRETRAINED_VOCAB_FILES_MAP = {'vocab_file': {}}
-PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
-    'nlp_structbert_backbone_large_std': 512,
-    'nlp_structbert_backbone_base_std': 512,
-    'nlp_structbert_backbone_lite_std': 512,
-    'nlp_structbert_backbone_tiny_std': 512,
-}
-PRETRAINED_INIT_CONFIGURATION = {
-    'english_sbert-large-std-512': {
-        'do_lower_case': True
-    },
-}
-def load_vocab(vocab_file):
-    """Loads a vocabulary file into a dictionary."""
-    vocab = collections.OrderedDict()
-    with open(vocab_file, 'r', encoding='utf-8') as reader:
-        tokens = reader.readlines()
-    for index, token in enumerate(tokens):
-        token = token.rstrip('\n')
-        vocab[token] = index
-    return vocab
-def whitespace_tokenize(text):
-    """Runs basic whitespace cleaning and splitting on a piece of text."""
-    text = text.strip()
-    if not text:
-        return []
-    tokens = text.split()
-    return tokens
-class SbertTokenizer(PreTrainedTokenizer):
-    r"""
-    Construct a SBERT tokenizer. Based on WordPiece.
-    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the main methods.
-    Users should refer to this superclass for more information regarding those methods.
-    Args:
-        vocab_file (:obj:`str`):
-            File containing the vocabulary.
-        do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):
-            Whether or not to lowercase the input when tokenizing.
-        do_basic_tokenize (:obj:`bool`, `optional`, defaults to :obj:`True`):
-            Whether or not to do basic tokenization before WordPiece.
-        never_split (:obj:`Iterable`, `optional`):
-            Collection of tokens which will never be split during tokenization. Only has an effect when
-            :obj:`do_basic_tokenize=True`
-        unk_token (:obj:`str`, `optional`, defaults to :obj:`"[UNK]"`):
-            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
-            token instead.
-        sep_token (:obj:`str`, `optional`, defaults to :obj:`"[SEP]"`):
-            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
-            sequence classification or for a text and a question for question answering. It is also used as the last
-            token of a sequence built with special tokens.
-        pad_token (:obj:`str`, `optional`, defaults to :obj:`"[PAD]"`):
-            The token used for padding, for example when batching sequences of different lengths.
-        cls_token (:obj:`str`, `optional`, defaults to :obj:`"[CLS]"`):
-            The classifier token which is used when doing sequence classification (classification of the whole sequence
-            instead of per-token classification). It is the first token of the sequence when built with special tokens.
-        mask_token (:obj:`str`, `optional`, defaults to :obj:`"[MASK]"`):
-            The token used for masking values. This is the token used when training this model with masked language
-            modeling. This is the token which the model will try to predict.
-        tokenize_chinese_chars (:obj:`bool`, `optional`, defaults to :obj:`True`):
-            Whether or not to tokenize Chinese characters.
-            This should likely be deactivated for Japanese (see this `issue
-            <https://github.com/huggingface/transformers/issues/328>`__).
-        strip_accents: (:obj:`bool`, `optional`):
-            Whether or not to strip all accents. If this option is not specified, then it will be determined by the
-            value for :obj:`lowercase` (as in the original BERT).
-    """
-    vocab_files_names = VOCAB_FILES_NAMES
-    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
-    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION
-    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
-    def __init__(self,
-                 vocab_file,
-                 do_lower_case=True,
-                 do_basic_tokenize=True,
-                 never_split=None,
-                 unk_token='[UNK]',
-                 sep_token='[SEP]',
-                 pad_token='[PAD]',
-                 cls_token='[CLS]',
-                 mask_token='[MASK]',
-                 tokenize_chinese_chars=True,
-                 strip_accents=None,
-                 **kwargs):
-        super().__init__(
-            do_lower_case=do_lower_case,
-            do_basic_tokenize=do_basic_tokenize,
-            never_split=never_split,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            pad_token=pad_token,
-            cls_token=cls_token,
-            mask_token=mask_token,
-            tokenize_chinese_chars=tokenize_chinese_chars,
-            strip_accents=strip_accents,
-            **kwargs,
-        )
-        if not os.path.isfile(vocab_file):
-            raise ValueError(
-                f"Can't find a vocabulary file at path '{vocab_file}'. To load the vocabulary from a Google pretrained "
-                'model use `tokenizer = SbertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`'
-            )
-        self.vocab = load_vocab(vocab_file)
-        self.ids_to_tokens = collections.OrderedDict([
-            (ids, tok) for tok, ids in self.vocab.items()
-        ])
-        self.do_basic_tokenize = do_basic_tokenize
-        if do_basic_tokenize:
-            self.basic_tokenizer = BasicTokenizer(
-                do_lower_case=do_lower_case,
-                never_split=never_split,
-                tokenize_chinese_chars=tokenize_chinese_chars,
-                strip_accents=strip_accents,
-            )
-        self.wordpiece_tokenizer = WordpieceTokenizer(
-            vocab=self.vocab, unk_token=self.unk_token)
-    @property
-    def do_lower_case(self):
-        return self.basic_tokenizer.do_lower_case
-    @property
-    def vocab_size(self):
-        return len(self.vocab)
-    def get_vocab(self):
-        return dict(self.vocab, **self.added_tokens_encoder)
-    def _tokenize(self, text):
-        split_tokens = []
-        if self.do_basic_tokenize:
-            for token in self.basic_tokenizer.tokenize(
-                    text, never_split=self.all_special_tokens):
-                # If the token is part of the never_split set
-                if token in self.basic_tokenizer.never_split:
-                    split_tokens.append(token)
-                else:
-                    split_tokens += self.wordpiece_tokenizer.tokenize(token)
-        else:
-            split_tokens = self.wordpiece_tokenizer.tokenize(text)
-        return split_tokens
-    def _convert_token_to_id(self, token):
-        """Converts a token (str) in an id using the vocab."""
-        return self.vocab.get(token, self.vocab.get(self.unk_token))
-    def _convert_id_to_token(self, index):
-        """Converts an index (integer) in a token (str) using the vocab."""
-        return self.ids_to_tokens.get(index, self.unk_token)
-    def convert_tokens_to_string(self, tokens):
-        """Converts a sequence of tokens (string) in a single string."""
-        out_string = ' '.join(tokens).replace(' ##', '').strip()
-        return out_string
-    def build_inputs_with_special_tokens(
-            self,
-            token_ids_0: List[int],
-            token_ids_1: Optional[List[int]] = None) -> List[int]:
-        """
-        Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
-        adding special tokens. A SBERT sequence has the following format:
-        - single sequence: ``[CLS] X [SEP]``
-        - pair of sequences: ``[CLS] A [SEP] B [SEP]``
-        Args:
-            token_ids_0 (:obj:`List[int]`):
-                List of IDs to which the special tokens will be added.
-            token_ids_1 (:obj:`List[int]`, `optional`):
-                Optional second list of IDs for sequence pairs.
-        Returns:
-            :obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
-        """
-        if token_ids_1 is None:
-            return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
-        cls = [self.cls_token_id]
-        sep = [self.sep_token_id]
-        return cls + token_ids_0 + sep + token_ids_1 + sep
-    def get_special_tokens_mask(
-            self,
-            token_ids_0: List[int],
-            token_ids_1: Optional[List[int]] = None,
-            already_has_special_tokens: bool = False) -> List[int]:
-        """
-        Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
-        special tokens using the tokenizer ``prepare_for_model`` method.
-        Args:
-            token_ids_0 (:obj:`List[int]`):
-                List of IDs.
-            token_ids_1 (:obj:`List[int]`, `optional`):
-                Optional second list of IDs for sequence pairs.
-            already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
-                Whether or not the token list is already formatted with special tokens for the model.
-        Returns:
-            :obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
-        """
-        if already_has_special_tokens:
-            return super().get_special_tokens_mask(
-                token_ids_0=token_ids_0,
-                token_ids_1=token_ids_1,
-                already_has_special_tokens=True)
-        if token_ids_1 is not None:
-            return [1] + ([0] * len(token_ids_0)) + [1] + (
-                [0] * len(token_ids_1)) + [1]
-        return [1] + ([0] * len(token_ids_0)) + [1]
-    def create_token_type_ids_from_sequences(
-            self,
-            token_ids_0: List[int],
-            token_ids_1: Optional[List[int]] = None) -> List[int]:
-        """
-        Create a mask from the two sequences passed to be used in a sequence-pair classification task. A SBERT sequence
-        pair mask has the following format:
-        ::
-            0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
-            | first sequence | second sequence |
-        If :obj:`token_ids_1` is :obj:`None`, this method only returns the first portion of the mask (0s).
-        Args:
-            token_ids_0 (:obj:`List[int]`):
-                List of IDs.
-            token_ids_1 (:obj:`List[int]`, `optional`):
-                Optional second list of IDs for sequence pairs.
-        Returns:
-            :obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given
-            sequence(s).
-        """
-        sep = [self.sep_token_id]
-        cls = [self.cls_token_id]
-        if token_ids_1 is None:
-            return len(cls + token_ids_0 + sep) * [0]
-        return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1
-                                                        + sep) * [1]
-    def save_vocabulary(self,
-                        save_directory: str,
-                        filename_prefix: Optional[str] = None) -> Tuple[str]:
-        index = 0
-        if os.path.isdir(save_directory):
-            vocab_file = os.path.join(
-                save_directory,
-                (filename_prefix + '-' if filename_prefix else '')
-                + VOCAB_FILES_NAMES['vocab_file'])
-        else:
-            vocab_file = (filename_prefix
-                          + '-' if filename_prefix else '') + save_directory
-        with open(vocab_file, 'w', encoding='utf-8') as writer:
-            for token, token_index in sorted(
-                    self.vocab.items(), key=lambda kv: kv[1]):
-                if index != token_index:
-                    logger.warning(
-                        f'Saving vocabulary to {vocab_file}: vocabulary indices are not consecutive.'
-                        ' Please check that the vocabulary is not corrupted!')
-                    index = token_index
-                writer.write(token + '\n')
-                index += 1
-        return (vocab_file, )
-class BasicTokenizer(object):
-    """
-    Constructs a BasicTokenizer that will run basic tokenization (punctuation splitting, lower casing, etc.).
-    Args:
-        do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):
-            Whether or not to lowercase the input when tokenizing.
-        never_split (:obj:`Iterable`, `optional`):
-            Collection of tokens which will never be split during tokenization. Only has an effect when
-            :obj:`do_basic_tokenize=True`
-        tokenize_chinese_chars (:obj:`bool`, `optional`, defaults to :obj:`True`):
-            Whether or not to tokenize Chinese characters.
-            This should likely be deactivated for Japanese (see this `issue
-            <https://github.com/huggingface/transformers/issues/328>`__).
-        strip_accents: (:obj:`bool`, `optional`):
-            Whether or not to strip all accents. If this option is not specified, then it will be determined by the
-            value for :obj:`lowercase` (as in the original BERT).
-    """
-    def __init__(self,
-                 do_lower_case=True,
-                 never_split=None,
-                 tokenize_chinese_chars=True,
-                 strip_accents=None):
-        if never_split is None:
-            never_split = []
-        self.do_lower_case = do_lower_case
-        self.never_split = set(never_split)
-        self.tokenize_chinese_chars = tokenize_chinese_chars
-        self.strip_accents = strip_accents
-    def tokenize(self, text, never_split=None):
-        """
-        Basic Tokenization of a piece of text. Split on "white spaces" only, for sub-word tokenization, see
-        WordPieceTokenizer.
-        Args:
-            **never_split**: (`optional`) list of str
-                Kept for backward compatibility purposes. Now implemented directly at the base class level (see
-                :func:`PreTrainedTokenizer.tokenize`) List of token not to split.
-        """
-        # union() returns a new set by concatenating the two sets.
-        never_split = self.never_split.union(
-            set(never_split)) if never_split else self.never_split
-        text = self._clean_text(text)
-        # This was added on November 1st, 2018 for the multilingual and Chinese
-        # models. This is also applied to the English models now, but it doesn't
-        # matter since the English models were not trained on any Chinese data
-        # and generally don't have any Chinese data in them (there are Chinese
-        # characters in the vocabulary because Wikipedia does have some Chinese
-        # words in the English Wikipedia.).
-        if self.tokenize_chinese_chars:
-            text = self._tokenize_chinese_chars(text)
-        orig_tokens = whitespace_tokenize(text)
-        split_tokens = []
-        for token in orig_tokens:
-            if token not in never_split:
-                if self.do_lower_case:
-                    token = token.lower()
-                    if self.strip_accents is not False:
-                        token = self._run_strip_accents(token)
-                elif self.strip_accents:
-                    token = self._run_strip_accents(token)
-            split_tokens.extend(self._run_split_on_punc(token, never_split))
-        output_tokens = whitespace_tokenize(' '.join(split_tokens))
-        return output_tokens
-    def _run_strip_accents(self, text):
-        """Strips accents from a piece of text."""
-        text = unicodedata.normalize('NFD', text)
-        output = []
-        for char in text:
-            cat = unicodedata.category(char)
-            if cat == 'Mn':
-                continue
-            output.append(char)
-        return ''.join(output)
-    def _run_split_on_punc(self, text, never_split=None):
-        """Splits punctuation on a piece of text."""
-        if never_split is not None and text in never_split:
-            return [text]
-        chars = list(text)
-        i = 0
-        start_new_word = True
-        output = []
-        while i < len(chars):
-            char = chars[i]
-            if _is_punctuation(char):
-                output.append([char])
-                start_new_word = True
-            else:
-                if start_new_word:
-                    output.append([])
-                start_new_word = False
-                output[-1].append(char)
-            i += 1
-        return [''.join(x) for x in output]
-    def _tokenize_chinese_chars(self, text):
-        """Adds whitespace around any CJK character."""
-        output = []
-        for char in text:
-            cp = ord(char)
-            if self._is_chinese_char(cp):
-                output.append(' ')
-                output.append(char)
-                output.append(' ')
-            else:
-                output.append(char)
-        return ''.join(output)
-    def _is_chinese_char(self, cp):
-        """Checks whether CP is the codepoint of a CJK character."""
-        # This defines a "chinese character" as anything in the CJK Unicode block:
-        # https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
-        #
-        # Note that the CJK Unicode block is NOT all Japanese and Korean characters,
-        # despite its name. The modern Korean Hangul alphabet is a different block,
-        # as is Japanese Hiragana and Katakana. Those alphabets are used to write
-        # space-separated words, so they are not treated specially and handled
-        # like the all of the other languages.
-        if ((0x4E00 <= cp <= 0x9FFF) or (0x3400 <= cp <= 0x4DBF)
-                or (0x20000 <= cp <= 0x2A6DF) or (0x2A700 <= cp <= 0x2B73F)
-                or (0x2B740 <= cp <= 0x2B81F) or (0x2B820 <= cp <= 0x2CEAF)
-                or (0xF900 <= cp <= 0xFAFF) or (0x2F800 <= cp <= 0x2FA1F)):
-            return True
-        return False
-    def _clean_text(self, text):
-        """Performs invalid character removal and whitespace cleanup on text."""
-        output = []
-        for char in text:
-            cp = ord(char)
-            if cp == 0 or cp == 0xFFFD or _is_control(char):
-                continue
-            if _is_whitespace(char):
-                output.append(' ')
-            else:
-                output.append(char)
-        return ''.join(output)
-class WordpieceTokenizer(object):
-    """Runs WordPiece tokenization."""
-    def __init__(self, vocab, unk_token, max_input_chars_per_word=100):
-        self.vocab = vocab
-        self.unk_token = unk_token
-        self.max_input_chars_per_word = max_input_chars_per_word
-    def tokenize(self, text):
-        """
-        Tokenizes a piece of text into its word pieces. This uses a greedy longest-match-first algorithm to perform
-        tokenization using the given vocabulary.
-        For example, :obj:`input = "unaffable"` wil return as output :obj:`["un", "##aff", "##able"]`.
-        Args:
-            text: A single token or whitespace separated tokens. This should have
-                already been passed through `BasicTokenizer`.
-        Returns:
-            A list of wordpiece tokens.
-        """
-        output_tokens = []
-        for token in whitespace_tokenize(text):
-            chars = list(token)
-            if len(chars) > self.max_input_chars_per_word:
-                output_tokens.append(self.unk_token)
-                continue
-            is_bad = False
-            start = 0
-            sub_tokens = []
-            while start < len(chars):
-                end = len(chars)
-                cur_substr = None
-                while start < end:
-                    substr = ''.join(chars[start:end])
-                    if start > 0:
-                        substr = '##' + substr
-                    if substr in self.vocab:
-                        cur_substr = substr
-                        break
-                    end -= 1
-                if cur_substr is None:
-                    is_bad = True
-                    break
-                sub_tokens.append(cur_substr)
-                start = end
-            if is_bad:
-                output_tokens.append(self.unk_token)
-            else:
-                output_tokens.extend(sub_tokens)
-        return output_tokens
@@ -1,203 +0,0 @@
-# Copyright 2021-2022 The Alibaba DAMO NLP Team Authors.
-# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""Fast Tokenization classes for Sbert. mainly copied from :module:`~transformers.tokenization_bert_fast`"""
-from typing import List, Optional, Tuple
-import json
-import transformers
-from tokenizers import normalizers
-from transformers.tokenization_utils_fast import PreTrainedTokenizerFast
-from modelscope.utils.constant import ModelFile
-from modelscope.utils.logger import get_logger
-from .tokenization import SbertTokenizer
-logger = get_logger(__name__)
-VOCAB_FILES_NAMES = {
-    'vocab_file': ModelFile.VOCAB_FILE,
-    'tokenizer_file': 'tokenizer.json'
-}
-PRETRAINED_VOCAB_FILES_MAP = {
-    'vocab_file': {},
-    'tokenizer_file': {},
-}
-PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
-    'nlp_structbert_backbone_large_std': 512,
-    'nlp_structbert_backbone_base_std': 512,
-    'nlp_structbert_backbone_lite_std': 512,
-    'nlp_structbert_backbone_tiny_std': 512,
-}
-PRETRAINED_INIT_CONFIGURATION = {
-    'english_sbert-large-std-512': {
-        'do_lower_case': True
-    },
-}
-transformers.SLOW_TO_FAST_CONVERTERS[
-    'SbertTokenizer'] = transformers.SLOW_TO_FAST_CONVERTERS['BertTokenizer']
-class SbertTokenizerFast(PreTrainedTokenizerFast):
-    r"""
-    Construct a "fast" SBERT tokenizer (backed by HuggingFace's `tokenizers` library). Based on WordPiece.
-    This tokenizer inherits from :class:`~transformers.PreTrainedTokenizerFast` which contains most of the main
-    methods. Users should refer to this superclass for more information regarding those methods.
-    Args:
-        vocab_file (:obj:`str`):
-            File containing the vocabulary.
-        do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):
-            Whether or not to lowercase the input when tokenizing.
-        unk_token (:obj:`str`, `optional`, defaults to :obj:`"[UNK]"`):
-            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
-            token instead.
-        sep_token (:obj:`str`, `optional`, defaults to :obj:`"[SEP]"`):
-            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
-            sequence classification or for a text and a question for question answering. It is also used as the last
-            token of a sequence built with special tokens.
-        pad_token (:obj:`str`, `optional`, defaults to :obj:`"[PAD]"`):
-            The token used for padding, for example when batching sequences of different lengths.
-        cls_token (:obj:`str`, `optional`, defaults to :obj:`"[CLS]"`):
-            The classifier token which is used when doing sequence classification (classification of the whole sequence
-            instead of per-token classification). It is the first token of the sequence when built with special tokens.
-        mask_token (:obj:`str`, `optional`, defaults to :obj:`"[MASK]"`):
-            The token used for masking values. This is the token used when training this model with masked language
-            modeling. This is the token which the model will try to predict.
-        clean_text (:obj:`bool`, `optional`, defaults to :obj:`True`):
-            Whether or not to clean the text before tokenization by removing any control characters and replacing all
-            whitespaces by the classic one.
-        tokenize_chinese_chars (:obj:`bool`, `optional`, defaults to :obj:`True`):
-            Whether or not to tokenize Chinese characters. This should likely be deactivated for Japanese (see `this
-            issue <https://github.com/huggingface/transformers/issues/328>`__).
-        strip_accents: (:obj:`bool`, `optional`):
-            Whether or not to strip all accents. If this option is not specified, then it will be determined by the
-            value for :obj:`lowercase` (as in the original BERT).
-        wordpieces_prefix: (:obj:`str`, `optional`, defaults to :obj:`"##"`):
-            The prefix for subwords.
-    """
-    vocab_files_names = VOCAB_FILES_NAMES
-    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
-    pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION
-    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
-    slow_tokenizer_class = SbertTokenizer
-    def __init__(self,
-                 vocab_file=None,
-                 tokenizer_file=None,
-                 do_lower_case=True,
-                 unk_token='[UNK]',
-                 sep_token='[SEP]',
-                 pad_token='[PAD]',
-                 cls_token='[CLS]',
-                 mask_token='[MASK]',
-                 tokenize_chinese_chars=True,
-                 strip_accents=None,
-                 **kwargs):
-        super().__init__(
-            vocab_file,
-            tokenizer_file=tokenizer_file,
-            do_lower_case=do_lower_case,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            pad_token=pad_token,
-            cls_token=cls_token,
-            mask_token=mask_token,
-            tokenize_chinese_chars=tokenize_chinese_chars,
-            strip_accents=strip_accents,
-            **kwargs,
-        )
-        pre_tok_state = json.loads(
-            self.backend_tokenizer.normalizer.__getstate__())
-        if (pre_tok_state.get('lowercase', do_lower_case) != do_lower_case
-                or pre_tok_state.get('strip_accents',
-                                     strip_accents) != strip_accents):
-            pre_tok_class = getattr(normalizers, pre_tok_state.pop('type'))
-            pre_tok_state['lowercase'] = do_lower_case
-            pre_tok_state['strip_accents'] = strip_accents
-            self.backend_tokenizer.normalizer = pre_tok_class(**pre_tok_state)
-        self.do_lower_case = do_lower_case
-    def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
-        """
-        Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
-        adding special tokens. A SBERT sequence has the following format:
-        - single sequence: ``[CLS] X [SEP]``
-        - pair of sequences: ``[CLS] A [SEP] B [SEP]``
-        Args:
-            token_ids_0 (:obj:`List[int]`):
-                List of IDs to which the special tokens will be added.
-            token_ids_1 (:obj:`List[int]`, `optional`):
-                Optional second list of IDs for sequence pairs.
-        Returns:
-            :obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
-        """
-        output = [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
-        if token_ids_1:
-            output += token_ids_1 + [self.sep_token_id]
-        return output
-    def create_token_type_ids_from_sequences(
-            self,
-            token_ids_0: List[int],
-            token_ids_1: Optional[List[int]] = None) -> List[int]:
-        """
-        Create a mask from the two sequences passed to be used in a sequence-pair classification task. A SBERT sequence
-        pair mask has the following format:
-        ::
-            0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
-            | first sequence | second sequence |
-        If :obj:`token_ids_1` is :obj:`None`, this method only returns the first portion of the mask (0s).
-        Args:
-            token_ids_0 (:obj:`List[int]`):
-                List of IDs.
-            token_ids_1 (:obj:`List[int]`, `optional`):
-                Optional second list of IDs for sequence pairs.
-        Returns:
-            :obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given
-            sequence(s).
-        """
-        sep = [self.sep_token_id]
-        cls = [self.cls_token_id]
-        if token_ids_1 is None:
-            return len(cls + token_ids_0 + sep) * [0]
-        return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1
-                                                        + sep) * [1]
-    def save_vocabulary(self,
-                        save_directory: str,
-                        filename_prefix: Optional[str] = None) -> Tuple[str]:
-        files = self._tokenizer.model.save(
-            save_directory, name=filename_prefix)
-        return tuple(files)
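Both Sbert tokenizer modules above are deleted outright. Since `SbertTokenizer` was a straight copy of the BERT WordPiece tokenizer (the fast variant even registered itself under transformers' `BertTokenizer` slow-to-fast converter), a stock transformers tokenizer loaded from the model directory should be a drop-in replacement. A minimal sketch, assuming the model directory contains the usual vocab.txt/tokenizer.json files:

from transformers import BertTokenizerFast

# '/path/to/model_dir' is a placeholder for a downloaded ModelScope model dir.
tokenizer = BertTokenizerFast.from_pretrained('/path/to/model_dir')
inputs = tokenizer('今天天气不错', return_tensors='pt')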
@@ -5,12 +5,10 @@ import numpy as np
 from modelscope.metainfo import TaskModels
 from modelscope.models.builder import MODELS
-from modelscope.models.nlp.bert import BertConfig
 from modelscope.models.nlp.task_models.task_model import \
     SingleBackboneTaskModelBase
-from modelscope.outputs import OutputKeys
+from modelscope.outputs import FeatureExtractionOutput, OutputKeys
 from modelscope.utils.constant import Tasks
-from modelscope.utils.hub import parse_label_mapping
 __all__ = ['FeatureExtractionModel']
@@ -31,9 +29,9 @@ class FeatureExtractionModel(SingleBackboneTaskModelBase):
         self.build_backbone(self.backbone_cfg)
-    def forward(self, **input: Dict[str, Any]) -> Dict[str, np.ndarray]:
+    def forward(self, **input: Dict[str, Any]) -> FeatureExtractionOutput:
         # backbone do not need labels, only head need for loss compute
         input.pop(OutputKeys.LABELS, None)
         outputs = super().forward(input)
         sequence_output = outputs.last_hidden_state
-        return {OutputKeys.TEXT_EMBEDDING: sequence_output}
+        return FeatureExtractionOutput(text_embedding=sequence_output)
@@ -7,7 +7,7 @@ from modelscope.metainfo import TaskModels
 from modelscope.models.builder import MODELS
 from modelscope.models.nlp.task_models.task_model import \
     SingleBackboneTaskModelBase
-from modelscope.outputs import OutputKeys
+from modelscope.outputs import InformationExtractionOutput, OutputKeys
 from modelscope.utils.constant import Tasks
 __all__ = ['InformationExtractionModel']
@@ -31,9 +31,9 @@ class InformationExtractionModel(SingleBackboneTaskModelBase):
         self.build_backbone(self.backbone_cfg)
         self.build_head(self.head_cfg)
-    def forward(self, **input: Dict[str, Any]) -> Dict[str, np.ndarray]:
+    def forward(self, **input: Dict[str, Any]) -> InformationExtractionOutput:
         outputs = super().forward(input)
         sequence_output = outputs.last_hidden_state
         outputs = self.head.forward(sequence_output, input['text'],
                                     input['offsets'])
-        return {OutputKeys.SPO_LIST: outputs}
+        return InformationExtractionOutput(spo_list=outputs)
@@ -12,7 +12,7 @@ from transformers import AutoConfig, AutoModel
 from modelscope.metainfo import Models
 from modelscope.models import TorchModel
 from modelscope.models.builder import MODELS
-from modelscope.outputs import TokenClassifierWithPredictionsOutput
+from modelscope.outputs import AttentionTokenClassificationModelOutput
 from modelscope.utils.constant import ModelFile, Tasks
 __all__ = [
@@ -115,7 +115,7 @@ class SequenceLabelingForNamedEntityRecognition(TorchModel):
                 - 0 for tokens that are **masked**.
         Returns:
-            Returns `modelscope.outputs.TokenClassifierOutput`
+            Returns `modelscope.outputs.AttentionTokenClassificationModelOutput`
         Examples:
             >>> from modelscope.models import Model
@@ -138,17 +138,16 @@ class SequenceLabelingForNamedEntityRecognition(TorchModel):
     def postprocess(self, input: Dict[str, Any], **kwargs):
         predicts = self.model.decode(input)
-        offset_len = len(input['offset_mapping'])
-        predictions = torch.narrow(
-            predicts, 1, 0,
-            offset_len)  # index_select only move loc, not resize
-        return TokenClassifierWithPredictionsOutput(
+        offset_mapping = input.get('offset_mapping')
+        mask = input.get('label_mask')
+        return AttentionTokenClassificationModelOutput(
             loss=None,
             logits=None,
             hidden_states=None,
             attentions=None,
-            offset_mapping=input['offset_mapping'],
-            predictions=predictions,
+            label_mask=mask,
+            offset_mapping=offset_mapping,
+            predictions=predicts,
         )
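Note the behavioral change here: `predicts` is no longer narrowed to the length of `offset_mapping`; the full decode result is returned together with `label_mask`, and consumers are expected to drop padded positions themselves. A hedged sketch of that trimming (the actual pipeline postprocess may differ):

# Assumes output.predictions is (batch, seq_len) and output.label_mask is a
# 0/1 mask aligned with it; keep only the positions the mask marks as valid.
valid_predictions = [
    pred[mask.bool()]
    for pred, mask in zip(output.predictions, output.label_mask)
]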
@@ -1,18 +1,16 @@
 # Copyright (c) Alibaba, Inc. and its affiliates.
 from typing import Any, Dict
-import numpy as np
 import torch
 from modelscope.metainfo import Models, TaskModels
 from modelscope.models.builder import MODELS
 from modelscope.models.nlp.task_models.task_model import \
     SingleBackboneTaskModelBase
-from modelscope.outputs import OutputKeys, TokenClassifierOutput
+from modelscope.outputs import (AttentionTokenClassificationModelOutput,
+                                OutputKeys)
 from modelscope.utils.constant import Tasks
 from modelscope.utils.hub import parse_label_mapping
-from modelscope.utils.tensor_utils import (torch_nested_detach,
-                                           torch_nested_numpify)
 __all__ = ['TokenClassificationModel']
@@ -48,7 +46,10 @@ class TokenClassificationModel(SingleBackboneTaskModelBase):
         self.build_backbone(self.backbone_cfg)
         self.build_head(self.head_cfg)
-    def forward(self, **input: Dict[str, Any]) -> Dict[str, np.ndarray]:
+    def forward(
+            self,
+            **input: Dict[str,
+                          Any]) -> AttentionTokenClassificationModelOutput:
         labels = None
         if OutputKeys.LABEL in input:
             labels = input.pop(OutputKeys.LABEL)
@@ -62,16 +63,23 @@ class TokenClassificationModel(SingleBackboneTaskModelBase):
         if labels in input:
             loss = self.compute_loss(outputs, labels)
-        # apply label mask to logits
-        logits = logits[input['label_mask']].unsqueeze(0)
+        if 'label_mask' in input:
+            mask = input['label_mask']
+            masked_lengths = mask.sum(-1).long()
+            masked_logits = torch.zeros_like(logits)
+            for i in range(len(mask)):
+                masked_logits[
+                    i, :masked_lengths[i], :] = logits[i].masked_select(
+                        mask[i].unsqueeze(-1)).view(masked_lengths[i], -1)
+            logits = masked_logits
-        return TokenClassifierOutput(
+        return AttentionTokenClassificationModelOutput(
             loss=loss,
             logits=logits,
             hidden_states=outputs.hidden_states,
             attentions=outputs.attentions,
-            offset_mapping=input['offset_mapping'],
-        )
+            offset_mapping=input.get('offset_mapping'),
+            label_mask=input.get('label_mask'))
     def extract_logits(self, outputs):
         return outputs[OutputKeys.LOGITS].cpu().detach()
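This is the second copy of the packing loop added to `SbertForTokenClassification` earlier in this diff. If a third model ever needs it, a shared helper would keep the two from drifting apart; a sketch of what that could look like (not part of this change):

import torch

def pack_masked_logits(logits: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Left-align logits at non-masked positions, zero-padding the tail."""
    masked_lengths = mask.sum(-1).long()
    packed = torch.zeros_like(logits)
    for i in range(len(mask)):
        packed[i, :masked_lengths[i], :] = logits[i].masked_select(
            mask[i].unsqueeze(-1)).view(masked_lengths[i], -1)
    return packed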
@@ -23,8 +23,6 @@ if TYPE_CHECKING:
     from .text_classification import VecoForSequenceClassification
     from .token_classification import VecoForTokenClassification
     from .fill_mask import VecoForMaskedLM
-    from .tokenization import VecoTokenizer
-    from .tokenization_fast import VecoTokenizerFast
 else:
     _import_structure = {
         'configuration': ['VecoConfig'],
@@ -32,8 +30,6 @@ else:
         'text_classification': ['VecoForSequenceClassification'],
         'fill_mask': ['VecoForMaskedLM'],
         'token_classification': ['VecoForTokenClassification'],
-        'tokenization': ['VecoTokenizer'],
-        'tokenization_fast': ['VecoTokenizerFast'],
     }
     import sys
@@ -40,7 +40,7 @@ class VecoForMaskedLM(TorchModel, RobertaForMaskedLM):
     Preprocessor:
         This is the fill_mask model of StructBERT, the preprocessor of this model
-        is `modelscope.preprocessors.NLPPreprocessor`.
+        is `modelscope.preprocessors.FillMaskTransformersPreprocessor`.
     Parameters:
         config ([`VecoConfig`]): Model configuration class with all the parameters of the
@@ -22,7 +22,7 @@ from modelscope.models import Model, TorchModel
 from modelscope.models.builder import MODELS
 from modelscope.outputs import AttentionTextClassificationModelOutput
 from modelscope.utils.constant import Tasks
-from modelscope.utils.hub import parse_label_mapping
+from modelscope.utils.nlp.utils import parse_labels_in_order
 from .configuration import VecoConfig
@@ -46,7 +46,7 @@ class VecoForSequenceClassification(TorchModel,
     Preprocessor:
         This is the text classification model of Veco, the preprocessor of this model
-        is `modelscope.preprocessors.SequenceClassificationPreprocessor`.
+        is `modelscope.preprocessors.TextClassificationTransformersPreprocessor`.
     Trainer:
         This model should be trained by dataset which has mixed languages,
@@ -124,27 +124,13 @@ class VecoForSequenceClassification(TorchModel,
         """
         model_dir = kwargs.pop('model_dir', None)
+        cfg = kwargs.pop('cfg', None)
+        model_args = parse_labels_in_order(model_dir, cfg, **kwargs)
         if model_dir is None:
-            config = VecoConfig(**kwargs)
+            config = VecoConfig(**model_args)
             model = cls(config)
         else:
-            model_kwargs = {}
-            label2id = kwargs.get('label2id', parse_label_mapping(model_dir))
-            id2label = kwargs.get(
-                'id2label', None if label2id is None else
-                {id: label
-                 for label, id in label2id.items()})
-            if id2label is not None and label2id is None:
-                label2id = {label: id for id, label in id2label.items()}
-            num_labels = kwargs.get(
-                'num_labels', None if label2id is None else len(label2id))
-            if num_labels is not None:
-                model_kwargs['num_labels'] = num_labels
-            if label2id is not None:
-                model_kwargs['label2id'] = label2id
-            if id2label is not None:
-                model_kwargs['id2label'] = id2label
             model = super(Model, cls).from_pretrained(
-                pretrained_model_name_or_path=model_dir, **model_kwargs)
+                pretrained_model_name_or_path=model_dir, **model_args)
         return model
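After this refactor the Veco and Sbert classification heads resolve labels through the same `parse_labels_in_order` path, so explicit overrides continue to win over whatever is found in the model directory. For illustration only (the model path is a placeholder, and the exact kwargs plumbing through from_pretrained may differ):

model = VecoForSequenceClassification.from_pretrained(
    model_dir='/path/to/veco_model',
    label2id={'negative': 0, 'positive': 1})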
@@ -15,6 +15,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
+import torch
 from transformers import RobertaForTokenClassification
 from modelscope.metainfo import Models
@@ -22,7 +23,7 @@ from modelscope.models import Model, TorchModel
 from modelscope.models.builder import MODELS
 from modelscope.outputs import AttentionTokenClassificationModelOutput
 from modelscope.utils.constant import Tasks
-from modelscope.utils.hub import parse_label_mapping
+from modelscope.utils.nlp.utils import parse_labels_in_order
 from .configuration import VecoConfig
@@ -58,6 +59,7 @@ class VecoForTokenClassification(TorchModel, RobertaForTokenClassification):
     def forward(self, *args, **kwargs):
         kwargs['return_dict'] = True
         outputs = super(Model, self).forward(*args, **kwargs)
         return AttentionTokenClassificationModelOutput(
             loss=outputs.loss,
             logits=outputs.logits,
@@ -81,27 +83,13 @@ class VecoForTokenClassification(TorchModel, RobertaForTokenClassification):
         """
         model_dir = kwargs.pop('model_dir', None)
+        cfg = kwargs.pop('cfg', None)
+        model_args = parse_labels_in_order(model_dir, cfg, **kwargs)
         if model_dir is None:
-            config = VecoConfig(**kwargs)
+            config = VecoConfig(**model_args)
             model = cls(config)
         else:
-            model_kwargs = {}
-            label2id = kwargs.get('label2id', parse_label_mapping(model_dir))
-            id2label = kwargs.get(
-                'id2label', None if label2id is None else
-                {id: label
-                 for label, id in label2id.items()})
-            if id2label is not None and label2id is None:
-                label2id = {label: id for id, label in id2label.items()}
-            num_labels = kwargs.get(
-                'num_labels', None if label2id is None else len(label2id))
-            if num_labels is not None:
-                model_kwargs['num_labels'] = num_labels
-            if label2id is not None:
-                model_kwargs['label2id'] = label2id
-            if id2label is not None:
-                model_kwargs['id2label'] = id2label
             model = super(Model, cls).from_pretrained(
-                pretrained_model_name_or_path=model_dir, **model_kwargs)
+                pretrained_model_name_or_path=model_dir, **model_args)
         return model
@@ -1,321 +0,0 @@
-# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.
-# Copyright 2021-2022 The Alibaba DAMO NLP Team Authors.
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License
-"""Tokenization classes for Veco. mainly copied from :module:`~transformers.tokenization_xlm_roberta`"""
-import os
-from shutil import copyfile
-from typing import Any, Dict, List, Optional, Tuple
-
-import sentencepiece as spm
-from transformers.tokenization_utils import AddedToken, PreTrainedTokenizer
-
-from modelscope.utils import logger as logging
-
-logger = logging.get_logger(__name__)
-
-SPIECE_UNDERLINE = '▁'
-
-VOCAB_FILES_NAMES = {'vocab_file': 'sentencepiece.bpe.model'}
-
-PRETRAINED_VOCAB_FILES_MAP = {'vocab_file': {}}
-
-PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {}
-
-
-class VecoTokenizer(PreTrainedTokenizer):
-    """
-    Adapted from [`RobertaTokenizer`] and [`XLNetTokenizer`]. Based on
-    [SentencePiece](https://github.com/google/sentencepiece).
-
-    This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods.
-    Users should refer to this superclass for more information regarding those methods.
-
-    Args:
-        vocab_file (`str`):
-            Path to the vocabulary file.
-        bos_token (`str`, *optional*, defaults to `"<s>"`):
-            The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
-
-            <Tip>
-
-            When building a sequence using special tokens, this is not the token that is used for the beginning of
-            sequence. The token used is the `cls_token`.
-
-            </Tip>
-
-        eos_token (`str`, *optional*, defaults to `"</s>"`):
-            The end of sequence token.
-
-            <Tip>
-
-            When building a sequence using special tokens, this is not the token that is used for the end of
-            sequence. The token used is the `sep_token`.
-
-            </Tip>
-
-        sep_token (`str`, *optional*, defaults to `"</s>"`):
-            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
-            sequence classification or for a text and a question for question answering. It is also used as the last
-            token of a sequence built with special tokens.
-        cls_token (`str`, *optional*, defaults to `"<s>"`):
-            The classifier token which is used when doing sequence classification (classification of the whole sequence
-            instead of per-token classification). It is the first token of the sequence when built with special tokens.
-        unk_token (`str`, *optional*, defaults to `"<unk>"`):
-            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
-            token instead.
-        pad_token (`str`, *optional*, defaults to `"<pad>"`):
-            The token used for padding, for example when batching sequences of different lengths.
-        mask_token (`str`, *optional*, defaults to `"<mask>"`):
-            The token used for masking values. This is the token used when training this model with masked language
-            modeling. This is the token which the model will try to predict.
-        additional_special_tokens (`List[str]`, *optional*, defaults to `["<s>NOTUSED", "</s>NOTUSED"]`):
-            Additional special tokens used by the tokenizer.
-        sp_model_kwargs (`dict`, *optional*):
-            Will be passed to the `SentencePieceProcessor.__init__()` method.
-            The [Python wrapper for SentencePiece](https://github.com/google/sentencepiece/tree/master/python)
-            can be used, among other things, to set:
-
-            - `enable_sampling`: Enable subword regularization.
-            - `nbest_size`: Sampling parameters for unigram. Invalid for BPE-Dropout.
-
-              - `nbest_size = {0,1}`: No sampling is performed.
-              - `nbest_size > 1`: samples from the nbest_size results.
-              - `nbest_size < 0`: assuming that nbest_size is infinite and samples from the all hypothesis (lattice)
-                using forward-filtering-and-backward-sampling algorithm.
-
-            - `alpha`: Smoothing parameter for unigram sampling, and dropout probability of merge operations for
-              BPE-dropout.
-
-    Attributes:
-        sp_model (`SentencePieceProcessor`):
-            The *SentencePiece* processor that is used for every conversion (string, tokens and IDs).
-    """
-
-    vocab_files_names = VOCAB_FILES_NAMES
-    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
-    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
-    model_input_names = ['input_ids', 'attention_mask']
-
-    def __init__(self,
-                 vocab_file,
-                 bos_token='<s>',
-                 eos_token='</s>',
-                 sep_token='</s>',
-                 cls_token='<s>',
-                 unk_token='<unk>',
-                 pad_token='<pad>',
-                 mask_token='<mask>',
-                 sp_model_kwargs: Optional[Dict[str, Any]] = None,
-                 **kwargs) -> None:
-        # Mask token behave like a normal word, i.e. include the space before it
-        mask_token = AddedToken(
-            mask_token, lstrip=True, rstrip=False) if isinstance(
-                mask_token, str) else mask_token
-
-        self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs
-
-        super().__init__(
-            bos_token=bos_token,
-            eos_token=eos_token,
-            unk_token=unk_token,
-            sep_token=sep_token,
-            cls_token=cls_token,
-            pad_token=pad_token,
-            mask_token=mask_token,
-            sp_model_kwargs=self.sp_model_kwargs,
-            **kwargs,
-        )
-
-        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
-        self.sp_model.Load(str(vocab_file))
-        self.vocab_file = vocab_file
-
-        # Original fairseq vocab and spm vocab must be "aligned":
-        # Vocab    |    0    |    1    |   2    |    3    |  4  |  5  |  6  |   7   |   8   |  9
-        # -------- | ------- | ------- | ------ | ------- | --- | --- | --- | ----- | ----- | ----
-        # fairseq  | '<s>'   | '<pad>' | '</s>' | '<unk>' | ',' | '.' | '▁' | 's'   | '▁de' | '-'
-        # spm      | '<unk>' | '<s>'   | '</s>' | ','     | '.' | '▁' | 's' | '▁de' | '-'   | '▁a'
-
-        # Mimic fairseq token-to-id alignment for the first 4 token
-        self.fairseq_tokens_to_ids = {
-            '<s>': 0,
-            '<pad>': 1,
-            '</s>': 2,
-            '<unk>': 3
-        }
-
-        # The first "real" token "," has position 4 in the original fairseq vocab and position 3 in the spm vocab
-        self.fairseq_offset = 1
-
-        self.fairseq_tokens_to_ids['<mask>'] = len(
-            self.sp_model) + self.fairseq_offset
-        self.fairseq_ids_to_tokens = {
-            v: k
-            for k, v in self.fairseq_tokens_to_ids.items()
-        }
-
-    def __getstate__(self):
-        state = self.__dict__.copy()
-        state['sp_model'] = None
-        state['sp_model_proto'] = self.sp_model.serialized_model_proto()
-        return state
-
-    def __setstate__(self, d):
-        self.__dict__ = d
-
-        # for backward compatibility
-        if not hasattr(self, 'sp_model_kwargs'):
-            self.sp_model_kwargs = {}
-
-        self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
-        self.sp_model.LoadFromSerializedProto(self.sp_model_proto)
-
-    def build_inputs_with_special_tokens(
-            self,
-            token_ids_0: List[int],
-            token_ids_1: Optional[List[int]] = None) -> List[int]:
-        """
-        Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
-        adding special tokens. An Veco sequence has the following format:
-
-        - single sequence: `<s> X </s>`
-        - pair of sequences: `<s> A </s></s> B </s>`
-
-        Args:
-            token_ids_0 (`List[int]`):
-                List of IDs to which the special tokens will be added.
-            token_ids_1 (`List[int]`, *optional*):
-                Optional second list of IDs for sequence pairs.
-
-        Returns:
-            `List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
-        """
-        if token_ids_1 is None:
-            return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
-        cls = [self.cls_token_id]
-        sep = [self.sep_token_id]
-        return cls + token_ids_0 + sep + sep + token_ids_1 + sep
-
-    def get_special_tokens_mask(
-            self,
-            token_ids_0: List[int],
-            token_ids_1: Optional[List[int]] = None,
-            already_has_special_tokens: bool = False) -> List[int]:
-        """
-        Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
-        special tokens using the tokenizer `prepare_for_model` method.
-
-        Args:
-            token_ids_0 (`List[int]`):
-                List of IDs.
-            token_ids_1 (`List[int]`, *optional*):
-                Optional second list of IDs for sequence pairs.
-            already_has_special_tokens (`bool`, *optional*, defaults to `False`):
-                Whether or not the token list is already formatted with special tokens for the model.
-
-        Returns:
-            `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
-        """
-        if already_has_special_tokens:
-            return super().get_special_tokens_mask(
-                token_ids_0=token_ids_0,
-                token_ids_1=token_ids_1,
-                already_has_special_tokens=True)
-
-        if token_ids_1 is None:
-            return [1] + ([0] * len(token_ids_0)) + [1]
-        return [1] + ([0] * len(token_ids_0)) + [1, 1] + (
-            [0] * len(token_ids_1)) + [1]
-
-    def create_token_type_ids_from_sequences(
-            self,
-            token_ids_0: List[int],
-            token_ids_1: Optional[List[int]] = None) -> List[int]:
-        """
-        Create a mask from the two sequences passed to be used in a sequence-pair classification task. Veco does
-        not make use of token type ids, therefore a list of zeros is returned.
-
-        Args:
-            token_ids_0 (`List[int]`):
-                List of IDs.
-            token_ids_1 (`List[int]`, *optional*):
-                Optional second list of IDs for sequence pairs.
-
-        Returns:
-            `List[int]`: List of zeros.
-        """
-        sep = [self.sep_token_id]
-        cls = [self.cls_token_id]
-
-        if token_ids_1 is None:
-            return len(cls + token_ids_0 + sep) * [0]
-        return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0]
-
-    @property
-    def vocab_size(self):
-        return len(
-            self.sp_model) + self.fairseq_offset + 1  # Add the <mask> token
-
-    def get_vocab(self):
-        vocab = {
-            self.convert_ids_to_tokens(i): i
-            for i in range(self.vocab_size)
-        }
-        vocab.update(self.added_tokens_encoder)
-        return vocab
-
-    def _tokenize(self, text: str) -> List[str]:
-        return self.sp_model.encode(text, out_type=str)
-
-    def _convert_token_to_id(self, token):
-        """Converts a token (str) in an id using the vocab."""
-        if token in self.fairseq_tokens_to_ids:
-            return self.fairseq_tokens_to_ids[token]
-        spm_id = self.sp_model.PieceToId(token)
-
-        # Need to return unknown token if the SP model returned 0
-        return spm_id + self.fairseq_offset if spm_id else self.unk_token_id
-
-    def _convert_id_to_token(self, index):
-        """Converts an index (integer) in a token (str) using the vocab."""
-        if index in self.fairseq_ids_to_tokens:
-            return self.fairseq_ids_to_tokens[index]
-        return self.sp_model.IdToPiece(index - self.fairseq_offset)
-
-    def convert_tokens_to_string(self, tokens):
-        """Converts a sequence of tokens (strings for sub-words) in a single string."""
-        out_string = ''.join(tokens).replace(SPIECE_UNDERLINE, ' ').strip()
-        return out_string
-
-    def save_vocabulary(self,
-                        save_directory: str,
-                        filename_prefix: Optional[str] = None) -> Tuple[str]:
-        if not os.path.isdir(save_directory):
-            logger.error(
-                f'Vocabulary path ({save_directory}) should be a directory')
-            return
-        out_vocab_file = os.path.join(
-            save_directory, (filename_prefix + '-' if filename_prefix else '')
-            + VOCAB_FILES_NAMES['vocab_file'])
-
-        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):
-            copyfile(self.vocab_file, out_vocab_file)
-
-        return (out_vocab_file, )
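The main subtlety in the deleted class is the fairseq/spm id alignment documented in its `__init__`. A self-contained sketch of the same conversion rule, for reference:

    # Replicates the removed _convert_token_to_id logic: reserved tokens keep
    # fixed fairseq ids, ordinary pieces shift by fairseq_offset, and spm id 0
    # (spm's '<unk>') falls back to the unknown token id.
    fairseq_tokens_to_ids = {'<s>': 0, '<pad>': 1, '</s>': 2, '<unk>': 3}
    fairseq_offset = 1

    def convert_token_to_id(token: str, piece_to_id) -> int:
        if token in fairseq_tokens_to_ids:
            return fairseq_tokens_to_ids[token]
        spm_id = piece_to_id(token)
        return spm_id + fairseq_offset if spm_id else fairseq_tokens_to_ids['<unk>']

    # ',' has spm id 3 in the alignment table above, so it maps to fairseq id 4.
    assert convert_token_to_id(',', lambda t: 3) == 4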
@@ -1,213 +0,0 @@
-# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.
-# Copyright 2021-2022 The Alibaba DAMO NLP Team Authors.
-# All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License
-"""Fast Tokenization classes for Veco. mainly copied from :module:`~transformers.tokenization_xlm_roberta_fast`"""
-import os
-from shutil import copyfile
-from typing import List, Optional, Tuple
-
-import transformers
-from transformers.file_utils import is_sentencepiece_available
-from transformers.tokenization_utils import AddedToken
-from transformers.tokenization_utils_fast import PreTrainedTokenizerFast
-
-from modelscope.utils import logger as logging
-
-if is_sentencepiece_available():
-    from .tokenization import VecoTokenizer
-else:
-    VecoTokenizer = None
-
-logger = logging.get_logger(__name__)
-
-VOCAB_FILES_NAMES = {
-    'vocab_file': 'sentencepiece.bpe.model',
-    'tokenizer_file': 'tokenizer.json'
-}
-
-PRETRAINED_VOCAB_FILES_MAP = {
-    'vocab_file': {},
-    'tokenizer_file': {},
-}
-
-PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {}
-
-transformers.SLOW_TO_FAST_CONVERTERS[
-    'VecoTokenizer'] = transformers.SLOW_TO_FAST_CONVERTERS[
-        'XLMRobertaTokenizer']
-
-
-class VecoTokenizerFast(PreTrainedTokenizerFast):
-    """
-    Adapted from [`RobertaTokenizer`] and [`XLNetTokenizer`].
-    Based on [BPE](https://huggingface.co/docs/tokenizers/python/latest/components.html?highlight=BPE#models).
-
-    This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main
-    methods. Users should refer to this superclass for more information regarding those methods.
-
-    Args:
-        vocab_file (`str`):
-            Path to the vocabulary file.
-        bos_token (`str`, *optional*, defaults to `"<s>"`):
-            The beginning of sequence token that was used during pretraining. Can be used a sequence classifier token.
-
-            <Tip>
-
-            When building a sequence using special tokens, this is not the token that is used for the beginning of
-            sequence. The token used is the `cls_token`.
-
-            </Tip>
-
-        eos_token (`str`, *optional*, defaults to `"</s>"`):
-            The end of sequence token.
-
-            <Tip>
-
-            When building a sequence using special tokens, this is not the token that is used for the end of
-            sequence. The token used is the `sep_token`.
-
-            </Tip>
-
-        sep_token (`str`, *optional*, defaults to `"</s>"`):
-            The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
-            sequence classification or for a text and a question for question answering. It is also used as the last
-            token of a sequence built with special tokens.
-        cls_token (`str`, *optional*, defaults to `"<s>"`):
-            The classifier token which is used when doing sequence classification (classification of the whole sequence
-            instead of per-token classification). It is the first token of the sequence when built with special tokens.
-        unk_token (`str`, *optional*, defaults to `"<unk>"`):
-            The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
-            token instead.
-        pad_token (`str`, *optional*, defaults to `"<pad>"`):
-            The token used for padding, for example when batching sequences of different lengths.
-        mask_token (`str`, *optional*, defaults to `"<mask>"`):
-            The token used for masking values. This is the token used when training this model with masked language
-            modeling. This is the token which the model will try to predict.
-        additional_special_tokens (`List[str]`, *optional*, defaults to `["<s>NOTUSED", "</s>NOTUSED"]`):
-            Additional special tokens used by the tokenizer.
-    """
-
-    vocab_files_names = VOCAB_FILES_NAMES
-    pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
-    max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
-    model_input_names = ['input_ids', 'attention_mask']
-    slow_tokenizer_class = VecoTokenizer
-
-    def __init__(self,
-                 vocab_file=None,
-                 tokenizer_file=None,
-                 bos_token='<s>',
-                 eos_token='</s>',
-                 sep_token='</s>',
-                 cls_token='<s>',
-                 unk_token='<unk>',
-                 pad_token='<pad>',
-                 mask_token='<mask>',
-                 **kwargs):
-        # Mask token behave like a normal word, i.e. include the space before it
-        mask_token = AddedToken(
-            mask_token, lstrip=True, rstrip=False) if isinstance(
-                mask_token, str) else mask_token
-
-        super().__init__(
-            vocab_file,
-            tokenizer_file=tokenizer_file,
-            bos_token=bos_token,
-            eos_token=eos_token,
-            sep_token=sep_token,
-            cls_token=cls_token,
-            unk_token=unk_token,
-            pad_token=pad_token,
-            mask_token=mask_token,
-            **kwargs,
-        )
-
-        self.vocab_file = vocab_file
-        self.can_save_slow_tokenizer = False if not self.vocab_file else True
-
-    def build_inputs_with_special_tokens(
-            self,
-            token_ids_0: List[int],
-            token_ids_1: Optional[List[int]] = None) -> List[int]:
-        """
-        Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
-        adding special tokens. An Veco sequence has the following format:
-
-        - single sequence: `<s> X </s>`
-        - pair of sequences: `<s> A </s></s> B </s>`
-
-        Args:
-            token_ids_0 (`List[int]`):
-                List of IDs to which the special tokens will be added.
-            token_ids_1 (`List[int]`, *optional*):
-                Optional second list of IDs for sequence pairs.
-
-        Returns:
-            `List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
-        """
-        if token_ids_1 is None:
-            return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
-        cls = [self.cls_token_id]
-        sep = [self.sep_token_id]
-        return cls + token_ids_0 + sep + sep + token_ids_1 + sep
-
-    def create_token_type_ids_from_sequences(
-            self,
-            token_ids_0: List[int],
-            token_ids_1: Optional[List[int]] = None) -> List[int]:
-        """
-        Create a mask from the two sequences passed to be used in a sequence-pair classification task. Veco does
-        not make use of token type ids, therefore a list of zeros is returned.
-
-        Args:
-            token_ids_0 (`List[int]`):
-                List of IDs.
-            token_ids_1 (`List[int]`, *optional*):
-                Optional second list of IDs for sequence pairs.
-
-        Returns:
-            `List[int]`: List of zeros.
-        """
-        sep = [self.sep_token_id]
-        cls = [self.cls_token_id]
-
-        if token_ids_1 is None:
-            return len(cls + token_ids_0 + sep) * [0]
-        return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0]
-
-    def save_vocabulary(self,
-                        save_directory: str,
-                        filename_prefix: Optional[str] = None) -> Tuple[str]:
-        if not self.can_save_slow_tokenizer:
-            raise ValueError(
-                'Your fast tokenizer does not have the necessary information to save the vocabulary for a slow '
-                'tokenizer.')
-
-        if not os.path.isdir(save_directory):
-            logger.error(
-                f'Vocabulary path ({save_directory}) should be a directory.')
-            return
-        out_vocab_file = os.path.join(
-            save_directory, (filename_prefix + '-' if filename_prefix else '')
-            + VOCAB_FILES_NAMES['vocab_file'])
-
-        if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):
-            copyfile(self.vocab_file, out_vocab_file)
-
-        return (out_vocab_file, )
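Both deleted files registered XLM-Roberta's slow-to-fast converter for `VecoTokenizer`, so the removed classes were behaviourally XLM-Roberta tokenizers. A hedged sketch of the drop-in replacement, assuming the model directory still ships `sentencepiece.bpe.model`/`tokenizer.json` as before:

    from transformers import XLMRobertaTokenizerFast

    # '/path/to/veco_model' is a placeholder for a local model directory.
    tokenizer = XLMRobertaTokenizerFast.from_pretrained('/path/to/veco_model')
    print(tokenizer('Hello world')['input_ids'])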
| @@ -1,179 +1,13 @@ | |||||
| from dataclasses import dataclass | from dataclasses import dataclass | ||||
| from typing import List, Optional, Tuple, Union | |||||
| from typing import Optional, Tuple, Union | |||||
| import numpy as np | |||||
| from modelscope.outputs.outputs import ModelOutputBase | from modelscope.outputs.outputs import ModelOutputBase | ||||
| Tensor = Union['torch.Tensor', 'tf.Tensor'] | Tensor = Union['torch.Tensor', 'tf.Tensor'] | ||||
| @dataclass | |||||
| class TextClassificationModelOutput(ModelOutputBase): | |||||
| """The output class for text classification models. | |||||
| Args: | |||||
| logits (`Tensor`): The logits output of the model. loss (`Tensor`, | |||||
| *optional*) The loss of the model, available when training. | |||||
| hidden_states (`Tensor`, *optional*) Hidden-states of the model at the | |||||
| output of each layer plus the optional initial embedding outputs. | |||||
| """ | |||||
| logits: Tensor = None | |||||
| loss: Tensor = None | |||||
| @dataclass | |||||
| class TokenClassificationModelOutput(ModelOutputBase): | |||||
| """The output class for token classification models. | |||||
| logits (`Tensor`): The logits output of the model. | |||||
| loss (`Tensor`, *optional*) The loss of the model, available when training. | |||||
| """ | |||||
| logits: Tensor = None | |||||
| loss: Tensor = None | |||||
| offset_mapping: Tensor = None | |||||
| @dataclass | |||||
| class FillMaskModelOutput(ModelOutputBase): | |||||
| """The output class for text classification models. | |||||
| Args: | |||||
| logits (`Tensor`): The logits output of the model. | |||||
| loss (`Tensor`, *optional*) The loss of the model, available when training. | |||||
| input_ids (`Tensor`, *optional*) The input id tensor fed into the model. | |||||
| hidden_states (`Tensor`, *optional*) Hidden-states of the model at the | |||||
| output of each layer plus the optional initial embedding outputs. | |||||
| """ | |||||
| logits: Tensor = None | |||||
| loss: Tensor = None | |||||
| input_ids: Tensor = None | |||||
| hidden_states: Tensor = None | |||||
| @dataclass | |||||
| class TokenClassifierOutput(ModelOutputBase): | |||||
| """ | |||||
| Base class for outputs of token classification models. | |||||
| Args: | |||||
| loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when | |||||
| `labels` is provided) : | |||||
| Classification loss. | |||||
| logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, | |||||
| config.num_labels)`): | |||||
| Classification scores (before SoftMax). | |||||
| hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when | |||||
| `output_hidden_states=True` is passed or when | |||||
| `config.output_hidden_states=True`): | |||||
| Tuple of `torch.FloatTensor` (one for the output of the embeddings, | |||||
| if the model has an embedding layer, + one for the output of each | |||||
| layer) of shape `(batch_size, sequence_length, hidden_size)`. | |||||
| Hidden-states of the model at the output of each layer plus the | |||||
| optional initial embedding outputs. | |||||
| attentions (`tuple(torch.FloatTensor)`, *optional*, returned when | |||||
| `output_attentions=True` is passed or when | |||||
| `config.output_attentions=True`): | |||||
| Tuple of `torch.FloatTensor` (one for each layer) of shape | |||||
| `(batch_size, num_heads, sequence_length, sequence_length)`. | |||||
| Attentions weights after the attention softmax, used to compute the | |||||
| weighted average in the self-attention heads. | |||||
| offset_mapping (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, | |||||
| sequence_length)`, `optional`): | |||||
| Indices of positions of each input sequence tokens in the sentence. | |||||
| Selected in the range ``[0, sequence_length - 1]``. | |||||
| """ | |||||
| loss: Tensor = None | |||||
| logits: Tensor = None | |||||
| hidden_states: Tensor = None | |||||
| attentions: Tensor = None | |||||
| offset_mapping: Tensor = None | |||||
| @dataclass | |||||
| class TokenClassifierWithPredictionsOutput(ModelOutputBase): | |||||
| """ | |||||
| Base class for outputs of token classification models. | |||||
| Args: | |||||
| loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when | |||||
| `labels` is provided) : | |||||
| Classification loss. | |||||
| logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, | |||||
| config.num_labels)`): | |||||
| Classification scores (before SoftMax). | |||||
| hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when | |||||
| `output_hidden_states=True` is passed or when | |||||
| `config.output_hidden_states=True`): | |||||
| Tuple of `torch.FloatTensor` (one for the output of the embeddings, | |||||
| if the model has an embedding layer, + one for the output of each | |||||
| layer) of shape `(batch_size, sequence_length, hidden_size)`. | |||||
| Hidden-states of the model at the output of each layer plus the | |||||
| optional initial embedding outputs. | |||||
| attentions (`tuple(torch.FloatTensor)`, *optional*, returned when | |||||
| `output_attentions=True` is passed or when | |||||
| `config.output_attentions=True`): | |||||
| Tuple of `torch.FloatTensor` (one for each layer) of shape | |||||
| `(batch_size, num_heads, sequence_length, sequence_length)`. | |||||
| Attentions weights after the attention softmax, used to compute the | |||||
| weighted average in the self-attention heads. | |||||
| offset_mapping (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, | |||||
| sequence_length)`, `optional`): | |||||
| Indices of positions of each input sequence tokens in the sentence. | |||||
| Selected in the range ``[0, sequence_length - 1]``. | |||||
| predictions: A PyTorch tensor of the best tag sequence for each batch of shape | |||||
| (nbest, batch_size, seq_length) | |||||
| """ | |||||
| loss: Tensor = None | |||||
| logits: Tensor = None | |||||
| hidden_states: Tensor = None | |||||
| attentions: Tensor = None | |||||
| offset_mapping: Tensor = None | |||||
| predictions: Tensor = None | |||||
| @dataclass | |||||
| class BaseModelOutput(ModelOutputBase): | |||||
| """ | |||||
| Base class for model's outputs, with potential hidden states and attentions. | |||||
| Args: | |||||
| last_hidden_state (`torch.FloatTensor` of shape `(batch_size, | |||||
| sequence_length, hidden_size)`): | |||||
| Sequence of hidden-states at the output of the last layer of the | |||||
| model. | |||||
| hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when | |||||
| `output_hidden_states=True` is passed or when | |||||
| `config.output_hidden_states=True`): | |||||
| Tuple of `torch.FloatTensor` (one for the output of the embeddings, | |||||
| if the model has an embedding layer, + one for the output of each | |||||
| layer) of shape `(batch_size, sequence_length, hidden_size)`. | |||||
| Hidden-states of the model at the output of each layer plus the | |||||
| optional initial embedding outputs. | |||||
| attentions (`tuple(torch.FloatTensor)`, *optional*, returned when | |||||
| `output_attentions=True` is passed or when | |||||
| `config.output_attentions=True`): | |||||
| Tuple of `torch.FloatTensor` (one for each layer) of shape | |||||
| `(batch_size, num_heads, sequence_length, sequence_length)`. | |||||
| Attentions weights after the attention softmax, used to compute the | |||||
| weighted average in the self-attention heads. | |||||
| """ | |||||
| last_hidden_state: Tensor = None | |||||
| hidden_states: Optional[Tuple[Tensor]] = None | |||||
| attentions: Optional[Tuple[Tensor]] = None | |||||
| @dataclass | @dataclass | ||||
| class BackboneModelOutput(ModelOutputBase): | class BackboneModelOutput(ModelOutputBase): | ||||
| """The output class for text classification models. | """The output class for text classification models. | ||||
@@ -196,81 +30,6 @@ class AttentionBackboneModelOutput(BackboneModelOutput):
     """The output class for backbones of attention based models.
 
     Args:
-        attentions (`tuple(Tensor)`, *optional* Attentions weights after the
-            attention softmax, used to compute the weighted average in the
-            self-attention heads.
-    """
-
-    attentions: Tensor = None
-    past_key_values: Tensor = None
-    cross_attentions: Tensor = None
-
-
-@dataclass
-class AttentionTextClassificationModelOutput(TextClassificationModelOutput):
-    """The output class for backbones of attention based models.
-
-    Args:
-        attentions (`tuple(Tensor)`, *optional* Attentions weights after the
-            attention softmax, used to compute the weighted average in the
-            self-attention heads.
-    """
-
-    attentions: Tensor = None
-    hidden_states: Tensor = None
-
-
-@dataclass
-class AttentionTokenClassificationModelOutput(TokenClassificationModelOutput):
-    """The output class for backbones of attention based models.
-
-    Args:
-        attentions (`tuple(Tensor)`, *optional* Attentions weights after the attention softmax,
-            used to compute the weighted average in the self-attention heads.
-    """
-
-    attentions: Tensor = None
-    hidden_states: Tensor = None
-
-
-@dataclass
-class AttentionFillMaskModelOutput(FillMaskModelOutput):
-    """The output class for the fill mask and attention based models.
-
-    Args:
-        attentions (`tuple(Tensor)`, *optional* Attentions weights after the
-            attention softmax, used to compute the weighted average in the
-            self-attention heads.
-    """
-
-    attentions: Tensor = None
-
-
-@dataclass
-class BaseModelOutputWithPoolingAndCrossAttentions(ModelOutputBase):
-    """
-    Base class for model's outputs that also contains a pooling of the last
-    hidden states.
-
-    Args:
-        last_hidden_state (`torch.FloatTensor` of shape `(batch_size,
-            sequence_length, hidden_size)`):
-            Sequence of hidden-states at the output of the last layer of the
-            model.
-        pooler_output (`torch.FloatTensor` of shape `(batch_size,
-            hidden_size)`):
-            Last layer hidden-state of the first token of the sequence
-            (classification token) after further processing through the layers
-            used for the auxiliary pretraining task. E.g. for BERT-family of
-            models, this returns the classification token after processing
-            through a linear layer and a tanh activation function. The linear
-            layer weights are trained from the next sentence prediction
-            (classification) objective during pretraining.
-        hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when
-            `output_hidden_states=True` is passed or when
-            `config.output_hidden_states=True`):
-            Tuple of `torch.FloatTensor` (one for the output of the embeddings,
-            if the model has an embedding layer, + one for the output of each
-            layer) of shape `(batch_size, sequence_length, hidden_size)`.
-            Hidden-states of the model at the output of each layer plus the
-            optional initial embedding outputs.
         attentions (`tuple(torch.FloatTensor)`, *optional*, returned when
             `output_attentions=True` is passed or when
             `config.output_attentions=True`):
@@ -303,75 +62,8 @@ class BaseModelOutputWithPoolingAndCrossAttentions(ModelOutputBase):
             can be used (see `past_key_values` input) to speed up sequential
             decoding.
     """
-
-    last_hidden_state: Tensor = None
-    pooler_output: Tensor = None
-    hidden_states: Tensor = None
-    past_key_values: Tensor = None
     attentions: Tensor = None
-    cross_attentions: Tensor = None
-
-
-@dataclass
-class BaseModelOutputWithPastAndCrossAttentions(ModelOutputBase):
-    """
-    Base class for model's outputs that may also contain a past key/values (to
-    speed up sequential decoding).
-
-    Args:
-        last_hidden_state (`torch.FloatTensor` of shape `(batch_size,
-            sequence_length, hidden_size)`):
-            Sequence of hidden-states at the output of the last layer of the
-            model.
-
-            If `past_key_values` is used only the last hidden-state of the
-            sequences of shape `(batch_size, 1, hidden_size)` is output.
-        past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned
-            when `use_cache=True` is passed or when `config.use_cache=True`):
-            Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`,
-            with each tuple having 2 tensors of shape `(batch_size, num_heads,
-            sequence_length, embed_size_per_head)`) and optionally if
-            `config.is_encoder_decoder=True` 2 additional tensors of shape
-            `(batch_size, num_heads, encoder_sequence_length,
-            embed_size_per_head)`.
-
-            Contains pre-computed hidden-states (key and values in the
-            self-attention blocks and optionally if
-            `config.is_encoder_decoder=True` in the cross-attention blocks) that
-            can be used (see `past_key_values` input) to speed up sequential
-            decoding.
-        hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when
-            `output_hidden_states=True` is passed or when
-            `config.output_hidden_states=True`):
-            Tuple of `torch.FloatTensor` (one for the output of the embeddings,
-            if the model has an embedding layer, + one for the output of each
-            layer) of shape `(batch_size, sequence_length, hidden_size)`.
-            Hidden-states of the model at the output of each layer plus the
-            optional initial embedding outputs.
-        attentions (`tuple(torch.FloatTensor)`, *optional*, returned when
-            `output_attentions=True` is passed or when
-            `config.output_attentions=True`):
-            Tuple of `torch.FloatTensor` (one for each layer) of shape
-            `(batch_size, num_heads, sequence_length, sequence_length)`.
-            Attentions weights after the attention softmax, used to compute the
-            weighted average in the self-attention heads.
-        cross_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when
-            `output_attentions=True` and `config.add_cross_attention=True` is passed
-            or when `config.output_attentions=True`):
-            Tuple of `torch.FloatTensor` (one for each layer) of shape
-            `(batch_size, num_heads, sequence_length, sequence_length)`.
-            Attentions weights of the decoder's cross-attention layer, after the
-            attention softmax, used to compute the weighted average in the
-            cross-attention heads.
-    """
-
-    last_hidden_state: Tensor = None
     past_key_values: Tensor = None
-    hidden_states: Tensor = None
-    attentions: Tensor = None
     cross_attentions: Tensor = None
@@ -459,6 +151,60 @@ class Seq2SeqModelOutput(ModelOutputBase):
     encoder_attentions: Optional[Tuple[Tensor]] = None
 
 
+@dataclass
+class FaqQuestionAnsweringOutput(ModelOutputBase):
+    """The output class for faq QA models.
+    """
+
+    scores: Tensor = None
+
+
+@dataclass
+class FeatureExtractionOutput(ModelOutputBase):
+    """The output class for feature extraction models.
+    """
+
+    text_embedding: Tensor = None
+
+
+@dataclass
+class FillMaskModelOutput(ModelOutputBase):
+    """The output class for fill mask models.
+
+    Args:
+        logits (`Tensor`): The logits output of the model.
+        loss (`Tensor`, *optional*) The loss of the model, available when training.
+        input_ids (`Tensor`, *optional*) The input id tensor fed into the model.
+        hidden_states (`Tensor`, *optional*) Hidden-states of the model at the
+            output of each layer plus the optional initial embedding outputs.
+    """
+
+    logits: Tensor = None
+    loss: Tensor = None
+    input_ids: Tensor = None
+    hidden_states: Tensor = None
+
+
+@dataclass
+class AttentionFillMaskModelOutput(FillMaskModelOutput):
+    """The output class for the fill mask and attention based models.
+
+    Args:
+        attentions (`tuple(Tensor)`, *optional*) Attentions weights after the
+            attention softmax, used to compute the weighted average in the
+            self-attention heads.
+    """
+
+    attentions: Tensor = None
+
+
+@dataclass
+class InformationExtractionOutput(ModelOutputBase):
+    """The output class for information extraction models.
+    """
+
+    spo_list: np.ndarray = None
+
+
 @dataclass
 class Seq2SeqLMOutput(ModelOutputBase):
     """
@@ -543,6 +289,42 @@ class Seq2SeqLMOutput(ModelOutputBase):
     encoder_attentions: Optional[Tuple[Tensor]] = None
 
 
+@dataclass
+class TextClassificationModelOutput(ModelOutputBase):
+    """The output class for text classification models.
+
+    Args:
+        logits (`Tensor`): The logits output of the model.
+        loss (`Tensor`, *optional*) The loss of the model, available when training.
+        hidden_states (`Tensor`, *optional*) Hidden-states of the model at the
+            output of each layer plus the optional initial embedding outputs.
+    """
+
+    logits: Tensor = None
+    loss: Tensor = None
+
+
+@dataclass
+class AttentionTextClassificationModelOutput(TextClassificationModelOutput):
+    """The output class for attention based text classification models.
+
+    Args:
+        attentions (`tuple(Tensor)`, *optional*) Attentions weights after the
+            attention softmax, used to compute the weighted average in the
+            self-attention heads.
+    """
+
+    attentions: Tensor = None
+    hidden_states: Tensor = None
+
+
+@dataclass
+class TextErrorCorrectionOutput(ModelOutputBase):
+    """The output class for text error correction models.
+    """
+
+    predictions: np.ndarray = None
+
+
 @dataclass
 class TextGenerationModelOutput(ModelOutputBase):
     """The output class for text generation models.
@@ -588,3 +370,35 @@ class TokenGeneratorOutput(ModelOutputBase):
     scores: Optional[Tuple[Tensor]] = None
     attentions: Optional[Tuple[Tuple[Tensor]]] = None
     hidden_states: Optional[Tuple[Tuple[Tensor]]] = None
+
+
+@dataclass
+class TokenClassificationModelOutput(ModelOutputBase):
+    """The output class for token classification models.
+
+    Args:
+        logits (`Tensor`): The logits output of the model.
+        loss (`Tensor`, *optional*) The loss of the model, available when training.
+        predictions (`Tensor`, *optional*) A PyTorch tensor of the best tag sequence for each batch of shape
+            (nbest, batch_size, seq_length)
+        offset_mapping (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,
+            sequence_length)`, `optional`):
+            Indices of positions of each input sequence tokens in the sentence.
+            Selected in the range ``[0, sequence_length - 1]``.
+    """
+
+    logits: Tensor = None
+    loss: Tensor = None
+    offset_mapping: Tensor = None
+    predictions: Tensor = None
+    label_mask: Tensor = None
+
+
+@dataclass
+class AttentionTokenClassificationModelOutput(TokenClassificationModelOutput):
+    """The output class for attention based token classification models.
+
+    Args:
+        attentions (`tuple(Tensor)`, *optional*) Attentions weights after the attention softmax,
+            used to compute the weighted average in the self-attention heads.
+    """
+
+    attentions: Tensor = None
+    hidden_states: Tensor = None
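A short sketch of how these output dataclasses are typically consumed; it assumes `ModelOutputBase` provides the dict-like access (`keys()`, membership tests) that the pipeline check in the next hunk relies on, mirroring transformers' `ModelOutput` behaviour:

    import torch
    from modelscope.outputs import TokenClassificationModelOutput

    output = TokenClassificationModelOutput(logits=torch.randn(1, 8, 5))
    print(output.logits.shape)         # attribute access
    print('logits' in output.keys())   # dict-style access used by Pipeline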
@@ -12,7 +12,7 @@ import numpy as np
 from modelscope.models.base import Model
 from modelscope.msdatasets import MsDataset
-from modelscope.outputs import TASK_OUTPUTS
+from modelscope.outputs import TASK_OUTPUTS, ModelOutputBase
 from modelscope.pipeline_inputs import TASK_INPUTS, check_input_type
 from modelscope.preprocessors import Preprocessor
 from modelscope.utils.config import Config
@@ -321,6 +321,8 @@ class Pipeline(ABC):
             return
         output_keys = TASK_OUTPUTS[task_name]
         missing_keys = []
+        input = input.keys() if isinstance(input,
+                                           (dict, ModelOutputBase)) else input
         for k in output_keys:
             if k not in input:
                 missing_keys.append(k)
@@ -298,6 +298,7 @@ def pipeline(task: str = None,
         raise ValueError('task or pipeline_name is required')
 
     model = normalize_model_input(model, model_revision)
+    pipeline_props = {'type': pipeline_name}
     if pipeline_name is None:
         # get default pipeline for this task
         if isinstance(model, str) \
@@ -309,7 +310,7 @@ def pipeline(task: str = None,
                 model, str) else read_config(
                     model[0], revision=model_revision)
             check_config(cfg)
-            pipeline_name = cfg.pipeline.type
+            pipeline_props = cfg.pipeline
         elif model is not None:
             # get pipeline info from Model object
             first_model = model[0] if isinstance(model, list) else model
@@ -318,13 +319,15 @@ def pipeline(task: str = None,
                 cfg = read_config(first_model.model_dir)
                 check_config(cfg)
                 first_model.pipeline = cfg.pipeline
-            pipeline_name = first_model.pipeline.type
+            pipeline_props = first_model.pipeline
         else:
             pipeline_name, default_model_repo = get_default_pipeline_info(task)
             model = normalize_model_input(default_model_repo, model_revision)
+            pipeline_props = {'type': pipeline_name}
 
-    cfg = ConfigDict(type=pipeline_name, model=model)
-    cfg.device = device
+    pipeline_props['model'] = model
+    pipeline_props['device'] = device
+    cfg = ConfigDict(pipeline_props)
 
     if kwargs:
         cfg.update(kwargs)
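The effect of the builder change above: the model's whole `pipeline` configuration section now flows into the pipeline constructor, not just its `type`. A hedged sketch with illustrative field names:

    # Suppose configuration.json contains:
    #   "pipeline": {"type": "text-classification", "sequence_length": 128}
    pipeline_props = {'type': 'text-classification', 'sequence_length': 128}
    pipeline_props['model'] = 'damo/example-model-id'  # hypothetical model id
    pipeline_props['device'] = 'gpu'
    # ConfigDict(pipeline_props) is then built, so `sequence_length` reaches
    # the pipeline's __init__ as an extra keyword argument.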
@@ -61,6 +61,8 @@ class EasyCVPipeline(object):
         self.cfg = Config.from_file(self.config_file)
         if 'device' in kwargs:
             kwargs['device'] = create_device(kwargs['device'])
+        if 'predictor_config' in kwargs:
+            kwargs.pop('predictor_config')
 
         self.predict_op = self._build_predict_op(**kwargs)
 
     def _build_predict_op(self, **kwargs):
@@ -12,22 +12,19 @@ if TYPE_CHECKING:
     from .dialog_state_tracking_pipeline import DialogStateTrackingPipeline
     from .document_segmentation_pipeline import DocumentSegmentationPipeline
     from .extractive_summarization_pipeline import ExtractiveSummarizationPipeline
-    from .fasttext_sequence_classification_pipeline import FasttextSequenceClassificationPipeline
+    from .fasttext_text_classification_pipeline import FasttextSequenceClassificationPipeline
    from .faq_question_answering_pipeline import FaqQuestionAnsweringPipeline
     from .feature_extraction_pipeline import FeatureExtractionPipeline
     from .fill_mask_pipeline import FillMaskPipeline
     from .information_extraction_pipeline import InformationExtractionPipeline
-    from .named_entity_recognition_pipeline import NamedEntityRecognitionPipeline, \
-        NamedEntityRecognitionThaiPipeline, \
-        NamedEntityRecognitionVietPipeline
+    from .named_entity_recognition_pipeline import NamedEntityRecognitionPipeline
     from .text_ranking_pipeline import TextRankingPipeline
     from .sentence_embedding_pipeline import SentenceEmbeddingPipeline
     from .text_classification_pipeline import TextClassificationPipeline
     from .summarization_pipeline import SummarizationPipeline
     from .translation_quality_estimation_pipeline import TranslationQualityEstimationPipeline
     from .text_error_correction_pipeline import TextErrorCorrectionPipeline
-    from .text_generation_pipeline import TextGenerationPipeline
+    from .text2text_generation_pipeline import Text2TextGenerationPipeline
+    from .text_generation_pipeline import TextGenerationPipeline, TextGenerationT5Pipeline
     from .token_classification_pipeline import TokenClassificationPipeline
     from .translation_pipeline import TranslationPipeline
     from .word_segmentation_pipeline import WordSegmentationPipeline, WordSegmentationThaiPipeline
@@ -56,8 +53,6 @@ else:
         'information_extraction_pipeline': ['InformationExtractionPipeline'],
         'named_entity_recognition_pipeline': [
             'NamedEntityRecognitionPipeline',
-            'NamedEntityRecognitionThaiPipeline',
-            'NamedEntityRecognitionVietPipeline'
         ],
         'text_ranking_pipeline': ['TextRankingPipeline'],
         'sentence_embedding_pipeline': ['SentenceEmbeddingPipeline'],
@@ -66,7 +61,8 @@ else:
         ['TableQuestionAnsweringPipeline'],
         'text_classification_pipeline': ['TextClassificationPipeline'],
         'text_error_correction_pipeline': ['TextErrorCorrectionPipeline'],
-        'text_generation_pipeline': ['TextGenerationPipeline'],
+        'text_generation_pipeline':
+        ['TextGenerationPipeline', 'TextGenerationT5Pipeline'],
         'text2text_generation_pipeline': ['Text2TextGenerationPipeline'],
         'token_classification_pipeline': ['TokenClassificationPipeline'],
         'translation_pipeline': ['TranslationPipeline'],
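For context, a hedged sketch of how such an `_import_structure` mapping is usually consumed in these `__init__` modules (the `LazyImportModule` wiring itself is outside this hunk):

    import sys

    from modelscope.utils.import_utils import LazyImportModule

    _import_structure = {
        'text_generation_pipeline':
        ['TextGenerationPipeline', 'TextGenerationT5Pipeline'],
    }
    sys.modules[__name__] = LazyImportModule(
        __name__, globals()['__file__'], _import_structure, module_spec=__spec__)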
@@ -24,18 +24,27 @@ class ConversationalTextToSqlPipeline(Pipeline):
     def __init__(self,
                  model: Union[StarForTextToSql, str],
                  preprocessor: ConversationalTextToSqlPreprocessor = None,
+                 config_file: str = None,
+                 device: str = 'gpu',
+                 auto_collate=True,
                  **kwargs):
         """use `model` and `preprocessor` to create a conversational text-to-sql prediction pipeline
 
         Args:
-            model (StarForTextToSql): a model instance
-            preprocessor (ConversationalTextToSqlPreprocessor):
-                a preprocessor instance
+            model (StarForTextToSql): A model instance
+            preprocessor (ConversationalTextToSqlPreprocessor): A preprocessor instance
+            kwargs (dict, `optional`):
+                Extra kwargs passed into the preprocessor's constructor.
         """
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
+        super().__init__(
+            model=model,
+            preprocessor=preprocessor,
+            config_file=config_file,
+            device=device,
+            auto_collate=auto_collate)
         if preprocessor is None:
             self.preprocessor = ConversationalTextToSqlPreprocessor(
-                self.model.model_dir)
+                self.model.model_dir, **kwargs)
 
     def postprocess(self, inputs: Dict[str, Any]) -> Dict[str, str]:
         """process the prediction results
@@ -22,6 +22,9 @@ class DialogIntentPredictionPipeline(Pipeline):
     def __init__(self,
                  model: Union[SpaceForDialogIntent, str],
                  preprocessor: DialogIntentPredictionPreprocessor = None,
+                 config_file: str = None,
+                 device: str = 'gpu',
+                 auto_collate=True,
                  **kwargs):
         """Use `model` and `preprocessor` to create a dialog intent prediction pipeline
 
@@ -29,11 +32,18 @@ class DialogIntentPredictionPipeline(Pipeline):
             model (str or SpaceForDialogIntent): Supply either a local model dir or a model id from the model hub,
                 or a SpaceForDialogIntent instance.
             preprocessor (DialogIntentPredictionPreprocessor): An optional preprocessor instance.
+            kwargs (dict, `optional`):
+                Extra kwargs passed into the preprocessor's constructor.
         """
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
+        super().__init__(
+            model=model,
+            preprocessor=preprocessor,
+            config_file=config_file,
+            device=device,
+            auto_collate=auto_collate)
         if preprocessor is None:
             self.preprocessor = DialogIntentPredictionPreprocessor(
-                self.model.model_dir)
+                self.model.model_dir, **kwargs)
         self.categories = self.preprocessor.categories
 
     def postprocess(self, inputs: Dict[str, Any]) -> Dict[str, str]:
@@ -21,6 +21,9 @@ class DialogModelingPipeline(Pipeline):
     def __init__(self,
                  model: Union[SpaceForDialogModeling, str],
                  preprocessor: DialogModelingPreprocessor = None,
+                 config_file: str = None,
+                 device: str = 'gpu',
+                 auto_collate=True,
                  **kwargs):
         """Use `model` and `preprocessor` to create a dialog modeling pipeline for dialog response generation
 
@@ -28,11 +31,18 @@ class DialogModelingPipeline(Pipeline):
             model (str or SpaceForDialogModeling): Supply either a local model dir or a model id from the model hub,
                 or a SpaceForDialogModeling instance.
             preprocessor (DialogModelingPreprocessor): An optional preprocessor instance.
+            kwargs (dict, `optional`):
+                Extra kwargs passed into the preprocessor's constructor.
         """
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
+        super().__init__(
+            model=model,
+            preprocessor=preprocessor,
+            config_file=config_file,
+            device=device,
+            auto_collate=auto_collate)
         if preprocessor is None:
             self.preprocessor = DialogModelingPreprocessor(
-                self.model.model_dir)
+                self.model.model_dir, **kwargs)
 
     def postprocess(self, inputs: Dict[str, Tensor]) -> Dict[str, str]:
         """process the prediction results
@@ -22,6 +22,9 @@ class DialogStateTrackingPipeline(Pipeline):
     def __init__(self,
                  model: Union[SpaceForDST, str],
                  preprocessor: DialogStateTrackingPreprocessor = None,
+                 config_file: str = None,
+                 device: str = 'gpu',
+                 auto_collate=True,
                  **kwargs):
         """use `model` and `preprocessor` to create a dialog state tracking pipeline for
         observation of dialog states tracking after many turns of open domain dialogue
 
@@ -30,11 +33,20 @@ class DialogStateTrackingPipeline(Pipeline):
             model (str or SpaceForDialogStateTracking): Supply either a local model dir or a model id
                 from the model hub, or a SpaceForDialogStateTracking instance.
             preprocessor (DialogStateTrackingPreprocessor): An optional preprocessor instance.
+            kwargs (dict, `optional`):
+                Extra kwargs passed into the preprocessor's constructor.
         """
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
+        super().__init__(
+            model=model,
+            preprocessor=preprocessor,
+            config_file=config_file,
+            device=device,
+            auto_collate=auto_collate)
         if preprocessor is None:
             self.preprocessor = DialogStateTrackingPreprocessor(
-                self.model.model_dir)
+                self.model.model_dir, **kwargs)
         self.tokenizer = self.preprocessor.tokenizer
         self.config = self.preprocessor.config
@@ -21,8 +21,16 @@ class DistributedGPT3Pipeline(DistributedPipeline):
     model = None

     def __init__(self, model, preprocessor=None, **kwargs):
+        """
+        Args:
+            model: The model piece, str is not supported.
+            preprocessor: The preprocessor matched with the model.
+            kwargs (dict, `optional`):
+                Extra kwargs passed into the preprocessor's constructor.
+        """
         if preprocessor is None:
-            preprocessor = TextGenerationJiebaPreprocessor(model)
+            preprocessor = TextGenerationJiebaPreprocessor(model, **kwargs)
         super().__init__(model, preprocessor=preprocessor, **kwargs)
         assert hasattr(preprocessor, 'tokenizer')
@@ -8,7 +8,7 @@ from modelscope.metainfo import Pipelines
 from modelscope.models.nlp.plug import DistributedPlug
 from modelscope.pipelines.base import DistributedPipeline
 from modelscope.pipelines.builder import PIPELINES
-from modelscope.preprocessors import TextGenerationPreprocessor
+from modelscope.preprocessors import TextGenerationTransformersPreprocessor
 from modelscope.utils.constant import Tasks
@@ -24,11 +24,12 @@ class DistributedPlugPipeline(DistributedPipeline):
                  model,
                  preprocessor=None,
                  first_sequence='sentence',
+                 sequence_length=512,
                  **kwargs):
         """Create a plug pipeline instance.

         Args:
             model: The model_id of plug(damo/nlp_plug_text-generation_27B).
                 The default path to damo/nlp_plug_text-generation_27B can be obtained by function
                 get_cache_dir("damo/nlp_plug_text-generation_27B"), the model should be downloaded to
                 this path before calling this class by model_id.
@@ -53,17 +54,16 @@ class DistributedPlugPipeline(DistributedPipeline):
                     |_ mp_rank_05_model_states.pt
                     |_ mp_rank_06_model_states.pt
                     |_ mp_rank_07_model_states.pt
             preprocessor: The optional preprocessor, if not passed in, a TextGenerationPreprocessor will
                 be used as default.
-            first_sequence: The first_sequence key name if the input format is a dict.
-            kwargs:
-                sequence_length: The input sequence_length.
+            sequence_length: The input sequence_length.
+            kwargs (dict, `optional`): Extra kwargs passed into the preprocessor's constructor.
         """
         if preprocessor is None:
-            preprocessor = TextGenerationPreprocessor(
+            preprocessor = TextGenerationTransformersPreprocessor(
                 model,
                 first_sequence=first_sequence,
-                sequence_length=kwargs.pop('sequence_length', 512))
+                sequence_length=sequence_length,
+                **kwargs)
         super().__init__(model, preprocessor=preprocessor, **kwargs)
         assert hasattr(preprocessor, 'tokenizer')
         self.cls_token_id = preprocessor.tokenizer.cls_token_id
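The `kwargs.pop('sequence_length', 512)` idiom disappears throughout this review in favor of an explicit keyword parameter. A standalone sketch of the difference (function names are mine, for illustration only):

    # With kwargs.pop the default silently applies even when the caller
    # mistypes the key; an explicit parameter keeps the same default while
    # staying visible in the signature, help() and IDE completion.
    def old_style(model, **kwargs):
        return kwargs.pop('sequence_length', 512)

    def new_style(model, sequence_length=512, **kwargs):
        return sequence_length

    assert old_style('m', sequnce_length=128) == 512  # typo swallowed silently
    assert new_style('m', sequence_length=128) == 128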
@@ -14,7 +14,8 @@ from modelscope.models.nlp.ponet.configuration import PoNetConfig
 from modelscope.outputs import OutputKeys
 from modelscope.pipelines.base import Pipeline, Tensor
 from modelscope.pipelines.builder import PIPELINES
-from modelscope.preprocessors import DocumentSegmentationPreprocessor
+from modelscope.preprocessors import \
+    DocumentSegmentationTransformersPreprocessor
 from modelscope.utils.constant import Tasks
 from modelscope.utils.logger import get_logger
@@ -27,26 +28,34 @@ __all__ = ['DocumentSegmentationPipeline']
     Tasks.document_segmentation, module_name=Pipelines.document_segmentation)
 class DocumentSegmentationPipeline(Pipeline):

-    def __init__(self,
-                 model: Union[Model, str],
-                 preprocessor: DocumentSegmentationPreprocessor = None,
-                 **kwargs):
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
-
-        self.model_dir = self.model.model_dir
-        self.model_cfg = self.model.forward()
-        if self.model_cfg['type'] == 'bert':
-            config = BertConfig.from_pretrained(self.model_dir, num_labels=2)
-        elif self.model_cfg['type'] == 'ponet':
-            config = PoNetConfig.from_pretrained(self.model_dir, num_labels=2)
-        self.document_segmentation_model = self.model.build_with_config(
-            config=config)
+    def __init__(
+            self,
+            model: Union[Model, str],
+            preprocessor: DocumentSegmentationTransformersPreprocessor = None,
+            config_file: str = None,
+            device: str = 'gpu',
+            auto_collate=True,
+            **kwargs):
+        """The document segmentation pipeline.
+
+        Args:
+            model (str or Model): Supply either a local model dir or a model id from the model hub
+            preprocessor (Preprocessor): An optional preprocessor instance, please make sure the preprocessor fits for
+                the model if supplied.
+        """
+        super().__init__(
+            model=model,
+            preprocessor=preprocessor,
+            config_file=config_file,
+            device=device,
+            auto_collate=auto_collate)
+        self.model_dir = self.model.model_dir
+        self.model_cfg = self.model.model_cfg
         if preprocessor is None:
-            self.preprocessor = DocumentSegmentationPreprocessor(
-                self.model.model_dir, config)
+            self.preprocessor = DocumentSegmentationTransformersPreprocessor(
+                self.model_dir, self.model.config.max_position_embeddings,
+                **kwargs)

     def __call__(
         self, documents: Union[List[List[str]], List[str],
@@ -85,8 +94,7 @@ class DocumentSegmentationPipeline(Pipeline):
                 key: torch.tensor(val)
                 for key, val in predict_dataset.items()
             }
-            predictions = self.document_segmentation_model.forward(
-                **input).logits
+            predictions = self.model.forward(**input).logits
             predictions = np.argmax(predictions, axis=2)

             assert len(sentences) == len(
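With the backbone config (BertConfig/PoNetConfig) now handled inside the model, the preprocessor only needs the model dir and the maximum length. A hedged construction sketch (the path is illustrative; the argument order mirrors the call above):

    from modelscope.preprocessors import \
        DocumentSegmentationTransformersPreprocessor

    # the second positional argument is the model's max_position_embeddings
    preprocessor = DocumentSegmentationTransformersPreprocessor(
        '/path/to/model_dir', 512)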
@@ -6,15 +6,14 @@ from typing import Any, Dict, List, Union
 import numpy as np
 import torch
 from datasets import Dataset
-from transformers.models.bert.modeling_bert import BertConfig

 from modelscope.metainfo import Pipelines
 from modelscope.models import Model
-from modelscope.models.nlp.ponet.configuration import PoNetConfig
 from modelscope.outputs import OutputKeys
 from modelscope.pipelines.base import Pipeline, Tensor
 from modelscope.pipelines.builder import PIPELINES
-from modelscope.preprocessors import DocumentSegmentationPreprocessor
+from modelscope.preprocessors import \
+    DocumentSegmentationTransformersPreprocessor
 from modelscope.utils.constant import Tasks
 from modelscope.utils.logger import get_logger
@@ -28,31 +27,29 @@ __all__ = ['ExtractiveSummarizationPipeline']
     module_name=Pipelines.extractive_summarization)
 class ExtractiveSummarizationPipeline(Pipeline):

-    def __init__(self,
-                 model: Union[Model, str],
-                 preprocessor: DocumentSegmentationPreprocessor = None,
-                 **kwargs):
-        model = model if isinstance(model,
-                                    Model) else Model.from_pretrained(model)
-        self.model_dir = model.model_dir
-        self.model_cfg = model.forward()
-        if self.model_cfg['type'] == 'bert':
-            config = BertConfig.from_pretrained(model.model_dir, num_labels=2)
-        elif self.model_cfg['type'] == 'ponet':
-            config = PoNetConfig.from_pretrained(model.model_dir, num_labels=2)
-        self.extractive_summarization_model = model.build_with_config(
-            config=config)
+    def __init__(
+            self,
+            model: Union[Model, str],
+            preprocessor: DocumentSegmentationTransformersPreprocessor = None,
+            config_file: str = None,
+            device: str = 'gpu',
+            auto_collate=True,
+            **kwargs):
+        super().__init__(
+            model=model,
+            preprocessor=preprocessor,
+            config_file=config_file,
+            device=device,
+            auto_collate=auto_collate)
+        self.model_dir = self.model.model_dir
+        self.model_cfg = self.model.model_cfg
         if preprocessor is None:
-            preprocessor = DocumentSegmentationPreprocessor(
-                self.model_dir, config)
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
-        self.preprocessor = preprocessor
+            self.preprocessor = DocumentSegmentationTransformersPreprocessor(
+                self.model_dir, self.model.config.max_position_embeddings,
+                **kwargs)

     def __call__(self, documents: Union[List[str], str]) -> Dict[str, Any]:
         output = self.predict(documents)
@@ -80,8 +77,7 @@ class ExtractiveSummarizationPipeline(Pipeline):
                 key: torch.tensor(val)
                 for key, val in predict_dataset.items()
             }
-            logits = self.extractive_summarization_model.forward(
-                **input).logits
+            logits = self.model.forward(**input).logits
             predictions = np.argmax(logits, axis=2)

             assert len(sentences) == len(
@@ -20,8 +20,24 @@ class FaqQuestionAnsweringPipeline(Pipeline):
     def __init__(self,
                  model: Union[str, Model],
                  preprocessor: Preprocessor = None,
+                 config_file: str = None,
+                 device: str = 'gpu',
+                 auto_collate=True,
                  **kwargs):
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
+        """The faq question answering pipeline.
+
+        Args:
+            model (str or Model): A model instance or a model local dir or a model id in the model hub.
+            preprocessor (Preprocessor, `optional`): a preprocessor instance
+            kwargs (dict, `optional`):
+                The preprocessor kwargs passed into the preprocessor's constructor.
+        """
+        super().__init__(
+            model=model,
+            preprocessor=preprocessor,
+            config_file=config_file,
+            device=device,
+            auto_collate=auto_collate)
         if preprocessor is None:
             self.preprocessor = Preprocessor.from_pretrained(
                 self.model.model_dir, **kwargs)
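`Preprocessor.from_pretrained` is the load path this refactor standardizes on: the preprocessor type comes from the configuration saved next to the model instead of being hard-coded per pipeline class. Sketch (path illustrative):

    from modelscope.preprocessors import Preprocessor

    # Builds whatever preprocessor the model's configuration declares;
    # extra kwargs override the saved preprocessor parameters.
    preprocessor = Preprocessor.from_pretrained(
        '/path/to/model_dir', sequence_length=256)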
@@ -9,11 +9,9 @@ from fasttext import load_model
 from fasttext.FastText import _FastText

 from modelscope.metainfo import Pipelines
-from modelscope.models import Model
 from modelscope.outputs import OutputKeys
 from modelscope.pipelines.base import Input, Pipeline
 from modelscope.pipelines.builder import PIPELINES
-from modelscope.preprocessors import SequenceClassificationPreprocessor
 from modelscope.utils.constant import ModelFile, Tasks

 __all__ = ['FasttextSequenceClassificationPipeline']
@@ -36,8 +34,7 @@ class FasttextSequenceClassificationPipeline(Pipeline):
         """use `model` and `preprocessor` to create a nlp text classification pipeline for prediction

         Args:
-            model: a model directory including model.bin and spm.model
-            preprocessor (SequenceClassificationPreprocessor): a preprocessor instance
+            model: A model directory including model.bin and spm.model
         """
         super().__init__(model=model)
         model_file = os.path.join(model, ModelFile.TORCH_MODEL_BIN_FILE)
@@ -53,8 +50,11 @@ class FasttextSequenceClassificationPipeline(Pipeline):
         text_sp = sentencepiece_tokenize(self.spm, text)
         return {'text_sp': text_sp, 'text': text}

-    def forward(self, inputs: Dict[str, Any]) -> Dict[str, Any]:
-        topk = inputs.get('topk', -1)
+    def forward(self,
+                inputs: Dict[str, Any],
+                topk: int = None) -> Dict[str, Any]:
+        if topk is None:
+            topk = inputs.get('topk', -1)
         label, probs = self.model.predict(inputs['text_sp'], k=topk)
         label = [x.replace('__label__', '') for x in label]
         result = {
@@ -9,7 +9,8 @@ from modelscope.models import Model
 from modelscope.outputs import OutputKeys
 from modelscope.pipelines.base import Pipeline, Tensor
 from modelscope.pipelines.builder import PIPELINES
-from modelscope.preprocessors import NLPPreprocessor, Preprocessor
+from modelscope.preprocessors import (FillMaskTransformersPreprocessor,
+                                      Preprocessor)
 from modelscope.utils.config import Config
 from modelscope.utils.constant import ModelFile, Tasks
@@ -23,7 +24,11 @@ class FeatureExtractionPipeline(Pipeline):
     def __init__(self,
                  model: Union[Model, str],
                  preprocessor: Optional[Preprocessor] = None,
-                 first_sequence='sentence',
+                 config_file: str = None,
+                 device: str = 'gpu',
+                 auto_collate=True,
+                 padding=False,
+                 sequence_length=128,
                  **kwargs):
         """Use `model` and `preprocessor` to create a nlp feature extraction pipeline for prediction
@@ -32,11 +37,8 @@ class FeatureExtractionPipeline(Pipeline):
                 no-head model id from the model hub, or a torch model instance.
             preprocessor (Preprocessor): An optional preprocessor instance, please make sure the preprocessor fits for
                 the model if supplied.
-            first_sequence: The key to read the sentence in.
-            sequence_length: Max sequence length in the user's custom scenario. 128 will be used as a default value.
-            NOTE: Inputs of type 'str' are also supported. In this scenario, the 'first_sequence'
-            param will have no effect.
+            kwargs (dict, `optional`):
+                Extra kwargs passed into the preprocessor's constructor.

         Example:
             >>> from modelscope.pipelines import pipeline
@@ -46,19 +48,21 @@ class FeatureExtractionPipeline(Pipeline):
         """
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
+        super().__init__(
+            model=model,
+            preprocessor=preprocessor,
+            config_file=config_file,
+            device=device,
+            auto_collate=auto_collate)
         if preprocessor is None:
-            self.preprocessor = NLPPreprocessor(
+            self.preprocessor = Preprocessor.from_pretrained(
                 self.model.model_dir,
-                padding=kwargs.pop('padding', False),
-                sequence_length=kwargs.pop('sequence_length', 128))
+                padding=padding,
+                sequence_length=sequence_length,
+                **kwargs)
         self.model.eval()
-        self.config = Config.from_file(
-            os.path.join(self.model.model_dir, ModelFile.CONFIGURATION))
-        self.tokenizer = self.preprocessor.tokenizer

     def forward(self, inputs: Dict[str, Any],
                 **forward_params) -> Dict[str, Any]:
         with torch.no_grad():
@@ -23,7 +23,11 @@ class FillMaskPipeline(Pipeline):
     def __init__(self,
                  model: Union[Model, str],
                  preprocessor: Optional[Preprocessor] = None,
-                 first_sequence: str = 'sentence',
+                 config_file: str = None,
+                 device: str = 'gpu',
+                 auto_collate=True,
+                 first_sequence='sentence',
+                 sequence_length=128,
                  **kwargs):
         """The inference pipeline for all the fill mask sub-tasks.
@@ -31,11 +35,8 @@ class FillMaskPipeline(Pipeline):
             model (`str` or `Model` or module instance): A model instance or a model local dir
                 or a model id in the model hub.
             preprocessor (`Preprocessor`, `optional`): A Preprocessor instance.
-            first_sequence (`str`, `optional`): The key to read the sentence in.
-            sequence_length (`int`, `optional`): Max sequence length in the user's custom scenario, default 128.
-            NOTE1: Inputs of type 'str' are also supported. In this scenario, the 'first_sequence'
-            param will have no effect.
+            kwargs (dict, `optional`):
+                Extra kwargs passed into the preprocessor's constructor.

         Example1:
             >>> from modelscope.pipelines import pipeline
@@ -51,20 +52,25 @@ class FillMaskPipeline(Pipeline):
         NOTE2: Please pay attention to the model's special tokens.
         If bert based model(bert, structbert, etc.) is used, the mask token is '[MASK]'.
         If the xlm-roberta(xlm-roberta, veco, etc.) based model is used, the mask token is '<mask>'.
-        To view other examples please check the tests/pipelines/test_fill_mask.py.
+        To view other examples please check tests/pipelines/test_fill_mask.py.
         """
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
+        super().__init__(
+            model=model,
+            preprocessor=preprocessor,
+            config_file=config_file,
+            device=device,
+            auto_collate=auto_collate)
         if preprocessor is None:
             self.preprocessor = Preprocessor.from_pretrained(
                 self.model.model_dir,
                 first_sequence=first_sequence,
-                second_sequence=None,
-                sequence_length=kwargs.pop('sequence_length', 128))
-        assert hasattr(
-            self.preprocessor, 'mask_id'
-        ), 'The input preprocessor should have the mask_id attribute.'
+                sequence_length=sequence_length,
+                **kwargs)
         self.model.eval()
+        assert hasattr(
+            self.preprocessor, 'mask_id'
+        ), 'The input preprocessor should have the mask_id attribute.'

     def forward(self, inputs: Dict[str, Any],
                 **forward_params) -> Dict[str, Any]:
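Note that the `mask_id` assertion now runs unconditionally, so it also guards user-supplied preprocessors instead of only the default one. Minimal usage sketch (model id illustrative; the mask token depends on the backbone as NOTE2 above explains):

    from modelscope.pipelines import pipeline

    fill_mask = pipeline(
        'fill-mask',
        model='damo/nlp_structbert_fill-mask_chinese-base')  # illustrative
    print(fill_mask('生活的真谛是[MASK]。'))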
@@ -8,8 +8,7 @@ from modelscope.metainfo import Pipelines
 from modelscope.models import Model
 from modelscope.pipelines.base import Pipeline
 from modelscope.pipelines.builder import PIPELINES
-from modelscope.preprocessors import (Preprocessor,
-                                      RelationExtractionPreprocessor)
+from modelscope.preprocessors import Preprocessor
 from modelscope.utils.constant import Tasks

 __all__ = ['InformationExtractionPipeline']
@@ -24,12 +23,33 @@ class InformationExtractionPipeline(Pipeline):
     def __init__(self,
                  model: Union[Model, str],
                  preprocessor: Optional[Preprocessor] = None,
+                 config_file: str = None,
+                 device: str = 'gpu',
+                 auto_collate=True,
+                 sequence_length=512,
                  **kwargs):
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
-        if preprocessor is None:
-            self.preprocessor = RelationExtractionPreprocessor(
+        """
+        Args:
+            model (str or Model): Supply either a local model dir which supported information extraction task, or a
+                model id from the model hub, or a torch model instance.
+            preprocessor (Preprocessor): An optional preprocessor instance, please make sure the preprocessor fits for
+                the model if supplied.
+            kwargs (dict, `optional`):
+                Extra kwargs passed into the preprocessor's constructor.
+        """
+        super().__init__(
+            model=model,
+            preprocessor=preprocessor,
+            config_file=config_file,
+            device=device,
+            auto_collate=auto_collate)
+        if self.preprocessor is None:
+            self.preprocessor = Preprocessor.from_pretrained(
                 self.model.model_dir,
-                sequence_length=kwargs.pop('sequence_length', 512))
+                sequence_length=sequence_length,
+                **kwargs)
         self.model.eval()

     def forward(self, inputs: Dict[str, Any],
@@ -1,36 +1,35 @@
 # Copyright (c) Alibaba, Inc. and its affiliates.
-from typing import Any, Dict, Optional, Union
-
-import torch
+from typing import Optional, Union

 from modelscope.metainfo import Pipelines
 from modelscope.models import Model
-from modelscope.outputs import OutputKeys
-from modelscope.pipelines.base import Pipeline
 from modelscope.pipelines.builder import PIPELINES
 from modelscope.pipelines.nlp import TokenClassificationPipeline
-from modelscope.preprocessors import (NERPreprocessorThai, NERPreprocessorViet,
-                                      Preprocessor,
-                                      TokenClassificationPreprocessor)
+from modelscope.preprocessors import Preprocessor
 from modelscope.utils.constant import Tasks
-from modelscope.utils.tensor_utils import (torch_nested_detach,
-                                           torch_nested_numpify)

-__all__ = [
-    'NamedEntityRecognitionPipeline', 'NamedEntityRecognitionThaiPipeline',
-    'NamedEntityRecognitionVietPipeline'
-]
+__all__ = ['NamedEntityRecognitionPipeline']


 @PIPELINES.register_module(
     Tasks.named_entity_recognition,
     module_name=Pipelines.named_entity_recognition)
+@PIPELINES.register_module(
+    Tasks.named_entity_recognition,
+    module_name=Pipelines.named_entity_recognition_thai)
+@PIPELINES.register_module(
+    Tasks.named_entity_recognition,
+    module_name=Pipelines.named_entity_recognition_viet)
 class NamedEntityRecognitionPipeline(TokenClassificationPipeline):

     def __init__(self,
                  model: Union[Model, str],
                  preprocessor: Optional[Preprocessor] = None,
+                 config_file: str = None,
+                 device: str = 'gpu',
+                 auto_collate=True,
+                 sequence_length=128,
                  **kwargs):
         """Use `model` and `preprocessor` to create a nlp NER pipeline for prediction
@@ -39,8 +38,8 @@ class NamedEntityRecognitionPipeline(TokenClassificationPipeline):
                 model id from the model hub, or a torch model instance.
             preprocessor (Preprocessor): An optional preprocessor instance, please make sure the preprocessor fits for
                 the model if supplied.
-            sequence_length: Max sequence length in the user's custom scenario. 512 will be used as a default value.
+            kwargs (dict, `optional`):
+                Extra kwargs passed into the preprocessor's constructor.

         Example:
             >>> from modelscope.pipelines import pipeline
             >>> pipeline_ins = pipeline(task='named-entity-recognition',
@@ -50,44 +49,17 @@ class NamedEntityRecognitionPipeline(TokenClassificationPipeline):
         To view other examples please check the tests/pipelines/test_named_entity_recognition.py.
         """
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
+        super().__init__(
+            model=model,
+            preprocessor=preprocessor,
+            config_file=config_file,
+            device=device,
+            auto_collate=auto_collate)
         if preprocessor is None:
-            self.preprocessor = TokenClassificationPreprocessor(
+            self.preprocessor = Preprocessor.from_pretrained(
                 self.model.model_dir,
-                sequence_length=kwargs.pop('sequence_length', 128))
+                sequence_length=sequence_length,
+                **kwargs)
         self.model.eval()
-        self.id2label = kwargs.get('id2label')
-        if self.id2label is None and hasattr(self.preprocessor, 'id2label'):
-            self.id2label = self.preprocessor.id2label
-
-
-@PIPELINES.register_module(
-    Tasks.named_entity_recognition,
-    module_name=Pipelines.named_entity_recognition_thai)
-class NamedEntityRecognitionThaiPipeline(NamedEntityRecognitionPipeline):
-
-    def __init__(self,
-                 model: Union[Model, str],
-                 preprocessor: Optional[Preprocessor] = None,
-                 **kwargs):
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
-        if preprocessor is None:
-            self.preprocessor = NERPreprocessorThai(
-                self.model.model_dir,
-                sequence_length=kwargs.pop('sequence_length', 512))
-
-
-@PIPELINES.register_module(
-    Tasks.named_entity_recognition,
-    module_name=Pipelines.named_entity_recognition_viet)
-class NamedEntityRecognitionVietPipeline(NamedEntityRecognitionPipeline):
-
-    def __init__(self,
-                 model: Union[Model, str],
-                 preprocessor: Optional[Preprocessor] = None,
-                 **kwargs):
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
-        if preprocessor is None:
-            self.preprocessor = NERPreprocessorViet(
-                self.model.model_dir,
-                sequence_length=kwargs.pop('sequence_length', 512))
+        assert hasattr(self.preprocessor, 'id2label')
+        self.id2label = self.preprocessor.id2label
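The Thai/Viet subclasses existed only to pin `NERPreprocessorThai`/`NERPreprocessorViet`; once `Preprocessor.from_pretrained` resolves the preprocessor from the model's own configuration, a single class can back all three registered pipeline names. Sketch (model id illustrative):

    from modelscope.pipelines import pipeline

    # Pipelines.named_entity_recognition_thai/viet remain registered, but
    # both now resolve to NamedEntityRecognitionPipeline; the language-
    # specific preprocessor comes from the model's configuration.
    ner = pipeline(
        'named-entity-recognition',
        model='damo/nlp_xlmr_named-entity-recognition_thai-ecommerce-title')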
@@ -22,7 +22,10 @@ class SentenceEmbeddingPipeline(Pipeline):
     def __init__(self,
                  model: Union[Model, str],
                  preprocessor: Optional[Preprocessor] = None,
-                 first_sequence='first_sequence',
+                 config_file: str = None,
+                 device: str = 'gpu',
+                 auto_collate=True,
+                 sequence_length=128,
                  **kwargs):
         """Use `model` and `preprocessor` to create a nlp text dual encoder then generates the text representation.
         Args:
@@ -30,15 +33,20 @@ class SentenceEmbeddingPipeline(Pipeline):
                 or a model id from the model hub, or a torch model instance.
             preprocessor (Preprocessor): An optional preprocessor instance, please make sure the preprocessor fits for
                 the model if supplied.
-            sequence_length: Max sequence length in the user's custom scenario. 128 will be used as a default value.
+            kwargs (dict, `optional`):
+                Extra kwargs passed into the preprocessor's constructor.
         """
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
+        super().__init__(
+            model=model,
+            preprocessor=preprocessor,
+            config_file=config_file,
+            device=device,
+            auto_collate=auto_collate)
         if preprocessor is None:
             self.preprocessor = Preprocessor.from_pretrained(
-                self.model.model_dir
-                if isinstance(self.model, Model) else model,
-                first_sequence=first_sequence,
-                sequence_length=kwargs.pop('sequence_length', 128))
+                self.model.model_dir,
+                sequence_length=sequence_length,
+                **kwargs)

     def forward(self, inputs: Dict[str, Any],
                 **forward_params) -> Dict[str, Any]:
@@ -1,12 +1,11 @@
 # Copyright (c) Alibaba, Inc. and its affiliates.
 from typing import Any, Dict, Optional, Union

-from modelscope.metainfo import Pipelines
-from modelscope.models.multi_modal import OfaForAllTasks
+from modelscope.metainfo import Pipelines, Preprocessors
 from modelscope.pipelines.base import Model, Pipeline
 from modelscope.pipelines.builder import PIPELINES
-from modelscope.preprocessors import OfaPreprocessor, Preprocessor
-from modelscope.utils.constant import Tasks
+from modelscope.preprocessors import Preprocessor
+from modelscope.utils.constant import Fields, Tasks
 from modelscope.utils.logger import get_logger

 logger = get_logger()
@@ -19,6 +18,9 @@ class SummarizationPipeline(Pipeline):
     def __init__(self,
                  model: Union[Model, str],
                  preprocessor: Optional[Preprocessor] = None,
+                 config_file: str = None,
+                 device: str = 'gpu',
+                 auto_collate=True,
                  **kwargs):
         """Use `model` and `preprocessor` to create a Summarization pipeline for prediction.
@@ -26,11 +28,25 @@ class SummarizationPipeline(Pipeline):
             model (str or Model): Supply either a local model dir which supported the summarization task,
                 or a model id from the model hub, or a model instance.
             preprocessor (Preprocessor): An optional preprocessor instance.
+            kwargs (dict, `optional`):
+                Extra kwargs passed into the preprocessor's constructor.
         """
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
+        super().__init__(
+            model=model,
+            preprocessor=preprocessor,
+            config_file=config_file,
+            device=device,
+            auto_collate=auto_collate)
         self.model.eval()
-        if preprocessor is None and isinstance(self.model, OfaForAllTasks):
-            self.preprocessor = OfaPreprocessor(model_dir=self.model.model_dir)
+        if preprocessor is None:
+            if self.model.__class__.__name__ == 'OfaForAllTasks':
+                self.preprocessor = Preprocessor.from_pretrained(
+                    self.model.model_dir,
+                    type=Preprocessors.ofa_tasks_preprocessor,
+                    field=Fields.multi_modal)
+            else:
+                self.preprocessor = Preprocessor.from_pretrained(
+                    self.model.model_dir, **kwargs)

     def postprocess(self, inputs: Dict[str, Any]) -> Dict[str, Any]:
         return inputs
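The explicit `type`/`field` arguments mirror how the OFA branch above pins its preprocessor, and the same override works anywhere a model's configuration lacks a usable preprocessor entry. Sketch (the local dir is illustrative):

    from modelscope.metainfo import Preprocessors
    from modelscope.preprocessors import Preprocessor
    from modelscope.utils.constant import Fields

    p = Preprocessor.from_pretrained(
        '/path/to/ofa_model_dir',  # illustrative path
        type=Preprocessors.ofa_tasks_preprocessor,
        field=Fields.multi_modal)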
@@ -33,6 +33,9 @@ class TableQuestionAnsweringPipeline(Pipeline):
                  model: Union[TableQuestionAnswering, str],
                  preprocessor: TableQuestionAnsweringPreprocessor = None,
                  db: Database = None,
+                 config_file: str = None,
+                 device: str = 'gpu',
+                 auto_collate=True,
                  **kwargs):
         """use `model` and `preprocessor` to create a table question answering prediction pipeline
@@ -40,11 +43,19 @@ class TableQuestionAnsweringPipeline(Pipeline):
             model (TableQuestionAnswering): a model instance
             preprocessor (TableQuestionAnsweringPreprocessor): a preprocessor instance
             db (Database): a database to store tables in the database
+            kwargs (dict, `optional`):
+                Extra kwargs passed into the preprocessor's constructor.
         """
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
+        super().__init__(
+            model=model,
+            preprocessor=preprocessor,
+            config_file=config_file,
+            device=device,
+            auto_collate=auto_collate)
         if preprocessor is None:
             self.preprocessor = TableQuestionAnsweringPreprocessor(
-                self.model.model_dir)
+                self.model.model_dir, **kwargs)

         # initialize tokenizer
         self.tokenizer = BertTokenizer(
@@ -1,113 +0,0 @@
-# Copyright (c) Alibaba, Inc. and its affiliates.
-from typing import Any, Dict, List, Optional, Union
-
-import torch
-from numpy import isin
-
-from modelscope.metainfo import Pipelines
-from modelscope.models.base import Model
-from modelscope.outputs import OutputKeys
-from modelscope.pipelines.base import Input, Pipeline, Tensor
-from modelscope.pipelines.builder import PIPELINES
-from modelscope.preprocessors import Text2TextGenerationPreprocessor
-from modelscope.utils.config import use_task_specific_params
-from modelscope.utils.constant import Tasks
-
-__all__ = ['Text2TextGenerationPipeline']
-
-TRANSLATE_PIPELINES = [
-    Pipelines.translation_en_to_de,
-    Pipelines.translation_en_to_ro,
-    Pipelines.translation_en_to_fr,
-]
-
-
-@PIPELINES.register_module(
-    Tasks.text2text_generation, module_name=Pipelines.text2text_generation)
-@PIPELINES.register_module(
-    Tasks.text2text_generation, module_name=Pipelines.translation_en_to_de)
-@PIPELINES.register_module(
-    Tasks.text2text_generation, module_name=Pipelines.translation_en_to_ro)
-@PIPELINES.register_module(
-    Tasks.text2text_generation, module_name=Pipelines.translation_en_to_fr)
-class Text2TextGenerationPipeline(Pipeline):
-
-    def __init__(
-            self,
-            model: Union[Model, str],
-            preprocessor: Optional[Text2TextGenerationPreprocessor] = None,
-            first_sequence='sentence',
-            **kwargs):
-        """Use `model` and `preprocessor` to create a text to text generation pipeline for prediction.
-
-        Args:
-            model (str or Model): Supply either a local model dir which supported the text generation task,
-                or a model id from the model hub, or a torch model instance.
-            preprocessor (Preprocessor): An optional preprocessor instance, please make sure the preprocessor fits for
-                the model if supplied.
-            first_sequence: The key to read the first sentence in.
-            sequence_length: Max sequence length in the user's custom scenario. 128 will be used as a default value.
-
-            NOTE: Inputs of type 'str' are also supported. In this scenario, the 'first_sequence'
-            param will have no effect.
-
-        Example:
-            >>> from modelscope.pipelines import pipeline
-            >>> pipeline_ins = pipeline(task='text2text-generation',
-            >>>     model='damo/nlp_t5_text2text-generation_chinese-base')
-            >>> sentence1 = '中国的首都位于<extra_id_0>。'
-            >>> print(pipeline_ins(sentence1))
-            >>> # Or use the dict input:
-            >>> print(pipeline_ins({'sentence': sentence1}))
-            >>> # 北京
-
-        To view other examples plese check the tests/pipelines/test_text_generation.py.
-        """
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
-
-        if preprocessor is None:
-            self.preprocessor = Text2TextGenerationPreprocessor(
-                self.model.model_dir,
-                sequence_length=kwargs.pop('sequence_length', 128))
-        self.tokenizer = self.preprocessor.tokenizer
-        self.pipeline = self.model.pipeline.type
-        self.model.eval()
-
-    def preprocess(self, inputs: Input, **preprocess_params) -> Dict[str, Any]:
-        """ Provide specific preprocess for text2text generation pipeline in order to handl multi tasks
-        """
-        if not isinstance(inputs, str):
-            raise ValueError(f'Not supported input type: {type(inputs)}')
-        if self.pipeline in TRANSLATE_PIPELINES:
-            use_task_specific_params(self.model, self.pipeline)
-            inputs = self.model.config.prefix + inputs
-        return super().preprocess(inputs, **preprocess_params)
-
-    def forward(self, inputs: Dict[str, Any],
-                **forward_params) -> Dict[str, Any]:
-        forward_params['min_length'] = forward_params.get(
-            'min_length', self.model.config.min_length)
-        forward_params['max_length'] = forward_params.get(
-            'max_length', self.model.config.max_length)
-        with torch.no_grad():
-            output_ids = self.model.generate(**inputs, **forward_params)
-        return {'output_ids': output_ids}
-
-    def postprocess(self, inputs: Dict[str, Tensor],
-                    **postprocess_params) -> Dict[str, str]:
-        """process the prediction results
-
-        Args:
-            inputs (Dict[str, Any]): _description_
-
-        Returns:
-            Dict[str, str]: the prediction results
-        """
-        output = self.tokenizer.decode(
-            inputs['output_ids'][0],
-            skip_special_tokens=True,
-        )
-        return {OutputKeys.TEXT: output}
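The deleted Text2TextGenerationPipeline is superseded by the text-generation stack; note the new `TextGenerationT5Pipeline` export in the `__all__` change below. Usage stays as in the removed docstring (model id taken from there):

    from modelscope.pipelines import pipeline

    t5 = pipeline('text2text-generation',
                  model='damo/nlp_t5_text2text-generation_chinese-base')
    print(t5('中国的首都位于<extra_id_0>。'))  # expected: 北京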
@@ -5,11 +5,14 @@ import numpy as np

 from modelscope.metainfo import Pipelines, Preprocessors
 from modelscope.models.base import Model
-from modelscope.outputs import OutputKeys
+from modelscope.outputs import OutputKeys, TextClassificationModelOutput
 from modelscope.pipelines.base import Pipeline
 from modelscope.pipelines.builder import PIPELINES
 from modelscope.preprocessors import Preprocessor
 from modelscope.utils.constant import Fields, Tasks
+from modelscope.utils.logger import get_logger
+
+logger = get_logger(__name__)


 @PIPELINES.register_module(
@@ -31,6 +34,9 @@ class TextClassificationPipeline(Pipeline):
     def __init__(self,
                  model: Union[Model, str],
                  preprocessor: Preprocessor = None,
+                 config_file: str = None,
+                 device: str = 'gpu',
+                 auto_collate=True,
                  **kwargs):
         """The inference pipeline for all the text classification sub-tasks.
@@ -38,10 +44,8 @@ class TextClassificationPipeline(Pipeline):
             model (`str` or `Model` or module instance): A model instance or a model local dir
                 or a model id in the model hub.
             preprocessor (`Preprocessor`, `optional`): A Preprocessor instance.
-            first_sequence (`str`, `optional`): The key of the first sentence.
-            second_sequence (`str`, `optional`): The key of the second sentence.
-            sequence_length (`int`, `optional`): The sequence length.
-            id2label (`dict`, `optional`): The id-label mapping.
+            kwargs (dict, `optional`):
+                Extra kwargs passed into the preprocessor's constructor.

         Example:
             >>> from modelscope.pipelines import pipeline
@@ -49,31 +53,38 @@ class TextClassificationPipeline(Pipeline):
                 model='damo/nlp_structbert_sentence-similarity_chinese-base')
             >>> input = ('这是个测试', '这也是个测试')
             >>> print(pipeline_ins(input))
-        NOTE: Inputs of type 'str' are also supported. In this scenario, the 'first_sequence' and 'second_sequence'
-        param will have no affection.
         """
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
+        super().__init__(
+            model=model,
+            preprocessor=preprocessor,
+            config_file=config_file,
+            device=device,
+            auto_collate=auto_collate)
         if preprocessor is None:
             if self.model.__class__.__name__ == 'OfaForAllTasks':
                 self.preprocessor = Preprocessor.from_pretrained(
                     model_name_or_path=self.model.model_dir,
                     type=Preprocessors.ofa_tasks_preprocessor,
-                    field=Fields.multi_modal)
+                    field=Fields.multi_modal,
+                    **kwargs)
             else:
                 first_sequence = kwargs.pop('first_sequence', 'first_sequence')
                 second_sequence = kwargs.pop('second_sequence', None)
+                sequence_length = kwargs.pop('sequence_length', 512)
                 self.preprocessor = Preprocessor.from_pretrained(
-                    self.model
-                    if isinstance(self.model, str) else self.model.model_dir,
-                    first_sequence=first_sequence,
-                    second_sequence=second_sequence,
-                    sequence_length=kwargs.pop('sequence_length', 512))
-        self.id2label = kwargs.get('id2label')
-        if self.id2label is None and hasattr(self.preprocessor, 'id2label'):
-            self.id2label = self.preprocessor.id2label
+                    self.model.model_dir, **{
+                        'first_sequence': first_sequence,
+                        'second_sequence': second_sequence,
+                        'sequence_length': sequence_length,
+                        **kwargs
+                    })
+        assert hasattr(self.preprocessor, 'id2label')
+        self.id2label = self.preprocessor.id2label
+        if self.id2label is None:
+            logger.warn(
+                'The id2label mapping is None, will return original ids.')

     def forward(self, inputs: Dict[str, Any],
                 **forward_params) -> Dict[str, Any]:
@@ -82,16 +93,17 @@ class TextClassificationPipeline(Pipeline):
         return self.model(**inputs, **forward_params)

     def postprocess(self,
-                    inputs: Dict[str, Any],
-                    topk: int = 5) -> Dict[str, str]:
-        """process the prediction results
+                    inputs: Union[Dict[str, Any],
+                                  TextClassificationModelOutput],
+                    topk: int = None) -> Dict[str, Any]:
+        """Process the prediction results

         Args:
             inputs (`Dict[str, Any]` or `TextClassificationModelOutput`): The model output, please check
                 the `TextClassificationModelOutput` class for details.
             topk (int): The topk probs to take

         Returns:
-            Dict[str, str]: the prediction results.
+            Dict[str, Any]: the prediction results.
                 scores: The probabilities of each label.
                 labels: The real labels.
             Label at index 0 is the smallest probability.
@@ -99,8 +111,6 @@ class TextClassificationPipeline(Pipeline):
         if self.model.__class__.__name__ == 'OfaForAllTasks':
             return inputs
         else:
-            assert self.id2label is not None, 'Cannot convert id to the original label, please pass in the mapping ' \
-                'as a parameter or make sure the preprocessor has the attribute.'
             logits = inputs[OutputKeys.LOGITS].cpu().numpy()
             if logits.shape[0] == 1:
                 logits = logits[0]
@@ -111,20 +121,24 @@ class TextClassificationPipeline(Pipeline):
             probs = softmax(logits)
             num_classes = probs.shape[-1]
-            topk = min(topk, num_classes)
+            topk = min(topk, num_classes) if topk is not None else num_classes
             top_indices = np.argpartition(probs, -topk)[-topk:]
             probs = np.take_along_axis(probs, top_indices, axis=-1).tolist()

             def map_to_label(id):
-                if id in self.id2label:
-                    return self.id2label[id]
-                elif str(id) in self.id2label:
-                    return self.id2label[str(id)]
+                if self.id2label is not None:
+                    if id in self.id2label:
+                        return self.id2label[id]
+                    elif str(id) in self.id2label:
+                        return self.id2label[str(id)]
+                    else:
+                        raise Exception(
+                            f'id {id} not found in id2label: {self.id2label}')
                 else:
-                    raise Exception('id not found in id2label')
+                    return id

             v_func = np.vectorize(map_to_label)
-            return {
-                OutputKeys.SCORES: probs,
-                OutputKeys.LABELS: v_func(top_indices).tolist()
-            }
+            top_indices = v_func(top_indices).tolist()
+            probs = list(reversed(probs))
+            top_indices = list(reversed(top_indices))
+            return {OutputKeys.SCORES: probs, OutputKeys.LABELS: top_indices}
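Three behavior changes worth flagging in `postprocess`: `topk=None` now means "all classes", a missing id2label no longer raises (raw ids are returned after the constructor's warning), and the output lists are reversed so the highest-probability label comes first, even though the unchanged docstring line "Label at index 0 is the smallest probability" still describes the old order. A standalone numpy sketch of the new logic (the function name is mine, not from the diff):

    import numpy as np

    def postprocess_topk(logits, id2label=None, topk=None):
        probs = np.exp(logits - logits.max())
        probs = probs / probs.sum()  # softmax
        num_classes = probs.shape[-1]
        topk = min(topk, num_classes) if topk is not None else num_classes
        # np.argpartition only guarantees the k-th position, as in the diff
        top = np.argpartition(probs, -topk)[-topk:]
        scores = np.take_along_axis(probs, top, axis=-1).tolist()
        labels = [id2label[int(i)] if id2label else int(i) for i in top]
        return list(reversed(scores)), list(reversed(labels))

    print(postprocess_topk(
        np.array([0.1, 2.0, -1.0]), {0: 'neg', 1: 'pos', 2: 'neu'}, topk=2))
    # the highest-probability label ('pos') comes first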
@@ -10,7 +10,7 @@ from modelscope.models.nlp import BartForTextErrorCorrection
 from modelscope.outputs import OutputKeys
 from modelscope.pipelines.base import Pipeline, Tensor
 from modelscope.pipelines.builder import PIPELINES
-from modelscope.preprocessors import TextErrorCorrectionPreprocessor
+from modelscope.preprocessors import Preprocessor
 from modelscope.utils.constant import Tasks

 __all__ = ['TextErrorCorrectionPipeline']
@@ -20,17 +20,20 @@ __all__ = ['TextErrorCorrectionPipeline']
     Tasks.text_error_correction, module_name=Pipelines.text_error_correction)
 class TextErrorCorrectionPipeline(Pipeline):

-    def __init__(
-            self,
-            model: Union[BartForTextErrorCorrection, str],
-            preprocessor: Optional[TextErrorCorrectionPreprocessor] = None,
-            **kwargs):
+    def __init__(self,
+                 model: Union[Model, str],
+                 preprocessor: Optional[Preprocessor] = None,
+                 config_file: str = None,
+                 device: str = 'gpu',
+                 auto_collate=True,
+                 **kwargs):
         """use `model` and `preprocessor` to create a nlp text correction pipeline.

         Args:
             model (BartForTextErrorCorrection): A model instance, or a model local dir, or a model id in the model hub.
             preprocessor (TextErrorCorrectionPreprocessor): An optional preprocessor instance.
+            kwargs (dict, `optional`):
+                Extra kwargs passed into the preprocessor's constructor.

         Example:
         >>> from modelscope.pipelines import pipeline
         >>> pipeline_ins = pipeline(
@@ -38,13 +41,17 @@ class TextErrorCorrectionPipeline(Pipeline):
         >>> sentence1 = '随着中国经济突飞猛近,建造工业与日俱增'
         >>> print(pipeline_ins(sentence1))

-        To view other examples plese check the tests/pipelines/test_text_error_correction.py.
+        To view other examples please check tests/pipelines/test_text_error_correction.py.
         """
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
+        super().__init__(
+            model=model,
+            preprocessor=preprocessor,
+            config_file=config_file,
+            device=device,
+            auto_collate=auto_collate)
         if preprocessor is None:
-            self.preprocessor = TextErrorCorrectionPreprocessor(
-                self.model.model_dir)
+            self.preprocessor = Preprocessor.from_pretrained(
+                self.model.model_dir, **kwargs)
         self.vocab = self.preprocessor.vocab

     def forward(self, inputs: Dict[str, Any],
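With the rewritten `__init__`, extra keyword arguments flow straight into `Preprocessor.from_pretrained`. A usage sketch; the model dir is a placeholder, and `sequence_length` is just an example of a preprocessor kwarg:

```python
from modelscope.pipelines.nlp import TextErrorCorrectionPipeline

# '<model-dir>' stands for a local model directory or a hub model id
pipe = TextErrorCorrectionPipeline(
    '<model-dir>', device='cpu', sequence_length=128)
print(pipe('随着中国经济突飞猛近,建造工业与日俱增'))
```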
@@ -1,20 +1,22 @@
 # Copyright (c) Alibaba, Inc. and its affiliates.
+import os
 from typing import Any, Dict, Optional, Union

 import torch

 from modelscope.metainfo import Pipelines
 from modelscope.models.base import Model
-from modelscope.outputs import OutputKeys
+from modelscope.outputs import (ModelOutputBase, OutputKeys,
+                                TokenGeneratorOutput)
 from modelscope.pipelines.base import Pipeline, Tensor
 from modelscope.pipelines.builder import PIPELINES
-from modelscope.preprocessors import Preprocessor, build_preprocessor
+from modelscope.preprocessors import Preprocessor
 from modelscope.utils.chinese_utils import remove_space_between_chinese_chars
-from modelscope.utils.constant import Fields, Tasks
-from modelscope.utils.hub import read_config
+from modelscope.utils.constant import Tasks
+from modelscope.utils.hub import Config, read_config

-__all__ = ['TextGenerationPipeline']
+__all__ = ['TextGenerationPipeline', 'TextGenerationT5Pipeline']

 @PIPELINES.register_module(
@@ -24,7 +26,11 @@ class TextGenerationPipeline(Pipeline):
     def __init__(self,
                  model: Union[Model, str],
                  preprocessor: Optional[Preprocessor] = None,
+                 config_file: str = None,
+                 device: str = 'gpu',
+                 auto_collate=True,
                  first_sequence='sentence',
+                 sequence_length=128,
                  **kwargs):
         """Use `model` and `preprocessor` to create a generation pipeline for prediction.
@@ -33,11 +39,8 @@
             or a model id from the model hub, or a torch model instance.
             preprocessor (Preprocessor): An optional preprocessor instance, please make sure the preprocessor fits for
             the model if supplied.
-            first_sequence: The key to read the first sentence in.
-            sequence_length: Max sequence length in the user's custom scenario. 128 will be used as a default value.
-            NOTE: Inputs of type 'str' are also supported. In this scenario, the 'first_sequence'
-            param will have no effect.
+            kwargs (dict, `optional`):
+                Extra kwargs passed into the preprocessor's constructor.

         Example:
         >>> from modelscope.pipelines import pipeline
@@ -49,26 +52,29 @@
         >>> # Or use the dict input:
         >>> print(pipeline_ins({'sentence': sentence1}))

-        To view other examples plese check the tests/pipelines/test_text_generation.py.
+        To view other examples please check tests/pipelines/test_text_generation.py.
         """
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
-        cfg = read_config(self.model.model_dir)
-        self.postprocessor = cfg.pop('postprocessor', 'decode')
+        super().__init__(
+            model=model,
+            preprocessor=preprocessor,
+            config_file=config_file,
+            device=device,
+            auto_collate=auto_collate)
         if preprocessor is None:
-            preprocessor_cfg = cfg.preprocessor
-            preprocessor_cfg.update({
-                'model_dir':
-                self.model.model_dir,
-                'first_sequence':
-                first_sequence,
-                'second_sequence':
-                None,
-                'sequence_length':
-                kwargs.pop('sequence_length', 128)
-            })
-            self.preprocessor = build_preprocessor(preprocessor_cfg,
-                                                   Fields.nlp)
+            self.preprocessor = Preprocessor.from_pretrained(
+                self.model.model_dir,
+                first_sequence=first_sequence,
+                sequence_length=sequence_length,
+                **kwargs)
         self.model.eval()
+        self.postprocessor = kwargs.pop('postprocessor', None)
+        if self.postprocessor is None and hasattr(self.model, 'model_dir'):
+            # Compatible with old code
+            cfg = read_config(self.model.model_dir)
+            self.postprocessor = cfg.get('postprocessor')
+        if self.postprocessor is None:
+            self.postprocessor = 'decode'

     def _sanitize_parameters(self, **pipeline_parameters):
         return {}, pipeline_parameters, {}
@@ -79,20 +85,19 @@
         return self.model.generate(inputs, **forward_params)

     def decode(self, inputs) -> str:
-        tokenizer = self.preprocessor.tokenizer
-        return tokenizer.decode(inputs.tolist(), skip_special_tokens=True)
+        return self.preprocessor.decode(
+            inputs.tolist(), skip_special_tokens=True)

     def sentence_piece(self, inputs) -> str:
-        tokenizer = self.preprocessor.tokenizer
-        return tokenizer.decode(inputs.tolist())
+        return self.preprocessor.decode(inputs.tolist())

     def roberta(self, inputs) -> str:
-        tokenizer = self.preprocessor.tokenizer
-        decoded = tokenizer.decode(inputs.tolist())
+        decoded = self.preprocessor.decode(inputs.tolist())
         return decoded.replace('<q>', '. ').replace('<mask>',
                                                     '. ').replace('</s>', '')

-    def postprocess(self, inputs: Dict[str, Tensor],
+    def postprocess(self, inputs: Union[Dict[str, Tensor],
+                                        TokenGeneratorOutput],
                     **postprocess_params) -> Dict[str, str]:
         """process the prediction results
@@ -102,9 +107,72 @@
         Returns:
             Dict[str, str]: the prediction results
         """
-        inputs = inputs['sequences']
+        if isinstance(inputs, (dict, ModelOutputBase)):
+            inputs = inputs['sequences']
         if isinstance(inputs, list) or len(inputs.shape) > 1:
             inputs = inputs[0]
         decoded = getattr(self, self.postprocessor)(inputs)
         text = remove_space_between_chinese_chars(decoded)
         return {OutputKeys.TEXT: text}
+
+@PIPELINES.register_module(
+    Tasks.text2text_generation, module_name=Pipelines.translation_en_to_de)
+@PIPELINES.register_module(
+    Tasks.text2text_generation, module_name=Pipelines.translation_en_to_ro)
+@PIPELINES.register_module(
+    Tasks.text2text_generation, module_name=Pipelines.translation_en_to_fr)
+@PIPELINES.register_module(
+    Tasks.text2text_generation, module_name=Pipelines.text2text_generation)
+class TextGenerationT5Pipeline(TextGenerationPipeline):
+
+    def __init__(self,
+                 model: Union[Model, str],
+                 preprocessor: Optional[Preprocessor] = None,
+                 sub_task=None,
+                 **kwargs):
+        super().__init__(model, preprocessor, **kwargs)
+        self.sub_task = sub_task
+        self.task_specific_params = self._parse_specific_model_params(
+            getattr(self.model, 'model_dir', None), 'task_specific_params')
+        self.min_length = self._parse_specific_model_params(
+            getattr(self.model, 'model_dir', None), 'min_length')
+        self.max_length = self._parse_specific_model_params(
+            getattr(self.model, 'model_dir', None), 'max_length')
+
+    def _parse_specific_model_params(self, model_dir, key):
+        if model_dir is None:
+            return
+        cfg: Config = read_config(model_dir)
+        params = cfg.safe_get(f'model.{key}')
+        if params is None:
+            cfg: Config = read_config(os.path.join(model_dir, 'config.json'))
+            params = cfg.safe_get(key)
+        return params
+
+    def preprocess(self, inputs, **preprocess_params) -> Dict[str, Any]:
+        if not isinstance(inputs, str):
+            raise ValueError(f'Not supported input type: {type(inputs)}')
+        if self.task_specific_params is not None:
+            sub_task = self.sub_task or self.model.pipeline.type
+            if sub_task in self.task_specific_params:
+                self.model.config.update(self.task_specific_params[sub_task])
+                if 'prefix' in self.task_specific_params[sub_task]:
+                    inputs = self.task_specific_params[sub_task].prefix + inputs
+        return super().preprocess(inputs, **preprocess_params)
+
+    def forward(self, inputs: Dict[str, Any],
+                **forward_params) -> Dict[str, Any]:
+        min_length = forward_params.get('min_length', self.min_length)
+        max_length = forward_params.get('max_length', self.max_length)
+        if min_length is not None:
+            forward_params['min_length'] = min_length
+        if max_length is not None:
+            forward_params['max_length'] = max_length
+        with torch.no_grad():
+            return self.model.generate(**inputs, **forward_params)
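`_parse_specific_model_params` prefers `model.<key>` from the ModelScope configuration and falls back to the raw HF `config.json`. A dict-based sketch of that two-level lookup (plain dicts stand in for `Config.safe_get`):

```python
def parse_specific_model_params(ms_cfg: dict, hf_cfg: dict, key: str):
    # prefer the ModelScope config's model section, then fall back to config.json
    value = ms_cfg.get('model', {}).get(key)
    if value is None:
        value = hf_cfg.get(key)
    return value

ms_cfg = {'model': {'min_length': 10}}
hf_cfg = {'min_length': 5, 'max_length': 50}
print(parse_specific_model_params(ms_cfg, hf_cfg, 'min_length'))  # 10, ModelScope wins
print(parse_specific_model_params(ms_cfg, hf_cfg, 'max_length'))  # 50, fallback
```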
@@ -9,7 +9,8 @@ from modelscope.models import Model
 from modelscope.outputs import OutputKeys
 from modelscope.pipelines.base import Pipeline
 from modelscope.pipelines.builder import PIPELINES
-from modelscope.preprocessors import Preprocessor, TextRankingPreprocessor
+from modelscope.preprocessors import (Preprocessor,
+                                      TextRankingTransformersPreprocessor)
 from modelscope.utils.constant import Tasks

 __all__ = ['TextRankingPipeline']
@@ -22,6 +23,10 @@ class TextRankingPipeline(Pipeline):
     def __init__(self,
                  model: Union[Model, str],
                  preprocessor: Optional[Preprocessor] = None,
+                 config_file: str = None,
+                 device: str = 'gpu',
+                 auto_collate=True,
+                 sequence_length=128,
                  **kwargs):
         """Use `model` and `preprocessor` to create a nlp text ranking pipeline for prediction.
@@ -30,14 +35,21 @@
             or a model id from the model hub, or a torch model instance.
             preprocessor (Preprocessor): An optional preprocessor instance, please make sure the preprocessor fits for
             the model if supplied.
+            sequence_length: Max sequence length in the user's custom scenario. 128 will be used as a default value.
+            kwargs (dict, `optional`):
+                Extra kwargs passed into the preprocessor's constructor.
         """
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
+        super().__init__(
+            model=model,
+            preprocessor=preprocessor,
+            config_file=config_file,
+            device=device,
+            auto_collate=auto_collate)
         if preprocessor is None:
             self.preprocessor = Preprocessor.from_pretrained(
                 self.model.model_dir,
-                sequence_length=kwargs.pop('sequence_length', 128))
+                sequence_length=sequence_length,
+                **kwargs)

     def forward(self, inputs: Dict[str, Any],
                 **forward_params) -> Dict[str, Any]:
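The same pattern applies here: `sequence_length` is now an explicit argument, and remaining kwargs still reach `Preprocessor.from_pretrained`. A hypothetical instantiation (placeholder model dir; the input keys follow the usual text-ranking convention and may differ per model):

```python
from modelscope.pipelines.nlp import TextRankingPipeline

pipe = TextRankingPipeline('<model-dir>', sequence_length=256, device='cpu')
result = pipe({
    'source_sentence': ['功和功率的区别'],
    'sentences_to_compare': ['功反映做功多少', '功率反映做功快慢'],
})
```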
@@ -1,7 +1,8 @@
 # Copyright (c) Alibaba, Inc. and its affiliates.
-from typing import Any, Dict, Optional, Union
+from typing import Any, Dict, List, Optional, Union

+import numpy as np
 import torch

 from modelscope.metainfo import Pipelines
@@ -32,24 +33,35 @@ class TokenClassificationPipeline(Pipeline):
     def __init__(self,
                  model: Union[Model, str],
                  preprocessor: Optional[Preprocessor] = None,
+                 config_file: str = None,
+                 device: str = 'gpu',
+                 auto_collate=True,
+                 sequence_length=128,
                  **kwargs):
         """use `model` and `preprocessor` to create a token classification pipeline for prediction

         Args:
             model (str or Model): A model instance or a model local dir or a model id in the model hub.
             preprocessor (Preprocessor): a preprocessor instance, must not be None.
+            kwargs (dict, `optional`):
+                Extra kwargs passed into the preprocessor's constructor.
         """
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
+        super().__init__(
+            model=model,
+            preprocessor=preprocessor,
+            config_file=config_file,
+            device=device,
+            auto_collate=auto_collate)
         if preprocessor is None:
             self.preprocessor = Preprocessor.from_pretrained(
                 self.model.model_dir,
-                sequence_length=kwargs.pop('sequence_length', 128))
+                sequence_length=sequence_length,
+                **kwargs)
         self.model.eval()
-        self.id2label = kwargs.get('id2label')
-        if self.id2label is None and hasattr(self.preprocessor, 'id2label'):
-            self.id2label = self.preprocessor.id2label
+        assert hasattr(self.preprocessor, 'id2label')
+        self.id2label = self.preprocessor.id2label

     def forward(self, inputs: Dict[str, Any],
                 **forward_params) -> Dict[str, Any]:
@@ -60,53 +72,59 @@ class TokenClassificationPipeline(Pipeline):
         }

     def postprocess(self, inputs: Dict[str, Any],
-                    **postprocess_params) -> Dict[str, str]:
-        """process the prediction results
+                    **postprocess_params) -> Dict[str, Any]:
+        """Process the prediction results

         Args:
             inputs (Dict[str, Any]): should be tensors from model

         Returns:
-            Dict[str, str]: the prediction results
+            Dict[str, Any]: the prediction results
         """
         chunks = self._chunk_process(inputs, **postprocess_params)
-        # for cws outputs
-        if len(chunks) > 0 and chunks[0]['type'].lower() == 'cws':
-            spans = [
-                chunk['span'] for chunk in chunks if chunk['span'].strip()
-            ]
-            seg_result = [span for span in spans]
-            outputs = {OutputKeys.OUTPUT: seg_result}
-        # for ner outputs
-        else:
-            outputs = {OutputKeys.OUTPUT: chunks}
-        return outputs
+        return {OutputKeys.OUTPUT: chunks}

     def _chunk_process(self, inputs: Dict[str, Any],
-                       **postprocess_params) -> Dict[str, str]:
+                       **postprocess_params) -> List:
         """process the prediction results and output as chunks

         Args:
             inputs (Dict[str, Any]): should be tensors from model

         Returns:
-            Dict[str, str]: the prediction results
+            List: The output chunks
         """
         text = inputs['text']
+        # TODO post_process does not support batch for now.
         if OutputKeys.PREDICTIONS not in inputs:
             logits = inputs[OutputKeys.LOGITS]
-            predictions = torch.argmax(logits[0], dim=-1)
+            if len(logits.shape) == 3:
+                logits = logits[0]
+            predictions = torch.argmax(logits, dim=-1)
         else:
-            predictions = inputs[OutputKeys.PREDICTIONS].squeeze(
-                0).cpu().numpy()
+            predictions = inputs[OutputKeys.PREDICTIONS]
+            if len(predictions.shape) == 2:
+                predictions = predictions[0]
+
+        offset_mapping = inputs['offset_mapping']
+        if len(offset_mapping.shape) == 3:
+            offset_mapping = offset_mapping[0]
+
+        label_mask = inputs.get('label_mask')
+        if label_mask is not None:
+            masked_lengths = label_mask.sum(-1).long().cpu().item()
+            offset_mapping = torch.narrow(
+                offset_mapping, 0, 0,
+                masked_lengths)  # narrow only moves the offset, it does not resize
+            predictions = torch.narrow(
+                predictions, 0, 0,
+                masked_lengths)  # narrow only moves the offset, it does not resize
+
+        offset_mapping = torch_nested_numpify(
+            torch_nested_detach(offset_mapping))
         predictions = torch_nested_numpify(torch_nested_detach(predictions))
-        offset_mapping = [x.cpu().tolist() for x in inputs['offset_mapping']]
         labels = [self.id2label[x] for x in predictions]
+        if len(labels) > len(offset_mapping):
+            labels = labels[1:-1]

         chunks = []
         chunk = {}
         for label, offsets in zip(labels, offset_mapping):
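The new `label_mask` branch trims padded positions before decoding. A small torch sketch of the `narrow` logic on toy tensors:

```python
import torch

# batch of 1, sequence of 6, of which only the first 4 positions carry labels
predictions = torch.tensor([[1, 2, 2, 0, 0, 0]])[0]
offset_mapping = torch.tensor([[[0, 1], [1, 2], [2, 3], [3, 4], [0, 0], [0, 0]]])[0]
label_mask = torch.tensor([[1, 1, 1, 1, 0, 0]])

masked_length = label_mask.sum(-1).long().item()
# narrow returns a view of the first `masked_length` entries, no copy
predictions = torch.narrow(predictions, 0, 0, masked_length)
offset_mapping = torch.narrow(offset_mapping, 0, 0, masked_length)
print(predictions.tolist())     # [1, 2, 2, 0]
print(offset_mapping.tolist())  # [[0, 1], [1, 2], [2, 3], [3, 4]]
```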
@@ -2,19 +2,15 @@
 import io
 import os
-from typing import Any, Dict, Union
+from typing import Any, Dict

-import numpy as np
 import torch
 from transformers import XLMRobertaTokenizer

 from modelscope.metainfo import Pipelines
-from modelscope.models import Model
-from modelscope.models.nlp import BertForSequenceClassification
 from modelscope.outputs import OutputKeys
-from modelscope.pipelines.base import Input, Pipeline
+from modelscope.pipelines.base import Pipeline
 from modelscope.pipelines.builder import PIPELINES
-from modelscope.preprocessors import SequenceClassificationPreprocessor
 from modelscope.utils.constant import ModelFile, Tasks

 __all__ = ['TranslationQualityEstimationPipeline']
@@ -10,9 +10,9 @@ from modelscope.outputs import OutputKeys
 from modelscope.pipelines.base import Pipeline
 from modelscope.pipelines.builder import PIPELINES
 from modelscope.pipelines.nlp import TokenClassificationPipeline
-from modelscope.preprocessors import (Preprocessor,
-                                      TokenClassificationPreprocessor,
-                                      WordSegmentationPreprocessorThai)
+from modelscope.preprocessors import (
+    Preprocessor, TokenClassificationTransformersPreprocessor,
+    WordSegmentationPreprocessorThai)
 from modelscope.utils.constant import Tasks
 from modelscope.utils.tensor_utils import (torch_nested_detach,
                                            torch_nested_numpify)
@@ -23,42 +23,49 @@ __all__ = ['WordSegmentationPipeline', 'WordSegmentationThaiPipeline']
 @PIPELINES.register_module(
     Tasks.word_segmentation, module_name=Pipelines.word_segmentation)
 class WordSegmentationPipeline(TokenClassificationPipeline):
+    """Use `model` and `preprocessor` to create a nlp word segment pipeline for prediction.
+
+    NOTE: The preprocessor will first split the sentence into single characters,
+    then feed them into the tokenizer with the parameter is_split_into_words=True.
+
+    Example:
+    >>> from modelscope.pipelines import pipeline
+    >>> pipeline_ins = pipeline(task='word-segmentation',
+    >>>    model='damo/nlp_structbert_word-segmentation_chinese-base')
+    >>> sentence1 = '今天天气不错,适合出去游玩'
+    >>> print(pipeline_ins(sentence1))
+
+    To view other examples please check tests/pipelines/test_word_segmentation.py.
+    """

-    def __init__(self,
-                 model: Union[Model, str],
-                 preprocessor: Optional[Preprocessor] = None,
-                 **kwargs):
-        """Use `model` and `preprocessor` to create a nlp word segment pipeline for prediction.
-
-        Args:
-            model (str or Model): Supply either a local model dir which supported the WS task,
-            or a model id from the model hub, or a torch model instance.
-            preprocessor (Preprocessor): An optional preprocessor instance, please make sure the preprocessor fits for
-            the model if supplied.
-            sequence_length: Max sequence length in the user's custom scenario. 128 will be used as a default value.
-
-        NOTE: The preprocessor will first split the sentence into single characters,
-        then feed them into the tokenizer with the parameter is_split_into_words=True.
-
-        Example:
-        >>> from modelscope.pipelines import pipeline
-        >>> pipeline_ins = pipeline(task='word-segmentation',
-        >>>    model='damo/nlp_structbert_word-segmentation_chinese-base')
-        >>> sentence1 = '今天天气不错,适合出去游玩'
-        >>> print(pipeline_ins(sentence1))
-
-        To view other examples plese check the tests/pipelines/test_word_segmentation.py.
-        """
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
-        if preprocessor is None:
-            self.preprocessor = TokenClassificationPreprocessor(
-                self.model.model_dir,
-                sequence_length=kwargs.pop('sequence_length', 128))
-        self.model.eval()
-        self.id2label = kwargs.get('id2label')
-        if self.id2label is None and hasattr(self.preprocessor, 'id2label'):
-            self.id2label = self.preprocessor.id2label
+    def postprocess(self,
+                    inputs: Dict[str, Any],
+                    output_final_sentence=True,
+                    **postprocess_params) -> Dict[str, Any]:
+        """Process the prediction results
+
+        Args:
+            inputs (Dict[str, Any]): should be tensors from model
+            output_final_sentence (bool): Whether to output the cut sentence split by blanks.
+                If False, the pipeline will output the original token-label information.
+
+        Returns:
+            Dict[str, Any]: The prediction results.
+        """
+        chunks = self._chunk_process(inputs, **postprocess_params)
+        # for cws outputs
+        if output_final_sentence:
+            spans = [
+                chunk['span'] for chunk in chunks if chunk['span'].strip()
+            ]
+            seg_result = [span for span in spans]
+            outputs = {OutputKeys.OUTPUT: seg_result}
+        # for ner outputs
+        else:
+            outputs = {OutputKeys.OUTPUT: chunks}
+        return outputs

 @PIPELINES.register_module(
@@ -66,8 +73,10 @@ class WordSegmentationPipeline(TokenClassificationPipeline):
     module_name=Pipelines.multilingual_word_segmentation)
 class MultilingualWordSegmentationPipeline(WordSegmentationPipeline):

-    def postprocess(self, inputs: Dict[str, Any],
-                    **postprocess_params) -> Dict[str, str]:
+    def postprocess(self,
+                    inputs: Dict[str, Any],
+                    output_final_sentence=True,
+                    **postprocess_params) -> Dict[str, Any]:
         chunks = self._chunk_process(inputs, **postprocess_params)
         word_segments = [entity['span'] for entity in chunks]
         return {OutputKeys.OUTPUT: word_segments}
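`output_final_sentence` now controls whether CWS output is collapsed into a token list or left as raw chunks. A sketch of the two output shapes, assuming each chunk is a dict with a `span` field:

```python
chunks = [{'span': '今天'}, {'span': '天气'}, {'span': ' '}]

# output_final_sentence=True: keep only non-blank spans
print({'output': [c['span'] for c in chunks if c['span'].strip()]})
# {'output': ['今天', '天气']}

# output_final_sentence=False: the raw chunk dicts pass through
print({'output': chunks})
```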
@@ -80,14 +89,22 @@ class WordSegmentationThaiPipeline(MultilingualWordSegmentationPipeline):
     def __init__(self,
                  model: Union[Model, str],
                  preprocessor: Optional[Preprocessor] = None,
+                 config_file: str = None,
+                 device: str = 'gpu',
+                 auto_collate=True,
+                 sequence_length=512,
                  **kwargs):
-        model = model if isinstance(model,
-                                    Model) else Model.from_pretrained(model)
+        super().__init__(
+            model=model,
+            preprocessor=preprocessor,
+            config_file=config_file,
+            device=device,
+            auto_collate=auto_collate)
         if preprocessor is None:
-            preprocessor = WordSegmentationPreprocessorThai(
-                model.model_dir,
-                sequence_length=kwargs.pop('sequence_length', 512))
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
+            self.preprocessor = WordSegmentationPreprocessorThai(
+                self.model.model_dir,
+                sequence_length=sequence_length,
+                **kwargs)

     def postprocess(self, inputs: Dict[str, Any],
                     **postprocess_params) -> Dict[str, str]:
@@ -10,8 +10,7 @@
 from modelscope.outputs import OutputKeys
 from modelscope.pipelines.base import Pipeline
 from modelscope.pipelines.builder import PIPELINES
-from modelscope.preprocessors import (Preprocessor,
-                                      ZeroShotClassificationPreprocessor)
+from modelscope.preprocessors import Preprocessor
 from modelscope.utils.constant import Tasks

 __all__ = ['ZeroShotClassificationPipeline']
@@ -25,6 +24,10 @@ class ZeroShotClassificationPipeline(Pipeline):
     def __init__(self,
                  model: Union[Model, str],
                  preprocessor: Preprocessor = None,
+                 config_file: str = None,
+                 device: str = 'gpu',
+                 auto_collate=True,
+                 sequence_length=512,
                  **kwargs):
         """Use `model` and `preprocessor` to create a nlp zero-shot classification pipeline for prediction.
@@ -44,7 +47,8 @@
             or a model id from the model hub, or a torch model instance.
             preprocessor (Preprocessor): An optional preprocessor instance, please make sure the preprocessor fits for
             the model if supplied.
-            sequence_length: Max sequence length in the user's custom scenario. 512 will be used as a default value.
+            kwargs (dict, `optional`):
+                Extra kwargs passed into the preprocessor's constructor.

         Example:
         >>> from modelscope.pipelines import pipeline
@@ -55,17 +59,22 @@
         >>> template = '这篇文章的标题是{}'
         >>> print(pipeline_ins(sentence1, candidate_labels=labels, hypothesis_template=template))

-        To view other examples plese check the tests/pipelines/test_zero_shot_classification.py.
+        To view other examples please check tests/pipelines/test_zero_shot_classification.py.
         """
-        assert isinstance(model, str) or isinstance(model, Model), \
-            'model must be a single str or Model'
-        super().__init__(model=model, preprocessor=preprocessor, **kwargs)
+        super().__init__(
+            model=model,
+            preprocessor=preprocessor,
+            config_file=config_file,
+            device=device,
+            auto_collate=auto_collate)
         self.entailment_id = 0
         self.contradiction_id = 2
         if preprocessor is None:
-            self.preprocessor = ZeroShotClassificationPreprocessor(
-                self.model.model_dir,
-                sequence_length=kwargs.pop('sequence_length', 512))
+            sequence_length = kwargs.pop('sequence_length', 512)
+            self.preprocessor = Preprocessor.from_pretrained(
+                self.model.model_dir,
+                sequence_length=sequence_length,
+                **kwargs)
         self.model.eval()

     def _sanitize_parameters(self, **kwargs):
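Zero-shot classification scores each candidate label by pairing the premise with a hypothesis built from the template and reading the NLI logits; `entailment_id = 0` and `contradiction_id = 2` record which logit columns to use. A toy sketch of turning entailment logits into label scores (the numbers are made up):

```python
import numpy as np

logits = np.array([[2.0, 0.1, -1.0],   # label A: [entailment, neutral, contradiction]
                   [-0.5, 0.2, 1.5]])  # label B
entailment_id = 0

scores = logits[:, entailment_id]
probs = np.exp(scores) / np.exp(scores).sum()
print(probs.round(3))  # label A clearly wins
```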
@@ -16,15 +16,19 @@ if TYPE_CHECKING:
     from .kws import WavToLists
     from .multi_modal import (OfaPreprocessor, MPlugPreprocessor)
     from .nlp import (
-        DocumentSegmentationPreprocessor, FaqQuestionAnsweringPreprocessor,
-        FillMaskPoNetPreprocessor, NLPPreprocessor,
-        NLPTokenizerPreprocessorBase, PassageRankingPreprocessor,
-        TextRankingPreprocessor, RelationExtractionPreprocessor,
-        SentenceEmbeddingPreprocessor, SequenceClassificationPreprocessor,
-        TokenClassificationPreprocessor, TextErrorCorrectionPreprocessor,
-        TextGenerationPreprocessor, Text2TextGenerationPreprocessor, Tokenize,
+        DocumentSegmentationTransformersPreprocessor,
+        FaqQuestionAnsweringTransformersPreprocessor,
+        FillMaskPoNetPreprocessor, FillMaskTransformersPreprocessor,
+        TextRankingTransformersPreprocessor,
+        RelationExtractionTransformersPreprocessor,
+        SentenceEmbeddingTransformersPreprocessor,
+        TextClassificationTransformersPreprocessor,
+        TokenClassificationTransformersPreprocessor,
+        TextErrorCorrectionPreprocessor, TextGenerationT5Preprocessor,
+        TextGenerationTransformersPreprocessor, Tokenize,
         WordSegmentationBlankSetToLabelPreprocessor, CodeGeeXPreprocessor,
-        MGLMSummarizationPreprocessor, ZeroShotClassificationPreprocessor,
+        MGLMSummarizationPreprocessor,
+        ZeroShotClassificationTransformersPreprocessor,
         TextGenerationJiebaPreprocessor, SentencePiecePreprocessor,
         DialogIntentPredictionPreprocessor, DialogModelingPreprocessor,
         DialogStateTrackingPreprocessor, ConversationalTextToSqlPreprocessor,
@@ -47,18 +51,21 @@ else:
         'kws': ['WavToLists'],
         'multi_modal': ['OfaPreprocessor', 'MPlugPreprocessor'],
         'nlp': [
-            'DocumentSegmentationPreprocessor',
-            'FaqQuestionAnsweringPreprocessor', 'FillMaskPoNetPreprocessor',
-            'NLPPreprocessor', 'NLPTokenizerPreprocessorBase',
-            'TextRankingPreprocessor', 'RelationExtractionPreprocessor',
-            'SentenceEmbeddingPreprocessor',
-            'SequenceClassificationPreprocessor',
-            'TokenClassificationPreprocessor',
-            'TextErrorCorrectionPreprocessor', 'TextGenerationPreprocessor',
-            'Tokenize', 'Text2TextGenerationPreprocessor',
+            'DocumentSegmentationTransformersPreprocessor',
+            'FaqQuestionAnsweringTransformersPreprocessor',
+            'FillMaskPoNetPreprocessor', 'FillMaskTransformersPreprocessor',
+            'NLPTokenizerPreprocessorBase',
+            'TextRankingTransformersPreprocessor',
+            'RelationExtractionTransformersPreprocessor',
+            'SentenceEmbeddingTransformersPreprocessor',
+            'TextClassificationTransformersPreprocessor',
+            'TokenClassificationTransformersPreprocessor',
+            'TextErrorCorrectionPreprocessor',
+            'TextGenerationTransformersPreprocessor', 'Tokenize',
+            'TextGenerationT5Preprocessor',
             'WordSegmentationBlankSetToLabelPreprocessor',
             'MGLMSummarizationPreprocessor', 'CodeGeeXPreprocessor',
-            'ZeroShotClassificationPreprocessor',
+            'ZeroShotClassificationTransformersPreprocessor',
             'TextGenerationJiebaPreprocessor', 'SentencePiecePreprocessor',
             'NERPreprocessorViet', 'NERPreprocessorThai',
             'WordSegmentationPreprocessorThai',
@@ -2,9 +2,10 @@
 import os
 from abc import ABC, abstractmethod
 from copy import deepcopy
-from typing import Any, Dict, Optional, Sequence
+from typing import Any, Callable, Dict, Optional, Sequence, Union

 from modelscope.metainfo import Models, Preprocessors
+from modelscope.utils.checkpoint import save_configuration
 from modelscope.utils.config import Config, ConfigDict
 from modelscope.utils.constant import (DEFAULT_MODEL_REVISION, Invoke,
                                        ModeKeys, Tasks)
@@ -98,6 +99,8 @@ PREPROCESSOR_MAP = {
     Preprocessors.sen_cls_tokenizer,
     (Models.structbert, Tasks.part_of_speech):
     Preprocessors.token_cls_tokenizer,
+    (Models.token_classification_for_ner, Tasks.named_entity_recognition):
+    Preprocessors.token_cls_tokenizer,
     (Models.structbert, Tasks.token_classification):
     Preprocessors.token_cls_tokenizer,
     (Models.structbert, Tasks.word_segmentation):
@@ -117,7 +120,15 @@ PREPROCESSOR_MAP = {
     (Models.veco, Tasks.sentence_similarity):
     Preprocessors.sen_cls_tokenizer,

-    # space
+    # taskmodels
+    (Models.lcrf, Tasks.named_entity_recognition):
+    Preprocessors.sequence_labeling_tokenizer,
+    (Models.lcrf_wseg, Tasks.word_segmentation):
+    Preprocessors.sequence_labeling_tokenizer,
+    (Models.tcrf_wseg, Tasks.word_segmentation):
+    Preprocessors.sequence_labeling_tokenizer,
+    (Models.tcrf, Tasks.named_entity_recognition):
+    Preprocessors.sequence_labeling_tokenizer,
 }
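`PREPROCESSOR_MAP` keys preprocessor defaults by `(model type, task)`, so a taskmodel such as a CRF head resolves to the sequence-labeling tokenizer even when its config has no preprocessor section. An illustrative lookup with stand-in string constants (the real keys are `Models`/`Tasks`/`Preprocessors` members):

```python
PREPROCESSOR_MAP = {
    ('lstm-crf', 'named-entity-recognition'): 'sequence-labeling-tokenizer',
    ('structbert', 'word-segmentation'): 'token-cls-tokenizer',
}

model_type, task = 'lstm-crf', 'named-entity-recognition'
print(PREPROCESSOR_MAP.get((model_type, task)))  # 'sequence-labeling-tokenizer'
```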
@@ -125,6 +136,8 @@ class Preprocessor(ABC):

     def __init__(self, mode=ModeKeys.INFERENCE, *args, **kwargs):
         self._mode = mode
+        assert self._mode in (ModeKeys.INFERENCE, ModeKeys.TRAIN,
+                              ModeKeys.EVAL)
         self.device = int(
             os.environ['LOCAL_RANK']) if 'LOCAL_RANK' in os.environ else None
         pass
@@ -264,4 +277,41 @@ class Preprocessor(ABC):
             })
             preprocessor = build_preprocessor(sub_cfg, field_name)
         preprocessor.mode = preprocessor_mode
+        sub_cfg.pop('model_dir', None)
+        if not hasattr(preprocessor, 'cfg'):
+            preprocessor.cfg = cfg
         return preprocessor
+
+    def save_pretrained(self,
+                        target_folder: Union[str, os.PathLike],
+                        config: Optional[dict] = None,
+                        save_config_function: Callable = save_configuration):
+        """Save the preprocessor, its configuration and other related files to a directory,
+        so that it can be re-loaded.
+
+        By default, this method will save the preprocessor's config with mode `inference`.
+
+        Args:
+            target_folder (Union[str, os.PathLike]):
+                Directory to which to save. Will be created if it doesn't exist.
+            config (Optional[dict], optional):
+                The config to be saved into configuration.json.
+            save_config_function (Callable): The function used to save the configuration,
+                called after the config is updated.
+        """
+        if config is None and hasattr(self, 'cfg'):
+            config = self.cfg
+
+        if config is not None:
+            # Update the mode to `inference` in the preprocessor field.
+            if 'preprocessor' in config and config['preprocessor'] is not None:
+                if 'mode' in config['preprocessor']:
+                    config['preprocessor']['mode'] = 'inference'
+                elif 'val' in config['preprocessor'] and 'mode' in config[
+                        'preprocessor']['val']:
+                    config['preprocessor']['val']['mode'] = 'inference'
+            save_config_function(target_folder, config)
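A usage sketch of the new `save_pretrained` (directories are placeholders): `from_pretrained` attaches the model's config as `preprocessor.cfg`, and saving writes that config back with the mode forced to `inference`, so a later `from_pretrained` on the target folder reconstructs the same preprocessor:

```python
from modelscope.preprocessors import Preprocessor

preprocessor = Preprocessor.from_pretrained('<model-dir>')
# writes configuration.json into '<target-dir>' via save_configuration,
# with the preprocessor mode reset to 'inference'
preprocessor.save_pretrained('<target-dir>')
```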
@@ -5,24 +5,22 @@ from modelscope.utils.import_utils import LazyImportModule

 if TYPE_CHECKING:
     from .text_error_correction import TextErrorCorrectionPreprocessor
-    from .nlp_base import (NLPTokenizerPreprocessorBase, NLPBasePreprocessor)
-    from .text_generation_jieba_preprocessor import TextGenerationJiebaPreprocessor
+    from .text_generation_preprocessor import TextGenerationJiebaPreprocessor
     from .sentence_piece_preprocessor import SentencePiecePreprocessor
     from .bert_seq_cls_tokenizer import Tokenize
-    from .document_segmentation_preprocessor import DocumentSegmentationPreprocessor
-    from .faq_question_answering_preprocessor import FaqQuestionAnsweringPreprocessor
-    from .fill_mask_preprocessor import FillMaskPoNetPreprocessor, NLPPreprocessor
-    from .text_ranking_preprocessor import TextRankingPreprocessor
-    from .relation_extraction_preprocessor import RelationExtractionPreprocessor
-    from .sentence_classification_preprocessor import SequenceClassificationPreprocessor
-    from .sentence_embedding_preprocessor import SentenceEmbeddingPreprocessor
-    from .text_generation_preprocessor import TextGenerationPreprocessor
-    from .text2text_generation_preprocessor import Text2TextGenerationPreprocessor
-    from .token_classification_preprocessor import TokenClassificationPreprocessor, \
+    from .document_segmentation_preprocessor import DocumentSegmentationTransformersPreprocessor
+    from .faq_question_answering_preprocessor import FaqQuestionAnsweringTransformersPreprocessor
+    from .fill_mask_preprocessor import FillMaskPoNetPreprocessor, FillMaskTransformersPreprocessor
+    from .text_ranking_preprocessor import TextRankingTransformersPreprocessor
+    from .relation_extraction_preprocessor import RelationExtractionTransformersPreprocessor
+    from .text_classification_preprocessor import TextClassificationTransformersPreprocessor
+    from .sentence_embedding_preprocessor import SentenceEmbeddingTransformersPreprocessor
+    from .text_generation_preprocessor import TextGenerationTransformersPreprocessor, TextGenerationT5Preprocessor
+    from .token_classification_preprocessor import TokenClassificationTransformersPreprocessor, \
         WordSegmentationBlankSetToLabelPreprocessor
     from .token_classification_thai_preprocessor import WordSegmentationPreprocessorThai, NERPreprocessorThai
     from .token_classification_viet_preprocessor import NERPreprocessorViet
-    from .zero_shot_classification_reprocessor import ZeroShotClassificationPreprocessor
+    from .zero_shot_classification_preprocessor import ZeroShotClassificationTransformersPreprocessor
     from .space import (DialogIntentPredictionPreprocessor,
                         DialogModelingPreprocessor,
                         DialogStateTrackingPreprocessor, InputFeatures,
@@ -36,30 +34,31 @@ else:
             'NLPTokenizerPreprocessorBase',
             'NLPBasePreprocessor',
         ],
-        'text_generation_jieba_preprocessor':
-        ['TextGenerationJiebaPreprocessor'],
         'sentence_piece_preprocessor': ['SentencePiecePreprocessor'],
         'bert_seq_cls_tokenizer': ['Tokenize'],
         'document_segmentation_preprocessor':
-        ['DocumentSegmentationPreprocessor'],
+        ['DocumentSegmentationTransformersPreprocessor'],
         'faq_question_answering_preprocessor':
-        ['FaqQuestionAnsweringPreprocessor'],
+        ['FaqQuestionAnsweringTransformersPreprocessor'],
         'fill_mask_preprocessor':
-        ['FillMaskPoNetPreprocessor', 'NLPPreprocessor'],
-        'text_ranking_preprocessor': ['TextRankingPreprocessor'],
-        'relation_extraction_preprocessor': ['RelationExtractionPreprocessor'],
-        'sentence_classification_preprocessor':
-        ['SequenceClassificationPreprocessor'],
-        'sentence_embedding_preprocessor': ['SentenceEmbeddingPreprocessor'],
-        'text_generation_preprocessor': ['TextGenerationPreprocessor'],
-        'text2text_generation_preprocessor':
-        ['Text2TextGenerationPreprocessor'],
+        ['FillMaskPoNetPreprocessor', 'FillMaskTransformersPreprocessor'],
+        'text_ranking_preprocessor': ['TextRankingTransformersPreprocessor'],
+        'relation_extraction_preprocessor':
+        ['RelationExtractionTransformersPreprocessor'],
+        'text_classification_preprocessor':
+        ['TextClassificationTransformersPreprocessor'],
+        'sentence_embedding_preprocessor':
+        ['SentenceEmbeddingTransformersPreprocessor'],
+        'text_generation_preprocessor': [
+            'TextGenerationTransformersPreprocessor',
+            'TextGenerationJiebaPreprocessor', 'TextGenerationT5Preprocessor'
+        ],
         'token_classification_preprocessor': [
-            'TokenClassificationPreprocessor',
+            'TokenClassificationTransformersPreprocessor',
             'WordSegmentationBlankSetToLabelPreprocessor'
         ],
-        'zero_shot_classification_reprocessor':
-        ['ZeroShotClassificationPreprocessor'],
+        'zero_shot_classification_preprocessor':
+        ['ZeroShotClassificationTransformersPreprocessor'],
         'text_error_correction': [
             'TextErrorCorrectionPreprocessor',
         ],
@@ -3,39 +3,52 @@
 from typing import Any, Dict

 from modelscope.metainfo import Preprocessors
+from modelscope.preprocessors import Preprocessor
 from modelscope.preprocessors.builder import PREPROCESSORS
-from modelscope.utils.constant import Fields
+from modelscope.utils.constant import Fields, ModeKeys
 from modelscope.utils.logger import get_logger
-from .nlp_base import NLPBasePreprocessor

 logger = get_logger()

 @PREPROCESSORS.register_module(
     Fields.nlp, module_name=Preprocessors.document_segmentation)
-class DocumentSegmentationPreprocessor(NLPBasePreprocessor):
-
-    def __init__(self, model_dir: str, config, *args, **kwargs):
-        """preprocess the data
+class DocumentSegmentationTransformersPreprocessor(Preprocessor):
+
+    def __init__(self,
+                 model_dir: str,
+                 model_max_length: int,
+                 mode: str = ModeKeys.INFERENCE,
+                 question_column_name='labels',
+                 context_column_name='sentences',
+                 example_id_column_name='example_id',
+                 label_list=['B-EOP', 'O']):
+        """The preprocessor for the document segmentation task, based on transformers' tokenizer.

         Args:
-            model_dir (str): model path
+            model_dir: The model dir containing the essential files to build the tokenizer.
+            model_max_length: The max length the model supports.
+            mode: The mode for this preprocessor.
+            question_column_name: The key for the question column, default `labels`.
+            context_column_name: The key for the context column, default `sentences`.
+            example_id_column_name: The key for the example id column, default `example_id`.
+            label_list: The label list, default `['B-EOP', 'O']`.
         """
-        super().__init__(model_dir, *args, **kwargs)
+        super().__init__(mode)
         from transformers import BertTokenizerFast
-        self.tokenizer = BertTokenizerFast.from_pretrained(
-            model_dir,
-            use_fast=True,
-        )
-        self.question_column_name = 'labels'
-        self.context_column_name = 'sentences'
-        self.example_id_column_name = 'example_id'
-        self.label_to_id = {'B-EOP': 0, 'O': 1}
+        self.tokenizer = BertTokenizerFast.from_pretrained(model_dir)
+        self.question_column_name = question_column_name
+        self.context_column_name = context_column_name
+        self.example_id_column_name = example_id_column_name
+        self.label_list = label_list
+        self.label_to_id = {
+            label: id
+            for id, label in enumerate(self.label_list)
+        }
         self.target_specical_ids = set()
         self.target_specical_ids.add(self.tokenizer.eos_token_id)
-        self.max_seq_length = config.max_position_embeddings
-        self.label_list = ['B-EOP', 'O']
+        self.max_seq_length = model_max_length

     def __call__(self, examples, model_cfg=None) -> Dict[str, Any]:
         questions = examples[self.question_column_name]
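The hard-coded `{'B-EOP': 0, 'O': 1}` mapping is now derived from the `label_list` argument, so the default stays identical while custom label sets become possible:

```python
label_list = ['B-EOP', 'O']
label_to_id = {label: idx for idx, label in enumerate(label_list)}
print(label_to_id)  # {'B-EOP': 0, 'O': 1} (matches the old constant)
```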
| @@ -1,38 +1,58 @@ | |||||
| # Copyright (c) Alibaba, Inc. and its affiliates. | # Copyright (c) Alibaba, Inc. and its affiliates. | ||||
| import os | |||||
| from typing import Any, Dict | from typing import Any, Dict | ||||
| from modelscope.metainfo import Preprocessors | from modelscope.metainfo import Preprocessors | ||||
| from modelscope.preprocessors import Preprocessor | |||||
| from modelscope.preprocessors.builder import PREPROCESSORS | from modelscope.preprocessors.builder import PREPROCESSORS | ||||
| from modelscope.utils.config import Config, ConfigFields | |||||
| from modelscope.utils.constant import Fields, ModeKeys, ModelFile | |||||
| from modelscope.utils.constant import Fields, ModeKeys | |||||
| from modelscope.utils.type_assert import type_assert | from modelscope.utils.type_assert import type_assert | ||||
| from .nlp_base import NLPBasePreprocessor | |||||
| @PREPROCESSORS.register_module( | @PREPROCESSORS.register_module( | ||||
| Fields.nlp, module_name=Preprocessors.faq_question_answering_preprocessor) | Fields.nlp, module_name=Preprocessors.faq_question_answering_preprocessor) | ||||
| class FaqQuestionAnsweringPreprocessor(NLPBasePreprocessor): | |||||
| def __init__(self, model_dir: str, *args, **kwargs): | |||||
| super(FaqQuestionAnsweringPreprocessor, self).__init__( | |||||
| model_dir, mode=ModeKeys.INFERENCE, **kwargs) | |||||
| from transformers import BertTokenizer | |||||
| preprocessor_config = Config.from_file( | |||||
| os.path.join(model_dir, ModelFile.CONFIGURATION)).get( | |||||
| ConfigFields.preprocessor, {}) | |||||
| if preprocessor_config.get('tokenizer', | |||||
| 'BertTokenizer') == 'XLMRoberta': | |||||
+class FaqQuestionAnsweringTransformersPreprocessor(Preprocessor):
+
+    def __init__(self,
+                 model_dir: str,
+                 mode: str = ModeKeys.INFERENCE,
+                 tokenizer='BertTokenizer',
+                 query_set='query_set',
+                 support_set='support_set',
+                 label_in_support_set='label',
+                 text_in_support_set='text',
+                 sequence_length=None,
+                 **kwargs):
+        """The preprocessor for the FAQ question answering task, based on transformers' tokenizer.
+
+        Args:
+            model_dir: The model dir containing the essential files to build the tokenizer.
+            mode: The mode for this preprocessor.
+            tokenizer: The tokenizer type to use. Supported types are `BertTokenizer`
+                and `XLMRoberta`, default `BertTokenizer`.
+            query_set: The key of the query set.
+            support_set: The key of the support set.
+            label_in_support_set: The key of the label in the support set.
+            text_in_support_set: The key of the text in the support set.
+            sequence_length: The sequence length for the preprocessor.
+        """
+        super().__init__(mode)
+        if tokenizer == 'XLMRoberta':
             from transformers import XLMRobertaTokenizer
             self.tokenizer = XLMRobertaTokenizer.from_pretrained(model_dir)
         else:
+            from transformers import BertTokenizer
             self.tokenizer = BertTokenizer.from_pretrained(model_dir)
-        self.MAX_LEN = preprocessor_config.get('max_seq_length', 50)
+        if sequence_length is not None:
+            self.max_len = sequence_length
+        else:
+            self.max_len = kwargs.get('max_seq_length', 50)
         self.label_dict = None
+        self.query_set = query_set
+        self.support_set = support_set
+        self.label_in_support_set = label_in_support_set
+        self.text_in_support_set = text_in_support_set

     def pad(self, samples, max_len):
         result = []

@@ -58,25 +78,31 @@ class FaqQuestionAnsweringPreprocessor(NLPBasePreprocessor):
     @type_assert(object, Dict)
     def __call__(self, data: Dict[str, Any],
                  **preprocessor_param) -> Dict[str, Any]:
-        TMP_MAX_LEN = preprocessor_param.get('max_seq_length', self.MAX_LEN)
-        queryset = data['query_set']
+        tmp_max_len = preprocessor_param.get(
+            'sequence_length',
+            preprocessor_param.get('max_seq_length', self.max_len))
+        queryset = data[self.query_set]
         if not isinstance(queryset, list):
             queryset = [queryset]
-        supportset = data['support_set']
-        supportset = sorted(supportset, key=lambda d: d['label'])
+        supportset = data[self.support_set]
+        supportset = sorted(
+            supportset, key=lambda d: d[self.label_in_support_set])
         queryset_tokenized = [self.encode_plus(text) for text in queryset]
         supportset_tokenized = [
-            self.encode_plus(item['text']) for item in supportset
+            self.encode_plus(item[self.text_in_support_set])
+            for item in supportset
         ]
         max_len = max(
             [len(seq) for seq in queryset_tokenized + supportset_tokenized])
-        max_len = min(TMP_MAX_LEN, max_len)
+        max_len = min(tmp_max_len, max_len)
         queryset_padded = self.pad(queryset_tokenized, max_len)
         supportset_padded = self.pad(supportset_tokenized, max_len)
-        supportset_labels_ori = [item['label'] for item in supportset]
+        supportset_labels_ori = [
+            item[self.label_in_support_set] for item in supportset
+        ]
         label_dict = []
         for label in supportset_labels_ori:
             if label not in label_dict:
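A minimal usage sketch of the new keyword arguments (the texts and labels below are made up for illustration):

    preprocessor = FaqQuestionAnsweringTransformersPreprocessor(model_dir)
    inputs = preprocessor({
        'query_set': ['How do I reset my password?'],
        'support_set': [
            {'text': 'I forgot my password', 'label': 'account'},
            {'text': 'Where is my order', 'label': 'logistics'},
        ],
    })

Callers whose data uses different field names can now pass, e.g., query_set='questions' instead of renaming the data.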
@@ -0,0 +1,78 @@
+# Copyright (c) Alibaba, Inc. and its affiliates.
+from typing import Any, Dict, Tuple, Union
+
+import numpy as np
+
+from modelscope.metainfo import Preprocessors
+from modelscope.preprocessors import Preprocessor
+from modelscope.preprocessors.builder import PREPROCESSORS
+from modelscope.utils.constant import Fields, ModeKeys
+from modelscope.utils.hub import get_model_type
+from .transformers_tokenizer import NLPTokenizer
+from .utils import parse_text_and_label
+
+
+@PREPROCESSORS.register_module(
+    Fields.nlp, module_name=Preprocessors.feature_extraction)
+class FeatureExtractionTransformersPreprocessor(Preprocessor):
+
+    def __init__(self,
+                 model_dir: str = None,
+                 first_sequence: str = None,
+                 second_sequence: str = None,
+                 mode: str = ModeKeys.INFERENCE,
+                 sequence_length: int = 128,
+                 use_fast: bool = None,
+                 **kwargs):
+        """The preprocessor for the feature extraction task, based on transformers' tokenizer.
+
+        Args:
+            model_dir: The model dir used to initialize the tokenizer.
+            use_fast: Use the fast tokenizer or not.
+            sequence_length: The max sequence length which the model supports,
+                will be passed into the tokenizer as the 'max_length' param.
+            **kwargs: Extra args input into the tokenizer's __call__ method.
+        """
+        self.first_sequence = first_sequence
+        self.second_sequence = second_sequence
+        kwargs['truncation'] = kwargs.get('truncation', True)
+        kwargs['padding'] = kwargs.get('padding', 'max_length')
+        kwargs['max_length'] = sequence_length
+        kwargs['return_token_type_ids'] = kwargs.get('return_token_type_ids',
+                                                     True)
+        super().__init__(mode)
+        model_type = None
+        if model_dir is not None:
+            model_type = get_model_type(model_dir)
+        self.nlp_tokenizer = NLPTokenizer(
+            model_dir, model_type, use_fast=use_fast, tokenize_kwargs=kwargs)
+
+    def __call__(self, data: Union[str, Tuple, Dict],
+                 **kwargs) -> Dict[str, Any]:
+        """Process the raw input data.
+
+        Args:
+            data (tuple): [sentence1, sentence2]
+                sentence1 (str): a sentence
+                    Example:
+                        'you are so handsome.'
+
+        Returns:
+            Dict[str, Any]: the preprocessed data
+        """
+        text_a, text_b, _ = parse_text_and_label(data, self.mode,
+                                                 self.first_sequence,
+                                                 self.second_sequence)
+        output = self._tokenize_text(text_a, text_b, **kwargs)
+        output = {
+            k: np.array(v) if isinstance(v, list) else v
+            for k, v in output.items()
+        }
+        return output
+
+    def _tokenize_text(self, sequence1, sequence2=None, **kwargs):
+        if 'return_tensors' not in kwargs:
+            kwargs[
+                'return_tensors'] = 'pt' if self.mode == ModeKeys.INFERENCE else None
+        return self.nlp_tokenizer(sequence1, sequence2, **kwargs)
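A usage sketch, assuming a transformers-compatible model dir (the sentence comes from the docstring example):

    preprocessor = FeatureExtractionTransformersPreprocessor(
        model_dir, first_sequence='sentence')
    features = preprocessor({'sentence': 'you are so handsome.'})
    # At inference time the values are 'pt' tensors padded/truncated to sequence_length.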
@@ -2,60 +2,207 @@
 import os.path as osp
 import re
+from abc import abstractmethod
 from typing import Any, Dict, Tuple, Union

 import numpy as np
 import torch

 from modelscope.metainfo import Preprocessors
+from modelscope.preprocessors import Preprocessor
 from modelscope.preprocessors.builder import PREPROCESSORS
 from modelscope.utils.config import Config
 from modelscope.utils.constant import Fields, ModeKeys, ModelFile
+from modelscope.utils.hub import get_model_type
 from modelscope.utils.nlp import import_external_nltk_data
-from .nlp_base import NLPTokenizerPreprocessorBase
+from .transformers_tokenizer import NLPTokenizer
+from .utils import parse_text_and_label
+
+
+class FillMaskPreprocessorBase(Preprocessor):
+
+    def __init__(self,
+                 first_sequence: str = None,
+                 second_sequence: str = None,
+                 mode: str = ModeKeys.INFERENCE):
+        """The base constructor for all the fill-mask preprocessors.
+
+        Args:
+            first_sequence: The key of the first sequence.
+            second_sequence: The key of the second sequence.
+            mode: The mode for the preprocessor.
+        """
+        super().__init__(mode)
+        self.first_sequence = first_sequence
+        self.second_sequence = second_sequence
+
+    def __call__(self, data: Union[str, Tuple, Dict],
+                 **kwargs) -> Dict[str, Any]:
+        """Process the raw input data.
+
+        Args:
+            data (tuple): [sentence1, sentence2]
+                sentence1 (str): a sentence
+                    Example:
+                        'you are so handsome.'
+
+        Returns:
+            Dict[str, Any]: the preprocessed data
+        """
+        text_a, text_b, _ = parse_text_and_label(data, self.mode,
+                                                 self.first_sequence,
+                                                 self.second_sequence)
+        output = self._tokenize_text(text_a, text_b, **kwargs)
+        output = {
+            k: np.array(v) if isinstance(v, list) else v
+            for k, v in output.items()
+        }
+        return output
+
+    def _tokenize_text(self, sequence1, sequence2=None, **kwargs):
+        """Tokenize the text.
+
+        Args:
+            sequence1: The first sequence.
+            sequence2: The second sequence, which may be None.
+
+        Returns:
+            The encoded sequence.
+        """
+        raise NotImplementedError()
+
+    @property
+    def mask_id(self):
+        """Return the id of the mask token.
+
+        Returns:
+            The id of the mask token.
+        """
+        return None
+
+    @abstractmethod
+    def decode(self,
+               token_ids,
+               skip_special_tokens: bool = False,
+               clean_up_tokenization_spaces: bool = True,
+               **kwargs):
+        """Turn the token_ids into a real sentence.
+
+        Args:
+            token_ids (`Union[int, List[int], np.ndarray, torch.Tensor, tf.Tensor]`):
+                List of tokenized input ids. Can be obtained using the `__call__` method.
+            skip_special_tokens (`bool`, *optional*, defaults to `False`):
+                Whether or not to remove special tokens in the decoding.
+            clean_up_tokenization_spaces (`bool`, *optional*, defaults to `True`):
+                Whether or not to clean up the tokenization spaces.
+            kwargs (additional keyword arguments, *optional*):
+                Will be passed to the underlying model specific decode method.
+
+        Returns:
+            The real sentence decoded by the preprocessor.
+        """
+        pass


 @PREPROCESSORS.register_module(Fields.nlp, module_name=Preprocessors.fill_mask)
-@PREPROCESSORS.register_module(
-    Fields.nlp, module_name=Preprocessors.feature_extraction)
-class NLPPreprocessor(NLPTokenizerPreprocessorBase):
-    """The tokenizer preprocessor used in MLM task.
-    """
+class FillMaskTransformersPreprocessor(FillMaskPreprocessorBase):

-    def __init__(self, model_dir: str, mode=ModeKeys.INFERENCE, **kwargs):
+    def __init__(self,
+                 model_dir: str = None,
+                 first_sequence: str = None,
+                 second_sequence: str = None,
+                 mode: str = ModeKeys.INFERENCE,
+                 sequence_length: int = 128,
+                 use_fast: bool = None,
+                 **kwargs):
+        """The preprocessor for the fill-mask task, based on transformers' tokenizer.
+
+        Args:
+            model_dir: The model dir used to initialize the tokenizer.
+            use_fast: Use the fast tokenizer or not.
+            sequence_length: The max sequence length which the model supports,
+                will be passed into the tokenizer as the 'max_length' param.
+            **kwargs: Extra args input into the tokenizer's __call__ method.
+        """
         kwargs['truncation'] = kwargs.get('truncation', True)
         kwargs['padding'] = kwargs.get('padding', 'max_length')
-        kwargs['max_length'] = kwargs.pop('sequence_length', 128)
+        kwargs['max_length'] = sequence_length
         kwargs['return_token_type_ids'] = kwargs.get('return_token_type_ids',
                                                      True)
-        super().__init__(model_dir, mode=mode, **kwargs)
+        super().__init__(first_sequence, second_sequence, mode)
+        model_type = None
+        if model_dir is not None:
+            model_type = get_model_type(model_dir)
+        self.nlp_tokenizer = NLPTokenizer(
+            model_dir, model_type, use_fast=use_fast, tokenize_kwargs=kwargs)
+
+    def _tokenize_text(self, sequence1, sequence2=None, **kwargs):
+        if 'return_tensors' not in kwargs:
+            kwargs[
+                'return_tensors'] = 'pt' if self.mode == ModeKeys.INFERENCE else None
+        return self.nlp_tokenizer(sequence1, sequence2, **kwargs)

     @property
     def mask_id(self):
-        return self.tokenizer.mask_token_id
+        """Return the id of the mask token.
+
+        Returns:
+            The id of the mask token.
+        """
+        return self.nlp_tokenizer.tokenizer.mask_token_id

     def decode(self,
                token_ids,
                skip_special_tokens: bool = False,
                clean_up_tokenization_spaces: bool = True,
                **kwargs):
-        return self.tokenizer.decode(token_ids, skip_special_tokens,
-                                     clean_up_tokenization_spaces, **kwargs)
+        """Turn the token_ids into a real sentence.
+
+        Args:
+            token_ids (`Union[int, List[int], np.ndarray, torch.Tensor, tf.Tensor]`):
+                List of tokenized input ids. Can be obtained using the `__call__` method.
+            skip_special_tokens (`bool`, *optional*, defaults to `False`):
+                Whether or not to remove special tokens in the decoding.
+            clean_up_tokenization_spaces (`bool`, *optional*, defaults to `True`):
+                Whether or not to clean up the tokenization spaces.
+            kwargs (additional keyword arguments, *optional*):
+                Will be passed to the underlying model specific decode method.
+
+        Returns:
+            The real sentence decoded by the preprocessor.
+        """
+        return self.nlp_tokenizer.tokenizer.decode(
+            token_ids, skip_special_tokens, clean_up_tokenization_spaces,
+            **kwargs)


 @PREPROCESSORS.register_module(
     Fields.nlp, module_name=Preprocessors.fill_mask_ponet)
-class FillMaskPoNetPreprocessor(NLPTokenizerPreprocessorBase):
-    """The tokenizer preprocessor used in PoNet model's MLM task.
-    """
+class FillMaskPoNetPreprocessor(FillMaskPreprocessorBase):

-    def __init__(self, model_dir: str, mode=ModeKeys.INFERENCE, **kwargs):
+    def __init__(self,
+                 model_dir,
+                 first_sequence: str = None,
+                 second_sequence: str = None,
+                 mode: str = ModeKeys.INFERENCE,
+                 sequence_length: int = 512,
+                 use_fast: bool = None,
+                 **kwargs):
+        """The tokenizer preprocessor used in the PoNet model's MLM task.
+
+        Args:
+            model_dir: The model dir used to initialize the tokenizer.
+            use_fast: Use the fast tokenizer or not.
+            sequence_length: The max sequence length which the model supports,
+                will be passed into the tokenizer as the 'max_length' param.
+            **kwargs: Extra args input into the tokenizer's __call__ method.
+        """
         kwargs['truncation'] = kwargs.get('truncation', True)
         kwargs['padding'] = kwargs.get('padding', 'max_length')
-        kwargs['max_length'] = kwargs.pop('sequence_length', 512)
+        kwargs['max_length'] = sequence_length
         kwargs['return_token_type_ids'] = kwargs.get('return_token_type_ids',
                                                      True)
-        super().__init__(model_dir, mode=mode, **kwargs)
+        super().__init__(first_sequence, second_sequence, mode)
+        self.nlp_tokenizer = NLPTokenizer(
+            model_dir, use_fast=use_fast, tokenize_kwargs=kwargs)
         self.cfg = Config.from_file(
             osp.join(model_dir, ModelFile.CONFIGURATION))

@@ -80,27 +227,15 @@ class FillMaskPoNetPreprocessor(NLPTokenizerPreprocessorBase):
         self.sent_tokenize = sent_tokenize
         self.max_length = kwargs['max_length']

-    def __call__(self, data: Union[str, Tuple, Dict]) -> Dict[str, Any]:
-        """process the raw input data
-
-        Args:
-            data (tuple): [sentence1, sentence2]
-                sentence1 (str): a sentence
-                    Example:
-                        'you are so handsome.'
-                sentence2 (str): a sentence
-                    Example:
-                        'you are so beautiful.'
-
-        Returns:
-            Dict[str, Any]: the preprocessed data
-        """
-        text_a, text_b, labels = self.parse_text_and_label(data)
-        output = self.tokenizer(
-            text_a,
-            text_b,
-            return_tensors='pt' if self._mode == ModeKeys.INFERENCE else None,
-            **self.tokenize_kwargs)
+    def __call__(self, data: Union[str, Tuple, Dict],
+                 **kwargs) -> Dict[str, Any]:
+        text_a, text_b, _ = parse_text_and_label(data, self.mode,
+                                                 self.first_sequence,
+                                                 self.second_sequence)
+        if 'return_tensors' not in kwargs:
+            kwargs[
+                'return_tensors'] = 'pt' if self.mode == ModeKeys.INFERENCE else None
+        output = self.nlp_tokenizer(text_a, text_b, **kwargs)
         max_seq_length = self.max_length
         if text_b is None:

@@ -108,7 +243,7 @@ class FillMaskPoNetPreprocessor(NLPTokenizerPreprocessorBase):
         seg_lens = list(
             map(
                 len,
-                self.tokenizer(
+                self.nlp_tokenizer.tokenizer(
                     self.sent_tokenize(text_a),
                     add_special_tokens=False,
                     truncation=True)['input_ids']))

@@ -125,18 +260,36 @@ class FillMaskPoNetPreprocessor(NLPTokenizerPreprocessorBase):
             k: np.array(v) if isinstance(v, list) else v
             for k, v in output.items()
         }
-        self.labels_to_id(labels, output)
         return output

     @property
     def mask_id(self):
-        return self.tokenizer.mask_token_id
+        """Return the id of the mask token.
+
+        Returns:
+            The id of the mask token.
+        """
+        return self.nlp_tokenizer.tokenizer.mask_token_id

     def decode(self,
                token_ids,
                skip_special_tokens: bool = False,
                clean_up_tokenization_spaces: bool = True,
                **kwargs):
-        return self.tokenizer.decode(token_ids, skip_special_tokens,
-                                     clean_up_tokenization_spaces, **kwargs)
+        """Turn the token_ids into a real sentence.
+
+        Args:
+            token_ids (`Union[int, List[int], np.ndarray, torch.Tensor, tf.Tensor]`):
+                List of tokenized input ids. Can be obtained using the `__call__` method.
+            skip_special_tokens (`bool`, *optional*, defaults to `False`):
+                Whether or not to remove special tokens in the decoding.
+            clean_up_tokenization_spaces (`bool`, *optional*, defaults to `True`):
+                Whether or not to clean up the tokenization spaces.
+            kwargs (additional keyword arguments, *optional*):
+                Will be passed to the underlying model specific decode method.
+
+        Returns:
+            The real sentence decoded by the preprocessor.
+        """
+        return self.nlp_tokenizer.tokenizer.decode(
+            token_ids, skip_special_tokens, clean_up_tokenization_spaces,
+            **kwargs)
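A round-trip sketch for the refactored fill-mask preprocessor (the mask token string depends on the tokenizer; '[MASK]' assumes a BERT-style vocab):

    preprocessor = FillMaskTransformersPreprocessor(model_dir)
    inputs = preprocessor('you are so [MASK].')
    # pick the predicted ids at the masked position from the model output, then:
    # text = preprocessor.decode(token_ids, skip_special_tokens=True)
    # preprocessor.mask_id gives the id used to locate the masked position.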
@@ -1,291 +0,0 @@
-# Copyright (c) Alibaba, Inc. and its affiliates.
-import os
-from abc import ABC
-from collections.abc import Mapping
-from typing import Any, Dict, List, Tuple, Union
-
-import json
-import numpy as np
-import torch
-from transformers import AutoTokenizer
-
-from modelscope.metainfo import Models
-from modelscope.outputs import OutputKeys
-from modelscope.preprocessors.base import Preprocessor
-from modelscope.utils.constant import ModeKeys
-from modelscope.utils.hub import get_model_type, parse_label_mapping
-from modelscope.utils.logger import get_logger
-
-logger = get_logger()
-
-__all__ = [
-    'NLPBasePreprocessor',
-    'NLPTokenizerPreprocessorBase',
-]
-
-
-class NLPBasePreprocessor(Preprocessor, ABC):
-
-    def __init__(self,
-                 model_dir: str,
-                 first_sequence=None,
-                 second_sequence=None,
-                 label=None,
-                 label2id=None,
-                 mode=ModeKeys.INFERENCE,
-                 use_fast=None,
-                 **kwargs):
-        """The NLP preprocessor base class.
-
-        Args:
-            model_dir (str): The local model path
-            first_sequence: The key for the first sequence
-            second_sequence: The key for the second sequence
-            label: The label key
-            label2id: An optional label2id mapping, the class will try to call utils.parse_label_mapping
-                if this mapping is not supplied.
-            mode: Run this preprocessor in either 'train'/'eval'/'inference' mode
-            use_fast: use the fast version of tokenizer
-        """
-        self.model_dir = model_dir
-        self.first_sequence = first_sequence
-        self.second_sequence = second_sequence
-        self.label = label
-
-        self.use_fast = use_fast
-        if self.use_fast is None and model_dir is None:
-            self.use_fast = False
-        elif self.use_fast is None and os.path.isfile(
-                os.path.join(model_dir, 'tokenizer_config.json')):
-            with open(
-                    os.path.join(model_dir, 'tokenizer_config.json'),
-                    'r',
-                    encoding='utf-8') as f:
-                json_config = json.load(f)
-                self.use_fast = json_config.get('use_fast')
-        self.use_fast = False if self.use_fast is None else self.use_fast
-
-        self.label2id = label2id
-        if self.label2id is None and model_dir is not None:
-            self.label2id = parse_label_mapping(model_dir)
-        super().__init__(mode, **kwargs)
-
-    @property
-    def mask_id(self):
-        """Child preprocessor can override this property to return the id of mask token.
-
-        Returns:
-            The id of mask token, default None.
-        """
-        return None
-
-    def decode(self,
-               token_ids: Union[int, List[int], 'np.ndarray', 'torch.Tensor',
-                                'tf.Tensor'],
-               skip_special_tokens: bool = False,
-               clean_up_tokenization_spaces: bool = True,
-               **kwargs):
-        """Turn the token_ids to real sentence.
-
-        Args:
-            token_ids (`Union[int, List[int], np.ndarray, torch.Tensor, tf.Tensor]`):
-                List of tokenized input ids. Can be obtained using the `__call__` method.
-            skip_special_tokens (`bool`, *optional*, defaults to `False`):
-                Whether or not to remove special tokens in the decoding.
-            clean_up_tokenization_spaces (`bool`, *optional*, defaults to `True`):
-                Whether or not to clean up the tokenization spaces.
-            kwargs (additional keyword arguments, *optional*):
-                Will be passed to the underlying model specific decode method.
-
-        Returns:
-            The real sentence decoded by the preprocessor.
-        """
-        raise NotImplementedError()
-
-
-class NLPTokenizerPreprocessorBase(NLPBasePreprocessor):
-
-    def __init__(self,
-                 model_dir: str,
-                 first_sequence: str = None,
-                 second_sequence: str = None,
-                 label: str = 'label',
-                 label2id: dict = None,
-                 mode: str = ModeKeys.INFERENCE,
-                 use_fast: bool = None,
-                 **kwargs):
-        """The NLP tokenizer preprocessor base class.
-
-        Any nlp preprocessor which uses the hf tokenizer can inherit from this class.
-
-        Args:
-            model_dir (str): The local model path
-            first_sequence: The key for the first sequence
-            second_sequence: The key for the second sequence
-            label: The key for the label
-            label2id: An optional label2id dict.
-                If label2id is None, the preprocessor will try to parse label-id mapping from:
-                - configuration.json model.label2id/model.id2label
-                - config.json label2id/id2label
-                - label_mapping.json
-            mode: Run this preprocessor in either 'train'/'eval'/'inference' mode, the behavior may be different.
-            use_fast: use the fast version of tokenizer
-            kwargs: These kwargs will be directly fed into the tokenizer.
-        """
-        super().__init__(model_dir, first_sequence, second_sequence, label,
-                         label2id, mode, use_fast, **kwargs)
-        self.model_dir = model_dir
-        self.tokenize_kwargs = kwargs
-        self.tokenizer = self.build_tokenizer(model_dir)
-        logger.info(f'The key of sentence1: {self.first_sequence}, '
-                    f'The key of sentence2: {self.second_sequence}, '
-                    f'The key of label: {self.label}')
-        if self.first_sequence is None:
-            logger.warning('[Important] first_sequence attribute is not set, '
-                           'this will cause an error if your input is a dict.')
-
-    @property
-    def id2label(self):
-        """Return the id2label mapping according to the label2id mapping.
-
-        @return: The id2label mapping if exists.
-        """
-        if self.label2id is not None:
-            return {id: label for label, id in self.label2id.items()}
-        return None
-
-    def build_tokenizer(self, model_dir):
-        """Build a tokenizer by the model type.
-
-        NOTE: This default implementation only returns slow tokenizer, because the fast tokenizers have a
-        multi-thread problem.
-
-        Args:
-            model_dir: The local model dir.
-
-        Returns:
-            The initialized tokenizer.
-        """
-        self.is_transformer_based_model = 'lstm' not in model_dir
-        # fast version lead to parallel inference failed
-        model_type = get_model_type(model_dir)
-        if model_type in (Models.structbert, Models.gpt3, Models.palm,
-                          Models.plug):
-            from modelscope.models.nlp.structbert import SbertTokenizer, SbertTokenizerFast
-            tokenizer = SbertTokenizerFast if self.use_fast else SbertTokenizer
-            return tokenizer.from_pretrained(model_dir)
-        elif model_type == Models.veco:
-            from modelscope.models.nlp.veco import VecoTokenizer, VecoTokenizerFast
-            tokenizer = VecoTokenizerFast if self.use_fast else VecoTokenizer
-            return tokenizer.from_pretrained(model_dir)
-        elif model_type == Models.deberta_v2:
-            from modelscope.models.nlp.deberta_v2 import DebertaV2Tokenizer, DebertaV2TokenizerFast
-            tokenizer = DebertaV2TokenizerFast if self.use_fast else DebertaV2Tokenizer
-            return tokenizer.from_pretrained(model_dir)
-        elif not self.is_transformer_based_model:
-            from transformers import BertTokenizer, BertTokenizerFast
-            tokenizer = BertTokenizerFast if self.use_fast else BertTokenizer
-            return tokenizer.from_pretrained(model_dir)
-        else:
-            return AutoTokenizer.from_pretrained(
-                model_dir, use_fast=self.use_fast)
-
-    def __call__(self, data: Union[str, Tuple, Dict]) -> Dict[str, Any]:
-        """process the raw input data
-
-        Args:
-            data (tuple): [sentence1, sentence2]
-                sentence1 (str): a sentence
-                    Example:
-                        'you are so handsome.'
-                sentence2 (str): a sentence
-                    Example:
-                        'you are so beautiful.'
-
-        Returns:
-            Dict[str, Any]: the preprocessed data
-        """
-        text_a, text_b, labels = self.parse_text_and_label(data)
-        output = self.tokenizer(
-            text_a,
-            text_b,
-            return_tensors='pt' if self._mode == ModeKeys.INFERENCE else None,
-            **self.tokenize_kwargs)
-        output = {
-            k: np.array(v) if isinstance(v, list) else v
-            for k, v in output.items()
-        }
-        self.labels_to_id(labels, output)
-        return output
-
-    def parse_text_and_label(self, data):
-        """Parse the input and return the sentences and labels.
-
-        When input type is tuple or list and its size is 2:
-        If the pair param is False, data will be parsed as the first_sentence and the label,
-        else it will be parsed as the first_sentence and the second_sentence.
-
-        Args:
-            data: The input data.
-
-        Returns:
-            The sentences and labels tuple.
-        """
-        text_a, text_b, labels = None, None, None
-        if isinstance(data, str):
-            text_a = data
-        elif isinstance(data, tuple) or isinstance(data, list):
-            if len(data) == 3:
-                text_a, text_b, labels = data
-            elif len(data) == 2:
-                if self._mode == ModeKeys.INFERENCE:
-                    text_a, text_b = data
-                else:
-                    text_a, labels = data
-        elif isinstance(data, Mapping):
-            text_a = data.get(self.first_sequence)
-            text_b = data.get(self.second_sequence)
-            labels = data.get(self.label)
-        return text_a, text_b, labels
-
-    def labels_to_id(self, labels, output):
-        """Turn the labels to id with the type int or float.
-
-        If the original label's type is str or int, the label2id mapping will try to convert it to the final label.
-        If the original label's type is float, or the label2id mapping does not exist,
-        the original label will be returned.
-
-        Args:
-            labels: The input labels.
-            output: The label id.
-
-        Returns:
-            The final labels.
-        """
-
-        def label_can_be_mapped(label):
-            return isinstance(label, str) or isinstance(label, int)
-
-        try:
-            if isinstance(labels, (tuple, list)) and all([label_can_be_mapped(label) for label in labels]) \
-                    and self.label2id is not None:
-                output[OutputKeys.LABELS] = [
-                    self.label2id[label]
-                    if label in self.label2id else self.label2id[str(label)]
-                    for label in labels
-                ]
-            elif label_can_be_mapped(labels) and self.label2id is not None:
-                output[OutputKeys.LABELS] = self.label2id[
-                    labels] if labels in self.label2id else self.label2id[str(
-                        labels)]
-            elif labels is not None:
-                output[OutputKeys.LABELS] = labels
-        except KeyError as e:
-            logger.error(
-                f'Label {labels} cannot be found in the label mapping {self.label2id},'
-                f'which comes from the user input or the configuration files. '
-                f'Please consider matching your labels with this mapping.')
-            raise e
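For reference, the label2id fallback order that `parse_label_mapping` implements (per the docstring in the deleted base class above) is roughly the following; this is a simplified sketch for reviewers, not the actual modelscope.utils.hub code:

    import json
    import os

    def parse_label_mapping_sketch(model_dir):
        # 1. configuration.json: model.label2id / model.id2label
        # 2. config.json: label2id / id2label
        # 3. label_mapping.json
        for file_name, get_section in (
                ('configuration.json', lambda cfg: cfg.get('model', {})),
                ('config.json', lambda cfg: cfg)):
            path = os.path.join(model_dir, file_name)
            if os.path.isfile(path):
                with open(path, encoding='utf-8') as f:
                    section = get_section(json.load(f))
                if 'label2id' in section:
                    return {label: int(id) for label, id in section['label2id'].items()}
                if 'id2label' in section:
                    return {label: int(id) for id, label in section['id2label'].items()}
        path = os.path.join(model_dir, 'label_mapping.json')
        if os.path.isfile(path):
            with open(path, encoding='utf-8') as f:
                return json.load(f)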
@@ -5,34 +5,36 @@ from typing import Any, Dict
 from transformers import AutoTokenizer

 from modelscope.metainfo import Preprocessors
+from modelscope.preprocessors import Preprocessor
 from modelscope.preprocessors.builder import PREPROCESSORS
-from modelscope.utils.constant import Fields
+from modelscope.utils.constant import Fields, ModeKeys
 from modelscope.utils.type_assert import type_assert
-from .nlp_base import NLPBasePreprocessor


 @PREPROCESSORS.register_module(
     Fields.nlp, module_name=Preprocessors.re_tokenizer)
-class RelationExtractionPreprocessor(NLPBasePreprocessor):
-    """The relation extraction preprocessor used in normal RE task.
-    """
+class RelationExtractionTransformersPreprocessor(Preprocessor):

-    def __init__(self, model_dir: str, *args, **kwargs):
-        """preprocess the data
+    def __init__(
+            self,
+            model_dir: str,
+            mode: str = ModeKeys.INFERENCE,
+            **kwargs,
+    ):
+        """The preprocessor for the relation extraction task, based on transformers' tokenizer.

         Args:
-            model_dir (str): model path
+            model_dir: The model dir used to initialize the tokenizer.
+            mode: The mode for the preprocessor.
         """
-        super().__init__(model_dir, *args, **kwargs)
+        super().__init__(mode)
         self.model_dir: str = model_dir
+        self.sequence_length = kwargs.pop('sequence_length', 512)
         self.tokenizer = AutoTokenizer.from_pretrained(
             model_dir, use_fast=True)

     @type_assert(object, str)
-    def __call__(self, data: str) -> Dict[str, Any]:
+    def __call__(self, data: str, **kwargs) -> Dict[str, Any]:
         """process the raw input data

         Args:

@@ -46,7 +48,9 @@ class RelationExtractionPreprocessor(NLPBasePreprocessor):
         # preprocess the data for the model input
         text = data
-        output = self.tokenizer([text], return_tensors='pt')
+        if 'return_tensors' not in kwargs:
+            kwargs['return_tensors'] = 'pt'
+        output = self.tokenizer([text], **kwargs)
         return {
             'text': text,
             'input_ids': output['input_ids'],
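A usage sketch (the input sentence is made up):

    preprocessor = RelationExtractionTransformersPreprocessor(model_dir)
    inputs = preprocessor('In 1997, Steve Jobs returned to Apple.')
    # inputs keeps the raw 'text' plus the tokenizer outputs such as 'input_ids'.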
@@ -1,25 +0,0 @@
-# Copyright (c) Alibaba, Inc. and its affiliates.
-from modelscope.metainfo import Preprocessors
-from modelscope.preprocessors.builder import PREPROCESSORS
-from modelscope.utils.constant import Fields, ModeKeys
-from .nlp_base import NLPTokenizerPreprocessorBase
-
-
-@PREPROCESSORS.register_module(
-    Fields.nlp, module_name=Preprocessors.nli_tokenizer)
-@PREPROCESSORS.register_module(
-    Fields.nlp, module_name=Preprocessors.sen_sim_tokenizer)
-@PREPROCESSORS.register_module(
-    Fields.nlp, module_name=Preprocessors.bert_seq_cls_tokenizer)
-@PREPROCESSORS.register_module(
-    Fields.nlp, module_name=Preprocessors.sen_cls_tokenizer)
-class SequenceClassificationPreprocessor(NLPTokenizerPreprocessorBase):
-    """The tokenizer preprocessor used in sequence classification.
-    """
-
-    def __init__(self, model_dir: str, mode=ModeKeys.INFERENCE, **kwargs):
-        kwargs['truncation'] = kwargs.get('truncation', True)
-        kwargs['padding'] = kwargs.get('padding', 'max_length')
-        kwargs['max_length'] = kwargs.pop('sequence_length', 128)
-        super().__init__(model_dir, mode=mode, **kwargs)
@@ -1,31 +1,61 @@
 # Copyright (c) Alibaba, Inc. and its affiliates.
-from typing import Any, Dict, Union
+from typing import Any, Dict

 from modelscope.metainfo import Preprocessors
+from modelscope.preprocessors import Preprocessor
 from modelscope.preprocessors.builder import PREPROCESSORS
 from modelscope.utils.constant import Fields, ModeKeys
-from .nlp_base import NLPTokenizerPreprocessorBase
+from modelscope.utils.hub import get_model_type
+from .transformers_tokenizer import NLPTokenizer


 @PREPROCESSORS.register_module(
     Fields.nlp, module_name=Preprocessors.sentence_embedding)
-class SentenceEmbeddingPreprocessor(NLPTokenizerPreprocessorBase):
+class SentenceEmbeddingTransformersPreprocessor(Preprocessor):
     """The tokenizer preprocessor used in sentence embedding.
     """

-    def __init__(self, model_dir: str, mode=ModeKeys.INFERENCE, **kwargs):
-        kwargs['truncation'] = kwargs.get('truncation', True)
-        kwargs['padding'] = kwargs.get('padding', 'max_length')
-        kwargs['max_length'] = kwargs.pop('sequence_length', 128)
-        super().__init__(model_dir, mode=mode, **kwargs)
+    def __init__(self,
+                 model_dir: str,
+                 first_sequence='source_sentence',
+                 second_sequence='sentences_to_compare',
+                 mode=ModeKeys.INFERENCE,
+                 use_fast: bool = None,
+                 sequence_length: int = 128,
+                 **kwargs):
+        """The preprocessor for the sentence embedding task, based on transformers' tokenizer.
+
+        Args:
+            model_dir: The model dir used to initialize the tokenizer.
+            first_sequence: The key of the first sequence.
+            second_sequence: The key of the second sequence.
+            mode: The mode for the preprocessor.
+            use_fast: Use the fast tokenizer or not.
+            sequence_length: The max sequence length which the model supports,
+                will be passed into the tokenizer as the 'max_length' param.
+            **kwargs: Extra args input into the tokenizer's __call__ method.
+        """
+        self.first_sequence = first_sequence
+        self.second_sequence = second_sequence
+        kwargs['max_length'] = sequence_length
+        model_type = None
+        if model_dir is not None:
+            model_type = get_model_type(model_dir)
+        self.nlp_tokenizer = NLPTokenizer(
+            model_dir, model_type, use_fast=use_fast, tokenize_kwargs=kwargs)
+        super().__init__(mode=mode)

-    def __call__(self, data: Union[str, Dict]) -> Dict[str, Any]:
+    def __call__(self,
+                 data: Dict,
+                 padding=True,
+                 truncation=True,
+                 **kwargs) -> Dict[str, Any]:
         """process the raw input data

         Args:
             data Dict:
-                keys: "source_sentence" && "sentences_to_compare"
+                keys: the source sentence key and the sentences-to-compare key
                 values: list of sentences
                 Example:
                     {"source_sentence": ["how long it take to get a master's degree"],

@@ -37,16 +67,16 @@ class SentenceEmbeddingPreprocessor(NLPTokenizerPreprocessorBase):
         Returns:
             Dict[str, Any]: the preprocessed data
         """
-        source_sentence = data['source_sentence']
-        compare_sentences = data['sentences_to_compare']
-        sentences = []
-        sentences.append(source_sentence[0])
+        source_sentence = data[self.first_sequence]
+        compare_sentences = data[self.second_sequence]
+        sentences = [source_sentence[0]]
         for sent in compare_sentences:
             sentences.append(sent)
-        tokenized_inputs = self.tokenizer(
-            sentences,
-            return_tensors='pt' if self._mode == ModeKeys.INFERENCE else None,
-            padding=True,
-            truncation=True)
+        if 'return_tensors' not in kwargs:
+            kwargs[
+                'return_tensors'] = 'pt' if self.mode == ModeKeys.INFERENCE else None
+        tokenized_inputs = self.nlp_tokenizer(
+            sentences, padding=padding, truncation=truncation, **kwargs)
         return tokenized_inputs
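A usage sketch based on the docstring example (the compare sentences are made up):

    preprocessor = SentenceEmbeddingTransformersPreprocessor(model_dir)
    inputs = preprocessor({
        'source_sentence': ["how long it take to get a master's degree"],
        'sentences_to_compare': [
            'Most masters degree programs take 18 to 24 months.',
            'A bachelor degree usually takes four years.',
        ],
    })
    # Row 0 of inputs['input_ids'] is the source sentence; the rest are the candidates.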
@@ -1,7 +1,6 @@
 # Copyright (c) Alibaba, Inc. and its affiliates.
+import os
 import os.path as osp
-from typing import Any, Dict

 import sentencepiece as spm
 import torch

@@ -9,17 +8,26 @@ import torch
 from modelscope.metainfo import Preprocessors
 from modelscope.preprocessors.base import Preprocessor
 from modelscope.preprocessors.builder import PREPROCESSORS
-from modelscope.utils.constant import Fields
+from modelscope.utils.constant import Fields, ModeKeys


 @PREPROCESSORS.register_module(
     Fields.nlp, module_name=Preprocessors.sentence_piece)
 class SentencePiecePreprocessor(Preprocessor):

-    def __init__(self, model_dir: str, *args, **kwargs):
-        import os
-        super().__init__(*args, **kwargs)
+    def __init__(self,
+                 model_dir: str,
+                 mode=ModeKeys.INFERENCE,
+                 *args,
+                 **kwargs):
+        """The preprocessor for the sentence piece tokenizer.
+
+        Args:
+            model_dir: The model dir that contains the essential files used by the `SentencePieceProcessor`.
+            mode: The mode for the preprocessor.
+        """
+        super().__init__(mode)
         self.tokenizer = None
         for file_name in os.listdir(model_dir):
             if file_name.endswith('.model'):

@@ -28,5 +36,5 @@ class SentencePiecePreprocessor(Preprocessor):
                 break
         assert self.tokenizer is not None, 'Can not find .model file'

-    def __call__(self, data: str) -> Dict[str, Any]:
+    def __call__(self, data: str) -> torch.Tensor:
         return torch.tensor(self.tokenizer.encode([data]), dtype=torch.long)
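A usage sketch; the model dir is assumed to contain a *.model sentencepiece file:

    preprocessor = SentencePiecePreprocessor(model_dir)
    ids = preprocessor('深蓝的天空中挂着一轮金黄的圆月')
    # ids is a torch.LongTensor of shape [1, num_tokens], matching the new return annotation.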
@@ -1,40 +0,0 @@
-# Copyright (c) Alibaba, Inc. and its affiliates.
-from typing import Any, Dict, Union
-
-from modelscope.metainfo import Preprocessors
-from modelscope.preprocessors.builder import PREPROCESSORS
-from modelscope.utils.constant import Fields, ModeKeys
-from .nlp_base import NLPTokenizerPreprocessorBase
-
-
-@PREPROCESSORS.register_module(
-    Fields.nlp, module_name=Preprocessors.text2text_gen_preprocessor)
-class Text2TextGenerationPreprocessor(NLPTokenizerPreprocessorBase):
-    """The tokenizer preprocessor used in text generation.
-    """
-
-    def __init__(self,
-                 model_dir: str,
-                 tokenizer=None,
-                 mode=ModeKeys.INFERENCE,
-                 **kwargs):
-        kwargs['truncation'] = kwargs.get('truncation', 'do_not_truncate')
-        kwargs['padding'] = kwargs.get('padding', False)
-        kwargs['return_token_type_ids'] = kwargs.get('return_token_type_ids',
-                                                     False)
-        kwargs['max_length'] = kwargs.pop('sequence_length', 128)
-        super().__init__(model_dir, mode=mode, **kwargs)
-
-    def __call__(self, data: Union[Dict, str]) -> Dict[str, Any]:
-        text_a, _, _ = self.parse_text_and_label(data)
-        inputs = self.tokenizer(
-            text_a,
-            return_tensors='pt' if self._mode == ModeKeys.INFERENCE else None,
-            **self.tokenize_kwargs)
-
-        # This is produced by tokenizers but is an invalid generate kwargs
-        if 'token_type_ids' in inputs:
-            del inputs['token_type_ids']
-        return inputs
@@ -0,0 +1,152 @@
+# Copyright (c) Alibaba, Inc. and its affiliates.
+from abc import abstractmethod
+from typing import Any, Dict, List, Tuple, Union
+
+import numpy as np
+
+from modelscope.metainfo import Preprocessors
+from modelscope.preprocessors import Preprocessor
+from modelscope.preprocessors.builder import PREPROCESSORS
+from modelscope.utils.constant import Fields, ModeKeys
+from modelscope.utils.hub import get_model_type, parse_label_mapping
+from modelscope.utils.logger import get_logger
+from .transformers_tokenizer import NLPTokenizer
+from .utils import labels_to_id, parse_text_and_label
+
+logger = get_logger(__name__)
+
+
+class TextClassificationPreprocessorBase(Preprocessor):
+
+    def __init__(
+            self,
+            model_dir=None,
+            first_sequence: str = None,
+            second_sequence: str = None,
+            label: str = 'label',
+            label2id: Dict = None,
+            mode: str = ModeKeys.INFERENCE,
+    ):
+        """The base class for the text classification preprocessors.
+
+        Args:
+            model_dir(str, `optional`): The model dir used to parse the label mapping, can be None.
+            first_sequence(str, `optional`): The key of the first sequence.
+            second_sequence(str, `optional`): The key of the second sequence.
+            label(str, `optional`): The key of the label column, default `label`.
+            label2id(dict, `optional`): The optional label2id mapping.
+            mode: The mode for the preprocessor.
+        """
+        super().__init__(mode)
+        self.model_dir = model_dir
+        self.first_sequence = first_sequence
+        self.second_sequence = second_sequence
+        self.label = label
+        self.label2id = label2id
+        if self.label2id is None and self.model_dir is not None:
+            self.label2id = parse_label_mapping(self.model_dir)
+        logger.info(f'The key of sentence1: {self.first_sequence}, '
+                    f'The key of sentence2: {self.second_sequence}, '
+                    f'The key of label: {self.label}')
+        if self.first_sequence is None:
+            logger.warning('[Important] first_sequence attribute is not set, '
+                           'this will cause an error if your input is a dict.')
+
+    @property
+    def id2label(self):
+        """Return the id2label mapping according to the label2id mapping.
+
+        @return: The id2label mapping if it exists.
+        """
+        if self.label2id is not None:
+            return {id: label for label, id in self.label2id.items()}
+        return None
+
+    def __call__(self, data: Union[str, Tuple, Dict],
+                 **kwargs) -> Dict[str, Any]:
+        """Process the raw input data.
+
+        Args:
+            data (tuple): [sentence1, sentence2]
+                sentence1 (str): a sentence
+                    Example:
+                        'you are so handsome.'
+                sentence2 (str): a sentence
+                    Example:
+                        'you are so beautiful.'
+
+        Returns:
+            Dict[str, Any]: the preprocessed data
+        """
+        text_a, text_b, labels = parse_text_and_label(data, self.mode,
+                                                      self.first_sequence,
+                                                      self.second_sequence,
+                                                      self.label)
+        output = self._tokenize_text(text_a, text_b, **kwargs)
+        output = {
+            k: np.array(v) if isinstance(v, list) else v
+            for k, v in output.items()
+        }
+        labels_to_id(labels, output, self.label2id)
+        return output
+
+    def _tokenize_text(self, sequence1, sequence2=None, **kwargs):
+        """Tokenize the text.
+
+        Args:
+            sequence1: The first sequence.
+            sequence2: The second sequence, which may be None.
+
+        Returns:
+            The encoded sequence.
+        """
+        raise NotImplementedError()
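The `labels_to_id` helper imported from `.utils` replaces the method of the removed `NLPTokenizerPreprocessorBase`; a sketch of its behavior, based on the deleted method shown earlier (the util's exact code may differ):

    def labels_to_id(labels, output, label2id):
        # Map str/int labels through label2id when possible, otherwise keep them as-is.
        def can_map(label):
            return isinstance(label, (str, int))

        if isinstance(labels, (tuple, list)) and label2id is not None \
                and all(can_map(label) for label in labels):
            output[OutputKeys.LABELS] = [
                label2id[label] if label in label2id else label2id[str(label)]
                for label in labels
            ]
        elif can_map(labels) and label2id is not None:
            output[OutputKeys.LABELS] = label2id[
                labels] if labels in label2id else label2id[str(labels)]
        elif labels is not None:
            output[OutputKeys.LABELS] = labels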
+@PREPROCESSORS.register_module(
+    Fields.nlp, module_name=Preprocessors.nli_tokenizer)
+@PREPROCESSORS.register_module(
+    Fields.nlp, module_name=Preprocessors.sen_sim_tokenizer)
+@PREPROCESSORS.register_module(
+    Fields.nlp, module_name=Preprocessors.bert_seq_cls_tokenizer)
+@PREPROCESSORS.register_module(
+    Fields.nlp, module_name=Preprocessors.sen_cls_tokenizer)
+class TextClassificationTransformersPreprocessor(
+        TextClassificationPreprocessorBase):
+
+    def _tokenize_text(self, sequence1, sequence2=None, **kwargs):
+        if 'return_tensors' not in kwargs:
+            kwargs[
+                'return_tensors'] = 'pt' if self.mode == ModeKeys.INFERENCE else None
+        return self.nlp_tokenizer(sequence1, sequence2, **kwargs)
+
+    def __init__(self,
+                 model_dir=None,
+                 first_sequence: str = None,
+                 second_sequence: str = None,
+                 label: Union[str, List] = 'label',
+                 label2id: Dict = None,
+                 mode: str = ModeKeys.INFERENCE,
+                 sequence_length: int = 128,
+                 use_fast: bool = None,
+                 **kwargs):
+        """The tokenizer preprocessor used in sequence classification.
+
+        Args:
+            use_fast: Whether to use the fast tokenizer or not.
+            sequence_length: The max sequence length which the model supports,
+                will be passed into the tokenizer as the 'max_length' param.
+            **kwargs: Extra args input into the tokenizer's __call__ method.
+        """
+        kwargs['truncation'] = kwargs.get('truncation', True)
+        kwargs['padding'] = kwargs.get('padding', 'max_length')
+        kwargs['max_length'] = sequence_length
+        model_type = None
+        if model_dir is not None:
+            model_type = get_model_type(model_dir)
+        self.nlp_tokenizer = NLPTokenizer(
+            model_dir, model_type, use_fast=use_fast, tokenize_kwargs=kwargs)
+        super().__init__(model_dir, first_sequence, second_sequence, label,
+                         label2id, mode)
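A usage sketch covering both modes (the keys and label mapping are made up):

    preprocessor = TextClassificationTransformersPreprocessor(
        model_dir,
        first_sequence='sentence1',
        second_sequence='sentence2',
        label2id={'negative': 0, 'positive': 1})
    features = preprocessor({
        'sentence1': 'you are so handsome.',
        'sentence2': 'you are so beautiful.'
    })
    # In train/eval mode, data['label'] is additionally mapped into features['labels'].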
@@ -7,12 +7,11 @@ from modelscope.metainfo import Preprocessors
 from modelscope.preprocessors.base import Preprocessor
 from modelscope.preprocessors.builder import PREPROCESSORS
 from modelscope.utils.constant import Fields
-from .nlp_base import NLPBasePreprocessor


 @PREPROCESSORS.register_module(
     Fields.nlp, module_name=Preprocessors.text_error_correction)
-class TextErrorCorrectionPreprocessor(NLPBasePreprocessor):
+class TextErrorCorrectionPreprocessor(Preprocessor):
     """The preprocessor used in text correction task.
     """

@@ -23,7 +22,7 @@ class TextErrorCorrectionPreprocessor(NLPBasePreprocessor):
         Args:
             model_dir (str): model path
         """
-        super().__init__(model_dir, *args, **kwargs)
+        super().__init__(*args, **kwargs)
         self.vocab = Dictionary.load(osp.join(model_dir, 'dict.src.txt'))

     def __call__(self, data: str) -> Dict[str, Any]:
@@ -1,44 +0,0 @@
-# Copyright (c) Alibaba, Inc. and its affiliates.
-import os.path as osp
-from typing import Any, Dict
-
-from modelscope.metainfo import Preprocessors
-from modelscope.preprocessors.base import Preprocessor
-from modelscope.preprocessors.builder import PREPROCESSORS
-from modelscope.utils.constant import Fields
-
-
-@PREPROCESSORS.register_module(
-    Fields.nlp, module_name=Preprocessors.text_gen_jieba_tokenizer)
-class TextGenerationJiebaPreprocessor(Preprocessor):
-    """The jieba tokenizer preprocessor used in text generation.
-    """
-
-    def __init__(self, model_dir: str, *args, **kwargs):
-        from modelscope.models.nlp.gpt3 import JiebaBPETokenizer
-        super().__init__(*args, **kwargs)
-        self.tokenizer = JiebaBPETokenizer(
-            osp.join(model_dir, 'tokenizer.json'))
-
-    def __call__(self, data: str) -> Dict[str, Any]:
-        """process the raw input data
-
-        Args:
-            data (str): a sentence
-                Example:
-                    '深蓝的天空中挂着一轮金黄的圆月,下面是海边的沙地'
-
-        Returns:
-            Dict[str, Any]: the preprocessed data
-            Example:
-                {'net_input':
-                    {'src_tokens': tensor([1, 2, 3, 4]),
-                     'src_lengths': tensor([4])}
-                }
-        """
-        import torch
-        return {
-            'input_ids':
-            torch.tensor(self.tokenizer.tokenize(data)).unsqueeze_(0)
-        }
@@ -1,62 +1,257 @@
 # Copyright (c) Alibaba, Inc. and its affiliates.
-import os.path as osp
 from typing import Any, Dict, Optional, Union

+import numpy as np
+import torch
+
 from modelscope.metainfo import Preprocessors
+from modelscope.preprocessors.base import Preprocessor
 from modelscope.preprocessors.builder import PREPROCESSORS
 from modelscope.utils.constant import Fields, ModeKeys
-from .nlp_base import NLPTokenizerPreprocessorBase
+from modelscope.utils.hub import get_model_type
+from modelscope.utils.logger import get_logger
+from .transformers_tokenizer import NLPTokenizer
+from .utils import parse_text_and_label
+
+logger = get_logger(__name__)
+
+
+class TextGenerationPreprocessorBase(Preprocessor):
+
+    def __init__(self,
+                 mode: str = ModeKeys.INFERENCE,
+                 src_txt='src_txt',
+                 tgt_txt='tgt_txt'):
+        """The base class for all the text generation task's preprocessors.
+
+        Args:
+            mode: The preprocessor mode.
+            src_txt: The key of the source text.
+            tgt_txt: The key of the target text.
+        """
+        super().__init__(mode)
+        self.src_txt = src_txt
+        self.tgt_txt = tgt_txt
+
+    def _tokenize_text(self, sequence1, sequence2=None, **kwargs):
+        """Tokenize the text.
+
+        Args:
+            sequence1: The first sequence.
+            sequence2: The second sequence, which may be None.
+
+        Returns:
+            The encoded sequence.
+        """
+        raise NotImplementedError()
+
+    def __call__(self, data: Union[Dict, str], **kwargs) -> Dict[str, Any]:
+        text_a, text_b = parse_text_and_label(data, self.mode, self.src_txt,
+                                              self.tgt_txt)[0:2]
+        output = self._tokenize_text(text_a, text_b, **kwargs)
+        output = {
+            k: np.array(v) if isinstance(v, list) else v
+            for k, v in output.items()
+        }
+        return output
+
+    def decode(self, tokens, **kwargs):
+        """Decode the tokens to real text.
+
+        Args:
+            tokens: The output tokens from the model's `forward` and `generate`.
+
+        Returns:
+            The actual text.
+        """
+        raise NotImplementedError()
+
+
+class NLPTokenizerForRoberta(NLPTokenizer):
+
+    def build_tokenizer(self):
+
+        def get_roberta_tokenizer_dir(model_dir: str) -> Optional[str]:
+            import os
+            for name in os.listdir(model_dir):
+                full_name = os.path.join(model_dir, name)
+                if 'roberta' in name and os.path.isdir(full_name):
+                    return full_name
+
+        roberta_tokenizer_dir = get_roberta_tokenizer_dir(self.model_dir)
+        if roberta_tokenizer_dir:
+            from transformers import RobertaTokenizer
+            return RobertaTokenizer.from_pretrained(
+                roberta_tokenizer_dir, do_lower_case=False)
+        return super().build_tokenizer()


 @PREPROCESSORS.register_module(
     Fields.nlp, module_name=Preprocessors.text_gen_tokenizer)
-class TextGenerationPreprocessor(NLPTokenizerPreprocessorBase):
-    """The tokenizer preprocessor used in text generation.
-    """
+class TextGenerationTransformersPreprocessor(TextGenerationPreprocessorBase):

     def __init__(self,
                  model_dir: str,
                  tokenizer=None,
-                 mode=ModeKeys.INFERENCE,
+                 mode: str = ModeKeys.INFERENCE,
+                 src_txt='src_txt',
+                 tgt_txt='tgt_txt',
+                 sequence_length: int = 128,
+                 use_fast: bool = None,
                  **kwargs):
+        """The tokenizer preprocessor used in text generation.
+
+        Args:
+            model_dir: The model dir used to initialize the tokenizer.
+            mode: The mode for the preprocessor.
+            src_txt: The key of the source sentence.
+            tgt_txt: The key of the generated sentence.
+            sequence_length: The max sequence length which the model supports,
+                will be passed into the tokenizer as the 'max_length' param.
+            use_fast: Whether to use the fast tokenizer or not.
+            **kwargs: Extra args input into the tokenizer's __call__ method.
+        """
+        if 'first_sequence' in kwargs:
+            src_txt = kwargs.pop('first_sequence')
+        super().__init__(mode, src_txt, tgt_txt)
         kwargs['truncation'] = kwargs.get('truncation', True)
         kwargs['padding'] = kwargs.get('padding', 'max_length')
         kwargs['return_token_type_ids'] = kwargs.get('return_token_type_ids',
                                                      False)
-        kwargs['max_length'] = kwargs.pop('sequence_length', 128)
-        super().__init__(model_dir, mode=mode, **kwargs)
-
-    @staticmethod
-    def get_roberta_tokenizer_dir(model_dir: str) -> Optional[str]:
-        import os
-        for name in os.listdir(model_dir):
-            full_name = os.path.join(model_dir, name)
-            if 'roberta' in name and os.path.isdir(full_name):
-                return full_name
-
-    def build_tokenizer(self, model_dir: str):
-        roberta_tokenizer_dir = self.get_roberta_tokenizer_dir(model_dir)
-        if roberta_tokenizer_dir:
-            from transformers import RobertaTokenizer
-            return RobertaTokenizer.from_pretrained(
-                roberta_tokenizer_dir, do_lower_case=False)
-        return super().build_tokenizer(model_dir)
-
-    def __call__(self, data: Union[Dict, str]) -> Dict[str, Any]:
-        if self._mode == ModeKeys.INFERENCE:
-            return super().__call__(data)
-        src_rst = super().__call__(data['src_txt'])
-        src_input_ids = src_rst['input_ids']
-        src_attention_mask = src_rst['attention_mask']
-        if 'tgt_txt' in data:
-            labels = super().__call__(data['tgt_txt'])['input_ids']
-        else:
-            labels = src_input_ids[1:]
-            src_input_ids = src_input_ids[:-1]
-            src_attention_mask = src_attention_mask[:-1]
+        kwargs['max_length'] = sequence_length
+        model_type = None
+        if model_dir is not None:
+            model_type = get_model_type(model_dir)
+        self.nlp_tokenizer = NLPTokenizerForRoberta(
+            model_dir, model_type, use_fast=use_fast, tokenize_kwargs=kwargs)

+    def decode(self, tokens, **kwargs):
+        """Decode the tokens to real text.
+
+        Args:
+            tokens: The output tokens from the model's `forward` and `generate`.
+
+        Returns:
+            The actual text.
+        """
+        return self.nlp_tokenizer.tokenizer.decode(tokens, **kwargs)
+
+    def _tokenize_text(self, sequence1, sequence2=None, **kwargs):
+        """Tokenize the text.
+
+        Args:
+            sequence1: The first sequence.
+            sequence2: The second sequence, which may be None.
+
+        Returns:
+            The encoded sequence.
+        """
+        if 'return_tensors' not in kwargs:
+            kwargs[
+                'return_tensors'] = 'pt' if self.mode == ModeKeys.INFERENCE else None
+        output = self.nlp_tokenizer(sequence1, **kwargs)
+        if self.mode != ModeKeys.INFERENCE:
+            if sequence2 is not None:
+                labels = self.nlp_tokenizer(sequence2)['input_ids']
+                src_input_ids = output['input_ids']
+                src_attention_mask = output['attention_mask']
+            else:
+                labels = output['input_ids'][1:]
+                src_input_ids = output['input_ids'][:-1]
+                src_attention_mask = output['attention_mask'][:-1]
+            output = {
+                'input_ids': src_input_ids,
+                'attention_mask': src_attention_mask,
+                'labels': labels,
+            }
+        return output
+@PREPROCESSORS.register_module(
+    Fields.nlp, module_name=Preprocessors.text_gen_jieba_tokenizer)
+class TextGenerationJiebaPreprocessor(TextGenerationPreprocessorBase):
+    """The jieba tokenizer preprocessor used in text generation.
+    """
+
+    def __init__(self,
+                 model_dir: str,
+                 mode: str = ModeKeys.INFERENCE,
+                 src_txt='src_txt',
+                 tgt_txt=None):
+        from modelscope.models.nlp.gpt3 import JiebaBPETokenizer
+        super().__init__(mode, src_txt, tgt_txt)
+        if self.tgt_txt is not None:
+            logger.warn(
+                f'TextGenerationJiebaPreprocessor currently does not support training, '
+                f'the {self.tgt_txt} of the tgt_txt field will be ignored.')
+        self.src_txt = src_txt
+        self.tokenizer = JiebaBPETokenizer(
+            osp.join(model_dir, 'tokenizer.json'))
+
+    def decode(self, tokens, **kwargs):
+        """Decode the tokens to real text.
+
+        Args:
+            tokens: The output tokens from model's `forward` and `generate`
+
+        Returns:
+            The actual text.
+        """
+        return self.tokenizer.detokenize(tokens)
+
+    def _tokenize_text(self, sequence1, sequence2=None, **kwargs):
+        """Tokenize the text.
+
+        Args:
+            sequence1: The first sequence.
+            sequence2: The second sequence which may be None.
+
+        Returns:
+            The encoded sequence.
+        """
+        return {
+            'input_ids':
+            torch.tensor(self.tokenizer.tokenize(sequence1)).unsqueeze_(0)
+        }
+
+
+@PREPROCESSORS.register_module(
+    Fields.nlp, module_name=Preprocessors.text2text_gen_preprocessor)
+class TextGenerationT5Preprocessor(TextGenerationTransformersPreprocessor):
+
+    def __init__(self,
+                 model_dir: str,
+                 mode: str = ModeKeys.INFERENCE,
+                 src_txt='src_txt',
+                 tgt_txt='tgt_txt',
+                 use_fast: bool = None,
+                 sequence_length: int = 128,
+                 **kwargs):
+        """The preprocessor for text to text generation task, based on transformers' tokenizer.
+
+        Args:
+            model_dir: The model dir used to initialize the tokenizer.
+            src_txt: The key of the first sequence.
+            tgt_txt: The key of the generated sentence.
+            use_fast: Use the fast tokenizer or not.
+            sequence_length: The max sequence length which the model supported,
+                will be passed into tokenizer as the 'max_length' param.
+            mode: The mode for the preprocessor.
+            **kwargs: Extra args input into the tokenizer's __call__ method.
+        """
+        super().__init__(
+            model_dir,
+            mode=mode,
+            src_txt=src_txt,
+            tgt_txt=tgt_txt,
+            sequence_length=sequence_length,
+            use_fast=use_fast,
+            truncation=kwargs.pop('truncation', True),
+            padding=kwargs.pop('padding', 'max_length'),
+            return_token_type_ids=kwargs.pop('return_token_type_ids', False),
+            **kwargs)
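Note: a minimal usage sketch of the refactored text generation preprocessor (the local checkpoint path and sample texts are made up for illustration):

    from modelscope.utils.constant import ModeKeys

    preprocessor = TextGenerationTransformersPreprocessor(
        model_dir='/path/to/local/model',  # hypothetical checkpoint dir
        mode=ModeKeys.TRAIN,
        sequence_length=128)

    # In training/evaluation mode both sequences are tokenized and the
    # target ids come back under 'labels'; in inference mode only the
    # source text is tokenized and returned as tensors.
    features = preprocessor({
        'src_txt': 'The quick brown fox jumps over the lazy dog.',
        'tgt_txt': 'A fox jumps over a dog.'
    })
    print(sorted(features.keys()))  # ['attention_mask', 'input_ids', 'labels']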
@@ -1,67 +1,78 @@
 # Copyright (c) Alibaba, Inc. and its affiliates.

-from typing import Any, Dict, Union
+from typing import Any, Dict

 from transformers import AutoTokenizer

 from modelscope.metainfo import Preprocessors
+from modelscope.preprocessors import Preprocessor
 from modelscope.preprocessors.builder import PREPROCESSORS
 from modelscope.utils.constant import Fields, ModeKeys
 from modelscope.utils.type_assert import type_assert
-from .nlp_base import NLPTokenizerPreprocessorBase


 @PREPROCESSORS.register_module(
     Fields.nlp, module_name=Preprocessors.text_ranking)
-class TextRankingPreprocessor(NLPTokenizerPreprocessorBase):
-    """The tokenizer preprocessor used in passage ranking model.
-    """
+class TextRankingTransformersPreprocessor(Preprocessor):

     def __init__(self,
                  model_dir: str,
-                 mode=ModeKeys.INFERENCE,
-                 *args,
+                 mode: str = ModeKeys.INFERENCE,
+                 first_sequence='source_sentence',
+                 second_sequence='sentences_to_compare',
+                 label='labels',
+                 qid='qid',
+                 sequence_length=128,
                  **kwargs):
-        """preprocess the data
+        """The tokenizer preprocessor used in text ranking.

         Args:
-            model_dir (str): model path
+            model_dir(str): The model dir used to initialize the tokenizer.
+            first_sequence(str, `optional`): The key of the first sequence.
+            second_sequence(str, `optional`): The key of the second sequence.
+            label(str, `optional`): The key of the label column, default `labels`.
+            qid(str, `optional`): The key of the query id column.
+            mode: The mode for the preprocessor.
+            sequence_length: The max sequence length which the model supported,
+                will be passed into tokenizer as the 'max_length' param.
         """
-        super().__init__(model_dir, mode=mode, *args, **kwargs)
-        self.model_dir: str = model_dir
-        self.first_sequence: str = kwargs.pop('first_sequence',
-                                              'source_sentence')
-        self.second_sequence = kwargs.pop('second_sequence',
-                                          'sentences_to_compare')
-        self.sequence_length = kwargs.pop('sequence_length', 128)
+        super().__init__(mode)
+        self.model_dir = model_dir
+        self.first_sequence = first_sequence
+        self.second_sequence = second_sequence
+        self.label = label
+        self.qid = qid
+        self.sequence_length = sequence_length
         self.tokenizer = AutoTokenizer.from_pretrained(self.model_dir)

-    @type_assert(object, (str, tuple, Dict))
-    def __call__(self, data: Union[tuple, Dict]) -> Dict[str, Any]:
-        if isinstance(data, tuple):
-            sentence1, sentence2 = data
-        elif isinstance(data, dict):
-            sentence1 = data.get(self.first_sequence)
-            sentence2 = data.get(self.second_sequence)
+    @type_assert(object, dict)
+    def __call__(self,
+                 data: Dict,
+                 padding='max_length',
+                 truncation=True,
+                 **kwargs) -> Dict[str, Any]:
+        sentence1 = data.get(self.first_sequence)
+        sentence2 = data.get(self.second_sequence)
+        labels = data.get(self.label)
+        qid = data.get(self.qid)
         if isinstance(sentence2, str):
             sentence2 = [sentence2]
         if isinstance(sentence1, str):
             sentence1 = [sentence1]
             sentence1 = sentence1 * len(sentence2)
-        max_seq_length = self.sequence_length
+        kwargs['max_length'] = kwargs.get(
+            'max_length', kwargs.pop('sequence_length', self.sequence_length))
+        if 'return_tensors' not in kwargs:
+            kwargs['return_tensors'] = 'pt'
         feature = self.tokenizer(
             sentence1,
             sentence2,
-            padding='max_length',
-            truncation=True,
-            max_length=max_seq_length,
-            return_tensors='pt')
-        if 'labels' in data:
-            labels = data['labels']
+            padding=padding,
+            truncation=truncation,
+            **kwargs)
+        if labels is not None:
             feature['labels'] = labels
-        if 'qid' in data:
-            qid = data['qid']
+        if qid is not None:
             feature['qid'] = qid
         return feature
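Note: a hedged sketch of the expected input layout for the new text ranking preprocessor (model path and sentences are made up):

    preprocessor = TextRankingTransformersPreprocessor(
        model_dir='/path/to/ranking/model')  # hypothetical checkpoint dir

    features = preprocessor({
        'source_sentence': ['how long is the great wall'],
        'sentences_to_compare': [
            'The Great Wall is about 21,196 kilometers long.',
            'The Great Wall was built across several dynasties.',
        ],
    })
    # The single source sentence is repeated to pair with every candidate,
    # so input_ids contains one row per candidate sentence.
    print(features['input_ids'].shape)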
@@ -1,28 +1,35 @@
 # Copyright (c) Alibaba, Inc. and its affiliates.

-from typing import Any, Dict, Tuple, Union
+from typing import Any, Dict, List, Tuple, Union

 import numpy as np
 import torch

 from modelscope.metainfo import Preprocessors
 from modelscope.outputs import OutputKeys
+from modelscope.preprocessors import Preprocessor
 from modelscope.preprocessors.builder import PREPROCESSORS
 from modelscope.utils.constant import Fields, ModeKeys
+from modelscope.utils.hub import get_model_type, parse_label_mapping
+from modelscope.utils.logger import get_logger
 from modelscope.utils.type_assert import type_assert
-from .nlp_base import NLPBasePreprocessor, NLPTokenizerPreprocessorBase
+from .transformers_tokenizer import NLPTokenizer
+from .utils import parse_text_and_label
+
+logger = get_logger(__name__)


 @PREPROCESSORS.register_module(
     Fields.nlp,
     module_name=Preprocessors.word_segment_text_to_label_preprocessor)
-class WordSegmentationBlankSetToLabelPreprocessor(NLPBasePreprocessor):
+class WordSegmentationBlankSetToLabelPreprocessor(Preprocessor):
     """The preprocessor used to turn a single sentence to a labeled token-classification dict.
     """

-    def __init__(self, **kwargs):
-        self.first_sequence: str = kwargs.pop('first_sequence', 'tokens')
-        self.label = kwargs.pop('label', OutputKeys.LABELS)
+    def __init__(self, generated_sentence='tokens', generated_label='labels'):
+        super().__init__()
+        self.generated_sentence = generated_sentence
+        self.generated_label = generated_label

     def __call__(self, data: str) -> Union[Dict[str, Any], Tuple]:
         data = data.split(' ')

@@ -43,9 +50,134 @@ class WordSegmentationBlankSetToLabelPreprocessor(NLPBasePreprocessor):
         chars, labels = produce_train_sample(data)
         return {
-            self.first_sequence: chars,
-            self.label: labels,
+            self.generated_sentence: chars,
+            self.generated_label: labels,
         }
+
+
+class TokenClassificationPreprocessorBase(Preprocessor):
+
+    def __init__(
+        self,
+        model_dir: str = None,
+        first_sequence: str = None,
+        label: str = 'label',
+        label2id: Dict = None,
+        label_all_tokens: bool = False,
+        mode: str = ModeKeys.INFERENCE,
+    ):
+        """The base class for all the token-classification tasks.
+
+        Args:
+            model_dir: The model dir used to build the label2id mapping.
+                If None, the user needs to pass in the `label2id` param.
+            first_sequence: The key for the text(token) column if the input type is a dict.
+            label: The key for the label column if the input type is a dict and the mode
+                is `training` or `evaluation`.
+            label2id: The label2id mapping. If not provided, you need to specify the
+                model_dir to search the mapping from config files.
+            label_all_tokens: If a label exists in the dataset, the preprocessor will try
+                to label the tokens. If label_all_tokens is True, all non-initial sub-tokens
+                will get labels like `I-xxx`, or else the labels will be filled with -100,
+                default False.
+            mode: The preprocessor mode.
+        """
+        super().__init__(mode)
+        self.model_dir = model_dir
+        self.first_sequence = first_sequence
+        self.label = label
+        self.label2id = label2id
+        self.label_all_tokens = label_all_tokens
+        if self.label2id is None and self.model_dir is not None:
+            self.label2id = parse_label_mapping(self.model_dir)
+
+    @property
+    def id2label(self):
+        """Return the id2label mapping according to the label2id mapping.
+
+        @return: The id2label mapping if exists.
+        """
+        if self.label2id is not None:
+            return {id: label for label, id in self.label2id.items()}
+        return None
+
+    def labels_to_id(self, labels_list, word_ids):
+        # align the labels with tokenized text
+        assert self.label2id is not None
+        # Map that sends B-Xxx label to its I-Xxx counterpart
+        b_to_i_label = []
+        label_enumerate_values = [
+            k for k, v in sorted(
+                self.label2id.items(), key=lambda item: item[1])
+        ]
+        for idx, label in enumerate(label_enumerate_values):
+            if label.startswith('B-') and label.replace(
+                    'B-', 'I-') in label_enumerate_values:
+                b_to_i_label.append(
+                    label_enumerate_values.index(label.replace('B-', 'I-')))
+            else:
+                b_to_i_label.append(idx)
+
+        label_row = [self.label2id[lb] for lb in labels_list]
+        previous_word_idx = None
+        label_ids = []
+        for word_idx in word_ids:
+            if word_idx is None:
+                label_ids.append(-100)
+            elif word_idx != previous_word_idx:
+                label_ids.append(label_row[word_idx])
+            else:
+                if self.label_all_tokens:
+                    label_ids.append(b_to_i_label[label_row[word_idx]])
+                else:
+                    label_ids.append(-100)
+            previous_word_idx = word_idx
+        return label_ids
+
+    def _tokenize_text(self, sequence1, **kwargs):
+        """Tokenize the text.
+
+        Args:
+            sequence1: The only input sequence, a str or a list of tokens.
+
+        Returns:
+            The encoded sequence and the word ids.
+        """
+        raise NotImplementedError()
+
+    @type_assert(object, (str, tuple, dict))
+    def __call__(self, data: Union[dict, tuple, str],
+                 **kwargs) -> Dict[str, Any]:
+        text, _, label = parse_text_and_label(
+            data, self.mode, self.first_sequence, label=self.label)
+        outputs, word_ids = self._tokenize_text(text, **kwargs)
+        if label is not None:
+            label_ids = self.labels_to_id(label, word_ids)
+            outputs[OutputKeys.LABELS] = label_ids
+        outputs = {
+            k: np.array(v) if isinstance(v, list) else v
+            for k, v in outputs.items()
+        }
+        if self.mode == ModeKeys.INFERENCE:
+            outputs['text'] = text
+        return outputs
+
+
+class NLPTokenizerForLSTM(NLPTokenizer):
+
+    def build_tokenizer(self):
+        if self.model_type == 'lstm':
+            from transformers import AutoTokenizer
+            return AutoTokenizer.from_pretrained(
+                self.model_dir, use_fast=self.use_fast, tokenizer_type='bert')
+        else:
+            return super().build_tokenizer()
+
+    def get_tokenizer_class(self):
+        tokenizer_class = self.tokenizer.__class__.__name__
+        if tokenizer_class.endswith(
+                'Fast') and tokenizer_class != 'PreTrainedTokenizerFast':
+            tokenizer_class = tokenizer_class[:-4]
+        return tokenizer_class
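Note: the -100/`I-xxx` alignment rule implemented in `labels_to_id` above can be seen with a tiny made-up mapping (label set and word ids are illustrative only):

    base = TokenClassificationPreprocessorBase(
        label2id={'O': 0, 'B-LOC': 1, 'I-LOC': 2}, label_all_tokens=True)

    # word_ids as a fast tokenizer reports them: None for special tokens,
    # a repeated index for the second sub-token of word 1.
    word_ids = [None, 0, 1, 1, None]
    print(base.labels_to_id(['B-LOC', 'O'], word_ids))
    # [-100, 1, 0, 0, -100]; with label_all_tokens=False the repeated
    # sub-token would be -100 instead, and a repeated 'B-LOC' would map
    # to the id of its 'I-LOC' counterpart.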
@@ -54,227 +186,238 @@ class WordSegmentationBlankSetToLabelPreprocessor(NLPBasePreprocessor):
 @PREPROCESSORS.register_module(
     Fields.nlp, module_name=Preprocessors.token_cls_tokenizer)
 @PREPROCESSORS.register_module(
     Fields.nlp, module_name=Preprocessors.sequence_labeling_tokenizer)
-class TokenClassificationPreprocessor(NLPTokenizerPreprocessorBase):
+class TokenClassificationTransformersPreprocessor(
+        TokenClassificationPreprocessorBase):
     """The tokenizer preprocessor used in normal NER task.
     """

-    def __init__(self, model_dir: str, mode=ModeKeys.INFERENCE, **kwargs):
-        """preprocess the data
+    def __init__(self,
+                 model_dir: str = None,
+                 first_sequence: str = None,
+                 label: str = 'label',
+                 label2id: Dict = None,
+                 label_all_tokens: bool = False,
+                 mode: str = ModeKeys.INFERENCE,
+                 sequence_length=128,
+                 use_fast=None,
+                 **kwargs):
+        """

         Args:
-            model_dir (str): model path
+            use_fast: Whether to use the fast tokenizer or not.
+            sequence_length: The max sequence length which the model supported,
+                will be passed into tokenizer as the 'max_length' param.
+            **kwargs: Extra args input into the tokenizer's __call__ method.
+                Other args are documented in the base class.
         """
+        super().__init__(model_dir, first_sequence, label, label2id,
+                         label_all_tokens, mode)
+        self.is_lstm_model = model_dir is not None and 'lstm' in model_dir
+        model_type = None
+        if self.is_lstm_model:
+            model_type = 'lstm'
+        elif model_dir is not None:
+            model_type = get_model_type(model_dir)
         kwargs['truncation'] = kwargs.get('truncation', True)
-        kwargs['padding'] = kwargs.get(
-            'padding', False if mode == ModeKeys.INFERENCE else 'max_length')
-        kwargs['max_length'] = kwargs.pop('sequence_length', 128)
-        self.sequence_length = kwargs['max_length']
-        self.label_all_tokens = kwargs.pop('label_all_tokens', False)
-        super().__init__(model_dir, mode=mode, **kwargs)
-
-        if 'is_split_into_words' in kwargs:
-            self.tokenize_kwargs['is_split_into_words'] = kwargs.pop(
-                'is_split_into_words')
-        else:
-            self.tokenize_kwargs[
-                'is_split_into_words'] = self.tokenizer.init_kwargs.get(
-                    'is_split_into_words', False)
-        if 'label2id' in kwargs:
-            kwargs.pop('label2id')
+        kwargs['padding'] = kwargs.get('padding', 'max_length')
+        kwargs['max_length'] = sequence_length
+        kwargs['add_special_tokens'] = model_type != 'lstm'
+        self.nlp_tokenizer = NLPTokenizerForLSTM(
+            model_dir=model_dir,
+            model_type=model_type,
+            use_fast=use_fast,
+            tokenize_kwargs=kwargs)
-    @type_assert(object, (str, dict))
-    def __call__(self, data: Union[dict, str]) -> Dict[str, Any]:
-        """process the raw input data
-
-        Args:
-            data (str): a sentence
-                Example:
-                    'you are so handsome.'
-
-        Returns:
-            Dict[str, Any]: the preprocessed data
-        """
+    def _tokenize_text(self, text: Union[str, List[str]], **kwargs):
+        tokens = text
+        if self.mode != ModeKeys.INFERENCE:
+            assert isinstance(tokens, list), 'Input needs to be lists in training and evaluating, ' \
+                                             'because the length of the words and the labels need to be equal.'
+        is_split_into_words = self.nlp_tokenizer.get_tokenizer_kwarg(
+            'is_split_into_words', False)
+        if is_split_into_words:
+            tokens = list(tokens)
+        if is_split_into_words and self.mode == ModeKeys.INFERENCE:
+            encodings, word_ids = self._tokenize_text_by_words(
+                tokens, **kwargs)
+        elif self.nlp_tokenizer.tokenizer.is_fast:
+            encodings, word_ids = self._tokenize_text_with_fast_tokenizer(
+                tokens, **kwargs)
+        else:
+            encodings, word_ids = self._tokenize_text_with_slow_tokenizer(
+                tokens, **kwargs)
-        # preprocess the data for the model input
-        text = None
-        labels_list = None
-        if isinstance(data, str):
-            # for inference inputs without label
-            text = data
-        elif isinstance(data, dict):
-            # for finetune inputs with label
-            text = data.get(self.first_sequence)
-            labels_list = data.get(self.label)
-        if isinstance(text, list):
-            self.tokenize_kwargs['is_split_into_words'] = True
-        if self._mode == ModeKeys.INFERENCE:
-            self.tokenize_kwargs['add_special_tokens'] = False
+        if self.mode == ModeKeys.INFERENCE:
+            for key in encodings.keys():
+                encodings[key] = torch.tensor(encodings[key]).unsqueeze(0)
+        else:
+            encodings.pop('offset_mapping', None)
+        return encodings, word_ids

+    def _tokenize_text_by_words(self, tokens, **kwargs):
         input_ids = []
         label_mask = []
         offset_mapping = []
-        token_type_ids = []
-        if self.tokenize_kwargs[
-                'is_split_into_words'] and self._mode == ModeKeys.INFERENCE:
-            for offset, token in enumerate(list(text)):
-                subtoken_ids = self.tokenizer.encode(token,
-                                                     **self.tokenize_kwargs)
-                if len(subtoken_ids) == 0:
-                    subtoken_ids = [self.tokenizer.unk_token_id]
-                input_ids.extend(subtoken_ids)
-                label_mask.extend([1] + [0] * (len(subtoken_ids) - 1))
-                offset_mapping.extend([(offset, offset + 1)])
+        attention_mask = []
+        for offset, token in enumerate(tokens):
+            subtoken_ids = self.nlp_tokenizer.tokenizer.encode(
+                token, add_special_tokens=False)
+            if len(subtoken_ids) == 0:
+                subtoken_ids = [self.nlp_tokenizer.tokenizer.unk_token_id]
+            input_ids.extend(subtoken_ids)
+            attention_mask.extend([1] * len(subtoken_ids))
+            label_mask.extend([True] + [False] * (len(subtoken_ids) - 1))
+            offset_mapping.extend([(offset, offset + 1)])
+        padding = kwargs.get('padding',
+                             self.nlp_tokenizer.get_tokenizer_kwarg('padding'))
+        max_length = kwargs.get(
+            'max_length',
+            kwargs.get('sequence_length',
+                       self.nlp_tokenizer.get_tokenizer_kwarg('max_length')))
+        special_token = 1 if self.nlp_tokenizer.get_tokenizer_kwarg(
+            'add_special_tokens') else 0
+        if len(label_mask) > max_length - 2 * special_token:
+            label_mask = label_mask[:(max_length - 2 * special_token)]
+            input_ids = input_ids[:(max_length - 2 * special_token)]
+            offset_mapping = offset_mapping[:sum(label_mask)]
+        if padding == 'max_length':
+            label_mask = [False] * special_token + label_mask + \
+                         [False] * (max_length - len(label_mask) - special_token)
+            offset_mapping = offset_mapping + [(0, 0)] * (
+                max_length - len(offset_mapping))
+            input_ids = [self.nlp_tokenizer.tokenizer.cls_token_id] * special_token + input_ids + \
+                        [self.nlp_tokenizer.tokenizer.sep_token_id] * special_token + \
+                        [self.nlp_tokenizer.tokenizer.pad_token_id] * (max_length - len(input_ids) - 2 * special_token)
+            attention_mask = attention_mask + [1] * (
+                special_token * 2) + [0] * (
+                    max_length - len(attention_mask) - 2 * special_token)
         else:
-            if self.tokenizer.is_fast:
-                encodings = self.tokenizer(
-                    text, return_offsets_mapping=True, **self.tokenize_kwargs)
-                attention_mask = encodings['attention_mask']
-                if 'token_type_ids' in encodings:
-                    token_type_ids = encodings['token_type_ids']
-                input_ids = encodings['input_ids']
-                word_ids = encodings.word_ids()
-                for i in range(len(word_ids)):
-                    if word_ids[i] is None:
-                        label_mask.append(0)
-                    elif word_ids[i] == word_ids[i - 1]:
-                        label_mask.append(0)
-                        offset_mapping[-1] = (
-                            offset_mapping[-1][0],
-                            encodings['offset_mapping'][i][1])
-                    else:
-                        label_mask.append(1)
-                        offset_mapping.append(encodings['offset_mapping'][i])
+            label_mask = [False] * special_token + label_mask + \
+                         [False] * special_token
+            input_ids = [self.nlp_tokenizer.tokenizer.cls_token_id] * special_token + input_ids + \
+                        [self.nlp_tokenizer.tokenizer.sep_token_id] * special_token
+            attention_mask = attention_mask + [1] * (special_token * 2)
+        encodings = {
+            'input_ids': input_ids,
+            'attention_mask': attention_mask,
+            'label_mask': label_mask,
+            'offset_mapping': offset_mapping,
+        }
+        return encodings, None
+    def _tokenize_text_with_fast_tokenizer(self, tokens, **kwargs):
+        is_split_into_words = isinstance(tokens, list)
+        encodings = self.nlp_tokenizer(
+            tokens,
+            return_offsets_mapping=True,
+            is_split_into_words=is_split_into_words,
+            **kwargs)
+        label_mask = []
+        word_ids = encodings.word_ids()
+        offset_mapping = []
+        for i in range(len(word_ids)):
+            if word_ids[i] is None:
+                label_mask.append(False)
+            elif word_ids[i] == word_ids[i - 1]:
+                label_mask.append(False)
+                if not is_split_into_words:
+                    offset_mapping[-1] = (offset_mapping[-1][0],
+                                          encodings['offset_mapping'][i][1])
-            else:
-                encodings = self.tokenizer(text, **self.tokenize_kwargs)
-                input_ids = encodings['input_ids']
-                label_mask, offset_mapping = self.get_label_mask_and_offset_mapping(
-                    text)
-
-        if self._mode == ModeKeys.INFERENCE:
-            if len(input_ids) >= self.sequence_length - 2:
-                input_ids = input_ids[:self.sequence_length - 2]
-                label_mask = label_mask[:self.sequence_length - 2]
-            input_ids = [self.tokenizer.cls_token_id
-                         ] + input_ids + [self.tokenizer.sep_token_id]
-            label_mask = [0] + label_mask + [0]
-            attention_mask = [1] * len(input_ids)
-            offset_mapping = offset_mapping[:sum(label_mask)]
-            if not self.is_transformer_based_model:
-                input_ids = input_ids[1:-1]
-                attention_mask = attention_mask[1:-1]
-                label_mask = label_mask[1:-1]
-            input_ids = torch.tensor(input_ids).unsqueeze(0)
-            attention_mask = torch.tensor(attention_mask).unsqueeze(0)
-            label_mask = torch.tensor(
-                label_mask, dtype=torch.bool).unsqueeze(0)
-            # the token classification
-            output = {
-                'text': text,
-                'input_ids': input_ids,
-                'attention_mask': attention_mask,
-                'label_mask': label_mask,
-                'offset_mapping': offset_mapping
-            }
+            else:
+                label_mask.append(True)
+                if is_split_into_words:
+                    offset_mapping.append((word_ids[i], word_ids[i] + 1))
+                else:
+                    offset_mapping.append(encodings['offset_mapping'][i])
+        padding = self.nlp_tokenizer.get_tokenizer_kwarg('padding')
+        if padding == 'max_length':
+            offset_mapping = offset_mapping + [(0, 0)] * (
+                len(label_mask) - len(offset_mapping))
+        encodings['offset_mapping'] = offset_mapping
+        encodings['label_mask'] = label_mask
+        return encodings, word_ids
+    def _tokenize_text_with_slow_tokenizer(self, tokens, **kwargs):
+        assert self.mode == ModeKeys.INFERENCE and isinstance(tokens, str), \
+            'Slow tokenizer now only support str input in inference mode. If you are training models, ' \
+            'please consider using the fast tokenizer.'
+        word_ids = None
+        encodings = self.nlp_tokenizer(
+            tokens, is_split_into_words=False, **kwargs)
+        tokenizer_name = self.nlp_tokenizer.get_tokenizer_class()
+        method = 'get_label_mask_and_offset_mapping_' + tokenizer_name
+        if not hasattr(self, method):
+            raise RuntimeError(
+                f'No `{method}` method defined for '
+                f'tokenizer {tokenizer_name}, please use a fast tokenizer instead, or '
+                f'try to implement a `{method}` method')
+        label_mask, offset_mapping = getattr(self, method)(tokens)
+        padding = self.nlp_tokenizer.get_tokenizer_kwarg('padding')
+        max_length = self.nlp_tokenizer.get_tokenizer_kwarg('max_length')
+        special_token = 1 if self.nlp_tokenizer.get_tokenizer_kwarg(
+            'add_special_tokens') else 0
+        if len(label_mask) > max_length - 2 * special_token:
+            label_mask = label_mask[:(max_length - 2 * special_token)]
+        offset_mapping = offset_mapping[:sum(label_mask)]
+        if padding == 'max_length':
+            label_mask = [False] * special_token + label_mask + \
+                         [False] * (max_length - len(label_mask) - special_token)
+            offset_mapping = offset_mapping + [(0, 0)] * (
+                max_length - len(offset_mapping))
         else:
-            output = {
-                'input_ids': input_ids,
-                'token_type_ids': token_type_ids,
-                'attention_mask': attention_mask,
-                'label_mask': label_mask,
-            }
-
-        # align the labels with tokenized text
-        if labels_list is not None:
-            assert self.label2id is not None
-            # Map that sends B-Xxx label to its I-Xxx counterpart
-            b_to_i_label = []
-            label_enumerate_values = [
-                k for k, v in sorted(
-                    self.label2id.items(), key=lambda item: item[1])
-            ]
-            for idx, label in enumerate(label_enumerate_values):
-                if label.startswith('B-') and label.replace(
-                        'B-', 'I-') in label_enumerate_values:
-                    b_to_i_label.append(
-                        label_enumerate_values.index(
-                            label.replace('B-', 'I-')))
-                else:
-                    b_to_i_label.append(idx)
-
-            label_row = [self.label2id[lb] for lb in labels_list]
-            previous_word_idx = None
-            label_ids = []
-            for word_idx in word_ids:
-                if word_idx is None:
-                    label_ids.append(-100)
-                elif word_idx != previous_word_idx:
-                    label_ids.append(label_row[word_idx])
-                else:
-                    if self.label_all_tokens:
-                        label_ids.append(b_to_i_label[label_row[word_idx]])
-                    else:
-                        label_ids.append(-100)
-                previous_word_idx = word_idx
-            labels = label_ids
-            output['labels'] = labels
-        output = {
-            k: np.array(v) if isinstance(v, list) else v
-            for k, v in output.items()
-        }
-        return output
+            label_mask = [False] * special_token + label_mask + \
+                         [False] * special_token
+        encodings['offset_mapping'] = offset_mapping
+        encodings['label_mask'] = label_mask
+        return encodings, word_ids

-    def get_tokenizer_class(self):
-        tokenizer_class = self.tokenizer.__class__.__name__
-        if tokenizer_class.endswith(
-                'Fast') and tokenizer_class != 'PreTrainedTokenizerFast':
-            tokenizer_class = tokenizer_class[:-4]
-        return tokenizer_class
+    def get_label_mask_and_offset_mapping_BertTokenizer(self, text):
+        label_mask = []
+        offset_mapping = []
+        tokens = self.nlp_tokenizer.tokenizer.tokenize(text)
+        offset = 0
+        for token in tokens:
+            is_start = (token[:2] != '##')
+            if is_start:
+                label_mask.append(True)
+            else:
+                token = token[2:]
+                label_mask.append(False)
+            start = offset + text[offset:].index(token)
+            end = start + len(token)
+            if is_start:
+                offset_mapping.append((start, end))
+            else:
+                offset_mapping[-1] = (offset_mapping[-1][0], end)
+            offset = end
+        return label_mask, offset_mapping

-    def get_label_mask_and_offset_mapping(self, text):
+    def get_label_mask_and_offset_mapping_XLMRobertaTokenizer(self, text):
         label_mask = []
         offset_mapping = []
-        tokens = self.tokenizer.tokenize(text)
+        tokens = self.nlp_tokenizer.tokenizer.tokenize(text)
         offset = 0
-        if self.get_tokenizer_class() == 'BertTokenizer':
-            for token in tokens:
-                is_start = (token[:2] != '##')
-                if is_start:
-                    label_mask.append(True)
-                else:
-                    token = token[2:]
-                    label_mask.append(False)
-                start = offset + text[offset:].index(token)
-                end = start + len(token)
-                if is_start:
-                    offset_mapping.append((start, end))
-                else:
-                    offset_mapping[-1] = (offset_mapping[-1][0], end)
-                offset = end
-        elif self.get_tokenizer_class() == 'XLMRobertaTokenizer':
-            last_is_blank = False
-            for token in tokens:
-                is_start = (token[0] == '▁')
-                if is_start:
-                    token = token[1:]
-                    label_mask.append(True)
-                    if len(token) == 0:
-                        last_is_blank = True
-                        continue
-                else:
-                    label_mask.append(False)
-                start = offset + text[offset:].index(token)
-                end = start + len(token)
-                if last_is_blank or is_start:
-                    offset_mapping.append((start, end))
-                else:
-                    offset_mapping[-1] = (offset_mapping[-1][0], end)
-                offset = end
-                last_is_blank = False
-        else:
-            raise NotImplementedError
+        last_is_blank = False
+        for token in tokens:
+            is_start = (token[0] == '▁')
+            if is_start:
+                token = token[1:]
+                label_mask.append(True)
+                if len(token) == 0:
+                    last_is_blank = True
+                    continue
+            else:
+                label_mask.append(False)
+            start = offset + text[offset:].index(token)
+            end = start + len(token)
+            if last_is_blank or is_start:
+                offset_mapping.append((start, end))
+            else:
+                offset_mapping[-1] = (offset_mapping[-1][0], end)
+            offset = end
+            last_is_blank = False
         return label_mask, offset_mapping
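Note: for slow tokenizers the fallback above is resolved by naming convention, `get_label_mask_and_offset_mapping_` plus the tokenizer class name (minus any `Fast` suffix). A hedged sketch of extending the preprocessor for another slow tokenizer, using a hypothetical whitespace-only tokenizer class named `MyWordTokenizer`:

    class MyTokenClassificationPreprocessor(
            TokenClassificationTransformersPreprocessor):

        def get_label_mask_and_offset_mapping_MyWordTokenizer(self, text):
            # Every whitespace-separated token starts a word, so the mask is
            # all True and the spans are plain character offsets.
            label_mask, offset_mapping = [], []
            offset = 0
            for token in text.split():
                start = offset + text[offset:].index(token)
                end = start + len(token)
                label_mask.append(True)
                offset_mapping.append((start, end))
                offset = end
            return label_mask, offset_mapping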
@@ -9,19 +9,23 @@ from modelscope.outputs import OutputKeys
 from modelscope.preprocessors.builder import PREPROCESSORS
 from modelscope.utils.constant import Fields, ModeKeys
 from modelscope.utils.type_assert import type_assert
-from .token_classification_preprocessor import TokenClassificationPreprocessor
+from .token_classification_preprocessor import \
+    TokenClassificationTransformersPreprocessor


 @PREPROCESSORS.register_module(
     Fields.nlp, module_name=Preprocessors.thai_ner_tokenizer)
-class NERPreprocessorThai(TokenClassificationPreprocessor):
+class NERPreprocessorThai(TokenClassificationTransformersPreprocessor):

-    @type_assert(object, str)
-    def __call__(self, data: str) -> Dict[str, Any]:
+    @type_assert(object, (str, dict))
+    def __call__(self, data: Union[Dict, str]) -> Dict[str, Any]:
         from pythainlp import word_tokenize
+        if isinstance(data, str):
+            text = data
+        else:
+            text = data[self.first_sequence]
         segmented_data = ' '.join([
-            w.strip(' ') for w in word_tokenize(text=data, engine='newmm')
+            w.strip(' ') for w in word_tokenize(text=text, engine='newmm')
             if w.strip(' ') != ''
         ])
         output = super().__call__(segmented_data)

@@ -31,12 +35,17 @@ class NERPreprocessorThai(TokenClassificationPreprocessor):
 @PREPROCESSORS.register_module(
     Fields.nlp, module_name=Preprocessors.thai_wseg_tokenizer)
-class WordSegmentationPreprocessorThai(TokenClassificationPreprocessor):
+class WordSegmentationPreprocessorThai(
+        TokenClassificationTransformersPreprocessor):

-    @type_assert(object, str)
-    def __call__(self, data: str) -> Dict[str, Any]:
+    @type_assert(object, (str, dict))
+    def __call__(self, data: Union[Dict, str]) -> Dict[str, Any]:
         import regex
-        data = regex.findall(r'\X', data)
+        if isinstance(data, str):
+            text = data
+        else:
+            text = data[self.first_sequence]
+        data = regex.findall(r'\X', text)
         data = ' '.join([char for char in data])
         output = super().__call__(data)
@@ -9,19 +9,23 @@ from modelscope.outputs import OutputKeys
 from modelscope.preprocessors.builder import PREPROCESSORS
 from modelscope.utils.constant import Fields, ModeKeys
 from modelscope.utils.type_assert import type_assert
-from .token_classification_preprocessor import TokenClassificationPreprocessor
+from .token_classification_preprocessor import \
+    TokenClassificationTransformersPreprocessor


 @PREPROCESSORS.register_module(
     Fields.nlp, module_name=Preprocessors.viet_ner_tokenizer)
-class NERPreprocessorViet(TokenClassificationPreprocessor):
+class NERPreprocessorViet(TokenClassificationTransformersPreprocessor):

-    @type_assert(object, str)
-    def __call__(self, data: str) -> Dict[str, Any]:
+    @type_assert(object, (str, dict))
+    def __call__(self, data: Union[Dict, str]) -> Dict[str, Any]:
         from pyvi import ViTokenizer
+        if isinstance(data, str):
+            text = data
+        else:
+            text = data[self.first_sequence]
         seg_words = [
-            t.strip(' ') for t in ViTokenizer.tokenize(data).split(' ')
+            t.strip(' ') for t in ViTokenizer.tokenize(text).split(' ')
             if t.strip(' ') != ''
         ]
         raw_words = []
@@ -0,0 +1,112 @@
+# Copyright (c) Alibaba, Inc. and its affiliates.
+import os
+from collections.abc import Mapping
+
+import json
+from transformers import AutoTokenizer
+
+from modelscope.metainfo import Models
+from modelscope.outputs import OutputKeys
+from modelscope.utils.constant import ModeKeys
+from modelscope.utils.logger import get_logger
+
+logger = get_logger()
+
+__all__ = [
+    'NLPTokenizer',
+]
+
+
+class NLPTokenizer:
+
+    def __init__(self,
+                 model_dir: str = None,
+                 model_type=None,
+                 use_fast: bool = None,
+                 tokenize_kwargs=None):
+        """The transformers tokenizer preprocessor base class.
+
+        Any nlp preprocessor which uses the huggingface tokenizer can inherit from this class.
+
+        Args:
+            model_dir (str, `optional`): The local path containing the files used to create a preprocessor.
+            model_type (str, `optional`): The model type used to pick a proper tokenizer class.
+            use_fast (bool, `optional`): Use the fast version of tokenizer
+            tokenize_kwargs (dict, `optional`): These args will be directly fed into the tokenizer.
+        """
+        self.model_dir = model_dir
+        self.model_type = model_type
+        self.tokenize_kwargs = tokenize_kwargs
+        if self.tokenize_kwargs is None:
+            self.tokenize_kwargs = {}
+        self._use_fast = use_fast
+        self._tokenizer = None
+
+    @property
+    def tokenizer(self):
+        if self._tokenizer is None:
+            self._tokenizer = self.build_tokenizer()
+        return self._tokenizer
+
+    @property
+    def use_fast(self):
+        if self._use_fast is None:
+            if self.model_dir is None:
+                self._use_fast = False
+            elif os.path.isfile(
+                    os.path.join(self.model_dir, 'tokenizer_config.json')):
+                with open(
+                        os.path.join(self.model_dir, 'tokenizer_config.json'),
+                        'r',
+                        encoding='utf-8') as f:
+                    json_config = json.load(f)
+                self._use_fast = json_config.get('use_fast')
+            self._use_fast = False if self._use_fast is None else self._use_fast
+        return self._use_fast
+
+    def build_tokenizer(self):
+        """Build a tokenizer by the model type.
+
+        NOTE: The fast tokenizers have a multi-thread problem, use it carefully.
+
+        Returns:
+            The initialized tokenizer.
+        """
+        # the fast version may lead to parallel inference failures
+        model_type = self.model_type
+        model_dir = self.model_dir
+        if model_type == Models.deberta_v2:
+            from modelscope.models.nlp.deberta_v2 import DebertaV2Tokenizer, DebertaV2TokenizerFast
+            tokenizer = DebertaV2TokenizerFast if self.use_fast else DebertaV2Tokenizer
+            return tokenizer.from_pretrained(
+                model_dir) if model_dir is not None else tokenizer()
+        if model_type in (Models.structbert, Models.gpt3, Models.palm,
+                          Models.plug):
+            from transformers import BertTokenizer, BertTokenizerFast
+            tokenizer = BertTokenizerFast if self.use_fast else BertTokenizer
+            return tokenizer.from_pretrained(
+                model_dir) if model_dir is not None else tokenizer()
+        elif model_type == Models.veco:
+            from transformers import XLMRobertaTokenizer, XLMRobertaTokenizerFast
+            tokenizer = XLMRobertaTokenizerFast if self.use_fast else XLMRobertaTokenizer
+            return tokenizer.from_pretrained(
+                model_dir) if model_dir is not None else tokenizer()
+        assert model_dir is not None
+        return AutoTokenizer.from_pretrained(model_dir, use_fast=self.use_fast)
+
+    def __call__(self, text, text_pair=None, **kwargs):
+        # `sequence_length` is accepted as an alias of `max_length`
+        kwargs['max_length'] = kwargs.get('max_length',
+                                          kwargs.pop('sequence_length', None))
+        if kwargs['max_length'] is None:
+            kwargs.pop('max_length')
+        # per-call kwargs override the defaults captured at init time
+        tokenize_kwargs = {k: v for k, v in self.tokenize_kwargs.items()}
+        tokenize_kwargs.update(kwargs)
+        return self.tokenizer(text, text_pair, **tokenize_kwargs)
+
+    def get_tokenizer_kwarg(self, key, default_value=None):
+        if key in self.tokenize_kwargs:
+            return self.tokenize_kwargs[key]
+        return self.tokenizer.init_kwargs.get(key, default_value)
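Note: a short usage sketch of the wrapper (the checkpoint path and values are illustrative; 'structbert' matches Models.structbert, so a BertTokenizer is built):

    tokenizer = NLPTokenizer(
        model_dir='/path/to/structbert/model',  # hypothetical checkpoint dir
        model_type='structbert',
        use_fast=False,
        tokenize_kwargs={'padding': 'max_length', 'max_length': 64,
                         'truncation': True})

    # Per-call kwargs override the defaults captured in tokenize_kwargs.
    features = tokenizer('How is the weather today?', return_tensors='pt')
    print(tokenizer.get_tokenizer_kwarg('max_length'))  # 64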
@@ -0,0 +1,100 @@
+# Copyright (c) Alibaba, Inc. and its affiliates.
+import os
+from collections.abc import Mapping
+from typing import Any, Dict, List, Tuple, Union
+
+import json
+import numpy as np
+from transformers import AutoTokenizer
+
+from modelscope.metainfo import Models
+from modelscope.outputs import OutputKeys
+from modelscope.preprocessors.base import Preprocessor
+from modelscope.utils.constant import ModeKeys
+from modelscope.utils.hub import get_model_type, parse_label_mapping
+from modelscope.utils.logger import get_logger
+
+logger = get_logger()
+
+__all__ = ['parse_text_and_label', 'labels_to_id']
+
+
+def parse_text_and_label(data,
+                         mode,
+                         first_sequence=None,
+                         second_sequence=None,
+                         label=None):
+    """Parse the input and return the sentences and labels.
+
+    When the input type is tuple or list and its size is 2, the mode decides
+    the parse: at inference time data will be parsed as the first_sentence
+    and the second_sentence, otherwise as the first_sentence and the label.
+
+    Args:
+        data: The input data.
+        mode: The mode of the preprocessor
+        first_sequence: The key of the first sequence
+        second_sequence: The key of the second sequence
+        label: The key of the label
+
+    Returns:
+        The sentences and labels tuple.
+    """
+    text_a, text_b, labels = None, None, None
+    if isinstance(data, str):
+        text_a = data
+    elif isinstance(data, tuple) or isinstance(data, list):
+        if len(data) == 3:
+            text_a, text_b, labels = data
+        elif len(data) == 2:
+            if mode == ModeKeys.INFERENCE:
+                text_a, text_b = data
+            else:
+                text_a, labels = data
+    elif isinstance(data, Mapping):
+        text_a = data.get(first_sequence)
+        text_b = data.get(second_sequence)
+        if label is None or isinstance(label, str):
+            labels = data.get(label)
+        else:
+            labels = [data.get(lb) for lb in label]
+
+    return text_a, text_b, labels
+
+
+def labels_to_id(labels, output, label2id=None):
+    """Turn the labels to id with the type int or float.
+
+    If the original label's type is str or int, the label2id mapping will try to convert it to the final label.
+    If the original label's type is float, or the label2id mapping does not exist,
+    the original label will be returned.
+
+    Args:
+        labels: The input labels.
+        output: The output dict which the converted labels will be filled into.
+        label2id: An extra label2id mapping. If not provided, the labels will not be translated to ids.
+
+    Returns:
+        None. The converted labels are written into `output[OutputKeys.LABELS]`.
+    """
+
+    def label_can_be_mapped(label):
+        return isinstance(label, str) or isinstance(label, int)
+
+    try:
+        if isinstance(labels, (tuple, list)) and all([label_can_be_mapped(label) for label in labels]) \
+                and label2id is not None:
+            output[OutputKeys.LABELS] = [
+                label2id[label] if label in label2id else label2id[str(label)]
+                for label in labels
+            ]
+        elif label_can_be_mapped(labels) and label2id is not None:
+            output[OutputKeys.LABELS] = label2id[
+                labels] if labels in label2id else label2id[str(labels)]
+        elif labels is not None:
+            output[OutputKeys.LABELS] = labels
+    except KeyError as e:
+        logger.error(
+            f'Label {labels} cannot be found in the label mapping {label2id}, '
+            f'which comes from the user input or the configuration files. '
+            f'Please consider matching your labels with this mapping.')
+        raise e
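Note: how `parse_text_and_label` resolves the different input shapes (all values made up):

    from modelscope.utils.constant import ModeKeys

    # A plain string only fills the first sentence.
    parse_text_and_label('a sentence', ModeKeys.INFERENCE)
    # -> ('a sentence', None, None)

    # Two-element inputs are ambiguous, so the mode decides:
    parse_text_and_label(('s1', 's2'), ModeKeys.INFERENCE)
    # -> ('s1', 's2', None), a sentence pair at inference time
    parse_text_and_label(('s1', 1), ModeKeys.TRAIN)
    # -> ('s1', None, 1), a sentence and its label during training

    # Dicts are resolved through the configured keys.
    parse_text_and_label({'text': 's1', 'label': 0}, ModeKeys.TRAIN,
                         first_sequence='text', label='label')
    # -> ('s1', None, 0)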
@@ -0,0 +1,74 @@
+# Copyright (c) Alibaba, Inc. and its affiliates.
+
+from typing import Any, Dict, Union
+
+from modelscope.metainfo import Preprocessors
+from modelscope.preprocessors import Preprocessor
+from modelscope.preprocessors.builder import PREPROCESSORS
+from modelscope.utils.constant import Fields, ModeKeys
+from modelscope.utils.hub import get_model_type
+from .transformers_tokenizer import NLPTokenizer
+
+
+@PREPROCESSORS.register_module(
+    Fields.nlp, module_name=Preprocessors.zero_shot_cls_tokenizer)
+class ZeroShotClassificationTransformersPreprocessor(Preprocessor):
+    """The tokenizer preprocessor used in zero shot classification.
+    """
+
+    def __init__(self,
+                 model_dir: str,
+                 first_sequence=None,
+                 mode=ModeKeys.INFERENCE,
+                 sequence_length=512,
+                 use_fast=None,
+                 **kwargs):
+        """preprocess the data
+
+        Args:
+            model_dir (str): The model dir used to initialize the tokenizer.
+            first_sequence (str, `optional`): The key of the sentence if the input type is a dict.
+            mode: The mode for the preprocessor.
+            sequence_length: The max sequence length which the model supported.
+            use_fast: Whether to use the fast tokenizer or not.
+            **kwargs: Extra args input into the tokenizer's __call__ method.
+        """
+        self.sequence_length = sequence_length
+        model_type = None
+        if model_dir is not None:
+            model_type = get_model_type(model_dir)
+        self.nlp_tokenizer = NLPTokenizer(
+            model_dir, model_type, use_fast=use_fast, tokenize_kwargs=kwargs)
+        self.first_sequence = first_sequence
+        super().__init__(mode=mode)
+
+    def __call__(self,
+                 data: Union[str, Dict],
+                 hypothesis_template: str,
+                 candidate_labels: list,
+                 padding=True,
+                 truncation=True,
+                 truncation_strategy='only_first',
+                 **kwargs) -> Dict[str, Any]:
+        """process the raw input data
+
+        Args:
+            data (str or dict): a sentence
+                Example:
+                    'you are so handsome.'
+
+        Returns:
+            Dict[str, Any]: the preprocessed data
+        """
+        if isinstance(data, dict):
+            data = data.get(self.first_sequence)
+
+        pairs = [[data, hypothesis_template.format(label)]
+                 for label in candidate_labels]
+        if 'return_tensors' not in kwargs:
+            kwargs[
+                'return_tensors'] = 'pt' if self._mode == ModeKeys.INFERENCE else None
+        features = self.nlp_tokenizer(
+            pairs,
+            padding=padding,
+            truncation=truncation,
+            truncation_strategy=truncation_strategy,
+            **kwargs)
+        return features
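Note: a hedged usage sketch (the model directory is hypothetical; any NLI-style model applies):

    preprocessor = ZeroShotClassificationTransformersPreprocessor(
        model_dir='/path/to/nli/model')

    features = preprocessor(
        'I will book a flight to Shanghai tomorrow.',
        hypothesis_template='This text is about {}.',
        candidate_labels=['travel', 'cooking', 'sports'])
    # One premise/hypothesis pair per candidate label, padded and returned
    # as tensors in inference mode.
    print(features['input_ids'].shape)  # (3, seq_len)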
@@ -1,51 +0,0 @@
-# Copyright (c) Alibaba, Inc. and its affiliates.
-
-from typing import Any, Dict, Union
-
-from modelscope.metainfo import Preprocessors
-from modelscope.preprocessors.builder import PREPROCESSORS
-from modelscope.utils.constant import Fields, ModeKeys
-from .nlp_base import NLPTokenizerPreprocessorBase
-
-
-@PREPROCESSORS.register_module(
-    Fields.nlp, module_name=Preprocessors.zero_shot_cls_tokenizer)
-class ZeroShotClassificationPreprocessor(NLPTokenizerPreprocessorBase):
-    """The tokenizer preprocessor used in zero shot classification.
-    """
-
-    def __init__(self, model_dir: str, mode=ModeKeys.INFERENCE, **kwargs):
-        """preprocess the data
-
-        Args:
-            model_dir (str): model path
-        """
-        self.sequence_length = kwargs.pop('sequence_length', 512)
-        super().__init__(model_dir, mode=mode, **kwargs)
-
-    def __call__(self, data: Union[str, Dict], hypothesis_template: str,
-                 candidate_labels: list) -> Dict[str, Any]:
-        """process the raw input data
-
-        Args:
-            data (str or dict): a sentence
-                Example:
-                    'you are so handsome.'
-
-        Returns:
-            Dict[str, Any]: the preprocessed data
-        """
-        if isinstance(data, dict):
-            data = data.get(self.first_sequence)
-
-        pairs = [[data, hypothesis_template.format(label)]
-                 for label in candidate_labels]
-
-        features = self.tokenizer(
-            pairs,
-            padding=True,
-            truncation=True,
-            max_length=self.sequence_length,
-            truncation_strategy='only_first',
-            return_tensors='pt' if self._mode == ModeKeys.INFERENCE else None)
-        return features
| @@ -6,11 +6,12 @@ import numpy as np | |||||
| import torch | import torch | ||||
| from modelscope import __version__ | from modelscope import __version__ | ||||
| from modelscope.metainfo import Hooks | |||||
| from modelscope.utils.checkpoint import load_checkpoint, save_checkpoint | |||||
| from modelscope.metainfo import Hooks, Pipelines | |||||
| from modelscope.utils.checkpoint import (load_checkpoint, save_checkpoint, | |||||
| save_configuration) | |||||
| from modelscope.utils.constant import LogKeys, ModelFile | from modelscope.utils.constant import LogKeys, ModelFile | ||||
| from modelscope.utils.logger import get_logger | from modelscope.utils.logger import get_logger | ||||
| from modelscope.utils.torch_utils import get_dist_info, is_master | |||||
| from modelscope.utils.torch_utils import is_master | |||||
| from .builder import HOOKS | from .builder import HOOKS | ||||
| from .hook import Hook | from .hook import Hook | ||||
| from .priority import Priority | from .priority import Priority | ||||
@@ -28,17 +29,25 @@ class CheckpointHook(Hook):
         save_dir (str): The directory to save checkpoints. If None, `trainer.work_dir` is used.
         save_last (bool): Whether to save the last checkpoint. Default: True.
         checkpoint_file (str): The checkpoint file to be loaded.
+        load_all_state (bool): Whether to load all states (optimizer, epoch, lr_scheduler, random_state,
+            etc.) from the training state file. If False, only the model's state dict is loaded.
+        max_checkpoint_num (int): The max number of checkpoint files to keep. Default: None, which means
+            never delete anything. Once the limit is exceeded, earlier checkpoints are deleted first.
     """

     PRIORITY = Priority.LOW

-    def __init__(self,
-                 interval=0,
-                 by_epoch=True,
-                 save_optimizer=True,
-                 save_dir=None,
-                 save_last=True,
-                 checkpoint_file=None):
+    def __init__(
+            self,
+            interval=0,
+            by_epoch=True,
+            save_optimizer=True,
+            save_dir=None,
+            save_last=True,
+            checkpoint_file=None,
+            load_all_state=True,
+            max_checkpoint_num=None,
+    ):
         self.interval = interval
         self.by_epoch = by_epoch
         self.save_optimizer = save_optimizer
@@ -47,6 +56,11 @@ class CheckpointHook(Hook):
         self.save_last = save_last
         self.rng_state = None
         self.need_load_rng_state = False
+        self.load_all_state = load_all_state
+        self.max_checkpoint_num = None
+        if max_checkpoint_num is not None:
+            self.max_checkpoint_num = max(int(max_checkpoint_num), 1)
+        self.history_checkpoints = []

     def before_run(self, trainer):
         if not self.save_dir:
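A minimal sketch of how the two new arguments might be supplied through the trainer's hook configuration; the surrounding cfg layout (a train.hooks list of dicts) is assumed from common modelscope configurations and is not shown in this diff:

    checkpoint_hook_cfg = {
        'type': 'CheckpointHook',
        'interval': 1,             # save once per epoch when by_epoch=True
        'max_checkpoint_num': 3,   # keep only the 3 most recent checkpoint files
        'load_all_state': False,   # reuse weights only; do not resume optimizer/epoch
    }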
@@ -65,9 +79,10 @@ class CheckpointHook(Hook):
         if self.checkpoint_file is not None and os.path.isfile(
                 self.checkpoint_file):
-            meta = self.load_checkpoint(self.checkpoint_file, trainer)
+            meta = self.load_checkpoint(self.checkpoint_file, trainer,
+                                        self.load_all_state)
             self.rng_state = meta.get('rng_state')
-            self.need_load_rng_state = True
+            self.need_load_rng_state = self.load_all_state

     def before_train_iter(self, trainer):
         if self.need_load_rng_state:
@@ -95,28 +110,30 @@ class CheckpointHook(Hook):
             self._save_checkpoint(trainer)

     @classmethod
-    def load_checkpoint(cls, filename, trainer):
+    def load_checkpoint(cls, filename, trainer, load_all_state=True):
         from modelscope.trainers.parallel.utils import is_parallel
         if is_parallel(trainer.model):
             model = trainer.model.module
         else:
             model = trainer.model
-        meta = load_checkpoint(filename, model,
-                               getattr(trainer, 'optimizer', None),
-                               getattr(trainer, 'lr_scheduler', None))
-        trainer._epoch = meta.get('epoch', trainer._epoch)
-        trainer._iter = meta.get('iter', trainer._iter)
-        trainer._inner_iter = meta.get('inner_iter', trainer._inner_iter)
-        for i, hook in enumerate(trainer.hooks):
-            # hook: Hook
-            key = f'{hook.__class__}-{i}'
-            if key in meta and hasattr(hook, 'load_state_dict'):
-                hook.load_state_dict(meta.get(key, {}))
-            else:
-                trainer.logger.warn(
-                    f'The state_dict of hook {hook.__class__} at index {i} is not found in the checkpoint file.'
-                )
+        meta = load_checkpoint(
+            filename, model,
+            getattr(trainer, 'optimizer', None) if load_all_state else None,
+            getattr(trainer, 'lr_scheduler', None) if load_all_state else None)
+        if load_all_state:
+            trainer._epoch = meta.get('epoch', trainer._epoch)
+            trainer._iter = meta.get('iter', trainer._iter)
+            trainer._inner_iter = meta.get('inner_iter', trainer._inner_iter)
+            for i, hook in enumerate(trainer.hooks):
+                # hook: Hook
+                key = f'{hook.__class__}-{i}'
+                if key in meta and hasattr(hook, 'load_state_dict'):
+                    hook.load_state_dict(meta.get(key, {}))
+                else:
+                    trainer.logger.warn(
+                        f'The state_dict of hook {hook.__class__} at index {i} is not found in the checkpoint file.'
+                    )

         version = meta.get('modelscope')
         if version != __version__:
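With load_all_state=False, load_checkpoint restores only the model weights and skips the optimizer, lr_scheduler, counters, and hook states. A hedged sketch of the two resulting modes, constructing the hook directly (the import path and checkpoint file names are assumptions):

    # from modelscope.trainers.hooks.checkpoint_hook import CheckpointHook  # assumed path

    # Continue an interrupted run: restore optimizer, lr_scheduler,
    # epoch/iter counters and rng state along with the weights.
    resume_hook = CheckpointHook(
        checkpoint_file='work_dir/epoch_3.pth', load_all_state=True)

    # Start a fresh training run from the saved weights only.
    warm_start_hook = CheckpointHook(
        checkpoint_file='work_dir/epoch_3.pth', load_all_state=False)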
@@ -163,6 +180,21 @@ class CheckpointHook(Hook):
                 and not self.by_epoch):
             self._save_pretrained(trainer)

+        self.history_checkpoints.append(cur_save_name)
+        self.remove_obsolete_checkpoints()
+
+    def remove_obsolete_checkpoints(self):
+        if self.max_checkpoint_num is not None and \
+                len(self.history_checkpoints) > self.max_checkpoint_num:
+            history_checkpoints = list(self.history_checkpoints)
+            self.history_checkpoints.clear()
+            for i, ckpt_file in enumerate(history_checkpoints):
+                if i < len(history_checkpoints) - self.max_checkpoint_num:
+                    if os.path.isfile(ckpt_file):
+                        os.remove(ckpt_file)
+                else:
+                    self.history_checkpoints.append(ckpt_file)
+
     def _save_pretrained(self, trainer):
         output_dir = os.path.join(self.save_dir, ModelFile.TRAIN_OUTPUT_DIR)
         from modelscope.trainers.parallel.utils import is_parallel
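The rotation above is plain FIFO: history_checkpoints preserves save order, so everything older than the newest max_checkpoint_num files is removed. A self-contained illustration of the pruning arithmetic (file names are illustrative):

    history = ['epoch_1.pth', 'epoch_2.pth', 'epoch_3.pth']
    max_checkpoint_num = 2
    cutoff = len(history) - max_checkpoint_num
    deleted, kept = history[:cutoff], history[cutoff:]
    assert deleted == ['epoch_1.pth']
    assert kept == ['epoch_2.pth', 'epoch_3.pth']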
@@ -175,15 +207,53 @@ class CheckpointHook(Hook):
         config = trainer.cfg.to_dict()
         # Override the pipeline with the task name after finetuning is done,
         # to avoid cases like a fill-mask pipeline paired with a text-cls task.
-        config['pipeline'] = {'type': config['task']}
+        if config['task'] in [
+                getattr(Pipelines, attr) for attr in dir(Pipelines)
+                if not attr.startswith('__')
+        ]:
+            # TODO a temp fix to avoid pipeline_name and task mismatch
+            config['pipeline'] = {'type': config['task']}
+
+        class SaveConfig:
+
+            def __init__(self, output_dir, config):
+                self.output_dir = output_dir
+                self.config = config
+
+            def __call__(self, _output_dir, _config):
+                self.config = _config
+
+            def save_config(self):
+                save_configuration(self.output_dir, self.config)
+
+        save_config_fn = SaveConfig(output_dir, config)
+
         if hasattr(model, 'save_pretrained'):
+            # Two binary file names are supported now:
+            # pytorch_model.bin and pytorch_model.pt
+            default_bin_file = ModelFile.TORCH_MODEL_BIN_FILE
+            if hasattr(
+                    model,
+                    'model_dir') and ModelFile.TORCH_MODEL_FILE in os.listdir(
+                        model.model_dir):
+                default_bin_file = ModelFile.TORCH_MODEL_FILE
             model.save_pretrained(
                 output_dir,
-                ModelFile.TORCH_MODEL_BIN_FILE,
+                default_bin_file,
                 save_function=save_checkpoint,
-                config=config,
+                config=save_config_fn.config,
+                save_config_function=save_config_fn,
                 with_meta=False)
+        if trainer.train_preprocessor is not None:
+            trainer.train_preprocessor.save_pretrained(
+                output_dir,
+                save_config_fn.config,
+                save_config_function=save_config_fn)
+        if trainer.eval_preprocessor is not None:
+            trainer.eval_preprocessor.save_pretrained(
+                output_dir,
+                save_config_fn.config,
+                save_config_function=save_config_fn)
+        save_config_fn.save_config()

     def after_train_iter(self, trainer):
         if self.by_epoch:
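SaveConfig is a small accumulator: the model and both preprocessors receive it as save_config_function and, instead of writing the configuration to disk themselves, hand back their (possibly modified) config through __call__; the file is then written exactly once by the final save_config(). A runnable sketch of the pattern with a stand-in writer and illustrative values:

    def save_configuration(output_dir, config):
        # Stand-in for the real modelscope.utils.checkpoint.save_configuration.
        print(f'writing {sorted(config)} to {output_dir}')

    class SaveConfig:  # mirrors the local class defined above

        def __init__(self, output_dir, config):
            self.output_dir = output_dir
            self.config = config

        def __call__(self, _output_dir, _config):
            self.config = _config  # accumulate, do not write yet

        def save_config(self):
            save_configuration(self.output_dir, self.config)

    saver = SaveConfig('output_dir', {'task': 'text-classification'})
    saver('output_dir', {**saver.config, 'preprocessor': {'type': 'tokenizer'}})
    saver.save_config()  # a single write, after all parties have contributed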
@@ -222,6 +292,9 @@ class BestCkptSaverHook(CheckpointHook):
         save_optimizer (bool): Whether to save optimizer state dict. Default: True.
         save_dir (str): Output directory to save the best checkpoint.
         restore_best (bool): Whether to restore the best checkpoint after training.
+        max_checkpoint_num (int): The max number of checkpoint files to keep. Default: 1, i.e. only the
+            best checkpoint is kept. Once the limit is exceeded, the checkpoints with the worst metrics
+            are deleted first, as judged by the `rule` and `metric_key` arguments.
     """

     PRIORITY = Priority.LOW
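Since max_checkpoint_num now defaults to 1 for this hook, only the single best checkpoint survives out of the box. A hedged construction example (argument values are illustrative):

    best_saver = BestCkptSaverHook(
        metric_key='accuracy',
        rule='max',             # a larger metric value is better
        max_checkpoint_num=1)   # prune everything but the best file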
@@ -235,13 +308,17 @@ class BestCkptSaverHook(CheckpointHook):
                  save_dir=None,
                  save_file_name=None,
                  restore_best=False,
-                 interval=0):
+                 max_checkpoint_num=1,
+                 interval=0,
+                 **kwargs):
         assert rule in ['max', 'min'], 'Only support "max" or "min" rule now.'
         super().__init__(
             interval=interval,
             by_epoch=by_epoch,
             save_optimizer=save_optimizer,
             save_dir=save_dir,
+            max_checkpoint_num=max_checkpoint_num,
+            **kwargs,
         )
         self.metric_key = metric_key
         self.rule = rule
@@ -249,6 +326,7 @@ class BestCkptSaverHook(CheckpointHook):
         self._best_ckpt_file = None
         self.save_file_name = save_file_name
         self.restore_best = restore_best
+        self.history_checkpoints = set()

     def _should_save(self, trainer):
         return self._is_best_metric(trainer.metric_values)
@@ -284,6 +362,10 @@ class BestCkptSaverHook(CheckpointHook):
                 self.save_dir,
                 f'best_{LogKeys.ITER}{trainer.iter + 1}_{self.metric_key}{self._best_metric}.pth'
             )
+        else:
+            if '.' not in cur_save_name:
+                cur_save_name = f'{cur_save_name}.pth'
+            cur_save_name = os.path.join(self.save_dir, cur_save_name)

         meta = {
             'epoch': trainer.epoch,
@@ -300,6 +382,28 @@ class BestCkptSaverHook(CheckpointHook):
                                trainer.lr_scheduler, meta)
         self._best_ckpt_file = cur_save_name
         self._save_pretrained(trainer)
+        self.history_checkpoints.add(cur_save_name)
+        self.remove_obsolete_checkpoints()
+
+    def remove_obsolete_checkpoints(self):
+
+        def extract_metric_from_filename(name):
+            # Strip only the trailing file extension; a plain split('.')
+            # would truncate a decimal metric like 0.85 down to 0.
+            metric = float(name.split(self.metric_key)[1].rsplit('.', 1)[0])
+            # Negate for 'max' so an ascending sort places the best first.
+            return -metric if self.rule == 'max' else metric
+
+        if self.max_checkpoint_num is not None and \
+                len(self.history_checkpoints) > self.max_checkpoint_num:
+            history_checkpoints = sorted(
+                self.history_checkpoints, key=extract_metric_from_filename)
+            self.history_checkpoints.clear()
+            for i, ckpt_file in enumerate(history_checkpoints):
+                if i < self.max_checkpoint_num:
+                    self.history_checkpoints.add(ckpt_file)
+                elif os.path.isfile(ckpt_file):
+                    os.remove(ckpt_file)

     def state_dict(self):
         return {
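The pruning key is parsed straight out of the checkpoint filename, which works because the save path embeds metric_key and the metric value in the name. A worked example of the parsing (the filename is illustrative):

    name = 'best_epoch3_accuracy0.85.pth'
    metric = float(name.split('accuracy')[1].rsplit('.', 1)[0])
    assert metric == 0.85  # a plain split('.') would have yielded 0.0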
@@ -14,8 +14,8 @@ from modelscope.utils.file_utils import func_receive_dict_inputs
 class TextGenerationTrainer(NlpEpochBasedTrainer):

     def _decode(self, tokens):
-        tokenizer = self.eval_preprocessor.tokenizer
-        return tokenizer.decode(tokens.tolist(), skip_special_tokens=True)
+        return self.eval_preprocessor.decode(
+            tokens.tolist(), skip_special_tokens=True)

     def evaluation_step(self, data):
         model = self.model.module if self._dist else self.model
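The trainer no longer reaches into the preprocessor's tokenizer; decoding becomes part of the preprocessor's own interface. A hedged sketch of the delegation this change relies on (the class name and constructor here are assumptions, not taken from this diff):

    class TextGenerationTransformersPreprocessor:

        def __init__(self, tokenizer):
            self.tokenizer = tokenizer

        def decode(self, token_ids, **kwargs):
            # Thin wrapper so callers never depend on the tokenizer directly.
            return self.tokenizer.decode(token_ids, **kwargs)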