
[to #42322933] Refactor NLP and fix some user feedback

1. Abstract the dict keys needed by the NLP metric classes into their init methods
2. Add Preprocessor.save_pretrained to save preprocessor information
3. Abstract the config-saving function, so the config is saved normally on a direct call of from_pretrained, and modified field by field when training
4. Remove SbertTokenizer and VecoTokenizer, and use transformers' tokenizers instead
5. Use the model's/preprocessor's from_pretrained in all NLP pipeline classes
6. Add model_kwargs and preprocessor_kwargs to all NLP pipeline classes (see the sketch after this list)
7. Add base classes for the fill-mask and text-classification preprocessors, as a demo for later changes
8. Fix user feedback: re-training the model in the continue-training scenario
9. Fix user feedback: too many checkpoints saved
10. Simplify the nlp-trainer
11. Fix user feedback: split the default trainer's __init__ method, making it easier for users to override
12. Add safe_get to the Config class (see the sketch after this list)
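
A minimal usage sketch of items 6 and 12 (the model id, config key, kwarg names, and safe_get's exact signature below are assumptions for illustration, not part of this change's verified API):

    from modelscope.pipelines import pipeline
    from modelscope.utils.config import Config

    # Read a nested config key without raising when part of the path is missing
    # (dot-separated path and default value are assumed semantics of safe_get).
    cfg = Config.from_file('configuration.json')
    num_labels = cfg.safe_get('model.num_labels', 2)

    # Extra kwargs are forwarded to the model / preprocessor constructors.
    pipe = pipeline(
        task='sentence-similarity',
        model='damo/nlp_structbert_sentence-similarity_chinese-base',  # illustrative model id
        model_kwargs={'num_labels': num_labels},
        preprocessor_kwargs={'max_length': 128},  # kwarg name assumed
    )
    print(pipe(('这件衣服不错', '这件衣服很好')))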

----------------------------  Another refactor from version 36 -------------------------

13. Rename all NLP transformers-based preprocessors from TaskNamePreprocessor to TaskNameTransformersPreprocessor, for example:
      TextClassificationPreprocessor -> TextClassificationTransformersPreprocessor
14. Add a per-task base class for every NLP task whose preprocessor has at least two sub-preprocessors
15. Add output classes for NLP models (see the sketch after this list)
16. Refactor the token-classification logic
17. Fix bug: checkpoint_hook does not support pytorch_model.pt
18. Fix bug: the pipeline name does not match the task name, so inference fails after training
       NOTE: This is just a stopgap solution; the root cause is the uncertain relationship between models and pipelines
        Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/10723513
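
Roughly how the renamed preprocessors and the new model output classes fit together (a sketch; the model id is illustrative, and the attribute access assumes the text-classification output class exposes logits):

    from modelscope.models import Model
    from modelscope.preprocessors import TextClassificationTransformersPreprocessor

    model = Model.from_pretrained('damo/nlp_structbert_sentence-similarity_chinese-base')  # illustrative id
    preprocessor = TextClassificationTransformersPreprocessor.from_pretrained(model.model_dir)

    inputs = preprocessor(('这件衣服不错', '这件衣服很好'))
    outputs = model(**inputs)
    print(outputs.logits)  # a typed output class attribute instead of a raw dict lookup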

    * add save_pretrained to preprocessor

* save preprocessor config in hook

* refactor label-id mapping fetching logic

* test ok on sentence-similarity

* run on finetuning

* fix bug

* pre-commit passed

* fix bug

* Merge branch 'master' into feat/refactor_config

# Conflicts:
#	modelscope/preprocessors/nlp/nlp_base.py

* add params to init

* 1. support max ckpt num 2. support ignoring everything but the bin file in continue training 3. add arguments to some nlp metrics

* Split trainer init impls to overridable methods

* remove some obsolete tokenizers

* unfinished

* support input params in pipeline

* fix bugs

* fix ut bug

* fix bug

* fix ut bug

* fix ut bug

* fix ut bug

* add base class for some preprocessors

* Merge commit '379867739548f394d0fa349ba07afe04adf4c8b6' into feat/refactor_config

* compatible with old code

* fix ut bug

* fix ut bugs

* fix bug

* add some comments

* fix ut bug

* add a requirement

* fix pre-commit

* Merge commit '0451b3d3cb2bebfef92ec2c227b2a3dd8d01dc6a' into feat/refactor_config

* fix bug

* Support function type in registry

* fix ut bug

* fix bug

* Merge commit '5f719e542b963f0d35457e5359df879a5eb80b82' into feat/refactor_config

# Conflicts:
#	modelscope/pipelines/nlp/multilingual_word_segmentation_pipeline.py
#	modelscope/pipelines/nlp/named_entity_recognition_pipeline.py
#	modelscope/pipelines/nlp/word_segmentation_pipeline.py
#	modelscope/utils/hub.py

* remove obsolete file

* rename init args

* rename params

* fix merge bug

* add default preprocessor config for ner-model

* move a method to a util file

* remove unused config

* Fix a bug in pbar

* bestckptsaver: change default ckpt number to 1

* 1. Add assert to max_epoch 2. split init_dist and get_device 3. change cmp func name

* Fix bug

* fix bug

* fix bug

* unfinished refactoring

* unfinished

* uw

* uw

* uw

* uw

* Merge branch 'feat/refactor_config' into feat/refactor_trainer

# Conflicts:
#	modelscope/preprocessors/nlp/document_segmentation_preprocessor.py
#	modelscope/preprocessors/nlp/faq_question_answering_preprocessor.py
#	modelscope/preprocessors/nlp/relation_extraction_preprocessor.py
#	modelscope/preprocessors/nlp/text_generation_preprocessor.py

* uw

* uw

* unify nlp task outputs

* uw

* uw

* uw

* uw

* change the order of text cls pipeline

* refactor t5

* refactor tg task preprocessor

* fix

* unfinished

* temp

* refactor code

* unfinished

* unfinished

* unfinished

* unfinished

* uw

* Merge branch 'feat/refactor_config' into feat/refactor_trainer

* smoke test pass

* ut testing

* pre-commit passed

* Merge branch 'master' into feat/refactor_config

# Conflicts:
#	modelscope/models/nlp/bert/document_segmentation.py
#	modelscope/pipelines/nlp/__init__.py
#	modelscope/pipelines/nlp/document_segmentation_pipeline.py

* merge master

* unfinished

* Merge branch 'feat/fix_bug_pipeline_name' into feat/refactor_config

* fix bug

* fix ut bug

* support ner batch inference

* fix ut bug

* fix bug

* support batch inference on three nlp tasks

* unfinished

* fix bug

* fix bug

* Merge branch 'master' into feat/refactor_config

# Conflicts:
#	modelscope/models/base/base_model.py
#	modelscope/pipelines/nlp/conversational_text_to_sql_pipeline.py
#	modelscope/pipelines/nlp/dialog_intent_prediction_pipeline.py
#	modelscope/pipelines/nlp/dialog_modeling_pipeline.py
#	modelscope/pipelines/nlp/dialog_state_tracking_pipeline.py
#	modelscope/pipelines/nlp/document_segmentation_pipeline.py
#	modelscope/pipelines/nlp/faq_question_answering_pipeline.py
#	modelscope/pipelines/nlp/feature_extraction_pipeline.py
#	modelscope/pipelines/nlp/fill_mask_pipeline.py
#	modelscope/pipelines/nlp/information_extraction_pipeline.py
#	modelscope/pipelines/nlp/named_entity_recognition_pipeline.py
#	modelscope/pipelines/nlp/sentence_embedding_pipeline.py
#	modelscope/pipelines/nlp/summarization_pipeline.py
#	modelscope/pipelines/nlp/table_question_answering_pipeline.py
#	modelscope/pipelines/nlp/text2text_generation_pipeline.py
#	modelscope/pipelines/nlp/text_classification_pipeline.py
#	modelscope/pipelines/nlp/text_error_correction_pipeline.py
#	modelscope/pipelines/nlp/text_generation_pipeline.py
#	modelscope/pipelines/nlp/text_ranking_pipeline.py
#	modelscope/pipelines/nlp/token_classification_pipeline.py
#	modelscope/pipelines/nlp/word_segmentation_pipeline.py
#	modelscope/pipelines/nlp/zero_shot_classification_pipeline.py
#	modelscope/trainers/nlp_trainer.py

* pre-commit passed

* fix bug

* Merge branch 'master' into feat/refactor_config

# Conflicts:
#	modelscope/preprocessors/__init__.py

* fix bug

* fix bug

* fix bug

* fix bug

* fix bug

* fix bug

* pre-commit passed

* fix bug

* fix bug

* fix bug

* fix bug

* fix bug

* fix bug

* self review done

* fix bug

* fix bug

* fix bug

* fix bugs

* remove sub-token offset mapping

* fix name bug

* add some tests

* 1. support batch inference of text-generation, text2text-generation, token-classification, text-classification 2. add corresponding UTs

* add old logic back

* tmp save

* add tokenize by words logic back

* move outputs file back

* revert veco token-classification back

* fix typo

* Fix description

* Merge commit '4dd99b8f6e4e7aefe047c68a1bedd95d3ec596d6' into feat/refactor_config

* Merge branch 'master' into feat/refactor_config

# Conflicts:
#	modelscope/pipelines/builder.py
yuze.zyz, 3 years ago · commit bb5512d1ab (master^2)
100 changed files with 4113 additions and 4658 deletions
  1. data/test/regression/sbert_ws_zh.bin (+2, -2)
  2. modelscope/exporters/nlp/sbert_for_sequence_classification_exporter.py (+7, -5)
  3. modelscope/metrics/sequence_classification_metric.py (+14, -5)
  4. modelscope/metrics/text_generation_metric.py (+9, -3)
  5. modelscope/metrics/token_classification_metric.py (+16, -10)
  6. modelscope/models/base/base_model.py (+13, -15)
  7. modelscope/models/base/base_torch_model.py (+1, -0)
  8. modelscope/models/nlp/T5/backbone.py (+5, -6)
  9. modelscope/models/nlp/T5/text2text_generation.py (+15, -3)
  10. modelscope/models/nlp/__init__.py (+1, -8)
  11. modelscope/models/nlp/bart/text_error_correction.py (+3, -2)
  12. modelscope/models/nlp/bert/backbone.py (+10, -29)
  13. modelscope/models/nlp/bert/document_segmentation.py (+23, -22)
  14. modelscope/models/nlp/bert/fill_mask.py (+1, -1)
  15. modelscope/models/nlp/bert/text_classification.py (+1, -1)
  16. modelscope/models/nlp/bert/token_classification.py (+15, -4)
  17. modelscope/models/nlp/deberta_v2/backbone.py (+1, -2)
  18. modelscope/models/nlp/deberta_v2/fill_mask.py (+1, -1)
  19. modelscope/models/nlp/palm_v2/__init__.py (+4, -6)
  20. modelscope/models/nlp/palm_v2/backbone.py (+0, -1327)
  21. modelscope/models/nlp/palm_v2/text_generation.py (+1343, -29)
  22. modelscope/models/nlp/ponet/backbone.py (+1, -31)
  23. modelscope/models/nlp/ponet/document_segmentation.py (+21, -20)
  24. modelscope/models/nlp/space/model/tokenization_space.py (+3, -3)
  25. modelscope/models/nlp/structbert/__init__.py (+0, -6)
  26. modelscope/models/nlp/structbert/backbone.py (+8, -24)
  27. modelscope/models/nlp/structbert/faq_question_answering.py (+4, -3)
  28. modelscope/models/nlp/structbert/fill_mask.py (+3, -3)
  29. modelscope/models/nlp/structbert/text_classification.py (+1, -1)
  30. modelscope/models/nlp/structbert/token_classification.py (+15, -4)
  31. modelscope/models/nlp/structbert/tokenization.py (+0, -519)
  32. modelscope/models/nlp/structbert/tokenization_fast.py (+0, -203)
  33. modelscope/models/nlp/task_models/feature_extraction.py (+3, -5)
  34. modelscope/models/nlp/task_models/information_extraction.py (+3, -3)
  35. modelscope/models/nlp/task_models/nncrf_for_named_entity_recognition.py (+8, -9)
  36. modelscope/models/nlp/task_models/token_classification.py (+18, -10)
  37. modelscope/models/nlp/veco/__init__.py (+0, -4)
  38. modelscope/models/nlp/veco/fill_mask.py (+1, -1)
  39. modelscope/models/nlp/veco/text_classification.py (+7, -21)
  40. modelscope/models/nlp/veco/token_classification.py (+8, -20)
  41. modelscope/models/nlp/veco/tokenization.py (+0, -321)
  42. modelscope/models/nlp/veco/tokenization_fast.py (+0, -213)
  43. modelscope/outputs/nlp/model_outputs.py (+125, -311)
  44. modelscope/pipelines/base.py (+3, -1)
  45. modelscope/pipelines/builder.py (+7, -4)
  46. modelscope/pipelines/cv/easycv_pipelines/base.py (+2, -0)
  47. modelscope/pipelines/nlp/__init__.py (+5, -9)
  48. modelscope/pipelines/nlp/conversational_text_to_sql_pipeline.py (+14, -5)
  49. modelscope/pipelines/nlp/dialog_intent_prediction_pipeline.py (+12, -2)
  50. modelscope/pipelines/nlp/dialog_modeling_pipeline.py (+12, -2)
  51. modelscope/pipelines/nlp/dialog_state_tracking_pipeline.py (+14, -2)
  52. modelscope/pipelines/nlp/distributed_gpt3_pipeline.py (+9, -1)
  53. modelscope/pipelines/nlp/distributed_plug_pipeline.py (+8, -8)
  54. modelscope/pipelines/nlp/document_segmentation_pipeline.py (+28, -20)
  55. modelscope/pipelines/nlp/extractive_summarization_pipeline.py (+24, -28)
  56. modelscope/pipelines/nlp/faq_question_answering_pipeline.py (+17, -1)
  57. modelscope/pipelines/nlp/fasttext_text_classification_pipeline.py (+6, -6)
  58. modelscope/pipelines/nlp/feature_extraction_pipeline.py (+19, -15)
  59. modelscope/pipelines/nlp/fill_mask_pipeline.py (+20, -14)
  60. modelscope/pipelines/nlp/information_extraction_pipeline.py (+26, -6)
  61. modelscope/pipelines/nlp/named_entity_recognition_pipeline.py (+26, -54)
  62. modelscope/pipelines/nlp/sentence_embedding_pipeline.py (+15, -7)
  63. modelscope/pipelines/nlp/summarization_pipeline.py (+23, -7)
  64. modelscope/pipelines/nlp/table_question_answering_pipeline.py (+13, -2)
  65. modelscope/pipelines/nlp/text2text_generation_pipeline.py (+0, -113)
  66. modelscope/pipelines/nlp/text_classification_pipeline.py (+49, -35)
  67. modelscope/pipelines/nlp/text_error_correction_pipeline.py (+19, -12)
  68. modelscope/pipelines/nlp/text_generation_pipeline.py (+102, -34)
  69. modelscope/pipelines/nlp/text_ranking_pipeline.py (+16, -4)
  70. modelscope/pipelines/nlp/token_classification_pipeline.py (+49, -31)
  71. modelscope/pipelines/nlp/translation_quality_estimation_pipeline.py (+2, -6)
  72. modelscope/pipelines/nlp/word_segmentation_pipeline.py (+59, -42)
  73. modelscope/pipelines/nlp/zero_shot_classification_pipeline.py (+18, -9)
  74. modelscope/preprocessors/__init__.py (+25, -18)
  75. modelscope/preprocessors/base.py (+52, -2)
  76. modelscope/preprocessors/nlp/__init__.py (+28, -29)
  77. modelscope/preprocessors/nlp/document_segmentation_preprocessor.py (+31, -18)
  78. modelscope/preprocessors/nlp/faq_question_answering_preprocessor.py (+51, -25)
  79. modelscope/preprocessors/nlp/feature_extraction_preprocessor.py (+78, -0)
  80. modelscope/preprocessors/nlp/fill_mask_preprocessor.py (+198, -45)
  81. modelscope/preprocessors/nlp/nlp_base.py (+0, -291)
  82. modelscope/preprocessors/nlp/relation_extraction_preprocessor.py (+17, -13)
  83. modelscope/preprocessors/nlp/sentence_classification_preprocessor.py (+0, -25)
  84. modelscope/preprocessors/nlp/sentence_embedding_preprocessor.py (+49, -19)
  85. modelscope/preprocessors/nlp/sentence_piece_preprocessor.py (+15, -7)
  86. modelscope/preprocessors/nlp/text2text_generation_preprocessor.py (+0, -40)
  87. modelscope/preprocessors/nlp/text_classification_preprocessor.py (+152, -0)
  88. modelscope/preprocessors/nlp/text_error_correction.py (+2, -3)
  89. modelscope/preprocessors/nlp/text_generation_jieba_preprocessor.py (+0, -44)
  90. modelscope/preprocessors/nlp/text_generation_preprocessor.py (+234, -39)
  91. modelscope/preprocessors/nlp/text_ranking_preprocessor.py (+45, -34)
  92. modelscope/preprocessors/nlp/token_classification_preprocessor.py (+351, -208)
  93. modelscope/preprocessors/nlp/token_classification_thai_preprocessor.py (+19, -10)
  94. modelscope/preprocessors/nlp/token_classification_viet_preprocessor.py (+10, -6)
  95. modelscope/preprocessors/nlp/transformers_tokenizer.py (+112, -0)
  96. modelscope/preprocessors/nlp/utils.py (+100, -0)
  97. modelscope/preprocessors/nlp/zero_shot_classification_preprocessor.py (+74, -0)
  98. modelscope/preprocessors/nlp/zero_shot_classification_reprocessor.py (+0, -51)
  99. modelscope/trainers/hooks/checkpoint_hook.py (+137, -33)
  100. modelscope/trainers/nlp/text_generation_trainer.py (+2, -2)

data/test/regression/sbert_ws_zh.bin (+2, -2)

@@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7d98ac11a4e9e2744a7402a5cc912da991a41938bbc5dd60f15ee5c6b3196030
size 63349
oid sha256:3b38bfb5a851d35d5fba4d59eda926557666dbd62c70e3e3b24c22605e7d9c4a
size 40771

modelscope/exporters/nlp/sbert_for_sequence_classification_exporter.py (+7, -5)

@@ -7,7 +7,8 @@ from torch.utils.data.dataloader import default_collate
from modelscope.exporters.builder import EXPORTERS
from modelscope.exporters.torch_model_exporter import TorchModelExporter
from modelscope.metainfo import Models
from modelscope.preprocessors import Preprocessor, build_preprocessor
from modelscope.preprocessors import (
TextClassificationTransformersPreprocessor, build_preprocessor)
from modelscope.utils.config import Config
from modelscope.utils.constant import ModeKeys, Tasks

@@ -59,12 +60,13 @@ class SbertForSequenceClassificationExporter(TorchModelExporter):
'mode': ModeKeys.TRAIN,
**sequence_length
})
preprocessor: Preprocessor = build_preprocessor(cfg, field_name)
preprocessor: TextClassificationTransformersPreprocessor = build_preprocessor(
cfg, field_name)
if pair:
first_sequence = preprocessor.tokenizer.unk_token
second_sequence = preprocessor.tokenizer.unk_token
first_sequence = preprocessor.nlp_tokenizer.tokenizer.unk_token
second_sequence = preprocessor.nlp_tokenizer.tokenizer.unk_token
else:
first_sequence = preprocessor.tokenizer.unk_token
first_sequence = preprocessor.nlp_tokenizer.tokenizer.unk_token
second_sequence = None

batched = []


modelscope/metrics/sequence_classification_metric.py (+14, -5)

@@ -19,18 +19,27 @@ from .builder import METRICS, MetricKeys
class SequenceClassificationMetric(Metric):
"""The metric computation class for sequence classification tasks.

This metric class calculates accuracy of the whole input batches.
This metric class calculates accuracy/F1 of all the input batches.

Args:
label_name: The key of label column in the 'inputs' arg.
logit_name: The key of logits column in the 'inputs' arg.
"""

def __init__(self, *args, **kwargs):
def __init__(self,
label_name=OutputKeys.LABELS,
logit_name=OutputKeys.LOGITS,
*args,
**kwargs):
super().__init__(*args, **kwargs)
self.preds = []
self.labels = []
self.label_name = label_name
self.logit_name = logit_name

def add(self, outputs: Dict, inputs: Dict):
label_name = OutputKeys.LABEL if OutputKeys.LABEL in inputs else OutputKeys.LABELS
ground_truths = inputs[label_name]
eval_results = outputs[OutputKeys.LOGITS]
ground_truths = inputs[self.label_name]
eval_results = outputs[self.logit_name]
self.preds.append(
torch_nested_numpify(torch_nested_detach(eval_results)))
self.labels.append(
torch_nested_numpify(torch_nested_detach(ground_truths)))
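
The keys this metric reads are now init arguments instead of hard-coded constants; a standalone sketch of the resulting behaviour (default key names made explicit, sample tensors made up):

    import torch
    from modelscope.metrics.sequence_classification_metric import SequenceClassificationMetric

    metric = SequenceClassificationMetric(label_name='labels', logit_name='logits')
    metric.add(
        outputs={'logits': torch.tensor([[0.2, 0.8], [0.9, 0.1]])},
        inputs={'labels': torch.tensor([1, 0])},
    )
    print(metric.evaluate())  # accuracy / F1 over everything added so far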


modelscope/metrics/text_generation_metric.py (+9, -3)

@@ -18,16 +18,22 @@ class TextGenerationMetric(Metric):
"""The metric computation class for text generation classes.

This metric class calculates F1 of the rouge scores for the whole evaluation dataset.

Args:
target_text: The key of the target text column in the `inputs` arg.
pred_text: The key of the predicted text column in the `outputs` arg.
"""

def __init__(self):
def __init__(self, target_text='tgts', pred_text='preds'):
self.preds: List[str] = []
self.tgts: List[str] = []
self.rouge = Rouge()
self.target_text = target_text
self.pred_text = pred_text

def add(self, outputs: Dict[str, List[str]], inputs: Dict[str, List[str]]):
ground_truths = inputs['tgts']
eval_results = outputs['preds']
ground_truths = inputs[self.target_text]
eval_results = outputs[self.pred_text]
for truth in ground_truths:
self.tgts.append(rebuild_chinese_str(truth))
for result in eval_results:


modelscope/metrics/token_classification_metric.py (+16, -10)

@@ -21,20 +21,16 @@ class TokenClassificationMetric(Metric):
This metric class uses seqeval to calculate the scores.

Args:
return_entity_level_metrics (bool, *optional*):
label_name(str, `optional`): The key of label column in the 'inputs' arg.
logit_name(str, `optional`): The key of logits column in the 'inputs' arg.
return_entity_level_metrics (bool, `optional`):
Whether to return every label's detail metrics, default False.
label2id(dict, `optional`): The label2id information to get the token labels.
"""

def add(self, outputs: Dict, inputs: Dict):
label_name = OutputKeys.LABEL if OutputKeys.LABEL in inputs else OutputKeys.LABELS
ground_truths = inputs[label_name]
eval_results = outputs[OutputKeys.LOGITS]
self.preds.append(
torch_nested_numpify(torch_nested_detach(eval_results)))
self.labels.append(
torch_nested_numpify(torch_nested_detach(ground_truths)))

def __init__(self,
label_name=OutputKeys.LABELS,
logit_name=OutputKeys.LOGITS,
return_entity_level_metrics=False,
label2id=None,
*args,
@@ -44,6 +40,16 @@ class TokenClassificationMetric(Metric):
self.preds = []
self.labels = []
self.label2id = label2id
self.label_name = label_name
self.logit_name = logit_name

def add(self, outputs: Dict, inputs: Dict):
ground_truths = inputs[self.label_name]
eval_results = outputs[self.logit_name]
self.preds.append(
torch_nested_numpify(torch_nested_detach(eval_results)))
self.labels.append(
torch_nested_numpify(torch_nested_detach(ground_truths)))

def evaluate(self):
label2id = self.label2id


modelscope/models/base/base_model.py (+13, -15)

@@ -6,7 +6,8 @@ from typing import Any, Callable, Dict, List, Optional, Union

from modelscope.hub.snapshot_download import snapshot_download
from modelscope.models.builder import build_model
from modelscope.utils.checkpoint import save_checkpoint, save_pretrained
from modelscope.utils.checkpoint import (save_checkpoint, save_configuration,
save_pretrained)
from modelscope.utils.config import Config
from modelscope.utils.constant import DEFAULT_MODEL_REVISION, Invoke, ModelFile
from modelscope.utils.device import verify_device
@@ -129,11 +130,9 @@ class Model(ABC):
model_cfg[k] = v
if device is not None:
model_cfg.device = device
model = build_model(
model_cfg, task_name=task_name, default_args=kwargs)
model = build_model(model_cfg, task_name=task_name)
else:
model = build_model(
model_cfg, task_name=task_name, default_args=kwargs)
model = build_model(model_cfg, task_name=task_name)

# dynamically add pipeline info to model for pipeline inference
if hasattr(cfg, 'pipeline'):
@@ -142,6 +141,7 @@ class Model(ABC):
if not hasattr(model, 'cfg'):
model.cfg = cfg

model_cfg.pop('model_dir', None)
model.name = model_name_or_path
model.model_dir = local_model_dir
return model
@@ -151,6 +151,7 @@ class Model(ABC):
save_checkpoint_names: Union[str, List[str]] = None,
save_function: Callable = save_checkpoint,
config: Optional[dict] = None,
save_config_function: Callable = save_configuration,
**kwargs):
"""save the pretrained model, its configuration and other related files to a directory,
so that it can be re-loaded
@@ -168,18 +169,15 @@ class Model(ABC):
config (Optional[dict], optional):
The config for the configuration.json, might not be identical with model.config

save_config_function (Callable, optional):
The function to use to save the configuration.

"""
if config is None and hasattr(self, 'cfg'):
config = self.cfg
assert config is not None, 'Cannot save the model because the model config is empty.'
if isinstance(config, Config):
config = config.to_dict()
if 'preprocessor' in config and config['preprocessor'] is not None:
if 'mode' in config['preprocessor']:
config['preprocessor']['mode'] = 'inference'
elif 'val' in config['preprocessor'] and 'mode' in config[
'preprocessor']['val']:
config['preprocessor']['val']['mode'] = 'inference'

if config is not None:
save_config_function(target_folder, config)

save_pretrained(self, target_folder, save_checkpoint_names,
save_function, config, **kwargs)
save_function, **kwargs)
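
With this change, the configuration is written by a separate, overridable save_config_function rather than inside save_pretrained itself; a hedged sketch of the call (model id and target folder are illustrative):

    from modelscope.models import Model
    from modelscope.utils.checkpoint import save_configuration

    model = Model.from_pretrained('damo/nlp_structbert_sentence-similarity_chinese-base')  # illustrative id
    model.save_pretrained(
        './finetuned_model',
        save_checkpoint_names='pytorch_model.bin',
        save_config_function=save_configuration,  # the default; pass a custom callable to override
    )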

modelscope/models/base/base_torch_model.py (+1, -0)

@@ -6,6 +6,7 @@ import torch
from torch import nn

from modelscope.utils.file_utils import func_receive_dict_inputs
from modelscope.utils.hub import parse_label_mapping
from modelscope.utils.logger import get_logger
from .base_model import Model



modelscope/models/nlp/T5/backbone.py (+5, -6)

@@ -36,9 +36,7 @@ from transformers.utils.model_parallel_utils import (assert_device_map,
from modelscope.metainfo import Models
from modelscope.models.base import Model, Tensor, TorchModel
from modelscope.models.builder import MODELS
from modelscope.outputs import (BaseModelOutput,
BaseModelOutputWithPastAndCrossAttentions,
Seq2SeqModelOutput)
from modelscope.outputs import AttentionBackboneModelOutput, Seq2SeqModelOutput
from modelscope.utils.constant import Tasks
from modelscope.utils.logger import get_logger
from .configuration import T5Config
@@ -1182,7 +1180,7 @@ class T5Stack(T5PreTrainedModel):
all_attentions,
all_cross_attentions,
] if v is not None)
return BaseModelOutputWithPastAndCrossAttentions(
return AttentionBackboneModelOutput(
last_hidden_state=hidden_states,
past_key_values=present_key_value_states,
hidden_states=all_hidden_states,
@@ -1475,8 +1473,9 @@ class T5Model(T5PreTrainedModel):
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
elif return_dict and not isinstance(encoder_outputs, BaseModelOutput):
encoder_outputs = BaseModelOutput(
elif return_dict and not isinstance(encoder_outputs,
AttentionBackboneModelOutput):
encoder_outputs = AttentionBackboneModelOutput(
last_hidden_state=encoder_outputs[0],
hidden_states=encoder_outputs[1]
if len(encoder_outputs) > 1 else None,


modelscope/models/nlp/T5/text2text_generation.py (+15, -3)

@@ -24,7 +24,8 @@ from transformers.utils.model_parallel_utils import (assert_device_map,

from modelscope.metainfo import Models
from modelscope.models.builder import MODELS
from modelscope.outputs import BaseModelOutput, Seq2SeqLMOutput
from modelscope.outputs import (AttentionBackboneModelOutput, Seq2SeqLMOutput,
TokenGeneratorOutput)
from modelscope.utils.constant import Tasks
from modelscope.utils.logger import get_logger
from .backbone import T5PreTrainedModel, T5Stack
@@ -311,8 +312,9 @@ class T5ForConditionalGeneration(T5PreTrainedModel):
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
elif return_dict and not isinstance(encoder_outputs, BaseModelOutput):
encoder_outputs = BaseModelOutput(
elif return_dict and not isinstance(encoder_outputs,
AttentionBackboneModelOutput):
encoder_outputs = AttentionBackboneModelOutput(
last_hidden_state=encoder_outputs[0],
hidden_states=encoder_outputs[1]
if len(encoder_outputs) > 1 else None,
@@ -426,6 +428,16 @@ class T5ForConditionalGeneration(T5PreTrainedModel):
def prepare_decoder_input_ids_from_labels(self, labels: torch.Tensor):
return self._shift_right(labels)

def generate(
self,
*args,
**kwargs,
):
output = super().generate(*args, **kwargs)
return TokenGeneratorOutput(
sequences=output if isinstance(output, torch.Tensor) else output[0]
)

def _reorder_cache(self, past, beam_idx):
# if decoder past is not included in output
# speedy decoding is disabled and no need to reorder


modelscope/models/nlp/__init__.py (+1, -8)

@@ -30,9 +30,7 @@ if TYPE_CHECKING:
SbertForMaskedLM,
SbertForSequenceClassification,
SbertForTokenClassification,
SbertTokenizer,
SbertModel,
SbertTokenizerFast,
)
from .T5 import T5ForConditionalGeneration
from .mglm import MGLMForTextSummarization
@@ -51,8 +49,7 @@ if TYPE_CHECKING:
)
from .veco import (VecoConfig, VecoForMaskedLM,
VecoForSequenceClassification,
VecoForTokenClassification, VecoModel, VecoTokenizer,
VecoTokenizerFast)
VecoForTokenClassification, VecoModel)
from .bloom import BloomModel
else:
_import_structure = {
@@ -66,8 +63,6 @@ else:
'SbertForMaskedLM',
'SbertForSequenceClassification',
'SbertForTokenClassification',
'SbertTokenizer',
'SbertTokenizerFast',
'SbertModel',
],
'veco': [
@@ -76,8 +71,6 @@ else:
'VecoForSequenceClassification',
'VecoForTokenClassification',
'VecoModel',
'VecoTokenizer',
'VecoTokenizerFast',
],
'bert': [
'BertForMaskedLM',


modelscope/models/nlp/bart/text_error_correction.py (+3, -2)

@@ -7,6 +7,7 @@ import torch.cuda
from modelscope.metainfo import Models
from modelscope.models.base import TorchModel
from modelscope.models.builder import MODELS
from modelscope.outputs import TextErrorCorrectionOutput
from modelscope.utils.constant import ModelFile, Tasks

__all__ = ['BartForTextErrorCorrection']
@@ -55,7 +56,7 @@ class BartForTextErrorCorrection(TorchModel):

self.task = task

def forward(self, input: Dict[str, Dict]) -> Dict[str, Any]:
def forward(self, input: Dict[str, Dict]) -> TextErrorCorrectionOutput:
"""return the result by the model

Args:
@@ -91,4 +92,4 @@ class BartForTextErrorCorrection(TorchModel):

# get 1-best List[Tensor]
preds = translations[0][0]['tokens']
return {'predictions': preds}
return TextErrorCorrectionOutput(predictions=preds)

modelscope/models/nlp/bert/backbone.py (+10, -29)

@@ -16,9 +16,6 @@
"""PyTorch BERT model. """

import math
import os
from dataclasses import dataclass
from typing import Optional, Tuple

import torch
import torch.utils.checkpoint
@@ -33,11 +30,10 @@ from transformers.modeling_utils import (PreTrainedModel,
from modelscope.metainfo import Models
from modelscope.models import Model, TorchModel
from modelscope.models.builder import MODELS
from modelscope.outputs import (BaseModelOutputWithPastAndCrossAttentions,
BaseModelOutputWithPoolingAndCrossAttentions)
from modelscope.outputs import AttentionBackboneModelOutput
from modelscope.utils.constant import Tasks
from modelscope.utils.hub import parse_label_mapping
from modelscope.utils.logger import get_logger
from modelscope.utils.nlp.utils import parse_labels_in_order
from .configuration import BertConfig

logger = get_logger(__name__)
@@ -562,7 +558,7 @@ class BertEncoder(nn.Module):
all_self_attentions,
all_cross_attentions,
] if v is not None)
return BaseModelOutputWithPastAndCrossAttentions(
return AttentionBackboneModelOutput(
last_hidden_state=hidden_states,
past_key_values=next_decoder_cache,
hidden_states=all_hidden_states,
@@ -639,30 +635,15 @@ class BertPreTrainedModel(TorchModel, PreTrainedModel):
The loaded model, which is initialized by transformers.PreTrainedModel.from_pretrained
"""

model_dir = kwargs.get('model_dir', None)
model_dir = kwargs.pop('model_dir', None)
cfg = kwargs.pop('cfg', None)
model_args = parse_labels_in_order(model_dir, cfg, **kwargs)
if model_dir is None:
config = BertConfig(**kwargs)
config = BertConfig(**model_args)
model = cls(config)
else:
model_kwargs = {}
label2id = kwargs.get('label2id', parse_label_mapping(model_dir))
id2label = kwargs.get(
'id2label', None if label2id is None else
{id: label
for label, id in label2id.items()})
if id2label is not None and label2id is None:
label2id = {label: id for id, label in id2label.items()}

num_labels = kwargs.get(
'num_labels', None if label2id is None else len(label2id))
if num_labels is not None:
model_kwargs['num_labels'] = num_labels
if label2id is not None:
model_kwargs['label2id'] = label2id
if id2label is not None:
model_kwargs['id2label'] = id2label
model = super(Model, cls).from_pretrained(
pretrained_model_name_or_path=model_dir, **model_kwargs)
pretrained_model_name_or_path=model_dir, **model_args)
model.model_dir = model_dir
return model

@@ -750,7 +731,7 @@ class BertModel(BertPreTrainedModel):
output_attentions=None,
output_hidden_states=None,
return_dict=None,
**kwargs):
**kwargs) -> AttentionBackboneModelOutput:
r"""
Args:
input_ids (`torch.LongTensor` of shape `((batch_size, sequence_length)`):
@@ -936,7 +917,7 @@ class BertModel(BertPreTrainedModel):
if not return_dict:
return (sequence_output, pooled_output) + encoder_outputs[1:]

return BaseModelOutputWithPoolingAndCrossAttentions(
return AttentionBackboneModelOutput(
last_hidden_state=sequence_output,
pooler_output=pooled_output,
past_key_values=encoder_outputs.past_key_values,


modelscope/models/nlp/bert/document_segmentation.py (+23, -22)

@@ -5,37 +5,22 @@ from typing import Any, Dict
import torch
from torch import nn
from torch.nn import CrossEntropyLoss
from transformers.modeling_outputs import TokenClassifierOutput
from transformers.models.bert.modeling_bert import (BertModel,
BertPreTrainedModel)

from modelscope.metainfo import Models
from modelscope.models.base import Model
from modelscope.models import Model
from modelscope.models.builder import MODELS
from modelscope.models.nlp.ponet import PoNetConfig
from modelscope.outputs import AttentionTokenClassificationModelOutput
from modelscope.utils.constant import Tasks
from .backbone import BertModel, BertPreTrainedModel
from .configuration import BertConfig

__all__ = ['BertForDocumentSegmentation']


@MODELS.register_module(
Tasks.document_segmentation, module_name=Models.bert_for_ds)
class BertForDocumentSegmentation(Model):

def __init__(self, model_dir: str, model_config: Dict[str, Any], *args,
**kwargs):
super().__init__(model_dir, model_config, *args, **kwargs)
self.model_cfg = model_config

def build_with_config(self, config):
self.bert_model = BertForDocumentSegmentationBase.from_pretrained(
self.model_dir, from_tf=False, config=config)
return self.bert_model

def forward(self) -> Dict[str, Any]:
return self.model_cfg


class BertForDocumentSegmentationBase(BertPreTrainedModel):
class BertForDocumentSegmentation(BertPreTrainedModel):

_keys_to_ignore_on_load_unexpected = [r'pooler']

@@ -103,9 +88,25 @@ class BertForDocumentSegmentationBase(BertPreTrainedModel):
output = (logits, ) + outputs[2:]
return ((loss, ) + output) if loss is not None else output

return TokenClassifierOutput(
return AttentionTokenClassificationModelOutput(
loss=loss,
logits=logits,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
)

@classmethod
def _instantiate(cls, model_dir, model_config: Dict[str, Any], **kwargs):
if model_config['type'] == 'bert':
config = BertConfig.from_pretrained(model_dir, num_labels=2)
elif model_config['type'] == 'ponet':
config = PoNetConfig.from_pretrained(model_dir, num_labels=2)
else:
raise ValueError(
f'Expected config type bert and ponet, which is : {model_config["type"]}'
)
model = super(Model, cls).from_pretrained(
model_dir, from_tf=False, config=config)
model.model_dir = model_dir
model.model_cfg = model_config
return model

modelscope/models/nlp/bert/fill_mask.py (+1, -1)

@@ -121,7 +121,7 @@ class BertForMaskedLM(BertPreTrainedModel):

Preprocessor:
This is the fill_mask model of Structbert, the preprocessor of this model
is `modelscope.preprocessors.NLPPreprocessor`.
is `modelscope.preprocessors.FillMaskTransformersPreprocessor`.

Parameters:
config (:class:`~modelscope.models.nlp.structbert.SbertConfig`): Model configuration class with


modelscope/models/nlp/bert/text_classification.py (+1, -1)

@@ -51,7 +51,7 @@ class BertForSequenceClassification(BertPreTrainedModel):

Preprocessor:
This is the fill_mask model of Bert, the preprocessor of this model
is `modelscope.preprocessors.SequenceClassificationPreprocessor`.
is `modelscope.preprocessors.TextClassificationTransformersPreprocessor`.

Trainer:
This model is a normal PyTorch model, and can be trained by variable trainers, like EpochBasedTrainer,


modelscope/models/nlp/bert/token_classification.py (+15, -4)

@@ -22,7 +22,7 @@ from torch.nn import CrossEntropyLoss

from modelscope.metainfo import Models
from modelscope.models.builder import MODELS
from modelscope.outputs import TokenClassifierOutput
from modelscope.outputs import AttentionTokenClassificationModelOutput
from modelscope.utils import logger as logging
from modelscope.utils.constant import Tasks
from .backbone import BertModel, BertPreTrainedModel
@@ -47,7 +47,7 @@ class BertForTokenClassification(BertPreTrainedModel):

Preprocessor:
This is the fill_mask model of Bert, the preprocessor of this model
is `modelscope.preprocessors.SequenceClassificationPreprocessor`.
is `modelscope.preprocessors.TokenClassificationTransformersPreprocessor`.

Trainer:
This model is a normal PyTorch model, and can be trained by variable trainers, like EpochBasedTrainer,
@@ -169,7 +169,7 @@ class BertForTokenClassification(BertPreTrainedModel):
- 0 for tokens that are **masked**.

Returns:
Returns `modelscope.outputs.TokenClassifierOutput`
Returns `modelscope.outputs.AttentionTokenClassificationModelOutput`

Examples:
>>> from modelscope.models import Model
@@ -212,14 +212,25 @@ class BertForTokenClassification(BertPreTrainedModel):
loss = loss_fct(
logits.view(-1, self.num_labels), labels.view(-1))

if label_mask is not None:
mask = label_mask
masked_lengths = mask.sum(-1).long()
masked_logits = torch.zeros_like(logits)
for i in range(len(mask)):
masked_logits[
i, :masked_lengths[i], :] = logits[i].masked_select(
mask[i].unsqueeze(-1)).view(masked_lengths[i], -1)
logits = masked_logits

if not return_dict:
output = (logits, ) + outputs[2:]
return ((loss, ) + output) if loss is not None else output

return TokenClassifierOutput(
return AttentionTokenClassificationModelOutput(
loss=loss,
logits=logits,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
offset_mapping=offset_mapping,
label_mask=label_mask,
)

modelscope/models/nlp/deberta_v2/backbone.py (+1, -2)

@@ -22,7 +22,6 @@ import torch.utils.checkpoint
from torch import nn
from torch.nn import LayerNorm
from transformers.activations import ACT2FN
from transformers.modeling_outputs import BaseModelOutput
from transformers.modeling_utils import PreTrainedModel
from transformers.pytorch_utils import softmax_backward_data

@@ -574,7 +573,7 @@ class DebertaV2Encoder(nn.Module):
return tuple(
v for v in [output_states, all_hidden_states, all_attentions]
if v is not None)
return BaseModelOutput(
return AttentionBackboneModelOutput(
last_hidden_state=output_states,
hidden_states=all_hidden_states,
attentions=all_attentions)


modelscope/models/nlp/deberta_v2/fill_mask.py (+1, -1)

@@ -44,7 +44,7 @@ class DebertaV2ForMaskedLM(DebertaV2PreTrainedModel):

Preprocessor:
This is the fill_mask model of Deberta_v2, the preprocessor of this model
is `modelscope.preprocessors.NLPPreprocessor`.
is `modelscope.preprocessors.FillMaskTransformersPreprocessor`.

Parameters:
config (`DebertaV2Config`): Model configuration class with all the parameters of the model.


modelscope/models/nlp/palm_v2/__init__.py (+4, -6)

@@ -18,18 +18,16 @@ from modelscope.utils.import_utils import LazyImportModule

if TYPE_CHECKING:
from .configuration import PalmConfig
from .backbone import (
from .text_generation import (
AbsSummarizer,
PalmForConditionalGeneration,
PalmForTextGeneration,
Translator,
)
from .text_generation import PalmForTextGeneration
else:
_import_structure = {
'configuration': ['PalmConfig'],
'backbone':
['AbsSummarizer', 'PalmForConditionalGeneration', 'Translator'],
'text_generation': ['PalmForTextGeneration'],
'text_generation':
['AbsSummarizer', 'Translator', 'PalmForTextGeneration'],
}

import sys


modelscope/models/nlp/palm_v2/backbone.py (+0, -1327): file diff suppressed because it is too large


modelscope/models/nlp/palm_v2/text_generation.py (+1343, -29): file diff suppressed because it is too large


modelscope/models/nlp/ponet/backbone.py (+1, -31)

@@ -23,8 +23,6 @@ import torch.utils.checkpoint
from packaging import version
from torch import nn
from transformers.activations import ACT2FN
from transformers.modeling_outputs import \
BaseModelOutputWithPastAndCrossAttentions
from transformers.modeling_utils import (PreTrainedModel,
apply_chunking_to_forward,
find_pruneable_heads_and_indices,
@@ -573,7 +571,7 @@ class PoNetEncoder(nn.Module):
all_self_attentions,
all_cross_attentions,
] if v is not None)
return BaseModelOutputWithPastAndCrossAttentions(
return AttentionBackboneModelOutput(
last_hidden_state=hidden_states,
past_key_values=next_decoder_cache,
hidden_states=all_hidden_states,
@@ -642,34 +640,6 @@ class PoNetPreTrainedModel(TorchModel, PreTrainedModel):
return model


class PoNetPreTrainedModelV2(PreTrainedModel):
"""
A base class to handle weights initialization and a simple interface for loading pretrained models.
"""

config_class = PoNetConfig
base_model_prefix = 'ponet'
_keys_to_ignore_on_load_missing = [r'position_ids']

def _init_weights(self, module):
"""Initialize the weights"""
if isinstance(module, nn.Linear):
# Slightly different from the TF version which uses truncated_normal for initialization
# cf https://github.com/pytorch/pytorch/pull/5617
module.weight.data.normal_(
mean=0.0, std=self.config.initializer_range)
if module.bias is not None:
module.bias.data.zero_()
elif isinstance(module, nn.Embedding):
module.weight.data.normal_(
mean=0.0, std=self.config.initializer_range)
if module.padding_idx is not None:
module.weight.data[module.padding_idx].zero_()
elif isinstance(module, nn.LayerNorm):
module.bias.data.zero_()
module.weight.data.fill_(1.0)


@MODELS.register_module(Tasks.backbone, module_name=Models.ponet)
class PoNetModel(PoNetPreTrainedModel):
"""The bare PoNet Model transformer outputting raw hidden-states without any specific head on top.


modelscope/models/nlp/ponet/document_segmentation.py (+21, -20)

@@ -5,13 +5,15 @@ from typing import Any, Dict
import torch
from torch import nn
from torch.nn import CrossEntropyLoss
from transformers.modeling_outputs import TokenClassifierOutput

from modelscope.metainfo import Models
from modelscope.models.base import Model
from modelscope.models.builder import MODELS
from modelscope.models.nlp.bert import BertConfig
from modelscope.outputs import AttentionTokenClassificationModelOutput
from modelscope.utils.constant import Tasks
from .backbone import PoNetModel, PoNetPreTrainedModelV2
from .backbone import PoNetModel, PoNetPreTrainedModel
from .configuration import PoNetConfig

__all__ = ['PoNetForDocumentSegmentation']

@@ -20,23 +22,7 @@ __all__ = ['PoNetForDocumentSegmentation']
Tasks.document_segmentation, module_name=Models.ponet_for_ds)
@MODELS.register_module(
Tasks.extractive_summarization, module_name=Models.ponet_for_ds)
class PoNetForDocumentSegmentation(Model):

def __init__(self, model_dir: str, model_config: Dict[str, Any], *args,
**kwargs):
super().__init__(model_dir, model_config, *args, **kwargs)
self.model_cfg = model_config

def build_with_config(self, config):
self.ponet_model = PoNetForDocumentSegmentationBase.from_pretrained(
self.model_dir, config=config)
return self.ponet_model

def forward(self) -> Dict[str, Any]:
return self.model_cfg


class PoNetForDocumentSegmentationBase(PoNetPreTrainedModelV2):
class PoNetForDocumentSegmentation(PoNetPreTrainedModel):
_keys_to_ignore_on_load_unexpected = [r'pooler']

def __init__(self, config):
@@ -107,9 +93,24 @@ class PoNetForDocumentSegmentationBase(PoNetPreTrainedModelV2):
output = (logits, ) + outputs[2:]
return ((loss, ) + output) if loss is not None else output

return TokenClassifierOutput(
return AttentionTokenClassificationModelOutput(
loss=loss,
logits=logits,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
)

@classmethod
def _instantiate(cls, model_dir, model_config: Dict[str, Any], **kwargs):
if model_config['type'] == 'bert':
config = BertConfig.from_pretrained(model_dir, num_labels=2)
elif model_config['type'] == 'ponet':
config = PoNetConfig.from_pretrained(model_dir, num_labels=2)
else:
raise ValueError(
f'Expected config type bert and ponet, which is : {model_config["type"]}'
)
model = super(Model, cls).from_pretrained(model_dir, config=config)
model.model_dir = model_dir
model.model_cfg = model_config
return model

modelscope/models/nlp/space/model/tokenization_space.py (+3, -3)

@@ -15,14 +15,14 @@
# limitations under the License
"""Tokenization classes for Space. mainly copied from :module:`~transformers.tokenization_xlm_roberta`"""

from modelscope.models.nlp.structbert import (BasicTokenizer, SbertTokenizer,
WordpieceTokenizer)
from transformers import BasicTokenizer, BertTokenizer, WordpieceTokenizer
from modelscope.utils import logger as logging

logger = logging.get_logger(__name__)


class SpaceTokenizer(SbertTokenizer):
class SpaceTokenizer(BertTokenizer):
"""
This class overrides [`SpaceTokenizer`]. Please check the superclass for the appropriate
documentation alongside usage examples.


modelscope/models/nlp/structbert/__init__.py (+0, -6)

@@ -24,9 +24,6 @@ if TYPE_CHECKING:
from .fill_mask import SbertForMaskedLM
from .text_classification import SbertForSequenceClassification
from .token_classification import SbertForTokenClassification
from .tokenization import (BasicTokenizer, SbertTokenizer,
WordpieceTokenizer)
from .tokenization_fast import SbertTokenizerFast
else:
_import_structure = {
'backbone': ['SbertModel', 'SbertPreTrainedModel'],
@@ -35,9 +32,6 @@ else:
'faq_question_answering': ['SbertForFaqQuestionAnswering'],
'text_classification': ['SbertForSequenceClassification'],
'token_classification': ['SbertForTokenClassification'],
'tokenization':
['BasicTokenizer', 'SbertTokenizer', 'WordpieceTokenizer'],
'tokenization_fast': ['SbertTokenizerFast'],
}

import sys


modelscope/models/nlp/structbert/backbone.py (+8, -24)

@@ -18,15 +18,13 @@

import math
from dataclasses import dataclass
from typing import Optional, Tuple, Union
from typing import Optional, Union

import torch
import torch.nn as nn
import torch.utils.checkpoint
from packaging import version
from transformers.activations import ACT2FN
from transformers.modeling_outputs import \
BaseModelOutputWithPastAndCrossAttentions
from transformers.modeling_utils import (PreTrainedModel,
apply_chunking_to_forward,
find_pruneable_heads_and_indices,
@@ -37,8 +35,8 @@ from modelscope.models import Model, TorchModel
from modelscope.models.builder import MODELS
from modelscope.outputs import AttentionBackboneModelOutput
from modelscope.utils.constant import Tasks
from modelscope.utils.hub import parse_label_mapping
from modelscope.utils.logger import get_logger
from modelscope.utils.nlp.utils import parse_labels_in_order
from .configuration import SbertConfig

logger = get_logger(__name__)
@@ -563,7 +561,7 @@ class SbertEncoder(nn.Module):
all_self_attentions,
all_cross_attentions,
] if v is not None)
return BaseModelOutputWithPastAndCrossAttentions(
return AttentionBackboneModelOutput(
last_hidden_state=hidden_states,
past_key_values=next_decoder_cache,
hidden_states=all_hidden_states,
@@ -641,29 +639,15 @@ class SbertPreTrainedModel(TorchModel, PreTrainedModel):
"""

model_dir = kwargs.pop('model_dir', None)
cfg = kwargs.pop('cfg', None)
model_args = parse_labels_in_order(model_dir, cfg, **kwargs)

if model_dir is None:
config = SbertConfig(**kwargs)
config = SbertConfig(**model_args)
model = cls(config)
else:
model_kwargs = {}
label2id = kwargs.get('label2id', parse_label_mapping(model_dir))
id2label = kwargs.get(
'id2label', None if label2id is None else
{id: label
for label, id in label2id.items()})
if id2label is not None and label2id is None:
label2id = {label: id for id, label in id2label.items()}

num_labels = kwargs.get(
'num_labels', None if label2id is None else len(label2id))
if num_labels is not None:
model_kwargs['num_labels'] = num_labels
if label2id is not None:
model_kwargs['label2id'] = label2id
if id2label is not None:
model_kwargs['id2label'] = id2label
model = super(Model, cls).from_pretrained(
pretrained_model_name_or_path=model_dir, **model_kwargs)
pretrained_model_name_or_path=model_dir, **model_args)
return model




modelscope/models/nlp/structbert/faq_question_answering.py (+4, -3)

@@ -14,6 +14,7 @@ from modelscope.metainfo import Models
from modelscope.models.builder import MODELS
from modelscope.models.nlp.structbert import SbertConfig, SbertModel
from modelscope.models.nlp.task_models.task_model import BaseTaskModel
from modelscope.outputs import FaqQuestionAnsweringOutput
from modelscope.utils.config import Config, ConfigFields
from modelscope.utils.constant import ModelFile, Tasks

@@ -208,10 +209,10 @@ class SbertForFaqQuestionAnswering(BaseTaskModel):
Predicted scores of all classes for each query.
Examples:
>>> from modelscope.hub.snapshot_download import snapshot_download
>>> from modelscope.preprocessors import FaqQuestionAnsweringPreprocessor
>>> from modelscope.preprocessors import FaqQuestionAnsweringTransformersPreprocessor
>>> from modelscope.models.nlp import SbertForFaqQuestionAnswering
>>> cache_path = snapshot_download('damo/nlp_structbert_faq-question-answering_chinese-base')
>>> preprocessor = FaqQuestionAnsweringPreprocessor.from_pretrained(cache_path)
>>> preprocessor = FaqQuestionAnsweringTransformersPreprocessor.from_pretrained(cache_path)
>>> model = SbertForFaqQuestionAnswering.from_pretrained(cache_path)
>>> param = {
>>> 'query_set': ['如何使用优惠券', '在哪里领券', '在哪里领券'],
@@ -270,7 +271,7 @@ class SbertForFaqQuestionAnswering(BaseTaskModel):
scores = self.metrics_layer(z_query, protos).view([n_query, num_cls])
if self.metrics_layer.name == 'relation':
scores = torch.sigmoid(scores)
return {'scores': scores}
return FaqQuestionAnsweringOutput(scores=scores)

def _get_onehot_labels(self, labels, support_size, num_cls):
labels_ = labels.view(support_size, 1)


modelscope/models/nlp/structbert/fill_mask.py (+3, -3)

@@ -105,7 +105,7 @@ class SbertForMaskedLM(SbertPreTrainedModel):

Preprocessor:
This is the fill_mask model of StructBERT, the preprocessor of this model
is `modelscope.preprocessors.NLPPreprocessor`.
is `modelscope.preprocessors.FillMaskTransformersPreprocessor`.

Parameters:
config (:class:`~modelscope.models.nlp.structbert.SbertConfig`): Model configuration class with
@@ -213,9 +213,9 @@ class SbertForMaskedLM(SbertPreTrainedModel):

Examples:
>>> from modelscope.models import Model
>>> from modelscope.preprocessors import Preprocessor, NLPPreprocessor
>>> from modelscope.preprocessors import Preprocessor, FillMaskTransformersPreprocessor
>>> model = Model.from_pretrained('damo/nlp_structbert_fill-mask_chinese-large')
>>> preprocessor = NLPPreprocessor('damo/nlp_structbert_fill-mask_chinese-large')
>>> preprocessor = FillMaskTransformersPreprocessor('damo/nlp_structbert_fill-mask_chinese-large')
>>> # Call the model, return some tensors
>>> print(model(**preprocessor('你师父差得动你,你师父可[MASK]不动我。')))
>>> # Call the pipeline


modelscope/models/nlp/structbert/text_classification.py (+1, -1)

@@ -55,7 +55,7 @@ class SbertForSequenceClassification(SbertPreTrainedModel):

Preprocessor:
This is the text classification model of StructBERT, the preprocessor of this model
is `modelscope.preprocessors.SequenceClassificationPreprocessor`.
is `modelscope.preprocessors.TextClassificationTransformersPreprocessor`.

Trainer:
This model is a normal PyTorch model, and can be trained by variable trainers, like EpochBasedTrainer,


modelscope/models/nlp/structbert/token_classification.py (+15, -4)

@@ -22,7 +22,7 @@ from torch.nn import CrossEntropyLoss

from modelscope.metainfo import Models
from modelscope.models.builder import MODELS
from modelscope.outputs import TokenClassifierOutput
from modelscope.outputs import AttentionTokenClassificationModelOutput
from modelscope.utils import logger as logging
from modelscope.utils.constant import Tasks
from .adv_utils import compute_adv_loss
@@ -50,7 +50,7 @@ class SbertForTokenClassification(SbertPreTrainedModel):

Preprocessor:
This is the token-classification model of StructBERT, the preprocessor of this model
is `modelscope.preprocessors.TokenClassificationPreprocessor`.
is `modelscope.preprocessors.TokenClassificationTransformersPreprocessor`.

Trainer:
This model is a normal PyTorch model, and can be trained by variable trainers, like EpochBasedTrainer,
@@ -168,7 +168,7 @@ class SbertForTokenClassification(SbertPreTrainedModel):
- 0 for tokens that are **masked**.

Returns:
Returns `modelscope.outputs.TokenClassifierOutput`
Returns `modelscope.outputs.AttentionTokenClassificationModelOutput`

Examples:
>>> from modelscope.models import Model
@@ -220,10 +220,21 @@ class SbertForTokenClassification(SbertPreTrainedModel):
with_attention_mask=attention_mask is not None,
**outputs.kwargs)

return TokenClassifierOutput(
if label_mask is not None:
mask = label_mask
masked_lengths = mask.sum(-1).long()
masked_logits = torch.zeros_like(logits)
for i in range(len(mask)):
masked_logits[
i, :masked_lengths[i], :] = logits[i].masked_select(
mask[i].unsqueeze(-1)).view(masked_lengths[i], -1)
logits = masked_logits

return AttentionTokenClassificationModelOutput(
loss=loss,
logits=logits,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
offset_mapping=offset_mapping,
label_mask=label_mask,
)

modelscope/models/nlp/structbert/tokenization.py (+0, -519)

@@ -1,519 +0,0 @@
# Copyright 2021-2022 The Alibaba DAMO NLP Team Authors.
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Tokenization classes for Sbert. mainly copied from :module:`~transformers.tokenization_bert`"""

import collections
import os
import unicodedata
from typing import List, Optional, Tuple

from transformers.tokenization_utils import (PreTrainedTokenizer, _is_control,
_is_punctuation, _is_whitespace)

from modelscope.utils.constant import ModelFile
from modelscope.utils.logger import get_logger

logger = get_logger(__name__)

VOCAB_FILES_NAMES = {'vocab_file': ModelFile.VOCAB_FILE}

PRETRAINED_VOCAB_FILES_MAP = {'vocab_file': {}}

PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
'nlp_structbert_backbone_large_std': 512,
'nlp_structbert_backbone_base_std': 512,
'nlp_structbert_backbone_lite_std': 512,
'nlp_structbert_backbone_tiny_std': 512,
}

PRETRAINED_INIT_CONFIGURATION = {
'english_sbert-large-std-512': {
'do_lower_case': True
},
}


def load_vocab(vocab_file):
"""Loads a vocabulary file into a dictionary."""
vocab = collections.OrderedDict()
with open(vocab_file, 'r', encoding='utf-8') as reader:
tokens = reader.readlines()
for index, token in enumerate(tokens):
token = token.rstrip('\n')
vocab[token] = index
return vocab


def whitespace_tokenize(text):
"""Runs basic whitespace cleaning and splitting on a piece of text."""
text = text.strip()
if not text:
return []
tokens = text.split()
return tokens


class SbertTokenizer(PreTrainedTokenizer):
r"""
Construct a SBERT tokenizer. Based on WordPiece.

This tokenizer inherits from :class:`~transformers.PreTrainedTokenizer` which contains most of the main methods.
Users should refer to this superclass for more information regarding those methods.

Args:
vocab_file (:obj:`str`):
File containing the vocabulary.
do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to lowercase the input when tokenizing.
do_basic_tokenize (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to do basic tokenization before WordPiece.
never_split (:obj:`Iterable`, `optional`):
Collection of tokens which will never be split during tokenization. Only has an effect when
:obj:`do_basic_tokenize=True`
unk_token (:obj:`str`, `optional`, defaults to :obj:`"[UNK]"`):
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead.
sep_token (:obj:`str`, `optional`, defaults to :obj:`"[SEP]"`):
The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
sequence classification or for a text and a question for question answering. It is also used as the last
token of a sequence built with special tokens.
pad_token (:obj:`str`, `optional`, defaults to :obj:`"[PAD]"`):
The token used for padding, for example when batching sequences of different lengths.
cls_token (:obj:`str`, `optional`, defaults to :obj:`"[CLS]"`):
The classifier token which is used when doing sequence classification (classification of the whole sequence
instead of per-token classification). It is the first token of the sequence when built with special tokens.
mask_token (:obj:`str`, `optional`, defaults to :obj:`"[MASK]"`):
The token used for masking values. This is the token used when training this model with masked language
modeling. This is the token which the model will try to predict.
tokenize_chinese_chars (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to tokenize Chinese characters.

This should likely be deactivated for Japanese (see this `issue
<https://github.com/huggingface/transformers/issues/328>`__).
strip_accents: (:obj:`bool`, `optional`):
Whether or not to strip all accents. If this option is not specified, then it will be determined by the
value for :obj:`lowercase` (as in the original BERT).
"""

vocab_files_names = VOCAB_FILES_NAMES
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES

def __init__(self,
vocab_file,
do_lower_case=True,
do_basic_tokenize=True,
never_split=None,
unk_token='[UNK]',
sep_token='[SEP]',
pad_token='[PAD]',
cls_token='[CLS]',
mask_token='[MASK]',
tokenize_chinese_chars=True,
strip_accents=None,
**kwargs):
super().__init__(
do_lower_case=do_lower_case,
do_basic_tokenize=do_basic_tokenize,
never_split=never_split,
unk_token=unk_token,
sep_token=sep_token,
pad_token=pad_token,
cls_token=cls_token,
mask_token=mask_token,
tokenize_chinese_chars=tokenize_chinese_chars,
strip_accents=strip_accents,
**kwargs,
)

if not os.path.isfile(vocab_file):
raise ValueError(
f"Can't find a vocabulary file at path '{vocab_file}'. To load the vocabulary from a Google pretrained "
'model use `tokenizer = SbertTokenizer.from_pretrained(PRETRAINED_MODEL_NAME)`'
)
self.vocab = load_vocab(vocab_file)
self.ids_to_tokens = collections.OrderedDict([
(ids, tok) for tok, ids in self.vocab.items()
])
self.do_basic_tokenize = do_basic_tokenize
if do_basic_tokenize:
self.basic_tokenizer = BasicTokenizer(
do_lower_case=do_lower_case,
never_split=never_split,
tokenize_chinese_chars=tokenize_chinese_chars,
strip_accents=strip_accents,
)
self.wordpiece_tokenizer = WordpieceTokenizer(
vocab=self.vocab, unk_token=self.unk_token)

@property
def do_lower_case(self):
return self.basic_tokenizer.do_lower_case

@property
def vocab_size(self):
return len(self.vocab)

def get_vocab(self):
return dict(self.vocab, **self.added_tokens_encoder)

def _tokenize(self, text):
split_tokens = []
if self.do_basic_tokenize:
for token in self.basic_tokenizer.tokenize(
text, never_split=self.all_special_tokens):

# If the token is part of the never_split set
if token in self.basic_tokenizer.never_split:
split_tokens.append(token)
else:
split_tokens += self.wordpiece_tokenizer.tokenize(token)
else:
split_tokens = self.wordpiece_tokenizer.tokenize(text)
return split_tokens

def _convert_token_to_id(self, token):
"""Converts a token (str) in an id using the vocab."""
return self.vocab.get(token, self.vocab.get(self.unk_token))

def _convert_id_to_token(self, index):
"""Converts an index (integer) in a token (str) using the vocab."""
return self.ids_to_tokens.get(index, self.unk_token)

def convert_tokens_to_string(self, tokens):
"""Converts a sequence of tokens (string) in a single string."""
out_string = ' '.join(tokens).replace(' ##', '').strip()
return out_string

def build_inputs_with_special_tokens(
self,
token_ids_0: List[int],
token_ids_1: Optional[List[int]] = None) -> List[int]:
"""
Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
adding special tokens. A SBERT sequence has the following format:

- single sequence: ``[CLS] X [SEP]``
- pair of sequences: ``[CLS] A [SEP] B [SEP]``

Args:
token_ids_0 (:obj:`List[int]`):
List of IDs to which the special tokens will be added.
token_ids_1 (:obj:`List[int]`, `optional`):
Optional second list of IDs for sequence pairs.

Returns:
:obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
"""
if token_ids_1 is None:
return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
cls = [self.cls_token_id]
sep = [self.sep_token_id]
return cls + token_ids_0 + sep + token_ids_1 + sep

def get_special_tokens_mask(
self,
token_ids_0: List[int],
token_ids_1: Optional[List[int]] = None,
already_has_special_tokens: bool = False) -> List[int]:
"""
Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
special tokens using the tokenizer ``prepare_for_model`` method.

Args:
token_ids_0 (:obj:`List[int]`):
List of IDs.
token_ids_1 (:obj:`List[int]`, `optional`):
Optional second list of IDs for sequence pairs.
already_has_special_tokens (:obj:`bool`, `optional`, defaults to :obj:`False`):
Whether or not the token list is already formatted with special tokens for the model.

Returns:
:obj:`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
"""

if already_has_special_tokens:
return super().get_special_tokens_mask(
token_ids_0=token_ids_0,
token_ids_1=token_ids_1,
already_has_special_tokens=True)

if token_ids_1 is not None:
return [1] + ([0] * len(token_ids_0)) + [1] + (
[0] * len(token_ids_1)) + [1]
return [1] + ([0] * len(token_ids_0)) + [1]

def create_token_type_ids_from_sequences(
self,
token_ids_0: List[int],
token_ids_1: Optional[List[int]] = None) -> List[int]:
"""
Create a mask from the two sequences passed to be used in a sequence-pair classification task. A SBERT sequence
pair mask has the following format:

::

0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence | second sequence |

If :obj:`token_ids_1` is :obj:`None`, this method only returns the first portion of the mask (0s).

Args:
token_ids_0 (:obj:`List[int]`):
List of IDs.
token_ids_1 (:obj:`List[int]`, `optional`):
Optional second list of IDs for sequence pairs.

Returns:
:obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given
sequence(s).
"""
sep = [self.sep_token_id]
cls = [self.cls_token_id]
if token_ids_1 is None:
return len(cls + token_ids_0 + sep) * [0]
return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1
+ sep) * [1]

def save_vocabulary(self,
save_directory: str,
filename_prefix: Optional[str] = None) -> Tuple[str]:
index = 0
if os.path.isdir(save_directory):
vocab_file = os.path.join(
save_directory,
(filename_prefix + '-' if filename_prefix else '')
+ VOCAB_FILES_NAMES['vocab_file'])
else:
vocab_file = (filename_prefix
+ '-' if filename_prefix else '') + save_directory
with open(vocab_file, 'w', encoding='utf-8') as writer:
for token, token_index in sorted(
self.vocab.items(), key=lambda kv: kv[1]):
if index != token_index:
logger.warning(
f'Saving vocabulary to {vocab_file}: vocabulary indices are not consecutive.'
' Please check that the vocabulary is not corrupted!')
index = token_index
writer.write(token + '\n')
index += 1
return (vocab_file, )


class BasicTokenizer(object):
"""
Constructs a BasicTokenizer that will run basic tokenization (punctuation splitting, lower casing, etc.).

Args:
do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to lowercase the input when tokenizing.
never_split (:obj:`Iterable`, `optional`):
Collection of tokens which will never be split during tokenization. Only has an effect when
:obj:`do_basic_tokenize=True`
tokenize_chinese_chars (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to tokenize Chinese characters.

This should likely be deactivated for Japanese (see this `issue
<https://github.com/huggingface/transformers/issues/328>`__).
strip_accents: (:obj:`bool`, `optional`):
Whether or not to strip all accents. If this option is not specified, then it will be determined by the
value for :obj:`lowercase` (as in the original BERT).
"""

def __init__(self,
do_lower_case=True,
never_split=None,
tokenize_chinese_chars=True,
strip_accents=None):
if never_split is None:
never_split = []
self.do_lower_case = do_lower_case
self.never_split = set(never_split)
self.tokenize_chinese_chars = tokenize_chinese_chars
self.strip_accents = strip_accents

def tokenize(self, text, never_split=None):
"""
Basic Tokenization of a piece of text. Split on "white spaces" only, for sub-word tokenization, see
WordPieceTokenizer.

Args:
**never_split**: (`optional`) list of str
Kept for backward compatibility purposes. Now implemented directly at the base class level (see
:func:`PreTrainedTokenizer.tokenize`). List of tokens not to split.
"""
# union() returns a new set by concatenating the two sets.
never_split = self.never_split.union(
set(never_split)) if never_split else self.never_split
text = self._clean_text(text)

# This was added on November 1st, 2018 for the multilingual and Chinese
# models. This is also applied to the English models now, but it doesn't
# matter since the English models were not trained on any Chinese data
# and generally don't have any Chinese data in them (there are Chinese
# characters in the vocabulary because Wikipedia does have some Chinese
# words in the English Wikipedia.).
if self.tokenize_chinese_chars:
text = self._tokenize_chinese_chars(text)
orig_tokens = whitespace_tokenize(text)
split_tokens = []
for token in orig_tokens:
if token not in never_split:
if self.do_lower_case:
token = token.lower()
if self.strip_accents is not False:
token = self._run_strip_accents(token)
elif self.strip_accents:
token = self._run_strip_accents(token)
split_tokens.extend(self._run_split_on_punc(token, never_split))

output_tokens = whitespace_tokenize(' '.join(split_tokens))
return output_tokens

def _run_strip_accents(self, text):
"""Strips accents from a piece of text."""
text = unicodedata.normalize('NFD', text)
output = []
for char in text:
cat = unicodedata.category(char)
if cat == 'Mn':
continue
output.append(char)
return ''.join(output)

def _run_split_on_punc(self, text, never_split=None):
"""Splits punctuation on a piece of text."""
if never_split is not None and text in never_split:
return [text]
chars = list(text)
i = 0
start_new_word = True
output = []
while i < len(chars):
char = chars[i]
if _is_punctuation(char):
output.append([char])
start_new_word = True
else:
if start_new_word:
output.append([])
start_new_word = False
output[-1].append(char)
i += 1

return [''.join(x) for x in output]

def _tokenize_chinese_chars(self, text):
"""Adds whitespace around any CJK character."""
output = []
for char in text:
cp = ord(char)
if self._is_chinese_char(cp):
output.append(' ')
output.append(char)
output.append(' ')
else:
output.append(char)
return ''.join(output)

def _is_chinese_char(self, cp):
"""Checks whether CP is the codepoint of a CJK character."""
# This defines a "chinese character" as anything in the CJK Unicode block:
# https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block)
#
# Note that the CJK Unicode block is NOT all Japanese and Korean characters,
# despite its name. The modern Korean Hangul alphabet is a different block,
# as is Japanese Hiragana and Katakana. Those alphabets are used to write
# space-separated words, so they are not treated specially and are handled
# like all of the other languages.
if ((0x4E00 <= cp <= 0x9FFF) or (0x3400 <= cp <= 0x4DBF)
or (0x20000 <= cp <= 0x2A6DF) or (0x2A700 <= cp <= 0x2B73F)
or (0x2B740 <= cp <= 0x2B81F) or (0x2B820 <= cp <= 0x2CEAF)
or (0xF900 <= cp <= 0xFAFF) or (0x2F800 <= cp <= 0x2FA1F)):
return True

return False

def _clean_text(self, text):
"""Performs invalid character removal and whitespace cleanup on text."""
output = []
for char in text:
cp = ord(char)
if cp == 0 or cp == 0xFFFD or _is_control(char):
continue
if _is_whitespace(char):
output.append(' ')
else:
output.append(char)
return ''.join(output)


class WordpieceTokenizer(object):
"""Runs WordPiece tokenization."""

def __init__(self, vocab, unk_token, max_input_chars_per_word=100):
self.vocab = vocab
self.unk_token = unk_token
self.max_input_chars_per_word = max_input_chars_per_word

def tokenize(self, text):
"""
Tokenizes a piece of text into its word pieces. This uses a greedy longest-match-first algorithm to perform
tokenization using the given vocabulary.

For example, :obj:`input = "unaffable"` will return as output :obj:`["un", "##aff", "##able"]`.

Args:
text: A single token or whitespace separated tokens. This should have
already been passed through `BasicTokenizer`.

Returns:
A list of wordpiece tokens.
"""

output_tokens = []
for token in whitespace_tokenize(text):
chars = list(token)
if len(chars) > self.max_input_chars_per_word:
output_tokens.append(self.unk_token)
continue

is_bad = False
start = 0
sub_tokens = []
while start < len(chars):
end = len(chars)
cur_substr = None
while start < end:
substr = ''.join(chars[start:end])
if start > 0:
substr = '##' + substr
if substr in self.vocab:
cur_substr = substr
break
end -= 1
if cur_substr is None:
is_bad = True
break
sub_tokens.append(cur_substr)
start = end

if is_bad:
output_tokens.append(self.unk_token)
else:
output_tokens.extend(sub_tokens)
return output_tokens
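For reference, the greedy longest-match-first split implemented by WordpieceTokenizer above, condensed into a standalone sketch (the max_input_chars_per_word guard is omitted and the toy vocabulary below is illustrative, not taken from any real model):

def wordpiece(token, vocab, unk='[UNK]'):
    """Greedy longest-match-first split, mirroring WordpieceTokenizer.tokenize above."""
    pieces, start = [], 0
    while start < len(token):
        end = len(token)
        cur = None
        while start < end:
            sub = token[start:end]
            if start > 0:
                sub = '##' + sub
            if sub in vocab:
                cur = sub
                break
            end -= 1
        if cur is None:
            return [unk]  # no vocab match at this position: the whole token becomes UNK
        pieces.append(cur)
        start = end
    return pieces

toy_vocab = {'un', '##aff', '##able'}  # illustrative only
print(wordpiece('unaffable', toy_vocab))  # ['un', '##aff', '##able']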

modelscope/models/nlp/structbert/tokenization_fast.py (+0, -203)

@@ -1,203 +0,0 @@
# Copyright 2021-2022 The Alibaba DAMO NLP Team Authors.
# Copyright 2018 The Google AI Language Team Authors and The HuggingFace Inc. team.
# All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Fast Tokenization classes for Sbert. mainly copied from :module:`~transformers.tokenization_bert_fast`"""

from typing import List, Optional, Tuple

import json
import transformers
from tokenizers import normalizers
from transformers.tokenization_utils_fast import PreTrainedTokenizerFast

from modelscope.utils.constant import ModelFile
from modelscope.utils.logger import get_logger
from .tokenization import SbertTokenizer

logger = get_logger(__name__)

VOCAB_FILES_NAMES = {
'vocab_file': ModelFile.VOCAB_FILE,
'tokenizer_file': 'tokenizer.json'
}

PRETRAINED_VOCAB_FILES_MAP = {
'vocab_file': {},
'tokenizer_file': {},
}

PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
'nlp_structbert_backbone_large_std': 512,
'nlp_structbert_backbone_base_std': 512,
'nlp_structbert_backbone_lite_std': 512,
'nlp_structbert_backbone_tiny_std': 512,
}

PRETRAINED_INIT_CONFIGURATION = {
'english_sbert-large-std-512': {
'do_lower_case': True
},
}

transformers.SLOW_TO_FAST_CONVERTERS[
'SbertTokenizer'] = transformers.SLOW_TO_FAST_CONVERTERS['BertTokenizer']


class SbertTokenizerFast(PreTrainedTokenizerFast):
r"""
Construct a "fast" SBERT tokenizer (backed by HuggingFace's `tokenizers` library). Based on WordPiece.

This tokenizer inherits from :class:`~transformers.PreTrainedTokenizerFast` which contains most of the main
methods. Users should refer to this superclass for more information regarding those methods.

Args:
vocab_file (:obj:`str`):
File containing the vocabulary.
do_lower_case (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to lowercase the input when tokenizing.
unk_token (:obj:`str`, `optional`, defaults to :obj:`"[UNK]"`):
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead.
sep_token (:obj:`str`, `optional`, defaults to :obj:`"[SEP]"`):
The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
sequence classification or for a text and a question for question answering. It is also used as the last
token of a sequence built with special tokens.
pad_token (:obj:`str`, `optional`, defaults to :obj:`"[PAD]"`):
The token used for padding, for example when batching sequences of different lengths.
cls_token (:obj:`str`, `optional`, defaults to :obj:`"[CLS]"`):
The classifier token which is used when doing sequence classification (classification of the whole sequence
instead of per-token classification). It is the first token of the sequence when built with special tokens.
mask_token (:obj:`str`, `optional`, defaults to :obj:`"[MASK]"`):
The token used for masking values. This is the token used when training this model with masked language
modeling. This is the token which the model will try to predict.
clean_text (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to clean the text before tokenization by removing any control characters and replacing all
whitespaces by the classic one.
tokenize_chinese_chars (:obj:`bool`, `optional`, defaults to :obj:`True`):
Whether or not to tokenize Chinese characters. This should likely be deactivated for Japanese (see `this
issue <https://github.com/huggingface/transformers/issues/328>`__).
strip_accents: (:obj:`bool`, `optional`):
Whether or not to strip all accents. If this option is not specified, then it will be determined by the
value for :obj:`lowercase` (as in the original BERT).
wordpieces_prefix: (:obj:`str`, `optional`, defaults to :obj:`"##"`):
The prefix for subwords.
"""

vocab_files_names = VOCAB_FILES_NAMES
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
pretrained_init_configuration = PRETRAINED_INIT_CONFIGURATION
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
slow_tokenizer_class = SbertTokenizer

def __init__(self,
vocab_file=None,
tokenizer_file=None,
do_lower_case=True,
unk_token='[UNK]',
sep_token='[SEP]',
pad_token='[PAD]',
cls_token='[CLS]',
mask_token='[MASK]',
tokenize_chinese_chars=True,
strip_accents=None,
**kwargs):
super().__init__(
vocab_file,
tokenizer_file=tokenizer_file,
do_lower_case=do_lower_case,
unk_token=unk_token,
sep_token=sep_token,
pad_token=pad_token,
cls_token=cls_token,
mask_token=mask_token,
tokenize_chinese_chars=tokenize_chinese_chars,
strip_accents=strip_accents,
**kwargs,
)

pre_tok_state = json.loads(
self.backend_tokenizer.normalizer.__getstate__())
if (pre_tok_state.get('lowercase', do_lower_case) != do_lower_case
or pre_tok_state.get('strip_accents',
strip_accents) != strip_accents):
pre_tok_class = getattr(normalizers, pre_tok_state.pop('type'))
pre_tok_state['lowercase'] = do_lower_case
pre_tok_state['strip_accents'] = strip_accents
self.backend_tokenizer.normalizer = pre_tok_class(**pre_tok_state)

self.do_lower_case = do_lower_case

def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
"""
Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
adding special tokens. A SBERT sequence has the following format:

- single sequence: ``[CLS] X [SEP]``
- pair of sequences: ``[CLS] A [SEP] B [SEP]``

Args:
token_ids_0 (:obj:`List[int]`):
List of IDs to which the special tokens will be added.
token_ids_1 (:obj:`List[int]`, `optional`):
Optional second list of IDs for sequence pairs.

Returns:
:obj:`List[int]`: List of `input IDs <../glossary.html#input-ids>`__ with the appropriate special tokens.
"""
output = [self.cls_token_id] + token_ids_0 + [self.sep_token_id]

if token_ids_1:
output += token_ids_1 + [self.sep_token_id]

return output

def create_token_type_ids_from_sequences(
self,
token_ids_0: List[int],
token_ids_1: Optional[List[int]] = None) -> List[int]:
"""
Create a mask from the two sequences passed to be used in a sequence-pair classification task. A SBERT sequence
pair mask has the following format:

::

0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence | second sequence |

If :obj:`token_ids_1` is :obj:`None`, this method only returns the first portion of the mask (0s).

Args:
token_ids_0 (:obj:`List[int]`):
List of IDs.
token_ids_1 (:obj:`List[int]`, `optional`):
Optional second list of IDs for sequence pairs.

Returns:
:obj:`List[int]`: List of `token type IDs <../glossary.html#token-type-ids>`_ according to the given
sequence(s).
"""
sep = [self.sep_token_id]
cls = [self.cls_token_id]
if token_ids_1 is None:
return len(cls + token_ids_0 + sep) * [0]
return len(cls + token_ids_0 + sep) * [0] + len(token_ids_1
+ sep) * [1]

def save_vocabulary(self,
save_directory: str,
filename_prefix: Optional[str] = None) -> Tuple[str]:
files = self._tokenizer.model.save(
save_directory, name=filename_prefix)
return tuple(files)
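With both the slow and the fast SBERT tokenizers removed in this change (and the converter registration above simply aliasing BertTokenizer's), transformers' own BERT tokenizers act as the drop-in replacement. A minimal sketch, assuming the checkpoint directory ships a standard vocab.txt or tokenizer.json; the path below is hypothetical:

from transformers import BertTokenizerFast

# hypothetical local directory; any StructBERT checkpoint laid out like a BERT one should work
tokenizer = BertTokenizerFast.from_pretrained('/path/to/structbert_model_dir')
encoded = tokenizer('today is a good day', 'it is sunny outside')
print(encoded['input_ids'])       # [CLS] sentence1 [SEP] sentence2 [SEP]
print(encoded['token_type_ids'])  # 0s for the first segment, 1s for the second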

modelscope/models/nlp/task_models/feature_extraction.py (+3, -5)

@@ -5,12 +5,10 @@ import numpy as np

from modelscope.metainfo import TaskModels
from modelscope.models.builder import MODELS
from modelscope.models.nlp.bert import BertConfig
from modelscope.models.nlp.task_models.task_model import \
SingleBackboneTaskModelBase
from modelscope.outputs import OutputKeys
from modelscope.outputs import FeatureExtractionOutput, OutputKeys
from modelscope.utils.constant import Tasks
from modelscope.utils.hub import parse_label_mapping

__all__ = ['FeatureExtractionModel']

@@ -31,9 +29,9 @@ class FeatureExtractionModel(SingleBackboneTaskModelBase):

self.build_backbone(self.backbone_cfg)

def forward(self, **input: Dict[str, Any]) -> Dict[str, np.ndarray]:
def forward(self, **input: Dict[str, Any]) -> FeatureExtractionOutput:
# backbone do not need labels, only head need for loss compute
input.pop(OutputKeys.LABELS, None)
outputs = super().forward(input)
sequence_output = outputs.last_hidden_state
return {OutputKeys.TEXT_EMBEDDING: sequence_output}
return FeatureExtractionOutput(text_embedding=sequence_output)
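The dict keyed by OutputKeys.TEXT_EMBEDDING is replaced with a typed output object. A rough stand-in showing the intended usage pattern, assuming ModelOutputBase behaves like an ordinary dataclass with attribute access (only the text_embedding field name is taken from the diff, the rest is illustrative):

from dataclasses import dataclass
from typing import Optional, Union

Tensor = Union['torch.Tensor', 'tf.Tensor']


@dataclass
class FeatureExtractionOutputSketch:
    # illustrative stand-in for modelscope.outputs.FeatureExtractionOutput
    text_embedding: Optional[Tensor] = None


out = FeatureExtractionOutputSketch(text_embedding=[[0.1, 0.2, 0.3]])
print(out.text_embedding)  # attribute access instead of result[OutputKeys.TEXT_EMBEDDING]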

modelscope/models/nlp/task_models/information_extraction.py (+3, -3)

@@ -7,7 +7,7 @@ from modelscope.metainfo import TaskModels
from modelscope.models.builder import MODELS
from modelscope.models.nlp.task_models.task_model import \
SingleBackboneTaskModelBase
from modelscope.outputs import OutputKeys
from modelscope.outputs import InformationExtractionOutput, OutputKeys
from modelscope.utils.constant import Tasks

__all__ = ['InformationExtractionModel']
@@ -31,9 +31,9 @@ class InformationExtractionModel(SingleBackboneTaskModelBase):
self.build_backbone(self.backbone_cfg)
self.build_head(self.head_cfg)

def forward(self, **input: Dict[str, Any]) -> Dict[str, np.ndarray]:
def forward(self, **input: Dict[str, Any]) -> InformationExtractionOutput:
outputs = super().forward(input)
sequence_output = outputs.last_hidden_state
outputs = self.head.forward(sequence_output, input['text'],
input['offsets'])
return {OutputKeys.SPO_LIST: outputs}
return InformationExtractionOutput(spo_list=outputs)

modelscope/models/nlp/task_models/nncrf_for_named_entity_recognition.py (+8, -9)

@@ -12,7 +12,7 @@ from transformers import AutoConfig, AutoModel
from modelscope.metainfo import Models
from modelscope.models import TorchModel
from modelscope.models.builder import MODELS
from modelscope.outputs import TokenClassifierWithPredictionsOutput
from modelscope.outputs import AttentionTokenClassificationModelOutput
from modelscope.utils.constant import ModelFile, Tasks

__all__ = [
@@ -115,7 +115,7 @@ class SequenceLabelingForNamedEntityRecognition(TorchModel):
- 0 for tokens that are **masked**.

Returns:
Returns `modelscope.outputs.TokenClassifierOutput`
Returns `modelscope.outputs.AttentionTokenClassificationModelOutput`

Examples:
>>> from modelscope.models import Model
@@ -138,17 +138,16 @@ class SequenceLabelingForNamedEntityRecognition(TorchModel):

def postprocess(self, input: Dict[str, Any], **kwargs):
predicts = self.model.decode(input)
offset_len = len(input['offset_mapping'])
predictions = torch.narrow(
predicts, 1, 0,
offset_len) # index_select only move loc, not resize
return TokenClassifierWithPredictionsOutput(
offset_mapping = input.get('offset_mapping')
mask = input.get('label_mask')
return AttentionTokenClassificationModelOutput(
loss=None,
logits=None,
hidden_states=None,
attentions=None,
offset_mapping=input['offset_mapping'],
predictions=predictions,
label_mask=mask,
offset_mapping=offset_mapping,
predictions=predicts,
)
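postprocess now returns the raw decode result together with label_mask and offset_mapping, so the narrowing it used to perform has to happen downstream (presumably in the refactored token-classification logic). A toy illustration of the removed narrowing step; the shapes are illustrative:

import torch

# toy shapes: predicts is (batch, padded_seq_len) tag ids from the CRF decode,
# offset_mapping has one entry per token of the original (un-padded) sequence
predicts = torch.randint(0, 5, (1, 10))
offset_mapping = [(0, 1)] * 7  # illustrative

offset_len = len(offset_mapping)
trimmed = torch.narrow(predicts, 1, 0, offset_len)  # the narrowing removed from postprocess
print(trimmed.shape)  # torch.Size([1, 7])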




modelscope/models/nlp/task_models/token_classification.py (+18, -10)

@@ -1,18 +1,16 @@
# Copyright (c) Alibaba, Inc. and its affiliates.
from typing import Any, Dict

import numpy as np
import torch

from modelscope.metainfo import Models, TaskModels
from modelscope.models.builder import MODELS
from modelscope.models.nlp.task_models.task_model import \
SingleBackboneTaskModelBase
from modelscope.outputs import OutputKeys, TokenClassifierOutput
from modelscope.outputs import (AttentionTokenClassificationModelOutput,
OutputKeys)
from modelscope.utils.constant import Tasks
from modelscope.utils.hub import parse_label_mapping
from modelscope.utils.tensor_utils import (torch_nested_detach,
torch_nested_numpify)

__all__ = ['TokenClassificationModel']

@@ -48,7 +46,10 @@ class TokenClassificationModel(SingleBackboneTaskModelBase):
self.build_backbone(self.backbone_cfg)
self.build_head(self.head_cfg)

def forward(self, **input: Dict[str, Any]) -> Dict[str, np.ndarray]:
def forward(
self,
**input: Dict[str,
Any]) -> AttentionTokenClassificationModelOutput:
labels = None
if OutputKeys.LABEL in input:
labels = input.pop(OutputKeys.LABEL)
@@ -62,16 +63,23 @@ class TokenClassificationModel(SingleBackboneTaskModelBase):
if labels in input:
loss = self.compute_loss(outputs, labels)

# apply label mask to logits
logits = logits[input['label_mask']].unsqueeze(0)
if 'label_mask' in input:
mask = input['label_mask']
masked_lengths = mask.sum(-1).long()
masked_logits = torch.zeros_like(logits)
for i in range(len(mask)):
masked_logits[
i, :masked_lengths[i], :] = logits[i].masked_select(
mask[i].unsqueeze(-1)).view(masked_lengths[i], -1)
logits = masked_logits

return TokenClassifierOutput(
return AttentionTokenClassificationModelOutput(
loss=loss,
logits=logits,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
offset_mapping=input['offset_mapping'],
)
offset_mapping=input.get('offset_mapping'),
label_mask=input.get('label_mask'))

def extract_logits(self, outputs):
return outputs[OutputKeys.LOGITS].cpu().detach()
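The new label_mask branch packs the logits of masked-in positions to the front of each sequence, replacing the old logits[input['label_mask']].unsqueeze(0), which collapsed the batch dimension. A small self-contained check of that gather on toy tensors:

import torch

logits = torch.arange(24, dtype=torch.float).view(2, 4, 3)           # (batch=2, seq=4, num_labels=3)
mask = torch.tensor([[1, 0, 1, 1], [1, 1, 0, 0]], dtype=torch.bool)  # toy label_mask

masked_lengths = mask.sum(-1).long()
masked_logits = torch.zeros_like(logits)
for i in range(len(mask)):
    n = int(masked_lengths[i])
    masked_logits[i, :n, :] = logits[i].masked_select(mask[i].unsqueeze(-1)).view(n, -1)

print(masked_lengths.tolist())  # [3, 2]
print(masked_logits[0])         # rows 0, 2 and 3 of the first sequence packed to the front, then zeros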

modelscope/models/nlp/veco/__init__.py (+0, -4)

@@ -23,8 +23,6 @@ if TYPE_CHECKING:
from .text_classification import VecoForSequenceClassification
from .token_classification import VecoForTokenClassification
from .fill_mask import VecoForMaskedLM
from .tokenization import VecoTokenizer
from .tokenization_fast import VecoTokenizerFast
else:
_import_structure = {
'configuration': ['VecoConfig'],
@@ -32,8 +30,6 @@ else:
'text_classification': ['VecoForSequenceClassification'],
'fill_mask': ['VecoForMaskedLM'],
'token_classification': ['VecoForTokenClassification'],
'tokenization': ['VecoTokenizer'],
'tokenization_fast': ['VecoTokenizerFast'],
}

import sys


modelscope/models/nlp/veco/fill_mask.py (+1, -1)

@@ -40,7 +40,7 @@ class VecoForMaskedLM(TorchModel, RobertaForMaskedLM):

Preprocessor:
This is the fill_mask model of Veco, the preprocessor of this model
is `modelscope.preprocessors.NLPPreprocessor`.
is `modelscope.preprocessors.FillMaskTransformersPreprocessor`.

Parameters:
config ([`VecoConfig`]): Model configuration class with all the parameters of the


modelscope/models/nlp/veco/text_classification.py (+7, -21)

@@ -22,7 +22,7 @@ from modelscope.models import Model, TorchModel
from modelscope.models.builder import MODELS
from modelscope.outputs import AttentionTextClassificationModelOutput
from modelscope.utils.constant import Tasks
from modelscope.utils.hub import parse_label_mapping
from modelscope.utils.nlp.utils import parse_labels_in_order
from .configuration import VecoConfig


@@ -46,7 +46,7 @@ class VecoForSequenceClassification(TorchModel,

Preprocessor:
This is the text classification model of Veco, the preprocessor of this model
is `modelscope.preprocessors.SequenceClassificationPreprocessor`.
is `modelscope.preprocessors.TextClassificationTransformersPreprocessor`.

Trainer:
This model should be trained by dataset which has mixed languages,
@@ -124,27 +124,13 @@ class VecoForSequenceClassification(TorchModel,
"""

model_dir = kwargs.pop('model_dir', None)
cfg = kwargs.pop('cfg', None)
model_args = parse_labels_in_order(model_dir, cfg, **kwargs)

if model_dir is None:
config = VecoConfig(**kwargs)
config = VecoConfig(**model_args)
model = cls(config)
else:
model_kwargs = {}
label2id = kwargs.get('label2id', parse_label_mapping(model_dir))
id2label = kwargs.get(
'id2label', None if label2id is None else
{id: label
for label, id in label2id.items()})
if id2label is not None and label2id is None:
label2id = {label: id for id, label in id2label.items()}

num_labels = kwargs.get(
'num_labels', None if label2id is None else len(label2id))
if num_labels is not None:
model_kwargs['num_labels'] = num_labels
if label2id is not None:
model_kwargs['label2id'] = label2id
if id2label is not None:
model_kwargs['id2label'] = id2label
model = super(Model, cls).from_pretrained(
pretrained_model_name_or_path=model_dir, **model_kwargs)
pretrained_model_name_or_path=model_dir, **model_args)
return model
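The inline label-mapping normalization removed here is centralized, presumably in parse_labels_in_order. For reference, the gist of the removed block as a standalone helper (a sketch of the old behaviour, not of the new utility's exact signature):

def normalize_label_args(label2id=None, id2label=None):
    """Mirror of the removed inline logic: fill in the missing mapping plus num_labels."""
    if id2label is None and label2id is not None:
        id2label = {idx: label for label, idx in label2id.items()}
    if label2id is None and id2label is not None:
        label2id = {label: idx for idx, label in id2label.items()}
    kwargs = {}
    if label2id is not None:
        kwargs['num_labels'] = len(label2id)
        kwargs['label2id'] = label2id
        kwargs['id2label'] = id2label
    return kwargs

print(normalize_label_args(label2id={'negative': 0, 'positive': 1}))
# {'num_labels': 2, 'label2id': {...}, 'id2label': {0: 'negative', 1: 'positive'}}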

modelscope/models/nlp/veco/token_classification.py (+8, -20)

@@ -15,6 +15,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.

import torch
from transformers import RobertaForTokenClassification

from modelscope.metainfo import Models
@@ -22,7 +23,7 @@ from modelscope.models import Model, TorchModel
from modelscope.models.builder import MODELS
from modelscope.outputs import AttentionTokenClassificationModelOutput
from modelscope.utils.constant import Tasks
from modelscope.utils.hub import parse_label_mapping
from modelscope.utils.nlp.utils import parse_labels_in_order
from .configuration import VecoConfig


@@ -58,6 +59,7 @@ class VecoForTokenClassification(TorchModel, RobertaForTokenClassification):
def forward(self, *args, **kwargs):
kwargs['return_dict'] = True
outputs = super(Model, self).forward(*args, **kwargs)

return AttentionTokenClassificationModelOutput(
loss=outputs.loss,
logits=outputs.logits,
@@ -81,27 +83,13 @@ class VecoForTokenClassification(TorchModel, RobertaForTokenClassification):
"""

model_dir = kwargs.pop('model_dir', None)
cfg = kwargs.pop('cfg', None)
model_args = parse_labels_in_order(model_dir, cfg, **kwargs)

if model_dir is None:
config = VecoConfig(**kwargs)
config = VecoConfig(**model_args)
model = cls(config)
else:
model_kwargs = {}
label2id = kwargs.get('label2id', parse_label_mapping(model_dir))
id2label = kwargs.get(
'id2label', None if label2id is None else
{id: label
for label, id in label2id.items()})
if id2label is not None and label2id is None:
label2id = {label: id for id, label in id2label.items()}

num_labels = kwargs.get(
'num_labels', None if label2id is None else len(label2id))
if num_labels is not None:
model_kwargs['num_labels'] = num_labels
if label2id is not None:
model_kwargs['label2id'] = label2id
if id2label is not None:
model_kwargs['id2label'] = id2label
model = super(Model, cls).from_pretrained(
pretrained_model_name_or_path=model_dir, **model_kwargs)
pretrained_model_name_or_path=model_dir, **model_args)
return model

modelscope/models/nlp/veco/tokenization.py (+0, -321)

@@ -1,321 +0,0 @@
# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.
# Copyright 2021-2022 The Alibaba DAMO NLP Team Authors.
# All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License
"""Tokenization classes for Veco. mainly copied from :module:`~transformers.tokenization_xlm_roberta`"""

import os
from shutil import copyfile
from typing import Any, Dict, List, Optional, Tuple

import sentencepiece as spm
from transformers.tokenization_utils import AddedToken, PreTrainedTokenizer

from modelscope.utils import logger as logging

logger = logging.get_logger(__name__)

SPIECE_UNDERLINE = '▁'

VOCAB_FILES_NAMES = {'vocab_file': 'sentencepiece.bpe.model'}

PRETRAINED_VOCAB_FILES_MAP = {'vocab_file': {}}

PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {}


class VecoTokenizer(PreTrainedTokenizer):
"""
Adapted from [`RobertaTokenizer`] and [`XLNetTokenizer`]. Based on
[SentencePiece](https://github.com/google/sentencepiece).

This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods.
Users should refer to this superclass for more information regarding those methods.

Args:
vocab_file (`str`):
Path to the vocabulary file.
bos_token (`str`, *optional*, defaults to `"<s>"`):
The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.

<Tip>

When building a sequence using special tokens, this is not the token that is used for the beginning of
sequence. The token used is the `cls_token`.

</Tip>

eos_token (`str`, *optional*, defaults to `"</s>"`):
The end of sequence token.

<Tip>

When building a sequence using special tokens, this is not the token that is used for the end of
sequence. The token used is the `sep_token`.

</Tip>

sep_token (`str`, *optional*, defaults to `"</s>"`):
The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
sequence classification or for a text and a question for question answering. It is also used as the last
token of a sequence built with special tokens.
cls_token (`str`, *optional*, defaults to `"<s>"`):
The classifier token which is used when doing sequence classification (classification of the whole sequence
instead of per-token classification). It is the first token of the sequence when built with special tokens.
unk_token (`str`, *optional*, defaults to `"<unk>"`):
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead.
pad_token (`str`, *optional*, defaults to `"<pad>"`):
The token used for padding, for example when batching sequences of different lengths.
mask_token (`str`, *optional*, defaults to `"<mask>"`):
The token used for masking values. This is the token used when training this model with masked language
modeling. This is the token which the model will try to predict.
additional_special_tokens (`List[str]`, *optional*, defaults to `["<s>NOTUSED", "</s>NOTUSED"]`):
Additional special tokens used by the tokenizer.
sp_model_kwargs (`dict`, *optional*):
Will be passed to the `SentencePieceProcessor.__init__()` method.
The [Python wrapper for SentencePiece](https://github.com/google/sentencepiece/tree/master/python)
can be used, among other things, to set:

- `enable_sampling`: Enable subword regularization.
- `nbest_size`: Sampling parameters for unigram. Invalid for BPE-Dropout.

- `nbest_size = {0,1}`: No sampling is performed.
- `nbest_size > 1`: samples from the nbest_size results.
- `nbest_size < 0`: assuming that nbest_size is infinite and samples from the all hypothesis (lattice)
using forward-filtering-and-backward-sampling algorithm.

- `alpha`: Smoothing parameter for unigram sampling, and dropout probability of merge operations for
BPE-dropout.

Attributes:
sp_model (`SentencePieceProcessor`):
The *SentencePiece* processor that is used for every conversion (string, tokens and IDs).
"""

vocab_files_names = VOCAB_FILES_NAMES
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
model_input_names = ['input_ids', 'attention_mask']

def __init__(self,
vocab_file,
bos_token='<s>',
eos_token='</s>',
sep_token='</s>',
cls_token='<s>',
unk_token='<unk>',
pad_token='<pad>',
mask_token='<mask>',
sp_model_kwargs: Optional[Dict[str, Any]] = None,
**kwargs) -> None:
# Mask token behave like a normal word, i.e. include the space before it
mask_token = AddedToken(
mask_token, lstrip=True, rstrip=False) if isinstance(
mask_token, str) else mask_token

self.sp_model_kwargs = {} if sp_model_kwargs is None else sp_model_kwargs

super().__init__(
bos_token=bos_token,
eos_token=eos_token,
unk_token=unk_token,
sep_token=sep_token,
cls_token=cls_token,
pad_token=pad_token,
mask_token=mask_token,
sp_model_kwargs=self.sp_model_kwargs,
**kwargs,
)

self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
self.sp_model.Load(str(vocab_file))
self.vocab_file = vocab_file

# Original fairseq vocab and spm vocab must be "aligned":
# Vocab | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
# -------- | ------- | ------- | ------ | ------- | --- | --- | --- | ----- | ----- | ----
# fairseq | '<s>' | '<pad>' | '</s>' | '<unk>' | ',' | '.' | '▁' | 's' | '▁de' | '-'
# spm | '<unk>' | '<s>' | '</s>' | ',' | '.' | '▁' | 's' | '▁de' | '-' | '▁a'

# Mimic fairseq token-to-id alignment for the first 4 tokens
self.fairseq_tokens_to_ids = {
'<s>': 0,
'<pad>': 1,
'</s>': 2,
'<unk>': 3
}

# The first "real" token "," has position 4 in the original fairseq vocab and position 3 in the spm vocab
self.fairseq_offset = 1

self.fairseq_tokens_to_ids['<mask>'] = len(
self.sp_model) + self.fairseq_offset
self.fairseq_ids_to_tokens = {
v: k
for k, v in self.fairseq_tokens_to_ids.items()
}

def __getstate__(self):
state = self.__dict__.copy()
state['sp_model'] = None
state['sp_model_proto'] = self.sp_model.serialized_model_proto()
return state

def __setstate__(self, d):
self.__dict__ = d

# for backward compatibility
if not hasattr(self, 'sp_model_kwargs'):
self.sp_model_kwargs = {}

self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwargs)
self.sp_model.LoadFromSerializedProto(self.sp_model_proto)

def build_inputs_with_special_tokens(
self,
token_ids_0: List[int],
token_ids_1: Optional[List[int]] = None) -> List[int]:
"""
Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
adding special tokens. A Veco sequence has the following format:

- single sequence: `<s> X </s>`
- pair of sequences: `<s> A </s></s> B </s>`

Args:
token_ids_0 (`List[int]`):
List of IDs to which the special tokens will be added.
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.

Returns:
`List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
"""

if token_ids_1 is None:
return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
cls = [self.cls_token_id]
sep = [self.sep_token_id]
return cls + token_ids_0 + sep + sep + token_ids_1 + sep

def get_special_tokens_mask(
self,
token_ids_0: List[int],
token_ids_1: Optional[List[int]] = None,
already_has_special_tokens: bool = False) -> List[int]:
"""
Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
special tokens using the tokenizer `prepare_for_model` method.

Args:
token_ids_0 (`List[int]`):
List of IDs.
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.
already_has_special_tokens (`bool`, *optional*, defaults to `False`):
Whether or not the token list is already formatted with special tokens for the model.

Returns:
`List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
"""

if already_has_special_tokens:
return super().get_special_tokens_mask(
token_ids_0=token_ids_0,
token_ids_1=token_ids_1,
already_has_special_tokens=True)

if token_ids_1 is None:
return [1] + ([0] * len(token_ids_0)) + [1]
return [1] + ([0] * len(token_ids_0)) + [1, 1] + (
[0] * len(token_ids_1)) + [1]

def create_token_type_ids_from_sequences(
self,
token_ids_0: List[int],
token_ids_1: Optional[List[int]] = None) -> List[int]:
"""
Create a mask from the two sequences passed to be used in a sequence-pair classification task. Veco does
not make use of token type ids, therefore a list of zeros is returned.

Args:
token_ids_0 (`List[int]`):
List of IDs.
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.

Returns:
`List[int]`: List of zeros.

"""

sep = [self.sep_token_id]
cls = [self.cls_token_id]

if token_ids_1 is None:
return len(cls + token_ids_0 + sep) * [0]
return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0]

@property
def vocab_size(self):
return len(
self.sp_model) + self.fairseq_offset + 1 # Add the <mask> token

def get_vocab(self):
vocab = {
self.convert_ids_to_tokens(i): i
for i in range(self.vocab_size)
}
vocab.update(self.added_tokens_encoder)
return vocab

def _tokenize(self, text: str) -> List[str]:
return self.sp_model.encode(text, out_type=str)

def _convert_token_to_id(self, token):
"""Converts a token (str) in an id using the vocab."""
if token in self.fairseq_tokens_to_ids:
return self.fairseq_tokens_to_ids[token]
spm_id = self.sp_model.PieceToId(token)

# Need to return unknown token if the SP model returned 0
return spm_id + self.fairseq_offset if spm_id else self.unk_token_id

def _convert_id_to_token(self, index):
"""Converts an index (integer) in a token (str) using the vocab."""
if index in self.fairseq_ids_to_tokens:
return self.fairseq_ids_to_tokens[index]
return self.sp_model.IdToPiece(index - self.fairseq_offset)

def convert_tokens_to_string(self, tokens):
"""Converts a sequence of tokens (strings for sub-words) in a single string."""
out_string = ''.join(tokens).replace(SPIECE_UNDERLINE, ' ').strip()
return out_string

def save_vocabulary(self,
save_directory: str,
filename_prefix: Optional[str] = None) -> Tuple[str]:
if not os.path.isdir(save_directory):
logger.error(
f'Vocabulary path ({save_directory}) should be a directory')
return
out_vocab_file = os.path.join(
save_directory, (filename_prefix + '-' if filename_prefix else '')
+ VOCAB_FILES_NAMES['vocab_file'])

if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):
copyfile(self.vocab_file, out_vocab_file)

return (out_vocab_file, )
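VecoTokenizer was copied from transformers' XLM-RoBERTa tokenizer, so after this removal transformers' own classes serve as the drop-in replacement. A minimal sketch, assuming the checkpoint directory contains sentencepiece.bpe.model or tokenizer.json; the path below is hypothetical:

from transformers import XLMRobertaTokenizerFast

# hypothetical local checkpoint directory containing sentencepiece.bpe.model / tokenizer.json
tokenizer = XLMRobertaTokenizerFast.from_pretrained('/path/to/veco_model_dir')
pair = tokenizer('first sentence', 'second sentence')
print(pair['input_ids'])  # <s> A </s></s> B </s>, as documented above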

modelscope/models/nlp/veco/tokenization_fast.py (+0, -213)

@@ -1,213 +0,0 @@
# Copyright 2018 Google AI, Google Brain and Carnegie Mellon University Authors and the HuggingFace Inc. team.
# Copyright 2021-2022 The Alibaba DAMO NLP Team Authors.
# All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License
"""Fast Tokenization classes for Veco. mainly copied from :module:`~transformers.tokenization_xlm_roberta_fast`"""

import os
from shutil import copyfile
from typing import List, Optional, Tuple

import transformers
from transformers.file_utils import is_sentencepiece_available
from transformers.tokenization_utils import AddedToken
from transformers.tokenization_utils_fast import PreTrainedTokenizerFast

from modelscope.utils import logger as logging

if is_sentencepiece_available():
from .tokenization import VecoTokenizer
else:
VecoTokenizer = None

logger = logging.get_logger(__name__)

VOCAB_FILES_NAMES = {
'vocab_file': 'sentencepiece.bpe.model',
'tokenizer_file': 'tokenizer.json'
}

PRETRAINED_VOCAB_FILES_MAP = {
'vocab_file': {},
'tokenizer_file': {},
}

PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {}

transformers.SLOW_TO_FAST_CONVERTERS[
'VecoTokenizer'] = transformers.SLOW_TO_FAST_CONVERTERS[
'XLMRobertaTokenizer']


class VecoTokenizerFast(PreTrainedTokenizerFast):
"""
Adapted from [`RobertaTokenizer`] and [`XLNetTokenizer`].
Based on [BPE](https://huggingface.co/docs/tokenizers/python/latest/components.html?highlight=BPE#models).

This tokenizer inherits from [`PreTrainedTokenizerFast`] which contains most of the main
methods. Users should refer to this superclass for more information regarding those methods.

Args:
vocab_file (`str`):
Path to the vocabulary file.
bos_token (`str`, *optional*, defaults to `"<s>"`):
The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.

<Tip>

When building a sequence using special tokens, this is not the token that is used for the beginning of
sequence. The token used is the `cls_token`.

</Tip>

eos_token (`str`, *optional*, defaults to `"</s>"`):
The end of sequence token.

<Tip>

When building a sequence using special tokens, this is not the token that is used for the end of
sequence. The token used is the `sep_token`.

</Tip>

sep_token (`str`, *optional*, defaults to `"</s>"`):
The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for
sequence classification or for a text and a question for question answering. It is also used as the last
token of a sequence built with special tokens.
cls_token (`str`, *optional*, defaults to `"<s>"`):
The classifier token which is used when doing sequence classification (classification of the whole sequence
instead of per-token classification). It is the first token of the sequence when built with special tokens.
unk_token (`str`, *optional*, defaults to `"<unk>"`):
The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
token instead.
pad_token (`str`, *optional*, defaults to `"<pad>"`):
The token used for padding, for example when batching sequences of different lengths.
mask_token (`str`, *optional*, defaults to `"<mask>"`):
The token used for masking values. This is the token used when training this model with masked language
modeling. This is the token which the model will try to predict.
additional_special_tokens (`List[str]`, *optional*, defaults to `["<s>NOTUSED", "</s>NOTUSED"]`):
Additional special tokens used by the tokenizer.
"""

vocab_files_names = VOCAB_FILES_NAMES
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
model_input_names = ['input_ids', 'attention_mask']
slow_tokenizer_class = VecoTokenizer

def __init__(self,
vocab_file=None,
tokenizer_file=None,
bos_token='<s>',
eos_token='</s>',
sep_token='</s>',
cls_token='<s>',
unk_token='<unk>',
pad_token='<pad>',
mask_token='<mask>',
**kwargs):
# Mask token behave like a normal word, i.e. include the space before it
mask_token = AddedToken(
mask_token, lstrip=True, rstrip=False) if isinstance(
mask_token, str) else mask_token

super().__init__(
vocab_file,
tokenizer_file=tokenizer_file,
bos_token=bos_token,
eos_token=eos_token,
sep_token=sep_token,
cls_token=cls_token,
unk_token=unk_token,
pad_token=pad_token,
mask_token=mask_token,
**kwargs,
)

self.vocab_file = vocab_file
self.can_save_slow_tokenizer = False if not self.vocab_file else True

def build_inputs_with_special_tokens(
self,
token_ids_0: List[int],
token_ids_1: Optional[List[int]] = None) -> List[int]:
"""
Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and
adding special tokens. A Veco sequence has the following format:

- single sequence: `<s> X </s>`
- pair of sequences: `<s> A </s></s> B </s>`

Args:
token_ids_0 (`List[int]`):
List of IDs to which the special tokens will be added.
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.

Returns:
`List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
"""

if token_ids_1 is None:
return [self.cls_token_id] + token_ids_0 + [self.sep_token_id]
cls = [self.cls_token_id]
sep = [self.sep_token_id]
return cls + token_ids_0 + sep + sep + token_ids_1 + sep

def create_token_type_ids_from_sequences(
self,
token_ids_0: List[int],
token_ids_1: Optional[List[int]] = None) -> List[int]:
"""
Create a mask from the two sequences passed to be used in a sequence-pair classification task. Veco does
not make use of token type ids, therefore a list of zeros is returned.

Args:
token_ids_0 (`List[int]`):
List of IDs.
token_ids_1 (`List[int]`, *optional*):
Optional second list of IDs for sequence pairs.

Returns:
`List[int]`: List of zeros.

"""

sep = [self.sep_token_id]
cls = [self.cls_token_id]

if token_ids_1 is None:
return len(cls + token_ids_0 + sep) * [0]
return len(cls + token_ids_0 + sep + sep + token_ids_1 + sep) * [0]

def save_vocabulary(self,
save_directory: str,
filename_prefix: Optional[str] = None) -> Tuple[str]:
if not self.can_save_slow_tokenizer:
raise ValueError(
'Your fast tokenizer does not have the necessary information to save the vocabulary for a slow '
'tokenizer.')

if not os.path.isdir(save_directory):
logger.error(
f'Vocabulary path ({save_directory}) should be a directory.')
return
out_vocab_file = os.path.join(
save_directory, (filename_prefix + '-' if filename_prefix else '')
+ VOCAB_FILES_NAMES['vocab_file'])

if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file):
copyfile(self.vocab_file, out_vocab_file)

return (out_vocab_file, )
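Both deleted fast tokenizers rely on the same one-line trick: registering the custom class name in transformers.SLOW_TO_FAST_CONVERTERS so an existing converter is reused. A sketch of that registration for a hypothetical tokenizer that keeps BERT's vocabulary format:

import transformers
from transformers import BertTokenizer


class MyBertLikeTokenizer(BertTokenizer):
    """Hypothetical slow tokenizer that keeps BERT's vocabulary format."""


# reuse the existing BertTokenizer converter for the new class name, mirroring the
# registration the deleted files performed for SbertTokenizer and VecoTokenizer
transformers.SLOW_TO_FAST_CONVERTERS['MyBertLikeTokenizer'] = \
    transformers.SLOW_TO_FAST_CONVERTERS['BertTokenizer']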

modelscope/outputs/nlp/model_outputs.py (+125, -311)

@@ -1,179 +1,13 @@
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union
from typing import Optional, Tuple, Union

import numpy as np

from modelscope.outputs.outputs import ModelOutputBase

Tensor = Union['torch.Tensor', 'tf.Tensor']


@dataclass
class TextClassificationModelOutput(ModelOutputBase):
"""The output class for text classification models.

Args:
logits (`Tensor`): The logits output of the model.
loss (`Tensor`, *optional*): The loss of the model, available when training.
hidden_states (`Tensor`, *optional*): Hidden-states of the model at the
output of each layer plus the optional initial embedding outputs.
"""

logits: Tensor = None
loss: Tensor = None


@dataclass
class TokenClassificationModelOutput(ModelOutputBase):
"""The output class for token classification models.
logits (`Tensor`): The logits output of the model.
loss (`Tensor`, *optional*) The loss of the model, available when training.
"""

logits: Tensor = None
loss: Tensor = None
offset_mapping: Tensor = None


@dataclass
class FillMaskModelOutput(ModelOutputBase):
"""The output class for text classification models.

Args:
logits (`Tensor`): The logits output of the model.
loss (`Tensor`, *optional*) The loss of the model, available when training.
input_ids (`Tensor`, *optional*) The input id tensor fed into the model.
hidden_states (`Tensor`, *optional*) Hidden-states of the model at the
output of each layer plus the optional initial embedding outputs.
"""

logits: Tensor = None
loss: Tensor = None
input_ids: Tensor = None
hidden_states: Tensor = None


@dataclass
class TokenClassifierOutput(ModelOutputBase):
"""
Base class for outputs of token classification models.

Args:
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when
`labels` is provided) :
Classification loss.
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length,
config.num_labels)`):
Classification scores (before SoftMax).
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when
`output_hidden_states=True` is passed or when
`config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings,
if the model has an embedding layer, + one for the output of each
layer) of shape `(batch_size, sequence_length, hidden_size)`.

Hidden-states of the model at the output of each layer plus the
optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when
`output_attentions=True` is passed or when
`config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape
`(batch_size, num_heads, sequence_length, sequence_length)`.

Attentions weights after the attention softmax, used to compute the
weighted average in the self-attention heads.
offset_mapping (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,
sequence_length)`, `optional`):
Indices of positions of each input sequence tokens in the sentence.
Selected in the range ``[0, sequence_length - 1]``.

"""

loss: Tensor = None
logits: Tensor = None
hidden_states: Tensor = None
attentions: Tensor = None
offset_mapping: Tensor = None


@dataclass
class TokenClassifierWithPredictionsOutput(ModelOutputBase):
"""
Base class for outputs of token classification models.

Args:
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when
`labels` is provided) :
Classification loss.
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length,
config.num_labels)`):
Classification scores (before SoftMax).
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when
`output_hidden_states=True` is passed or when
`config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings,
if the model has an embedding layer, + one for the output of each
layer) of shape `(batch_size, sequence_length, hidden_size)`.

Hidden-states of the model at the output of each layer plus the
optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when
`output_attentions=True` is passed or when
`config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape
`(batch_size, num_heads, sequence_length, sequence_length)`.

Attentions weights after the attention softmax, used to compute the
weighted average in the self-attention heads.
offset_mapping (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,
sequence_length)`, `optional`):
Indices of positions of each input sequence tokens in the sentence.
Selected in the range ``[0, sequence_length - 1]``.
predictions: A PyTorch tensor of the best tag sequence for each batch of shape
(nbest, batch_size, seq_length)

"""

loss: Tensor = None
logits: Tensor = None
hidden_states: Tensor = None
attentions: Tensor = None
offset_mapping: Tensor = None
predictions: Tensor = None


@dataclass
class BaseModelOutput(ModelOutputBase):
"""
Base class for model's outputs, with potential hidden states and attentions.

Args:
last_hidden_state (`torch.FloatTensor` of shape `(batch_size,
sequence_length, hidden_size)`):
Sequence of hidden-states at the output of the last layer of the
model.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when
`output_hidden_states=True` is passed or when
`config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings,
if the model has an embedding layer, + one for the output of each
layer) of shape `(batch_size, sequence_length, hidden_size)`.

Hidden-states of the model at the output of each layer plus the
optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when
`output_attentions=True` is passed or when
`config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape
`(batch_size, num_heads, sequence_length, sequence_length)`.

Attentions weights after the attention softmax, used to compute the
weighted average in the self-attention heads.
"""

last_hidden_state: Tensor = None
hidden_states: Optional[Tuple[Tensor]] = None
attentions: Optional[Tuple[Tensor]] = None


@dataclass
class BackboneModelOutput(ModelOutputBase):
"""The output class for text classification models.
@@ -196,81 +30,6 @@ class AttentionBackboneModelOutput(BackboneModelOutput):
"""The output class for backbones of attention based models.

Args:
attentions (`tuple(Tensor)`, *optional*): Attentions weights after the
attention softmax, used to compute the weighted average in the
self-attention heads.
"""
attentions: Tensor = None
past_key_values: Tensor = None
cross_attentions: Tensor = None


@dataclass
class AttentionTextClassificationModelOutput(TextClassificationModelOutput):
"""The output class for backbones of attention based models.

Args:
attentions (`tuple(Tensor)`, *optional* Attentions weights after the
attention softmax, used to compute the weighted average in the
self-attention heads.
"""
attentions: Tensor = None
hidden_states: Tensor = None


@dataclass
class AttentionTokenClassificationModelOutput(TokenClassificationModelOutput):
"""The output class for backbones of attention based models.

Args:
attentions (`tuple(Tensor)`, *optional* Attentions weights after the attention softmax,
used to compute the weighted average in the self-attention heads.
"""
attentions: Tensor = None
hidden_states: Tensor = None


@dataclass
class AttentionFillMaskModelOutput(FillMaskModelOutput):
"""The output class for the fill mask and attention based models.

Args:
attentions (`tuple(Tensor)`, *optional*): Attentions weights after the
attention softmax, used to compute the weighted average in the
self-attention heads.
"""
attentions: Tensor = None


@dataclass
class BaseModelOutputWithPoolingAndCrossAttentions(ModelOutputBase):
"""
Base class for model's outputs that also contains a pooling of the last
hidden states.

Args:
last_hidden_state (`torch.FloatTensor` of shape `(batch_size,
sequence_length, hidden_size)`):
Sequence of hidden-states at the output of the last layer of the
model.
pooler_output (`torch.FloatTensor` of shape `(batch_size,
hidden_size)`):
Last layer hidden-state of the first token of the sequence
(classification token) after further processing through the layers
used for the auxiliary pretraining task. E.g. for BERT-family of
models, this returns the classification token after processing
through a linear layer and a tanh activation function. The linear
layer weights are trained from the next sentence prediction
(classification) objective during pretraining.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when
`output_hidden_states=True` is passed or when
`config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings,
if the model has an embedding layer, + one for the output of each
layer) of shape `(batch_size, sequence_length, hidden_size)`.

Hidden-states of the model at the output of each layer plus the
optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when
`output_attentions=True` is passed or when
`config.output_attentions=True`):
@@ -303,75 +62,8 @@ class BaseModelOutputWithPoolingAndCrossAttentions(ModelOutputBase):
can be used (see `past_key_values` input) to speed up sequential
decoding.
"""

last_hidden_state: Tensor = None
pooler_output: Tensor = None
hidden_states: Tensor = None
past_key_values: Tensor = None
attentions: Tensor = None
cross_attentions: Tensor = None


@dataclass
class BaseModelOutputWithPastAndCrossAttentions(ModelOutputBase):
"""
Base class for model's outputs that may also contain a past key/values (to
speed up sequential decoding).

Args:
last_hidden_state (`torch.FloatTensor` of shape `(batch_size,
sequence_length, hidden_size)`):
Sequence of hidden-states at the output of the last layer of the
model.

If `past_key_values` is used only the last hidden-state of the
sequences of shape `(batch_size, 1, hidden_size)` is output.
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned
when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`,
with each tuple having 2 tensors of shape `(batch_size, num_heads,
sequence_length, embed_size_per_head)`) and optionally if
`config.is_encoder_decoder=True` 2 additional tensors of shape
`(batch_size, num_heads, encoder_sequence_length,
embed_size_per_head)`.

Contains pre-computed hidden-states (key and values in the
self-attention blocks and optionally if
`config.is_encoder_decoder=True` in the cross-attention blocks) that
can be used (see `past_key_values` input) to speed up sequential
decoding.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when
`output_hidden_states=True` is passed or when
`config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings,
if the model has an embedding layer, + one for the output of each
layer) of shape `(batch_size, sequence_length, hidden_size)`.

Hidden-states of the model at the output of each layer plus the
optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when
`output_attentions=True` is passed or when
`config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape
`(batch_size, num_heads, sequence_length, sequence_length)`.

Attention weights after the attention softmax, used to compute the
weighted average in the self-attention heads.
cross_attentions (`tuple(torch.FloatTensor)`, *optional*, returned when
`output_attentions=True` and `config.add_cross_attention=True` is passed
or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape
`(batch_size, num_heads, sequence_length, sequence_length)`.

Attention weights of the decoder's cross-attention layer, after the
attention softmax, used to compute the weighted average in the
cross-attention heads.
"""

last_hidden_state: Tensor = None
past_key_values: Tensor = None
hidden_states: Tensor = None
attentions: Tensor = None
cross_attentions: Tensor = None


@@ -459,6 +151,60 @@ class Seq2SeqModelOutput(ModelOutputBase):
encoder_attentions: Optional[Tuple[Tensor]] = None


@dataclass
class FaqQuestionAnsweringOutput(ModelOutputBase):
"""The output class for faq QA models.
"""

scores: Tensor = None


@dataclass
class FeatureExtractionOutput(ModelOutputBase):
"""The output class for feature extraction models.
"""

text_embedding: Tensor = None


@dataclass
class FillMaskModelOutput(ModelOutputBase):
"""The output class for text classification models.

Args:
logits (`Tensor`): The logits output of the model.
loss (`Tensor`, *optional*) The loss of the model, available when training.
input_ids (`Tensor`, *optional*) The input id tensor fed into the model.
hidden_states (`Tensor`, *optional*) Hidden-states of the model at the
output of each layer plus the optional initial embedding outputs.
"""

logits: Tensor = None
loss: Tensor = None
input_ids: Tensor = None
hidden_states: Tensor = None


@dataclass
class AttentionFillMaskModelOutput(FillMaskModelOutput):
"""The output class for the fill mask and attention based models.

Args:
attentions (`tuple(Tensor)`, *optional*): Attention weights after the
attention softmax, used to compute the weighted average in the
self-attention heads.
"""
attentions: Tensor = None


@dataclass
class InformationExtractionOutput(ModelOutputBase):
"""The output class for information extraction models.
"""

spo_list: np.ndarray = None


@dataclass
class Seq2SeqLMOutput(ModelOutputBase):
"""
@@ -543,6 +289,42 @@ class Seq2SeqLMOutput(ModelOutputBase):
encoder_attentions: Optional[Tuple[Tensor]] = None


@dataclass
class TextClassificationModelOutput(ModelOutputBase):
"""The output class for text classification models.

Args:
logits (`Tensor`): The logits output of the model.
loss (`Tensor`, *optional*): The loss of the model, available when training.
hidden_states (`Tensor`, *optional*): Hidden-states of the model at the
output of each layer plus the optional initial embedding outputs.
"""

logits: Tensor = None
loss: Tensor = None


@dataclass
class AttentionTextClassificationModelOutput(TextClassificationModelOutput):
"""The output class for backbones of attention based models.

Args:
attentions (`tuple(Tensor)`, *optional* Attentions weights after the
attention softmax, used to compute the weighted average in the
self-attention heads.
"""
attentions: Tensor = None
hidden_states: Tensor = None


@dataclass
class TextErrorCorrectionOutput(ModelOutputBase):
"""The output class for information extraction models.
"""

predictions: np.ndarray = None


@dataclass
class TextGenerationModelOutput(ModelOutputBase):
"""The output class for text generation models.
@@ -588,3 +370,35 @@ class TokenGeneratorOutput(ModelOutputBase):
scores: Optional[Tuple[Tensor]] = None
attentions: Optional[Tuple[Tuple[Tensor]]] = None
hidden_states: Optional[Tuple[Tuple[Tensor]]] = None


@dataclass
class TokenClassificationModelOutput(ModelOutputBase):
"""The output class for token classification models.
logits (`Tensor`): The logits output of the model.
loss (`Tensor`, *optional*) The loss of the model, available when training.
predictions: A PyTorch tensor of the best tag sequence for each batch of shape
(nbest, batch_size, seq_length)
offset_mapping (:obj:`torch.FloatTensor` of shape :obj:`(batch_size,
sequence_length)`, `optional`):
Indices of positions of each input sequence tokens in the sentence.
Selected in the range ``[0, sequence_length - 1]``.
"""

logits: Tensor = None
loss: Tensor = None
offset_mapping: Tensor = None
predictions: Tensor = None
label_mask: Tensor = None


@dataclass
class AttentionTokenClassificationModelOutput(TokenClassificationModelOutput):
"""The output class for backbones of attention based models.

Args:
attentions (`tuple(Tensor)`, *optional* Attentions weights after the attention softmax,
used to compute the weighted average in the self-attention heads.
"""
attentions: Tensor = None
hidden_states: Tensor = None
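
Note (illustrative, not part of the diff): a minimal sketch of how these output dataclasses are consumed. It assumes TokenClassificationModelOutput is exported from modelscope.outputs, like TextClassificationModelOutput is in the pipeline changes below, and that ModelOutputBase supports dict-style access as the pipeline code relies on; the tensor shapes are made up.

import torch

from modelscope.outputs import TokenClassificationModelOutput

# Illustrative shapes: batch of 1, sequence length 16, 9 labels, nbest=1.
output = TokenClassificationModelOutput(
    logits=torch.randn(1, 16, 9),
    predictions=torch.zeros(1, 1, 16, dtype=torch.long))

print(output.logits.shape)       # attribute access
print(output['logits'].shape)    # dict-style access inherited from ModelOutputBase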

+ 3
- 1
modelscope/pipelines/base.py View File

@@ -12,7 +12,7 @@ import numpy as np

from modelscope.models.base import Model
from modelscope.msdatasets import MsDataset
from modelscope.outputs import TASK_OUTPUTS
from modelscope.outputs import TASK_OUTPUTS, ModelOutputBase
from modelscope.pipeline_inputs import TASK_INPUTS, check_input_type
from modelscope.preprocessors import Preprocessor
from modelscope.utils.config import Config
@@ -321,6 +321,8 @@ class Pipeline(ABC):
return
output_keys = TASK_OUTPUTS[task_name]
missing_keys = []
input = input.keys() if isinstance(input,
(dict, ModelOutputBase)) else input
for k in output_keys:
if k not in input:
missing_keys.append(k)
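
Note (illustrative, not part of the diff): the check above now reduces ModelOutputBase instances to their keys, exactly like dicts. A self-contained sketch of the same logic; check_missing_keys is a made-up helper name.

from typing import List, Sequence, Union

from modelscope.outputs import ModelOutputBase


def check_missing_keys(output: Union[dict, ModelOutputBase, Sequence[str]],
                       expected_keys: Sequence[str]) -> List[str]:
    # dicts and ModelOutputBase both expose .keys(); anything else is assumed
    # to already be an iterable of keys (e.g. a list of output names).
    keys = output.keys() if isinstance(output, (dict, ModelOutputBase)) else output
    return [k for k in expected_keys if k not in keys]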


+ 7
- 4
modelscope/pipelines/builder.py View File

@@ -298,6 +298,7 @@ def pipeline(task: str = None,
raise ValueError('task or pipeline_name is required')

model = normalize_model_input(model, model_revision)
pipeline_props = {'type': pipeline_name}
if pipeline_name is None:
# get default pipeline for this task
if isinstance(model, str) \
@@ -309,7 +310,7 @@ def pipeline(task: str = None,
model, str) else read_config(
model[0], revision=model_revision)
check_config(cfg)
pipeline_name = cfg.pipeline.type
pipeline_props = cfg.pipeline
elif model is not None:
# get pipeline info from Model object
first_model = model[0] if isinstance(model, list) else model
@@ -318,13 +319,15 @@ def pipeline(task: str = None,
cfg = read_config(first_model.model_dir)
check_config(cfg)
first_model.pipeline = cfg.pipeline
pipeline_name = first_model.pipeline.type
pipeline_props = first_model.pipeline
else:
pipeline_name, default_model_repo = get_default_pipeline_info(task)
model = normalize_model_input(default_model_repo, model_revision)
pipeline_props = {'type': pipeline_name}

cfg = ConfigDict(type=pipeline_name, model=model)
cfg.device = device
pipeline_props['model'] = model
pipeline_props['device'] = device
cfg = ConfigDict(pipeline_props)

if kwargs:
cfg.update(kwargs)
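
Note (illustrative, not part of the diff): the builder now forwards the whole `pipeline` section of the configuration into the ConfigDict, not only its `type`, so any extra keys configured there reach the pipeline constructor. A rough sketch, assuming ConfigDict is importable from modelscope.utils.config; the configuration content and model id are made up.

from modelscope.utils.config import ConfigDict

# Pretend configuration.json contains:
#   "pipeline": {"type": "text-classification", "sequence_length": 256}
pipeline_section = {'type': 'text-classification', 'sequence_length': 256}

pipeline_props = dict(pipeline_section)
pipeline_props['model'] = 'damo/some-model-id'   # placeholder model id
pipeline_props['device'] = 'gpu'
cfg = ConfigDict(pipeline_props)
print(cfg.type, cfg.sequence_length)             # every configured key survives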


+ 2
- 0
modelscope/pipelines/cv/easycv_pipelines/base.py View File

@@ -61,6 +61,8 @@ class EasyCVPipeline(object):
self.cfg = Config.from_file(self.config_file)
if 'device' in kwargs:
kwargs['device'] = create_device(kwargs['device'])
if 'predictor_config' in kwargs:
kwargs.pop('predictor_config')
self.predict_op = self._build_predict_op(**kwargs)

def _build_predict_op(self, **kwargs):


+ 5
- 9
modelscope/pipelines/nlp/__init__.py View File

@@ -12,22 +12,19 @@ if TYPE_CHECKING:
from .dialog_state_tracking_pipeline import DialogStateTrackingPipeline
from .document_segmentation_pipeline import DocumentSegmentationPipeline
from .extractive_summarization_pipeline import ExtractiveSummarizationPipeline
from .fasttext_sequence_classification_pipeline import FasttextSequenceClassificationPipeline
from .fasttext_text_classification_pipeline import FasttextSequenceClassificationPipeline
from .faq_question_answering_pipeline import FaqQuestionAnsweringPipeline
from .feature_extraction_pipeline import FeatureExtractionPipeline
from .fill_mask_pipeline import FillMaskPipeline
from .information_extraction_pipeline import InformationExtractionPipeline
from .named_entity_recognition_pipeline import NamedEntityRecognitionPipeline, \
NamedEntityRecognitionThaiPipeline, \
NamedEntityRecognitionVietPipeline
from .named_entity_recognition_pipeline import NamedEntityRecognitionPipeline
from .text_ranking_pipeline import TextRankingPipeline
from .sentence_embedding_pipeline import SentenceEmbeddingPipeline
from .text_classification_pipeline import TextClassificationPipeline
from .summarization_pipeline import SummarizationPipeline
from .translation_quality_estimation_pipeline import TranslationQualityEstimationPipeline
from .text_error_correction_pipeline import TextErrorCorrectionPipeline
from .text_generation_pipeline import TextGenerationPipeline
from .text2text_generation_pipeline import Text2TextGenerationPipeline
from .text_generation_pipeline import TextGenerationPipeline, TextGenerationT5Pipeline
from .token_classification_pipeline import TokenClassificationPipeline
from .translation_pipeline import TranslationPipeline
from .word_segmentation_pipeline import WordSegmentationPipeline, WordSegmentationThaiPipeline
@@ -56,8 +53,6 @@ else:
'information_extraction_pipeline': ['InformationExtractionPipeline'],
'named_entity_recognition_pipeline': [
'NamedEntityRecognitionPipeline',
'NamedEntityRecognitionThaiPipeline',
'NamedEntityRecognitionVietPipeline'
],
'text_ranking_pipeline': ['TextRankingPipeline'],
'sentence_embedding_pipeline': ['SentenceEmbeddingPipeline'],
@@ -66,7 +61,8 @@ else:
['TableQuestionAnsweringPipeline'],
'text_classification_pipeline': ['TextClassificationPipeline'],
'text_error_correction_pipeline': ['TextErrorCorrectionPipeline'],
'text_generation_pipeline': ['TextGenerationPipeline'],
'text_generation_pipeline':
['TextGenerationPipeline', 'TextGenerationT5Pipeline'],
'text2text_generation_pipeline': ['Text2TextGenerationPipeline'],
'token_classification_pipeline': ['TokenClassificationPipeline'],
'translation_pipeline': ['TranslationPipeline'],


+ 14
- 5
modelscope/pipelines/nlp/conversational_text_to_sql_pipeline.py View File

@@ -24,18 +24,27 @@ class ConversationalTextToSqlPipeline(Pipeline):
def __init__(self,
model: Union[StarForTextToSql, str],
preprocessor: ConversationalTextToSqlPreprocessor = None,
config_file: str = None,
device: str = 'gpu',
auto_collate=True,
**kwargs):
"""use `model` and `preprocessor` to create a conversational text-to-sql prediction pipeline

Args:
model (StarForTextToSql): a model instance
preprocessor (ConversationalTextToSqlPreprocessor):
a preprocessor instance
model (StarForTextToSql): A model instance
preprocessor (ConversationalTextToSqlPreprocessor): A preprocessor instance
kwargs (dict, `optional`):
Extra kwargs passed into the preprocessor's constructor.
"""
super().__init__(model=model, preprocessor=preprocessor, **kwargs)
super().__init__(
model=model,
preprocessor=preprocessor,
config_file=config_file,
device=device,
auto_collate=auto_collate)
if preprocessor is None:
self.preprocessor = ConversationalTextToSqlPreprocessor(
self.model.model_dir)
self.model.model_dir, **kwargs)

def postprocess(self, inputs: Dict[str, Any]) -> Dict[str, str]:
"""process the prediction results


+ 12
- 2
modelscope/pipelines/nlp/dialog_intent_prediction_pipeline.py View File

@@ -22,6 +22,9 @@ class DialogIntentPredictionPipeline(Pipeline):
def __init__(self,
model: Union[SpaceForDialogIntent, str],
preprocessor: DialogIntentPredictionPreprocessor = None,
config_file: str = None,
device: str = 'gpu',
auto_collate=True,
**kwargs):
"""Use `model` and `preprocessor` to create a dialog intent prediction pipeline

@@ -29,11 +32,18 @@ class DialogIntentPredictionPipeline(Pipeline):
model (str or SpaceForDialogIntent): Supply either a local model dir or a model id from the model hub,
or a SpaceForDialogIntent instance.
preprocessor (DialogIntentPredictionPreprocessor): An optional preprocessor instance.
kwargs (dict, `optional`):
Extra kwargs passed into the preprocessor's constructor.
"""
super().__init__(model=model, preprocessor=preprocessor, **kwargs)
super().__init__(
model=model,
preprocessor=preprocessor,
config_file=config_file,
device=device,
auto_collate=auto_collate)
if preprocessor is None:
self.preprocessor = DialogIntentPredictionPreprocessor(
self.model.model_dir)
self.model.model_dir, **kwargs)
self.categories = self.preprocessor.categories

def postprocess(self, inputs: Dict[str, Any]) -> Dict[str, str]:


+ 12
- 2
modelscope/pipelines/nlp/dialog_modeling_pipeline.py View File

@@ -21,6 +21,9 @@ class DialogModelingPipeline(Pipeline):
def __init__(self,
model: Union[SpaceForDialogModeling, str],
preprocessor: DialogModelingPreprocessor = None,
config_file: str = None,
device: str = 'gpu',
auto_collate=True,
**kwargs):
"""Use `model` and `preprocessor` to create a dialog modeling pipeline for dialog response generation

@@ -28,11 +31,18 @@ class DialogModelingPipeline(Pipeline):
model (str or SpaceForDialogModeling): Supply either a local model dir or a model id from the model hub,
or a SpaceForDialogModeling instance.
preprocessor (DialogModelingPreprocessor): An optional preprocessor instance.
kwargs (dict, `optional`):
Extra kwargs passed into the preprocessor's constructor.
"""
super().__init__(model=model, preprocessor=preprocessor, **kwargs)
super().__init__(
model=model,
preprocessor=preprocessor,
config_file=config_file,
device=device,
auto_collate=auto_collate)
if preprocessor is None:
self.preprocessor = DialogModelingPreprocessor(
self.model.model_dir)
self.model.model_dir, **kwargs)

def postprocess(self, inputs: Dict[str, Tensor]) -> Dict[str, str]:
"""process the prediction results


+ 14
- 2
modelscope/pipelines/nlp/dialog_state_tracking_pipeline.py View File

@@ -22,6 +22,9 @@ class DialogStateTrackingPipeline(Pipeline):
def __init__(self,
model: Union[SpaceForDST, str],
preprocessor: DialogStateTrackingPreprocessor = None,
config_file: str = None,
device: str = 'gpu',
auto_collate=True,
**kwargs):
"""use `model` and `preprocessor` to create a dialog state tracking pipeline for
observation of dialog states tracking after many turns of open domain dialogue
@@ -30,11 +33,20 @@ class DialogStateTrackingPipeline(Pipeline):
model (str or SpaceForDialogStateTracking): Supply either a local model dir or a model id
from the model hub, or a SpaceForDialogStateTracking instance.
preprocessor (DialogStateTrackingPreprocessor): An optional preprocessor instance.
kwargs (dict, `optional`):
Extra kwargs passed into the preprocessor's constructor.
"""
super().__init__(model=model, preprocessor=preprocessor, **kwargs)

super().__init__(
model=model,
preprocessor=preprocessor,
config_file=config_file,
device=device,
auto_collate=auto_collate)

if preprocessor is None:
self.preprocessor = DialogStateTrackingPreprocessor(
self.model.model_dir)
self.model.model_dir, **kwargs)

self.tokenizer = self.preprocessor.tokenizer
self.config = self.preprocessor.config


+ 9
- 1
modelscope/pipelines/nlp/distributed_gpt3_pipeline.py View File

@@ -21,8 +21,16 @@ class DistributedGPT3Pipeline(DistributedPipeline):
model = None

def __init__(self, model, preprocessor=None, **kwargs):
"""

Args:
model: The model piece, str is not supported.
preprocessor: The preprocessor matched with the model.
kwargs (dict, `optional`):
Extra kwargs passed into the preprocessor's constructor.
"""
if preprocessor is None:
preprocessor = TextGenerationJiebaPreprocessor(model)
preprocessor = TextGenerationJiebaPreprocessor(model, **kwargs)
super().__init__(model, preprocessor=preprocessor, **kwargs)
assert hasattr(preprocessor, 'tokenizer')



+ 8
- 8
modelscope/pipelines/nlp/distributed_plug_pipeline.py View File

@@ -8,7 +8,7 @@ from modelscope.metainfo import Pipelines
from modelscope.models.nlp.plug import DistributedPlug
from modelscope.pipelines.base import DistributedPipeline
from modelscope.pipelines.builder import PIPELINES
from modelscope.preprocessors import TextGenerationPreprocessor
from modelscope.preprocessors import TextGenerationTransformersPreprocessor
from modelscope.utils.constant import Tasks


@@ -24,11 +24,12 @@ class DistributedPlugPipeline(DistributedPipeline):
model,
preprocessor=None,
first_sequence='sentence',
sequence_length=512,
**kwargs):
"""Create a plug pipeline instance.

Args:
model: The model_id of plug(damo/nlp_plug_text-generation_27B).
model: The model_id of plug(damo/nlp_plug_text-generation_27B).
The default path to damo/nlp_plug_text-generation_27B can be obtained by function
get_cache_dir("damo/nlp_plug_text-generation_27B"), the model should be downloaded to
this path before calling this class by model_id.
@@ -53,17 +54,16 @@ class DistributedPlugPipeline(DistributedPipeline):
|_ mp_rank_05_model_states.pt
|_ mp_rank_06_model_states.pt
|_ mp_rank_07_model_states.pt
preprocessor: The optional preprocessor, if not passed in, a TextGenerationPreprocessor will
preprocessor: The optional preprocessor, if not passed in, a TextGenerationPreprocessor will
be used as default.
first_sequence: The first_sequence key name if the input format is a dict.
kwargs:
sequence_length: The input sequence_length.
kwargs (dict, `optional`): Extra kwargs passed into the preprocessor's constructor.
"""
if preprocessor is None:
preprocessor = TextGenerationPreprocessor(
preprocessor = TextGenerationTransformersPreprocessor(
model,
first_sequence=first_sequence,
sequence_length=kwargs.pop('sequence_length', 512))
sequence_length=sequence_length,
**kwargs)
super().__init__(model, preprocessor=preprocessor, **kwargs)
assert hasattr(preprocessor, 'tokenizer')
self.cls_token_id = preprocessor.tokenizer.cls_token_id


+ 28
- 20
modelscope/pipelines/nlp/document_segmentation_pipeline.py View File

@@ -14,7 +14,8 @@ from modelscope.models.nlp.ponet.configuration import PoNetConfig
from modelscope.outputs import OutputKeys
from modelscope.pipelines.base import Pipeline, Tensor
from modelscope.pipelines.builder import PIPELINES
from modelscope.preprocessors import DocumentSegmentationPreprocessor
from modelscope.preprocessors import \
DocumentSegmentationTransformersPreprocessor
from modelscope.utils.constant import Tasks
from modelscope.utils.logger import get_logger

@@ -27,26 +28,34 @@ __all__ = ['DocumentSegmentationPipeline']
Tasks.document_segmentation, module_name=Pipelines.document_segmentation)
class DocumentSegmentationPipeline(Pipeline):

def __init__(self,
model: Union[Model, str],
preprocessor: DocumentSegmentationPreprocessor = None,
**kwargs):
super().__init__(model=model, preprocessor=preprocessor, **kwargs)
def __init__(
self,
model: Union[Model, str],
preprocessor: DocumentSegmentationTransformersPreprocessor = None,
config_file: str = None,
device: str = 'gpu',
auto_collate=True,
**kwargs):
"""The document segmentation pipeline.

self.model_dir = self.model.model_dir
self.model_cfg = self.model.forward()

if self.model_cfg['type'] == 'bert':
config = BertConfig.from_pretrained(self.model_dir, num_labels=2)
elif self.model_cfg['type'] == 'ponet':
config = PoNetConfig.from_pretrained(self.model_dir, num_labels=2)

self.document_segmentation_model = self.model.build_with_config(
config=config)
Args:
model (str or Model): Supply either a local model dir or a model id from the model hub
preprocessor (Preprocessor): An optional preprocessor instance, please make sure the preprocessor fits for
the model if supplied.
"""
super().__init__(
model=model,
preprocessor=preprocessor,
config_file=config_file,
device=device,
auto_collate=auto_collate)

self.model_dir = self.model.model_dir
self.model_cfg = self.model.model_cfg
if preprocessor is None:
self.preprocessor = DocumentSegmentationPreprocessor(
self.model.model_dir, config)
self.preprocessor = DocumentSegmentationTransformersPreprocessor(
self.model_dir, self.model.config.max_position_embeddings,
**kwargs)

def __call__(
self, documents: Union[List[List[str]], List[str],
@@ -85,8 +94,7 @@ class DocumentSegmentationPipeline(Pipeline):
key: torch.tensor(val)
for key, val in predict_dataset.items()
}
predictions = self.document_segmentation_model.forward(
**input).logits
predictions = self.model.forward(**input).logits

predictions = np.argmax(predictions, axis=2)
assert len(sentences) == len(


+ 24
- 28
modelscope/pipelines/nlp/extractive_summarization_pipeline.py View File

@@ -6,15 +6,14 @@ from typing import Any, Dict, List, Union
import numpy as np
import torch
from datasets import Dataset
from transformers.models.bert.modeling_bert import BertConfig

from modelscope.metainfo import Pipelines
from modelscope.models import Model
from modelscope.models.nlp.ponet.configuration import PoNetConfig
from modelscope.outputs import OutputKeys
from modelscope.pipelines.base import Pipeline, Tensor
from modelscope.pipelines.builder import PIPELINES
from modelscope.preprocessors import DocumentSegmentationPreprocessor
from modelscope.preprocessors import \
DocumentSegmentationTransformersPreprocessor
from modelscope.utils.constant import Tasks
from modelscope.utils.logger import get_logger

@@ -28,31 +27,29 @@ __all__ = ['ExtractiveSummarizationPipeline']
module_name=Pipelines.extractive_summarization)
class ExtractiveSummarizationPipeline(Pipeline):

def __init__(self,
model: Union[Model, str],
preprocessor: DocumentSegmentationPreprocessor = None,
**kwargs):
model = model if isinstance(model,
Model) else Model.from_pretrained(model)
self.model_dir = model.model_dir
self.model_cfg = model.forward()
if self.model_cfg['type'] == 'bert':
config = BertConfig.from_pretrained(model.model_dir, num_labels=2)
elif self.model_cfg['type'] == 'ponet':
config = PoNetConfig.from_pretrained(model.model_dir, num_labels=2)
self.extractive_summarization_model = model.build_with_config(
config=config)
def __init__(
self,
model: Union[Model, str],
preprocessor: DocumentSegmentationTransformersPreprocessor = None,
config_file: str = None,
device: str = 'gpu',
auto_collate=True,
**kwargs):
super().__init__(
model=model,
preprocessor=preprocessor,
config_file=config_file,
device=device,
auto_collate=auto_collate)
self.model_dir = self.model.model_dir
self.model_cfg = self.model.model_cfg

if preprocessor is None:
preprocessor = DocumentSegmentationPreprocessor(
self.model_dir, config)
super().__init__(model=model, preprocessor=preprocessor, **kwargs)

self.preprocessor = preprocessor
self.preprocessor = DocumentSegmentationTransformersPreprocessor(
self.model_dir, self.model.config.max_position_embeddings,
**kwargs)

def __call__(self, documents: Union[List[str], str]) -> Dict[str, Any]:
output = self.predict(documents)
@@ -80,8 +77,7 @@ class ExtractiveSummarizationPipeline(Pipeline):
key: torch.tensor(val)
for key, val in predict_dataset.items()
}
logits = self.extractive_summarization_model.forward(
**input).logits
logits = self.model.forward(**input).logits

predictions = np.argmax(logits, axis=2)
assert len(sentences) == len(


+ 17
- 1
modelscope/pipelines/nlp/faq_question_answering_pipeline.py View File

@@ -20,8 +20,24 @@ class FaqQuestionAnsweringPipeline(Pipeline):
def __init__(self,
model: Union[str, Model],
preprocessor: Preprocessor = None,
config_file: str = None,
device: str = 'gpu',
auto_collate=True,
**kwargs):
super().__init__(model=model, preprocessor=preprocessor, **kwargs)
"""The faq question answering pipeline.

Args:
model (str or Model): A model instance or a model local dir or a model id in the model hub.
preprocessor (Preprocessor, `optional`): a preprocessor instance
kwargs (dict, `optional`):
The preprocessor kwargs passed into the preprocessor's constructor.
"""
super().__init__(
model=model,
preprocessor=preprocessor,
config_file=config_file,
device=device,
auto_collate=auto_collate)
if preprocessor is None:
self.preprocessor = Preprocessor.from_pretrained(
self.model.model_dir, **kwargs)


modelscope/pipelines/nlp/fasttext_sequence_classification_pipeline.py → modelscope/pipelines/nlp/fasttext_text_classification_pipeline.py View File

@@ -9,11 +9,9 @@ from fasttext import load_model
from fasttext.FastText import _FastText

from modelscope.metainfo import Pipelines
from modelscope.models import Model
from modelscope.outputs import OutputKeys
from modelscope.pipelines.base import Input, Pipeline
from modelscope.pipelines.builder import PIPELINES
from modelscope.preprocessors import SequenceClassificationPreprocessor
from modelscope.utils.constant import ModelFile, Tasks

__all__ = ['FasttextSequenceClassificationPipeline']
@@ -36,8 +34,7 @@ class FasttextSequenceClassificationPipeline(Pipeline):
"""use `model` and `preprocessor` to create a nlp text classification pipeline for prediction

Args:
model: a model directory including model.bin and spm.model
preprocessor (SequenceClassificationPreprocessor): a preprocessor instance
model: A model directory including model.bin and spm.model
"""
super().__init__(model=model)
model_file = os.path.join(model, ModelFile.TORCH_MODEL_BIN_FILE)
@@ -53,8 +50,11 @@ class FasttextSequenceClassificationPipeline(Pipeline):
text_sp = sentencepiece_tokenize(self.spm, text)
return {'text_sp': text_sp, 'text': text}

def forward(self, inputs: Dict[str, Any]) -> Dict[str, Any]:
topk = inputs.get('topk', -1)
def forward(self,
inputs: Dict[str, Any],
topk: int = None) -> Dict[str, Any]:
if topk is None:
topk = inputs.get('topk', -1)
label, probs = self.model.predict(inputs['text_sp'], k=topk)
label = [x.replace('__label__', '') for x in label]
result = {

+ 19
- 15
modelscope/pipelines/nlp/feature_extraction_pipeline.py View File

@@ -9,7 +9,8 @@ from modelscope.models import Model
from modelscope.outputs import OutputKeys
from modelscope.pipelines.base import Pipeline, Tensor
from modelscope.pipelines.builder import PIPELINES
from modelscope.preprocessors import NLPPreprocessor, Preprocessor
from modelscope.preprocessors import (FillMaskTransformersPreprocessor,
Preprocessor)
from modelscope.utils.config import Config
from modelscope.utils.constant import ModelFile, Tasks

@@ -23,7 +24,11 @@ class FeatureExtractionPipeline(Pipeline):
def __init__(self,
model: Union[Model, str],
preprocessor: Optional[Preprocessor] = None,
first_sequence='sentence',
config_file: str = None,
device: str = 'gpu',
auto_collate=True,
padding=False,
sequence_length=128,
**kwargs):
"""Use `model` and `preprocessor` to create a nlp feature extraction pipeline for prediction

@@ -32,11 +37,8 @@ class FeatureExtractionPipeline(Pipeline):
no-head model id from the model hub, or a torch model instance.
preprocessor (Preprocessor): An optional preprocessor instance, please make sure the preprocessor fits for
the model if supplied.
first_sequence: The key to read the sentence in.
sequence_length: Max sequence length in the user's custom scenario. 128 will be used as a default value.

NOTE: Inputs of type 'str' are also supported. In this scenario, the 'first_sequence'
param will have no effect.
kwargs (dict, `optional`):
Extra kwargs passed into the preprocessor's constructor.

Example:
>>> from modelscope.pipelines import pipeline
@@ -46,19 +48,21 @@ class FeatureExtractionPipeline(Pipeline):


"""
super().__init__(model=model, preprocessor=preprocessor, **kwargs)
super().__init__(
model=model,
preprocessor=preprocessor,
config_file=config_file,
device=device,
auto_collate=auto_collate)

if preprocessor is None:
self.preprocessor = NLPPreprocessor(
self.preprocessor = Preprocessor.from_pretrained(
self.model.model_dir,
padding=kwargs.pop('padding', False),
sequence_length=kwargs.pop('sequence_length', 128))
padding=padding,
sequence_length=sequence_length,
**kwargs)
self.model.eval()

self.config = Config.from_file(
os.path.join(self.model.model_dir, ModelFile.CONFIGURATION))
self.tokenizer = self.preprocessor.tokenizer

def forward(self, inputs: Dict[str, Any],
**forward_params) -> Dict[str, Any]:
with torch.no_grad():
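
Note (illustrative, not part of the diff): padding and sequence_length are now explicit constructor parameters and the preprocessor is resolved through Preprocessor.from_pretrained. A sketch of direct construction; the model id is a placeholder.

from modelscope.pipelines.nlp import FeatureExtractionPipeline

pipe = FeatureExtractionPipeline(
    'damo/nlp_structbert_feature-extraction_chinese-base',  # placeholder model id
    padding=True,
    sequence_length=256)
result = pipe('这是一个测试')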


+ 20
- 14
modelscope/pipelines/nlp/fill_mask_pipeline.py View File

@@ -23,7 +23,11 @@ class FillMaskPipeline(Pipeline):
def __init__(self,
model: Union[Model, str],
preprocessor: Optional[Preprocessor] = None,
first_sequence: str = 'sentence',
config_file: str = None,
device: str = 'gpu',
auto_collate=True,
first_sequence='sentence',
sequence_length=128,
**kwargs):
"""The inference pipeline for all the fill mask sub-tasks.

@@ -31,11 +35,8 @@ class FillMaskPipeline(Pipeline):
model (`str` or `Model` or module instance): A model instance or a model local dir
or a model id in the model hub.
preprocessor (`Preprocessor`, `optional`): A Preprocessor instance.
first_sequence (`str`, `optional`): The key to read the sentence in.
sequence_length (`int`, `optional`): Max sequence length in the user's custom scenario, default 128.

NOTE1: Inputs of type 'str' are also supported. In this scenario, the 'first_sequence'
param will have no effect.
kwargs (dict, `optional`):
Extra kwargs passed into the preprocessor's constructor.

Example1:
>>> from modelscope.pipelines import pipeline
@@ -51,20 +52,25 @@ class FillMaskPipeline(Pipeline):
NOTE2: Please pay attention to the model's special tokens.
If a bert based model (bert, structbert, etc.) is used, the mask token is '[MASK]'.
If an xlm-roberta based model (xlm-roberta, veco, etc.) is used, the mask token is '<mask>'.
To view other examples plese check the tests/pipelines/test_fill_mask.py.
To view other examples please check tests/pipelines/test_fill_mask.py.
"""
super().__init__(model=model, preprocessor=preprocessor, **kwargs)
super().__init__(
model=model,
preprocessor=preprocessor,
config_file=config_file,
device=device,
auto_collate=auto_collate)

if preprocessor is None:
self.preprocessor = Preprocessor.from_pretrained(
self.model.model_dir,
first_sequence=first_sequence,
second_sequence=None,
sequence_length=kwargs.pop('sequence_length', 128))
assert hasattr(
self.preprocessor, 'mask_id'
), 'The input preprocessor should have the mask_id attribute.'

sequence_length=sequence_length,
**kwargs)
self.model.eval()
assert hasattr(
self.preprocessor, 'mask_id'
), 'The input preprocessor should have the mask_id attribute.'

def forward(self, inputs: Dict[str, Any],
**forward_params) -> Dict[str, Any]:
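
Note (illustrative, not part of the diff): whichever way the preprocessor is obtained, __init__ now asserts that it exposes mask_id. A minimal sketch of the contract a custom preprocessor has to satisfy; the class name and token id are made up.

from modelscope.preprocessors import Preprocessor


class MyFillMaskPreprocessor(Preprocessor):
    """Illustrative only: anything passed to FillMaskPipeline must carry mask_id."""

    def __init__(self, model_dir: str, **kwargs):
        super().__init__(**kwargs)
        self.mask_id = 103  # e.g. the [MASK] token id of a BERT vocabulary

    def __call__(self, data, **kwargs):
        raise NotImplementedError  # tokenization omitted in this sketch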


+ 26
- 6
modelscope/pipelines/nlp/information_extraction_pipeline.py View File

@@ -8,8 +8,7 @@ from modelscope.metainfo import Pipelines
from modelscope.models import Model
from modelscope.pipelines.base import Pipeline
from modelscope.pipelines.builder import PIPELINES
from modelscope.preprocessors import (Preprocessor,
RelationExtractionPreprocessor)
from modelscope.preprocessors import Preprocessor
from modelscope.utils.constant import Tasks

__all__ = ['InformationExtractionPipeline']
@@ -24,12 +23,33 @@ class InformationExtractionPipeline(Pipeline):
def __init__(self,
model: Union[Model, str],
preprocessor: Optional[Preprocessor] = None,
config_file: str = None,
device: str = 'gpu',
auto_collate=True,
sequence_length=512,
**kwargs):
super().__init__(model=model, preprocessor=preprocessor, **kwargs)
if preprocessor is None:
self.preprocessor = RelationExtractionPreprocessor(
"""

Args:
model (str or Model): Supply either a local model dir which supports the information extraction task, or a
model id from the model hub, or a torch model instance.
preprocessor (Preprocessor): An optional preprocessor instance, please make sure the preprocessor fits for
the model if supplied.
kwargs (dict, `optional`):
Extra kwargs passed into the preprocessor's constructor.
"""
super().__init__(
model=model,
preprocessor=preprocessor,
config_file=config_file,
device=device,
auto_collate=auto_collate)

if self.preprocessor is None:
self.preprocessor = Preprocessor.from_pretrained(
self.model.model_dir,
sequence_length=kwargs.pop('sequence_length', 512))
sequence_length=sequence_length,
**kwargs)
self.model.eval()

def forward(self, inputs: Dict[str, Any],


+ 26
- 54
modelscope/pipelines/nlp/named_entity_recognition_pipeline.py View File

@@ -1,36 +1,35 @@
# Copyright (c) Alibaba, Inc. and its affiliates.

from typing import Any, Dict, Optional, Union

import torch
from typing import Optional, Union

from modelscope.metainfo import Pipelines
from modelscope.models import Model
from modelscope.outputs import OutputKeys
from modelscope.pipelines.base import Pipeline
from modelscope.pipelines.builder import PIPELINES
from modelscope.pipelines.nlp import TokenClassificationPipeline
from modelscope.preprocessors import (NERPreprocessorThai, NERPreprocessorViet,
Preprocessor,
TokenClassificationPreprocessor)
from modelscope.preprocessors import Preprocessor
from modelscope.utils.constant import Tasks
from modelscope.utils.tensor_utils import (torch_nested_detach,
torch_nested_numpify)

__all__ = [
'NamedEntityRecognitionPipeline', 'NamedEntityRecognitionThaiPipeline',
'NamedEntityRecognitionVietPipeline'
]
__all__ = ['NamedEntityRecognitionPipeline']


@PIPELINES.register_module(
Tasks.named_entity_recognition,
module_name=Pipelines.named_entity_recognition)
@PIPELINES.register_module(
Tasks.named_entity_recognition,
module_name=Pipelines.named_entity_recognition_thai)
@PIPELINES.register_module(
Tasks.named_entity_recognition,
module_name=Pipelines.named_entity_recognition_viet)
class NamedEntityRecognitionPipeline(TokenClassificationPipeline):

def __init__(self,
model: Union[Model, str],
preprocessor: Optional[Preprocessor] = None,
config_file: str = None,
device: str = 'gpu',
auto_collate=True,
sequence_length=128,
**kwargs):
"""Use `model` and `preprocessor` to create a nlp NER pipeline for prediction

@@ -39,8 +38,8 @@ class NamedEntityRecognitionPipeline(TokenClassificationPipeline):
model id from the model hub, or a torch model instance.
preprocessor (Preprocessor): An optional preprocessor instance, please make sure the preprocessor fits for
the model if supplied.
sequence_length: Max sequence length in the user's custom scenario. 512 will be used as a default value.
kwargs (dict, `optional`):
Extra kwargs passed into the preprocessor's constructor.
Example:
>>> from modelscope.pipelines import pipeline
>>> pipeline_ins = pipeline(task='named-entity-recognition',
@@ -50,44 +49,17 @@ class NamedEntityRecognitionPipeline(TokenClassificationPipeline):

To view other examples please check the tests/pipelines/test_named_entity_recognition.py.
"""
super().__init__(model=model, preprocessor=preprocessor, **kwargs)
super().__init__(
model=model,
preprocessor=preprocessor,
config_file=config_file,
device=device,
auto_collate=auto_collate)
if preprocessor is None:
self.preprocessor = TokenClassificationPreprocessor(
self.preprocessor = Preprocessor.from_pretrained(
self.model.model_dir,
sequence_length=kwargs.pop('sequence_length', 128))
sequence_length=sequence_length,
**kwargs)
self.model.eval()
self.id2label = kwargs.get('id2label')
if self.id2label is None and hasattr(self.preprocessor, 'id2label'):
self.id2label = self.preprocessor.id2label


@PIPELINES.register_module(
Tasks.named_entity_recognition,
module_name=Pipelines.named_entity_recognition_thai)
class NamedEntityRecognitionThaiPipeline(NamedEntityRecognitionPipeline):

def __init__(self,
model: Union[Model, str],
preprocessor: Optional[Preprocessor] = None,
**kwargs):
super().__init__(model=model, preprocessor=preprocessor, **kwargs)
if preprocessor is None:
self.preprocessor = NERPreprocessorThai(
self.model.model_dir,
sequence_length=kwargs.pop('sequence_length', 512))


@PIPELINES.register_module(
Tasks.named_entity_recognition,
module_name=Pipelines.named_entity_recognition_viet)
class NamedEntityRecognitionVietPipeline(NamedEntityRecognitionPipeline):

def __init__(self,
model: Union[Model, str],
preprocessor: Optional[Preprocessor] = None,
**kwargs):
super().__init__(model=model, preprocessor=preprocessor, **kwargs)
if preprocessor is None:
self.preprocessor = NERPreprocessorViet(
self.model.model_dir,
sequence_length=kwargs.pop('sequence_length', 512))
assert hasattr(self.preprocessor, 'id2label')
self.id2label = self.preprocessor.id2label
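
Note (illustrative, not part of the diff): with the Thai and Viet subclasses removed, all three registered module names resolve to NamedEntityRecognitionPipeline, and the language-specific preprocessor is picked by Preprocessor.from_pretrained from the model directory. A hedged usage sketch; the model id is a placeholder.

from modelscope.pipelines import pipeline

ner = pipeline(
    task='named-entity-recognition',
    model='damo/nlp_raner_named-entity-recognition_thai-large-generic')  # placeholder id
print(ner('สวัสดีครับ ผมชื่อสมชาย'))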

+ 15
- 7
modelscope/pipelines/nlp/sentence_embedding_pipeline.py View File

@@ -22,7 +22,10 @@ class SentenceEmbeddingPipeline(Pipeline):
def __init__(self,
model: Union[Model, str],
preprocessor: Optional[Preprocessor] = None,
first_sequence='first_sequence',
config_file: str = None,
device: str = 'gpu',
auto_collate=True,
sequence_length=128,
**kwargs):
"""Use `model` and `preprocessor` to create a nlp text dual encoder then generates the text representation.
Args:
@@ -30,15 +33,20 @@ class SentenceEmbeddingPipeline(Pipeline):
or a model id from the model hub, or a torch model instance.
preprocessor (Preprocessor): An optional preprocessor instance, please make sure the preprocessor fits for
the model if supplied.
sequence_length: Max sequence length in the user's custom scenario. 128 will be used as a default value.
kwargs (dict, `optional`):
Extra kwargs passed into the preprocessor's constructor.
"""
super().__init__(model=model, preprocessor=preprocessor, **kwargs)
super().__init__(
model=model,
preprocessor=preprocessor,
config_file=config_file,
device=device,
auto_collate=auto_collate)
if preprocessor is None:
self.preprocessor = Preprocessor.from_pretrained(
self.model.model_dir
if isinstance(self.model, Model) else model,
first_sequence=first_sequence,
sequence_length=kwargs.pop('sequence_length', 128))
self.model.model_dir,
sequence_length=sequence_length,
**kwargs)

def forward(self, inputs: Dict[str, Any],
**forward_params) -> Dict[str, Any]:


+ 23
- 7
modelscope/pipelines/nlp/summarization_pipeline.py View File

@@ -1,12 +1,11 @@
# Copyright (c) Alibaba, Inc. and its affiliates.
from typing import Any, Dict, Optional, Union

from modelscope.metainfo import Pipelines
from modelscope.models.multi_modal import OfaForAllTasks
from modelscope.metainfo import Pipelines, Preprocessors
from modelscope.pipelines.base import Model, Pipeline
from modelscope.pipelines.builder import PIPELINES
from modelscope.preprocessors import OfaPreprocessor, Preprocessor
from modelscope.utils.constant import Tasks
from modelscope.preprocessors import Preprocessor
from modelscope.utils.constant import Fields, Tasks
from modelscope.utils.logger import get_logger

logger = get_logger()
@@ -19,6 +18,9 @@ class SummarizationPipeline(Pipeline):
def __init__(self,
model: Union[Model, str],
preprocessor: Optional[Preprocessor] = None,
config_file: str = None,
device: str = 'gpu',
auto_collate=True,
**kwargs):
"""Use `model` and `preprocessor` to create a Summarization pipeline for prediction.

@@ -26,11 +28,25 @@ class SummarizationPipeline(Pipeline):
model (str or Model): Supply either a local model dir which supports the summarization task,
or a model id from the model hub, or a model instance.
preprocessor (Preprocessor): An optional preprocessor instance.
kwargs (dict, `optional`):
Extra kwargs passed into the preprocessor's constructor.
"""
super().__init__(model=model, preprocessor=preprocessor, **kwargs)
super().__init__(
model=model,
preprocessor=preprocessor,
config_file=config_file,
device=device,
auto_collate=auto_collate)
self.model.eval()
if preprocessor is None and isinstance(self.model, OfaForAllTasks):
self.preprocessor = OfaPreprocessor(model_dir=self.model.model_dir)
if preprocessor is None:
if self.model.__class__.__name__ == 'OfaForAllTasks':
self.preprocessor = Preprocessor.from_pretrained(
self.model.model_dir,
type=Preprocessors.ofa_tasks_preprocessor,
field=Fields.multi_modal)
else:
self.preprocessor = Preprocessor.from_pretrained(
self.model.model_dir, **kwargs)

def postprocess(self, inputs: Dict[str, Any]) -> Dict[str, Any]:
return inputs
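
Note (illustrative, not part of the diff): the OFA branch above builds its preprocessor by name through Preprocessor.from_pretrained instead of importing OfaPreprocessor directly. The same call, isolated; the model directory is a placeholder path.

from modelscope.metainfo import Preprocessors
from modelscope.preprocessors import Preprocessor
from modelscope.utils.constant import Fields

preprocessor = Preprocessor.from_pretrained(
    '/path/to/ofa_model_dir',                    # placeholder path
    type=Preprocessors.ofa_tasks_preprocessor,
    field=Fields.multi_modal)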

+ 13
- 2
modelscope/pipelines/nlp/table_question_answering_pipeline.py View File

@@ -33,6 +33,9 @@ class TableQuestionAnsweringPipeline(Pipeline):
model: Union[TableQuestionAnswering, str],
preprocessor: TableQuestionAnsweringPreprocessor = None,
db: Database = None,
config_file: str = None,
device: str = 'gpu',
auto_collate=True,
**kwargs):
"""use `model` and `preprocessor` to create a table question answering prediction pipeline

@@ -40,11 +43,19 @@ class TableQuestionAnsweringPipeline(Pipeline):
model (TableQuestionAnswering): a model instance
preprocessor (TableQuestionAnsweringPreprocessor): a preprocessor instance
db (Database): a database to store tables in the database
kwargs (dict, `optional`):
Extra kwargs passed into the preprocessor's constructor.
"""
super().__init__(model=model, preprocessor=preprocessor, **kwargs)
super().__init__(
model=model,
preprocessor=preprocessor,
config_file=config_file,
device=device,
auto_collate=auto_collate)

if preprocessor is None:
self.preprocessor = TableQuestionAnsweringPreprocessor(
self.model.model_dir)
self.model.model_dir, **kwargs)

# initialize tokenizer
self.tokenizer = BertTokenizer(


+ 0
- 113
modelscope/pipelines/nlp/text2text_generation_pipeline.py View File

@@ -1,113 +0,0 @@
# Copyright (c) Alibaba, Inc. and its affiliates.
from typing import Any, Dict, List, Optional, Union

import torch
from numpy import isin

from modelscope.metainfo import Pipelines
from modelscope.models.base import Model
from modelscope.outputs import OutputKeys
from modelscope.pipelines.base import Input, Pipeline, Tensor
from modelscope.pipelines.builder import PIPELINES
from modelscope.preprocessors import Text2TextGenerationPreprocessor
from modelscope.utils.config import use_task_specific_params
from modelscope.utils.constant import Tasks

__all__ = ['Text2TextGenerationPipeline']

TRANSLATE_PIPELINES = [
Pipelines.translation_en_to_de,
Pipelines.translation_en_to_ro,
Pipelines.translation_en_to_fr,
]


@PIPELINES.register_module(
Tasks.text2text_generation, module_name=Pipelines.text2text_generation)
@PIPELINES.register_module(
Tasks.text2text_generation, module_name=Pipelines.translation_en_to_de)
@PIPELINES.register_module(
Tasks.text2text_generation, module_name=Pipelines.translation_en_to_ro)
@PIPELINES.register_module(
Tasks.text2text_generation, module_name=Pipelines.translation_en_to_fr)
class Text2TextGenerationPipeline(Pipeline):

def __init__(
self,
model: Union[Model, str],
preprocessor: Optional[Text2TextGenerationPreprocessor] = None,
first_sequence='sentence',
**kwargs):
"""Use `model` and `preprocessor` to create a text to text generation pipeline for prediction.

Args:
model (str or Model): Supply either a local model dir which supported the text generation task,
or a model id from the model hub, or a torch model instance.
preprocessor (Preprocessor): An optional preprocessor instance, please make sure the preprocessor fits for
the model if supplied.
first_sequence: The key to read the first sentence in.
sequence_length: Max sequence length in the user's custom scenario. 128 will be used as a default value.

NOTE: Inputs of type 'str' are also supported. In this scenario, the 'first_sequence'
param will have no effect.

Example:
>>> from modelscope.pipelines import pipeline
>>> pipeline_ins = pipeline(task='text2text-generation',
>>> model='damo/nlp_t5_text2text-generation_chinese-base')
>>> sentence1 = '中国的首都位于<extra_id_0>。'
>>> print(pipeline_ins(sentence1))
>>> # Or use the dict input:
>>> print(pipeline_ins({'sentence': sentence1}))
>>> # 北京

To view other examples plese check the tests/pipelines/test_text_generation.py.
"""
super().__init__(model=model, preprocessor=preprocessor, **kwargs)
if preprocessor is None:
self.preprocessor = Text2TextGenerationPreprocessor(
self.model.model_dir,
sequence_length=kwargs.pop('sequence_length', 128))
self.tokenizer = self.preprocessor.tokenizer
self.pipeline = self.model.pipeline.type
self.model.eval()

def preprocess(self, inputs: Input, **preprocess_params) -> Dict[str, Any]:
""" Provide specific preprocess for text2text generation pipeline in order to handl multi tasks
"""
if not isinstance(inputs, str):
raise ValueError(f'Not supported input type: {type(inputs)}')

if self.pipeline in TRANSLATE_PIPELINES:
use_task_specific_params(self.model, self.pipeline)
inputs = self.model.config.prefix + inputs

return super().preprocess(inputs, **preprocess_params)

def forward(self, inputs: Dict[str, Any],
**forward_params) -> Dict[str, Any]:

forward_params['min_length'] = forward_params.get(
'min_length', self.model.config.min_length)
forward_params['max_length'] = forward_params.get(
'max_length', self.model.config.max_length)

with torch.no_grad():
output_ids = self.model.generate(**inputs, **forward_params)
return {'output_ids': output_ids}

def postprocess(self, inputs: Dict[str, Tensor],
**postprocess_params) -> Dict[str, str]:
"""process the prediction results

Args:
inputs (Dict[str, Any]): _description_

Returns:
Dict[str, str]: the prediction results
"""
output = self.tokenizer.decode(
inputs['output_ids'][0],
skip_special_tokens=True,
)
return {OutputKeys.TEXT: output}

+ 49
- 35
modelscope/pipelines/nlp/text_classification_pipeline.py View File

@@ -5,11 +5,14 @@ import numpy as np

from modelscope.metainfo import Pipelines, Preprocessors
from modelscope.models.base import Model
from modelscope.outputs import OutputKeys
from modelscope.outputs import OutputKeys, TextClassificationModelOutput
from modelscope.pipelines.base import Pipeline
from modelscope.pipelines.builder import PIPELINES
from modelscope.preprocessors import Preprocessor
from modelscope.utils.constant import Fields, Tasks
from modelscope.utils.logger import get_logger

logger = get_logger(__name__)


@PIPELINES.register_module(
@@ -31,6 +34,9 @@ class TextClassificationPipeline(Pipeline):
def __init__(self,
model: Union[Model, str],
preprocessor: Preprocessor = None,
config_file: str = None,
device: str = 'gpu',
auto_collate=True,
**kwargs):
"""The inference pipeline for all the text classification sub-tasks.

@@ -38,10 +44,8 @@ class TextClassificationPipeline(Pipeline):
model (`str` or `Model` or module instance): A model instance or a model local dir
or a model id in the model hub.
preprocessor (`Preprocessor`, `optional`): A Preprocessor instance.
first_sequence (`str`, `optional`): The key of the first sentence.
second_sequence (`str`, `optional`): The key of the second sentence.
sequence_length (`int`, `optional`): The sequence length.
id2label (`dict`, `optional`): The id-label mapping.
kwargs (dict, `optional`):
Extra kwargs passed into the preprocessor's constructor.

Example:
>>> from modelscope.pipelines import pipeline
@@ -49,31 +53,38 @@ class TextClassificationPipeline(Pipeline):
model='damo/nlp_structbert_sentence-similarity_chinese-base')
>>> input = ('这是个测试', '这也是个测试')
>>> print(pipeline_ins(input))

NOTE: Inputs of type 'str' are also supported. In this scenario, the 'first_sequence' and 'second_sequence'
params will have no effect.
"""
super().__init__(model=model, preprocessor=preprocessor, **kwargs)
super().__init__(
model=model,
preprocessor=preprocessor,
config_file=config_file,
device=device,
auto_collate=auto_collate)

if preprocessor is None:
if self.model.__class__.__name__ == 'OfaForAllTasks':
self.preprocessor = Preprocessor.from_pretrained(
model_name_or_path=self.model.model_dir,
type=Preprocessors.ofa_tasks_preprocessor,
field=Fields.multi_modal)
field=Fields.multi_modal,
**kwargs)
else:
first_sequence = kwargs.pop('first_sequence', 'first_sequence')
second_sequence = kwargs.pop('second_sequence', None)
sequence_length = kwargs.pop('sequence_length', 512)
self.preprocessor = Preprocessor.from_pretrained(
self.model
if isinstance(self.model, str) else self.model.model_dir,
first_sequence=first_sequence,
second_sequence=second_sequence,
sequence_length=kwargs.pop('sequence_length', 512))

self.id2label = kwargs.get('id2label')
if self.id2label is None and hasattr(self.preprocessor, 'id2label'):
self.id2label = self.preprocessor.id2label
self.model.model_dir, **{
'first_sequence': first_sequence,
'second_sequence': second_sequence,
'sequence_length': sequence_length,
**kwargs
})
assert hasattr(self.preprocessor, 'id2label')
self.id2label = self.preprocessor.id2label
if self.id2label is None:
logger.warn(
'The id2label mapping is None, will return original ids.'
)

def forward(self, inputs: Dict[str, Any],
**forward_params) -> Dict[str, Any]:
@@ -82,16 +93,17 @@ class TextClassificationPipeline(Pipeline):
return self.model(**inputs, **forward_params)

def postprocess(self,
inputs: Dict[str, Any],
topk: int = 5) -> Dict[str, str]:
"""process the prediction results
inputs: Union[Dict[str, Any],
TextClassificationModelOutput],
topk: int = None) -> Dict[str, Any]:
"""Process the prediction results

Args:
inputs (`Dict[str, Any]` or `TextClassificationModelOutput`): The model output, please check
the `TextClassificationModelOutput` class for details.
topk (int, `optional`): The number of top classes to return; all classes are returned when None.
Returns:
Dict[str, str]: the prediction results.
Dict[str, Any]: the prediction results.
scores: The probabilities of each label.
labels: The real labels.
Label at index 0 has the largest probability.
@@ -99,8 +111,6 @@ class TextClassificationPipeline(Pipeline):
if self.model.__class__.__name__ == 'OfaForAllTasks':
return inputs
else:
assert self.id2label is not None, 'Cannot convert id to the original label, please pass in the mapping ' \
'as a parameter or make sure the preprocessor has the attribute.'
logits = inputs[OutputKeys.LOGITS].cpu().numpy()
if logits.shape[0] == 1:
logits = logits[0]
@@ -111,20 +121,24 @@ class TextClassificationPipeline(Pipeline):

probs = softmax(logits)
num_classes = probs.shape[-1]
topk = min(topk, num_classes)
topk = min(topk, num_classes) if topk is not None else num_classes
top_indices = np.argpartition(probs, -topk)[-topk:]
probs = np.take_along_axis(probs, top_indices, axis=-1).tolist()

def map_to_label(id):
if id in self.id2label:
return self.id2label[id]
elif str(id) in self.id2label:
return self.id2label[str(id)]
if self.id2label is not None:
if id in self.id2label:
return self.id2label[id]
elif str(id) in self.id2label:
return self.id2label[str(id)]
else:
raise Exception(
f'id {id} not found in id2label: {self.id2label}')
else:
raise Exception('id not found in id2label')
return id

v_func = np.vectorize(map_to_label)
return {
OutputKeys.SCORES: probs,
OutputKeys.LABELS: v_func(top_indices).tolist()
}
top_indices = v_func(top_indices).tolist()
probs = list(reversed(probs))
top_indices = list(reversed(top_indices))
return {OutputKeys.SCORES: probs, OutputKeys.LABELS: top_indices}
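
Note (illustrative, not part of the diff): two behavioural points of the new postprocess, sketched. topk=None keeps all classes, and when no id2label mapping is found the raw ids are returned instead of raising. The model id is a placeholder and the logits tensor is made up.

import torch

from modelscope.outputs import TextClassificationModelOutput
from modelscope.pipelines import pipeline

clf = pipeline(
    task='text-classification',
    model='damo/nlp_structbert_text-classification_chinese-base')  # placeholder id

fake_output = TextClassificationModelOutput(logits=torch.tensor([[1.0, 2.0, 0.5]]))
# postprocess accepts the output object directly; topk=None returns every class.
print(clf.postprocess(fake_output, topk=None))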

+ 19
- 12
modelscope/pipelines/nlp/text_error_correction_pipeline.py View File

@@ -10,7 +10,7 @@ from modelscope.models.nlp import BartForTextErrorCorrection
from modelscope.outputs import OutputKeys
from modelscope.pipelines.base import Pipeline, Tensor
from modelscope.pipelines.builder import PIPELINES
from modelscope.preprocessors import TextErrorCorrectionPreprocessor
from modelscope.preprocessors import Preprocessor
from modelscope.utils.constant import Tasks

__all__ = ['TextErrorCorrectionPipeline']
@@ -20,17 +20,20 @@ __all__ = ['TextErrorCorrectionPipeline']
Tasks.text_error_correction, module_name=Pipelines.text_error_correction)
class TextErrorCorrectionPipeline(Pipeline):

def __init__(
self,
model: Union[BartForTextErrorCorrection, str],
preprocessor: Optional[TextErrorCorrectionPreprocessor] = None,
**kwargs):
def __init__(self,
model: Union[Model, str],
preprocessor: Optional[Preprocessor] = None,
config_file: str = None,
device: str = 'gpu',
auto_collate=True,
**kwargs):
"""use `model` and `preprocessor` to create a nlp text correction pipeline.

Args:
model (BartForTextErrorCorrection): A model instance, or a model local dir, or a model id in the model hub.
preprocessor (TextErrorCorrectionPreprocessor): An optional preprocessor instance.

kwargs (dict, `optional`):
Extra kwargs passed into the preprocessor's constructor.
Example:
>>> from modelscope.pipelines import pipeline
>>> pipeline_ins = pipeline(
@@ -38,13 +41,17 @@ class TextErrorCorrectionPipeline(Pipeline):
>>> sentence1 = '随着中国经济突飞猛近,建造工业与日俱增'
>>> print(pipeline_ins(sentence1))

To view other examples plese check the tests/pipelines/test_text_error_correction.py.
To view other examples please check tests/pipelines/test_text_error_correction.py.
"""
super().__init__(model=model, preprocessor=preprocessor, **kwargs)

super().__init__(
model=model,
preprocessor=preprocessor,
config_file=config_file,
device=device,
auto_collate=auto_collate)
if preprocessor is None:
self.preprocessor = TextErrorCorrectionPreprocessor(
self.model.model_dir)
self.preprocessor = Preprocessor.from_pretrained(
self.model.model_dir, **kwargs)
self.vocab = self.preprocessor.vocab

def forward(self, inputs: Dict[str, Any],


+ 102
- 34
modelscope/pipelines/nlp/text_generation_pipeline.py View File

@@ -1,20 +1,22 @@
# Copyright (c) Alibaba, Inc. and its affiliates.

import os
from typing import Any, Dict, Optional, Union

import torch

from modelscope.metainfo import Pipelines
from modelscope.models.base import Model
from modelscope.outputs import OutputKeys
from modelscope.outputs import (ModelOutputBase, OutputKeys,
TokenGeneratorOutput)
from modelscope.pipelines.base import Pipeline, Tensor
from modelscope.pipelines.builder import PIPELINES
from modelscope.preprocessors import Preprocessor, build_preprocessor
from modelscope.preprocessors import Preprocessor
from modelscope.utils.chinese_utils import remove_space_between_chinese_chars
from modelscope.utils.constant import Fields, Tasks
from modelscope.utils.hub import read_config
from modelscope.utils.constant import Tasks
from modelscope.utils.hub import Config, read_config

__all__ = ['TextGenerationPipeline']
__all__ = ['TextGenerationPipeline', 'TextGenerationT5Pipeline']


@PIPELINES.register_module(
@@ -24,7 +26,11 @@ class TextGenerationPipeline(Pipeline):
def __init__(self,
model: Union[Model, str],
preprocessor: Optional[Preprocessor] = None,
config_file: str = None,
device: str = 'gpu',
auto_collate=True,
first_sequence='sentence',
sequence_length=128,
**kwargs):
"""Use `model` and `preprocessor` to create a generation pipeline for prediction.

@@ -33,11 +39,8 @@ class TextGenerationPipeline(Pipeline):
or a model id from the model hub, or a torch model instance.
preprocessor (Preprocessor): An optional preprocessor instance, please make sure the preprocessor fits for
the model if supplied.
first_sequence: The key to read the first sentence in.
sequence_length: Max sequence length in the user's custom scenario. 128 will be used as a default value.

NOTE: Inputs of type 'str' are also supported. In this scenario, the 'first_sequence'
param will have no effect.
kwargs (dict, `optional`):
Extra kwargs passed into the preprocessor's constructor.

Example:
>>> from modelscope.pipelines import pipeline
@@ -49,26 +52,29 @@ class TextGenerationPipeline(Pipeline):
>>> # Or use the dict input:
>>> print(pipeline_ins({'sentence': sentence1}))

To view other examples please check the tests/pipelines/test_text_generation.py.
To view other examples please check tests/pipelines/test_text_generation.py.
"""
super().__init__(model=model, preprocessor=preprocessor, **kwargs)
cfg = read_config(self.model.model_dir)
self.postprocessor = cfg.pop('postprocessor', 'decode')
super().__init__(
model=model,
preprocessor=preprocessor,
config_file=config_file,
device=device,
auto_collate=auto_collate)

if preprocessor is None:
preprocessor_cfg = cfg.preprocessor
preprocessor_cfg.update({
'model_dir':
self.preprocessor = Preprocessor.from_pretrained(
self.model.model_dir,
'first_sequence':
first_sequence,
'second_sequence':
None,
'sequence_length':
kwargs.pop('sequence_length', 128)
})
self.preprocessor = build_preprocessor(preprocessor_cfg,
Fields.nlp)
first_sequence=first_sequence,
sequence_length=sequence_length,
**kwargs)
self.model.eval()
self.postprocessor = kwargs.pop('postprocessor', None)
if self.postprocessor is None and hasattr(self.model, 'model_dir'):
# Compatible with old code
cfg = read_config(self.model.model_dir)
self.postprocessor = cfg.get('postprocessor')
if self.postprocessor is None:
self.postprocessor = 'decode'

def _sanitize_parameters(self, **pipeline_parameters):
return {}, pipeline_parameters, {}
@@ -79,20 +85,19 @@ class TextGenerationPipeline(Pipeline):
return self.model.generate(inputs, **forward_params)

def decode(self, inputs) -> str:
tokenizer = self.preprocessor.tokenizer
return tokenizer.decode(inputs.tolist(), skip_special_tokens=True)
return self.preprocessor.decode(
inputs.tolist(), skip_special_tokens=True)

def sentence_piece(self, inputs) -> str:
tokenizer = self.preprocessor.tokenizer
return tokenizer.decode(inputs.tolist())
return self.preprocessor.decode(inputs.tolist())

def roberta(self, inputs) -> str:
tokenizer = self.preprocessor.tokenizer
decoded = tokenizer.decode(inputs.tolist())
decoded = self.preprocessor.decode(inputs.tolist())
return decoded.replace('<q>', '. ').replace('<mask>',
'. ').replace('</s>', '')

def postprocess(self, inputs: Dict[str, Tensor],
def postprocess(self, inputs: Union[Dict[str, Tensor],
TokenGeneratorOutput],
**postprocess_params) -> Dict[str, str]:
"""process the prediction results

@@ -102,9 +107,72 @@ class TextGenerationPipeline(Pipeline):
Returns:
Dict[str, str]: the prediction results
"""
inputs = inputs['sequences']
if isinstance(inputs, (dict, ModelOutputBase)):
inputs = inputs['sequences']
if isinstance(inputs, list) or len(inputs.shape) > 1:
inputs = inputs[0]
decoded = getattr(self, self.postprocessor)(inputs)
text = remove_space_between_chinese_chars(decoded)
return {OutputKeys.TEXT: text}


@PIPELINES.register_module(
Tasks.text2text_generation, module_name=Pipelines.translation_en_to_de)
@PIPELINES.register_module(
Tasks.text2text_generation, module_name=Pipelines.translation_en_to_ro)
@PIPELINES.register_module(
Tasks.text2text_generation, module_name=Pipelines.translation_en_to_fr)
@PIPELINES.register_module(
Tasks.text2text_generation, module_name=Pipelines.text2text_generation)
class TextGenerationT5Pipeline(TextGenerationPipeline):

def __init__(self,
model: Union[Model, str],
preprocessor: Optional[Preprocessor] = None,
sub_task=None,
**kwargs):
super().__init__(model, preprocessor, **kwargs)
self.sub_task = sub_task
self.task_specific_params = self._parse_specific_model_params(
getattr(self.model, 'model_dir', None), 'task_specific_params')
self.min_length = self._parse_specific_model_params(
getattr(self.model, 'model_dir', None), 'min_length')
self.max_length = self._parse_specific_model_params(
getattr(self.model, 'model_dir', None), 'max_length')

def _parse_specific_model_params(self, model_dir, key):
if model_dir is None:
return

cfg: Config = read_config(model_dir)
params = cfg.safe_get(f'model.{key}')
if params is None:
cfg: Config = read_config(os.path.join(model_dir, 'config.json'))
params = cfg.safe_get(key)
return params

def preprocess(self, inputs, **preprocess_params) -> Dict[str, Any]:
if not isinstance(inputs, str):
raise ValueError(f'Not supported input type: {type(inputs)}')

if self.task_specific_params is not None:
sub_task = self.sub_task or self.model.pipeline.type
if sub_task in self.task_specific_params:
self.model.config.update(self.task_specific_params[sub_task])
if 'prefix' in self.task_specific_params[sub_task]:
inputs = self.task_specific_params[sub_task].prefix + inputs

return super().preprocess(inputs, **preprocess_params)

def forward(self, inputs: Dict[str, Any],
**forward_params) -> Dict[str, Any]:

min_length = forward_params.get('min_length', self.min_length)
max_length = forward_params.get('max_length', self.max_length)
if min_length is not None:
forward_params['min_length'] = min_length
if max_length is not None:
forward_params['max_length'] = max_length

with torch.no_grad():
return self.model.generate(**inputs, **forward_params)
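
A hedged sketch of the new TextGenerationT5Pipeline: min_length/max_length are read from the
model configuration when available and can be overridden per call, since call-time kwargs are
routed into forward_params by _sanitize_parameters. The task string and model id are placeholders.

>>> from modelscope.pipelines import pipeline
>>> pipe = pipeline(task='text2text-generation', model='<t5-style-model-id>')
>>> print(pipe('<source sentence to rewrite or translate>', max_length=64))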

+ 16
- 4
modelscope/pipelines/nlp/text_ranking_pipeline.py View File

@@ -9,7 +9,8 @@ from modelscope.models import Model
from modelscope.outputs import OutputKeys
from modelscope.pipelines.base import Pipeline
from modelscope.pipelines.builder import PIPELINES
from modelscope.preprocessors import Preprocessor, TextRankingPreprocessor
from modelscope.preprocessors import (Preprocessor,
TextRankingTransformersPreprocessor)
from modelscope.utils.constant import Tasks

__all__ = ['TextRankingPipeline']
@@ -22,6 +23,10 @@ class TextRankingPipeline(Pipeline):
def __init__(self,
model: Union[Model, str],
preprocessor: Optional[Preprocessor] = None,
config_file: str = None,
device: str = 'gpu',
auto_collate=True,
sequence_length=128,
**kwargs):
"""Use `model` and `preprocessor` to create a nlp word segment pipeline for prediction.

@@ -30,14 +35,21 @@ class TextRankingPipeline(Pipeline):
or a model id from the model hub, or a torch model instance.
preprocessor (Preprocessor): An optional preprocessor instance, please make sure the preprocessor fits for
the model if supplied.
sequence_length: Max sequence length in the user's custom scenario. 128 will be used as a default value.
kwargs (dict, `optional`):
Extra kwargs passed into the preprocessor's constructor.
"""
super().__init__(model=model, preprocessor=preprocessor, **kwargs)
super().__init__(
model=model,
preprocessor=preprocessor,
config_file=config_file,
device=device,
auto_collate=auto_collate)

if preprocessor is None:
self.preprocessor = Preprocessor.from_pretrained(
self.model.model_dir,
sequence_length=kwargs.pop('sequence_length', 128))
sequence_length=sequence_length,
**kwargs)

def forward(self, inputs: Dict[str, Any],
**forward_params) -> Dict[str, Any]:
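
A short construction sketch with sequence_length now an explicit argument; the model id is a
placeholder and the import location of the class is assumed.

>>> from modelscope.pipelines.nlp import TextRankingPipeline
>>> pipe = TextRankingPipeline(model='<text-ranking-model-id-or-dir>', sequence_length=256)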


+ 49
- 31
modelscope/pipelines/nlp/token_classification_pipeline.py View File

@@ -1,7 +1,8 @@
# Copyright (c) Alibaba, Inc. and its affiliates.

from typing import Any, Dict, Optional, Union
from typing import Any, Dict, List, Optional, Union

import numpy as np
import torch

from modelscope.metainfo import Pipelines
@@ -32,24 +33,35 @@ class TokenClassificationPipeline(Pipeline):
def __init__(self,
model: Union[Model, str],
preprocessor: Optional[Preprocessor] = None,
config_file: str = None,
device: str = 'gpu',
auto_collate=True,
sequence_length=128,
**kwargs):
"""use `model` and `preprocessor` to create a token classification pipeline for prediction

Args:
model (str or Model): A model instance or a model local dir or a model id in the model hub.
preprocessor (Preprocessor): An optional preprocessor instance; if None, one will be built from the model directory.
kwargs (dict, `optional`):
Extra kwargs passed into the preprocessor's constructor.
"""
super().__init__(model=model, preprocessor=preprocessor, **kwargs)
super().__init__(
model=model,
preprocessor=preprocessor,
config_file=config_file,
device=device,
auto_collate=auto_collate)

if preprocessor is None:
self.preprocessor = Preprocessor.from_pretrained(
self.model.model_dir,
sequence_length=kwargs.pop('sequence_length', 128))
sequence_length=sequence_length,
**kwargs)
self.model.eval()

self.id2label = kwargs.get('id2label')
if self.id2label is None and hasattr(self.preprocessor, 'id2label'):
self.id2label = self.preprocessor.id2label
assert hasattr(self.preprocessor, 'id2label')
self.id2label = self.preprocessor.id2label

def forward(self, inputs: Dict[str, Any],
**forward_params) -> Dict[str, Any]:
@@ -60,53 +72,59 @@ class TokenClassificationPipeline(Pipeline):
}

def postprocess(self, inputs: Dict[str, Any],
**postprocess_params) -> Dict[str, str]:
"""process the prediction results
**postprocess_params) -> Dict[str, Any]:
"""Process the prediction results

Args:
inputs (Dict[str, Any]): should be tensors from model

Returns:
Dict[str, str]: the prediction results
Dict[str, Any]: the prediction results
"""
chunks = self._chunk_process(inputs, **postprocess_params)

# for cws outputs
if len(chunks) > 0 and chunks[0]['type'].lower() == 'cws':
spans = [
chunk['span'] for chunk in chunks if chunk['span'].strip()
]
seg_result = [span for span in spans]
outputs = {OutputKeys.OUTPUT: seg_result}

# for ner outputs
else:
outputs = {OutputKeys.OUTPUT: chunks}
return outputs
return {OutputKeys.OUTPUT: chunks}

def _chunk_process(self, inputs: Dict[str, Any],
**postprocess_params) -> Dict[str, str]:
**postprocess_params) -> List:
"""process the prediction results and output as chunks

Args:
inputs (Dict[str, Any]): should be tensors from model

Returns:
Dict[str, str]: the prediction results
List: The output chunks
"""
text = inputs['text']
# TODO post_process does not support batch for now.
if OutputKeys.PREDICTIONS not in inputs:
logits = inputs[OutputKeys.LOGITS]
predictions = torch.argmax(logits[0], dim=-1)
if len(logits.shape) == 3:
logits = logits[0]
predictions = torch.argmax(logits, dim=-1)
else:
predictions = inputs[OutputKeys.PREDICTIONS].squeeze(
0).cpu().numpy()
predictions = inputs[OutputKeys.PREDICTIONS]
if len(predictions.shape) == 2:
predictions = predictions[0]

offset_mapping = inputs['offset_mapping']
if len(offset_mapping.shape) == 3:
offset_mapping = offset_mapping[0]

label_mask = inputs.get('label_mask')
if label_mask is not None:
masked_lengths = label_mask.sum(-1).long().cpu().item()
offset_mapping = torch.narrow(
offset_mapping, 0, 0,
masked_lengths) # index_select only move loc, not resize
predictions = torch.narrow(
predictions, 0, 0,
masked_lengths) # index_select only move loc, not resize

offset_mapping = torch_nested_numpify(
torch_nested_detach(offset_mapping))
predictions = torch_nested_numpify(torch_nested_detach(predictions))
offset_mapping = [x.cpu().tolist() for x in inputs['offset_mapping']]

labels = [self.id2label[x] for x in predictions]
if len(labels) > len(offset_mapping):
labels = labels[1:-1]

chunks = []
chunk = {}
for label, offsets in zip(labels, offset_mapping):
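
A minimal sketch of the refactored pipeline: the preprocessor is built from the model directory
via Preprocessor.from_pretrained and must expose id2label. The model id is a placeholder.

>>> from modelscope.pipelines.nlp import TokenClassificationPipeline
>>> pipe = TokenClassificationPipeline(model='<token-classification-model-id-or-dir>')
>>> print(pipe('今天天气不错,适合出去游玩'))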


+ 2
- 6
modelscope/pipelines/nlp/translation_quality_estimation_pipeline.py View File

@@ -2,19 +2,15 @@

import io
import os
from typing import Any, Dict, Union
from typing import Any, Dict

import numpy as np
import torch
from transformers import XLMRobertaTokenizer

from modelscope.metainfo import Pipelines
from modelscope.models import Model
from modelscope.models.nlp import BertForSequenceClassification
from modelscope.outputs import OutputKeys
from modelscope.pipelines.base import Input, Pipeline
from modelscope.pipelines.base import Pipeline
from modelscope.pipelines.builder import PIPELINES
from modelscope.preprocessors import SequenceClassificationPreprocessor
from modelscope.utils.constant import ModelFile, Tasks

__all__ = ['TranslationQualityEstimationPipeline']


+ 59
- 42
modelscope/pipelines/nlp/word_segmentation_pipeline.py View File

@@ -10,9 +10,9 @@ from modelscope.outputs import OutputKeys
from modelscope.pipelines.base import Pipeline
from modelscope.pipelines.builder import PIPELINES
from modelscope.pipelines.nlp import TokenClassificationPipeline
from modelscope.preprocessors import (Preprocessor,
TokenClassificationPreprocessor,
WordSegmentationPreprocessorThai)
from modelscope.preprocessors import (
Preprocessor, TokenClassificationTransformersPreprocessor,
WordSegmentationPreprocessorThai)
from modelscope.utils.constant import Tasks
from modelscope.utils.tensor_utils import (torch_nested_detach,
torch_nested_numpify)
@@ -23,42 +23,49 @@ __all__ = ['WordSegmentationPipeline', 'WordSegmentationThaiPipeline']
@PIPELINES.register_module(
Tasks.word_segmentation, module_name=Pipelines.word_segmentation)
class WordSegmentationPipeline(TokenClassificationPipeline):
"""Use `model` and `preprocessor` to create a nlp word segment pipeline for prediction.

def __init__(self,
model: Union[Model, str],
preprocessor: Optional[Preprocessor] = None,
**kwargs):
"""Use `model` and `preprocessor` to create a nlp word segment pipeline for prediction.
NOTE: The preprocessor will first split the sentence into single characters,
then feed them into the tokenizer with the parameter is_split_into_words=True.

Example:
>>> from modelscope.pipelines import pipeline
>>> pipeline_ins = pipeline(task='word-segmentation',
>>> model='damo/nlp_structbert_word-segmentation_chinese-base')
>>> sentence1 = '今天天气不错,适合出去游玩'
>>> print(pipeline_ins(sentence1))

To view other examples please check tests/pipelines/test_word_segmentation.py.
"""

def postprocess(self,
inputs: Dict[str, Any],
output_final_sentence=True,
**postprocess_params) -> Dict[str, Any]:
"""Process the prediction results

Args:
model (str or Model): Supply either a local model dir which supported the WS task,
or a model id from the model hub, or a torch model instance.
preprocessor (Preprocessor): An optional preprocessor instance, please make sure the preprocessor fits for
the model if supplied.
sequence_length: Max sequence length in the user's custom scenario. 128 will be used as a default value.

NOTE: The preprocessor will first split the sentence into single characters,
then feed them into the tokenizer with the parameter is_split_into_words=True.

Example:
>>> from modelscope.pipelines import pipeline
>>> pipeline_ins = pipeline(task='word-segmentation',
>>> model='damo/nlp_structbert_word-segmentation_chinese-base')
>>> sentence1 = '今天天气不错,适合出去游玩'
>>> print(pipeline_ins(sentence1))

To view other examples please check the tests/pipelines/test_word_segmentation.py.
inputs (Dict[str, Any]): should be tensors from model
output_final_sentence (bool): Whether to output the final word segmentation result as a list of spans.
If False, the pipeline will output the original token-label information.

Returns:
Dict[str, Any]: The prediction results.
"""
super().__init__(model=model, preprocessor=preprocessor, **kwargs)
if preprocessor is None:
self.preprocessor = TokenClassificationPreprocessor(
self.model.model_dir,
sequence_length=kwargs.pop('sequence_length', 128))
self.model.eval()
chunks = self._chunk_process(inputs, **postprocess_params)

# for cws outputs
if output_final_sentence:
spans = [
chunk['span'] for chunk in chunks if chunk['span'].strip()
]
seg_result = [span for span in spans]
outputs = {OutputKeys.OUTPUT: seg_result}

self.id2label = kwargs.get('id2label')
if self.id2label is None and hasattr(self.preprocessor, 'id2label'):
self.id2label = self.preprocessor.id2label
# for ner outputs
else:
outputs = {OutputKeys.OUTPUT: chunks}
return outputs


@PIPELINES.register_module(
@@ -66,8 +73,10 @@ class WordSegmentationPipeline(TokenClassificationPipeline):
module_name=Pipelines.multilingual_word_segmentation)
class MultilingualWordSegmentationPipeline(WordSegmentationPipeline):

def postprocess(self, inputs: Dict[str, Any],
**postprocess_params) -> Dict[str, str]:
def postprocess(self,
inputs: Dict[str, Any],
output_final_sentence=True,
**postprocess_params) -> Dict[str, Any]:
chunks = self._chunk_process(inputs, **postprocess_params)
word_segments = [entity['span'] for entity in chunks]
return {OutputKeys.OUTPUT: word_segments}
@@ -80,14 +89,22 @@ class WordSegmentationThaiPipeline(MultilingualWordSegmentationPipeline):
def __init__(self,
model: Union[Model, str],
preprocessor: Optional[Preprocessor] = None,
config_file: str = None,
device: str = 'gpu',
auto_collate=True,
sequence_length=512,
**kwargs):
model = model if isinstance(model,
Model) else Model.from_pretrained(model)
super().__init__(
model=model,
preprocessor=preprocessor,
config_file=config_file,
device=device,
auto_collate=auto_collate)
if preprocessor is None:
preprocessor = WordSegmentationPreprocessorThai(
model.model_dir,
sequence_length=kwargs.pop('sequence_length', 512))
super().__init__(model=model, preprocessor=preprocessor, **kwargs)
self.preprocessor = WordSegmentationPreprocessorThai(
self.model.model_dir,
sequence_length=sequence_length,
**kwargs)

def postprocess(self, inputs: Dict[str, Any],
**postprocess_params) -> Dict[str, str]:
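
A sketch of the new postprocess switch: by default the segmented spans are returned, while
output_final_sentence=False keeps the raw token-label chunks. Forwarding the flag through
__call__ is assumed here; the model id comes from the docstring above.

>>> from modelscope.pipelines import pipeline
>>> pipe = pipeline(task='word-segmentation',
>>>                 model='damo/nlp_structbert_word-segmentation_chinese-base')
>>> print(pipe('今天天气不错,适合出去游玩'))
>>> print(pipe('今天天气不错,适合出去游玩', output_final_sentence=False))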


+ 18
- 9
modelscope/pipelines/nlp/zero_shot_classification_pipeline.py View File

@@ -10,8 +10,7 @@ from modelscope.models import Model
from modelscope.outputs import OutputKeys
from modelscope.pipelines.base import Pipeline
from modelscope.pipelines.builder import PIPELINES
from modelscope.preprocessors import (Preprocessor,
ZeroShotClassificationPreprocessor)
from modelscope.preprocessors import Preprocessor
from modelscope.utils.constant import Tasks

__all__ = ['ZeroShotClassificationPipeline']
@@ -25,6 +24,10 @@ class ZeroShotClassificationPipeline(Pipeline):
def __init__(self,
model: Union[Model, str],
preprocessor: Preprocessor = None,
config_file: str = None,
device: str = 'gpu',
auto_collate=True,
sequence_length=512,
**kwargs):
"""Use `model` and `preprocessor` to create a nlp zero shot classifiction for prediction.

@@ -44,7 +47,8 @@ class ZeroShotClassificationPipeline(Pipeline):
or a model id from the model hub, or a torch model instance.
preprocessor (Preprocessor): An optional preprocessor instance, please make sure the preprocessor fits for
the model if supplied.
sequence_length: Max sequence length in the user's custom scenario. 512 will be used as a default value.
kwargs (dict, `optional`):
Extra kwargs passed into the preprocessor's constructor.

Example:
>>> from modelscope.pipelines import pipeline
@@ -55,17 +59,22 @@ class ZeroShotClassificationPipeline(Pipeline):
>>> template = '这篇文章的标题是{}'
>>> print(pipeline_ins(sentence1, candidate_labels=labels, hypothesis_template=template))

To view other examples please check the tests/pipelines/test_zero_shot_classification.py.
To view other examples please check tests/pipelines/test_zero_shot_classification.py.
"""
assert isinstance(model, str) or isinstance(model, Model), \
'model must be a single str or Model'
super().__init__(model=model, preprocessor=preprocessor, **kwargs)
super().__init__(
model=model,
preprocessor=preprocessor,
config_file=config_file,
device=device,
auto_collate=auto_collate)
self.entailment_id = 0
self.contradiction_id = 2
if preprocessor is None:
self.preprocessor = ZeroShotClassificationPreprocessor(
sequence_length = kwargs.pop('sequence_length', 512)
self.preprocessor = Preprocessor.from_pretrained(
self.model.model_dir,
sequence_length=kwargs.pop('sequence_length', 512))
sequence_length=sequence_length,
**kwargs)
self.model.eval()

def _sanitize_parameters(self, **kwargs):
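
A minimal sketch of calling the pipeline with candidate labels and a hypothesis template, as in
the docstring above; the task string, model id and labels are placeholders.

>>> from modelscope.pipelines import pipeline
>>> pipe = pipeline(task='zero-shot-classification', model='<zero-shot-model-id>')
>>> labels = ['文化', '体育', '财经']
>>> print(pipe('这是一段待分类的新闻文本', candidate_labels=labels, hypothesis_template='这篇文章的标题是{}'))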


+ 25
- 18
modelscope/preprocessors/__init__.py View File

@@ -16,15 +16,19 @@ if TYPE_CHECKING:
from .kws import WavToLists
from .multi_modal import (OfaPreprocessor, MPlugPreprocessor)
from .nlp import (
DocumentSegmentationPreprocessor, FaqQuestionAnsweringPreprocessor,
FillMaskPoNetPreprocessor, NLPPreprocessor,
NLPTokenizerPreprocessorBase, PassageRankingPreprocessor,
TextRankingPreprocessor, RelationExtractionPreprocessor,
SentenceEmbeddingPreprocessor, SequenceClassificationPreprocessor,
TokenClassificationPreprocessor, TextErrorCorrectionPreprocessor,
TextGenerationPreprocessor, Text2TextGenerationPreprocessor, Tokenize,
DocumentSegmentationTransformersPreprocessor,
FaqQuestionAnsweringTransformersPreprocessor,
FillMaskPoNetPreprocessor, FillMaskTransformersPreprocessor,
TextRankingTransformersPreprocessor,
RelationExtractionTransformersPreprocessor,
SentenceEmbeddingTransformersPreprocessor,
TextClassificationTransformersPreprocessor,
TokenClassificationTransformersPreprocessor,
TextErrorCorrectionPreprocessor, TextGenerationT5Preprocessor,
TextGenerationTransformersPreprocessor, Tokenize,
WordSegmentationBlankSetToLabelPreprocessor, CodeGeeXPreprocessor,
MGLMSummarizationPreprocessor, ZeroShotClassificationPreprocessor,
MGLMSummarizationPreprocessor,
ZeroShotClassificationTransformersPreprocessor,
TextGenerationJiebaPreprocessor, SentencePiecePreprocessor,
DialogIntentPredictionPreprocessor, DialogModelingPreprocessor,
DialogStateTrackingPreprocessor, ConversationalTextToSqlPreprocessor,
@@ -47,18 +51,21 @@ else:
'kws': ['WavToLists'],
'multi_modal': ['OfaPreprocessor', 'MPlugPreprocessor'],
'nlp': [
'DocumentSegmentationPreprocessor',
'FaqQuestionAnsweringPreprocessor', 'FillMaskPoNetPreprocessor',
'NLPPreprocessor', 'NLPTokenizerPreprocessorBase',
'TextRankingPreprocessor', 'RelationExtractionPreprocessor',
'SentenceEmbeddingPreprocessor',
'SequenceClassificationPreprocessor',
'TokenClassificationPreprocessor',
'TextErrorCorrectionPreprocessor', 'TextGenerationPreprocessor',
'Tokenize', 'Text2TextGenerationPreprocessor',
'DocumentSegmentationTransformersPreprocessor',
'FaqQuestionAnsweringTransformersPreprocessor',
'FillMaskPoNetPreprocessor', 'FillMaskTransformersPreprocessor',
'NLPTokenizerPreprocessorBase',
'TextRankingTransformersPreprocessor',
'RelationExtractionTransformersPreprocessor',
'SentenceEmbeddingTransformersPreprocessor',
'TextClassificationTransformersPreprocessor',
'TokenClassificationTransformersPreprocessor',
'TextErrorCorrectionPreprocessor',
'TextGenerationTransformersPreprocessor', 'Tokenize',
'TextGenerationT5Preprocessor',
'WordSegmentationBlankSetToLabelPreprocessor',
'MGLMSummarizationPreprocessor', 'CodeGeeXPreprocessor',
'ZeroShotClassificationPreprocessor',
'ZeroShotClassificationTransformersPreprocessor',
'TextGenerationJiebaPreprocessor', 'SentencePiecePreprocessor',
'NERPreprocessorViet', 'NERPreprocessorThai',
'WordSegmentationPreprocessorThai',


+ 52
- 2
modelscope/preprocessors/base.py View File

@@ -2,9 +2,10 @@
import os
from abc import ABC, abstractmethod
from copy import deepcopy
from typing import Any, Dict, Optional, Sequence
from typing import Any, Callable, Dict, Optional, Sequence, Union

from modelscope.metainfo import Models, Preprocessors
from modelscope.utils.checkpoint import save_configuration
from modelscope.utils.config import Config, ConfigDict
from modelscope.utils.constant import (DEFAULT_MODEL_REVISION, Invoke,
ModeKeys, Tasks)
@@ -98,6 +99,8 @@ PREPROCESSOR_MAP = {
Preprocessors.sen_cls_tokenizer,
(Models.structbert, Tasks.part_of_speech):
Preprocessors.token_cls_tokenizer,
(Models.token_classification_for_ner, Tasks.named_entity_recognition):
Preprocessors.token_cls_tokenizer,
(Models.structbert, Tasks.token_classification):
Preprocessors.token_cls_tokenizer,
(Models.structbert, Tasks.word_segmentation):
@@ -117,7 +120,15 @@ PREPROCESSOR_MAP = {
(Models.veco, Tasks.sentence_similarity):
Preprocessors.sen_cls_tokenizer,

# space
# taskmodels
(Models.lcrf, Tasks.named_entity_recognition):
Preprocessors.sequence_labeling_tokenizer,
(Models.lcrf_wseg, Tasks.word_segmentation):
Preprocessors.sequence_labeling_tokenizer,
(Models.tcrf_wseg, Tasks.word_segmentation):
Preprocessors.sequence_labeling_tokenizer,
(Models.tcrf, Tasks.named_entity_recognition):
Preprocessors.sequence_labeling_tokenizer,
}


@@ -125,6 +136,8 @@ class Preprocessor(ABC):

def __init__(self, mode=ModeKeys.INFERENCE, *args, **kwargs):
self._mode = mode
assert self._mode in (ModeKeys.INFERENCE, ModeKeys.TRAIN,
ModeKeys.EVAL)
self.device = int(
os.environ['LOCAL_RANK']) if 'LOCAL_RANK' in os.environ else None
pass
@@ -264,4 +277,41 @@ class Preprocessor(ABC):
})
preprocessor = build_preprocessor(sub_cfg, field_name)
preprocessor.mode = preprocessor_mode
sub_cfg.pop('model_dir', None)
if not hasattr(preprocessor, 'cfg'):
preprocessor.cfg = cfg
return preprocessor

def save_pretrained(self,
target_folder: Union[str, os.PathLike],
config: Optional[dict] = None,
save_config_function: Callable = save_configuration):
"""Save the preprocessor, its configuration and other related files to a directory,
so that it can be re-loaded

By default, this method will save the preprocessor's config with mode `inference`.

Args:
target_folder (Union[str, os.PathLike]):
Directory to which to save. Will be created if it doesn't exist.

config (Optional[dict], optional):
The config to be saved as configuration.json. If None, the preprocessor's own `cfg` will be used when available.

save_config_function (Callable): The function used to save the configuration; it is called
after the config has been updated.

"""
if config is None and hasattr(self, 'cfg'):
config = self.cfg

if config is not None:
# Update the mode to `inference` in the preprocessor field.
if 'preprocessor' in config and config['preprocessor'] is not None:
if 'mode' in config['preprocessor']:
config['preprocessor']['mode'] = 'inference'
elif 'val' in config['preprocessor'] and 'mode' in config[
'preprocessor']['val']:
config['preprocessor']['val']['mode'] = 'inference'

save_config_function(target_folder, config)
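
A minimal sketch of the new save path: the default implementation only writes the configuration,
with the preprocessor mode forced to `inference`. The model id and target folder are placeholders.

>>> from modelscope.preprocessors import Preprocessor
>>> preprocessor = Preprocessor.from_pretrained('<model-id-or-local-dir>')
>>> # writes configuration.json (preprocessor mode set to 'inference') into the folder
>>> preprocessor.save_pretrained('/tmp/my_preprocessor')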

+ 28
- 29
modelscope/preprocessors/nlp/__init__.py View File

@@ -5,24 +5,22 @@ from modelscope.utils.import_utils import LazyImportModule

if TYPE_CHECKING:
from .text_error_correction import TextErrorCorrectionPreprocessor
from .nlp_base import (NLPTokenizerPreprocessorBase, NLPBasePreprocessor)
from .text_generation_jieba_preprocessor import TextGenerationJiebaPreprocessor
from .text_generation_preprocessor import TextGenerationJiebaPreprocessor
from .sentence_piece_preprocessor import SentencePiecePreprocessor
from .bert_seq_cls_tokenizer import Tokenize
from .document_segmentation_preprocessor import DocumentSegmentationPreprocessor
from .faq_question_answering_preprocessor import FaqQuestionAnsweringPreprocessor
from .fill_mask_preprocessor import FillMaskPoNetPreprocessor, NLPPreprocessor
from .text_ranking_preprocessor import TextRankingPreprocessor
from .relation_extraction_preprocessor import RelationExtractionPreprocessor
from .sentence_classification_preprocessor import SequenceClassificationPreprocessor
from .sentence_embedding_preprocessor import SentenceEmbeddingPreprocessor
from .text_generation_preprocessor import TextGenerationPreprocessor
from .text2text_generation_preprocessor import Text2TextGenerationPreprocessor
from .token_classification_preprocessor import TokenClassificationPreprocessor, \
from .document_segmentation_preprocessor import DocumentSegmentationTransformersPreprocessor
from .faq_question_answering_preprocessor import FaqQuestionAnsweringTransformersPreprocessor
from .fill_mask_preprocessor import FillMaskPoNetPreprocessor, FillMaskTransformersPreprocessor
from .text_ranking_preprocessor import TextRankingTransformersPreprocessor
from .relation_extraction_preprocessor import RelationExtractionTransformersPreprocessor
from .text_classification_preprocessor import TextClassificationTransformersPreprocessor
from .sentence_embedding_preprocessor import SentenceEmbeddingTransformersPreprocessor
from .text_generation_preprocessor import TextGenerationTransformersPreprocessor, TextGenerationT5Preprocessor
from .token_classification_preprocessor import TokenClassificationTransformersPreprocessor, \
WordSegmentationBlankSetToLabelPreprocessor
from .token_classification_thai_preprocessor import WordSegmentationPreprocessorThai, NERPreprocessorThai
from .token_classification_viet_preprocessor import NERPreprocessorViet
from .zero_shot_classification_reprocessor import ZeroShotClassificationPreprocessor
from .zero_shot_classification_preprocessor import ZeroShotClassificationTransformersPreprocessor
from .space import (DialogIntentPredictionPreprocessor,
DialogModelingPreprocessor,
DialogStateTrackingPreprocessor, InputFeatures,
@@ -36,30 +34,31 @@ else:
'NLPTokenizerPreprocessorBase',
'NLPBasePreprocessor',
],
'text_generation_jieba_preprocessor':
['TextGenerationJiebaPreprocessor'],
'sentence_piece_preprocessor': ['SentencePiecePreprocessor'],
'bert_seq_cls_tokenizer': ['Tokenize'],
'document_segmentation_preprocessor':
['DocumentSegmentationPreprocessor'],
['DocumentSegmentationTransformersPreprocessor'],
'faq_question_answering_preprocessor':
['FaqQuestionAnsweringPreprocessor'],
['FaqQuestionAnsweringTransformersPreprocessor'],
'fill_mask_preprocessor':
['FillMaskPoNetPreprocessor', 'NLPPreprocessor'],
'text_ranking_preprocessor': ['TextRankingPreprocessor'],
'relation_extraction_preprocessor': ['RelationExtractionPreprocessor'],
'sentence_classification_preprocessor':
['SequenceClassificationPreprocessor'],
'sentence_embedding_preprocessor': ['SentenceEmbeddingPreprocessor'],
'text_generation_preprocessor': ['TextGenerationPreprocessor'],
'text2text_generation_preprocessor':
['Text2TextGenerationPreprocessor'],
['FillMaskPoNetPreprocessor', 'FillMaskTransformersPreprocessor'],
'text_ranking_preprocessor': ['TextRankingTransformersPreprocessor'],
'relation_extraction_preprocessor':
['RelationExtractionTransformersPreprocessor'],
'text_classification_preprocessor':
['TextClassificationTransformersPreprocessor'],
'sentence_embedding_preprocessor':
['SentenceEmbeddingTransformersPreprocessor'],
'text_generation_preprocessor': [
'TextGenerationTransformersPreprocessor',
'TextGenerationJiebaPreprocessor', 'TextGenerationT5Preprocessor'
],
'token_classification_preprocessor': [
'TokenClassificationPreprocessor',
'TokenClassificationTransformersPreprocessor',
'WordSegmentationBlankSetToLabelPreprocessor'
],
'zero_shot_classification_reprocessor':
['ZeroShotClassificationPreprocessor'],
'zero_shot_classification_preprocessor':
['ZeroShotClassificationTransformersPreprocessor'],
'text_error_correction': [
'TextErrorCorrectionPreprocessor',
],


+ 31
- 18
modelscope/preprocessors/nlp/document_segmentation_preprocessor.py View File

@@ -3,39 +3,52 @@
from typing import Any, Dict

from modelscope.metainfo import Preprocessors
from modelscope.preprocessors import Preprocessor
from modelscope.preprocessors.builder import PREPROCESSORS
from modelscope.utils.constant import Fields
from modelscope.utils.constant import Fields, ModeKeys
from modelscope.utils.logger import get_logger
from .nlp_base import NLPBasePreprocessor

logger = get_logger()


@PREPROCESSORS.register_module(
Fields.nlp, module_name=Preprocessors.document_segmentation)
class DocumentSegmentationPreprocessor(NLPBasePreprocessor):

def __init__(self, model_dir: str, config, *args, **kwargs):
"""preprocess the data
class DocumentSegmentationTransformersPreprocessor(Preprocessor):

def __init__(self,
model_dir: str,
model_max_length: int,
mode: str = ModeKeys.INFERENCE,
question_column_name='labels',
context_column_name='sentences',
example_id_column_name='example_id',
label_list=['B-EOP', 'O']):
"""The preprocessor for document segmentation task, based on transformers' tokenizer.

Args:
model_dir (str): model path
model_dir: The model dir containing the essential files to build the tokenizer.
model_max_length: The max sequence length the model supports.
mode: The mode for this preprocessor.
question_column_name: The key for the question column, default `labels`.
context_column_name: The key for the context column, default `sentences`.
example_id_column_name: The key for the example id column, default `example_id`.
label_list: The label list, default `['B-EOP', 'O']`
"""

super().__init__(model_dir, *args, **kwargs)
super().__init__(mode)
from transformers import BertTokenizerFast
self.tokenizer = BertTokenizerFast.from_pretrained(
model_dir,
use_fast=True,
)
self.question_column_name = 'labels'
self.context_column_name = 'sentences'
self.example_id_column_name = 'example_id'
self.label_to_id = {'B-EOP': 0, 'O': 1}
self.tokenizer = BertTokenizerFast.from_pretrained(model_dir, )
self.question_column_name = question_column_name
self.context_column_name = context_column_name
self.example_id_column_name = example_id_column_name
self.label_list = label_list
self.label_to_id = {
label: id
for id, label in enumerate(self.label_list)
}
self.target_specical_ids = set()
self.target_specical_ids.add(self.tokenizer.eos_token_id)
self.max_seq_length = config.max_position_embeddings
self.label_list = ['B-EOP', 'O']
self.max_seq_length = model_max_length

def __call__(self, examples, model_cfg=None) -> Dict[str, Any]:
questions = examples[self.question_column_name]
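
A construction sketch for the renamed preprocessor, which now takes model_max_length explicitly
instead of reading config.max_position_embeddings; the model dir is a placeholder.

>>> from modelscope.preprocessors import DocumentSegmentationTransformersPreprocessor
>>> preprocessor = DocumentSegmentationTransformersPreprocessor(
>>>     '<bert-style-model-dir>', model_max_length=512)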


+ 51
- 25
modelscope/preprocessors/nlp/faq_question_answering_preprocessor.py View File

@@ -1,38 +1,58 @@
# Copyright (c) Alibaba, Inc. and its affiliates.

import os
from typing import Any, Dict

from modelscope.metainfo import Preprocessors
from modelscope.preprocessors import Preprocessor
from modelscope.preprocessors.builder import PREPROCESSORS
from modelscope.utils.config import Config, ConfigFields
from modelscope.utils.constant import Fields, ModeKeys, ModelFile
from modelscope.utils.constant import Fields, ModeKeys
from modelscope.utils.type_assert import type_assert
from .nlp_base import NLPBasePreprocessor


@PREPROCESSORS.register_module(
Fields.nlp, module_name=Preprocessors.faq_question_answering_preprocessor)
class FaqQuestionAnsweringPreprocessor(NLPBasePreprocessor):

def __init__(self, model_dir: str, *args, **kwargs):
super(FaqQuestionAnsweringPreprocessor, self).__init__(
model_dir, mode=ModeKeys.INFERENCE, **kwargs)

from transformers import BertTokenizer

preprocessor_config = Config.from_file(
os.path.join(model_dir, ModelFile.CONFIGURATION)).get(
ConfigFields.preprocessor, {})
if preprocessor_config.get('tokenizer',
'BertTokenizer') == 'XLMRoberta':
class FaqQuestionAnsweringTransformersPreprocessor(Preprocessor):

def __init__(self,
model_dir: str,
mode: str = ModeKeys.INFERENCE,
tokenizer='BertTokenizer',
query_set='query_set',
support_set='support_set',
label_in_support_set='label',
text_in_support_set='text',
sequence_length=None,
**kwargs):
"""The preprocessor for Faq QA task, based on transformers' tokenizer.

Args:
model_dir: The model dir containing the essential files to build the tokenizer.
mode: The mode for this preprocessor.
tokenizer: The tokenizer type to use. Pass `XLMRoberta` to load an `XLMRobertaTokenizer`;
any other value falls back to `BertTokenizer` (the default).
query_set: The key of the query set in the input dict.
support_set: The key of the support set in the input dict.
label_in_support_set: The key of the label field inside each support-set item.
text_in_support_set: The key of the text field inside each support-set item.
sequence_length: The max sequence length used when padding the tokenized inputs.
"""
super().__init__(mode)
if tokenizer == 'XLMRoberta':
from transformers import XLMRobertaTokenizer
self.tokenizer = XLMRobertaTokenizer.from_pretrained(model_dir)
else:
from transformers import BertTokenizer
self.tokenizer = BertTokenizer.from_pretrained(model_dir)

self.MAX_LEN = preprocessor_config.get('max_seq_length', 50)
if sequence_length is not None:
self.max_len = sequence_length
else:
self.max_len = kwargs.get('max_seq_length', 50)
self.label_dict = None
self.query_set = query_set
self.support_set = support_set
self.label_in_support_set = label_in_support_set
self.text_in_support_set = text_in_support_set

def pad(self, samples, max_len):
result = []
@@ -58,25 +78,31 @@ class FaqQuestionAnsweringPreprocessor(NLPBasePreprocessor):
@type_assert(object, Dict)
def __call__(self, data: Dict[str, Any],
**preprocessor_param) -> Dict[str, Any]:
TMP_MAX_LEN = preprocessor_param.get('max_seq_length', self.MAX_LEN)
queryset = data['query_set']
tmp_max_len = preprocessor_param.get(
'sequence_length',
preprocessor_param.get('max_seq_length', self.max_len))
queryset = data[self.query_set]
if not isinstance(queryset, list):
queryset = [queryset]
supportset = data['support_set']
supportset = sorted(supportset, key=lambda d: d['label'])
supportset = data[self.support_set]
supportset = sorted(
supportset, key=lambda d: d[self.label_in_support_set])

queryset_tokenized = [self.encode_plus(text) for text in queryset]
supportset_tokenized = [
self.encode_plus(item['text']) for item in supportset
self.encode_plus(item[self.text_in_support_set])
for item in supportset
]

max_len = max(
[len(seq) for seq in queryset_tokenized + supportset_tokenized])
max_len = min(TMP_MAX_LEN, max_len)
max_len = min(tmp_max_len, max_len)
queryset_padded = self.pad(queryset_tokenized, max_len)
supportset_padded = self.pad(supportset_tokenized, max_len)

supportset_labels_ori = [item['label'] for item in supportset]
supportset_labels_ori = [
item[self.label_in_support_set] for item in supportset
]
label_dict = []
for label in supportset_labels_ori:
if label not in label_dict:
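
A usage sketch showing the now-configurable input keys (defaults shown); the model dir and the
sample texts are placeholders.

>>> from modelscope.preprocessors import FaqQuestionAnsweringTransformersPreprocessor
>>> p = FaqQuestionAnsweringTransformersPreprocessor('<bert-style-model-dir>')
>>> features = p({
>>>     'query_set': ['怎样修改绑定的手机号'],
>>>     'support_set': [
>>>         {'text': '如何修改手机号', 'label': 'change_phone'},
>>>         {'text': '密码忘记了怎么办', 'label': 'reset_password'},
>>>     ]
>>> })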


+ 78
- 0
modelscope/preprocessors/nlp/feature_extraction_preprocessor.py View File

@@ -0,0 +1,78 @@
# Copyright (c) Alibaba, Inc. and its affiliates.

from typing import Any, Dict, Tuple, Union

import numpy as np

from modelscope.metainfo import Preprocessors
from modelscope.preprocessors import Preprocessor
from modelscope.preprocessors.builder import PREPROCESSORS
from modelscope.utils.constant import Fields, ModeKeys
from modelscope.utils.hub import get_model_type
from .transformers_tokenizer import NLPTokenizer
from .utils import parse_text_and_label


@PREPROCESSORS.register_module(
Fields.nlp, module_name=Preprocessors.feature_extraction)
class FeatureExtractionTransformersPreprocessor(Preprocessor):

def __init__(self,
model_dir: str = None,
first_sequence: str = None,
second_sequence: str = None,
mode: str = ModeKeys.INFERENCE,
sequence_length: int = 128,
use_fast: bool = None,
**kwargs):
"""The preprocessor for feature extraction task, based on transformers' tokenizer.

Args:
model_dir: The model dir used to initialize the tokenizer.
use_fast: Use the fast tokenizer or not.
sequence_length: The max sequence length the model supports;
it will be passed to the tokenizer as the 'max_length' param.
**kwargs: Extra args input into the tokenizer's __call__ method.
"""
self.first_sequence = first_sequence
self.second_sequence = second_sequence
kwargs['truncation'] = kwargs.get('truncation', True)
kwargs['padding'] = kwargs.get('padding', 'max_length')
kwargs['max_length'] = sequence_length
kwargs['return_token_type_ids'] = kwargs.get('return_token_type_ids',
True)
super().__init__(mode)
model_type = None
if model_dir is not None:
model_type = get_model_type(model_dir)
self.nlp_tokenizer = NLPTokenizer(
model_dir, model_type, use_fast=use_fast, tokenize_kwargs=kwargs)

def __call__(self, data: Union[str, Tuple, Dict],
**kwargs) -> Dict[str, Any]:
"""process the raw input data

Args:
data (tuple): [sentence1, sentence2]
sentence1 (str): a sentence
Example:
'you are so handsome.'
Returns:
Dict[str, Any]: the preprocessed data
"""

text_a, text_b, _ = parse_text_and_label(data, self.mode,
self.first_sequence,
self.second_sequence)
output = self._tokenize_text(text_a, text_b, **kwargs)
output = {
k: np.array(v) if isinstance(v, list) else v
for k, v in output.items()
}
return output

def _tokenize_text(self, sequence1, sequence2=None, **kwargs):
if 'return_tensors' not in kwargs:
kwargs[
'return_tensors'] = 'pt' if self.mode == ModeKeys.INFERENCE else None
return self.nlp_tokenizer(sequence1, sequence2, **kwargs)
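
A minimal sketch of the new preprocessor called on raw text; the model dir is a placeholder and
the module path is taken from the file header above, since the class is not re-exported in the
__init__ diffs shown here.

>>> from modelscope.preprocessors.nlp.feature_extraction_preprocessor import (
>>>     FeatureExtractionTransformersPreprocessor)
>>> p = FeatureExtractionTransformersPreprocessor('<model-dir>', sequence_length=128)
>>> features = p('you are so handsome.')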

+ 198
- 45
modelscope/preprocessors/nlp/fill_mask_preprocessor.py View File

@@ -2,60 +2,207 @@

import os.path as osp
import re
from abc import abstractmethod
from typing import Any, Dict, Tuple, Union

import numpy as np
import torch

from modelscope.metainfo import Preprocessors
from modelscope.preprocessors import Preprocessor
from modelscope.preprocessors.builder import PREPROCESSORS
from modelscope.utils.config import Config
from modelscope.utils.constant import Fields, ModeKeys, ModelFile
from modelscope.utils.hub import get_model_type
from modelscope.utils.nlp import import_external_nltk_data
from .nlp_base import NLPTokenizerPreprocessorBase
from .transformers_tokenizer import NLPTokenizer
from .utils import parse_text_and_label


class FillMaskPreprocessorBase(Preprocessor):

def __init__(self,
first_sequence: str = None,
second_sequence: str = None,
mode: str = ModeKeys.INFERENCE):
"""The base constructor for all the fill-mask preprocessors.

Args:
first_sequence: The key of the first sequence.
second_sequence: The key of the second sequence.
mode: The mode for the preprocessor.
"""
super().__init__(mode)
self.first_sequence = first_sequence
self.second_sequence = second_sequence

def __call__(self, data: Union[str, Tuple, Dict],
**kwargs) -> Dict[str, Any]:
"""process the raw input data

Args:
data (tuple): [sentence1, sentence2]
sentence1 (str): a sentence
Example:
'you are so handsome.'
Returns:
Dict[str, Any]: the preprocessed data
"""

text_a, text_b, _ = parse_text_and_label(data, self.mode,
self.first_sequence,
self.second_sequence)
output = self._tokenize_text(text_a, text_b, **kwargs)
output = {
k: np.array(v) if isinstance(v, list) else v
for k, v in output.items()
}
return output

def _tokenize_text(self, sequence1, sequence2=None, **kwargs):
"""Tokenize the text.

Args:
sequence1: The first sequence.
sequence2: The second sequence which may be None.

Returns:
The encoded sequence.
"""
raise NotImplementedError()

@property
def mask_id(self):
"""Return the id of the mask token.

Returns:
The id of mask token.
"""
return None

@abstractmethod
def decode(self,
token_ids,
skip_special_tokens: bool = False,
clean_up_tokenization_spaces: bool = True,
**kwargs):
"""Turn the token_ids to real sentence.

Args:
token_ids (`Union[int, List[int], np.ndarray, torch.Tensor, tf.Tensor]`):
List of tokenized input ids. Can be obtained using the `__call__` method.
skip_special_tokens (`bool`, *optional*, defaults to `False`):
Whether or not to remove special tokens in the decoding.
clean_up_tokenization_spaces (`bool`, *optional*, defaults to `True`):
Whether or not to clean up the tokenization spaces.
kwargs (additional keyword arguments, *optional*):
Will be passed to the underlying model specific decode method.
Returns:
The real sentence decoded by the preprocessor.
"""
pass


@PREPROCESSORS.register_module(Fields.nlp, module_name=Preprocessors.fill_mask)
@PREPROCESSORS.register_module(
Fields.nlp, module_name=Preprocessors.feature_extraction)
class NLPPreprocessor(NLPTokenizerPreprocessorBase):
"""The tokenizer preprocessor used in MLM task.
"""
class FillMaskTransformersPreprocessor(FillMaskPreprocessorBase):

def __init__(self,
model_dir: str = None,
first_sequence: str = None,
second_sequence: str = None,
mode: str = ModeKeys.INFERENCE,
sequence_length: int = 128,
use_fast: bool = None,
**kwargs):
"""The preprocessor for fill mask task, based on transformers' tokenizer.

def __init__(self, model_dir: str, mode=ModeKeys.INFERENCE, **kwargs):
Args:
model_dir: The model dir used to initialize the tokenizer.
use_fast: Use the fast tokenizer or not.
sequence_length: The max sequence length the model supports;
it will be passed to the tokenizer as the 'max_length' param.
**kwargs: Extra args input into the tokenizer's __call__ method.
"""
kwargs['truncation'] = kwargs.get('truncation', True)
kwargs['padding'] = kwargs.get('padding', 'max_length')
kwargs['max_length'] = kwargs.pop('sequence_length', 128)
kwargs['max_length'] = sequence_length
kwargs['return_token_type_ids'] = kwargs.get('return_token_type_ids',
True)
super().__init__(model_dir, mode=mode, **kwargs)
super().__init__(first_sequence, second_sequence, mode)
model_type = None
if model_dir is not None:
model_type = get_model_type(model_dir)
self.nlp_tokenizer = NLPTokenizer(
model_dir, model_type, use_fast=use_fast, tokenize_kwargs=kwargs)

def _tokenize_text(self, sequence1, sequence2=None, **kwargs):
if 'return_tensors' not in kwargs:
kwargs[
'return_tensors'] = 'pt' if self.mode == ModeKeys.INFERENCE else None
return self.nlp_tokenizer(sequence1, sequence2, **kwargs)

@property
def mask_id(self):
return self.tokenizer.mask_token_id
"""Return the id of the mask token.

Returns:
The id of mask token.
"""
return self.nlp_tokenizer.tokenizer.mask_token_id

def decode(self,
token_ids,
skip_special_tokens: bool = False,
clean_up_tokenization_spaces: bool = True,
**kwargs):
return self.tokenizer.decode(token_ids, skip_special_tokens,
clean_up_tokenization_spaces, **kwargs)
"""Turn the token_ids to real sentence.

Args:
token_ids (`Union[int, List[int], np.ndarray, torch.Tensor, tf.Tensor]`):
List of tokenized input ids. Can be obtained using the `__call__` method.
skip_special_tokens (`bool`, *optional*, defaults to `False`):
Whether or not to remove special tokens in the decoding.
clean_up_tokenization_spaces (`bool`, *optional*, defaults to `True`):
Whether or not to clean up the tokenization spaces.
kwargs (additional keyword arguments, *optional*):
Will be passed to the underlying model specific decode method.
Returns:
The real sentence decoded by the preprocessor.
"""
return self.nlp_tokenizer.tokenizer.decode(
token_ids, skip_special_tokens, clean_up_tokenization_spaces,
**kwargs)


@PREPROCESSORS.register_module(
Fields.nlp, module_name=Preprocessors.fill_mask_ponet)
class FillMaskPoNetPreprocessor(NLPTokenizerPreprocessorBase):
"""The tokenizer preprocessor used in PoNet model's MLM task.
"""
class FillMaskPoNetPreprocessor(FillMaskPreprocessorBase):

def __init__(self,
model_dir,
first_sequence: str = None,
second_sequence: str = None,
mode: str = ModeKeys.INFERENCE,
sequence_length: int = 512,
use_fast: bool = None,
**kwargs):
"""The tokenizer preprocessor used in PoNet model's MLM task.

def __init__(self, model_dir: str, mode=ModeKeys.INFERENCE, **kwargs):
Args:
model_dir: The model dir used to initialize the tokenizer.
use_fast: Use the fast tokenizer or not.
sequence_length: The max sequence length the model supports;
it will be passed to the tokenizer as the 'max_length' param.
**kwargs: Extra args input into the tokenizer's __call__ method.
"""
kwargs['truncation'] = kwargs.get('truncation', True)
kwargs['padding'] = kwargs.get('padding', 'max_length')
kwargs['max_length'] = kwargs.pop('sequence_length', 512)
kwargs['max_length'] = sequence_length
kwargs['return_token_type_ids'] = kwargs.get('return_token_type_ids',
True)
super().__init__(model_dir, mode=mode, **kwargs)
super().__init__(first_sequence, second_sequence, mode)
self.nlp_tokenizer = NLPTokenizer(
model_dir, use_fast=use_fast, tokenize_kwargs=kwargs)

self.cfg = Config.from_file(
osp.join(model_dir, ModelFile.CONFIGURATION))
@@ -80,27 +227,15 @@ class FillMaskPoNetPreprocessor(NLPTokenizerPreprocessorBase):
self.sent_tokenize = sent_tokenize
self.max_length = kwargs['max_length']

def __call__(self, data: Union[str, Tuple, Dict]) -> Dict[str, Any]:
"""process the raw input data

Args:
data (tuple): [sentence1, sentence2]
sentence1 (str): a sentence
Example:
'you are so handsome.'
sentence2 (str): a sentence
Example:
'you are so beautiful.'
Returns:
Dict[str, Any]: the preprocessed data
"""

text_a, text_b, labels = self.parse_text_and_label(data)
output = self.tokenizer(
text_a,
text_b,
return_tensors='pt' if self._mode == ModeKeys.INFERENCE else None,
**self.tokenize_kwargs)
def __call__(self, data: Union[str, Tuple, Dict],
**kwargs) -> Dict[str, Any]:
text_a, text_b, _ = parse_text_and_label(data, self.mode,
self.first_sequence,
self.second_sequence)
if 'return_tensors' not in kwargs:
kwargs[
'return_tensors'] = 'pt' if self.mode == ModeKeys.INFERENCE else None
output = self.nlp_tokenizer(text_a, text_b, **kwargs)
max_seq_length = self.max_length

if text_b is None:
@@ -108,7 +243,7 @@ class FillMaskPoNetPreprocessor(NLPTokenizerPreprocessorBase):
seg_lens = list(
map(
len,
self.tokenizer(
self.nlp_tokenizer.tokenizer(
self.sent_tokenize(text_a),
add_special_tokens=False,
truncation=True)['input_ids']))
@@ -125,18 +260,36 @@ class FillMaskPoNetPreprocessor(NLPTokenizerPreprocessorBase):
k: np.array(v) if isinstance(v, list) else v
for k, v in output.items()
}

self.labels_to_id(labels, output)
return output

@property
def mask_id(self):
return self.tokenizer.mask_token_id
"""Return the id of the mask token.

Returns:
The id of mask token.
"""
return self.nlp_tokenizer.tokenizer.mask_token_id

def decode(self,
token_ids,
skip_special_tokens: bool = False,
clean_up_tokenization_spaces: bool = True,
**kwargs):
return self.tokenizer.decode(token_ids, skip_special_tokens,
clean_up_tokenization_spaces, **kwargs)
"""Turn the token_ids to real sentence.

Args:
token_ids (`Union[int, List[int], np.ndarray, torch.Tensor, tf.Tensor]`):
List of tokenized input ids. Can be obtained using the `__call__` method.
skip_special_tokens (`bool`, *optional*, defaults to `False`):
Whether or not to remove special tokens in the decoding.
clean_up_tokenization_spaces (`bool`, *optional*, defaults to `True`):
Whether or not to clean up the tokenization spaces.
kwargs (additional keyword arguments, *optional*):
Will be passed to the underlying model specific decode method.
Returns:
The real sentence decoded by the preprocessor.
"""
return self.nlp_tokenizer.tokenizer.decode(
token_ids, skip_special_tokens, clean_up_tokenization_spaces,
**kwargs)
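
A short sketch of the renamed fill-mask preprocessor and its mask_id/decode helpers; the model
dir is a placeholder and the literal mask token string depends on the underlying tokenizer.

>>> from modelscope.preprocessors import FillMaskTransformersPreprocessor
>>> p = FillMaskTransformersPreprocessor('<bert-style-model-dir>', sequence_length=128)
>>> encoded = p('The capital of China is [MASK].')
>>> p.mask_id  # id of the tokenizer's mask token
>>> print(p.decode(encoded['input_ids'][0], skip_special_tokens=True))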

+ 0
- 291
modelscope/preprocessors/nlp/nlp_base.py View File

@@ -1,291 +0,0 @@
# Copyright (c) Alibaba, Inc. and its affiliates.

import os
from abc import ABC
from collections.abc import Mapping
from typing import Any, Dict, List, Tuple, Union

import json
import numpy as np
import torch
from transformers import AutoTokenizer

from modelscope.metainfo import Models
from modelscope.outputs import OutputKeys
from modelscope.preprocessors.base import Preprocessor
from modelscope.utils.constant import ModeKeys
from modelscope.utils.hub import get_model_type, parse_label_mapping
from modelscope.utils.logger import get_logger

logger = get_logger()

__all__ = [
'NLPBasePreprocessor',
'NLPTokenizerPreprocessorBase',
]


class NLPBasePreprocessor(Preprocessor, ABC):

def __init__(self,
model_dir: str,
first_sequence=None,
second_sequence=None,
label=None,
label2id=None,
mode=ModeKeys.INFERENCE,
use_fast=None,
**kwargs):
"""The NLP preprocessor base class.

Args:
model_dir (str): The local model path
first_sequence: The key for the first sequence
second_sequence: The key for the second sequence
label: The label key
label2id: An optional label2id mapping, the class will try to call utils.parse_label_mapping
if this mapping is not supplied.
mode: Run this preprocessor in either 'train'/'eval'/'inference' mode
use_fast: use the fast version of tokenizer

"""
self.model_dir = model_dir
self.first_sequence = first_sequence
self.second_sequence = second_sequence
self.label = label

self.use_fast = use_fast
if self.use_fast is None and model_dir is None:
self.use_fast = False
elif self.use_fast is None and os.path.isfile(
os.path.join(model_dir, 'tokenizer_config.json')):
with open(
os.path.join(model_dir, 'tokenizer_config.json'),
'r',
encoding='utf-8') as f:
json_config = json.load(f)
self.use_fast = json_config.get('use_fast')
self.use_fast = False if self.use_fast is None else self.use_fast

self.label2id = label2id
if self.label2id is None and model_dir is not None:
self.label2id = parse_label_mapping(model_dir)
super().__init__(mode, **kwargs)

@property
def mask_id(self):
"""Child preprocessor can override this property to return the id of mask token.

Returns:
The id of mask token, default None.
"""
return None

def decode(self,
token_ids: Union[int, List[int], 'np.ndarray', 'torch.Tensor',
'tf.Tensor'],
skip_special_tokens: bool = False,
clean_up_tokenization_spaces: bool = True,
**kwargs):
"""Turn the token_ids to real sentence.

Args:
token_ids (`Union[int, List[int], np.ndarray, torch.Tensor, tf.Tensor]`):
List of tokenized input ids. Can be obtained using the `__call__` method.
skip_special_tokens (`bool`, *optional*, defaults to `False`):
Whether or not to remove special tokens in the decoding.
clean_up_tokenization_spaces (`bool`, *optional*, defaults to `True`):
Whether or not to clean up the tokenization spaces.
kwargs (additional keyword arguments, *optional*):
Will be passed to the underlying model specific decode method.
Returns:
The real sentence decoded by the preprocessor.
"""
raise NotImplementedError()


class NLPTokenizerPreprocessorBase(NLPBasePreprocessor):

def __init__(self,
model_dir: str,
first_sequence: str = None,
second_sequence: str = None,
label: str = 'label',
label2id: dict = None,
mode: str = ModeKeys.INFERENCE,
use_fast: bool = None,
**kwargs):
"""The NLP tokenizer preprocessor base class.

Any nlp preprocessor which uses the hf tokenizer can inherit from this class.

Args:
model_dir (str): The local model path
first_sequence: The key for the first sequence
second_sequence: The key for the second sequence
label: The key for the label
label2id: An optional label2id dict.
If label2id is None, the preprocessor will try to parse label-id mapping from:
- configuration.json model.label2id/model.id2label
- config.json label2id/id2label
- label_mapping.json
mode: Run this preprocessor in either 'train'/'eval'/'inference' mode, the behavior may be different.
use_fast: use the fast version of tokenizer
kwargs: These kwargs will be directly fed into the tokenizer.
"""

super().__init__(model_dir, first_sequence, second_sequence, label,
label2id, mode, use_fast, **kwargs)
self.model_dir = model_dir
self.tokenize_kwargs = kwargs
self.tokenizer = self.build_tokenizer(model_dir)
logger.info(f'The key of sentence1: {self.first_sequence}, '
f'The key of sentence2: {self.second_sequence}, '
f'The key of label: {self.label}')
if self.first_sequence is None:
logger.warning('[Important] first_sequence attribute is not set, '
'this will cause an error if your input is a dict.')

@property
def id2label(self):
"""Return the id2label mapping according to the label2id mapping.

@return: The id2label mapping if exists.
"""
if self.label2id is not None:
return {id: label for label, id in self.label2id.items()}
return None

def build_tokenizer(self, model_dir):
"""Build a tokenizer by the model type.

NOTE: This default implementation only returns slow tokenizer, because the fast tokenizers have a
multi-thread problem.

Args:
model_dir: The local model dir.

Returns:
The initialized tokenizer.
"""
self.is_transformer_based_model = 'lstm' not in model_dir
# fast version lead to parallel inference failed
model_type = get_model_type(model_dir)
if model_type in (Models.structbert, Models.gpt3, Models.palm,
Models.plug):
from modelscope.models.nlp.structbert import SbertTokenizer, SbertTokenizerFast
tokenizer = SbertTokenizerFast if self.use_fast else SbertTokenizer
return tokenizer.from_pretrained(model_dir)
elif model_type == Models.veco:
from modelscope.models.nlp.veco import VecoTokenizer, VecoTokenizerFast
tokenizer = VecoTokenizerFast if self.use_fast else VecoTokenizer
return tokenizer.from_pretrained(model_dir)
elif model_type == Models.deberta_v2:
from modelscope.models.nlp.deberta_v2 import DebertaV2Tokenizer, DebertaV2TokenizerFast
tokenizer = DebertaV2TokenizerFast if self.use_fast else DebertaV2Tokenizer
return tokenizer.from_pretrained(model_dir)
elif not self.is_transformer_based_model:
from transformers import BertTokenizer, BertTokenizerFast
tokenizer = BertTokenizerFast if self.use_fast else BertTokenizer
return tokenizer.from_pretrained(model_dir)
else:
return AutoTokenizer.from_pretrained(
model_dir, use_fast=self.use_fast)

def __call__(self, data: Union[str, Tuple, Dict]) -> Dict[str, Any]:
"""process the raw input data

Args:
data (tuple): [sentence1, sentence2]
sentence1 (str): a sentence
Example:
'you are so handsome.'
sentence2 (str): a sentence
Example:
'you are so beautiful.'
Returns:
Dict[str, Any]: the preprocessed data
"""

text_a, text_b, labels = self.parse_text_and_label(data)
output = self.tokenizer(
text_a,
text_b,
return_tensors='pt' if self._mode == ModeKeys.INFERENCE else None,
**self.tokenize_kwargs)
output = {
k: np.array(v) if isinstance(v, list) else v
for k, v in output.items()
}
self.labels_to_id(labels, output)
return output

def parse_text_and_label(self, data):
"""Parse the input and return the sentences and labels.

When input type is tuple or list and its size is 2:
If the pair param is False, data will be parsed as the first_sentence and the label,
else it will be parsed as the first_sentence and the second_sentence.

Args:
data: The input data.

Returns:
The sentences and labels tuple.
"""
text_a, text_b, labels = None, None, None
if isinstance(data, str):
text_a = data
elif isinstance(data, tuple) or isinstance(data, list):
if len(data) == 3:
text_a, text_b, labels = data
elif len(data) == 2:
if self._mode == ModeKeys.INFERENCE:
text_a, text_b = data
else:
text_a, labels = data
elif isinstance(data, Mapping):
text_a = data.get(self.first_sequence)
text_b = data.get(self.second_sequence)
labels = data.get(self.label)

return text_a, text_b, labels

def labels_to_id(self, labels, output):
"""Turn the labels to id with the type int or float.

If the original label's type is str or int, the label2id mapping will try to convert it to the final label.
If the original label's type is float, or the label2id mapping does not exist,
the original label will be returned.

Args:
labels: The input labels.
output: The output dict into which the converted label ids will be filled.

Returns:
The final labels.
"""

def label_can_be_mapped(label):
return isinstance(label, str) or isinstance(label, int)

try:
if isinstance(labels, (tuple, list)) and all([label_can_be_mapped(label) for label in labels]) \
and self.label2id is not None:
output[OutputKeys.LABELS] = [
self.label2id[label]
if label in self.label2id else self.label2id[str(label)]
for label in labels
]
elif label_can_be_mapped(labels) and self.label2id is not None:
output[OutputKeys.LABELS] = self.label2id[
labels] if labels in self.label2id else self.label2id[str(
labels)]
elif labels is not None:
output[OutputKeys.LABELS] = labels
except KeyError as e:
logger.error(
f'Label {labels} cannot be found in the label mapping {self.label2id},'
f'which comes from the user input or the configuration files. '
f'Please consider matching your labels with this mapping.')
raise e
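
As a side note, the fallback behaviour of this mapping (direct lookup first, then the stringified key, floats passed through) can be summarised in a minimal, self-contained sketch with a hypothetical label2id dict:

label2id = {'0': 0, '1': 1, 'negative': 0, 'positive': 1}  # hypothetical mapping

def to_id(label, label2id):
    if isinstance(label, (str, int)) and label2id is not None:
        # direct lookup first, then fall back to the stringified key
        return label2id[label] if label in label2id else label2id[str(label)]
    return label  # e.g. float regression targets are kept as-is

assert to_id('positive', label2id) == 1
assert to_id(1, label2id) == 1      # int 1 falls back to label2id['1']
assert to_id(0.5, label2id) == 0.5  # float labels pass through unchanged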

+ 17
- 13
modelscope/preprocessors/nlp/relation_extraction_preprocessor.py View File

@@ -5,34 +5,36 @@ from typing import Any, Dict
from transformers import AutoTokenizer

from modelscope.metainfo import Preprocessors
from modelscope.preprocessors import Preprocessor
from modelscope.preprocessors.builder import PREPROCESSORS
from modelscope.utils.constant import Fields
from modelscope.utils.constant import Fields, ModeKeys
from modelscope.utils.type_assert import type_assert
from .nlp_base import NLPBasePreprocessor


@PREPROCESSORS.register_module(
Fields.nlp, module_name=Preprocessors.re_tokenizer)
class RelationExtractionPreprocessor(NLPBasePreprocessor):
"""The relation extraction preprocessor used in normal RE task.
"""
class RelationExtractionTransformersPreprocessor(Preprocessor):

def __init__(self, model_dir: str, *args, **kwargs):
"""preprocess the data
def __init__(
self,
model_dir: str,
mode: str = ModeKeys.INFERENCE,
**kwargs,
):
"""The preprocessor for relation Extraction task, based on transformers' tokenizer.

Args:
model_dir (str): model path
model_dir: The model dir used to initialize the tokenizer.
mode: The mode for the preprocessor.
"""

super().__init__(model_dir, *args, **kwargs)

super().__init__(mode)
self.model_dir: str = model_dir
self.sequence_length = kwargs.pop('sequence_length', 512)
self.tokenizer = AutoTokenizer.from_pretrained(
model_dir, use_fast=True)

@type_assert(object, str)
def __call__(self, data: str) -> Dict[str, Any]:
def __call__(self, data: str, **kwargs) -> Dict[str, Any]:
"""process the raw input data

Args:
@@ -46,7 +48,9 @@ class RelationExtractionPreprocessor(NLPBasePreprocessor):

# preprocess the data for the model input
text = data
output = self.tokenizer([text], return_tensors='pt')
if 'return_tensors' not in kwargs:
kwargs['return_tensors'] = 'pt'
output = self.tokenizer([text], **kwargs)
return {
'text': text,
'input_ids': output['input_ids'],


+ 0
- 25
modelscope/preprocessors/nlp/sentence_classification_preprocessor.py View File

@@ -1,25 +0,0 @@
# Copyright (c) Alibaba, Inc. and its affiliates.

from modelscope.metainfo import Preprocessors
from modelscope.preprocessors.builder import PREPROCESSORS
from modelscope.utils.constant import Fields, ModeKeys
from .nlp_base import NLPTokenizerPreprocessorBase


@PREPROCESSORS.register_module(
Fields.nlp, module_name=Preprocessors.nli_tokenizer)
@PREPROCESSORS.register_module(
Fields.nlp, module_name=Preprocessors.sen_sim_tokenizer)
@PREPROCESSORS.register_module(
Fields.nlp, module_name=Preprocessors.bert_seq_cls_tokenizer)
@PREPROCESSORS.register_module(
Fields.nlp, module_name=Preprocessors.sen_cls_tokenizer)
class SequenceClassificationPreprocessor(NLPTokenizerPreprocessorBase):
"""The tokenizer preprocessor used in sequence classification.
"""

def __init__(self, model_dir: str, mode=ModeKeys.INFERENCE, **kwargs):
kwargs['truncation'] = kwargs.get('truncation', True)
kwargs['padding'] = kwargs.get('padding', 'max_length')
kwargs['max_length'] = kwargs.pop('sequence_length', 128)
super().__init__(model_dir, mode=mode, **kwargs)

+ 49
- 19
modelscope/preprocessors/nlp/sentence_embedding_preprocessor.py View File

@@ -1,31 +1,61 @@
# Copyright (c) Alibaba, Inc. and its affiliates.

from typing import Any, Dict, Union
from typing import Any, Dict

from modelscope.metainfo import Preprocessors
from modelscope.preprocessors import Preprocessor
from modelscope.preprocessors.builder import PREPROCESSORS
from modelscope.utils.constant import Fields, ModeKeys
from .nlp_base import NLPTokenizerPreprocessorBase
from modelscope.utils.hub import get_model_type
from .transformers_tokenizer import NLPTokenizer


@PREPROCESSORS.register_module(
Fields.nlp, module_name=Preprocessors.sentence_embedding)
class SentenceEmbeddingPreprocessor(NLPTokenizerPreprocessorBase):
class SentenceEmbeddingTransformersPreprocessor(Preprocessor):
"""The tokenizer preprocessor used in sentence embedding.
"""

def __init__(self, model_dir: str, mode=ModeKeys.INFERENCE, **kwargs):
kwargs['truncation'] = kwargs.get('truncation', True)
kwargs['padding'] = kwargs.get('padding', 'max_length')
kwargs['max_length'] = kwargs.pop('sequence_length', 128)
super().__init__(model_dir, mode=mode, **kwargs)
def __init__(self,
model_dir: str,
first_sequence='source_sentence',
second_sequence='sentences_to_compare',
mode=ModeKeys.INFERENCE,
use_fast: bool = None,
sequence_length: int = 128,
**kwargs):
"""The preprocessor for sentence embedding task, based on transformers' tokenizer.

def __call__(self, data: Union[str, Dict]) -> Dict[str, Any]:
Args:
model_dir: The model dir used to initialize the tokenizer.
first_sequence: The key of the first sequence.
second_sequence: The key of the second sequence.
mode: The mode for the preprocessor.
use_fast: Use the fast tokenizer or not.
sequence_length: The max sequence length the model supports; it will be
passed to the tokenizer as the 'max_length' param.
**kwargs: Extra args input into the tokenizer's __call__ method.
"""
self.first_sequence = first_sequence
self.second_sequence = second_sequence
kwargs['max_length'] = sequence_length
model_type = None
if model_dir is not None:
model_type = get_model_type(model_dir)
self.nlp_tokenizer = NLPTokenizer(
model_dir, model_type, use_fast=use_fast, tokenize_kwargs=kwargs)
super().__init__(mode=mode)

def __call__(self,
data: Dict,
padding=True,
truncation=True,
**kwargs) -> Dict[str, Any]:
"""process the raw input data

Args:
data Dict:
keys: "source_sentence" && "sentences_to_compare"
keys: the source sentence and the sentences to compare
values: list of sentences
Example:
{"source_sentence": ["how long it take to get a master's degree"],
@@ -37,16 +67,16 @@ class SentenceEmbeddingPreprocessor(NLPTokenizerPreprocessorBase):
Returns:
Dict[str, Any]: the preprocessed data
"""
source_sentence = data['source_sentence']
compare_sentences = data['sentences_to_compare']
sentences = []
sentences.append(source_sentence[0])
source_sentence = data[self.first_sequence]
compare_sentences = data[self.second_sequence]
sentences = [source_sentence[0]]
for sent in compare_sentences:
sentences.append(sent)

tokenized_inputs = self.tokenizer(
sentences,
return_tensors='pt' if self._mode == ModeKeys.INFERENCE else None,
padding=True,
truncation=True)
if 'return_tensors' not in kwargs:
kwargs[
'return_tensors'] = 'pt' if self.mode == ModeKeys.INFERENCE else None

tokenized_inputs = self.nlp_tokenizer(
sentences, padding=padding, truncation=truncation, **kwargs)
return tokenized_inputs
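
A small sketch of the expected input layout may help here: the query and its candidates are flattened into one batch before tokenization (keys are the defaults; any HF tokenizer could stand in for NLPTokenizer, so the actual call is left commented):

data = {
    'source_sentence': ["how long it take to get a master's degree"],
    'sentences_to_compare': [
        'On average, students take about 18 to 24 months to complete it.',
        'The average cat weighs around 4 kilograms.',
    ],
}
sentences = [data['source_sentence'][0]] + list(data['sentences_to_compare'])
# tokenized = nlp_tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
# -> a single batch: row 0 is the query, the remaining rows are the candidates
print(sentences)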

+ 15
- 7
modelscope/preprocessors/nlp/sentence_piece_preprocessor.py View File

@@ -1,7 +1,6 @@
# Copyright (c) Alibaba, Inc. and its affiliates.
import os
import os.path as osp
from typing import Any, Dict

import sentencepiece as spm
import torch
@@ -9,17 +8,26 @@ import torch
from modelscope.metainfo import Preprocessors
from modelscope.preprocessors.base import Preprocessor
from modelscope.preprocessors.builder import PREPROCESSORS
from modelscope.utils.constant import Fields
from modelscope.utils.constant import Fields, ModeKeys


@PREPROCESSORS.register_module(
Fields.nlp, module_name=Preprocessors.sentence_piece)
class SentencePiecePreprocessor(Preprocessor):

def __init__(self, model_dir: str, *args, **kwargs):
import os
def __init__(self,
model_dir: str,
mode=ModeKeys.INFERENCE,
*args,
**kwargs):
"""The preprocessor for the sentence piece tokenizer.

Args:
model_dir: The model dir that contains the essential files used by the `SentencePieceProcessor`.
mode: The mode for the preprocessor.
"""

super().__init__(*args, **kwargs)
super().__init__(mode)
self.tokenizer = None
for file_name in os.listdir(model_dir):
if file_name.endswith('.model'):
@@ -28,5 +36,5 @@ class SentencePiecePreprocessor(Preprocessor):
break
assert self.tokenizer is not None, 'Can not find .model file'

def __call__(self, data: str) -> Dict[str, Any]:
def __call__(self, data: str) -> torch.Tensor:
return torch.tensor(self.tokenizer.encode([data]), dtype=torch.long)
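
Roughly, the preprocessor above is equivalent to the following sketch (the .model path is hypothetical; whatever *.model file is found in model_dir is loaded):

import sentencepiece as spm
import torch

sp = spm.SentencePieceProcessor()
sp.load('/path/to/model_dir/tokenizer.model')  # hypothetical *.model file
ids = torch.tensor(sp.encode(['an example sentence']), dtype=torch.long)
print(ids.shape)  # (1, sequence_length), matching the __call__ above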

+ 0
- 40
modelscope/preprocessors/nlp/text2text_generation_preprocessor.py View File

@@ -1,40 +0,0 @@
# Copyright (c) Alibaba, Inc. and its affiliates.

from typing import Any, Dict, Union

from modelscope.metainfo import Preprocessors
from modelscope.preprocessors.builder import PREPROCESSORS
from modelscope.utils.constant import Fields, ModeKeys
from .nlp_base import NLPTokenizerPreprocessorBase


@PREPROCESSORS.register_module(
Fields.nlp, module_name=Preprocessors.text2text_gen_preprocessor)
class Text2TextGenerationPreprocessor(NLPTokenizerPreprocessorBase):
"""The tokenizer preprocessor used in text generation.
"""

def __init__(self,
model_dir: str,
tokenizer=None,
mode=ModeKeys.INFERENCE,
**kwargs):
kwargs['truncation'] = kwargs.get('truncation', 'do_not_truncate')
kwargs['padding'] = kwargs.get('padding', False)
kwargs['return_token_type_ids'] = kwargs.get('return_token_type_ids',
False)
kwargs['max_length'] = kwargs.pop('sequence_length', 128)
super().__init__(model_dir, mode=mode, **kwargs)

def __call__(self, data: Union[Dict, str]) -> Dict[str, Any]:
text_a, _, _ = self.parse_text_and_label(data)

inputs = self.tokenizer(
text_a,
return_tensors='pt' if self._mode == ModeKeys.INFERENCE else None,
**self.tokenize_kwargs)

# This is produced by tokenizers but is an invalid generate kwargs
if 'token_type_ids' in inputs:
del inputs['token_type_ids']
return inputs

+ 152
- 0
modelscope/preprocessors/nlp/text_classification_preprocessor.py View File

@@ -0,0 +1,152 @@
# Copyright (c) Alibaba, Inc. and its affiliates.

from abc import abstractmethod
from typing import Any, Dict, List, Tuple, Union

import numpy as np

from modelscope.metainfo import Preprocessors
from modelscope.preprocessors import Preprocessor
from modelscope.preprocessors.builder import PREPROCESSORS
from modelscope.utils.constant import Fields, ModeKeys
from modelscope.utils.hub import get_model_type, parse_label_mapping
from modelscope.utils.logger import get_logger
from .transformers_tokenizer import NLPTokenizer
from .utils import labels_to_id, parse_text_and_label

logger = get_logger(__name__)


class TextClassificationPreprocessorBase(Preprocessor):

def __init__(
self,
model_dir=None,
first_sequence: str = None,
second_sequence: str = None,
label: str = 'label',
label2id: Dict = None,
mode: str = ModeKeys.INFERENCE,
):
"""The base class for the text classification preprocessor.

Args:
model_dir(str, `optional`): The model dir used to parse the label mapping, can be None.
first_sequence(str, `optional`): The key of the first sequence.
second_sequence(str, `optional`): The key of the second sequence.
label(str, `optional`): The key of the label column, default is `label`.
label2id (dict, `optional`): The optional label2id mapping.
mode: The mode for the preprocessor
"""
super().__init__(mode)
self.model_dir = model_dir
self.first_sequence = first_sequence
self.second_sequence = second_sequence
self.label = label
self.label2id = label2id
if self.label2id is None and self.model_dir is not None:
self.label2id = parse_label_mapping(self.model_dir)

logger.info(f'The key of sentence1: {self.first_sequence}, '
f'The key of sentence2: {self.second_sequence}, '
f'The key of label: {self.label}')
if self.first_sequence is None:
logger.warning('[Important] first_sequence attribute is not set, '
'this will cause an error if your input is a dict.')

@property
def id2label(self):
"""Return the id2label mapping according to the label2id mapping.

@return: The id2label mapping if it exists.
"""
if self.label2id is not None:
return {id: label for label, id in self.label2id.items()}
return None

def __call__(self, data: Union[str, Tuple, Dict],
**kwargs) -> Dict[str, Any]:
"""process the raw input data

Args:
data (tuple): [sentence1, sentence2]
sentence1 (str): a sentence
Example:
'you are so handsome.'
sentence2 (str): a sentence
Example:
'you are so beautiful.'
Returns:
Dict[str, Any]: the preprocessed data
"""

text_a, text_b, labels = parse_text_and_label(data, self.mode,
self.first_sequence,
self.second_sequence,
self.label)
output = self._tokenize_text(text_a, text_b, **kwargs)
output = {
k: np.array(v) if isinstance(v, list) else v
for k, v in output.items()
}
labels_to_id(labels, output, self.label2id)
return output

def _tokenize_text(self, sequence1, sequence2=None, **kwargs):
"""Tokenize the text.

Args:
sequence1: The first sequence.
sequence2: The second sequence which may be None.

Returns:
The encoded sequence.
"""
raise NotImplementedError()


@PREPROCESSORS.register_module(
Fields.nlp, module_name=Preprocessors.nli_tokenizer)
@PREPROCESSORS.register_module(
Fields.nlp, module_name=Preprocessors.sen_sim_tokenizer)
@PREPROCESSORS.register_module(
Fields.nlp, module_name=Preprocessors.bert_seq_cls_tokenizer)
@PREPROCESSORS.register_module(
Fields.nlp, module_name=Preprocessors.sen_cls_tokenizer)
class TextClassificationTransformersPreprocessor(
TextClassificationPreprocessorBase):

def _tokenize_text(self, sequence1, sequence2=None, **kwargs):
if 'return_tensors' not in kwargs:
kwargs[
'return_tensors'] = 'pt' if self.mode == ModeKeys.INFERENCE else None
return self.nlp_tokenizer(sequence1, sequence2, **kwargs)

def __init__(self,
model_dir=None,
first_sequence: str = None,
second_sequence: str = None,
label: Union[str, List] = 'label',
label2id: Dict = None,
mode: str = ModeKeys.INFERENCE,
sequence_length: int = 128,
use_fast: bool = None,
**kwargs):
"""The tokenizer preprocessor used in sequence classification.

Args:
use_fast: Whether to use the fast tokenizer or not.
sequence_length: The max sequence length the model supports; it will be
passed to the tokenizer as the 'max_length' param.
**kwargs: Extra args input into the tokenizer's __call__ method.
"""
kwargs['truncation'] = kwargs.get('truncation', True)
kwargs['padding'] = kwargs.get('padding', 'max_length')
kwargs['max_length'] = sequence_length
model_type = None
if model_dir is not None:
model_type = get_model_type(model_dir)
self.nlp_tokenizer = NLPTokenizer(
model_dir, model_type, use_fast=use_fast, tokenize_kwargs=kwargs)
super().__init__(model_dir, first_sequence, second_sequence, label,
label2id, mode)
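
The split between the base class and this subclass is essentially a template method: the base handles input parsing and label mapping, and delegates only the tokenization step. A toy sketch of the same pattern (not the real classes, no modelscope imports):

class ToyClassificationPreprocessorBase:
    def __call__(self, first, second=None):
        output = self._tokenize_text(first, second)  # the only subclass-specific step
        # parsing of dict/tuple inputs and label2id mapping would happen around this
        return output

    def _tokenize_text(self, first, second=None):
        raise NotImplementedError()

class WhitespaceToyPreprocessor(ToyClassificationPreprocessorBase):
    def _tokenize_text(self, first, second=None):
        tokens = first.split() + (second.split() if second else [])
        return {'input_ids': list(range(len(tokens)))}

print(WhitespaceToyPreprocessor()('you are so handsome.', 'you are so beautiful.'))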

+ 2
- 3
modelscope/preprocessors/nlp/text_error_correction.py View File

@@ -7,12 +7,11 @@ from modelscope.metainfo import Preprocessors
from modelscope.preprocessors.base import Preprocessor
from modelscope.preprocessors.builder import PREPROCESSORS
from modelscope.utils.constant import Fields
from .nlp_base import NLPBasePreprocessor


@PREPROCESSORS.register_module(
Fields.nlp, module_name=Preprocessors.text_error_correction)
class TextErrorCorrectionPreprocessor(NLPBasePreprocessor):
class TextErrorCorrectionPreprocessor(Preprocessor):
"""The preprocessor used in text correction task.
"""

@@ -23,7 +22,7 @@ class TextErrorCorrectionPreprocessor(NLPBasePreprocessor):
Args:
model_dir (str): model path
"""
super().__init__(model_dir, *args, **kwargs)
super().__init__(*args, **kwargs)
self.vocab = Dictionary.load(osp.join(model_dir, 'dict.src.txt'))

def __call__(self, data: str) -> Dict[str, Any]:


+ 0
- 44
modelscope/preprocessors/nlp/text_generation_jieba_preprocessor.py View File

@@ -1,44 +0,0 @@
# Copyright (c) Alibaba, Inc. and its affiliates.

import os.path as osp
from typing import Any, Dict

from modelscope.metainfo import Preprocessors
from modelscope.preprocessors.base import Preprocessor
from modelscope.preprocessors.builder import PREPROCESSORS
from modelscope.utils.constant import Fields


@PREPROCESSORS.register_module(
Fields.nlp, module_name=Preprocessors.text_gen_jieba_tokenizer)
class TextGenerationJiebaPreprocessor(Preprocessor):
"""The jieba tokenizer preprocessor used in text generation.
"""

def __init__(self, model_dir: str, *args, **kwargs):
from modelscope.models.nlp.gpt3 import JiebaBPETokenizer
super().__init__(*args, **kwargs)
self.tokenizer = JiebaBPETokenizer(
osp.join(model_dir, 'tokenizer.json'))

def __call__(self, data: str) -> Dict[str, Any]:
"""process the raw input data

Args:
data (str): a sentence
Example:
'深蓝的天空中挂着一轮金黄的圆月,下面是海边的沙地'
Returns:
Dict[str, Any]: the preprocessed data
Example:
{'net_input':
{'src_tokens':tensor([1,2,3,4]),
'src_lengths': tensor([4])}
}
"""
import torch

return {
'input_ids':
torch.tensor(self.tokenizer.tokenize(data)).unsqueeze_(0)
}

+ 234
- 39
modelscope/preprocessors/nlp/text_generation_preprocessor.py View File

@@ -1,62 +1,257 @@
# Copyright (c) Alibaba, Inc. and its affiliates.

import os.path as osp
from typing import Any, Dict, Optional, Union

import numpy as np
import torch

from modelscope.metainfo import Preprocessors
from modelscope.preprocessors.base import Preprocessor
from modelscope.preprocessors.builder import PREPROCESSORS
from modelscope.utils.constant import Fields, ModeKeys
from .nlp_base import NLPTokenizerPreprocessorBase
from modelscope.utils.hub import get_model_type
from modelscope.utils.logger import get_logger
from .transformers_tokenizer import NLPTokenizer
from .utils import parse_text_and_label

logger = get_logger(__name__)


class TextGenerationPreprocessorBase(Preprocessor):

def __init__(self,
mode: str = ModeKeys.INFERENCE,
src_txt='src_txt',
tgt_txt='tgt_txt'):
"""The base class for all the text generation task's preprocessors.

Args:
mode: The preprocessor mode.
src_txt: The key for the src text.
tgt_txt: The key for the tgt text.
"""
super().__init__(mode)
self.src_txt = src_txt
self.tgt_txt = tgt_txt

def _tokenize_text(self, sequence1, sequence2=None, **kwargs):
"""Tokenize the text.

Args:
sequence1: The first sequence.
sequence2: The second sequence which may be None.

Returns:
The encoded sequence.
"""
raise NotImplementedError()

def __call__(self, data: Union[Dict, str], **kwargs) -> Dict[str, Any]:
text_a, text_b = parse_text_and_label(data, self.mode, self.src_txt,
self.tgt_txt)[0:2]

output = self._tokenize_text(text_a, text_b, **kwargs)
output = {
k: np.array(v) if isinstance(v, list) else v
for k, v in output.items()
}
return output

def decode(self, tokens, **kwargs):
"""Decode the tokens to real text.

Args:
tokens: The output tokens from model's `forward` and `generate`

Returns:
The actual text.
"""
raise NotImplementedError()


class NLPTokenizerForRoberta(NLPTokenizer):

def build_tokenizer(self):

def get_roberta_tokenizer_dir(model_dir: str) -> Optional[str]:
import os
for name in os.listdir(model_dir):
full_name = os.path.join(model_dir, name)
if 'roberta' in name and os.path.isdir(full_name):
return full_name

roberta_tokenizer_dir = get_roberta_tokenizer_dir(self.model_dir)
if roberta_tokenizer_dir:
from transformers import RobertaTokenizer
return RobertaTokenizer.from_pretrained(
roberta_tokenizer_dir, do_lower_case=False)
return super().build_tokenizer()


@PREPROCESSORS.register_module(
Fields.nlp, module_name=Preprocessors.text_gen_tokenizer)
class TextGenerationPreprocessor(NLPTokenizerPreprocessorBase):
"""The tokenizer preprocessor used in text generation.
"""
class TextGenerationTransformersPreprocessor(TextGenerationPreprocessorBase):

def __init__(self,
model_dir: str,
tokenizer=None,
mode=ModeKeys.INFERENCE,
mode: str = ModeKeys.INFERENCE,
src_txt='src_txt',
tgt_txt='tgt_txt',
sequence_length: int = 128,
use_fast: bool = None,
**kwargs):
"""The tokenizer preprocessor used in text generation.

Args:
model_dir: The model dir used to initialize the tokenizer.
mode: The mode for the preprocessor.
src_txt: The key of the source sentence.
tgt_txt: The key of the generated sentence.
sequence_length: The max sequence length the model supports; it will be
passed to the tokenizer as the 'max_length' param.
use_fast: Whether to use the fast tokenizer or not.
**kwargs: Extra args input into the tokenizer's __call__ method.
"""
if 'first_sequence' in kwargs:
src_txt = kwargs.pop('first_sequence')
super().__init__(mode, src_txt, tgt_txt)
kwargs['truncation'] = kwargs.get('truncation', True)
kwargs['padding'] = kwargs.get('padding', 'max_length')
kwargs['return_token_type_ids'] = kwargs.get('return_token_type_ids',
False)
kwargs['max_length'] = kwargs.pop('sequence_length', 128)
super().__init__(model_dir, mode=mode, **kwargs)

@staticmethod
def get_roberta_tokenizer_dir(model_dir: str) -> Optional[str]:
import os
for name in os.listdir(model_dir):
full_name = os.path.join(model_dir, name)
if 'roberta' in name and os.path.isdir(full_name):
return full_name

def build_tokenizer(self, model_dir: str):
roberta_tokenizer_dir = self.get_roberta_tokenizer_dir(model_dir)
if roberta_tokenizer_dir:
from transformers import RobertaTokenizer
return RobertaTokenizer.from_pretrained(
roberta_tokenizer_dir, do_lower_case=False)
return super().build_tokenizer(model_dir)

def __call__(self, data: Union[Dict, str]) -> Dict[str, Any]:
if self._mode == ModeKeys.INFERENCE:
return super().__call__(data)
src_rst = super().__call__(data['src_txt'])
src_input_ids = src_rst['input_ids']
src_attention_mask = src_rst['attention_mask']
if 'tgt_txt' in data:
labels = super().__call__(data['tgt_txt'])['input_ids']
else:
labels = src_input_ids[1:]
src_input_ids = src_input_ids[:-1]
src_attention_mask = src_attention_mask[:-1]
kwargs['max_length'] = sequence_length
model_type = None
if model_dir is not None:
model_type = get_model_type(model_dir)
self.nlp_tokenizer = NLPTokenizerForRoberta(
model_dir, model_type, use_fast=use_fast, tokenize_kwargs=kwargs)

def decode(self, tokens, **kwargs):
"""Decode the tokens to real text.

Args:
tokens: The output tokens from model's `forward` and `generate`

Returns:
The actual text.
"""
return self.nlp_tokenizer.tokenizer.decode(tokens, **kwargs)

def _tokenize_text(self, sequence1, sequence2=None, **kwargs):
"""Tokenize the text.

Args:
sequence1: The first sequence.
sequence2: The second sequence which may be None.

Returns:
The encoded sequence.
"""
if 'return_tensors' not in kwargs:
kwargs[
'return_tensors'] = 'pt' if self.mode == ModeKeys.INFERENCE else None

output = self.nlp_tokenizer(sequence1, **kwargs)

if self.mode != ModeKeys.INFERENCE:
if sequence2 is not None:
labels = self.nlp_tokenizer(sequence2)['input_ids']
src_input_ids = output['input_ids']
src_attention_mask = output['attention_mask']
else:
labels = output['input_ids'][1:]
src_input_ids = output['input_ids'][:-1]
src_attention_mask = output['attention_mask'][:-1]

output = {
'input_ids': src_input_ids,
'attention_mask': src_attention_mask,
'labels': labels,
}
return output


@PREPROCESSORS.register_module(
Fields.nlp, module_name=Preprocessors.text_gen_jieba_tokenizer)
class TextGenerationJiebaPreprocessor(TextGenerationPreprocessorBase):
"""The jieba tokenizer preprocessor used in text generation.
"""

def __init__(self,
model_dir: str,
mode: str = ModeKeys.INFERENCE,
src_txt='src_txt',
tgt_txt=None):
from modelscope.models.nlp.gpt3 import JiebaBPETokenizer
super().__init__(mode, src_txt, tgt_txt)
if self.tgt_txt is not None:
logger.warning(
f'TextGenerationJiebaPreprocessor currently does not support training, '
f'the tgt_txt field ({self.tgt_txt}) will be ignored.')
self.src_txt = src_txt
self.tokenizer = JiebaBPETokenizer(
osp.join(model_dir, 'tokenizer.json'))

def decode(self, tokens, **kwargs):
"""Decode the tokens to real text.

Args:
tokens: The output tokens from model's `forward` and `generate`

Returns:
The actual text.
"""
return self.tokenizer.detokenize(tokens)

def _tokenize_text(self, sequence1, sequence2=None, **kwargs):
"""Tokenize the text.

Args:
sequence1: The first sequence.
sequence2: The second sequence which may be None.

Returns:
The encoded sequence.
"""
return {
'input_ids': src_input_ids,
'attention_mask': src_attention_mask,
'labels': labels,
'input_ids':
torch.tensor(self.tokenizer.tokenize(sequence1)).unsqueeze_(0)
}


@PREPROCESSORS.register_module(
Fields.nlp, module_name=Preprocessors.text2text_gen_preprocessor)
class TextGenerationT5Preprocessor(TextGenerationTransformersPreprocessor):

def __init__(self,
model_dir: str,
mode: str = ModeKeys.INFERENCE,
src_txt='src_txt',
tgt_txt='tgt_txt',
use_fast: bool = None,
sequence_length: int = 128,
**kwargs):
"""The preprocessor for text to text generation task, based on transformers' tokenizer.

Args:
model_dir: The model dir used to initialize the tokenizer.
src_txt: The key of the first sequence.
use_fast: Use the fast tokenizer or not.
sequence_length: The max sequence length the model supports; it will be
passed to the tokenizer as the 'max_length' param.
mode: The mode for the preprocessor.
**kwargs: Extra args input into the tokenizer's __call__ method.
"""
super().__init__(
model_dir,
mode=mode,
src_txt=src_txt,
tgt_txt=tgt_txt,
sequence_length=sequence_length,
use_fast=use_fast,
truncation=kwargs.pop('truncation', True),
padding=kwargs.pop('padding', 'max_length'),
return_token_type_ids=kwargs.pop('return_token_type_ids', False),
**kwargs)
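
For the generation preprocessors above, the training-mode branch is the interesting part: with a tgt_txt the labels are simply the tokenized target, otherwise the source ids are shifted by one position. A toy illustration with made-up token ids:

# Toy token ids, no tokenizer involved.
src_input_ids = [101, 7, 8, 9, 102]
src_attention_mask = [1, 1, 1, 1, 1]
tgt_input_ids = None  # i.e. no 'tgt_txt' was provided

if tgt_input_ids is not None:
    labels = tgt_input_ids
else:
    labels = src_input_ids[1:]           # next-token targets
    src_input_ids = src_input_ids[:-1]   # drop the last position accordingly
    src_attention_mask = src_attention_mask[:-1]

print(src_input_ids, labels)  # [101, 7, 8, 9] [7, 8, 9, 102]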

+ 45
- 34
modelscope/preprocessors/nlp/text_ranking_preprocessor.py View File

@@ -1,67 +1,78 @@
# Copyright (c) Alibaba, Inc. and its affiliates.

from typing import Any, Dict, Union
from typing import Any, Dict

from transformers import AutoTokenizer

from modelscope.metainfo import Preprocessors
from modelscope.preprocessors import Preprocessor
from modelscope.preprocessors.builder import PREPROCESSORS
from modelscope.utils.constant import Fields, ModeKeys
from modelscope.utils.type_assert import type_assert
from .nlp_base import NLPTokenizerPreprocessorBase


@PREPROCESSORS.register_module(
Fields.nlp, module_name=Preprocessors.text_ranking)
class TextRankingPreprocessor(NLPTokenizerPreprocessorBase):
"""The tokenizer preprocessor used in passage ranking model.
"""
class TextRankingTransformersPreprocessor(Preprocessor):

def __init__(self,
model_dir: str,
mode=ModeKeys.INFERENCE,
*args,
mode: str = ModeKeys.INFERENCE,
first_sequence='source_sentence',
second_sequence='sentences_to_compare',
label='labels',
qid='qid',
sequence_length=128,
**kwargs):
"""preprocess the data
"""The tokenizer preprocessor class for the text ranking preprocessor.

Args:
model_dir (str): model path
model_dir(str, `optional`): The model dir used to initialize the tokenizer, can be None.
first_sequence(str, `optional`): The key of the first sequence.
second_sequence(str, `optional`): The key of the second sequence.
label(str, `optional`): The key of the label column, default `labels`.
qid(str, `optional`): The key of the query id column, default `qid`.
mode: The mode for the preprocessor.
sequence_length: The max sequence length the model supports; it will be
passed to the tokenizer as the 'max_length' param.
"""
super().__init__(model_dir, mode=mode, *args, **kwargs)
self.model_dir: str = model_dir
self.first_sequence: str = kwargs.pop('first_sequence',
'source_sentence')
self.second_sequence = kwargs.pop('second_sequence',
'sentences_to_compare')
self.sequence_length = kwargs.pop('sequence_length', 128)

super().__init__(mode)
self.model_dir = model_dir
self.first_sequence = first_sequence
self.second_sequence = second_sequence
self.label = label
self.qid = qid
self.sequence_length = sequence_length
self.tokenizer = AutoTokenizer.from_pretrained(self.model_dir)

@type_assert(object, (str, tuple, Dict))
def __call__(self, data: Union[tuple, Dict]) -> Dict[str, Any]:
if isinstance(data, tuple):
sentence1, sentence2 = data
elif isinstance(data, dict):
sentence1 = data.get(self.first_sequence)
sentence2 = data.get(self.second_sequence)
@type_assert(object, dict)
def __call__(self,
data: Dict,
padding='max_length',
truncation=True,
**kwargs) -> Dict[str, Any]:
sentence1 = data.get(self.first_sequence)
sentence2 = data.get(self.second_sequence)
labels = data.get(self.label)
qid = data.get(self.qid)

if isinstance(sentence2, str):
sentence2 = [sentence2]
if isinstance(sentence1, str):
sentence1 = [sentence1]
sentence1 = sentence1 * len(sentence2)

max_seq_length = self.sequence_length
kwargs['max_length'] = kwargs.get(
'max_length', kwargs.pop('sequence_length', self.sequence_length))
if 'return_tensors' not in kwargs:
kwargs['return_tensors'] = 'pt'
feature = self.tokenizer(
sentence1,
sentence2,
padding='max_length',
truncation=True,
max_length=max_seq_length,
return_tensors='pt')
if 'labels' in data:
labels = data['labels']
padding=padding,
truncation=truncation,
**kwargs)
if labels is not None:
feature['labels'] = labels
if 'qid' in data:
qid = data['qid']
if qid is not None:
feature['qid'] = qid
return feature
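
A quick sketch of the input layout expected by this preprocessor: the query is replicated so that each candidate forms a (query, candidate) pair, with labels and qid passed through untouched (keys are the defaults, the tokenizer call itself is omitted):

data = {
    'source_sentence': ['what is the capital of france'],
    'sentences_to_compare': ['Paris is the capital of France.',
                             'Berlin is the capital of Germany.'],
    'labels': [1, 0],
    'qid': [11, 11],
}
sentence1 = data['source_sentence'] * len(data['sentences_to_compare'])
# feature = tokenizer(sentence1, data['sentences_to_compare'],
#                     padding='max_length', truncation=True, return_tensors='pt')
print(list(zip(sentence1, data['sentences_to_compare'])))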

+ 351
- 208
modelscope/preprocessors/nlp/token_classification_preprocessor.py View File

@@ -1,28 +1,35 @@
# Copyright (c) Alibaba, Inc. and its affiliates.

from typing import Any, Dict, Tuple, Union
from typing import Any, Dict, List, Tuple, Union

import numpy as np
import torch

from modelscope.metainfo import Preprocessors
from modelscope.outputs import OutputKeys
from modelscope.preprocessors import Preprocessor
from modelscope.preprocessors.builder import PREPROCESSORS
from modelscope.utils.constant import Fields, ModeKeys
from modelscope.utils.hub import get_model_type, parse_label_mapping
from modelscope.utils.logger import get_logger
from modelscope.utils.type_assert import type_assert
from .nlp_base import NLPBasePreprocessor, NLPTokenizerPreprocessorBase
from .transformers_tokenizer import NLPTokenizer
from .utils import parse_text_and_label

logger = get_logger(__name__)


@PREPROCESSORS.register_module(
Fields.nlp,
module_name=Preprocessors.word_segment_text_to_label_preprocessor)
class WordSegmentationBlankSetToLabelPreprocessor(NLPBasePreprocessor):
class WordSegmentationBlankSetToLabelPreprocessor(Preprocessor):
"""The preprocessor used to turn a single sentence to a labeled token-classification dict.
"""

def __init__(self, **kwargs):
self.first_sequence: str = kwargs.pop('first_sequence', 'tokens')
self.label = kwargs.pop('label', OutputKeys.LABELS)
def __init__(self, generated_sentence='tokens', generated_label='labels'):
super().__init__()
self.generated_sentence = generated_sentence
self.generated_label = generated_label

def __call__(self, data: str) -> Union[Dict[str, Any], Tuple]:
data = data.split(' ')
@@ -43,9 +50,134 @@ class WordSegmentationBlankSetToLabelPreprocessor(NLPBasePreprocessor):

chars, labels = produce_train_sample(data)
return {
self.first_sequence: chars,
self.label: labels,
self.generated_sentence: chars,
self.generated_label: labels,
}


class TokenClassificationPreprocessorBase(Preprocessor):

def __init__(
self,
model_dir: str = None,
first_sequence: str = None,
label: str = 'label',
label2id: Dict = None,
label_all_tokens: bool = False,
mode: str = ModeKeys.INFERENCE,
):
"""The base class for all the token-classification tasks.

Args:
model_dir: The model dir used to build the label2id mapping.
If None, the user needs to pass in the `label2id` param.
first_sequence: The key for the text (token) column if the input type is a dict.
label: The key for the label column if the input type is a dict and the mode is `training` or `evaluation`.
label2id: The label2id mapping. If not provided, the model_dir is needed to search for the mapping
in the config files.
label_all_tokens: If a label exists in the dataset, the preprocessor will try to label the sub-tokens.
If label_all_tokens is true, all non-initial sub-tokens will get labels like `I-xxx`,
otherwise they will be filled with -100. Default False.
mode: The preprocessor mode.
"""
super().__init__(mode)
self.model_dir = model_dir
self.first_sequence = first_sequence
self.label = label
self.label2id = label2id
self.label_all_tokens = label_all_tokens
if self.label2id is None and self.model_dir is not None:
self.label2id = parse_label_mapping(self.model_dir)

@property
def id2label(self):
"""Return the id2label mapping according to the label2id mapping.

@return: The id2label mapping if it exists.
"""
if self.label2id is not None:
return {id: label for label, id in self.label2id.items()}
return None

def labels_to_id(self, labels_list, word_ids):
# align the labels with tokenized text
assert self.label2id is not None
# Map that sends B-Xxx label to its I-Xxx counterpart
b_to_i_label = []
label_enumerate_values = [
k for k, v in sorted(
self.label2id.items(), key=lambda item: item[1])
]
for idx, label in enumerate(label_enumerate_values):
if label.startswith('B-') and label.replace(
'B-', 'I-') in label_enumerate_values:
b_to_i_label.append(
label_enumerate_values.index(label.replace('B-', 'I-')))
else:
b_to_i_label.append(idx)

label_row = [self.label2id[lb] for lb in labels_list]
previous_word_idx = None
label_ids = []
for word_idx in word_ids:
if word_idx is None:
label_ids.append(-100)
elif word_idx != previous_word_idx:
label_ids.append(label_row[word_idx])
else:
if self.label_all_tokens:
label_ids.append(b_to_i_label[label_row[word_idx]])
else:
label_ids.append(-100)
previous_word_idx = word_idx
return label_ids

def _tokenize_text(self, sequence1, **kwargs):
"""Tokenize the text.

Args:
sequence1: The input sequence to tokenize.

Returns:
The encoded sequence.
"""
raise NotImplementedError()

@type_assert(object, (str, tuple, dict))
def __call__(self, data: Union[dict, tuple, str],
**kwargs) -> Dict[str, Any]:
text, _, label = parse_text_and_label(
data, self.mode, self.first_sequence, label=self.label)
outputs, word_ids = self._tokenize_text(text, **kwargs)
if label is not None:
label_ids = self.labels_to_id(label, word_ids)
outputs[OutputKeys.LABELS] = label_ids
outputs = {
k: np.array(v) if isinstance(v, list) else v
for k, v in outputs.items()
}
if self.mode == ModeKeys.INFERENCE:
outputs['text'] = text
return outputs


class NLPTokenizerForLSTM(NLPTokenizer):

def build_tokenizer(self):
if self.model_type == 'lstm':
from transformers import AutoTokenizer
return AutoTokenizer.from_pretrained(
self.model_dir, use_fast=self.use_fast, tokenizer_type='bert')
else:
return super().build_tokenizer()

def get_tokenizer_class(self):
tokenizer_class = self.tokenizer.__class__.__name__
if tokenizer_class.endswith(
'Fast') and tokenizer_class != 'PreTrainedTokenizerFast':
tokenizer_class = tokenizer_class[:-4]
return tokenizer_class


@PREPROCESSORS.register_module(
@@ -54,227 +186,238 @@ class WordSegmentationBlankSetToLabelPreprocessor(NLPBasePreprocessor):
Fields.nlp, module_name=Preprocessors.token_cls_tokenizer)
@PREPROCESSORS.register_module(
Fields.nlp, module_name=Preprocessors.sequence_labeling_tokenizer)
class TokenClassificationPreprocessor(NLPTokenizerPreprocessorBase):
class TokenClassificationTransformersPreprocessor(
TokenClassificationPreprocessorBase):
"""The tokenizer preprocessor used in normal NER task.
"""

def __init__(self, model_dir: str, mode=ModeKeys.INFERENCE, **kwargs):
"""preprocess the data
def __init__(self,
model_dir: str = None,
first_sequence: str = None,
label: str = 'label',
label2id: Dict = None,
label_all_tokens: bool = False,
mode: str = ModeKeys.INFERENCE,
sequence_length=128,
use_fast=None,
**kwargs):
"""

Args:
model_dir (str): model path
use_fast: Whether to use the fast tokenizer or not.
sequence_length: The max sequence length the model supports; it will be
passed to the tokenizer as the 'max_length' param.
**kwargs: Extra args input into the tokenizer's __call__ method.
"""
super().__init__(model_dir, first_sequence, label, label2id,
label_all_tokens, mode)
self.is_lstm_model = 'lstm' in model_dir
model_type = None
if self.is_lstm_model:
model_type = 'lstm'
elif model_dir is not None:
model_type = get_model_type(model_dir)
kwargs['truncation'] = kwargs.get('truncation', True)
kwargs['padding'] = kwargs.get(
'padding', False if mode == ModeKeys.INFERENCE else 'max_length')
kwargs['max_length'] = kwargs.pop('sequence_length', 128)
self.sequence_length = kwargs['max_length']
self.label_all_tokens = kwargs.pop('label_all_tokens', False)
super().__init__(model_dir, mode=mode, **kwargs)

if 'is_split_into_words' in kwargs:
self.tokenize_kwargs['is_split_into_words'] = kwargs.pop(
'is_split_into_words')
else:
self.tokenize_kwargs[
'is_split_into_words'] = self.tokenizer.init_kwargs.get(
'is_split_into_words', False)
if 'label2id' in kwargs:
kwargs.pop('label2id')
kwargs['padding'] = kwargs.get('padding', 'max_length')
kwargs['max_length'] = sequence_length
kwargs['add_special_tokens'] = model_type != 'lstm'
self.nlp_tokenizer = NLPTokenizerForLSTM(
model_dir=model_dir,
model_type=model_type,
use_fast=use_fast,
tokenize_kwargs=kwargs)

@type_assert(object, (str, dict))
def __call__(self, data: Union[dict, str]) -> Dict[str, Any]:
"""process the raw input data
def _tokenize_text(self, text: Union[str, List[str]], **kwargs):
tokens = text
if self.mode != ModeKeys.INFERENCE:
assert isinstance(tokens, list), 'Input needs to be a list in training and evaluating, ' \
'because the length of the words and the labels need to be equal.'
is_split_into_words = self.nlp_tokenizer.get_tokenizer_kwarg(
'is_split_into_words', False)
if is_split_into_words:
tokens = list(tokens)

Args:
data (str): a sentence
Example:
'you are so handsome.'

Returns:
Dict[str, Any]: the preprocessed data
"""
if is_split_into_words and self.mode == ModeKeys.INFERENCE:
encodings, word_ids = self._tokenize_text_by_words(
tokens, **kwargs)
elif self.nlp_tokenizer.tokenizer.is_fast:
encodings, word_ids = self._tokenize_text_with_fast_tokenizer(
tokens, **kwargs)
else:
encodings, word_ids = self._tokenize_text_with_slow_tokenizer(
tokens, **kwargs)

# preprocess the data for the model input
text = None
labels_list = None
if isinstance(data, str):
# for inference inputs without label
text = data
elif isinstance(data, dict):
# for finetune inputs with label
text = data.get(self.first_sequence)
labels_list = data.get(self.label)
if isinstance(text, list):
self.tokenize_kwargs['is_split_into_words'] = True

if self._mode == ModeKeys.INFERENCE:
self.tokenize_kwargs['add_special_tokens'] = False
if self.mode == ModeKeys.INFERENCE:
for key in encodings.keys():
encodings[key] = torch.tensor(encodings[key]).unsqueeze(0)
else:
encodings.pop('offset_mapping', None)
return encodings, word_ids

def _tokenize_text_by_words(self, tokens, **kwargs):
input_ids = []
label_mask = []
offset_mapping = []
token_type_ids = []
if self.tokenize_kwargs[
'is_split_into_words'] and self._mode == ModeKeys.INFERENCE:
for offset, token in enumerate(list(text)):
subtoken_ids = self.tokenizer.encode(token,
**self.tokenize_kwargs)
if len(subtoken_ids) == 0:
subtoken_ids = [self.tokenizer.unk_token_id]
input_ids.extend(subtoken_ids)
label_mask.extend([1] + [0] * (len(subtoken_ids) - 1))
offset_mapping.extend([(offset, offset + 1)])
attention_mask = []
for offset, token in enumerate(tokens):
subtoken_ids = self.nlp_tokenizer.tokenizer.encode(
token, add_special_tokens=False)
if len(subtoken_ids) == 0:
subtoken_ids = [self.nlp_tokenizer.tokenizer.unk_token_id]
input_ids.extend(subtoken_ids)
attention_mask.extend([1] * len(subtoken_ids))
label_mask.extend([True] + [False] * (len(subtoken_ids) - 1))
offset_mapping.extend([(offset, offset + 1)])

padding = kwargs.get('padding',
self.nlp_tokenizer.get_tokenizer_kwarg('padding'))
max_length = kwargs.get(
'max_length',
kwargs.get('sequence_length',
self.nlp_tokenizer.get_tokenizer_kwarg('max_length')))
special_token = 1 if self.nlp_tokenizer.get_tokenizer_kwarg(
'add_special_tokens') else 0
if len(label_mask) > max_length - 2 * special_token:
label_mask = label_mask[:(max_length - 2 * special_token)]
input_ids = input_ids[:(max_length - 2 * special_token)]
offset_mapping = offset_mapping[:sum(label_mask)]
if padding == 'max_length':
label_mask = [False] * special_token + label_mask + \
[False] * (max_length - len(label_mask) - special_token)
offset_mapping = offset_mapping + [(0, 0)] * (
max_length - len(offset_mapping))
input_ids = [self.nlp_tokenizer.tokenizer.cls_token_id] * special_token + input_ids + \
[self.nlp_tokenizer.tokenizer.sep_token_id] * special_token + \
[self.nlp_tokenizer.tokenizer.pad_token_id] * (max_length - len(input_ids) - 2 * special_token)
attention_mask = attention_mask + [1] * (
special_token * 2) + [0] * (
max_length - len(attention_mask) - 2 * special_token)
else:
if self.tokenizer.is_fast:
encodings = self.tokenizer(
text, return_offsets_mapping=True, **self.tokenize_kwargs)
attention_mask = encodings['attention_mask']
if 'token_type_ids' in encodings:
token_type_ids = encodings['token_type_ids']
input_ids = encodings['input_ids']
word_ids = encodings.word_ids()
for i in range(len(word_ids)):
if word_ids[i] is None:
label_mask.append(0)
elif word_ids[i] == word_ids[i - 1]:
label_mask.append(0)
offset_mapping[-1] = (
offset_mapping[-1][0],
encodings['offset_mapping'][i][1])
else:
label_mask.append(1)
offset_mapping.append(encodings['offset_mapping'][i])
label_mask = [False] * special_token + label_mask + \
[False] * special_token
input_ids = [self.nlp_tokenizer.tokenizer.cls_token_id] * special_token + input_ids + \
[self.nlp_tokenizer.tokenizer.sep_token_id] * special_token
attention_mask = attention_mask + [1] * (special_token * 2)

encodings = {
'input_ids': input_ids,
'attention_mask': attention_mask,
'label_mask': label_mask,
'offset_mapping': offset_mapping,
}
return encodings, None

def _tokenize_text_with_fast_tokenizer(self, tokens, **kwargs):
is_split_into_words = isinstance(tokens, list)
encodings = self.nlp_tokenizer(
tokens,
return_offsets_mapping=True,
is_split_into_words=is_split_into_words,
**kwargs)
label_mask = []
word_ids = encodings.word_ids()
offset_mapping = []
for i in range(len(word_ids)):
if word_ids[i] is None:
label_mask.append(False)
elif word_ids[i] == word_ids[i - 1]:
label_mask.append(False)
if not is_split_into_words:
offset_mapping[-1] = (offset_mapping[-1][0],
encodings['offset_mapping'][i][1])
else:
encodings = self.tokenizer(text, **self.tokenize_kwargs)
input_ids = encodings['input_ids']
label_mask, offset_mapping = self.get_label_mask_and_offset_mapping(
text)

if self._mode == ModeKeys.INFERENCE:
if len(input_ids) >= self.sequence_length - 2:
input_ids = input_ids[:self.sequence_length - 2]
label_mask = label_mask[:self.sequence_length - 2]
input_ids = [self.tokenizer.cls_token_id
] + input_ids + [self.tokenizer.sep_token_id]
label_mask = [0] + label_mask + [0]
attention_mask = [1] * len(input_ids)
offset_mapping = offset_mapping[:sum(label_mask)]

if not self.is_transformer_based_model:
input_ids = input_ids[1:-1]
attention_mask = attention_mask[1:-1]
label_mask = label_mask[1:-1]

input_ids = torch.tensor(input_ids).unsqueeze(0)
attention_mask = torch.tensor(attention_mask).unsqueeze(0)
label_mask = torch.tensor(
label_mask, dtype=torch.bool).unsqueeze(0)

# the token classification
output = {
'text': text,
'input_ids': input_ids,
'attention_mask': attention_mask,
'label_mask': label_mask,
'offset_mapping': offset_mapping
}
label_mask.append(True)
if is_split_into_words:
offset_mapping.append((word_ids[i], word_ids[i] + 1))
else:
offset_mapping.append(encodings['offset_mapping'][i])

padding = self.nlp_tokenizer.get_tokenizer_kwarg('padding')
if padding == 'max_length':
offset_mapping = offset_mapping + [(0, 0)] * (
len(label_mask) - len(offset_mapping))
encodings['offset_mapping'] = offset_mapping
encodings['label_mask'] = label_mask
return encodings, word_ids

def _tokenize_text_with_slow_tokenizer(self, tokens, **kwargs):
assert self.mode == ModeKeys.INFERENCE and isinstance(tokens, str), \
'Slow tokenizer now only support str input in inference mode. If you are training models, ' \
'please consider using the fast tokenizer.'
word_ids = None
encodings = self.nlp_tokenizer(
tokens, is_split_into_words=False, **kwargs)
tokenizer_name = self.nlp_tokenizer.get_tokenizer_class()
method = 'get_label_mask_and_offset_mapping_' + tokenizer_name
if not hasattr(self, method):
raise RuntimeError(
f'No `{method}` method defined for '
f'tokenizer {tokenizer_name}, please use a fast tokenizer instead, or '
f'try to implement a `{method}` method')
label_mask, offset_mapping = getattr(self, method)(tokens)
padding = self.nlp_tokenizer.get_tokenizer_kwarg('padding')
max_length = self.nlp_tokenizer.get_tokenizer_kwarg('max_length')
special_token = 1 if self.nlp_tokenizer.get_tokenizer_kwarg(
'add_special_tokens') else 0
if len(label_mask) > max_length - 2 * special_token:
label_mask = label_mask[:(max_length - 2 * special_token)]
offset_mapping = offset_mapping[:sum(label_mask)]
if padding == 'max_length':
label_mask = [False] * special_token + label_mask + \
[False] * (max_length - len(label_mask) - special_token)
offset_mapping = offset_mapping + [(0, 0)] * (
max_length - len(offset_mapping))
else:
output = {
'input_ids': input_ids,
'token_type_ids': token_type_ids,
'attention_mask': attention_mask,
'label_mask': label_mask,
}

# align the labels with tokenized text
if labels_list is not None:
assert self.label2id is not None
# Map that sends B-Xxx label to its I-Xxx counterpart
b_to_i_label = []
label_enumerate_values = [
k for k, v in sorted(
self.label2id.items(), key=lambda item: item[1])
]
for idx, label in enumerate(label_enumerate_values):
if label.startswith('B-') and label.replace(
'B-', 'I-') in label_enumerate_values:
b_to_i_label.append(
label_enumerate_values.index(
label.replace('B-', 'I-')))
else:
b_to_i_label.append(idx)

label_row = [self.label2id[lb] for lb in labels_list]
previous_word_idx = None
label_ids = []
for word_idx in word_ids:
if word_idx is None:
label_ids.append(-100)
elif word_idx != previous_word_idx:
label_ids.append(label_row[word_idx])
else:
if self.label_all_tokens:
label_ids.append(b_to_i_label[label_row[word_idx]])
else:
label_ids.append(-100)
previous_word_idx = word_idx
labels = label_ids
output['labels'] = labels
output = {
k: np.array(v) if isinstance(v, list) else v
for k, v in output.items()
}
return output
label_mask = [False] * special_token + label_mask + \
[False] * special_token
encodings['offset_mapping'] = offset_mapping
encodings['label_mask'] = label_mask
return encodings, word_ids

def get_tokenizer_class(self):
tokenizer_class = self.tokenizer.__class__.__name__
if tokenizer_class.endswith(
'Fast') and tokenizer_class != 'PreTrainedTokenizerFast':
tokenizer_class = tokenizer_class[:-4]
return tokenizer_class
def get_label_mask_and_offset_mapping_BertTokenizer(self, text):
label_mask = []
offset_mapping = []
tokens = self.nlp_tokenizer.tokenizer.tokenize(text)
offset = 0
for token in tokens:
is_start = (token[:2] != '##')
if is_start:
label_mask.append(True)
else:
token = token[2:]
label_mask.append(False)
start = offset + text[offset:].index(token)
end = start + len(token)
if is_start:
offset_mapping.append((start, end))
else:
offset_mapping[-1] = (offset_mapping[-1][0], end)
offset = end

return label_mask, offset_mapping

def get_label_mask_and_offset_mapping(self, text):
def get_label_mask_and_offset_mapping_XLMRobertaTokenizer(self, text):
label_mask = []
offset_mapping = []
tokens = self.tokenizer.tokenize(text)
tokens = self.nlp_tokenizer.tokenizer.tokenize(text)
offset = 0
if self.get_tokenizer_class() == 'BertTokenizer':
for token in tokens:
is_start = (token[:2] != '##')
if is_start:
label_mask.append(True)
else:
token = token[2:]
label_mask.append(False)
start = offset + text[offset:].index(token)
end = start + len(token)
if is_start:
offset_mapping.append((start, end))
else:
offset_mapping[-1] = (offset_mapping[-1][0], end)
offset = end
elif self.get_tokenizer_class() == 'XLMRobertaTokenizer':
last_is_blank = False
for token in tokens:
is_start = (token[0] == '▁')
if is_start:
token = token[1:]
label_mask.append(True)
if len(token) == 0:
last_is_blank = True
continue
else:
label_mask.append(False)
start = offset + text[offset:].index(token)
end = start + len(token)
if last_is_blank or is_start:
offset_mapping.append((start, end))
else:
offset_mapping[-1] = (offset_mapping[-1][0], end)
offset = end
last_is_blank = False
for token in tokens:
is_start = (token[0] == '▁')
if is_start:
token = token[1:]
label_mask.append(True)
if len(token) == 0:
last_is_blank = True
continue
else:
label_mask.append(False)
start = offset + text[offset:].index(token)
end = start + len(token)
if last_is_blank or is_start:
offset_mapping.append((start, end))
else:
offset_mapping[-1] = (offset_mapping[-1][0], end)
offset = end
last_is_blank = False
else:
raise NotImplementedError

return label_mask, offset_mapping
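
The label alignment in labels_to_id above follows the usual sub-word convention: only the first sub-token of each word keeps the word's label id, later sub-tokens and special tokens get -100 so the loss ignores them. A self-contained restatement:

label2id = {'O': 0, 'B-LOC': 1, 'I-LOC': 2}
labels_list = ['B-LOC', 'O']       # one label per word
word_ids = [None, 0, 0, 1, None]   # as returned by a fast tokenizer's word_ids()

label_row = [label2id[lb] for lb in labels_list]
label_ids, previous = [], None
for word_idx in word_ids:
    if word_idx is None:
        label_ids.append(-100)                # special tokens
    elif word_idx != previous:
        label_ids.append(label_row[word_idx])
    else:
        label_ids.append(-100)                # the label_all_tokens=False branch
    previous = word_idx

print(label_ids)  # [-100, 1, -100, 0, -100]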

+ 19
- 10
modelscope/preprocessors/nlp/token_classification_thai_preprocessor.py View File

@@ -9,19 +9,23 @@ from modelscope.outputs import OutputKeys
from modelscope.preprocessors.builder import PREPROCESSORS
from modelscope.utils.constant import Fields, ModeKeys
from modelscope.utils.type_assert import type_assert
from .token_classification_preprocessor import TokenClassificationPreprocessor
from .token_classification_preprocessor import \
TokenClassificationTransformersPreprocessor


@PREPROCESSORS.register_module(
Fields.nlp, module_name=Preprocessors.thai_ner_tokenizer)
class NERPreprocessorThai(TokenClassificationPreprocessor):
class NERPreprocessorThai(TokenClassificationTransformersPreprocessor):

@type_assert(object, str)
def __call__(self, data: str) -> Dict[str, Any]:
@type_assert(object, (str, dict))
def __call__(self, data: Union[Dict, str]) -> Dict[str, Any]:
from pythainlp import word_tokenize

if isinstance(data, str):
text = data
else:
text = data[self.first_sequence]
segmented_data = ' '.join([
w.strip(' ') for w in word_tokenize(text=data, engine='newmm')
w.strip(' ') for w in word_tokenize(text=text, engine='newmm')
if w.strip(' ') != ''
])
output = super().__call__(segmented_data)
@@ -31,12 +35,17 @@ class NERPreprocessorThai(TokenClassificationPreprocessor):

@PREPROCESSORS.register_module(
Fields.nlp, module_name=Preprocessors.thai_wseg_tokenizer)
class WordSegmentationPreprocessorThai(TokenClassificationPreprocessor):
class WordSegmentationPreprocessorThai(
TokenClassificationTransformersPreprocessor):

@type_assert(object, str)
def __call__(self, data: str) -> Dict[str, Any]:
@type_assert(object, (str, dict))
def __call__(self, data: Union[Dict, str]) -> Dict[str, Any]:
import regex
data = regex.findall(r'\X', data)
if isinstance(data, str):
text = data
else:
text = data[self.first_sequence]
data = regex.findall(r'\X', text)
data = ' '.join([char for char in data])

output = super().__call__(data)


+ 10
- 6
modelscope/preprocessors/nlp/token_classification_viet_preprocessor.py View File

@@ -9,19 +9,23 @@ from modelscope.outputs import OutputKeys
from modelscope.preprocessors.builder import PREPROCESSORS
from modelscope.utils.constant import Fields, ModeKeys
from modelscope.utils.type_assert import type_assert
from .token_classification_preprocessor import TokenClassificationPreprocessor
from .token_classification_preprocessor import \
TokenClassificationTransformersPreprocessor


@PREPROCESSORS.register_module(
Fields.nlp, module_name=Preprocessors.viet_ner_tokenizer)
class NERPreprocessorViet(TokenClassificationPreprocessor):
class NERPreprocessorViet(TokenClassificationTransformersPreprocessor):

@type_assert(object, str)
def __call__(self, data: str) -> Dict[str, Any]:
@type_assert(object, (str, dict))
def __call__(self, data: Union[Dict, str]) -> Dict[str, Any]:
from pyvi import ViTokenizer

if isinstance(data, str):
text = data
else:
text = data[self.first_sequence]
seg_words = [
t.strip(' ') for t in ViTokenizer.tokenize(data).split(' ')
t.strip(' ') for t in ViTokenizer.tokenize(text).split(' ')
if t.strip(' ') != ''
]
raw_words = []


+ 112
- 0
modelscope/preprocessors/nlp/transformers_tokenizer.py View File

@@ -0,0 +1,112 @@
# Copyright (c) Alibaba, Inc. and its affiliates.

import os
from collections.abc import Mapping

import json
from transformers import AutoTokenizer

from modelscope.metainfo import Models
from modelscope.outputs import OutputKeys
from modelscope.utils.constant import ModeKeys
from modelscope.utils.logger import get_logger

logger = get_logger()

__all__ = [
'NLPTokenizer',
]


class NLPTokenizer:

def __init__(self,
model_dir: str = None,
model_type=None,
use_fast: bool = None,
tokenize_kwargs=None):
"""The transformers tokenizer preprocessor base class.

Any nlp preprocessor which uses the huggingface tokenizer can inherit from this class.

Args:
model_dir (str, `optional`): The local path containing the files used to create a preprocessor.
use_fast (bool, `optional`): Use the fast version of the tokenizer
tokenize_kwargs (dict, `optional`): These args will be directly fed into the tokenizer.
"""
self.model_dir = model_dir
self.model_type = model_type
self.tokenize_kwargs = tokenize_kwargs
if self.tokenize_kwargs is None:
self.tokenize_kwargs = {}
self._use_fast = use_fast
self._tokenizer = None

@property
def tokenizer(self):
if self._tokenizer is None:
self._tokenizer = self.build_tokenizer()
return self._tokenizer

@property
def use_fast(self):
if self._use_fast is None:
if self._use_fast is None and self.model_dir is None:
self._use_fast = False
elif self._use_fast is None and os.path.isfile(
os.path.join(self.model_dir, 'tokenizer_config.json')):
with open(
os.path.join(self.model_dir, 'tokenizer_config.json'),
'r',
encoding='utf-8') as f:
json_config = json.load(f)
self._use_fast = json_config.get('use_fast')
self._use_fast = False if self._use_fast is None else self._use_fast
return self._use_fast

def build_tokenizer(self):
"""Build a tokenizer by the model type.

NOTE: The fast tokenizers have a multi-thread problem, use it carefully.

Returns:
The initialized tokenizer.
"""
# fast version lead to parallel inference failed
model_type = self.model_type
model_dir = self.model_dir
if model_type == Models.deberta_v2:
from modelscope.models.nlp.deberta_v2 import DebertaV2Tokenizer, DebertaV2TokenizerFast
tokenizer = DebertaV2TokenizerFast if self.use_fast else DebertaV2Tokenizer
return tokenizer.from_pretrained(
model_dir) if model_dir is not None else tokenizer()

if model_type in (Models.structbert, Models.gpt3, Models.palm,
Models.plug):
from transformers import BertTokenizer, BertTokenizerFast
tokenizer = BertTokenizerFast if self.use_fast else BertTokenizer
return tokenizer.from_pretrained(
model_dir) if model_dir is not None else tokenizer()
elif model_type == Models.veco:
from transformers import XLMRobertaTokenizer, XLMRobertaTokenizerFast
tokenizer = XLMRobertaTokenizerFast if self.use_fast else XLMRobertaTokenizer
return tokenizer.from_pretrained(
model_dir) if model_dir is not None else tokenizer()

assert model_dir is not None
return AutoTokenizer.from_pretrained(model_dir, use_fast=self.use_fast)

def __call__(self, text, text_pair=None, **kwargs):
kwargs['max_length'] = kwargs.get('max_length',
kwargs.pop('sequence_length', None))
if kwargs['max_length'] is None:
kwargs.pop('max_length')
tokenize_kwargs = {k: v for k, v in self.tokenize_kwargs.items()}
tokenize_kwargs.update(kwargs)
kwargs.update(self.tokenize_kwargs)
return self.tokenizer(text, text_pair, **tokenize_kwargs)

def get_tokenizer_kwarg(self, key, default_value=None):
if key in self.tokenize_kwargs:
return self.tokenize_kwargs[key]
return self.tokenizer.init_kwargs.get(key, default_value)
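
A rough usage sketch of NLPTokenizer (the model dir below is hypothetical; with an unknown model_type the class simply falls through to AutoTokenizer.from_pretrained). Defaults given in tokenize_kwargs apply to every call, and per-call kwargs override them:

from modelscope.preprocessors.nlp.transformers_tokenizer import NLPTokenizer

nlp_tokenizer = NLPTokenizer(
    model_dir='/path/to/local/model',   # hypothetical local checkpoint
    model_type=None,                    # unknown type -> AutoTokenizer branch
    use_fast=False,
    tokenize_kwargs={'padding': 'max_length', 'truncation': True, 'max_length': 128})
output = nlp_tokenizer('an example sentence', return_tensors='pt')  # per-call override
print(nlp_tokenizer.get_tokenizer_kwarg('max_length'))  # 128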

+ 100
- 0
modelscope/preprocessors/nlp/utils.py View File

@@ -0,0 +1,100 @@
# Copyright (c) Alibaba, Inc. and its affiliates.

import os
from collections.abc import Mapping
from typing import Any, Dict, List, Tuple, Union

import json
import numpy as np
from transformers import AutoTokenizer

from modelscope.metainfo import Models
from modelscope.outputs import OutputKeys
from modelscope.preprocessors.base import Preprocessor
from modelscope.utils.constant import ModeKeys
from modelscope.utils.hub import get_model_type, parse_label_mapping
from modelscope.utils.logger import get_logger

logger = get_logger()

__all__ = ['parse_text_and_label', 'labels_to_id']


def parse_text_and_label(data,
mode,
first_sequence=None,
second_sequence=None,
label=None):
"""Parse the input and return the sentences and labels.

When the input type is a tuple or list of size 2:
in inference mode the data will be parsed as the first_sentence and the second_sentence,
otherwise it will be parsed as the first_sentence and the label.

Args:
data: The input data.
mode: The mode of the preprocessor
first_sequence: The key of the first sequence
second_sequence: The key of the second sequence
label: The key of the label
Returns:
The sentences and labels tuple.
"""
text_a, text_b, labels = None, None, None
if isinstance(data, str):
text_a = data
elif isinstance(data, tuple) or isinstance(data, list):
if len(data) == 3:
text_a, text_b, labels = data
elif len(data) == 2:
if mode == ModeKeys.INFERENCE:
text_a, text_b = data
else:
text_a, labels = data
elif isinstance(data, Mapping):
text_a = data.get(first_sequence)
text_b = data.get(second_sequence)
if label is None or isinstance(label, str):
labels = data.get(label)
else:
labels = [data.get(lb) for lb in label]
return text_a, text_b, labels
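Illustrative calls (not part of the new file) showing how parse_text_and_label dispatches on the input shape and mode:

    # A single sentence
    parse_text_and_label('A single sentence.', ModeKeys.INFERENCE)
    # -> ('A single sentence.', None, None)

    # A 2-element tuple is (sentence1, sentence2) in inference mode, (sentence, label) otherwise
    parse_text_and_label(('premise', 'hypothesis'), ModeKeys.INFERENCE)
    # -> ('premise', 'hypothesis', None)
    parse_text_and_label(('a sentence', 1), ModeKeys.TRAIN)
    # -> ('a sentence', None, 1)

    # A dict is parsed by the configured keys
    parse_text_and_label({'sentence1': 'a', 'sentence2': 'b', 'label': 1},
                         ModeKeys.TRAIN,
                         first_sequence='sentence1',
                         second_sequence='sentence2',
                         label='label')
    # -> ('a', 'b', 1)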


def labels_to_id(labels, output, label2id=None):
"""Turn the labels to id with the type int or float.

If the original label's type is str or int, the label2id mapping will be used to convert it to an id.
If the original label's type is float, or the label2id mapping does not exist,
the original label will be kept as-is.

Args:
labels: The input labels.
output: The output dict; the converted labels are written into it under OutputKeys.LABELS.
label2id: An extra label2id mapping. If not provided, the labels will not be translated to ids.

Returns:
None. The converted labels are stored in the output dict.
"""

def label_can_be_mapped(label):
return isinstance(label, str) or isinstance(label, int)

try:
if isinstance(labels, (tuple, list)) and all([label_can_be_mapped(label) for label in labels]) \
and label2id is not None:
output[OutputKeys.LABELS] = [
label2id[label] if label in label2id else label2id[str(label)]
for label in labels
]
elif label_can_be_mapped(labels) and label2id is not None:
output[OutputKeys.LABELS] = label2id[
labels] if labels in label2id else label2id[str(labels)]
elif labels is not None:
output[OutputKeys.LABELS] = labels
except KeyError as e:
logger.error(
f'Label {labels} cannot be found in the label mapping {label2id},'
f'which comes from the user input or the configuration files. '
f'Please consider matching your labels with this mapping.')
raise e
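And a short example of how labels_to_id fills the output dict (illustrative, not part of the file):

    output = {}
    labels_to_id(['positive', 'negative'], output,
                 label2id={'negative': 0, 'positive': 1})
    # output == {OutputKeys.LABELS: [1, 0]}

    # Float labels (e.g. regression targets) cannot be mapped and are kept as-is
    output = {}
    labels_to_id(0.75, output, label2id={'negative': 0, 'positive': 1})
    # output == {OutputKeys.LABELS: 0.75}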

+ 74
- 0
modelscope/preprocessors/nlp/zero_shot_classification_preprocessor.py

@@ -0,0 +1,74 @@
# Copyright (c) Alibaba, Inc. and its affiliates.

from typing import Any, Dict, Union

from modelscope.metainfo import Preprocessors
from modelscope.preprocessors import Preprocessor
from modelscope.preprocessors.builder import PREPROCESSORS
from modelscope.utils.constant import Fields, ModeKeys
from modelscope.utils.hub import get_model_type
from .transformers_tokenizer import NLPTokenizer


@PREPROCESSORS.register_module(
Fields.nlp, module_name=Preprocessors.zero_shot_cls_tokenizer)
class ZeroShotClassificationTransformersPreprocessor(Preprocessor):
"""The tokenizer preprocessor used in zero shot classification.
"""

def __init__(self,
model_dir: str,
first_sequence=None,
mode=ModeKeys.INFERENCE,
sequence_length=512,
use_fast=None,
**kwargs):
"""preprocess the data

Args:
model_dir (str): model path
"""
self.sequence_length = sequence_length
model_type = None
if model_dir is not None:
model_type = get_model_type(model_dir)
self.nlp_tokenizer = NLPTokenizer(
model_dir, model_type, use_fast=use_fast, tokenize_kwargs=kwargs)
self.first_sequence = first_sequence
super().__init__(mode=mode)

def __call__(self,
data: Union[str, Dict],
hypothesis_template: str,
candidate_labels: list,
padding=True,
truncation=True,
truncation_strategy='only_first',
**kwargs) -> Dict[str, Any]:
"""process the raw input data

Args:
data (str or dict): a sentence
Example:
'you are so handsome.'

Returns:
Dict[str, Any]: the preprocessed data
"""
if isinstance(data, dict):
data = data.get(self.first_sequence)

pairs = [[data, hypothesis_template.format(label)]
for label in candidate_labels]

if 'return_tensors' not in kwargs:
kwargs[
'return_tensors'] = 'pt' if self._mode == ModeKeys.INFERENCE else None

features = self.nlp_tokenizer(
pairs,
padding=padding,
truncation=truncation,
truncation_strategy=truncation_strategy,
**kwargs)
return features
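A usage sketch of the new preprocessor; the model directory is hypothetical and only needs to contain a configuration from which get_model_type can read the model type:

    preprocessor = ZeroShotClassificationTransformersPreprocessor(
        '/path/to/structbert_zero_shot_model', first_sequence='sentence')
    features = preprocessor(
        {'sentence': 'The match went to extra time before the home side scored.'},
        hypothesis_template='This text is about {}.',
        candidate_labels=['sports', 'politics', 'cooking'])
    # One (premise, hypothesis) pair is tokenized per candidate label; in inference
    # mode the features are returned as PyTorch tensors.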

+ 0
- 51
modelscope/preprocessors/nlp/zero_shot_classification_reprocessor.py

@@ -1,51 +0,0 @@
# Copyright (c) Alibaba, Inc. and its affiliates.

from typing import Any, Dict, Union

from modelscope.metainfo import Preprocessors
from modelscope.preprocessors.builder import PREPROCESSORS
from modelscope.utils.constant import Fields, ModeKeys
from .nlp_base import NLPTokenizerPreprocessorBase


@PREPROCESSORS.register_module(
Fields.nlp, module_name=Preprocessors.zero_shot_cls_tokenizer)
class ZeroShotClassificationPreprocessor(NLPTokenizerPreprocessorBase):
"""The tokenizer preprocessor used in zero shot classification.
"""

def __init__(self, model_dir: str, mode=ModeKeys.INFERENCE, **kwargs):
"""preprocess the data

Args:
model_dir (str): model path
"""
self.sequence_length = kwargs.pop('sequence_length', 512)
super().__init__(model_dir, mode=mode, **kwargs)

def __call__(self, data: Union[str, Dict], hypothesis_template: str,
candidate_labels: list) -> Dict[str, Any]:
"""process the raw input data

Args:
data (str or dict): a sentence
Example:
'you are so handsome.'

Returns:
Dict[str, Any]: the preprocessed data
"""
if isinstance(data, dict):
data = data.get(self.first_sequence)

pairs = [[data, hypothesis_template.format(label)]
for label in candidate_labels]

features = self.tokenizer(
pairs,
padding=True,
truncation=True,
max_length=self.sequence_length,
truncation_strategy='only_first',
return_tensors='pt' if self._mode == ModeKeys.INFERENCE else None)
return features

+ 137
- 33
modelscope/trainers/hooks/checkpoint_hook.py

@@ -6,11 +6,12 @@ import numpy as np
import torch

from modelscope import __version__
from modelscope.metainfo import Hooks
from modelscope.utils.checkpoint import load_checkpoint, save_checkpoint
from modelscope.metainfo import Hooks, Pipelines
from modelscope.utils.checkpoint import (load_checkpoint, save_checkpoint,
save_configuration)
from modelscope.utils.constant import LogKeys, ModelFile
from modelscope.utils.logger import get_logger
from modelscope.utils.torch_utils import get_dist_info, is_master
from modelscope.utils.torch_utils import is_master
from .builder import HOOKS
from .hook import Hook
from .priority import Priority
@@ -28,17 +29,25 @@ class CheckpointHook(Hook):
save_dir (str): The directory to save checkpoints. If is None, use `trainer.work_dir`
save_last (bool): Whether to save the last checkpoint. Default: True.
checkpoint_file (str): The checkpoint file to be loaded.
load_all_state (bool): Whether to load all training states (optimizer, epoch, lr_scheduler, random_state,
etc.) from the checkpoint file. If False, only the model's state dict will be loaded.
max_checkpoint_num (int): The max number of checkpoint files to keep. Default: None, which means never
delete anything. If the number exceeds the limit, the earliest checkpoints will be deleted first.
"""

PRIORITY = Priority.LOW

def __init__(self,
interval=0,
by_epoch=True,
save_optimizer=True,
save_dir=None,
save_last=True,
checkpoint_file=None):
def __init__(
self,
interval=0,
by_epoch=True,
save_optimizer=True,
save_dir=None,
save_last=True,
checkpoint_file=None,
load_all_state=True,
max_checkpoint_num=None,
):
self.interval = interval
self.by_epoch = by_epoch
self.save_optimizer = save_optimizer
@@ -47,6 +56,11 @@ class CheckpointHook(Hook):
self.save_last = save_last
self.rng_state = None
self.need_load_rng_state = False
self.load_all_state = load_all_state
self.max_checkpoint_num = None
if max_checkpoint_num is not None:
self.max_checkpoint_num = max(int(max_checkpoint_num), 1)
self.history_checkpoints = []

def before_run(self, trainer):
if not self.save_dir:
@@ -65,9 +79,10 @@ class CheckpointHook(Hook):

if self.checkpoint_file is not None and os.path.isfile(
self.checkpoint_file):
meta = self.load_checkpoint(self.checkpoint_file, trainer)
meta = self.load_checkpoint(self.checkpoint_file, trainer,
self.load_all_state)
self.rng_state = meta.get('rng_state')
self.need_load_rng_state = True
self.need_load_rng_state = self.load_all_state

def before_train_iter(self, trainer):
if self.need_load_rng_state:
@@ -95,28 +110,30 @@ class CheckpointHook(Hook):
self._save_checkpoint(trainer)

@classmethod
def load_checkpoint(cls, filename, trainer):
def load_checkpoint(cls, filename, trainer, load_all_state=True):
from modelscope.trainers.parallel.utils import is_parallel
if is_parallel(trainer.model):
model = trainer.model.module
else:
model = trainer.model
meta = load_checkpoint(filename, model,
getattr(trainer, 'optimizer', None),
getattr(trainer, 'lr_scheduler', None))
trainer._epoch = meta.get('epoch', trainer._epoch)
trainer._iter = meta.get('iter', trainer._iter)
trainer._inner_iter = meta.get('inner_iter', trainer._inner_iter)

for i, hook in enumerate(trainer.hooks):
# hook: Hook
key = f'{hook.__class__}-{i}'
if key in meta and hasattr(hook, 'load_state_dict'):
hook.load_state_dict(meta.get(key, {}))
else:
trainer.logger.warn(
f'The state_dict of hook {hook.__class__} at index {i} is not found in the checkpoint file.'
)
meta = load_checkpoint(
filename, model,
getattr(trainer, 'optimizer', None) if load_all_state else None,
getattr(trainer, 'lr_scheduler', None) if load_all_state else None)
if load_all_state:
trainer._epoch = meta.get('epoch', trainer._epoch)
trainer._iter = meta.get('iter', trainer._iter)
trainer._inner_iter = meta.get('inner_iter', trainer._inner_iter)

for i, hook in enumerate(trainer.hooks):
# hook: Hook
key = f'{hook.__class__}-{i}'
if key in meta and hasattr(hook, 'load_state_dict'):
hook.load_state_dict(meta.get(key, {}))
else:
trainer.logger.warn(
f'The state_dict of hook {hook.__class__} at index {i} is not found in the checkpoint file.'
)

version = meta.get('modelscope')
if version != __version__:
@@ -163,6 +180,21 @@ class CheckpointHook(Hook):
and not self.by_epoch):
self._save_pretrained(trainer)

self.history_checkpoints.append(cur_save_name)
self.remove_obsolete_checkpoints()

def remove_obsolete_checkpoints(self):
if self.max_checkpoint_num is not None and \
len(self.history_checkpoints) > self.max_checkpoint_num:
history_checkpoints = [ckpt for ckpt in self.history_checkpoints]
self.history_checkpoints.clear()
for i, ckpt_file in enumerate(history_checkpoints):
if i < len(history_checkpoints) - self.max_checkpoint_num:
if os.path.isfile(ckpt_file):
os.remove(ckpt_file)
else:
self.history_checkpoints.append(ckpt_file)
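A hedged sketch of how the new knobs could be set in a trainer configuration; the hook type string is assumed to match the registered hook name, and the other values are illustrative:

    cfg.train.hooks = [{
        'type': 'CheckpointHook',
        'by_epoch': True,
        'interval': 1,
        'max_checkpoint_num': 2,   # keep only the two most recent checkpoints
        'load_all_state': False,   # on continue training, restore model weights only
    }]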

def _save_pretrained(self, trainer):
output_dir = os.path.join(self.save_dir, ModelFile.TRAIN_OUTPUT_DIR)
from modelscope.trainers.parallel.utils import is_parallel
@@ -175,15 +207,53 @@ class CheckpointHook(Hook):
config = trainer.cfg.to_dict()
# override pipeline by tasks name after finetune done,
# avoid case like fill mask pipeline with a text cls task
config['pipeline'] = {'type': config['task']}
if config['task'] in [
getattr(Pipelines, attr) for attr in dir(Pipelines)
if not attr.startswith('__')
]:
# TODO a temp fix to avoid pipeline_name and task mismatch
config['pipeline'] = {'type': config['task']}

class SaveConfig:

def __init__(self, output_dir, config):
self.output_dir = output_dir
self.config = config

def __call__(self, _output_dir, _config):
self.config = _config

def save_config(self):
save_configuration(self.output_dir, self.config)

save_config_fn = SaveConfig(output_dir, config)

if hasattr(model, 'save_pretrained'):
# Now support two binary files: pytorch_model.bin and pytorch_model.pt
default_bin_file = ModelFile.TORCH_MODEL_BIN_FILE
if hasattr(
model,
'model_dir') and ModelFile.TORCH_MODEL_FILE in os.listdir(
model.model_dir):
default_bin_file = ModelFile.TORCH_MODEL_FILE
model.save_pretrained(
output_dir,
ModelFile.TORCH_MODEL_BIN_FILE,
default_bin_file,
save_function=save_checkpoint,
config=config,
config=save_config_fn.config,
save_config_function=save_config_fn,
with_meta=False)
if trainer.train_preprocessor is not None:
trainer.train_preprocessor.save_pretrained(
output_dir,
save_config_fn.config,
save_config_function=save_config_fn)
if trainer.eval_preprocessor is not None:
trainer.eval_preprocessor.save_pretrained(
output_dir,
save_config_fn.config,
save_config_function=save_config_fn)
save_config_fn.save_config()

def after_train_iter(self, trainer):
if self.by_epoch:
@@ -222,6 +292,9 @@ class BestCkptSaverHook(CheckpointHook):
save_optimizer (bool): Whether to save optimizer state dict. Default: True.
save_dir (str): Output directory to save best checkpoint.
restore_best (bool): Whether to restore the best checkpoint after training.
max_checkpoint_num (int): The max number of checkpoint files to keep. Default: 1, i.e. only the best
checkpoint is kept. If the number exceeds the limit, the checkpoints with the worst metric values will
be deleted first, as judged by the `rule` and `metric_key` arguments.
"""

PRIORITY = Priority.LOW
@@ -235,13 +308,17 @@ class BestCkptSaverHook(CheckpointHook):
save_dir=None,
save_file_name=None,
restore_best=False,
interval=0):
max_checkpoint_num=1,
interval=0,
**kwargs):
assert rule in ['max', 'min'], 'Only support "max" or "min" rule now.'
super().__init__(
interval=interval,
by_epoch=by_epoch,
save_optimizer=save_optimizer,
save_dir=save_dir,
max_checkpoint_num=max_checkpoint_num,
**kwargs,
)
self.metric_key = metric_key
self.rule = rule
@@ -249,6 +326,7 @@ class BestCkptSaverHook(CheckpointHook):
self._best_ckpt_file = None
self.save_file_name = save_file_name
self.restore_best = restore_best
self.history_checkpoints = set()

def _should_save(self, trainer):
return self._is_best_metric(trainer.metric_values)
@@ -284,6 +362,10 @@ class BestCkptSaverHook(CheckpointHook):
self.save_dir,
f'best_{LogKeys.ITER}{trainer.iter + 1}_{self.metric_key}{self._best_metric}.pth'
)
else:
if '.' not in cur_save_name:
cur_save_name = f'{cur_save_name}.pth'
cur_save_name = os.path.join(self.save_dir, cur_save_name)

meta = {
'epoch': trainer.epoch,
@@ -300,6 +382,28 @@ class BestCkptSaverHook(CheckpointHook):
trainer.lr_scheduler, meta)
self._best_ckpt_file = cur_save_name
self._save_pretrained(trainer)
self.history_checkpoints.add(cur_save_name)
self.remove_obsolete_checkpoints()

def remove_obsolete_checkpoints(self):

    def extract_metric_from_filename(name1):
        # Strip the trailing '.pth' extension before parsing, otherwise the
        # fractional part of the metric value would be lost.
        metric1 = float(name1.split(self.metric_key)[1].rsplit('.', 1)[0])
        if self.rule == 'max':
            return -metric1
        else:
            return metric1

if self.max_checkpoint_num is not None and \
len(self.history_checkpoints) > self.max_checkpoint_num:
history_checkpoints = sorted(
self.history_checkpoints, key=extract_metric_from_filename)
self.history_checkpoints.clear()
for i, ckpt_file in enumerate(history_checkpoints):
if i < self.max_checkpoint_num:
self.history_checkpoints.add(ckpt_file)
elif os.path.isfile(ckpt_file):
os.remove(ckpt_file)
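Similarly, with the new default of max_checkpoint_num=1 the best-checkpoint saver keeps only the single best file; a hedged configuration sketch (the metric name is illustrative and the type string is assumed to match the registered hook name):

    cfg.train.hooks.append({
        'type': 'BestCkptSaverHook',
        'metric_key': 'accuracy',
        'rule': 'max',
        'max_checkpoint_num': 1,
    })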

def state_dict(self):
return {


+ 2
- 2
modelscope/trainers/nlp/text_generation_trainer.py

@@ -14,8 +14,8 @@ from modelscope.utils.file_utils import func_receive_dict_inputs
class TextGenerationTrainer(NlpEpochBasedTrainer):

def _decode(self, tokens):
tokenizer = self.eval_preprocessor.tokenizer
return tokenizer.decode(tokens.tolist(), skip_special_tokens=True)
return self.eval_preprocessor.decode(
tokens.tolist(), skip_special_tokens=True)
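The trainer no longer reaches into the preprocessor's tokenizer; it only relies on the preprocessor exposing a decode method. A hedged sketch of what such a wrapper can look like on the text-generation preprocessor side (the actual implementation in this refactor may differ):

    def decode(self, tokens, **kwargs):
        # Delegate to the transformers tokenizer wrapped by NLPTokenizer.
        return self.nlp_tokenizer.tokenizer.decode(tokens, **kwargs)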

def evaluation_step(self, data):
model = self.model.module if self._dist else self.model


Some files were not shown because too many files changed in this diff
