
test_ast.py 8.2 kB

[to #42322933] Refactor NLP and fix some user feedback

1. Abstract the dict keys needed by the NLP metric classes into the __init__ method.
2. Add Preprocessor.save_pretrained to save preprocessor information.
3. Abstract the config-saving function, so that a direct call of from_pretrained saves the config as-is, while training modifies the cfg entry by entry.
4. Remove SbertTokenizer and VecoTokenizer; use transformers' tokenizers instead.
5. Use the model's/preprocessor's from_pretrained in all NLP pipeline classes.
6. Add model_kwargs and preprocessor_kwargs to all NLP pipeline classes.
7. Add base classes for the fill-mask and text-classification preprocessors, as a demo for later changes.
8. Fix user feedback: the model is re-trained in the continue-training scenario.
9. Fix user feedback: too many checkpoints are saved.
10. Simplify the nlp-trainer.
11. Fix user feedback: split the default trainer's __init__ method, making it easier for users to override.
12. Add safe_get to the Config class.

---------------------------- Another refactor from version 36 -------------------------

13. Rename all NLP transformers preprocessors from TaskNamePreprocessor to TaskNameTransformersPreprocessor, for example: TextClassificationPreprocessor -> TextClassificationTransformersPreprocessor.
14. Add a per-task base class for every NLP task whose preprocessor has at least two sub-preprocessors.
15. Add output classes for the NLP models.
16. Refactor the token-classification logic.
17. Fix bug: checkpoint_hook does not support pytorch_model.pt.
18. Fix bug: the pipeline name does not match the task name, so inference does not succeed after training. NOTE: this is just a stop-the-bleeding solution; the root cause is the uncertain relationship between models and pipelines.

Link: https://code.alibaba-inc.com/Ali-MaaS/MaaS-lib/codereview/10723513

Squashed commits (consecutive identical messages collapsed as (xN)):
* add save_pretrained to preprocessor
* save preprocessor config in hook
* refactor label-id mapping fetching logic
* test ok on sentence-similarity
* run on finetuning
* fix bug
* pre-commit passed
* fix bug
* Merge branch 'master' into feat/refactor_config (conflicts: modelscope/preprocessors/nlp/nlp_base.py)
* add params to init
* 1. support max ckpt num 2. support ignoring everything but the bin file in continue training 3. add arguments to some nlp metrics
* Split trainer init impls to overridable methods
* remove some obsolete tokenizers
* unfinished
* support input params in pipeline
* fix bugs
* fix ut bug
* fix bug
* fix ut bug (x3)
* add base class for some preprocessors
* Merge commit '379867739548f394d0fa349ba07afe04adf4c8b6' into feat/refactor_config
* compatible with old code
* fix ut bug
* fix ut bugs
* fix bug
* add some comments
* fix ut bug
* add a requirement
* fix pre-commit
* Merge commit '0451b3d3cb2bebfef92ec2c227b2a3dd8d01dc6a' into feat/refactor_config
* fix bug
* Support function type in registry
* fix ut bug
* fix bug
* Merge commit '5f719e542b963f0d35457e5359df879a5eb80b82' into feat/refactor_config (conflicts: modelscope/pipelines/nlp/multilingual_word_segmentation_pipeline.py, modelscope/pipelines/nlp/named_entity_recognition_pipeline.py, modelscope/pipelines/nlp/word_segmentation_pipeline.py, modelscope/utils/hub.py)
* remove obsolete file
* rename init args
* rename params
* fix merge bug
* add default preprocessor config for ner-model
* move a method to a util file
* remove unused config
* Fix a bug in pbar
* bestckptsaver: change default ckpt number to 1
* 1. Add assert to max_epoch 2. split init_dist and get_device 3. change cmp func name
* Fix bug
* fix bug (x2)
* unfinished refactoring
* unfinished
* uw (x4)
* Merge branch 'feat/refactor_config' into feat/refactor_trainer (conflicts: modelscope/preprocessors/nlp/document_segmentation_preprocessor.py, modelscope/preprocessors/nlp/faq_question_answering_preprocessor.py, modelscope/preprocessors/nlp/relation_extraction_preprocessor.py, modelscope/preprocessors/nlp/text_generation_preprocessor.py)
* uw (x2)
* unify nlp task outputs
* uw (x4)
* change the order of text cls pipeline
* refactor t5
* refactor tg task preprocessor
* fix
* unfinished
* temp
* refactor code
* unfinished (x4)
* uw
* Merge branch 'feat/refactor_config' into feat/refactor_trainer
* smoke test pass
* ut testing
* pre-commit passed
* Merge branch 'master' into feat/refactor_config (conflicts: modelscope/models/nlp/bert/document_segmentation.py, modelscope/pipelines/nlp/__init__.py, modelscope/pipelines/nlp/document_segmentation_pipeline.py)
* merge master
* unfinished
* Merge branch 'feat/fix_bug_pipeline_name' into feat/refactor_config
* fix bug
* fix ut bug
* support ner batch inference
* fix ut bug
* fix bug
* support batch inference on three nlp tasks
* unfinished
* fix bug (x2)
* Merge branch 'master' into feat/refactor_config (conflicts: modelscope/models/base/base_model.py, modelscope/pipelines/nlp/conversational_text_to_sql_pipeline.py, modelscope/pipelines/nlp/dialog_intent_prediction_pipeline.py, modelscope/pipelines/nlp/dialog_modeling_pipeline.py, modelscope/pipelines/nlp/dialog_state_tracking_pipeline.py, modelscope/pipelines/nlp/document_segmentation_pipeline.py, modelscope/pipelines/nlp/faq_question_answering_pipeline.py, modelscope/pipelines/nlp/feature_extraction_pipeline.py, modelscope/pipelines/nlp/fill_mask_pipeline.py, modelscope/pipelines/nlp/information_extraction_pipeline.py, modelscope/pipelines/nlp/named_entity_recognition_pipeline.py, modelscope/pipelines/nlp/sentence_embedding_pipeline.py, modelscope/pipelines/nlp/summarization_pipeline.py, modelscope/pipelines/nlp/table_question_answering_pipeline.py, modelscope/pipelines/nlp/text2text_generation_pipeline.py, modelscope/pipelines/nlp/text_classification_pipeline.py, modelscope/pipelines/nlp/text_error_correction_pipeline.py, modelscope/pipelines/nlp/text_generation_pipeline.py, modelscope/pipelines/nlp/text_ranking_pipeline.py, modelscope/pipelines/nlp/token_classification_pipeline.py, modelscope/pipelines/nlp/word_segmentation_pipeline.py, modelscope/pipelines/nlp/zero_shot_classification_pipeline.py, modelscope/trainers/nlp_trainer.py)
* pre-commit passed
* fix bug
* Merge branch 'master' into feat/refactor_config (conflicts: modelscope/preprocessors/__init__.py)
* fix bug (x6)
* pre-commit passed
* fix bug (x6)
* self review done
* fix bug (x3)
* fix bugs
* remove sub-token offset mapping
* fix name bug
* add some tests
* 1. support batch inference of text-generation, text2text-generation, token-classification, text-classification 2. add corresponding UTs
* add old logic back
* tmp save
* add tokenize-by-words logic back
* move outputs file back
* revert veco token-classification back
* fix typo
* Fix description
* Merge commit '4dd99b8f6e4e7aefe047c68a1bedd95d3ec596d6' into feat/refactor_config
* Merge branch 'master' into feat/refactor_config (conflicts: modelscope/pipelines/builder.py)
3 years ago
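Items 2, 5, 6, and 12 of the commit message describe user-facing API changes. The snippet below is a minimal usage sketch based only on that description: the model ID is a placeholder, and the exact keyword names and signatures are assumptions, not verified against the released modelscope API.

# Minimal usage sketch of the refactored API described above; the model id is
# a placeholder and the kwarg names follow the commit message (assumptions).
from modelscope.pipelines import pipeline
from modelscope.preprocessors import Preprocessor
from modelscope.utils.config import Config

# Items 5/6: pipelines build their model and preprocessor via from_pretrained
# and forward extra kwargs to each.
pipe = pipeline(
    task='text-classification',
    model='damo/some-text-classification-model',  # hypothetical model id
    model_kwargs={},  # forwarded to the model's from_pretrained (item 6)
    preprocessor_kwargs={})  # forwarded to the preprocessor (item 6)
print(pipe('this is a test sentence'))

# Item 2: a preprocessor can persist its own configuration for reloading.
preprocessor = Preprocessor.from_pretrained('damo/some-text-classification-model')
preprocessor.save_pretrained('./saved_preprocessor')

# Item 12: safe_get walks a dotted key path without raising on missing nodes.
cfg = Config({'preprocessor': {'type': 'sen-sim-tokenizer'}})
assert cfg.safe_get('preprocessor.type') == 'sen-sim-tokenizer'
assert cfg.safe_get('preprocessor.missing.key', default=None) is None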
# Copyright (c) Alibaba, Inc. and its affiliates.
import os
import shutil
import tempfile
import time
import unittest
from pathlib import Path

from modelscope.utils.ast_utils import (FILES_MTIME_KEY, INDEX_KEY, MD5_KEY,
                                        MODELSCOPE_PATH_KEY, REQUIREMENT_KEY,
                                        VERSION_KEY, AstScaning,
                                        FilesAstScaning, load_index)

p = Path(__file__)
MODELSCOPE_PATH = p.resolve().parents[2].joinpath('modelscope')


class AstScaningTest(unittest.TestCase):

    def setUp(self):
        print('Testing %s.%s' % (type(self).__name__, self._testMethodName))
        # The TemporaryDirectory object is discarded immediately (which removes
        # the directory), so only the generated path is kept and re-created.
        self.tmp_dir = tempfile.TemporaryDirectory().name
        self.test_file = os.path.join(self.tmp_dir, 'test.py')
        if not os.path.exists(self.tmp_dir):
            os.makedirs(self.tmp_dir)

    def tearDown(self):
        super().tearDown()
        shutil.rmtree(self.tmp_dir)

    def test_ast_scaning_class(self):
        astScaner = AstScaning()
        pipeline_file = os.path.join(MODELSCOPE_PATH, 'pipelines', 'nlp',
                                     'text_generation_pipeline.py')
        output = astScaner.generate_ast(pipeline_file)
        self.assertTrue(output['imports'] is not None)
        self.assertTrue(output['from_imports'] is not None)
        self.assertTrue(output['decorators'] is not None)
        imports, from_imports, decorators = output['imports'], output[
            'from_imports'], output['decorators']
        self.assertIsInstance(imports, dict)
        self.assertIsInstance(from_imports, dict)
        self.assertIsInstance(decorators, list)
        self.assertListEqual(
            list(set(imports.keys()) - set(['torch', 'os'])), [])
        self.assertEqual(len(from_imports.keys()), 10)
        self.assertTrue(from_imports['modelscope.metainfo'] is not None)
        self.assertEqual(from_imports['modelscope.metainfo'], ['Pipelines'])
        self.assertEqual(
            decorators,
            [('PIPELINES', 'text-generation', 'text-generation'),
             ('PIPELINES', 'text2text-generation', 'translation_en_to_de'),
             ('PIPELINES', 'text2text-generation', 'translation_en_to_ro'),
             ('PIPELINES', 'text2text-generation', 'translation_en_to_fr'),
             ('PIPELINES', 'text2text-generation', 'text2text-generation')])

    def test_files_scaning_method(self):
        fileScaner = FilesAstScaning()
        # case of passing in files directly
        pipeline_file = os.path.join(MODELSCOPE_PATH, 'pipelines', 'nlp',
                                     'text_generation_pipeline.py')
        file_list = [pipeline_file]
        output = fileScaner.get_files_scan_results(file_list)
        self.assertTrue(output[INDEX_KEY] is not None)
        self.assertTrue(output[REQUIREMENT_KEY] is not None)
        index, requirements = output[INDEX_KEY], output[REQUIREMENT_KEY]
        self.assertIsInstance(index, dict)
        self.assertIsInstance(requirements, dict)
        self.assertIsInstance(list(index.keys())[0], tuple)
        index_0 = list(index.keys())[0]
        self.assertIsInstance(index[index_0], dict)
        self.assertTrue(index[index_0]['imports'] is not None)
        self.assertIsInstance(index[index_0]['imports'], list)
        self.assertTrue(index[index_0]['module'] is not None)
        self.assertIsInstance(index[index_0]['module'], str)
        index_0 = list(requirements.keys())[0]
        self.assertIsInstance(requirements[index_0], list)

    def test_file_mtime_md5_method(self):
        fileScaner = FilesAstScaning()
        # create the first file
        with open(self.test_file, 'w', encoding='utf-8') as f:
            f.write('This is the new test!')
        md5_1, mtime_1 = fileScaner.files_mtime_md5(self.tmp_dir, [])
        md5_2, mtime_2 = fileScaner.files_mtime_md5(self.tmp_dir, [])
        self.assertEqual(md5_1, md5_2)
        self.assertEqual(mtime_1, mtime_2)
        self.assertIsInstance(mtime_1, dict)
        self.assertEqual(list(mtime_1.keys()), [self.test_file])
        self.assertEqual(mtime_1[self.test_file], mtime_2[self.test_file])
        time.sleep(2)
        # case of revising an existing file
        with open(self.test_file, 'w', encoding='utf-8') as f:
            f.write('test again')
        md5_3, mtime_3 = fileScaner.files_mtime_md5(self.tmp_dir, [])
        self.assertNotEqual(md5_1, md5_3)
        self.assertNotEqual(mtime_1[self.test_file], mtime_3[self.test_file])
        # case of creating a new file
        self.test_file_new = os.path.join(self.tmp_dir, 'test_1.py')
        time.sleep(2)
        with open(self.test_file_new, 'w', encoding='utf-8') as f:
            f.write('test again')
        md5_4, mtime_4 = fileScaner.files_mtime_md5(self.tmp_dir, [])
        self.assertNotEqual(md5_1, md5_4)
        self.assertNotEqual(md5_3, md5_4)
        self.assertEqual(
            set(mtime_4.keys()) - set([self.test_file, self.test_file_new]),
            set())

    def test_load_index_method(self):
        # test the full indexing case
        output = load_index()
        self.assertTrue(output[INDEX_KEY] is not None)
        self.assertTrue(output[REQUIREMENT_KEY] is not None)
        index, requirements = output[INDEX_KEY], output[REQUIREMENT_KEY]
        self.assertIsInstance(index, dict)
        self.assertIsInstance(requirements, dict)
        self.assertIsInstance(list(index.keys())[0], tuple)
        index_0 = list(index.keys())[0]
        self.assertIsInstance(index[index_0], dict)
        self.assertTrue(index[index_0]['imports'] is not None)
        self.assertIsInstance(index[index_0]['imports'], list)
        self.assertTrue(index[index_0]['module'] is not None)
        self.assertIsInstance(index[index_0]['module'], str)
        index_0 = list(requirements.keys())[0]
        self.assertIsInstance(requirements[index_0], list)
        self.assertIsInstance(output[MD5_KEY], str)
        self.assertIsInstance(output[MODELSCOPE_PATH_KEY], str)
        self.assertIsInstance(output[VERSION_KEY], str)
        self.assertIsInstance(output[FILES_MTIME_KEY], dict)

    def test_update_load_index_method(self):
        file_number = 20
        file_list = []
        for i in range(file_number):
            filename = os.path.join(self.tmp_dir, f'test_{i}.py')
            with open(filename, 'w', encoding='utf-8') as f:
                f.write('import os')
            file_list.append(filename)
        index_file = 'ast_indexer_1'
        start = time.time()
        index = load_index(
            file_list=file_list,
            indexer_file_dir=self.tmp_dir,
            indexer_file=index_file)
        duration_1 = time.time() - start
        self.assertEqual(len(index[FILES_MTIME_KEY]), file_number)
        # no-change case: loading should be faster than the initial full build
        start = time.time()
        index = load_index(
            file_list=file_list,
            indexer_file_dir=self.tmp_dir,
            indexer_file=index_file)
        duration_2 = time.time() - start
        self.assertGreater(duration_1, duration_2)
        self.assertEqual(len(index[FILES_MTIME_KEY]), file_number)
        # adding one new file: only that file needs scanning, so loading
        # should still be faster than the initial full build
        test_file_new_2 = os.path.join(self.tmp_dir, 'test_new.py')
        with open(test_file_new_2, 'w', encoding='utf-8') as f:
            f.write('import os')
        file_list.append(test_file_new_2)
        start = time.time()
        index = load_index(
            file_list=file_list,
            indexer_file_dir=self.tmp_dir,
            indexer_file=index_file)
        duration_3 = time.time() - start
        self.assertGreater(duration_1, duration_3)
        self.assertEqual(len(index[FILES_MTIME_KEY]), file_number + 1)
        # deleting one file: its entry is dropped without re-scanning, so
        # loading should still be faster than the initial full build
        file_list.pop()
        start = time.time()
        index = load_index(
            file_list=file_list,
            indexer_file_dir=self.tmp_dir,
            indexer_file=index_file)
        duration_4 = time.time() - start
        self.assertGreater(duration_1, duration_4)
        self.assertEqual(len(index[FILES_MTIME_KEY]), file_number)


if __name__ == '__main__':
    unittest.main()
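The last two tests rely on the incremental behaviour behind files_mtime_md5 and load_index: a per-file mtime table plus a single MD5 digest lets the loader detect which files changed and re-parse only those. Below is a small standalone sketch of that fingerprinting idea; it mirrors the name of the tested method but is an illustration of the technique under assumed behaviour, not the actual FilesAstScaning implementation.

import hashlib
import os


def files_mtime_md5(dir_path, exclude_files):
    """Fingerprint the .py files under dir_path: per-file mtimes plus one MD5.

    Illustrative stand-in for the technique the tests exercise.
    """
    mtimes = {}
    for root, _, files in os.walk(dir_path):
        for name in sorted(files):
            path = os.path.join(root, name)
            if path.endswith('.py') and path not in exclude_files:
                mtimes[path] = os.path.getmtime(path)
    # Hash the (path, mtime) pairs so any create, modify, or delete
    # changes the digest, while an unchanged tree reproduces it.
    digest = hashlib.md5(
        repr(sorted(mtimes.items())).encode('utf-8')).hexdigest()
    return digest, mtimes

This is why md5_1, md5_3, and md5_4 all differ in test_file_mtime_md5_method, while two scans of an unchanged directory return identical results, and why the second and later load_index calls in test_update_load_index_method are expected to beat the initial full build.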