序列标注.rst 8.7 kB

=====================
Sequence Labeling
=====================

This tutorial shows how to implement a sequence labeling task with fastNLP. Using fastNLP's components, you can complete a sequence labeling task quickly and conveniently, with strong results.

Before reading this tutorial, you should already be familiar with the basics of fastNLP, especially loading data and building models. Working through this small task will further familiarize you with fastNLP.

.. note::

    A GPU is recommended for the experiments in this tutorial.

Named Entity Recognition (NER)
------------------------------------------

Named entity recognition extracts entities with special meaning or strong referential value from text, typically including person names, place names, organization names, and times. For example, in the sentence

    我来自复旦大学。 (I come from Fudan University.)

"复旦大学" (Fudan University) is an organization name. Named entity recognition must identify that these four characters form a single unit, and that this unit belongs to the organization category. In practice, this problem is converted into a sequence labeling problem.

For the sentence "我来自复旦大学", the prediction target is [O, O, O, B-ORG, I-ORG, I-ORG, I-ORG], where O means "outside" (not part of an entity), B-ORG marks the beginning (Begin) of an ORG (short for organization) entity, and I-ORG marks the inside (Inside) of an ORG entity.

In this tutorial we will use fastNLP to build a model that performs this task.
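
The conversion described above can be sketched in plain Python. The helper `to_bio` below is purely illustrative and is not part of fastNLP:

```python
# Turn a character sequence plus labeled entity spans into a BIO tag sequence.
def to_bio(chars, spans):
    """chars: list of characters; spans: list of (start, end, label), end exclusive."""
    tags = ['O'] * len(chars)
    for start, end, label in spans:
        tags[start] = 'B-' + label           # first character of the entity
        for i in range(start + 1, end):
            tags[i] = 'I-' + label           # remaining characters of the entity
    return tags

chars = list('我来自复旦大学')
print(to_bio(chars, [(3, 7, 'ORG')]))
# ['O', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG']
```

In the dataset used below, these string tags are further mapped to integer indices by a vocabulary, which is why the `target` column printed later contains numbers rather than tag names.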

Loading the Data
------------------------------------------

Data loading in fastNLP is handled by two base classes, Loader and Pipe. See :doc:`使用Loader和Pipe处理数据 </tutorials/tutorial_4_load_dataset>` for how to use the data loading functions fastNLP provides. Below we use the Weibo named entity recognition task to demonstrate sequence labeling with fastNLP.

.. code-block:: python

    from fastNLP.io import WeiboNERPipe

    data_bundle = WeiboNERPipe().process_from_file()
    print(data_bundle.get_dataset('train')[:2])

The printed data is ::

    +-------------------------------------------------+------------------------------------------+------------------------------------------+---------+
    | raw_chars                                       | target                                   | chars                                    | seq_len |
    +-------------------------------------------------+------------------------------------------+------------------------------------------+---------+
    | ['一', '节', '课', '的', '时', '间', '真', '... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, ... | [8, 211, 775, 3, 49, 245, 89, 26, 101... | 16      |
    | ['回', '复', '支', '持', ',', '赞', '成', '... | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... | [116, 480, 127, 109, 2, 446, 134, 2, ... | 59      |
    +-------------------------------------------------+------------------------------------------+------------------------------------------+---------+

Building the Model
--------------------------------

First, choose the type of Embedding to use. For details on Embeddings, see :doc:`使用Embedding模块将文本转成向量 </tutorials/tutorial_3_embedding>`. Here we use a Chinese character embedding pretrained with word2vec.

.. code-block:: python

    from fastNLP.embeddings import StaticEmbedding

    embed = StaticEmbedding(vocab=data_bundle.get_vocab('chars'), model_dir_or_name='cn-char-fastnlp-100d')

With the Embedding chosen, we can use fastNLP's built-in :class:`fastNLP.models.BiLSTMCRF` as the model.

.. code-block:: python

    from fastNLP.models import BiLSTMCRF

    # BiLSTMCRF's forward function expects a field named 'words' rather than 'chars',
    # so this column needs to be renamed.
    data_bundle.rename_field('chars', 'words')
    model = BiLSTMCRF(embed=embed, num_classes=len(data_bundle.get_vocab('target')), num_layers=1, hidden_size=200, dropout=0.5,
                      target_vocab=data_bundle.get_vocab('target'))

Training
--------------------------------

Next we choose the metric used to evaluate the model and the optimizer used for training.

.. code-block:: python

    from fastNLP import SpanFPreRecMetric
    from torch.optim import Adam
    from fastNLP import LossInForward

    metric = SpanFPreRecMetric(tag_vocab=data_bundle.get_vocab('target'))
    optimizer = Adam(model.parameters(), lr=1e-2)
    loss = LossInForward()
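
Span-based evaluation scores whole entities rather than individual tags: a predicted entity counts as correct only if both its boundaries and its type exactly match a gold entity. The sketch below is a simplified illustration of this idea, not fastNLP's actual implementation of `SpanFPreRecMetric`:

```python
# Span-level precision/recall/F1 over (start, end, label) entity tuples.
def span_f1(pred_spans, gold_spans):
    pred, gold = set(pred_spans), set(gold_spans)
    tp = len(pred & gold)                          # exact boundary-and-type matches
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f = 2 * precision * recall / (precision + recall) if tp else 0.0
    return f, precision, recall

gold = [(3, 7, 'ORG')]
pred = [(3, 7, 'ORG'), (0, 1, 'PER')]              # one correct, one spurious
print(span_f1(pred, gold))                         # precision 0.5, recall 1.0
```

This is why the `f`, `pre`, and `rec` numbers in the training logs below can differ substantially from per-tag accuracy: a single wrong boundary makes the whole entity count as an error.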

Train with the Trainer; you can select the GPU by changing the value of ``device``.

.. code-block:: python

    from fastNLP import Trainer
    import torch

    device = 0 if torch.cuda.is_available() else 'cpu'
    trainer = Trainer(data_bundle.get_dataset('train'), model, loss=loss, optimizer=optimizer,
                      dev_data=data_bundle.get_dataset('dev'), metrics=metric, device=device)
    trainer.train()

The training output is ::

    input fields after batch(if batch size is 2):
        target: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 26])
        seq_len: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2])
        words: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 26])
    target fields after batch(if batch size is 2):
        target: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2, 26])
        seq_len: (1)type:torch.Tensor (2)dtype:torch.int64, (3)shape:torch.Size([2])

    training epochs started 2019-09-25-10-43-09
    Evaluate data in 0.62 seconds!
    Evaluation on dev at Epoch 1/10. Step:43/430:
    SpanFPreRecMetric: f=0.070352, pre=0.100962, rec=0.053985

    ...

    Evaluate data in 0.61 seconds!
    Evaluation on dev at Epoch 10/10. Step:430/430:
    SpanFPreRecMetric: f=0.51223, pre=0.581699, rec=0.457584

    In Epoch:7/Step:301, got best dev performance:
    SpanFPreRecMetric: f=0.515528, pre=0.65098, rec=0.426735
    Reloaded the best model.

Testing
--------------------------------

After training finishes, you can evaluate the model's performance on the test set with :class:`~fastNLP.Tester`.

.. code-block:: python

    from fastNLP import Tester

    tester = Tester(data_bundle.get_dataset('test'), model, metrics=metric)
    tester.test()

The output is ::

    [tester]
    SpanFPreRecMetric: f=0.482399, pre=0.530086, rec=0.442584
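
In an application you will usually want to turn the model's predicted tag sequences back into entity spans. The decoder below is an illustrative sketch of BIO decoding, not a fastNLP API:

```python
# Decode a BIO tag sequence into (start, end, label) spans, end exclusive.
def bio_to_spans(tags):
    spans, start, label = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith('B-'):
            if start is not None:                # close the previous entity
                spans.append((start, i, label))
            start, label = i, tag[2:]
        elif tag.startswith('I-') and label == tag[2:]:
            continue                             # still inside the current entity
        else:
            if start is not None:                # 'O' or an inconsistent 'I-' tag
                spans.append((start, i, label))
            start, label = None, None
    if start is not None:                        # entity running to end of sentence
        spans.append((start, len(tags), label))
    return spans

tags = ['O', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG']
print(bio_to_spans(tags))
# [(3, 7, 'ORG')]
```

To obtain the tag strings themselves, map the model's predicted indices back through ``data_bundle.get_vocab('target')``.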

Sequence Labeling with the Stronger BERT
----------------------------------------

To use BERT for a task in fastNLP, you only need to switch :class:`fastNLP.embeddings.StaticEmbedding` to :class:`fastNLP.embeddings.BertEmbedding` (you can change ``device`` to select the GPU).

.. code-block:: python

    from fastNLP.io import WeiboNERPipe
    from fastNLP.models import BiLSTMCRF

    data_bundle = WeiboNERPipe().process_from_file()
    data_bundle.rename_field('chars', 'words')

    from fastNLP.embeddings import BertEmbedding
    embed = BertEmbedding(vocab=data_bundle.get_vocab('words'), model_dir_or_name='cn')
    model = BiLSTMCRF(embed=embed, num_classes=len(data_bundle.get_vocab('target')), num_layers=1, hidden_size=200, dropout=0.5,
                      target_vocab=data_bundle.get_vocab('target'))

    from fastNLP import SpanFPreRecMetric
    from torch.optim import Adam
    from fastNLP import LossInForward
    metric = SpanFPreRecMetric(tag_vocab=data_bundle.get_vocab('target'))
    optimizer = Adam(model.parameters(), lr=2e-5)
    loss = LossInForward()

    from fastNLP import Trainer
    import torch
    device = 0 if torch.cuda.is_available() else 'cpu'
    trainer = Trainer(data_bundle.get_dataset('train'), model, loss=loss, optimizer=optimizer, batch_size=12,
                      dev_data=data_bundle.get_dataset('dev'), metrics=metric, device=device)
    trainer.train()

    from fastNLP import Tester
    tester = Tester(data_bundle.get_dataset('test'), model, metrics=metric)
    tester.test()

The output is ::

    training epochs started 2019-09-25-07-15-43
    Evaluate data in 2.02 seconds!
    Evaluation on dev at Epoch 1/10. Step:113/1130:
    SpanFPreRecMetric: f=0.0, pre=0.0, rec=0.0

    ...

    Evaluate data in 2.17 seconds!
    Evaluation on dev at Epoch 10/10. Step:1130/1130:
    SpanFPreRecMetric: f=0.647332, pre=0.589852, rec=0.717224

    In Epoch:6/Step:678, got best dev performance:
    SpanFPreRecMetric: f=0.669963, pre=0.645238, rec=0.696658
    Reloaded the best model.

    Evaluate data in 1.82 seconds!
    [tester]
    SpanFPreRecMetric: f=0.641774, pre=0.626424, rec=0.657895

As you can see, using BERT clearly improves the results, raising the test F1 from 48.2 to 64.1.

----------------------------------
Code Download
----------------------------------

.. raw:: html

    <a href="../_static/notebooks/%E5%BA%8F%E5%88%97%E6%A0%87%E6%B3%A8.ipynb" download="序列标注.ipynb">Click to download the IPython Notebook file</a><hr>