{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# T2. Basic usage of databundle and tokenizer\n", "\n", "  1   Extensions of dataset in fastNLP\n", "\n", "    1.1   The concept and usage of databundle\n", "\n", "  2   Tokenizers in fastNLP\n", " \n", "    2.1   The concept of PreTrainedTokenizer\n", "\n", "    2.2   Basic usage of BertTokenizer\n", "" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Extensions of dataset in fastNLP\n", "\n", "### 1.1 The concept and usage of databundle\n", "\n", "In `fastNLP 0.8`, between the familiar data-loading module `DataLoader` and the dataset module `DataSet` there is\n", "\n", "  an intermediate module, the **data-bundle module `DataBundle`**, which can be imported from `fastNLP.io`.\n", "\n", "In `fastNLP 0.8`, **a `databundle` contains a number of `dataset` instances and `vocabulary` instances**,\n", "\n", "  stored in the `datasets` and `vocabs` attributes respectively. Before looking at `databundle`,\n", "\n", "we therefore first **review `dataset` and `vocabulary`**. **Do you know roughly what the following code does?**\n", "" ] },
{ "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Processing: 0%| | 0/6 [00:00<?, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "{'<pad>': 0, '<unk>': 1, 'negative': 2, 'positive': 3, 'neutral': 4}\n" ] } ], "source": [ "import pandas as pd\n", "\n", "from fastNLP import DataSet\n", "from fastNLP import Vocabulary\n", "from fastNLP.io import DataBundle\n", "\n", "datasets = DataSet.from_pandas(pd.read_csv('./data/test4dataset.tsv', sep='\\t'))\n", "datasets.rename_field('Sentence', 'text')\n", "datasets.rename_field('Sentiment', 'label')\n", "datasets.apply_more(lambda ins: {'label': ins['label'].lower(),\n", "                                 'text': ins['text'].lower().split()},\n", "                    progress_bar='tqdm')\n", "datasets.delete_field('SentenceId')\n", "train_ds, test_ds = datasets.split(ratio=0.7)\n", "datasets = {'train': train_ds, 'test': test_ds}\n", "print(datasets['train'])\n", "print(datasets['test'])\n", "\n", "vocabs = {}\n", "vocabs['label'] = Vocabulary().from_dataset(datasets['train'].concat(datasets['test'], inplace=False), field_name='label')\n", "vocabs['text'] = Vocabulary().from_dataset(datasets['train'].concat(datasets['test'], inplace=False), field_name='text')\n", "print(vocabs['label'].word2idx)" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "The code above does the following: from the 6 instances in `test4dataset`, it splits off 4 for the training set (`int(6*0.7) = 4`) and 2 for the test set,\n", "\n", "    renames the relevant fields, deletes the index field, lower-cases all labels, and splits the text into tokens.\n", "\n", "  Next, the training and test sets are concatenated with `concat`; note `inplace=False`, which produces a temporary new dataset.\n", "\n", "  The `from_dataset` method then extracts a vocabulary from the concatenated dataset, in preparation for replacing the words in the datasets with indices.\n", "\n", "This yields the **dataset dict `datasets`** (**the training and test sets**) and the **vocabulary dict `vocabs`** (**one vocabulary per field**).\n", "\n", "  Now a `databundle` can be initialized; `print` shows its rough structure, as follows." ] },
{ "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "In total 2 
vocabs:\n", "\tlabel has 5 entries.\n", "\ttext has 96 entries.\n", "\n", "['train', 'test']\n", "['label', 'text']\n" ] } ], "source": [ "data_bundle = DataBundle(datasets=datasets, vocabs=vocabs)\n", "print(data_bundle)\n", "print(data_bundle.get_dataset_names())\n", "print(data_bundle.get_vocab_names())" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "In addition, the `num_dataset` and `num_vocab` attributes of `data_bundle` return the number of datasets and vocabularies,\n", "\n", "  and its `iter_datasets` and `iter_vocabs` methods iterate over the datasets and vocabularies." ] },
{ "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "In total 2 datasets:\n", "\ttrain has 4 instances.\n", "\ttest has 2 instances.\n", "In total 2 vocabs:\n", "\tlabel has 5 entries.\n", "\ttext has 96 entries.\n" ] } ], "source": [ "print(\"In total %d datasets:\" % data_bundle.num_dataset)\n", "for name, dataset in data_bundle.iter_datasets():\n", "    print(\"\\t%s has %d instances.\" % (name, len(dataset)))\n", "print(\"In total %d vocabs:\" % data_bundle.num_vocab)\n", "for name, vocab in data_bundle.iter_vocabs():\n", "    print(\"\\t%s has %d entries.\" % (name, len(vocab)))" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "The data bundle `databundle` also provides four `apply` functions analogous to those of `dataset`, namely\n", "\n", "  `apply`, `apply_field`, `apply_field_more` and `apply_more`,\n", "\n", "  which preprocess the contained datasets; the `apply_more` example below is representative, and the other functions work similarly.\n", "\n", "In addition, the `get_dataset` function retrieves a dataset by its name `name`,\n", "\n", "  and the `get_vocab` function retrieves a vocabulary by its field name `field_name`." ] },
{ "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Processing: 0%| | 0/4 [00:00<?, ?it/s]\n" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Processing: 0%| | 0/4 
[00:00<?, ?it/s]\n" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "The upcoming `tutorial 3.` will introduce the `dataloader` module of `fastNLP v0.8`; it covers the `collator` module\n", "\n", "  mentioned in this chapter, `fastNLP`'s adaptation to multiple backend frameworks, and the complete data-loading pipeline. Stay tuned." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.13" }, "pycharm": { "stem_cell": { "cell_type": "raw", "metadata": { "collapsed": false }, "source": [] } } }, "nbformat": 4, "nbformat_minor": 1 }