{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# T2. Basic usage of dataloader and tokenizer\n", "\n", " 1 dataloader in fastNLP\n", "\n", " 1.1 Structure and usage of databundle\n", "\n", " 1.2 Structure and usage of dataloader\n", "\n", " 2 tokenizer in fastNLP\n", " \n", " 2.1 Loading traditional GloVe word embeddings\n", " \n", " 2.2 The concept of PreTrainedTokenizer\n", "\n", " 2.3 Basic usage of BertTokenizer\n", "\n", " 3 Example: the complete loading process for the NG20 dataset\n", " \n", " 3.1 \n", "\n", " 3.2 " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. dataloader in fastNLP\n", "\n", "### 1.1 Structure and usage of databundle\n", "\n", "In `fastNLP 0.8`, between the commonly used data-loading module `DataLoader` and the dataset module `DataSet`, there is\n", "\n", " an intermediate module, the **`DataBundle` module**, which can be imported from `fastNLP.io`\n", "\n", "In `fastNLP 0.8`, **a `databundle` holds several `dataset` objects and `vocabulary` objects**,\n", "\n", " stored in its `datasets` and `vocabs` variables respectively, so before looking at `databundle`\n", "\n", " you should first **review `dataset` and `vocabulary`**. **Can you tell roughly what the code below does?**\n", "\n", "Hint: `NG20`, short for [`News Group 20`](http://qwone.com/~jason/20Newsgroups/), is a news text classification dataset with 20 major categories and a number of subcategories\n", "\n", " The dataset consists of a training set `'ng20_train.csv'` and a test set `'ng20_test.csv'`; each record\n", "\n", " has two fields, a `'label'` and a `'text'`; `sample(frac=1)[:10]` shuffles the rows and keeps the first ten" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n",
"\n"
],
"text/plain": [
"\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-------+------------------------------------------+\n",
"| label | text |\n",
"+-------+------------------------------------------+\n",
"| talk | ['mwilson', 'ncratl', 'atlantaga', 'n... |\n",
"| talk | ['ch981', 'cleveland', 'freenet', 'ed... |\n",
"| rec | ['mbeaving', 'bnr', 'ca', '\\\\(', 'bea... |\n",
"| soc | ['jayne', 'mmalt', 'guild', 'org', '\\... |\n",
"| talk | ['jrutledg', 'cs', 'ulowell', 'edu', ... |\n",
"| talk | ['cramer', 'optilink', 'com', '\\\\(', ... |\n",
"| comp | ['triton', 'unm', 'edu', '\\\\(', 'larr... |\n",
"| rec | ['ingres', 'com', '\\\\(', 'bruce', '\\\\... |\n",
"| comp | ['ldo', 'waikato', 'ac', 'nz', '\\\\(',... |\n",
"| misc | ['rebecca', 'rpi', 'edu', '\\\\(', 'ezr... |\n",
"+-------+------------------------------------------+\n",
"{'