{
"cells": [
{
"cell_type": "markdown",
"id": "cdc25fcd",
"metadata": {},
"source": [
"# T1. Basic usage of dataset and vocabulary\n",
"\n",
" 1 Usage and structure of dataset\n",
" \n",
" 1.1 Structure and creation of dataset\n",
"\n",
" 1.2 Data preprocessing with dataset\n",
"\n",
" 1.3 Extension: instance and field\n",
"\n",
" 2 Structure and usage of vocabulary\n",
"\n",
" 2.1 Creating and modifying a vocabulary\n",
"\n",
" 2.2 vocabulary and the OOV problem\n",
"\n",
" 3 Using dataset and vocabulary together\n",
" \n",
" 3.1 Loading a dataset from a dataframe\n",
"\n",
" 3.2 Building a vocabulary from a dataset"
]
},
{
"cell_type": "markdown",
"id": "0eb18a22",
"metadata": {},
"source": [
"## 1. Basic usage of dataset\n",
"\n",
"### 1.1 Structure and creation of dataset\n",
"\n",
"In `fastNLP 0.8`, a dataset is represented by the `DataSet` class; **a `dataset` resembles a table in a relational database** (written in lowercase as `dataset` from here on)\n",
"\n",
" **It consists of two kinds of elements, `field`s and `instance`s**, which correspond to the `field`s (columns) and `record`s (rows) of a `table`\n",
"\n",
"In `fastNLP 0.8`, `DataSet` is defined under `fastNLP.core.dataset`. After importing it, the simplest way to initialize a `dataset` is to pass a table in dictionary form, **`{'field1': column1, 'field2': column2, ...}`**, to the constructor"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "a1d69ad2",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----+------------------------+------------------------+-----+\n",
"| idx | sentence | words | num |\n",
"+-----+------------------------+------------------------+-----+\n",
"| 0 | This is an apple . | ['This', 'is', 'an'... | 5 |\n",
"| 1 | I like apples . | ['I', 'like', 'appl... | 4 |\n",
"| 2 | Apples are good for... | ['Apples', 'are', '... | 7 |\n",
"+-----+------------------------+------------------------+-----+\n"
]
}
],
"source": [
"from fastNLP.core.dataset import DataSet\n",
"\n",
"data = {'idx': [0, 1, 2], \n",
" 'sentence':[\"This is an apple .\", \"I like apples .\", \"Apples are good for our health .\"],\n",
" 'words': [['This', 'is', 'an', 'apple', '.'], \n",
" ['I', 'like', 'apples', '.'], \n",
" ['Apples', 'are', 'good', 'for', 'our', 'health', '.']],\n",
" 'num': [5, 4, 7]}\n",
"\n",
"dataset = DataSet(data)\n",
"print(dataset)"
]
},
{
"cell_type": "markdown",
"id": "9260fdc6",
"metadata": {},
"source": [
" In a `dataset`, the field names and the strings stored in instances may also be Chinese"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "3d72ef00",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+------+--------------------+------------------------+------+\n",
"| 序号 | 句子 | 字符 | 长度 |\n",
"+------+--------------------+------------------------+------+\n",
"| 0 | 生活就像海洋, | ['生', '活', '就', ... | 7 |\n",
"| 1 | 只有意志坚强的人, | ['只', '有', '意', ... | 9 |\n",
"| 2 | 才能到达彼岸。 | ['才', '能', '到', ... | 7 |\n",
"+------+--------------------+------------------------+------+\n"
]
}
],
"source": [
"temp = {'序号': [0, 1, 2], \n",
" '句子':[\"生活就像海洋,\", \"只有意志坚强的人,\", \"才能到达彼岸。\"],\n",
" '字符': [['生', '活', '就', '像', '海', '洋', ','], \n",
" ['只', '有', '意', '志', '坚', '强', '的', '人', ','], \n",
" ['才', '能', '到', '达', '彼', '岸', '。']],\n",
" '长度': [7, 9, 7]}\n",
"\n",
"chinese = DataSet(temp)\n",
"print(chinese)"
]
},
{
"cell_type": "markdown",
"id": "202e5490",
"metadata": {},
"source": [
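"In a `dataset`, the `drop` method deletes the instances that satisfy a condition; the condition here is written as a Python `lambda` expression\n",
"\n",
" Note 1: with `inplace=False`, `drop` leaves the original `dataset` untouched and returns the filtered result as a new `dataset` object"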
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "09b478f8",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1969418794120 1971237588872\n",
"+-----+------------------------+------------------------+-----+\n",
"| idx | sentence | words | num |\n",
"+-----+------------------------+------------------------+-----+\n",
"| 0 | This is an apple . | ['This', 'is', 'an'... | 5 |\n",
"| 2 | Apples are good for... | ['Apples', 'are', '... | 7 |\n",
"+-----+------------------------+------------------------+-----+\n",
"+-----+------------------------+------------------------+-----+\n",
"| idx | sentence | words | num |\n",
"+-----+------------------------+------------------------+-----+\n",
"| 0 | This is an apple . | ['This', 'is', 'an'... | 5 |\n",
"| 1 | I like apples . | ['I', 'like', 'appl... | 4 |\n",
"| 2 | Apples are good for... | ['Apples', 'are', '... | 7 |\n",
"+-----+------------------------+------------------------+-----+\n"
]
}
],
"source": [
"dropped = dataset\n",
"dropped = dropped.drop(lambda ins:ins['num'] < 5, inplace=False)\n",
"print(id(dropped), id(dataset))\n",
"print(dropped)\n",
"print(dataset)"
]
},
{
"cell_type": "markdown",
"id": "aa277674",
"metadata": {},
"source": [
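" Note 2: as with any Python object, **assigning a `dataset` with `=` binds a second name to the same object**; **it does not make a copy**\n",
"\n",
" As shown below, **`dropped` and `dataset` have the same `id`**, so **deleting instances through `dropped` (with `drop` called without `inplace=False`) also modifies `dataset`**"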
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "77c8583a",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1971237588872 1971237588872\n",
"+-----+------------------------+------------------------+-----+\n",
"| idx | sentence | words | num |\n",
"+-----+------------------------+------------------------+-----+\n",
"| 0 | This is an apple . | ['This', 'is', 'an'... | 5 |\n",
"| 2 | Apples are good for... | ['Apples', 'are', '... | 7 |\n",
"+-----+------------------------+------------------------+-----+\n",
"+-----+------------------------+------------------------+-----+\n",
"| idx | sentence | words | num |\n",
"+-----+------------------------+------------------------+-----+\n",
"| 0 | This is an apple . | ['This', 'is', 'an'... | 5 |\n",
"| 2 | Apples are good for... | ['Apples', 'are', '... | 7 |\n",
"+-----+------------------------+------------------------+-----+\n"
]
}
],
"source": [
"dropped = dataset\n",
"dropped.drop(lambda ins:ins['num'] < 5)\n",
"print(id(dropped), id(dataset))\n",
"print(dropped)\n",
"print(dataset)"
]
},
{
"cell_type": "markdown",
"id": "a76199dc",
"metadata": {},
"source": [
"In a `dataset`, the `delete_instance` method deletes the `instance` at the given index; indices start at 0"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "d8824b40",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----+--------------------+------------------------+-----+\n",
"| idx | sentence | words | num |\n",
"+-----+--------------------+------------------------+-----+\n",
"| 0 | This is an apple . | ['This', 'is', 'an'... | 5 |\n",
"| 1 | I like apples . | ['I', 'like', 'appl... | 4 |\n",
"+-----+--------------------+------------------------+-----+\n"
]
}
],
"source": [
"dataset = DataSet(data)\n",
"dataset.delete_instance(2)\n",
"print(dataset)"
]
},
{
"cell_type": "markdown",
"id": "f4fa9f33",
"metadata": {},
"source": [
"In a `dataset`, the `delete_field` method deletes the `field` with the given name"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "f68ddb40",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----+--------------------+------------------------------+\n",
"| idx | sentence | words |\n",
"+-----+--------------------+------------------------------+\n",
"| 0 | This is an apple . | ['This', 'is', 'an', 'app... |\n",
"| 1 | I like apples . | ['I', 'like', 'apples', '... |\n",
"+-----+--------------------+------------------------------+\n"
]
}
],
"source": [
"dataset.delete_field('num')\n",
"print(dataset)"
]
},
{
"cell_type": "markdown",
"id": "b1e9d42c",
"metadata": {},
"source": [
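"### 1.2 Data preprocessing with dataset\n",
"\n",
"On a `dataset`, the methods `apply`, `apply_field`, `apply_more` and `apply_field_more` perform simple data preprocessing\n",
"\n",
" **`apply` and `apply_more` operate on the whole instance**, while **`apply_field` and `apply_field_more` operate only on the selected field of each instance**\n",
"\n",
" **`apply` and `apply_field` write their result to a single field**, while **`apply_more` and `apply_field_more` can write to several fields**\n",
"\n",
" **`apply` and `apply_field` return a list**, while **`apply_more` and `apply_field_more` return a dict**\n",
"\n",
"***\n",
"\n",
"`apply` takes a function `func` and a new field name `new_field_name`. `func` is called on each `instance` of the `dataset`,\n",
"\n",
" and its result is stored in the newly created field named `new_field_name`"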
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "72a0b5f9",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----+------------------------------+------------------------------+\n",
"| idx | sentence | words |\n",
"+-----+------------------------------+------------------------------+\n",
"| 0 | This is an apple . | ['This', 'is', 'an', 'app... |\n",
"| 1 | I like apples . | ['I', 'like', 'apples', '... |\n",
"| 2 | Apples are good for our h... | ['Apples', 'are', 'good',... |\n",
"+-----+------------------------------+------------------------------+\n"
]
}
],
"source": [
"data = {'idx': [0, 1, 2], \n",
" 'sentence':[\"This is an apple .\", \"I like apples .\", \"Apples are good for our health .\"], }\n",
"dataset = DataSet(data)\n",
"dataset.apply(lambda ins: ins['sentence'].split(), new_field_name='words')\n",
"print(dataset)"
]
},
{
"cell_type": "markdown",
"id": "c10275ee",
"metadata": {},
"source": [
" **The function passed to `apply` can be an anonymous `lambda` expression** or **a custom named function**"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "b1a8631f",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----+------------------------------+------------------------------+\n",
"| idx | sentence | words |\n",
"+-----+------------------------------+------------------------------+\n",
"| 0 | This is an apple . | ['This', 'is', 'an', 'app... |\n",
"| 1 | I like apples . | ['I', 'like', 'apples', '... |\n",
"| 2 | Apples are good for our h... | ['Apples', 'are', 'good',... |\n",
"+-----+------------------------------+------------------------------+\n"
]
}
],
"source": [
"dataset = DataSet(data)\n",
"\n",
"def get_words(instance):\n",
" sentence = instance['sentence']\n",
" words = sentence.split()\n",
" return words\n",
"\n",
"dataset.apply(get_words, new_field_name='words')\n",
"print(dataset)"
]
},
{
"cell_type": "markdown",
"id": "64abf745",
"metadata": {},
"source": [
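"`apply_field` takes, besides the function `func`, the parameters `field_name` and `new_field_name`. Here `func` receives only\n",
"\n",
" the content of the field named `field_name` for each instance of the `dataset`, and its result is stored in the newly created field named `new_field_name`"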
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "057c1d2c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----+------------------------------+------------------------------+\n",
"| idx | sentence | words |\n",
"+-----+------------------------------+------------------------------+\n",
"| 0 | This is an apple . | ['This', 'is', 'an', 'app... |\n",
"| 1 | I like apples . | ['I', 'like', 'apples', '... |\n",
"| 2 | Apples are good for our h... | ['Apples', 'are', 'good',... |\n",
"+-----+------------------------------+------------------------------+\n"
]
}
],
"source": [
"dataset = DataSet(data)\n",
"dataset.apply_field(lambda sent:sent.split(), field_name='sentence', new_field_name='words')\n",
"print(dataset)"
]
},
{
"cell_type": "markdown",
"id": "5a9cc8b2",
"metadata": {},
"source": [
"`apply_more` takes only the function `func`, which is called on each `instance` of the `dataset`\n",
"\n",
" `func` must return a dict; its `key-value` pairs determine the names and contents of the fields stored in the `dataset`"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "51e2f02c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----+------------------------+------------------------+-----+\n",
"| idx | sentence | words | num |\n",
"+-----+------------------------+------------------------+-----+\n",
"| 0 | This is an apple . | ['This', 'is', 'an'... | 5 |\n",
"| 1 | I like apples . | ['I', 'like', 'appl... | 4 |\n",
"| 2 | Apples are good for... | ['Apples', 'are', '... | 7 |\n",
"+-----+------------------------+------------------------+-----+\n"
]
}
],
"source": [
"dataset = DataSet(data)\n",
"dataset.apply_more(lambda ins:{'words': ins['sentence'].split(), 'num': len(ins['sentence'].split())})\n",
"print(dataset)"
]
},
{
"cell_type": "markdown",
"id": "02d2b7ef",
"metadata": {},
"source": [
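"`apply_field_more` takes the function `func` and the parameter `field_name`; here `func` receives only the content of the field named `field_name` for each instance of the `dataset`\n",
"\n",
" As with `apply_more`, `func` must return a dict; its `key-value` pairs determine the names and contents of the fields stored in the `dataset`"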
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "db4295d5",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----+------------------------+------------------------+-----+\n",
"| idx | sentence | words | num |\n",
"+-----+------------------------+------------------------+-----+\n",
"| 0 | This is an apple . | ['This', 'is', 'an'... | 5 |\n",
"| 1 | I like apples . | ['I', 'like', 'appl... | 4 |\n",
"| 2 | Apples are good for... | ['Apples', 'are', '... | 7 |\n",
"+-----+------------------------+------------------------+-----+\n"
]
}
],
"source": [
"dataset = DataSet(data)\n",
"dataset.apply_field_more(lambda sent:{'words': sent.split(), 'num': len(sent.split())}, \n",
" field_name='sentence')\n",
"print(dataset)"
]
},
{
"cell_type": "markdown",
"id": "9c09e592",
"metadata": {},
"source": [
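"### 1.3 Extension: instance and field\n",
"\n",
"In `fastNLP 0.8`, each record of a `dataset` is represented by the `Instance` class and is called an instance\n",
"\n",
" An `Instance` is constructed much like a dict, and a `dataset` can also be initialized from a list of `Instance`s that share the same keys, as shown below"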
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "012f537c",
"metadata": {},
"outputs": [],
"source": [
"from fastNLP.core.dataset import DataSet\n",
"from fastNLP.core.dataset import Instance\n",
"\n",
"dataset = DataSet([\n",
" Instance(sentence=\"This is an apple .\",\n",
" words=['This', 'is', 'an', 'apple', '.'],\n",
" num=5),\n",
" Instance(sentence=\"I like apples .\",\n",
" words=['I', 'like', 'apples', '.'],\n",
" num=4),\n",
" Instance(sentence=\"Apples are good for our health .\",\n",
" words=['Apples', 'are', 'good', 'for', 'our', 'health', '.'],\n",
" num=7),\n",
" ])"
]
},
{
"cell_type": "markdown",
"id": "2fafb1ef",
"metadata": {},
"source": [
" The `items`, `keys` and `values` methods return the items, keys and values of an `Instance`, respectively"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "a4c1c10d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"dict_items([('sentence', 'This is an apple .'), ('words', ['This', 'is', 'an', 'apple', '.']), ('num', 5)])\n",
"dict_keys(['sentence', 'words', 'num'])\n",
"dict_values(['This is an apple .', ['This', 'is', 'an', 'apple', '.'], 5])\n"
]
}
],
"source": [
"ins = Instance(sentence=\"This is an apple .\", words=['This', 'is', 'an', 'apple', '.'], num=5)\n",
"\n",
"print(ins.items())\n",
"print(ins.keys())\n",
"print(ins.values())"
]
},
{
"cell_type": "markdown",
"id": "b5459a2d",
"metadata": {},
"source": [
" The `add_field` method adds a field to an `Instance`: the parameter `field_name` gives the name of the field and the parameter `field` gives its value"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "55376402",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------------------+------------------------+-----+-----+\n",
"| sentence | words | num | idx |\n",
"+--------------------+------------------------+-----+-----+\n",
"| This is an apple . | ['This', 'is', 'an'... | 5 | 0 |\n",
"+--------------------+------------------------+-----+-----+\n"
]
}
],
"source": [
"ins.add_field(field_name='idx', field=0)\n",
"print(ins)"
]
},
{
"cell_type": "markdown",
"id": "49caaa9c",
"metadata": {},
"source": [
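"In `fastNLP 0.8`, each field (column) of a `dataset` is represented by the `FieldArray` class (note: there is no `field` class)\n",
"\n",
" The `get_all_fields` method returns the fields of a `dataset`\n",
"\n",
" The `get_field_names` method returns the list of field names of a `dataset`, as shown below"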
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "fe15f4c1",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'sentence':