You can not select more than 25 topics Topics must start with a chinese character,a letter or number, can include dashes ('-') and can be up to 35 characters long.

fastnlp_tutorial_1.ipynb 36 kB

12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485868788899091929394959697989910010110210310410510610710810911011111211311411511611711811912012112212312412512612712812913013113213313413513613713813914014114214314414514614714814915015115215315415515615715815916016116216316416516616716816917017117217317417517617717817918018118218318418518618718818919019119219319419519619719819920020120220320420520620720820921021121221321421521621721821922022122222322422522622722822923023123223323423523623723823924024124224324424524624724824925025125225325425525625725825926026126226326426526626726826927027127227327427527627727827928028128228328428528628728828929029129229329429529629729829930030130230330430530630730830931031131231331431531631731831932032132232332432532632732832933033133233333433533633733833934034134234334434534634734834935035135235335435535635735835936036136236336436536636736836937037137237337437537637737837938038138238338438538638738838939039139239339439539639739839940040140240340440540640740840941041141241341441541641741841942042142242342442542642742842943043143243343443543643743843944044144244344444544644744844945045145245345445545645745845946046146246346446546646746846947047147247347447547647747847948048148248348448548648748848949049149249349449549649749849950050150250350450550650750850951051151251351451551651751851952052152252352452552652752852953053153253353453553653753853954054154254354454554654754854955055155255355455555655755855956056156256356456556656756856957057157257357457557657757857958058158258358458558658758858959059159259359459559659759859960060160260360460560660760860961061161261361461561661761861962062162262362462562662762862963063163263363463563663763863964064164264364464564664764864965065165265365465565665765865966066166266366466566666766866967067167267367467567667767867968068168268368468568668768868969069169269369469569669769869970070170270
37047057067077087097107117127137147157167177187197207217227237247257267277287297307317327337347357367377387397407417427437447457467477487497507517527537547557567577587597607617627637647657667677687697707717727737747757767777787797807817827837847857867877887897907917927937947957967977987998008018028038048058068078088098108118128138148158168178188198208218228238248258268278288298308318328338348358368378388398408418428438448458468478488498508518528538548558568578588598608618628638648658668678688698708718728738748758768778788798808818828838848858868878888898908918928938948958968978988999009019029039049059069079089099109119129139149159169179189199209219229239249259269279289299309319329339349359369379389399409419429439449459469479489499509519529539549559569579589599609619629639649659669679689699709719729739749759769779789799809819829839849859869879889899909919929939949959969979989991000100110021003100410051006100710081009101010111012101310141015101610171018101910201021102210231024102510261027102810291030103110321033103410351036103710381039104010411042104310441045104610471048104910501051105210531054105510561057105810591060106110621063106410651066106710681069107010711072107310741075107610771078107910801081108210831084108510861087108810891090109110921093109410951096109710981099110011011102110311041105110611071108110911101111111211131114111511161117111811191120112111221123112411251126112711281129113011311132113311341135113611371138113911401141114211431144114511461147114811491150115111521153115411551156
  1. {
  2. "cells": [
  3. {
  4. "cell_type": "markdown",
  5. "id": "cdc25fcd",
  6. "metadata": {},
  7. "source": [
  8. "# T1. dataset 和 vocabulary 的基本使用\n",
  9. "\n",
  10. "  1   dataset 的使用与结构\n",
  11. " \n",
  12. "    1.1   dataset 的结构与创建\n",
  13. "\n",
  14. "    1.2   dataset 的数据预处理\n",
  15. "\n",
  16. "    1.3   延伸:instance 和 field\n",
  17. "\n",
  18. "  2   vocabulary 的结构与使用\n",
  19. "\n",
  20. "    2.1   vocabulary 的创建与修改\n",
  21. "\n",
  22. "    2.2   vocabulary 与 OOV 问题\n",
  23. "\n",
  24. "  3   dataset 和 vocabulary 的组合使用\n",
  25. " \n",
  26. "    3.1   从 dataframe 中加载 dataset\n",
  27. "\n",
  28. "    3.2   从 dataset 中获取 vocabulary"
  29. ]
  30. },
  31. {
  32. "cell_type": "markdown",
  33. "id": "0eb18a22",
  34. "metadata": {},
  35. "source": [
  36. "## 1. dataset 的基本使用\n",
  37. "\n",
  38. "### 1.1 dataset 的结构与创建\n",
  39. "\n",
  40. "在`fastNLP 0.8`中,使用`DataSet`模块表示数据集,**`dataset`类似于关系型数据库中的数据表**(下文统一为小写`dataset`)\n",
  41. "\n",
  42. "  **主要包含`field`字段和`instance`实例两个元素**,对应`table`中的`field`字段和`record`记录\n",
  43. "\n",
  44. "在`fastNLP 0.8`中,`DataSet`模块被定义在`fastNLP.core.dataset`路径下,导入该模块后,最简单的\n",
  45. "\n",
  46. "  初始化方法,即将字典形式的表格 **`{'field1': column1, 'field2': column2, ...}`** 传入构造函数"
  47. ]
  48. },
  49. {
  50. "cell_type": "code",
  51. "execution_count": 1,
  52. "id": "a1d69ad2",
  53. "metadata": {},
  54. "outputs": [
  55. {
  56. "data": {
  57. "text/html": [
  58. "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">\n",
  59. "</pre>\n"
  60. ],
  61. "text/plain": [
  62. "\n"
  63. ]
  64. },
  65. "metadata": {},
  66. "output_type": "display_data"
  67. },
  68. {
  69. "name": "stdout",
  70. "output_type": "stream",
  71. "text": [
  72. "+-----+------------------------+------------------------+-----+\n",
  73. "| idx | sentence | words | num |\n",
  74. "+-----+------------------------+------------------------+-----+\n",
  75. "| 0 | This is an apple . | ['This', 'is', 'an'... | 5 |\n",
  76. "| 1 | I like apples . | ['I', 'like', 'appl... | 4 |\n",
  77. "| 2 | Apples are good for... | ['Apples', 'are', '... | 7 |\n",
  78. "+-----+------------------------+------------------------+-----+\n"
  79. ]
  80. }
  81. ],
  82. "source": [
  83. "from fastNLP.core.dataset import DataSet\n",
  84. "\n",
  85. "data = {'idx': [0, 1, 2], \n",
  86. " 'sentence':[\"This is an apple .\", \"I like apples .\", \"Apples are good for our health .\"],\n",
  87. " 'words': [['This', 'is', 'an', 'apple', '.'], \n",
  88. " ['I', 'like', 'apples', '.'], \n",
  89. " ['Apples', 'are', 'good', 'for', 'our', 'health', '.']],\n",
  90. " 'num': [5, 4, 7]}\n",
  91. "\n",
  92. "dataset = DataSet(data)\n",
  93. "print(dataset)"
  94. ]
  95. },
  96. {
  97. "cell_type": "markdown",
  98. "id": "9260fdc6",
  99. "metadata": {},
  100. "source": [
  101. "&emsp; 在`dataset`的实例中,字段`field`的名称和实例`instance`中的字符串也可以是中文"
  102. ]
  103. },
  104. {
  105. "cell_type": "code",
  106. "execution_count": 2,
  107. "id": "3d72ef00",
  108. "metadata": {},
  109. "outputs": [
  110. {
  111. "name": "stdout",
  112. "output_type": "stream",
  113. "text": [
  114. "+------+--------------------+------------------------+------+\n",
  115. "| 序号 | 句子 | 字符 | 长度 |\n",
  116. "+------+--------------------+------------------------+------+\n",
  117. "| 0 | 生活就像海洋, | ['生', '活', '就', ... | 7 |\n",
  118. "| 1 | 只有意志坚强的人, | ['只', '有', '意', ... | 9 |\n",
  119. "| 2 | 才能到达彼岸。 | ['才', '能', '到', ... | 7 |\n",
  120. "+------+--------------------+------------------------+------+\n"
  121. ]
  122. }
  123. ],
  124. "source": [
  125. "temp = {'序号': [0, 1, 2], \n",
  126. " '句子':[\"生活就像海洋,\", \"只有意志坚强的人,\", \"才能到达彼岸。\"],\n",
  127. " '字符': [['生', '活', '就', '像', '海', '洋', ','], \n",
  128. " ['只', '有', '意', '志', '坚', '强', '的', '人', ','], \n",
  129. " ['才', '能', '到', '达', '彼', '岸', '。']],\n",
  130. " '长度': [7, 9, 7]}\n",
  131. "\n",
  132. "chinese = DataSet(temp)\n",
  133. "print(chinese)"
  134. ]
  135. },
  136. {
  137. "cell_type": "markdown",
  138. "id": "202e5490",
  139. "metadata": {},
  140. "source": [
  141. "在`dataset`中,使用`drop`方法可以删除满足条件的实例,这里使用了python中的`lambda`表达式\n",
  142. "\n",
  143. "&emsp; 注一:在`drop`方法中,通过设置`inplace`参数将删除对应实例后的`dataset`作为一个新的实例生成"
  144. ]
  145. },
  146. {
  147. "cell_type": "code",
  148. "execution_count": 3,
  149. "id": "09b478f8",
  150. "metadata": {},
  151. "outputs": [
  152. {
  153. "name": "stdout",
  154. "output_type": "stream",
  155. "text": [
  156. "1969418794120 1971237588872\n",
  157. "+-----+------------------------+------------------------+-----+\n",
  158. "| idx | sentence | words | num |\n",
  159. "+-----+------------------------+------------------------+-----+\n",
  160. "| 0 | This is an apple . | ['This', 'is', 'an'... | 5 |\n",
  161. "| 2 | Apples are good for... | ['Apples', 'are', '... | 7 |\n",
  162. "+-----+------------------------+------------------------+-----+\n",
  163. "+-----+------------------------+------------------------+-----+\n",
  164. "| idx | sentence | words | num |\n",
  165. "+-----+------------------------+------------------------+-----+\n",
  166. "| 0 | This is an apple . | ['This', 'is', 'an'... | 5 |\n",
  167. "| 1 | I like apples . | ['I', 'like', 'appl... | 4 |\n",
  168. "| 2 | Apples are good for... | ['Apples', 'are', '... | 7 |\n",
  169. "+-----+------------------------+------------------------+-----+\n"
  170. ]
  171. }
  172. ],
  173. "source": [
  174. "dropped = dataset\n",
  175. "dropped = dropped.drop(lambda ins:ins['num'] < 5, inplace=False)\n",
  176. "print(id(dropped), id(dataset))\n",
  177. "print(dropped)\n",
  178. "print(dataset)"
  179. ]
  180. },
  181. {
  182. "cell_type": "markdown",
  183. "id": "aa277674",
  184. "metadata": {},
  185. "source": [
  186. "&emsp; 注二:在`fastNLP 0.8`中,**对`dataset`使用等号**,**其效果是传引用**,**而不是拷贝**(这是 python 赋值的标准语义,并非`fastNLP`特有)\n",
  187. "\n",
  188. "&emsp; &emsp; 如下所示,**`dropped`和`dataset`具有相同`id`**,**对`dropped`执行删除操作`dataset`同时会被修改**"
  189. ]
  190. },
  191. {
  192. "cell_type": "code",
  193. "execution_count": 4,
  194. "id": "77c8583a",
  195. "metadata": {},
  196. "outputs": [
  197. {
  198. "name": "stdout",
  199. "output_type": "stream",
  200. "text": [
  201. "1971237588872 1971237588872\n",
  202. "+-----+------------------------+------------------------+-----+\n",
  203. "| idx | sentence | words | num |\n",
  204. "+-----+------------------------+------------------------+-----+\n",
  205. "| 0 | This is an apple . | ['This', 'is', 'an'... | 5 |\n",
  206. "| 2 | Apples are good for... | ['Apples', 'are', '... | 7 |\n",
  207. "+-----+------------------------+------------------------+-----+\n",
  208. "+-----+------------------------+------------------------+-----+\n",
  209. "| idx | sentence | words | num |\n",
  210. "+-----+------------------------+------------------------+-----+\n",
  211. "| 0 | This is an apple . | ['This', 'is', 'an'... | 5 |\n",
  212. "| 2 | Apples are good for... | ['Apples', 'are', '... | 7 |\n",
  213. "+-----+------------------------+------------------------+-----+\n"
  214. ]
  215. }
  216. ],
  217. "source": [
  218. "dropped = dataset\n",
  219. "dropped.drop(lambda ins:ins['num'] < 5)\n",
  220. "print(id(dropped), id(dataset))\n",
  221. "print(dropped)\n",
  222. "print(dataset)"
  223. ]
  224. },
  225. {
  226. "cell_type": "markdown",
  227. "id": "a76199dc",
  228. "metadata": {},
  229. "source": [
  230. "在`dataset`中,使用`delete_instance`方法可以删除对应序号的`instance`实例,序号从0开始"
  231. ]
  232. },
  233. {
  234. "cell_type": "code",
  235. "execution_count": 5,
  236. "id": "d8824b40",
  237. "metadata": {},
  238. "outputs": [
  239. {
  240. "name": "stdout",
  241. "output_type": "stream",
  242. "text": [
  243. "+-----+--------------------+------------------------+-----+\n",
  244. "| idx | sentence | words | num |\n",
  245. "+-----+--------------------+------------------------+-----+\n",
  246. "| 0 | This is an apple . | ['This', 'is', 'an'... | 5 |\n",
  247. "| 1 | I like apples . | ['I', 'like', 'appl... | 4 |\n",
  248. "+-----+--------------------+------------------------+-----+\n"
  249. ]
  250. }
  251. ],
  252. "source": [
  253. "dataset = DataSet(data)\n",
  254. "dataset.delete_instance(2)\n",
  255. "print(dataset)"
  256. ]
  257. },
  258. {
  259. "cell_type": "markdown",
  260. "id": "f4fa9f33",
  261. "metadata": {},
  262. "source": [
  263. "在`dataset`中,使用`delete_field`方法可以删除对应名称的`field`字段"
  264. ]
  265. },
  266. {
  267. "cell_type": "code",
  268. "execution_count": 6,
  269. "id": "f68ddb40",
  270. "metadata": {},
  271. "outputs": [
  272. {
  273. "name": "stdout",
  274. "output_type": "stream",
  275. "text": [
  276. "+-----+--------------------+------------------------------+\n",
  277. "| idx | sentence | words |\n",
  278. "+-----+--------------------+------------------------------+\n",
  279. "| 0 | This is an apple . | ['This', 'is', 'an', 'app... |\n",
  280. "| 1 | I like apples . | ['I', 'like', 'apples', '... |\n",
  281. "+-----+--------------------+------------------------------+\n"
  282. ]
  283. }
  284. ],
  285. "source": [
  286. "dataset.delete_field('num')\n",
  287. "print(dataset)"
  288. ]
  289. },
  290. {
  291. "cell_type": "markdown",
  292. "id": "b1e9d42c",
  293. "metadata": {},
  294. "source": [
  295. "### 1.2 dataset 的数据预处理\n",
  296. "\n",
  297. "在`dataset`模块中,`apply`、`apply_field`、`apply_more`和`apply_field_more`函数可以进行简单的数据预处理\n",
  298. "\n",
  299. "&emsp; **`apply`和`apply_more`针对整条实例**,**`apply_field`和`apply_field_more`仅针对实例的部分字段**\n",
  300. "\n",
  301. "&emsp; **`apply`和`apply_field`仅针对单个字段**,**`apply_more`和`apply_field_more`则可以针对多个字段**\n",
  302. "\n",
  303. "&emsp; **`apply`和`apply_field`返回的是个列表**,**`apply_more`和`apply_field_more`返回的是个字典**\n",
  304. "\n",
  305. "***\n",
  306. "\n",
  307. "`apply`的参数包括一个函数`func`和一个新字段名`new_field_name`,函数`func`的处理对象是`dataset`模块中\n",
  308. "\n",
  309. "&emsp; 的每个`instance`实例,函数`func`的处理结果存放在`new_field_name`对应的新建字段内"
  310. ]
  311. },
  312. {
  313. "cell_type": "code",
  314. "execution_count": 7,
  315. "id": "72a0b5f9",
  316. "metadata": {},
  317. "outputs": [
  318. {
  319. "data": {
  320. "application/vnd.jupyter.widget-view+json": {
  321. "model_id": "",
  322. "version_major": 2,
  323. "version_minor": 0
  324. },
  325. "text/plain": [
  326. "Output()"
  327. ]
  328. },
  329. "metadata": {},
  330. "output_type": "display_data"
  331. },
  332. {
  333. "data": {
  334. "text/html": [
  335. "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"></pre>\n"
  336. ],
  337. "text/plain": []
  338. },
  339. "metadata": {},
  340. "output_type": "display_data"
  341. },
  342. {
  343. "data": {
  344. "text/html": [
  345. "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">\n",
  346. "</pre>\n"
  347. ],
  348. "text/plain": [
  349. "\n"
  350. ]
  351. },
  352. "metadata": {},
  353. "output_type": "display_data"
  354. },
  355. {
  356. "name": "stdout",
  357. "output_type": "stream",
  358. "text": [
  359. "+-----+------------------------------+------------------------------+\n",
  360. "| idx | sentence | words |\n",
  361. "+-----+------------------------------+------------------------------+\n",
  362. "| 0 | This is an apple . | ['This', 'is', 'an', 'app... |\n",
  363. "| 1 | I like apples . | ['I', 'like', 'apples', '... |\n",
  364. "| 2 | Apples are good for our h... | ['Apples', 'are', 'good',... |\n",
  365. "+-----+------------------------------+------------------------------+\n"
  366. ]
  367. }
  368. ],
  369. "source": [
  370. "data = {'idx': [0, 1, 2], \n",
  371. " 'sentence':[\"This is an apple .\", \"I like apples .\", \"Apples are good for our health .\"], }\n",
  372. "dataset = DataSet(data)\n",
  373. "dataset.apply(lambda ins: ins['sentence'].split(), new_field_name='words')\n",
  374. "print(dataset)"
  375. ]
  376. },
  377. {
  378. "cell_type": "markdown",
  379. "id": "c10275ee",
  380. "metadata": {},
  381. "source": [
  382. "&emsp; **`apply`使用的函数可以是一个基于`lambda`表达式的匿名函数**,**也可以是一个自定义的函数**"
  383. ]
  384. },
  385. {
  386. "cell_type": "code",
  387. "execution_count": 8,
  388. "id": "b1a8631f",
  389. "metadata": {},
  390. "outputs": [
  391. {
  392. "data": {
  393. "text/html": [
  394. "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"></pre>\n"
  395. ],
  396. "text/plain": []
  397. },
  398. "metadata": {},
  399. "output_type": "display_data"
  400. },
  401. {
  402. "data": {
  403. "text/html": [
  404. "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"></pre>\n"
  405. ],
  406. "text/plain": []
  407. },
  408. "metadata": {},
  409. "output_type": "display_data"
  410. },
  411. {
  412. "data": {
  413. "text/html": [
  414. "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">\n",
  415. "</pre>\n"
  416. ],
  417. "text/plain": [
  418. "\n"
  419. ]
  420. },
  421. "metadata": {},
  422. "output_type": "display_data"
  423. },
  424. {
  425. "name": "stdout",
  426. "output_type": "stream",
  427. "text": [
  428. "+-----+------------------------------+------------------------------+\n",
  429. "| idx | sentence | words |\n",
  430. "+-----+------------------------------+------------------------------+\n",
  431. "| 0 | This is an apple . | ['This', 'is', 'an', 'app... |\n",
  432. "| 1 | I like apples . | ['I', 'like', 'apples', '... |\n",
  433. "| 2 | Apples are good for our h... | ['Apples', 'are', 'good',... |\n",
  434. "+-----+------------------------------+------------------------------+\n"
  435. ]
  436. }
  437. ],
  438. "source": [
  439. "dataset = DataSet(data)\n",
  440. "\n",
  441. "def get_words(instance):\n",
  442. " sentence = instance['sentence']\n",
  443. " words = sentence.split()\n",
  444. " return words\n",
  445. "\n",
  446. "dataset.apply(get_words, new_field_name='words')\n",
  447. "print(dataset)"
  448. ]
  449. },
  450. {
  451. "cell_type": "markdown",
  452. "id": "64abf745",
  453. "metadata": {},
  454. "source": [
  455. "`apply_field`的参数,除了函数`func`外还有`field_name`和`new_field_name`,该函数`func`的处理对象仅\n",
  456. "\n",
  457. "&emsp; 是`dataset`模块中的每个`field_name`对应的字段内容,处理结果存放在`new_field_name`对应的新建字段内"
  458. ]
  459. },
  460. {
  461. "cell_type": "code",
  462. "execution_count": 9,
  463. "id": "057c1d2c",
  464. "metadata": {},
  465. "outputs": [
  466. {
  467. "data": {
  468. "text/html": [
  469. "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"></pre>\n"
  470. ],
  471. "text/plain": []
  472. },
  473. "metadata": {},
  474. "output_type": "display_data"
  475. },
  476. {
  477. "data": {
  478. "text/html": [
  479. "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"></pre>\n"
  480. ],
  481. "text/plain": []
  482. },
  483. "metadata": {},
  484. "output_type": "display_data"
  485. },
  486. {
  487. "data": {
  488. "text/html": [
  489. "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">\n",
  490. "</pre>\n"
  491. ],
  492. "text/plain": [
  493. "\n"
  494. ]
  495. },
  496. "metadata": {},
  497. "output_type": "display_data"
  498. },
  499. {
  500. "name": "stdout",
  501. "output_type": "stream",
  502. "text": [
  503. "+-----+------------------------------+------------------------------+\n",
  504. "| idx | sentence | words |\n",
  505. "+-----+------------------------------+------------------------------+\n",
  506. "| 0 | This is an apple . | ['This', 'is', 'an', 'app... |\n",
  507. "| 1 | I like apples . | ['I', 'like', 'apples', '... |\n",
  508. "| 2 | Apples are good for our h... | ['Apples', 'are', 'good',... |\n",
  509. "+-----+------------------------------+------------------------------+\n"
  510. ]
  511. }
  512. ],
  513. "source": [
  514. "dataset = DataSet(data)\n",
  515. "dataset.apply_field(lambda sent:sent.split(), field_name='sentence', new_field_name='words')\n",
  516. "print(dataset)"
  517. ]
  518. },
  519. {
  520. "cell_type": "markdown",
  521. "id": "5a9cc8b2",
  522. "metadata": {},
  523. "source": [
  524. "`apply_more`的参数只有函数`func`,函数`func`的处理对象是`dataset`模块中的每个`instance`实例\n",
  525. "\n",
  526. "&emsp; 要求函数`func`返回一个字典,根据字典的`key-value`确定存储在`dataset`中的字段名称与内容"
  527. ]
  528. },
  529. {
  530. "cell_type": "code",
  531. "execution_count": 10,
  532. "id": "51e2f02c",
  533. "metadata": {},
  534. "outputs": [
  535. {
  536. "data": {
  537. "text/html": [
  538. "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"></pre>\n"
  539. ],
  540. "text/plain": []
  541. },
  542. "metadata": {},
  543. "output_type": "display_data"
  544. },
  545. {
  546. "data": {
  547. "text/html": [
  548. "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"></pre>\n"
  549. ],
  550. "text/plain": []
  551. },
  552. "metadata": {},
  553. "output_type": "display_data"
  554. },
  555. {
  556. "data": {
  557. "text/html": [
  558. "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">\n",
  559. "</pre>\n"
  560. ],
  561. "text/plain": [
  562. "\n"
  563. ]
  564. },
  565. "metadata": {},
  566. "output_type": "display_data"
  567. },
  568. {
  569. "name": "stdout",
  570. "output_type": "stream",
  571. "text": [
  572. "+-----+------------------------+------------------------+-----+\n",
  573. "| idx | sentence | words | num |\n",
  574. "+-----+------------------------+------------------------+-----+\n",
  575. "| 0 | This is an apple . | ['This', 'is', 'an'... | 5 |\n",
  576. "| 1 | I like apples . | ['I', 'like', 'appl... | 4 |\n",
  577. "| 2 | Apples are good for... | ['Apples', 'are', '... | 7 |\n",
  578. "+-----+------------------------+------------------------+-----+\n"
  579. ]
  580. }
  581. ],
  582. "source": [
  583. "dataset = DataSet(data)\n",
  584. "dataset.apply_more(lambda ins:{'words': ins['sentence'].split(), 'num': len(ins['sentence'].split())})\n",
  585. "print(dataset)"
  586. ]
  587. },
  588. {
  589. "cell_type": "markdown",
  590. "id": "02d2b7ef",
  591. "metadata": {},
  592. "source": [
  593. "`apply_field_more`的参数,除了函数`func`外还有`field_name`,函数`func`的处理对象仅是每个`field_name`对应的字段内容\n",
  594. "\n",
  595. "&emsp; 同样要求函数`func`返回一个字典,根据字典的`key-value`确定存储在`dataset`中的字段名称与内容"
  596. ]
  597. },
  598. {
  599. "cell_type": "code",
  600. "execution_count": 11,
  601. "id": "db4295d5",
  602. "metadata": {},
  603. "outputs": [
  604. {
  605. "data": {
  606. "text/html": [
  607. "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"></pre>\n"
  608. ],
  609. "text/plain": []
  610. },
  611. "metadata": {},
  612. "output_type": "display_data"
  613. },
  614. {
  615. "data": {
  616. "text/html": [
  617. "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\"></pre>\n"
  618. ],
  619. "text/plain": []
  620. },
  621. "metadata": {},
  622. "output_type": "display_data"
  623. },
  624. {
  625. "data": {
  626. "text/html": [
  627. "<pre style=\"white-space:pre;overflow-x:auto;line-height:normal;font-family:Menlo,'DejaVu Sans Mono',consolas,'Courier New',monospace\">\n",
  628. "</pre>\n"
  629. ],
  630. "text/plain": [
  631. "\n"
  632. ]
  633. },
  634. "metadata": {},
  635. "output_type": "display_data"
  636. },
  637. {
  638. "name": "stdout",
  639. "output_type": "stream",
  640. "text": [
  641. "+-----+------------------------+------------------------+-----+\n",
  642. "| idx | sentence | words | num |\n",
  643. "+-----+------------------------+------------------------+-----+\n",
  644. "| 0 | This is an apple . | ['This', 'is', 'an'... | 5 |\n",
  645. "| 1 | I like apples . | ['I', 'like', 'appl... | 4 |\n",
  646. "| 2 | Apples are good for... | ['Apples', 'are', '... | 7 |\n",
  647. "+-----+------------------------+------------------------+-----+\n"
  648. ]
  649. }
  650. ],
  651. "source": [
  652. "dataset = DataSet(data)\n",
  653. "dataset.apply_field_more(lambda sent:{'words': sent.split(), 'num': len(sent.split())}, \n",
  654. " field_name='sentence')\n",
  655. "print(dataset)"
  656. ]
  657. },
  658. {
  659. "cell_type": "markdown",
  660. "id": "9c09e592",
  661. "metadata": {},
  662. "source": [
  663. "### 1.3 延伸:instance 和 field\n",
  664. "\n",
  665. "在`fastNLP 0.8`中,使用`Instance`模块表示数据集`dataset`中的每条数据,被称为实例\n",
  666. "\n",
  667. "&emsp; 构造方式类似于构造一个字典,通过键值相同的`Instance`列表,也可以初始化一个`dataset`,代码如下"
  668. ]
  669. },
  670. {
  671. "cell_type": "code",
  672. "execution_count": 12,
  673. "id": "012f537c",
  674. "metadata": {},
  675. "outputs": [],
  676. "source": [
  677. "from fastNLP.core.dataset import DataSet\n",
  678. "from fastNLP.core.dataset import Instance\n",
  679. "\n",
  680. "dataset = DataSet([\n",
  681. " Instance(sentence=\"This is an apple .\",\n",
  682. " words=['This', 'is', 'an', 'apple', '.'],\n",
  683. " num=5),\n",
  684. " Instance(sentence=\"I like apples .\",\n",
  685. " words=['I', 'like', 'apples', '.'],\n",
  686. " num=4),\n",
  687. " Instance(sentence=\"Apples are good for our health .\",\n",
  688. " words=['Apples', 'are', 'good', 'for', 'our', 'health', '.'],\n",
  689. " num=7),\n",
  690. " ])"
  691. ]
  692. },
  693. {
  694. "cell_type": "markdown",
  695. "id": "2fafb1ef",
  696. "metadata": {},
  697. "source": [
  698. "&emsp; 通过`items`、`keys`和`values`方法,可以分别获得`dataset`的`item`列表、`key`列表、`value`列表"
  699. ]
  700. },
  701. {
  702. "cell_type": "code",
  703. "execution_count": 13,
  704. "id": "a4c1c10d",
  705. "metadata": {},
  706. "outputs": [
  707. {
  708. "name": "stdout",
  709. "output_type": "stream",
  710. "text": [
  711. "dict_items([('sentence', 'This is an apple .'), ('words', ['This', 'is', 'an', 'apple', '.']), ('num', 5)])\n",
  712. "dict_keys(['sentence', 'words', 'num'])\n",
  713. "dict_values(['This is an apple .', ['This', 'is', 'an', 'apple', '.'], 5])\n"
  714. ]
  715. }
  716. ],
  717. "source": [
  718. "ins = Instance(sentence=\"This is an apple .\", words=['This', 'is', 'an', 'apple', '.'], num=5)\n",
  719. "\n",
  720. "print(ins.items())\n",
  721. "print(ins.keys())\n",
  722. "print(ins.values())"
  723. ]
  724. },
  725. {
  726. "cell_type": "markdown",
  727. "id": "b5459a2d",
  728. "metadata": {},
  729. "source": [
  730. "&emsp; 通过`add_field`方法,可以在`Instance`实例中,通过参数`field_name`添加字段,通过参数`field`赋值"
  731. ]
  732. },
  733. {
  734. "cell_type": "code",
  735. "execution_count": 14,
  736. "id": "55376402",
  737. "metadata": {},
  738. "outputs": [
  739. {
  740. "name": "stdout",
  741. "output_type": "stream",
  742. "text": [
  743. "+--------------------+------------------------+-----+-----+\n",
  744. "| sentence | words | num | idx |\n",
  745. "+--------------------+------------------------+-----+-----+\n",
  746. "| This is an apple . | ['This', 'is', 'an'... | 5 | 0 |\n",
  747. "+--------------------+------------------------+-----+-----+\n"
  748. ]
  749. }
  750. ],
  751. "source": [
  752. "ins.add_field(field_name='idx', field=0)\n",
  753. "print(ins)"
  754. ]
  755. },
  756. {
  757. "cell_type": "markdown",
  758. "id": "49caaa9c",
  759. "metadata": {},
  760. "source": [
  761. "在`fastNLP 0.8`中,使用`FieldArray`模块表示数据集`dataset`中的每个字段(注:没有`field`类)\n",
  762. "\n",
  763. "&emsp; 通过`get_all_fields`方法可以获取`dataset`的字段列表\n",
  764. "\n",
  765. "&emsp; 通过`get_field_names`方法可以获取`dataset`的字段名称列表,代码如下"
  766. ]
  767. },
  768. {
  769. "cell_type": "code",
  770. "execution_count": 15,
  771. "id": "fe15f4c1",
  772. "metadata": {},
  773. "outputs": [
  774. {
  775. "data": {
  776. "text/plain": [
  777. "{'sentence': <fastNLP.core.dataset.field.FieldArray at 0x1ca8a879d08>,\n",
  778. " 'words': <fastNLP.core.dataset.field.FieldArray at 0x1ca8a879d88>,\n",
  779. " 'num': <fastNLP.core.dataset.field.FieldArray at 0x1ca8a879e08>}"
  780. ]
  781. },
  782. "execution_count": 15,
  783. "metadata": {},
  784. "output_type": "execute_result"
  785. }
  786. ],
  787. "source": [
  788. "dataset.get_all_fields()"
  789. ]
  790. },
  791. {
  792. "cell_type": "code",
  793. "execution_count": 16,
  794. "id": "5433815c",
  795. "metadata": {},
  796. "outputs": [
  797. {
  798. "data": {
  799. "text/plain": [
  800. "['num', 'sentence', 'words']"
  801. ]
  802. },
  803. "execution_count": 16,
  804. "metadata": {},
  805. "output_type": "execute_result"
  806. }
  807. ],
  808. "source": [
  809. "dataset.get_field_names()"
  810. ]
  811. },
  812. {
  813. "cell_type": "markdown",
  814. "id": "4964eeed",
  815. "metadata": {},
  816. "source": [
  817. "其他`dataset`的基本使用:通过`in`或者`has_field`方法可以判断`dataset`中是否包含某种字段\n",
  818. "\n",
  819. "&emsp; 通过`rename_field`方法可以更改`dataset`中的字段名称;通过`concat`方法可以实现两个`dataset`的拼接\n",
  820. "\n",
  821. "&emsp; 通过`len`可以统计`dataset`中的实例数目;`dataset`的全部变量与函数可以通过`dir(dataset)`查询"
  822. ]
  823. },
  824. {
  825. "cell_type": "code",
  826. "execution_count": 17,
  827. "id": "25ce5488",
  828. "metadata": {},
  829. "outputs": [
  830. {
  831. "name": "stdout",
  832. "output_type": "stream",
  833. "text": [
  834. "3 False\n",
  835. "6 True\n",
  836. "+------------------------------+------------------------------+--------+\n",
  837. "| sentence | words | length |\n",
  838. "+------------------------------+------------------------------+--------+\n",
  839. "| This is an apple . | ['This', 'is', 'an', 'app... | 5 |\n",
  840. "| I like apples . | ['I', 'like', 'apples', '... | 4 |\n",
  841. "| Apples are good for our h... | ['Apples', 'are', 'good',... | 7 |\n",
  842. "| This is an apple . | ['This', 'is', 'an', 'app... | 5 |\n",
  843. "| I like apples . | ['I', 'like', 'apples', '... | 4 |\n",
  844. "| Apples are good for our h... | ['Apples', 'are', 'good',... | 7 |\n",
  845. "+------------------------------+------------------------------+--------+\n"
  846. ]
  847. }
  848. ],
  849. "source": [
  850. "print(len(dataset), dataset.has_field('length')) \n",
  851. "if 'num' in dataset:\n",
  852. " dataset.rename_field('num', 'length')\n",
  853. "elif 'length' in dataset:\n",
  854. " dataset.rename_field('length', 'num')\n",
  855. "dataset.concat(dataset)\n",
  856. "print(len(dataset), dataset.has_field('length')) \n",
  857. "print(dataset) "
  858. ]
  859. },
  860. {
  861. "cell_type": "markdown",
  862. "id": "e30a6cd7",
  863. "metadata": {},
  864. "source": [
  865. "## 2. vocabulary 的结构与使用\n",
  866. "\n",
  867. "### 2.1 vocabulary 的创建与修改\n",
  868. "\n",
  869. "在`fastNLP 0.8`中,使用`Vocabulary`模块表示词汇表,**`vocabulary`的核心是从单词到序号的映射**\n",
  870. "\n",
  871. "&emsp; 可以直接通过构造函数实例化,通过查找`word2idx`属性,可以找到`vocabulary`映射对应的字典实现\n",
  872. "\n",
  873. "&emsp; **默认补零`padding`用`<pad>`表示**,**对应序号为0**;**未知单词`unknown`用`<unk>`表示**,**对应序号1**\n",
  874. "\n",
  875. "&emsp; 通过打印`vocabulary`可以看到词汇表中的单词列表,其中,`padding`和`unknown`不会显示"
  876. ]
  877. },
  878. {
  879. "cell_type": "code",
  880. "execution_count": 18,
  881. "id": "3515e096",
  882. "metadata": {},
  883. "outputs": [
  884. {
  885. "name": "stdout",
  886. "output_type": "stream",
  887. "text": [
  888. "Vocabulary([]...)\n",
  889. "{'<pad>': 0, '<unk>': 1}\n",
  890. "<pad> 0\n",
  891. "<unk> 1\n"
  892. ]
  893. }
  894. ],
  895. "source": [
  896. "from fastNLP.core.vocabulary import Vocabulary\n",
  897. "\n",
  898. "vocab = Vocabulary()\n",
  899. "print(vocab)\n",
  900. "print(vocab.word2idx)\n",
  901. "print(vocab.padding, vocab.padding_idx)\n",
  902. "print(vocab.unknown, vocab.unknown_idx)"
  903. ]
  904. },
  905. {
  906. "cell_type": "markdown",
  907. "id": "640be126",
  908. "metadata": {},
  909. "source": [
  910. "在`vocabulary`中,通过`add_word`方法或`add_word_lst`方法,可以单独或批量添加单词\n",
  911. "\n",
  912. "&emsp; 通过`len`或`word_count`属性,可以显示`vocabulary`的单词量和每个单词添加的次数"
  913. ]
  914. },
  915. {
  916. "cell_type": "code",
  917. "execution_count": 19,
  918. "id": "88c7472a",
  919. "metadata": {},
  920. "outputs": [
  921. {
  922. "name": "stdout",
  923. "output_type": "stream",
  924. "text": [
  925. "5 Counter({'生活': 1, '就像': 1, '海洋': 1})\n",
  926. "6 Counter({'生活': 1, '就像': 1, '海洋': 1, '只有': 1})\n"
  927. ]
  928. }
  929. ],
  930. "source": [
  931. "vocab.add_word_lst(['生活', '就像', '海洋'])\n",
  932. "print(len(vocab), vocab.word_count)\n",
  933. "vocab.add_word('只有')\n",
  934. "print(len(vocab), vocab.word_count)"
  935. ]
  936. },
  937. {
  938. "cell_type": "markdown",
  939. "id": "f9ec8b28",
  940. "metadata": {},
  941. "source": [
  942. "&emsp; **通过`to_word`方法可以找到序号对应的单词**,**通过`to_index`方法可以找到单词对应的序号**\n",
  943. "\n",
  944. "&emsp; &emsp; 由于序号0和序号1已经被占用,所以**新加入的词的序号从2开始计数**,如`'生活'`对应2\n",
  945. "\n",
  946. "&emsp; &emsp; 通过`has_word`方法可以判断单词是否在词汇表中,没有的单词被判做`<unk>`"
  947. ]
  948. },
  949. {
  950. "cell_type": "code",
  951. "execution_count": 20,
  952. "id": "3447acde",
  953. "metadata": {},
  954. "outputs": [
  955. {
  956. "name": "stdout",
  957. "output_type": "stream",
  958. "text": [
  959. "<pad> 0\n",
  960. "<unk> 1\n",
  961. "生活 2\n",
  962. "只有 5\n",
  963. "彼岸 1 False\n"
  964. ]
  965. }
  966. ],
  967. "source": [
  968. "print(vocab.to_word(0), vocab.to_index('<pad>'))\n",
  969. "print(vocab.to_word(1), vocab.to_index('<unk>'))\n",
  970. "print(vocab.to_word(2), vocab.to_index('生活'))\n",
  971. "print(vocab.to_word(5), vocab.to_index('只有'))\n",
  972. "print('彼岸', vocab.to_index('彼岸'), vocab.has_word('彼岸'))"
  973. ]
  974. },
  975. {
  976. "cell_type": "markdown",
  977. "id": "b4e36850",
  978. "metadata": {},
  979. "source": [
  980. "**`vocabulary`允许反复添加相同单词**,**可以通过`word_count`属性看到相应单词被添加的次数**\n",
  981. "\n",
  982. "&emsp; 但其中没有`<unk>`和`<pad>`,`vocabulary`的全部变量与函数可以通过`dir(vocabulary)`查询"
  983. ]
  984. },
  985. {
  986. "cell_type": "code",
  987. "execution_count": 21,
  988. "id": "490b101c",
  989. "metadata": {},
  990. "outputs": [
  991. {
  992. "name": "stdout",
  993. "output_type": "stream",
  994. "text": [
  995. "13 Counter({'生活': 2, '就像': 2, '海洋': 2, '只有': 2, '意志': 1, '坚强的': 1, '人': 1, '才': 1, '能': 1, '到达': 1, '彼岸': 1})\n",
  996. "彼岸 12 True\n"
  997. ]
  998. }
  999. ],
  1000. "source": [
  1001. "vocab.add_word_lst(['生活', '就像', '海洋', '只有', '意志', '坚强的', '人', '才', '能', '到达', '彼岸'])\n",
  1002. "print(len(vocab), vocab.word_count)\n",
  1003. "print('彼岸', vocab.to_index('彼岸'), vocab.has_word('彼岸'))"
  1004. ]
  1005. },
  1006. {
  1007. "cell_type": "markdown",
  1008. "id": "23e32a63",
  1009. "metadata": {},
  1010. "source": [
  1011. "### 2.2 vocabulary 与 OOV 问题\n",
  1012. "\n",
  1013. "在`vocabulary`模块初始化的时候,可以通过指定`unknown`和`padding`为`None`,限制其存在\n",
  1014. "\n",
  1015. "&emsp; 此时添加单词直接从0开始标号,如果遇到未知单词会直接报错,即 out of vocabulary"
  1016. ]
  1017. },
  1018. {
  1019. "cell_type": "code",
  1020. "execution_count": 22,
  1021. "id": "a99ff909",
  1022. "metadata": {},
  1023. "outputs": [
  1024. {
  1025. "name": "stdout",
  1026. "output_type": "stream",
  1027. "text": [
  1028. "{'positive': 0, 'negative': 1}\n",
  1029. "ValueError: word `neutral` not in vocabulary\n"
  1030. ]
  1031. }
  1032. ],
  1033. "source": [
  1034. "vocab = Vocabulary(unknown=None, padding=None)\n",
  1035. "\n",
  1036. "vocab.add_word_lst(['positive', 'negative'])\n",
  1037. "print(vocab.word2idx)\n",
  1038. "\n",
  1039. "try:\n",
  1040. " print(vocab.to_index('neutral'))\n",
  1041. "except ValueError:\n",
  1042. " print(\"ValueError: word `neutral` not in vocabulary\")"
  1043. ]
  1044. },
  1045. {
  1046. "cell_type": "markdown",
  1047. "id": "618da6bd",
  1048. "metadata": {},
  1049. "source": [
  1050. "&emsp; 相应的,如果只指定其中的`unknown`,则编号会后移一个,同时遇到未知单词全部当做`<unk>`"
  1051. ]
  1052. },
  1053. {
  1054. "cell_type": "code",
  1055. "execution_count": 23,
  1056. "id": "432f74c1",
  1057. "metadata": {},
  1058. "outputs": [
  1059. {
  1060. "name": "stdout",
  1061. "output_type": "stream",
  1062. "text": [
  1063. "{'<unk>': 0, 'positive': 1, 'negative': 2}\n",
  1064. "0 <unk>\n"
  1065. ]
  1066. }
  1067. ],
  1068. "source": [
  1069. "vocab = Vocabulary(unknown='<unk>', padding=None)\n",
  1070. "\n",
  1071. "vocab.add_word_lst(['positive', 'negative'])\n",
  1072. "print(vocab.word2idx)\n",
  1073. "\n",
  1074. "print(vocab.to_index('neutral'), vocab.to_word(vocab.to_index('neutral')))"
  1075. ]
  1076. },
  1077. {
  1078. "cell_type": "markdown",
  1079. "id": "b6263f73",
  1080. "metadata": {},
  1081. "source": [
  1082. "## 3. dataset 和 vocabulary 的组合使用\n",
  1083. " \n",
  1084. "### 3.1 从 dataframe 中加载 dataset\n",
  1085. "\n"
  1086. ]
  1087. },
  1088. {
  1089. "cell_type": "markdown",
  1090. "id": "89059713",
  1091. "metadata": {},
  1092. "source": []
  1093. },
  1094. {
  1095. "cell_type": "code",
  1096. "execution_count": null,
  1097. "id": "3dbd985d",
  1098. "metadata": {},
  1099. "outputs": [],
  1100. "source": []
  1101. },
  1102. {
  1103. "cell_type": "code",
  1104. "execution_count": null,
  1105. "id": "4f634586",
  1106. "metadata": {},
  1107. "outputs": [],
  1108. "source": []
  1109. },
  1110. {
  1111. "cell_type": "markdown",
  1112. "id": "5ba13989",
  1113. "metadata": {},
  1114. "source": [
  1115. "### 3.2 从 dataset 中获取 vocabulary"
  1116. ]
  1117. },
  1118. {
  1119. "cell_type": "code",
  1120. "execution_count": null,
  1121. "id": "a2de615b",
  1122. "metadata": {},
  1123. "outputs": [],
  1124. "source": []
  1125. },
  1126. {
  1127. "cell_type": "code",
  1128. "execution_count": null,
  1129. "id": "5f5eed18",
  1130. "metadata": {},
  1131. "outputs": [],
  1132. "source": []
  1133. }
  1134. ],
  1135. "metadata": {
  1136. "kernelspec": {
  1137. "display_name": "Python 3 (ipykernel)",
  1138. "language": "python",
  1139. "name": "python3"
  1140. },
  1141. "language_info": {
  1142. "codemirror_mode": {
  1143. "name": "ipython",
  1144. "version": 3
  1145. },
  1146. "file_extension": ".py",
  1147. "mimetype": "text/x-python",
  1148. "name": "python",
  1149. "nbconvert_exporter": "python",
  1150. "pygments_lexer": "ipython3",
  1151. "version": "3.7.4"
  1152. }
  1153. },
  1154. "nbformat": 4,
  1155. "nbformat_minor": 5
  1156. }