{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Node Classification for Heterogeneous Graph\n", "This notebook introduces how to use AutoGL to automate the learning of heterogeneous graphs in Deep Graph Library (DGL).\n", "\n", "\n", "## Import libraries" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "WARNING:root:The OGB package is out of date. Your version is 1.3.3, while the latest version is 1.3.4.\n" ] } ], "source": [ "# Import needed libraries.\n", "import torch\n", "from autogl.datasets import build_dataset_from_name\n", "from autogl.module.model.dgl import AutoHAN\n", "from sklearn.metrics import f1_score\n", "from autogl.solver import AutoHeteroNodeClassifier\n", "from tqdm import tqdm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Creating a Heterogeneous Graph\n", "\n", "AutoGL supports datasets created in DGL. We provide two datasets named \"hetero-acm-han\" and \"hetero-acm-hgt\" for HAN and HGT models, respectively [1]." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following code snippet is an example for loading a heterogeneous graph. You can also access to data stored in the dataset object for more details:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Using cached file /villa/xbn/.cache-autogl/data/hetero-acm-han/raw/ACM.mat\n" ] } ], "source": [ "# load dataset\n", "dataset = build_dataset_from_name(\"hetero-acm-han\")\n", "g = dataset[0]\n", "\n", "# get dataset information\n", "node_type = dataset.schema[\"target_node_type\"]\n", "labels = g.nodes[node_type].data['label']\n", "num_classes = labels.max().item() + 1\n", "num_features=g.nodes[node_type].data['feat'].shape[1]\n", "\n", "# get train-val-test mask\n", "train_mask = g.nodes[node_type].data['train_mask']\n", "val_mask = g.nodes[node_type].data['val_mask']\n", "test_mask = g.nodes[node_type].data['test_mask']" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "# set device\n", "device = 'cpu' # or cuda:0" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also build your own dataset and do feature engineering by adding files in the location AutoGL/autogl/datasets/_heterogeneous_datasets/_dgl_heterogeneous_datasets.py. \n", "\n", "We suggest users create a data object of type `torch_geometric.data.HeteroData` refering to the official documentation of DGL.\n", "\n", "\n", "## Building Heterogeneous GNN Modules\n", "\n", "AutoGL integrates commonly used heterogeneous graph neural network models such as ``HeteroRGCN`` (Schlichtkrull et al., 2018) [2], ``HAN`` (Wang et al., 2019) [3] and ``HGT`` (Hu et al., 2029) [4].\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# initialize auto model\n", "model = AutoHAN(\n", " dataset=dataset,\n", " num_features=num_features,\n", " num_classes=num_classes,\n", " device=device,\n", " init=True\n", ").model" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then you can train the model for 100 epochs." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "# Define the loss function.\n", "loss_fcn = torch.nn.CrossEntropyLoss()\n", "# Define the loss optimizer.\n", "optimizer = torch.optim.Adam(model.parameters(), lr=1e-2,\n", " weight_decay=1e-2)\n", "\n", "\n", "# Define the evlaute score function.\n", "def score(logits, labels):\n", " _, indices = torch.max(logits, dim=1)\n", " prediction = indices.long().cpu().numpy()\n", " labels = labels.cpu().numpy()\n", "\n", " accuracy = (prediction == labels).sum() / len(prediction)\n", " micro_f1 = f1_score(labels, prediction, average='micro')\n", " macro_f1 = f1_score(labels, prediction, average='macro')\n", "\n", " return accuracy, micro_f1, macro_f1\n", "\n", "# Define the Evaluation function.\n", "def evaluate(model, g, labels, mask, loss_func):\n", " model.eval()\n", " with torch.no_grad():\n", " logits = model(g)\n", " loss = loss_func(logits[mask], labels[mask])\n", " accuracy, micro_f1, macro_f1 = score(logits[mask], labels[mask])\n", "\n", " return loss, accuracy, micro_f1, macro_f1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, evaluate the model." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "100%|██████████| 100/100 [20:36<00:00, 12.37s/it]\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "test accuracy for HAN: 0.8988\n" ] } ], "source": [ "\n", "# Training.\n", "g = g.to(device)\n", "for epoch in tqdm(range(100)):\n", " model.train()\n", " logits = model(g)\n", " loss = loss_fcn(logits[train_mask], labels[train_mask])\n", "\n", " optimizer.zero_grad()\n", " loss.backward()\n", " optimizer.step()\n", "\n", " val_loss, val_acc, _, _ = evaluate(model, g, labels, val_mask, loss_fcn)\n", "\n", " \n", "_, test_acc, _, _ = evaluate(model, g, labels, test_mask, loss_fcn)\n", "print('test accuracy for HAN: {:.4f}'.format(test_acc))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can also define your own heterogeneous graph neural network models by adding files in the location AutoGL/autogl/module/model/dgl/hetero.\n", "\n", "## Automatic Search for Node Classification Tasks\n", "\n", "On top of the modules mentioned above, we provide a high-level API Solver to control the overall pipeline. We encapsulated the training process in the Building Heterogeneous GNN Modules part in the solver AutoHeteroNodeClassifier that supports automatic hyperparametric optimization as well as feature engineering and ensemble.\n", "In this part, we will show you how to use AutoHeteroNodeClassifier to automatically predict the publishing conference of a paper using the ACM academic graph dataset.\n", "\n", "Firstly, we get the pre-defined model hyperparameter. " ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "model_hp = {\n", " \"num_layers\": 2,\n", " \"hidden\": [256], \n", " \"heads\": [8], \n", " \"dropout\": 0.2,\n", " \"act\": \"gelu\",\n", " }" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Secondly, use AutoHeteroNodeClassifier directly to bulid automatic heterogeneous GNN models in the following example:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "def fixed(**kwargs):\n", " return [{\n", " 'parameterName': k,\n", " \"type\": \"FIXED\",\n", " \"value\": v\n", " } for k, v in kwargs.items()]\n", " \n", "solver = AutoHeteroNodeClassifier(\n", " graph_models=[\"han\"],\n", " hpo_module=\"random\",\n", " ensemble_module=None,\n", " max_evals=1,\n", " device=device,\n", " trainer_hp_space=fixed(\n", " max_epoch=100,\n", " early_stopping_round=101,\n", " lr=1e-3,\n", " weight_decay=1e-2\n", " ),\n", " model_hp_spaces=[fixed(**model_hp)]\n", " )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, fit and evlauate the model." ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "HPO Search Phase:\n", "\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ " 0%| | 0/1 [00:00