|
|
|
@@ -3,144 +3,98 @@ |
|
|
|
AutoGL Dataset |
|
|
|
============== |
|
|
|
|
|
|
|
We import the module of datasets from `CogDL` and `PyTorch Geometric` and add support for datasets from `OGB`. One can refer to the usage of creating and building datasets via the tutorial of `CogDL`_, `PyTorch Geometric`_, and `OGB`_. |
|
|
|
We provide various common datasets based on ``PyTorch-Geometric``, ``Deep Graph Library`` and ``OGB``. |
|
|
|
Besides, users are able to leverage a unified abstraction provided in AutoGL, ``GeneralStaticGraph``, which is towards both static homogeneous graph and static heterogeneous graph. |
|
|
|
|
|
|
|
.. _CogDL: https://cogdl.readthedocs.io/en/latest/tutorial.html |
|
|
|
.. _PyTorch Geometric: https://pytorch-geometric.readthedocs.io/en/latest/notes/create_dataset.html |
|
|
|
.. _OGB: https://ogb.stanford.edu/docs/dataset_overview/ |
|
|
|
|
|
|
|
A basic example to construct an instance of ``GeneralStaticGraph`` is shown as follows. |
|
|
|
|
|
|
|
.. code-block:: python |
|
|
|
from autogl.data.graph import GeneralStaticGraph, GeneralStaticGraphGenerator |
|
|
|
|
|
|
|
''' Construct a custom homogeneous graph ''' |
|
|
|
custom_static_homogeneous_graph: GeneralStaticGraph = GeneralStaticGraphGenerator.create_homogeneous_static_graph( |
|
|
|
{'x': torch.rand(2708, 3), 'y': torch.rand(2708, 1)}, torch.randint(0, 1024, (2, 10556)) |
|
|
|
) |
|
|
|
|
|
|
|
''' Construct a custom heterogemneous graph ''' |
|
|
|
custom_static_heterogeneous_graph: GeneralStaticGraph = GeneralStaticGraphGenerator.create_heterogeneous_static_graph( |
|
|
|
{ |
|
|
|
'author': {'x': torch.rand(1024, 3), 'y': torch.rand(1024, 1)}, |
|
|
|
'paper': {'feat': torch.rand(2048, 10), 'z': torch.rand(2048, 13)} |
|
|
|
}, |
|
|
|
{ |
|
|
|
('author', 'writing', 'paper'): (torch.randint(0, 1024, (2, 5120)), torch.rand(5120, 10)), |
|
|
|
('author', 'reading', 'paper'): torch.randint(0, 1024, (2, 3840)), |
|
|
|
} |
|
|
|
) |
|
|
|
|
|
|
|
Supporting datasets |
|
|
|
------------------- |
|
|
|
AutoGL now supports the following benchmarks for different tasks: |
|
|
|
|
|
|
|
Semi-supervised node classification: Cora, Citeseer, Pubmed, Amazon Computers\*, Amazon Photo\*, Coauthor CS\*, Coauthor Physics\*, Reddit (\*: using `utils.random_splits_mask_class` for splitting dataset is recommended.). |
|
|
|
For detailed information for supporting datasets, please kindly refer to `PyTorch Geometric Dataset`_. |
|
|
|
|
|
|
|
.. _PyTorch Geometric Dataset: https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html |
|
|
|
|
|
|
|
+------------------+------------+-----------+------------+------------+-----------+------------+--------------------+---------------------+ |
|
|
|
| Dataset | PyG | CogDL | x | y | edge_index| edge_attr | train/val/test node | train/val/test mask | |
|
|
|
+==================+============+===========+============+============+===========+============+====================+=====================+ |
|
|
|
| Cora | ✓ | | ✓ | ✓ | ✓ | ✓ | | ✓ | |
|
|
|
+------------------+------------+-----------+------------+------------+-----------+------------+--------------------+---------------------+ |
|
|
|
| Citeseer | ✓ | | ✓ | ✓ | ✓ | ✓ | | ✓ | |
|
|
|
+------------------+------------+-----------+------------+------------+-----------+------------+--------------------+---------------------+ |
|
|
|
| Pubmed | ✓ | | ✓ | ✓ | ✓ | ✓ | | ✓ | |
|
|
|
+------------------+------------+-----------+------------+------------+-----------+------------+--------------------+---------------------+ |
|
|
|
| Amazon Computers | ✓ | | ✓ | ✓ | ✓ | ✓ | | | |
|
|
|
+------------------+------------+-----------+------------+------------+-----------+------------+--------------------+---------------------+ |
|
|
|
| Amazon Photo | ✓ | | ✓ | ✓ | ✓ | ✓ | | | |
|
|
|
+------------------+------------+-----------+------------+------------+-----------+------------+--------------------+---------------------+ |
|
|
|
| Coauthor CS | ✓ | | ✓ | ✓ | ✓ | ✓ | | | |
|
|
|
+------------------+------------+-----------+------------+------------+-----------+------------+--------------------+---------------------+ |
|
|
|
| Coauthor Physics | ✓ | | ✓ | ✓ | ✓ | ✓ | | | |
|
|
|
+------------------+------------+-----------+------------+------------+-----------+------------+--------------------+---------------------+ |
|
|
|
| Reddit | ✓ | | ✓ | ✓ | ✓ | ✓ | | ✓ | |
|
|
|
+------------------+------------+-----------+------------+------------+-----------+------------+--------------------+---------------------+ |
|
|
|
|
|
|
|
Graph classification: MUTAG, IMDB-B, IMDB-M, PROTEINS, COLLAB |
|
|
|
|
|
|
|
+-----------+------------+------------+-----------+------------+------------+-----------+ |
|
|
|
| Dataset | PyG | CogDL | x | y | edge_index | edge_attr | |
|
|
|
+===========+============+============+===========+============+============+===========+ |
|
|
|
| MUTAG | ✓ | | ✓ | ✓ | ✓ | ✓ | |
|
|
|
+-----------+------------+------------+-----------+------------+------------+-----------+ |
|
|
|
| IMDB-B | ✓ | | | ✓ | ✓ | | |
|
|
|
+-----------+------------+------------+-----------+------------+------------+-----------+ |
|
|
|
| IMDB-M | ✓ | | | ✓ | ✓ | | |
|
|
|
+-----------+------------+------------+-----------+------------+------------+-----------+ |
|
|
|
| PROTEINS | ✓ | | ✓ | ✓ | ✓ | | |
|
|
|
+-----------+------------+------------+-----------+------------+------------+-----------+ |
|
|
|
| COLLAB | ✓ | | | ✓ | ✓ | | |
|
|
|
+-----------+------------+------------+-----------+------------+------------+-----------+ |
|
|
|
|
|
|
|
TODO: Supporting all datasets from `PyTorch Geometric`. |
|
|
|
|
|
|
|
OGB datasets |
|
|
|
------------ |
|
|
|
AutoGL also supports the popular benchmark on `OGB` for node classification and graph classification tasks. For the summary of `OGB` datasets, please kindly refer to the their `docs`_. |
|
|
|
|
|
|
|
.. _docs: https://ogb.stanford.edu/docs/nodeprop/ |
|
|
|
|
|
|
|
Since the loss and evaluation metric used for `OGB` datasets vary among different tasks, we also add `string` properties of datasets for identification: |
|
|
|
|
|
|
|
+-----------------+----------------+-------------------+ |
|
|
|
| Dataset | dataset.metric | datasets.loss | |
|
|
|
+=================+================+===================+ |
|
|
|
| ogbn-products | Accuracy | nll_loss | |
|
|
|
+-----------------+----------------+-------------------+ |
|
|
|
| ogbn-proteins | ROC-AUC | BCEWithLogitsLoss | |
|
|
|
+-----------------+----------------+-------------------+ |
|
|
|
| ogbn-arxiv | Accuracy | nll_loss | |
|
|
|
+-----------------+----------------+-------------------+ |
|
|
|
| ogbn-papers100M | Accuracy | nll_loss | |
|
|
|
+-----------------+----------------+-------------------+ |
|
|
|
| ogbn-mag | Accuracy | nll_loss | |
|
|
|
+-----------------+----------------+-------------------+ |
|
|
|
| ogbg-molhiv | ROC-AUC | BCEWithLogitsLoss | |
|
|
|
+-----------------+----------------+-------------------+ |
|
|
|
| ogbg-molpcba | AP | BCEWithLogitsLoss | |
|
|
|
+-----------------+----------------+-------------------+ |
|
|
|
| ogbg-ppa | Accuracy | CrossEntropyLoss | |
|
|
|
+-----------------+----------------+-------------------+ |
|
|
|
| ogbg-code | F1 score | CrossEntropyLoss | |
|
|
|
+-----------------+----------------+-------------------+ |
|
|
|
|
|
|
|
|
|
|
|
Create a dataset via URL |
|
|
|
------------------------ |
|
|
|
|
|
|
|
If your dataset is the same as the 'ppi' dataset, which contains two matrices: 'network' and 'group', you can register your dataset directly use the above code. The default root for downloading dataset is `~/.cache-autogl`, you can also specify the root by passing the string to the `path` in `build_dataset(args, path)` or `build_dataset_from_name(dataset, path)`. |
|
|
|
|
|
|
|
.. code-block:: python |
|
|
|
|
|
|
|
# following code-snippet is from autogl/datasets/matlab_matrix.py |
|
|
|
|
|
|
|
@register_dataset("ppi") |
|
|
|
class PPIDataset(MatlabMatrix): |
|
|
|
def __init__(self, path): |
|
|
|
dataset, filename = "ppi", "Homo_sapiens" |
|
|
|
url = "http://snap.stanford.edu/node2vec/" |
|
|
|
super(PPIDataset, self).__init__(path, filename, url) |
|
|
|
|
|
|
|
You should declare the name of the dataset, the name of the file, and the URL, where our script can download the resource. Then you can use either `build_dataset(args, path)` or `build_dataset_from_name(dataset, path)` in your task to build a dataset with corresponding parameters. |
|
|
|
|
|
|
|
Create a dataset locally |
|
|
|
------------------------ |
|
|
|
|
|
|
|
If you want to test your local dataset, we recommend you to refer to the docs on `creating PyTorch Geometric dataset`_. |
|
|
|
|
|
|
|
.. _creating PyTorch Geometric dataset: https://pytorch-geometric.readthedocs.io/en/latest/notes/create_dataset.html |
|
|
|
|
|
|
|
|
|
|
|
You can simply inherit from `torch_geometric.data.InMemoryDataset` to create an empty `dataset`, then create some `torch_geometric.data.Data` objects for your data and pass a regular python list holding them, then pass them to `torch_geometric.data.Dataset` or `torch_geometric.data.DataLoader`. |
|
|
|
Let’s see this process in a simplified example: |
|
|
|
Semi-supervised node classification: Cora, Citeseer, Pubmed, Amazon Computers, Amazon Photo, Coauthor CS, Coauthor Physics, Reddit, etc. |
|
|
|
|
|
|
|
+------------------+------------+-----------+--------------------------------+ |
|
|
|
| Dataset | PyG | DGL | default train/val/test split | |
|
|
|
+==================+============+===========+================================+ |
|
|
|
| Cora | ✓ | ✓ | ✓ | |
|
|
|
+------------------+------------+-----------+--------------------------------+ |
|
|
|
| Citeseer | ✓ | ✓ | ✓ | |
|
|
|
+------------------+------------+-----------+--------------------------------+ |
|
|
|
| Pubmed | ✓ | ✓ | ✓ | |
|
|
|
+------------------+------------+-----------+--------------------------------+ |
|
|
|
| Amazon Computers | ✓ | ✓ | | |
|
|
|
+------------------+------------+-----------+--------------------------------+ |
|
|
|
| Amazon Photo | ✓ | ✓ | | |
|
|
|
+------------------+------------+-----------+--------------------------------+ |
|
|
|
| Coauthor CS | ✓ | ✓ | | |
|
|
|
+------------------+------------+-----------+--------------------------------+ |
|
|
|
| Coauthor Physics | ✓ | ✓ | | |
|
|
|
+------------------+------------+-----------+--------------------------------+ |
|
|
|
| Reddit | ✓ | ✓ | ✓ | |
|
|
|
+------------------+------------+-----------+--------------------------------+ |
|
|
|
| ogbn-products | ✓ | ✓ | ✓ | |
|
|
|
+------------------+------------+-----------+--------------------------------+ |
|
|
|
| ogbn-proteins | ✓ | ✓ | ✓ | |
|
|
|
+------------------+------------+-----------+--------------------------------+ |
|
|
|
| ogbn-arxiv | ✓ | ✓ | ✓ | |
|
|
|
+------------------+------------+-----------+--------------------------------+ |
|
|
|
| ogbn-papers100M | ✓ | ✓ | ✓ | |
|
|
|
+------------------+------------+-----------+--------------------------------+ |
|
|
|
|
|
|
|
Graph classification: MUTAG, IMDB-Binary, IMDB-Multi, PROTEINS, COLLAB, etc. |
|
|
|
|
|
|
|
+-------------+------------+------------+--------------+------------+--------------------+ |
|
|
|
| Dataset | PyG | DGL | Node Feature | Label | Edge Features | |
|
|
|
+=============+============+============+==============+============+====================+ |
|
|
|
| MUTAG | ✓ | ✓ | ✓ | ✓ | ✓ | |
|
|
|
+-------------+------------+------------+--------------+------------+--------------------+ |
|
|
|
| IMDB-Binary | ✓ | ✓ | | ✓ | | |
|
|
|
+-------------+------------+------------+--------------+------------+--------------------+ |
|
|
|
| IMDB-Multi | ✓ | ✓ | | ✓ | | |
|
|
|
+-------------+------------+------------+--------------+------------+--------------------+ |
|
|
|
| PROTEINS | ✓ | ✓ | ✓ | ✓ | | |
|
|
|
+-------------+------------+------------+--------------+------------+--------------------+ |
|
|
|
| COLLAB | ✓ | ✓ | | ✓ | | |
|
|
|
+-------------+------------+------------+--------------+------------+--------------------+ |
|
|
|
| ogbg-molhiv | ✓ | ✓ | ✓ | ✓ | ✓ | |
|
|
|
+-------------+------------+------------+--------------+------------+--------------------+ |
|
|
|
| ogbg-molpcba| ✓ | ✓ | ✓ | ✓ | ✓ | |
|
|
|
+-------------+------------+------------+--------------+------------+--------------------+ |
|
|
|
| ogbg-ppa | ✓ | ✓ | | ✓ | ✓ | |
|
|
|
+-------------+------------+------------+--------------+------------+--------------------+ |
|
|
|
| ogbg-code2 | ✓ | ✓ | ✓ | ✓ | ✓ | |
|
|
|
+-------------+------------+------------+--------------+------------+--------------------+ |
|
|
|
|
|
|
|
Link Prediction: At present, AutoGL utilizes various homogeneous graphs towards node classification to conduct automatic link prediction. |
|
|
|
|
|
|
|
Construct custom dataset by instances of GeneralStaticGraph |
|
|
|
------------------------------------------------------------ |
|
|
|
The following example shows the way to compose a custom dataset by a sequence of instances of ``GeneralStaticGraph``. |
|
|
|
|
|
|
|
.. code-block:: python |
|
|
|
|
|
|
|
from typing import Iterable |
|
|
|
from torch_geometric.data.data import Data |
|
|
|
from autogl.datasets import build_dataset_from_name |
|
|
|
from torch_geometric.data import InMemoryDataset |
|
|
|
|
|
|
|
class MyDataset(InMemoryDataset): |
|
|
|
def __init__(self, datalist) -> None: |
|
|
|
super().__init__() |
|
|
|
self.data, self.slices = self.collate(datalist) |
|
|
|
|
|
|
|
# Create your own Data objects |
|
|
|
|
|
|
|
# for example, if you have edge_index, features and labels |
|
|
|
# you can create a Data as follows |
|
|
|
# See pytorch geometric more info of Data |
|
|
|
data = Data() |
|
|
|
data.edge_index = edge_index |
|
|
|
data.x = features |
|
|
|
data.y = labels |
|
|
|
|
|
|
|
# create a list of Data object |
|
|
|
data_list = [data, Data(...), ..., Data(...)] |
|
|
|
|
|
|
|
# Initialize AutoGL Dataset with your own data |
|
|
|
myData = MyDataset(data_list) |
|
|
|
from autogl.data import InMemoryDataset |
|
|
|
''' Suppose the graphs is a sequence of instances of GeneralStaticGraph ''' |
|
|
|
graphs = [ ... ] |
|
|
|
custom_dataset = InMemoryDataset(graphs) |