|
- .. _dataset:
-
- AutoGL Dataset
- ==============
-
- AutoGL reuses the dataset modules of `CogDL` and `PyTorch Geometric` and adds support for datasets from `OGB`. For how to create and build datasets, see the tutorials of `CogDL`_, `PyTorch Geometric`_, and `OGB`_.
-
- .. _CogDL: https://cogdl.readthedocs.io/en/latest/tutorial.html
- .. _PyTorch Geometric: https://pytorch-geometric.readthedocs.io/en/latest/notes/create_dataset.html
- .. _OGB: https://ogb.stanford.edu/docs/dataset_overview/
-
-
- Supported datasets
- ------------------
- AutoGL currently supports the following benchmarks for different tasks:
-
- Semi-supervised node classification: Cora, Citeseer, Pubmed, Amazon Computers\*, Amazon Photo\*, Coauthor CS\*, Coauthor Physics\*, Reddit. For the datasets marked with \*, splitting with `utils.random_splits_mask_class` is recommended, since they provide no predefined train/val/test masks (see the table below).
- For details on the supported datasets, see `PyTorch Geometric Dataset`_.
-
- .. _PyTorch Geometric Dataset: https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html
-
- +------------------+------------+-----------+------------+------------+------------+------------+---------------------+---------------------+
- | Dataset          | PyG        | CogDL     | x          | y          | edge_index | edge_attr  | train/val/test node | train/val/test mask |
- +==================+============+===========+============+============+============+============+=====================+=====================+
- | Cora             | ✓          |           | ✓          | ✓          | ✓          | ✓          |                     | ✓                   |
- +------------------+------------+-----------+------------+------------+------------+------------+---------------------+---------------------+
- | Citeseer         | ✓          |           | ✓          | ✓          | ✓          | ✓          |                     | ✓                   |
- +------------------+------------+-----------+------------+------------+------------+------------+---------------------+---------------------+
- | Pubmed           | ✓          |           | ✓          | ✓          | ✓          | ✓          |                     | ✓                   |
- +------------------+------------+-----------+------------+------------+------------+------------+---------------------+---------------------+
- | Amazon Computers | ✓          |           | ✓          | ✓          | ✓          | ✓          |                     |                     |
- +------------------+------------+-----------+------------+------------+------------+------------+---------------------+---------------------+
- | Amazon Photo     | ✓          |           | ✓          | ✓          | ✓          | ✓          |                     |                     |
- +------------------+------------+-----------+------------+------------+------------+------------+---------------------+---------------------+
- | Coauthor CS      | ✓          |           | ✓          | ✓          | ✓          | ✓          |                     |                     |
- +------------------+------------+-----------+------------+------------+------------+------------+---------------------+---------------------+
- | Coauthor Physics | ✓          |           | ✓          | ✓          | ✓          | ✓          |                     |                     |
- +------------------+------------+-----------+------------+------------+------------+------------+---------------------+---------------------+
- | Reddit           | ✓          |           | ✓          | ✓          | ✓          | ✓          |                     | ✓                   |
- +------------------+------------+-----------+------------+------------+------------+------------+---------------------+---------------------+
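-
- To picture what a per-class random split does for the datasets that lack predefined masks, here is a minimal standard-library sketch. It is not the implementation of `utils.random_splits_mask_class`; the function name `random_split_per_class`, its parameters, and the toy labels are assumptions made for illustration only.
-

```python
import random
from collections import defaultdict


def random_split_per_class(labels, num_train_per_class, num_val, seed=0):
    """Return boolean train/val/test masks (hypothetical helper, not AutoGL's API).

    Draws `num_train_per_class` random nodes from every class for training,
    then `num_val` of the remaining nodes for validation; the rest is test.
    """
    rng = random.Random(seed)

    # Group node indices by their class label.
    by_class = defaultdict(list)
    for node, label in enumerate(labels):
        by_class[label].append(node)

    # Sample the same number of training nodes from each class.
    train = set()
    for nodes in by_class.values():
        train.update(rng.sample(nodes, num_train_per_class))

    # Shuffle the remaining nodes and cut them into validation and test.
    rest = [n for n in range(len(labels)) if n not in train]
    rng.shuffle(rest)
    val, test = set(rest[:num_val]), set(rest[num_val:])

    n = len(labels)
    train_mask = [i in train for i in range(n)]
    val_mask = [i in val for i in range(n)]
    test_mask = [i in test for i in range(n)]
    return train_mask, val_mask, test_mask


# Toy labels for ten nodes in three classes (made-up data).
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
train_mask, val_mask, test_mask = random_split_per_class(labels, 2, 2)
```

- Sampling per class keeps the training set balanced across classes, which a plain random split over all nodes would not guarantee.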
-
- Graph classification: MUTAG, IMDB-B, IMDB-M, PROTEINS, COLLAB
-
- +-----------+------------+------------+-----------+------------+------------+-----------+
- | Dataset   | PyG        | CogDL      | x         | y          | edge_index | edge_attr |
- +===========+============+============+===========+============+============+===========+
- | MUTAG     | ✓          |            | ✓         | ✓          | ✓          | ✓         |
- +-----------+------------+------------+-----------+------------+------------+-----------+
- | IMDB-B    | ✓          |            |           | ✓          | ✓          |           |
- +-----------+------------+------------+-----------+------------+------------+-----------+
- | IMDB-M    | ✓          |            |           | ✓          | ✓          |           |
- +-----------+------------+------------+-----------+------------+------------+-----------+
- | PROTEINS  | ✓          |            | ✓         | ✓          | ✓          |           |
- +-----------+------------+------------+-----------+------------+------------+-----------+
- | COLLAB    | ✓          |            |           | ✓          | ✓          |           |
- +-----------+------------+------------+-----------+------------+------------+-----------+
-
- TODO: Support all datasets from `PyTorch Geometric`.
-
- OGB datasets
- ------------
- AutoGL also supports the popular `OGB` benchmarks for node classification and graph classification tasks. For a summary of the `OGB` datasets, see their `docs`_.
-
- .. _docs: https://ogb.stanford.edu/docs/nodeprop/
-
- Because the loss and evaluation metric used for `OGB` datasets differ from task to task, each dataset also carries two string properties that identify them:
-
- +-----------------+----------------+-------------------+
- | Dataset         | dataset.metric | dataset.loss      |
- +=================+================+===================+
- | ogbn-products   | Accuracy       | nll_loss          |
- +-----------------+----------------+-------------------+
- | ogbn-proteins   | ROC-AUC        | BCEWithLogitsLoss |
- +-----------------+----------------+-------------------+
- | ogbn-arxiv      | Accuracy       | nll_loss          |
- +-----------------+----------------+-------------------+
- | ogbn-papers100M | Accuracy       | nll_loss          |
- +-----------------+----------------+-------------------+
- | ogbn-mag        | Accuracy       | nll_loss          |
- +-----------------+----------------+-------------------+
- | ogbg-molhiv     | ROC-AUC        | BCEWithLogitsLoss |
- +-----------------+----------------+-------------------+
- | ogbg-molpcba    | AP             | BCEWithLogitsLoss |
- +-----------------+----------------+-------------------+
- | ogbg-ppa        | Accuracy       | CrossEntropyLoss  |
- +-----------------+----------------+-------------------+
- | ogbg-code       | F1 score       | CrossEntropyLoss  |
- +-----------------+----------------+-------------------+
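-
- As a rough illustration of how downstream code might branch on these two string properties, the sketch below uses a stand-in object rather than a real AutoGL dataset. The `name` attribute, the `summarize` helper, and the grouping of losses are assumptions for this example; only the `metric` and `loss` values come from the table above.
-

```python
from types import SimpleNamespace

# Stand-in for a loaded OGB dataset. A real AutoGL dataset is assumed to expose
# the same `metric` and `loss` string attributes; `name` is added for display.
dataset = SimpleNamespace(
    name="ogbg-molhiv", metric="ROC-AUC", loss="BCEWithLogitsLoss"
)

# Losses from the table that operate on binary or multi-label targets.
BINARY_STYLE_LOSSES = {"BCEWithLogitsLoss"}


def summarize(ds):
    """Describe how a dataset would be trained and evaluated (hypothetical helper)."""
    if ds.loss in BINARY_STYLE_LOSSES:
        kind = "binary or multi-label targets"
    else:
        kind = "single-label targets"
    return f"{ds.name}: train with {ds.loss}, evaluate with {ds.metric} ({kind})"


summary = summarize(dataset)
```

- Keeping the loss and metric as plain strings lets a training loop pick the matching function by name without special-casing each dataset.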
-
-
- Create a dataset via URL
- ------------------------
-
- If your dataset has the same format as the 'ppi' dataset, which contains two matrices, 'network' and 'group', you can register it directly using the code below. The default root for downloaded datasets is `~/.cache-autogl`; you can specify a different root by passing a string as `path` to `build_dataset(args, path)` or `build_dataset_from_name(dataset, path)`.
-
- .. code-block:: python
-
-     # The following snippet is from autogl/datasets/matlab_matrix.py
-
-     @register_dataset("ppi")
-     class PPIDataset(MatlabMatrix):
-         def __init__(self, path):
-             dataset, filename = "ppi", "Homo_sapiens"
-             url = "http://snap.stanford.edu/node2vec/"
-             super(PPIDataset, self).__init__(path, filename, url)
-
- Declare the name of the dataset, the name of the file, and the URL from which our script can download the resource. You can then call either `build_dataset(args, path)` or `build_dataset_from_name(dataset, path)` in your task to build the dataset with the corresponding parameters.
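-
- The registration pattern itself can be pictured with a small self-contained sketch. This is a toy re-implementation written for illustration, not AutoGL's code: the registry dict, the `UrlDataset` base class, the dataset name `my_ppi_like`, the file name, and the example.com URL are all hypothetical, and nothing is downloaded.
-

```python
# Toy registry mapping dataset names to classes (not AutoGL's internals).
DATASET_REGISTRY = {}


def register_dataset(name):
    """Class decorator that records a dataset class under `name`."""
    def decorator(cls):
        if name in DATASET_REGISTRY:
            raise ValueError(f"dataset {name!r} is already registered")
        DATASET_REGISTRY[name] = cls
        return cls
    return decorator


class UrlDataset:
    """Hypothetical base class that only stores where the data would come from."""

    def __init__(self, path, filename, url):
        self.path = path
        self.filename = filename
        self.url = url


@register_dataset("my_ppi_like")
class MyPPILikeDataset(UrlDataset):
    def __init__(self, path):
        # Placeholder file name and URL; replace with your own resource.
        super().__init__(path, "my_file", "http://example.com/data/")


def build_dataset_from_name(dataset, path="~/.cache-autogl"):
    """Look up the registered class by name and instantiate it at `path`."""
    return DATASET_REGISTRY[dataset](path)


ds = build_dataset_from_name("my_ppi_like", "/tmp/my-datasets")
```

- Because the decorator runs at import time, a dataset becomes available by name as soon as the module that defines it is imported.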
-
- Create a dataset locally
- ------------------------
-
- If you want to test a local dataset, we recommend following the docs on `creating PyTorch Geometric dataset`_.
-
- .. _creating PyTorch Geometric dataset: https://pytorch-geometric.readthedocs.io/en/latest/notes/create_dataset.html
-
-
- Inherit from `torch_geometric.data.InMemoryDataset` to define an empty dataset class, create `torch_geometric.data.Data` objects for your data, and collect them in a regular Python list. Then pass that list to your `torch_geometric.data.Dataset` subclass or to `torch_geometric.data.DataLoader`.
- Let's see this process in a simplified example:
-
- .. code-block:: python
-
-     from torch_geometric.data import Data, InMemoryDataset
-
-     class MyDataset(InMemoryDataset):
-         def __init__(self, datalist) -> None:
-             super().__init__()
-             # Collate the list of Data objects into one storage object plus slices
-             self.data, self.slices = self.collate(datalist)
-
-     # Create your own Data objects.
-     # For example, if you have already loaded edge_index, features and labels,
-     # you can create a Data object as follows.
-     # See the PyTorch Geometric docs for more information on Data.
-     data = Data()
-     data.edge_index = edge_index
-     data.x = features
-     data.y = labels
-
-     # Create a list of Data objects
-     data_list = [data, Data(...), ..., Data(...)]
-
-     # Initialize the AutoGL dataset with your own data
-     myData = MyDataset(data_list)
|