.. _dataset:

AutoGL Dataset
==============

We import the dataset modules from `CogDL` and `PyTorch Geometric` and add support for datasets from `OGB`. For how to create and build datasets, see the tutorials of `CogDL`_, `PyTorch Geometric`_, and `OGB`_.

.. _CogDL: https://cogdl.readthedocs.io/en/latest/tutorial.html
.. _PyTorch Geometric: https://pytorch-geometric.readthedocs.io/en/latest/notes/create_dataset.html
.. _OGB: https://ogb.stanford.edu/docs/dataset_overview/

Supported datasets
------------------

AutoGL currently supports the following benchmarks for different tasks.

Semi-supervised node classification: Cora, Citeseer, Pubmed, Amazon Computers\*, Amazon Photo\*, Coauthor CS\*, Coauthor Physics\*, Reddit (\*: splitting the dataset with `utils.random_splits_mask_class` is recommended).

For details on the supported datasets, please refer to `PyTorch Geometric Dataset`_.

.. _PyTorch Geometric Dataset: https://pytorch-geometric.readthedocs.io/en/latest/modules/datasets.html

+------------------+-----+-------+---+---+------------+-----------+---------------------+---------------------+
| Dataset          | PyG | CogDL | x | y | edge_index | edge_attr | train/val/test node | train/val/test mask |
+==================+=====+=======+===+===+============+===========+=====================+=====================+
| Cora             | ✓   |       | ✓ | ✓ | ✓          | ✓         |                     | ✓                   |
+------------------+-----+-------+---+---+------------+-----------+---------------------+---------------------+
| Citeseer         | ✓   |       | ✓ | ✓ | ✓          | ✓         |                     | ✓                   |
+------------------+-----+-------+---+---+------------+-----------+---------------------+---------------------+
| Pubmed           | ✓   |       | ✓ | ✓ | ✓          | ✓         |                     | ✓                   |
+------------------+-----+-------+---+---+------------+-----------+---------------------+---------------------+
| Amazon Computers | ✓   |       | ✓ | ✓ | ✓          | ✓         |                     |                     |
+------------------+-----+-------+---+---+------------+-----------+---------------------+---------------------+
| Amazon Photo     | ✓   |       | ✓ | ✓ | ✓          | ✓         |                     |                     |
+------------------+-----+-------+---+---+------------+-----------+---------------------+---------------------+
| Coauthor CS      | ✓   |       | ✓ | ✓ | ✓          | ✓         |                     |                     |
+------------------+-----+-------+---+---+------------+-----------+---------------------+---------------------+
| Coauthor Physics | ✓   |       | ✓ | ✓ | ✓          | ✓         |                     |                     |
+------------------+-----+-------+---+---+------------+-----------+---------------------+---------------------+
| Reddit           | ✓   |       | ✓ | ✓ | ✓          | ✓         |                     | ✓                   |
+------------------+-----+-------+---+---+------------+-----------+---------------------+---------------------+
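
For the datasets marked with \* in the list above, which come without train/val/test masks, a random per-class split is recommended. The following is a minimal sketch of that idea in plain Python. It is not the implementation of `utils.random_splits_mask_class`; the function name `random_split_masks`, its parameters, and the toy labels are assumptions made for illustration only.

.. code-block:: python

    import random
    from collections import defaultdict


    def random_split_masks(labels, num_train_per_class, num_val_per_class, seed=None):
        """Return boolean train/val/test masks (hypothetical helper).

        A fixed number of training and validation nodes is sampled from
        every class; all remaining nodes go to the test set.
        """
        rng = random.Random(seed)

        # Group node indices by class label
        nodes_by_class = defaultdict(list)
        for node, label in enumerate(labels):
            nodes_by_class[label].append(node)

        n = len(labels)
        train, val, test = [False] * n, [False] * n, [False] * n
        val_end = num_train_per_class + num_val_per_class
        for nodes in nodes_by_class.values():
            rng.shuffle(nodes)
            for node in nodes[:num_train_per_class]:
                train[node] = True
            for node in nodes[num_train_per_class:val_end]:
                val[node] = True
            for node in nodes[val_end:]:
                test[node] = True
        return train, val, test


    # Toy labels: 3 classes with 4 nodes each (placeholder data)
    labels = [0, 1, 2] * 4
    train_mask, val_mask, test_mask = random_split_masks(labels, 2, 1, seed=0)

Each node ends up in exactly one of the three masks, and fixing `seed` makes the split reproducible.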

Graph classification: MUTAG, IMDB-B, IMDB-M, PROTEINS, COLLAB

+----------+-----+-------+---+---+------------+-----------+
| Dataset  | PyG | CogDL | x | y | edge_index | edge_attr |
+==========+=====+=======+===+===+============+===========+
| MUTAG    | ✓   |       | ✓ | ✓ | ✓          | ✓         |
+----------+-----+-------+---+---+------------+-----------+
| IMDB-B   | ✓   |       |   | ✓ | ✓          |           |
+----------+-----+-------+---+---+------------+-----------+
| IMDB-M   | ✓   |       |   | ✓ | ✓          |           |
+----------+-----+-------+---+---+------------+-----------+
| PROTEINS | ✓   |       | ✓ | ✓ | ✓          |           |
+----------+-----+-------+---+---+------------+-----------+
| COLLAB   | ✓   |       |   | ✓ | ✓          |           |
+----------+-----+-------+---+---+------------+-----------+

TODO: Support all datasets from `PyTorch Geometric`.

OGB datasets
------------

AutoGL also supports the popular `OGB` benchmarks for node classification and graph classification tasks. For a summary of the `OGB` datasets, please refer to their `docs`_.

.. _docs: https://ogb.stanford.edu/docs/nodeprop/

Since the loss and evaluation metric used for `OGB` datasets vary among tasks, we also add `string` properties to the datasets for identification:

+-----------------+----------------+-------------------+
| Dataset         | dataset.metric | dataset.loss      |
+=================+================+===================+
| ogbn-products   | Accuracy       | nll_loss          |
+-----------------+----------------+-------------------+
| ogbn-proteins   | ROC-AUC        | BCEWithLogitsLoss |
+-----------------+----------------+-------------------+
| ogbn-arxiv      | Accuracy       | nll_loss          |
+-----------------+----------------+-------------------+
| ogbn-papers100M | Accuracy       | nll_loss          |
+-----------------+----------------+-------------------+
| ogbn-mag        | Accuracy       | nll_loss          |
+-----------------+----------------+-------------------+
| ogbg-molhiv     | ROC-AUC        | BCEWithLogitsLoss |
+-----------------+----------------+-------------------+
| ogbg-molpcba    | AP             | BCEWithLogitsLoss |
+-----------------+----------------+-------------------+
| ogbg-ppa        | Accuracy       | CrossEntropyLoss  |
+-----------------+----------------+-------------------+
| ogbg-code       | F1 score       | CrossEntropyLoss  |
+-----------------+----------------+-------------------+
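
Because these properties are plain strings, training code typically has to branch on them. The sketch below copies the metric and loss names from the table above into a dictionary and pairs each loss name with the kind of model output the PyTorch loss of the same name expects. The names `OGB_PROPERTIES`, `EXPECTED_MODEL_OUTPUT`, and `describe` are hypothetical and not part of AutoGL.

.. code-block:: python

    # (metric, loss) names copied from the table above
    OGB_PROPERTIES = {
        "ogbn-products": ("Accuracy", "nll_loss"),
        "ogbn-proteins": ("ROC-AUC", "BCEWithLogitsLoss"),
        "ogbn-arxiv": ("Accuracy", "nll_loss"),
        "ogbn-papers100M": ("Accuracy", "nll_loss"),
        "ogbn-mag": ("Accuracy", "nll_loss"),
        "ogbg-molhiv": ("ROC-AUC", "BCEWithLogitsLoss"),
        "ogbg-molpcba": ("AP", "BCEWithLogitsLoss"),
        "ogbg-ppa": ("Accuracy", "CrossEntropyLoss"),
        "ogbg-code": ("F1 score", "CrossEntropyLoss"),
    }

    # What each loss expects as model output, following PyTorch's conventions
    EXPECTED_MODEL_OUTPUT = {
        "nll_loss": "log-probabilities",    # torch.nn.functional.nll_loss
        "CrossEntropyLoss": "raw logits",   # torch.nn.CrossEntropyLoss
        "BCEWithLogitsLoss": "raw logits",  # torch.nn.BCEWithLogitsLoss
    }


    def describe(name):
        """Summarize metric, loss, and expected model output for one dataset."""
        metric, loss = OGB_PROPERTIES[name]
        return {
            "metric": metric,
            "loss": loss,
            "model_output": EXPECTED_MODEL_OUTPUT[loss],
        }


    info = describe("ogbn-arxiv")

For `ogbn-arxiv`, for instance, the model should end with a log-softmax, since `nll_loss` takes log-probabilities rather than raw logits.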

Create a dataset via URL
------------------------

If your dataset has the same format as the 'ppi' dataset, which contains two matrices, 'network' and 'group', you can register it directly with code like the following. The default root for downloading datasets is `~/.cache-autogl`; you can also specify the root by passing a string as `path` in `build_dataset(args, path)` or `build_dataset_from_name(dataset, path)`.

.. code-block:: python

    # The following snippet is from autogl/datasets/matlab_matrix.py
    @register_dataset("ppi")
    class PPIDataset(MatlabMatrix):
        def __init__(self, path):
            dataset, filename = "ppi", "Homo_sapiens"
            url = "http://snap.stanford.edu/node2vec/"
            super(PPIDataset, self).__init__(path, filename, url)

You should declare the name of the dataset, the name of the file, and the URL from which our script can download the resource. Then you can use either `build_dataset(args, path)` or `build_dataset_from_name(dataset, path)` in your task to build the dataset with the corresponding parameters.
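
To make the register-then-build pattern concrete, here is a simplified, self-contained sketch of how a name-based registry can work. It is not AutoGL's code: `_REGISTRY`, `register`, `build_from_name`, and `ToyDataset` are hypothetical stand-ins, and the real `register_dataset` and `build_dataset_from_name` may behave differently.

.. code-block:: python

    _REGISTRY = {}  # maps dataset name -> dataset class


    def register(name):
        """Class decorator that records a dataset class under `name`."""
        def decorator(cls):
            if name in _REGISTRY:
                raise ValueError(f"dataset {name!r} is already registered")
            _REGISTRY[name] = cls
            return cls
        return decorator


    def build_from_name(name, path="~/.cache-autogl"):
        """Look up the class registered under `name` and instantiate it."""
        if name not in _REGISTRY:
            raise KeyError(f"unknown dataset {name!r}")
        return _REGISTRY[name](path)


    @register("toy")
    class ToyDataset:
        def __init__(self, path):
            # A real dataset would download and parse its files under `path`
            self.path = path


    dataset = build_from_name("toy", "/tmp/toy-data")

The decorator only stores the class; nothing is constructed until the dataset is requested by name, which is why the `path` can be chosen at build time.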

Create a dataset locally
------------------------

If you want to test your local dataset, we recommend referring to the docs on `creating PyTorch Geometric dataset`_.

.. _creating PyTorch Geometric dataset: https://pytorch-geometric.readthedocs.io/en/latest/notes/create_dataset.html

You can simply inherit from `torch_geometric.data.InMemoryDataset` to create an empty `dataset`, then create some `torch_geometric.data.Data` objects for your data, collect them in a regular Python list, and pass that list to your `torch_geometric.data.Dataset` subclass or to `torch_geometric.data.DataLoader`.

Let’s see this process in a simplified example:

.. code-block:: python

    from torch_geometric.data import Data, InMemoryDataset

    class MyDataset(InMemoryDataset):
        def __init__(self, datalist) -> None:
            super().__init__()
            # Collate the list of Data objects into one Data object plus slices
            self.data, self.slices = self.collate(datalist)

    # Create your own Data objects.
    # For example, if you have edge_index, features and labels
    # (tensors you have prepared yourself), you can create a Data as follows.
    # See the PyTorch Geometric documentation for more information on Data.
    data = Data()
    data.edge_index = edge_index
    data.x = features
    data.y = labels

    # Create a list of Data objects
    data_list = [data, Data(...), ..., Data(...)]

    # Initialize the AutoGL dataset with your own data
    myData = MyDataset(data_list)