.. _preprocessing:

AutoGL Preprocessing
==========================

We provide a series of node-level and graph-level feature engineers that
you can compose into a feature engineering pipeline. An automatic
feature engineering algorithm and several structure engineers are also provided.

Quick Start
-----------

.. code-block :: python

    # 1. Choose a dataset.
    from autogl.datasets import build_dataset_from_name
    data = build_dataset_from_name('cora')

    # 2. Compose a preprocessing pipeline.
    from autogl.module.preprocessing import DataPreprocessor
    from autogl.module.preprocessing.feature_engineering import AutoFeatureEngineer
    from autogl.module.preprocessing.feature_engineering._generators import OneHotFeatureGenerator
    from autogl.module.preprocessing.feature_engineering._selectors import GBDTFeatureSelector
    from autogl.module.preprocessing.feature_engineering._graph import NXLargeCliqueSize

    # You may compose preprocessing bases with the & operator.
    fe = OneHotFeatureGenerator() & GBDTFeatureSelector(fixlen=100) & NXLargeCliqueSize()

    # 3. Fit and transform the data.
    fe.fit(data)
    data1 = fe.transform(data, inplace=False)

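
One way to picture the ``&`` composition is as the construction of a sequential pipeline. The following is a minimal, dependency-free sketch of that idea and is not AutoGL's actual implementation: the classes ``Step``, ``Pipeline``, ``AppendRowSum`` and ``KeepFirstK`` are hypothetical, and the "data" is just a list of feature rows.

```python
class Step:
    """Hypothetical preprocessing step with a fit/transform interface."""

    def fit(self, rows):
        return self

    def transform(self, rows):
        return rows

    def __and__(self, other):
        # `a & b` builds a pipeline that applies a first, then b.
        return Pipeline([self, other])


class Pipeline(Step):
    def __init__(self, steps):
        # Flatten nested pipelines so that (a & b) & c holds three steps.
        self.steps = []
        for step in steps:
            self.steps.extend(step.steps if isinstance(step, Pipeline) else [step])

    def fit(self, rows):
        # Each step is fitted on the output of the steps before it.
        for step in self.steps:
            rows = step.fit(rows).transform(rows)
        return self

    def transform(self, rows):
        for step in self.steps:
            rows = step.transform(rows)
        return rows


class AppendRowSum(Step):
    def transform(self, rows):
        return [row + [sum(row)] for row in rows]


class KeepFirstK(Step):
    def __init__(self, k):
        self.k = k

    def transform(self, rows):
        return [row[: self.k] for row in rows]


pipeline = AppendRowSum() & KeepFirstK(2) & AppendRowSum()
rows = [[1, 2, 3], [4, 5, 6]]
result = pipeline.fit(rows).transform(rows)
print(result)
```

In this sketch every step returns new rows, so the input list is left untouched; that mirrors the intent of calling ``transform`` with ``inplace=False`` above.
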
List of Feature Engineer Base Names
-----------------------------------

Three kinds of feature engineering bases are currently supported: ``generators``, ``selectors`` and ``graph``. You can import
bases from the corresponding module as shown in the ``Quick Start`` section, or simply list the names of the bases
in configurations or pass them as arguments to the AutoGL solver.

1. ``generators``

+---------------------------------+------------------------------------------------+
| Base                            | Description                                    |
+=================================+================================================+
| ``GraphletGenerator``           | Concatenate local graphlet counts as features. |
+---------------------------------+------------------------------------------------+
| ``EigenFeatureGenerator``       | Concatenate eigen features.                    |
+---------------------------------+------------------------------------------------+
| ``PageRankFeatureGenerator``    | Concatenate PageRank scores.                   |
+---------------------------------+------------------------------------------------+
| ``LocalDegreeProfileGenerator`` | Concatenate Local Degree Profile features.     |
+---------------------------------+------------------------------------------------+
| ``NormalizeFeatures``           | Normalize all node features.                   |
+---------------------------------+------------------------------------------------+
| ``OneHotDegreeGenerator``       | Concatenate one-hot encodings of node degrees. |
+---------------------------------+------------------------------------------------+
| ``OneHotFeatureGenerator``      | Concatenate one-hot encodings of node IDs.     |
+---------------------------------+------------------------------------------------+

2. ``selectors``

+-------------------------+---------------------------------------------------------------------------------------+
| Base                    | Description                                                                           |
+=========================+=======================================================================================+
| ``FilterConstant``      | Delete all constant and one-hot encoded node features.                                |
+-------------------------+---------------------------------------------------------------------------------------+
| ``GBDTFeatureSelector`` | Select the top-k important node features ranked by a Gradient Boosting Decision Tree. |
+-------------------------+---------------------------------------------------------------------------------------+

3. ``graph``

``NetLSD`` is a graph feature generation method; please refer to its documentation for details.

A set of graph feature extractors implemented in NetworkX is also wrapped; please refer to the NetworkX documentation for details
(``NxLargeCliqueSize``, ``NxAverageClusteringApproximate``, ``NxDegreeAssortativityCoefficient``, ``NxDegreePearsonCorrelationCoefficient``,
``NxHasBridge``, ``NxGraphCliqueNumber``, ``NxGraphNumberOfCliques``, ``NxTransitivity``, ``NxAverageClustering``, ``NxIsConnected``,
``NxNumberConnectedComponents``, ``NxIsDistanceRegular``, ``NxLocalEfficiency``, ``NxGlobalEfficiency``, ``NxIsEulerian``).

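
To illustrate what a graph-level feature is, the sketch below computes two values per graph, loosely corresponding to what the names ``NxNumberConnectedComponents`` and ``NxIsConnected`` suggest. It is written in plain Python rather than NetworkX to stay dependency-free, and the function names ``number_connected_components`` and ``graph_features`` are hypothetical, not part of AutoGL.

```python
from collections import deque


def number_connected_components(num_nodes, edges):
    """Count connected components of an undirected graph via breadth-first search."""
    neighbors = {node: set() for node in range(num_nodes)}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    seen = set()
    components = 0
    for start in range(num_nodes):
        if start in seen:
            continue
        # A node not reached so far starts a new component.
        components += 1
        seen.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node] - seen:
                seen.add(nxt)
                queue.append(nxt)
    return components


def graph_features(num_nodes, edges):
    """Return one feature dictionary for the whole graph."""
    count = number_connected_components(num_nodes, edges)
    return {"num_components": count, "is_connected": count == 1}


# Two components: {0, 1, 2} and {3, 4}.
features = graph_features(5, [(0, 1), (1, 2), (3, 4)])
print(features)
```

Unlike node-level generators, the result here has a single entry per graph, regardless of how many nodes the graph contains.
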
The taxonomy of base types follows the way each type transforms features. ``generators`` concatenate newly generated
features to the original ones, or simply overwrite the original ones. Instead of generating new features, ``selectors``
try to select useful features and keep the learned selection inside the base itself. These two types of bases operate
at the node or edge level (modifying each node or edge feature), while ``graph`` focuses on feature engineering at the
graph level (modifying each graph feature).

To ease further development, you may design a new base by inheriting from one of them.
Of course, you can also inherit directly from ``FeatureEngineer``.

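
The difference between a generator and a selector can be sketched without any library. In the hypothetical example below, ``one_hot_degree`` appends new columns to the existing features, while ``ConstantColumnFilter`` learns during ``fit`` which columns to keep and applies that same selection in ``transform``. Both names are illustrative assumptions and do not reproduce AutoGL's classes.

```python
def one_hot_degree(num_nodes, edges, features):
    """Generator-style step: append a one-hot degree encoding to each row."""
    degree = [0] * num_nodes
    for u, v in edges:  # each undirected edge is listed once
        degree[u] += 1
        degree[v] += 1
    width = max(degree) + 1
    return [
        row + [1 if d == deg else 0 for d in range(width)]
        for row, deg in zip(features, degree)
    ]


class ConstantColumnFilter:
    """Selector-style step: remember which columns vary, then keep only those."""

    def fit(self, rows):
        columns = list(zip(*rows))
        self.kept = [i for i, col in enumerate(columns) if len(set(col)) > 1]
        return self

    def transform(self, rows):
        return [[row[i] for i in self.kept] for row in rows]


edges = [(0, 1), (0, 2), (0, 3)]  # a star centred on node 0
features = [[7.0], [7.0], [7.0], [7.0]]  # one constant input feature

generated = one_hot_degree(4, edges, features)
selector = ConstantColumnFilter().fit(generated)
selected = selector.transform(generated)
print(generated)
print(selected)
```

The generator widens the feature matrix from one column to five; the selector then drops the three columns that are constant across nodes and stores the surviving column indices in ``kept`` for reuse.
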
Create Your Own Feature Engineer
--------------------------------

You can create your own feature engineering object by inheriting from ``FeatureEngineer`` and overriding the methods ``_fit`` and ``_transform``,
or simply by inheriting from one of the feature engineering base types, namely ``generators``, ``selectors`` and ``graph``.

.. code-block :: python

    # For example: create a node one-hot feature.
    import torch
    from autogl.datasets import build_dataset_from_name
    from autogl.module.preprocessing.feature_engineering._generators._basic import BaseFeatureGenerator

    data = build_dataset_from_name('cora')

    class GeOnehot(BaseFeatureGenerator):
        def _extract_nodes_feature(self, data):
            # Use the size of the feature matrix if it exists; otherwise infer
            # the number of nodes from the largest index in edge_index.
            num_nodes: int = (
                data.x.size(0)
                if data.x is not None and isinstance(data.x, torch.Tensor)
                else (data.edge_index.max().item() + 1)
            )
            # The one-hot encoding of node IDs is an identity matrix.
            return torch.eye(num_nodes)

    fe = GeOnehot()
    fe.fit(data)
    data1 = fe.transform(data, inplace=False)

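
The relationship between the public ``fit`` / ``transform`` calls and the overridable ``_fit`` / ``_transform`` hooks can be sketched as a template-method pattern. The example below is an assumption-laden illustration rather than AutoGL's code: ``SketchEngineer`` and ``MeanCenter`` are hypothetical classes, the data is a plain dictionary, and the default value of ``inplace`` is chosen for the sketch only.

```python
import copy


class SketchEngineer:
    """Hypothetical base class: public methods delegate to overridable hooks."""

    def fit(self, data):
        self._fit(data)
        return self

    def transform(self, data, inplace=True):
        # With inplace=False the hook works on a deep copy of the input.
        target = data if inplace else copy.deepcopy(data)
        return self._transform(target)

    def _fit(self, data):
        pass

    def _transform(self, data):
        return data


class MeanCenter(SketchEngineer):
    def _fit(self, data):
        values = data["x"]
        self.mean = sum(values) / len(values)

    def _transform(self, data):
        data["x"] = [value - self.mean for value in data["x"]]
        return data


data = {"x": [1.0, 2.0, 3.0]}
engineer = MeanCenter().fit(data)

copied = engineer.transform(data, inplace=False)
untouched = list(data["x"])  # the original is unchanged after a copy transform

same = engineer.transform(data, inplace=True)
print(copied, untouched, same is data)
```

A subclass only implements the two hooks; copying and call order stay in the base class, which is why overriding ``_fit`` and ``_transform`` is enough to define a new engineer in this sketch.
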
List of Structure Engineer Base Names
-------------------------------------

+----------------+-------------------------------------------+
| Base           | Description                               |
+================+===========================================+
| ``GCNSVD``     | Use truncated SVD as preprocessing.       |
+----------------+-------------------------------------------+
| ``GCNJaccard`` | Drop edges that connect dissimilar nodes. |
+----------------+-------------------------------------------+

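
As an illustration of dropping dissimilar edges, the sketch below removes every edge whose endpoints have a Jaccard similarity below a threshold, computed on binary node features. It is a simplified stand-in under stated assumptions, not the ``GCNJaccard`` implementation: the functions ``jaccard`` and ``drop_dissimilar_edges`` and the threshold of 0.5 are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two binary feature vectors."""
    on_a = {i for i, value in enumerate(a) if value}
    on_b = {i for i, value in enumerate(b) if value}
    union = on_a | on_b
    # Two all-zero vectors are treated as dissimilar in this sketch.
    return len(on_a & on_b) / len(union) if union else 0.0


def drop_dissimilar_edges(edges, features, threshold):
    """Keep only edges whose endpoints are at least `threshold` similar."""
    return [
        (u, v)
        for u, v in edges
        if jaccard(features[u], features[v]) >= threshold
    ]


features = [
    [1, 1, 0, 0],  # node 0
    [1, 1, 1, 0],  # node 1: shares two of three active features with node 0
    [0, 0, 0, 1],  # node 2: shares nothing with the others
]
edges = [(0, 1), (1, 2), (0, 2)]

kept = drop_dissimilar_edges(edges, features, threshold=0.5)
print(kept)
```

Only the edge between nodes 0 and 1 survives, since their similarity is 2/3 while both edges touching node 2 have similarity 0.
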
Create Your Own Structure Engineer
----------------------------------

You can create your own structure engineering object by inheriting from ``StructureEngineer`` and overriding the methods ``_fit`` and ``_transform``.

.. code-block :: python

    from autogl.datasets import build_dataset_from_name
    from autogl.module.preprocessing.structure_engineering import *
    from autogl.module.preprocessing.structure_engineering._structure_engineer import *
    from torch_geometric.utils import add_self_loops

    data = build_dataset_from_name('cora')

    class AddSelfLoop(StructureEngineer):
        def _transform(self, data):
            adj = get_edges(data)  # edge list
            # add_self_loops returns (edge_index, edge_attr); keep only the edges.
            modified_adj, _ = add_self_loops(adj)
            set_edges(data, modified_adj)
            return data

    fe = AddSelfLoop()
    fe.fit(data)
    data1 = fe.transform(data, inplace=False)