Advanced Tutorial
=================

This tutorial assumes the reader is familiar with the D3M ecosystem in general. If not, please refer to other sections of `documentation`_ first, e.g., :ref:`quickstart`.

.. _documentation: https://docs.datadrivendiscovery.org

Overview of building a primitive
--------------------------------

1. :ref:`Recognize the base class of a primitive <primitive-class>`.
2. :ref:`Identify the input and output container types <input-output-types>`.
3. :ref:`Define metadata for each primitive <tutorial-primitive-metadata>`.
4. :ref:`Write a unit test to verify the primitive functions <unit-tests>`.
5. :ref:`Generate the primitive annotation for the primitive <primitive-annotation>`.
6. :ref:`Write a pipeline demonstrating primitive functionality <example-pipeline>`.
7. :ref:`Advanced: Primitive might use static files <static-files>`.

.. _primitive-class:

Primitive class
---------------

There are a variety of :py:mod:`primitive interfaces/classes <d3m.primitive_interfaces>` available. For example, for a primitive doing just attribute extraction without requiring any fitting, :py:class:`~d3m.primitive_interfaces.transformer.TransformerPrimitiveBase` from the :py:mod:`~d3m.primitive_interfaces.transformer` module can be used.

Each primitive can have its own :py:mod:`hyper-parameters <d3m.metadata.hyperparams>`. Some example hyper-parameter types one can use to describe a primitive's hyper-parameters are: :py:class:`~d3m.metadata.hyperparams.Constant`, :py:class:`~d3m.metadata.hyperparams.UniformBool`, :py:class:`~d3m.metadata.hyperparams.UniformInt`, :py:class:`~d3m.metadata.hyperparams.Choice`, :py:class:`~d3m.metadata.hyperparams.List`.

Also, each hyper-parameter should be defined with one or more of the four hyper-parameter semantic types:

* https://metadata.datadrivendiscovery.org/types/TuningParameter
* https://metadata.datadrivendiscovery.org/types/ControlParameter
* https://metadata.datadrivendiscovery.org/types/ResourcesUseParameter
* https://metadata.datadrivendiscovery.org/types/MetafeatureParameter

Example
~~~~~~~
.. code:: python

    from d3m.primitive_interfaces import base, transformer
    from d3m.metadata import base as metadata_base, hyperparams

    __all__ = ('ExampleTransformPrimitive',)


    class Hyperparams(hyperparams.Hyperparams):
        learning_rate = hyperparams.Uniform(lower=0.0, upper=1.0, default=0.001, semantic_types=[
            'https://metadata.datadrivendiscovery.org/types/TuningParameter',
        ])
        clusters = hyperparams.UniformInt(lower=1, upper=100, default=10, semantic_types=[
            'https://metadata.datadrivendiscovery.org/types/TuningParameter',
        ])


    # Inputs and Outputs are D3M container types, defined in the next section.
    class ExampleTransformPrimitive(transformer.TransformerPrimitiveBase[Inputs, Outputs, Hyperparams]):
        """
        The docstring is very important and must be included. It should contain
        relevant information about the hyper-parameters, primitive functionality, etc.
        """

        def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> base.CallResult[Outputs]:
            pass

.. _input-output-types:

Input/Output types
------------------

The acceptable inputs/outputs of a primitive must be pre-defined. D3M supports a variety of standard input/output container types such as:

- ``pandas.DataFrame`` (as :py:class:`d3m.container.pandas.DataFrame`)
- ``numpy.ndarray`` (as :py:class:`d3m.container.numpy.ndarray`)
- ``list`` (as :py:class:`d3m.container.list.List`)

.. note::

    Even though D3M container types behave mostly as standard types, the D3M
    container types must be used for inputs/outputs, because D3M container
    types support D3M metadata.

Example
~~~~~~~

.. code:: python

    from d3m import container

    Inputs = container.DataFrame
    Outputs = container.DataFrame


    class ExampleTransformPrimitive(transformer.TransformerPrimitiveBase[Inputs, Outputs, Hyperparams]):
        ...

.. note::

    When returning the output DataFrame, its metadata should be updated with
    the correct semantic and structural types.

Example
~~~~~~~

.. code:: python

    # Update metadata for each DataFrame column.
    for column_index in range(outputs.shape[1]):
        column_metadata = {}
        column_metadata['structural_type'] = float
        column_metadata['name'] = 'column {i}'.format(i=column_index)
        column_metadata['semantic_types'] = (
            'http://schema.org/Float',
            'https://metadata.datadrivendiscovery.org/types/Attribute',
        )
        outputs.metadata = outputs.metadata.update((metadata_base.ALL_ELEMENTS, column_index), column_metadata)

.. _tutorial-primitive-metadata:

Primitive Metadata
------------------

It is crucial to define primitive metadata for the primitive properly. Primitive metadata can be used by TA2 systems to metalearn about primitives and, in general, to decide which primitive to use when.

Example
~~~~~~~

.. code:: python

    import os

    from d3m import utils as d3m_utils
    from d3m.primitive_interfaces import base, transformer
    from d3m.metadata import base as metadata_base, hyperparams

    __all__ = ('ExampleTransformPrimitive',)


    class ExampleTransformPrimitive(transformer.TransformerPrimitiveBase[Inputs, Outputs, Hyperparams]):
        """
        Docstring.
        """

        metadata = metadata_base.PrimitiveMetadata({
            'id': '<UUID of the primitive>',
            'version': '<version of the primitive>',
            'name': '<name of the primitive>',
            # Must match the path in setup.py.
            'python_path': 'd3m.primitives.<primitive family>.<primitive name>.<kind>',
            'source': {
                'name': '<name of the author>',
                'uris': ['<URI of the code repository>'],
                'contact': 'mailto:<contact e-mail>',
            },
            'installation': [{
                'type': metadata_base.PrimitiveInstallationType.PIP,
                'package_uri': 'git+<git URI of the repository>@{git_commit}#egg=<egg name>'.format(
                    git_commit=d3m_utils.current_git_commit(os.path.dirname(__file__)),
                ),
            }],
            'algorithm_types': [
                # Check https://metadata.datadrivendiscovery.org/devel/?definitions#definitions.algorithm_types for all available algorithm types.
                # If an algorithm type is not available, a merge request should be made to add it to the core package.
                metadata_base.PrimitiveAlgorithmType.<algorithm type>,
            ],
            # Check https://metadata.datadrivendiscovery.org/devel/?definitions#definitions.primitive_family for all available primitive families.
            # If a primitive family is not available, a merge request should be made to add it to the core package.
            'primitive_family': metadata_base.PrimitiveFamily.<primitive family>,
        })

        ...
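As a toy illustration of how such metadata can be consumed (plain dictionaries stand in for real primitive metadata records, and the first Python path below is made up), a TA2 system can filter candidate primitives by fields such as ``primitive_family``:

```python
# Hypothetical metadata records; real ones come from primitive annotations.
PRIMITIVES = [
    {'python_path': 'd3m.primitives.data_transformation.example.Example',
     'primitive_family': 'DATA_TRANSFORMATION'},
    {'python_path': 'd3m.primitives.classification.decision_tree.SKlearn',
     'primitive_family': 'CLASSIFICATION'},
]


def primitives_in_family(records, family):
    """Return Python paths of all records whose primitive family matches."""
    return [record['python_path'] for record in records if record['primitive_family'] == family]


print(primitives_in_family(PRIMITIVES, 'CLASSIFICATION'))
```

Real TA2 systems use much richer signals (algorithm types, hyper-parameter spaces, past performance), but the principle of selecting on declared metadata is the same.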
.. _unit-tests:

Unit tests
----------

Once the primitive is constructed, unit testing must be done to check that the primitive works as intended.

**Sample Setup**

.. code:: python

    import os
    import unittest

    from d3m.container import dataset
    from d3m.metadata import base as metadata_base
    from common_primitives import dataset_to_dataframe

    from example_primitive import ExampleTransformPrimitive


    class ExampleTransformTest(unittest.TestCase):
        def test_happy_path(self):
            # Load a dataset.
            # Datasets can be obtained from: https://datasets.datadrivendiscovery.org/d3m/datasets
            base_path = '../datasets/training_datasets/seed_datasets_archive/'
            dataset_doc_path = os.path.join(base_path, '38_sick_dataset', 'datasetDoc.json')
            loaded_dataset = dataset.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path))

            dataframe_hyperparams_class = dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.get_hyperparams()
            dataframe_primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=dataframe_hyperparams_class.defaults())
            dataframe = dataframe_primitive.produce(inputs=loaded_dataset).value

            # Call the example transformer.
            hyperparams_class = ExampleTransformPrimitive.metadata.get_hyperparams()
            primitive = ExampleTransformPrimitive(hyperparams=hyperparams_class.defaults())
            test_out = primitive.produce(inputs=dataframe).value

            # Write assertions to make sure that the output (type, shape, metadata) is what is expected.
            self.assertEqual(...)

            ...


    if __name__ == '__main__':
        unittest.main()

It is recommended to do the testing inside the D3M Docker container:

.. code:: shell

    docker run --rm -v /home/foo/d3m:/mnt/d3m -it \
        registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9

    cd /mnt/d3m/example_primitive
    python3 primitive_name_test.py
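The assertions at the end of ``test_happy_path`` usually check the produced container's shape, value types, and column names. A self-contained sketch of that style of check, using a plain list of rows as a stand-in for a real D3M DataFrame:

```python
import unittest


class OutputChecksTest(unittest.TestCase):
    def test_shape_and_types(self):
        # Hypothetical produced output: three rows with two float attributes each.
        rows = [[1.0, 0.5], [2.0, 0.25], [3.0, 0.125]]
        column_names = ['column 0', 'column 1']

        # Shape: expected row count and a value for every column in every row.
        self.assertEqual(len(rows), 3)
        self.assertTrue(all(len(row) == len(column_names) for row in rows))
        # Structural type: every value is a float.
        self.assertTrue(all(isinstance(value, float) for row in rows for value in row))


result = unittest.TextTestRunner(verbosity=0).run(
    unittest.defaultTestLoader.loadTestsFromTestCase(OutputChecksTest),
)
```

With a real D3M container you would additionally assert on ``test_out.metadata`` (e.g., the semantic types set in ``produce``), which the plain lists here cannot model.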
.. _primitive-annotation:

Primitive annotation
--------------------

Once the primitive is constructed and unit testing is successful, the final step in building a primitive is to generate the primitive annotation, which will be indexed and used by D3M.

.. code:: shell

    docker run --rm -v /home/foo/d3m:/mnt/d3m -it \
        registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9

    cd /mnt/d3m/example_primitive
    pip3 install -e .
    python3 -m d3m index describe -i 4 <python path of the primitive>

Alternatively, a helper script can be used to generate primitive annotations as well. This can be more convenient when having to manage multiple primitives. In this case, generating the primitive annotation is done as follows:

.. code:: shell

    docker run --rm -v /home/foo/d3m:/mnt/d3m -it \
        registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9

    cd /mnt/d3m/example_primitive
    pip3 install -e .
    python3 generate-primitive-json.py ...

.. _example-pipeline:

Example pipeline
----------------

After building a custom primitive, it has to be used in an example pipeline and run using one of the D3M seed datasets in order to be integrated with other indexed D3M primitives. The essential elements of pipelines are:

``Dataset Denormalizer -> Dataset Parser -> Data Cleaner (if necessary) -> Feature Extraction -> Classifier/Regressor -> Output``

Example code for building such a pipeline is shown below:
.. code:: python

    # D3M dependencies
    from d3m import index
    from d3m.metadata.base import ArgumentType
    from d3m.metadata.pipeline import Pipeline, PrimitiveStep

    # Common primitives
    from common_primitives.column_parser import ColumnParserPrimitive
    from common_primitives.dataset_to_dataframe import DatasetToDataFramePrimitive
    from common_primitives.extract_columns_semantic_types import ExtractColumnsBySemanticTypesPrimitive

    # Testing primitive
    from quickstart_primitives.sample_primitive1.input_to_output import InputToOutputPrimitive

    # Pipeline
    pipeline = Pipeline()
    pipeline.add_input(name='inputs')

    # Step 0: DatasetToDataFrame (Dataset Denormalizer)
    step_0 = PrimitiveStep(primitive_description=DatasetToDataFramePrimitive.metadata.query())
    step_0.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference='inputs.0')
    step_0.add_output('produce')
    pipeline.add_step(step_0)

    # Step 1: Custom primitive
    step_1 = PrimitiveStep(primitive=InputToOutputPrimitive)
    step_1.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference='steps.0.produce')
    step_1.add_output('produce')
    pipeline.add_step(step_1)

    # Step 2: Column Parser (Dataset Parser)
    step_2 = PrimitiveStep(primitive_description=ColumnParserPrimitive.metadata.query())
    step_2.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference='steps.1.produce')
    step_2.add_output('produce')
    pipeline.add_step(step_2)

    # Step 3: Extract Attributes (Feature Extraction)
    step_3 = PrimitiveStep(primitive_description=ExtractColumnsBySemanticTypesPrimitive.metadata.query())
    step_3.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference='steps.2.produce')
    step_3.add_output('produce')
    step_3.add_hyperparameter(
        name='semantic_types',
        argument_type=ArgumentType.VALUE,
        data=['https://metadata.datadrivendiscovery.org/types/Attribute'],
    )
    pipeline.add_step(step_3)

    # Step 4: Extract Targets (Feature Extraction)
    step_4 = PrimitiveStep(primitive_description=ExtractColumnsBySemanticTypesPrimitive.metadata.query())
    step_4.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference='steps.0.produce')
    step_4.add_output('produce')
    step_4.add_hyperparameter(
        name='semantic_types',
        argument_type=ArgumentType.VALUE,
        data=['https://metadata.datadrivendiscovery.org/types/TrueTarget'],
    )
    pipeline.add_step(step_4)

    attributes = 'steps.3.produce'
    targets = 'steps.4.produce'

    # Step 5: Imputer (Data Cleaner)
    step_5 = PrimitiveStep(primitive=index.get_primitive('d3m.primitives.data_cleaning.imputer.SKlearn'))
    step_5.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference=attributes)
    step_5.add_output('produce')
    pipeline.add_step(step_5)

    # Step 6: Classifier
    step_6 = PrimitiveStep(primitive=index.get_primitive('d3m.primitives.classification.decision_tree.SKlearn'))
    step_6.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference='steps.5.produce')
    step_6.add_argument(name='outputs', argument_type=ArgumentType.CONTAINER, data_reference=targets)
    step_6.add_output('produce')
    pipeline.add_step(step_6)

    # Final output
    pipeline.add_output(name='output predictions', data_reference='steps.6.produce')

    # print(pipeline.to_json())
    with open('./pipeline.json', 'w') as write_file:
        write_file.write(pipeline.to_json(indent=4, sort_keys=False, ensure_ascii=False))

Once the pipeline is constructed and the pipeline's JSON file is generated, the pipeline is run using the ``python3 -m d3m runtime`` command. Successfully running the pipeline validates that the primitive is working as intended.
.. code:: shell

    docker run --rm -v /home/foo/d3m:/mnt/d3m -it \
        registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 \
        /bin/bash -c "cd /mnt/d3m; \
            pip3 install -e .; \
            cd pipelines; \
            python3 -m d3m runtime fit-produce \
                --pipeline pipeline.json \
                --problem /datasets/seed_datasets_current/38_sick/TRAIN/problem_TRAIN/problemDoc.json \
                --input /datasets/seed_datasets_current/38_sick/TRAIN/dataset_TRAIN/datasetDoc.json \
                --test-input /datasets/seed_datasets_current/38_sick/TEST/dataset_TEST/datasetDoc.json \
                --output 38_sick_results.csv \
                --output-run pipeline_run.yml; \
            exit"

.. _static-files:

Advanced: Primitive with static files
-------------------------------------

When building primitives that use external/static files, i.e., pre-trained weights, the primitive's metadata must properly define such a dependency. The static file can be hosted anywhere based on your preference, as long as the URL to the file is a direct download link. It must be public so that users of your primitive can access the file. Be sure to keep the URL available, as older versions of the primitive could start failing if the URL stops resolving.

.. note::

    Full code of this section can be found in the quickstart repository.

Below is a description of the primitive metadata required, named ``_weights_configs``, for each static file.

.. code:: python

    _weights_configs = [{
        'type': 'FILE',
        'key': '<file name>',
        'file_uri': '<direct download URL of the file>',
        'file_digest': '<sha256sum of the file>',
    }]

This ``_weights_configs`` should be directly added to the ``installation`` field of the primitive metadata.

.. code:: python

    from d3m.primitive_interfaces import base, transformer
    from d3m.metadata import base as metadata_base, hyperparams

    __all__ = ('ExampleTransform',)


    class ExampleTransform(transformer.TransformerPrimitiveBase[Inputs, Outputs, Hyperparams]):
        """
        Docstring.
        """
""" _weights_configs = [{ 'type': 'FILE', 'key': '', 'file_uri': '', 'file_digest':'sha256sum of the ', }] metadata = ... 'installation': [{ 'type': metadata_base.PrimitiveInstallationType.PIP, 'package_uri': 'git+@{git_commit}#egg='.format( git_commit=d3m_utils.current_git_commit(os.path.dirname(__file__)), ), }] + _weights_configs, ... ... After the primitive metadata definition, it is important to include code to return the path of files. An example is given as follows: .. code:: python def _find_weights_path(self, key_filename): if key_filename in self.volumes: weight_file_path = self.volumes[key_filename] else: weight_file_path = os.path.join('.', self._weights_configs['file_digest'], key_filename) if not os.path.isfile(weight_file_path): raise ValueError( "Can't get weights file from volumes by key '{key_filename}' and at path '{path}'.".format( key_filename=key_filename, path=weight_file_path, ), ) return weight_file_path In this example code, ``_find_weights_path`` method will try to find the static files from volumes based on weight file key. If it cannot be found (e.g., runtime was not provided with static files), then it looks into the current directory. The latter fallback is useful during development. To run a pipeline with such primitive, you have to download static files and provide them to the runtime: .. 
.. code:: shell

    docker run --rm -v /home/foo/d3m:/mnt/d3m -it \
        registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 \
        /bin/bash -c "cd /mnt/d3m; \
            pip3 install -e .; \
            cd pipelines; \
            mkdir /static; \
            python3 -m d3m index download -p d3m.primitives.path.of.Primitive -o /static; \
            python3 -m d3m runtime --volumes /static fit-produce \
                --pipeline feature_pipeline.json \
                --problem /datasets/seed_datasets_current/22_handgeometry/TRAIN/problem_TRAIN/problemDoc.json \
                --input /datasets/seed_datasets_current/22_handgeometry/TRAIN/dataset_TRAIN/datasetDoc.json \
                --test-input /datasets/seed_datasets_current/22_handgeometry/TEST/dataset_TEST/datasetDoc.json \
                --output 22_handgeometry_results.csv \
                --output-run feature_pipeline_run.yml; \
            exit"

The static files will be downloaded and stored locally based on the ``file_digest`` of ``_weights_configs``. In this way, the same files used by multiple primitives are not duplicated:

.. code:: shell

    mkdir /static
    python3 -m d3m index download -p d3m.primitives.path.of.Primitive -o /static

``-p``
    Optional argument to download static files for a particular primitive, matching on its Python path.
``-o``
    Optional argument to download the static files into a common folder. If not provided, they are downloaded into the current directory.

After the download, the file structure is as follows::

    /static/
        <file digest>/
            <file name>
        ...
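The ``file_digest`` values are SHA-256 hex digests of the static files (what ``sha256sum`` prints). A minimal stdlib sketch for computing one when filling in ``_weights_configs`` (the file written here is just a stand-in for a real weights file):

```python
import hashlib
import os
import tempfile


def compute_file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


# Stand-in for a real weights file.
path = os.path.join(tempfile.gettempdir(), 'example_weights.bin')
with open(path, 'wb') as f:
    f.write(b'example weights')

print(compute_file_digest(path))
```

Reading in chunks keeps memory use constant, which matters for large pre-trained weights.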