wangwei
/
tods

 
			
							Advanced Tutorial
=================

This tutorial assumes the reader is familiar with d3m ecosystem in general.
If not, please refer to other sections of `documentation`_ first, e.g.,
:ref:`quickstart`.

.. _documentation: https://docs.datadrivendiscovery.org

Overview of building a primitive
--------------------------------

1. :ref:`Recognize the base class of a primitive <primitive-class>`.

2. :ref:`Identify the input and output container types <input-output-types>`.

3. :ref:`Define metadata for each primitive <tutorial-primitive-metadata>`.

4. :ref:`Write a unit test to verify the primitive functions <unit-tests>`.

5. :ref:`Generate the primitive annotation for the primitive <primitive-annotation>`.

6. :ref:`Write pipeline for demonstrating primitive functionality <example-pipeline>`.

7. :ref:`Advanced: Primitive might use static files <static-files>`.

.. _primitive-class:

Primitive class
---------------

There are a variety of :py:mod:`primitive interfaces/classes <d3m.primitive_interfaces>` available. As an example,
a primitive doing just attribute extraction without requiring any fitting, a :py:class:`~d3m.primitive_interfaces.transformer.TransformerPrimitiveBase`
from :py:mod:`~d3m.primitive_interfaces.transformer` module can be used.

Each primitives can have it's own :py:mod:`hyper-parameters <d3m.metadata.hyperparams>`. Some example hyper-parameter types one can use to describe
primitive's hyper-parameters are: :py:class:`~d3m.metadata.hyperparams.Constant`, :py:class:`~d3m.metadata.hyperparams.UniformBool`,
:py:class:`~d3m.metadata.hyperparams.UniformInt`, :py:class:`~d3m.metadata.hyperparams.Choice`, :py:class:`~d3m.metadata.hyperparams.List`.

Also, each hyper-parameter should be defined as one or more of the four :ref:`hyper-parameter semantic types <hyperparameters>`:

* `https://metadata.datadrivendiscovery.org/types/TuningParameter <https://metadata.datadrivendiscovery.org/types/TuningParameter>`__
* `https://metadata.datadrivendiscovery.org/types/ControlParameter <https://metadata.datadrivendiscovery.org/types/ControlParameter>`__
* `https://metadata.datadrivendiscovery.org/types/ResourcesUseParameter <https://metadata.datadrivendiscovery.org/types/ResourcesUseParameter>`__
* `https://metadata.datadrivendiscovery.org/types/MetafeatureParameter <https://metadata.datadrivendiscovery.org/types/MetafeatureParameter>`__

Example
~~~~~~~

.. code:: python

    from d3m.primitive_interfaces import base, transformer
    from d3m.metadata import base as metadata_base, hyperparams

    __all__ = ('ExampleTransformPrimitive',)


    class Hyperparams(hyperparams.Hyperparams):
        learning_rate = hyperparams.Uniform(lower=0.0, upper=1.0, default=0.001, semantic_types=[
            'https://metadata.datadrivendiscovery.org/types/TuningParameter',
        ])
        clusters = hyperparams.UniformInt(lower=1, upper=100, default=10, semantic_types=[
            'https://metadata.datadrivendiscovery.org/types/TuningParameter',
        ])


    class ExampleTransformPrimitive(transformer.TransformerPrimitiveBase[Inputs, Outputs, Hyperparams]):
        """
        The docstring is very important and must to be included. It should contain
        relevant information about the hyper-parameters, primitive functionality, etc.
        """

        def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> base.CallResult[Outputs]:
            pass

.. _input-output-types:

Input/Output types
------------------

The acceptable inputs/outputs of a primitive must be pre-defined. D3M supports a variety of
standard input/output :ref:`container types <container_types>` such as:

- ``pandas.DataFrame`` (as :py:class:`d3m.container.pandas.DataFrame`)

- ``numpy.ndarray`` (as :py:class:`d3m.container.numpy.ndarray`)

- ``list`` (as :py:class:`d3m.container.list.List`)

.. note::
    Even thought D3M container types behave mostly as standard types, the D3M container types must be used for inputs/outputs, because D3M container types support D3M metadata.

Example
~~~~~~~

.. code:: python

    from d3m import container

    Inputs  = container.DataFrame
    Outputs = container.DataFrame


    class ExampleTransformPrimitive(transformer.TransformerPrimitiveBase[Inputs, Outputs, Hyperparams]):
        ...

.. note::
    When returning the output DataFrame, its metadata should be updated with the correct semantic and structural types.

Example
~~~~~~~

.. code:: python

    # Update metadata for each DataFrame column.
    for column_index in range(outputs.shape[1]):
        column_metadata = {}
        column_metadata['structural_type'] = type(1.0)
        column_metadata['name'] = "column {i}".format(i=column_index)
        column_metadata["semantic_types"] = ("http://schema.org/Float", "https://metadata.datadrivendiscovery.org/types/Attribute",)
        outputs.metadata = outputs.metadata.update((metadata_base.ALL_ELEMENTS, column_index), column_metadata)

.. _tutorial-primitive-metadata:

Primitive Metadata
------------------

It is very crucial to define :ref:`primitive metadata <primitive-metadata>` for the primitive properly.
Primitive metadata can be used by TA2 systems to metalearn about primitives and in general decide which primitive to use when.

Example
~~~~~~~

.. code:: python

    from d3m.primitive_interfaces import base, transformer
    from d3m.metadata import base as metadata_base, hyperparams

    __all__ = ('ExampleTransformPrimitive',)

    class ExampleTransformPrimitive(transformer.TransformerPrimitiveBase[Inputs, Outputs, Hyperparams]):
        """
        Docstring.
        """

        metadata = metadata_base.PrimitiveMetadata({
            'id': <Unique-ID, generated using UUID>,
            'version': <Primitive-development-version>,
            'name': <Primitive-Name>,
            'python_path': 'd3m.primitives.<>.<>.<>' # Must match path in setup.py,
            'source': {
                'name': <Project-maintainer-name>,
                'uris': [<GitHub-link-to-project>],
                'contact': 'mailto:<Author E-Mail>'
            },
            'installation': [{
                'type': metadata_base.PrimitiveInstallationType.PIP,
                'package_uri': 'git+<git-link-to-project>@{git_commit}#egg=<Package_name>'.format(
                    git_commit=d3m_utils.current_git_commit(os.path.dirname(__file__)),
                ),
            }],
            'algorithm_types': [
                # Check https://metadata.datadrivendiscovery.org/devel/?definitions#definitions.algorithm_types for all available algorithm types.
                # If algorithm type s not available a Merge Request should be made to add it to core package.
                metadata_base.PrimitiveAlgorithmType.<Choose-the-algorithm-type-that-best-describes-the-primitive>,
            ],
            # Check https://metadata.datadrivendiscovery.org/devel/?definitions#definitions.primitive_family for all available primitive family types.
            # If primitive family is not available a Merge Request should be made to add it to core package.
            'primitive_family': metadata_base.PrimitiveFamily.<Choose-the-primitive-family-that-closely-associates-to-the-primitive>
        })

        ...

.. _unit-tests:

Unit tests
----------

Once the primitives are constructed, unit testing must be done to see if the
primitive works as intended.

**Sample Setup**

.. code:: python

    import os
    import unittest

    from d3m.container import dataset
    from d3m.metadata import base as metadata_base
    from common_primitives import dataset_to_dataframe

    from example_primitive import ExampleTransformPrimitive


    class ExampleTransformTest(unittest.TestCase):
        def test_happy_path():
            # Load a dataset.
            # Datasets can be obtained from: https://datasets.datadrivendiscovery.org/d3m/datasets
            base_path = '../datasets/training_datasets/seed_datasets_archive/'
            dataset_doc_path = os.path.join(base_path, '38_sick_dataset', 'datasetDoc.json')
            dataset = dataset.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path))

            dataframe_hyperparams_class = dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.get_hyperparams()
            dataframe_primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=dataframe_hyperparams_class.defaults())
            dataframe = dataframe_primitive.produce(inputs=dataset).value

            # Call example transformer.
            hyperparams_class = SampleTransform.metadata.get_hyperparams()
            primitive  = SampleTransform(hyperparams=hyperparams_class.defaults())
            test_out   = primitive.produce(inputs=dataframe).value

            # Write assertions to make sure that the output (type, shape, metadata) is what is expected.
            self.assertEqual(...)

            ...


    if __name__ == '__main__':
        unittest.main()

It is recommended to do the testing inside the D3M Docker container:

.. code:: shell

    docker run --rm -v /home/foo/d3m:/mnt/d3m -it \
      registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9
    cd /mnt/d3m/example_primitive
    python3 primitive_name_test.py

.. _primitive-annotation:

Primitive annotation
--------------------

Once primitive is constructed and unit testing is successful, the
final step in building a primitive is to generate the primitive annotation
which will be indexed and used by D3M.

.. code:: shell

    docker run --rm -v /home/foo/d3m:/mnt/d3m -it \
      registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9
    cd /mnt/d3m/example_primitive
    pip3 install -e .
    python3 -m d3m index describe -i 4 <primitive_name>

Alternatively, a `helper script <https://gitlab.com/datadrivendiscovery/docs-quickstart/-/blob/master/quickstart_primitives/generate-primitive-json.py>`__
can be used to generate primitive annotations as well.
This can be more convenient when having to manage multiple primitives.
In this case, generating the primitive annotation is done as follows:

.. code:: shell

    docker run --rm -v /home/foo/d3m:/mnt/d3m -it \
      registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9
    cd /mnt/d3m/example_primitive
    pip3 install -e .
    python3 generate-primitive-json.py ...

.. _example-pipeline:

Example pipeline
----------------

After building custom primitives, it has to be used in an example pipeline and run using one of
D3M seed datasets in order to be integrated with other indexed D3M primitives.

The essential elements of pipelines are:

``Dataset Denormalizer -> Dataset Parser -> Data Cleaner (If necessary) -> Feature Extraction -> Classifier/Regressor -> Output``

An example code of building pipeline is shown below:

.. code:: python

    # D3M dependencies
    from d3m import index
    from d3m.metadata.base import ArgumentType
    from d3m.metadata.pipeline import Pipeline, PrimitiveStep

    # Common Primitives
    from common_primitives.column_parser import ColumnParserPrimitive
    from common_primitives.dataset_to_dataframe import DatasetToDataFramePrimitive
    from common_primitives.extract_columns_semantic_types import ExtractColumnsBySemanticTypesPrimitive

    # Testing primitive
    from quickstart_primitives.sample_primitive1.input_to_output import InputToOutputPrimitive

    # Pipeline
    pipeline = Pipeline()
    pipeline.add_input(name='inputs')

    # Step 0: DatasetToDataFrame (Dataset Denormalizer)
    step_0 = PrimitiveStep(primitive_description=DatasetToDataFramePrimitive.metadata.query())
    step_0.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference='inputs.0')
    step_0.add_output('produce')
    pipeline.add_step(step_0)

    # Step 1: Custom primitive
    step_1 = PrimitiveStep(primitive=InputToOutputPrimitive)
    step_1.add_argument(name='inputs',  argument_type=ArgumentType.CONTAINER, data_reference='steps.0.produce')
    step_1.add_output('produce')
    pipeline.add_step(step_1)

    # Step 2: Column Parser (Dataset Parser)
    step_2 = PrimitiveStep(primitive_description=ColumnParserPrimitive.metadata.query())
    step_2.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference='steps.1.produce')
    step_2.add_output('produce')
    pipeline.add_step(step_2)

    # Step 3: Extract Attributes (Feature Extraction)
    step_3 = PrimitiveStep(primitive_description=ExtractColumnsBySemanticTypesPrimitive.metadata.query())
    step_3.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference='steps.2.produce')
    step_3.add_output('produce')
    step_3.add_hyperparameter(name='semantic_types', argument_type=ArgumentType.VALUE, data=['https://metadata.datadrivendiscovery.org/types/Attribute'] )
    pipeline.add_step(step_3)

    # Step 4: Extract Targets (Feature Extraction)
    step_4 = PrimitiveStep(primitive_description=ExtractColumnsBySemanticTypesPrimitive.metadata.query())
    step_4.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference='steps.0.produce')
    step_4.add_output('produce')
    step_4.add_hyperparameter(name='semantic_types', argument_type=ArgumentType.VALUE, data=['https://metadata.datadrivendiscovery.org/types/TrueTarget'] )
    pipeline.add_step(step_4)

    attributes = 'steps.3.produce'
    targets    = 'steps.4.produce'

    # Step 6: Imputer (Data Cleaner)
    step_5 = PrimitiveStep(primitive=index.get_primitive('d3m.primitives.data_cleaning.imputer.SKlearn'))
    step_5.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference=attributes)
    step_5.add_output('produce')
    pipeline.add_step(step_5)

    # Step 7: Classifier
    step_6 = PrimitiveStep(primitive=index.get_primitive('d3m.primitives.classification.decision_tree.SKlearn'))
    step_6.add_argument(name='inputs',  argument_type=ArgumentType.CONTAINER,  data_reference='steps.5.produce')
    step_6.add_argument(name='outputs', argument_type=ArgumentType.CONTAINER, data_reference=targets)
    step_6.add_output('produce')
    pipeline.add_step(step_6)

    # Final Output
    pipeline.add_output(name='output predictions', data_reference='steps.6.produce')

    # print(pipeline.to_json())
    with open('./pipeline.json', 'w') as write_file:
        write_file.write(pipeline.to_json(indent=4, sort_keys=False, ensure_ascii=False))

Once pipeline is constructed and the pipeline's JSON file is generated, the pipeline is run using
``python3 -m d3m runtime`` command.
Successfully running the pipeline validates that the primitive is working as intended.

.. code:: shell

    docker run --rm -v /home/foo/d3m:/mnt/d3m -it \
      registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 \
      /bin/bash -c "cd /mnt/d3m; \
        pip3 install -e .; \
        cd pipelines; \
        python3 -m d3m runtime fit-produce \
                --pipeline pipeline.json \
                --problem /datasets/seed_datasets_current/38_sick/TRAIN/problem_TRAIN/problemDoc.json \
                --input /datasets/seed_datasets_current/38_sick/TRAIN/dataset_TRAIN/datasetDoc.json \
                --test-input /datasets/seed_datasets_current/38_sick/TEST/dataset_TEST/datasetDoc.json \
                --output 38_sick_results.csv \
                --output-run pipeline_run.yml; \
        exit"

.. _static-files:

Advanced: Primitive with static files
-------------------------------------

When building primitives that uses external/static files i.e. pre-trained weights, the
metadata for the primitive must be properly define such dependency.
The static file can be hosted anywhere based on your preference, as long as the URL to the file is a direct download link. It must
be public so that users of your primitive can access the file. Be sure to keep the URL available, as
the older version of the primitive could potentially start failing if URL stops resolving.

.. note::
    Full code of this section can be found in the `quickstart repository <https://gitlab.com/datadrivendiscovery/docs-quickstart>`__.

Below is a description of primitive metadata definition required, named ``_weights_configs`` for
each static file.

.. code:: python

    _weights_configs = [{
        'type': 'FILE',
        'key': '<Weight File Name>',
        'file_uri': '<URL to directly Download the Weight File>',
        'file_digest':'sha256sum of the <Weight File>',
    }]


This ``_weights_configs`` should be directly added to the ``INSTALLATION`` field of the primitive metadata.

.. code:: python

    from d3m.primitive_interfaces import base, transformer
    from d3m.metadata import base as metadata_base, hyperparams

    __all__ = ('ExampleTransform',)

    class ExampleTransform(transformer.TransformerPrimitiveBase[Inputs, Outputs, Hyperparams]):
        """
        Docstring.
        """

        _weights_configs = [{
            'type': 'FILE',
            'key': '<Weight File Name>',
            'file_uri': '<URL to directly Download the Weight File>',
            'file_digest':'sha256sum of the <Weight File>',
        }]

        metadata = ...
            'installation': [{
                'type': metadata_base.PrimitiveInstallationType.PIP,
                'package_uri': 'git+<git-link-to-project>@{git_commit}#egg=<Package_name>'.format(
                    git_commit=d3m_utils.current_git_commit(os.path.dirname(__file__)),
                ),
            }] + _weights_configs,
            ...

        ...

After the primitive metadata definition, it is important to include code to return the path of files.
An example is given as follows:

.. code:: python

    def _find_weights_path(self, key_filename):
        if key_filename in self.volumes:
            weight_file_path = self.volumes[key_filename]
        else:
            weight_file_path = os.path.join('.', self._weights_configs['file_digest'], key_filename)

        if not os.path.isfile(weight_file_path):
            raise ValueError(
                "Can't get weights file from volumes by key '{key_filename}' and at path '{path}'.".format(
                    key_filename=key_filename,
                    path=weight_file_path,
                ),
            )

        return weight_file_path

In this example code,  ``_find_weights_path`` method will try to find the static files from volumes based on weight file key.
If it cannot be found (e.g., runtime was not provided with static files), then it looks into the current directory.
The latter fallback is useful during development.

To run a pipeline with such primitive, you have to download static files and provide them to the runtime:

.. code:: shell

    docker run --rm -v /home/foo/d3m:/mnt/d3m -it \
      registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 \
      /bin/bash -c "cd /mnt/d3m; \
        pip3 install -e .; \
        cd pipelines; \
        mkdir /static
        python3 -m d3m index download -p d3m.primitives.path.of.Primitive -o /static; \
        python3 -m d3m runtime --volumes /static fit-produce \
                --pipeline feature_pipeline.json \
                --problem /datasets/seed_datasets_current/22_handgeometry/TRAIN/problem_TRAIN/problemDoc.json \
                --input /datasets/seed_datasets_current/22_handgeometry/TRAIN/dataset_TRAIN/datasetDoc.json \
                --test-input /datasets/seed_datasets_current/22_handgeometry/TEST/dataset_TEST/datasetDoc.json \
                --output 22_handgeometry_results.csv \
                --output-run feature_pipeline_run.yml; \
        exit"

The static files will be downloaded and stored locally based on ``file_digest`` of ``_weights_configs``.
In this way we don't duplicate same files used by multiple primitives:

.. code:: shell

    mkdir /static
    python3 -m d3m index download -p d3m.primitives.path.of.Primitive -o /static

``-p`` optional argument to download static files for a particular primitive, matching on its Python path.
``-o`` optional argument to download the static files into a common folder. If not provided, they are
downloaded into the current directory.

After the download, the file structure is given as follows::

    /static/
      <file_digest>/
        <file>
      <file_digest>/
        <file>
      ...
      ...