- Pipeline
- ========
-
- A pipeline is described as a DAG of interconnected steps, where steps
- can be primitives or other (nested) pipelines. A pipeline has data-flow
- semantics, which means that steps are not necessarily executed in the
- order they are listed; instead, a step can be executed as soon as all of
- its inputs are available, and some steps can even be executed in
- parallel. On the other hand, each step can use only outputs of steps
- listed before it. In JSON, the following is a sketch of a pipeline's
- representation:
-
- .. code:: yaml
-
- {
- "id": <UUID of the pipeline>,
- "schema": <a URI representing a schema and version to which pipeline description conforms>,
- "source": {
- "name": <string representing name of the author, team>,
- "contact": <contact information of author of the pipeline>,
- "from": <if pipeline was derived from another pipeline, or pipelines, which>
- ... # Any extra metadata author might want to add into the pipeline, like version,
- # name, and config parameters of the system which produced this pipeline.
- },
- "created": <timestamp when created>,
- "name": <human friendly name of the pipeline, if it exists>,
- "description": <human friendly description of the pipeline, if it exists>,
- "users": [
- {
- "id": <UUID for the user, if user is associated with the creation of the pipeline>,
- "reason": <textual description of what user did to create the pipeline>,
- "rationale": <textual description by the user of what the user did>
- }
- ],
- "inputs": [
- {
- "name": <human friendly name of the inputs>
- }
- ],
- "outputs": [
- {
- "name": <human friendly name of the outputs>,
- "data": <data reference, probably of an output of a step>
- }
- ],
- "steps": [
- {
- "type": "PRIMITIVE",
- "primitive": {
- "id": <ID of the primitive used in this step>,
- "version": <version of the primitive used in this step>,
- "python_path": <Python path of this primitive>,
- "name": <human friendly name of this primitive>,
- "digest": <digest of this primitive>
- },
- # Constructor arguments should not be listed here, because they can be automatically created from other
- # information. All these arguments are listed as kind "PIPELINE" in the primitive's metadata.
- "arguments": {
- # A standard inputs argument, used by both "set_training_data" and the default "produce" method.
- "inputs": {
- "type": "CONTAINER",
- "data": <data reference, probably of an output of a step or pipeline input>
- },
- # A standard outputs argument, used for "set_training_data".
- "outputs": {
- "type": "CONTAINER",
- "data": <data reference, probably of an output of a step or pipeline input>
- },
- # An extra argument which takes as inputs outputs from another primitive in this pipeline.
- "extra_data": {
- "type": "CONTAINER",
- "data": <data reference, probably of an output of a step or pipeline input>
- },
- # An extra argument which takes as input a singleton output from another step in this pipeline.
- "offset": {
- "type": "DATA",
- "data": <data reference, probably of an output of a step or pipeline input>
- }
- },
- "outputs": [
- {
- # Data is made available by this step from default "produce" method.
- "id": "produce"
- },
- {
- # Data is made available by this step from an extra "produce" method, too.
- "id": "produce_score"
- }
- ],
- # Some hyper-parameters are not really tunable and should be fixed as part of the pipeline definition. This
- # can be done here. Hyper-parameters listed here cannot be tuned or overridden during a run. The author of
- # a pipeline decides which hyper-parameters are which, probably based on their semantic type.
- # This is a map of hyper-parameter names to their values, using a format similar to arguments, but
- # also allowing the PRIMITIVE and VALUE types.
- "hyperparams": {
- "loss": {
- "type": "PRIMITIVE",
- "data": <0-based index from steps identifying a primitive to pass in>
- },
- "column_to_operate_on": {
- "type": "VALUE",
- # Value is converted to a JSON-compatible value by hyper-parameter class.
- # It also knows how to convert it back.
- "data": 5
- },
- # A special case where a hyper-parameter can also be a list of primitives,
- # which are then passed to the "Set" hyper-parameter class.
- "ensemble": {
- "type": "PRIMITIVE",
- "data": [
- <0-based index from steps identifying a primitive to pass in>,
- <0-based index from steps identifying a primitive to pass in>
- ]
- }
- },
- "users": [
- {
- "id": <UUID for the user, if user is associated with selection of this primitive/arguments/hyper-parameters>,
- "reason": <textual description of what user did to select this primitive>,
- "rationale": <textual description by the user of what the user did>
- }
- ]
- },
- {
- "type": "SUBPIPELINE",
- "pipeline": {
- "id": <UUID of a pipeline to run as this step>
- },
- # For example: [{"data": "steps.0.produce"}] would map the data reference "steps.0.produce" of
- # the outer pipeline to the first input of a sub-pipeline.
- "inputs": [
- {
- "data": <data reference, probably of an output of a step or pipeline input, mapped to sub-pipeline's inputs in order>
- }
- ],
- # For example: [{"id": "predictions"}] would map the first output of a sub-pipeline to a data
- # reference "steps.X.predictions" where "X" is the step number of a given sub-pipeline step.
- "outputs": [
- {
- "id": <ID to be used in data reference, mapping sub-pipeline's outputs in order>
- }
- ]
- },
- {
- # Used to represent a pipeline template which can be used to generate full pipelines. Not to be used in
- # the metalearning context. Additional properties to further specify the placeholder constraints are allowed.
- "type": "PLACEHOLDER",
- # A list of inputs which can be used as inputs to resulting sub-pipeline.
- # Resulting sub-pipeline does not have to use all the inputs, but it cannot use any other inputs.
- "inputs": [
- {
- "data": <data reference, probably of an output of a step or pipeline input>
- }
- ],
- # A list of outputs of the resulting sub-pipeline.
- # Their (allowed) number and meaning are defined elsewhere.
- "outputs": [
- {
- "id": <ID to be used in data reference, mapping resulting sub-pipeline's outputs in order>
- }
- ]
- }
- ]
- }
-
- ``id`` uniquely identifies this particular database document.
-
- A pipeline describes how inputs are computed into outputs. In most cases
- inputs are :class:`~d3m.container.dataset.Dataset` container values and
- outputs are predictions as Pandas :class:`~d3m.container.pandas.DataFrame` container
- values in `Lincoln Labs predictions
- format <https://gitlab.com/datadrivendiscovery/data-supply/blob/shared/documentation/problemSchema.md#predictions-file>`__,
- and, during training, potentially also internal losses/scores. The same
- pipeline is used for both training and predicting.
-
- A pipeline description contains many *data references*. A data reference
- is just a string which identifies an output of a step or a pipeline
- input, and forms a data-flow connection between available data and an
- input to a step. It is recommended to be a string of one of the
- following forms (see the sketch after this list):
-
- - ``steps.<number>.<id>`` — ``number`` identifies the step in the list
- of steps (0-based) and ``id`` identifies the name of a produce method
- of the primitive, or the output of a pipeline step
- - ``inputs.<number>`` — ``number`` identifies the pipeline input
- (0-based)
- - ``outputs.<number>`` — ``number`` identifies the pipeline output
- (0-based)
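-
- For illustration, here is a minimal sketch of how such data references
- wire a pipeline together using the Python API shown later in this
- section (the primitive used is the common ``dataset_to_dataframe``
- primitive from the example below):
-
- .. code:: python
-
- from d3m import index
- from d3m.metadata.base import ArgumentType
- from d3m.metadata.pipeline import Pipeline, PrimitiveStep
-
- pipeline = Pipeline()
- pipeline.add_input(name='inputs')  # Referenced as "inputs.0".
-
- step = PrimitiveStep(primitive=index.get_primitive('d3m.primitives.data_transformation.dataset_to_dataframe.Common'))
- # "inputs.0" connects the first pipeline input to this step's "inputs" argument.
- step.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference='inputs.0')
- step.add_output('produce')  # Referenced as "steps.0.produce".
- pipeline.add_step(step)
-
- # "steps.0.produce" connects step 0's "produce" output to the first pipeline output, "outputs.0".
- pipeline.add_output(name='output', data_reference='steps.0.produce')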
-
- Inputs in the context of metalearning are expected to be datasets, and
- the order of inputs matches the order of datasets in a pipeline run. (In
- other contexts, like the TA2-TA3 API, inputs might be something else; for
- example, a pipeline can consist of just one primitive a TA3 wants to run
- on a particular input.)
-
- Remember that each primitive has a set of arguments it takes as a whole,
- combining all the arguments from all its methods. Each argument
- (identified by its name) can have only one value associated with it and
- any method accepting that argument receives that value. Once all values
- for all arguments for a method are available, that method can be called.
-
- Remember as well that each primitive can have multiple "produce"
- methods. These methods can be called after a primitive has been fitted.
- In this way a primitive can have multiple outputs, one for each
- "produce" method.
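-
- To see which "produce" methods (and thus which possible step outputs) a
- primitive offers, its automatically generated metadata can be inspected.
- A minimal sketch, assuming the SKlearn imputer used in the example below
- is installed:
-
- .. code:: python
-
- from d3m import index
-
- # By convention, the name of every produce method starts with "produce",
- # and each one can be exposed as a separate step output.
- primitive = index.get_primitive('d3m.primitives.data_cleaning.imputer.SKlearn')
- metadata = primitive.metadata.query()
- produce_methods = [name for name in metadata['primitive_code']['instance_methods'] if name.startswith('produce')]
- print(produce_methods)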
-
- Placeholders can be used to define pipeline templates to be used outside
- of the metalearning context. A placeholder is replaced with a pipeline
- step to form a full pipeline. Restrictions may apply to the number of
- placeholders, their position, their allowed inputs and outputs, etc.
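-
- For example, here is a minimal sketch of building such a template with
- the ``PlaceholderStep`` class from :mod:`d3m.metadata.pipeline`, assuming
- the placeholder should consume the output of the first step and provide
- the data for the pipeline output:
-
- .. code:: python
-
- from d3m import index
- from d3m.metadata.base import ArgumentType
- from d3m.metadata.pipeline import Pipeline, PlaceholderStep, PrimitiveStep
-
- template = Pipeline()
- template.add_input(name='inputs')
-
- # A fixed first step which converts the input dataset to a DataFrame.
- step_0 = PrimitiveStep(primitive=index.get_primitive('d3m.primitives.data_transformation.dataset_to_dataframe.Common'))
- step_0.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference='inputs.0')
- step_0.add_output('produce')
- template.add_step(step_0)
-
- # A placeholder to be replaced later with a primitive step or a sub-pipeline.
- placeholder = PlaceholderStep()
- placeholder.add_input('steps.0.produce')  # An allowed input of the resulting sub-pipeline.
- placeholder.add_output('produce')  # Exposed as "steps.1.produce".
- template.add_step(placeholder)
-
- template.add_output(name='output', data_reference='steps.1.produce')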
-
- .. _pipeline-description-example:
-
- Pipeline description example
- ----------------------------
-
- The following example uses the core package and the `common primitives
- repo <https://gitlab.com/datadrivendiscovery/common-primitives>`__ and
- shows the basics of building a pipeline in memory. This specific example
- creates a pipeline for a tabular prediction task using a random forest.
-
- .. code:: python
-
- from d3m import index
- from d3m.metadata.base import ArgumentType
- from d3m.metadata.pipeline import Pipeline, PrimitiveStep
-
- # -> dataset_to_dataframe -> column_parser -> extract_columns_by_semantic_types(attributes) -> imputer -> random_forest
- # extract_columns_by_semantic_types(targets) -> ^
-
- # Creating pipeline
- pipeline_description = Pipeline()
- pipeline_description.add_input(name='inputs')
-
- # Step 0: dataset_to_dataframe
- step_0 = PrimitiveStep(primitive=index.get_primitive('d3m.primitives.data_transformation.dataset_to_dataframe.Common'))
- step_0.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference='inputs.0')
- step_0.add_output('produce')
- pipeline_description.add_step(step_0)
-
- # Step 1: column_parser
- step_1 = PrimitiveStep(primitive=index.get_primitive('d3m.primitives.data_transformation.column_parser.Common'))
- step_1.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference='steps.0.produce')
- step_1.add_output('produce')
- pipeline_description.add_step(step_1)
-
- # Step 2: extract_columns_by_semantic_types(attributes)
- step_2 = PrimitiveStep(primitive=index.get_primitive('d3m.primitives.data_transformation.extract_columns_by_semantic_types.Common'))
- step_2.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference='steps.1.produce')
- step_2.add_output('produce')
- step_2.add_hyperparameter(name='semantic_types', argument_type=ArgumentType.VALUE,
- data=['https://metadata.datadrivendiscovery.org/types/Attribute'])
- pipeline_description.add_step(step_2)
-
- # Step 3: extract_columns_by_semantic_types(targets)
- step_3 = PrimitiveStep(primitive=index.get_primitive('d3m.primitives.data_transformation.extract_columns_by_semantic_types.Common'))
- step_3.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference='steps.0.produce')
- step_3.add_output('produce')
- step_3.add_hyperparameter(name='semantic_types', argument_type=ArgumentType.VALUE,
- data=['https://metadata.datadrivendiscovery.org/types/TrueTarget'])
- pipeline_description.add_step(step_3)
-
- attributes = 'steps.2.produce'
- targets = 'steps.3.produce'
-
- # Step 4: imputer
- step_4 = PrimitiveStep(primitive=index.get_primitive('d3m.primitives.data_cleaning.imputer.SKlearn'))
- step_4.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference=attributes)
- step_4.add_output('produce')
- pipeline_description.add_step(step_4)
-
- # Step 5: random_forest
- step_5 = PrimitiveStep(primitive=index.get_primitive('d3m.primitives.regression.random_forest.SKlearn'))
- step_5.add_argument(name='inputs', argument_type=ArgumentType.CONTAINER, data_reference='steps.4.produce')
- step_5.add_argument(name='outputs', argument_type=ArgumentType.CONTAINER, data_reference=targets)
- step_5.add_output('produce')
- pipeline_description.add_step(step_5)
-
- # Final Output
- pipeline_description.add_output(name='output predictions', data_reference='steps.5.produce')
-
- # Output to YAML
- print(pipeline_description.to_yaml())
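-
- The resulting description can also be serialized to JSON and loaded back
- again; a short sketch (the file name is arbitrary):
-
- .. code:: python
-
- from d3m.metadata.pipeline import Pipeline
-
- # Save the pipeline description as JSON.
- with open('pipeline.json', 'w') as file:
-     file.write(pipeline_description.to_json())
-
- # Load it back into a Pipeline object.
- with open('pipeline.json', 'r') as file:
-     loaded_pipeline = Pipeline.from_json(string_or_file=file)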
-
- Pipeline Run
- ------------
-
- The :mod:`d3m.metadata.pipeline_run` module contains the classes that represent the Pipeline Run. The Pipeline Run
- was introduced to ensure that a pipeline execution can be captured and reproduced. To accomplish this, the problem
- description, hyper-parameter settings, and any other inputs to the pipeline execution phases are captured by the
- Pipeline Run.
-
- The Pipeline Run is generated during pipeline execution:
-
- ::
-
- $ python3 -m d3m runtime fit-produce -p pipeline.json -r problem/problemDoc.json -i dataset_TRAIN/datasetDoc.json \
- -t dataset_TEST/datasetDoc.json -o results.csv -O pipeline_run.yml
-
- In YAML, the following is a sketch of the Pipeline Run representation, one document per phase, for the above fit-produce call:
-
- .. code:: yaml
-
- context: <The run context for this phase (TESTING for example)>
- datasets:
- <ID and digest of the dataset>
- end: <timestamp of when this phase ended>
- environment:
- <Details about the machine the phase was performed on>
- id: e3187585-cf8b-5e31-9435-69907912c3ca <id of this pipeline run instance>
- pipeline:
- <ID and Digest of the pipeline>
- problem:
- <ID and digest of the problem>
- random_seed: <Random seed value, 0 if none provided>
- run:
- is_standard_pipeline: true
- phase: FIT
- results:
- <Results of the fit phase>
- schema: https://metadata.datadrivendiscovery.org/schemas/v0/pipeline_run.json
- start: <timestamp of when this phase started>
- status:
- state: <Whether this stage completed successfully or not>
- steps:
- <Details of each step (primitive) in this stage of running the pipeline, parameters start/end times plus success/failure>
- --- <This indicates a divider between phases like fit and produce in this example>
- context: <The run context for this phase (TESTING for example)>
- datasets:
- <ID and digest of the dataset>
- end: <timestamp of when this phase ended>
- environment:
- <Details about the machine the phase was performed on>
- id: b2e9b591-c332-5bc5-815e-d1ec73ecdb06 <id of this pipeline run instance>
- pipeline:
- <ID and Digest of the pipeline>
- previous_pipeline_run:
- id: e3187585-cf8b-5e31-9435-69907912c3ca <id of the previous pipeline run instance (see above)>
- problem:
- <ID and digest of the problem>
- random_seed: <Random seed value, 0 if none provided>
- run:
- is_standard_pipeline: true
- phase: PRODUCE
- results:
- <Results of the produce phase>
- scoring:
- datasets:
- <ID and Digest of the scoring dataset>
- end: <timestamp of when the scoring phase ended>
- pipeline:
- <ID and Digest of the pipeline>
- random_seed: <Random seed value, 0 if none provided>
- start: <timestamp of when the scoring phase started>
- status:
- state: <Whether this scoring stage completed successfully or not>
- steps:
- <Details of each step (primitive) in this stage of running the pipeline, parameters start/end times plus success/failure>
- schema: https://metadata.datadrivendiscovery.org/schemas/v0/pipeline_run.json
- start: <timestamp of when this phase started>
- status:
- state: <Whether this stage completed successfully or not>
- steps:
- <Details of each step (primitive) in this stage of running the pipeline, parameters start/end times plus success/failure>
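-
- Since the file is plain YAML, it can be inspected with standard tools.
- A minimal sketch using PyYAML, assuming the ``pipeline_run.yml`` file
- produced by the call above, which prints the ID, phase, and status of
- each run document:
-
- .. code:: python
-
- import yaml
-
- # The fit-produce call above writes two YAML documents (FIT and PRODUCE),
- # separated by "---", into a single file.
- with open('pipeline_run.yml') as run_file:
-     for run in yaml.safe_load_all(run_file):
-         print(run['id'], run['run']['phase'], run['status']['state'])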
-
- The d3m module also provides a command that supports Pipeline Run actions:
-
- ::
-
- $ python3 -m d3m pipeline-run --help
-
- Currently there is only one subcommand available, which validates a Pipeline Run:
-
- ::
-
- $ python3 -m d3m pipeline-run validate pipeline_run.yml
-
- The Reference Runtime offers a way to pass an existing Pipeline Run file to a runtime command to allow it to be rerun.
- Here is an example of this for the fit-produce call:
-
- ::
-
- $ python3 -m d3m runtime fit-produce -u pipeline_run.yml
-
- Here is the relevant option from the help output:
-
- ::
-
- -u INPUT_RUN, --input-run INPUT_RUN
- path to a pipeline run file with configuration, use
- "-" for stdin
-
-
- Reference runtime
- -----------------
-
- :mod:`d3m.runtime` module contains a reference runtime for pipelines. This
- module also has an extensive command line interface you can access
- through ``python3 -m d3m runtime``.
-
- Example of fitting and producing a pipeline with Runtime:
-
- .. code:: python
-
- import os
-
- from d3m.metadata import base as metadata_base, hyperparams as hyperparams_module, pipeline as pipeline_module, problem
- from d3m.container.dataset import Dataset
- from d3m.runtime import Runtime
-
- # Loading problem description.
- problem_description = problem.parse_problem_description('problemDoc.json')
-
- # Loading dataset.
- path = 'file://{uri}'.format(uri=os.path.abspath('datasetDoc.json'))
- dataset = Dataset.load(dataset_uri=path)
-
- # Loading pipeline description file.
- with open('pipeline_description.json', 'r') as file:
- pipeline_description = pipeline_module.Pipeline.from_json(string_or_file=file)
-
- # Creating an instance of the Runtime with the pipeline description and problem description.
- runtime = Runtime(pipeline=pipeline_description, problem_description=problem_description, context=metadata_base.Context.TESTING)
-
- # Fitting pipeline on input dataset.
- fit_results = runtime.fit(inputs=[dataset])
- fit_results.check_success()
-
- # Producing results using the fitted pipeline.
- produce_results = runtime.produce(inputs=[dataset])
- produce_results.check_success()
-
- print(produce_results.values)
-
- The Runtime also provides a very useful command-line interface for
- running pipelines from the terminal. Here is a basic example of how to
- fit and produce a pipeline, equivalent to the previous example:
-
- ::
-
- $ python3 -m d3m runtime fit-produce -p pipeline.json -r problem/problemDoc.json -i dataset_TRAIN/datasetDoc.json -t dataset_TEST/datasetDoc.json -o results.csv -O pipeline_run.yml
-
- For more information about the usage:
-
- ::
-
- $ python3 -m d3m runtime --help