wangwei
/
tods

 
			
			   
				 
					
						
						
							
							TA1 API for primitives
====================================

A collection of standard Python interfaces for TA1 primitives. All
primitives should extend one of the base classes available and
optionally implement available mixins.

Design principles
-----------------

Standard TA1 primitive interfaces have been designed to be possible for
TA2 systems to call primitives automatically and combine them into
pipelines.

Some design principles applied:

-  Use of a de facto standard language for "glue" between different
   components and libraries, Python.
-  Use of keyword-only arguments for all methods so that caller does not
   have to worry about the order of arguments.
-  Every primitive should implement only one functionality, more or less
   a function, with clear inputs and outputs. All parameters of the
   function do not have to be known in advance and function can be
   "fitted" as part of the training step of the pipeline.
-  Use of Python 3 typing extensions to annotate methods and classes
   with typing information to make it easier for TA2 systems to prune
   incompatible combinations of inputs and outputs and to reuse existing
   Python type-checking tooling.
-  Typing information can serve both detecting issues and
   incompatibilities in primitive implementations and help with pipeline
   construction.
-  All values being passed through a primitive have metadata associated
   with them.
-  Primitives can operate only at a metadata level to help guide the
   pipeline construction process without having to operate on data
   itself.
-  Primitive metadata is close to the source, primitive code, and not in
   separate files to minimize chances that it is goes out of sync.
   Metadata which can be automatically determined from the code should
   be automatically determined from the code. Similarly for data
   metadata.
-  All randomness of primitives is captured by a random seed argument to
   assure reproducibility.
-  Operations can work in iterations, under time budgets, and caller
   might not always want to compute values fully.
-  Through use of mixins primitives can signal which capabilities they
   support.
-  Primitives are to be composed and executed in a data-flow manner.

Main concepts
-------------

Interface classes, mixins, and methods are documented in detail through
use of docstrings and typing annotations. Here we note some higher-level
concept which can help understand basic ideas behind interfaces and what
they are trying to achieve, the big picture. This section is not
normative.

A primitive should extend one of the base classes available and
optionally mixins as well. Not all mixins apply to all primitives. That
being said, you probably do not want to subclass ``PrimitiveBase``
directly, but instead one of other base classes to signal to a caller
more about what your primitive is doing. If your primitive belong to a
larger set of primitives no exiting non-\ ``PrimitiveBase`` base class
suits well, consider suggesting that a new base class is created by
opening an issue or making a merge request.

Base class and mixins have generally four type arguments you have to
provide: ``Inputs``, ``Outpus``, ``Params``, and ``Hyperparams``. One
can see a primitive as parameterized by those four type arguments. You
can access them at runtime through metadata:

.. code:: python

    FooBarPrimitive.metadata.query()['class_type_arguments']

``Inputs`` should be set to a primary input type of a primitive.
Primary, because you can define additional inputs your primitive might
need, but we will go into these details later. Similarly for
``Outputs``. ``produce`` method then produces outputs from inputs. Other
primitive methods help the primitive (and its ``produce`` method)
achieve that, or help the runtime execute the primitive as a whole, or
optimize its behavior.

Both ``Inputs`` and ``Outputs`` should be of a
:ref:`container_types`. We allow a limited set of value types being
passed between primitives so that both TA2 and TA3 systems can
implement introspection for those values if needed, or user interface
for them, etc. Moreover this allows us also to assure that they can be
efficiently used with Arrow/Plasma store.

Container values can then in turn contain values of an :ref:`extended but
still limited set of data types <data_types>`.

Those values being passed between primitives also hold metadata.
Metadata is available on their ``metadata`` attribute. Metadata on
values is stored in an instance of
:class:`~d3m.metadata.base.DataMetadata` class. This is a
reason why we have :ref:`our own versions of some standard container
types <container_types>`: to have the ``metadata`` attribute.

All metadata is immutable and updating a metadata object returns a new,
updated, copy. Metadata internally remembers the history of changes, but
there is no API yet to access that. But the idea is that you will be
able to follow the whole history of change to data in a pipeline through
metadata. See :ref:`metadata API <metadata_api>` for more information
how to manipulate metadata.

Primitives have a similar class ``PrimitiveMetadata``, which when
created automatically analyses its primitive and populates parts of
metadata based on that. In this way author does not have to have
information in two places (metadata and code) but just in code and
metadata is extracted from it. When possible. Some metadata author of
the primitive stil has to provide directly.

Currently most standard interface base classes have only one ``produce``
method, but design allows for multiple: their name has to be prefixed
with ``produce_``, have similar arguments and same semantics as all
produce methods. The main motivation for this is that some primitives
might be able to expose same results in different ways. Having multiple
produce methods allow the caller to pick which type of the result they
want.

To keep primitive from outside simple and allow easier compositionality
in pipelines, primitives have arguments defined per primitive and not
per their method. The idea here is that once a caller satisfies
(computes a value to be passed to) an argument, any method which
requires that argument can be called on a primitive.

There are three types of arguments:

-  pipeline – arguments which are provided by the pipeline, they are
   required (otherwise caller would be able to trivially satisfy them by
   always passing ``None`` or another default value)
-  runtime – arguments which caller provides during pipeline execution
   and they control various aspects of the execution
-  hyper-parameter – a method can declare that primitive's
   hyper-parameter can be overridden for the call of the method, they
   have to match hyper-parameter definition

Methods can accept additional pipeline and hyper-parameter arguments and
not just those from the standard interfaces.

Produce methods and some other methods return results wrapped in
``CallResult``. In this way primitives can expose information about
internal iterative or optimization process and allow caller to decide
how long to run.

When calling a primitive, to access ``Hyperparams`` class you can do:

.. code:: python

    hyperparams_class = FooBarPrimitive.metadata.query()['class_type_arguments']['Hyperparams']

You can now create an instance of the class by directly providing values
for hyper-parameters, use available simple sampling, or just use default
values:

.. code:: python

    hp1 = hyperparams_class({'threshold': 0.01})
    hp2 = hyperparams_class.sample(random_state=42)
    hp3 = hyperparams_class.defaults

You can then pass those instances as the ``hyperparams`` argument to
primitive's constructor.

Author of a primitive has to define what internal parameters does the
primitive have, if any, by extending the ``Params`` class. It is just a
fancy dict, so you can both create an instance of it in the same way,
and access its values:

.. code:: python

    class Params(params.Params):
        coefficients: numpy.ndarray

    ps = Params({'coefficients': numpy.array[1, 2, 3]})
    ps['coefficients']

``Hyperparams`` class and ``Params`` class have to be pickable and
copyable so that instances of primitives can be serialized and restored
as needed.

Primitives (and some other values) are uniquely identified by their ID
and version. ID does not change through versions.

Primitives should not modify in-place any input argument but always
first make a copy before any modification.

Checklist for creating a new primitive
--------------------------------------
1. Implement as many interfaces as are applicable to your
   primitive. An up-to-date list of mixins you can implement can be
   found at
   <https://gitlab.com/datadrivendiscovery/d3m/blob/devel/d3m/primitive_interfaces/base.py>

2. Create unit tests to test all methods you implement

3. Include all relevant hyperparameters and use appropriate
   ``Hyperparameter`` subclass for specifying the range of values a
   hyperparameter can take. Try to provide good default values where
   possible. Also include all relevant ``semantic_types``
   <https://metadata.datadrivendiscovery.org/types/>

4. Include ``metadata`` and ``__author__`` fields in your class
   definition. The ``__author__`` field should include a name or team
   as well as email. The ``metadata`` object has many fields which should
   be filled in:

   * id, this is a uuid unique to this primitive. It can be generated with :code:`import uuid; uuid.uuid4()`
   * version
   * python_path, the name you want to be import this primitive through
   * keywords, keywords you want your primitive to be discovered by
   * installation, how to install the package which has this primitive. This is easiest if this is just a python package on PyPI
   * algorithm_types, specify which PrimitiveAlgorithmType the algorithm is, a complete list can be found in TODO
   * primitive_family, specify the broad family a primitive falls under, a complete list can be found in TODO
   * hyperparameters_to_tune, specify which hyperparameters you would prefer a TA2 system tune

5. Make sure primitive uses the correct container type

6. If container type is a dataframe, specify which column is the
   target value, which columns are the input values, and which columns
   are the output values.

7. Create an example pipeline which includes this primitive and uses one of the seed datasets as input.

Examples
--------

Examples of simple primitives using these interfaces can be found `in
this
repository <https://gitlab.com/datadrivendiscovery/tests-data/tree/master/primitives>`__:

-  `MonomialPrimitive <https://gitlab.com/datadrivendiscovery/tests-data/blob/master/primitives/test_primitives/monomial.py>`__
   is a simple regressor which shows how to use ``container.List``,
   define and use ``Params`` and ``Hyperparams``, and implement multiple
   methods needed by a supervised learner primitive
-  `IncrementPrimitive <https://gitlab.com/datadrivendiscovery/tests-data/blob/master/primitives/test_primitives/increment.py>`__
   is a transformer and shows how to have ``container.ndarray`` as
   inputs and outputs, and how to set metadata for outputs
-  `SumPrimitive <https://gitlab.com/datadrivendiscovery/tests-data/blob/master/primitives/test_primitives/sum.py>`__
   is a transformer as well, but it is just a wrapper around a Docker
   image, it shows how to define Docker image in metadata and how to
   connect to a running Docker container, moreover, it also shows how
   inputs can be a union type of multiple other types
-  `RandomPrimitive <https://gitlab.com/datadrivendiscovery/tests-data/blob/master/primitives/test_primitives/random.py>`__
   is a generator which shows how to use ``random_seed``, too.