.. _primitive-good-citizen:

Primitive Good Citizen Checklist
================================

This is a list of dos, don'ts and things to consider when crafting a new primitive or updating an existing one. This
list is not exhaustive, so please add new items to the list as they are discovered! An example of a primitive that
endeavors to adhere to all of the following guidance can be found `here`_.

DO's

* Do complete the documentation on the primitive such as:

  * Primitive family, algorithm type.
  * Docstring of the primitive's Python class.

    * One line summary first:

      * Primitive name should be close to this.
      * Primitive path should be close to this as well.

    * Longer documentation/description after, all in the main docstring of the class.

  * Provide pipeline examples together with the primitive annotation.
  * Docstrings in `numpy style`_.
  * Please use `reStructuredText`_ instead of markdown or other formats.
  * Maintain a change-log of alterations to the primitive (somewhere in the primitive's repo, consider using a `standard format`_).
  * Add point of contact information and the git repository link to the primitive's metadata
    (``source.name``, ``source.contact`` and ``source.uris`` metadata fields).
  * Add your primitive name to the `list of primitive names`_ if it does not already
    exist. Chances are that your generic primitive name is in that list and you should use that name for your primitive.

* Do annotate your Primitive with Python types.

* Do make sure the output from your produce method is a d3m container type.


* If your primitive is operating on columns and rows:

  * Do include ``d3mIndex`` column in produced output if input has ``d3mIndex`` column.

    * You can make this behavior controlled by the ``add_index_columns`` hyper-parameter.

  * If a primitive has a hyper-parameter to directly set which columns to operate on, do use column
    indices and not column names to identify those columns.

    * Consider using a pair of hyper-parameters: ``use_columns`` and ``exclude_columns`` with standard logic.

  * When using semantic types to decide which columns to operate on, do use the
    ``https://metadata.datadrivendiscovery.org/types/TrueTarget``
    and ``https://metadata.datadrivendiscovery.org/types/RedactedPrivilegedData`` semantic types and not
    ``https://metadata.datadrivendiscovery.org/types/SuggestedTarget`` and
    ``https://metadata.datadrivendiscovery.org/types/SuggestedPrivilegedData``.
    The latter are semantic types which come from the dataset, the former are those which come from the problem description.
    Although the two generally match at present, primitives should respect only those coming from
    the problem description. The dataset carries the suggested types so that one can create problem descriptions
    on the fly, if needed.

* Be mindful that data being passed through a pipeline also has metadata:

  * If your primitive generates new data (e.g., new columns), add metadata suitable for those columns:

    * Name the column appropriately for human consumption by setting column's ``name`` metadata.

    * Set semantic types appropriately.

    * If your primitive is producing target predictions, add ``https://metadata.datadrivendiscovery.org/types/PredictedTarget``
      to a column containing those predictions.

    * Remember metadata encountered on target columns during fitting, and reuse that metadata as much
      as reasonable when producing target predictions.

  * If your primitive is transforming existing data (e.g., transforming columns), reuse as much metadata from
    original data as reasonable, but do update metadata based on new data.

    * If structural type of the column changes, make sure you note this change in metadata as well.

  * Also support non-standard metadata and try to pass it through as-is if possible.

* Do write unit tests for your primitives. This greatly aids porting to a new version of the core package.

  * Test pickle and unpickle of the primitive (both fitted and unfitted primitives).
  * Test with use of semantic types to select columns to operate on, and without the use of semantic types.
  * Test with all return types: ``append``, ``replace``, ``new``.
  * Test all hyper-parameter values with their ``sample`` method.
  * Use/contribute to `tests data repository`_.

* Do clearly define hyper-parameters (bounds, descriptions, semantic types).

  * Suggest new classes of hyper-parameters if needed.
  * For every bounded hyper-parameter, consider whether its bounds should be inclusive or not
    (``upper_inclusive`` and ``lower_inclusive``).
  * Define reasonable hyper-parameters which can be automatically populated/searched by TA2.
    A hyper-parameter such as ``hyperparams.Hyperparameter[typing.Sequence[Any]]`` is not useful in this case.
  * Ensure that your primitive can be run successfully with default settings for all hyper-parameters.
  * If there are combinations of hyper-parameter settings that are suboptimal, please note this in the documentation. For
    example: "If hyper-parameter A is set to True, hyper-parameter B must always be a positive integer".

* Do bump primitive version when changing hyper-parameters, method signatures or params.
  In short, on any API change of your primitive.

* If your primitive can use GPUs when available, set ``can_use_gpus`` in the primitive's metadata to true.

* If your primitive can use different number of CPUs/cores, expose a hyper-parameter with semantic types
  ``https://metadata.datadrivendiscovery.org/types/ResourcesUseParameter`` and ``https://metadata.datadrivendiscovery.org/types/CPUResourcesUseParameter``
  and allow caller to control the number of CPUs/cores used through it.

  * Make sure that the default value of such hyper-parameter is 1.
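
The ``use_columns``/``exclude_columns`` pair mentioned above is described only as following "standard logic".
One common reading, which is an assumption here rather than something this checklist spells out, is that a
non-empty ``use_columns`` wins outright and ``exclude_columns`` applies only when ``use_columns`` is empty.
A minimal, library-free sketch of that reading follows; ``select_columns`` and ``can_use_column`` are
hypothetical names used for illustration and are not part of the d3m API:

```python
from typing import Callable, List, Sequence


def select_columns(
    column_count: int,
    use_columns: Sequence[int] = (),
    exclude_columns: Sequence[int] = (),
    can_use_column: Callable[[int], bool] = lambda index: True,
) -> List[int]:
    """Return the column indices a primitive should operate on.

    Assumed logic (not confirmed by the checklist itself):
    * a non-empty ``use_columns`` takes precedence over ``exclude_columns``;
    * otherwise every column not listed in ``exclude_columns`` is a candidate;
    * ``can_use_column`` is a hypothetical hook where a primitive could
      check structural or semantic types of each candidate column.
    """
    if use_columns:
        candidates = list(use_columns)
    else:
        excluded = set(exclude_columns)
        candidates = [i for i in range(column_count) if i not in excluded]
    # Keep only the columns the primitive is actually able to process.
    return [i for i in candidates if can_use_column(i)]
```

Columns are identified by index throughout, as the checklist asks, so the selection does not depend on column names.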

DON'Ts

* Don't change the input DataFrame! Make a copy and make changes to the copy instead. The original input DataFrame is
  assumed never to change between primitives in the pipeline.
* Don't return DataFrames with a (non-default) Pandas DataFrame index. It can be utilized internally, but drop it before
  returning. On output a default index should be provided.
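
Both points above can be illustrated with a small sketch. It uses a plain ``pandas.DataFrame`` as a stand-in
for the d3m container type, and the ``produce`` function and the ``value`` column are hypothetical, chosen
only for this example:

```python
import pandas as pd


def produce(inputs: pd.DataFrame) -> pd.DataFrame:
    """Double the (hypothetical) ``value`` column and keep positive rows."""
    # Work on a copy so that the caller's DataFrame is never modified.
    outputs = inputs.copy()
    outputs["value"] = outputs["value"] * 2
    # Filtering rows leaves gaps in the index (a non-default index).
    outputs = outputs[outputs["value"] > 0]
    # Drop that index before returning so the output has a default one.
    return outputs.reset_index(drop=True)
```

After calling ``produce``, the input DataFrame still holds its original values, and the returned DataFrame is
indexed from zero without gaps.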

PLEASE CONSIDER

* Consider using/supporting semantic types to select which columns to operate on, and use the ``use_semantic_types`` hyper-parameter.
* Consider allowing three output strategies: ``new``/``append``/``replace`` output, if operating on columns,
  controlled by the ``return_result`` hyper-parameter.
* Consider picking the input and output format/structure of data to match other primitives of the same family/type. If
  necessary, convert data to the format you need inside your primitive. Pipelines tend to start with datasets, then go
  to dataframes, and sometimes to ndarrays, returning predictions as a dataframe.
  Consider where in a pipeline your primitive should generally sit, and take that into account when deciding on
  its inputs and outputs. Consider that your primitive will be
  chosen dynamically by a TA2 and will be expected to behave in predictable ways based on family and base class.
* Consider using a specific hyper-parameter class instead of the hyper-parameter base class, because the base class
  is not very useful for TA2s. For example, use ``hyperparams.Set`` instead of
  ``hyperparams.Hyperparameter[typing.Sequence[Any]]``; the former is far more descriptive.
* Use a base class for your primitive which makes sense based on semantics of the base class and not necessarily
  how a human would understand the primitive.
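
The three ``return_result`` strategies are named above but not defined. The sketch below shows one plausible
interpretation, which is an assumption: ``new`` returns only the produced columns, ``append`` keeps all input
columns and adds the produced ones at the end, and ``replace`` puts each produced column in place of the column
it was computed from. Columns are modelled as plain ``(name, values)`` pairs, ``combine_columns`` is a
hypothetical helper, and the sketch assumes exactly one produced column per operated column:

```python
from typing import List, Sequence, Tuple

Column = Tuple[str, List[float]]


def combine_columns(
    input_columns: Sequence[Column],
    operated_indices: Sequence[int],
    produced_columns: Sequence[Column],
    return_result: str = "new",
) -> List[Column]:
    """Combine produced columns with input columns (illustrative only)."""
    if return_result == "new":
        # Only the newly produced columns are returned.
        return list(produced_columns)
    if return_result == "append":
        # All input columns are kept; produced columns go at the end.
        return list(input_columns) + list(produced_columns)
    if return_result == "replace":
        if len(operated_indices) != len(produced_columns):
            raise ValueError("expected one produced column per operated column")
        # Map each operated index to the column that replaces it.
        replacement = dict(zip(operated_indices, produced_columns))
        return [replacement.get(i, col) for i, col in enumerate(input_columns)]
    raise ValueError(f"unknown return_result: {return_result!r}")
```

Unit tests that cover all three values, as the DOs section recommends, would exercise each branch of such a helper.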

.. _here: https://gitlab.com/datadrivendiscovery/common-primitives/blob/master/common_primitives/random_forest.py
.. _numpy style: https://numpydoc.readthedocs.io/en/latest/format.html
.. _reStructuredText: http://www.sphinx-doc.org/en/master/usage/restructuredtext/basics.html
.. _tests data repository: https://gitlab.com/datadrivendiscovery/tests-data
.. _standard format: https://keepachangelog.com/en/1.0.0/
.. _list of primitive names: https://gitlab.com/datadrivendiscovery/d3m/-/blob/devel/d3m/metadata/primitive_names.py