- .. _quickstart:
-
- TA1 quick-start guide
- =====================
-
- This is a tutorial, or quick-start guide, for
- newcomers to the D3M project who are interested in writing TA1 primitives.
- It is not meant to be a comprehensive
- guide to everything about D3M, or even just TA1. The goal is for
- the reader to be able to write a new, simple, but working primitive by
- the end of this tutorial. To achieve this goal, the tutorial is divided
- into several sections:
-
- Important links
- ---------------
-
- First, here is a list of important links that provide
- reference and instructional material beyond this quick-start guide. Be
- aware also that the d3m core package source code has extensive docstrings that
- :ref:`you may find helpful <api-reference>`.
-
- - Documentation of the whole D3M program:
- `https://docs.datadrivendiscovery.org <https://docs.datadrivendiscovery.org>`__
- - Common primitives:
- `https://gitlab.com/datadrivendiscovery/common-primitives <https://gitlab.com/datadrivendiscovery/common-primitives>`__
- - Public datasets:
- `https://datasets.datadrivendiscovery.org/d3m/datasets <https://datasets.datadrivendiscovery.org/d3m/datasets>`__
- - Docker images:
- `https://docs.datadrivendiscovery.org/docker.html <https://docs.datadrivendiscovery.org/docker.html>`__
- - Index of TA1, TA2, TA3 repositories:
- `https://github.com/darpa-i2o/d3m-program-index <https://github.com/darpa-i2o/d3m-program-index>`__
- - :ref:`primitive-good-citizen`
-
- .. _overview-of-primitives-and-pipelines:
-
- Overview of primitives and pipelines
- ------------------------------------
-
- Let's start with basic definitions in order for us to understand a
- little bit better what happens when we run a pipeline later in the
- tutorial.
-
- A *pipeline* is essentially a series of steps that are executed in order
- to solve a particular *problem* (such as prediction based on historical
- data). A step of a pipeline is usually a *primitive* (a step can also be
- something else, like a sub-pipeline, but for the purposes of
- this tutorial, assume that each step is a primitive): something that
- individually could, for example, transform data into another format, or
- fit a model for prediction. There are many types of primitives (see the
- `primitives index repo`_ for the full
- list of available primitives). In a pipeline, the steps must be arranged
- so that each step can read the data in the format
- produced by the preceding step.
-
- .. _primitives index repo: https://gitlab.com/datadrivendiscovery/primitives
-
- For this tutorial, let's use the example pipeline that comes with
- a primitive called
- ``d3m.primitives.classification.logistic_regression.SKlearn`` to predict
- baseball hall-of-fame players based on their stats (see the
- `185_baseball dataset <https://datasets.datadrivendiscovery.org/d3m/datasets/-/tree/master/training_datasets/seed_datasets_archive/185_baseball>`__).
-
- Let's take a look at the example pipeline. Many example pipelines can be found
- in the `primitives index repo`_, where they demonstrate how to use particular primitives.
- At the time of this writing, an example pipeline can be found `here
- <https://gitlab.com/datadrivendiscovery/primitives/blob/master/v2020.1.9/JPL/d3m.primitives.classification.logistic_regression.SKlearn/2019.11.13/pipelines/862df0a2-2f87-450d-a6bd-24e9269a8ba6.json>`__,
- but this repository's directory names and files change periodically, so it is
- worth knowing how to navigate to this file yourself.
-
- The index is organized as:
-
- - ``v2020.1.9`` (version of the core package of the index, changes periodically)
- - ``JPL`` (the organization that develops/maintains the primitive)
- - ``d3m.primitives.classification.logistic_regression.SKlearn`` (the python path of the actual primitive)
- - ``2019.11.13`` (the version of this primitive, changes periodically)
- - ``pipelines``
- - ``862df0a2-2f87-450d-a6bd-24e9269a8ba6.json`` (actual pipeline description filename, changes periodically)
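-
- If you prefer the command line, you can also locate such files in a local
- clone of the `primitives index repo`_ (a sketch; as noted above, the version
- and primitive directory names change periodically):
-
- .. code:: shell
-
-     # list the example pipelines shipped with the SKlearn logistic regression primitive
-     find primitives/v2020.1.9/JPL/d3m.primitives.classification.logistic_regression.SKlearn \
-         -path '*/pipelines/*' -name '*.json'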
-
- Early on in this JSON document, you will see a list called ``steps``. This
- is the actual list of primitive steps that run one after another in a
- pipeline. Each step contains information about the primitive, as well as its
- arguments, outputs, and hyper-parameters, if any. This specific pipeline
- has 5 steps (the ``d3m.primitives`` prefix is omitted in the following
- list):
-
- - ``data_transformation.dataset_to_dataframe.Common``
- - ``data_transformation.column_parser.Common``
- - ``data_cleaning.imputer.SKlearn``
- - ``classification.logistic_regression.SKlearn``
- - ``data_transformation.construct_predictions.Common``
-
- Now let's take a look at the first primitive step in that pipeline. We
- can find the source code of this primitive in the common-primitives repo
- (`common_primitives/dataset_to_dataframe.py
- <https://gitlab.com/datadrivendiscovery/common-primitives/blob/master/common_primitives/dataset_to_dataframe.py>`__).
- Take a look particularly at the ``produce`` method. This is essentially
- what the primitive does. Try to do this for the other primitive steps in
- the pipeline as well - take a cursory look at what each one
- does (note that for the actual classifier primitive, you should look at
- the ``fit`` method as well to see how the model is trained). Primitives
- whose python path ends in ``*.Common`` are in the `common primitives <https://gitlab.com/datadrivendiscovery/common-primitives>`__
- repository, and those ending in ``*.SKlearn`` are in the
- `sklearn-wrap <https://gitlab.com/datadrivendiscovery/sklearn-wrap>`__ repository (check out the `dist <https://gitlab.com/datadrivendiscovery/sklearn-wrap/-/tree/dist>`__ branch,
- into which the primitives are generated).
-
- If you're having a hard time finding the correct source file, you can try
- taking the primitive ``id`` from the primitive step description in the
- pipeline and ``grep``-ing for it. For example, if you were
- looking for the source code of the first primitive step in this
- pipeline, you would first look at the primitive info in that step and get its
- ``id``:
-
- .. code::
-
- "primitive": {
- "id": "4b42ce1e-9b98-4a25-b68e-fad13311eb65",
- "version": "0.3.0",
- "python_path": "d3m.primitives.data_transformation.dataset_to_dataframe.Common",
- "name": "Extract a DataFrame from a Dataset"
- },
-
- Then, run this:
-
- .. code:: shell
-
- git clone https://gitlab.com/datadrivendiscovery/common-primitives.git
- cd common-primitives
- grep -r 4b42ce1e-9b98-4a25-b68e-fad13311eb65 . | grep -F .py
-
- However, this series of commands assumes that you already know which
- repository contains the primitive's source code (the ``git
- clone`` command). Since this is probably not the case for an arbitrarily
- given primitive, there is a way to find the repository URL
- of any primitive; it requires using a d3m Docker image and is
- described in the next section.
-
- Setting up a local d3m environment
- ----------------------------------
-
- In order to run a pipeline, you must have a Python environment in which the
- d3m core package is installed, as well as the packages of the primitives
- the pipeline uses. While it is possible to set up a Python virtual
- environment and install the packages yourself through ``pip``, in this
- tutorial we're going to use the d3m Docker images instead (in many
- cases, even beyond this tutorial, this will save you a lot of time and
- effort finding any missing primitive packages, manually
- installing them, and troubleshooting installation errors). So, make sure
- `Docker <https://docs.docker.com/>`__ is installed on your system.
-
- You can find the list of D3M docker images `here <https://docs.datadrivendiscovery.org/docker.html>`__.
- The one we're going to use in this tutorial is the v2020.1.9
- primitives image (feel free to use the latest one instead;
- just modify the ``v2020.1.9`` part accordingly):
-
- .. code:: shell
-
- docker pull registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9
-
- Once you have downloaded the image, we can finally run the d3m package
- (and hence run a pipeline). Before running a pipeline, though, let's
- first get a list of the primitives installed in the image's
- Python environment:
-
- .. code:: shell
-
- docker run --rm registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 python3 -m d3m index search
-
- You should get a big list of primitives. All primitives known to
- D3M should be there.
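-
- For example, to check that the classifier used later in this tutorial is
- installed, you can filter the list (the pipe runs on the host; any substring
- of a primitive's python path works):
-
- .. code:: shell
-
-     docker run --rm registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 python3 -m d3m index search | grep logistic_regression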
-
- You can also run the docker container in interactive mode (to run
- commands as if you had logged into the machine the container provides) by
- using the ``-it`` option:
-
- .. code:: shell
-
- docker run --rm -it registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9
-
- The previous section mentions a method of determining where the source
- code of an arbitrarily given primitive can be found. We can do this
- using the d3m python package within a d3m docker container. First get the
- ``python_path`` of the primitive step (see the JSON snippet above of the
- primitive's info from the pipeline). Then, run this command:
-
- .. code:: shell
-
- docker run --rm registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 python3 -m d3m index describe d3m.primitives.data_transformation.dataset_to_dataframe.Common
-
- Near the top of the huge JSON string describing the primitive, you'll see
- ``"source"``, and inside it, ``"uris"``. To help read the JSON, you can use
- the ``jq`` utility:
-
- .. code:: shell
-
- # on the host, start an interactive container:
- docker run --rm -it registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9
- # then, inside the container:
- python3 -m d3m index describe d3m.primitives.data_transformation.dataset_to_dataframe.Common | jq .source.uris
-
- This should give the URI of the git repo where the source code of that primitive can be found. You
- can also substitute the primitive ``id`` for the ``python_path`` in that
- command, but the command usually returns a result faster if you provide
- the ``python_path``. Note also that you can only do this for primitives
- that have been submitted for a particular image (primitives that are
- contained in the `primitives index repo`_).
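-
- For example, using the primitive ``id`` from the JSON snippet shown earlier
- (a sketch; the pipe runs on the host here, so this assumes ``jq`` is also
- installed on your host):
-
- .. code:: shell
-
-     docker run --rm registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 \
-         python3 -m d3m index describe 4b42ce1e-9b98-4a25-b68e-fad13311eb65 | jq .source.uris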
-
- It can be obscure at first how to use the d3m python package, but you can
- always access the help string for each d3m command at every level of the
- command chain by using the ``-h`` flag. This is especially useful for
- getting a list of all the possible arguments for the ``runtime``
- module.
-
- .. code:: shell
-
- docker run --rm registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 python3 -m d3m -h
- docker run --rm registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 python3 -m d3m index -h
- docker run --rm registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 python3 -m d3m runtime -h
- docker run --rm registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 python3 -m d3m runtime fit-score -h
-
- One last point before we try running a pipeline. The docker container
- must be able to access the dataset location and the pipeline location
- from the host filesystem. We can do this by `bind-mounting
- <https://docs.docker.com/storage/bind-mounts/>`__ a host directory that
- contains both the ``datasets`` repo and the ``primitives`` index repo to
- a container directory. Git clone these repos, and also make another empty
- directory called ``pipeline-outputs`` (example commands follow the tree
- below). Now, if your directory structure looks like this::
-
- /home/foo/d3m
- ├── datasets
- ├── pipeline-outputs
- └── primitives
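-
- One way to prepare this layout (repository URLs as linked earlier; adjust
- ``/home/foo/d3m`` to your own path):
-
- .. code:: shell
-
-     mkdir -p /home/foo/d3m && cd /home/foo/d3m
-     git clone https://datasets.datadrivendiscovery.org/d3m/datasets.git
-     git clone https://gitlab.com/datadrivendiscovery/primitives.git
-     mkdir pipeline-outputs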
-
- Then you'll want to bind-mount ``/home/foo/d3m`` to a directory in the
- container, say ``/mnt/d3m``. You can specify this mapping in the docker
- command itself:
-
- .. code:: shell
-
- docker run \
- --rm \
- -v /home/foo/d3m:/mnt/d3m \
- registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 \
- ls /mnt/d3m
-
- If you're reading this tutorial in a text editor, it might be a good
- idea at this point to find and replace ``/home/foo/d3m`` with the actual
- path on your system where the ``datasets``, ``pipeline-outputs``, and
- ``primitives`` directories are all located. This will make it easier to
- copy and paste the commands from here on out, instead of
- changing the faux path every time.
-
- .. _running-example-pipeline:
-
- Running an example pipeline
- ---------------------------
-
- At this point, let's try running a pipeline. Again, we're going to run
- the example pipeline that comes with
- ``d3m.primitives.classification.logistic_regression.SKlearn``. There are
- two ways to run a pipeline: by specifying all the necessary paths of the
- dataset, or by using an existing pipeline run file. First, though, let's
- make sure the dataset is available, as described in the
- next subsection.
-
- .. _preparing-dataset:
-
- Preparing the dataset
- ~~~~~~~~~~~~~~~~~~~~~
-
- Towards the end of the previous section, you were asked to git clone the
- ``datasets`` repo to your machine. You probably
- accomplished that like this:
-
- .. code:: shell
-
- git clone https://datasets.datadrivendiscovery.org/d3m/datasets.git
-
- But unless you had `git LFS <https://github.com/git-lfs/git-lfs>`__
- installed, the entire contents of the repo might not have actually been
- downloaded.
-
- The repo is organized such that all files larger than 100
- KB are stored in git LFS. Thus, if you cloned without git LFS installed, you
- most likely have to do a one-time extra step before you can use a dataset, as
- any files of that dataset that are over 100 KB will not have the actual
- data in them (although they will still exist as files in the cloned
- repo). This is true even for the dataset that we will use in this
- exercise, ``185_baseball``. To verify this, open this file in a text
- editor::
-
- datasets/training_datasets/seed_datasets_archive/185_baseball/185_baseball_dataset/tables/learningData.csv
-
- Then, see if it contains text similar to this::
-
- version https://git-lfs.github.com/spec/v1
- oid sha256:931943cc4a675ee3f46be945becb47f53e4297ec3e470c4e3e1f1db66ad3b8d6
- size 131187
-
- If it does, then this dataset has not yet been fully downloaded from git
- LFS (but if it looks like a normal CSV file, then you can skip the rest
- of this subsection and move on). To download this dataset, simply run
- this command inside the ``datasets`` directory:
-
- .. code:: shell
-
- git lfs pull -I training_datasets/seed_datasets_archive/185_baseball/
-
- Inspect the file again, and you should see that it looks like a normal
- CSV file now.
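-
- For a quick check from the shell instead of a text editor (run inside the
- ``datasets`` directory):
-
- .. code:: shell
-
-     head -n 2 training_datasets/seed_datasets_archive/185_baseball/185_baseball_dataset/tables/learningData.csv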
-
- In general, if you don't know which specific dataset a certain
- example pipeline in the ``primitives`` repo uses, inspect the pipeline
- run output file of that primitive (its file path is similar to that of
- the pipeline JSON file, as described in the :ref:`overview-of-primitives-and-pipelines` section, but
- instead of going to ``pipelines``, go to ``pipeline_runs``). The
- pipeline run is stored gzipped in the ``primitives`` repo, so
- decompress it first. Then open the actual .yml file, look for
- ``datasets``, and under it should be an ``id``. If you do that for the
- example pipeline run of the SKlearn logistic regression primitive
- that we're looking at for this exercise, you'll find that the dataset id
- is ``185_baseball_dataset``. The name of the main dataset directory is this string,
- without the ``_dataset`` part.
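-
- A minimal sketch of that lookup from the shell (this assumes the run file
- name and YAML layout stay as they currently are in the ``primitives`` repo):
-
- .. code:: shell
-
-     cd primitives/v2020.1.9/JPL/d3m.primitives.classification.logistic_regression.SKlearn/2019.11.13/pipeline_runs
-     # print the datasets entry without decompressing the file on disk
-     zcat pipeline_run.yml.gz | grep -m 1 -A 2 'datasets:'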
-
- Now, let's actually run the pipeline using the two ways mentioned
- earlier.
-
- Specifying all the necessary paths of a dataset
- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
- You can use this approach if there is no existing pipeline run yet for a
- pipeline, or if you want to manually specify the dataset paths: point
- ``--problem``, ``--input``, ``--test-input``, and ``--score-input`` at the
- corresponding files of your target dataset, and ``--pipeline`` at the
- pipeline description file.
-
- Remember to change the bind mount paths as appropriate for your system
- (specified by ``-v``).
-
- .. code:: shell
-
- docker run \
- --rm \
- -v /home/foo/d3m:/mnt/d3m \
- registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 \
- python3 -m d3m \
- runtime \
- fit-score \
- --problem /mnt/d3m/datasets/training_datasets/seed_datasets_archive/185_baseball/185_baseball_problem/problemDoc.json \
- --input /mnt/d3m/datasets/training_datasets/seed_datasets_archive/185_baseball/TRAIN/dataset_TRAIN/datasetDoc.json \
- --test-input /mnt/d3m/datasets/training_datasets/seed_datasets_archive/185_baseball/TEST/dataset_TEST/datasetDoc.json \
- --score-input /mnt/d3m/datasets/training_datasets/seed_datasets_archive/185_baseball/SCORE/dataset_TEST/datasetDoc.json \
- --pipeline /mnt/d3m/primitives/v2020.1.9/JPL/d3m.primitives.classification.logistic_regression.SKlearn/2019.11.13/pipelines/862df0a2-2f87-450d-a6bd-24e9269a8ba6.json \
- --output /mnt/d3m/pipeline-outputs/predictions.csv \
- --output-run /mnt/d3m/pipeline-outputs/run.yml
-
- The score is displayed after the pipeline run. The output predictions
- are stored at the path specified by ``--output``, and information about
- the pipeline run is stored at the path specified by ``--output-run``.
-
- Again, you can use the ``-h`` flag on ``fit-score`` to access the help
- string and read about the different arguments, as described earlier.
-
- If you get a python error that complains about missing columns, or
- something that looks like this::
-
- ValueError: Mismatch between column name in data 'version https://git-lfs.github.com/spec/v1' and column name in metadata 'd3mIndex'.
-
- Chances are that the ``185_baseball`` dataset has not yet been
- downloaded through git LFS. See the :ref:`previous subsection
- <preparing-dataset>` for details on how to verify and fix this.
-
- Using a pipeline run file
- ~~~~~~~~~~~~~~~~~~~~~~~~~
-
- Instead of specifying all the specific dataset paths, you can also use
- an existing pipeline run to essentially "re-run" a previous run
- of the pipeline:
-
- .. code:: shell
-
- docker run \
- --rm \
- -v /home/foo/d3m:/mnt/d3m \
- registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 \
- python3 -m d3m \
- --pipelines-path /mnt/d3m/primitives/v2020.1.9/JPL/d3m.primitives.classification.logistic_regression.SKlearn/2019.11.13/pipelines \
- runtime \
- --datasets /mnt/d3m/datasets \
- fit-score \
- --input-run /mnt/d3m/primitives/v2020.1.9/JPL/d3m.primitives.classification.logistic_regression.SKlearn/2019.11.13/pipeline_runs/pipeline_run.yml.gz \
- --output /mnt/d3m/pipeline-outputs/predictions.csv \
- --output-run /mnt/d3m/pipeline-outputs/run.yml
-
- In this case, ``--input-run`` is the pipeline run file that this pipeline
- will re-run, and ``--output-run`` is the new pipeline run file that will be
- generated.
-
- Note that if you choose ``fit-score`` for the d3m runtime option, the
- pipeline actually runs in two phases: fit and produce. You can verify
- this by searching for ``phase`` in the pipeline run file.
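-
- For example, on the host (using the run file written by the previous command):
-
- .. code:: shell
-
-     grep 'phase' /home/foo/d3m/pipeline-outputs/run.yml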
-
- Lastly, if you want to run multiple commands in the docker container,
- simply chain your commands with ``&&`` and wrap them in double quotes
- (``"``) for ``bash -c``. As an example:
-
- .. code:: shell
-
- docker run \
- --rm \
- -v /home/foo/d3m:/mnt/d3m \
- registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 \
- /bin/bash -c \
- "python3 -m d3m \
- --pipelines-path /mnt/d3m/primitives/v2020.1.9/JPL/d3m.primitives.classification.logistic_regression.SKlearn/2019.11.13/pipelines \
- runtime \
- --datasets /mnt/d3m/datasets/training_datasets/seed_datasets_archive/185_baseball \
- fit-score \
- --input-run /mnt/d3m/primitives/v2020.1.9/JPL/d3m.primitives.classification.logistic_regression.SKlearn/2019.11.13/pipeline_runs/pipeline_run.yml \
- --output /mnt/d3m/pipeline-outputs/predictions.csv \
- --output-run /mnt/d3m/pipeline-outputs/run.yml && \
- head /mnt/d3m/pipeline-outputs/predictions.csv"
-
- Writing a new primitive
- -----------------------
-
- Let's now try to write a very simple new primitive - one that simply
- passes whatever input data it receives from the previous step to the
- next step in the pipeline. Let's call this primitive "Passthrough".
-
- We will use this `skeleton primitive repo
- <https://gitlab.com/datadrivendiscovery/docs-quickstart>`__
- as a starting point
- for this exercise. A d3m primitive repo does not have to follow
- exactly the same directory structure as this one, but it is a good structure to
- start with. Git clone the repo into ``docs-quickstart`` at the same place
- where the other repos that we have used earlier are located
- (``datasets``, ``pipeline-outputs``, ``primitives``).
-
- Alternatively, you can use the `test primitives
- <https://gitlab.com/datadrivendiscovery/tests-data/tree/master/primitives>`__
- as a model/starting point. ``test_primitives/null.py`` is essentially
- the same primitive that we are trying to write.
-
- .. _primitive-source-code:
-
- Primitive source code
- ~~~~~~~~~~~~~~~~~~~~~
-
- In the ``docs-quickstart`` directory, open
- ``quickstart_primitives/sample_primitive1/input_to_output.py``. The first
- important thing to change here is the primitive metadata, which is the
- first object defined under the ``InputToOutputPrimitive`` class. Modify the
- following fields (unless otherwise noted, the values you put in must be
- strings):
-
- - ``id``: The primitive's UUID v4 identifier. To generate one,
- you can simply run this inline Python command:
-
- .. code:: shell
-
- python3 -c "import uuid; print(uuid.uuid4())"
-
- - ``version``: You can use semantic versioning for this or another style
- of versioning. Write ``"0.1.0"`` for this exercise. You should bump
- the version of the primitive at least every time public interfaces
- of the primitive change (e.g. hyper-parameters).
-
- - ``name``: The primitive's name. Write ``"Passthrough primitive"`` for
- this exercise.
-
- - ``description``: A short description of the primitive. Write ``"A
- primitive which directly outputs the input."`` for this exercise.
-
- - ``python_path``: This follows this format::
-
- d3m.primitives.<primitive family>.<primitive name>.<kind>
-
- Primitive families can be found in the `d3m metadata page
- <https://metadata.datadrivendiscovery.org/devel/?definitions#definitions.primitive_family>`__
- (wait a few seconds for the page to load completely), and primitive
- names can be found in the `d3m core package source code
- <https://gitlab.com/datadrivendiscovery/d3m/blob/devel/d3m/metadata/primitive_names.py>`__.
- The last segment can be used to attribute the primitive to the author and/or
- describe how it differs from other primitives with the same
- primitive family and primitive name, e.g., a different implementation with different
- trade-offs.
-
- For this exercise, write
- ``"d3m.primitives.operator.input_to_output.Quickstart"``. Note that
- ``input_to_output`` is not currently registered as a standard primitive name,
- and using it will produce a warning. For primitives you intend to publish,
- make a merge request to the d3m core package to add any primitive names
- you need.
-
- - ``primitive_family``: This must be the same family as used in ``python_path``.
- You can provide it as a string or as a Python enumeration value.
- Add this import statement (if not there already):
-
- .. code:: python
-
- from d3m.metadata import base as metadata_base
-
- Then write ``metadata_base.PrimitiveFamily.OPERATOR`` (as
- a value, not a string, so do not put quotation marks) as the value of
- this field.
-
- - ``algorithm_types``: Algorithm type(s) that the primitive implements.
- This can be multiple values in an array. Values can be chosen from
- the `d3m metadata page
- <https://metadata.datadrivendiscovery.org/devel/?definitions#definitions.algorithm_types>`__
- as well.
- Write ``[metadata_base.PrimitiveAlgorithmType.IDENTITY_FUNCTION]``
- here for this exercise (as a list that contains one element, not a
- string).
-
- - ``source``: General info about the author of this primitive. ``name``
- is usually the name of the person or the team that wrote this
- primitive. ``contact`` is a ``mailto`` URI to the email address of
- whoever one should contact about this primitive. ``uris`` are usually
- the git clone URL of the repo, and you can also add the URL of the
- source file of this primitive.
-
- Write these for the exercise:
-
- .. code:: python
-
- "name": "My Name",
- "contact": "mailto:myname@example.com",
- "uris": ["https://gitlab.com/datadrivendiscovery/docs-quickstart.git"],
-
- - ``keywords``: Keywords for what this primitive is or does. Write
- ``["passthrough"]``.
-
- - ``installation``: Information about how to install this primitive. Add
- these import statements first:
-
- .. code:: python
-
- import os.path
- from d3m import utils
-
- Then replace the ``installation`` entry with this:
-
- .. code:: python
-
- "installation": [{
- "type": metadata_base.PrimitiveInstallationType.PIP,
- "package_uri": "git+https://gitlab.com/datadrivendiscovery/docs-quickstart@{git_commit}#egg=quickstart_primitives".format(
- git_commit=utils.current_git_commit(os.path.dirname(__file__))
- ),
- }],
-
- In general, for your own actual primitives, you might only need to
- substitute the git repo URL here as well as the python egg name.
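-
- Putting it all together, the metadata for this exercise might look roughly
- like the following sketch (values as chosen above; the skeleton repo defines
- this as the ``metadata`` class attribute of the primitive, and may include
- additional fields):
-
- .. code:: python
-
-     import os.path
-
-     from d3m import utils
-     from d3m.metadata import base as metadata_base
-
-     metadata = metadata_base.PrimitiveMetadata({
-         # Use your own generated UUID v4 here.
-         "id": "30d5f2fa-4394-4e46-9857-2029ec9ed0e0",
-         "version": "0.1.0",
-         "name": "Passthrough primitive",
-         "description": "A primitive which directly outputs the input.",
-         "python_path": "d3m.primitives.operator.input_to_output.Quickstart",
-         "primitive_family": metadata_base.PrimitiveFamily.OPERATOR,
-         "algorithm_types": [metadata_base.PrimitiveAlgorithmType.IDENTITY_FUNCTION],
-         "source": {
-             "name": "My Name",
-             "contact": "mailto:myname@example.com",
-             "uris": ["https://gitlab.com/datadrivendiscovery/docs-quickstart.git"],
-         },
-         "keywords": ["passthrough"],
-         "installation": [{
-             "type": metadata_base.PrimitiveInstallationType.PIP,
-             "package_uri": "git+https://gitlab.com/datadrivendiscovery/docs-quickstart@{git_commit}#egg=quickstart_primitives".format(
-                 git_commit=utils.current_git_commit(os.path.dirname(__file__)),
-             ),
-         }],
-     })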
-
- Next, let's take a look at the ``produce`` method. You can see that it
- simply makes a new dataframe out of the input data, and returns it as
- the output. To see for ourselves though that our primitive (and thus
- this ``produce`` method) gets called during the pipeline run, let's add
- a log statement here. The ``produce`` method should now look something
- like this:
-
- .. code:: python
-
- def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> base.CallResult[Outputs]:
-     self.logger.warning('Hi, InputToOutputPrimitive.produce was called!')
-     return base.CallResult(value=inputs)
-
- Note that this is simply an example primitive that is intentionally
- simple for the purposes of this tutorial. It does not necessarily model
- a well-written primitive, by any means. For guidelines on how to write a
- good primitive, take a look at the :ref:`primitive-good-citizen`.
-
- setup.py
- ~~~~~~~~
-
- Next, we fill in the necessary information in ``setup.py`` so that
- ``pip`` can correctly install our primitive in our local d3m
- environment. Open ``setup.py`` (in the project root), and modify the
- following fields:
-
- - ``name``: Same as the egg name you used in ``package_uri``
-
- - ``version``: Same as the primitive metadata's ``version``
-
- - ``description``: Same as the primitive metadata's ``description``,
- or a description of all primitives if there are multiple primitives
- in the package you are making
-
- - ``author``: Same as the primitive metadata's ``source.name``
-
- - ``url``: Same as main URL in the primitive metadata's
- ``source.uris``
-
- - ``packages``: This is an array of the python packages that this
- primitive repo contains. You can use the ``find_packages`` helper:
-
- .. code:: python
-
- packages=find_packages(exclude=['pipelines']),
-
- - ``keywords``: A list of keywords. An important standard keyword is
- ``d3m_primitive``, which makes all primitives discoverable on PyPI
-
- - ``install_requires``: This is an array of the python package
- dependencies of the primitives contained in this repo. Our primitive
- needs nothing except the d3m core package (and the
- ``common-primitives`` package too for testing, but this is not a
- package dependency), so write this as the value of this field:
- ``['d3m']``
-
- - ``entry_points``: This is how the d3m runtime maps your primitives'
- d3m python paths to your repo's local python paths. For this
- exercise, it should look like this:
-
- .. code:: python
-
- entry_points={
-     'd3m.primitives': [
-         'operator.input_to_output.Quickstart = quickstart_primitives.sample_primitive1:InputToOutputPrimitive',
-     ],
- }
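-
- With those fields filled in, the resulting ``setup.py`` might look roughly
- like this (a sketch assembled from the values above; the skeleton repo may
- include additional fields):
-
- .. code:: python
-
-     from setuptools import setup, find_packages
-
-     setup(
-         name='quickstart_primitives',
-         version='0.1.0',
-         description='A primitive which directly outputs the input.',
-         author='My Name',
-         url='https://gitlab.com/datadrivendiscovery/docs-quickstart',
-         packages=find_packages(exclude=['pipelines']),
-         keywords=['d3m_primitive', 'passthrough'],
-         install_requires=['d3m'],
-         entry_points={
-             'd3m.primitives': [
-                 'operator.input_to_output.Quickstart = quickstart_primitives.sample_primitive1:InputToOutputPrimitive',
-             ],
-         },
-     )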
-
- That's it for this file. Briefly review it for any possible syntax
- errors.
-
- Primitive unit tests
- ~~~~~~~~~~~~~~~~~~~~
-
- Let's now make a python test for this primitive, which in this case will
- just assert whether the input dataframe to the primitive equals the
- output dataframe. Make a new file called ``test_input_to_output.py``
- inside ``quickstart_primitives/sample_primitive1`` (the same directory as
- ``input_to_output.py``), and write this as its contents:
-
- .. code:: python
-
- import unittest
- import os
-
- from d3m import container
- from common_primitives import dataset_to_dataframe
- from input_to_output import InputToOutputPrimitive
-
-
- class InputToOutputTestCase(unittest.TestCase):
-     def test_output_equals_input(self):
-         dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), '..', '..', 'tests-data', 'datasets', 'timeseries_dataset_1', 'datasetDoc.json'))
-
-         dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path))
-
-         # Convert the Dataset into a DataFrame using a common primitive.
-         dataframe_hyperparams_class = dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.get_hyperparams()
-         dataframe_primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=dataframe_hyperparams_class.defaults())
-         dataframe = dataframe_primitive.produce(inputs=dataset).value
-
-         # Run our primitive and check that it returns its input unchanged.
-         i2o_hyperparams_class = InputToOutputPrimitive.metadata.get_hyperparams()
-         i2o_primitive = InputToOutputPrimitive(hyperparams=i2o_hyperparams_class.defaults())
-         output = i2o_primitive.produce(inputs=dataframe).value
-
-         self.assertTrue(output.equals(dataframe))
-
-
- if __name__ == '__main__':
-     unittest.main()
-
- For the dataset that this test uses, add the `d3m tests-data <https://gitlab.com/datadrivendiscovery/tests-data>`__
- repository as a git submodule at the root of the ``docs-quickstart`` repository, for example:
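-
- .. code:: shell
-
-     # run at the root of your docs-quickstart clone (path as used throughout this tutorial)
-     cd /home/foo/d3m/docs-quickstart
-     git submodule add https://gitlab.com/datadrivendiscovery/tests-data.git
-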
- Then let's install this new primitive into the Docker image's d3m environment and
- run the test, using the command below:
-
- .. code:: shell
-
- docker run \
- --rm \
- -v /home/foo/d3m:/mnt/d3m \
- registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 \
- /bin/bash -c \
- "pip3 install -e /mnt/d3m/docs-quickstart && \
- cd /mnt/d3m/docs-quickstart/quickstart_primitives/sample_primitive1 && \
- python3 test_input_to_output.py"
-
- You should see a log statement like this, as well as the python unittest
- pass message::
-
- Hi, InputToOutputPrimitive.produce was called!
- .
- ----------------------------------------------------------------------
- Ran 1 test in 0.011s
-
- Using this primitive in a pipeline
- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
- Having seen the primitive test pass, we can now confidently include this
- primitive in a pipeline. Let's take the same pipeline that we ran :ref:`before <running-example-pipeline>`
- (the sklearn logistic regression's example pipeline),
- and add a step using this primitive.
-
- In the root directory of your repository, create these directories:
- ``pipelines/operator.input_to_output.Quickstart``. Then, from the d3m
- ``primitives`` repo, copy the JSON pipeline description file from
- ``primitives/v2020.1.9/JPL/d3m.primitives.classification.logistic_regression.SKlearn/2019.11.13/pipelines``
- into the directory we just created. Open this file, and replace the
- ``id`` (generate another UUID v4 using the inline python command shown
- earlier; it must be different from the primitive ``id``), as well as the ``created``
- timestamp, using this inline python command (the trailing ``Z`` denotes UTC)::
-
- python3 -c "import datetime; \
- print(datetime.datetime.utcnow().isoformat() + 'Z')"
-
- You can also rename the json file to match the new pipeline ``id``.
-
- Next, change the output step number under ``outputs`` to be one more than
- the current number (at the time of this writing it is ``"steps.4.produce"``,
- so change it to ``"steps.5.produce"``, as shown below):
-
- .. code:: json
-
- "outputs": [
- {
- "data": "steps.5.produce",
- "name": "output predictions"
- }
- ],
-
- Then, find the step that contains the
- ``d3m.primitives.classification.logistic_regression.SKlearn`` primitive
- (search for this string in the file), and right above it, add the
- following JSON object. Remember to change ``primitive.id`` to the
- primitive's id that you generated in the earlier :ref:`primitive-source-code` subsection.
-
- .. code:: json
-
- {
-     "type": "PRIMITIVE",
-     "primitive": {
-         "id": "30d5f2fa-4394-4e46-9857-2029ec9ed0e0",
-         "version": "0.1.0",
-         "python_path": "d3m.primitives.operator.input_to_output.Quickstart",
-         "name": "Passthrough primitive"
-     },
-     "arguments": {
-         "inputs": {
-             "type": "CONTAINER",
-             "data": "steps.2.produce"
-         }
-     },
-     "outputs": [
-         {
-             "id": "produce"
-         }
-     ]
- },
-
- Make sure that the step number (``"steps.N.produce"``) in
- ``arguments.inputs.data`` is correct (one greater than the previous step
- and one less than the next step). Do this as well for the succeeding
- steps, with the following caveats:
-
- - For ``d3m.primitives.classification.logistic_regression.SKlearn``,
- increment the step number both for ``arguments.inputs.data`` and
- ``arguments.outputs.data`` (at the time of this writing, the number
- should be changed to ``3``).
- - For
- ``d3m.primitives.data_transformation.construct_predictions.Common``,
- increment the step number for ``arguments.inputs.data`` (at the time
- of this writing, the number should be changed to ``4``), but do not
- change the one for ``arguments.reference.data`` (the value should
- stay as ``"steps.0.produce"``)
-
- Generally, you can also programmatically generate a pipeline, as
- described in the :ref:`pipeline-description-example`.
-
- Now we can finally run this pipeline that uses our new primitive. In the
- command below, modify the pipeline JSON filename in the ``--pipeline`` argument
- to match the filename of your pipeline file (if you renamed it using the
- new pipeline id that you generated).
-
- .. code:: shell
-
- docker run \
- --rm \
- -v /home/foo/d3m:/mnt/d3m \
- registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 \
- /bin/bash -c \
- "pip3 install -e /mnt/d3m/docs-quickstart && \
- python3 -m d3m \
- runtime \
- fit-score \
- --problem /mnt/d3m/datasets/training_datasets/seed_datasets_archive/185_baseball/185_baseball_problem/problemDoc.json \
- --input /mnt/d3m/datasets/training_datasets/seed_datasets_archive/185_baseball/TRAIN/dataset_TRAIN/datasetDoc.json \
- --test-input /mnt/d3m/datasets/training_datasets/seed_datasets_archive/185_baseball/TEST/dataset_TEST/datasetDoc.json \
- --score-input /mnt/d3m/datasets/training_datasets/seed_datasets_archive/185_baseball/SCORE/dataset_TEST/datasetDoc.json \
- --pipeline /mnt/d3m/docs-quickstart/pipelines/operator.input_to_output.Quickstart/0f290525-3fec-44f7-ab93-bd778747b91e.json \
- --output /mnt/d3m/pipeline-outputs/predictions_new.csv \
- --output-run /mnt/d3m/pipeline-outputs/run_new.yml"
-
- In the output, you should see the log statement as a warning,
- before the score is shown (similar to the text below)::
-
- ...
- WARNING:d3m.primitives.operator.input_to_output.Quickstart:Hi, InputToOutputPrimitive.produce was called!
- ...
- metric,value,normalized,randomSeed
- F1_MACRO,0.31696136214800263,0.31696136214800263,0
-
- Verify that the old and new ``predictions.csv`` in ``pipeline-outputs``
- are the same (you can use ``diff``), as well as the scores in the old
- and new ``run.yml`` files (search for ``scores`` in the files).
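-
- For example (paths as used throughout this tutorial; ``diff`` prints nothing
- when the files match):
-
- .. code:: shell
-
-     diff /home/foo/d3m/pipeline-outputs/predictions.csv /home/foo/d3m/pipeline-outputs/predictions_new.csv
-     grep -A 5 'scores' /home/foo/d3m/pipeline-outputs/run.yml
-     grep -A 5 'scores' /home/foo/d3m/pipeline-outputs/run_new.yml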
-
- Beyond this tutorial
- --------------------
-
- Congratulations! You just built your own primitive and you were able to
- use it in a d3m pipeline!
-
- Normally, when you build your own primitives, you would proceed to
- validate them for inclusion in the d3m index of all known
- primitives. See the `primitives repo README
- <https://gitlab.com/datadrivendiscovery/primitives#adding-a-primitive>`__
- for details on how to do this.