.. _quickstart:

TA1 quick-start guide
=====================

This is a tutorial, or quick-start guide, for newcomers to the D3M project who are interested in writing TA1 primitives. It is not meant to be a comprehensive guide to everything about D3M, or even just TA1. The goal is for you to be able to write a new, simple, but working primitive by the end of this tutorial. To achieve this goal, the tutorial is divided into several sections.

Important links
---------------

First, here is a list of important links that should help you with reference and instructional material beyond this quick-start guide. Be aware also that the d3m core package source code has extensive docstrings that you may find helpful.

- Documentation of the whole D3M program: https://docs.datadrivendiscovery.org
- Common primitives: https://gitlab.com/datadrivendiscovery/common-primitives
- Public datasets: https://datasets.datadrivendiscovery.org/d3m/datasets
- Docker images: https://docs.datadrivendiscovery.org/docker.html
- Index of TA1, TA2, TA3 repositories: https://github.com/darpa-i2o/d3m-program-index
- :ref:`primitive-good-citizen`

.. _overview-of-primitives-and-pipelines:

Overview of primitives and pipelines
------------------------------------

Let's start with basic definitions so that we understand a little better what happens when we run a pipeline later in the tutorial.

A *pipeline* is basically a series of steps that are executed in order to solve a particular *problem* (such as prediction based on historical data). A step of a pipeline is usually a *primitive*: something that individually could, for example, transform data into another format, or fit a model for prediction. (A step can also be something else, like a sub-pipeline, but for the purposes of this tutorial, assume that each step is a primitive.)
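To make the idea concrete, here is a toy sketch of a pipeline as an ordered list of steps, where each step consumes the output of the preceding step. This is plain Python for illustration only, not the d3m API; the function names are made up for this sketch.

```python
# A toy illustration of the pipeline concept (NOT the d3m API):
# each step is a function that reads the data produced by the
# preceding step and passes its own output onward.

def parse_columns(rows):
    # Turn string fields into numbers.
    return [[float(value) for value in row] for row in rows]

def impute_missing(rows, fill=0.0):
    # Replace NaN gaps with a fill value (NaN is the only value
    # that does not compare equal to itself).
    return [[fill if value != value else value for value in row] for row in rows]

def run_pipeline(steps, data):
    # Execute each step in order, feeding each step's output to the next.
    for step in steps:
        data = step(data)
    return data

result = run_pipeline([parse_columns, impute_missing], [["1.5", "nan"], ["2.0", "3.0"]])
print(result)  # [[1.5, 0.0], [2.0, 3.0]]
```

Each real d3m step must likewise produce data in a format the next step can read, which is the arrangement constraint described below.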
There are many types of primitives (see the `primitives index repo`_ for the full list of available primitives). In a pipeline, the steps must be arranged such that each step is able to read the data in the format produced by the preceding step.

.. _primitives index repo: https://gitlab.com/datadrivendiscovery/primitives

For this tutorial, let's use the example pipeline that comes with a primitive called ``d3m.primitives.classification.logistic_regression.SKlearn`` to predict baseball hall-of-fame players based on their stats (see the ``185_baseball`` dataset).

Let's take a look at the example pipeline. Many example pipelines can be found in the `primitives index repo`_, where they demonstrate how to use particular primitives. This repository's directory names and files periodically change, so it is prudent to know how to navigate to such a file. The index is organized as:

- ``v2020.1.9`` (version of the core package of the index, changes periodically)
- ``JPL`` (the organization that develops/maintains the primitive)
- ``d3m.primitives.classification.logistic_regression.SKlearn`` (the python path of the actual primitive)
- ``2019.11.13`` (the version of this primitive, changes periodically)
- ``pipelines``
- ``862df0a2-2f87-450d-a6bd-24e9269a8ba6.json`` (the actual pipeline description filename, changes periodically)

Early on in this JSON document, you will see a list called ``steps``. This is the actual list of primitive steps that run one after another in a pipeline. Each step has information about the primitive, as well as its arguments, outputs, and hyper-parameters, if any.
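The shape of the ``steps`` list can be explored programmatically with nothing but the standard library. The pipeline fragment below is abbreviated and partly hypothetical (only the fields discussed above are included, and the second step's ``id`` is a made-up placeholder); the helper function is ours, not part of the d3m package.

```python
import json

# Abbreviated, illustrative pipeline description. The first primitive id
# is taken from the example pipeline; the second id is a placeholder.
pipeline_json = """
{
  "id": "862df0a2-2f87-450d-a6bd-24e9269a8ba6",
  "steps": [
    {
      "type": "PRIMITIVE",
      "primitive": {
        "id": "4b42ce1e-9b98-4a25-b68e-fad13311eb65",
        "python_path": "d3m.primitives.data_transformation.dataset_to_dataframe.Common",
        "name": "Extract a DataFrame from a Dataset"
      }
    },
    {
      "type": "PRIMITIVE",
      "primitive": {
        "id": "11111111-2222-3333-4444-555555555555",
        "python_path": "d3m.primitives.data_transformation.column_parser.Common",
        "name": "Parses strings into their types"
      }
    }
  ]
}
"""

def step_python_paths(description):
    """Return the python path of every primitive step, in execution order."""
    return [step["primitive"]["python_path"]
            for step in description["steps"]
            if step["type"] == "PRIMITIVE"]

pipeline = json.loads(pipeline_json)
print(step_python_paths(pipeline))
```

Running this prints the two python paths in the order the steps would execute.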
This specific pipeline has 5 steps (the ``d3m.primitives`` prefix is omitted in the following list):

- ``data_transformation.dataset_to_dataframe.Common``
- ``data_transformation.column_parser.Common``
- ``data_cleaning.imputer.SKlearn``
- ``classification.logistic_regression.SKlearn``
- ``data_transformation.construct_predictions.Common``

Now let's take a look at the first primitive step in that pipeline. We can find the source code of this primitive in the common-primitives repo, in ``common_primitives/dataset_to_dataframe.py``. Take a look particularly at the ``produce`` method. This is essentially what the primitive does. Try to do this for the other primitive steps in the pipeline as well: take a cursory look at what each one essentially does (note that for the actual classifier primitive, you should look at the ``fit`` method as well to see how the model is trained).

Primitives whose python path suffix is ``*.Common`` are in the `common primitives <https://gitlab.com/datadrivendiscovery/common-primitives>`__ repository, and those that have a ``*.SKlearn`` suffix are in the sklearn-wrap repository (check out the ``dist`` branch, to which primitives are generated).

If you're having a hard time finding the correct source file, you can take the primitive ``id`` from the primitive step description in the pipeline and ``grep`` for it. For example, if you were looking for the source code of the first primitive step in this pipeline, first look at the primitive info in that step and get its ``id``:

.. code::

    "primitive": {
        "id": "4b42ce1e-9b98-4a25-b68e-fad13311eb65",
        "version": "0.3.0",
        "python_path": "d3m.primitives.data_transformation.dataset_to_dataframe.Common",
        "name": "Extract a DataFrame from a Dataset"
    },

Then, run this:

.. code:: shell

    git clone https://gitlab.com/datadrivendiscovery/common-primitives.git
    cd common-primitives
    grep -r 4b42ce1e-9b98-4a25-b68e-fad13311eb65 . | grep -F .py

However, this series of commands assumes that you already know which specific repository the primitive's source code is located in (the ``git clone`` command). Since this is probably not the case for an arbitrary primitive, there is a way to find out the repository URL of any primitive; it requires using a d3m Docker image and is described in the next section.

Setting up a local d3m environment
----------------------------------

In order to run a pipeline, you must have a Python environment where the d3m core package is installed, as well as the packages of the primitives used by the pipeline. While it is possible to set up a Python virtual environment and install the packages through ``pip``, in this tutorial we're going to use the d3m Docker images instead (in many cases, even beyond this tutorial, this will save you a lot of time and effort trying to find any missing primitive packages, manually installing them, and troubleshooting installation errors). So, make sure `Docker <https://www.docker.com/>`__ is installed on your system.

You can find the list of D3M Docker images at https://docs.datadrivendiscovery.org/docker.html. The one we're going to use in this tutorial is the v2020.1.9 primitives image (feel free to use the latest one instead; just modify the ``v2020.1.9`` part accordingly):

.. code:: shell

    docker pull registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9

Once you have downloaded the image, we can finally run the d3m package (and hence run a pipeline). Before running a pipeline though, let's first get a list of the primitives installed in the image's Python environment:

.. code:: shell

    docker run --rm registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 python3 -m d3m index search

You should get a big list of primitives. All of the primitives known to D3M should be there.
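Each line of that listing is a primitive python path of the form ``d3m.primitives.<family>.<name>.<suffix>``, so it is easy to narrow the list down by primitive family with a few lines of standard-library Python. The helper function is ours and the listing below is an illustrative excerpt, not the full search output.

```python
def filter_primitives(python_paths, family):
    """Keep only primitive python paths in the given primitive family.

    Paths follow the pattern d3m.primitives.<family>.<name>.<suffix>,
    so the family is the third dot-separated segment.
    """
    return [path for path in python_paths if path.split(".")[2] == family]

# Illustrative excerpt of what `python3 -m d3m index search` might list.
listing = [
    "d3m.primitives.classification.logistic_regression.SKlearn",
    "d3m.primitives.data_cleaning.imputer.SKlearn",
    "d3m.primitives.data_transformation.dataset_to_dataframe.Common",
]

print(filter_primitives(listing, "data_transformation"))
# ['d3m.primitives.data_transformation.dataset_to_dataframe.Common']
```

In practice you would pipe the docker command's output through ``grep`` for the same effect; this sketch just makes the path structure explicit.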
You can also run the Docker container in interactive mode (to run commands as if you were logged into the container) by using the ``-it`` option:

.. code:: shell

    docker run --rm -it registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9

The previous section mentions a method of determining where the source code of an arbitrary primitive can be found. We can do this using the d3m Python package within a d3m Docker container. First get the ``python_path`` of the primitive step (see the JSON snippet above of the primitive's info from the pipeline). Then, run this command:

.. code:: shell

    docker run --rm registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 python3 -m d3m index describe d3m.primitives.data_transformation.dataset_to_dataframe.Common

Near the top of the huge JSON string describing the primitive, you'll see ``"source"``, and inside it, ``"uris"``. To help read the JSON, you can use the ``jq`` utility:

.. code:: shell

    docker run --rm -it registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 python3 -m d3m index describe d3m.primitives.data_transformation.dataset_to_dataframe.Common | jq .source.uris

This should give the URI of the git repo where the source code of that primitive can be found. You can also substitute the primitive ``id`` for the ``python_path`` in that command, but the command usually returns a result faster if you provide the ``python_path``. Note also that you can only do this for primitives that have been submitted for a particular image (primitives that are contained in the `primitives index repo`_).

It can be obscure at first how to use the d3m Python package, but you can always access the help string for each d3m command at every level of the command chain by using the ``-h`` flag. This is especially useful for getting a list of all the possible arguments of the ``runtime`` module.

.. code:: shell

    docker run --rm registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 python3 -m d3m -h
    docker run --rm registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 python3 -m d3m index -h
    docker run --rm registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 python3 -m d3m runtime -h
    docker run --rm registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 python3 -m d3m runtime fit-score -h

One last point before we try running a pipeline. The Docker container must be able to access the dataset location and the pipeline location on the host filesystem. We can do this by bind-mounting a host directory that contains both the ``datasets`` repo and the ``primitives`` index repo to a container directory. Git clone these repos, and also make another empty directory called ``pipeline-outputs``. Now, if your directory structure looks like this::

    /home/foo/d3m
    ├── datasets
    ├── pipeline-outputs
    └── primitives

Then you'll want to bind-mount ``/home/foo/d3m`` to a directory in the container, say ``/mnt/d3m``. You can specify this mapping in the docker command itself:

.. code:: shell

    docker run \
        --rm \
        -v /home/foo/d3m:/mnt/d3m \
        registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 \
        ls /mnt/d3m

If you're reading this tutorial in a text editor, it might be a good idea at this point to find and replace ``/home/foo/d3m`` with the actual path on your system where the ``datasets``, ``pipeline-outputs``, and ``primitives`` directories are all located. This will make it easier to copy and paste the commands from here on out, instead of changing the faux path every time.

.. _running-example-pipeline:

Running an example pipeline
---------------------------

At this point, let's try running a pipeline.
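One common stumbling block at this stage is datasets that were cloned without git LFS: large files are then small text pointers rather than real data, as the next subsection explains in detail. Here is a minimal stdlib sketch for detecting such a pointer file; the function name is ours, and the check is a heuristic based on the pointer format shown below.

```python
def is_git_lfs_pointer(path):
    """Heuristically detect a git LFS pointer file.

    An unfetched LFS file is a small text file whose first line looks like
    'version https://git-lfs.github.com/spec/v1' instead of actual data.
    """
    try:
        with open(path, "r", encoding="utf-8", errors="strict") as f:
            first_line = f.readline()
    except (OSError, UnicodeDecodeError):
        return False  # unreadable or binary content is not a pointer
    return first_line.startswith("version https://git-lfs.github.com/spec/")
```

Pointing this at a dataset's ``learningData.csv`` tells you whether you still need to run ``git lfs pull`` for that dataset.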
Again, we're going to run the example pipeline that comes with ``d3m.primitives.classification.logistic_regression.SKlearn``. There are two ways to run a pipeline: by specifying all the necessary paths of the dataset, or by specifying and using a pipeline run file. Let's first make sure that the dataset is available, as described in the next subsection.

.. _preparing-dataset:

Preparing the dataset
~~~~~~~~~~~~~~~~~~~~~

Towards the end of the previous section, you were asked to git clone the ``datasets`` repo to your machine. Most likely, you accomplished that like this:

.. code:: shell

    git clone https://datasets.datadrivendiscovery.org/d3m/datasets.git

But unless you had `git LFS <https://git-lfs.github.com/>`__ installed, the entire contents of the repo might not have really been downloaded. The repo is organized such that all files larger than 100 KB are stored in git LFS. Thus, if you cloned without git LFS installed, you most likely have to do a one-time extra step before you can use a dataset, as files of that dataset that are over 100 KB will not have the actual data in them (although they will still exist as files in the cloned repo). This is true even for the dataset that we will use in this exercise, ``185_baseball``. To verify this, open this file in a text editor::

    datasets/training_datasets/seed_datasets_archive/185_baseball/185_baseball_dataset/tables/learningData.csv

Then, see if it contains text similar to this::

    version https://git-lfs.github.com/spec/v1
    oid sha256:931943cc4a675ee3f46be945becb47f53e4297ec3e470c4e3e1f1db66ad3b8d6
    size 131187

If it does, then this dataset has not yet been fully downloaded from git LFS (but if it looks like a normal CSV file, then you can skip the rest of this subsection and move on). To download this dataset, simply run this command inside the ``datasets`` directory:

.. code:: shell

    git lfs pull -I training_datasets/seed_datasets_archive/185_baseball/

Inspect the file again, and you should see that it now looks like a normal CSV file.

In general, if you don't know which specific dataset a certain example pipeline in the ``primitives`` repo uses, inspect the pipeline run output file of that primitive (whose file path is similar to that of the pipeline JSON file, as described in the :ref:`overview-of-primitives-and-pipelines` section, but instead of going to ``pipelines``, go to ``pipeline_runs``). The pipeline run is initially gzipped in the ``primitives`` repo, so decompress it first. Then open up the actual .yml file and look at ``datasets``; under it should be ``id``. If you do that for the example pipeline run of the SKlearn logistic regression primitive that we're looking at for this exercise, you'll find that the dataset id is ``185_baseball_dataset``. The name of the main dataset directory is this string, without the ``_dataset`` part.

Now, let's actually run the pipeline using the two ways mentioned earlier.

Specifying all the necessary paths of a dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can use this way if there is no existing pipeline run yet for a pipeline, or if you want to manually specify the dataset path (point ``--problem``, ``--input``, ``--test-input``, ``--score-input``, and ``--pipeline`` to your target dataset location). Remember to change the bind mount paths (specified by ``-v``) as appropriate for your system.

.. code:: shell

    docker run \
        --rm \
        -v /home/foo/d3m:/mnt/d3m \
        registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 \
        python3 -m d3m \
            runtime \
            fit-score \
            --problem /mnt/d3m/datasets/training_datasets/seed_datasets_archive/185_baseball/185_baseball_problem/problemDoc.json \
            --input /mnt/d3m/datasets/training_datasets/seed_datasets_archive/185_baseball/TRAIN/dataset_TRAIN/datasetDoc.json \
            --test-input /mnt/d3m/datasets/training_datasets/seed_datasets_archive/185_baseball/TEST/dataset_TEST/datasetDoc.json \
            --score-input /mnt/d3m/datasets/training_datasets/seed_datasets_archive/185_baseball/SCORE/dataset_TEST/datasetDoc.json \
            --pipeline /mnt/d3m/primitives/v2020.1.9/JPL/d3m.primitives.classification.logistic_regression.SKlearn/2019.11.13/pipelines/862df0a2-2f87-450d-a6bd-24e9269a8ba6.json \
            --output /mnt/d3m/pipeline-outputs/predictions.csv \
            --output-run /mnt/d3m/pipeline-outputs/run.yml

The score is displayed after the pipeline run. The output predictions are stored at the path specified by ``--output``, and information about the pipeline run is stored at the path specified by ``--output-run``. Again, you can use the ``-h`` flag on ``fit-score`` to access the help string and read about the different arguments, as described earlier.

If you get a Python error that complains about missing columns, or something that looks like this::

    ValueError: Mismatch between column name in data 'version https://git-lfs.github.com/spec/v1' and column name in metadata 'd3mIndex'.

chances are that the ``185_baseball`` dataset has not yet been downloaded through git LFS. See the :ref:`previous subsection <preparing-dataset>` for details on how to verify and fix this.

Using a pipeline run file
~~~~~~~~~~~~~~~~~~~~~~~~~

Instead of specifying all the specific dataset paths, you can also use an existing pipeline run to essentially "re-run" a previous run of the pipeline:

.. code:: shell

    docker run \
        --rm \
        -v /home/foo/d3m:/mnt/d3m \
        registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 \
        python3 -m d3m \
            --pipelines-path /mnt/d3m/primitives/v2020.1.9/JPL/d3m.primitives.classification.logistic_regression.SKlearn/2019.11.13/pipelines \
            runtime \
            --datasets /mnt/d3m/datasets \
            fit-score \
            --input-run /mnt/d3m/primitives/v2020.1.9/JPL/d3m.primitives.classification.logistic_regression.SKlearn/2019.11.13/pipeline_runs/pipeline_run.yml.gz \
            --output /mnt/d3m/pipeline-outputs/predictions.csv \
            --output-run /mnt/d3m/pipeline-outputs/run.yml

In this case, ``--input-run`` is the pipeline run file that will be re-run, and ``--output-run`` is the new pipeline run file that will be generated. Note that if you choose ``fit-score`` for the d3m runtime option, the pipeline actually runs in two phases: fit and produce. You can verify this by searching for ``phase`` in the pipeline run file.

Lastly, if you want to run multiple commands in the Docker container, simply chain your commands with ``&&`` and wrap them in double quotes (``"``) for ``bash -c``. As an example:

.. code:: shell

    docker run \
        --rm \
        -v /home/foo/d3m:/mnt/d3m \
        registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 \
        /bin/bash -c \
        "python3 -m d3m \
            --pipelines-path /mnt/d3m/primitives/v2020.1.9/JPL/d3m.primitives.classification.logistic_regression.SKlearn/2019.11.13/pipelines \
            runtime \
            --datasets /mnt/d3m/datasets/training_datasets/seed_datasets_archive/185_baseball \
            fit-score \
            --input-run /mnt/d3m/primitives/v2020.1.9/JPL/d3m.primitives.classification.logistic_regression.SKlearn/2019.11.13/pipeline_runs/pipeline_run.yml \
            --output /mnt/d3m/pipeline-outputs/predictions.csv \
            --output-run /mnt/d3m/pipeline-outputs/run.yml && \
        head /mnt/d3m/pipeline-outputs/predictions.csv"

Writing a new primitive
-----------------------

Let's now try to write a very simple new primitive: one that simply passes whatever input data it receives from the previous step on to the next step in the pipeline. Let's call this primitive "Passthrough".

We will use the `skeleton primitive repo <https://gitlab.com/datadrivendiscovery/docs-quickstart>`__ as a starting point for this exercise. A d3m primitive repo does not have to follow this exact directory structure, but it is a good structure to start with. Git clone the repo into ``docs-quickstart`` at the same place where the other repos that we have used earlier are located (``datasets``, ``pipeline-outputs``, ``primitives``). Alternatively, you can also use the test primitives in the d3m core repository as a model/starting point; ``test_primitives/null.py`` is essentially the same primitive that we are trying to write.

.. _primitive-source-code:

Primitive source code
~~~~~~~~~~~~~~~~~~~~~

In the ``docs-quickstart`` directory, open ``quickstart_primitives/sample_primitive1/input_to_output.py``. The first important thing to change here is the primitive metadata, which is the first object defined under the ``InputToOutputPrimitive`` class.
Modify the following fields (unless otherwise noted, the values you put in must be strings):

- ``id``: The primitive's UUID v4 identifier. To generate one, you can simply run this inline Python command:

  .. code:: shell

      python3 -c "import uuid; print(uuid.uuid4())"

- ``version``: You can use semantic versioning for this, or another style of versioning. Write ``"0.1.0"`` for this exercise. You should bump the version of the primitive at least every time the public interfaces of the primitive change (e.g., hyper-parameters).

- ``name``: The primitive's name. Write ``"Passthrough primitive"`` for this exercise.

- ``description``: A short description of the primitive. Write ``"A primitive which directly outputs the input."`` for this exercise.

- ``python_path``: This follows the format::

      d3m.primitives.<primitive_family>.<primitive_name>.<kind>

  Primitive families can be found on the d3m metadata page (wait a few seconds for the page to load completely), and primitive names can be found in the d3m core package source code. The last segment can be used to attribute the primitive to the author and/or describe in which way it is different from other primitives with the same primitive family and primitive name, e.g., a different implementation with different trade-offs. For this exercise, write ``"d3m.primitives.operator.input_to_output.Quickstart"``. Note that ``input_to_output`` is not currently registered as a standard primitive name, and using it will produce a warning. For primitives you intend to publish, make a merge request to the d3m core package to add any primitive names you need.

- ``primitive_family``: This must be the same as used in ``python_path``. You can use a string or a Python enumeration value. For the latter, add this import statement (if not there already):

  .. code:: python

      from d3m.metadata import base as metadata_base

  Then write ``metadata_base.PrimitiveFamily.OPERATOR`` (as a value, not a string, so do not put quotation marks) as the value of this field.
- ``algorithm_types``: The algorithm type(s) that the primitive implements. This can be multiple values in an array. Values can be chosen from the d3m metadata page as well. Write ``[metadata_base.PrimitiveAlgorithmType.IDENTITY_FUNCTION]`` here for this exercise (as a list that contains one element, not a string).

- ``source``: General info about the author of this primitive. ``name`` is usually the name of the person or the team that wrote the primitive. ``contact`` is a ``mailto`` URI with the email address of whoever one should contact about this primitive. ``uris`` usually contains the git clone URL of the repo, and you can also add the URL of the source file of this primitive. Write these for the exercise:

  .. code:: python

      "name": "My Name",
      "contact": "mailto:myname@example.com",
      "uris": ["https://gitlab.com/datadrivendiscovery/docs-quickstart.git"],

- ``keywords``: Keywords for what this primitive is or does. Write ``["passthrough"]``.

- ``installation``: Information about how to install this primitive. Add these import statements first:

  .. code:: python

      import os.path

      from d3m import utils

  Then replace the ``installation`` entry with this:

  .. code:: python

      "installation": [{
          "type": metadata_base.PrimitiveInstallationType.PIP,
          "package_uri": "git+https://gitlab.com/datadrivendiscovery/docs-quickstart@{git_commit}#egg=quickstart_primitives".format(
              git_commit=utils.current_git_commit(os.path.dirname(__file__))
          ),
      }],

  In general, for your own primitives, you might only need to substitute the git repo URL here, as well as the Python egg name.

Next, let's take a look at the ``produce`` method. You can see that it simply makes a new dataframe out of the input data and returns it as the output. To see for ourselves that our primitive (and thus this ``produce`` method) gets called during the pipeline run, let's add a log statement here. The ``produce`` method should now look something like this:

.. code:: python

    def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> base.CallResult[Outputs]:
        self.logger.warning('Hi, InputToOutputPrimitive.produce was called!')

        return base.CallResult(value=inputs)

Note that this is an example primitive that is intentionally simple for the purposes of this tutorial. It does not necessarily model a well-written primitive, by any means. For guidelines on how to write a good primitive, take a look at the :ref:`primitive-good-citizen`.

setup.py
~~~~~~~~

Next, we fill in the necessary information in ``setup.py`` so that ``pip`` can correctly install our primitive into our local d3m environment. Open ``setup.py`` (in the project root), and modify the following fields:

- ``name``: Same as the egg name you used in ``package_uri``.
- ``version``: Same as the primitive metadata's ``version``.
- ``description``: Same as the primitive metadata's ``description``, or a description of all primitives if there are multiple primitives in the package you are making.
- ``author``: Same as the primitive metadata's ``source.name``.
- ``url``: Same as the main URL in the primitive metadata's ``source.uris``.
- ``packages``: An array of the Python packages that this primitive repo contains. You can use the ``find_packages`` helper:

  .. code:: python

      packages=find_packages(exclude=['pipelines']),

- ``keywords``: A list of keywords. An important standard keyword is ``d3m_primitive``, which makes primitives discoverable on PyPI.
- ``install_requires``: An array of the Python package dependencies of the primitives contained in this repo. Our primitive needs nothing except the d3m core package (and the ``common-primitives`` package for testing, but that is not a package dependency), so write this as the value of this field: ``['d3m']``.
- ``entry_points``: This is how the d3m runtime maps your primitives' d3m python paths to your repo's local Python paths. For this exercise, it should look like this:

  .. code:: python

      entry_points={
          'd3m.primitives': [
              'operator.input_to_output.Quickstart = quickstart_primitives.sample_primitive1:InputToOutputPrimitive',
          ],
      },

That's it for this file. Briefly review it for any possible syntax errors.

Primitive unit tests
~~~~~~~~~~~~~~~~~~~~

Let's now write a Python test for this primitive, which in this case will just assert that the input dataframe to the primitive equals the output dataframe. Make a new file called ``test_input_to_output.py`` inside ``quickstart_primitives/sample_primitive1`` (the same directory as ``input_to_output.py``), and write this as its contents:

.. code:: python

    import os
    import unittest

    from d3m import container
    from common_primitives import dataset_to_dataframe

    from input_to_output import InputToOutputPrimitive


    class InputToOutputTestCase(unittest.TestCase):
        def test_output_equals_input(self):
            dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), '..', '..', 'tests-data', 'datasets', 'timeseries_dataset_1', 'datasetDoc.json'))
            dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path))

            dataframe_hyperparams_class = dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.get_hyperparams()
            dataframe_primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=dataframe_hyperparams_class.defaults())
            dataframe = dataframe_primitive.produce(inputs=dataset).value

            i2o_hyperparams_class = InputToOutputPrimitive.metadata.get_hyperparams()
            i2o_primitive = InputToOutputPrimitive(hyperparams=i2o_hyperparams_class.defaults())
            output = i2o_primitive.produce(inputs=dataframe).value

            self.assertTrue(output.equals(dataframe))


    if __name__ == '__main__':
        unittest.main()

For the dataset that this test uses, add the d3m ``tests-data`` repository as a git submodule at the root of the ``docs-quickstart`` repository. Then let's install this new primitive into the Docker image's d3m environment and run this test, using the command below:

.. code:: shell

    docker run \
        --rm \
        -v /home/foo/d3m:/mnt/d3m \
        registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 \
        /bin/bash -c \
        "pip3 install -e /mnt/d3m/docs-quickstart && \
        cd /mnt/d3m/docs-quickstart/quickstart_primitives/sample_primitive1 && \
        python3 test_input_to_output.py"

You should see a log statement like this, as well as the Python unittest pass message::

    Hi, InputToOutputPrimitive.produce was called!
    .
    ----------------------------------------------------------------------
    Ran 1 test in 0.011s

Using this primitive in a pipeline
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Having seen the primitive test pass, we can now confidently include this primitive in a pipeline. Let's take the same pipeline that we ran :ref:`before <running-example-pipeline>` (the sklearn logistic regression's example pipeline), and add a step that uses this primitive.

In the root directory of your repository, create these directories: ``pipelines/operator.input_to_output.Quickstart``. Then, from the d3m ``primitives`` repo, copy the JSON pipeline description file from ``primitives/v2020.1.9/JPL/d3m.primitives.classification.logistic_regression.SKlearn/2019.11.13/pipelines`` into the directory we just created.

Open this file, and replace the ``id`` (generate another UUID v4 using the inline Python command from earlier; it must be different from the primitive ``id``), as well as the ``created`` timestamp, using this inline Python command (add ``Z`` at the end of the generated timestamp)::

    python3 -c "import time; import datetime; \
        print(datetime.datetime.fromtimestamp(time.time()).isoformat())"

You can rename the JSON file too, using the new pipeline ``id``.

Next, change the output step number to be one more than the current number (at the time of this writing, it is ``4``, so in this case, change ``"steps.4.produce"`` to ``"steps.5.produce"``, as shown below):

.. code:: json

    "outputs": [
        {
            "data": "steps.5.produce",
            "name": "output predictions"
        }
    ],

Then, find the step that contains the ``d3m.primitives.classification.logistic_regression.SKlearn`` primitive (search for this string in the file), and right above it, add the following JSON object. Remember to change ``primitive.id`` to the primitive's id that you generated in the earlier :ref:`primitive-source-code` subsection.

.. code:: json

    {
        "type": "PRIMITIVE",
        "primitive": {
            "id": "30d5f2fa-4394-4e46-9857-2029ec9ed0e0",
            "version": "0.1.0",
            "python_path": "d3m.primitives.operator.input_to_output.Quickstart",
            "name": "Passthrough primitive"
        },
        "arguments": {
            "inputs": {
                "type": "CONTAINER",
                "data": "steps.2.produce"
            }
        },
        "outputs": [
            {
                "id": "produce"
            }
        ]
    },

Make sure that the step number (``"steps.N.produce"``) in ``arguments.inputs.data`` is correct (one greater than the previous step and one less than the next step). Do this as well for the succeeding steps, with the following caveats:

- For ``d3m.primitives.classification.logistic_regression.SKlearn``, increment the step number both in ``arguments.inputs.data`` and ``arguments.outputs.data`` (at the time of this writing, the number should be changed to ``3``).
- For ``d3m.primitives.data_transformation.construct_predictions.Common``, increment the step number in ``arguments.inputs.data`` (at the time of this writing, the number should be changed to ``4``), but do not change the one in ``arguments.reference.data`` (the value should stay ``"steps.0.produce"``).

Generally, you can also generate a pipeline programmatically, as described in the :ref:`pipeline-description-example`.

Now we can finally run this pipeline that uses our new primitive. In the command below, modify the pipeline JSON filename in the ``--pipeline`` argument to match the filename of your pipeline file (if you renamed it to the new pipeline id that you generated).

.. code:: shell

    docker run \
        --rm \
        -v /home/foo/d3m:/mnt/d3m \
        registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 \
        /bin/bash -c \
        "pip3 install -e /mnt/d3m/docs-quickstart && \
        python3 -m d3m \
            runtime \
            fit-score \
            --problem /mnt/d3m/datasets/training_datasets/seed_datasets_archive/185_baseball/185_baseball_problem/problemDoc.json \
            --input /mnt/d3m/datasets/training_datasets/seed_datasets_archive/185_baseball/TRAIN/dataset_TRAIN/datasetDoc.json \
            --test-input /mnt/d3m/datasets/training_datasets/seed_datasets_archive/185_baseball/TEST/dataset_TEST/datasetDoc.json \
            --score-input /mnt/d3m/datasets/training_datasets/seed_datasets_archive/185_baseball/SCORE/dataset_TEST/datasetDoc.json \
            --pipeline /mnt/d3m/docs-quickstart/pipelines/operator.input_to_output.Quickstart/0f290525-3fec-44f7-ab93-bd778747b91e.json \
            --output /mnt/d3m/pipeline-outputs/predictions_new.csv \
            --output-run /mnt/d3m/pipeline-outputs/run_new.yml"

In the output, you should see the log statement as a warning before the score is shown (similar to the text below)::

    ...
    WARNING:d3m.primitives.operator.input_to_output.Quickstart:Hi, InputToOutputPrimitive.produce was called!
    ...
    metric,value,normalized,randomSeed
    F1_MACRO,0.31696136214800263,0.31696136214800263,0

Verify that the old and new ``predictions.csv`` in ``pipeline-outputs`` are the same (you can use ``diff``), as well as the scores in the old and new ``run.yml`` files (search for ``scores`` in the files).

Beyond this tutorial
--------------------

Congratulations! You just built your own primitive and used it in a d3m pipeline! Normally, when you build your own primitives, you would proceed to validate them for inclusion in the d3m index of all known primitives. See the README of the `primitives index repo`_ for details on how to do this.