.. _quickstart:

TA1 quick-start guide
=====================

This aims to be a tutorial, or a quick-start guide, for newcomers to the
D3M project who are interested in writing TA1 primitives. It is not
meant to be a comprehensive guide to everything about D3M, or even just
TA1. The goal here is for the reader to be able to write a new, simple,
but working primitive by the end of this tutorial. To achieve this goal,
this tutorial is divided into several sections.
Important links
---------------

First, here is a list of some important links that should help you with
reference and instructional material beyond this quick-start guide. Be
aware also that the d3m core package source code has extensive
docstrings that :ref:`you may find helpful <api-reference>`.

- Documentation of the whole D3M program:
  `https://docs.datadrivendiscovery.org <https://docs.datadrivendiscovery.org>`__
- Common primitives:
  `https://gitlab.com/datadrivendiscovery/common-primitives <https://gitlab.com/datadrivendiscovery/common-primitives>`__
- Public datasets:
  `https://datasets.datadrivendiscovery.org/d3m/datasets <https://datasets.datadrivendiscovery.org/d3m/datasets>`__
- Docker images:
  `https://docs.datadrivendiscovery.org/docker.html <https://docs.datadrivendiscovery.org/docker.html>`__
- Index of TA1, TA2, TA3 repositories:
  `https://github.com/darpa-i2o/d3m-program-index <https://github.com/darpa-i2o/d3m-program-index>`__
- :ref:`primitive-good-citizen`
.. _overview-of-primitives-and-pipelines:

Overview of primitives and pipelines
------------------------------------

Let's start with basic definitions in order for us to understand a
little bit better what happens when we run a pipeline later in the
tutorial.

A *pipeline* is basically a series of steps that are executed in order
to solve a particular *problem* (such as prediction based on historical
data). A step of a pipeline is usually a *primitive* (a step can be
something else, like a sub-pipeline, but for the purposes of this
tutorial, assume that each step is a primitive): something that
individually could, for example, transform data into another format, or
fit a model for prediction. There are many types of primitives (see the
`primitives index repo`_ for the full list of available primitives). In
a pipeline, the steps must be arranged so that each step can read the
data in the format produced by the preceding step.

.. _primitives index repo: https://gitlab.com/datadrivendiscovery/primitives
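To make the data-flow idea concrete, here is a toy sketch in plain Python. This is NOT the d3m API; the step functions are made up for illustration. It only shows the shape of a pipeline: an ordered list of steps, where each step reads the output of the preceding one.

```python
# Toy illustration only -- NOT the d3m API. Each "primitive" here is a
# plain function, and the pipeline is just an ordered list of them.

def to_table(raw_rows):
    # stand-in for a data-transformation step: copy the raw rows
    return [list(row) for row in raw_rows]

def parse_columns(rows):
    # stand-in for a column-parsing step: strings -> floats
    return [[float(value) for value in row] for row in rows]

pipeline = [to_table, parse_columns]

data = [["1", "2.5"], ["3", "4.5"]]
for step in pipeline:
    # each step must accept the format produced by the preceding step
    data = step(data)

print(data)  # [[1.0, 2.5], [3.0, 4.5]]
```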
For this tutorial, let's use the example pipeline that comes with a
primitive called
``d3m.primitives.classification.logistic_regression.SKlearn`` to predict
baseball hall-of-fame players based on their stats (see the
`185_baseball dataset <https://datasets.datadrivendiscovery.org/d3m/datasets/-/tree/master/training_datasets/seed_datasets_archive/185_baseball>`__).

Let's take a look at the example pipeline. Many example pipelines can be
found in the `primitives index repo`_, where they demonstrate how to use
particular primitives. At the time of this writing, an example pipeline
can be found `here
<https://gitlab.com/datadrivendiscovery/primitives/blob/master/v2020.1.9/JPL/d3m.primitives.classification.logistic_regression.SKlearn/2019.11.13/pipelines/862df0a2-2f87-450d-a6bd-24e9269a8ba6.json>`__,
but this repository's directory names and files periodically change, so
it is prudent to know how to navigate to this file too. The index is
organized as:

- ``v2020.1.9`` (version of the core package of the index, changes periodically)
- ``JPL`` (the organization that develops/maintains the primitive)
- ``d3m.primitives.classification.logistic_regression.SKlearn`` (the python path of the actual primitive)
- ``2019.11.13`` (the version of this primitive, changes periodically)
- ``pipelines``
- ``862df0a2-2f87-450d-a6bd-24e9269a8ba6.json`` (actual pipeline description filename, changes periodically)
Early on in this JSON document, you will see a list called ``steps``.
This is the actual list of primitive steps that run one after another in
a pipeline. Each step has information about the primitive, as well as
arguments, outputs, and hyper-parameters, if any. This specific pipeline
has 5 steps (the ``d3m.primitives`` prefix is omitted in the following
list):

- ``data_transformation.dataset_to_dataframe.Common``
- ``data_transformation.column_parser.Common``
- ``data_cleaning.imputer.SKlearn``
- ``classification.logistic_regression.SKlearn``
- ``data_transformation.construct_predictions.Common``

Now let's take a look at the first primitive step in that pipeline. We
can find the source code of this primitive in the common-primitives repo
(`common_primitives/dataset_to_dataframe.py
<https://gitlab.com/datadrivendiscovery/common-primitives/blob/master/common_primitives/dataset_to_dataframe.py>`__).
Take a look particularly at the ``produce`` method. This is essentially
what the primitive does. Try to do this for the other primitive steps in
the pipeline as well - take a cursory look at what each one essentially
does (note that for the actual classifier primitive, you should look at
the ``fit`` method as well to see how the model is trained). Primitives
whose python path suffix is ``*.Common`` are in the `common primitives
<https://gitlab.com/datadrivendiscovery/common-primitives>`__
repository, and those that have a ``*.SKlearn`` suffix are in the
`sklearn-wrap <https://gitlab.com/datadrivendiscovery/sklearn-wrap>`__
repository (check out the `dist <https://gitlab.com/datadrivendiscovery/sklearn-wrap/-/tree/dist>`__
branch, into which primitives are generated).
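As a mnemonic for this convention, here is a tiny sketch. The suffix-to-repository mapping is taken from the paragraph above; the helper function itself is made up for illustration.

```python
# Map a primitive's python-path suffix to the repository that, by the
# convention described above, hosts its source code.
REPOS = {
    "Common": "https://gitlab.com/datadrivendiscovery/common-primitives",
    "SKlearn": "https://gitlab.com/datadrivendiscovery/sklearn-wrap",
}

def repo_for(python_path):
    # the suffix is the last dotted segment of the python path
    suffix = python_path.rsplit(".", 1)[-1]
    return REPOS.get(suffix)  # None for suffixes not covered here

print(repo_for("d3m.primitives.data_cleaning.imputer.SKlearn"))
# https://gitlab.com/datadrivendiscovery/sklearn-wrap
```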
If you're having a hard time looking for the correct source file, you
can try taking the primitive ``id`` from the primitive step description
in the pipeline, and ``grep`` for it. For example, if you were looking
for the source code of the first primitive step in this pipeline, first
look at the primitive info in that step and get its ``id``:

.. code::

    "primitive": {
        "id": "4b42ce1e-9b98-4a25-b68e-fad13311eb65",
        "version": "0.3.0",
        "python_path": "d3m.primitives.data_transformation.dataset_to_dataframe.Common",
        "name": "Extract a DataFrame from a Dataset"
    },

Then, run this:

.. code:: shell

    git clone https://gitlab.com/datadrivendiscovery/common-primitives.git
    cd common-primitives
    grep -r 4b42ce1e-9b98-4a25-b68e-fad13311eb65 . | grep -F .py
However, this series of commands assumes that you know exactly which
repository the primitive's source code is located in (the ``git clone``
command). Since this is probably not the case for an arbitrarily given
primitive, there is a way to find out the repository URL of any
primitive; it requires using a d3m Docker image and is described in the
next section.
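As an aside, once you have a pipeline description JSON locally, listing each step's primitive ``id`` and ``python_path`` is easy to script. A minimal sketch (the helper name is made up; it assumes the standard pipeline layout shown in the JSON snippet above):

```python
import json

def list_primitive_steps(pipeline_json_text):
    """Return (id, python_path) for each primitive step in a pipeline."""
    pipeline = json.loads(pipeline_json_text)
    return [
        (step["primitive"]["id"], step["primitive"]["python_path"])
        for step in pipeline.get("steps", [])
        if "primitive" in step  # skip non-primitive (e.g. sub-pipeline) steps
    ]
```

You would call it with the contents of a pipeline JSON file, e.g. ``list_primitive_steps(open("pipeline.json").read())``.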
Setting up a local d3m environment
----------------------------------

In order to run a pipeline, you must have a Python environment where the
d3m core package is installed, as well as the packages of the
primitives. While it is possible to set up a Python virtual environment
and install the packages through ``pip``, in this tutorial we're going
to use the d3m Docker images instead (in many cases, even beyond this
tutorial, this will save you a lot of time and effort trying to find any
missing primitive packages, manually installing them, and
troubleshooting installation errors). So, make sure `Docker
<https://docs.docker.com/>`__ is installed on your system.

You can find the list of D3M docker images `here
<https://docs.datadrivendiscovery.org/docker.html>`__. The one we're
going to use in this tutorial is the v2020.1.9 primitives image (feel
free to use the latest one instead - just modify the ``v2020.1.9`` part
accordingly):

.. code:: shell

    docker pull registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9
Once you have downloaded the image, we can finally run the d3m package
(and hence run a pipeline). Before running a pipeline though, let's
first try to get a list of what primitives are installed in the image's
Python environment:

.. code:: shell

    docker run --rm registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 python3 -m d3m index search

You should get a big list of primitives. All of the primitives known to
D3M should be there.
You can also run the docker container in interactive mode (to run
commands as if you had logged into the machine that the container
provides) by using the ``-it`` option:

.. code:: shell

    docker run --rm -it registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9
The previous section mentions a method of determining where the source
code of an arbitrarily given primitive can be found. We can do this
using the d3m python package within a d3m docker container. First get
the ``python_path`` of the primitive step (see the JSON snippet above of
the primitive's info from the pipeline). Then, run this command:

.. code:: shell

    docker run --rm registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 python3 -m d3m index describe d3m.primitives.data_transformation.dataset_to_dataframe.Common

Near the top of the huge JSON string describing the primitive, you'll
see ``"source"``, and inside it, ``"uris"``. To help read the JSON, you
can use the ``jq`` utility, for example inside an interactive container:

.. code:: shell

    docker run --rm -it registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9
    python3 -m d3m index describe d3m.primitives.data_transformation.dataset_to_dataframe.Common | jq .source.uris

This should give the URI of the git repo where the source code of that
primitive can be found. You can also substitute the primitive ``id`` for
the ``python_path`` in that command, but the command usually returns a
result faster if you provide the ``python_path``. Note also that you can
only do this for primitives that have been submitted for a particular
image (primitives that are contained in the `primitives index repo`_).
It can be obscure at first how to use the d3m python package, but you
can always access the help string for each d3m command at every level of
the command chain by using the ``-h`` flag. This is especially useful
for getting a list of all the possible arguments for the ``runtime``
module:

.. code:: shell

    docker run --rm registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 python3 -m d3m -h
    docker run --rm registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 python3 -m d3m index -h
    docker run --rm registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 python3 -m d3m runtime -h
    docker run --rm registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 python3 -m d3m runtime fit-score -h
One last point before we try running a pipeline. The docker container
must be able to access the dataset location and the pipeline location on
the host filesystem. We can do this by `bind-mounting
<https://docs.docker.com/storage/bind-mounts/>`__ a host directory that
contains both the ``datasets`` repo and the ``primitives`` index repo to
a container directory. Git clone these repos, and also make another
empty directory called ``pipeline-outputs``. Now, if your directory
structure looks like this::

    /home/foo/d3m
    ├── datasets
    ├── pipeline-outputs
    └── primitives

then you'll want to bind-mount ``/home/foo/d3m`` to a directory in the
container, say ``/mnt/d3m``. You can specify this mapping in the docker
command itself:

.. code:: shell

    docker run \
        --rm \
        -v /home/foo/d3m:/mnt/d3m \
        registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 \
        ls /mnt/d3m
If you're reading this tutorial in a text editor, it might be a good
idea at this point to find and replace ``/home/foo/d3m`` with the actual
path on your system where the ``datasets``, ``pipeline-outputs``, and
``primitives`` directories are all located. This will make it easier for
you to just copy and paste the commands from here on out, instead of
changing the faux path every time.

.. _running-example-pipeline:

Running an example pipeline
---------------------------

At this point, let's try running a pipeline. Again, we're going to run
the example pipeline that comes with
``d3m.primitives.classification.logistic_regression.SKlearn``. There are
two ways to run a pipeline: by specifying all the necessary dataset
paths, or by using a pipeline run file. First, though, let's make sure
the dataset is available, as described in the next subsection.
.. _preparing-dataset:

Preparing the dataset
~~~~~~~~~~~~~~~~~~~~~

Towards the end of the previous section, you were asked to git clone the
``datasets`` repo to your machine. Most likely, you accomplished that
like this:

.. code:: shell

    git clone https://datasets.datadrivendiscovery.org/d3m/datasets.git

But unless you had `git LFS <https://github.com/git-lfs/git-lfs>`__
installed, the entire contents of the repo might not have actually been
downloaded. The repo is organized such that all files larger than 100 KB
are stored in git LFS. Thus, if you cloned without git LFS installed,
you most likely have to do a one-time extra step before you can use a
dataset, as files of that dataset that are over 100 KB will not have the
actual data in them (although they will still exist as files in the
cloned repo). This is true even for the dataset that we will use in this
exercise, ``185_baseball``. To verify this, open this file in a text
editor::

    datasets/training_datasets/seed_datasets_archive/185_baseball/185_baseball_dataset/tables/learningData.csv

Then, see if it contains text similar to this::

    version https://git-lfs.github.com/spec/v1
    oid sha256:931943cc4a675ee3f46be945becb47f53e4297ec3e470c4e3e1f1db66ad3b8d6
    size 131187

If it does, then this dataset has not yet been fully downloaded from git
LFS (but if it looks like a normal CSV file, then you can skip the rest
of this subsection and move on). To download this dataset, simply run
this command inside the ``datasets`` directory:

.. code:: shell

    git lfs pull -I training_datasets/seed_datasets_archive/185_baseball/

Inspect the file again, and you should see that it looks like a normal
CSV file now.
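If you'd rather check programmatically whether a file is still an un-downloaded LFS pointer, here is a quick sketch; it assumes the standard pointer format shown above (first line starting with the git-LFS spec URL), and the helper name is made up.

```python
def is_lfs_pointer(path):
    """True if the file still looks like a git-LFS pointer stub."""
    with open(path, "rb") as f:
        first_line = f.readline()
    return first_line.startswith(b"version https://git-lfs.github.com/spec/")
```

A real CSV starts with its header row, so the check returns ``False`` for a fully-downloaded dataset file.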
In general, if you don't know which specific dataset a certain example
pipeline in the ``primitives`` repo uses, inspect the pipeline run
output file of that primitive (whose file path is similar to that of the
pipeline JSON file, as described in the
:ref:`overview-of-primitives-and-pipelines` section, but instead of
going to ``pipelines``, go to ``pipeline_runs``). The pipeline run is
initially gzipped in the ``primitives`` repo, so decompress it first.
Then open the actual .yml file, look at ``datasets``, and under it
should be ``id``. If you do that for the example pipeline run of the
SKlearn logistic regression primitive that we're looking at for this
exercise, you'll find that the dataset id is ``185_baseball_dataset``.
The name of the main dataset directory is this string, without the
``_dataset`` suffix.
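The id-to-directory rule just described can be captured in a short helper (the function name is made up for illustration):

```python
def dataset_dir_from_id(dataset_id):
    """Main dataset directory name: the dataset id minus '_dataset'."""
    suffix = "_dataset"
    if dataset_id.endswith(suffix):
        return dataset_id[: -len(suffix)]
    return dataset_id

print(dataset_dir_from_id("185_baseball_dataset"))  # 185_baseball
```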
Now, let's actually run the pipeline using the two ways mentioned
earlier.

Specifying all the necessary paths of a dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can use this approach if there is no existing pipeline run yet for a
pipeline, or if you want to manually specify the dataset path (set the
paths for ``--problem``, ``--input``, ``--test-input``,
``--score-input``, and ``--pipeline`` to your target dataset location).
Remember to change the bind mount paths as appropriate for your system
(specified by ``-v``).
.. code:: shell

    docker run \
        --rm \
        -v /home/foo/d3m:/mnt/d3m \
        registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 \
        python3 -m d3m \
        runtime \
        fit-score \
        --problem /mnt/d3m/datasets/training_datasets/seed_datasets_archive/185_baseball/185_baseball_problem/problemDoc.json \
        --input /mnt/d3m/datasets/training_datasets/seed_datasets_archive/185_baseball/TRAIN/dataset_TRAIN/datasetDoc.json \
        --test-input /mnt/d3m/datasets/training_datasets/seed_datasets_archive/185_baseball/TEST/dataset_TEST/datasetDoc.json \
        --score-input /mnt/d3m/datasets/training_datasets/seed_datasets_archive/185_baseball/SCORE/dataset_TEST/datasetDoc.json \
        --pipeline /mnt/d3m/primitives/v2020.1.9/JPL/d3m.primitives.classification.logistic_regression.SKlearn/2019.11.13/pipelines/862df0a2-2f87-450d-a6bd-24e9269a8ba6.json \
        --output /mnt/d3m/pipeline-outputs/predictions.csv \
        --output-run /mnt/d3m/pipeline-outputs/run.yml

The score is displayed after the pipeline run. The output predictions
will be stored at the path specified by ``--output``, and information
about the pipeline run is stored at the path specified by
``--output-run``. Again, you can use the ``-h`` flag on ``fit-score`` to
access the help string and read about the different arguments, as
described earlier.

If you get a python error that complains about missing columns, or
something that looks like this::

    ValueError: Mismatch between column name in data 'version https://git-lfs.github.com/spec/v1' and column name in metadata 'd3mIndex'.

Chances are that the ``185_baseball`` dataset has not yet been
downloaded through git LFS. See the :ref:`previous subsection
<preparing-dataset>` for details on how to verify and fix this.
Using a pipeline run file
~~~~~~~~~~~~~~~~~~~~~~~~~

Instead of specifying all the specific dataset paths, you can also use
an existing pipeline run to essentially "re-run" a previous run of the
pipeline:

.. code:: shell

    docker run \
        --rm \
        -v /home/foo/d3m:/mnt/d3m \
        registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 \
        python3 -m d3m \
        --pipelines-path /mnt/d3m/primitives/v2020.1.9/JPL/d3m.primitives.classification.logistic_regression.SKlearn/2019.11.13/pipelines \
        runtime \
        --datasets /mnt/d3m/datasets \
        fit-score \
        --input-run /mnt/d3m/primitives/v2020.1.9/JPL/d3m.primitives.classification.logistic_regression.SKlearn/2019.11.13/pipeline_runs/pipeline_run.yml.gz \
        --output /mnt/d3m/pipeline-outputs/predictions.csv \
        --output-run /mnt/d3m/pipeline-outputs/run.yml

In this case, ``--input-run`` is the pipeline run file that this
pipeline will re-run, and ``--output-run`` is the new pipeline run file
that will be generated.

Note that if you choose ``fit-score`` for the d3m runtime option, the
pipeline actually runs in two phases: fit and produce. You can verify
this by searching for ``phase`` in the pipeline run file.
Lastly, if you want to run multiple commands in the docker container,
simply chain your commands with ``&&`` and wrap them in double quotes
(``"``) for ``bash -c``. As an example:

.. code:: shell

    docker run \
        --rm \
        -v /home/foo/d3m:/mnt/d3m \
        registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 \
        /bin/bash -c \
        "python3 -m d3m \
        --pipelines-path /mnt/d3m/primitives/v2020.1.9/JPL/d3m.primitives.classification.logistic_regression.SKlearn/2019.11.13/pipelines \
        runtime \
        --datasets /mnt/d3m/datasets/training_datasets/seed_datasets_archive/185_baseball \
        fit-score \
        --input-run /mnt/d3m/primitives/v2020.1.9/JPL/d3m.primitives.classification.logistic_regression.SKlearn/2019.11.13/pipeline_runs/pipeline_run.yml \
        --output /mnt/d3m/pipeline-outputs/predictions.csv \
        --output-run /mnt/d3m/pipeline-outputs/run.yml && \
        head /mnt/d3m/pipeline-outputs/predictions.csv"
Writing a new primitive
-----------------------

Let's now try to write a very simple new primitive - one that simply
passes whatever input data it receives from the previous step to the
next step in the pipeline. Let's call this primitive "Passthrough".

We will use this `skeleton primitive repo
<https://gitlab.com/datadrivendiscovery/docs-quickstart>`__ as a
starting point for this exercise. A d3m primitive repo does not have to
follow the exact same directory structure as this, but this is a good
structure to start with, at least. Git clone the repo into
``docs-quickstart`` at the same place where the other repos that we have
used earlier are located (``datasets``, ``pipeline-outputs``,
``primitives``).

Alternatively, you can also use the `test primitives
<https://gitlab.com/datadrivendiscovery/tests-data/tree/master/primitives>`__
as a model/starting point. ``test_primitives/null.py`` is essentially
the same primitive that we are trying to write.
.. _primitive-source-code:

Primitive source code
~~~~~~~~~~~~~~~~~~~~~

In the ``docs-quickstart`` directory, open
``quickstart_primitives/sample_primitive1/input_to_output.py``. The
first important thing to change here is the primitive metadata, which
are the first objects defined under the ``InputToOutputPrimitive``
class. Modify the following fields (unless otherwise noted, the values
you put in must be strings):

- ``id``: The primitive's UUID v4 identifier. To generate one, you can
  run this simple inline Python command:

  .. code:: shell

      python3 -c "import uuid; print(uuid.uuid4())"

- ``version``: You can use semantic versioning for this, or another
  style of versioning. Write ``"0.1.0"`` for this exercise. You should
  bump the version of the primitive at least every time public
  interfaces of the primitive change (e.g. hyper-parameters).

- ``name``: The primitive's name. Write ``"Passthrough primitive"`` for
  this exercise.

- ``description``: A short description of the primitive. Write ``"A
  primitive which directly outputs the input."`` for this exercise.

- ``python_path``: This follows this format::

      d3m.primitives.<primitive family>.<primitive name>.<kind>

  Primitive families can be found in the `d3m metadata page
  <https://metadata.datadrivendiscovery.org/devel/?definitions#definitions.primitive_family>`__
  (wait a few seconds for the page to load completely), and primitive
  names can be found in the `d3m core package source code
  <https://gitlab.com/datadrivendiscovery/d3m/blob/devel/d3m/metadata/primitive_names.py>`__.
  The last segment can be used to attribute the primitive to the author
  and/or describe in which way it is different from other primitives
  with the same primitive family and primitive name, e.g., a different
  implementation with different trade-offs.

  For this exercise, write
  ``"d3m.primitives.operator.input_to_output.Quickstart"``. Note that
  ``input_to_output`` is not currently registered as a standard
  primitive name, and using it will produce a warning. For primitives
  you intend to publish, make a merge request to the d3m core package
  to add any primitive names you need.

- ``primitive_family``: This must match the primitive family used in
  ``python_path``, given as a string or a Python enumeration value. Add
  this import statement (if not there already):

  .. code:: python

      from d3m.metadata import base as metadata_base

  Then write ``metadata_base.PrimitiveFamily.OPERATOR`` (as a value, not
  a string, so do not put quotation marks) as the value of this field.
- ``algorithm_types``: Algorithm type(s) that the primitive implements.
  This can be multiple values in an array. Values can be chosen from the
  `d3m metadata page
  <https://metadata.datadrivendiscovery.org/devel/?definitions#definitions.algorithm_types>`__
  as well. Write
  ``[metadata_base.PrimitiveAlgorithmType.IDENTITY_FUNCTION]`` here for
  this exercise (as a list that contains one element, not a string).

- ``source``: General info about the author of this primitive. ``name``
  is usually the name of the person or the team that wrote this
  primitive. ``contact`` is a ``mailto`` URI to the email address of
  whoever one should contact about this primitive. ``uris`` are usually
  the git clone URL of the repo, and you can also add the URL of the
  source file of this primitive. Write these for the exercise:

  .. code:: python

      "name": "My Name",
      "contact": "mailto:myname@example.com",
      "uris": ["https://gitlab.com/datadrivendiscovery/docs-quickstart.git"],

- ``keywords``: Keywords for what this primitive is or does. Write
  ``["passthrough"]``.

- ``installation``: Information about how to install this primitive.
  Add these import statements first:

  .. code:: python

      import os.path
      from d3m import utils

  Then replace the ``installation`` entry with this:

  .. code:: python

      "installation": [{
          "type": metadata_base.PrimitiveInstallationType.PIP,
          "package_uri": "git+https://gitlab.com/datadrivendiscovery/docs-quickstart@{git_commit}#egg=quickstart_primitives".format(
              git_commit=utils.current_git_commit(os.path.dirname(__file__))
          ),
      }],

  In general, for your own actual primitives, you might only need to
  substitute the git repo URL here as well as the python egg name.
  428. Next, let's take a look at the ``produce`` method. You can see that it
  429. simply makes a new dataframe out of the input data, and returns it as
  430. the output. To see for ourselves though that our primitive (and thus
  431. this ``produce`` method) gets called during the pipeline run, let's add
  432. a log statement here. The ``produce`` method should now look something
  433. like this:
.. code:: python

    def produce(self, *, inputs: Inputs, timeout: float = None, iterations: int = None) -> base.CallResult[Outputs]:
        self.logger.warning('Hi, InputToOutputPrimitive.produce was called!')

        return base.CallResult(value=inputs)
Note that this is simply an example primitive that is intentionally
simple for the purposes of this tutorial. It does not necessarily model
a well-written primitive, by any means. For guidelines on how to write a
good primitive, take a look at the :ref:`primitive-good-citizen`.

setup.py
~~~~~~~~
Next, we fill in the necessary information in ``setup.py`` so that
``pip`` can correctly install our primitive in our local d3m
environment. Open ``setup.py`` (in the project root), and modify the
following fields:

- ``name``: Same as the egg name you used in ``package_uri``
- ``version``: Same as the primitive metadata's ``version``
- ``description``: Same as the primitive metadata's ``description``,
  or a description of all primitives if there are multiple primitives
  in the package you are making
- ``author``: Same as the primitive metadata's ``source.name``
- ``url``: Same as the main URL in the primitive metadata's
  ``source.uris``
- ``packages``: This is an array of the python packages that this
  primitive repo contains. You can use the ``find_packages`` helper:
  .. code:: python

      packages=find_packages(exclude=['pipelines']),
- ``keywords``: A list of keywords. An important standard keyword is
  ``d3m_primitive``, which makes all primitives discoverable on PyPI
- ``install_requires``: This is an array of the python package
  dependencies of the primitives contained in this repo. Our primitive
  needs nothing except the d3m core package (and the
  ``common-primitives`` package too for testing, but this is not a
  package dependency), so write this as the value of this field:
  ``['d3m']``
- ``entry_points``: This is how the d3m runtime maps your primitives'
  d3m python paths to your repo's local python paths. For this
  exercise, it should look like this:
  .. code:: python

      entry_points={
          'd3m.primitives': [
              'operator.input_to_output.Quickstart = quickstart_primitives.sample_primitive1:InputToOutputPrimitive',
          ],
      }
That's it for this file. Briefly review it for any possible syntax
errors.

Primitive unit tests
~~~~~~~~~~~~~~~~~~~~
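Putting the bullet items above together, a complete ``setup.py`` for this
exercise might look like the sketch below. Treat it as a reference rather
than a drop-in file: ``version``, ``description``, ``author``, and ``url``
are placeholder values that should match your own primitive metadata.

.. code:: python

    from setuptools import setup, find_packages

    setup(
        # Same as the egg name in the installation metadata's package_uri.
        name='quickstart_primitives',
        version='0.1.0',
        description='A sample passthrough primitive',
        author='My Name',
        url='https://gitlab.com/datadrivendiscovery/docs-quickstart.git',
        # All python packages in this repo, excluding the pipelines directory.
        packages=find_packages(exclude=['pipelines']),
        keywords=['d3m_primitive', 'passthrough'],
        # Only the d3m core package is a hard dependency of our primitive.
        install_requires=['d3m'],
        # Maps the d3m python path to this repo's local python path.
        entry_points={
            'd3m.primitives': [
                'operator.input_to_output.Quickstart = quickstart_primitives.sample_primitive1:InputToOutputPrimitive',
            ],
        },
    )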
Let's now make a python test for this primitive, which in this case will
just assert whether the input dataframe to the primitive equals the
output dataframe. Make a new file called ``test_input_to_output.py``
inside ``quickstart_primitives/sample_primitive1`` (the same directory as
``input_to_output.py``), and write this as its contents:
.. code:: python

    import unittest
    import os

    from d3m import container
    from common_primitives import dataset_to_dataframe

    from input_to_output import InputToOutputPrimitive


    class InputToOutputTestCase(unittest.TestCase):
        def test_output_equals_input(self):
            dataset_doc_path = os.path.abspath(os.path.join(os.path.dirname(__file__), '..', '..', 'tests-data', 'datasets', 'timeseries_dataset_1', 'datasetDoc.json'))
            dataset = container.Dataset.load('file://{dataset_doc_path}'.format(dataset_doc_path=dataset_doc_path))

            dataframe_hyperparams_class = dataset_to_dataframe.DatasetToDataFramePrimitive.metadata.get_hyperparams()
            dataframe_primitive = dataset_to_dataframe.DatasetToDataFramePrimitive(hyperparams=dataframe_hyperparams_class.defaults())
            dataframe = dataframe_primitive.produce(inputs=dataset).value

            i2o_hyperparams_class = InputToOutputPrimitive.metadata.get_hyperparams()
            i2o_primitive = InputToOutputPrimitive(hyperparams=i2o_hyperparams_class.defaults())

            output = i2o_primitive.produce(inputs=dataframe).value

            self.assertTrue(output.equals(dataframe))


    if __name__ == '__main__':
        unittest.main()
For the dataset that this test uses, add as a git submodule the `d3m tests-data <https://gitlab.com/datadrivendiscovery/tests-data>`__
repository at the root of the ``docs-quickstart`` repository.
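Assuming your working copy sits at ``/home/foo/d3m/docs-quickstart`` (the
host path used by the Docker commands in this tutorial; adjust it to your
setup), adding the submodule looks like this:

.. code:: shell

    cd /home/foo/d3m/docs-quickstart
    # Clone tests-data into the repo root and record it in .gitmodules.
    git submodule add https://gitlab.com/datadrivendiscovery/tests-data.git tests-data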
Then let's install this new primitive into the Docker image's d3m environment, and
run this test using the command below:
.. code:: shell

    docker run \
        --rm \
        -v /home/foo/d3m:/mnt/d3m \
        registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 \
        /bin/bash -c \
            "pip3 install -e /mnt/d3m/docs-quickstart && \
            cd /mnt/d3m/docs-quickstart/quickstart_primitives/sample_primitive1 && \
            python3 test_input_to_output.py"
You should see a log statement like this, as well as the python unittest
pass message::

    Hi, InputToOutputPrimitive.produce was called!
    .
    ----------------------------------------------------------------------
    Ran 1 test in 0.011s
Using this primitive in a pipeline
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Having seen the primitive test pass, we can now confidently include this
primitive in a pipeline. Let's take the same pipeline that we ran :ref:`before <running-example-pipeline>`
(the sklearn logistic regression's example pipeline),
and add a step using this primitive.
In the root directory of your repository, create these directories:
``pipelines/operator.input_to_output.Quickstart``. Then, from the d3m
``primitives`` repo, copy the JSON pipeline description file from
``primitives/v2020.1.9/JPL/d3m.primitives.classification.logistic_regression.SKlearn/2019.11.13/pipelines``
into the directory we just created. Open this file, and replace the
``id`` (generate another UUID v4 number using the inline python command
earlier, different from the primitive ``id``), as well as the created
timestamp using this inline python command (add ``Z`` at the end of the
generated timestamp)::

    python3 -c "import time; import datetime; \
        print(datetime.datetime.fromtimestamp(time.time()).isoformat())"
You can also rename the JSON file using the new pipeline ``id``.
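If you prefer, both replacement values can be produced with one short
script. This is a sketch: it prints a fresh UUID v4 for the pipeline
``id``, and an ISO timestamp with the ``Z`` suffix already appended for
the created timestamp.

.. code:: python

    import datetime
    import uuid

    # New UUID v4 for the pipeline's "id" field.
    print(uuid.uuid4())

    # ISO timestamp for the created field, with the trailing "Z" appended.
    print(datetime.datetime.utcnow().isoformat() + 'Z')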
Next, change the output step number to be one more than the current
number (at the time of this writing, it is ``4``, i.e.
``"steps.4.produce"``, so in this case, change it to ``5`` as shown
below):
.. code:: json

    "outputs": [
        {
            "data": "steps.5.produce",
            "name": "output predictions"
        }
    ],
Then, find the step that contains the
``d3m.primitives.classification.logistic_regression.SKlearn`` primitive
(search for this string in the file), and right above it, add the
following JSON object. Remember to change ``primitive.id`` to the
primitive's id that you generated in the earlier :ref:`primitive-source-code` subsection.
.. code:: json

    {
        "type": "PRIMITIVE",
        "primitive": {
            "id": "30d5f2fa-4394-4e46-9857-2029ec9ed0e0",
            "version": "0.1.0",
            "python_path": "d3m.primitives.operator.input_to_output.Quickstart",
            "name": "Passthrough primitive"
        },
        "arguments": {
            "inputs": {
                "type": "CONTAINER",
                "data": "steps.2.produce"
            }
        },
        "outputs": [
            {
                "id": "produce"
            }
        ]
    },
Make sure that the step number (``"steps.N.produce"``) in
``arguments.inputs.data`` is correct (one greater than the previous step
and one less than the next step). Do this as well for the succeeding
steps, with the following caveats:

- For ``d3m.primitives.classification.logistic_regression.SKlearn``,
  increment the step number both for ``arguments.inputs.data`` and
  ``arguments.outputs.data`` (at the time of this writing, the number
  should be changed to ``3``).
- For
  ``d3m.primitives.data_transformation.construct_predictions.Common``,
  increment the step number for ``arguments.inputs.data`` (at the time
  of this writing, the number should be changed to ``4``), but do not
  change the one for ``arguments.reference.data`` (the value should
  stay as ``"steps.0.produce"``)
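The renumbering above can also be expressed programmatically. The sketch
below is a toy illustration (the pipeline dict is a hypothetical,
trimmed-down stand-in, not the real pipeline file): references to steps
at or after the insertion index are shifted up by one, later steps that
consumed the upstream output are rewired to consume the passthrough's
output instead, and references to earlier steps (such as
``steps.0.produce``) are left untouched.

.. code:: python

    import re

    INSERT_AT = 2  # index at which the new passthrough step is inserted

    # Hypothetical miniature pipeline: only fields relevant to renumbering.
    pipeline = {
        "outputs": [{"data": "steps.2.produce", "name": "output predictions"}],
        "steps": [
            {"arguments": {"inputs": {"data": "inputs.0"}}},
            {"arguments": {"inputs": {"data": "steps.0.produce"}}},
            {"arguments": {"inputs": {"data": "steps.1.produce"}}},
        ],
    }

    UPSTREAM = "steps.{}.produce".format(INSERT_AT - 1)  # what the new step consumes
    NEW_OUTPUT = "steps.{}.produce".format(INSERT_AT)    # what the new step produces

    def renumber(ref):
        # Pure renumbering: "steps.N.xxx" references with N at or after the
        # insertion index shift up by one; everything else is unchanged.
        match = re.fullmatch(r"steps\.(\d+)\.(\w+)", ref)
        if match and int(match.group(1)) >= INSERT_AT:
            return "steps.{}.{}".format(int(match.group(1)) + 1, match.group(2))
        return ref

    def walk(obj, fn):
        # Recursively apply fn to every string in a nested dict/list structure.
        if isinstance(obj, dict):
            return {key: walk(value, fn) for key, value in obj.items()}
        if isinstance(obj, list):
            return [walk(value, fn) for value in obj]
        if isinstance(obj, str):
            return fn(obj)
        return obj

    pipeline = walk(pipeline, renumber)

    # Insert the passthrough step, consuming the upstream step's output.
    pipeline["steps"].insert(INSERT_AT, {"arguments": {"inputs": {"data": UPSTREAM}}})

    # Rewire later steps that consumed the upstream output so they now read
    # from the passthrough instead (the "increment" rule in the text above).
    for step in pipeline["steps"][INSERT_AT + 1:]:
        step["arguments"] = walk(step["arguments"], lambda ref: NEW_OUTPUT if ref == UPSTREAM else ref)

    print(pipeline["outputs"][0]["data"])  # steps.3.produce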
Generally, you can also programmatically generate a pipeline, as
described in the :ref:`pipeline-description-example`.

Now we can finally run this pipeline that uses our new primitive. In the
command below, modify the pipeline JSON filename in the ``--pipeline``
argument to match the filename of your pipeline file (if you changed it
to the new pipeline id that you generated).
.. code:: shell

    docker run \
        --rm \
        -v /home/foo/d3m:/mnt/d3m \
        registry.gitlab.com/datadrivendiscovery/images/primitives:ubuntu-bionic-python36-v2020.1.9 \
        /bin/bash -c \
            "pip3 install -e /mnt/d3m/docs-quickstart && \
            python3 -m d3m \
                runtime \
                fit-score \
                --problem /mnt/d3m/datasets/training_datasets/seed_datasets_archive/185_baseball/185_baseball_problem/problemDoc.json \
                --input /mnt/d3m/datasets/training_datasets/seed_datasets_archive/185_baseball/TRAIN/dataset_TRAIN/datasetDoc.json \
                --test-input /mnt/d3m/datasets/training_datasets/seed_datasets_archive/185_baseball/TEST/dataset_TEST/datasetDoc.json \
                --score-input /mnt/d3m/datasets/training_datasets/seed_datasets_archive/185_baseball/SCORE/dataset_TEST/datasetDoc.json \
                --pipeline /mnt/d3m/docs-quickstart/pipelines/operator.input_to_output.Quickstart/0f290525-3fec-44f7-ab93-bd778747b91e.json \
                --output /mnt/d3m/pipeline-outputs/predictions_new.csv \
                --output-run /mnt/d3m/pipeline-outputs/run_new.yml"
In the output, you should see the log statement as a warning,
before the score is shown (similar to the text below)::

    ...
    WARNING:d3m.primitives.operator.input_to_output.Quickstart:Hi, InputToOutputPrimitive.produce was called!
    ...
    metric,value,normalized,randomSeed
    F1_MACRO,0.31696136214800263,0.31696136214800263,0
Verify that the old and new ``predictions.csv`` in ``pipeline-outputs``
are the same (you can use ``diff``), as well as the scores in the old
and new ``run.yml`` files (search for ``scores`` in the files).
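For example, with the host paths used throughout this tutorial
(``/home/foo/d3m``; adjust to your setup), the comparison might look
like this sketch:

.. code:: shell

    cd /home/foo/d3m/pipeline-outputs
    # No output from diff means the two prediction files are identical.
    diff predictions.csv predictions_new.csv
    # Show the recorded scores in the two pipeline run files.
    grep -A 10 scores run.yml
    grep -A 10 scores run_new.yml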
Beyond this tutorial
--------------------

Congratulations! You just built your own primitive and you were able to
use it in a d3m pipeline!

Normally, when you build your own primitives, you would proceed to
validating the primitives to be included in the d3m index of all known
primitives. See the `primitives repo README
<https://gitlab.com/datadrivendiscovery/primitives#adding-a-primitive>`__
for details on how to do this.
