.. _metadata:

Metadata for primitives and the values they process
===================================================

Metadata is a core component of any data-based system. This repository
standardizes how we represent metadata in the D3M program and
focuses on three types of metadata:

* metadata associated with primitives
* metadata associated with datasets
* metadata associated with values passed inside pipelines

This repository also standardizes the types of values passed
between primitives in pipelines. While theoretically any value could be
passed between primitives, limiting them to a known set of types makes
primitives more compatible and efficient, and makes values easier for
TA3 systems to introspect.

.. _container_types:

Container types
---------------

All input and output (container) values passed between primitives should
expose a ``Sequence``
`protocol <https://www.python.org/dev/peps/pep-0544/>`__ (a sequence of
samples) and provide a ``metadata`` attribute holding their metadata.

The ``d3m.container`` module exposes the following standard types:

-  ``Dataset`` – a class representing datasets, including D3M datasets,
   implemented in
   :mod:`d3m.container.dataset` module
-  ``DataFrame`` –
   :class:`pandas.DataFrame`
   with support for ``metadata`` attribute, implemented in
   :mod:`d3m.container.pandas` module
-  ``ndarray`` –
   :class:`numpy.ndarray`
   with support for ``metadata`` attribute, implemented in
   :mod:`d3m.container.numpy` module
-  ``List`` – a standard :class:`list` with support for ``metadata``
   attribute, implemented in
   :mod:`d3m.container.list` module

``List`` can be used to create a simple list container.

It is strongly encouraged to use the :class:`~d3m.container.pandas.DataFrame` container type for
primitives which do not have strong reasons to use something else
(:class:`~d3m.container.dataset.Dataset`\ s to operate on initial pipeline input, or optimized
high-dimensional packed data in :class:`~numpy.ndarray`\ s, or :class:`list`\ s to pass as
values to hyper-parameters). This makes it easier to operate just on
columns without type casting while the data is being transformed to make
it useful for models.

When deciding which container type to use for the inputs and outputs of a
primitive, also consider where in the pipeline your primitive is
expected to be placed. Generally, pipelines tend to have primitives
operating on :class:`~d3m.container.dataset.Dataset` at the beginning, then use :class:`~d3m.container.pandas.DataFrame` and
then convert to :class:`~numpy.ndarray`.

.. _data_types:

Data types
----------

Container types can contain values of the following types:

* container types themselves
* Python builtin primitive types:

  * ``str``
  * ``bytes``
  * ``bool``
  * ``float``
  * ``int``
  * ``dict`` (consider using :class:`typing.Dict`, :class:`typing.NamedTuple`, or :ref:`TypedDict <mypy:typeddict>`)
  * ``NoneType``

Metadata
--------

The :mod:`d3m.metadata.base` module provides a
standard Python implementation of metadata objects.

When thinking about metadata, it is useful to keep in mind that metadata
can apply to different contexts:

* primitives
* values being passed between primitives, which we call containers
  (and which are container types); datasets are a special case of a
  container
* parts of data contained inside a container; for example, a cell in a
  table can have its own metadata
Containers and their data can be seen as multi-dimensional structures.
Dimensions can have numeric (arrays) or string indexes (string to value
maps, i.e., dicts). Moreover, even numeric indexes can still have names
associated with each index value, e.g., column names in a table.

If a container type has a concept of *shape*
(:attr:`DataFrame.shape <pandas.DataFrame.shape>`, :attr:`ndarray.shape <numpy.ndarray.shape>`),
dimensions go in that order. For tabular data and existing container
types this means that the first dimension of a container is always
traversing samples (e.g., rows in a table), and the second dimension
columns.

Values can contain other nested values, and metadata dimensions span all
of them down to scalar values. So if a Pandas DataFrame contains
3-dimensional ndarrays, the whole value has 5 dimensions: two for the rows
and columns of the DataFrame (even if there is only one column), and 3
for the array.
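
The dimension count can be illustrated with plain nested lists standing
in for the DataFrame and the arrays. This is only a sketch:
``count_dimensions`` is a hypothetical helper, not part of ``d3m``, and
it assumes every nesting level is a non-empty list.

.. code:: python

    def count_dimensions(value):
        """Count nesting levels by following the first element down to a scalar."""
        depth = 0
        while isinstance(value, list):
            depth += 1
            value = value[0]
        return depth


    # A stand-in for a 3-dimensional array of shape (2, 2, 2).
    array_3d = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]

    # A stand-in for a table with two rows and a single column of such arrays.
    table = [[array_3d], [array_3d]]

    print(count_dimensions(array_3d))  # 3
    print(count_dimensions(table))     # 5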

To indicate which part of the data inside a container a piece of metadata
applies to, we use a *selector*. A selector is a tuple of strings, integers,
or special values. A selector corresponds to a series of ``[...]`` item
getter Python operations on most values, except for Pandas DataFrame
where it corresponds to
:attr:`iloc <pandas.DataFrame.iloc>`
position-based selection.
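
For values that support plain ``[...]`` access, the correspondence
between a selector and item getters can be sketched as follows. The
``resolve`` helper and the sample data are made up for illustration and
are not part of ``d3m``; a DataFrame would need ``iloc`` instead of
plain indexing.

.. code:: python

    from functools import reduce


    def resolve(value, selector):
        """Apply one ``[...]`` item access per selector segment, left to right."""
        return reduce(lambda current, segment: current[segment], selector, value)


    # Made-up container: a string-indexed map holding a small table.
    data = {
        "learningData": [
            [0, "a", 1.5],
            [1, "b", 2.5],
        ],
    }

    print(resolve(data, ("learningData", 1, 2)))  # 2.5
    print(resolve(data, ()) is data)              # True: the empty selector is the whole value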

Special selector values:

-  ``ALL_ELEMENTS`` – makes metadata apply to all elements in a given
   dimension (a wildcard)

Metadata itself is represented as a (potentially nested) dict. If
multiple metadata dicts from different selectors apply to the same
resolved selector location, they are merged in order from
least specific to most specific, with later values overriding earlier ones.
A ``null`` metadata value clears the key set by a less specific selector.
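
The merging rule can be sketched in plain Python. This is a toy model
and not the ``d3m`` implementation: the ``ALL_ELEMENTS`` object below is
a local stand-in, specificity is assumed to be the number of
non-wildcard segments, only top-level keys are merged, and the ``unit``
key is invented for the example.

.. code:: python

    ALL_ELEMENTS = object()  # local stand-in for the wildcard


    def matches(pattern, selector):
        """A stored selector matches when each segment is equal or a wildcard."""
        return len(pattern) == len(selector) and all(
            p is ALL_ELEMENTS or p == s for p, s in zip(pattern, selector)
        )


    def specificity(pattern):
        """Assumption: more non-wildcard segments means more specific."""
        return sum(1 for p in pattern if p is not ALL_ELEMENTS)


    def query(entries, selector):
        """Merge all matching metadata dicts, least specific first."""
        matching = [entry for entry in entries if matches(entry[0], selector)]
        merged = {}
        for _, metadata in sorted(matching, key=lambda entry: specificity(entry[0])):
            for key, value in metadata.items():
                if value is None:
                    merged.pop(key, None)  # None (JSON null) clears the key
                else:
                    merged[key] = value
        return merged


    entries = [
        ((1, 2), {"description": "Measured by hand.", "unit": None}),
        ((ALL_ELEMENTS, 2), {"name": "sepalWidth", "unit": "cm"}),
    ]

    print(query(entries, (0, 2)))  # {'name': 'sepalWidth', 'unit': 'cm'}
    print(query(entries, (1, 2)))  # {'name': 'sepalWidth', 'description': 'Measured by hand.'}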

Example
~~~~~~~

To better understand how metadata is attached to various parts of a
value, consider a `simple tabular D3M
dataset <https://gitlab.com/datadrivendiscovery/tests-data/tree/master/datasets/iris_dataset_1>`__,
which could be represented as a multi-dimensional structure:

.. code:: yaml

    {
      "0": [
        [0, 5.1, 3.5, 1.4, 0.2, "Iris-setosa"],
        [1, 4.9, 3, 1.4, 0.2, "Iris-setosa"],
        ...
      ]
    }

It contains one resource with ID ``"0"`` which is the first dimension
(using strings as index; it is a map not an array), then rows, which is
the second dimension, and then columns, which is the third dimension.
The last two dimensions are numeric.

In Python, accessing the third column of the second row would be
``["0"][1][2]``, which yields the value ``3``. This is also the selector
we would use to attach metadata to that cell. If this metadata is a
description of the cell, we can describe the cell's metadata as a
pair of a selector and a metadata dict:

-  selector: ``["0"][1][2]``
-  metadata:
   ``{"description": "Measured personally by Ronald Fisher."}``

Dataset-level metadata has an empty selector:

-  selector: ``[]``
-  metadata: ``{"id": "iris_dataset_1", "name": "Iris Dataset"}``

To describe the first dimension itself, we set ``dimension`` metadata at the
dataset level (container). ``dimension`` describes the next dimension at
that location in the data structure.

-  selector: ``[]``
-  metadata: ``{"dimension": {"name": "resources", "length": 1}}``

This means that the full dataset-level metadata is now:

.. code:: json

    {
      "id": "iris_dataset_1",
      "name": "Iris Dataset",
      "dimension": {
        "name": "resources",
        "length": 1
      }
    }

To attach metadata to the first (and only) resource, we can do:

-  selector: ``["0"]``
-  metadata:
   ``{"structural_type": "pandas.core.frame.DataFrame", "dimension": {"length": 150, "name": "rows"}}``

``dimension`` describes rows.

Columns dimension:

-  selector: ``["0"][ALL_ELEMENTS]``
-  metadata: ``{"dimension": {"length": 6, "name": "columns"}}``

Observe that there is no requirement that dimensions are aligned from
the perspective of metadata. But in this case they are, so we can use
``ALL_ELEMENTS`` wildcard to describe columns for all rows.

Third column metadata:

-  selector: ``["0"][ALL_ELEMENTS][2]``
-  metadata:
   ``{"name": "sepalWidth", "structural_type": "builtins.str", "semantic_types": ["http://schema.org/Float", "https://metadata.datadrivendiscovery.org/types/Attribute"]}``

Column names belong to each particular column and not to all columns. Using
``name`` can serve to assign a string name to an otherwise numeric
dimension.

We attach names and types to datums themselves and not to dimensions.
Because we use the ``ALL_ELEMENTS`` selector, this is internally stored
efficiently. We see the traditional approach of storing this information in
the header of a column as a special case of an ``ALL_ELEMENTS`` selector.

Note that the name of a column belongs to the metadata because it is
just an alternative way to reference values in an otherwise numeric
dimension. This is different from a case where a dimension has
string-based index (a map/dict) where names of values are part of the
data structure at that dimension. Which approach is used depends on the
structure of the container to which metadata is attached.

The default D3M dataset loader found in this package parses all tabular
values as strings and adds semantic types, if known, describing what those
strings could represent (a float) and their role (an attribute). This
allows primitives later in a pipeline to convert them to proper
structural types but also allows additional analysis on the original values
before such conversion is done.

Fetching all metadata for ``["0"][1][2]`` now returns:

.. code:: json

    {
      "name": "sepalWidth",
      "structural_type": "builtins.str",
      "semantic_types": [
        "http://schema.org/Float",
        "https://metadata.datadrivendiscovery.org/types/Attribute"
      ],
      "description": "Measured personally by Ronald Fisher."
    }

.. _metadata_api:

API
~~~

The :mod:`d3m.metadata.base` module provides two
classes for storing metadata on values: :class:`~d3m.metadata.base.DataMetadata` for
data values, and :class:`~d3m.metadata.base.PrimitiveMetadata` for primitives. It also exposes an
:const:`~d3m.metadata.base.ALL_ELEMENTS` constant to be used in selectors.

The public methods available on these classes are documented in their
code. Some of the main ones are:

-  ``__init__(metadata)`` – constructs a new instance of the metadata
   class and optionally initializes it with top-level metadata
-  ``update(selector, metadata)`` – updates metadata at a given location
   in data structure identified by a selector
-  ``query(selector)`` – retrieves metadata at a given location
-  ``query_with_exceptions(selector)`` – retrieves metadata at a given
   location, but also returns metadata for selectors which have metadata
   which differs from that of ``ALL_ELEMENTS``
-  ``remove(selector)`` – removes metadata at a given location
-  ``get_elements(selector)`` – lists element names which exist at a
   given location
-  ``to_json()`` – converts metadata to a JSON representation
-  ``pretty_print()`` – pretty-print all metadata

``PrimitiveMetadata`` differs from ``DataMetadata`` in that it does not
accept a selector in its methods, because there is no structure in
primitives.

Standard metadata keys
~~~~~~~~~~~~~~~~~~~~~~

You can use custom keys for metadata, but the following keys are
standardized, so you should use those if you are trying to represent the
same metadata:
https://metadata.datadrivendiscovery.org/schemas/v0/definitions.json

The same key always has the same meaning, and we reuse the same key in
different contexts when we need the same meaning. So instead of having
both ``primitive_name`` and ``dataset_name`` we have just ``name``.

Different keys are expected in different contexts:

-  ``primitive`` –
   https://metadata.datadrivendiscovery.org/schemas/v0/primitive.json
-  ``container`` –
   https://metadata.datadrivendiscovery.org/schemas/v0/container.json
-  ``data`` –
   https://metadata.datadrivendiscovery.org/schemas/v0/data.json

A more user-friendly visualization of the schemas listed above is available
at https://metadata.datadrivendiscovery.org/.

Contribute: Standardizing metadata schemas is an ongoing process. Feel
free to contribute suggestions and merge requests with improvements.

.. _primitive-metadata:

Primitive metadata
~~~~~~~~~~~~~~~~~~

Part of a primitive's metadata can be obtained automatically from the
primitive's code and some can be computed by evaluating the primitive,
but some has to be provided by the primitive's author. Details of which
metadata is currently standardized and which values are possible can be
found in the primitive's JSON schema. This section describes the
author-provided metadata in more detail. Here is an example of metadata
provided by an author, taken from the `Monomial test
primitive <https://gitlab.com/datadrivendiscovery/tests-data/blob/master/primitives/test_primitives/monomial.py#L32>`__
and slightly modified:

.. code:: python

    metadata = metadata_module.PrimitiveMetadata({
        'id': '4a0336ae-63b9-4a42-860e-86c5b64afbdd',
        'version': '0.1.0',
        'name': "Monomial Regressor",
        'keywords': ['test primitive'],
        'source': {
            'name': 'Test team',
            'uris': [
                'https://gitlab.com/datadrivendiscovery/tests-data/blob/master/primitives/test_primitives/monomial.py',
                'https://gitlab.com/datadrivendiscovery/tests-data.git',
            ],
        },
        'installation': [{
            'type': metadata_module.PrimitiveInstallationType.PIP,
            'package_uri': 'git+https://gitlab.com/datadrivendiscovery/tests-data.git@{git_commit}#egg=test_primitives&subdirectory=primitives'.format(
                git_commit=utils.current_git_commit(os.path.dirname(__file__)),
            ),
        }],
        'location_uris': [
            'https://gitlab.com/datadrivendiscovery/tests-data/raw/{git_commit}/primitives/test_primitives/monomial.py'.format(
                git_commit=utils.current_git_commit(os.path.dirname(__file__)),
            ),
        ],
        'python_path': 'd3m.primitives.test.MonomialPrimitive',
        'algorithm_types': [
            metadata_module.PrimitiveAlgorithmType.LINEAR_REGRESSION,
        ],
        'primitive_family': metadata_module.PrimitiveFamily.REGRESSION,
    })

-  A primitive's metadata provided by an author is defined as a class
   attribute which is an instance of :class:`~d3m.metadata.base.PrimitiveMetadata`.
-  When the class is defined, it is automatically analyzed and its metadata
   is extended with values obtained automatically from the class code.
-  ``id`` can be simply generated using :func:`uuid.uuid4` in Python and
   should never change. **Do not reuse IDs and do not use the ID from
   this example.**
-  When the primitive's code changes you should update the version, which
   should be `PEP 440 <https://www.python.org/dev/peps/pep-0440/>`__
   compatible. Consider updating the version every time you change code,
   potentially using `semantic versioning <https://semver.org/>`__, but
   none of this is enforced.
-  ``name`` is a human-friendly name of the primitive.
-  ``keywords`` can be anything you want to convey to users of the
   primitive and which could help with the primitive's discovery.
-  ``source`` describes where the primitive is coming from. The required
   value is ``name``, which identifies the author, but you might
   also be interested in ``contact`` where you can put an e-mail like
   ``mailto:author@example.com`` as a way to contact the author.
   ``uris`` can be anything. In the example above, one points to the code
   in GitLab, and another to the repo. If there is a website for the
   primitive, you might want to add it here as well. These URIs are not
   really meant for automatic consumption but serve more as a reference.
   See ``location_uris`` for URIs to the code.
-  ``installation`` is important because it describes how your
   primitive can be automatically installed. Entries are installed in order
   and currently the following types of entries are supported:
-  A ``PIP`` package available on PyPI or some other package registry:

   .. code:: python

       {
           'type': metadata_module.PrimitiveInstallationType.PIP,
           'package': 'my-primitive-package',
           'version': '0.1.0',
       }

-  A ``PIP`` package available at some URI. If this is a git repository,
   then an exact git hash and ``egg`` name should be provided. The ``egg``
   name should match the name of the installed package. This creates a
   chicken-and-egg problem: committing the hash of a code version
   changes that hash. To work around it, you can use a helper utility
   function which provides the hash automatically at runtime. The
   ``subdirectory`` part of the URI suffix is not necessary and is here
   just because this particular primitive happens to reside in a
   subdirectory of the repository.
-  A ``DOCKER`` image which should run while the primitive is operating.
   Starting and stopping the Docker container is managed by the caller,
   which passes information about the running container through the
   primitive's ``docker_containers`` ``__init__`` argument. The argument
   is a mapping between the ``key`` value and the address and ports at
   which the running container is available. See `Sum test
   primitive <https://gitlab.com/datadrivendiscovery/tests-data/blob/master/primitives/test_primitives/sum.py#L66>`__
   for an example:

   .. code:: python

       {
           'type': metadata_module.PrimitiveInstallationType.DOCKER,
           'key': 'summing',
           'image_name': 'registry.gitlab.com/datadrivendiscovery/tests-data/summing',
           'image_digest': 'sha256:07db5fef262c1172de5c1db5334944b2f58a679e4bb9ea6232234d71239deb64',
       }

-  A ``UBUNTU`` entry can be used to describe a system library or
   package required for installation or operation of your primitive. If
   your other dependencies require a system library to be installed
   before they can be installed, list this entry before them in
   ``installation`` list.

   .. code:: python

       {
           'type': metadata_module.PrimitiveInstallationType.UBUNTU,
           'package': 'ffmpeg',
           'version': '7:3.3.4-2',
       }

-  A ``FILE`` entry allows a primitive to specify a static file
   dependency which should be provided by the caller to the primitive.
   The caller passes the path of the downloaded file
   through the primitive's ``volumes`` ``__init__`` argument. The argument
   is a mapping between the ``key`` value and the file path. The filename
   portion of the provided path does not necessarily match the filename
   portion of the file's URI.

   .. code:: python

       {
           'type': metadata_module.PrimitiveInstallationType.FILE,
           'key': 'model',
           'file_uri': 'http://mmlab.ie.cuhk.edu.hk/datasets/comp_cars/googlenet_finetune_web_car_iter_10000.caffemodel',
           'file_digest': '6bdf72f703a504cd02d7c3efc6c67cbbaf506e1cbd9530937db6a698b330242e',
       }

-  A ``TGZ`` entry allows a primitive to specify a static directory
   dependency which should be provided by a caller to a primitive.
   Caller passes information about the directory path of downloaded and
   extracted file through primitive's ``volumes`` ``__init__`` argument.
   The argument is a mapping between the ``key`` value and directory
   path.

   .. code:: python

       {
           'type': metadata_module.PrimitiveInstallationType.TGZ,
           'key': 'mails',
           'file_uri': 'https://www.cs.cmu.edu/~enron/enron_mail_20150507.tar.gz',
           'file_digest': 'b3da1b3fe0369ec3140bb4fbce94702c33b7da810ec15d718b3fadf5cd748ca7',
       }

-  If you can provide it, ``location_uris`` should point to the exact code
   used by the primitive. This code can also be obtained by installing the
   primitive, but it can be helpful to have an online resource as well.
-  ``python_path`` is the path under which the primitive will get mapped
   through ``setup.py`` entry points. It is very important to keep the
   two in sync.
-  ``algorithm_types`` and ``primitive_family`` help with discovery of a
   primitive. They are required, and if suitable values are not available
   for you, make a merge request and propose new values. As you can see in
   the code here and in the ``installation`` entries, you can directly use
   Python enumerations to populate these values.
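
Some of the values above can be produced with the standard library
alone. The sketch below generates an ``id`` with :func:`uuid.uuid4` and
computes a digest for a ``FILE`` or ``TGZ`` entry. It assumes that
``file_digest`` is a hex-encoded SHA-256 digest, which the 64-character
examples above suggest but this document does not state.
``new_primitive_id`` and ``sha256_digest`` are hypothetical helper
names.

.. code:: python

    import hashlib
    import uuid


    def new_primitive_id():
        """Return a fresh random UUID string; generate it once, then hard-code it."""
        return str(uuid.uuid4())


    def sha256_digest(path, chunk_size=1 << 20):
        """Hex SHA-256 digest of a file, read in chunks to bound memory use."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

The generated ID should be pasted into the metadata as a literal string
rather than computed at import time, since it must never change.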

Some other metadata you might want to provide to help callers
use your primitive better are ``preconditions`` (what preconditions
should hold on data for the primitive to operate well), ``effects`` (what
changes the primitive makes to data), and a ``hyperparams_to_tune`` hint
to help callers know which hyper-parameters are most important to focus
on.

Primitive metadata also includes descriptions of a primitive and its
methods. These descriptions are automatically obtained from the primitive's
docstrings. Docstrings should be written according to the :ref:`numpy docstring
format <numpy:format>`
(`examples <https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_numpy.html>`__).

Data metadata
~~~~~~~~~~~~~

Every value passed around a pipeline has metadata associated with it.
The defined container types have a ``metadata`` attribute to hold it.
The API available to manipulate metadata is still evolving because many
operations one can do on data are also reasonable on metadata (e.g.,
slicing and combining data). Currently, every operation on data clears
and re-initializes the associated metadata.

    **Note:** While part of a primitive's metadata is obtained
    automatically, nothing like that is currently done for data metadata.
    This means one has to populate dimension and typing
    information manually. This will be improved in the future with automatic
    extraction of this metadata from data.

Parameters
----------

A base class to be subclassed and used as the type for the :class:`~d3m.metadata.params.Params` type
argument in primitive interfaces can be found in the
:mod:`d3m.metadata.params` module. An
instance of this subclass should be returned from the primitive's
:meth:`~d3m.primitive_interfaces.base.PrimitiveBase.get_params` method and accepted by :meth:`~d3m.primitive_interfaces.base.PrimitiveBase.set_params`.

To define the parameters a primitive has, you should subclass this base class
and define parameters as class attributes with type annotations.
Example:

.. code:: python

    import numpy
    from d3m.metadata import params

    class Params(params.Params):
        weights: numpy.ndarray
        bias: float

The :class:`~d3m.metadata.params.Params` class is just a fancy Python dict which checks the types of
parameters and requires all of them to be set. You can create it like:

.. code:: python

    weights = numpy.array([0.5, 1.5])  # example values
    ps = Params({'weights': weights, 'bias': 0.1})
    ps['bias']

::

    0.1

``weights`` and ``bias`` do not exist as attributes on the class or
instance. In the class definition, they are just type annotations to
configure which parameters are there.
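
To make the "fancy dict" behaviour concrete, here is a toy re-creation in
plain Python. It is a sketch of the idea only, not how ``d3m``
implements :class:`~d3m.metadata.params.Params`: ``TypedParams`` and
``MyParams`` are hypothetical names, and the type check only handles
plain classes such as ``list`` and ``float``.

.. code:: python

    from typing import get_type_hints


    class TypedParams(dict):
        """Toy dict that validates keys and value types against class annotations."""

        def __init__(self, values):
            hints = get_type_hints(type(self))
            missing = set(hints) - set(values)
            extra = set(values) - set(hints)
            if missing or extra:
                raise ValueError(f"missing={sorted(missing)}, unexpected={sorted(extra)}")
            for name, expected in hints.items():
                if not isinstance(values[name], expected):
                    raise TypeError(f"{name!r} must be {expected.__name__}")
            super().__init__(values)


    class MyParams(TypedParams):
        weights: list
        bias: float


    ps = MyParams({"weights": [0.5, 1.5], "bias": 0.1})
    print(ps["bias"])           # 0.1
    print(hasattr(ps, "bias"))  # False: annotations alone do not create attributes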

    **Note:** :class:`~d3m.metadata.params.Params` class uses ``parameter_name: type`` syntax
    while :class:`~d3m.metadata.hyperparams.Hyperparams` class uses
    ``hyperparameter_name = Descriptor(...)`` syntax. Do not confuse
    them.

.. _hyperparameters:

Hyper-parameters
----------------

A base class for hyper-parameters description for primitives can be
found in the
:mod:`d3m.metadata.hyperparams` module.

To define a hyper-parameters space you should subclass this base class
and define hyper-parameters as class attributes. Example:

.. code:: python

    from d3m.metadata import hyperparams

    class Hyperparams(hyperparams.Hyperparams):
        learning_rate = hyperparams.Uniform(lower=0.0, upper=1.0, default=0.001, semantic_types=[
            'https://metadata.datadrivendiscovery.org/types/TuningParameter'
        ])
        clusters = hyperparams.UniformInt(lower=1, upper=100, default=10, semantic_types=[
            'https://metadata.datadrivendiscovery.org/types/TuningParameter'
        ])

To access hyper-parameters space configuration, you can now call:

.. code:: python

    Hyperparams.configuration

::

    OrderedDict([('learning_rate', Uniform(lower=0.0, upper=1.0, q=None, default=0.001)), ('clusters', UniformInt(lower=1, upper=100, default=10))])

To get a random sample of all hyper-parameters, call:

.. code:: python

    hp1 = Hyperparams.sample(random_state=42)

::

    Hyperparams({'learning_rate': 0.3745401188473625, 'clusters': 93})

To get an instance with all default values:

.. code:: python

    hp2 = Hyperparams.defaults()

::

    Hyperparams({'learning_rate': 0.001, 'clusters': 10})

:class:`~d3m.metadata.hyperparams.Hyperparams` class is just a fancy read-only Python dict. You can
also manually create its instance:

.. code:: python

    hp3 = Hyperparams({'learning_rate': 0.01, 'clusters': 20})
    hp3['learning_rate']

::

    0.01

If you want to use most of the default values but override some, you can
use this dict-construction approach:

.. code:: python

    hp4 = Hyperparams(Hyperparams.defaults(), clusters=30)

::

    Hyperparams({'learning_rate': 0.001, 'clusters': 30})

There is no class- or instance-level attribute ``learning_rate`` or
``clusters``. In the class definition, they were used only for defining
the hyper-parameters space; those attributes were then extracted and
put into the ``configuration`` attribute.
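
The extraction of class attributes into ``configuration`` can be
imitated in a few lines of plain Python. This is a toy model under
stated assumptions, not the ``d3m`` implementation: ``Descriptor``,
``SpaceBase`` and ``Space`` are hypothetical names, the descriptor only
stores a default, and hyper-parameters inherited from a parent space
are not handled.

.. code:: python

    from collections import OrderedDict


    class Descriptor:
        """Toy hyper-parameter descriptor that only stores a default value."""

        def __init__(self, default):
            self.default = default


    class SpaceBase(dict):
        """Toy dict whose subclasses declare descriptors as class attributes."""

        configuration = OrderedDict()

        def __init_subclass__(cls, **kwargs):
            super().__init_subclass__(**kwargs)
            found = OrderedDict(
                (name, value) for name, value in vars(cls).items()
                if isinstance(value, Descriptor)
            )
            for name in found:
                delattr(cls, name)  # the attribute no longer exists on the class
            cls.configuration = found

        @classmethod
        def defaults(cls):
            """Build an instance holding every descriptor's default value."""
            return cls({name: d.default for name, d in cls.configuration.items()})


    class Space(SpaceBase):
        learning_rate = Descriptor(default=0.001)
        clusters = Descriptor(default=10)


    print(list(Space.configuration))        # ['learning_rate', 'clusters']
    print(hasattr(Space, "learning_rate"))  # False
    print(Space.defaults())                 # {'learning_rate': 0.001, 'clusters': 10}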

There are four types of hyper-parameters:

* tuning parameters which should be tuned during the hyper-parameter
  optimization phase
* control parameters which should be determined during the pipeline
  construction phase and are part of the logic of the pipeline
* parameters which control the use of resources by the primitive
* parameters which control which meta-features are computed by the
  primitive

You can use hyper-parameter's semantic type to differentiate between
those types of hyper-parameters using the following URIs:

* https://metadata.datadrivendiscovery.org/types/TuningParameter
* https://metadata.datadrivendiscovery.org/types/ControlParameter
* https://metadata.datadrivendiscovery.org/types/ResourcesUseParameter
* https://metadata.datadrivendiscovery.org/types/MetafeatureParameter

Once you define a :class:`~d3m.metadata.hyperparams.Hyperparams` class for your primitive you can pass
it as a class type argument in your primitive's class definition:

.. code:: python

    class MyPrimitive(SupervisedLearnerPrimitiveBase[Inputs, Outputs, Params, Hyperparams]):
        ...

Those class type arguments are then automatically extracted from the
class definition and made part of the primitive's metadata. This allows the
caller to access the :class:`~d3m.metadata.hyperparams.Hyperparams` class to create an instance to pass
to the primitive's constructor:

.. code:: python

    hyperparams_class = MyPrimitive.metadata.get_hyperparams()
    primitive = MyPrimitive(hyperparams=hyperparams_class.defaults())

    **Note:** :class:`~d3m.metadata.hyperparams.Hyperparams` class uses
    ``hyperparameter_name = Descriptor(...)`` syntax while :class:`~d3m.metadata.params.Params`
    class uses ``parameter_name: type`` syntax. Do not confuse them.

Problem description
-------------------

The :mod:`d3m.metadata.problem` module provides
a parser which loads a problem description into a normalized Python object.

You can load a problem description and get the loaded object dumped back
by running:

.. code:: bash

    python3 -m d3m problem describe <path to problemDoc.json>

Dataset
-------

This package also provides a Python class to load and represent datasets
in Python in the :mod:`d3m.container.dataset`
module. This container value can serve as an input to the whole pipeline
and be used as input for primitives which operate on a dataset as a
whole. It allows one to register multiple loaders to support different
formats of datasets. You pass a URI to a dataset and it automatically
picks the right loader. By default it supports:

-  D3M dataset. Only ``file://`` URI scheme is supported and URI should
   point to the ``datasetDoc.json`` file. Example:
   ``file:///path/to/datasetDoc.json``
-  CSV file. Many URI schemes are supported, including remote ones like
   ``http://``. URI should point to a file with ``.csv`` extension.
   Example: ``http://example.com/iris.csv``
-  Sample datasets from :mod:`sklearn.datasets`.
   Example: ``sklearn://boston``
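
The idea of picking a loader by URI can be sketched with a small
registry. Everything here is hypothetical and only illustrates the
dispatch pattern: the class names, the ``can_load``/``load`` methods as
written, and the returned placeholder strings are not the ``d3m``
loaders, which return real dataset objects.

.. code:: python

    from urllib.parse import urlparse


    class D3MLoader:
        """Accepts ``file://`` URIs pointing to a ``datasetDoc.json`` file."""

        def can_load(self, uri):
            parsed = urlparse(uri)
            return parsed.scheme == "file" and parsed.path.endswith("/datasetDoc.json")

        def load(self, uri):
            return f"D3M dataset described by {uri}"


    class CSVLoader:
        """Accepts any URI whose path ends with ``.csv``."""

        def can_load(self, uri):
            return urlparse(uri).path.endswith(".csv")

        def load(self, uri):
            return f"CSV table from {uri}"


    LOADERS = [D3MLoader(), CSVLoader()]


    def load_dataset(uri):
        """Return the result of the first registered loader that accepts the URI."""
        for loader in LOADERS:
            if loader.can_load(uri):
                return loader.load(uri)
        raise ValueError(f"No registered loader can handle {uri!r}")


    print(load_dataset("file:///path/to/datasetDoc.json"))
    print(load_dataset("http://example.com/iris.csv"))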

You can load a dataset and get the loaded object dumped back by running:

.. code:: bash

    python3 -m d3m dataset describe <path to the dataset file>