[DOC] Update docs

3 years ago · 341c716901
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -26,12 +26,19 @@ Document Structure
   Introduction <start/intro.rst>
   Quick Start <start/quick.rst>
   Installation Guide <start/install.rst>
   Workflow <start/workflow.rst>
   Examples <start/performance.rst>
   Experiments and Examples <start/performance.rst>

 .. toctree::
   :maxdepth: 3
   :caption: COMPONENTS:
   :caption: WORKFLOWS:

   Learnware Preparation and Submission <workflow/submit.rst>
   Helpful Learnwares Identification <workflow/identify.rst>
   Learnwares Reuse <workflow/reuse.rst>

 .. toctree::
   :maxdepth: 3
   :caption: MAIN COMPONENTS:

   Market <components/market.rst>
   Learnware <components/learnware.rst>
--- a/docs/start/get_start_examples.rst
+++ b/docs/start/get_start_examples.rst
@@ -1,63 +0,0 @@
 .. _examples:
 ================================
 Experiments & Get Start Examples
 ================================

 This chapter will introduce related experiments to illustrate the search and reuse performance of our learnware system.

 ================
 Environment
 ================
 For all experiments, we used a single linux server. Details on the specifications are listed in the table below. All processors were used for training and evaluating.

 ====================  ====================  ===============================
 System                GPU                   CPU
 ====================  ====================  ===============================
 Ubuntu 20.04.4 LTS    Nvidia Tesla V100S    Intel(R) Xeon(R) Gold 6240R
 ====================  ====================  ===============================


 ================
 Experiments
 ================

 Datasets
 ================
 We designed experiments on three publicly available datasets, namely `Prediction Future Sales (PFS) <https://www.kaggle.com/c/competitive-data-science-predict-future-sales/data>`_,
 `M5 Forecasting (M5) <https://www.kaggle.com/competitions/m5-forecasting-accuracy/data>`_ and `CIFAR 10 <https://www.cs.toronto.edu/~kriz/cifar.html>`_.
 For the two sales forecasting data sets of PFS and M5, we divide the user data according to different stores, and train the Ridge model and LightGBM model on the corresponding data respectively.
 For the CIFAR10 image classification task, we first randomly pick 6 to 10 categories, and randomly select 800 to 2000 samples from each category from the categories corresponding to the training set, constituting a total of 50 different uploaders.
 For test users, we first randomly pick 3 to 6 categories, and randomly select 150 to 350 samples from each category from the corresponding categories from the test set, constituting a total of 20 different users.

 We tested the efficiency of the specification generation and the accuracy of the search and reuse model respectively.
 The evaluation index on PFS and M5 data is RMSE, and the evaluation index on CIFAR10 classification task is classification accuracy

 Results
 ================

 The time-consuming specification generation is shown in the table below:

 ====================  ====================  =================================
 Dataset               Data Dimensions       Specification Generation Time (s)
 ====================  ====================  =================================
 PFS
 M5
 CIFAR10               9000*3*32*32          7~10
 ====================  ====================  =================================

 The accuracy of search and reuse is shown in the table below:

 ====================  ==================== ================================= =================================
 Dataset               Top-1 Performance    Job Selector Reuse                Average Ensemble Reuse
 ====================  ==================== ================================= =================================
 PFS
 M5
 CIFAR10                 0.619 +/- 0.138    0.585 +/- 0.056                    0.715 +/- 0.075
 ====================  ==================== ================================= =================================

 =========================
 Get Start Examples
 =========================
 Examples for `PFS, M5` and `CIFAR10` are available at [xxx]. You can run { main.py } directly to reproduce related experiments.
 The test code is mainly composed of three parts, namely data preparation (optional), specification generation and market construction, and search test.
 You can load data prepared by as and skip the data preparation step.
--- a/docs/start/intro.rst
+++ b/docs/start/intro.rst
@@ -50,7 +50,7 @@ Developers or owners of trained machine learning models can submit their models
 Instead of building a model from scratch, users can submit their requirements to the learnware market, which then identifies and deploys helpful learnware(s) based on the specifications. Users can apply the learnware directly, adapt it using their data, or exploit it in other ways to improve their model. This process is more efficient and less expensive than building a model from scratch.

 Benefits of the Learnware Paradigm
 =======================
 ==============================================

 +-----------------------+-----------------------------------------------------------------------------------------------+
 | Benefit               | Description                                                                                   |
@@ -77,7 +77,7 @@ Benefits of the Learnware Paradigm
 +-----------------------+-----------------------------------------------------------------------------------------------+

 Challenges and Future Work
 =======================
 ==============================================

 Although the learnware proposal shows promise, much work remains to make it a reality. The next sections will present some of the progress made so far.

--- a/docs/start/performance.rst
+++ b/docs/start/performance.rst
@@ -1,6 +1,6 @@
 .. _performance:
 ================================
 Experiments
 Experiments and Examples
 ================================

 This chapter will introduce related experiments to illustrate the search and reuse performance of our learnware system.
--- a/docs/start/workflow.rst
+++ b/docs/start/workflow.rst
@@ -1,339 +0,0 @@
 ================
 Workflow
 ================


 Submit learnware
 =================

 In this section, a detailed guide on how to submit your own learnware to the Learnware Market will be provided.
 Specifically, we will first elaborate on the components that a valid learnware file should include, then explain
 how learnwares are uploaded and removed within ``Learnware Market``.


 Prepare Learnware
 -------------------

 A valid learnware is a zipfile which consists of four essentia parts. Here we demonstrate the detail format of a learnware zipfile.

 ``__init__.py``
 ^^^^^^^^^^^^^^^

 In ``Learnware Market``, each uploader is required to provide a set of unified interfaces for their model, 
 which enables convenient usage by future users.
 ``__init__.py`` is the python file offering interfaces for your model's fitting, predicting and fine-tuning. For example,
 the code snippet below trains and saves a SVM model for a sample dataset on sklearn digits classification:

 .. code-block:: python

    import joblib
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split

    X,y = load_digits(return_X_y=True) 
    data_X, _, data_y, _ = train_test_split(X, y, test_size=0.3, shuffle=True)

    # input dimension: (64, ), output dimension: (10, )
    clf = svm.SVC(kernel="linear", probability=True)
    clf.fit(data_X, data_y)

    joblib.dump(clf, "svm.pkl") # model is stored as file "svm.pkl"


 Then the ``__init__.py`` for this SVM model should be as follows:

 .. code-block:: python
    
    import os
    import joblib
    import numpy as np
    from learnware.model import BaseModel


    class SVM(BaseModel):
        def __init__(self):
            super(SVM, self).__init__(input_shape=(64,), output_shape=(10,))
            dir_path = os.path.dirname(os.path.abspath(__file__))
            self.model = joblib.load(os.path.join("svm.pkl"))

        def fit(self, X: np.ndarray, y: np.ndarray):
            pass

        def predict(self, X: np.ndarray) -> np.ndarray:
            return self.model.predict_proba(X)

        def finetune(self, X: np.ndarray, y: np.ndarray):
            pass
    
 As a kind reminder, don't forget to fill in ``input_shape`` and ``output_shape`` corresponding to the model 
 (in our sklearn digits classification example, (64,) and (10,) respectively).


 ``stat.json``
 ^^^^^^^^^^^^^^^

 In order to better match users with learnwares suitable for their tasks, 
 we do need the information of your training dataset. Specifically, you need to provide a statistical specification 
 stored as a json file, e.g., ``stat.json``, which contains statistical information of the dataset. 
 This json file is all we required regarding your training data, and there is no need for you to upload your own data.

 Statistical specification can have many implementation approaches. 
 If Reduced Kernel Mean Embedding (RKME) is chosen to be as statistical specification, 
 the following code snippet provides guidance on how to build and store the RKME of a dataset:

 .. code-block:: python
    
    import learnware.specification as specification
    
    # generate rkme specification for digits dataset
    spec = specification.utils.generate_rkme_spec(X=data_X, gamma=0.1, cuda_idx=0)
    spec.save(os.path.join("stat.json"))


 ``learnware.yaml``
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 We also require you to prepare a configuration file in YAML format,
 which describes your model class name, type of statistical specification(e.g. Reduced Kernel Mean Embedding, ``RKMEStatSpecification``), and 
 the file name of your statistical specification file. The following ``learnware.yaml`` demonstrates 
 how your learnware configuration file should be structured based on our previous example:

 .. code-block:: yaml

    model:
      class_name: SVM
      kwargs: {}
    stat_specifications:
      - module_path: learnware.specification
        class_name: RKMEStatSpecification
        file_name: stat.json
        kwargs: {}  


 ``environment.yaml``
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 In this YAML file, you need to specify the conda environment configuration for running your model 
 (if the model environment is incompatible, you can rely on this for manual configuration). 
 You can generate this file according to the following steps:

 - Create env config for conda:

    .. code-block::

        conda env export | grep -v "^prefix: " > environment.yaml
        
 - Recover env from config:

    .. code-block::

        conda env create -f environment.yaml


 Upload Learnware 
 -------------------

 Once you have prepared the four required files mentioned above, 
 you can package them as your own learnware zipfile. Combined with the generated semantic specification that 
 briefly describes the features of your task and model (Please refer to :ref:`semantic_specification` for more details), 
 you can easily upload your learnware to the ``Learnware Market`` with a single line of code:

 .. code-block:: python

    import learnware
    from learnware.market import EasyMarket

    learnware.init()
    
    # EasyMarket: most basic set of functions in a Learnware Market
    easy_market = EasyMarket(market_id="demo", rebuild=True) 
    
    # single line uploading
    easy_market.add_learnware(zip_path, semantic_spec) 

 Here, ``zip_path`` is the directory of your learnware zipfile.


 Remove Learnware
 ------------------

 As ``Learnware Market`` administrators, it is necessary to remove learnwares with suspicious uploading motives.
 With required permissions and approvals, you can use the following code to remove a learnware 
 from the ``Learnware Market``:

 .. code-block:: python

    easy_market.delete_learnware(learnware_id)

 Here,  ``learnware_id`` is the market ID of the learnware to be removed.



 Identify helpful learnware
 ===========================

 When a user comes with her requirements, the market should identify helpful learnwares and recommend them to the user.
 The search of helpful learnwares is based on the user information, and can be divided into two stages: semantic specification search and statistical specification search.

 User information
 -------------------------------
 The user should provide her requirements in ``BaseUserInfo``. The class ``BaseUserInfo`` consists of user's semantic specification ``user_semantic`` and statistical information ``stat_info``. 

 The semantic specification ``user_semantic`` is stored in a ``dict``, with keywords 'Data', 'Task', 'Library', 'Scenario', 'Description' and 'Name'. An example is shown below, and you could choose their values according to the figure. For the keys of type 'Class, you should choose one ilegal value; for the keys of type 'Tag', you can choose one or more values; for the keys of type 'String', you should provide a string; the key 'Description' is used in learnwares' semantic specifications and is ignored in user semantic specification; the values of all these keys can be empty if the user have no idea of them.

 .. code-block:: python

    # An example of user_semantic
    user_semantic = {
        "Data": {"Values": ["Image"], "Type": "Class"},
        "Task": {"Values": ["Classification"], "Type": "Class"},
        "Library": {"Values": ["Scikit-learn"], "Type": "Tag"},
        "Scenario": {"Values": ["Education"], "Type": "Class"},
        "Description": {"Values": "", "Type": "String"},
        "Name": {"Values": "digits", "Type": "String"},
    }

 .. _semantic_specification:

 .. figure: ..\_static\img\semantic_spec.png
   :alt: Semantic Specification
   :align: center

 引用方式 :ref:`semantic_specification` 。


 The user's statistical information ``stat_info`` is stored in a ``json`` file, e.g., ``stat.json``. The generation of this file is seen in `这是一个语义规约生成的链接`_.



 Semantic Specification Search
 -------------------------------
 To search for learnwares that fit your task purpose, 
 the user should first provide a semantic specification ``user_semantic`` that describes the characteristics of your task.
 The Learnware Market will perform a first-stage search based on ``user_semantic``,
 identifying potentially helpful leranwares whose models solve tasks similar to your requirements. 

 .. code-block:: python

    # construct user_info which includes semantic specification for searching learnware
    user_info = BaseUserInfo(semantic_spec=semantic_spec)

    # search_learnware performs semantic specification search if user_info doesn't include a statistical specification
    _, single_learnware_list, _ = easy_market.search_learnware(user_info) 

    # single_learnware_list is the learnware list by semantic specification searching
    print(single_learnware_list)

 In semantic specification search, we go through all learnwares in the market to compare their semantic specifications with the user's one, and return all the learnwares that pass through the comparation. When comparing two learnwares' semantic specifications, we design different ways for different semantic keys:

 - For semantic keys with type 'Class', they are matched only if they have the same value.

 - For semantic keys with type 'Tag', they are matched only if they have nonempty intersections.

 - For the user's input in the search box, it matchs with a learnware's semantic specification only if it's a substring of its 'Name' or 'Description'. All the strings are converted to the lower case before matching.

 - When a key value is missing, it will not participate in the match. The user could upload no semantic specifications if he wants.

 Statistical Specification Search
 ---------------------------------

 If you choose to provide your own statistical specification file ``stat.json``, 
 the Learnware Market can perform a more accurate leanware selection from 
 the learnwares returned by the previous step. This second-stage search is based on statistical information and returns one or more learnwares that are most likely to be helpful for your task. 

 For example, the following code is designed to work with Reduced Kernel Mean Embedding (RKME) as a statistical specification:

 .. code-block:: python

    import learnware.specification as specification

    user_spec = specification.rkme.RKMEStatSpecification()
    user_spec.load(os.path.join("rkme.json"))
    user_info = BaseUserInfo(
        semantic_spec=user_semantic, stat_info={"RKMEStatSpecification": user_spec}
    )
    (sorted_score_list, single_learnware_list,
        mixture_score, mixture_learnware_list) = easy_market.search_learnware(user_info)

    # sorted_score_list is the learnware scores based on MMD distances, sorted in descending order
    print(sorted_score_list) 

    # single_learnware_list is the learnwares sorted in descending order based on their scores
    print(single_learnware_list)

    # mixture_learnware_list is the learnwares whose mixture is helpful for your task
    print(mixture_learnware_list) 

    # mixture_score is the score of the mixture of learnwares
    print(mixture_score)

 The return values of statistical specification search are ``sorted_score_list``, ``single_learnware_list``, ``mixture_score`` and ``mixture_learnware_list``.
 ``sorted_score_list`` and ``single_learnware_list`` are the ranking of each single learnware and the corresponding scores. We return at least 15 learnwares unless there're no enough ones. If there are more than 15 matched learnwares, the ones with scores less than 50 will be ignored.
 ``mixture_score`` and ``mixture_learnware_list`` are the chosen mixture learnwares and the corresponding score. At most 5 learnwares will be chosen, whose mixture may have a relatively good performance on the user's task.


 The statistical specification search is done in the following way.
 We first filter by the dimension of RKME specifications; only those with the same dimension with the user's will enter the subsequent stage.

 The single_learnware_list is calculated using the distances between two RKMEs. The greater the distance from the user's RKME, the lower the score is. 

 The mixture_learnware_list is calculated in a greedy way. Each time we choose a learnware to make their mixture closer to the user's RKME. Specifically, each time we go through all the left learnwares to find the one whose combination with chosen learnwares could minimize the distance between their mixture's RKME and the user's RKME. The mixture weight is calculated by minimizing the RKME distance, which is solved by quadratic programming. If the distance become larger or the number of chosen learnwares reaches a threshold, the process will end and the chosen learnware and weight list will return.



 Reuse learnware
 ===========================

 This part introduces two baseline methods for reusing a given list of learnwares, namely ``JobSelectorReuser`` and ``AveragingReuser``.
 Instead of training a model from scratch, the user can easily reuse a list of learnwares (``List[Learnware]``) to predict the labels of their own data (``numpy.ndarray`` or ``torch.Tensor``).

 To illustrate, we provide a code demonstration that obtains the user dataset using ``sklearn.datasets.load_digits``, where ``test_data`` represents the data that requires prediction.
 Assuming that ``learnware_list`` is the list of learnwares searched by the learnware market based on user specifications, the user can reuse each learnware in the ``learnware_list`` through ``JobSelectorReuser`` or ``AveragingReuser`` to predict the label of ``test_data``, thereby avoiding training a model from scratch.

 .. code-block:: python

    from sklearn.datasets import load_digits
    from learnware.learnware import JobSelectorReuser, AveragingReuser

    # Load user data
    X, y = load_digits(return_X_y=True)
    test_data = X

    # Based on user information, the learnware market returns a list of learnwares (learnware_list)
    # Use jobselector reuser to reuse the searched learnwares to make prediction
    reuse_job_selector = JobSelectorReuser(learnware_list=learnware_list)
    job_selector_predict_y = reuse_job_selector.predict(user_data=test_data)

    # Use averaging ensemble reuser to reuse the searched learnwares to make prediction
    reuse_ensemble = AveragingReuser(learnware_list=learnware_list)
    ensemble_predict_y = reuse_ensemble.predict(user_data=test_data)


 JobSelectorReuser
 -------------------

 The ``JobSelectorReuser`` is a class that inherits from the base reuse class ``BaseReuser``.
 Its purpose is to create a job selector that identifies the optimal learnware for each data point in user data.
 There are three parameters required to initialize the class:

 - ``learnware_list``: A list of objects of type ``Learnware``. Each ``Learnware`` object should have an RKME specification.
 - ``herding_num``: An optional integer that specifies the number of items to herd, which defaults to 1000 if not provided.
 - ``use_herding``: A boolean flag indicating whether to use kernel herding.

 The job selector is essentially a multi-class classifier :math:`g(\boldsymbol{x}):\mathcal{X}\rightarrow \mathcal{I}` with :math:`\mathcal{I}=\{1,\ldots, C\}`, where :math:`C` is the size of ``learnware_list``.
 Given a testing sample :math:`\boldsymbol{x}`, the ``JobSelectorReuser`` predicts it by using the :math:`g(\boldsymbol{x})`-th learnware in ``learnware_list``.
 If ``use_herding`` is set to false, the ``JobSelectorReuser`` uses data points in each learware's RKME spefication with the corresponding learnware index to train a job selector.
 If ``use_herding`` is true, the algorithm estimates the mixture weight based on RKME specifications and raw user data, uses the weight to generate ``herding_num`` auxiliary data points mimicking the user distribution through the kernel herding method, and learns a job selector on these data.


 AveragingReuser
 -------------------

 The ``AveragingReuser`` is a class that inherits from the base reuse class ``BaseReuser``, that implements the average ensemble method by averaging each learnware's output to predict user data.
 There are two parameters required to initialize the class:

 - ``learnware_list``: A list of objects of type ``Learnware``.
 - ``mode``: The mode of averaging leanrware outputs, which can be set to "mean" or "vote" and defaults to "mean".

 If ``mode`` is set to "mean", the ``AveragingReuser`` computes the mean of the learnware's output to predict user data, which is commonly used in regression tasks.
 If ``mode`` is set to "vote", the ``AveragingReuser`` computes the mean of the softmax of the learnware's output to predict each label probability of user data, which is commonly used in classification tasks.
--- a/docs/workflow/.gitkeep
+++ b/docs/workflow/.gitkeep
--- a/docs/workflow/identify.rst
+++ b/docs/workflow/identify.rst
@@ -1,5 +1,5 @@
 ============================================================
 Identify Helpful Learnwares
 Helpful Learnwares Identification
 ============================================================

 When a user comes with her requirements, the market should identify helpful learnwares and recommend them to the user.
--- a/docs/workflow/reuse.rst
+++ b/docs/workflow/reuse.rst
@@ -1,5 +1,5 @@
 ==========================================
 Getting Started-Workflow-Reuse learnware
 Learnwares Reuse
 ==========================================

 This part introduces two baseline methods for reusing a given list of learnwares, namely ``JobSelectorReuser`` and ``AveragingReuser``.
--- a/docs/workflow/submit.rst
+++ b/docs/workflow/submit.rst
@@ -1,5 +1,5 @@
 ==========================================
 Submit Learnware
 Learnware Preparation and Submission
 ==========================================

 In this section, a detailed guide on how to submit your own learnware to the Learnware Market will be provided.