diff --git a/docs/_static/img/semantic_spec.png b/docs/_static/img/semantic_spec.png
new file mode 100644
index 0000000..bb11762
Binary files /dev/null and b/docs/_static/img/semantic_spec.png differ
diff --git a/docs/start/get_start_examples.rst b/docs/start/get_start_examples.rst
new file mode 100644
index 0000000..8f225f8
--- /dev/null
+++ b/docs/start/get_start_examples.rst
@@ -0,0 +1,63 @@
+.. _examples:
+================================
+Experiments & Get Start Examples
+================================
+
+This chapter introduces experiments that illustrate the search and reuse performance of our learnware system.
+
+================
+Environment
+================
+For all experiments, we used a single Linux server. Details on the specifications are listed in the table below. All processors were used for training and evaluation.
+
+==================== ==================== ===============================
+System               GPU                  CPU
+==================== ==================== ===============================
+Ubuntu 20.04.4 LTS   Nvidia Tesla V100S   Intel(R) Xeon(R) Gold 6240R
+==================== ==================== ===============================
+
+
+================
+Experiments
+================
+
+Datasets
+================
+We designed experiments on three publicly available datasets, namely `Prediction Future Sales (PFS) `_,
+`M5 Forecasting (M5) `_ and `CIFAR 10 `_.
+For the two sales forecasting datasets, PFS and M5, we divide the user data by store and train the Ridge model and the LightGBM model on the corresponding data, respectively.
+For the CIFAR10 image classification task, we first randomly pick 6 to 10 categories and then randomly select 800 to 2000 training-set samples from each picked category, constituting a total of 50 different uploaders.
+For test users, we first randomly pick 3 to 6 categories and then randomly select 150 to 350 test-set samples from each picked category, constituting a total of 20 different users.
+
+We tested the efficiency of specification generation and the accuracy of search and reuse.
+The evaluation metric on the PFS and M5 data is RMSE, and the evaluation metric on the CIFAR10 classification task is classification accuracy.
+
+Results
+================
+
+The time taken by specification generation is shown in the table below:
+
+==================== ==================== =================================
+Dataset              Data Dimensions      Specification Generation Time (s)
+==================== ==================== =================================
+PFS
+M5
+CIFAR10              9000*3*32*32         7~10
+==================== ==================== =================================
+
+The accuracy of search and reuse is shown in the table below:
+
+==================== ==================== ================================= =================================
+Dataset              Top-1 Performance    Job Selector Reuse                Average Ensemble Reuse
+==================== ==================== ================================= =================================
+PFS
+M5
+CIFAR10              0.619 +/- 0.138      0.585 +/- 0.056                   0.715 +/- 0.075
+==================== ==================== ================================= =================================
+
+=========================
+Get Start Examples
+=========================
+Examples for `PFS, M5` and `CIFAR10` are available at [xxx]. You can run { main.py } directly to reproduce the related experiments.
+The test code is mainly composed of three parts, namely data preparation (optional), specification generation and market construction, and search test.
+You can load data prepared by us and skip the data preparation step.
\ No newline at end of file
diff --git a/docs/workflow/Identify helpful learnwares.rst b/docs/workflow/Identify helpful learnwares.rst
index 7a7d69d..7cc8d17 100644
--- a/docs/workflow/Identify helpful learnwares.rst
+++ b/docs/workflow/Identify helpful learnwares.rst
@@ -2,17 +2,47 @@
 Identify Helpful Learnwares
 ============================================================
 
+When a user comes with her requirements, the market should identify helpful learnwares and recommend them to the user.
+The search for helpful learnwares is based on the user information and is divided into two stages: semantic specification search and statistical specification search.
+
+User information
+-------------------------------
+The user should provide her requirements in ``BaseUserInfo``. The class ``BaseUserInfo`` consists of the user's semantic specification ``user_semantic`` and statistical information ``stat_info``.
+
+The semantic specification ``user_semantic`` is stored in a ``dict`` with the keys 'Data', 'Task', 'Library', 'Scenario', 'Description' and 'Name'. An example is shown below, and you can choose the values according to the figure. For keys of type 'Class', you should choose exactly one legal value; for keys of type 'Tag', you can choose one or more values; for keys of type 'String', you should provide a string. The key 'Description' is used in learnwares' semantic specifications and is ignored in the user's semantic specification. The value of any key can be left empty if the user does not know it.
+
+.. code-block:: python
+
+    # An example of user_semantic
+    user_semantic = {
+        "Data": {"Values": ["Image"], "Type": "Class"},
+        "Task": {"Values": ["Classification"], "Type": "Class"},
+        "Library": {"Values": ["Scikit-learn"], "Type": "Tag"},
+        "Scenario": {"Values": ["Education"], "Type": "Class"},
+        "Description": {"Values": "", "Type": "String"},
+        "Name": {"Values": "digits", "Type": "String"},
+    }
+
+
+.. image:: ../_static/img/semantic_spec.png
+   :align: center
+
+
+The user's statistical information ``stat_info`` is stored in a ``json`` file, e.g., ``stat.json``. The generation of this file is described in `this is a link to semantic specification generation`_.
+
+
+
 Semantic Specification Search
 -------------------------------
 To search for learnwares that fit your task purpose,
-you should first provide a semantic specification ``user_semantic`` that describes the characteristics of your task.
+the user should first provide a semantic specification ``user_semantic`` that describes the characteristics of the task.
 The Learnware Market will perform a first-stage search based on ``user_semantic``,
 identifying potentially helpful leranwares whose models solve tasks similar to your requirements.
 
 .. code-block:: python
 
     # construct user_info which includes semantic specification for searching learnware
-    user_info = BaseUserInfo(id="user", semantic_spec=semantic_spec)
+    user_info = BaseUserInfo(semantic_spec=semantic_spec)
 
     # search_learnware performs semantic specification search if user_info doesn't include a statistical specification
     _, single_learnware_list, _ = easy_market.search_learnware(user_info)
@@ -24,7 +54,7 @@ In semantic specification search, we go through all learnwares in the market to
 - For semantic keys with type 'Class', they are matched only if they have the same value.
 - For semantic keys with type 'Tag', they are matched only if they have nonempty intersections.
 - For the user's input in the search box, it matchs with a learnware's semantic specification only if it's a substring of its 'Name' or 'Description'. All the strings are converted to the lower case before matching.
-- When a key value is missing, it will not participate in the match. The user could upload no semantic specifications if he wants.
+- When a key value is missing, it will not participate in the match.
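The matching rules in the list above can be sketched in a few lines of plain Python. This is only an illustration, assuming semantic specifications are dicts shaped like the ``user_semantic`` example; the names ``match_semantic``, ``is_empty`` and ``search_text`` are hypothetical and are not part of the learnware API, the sample values are made up, and skipping a key when it is empty on either side is an assumption about how "missing" is handled.

```python
from typing import Any, Dict, Optional

SemanticSpec = Dict[str, Dict[str, Any]]


def is_empty(entry: Optional[Dict[str, Any]]) -> bool:
    """Treat an absent key or an empty "Values" field as 'not provided'."""
    return entry is None or not entry.get("Values")


def match_semantic(user: SemanticSpec, learnware: SemanticSpec, search_text: str = "") -> bool:
    """Hypothetical helper: True if the learnware passes every applicable rule."""
    for key, user_entry in user.items():
        learn_entry = learnware.get(key)
        # Assumption: a key that is empty on either side is skipped.
        if is_empty(user_entry) or is_empty(learn_entry):
            continue
        if user_entry["Type"] == "Class":
            # 'Class' keys match only if both sides hold the same value.
            if user_entry["Values"] != learn_entry["Values"]:
                return False
        elif user_entry["Type"] == "Tag":
            # 'Tag' keys match only if the two value sets intersect.
            if not set(user_entry["Values"]) & set(learn_entry["Values"]):
                return False
        # 'String' keys are not compared key by key in this sketch.
    if search_text:
        # Search-box rule: lower-cased substring of 'Name' or 'Description'.
        needle = search_text.lower()
        fields = (learnware.get(k, {}).get("Values", "") for k in ("Name", "Description"))
        return any(needle in str(text).lower() for text in fields)
    return True


# Illustrative data only.
user = {
    "Data": {"Values": ["Image"], "Type": "Class"},
    "Library": {"Values": ["Scikit-learn"], "Type": "Tag"},
    "Scenario": {"Values": [], "Type": "Class"},  # left empty, so it is skipped
}
learnware = {
    "Data": {"Values": ["Image"], "Type": "Class"},
    "Library": {"Values": ["Scikit-learn", "PyTorch"], "Type": "Tag"},
    "Scenario": {"Values": ["Education"], "Type": "Class"},
    "Name": {"Values": "Digits Classifier", "Type": "String"},
    "Description": {"Values": "SVM trained on digits", "Type": "String"},
}
print(match_semantic(user, learnware, search_text="DIGITS"))  # True
```

Returning early on the first failed rule keeps the check cheap when it is run over every learnware in the market, which is how the first-stage search is described.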
 
 Statistical Specification Search
 ---------------------------------
@@ -40,9 +70,9 @@ For example, the following code is designed to work with Reduced Kernel Mean Emb
     import learnware.specification as specification
 
     user_spec = specification.rkme.RKMEStatSpecification()
-    user_spec.load(os.path.join(unzip_path, "rkme.json"))
+    user_spec.load(os.path.join("rkme.json"))
     user_info = BaseUserInfo(
-        id="user", semantic_spec=user_semantic, stat_info={"RKMEStatSpecification": user_spec}
+        semantic_spec=user_semantic, stat_info={"RKMEStatSpecification": user_spec}
     )
     (sorted_score_list, single_learnware_list, mixture_score, mixture_learnware_list) = easy_market.search_learnware(user_info)
 
@@ -59,9 +89,14 @@ For example, the following code is designed to work with Reduced Kernel Mean Emb
     # mixture_score is the score of the mixture of learnwares
     print(mixture_score)
 
+The return values of statistical specification search are ``sorted_score_list``, ``single_learnware_list``, ``mixture_score`` and ``mixture_learnware_list``.
+``sorted_score_list`` and ``single_learnware_list`` are the scores and the corresponding ranking of the single learnwares. We return at least 15 learnwares unless fewer are available. If there are more than 15 matched learnwares, the ones with scores less than 50 will be ignored.
+``mixture_score`` and ``mixture_learnware_list`` are the score of the chosen mixture and the learnwares in it. At most 5 learnwares will be chosen, whose mixture may have a relatively good performance on the user's task.
+
+
 The statistical specification search is done in the following way.
 We first filter by the dimension of RKME specifications; only those with the same dimension with the user's will enter the subsequent stage.
 
-The single_learnware_list is calculated using the distances between two RKMEs. The greater the distance from the user's RKME, the lower the score is. The learnwares with very low scores will not be returned, unless the number of matching leanwares is smaller than a set threshold.
+The ``single_learnware_list`` is calculated using the distance between each learnware's RKME and the user's RKME. The greater the distance from the user's RKME, the lower the score is.
 
 The mixture_learnware_list is calculated in a greedy way. Each time we choose a learnware to make their mixture closer to the user's RKME. Specifically, each time we go through all the left learnwares to find the one whose combination with chosen learnwares could minimize the distance between their mixture's RKME and the user's RKME. The mixture weight is calculated by minimizing the RKME distance, which is solved by quadratic programming. If the distance become larger or the number of chosen learnwares reaches a threshold, the process will end and the chosen learnware and weight list will return.
\ No newline at end of file
diff --git a/learnware/learnware/reuse.py b/learnware/learnware/reuse.py
index 7c289cc..6ada88b 100644
--- a/learnware/learnware/reuse.py
+++ b/learnware/learnware/reuse.py
@@ -163,7 +163,7 @@ class JobSelectorReuser(BaseReuser):
         Parameters
         ----------
         user_data : np.ndarray
-            User's labeld raw data.
+            Raw user data.
         task_rkme_list : List[RKMEStatSpecification]
             The list of learwares' rkmes whose mixture approximates the user's rkme
         task_rkme_matrix : np.ndarray
@@ -272,7 +272,7 @@ class AveragingReuser(BaseReuser):
         Parameters
         ----------
         learnware_list : List[Learnware]
-            The learnware list, which should have RKME Specification for each learnweare
+            The learnware list
         """
         super(AveragingReuser, self).__init__(learnware_list)
         self.mode = mode
@@ -283,7 +283,7 @@ class AveragingReuser(BaseReuser):
         Parameters
         ----------
         user_data : np.ndarray
-            Raw user data.
+            Raw user data.
 
         Returns
         -------