| @@ -43,4 +43,5 @@ cache/ | |||
| tmp/ | |||
| learnware_pool/ | |||
| PFS/ | |||
| data/ | |||
| data/ | |||
| examples/results/ | |||
| @@ -1,5 +1,5 @@ | |||
| ============================== | |||
| Specification evolvement | |||
| Specification Evolvement | |||
| ============================== | |||
| The specification is the core of the learnware paradigm. | |||
| @@ -1,87 +1,138 @@ | |||
| .. _learnware: | |||
| ========================================== | |||
| Learnware & Reuser | |||
| ========================================== | |||
| Learnware and Reuser are related... | |||
| ``Learnware`` is the most basic concept in the ``learnware paradigm``. In this section, we will introduce the concept and design of ``learnware`` and its extension for ``Hetero Reuse``. Then we will introduce the ``Reuse Methods``, which apply one or several ``learnware``\ s to solve the user's task. | |||
| Concepts | |||
| =================== | |||
| A learnware, a concept first introduced by Zhi-Hua Zhou, is a well-trained machine learning model accompanied by a specification that allows future users with no prior knowledge of the learnware to identify and reuse it according to their needs. | |||
| In the learnware paradigm, a learnware is a well-performing trained machine learning model with a specification that enables future users, who know nothing about the learnware in advance, to identify and reuse it according to their requirements. Specifications are introduced in `COMPONENTS: Specification <./spec.html>`_. | |||
| Developers or owners of trained machine learning models can voluntarily submit their models to a learnware marketplace. If the marketplace accepts the model, it assigns a specification to the model and makes it available in the marketplace. | |||
| In our implementation, the class ``Learnware`` has 3 important member variables: | |||
| Utilizing Learnware in Practice | |||
| ------------------------------- | |||
| - ``id``: The learnware id is generated by ``market``. | |||
| - ``model``: The model in the learnware, which can be a ``BaseModel`` or a dict containing the model name and path. When it is a dict, the function ``Learnware.instantiate_model`` is used to transform it into a ``BaseModel``. The function ``Learnware.predict`` uses the model to predict for an input ``X``. See more in `COMPONENTS: Model <./model.html>`_. | |||
| - ``specification``: The specification, which includes the semantic specification and the statistical specification. | |||
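The three member variables can be pictured with a minimal sketch. The class below is a hypothetical stand-in written for illustration, not the package's ``Learnware`` class; in particular, the ``"factory"`` key inside the model dict is an assumption of this sketch (the text above only says the dict holds a model name and path).

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Union


@dataclass
class SketchLearnware:
    """Hypothetical stand-in for ``Learnware``; not the package's class."""

    id: str
    model: Union[Callable[[Any], Any], Dict[str, Any]]
    specification: Dict[str, Any]

    def instantiate_model(self) -> None:
        # If the model is still a description (a dict), build the callable.
        # The "factory" key is an assumption made for this sketch only.
        if isinstance(self.model, dict):
            self.model = self.model["factory"]()

    def predict(self, X):
        # Make sure a usable model exists, then delegate to it.
        self.instantiate_model()
        return self.model(X)


lw = SketchLearnware(
    id="00000001",
    model={"name": "doubler", "factory": lambda: (lambda X: [2 * x for x in X])},
    specification={"semantic": {}, "statistical": {}},
)
print(lw.predict([1, 2, 3]))  # prints [2, 4, 6]
```

Deferring instantiation until the first ``predict`` call keeps the object cheap to create when only its specification is needed.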
| With a learnware marketplace in place, users can tackle machine learning tasks without having to create models from scratch. | |||
| Learnware for Hetero Reuse (Feature Align + Hetero Map Learnware) | |||
| ======================================================================= | |||
| Addressing Concerns with Learnware | |||
| ---------------------------------- | |||
| In the Hetero Market (see `COMPONENTS: Hetero Market <./market.html#hetero-market>`_ for details), ``HeteroSearcher`` identifies and recommends helpful learnwares among all learnwares in the market, | |||
| including learnwares with feature/label spaces different from the user's task requirements (heterogeneous learnwares). ``FeatureAlignLearnware`` and ``HeteroMapAlignLearnware`` | |||
| are designed to enable the reuse of heterogeneous learnwares. They extend ``Learnware`` with the ability to align the feature space and label space of the learnware to the user's task requirements, | |||
| and provide basic interfaces for heterogeneous learnwares to be applied to tasks beyond their original purposes. | |||
| ``FeatureAlignLearnware`` | |||
| --------------------------- | |||
| ``FeatureAlignLearnware`` employs a neural network to align the feature space of the learnware to the user's task. | |||
| It is initialized with a ``Learnware``, and has the following methods to expand the applicable scope of this ``Learnware``: | |||
| - **align**: Trains a neural network to align ``user_rkme``, which is the ``RKMETableSpecification`` of the user's data, with the learnware's statistical specification. | |||
| - **predict**: Predicts the output for user data using the trained neural network and the original learnware's model. | |||
| ``HeteroMapAlignLearnware`` | |||
| ----------------------------- | |||
| If user data is heterogeneous not only in feature space but also in label space, ``HeteroMapAlignLearnware`` uses | |||
| a small amount of labeled data ``(x_train, y_train)`` from the user task to align heterogeneous learnwares with the user task. | |||
| There are two key interfaces in ``HeteroMapAlignLearnware``: | |||
| - ``HeteroMapAlignLearnware.align(self, user_rkme: RKMETableSpecification, x_train: np.ndarray, y_train: np.ndarray)`` | |||
| - **input space alignment**: Align the feature space of the learnware to the user task's statistical specification ``user_rkme`` using ``FeatureAlignLearnware``. | |||
| - **output space alignment**: Further align the label space of the aligned learnware to the user task through supervised learning of ``FeatureAugmentReuser`` using ``(x_train, y_train)``. | |||
| - ``HeteroMapAlignLearnware.predict(self, user_data)`` | |||
| - If input space and output space alignment are both performed, use the ``FeatureAugmentReuser`` to predict the output for ``user_data``. | |||
| The learnware approach aims to address several challenges: | |||
| +------------------------+----------------------------------------------------------------------------------------+ | |||
| | Concern | Solution | | |||
| +========================+========================================================================================+ | |||
| | Limited training data | Use existing high-quality learnware and require only a small amount of data for | | |||
| | | adaptation or refinement. | | |||
| +------------------------+----------------------------------------------------------------------------------------+ | |||
| | Lack of training skills| Leverage existing learnware instead of building a model from scratch. | | |||
| +------------------------+----------------------------------------------------------------------------------------+ | |||
| | Catastrophic forgetting| Retain old knowledge in the marketplace as accepted learnware remain available. | | |||
| +------------------------+----------------------------------------------------------------------------------------+ | |||
| | Continual learning | Facilitate continuous and lifelong learning with the constant influx of high-quality | | |||
| | | learnware, enriching the knowledge base. | | |||
| +------------------------+----------------------------------------------------------------------------------------+ | |||
| | Data privacy and | Ensure data privacy and proprietary protection by having developers only submit | | |||
| | proprietary concerns | models, not their data. | | |||
| +------------------------+----------------------------------------------------------------------------------------+ | |||
| | Unplanned tasks | Ensure the availability of helpful learnware for various tasks, unless entirely new | | |||
| | | to all legal developers. | | |||
| +------------------------+----------------------------------------------------------------------------------------+ | |||
| | Carbon emissions | Reduce the need to train numerous large models by assembling smaller models that | | |||
| | | provide satisfactory performance. | | |||
| +------------------------+----------------------------------------------------------------------------------------+ | |||
| Future Work and Progress | |||
| ------------------------ | |||
| Despite the promising potential of the learnware proposal, much work remains to bring it to fruition. The following sections will discuss some of the progress made thus far. | |||
| Learnware for Hetero Reuse (Feature Align + Hetero Map Learnware) | |||
| ======================================================================= | |||
| All Reuse Methods | |||
| =========================== | |||
| In addition to applying ``Learnware``, ``FeatureAlignLearnware`` or ``HeteroMapAlignLearnware`` objects directly by calling their ``predict`` interface, | |||
| the ``learnware`` package also provides a set of ``Reuse Methods`` for users to further customize a single or multiple learnwares, with the hope of enabling learnwares to be | |||
| helpful beyond their original purposes, and eliminating the need for users to build models from scratch. | |||
| There are two main categories of ``Reuse Methods``: (1) direct reuse and (2) reuse based on a small amount of labeled data. | |||
| .. note:: | |||
| Combine ``HeteroMapAlignLearnware`` with the following reuse methods to enable the reuse of heterogeneous learnwares. See `WORKFLOW: Hetero Reuse <../workflows/reuse.html#hetero-reuse>`_ for details. | |||
| Direct Reuse of Learnware | |||
| -------------------------- | |||
| Two methods for direct reuse of learnwares are provided: ``JobSelectorReuser`` and ``AveragingReuser``. | |||
| JobSelectorReuser | |||
| -------------------- | |||
| ^^^^^^^^^^^^^^^^^^ | |||
| The ``JobSelectorReuser`` is a class that inherits from the base reuse class ``BaseReuser``. | |||
| Its purpose is to create a job selector that identifies the optimal learnware for each data point in user data. | |||
| There are three parameters required to initialize the class: | |||
| ``JobSelectorReuser`` trains a classifier ``job selector`` that identifies the optimal learnware for each data point in user data. | |||
| There are three member variables: | |||
| - ``learnware_list``: A list of objects of type ``Learnware``. Each ``Learnware`` object should have an RKME specification. | |||
| - ``learnware_list``: A list of ``Learnware`` objects for the ``JobSelectorReuser`` to choose from. | |||
| - ``herding_num``: An optional integer that specifies the number of items to herd, which defaults to 1000 if not provided. | |||
| - ``use_herding``: A boolean flag indicating whether to use kernel herding. | |||
| The job selector is essentially a multi-class classifier :math:`g(\boldsymbol{x}):\mathcal{X}\rightarrow \mathcal{I}` with :math:`\mathcal{I}=\{1,\ldots, C\}`, where :math:`C` is the size of ``learnware_list``. | |||
| Given a testing sample :math:`\boldsymbol{x}`, the ``JobSelectorReuser`` predicts it by using the :math:`g(\boldsymbol{x})`-th learnware in ``learnware_list``. | |||
| If ``use_herding`` is set to false, the ``JobSelectorReuser`` uses data points in each learnware's RKME specification with the corresponding learnware index to train a job selector. | |||
| If ``use_herding`` is true, the algorithm estimates the mixture weight based on RKME specifications and raw user data, uses the weight to generate ``herding_num`` auxiliary data points mimicking the user distribution through the kernel herding method, and learns a job selector on these data. | |||
| The most important methods of ``JobSelectorReuser`` are ``job_selector`` and ``predict``: | |||
| - **job_selector**: Trains a ``job selector`` based on the user's data and the ``learnware_list``. The procedure depends on the value of ``use_herding``: | |||
| - If ``use_herding`` is False: Statistical specifications of learnwares in ``learnware_list`` combined with the corresponding learnware index are used to train the ``job selector``. | |||
| - If ``use_herding`` is True: | |||
| - Estimate the mixture weight based on user raw data and the statistical specifications of learnwares in ``learnware_list`` | |||
| - Use the mixture weight to generate ``herding_num`` auxiliary data points which mimic the user task's distribution through the kernel herding method | |||
| - Finally, learn the ``job selector`` on the auxiliary data points. | |||
| - **predict**: The ``job selector`` is essentially a multi-class classifier :math:`g(\boldsymbol{x}):\mathcal{X}\rightarrow \mathcal{I}` with :math:`\mathcal{I}=\{1,\ldots, C\}`, where :math:`C` is the size of ``learnware_list``. Given a testing sample :math:`\boldsymbol{x}`, the ``JobSelectorReuser`` predicts it by using the :math:`g(\boldsymbol{x})`-th learnware in ``learnware_list``. | |||
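The routing idea behind the job selector can be sketched as follows. This is a simplified illustration and not the package's implementation: a nearest-centroid rule stands in for the trained multi-class classifier, the learnwares are plain callables, and the helper names ``train_job_selector`` and ``predict_with_selector`` are hypothetical. Indices here are 0-based, whereas the text above uses :math:`1,\ldots,C`.

```python
from typing import Callable, List, Sequence

Point = Sequence[float]


def train_job_selector(spec_points: List[List[Point]]) -> Callable[[Point], int]:
    """Fit a nearest-centroid selector g(x) from per-learnware sample points.

    spec_points[i] holds the points attached to learnware i, so every point
    is implicitly labeled with its learnware index.
    """
    centroids = []
    for points in spec_points:
        dim = len(points[0])
        centroids.append([sum(p[d] for p in points) / len(points) for d in range(dim)])

    def selector(x: Point) -> int:
        # Squared Euclidean distance from x to each learnware's centroid.
        dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids]
        return dists.index(min(dists))

    return selector


def predict_with_selector(selector, learnwares, X):
    # Each sample is predicted by the learnware that the selector picks for it.
    return [learnwares[selector(x)](x) for x in X]


spec_points = [
    [(0.0, 0.0), (1.0, 0.0)],      # points attached to learnware 0
    [(10.0, 10.0), (11.0, 10.0)],  # points attached to learnware 1
]
learnwares = [lambda x: "low", lambda x: "high"]
g = train_job_selector(spec_points)
print(predict_with_selector(g, learnwares, [(0.5, 0.2), (9.0, 9.5)]))  # prints ['low', 'high']
```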
| AveragingReuser | |||
| ------------------ | |||
| ^^^^^^^^^^^^^^^^^^ | |||
| ``AveragingReuser`` uses an ensemble method to make predictions. It is initialized with a list of ``Learnware`` objects, and has a member variable ``mode`` which | |||
| specifies the ensemble method (the default is ``mean``). | |||
| - **predict**: The member variable ``mode`` provides different options for classification and regression tasks: | |||
| - For regression tasks, ``mode`` should be set to ``mean``. The prediction is the average of the learnwares' outputs. | |||
| - For classification tasks, ``mode`` has two available options. If ``mode`` is set to ``vote_by_label``, the prediction is the majority vote label based on learnwares' output labels. If ``mode`` is set to ``vote_by_prob``, the prediction is the mean vector of all learnwares' output label probabilities. | |||
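The three modes can be illustrated with a small, self-contained sketch. The function ``combine_outputs`` is a hypothetical helper, not part of the package; it assumes ``outputs[i][j]`` is the output of learnware ``i`` for sample ``j``, and how ties in the label vote are broken is an artifact of this sketch.

```python
from collections import Counter


def combine_outputs(outputs, mode="mean"):
    """Combine per-learnware outputs; outputs[i][j] is learnware i on sample j."""
    n_models = len(outputs)
    n_samples = len(outputs[0])
    if mode == "mean":
        # Regression: average the numeric outputs.
        return [sum(o[j] for o in outputs) / n_models for j in range(n_samples)]
    if mode == "vote_by_label":
        # Classification: majority vote over predicted labels.
        return [Counter(o[j] for o in outputs).most_common(1)[0][0] for j in range(n_samples)]
    if mode == "vote_by_prob":
        # Classification: mean vector of the predicted label probabilities.
        n_classes = len(outputs[0][0])
        return [
            [sum(o[j][k] for o in outputs) / n_models for k in range(n_classes)]
            for j in range(n_samples)
        ]
    raise ValueError(f"unknown mode: {mode}")


print(combine_outputs([[1.0, 2.0], [3.0, 4.0]], mode="mean"))  # prints [2.0, 3.0]
print(combine_outputs([["a", "b"], ["a", "c"], ["b", "c"]], mode="vote_by_label"))  # prints ['a', 'c']
print(combine_outputs([[[0.75, 0.25]], [[0.25, 0.75]]], mode="vote_by_prob"))  # prints [[0.5, 0.5]]
```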
| Reuse Learnware with Labeled Data | |||
| ---------------------------------- | |||
| When users have a small amount of labeled data available, the ``learnware`` package provides two methods, ``EnsemblePruningReuser`` and ``FeatureAugmentReuser``, to help reuse learnwares. | |||
| They are both initialized with a list of ``Learnware`` objects ``learnware_list``, and have different implementations of ``fit`` and ``predict`` methods. | |||
| EnsemblePruningReuser | |||
| ^^^^^^^^^^^^^^^^^^^^^^ | |||
| The ``EnsemblePruningReuser`` class implements a selective ensemble approach inspired by the MDEP algorithm, as detailed in [1]_. | |||
| It selects a subset of learnwares from ``learnware_list``, utilizing user's labeled data for effective ensemble integration on user tasks. | |||
| This method effectively balances validation error, margin ratio, and ensemble size, leading to a robust and optimized selection of learnwares for task-specific ensemble creation. | |||
| - **fit**: Effectively prunes the large set of learnwares ``learnware_list`` by evaluating and comparing the learnwares based on their performance on user's labeled validation data ``(val_X, val_y)``. Returns the most suitable subset of learnwares. | |||
| - **predict**: The ``mode`` member variable has two available options. Set ``mode`` to ``regression`` for regression tasks, and ``classification`` for classification tasks. The prediction is the average of the selected learnwares' outputs. | |||
| FeatureAugmentReuser | |||
| ^^^^^^^^^^^^^^^^^^^^^^ | |||
| ``FeatureAugmentReuser`` helps users reuse learnwares by augmenting features. In this method, | |||
| outputs of the learnwares from ``learnware_list`` on user's validation data ``val_X`` are taken as augmented features and are concatenated with original features ``val_X``. | |||
| The augmented data (concatenated features combined with validation labels ``val_y``) are then used to train a simple model ``augment_reuser`` which gives the final prediction | |||
| on ``user_data``. | |||
| - **fit**: Trains the ``augment_reuser`` using augmented user validation data. For classification tasks, ``mode`` should be set to ``classification``, and ``augment_reuser`` is a ``LogisticRegression`` model. For regression tasks, ``mode`` should be set to ``regression``, and ``augment_reuser`` is a ``RidgeCV`` model. | |||
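The augmentation step itself can be sketched in a few lines. The helper ``augment_features`` is hypothetical and the learnwares are stand-in callables that map a batch of rows to one output per row; fitting the downstream ``augment_reuser`` on the augmented rows is left out of this sketch.

```python
def augment_features(X, learnwares):
    """Append every learnware's output for a row to that row's features."""
    outputs = [lw(X) for lw in learnwares]  # one list of outputs per learnware
    return [list(row) + [out[i] for out in outputs] for i, row in enumerate(X)]


val_X = [[1.0, 2.0], [3.0, 4.0]]
learnwares = [
    lambda X: [sum(r) for r in X],       # stand-in learnware 0
    lambda X: [r[0] - r[1] for r in X],  # stand-in learnware 1
]
print(augment_features(val_X, learnwares))
# prints [[1.0, 2.0, 3.0, -1.0], [3.0, 4.0, 7.0, -1.0]]
```

Each augmented row keeps the original features first, so a model trained on it can still fall back on the raw inputs when the learnwares' outputs are uninformative.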
| The ``AveragingReuser`` is a class that inherits from the base reuse class ``BaseReuser``. It implements the average ensemble method by averaging each learnware's output to predict user data. | |||
| There are two parameters required to initialize the class: | |||
| - ``learnware_list``: A list of objects of type ``Learnware``. | |||
| - ``mode``: The mode of averaging learnware outputs, which can be set to "mean" or "vote" and defaults to "mean". | |||
| References | |||
| ----------- | |||
| If ``mode`` is set to "mean", the ``AveragingReuser`` computes the mean of the learnware's output to predict user data, which is commonly used in regression tasks. | |||
| If ``mode`` is set to "vote", the ``AveragingReuser`` computes the mean of the softmax of the learnware's output to predict each label probability of user data, which is commonly used in classification tasks. | |||
| .. [1] Yu-Chang Wu, Yi-Xiao He, Chao Qian, and Zhi-Hua Zhou. Multi-objective Evolutionary Ensemble Pruning Guided by Margin Distribution. In: Proceedings of the 17th International Conference on Parallel Problem Solving from Nature (PPSN'22), Dortmund, Germany, 2022. | |||
| @@ -1,6 +1,7 @@ | |||
| .. _market: | |||
| ================================ | |||
| Market | |||
| Learnware Market | |||
| ================================ | |||
| The ``learnware market`` receives high-performance machine learning models from developers, incorporates them into the system, and provides services to users by identifying and reusing learnware to help users solve current tasks. Developers voluntarily submit various learnwares to the learnware market, and the market conducts quality checks and further organization of these learnwares. When users submit task requirements, the learnware market automatically selects whether to recommend a single learnware or a combination of multiple learnwares. | |||
| @@ -10,11 +11,11 @@ The ``learnware market`` will receive various kinds of learnwares, and learnware | |||
| Framework | |||
| ====================================== | |||
| The ``learnware market`` is combined with a ``organizer``, a ``searcher``, and a list of ``checker``s. | |||
| The ``learnware market`` consists of an ``organizer``, a ``searcher``, and a list of ``checker``\ s. | |||
| The ``organizer`` can store and organize learnwares in the market. It supports ``add``, ``delete``, and ``update`` operations for learnwares. It also provides the interface for ``searcher`` to search learnwares based on user requirement. | |||
| The ``searcher`` can search learnwares based on user requirement. The implementation of ``searcher`` is dependent on the concrete implementation and interface for ``organizer``, where usually an ``organizer`` can be compatible with multiple different ``searcher``s. | |||
| The ``searcher`` can search learnwares based on user requirement. The implementation of ``searcher`` is dependent on the concrete implementation and interface for ``organizer``, where usually an ``organizer`` can be compatible with multiple different ``searcher``\ s. | |||
| The ``checker`` is used for checking the learnware against some standards. It should check the utility of a learnware and is supposed to return the status and a message related to the learnware's check result. Only the learnwares that pass the ``checker`` can be stored and added into the ``learnware market``. | |||
| @@ -23,9 +24,9 @@ The ``checker`` is used for checking the learnware in some standards. It should | |||
| Current Checkers | |||
| ====================================== | |||
| The ``learnware`` package provide two different implementation of ``market`` where both of them share the same ``checker`` list. So we first introduce the details of ``checker``s. | |||
| The ``learnware`` package provides two different implementations of ``market``, both of which share the same ``checker`` list. So we first introduce the details of ``checker``\ s. | |||
| The ``checker``s check a learnware object in different aspects, including environment configuration (``CondaChecker``), semantic specifications (``EasySemanticChecker``), and statistical specifications (``EasyStatChecker``). The ``__call__`` method of each checker is designed to be invoked as a function to conduct the respective checks on the learnware and return the outcomes. It defines three types of learnwares: ``INVALID_LEARNWARE`` denotes the learnware does not pass the check, ``NONUSABLE_LEARNWARE`` denotes the learnware pass the check but cannot make prediction, ``USABLE_LEARWARE`` denotes the leanrware pass the check and can make prediction. Currently, we have three ``checker``s, which are described below. | |||
| The ``checker``\ s check a learnware object in different aspects, including environment configuration (``CondaChecker``), semantic specifications (``EasySemanticChecker``), and statistical specifications (``EasyStatChecker``). The ``__call__`` method of each checker is designed to be invoked as a function to conduct the respective checks on the learnware and return the outcomes. It defines three types of learnwares: ``INVALID_LEARNWARE`` denotes that the learnware does not pass the check, ``NONUSABLE_LEARNWARE`` denotes that the learnware passes the check but cannot make predictions, and ``USABLE_LEARNWARE`` denotes that the learnware passes the check and can make predictions. Currently, we have three ``checker``\ s, which are described below. | |||
| ``CondaChecker`` | |||
| @@ -52,15 +53,115 @@ The ``learnware`` package provide two different implementation of ``market``, i. | |||
| Easy Market | |||
| ------------- | |||
| Easy market is a basic realization of the learnware market. It consists of ``EasyOrganizer``, ``EasySearcher``, and the checker list ``[EasySemanticChecker, EasyStatChecker]``. | |||
| ``Easy Organizer`` | |||
| ++++++++++++++++++++ | |||
| ``EasyOrganizer`` mainly has the following methods to store learnwares, which is an easy way to organize learnwares. | |||
| - **reload_market**: Reload the learnware market when server restarted, and return a flag indicating whether the market is reloaded successfully. | |||
| - **add_learnware**: Add a learnware with ``learnware_id``, ``semantic_spec`` and model files in ``zip_path`` into the market. Return the ``learnware_id`` and ``learnware_status``. The ``learnware_status`` is set to ``check_status`` if it is provided; otherwise the ``checker`` is called to generate the ``learnware_status``. | |||
| - **delete_learnware**: Delete the learnware with ``id`` from the market, and return a flag indicating whether the deletion is successful. | |||
| - **update_learnware**: Update the learnware's ``zip_path``, ``semantic_spec``, ``check_status``. If None, the corresponding item is not updated. Return a flag indicating whether it passed the ``checker``. | |||
| - **get_learnwares**: Similar to **get_learnware_ids**, but return list of learnwares instead of ids. | |||
| - **reload_learnware**: Reload all the attributes of the learnware with ``learnware_id``. | |||
| ``Easy Searcher`` | |||
| ++++++++++++++++++++ | |||
| ``EasySearcher`` consists of ``EasyFuzzSemanticSearcher`` and ``EasyStatSearcher``. ``EasyFuzzSemanticSearcher`` is a kind of ``Semantic Specification Searcher``, while ``EasyStatSearcher`` is a kind of ``Statistical Specification Searcher``. All these searchers return helpful learnwares based on ``BaseUserInfo`` provided by users. | |||
| ``BaseUserInfo`` is a ``Python API`` for users to provide enough information to identify helpful learnwares. | |||
| When initializing ``BaseUserInfo``, three optional pieces of information can be provided: ``id``, ``semantic_spec`` and ``stat_info``. These specifications are introduced in `COMPONENTS: Specification <./spec.html>`_. | |||
| The semantic specification search and statistical specification search have been integrated into the same interface ``EasySearcher``. | |||
| - **EasySearcher.__call__(self, user_info: BaseUserInfo, check_status: int = None, max_search_num: int = 5, search_method: str = "greedy",) -> SearchResults** | |||
| - It runs the semantic searcher ``EasyFuzzSemanticSearcher`` on all the learnwares from the ``organizer`` with the same ``check_status`` (all learnwares if ``check_status`` is None). If the result is not empty and ``stat_info`` is provided in ``user_info``, it then runs ``EasyStatSearcher`` and returns the ``SearchResults``. | |||
| ``Semantic Specification Searcher`` | |||
| '''''''''''''''''''''''''''''''''''' | |||
| ``Semantic Specification Searcher`` is the first-stage search based on ``user_semantic``, identifying potentially helpful learnwares whose models solve tasks similar to your requirements. There are two types of Semantic Specification Search: ``EasyExactSemanticSearcher`` and ``EasyFuzzSemanticSearcher``. | |||
| In these two searchers, each learnware in the ``learnware_list`` is compared with ``user_info`` according to their ``semantic_spec``, and added to the search result if matched. Two ``semantic_spec``\ s are matched when every key is either matched or empty in ``user_info``. Different keys have different matching rules. Their ``__call__`` functions are the same: | |||
| - **EasyExactSemanticSearcher/EasyFuzzSemanticSearcher.__call__(self, learnware_list: List[Learnware], user_info: BaseUserInfo)-> SearchResults** | |||
| - For keys ``Data``, ``Task``, ``Library`` and ``license``, two ``semantic_spec`` keys are matched only if the value of the learnware ``semantic_spec`` (only one value for each key) exists in the values of the user ``semantic_spec`` (there may be multiple values for one key). | |||
| - For the key ``Scenario``, two ``semantic_spec`` keys are matched if their values have a nonempty intersection. | |||
| - For keys ``Name`` and ``Description``, the values are strings and case is ignored. In ``EasyExactSemanticSearcher``, two ``semantic_spec`` keys are matched if the value of the learnware ``semantic_spec`` is a substring of that of the user ``semantic_spec``. In ``EasyFuzzSemanticSearcher``, the exact semantic search is first conducted as in ``EasyExactSemanticSearcher``. If the result is empty, the fuzzy semantic search is activated: the ``learnware_list`` is sorted according to the fuzzy score function ``fuzz.partial_ratio`` in ``rapidfuzz``. | |||
| The results are stored in ``single_results`` of ``SearchResults`` and returned. | |||
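The exact matching rules listed above can be written down as a small sketch. The specification layout used here (a plain dict mapping each key to a list of values or a string) is an assumption made for illustration and is not the package's ``semantic_spec`` format; ``exact_semantic_match`` is a hypothetical helper, and the direction of the substring test follows the description above.

```python
SINGLE_VALUE_KEYS = ("Data", "Task", "Library", "license")


def exact_semantic_match(learnware_spec: dict, user_spec: dict) -> bool:
    """Return True if every non-empty user key matches the learnware's spec."""
    for key in SINGLE_VALUE_KEYS:
        user_values = user_spec.get(key) or []
        lw_values = learnware_spec.get(key) or []
        # The learnware has a single value, which must be among the user's values.
        if user_values and (not lw_values or lw_values[0] not in user_values):
            return False

    user_scenarios = set(user_spec.get("Scenario") or [])
    lw_scenarios = set(learnware_spec.get("Scenario") or [])
    # Scenario values only need to overlap.
    if user_scenarios and not (user_scenarios & lw_scenarios):
        return False

    for key in ("Name", "Description"):
        user_text = str(user_spec.get(key) or "").lower()
        lw_text = str(learnware_spec.get(key) or "").lower()
        # Case-insensitive substring test; an empty user value always matches.
        if user_text and lw_text not in user_text:
            return False
    return True


lw_spec = {
    "Data": ["Table"], "Task": ["Classification"], "Library": ["Scikit-learn"],
    "license": ["MIT"], "Scenario": ["Business", "Financial"],
    "Name": "sales", "Description": "",
}
user_spec = {
    "Data": ["Table", "Image"], "Task": ["Classification"],
    "Scenario": ["Financial"], "Name": "Retail SALES forecast",
}
print(exact_semantic_match(lw_spec, user_spec))  # prints True
print(exact_semantic_match(lw_spec, dict(user_spec, Task=["Regression"])))  # prints False
```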
| ``Statistical Specification Searcher`` | |||
| '''''''''''''''''''''''''''''''''''''''''' | |||
| If the user's statistical specification ``stat_info`` is provided, the learnware market can perform a more accurate learnware selection using ``EasyStatSearcher``. | |||
| - **EasyStatSearcher.__call__(self, learnware_list: List[Learnware], user_info: BaseUserInfo, max_search_num: int = 5, search_method: str = "greedy",) -> SearchResults** | |||
| - It searches for helpful learnwares from ``learnware_list`` based on the ``stat_info`` in ``user_info``. | |||
| - The result ``SingleSearchItem`` and ``MultipleSearchItem`` are both stored in ``SearchResults``. In ``SingleSearchItem``, it searches for single learnwares that could solve the user task; scores are also provided to represent the fitness of each single learnware and user task. In ``MultipleSearchItem``, it searches for a mixture of learnwares that could solve the user task better; the mixture learnware list and a score for the mixture is returned. | |||
| - The parameter ``search_method`` provides two choices of search strategy for mixture learnwares: ``greedy`` and ``auto``. With ``greedy``, the searcher repeatedly adds the learnware that brings the mixture closer to the user's ``stat_info``; with ``auto``, it directly calculates the best mixture weight for the ``learnware_list``. | |||
| - For single learnware search, we only return the learnwares with score larger than 0.6; For multiple learnware search, the parameter ``max_search_num`` specifies the maximum length of the returned mixture learnware list. | |||
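The ``greedy`` strategy can be illustrated with a heavily simplified sketch. It is not the package's algorithm: each learnware's statistical specification is reduced to a plain vector, the mixture is a uniform average, and closeness is squared Euclidean distance, all of which are assumptions of this example. The helper name ``greedy_mixture`` is hypothetical; only ``max_search_num`` mirrors a parameter named above.

```python
def greedy_mixture(candidates, target, max_search_num=5):
    """Greedily pick vectors whose uniform average moves closest to target."""

    def dist_to_target(indices):
        dim = len(target)
        mean = [sum(candidates[i][d] for i in indices) / len(indices) for d in range(dim)]
        return sum((m - t) ** 2 for m, t in zip(mean, target))

    chosen, best = [], float("inf")
    while len(chosen) < max_search_num:
        remaining = [i for i in range(len(candidates)) if i not in chosen]
        if not remaining:
            break
        # Try each remaining candidate and keep the one giving the closest mixture.
        dist, pick = min((dist_to_target(chosen + [i]), i) for i in remaining)
        if dist >= best:
            break  # no remaining candidate improves the mixture
        chosen.append(pick)
        best = dist
    return chosen, best


candidates = [(0.0, 0.0), (2.0, 2.0), (10.0, 10.0)]
print(greedy_mixture(candidates, target=(1.0, 1.0)))  # prints ([0, 1], 0.0)
```

In the example, neither of the first two vectors equals the target on its own, but their average does, which is the situation in which a mixture of learnwares is preferable to any single one.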
| ``Easy Checker`` | |||
| ++++++++++++++++++++ | |||
| ``EasySemanticChecker`` and ``EasyStatChecker`` are used to check the validity of the learnwares. They are used as: | |||
| - ``EasySemanticChecker`` mainly checks the integrity and legitimacy of the ``semantic_spec`` in the learnware. A legal ``semantic_spec`` should include all the keys, and the type of each key should meet our requirements. For keys with type ``Class``, the values should be unique and in our ``valid_list``; for keys with type ``Tag``, the values should not be empty; for keys with type ``String``, a non-empty string is expected as the value; for a table learnware, the dimensions and description of inputs are needed; for ``classification`` or ``regression`` learnwares, the dimensions and description of outputs are indispensable. A learnware that passes the ``EasySemanticChecker`` is marked as ``NONUSABLE_LEARNWARE``; otherwise, it is marked as ``INVALID_LEARNWARE`` and error information is returned. | |||
| - ``EasyStatChecker`` mainly checks the ``model`` and ``stat_spec`` of the learnwares. It includes the following steps: | |||
| - **Check model instantiation**: ``learnware.instantiate_model`` to instantiate the model and transform it to a ``BaseModel``. | |||
| - **Check input shape**: Check whether the shape of ``semantic_spec`` input(if exists), ``learnware.input_shape`` and shape of ``stat_spec`` are consistent, and then generate an example input with that shape. | |||
| - **Check model prediction**: Use the model to predict the label of the example input, and record the output shape. | |||
| - **Check output shape**: For ``Classification``, ``Regression`` and ``Feature Extraction`` tasks, the output shape should be consistent with that in ``semantic_spec`` and ``learnware.output_shape``. Besides, for ``Classification`` tasks, the output should be a legal class in ``semantic_spec``. | |||
| If any step above fails or raises an error, the learnware will be marked as ``INVALID_LEARNWARE``. The learnwares that pass the ``EasyStatChecker`` are marked as ``USABLE_LEARNWARE``. | |||
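The step sequence above can be sketched as a single function. This is an illustrative outline and not the ``EasyStatChecker`` code: the status constants are placeholder strings (the real values are not stated here), the model is produced by a hypothetical ``model_factory`` callable, and shapes are reduced to a single input and output dimension.

```python
# Placeholder status values, chosen for this sketch only.
INVALID_LEARNWARE = "invalid"
USABLE_LEARNWARE = "usable"


def stat_check(model_factory, input_dim, expected_output_dim):
    """Run instantiate -> build example input -> predict -> compare output shape."""
    try:
        model = model_factory()            # 1. model instantiation
        example = [[0.0] * input_dim]      # 2. one example row of the declared shape
        output = model(example)            # 3. model prediction
        # 4. output shape must match the declared one
        if len(output) != 1 or len(output[0]) != expected_output_dim:
            return INVALID_LEARNWARE, "unexpected output shape"
    except Exception as err:
        # Any failure or error in the steps above invalidates the learnware.
        return INVALID_LEARNWARE, f"check failed: {err!r}"
    return USABLE_LEARNWARE, "ok"


def good_factory():
    return lambda X: [[sum(row)] for row in X]


def broken_factory():
    raise RuntimeError("cannot load weights")


print(stat_check(good_factory, input_dim=3, expected_output_dim=1))  # prints ('usable', 'ok')
print(stat_check(good_factory, input_dim=3, expected_output_dim=2)[0])  # prints invalid
print(stat_check(broken_factory, input_dim=3, expected_output_dim=1)[0])  # prints invalid
```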
| Hetero Market | |||
| -------------- | |||
| ------------- | |||
| Hetero Market consists of ``HeteroMapTableOrganizer``, ``HeteroSearcher``, and the checker list ``[EasySemanticChecker, EasyStatChecker]``. | |||
| It is an extended version of the Easy Market which accommodates table learnwares from different feature spaces (heterogeneous table learnwares), expanding the applicable scope of the learnware paradigm. | |||
| This market trains a heterogeneous engine based on existing learnware specifications in the market to merge different specification islands and assign new specifications (``HeteroMapTableSpecification``) to learnwares. | |||
| With more learnwares submitted, the heterogeneous engine will continuously update and is expected to build a more precise specification world. | |||
| ``HeteroMapTableOrganizer`` | |||
| +++++++++++++++++++++++++++ | |||
| ``HeteroMapTableOrganizer`` overrides methods from ``EasyOrganizer`` and implements new methods to support organization of heterogeneous table learnwares. Key features include: | |||
| - **reload_market**: Reloads the heterogeneous engine if there is one, otherwise initializes an engine with default configurations. Returns a flag indicating whether the market is reloaded successfully. | |||
| - **reset**: Resets the heterogeneous market with specific settings regarding the heterogeneous engine such as ``auto_update``, ``auto_update_limit`` and ``training_args`` configurations. | |||
| - **add_learnware**: Add a learnware into the market, meanwhile assigning ``HeteroMapTableSpecification`` to the learnware using the heterogeneous engine. The engine's update process will be triggered if ``auto_update`` is set to True and the number of learnwares in the market with ``USABLE_LEARNWARE`` status exceeds ``auto_update_limit``. Return the ``learnware_id`` and ``learnwere_status``. | |||
| - **delete_learnware**: Removes the learnware with ``id`` from the market, also remove its new specification if there is one. Return a flag of whether the deletion is successful. | |||
| - **update_learnware**: Update the learnware's ``zip_path``, ``semantic_spec``, ``check_status`` and its new specification if there is one. Return a flag indicating whether it passed the ``checker``. | |||
| - **generate_hetero_map_spec**: Generate ``HeteroMapTableSpecification`` for users based on the information provided in ``user_info``. | |||
| - **train**: Build the heterogeneous engine using learnwares from the market that supports heterogeneous market training. | |||
| The learnware market naturally consists of models with different feature spaces, different label spaces, or different objectives. It is beneficial for the market to accommodate these heterogeneous learnwares and provide corresponding learnware recommendation and reuse services to the user so as to expand the applicable scope of learnware paradigm. | |||
| ``HeteroSearcher`` | |||
| ++++++++++++++++++ | |||
| Models are submitted to the market with their original specifications. However, these specifications are hard to use for responding to user requirements because of heterogeneity. Specifications of heterogeneous models reside in different specification spaces. The market needs to merge these specification spaces into a unified one. To achieve this adjustment, you need to implement the class ``EvolvedMarket``, especially the function ``EvolvedMarket.generate_new_stat_specification``, which generates a new statistical specification in an identical space for each submitted model. | |||
| ``HeteroSearcher`` builds upon ``EasySearcher`` with additional support for searching among heterogeneous table learnwares, returning helpful learnwares with feature space and label space different from the user's task requirements. | |||
| The semantic specification search and statistical specification search have been integrated into the same interface ``HeteroSearcher``. | |||
| One important case is that models have different feature spaces. In order to enable the learnware market to handle heterogeneous feature spaces, you need to implement the class ``HeterogeneousFeatureMarket`` in the following way: | |||
| - **HeteroSearcher.__call__(self, user_info: BaseUserInfo, check_status: int = None, max_search_num: int = 5, search_method: str = "greedy",) -> SearchResults** | |||
| - First, design a method for the market to connect different feature spaces to a common subspace and implement the function ``HeterogeneousFeatureMarket.learn_mapping_functions``. This function uses the specifications of all submitted models to learn mapping functions that can map data from the original feature space to the common subspace and vice versa. | |||
| - Second, use the learned mapping functions to implement the functions ``HeterogeneousFeatureMarket.transform_original_to_subspace`` and ``HeterogeneousFeatureMarket.transform_subspace_to_original``. | |||
| - Third, use the functions ``HeterogeneousFeatureMarket.transform_original_to_subspace`` and ``HeterogeneousFeatureMarket.transform_subspace_to_original`` to overwrite the methods ``EvolvedMarket.generate_new_stat_specification`` and ``EvolvedMarket.EvolvedMarket.evolve_learnware_list`` of the base class ``EvolvedMarket``. | |||
| - It conducts the semantic searcher ``EasyFuzzsematicSearcher`` on all the learnwares from the ``HeteroOrganizer`` with the same ``check_status`` (All learnwares if ``check_status`` is None). | |||
| - If the ``stat_info`` is provided in ``user_info``, it conducts one of the following two types of statistical specification search using ``EasySearcher``, depending on whether heterogeneous learnware search is enabled. If enabled, ``stat_info`` will be updated with a ``HeteroMapTableSpecification`` generated for the user, and the Hetero Market performs heterogeneous learnware selection based on the updated ``stat_info``. If not enabled, the Hetero Market performs homogeneous learnware selection based on the original ``stat_info``. | |||
| .. note:: | |||
| The heterogeneous learnware search is enabled when ``user_info`` contains valid heterogeneous search information. Please refer to `WORKFLOWS: Hetero Search <../workflows/search.html#hetero-search>`_ for details. | |||
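The two-stage flow described above (semantic filtering first, then a homogeneous or heterogeneous statistical search when ``stat_info`` is present) can be outlined as follows. This is a toy sketch: the dictionary keys ``status``, ``task`` and ``hetero_enabled`` and the function ``toy_search`` are hypothetical and do not reflect the real ``HeteroSearcher`` data structures.

```python
def toy_search(user_info, learnwares, check_status=None):
    """Return (mode, candidates) following the two-stage flow in the text."""
    # Stage 1: keep learnwares with the requested status (all if None),
    # then apply a stand-in semantic match on a single tag.
    candidates = [
        lw for lw in learnwares
        if check_status is None or lw["status"] == check_status
    ]
    candidates = [lw for lw in candidates if lw["task"] == user_info["task"]]

    # Stage 2: a statistical search only runs when stat_info is provided.
    if "stat_info" not in user_info:
        return "semantic_only", candidates
    mode = "hetero" if user_info.get("hetero_enabled") else "homo"
    return mode, candidates


market = [
    {"id": "a", "status": 1, "task": "regression"},
    {"id": "b", "status": 0, "task": "regression"},
    {"id": "c", "status": 1, "task": "classification"},
]
mode, found = toy_search(
    {"task": "regression", "stat_info": object(), "hetero_enabled": True},
    market,
    check_status=1,
)
```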
| @@ -3,80 +3,67 @@ | |||
| Specification | |||
| ================================ | |||
| Concepts & Types | |||
| ====================================== | |||
| The search for helpful learnwares can be divided into two stages: statistical specification and semantic specification. | |||
| Statistical Specification | |||
| --------------------------- | |||
| The learnware specification should ideally provide essential information about every model in the learnware market, enabling efficient and accurate identification for future users. Our current specification design has two components. The first part consists of a string of descriptions or tags assigned by the learnware market based on developer-submitted information. These descriptions or tags help identify the model's specification island. Different learnware market enterprises may use different descriptions or tags. | |||
| The second part of the specification is crucial for determining the model's position in the functional space :math:`F: \mathcal{X} \mapsto \mathcal{Y}` with respect to obj. A recent development in this area is the RKME (Reduced Kernel Mean Embedding) specification, which builds on the reduced set of KME (Kernel Mean Embedding) techniques. KME is a powerful method for mapping a probability distribution to a point in RKHS (Reproducing Kernel Hilbert Space), while the reduced set retains this ability with a concise representation that doesn't reveal the original data. | |||
| The RKME specification assumes that each learnware is a well-performed model on its training data. The RKME specification is based on RKME :math:`\widetilde{\Phi}`, which aims to provide a good representation by constructing a reduced set to approximate the empirical KME :math:`\Phi=\int_{\mathcal{X}} k(\boldsymbol{x}, \cdot) \mathrm{d} P(\boldsymbol{x})` of the underlying distribution. Theoretically, when the kernel function satisfies :math:`k(\boldsymbol{x}, \boldsymbol{x}) \leq 1` for all :math:`x \in \mathcal{X}`, we have the guarantee that | |||
| .. math:: | |||
| \|\widetilde{\Phi}-\Phi\|_{\mathcal{H}} \leq 2 \sqrt{\frac{2}{n}}+\sqrt{\frac{1}{m}}+\sqrt{\frac{2 \log (1 / \delta)}{m}}, | |||
| with a probability of at least :math:`1-\delta`, where :math:`n, m` are the sizes of the RKME reduced set and the original data, respectively. It is known that when using characteristic kernels such as the Gaussian kernel, KME can capture all information about the distribution. Additionally, when the RKHS of the kernel function is finite-dimensional, RKME has a linear convergence rate :math:`O\left(e^{-n}\right)` to empirical KME; for infinite-dimensional RKHS, it has been proved constructively that RKME can enjoy :math:`O(\sqrt{d} / n)` convergence rate under :math:`L_{\infty}` measure, where :math:`d` is the dimension of the original data. Therefore, RKME is guaranteed to be a good estimation of KME and a valid representation for data distribution that encodes the ability of a trained model. | |||
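To get a feel for the guarantee above, the right-hand side of the bound can be evaluated for concrete sizes. The helper below is a small sketch; the function name ``rkme_bound`` and the example values of :math:`n`, :math:`m` and :math:`\delta` are our own choices for illustration.

```python
import math


def rkme_bound(n, m, delta):
    """Evaluate 2*sqrt(2/n) + sqrt(1/m) + sqrt(2*log(1/delta)/m).

    n: size of the RKME reduced set, m: size of the original data,
    delta: failure probability (the bound holds with probability >= 1 - delta).
    """
    if n <= 0 or m <= 0 or not 0 < delta < 1:
        raise ValueError("need n > 0, m > 0 and 0 < delta < 1")
    return (
        2 * math.sqrt(2 / n)
        + math.sqrt(1 / m)
        + math.sqrt(2 * math.log(1 / delta) / m)
    )


bound_small = rkme_bound(n=100, m=10_000, delta=0.05)
bound_large = rkme_bound(n=1_000, m=10_000, delta=0.05)
```

For a fixed amount of original data, the first term dominates, so enlarging the reduced set is what tightens the bound in this example.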
| Under certain assumptions, the risk for the user task can be bounded, such as assuming that the distribution corresponding to the user's task matches that of a learnware, or that it can be approximated by a mixture of distributions corresponding to a set of learnwares' tasks, i.e., | |||
| .. math:: | |||
| \mathcal{D}_u=\sum_{i=1}^N w_i \mathcal{D}_i | |||
| where :math:`\mathcal{D}_u` is the distribution corresponding to the user's task, :math:`N` is the number of learnwares, and :math:`\mathcal{D}_i` are their corresponding distributions. We have :math:`\sum_{i=1}^N w_i=1` and :math:`w_i \geq 0`. These two assumptions are known as task-recurrent and instance-recurrent assumptions. Additionally, assume that all learnwares are well-performed ones: | |||
| .. math:: | |||
| \mathbb{E}_{\mathcal{D}_i}\left[\ell\left(\widehat{f}_i(\boldsymbol{x}), \boldsymbol{y}\right)\right] \leq \epsilon, \forall i \in[N], | |||
| Learnware specification is the core component of the learnware paradigm, linking all processes about learnwares, including uploading, organizing, searching, deploying and reusing. | |||
| where :math:`\widehat{f}_i` is the function corresponding to the :math:`i`-th learnware, :math:`\ell` is the loss function, and :math:`\boldsymbol{y}` is assumed to be determined by a ground-truth global function :math:`h`. Under these assumptions, recent studies have attempted to bound the risk on the user's task. With the task-recurrent assumption and selecting the learnware :math:`\left(\widehat{f}_i, \tilde{\Phi}_i\right)` with the smallest RKHS distance :math:`\eta` according to RKME, given the loss function | |||
| In this section, we will introduce the concept and design of learnware specification in the ``learnware`` package. | |||
| We will then explore ``regular specification``\ s tailored for different data types such as tables, images and texts. | |||
| Lastly, we cover a ``system specification`` specifically assigned to table learnwares by the learnware market, aimed at accommodating all available table learnwares into a unified "specification world" despite their heterogeneity. | |||
| .. math:: | |||
| \left|\ell\left(\widehat{f}_i(\boldsymbol{x}), h(\boldsymbol{x})\right)\right| \leq U, \forall \boldsymbol{x} \in \mathcal{X}, \forall i \in[N], | |||
| Concepts & Types | |||
| ================== | |||
| we have | |||
| The learnware specification describes the model's specialty and utility in a certain format, allowing the model to be identified and reused by future users who may have no prior knowledge of the learnware. | |||
| The ``learnware`` package employs a highly extensible specification design, which consists of two parts: | |||
| .. math:: | |||
| - **Semantic specification** describes the model's type and functionality through a set of descriptions and tags. Learnwares with similar semantic specifications reside in the same specification island. | |||
| - **Statistical specification** characterizes the statistical information contained in the model using various machine learning techniques. It plays a crucial role in locating the appropriate place for the model within the specification island. | |||
| \mathbb{E}_{\mathcal{D}_u}\left[\ell\left(\widehat{f}_i(\boldsymbol{x}), \boldsymbol{y}\right)\right] \leq \epsilon+U \eta+O\left(\frac{1}{\sqrt{m}}+\frac{1}{\sqrt{n}}\right). | |||
| When searching in the learnware market, the system first locates specification islands based on the semantic specification of the user's task, | |||
| then pinpoints highly beneficial learnwares on these islands based on the statistical specification of the user's task. | |||
| As for the instance-recurrent assumption and the 0/1-loss | |||
| Statistical Specification | |||
| --------------------------- | |||
| .. math:: | |||
| We employ the ``Reduced Kernel Mean Embedding (RKME) Specification`` as the foundation for implementing statistical specification for diverse data types, | |||
| with adjustments made according to the characteristics of each data type. | |||
| The RKME specification is a recent development in learnware specification design, which represents the distribution of a model's training data in a privacy-preserving manner. | |||
| \ell_{01}(f(\boldsymbol{x}), \boldsymbol{y})=\mathbb{I}(f(\boldsymbol{x}) \neq \boldsymbol{y}), | |||
| Within the ``learnware`` package, you'll find two types of statistical specifications: ``regular specification`` and ``system specification``. The former is generated locally | |||
| by users to express their model's statistical information, while the latter is assigned by the learnware market to accommodate and organize heterogeneous learnwares. | |||
| a more general result has been achieved: | |||
| Semantic Specification | |||
| ----------------------- | |||
| .. math:: | |||
| The semantic specification consists of a "dict" structure that includes keywords "Data", "Task", "Library", "Scenario", "License", "Description", and "Name". | |||
| In the case of table learnwares, users should additionally provide descriptions for each feature dimension and output dimension through the "Input" and "Output" keywords. | |||
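As a rough illustration of the keyword structure, the sketch below checks a dict for the keywords listed above. The flat string values, the ``"Table"`` marker and the helper ``missing_semantic_keys`` are assumptions made for this example; the package's actual value schema may be richer.

```python
# Keywords named in the text; the value format used here is an assumption.
REQUIRED_KEYS = ("Data", "Task", "Library", "Scenario", "License", "Description", "Name")
TABLE_ONLY_KEYS = ("Input", "Output")


def missing_semantic_keys(spec):
    """Return the keywords that are absent from a semantic specification dict."""
    required = list(REQUIRED_KEYS)
    if spec.get("Data") == "Table":
        # Table learnwares also describe each feature and output dimension.
        required += TABLE_ONLY_KEYS
    return [key for key in required if key not in spec]


toy_spec = {
    "Data": "Table",
    "Task": "Regression",
    "Library": "Scikit-learn",
    "Scenario": "Business",
    "License": "MIT",
    "Description": "Toy sales forecasting model",
    "Name": "toy_learnware",
}
missing = missing_semantic_keys(toy_spec)

complete_spec = dict(
    toy_spec,
    Input={"0": "units sold last week"},
    Output={"0": "units sold next week"},
)
```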
| \mathbb{E}_{\mathcal{D}_u}\left[\ell_{01}(f(\boldsymbol{x}), \boldsymbol{y})\right] \leq \epsilon+R(g), | |||
| where :math:`R(g)=\sum_{i=1}^N w_i \mathbb{E}_{\mathcal{D}_i}\left[\ell_{01}(g(\boldsymbol{x}), i)\right]` represents the weighted risk of any learnware selector :math:`g(\boldsymbol{x})`, which takes unlabeled data as input and assigns it to the appropriate model, and :math:`f(\boldsymbol{x})=\widehat{f}_{g(\boldsymbol{x})}(\boldsymbol{x})` is the final model for the user's task. | |||
| Regular Specification | |||
| ====================================== | |||
| Efforts have been made to enable the learnware market to handle unseen tasks, where the user's task involves some unseen aspects that have never been addressed by the current learnwares in the market. A more general theoretical analysis has been presented based on mixture proportion estimation. | |||
| The ``learnware`` package provides a unified interface, ``generate_stat_spec``, for generating ``regular specification``\ s across different data types. | |||
| Users can use the training data ``train_x`` (supported types include numpy.ndarray, pandas.DataFrame, and torch.Tensor) as input to generate the ``regular specification`` of the model, | |||
| as shown in the following code: | |||
| .. code:: python | |||
| Semantic Specification | |||
| --------------------------- | |||
| from learnware.specification import generate_stat_spec | |||
| The semantic specification describes the characteristics of the user's task, and the market will identify potentially helpful learnwares whose models solve tasks similar to your requirements. The details of the semantic specification are given in `Identification Learnwares <../workflow/identify.html>`_. | |||
| data_type = "table" # supported data types: ["table", "image", "text"] | |||
| regular_spec = generate_stat_spec(type=data_type, x=train_x) | |||
| regular_spec.save("stat.json") | |||
| It's worth noting that the above code only runs on user's local computer and does not interact with any cloud servers or leak any local private data. | |||
| Regular Specification | |||
| ====================================== | |||
| .. note:: | |||
| In cases where the model's training data is too large, causing the above code to fail, you can consider sampling the training data to ensure it's of a suitable size before proceeding with reduction generation. | |||
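One simple way to follow the advice in this note is to draw a uniform random subset of rows before generating the specification. The sketch below uses plain Python lists; the helper ``subsample_rows`` and the size limit are hypothetical, and the sampled rows would still need to be converted to one of the supported input types before being passed to ``generate_stat_spec``.

```python
import random


def subsample_rows(rows, max_rows, seed=0):
    """Return at most max_rows rows, drawn uniformly without replacement."""
    if len(rows) <= max_rows:
        return list(rows)
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return rng.sample(list(rows), max_rows)


# Toy training table with two features per row.
train_x = [[float(i), float(i) ** 2] for i in range(100_000)]
train_x_small = subsample_rows(train_x, max_rows=5_000)
```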
| Table Specification | |||
| -------------------------- | |||
| The ``regular specification`` for tabular learnware is essentially the RKME specification of the model's training table data. No additional adjustment is needed. | |||
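To make the idea of an RKME concrete, the sketch below measures the squared RKHS distance between a weighted reduced set and the empirical kernel mean embedding of the full data, using a Gaussian kernel. It is only an illustration: the reduced set here is a uniform random subset with equal weights, whereas an actual RKME optimizes both the points and the weights. All names and the kernel width are our own assumptions.

```python
import math
import random


def gaussian_kernel(x, y, gamma=0.5):
    """k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def kme_distance_sq(z, beta, x, gamma=0.5):
    """Squared RKHS distance between sum_j beta_j k(z_j, .) and the empirical KME of x."""
    m = len(x)
    zz = sum(
        bi * bj * gaussian_kernel(zi, zj, gamma)
        for bi, zi in zip(beta, z)
        for bj, zj in zip(beta, z)
    )
    zx = sum(bi * gaussian_kernel(zi, xj, gamma) for bi, zi in zip(beta, z) for xj in x) / m
    xx = sum(gaussian_kernel(xi, xj, gamma) for xi in x for xj in x) / (m * m)
    return zz - 2 * zx + xx


rng = random.Random(0)
data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]

# Naive reduced set: 10 sampled points with equal weights (not optimized).
reduced = rng.sample(data, 10)
weights = [1 / len(reduced)] * len(reduced)

dist_subset = kme_distance_sq(reduced, weights, data)
dist_self = kme_distance_sq(data, [1 / len(data)] * len(data), data)
```

The distance of the data to itself is zero up to rounding, which serves as a sanity check; a smaller ``dist_subset`` means the reduced set represents the data distribution more faithfully.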
| Image Specification | |||
| -------------------------- | |||
| @@ -99,19 +86,17 @@ By randomly sampling a subset of the dataset, we can construct Image Specificati | |||
| .. code-block:: python | |||
| import torchvision | |||
| from torch.utils.data import DataLoader | |||
| from learnware.specification import generate_rkme_image_spec | |||
| import torchvision | |||
| from torch.utils.data import DataLoader | |||
| from learnware.specification import generate_rkme_image_spec | |||
| SAMPLED_SIZE = 5000 | |||
| full_set = torchvision.datasets.CIFAR10( | |||
| root='./data', train=True, download=True, transform=torchvision.transforms.ToTensor()) | |||
| loader = DataLoader(full_set, batch_size=SAMPLED_SIZE, shuffle=True) | |||
| sampled_X, _ = next(iter(loader)) | |||
| cifar10 = torchvision.datasets.CIFAR10( | |||
| root='./data', train=True, download=True, transform=torchvision.transforms.ToTensor()) | |||
| X, _ = next(iter(DataLoader(cifar10, batch_size=len(cifar10)))) | |||
| spec = generate_rkme_image_spec(sampled_X) | |||
| spec.save("cifar10.json") | |||
| spec = generate_rkme_image_spec(X, sample_size=5000) | |||
| spec.save("cifar10.json") | |||
| Privacy Protection | |||
| ^^^^^^^^^^^^^^^^^^^^^^^^^^ | |||
| @@ -130,4 +115,21 @@ Text Specification | |||
| System Specification | |||
| ====================================== | |||
| ====================================== | |||
| In contrast to ``regular specification``\ s which are generated solely by users, | |||
| ``system specification``\ s are higher-level statistical specifications assigned by learnware markets | |||
| to effectively accommodate and organize heterogeneous learnwares. | |||
| This implies that ``regular specification``\ s are usually applicable across different markets, while ``system specification``\ s are generally closely associated | |||
| with particular learnware market implementations. | |||
| ``system specification``\ s play a critical role in heterogeneous markets such as the ``Hetero Market``: | |||
| - Learnware organizers use these specifications to connect isolated specification islands into unified "specification world"s. | |||
| - Learnware searchers perform helpful learnware recommendations among all table learnwares in the market, leveraging the ``system specification``\ s generated for users. | |||
| The ``learnware`` package now includes a type of ``system specification``, named ``HeteroMapTableSpecification``, made especially for the ``Hetero Market`` implementation. | |||
| This specification is automatically given to all table learnwares when they are added to the ``Hetero Market``. | |||
| It is also set up to be updated periodically, ensuring it remains accurate as the learnware market evolves and builds more precise specification worlds. | |||
| Please refer to `COMPONENTS: Hetero Market <../components/market.html#hetero-market>`_ for implementation details. | |||
| @@ -92,8 +92,8 @@ html_theme = "sphinx_book_theme" | |||
| html_theme_path = [sphinx_book_theme.get_html_theme_path()] | |||
| html_theme_options = { | |||
| "logo_only": True, | |||
| # "collapse_navigation": False, | |||
| # "display_version": False, | |||
| "collapse_navigation": False, | |||
| # "display_version": False, | |||
| "navigation_depth": 4, | |||
| } | |||
| html_logo = "_static/img/logo/logo1.png" | |||
| @@ -21,50 +21,195 @@ Table: homo+hetero | |||
| Datasets | |||
| ------------------ | |||
| We designed experiments on three publicly available datasets, namely `Prediction Future Sales (PFS) <https://www.kaggle.com/c/competitive-data-science-predict-future-sales/data>`_, | |||
| `M5 Forecasting (M5) <https://www.kaggle.com/competitions/m5-forecasting-accuracy/data>`_ and `CIFAR 10 <https://www.cs.toronto.edu/~kriz/cifar.html>`_. | |||
| For the two sales forecasting data sets of PFS and M5, we divide the user data according to different stores, and train the Ridge model and LightGBM model on the corresponding data respectively. | |||
| For the CIFAR10 image classification task, we first randomly pick 6 to 10 categories, and randomly select 800 to 2000 samples from each category from the categories corresponding to the training set, constituting a total of 50 different uploaders. | |||
| For test users, we first randomly pick 3 to 6 categories, and randomly select 150 to 350 samples from each category from the corresponding categories from the test set, constituting a total of 20 different users. | |||
| Our study involved three public datasets in the sales forecasting field: `Predict Future Sales (PFS) <https://www.kaggle.com/c/competitive-data-science-predict-future-sales/data>`_, | |||
| `M5 Forecasting (M5) <https://www.kaggle.com/competitions/m5-forecasting-accuracy/data>`_ and `Corporacion <https://www.kaggle.com/competitions/favorita-grocery-sales-forecasting/data>`_. | |||
| We applied various pre-processing methods to these datasets to enhance the richness of the data. | |||
| After pre-processing, we first divided each dataset by store and then split the data for each store into training and test sets. Specifically: | |||
| - For PFS, the test set consisted of the last month's data from each store. | |||
| - For M5, we designated the final 28 days' data from each store as the test set. | |||
| - For Corporacion, the test set was composed of the last 16 days of data from each store. | |||
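The per-store split by trailing days can be sketched as below. The record layout (a list of ``(date, value)`` pairs per store) and the helper ``split_last_days`` are assumptions for illustration; the actual pre-processing code may differ.

```python
from datetime import date, timedelta


def split_last_days(records, test_days):
    """Split (date, value) records so the last `test_days` calendar days form the test set."""
    last_day = max(day for day, _ in records)
    cutoff = last_day - timedelta(days=test_days)
    train = [r for r in records if r[0] <= cutoff]
    test = [r for r in records if r[0] > cutoff]
    return train, test


# Toy store with 100 consecutive days of data (dates are made up).
start = date(2017, 1, 1)
records = [(start + timedelta(days=i), float(i)) for i in range(100)]

# 16 days mirrors the Corporacion setting; 28 would mirror M5.
train, test = split_last_days(records, test_days=16)
```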
| In the submitting stage, the Corporacion dataset's 55 stores are regarded as 165 uploaders, each employing one of three different feature engineering methods. | |||
| For the PFS dataset, 100 uploaders are established, each using one of two feature engineering approaches. | |||
| These uploaders then utilize their respective stores' training data to develop LightGBM models. | |||
| As a result, the learnware market comprises 265 learnwares, derived from five types of feature spaces and two types of label spaces. | |||
| Based on the specific design of user tasks, our experiments were primarily categorized into two types: | |||
| - ``homogeneous experiments`` are designed to evaluate performance when users can reuse learnwares in the learnware market that have the same feature space as their tasks (homogeneous learnwares). | |||
| This contributes to showing the effectiveness of using learnwares that align closely with the user's specific requirements. | |||
| - ``heterogeneous experiments`` aim to evaluate the performance of identifying and reusing helpful heterogeneous learnwares in situations where | |||
| no available learnwares match the feature space of the user's task. This helps to highlight the potential of learnwares for applications beyond their original purpose. | |||
| Homo Experiments | |||
| ----------------------- | |||
| For homogeneous experiments, the 55 stores in the Corporacion dataset act as 55 users, each applying one feature engineering method, | |||
| and using the test data from their respective store as user data. These users can then search for homogeneous learnwares in the market with the same feature spaces as their tasks. | |||
| The Mean Squared Error (MSE) of search and reuse is presented in the table below: | |||
| +-----------------------------------+---------------------+ | |||
| | Mean in Market (Single) | 0.331 ± 0.040 | | |||
| +-----------------------------------+---------------------+ | |||
| | Best in Market (Single) | 0.151 ± 0.046 | | |||
| +-----------------------------------+---------------------+ | |||
| | Top-1 Reuse (Single) | 0.280 ± 0.090 | | |||
| +-----------------------------------+---------------------+ | |||
| | Job Selector Reuse (Multiple) | 0.274 ± 0.064 | | |||
| +-----------------------------------+---------------------+ | |||
| | Average Ensemble Reuse (Multiple) | 0.267 ± 0.051 | | |||
| +-----------------------------------+---------------------+ | |||
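For readers unfamiliar with the metric and the averaging baseline in the table, the following toy sketch computes the MSE of single models and of their averaged predictions. The numbers are contrived for illustration and are unrelated to the reported results; ``mse`` and ``average_ensemble`` are our own helper names, not the package's reuser classes.

```python
def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def average_ensemble(predictions):
    """Average the predictions of several models, sample by sample."""
    n_models = len(predictions)
    return [sum(column) / n_models for column in zip(*predictions)]


y_true = [1.0, 2.0, 3.0]
model_preds = [
    [0.0, 2.0, 4.0],  # predictions of a first toy learnware
    [2.0, 2.0, 2.0],  # predictions of a second toy learnware
]

single_mses = [mse(y_true, preds) for preds in model_preds]
ensemble_pred = average_ensemble(model_preds)
ensemble_mse = mse(y_true, ensemble_pred)
```

In this contrived case the two models err in opposite directions, so averaging cancels their errors; real learnwares will generally not be this complementary.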
| When users have both test data and limited training data derived from their original data, reusing single or multiple searched learnwares from the market can often yield | |||
| better results than training models from scratch on limited training data. We present the change curves in MSE for the user's self-trained model, as well as for the Feature Augmentation single learnware reuse method and the Ensemble Pruning multiple learnware reuse method. | |||
| These curves display their performance on the user's test data as the amount of labeled training data increases. | |||
| The average results across 55 users are depicted in the figure below: | |||
| .. image:: ../_static/img/table_homo_labeled.png | |||
| :align: center | |||
| :alt: Table Homo Limited Labeled Data | |||
| From the figure, it's evident that when users have limited training data, the performance of reusing single/multiple table learnwares is superior to that of the user's own model. | |||
| This emphasizes the benefit of learnware reuse in significantly reducing the need for extensive training data and achieving enhanced results when available user training data is limited. | |||
| Hetero Experiments | |||
| ------------------------- | |||
| In heterogeneous experiments, the learnware market recommends helpful heterogeneous learnwares whose feature spaces differ from those of | |||
| the user tasks. Based on whether there are learnwares in the market that handle tasks similar to the user's task, the experiments can be further subdivided into the following two types: | |||
| Cross Feature Space Experiments | |||
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | |||
| We designate the 41 stores in the PFS dataset as users, creating their user data with an alternative feature engineering approach that varies from the methods employed by learnwares in the market. | |||
| Consequently, while the market's learnwares from the PFS dataset undertake tasks very similar to our users, the feature spaces do not match exactly. In this experimental configuration, | |||
| we tested various heterogeneous learnware reuse methods (without using user's labeled data) and compared them to the user's self-trained model based on a small amount of training data. | |||
| The average MSE performance across 41 users is as follows: | |||
| +-----------------------------------+---------------------+ | |||
| | Mean in Market (Single) | 1.459 ± 1.066 | | |||
| +-----------------------------------+---------------------+ | |||
| | Best in Market (Single) | 1.226 ± 1.032 | | |||
| +-----------------------------------+---------------------+ | |||
| | Top-1 Reuse (Single) | 1.407 ± 1.061 | | |||
| +-----------------------------------+---------------------+ | |||
| | Average Ensemble Reuse (Multiple) | 1.312 ± 1.099 | | |||
| +-----------------------------------+---------------------+ | |||
| | User model with 50 labeled data | 1.267 ± 1.055 | | |||
| +-----------------------------------+---------------------+ | |||
| From the results, it is noticeable that the learnware market still performs quite well even when users lack labeled data, | |||
| provided it includes learnwares addressing tasks that are similar but not identical to the user's. | |||
| In these instances, the market's effectiveness can rival scenarios where users have access to a limited quantity of labeled data. | |||
| Cross Task Experiments | |||
| ^^^^^^^^^^^^^^^^^^^^^^^ | |||
| Here we have chosen the 10 stores from the M5 dataset to act as users. Although the broad task of sales forecasting is similar to the tasks addressed by the learnwares in the market, | |||
| there are no learnwares available that directly cater to the M5 sales forecasting requirements. All learnwares show variations in both feature and label spaces compared to the tasks of M5 users. | |||
| We present the change curves in RMSE for the user's self-trained model and several learnware reuse methods. | |||
| These curves display their performance on the user's test data as the amount of labeled training data increases. | |||
| The average results across 10 users are depicted in the figure below: | |||
| .. image:: ../_static/img/table_hetero_labeled.png | |||
| :align: center | |||
| :alt: Table Hetero Limited Labeled Data | |||
| We can observe that heterogeneous learnwares are beneficial when there's a limited amount of the user's labeled training data available, | |||
| aiding in better alignment with the user's specific task. This underscores the potential of learnwares to be applied to tasks beyond their original purpose. | |||
| We tested the efficiency of the specification generation and the accuracy of the search and reuse model respectively. | |||
| The evaluation index on PFS and M5 data is RMSE, and the evaluation index on the CIFAR10 classification task is classification accuracy. | |||
| Text Experiment | |||
| ==================== | |||
| Datasets | |||
| ------------------ | |||
| We conducted experiments on the widely used text benchmark dataset: `20-newsgroup <http://qwone.com/~jason/20Newsgroups/>`_. | |||
| 20-newsgroup is a renowned text classification benchmark with a hierarchical structure, featuring 5 superclasses {comp, rec, sci, talk, misc}. | |||
| In the submitting stage, we enumerated all combinations of three superclasses from the five available, randomly sampling 50% of each combination from the training set to create datasets for 50 uploaders. | |||
| In the deploying stage, we considered all combinations of two superclasses out of the five, selecting all data for each combination from the testing set as a test dataset for one user. This resulted in 10 users. | |||
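The superclass combinations mentioned above can be enumerated directly. The sketch below only lists the combinations; the text does not spell out how the three-superclass combinations are turned into 50 uploaders, so that step is not modeled here.

```python
from itertools import combinations

SUPERCLASSES = ("comp", "rec", "sci", "talk", "misc")

# Uploaders draw on combinations of three superclasses,
# while each user corresponds to one pair of superclasses.
uploader_combos = list(combinations(SUPERCLASSES, 3))
user_combos = list(combinations(SUPERCLASSES, 2))
```

Choosing 2 out of 5 superclasses gives 10 pairs, which matches the 10 users stated in the text.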
| The user's own training data was generated using the same sampling procedure as the user test data, despite originating from the training dataset. | |||
| Model training comprised two parts: the first part involved training a tfidf feature extractor, and the second part used the extracted text feature vectors to train a naive Bayes classifier. | |||
| Our experiments comprise two components: | |||
| * ``unlabeled_text_example`` is designed to evaluate performance when users possess only testing data, searching and reusing learnware available in the market. | |||
| * ``labeled_text_example`` aims to assess performance when users have both testing and limited training data, searching and reusing learnware directly from the market instead of training a model from scratch. This helps determine the amount of training data saved for the user. | |||
| Results | |||
| ---------------- | |||
| The time consumed by specification generation is shown in the table below: | |||
| * ``unlabeled_text_example``: | |||
| ==================== ==================== ================================= | |||
| Dataset Data Dimensions Specification Generation Time (s) | |||
| ==================== ==================== ================================= | |||
| PFS 8714274*31 < 1.5 | |||
| M5 46027957*82 9~15 | |||
| CIFAR10 9000*3*32*32 7~10 | |||
| ==================== ==================== ================================= | |||
| The accuracy of search and reuse is presented in the table below: | |||
| The accuracy of search and reuse is shown in the table below: | |||
| +-----------------------------------+---------------------+ | |||
| | Mean in Market (Single) | 0.507 ± 0.030 | | |||
| +-----------------------------------+---------------------+ | |||
| | Best in Market (Single) | 0.859 ± 0.051 | | |||
| +-----------------------------------+---------------------+ | |||
| | Top-1 Reuse (Single) | 0.846 ± 0.054 | | |||
| +-----------------------------------+---------------------+ | |||
| | Job Selector Reuse (Multiple) | 0.845 ± 0.053 | | |||
| +-----------------------------------+---------------------+ | |||
| | Average Ensemble Reuse (Multiple) | 0.862 ± 0.051 | | |||
| +-----------------------------------+---------------------+ | |||
| ==================== ==================== ================================= ================================= | |||
| Dataset Top-1 Performance Job Selector Reuse Average Ensemble Reuse | |||
| ==================== ==================== ================================= ================================= | |||
| PFS 1.955 +/- 2.866 2.175 +/- 2.847 1.950 +/- 2.888 | |||
| M5 2.066 +/- 0.424 2.116 +/- 0.472 2.512 +/- 0.573 | |||
| CIFAR10 0.619 +/- 0.138 0.585 +/- 0.056 0.715 +/- 0.075 | |||
| ==================== ==================== ================================= ================================= | |||
| * ``labeled_text_example``: | |||
| We present the change curves in classification error rates for both the user's self-trained model and multiple learnware reuse (EnsemblePrune), showcasing their performance on the user's test data as the user's training data increases. The average results across 10 users are depicted below: | |||
| .. image:: ../_static/img/text_labeled_curves.png | |||
| :align: center | |||
| :alt: Text Limited Labeled Data | |||
| Text Experiment | |||
| ==================== | |||
| From the figure above, it is evident that when the user's own training data is limited, the performance of multiple learnware reuse surpasses that of the user's own model. As the user's training data grows, it is expected that the user's model will eventually outperform the learnware reuse. This underscores the value of reusing learnware to significantly conserve training data and achieve superior performance when user training data is limited. | |||
| Image Experiment | |||
| ==================== | |||
| For the CIFAR-10 dataset, we sampled the training set unevenly by category and constructed unbalanced training datasets for the 50 learnwares that contained only some of the categories. This makes it unlikely that there exists any learnware in the learnware market that can accurately handle all categories of data; only the learnware whose training data is closest to the data distribution of the target task is likely to perform well on the target task. Specifically, the probability of each category being sampled obeys a random multinomial distribution, with a non-zero probability of sampling on only 4 categories, and the sampling ratio is 0.4: 0.4: 0.1: 0.1. Ultimately, the training set for each learnware contains 12,000 samples covering the data of 4 categories in CIFAR-10. | |||
| We constructed 50 target tasks using data from the test set of CIFAR-10. Similar to constructing the training set for the learnwares, in order to allow for some variation between tasks, we sampled the test set unevenly. Specifically, the probability of each category being sampled obeys a random multinomial distribution, with non-zero sampling probability on 6 categories, and the sampling ratio is 0.3: 0.3: 0.1: 0.1: 0.1: 0.1. Ultimately, each target task contains 3000 samples covering the data of 6 categories in CIFAR-10. | |||
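The uneven sampling described above can be sketched as follows. This is a minimal illustration, not the experiment's actual code: the helper name `sample_unbalanced` is hypothetical, the labels are random stand-ins for the CIFAR-10 training labels, and the reading of the sampling scheme (pick the categories at random, then draw each sample's category with the fixed ratios) is an assumption.

```python
import numpy as np

def sample_unbalanced(labels, ratios, n_samples, rng):
    """Pick len(ratios) random classes and draw n_samples indices with those class ratios."""
    classes = rng.choice(np.unique(labels), size=len(ratios), replace=False)
    # Draw a class for every sample according to the fixed ratios
    drawn = rng.choice(classes, size=n_samples, p=ratios)
    indices = np.empty(n_samples, dtype=int)
    for c in classes:
        mask = drawn == c
        n_needed = int(mask.sum())
        pool = np.flatnonzero(labels == c)
        # Fall back to sampling with replacement only if the class pool is too small
        indices[mask] = rng.choice(pool, size=n_needed, replace=n_needed > len(pool))
    return indices, classes

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=50_000)  # stand-in for CIFAR-10 training labels
idx, classes = sample_unbalanced(labels, [0.4, 0.4, 0.1, 0.1], 12_000, rng)
```

The same helper would cover the target tasks by passing six ratios (0.3, 0.3, 0.1, 0.1, 0.1, 0.1) and 3000 samples drawn from the test labels.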
| With this experimental setup, we evaluated the performance of RKME Image using 1 - Accuracy as the loss. | |||
| +-----------------------------------+---------------------+ | |||
| | Mean in Market (Single) | 0.655 ± 0.021 | | |||
| +-----------------------------------+---------------------+ | |||
| | Best in Market (Single) | 0.304 ± 0.046 | | |||
| +-----------------------------------+---------------------+ | |||
| | Top-1 Reuse (Single) | 0.406 ± 0.128 | | |||
| +-----------------------------------+---------------------+ | |||
| | Job Selector Reuse (Multiple) | 0.406 ± 0.128 | | |||
| +-----------------------------------+---------------------+ | |||
| | Average Ensemble Reuse (Multiple) | 0.310 ± 0.112 | | |||
| +-----------------------------------+---------------------+ | |||
| In some specific settings, the user will have a small number of labelled samples. In such settings, learning the weight of selected learnwares on a limited number of labelled samples can result in a better performance than training directly on a limited number of labelled samples. | |||
| .. image:: ../_static/img/image_labeled.png | |||
| :align: center | |||
| Get Start Examples | |||
| ========================= | |||
| Examples for `PFS, M5` and `CIFAR10` are available at [xxx]. You can run { main.py } directly to reproduce related experiments. | |||
| The test code is mainly composed of three parts, namely data preparation (optional), specification generation and market construction, and search test. | |||
| You can load the data prepared by us and skip the data preparation step. | |||
| Examples for the `20-newsgroup` dataset are available at [examples/dataset_text_workflow]. | |||
| We utilize the `fire` module to construct our experiments. You can execute the experiment with the following commands: | |||
| * `python main.py prepare_market`: Prepares the market. | |||
| * `python main.py unlabeled_text_example`: Executes the unlabeled_text_example experiment; the results will be printed in the terminal. | |||
| * `python main.py labeled_text_example`: Executes the labeled_text_example experiment; result curves will be automatically saved in the `figs` directory. | |||
| * Additionally, you can use `python main.py unlabeled_text_example True` to combine steps 1 and 2. The same approach applies to running labeled_text_example directly. | |||
| @@ -2,36 +2,135 @@ | |||
| Learnwares Reuse | |||
| ========================================== | |||
| This part introduces two baseline methods for reusing a given list of learnwares, namely ``JobSelectorReuser`` and ``AveragingReuser``. | |||
| Instead of training a model from scratch, the user can easily reuse a list of learnwares (``List[Learnware]``) to predict the labels of their own data (``numpy.ndarray`` or ``torch.Tensor``). | |||
| ``Learnware Reuser`` is a ``Python API`` that offers a variety of convenient tools for learnware reuse. Users can reuse a single learnware, a combination of multiple learnwares, | |||
| and heterogeneous learnwares efficiently with these tools, saving the time and effort of building models from scratch. There are mainly two types of | |||
| reuse tools, depending on whether the user has gathered a small amount of labeled data beforehand: (1) direct reuse and (2) customized reuse based on labeled data. | |||
| To illustrate, we provide a code demonstration that obtains the user dataset using ``sklearn.datasets.load_digits``, where ``test_data`` represents the data that requires prediction. | |||
| Assuming that ``learnware_list`` is the list of learnwares searched by the learnware market based on user specifications, the user can reuse each learnware in the ``learnware_list`` through ``JobSelectorReuser`` or ``AveragingReuser`` to predict the label of ``test_data``, thereby avoiding training a model from scratch. | |||
| .. note:: | |||
| For detailed explanations of the learnware reusers mentioned below, please refer to `COMPONENTS: All Reuse Methods <../components/learnware.html#all-reuse-methods>`_ . | |||
| Homo Reuse | |||
| ==================== | |||
| .. code-block:: python | |||
| This part introduces baseline methods for reusing homogeneous learnwares to make predictions on unlabeled data. | |||
| Direct reuse of Learnware | |||
| -------------------------- | |||
| from sklearn.datasets import load_digits | |||
| from learnware.learnware import JobSelectorReuser, AveragingReuser | |||
| - ``JobSelector`` selects different learnwares for different data by training a ``job selector`` classifier. The following code shows how to use it: | |||
| # Load user data | |||
| X, y = load_digits(return_X_y=True) | |||
| test_data = X | |||
| .. code:: python | |||
| # Based on user information, the learnware market returns a list of learnwares (learnware_list) | |||
| # Use jobselector reuser to reuse the searched learnwares to make prediction | |||
| from learnware.reuse import JobSelectorReuser | |||
| # learnware_list is the list of searched learnware | |||
| reuse_job_selector = JobSelectorReuser(learnware_list=learnware_list) | |||
| job_selector_predict_y = reuse_job_selector.predict(user_data=test_data) | |||
| # Use averaging ensemble reuser to reuse the searched learnwares to make prediction | |||
| reuse_ensemble = AveragingReuser(learnware_list=learnware_list) | |||
| ensemble_predict_y = reuse_ensemble.predict(user_data=test_data) | |||
| # test_x is the user's data for prediction | |||
| # predict_y is the prediction result of the reused learnwares | |||
| predict_y = reuse_job_selector.predict(user_data=test_x) | |||
| - ``AveragingReuser`` uses an ensemble method to make predictions. The ``mode`` parameter specifies the specific ensemble method: | |||
| .. code:: python | |||
| from learnware.reuse import AveragingReuser | |||
| # Regression tasks: | |||
| # - mode="mean": average the learnware outputs. | |||
| # Classification tasks: | |||
| # - mode="vote_by_label": majority vote for learnware output labels. | |||
| # - mode="vote_by_prob": majority vote for learnware output label probabilities. | |||
| reuse_ensemble = AveragingReuser( | |||
| learnware_list=learnware_list, mode="vote_by_label" | |||
| ) | |||
| ensemble_predict_y = reuse_ensemble.predict(user_data=test_x) | |||
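To make the difference between the ensemble modes concrete, here is a small NumPy sketch of averaging and of majority voting over labels. It illustrates the idea only and is not the library's implementation; the function names `mean_ensemble` and `vote_by_label` are hypothetical, and breaking ties in favor of the smallest label is an assumption.

```python
import numpy as np

def mean_ensemble(outputs):
    """Average real-valued outputs; `outputs` has shape (n_models, n_samples)."""
    return np.mean(np.asarray(outputs, dtype=float), axis=0)

def vote_by_label(label_preds, n_classes):
    """Majority vote over integer labels; ties go to the smallest label."""
    label_preds = np.asarray(label_preds, dtype=int)
    # counts[k, j] = number of models that predicted class k for sample j
    counts = np.apply_along_axis(np.bincount, 0, label_preds, minlength=n_classes)
    return counts.argmax(axis=0)

# Three models, three samples
label_preds = [[0, 1, 2], [0, 1, 1], [1, 1, 2]]
votes = vote_by_label(label_preds, n_classes=3)
averaged = mean_ensemble([[1.0, 2.0], [3.0, 4.0]])
```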
| Reusing Learnware with Labeled Data | |||
| ------------------------------------ | |||
| When users have a small amount of labeled data, they can also adapt/polish the received learnware(s) by reusing them with the labeled data, gaining even better performance. | |||
| - ``EnsemblePruningReuser`` selectively ensembles a subset of learnwares to choose the ones that are most suitable for the user's task: | |||
| .. code:: python | |||
| from learnware.reuse import EnsemblePruningReuser | |||
| # mode="regression": Suitable for regression tasks | |||
| # mode="classification": Suitable for classification tasks | |||
| reuse_ensemble_pruning = EnsemblePruningReuser( | |||
| learnware_list=learnware_list, mode="regression" | |||
| ) | |||
| # (val_X, val_y) is the small amount of labeled data | |||
| reuse_ensemble_pruning.fit(val_X, val_y) | |||
| predict_y = reuse_ensemble_pruning.predict(user_data=test_x) | |||
| - ``FeatureAugmentReuser`` helps users reuse learnwares by augmenting features. This reuser regards each received learnware as a feature augmentor, takes its output as a new feature, and then builds a simple model on the augmented feature set (``logistic regression`` for classification tasks and ``ridge regression`` for regression tasks): | |||
| .. code:: python | |||
| from learnware.reuse import FeatureAugmentReuser | |||
| # mode="regression": Suitable for regression tasks | |||
| # mode="classification": Suitable for classification tasks | |||
| reuse_feature_augment = FeatureAugmentReuser( | |||
| learnware_list=learnware_list, mode="regression" | |||
| ) | |||
| # (val_X, val_y) is the small amount of labeled data | |||
| reuse_feature_augment.fit(val_X, val_y) | |||
| predict_y = reuse_feature_augment.predict(user_data=test_x) | |||
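The feature augmentation idea can be sketched with plain NumPy for the regression case. This is an illustration under stated assumptions rather than the library's code: the helpers `augment_features`, `fit_ridge` and `predict_ridge` are hypothetical, the two lambda predictors stand in for learnwares, and the closed-form ridge fit also penalizes the bias term for brevity.

```python
import numpy as np

def augment_features(X, predictors):
    """Append each predictor's output to X as extra feature columns."""
    outputs = [np.asarray(p(X), dtype=float).reshape(len(X), -1) for p in predictors]
    return np.hstack([X, *outputs])

def fit_ridge(A, y, alpha=1.0):
    """Closed-form ridge regression on A with an appended bias column."""
    A1 = np.hstack([A, np.ones((len(A), 1))])
    return np.linalg.solve(A1.T @ A1 + alpha * np.eye(A1.shape[1]), A1.T @ y)

def predict_ridge(A, w):
    return np.hstack([A, np.ones((len(A), 1))]) @ w

rng = np.random.default_rng(0)
val_X = rng.normal(size=(40, 3))             # small labeled set
val_y = val_X @ np.array([1.0, -2.0, 0.5])
# Stand-ins for learnwares: each maps X to one output per sample
predictors = [lambda X: 1.1 * X[:, 0], lambda X: -2.0 * X[:, 1]]

augmented = augment_features(val_X, predictors)
w = fit_ridge(augmented, val_y)
pred = predict_ridge(augmented, w)
```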
| Hetero Reuse | |||
| ==================== | |||
| When heterogeneous learnware search is activated (see `WORKFLOWS: Hetero Search <../workflows/search.html#hetero-search>`_), users receive heterogeneous learnwares identified from the whole "specification world". | |||
| Though these recommended learnwares are trained on tasks whose feature/label spaces differ from the user's task, they can still be helpful and perform well beyond their original purpose. | |||
| Normally these learnwares are hard to use, let alone polish, because of the feature/label space heterogeneity. However, the ``HeteroMapAlignLearnware`` class aligns a heterogeneous learnware | |||
| with the user's task, so users can easily reuse it with the same set of reuse methods mentioned above. | |||
| During the alignment of a heterogeneous learnware, the statistical specifications of the learnware and the user's task ``(user_spec)`` are used for input space alignment, | |||
| and a small amount of labeled data ``(val_x, val_y)`` is required for output space alignment. This can be done with the following code: | |||
| .. code:: python | |||
| from learnware.reuse import HeteroMapAlignLearnware | |||
| # mode="regression": For user tasks of regression | |||
| # mode="classification": For user tasks of classification | |||
| hetero_learnware = HeteroMapAlignLearnware(learnware=learnware, mode="regression") | |||
| hetero_learnware.align(user_spec, val_x, val_y) | |||
| # Make predictions using the aligned heterogeneous learnware | |||
| predict_y = hetero_learnware.predict(user_data=test_x) | |||
| To reuse multiple heterogeneous learnwares, | |||
| combine ``HeteroMapAlignLearnware`` with the homogeneous reuse methods ``AveragingReuser`` and ``EnsemblePruningReuser`` mentioned above: | |||
| .. code:: python | |||
| hetero_learnware_list = [] | |||
| for learnware in learnware_list: | |||
| hetero_learnware = HeteroMapAlignLearnware(learnware, mode="regression") | |||
| hetero_learnware.align(user_spec, val_x, val_y) | |||
| hetero_learnware_list.append(hetero_learnware) | |||
| # Reuse multiple heterogeneous learnwares using AveragingReuser | |||
| reuse_ensemble = AveragingReuser(learnware_list=hetero_learnware_list, mode="mean") | |||
| ensemble_predict_y = reuse_ensemble.predict(user_data=test_x) | |||
| # Reuse multiple heterogeneous learnwares using EnsemblePruningReuser | |||
| reuse_ensemble = EnsemblePruningReuser( | |||
| learnware_list=hetero_learnware_list, mode="regression" | |||
| ) | |||
| reuse_ensemble.fit(val_x, val_y) | |||
| ensemble_pruning_predict_y = reuse_ensemble.predict(user_data=test_x) | |||
| Reuse with Container | |||
| ===================== | |||
| @@ -2,111 +2,95 @@ | |||
| Learnwares Search | |||
| ============================================================ | |||
| When a user comes with her requirements, the market should identify helpful learnwares and recommend them to the user. | |||
| The search of helpful learnwares is based on the user information, and can be divided into two stages: semantic specification search and statistical specification search. | |||
| ``Learnware Searcher`` is a key component of ``Learnware Market`` that identifies and recommends helpful learnwares to users according to their ``UserInfo``. Based on whether the returned learnware dimensions are consistent with user tasks, the searchers can be divided into two categories: homogeneous searchers and heterogeneous searchers. | |||
| All the searchers are implemented as subclasses of ``BaseSearcher``. When initializing a searcher, you should assign an ``organizer`` to it. The introduction of ``organizer`` is shown in `COMPONENTS: Market - Framework <../components/market.html>`_. Then these searchers can be called with ``UserInfo`` and return ``SearchResults``. | |||
| Homo Search | |||
| ====================== | |||
| User information | |||
| ------------------------------- | |||
| The user should provide her requirements in ``BaseUserInfo``. The class ``BaseUserInfo`` consists of user's semantic specification ``user_semantic`` and statistical information ``stat_info``. | |||
| The semantic specification ``user_semantic`` is stored in a ``dict``, with keywords 'Data', 'Task', 'Library', 'Scenario', 'Description' and 'Name'. An example is shown below, and you can choose their values according to the figure. For keys of type 'Class', you should choose one legal value; for keys of type 'Tag', you can choose one or more values; for keys of type 'String', you should provide a string. The key 'Description' is used in learnwares' semantic specifications and is ignored in the user's semantic specification. The values of all these keys can be empty if the user has no idea about them. | |||
| The homogeneous search of helpful learnwares can be divided into two stages: semantic specification search and statistical specification search. Both stages need ``BaseUserInfo`` as input. The following code shows how to use the searcher to search for helpful learnwares from a market ``easy_market`` for a user. The introduction of ``EasyMarket`` is in `COMPONENTS: Market <../components/market.html>`_. | |||
| .. code-block:: python | |||
| # An example of user_semantic | |||
| # generate BaseUserInfo(semantic_spec + stat_info) | |||
| user_semantic = { | |||
| "Data": {"Values": ["Image"], "Type": "Class"}, | |||
| "Task": {"Values": ["Classification"], "Type": "Class"}, | |||
| "Library": {"Values": ["Scikit-learn"], "Type": "Tag"}, | |||
| "Scenario": {"Values": ["Education"], "Type": "Class"}, | |||
| "Data": {"Values": ["Table"], "Type": "Class"}, | |||
| "Task": {"Values": ["Regression"], "Type": "Class"}, | |||
| "Library": {"Values": ["Scikit-learn"], "Type": "Class"}, | |||
| "Scenario": {"Values": ["Business"], "Type": "Tag"}, | |||
| "Description": {"Values": "", "Type": "String"}, | |||
| "Name": {"Values": "digits", "Type": "String"}, | |||
| "Name": {"Values": "", "Type": "String"}, | |||
| "Input": {"Dimension": 82, "Description": {},}, | |||
| "Output": {"Dimension": 1, "Description": {},}, | |||
| "License": {"Values": ["MIT"], "Type": "Class"}, | |||
| } | |||
| .. image:: ../_static/img/semantic_spec.png | |||
| :align: center | |||
| The user's statistical information ``stat_info`` is stored in a ``json`` file, e.g., ``stat.json``. The generation of this file is seen in `Learnware Preparation <./submit>`_. | |||
| Semantic Specification Search | |||
| ------------------------------- | |||
| To search for learnwares that fit your task, | |||
| you should first provide a semantic specification ``user_semantic`` that describes the characteristics of your task. | |||
| The Learnware Market will perform a first-stage search based on ``user_semantic``, | |||
| identifying potentially helpful learnwares whose models solve tasks similar to your requirements. | |||
| .. code-block:: python | |||
| # construct user_info which includes semantic specification for searching learnware | |||
| user_info = BaseUserInfo(semantic_spec=user_semantic) | |||
| # search_learnware performs semantic specification search if user_info doesn't include a statistical specification | |||
| _, single_learnware_list, _ = easy_market.search_learnware(user_info) | |||
| # single_learnware_list is the learnware list by semantic specification searching | |||
| print(single_learnware_list) | |||
| In semantic specification search, we go through all learnwares in the market, compare their semantic specifications with the user's, and return all the learnwares that pass the comparison. When comparing two semantic specifications, we use a different rule for each kind of semantic key: | |||
| - For semantic keys with type 'Class', they are matched only if they have the same value. | |||
| - For semantic keys with type 'Tag', they are matched only if they have nonempty intersections. | |||
| - For the user's input in the search box, it matches a learnware's semantic specification only if it is a substring of its 'Name' or 'Description'. All the strings are converted to lower case before matching. | |||
| - When a key value is missing, it will not participate in the match. The user could upload no semantic specifications if he wants. | |||
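The matching rules above can be sketched as a small pure-Python function. This is a hedged illustration of the rules as stated, not the market's actual implementation; the function name `semantic_match` and the `query` argument (standing in for the search-box input) are hypothetical.

```python
def semantic_match(user_spec, learnware_spec, query=""):
    """Return True if a learnware's semantic specification passes the user's filter."""
    for key, item in user_spec.items():
        values = item.get("Values")
        other = learnware_spec.get(key, {}).get("Values")
        if not values or not other:
            continue  # missing or empty keys do not participate in the match
        kind = item.get("Type")
        if kind == "Class" and values != other:
            return False  # 'Class' keys must have the same value
        if kind == "Tag" and not set(values) & set(other):
            return False  # 'Tag' keys must share at least one value
    if query:
        q = query.lower()
        fields = (learnware_spec.get(k, {}).get("Values", "") for k in ("Name", "Description"))
        return any(q in str(f).lower() for f in fields)
    return True

learnware_spec = {
    "Data": {"Values": ["Table"], "Type": "Class"},
    "Scenario": {"Values": ["Business", "Finance"], "Type": "Tag"},
    "Name": {"Values": "Sales Forecast", "Type": "String"},
    "Description": {"Values": "", "Type": "String"},
}
user_spec = {
    "Data": {"Values": ["Table"], "Type": "Class"},
    "Task": {"Values": [], "Type": "Class"},  # empty: ignored
    "Scenario": {"Values": ["Business"], "Type": "Tag"},
}
matched = semantic_match(user_spec, learnware_spec, query="forecast")
```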
| Statistical Specification Search | |||
| --------------------------------- | |||
| If you choose to provide your own statistical specification file ``stat.json``, | |||
| the Learnware Market can perform a more accurate learnware selection from | |||
| the learnwares returned by the previous step. This second-stage search is based on statistical information and returns one or more learnwares that are most likely to be helpful for your task. | |||
| For example, the following code is designed to work with Reduced Kernel Mean Embedding (RKME) as a statistical specification: | |||
| .. code-block:: python | |||
| import learnware.specification as specification | |||
| user_spec = specification.RKMETableSpecification() | |||
| user_spec.load(os.path.join("rkme.json")) | |||
| user_spec = generate_rkme_table_spec(X=x) | |||
| user_info = BaseUserInfo( | |||
| semantic_spec=user_semantic, stat_info={"RKMETableSpecification": user_spec} | |||
| semantic_spec=user_semantic, | |||
| stat_info={"RKMETableSpecification": user_spec} | |||
| ) | |||
| (sorted_score_list, single_learnware_list, | |||
| mixture_score, mixture_learnware_list) = easy_market.search_learnware(user_info) | |||
| # sorted_score_list is the learnware scores based on MMD distances, sorted in descending order | |||
| print(sorted_score_list) | |||
| # search the market for the user | |||
| search_result = easy_market.search_learnware(user_info) | |||
| # single_learnware_list is the learnwares sorted in descending order based on their scores | |||
| print(single_learnware_list) | |||
| # search result: single_result | |||
| single_result = search_result.get_single_results() | |||
| print( | |||
| f"single model num: {len(single_result)}, " | |||
| f"max_score: {single_result[0].score}, " | |||
| f"min_score: {single_result[-1].score}" | |||
| ) | |||
| # search result: multiple_result | |||
| multiple_result = search_result.get_multiple_results() | |||
| mixture_id = " ".join([learnware.id for learnware in multiple_result[0].learnwares]) | |||
| print(f"mixture_score: {multiple_result[0].score}, mixture_learnwares: {mixture_id}") | |||
| # mixture_learnware_list is the learnwares whose mixture is helpful for your task | |||
| print(mixture_learnware_list) | |||
| Hetero Search | |||
| ====================== | |||
| # mixture_score is the score of the mixture of learnwares | |||
| print(mixture_score) | |||
| For table-based user tasks, | |||
| homogeneous searchers like ``EasySearcher`` fail to recommend learnwares when no table learnware matches the user task's feature dimension, returning empty results. | |||
| To enhance functionality, the ``learnware`` package includes a heterogeneous learnware search feature, whose process is as follows: | |||
| The return values of statistical specification search are ``sorted_score_list``, ``single_learnware_list``, ``mixture_score`` and ``mixture_learnware_list``. | |||
| ``sorted_score_list`` and ``single_learnware_list`` are the ranking of each single learnware and the corresponding scores. We return at least 15 learnwares unless fewer are available. If there are more than 15 matched learnwares, the ones with scores less than 50 will be ignored. | |||
| ``mixture_score`` and ``mixture_learnware_list`` are the chosen mixture learnwares and the corresponding score. At most 5 learnwares will be chosen, whose mixture may have a relatively good performance on the user's task. | |||
| - Learnware markets such as ``Hetero Market`` integrate different specification islands into a unified "specification world" by assigning system-level specifications to all learnwares. This allows heterogeneous searchers like ``HeteroSearcher`` to find helpful learnwares from all available table learnwares. | |||
| - Searchers assign system-level specifications to users based on ``UserInfo``'s statistical specification, using methods provided by corresponding organizers. In ``Hetero Market``, for example, ``HeteroOrganizer.generate_hetero_map_spec`` generates system-level specifications for users. | |||
| - Finally searchers conduct statistical specification search across the "specification world". User's system-level specification will guide the searcher in pinpointing helpful heterogeneous learnwares. | |||
| To activate heterogeneous learnware search, ``UserInfo`` should contain both semantic and statistical specifications. What's more, the semantic specification should meet the following requirements: | |||
| The statistical specification search is done in the following way. | |||
| We first filter by the dimension of RKME specifications; only those with the same dimension with the user's will enter the subsequent stage. | |||
| - The task type should be ``Classification`` or ``Regression``. | |||
| - The data type should be ``Table``. | |||
| - It should include description for at least one feature dimension. | |||
| - The feature dimension stated here should match with the feature dimension in the statistical specification. | |||
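The requirements listed above can be expressed as a simple pre-check on the user's semantic specification. The sketch below is only an illustration: the function `supports_hetero_search` is hypothetical and not part of the package, it assumes the dictionary layout used in the examples in this document, and `stat_dim` (the feature dimension of the statistical specification) is passed in explicitly.

```python
def supports_hetero_search(semantic_spec, stat_dim):
    """Check the four semantic-specification requirements for heterogeneous search."""
    task = semantic_spec.get("Task", {}).get("Values", [])
    data = semantic_spec.get("Data", {}).get("Values", [])
    input_spec = semantic_spec.get("Input", {})
    return (
        len(task) == 1 and task[0] in ("Classification", "Regression")  # task type
        and data == ["Table"]                                           # data type
        and len(input_spec.get("Description", {})) >= 1                 # at least one described feature
        and input_spec.get("Dimension") == stat_dim                     # dimensions agree
    )

user_semantic = {
    "Data": {"Values": ["Table"], "Type": "Class"},
    "Task": {"Values": ["Classification"], "Type": "Class"},
    "Input": {"Dimension": 2, "Description": {"0": "leaf width", "1": "leaf length"}},
}
ok = supports_hetero_search(user_semantic, stat_dim=2)
```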
| The single_learnware_list is calculated using the distances between two RKMEs. The greater the distance from the user's RKME, the lower the score is. | |||
| The following code shows how to search for helpful heterogeneous learnwares from a market | |||
| ``hetero_market`` for a user. The introduction of ``HeteroMarket`` is in `COMPONENTS: Hetero Market <../components/market.html#hetero-market>`_. | |||
| The mixture_learnware_list is calculated in a greedy way. Each time we choose a learnware that brings the mixture closer to the user's RKME. Specifically, each time we go through all the remaining learnwares to find the one whose combination with the chosen learnwares minimizes the distance between their mixture's RKME and the user's RKME. The mixture weight is calculated by minimizing the RKME distance, which is solved by quadratic programming. If the distance becomes larger or the number of chosen learnwares reaches a threshold, the process ends and the chosen learnwares and weight list are returned. | |||
| .. code-block:: python | |||
| Hetero Search | |||
| ====================== | |||
| # initiate a Hetero Market | |||
| hetero_market = instantiate_learnware_market(market_id="test_hetero", name="hetero") | |||
| # user_semantic should meet the above requirements | |||
| input_description = { | |||
| "Dimension": 2, | |||
| "Description": { | |||
| "0": "leaf width", | |||
| "1": "leaf length", | |||
| }, | |||
| } | |||
| user_semantic = generate_semantic_spec( | |||
| data_type="table", | |||
| task_type="Classification", | |||
| scenarios=["Business"], | |||
| input_description=input_description, | |||
| ) | |||
| user_spec = generate_stat_spec(type="table", X=train_x) | |||
| user_info = BaseUserInfo( | |||
| semantic_spec=user_semantic, | |||
| stat_info={user_spec.type: user_spec} | |||
| ) | |||
| # search for heterogeneous learnwares in hetero_market | |||
| search_result = hetero_market.search_learnware(user_info) | |||
| @@ -0,0 +1,58 @@ | |||
| # Text Dataset Workflow Example | |||
| ## Introduction | |||
| We conducted experiments on the widely used text benchmark dataset: [``20-newsgroup``](http://qwone.com/~jason/20Newsgroups/). | |||
| ``20-newsgroup`` is a renowned text classification benchmark with a hierarchical structure, featuring 5 superclasses {comp, rec, sci, talk, misc}. | |||
| In the submitting stage, we enumerated all combinations of three superclasses from the five available, randomly sampling 50% of each combination from the training set to create datasets for 50 uploaders. | |||
| In the deploying stage, we considered all combinations of two superclasses out of the five, selecting all data for each combination from the testing set as a test dataset for one user. This resulted in 10 users. | |||
| The user's own training data was generated using the same sampling procedure as the user test data, despite originating from the training dataset. | |||
| Model training comprised two parts: the first part involved training a tfidf feature extractor, and the second part used the extracted text feature vectors to train a naive Bayes classifier. | |||
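As a quick check of the task construction, the superclass combinations can be enumerated with `itertools.combinations`. This sketch only lists the combinations; it does not reproduce the 50% sampling or the way combinations are turned into uploader datasets, and the variable names are illustrative.

```python
from itertools import combinations

superclasses = ["comp", "rec", "sci", "talk", "misc"]

# Submitting stage: every choice of three superclasses out of five
uploader_combos = list(combinations(superclasses, 3))
# Deploying stage: every choice of two superclasses out of five, one per user
user_combos = list(combinations(superclasses, 2))

print(len(uploader_combos), len(user_combos))
```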
| Our experiments comprise two components: | |||
| * ``unlabeled_text_example`` is designed to evaluate performance when users possess only testing data, searching and reusing learnware available in the market. | |||
| * ``labeled_text_example`` aims to assess performance when users have both testing and limited training data, searching and reusing learnware directly from the market instead of training a model from scratch. This helps determine the amount of training data saved for the user. | |||
| ## Run the code | |||
| Run the following command to start the ``unlabeled_text_example``. | |||
| ```bash | |||
| python workflow.py unlabeled_text_example | |||
| ``` | |||
| Run the following command to start the ``labeled_text_example``. | |||
| ```bash | |||
| python workflow.py labeled_text_example | |||
| ``` | |||
| ## Results | |||
| ### ``unlabeled_text_example``: | |||
| The accuracy of search and reuse is presented in the table below: | |||
| | Metric | Value | | |||
| |--------------------------------------|---------------------| | |||
| | Mean in Market (Single) | 0.507 ± 0.030 | | |||
| | Best in Market (Single) | 0.859 ± 0.051 | | |||
| | Top-1 Reuse (Single) | 0.846 ± 0.054 | | |||
| | Job Selector Reuse (Multiple) | 0.845 ± 0.053 | | |||
| | Average Ensemble Reuse (Multiple) | 0.862 ± 0.051 | | |||
| ### ``labeled_text_example``: | |||
| We present the classification error rate curves for both the user's self-trained model and multiple learnware reuse (EnsemblePrune), showing their performance on the user's test data as the user's training data increases. The average results across 10 users are depicted below: | |||
| <div align=center> | |||
| <img src="../../docs/_static/img/text_labeled_curves.png" alt="Text Limited Labeled Data" style="width:50%;" /> | |||
| </div> | |||
| From the figure above, it is evident that when the user's own training data is limited, the performance of multiple learnware reuse surpasses that of the user's own model. As the user's training data grows, it is expected that the user's model will eventually outperform the learnware reuse. This underscores the value of reusing learnware to significantly conserve training data and achieve superior performance when user training data is limited. | |||
| @@ -0,0 +1,62 @@ | |||
| from learnware.tests.benchmarks import BenchmarkConfig | |||
| text_benchmark_config = BenchmarkConfig( | |||
| name="20-Newsgroups", | |||
| user_num=10, | |||
| learnware_ids=[ | |||
| "00002193", | |||
| "00002192", | |||
| "00002191", | |||
| "00002190", | |||
| "00002189", | |||
| "00002188", | |||
| "00002187", | |||
| "00002186", | |||
| "00002185", | |||
| "00002184", | |||
| "00002183", | |||
| "00002182", | |||
| "00002181", | |||
| "00002180", | |||
| "00002179", | |||
| "00002178", | |||
| "00002177", | |||
| "00002176", | |||
| "00002175", | |||
| "00002174", | |||
| "00002173", | |||
| "00002172", | |||
| "00002171", | |||
| "00002170", | |||
| "00002169", | |||
| "00002168", | |||
| "00002167", | |||
| "00002166", | |||
| "00002165", | |||
| "00002164", | |||
| "00002163", | |||
| "00002162", | |||
| "00002161", | |||
| "00002160", | |||
| "00002159", | |||
| "00002158", | |||
| "00002157", | |||
| "00002156", | |||
| "00002155", | |||
| "00002154", | |||
| "00002153", | |||
| "00002152", | |||
| "00002151", | |||
| "00002150", | |||
| "00002149", | |||
| "00002148", | |||
| "00002147", | |||
| "00002146", | |||
| "00002145", | |||
| "00002144", | |||
| ], | |||
| test_data_path="20-Newsgroups/test_data.zip", | |||
| train_data_path="20-Newsgroups/train_data.zip", | |||
| extra_info_path="20-Newsgroup/extra_info.zip", | |||
| ) | |||
| @@ -1,29 +0,0 @@ | |||
| import os | |||
| import pickle | |||
| import numpy as np | |||
| from learnware.model import BaseModel | |||
| class Model(BaseModel): | |||
| def __init__(self): | |||
| super(Model, self).__init__(input_shape=(1,), output_shape=(1,)) | |||
| dir_path = os.path.dirname(os.path.abspath(__file__)) | |||
| modelv_path = os.path.join(dir_path, "modelv.pth") | |||
| with open(modelv_path, "rb") as f: | |||
| self.modelv = pickle.load(f) | |||
| modell_path = os.path.join(dir_path, "modell.pth") | |||
| with open(modell_path, "rb") as f: | |||
| self.modell = pickle.load(f) | |||
| def fit(self, X: np.ndarray, y: np.ndarray): | |||
| pass | |||
| def predict(self, X: np.ndarray) -> np.ndarray: | |||
| return self.modell.predict(self.modelv.transform(X)) | |||
| def finetune(self, X: np.ndarray, y: np.ndarray): | |||
| pass | |||
| @@ -1,8 +0,0 @@ | |||
| model: | |||
| class_name: Model | |||
| kwargs: { } | |||
| stat_specifications: | |||
| - module_path: learnware.specification | |||
| class_name: RKMETextSpecification | |||
| file_name: rkme.json | |||
| kwargs: { } | |||
| @@ -1,4 +0,0 @@ | |||
| numpy | |||
| pickle | |||
| lightgbm | |||
| scikit-learn | |||
| @@ -1,16 +0,0 @@ | |||
| import os | |||
| import pandas as pd | |||
| def get_data(data_root="./data"): | |||
| dtrain = pd.read_csv(os.path.join(data_root, "train.csv")) | |||
| dtest = pd.read_csv(os.path.join(data_root, "test.csv")) | |||
| # returned X(DataFrame), y(Series) | |||
| return ( | |||
| dtrain[["discourse_text", "discourse_type"]], | |||
| dtrain["discourse_effectiveness"], | |||
| dtest[["discourse_text", "discourse_type"]], | |||
| dtest["discourse_effectiveness"], | |||
| ) | |||
| @@ -1,274 +0,0 @@ | |||
| import os | |||
| import fire | |||
| import pickle | |||
| import time | |||
| import zipfile | |||
| from shutil import copyfile, rmtree | |||
| import numpy as np | |||
| import learnware.specification as specification | |||
| from get_data import get_data | |||
| from learnware.logger import get_module_logger | |||
| from learnware.market import instantiate_learnware_market, BaseUserInfo | |||
| from learnware.reuse import JobSelectorReuser, AveragingReuser, EnsemblePruningReuser | |||
| from utils import generate_uploader, generate_user, TextDataLoader, train, eval_prediction | |||
| logger = get_module_logger("text_test", level="INFO") | |||
| origin_data_root = "./data/origin_data" | |||
| processed_data_root = "./data/processed_data" | |||
| tmp_dir = "./data/tmp" | |||
| learnware_pool_dir = "./data/learnware_pool" | |||
| dataset = "ae" # argumentative essays | |||
| n_uploaders = 7 | |||
| n_users = 7 | |||
| n_classes = 3 | |||
| data_root = os.path.join(origin_data_root, dataset) | |||
| data_save_root = os.path.join(processed_data_root, dataset) | |||
| user_save_root = os.path.join(data_save_root, "user") | |||
| uploader_save_root = os.path.join(data_save_root, "uploader") | |||
| model_save_root = os.path.join(data_save_root, "uploader_model") | |||
| os.makedirs(data_root, exist_ok=True) | |||
| os.makedirs(user_save_root, exist_ok=True) | |||
| os.makedirs(uploader_save_root, exist_ok=True) | |||
| os.makedirs(model_save_root, exist_ok=True) | |||
| output_description = { | |||
| "Dimension": 1, | |||
| "Description": { | |||
| "0": "classify as 0(ineffective), 1(effective), or 2(adequate).", | |||
| }, | |||
| } | |||
| semantic_specs = [ | |||
| { | |||
| "Data": {"Values": ["Text"], "Type": "Class"}, | |||
| "Task": {"Values": ["Classification"], "Type": "Class"}, | |||
| "Library": {"Values": ["Scikit-learn"], "Type": "Class"}, | |||
| "Scenario": {"Values": ["Education"], "Type": "Tag"}, | |||
| "Description": {"Values": "", "Type": "String"}, | |||
| "Name": {"Values": "learnware_1", "Type": "String"}, | |||
| "Output": output_description, | |||
| "License": {"Values": ["MIT"], "Type": "Class"}, | |||
| } | |||
| ] | |||
| user_semantic = { | |||
| "Data": {"Values": ["Text"], "Type": "Class"}, | |||
| "Task": {"Values": ["Classification"], "Type": "Class"}, | |||
| "Library": {"Values": ["Scikit-learn"], "Type": "Class"}, | |||
| "Scenario": {"Values": ["Education"], "Type": "Tag"}, | |||
| "Description": {"Values": "", "Type": "String"}, | |||
| "Name": {"Values": "", "Type": "String"}, | |||
| "Output": output_description, | |||
| "License": {"Values": ["MIT"], "Type": "Class"}, | |||
| } | |||
| class TextDatasetWorkflow: | |||
| def _init_text_dataset(self): | |||
| self._prepare_data() | |||
| self._prepare_model() | |||
| def _prepare_data(self): | |||
| X_train, y_train, X_test, y_test = get_data(data_root) | |||
| generate_uploader(X_train, y_train, n_uploaders=n_uploaders, data_save_root=uploader_save_root) | |||
| generate_user(X_test, y_test, n_users=n_users, data_save_root=user_save_root) | |||
| def _prepare_model(self): | |||
| dataloader = TextDataLoader(data_save_root, train=True) | |||
| for i in range(n_uploaders): | |||
| logger.info("Train on uploader: %d" % (i)) | |||
| X, y = dataloader.get_idx_data(i) | |||
| vectorizer, lgbm = train(X, y, out_classes=n_classes) | |||
| modelv_save_path = os.path.join(model_save_root, "uploader_v_%d.pth" % (i)) | |||
| modell_save_path = os.path.join(model_save_root, "uploader_l_%d.pth" % (i)) | |||
| with open(modelv_save_path, "wb") as f: | |||
| pickle.dump(vectorizer, f) | |||
| with open(modell_save_path, "wb") as f: | |||
| pickle.dump(lgbm, f) | |||
| logger.info("Model saved to '%s' and '%s'" % (modelv_save_path, modell_save_path)) | |||
| def _prepare_learnware( | |||
| self, data_path, modelv_path, modell_path, init_file_path, yaml_path, env_file_path, save_root, zip_name | |||
| ): | |||
| os.makedirs(save_root, exist_ok=True) | |||
| tmp_spec_path = os.path.join(save_root, "rkme.json") | |||
| tmp_modelv_path = os.path.join(save_root, "modelv.pth") | |||
| tmp_modell_path = os.path.join(save_root, "modell.pth") | |||
| tmp_yaml_path = os.path.join(save_root, "learnware.yaml") | |||
| tmp_init_path = os.path.join(save_root, "__init__.py") | |||
| tmp_env_path = os.path.join(save_root, "requirements.txt") | |||
| with open(data_path, "rb") as f: | |||
| X = pickle.load(f) | |||
| semantic_spec = semantic_specs[0] | |||
| st = time.time() | |||
| user_spec = specification.RKMETextSpecification() | |||
| user_spec.generate_stat_spec_from_data(X=X) | |||
| ed = time.time() | |||
| logger.info("Stat spec generated in %.3f s" % (ed - st)) | |||
| user_spec.save(tmp_spec_path) | |||
| copyfile(modelv_path, tmp_modelv_path) | |||
| copyfile(modell_path, tmp_modell_path) | |||
| copyfile(yaml_path, tmp_yaml_path) | |||
| copyfile(init_file_path, tmp_init_path) | |||
| copyfile(env_file_path, tmp_env_path) | |||
| zip_file_name = os.path.join(learnware_pool_dir, "%s.zip" % (zip_name)) | |||
| with zipfile.ZipFile(zip_file_name, "w", compression=zipfile.ZIP_DEFLATED) as zip_obj: | |||
| zip_obj.write(tmp_spec_path, "rkme.json") | |||
| zip_obj.write(tmp_modelv_path, "modelv.pth") | |||
| zip_obj.write(tmp_modell_path, "modell.pth") | |||
| zip_obj.write(tmp_yaml_path, "learnware.yaml") | |||
| zip_obj.write(tmp_init_path, "__init__.py") | |||
| zip_obj.write(tmp_env_path, "requirements.txt") | |||
| rmtree(save_root) | |||
| logger.info("New Learnware Saved to %s" % (zip_file_name)) | |||
| return zip_file_name | |||
| def prepare_market(self, regenerate_flag=False): | |||
| if regenerate_flag: | |||
| self._init_text_dataset() | |||
| text_market = instantiate_learnware_market(market_id="ae", rebuild=True) | |||
| try: | |||
| rmtree(learnware_pool_dir) | |||
| except: | |||
| pass | |||
| os.makedirs(learnware_pool_dir, exist_ok=True) | |||
| for i in range(n_uploaders): | |||
| data_path = os.path.join(uploader_save_root, "uploader_%d_X.pkl" % (i)) | |||
| modelv_path = os.path.join(model_save_root, "uploader_v_%d.pth" % (i)) | |||
| modell_path = os.path.join(model_save_root, "uploader_l_%d.pth" % (i)) | |||
| init_file_path = "./example_files/example_init.py" | |||
| yaml_file_path = "./example_files/example_yaml.yaml" | |||
| env_file_path = "./example_files/requirements.txt" | |||
| new_learnware_path = self._prepare_learnware( | |||
| data_path, | |||
| modelv_path, | |||
| modell_path, | |||
| init_file_path, | |||
| yaml_file_path, | |||
| env_file_path, | |||
| tmp_dir, | |||
| "%s_%d" % (dataset, i), | |||
| ) | |||
| semantic_spec = semantic_specs[0] | |||
| semantic_spec["Name"]["Values"] = "learnware_%d" % (i) | |||
| semantic_spec["Description"]["Values"] = "test_learnware_number_%d" % (i) | |||
| text_market.add_learnware(new_learnware_path, semantic_spec) | |||
| logger.info("Total Item: %d" % (len(text_market))) | |||
| def test(self, regenerate_flag=False): | |||
| self.prepare_market(regenerate_flag) | |||
| text_market = instantiate_learnware_market(market_id="ae") | |||
| print("Total Item: %d" % len(text_market)) | |||
| select_list = [] | |||
| avg_list = [] | |||
| improve_list = [] | |||
| job_selector_score_list = [] | |||
| ensemble_score_list = [] | |||
| pruning_score_list = [] | |||
| for i in range(n_users): | |||
| user_data_path = os.path.join(user_save_root, "user_%d_X.pkl" % (i)) | |||
| user_label_path = os.path.join(user_save_root, "user_%d_y.pkl" % (i)) | |||
| with open(user_data_path, "rb") as f: | |||
| user_data = pickle.load(f) | |||
| with open(user_label_path, "rb") as f: | |||
| user_label = pickle.load(f) | |||
| user_stat_spec = specification.RKMETextSpecification() | |||
| user_stat_spec.generate_stat_spec_from_data(X=user_data) | |||
| user_info = BaseUserInfo(semantic_spec=user_semantic, stat_info={"RKMETextSpecification": user_stat_spec}) | |||
| logger.info("Searching Market for user: %d" % (i)) | |||
| search_result = text_market.search_learnware(user_info) | |||
| single_result = search_result.get_single_results() | |||
| multiple_result = search_result.get_multiple_results() | |||
| print(f"search result of user{i}:") | |||
| print( | |||
| f"single model num: {len(single_result)}, max_score: {single_result[0].score}, min_score: {single_result[-1].score}" | |||
| ) | |||
| l = len(single_result) | |||
| acc_list = [] | |||
| for idx in range(l): | |||
| learnware = single_result[idx].learnware | |||
| score = single_result[idx].score | |||
| pred_y = learnware.predict(user_data) | |||
| acc = eval_prediction(pred_y, user_label) | |||
| acc_list.append(acc) | |||
| print( | |||
| f"Top1-score: {single_result[0].score}, learnware_id: {single_result[0].learnware.id}, acc: {acc_list[0]}" | |||
| ) | |||
| if len(multiple_result) > 0: | |||
| mixture_id = " ".join([learnware.id for learnware in multiple_result[0].learnwares]) | |||
| print(f"mixture_score: {multiple_result[0].score}, mixture_learnware: {mixture_id}") | |||
| mixture_learnware_list = multiple_result[0].learnwares | |||
| else: | |||
| mixture_learnware_list = [single_result[0].learnware] | |||
| # test reuse (job selector) | |||
| reuse_baseline = JobSelectorReuser(learnware_list=mixture_learnware_list, herding_num=100) | |||
| reuse_predict = reuse_baseline.predict(user_data=user_data) | |||
| reuse_score = eval_prediction(reuse_predict, user_label) | |||
| job_selector_score_list.append(reuse_score) | |||
| print(f"mixture reuse loss(job selector): {reuse_score}") | |||
| # test reuse (ensemble) | |||
| reuse_ensemble = AveragingReuser(learnware_list=mixture_learnware_list, mode="vote_by_label") | |||
| ensemble_predict_y = reuse_ensemble.predict(user_data=user_data) | |||
| ensemble_score = eval_prediction(ensemble_predict_y, user_label) | |||
| ensemble_score_list.append(ensemble_score) | |||
| print(f"mixture reuse accuracy (ensemble): {ensemble_score}") | |||
| # test reuse (ensemblePruning) | |||
| reuse_pruning = EnsemblePruningReuser(learnware_list=mixture_learnware_list) | |||
| pruning_predict_y = reuse_pruning.predict(user_data=user_data) | |||
| pruning_score = eval_prediction(pruning_predict_y, user_label) | |||
| pruning_score_list.append(pruning_score) | |||
| print(f"mixture reuse accuracy (ensemble Pruning): {pruning_score}\n") | |||
| select_list.append(acc_list[0]) | |||
| avg_list.append(np.mean(acc_list)) | |||
| improve_list.append((acc_list[0] - np.mean(acc_list)) / np.mean(acc_list)) | |||
| logger.info( | |||
| "Accuracy of selected learnware: %.3f +/- %.3f, Average performance: %.3f +/- %.3f" | |||
| % (np.mean(select_list), np.std(select_list), np.mean(avg_list), np.std(avg_list)) | |||
| ) | |||
| logger.info("Average performance improvement: %.3f" % (np.mean(improve_list))) | |||
| logger.info( | |||
| "Average Job Selector Reuse Performance: %.3f +/- %.3f" | |||
| % (np.mean(job_selector_score_list), np.std(job_selector_score_list)) | |||
| ) | |||
| logger.info( | |||
| "Averaging Ensemble Reuse Performance: %.3f +/- %.3f" | |||
| % (np.mean(ensemble_score_list), np.std(ensemble_score_list)) | |||
| ) | |||
| logger.info( | |||
| "Selective Ensemble Reuse Performance: %.3f +/- %.3f" | |||
| % (np.mean(pruning_score_list), np.std(pruning_score_list)) | |||
| ) | |||
| if __name__ == "__main__": | |||
| fire.Fire(TextDatasetWorkflow) | |||
| @@ -1,104 +0,0 @@ | |||
| import os | |||
| import pickle | |||
| import numpy as np | |||
| import pandas as pd | |||
| from lightgbm import LGBMClassifier | |||
| from sklearn.feature_extraction.text import TfidfVectorizer | |||
| class TextDataLoader: | |||
| def __init__(self, data_root, train: bool = True): | |||
| self.data_root = data_root | |||
| self.train = train | |||
| def get_idx_data(self, idx=0): | |||
| if self.train: | |||
| X_path = os.path.join(self.data_root, "uploader", "uploader_%d_X.pkl" % (idx)) | |||
| y_path = os.path.join(self.data_root, "uploader", "uploader_%d_y.pkl" % (idx)) | |||
| if not (os.path.exists(X_path) and os.path.exists(y_path)): | |||
| raise Exception("Index Error") | |||
| with open(X_path, "rb") as f: | |||
| X = pickle.load(f) | |||
| with open(y_path, "rb") as f: | |||
| y = pickle.load(f) | |||
| else: | |||
| X_path = os.path.join(self.data_root, "user", "user_%d_X.pkl" % (idx)) | |||
| y_path = os.path.join(self.data_root, "user", "user_%d_y.pkl" % (idx)) | |||
| if not (os.path.exists(X_path) and os.path.exists(y_path)): | |||
| raise Exception("Index Error") | |||
| with open(X_path, "rb") as f: | |||
| X = pickle.load(f) | |||
| with open(y_path, "rb") as f: | |||
| y = pickle.load(f) | |||
| return X, y | |||
| def generate_uploader(data_x: pd.Series, data_y: pd.Series, n_uploaders=50, data_save_root=None): | |||
| if data_save_root is None: | |||
| return | |||
| os.makedirs(data_save_root, exist_ok=True) | |||
| types = data_x["discourse_type"].unique() | |||
| for i in range(n_uploaders): | |||
| indices = data_x["discourse_type"] == types[i] | |||
| selected_X = data_x[indices]["discourse_text"].to_list() | |||
| selected_y = data_y[indices].to_list() | |||
| X_save_dir = os.path.join(data_save_root, "uploader_%d_X.pkl" % (i)) | |||
| y_save_dir = os.path.join(data_save_root, "uploader_%d_y.pkl" % (i)) | |||
| with open(X_save_dir, "wb") as f: | |||
| pickle.dump(selected_X, f) | |||
| with open(y_save_dir, "wb") as f: | |||
| pickle.dump(selected_y, f) | |||
| print("Saving to %s" % (X_save_dir)) | |||
| def generate_user(data_x, data_y, n_users=50, data_save_root=None): | |||
| if data_save_root is None: | |||
| return | |||
| os.makedirs(data_save_root, exist_ok=True) | |||
| types = data_x["discourse_type"].unique() | |||
| for i in range(n_users): | |||
| indices = data_x["discourse_type"] == types[i] | |||
| selected_X = data_x[indices]["discourse_text"].to_list() | |||
| selected_y = data_y[indices].to_list() | |||
| X_save_dir = os.path.join(data_save_root, "user_%d_X.pkl" % (i)) | |||
| y_save_dir = os.path.join(data_save_root, "user_%d_y.pkl" % (i)) | |||
| with open(X_save_dir, "wb") as f: | |||
| pickle.dump(selected_X, f) | |||
| with open(y_save_dir, "wb") as f: | |||
| pickle.dump(selected_y, f) | |||
| print("Saving to %s" % (X_save_dir)) | |||
| # Train Uploaders' models | |||
| def train(X, y, out_classes): | |||
| vectorizer = TfidfVectorizer(stop_words="english") | |||
| X_tfidf = vectorizer.fit_transform(X) | |||
| lgbm = LGBMClassifier(boosting_type="dart", n_estimators=500, num_leaves=21) | |||
| lgbm.fit(X_tfidf, y) | |||
| return vectorizer, lgbm | |||
| def eval_prediction(pred_y, target_y): | |||
| if not isinstance(pred_y, np.ndarray): | |||
| pred_y = pred_y.detach().cpu().numpy() | |||
| if len(pred_y.shape) == 1: | |||
| predicted = np.array(pred_y) | |||
| else: | |||
| predicted = np.argmax(pred_y, 1) | |||
| annos = np.array(target_y) | |||
| total = predicted.shape[0] | |||
| correct = (predicted == annos).sum().item() | |||
| return correct / total | |||
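Note on `eval_prediction`: the helper accepts either hard labels (a 1-D array) or per-class scores (a 2-D array) and reduces the latter with an arg-max before comparing against the targets. The following is a minimal NumPy-only sketch of that behaviour. The name `accuracy` is hypothetical, and the sketch deliberately leaves out the tensor branch (`detach().cpu().numpy()`) that the helper uses for non-NumPy inputs.

```python
import numpy as np


def accuracy(pred_y, target_y):
    """Fraction of correct predictions; pred_y may be labels or per-class scores."""
    pred_y = np.asarray(pred_y)
    # A 2-D input is treated as per-class scores: pick the arg-max class per row.
    labels = pred_y if pred_y.ndim == 1 else np.argmax(pred_y, axis=1)
    return float(np.mean(labels == np.asarray(target_y)))


# Hard labels: three of four predictions match.
print(accuracy([0, 1, 2, 1], [0, 1, 1, 1]))
# Per-class scores: both rows arg-max to the correct class.
print(accuracy([[0.9, 0.1], [0.2, 0.8]], [0, 1]))
```

Taking the mean of the boolean match array is equivalent to `correct / total` in the helper above, and it is also what `sklearn.metrics.accuracy_score` returns by default for this kind of input.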
| @@ -0,0 +1,285 @@ | |||
| import os | |||
| import fire | |||
| import time | |||
| import random | |||
| import pickle | |||
| import tempfile | |||
| import numpy as np | |||
| import matplotlib.pyplot as plt | |||
| from sklearn.metrics import accuracy_score | |||
| from sklearn.naive_bayes import MultinomialNB | |||
| from sklearn.feature_extraction.text import TfidfVectorizer | |||
| from learnware.client import LearnwareClient | |||
| from learnware.logger import get_module_logger | |||
| from learnware.specification import RKMETextSpecification | |||
| from learnware.tests.benchmarks import LearnwareBenchmark | |||
| from learnware.market import instantiate_learnware_market, BaseUserInfo | |||
| from learnware.reuse import JobSelectorReuser, AveragingReuser, EnsemblePruningReuser | |||
| from config import text_benchmark_config | |||
| logger = get_module_logger("text_workflow", level="INFO") | |||
| class TextDatasetWorkflow: | |||
| @staticmethod | |||
| def _train_model(X, y): | |||
| vectorizer = TfidfVectorizer(stop_words="english") | |||
| X_tfidf = vectorizer.fit_transform(X) | |||
| clf = MultinomialNB(alpha=0.1) | |||
| clf.fit(X_tfidf, y) | |||
| return vectorizer, clf | |||
| @staticmethod | |||
| def _eval_prediction(pred_y, target_y): | |||
| if not isinstance(pred_y, np.ndarray): | |||
| pred_y = pred_y.detach().cpu().numpy() | |||
| pred_y = np.array(pred_y) if len(pred_y.shape) == 1 else np.argmax(pred_y, 1) | |||
| target_y = np.array(target_y) | |||
| return accuracy_score(target_y, pred_y) | |||
| def _plot_labeled_peformance_curves(self, all_user_curves_data): | |||
| plt.figure(figsize=(10, 6)) | |||
| plt.xticks(range(len(self.n_labeled_list)), self.n_labeled_list) | |||
| styles = [ | |||
| {"color": "navy", "linestyle": "-", "marker": "o"}, | |||
| {"color": "magenta", "linestyle": "-.", "marker": "d"}, | |||
| ] | |||
| labels = ["User Model", "Multiple Learnware Reuse (EnsemblePrune)"] | |||
| user_mat, pruning_mat = all_user_curves_data | |||
| user_mat, pruning_mat = np.array(user_mat), np.array(pruning_mat) | |||
| for mat, style, label in zip([user_mat, pruning_mat], styles, labels): | |||
| mean_curve, std_curve = 1 - np.mean(mat, axis=0), np.std(mat, axis=0) | |||
| plt.plot(mean_curve, **style, label=label) | |||
| plt.fill_between( | |||
| range(len(mean_curve)), | |||
| mean_curve - 0.5 * std_curve, | |||
| mean_curve + 0.5 * std_curve, | |||
| color=style["color"], | |||
| alpha=0.2, | |||
| ) | |||
| plt.xlabel("Labeled Data Size") | |||
| plt.ylabel("1 - Accuracy") | |||
| plt.title("Text Limited Labeled Data") | |||
| plt.legend() | |||
| plt.tight_layout() | |||
| plt.savefig(os.path.join(self.fig_path, "text_labeled_curves.png"), bbox_inches="tight", dpi=700) | |||
| def _prepare_market(self, rebuild=False): | |||
| client = LearnwareClient() | |||
| self.text_benchmark = LearnwareBenchmark().get_benchmark(text_benchmark_config) | |||
| self.text_market = instantiate_learnware_market(market_id=self.text_benchmark.name, rebuild=rebuild) | |||
| self.user_semantic = client.get_semantic_specification(self.text_benchmark.learnware_ids[0]) | |||
| self.user_semantic["Name"]["Values"] = "" | |||
| if len(self.text_market) == 0 or rebuild: | |||
| for learnware_id in self.text_benchmark.learnware_ids: | |||
| with tempfile.TemporaryDirectory(prefix="text_benchmark_") as tempdir: | |||
| zip_path = os.path.join(tempdir, f"{learnware_id}.zip") | |||
| for i in range(20): | |||
| try: | |||
| semantic_spec = client.get_semantic_specification(learnware_id) | |||
| client.download_learnware(learnware_id, zip_path) | |||
| self.text_market.add_learnware(zip_path, semantic_spec) | |||
| break | |||
| except Exception: | |||
| time.sleep(1) | |||
| continue | |||
| logger.info("Total Item: %d" % (len(self.text_market))) | |||
| def unlabeled_text_example(self, rebuild=False): | |||
| self._prepare_market(rebuild) | |||
| select_list = [] | |||
| avg_list = [] | |||
| best_list = [] | |||
| improve_list = [] | |||
| job_selector_score_list = [] | |||
| ensemble_score_list = [] | |||
| all_learnwares = self.text_market.get_learnwares() | |||
| for i in range(self.text_benchmark.user_num): | |||
| user_data, user_label = self.text_benchmark.get_test_data(user_ids=i) | |||
| user_stat_spec = RKMETextSpecification() | |||
| user_stat_spec.generate_stat_spec_from_data(X=user_data) | |||
| user_info = BaseUserInfo( | |||
| semantic_spec=self.user_semantic, stat_info={"RKMETextSpecification": user_stat_spec} | |||
| ) | |||
| logger.info("Searching Market for user: %d" % (i)) | |||
| search_result = self.text_market.search_learnware(user_info) | |||
| single_result = search_result.get_single_results() | |||
| multiple_result = search_result.get_multiple_results() | |||
| print(f"search result of user{i}:") | |||
| print( | |||
| f"single model num: {len(single_result)}, max_score: {single_result[0].score}, min_score: {single_result[-1].score}" | |||
| ) | |||
| acc_list = [] | |||
| for idx in range(len(all_learnwares)): | |||
| learnware = all_learnwares[idx] | |||
| pred_y = learnware.predict(user_data) | |||
| acc = self._eval_prediction(pred_y, user_label) | |||
| acc_list.append(acc) | |||
| learnware = single_result[0].learnware | |||
| pred_y = learnware.predict(user_data) | |||
| best_acc = self._eval_prediction(pred_y, user_label) | |||
| best_list.append(np.max(acc_list)) | |||
| select_list.append(best_acc) | |||
| avg_list.append(np.mean(acc_list)) | |||
| improve_list.append((best_acc - np.mean(acc_list)) / np.mean(acc_list)) | |||
| print(f"market mean accuracy: {np.mean(acc_list)}, market best accuracy: {np.max(acc_list)}") | |||
| print( | |||
| f"Top1-score: {single_result[0].score}, learnware_id: {single_result[0].learnware.id}, acc: {best_acc}" | |||
| ) | |||
| if len(multiple_result) > 0: | |||
| mixture_id = " ".join([learnware.id for learnware in multiple_result[0].learnwares]) | |||
| print(f"mixture_score: {multiple_result[0].score}, mixture_learnware: {mixture_id}") | |||
| mixture_learnware_list = multiple_result[0].learnwares | |||
| else: | |||
| mixture_learnware_list = [single_result[0].learnware] | |||
| # test reuse (job selector) | |||
| reuse_baseline = JobSelectorReuser(learnware_list=mixture_learnware_list, herding_num=100) | |||
| reuse_predict = reuse_baseline.predict(user_data=user_data) | |||
| reuse_score = self._eval_prediction(reuse_predict, user_label) | |||
| job_selector_score_list.append(reuse_score) | |||
| print(f"mixture reuse accuracy (job selector): {reuse_score}") | |||
| # test reuse (ensemble) | |||
| reuse_ensemble = AveragingReuser(learnware_list=mixture_learnware_list, mode="vote_by_label") | |||
| ensemble_predict_y = reuse_ensemble.predict(user_data=user_data) | |||
| ensemble_score = self._eval_prediction(ensemble_predict_y, user_label) | |||
| ensemble_score_list.append(ensemble_score) | |||
| print(f"mixture reuse accuracy (ensemble): {ensemble_score}\n") | |||
| logger.info( | |||
| "Accuracy of selected learnware: %.3f +/- %.3f, Average performance: %.3f +/- %.3f, Best performance: %.3f +/- %.3f" | |||
| % ( | |||
| np.mean(select_list), | |||
| np.std(select_list), | |||
| np.mean(avg_list), | |||
| np.std(avg_list), | |||
| np.mean(best_list), | |||
| np.std(best_list), | |||
| ) | |||
| ) | |||
| logger.info("Average performance improvement: %.3f" % (np.mean(improve_list))) | |||
| logger.info( | |||
| "Average Job Selector Reuse Performance: %.3f +/- %.3f" | |||
| % (np.mean(job_selector_score_list), np.std(job_selector_score_list)) | |||
| ) | |||
| logger.info( | |||
| "Averaging Ensemble Reuse Performance: %.3f +/- %.3f" | |||
| % (np.mean(ensemble_score_list), np.std(ensemble_score_list)) | |||
| ) | |||
| def labeled_text_example(self, rebuild=False, train_flag=True): | |||
| self.n_labeled_list = [100, 200, 500, 1000, 2000, 4000] | |||
| self.repeated_list = [10, 10, 10, 3, 3, 3] | |||
| self.root_path = os.path.dirname(os.path.abspath(__file__)) | |||
| self.fig_path = os.path.join(self.root_path, "figs") | |||
| self.curve_path = os.path.join(self.root_path, "curves") | |||
| if train_flag: | |||
| self._prepare_market(rebuild) | |||
| os.makedirs(self.fig_path, exist_ok=True) | |||
| os.makedirs(self.curve_path, exist_ok=True) | |||
| for i in range(self.text_benchmark.user_num): | |||
| user_model_score_mat = [] | |||
| pruning_score_mat = [] | |||
| single_score_mat = [] | |||
| test_x, test_y = self.text_benchmark.get_test_data(user_ids=i) | |||
| test_y = np.array(test_y) | |||
| train_x, train_y = self.text_benchmark.get_train_data(user_ids=i) | |||
| train_y = np.array(train_y) | |||
| user_stat_spec = RKMETextSpecification() | |||
| user_stat_spec.generate_stat_spec_from_data(X=test_x) | |||
| user_info = BaseUserInfo( | |||
| semantic_spec=self.user_semantic, stat_info={"RKMETextSpecification": user_stat_spec} | |||
| ) | |||
| logger.info(f"Searching Market for user_{i}") | |||
| search_result = self.text_market.search_learnware(user_info) | |||
| single_result = search_result.get_single_results() | |||
| multiple_result = search_result.get_multiple_results() | |||
| learnware = single_result[0].learnware | |||
| pred_y = learnware.predict(test_x) | |||
| best_acc = self._eval_prediction(pred_y, test_y) | |||
| print(f"search result of user_{i}:") | |||
| print( | |||
| f"single model num: {len(single_result)}, max_score: {single_result[0].score}, min_score: {single_result[-1].score}, single model acc: {best_acc}" | |||
| ) | |||
| if len(multiple_result) > 0: | |||
| mixture_id = " ".join([learnware.id for learnware in multiple_result[0].learnwares]) | |||
| print(f"mixture_score: {multiple_result[0].score}, mixture_learnware: {mixture_id}") | |||
| mixture_learnware_list = multiple_result[0].learnwares | |||
| else: | |||
| mixture_learnware_list = [single_result[0].learnware] | |||
| print(len(train_x)) | |||
| for n_label, repeated in zip(self.n_labeled_list, self.repeated_list): | |||
| user_model_score_list, reuse_pruning_score_list = [], [] | |||
| if n_label > len(train_x): | |||
| n_label = len(train_x) | |||
| for _ in range(repeated): | |||
| x_train, y_train = zip(*random.sample(list(zip(train_x, train_y)), k=n_label)) | |||
| x_train = list(x_train) | |||
| y_train = np.array(list(y_train)) | |||
| modelv, modell = self._train_model(x_train, y_train) | |||
| user_model_predict_y = modell.predict(modelv.transform(test_x)) | |||
| user_model_score = self._eval_prediction(user_model_predict_y, test_y) | |||
| user_model_score_list.append(user_model_score) | |||
| reuse_pruning = EnsemblePruningReuser( | |||
| learnware_list=mixture_learnware_list, mode="classification" | |||
| ) | |||
| reuse_pruning.fit(x_train, y_train) | |||
| reuse_pruning_predict_y = reuse_pruning.predict(user_data=test_x) | |||
| reuse_pruning_score = self._eval_prediction(reuse_pruning_predict_y, test_y) | |||
| reuse_pruning_score_list.append(reuse_pruning_score) | |||
| single_score_mat.append([best_acc] * repeated) | |||
| user_model_score_mat.append(user_model_score_list) | |||
| pruning_score_mat.append(reuse_pruning_score_list) | |||
| print(n_label, np.mean(user_model_score_mat[-1]), np.mean(pruning_score_mat[-1])) | |||
| logger.info(f"Saving Curves for User_{i}") | |||
| user_curves_data = (single_score_mat, user_model_score_mat, pruning_score_mat) | |||
| with open(os.path.join(self.curve_path, f"curve{i}.pkl"), "wb") as f: | |||
| pickle.dump(user_curves_data, f) | |||
| pruning_curves_data, user_model_curves_data = [], [] | |||
| for i in range(self.text_benchmark.user_num): | |||
| with open(os.path.join(self.curve_path, f"curve{i}.pkl"), "rb") as f: | |||
| user_curves_data = pickle.load(f) | |||
| (single_score_mat, user_model_score_mat, pruning_score_mat) = user_curves_data | |||
| for j in range(len(single_score_mat)): | |||
| user_model_score_mat[j] = np.mean(user_model_score_mat[j]) | |||
| pruning_score_mat[j] = np.mean(pruning_score_mat[j]) | |||
| if len(user_model_score_mat) < 6: | |||
| for _ in range(6 - len(user_model_score_mat)): | |||
| user_model_score_mat.append(user_model_score_mat[-1]) | |||
| pruning_score_mat.append(pruning_score_mat[-1]) | |||
| user_model_curves_data.append(user_model_score_mat[:6]) | |||
| pruning_curves_data.append(pruning_score_mat[:6]) | |||
| self._plot_labeled_peformance_curves([user_model_curves_data, pruning_curves_data]) | |||
| if __name__ == "__main__": | |||
| fire.Fire(TextDatasetWorkflow) | |||
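Note on the download loop in `_prepare_market`: each learnware is attempted up to 20 times with a one-second pause, and if every attempt fails the loop simply moves on, so a learnware can be missing from the market without any log entry. The following is a minimal sketch of the same bounded-retry pattern that surfaces the final failure instead. The helper name `retry` and its parameters are hypothetical and are not part of the `learnware` package.

```python
import time


def retry(action, attempts=20, delay=1.0, exceptions=(Exception,)):
    """Call action() until it succeeds or the attempts run out.

    Hypothetical helper: re-raises as RuntimeError, chained to the last error.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions as err:
            last_error = err
            # Sleep only between attempts, not after the final failure.
            if attempt < attempts:
                time.sleep(delay)
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error


# Usage: an action that fails twice with a transient error, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"


result = retry(flaky, attempts=5, delay=0.0)
print(result, calls["n"])
```

In the workflow, the `action` would wrap the three calls made per learnware (fetch the semantic specification, download the zip, add it to the market). Whether a failed download should abort the benchmark or only be logged is a design choice the diff does not settle.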
| @@ -1,4 +1,4 @@ | |||
| __version__ = "0.2.0.7" | |||
| __version__ = "0.2.0.9" | |||
| import os | |||
| import json | |||
| @@ -14,10 +14,11 @@ from typing import Union, List, Optional | |||
| from ..config import C | |||
| from .container import LearnwaresContainer | |||
| from ..market import BaseChecker | |||
| from ..specification import generate_semantic_spec | |||
| from ..logger import get_module_logger | |||
| from ..learnware import get_learnware_from_dirpath | |||
| from ..market import BaseUserInfo | |||
| from ..tests import get_semantic_specification | |||
| CHUNK_SIZE = 1024 * 1024 | |||
| logger = get_module_logger(module_name="LearnwareClient") | |||
| @@ -52,8 +53,8 @@ class SemanticSpecificationKey(Enum): | |||
| DATA_TYPE = "Data" | |||
| TASK_TYPE = "Task" | |||
| LIBRARY_TYPE = "Library" | |||
| LICENSE = "License" | |||
| SENARIOES = "Scenario" | |||
| LICENSE = "License" | |||
| class LearnwareClient: | |||
| @@ -67,8 +68,16 @@ class LearnwareClient: | |||
| self.chunk_size = 1024 * 1024 | |||
| self.tempdir_list = [] | |||
| self.login_status = False | |||
| atexit.register(self.cleanup) | |||
| def is_connected(self): | |||
| url = f"{self.host}/auth/login_by_token" | |||
| response = requests.post(url) | |||
| if response.status_code == 404: | |||
| return False | |||
| return True | |||
| def login(self, email, token): | |||
| url = f"{self.host}/auth/login_by_token" | |||
| @@ -80,6 +89,10 @@ class LearnwareClient: | |||
| token = result["data"]["token"] | |||
| self.headers = {"Authorization": f"Bearer {token}"} | |||
| self.login_status = True | |||
| def is_login(self): | |||
| return self.login_status | |||
| @require_login | |||
| def logout(self): | |||
| @@ -166,7 +179,18 @@ class LearnwareClient: | |||
| if result["code"] != 0: | |||
| raise Exception("update failed: " + json.dumps(result)) | |||
| def download_learnware(self, learnware_id, save_path): | |||
| def get_semantic_specification(self, learnware_id: str): | |||
| url = f"{self.host}/engine/learnware_info" | |||
| response = requests.get(url, params={"learnware_id": learnware_id}, headers=self.headers, stream=True) | |||
| result = response.json() | |||
| if result["code"] != 0: | |||
| raise Exception("get learnware semantic specification failed: " + json.dumps(result)) | |||
| return result["data"]["learnware_info"]["semantic_specification"] | |||
| def download_learnware(self, learnware_id: str, save_path: str): | |||
| url = f"{self.host}/engine/download_learnware" | |||
| response = requests.get( | |||
| @@ -251,7 +275,6 @@ class LearnwareClient: | |||
| headers=self.headers, | |||
| ) | |||
| result = response.json() | |||
| if result["code"] != 0: | |||
| raise Exception("search failed: " + json.dumps(result)) | |||
| @@ -260,12 +283,11 @@ class LearnwareClient: | |||
| returns["single"]["semantic_specifications"].append(learnware["semantic_specification"]) | |||
| returns["single"]["matching"].append(learnware["matching"]) | |||
| if len(result["data"]["learnware_list_multi"]) > 0: | |||
| multi_learnware = result["data"]["learnware_list_multi"][0] | |||
| returns["multiple"]["learnware_ids"].append(multi_learnware["learnware_id"]) | |||
| returns["multiple"]["semantic_specifications"].append(multi_learnware["semantic_specification"]) | |||
| for learnware in result["data"]["learnware_list_multi"]: | |||
| returns["multiple"]["learnware_ids"].append(learnware["learnware_id"]) | |||
| returns["multiple"]["semantic_specifications"].append(learnware["semantic_specification"]) | |||
| returns["multiple"]["matching"] = learnware["matching"] | |||
| # Delete temp json file | |||
| os.remove(temp_file_name) | |||
| @@ -281,41 +303,6 @@ class LearnwareClient: | |||
| if result["code"] != 0: | |||
| raise Exception("delete failed: " + json.dumps(result)) | |||
| def create_semantic_specification( | |||
| self, | |||
| name: Optional[str] = None, | |||
| description: Optional[str] = None, | |||
| data_type: Optional[str] = None, | |||
| task_type: Optional[str] = None, | |||
| library_type: Optional[str] = None, | |||
| scenarios: Optional[Union[str, List[str]]] = None, | |||
| license: Optional[Union[str, List[str]]] = None, | |||
| input_description: Optional[dict] = None, | |||
| output_description: Optional[dict] = None, | |||
| ): | |||
| semantic_specification = dict() | |||
| semantic_specification["Data"] = {"Type": "Class", "Values": [data_type] if data_type is not None else []} | |||
| semantic_specification["Task"] = {"Type": "Class", "Values": [task_type] if task_type is not None else []} | |||
| semantic_specification["Library"] = { | |||
| "Type": "Class", | |||
| "Values": [library_type] if library_type is not None else [], | |||
| } | |||
| license = [license] if isinstance(license, str) else license | |||
| semantic_specification["License"] = {"Type": "Class", "Values": license if license is not None else []} | |||
| scenarios = [scenarios] if isinstance(scenarios, str) else scenarios | |||
| semantic_specification["Scenario"] = {"Type": "Tag", "Values": scenarios if scenarios is not None else []} | |||
| semantic_specification["Name"] = {"Type": "String", "Values": name if name is not None else ""} | |||
| semantic_specification["Description"] = { | |||
| "Type": "String", | |||
| "Values": description if description is not None else "", | |||
| } | |||
| semantic_specification["Input"] = {} if input_description is None else input_description | |||
| semantic_specification["Output"] = {} if output_description is None else output_description | |||
| return semantic_specification | |||
| def list_semantic_specification_values(self, key: SemanticSpecificationKey): | |||
| url = f"{self.host}/engine/semantic_specification" | |||
| response = requests.get(url, headers=self.headers) | |||
| @@ -435,7 +422,17 @@ class LearnwareClient: | |||
| @staticmethod | |||
| def check_learnware(learnware_zip_path, semantic_specification=None): | |||
| semantic_specification = ( | |||
| get_semantic_specification() if semantic_specification is None else semantic_specification | |||
| generate_semantic_spec( | |||
| name="test", | |||
| description="test", | |||
| data_type="Text", | |||
| task_type="Segmentation", | |||
| scenarios="Financial", | |||
| library_type="Scikit-learn", | |||
| license="Apache-2.0", | |||
| ) | |||
| if semantic_specification is None | |||
| else semantic_specification | |||
| ) | |||
| check_status, message = LearnwareClient._check_semantic_specification(semantic_specification) | |||
| @@ -446,12 +443,9 @@ class LearnwareClient: | |||
| z_file.extractall(tempdir) | |||
| learnware = get_learnware_from_dirpath( | |||
| id="test", semantic_spec=semantic_specification, learnware_dirpath=tempdir | |||
| id="test", semantic_spec=semantic_specification, learnware_dirpath=tempdir, ignore_error=False | |||
| ) | |||
| if learnware is None: | |||
| raise Exception("The learnware is not valid.") | |||
| check_status, message = LearnwareClient._check_stat_specification(learnware) | |||
| assert check_status is True, message | |||
| @@ -28,8 +28,6 @@ def system_execute(args, timeout=None, env=None, stdout=subprocess.DEVNULL, stde | |||
| def remove_enviroment(conda_env): | |||
| system_execute(args=["conda", "env", "remove", "-n", f"{conda_env}"]) | |||
| logger.info(f"The learnware conda env [{conda_env}] is removed.") | |||
| def install_environment(learnware_dirpath, conda_env): | |||
| """Install environment of a learnware | |||
| @@ -51,7 +49,7 @@ def install_environment(learnware_dirpath, conda_env): | |||
| if "environment.yaml" in os.listdir(learnware_dirpath): | |||
| yaml_path: str = os.path.join(learnware_dirpath, "environment.yaml") | |||
| yaml_path_filter: str = os.path.join(tempdir, "environment_filter.yaml") | |||
| logger.info(f"checking the avaliabe conda packages for {conda_env}") | |||
| logger.info(f"checking the available conda packages for {conda_env}") | |||
| filter_nonexist_conda_packages_file(yaml_file=yaml_path, output_yaml_file=yaml_path_filter) | |||
| # create environment | |||
| logger.info(f"create conda env [{conda_env}] according to .yaml file") | |||
| @@ -60,7 +58,7 @@ def install_environment(learnware_dirpath, conda_env): | |||
| elif "requirements.txt" in os.listdir(learnware_dirpath): | |||
| requirements_path: str = os.path.join(learnware_dirpath, "requirements.txt") | |||
| requirements_path_filter: str = os.path.join(tempdir, "requirements_filter.txt") | |||
| logger.info(f"checking the avaliabe pip packages for {conda_env}") | |||
| logger.info(f"checking the available pip packages for {conda_env}") | |||
| filter_nonexist_pip_packages_file(requirements_file=requirements_path, output_file=requirements_path_filter) | |||
| logger.info(f"create empty conda env [{conda_env}]") | |||
| system_execute(args=["conda", "create", "-y", "--name", f"{conda_env}", "python=3.8"]) | |||
| @@ -1,8 +1,9 @@ | |||
| import os | |||
| import copy | |||
| from typing import Optional | |||
| import traceback | |||
| from .base import Learnware | |||
| from .utils import get_stat_spec_from_config | |||
| from ..specification import Specification | |||
| from ..utils import read_yaml_to_dict | |||
| @@ -12,7 +13,7 @@ from ..config import C | |||
| logger = get_module_logger("learnware.learnware") | |||
| def get_learnware_from_dirpath(id: str, semantic_spec: dict, learnware_dirpath, ignore_error=True) -> Learnware: | |||
| def get_learnware_from_dirpath(id: str, semantic_spec: dict, learnware_dirpath, ignore_error=True) -> Optional[Learnware]: | |||
| """Get the learnware object from dirpath, and provide the management interface for the Learnware class | |||
| Parameters | |||
| @@ -45,7 +46,12 @@ def get_learnware_from_dirpath(id: str, semantic_spec: dict, learnware_dirpath, | |||
| } | |||
| try: | |||
| yaml_config = read_yaml_to_dict(os.path.join(learnware_dirpath, C.learnware_folder_config["yaml_file"])) | |||
| learnware_yaml_path = os.path.join(learnware_dirpath, C.learnware_folder_config["yaml_file"]) | |||
| assert os.path.exists(learnware_yaml_path), f"learnware.yaml is not found for learnware_{id}, please check the learnware folder or zipfile." | |||
| yaml_config = read_yaml_to_dict(learnware_yaml_path) | |||
| if "name" in yaml_config: | |||
| learnware_config["name"] = yaml_config["name"] | |||
| @@ -60,7 +66,10 @@ def get_learnware_from_dirpath(id: str, semantic_spec: dict, learnware_dirpath, | |||
| learnware_spec = Specification() | |||
| for _stat_spec in learnware_config["stat_specifications"]: | |||
| stat_spec = _stat_spec.copy() | |||
| stat_spec["file_name"] = os.path.join(learnware_dirpath, stat_spec["file_name"]) | |||
| stat_spec_path = os.path.join(learnware_dirpath, stat_spec["file_name"]) | |||
| assert os.path.exists(stat_spec_path), f"statistical specification file {stat_spec['file_name']} is not found for learnware_{id}, please check the learnware folder or zipfile." | |||
| stat_spec["file_name"] = stat_spec_path | |||
| stat_spec_inst = get_stat_spec_from_config(stat_spec) | |||
| learnware_spec.update_stat_spec(**{stat_spec_inst.type: stat_spec_inst}) | |||
| @@ -69,7 +78,7 @@ def get_learnware_from_dirpath(id: str, semantic_spec: dict, learnware_dirpath, | |||
| except Exception as e: | |||
| if not ignore_error: | |||
| raise e | |||
| logger.warning(f"Load Learnware {id} failed! Due to {repr(e)}") | |||
| logger.warning(f"Load Learnware {id} failed! Due to {e}; details:\n{traceback.format_exc()}") | |||
| return None | |||
| return Learnware( | |||
| @@ -139,7 +139,7 @@ class LearnwareMarket: | |||
| def check_learnware(self, zip_path: str, semantic_spec: dict, checker_names: List[str] = None, **kwargs) -> bool: | |||
| try: | |||
| final_status = BaseChecker.NONUSABLE_LEARNWARE | |||
| if len(checker_names): | |||
| if checker_names is not None and len(checker_names): | |||
| with tempfile.TemporaryDirectory(prefix="pending_learnware_") as tempdir: | |||
| with zipfile.ZipFile(zip_path, mode="r") as z_file: | |||
| z_file.extractall(tempdir) | |||
| @@ -236,14 +236,14 @@ class LearnwareMarket: | |||
| int | |||
| The final learnware check_status. | |||
| """ | |||
| zip_path = self.get_learnware_zip_path_by_ids(id) if zip_path is None else zip_path | |||
| zip_path_for_check = self.get_learnware_zip_path_by_ids(id) if zip_path is None else zip_path | |||
| semantic_spec = ( | |||
| self.get_learnware_by_ids(id).get_specification().get_semantic_spec() | |||
| if semantic_spec is None | |||
| else semantic_spec | |||
| ) | |||
| checker_names = list(self.learnware_checker.keys()) if checker_names is None else checker_names | |||
| update_status = self.check_learnware(zip_path, semantic_spec, checker_names) | |||
| update_status = self.check_learnware(zip_path_for_check, semantic_spec, checker_names) | |||
| check_status = ( | |||
| update_status if check_status is None or update_status == BaseChecker.INVALID_LEARNWARE else check_status | |||
| ) | |||
| @@ -15,6 +15,13 @@ logger = get_module_logger("easy_seacher") | |||
| class EasyExactSemanticSearcher(BaseSearcher): | |||
| def _learnware_id_search(self, learnware_id: str, learnware_list: List[Learnware]) -> List[Learnware]: | |||
| match_learnwares = [] | |||
| for learnware in learnware_list: | |||
| if learnware_id == learnware.id: | |||
| match_learnwares.append(learnware) | |||
| return match_learnwares | |||
| def _match_semantic_spec(self, semantic_spec1, semantic_spec2): | |||
| """ | |||
| semantic_spec1: semantic spec input by user | |||
| @@ -29,12 +36,10 @@ class EasyExactSemanticSearcher(BaseSearcher): | |||
| for key in semantic_spec1.keys(): | |||
| v1 = semantic_spec1[key].get("Values", "") | |||
| v2 = semantic_spec2[key].get("Values", "") | |||
| if len(v1) == 0: | |||
| # user input is empty, no need to search | |||
| if key not in semantic_spec2 or len(v1) == 0: | |||
| continue | |||
| v2 = semantic_spec2[key].get("Values", "") | |||
| if key in ("Name", "Description"): | |||
| v1 = v1.lower() | |||
| if v1 not in name2 and v1 not in description2: | |||
| @@ -57,9 +62,15 @@ class EasyExactSemanticSearcher(BaseSearcher): | |||
| def __call__(self, learnware_list: List[Learnware], user_info: BaseUserInfo) -> SearchResults: | |||
| match_learnwares = [] | |||
| user_semantic_spec = user_info.get_semantic_spec() | |||
| # Learnware id search | |||
| if "learnware_id" in user_semantic_spec: | |||
| learnware_list = self._learnware_id_search(user_semantic_spec["learnware_id"]["Values"], learnware_list) | |||
| # Semantic tag match | |||
| for learnware in learnware_list: | |||
| learnware_semantic_spec = learnware.get_specification().get_semantic_spec() | |||
| user_semantic_spec = user_info.get_semantic_spec() | |||
| if self._match_semantic_spec(user_semantic_spec, learnware_semantic_spec): | |||
| match_learnwares.append(learnware) | |||
| logger.info("semantic_spec search: choose %d from %d learnwares" % (len(match_learnwares), len(learnware_list))) | |||
| @@ -67,6 +78,13 @@ class EasyExactSemanticSearcher(BaseSearcher): | |||
| class EasyFuzzSemanticSearcher(BaseSearcher): | |||
| def _learnware_id_search(self, learnware_id: str, learnware_list: List[Learnware]) -> List[Learnware]: | |||
| match_learnwares = [] | |||
| for learnware in learnware_list: | |||
| if learnware_id in learnware.id: | |||
| match_learnwares.append(learnware) | |||
| return match_learnwares | |||
| def _match_semantic_spec_tag(self, semantic_spec1, semantic_spec2) -> bool: | |||
| """Judge if tags of two semantic specs are consistent | |||
| @@ -128,6 +146,11 @@ class EasyFuzzSemanticSearcher(BaseSearcher): | |||
| final_result = [] | |||
| user_semantic_spec = user_info.get_semantic_spec() | |||
| # Learnware id search | |||
| if "learnware_id" in user_semantic_spec: | |||
| learnware_list = self._learnware_id_search(user_semantic_spec["learnware_id"]["Values"], learnware_list) | |||
| # Semantic tag match | |||
| for learnware in learnware_list: | |||
| learnware_semantic_spec = learnware.get_specification().get_semantic_spec() | |||
| if self._match_semantic_spec_tag(user_semantic_spec, learnware_semantic_spec): | |||
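The two `_learnware_id_search` helpers added above differ only in their comparison: the exact searcher uses `==`, while the fuzzy searcher uses substring containment (`in`). The following is a minimal sketch of that difference, not the project's code; `FakeLearnware` and both function names are hypothetical stand-ins that model only the `id` attribute.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeLearnware:
    # Hypothetical stand-in for Learnware; only the id attribute is modelled.
    id: str


def exact_id_search(learnware_id: str, learnwares: List[FakeLearnware]) -> List[FakeLearnware]:
    # Mirrors the exact searcher: keep entries whose id equals the query.
    return [lw for lw in learnwares if learnware_id == lw.id]


def fuzzy_id_search(learnware_id: str, learnwares: List[FakeLearnware]) -> List[FakeLearnware]:
    # Mirrors the fuzzy searcher: keep entries whose id contains the query.
    return [lw for lw in learnwares if learnware_id in lw.id]


pool = [FakeLearnware("00000001"), FakeLearnware("00000012"), FakeLearnware("00000100")]
exact_hits = [lw.id for lw in exact_id_search("00000001", pool)]
fuzzy_hits = [lw.id for lw in fuzzy_id_search("0000001", pool)]
print(exact_hits, fuzzy_hits)
```

With substring matching, a partial id can select several learnwares, so the later semantic tag match runs over a larger candidate list than in the exact case.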
| @@ -12,7 +12,7 @@ from ...base import BaseChecker, BaseUserInfo | |||
| from ...easy import EasyOrganizer | |||
| from ....learnware import Learnware | |||
| from ....logger import get_module_logger | |||
| from ....specification import HeteroMapTableSpecification, RKMETableSpecification | |||
| from ....specification import HeteroMapTableSpecification | |||
| logger = get_module_logger("hetero_map_table_organizer") | |||
| @@ -165,9 +165,11 @@ class HeteroMapTableOrganizer(EasyOrganizer): | |||
| int | |||
| The final learnware check_status. | |||
| """ | |||
| old_semantic_spec = self.learnware_list[id].get_specification().get_semantic_spec() | |||
| final_status = super(HeteroMapTableOrganizer, self).update_learnware(id, zip_path, semantic_spec, check_status) | |||
| if final_status == BaseChecker.USABLE_LEARWARE and len(self._get_hetero_learnware_ids(id)): | |||
| self._update_learware_hetero_spec(id) | |||
| if zip_path is not None or old_semantic_spec.get("Input", {}) != semantic_spec.get("Input", {}): | |||
| self._update_learware_hetero_spec(id) | |||
| return final_status | |||
| def _reload_learnware_hetero_spec(self, learnware_id): | |||
| @@ -245,7 +247,7 @@ class HeteroMapTableOrganizer(EasyOrganizer): | |||
| ret = [] | |||
| for idx in ids: | |||
| spec = self.learnware_list[idx].get_specification() | |||
| if is_hetero(stat_specs=spec.get_stat_spec(), semantic_spec=spec.get_semantic_spec()): | |||
| if is_hetero(stat_specs=spec.get_stat_spec(), semantic_spec=spec.get_semantic_spec(), verbose=False): | |||
| ret.append(idx) | |||
| return ret | |||
| @@ -1,9 +1,10 @@ | |||
| import traceback | |||
| from ...logger import get_module_logger | |||
| logger = get_module_logger("hetero_utils") | |||
| def is_hetero(stat_specs: dict, semantic_spec: dict) -> bool: | |||
| def is_hetero(stat_specs: dict, semantic_spec: dict, verbose=True) -> bool: | |||
| """Check if user_info satisfies all the criteria required for enabling heterogeneous learnware search | |||
| Parameters | |||
| @@ -35,15 +36,17 @@ def is_hetero(stat_specs: dict, semantic_spec: dict) -> bool: | |||
| semantic_decription_feature_num = len(semantic_input_description["Description"]) | |||
| if semantic_decription_feature_num <= 0: | |||
| logger.warning("At least one of Input.Description in semantic spec should be provides.") | |||
| if verbose: | |||
| logger.warning("At least one of Input.Description in semantic spec should be provided.") | |||
| return False | |||
| if table_input_shape != semantic_description_dim: | |||
| logger.warning("User data feature dimensions mismatch with semantic specification.") | |||
| if verbose: | |||
| logger.warning("User data feature dimensions mismatch with semantic specification.") | |||
| return False | |||
| return True | |||
| except Exception as e: | |||
| logger.warning(f"Invalid heterogeneous search information provided due to {e}. Use homogeneous search instead.") | |||
| except Exception as err: | |||
| if verbose: | |||
| logger.warning("Invalid heterogeneous search information provided.") | |||
| return False | |||
| @@ -1,9 +1,10 @@ | |||
| from .base import LearnwareMarket | |||
| from .classes import CondaChecker | |||
| from .easy import EasyOrganizer, EasySearcher, EasySemanticChecker, EasyStatChecker | |||
| from .heterogeneous import HeteroMapTableOrganizer, HeteroSearcher | |||
| def get_market_component(name, market_id, rebuild, organizer_kwargs=None, searcher_kwargs=None, checker_kwargs=None): | |||
| def get_market_component(name, market_id, rebuild, organizer_kwargs=None, searcher_kwargs=None, checker_kwargs=None, conda_checker=False): | |||
| organizer_kwargs = {} if organizer_kwargs is None else organizer_kwargs | |||
| searcher_kwargs = {} if searcher_kwargs is None else searcher_kwargs | |||
| checker_kwargs = {} if checker_kwargs is None else checker_kwargs | |||
| @@ -11,7 +12,7 @@ def get_market_component(name, market_id, rebuild, organizer_kwargs=None, search | |||
| if name == "easy": | |||
| easy_organizer = EasyOrganizer(market_id=market_id, rebuild=rebuild) | |||
| easy_searcher = EasySearcher(organizer=easy_organizer) | |||
| easy_checker_list = [EasySemanticChecker(), EasyStatChecker()] | |||
| easy_checker_list = [EasySemanticChecker(), EasyStatChecker() if conda_checker is False else CondaChecker(EasyStatChecker())] | |||
| market_component = { | |||
| "organizer": easy_organizer, | |||
| "searcher": easy_searcher, | |||
| @@ -20,7 +21,7 @@ def get_market_component(name, market_id, rebuild, organizer_kwargs=None, search | |||
| elif name == "hetero": | |||
| hetero_organizer = HeteroMapTableOrganizer(market_id=market_id, rebuild=rebuild, **organizer_kwargs) | |||
| hetero_searcher = HeteroSearcher(organizer=hetero_organizer) | |||
| hetero_checker_list = [EasySemanticChecker(), EasyStatChecker()] | |||
| hetero_checker_list = [EasySemanticChecker(), EasyStatChecker() if conda_checker is False else CondaChecker(EasyStatChecker())] | |||
| market_component = { | |||
| "organizer": hetero_organizer, | |||
| @@ -40,9 +41,10 @@ def instantiate_learnware_market( | |||
| organizer_kwargs: dict = None, | |||
| searcher_kwargs: dict = None, | |||
| checker_kwargs: dict = None, | |||
| conda_checker: bool = False, | |||
| **kwargs, | |||
| ): | |||
| market_componets = get_market_component(name, market_id, rebuild, organizer_kwargs, searcher_kwargs, checker_kwargs) | |||
| market_componets = get_market_component(name, market_id, rebuild, organizer_kwargs, searcher_kwargs, checker_kwargs, conda_checker) | |||
| return LearnwareMarket( | |||
| organizer=market_componets["organizer"], | |||
| searcher=market_componets["searcher"], | |||
| @@ -17,5 +17,12 @@ if not is_torch_available(verbose=False): | |||
| generate_rkme_table_spec = None | |||
| generate_rkme_image_spec = None | |||
| generate_rkme_text_spec = None | |||
| generate_semantic_spec = None | |||
| else: | |||
| from .module import generate_stat_spec, generate_rkme_table_spec, generate_rkme_image_spec, generate_rkme_text_spec | |||
| from .module import ( | |||
| generate_stat_spec, | |||
| generate_rkme_table_spec, | |||
| generate_rkme_image_spec, | |||
| generate_rkme_text_spec, | |||
| generate_semantic_spec, | |||
| ) | |||
| @@ -1,7 +1,7 @@ | |||
| import torch | |||
| import numpy as np | |||
| import pandas as pd | |||
| from typing import Union, List | |||
| from typing import Union, List, Optional | |||
| from .utils import convert_to_numpy | |||
| from .base import BaseStatSpecification | |||
| @@ -78,7 +78,7 @@ def generate_rkme_image_spec( | |||
| reduce: bool = True, | |||
| verbose: bool = True, | |||
| cuda_idx: int = None, | |||
| **kwargs | |||
| **kwargs, | |||
| ) -> RKMEImageSpecification: | |||
| """ | |||
| Interface for users to generate Reduced Kernel Mean Embedding (RKME) specification for Image. | |||
| @@ -168,7 +168,7 @@ def generate_rkme_text_spec( | |||
| # Check input type | |||
| if not isinstance(X, list) or not all(isinstance(item, str) for item in X): | |||
| raise TypeError("Input data must be a list of strings.") | |||
| # Generate rkme text spec | |||
| rkme_text_spec = RKMETextSpecification(gamma=gamma, cuda_idx=cuda_idx) | |||
| rkme_text_spec.generate_stat_spec_from_data(X, reduced_set_size, step_size, steps, nonnegative_beta, reduce) | |||
| @@ -177,7 +177,7 @@ def generate_rkme_text_spec( | |||
| def generate_stat_spec( | |||
| type: str, X: Union[np.ndarray, pd.DataFrame, torch.Tensor, List[str]], *args, **kwargs | |||
| ) -> BaseStatSpecification: | |||
| ) -> Union[RKMETableSpecification, RKMEImageSpecification, RKMETextSpecification]: | |||
| """ | |||
| Interface for users to generate statistical specification. | |||
| Return a StatSpecification object, use .save() method to save as npy file. | |||
| @@ -204,3 +204,41 @@ def generate_stat_spec( | |||
| return generate_rkme_image_spec(X=X, *args, **kwargs) | |||
| else: | |||
| raise TypeError(f"type {type} is not supported!") | |||
| def generate_semantic_spec( | |||
| name: Optional[str] = None, | |||
| description: Optional[str] = None, | |||
| data_type: Optional[str] = None, | |||
| task_type: Optional[str] = None, | |||
| library_type: Optional[str] = None, | |||
| scenarios: Optional[Union[str, List[str]]] = None, | |||
| license: Optional[Union[str, List[str]]] = None, | |||
| input_description: Optional[dict] = None, | |||
| output_description: Optional[dict] = None, | |||
| ): | |||
| semantic_specification = dict() | |||
| semantic_specification["Data"] = {"Type": "Class", "Values": [data_type] if data_type is not None else []} | |||
| semantic_specification["Task"] = {"Type": "Class", "Values": [task_type] if task_type is not None else []} | |||
| semantic_specification["Library"] = { | |||
| "Type": "Class", | |||
| "Values": [library_type] if library_type is not None else [], | |||
| } | |||
| license = [license] if isinstance(license, str) else license | |||
| semantic_specification["License"] = {"Type": "Class", "Values": license if license is not None else []} | |||
| scenarios = [scenarios] if isinstance(scenarios, str) else scenarios | |||
| semantic_specification["Scenario"] = {"Type": "Tag", "Values": scenarios if scenarios is not None else []} | |||
| semantic_specification["Name"] = {"Type": "String", "Values": name if name is not None else ""} | |||
| semantic_specification["Description"] = { | |||
| "Type": "String", | |||
| "Values": description if description is not None else "", | |||
| } | |||
| if input_description is not None: | |||
| semantic_specification["Input"] = input_description | |||
| if output_description is not None: | |||
| semantic_specification["Output"] = output_description | |||
| return semantic_specification | |||
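`generate_semantic_spec` accepts either a single string or a list of strings for `scenarios` and `license`, and maps a missing value to an empty list. The sketch below isolates that normalization so the resulting entry shape is easy to see. It is a simplified stand-in rather than the library function: `as_value_list` and `tag_entry` are hypothetical helper names, and the tag values are illustrative only.

```python
from typing import List, Optional, Union


def as_value_list(value: Optional[Union[str, List[str]]]) -> List[str]:
    # Hypothetical helper: wrap a lone string, copy a list, map None to [].
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


def tag_entry(value: Optional[Union[str, List[str]]], entry_type: str = "Class") -> dict:
    # Hypothetical helper: build one entry with the Type/Values layout used in the diff.
    return {"Type": entry_type, "Values": as_value_list(value)}


single = tag_entry("Apache-2.0")
multi = tag_entry(["Financial", "Other"], entry_type="Tag")
empty = tag_entry(None)
print(single, multi, empty)
```

Normalizing at the boundary means every consumer of the specification can treat `Values` as a list for `Class` and `Tag` entries, whichever form the caller supplied.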
| @@ -39,8 +39,7 @@ class RKMEImageSpecification(RegularStatSpecification): | |||
| """ | |||
| self.RKME_IMAGE_VERSION = 1 # Please maintain backward compatibility. | |||
| # TODO: remove this | |||
| self.msg=None | |||
| self.msg = None | |||
| self.z = None | |||
| self.beta = None | |||
| @@ -170,7 +169,8 @@ class RKMEImageSpecification(RegularStatSpecification): | |||
| import torch_optimizer | |||
| except ModuleNotFoundError: | |||
| raise ModuleNotFoundError( | |||
| f"RKMEImageSpecification is not available because 'torch-optimizer' is not installed! Please install it manually.") | |||
| f"RKMEImageSpecification is not available because 'torch-optimizer' is not installed! Please install it manually." | |||
| ) | |||
| # Cross-platform by default, unless the spec is specified to be generated specifically for local experiments. | |||
| cross_platform = "experimental" not in kwargs or not kwargs["experimental"] | |||
| @@ -422,7 +422,9 @@ class RKMEImageSpecification(RegularStatSpecification): | |||
| for d in self.get_states(): | |||
| if d in rkme_load.keys(): | |||
| if d == "type" and rkme_load[d] != self.type: | |||
| raise TypeError(f"The type of loaded RKME ({rkme_load[d]}) is different from the expected type ({self.type})!") | |||
| raise TypeError( | |||
| f"The type of loaded RKME ({rkme_load[d]}) is different from the expected type ({self.type})!" | |||
| ) | |||
| setattr(self, d, rkme_load[d]) | |||
| self.beta = self.beta.to(self._device) | |||
| @@ -441,9 +443,8 @@ def _get_zca_matrix(X, reg_coef=0.1): | |||
| class RandomGenerator: | |||
| def __init__(self, seed=0, cross_platform=True): | |||
| self.cross_platform=cross_platform | |||
| self.cross_platform = cross_platform | |||
| self.state = RandomState(seed) | |||
| def normal_(self, tensor: torch.Tensor, mean=0.0, std=1.0): | |||
| @@ -462,24 +463,24 @@ def deterministic(cross_platform, device): | |||
| deterministic_state = torch.backends.cudnn.deterministic | |||
| torch.backends.cudnn.deterministic = True | |||
| if cross_platform and torch.cuda.is_available(): | |||
| torch.cuda.set_rng_state( | |||
| new_state=torch.cuda.get_rng_state(device.index), | |||
| device="cpu") | |||
| torch.cuda.set_rng_state(new_state=torch.cuda.get_rng_state(device.index), device="cpu") | |||
| yield RandomGenerator(seed=0, cross_platform=cross_platform) | |||
| torch.backends.cudnn.deterministic = deterministic_state | |||
| if cross_platform and torch.cuda.is_available(): | |||
| torch.cuda.set_rng_state( | |||
| new_state=torch.cuda.get_rng_state(device.index), | |||
| device="cuda") | |||
| torch.cuda.set_rng_state(new_state=torch.cuda.get_rng_state(device.index), device="cuda") | |||
| class _ConvNet_wide(nn.Module): | |||
| def __init__(self, channel, random_generator, mu=None, sigma=None, k=2, net_width=128, net_depth=3, im_size=(32, 32)): | |||
| def __init__( | |||
| self, channel, random_generator, mu=None, sigma=None, k=2, net_width=128, net_depth=3, im_size=(32, 32) | |||
| ): | |||
| self.k = k | |||
| super().__init__() | |||
| self.features, shape_feat = self._make_layers(channel, net_width, net_depth, im_size, mu, sigma, random_generator) | |||
| self.features, shape_feat = self._make_layers( | |||
| channel, net_width, net_depth, im_size, mu, sigma, random_generator | |||
| ) | |||
| # self.aggregation = nn.AvgPool2d(kernel_size=shape_feat[1]) | |||
| def forward(self, x): | |||
| @@ -495,7 +496,9 @@ class _ConvNet_wide(nn.Module): | |||
| in_channels = channel | |||
| shape_feat = [in_channels, im_size[0], im_size[1]] | |||
| for d in range(net_depth): | |||
| layers += [_build_conv2d_gaussian(in_channels, int(k * net_width), random_generator, 3, 1, mean=mu, std=sigma)] | |||
| layers += [ | |||
| _build_conv2d_gaussian(in_channels, int(k * net_width), random_generator, 3, 1, mean=mu, std=sigma) | |||
| ] | |||
| shape_feat[0] = int(k * net_width) | |||
| layers += [nn.ReLU(inplace=True)] | |||
| @@ -508,7 +511,9 @@ class _ConvNet_wide(nn.Module): | |||
| return nn.Sequential(*layers), shape_feat | |||
| def _build_conv2d_gaussian(in_channels, out_channels, random_generator: RandomGenerator, kernel=3, padding=1, mean=None, std=None): | |||
| def _build_conv2d_gaussian( | |||
| in_channels, out_channels, random_generator: RandomGenerator, kernel=3, padding=1, mean=None, std=None | |||
| ): | |||
| layer = nn.Conv2d(in_channels, out_channels, kernel, padding=padding) | |||
| if mean is None: | |||
| mean = 0 | |||
| @@ -1 +1 @@ | |||
| from .module import get_semantic_specification | |||
| from .utils import parametrize | |||
| @@ -0,0 +1,173 @@ | |||
| import os | |||
| import pickle | |||
| import tempfile | |||
| import zipfile | |||
| from dataclasses import dataclass | |||
| from typing import Tuple, Optional, List, Union | |||
| from .config import BenchmarkConfig, benchmark_configs | |||
| from ..data import GetData | |||
| from ...config import C | |||
| @dataclass | |||
| class Benchmark: | |||
| name: str | |||
| user_num: int | |||
| learnware_ids: List[str] | |||
| test_X_paths: List[str] | |||
| test_y_paths: List[str] | |||
| train_X_paths: Optional[List[str]] = None | |||
| train_y_paths: Optional[List[str]] = None | |||
| extra_info_path: Optional[str] = None | |||
| def get_test_data(self, user_ids: Union[int, List[int]]): | |||
| raw_user_ids = user_ids | |||
| if isinstance(user_ids, int): | |||
| user_ids = [user_ids] | |||
| ret = [] | |||
| for user_id in user_ids: | |||
| with open(self.test_X_paths[user_id], "rb") as fin: | |||
| test_X = pickle.load(fin) | |||
| with open(self.test_y_paths[user_id], "rb") as fin: | |||
| test_y = pickle.load(fin) | |||
| ret.append((test_X, test_y)) | |||
| if isinstance(raw_user_ids, int): | |||
| return ret[0] | |||
| else: | |||
| return ret | |||
| def get_train_data(self, user_ids: Union[int, List[int]]): | |||
| if self.train_X_paths is None or self.train_y_paths is None: | |||
| return None | |||
| raw_user_ids = user_ids | |||
| if isinstance(user_ids, int): | |||
| user_ids = [user_ids] | |||
| ret = [] | |||
| for user_id in user_ids: | |||
| with open(self.train_X_paths[user_id], "rb") as fin: | |||
| train_X = pickle.load(fin) | |||
| with open(self.train_y_paths[user_id], "rb") as fin: | |||
| train_y = pickle.load(fin) | |||
| ret.append((train_X, train_y)) | |||
| if isinstance(raw_user_ids, int): | |||
| return ret[0] | |||
| else: | |||
| return ret | |||
| class LearnwareBenchmark: | |||
| def __init__(self): | |||
| self.benchmark_configs = benchmark_configs | |||
| def list_benchmarks(self): | |||
| return list(self.benchmark_configs.keys()) | |||
| def _check_cache_data_valid(self, benchmark_config: BenchmarkConfig, data_type: str) -> bool: | |||
| """Check if the cache data is valid | |||
| Parameters | |||
| ---------- | |||
| benchmark_config : BenchmarkConfig | |||
| benchmark config | |||
| data_type : str | |||
| "test" for test data or "train" for train data | |||
| Returns | |||
| ------- | |||
| bool | |||
| A flag indicating if the cache data is valid | |||
| """ | |||
| cache_folder = os.path.join(C.cache_path, benchmark_config.name, f"{data_type}_data") | |||
| if os.path.exists(cache_folder): | |||
| for user_id in range(benchmark_config.user_num): | |||
| X_path = os.path.join(cache_folder, f"user{user_id}_X.pkl") | |||
| y_path = os.path.join(cache_folder, f"user{user_id}_y.pkl") | |||
| if not os.path.isfile(X_path) or not os.path.isfile(y_path): | |||
| return False | |||
| return True | |||
| else: | |||
| return False | |||
| def _download_data(self, download_path: str, save_path: str): | |||
| """Download data from backend | |||
| Parameters | |||
| ---------- | |||
| download_path : str | |||
| data path for download in backend | |||
| save_path : str | |||
| local cache path for saving data | |||
| """ | |||
| with tempfile.TemporaryDirectory(prefix="learnware_benchmark_") as tempdir: | |||
| test_data_zippath = os.path.join(tempdir, "benchmark_data.zip") | |||
| GetData().download_file(download_path, test_data_zippath) | |||
| os.makedirs(save_path, exist_ok=True) | |||
| with zipfile.ZipFile(test_data_zippath, "r") as z_file: | |||
| z_file.extractall(save_path) | |||
| def _load_cache_data(self, benchmark_config: BenchmarkConfig, data_type: str) -> Tuple[List[str], List[str]]: | |||
| """Load data from local cache path | |||
| Parameters | |||
| ---------- | |||
| benchmark_config : BenchmarkConfig | |||
| benchmark config | |||
| data_type : str | |||
| "test" for test data or "train" for train data | |||
| """ | |||
| cache_folder = os.path.join(C.cache_path, benchmark_config.name, f"{data_type}_data") | |||
| if not self._check_cache_data_valid(benchmark_config, data_type): | |||
| download_path = getattr(benchmark_config, f"{data_type}_data_path", None) | |||
| self._download_data(download_path, cache_folder) | |||
| X_paths, y_paths = [], [] | |||
| for user_id in range(benchmark_config.user_num): | |||
| user_X_path = os.path.join(cache_folder, f"user{user_id}_X.pkl") | |||
| user_y_path = os.path.join(cache_folder, f"user{user_id}_y.pkl") | |||
| assert os.path.isfile(user_X_path), f"user {user_id} {data_type}_X is not valid!" | |||
| assert os.path.isfile(user_y_path), f"user {user_id} {data_type}_y is not valid!" | |||
| X_paths.append(user_X_path) | |||
| y_paths.append(user_y_path) | |||
| return X_paths, y_paths | |||
| def get_benchmark(self, benchmark_config: Union[str, BenchmarkConfig]): | |||
| if isinstance(benchmark_config, str): | |||
| benchmark_config = self.benchmark_configs[benchmark_config] | |||
| # Load test data | |||
| test_X_paths, test_y_paths = self._load_cache_data(benchmark_config, "test") | |||
| # Load train data | |||
| train_X_paths, train_y_paths = None, None | |||
| if benchmark_config.train_data_path is not None: | |||
| train_X_paths, train_y_paths = self._load_cache_data(benchmark_config, "train") | |||
| # Load extra info | |||
| extra_info_path = None | |||
| if benchmark_config.extra_info_path is not None: | |||
| extra_info_path = os.path.join(C.cache_path, benchmark_config.name, "extra_info") | |||
| if not os.path.exists(extra_info_path): | |||
| self._download_data(benchmark_config.extra_info_path, extra_info_path) | |||
| return Benchmark( | |||
| name=benchmark_config.name, | |||
| user_num=benchmark_config.user_num, | |||
| learnware_ids=benchmark_config.learnware_ids, | |||
| test_X_paths=test_X_paths, | |||
| test_y_paths=test_y_paths, | |||
| train_X_paths=train_X_paths, | |||
| train_y_paths=train_y_paths, | |||
| extra_info_path=extra_info_path, | |||
| ) | |||
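`Benchmark.get_test_data` and `Benchmark.get_train_data` share one calling convention: an `int` user id returns a single `(X, y)` tuple, while a list of ids returns a list of tuples. The sketch below reproduces that convention against temporary pickle files so it can run without the benchmark backend. `load_pairs` is a hypothetical name, the payloads are made up for illustration, and only the `user{id}_X.pkl` / `user{id}_y.pkl` file naming is taken from the diff.

```python
import os
import pickle
import tempfile
from typing import List, Union


def load_pairs(X_paths: List[str], y_paths: List[str], user_ids: Union[int, List[int]]):
    # An int yields one (X, y) tuple; a list yields a list of tuples.
    ids = [user_ids] if isinstance(user_ids, int) else user_ids
    pairs = []
    for uid in ids:
        with open(X_paths[uid], "rb") as fin:
            X = pickle.load(fin)
        with open(y_paths[uid], "rb") as fin:
            y = pickle.load(fin)
        pairs.append((X, y))
    return pairs[0] if isinstance(user_ids, int) else pairs


with tempfile.TemporaryDirectory() as tmp:
    X_paths, y_paths = [], []
    for uid in range(2):
        # Write one illustrative feature file and one label file per user.
        for kind, store, payload in (("X", X_paths, [[uid, uid + 1]]), ("y", y_paths, [uid])):
            path = os.path.join(tmp, f"user{uid}_{kind}.pkl")
            with open(path, "wb") as fout:
                pickle.dump(payload, fout)
            store.append(path)
    one_user = load_pairs(X_paths, y_paths, 1)
    two_users = load_pairs(X_paths, y_paths, [0, 1])

print(one_user, two_users)
```

Because the data is loaded with `pickle`, this pattern is only appropriate for files from a trusted source.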
| @@ -0,0 +1,15 @@ | |||
| from dataclasses import dataclass | |||
| from typing import Optional, List | |||
| @dataclass | |||
| class BenchmarkConfig: | |||
| name: str | |||
| user_num: int | |||
| learnware_ids: List[str] | |||
| test_data_path: str | |||
| train_data_path: Optional[str] = None | |||
| extra_info_path: Optional[str] = None | |||
| benchmark_configs = {} | |||
| @@ -0,0 +1,39 @@ | |||
| import json | |||
| import requests | |||
| from tqdm import tqdm | |||
| from ..config import C | |||
| class GetData: | |||
| def __init__(self, host=None, chunk_size=1024 * 1024): | |||
| self.headers = None | |||
| if host is None: | |||
| self.host = C.backend_host | |||
| else: | |||
| self.host = host | |||
| self.chunk_size = chunk_size | |||
| def download_file(self, file_path: str, save_path: str): | |||
| url = f"{self.host}/datasets/download_datasets" | |||
| response = requests.get( | |||
| url, | |||
| params={ | |||
| "dataset": file_path, | |||
| }, | |||
| stream=True, | |||
| ) | |||
| if response.status_code != 200: | |||
| raise Exception("download failed: " + json.dumps(response.json())) | |||
| num_chunks = int(response.headers["Content-Length"]) // self.chunk_size + 1 | |||
| bar = tqdm(total=num_chunks, desc="Downloading", unit="MB") | |||
| with open(save_path, "wb") as f: | |||
| for chunk in response.iter_content(chunk_size=self.chunk_size): | |||
| f.write(chunk) | |||
| bar.update(1) | |||
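`download_file` sizes its progress bar as `Content-Length // chunk_size + 1`. The sketch below compares that arithmetic with ceiling division, assuming every iteration yields a full chunk except possibly the last. Under that assumption the two agree except when the length is an exact multiple of the chunk size, where the bar total ends up one higher than the number of chunks written. Both function names are hypothetical.

```python
def chunks_as_in_diff(content_length: int, chunk_size: int) -> int:
    # Same arithmetic as download_file: floor division plus one.
    return content_length // chunk_size + 1


def chunks_ceil(content_length: int, chunk_size: int) -> int:
    # Ceiling division: the number of non-empty chunks needed for content_length bytes.
    return -(-content_length // chunk_size)


MB = 1024 * 1024
partial = (chunks_as_in_diff(5 * MB + 1, MB), chunks_ceil(5 * MB + 1, MB))
exact = (chunks_as_in_diff(5 * MB, MB), chunks_ceil(5 * MB, MB))
print(partial, exact)
```

The difference only affects the displayed total, not the downloaded bytes.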
| @@ -1,10 +0,0 @@ | |||
| def get_semantic_specification(): | |||
| semantic_specification = dict() | |||
| semantic_specification["Data"] = {"Type": "Class", "Values": ["Text"]} | |||
| semantic_specification["Task"] = {"Type": "Class", "Values": ["Segmentation"]} | |||
| semantic_specification["Library"] = {"Type": "Class", "Values": ["Scikit-learn"]} | |||
| semantic_specification["Scenario"] = {"Type": "Tag", "Values": ["Financial"]} | |||
| semantic_specification["License"] = {"Type": "Class", "Values": ["Apache-2.0"]} | |||
| semantic_specification["Name"] = {"Type": "String", "Values": "test"} | |||
| semantic_specification["Description"] = {"Type": "String", "Values": "test"} | |||
| return semantic_specification | |||
| @@ -0,0 +1,101 @@ | |||
| import os | |||
| import tempfile | |||
| from dataclasses import dataclass, field | |||
| from shutil import copyfile | |||
| from typing import List, Tuple, Union, Optional | |||
| from ...utils import save_dict_to_yaml, convert_folder_to_zipfile | |||
| from ...config import C | |||
| @dataclass | |||
| class ModelTemplate: | |||
| class_name: str = field(init=False) | |||
| template_path: str = field(init=False) | |||
| model_kwargs: dict = field(init=False) | |||
| @dataclass | |||
| class PickleModelTemplate(ModelTemplate): | |||
| model_kwargs: dict | |||
| pickle_filepath: str | |||
| def __post_init__(self): | |||
| self.class_name = "PickleLoadedModel" | |||
| self.template_path = os.path.join(C.package_path, "tests", "templates", "pickle_model.py") | |||
| default_model_kwargs = { | |||
| "predict_method": "predict", | |||
| "fit_method": "fit", | |||
| "finetune_method": "finetune", | |||
| "pickle_filename": "model.pkl", | |||
| } | |||
| default_model_kwargs.update(self.model_kwargs) | |||
| self.model_kwargs = default_model_kwargs | |||
| @dataclass | |||
| class StatSpecTemplate: | |||
| filepath: str | |||
| type: str = field(default="RKMETableSpecification") | |||
| class LearnwareTemplate: | |||
| @staticmethod | |||
| def generate_requirements(filepath, requirements: Optional[List[Union[Tuple[str, str, str], str]]] = None): | |||
| requirements = [] if requirements is None else requirements | |||
| operators = {"==", "~=", ">=", "<=", ">", "<"} | |||
| requirements_str = "" | |||
| for requirement in requirements: | |||
| if isinstance(requirement, str): | |||
| line_str = requirement.strip() + "\n" | |||
| elif isinstance(requirement, tuple): | |||
| assert requirement[1] in operators, f"The requirement operator {requirement[1]!r} is not supported." | |||
| line_str = requirement[0].strip() + requirement[1].strip() + requirement[2].strip() + "\n" | |||
| else: | |||
| raise TypeError(f"requirement must be type str/tuple, rather than {type(requirement)}") | |||
| requirements_str += line_str | |||
| with open(filepath, "w") as fdout: | |||
| fdout.write(requirements_str) | |||
| @staticmethod | |||
| def generate_learnware_yaml(filepath, model_config: Optional[dict] = None, stat_spec_config: Optional[List[dict]] = None): | |||
| learnware_config = {} | |||
| if model_config is not None: | |||
| learnware_config["model"] = model_config | |||
| if stat_spec_config is not None: | |||
| learnware_config["stat_specifications"] = stat_spec_config | |||
| save_dict_to_yaml(learnware_config, filepath) | |||
| @staticmethod | |||
| def generate_learnware_zipfile( | |||
| learnware_zippath: str, | |||
| model_template: ModelTemplate, | |||
| stat_spec_template: StatSpecTemplate, | |||
| requirements: Optional[List[Union[Tuple[str, str, str], str]]] = None, | |||
| ): | |||
| with tempfile.TemporaryDirectory(suffix="learnware_template") as tempdir: | |||
| requirement_filepath = os.path.join(tempdir, "requirements.txt") | |||
| LearnwareTemplate.generate_requirements(requirement_filepath, requirements) | |||
| model_filepath = os.path.join(tempdir, "__init__.py") | |||
| copyfile(model_template.template_path, model_filepath) | |||
| learnware_yaml_filepath = os.path.join(tempdir, "learnware.yaml") | |||
| model_config = { | |||
| "class_name": model_template.class_name, | |||
| "kwargs": model_template.model_kwargs, | |||
| } | |||
| stat_spec_config = { | |||
| "module_path": "learnware.specification", | |||
| "class_name": stat_spec_template.type, | |||
| "file_name": "stat_spec.json", | |||
| "kwargs": {} | |||
| } | |||
| copyfile(stat_spec_template.filepath, os.path.join(tempdir, stat_spec_config["file_name"])) | |||
| LearnwareTemplate.generate_learnware_yaml(learnware_yaml_filepath, model_config, stat_spec_config=[stat_spec_config]) | |||
| if isinstance(model_template, PickleModelTemplate): | |||
| pickle_filepath = os.path.join(tempdir, model_template.model_kwargs["pickle_filename"]) | |||
| copyfile(model_template.pickle_filepath, pickle_filepath) | |||
| convert_folder_to_zipfile(tempdir, learnware_zippath) | |||
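The requirement handling in `LearnwareTemplate.generate_requirements` accepts either a plain string or a `(name, operator, version)` tuple. The following is a minimal standalone sketch of that formatting step only. The function name `format_requirements` is hypothetical, and it raises `ValueError` where the diff uses `assert`.

```python
from typing import List, Optional, Tuple, Union

Requirement = Union[Tuple[str, str, str], str]
OPERATORS = {"==", "~=", ">=", "<=", ">", "<"}


def format_requirements(requirements: Optional[List[Requirement]] = None) -> str:
    """Render requirements as the text of a requirements.txt file (hypothetical helper)."""
    lines = []
    for requirement in requirements or []:
        if isinstance(requirement, str):
            # Plain strings are passed through with surrounding whitespace removed.
            lines.append(requirement.strip())
        elif isinstance(requirement, tuple):
            name, operator, version = requirement
            if operator not in OPERATORS:
                raise ValueError(f"Unsupported requirement operator: {operator!r}")
            lines.append(name.strip() + operator.strip() + version.strip())
        else:
            raise TypeError(f"requirement must be str or tuple, not {type(requirement)}")
    # One requirement per line, each terminated by a newline.
    return "".join(line + "\n" for line in lines)
```

For instance, `format_requirements([("numpy", ">=", "1.20.0"), " scikit-learn "])` returns the two lines `numpy>=1.20.0` and `scikit-learn`.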
| @@ -0,0 +1,33 @@ | |||
| import os | |||
| import pickle | |||
| import numpy as np | |||
| from learnware.model.base import BaseModel | |||
| class PickleLoadedModel(BaseModel): | |||
| def __init__( | |||
| self, | |||
| input_shape, | |||
| output_shape, | |||
| predict_method="predict", | |||
| fit_method="fit", | |||
| finetune_method="finetune", | |||
| pickle_filename="model.pkl", | |||
| ): | |||
| super(PickleLoadedModel, self).__init__(input_shape=input_shape, output_shape=output_shape) | |||
| dir_path = os.path.dirname(os.path.abspath(__file__)) | |||
| self.pickle_filepath = os.path.join(dir_path, pickle_filename) | |||
| with open(self.pickle_filepath, "rb") as fd: | |||
| self.model = pickle.load(fd) | |||
| self.predict_method = predict_method | |||
| self.fit_method = fit_method | |||
| self.finetune_method = finetune_method | |||
| def predict(self, X: np.ndarray) -> np.ndarray: | |||
| return getattr(self.model, self.predict_method)(X) | |||
| def fit(self, X: np.ndarray, y: np.ndarray): | |||
| getattr(self.model, self.fit_method)(X, y) | |||
| def finetune(self, X: np.ndarray, y: np.ndarray): | |||
| getattr(self.model, self.finetune_method)(X, y) | |||
| @@ -0,0 +1,9 @@ | |||
| import unittest | |||
| def parametrize(test_class, **kwargs): | |||
| test_loader = unittest.TestLoader() | |||
| test_names = test_loader.getTestCaseNames(test_class) | |||
| _suite = unittest.TestSuite() | |||
| for name in test_names: | |||
| _suite.addTest(test_class(name, **kwargs)) | |||
| return _suite | |||
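As a usage sketch for `parametrize`: the helper passes `**kwargs` to the test class constructor, so the test class has to accept them in `__init__`. The class `ExampleTest` and its `learnware_num` parameter below are hypothetical and exist only for illustration.

```python
import unittest


def parametrize(test_class, **kwargs):
    """Copy of the helper from the diff: one instance per test method, all sharing kwargs."""
    test_loader = unittest.TestLoader()
    test_names = test_loader.getTestCaseNames(test_class)
    _suite = unittest.TestSuite()
    for name in test_names:
        _suite.addTest(test_class(name, **kwargs))
    return _suite


class ExampleTest(unittest.TestCase):
    """Hypothetical test case that accepts an extra constructor argument."""

    def __init__(self, method_name="runTest", learnware_num=1):
        super().__init__(method_name)
        self.learnware_num = learnware_num

    def test_positive(self):
        self.assertGreater(self.learnware_num, 0)

    def test_is_int(self):
        self.assertIsInstance(self.learnware_num, int)


def run_example(learnware_num):
    """Build the parametrized suite, run it, and report (test count, success flag)."""
    suite = parametrize(ExampleTest, learnware_num=learnware_num)
    test_count = suite.countTestCases()
    result = unittest.TestResult()
    suite.run(result)
    return test_count, result.wasSuccessful()
```

Every test method in the class receives the same keyword arguments, which suits running one test class against several configurations.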
| @@ -3,7 +3,7 @@ import zipfile | |||
| from .import_utils import is_torch_available | |||
| from .module import get_module_by_module_path | |||
| from .file import read_yaml_to_dict, save_dict_to_yaml | |||
| from .file import read_yaml_to_dict, save_dict_to_yaml, convert_folder_to_zipfile | |||
| from .gpu import setup_seed, choose_device, allocate_cuda_idx | |||
| from ..config import get_platform, SystemType | |||
| @@ -1,5 +1,6 @@ | |||
| import os | |||
| import yaml | |||
| import zipfile | |||
| def save_dict_to_yaml(dict_value: dict, save_path: str): | |||
| """save dict object into yaml file""" | |||
| @@ -12,3 +13,13 @@ def read_yaml_to_dict(yaml_path: str): | |||
| with open(yaml_path, "r") as file: | |||
| dict_value = yaml.load(file.read(), Loader=yaml.FullLoader) | |||
| return dict_value | |||
| def convert_folder_to_zipfile(folder_path, zip_path): | |||
| with zipfile.ZipFile(zip_path, "w") as zip_obj: | |||
| for foldername, subfolders, filenames in os.walk(folder_path): | |||
| for filename in filenames: | |||
| file_path = os.path.join(foldername, filename) | |||
| zip_info = zipfile.ZipInfo(filename) | |||
| zip_info.compress_type = zipfile.ZIP_STORED | |||
| with open(file_path, "rb") as file: | |||
| zip_obj.writestr(zip_info, file.read()) | |||
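One property of `convert_folder_to_zipfile` is worth spelling out: because each entry is created with `zipfile.ZipInfo(filename)`, files are stored under their base name and the folder hierarchy is dropped, much like `zip -j`. The sketch below repeats the helper and demonstrates this on a temporary folder. The file names used in the demo are illustrative assumptions.

```python
import os
import tempfile
import zipfile


def convert_folder_to_zipfile(folder_path, zip_path):
    """Copy of the helper from the diff: store every file under its base name."""
    with zipfile.ZipFile(zip_path, "w") as zip_obj:
        for foldername, _subfolders, filenames in os.walk(folder_path):
            for filename in filenames:
                file_path = os.path.join(foldername, filename)
                zip_info = zipfile.ZipInfo(filename)  # base name only, no directory part
                zip_info.compress_type = zipfile.ZIP_STORED
                with open(file_path, "rb") as file:
                    zip_obj.writestr(zip_info, file.read())


def demo_flattened_names():
    """Zip a folder containing one nested file and return the sorted member names."""
    with tempfile.TemporaryDirectory() as workdir:
        folder = os.path.join(workdir, "learnware")
        os.makedirs(os.path.join(folder, "sub"))
        for rel_path in ("learnware.yaml", os.path.join("sub", "model.pkl")):
            with open(os.path.join(folder, rel_path), "w") as f:
                f.write("placeholder")
        # The archive is written outside the zipped folder so it is not walked itself.
        zip_path = os.path.join(workdir, "learnware.zip")
        convert_folder_to_zipfile(folder, zip_path)
        with zipfile.ZipFile(zip_path) as zip_obj:
            return sorted(zip_obj.namelist())
```

A consequence of this design is that two files with the same name in different subfolders would end up as duplicate entries in the archive, so the helper fits flat learnware folders best.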
| @@ -0,0 +1,103 @@ | |||
| import os | |||
| import unittest | |||
| import tempfile | |||
| import logging | |||
| import learnware | |||
| learnware.init(logging_level=logging.WARNING) | |||
| from learnware.learnware import Learnware | |||
| from learnware.client import LearnwareClient | |||
| from learnware.market import instantiate_learnware_market, BaseUserInfo, EasySemanticChecker | |||
| from learnware.config import C | |||
| class TestSearch(unittest.TestCase): | |||
| client = LearnwareClient() | |||
| @classmethod | |||
| def setUpClass(cls): | |||
| cls.market = instantiate_learnware_market(market_id="search_test", name="hetero", rebuild=True) | |||
| if cls.client.is_connected(): | |||
| cls._build_learnware_market() | |||
| @classmethod | |||
| def _build_learnware_market(cls): | |||
| table_learnware_ids = ["00001951", "00001980", "00001987"] | |||
| image_learnware_ids = ["00000851", "00000858", "00000841"] | |||
| text_learnware_ids = ["00000652", "00000637"] | |||
| learnware_ids = table_learnware_ids + image_learnware_ids + text_learnware_ids | |||
| with tempfile.TemporaryDirectory(prefix="learnware_search_test") as tempdir: | |||
| for learnware_id in learnware_ids: | |||
| learnware_zippath = os.path.join(tempdir, f"learnware_{learnware_id}.zip") | |||
| try: | |||
| cls.client.download_learnware(learnware_id=learnware_id, save_path=learnware_zippath) | |||
| semantic_spec = ( | |||
| cls.client.load_learnware(learnware_path=learnware_zippath) | |||
| .get_specification() | |||
| .get_semantic_spec() | |||
| ) | |||
| except Exception: | |||
| print(f"Learnware '{learnware_id}' is skipped due to a network problem.") | |||
| continue | |||
| cls.market.add_learnware( | |||
| learnware_zippath, | |||
| learnware_id=learnware_id, | |||
| semantic_spec=semantic_spec, | |||
| checker_names=["EasySemanticChecker"], | |||
| ) | |||
| def _skip_test(self): | |||
| if not self.client.is_connected(): | |||
| print("Client cannot connect!") | |||
| return True | |||
| return False | |||
| def test_image_search(self): | |||
| if not self._skip_test(): | |||
| learnware_id = "00000619" | |||
| try: | |||
| learnware: Learnware = self.client.load_learnware(learnware_id=learnware_id) | |||
| except Exception: | |||
| print("'test_image_search' is skipped due to a network problem.") | |||
| return | |||
| user_info = BaseUserInfo(stat_info=learnware.get_specification().get_stat_spec()) | |||
| search_result = self.market.search_learnware(user_info) | |||
| print("Single Search Results:", search_result.get_single_results()) | |||
| print("Multiple Search Results:", search_result.get_multiple_results()) | |||
| def test_text_search(self): | |||
| if not self._skip_test(): | |||
| learnware_id = "00000653" | |||
| try: | |||
| learnware: Learnware = self.client.load_learnware(learnware_id=learnware_id) | |||
| except Exception: | |||
| print("'test_text_search' is skipped due to a network problem.") | |||
| return | |||
| user_info = BaseUserInfo(stat_info=learnware.get_specification().get_stat_spec()) | |||
| search_result = self.market.search_learnware(user_info) | |||
| print("Single Search Results:", search_result.get_single_results()) | |||
| print("Multiple Search Results:", search_result.get_multiple_results()) | |||
| def test_table_search(self): | |||
| if not self._skip_test(): | |||
| learnware_id = "00001950" | |||
| try: | |||
| learnware: Learnware = self.client.load_learnware(learnware_id=learnware_id) | |||
| except Exception: | |||
| print("'test_table_search' is skipped due to a network problem.") | |||
| return | |||
| user_info = BaseUserInfo(stat_info=learnware.get_specification().get_stat_spec()) | |||
| search_result = self.market.search_learnware(user_info) | |||
| print("Single Search Results:", search_result.get_single_results()) | |||
| print("Multiple Search Results:", search_result.get_multiple_results()) | |||
| def suite(): | |||
| _suite = unittest.TestSuite() | |||
| _suite.addTest(TestSearch("test_image_search")) | |||
| _suite.addTest(TestSearch("test_text_search")) | |||
| _suite.addTest(TestSearch("test_table_search")) | |||
| return _suite | |||
| if __name__ == "__main__": | |||
| runner = unittest.TextTestRunner() | |||
| runner.run(suite()) | |||
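The search tests above print a message and return when the client is offline, which the test runner reports as a pass. A possible alternative, sketched here under stated assumptions, is `unittest`'s built-in skip mechanism, so that offline runs are reported as skipped. `FakeClient` is a hypothetical stand-in that models only an `is_connected` method and is not the real `LearnwareClient`.

```python
import unittest


class FakeClient:
    """Hypothetical stand-in for a client object; only `is_connected` is modelled."""

    def __init__(self, connected):
        self._connected = connected

    def is_connected(self):
        return self._connected


class SearchTestSketch(unittest.TestCase):
    client = FakeClient(connected=False)

    def setUp(self):
        # Mark the test as skipped instead of letting it pass silently.
        if not self.client.is_connected():
            self.skipTest("client cannot connect")

    def test_search(self):
        self.assertTrue(self.client.is_connected())


def run_sketch():
    """Run the sketch test case and return (number of skipped tests, success flag)."""
    result = unittest.TestResult()
    unittest.TestLoader().loadTestsFromTestCase(SearchTestSketch).run(result)
    return len(result.skipped), result.wasSuccessful()
```

With the fake client reporting no connection, the single test is recorded as skipped and the run still counts as successful.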
| @@ -1,8 +0,0 @@ | |||
| model: | |||
| class_name: MyModel | |||
| kwargs: {} | |||
| stat_specifications: | |||
| - module_path: learnware.specification | |||
| class_name: RKMETableSpecification | |||
| file_name: stat.json | |||
| kwargs: {} | |||
| @@ -1,16 +0,0 @@ | |||
| from learnware.model import BaseModel | |||
| import numpy as np | |||
| import joblib | |||
| import os | |||
| class MyModel(BaseModel): | |||
| def __init__(self): | |||
| super(MyModel, self).__init__(input_shape=(20,), output_shape=(1,)) | |||
| dir_path = os.path.dirname(os.path.abspath(__file__)) | |||
| model_path = os.path.join(dir_path, "ridge.pkl") | |||
| model = joblib.load(model_path) | |||
| self.model = model | |||
| def predict(self, X: np.ndarray) -> np.ndarray: | |||
| return self.model.predict(X) | |||
| @@ -1,16 +0,0 @@ | |||
| from learnware.model import BaseModel | |||
| import numpy as np | |||
| import joblib | |||
| import os | |||
| class MyModel(BaseModel): | |||
| def __init__(self): | |||
| super(MyModel, self).__init__(input_shape=(30,), output_shape=(1,)) | |||
| dir_path = os.path.dirname(os.path.abspath(__file__)) | |||
| model_path = os.path.join(dir_path, "ridge.pkl") | |||
| model = joblib.load(model_path) | |||
| self.model = model | |||
| def predict(self, X: np.ndarray) -> np.ndarray: | |||
| return self.model.predict(X) | |||
| @@ -1 +0,0 @@ | |||
| learnware == 0.1.0.999 | |||
| @@ -1,414 +0,0 @@ | |||
| import torch | |||
| import unittest | |||
| import os | |||
| import copy | |||
| import joblib | |||
| import zipfile | |||
| import numpy as np | |||
| import multiprocessing | |||
| from sklearn.linear_model import Ridge | |||
| from sklearn.datasets import make_regression | |||
| from shutil import copyfile, rmtree | |||
| from learnware.client import LearnwareClient | |||
| from sklearn.metrics import mean_squared_error | |||
| import learnware | |||
| from learnware.market import instantiate_learnware_market, BaseUserInfo | |||
| from learnware.specification import RKMETableSpecification, generate_rkme_table_spec | |||
| from learnware.reuse import HeteroMapAlignLearnware, AveragingReuser, EnsemblePruningReuser | |||
| from example_learnwares.config import ( | |||
| input_shape_list, | |||
| input_description_list, | |||
| output_description_list, | |||
| user_description_list, | |||
| ) | |||
| curr_root = os.path.dirname(os.path.abspath(__file__)) | |||
| user_semantic = { | |||
| "Data": {"Values": ["Table"], "Type": "Class"}, | |||
| "Task": { | |||
| "Values": ["Regression"], | |||
| "Type": "Class", | |||
| }, | |||
| "Library": {"Values": ["Scikit-learn"], "Type": "Class"}, | |||
| "Scenario": {"Values": ["Education"], "Type": "Tag"}, | |||
| "Description": {"Values": "", "Type": "String"}, | |||
| "Name": {"Values": "", "Type": "String"}, | |||
| "License": {"Values": ["MIT"], "Type": "Class"}, | |||
| } | |||
| def check_learnware(learnware_name, dir_path=os.path.join(curr_root, "learnware_pool")): | |||
| print(f"Checking Learnware: {learnware_name}") | |||
| zip_file_path = os.path.join(dir_path, learnware_name) | |||
| client = LearnwareClient() | |||
| # if check_learnware doesn't raise an exception, return True, otherwise, return false | |||
| try: | |||
| client.check_learnware(zip_file_path) | |||
| return True | |||
| except Exception as e: | |||
| print(f"Learnware {learnware_name} failed the check: {e}") | |||
| return False | |||
| class TestMarket(unittest.TestCase): | |||
| @classmethod | |||
| def setUpClass(cls) -> None: | |||
| np.random.seed(2023) | |||
| learnware.init() | |||
| def _init_learnware_market(self, organizer_kwargs=None): | |||
| """initialize learnware market""" | |||
| hetero_market = instantiate_learnware_market( | |||
| market_id="hetero_toy", name="hetero", rebuild=True, organizer_kwargs=organizer_kwargs | |||
| ) | |||
| return hetero_market | |||
| def test_prepare_learnware_randomly(self, learnware_num=5): | |||
| self.zip_path_list = [] | |||
| for i in range(learnware_num): | |||
| dir_path = os.path.join(curr_root, "learnware_pool", "ridge_%d" % (i)) | |||
| os.makedirs(dir_path, exist_ok=True) | |||
| print("Preparing Learnware: %d" % (i)) | |||
| example_learnware_idx = i % 2 | |||
| input_dim = input_shape_list[example_learnware_idx] | |||
| learnware_example_dir = "example_learnwares" | |||
| X, y = make_regression(n_samples=5000, n_informative=15, n_features=input_dim, noise=0.1, random_state=42) | |||
| clf = Ridge(alpha=1.0) | |||
| clf.fit(X, y) | |||
| joblib.dump(clf, os.path.join(dir_path, "ridge.pkl")) | |||
| spec = generate_rkme_table_spec(X=X, gamma=0.1, cuda_idx=0) | |||
| spec.save(os.path.join(dir_path, "stat.json")) | |||
| init_file = os.path.join(dir_path, "__init__.py") | |||
| copyfile( | |||
| os.path.join(curr_root, learnware_example_dir, f"model{example_learnware_idx}.py"), init_file | |||
| ) # cp example_init.py init_file | |||
| yaml_file = os.path.join(dir_path, "learnware.yaml") | |||
| copyfile( | |||
| os.path.join(curr_root, learnware_example_dir, "learnware.yaml"), yaml_file | |||
| ) # cp example.yaml yaml_file | |||
| env_file = os.path.join(dir_path, "requirements.txt") | |||
| copyfile(os.path.join(curr_root, learnware_example_dir, "requirements.txt"), env_file) | |||
| zip_file = dir_path + ".zip" | |||
| # zip -q -r -j zip_file dir_path | |||
| with zipfile.ZipFile(zip_file, "w") as zip_obj: | |||
| for foldername, subfolders, filenames in os.walk(dir_path): | |||
| for filename in filenames: | |||
| file_path = os.path.join(foldername, filename) | |||
| zip_info = zipfile.ZipInfo(filename) | |||
| zip_info.compress_type = zipfile.ZIP_STORED | |||
| with open(file_path, "rb") as file: | |||
| zip_obj.writestr(zip_info, file.read()) | |||
| rmtree(dir_path) # rm -r dir_path | |||
| self.zip_path_list.append(zip_file) | |||
| def test_generated_learnwares(self): | |||
| curr_root = os.path.dirname(os.path.abspath(__file__)) | |||
| dir_path = os.path.join(curr_root, "learnware_pool") | |||
| # Execute multi-process checking using Pool | |||
| mp_context = multiprocessing.get_context("spawn") | |||
| with mp_context.Pool() as pool: | |||
| results = pool.starmap(check_learnware, [(name, dir_path) for name in os.listdir(dir_path)]) | |||
| # Use an assert statement to ensure that all checks return True | |||
| self.assertTrue(all(results), "Not all learnwares passed the check") | |||
| def test_upload_delete_learnware(self, learnware_num=5, delete=True): | |||
| hetero_market = self._init_learnware_market() | |||
| self.test_prepare_learnware_randomly(learnware_num) | |||
| self.learnware_num = learnware_num | |||
| print("Total Item:", len(hetero_market)) | |||
| assert len(hetero_market) == 0, f"The market should be empty!" | |||
| for idx, zip_path in enumerate(self.zip_path_list): | |||
| semantic_spec = copy.deepcopy(user_semantic) | |||
| semantic_spec["Name"]["Values"] = "learnware_%d" % (idx) | |||
| semantic_spec["Description"]["Values"] = "test_learnware_number_%d" % (idx) | |||
| semantic_spec["Input"] = input_description_list[idx % 2] | |||
| semantic_spec["Output"] = output_description_list[idx % 2] | |||
| hetero_market.add_learnware(zip_path, semantic_spec) | |||
| print("Total Item:", len(hetero_market)) | |||
| assert len(hetero_market) == self.learnware_num, f"The number of learnwares must be {self.learnware_num}!" | |||
| curr_inds = hetero_market.get_learnware_ids() | |||
| print("Available ids After Uploading Learnwares:", curr_inds) | |||
| assert len(curr_inds) == self.learnware_num, f"The number of learnwares must be {self.learnware_num}!" | |||
| if delete: | |||
| for learnware_id in curr_inds: | |||
| hetero_market.delete_learnware(learnware_id) | |||
| self.learnware_num -= 1 | |||
| assert ( | |||
| len(hetero_market) == self.learnware_num | |||
| ), f"The number of learnwares must be {self.learnware_num}!" | |||
| curr_inds = hetero_market.get_learnware_ids() | |||
| print("Available ids After Deleting Learnwares:", curr_inds) | |||
| assert len(curr_inds) == 0, f"The market should be empty!" | |||
| return hetero_market | |||
| def test_train_market_model(self, learnware_num=5): | |||
| hetero_market = self._init_learnware_market( | |||
| organizer_kwargs={"auto_update": False, "auto_update_limit": learnware_num} | |||
| ) | |||
| self.test_prepare_learnware_randomly(learnware_num) | |||
| self.learnware_num = learnware_num | |||
| print("Total Item:", len(hetero_market)) | |||
| assert len(hetero_market) == 0, f"The market should be empty!" | |||
| for idx, zip_path in enumerate(self.zip_path_list): | |||
| semantic_spec = copy.deepcopy(user_semantic) | |||
| semantic_spec["Name"]["Values"] = "learnware_%d" % (idx) | |||
| semantic_spec["Description"]["Values"] = "test_learnware_number_%d" % (idx) | |||
| semantic_spec["Input"] = input_description_list[idx % 2] | |||
| semantic_spec["Output"] = output_description_list[idx % 2] | |||
| hetero_market.add_learnware(zip_path, semantic_spec) | |||
| print("Total Item:", len(hetero_market)) | |||
| assert len(hetero_market) == self.learnware_num, f"The number of learnwares must be {self.learnware_num}!" | |||
| curr_inds = hetero_market.get_learnware_ids() | |||
| print("Available ids After Uploading Learnwares:", curr_inds) | |||
| assert len(curr_inds) == self.learnware_num, f"The number of learnwares must be {self.learnware_num}!" | |||
| # organizer=hetero_market.learnware_organizer | |||
| # organizer.train(hetero_market.learnware_organizer.learnware_list.values()) | |||
| return hetero_market | |||
| def test_search_semantics(self, learnware_num=5): | |||
| hetero_market = self.test_upload_delete_learnware(learnware_num, delete=False) | |||
| print("Total Item:", len(hetero_market)) | |||
| assert len(hetero_market) == self.learnware_num, f"The number of learnwares must be {self.learnware_num}!" | |||
| semantic_spec = copy.deepcopy(user_semantic) | |||
| semantic_spec["Name"]["Values"] = f"learnware_{learnware_num - 1}" | |||
| user_info = BaseUserInfo(semantic_spec=semantic_spec) | |||
| search_result = hetero_market.search_learnware(user_info) | |||
| single_result = search_result.get_single_results() | |||
| print("User info:", user_info.get_semantic_spec()) | |||
| print(f"Search result:") | |||
| assert len(single_result) == 1, f"Exact semantic search failed!" | |||
| for search_item in single_result: | |||
| semantic_spec1 = search_item.learnware.get_specification().get_semantic_spec() | |||
| print("Choose learnware:", search_item.learnware.id, semantic_spec1) | |||
| assert semantic_spec1["Name"]["Values"] == semantic_spec["Name"]["Values"], f"Exact semantic search failed!" | |||
| semantic_spec["Name"]["Values"] = "laernwaer" | |||
| user_info = BaseUserInfo(semantic_spec=semantic_spec) | |||
| search_result = hetero_market.search_learnware(user_info) | |||
| single_result = search_result.get_single_results() | |||
| print("User info:", user_info.get_semantic_spec()) | |||
| print(f"Search result:") | |||
| assert len(single_result) == self.learnware_num, f"Fuzzy semantic search failed!" | |||
| for search_item in single_result: | |||
| semantic_spec1 = search_item.learnware.get_specification().get_semantic_spec() | |||
| print("Choose learnware:", search_item.learnware.id, semantic_spec1) | |||
| def test_stat_search(self, learnware_num=5): | |||
| hetero_market = self.test_train_market_model(learnware_num) | |||
| print("Total Item:", len(hetero_market)) | |||
| # hetero test | |||
| print("+++++ HETERO TEST ++++++") | |||
| user_dim = 15 | |||
| test_folder = os.path.join(curr_root, "test_stat") | |||
| for idx, zip_path in enumerate(self.zip_path_list): | |||
| unzip_dir = os.path.join(test_folder, f"{idx}") | |||
| # unzip -o -q zip_path -d unzip_dir | |||
| if os.path.exists(unzip_dir): | |||
| rmtree(unzip_dir) | |||
| os.makedirs(unzip_dir, exist_ok=True) | |||
| with zipfile.ZipFile(zip_path, "r") as zip_obj: | |||
| zip_obj.extractall(path=unzip_dir) | |||
| user_spec = RKMETableSpecification() | |||
| user_spec.load(os.path.join(unzip_dir, "stat.json")) | |||
| z = user_spec.get_z() | |||
| z = z[:, :user_dim] | |||
| device = user_spec.device | |||
| z = torch.tensor(z, device=device) | |||
| user_spec.z = z | |||
| print(">> normal case test:") | |||
| semantic_spec = copy.deepcopy(user_semantic) | |||
| semantic_spec["Input"] = copy.deepcopy(input_description_list[idx % 2]) | |||
| semantic_spec["Input"]["Dimension"] = user_dim | |||
| # keep only the first user_dim descriptions | |||
| semantic_spec["Input"]["Description"] = { | |||
| str(key): semantic_spec["Input"]["Description"][str(key)] for key in range(user_dim) | |||
| } | |||
| user_info = BaseUserInfo(semantic_spec=semantic_spec, stat_info={"RKMETableSpecification": user_spec}) | |||
| search_result = hetero_market.search_learnware(user_info) | |||
| single_result = search_result.get_single_results() | |||
| multiple_result = search_result.get_multiple_results() | |||
| print(f"search result of user{idx}:") | |||
| for single_item in single_result: | |||
| print(f"score: {single_item.score}, learnware_id: {single_item.learnware.id}") | |||
| for multiple_item in multiple_result: | |||
| print( | |||
| f"mixture_score: {multiple_item.score}, mixture_learnware_ids: {[item.id for item in multiple_item.learnwares]}" | |||
| ) | |||
| # improper key "Task" in semantic_spec, use homo search and print invalid semantic_spec | |||
| print(">> test for key 'Task' has empty 'Values':") | |||
| semantic_spec["Task"] = {"Values": ["Segmentation"], "Type": "Class"} | |||
| user_info = BaseUserInfo(semantic_spec=semantic_spec, stat_info={"RKMETableSpecification": user_spec}) | |||
| search_result = hetero_market.search_learnware(user_info) | |||
| single_result = search_result.get_single_results() | |||
| assert len(single_result) == 0, f"Statistical search failed!" | |||
| # delete key "Task" in semantic_spec, use homo search and print WARNING INFO with "User doesn't provide correct task type" | |||
| print(">> delete key 'Task' test:") | |||
| semantic_spec.pop("Task") | |||
| user_info = BaseUserInfo(semantic_spec=semantic_spec, stat_info={"RKMETableSpecification": user_spec}) | |||
| search_result = hetero_market.search_learnware(user_info) | |||
| single_result = search_result.get_single_results() | |||
| assert len(single_result) == 0, f"Statistical search failed!" | |||
| # modify semantic info with mismatch dim, use homo search and print "User data feature dimensions mismatch with semantic specification." | |||
| print(">> mismatch dim test") | |||
| semantic_spec = copy.deepcopy(user_semantic) | |||
| semantic_spec["Input"] = copy.deepcopy(input_description_list[idx % 2]) | |||
| semantic_spec["Input"]["Dimension"] = user_dim - 2 | |||
| semantic_spec["Input"]["Description"] = { | |||
| str(key): semantic_spec["Input"]["Description"][str(key)] for key in range(user_dim) | |||
| } | |||
| user_info = BaseUserInfo(semantic_spec=semantic_spec, stat_info={"RKMETableSpecification": user_spec}) | |||
| search_result = hetero_market.search_learnware(user_info) | |||
| single_result = search_result.get_single_results() | |||
| assert len(single_result) == 0, f"Statistical search failed!" | |||
| rmtree(test_folder) # rm -r test_folder | |||
| # homo test | |||
| print("\n+++++ HOMO TEST ++++++") | |||
| test_folder = os.path.join(curr_root, "test_stat") | |||
| for idx, zip_path in enumerate(self.zip_path_list): | |||
| unzip_dir = os.path.join(test_folder, f"{idx}") | |||
| # unzip -o -q zip_path -d unzip_dir | |||
| if os.path.exists(unzip_dir): | |||
| rmtree(unzip_dir) | |||
| os.makedirs(unzip_dir, exist_ok=True) | |||
| with zipfile.ZipFile(zip_path, "r") as zip_obj: | |||
| zip_obj.extractall(path=unzip_dir) | |||
| user_spec = RKMETableSpecification() | |||
| user_spec.load(os.path.join(unzip_dir, "stat.json")) | |||
| user_info = BaseUserInfo(semantic_spec=user_semantic, stat_info={"RKMETableSpecification": user_spec}) | |||
| search_result = hetero_market.search_learnware(user_info) | |||
| single_result = search_result.get_single_results() | |||
| multiple_result = search_result.get_multiple_results() | |||
| assert len(single_result) >= 1, f"Statistical search failed!" | |||
| print(f"search result of user{idx}:") | |||
| for single_item in single_result: | |||
| print(f"score: {single_item.score}, learnware_id: {single_item.learnware.id}") | |||
| for multiple_item in multiple_result: | |||
| print(f"mixture_score: {multiple_item.score}\n") | |||
| mixture_id = " ".join([learnware.id for learnware in multiple_item.learnwares]) | |||
| print(f"mixture_learnware: {mixture_id}\n") | |||
| rmtree(test_folder) # rm -r test_folder | |||
| def test_model_reuse(self, learnware_num=5): | |||
| # generate toy regression problem | |||
| X, y = make_regression(n_samples=5000, n_informative=10, n_features=15, noise=0.1, random_state=0) | |||
| # generate rkme | |||
| user_spec = generate_rkme_table_spec(X=X, gamma=0.1, cuda_idx=0) | |||
| # generate specification | |||
| semantic_spec = copy.deepcopy(user_semantic) | |||
| semantic_spec["Input"] = user_description_list[0] | |||
| user_info = BaseUserInfo(semantic_spec=semantic_spec, stat_info={"RKMETableSpecification": user_spec}) | |||
| # learnware market search | |||
| hetero_market = self.test_train_market_model(learnware_num) | |||
| search_result = hetero_market.search_learnware(user_info) | |||
| single_result = search_result.get_single_results() | |||
| multiple_result = search_result.get_multiple_results() | |||
| # print search results | |||
| for single_item in single_result: | |||
| print(f"score: {single_item.score}, learnware_id: {single_item.learnware.id}") | |||
| for multiple_item in multiple_result: | |||
| print( | |||
| f"mixture_score: {multiple_item.score}, mixture_learnware_ids: {[item.id for item in multiple_item.learnwares]}" | |||
| ) | |||
| # single model reuse | |||
| hetero_learnware = HeteroMapAlignLearnware(single_result[0].learnware, mode="regression") | |||
| hetero_learnware.align(user_spec, X[:100], y[:100]) | |||
| single_predict_y = hetero_learnware.predict(X) | |||
| # multi model reuse | |||
| hetero_learnware_list = [] | |||
| for learnware in multiple_result[0].learnwares: | |||
| hetero_learnware = HeteroMapAlignLearnware(learnware, mode="regression") | |||
| hetero_learnware.align(user_spec, X[:100], y[:100]) | |||
| hetero_learnware_list.append(hetero_learnware) | |||
| # Use averaging ensemble reuser to reuse the searched learnwares to make prediction | |||
| reuse_ensemble = AveragingReuser(learnware_list=hetero_learnware_list, mode="mean") | |||
| ensemble_predict_y = reuse_ensemble.predict(user_data=X) | |||
| # Use ensemble pruning reuser to reuse the searched learnwares to make prediction | |||
| reuse_ensemble = EnsemblePruningReuser(learnware_list=hetero_learnware_list, mode="regression") | |||
| reuse_ensemble.fit(X[:100], y[:100]) | |||
| ensemble_pruning_predict_y = reuse_ensemble.predict(user_data=X) | |||
| print("Single model RMSE by finetune:", mean_squared_error(y, single_predict_y, squared=False)) | |||
| print("Averaging Reuser RMSE:", mean_squared_error(y, ensemble_predict_y, squared=False)) | |||
| print("Ensemble Pruning Reuser RMSE:", mean_squared_error(y, ensemble_pruning_predict_y, squared=False)) | |||
| def suite(): | |||
| _suite = unittest.TestSuite() | |||
| _suite.addTest(TestMarket("test_prepare_learnware_randomly")) | |||
| _suite.addTest(TestMarket("test_generated_learnwares")) | |||
| _suite.addTest(TestMarket("test_upload_delete_learnware")) | |||
| _suite.addTest(TestMarket("test_train_market_model")) | |||
| _suite.addTest(TestMarket("test_search_semantics")) | |||
| _suite.addTest(TestMarket("test_stat_search")) | |||
| _suite.addTest(TestMarket("test_model_reuse")) | |||
| return _suite | |||
| if __name__ == "__main__": | |||
| runner = unittest.TextTestRunner() | |||
| runner.run(suite()) | |||
| @@ -3,17 +3,20 @@ import json | |||
| import zipfile | |||
| import unittest | |||
| import tempfile | |||
| import argparse | |||
| from learnware.client import LearnwareClient | |||
| from learnware.specification import Specification | |||
| from learnware.specification import generate_semantic_spec | |||
| from learnware.market import BaseUserInfo | |||
| class TestAllLearnware(unittest.TestCase): | |||
| def setUp(self): | |||
| unittest.TestCase.setUpClass() | |||
| dir_path = os.path.dirname(__file__) | |||
| config_path = os.path.join(dir_path, "config.json") | |||
| client = LearnwareClient() | |||
| @classmethod | |||
| def setUpClass(cls) -> None: | |||
| config_path = os.path.join(os.path.dirname(__file__), "config.json") | |||
| if not os.path.exists(config_path): | |||
| data = {"email": None, "token": None} | |||
| with open(config_path, "w") as file: | |||
| @@ -21,42 +24,53 @@ class TestAllLearnware(unittest.TestCase): | |||
| with open(config_path, "r") as file: | |||
| data = json.load(file) | |||
| email = data["email"] | |||
| token = data["token"] | |||
| email = data.get("email") | |||
| token = data.get("token") | |||
| if email is None or token is None: | |||
| raise ValueError("Please set email and token in config.json.") | |||
| self.client = LearnwareClient() | |||
| self.client.login(email, token) | |||
| print("Please set email and token in config.json.") | |||
| else: | |||
| cls.client.login(email, token) | |||
| def _skip_test(self): | |||
| if not self.client.is_login(): | |||
| print("Client is not logged in!") | |||
| return True | |||
| return False | |||
| def test_all_learnware(self): | |||
| max_learnware_num = 2000 | |||
| semantic_spec = self.client.create_semantic_specification() | |||
| user_info = BaseUserInfo(semantic_spec=semantic_spec, stat_info={}) | |||
| result = self.client.search_learnware(user_info, page_size=max_learnware_num) | |||
| learnware_ids = result["single"]["learnware_ids"] | |||
| keys = [key for key in result["single"]["semantic_specifications"][0]] | |||
| print(f"result size: {len(learnware_ids)}") | |||
| print(f"key in result: {keys}") | |||
| failed_ids = [] | |||
| with tempfile.TemporaryDirectory(prefix="learnware_") as tempdir: | |||
| for idx in learnware_ids: | |||
| zip_path = os.path.join(tempdir, f"test_{idx}.zip") | |||
| self.client.download_learnware(idx, zip_path) | |||
| with zipfile.ZipFile(zip_path, "r") as zip_file: | |||
| with zip_file.open("semantic_specification.json") as json_file: | |||
| semantic_spec = json.load(json_file) | |||
| try: | |||
| LearnwareClient.check_learnware(zip_path, semantic_spec) | |||
| print(f"check learnware {idx} succeeded") | |||
| except Exception: | |||
| failed_ids.append(idx) | |||
| print(f"check learnware {idx} failed!!!") | |||
| print(f"The currently failed learnware ids: {failed_ids}") | |||
| if not self._skip_test(): | |||
| semantic_spec = generate_semantic_spec() | |||
| user_info = BaseUserInfo(semantic_spec=semantic_spec, stat_info={}) | |||
| result = self.client.search_learnware(user_info, page_size=None) | |||
| learnware_ids = result["single"]["learnware_ids"] | |||
| keys = [key for key in result["single"]["semantic_specifications"][0]] | |||
| print(f"result size: {len(learnware_ids)}") | |||
| print(f"key in result: {keys}") | |||
| failed_ids = [] | |||
| with tempfile.TemporaryDirectory(prefix="learnware_") as tempdir: | |||
| for idx in learnware_ids: | |||
| zip_path = os.path.join(tempdir, f"test_{idx}.zip") | |||
| self.client.download_learnware(idx, zip_path) | |||
| try: | |||
| semantic_spec = self.client.get_semantic_specification(idx) | |||
| LearnwareClient.check_learnware(zip_path, semantic_spec) | |||
| print(f"check learnware {idx} succeeded") | |||
| except Exception: | |||
| failed_ids.append(idx) | |||
| print(f"check learnware {idx} failed!!!") | |||
| print(f"The currently failed learnware ids: {failed_ids}") | |||
| def suite(): | |||
| _suite = unittest.TestSuite() | |||
| _suite.addTest(TestAllLearnware("test_all_learnware")) | |||
| return _suite | |||
| if __name__ == "__main__": | |||
| unittest.main() | |||
| runner = unittest.TextTestRunner() | |||
| runner.run(suite()) | |||
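Review note: the `_skip_test` helper above prints a message and returns, so a run without credentials is reported as a pass. A minimal sketch of an alternative, assuming only the standard library, is to call `skipTest` so the runner records the test as skipped. `load_credentials` and `NeedsLoginTest` are hypothetical names used for illustration; they are not part of `learnware`.

```python
import json
import os
import tempfile
import unittest


def load_credentials(config_path):
    """Return (email, token); write a placeholder config if none exists."""
    if not os.path.exists(config_path):
        with open(config_path, "w") as f:
            json.dump({"email": None, "token": None}, f)
    with open(config_path, "r") as f:
        data = json.load(f)
    return data.get("email"), data.get("token")


class NeedsLoginTest(unittest.TestCase):
    # Hypothetical stand-in: a real test would fill this in setUpClass
    # from load_credentials(...) and log a client in.
    credentials = (None, None)

    def setUp(self):
        email, token = self.credentials
        if email is None or token is None:
            # Reported as "skipped" rather than silently passing.
            self.skipTest("Please set email and token in config.json.")

    def test_requires_login(self):
        self.assertIsNotNone(self.credentials[0])


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tempdir:
        print(load_credentials(os.path.join(tempdir, "config.json")))
    unittest.main()
```

With this shape the test report distinguishes "not run because no credentials" from "ran and passed", which the print-and-return approach cannot do.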
| @@ -4,7 +4,6 @@ import zipfile | |||
| import unittest | |||
| import tempfile | |||
| from learnware.client import LearnwareClient | |||
| @@ -13,15 +12,19 @@ class TestCheckLearnware(unittest.TestCase): | |||
| unittest.TestCase.setUpClass() | |||
| self.client = LearnwareClient() | |||
| def test_check_learnware_pip(self): | |||
| def test_check_learnware_pip_only_zip(self): | |||
| learnware_id = "00000208" | |||
| with tempfile.TemporaryDirectory(prefix="learnware_") as tempdir: | |||
| self.zip_path = os.path.join(tempdir, "test.zip") | |||
| self.client.download_learnware(learnware_id, self.zip_path) | |||
| LearnwareClient.check_learnware(self.zip_path) | |||
| with zipfile.ZipFile(self.zip_path, "r") as zip_file: | |||
| with zip_file.open("semantic_specification.json") as json_file: | |||
| semantic_spec = json.load(json_file) | |||
| def test_check_learnware_pip(self): | |||
| learnware_id = "00000208" | |||
| with tempfile.TemporaryDirectory(prefix="learnware_") as tempdir: | |||
| self.zip_path = os.path.join(tempdir, "test.zip") | |||
| self.client.download_learnware(learnware_id, self.zip_path) | |||
| semantic_spec = self.client.get_semantic_specification(learnware_id) | |||
| LearnwareClient.check_learnware(self.zip_path, semantic_spec) | |||
| def test_check_learnware_conda(self): | |||
| @@ -29,10 +32,7 @@ class TestCheckLearnware(unittest.TestCase): | |||
| with tempfile.TemporaryDirectory(prefix="learnware_") as tempdir: | |||
| self.zip_path = os.path.join(tempdir, "test.zip") | |||
| self.client.download_learnware(learnware_id, self.zip_path) | |||
| with zipfile.ZipFile(self.zip_path, "r") as zip_file: | |||
| with zip_file.open("semantic_specification.json") as json_file: | |||
| semantic_spec = json.load(json_file) | |||
| semantic_spec = self.client.get_semantic_specification(learnware_id) | |||
| LearnwareClient.check_learnware(self.zip_path, semantic_spec) | |||
| def test_check_learnware_dependency(self): | |||
| @@ -40,10 +40,7 @@ class TestCheckLearnware(unittest.TestCase): | |||
| with tempfile.TemporaryDirectory(prefix="learnware_") as tempdir: | |||
| self.zip_path = os.path.join(tempdir, "test.zip") | |||
| self.client.download_learnware(learnware_id, self.zip_path) | |||
| with zipfile.ZipFile(self.zip_path, "r") as zip_file: | |||
| with zip_file.open("semantic_specification.json") as json_file: | |||
| semantic_spec = json.load(json_file) | |||
| semantic_spec = self.client.get_semantic_specification(learnware_id) | |||
| LearnwareClient.check_learnware(self.zip_path, semantic_spec) | |||
| def test_check_learnware_image(self): | |||
| @@ -51,10 +48,7 @@ class TestCheckLearnware(unittest.TestCase): | |||
| with tempfile.TemporaryDirectory(prefix="learnware_") as tempdir: | |||
| self.zip_path = os.path.join(tempdir, "test.zip") | |||
| self.client.download_learnware(learnware_id, self.zip_path) | |||
| with zipfile.ZipFile(self.zip_path, "r") as zip_file: | |||
| with zip_file.open("semantic_specification.json") as json_file: | |||
| semantic_spec = json.load(json_file) | |||
| semantic_spec = self.client.get_semantic_specification(learnware_id) | |||
| LearnwareClient.check_learnware(self.zip_path, semantic_spec) | |||
| def test_check_learnware_text(self): | |||
| @@ -62,12 +56,21 @@ class TestCheckLearnware(unittest.TestCase): | |||
| with tempfile.TemporaryDirectory(prefix="learnware_") as tempdir: | |||
| self.zip_path = os.path.join(tempdir, "test.zip") | |||
| self.client.download_learnware(learnware_id, self.zip_path) | |||
| with zipfile.ZipFile(self.zip_path, "r") as zip_file: | |||
| with zip_file.open("semantic_specification.json") as json_file: | |||
| semantic_spec = json.load(json_file) | |||
| semantic_spec = self.client.get_semantic_specification(learnware_id) | |||
| LearnwareClient.check_learnware(self.zip_path, semantic_spec) | |||
| def suite(): | |||
| _suite = unittest.TestSuite() | |||
| _suite.addTest(TestCheckLearnware("test_check_learnware_pip_only_zip")) | |||
| _suite.addTest(TestCheckLearnware("test_check_learnware_pip")) | |||
| _suite.addTest(TestCheckLearnware("test_check_learnware_conda")) | |||
| _suite.addTest(TestCheckLearnware("test_check_learnware_dependency")) | |||
| _suite.addTest(TestCheckLearnware("test_check_learnware_image")) | |||
| _suite.addTest(TestCheckLearnware("test_check_learnware_text")) | |||
| return _suite | |||
| if __name__ == "__main__": | |||
| unittest.main() | |||
| runner = unittest.TextTestRunner() | |||
| runner.run(suite()) | |||
| @@ -0,0 +1,51 @@ | |||
| import unittest | |||
| import numpy as np | |||
| from learnware.client import LearnwareClient | |||
| from learnware.client.container import LearnwaresContainer | |||
| class TestContainer(unittest.TestCase): | |||
| def __init__(self, method_name='runTest', mode="all"): | |||
| super(TestContainer, self).__init__(method_name) | |||
| self.modes = [] | |||
| if mode in {"all", "conda"}: | |||
| self.modes.append("conda") | |||
| if mode in {"all", "docker"}: | |||
| self.modes.append("docker") | |||
| def setUp(self): | |||
| self.client = LearnwareClient() | |||
| def _test_container_with_pip(self, mode): | |||
| learnware_id = "00000147" | |||
| learnware = self.client.load_learnware(learnware_id=learnware_id) | |||
| with LearnwaresContainer(learnware, ignore_error=False, mode=mode) as env_container: | |||
| learnware = env_container.get_learnwares_with_container()[0] | |||
| input_array = np.random.random(size=(20, 23)) | |||
| print(learnware.predict(input_array)) | |||
| def _test_container_with_conda(self, mode): | |||
| learnware_id = "00000148" | |||
| learnware = self.client.load_learnware(learnware_id=learnware_id) | |||
| with LearnwaresContainer(learnware, ignore_error=False, mode=mode) as env_container: | |||
| learnware = env_container.get_learnwares_with_container()[0] | |||
| input_array = np.random.random(size=(20, 204)) | |||
| print(learnware.predict(input_array)) | |||
| def test_container_with_pip(self): | |||
| for mode in self.modes: | |||
| self._test_container_with_pip(mode=mode) | |||
| def test_container_with_conda(self): | |||
| for mode in self.modes: | |||
| self._test_container_with_conda(mode=mode) | |||
| def suite(): | |||
| _suite = unittest.TestSuite() | |||
| _suite.addTest(TestContainer("test_container_with_pip", mode="all")) | |||
| _suite.addTest(TestContainer("test_container_with_conda", mode="all")) | |||
| return _suite | |||
| if __name__ == "__main__": | |||
| runner = unittest.TextTestRunner() | |||
| runner.run(suite()) | |||
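Review note: `TestContainer` takes an extra `mode` argument in its constructor, which only takes effect when tests are added to a suite by hand as above. The default loader instantiates each test case with just the method name, so under `unittest.main()` or discovery the default `mode="all"` applies. The sketch below illustrates the pattern with the standard library only; `ModeAwareTest` and `build_suite` are hypothetical names, and the body of the test is a placeholder for the real container checks.

```python
import unittest


class ModeAwareTest(unittest.TestCase):
    """Hypothetical test case parameterized by an extra `mode` argument."""

    def __init__(self, method_name="runTest", mode="all"):
        super().__init__(method_name)
        self.modes = []
        if mode in {"all", "conda"}:
            self.modes.append("conda")
        if mode in {"all", "docker"}:
            self.modes.append("docker")

    def test_modes(self):
        for mode in self.modes:
            # subTest reports each mode separately if one of them fails.
            with self.subTest(mode=mode):
                self.assertIn(mode, {"conda", "docker"})


def build_suite(mode="all"):
    """Build a suite by hand so the extra argument reaches the test case."""
    suite = unittest.TestSuite()
    suite.addTest(ModeAwareTest("test_modes", mode=mode))
    return suite


if __name__ == "__main__":
    unittest.TextTestRunner().run(build_suite("conda"))
```

Wrapping each iteration in `subTest` is optional; it keeps a failure in one mode from hiding the result of the other.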
| @@ -1,82 +0,0 @@ | |||
| import os | |||
| import unittest | |||
| import zipfile | |||
| import numpy as np | |||
| import learnware | |||
| from learnware.learnware import get_learnware_from_dirpath | |||
| from learnware.client import LearnwareClient | |||
| from learnware.client.container import ModelCondaContainer, LearnwaresContainer | |||
| from learnware.reuse import AveragingReuser | |||
| class TestLearnwareLoad(unittest.TestCase): | |||
| def setUp(self): | |||
| unittest.TestCase.setUpClass() | |||
| self.client = LearnwareClient() | |||
| root = os.path.dirname(__file__) | |||
| self.learnware_ids = ["00000910", "00000899", "00000900"] | |||
| self.zip_paths = [os.path.join(root, x) for x in ["1.zip", "2.zip", "3.zip"]] | |||
| def test_load_single_learnware_by_zippath(self): | |||
| for learnware_id, zip_path in zip(self.learnware_ids, self.zip_paths): | |||
| self.client.download_learnware(learnware_id, zip_path) | |||
| learnware_list = [ | |||
| self.client.load_learnware(learnware_path=zippath, runnable_option="conda") for zippath in self.zip_paths | |||
| ] | |||
| reuser = AveragingReuser(learnware_list, mode="mean") | |||
| input_array = np.random.random(size=(20, 40)) | |||
| print(reuser.predict(input_array)) | |||
| for learnware in learnware_list: | |||
| print(learnware.id, learnware.predict(input_array)) | |||
| def test_load_multi_learnware_by_zippath(self): | |||
| for learnware_id, zip_path in zip(self.learnware_ids, self.zip_paths): | |||
| self.client.download_learnware(learnware_id, zip_path) | |||
| learnware_list = self.client.load_learnware(learnware_path=self.zip_paths, runnable_option="conda") | |||
| reuser = AveragingReuser(learnware_list, mode="mean") | |||
| input_array = np.random.random(size=(20, 40)) | |||
| print(reuser.predict(input_array)) | |||
| for learnware in learnware_list: | |||
| print(learnware.id, learnware.predict(input_array)) | |||
| def test_load_single_learnware_by_id(self): | |||
| learnware_list = [ | |||
| self.client.load_learnware(learnware_id=idx, runnable_option="conda") for idx in self.learnware_ids | |||
| ] | |||
| reuser = AveragingReuser(learnware_list, mode="mean") | |||
| input_array = np.random.random(size=(20, 40)) | |||
| print(reuser.predict(input_array)) | |||
| for learnware in learnware_list: | |||
| print(learnware.id, learnware.predict(input_array)) | |||
| def test_load_multi_learnware_by_id(self): | |||
| learnware_list = self.client.load_learnware(learnware_id=self.learnware_ids, runnable_option="conda") | |||
| reuser = AveragingReuser(learnware_list, mode="mean") | |||
| input_array = np.random.random(size=(20, 40)) | |||
| print(reuser.predict(input_array)) | |||
| for learnware in learnware_list: | |||
| print(learnware.id, learnware.predict(input_array)) | |||
| def test_load_single_learnware_by_id_pip(self): | |||
| learnware_id = "00000147" | |||
| learnware = self.client.load_learnware(learnware_id=learnware_id, runnable_option="conda") | |||
| input_array = np.random.random(size=(20, 23)) | |||
| print(learnware.predict(input_array)) | |||
| def test_load_single_learnware_by_id_conda(self): | |||
| learnware_id = "00000148" | |||
| learnware = self.client.load_learnware(learnware_id=learnware_id, runnable_option="conda") | |||
| input_array = np.random.random(size=(20, 204)) | |||
| print(learnware.predict(input_array)) | |||
| if __name__ == "__main__": | |||
| unittest.main() | |||
| @@ -1,57 +0,0 @@ | |||
| import os | |||
| import unittest | |||
| import zipfile | |||
| import numpy as np | |||
| import learnware | |||
| from learnware.learnware import get_learnware_from_dirpath | |||
| from learnware.client import LearnwareClient | |||
| from learnware.client.container import ModelCondaContainer, LearnwaresContainer | |||
| from learnware.reuse import AveragingReuser | |||
| class TestLearnwareLoad(unittest.TestCase): | |||
| def setUp(self): | |||
| unittest.TestCase.setUpClass() | |||
| self.client = LearnwareClient() | |||
| root = os.path.dirname(__file__) | |||
| self.learnware_ids = ["00000910", "00000899", "00000900"] | |||
| self.zip_paths = [os.path.join(root, x) for x in ["1.zip", "2.zip", "3.zip"]] | |||
| def test_load_multi_learnware_by_zippath(self): | |||
| for learnware_id, zip_path in zip(self.learnware_ids, self.zip_paths): | |||
| self.client.download_learnware(learnware_id, zip_path) | |||
| learnware_list = self.client.load_learnware(learnware_path=self.zip_paths, runnable_option="docker") | |||
| reuser = AveragingReuser(learnware_list, mode="mean") | |||
| input_array = np.random.random(size=(20, 40)) | |||
| print(reuser.predict(input_array)) | |||
| for learnware in learnware_list: | |||
| print(learnware.id, learnware.predict(input_array)) | |||
| def test_load_multi_learnware_by_id(self): | |||
| learnware_list = self.client.load_learnware(learnware_id=self.learnware_ids, runnable_option="docker") | |||
| reuser = AveragingReuser(learnware_list, mode="mean") | |||
| input_array = np.random.random(size=(20, 40)) | |||
| print(reuser.predict(input_array)) | |||
| for learnware in learnware_list: | |||
| print(learnware.id, learnware.predict(input_array)) | |||
| def test_load_single_learnware_by_id_pip(self): | |||
| learnware_id = "00000147" | |||
| learnware = self.client.load_learnware(learnware_id=learnware_id, runnable_option="docker") | |||
| input_array = np.random.random(size=(20, 23)) | |||
| print(learnware.predict(input_array)) | |||
| def test_load_single_learnware_by_id_conda(self): | |||
| learnware_id = "00000148" | |||
| learnware = self.client.load_learnware(learnware_id=learnware_id, runnable_option="docker") | |||
| input_array = np.random.random(size=(20, 204)) | |||
| print(learnware.predict(input_array)) | |||
| if __name__ == "__main__": | |||
| unittest.main() | |||
| @@ -0,0 +1,61 @@ | |||
| import os | |||
| import unittest | |||
| import numpy as np | |||
| from learnware.client import LearnwareClient | |||
| from learnware.reuse import AveragingReuser | |||
| class TestLearnwareLoad(unittest.TestCase): | |||
| def __init__(self, method_name='runTest', mode="all"): | |||
| super(TestLearnwareLoad, self).__init__(method_name) | |||
| self.runnable_options = [] | |||
| if mode in {"all", "conda"}: | |||
| self.runnable_options.append("conda") | |||
| if mode in {"all", "docker"}: | |||
| self.runnable_options.append("docker") | |||
| def setUp(self): | |||
| self.client = LearnwareClient() | |||
| root = os.path.dirname(__file__) | |||
| self.learnware_ids = ["00000910", "00000899", "00000900"] | |||
| self.zip_paths = [os.path.join(root, x) for x in ["1.zip", "2.zip", "3.zip"]] | |||
| def _test_load_learnware_by_zippath(self, runnable_option): | |||
| for learnware_id, zip_path in zip(self.learnware_ids, self.zip_paths): | |||
| self.client.download_learnware(learnware_id, zip_path) | |||
| learnware_list = self.client.load_learnware(learnware_path=self.zip_paths, runnable_option=runnable_option) | |||
| reuser = AveragingReuser(learnware_list, mode="vote_by_label") | |||
| input_array = np.random.random(size=(20, 13)) | |||
| print(reuser.predict(input_array)) | |||
| for learnware in learnware_list: | |||
| print(learnware.id, learnware.predict(input_array)) | |||
| def _test_load_learnware_by_id(self, runnable_option): | |||
| learnware_list = self.client.load_learnware(learnware_id=self.learnware_ids, runnable_option=runnable_option) | |||
| reuser = AveragingReuser(learnware_list, mode="vote_by_label") | |||
| input_array = np.random.random(size=(20, 13)) | |||
| print(reuser.predict(input_array)) | |||
| for learnware in learnware_list: | |||
| print(learnware.id, learnware.predict(input_array)) | |||
| def test_load_learnware_by_zippath(self): | |||
| for runnable_option in self.runnable_options: | |||
| self._test_load_learnware_by_zippath(runnable_option=runnable_option) | |||
| def test_load_learnware_by_id(self): | |||
| for runnable_option in self.runnable_options: | |||
| self._test_load_learnware_by_id(runnable_option=runnable_option) | |||
| def suite(): | |||
| _suite = unittest.TestSuite() | |||
| _suite.addTest(TestLearnwareLoad("test_load_learnware_by_zippath", mode="all")) | |||
| _suite.addTest(TestLearnwareLoad("test_load_learnware_by_id", mode="all")) | |||
| return _suite | |||
| if __name__ == "__main__": | |||
| runner = unittest.TextTestRunner() | |||
| runner.run(suite()) | |||
| @@ -1,34 +0,0 @@ | |||
| import zipfile | |||
| import numpy as np | |||
| from learnware.learnware import get_learnware_from_dirpath | |||
| from learnware.client.container import LearnwaresContainer | |||
| from learnware.reuse import AveragingReuser | |||
| from learnware.tests.module import get_semantic_specification | |||
| if __name__ == "__main__": | |||
| semantic_specification = get_semantic_specification() | |||
| zip_paths = [ | |||
| "/home/bixd/workspace/learnware/Learnware/tests/test_learnware_client/rf_tic.zip", | |||
| "/home/bixd/workspace/learnware/Learnware/tests/test_learnware_client/svc_tic.zip", | |||
| ] | |||
| dir_paths = [ | |||
| "/home/bixd/workspace/learnware/Learnware/tests/test_learnware_client/rf_tic", | |||
| "/home/bixd/workspace/learnware/Learnware/tests/test_learnware_client/svc_tic", | |||
| ] | |||
| learnware_list = [] | |||
| for id, (zip_path, dir_path) in enumerate(zip(zip_paths, dir_paths)): | |||
| with zipfile.ZipFile(zip_path, "r") as z_file: | |||
| z_file.extractall(dir_path) | |||
| learnware = get_learnware_from_dirpath(f"test_id{id}", semantic_specification, dir_path) | |||
| learnware_list.append(learnware) | |||
| with LearnwaresContainer(learnware_list) as env_container: | |||
| learnware_list = env_container.get_learnwares_with_container() | |||
| reuser = AveragingReuser(learnware_list, mode="vote") | |||
| input_array = np.random.randint(0, 3, size=(20, 9)) | |||
| print(reuser.predict(input_array).argmax(axis=1)) | |||
| for id, ind_learner in enumerate(learnware_list): | |||
| print(f"learner_{id}", reuser.predict(input_array).argmax(axis=1)) | |||
| @@ -4,13 +4,16 @@ import unittest | |||
| import tempfile | |||
| from learnware.client import LearnwareClient | |||
| from learnware.specification import generate_semantic_spec | |||
| class TestAllLearnware(unittest.TestCase): | |||
| def setUp(self): | |||
| unittest.TestCase.setUpClass() | |||
| dir_path = os.path.dirname(__file__) | |||
| config_path = os.path.join(dir_path, "config.json") | |||
| class TestUpload(unittest.TestCase): | |||
| client = LearnwareClient() | |||
| @classmethod | |||
| def setUpClass(cls) -> None: | |||
| config_path = os.path.join(os.path.dirname(__file__), "config.json") | |||
| if not os.path.exists(config_path): | |||
| data = {"email": None, "token": None} | |||
| with open(config_path, "w") as file: | |||
| @@ -18,52 +21,65 @@ class TestAllLearnware(unittest.TestCase): | |||
| with open(config_path, "r") as file: | |||
| data = json.load(file) | |||
| email = data["email"] | |||
| token = data["token"] | |||
| email = data.get("email") | |||
| token = data.get("token") | |||
| if email is None or token is None: | |||
| raise ValueError("Please set email and token in config.json.") | |||
| self.client = LearnwareClient() | |||
| self.client.login(email, token) | |||
| print("Please set email and token in config.json.") | |||
| else: | |||
| cls.client.login(email, token) | |||
| def _skip_test(self): | |||
| if not self.client.is_login(): | |||
| print("Client is not logged in!") | |||
| return True | |||
| return False | |||
| def test_upload(self): | |||
| input_description = { | |||
| "Dimension": 13, | |||
| "Description": {"0": "age", "1": "weight", "2": "body length", "3": "animal type", "4": "claw length"}, | |||
| } | |||
| output_description = { | |||
| "Dimension": 1, | |||
| "Description": { | |||
| "0": "the probability of being a cat", | |||
| }, | |||
| } | |||
| semantic_spec = self.client.create_semantic_specification( | |||
| name="learnware_example", | |||
| description="Just an example for uploading a learnware", | |||
| data_type="Table", | |||
| task_type="Classification", | |||
| library_type="Scikit-learn", | |||
| scenarios=["Business", "Financial"], | |||
| input_description=input_description, | |||
| output_description=output_description, | |||
| ) | |||
| assert isinstance(semantic_spec, dict) | |||
| download_learnware_id = "00000084" | |||
| with tempfile.TemporaryDirectory(prefix="learnware_") as tempdir: | |||
| zip_path = os.path.join(tempdir, "test.zip") | |||
| self.client.download_learnware(download_learnware_id, zip_path) | |||
| learnware_id = self.client.upload_learnware( | |||
| learnware_zip_path=zip_path, semantic_specification=semantic_spec | |||
| if not self._skip_test(): | |||
| input_description = { | |||
| "Dimension": 13, | |||
| "Description": {"0": "age", "1": "weight", "2": "body length", "3": "animal type", "4": "claw length"}, | |||
| } | |||
| output_description = { | |||
| "Dimension": 2, | |||
| "Description": {"0": "cat", "1": "not cat"}, | |||
| } | |||
| semantic_spec = generate_semantic_spec( | |||
| name="learnware_example", | |||
| description="Just an example for uploading a learnware", | |||
| data_type="Table", | |||
| task_type="Classification", | |||
| library_type="Scikit-learn", | |||
| scenarios=["Business", "Financial"], | |||
| license="MIT", | |||
| input_description=input_description, | |||
| output_description=output_description, | |||
| ) | |||
| assert isinstance(semantic_spec, dict) | |||
| download_learnware_id = "00000084" | |||
| with tempfile.TemporaryDirectory(prefix="learnware_") as tempdir: | |||
| zip_path = os.path.join(tempdir, "test.zip") | |||
| self.client.download_learnware(download_learnware_id, zip_path) | |||
| learnware_id = self.client.upload_learnware( | |||
| learnware_zip_path=zip_path, semantic_specification=semantic_spec | |||
| ) | |||
| uploaded_ids = [learnware["learnware_id"] for learnware in self.client.list_learnware()] | |||
| assert learnware_id in uploaded_ids | |||
| self.client.delete_learnware(learnware_id) | |||
| uploaded_ids = [learnware["learnware_id"] for learnware in self.client.list_learnware()] | |||
| assert learnware_id not in uploaded_ids | |||
| uploaded_ids = [learnware["learnware_id"] for learnware in self.client.list_learnware()] | |||
| assert learnware_id in uploaded_ids | |||
| self.client.delete_learnware(learnware_id) | |||
| uploaded_ids = [learnware["learnware_id"] for learnware in self.client.list_learnware()] | |||
| assert learnware_id not in uploaded_ids | |||
| def suite(): | |||
| _suite = unittest.TestSuite() | |||
| _suite.addTest(TestUpload("test_upload")) | |||
| return _suite | |||
| if __name__ == "__main__": | |||
| unittest.main() | |||
| runner = unittest.TextTestRunner() | |||
| runner.run(suite()) | |||
| @@ -0,0 +1,43 @@ | |||
| import os | |||
| import json | |||
| import string | |||
| import random | |||
| import torch | |||
| import unittest | |||
| import tempfile | |||
| import numpy as np | |||
| from learnware.specification import RKMETableSpecification, HeteroMapTableSpecification | |||
| from learnware.specification import generate_stat_spec | |||
| from learnware.market.heterogeneous.organizer import HeteroMap | |||
| class TestTableRKME(unittest.TestCase): | |||
| def setUp(self): | |||
| self.hetero_map = HeteroMap() | |||
| def _test_hetero_spec(self, X): | |||
| rkme: RKMETableSpecification = generate_stat_spec(type="table", X=X) | |||
| hetero_spec = self.hetero_map.hetero_mapping(rkme_spec=rkme, features=dict()) | |||
| with tempfile.TemporaryDirectory(prefix="learnware_") as tempdir: | |||
| rkme_path = os.path.join(tempdir, "rkme.json") | |||
| hetero_spec.save(rkme_path) | |||
| with open(rkme_path, "r") as f: | |||
| data = json.load(f) | |||
| assert data["type"] == "HeteroMapTableSpecification" | |||
| rkme2 = HeteroMapTableSpecification() | |||
| rkme2.load(rkme_path) | |||
| assert rkme2.type == "HeteroMapTableSpecification" | |||
| def test_hetero_rkme(self): | |||
| self._test_hetero_spec(np.random.uniform(-10000, 10000, size=(5000, 200))) | |||
| self._test_hetero_spec(np.random.uniform(-10000, 10000, size=(10000, 100))) | |||
| self._test_hetero_spec(np.random.uniform(-10000, 10000, size=(5, 20))) | |||
| self._test_hetero_spec(np.random.uniform(-10000, 10000, size=(1, 50))) | |||
| self._test_hetero_spec(np.random.uniform(-10000, 10000, size=(100, 150))) | |||
| if __name__ == "__main__": | |||
| unittest.main() | |||
| @@ -0,0 +1,38 @@ | |||
| import os | |||
| import json | |||
| import torch | |||
| import unittest | |||
| import tempfile | |||
| import numpy as np | |||
| from learnware.specification import RKMEImageSpecification | |||
| from learnware.specification import generate_stat_spec | |||
| class TestImageRKME(unittest.TestCase): | |||
| @staticmethod | |||
| def _test_image_rkme(X): | |||
| image_rkme = generate_stat_spec(type="image", X=X, steps=10) | |||
| with tempfile.TemporaryDirectory(prefix="learnware_") as tempdir: | |||
| rkme_path = os.path.join(tempdir, "rkme.json") | |||
| image_rkme.save(rkme_path) | |||
| with open(rkme_path, "r") as f: | |||
| data = json.load(f) | |||
| assert data["type"] == "RKMEImageSpecification" | |||
| rkme2 = RKMEImageSpecification() | |||
| rkme2.load(rkme_path) | |||
| assert rkme2.type == "RKMEImageSpecification" | |||
| def test_image_rkme(self): | |||
| self._test_image_rkme(np.random.randint(0, 255, size=(2000, 3, 32, 32))) | |||
| self._test_image_rkme(np.random.randint(0, 255, size=(100, 1, 128, 128))) | |||
| self._test_image_rkme(np.random.randint(0, 255, size=(50, 3, 128, 128)) / 255) | |||
| self._test_image_rkme(torch.randint(0, 255, (2000, 3, 32, 32))) | |||
| self._test_image_rkme(torch.randint(0, 255, (20, 3, 128, 128))) | |||
| self._test_image_rkme(torch.randint(0, 255, (1, 1, 128, 128)) / 255) | |||
| if __name__ == "__main__": | |||
| unittest.main() | |||
| @@ -1,104 +0,0 @@ | |||
| import os | |||
| import json | |||
| import string | |||
| import random | |||
| import torch | |||
| import unittest | |||
| import tempfile | |||
| import numpy as np | |||
| from learnware.specification import RKMETableSpecification, RKMEImageSpecification, RKMETextSpecification | |||
| from learnware.specification import generate_stat_spec | |||
| class TestRKME(unittest.TestCase): | |||
| def test_rkme(self): | |||
| def _test_table_rkme(X): | |||
| rkme = generate_stat_spec(type="table", X=X) | |||
| with tempfile.TemporaryDirectory(prefix="learnware_") as tempdir: | |||
| rkme_path = os.path.join(tempdir, "rkme.json") | |||
| rkme.save(rkme_path) | |||
| with open(rkme_path, "r") as f: | |||
| data = json.load(f) | |||
| assert data["type"] == "RKMETableSpecification" | |||
| rkme2 = RKMETableSpecification() | |||
| rkme2.load(rkme_path) | |||
| assert rkme2.type == "RKMETableSpecification" | |||
| _test_table_rkme(np.random.uniform(-10000, 10000, size=(5000, 200))) | |||
| _test_table_rkme(np.random.uniform(-10000, 10000, size=(10000, 100))) | |||
| _test_table_rkme(np.random.uniform(-10000, 10000, size=(5, 20))) | |||
| _test_table_rkme(np.random.uniform(-10000, 10000, size=(1, 50))) | |||
| _test_table_rkme(np.random.uniform(-10000, 10000, size=(100, 150))) | |||
| def test_image_rkme(self): | |||
| def _test_image_rkme(X): | |||
| image_rkme = generate_stat_spec(type="image", X=X, steps=10) | |||
| with tempfile.TemporaryDirectory(prefix="learnware_") as tempdir: | |||
| rkme_path = os.path.join(tempdir, "rkme.json") | |||
| image_rkme.save(rkme_path) | |||
| with open(rkme_path, "r") as f: | |||
| data = json.load(f) | |||
| assert data["type"] == "RKMEImageSpecification" | |||
| rkme2 = RKMEImageSpecification() | |||
| rkme2.load(rkme_path) | |||
| assert rkme2.type == "RKMEImageSpecification" | |||
| _test_image_rkme(np.random.randint(0, 255, size=(2000, 3, 32, 32))) | |||
| _test_image_rkme(np.random.randint(0, 255, size=(100, 1, 128, 128))) | |||
| _test_image_rkme(np.random.randint(0, 255, size=(50, 3, 128, 128)) / 255) | |||
| _test_image_rkme(torch.randint(0, 255, (2000, 3, 32, 32))) | |||
| _test_image_rkme(torch.randint(0, 255, (20, 3, 128, 128))) | |||
| _test_image_rkme(torch.randint(0, 255, (1, 1, 128, 128)) / 255) | |||
| def test_text_rkme(self): | |||
| def generate_random_text_list(num, text_type="en", min_len=10, max_len=1000): | |||
| text_list = [] | |||
| for i in range(num): | |||
| length = random.randint(min_len, max_len) | |||
| if text_type == "en": | |||
| characters = string.ascii_letters + string.digits + string.punctuation | |||
| result_str = "".join(random.choice(characters) for i in range(length)) | |||
| text_list.append(result_str) | |||
| elif text_type == "zh": | |||
| result_str = "".join(chr(random.randint(0x4E00, 0x9FFF)) for i in range(length)) | |||
| text_list.append(result_str) | |||
| else: | |||
| raise ValueError("Type should be en or zh") | |||
| return text_list | |||
| def _test_text_rkme(X): | |||
| rkme = generate_stat_spec(type="text", X=X) | |||
| with tempfile.TemporaryDirectory(prefix="learnware_") as tempdir: | |||
| rkme_path = os.path.join(tempdir, "rkme.json") | |||
| rkme.save(rkme_path) | |||
| with open(rkme_path, "r") as f: | |||
| data = json.load(f) | |||
| assert data["type"] == "RKMETextSpecification" | |||
| rkme2 = RKMETextSpecification() | |||
| rkme2.load(rkme_path) | |||
| assert rkme2.type == "RKMETextSpecification" | |||
| return rkme2.get_z().shape[1] | |||
| dim1 = _test_text_rkme(generate_random_text_list(3000, "en")) | |||
| dim2 = _test_text_rkme(generate_random_text_list(100, "en")) | |||
| dim3 = _test_text_rkme(generate_random_text_list(50, "zh")) | |||
| dim4 = _test_text_rkme(generate_random_text_list(5000, "zh")) | |||
| dim5 = _test_text_rkme(generate_random_text_list(1, "zh")) | |||
| assert dim1 == dim2 and dim2 == dim3 and dim3 == dim4 and dim4 == dim5 | |||
| if __name__ == "__main__": | |||
| unittest.main() | |||
| @@ -0,0 +1,36 @@ | |||
| import os | |||
| import json | |||
| import unittest | |||
| import tempfile | |||
| import numpy as np | |||
| from learnware.specification import RKMETableSpecification, RKMEImageSpecification, RKMETextSpecification | |||
| from learnware.specification import generate_stat_spec | |||
| class TestTableRKME(unittest.TestCase): | |||
| @staticmethod | |||
| def _test_table_rkme(X): | |||
| rkme = generate_stat_spec(type="table", X=X) | |||
| with tempfile.TemporaryDirectory(prefix="learnware_") as tempdir: | |||
| rkme_path = os.path.join(tempdir, "rkme.json") | |||
| rkme.save(rkme_path) | |||
| with open(rkme_path, "r") as f: | |||
| data = json.load(f) | |||
| assert data["type"] == "RKMETableSpecification" | |||
| rkme2 = RKMETableSpecification() | |||
| rkme2.load(rkme_path) | |||
| assert rkme2.type == "RKMETableSpecification" | |||
| def test_table_rkme(self): | |||
| self._test_table_rkme(np.random.uniform(-10000, 10000, size=(5000, 200))) | |||
| self._test_table_rkme(np.random.uniform(-10000, 10000, size=(10000, 100))) | |||
| self._test_table_rkme(np.random.uniform(-10000, 10000, size=(5, 20))) | |||
| self._test_table_rkme(np.random.uniform(-10000, 10000, size=(1, 50))) | |||
| self._test_table_rkme(np.random.uniform(-10000, 10000, size=(100, 150))) | |||
| if __name__ == "__main__": | |||
| unittest.main() | |||
| @@ -0,0 +1,58 @@ | |||
| import os | |||
| import json | |||
| import string | |||
| import random | |||
| import unittest | |||
| import tempfile | |||
| from learnware.specification import RKMETextSpecification | |||
| from learnware.specification import generate_stat_spec | |||
| class TestTextRKME(unittest.TestCase): | |||
| @staticmethod | |||
| def generate_random_text_list(num, text_type="en", min_len=10, max_len=1000): | |||
| text_list = [] | |||
| for i in range(num): | |||
| length = random.randint(min_len, max_len) | |||
| if text_type == "en": | |||
| characters = string.ascii_letters + string.digits + string.punctuation | |||
| result_str = "".join(random.choice(characters) for i in range(length)) | |||
| text_list.append(result_str) | |||
| elif text_type == "zh": | |||
| result_str = "".join(chr(random.randint(0x4E00, 0x9FFF)) for i in range(length)) | |||
| text_list.append(result_str) | |||
| else: | |||
| raise ValueError("Type should be en or zh") | |||
| return text_list | |||
| @staticmethod | |||
| def _test_text_rkme(X): | |||
| rkme = generate_stat_spec(type="text", X=X) | |||
| with tempfile.TemporaryDirectory(prefix="learnware_") as tempdir: | |||
| rkme_path = os.path.join(tempdir, "rkme.json") | |||
| rkme.save(rkme_path) | |||
| with open(rkme_path, "r") as f: | |||
| data = json.load(f) | |||
| assert data["type"] == "RKMETextSpecification" | |||
| rkme2 = RKMETextSpecification() | |||
| rkme2.load(rkme_path) | |||
| assert rkme2.type == "RKMETextSpecification" | |||
| return rkme2.get_z().shape[1] | |||
| def test_text_rkme(self): | |||
| dim1 = self._test_text_rkme(self.generate_random_text_list(3000, "en")) | |||
| dim2 = self._test_text_rkme(self.generate_random_text_list(100, "en")) | |||
| dim3 = self._test_text_rkme(self.generate_random_text_list(50, "zh")) | |||
| dim4 = self._test_text_rkme(self.generate_random_text_list(5000, "zh")) | |||
| dim5 = self._test_text_rkme(self.generate_random_text_list(1, "zh")) | |||
| assert dim1 == dim2 and dim2 == dim3 and dim3 == dim4 and dim4 == dim5 | |||
| if __name__ == "__main__": | |||
| unittest.main() | |||
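The RKME tests above share one pattern: save a specification to a temporary directory, check the `type` field in the JSON file, then reload it into a fresh object. Below is a minimal sketch of that pattern using only the standard library. `ToySpec` and `round_trip` are hypothetical names invented for illustration; they are not part of the `learnware` package, and the real specification classes store much more than this.

```python
import json
import os
import tempfile


class ToySpec:
    """Hypothetical stand-in for a specification class (not the learnware API)."""

    def __init__(self, z=None):
        self.type = self.__class__.__name__
        self.z = z if z is not None else []

    def save(self, path):
        # Serialize the payload together with the class name under "type".
        with open(path, "w") as f:
            json.dump({"type": self.type, "z": self.z}, f)

    def load(self, path):
        # Restore both the stored type and the payload from disk.
        with open(path, "r") as f:
            data = json.load(f)
        self.type = data["type"]
        self.z = data["z"]


def round_trip(spec):
    """Save `spec` to a temporary directory, reload it, and return the copy."""
    with tempfile.TemporaryDirectory(prefix="toy_spec_") as tempdir:
        path = os.path.join(tempdir, "spec.json")
        spec.save(path)
        # Inspect the raw JSON first, as the tests do with data["type"].
        with open(path, "r") as f:
            assert json.load(f)["type"] == "ToySpec"
        restored = ToySpec()
        restored.load(path)
    return restored
```

Reading the raw JSON before reloading separates two failure modes: a wrong on-disk format and a faulty `load` implementation.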
| @@ -1,10 +0,0 @@ | |||
| ## How to Generate Environment Yaml | |||
| * Create the env config for conda: | |||
| ```shell | |||
| conda env export | grep -v "^prefix: " > environment.yml | |||
| ``` | |||
| * Recover the env from the config: | |||
| ```shell | |||
| conda env create -f environment.yml | |||
| ``` | |||
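The `grep -v "^prefix: "` step above drops the machine-specific `prefix:` line from the exported environment. Where `grep` is not available, the same filtering could be done in Python. The sketch below assumes the export has already been read into a string; `strip_prefix_lines` is a hypothetical helper, not something shipped with conda or this repository.

```python
def strip_prefix_lines(exported_yaml: str) -> str:
    """Drop lines starting with 'prefix: ', mirroring `grep -v "^prefix: "`."""
    kept = [
        line
        for line in exported_yaml.splitlines()
        if not line.startswith("prefix: ")
    ]
    # Re-join with a trailing newline so the result is a well-formed text file.
    return "\n".join(kept) + "\n"
```

Only lines that begin with `prefix: ` are removed, so an indented dependency entry that happens to contain the word is kept.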
| @@ -1,27 +0,0 @@ | |||
| name: learnware_example_env | |||
| channels: | |||
| - defaults | |||
| dependencies: | |||
| - _libgcc_mutex=0.1=main | |||
| - _openmp_mutex=5.1=1_gnu | |||
| - ca-certificates=2023.01.10=h06a4308_0 | |||
| - ld_impl_linux-64=2.38=h1181459_1 | |||
| - libffi=3.4.2=h6a678d5_6 | |||
| - libgcc-ng=11.2.0=h1234567_1 | |||
| - libgomp=11.2.0=h1234567_1 | |||
| - libstdcxx-ng=11.2.0=h1234567_1 | |||
| - ncurses=6.4=h6a678d5_0 | |||
| - openssl=1.1.1t=h7f8727e_0 | |||
| - pip=23.0.1=py38h06a4308_0 | |||
| - python=3.8.16=h7a1cb2a_3 | |||
| - readline=8.2=h5eee18b_0 | |||
| - setuptools=66.0.0=py38h06a4308_0 | |||
| - sqlite=3.41.2=h5eee18b_0 | |||
| - tk=8.6.12=h1ccaba5_0 | |||
| - wheel=0.38.4=py38h06a4308_0 | |||
| - xz=5.2.10=h5eee18b_1 | |||
| - zlib=1.2.13=h5eee18b_0 | |||
| - pip: | |||
| - joblib==1.2.0 | |||
| - learnware==0.0.1.99 | |||
| - numpy==1.19.5 | |||
| @@ -1,8 +0,0 @@ | |||
| model: | |||
| class_name: SVM | |||
| kwargs: {} | |||
| stat_specifications: | |||
| - module_path: learnware.specification | |||
| class_name: RKMETableSpecification | |||
| file_name: svm.json | |||
| kwargs: {} | |||
| @@ -1,20 +0,0 @@ | |||
| import os | |||
| import joblib | |||
| import numpy as np | |||
| from learnware.model import BaseModel | |||
| class SVM(BaseModel): | |||
| def __init__(self): | |||
| super(SVM, self).__init__(input_shape=(64,), output_shape=(10,)) | |||
| dir_path = os.path.dirname(os.path.abspath(__file__)) | |||
| self.model = joblib.load(os.path.join(dir_path, "svm.pkl")) | |||
| def fit(self, X: np.ndarray, y: np.ndarray): | |||
| pass | |||
| def predict(self, X: np.ndarray) -> np.ndarray: | |||
| return self.model.predict_proba(X) | |||
| def finetune(self, X: np.ndarray, y: np.ndarray): | |||
| pass | |||
| @@ -0,0 +1,321 @@ | |||
| import torch | |||
| import pickle | |||
| import unittest | |||
| import os | |||
| import logging | |||
| import tempfile | |||
| import zipfile | |||
| from sklearn.linear_model import Ridge | |||
| from sklearn.datasets import make_regression | |||
| from shutil import copyfile, rmtree | |||
| from sklearn.metrics import mean_squared_error | |||
| import learnware | |||
| learnware.init(logging_level=logging.WARNING) | |||
| from learnware.market import instantiate_learnware_market, BaseUserInfo | |||
| from learnware.specification import RKMETableSpecification, generate_rkme_table_spec, generate_semantic_spec | |||
| from learnware.reuse import HeteroMapAlignLearnware, AveragingReuser, EnsemblePruningReuser | |||
| from learnware.tests.templates import LearnwareTemplate, PickleModelTemplate, StatSpecTemplate | |||
| from hetero_config import input_shape_list, input_description_list, output_description_list, user_description_list | |||
| curr_root = os.path.dirname(os.path.abspath(__file__)) | |||
| class TestHeteroWorkflow(unittest.TestCase): | |||
| universal_semantic_config = { | |||
| "data_type": "Table", | |||
| "task_type": "Regression", | |||
| "library_type": "Scikit-learn", | |||
| "scenarios": "Education", | |||
| "license": "MIT", | |||
| } | |||
| def _init_learnware_market(self, organizer_kwargs=None): | |||
| """initialize learnware market""" | |||
| hetero_market = instantiate_learnware_market( | |||
| market_id="hetero_toy", name="hetero", rebuild=True, organizer_kwargs=organizer_kwargs | |||
| ) | |||
| return hetero_market | |||
| def test_prepare_learnware_randomly(self, learnware_num=5): | |||
| self.zip_path_list = [] | |||
| for i in range(learnware_num): | |||
| learnware_pool_dirpath = os.path.join(curr_root, "learnware_pool_hetero") | |||
| os.makedirs(learnware_pool_dirpath, exist_ok=True) | |||
| learnware_zippath = os.path.join(learnware_pool_dirpath, "ridge_%d.zip" % (i)) | |||
| print("Preparing Learnware: %d" % (i)) | |||
| X, y = make_regression(n_samples=5000, n_informative=15, n_features=input_shape_list[i % 2], noise=0.1, random_state=42) | |||
| clf = Ridge(alpha=1.0) | |||
| clf.fit(X, y) | |||
| pickle_filepath = os.path.join(learnware_pool_dirpath, "ridge.pkl") | |||
| with open(pickle_filepath, "wb") as fout: | |||
| pickle.dump(clf, fout) | |||
| spec = generate_rkme_table_spec(X=X, gamma=0.1) | |||
| spec_filepath = os.path.join(learnware_pool_dirpath, "stat_spec.json") | |||
| spec.save(spec_filepath) | |||
| LearnwareTemplate.generate_learnware_zipfile( | |||
| learnware_zippath=learnware_zippath, | |||
| model_template=PickleModelTemplate(pickle_filepath=pickle_filepath, model_kwargs={"input_shape":(input_shape_list[i % 2],), "output_shape": (1,)}), | |||
| stat_spec_template=StatSpecTemplate(filepath=spec_filepath, type="RKMETableSpecification"), | |||
| requirements=["scikit-learn==0.22"], | |||
| ) | |||
| self.zip_path_list.append(learnware_zippath) | |||
| def _upload_delete_learnware(self, hetero_market, learnware_num, delete): | |||
| self.test_prepare_learnware_randomly(learnware_num) | |||
| self.learnware_num = learnware_num | |||
| print("Total Item:", len(hetero_market)) | |||
| assert len(hetero_market) == 0, f"The market should be empty!" | |||
| for idx, zip_path in enumerate(self.zip_path_list): | |||
| semantic_spec = generate_semantic_spec( | |||
| name=f"learnware_{idx}", | |||
| description=f"test_learnware_number_{idx}", | |||
| input_description=input_description_list[idx % 2], | |||
| output_description=output_description_list[idx % 2], | |||
| **self.universal_semantic_config | |||
| ) | |||
| hetero_market.add_learnware(zip_path, semantic_spec) | |||
| print("Total Item:", len(hetero_market)) | |||
| assert len(hetero_market) == self.learnware_num, f"The number of learnwares must be {self.learnware_num}!" | |||
| curr_inds = hetero_market.get_learnware_ids() | |||
| print("Available ids After Uploading Learnwares:", curr_inds) | |||
| assert len(curr_inds) == self.learnware_num, f"The number of learnwares must be {self.learnware_num}!" | |||
| if delete: | |||
| for learnware_id in curr_inds: | |||
| hetero_market.delete_learnware(learnware_id) | |||
| self.learnware_num -= 1 | |||
| assert ( | |||
| len(hetero_market) == self.learnware_num | |||
| ), f"The number of learnwares must be {self.learnware_num}!" | |||
| curr_inds = hetero_market.get_learnware_ids() | |||
| print("Available ids After Deleting Learnwares:", curr_inds) | |||
| assert len(curr_inds) == 0, f"The market should be empty!" | |||
| return hetero_market | |||
| def test_upload_delete_learnware(self, learnware_num=5, delete=True): | |||
| hetero_market = self._init_learnware_market() | |||
| return self._upload_delete_learnware(hetero_market, learnware_num, delete) | |||
| def test_train_market_model(self, learnware_num=5, delete=False): | |||
| hetero_market = self._init_learnware_market( | |||
| organizer_kwargs={"auto_update": True, "auto_update_limit": learnware_num} | |||
| ) | |||
| hetero_market = self._upload_delete_learnware(hetero_market, learnware_num, delete) | |||
| # organizer=hetero_market.learnware_organizer | |||
| # organizer.train(hetero_market.learnware_organizer.learnware_list.values()) | |||
| return hetero_market | |||
| def test_search_semantics(self, learnware_num=5): | |||
| hetero_market = self.test_upload_delete_learnware(learnware_num, delete=False) | |||
| print("Total Item:", len(hetero_market)) | |||
| assert len(hetero_market) == self.learnware_num, f"The number of learnwares must be {self.learnware_num}!" | |||
| semantic_spec = generate_semantic_spec( | |||
| name=f"learnware_{learnware_num - 1}", | |||
| **self.universal_semantic_config, | |||
| ) | |||
| user_info = BaseUserInfo(semantic_spec=semantic_spec) | |||
| search_result = hetero_market.search_learnware(user_info) | |||
| single_result = search_result.get_single_results() | |||
| print(f"Search result1:") | |||
| assert len(single_result) == 1, f"Exact semantic search failed!" | |||
| for search_item in single_result: | |||
| semantic_spec1 = search_item.learnware.get_specification().get_semantic_spec() | |||
| print("Choose learnware:", search_item.learnware.id) | |||
| assert semantic_spec1["Name"]["Values"] == semantic_spec["Name"]["Values"], f"Exact semantic search failed!" | |||
| semantic_spec["Name"]["Values"] = "laernwaer" | |||
| user_info = BaseUserInfo(semantic_spec=semantic_spec) | |||
| search_result = hetero_market.search_learnware(user_info) | |||
| single_result = search_result.get_single_results() | |||
| print(f"Search result2:") | |||
| assert len(single_result) == self.learnware_num, f"Fuzzy semantic search failed!" | |||
| for search_item in single_result: | |||
| print("Choose learnware:", search_item.learnware.id) | |||
| def test_hetero_stat_search(self, learnware_num=5): | |||
| hetero_market = self.test_train_market_model(learnware_num, delete=False) | |||
| print("Total Item:", len(hetero_market)) | |||
| user_dim = 15 | |||
| with tempfile.TemporaryDirectory(prefix="learnware_test_hetero") as test_folder: | |||
| for idx, zip_path in enumerate(self.zip_path_list): | |||
| with zipfile.ZipFile(zip_path, "r") as zip_obj: | |||
| zip_obj.extractall(path=test_folder) | |||
| user_spec = RKMETableSpecification() | |||
| user_spec.load(os.path.join(test_folder, "stat_spec.json")) | |||
| z = user_spec.get_z() | |||
| z = z[:, :user_dim] | |||
| device = user_spec.device | |||
| z = torch.tensor(z, device=device) | |||
| user_spec.z = z | |||
| print(">> normal case test:") | |||
| semantic_spec = generate_semantic_spec( | |||
| input_description={ | |||
| "Dimension": user_dim, | |||
| "Description": {str(key): input_description_list[idx % 2]["Description"][str(key)] for key in range(user_dim)}, | |||
| }, | |||
| **self.universal_semantic_config, | |||
| ) | |||
| user_info = BaseUserInfo(semantic_spec=semantic_spec, stat_info={"RKMETableSpecification": user_spec}) | |||
| search_result = hetero_market.search_learnware(user_info) | |||
| single_result = search_result.get_single_results() | |||
| multiple_result = search_result.get_multiple_results() | |||
| print(f"search result of user{idx}:") | |||
| for single_item in single_result: | |||
| print(f"score: {single_item.score}, learnware_id: {single_item.learnware.id}") | |||
| for multiple_item in multiple_result: | |||
| print( | |||
| f"mixture_score: {multiple_item.score}, mixture_learnware_ids: {[item.id for item in multiple_item.learnwares]}" | |||
| ) | |||
| # improper "Task" entry in semantic_spec, use homo search and print invalid semantic_spec | |||
| print(">> test for key 'Task' with a mismatched value:") | |||
| semantic_spec["Task"] = {"Values": ["Segmentation"], "Type": "Class"} | |||
| user_info = BaseUserInfo(semantic_spec=semantic_spec, stat_info={"RKMETableSpecification": user_spec}) | |||
| search_result = hetero_market.search_learnware(user_info) | |||
| single_result = search_result.get_single_results() | |||
| assert len(single_result) == 0, f"Statistical search failed!" | |||
| # delete key "Task" in semantic_spec, use homo search and print WARNING INFO with "User doesn't provide correct task type" | |||
| print(">> delete key 'Task' test:") | |||
| semantic_spec.pop("Task") | |||
| user_info = BaseUserInfo(semantic_spec=semantic_spec, stat_info={"RKMETableSpecification": user_spec}) | |||
| search_result = hetero_market.search_learnware(user_info) | |||
| single_result = search_result.get_single_results() | |||
| assert len(single_result) == 0, f"Statistical search failed!" | |||
| # modify semantic info with a mismatched dimension, use homo search and print "User data feature dimensions mismatch with semantic specification." | |||
| print(">> mismatch dim test") | |||
| semantic_spec = generate_semantic_spec( | |||
| input_description={ | |||
| "Dimension": user_dim - 2, | |||
| "Description": {str(key): input_description_list[idx % 2]["Description"][str(key)] for key in range(user_dim)}, | |||
| }, | |||
| **self.universal_semantic_config, | |||
| ) | |||
| user_info = BaseUserInfo(semantic_spec=semantic_spec, stat_info={"RKMETableSpecification": user_spec}) | |||
| search_result = hetero_market.search_learnware(user_info) | |||
| single_result = search_result.get_single_results() | |||
| assert len(single_result) == 0, f"Statistical search failed!" | |||
| def test_homo_stat_search(self, learnware_num=5): | |||
| hetero_market = self.test_train_market_model(learnware_num, delete=False) | |||
| print("Total Item:", len(hetero_market)) | |||
| with tempfile.TemporaryDirectory(prefix="learnware_test_hetero") as test_folder: | |||
| for idx, zip_path in enumerate(self.zip_path_list): | |||
| with zipfile.ZipFile(zip_path, "r") as zip_obj: | |||
| zip_obj.extractall(path=test_folder) | |||
| user_spec = RKMETableSpecification() | |||
| user_spec.load(os.path.join(test_folder, "stat_spec.json")) | |||
| user_semantic = generate_semantic_spec(**self.universal_semantic_config) | |||
| user_info = BaseUserInfo(semantic_spec=user_semantic, stat_info={"RKMETableSpecification": user_spec}) | |||
| search_result = hetero_market.search_learnware(user_info) | |||
| single_result = search_result.get_single_results() | |||
| multiple_result = search_result.get_multiple_results() | |||
| assert len(single_result) >= 1, f"Statistical search failed!" | |||
| print(f"search result of user{idx}:") | |||
| for single_item in single_result: | |||
| print(f"score: {single_item.score}, learnware_id: {single_item.learnware.id}") | |||
| for multiple_item in multiple_result: | |||
| print(f"mixture_score: {multiple_item.score}\n") | |||
| mixture_id = " ".join([learnware.id for learnware in multiple_item.learnwares]) | |||
| print(f"mixture_learnware: {mixture_id}\n") | |||
| def test_model_reuse(self, learnware_num=5): | |||
| # generate toy regression problem | |||
| X, y = make_regression(n_samples=5000, n_informative=10, n_features=15, noise=0.1, random_state=0) | |||
| # generate rkme | |||
| user_spec = generate_rkme_table_spec(X=X, gamma=0.1, cuda_idx=0) | |||
| # generate specification | |||
| semantic_spec = generate_semantic_spec(input_description=user_description_list[0], **self.universal_semantic_config) | |||
| user_info = BaseUserInfo(semantic_spec=semantic_spec, stat_info={"RKMETableSpecification": user_spec}) | |||
| # learnware market search | |||
| hetero_market = self.test_train_market_model(learnware_num, delete=False) | |||
| search_result = hetero_market.search_learnware(user_info) | |||
| single_result = search_result.get_single_results() | |||
| multiple_result = search_result.get_multiple_results() | |||
| # print search results | |||
| for single_item in single_result: | |||
| print(f"score: {single_item.score}, learnware_id: {single_item.learnware.id}") | |||
| for multiple_item in multiple_result: | |||
| print( | |||
| f"mixture_score: {multiple_item.score}, mixture_learnware_ids: {[item.id for item in multiple_item.learnwares]}" | |||
| ) | |||
| # single model reuse | |||
| hetero_learnware = HeteroMapAlignLearnware(single_result[0].learnware, mode="regression") | |||
| hetero_learnware.align(user_spec, X[:100], y[:100]) | |||
| single_predict_y = hetero_learnware.predict(X) | |||
| # multi model reuse | |||
| hetero_learnware_list = [] | |||
| for learnware in multiple_result[0].learnwares: | |||
| hetero_learnware = HeteroMapAlignLearnware(learnware, mode="regression") | |||
| hetero_learnware.align(user_spec, X[:100], y[:100]) | |||
| hetero_learnware_list.append(hetero_learnware) | |||
| # Use averaging ensemble reuser to reuse the searched learnwares to make prediction | |||
| reuse_ensemble = AveragingReuser(learnware_list=hetero_learnware_list, mode="mean") | |||
| ensemble_predict_y = reuse_ensemble.predict(user_data=X) | |||
| # Use ensemble pruning reuser to reuse the searched learnwares to make prediction | |||
| reuse_ensemble = EnsemblePruningReuser(learnware_list=hetero_learnware_list, mode="regression") | |||
| reuse_ensemble.fit(X[:100], y[:100]) | |||
| ensemble_pruning_predict_y = reuse_ensemble.predict(user_data=X) | |||
| print("Single model RMSE by finetune:", mean_squared_error(y, single_predict_y, squared=False)) | |||
| print("Averaging Reuser RMSE:", mean_squared_error(y, ensemble_predict_y, squared=False)) | |||
| print("Ensemble Pruning Reuser RMSE:", mean_squared_error(y, ensemble_pruning_predict_y, squared=False)) | |||
| def suite(): | |||
| _suite = unittest.TestSuite() | |||
| #_suite.addTest(TestHeteroWorkflow("test_prepare_learnware_randomly")) | |||
| #_suite.addTest(TestHeteroWorkflow("test_upload_delete_learnware")) | |||
| #_suite.addTest(TestHeteroWorkflow("test_train_market_model")) | |||
| _suite.addTest(TestHeteroWorkflow("test_search_semantics")) | |||
| _suite.addTest(TestHeteroWorkflow("test_hetero_stat_search")) | |||
| _suite.addTest(TestHeteroWorkflow("test_homo_stat_search")) | |||
| _suite.addTest(TestHeteroWorkflow("test_model_reuse")) | |||
| return _suite | |||
| if __name__ == "__main__": | |||
| runner = unittest.TextTestRunner(verbosity=2) | |||
| runner.run(suite()) | |||
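`test_model_reuse` reports its three scores with `mean_squared_error(..., squared=False)`. In scikit-learn versions that accept the `squared` argument, this returns the root mean squared error. As a reference for what those printed numbers mean, here is a dependency-free sketch of the same quantity. `rmse` is a hypothetical helper written for illustration; it handles flat sequences only and is not scikit-learn's implementation.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length, non-empty sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("inputs must not be empty")
    # Mean of squared residuals, then the square root.
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```

Because RMSE is expressed in the units of the target, the scores for the single learnware, the averaging reuser and the ensemble pruning reuser can be compared directly.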
| @@ -1,37 +1,34 @@ | |||
| import sys | |||
| import unittest | |||
| import os | |||
| import copy | |||
| import joblib | |||
| import logging | |||
| import tempfile | |||
| import pickle | |||
| import zipfile | |||
| import numpy as np | |||
| from sklearn import svm | |||
| from sklearn.datasets import load_digits | |||
| from sklearn.model_selection import train_test_split | |||
| from shutil import copyfile, rmtree | |||
| import learnware | |||
| learnware.init(logging_level=logging.WARNING) | |||
| from learnware.market import instantiate_learnware_market, BaseUserInfo | |||
| from learnware.specification import RKMETableSpecification, generate_rkme_table_spec | |||
| from learnware.specification import RKMETableSpecification, generate_rkme_table_spec, generate_semantic_spec | |||
| from learnware.reuse import JobSelectorReuser, AveragingReuser, EnsemblePruningReuser, FeatureAugmentReuser | |||
| from learnware.tests.templates import LearnwareTemplate, PickleModelTemplate, StatSpecTemplate | |||
| curr_root = os.path.dirname(os.path.abspath(__file__)) | |||
| user_semantic = { | |||
| "Data": {"Values": ["Table"], "Type": "Class"}, | |||
| "Task": { | |||
| "Values": ["Classification"], | |||
| "Type": "Class", | |||
| }, | |||
| "Library": {"Values": ["Scikit-learn"], "Type": "Class"}, | |||
| "Scenario": {"Values": ["Education"], "Type": "Tag"}, | |||
| "Description": {"Values": "", "Type": "String"}, | |||
| "Name": {"Values": "", "Type": "String"}, | |||
| "License": {"Values": ["MIT"], "Type": "Class"}, | |||
| } | |||
| class TestWorkflow(unittest.TestCase): | |||
| universal_semantic_config = { | |||
| "data_type": "Table", | |||
| "task_type": "Classification", | |||
| "library_type": "Scikit-learn", | |||
| "scenarios": "Education", | |||
| "license": "MIT", | |||
| } | |||
| def _init_learnware_market(self): | |||
| """initialize learnware market""" | |||
| easy_market = instantiate_learnware_market(market_id="sklearn_digits_easy", name="easy", rebuild=True) | |||
| @@ -42,45 +39,30 @@ class TestWorkflow(unittest.TestCase): | |||
| X, y = load_digits(return_X_y=True) | |||
| for i in range(learnware_num): | |||
| dir_path = os.path.join(curr_root, "learnware_pool", "svm_%d" % (i)) | |||
| os.makedirs(dir_path, exist_ok=True) | |||
| learnware_pool_dirpath = os.path.join(curr_root, "learnware_pool") | |||
| os.makedirs(learnware_pool_dirpath, exist_ok=True) | |||
| learnware_zippath = os.path.join(learnware_pool_dirpath, "svm_%d.zip" % (i)) | |||
| print("Preparing Learnware: %d" % (i)) | |||
| data_X, _, data_y, _ = train_test_split(X, y, test_size=0.3, shuffle=True) | |||
| clf = svm.SVC(kernel="linear", probability=True) | |||
| clf.fit(data_X, data_y) | |||
| joblib.dump(clf, os.path.join(dir_path, "svm.pkl")) | |||
| pickle_filepath = os.path.join(learnware_pool_dirpath, "model.pkl") | |||
| with open(pickle_filepath, "wb") as fout: | |||
| pickle.dump(clf, fout) | |||
| spec = generate_rkme_table_spec(X=data_X, gamma=0.1, cuda_idx=0) | |||
| spec.save(os.path.join(dir_path, "svm.json")) | |||
| init_file = os.path.join(dir_path, "__init__.py") | |||
| copyfile( | |||
| os.path.join(curr_root, "learnware_example/example_init.py"), init_file | |||
| ) # cp example_init.py init_file | |||
| yaml_file = os.path.join(dir_path, "learnware.yaml") | |||
| copyfile(os.path.join(curr_root, "learnware_example/example.yaml"), yaml_file) # cp example.yaml yaml_file | |||
| env_file = os.path.join(dir_path, "environment.yaml") | |||
| copyfile(os.path.join(curr_root, "learnware_example/environment.yaml"), env_file) | |||
| zip_file = dir_path + ".zip" | |||
| # zip -q -r -j zip_file dir_path | |||
| with zipfile.ZipFile(zip_file, "w") as zip_obj: | |||
| for foldername, subfolders, filenames in os.walk(dir_path): | |||
| for filename in filenames: | |||
| file_path = os.path.join(foldername, filename) | |||
| zip_info = zipfile.ZipInfo(filename) | |||
| zip_info.compress_type = zipfile.ZIP_STORED | |||
| with open(file_path, "rb") as file: | |||
| zip_obj.writestr(zip_info, file.read()) | |||
| rmtree(dir_path) # rm -r dir_path | |||
| self.zip_path_list.append(zip_file) | |||
| spec_filepath = os.path.join(learnware_pool_dirpath, "stat_spec.json") | |||
| spec.save(spec_filepath) | |||
| LearnwareTemplate.generate_learnware_zipfile( | |||
| learnware_zippath=learnware_zippath, | |||
| model_template=PickleModelTemplate(pickle_filepath=pickle_filepath, model_kwargs={"input_shape":(64,), "output_shape": (10,), "predict_method": "predict_proba"}), | |||
| stat_spec_template=StatSpecTemplate(filepath=spec_filepath, type="RKMETableSpecification"), | |||
| requirements=["scikit-learn==0.22"], | |||
| ) | |||
| self.zip_path_list.append(learnware_zippath) | |||
| def test_upload_delete_learnware(self, learnware_num=5, delete=True): | |||
| easy_market = self._init_learnware_market() | |||
| @@ -91,20 +73,22 @@ class TestWorkflow(unittest.TestCase): | |||
| assert len(easy_market) == 0, f"The market should be empty!" | |||
| for idx, zip_path in enumerate(self.zip_path_list): | |||
| semantic_spec = copy.deepcopy(user_semantic) | |||
| semantic_spec["Name"]["Values"] = "learnware_%d" % (idx) | |||
| semantic_spec["Description"]["Values"] = "test_learnware_number_%d" % (idx) | |||
| semantic_spec["Input"] = { | |||
| "Dimension": 64, | |||
| "Description": { | |||
| f"{i}": f"The value in the grid {i // 8}{i % 8} of the image of hand-written digit." | |||
| for i in range(64) | |||
| semantic_spec = generate_semantic_spec( | |||
| name=f"learnware_{idx}", | |||
| description=f"test_learnware_number_{idx}", | |||
| input_description={ | |||
| "Dimension": 64, | |||
| "Description": { | |||
| f"{i}": f"The value in the grid {i // 8}{i % 8} of the image of hand-written digit." | |||
| for i in range(64) | |||
| }, | |||
| }, | |||
| output_description={ | |||
| "Dimension": 10, | |||
| "Description": {f"{i}": "The probability for each digit for 0 to 9." for i in range(10)}, | |||
| }, | |||
| } | |||
| semantic_spec["Output"] = { | |||
| "Dimension": 10, | |||
| "Description": {f"{i}": "The probability for each digit for 0 to 9." for i in range(10)}, | |||
| } | |||
| **self.universal_semantic_config | |||
| ) | |||
| easy_market.add_learnware(zip_path, semantic_spec) | |||
| print("Total Item:", len(easy_market)) | |||
| @@ -129,70 +113,52 @@ class TestWorkflow(unittest.TestCase): | |||
| easy_market = self.test_upload_delete_learnware(learnware_num, delete=False) | |||
| print("Total Item:", len(easy_market)) | |||
| assert len(easy_market) == self.learnware_num, f"The number of learnwares must be {self.learnware_num}!" | |||
| test_folder = os.path.join(curr_root, "test_semantics") | |||
| # unzip -o -q zip_path -d unzip_dir | |||
| if os.path.exists(test_folder): | |||
| rmtree(test_folder) | |||
| os.makedirs(test_folder, exist_ok=True) | |||
| with zipfile.ZipFile(self.zip_path_list[0], "r") as zip_obj: | |||
| zip_obj.extractall(path=test_folder) | |||
| semantic_spec = copy.deepcopy(user_semantic) | |||
| semantic_spec["Name"]["Values"] = f"learnware_{learnware_num - 1}" | |||
| semantic_spec["Description"]["Values"] = f"test_learnware_number_{learnware_num - 1}" | |||
| user_info = BaseUserInfo(semantic_spec=semantic_spec) | |||
| search_result = easy_market.search_learnware(user_info) | |||
| single_result = search_result.get_single_results() | |||
| print("User info:", user_info.get_semantic_spec()) | |||
| print(f"Search result:") | |||
| for search_item in single_result: | |||
| print( | |||
| "Choose learnware:", | |||
| search_item.learnware.id, | |||
| search_item.learnware.get_specification().get_semantic_spec(), | |||
| with tempfile.TemporaryDirectory(prefix="learnware_test_workflow") as test_folder: | |||
| with zipfile.ZipFile(self.zip_path_list[0], "r") as zip_obj: | |||
| zip_obj.extractall(path=test_folder) | |||
| semantic_spec = generate_semantic_spec( | |||
| name=f"learnware_{learnware_num - 1}", | |||
| description=f"test_learnware_number_{learnware_num - 1}", | |||
| **self.universal_semantic_config, | |||
| ) | |||
| user_info = BaseUserInfo(semantic_spec=semantic_spec) | |||
| search_result = easy_market.search_learnware(user_info) | |||
| single_result = search_result.get_single_results() | |||
| rmtree(test_folder) # rm -r test_folder | |||
| print(f"Search result:") | |||
| for search_item in single_result: | |||
| print("Choose learnware:", search_item.learnware.id) | |||
| def test_stat_search(self, learnware_num=5): | |||
| easy_market = self.test_upload_delete_learnware(learnware_num, delete=False) | |||
| print("Total Item:", len(easy_market)) | |||
| test_folder = os.path.join(curr_root, "test_stat") | |||
| with tempfile.TemporaryDirectory(prefix="learnware_test_workflow") as test_folder: | |||
| for idx, zip_path in enumerate(self.zip_path_list): | |||
| with zipfile.ZipFile(zip_path, "r") as zip_obj: | |||
| zip_obj.extractall(path=test_folder) | |||
| for idx, zip_path in enumerate(self.zip_path_list): | |||
| unzip_dir = os.path.join(test_folder, f"{idx}") | |||
| # unzip -o -q zip_path -d unzip_dir | |||
| if os.path.exists(unzip_dir): | |||
| rmtree(unzip_dir) | |||
| os.makedirs(unzip_dir, exist_ok=True) | |||
| with zipfile.ZipFile(zip_path, "r") as zip_obj: | |||
| zip_obj.extractall(path=unzip_dir) | |||
| user_spec = RKMETableSpecification() | |||
| user_spec.load(os.path.join(unzip_dir, "svm.json")) | |||
| user_info = BaseUserInfo(semantic_spec=user_semantic, stat_info={"RKMETableSpecification": user_spec}) | |||
| search_results = easy_market.search_learnware(user_info) | |||
| user_spec = RKMETableSpecification() | |||
| user_spec.load(os.path.join(test_folder, "stat_spec.json")) | |||
| user_semantic = generate_semantic_spec(**self.universal_semantic_config) | |||
| user_info = BaseUserInfo(semantic_spec=user_semantic, stat_info={"RKMETableSpecification": user_spec}) | |||
| search_results = easy_market.search_learnware(user_info) | |||
| single_result = search_results.get_single_results() | |||
| multiple_result = search_results.get_multiple_results() | |||
| assert len(single_result) >= 1, f"Statistical search failed!" | |||
| print(f"search result of user{idx}:") | |||
| for search_item in single_result: | |||
| print(f"score: {search_item.score}, learnware_id: {search_item.learnware.id}") | |||
| single_result = search_results.get_single_results() | |||
| multiple_result = search_results.get_multiple_results() | |||
| for mixture_item in multiple_result: | |||
| print(f"mixture_score: {mixture_item.score}\n") | |||
| mixture_id = " ".join([learnware.id for learnware in mixture_item.learnwares]) | |||
| print(f"mixture_learnware: {mixture_id}\n") | |||
| assert len(single_result) >= 1, f"Statistical search failed!" | |||
| print(f"search result of user{idx}:") | |||
| for search_item in single_result: | |||
| print(f"score: {search_item.score}, learnware_id: {search_item.learnware.id}") | |||
| rmtree(test_folder) # rm -r test_folder | |||
| for mixture_item in multiple_result: | |||
| print(f"mixture_score: {mixture_item.score}\n") | |||
| mixture_id = " ".join([learnware.id for learnware in mixture_item.learnwares]) | |||
| print(f"mixture_learnware: {mixture_id}\n") | |||
| def test_learnware_reuse(self, learnware_num=5): | |||
| easy_market = self.test_upload_delete_learnware(learnware_num, delete=False) | |||
| @@ -202,6 +168,7 @@ class TestWorkflow(unittest.TestCase): | |||
| train_X, data_X, train_y, data_y = train_test_split(X, y, test_size=0.3, shuffle=True) | |||
| stat_spec = generate_rkme_table_spec(X=data_X, gamma=0.1, cuda_idx=0) | |||
| user_semantic = generate_semantic_spec(**self.universal_semantic_config) | |||
| user_info = BaseUserInfo(semantic_spec=user_semantic, stat_info={"RKMETableSpecification": stat_spec}) | |||
| search_results = easy_market.search_learnware(user_info) | |||
| @@ -243,5 +210,5 @@ def suite(): | |||
| if __name__ == "__main__": | |||
| runner = unittest.TextTestRunner() | |||
| runner = unittest.TextTestRunner(verbosity=2) | |||
| runner.run(suite()) | |||