From fa08c549fc28cadddd83e7bc8d5825200381eeab Mon Sep 17 00:00:00 2001 From: nju-xy <1582857295@qq.com> Date: Fri, 8 Dec 2023 10:25:08 +0800 Subject: [PATCH] [MNT] revert commits irrelevant to text example --- docs/advanced/anchor.rst | 10 ++-- docs/components/spec.rst | 3 +- docs/workflows/search.rst | 109 ++++++++++++++++++++++++++------------ 3 files changed, 80 insertions(+), 42 deletions(-) diff --git a/docs/advanced/anchor.rst b/docs/advanced/anchor.rst index fce3b0f..75fdcc1 100644 --- a/docs/advanced/anchor.rst +++ b/docs/advanced/anchor.rst @@ -4,15 +4,15 @@ Anchor learnware Anchor learnwares are a small fraction of representative learnwares that helps locate user's requirements through user feedback. The learnware market can choose or generate several learnwares as anchor learnwares corresponding to the specification island. If the user does not have sufficient training data for constructing an RKME requirement, the learnware market can send several anchor learnwares to the user. By feeding her own data to these anchor learnwares, some information such as (precision, recall) or other performance indicators, can be generated and returned to the market. These information could help the market identify potentially helpful models, e.g., by identifying models that are far from anchors exhibiting poor performance whereas close to anchors exhibiting relatively better performance in the specification island. -To fulfill the anchor learnware method, you need to implement the following functions in the folder ``Learnware/market/anchor``: +To fulfill the anchor learnware method, you need to implement the following functions in ``anchor.py``: -- First, you should design how the market chooses or generates anchor learnwares. This can be realized by selecting prototype models through functional space clustering, and more interesting designs can be explored. The class ``AnchoredOrganizer`` in ``organizer.py`` is designed for it. The function ``AnchoredOrganizer.update_anchor_learnware_list`` is reserved for choosing or generating anchor learnwares. The functions ``AnchoredOrganizer._update_anchor_learnware`` and ``AnchoredOrganizer._delete_anchor_learnware`` have been completed as auxiliary. +- First, you should design how the market chooses or generates anchor learnwares. This can be realized by selecting prototype models through functional space clustering, and more interesting designs can be explored. The function ``AnchoredMarket.update_anchor_learnware_list`` is reserved for it. The functions ``AnchoredMarket._update_anchor_learnware`` and ``AnchoredMarket._delete_anchor_learnware`` have been completed as auxiliary. -- Second, when a user comes with no RKME(or other statistical) specifications, the market should choose several anchor learnwares and send them to the user. This process is done by ``AnchoredSearcher.search_anchor_learnware`` in ``searcher.py``. Besides the list of anchor learnwares, it also returns an item specifying which performance indicator should the user return. +- Second, when a user comes with no RKME(or other statistical) specifications, the market should choose several anchor learnwares and send them to the user. This process is done by ``AnchoredMarket.search_anchor_learnware``, and the chosen anchors are stored in ``AnchoredUserInfo`` by ``AnchoredUserInfo.add_anchor_learnware``. -- Third, by feeding the user's data to these anchor learnwares, the returned information is calculated and stored in ``AnchoredUserInfo`` in ``user_info.py``. The user should add the anchor learnwares into it using ``AnchoredUserInfo.add_anchor_learnware_ids``, and then fill in performance indicator using ``AnchoredUserInfo.update_stat_info``. +- Third, the market should specify which performance indicator should the user return. By feeding the user's data to these anchor learnwares, the returned information is calculated and stored in ``AnchoredUserInfo`` by ``AnchoredUserInfo.update_stat_info``. -- Fourth, according to the returned information from the user, the market should identify the helpful learnwares for the user. This process is done in ``AnchoredSearcher.search_learnware`` in ``searcher.py``, which returns the recommended comination of helpful learnwares and the list of helpful learnwares. +- Fourth, according to the returned information from the user, the market should identify the helpful learnwares for the user. This process is done in ``AnchoredMarket.search_learnware``. diff --git a/docs/components/spec.rst b/docs/components/spec.rst index a1ff33e..ed07c7a 100644 --- a/docs/components/spec.rst +++ b/docs/components/spec.rst @@ -82,8 +82,7 @@ Image Specification Text Specification -------------------------- -Different from tabular data, each text input is a string of different length, so we should first transform them to equal-length arrays. Sentence embedding is used here to complete this transformation. We choose the model ``paraphrase-multilingual-MiniLM-L12-v2``, a lightweight multilingual embedding model. Then, we calculate the RKME specification on the embedding, -just like we do with tabular data. Besides, we use the package ``langdetect`` to detect and store the language of the text inputs for further search. We hope to search for the learnware which supports the language of the user task. + System Specification ====================================== \ No newline at end of file diff --git a/docs/workflows/search.rst b/docs/workflows/search.rst index ffbd201..9086532 100644 --- a/docs/workflows/search.rst +++ b/docs/workflows/search.rst @@ -2,72 +2,111 @@ Learnwares Search ============================================================ -``Learnware Searcher`` is a key component of ``Learnware Market`` that identifies and recommends helpful learnwares to users according to their ``UserInfo``. Based on whether the returned learnware dimensions are consistent with user tasks, the searchers can be divided into two categories: homogeneous searchers and heterogeneous searchers. - +When a user comes with her requirements, the market should identify helpful learnwares and recommend them to the user. +The search of helpful learnwares is based on the user information, and can be divided into two stages: semantic specification search and statistical specification search. Homo Search ====================== -The homogeneous search of helpful learnwares is based on the user information, and can be divided into two stages: semantic specification search and statistical specification search. - User information ------------------------------- -``BaseUserInfo`` is a ``Python API`` for users to provide enough information to identify helpful learnwares. -When initializing ``BaseUserInfo``, three optional information can be provided: ``id``, ``semantic_spec`` and ``stat_info``. The generation of these specifications is seen in `Learnware Preparation <./submit>`_. +The user should provide her requirements in ``BaseUserInfo``. The class ``BaseUserInfo`` consists of user's semantic specification ``user_semantic`` and statistical information ``stat_info``. + +The semantic specification ``user_semantic`` is stored in a ``dict``, with keywords 'Data', 'Task', 'Library', 'Scenario', 'Description' and 'Name'. An example is shown below, and you could choose their values according to the figure. For the keys of type 'Class, you should choose one ilegal value; for the keys of type 'Tag', you can choose one or more values; for the keys of type 'String', you should provide a string; the key 'Description' is used in learnwares' semantic specifications and is ignored in user semantic specification; the values of all these keys can be empty if the user have no idea of them. + +.. code-block:: python + + # An example of user_semantic + user_semantic = { + "Data": {"Values": ["Image"], "Type": "Class"}, + "Task": {"Values": ["Classification"], "Type": "Class"}, + "Library": {"Values": ["Scikit-learn"], "Type": "Tag"}, + "Scenario": {"Values": ["Education"], "Type": "Class"}, + "Description": {"Values": "", "Type": "String"}, + "Name": {"Values": "digits", "Type": "String"}, + } + + +.. image:: ../_static/img/semantic_spec.png + :align: center + + +The user's statistical information ``stat_info`` is stored in a ``json`` file, e.g., ``stat.json``. The generation of this file is seen in `Learnware Preparation <./submit>`_. Semantic Specification Search ------------------------------- To search for learnwares that fit your task purpose, -the user could first provide a semantic specification ``user_semantic`` that describes the characteristics of your task. +the user should first provide a semantic specification ``user_semantic`` that describes the characteristics of your task. The Learnware Market will perform a first-stage search based on ``user_semantic``, -identifying potentially helpful leaarnwares whose models solve tasks similar to your requirements. There are two types of Semantic Specification Search: ``EasyExactSemanticSearcher`` and ``EasyFuzzSemanticSearcher``. They can be used like: - -- `EasyExactSemanticSearcher(learnware_list: List[Learnware], user_info: BaseUserInfo)-> SearchResults` +identifying potentially helpful leaarnwares whose models solve tasks similar to your requirements. -- `EasyFuzzSemanticSearcher(learnware_list: List[Learnware], user_info: BaseUserInfo)-> SearchResults` +.. code-block:: python -In these two searchers, each learnware in the ``learnware_list`` is compared with ``user_info`` according to their ``semantic_spec``, and added to the search result if mathched. Two semantic_spec are matched when all the key words are matched or empty in ``user_info``. Different keys have different matching rules: + # construct user_info which includes semantic specification for searching learnware + user_info = BaseUserInfo(semantic_spec=semantic_spec) -- For keys ``Data``, ``Task``, ``Library`` and ``license``, two ``semantic_spec`` keys are matched only if these values(only one value for each key) of learnware ``semantic_spec`` exists in values(may be muliple values for one key) of user ``semantic_spec``. + # search_learnware performs semantic specification search if user_info doesn't include a statistical specification + _, single_learnware_list, _ = easy_market.search_learnware(user_info) -- For the key ``Scenario``, two ``semantic_spec`` keys are matched if their values have nonempty intersections. + # single_learnware_list is the learnware list by semantic specification searching + print(single_learnware_list) -- For keys ``Name`` and ``Description``, the values are strings and case is ignored; +In semantic specification search, we go through all learnwares in the market to compare their semantic specifications with the user's one, and return all the learnwares that pass through the comparation. When comparing two learnwares' semantic specifications, we design different ways for different semantic keys: - - In ``EasyExactSemanticSearcher``, two ``semantic_spec`` keys are matched if these values of learnware ``semantic_spec`` is a substring of user ``semantic_spec``. +- For semantic keys with type 'Class', they are matched only if they have the same value. - - In ``EasyFuzzSemanticSearcher``, first the exact semantic searcher is conducted like ``EasyExactSemanticSearcher``. If the result is empty, the fuzz semantic searcher is activated: the ``learnware_list`` is sorted according to the fuzz score function ``fuzz.partial_ratio`` in ``rapidfuzz``. +- For semantic keys with type 'Tag', they are matched only if they have nonempty intersections. -The results are returned storing in ``single_results`` of ``SearchResults``. +- For the user's input in the search box, it matchs with a learnware's semantic specification only if it's a substring of its 'Name' or 'Description'. All the strings are converted to the lower case before matching. +- When a key value is missing, it will not participate in the match. The user could upload no semantic specifications if he wants. Statistical Specification Search --------------------------------- -If you choose to provide your own statistical specification ``stat_info``, -the Learnware Market can perform a more accurate leanware selection using ``EasyStatSearcher``. +If you choose to provide your own statistical specification file ``stat.json``, +the Learnware Market can perform a more accurate leanware selection from +the learnwares returned by the previous step. This second-stage search is based on statistical information and returns one or more learnwares that are most likely to be helpful for your task. -- `EasyStatSearcher( - learnware_list: List[Learnware], - user_info: BaseUserInfo, - max_search_num: int = 5, - search_method: str = "greedy",) - -> SearchResults` - - - It searches for helpful learnwares from ``learnware_list`` based on the ``stat_info`` in ``user_info``. - - - The result ``SingleSearchItem`` and ``MultipleSearchItem`` are both stored in ``SearchResults``. In ``SingleSearchItem``, it searches for single learnwares that could solve the user task; scores are also provided to represent the fitness of each single learnware and user task. In ``MultipleSearchItem``, it searches for a mixture of learnwares that could solve the user task better; the mixture learnware list and a score for the mixture is returned. +For example, the following code is designed to work with Reduced Kernel Mean Embedding (RKME) as a statistical specification: - - The parameter ``search_method`` provides two choice of search strategies for mixture learnwares: ``greedy`` and ``auto``. For the search method ``greedy``, each time it chooses a learnware to make their mixture closer to the user's ``stat_info``; for the search method ``auto``, it directly calculates a best mixture weight for the ``learnware_list``. +.. code-block:: python - - For single learnware search, we only return the learnwares with score larger than 0.6; For multiple learnware search, the parameter ``max_search_num`` specifies the maximum length of the returned mixture learnware list. + import learnware.specification as specification -Semantic and Statistical Specification Search -------------------------------- + user_spec = specification.RKMETableSpecification() + user_spec.load(os.path.join("rkme.json")) + user_info = BaseUserInfo( + semantic_spec=user_semantic, stat_info={"RKMETableSpecification": user_spec} + ) + (sorted_score_list, single_learnware_list, + mixture_score, mixture_learnware_list) = easy_market.search_learnware(user_info) + + # sorted_score_list is the learnware scores based on MMD distances, sorted in descending order + print(sorted_score_list) + + # single_learnware_list is the learnwares sorted in descending order based on their scores + print(single_learnware_list) + + # mixture_learnware_list is the learnwares whose mixture is helpful for your task + print(mixture_learnware_list) + + # mixture_score is the score of the mixture of learnwares + print(mixture_score) + +The return values of statistical specification search are ``sorted_score_list``, ``single_learnware_list``, ``mixture_score`` and ``mixture_learnware_list``. +``sorted_score_list`` and ``single_learnware_list`` are the ranking of each single learnware and the corresponding scores. We return at least 15 learnwares unless there're no enough ones. If there are more than 15 matched learnwares, the ones with scores less than 50 will be ignored. +``mixture_score`` and ``mixture_learnware_list`` are the chosen mixture learnwares and the corresponding score. At most 5 learnwares will be chosen, whose mixture may have a relatively good performance on the user's task. + + +The statistical specification search is done in the following way. +We first filter by the dimension of RKME specifications; only those with the same dimension with the user's will enter the subsequent stage. + +The single_learnware_list is calculated using the distances between two RKMEs. The greater the distance from the user's RKME, the lower the score is. -The semantic specification search and statistical specification search have been Has been integrated into the same interface ``EasySearcher``. +The mixture_learnware_list is calculated in a greedy way. Each time we choose a learnware to make their mixture closer to the user's RKME. Specifically, each time we go through all the left learnwares to find the one whose combination with chosen learnwares could minimize the distance between their mixture's RKME and the user's RKME. The mixture weight is calculated by minimizing the RKME distance, which is solved by quadratic programming. If the distance become larger or the number of chosen learnwares reaches a threshold, the process will end and the chosen learnware and weight list will return. Hetero Search ====================== \ No newline at end of file