@@ -1,7 +1,9 @@

============================================================
- [What Is MindSpore?](#what-is-mindspore)
[View the Chinese version (查看中文)](./README_CN.md)
- [What Is MindSpore](#what-is-mindspore)
    - [Automatic Differentiation](#automatic-differentiation)
    - [Automatic Parallel](#automatic-parallel)
- [Installation](#installation)
@@ -0,0 +1,220 @@

============================================================

[View English](./README.md)

- [What Is MindSpore](#what-is-mindspore)
    - [Automatic Differentiation](#automatic-differentiation)
    - [Automatic Parallel](#automatic-parallel)
- [Installation](#installation)
    - [Binaries](#binaries)
    - [From Source](#from-source)
    - [Docker Image](#docker-image)
- [Quickstart](#quickstart)
- [Docs](#docs)
- [Community](#community)
    - [Governance](#governance)
    - [Communication](#communication)
- [Contributing](#contributing)
- [Release Notes](#release-notes)
- [License](#license)

## What Is MindSpore

MindSpore is a new open-source deep learning training/inference framework for device, edge, and cloud scenarios. MindSpore provides friendly design and efficient execution, aiming to improve the development experience of data scientists and algorithm engineers, and offers native support for the Ascend AI processor along with software-hardware co-optimization. At the same time, MindSpore, as a global AI open-source community, aims to further develop and enrich the AI software/hardware application ecosystem.

<img src="docs/MindSpore-architecture.png" alt="MindSpore Architecture" width="600"/>

For more details, please check out our [Architecture Guide](https://www.mindspore.cn/docs/zh-CN/master/architecture.html).

### Automatic Differentiation

There are three automatic differentiation techniques in the mainstream deep learning frameworks:

- **Conversion based on a static computational graph**: converts the network into a static dataflow graph at compile time, then applies the chain rule to the dataflow graph to implement automatic differentiation.
- **Conversion based on a dynamic computational graph**: records the operation trajectory of the network during forward execution through operator overloading, then applies the chain rule to the dynamically generated dataflow graph to implement automatic differentiation.
- **Conversion based on source code**: this technique evolved from functional programming frameworks and performs automatic differentiation transformation on the intermediate representation (the form a program takes during compilation) in the manner of just-in-time (JIT) compilation, supporting complex control-flow scenarios, higher-order functions, and closures.

TensorFlow adopted static computational graphs in its early days, whereas PyTorch uses dynamic computational graphs. Static graphs can leverage static compilation technology to optimize network performance, but building or debugging a network is very complicated. Dynamic graphs are very convenient to use, but it is difficult to push performance optimization to the limit.

MindSpore takes a different path: automatic differentiation based on source-code transformation. On the one hand, it supports automatic differentiation of automatic control flow, so building models is as convenient as with PyTorch. On the other hand, MindSpore can perform static compilation optimization on neural networks to achieve better performance.

<img src="docs/Automatic-differentiation.png" alt="Automatic Differentiation" width="600"/>

The implementation of MindSpore automatic differentiation can be understood as the symbolic differentiation of the program itself. Because MindSpore IR is a functional intermediate representation, it has an intuitive correspondence with composite functions in basic algebra: the formula of a composite function is composed of arbitrary differentiable basic functions, and each primitive operation in MindSpore IR corresponds to a basic function in basic algebra, which allows more complex control flow to be built.
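To make the source-transformation approach concrete, the sketch below differentiates a tiny network. It is a minimal illustration only: it assumes the 0.6-era composite API (`mindspore.ops.composite.GradOperation`, which at that time took a name string as its first argument) and a CPU context, so treat it as a sketch rather than canonical usage.

```python
import numpy as np
import mindspore.context as context
import mindspore.nn as nn
from mindspore import Tensor
from mindspore.ops import composite as C

context.set_context(mode=context.GRAPH_MODE, device_target="CPU")

class Square(nn.Cell):
    def construct(self, x):
        return x * x

class GradNet(nn.Cell):
    def __init__(self, net):
        super(GradNet, self).__init__()
        self.net = net
        self.grad = C.GradOperation('grad')  # assumed 0.6-era signature

    def construct(self, x):
        # The gradient function is derived from the source of `construct`
        # by the IR-level transformation described above.
        return self.grad(self.net)(x)

x = Tensor(np.array([1.0, 2.0, 3.0]).astype(np.float32))
print(GradNet(Square())(x))  # dy/dx = 2x -> [2. 4. 6.]
```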
### Automatic Parallel

The goal of MindSpore automatic parallelism is to build a training method that combines data parallelism, model parallelism, and hybrid parallelism. It can automatically select the model-splitting strategy with the lowest cost to achieve automatic distributed parallel training.

<img src="docs/Automatic-parallel.png" alt="Automatic Parallel" width="600"/>

At present, MindSpore uses a fine-grained parallel strategy of operator splitting: each operator in the graph is split into clusters to complete the parallel operation. The splitting strategy in this process can be very complicated, but as a Python developer you do not need to care about the underlying implementation, as long as the top-level API computation is valid.
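As a rough illustration of how little the user-facing script changes, the following sketch enables the strategy-search mode through the context API. `set_auto_parallel_context` is the real entry point; the device count and the distributed launch environment are assumptions here.

```python
from mindspore import context
from mindspore.communication.management import init

# Assumes a multi-device job launched with the matching environment variables.
context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")
init()  # initialize the communication backend
context.set_auto_parallel_context(parallel_mode="auto_parallel", device_num=8)
# The single-device network definition can be reused unchanged from here on;
# the framework searches for a low-cost operator-splitting strategy.
```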
## Installation

### Binaries

MindSpore offers build options across multiple backends:

| Hardware Platform | Operating System | Status |
| :------------ | :-------------- | :--- |
| Ascend 910 | Ubuntu-x86 | ✔️ |
| | EulerOS-x86 | ✔️ |
| | EulerOS-aarch64 | ✔️ |
| GPU CUDA 10.1 | Ubuntu-x86 | ✔️ |
| CPU | Ubuntu-x86 | ✔️ |
| | Windows-x86 | ✔️ |

To install with the `pip` command, take the `CPU` and `Ubuntu-x86` build as an example:

1. Download the whl package from the [MindSpore download page](https://www.mindspore.cn/versions) and install it.

```
pip install https://ms-release.obs.cn-north-4.myhuaweicloud.com/0.6.0-beta/MindSpore/cpu/ubuntu_x86/mindspore-0.6.0-cp37-cp37m-linux_x86_64.whl
```

2. Run the following command to verify the installation.

```python
import numpy as np
import mindspore.context as context
import mindspore.nn as nn
from mindspore import Tensor
from mindspore.ops import operations as P

context.set_context(mode=context.GRAPH_MODE, device_target="CPU")

class Mul(nn.Cell):
    def __init__(self):
        super(Mul, self).__init__()
        self.mul = P.Mul()

    def construct(self, x, y):
        return self.mul(x, y)

x = Tensor(np.array([1.0, 2.0, 3.0]).astype(np.float32))
y = Tensor(np.array([4.0, 5.0, 6.0]).astype(np.float32))

mul = Mul()
print(mul(x, y))
```
```
[ 4. 10. 18.]
```

### From Source

[Install MindSpore](https://www.mindspore.cn/install).

### Docker Image

MindSpore Docker images are hosted on [Docker Hub](https://hub.docker.com/r/mindspore).
The current support for containerized build options is as follows:

| Hardware Platform | Docker Image Repository | Tag | Description |
| :----- | :------------------------ | :----------------------- | :--------------------------------------- |
| CPU | `mindspore/mindspore-cpu` | `x.y.z` | A production environment with the MindSpore `x.y.z` CPU release pre-installed. |
| | | `devel` | A development environment for building MindSpore from source (`CPU` backend). For installation details, see https://www.mindspore.cn/install. |
| | | `runtime` | A runtime environment in which to install a MindSpore binary package (`CPU` backend). |
| GPU | `mindspore/mindspore-gpu` | `x.y.z` | A production environment with the MindSpore `x.y.z` GPU release pre-installed. |
| | | `devel` | A development environment for building MindSpore from source (`GPU CUDA10.1` backend). For installation details, see https://www.mindspore.cn/install. |
| | | `runtime` | A runtime environment in which to install a MindSpore binary package (`GPU CUDA10.1` backend). |
| Ascend | <center>—</center> | <center>—</center> | Coming soon. |

> **NOTICE:** Building the GPU `devel` Docker image from source and then installing the whl package directly is not recommended. We strongly recommend transferring and installing the whl package inside the GPU `runtime` Docker image.

* CPU

For the `CPU` backend, you can pull and run the latest stable image directly with the following commands:

```
docker pull mindspore/mindspore-cpu:0.6.0-beta
docker run -it mindspore/mindspore-cpu:0.6.0-beta /bin/bash
```

* GPU

For the `GPU` backend, make sure `nvidia-container-toolkit` has been installed in advance. The following is an installation guide for `Ubuntu` users:

```
DISTRIBUTION=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$DISTRIBUTION/nvidia-docker.list | tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit nvidia-docker2
sudo systemctl restart docker
```

Pull and run the latest stable image with the following commands:

```
docker pull mindspore/mindspore-gpu:0.6.0-beta
docker run -it --runtime=nvidia --privileged=true mindspore/mindspore-gpu:0.6.0-beta /bin/bash
```

To test whether the Docker image works, run the following Python code and check the output:

```python
import numpy as np
import mindspore.context as context
from mindspore import Tensor
from mindspore.ops import functional as F

context.set_context(device_target="GPU")

x = Tensor(np.ones([1, 3, 3, 4]).astype(np.float32))
y = Tensor(np.ones([1, 3, 3, 4]).astype(np.float32))
print(F.tensor_add(x, y))
```
```
[[[ 2. 2. 2. 2.],
[ 2. 2. 2. 2.],
[ 2. 2. 2. 2.]],
[[ 2. 2. 2. 2.],
[ 2. 2. 2. 2.],
[ 2. 2. 2. 2.]],
[[ 2. 2. 2. 2.],
[ 2. 2. 2. 2.],
[ 2. 2. 2. 2.]]]
```

If you want to learn more about the build process of MindSpore Docker images, check the [docker](docker/README.md) repo for details.

## Quickstart

See the [Quick Start](https://www.mindspore.cn/tutorial/zh-CN/master/quick_start/quick_start.html) to implement image classification.

## Docs

For more details about installation guides, tutorials, and APIs, see the [User Documentation](https://gitee.com/mindspore/docs).

## Community

### Governance

See how MindSpore implements [Open Governance](https://gitee.com/mindspore/community/blob/master/governance.md).

### Communication

- [MindSpore Slack](https://join.slack.com/t/mindspore/shared_invite/zt-dgk65rli-3ex4xvS4wHX7UDmsQmfu8w) - a communication platform for developers.
- `#mindspore` IRC channel (meeting records only)
- Video conferencing: TBD
- Mailing list: <https://mailweb.mindspore.cn/postorius/lists>

## Contributing

Contributions are welcome. For more details, see our [Contributor Wiki](CONTRIBUTING.md).

## Release Notes

See [RELEASE](RELEASE.md) for the release notes.

## License

[Apache License 2.0](LICENSE)
@@ -150,7 +150,7 @@ TensorPtr TensorPy::MakeTensor(const py::array &input, const TypePtr &type_ptr)
  // Get tensor shape.
  std::vector<int> shape(buf.shape.begin(), buf.shape.end());
  if (data_type == buf_type) {
    // Use memory copy if input data type is same as the required type.
    // Use memory copy if input data type is the same as the required type.
    return std::make_shared<Tensor>(data_type, shape, buf.ptr, buf.size * buf.itemsize);
  }
  // Create tensor with data type converted.
@@ -546,9 +546,11 @@ def set_context(**kwargs):
Note:
Attribute name is required for setting attributes.
The mode is not recommended to be changed after the net is initialized because the implementations of some
operations are different in graph mode and pynative mode. Default: PYNATIVE_MODE.
Args:
mode (int): Running in GRAPH_MODE(0) or PYNATIVE_MODE(1). Default: PYNATIVE_MODE.
mode (int): Running in GRAPH_MODE(0) or PYNATIVE_MODE(1).
device_target (str): The target device to run, support "Ascend", "GPU", "CPU". Default: "Ascend".
device_id (int): ID of the target device, the value must be in [0, device_num_per_host-1],
while device_num_per_host should be no more than 4096. Default: 0.
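For reference, a typical call looks like the sketch below; the exact attribute set varies by version, so this only exercises the parameters documented above.

```python
import mindspore.context as context

# Graph mode on the first Ascend device. Set the mode once, up front, since
# some operations are implemented differently in graph and pynative mode.
context.set_context(mode=context.GRAPH_MODE, device_target="Ascend", device_id=0)
```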
@@ -148,7 +148,7 @@ class Cell:
def update_cell_type(self, cell_type):
"""
Update the current cell type mainly identify if quantization aware training network.
The current cell type is updated when a quantization aware training network is encountered.
After being invoked, it can set the cell type to 'cell_type'.
"""
@@ -934,7 +934,7 @@ class GraphKernel(Cell):
Base class for GraphKernel.
A `GraphKernel` is a composite of basic primitives and can be compiled into a fused kernel automatically when
context.set_context(enable_graph_kernel=True).
enable_graph_kernel in context is set to True.
Examples:
>>> class Relu(GraphKernel):
@@ -661,7 +661,7 @@ class LogSoftmax(GraphKernel):
Log Softmax activation function.
Applies the Log Softmax function to the input tensor on the specified axis.
Suppose a slice along the given aixs :math:`x` then for each element :math:`x_i`
Suppose a slice in the given axis :math:`x`, then for each element :math:`x_i`,
the Log Softmax function is shown as follows:
.. math::
@@ -987,10 +987,10 @@ class LayerNorm(Cell):
Applies Layer Normalization over a mini-batch of inputs.
Layer normalization is widely used in recurrent neural networks. It applies
normalization over a mini-batch of inputs for each single training case as described
normalization on a mini-batch of inputs for each single training case as described
in the paper `Layer Normalization <https://arxiv.org/pdf/1607.06450.pdf>`_. Unlike batch
normalization, layer normalization performs exactly the same computation at training and
testing times. It can be described using the following formula. It is applied across all channels
testing time. It can be described using the following formula. It is applied across all channels
and pixels but only one batch size.
.. math::
@@ -1139,9 +1139,9 @@ class LambNextMV(GraphKernel):
Outputs:
Tuple of 2 Tensors.
- **add3** (Tensor) - The shape is same as the shape after broadcasting, and the data type is
- **add3** (Tensor) - The shape is the same as the shape after broadcasting, and the data type is
the one with high precision or high digits among the inputs.
- **realdiv4** (Tensor) - The shape is same as the shape after broadcasting, and the data type is
- **realdiv4** (Tensor) - The shape is the same as the shape after broadcasting, and the data type is
the one with high precision or high digits among the inputs.
Examples:
@@ -55,7 +55,7 @@ class Softmax(Cell):
.. math::
\text{softmax}(x_{i}) = \frac{\exp(x_i)}{\sum_{j=0}^{n-1}\exp(x_j)},
where :math:`x_{i}` is the :math:`i`-th slice along the given dim of the input Tensor.
where :math:`x_{i}` is the :math:`i`-th slice in the given dimension of the input Tensor.
Args:
axis (Union[int, tuple[int]]): The axis to apply Softmax operation, -1 means the last dimension. Default: -1.
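A minimal usage sketch of this cell, mirroring the style of the other docstring examples (the input values are arbitrary):

```python
import numpy as np
import mindspore
import mindspore.nn as nn
from mindspore import Tensor

softmax = nn.Softmax()  # axis=-1 by default
input_x = Tensor(np.array([-1, -2, 0, 2, 1]), mindspore.float16)
output = softmax(input_x)  # entries along the last axis sum to 1
```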
@@ -87,11 +87,11 @@ class LogSoftmax(Cell):
Applies the LogSoftmax function to an n-dimensional input tensor.
The input is transformed with Softmax function and then with log function to lie in range[-inf,0).
The input is transformed by the Softmax function and then by the log function to lie in range [-inf, 0).
Logsoftmax is defined as:
:math:`\text{logsoftmax}(x_i) = \log \left(\frac{\exp(x_i)}{\sum_{j=0}^{n-1} \exp(x_j)}\right)`,
where :math:`x_{i}` is the :math:`i`-th slice along the given dim of the input Tensor.
where :math:`x_{i}` is the :math:`i`-th slice in the given dimension of the input Tensor.
Args:
axis (int): The axis to apply LogSoftmax operation, -1 means the last dimension. Default: -1.
@@ -123,7 +123,7 @@ class ELU(Cell):
Exponential Linear Unit activation function.
Applies the exponential linear unit function element-wise.
The activation function defined as:
The activation function is defined as:
.. math::
E_{i} =
@@ -162,7 +162,7 @@ class ReLU(Cell):
Applies the rectified linear unit function element-wise. It returns
element-wise :math:`\max(0, x)`; specifically, the neurons with negative outputs
will suppressed and the active neurons will stay the same.
will be suppressed and the active neurons will stay the same.
Inputs:
- **input_data** (Tensor) - The input of ReLU.
@@ -197,7 +197,7 @@ class ReLU6(Cell):
- **input_data** (Tensor) - The input of ReLU6.
Outputs:
Tensor, which has the same type with `input_data`.
Tensor, which has the same type as `input_data`.
Examples:
>>> input_x = Tensor(np.array([-1, -2, 0, 2, 1]), mindspore.float16)
@@ -234,7 +234,7 @@ class LeakyReLU(Cell):
- **input_x** (Tensor) - The input of LeakyReLU.
Outputs:
Tensor, has the same type and shape with the `input_x`.
Tensor, has the same type and shape as `input_x`.
Examples:
>>> input_x = Tensor(np.array([[-1.0, 4.0, -8.0], [2.0, -5.0, 9.0]]), mindspore.float32)
@@ -365,7 +365,7 @@ class PReLU(Cell):
PReLU is defined as: :math:`prelu(x_i)= \max(0, x_i) + w * \min(0, x_i)`, where :math:`x_i`
is an element of a channel of the input.
Here :math:`w` is an learnable parameter with default initial value 0.25.
Here :math:`w` is a learnable parameter with a default initial value of 0.25.
Parameter :math:`w` has the dimensionality of the argument channel. If called without argument
channel, a single parameter :math:`w` will be shared across all channels.
@@ -413,7 +413,7 @@ class PReLU(Cell):
class HSwish(Cell):
r"""
rHard swish activation function.
Hard swish activation function.
Applies hswish-type activation element-wise. The input is a Tensor with any valid shape.
@@ -422,7 +422,7 @@ class HSwish(Cell):
.. math::
\text{hswish}(x_{i}) = x_{i} * \frac{ReLU6(x_{i} + 3)}{6},
where :math:`x_{i}` is the :math:`i`-th slice along the given dim of the input Tensor.
where :math:`x_{i}` is the :math:`i`-th slice in the given dimension of the input Tensor.
Inputs:
- **input_data** (Tensor) - The input of HSwish.
@@ -456,7 +456,7 @@ class HSigmoid(Cell):
.. math::
\text{hsigmoid}(x_{i}) = \max(0, \min(1, \frac{x_{i} + 3}{6})),
where :math:`x_{i}` is the :math:`i`-th slice along the given dim of the input Tensor.
where :math:`x_{i}` is the :math:`i`-th slice in the given dimension of the input Tensor.
Inputs:
- **input_data** (Tensor) - The input of HSigmoid.
@@ -65,7 +65,7 @@ class Dropout(Cell):
dtype (:class:`mindspore.dtype`): Data type of input. Default: mindspore.float32.
Raises:
ValueError: If keep_prob is not in range (0, 1).
ValueError: If `keep_prob` is not in range (0, 1).
Inputs:
- **input** (Tensor) - An N-D Tensor.
@@ -373,8 +373,8 @@ class OneHot(Cell):
axis is created at dimension `axis`.
Args:
axis (int): Features x depth if axis == -1, depth x features
if axis == 0. Default: -1.
axis (int): Features x depth if axis is -1, depth x features
if axis is 0. Default: -1.
depth (int): A scalar defining the depth of the one hot dimension. Default: 1.
on_value (float): A scalar defining the value to fill in output[i][j]
when indices[j] = i. Default: 1.0.
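To illustrate the axis semantics described above, here is a small sketch (the index values are arbitrary):

```python
import numpy as np
import mindspore
import mindspore.nn as nn
from mindspore import Tensor

net = nn.OneHot(depth=4, axis=-1)  # features x depth when axis is -1
indices = Tensor(np.array([[1, 3], [0, 2]]), mindspore.int32)
output = net(indices)  # shape (2, 2, 4), filled with on_value/off_value
```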
@@ -492,18 +492,18 @@ class Unfold(Cell):
The input tensor must be a 4-D tensor and the data format is NCHW.
Args:
ksizes (Union[tuple[int], list[int]]): The size of sliding window, should be a tuple or list of int,
ksizes (Union[tuple[int], list[int]]): The size of the sliding window, should be a tuple or a list of integers,
and the format is [1, ksize_row, ksize_col, 1].
strides (Union[tuple[int], list[int]]): Distance between the centers of the two consecutive patches,
should be a tuple or list of int, and the format is [1, stride_row, stride_col, 1].
rates (Union[tuple[int], list[int]]): In each extracted patch, the gap between the corresponding dim
pixel positions, should be a tuple or list of int, and the format is [1, rate_row, rate_col, 1].
rates (Union[tuple[int], list[int]]): In each extracted patch, the gap between the corresponding dimension
pixel positions, should be a tuple or a list of integers, and the format is [1, rate_row, rate_col, 1].
padding (str): The type of padding algorithm, a string whose value is "same" or "valid",
not case sensitive. Default: "valid".
- same: Means that the patch can take the part beyond the original image, and this part is filled with 0.
- valid: Means that the patch area taken must be completely contained in the original image.
- valid: Means that the taken patch area must be completely contained in the original image.
Inputs:
- **input_x** (Tensor) - A 4-D tensor whose shape is [in_batch, in_depth, in_row, in_col] and
@@ -511,7 +511,7 @@ class Unfold(Cell):
Outputs:
Tensor, a 4-D tensor whose data type is the same as `input_x`,
and the shape is [out_batch, out_depth, out_row, out_col], the out_batch is same as the in_batch.
and the shape is [out_batch, out_depth, out_row, out_col], where out_batch is the same as in_batch.
Examples:
>>> net = Unfold(ksizes=[1, 2, 2, 1], strides=[1, 1, 1, 1], rates=[1, 1, 1, 1])
@@ -556,11 +556,11 @@ class MatrixDiag(Cell):
Returns a batched diagonal tensor with given batched diagonal values.
Inputs:
- **x** (Tensor) - The diagonal values. It can be of the following data types:
float32, float16, int32, int8, uint8.
- **x** (Tensor) - The diagonal values. It can be one of the following data types:
float32, float16, int32, int8, and uint8.
Outputs:
Tensor, same type as input `x`. The shape should be x.shape + (x.shape[-1], ).
Tensor, has the same type as input `x`. The shape should be x.shape + (x.shape[-1], ).
Examples:
>>> x = Tensor(np.array([1, -1]), mstype.float32)
@@ -587,11 +587,11 @@ class MatrixDiagPart(Cell):
Returns the batched diagonal part of a batched tensor.
Inputs:
- **x** (Tensor) - The batched tensor. It can be of the following data types:
float32, float16, int32, int8, uint8.
- **x** (Tensor) - The batched tensor. It can be one of the following data types:
float32, float16, int32, int8, and uint8.
Outputs:
Tensor, same type as input `x`. The shape should be x.shape[:-2] + [min(x.shape[-2:])].
Tensor, has the same type as input `x`. The shape should be x.shape[:-2] + [min(x.shape[-2:])].
Examples:
>>> x = Tensor([[[-1, 0], [0, 1]], [[-1, 0], [0, 1]], [[-1, 0], [0, 1]]], mindspore.float32)
@@ -617,12 +617,12 @@ class MatrixSetDiag(Cell):
Modify the batched diagonal part of a batched tensor.
Inputs:
- **x** (Tensor) - The batched tensor. It can be of the following data types:
float32, float16, int32, int8, uint8.
- **x** (Tensor) - The batched tensor. It can be one of the following data types:
float32, float16, int32, int8, and uint8.
- **diagonal** (Tensor) - The diagonal values.
Outputs:
Tensor, same type as input `x`. The shape same as `x`.
Tensor, has the same type and shape as input `x`.
Examples:
>>> x = Tensor([[[-1, 0], [0, 1]], [[-1, 0], [0, 1]], [[-1, 0], [0, 1]]], mindspore.float32)
@@ -72,7 +72,7 @@ class SequentialCell(Cell):
args (list, OrderedDict): List of subclasses of Cell.
Raises:
TypeError: If arg is not of type list or OrderedDict.
TypeError: If the type of the argument is not list or OrderedDict.
Inputs:
- **input** (Tensor) - Tensor with shape according to the first Cell in the sequence.
@@ -131,7 +131,7 @@ class Conv2d(_Conv):
Args:
in_channels (int): The number of input channels :math:`C_{in}`.
out_channels (int): The number of output channels :math:`C_{out}`.
kernel_size (Union[int, tuple[int]]): The data type is int or tuple with 2 integers. Specifies the height
kernel_size (Union[int, tuple[int]]): The data type is int or a tuple of 2 integers. Specifies the height
and width of the 2D convolution window. Single int means the value is for both the height and the width of
the kernel. A tuple of 2 ints means the first value is for the height and the other is for the
width of the kernel.
@@ -147,7 +147,7 @@ class Conv2d(_Conv):
last extra padding will be done from the bottom and the right side. If this mode is set, `padding`
must be 0.
- valid: Adopts the way of discarding. The possibly largest height and width of output will be returned
- valid: Adopts the way of discarding. The largest possible height and width of the output will be returned
without padding. Extra pixels will be discarded. If this mode is set, `padding`
must be 0.
@@ -158,7 +158,7 @@ class Conv2d(_Conv):
the padding of top, bottom, left and right is the same, equal to padding. If `padding` is a tuple
with four integers, the padding of top, bottom, left and right will be equal to padding[0],
padding[1], padding[2], and padding[3] accordingly. Default: 0.
dilation (Union[int, tuple[int]]): The data type is int or tuple with 2 integers. Specifies the dilation rate
dilation (Union[int, tuple[int]]): The data type is int or a tuple of 2 integers. Specifies the dilation rate
to use for dilated convolution. If set to be :math:`k > 1`, there will
be :math:`k - 1` pixels skipped for each sampling location. Its value should
be greater than or equal to 1 and bounded by the height and width of the
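A short usage sketch for the arguments above (shapes chosen arbitrarily):

```python
import numpy as np
import mindspore.nn as nn
from mindspore import Tensor

net = nn.Conv2d(120, 240, 4, has_bias=False, weight_init='normal')
input = Tensor(np.ones([1, 120, 1024, 640]).astype(np.float32))
output = net(input)  # NCHW output (1, 240, 1024, 640) with the default 'same' padding
```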
@@ -451,7 +451,7 @@ class Conv2dTranspose(_Conv):
Args:
in_channels (int): The number of channels in the input space.
out_channels (int): The number of channels in the output space.
kernel_size (Union[int, tuple]): int or tuple with 2 integers, which specifies the height
kernel_size (Union[int, tuple]): int or a tuple of 2 integers, which specifies the height
and width of the 2D convolution window. Single int means the value is for both the height and the width of
the kernel. A tuple of 2 ints means the first value is for the height and the other is for the
width of the kernel.
@@ -825,7 +825,7 @@ class DepthwiseConv2d(Cell):
Args:
in_channels (int): The number of input channels :math:`C_{in}`.
out_channels (int): The number of output channels :math:`C_{out}`.
kernel_size (Union[int, tuple[int]]): The data type is int or tuple with 2 integers. Specifies the height
kernel_size (Union[int, tuple[int]]): The data type is int or a tuple of 2 integers. Specifies the height
and width of the 2D convolution window. Single int means the value is for both the height and the width of
the kernel. A tuple of 2 ints means the first value is for the height and the other is for the
width of the kernel.
@@ -841,7 +841,7 @@ class DepthwiseConv2d(Cell):
last extra padding will be done from the bottom and the right side. If this mode is set, `padding`
must be 0.
- valid: Adopts the way of discarding. The possibly largest height and width of output will be returned
- valid: Adopts the way of discarding. The largest possible height and width of the output will be returned
without padding. Extra pixels will be discarded. If this mode is set, `padding`
must be 0.
@@ -849,16 +849,16 @@ class DepthwiseConv2d(Cell):
Tensor borders. `padding` should be greater than or equal to 0.
padding (int): Implicit paddings on both sides of the input. Default: 0.
dilation (Union[int, tuple[int]]): The data type is int or tuple with 2 integers. Specifies the dilation rate
dilation (Union[int, tuple[int]]): The data type is int or a tuple of 2 integers. Specifies the dilation rate
to use for dilated convolution. If set to be :math:`k > 1`, there will
be :math:`k - 1` pixels skipped for each sampling location. Its value should
be greater or equal to 1 and bounded by the height and width of the
be greater than or equal to 1 and bounded by the height and width of the
input. Default: 1.
group (int): Split filter into groups, `in_channels` and `out_channels` should be
divisible by the number of groups. Default: 1.
has_bias (bool): Specifies whether the layer uses a bias vector. Default: False.
weight_init (Union[Tensor, str, Initializer, numbers.Number]): Initializer for the convolution kernel.
It can be a Tensor, a string, an Initializer or a numbers.Number. When a string is specified,
It can be a Tensor, a string, an Initializer or a number. When a string is specified,
values from 'TruncatedNormal', 'Normal', 'Uniform', 'HeUniform' and 'XavierUniform' distributions as well
as constant 'One' and 'Zero' distributions are possible. Aliases 'xavier_uniform', 'he_uniform', 'ones'
and 'zeros' are acceptable. Uppercase and lowercase are both acceptable. Refer to the values of
@@ -36,7 +36,7 @@ class Embedding(Cell):
the corresponding word embeddings.
Note:
When 'use_one_hot' is set to True, the input should be of type mindspore.int32.
When 'use_one_hot' is set to True, the type of the input should be mindspore.int32.
Args:
vocab_size (int): Size of the dictionary of embeddings.
@@ -48,9 +48,9 @@ class Embedding(Cell):
dtype (:class:`mindspore.dtype`): Data type of input. Default: mindspore.float32.
Inputs:
- **input** (Tensor) - Tensor of shape :math:`(\text{batch_size}, \text{input_length})`. The element of
the Tensor should be integer and not larger than vocab_size. else the corresponding embedding vector is zero
if larger than vocab_size.
- **input** (Tensor) - Tensor of shape :math:`(\text{batch_size}, \text{input_length})`. The elements of
the Tensor should be integers and not larger than vocab_size. Otherwise, the corresponding embedding vector will
be zero.
Outputs:
Tensor of shape :math:`(\text{batch_size}, \text{input_length}, \text{embedding_size})`.
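A brief sketch of the shapes described above (vocabulary and embedding sizes are arbitrary):

```python
import numpy as np
import mindspore
import mindspore.nn as nn
from mindspore import Tensor

net = nn.Embedding(20000, 768, True)  # use_one_hot=True, so the input is int32
input_data = Tensor(np.ones([8, 128]), mindspore.int32)
output = net(input_data)  # shape (8, 128, 768)
```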
@@ -253,7 +253,7 @@ class MSSSIM(Cell):
Args:
max_val (Union[int, float]): The dynamic range of the pixel values (255 for 8-bit grayscale images).
Default: 1.0.
power_factors (Union[tuple, list]): Iterable of weights for each of the scales.
power_factors (Union[tuple, list]): Iterable of weights for each scale.
Default: (0.0448, 0.2856, 0.3001, 0.2363, 0.1333). Default values obtained by Wang et al.
filter_size (int): The size of the Gaussian filter. Default: 11.
filter_sigma (float): The standard deviation of Gaussian kernel. Default: 1.5.
@@ -35,7 +35,7 @@ class LSTM(Cell):
Applies an LSTM to the input.
There are two pipelines connecting two consecutive cells in an LSTM model; one is the cell state pipeline
and another is hidden state pipeline. Denote two consecutive time nodes as :math:`t-1` and :math:`t`.
and the other is the hidden state pipeline. Denote two consecutive time nodes as :math:`t-1` and :math:`t`.
Given an input :math:`x_t` at time :math:`t`, a hidden state :math:`h_{t-1}` and a cell
state :math:`c_{t-1}` of the layer at time :math:`{t-1}`, the cell state and hidden state at
time :math:`t` are computed using a gating mechanism. Input gate :math:`i_t` is designed to protect the cell
@@ -68,18 +68,17 @@ class LSTM(Cell):
input_size (int): Number of features of input.
hidden_size (int): Number of features of hidden layer.
num_layers (int): Number of layers of stacked LSTM. Default: 1.
has_bias (bool): Specifies whether has bias `b_ih` and `b_hh`. Default: True.
has_bias (bool): Whether the cell has bias `b_ih` and `b_hh`. Default: True.
batch_first (bool): Specifies whether the first dimension of input is batch_size. Default: False.
dropout (float, int): If not 0, append `Dropout` layer on the outputs of each
LSTM layer except the last layer. Default: 0. The range of dropout is [0.0, 1.0].
bidirectional (bool): Specifies whether this is a bidirectional LSTM. If set True,
number of directions will be 2 otherwise number of directions is 1. Default: False.
bidirectional (bool): Specifies whether it is a bidirectional LSTM. Default: False.
Inputs:
- **input** (Tensor) - Tensor of shape (seq_len, batch_size, `input_size`).
- **hx** (tuple) - A tuple of two Tensors (h_0, c_0) both of data type mindspore.float32 or
mindspore.float16 and shape (num_directions * `num_layers`, batch_size, `hidden_size`).
Data type of `hx` should be the same of `input`.
Data type of `hx` should be the same as `input`.
Outputs:
Tuple, a tuple contains (`output`, (`h_n`, `c_n`)).
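A minimal sketch of the input/output contract described above (sizes are arbitrary; `batch_first=True`, so the input is (batch, seq_len, input_size)):

```python
import numpy as np
import mindspore.nn as nn
from mindspore import Tensor

net = nn.LSTM(10, 16, 2, has_bias=True, batch_first=True, bidirectional=False)
input = Tensor(np.ones([3, 5, 10]).astype(np.float32))
# (num_directions * num_layers, batch_size, hidden_size)
h0 = Tensor(np.ones([1 * 2, 3, 16]).astype(np.float32))
c0 = Tensor(np.ones([1 * 2, 3, 16]).astype(np.float32))
output, (hn, cn) = net(input, (h0, c0))
```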
@@ -205,7 +204,7 @@ class LSTMCell(Cell):
Applies an LSTM layer to the input.
There are two pipelines connecting two consecutive cells in an LSTM model; one is the cell state pipeline
and another is hidden state pipeline. Denote two consecutive time nodes as :math:`t-1` and :math:`t`.
and the other is the hidden state pipeline. Denote two consecutive time nodes as :math:`t-1` and :math:`t`.
Given an input :math:`x_t` at time :math:`t`, a hidden state :math:`h_{t-1}` and a cell
state :math:`c_{t-1}` of the layer at time :math:`{t-1}`, the cell state and hidden state at
time :math:`t` are computed using a gating mechanism. Input gate :math:`i_t` is designed to protect the cell
@@ -238,7 +237,7 @@ class LSTMCell(Cell):
input_size (int): Number of features of input.
hidden_size (int): Number of features of hidden layer.
layer_index (int): Index of the current layer in the stacked LSTM. Default: 0.
has_bias (bool): Specifies whether has bias `b_ih` and `b_hh`. Default: True.
has_bias (bool): Whether the cell has bias `b_ih` and `b_hh`. Default: True.
batch_first (bool): Specifies whether the first dimension of input is batch_size. Default: False.
dropout (float, int): If not 0, append `Dropout` layer on the outputs of each
LSTM layer except the last layer. Default: 0. The range of dropout is [0.0, 1.0].
@@ -243,6 +243,10 @@ class BatchNorm1d(_BatchNorm):
.. math::
y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} * \gamma + \beta
Note:
The implementation of BatchNorm is different in graph mode and pynative mode; therefore, the mode is not
recommended to be changed after the net is initialized.
Args:
num_features (int): `C` from an expected input of size (N, C).
eps (float): A value added to the denominator for numerical stability. Default: 1e-5.
@@ -319,6 +323,10 @@ class BatchNorm2d(_BatchNorm):
.. math::
y = \frac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} * \gamma + \beta
Note:
The implementation of BatchNorm is different in graph mode and pynative mode; therefore, the mode cannot be
changed after the net is initialized.
Args:
num_features (int): `C` from an expected input of size (N, C, H, W).
eps (float): A value added to the denominator for numerical stability. Default: 1e-5.
@@ -384,8 +392,8 @@ class GlobalBatchNorm(_BatchNorm):
r"""
Global normalization layer over an N-dimensional input.
Global Normalization is cross device synchronized batch normalization. Batch Normalization implementation
only normalize the data within each device. Global normalization will normalize the input within the group.
Global Normalization is cross-device synchronized batch normalization. The implementation of Batch Normalization
only normalizes the data within each device. Global normalization will normalize the input within the group.
It has been described in the paper `Batch Normalization: Accelerating Deep Network Training by
Reducing Internal Covariate Shift <https://arxiv.org/abs/1502.03167>`_. It rescales and recenters the
feature using a mini-batch of data and the learned parameters which can be described in the following formula.
@@ -467,10 +475,10 @@ class LayerNorm(Cell):
Applies Layer Normalization over a mini-batch of inputs.
Layer normalization is widely used in recurrent neural networks. It applies
normalization over a mini-batch of inputs for each single training case as described
normalization on a mini-batch of inputs for each single training case as described
in the paper `Layer Normalization <https://arxiv.org/pdf/1607.06450.pdf>`_. Unlike batch
normalization, layer normalization performs exactly the same computation at training and
testing times. It can be described using the following formula. It is applied across all channels
testing time. It can be described using the following formula. It is applied across all channels
and pixels but only one batch size.
.. math::
@@ -545,7 +553,7 @@ class GroupNorm(Cell):
Group Normalization over a mini-batch of inputs.
Group normalization is widely used in recurrent neural networks. It applies
normalization over a mini-batch of inputs for each single training case as described
normalization on a mini-batch of inputs for each single training case as described
in the paper `Group Normalization <https://arxiv.org/pdf/1803.08494.pdf>`_. Group normalization
divides the channels into groups and computes within each group the mean and variance for normalization,
and it performs very stably over a wide range of batch sizes. It can be described using the following formula.
@@ -557,7 +565,7 @@ class GroupNorm(Cell):
num_groups (int): The number of groups to be divided along the channel dimension.
num_channels (int): The number of channels per group.
eps (float): A value added to the denominator for numerical stability. Default: 1e-5.
affine (bool): A bool value, this layer will has learnable affine parameters when set to true. Default: True.
affine (bool): A bool value, this layer will have learnable affine parameters when set to true. Default: True.
gamma_init (Union[Tensor, str, Initializer, numbers.Number]): Initializer for the gamma weight.
The values of str refer to the function `initializer` including 'zeros', 'ones', 'xavier_uniform',
'he_uniform', etc. Default: 'ones'.
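A minimal usage sketch of the group semantics above (2 groups over 2 channels; shapes are arbitrary):

```python
import numpy as np
import mindspore.nn as nn
from mindspore import Tensor

group_norm_op = nn.GroupNorm(2, 2)
x = Tensor(np.ones([1, 2, 4, 4], np.float32))
output = group_norm_op(x)  # same shape as x, normalized within each group
```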
@@ -61,7 +61,7 @@ class Conv2dBnAct(Cell):
Args:
in_channels (int): The number of input channels :math:`C_{in}`.
out_channels (int): The number of output channels :math:`C_{out}`.
kernel_size (Union[int, tuple]): The data type is int or tuple with 2 integers. Specifies the height
kernel_size (Union[int, tuple]): The data type is int or a tuple of 2 integers. Specifies the height
and width of the 2D convolution window. Single int means the value is for both height and width of
the kernel. A tuple of 2 ints means the first value is for the height and the other is for the
width of the kernel.
@@ -292,19 +292,19 @@ class BatchNormFoldCell(Cell):
class FakeQuantWithMinMax(Cell):
r"""
Quantization aware op. This OP provide Fake quantization observer function on data with min and max.
Quantization aware op. This OP provides the fake quantization observer function on data with min and max.
Args:
min_init (int, float): The dimension of channel or 1 (layer). Default: -6.
max_init (int, float): The dimension of channel or 1 (layer). Default: 6.
ema (bool): Exponential Moving Average algorithm update min and max. Default: False.
ema (bool): Whether the Exponential Moving Average algorithm updates min and max. Default: False.
ema_decay (float): Exponential Moving Average algorithm parameter. Default: 0.999.
per_channel (bool): Quantization granularity based on layer or on channel. Default: False.
channel_axis (int): Quantization by channel axis. Default: 1.
num_channels (int): Declares the min and max channel size. Default: 1.
num_bits (int): The quantization number bit, support 4 and 8bit. Default: 8.
symmetric (bool): The quantization algorithm is symmetric or not. Default: False.
narrow_range (bool): The quantization algorithm uses narrow range or not. Default: False.
num_bits (int): The bit number of quantization, supporting 4 and 8 bits. Default: 8.
symmetric (bool): Whether the quantization algorithm is symmetric or not. Default: False.
narrow_range (bool): Whether the quantization algorithm uses narrow range or not. Default: False.
quant_delay (int): Quantization delay parameters according to the global step. Default: 0.
Inputs:
@@ -431,7 +431,7 @@ class Conv2dBnFoldQuant(Cell):
variance vector. Default: 'ones'.
fake (bool): Whether Conv2dBnFoldQuant Cell adds FakeQuantWithMinMax op. Default: True.
per_channel (bool): FakeQuantWithMinMax parameters. Default: False.
num_bits (int): The quantization number bit, support 4 and 8bit. Default: 8.
num_bits (int): The bit number of quantization, supporting 4 and 8 bits. Default: 8.
symmetric (bool): The quantization algorithm is symmetric or not. Default: False.
narrow_range (bool): The quantization algorithm uses narrow range or not. Default: False.
quant_delay (int): The quantization delay parameters according to the global step. Default: 0.
@@ -614,7 +614,7 @@ class Conv2dBnWithoutFoldQuant(Cell):
Default: 'normal'.
bias_init (Union[Tensor, str, Initializer, numbers.Number]): Initializer for the bias vector. Default: 'zeros'.
per_channel (bool): FakeQuantWithMinMax parameters. Default: False.
num_bits (int): The quantization number bit, support 4 and 8bit. Default: 8.
num_bits (int): The bit number of quantization, supporting 4 and 8 bits. Default: 8.
symmetric (bool): The quantization algorithm is symmetric or not. Default: False.
narrow_range (bool): The quantization algorithm uses narrow range or not. Default: False.
quant_delay (int): Quantization delay parameters according to the global step. Default: 0.
@@ -736,7 +736,7 @@ class Conv2dQuant(Cell):
Default: 'normal'.
bias_init (Union[Tensor, str, Initializer, numbers.Number]): Initializer for the bias vector. Default: 'zeros'.
per_channel (bool): FakeQuantWithMinMax parameters. Default: False.
num_bits (int): The quantization number bit, support 4 and 8bit. Default: 8.
num_bits (int): The bit number of quantization, supporting 4 and 8 bits. Default: 8.
symmetric (bool): The quantization algorithm is symmetric or not. Default: False.
narrow_range (bool): The quantization algorithm uses narrow range or not. Default: False.
quant_delay (int): Quantization delay parameters according to the global step. Default: 0.
@@ -845,7 +845,7 @@ class DenseQuant(Cell):
has_bias (bool): Specifies whether the layer uses a bias vector. Default: True.
activation (str): The regularization function applied to the output of the layer, e.g. 'relu'. Default: None.
per_channel (bool): FakeQuantWithMinMax parameters. Default: False.
num_bits (int): The quantization number bit, support 4 and 8bit. Default: 8.
num_bits (int): The bit number of quantization, supporting 4 and 8 bits. Default: 8.
symmetric (bool): The quantization algorithm is symmetric or not. Default: False.
narrow_range (bool): The quantization algorithm uses narrow range or not. Default: False.
quant_delay (int): Quantization delay parameters according to the global step. Default: 0.
@@ -947,15 +947,14 @@ class ActQuant(_QuantActivation):
r"""
Quantization aware training activation function.
Add Fake Quant OP after activation. Not Recommand to used these cell for Fake Quant Op
Will climp the max range of the activation and the relu6 do the same operation.
This part is a more detailed overview of ReLU6 op.
Add the fake quant op to the end of activation op, by which the output of activation op will be truncated.
Please check `FakeQuantWithMinMax` for more details.
Args:
activation (Cell): Activation cell class.
ema_decay (float): Exponential Moving Average algorithm parameter. Default: 0.999.
per_channel (bool): Quantization granularity based on layer or on channel. Default: False.
num_bits (int): The quantization number bit, support 4 and 8bit. Default: 8.
num_bits (int): The bit number of quantization, supporting 4 and 8 bits. Default: 8.
symmetric (bool): The quantization algorithm is symmetric or not. Default: False.
narrow_range (bool): The quantization algorithm uses narrow range or not. Default: False.
quant_delay (int): Quantization delay parameters according to the global steps. Default: 0.
@@ -1010,7 +1009,7 @@ class LeakyReLUQuant(_QuantActivation):
activation (Cell): Activation cell class.
ema_decay (float): Exponential Moving Average algorithm parameter. Default: 0.999.
per_channel (bool): Quantization granularity based on layer or on channel. Default: False.
num_bits (int): The quantization number bit, support 4 and 8bit. Default: 8.
num_bits (int): The bit number of quantization, supporting 4 and 8 bits. Default: 8.
symmetric (bool): The quantization algorithm is symmetric or not. Default: False.
narrow_range (bool): The quantization algorithm uses narrow range or not. Default: False.
quant_delay (int): Quantization delay parameters according to the global step. Default: 0.
@@ -1080,9 +1079,9 @@ class HSwishQuant(_QuantActivation):
activation (Cell): Activation cell class.
ema_decay (float): Exponential Moving Average algorithm parameter. Default: 0.999.
per_channel (bool): Quantization granularity based on layer or on channel. Default: False.
num_bits (int): The quantization number bit, support 4 and 8bit. Default: 8.
symmetric (bool): The quantization algorithm is symmetric or not. Default: False.
narrow_range (bool): The quantization algorithm uses narrow range or not. Default: False.
num_bits (int): The bit number of quantization, supporting 4 and 8 bits. Default: 8.
symmetric (bool): Whether the quantization algorithm is symmetric or not. Default: False.
narrow_range (bool): Whether the quantization algorithm uses narrow range or not. Default: False.
quant_delay (int): Quantization delay parameters according to the global step. Default: 0.
Inputs:
@@ -1149,9 +1148,9 @@ class HSigmoidQuant(_QuantActivation):
activation (Cell): Activation cell class.
ema_decay (float): Exponential Moving Average algorithm parameter. Default: 0.999.
per_channel (bool): Quantization granularity based on layer or on channel. Default: False.
num_bits (int): The quantization number bit, support 4 and 8bit. Default: 8.
symmetric (bool): The quantization algorithm is symmetric or not. Default: False.
narrow_range (bool): The quantization algorithm uses narrow range or not. Default: False.
num_bits (int): The bit number of quantization, supporting 4 and 8 bits. Default: 8.
symmetric (bool): Whether the quantization algorithm is symmetric or not. Default: False.
narrow_range (bool): Whether the quantization algorithm uses narrow range or not. Default: False.
quant_delay (int): Quantization delay parameters according to the global step. Default: 0.
Inputs:
@@ -1217,7 +1216,7 @@ class TensorAddQuant(Cell):
Args:
ema_decay (float): Exponential Moving Average algorithm parameter. Default: 0.999.
per_channel (bool): Quantization granularity based on layer or on channel. Default: False.
num_bits (int): The quantization number bit, support 4 and 8bit. Default: 8.
num_bits (int): The bit number of quantization, supporting 4 and 8 bits. Default: 8.
symmetric (bool): The quantization algorithm is symmetric or not. Default: False.
narrow_range (bool): The quantization algorithm uses narrow range or not. Default: False.
quant_delay (int): Quantization delay parameters according to the global step. Default: 0.
@@ -1269,7 +1268,7 @@ class MulQuant(Cell):
Args:
ema_decay (float): Exponential Moving Average algorithm parameter. Default: 0.999.
per_channel (bool): Quantization granularity based on layer or on channel. Default: False.
num_bits (int): The quantization number bit, support 4 and 8bit. Default: 8.
num_bits (int): The bit number of quantization, supporting 4 and 8 bits. Default: 8.
symmetric (bool): The quantization algorithm is symmetric or not. Default: False.
narrow_range (bool): The quantization algorithm uses narrow range or not. Default: False.
quant_delay (int): Quantization delay parameters according to the global step. Default: 0.
@@ -80,7 +80,7 @@ class L1Loss(_Loss):
When argument reduction is 'sum', the sum of :math:`L(x, y)` will be returned. :math:`N` is the batch size.
Args:
reduction (str): Type of reduction to apply to loss. The optional values are "mean", "sum", "none".
reduction (str): Type of reduction to be applied to loss. The optional values are "mean", "sum", and "none".
Default: "mean".
Inputs:
@@ -107,7 +107,7 @@ class L1Loss(_Loss):
class MSELoss(_Loss):
r"""
MSELoss create a criterion to measures the mean squared error (squared L2-norm) between :math:`x` and :math:`y`
MSELoss creates a criterion to measure the mean squared error (squared L2-norm) between :math:`x` and :math:`y`
element-wise, where :math:`x` is the input and :math:`y` is the target.
For simplicity, let :math:`x` and :math:`y` be 1-dimensional Tensors with length :math:`N`,
@@ -120,7 +120,7 @@ class MSELoss(_Loss):
When argument reduction is 'sum', the sum of :math:`L(x, y)` will be returned. :math:`N` is the batch size.
Args:
reduction (str): Type of reduction to apply to loss. The optional values are "mean", "sum", "none".
reduction (str): Type of reduction to be applied to loss. The optional values are "mean", "sum", and "none".
Default: "mean".
Inputs:
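A minimal sketch of the reduction behavior described above (values chosen so the mean of squared errors is 1/3):

```python
import numpy as np
import mindspore
import mindspore.nn as nn
from mindspore import Tensor

loss = nn.MSELoss()  # reduction="mean" by default
input_data = Tensor(np.array([1, 2, 3]), mindspore.float32)
target_data = Tensor(np.array([1, 2, 2]), mindspore.float32)
output = loss(input_data, target_data)  # (0 + 0 + 1) / 3
```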
@@ -210,14 +210,14 @@ class SoftmaxCrossEntropyWithLogits(_Loss):
Note:
While the target classes are mutually exclusive, i.e., only one class is positive in the target, the predicted
probabilities need not be exclusive. All that is required is that the predicted probability distribution
probabilities need not be exclusive. It is only required that the predicted probability distribution
of entry is a valid one.
Args:
is_grad (bool): Specifies whether to calculate grad only. Default: True.
sparse (bool): Specifies whether labels use sparse format or not. Default: False.
reduction (Union[str, None]): Type of reduction to apply to loss. Support 'sum' or 'mean' If None,
do not reduction. Default: None.
reduction (Union[str, None]): Type of reduction to be applied to loss. Support 'sum' and 'mean'. If None,
do not perform reduction. Default: None.
smooth_factor (float): Label smoothing factor. It is an optional input which should be in range [0, 1].
Default: 0.
num_classes (int): The number of classes in the task. It is an optional input. Default: 2.
@@ -225,7 +225,7 @@ class SoftmaxCrossEntropyWithLogits(_Loss):
Inputs:
- **logits** (Tensor) - Tensor of shape (N, C).
- **labels** (Tensor) - Tensor of shape (N, ). If `sparse` is True, the type of
`labels` is mindspore.int32. If `sparse` is False, the type of `labels` is same as the type of `logits`.
`labels` is mindspore.int32. If `sparse` is False, the type of `labels` is the same as the type of `logits`.
Outputs:
Tensor, a tensor of the same shape as logits with the component-wise
@@ -282,8 +282,8 @@ class SoftmaxCrossEntropyExpand(Cell):
where :math:`x_i` is a 1D score Tensor and :math:`t_i` is the target class.
Note:
When argument sparse is set to True, the format of label is the index
range from :math:`0` to :math:`C - 1` instead of one-hot vectors.
When argument sparse is set to True, the format of the label is the index
ranging from :math:`0` to :math:`C - 1` instead of one-hot vectors.
Args:
sparse (bool): Specifies whether labels use sparse format or not. Default: False.
@@ -69,7 +69,7 @@ def names():
def get_metric_fn(name, *args, **kwargs):
"""
Gets the metric method base on the input name.
Gets the metric method based on the input name.
Args:
name (str): The name of the metric method. Refer to the '__factory__'
@@ -82,7 +82,7 @@ class Metric(metaclass=ABCMeta):
@abstractmethod
def clear(self):
"""
A interface describes the behavior of clearing the internal evaluation result.
An interface describing the behavior of clearing the internal evaluation result.
Note:
All subclasses should override this interface.
@@ -92,7 +92,7 @@ class Metric(metaclass=ABCMeta):
@abstractmethod
def eval(self):
"""
A interface describes the behavior of computing the evaluation result.
An interface describing the behavior of computing the evaluation result.
Note:
All subclasses should override this interface.
@@ -102,7 +102,7 @@ class Metric(metaclass=ABCMeta):
@abstractmethod
def update(self, *inputs):
"""
A interface describes the behavior of updating the internal evaluation result.
An interface describing the behavior of updating the internal evaluation result.
Note:
All subclasses should override this interface.
@@ -36,8 +36,8 @@ def _update_run_op(beta1, beta2, eps, lr, weight_decay, param, m, v, gradient, d
Update parameters.
Args:
beta1 (Tensor): The exponential decay rate for the 1st moment estimates. Should be in range (0.0, 1.0).
beta2 (Tensor): The exponential decay rate for the 2nd moment estimates. Should be in range (0.0, 1.0).
beta1 (Tensor): The exponential decay rate for the 1st moment estimations. Should be in range (0.0, 1.0).
beta2 (Tensor): The exponential decay rate for the 2nd moment estimations. Should be in range (0.0, 1.0).
eps (Tensor): Term added to the denominator to improve numerical stability. Should be greater than 0.
lr (Tensor): Learning rate.
weight_decay (Number): Weight decay. Should be equal to or greater than 0.
@@ -180,12 +180,12 @@ class Adam(Optimizer):
the order will be followed in the optimizer. There are no other keys in the `dict`, and the parameters
in 'order_params' should be in one of the group parameters.
learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]): A value or graph for the learning rate.
When the learning_rate is a Iterable or a Tensor with dimension of 1, use the dynamic learning rate, then
learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]): A value or a graph for the learning rate.
When the learning_rate is an Iterable or a one-dimensional Tensor, use the dynamic learning rate; then
the i-th step will take the i-th value as the learning rate. When the learning_rate is LearningRateSchedule,
use dynamic learning rate, the i-th learning rate will be calculated during the process of training
according to the formula of LearningRateSchedule. When the learning_rate is a float or a Tensor with
dimension of 0, use fixed learning rate. Other cases are not supported. The float learning rate should be
according to the formula of LearningRateSchedule. When the learning_rate is a float or a zero-dimensional
Tensor, use fixed learning rate. Other cases are not supported. The float learning rate should be
equal to or greater than 0. If the type of `learning_rate` is int, it will be converted to float.
Default: 1e-3.
beta1 (float): The exponential decay rate for the 1st moment estimations. Should be in range (0.0, 1.0).
@@ -195,11 +195,11 @@ class Adam(Optimizer):
eps (float): Term added to the denominator to improve numerical stability. Should be greater than 0. Default:
1e-8.
use_locking (bool): Whether to enable a lock to protect updating variable tensors.
If True, updating of the var, m, and v tensors will be protected by a lock.
If False, the result is unpredictable. Default: False.
If true, updates of the var, m, and v tensors will be protected by a lock.
If false, the result is unpredictable. Default: False.
use_nesterov (bool): Whether to use Nesterov Accelerated Gradient (NAG) algorithm to update the gradients.
If True, update the gradients using NAG.
If False, update the gradients without using NAG. Default: False.
If true, update the gradients using NAG.
If false, update the gradients without using NAG. Default: False.
weight_decay (float): Weight decay (L2 penalty). It should be equal to or greater than 0. Default: 0.0.
loss_scale (float): A floating point value for the loss scale. Should be greater than 0. Default: 1.0.
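A minimal construction sketch for the arguments above (the network is a stand-in; hyperparameters are the documented defaults):

```python
import mindspore.nn as nn

net = nn.Dense(3, 4)  # stand-in for any user-defined network (Cell)
optim = nn.Adam(params=net.trainable_params(), learning_rate=1e-3,
                beta1=0.9, beta2=0.999, eps=1e-8, weight_decay=0.0)
# The optimizer is then typically passed to mindspore.train.Model together
# with the network and a loss function.
```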
| @@ -304,12 +304,12 @@ class AdamWeightDecay(Optimizer): | |||
| the order will be followed in the optimizer. There are no other keys in the `dict` and the parameters | |||
| which in the 'order_params' should be in one of group parameters. | |||
| learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]): A value or graph for the learning rate. | |||
| When the learning_rate is a Iterable or a Tensor with dimension of 1, use the dynamic learning rate, then | |||
| learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]): A value or a graph for the learning rate. | |||
| When the learning_rate is an Iterable or a Tensor in a 1D dimension, use the dynamic learning rate, then | |||
| the i-th step will take the i-th value as the learning rate. When the learning_rate is LearningRateSchedule, | |||
| use dynamic learning rate, the i-th learning rate will be calculated during the process of training | |||
| according to the formula of LearningRateSchedule. When the learning_rate is a float or a Tensor with | |||
| dimension of 0, use fixed learning rate. Other cases are not supported. The float learning rate should be | |||
| according to the formula of LearningRateSchedule. When the learning_rate is a float or a 0-D | |||
| Tensor, use fixed learning rate. Other cases are not supported. The float learning rate should be | |||
| equal to or greater than 0. If the type of `learning_rate` is int, it will be converted to float. | |||
| Default: 1e-3. | |||
| beta1 (float): The exponential decay rate for the 1st moment estimations. Default: 0.9. | |||
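The group-parameter `dict` form described above can be sketched as follows; the split between decay and no-decay parameters is an assumption made for illustration:

```python
import mindspore.nn as nn

net = nn.Dense(3, 2)  # stand-in network
all_params = net.trainable_params()
decay_params = [p for p in all_params if 'bias' not in p.name]
other_params = [p for p in all_params if 'bias' in p.name]

group_params = [{'params': decay_params, 'weight_decay': 0.01},
                {'params': other_params},
                {'order_params': all_params}]  # fixes the update order
opt = nn.AdamWeightDecay(group_params, learning_rate=1e-3)
```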
| @@ -114,12 +114,12 @@ class FTRL(Optimizer): | |||
| than or equal to zero. Use fixed learning rate if lr_power is zero. Default: -0.5. | |||
| l1 (float): l1 regularization strength, must be greater than or equal to zero. Default: 0.0. | |||
| l2 (float): l2 regularization strength, must be greater than or equal to zero. Default: 0.0. | |||
| use_locking (bool): If True use locks for update operation. Default: False. | |||
| use_locking (bool): If True, use locks for the update operation. Default: False. | |||
| loss_scale (float): Value for the loss scale. It should be equal to or greater than 1.0. Default: 1.0. | |||
| weight_decay (float): Weight decay value to multiply weight, must be zero or positive value. Default: 0.0. | |||
| Inputs: | |||
| - **grads** (tuple[Tensor]) - The gradients of `params` in optimizer, the shape is as same as the `params` | |||
| - **grads** (tuple[Tensor]) - The gradients of `params` in the optimizer, the shape is the same as the `params` | |||
| in the optimizer. | |||
| Outputs: | |||
| @@ -39,8 +39,8 @@ def _update_run_op(beta1, beta2, eps, global_step, lr, weight_decay, param, m, v | |||
| Update parameters. | |||
| Args: | |||
| beta1 (Tensor): The exponential decay rate for the 1st moment estimates. Should be in range (0.0, 1.0). | |||
| beta2 (Tensor): The exponential decay rate for the 2nd moment estimates. Should be in range (0.0, 1.0). | |||
| beta1 (Tensor): The exponential decay rate for the 1st moment estimations. Should be in range (0.0, 1.0). | |||
| beta2 (Tensor): The exponential decay rate for the 2nd moment estimations. Should be in range (0.0, 1.0). | |||
| eps (Tensor): Term added to the denominator to improve numerical stability. Should be greater than 0. | |||
| lr (Tensor): Learning rate. | |||
| weight_decay (Number): Weight decay. Should be equal to or greater than 0. | |||
| @@ -122,8 +122,8 @@ def _update_run_op_graph_kernel(beta1, beta2, eps, global_step, lr, weight_decay | |||
| Update parameters. | |||
| Args: | |||
| beta1 (Tensor): The exponential decay rate for the 1st moment estimates. Should be in range (0.0, 1.0). | |||
| beta2 (Tensor): The exponential decay rate for the 2nd moment estimates. Should be in range (0.0, 1.0). | |||
| beta1 (Tensor): The exponential decay rate for the 1st moment estimations. Should be in range (0.0, 1.0). | |||
| beta2 (Tensor): The exponential decay rate for the 2nd moment estimations. Should be in range (0.0, 1.0). | |||
| eps (Tensor): Term added to the denominator to improve numerical stability. Should be greater than 0. | |||
| lr (Tensor): Learning rate. | |||
| weight_decay (Number): Weight decay. Should be equal to or greater than 0. | |||
| @@ -184,7 +184,7 @@ def _check_param_value(beta1, beta2, eps, prim_name): | |||
| class Lamb(Optimizer): | |||
| """ | |||
| Lamb Dynamic LR. | |||
| Lamb Dynamic Learning Rate. | |||
| LAMB is an optimization algorithm employing a layerwise adaptive large batch | |||
| optimization technique. Refer to the paper `LARGE BATCH OPTIMIZATION FOR DEEP LEARNING: TRAINING BERT IN 76 | |||
| @@ -214,16 +214,16 @@ class Lamb(Optimizer): | |||
| the order will be followed in optimizer. There are no other keys in the `dict` and the parameters which | |||
| in the value of 'order_params' should be in one of group parameters. | |||
| learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]): A value or graph for the learning rate. | |||
| When the learning_rate is a Iterable or a Tensor with dimension of 1, use dynamic learning rate, then | |||
| learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]): A value or a graph for the learning rate. | |||
| When the learning_rate is an Iterable or a 1-D Tensor, use dynamic learning rate, then | |||
| the i-th step will take the i-th value as the learning rate. When the learning_rate is LearningRateSchedule, | |||
| use dynamic learning rate, the i-th learning rate will be calculated during the process of training | |||
| according to the formula of LearningRateSchedule. When the learning_rate is a float or a Tensor with | |||
| dimension of 0, use fixed learning rate. Other cases are not supported. The float learning rate should be | |||
| according to the formula of LearningRateSchedule. When the learning_rate is a float or a 0-D | |||
| Tensor, use fixed learning rate. Other cases are not supported. The float learning rate should be | |||
| equal to or greater than 0. If the type of `learning_rate` is int, it will be converted to float. | |||
| beta1 (float): The exponential decay rate for the 1st moment estimates. Default: 0.9. | |||
| beta1 (float): The exponential decay rate for the 1st moment estimations. Default: 0.9. | |||
| Should be in range (0.0, 1.0). | |||
| beta2 (float): The exponential decay rate for the 2nd moment estimates. Default: 0.999. | |||
| beta2 (float): The exponential decay rate for the 2nd moment estimations. Default: 0.999. | |||
| Should be in range (0.0, 1.0). | |||
| eps (float): Term added to the denominator to improve numerical stability. Default: 1e-6. | |||
| Should be greater than 0. | |||
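For the `LearningRateSchedule` case, a hedged sketch assuming the built-in `ExponentialDecayLR` schedule class is available in this version; the i-th learning rate is computed from the schedule's formula rather than read from a list:

```python
import mindspore.nn as nn

net = nn.Dense(3, 2)  # stand-in network
# lr_i = 0.1 * 0.9^(i / 100), computed on the fly at each step.
schedule = nn.ExponentialDecayLR(learning_rate=0.1, decay_rate=0.9, decay_steps=100)
opt = nn.Lamb(net.trainable_params(), learning_rate=schedule)
```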
| @@ -58,12 +58,12 @@ class LARS(Optimizer): | |||
| epsilon (float): Term added to the denominator to improve numerical stability. Default: 1e-05. | |||
| coefficient (float): Trust coefficient for calculating the local learning rate. Default: 0.001. | |||
| use_clip (bool): Whether to use clip operation for calculating the local learning rate. Default: False. | |||
| lars_filter (Function): A function to determine whether apply lars algorithm. Default: | |||
| lars_filter (Function): A function to determine whether to apply the LARS algorithm. Default: | |||
| lambda x: 'LayerNorm' not in x.name and 'bias' not in x.name. | |||
| Inputs: | |||
| - **gradients** (tuple[Tensor]) - The gradients of `params` in optimizer, the shape is | |||
| as same as the `params` in optimizer. | |||
| - **gradients** (tuple[Tensor]) - The gradients of `params` in the optimizer, the shape is the | |||
| same as the `params` in the optimizer. | |||
| Outputs: | |||
| Union[Tensor[bool], tuple[Parameter]], it depends on the output of `optimizer`. | |||
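Since LARS wraps another optimizer, a minimal usage sketch (the filter simply mirrors the default quoted above):

```python
import mindspore.nn as nn

net = nn.Dense(3, 2)  # stand-in network
sgd = nn.Momentum(net.trainable_params(), learning_rate=0.1, momentum=0.9)
opt = nn.LARS(sgd, epsilon=1e-05, coefficient=0.001,
              lars_filter=lambda x: 'LayerNorm' not in x.name and 'bias' not in x.name)
```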
| @@ -127,26 +127,26 @@ class LazyAdam(Optimizer): | |||
| the order will be followed in optimizer. There are no other keys in the `dict` and the parameters which | |||
| in the value of 'order_params' should be in one of group parameters. | |||
| learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]): A value or graph for the learning rate. | |||
| When the learning_rate is a Iterable or a Tensor with dimension of 1, use dynamic learning rate, then | |||
| learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]): A value or a graph for the learning rate. | |||
| When the learning_rate is an Iterable or a 1-D Tensor, use dynamic learning rate, then | |||
| the i-th step will take the i-th value as the learning rate. When the learning_rate is LearningRateSchedule, | |||
| use dynamic learning rate, the i-th learning rate will be calculated during the process of training | |||
| according to the formula of LearningRateSchedule. When the learning_rate is a float or a Tensor with | |||
| dimension of 0, use fixed learning rate. Other cases are not supported. The float learning rate should be | |||
| according to the formula of LearningRateSchedule. When the learning_rate is a float or a 0-D | |||
| Tensor, use fixed learning rate. Other cases are not supported. The float learning rate should be | |||
| equal to or greater than 0. If the type of `learning_rate` is int, it will be converted to float. | |||
| Default: 1e-3. | |||
| beta1 (float): The exponential decay rate for the 1st moment estimates. Should be in range (0.0, 1.0). Default: | |||
| 0.9. | |||
| beta2 (float): The exponential decay rate for the 2nd moment estimates. Should be in range (0.0, 1.0). Default: | |||
| 0.999. | |||
| beta1 (float): The exponential decay rate for the 1st moment estimations. Should be in range (0.0, 1.0). | |||
| Default: 0.9. | |||
| beta2 (float): The exponential decay rate for the 2nd moment estimations. Should be in range (0.0, 1.0). | |||
| Default: 0.999. | |||
| eps (float): Term added to the denominator to improve numerical stability. Should be greater than 0. Default: | |||
| 1e-8. | |||
| use_locking (bool): Whether to enable a lock to protect updating variable tensors. | |||
| If True, updating of the var, m, and v tensors will be protected by a lock. | |||
| If False, the result is unpredictable. Default: False. | |||
| If true, updates of the var, m, and v tensors will be protected by a lock. | |||
| If false, the result is unpredictable. Default: False. | |||
| use_nesterov (bool): Whether to use Nesterov Accelerated Gradient (NAG) algorithm to update the gradients. | |||
| If True, updates the gradients using NAG. | |||
| If False, updates the gradients without using NAG. Default: False. | |||
| If true, update the gradients using NAG. | |||
| If false, update the gradients without using NAG. Default: False. | |||
| weight_decay (float): Weight decay (L2 penalty). Default: 0.0. | |||
| loss_scale (float): A floating point value for the loss scale. Should be equal to or greater than 1. Default: | |||
| 1.0. | |||
| @@ -83,12 +83,12 @@ class Momentum(Optimizer): | |||
| the order will be followed in optimizer. There are no other keys in the `dict` and the parameters which | |||
| in the value of 'order_params' should be in one of group parameters. | |||
| learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]): A value or graph for the learning rate. | |||
| When the learning_rate is a Iterable or a Tensor with dimension of 1, use dynamic learning rate, then | |||
| learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]): A value or a graph for the learning rate. | |||
| When the learning_rate is an Iterable or a 1-D Tensor, use dynamic learning rate, then | |||
| the i-th step will take the i-th value as the learning rate. When the learning_rate is LearningRateSchedule, | |||
| use dynamic learning rate, the i-th learning rate will be calculated during the process of training | |||
| according to the formula of LearningRateSchedule. When the learning_rate is a float or a Tensor with | |||
| dimension of 0, use fixed learning rate. Other cases are not supported. The float learning rate should be | |||
| according to the formula of LearningRateSchedule. When the learning_rate is a float or a 0-D | |||
| Tensor, use fixed learning rate. Other cases are not supported. The float learning rate should be | |||
| equal to or greater than 0. If the type of `learning_rate` is int, it will be converted to float. | |||
| momentum (float): Hyperparameter of type float, means momentum for the moving average. | |||
| It should be at least 0.0. | |||
| @@ -40,8 +40,6 @@ class Optimizer(Cell): | |||
| """ | |||
| Base class for all optimizers. | |||
| This class defines the API to add Ops to train a model. | |||
| Note: | |||
| This class defines the API to add Ops to train a model. Never use | |||
| this class directly, but instead instantiate one of its subclasses. | |||
| @@ -55,12 +53,12 @@ class Optimizer(Cell): | |||
| To improve parameter groups performance, the customized order of parameters can be supported. | |||
| Args: | |||
| learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]): A value or graph for the learning | |||
| rate. When the learning_rate is a Iterable or a Tensor with dimension of 1, use dynamic learning rate, then | |||
| learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]): A value or a graph for the learning | |||
| rate. When the learning_rate is an Iterable or a 1-D Tensor, use dynamic learning rate, then | |||
| the i-th step will take the i-th value as the learning rate. When the learning_rate is LearningRateSchedule, | |||
| use dynamic learning rate, the i-th learning rate will be calculated during the process of training | |||
| according to the formula of LearningRateSchedule. When the learning_rate is a float or a Tensor with | |||
| dimension of 0, use fixed learning rate. Other cases are not supported. The float learning rate should be | |||
| according to the formula of LearningRateSchedule. When the learning_rate is a float or a 0-D | |||
| Tensor, use fixed learning rate. Other cases are not supported. The float learning rate should be | |||
| equal to or greater than 0. If the type of `learning_rate` is int, it will be converted to float. | |||
| parameters (Union[list[Parameter], list[dict]]): When the `parameters` is a list of `Parameter` which will be | |||
| updated, the element in `parameters` should be class `Parameter`. When the `parameters` is a list of `dict`, | |||
| @@ -84,8 +82,8 @@ class Optimizer(Cell): | |||
| type of `loss_scale` input is int, it will be converted to float. Default: 1.0. | |||
| Raises: | |||
| ValueError: If the learning_rate is a Tensor, but the dims of tensor is greater than 1. | |||
| TypeError: If the learning_rate is not any of the three types: float, Tensor, Iterable. | |||
| ValueError: If the learning_rate is a Tensor, but the dimension of tensor is greater than 1. | |||
| TypeError: If the learning_rate is not one of the three types: float, Tensor, or Iterable. | |||
| """ | |||
| def __init__(self, learning_rate, parameters, weight_decay=0.0, loss_scale=1.0): | |||
| @@ -179,7 +177,7 @@ class Optimizer(Cell): | |||
| An approach to reduce the overfitting of a deep learning neural network model. | |||
| Args: | |||
| gradients (tuple[Tensor]): The gradients of `self.parameters`, and have the same shape with | |||
| gradients (tuple[Tensor]): The gradients of `self.parameters`, and have the same shape as | |||
| `self.parameters`. | |||
| Returns: | |||
| @@ -204,7 +202,7 @@ class Optimizer(Cell): | |||
| network. | |||
| Args: | |||
| gradients (tuple[Tensor]): The gradients of `self.parameters`, and have the same shape with | |||
| gradients (tuple[Tensor]): The gradients of `self.parameters`, and have the same shape as | |||
| `self.parameters`. | |||
| Returns: | |||
| @@ -87,22 +87,22 @@ class ProximalAdagrad(Optimizer): | |||
| in the value of 'order_params' should be in one of group parameters. | |||
| accum (float): The starting value for accumulators, must be zero or positive values. Default: 0.1. | |||
| learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]): A value or graph for the learning rate. | |||
| When the learning_rate is a Iterable or a Tensor with dimension of 1, use dynamic learning rate, then | |||
| learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]): A value or a graph for the learning rate. | |||
| When the learning_rate is an Iterable or a 1-D Tensor, use dynamic learning rate, then | |||
| the i-th step will take the i-th value as the learning rate. When the learning_rate is LearningRateSchedule, | |||
| use dynamic learning rate, the i-th learning rate will be calculated during the process of training | |||
| according to the formula of LearningRateSchedule. When the learning_rate is a float or a Tensor with | |||
| dimension of 0, use fixed learning rate. Other cases are not supported. The float learning rate should be | |||
| according to the formula of LearningRateSchedule. When the learning_rate is a float or a 0-D | |||
| Tensor, use fixed learning rate. Other cases are not supported. The float learning rate should be | |||
| equal to or greater than 0. If the type of `learning_rate` is int, it will be converted to float. | |||
| Default: 0.001. | |||
| l1 (float): l1 regularization strength, must be greater than or equal to zero. Default: 0.0. | |||
| l2 (float): l2 regularization strength, must be greater than or equal to zero. Default: 0.0. | |||
| use_locking (bool): If True use locks for update operation. Default: False. | |||
| use_locking (bool): If True, use locks for the update operation. Default: False. | |||
| loss_scale (float): Value for the loss scale. It should be greater than 0.0. Default: 1.0. | |||
| weight_decay (float): Weight decay value to multiply weight, must be zero or positive value. Default: 0.0. | |||
| Inputs: | |||
| - **grads** (tuple[Tensor]) - The gradients of `params` in optimizer, the shape is as same as the `params` | |||
| - **grads** (tuple[Tensor]) - The gradients of `params` in the optimizer, the shape is the same as the `params` | |||
| in the optimizer. | |||
| Outputs: | |||
| @@ -106,12 +106,12 @@ class RMSProp(Optimizer): | |||
| the order will be followed in optimizer. There are no other keys in the `dict` and the parameters which | |||
| in the value of 'order_params' should be in one of group parameters. | |||
| learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]): A value or graph for the learning rate. | |||
| When the learning_rate is a Iterable or a Tensor with dimension of 1, use dynamic learning rate, then | |||
| learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]): A value or a graph for the learning rate. | |||
| When the learning_rate is an Iterable or a 1-D Tensor, use dynamic learning rate, then | |||
| the i-th step will take the i-th value as the learning rate. When the learning_rate is LearningRateSchedule, | |||
| use dynamic learning rate, the i-th learning rate will be calculated during the process of training | |||
| according to the formula of LearningRateSchedule. When the learning_rate is a float or a Tensor with | |||
| dimension of 0, use fixed learning rate. Other cases are not supported. The float learning rate should be | |||
| according to the formula of LearningRateSchedule. When the learning_rate is a float or a 0-D | |||
| Tensor, use fixed learning rate. Other cases are not supported. The float learning rate should be | |||
| equal to or greater than 0. If the type of `learning_rate` is int, it will be converted to float. | |||
| Default: 0.1. | |||
| decay (float): Decay rate. Should be equal to or greater than 0. Default: 0.9. | |||
| @@ -78,12 +78,12 @@ class SGD(Optimizer): | |||
| the order will be followed in optimizer. There are no other keys in the `dict` and the parameters which | |||
| in the value of 'order_params' should be in one of group parameters. | |||
| learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]): A value or graph for the learning rate. | |||
| When the learning_rate is a Iterable or a Tensor with dimension of 1, use dynamic learning rate, then | |||
| learning_rate (Union[float, Tensor, Iterable, LearningRateSchedule]): A value or a graph for the learning rate. | |||
| When the learning_rate is an Iterable or a 1-D Tensor, use dynamic learning rate, then | |||
| the i-th step will take the i-th value as the learning rate. When the learning_rate is LearningRateSchedule, | |||
| use dynamic learning rate, the i-th learning rate will be calculated during the process of training | |||
| according to the formula of LearningRateSchedule. When the learning_rate is a float or a Tensor with | |||
| dimension of 0, use fixed learning rate. Other cases are not supported. The float learning rate should be | |||
| according to the formula of LearningRateSchedule. When the learning_rate is a float or a 0-D | |||
| Tensor, use fixed learning rate. Other cases are not supported. The float learning rate should be | |||
| equal to or greater than 0. If the type of `learning_rate` is int, it will be converted to float. | |||
| Default: 0.1. | |||
| momentum (float): A floating point value for the momentum. It should be at least 0.0. Default: 0.0. | |||
| @@ -138,9 +138,9 @@ class TrainOneStepCell(Cell): | |||
| r""" | |||
| Network training package class. | |||
| Wraps the network with an optimizer. The resulting Cell be trained with input *inputs. | |||
| Backward graph will be created in the construct function to do parameter updating. Different | |||
| parallel modes are available to run the training. | |||
| Wraps the network with an optimizer. The resulting Cell is trained with input *inputs. | |||
| The backward graph will be created in the construct function to update the parameters. Different | |||
| parallel modes are available for training. | |||
| Args: | |||
| network (Cell): The training network. | |||
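Putting the wrapping together, a hedged sketch of a complete training step; the loss cell, shapes, and data are stand-ins:

```python
import numpy as np
import mindspore.nn as nn
from mindspore import Tensor

net = nn.Dense(3, 2)                                  # stand-in network
loss_fn = nn.SoftmaxCrossEntropyWithLogits()          # stand-in loss
opt = nn.Momentum(net.trainable_params(), learning_rate=0.1, momentum=0.9)

net_with_loss = nn.WithLossCell(net, loss_fn)
train_net = nn.TrainOneStepCell(net_with_loss, opt)   # backward graph built here
train_net.set_train()

data = Tensor(np.random.rand(4, 3).astype(np.float32))
label = Tensor(np.eye(2)[[0, 1, 0, 1]].astype(np.float32))  # one-hot labels
loss = train_net(data, label)                         # one optimization step
```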
| @@ -231,14 +231,14 @@ class DataWrapper(Cell): | |||
| class GetNextSingleOp(Cell): | |||
| """ | |||
| Cell to run get next operation. | |||
| Cell to run the GetNext operation and fetch data from the dataset queue. | |||
| Args: | |||
| dataset_types (list[:class:`mindspore.dtype`]): The types of dataset. | |||
| dataset_shapes (list[tuple[int]]): The shapes of dataset. | |||
| queue_name (str): Queue name to fetch the data. | |||
| Detailed information, please refer to `ops.operations.GetNext`. | |||
| For detailed information, refer to `ops.operations.GetNext`. | |||
| """ | |||
| def __init__(self, dataset_types, dataset_shapes, queue_name): | |||
| @@ -360,7 +360,7 @@ class ParameterUpdate(Cell): | |||
| param (Parameter): The parameter to be updated manually. | |||
| Raises: | |||
| KeyError: If parameter with the specified name do not exist. | |||
| KeyError: If the parameter with the specified name does not exist. | |||
| Examples: | |||
| >>> network = Net() | |||
| @@ -329,7 +329,7 @@ class DistributedGradReducer(Cell): | |||
| def construct(self, grads): | |||
| """ | |||
| In some circumstances, the data precision of grads could be mixed with float16 and float32. Thus, the | |||
| Under certain circumstances, the data precision of grads could be mixed with float16 and float32. Thus, the | |||
| result of AllReduce is unreliable. To solve the problem, grads should be cast to float32 before AllReduce, | |||
| and cast back after the operation. | |||
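A per-gradient sketch of that workaround; illustrative only, since the real reducer also fuses and groups the communication:

```python
import mindspore.common.dtype as mstype
from mindspore.ops import functional as F
from mindspore.ops import operations as P

all_reduce = P.AllReduce()

def reduce_grad(grad):
    # AllReduce in float32, then cast back to the gradient's original dtype.
    origin_dtype = F.dtype(grad)
    reduced = all_reduce(F.cast(grad, mstype.float32))
    return F.cast(reduced, origin_dtype)
```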
| @@ -54,8 +54,8 @@ class DynamicLossScaleUpdateCell(Cell): | |||
| Dynamic Loss scale update cell. | |||
| For loss scaling training, the initial loss scaling value will be set to be `loss_scale_value`. | |||
| In every training step, the loss scaling value will be updated by loss scaling value/`scale_factor` | |||
| when there is overflow. And it will be increased by loss scaling value * `scale_factor` if there is no | |||
| In each training step, the loss scaling value will be updated by loss scaling value/`scale_factor` | |||
| when there is an overflow. And it will be increased by loss scaling value * `scale_factor` if there is no | |||
| overflow for `scale_window` consecutive steps. This cell is used for Graph mode training, in which all | |||
| logic will be executed on the device side (another training mode is the normal, non-sink, mode in which | |||
| some logic will be executed on the host). | |||
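The update rule itself is simple; a plain-Python sketch of the logic described above, with state handling simplified for illustration:

```python
def update_loss_scale(scale, overflow, good_steps,
                      scale_factor=2.0, scale_window=1000):
    """Return the new (scale, good_steps) pair for one training step."""
    if overflow:
        return max(scale / scale_factor, 1.0), 0   # shrink on overflow
    good_steps += 1
    if good_steps >= scale_window:
        return scale * scale_factor, 0             # grow after a clean window
    return scale, good_steps
```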
| @@ -133,7 +133,7 @@ class FixedLossScaleUpdateCell(Cell): | |||
| """ | |||
| Static scale update cell, the loss scaling value will not be updated. | |||
| For usage please refer to `DynamicLossScaleUpdateCell`. | |||
| For usage, refer to `DynamicLossScaleUpdateCell`. | |||
| Args: | |||
| loss_scale_value (float): Init loss scale. | |||
| @@ -57,7 +57,7 @@ class _TupleGetItemTensor(base.TupleGetItemTensor_): | |||
| data (tuple): A tuple of items. | |||
| index (Tensor): The index in tensor. | |||
| Outputs: | |||
| Type, is same as the element type of data. | |||
| Type, is the same as the element type of data. | |||
| """ | |||
| def __init__(self, name): | |||
| @@ -81,7 +81,7 @@ def _tuple_getitem_by_number(data, number_index): | |||
| number_index (Number): Index in scalar. | |||
| Outputs: | |||
| Type, is same as the element type of data. | |||
| Type, is the same as the element type of data. | |||
| """ | |||
| return F.tuple_getitem(data, number_index) | |||
| @@ -96,7 +96,7 @@ def _tuple_getitem_by_slice(data, slice_index): | |||
| slice_index (Slice): Index in slice. | |||
| Outputs: | |||
| Tuple, element type is same as the element type of data. | |||
| Tuple, element type is the same as the element type of data. | |||
| """ | |||
| return _tuple_slice(data, slice_index) | |||
| @@ -111,7 +111,7 @@ def _tuple_getitem_by_tensor(data, tensor_index): | |||
| tensor_index (Tensor): Index to select item. | |||
| Outputs: | |||
| Type, is same as the element type of data. | |||
| Type, is the same as the element type of data. | |||
| """ | |||
| return _tuple_get_item_tensor(data, tensor_index) | |||
| @@ -126,7 +126,7 @@ def _list_getitem_by_number(data, number_index): | |||
| number_index (Number): Index in scalar. | |||
| Outputs: | |||
| Type is same as the element type of data. | |||
| Type is the same as the element type of data. | |||
| """ | |||
| return F.list_getitem(data, number_index) | |||
| @@ -186,7 +186,7 @@ def _tensor_getitem_by_slice(data, slice_index): | |||
| slice_index (Slice): Index in slice. | |||
| Outputs: | |||
| Tensor, element type is same as the element type of data. | |||
| Tensor, element type is the same as the element type of data. | |||
| """ | |||
| return compile_utils.tensor_index_by_slice(data, slice_index) | |||
| @@ -201,7 +201,7 @@ def _tensor_getitem_by_tensor(data, tensor_index): | |||
| tensor_index (Tensor): An index expressed by tensor. | |||
| Outputs: | |||
| Tensor, element type is same as the element type of data. | |||
| Tensor, element type is the same as the element type of data. | |||
| """ | |||
| return compile_utils.tensor_index_by_tensor(data, tensor_index) | |||
| @@ -216,7 +216,7 @@ def _tensor_getitem_by_tuple(data, tuple_index): | |||
| tuple_index (tuple): Index in tuple. | |||
| Outputs: | |||
| Tensor, element type is same as the element type of data. | |||
| Tensor, element type is the same as the element type of data. | |||
| """ | |||
| return compile_utils.tensor_index_by_tuple(data, tuple_index) | |||
| @@ -32,7 +32,7 @@ def _list_setitem_with_string(data, number_index, value): | |||
| number_index (Number): Index of data. | |||
| Outputs: | |||
| list, type is same as the element type of data. | |||
| list, type is the same as the element type of data. | |||
| """ | |||
| return F.list_setitem(data, number_index, value) | |||
| @@ -48,7 +48,7 @@ def _list_setitem_with_number(data, number_index, value): | |||
| value (Number): Value given. | |||
| Outputs: | |||
| list, type is same as the element type of data. | |||
| list, type is the same as the element type of data. | |||
| """ | |||
| return F.list_setitem(data, number_index, value) | |||
| @@ -64,7 +64,7 @@ def _list_setitem_with_Tensor(data, number_index, value): | |||
| value (Tensor): Value given. | |||
| Outputs: | |||
| list, type is same as the element type of data. | |||
| list, type is the same as the element type of data. | |||
| """ | |||
| return F.list_setitem(data, number_index, value) | |||
| @@ -80,7 +80,7 @@ def _list_setitem_with_List(data, number_index, value): | |||
| value (list): Value given. | |||
| Outputs: | |||
| list, type is same as the element type of data. | |||
| list, type is the same as the element type of data. | |||
| """ | |||
| return F.list_setitem(data, number_index, value) | |||
| @@ -96,7 +96,7 @@ def _list_setitem_with_Tuple(data, number_index, value): | |||
| value (list): Value given. | |||
| Outputs: | |||
| list, type is same as the element type of data. | |||
| list, type is the same as the element type of data. | |||
| """ | |||
| return F.list_setitem(data, number_index, value) | |||
| @@ -158,18 +158,18 @@ class ExtractImagePatches(PrimitiveWithInfer): | |||
| The input tensor must be a 4-D tensor and the data format is NHWC. | |||
| Args: | |||
| ksizes (Union[tuple[int], list[int]]): The size of sliding window, should be a tuple or list of int, | |||
| ksizes (Union[tuple[int], list[int]]): The size of the sliding window, should be a tuple or a list of integers, | |||
| and the format is [1, ksize_row, ksize_col, 1]. | |||
| strides (Union[tuple[int], list[int]]): Distance between the centers of the two consecutive patches, | |||
| should be a tuple or list of int, and the format is [1, stride_row, stride_col, 1]. | |||
| rates (Union[tuple[int], list[int]]): In each extracted patch, the gap between the corresponding dim | |||
| pixel positions, should be a tuple or list of int, and the format is [1, rate_row, rate_col, 1]. | |||
| rates (Union[tuple[int], list[int]]): In each extracted patch, the gap between the corresponding dimension | |||
| pixel positions, should be a tuple or a list of integers, and the format is [1, rate_row, rate_col, 1]. | |||
| padding (str): The type of padding algorithm, is a string whose value is "same" or "valid", | |||
| not case sensitive. Default: "valid". | |||
| - same: Means that the patch can take the part beyond the original image, and this part is filled with 0. | |||
| - valid: Means that the patch area taken must be completely contained in the original image. | |||
| - valid: Means that the taken patch area must be completely contained within the original image. | |||
| Inputs: | |||
| - **input_x** (Tensor) - A 4-D tensor whose shape is [in_batch, in_row, in_col, in_depth] and | |||
| @@ -177,7 +177,7 @@ class ExtractImagePatches(PrimitiveWithInfer): | |||
| Outputs: | |||
| Tensor, a 4-D tensor whose data type is same as 'input_x', | |||
| and the shape is [out_batch, out_row, out_col, out_depth], the out_batch is same as the in_batch. | |||
| and the shape is [out_batch, out_row, out_col, out_depth], where out_batch is the same as the in_batch. | |||
| """ | |||
| @prim_attr_register | |||
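A hedged usage sketch for `ExtractImagePatches` with an NHWC input; the parameter formats follow the Args list above:

```python
import numpy as np
from mindspore import Tensor
from mindspore.ops import operations as P

extract = P.ExtractImagePatches(ksizes=[1, 2, 2, 1],
                                strides=[1, 2, 2, 1],
                                rates=[1, 1, 1, 1],
                                padding="valid")
image = Tensor(np.arange(16).reshape(1, 4, 4, 1).astype(np.float32))
patches = extract(image)  # 2x2 patches -> out_depth = 2 * 2 * 1 = 4
```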
| @@ -436,8 +436,8 @@ class MatrixDiag(PrimitiveWithInfer): | |||
| Returns a batched diagonal tensor with a given batched diagonal values. | |||
| Inputs: | |||
| - **x** (Tensor) - A tensor which to be element-wise multi by `assist`. It can be of the following data types: | |||
| float32, float16, int32, int8, uint8. | |||
| - **x** (Tensor) - A tensor to be multiplied element-wise by `assist`. It can be one of the following data | |||
| types: float32, float16, int32, int8, and uint8. | |||
| - **assist** (Tensor) - An eye tensor of the same type as `x`. Its rank must be greater than or equal to 2 and | |||
| its last dimension must be equal to the second-to-last dimension. | |||
| @@ -490,7 +490,7 @@ class MatrixDiagPart(PrimitiveWithInfer): | |||
| Returns the batched diagonal part of a batched tensor. | |||
| Inputs: | |||
| - **x** (Tensor) - The batched tensor. It can be of the following data types: | |||
| - **x** (Tensor) - The batched tensor. It can be one of the following data types: | |||
| float32, float16, int32, int8, uint8. | |||
| - **assist** (Tensor) - An eye tensor of the same type as `x`, with the same shape as `x`. | |||
| @@ -531,7 +531,7 @@ class MatrixSetDiag(PrimitiveWithInfer): | |||
| Modify the batched diagonal part of a batched tensor. | |||
| Inputs: | |||
| - **x** (Tensor) - The batched tensor. It can be of the following data types: | |||
| - **x** (Tensor) - The batched tensor. It can be one of the following data types: | |||
| float32, float16, int32, int8, uint8. | |||
| - **assist** (Tensor) - A eye tensor of the same type as `x`. With shape same as `x`. | |||
| - **diagonal** (Tensor) - The diagonal values. | |||
| @@ -178,8 +178,8 @@ class FakeQuantPerLayer(PrimitiveWithInfer): | |||
| quant_delay (int): Quantization delay parameter. The simulated quantization aware function is not | |||
| applied for the first `quant_delay` training steps, and is applied afterwards. Default: 0. | |||
| symmetric (bool): Quantization algorithm use symmetric or not. Default: False. | |||
| narrow_range (bool): Quantization algorithm use narrow range or not. Default: False. | |||
| symmetric (bool): Whether the quantization algorithm is symmetric or not. Default: False. | |||
| narrow_range (bool): Whether the quantization algorithm uses narrow range or not. Default: False. | |||
| training (bool): Training the network or not. Default: True. | |||
| Inputs: | |||
| @@ -318,8 +318,8 @@ class FakeQuantPerChannel(PrimitiveWithInfer): | |||
| quant_delay (int): Quantization delay parameter. The weight data is not updated to simulate the | |||
| quantize operation for the first `quant_delay` training steps; the simulated quantize operation | |||
| begins afterwards. Default: 0. | |||
| symmetric (bool): Quantization algorithm use symmetric or not. Default: False. | |||
| narrow_range (bool): Quantization algorithm use narrow range or not. Default: False. | |||
| symmetric (bool): Whether the quantization algorithm is symmetric or not. Default: False. | |||
| narrow_range (bool): Whether the quantization algorithm uses narrow range or not. Default: False. | |||
| training (bool): Training the network or not. Default: True. | |||
| channel_axis (int): Quantization by channel axis. Ascend backend only supports 0 or 1. Default: 1. | |||
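What `symmetric` and `narrow_range` control can be sketched in plain Python; this is a simplified model of simulated quantization, not the actual kernel:

```python
def quant_range(num_bits=8, narrow_range=False):
    # narrow_range drops the lowest code so the integer range is symmetric.
    qmin = 1 if narrow_range else 0
    qmax = (1 << num_bits) - 1
    return qmin, qmax

def adjust_min_max(x_min, x_max, symmetric=False):
    # symmetric mirrors the float range around zero before computing the scale.
    if symmetric:
        bound = max(abs(x_min), abs(x_max))
        return -bound, bound
    return min(x_min, 0.0), max(x_max, 0.0)
```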
| @@ -3359,7 +3359,7 @@ class InplaceUpdate(PrimitiveWithInfer): | |||
| indices (Union[int, tuple]): Indices into the left-most dimension of `x`. | |||
| Inputs: | |||
| - **x** (Tensor) - A tensor which to be inplace updated. It can be of the following data types: | |||
| - **x** (Tensor) - A tensor to be updated inplace. It can be one of the following data types: | |||
| float32, float16, int32. | |||
| - **v** (Tensor) - A tensor of the same type as `x`. Same dimension size as `x` except | |||
| the first dimension, which must be the same as the size of `indices`. | |||
| @@ -3474,7 +3474,7 @@ class TransShape(PrimitiveWithInfer): | |||
| - **out_shape** (tuple[int]) - The shape of output data. | |||
| Outputs: | |||
| Tensor, a tensor whose data type is same as 'input_x', and the shape is same as the `out_shape`. | |||
| Tensor, a tensor whose data type is the same as 'input_x', and the shape is the same as the `out_shape`. | |||
| """ | |||
| @prim_attr_register | |||
| def __init__(self): | |||
| @@ -31,7 +31,7 @@ class ScalarCast(PrimitiveWithInfer): | |||
| - **input_y** (mindspore.dtype) - The type to cast to. Only constant value is allowed. | |||
| Outputs: | |||
| Scalar. The type is same as the python type corresponding to `input_y`. | |||
| Scalar. The type is the same as the Python type corresponding to `input_y`. | |||
| Examples: | |||
| >>> scalar_cast = P.ScalarCast() | |||
| @@ -132,7 +132,7 @@ class TensorAdd(_MathBinaryOp): | |||
| a bool when the first input is a tensor or a tensor whose data type is number or bool. | |||
| Outputs: | |||
| Tensor, the shape is same as the shape after broadcasting, | |||
| Tensor, the shape is the same as the shape after broadcasting, | |||
| and the data type is the one with high precision or high digits among the two inputs. | |||
| Examples: | |||
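A minimal broadcasting sketch for these binary math ops, with shapes (3, 1) and (1, 3) broadcasting to (3, 3); values are illustrative:

```python
import numpy as np
from mindspore import Tensor
from mindspore.ops import operations as P

add = P.TensorAdd()
x = Tensor(np.ones((3, 1)).astype(np.float32))
y = Tensor(np.ones((1, 3)).astype(np.float32))
out = add(x, y)  # shape (3, 3), every element 2.0
```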
| @@ -1067,7 +1067,7 @@ class Sub(_MathBinaryOp): | |||
| a bool when the first input is a tensor or a tensor whose data type is number or bool. | |||
| Outputs: | |||
| Tensor, the shape is same as the shape after broadcasting, | |||
| Tensor, the shape is the same as the shape after broadcasting, | |||
| and the data type is the one with high precision or high digits among the two inputs. | |||
| Examples: | |||
| @@ -1105,7 +1105,7 @@ class Mul(_MathBinaryOp): | |||
| a bool when the first input is a tensor or a tensor whose data type is number or bool. | |||
| Outputs: | |||
| Tensor, the shape is same as the shape after broadcasting, | |||
| Tensor, the shape is the same as the shape after broadcasting, | |||
| and the data type is the one with high precision or high digits among the two inputs. | |||
| Examples: | |||
| @@ -1144,7 +1144,7 @@ class SquaredDifference(_MathBinaryOp): | |||
| float16, float32, int32 or bool. | |||
| Outputs: | |||
| Tensor, the shape is same as the shape after broadcasting, | |||
| Tensor, the shape is the same as the shape after broadcasting, | |||
| and the data type is the one with high precision or high digits among the two inputs. | |||
| Examples: | |||
| @@ -1333,7 +1333,7 @@ class Pow(_MathBinaryOp): | |||
| a bool when the first input is a tensor or a tensor whose data type is number or bool. | |||
| Outputs: | |||
| Tensor, the shape is same as the shape after broadcasting, | |||
| Tensor, the shape is the same as the shape after broadcasting, | |||
| and the data type is the one with high precision or high digits among the two inputs. | |||
| Examples: | |||
| @@ -1618,7 +1618,7 @@ class Minimum(_MathBinaryOp): | |||
| a bool when the first input is a tensor or a tensor whose data type is number or bool. | |||
| Outputs: | |||
| Tensor, the shape is same as the shape after broadcasting, | |||
| Tensor, the shape is the same as the shape after broadcasting, | |||
| and the data type is the one with high precision or high digits among the two inputs. | |||
| Examples: | |||
| @@ -1656,7 +1656,7 @@ class Maximum(_MathBinaryOp): | |||
| a bool when the first input is a tensor or a tensor whose data type is number or bool. | |||
| Outputs: | |||
| Tensor, the shape is same as the shape after broadcasting, | |||
| Tensor, the shape is the same as the shape after broadcasting, | |||
| and the data type is the one with high precision or high digits among the two inputs. | |||
| Examples: | |||
| @@ -1694,7 +1694,7 @@ class RealDiv(_MathBinaryOp): | |||
| a bool when the first input is a tensor or a tensor whose data type is number or bool. | |||
| Outputs: | |||
| Tensor, the shape is same as the shape after broadcasting, | |||
| Tensor, the shape is the same as the shape after broadcasting, | |||
| and the data type is the one with high precision or high digits among the two inputs. | |||
| Examples: | |||
| @@ -1733,7 +1733,7 @@ class Div(_MathBinaryOp): | |||
| is a number or a bool, the second input should be a tensor whose data type is number or bool. | |||
| Outputs: | |||
| Tensor, the shape is same as the shape after broadcasting, | |||
| Tensor, the shape is the same as the shape after broadcasting, | |||
| and the data type is the one with high precision or high digits among the two inputs. | |||
| Raises: | |||
| @@ -1772,7 +1772,7 @@ class DivNoNan(_MathBinaryOp): | |||
| a bool when the first input is a tensor or a tensor whose data type is number or bool. | |||
| Outputs: | |||
| Tensor, the shape is same as the shape after broadcasting, | |||
| Tensor, the shape is the same as the shape after broadcasting, | |||
| and the data type is the one with high precision or high digits among the two inputs. | |||
| Raises: | |||
| @@ -1814,7 +1814,7 @@ class FloorDiv(_MathBinaryOp): | |||
| a bool when the first input is a tensor or a tensor whose data type is number or bool. | |||
| Outputs: | |||
| Tensor, the shape is same as the shape after broadcasting, | |||
| Tensor, the shape is the same as the shape after broadcasting, | |||
| and the data type is the one with high precision or high digits among the two inputs. | |||
| Examples: | |||
| @@ -1844,7 +1844,7 @@ class TruncateDiv(_MathBinaryOp): | |||
| a bool when the first input is a tensor or a tensor whose data type is number or bool. | |||
| Outputs: | |||
| Tensor, the shape is same as the shape after broadcasting, | |||
| Tensor, the shape is the same as the shape after broadcasting, | |||
| and the data type is the one with high precision or high digits among the two inputs. | |||
| Examples: | |||
| @@ -1873,7 +1873,7 @@ class TruncateMod(_MathBinaryOp): | |||
| a bool when the first input is a tensor or a tensor whose data type is number or bool. | |||
| Outputs: | |||
| Tensor, the shape is same as the shape after broadcasting, | |||
| Tensor, the shape is the same as the shape after broadcasting, | |||
| and the data type is the one with high precision or high digits among the two inputs. | |||
| Examples: | |||
| @@ -1900,7 +1900,7 @@ class Mod(_MathBinaryOp): | |||
| the second input should be a tensor whose data type is number. | |||
| Outputs: | |||
| Tensor, the shape is same as the shape after broadcasting, | |||
| Tensor, the shape is the same as the shape after broadcasting, | |||
| and the data type is the one with high precision or high digits among the two inputs. | |||
| Raises: | |||
| @@ -1967,7 +1967,7 @@ class FloorMod(_MathBinaryOp): | |||
| a bool when the first input is a tensor or a tensor whose data type is number or bool. | |||
| Outputs: | |||
| Tensor, the shape is same as the shape after broadcasting, | |||
| Tensor, the shape is the same as the shape after broadcasting, | |||
| and the data type is the one with high precision or high digits among the two inputs. | |||
| Examples: | |||
| @@ -2025,7 +2025,7 @@ class Xdivy(_MathBinaryOp): | |||
| a bool when the first input is a tensor or a tensor whose data type is float16, float32 or bool. | |||
| Outputs: | |||
| Tensor, the shape is same as the shape after broadcasting, | |||
| Tensor, the shape is the same as the shape after broadcasting, | |||
| and the data type is the one with high precision or high digits among the two inputs. | |||
| Examples: | |||
| @@ -2059,7 +2059,7 @@ class Xlogy(_MathBinaryOp): | |||
| The value must be positive. | |||
| Outputs: | |||
| Tensor, the shape is same as the shape after broadcasting, | |||
| Tensor, the shape is the same as the shape after broadcasting, | |||
| and the data type is the one with high precision or high digits among the two inputs. | |||
| Examples: | |||
| @@ -2219,7 +2219,7 @@ class Equal(_LogicBinaryOp): | |||
| a bool when the first input is a tensor or a tensor whose data type is number or bool. | |||
| Outputs: | |||
| Tensor, the shape is same as the shape after broadcasting,and the data type is bool. | |||
| Tensor, the shape is the same as the shape after broadcasting, and the data type is bool. | |||
| Examples: | |||
| >>> input_x = Tensor(np.array([1, 2, 3]), mindspore.float32) | |||
| @@ -2250,7 +2250,7 @@ class ApproximateEqual(_LogicBinaryOp): | |||
| - **x2** (Tensor) - A tensor of the same type and shape as 'x1'. | |||
| Outputs: | |||
| Tensor, the shape is same as the shape of 'x1', and the data type is bool. | |||
| Tensor, the shape is the same as the shape of 'x1', and the data type is bool. | |||
| Examples: | |||
| >>> x1 = Tensor(np.array([1, 2, 3]), mindspore.float32) | |||
| @@ -2328,7 +2328,7 @@ class NotEqual(_LogicBinaryOp): | |||
| a bool when the first input is a tensor or a tensor whose data type is number or bool. | |||
| Outputs: | |||
| Tensor, the shape is same as the shape after broadcasting,and the data type is bool. | |||
| Tensor, the shape is the same as the shape after broadcasting, and the data type is bool. | |||
| Examples: | |||
| >>> input_x = Tensor(np.array([1, 2, 3]), mindspore.float32) | |||
| @@ -2364,7 +2364,7 @@ class Greater(_LogicBinaryOp): | |||
| a bool when the first input is a tensor or a tensor whose data type is number or bool. | |||
| Outputs: | |||
| Tensor, the shape is same as the shape after broadcasting,and the data type is bool. | |||
| Tensor, the shape is the same as the shape after broadcasting, and the data type is bool. | |||
| Examples: | |||
| >>> input_x = Tensor(np.array([1, 2, 3]), mindspore.int32) | |||
| @@ -2399,7 +2399,7 @@ class GreaterEqual(_LogicBinaryOp): | |||
| a bool when the first input is a tensor or a tensor whose data type is number or bool. | |||
| Outputs: | |||
| Tensor, the shape is same as the shape after broadcasting,and the data type is bool. | |||
| Tensor, the shape is the same as the shape after broadcasting, and the data type is bool. | |||
| Examples: | |||
| >>> input_x = Tensor(np.array([1, 2, 3]), mindspore.int32) | |||
| @@ -2434,7 +2434,7 @@ class Less(_LogicBinaryOp): | |||
| a bool when the first input is a tensor or a tensor whose data type is number or bool. | |||
| Outputs: | |||
| Tensor, the shape is same as the shape after broadcasting,and the data type is bool. | |||
| Tensor, the shape is the same as the shape after broadcasting, and the data type is bool. | |||
| Examples: | |||
| >>> input_x = Tensor(np.array([1, 2, 3]), mindspore.int32) | |||
| @@ -2469,7 +2469,7 @@ class LessEqual(_LogicBinaryOp): | |||
| a bool when the first input is a tensor or a tensor whose data type is number or bool. | |||
| Outputs: | |||
| Tensor, the shape is same as the shape after broadcasting,and the data type is bool. | |||
| Tensor, the shape is the same as the shape after broadcasting, and the data type is bool. | |||
| Examples: | |||
| >>> input_x = Tensor(np.array([1, 2, 3]), mindspore.int32) | |||
| @@ -2495,7 +2495,7 @@ class LogicalNot(PrimitiveWithInfer): | |||
| - **input_x** (Tensor) - The input tensor whose dtype is bool. | |||
| Outputs: | |||
| Tensor, the shape is same as the `input_x`, and the dtype is bool. | |||
| Tensor, the shape is the same as the `input_x`, and the dtype is bool. | |||
| Examples: | |||
| >>> input_x = Tensor(np.array([True, False, True]), mindspore.bool_) | |||
| @@ -2533,7 +2533,7 @@ class LogicalAnd(_LogicBinaryOp): | |||
| a tensor whose data type is bool. | |||
| Outputs: | |||
| Tensor, the shape is same as the shape after broadcasting, and the data type is bool. | |||
| Tensor, the shape is the same as the shape after broadcasting, and the data type is bool. | |||
| Examples: | |||
| >>> input_x = Tensor(np.array([True, False, True]), mindspore.bool_) | |||
| @@ -2563,7 +2563,7 @@ class LogicalOr(_LogicBinaryOp): | |||
| a tensor whose data type is bool. | |||
| Outputs: | |||
| Tensor, the shape is same as the shape after broadcasting,and the data type is bool. | |||
| Tensor, the shape is the same as the shape after broadcasting, and the data type is bool. | |||
| Examples: | |||
| >>> input_x = Tensor(np.array([True, False, True]), mindspore.bool_) | |||
| @@ -3182,7 +3182,7 @@ class Atan2(_MathBinaryOp): | |||
| - **input_y** (Tensor) - The input tensor. | |||
| Outputs: | |||
| Tensor, the shape is same as the shape after broadcasting,and the data type is same as `input_x`. | |||
| Tensor, the shape is the same as the shape after broadcasting, and the data type is the same as `input_x`. | |||
| Examples: | |||
| >>> input_x = Tensor(np.array([[0, 1]]), mindspore.float32) | |||
| @@ -100,7 +100,7 @@ class Softmax(PrimitiveWithInfer): | |||
| Softmax operation. | |||
| Applies the Softmax operation to the input tensor on the specified axis. | |||
| Suppose a slice along the given aixs :math:`x` then for each element :math:`x_i` | |||
| Suppose a slice in the given axis :math:`x`, then for each element :math:`x_i` | |||
| the Softmax function is shown as follows: | |||
| .. math:: | |||
| @@ -151,7 +151,7 @@ class LogSoftmax(PrimitiveWithInfer): | |||
| Log Softmax activation function. | |||
| Applies the Log Softmax function to the input tensor on the specified axis. | |||
| Suppose a slice along the given aixs :math:`x` then for each element :math:`x_i` | |||
| Suppose a slice in the given axis :math:`x`, then for each element :math:`x_i` | |||
| the Log Softmax function is shown as follows: | |||
| .. math:: | |||
| @@ -429,7 +429,7 @@ class HSwish(PrimitiveWithInfer): | |||
| .. math:: | |||
| \text{hswish}(x_{i}) = x_{i} * \frac{ReLU6(x_{i} + 3)}{6}, | |||
| where :math:`x_{i}` is the :math:`i`-th slice along the given dim of the input Tensor. | |||
| where :math:`x_{i}` is the :math:`i`-th slice in the given dimension of the input Tensor. | |||
| Inputs: | |||
| - **input_data** (Tensor) - The input of HSwish, data type should be float16 or float32. | |||
| @@ -502,7 +502,7 @@ class HSigmoid(PrimitiveWithInfer): | |||
| .. math:: | |||
| \text{hsigmoid}(x_{i}) = max(0, min(1, \frac{x_{i} + 3}{6})), | |||
| where :math:`x_{i}` is the :math:`i`-th slice along the given dim of the input Tensor. | |||
| where :math:`x_{i}` is the :math:`i`-th slice in the given dimension of the input Tensor. | |||
| Inputs: | |||
| - **input_data** (Tensor) - The input of HSigmoid, data type should be float16 or float32. | |||
| @@ -2234,7 +2234,7 @@ class DropoutDoMask(PrimitiveWithInfer): | |||
| shape of `input_x` must be the same as the value of `DropoutGenMask`'s input `shape`. If a wrong `mask` is | |||
| input, the output of `DropoutDoMask` is unpredictable. | |||
| - **keep_prob** (Tensor) - The keep rate, between 0 and 1, e.g. keep_prob = 0.9, | |||
| means dropping out 10% of input units. The value of `keep_prob` is same as the input `keep_prob` of | |||
| means dropping out 10% of input units. The value of `keep_prob` is the same as the input `keep_prob` of | |||
| `DropoutGenMask`. | |||
| Outputs: | |||
| @@ -2674,9 +2674,9 @@ class Pad(PrimitiveWithInfer): | |||
| Args: | |||
| paddings (tuple): The shape of parameter `paddings` is (N, 2). N is the rank of input data. All elements of | |||
| paddings are int type. For `D` th dimension of input, paddings[D, 0] indicates how many sizes to be | |||
| extended ahead of the `D` th dimension of the input tensor, and paddings[D, 1] indicates how many sizes to | |||
| be extended behind of the `D` th dimension of the input tensor. | |||
| paddings are int type. For the input in the `D` th dimension, paddings[D, 0] indicates how many sizes to be | |||
| extended ahead of the input tensor in the `D` th dimension, and paddings[D, 1] indicates how many sizes to | |||
| be extended behind the input tensor in the `D` th dimension. | |||
| Inputs: | |||
| - **input_x** (Tensor) - The input tensor. | |||
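A hedged usage sketch for `Pad` on a 2-D input; `paddings` carries one (before, after) pair per dimension:

```python
import numpy as np
from mindspore import Tensor
from mindspore.ops import operations as P

# Pad 1 row before / 2 rows after, and 2 columns before / 1 column after.
pad = P.Pad(paddings=((1, 2), (2, 1)))
x = Tensor(np.ones((2, 3)).astype(np.float32))
out = pad(x)  # shape (2 + 1 + 2, 3 + 2 + 1) = (5, 6), padded with zeros
```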
| @@ -2733,9 +2733,9 @@ class MirrorPad(PrimitiveWithInfer): | |||
| - **input_x** (Tensor) - The input tensor. | |||
| - **paddings** (Tensor) - The paddings tensor. The value of `paddings` is a matrix (list), | |||
| and its shape is (N, 2). N is the rank of input data. All elements of paddings | |||
| are int type. For `D` th dimension of input, paddings[D, 0] indicates how many sizes to be | |||
| extended ahead of the `D` th dimension of the input tensor, and paddings[D, 1] indicates | |||
| how many sizes to be extended behind of the `D` th dimension of the input tensor. | |||
| are int type. For the input in the `D` th dimension, paddings[D, 0] indicates how many sizes to be | |||
| extended ahead of the input tensor in the `D` th dimension, and paddings[D, 1] indicates how many sizes to | |||
| be extended behind the input tensor in the `D` th dimension. | |||
| Outputs: | |||
| Tensor, the tensor after padding. | |||
| @@ -2880,11 +2880,11 @@ class Adam(PrimitiveWithInfer): | |||
| Args: | |||
| use_locking (bool): Whether to enable a lock to protect updating variable tensors. | |||
| If True, updating of the var, m, and v tensors will be protected by a lock. | |||
| If False, the result is unpredictable. Default: False. | |||
| If true, updates of the var, m, and v tensors will be protected by a lock. | |||
| If false, the result is unpredictable. Default: False. | |||
| use_nesterov (bool): Whether to use Nesterov Accelerated Gradient (NAG) algorithm to update the gradients. | |||
| If True, updates the gradients using NAG. | |||
| If False, updates the gradients without using NAG. Default: False. | |||
| If true, update the gradients using NAG. | |||
| If false, update the gradients without using NAG. Default: False. | |||
| Inputs: | |||
| - **var** (Tensor) - Weights to be updated. | |||
| @@ -2894,8 +2894,8 @@ class Adam(PrimitiveWithInfer): | |||
| - **beta1_power** (float) - :math:`beta_1^t` in the updating formula. | |||
| - **beta2_power** (float) - :math:`beta_2^t` in the updating formula. | |||
| - **lr** (float) - :math:`l` in the updating formula. | |||
| - **beta1** (float) - The exponential decay rate for the 1st moment estimates. | |||
| - **beta2** (float) - The exponential decay rate for the 2nd moment estimates. | |||
| - **beta1** (float) - The exponential decay rate for the 1st moment estimations. | |||
| - **beta2** (float) - The exponential decay rate for the 2nd moment estimations. | |||
| - **epsilon** (float) - Term added to the denominator to improve numerical stability. | |||
| - **gradient** (Tensor) - Gradients. Has the same type as `var`. | |||
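For reference, these inputs parameterize the standard Adam update; the formulas below are reconstructed from the published algorithm rather than quoted from this file, so treat them as a sketch:

```latex
m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \\
v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2 \\
l_t = lr \cdot \sqrt{1 - \beta_2^t} / (1 - \beta_1^t) \\
var_t = var_{t-1} - l_t \cdot m_t / (\sqrt{v_t} + \epsilon)
```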
| @@ -2974,11 +2974,11 @@ class FusedSparseAdam(PrimitiveWithInfer): | |||
| Args: | |||
| use_locking (bool): Whether to enable a lock to protect updating variable tensors. | |||
| If True, updating of the var, m, and v tensors will be protected by a lock. | |||
| If False, the result is unpredictable. Default: False. | |||
| If true, updates of the var, m, and v tensors will be protected by a lock. | |||
| If false, the result is unpredictable. Default: False. | |||
| use_nesterov (bool): Whether to use Nesterov Accelerated Gradient (NAG) algorithm to update the gradients. | |||
| If True, updates the gradients using NAG. | |||
| If False, updates the gradients without using NAG. Default: False. | |||
| If true, update the gradients using NAG. | |||
| If false, update the gradients without using NAG. Default: False. | |||
| Inputs: | |||
| - **var** (Parameter) - Parameters to be updated. With float32 data type. | |||
| @@ -2989,8 +2989,8 @@ class FusedSparseAdam(PrimitiveWithInfer): | |||
| - **beta1_power** (Tensor) - :math:`beta_1^t` in the updating formula. With float32 data type. | |||
| - **beta2_power** (Tensor) - :math:`beta_2^t` in the updating formula. With float32 data type. | |||
| - **lr** (Tensor) - :math:`l` in the updating formula. With float32 data type. | |||
| - **beta1** (Tensor) - The exponential decay rate for the 1st moment estimates. With float32 data type. | |||
| - **beta2** (Tensor) - The exponential decay rate for the 2nd moment estimates. With float32 data type. | |||
| - **beta1** (Tensor) - The exponential decay rate for the 1st moment estimations. With float32 data type. | |||
| - **beta2** (Tensor) - The exponential decay rate for the 2nd moment estimations. With float32 data type. | |||
| - **epsilon** (Tensor) - Term added to the denominator to improve numerical stability. With float32 data type. | |||
| - **gradient** (Tensor) - Gradient value. With float32 data type. | |||
| - **indices** (Tensor) - Gradient indices. With int32 data type. | |||
| @@ -3108,11 +3108,11 @@ class FusedSparseLazyAdam(PrimitiveWithInfer): | |||
| Args: | |||
| use_locking (bool): Whether to enable a lock to protect updating variable tensors. | |||
| If True, updating of the var, m, and v tensors will be protected by a lock. | |||
| If False, the result is unpredictable. Default: False. | |||
| If true, updates of the var, m, and v tensors will be protected by a lock. | |||
| If false, the result is unpredictable. Default: False. | |||
| use_nesterov (bool): Whether to use Nesterov Accelerated Gradient (NAG) algorithm to update the gradients. | |||
| If True, updates the gradients using NAG. | |||
| If False, updates the gradients without using NAG. Default: False. | |||
| If true, update the gradients using NAG. | |||
| If false, update the gradients without using NAG. Default: False. | |||
| Inputs: | |||
| - **var** (Parameter) - Parameters to be updated. With float32 data type. | |||
| @@ -3123,8 +3123,8 @@ class FusedSparseLazyAdam(PrimitiveWithInfer): | |||
| - **beta1_power** (Tensor) - :math:`beta_1^t` in the updating formula. With float32 data type. | |||
| - **beta2_power** (Tensor) - :math:`beta_2^t` in the updating formula. With float32 data type. | |||
| - **lr** (Tensor) - :math:`l` in the updating formula. With float32 data type. | |||
| - **beta1** (Tensor) - The exponential decay rate for the 1st moment estimates. With float32 data type. | |||
| - **beta2** (Tensor) - The exponential decay rate for the 2nd moment estimates. With float32 data type. | |||
| - **beta1** (Tensor) - The exponential decay rate for the 1st moment estimations. With float32 data type. | |||
| - **beta2** (Tensor) - The exponential decay rate for the 2nd moment estimations. With float32 data type. | |||
| - **epsilon** (Tensor) - Term added to the denominator to improve numerical stability. With float32 data type. | |||
| - **gradient** (Tensor) - Gradient value. With float32 data type. | |||
| - **indices** (Tensor) - Gradient indices. With int32 data type. | |||
| @@ -3227,7 +3227,7 @@ class FusedSparseFtrl(PrimitiveWithInfer): | |||
| l2 (float): l2 regularization strength, must be greater than or equal to zero. | |||
| lr_power (float): Learning rate power controls how the learning rate decreases during training, | |||
| must be less than or equal to zero. Use fixed learning rate if `lr_power` is zero. | |||
| use_locking (bool): Use locks for update operation if True . Default: False. | |||
| use_locking (bool): Use locks for the updating operation if true. Default: False. | |||
| Inputs: | |||
| - **var** (Parameter) - The variable to be updated. The data type must be float32. | |||
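| The Args above translate into a constructor call along these lines. The values are illustrative, and the `l1` parameter is an assumption by symmetry with `l2` and the `lr * l1` term in the FTRL formula:

```python
from mindspore.ops import operations as P

# lr_power must be <= 0; a value of 0 selects a fixed learning rate
sparse_ftrl = P.FusedSparseFtrl(lr=0.01, l1=0.0, l2=0.0, lr_power=-0.5, use_locking=False)
```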
| @@ -3320,7 +3320,7 @@ class FusedSparseProximalAdagrad(PrimitiveWithInfer): | |||
| var = \frac{sign(\text{prox_v})}{1 + lr * l2} * \max(\left| \text{prox_v} \right| - lr * l1, 0) | |||
| Args: | |||
| use_locking (bool): If True, updating of the var and accum tensors will be protected. Default: False. | |||
| use_locking (bool): If true, updates of the var and accum tensors will be protected. Default: False. | |||
| Inputs: | |||
| - **var** (Parameter) - Variable tensor to be updated. The data type must be float32. | |||
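| The displayed formula is only the final step of the proximal update. For context, a sketch of the preceding steps that produce `prox_v`, reconstructed from the standard proximal Adagrad algorithm rather than quoted from the docstring:

```latex
accum \leftarrow accum + grad^2
\text{prox\_v} = var - lr \cdot grad / \sqrt{accum}
var = \frac{\operatorname{sign}(\text{prox\_v})}{1 + lr \cdot l2} \cdot \max(|\text{prox\_v}| - lr \cdot l1, 0)
```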
| @@ -3415,7 +3415,7 @@ class KLDivLoss(PrimitiveWithInfer): | |||
| \end{cases} | |||
| Args: | |||
| reduction (str): Specifies the reduction to apply to the output. | |||
| reduction (str): Specifies the reduction to be applied to the output. | |||
| Its value should be one of 'none', 'mean', 'sum'. Default: 'mean'. | |||
| Inputs: | |||
| @@ -3487,7 +3487,7 @@ class BinaryCrossEntropy(PrimitiveWithInfer): | |||
| \end{cases} | |||
| Args: | |||
| reduction (str): Specifies the reduction to apply to the output. | |||
| reduction (str): Specifies the reduction to be applied to the output. | |||
| Its value should be one of 'none', 'mean', 'sum'. Default: 'mean'. | |||
| Inputs: | |||
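| Both loss primitives share this reduction contract. A minimal sketch of how the choice changes the output, assuming BinaryCrossEntropy takes (probabilities, labels, weight) as its inputs (the Inputs list is not shown in this hunk) and direct PyNative-style invocation; in graph mode, wrap the op in a Cell as in the earlier examples:

```python
import numpy as np
from mindspore import Tensor
from mindspore.ops import operations as P

# 'none' keeps the per-element losses; 'mean' and 'sum' reduce them to a scalar
bce_none = P.BinaryCrossEntropy(reduction='none')
bce_mean = P.BinaryCrossEntropy(reduction='mean')

probs = Tensor(np.array([0.2, 0.7, 0.9]).astype(np.float32))
labels = Tensor(np.array([0.0, 1.0, 1.0]).astype(np.float32))
weight = Tensor(np.ones(3).astype(np.float32))

per_element = bce_none(probs, labels, weight)  # shape (3,)
averaged = bce_mean(probs, labels, weight)     # scalar
```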
| @@ -3575,9 +3575,9 @@ class ApplyAdaMax(PrimitiveWithInfer): | |||
| With float32 or float16 data type. | |||
| - **lr** (Union[Number, Tensor]) - Learning rate, :math:`l` in the updating formula, should be scalar. | |||
| With float32 or float16 data type. | |||
| - **beta1** (Union[Number, Tensor]) - The exponential decay rate for the 1st moment estimates, | |||
| - **beta1** (Union[Number, Tensor]) - The exponential decay rate for the 1st moment estimations, | |||
| should be scalar. With float32 or float16 data type. | |||
| - **beta2** (Union[Number, Tensor]) - The exponential decay rate for the 2nd moment estimates, | |||
| - **beta2** (Union[Number, Tensor]) - The exponential decay rate for the 2nd moment estimations, | |||
| should be scalar. With float32 or float16 data type. | |||
| - **epsilon** (Union[Number, Tensor]) - A small value added for numerical stability, should be scalar. | |||
| With float32 or float16 data type. | |||
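| These parameters correspond to the standard AdaMax variant of Adam, in which the second-moment estimate is replaced by an infinity-norm accumulator. A sketch of the update, with `t` implied by `beta1_power`:

```latex
m \leftarrow \beta_1 \cdot m + (1 - \beta_1) \cdot g
v \leftarrow \max(\beta_2 \cdot v, |g|)
var \leftarrow var - \frac{l}{1 - \beta_1^t} \cdot \frac{m}{v + \epsilon}
```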
| @@ -3939,7 +3939,7 @@ class SparseApplyAdagrad(PrimitiveWithInfer): | |||
| Args: | |||
| lr (float): Learning rate. | |||
| update_slots (bool): If `True`, `accum` will be updated. Default: True. | |||
| use_locking (bool): If True, updating of the var and accum tensors will be protected. Default: False. | |||
| use_locking (bool): If true, updates of the var and accum tensors will be protected. Default: False. | |||
| Inputs: | |||
| - **var** (Parameter) - Variable to be updated. The data type must be float16 or float32. | |||
| @@ -4099,7 +4099,7 @@ class ApplyProximalAdagrad(PrimitiveWithInfer): | |||
| var = \frac{sign(\text{prox_v})}{1 + lr * l2} * \max(\left| \text{prox_v} \right| - lr * l1, 0) | |||
| Args: | |||
| use_locking (bool): If True, updating of the var and accum tensors will be protected. Default: False. | |||
| use_locking (bool): If true, updates of the var and accum tensors will be protected. Default: False. | |||
| Inputs: | |||
| - **var** (Parameter) - Variable to be updated. The data type should be float16 or float32. | |||
| @@ -4195,7 +4195,7 @@ class SparseApplyProximalAdagrad(PrimitiveWithInfer): | |||
| var = \frac{sign(\text{prox_v})}{1 + lr * l2} * \max(\left| \text{prox_v} \right| - lr * l1, 0) | |||
| Args: | |||
| use_locking (bool): If True, updating of the var and accum tensors will be protected. Default: False. | |||
| use_locking (bool): If true, updates of the var and accum tensors will be protected. Default: False. | |||
| Inputs: | |||
| - **var** (Parameter) - Variable tensor to be updated. The data type must be float16 or float32. | |||
| @@ -4697,7 +4697,7 @@ class ApplyFtrl(PrimitiveWithInfer): | |||
| Update relevant entries according to the FTRL scheme. | |||
| Args: | |||
| use_locking (bool): Use locks for update operation if True . Default: False. | |||
| use_locking (bool): Use locks for the updating operation if true. Default: False. | |||
| Inputs: | |||
| - **var** (Parameter) - The variable to be updated. The data type should be float16 or float32. | |||
| @@ -4788,7 +4788,7 @@ class SparseApplyFtrl(PrimitiveWithInfer): | |||
| l2 (float): l2 regularization strength, must be greater than or equal to zero. | |||
| lr_power (float): Learning rate power controls how the learning rate decreases during training, | |||
| must be less than or equal to zero. Use fixed learning rate if `lr_power` is zero. | |||
| use_locking (bool): Use locks for update operation if True . Default: False. | |||
| use_locking (bool): Use locks for the updating operation if true. Default: False. | |||
| Inputs: | |||
| - **var** (Parameter) - The variable to be updated. The data type must be float16 or float32. | |||
| @@ -4967,8 +4967,8 @@ class ConfusionMulGrad(PrimitiveWithInfer): | |||
| axis (Union[int, tuple[int], list[int]]): The dimensions to reduce. | |||
| Default: (), reduce all dimensions. Only constant value is allowed. | |||
| keep_dims (bool): | |||
| - If True, keep these reduced dimensions and the length is 1. | |||
| - If False, don't keep these dimensions. Default:False. | |||
| - If true, keep these reduced dimensions with length 1. | |||
| - If false, don't keep these dimensions. Default: False. | |||
| Inputs: | |||
| - **input_0** (Tensor) - The input Tensor. | |||
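| `keep_dims` follows the usual reduction convention; the effect on shapes, for illustration:

```python
# keep_dims with input shape (2, 3) and axis=(1,):
#   keep_dims=True  -> reduced output has shape (2, 1)
#   keep_dims=False -> reduced output has shape (2,)
```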
| @@ -5094,9 +5094,9 @@ class CTCLoss(PrimitiveWithInfer): | |||
| Calculates the CTC(Connectionist Temporal Classification) loss. Also calculates the gradient. | |||
| Args: | |||
| preprocess_collapse_repeated (bool): If True, repeated labels are collapsed prior to the CTC calculation. | |||
| preprocess_collapse_repeated (bool): If true, repeated labels are collapsed prior to the CTC calculation. | |||
| Default: False. | |||
| ctc_merge_repeated (bool): If False, during CTC calculation, repeated non-blank labels will not be merged | |||
| ctc_merge_repeated (bool): If false, during CTC calculation, repeated non-blank labels will not be merged | |||
| and are interpreted as individual labels. This is a simplified version of CTC. | |||
| Default: True. | |||
| ignore_longer_outputs_than_inputs (bool): If True, sequences with longer outputs than inputs will be ignored. | |||
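| A minimal invocation sketch. Since this hunk shows only the Args, the input layout (activations as (max_time, batch_size, num_classes)) and the SparseTensor-style (indices, values) label pair, including their dtypes, are assumptions based on the common CTC interface:

```python
import numpy as np
from mindspore import Tensor
from mindspore.ops import operations as P

ctc_loss = P.CTCLoss()  # defaults as listed above

inputs = Tensor(np.random.rand(2, 2, 3).astype(np.float32))          # (max_time, batch, classes)
labels_indices = Tensor(np.array([[0, 0], [1, 0]]).astype(np.int64))  # (batch_idx positions)
labels_values = Tensor(np.array([2, 2]).astype(np.int32))             # label ids
sequence_length = Tensor(np.array([2, 2]).astype(np.int32))           # per-sample lengths

loss, gradient = ctc_loss(inputs, labels_indices, labels_values, sequence_length)
```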
| @@ -5192,7 +5192,7 @@ class BasicLSTMCell(PrimitiveWithInfer): | |||
| keep_prob (float): If not 1.0, append `Dropout` layer on the outputs of each | |||
| LSTM layer except the last layer. Default: 1.0. The range of dropout is [0.0, 1.0]. | |||
| forget_bias (float): Add forget bias to forget gate biases in order to decrease the former scale. Default: 1.0. | |||
| state_is_tuple (bool): If True, state is tensor tuple, containing h and c; If False, one tensor, | |||
| state_is_tuple (bool): If true, the state is a tensor tuple containing h and c; if false, it is one | |||
| tensor that needs to be split first. Default: True. | |||
| activation (str): Activation function. Default: "tanh". | |||
| @@ -496,12 +496,11 @@ def convert_quant_network(network, | |||
| per_channel (bool, list or tuple): Quantization granularity based on layer or on channel. If `True`, | |||
| quantization is applied per channel; otherwise per layer. The first element represents weights | |||
| and the second element represents data flow. Default: (False, False) | |||
| symmetric (bool, list or tuple): Quantization algorithm use symmetric or not. If `True` then base on | |||
| symmetric (bool, list or tuple): Whether the quantization algorithm is symmetric or not. If `True`, | |||
| symmetric quantization is used; otherwise asymmetric. The first element represents weights and the | |||
| second element represents data flow. Default: (False, False) | |||
| narrow_range (bool, list or tuple): Quantization algorithm use narrow range or not. If `True` then base | |||
| on narrow range otherwise base on off narrow range. The first element represent weights and | |||
| second element represent data flow. Default: (False, False) | |||
| narrow_range (bool, list or tuple): Whether the quantization algorithm uses narrow range or not. | |||
| The first element represents weights and the second element represents data flow. Default: (False, False) | |||
| Returns: | |||
| Cell, the network converted to a quantization aware training network. | |||
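| Putting the per-weight/per-data-flow tuples together, a hedged call sketch; the import path is an assumption and may differ by version:

```python
# hypothetical import path; adjust to wherever convert_quant_network lives in your version
from mindspore.train.quant import quant

quant_net = quant.convert_quant_network(
    net,                         # an fp32 network (nn.Cell) to convert
    per_channel=(True, False),   # per-channel for weights, per-layer for data flow
    symmetric=(False, False),
    narrow_range=(False, False),
)
```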
| @@ -31,8 +31,8 @@ def cal_quantization_params(input_min, | |||
| input_max (numpy.ndarray): The dimension of channel or 1. | |||
| data_type (numpy type): Can be numpy int8 or numpy uint8. | |||
| num_bits (int): Number of quantization bits; supports 4 and 8 bits. Default: 8. | |||
| symmetric (bool): Quantization algorithm use symmetric or not. Default: False. | |||
| narrow_range (bool): Quantization algorithm use narrow range or not. Default: False. | |||
| symmetric (bool): Whether the quantization algorithm is symmetric or not. Default: False. | |||
| narrow_range (bool): Whether the quantization algorithm uses narrow range or not. Default: False. | |||
| Returns: | |||
| scale (numpy.ndarray): Quantization scale parameter. | |||
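| For intuition, an illustrative re-derivation of scale and zero point under one common convention. This is a sketch of the math the parameters describe, not the library's actual implementation:

```python
import numpy as np

def quant_params_sketch(input_min, input_max, num_bits=8, symmetric=False, narrow_range=False):
    """Map the real range [input_min, input_max] onto the integer range [qmin, qmax]."""
    qmin = 1 if narrow_range else 0
    qmax = (1 << num_bits) - 1
    if symmetric:
        # symmetric: the range is centered on zero, so the zero point is fixed at 0
        bound = np.maximum(np.abs(input_min), np.abs(input_max))
        scale = 2.0 * bound / (qmax - qmin)
        zero_point = np.zeros_like(scale)
    else:
        # asymmetric: [min, max] maps directly onto [qmin, qmax]
        scale = (input_max - input_min) / (qmax - qmin)
        zero_point = np.round(qmin - input_min / scale)
    # a real implementation would also guard against scale == 0 when min == max
    return scale, zero_point
```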
| @@ -34,7 +34,7 @@ pkg_dir = os.path.join(pwd, 'build/package') | |||
| def _read_file(filename): | |||
| with open(os.path.join(pwd, filename)) as f: | |||
| with open(os.path.join(pwd, filename), encoding='UTF-8') as f: | |||
| return f.read() | |||
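| For context on this final hunk: passing `encoding='UTF-8'` pins the file encoding instead of relying on the platform's locale default, which can make `_read_file` fail on non-ASCII content (such as the Chinese README) under a C or GBK locale.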