
!15374 modify some formula

From: @pan-fei
Reviewed-by: @gemini524,@kingxian,@liangchenghui
Signed-off-by: @kingxian
pull/15374/MERGE
mindspore-ci-bot committed 4 years ago
commit 9650979baa
5 changed files with 60 additions and 60 deletions

  1. mindspore/nn/optim/ada_grad.py (+2, -2)
  2. mindspore/nn/optim/momentum.py (+3, -3)
  3. mindspore/nn/optim/rmsprop.py (+11, -11)
  4. mindspore/ops/operations/nn_ops.py (+41, -41)
  5. model_zoo/official/cv/resnet/src/momentum.py (+3, -3)

mindspore/nn/optim/ada_grad.py (+2, -2)

@@ -46,8 +46,8 @@ class Adagrad(Optimizer):

.. math::
\begin{array}{ll} \\
- h_{t} = h_{t-1} + g\\
- w_{t} = w_{t-1} - lr*\frac{1}{\sqrt{h_{t}}}*g
+ h_{t+1} = h_{t} + g\\
+ w_{t+1} = w_{t} - lr*\frac{1}{\sqrt{h_{t+1}}}*g
\end{array}

:math:`h` represents the cumulative sum of gradient squared, :math:`g` represents `gradients`.
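
As a quick sanity check of the re-indexed Adagrad update, here is a minimal NumPy sketch (not the MindSpore implementation; function and variable names are illustrative). The docstring's prose describes h as the cumulative sum of the squared gradient, so the sketch accumulates g*g even though the formula abbreviates it as `+ g`; the `eps` term is an added assumption for numerical safety.

    import numpy as np

    def adagrad_step(w, h, g, lr=0.1, eps=1e-8):
        # h_{t+1} = h_t + g^2 : accumulate squared gradients
        h = h + g * g
        # w_{t+1} = w_t - lr * g / sqrt(h_{t+1})
        w = w - lr * g / (np.sqrt(h) + eps)
        return w, h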


mindspore/nn/optim/momentum.py (+3, -3)

@@ -44,17 +44,17 @@ class Momentum(Optimizer):
Refer to the paper on the importance of initialization and momentum in deep learning for more details.

.. math::
- v_{t} = v_{t-1} \ast u + gradients
+ v_{t+1} = v_{t} \ast u + gradients

If use_nesterov is True:

.. math::
- p_{t} = p_{t-1} - (grad \ast lr + v_{t} \ast u \ast lr)
+ p_{t+1} = p_{t} - (grad \ast lr + v_{t+1} \ast u \ast lr)

If use_nesterov is False:

.. math::
- p_{t} = p_{t-1} - lr \ast v_{t}
+ p_{t+1} = p_{t} - lr \ast v_{t+1}

Here, grad, lr, p, v and u denote the gradients, learning_rate, params, moments, and momentum respectively.
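
The corrected indexing is easy to verify with a small NumPy sketch (illustrative names only, not the MindSpore kernel): the velocity v is updated first, and the parameter update then reads the fresh v_{t+1}.

    import numpy as np

    def momentum_step(p, v, grad, lr=0.01, u=0.9, use_nesterov=False):
        v = v * u + grad                        # v_{t+1} = v_t * u + gradients
        if use_nesterov:
            p = p - (grad * lr + v * u * lr)    # p_{t+1} = p_t - (grad*lr + v_{t+1}*u*lr)
        else:
            p = p - lr * v                      # p_{t+1} = p_t - lr * v_{t+1}
        return p, v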



mindspore/nn/optim/rmsprop.py (+11, -11)

@@ -47,35 +47,35 @@ class RMSProp(Optimizer):
The equation is as follows:

.. math::
- s_{t} = \\rho s_{t-1} + (1 - \\rho)(\\nabla Q_{i}(w))^2
+ s_{t+1} = \\rho s_{t} + (1 - \\rho)(\\nabla Q_{i}(w))^2

.. math::
- m_{t} = \\beta m_{t-1} + \\frac{\\eta} {\\sqrt{s_{t} + \\epsilon}} \\nabla Q_{i}(w)
+ m_{t+1} = \\beta m_{t} + \\frac{\\eta} {\\sqrt{s_{t+1} + \\epsilon}} \\nabla Q_{i}(w)

.. math::
- w = w - m_{t}
+ w = w - m_{t+1}

The first equation calculates moving average of the squared gradient for
- each weight. Then dividing the gradient by :math:`\\sqrt{ms_{t} + \\epsilon}`.
+ each weight. Then dividing the gradient by :math:`\\sqrt{ms_{t+1} + \\epsilon}`.

if centered is True:

.. math::
- g_{t} = \\rho g_{t-1} + (1 - \\rho)\\nabla Q_{i}(w)
+ g_{t+1} = \\rho g_{t} + (1 - \\rho)\\nabla Q_{i}(w)

.. math::
- s_{t} = \\rho s_{t-1} + (1 - \\rho)(\\nabla Q_{i}(w))^2
+ s_{t+1} = \\rho s_{t} + (1 - \\rho)(\\nabla Q_{i}(w))^2

.. math::
- m_{t} = \\beta m_{t-1} + \\frac{\\eta} {\\sqrt{s_{t} - g_{t}^2 + \\epsilon}} \\nabla Q_{i}(w)
+ m_{t+1} = \\beta m_{t} + \\frac{\\eta} {\\sqrt{s_{t+1} - g_{t+1}^2 + \\epsilon}} \\nabla Q_{i}(w)

.. math::
- w = w - m_{t}
+ w = w - m_{t+1}

where :math:`w` represents `params`, which will be updated.
- :math:`g_{t}` is mean gradients, :math:`g_{t-1}` is the last moment of :math:`g_{t}`.
- :math:`s_{t}` is the mean square gradients, :math:`s_{t-1}` is the last moment of :math:`s_{t}`,
- :math:`m_{t}` is moment, the delta of `w`, :math:`m_{t-1}` is the last moment of :math:`m_{t}`.
+ :math:`g_{t+1}` is mean gradients, :math:`g_{t}` is the last moment of :math:`g_{t+1}`.
+ :math:`s_{t+1}` is the mean square gradients, :math:`s_{t}` is the last moment of :math:`s_{t+1}`,
+ :math:`m_{t+1}` is moment, the delta of `w`, :math:`m_{t}` is the last moment of :math:`m_{t+1}`.
:math:`\\rho` represents `decay`. :math:`\\beta` is the momentum term, represents `momentum`.
:math:`\\epsilon` is a smoothing term to avoid division by zero, represents `epsilon`.
:math:`\\eta` is learning rate, represents `learning_rate`. :math:`\\nabla Q_{i}(w)` is gradients,
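
A minimal NumPy sketch of the plain (non-centered) update above, with illustrative names and default values that are assumptions rather than the documented defaults:

    import numpy as np

    def rmsprop_step(w, s, m, grad, lr=0.01, rho=0.9, momentum=0.9, eps=1e-10):
        s = rho * s + (1 - rho) * grad ** 2               # s_{t+1}: mean square of gradients
        m = momentum * m + lr / np.sqrt(s + eps) * grad   # m_{t+1}: momentum-smoothed delta
        w = w - m                                         # w = w - m_{t+1}
        return w, s, m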


mindspore/ops/operations/nn_ops.py (+41, -41)

@@ -2738,14 +2738,14 @@ class ApplyRMSProp(PrimitiveWithInfer):

.. math::
\begin{array}{ll} \\
- s_{t} = \rho s_{t-1} + (1 - \rho)(\nabla Q_{i}(w))^2 \\
- m_{t} = \beta m_{t-1} + \frac{\eta} {\sqrt{s_{t} + \epsilon}} \nabla Q_{i}(w) \\
- w = w - m_{t}
+ s_{t+1} = \rho s_{t} + (1 - \rho)(\nabla Q_{i}(w))^2 \\
+ m_{t+1} = \beta m_{t} + \frac{\eta} {\sqrt{s_{t+1} + \epsilon}} \nabla Q_{i}(w) \\
+ w = w - m_{t+1}
\end{array}

where :math:`w` represents `var`, which will be updated.
- :math:`s_{t}` represents `mean_square`, :math:`s_{t-1}` is the last moment of :math:`s_{t}`,
- :math:`m_{t}` represents `moment`, :math:`m_{t-1}` is the last moment of :math:`m_{t}`.
+ :math:`s_{t+1}` represents `mean_square`, :math:`s_{t}` is the last moment of :math:`s_{t+1}`,
+ :math:`m_{t+1}` represents `moment`, :math:`m_{t}` is the last moment of :math:`m_{t+1}`.
:math:`\rho` represents `decay`. :math:`\beta` is the momentum term, represents `momentum`.
:math:`\epsilon` is a smoothing term to avoid division by zero, represents `epsilon`.
:math:`\eta` represents `learning_rate`. :math:`\nabla Q_{i}(w)` represents `grad`.
@@ -2834,16 +2834,16 @@ class ApplyCenteredRMSProp(PrimitiveWithInfer):

.. math::
\begin{array}{ll} \\
- g_{t} = \rho g_{t-1} + (1 - \rho)\nabla Q_{i}(w) \\
- s_{t} = \rho s_{t-1} + (1 - \rho)(\nabla Q_{i}(w))^2 \\
- m_{t} = \beta m_{t-1} + \frac{\eta} {\sqrt{s_{t} - g_{t}^2 + \epsilon}} \nabla Q_{i}(w) \\
- w = w - m_{t}
+ g_{t+1} = \rho g_{t} + (1 - \rho)\nabla Q_{i}(w) \\
+ s_{t+1} = \rho s_{t} + (1 - \rho)(\nabla Q_{i}(w))^2 \\
+ m_{t+1} = \beta m_{t} + \frac{\eta} {\sqrt{s_{t+1} - g_{t+1}^2 + \epsilon}} \nabla Q_{i}(w) \\
+ w = w - m_{t+1}
\end{array}

where :math:`w` represents `var`, which will be updated.
- :math:`g_{t}` represents `mean_gradient`, :math:`g_{t-1}` is the last moment of :math:`g_{t}`.
- :math:`s_{t}` represents `mean_square`, :math:`s_{t-1}` is the last moment of :math:`s_{t}`,
- :math:`m_{t}` represents `moment`, :math:`m_{t-1}` is the last moment of :math:`m_{t}`.
+ :math:`g_{t+1}` represents `mean_gradient`, :math:`g_{t}` is the last moment of :math:`g_{t+1}`.
+ :math:`s_{t+1}` represents `mean_square`, :math:`s_{t}` is the last moment of :math:`s_{t+1}`,
+ :math:`m_{t+1}` represents `moment`, :math:`m_{t}` is the last moment of :math:`m_{t+1}`.
:math:`\rho` represents `decay`. :math:`\beta` is the momentum term, represents `momentum`.
:math:`\epsilon` is a smoothing term to avoid division by zero, represents `epsilon`.
:math:`\eta` represents `learning_rate`. :math:`\nabla Q_{i}(w)` represents `grad`.
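
For the centered variant, a corresponding hedged NumPy sketch (illustrative names, not the fused kernel): subtracting the squared running mean of gradients from the running mean square turns the denominator into an estimate of the gradient variance.

    import numpy as np

    def centered_rmsprop_step(w, g_avg, s, m, grad, lr=0.01, rho=0.9,
                              momentum=0.9, eps=1e-10):
        g_avg = rho * g_avg + (1 - rho) * grad            # g_{t+1}: mean gradient
        s = rho * s + (1 - rho) * grad ** 2               # s_{t+1}: mean square gradient
        m = momentum * m + lr / np.sqrt(s - g_avg ** 2 + eps) * grad  # m_{t+1}
        w = w - m                                         # w = w - m_{t+1}
        return w, g_avg, s, m
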
@@ -5001,16 +5001,16 @@ class ApplyAdaMax(PrimitiveWithInfer):

.. math::
\begin{array}{ll} \\
- m_{t} = \beta_1 * m_{t-1} + (1 - \beta_1) * g \\
- v_{t} = \max(\beta_2 * v_{t-1}, \left| g \right|) \\
- var = var - \frac{l}{1 - \beta_1^t} * \frac{m_{t}}{v_{t} + \epsilon}
+ m_{t+1} = \beta_1 * m_{t} + (1 - \beta_1) * g \\
+ v_{t+1} = \max(\beta_2 * v_{t}, \left| g \right|) \\
+ var = var - \frac{l}{1 - \beta_1^{t+1}} * \frac{m_{t+1}}{v_{t+1} + \epsilon}
\end{array}

- :math:`t` represents updating step while :math:`m` represents the 1st moment vector, :math:`m_{t-1}`
- is the last moment of :math:`m_{t}`, :math:`v` represents the 2nd moment vector, :math:`v_{t-1}`
- is the last moment of :math:`v_{t}`, :math:`l` represents scaling factor `lr`,
+ :math:`t` represents updating step while :math:`m` represents the 1st moment vector, :math:`m_{t}`
+ is the last moment of :math:`m_{t+1}`, :math:`v` represents the 2nd moment vector, :math:`v_{t}`
+ is the last moment of :math:`v_{t+1}`, :math:`l` represents scaling factor `lr`,
:math:`g` represents `grad`, :math:`\beta_1, \beta_2` represent `beta1` and `beta2`,
- :math:`\beta_1^t` represents `beta1_power`, :math:`var` represents the variable to be updated,
+ :math:`\beta_1^{t+1}` represents `beta1_power`, :math:`var` represents the variable to be updated,
:math:`\epsilon` represents `epsilon`.

Inputs of `var`, `m`, `v` and `grad` comply with the implicit type conversion rules
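
A NumPy sketch of the corrected AdaMax step (names are illustrative; `beta1_power` stands for the externally maintained :math:`\beta_1^{t+1}`):

    import numpy as np

    def adamax_step(var, m, v, g, lr=0.001, beta1=0.9, beta2=0.999,
                    beta1_power=0.9, eps=1e-8):
        m = beta1 * m + (1 - beta1) * g          # m_{t+1}
        v = np.maximum(beta2 * v, np.abs(g))     # v_{t+1}: infinity-norm second moment
        var = var - lr / (1 - beta1_power) * m / (v + eps)
        return var, m, v
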
@@ -5914,13 +5914,13 @@ class ApplyAddSign(PrimitiveWithInfer):

.. math::
\begin{array}{ll} \\
- m_{t} = \beta * m_{t-1} + (1 - \beta) * g \\
+ m_{t+1} = \beta * m_{t} + (1 - \beta) * g \\
\text{update} = (\alpha + \text{sign_decay} * sign(g) * sign(m)) * g \\
- var = var - lr_{t} * \text{update}
+ var = var - lr_{t+1} * \text{update}
\end{array}

- :math:`t` represents updating step while :math:`m` represents the 1st moment vector, :math:`m_{t-1}`
- is the last moment of :math:`m_{t}`, :math:`lr` represents scaling factor `lr`, :math:`g` represents `grad`.
+ :math:`t` represents updating step while :math:`m` represents the 1st moment vector, :math:`m_{t}`
+ is the last moment of :math:`m_{t+1}`, :math:`lr` represents scaling factor `lr`, :math:`g` represents `grad`.

Inputs of `var`, `accum` and `grad` comply with the implicit type conversion rules
to make the data types consistent.
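
The AddSign rule scales each gradient component up or down depending on whether the gradient and its moving average agree in sign. A hedged NumPy sketch (illustrative names and defaults):

    import numpy as np

    def addsign_step(var, m, g, lr=0.01, alpha=1.0, sign_decay=0.99, beta=0.9):
        m = beta * m + (1 - beta) * g                                # m_{t+1}
        update = (alpha + sign_decay * np.sign(g) * np.sign(m)) * g
        var = var - lr * update                                      # var = var - lr_{t+1}*update
        return var, m
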
@@ -6039,13 +6039,13 @@ class ApplyPowerSign(PrimitiveWithInfer):

.. math::
\begin{array}{ll} \\
- m_{t} = \beta * m_{t-1} + (1 - \beta) * g \\
+ m_{t+1} = \beta * m_{t} + (1 - \beta) * g \\
\text{update} = \exp(\text{logbase} * \text{sign_decay} * sign(g) * sign(m)) * g \\
- var = var - lr_{t} * \text{update}
+ var = var - lr_{t+1} * \text{update}
\end{array}

- :math:`t` represents updating step while :math:`m` represents the 1st moment vector, :math:`m_{t-1}`
- is the last moment of :math:`m_{t}`, :math:`lr` represents scaling factor `lr`, :math:`g` represents `grad`.
+ :math:`t` represents updating step while :math:`m` represents the 1st moment vector, :math:`m_{t}`
+ is the last moment of :math:`m_{t+1}`, :math:`lr` represents scaling factor `lr`, :math:`g` represents `grad`.

All of inputs comply with the implicit type conversion rules to make the data types consistent.
If `lr`, `logbase`, `sign_decay` or `beta` is a number, the number is automatically converted to Tensor,
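
PowerSign is the multiplicative counterpart of AddSign: the sign agreement enters through an exponential factor. A hedged NumPy sketch (illustrative names; log(2) as the logbase default is an assumption):

    import numpy as np

    def powersign_step(var, m, g, lr=0.01, logbase=np.log(2.0),
                       sign_decay=0.99, beta=0.9):
        m = beta * m + (1 - beta) * g                                # m_{t+1}
        update = np.exp(logbase * sign_decay * np.sign(g) * np.sign(m)) * g
        var = var - lr * update                                      # var = var - lr_{t+1}*update
        return var, m
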
@@ -7130,12 +7130,12 @@ class DynamicRNN(PrimitiveWithInfer):

.. math::
\begin{array}{ll} \\
- i_t = \sigma(W_{ix} x_t + b_{ix} + W_{ih} h_{(t-1)} + b_{ih}) \\
- f_t = \sigma(W_{fx} x_t + b_{fx} + W_{fh} h_{(t-1)} + b_{fh}) \\
- \tilde{c}_t = \tanh(W_{cx} x_t + b_{cx} + W_{ch} h_{(t-1)} + b_{ch}) \\
- o_t = \sigma(W_{ox} x_t + b_{ox} + W_{oh} h_{(t-1)} + b_{oh}) \\
- c_t = f_t * c_{(t-1)} + i_t * \tilde{c}_t \\
- h_t = o_t * \tanh(c_t) \\
+ i_{t+1} = \sigma(W_{ix} x_{t+1} + b_{ix} + W_{ih} h_{(t)} + b_{ih}) \\
+ f_{t+1} = \sigma(W_{fx} x_{t+1} + b_{fx} + W_{fh} h_{(t)} + b_{fh}) \\
+ \tilde{c}_{t+1} = \tanh(W_{cx} x_{t+1} + b_{cx} + W_{ch} h_{(t)} + b_{ch}) \\
+ o_{t+1} = \sigma(W_{ox} x_{t+1} + b_{ox} + W_{oh} h_{(t)} + b_{oh}) \\
+ c_{t+1} = f_{t+1} * c_{(t)} + i_{t+1} * \tilde{c}_{t+1} \\
+ h_{t+1} = o_{t+1} * \tanh(c_{t+1}) \\
\end{array}

Here :math:`\sigma` is the sigmoid function, and :math:`*` is the Hadamard product. :math:`W, b`
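
The recurrence is a standard LSTM cell. A minimal NumPy sketch of one step (the per-gate dict layout is a hypothetical packing, not the DynamicRNN fused weight format):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_cell(x, h_prev, c_prev, Wx, Wh, b):
        # Wx, Wh, b are dicts keyed by gate: 'i', 'f', 'c', 'o';
        # the two per-gate biases b_{ix} + b_{ih} are folded into a single b.
        i = sigmoid(x @ Wx['i'] + h_prev @ Wh['i'] + b['i'])
        f = sigmoid(x @ Wx['f'] + h_prev @ Wh['f'] + b['f'])
        c_tilde = np.tanh(x @ Wx['c'] + h_prev @ Wh['c'] + b['c'])
        o = sigmoid(x @ Wx['o'] + h_prev @ Wh['o'] + b['o'])
        c = f * c_prev + i * c_tilde    # c_{t+1} uses the fresh i_{t+1}
        h = o * np.tanh(c)              # h_{t+1}
        return h, c
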
@@ -7285,16 +7285,16 @@ class DynamicGRUV2(PrimitiveWithInfer):
.. math::

\begin{array}{ll}
- r_t = \sigma(W_{ir} x_t + b_{ir} + W_{hr} h_{(t-1)} + b_{hr}) \\
- z_t = \sigma(W_{iz} x_t + b_{iz} + W_{hz} h_{(t-1)} + b_{hz}) \\
- n_t = \tanh(W_{in} x_t + b_{in} + r_t * (W_{hn} h_{(t-1)}+ b_{hn})) \\
- h_t = (1 - z_t) * n_t + z_t * h_{(t-1)}
+ r_{t+1} = \sigma(W_{ir} x_{t+1} + b_{ir} + W_{hr} h_{(t)} + b_{hr}) \\
+ z_{t+1} = \sigma(W_{iz} x_{t+1} + b_{iz} + W_{hz} h_{(t)} + b_{hz}) \\
+ n_{t+1} = \tanh(W_{in} x_{t+1} + b_{in} + r_{t+1} * (W_{hn} h_{(t)}+ b_{hn})) \\
+ h_{t+1} = (1 - z_{t+1}) * n_{t+1} + z_{t+1} * h_{(t)}
\end{array}

- where :math:`h_t` is the hidden state at time `t`, :math:`x_t` is the input
- at time `t`, :math:`h_{(t-1)}` is the hidden state of the layer
- at time `t-1` or the initial hidden state at time `0`, and :math:`r_t`,
- :math:`z_t`, :math:`n_t` are the reset, update, and new gates, respectively.
+ where :math:`h_{t+1}` is the hidden state at time `t+1`, :math:`x_{t+1}` is the input
+ at time `t+1`, :math:`h_{t}` is the hidden state of the layer
+ at time `t` or the initial hidden state at time `0`, and :math:`r_{t+1}`,
+ :math:`z_{t+1}`, :math:`n_{t+1}` are the reset, update, and new gates, respectively.
:math:`\sigma` is the sigmoid function, and :math:`*` is the Hadamard product.

Args:
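
Likewise, a hedged NumPy sketch of one GRU step matching the re-indexed equations (hypothetical per-gate dict layout; input and hidden biases are kept separate, as in the formulas):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def gru_cell(x, h_prev, Wx, Wh, b_i, b_h):
        # Wx, Wh, b_i, b_h are dicts keyed by gate: 'r', 'z', 'n'
        r = sigmoid(x @ Wx['r'] + b_i['r'] + h_prev @ Wh['r'] + b_h['r'])
        z = sigmoid(x @ Wx['z'] + b_i['z'] + h_prev @ Wh['z'] + b_h['z'])
        n = np.tanh(x @ Wx['n'] + b_i['n'] + r * (h_prev @ Wh['n'] + b_h['n']))
        h = (1 - z) * n + z * h_prev    # h_{t+1}
        return h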


model_zoo/official/cv/resnet/src/momentum.py (+3, -3)

@@ -37,17 +37,17 @@ class Momentum(Optimizer):
Refer to the paper on the importance of initialization and momentum in deep learning for more details.

.. math::
- v_{t} = v_{t-1} \ast u + gradients
+ v_{t+1} = v_{t} \ast u + gradients

If use_nesterov is True:

.. math::
- p_{t} = p_{t-1} - (grad \ast lr + v_{t} \ast u \ast lr)
+ p_{t+1} = p_{t} - (grad \ast lr + v_{t+1} \ast u \ast lr)

If use_nesterov is False:

.. math::
- p_{t} = p_{t-1} - lr \ast v_{t}
+ p_{t+1} = p_{t} - lr \ast v_{t+1}

Here, grad, lr, p, v and u denote the gradients, learning_rate, params, moments, and momentum respectively.
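
Since the update rule is identical to nn.Momentum above, the NumPy sketch there applies here too; for reference, a minimal construction sketch (toy network and hyperparameters are illustrative):

    import mindspore.nn as nn

    net = nn.Dense(16, 10)  # any Cell works in place of this toy layer
    opt = nn.Momentum(params=net.trainable_params(),
                      learning_rate=0.1,
                      momentum=0.9,
                      use_nesterov=False)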


