
!31952 auto_parallel_support_not_only_2_power_fix_wont_gen_repeated_stra_bug_r1.7

Merge pull request !31952 from yao_yf/auto_parallel_support_not_only_2_power_fix_wont_gen_repeated_stra_bug_r1.7 into r1.7
i-robot (Gitee) committed 4 years ago
commit 26c1e0ec8b
5 changed files with 42 additions and 2 deletions:

1. docs/api/api_python/nn/mindspore.nn.AdaSumByDeltaWeightWrapCell.rst (+1, -1)
2. docs/api/api_python/nn/mindspore.nn.AdaSumByGradWrapCell.rst (+1, -1)
3. mindspore/ccsrc/frontend/parallel/ops_info/matmul_info.cc (+8, -0)
4. mindspore/ccsrc/frontend/parallel/ops_info/operator_info.cc (+7, -0)
5. mindspore/python/mindspore/nn/optim/adasum.py (+25, -0)

docs/api/api_python/nn/mindspore.nn.AdaSumByDeltaWeightWrapCell.rst (+1, -1)

@@ -3,7 +3,7 @@ mindspore.nn.AdaSumByDeltaWeightWrapCell

 .. py:class:: mindspore.nn.AdaSumByDeltaWeightWrapCell(optimizer)

-    Implementation of the Adaptive Summation (AdaSum) algorithm, computed from the difference of the parameters before and after the update.
+    Implementation of the Adaptive Summation (AdaSum) algorithm, computed from the difference of the parameters before and after the update. It is applied in semi_auto_parallel/auto_parallel mode.

     For details, see the paper `AdaSum: Scaling Distributed Training with Adaptive Summation <https://arxiv.org/abs/2006.02924>`_.



docs/api/api_python/nn/mindspore.nn.AdaSumByGradWrapCell.rst (+1, -1)

@@ -3,7 +3,7 @@ mindspore.nn.AdaSumByGradWrapCell

 .. py:class:: mindspore.nn.AdaSumByGradWrapCell(optimizer)

-    Implementation of the Adaptive Summation (AdaSum) algorithm, computed from the gradients.
+    Implementation of the Adaptive Summation (AdaSum) algorithm, computed from the gradients. It is applied in semi_auto_parallel/auto_parallel mode.

     For details, see the paper `AdaSum: Scaling Distributed Training with Adaptive Summation <https://arxiv.org/abs/2006.02924>`_.
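For context, a minimal usage sketch of the two wrapper cells documented above (not part of this commit). The network `net` is a placeholder, and the script assumes it runs in a distributed job already configured for semi_auto_parallel/auto_parallel with the device count required by the adasum.py docstrings further down.

```python
# Hedged usage sketch, not from this commit: assumes a distributed job already
# initialized in semi_auto_parallel/auto_parallel mode with enough devices
# (a power of 2, at least 16 cards, per the note in adasum.py below).
import mindspore.nn as nn

net = nn.Dense(16, 10)  # placeholder network just for illustration
base_opt = nn.Momentum(params=net.trainable_params(), learning_rate=0.1, momentum=0.9)

# Either combine the per-device gradients with AdaSum ...
opt = nn.AdaSumByGradWrapCell(base_opt)

# ... or combine the parameter deltas produced by the optimizer update.
opt = nn.AdaSumByDeltaWeightWrapCell(base_opt)
```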



mindspore/ccsrc/frontend/parallel/ops_info/matmul_info.cc (+8, -0)

@@ -484,6 +484,14 @@ Status MatMulBase::GenerateStrategiesNotPower2(int64_t stage_id, size_t dev_num_
     }
   }
   strategy_cost_.clear();
+  // add the repeated strategy
+  auto repeated_stra_arrays{inputs_shape_};
+  for (auto &stra_array : repeated_stra_arrays) {
+    std::fill(stra_array.begin(), stra_array.end(), 1);
+  }
+  StrategyPtr repeated_stra = std::make_shared<Strategy>(stage_id, repeated_stra_arrays);
+  sp_vector.push_back(repeated_stra);
+
   for (auto &sp : sp_vector) {
     if (SetCostUnderStrategy(sp) == FAILED) {
       MS_LOG(WARNING) << name_ << " : Calculating cost for strategy failed.";


mindspore/ccsrc/frontend/parallel/ops_info/operator_info.cc (+7, -0)

@@ -1470,6 +1470,13 @@ Status GenerateStrategiesForIndependentInputs(int64_t stage_id, const Shapes &in
       }
     }
   }
+  // add the repeated strategy
+  auto repeated_stra_arrays{splittable_inputs};
+  for (auto &stra_array : repeated_stra_arrays) {
+    std::fill(stra_array.begin(), stra_array.end(), 1);
+  }
+  StrategyPtr repeated_stra = std::make_shared<Strategy>(stage_id, repeated_stra_arrays);
+  sp_vector->push_back(repeated_stra);
   return SUCCESS;
 }
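For illustration, a small sketch (not part of the commit) of what the "repeated strategy" appended in matmul_info.cc and operator_info.cc above amounts to: every dimension of every input gets a split factor of 1, i.e. the operator is fully replicated, which is a legal fallback even when the device number is not a power of 2. The input shapes below are made up.

```python
# Hypothetical Python rendering of the C++ snippet above: copy the per-input
# shape arrays and overwrite every split factor with 1 (std::fill(..., 1)).
inputs_shape = [[64, 128], [128, 256]]  # e.g. the two MatMul inputs (made-up shapes)
repeated_strategy = [[1] * len(shape) for shape in inputs_shape]
assert repeated_strategy == [[1, 1], [1, 1]]  # fully replicated: no dimension is sharded
```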



mindspore/python/mindspore/nn/optim/adasum.py (+25, -0)

@@ -402,6 +402,18 @@ def _parallel_check():
 class AdaSumByGradWrapCell(Cell):
     r"""
     Enable the adasum in "auto_parallel/semi_auto_parallel" mode.
+    The implementation of the Adaptive Summation (AdaSum) algorithm is calculated from the gradients.
+    See the paper `AdaSum: Scaling Distributed Training with Adaptive Summation <https://arxiv.org/abs/2006.02924>`_.
+
+    .. math::
+        \begin{array}{ll}
+            w_{t+1} = w_{t} - \alpha \cdot Adasum(g_{1}, g_{2}) \\
+            w_{t+1} = w_{t} - \alpha \cdot [(1 - \frac{g_2^{T} \cdot g_1}{2 \cdot \left \| g_1 \right \|^2}) \cdot g_1 +
+            (1 - \frac{g_1^{T} \cdot g_2}{2 \cdot \left \| g_2 \right \|^2}) \cdot g_2] \\
+        \end{array}
+
+    In this implementation, :math:`g` represents the gradient of the weights,
+    and the subscripts represent different devices in the data-parallel dimension.
+
     Note:
         When using AdaSum, the number of training cards needs to be a power of 2 and at least 16 cards are required.
@@ -456,6 +468,19 @@
 class AdaSumByDeltaWeightWrapCell(Cell):
     r"""
     Enable the adasum in "auto_parallel/semi_auto_parallel" mode.
+    The implementation of the Adaptive Summation (AdaSum) algorithm is calculated based on the difference of
+    the weights before and after the update of the optimizer.
+    See the paper `AdaSum: Scaling Distributed Training with Adaptive Summation <https://arxiv.org/abs/2006.02924>`_.
+
+    .. math::
+        \begin{array}{ll}
+            w_{t+1} = w_{t} - \alpha \cdot Adasum(g_{1}, g_{2}) \\
+            w_{t+1} = w_{t} - \alpha \cdot [(1 - \frac{g_2^{T} \cdot g_1}{2 \cdot \left \| g_1 \right \|^2}) \cdot g_1 +
+            (1 - \frac{g_1^{T} \cdot g_2}{2 \cdot \left \| g_2 \right \|^2}) \cdot g_2] \\
+        \end{array}
+
+    In this implementation, :math:`g` represents the weight difference before and after the update of the
+    optimizer, and the subscripts represent different devices in the data-parallel dimension.
+
     Note:
         When using AdaSum, the number of training cards needs to be a power of 2 and at least 16 cards are required.
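To make the formula in the two docstrings concrete, here is a small NumPy sketch (not part of this commit) of the two-device AdaSum combination. `g1` and `g2` stand for the per-device gradients (AdaSumByGradWrapCell) or weight deltas (AdaSumByDeltaWeightWrapCell), and the update is then `w = w - alpha * adasum_combine(g1, g2)`.

```python
import numpy as np

def adasum_combine(g1, g2):
    """Combine two per-device gradient vectors with Adaptive Summation.

    Implements (1 - g2.g1 / (2*||g1||^2)) * g1 + (1 - g1.g2 / (2*||g2||^2)) * g2,
    i.e. each vector is shrunk by half of its projection onto the other before summing.
    """
    g1 = np.asarray(g1, dtype=np.float64).ravel()
    g2 = np.asarray(g2, dtype=np.float64).ravel()
    dot = float(np.dot(g1, g2))
    scale1 = 1.0 - dot / (2.0 * float(np.dot(g1, g1)))
    scale2 = 1.0 - dot / (2.0 * float(np.dot(g2, g2)))
    return scale1 * g1 + scale2 * g2

# Orthogonal gradients are simply added; identical gradients collapse to a single copy.
print(adasum_combine([1.0, 0.0], [0.0, 1.0]))  # -> [1. 1.]
print(adasum_combine([1.0, 0.0], [1.0, 0.0]))  # -> [1. 0.]
```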

