The number of Ascend (D) chips is allocated automatically according to the device_num set in the HCCL config file; you do not need to specify it yourself.
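As a minimal illustration, the device count can be read back from the HCCL config file itself. The helper below is only a sketch: it assumes the JSON layout produced by hccl_tools.py (a top-level "server_list" whose entries carry a "device" array), and the function name is hypothetical.

```python
import json

def infer_device_num(hccl_config_path):
    """Count the Ascend devices described in an HCCL config file.

    Assumes the JSON layout produced by hccl_tools.py: a top-level
    "server_list" whose entries each hold a "device" array.
    """
    with open(hccl_config_path) as f:
        cfg = json.load(f)
    return sum(len(server["device"]) for server in cfg["server_list"])

# Example (illustrative path):
# infer_device_num("model_zoo/utils/hccl_tools/hccl_2p_56_x.x.x.x.json")  # -> 2
```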
For example, to run distributed pretraining of the BERT model on Ascend (D) chips, run the following in the /bert/ directory:
python model_zoo/utils/ascend_distributed_launcher/run_distribute_pretrain.py --run_script_dir ./run_pretrain.py --hyper_parameter_config_dir model_zoo/utils/ascend_distributed_launcher/hyper_parameter_config.ini --data_dir /path/dataset/ --hccl_config_dir model_zoo/utils/hccl_tools/hccl_2p_56_x.x.x.x.json
output:
hccl_config_dir: model_zoo/utils/hccl_tools/hccl_2p_56_x.x.x.x.json
the number of logical core: 192
avg_core_per_rank: 96
rank_size: 2
start training for rank 0, device 5:
rank_id: 0
device_id: 5
core nums: 0-95
epoch_size: 8
data_dir: /data/small_512/
schema_dir:
log file dir: ./LOG5/log.txt
start training for rank 1, device 6:
rank_id: 1
device_id: 6
core nums: 96-191
epoch_size: 8
data_dir: /data/small_512/
schema_dir:
log file dir: ./LOG6/log.txt
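The core-binding lines above ("the number of logical core: 192", "avg_core_per_rank: 96", "core nums: 0-95" / "96-191") follow from an even split of the host's logical cores across ranks. The snippet below is a sketch of that arithmetic, not the launcher's actual code; the function name is illustrative.

```python
import multiprocessing

def core_range_for_rank(rank_id, rank_size):
    """Evenly split the host's logical cores across ranks.

    With 192 logical cores and rank_size 2 this gives "0-95" for rank 0
    and "96-191" for rank 1, matching the log above. Illustrative only.
    """
    total_cores = multiprocessing.cpu_count()      # e.g. 192
    avg_core_per_rank = total_cores // rank_size   # e.g. 96
    start = rank_id * avg_core_per_rank
    return f"{start}-{start + avg_core_per_rank - 1}"

# core_range_for_rank(0, 2) -> "0-95"; core_range_for_rank(1, 2) -> "96-191"
```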
Note that hccl_2p_56_x.x.x.x.json can be generated with hccl_tools.py.
For the hyperparameters, customize hyper_parameter_config.ini for your run. Note that the following two parameters must not be configured there, because they are assigned automatically for each rank (a quick check is sketched after the list):

- device_id
- device_num
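If you want to guard against these two keys creeping into the config, a small check like the one below can help. It is a sketch only: it iterates over whatever sections the file happens to have (no section name is assumed), and the function name is hypothetical.

```python
import configparser

def check_hyper_parameter_config(path):
    """Raise if device_id or device_num appear in hyper_parameter_config.ini.

    These two values are assigned automatically per rank by the launcher,
    so they must not be hard-coded in the config. Illustrative helper only.
    """
    parser = configparser.ConfigParser()
    parser.read(path)
    for section in parser.sections():
        for forbidden in ("device_id", "device_num"):
            if parser.has_option(section, forbidden):
                raise ValueError(
                    f"'{forbidden}' must not be set in {path}; "
                    "it is assigned automatically for each rank"
                )

# check_hyper_parameter_config("model_zoo/utils/ascend_distributed_launcher/hyper_parameter_config.ini")
```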
For other models, customize the run_script option and the corresponding hyper_parameter_config.ini accordingly.