# 1: Inference and train with existing models and standard datasets

MMDetection provides hundreds of existing detection models in the [Model Zoo](https://mmdetection.readthedocs.io/en/latest/model_zoo.html), and supports multiple standard datasets, including Pascal VOC, COCO, CityScapes, LVIS, etc. This note shows how to perform common tasks on these existing models and standard datasets, including:

- Use existing models to run inference on given images.
- Test existing models on standard datasets.
- Train predefined models on standard datasets.
  6. ## Inference with existing models
By inference, we mean using trained models to detect objects in images. In MMDetection, a model is defined by a configuration file, and existing model parameters are saved in a checkpoint file.

To start with, we recommend [Faster RCNN](https://github.com/open-mmlab/mmdetection/tree/master/configs/faster_rcnn) with this [configuration file](https://github.com/open-mmlab/mmdetection/blob/master/configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py) and this [checkpoint file](https://download.openmmlab.com/mmdetection/v2.0/faster_rcnn/faster_rcnn_r50_fpn_1x_coco/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth). It is recommended to download the checkpoint file to the `checkpoints` directory.
  9. ### High-level APIs for inference
MMDetection provides high-level Python APIs for inference on images. Here is an example of building the model and running inference on given images or videos.
```python
from mmdet.apis import init_detector, inference_detector
import mmcv

# Specify the path to model config and checkpoint file
config_file = 'configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py'
checkpoint_file = 'checkpoints/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth'

# build the model from a config file and a checkpoint file
model = init_detector(config_file, checkpoint_file, device='cuda:0')

# test a single image and show the results
img = 'test.jpg'  # or img = mmcv.imread(img), which will only load it once
result = inference_detector(model, img)
# visualize the results in a new window
model.show_result(img, result)
# or save the visualization results to image files
model.show_result(img, result, out_file='result.jpg')

# test a video and show the results
video = mmcv.VideoReader('video.mp4')
for frame in video:
    result = inference_detector(model, frame)
    model.show_result(frame, result, wait_time=1)
```
  32. A notebook demo can be found in [demo/inference_demo.ipynb](https://github.com/open-mmlab/mmdetection/blob/master/demo/inference_demo.ipynb).
  33. Note: `inference_detector` only supports single-image inference for now.
  34. ### Asynchronous interface - supported for Python 3.7+
  35. For Python 3.7+, MMDetection also supports async interfaces.
By utilizing CUDA streams, the async interface avoids blocking the CPU on GPU-bound inference code and enables better CPU/GPU utilization for single-threaded applications. Inference can run concurrently either across different input samples or across different models of an inference pipeline.
  37. See `tests/async_benchmark.py` to compare the speed of synchronous and asynchronous interfaces.
```python
import asyncio
import torch
from mmdet.apis import init_detector, async_inference_detector
from mmdet.utils.contextmanagers import concurrent


async def main():
    config_file = 'configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py'
    checkpoint_file = 'checkpoints/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth'
    device = 'cuda:0'
    model = init_detector(config_file, checkpoint=checkpoint_file, device=device)

    # queue is used for concurrent inference of multiple images
    streamqueue = asyncio.Queue()
    # queue size defines concurrency level
    streamqueue_size = 3

    for _ in range(streamqueue_size):
        streamqueue.put_nowait(torch.cuda.Stream(device=device))

    # test a single image and show the results
    img = 'test.jpg'  # or img = mmcv.imread(img), which will only load it once

    async with concurrent(streamqueue):
        result = await async_inference_detector(model, img)

    # visualize the results in a new window
    model.show_result(img, result)
    # or save the visualization results to image files
    model.show_result(img, result, out_file='result.jpg')


asyncio.run(main())
```
  64. ### Demos
We also provide three demo scripts, implemented with the high-level APIs and supporting utility code.
The source code is available [here](https://github.com/open-mmlab/mmdetection/tree/master/demo).
  67. #### Image demo
  68. This script performs inference on a single image.
  69. ```shell
  70. python demo/image_demo.py \
  71. ${IMAGE_FILE} \
  72. ${CONFIG_FILE} \
  73. ${CHECKPOINT_FILE} \
  74. [--device ${GPU_ID}] \
  75. [--score-thr ${SCORE_THR}]
  76. ```
  77. Examples:
  78. ```shell
  79. python demo/image_demo.py demo/demo.jpg \
  80. configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py \
  81. checkpoints/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth \
  82. --device cpu
  83. ```
  84. #### Webcam demo
  85. This is a live demo from a webcam.
```shell
python demo/webcam_demo.py \
    ${CONFIG_FILE} \
    ${CHECKPOINT_FILE} \
    [--device ${GPU_ID}] \
    [--camera-id ${CAMERA_ID}] \
    [--score-thr ${SCORE_THR}]
```
  94. Examples:
  95. ```shell
  96. python demo/webcam_demo.py \
  97. configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py \
  98. checkpoints/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth
  99. ```
  100. #### Video demo
  101. This script performs inference on a video.
  102. ```shell
  103. python demo/video_demo.py \
  104. ${VIDEO_FILE} \
  105. ${CONFIG_FILE} \
  106. ${CHECKPOINT_FILE} \
  107. [--device ${GPU_ID}] \
  108. [--score-thr ${SCORE_THR}] \
  109. [--out ${OUT_FILE}] \
  110. [--show] \
  111. [--wait-time ${WAIT_TIME}]
  112. ```
  113. Examples:
  114. ```shell
  115. python demo/video_demo.py demo/demo.mp4 \
  116. configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py \
  117. checkpoints/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth \
  118. --out result.mp4
  119. ```
  120. ## Test existing models on standard datasets
  121. To evaluate a model's accuracy, one usually tests the model on some standard datasets.
  122. MMDetection supports multiple public datasets including COCO, Pascal VOC, CityScapes, and [more](https://github.com/open-mmlab/mmdetection/tree/master/configs/_base_/datasets).
  123. This section will show how to test existing models on supported datasets.
  124. ### Prepare datasets
Public datasets such as [Pascal VOC](http://host.robots.ox.ac.uk/pascal/VOC/index.html) and [COCO](https://cocodataset.org/#download) are available from their official websites or mirrors. Note: In the detection task, Pascal VOC 2012 is an extension of Pascal VOC 2007 without overlap, and the two are usually used together.
  126. It is recommended to download and extract the dataset somewhere outside the project directory and symlink the dataset root to `$MMDETECTION/data` as below.
  127. If your folder structure is different, you may need to change the corresponding paths in config files.
  128. ```text
  129. mmdetection
  130. ├── mmdet
  131. ├── tools
  132. ├── configs
  133. ├── data
  134. │ ├── coco
  135. │ │ ├── annotations
  136. │ │ ├── train2017
  137. │ │ ├── val2017
  138. │ │ ├── test2017
  139. │ ├── cityscapes
  140. │ │ ├── annotations
  141. │ │ ├── leftImg8bit
  142. │ │ │ ├── train
  143. │ │ │ ├── val
  144. │ │ ├── gtFine
  145. │ │ │ ├── train
  146. │ │ │ ├── val
  147. │ ├── VOCdevkit
  148. │ │ ├── VOC2007
  149. │ │ ├── VOC2012
  150. ```
Some models, such as HTC, DetectoRS and SCNet, require the additional [COCO-stuff](http://calvin.inf.ed.ac.uk/wp-content/uploads/data/cocostuffdataset/stuffthingmaps_trainval2017.zip) dataset. Download and unzip it, then move it to the coco folder. The directory should look like this.
  152. ```text
  153. mmdetection
  154. ├── data
  155. │ ├── coco
  156. │ │ ├── annotations
  157. │ │ ├── train2017
  158. │ │ ├── val2017
  159. │ │ ├── test2017
  160. │ │ ├── stuffthingmaps
  161. ```
Panoptic segmentation models like PanopticFPN require the additional [COCO Panoptic](http://images.cocodataset.org/annotations/panoptic_annotations_trainval2017.zip) dataset. Download and unzip it, then move it to the coco annotation folder. The directory should look like this.
  163. ```text
  164. mmdetection
  165. ├── data
  166. │ ├── coco
  167. │ │ ├── annotations
  168. │ │ │ ├── panoptic_train2017.json
  169. │ │ │ ├── panoptic_train2017
  170. │ │ │ ├── panoptic_val2017.json
  171. │ │ │ ├── panoptic_val2017
  172. │ │ ├── train2017
  173. │ │ ├── val2017
  174. │ │ ├── test2017
  175. ```
  176. The [cityscapes](https://www.cityscapes-dataset.com/) annotations need to be converted into the coco format using `tools/dataset_converters/cityscapes.py`:
  177. ```shell
  178. pip install cityscapesscripts
  179. python tools/dataset_converters/cityscapes.py \
  180. ./data/cityscapes \
  181. --nproc 8 \
  182. --out-dir ./data/cityscapes/annotations
  183. ```
  184. TODO: CHANGE TO THE NEW PATH
  185. ### Test existing models
  186. We provide testing scripts for evaluating an existing model on the whole dataset (COCO, PASCAL VOC, Cityscapes, etc.).
  187. The following testing environments are supported:
  188. - single GPU
  189. - single node multiple GPUs
  190. - multiple nodes
  191. Choose the proper script to perform testing depending on the testing environment.
  192. ```shell
  193. # single-gpu testing
  194. python tools/test.py \
  195. ${CONFIG_FILE} \
  196. ${CHECKPOINT_FILE} \
  197. [--out ${RESULT_FILE}] \
  198. [--eval ${EVAL_METRICS}] \
  199. [--show]
  200. # multi-gpu testing
  201. bash tools/dist_test.sh \
  202. ${CONFIG_FILE} \
  203. ${CHECKPOINT_FILE} \
  204. ${GPU_NUM} \
  205. [--out ${RESULT_FILE}] \
  206. [--eval ${EVAL_METRICS}]
  207. ```
  208. `tools/dist_test.sh` also supports multi-node testing, but relies on PyTorch's [launch utility](https://pytorch.org/docs/stable/distributed.html#launch-utility).
  209. Optional arguments:
  210. - `RESULT_FILE`: Filename of the output results in pickle format. If not specified, the results will not be saved to a file.
  211. - `EVAL_METRICS`: Items to be evaluated on the results. Allowed values depend on the dataset, e.g., `proposal_fast`, `proposal`, `bbox`, `segm` are available for COCO, `mAP`, `recall` for PASCAL VOC. Cityscapes could be evaluated by `cityscapes` as well as all COCO metrics.
  212. - `--show`: If specified, detection results will be plotted on the images and shown in a new window. It is only applicable to single GPU testing and used for debugging and visualization. Please make sure that GUI is available in your environment. Otherwise, you may encounter an error like `cannot connect to X server`.
  213. - `--show-dir`: If specified, detection results will be plotted on the images and saved to the specified directory. It is only applicable to single GPU testing and used for debugging and visualization. You do NOT need a GUI available in your environment for using this option.
  214. - `--show-score-thr`: If specified, detections with scores below this threshold will be removed.
- `--cfg-options`: If specified, the given key-value pairs are merged into the config file.
- `--eval-options`: If specified, the given key-value pairs are passed as kwargs to the `dataset.evaluate()` function. It only affects evaluation.
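
To make the `RESULT_FILE` option more concrete, here is a minimal sketch of reading a saved pickle and counting confident detections per image. It assumes the common layout for box-only detectors: one entry per image, each holding one per-class collection of `[x1, y1, x2, y2, score]` rows. The helper name `count_confident_detections`, the threshold value and the synthetic data are illustrative assumptions, and models that also predict masks may store a different structure.

```python
import os
import pickle
import tempfile


def count_confident_detections(result_file, score_thr=0.3):
    """Count boxes with score >= score_thr for every image in a results pickle.

    Assumed layout (box-only detectors): results[image][class] is a sequence
    of rows [x1, y1, x2, y2, score].
    """
    with open(result_file, 'rb') as f:
        results = pickle.load(f)
    counts = []
    for per_image in results:
        counts.append(
            sum(1 for per_class in per_image for box in per_class
                if box[4] >= score_thr))
    return counts


# Synthetic stand-in for a real results.pkl: 2 images, 2 classes.
fake_results = [
    [[[10, 20, 50, 80, 0.9]], []],
    [[[5, 5, 30, 30, 0.2]], [[0, 0, 10, 10, 0.6], [1, 1, 9, 9, 0.1]]],
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'results.pkl')
    with open(path, 'wb') as f:
        pickle.dump(fake_results, f)
    counts = count_confident_detections(path)

print(counts)
```

With a real file, pass the path given to `--out` instead of the temporary one. Loading it requires the libraries that produced the stored arrays to be importable.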
  217. ### Examples
The examples below assume that you have already downloaded the checkpoints to the directory `checkpoints/`.
  219. 1. Test Faster R-CNN and visualize the results. Press any key for the next image.
  220. Config and checkpoint files are available [here](https://github.com/open-mmlab/mmdetection/tree/master/configs/faster_rcnn).
  221. ```shell
  222. python tools/test.py \
  223. configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py \
  224. checkpoints/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth \
  225. --show
  226. ```
  227. 2. Test Faster R-CNN and save the painted images for future visualization.
  228. Config and checkpoint files are available [here](https://github.com/open-mmlab/mmdetection/tree/master/configs/faster_rcnn).
  229. ```shell
  230. python tools/test.py \
  231. configs/faster_rcnn/faster_rcnn_r50_fpn_1x_coco.py \
  232. checkpoints/faster_rcnn_r50_fpn_1x_coco_20200130-047c8118.pth \
  233. --show-dir faster_rcnn_r50_fpn_1x_results
  234. ```
  235. 3. Test Faster R-CNN on PASCAL VOC (without saving the test results) and evaluate the mAP.
  236. Config and checkpoint files are available [here](https://github.com/open-mmlab/mmdetection/tree/master/configs/pascal_voc).
  237. ```shell
  238. python tools/test.py \
  239. configs/pascal_voc/faster_rcnn_r50_fpn_1x_voc.py \
  240. checkpoints/faster_rcnn_r50_fpn_1x_voc0712_20200624-c9895d40.pth \
  241. --eval mAP
  242. ```
  243. 4. Test Mask R-CNN with 8 GPUs, and evaluate the bbox and mask AP.
  244. Config and checkpoint files are available [here](https://github.com/open-mmlab/mmdetection/tree/master/configs/mask_rcnn).
```shell
./tools/dist_test.sh \
    configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py \
    checkpoints/mask_rcnn_r50_fpn_1x_coco_20200205-d4b0c5d6.pth \
    8 \
    --out results.pkl \
    --eval bbox segm
```
  253. 5. Test Mask R-CNN with 8 GPUs, and evaluate the **classwise** bbox and mask AP.
  254. Config and checkpoint files are available [here](https://github.com/open-mmlab/mmdetection/tree/master/configs/mask_rcnn).
```shell
./tools/dist_test.sh \
    configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py \
    checkpoints/mask_rcnn_r50_fpn_1x_coco_20200205-d4b0c5d6.pth \
    8 \
    --out results.pkl \
    --eval bbox segm \
    --eval-options "classwise=True"
```
  264. 6. Test Mask R-CNN on COCO test-dev with 8 GPUs, and generate JSON files for submitting to the official evaluation server.
  265. Config and checkpoint files are available [here](https://github.com/open-mmlab/mmdetection/tree/master/configs/mask_rcnn).
```shell
./tools/dist_test.sh \
    configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py \
    checkpoints/mask_rcnn_r50_fpn_1x_coco_20200205-d4b0c5d6.pth \
    8 \
    --format-only \
    --eval-options "jsonfile_prefix=./mask_rcnn_test-dev_results"
```
  274. This command generates two JSON files `mask_rcnn_test-dev_results.bbox.json` and `mask_rcnn_test-dev_results.segm.json`.
  275. 7. Test Mask R-CNN on Cityscapes test with 8 GPUs, and generate txt and png files for submitting to the official evaluation server.
  276. Config and checkpoint files are available [here](https://github.com/open-mmlab/mmdetection/tree/master/configs/cityscapes).
```shell
./tools/dist_test.sh \
    configs/cityscapes/mask_rcnn_r50_fpn_1x_cityscapes.py \
    checkpoints/mask_rcnn_r50_fpn_1x_cityscapes_20200227-afe51d5a.pth \
    8 \
    --format-only \
    --eval-options "txtfile_prefix=./mask_rcnn_cityscapes_test_results"
```
  285. The generated png and txt would be under `./mask_rcnn_cityscapes_test_results` directory.
  286. ### Test without Ground Truth Annotations
MMDetection supports testing models without ground-truth annotations using `CocoDataset`. If your dataset is not in COCO format, please convert it to COCO format first. For example, a VOC-format dataset can be converted directly with [this script in tools](https://github.com/open-mmlab/mmdetection/tree/master/tools/dataset_converters/pascal_voc.py), and a Cityscapes-format dataset with [this script in tools](https://github.com/open-mmlab/mmdetection/tree/master/tools/dataset_converters/cityscapes.py). Other formats can be converted using [this script](https://github.com/open-mmlab/mmdetection/tree/master/tools/dataset_converters/images2coco.py).

```shell
python tools/dataset_converters/images2coco.py \
    ${IMG_PATH} \
    ${CLASSES} \
    ${OUT} \
    [--exclude-extensions]
```
Arguments:

- `IMG_PATH`: The root path of images.
- `CLASSES`: The text file with a list of categories.
- `OUT`: The output annotation json file name. The file is saved in the same directory as `IMG_PATH`.
- `--exclude-extensions`: The suffixes of images to be excluded, such as 'png' and 'bmp'.
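
For orientation, the sketch below builds the rough shape of a COCO-style annotation file that lists images and categories but contains no ground-truth boxes. It is an illustration only, not the converter's exact output: the helper `build_image_only_coco`, the zero-based ids and the sample file names are assumptions made for this example.

```python
import json


def build_image_only_coco(image_infos, class_names):
    """Build a COCO-style dict with images and categories but no annotations.

    image_infos: iterable of (file_name, width, height) tuples.
    class_names: list of category names, e.g. read from the CLASSES text file.
    """
    return {
        'images': [
            dict(id=idx, file_name=name, width=width, height=height)
            for idx, (name, width, height) in enumerate(image_infos)
        ],
        'categories': [
            dict(id=idx, name=name) for idx, name in enumerate(class_names)
        ],
        # No ground truth is available, so the annotation list stays empty.
        'annotations': [],
    }


coco = build_image_only_coco(
    image_infos=[('0001.jpg', 640, 480), ('0002.jpg', 1280, 720)],
    class_names=['person', 'car'],
)
print(json.dumps(coco, indent=2))
```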
After the conversion is complete, you can use the following commands to test:

```shell
# single-gpu testing
python tools/test.py \
    ${CONFIG_FILE} \
    ${CHECKPOINT_FILE} \
    --format-only \
    --eval-options ${JSONFILE_PREFIX} \
    [--show]

# multi-gpu testing
bash tools/dist_test.sh \
    ${CONFIG_FILE} \
    ${CHECKPOINT_FILE} \
    ${GPU_NUM} \
    --format-only \
    --eval-options ${JSONFILE_PREFIX} \
    [--show]
```
  318. Assuming that the checkpoints in the [model zoo](https://mmdetection.readthedocs.io/en/latest/modelzoo_statistics.html) have been downloaded to the directory `checkpoints/`, we can test Mask R-CNN on COCO test-dev with 8 GPUs, and generate JSON files using the following command.
```shell
./tools/dist_test.sh \
    configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py \
    checkpoints/mask_rcnn_r50_fpn_1x_coco_20200205-d4b0c5d6.pth \
    8 \
    --format-only \
    --eval-options "jsonfile_prefix=./mask_rcnn_test-dev_results"
```
  327. This command generates two JSON files `mask_rcnn_test-dev_results.bbox.json` and `mask_rcnn_test-dev_results.segm.json`.
  328. ### Batch Inference
MMDetection supports inference with a single image or batched images in test mode. By default, single-image inference is used. To enable batch inference, change `samples_per_gpu` in the test data config, either by editing the config as below:

```python
data = dict(train=dict(...), val=dict(...), test=dict(samples_per_gpu=2, ...))
```

or by setting it on the command line through `--cfg-options`, e.g. `--cfg-options data.test.samples_per_gpu=2`.
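
To illustrate what a dotted override such as `data.test.samples_per_gpu=2` means, here is a minimal sketch of merging dotted keys into a nested config dictionary. The helper `merge_dotted_options` is hypothetical and only mimics the idea. It is not the implementation used by the command-line tools.

```python
import copy


def merge_dotted_options(cfg, options):
    """Return a copy of cfg with dotted keys from options merged in."""
    merged = copy.deepcopy(cfg)
    for dotted_key, value in options.items():
        *parents, leaf = dotted_key.split('.')
        node = merged
        for name in parents:
            # Walk down the nested dicts, creating levels that do not exist.
            node = node.setdefault(name, {})
        node[leaf] = value
    return merged


# A toy config: single-image inference in test mode.
cfg = dict(
    data=dict(
        train=dict(samples_per_gpu=2),
        test=dict(samples_per_gpu=1)))

new_cfg = merge_dotted_options(cfg, {'data.test.samples_per_gpu': 2})
print(new_cfg['data']['test'])
```

Only the addressed leaf changes. Sibling entries such as `data.train` are left untouched, and the original dictionary is not modified.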
  334. ### Deprecated ImageToTensor
In test mode, the `ImageToTensor` pipeline is deprecated and replaced by `DefaultFormatBundle`. It is recommended to replace it manually in the test data pipeline of your config file. Examples:
```python
# use ImageToTensor (deprecated)
pipelines = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1333, 800),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(type='Normalize', mean=[0, 0, 0], std=[1, 1, 1]),
            dict(type='Pad', size_divisor=32),
            dict(type='ImageToTensor', keys=['img']),
            dict(type='Collect', keys=['img']),
        ])
]

# manually replace ImageToTensor with DefaultFormatBundle (recommended)
pipelines = [
    dict(type='LoadImageFromFile'),
    dict(
        type='MultiScaleFlipAug',
        img_scale=(1333, 800),
        flip=False,
        transforms=[
            dict(type='Resize', keep_ratio=True),
            dict(type='RandomFlip'),
            dict(type='Normalize', mean=[0, 0, 0], std=[1, 1, 1]),
            dict(type='Pad', size_divisor=32),
            dict(type='DefaultFormatBundle'),
            dict(type='Collect', keys=['img']),
        ])
]
```
  370. ## Train predefined models on standard datasets
  371. MMDetection also provides out-of-the-box tools for training detection models.
This section shows how to train _predefined_ models (under [configs](https://github.com/open-mmlab/mmdetection/tree/master/configs)) on standard datasets, e.g., COCO.

**Important**: The default learning rate in config files is for 8 GPUs and 2 img/gpu (batch size = 8\*2 = 16).
According to the [linear scaling rule](https://arxiv.org/abs/1706.02677), you need to set the learning rate proportional to the batch size if you use a different number of GPUs or images per GPU, e.g., `lr=0.01` for 4 GPUs \* 2 imgs/gpu and `lr=0.08` for 16 GPUs \* 4 imgs/gpu.
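
The arithmetic behind the linear scaling rule can be sketched in a few lines. The base learning rate of 0.02 at batch size 16 is inferred from the two examples above rather than read from a specific config, and the helper name `scaled_lr` is hypothetical. Check the `lr` value in the config you actually use.

```python
def scaled_lr(num_gpus, imgs_per_gpu, base_lr=0.02, base_batch_size=16):
    """Scale the learning rate linearly with the total batch size."""
    total_batch_size = num_gpus * imgs_per_gpu
    return base_lr * total_batch_size / base_batch_size


for gpus, imgs in [(8, 2), (4, 2), (16, 4)]:
    lr = scaled_lr(gpus, imgs)
    print(f'{gpus} GPUs x {imgs} imgs/gpu -> lr={lr:.2f}')
```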
  375. ### Prepare datasets
  376. Training requires preparing datasets too. See section [Prepare datasets](#prepare-datasets) above for details.
  377. **Note**:
  378. Currently, the config files under `configs/cityscapes` use COCO pretrained weights to initialize.
If the network connection is unavailable or slow, download these pretrained models in advance. Otherwise, errors will occur at the beginning of training.
  380. ### Training on a single GPU
  381. We provide `tools/train.py` to launch training jobs on a single GPU.
  382. The basic usage is as follows.
  383. ```shell
  384. python tools/train.py \
  385. ${CONFIG_FILE} \
  386. [optional arguments]
  387. ```
  388. During training, log files and checkpoints will be saved to the working directory, which is specified by `work_dir` in the config file or via CLI argument `--work-dir`.
By default, the model is evaluated on the validation set every epoch. The evaluation interval can be specified in the config file as shown below.

```python
# evaluate the model every 12 epochs.
evaluation = dict(interval=12)
```
  394. This tool accepts several optional arguments, including:
  395. - `--no-validate` (**not suggested**): Disable evaluation during training.
  396. - `--work-dir ${WORK_DIR}`: Override the working directory.
  397. - `--resume-from ${CHECKPOINT_FILE}`: Resume from a previous checkpoint file.
  398. - `--options 'Key=value'`: Overrides other settings in the used config.
  399. **Note**:
  400. Difference between `resume-from` and `load-from`:
  401. `resume-from` loads both the model weights and optimizer status, and the epoch is also inherited from the specified checkpoint. It is usually used for resuming the training process that is interrupted accidentally.
  402. `load-from` only loads the model weights and the training epoch starts from 0. It is usually used for finetuning.
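
The difference between the two options can be summarized with a small sketch. The checkpoint layout used here (`state_dict`, `optimizer`, `meta['epoch']`) and the helper `initial_training_state` are assumptions chosen for illustration, not a description of the actual loading code.

```python
def initial_training_state(checkpoint, mode):
    """Return what training would start from under each option (illustrative)."""
    if mode == 'resume-from':
        # Restore everything: weights, optimizer status and epoch counter.
        return {
            'weights': checkpoint['state_dict'],
            'optimizer': checkpoint['optimizer'],
            'start_epoch': checkpoint['meta']['epoch'],
        }
    if mode == 'load-from':
        # Only reuse the weights; training starts again from epoch 0.
        return {
            'weights': checkpoint['state_dict'],
            'optimizer': None,
            'start_epoch': 0,
        }
    raise ValueError(f'unknown mode: {mode}')


# A toy checkpoint saved after 7 epochs.
checkpoint = {
    'state_dict': {'backbone.conv1.weight': [0.1, 0.2]},
    'optimizer': {'lr': 0.02, 'momentum': 0.9},
    'meta': {'epoch': 7},
}

resumed = initial_training_state(checkpoint, 'resume-from')
finetuned = initial_training_state(checkpoint, 'load-from')
print(resumed['start_epoch'], finetuned['start_epoch'])
```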
  403. ### Training on multiple GPUs
  404. We provide `tools/dist_train.sh` to launch training on multiple GPUs.
  405. The basic usage is as follows.
  406. ```shell
  407. bash ./tools/dist_train.sh \
  408. ${CONFIG_FILE} \
  409. ${GPU_NUM} \
  410. [optional arguments]
  411. ```
Optional arguments remain the same as stated [above](#training-on-a-single-gpu).
  413. #### Launch multiple jobs simultaneously
  414. If you would like to launch multiple jobs on a single machine, e.g., 2 jobs of 4-GPU training on a machine with 8 GPUs,
  415. you need to specify different ports (29500 by default) for each job to avoid communication conflict.
  416. If you use `dist_train.sh` to launch training jobs, you can set the port in commands.
  417. ```shell
  418. CUDA_VISIBLE_DEVICES=0,1,2,3 PORT=29500 ./tools/dist_train.sh ${CONFIG_FILE} 4
  419. CUDA_VISIBLE_DEVICES=4,5,6,7 PORT=29501 ./tools/dist_train.sh ${CONFIG_FILE} 4
  420. ```
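
If you are unsure which ports are free on the machine, a small standard-library helper can ask the operating system for one. This is only a convenience sketch and not part of MMDetection. The name `find_free_port` is hypothetical, and the port could in principle be taken by another process between the check and the launch.

```python
import socket


def find_free_port(host='127.0.0.1'):
    """Ask the OS for a currently unused TCP port on the given host."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Binding to port 0 lets the OS pick any free port.
        sock.bind((host, 0))
        return sock.getsockname()[1]


port = find_free_port()
print(f'PORT={port}')
```

The printed value can then be passed as the `PORT` environment variable when launching `dist_train.sh`.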
  421. ### Training on multiple nodes
MMDetection relies on the `torch.distributed` package for distributed training.
  423. Thus, as a basic usage, one can launch distributed training via PyTorch's [launch utility](https://pytorch.org/docs/stable/distributed.html#launch-utility).
  424. ### Manage jobs with Slurm
  425. [Slurm](https://slurm.schedmd.com/) is a good job scheduling system for computing clusters.
  426. On a cluster managed by Slurm, you can use `slurm_train.sh` to spawn training jobs. It supports both single-node and multi-node training.
  427. The basic usage is as follows.
  428. ```shell
  429. [GPUS=${GPUS}] ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} ${CONFIG_FILE} ${WORK_DIR}
  430. ```
Below is an example of using 16 GPUs to train Mask R-CNN on a Slurm partition named _dev_, with the work-dir set to a shared file system.

```shell
GPUS=16 ./tools/slurm_train.sh dev mask_r50_1x configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py /nfs/xxxx/mask_rcnn_r50_fpn_1x
```
  435. You can check [the source code](https://github.com/open-mmlab/mmdetection/blob/master/tools/slurm_train.sh) to review full arguments and environment variables.
When using Slurm, the port option needs to be set in one of the following ways:

1. Set the port through `--options`. This is the recommended way since it does not change the original configs.
  438. ```shell
  439. CUDA_VISIBLE_DEVICES=0,1,2,3 GPUS=4 ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py ${WORK_DIR} --options 'dist_params.port=29500'
  440. CUDA_VISIBLE_DEVICES=4,5,6,7 GPUS=4 ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config2.py ${WORK_DIR} --options 'dist_params.port=29501'
  441. ```
  442. 2. Modify the config files to set different communication ports.
  443. In `config1.py`, set
  444. ```python
  445. dist_params = dict(backend='nccl', port=29500)
  446. ```
  447. In `config2.py`, set
  448. ```python
  449. dist_params = dict(backend='nccl', port=29501)
  450. ```
  451. Then you can launch two jobs with `config1.py` and `config2.py`.
  452. ```shell
  453. CUDA_VISIBLE_DEVICES=0,1,2,3 GPUS=4 ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config1.py ${WORK_DIR}
  454. CUDA_VISIBLE_DEVICES=4,5,6,7 GPUS=4 ./tools/slurm_train.sh ${PARTITION} ${JOB_NAME} config2.py ${WORK_DIR}
  455. ```
