This file documents a large collection of baselines trained
with detectron2 in Sep-Oct, 2019.
The corresponding configurations for all models can be found under the configs/ directory.
Unless otherwise noted, the following settings are used for all runs:

* All models were trained on Big Basin servers with 8 NVIDIA V100 GPUs, with data-parallel sync SGD and a total minibatch size of 16 images.
* All models were trained with CUDA 9.2, cuDNN 7.4.2 or 7.6.3 (the difference in speed is found to be negligible).
* Training curves and other statistics can be found in `metrics` for each model.

The default settings are not directly comparable with Detectron.
For example, our default training data augmentation uses scale jittering in addition to horizontal flipping.
For configs that are comparable to Detectron's settings, see
Detectron1-Comparisons for accuracy comparison,
and benchmarks
for speed comparison.
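
As a rough illustration of the default training augmentation mentioned above (scale jittering plus horizontal flipping), here is a minimal sketch. The short-edge choices, the cap on the longer edge, and the flip probability are assumptions that are not stated in this document, and the helper names are hypothetical; the actual values live in the config files.

```python
import random

# Assumed values (not stated in this document): candidate short-edge sizes
# and the cap on the longer edge.
SHORT_EDGE_CHOICES = (640, 672, 704, 736, 768, 800)
MAX_LONG_EDGE = 1333


def jittered_size(height, width, rng):
    """Pick a random short-edge target and return the resized (height, width)."""
    target = rng.choice(SHORT_EDGE_CHOICES)
    scale = target / min(height, width)
    # Shrink further if the longer edge would exceed the cap.
    if max(height, width) * scale > MAX_LONG_EDGE:
        scale = MAX_LONG_EDGE / max(height, width)
    return round(height * scale), round(width * scale)


def sample_augmentation(height, width, rng, flip_prob=0.5):
    """Return the output size and whether to flip the image horizontally."""
    new_h, new_w = jittered_size(height, width, rng)
    flip = rng.random() < flip_prob
    return {"height": new_h, "width": new_w, "flip": flip}


rng = random.Random(0)
aug = sample_augmentation(480, 640, rng)
```

Each call draws a new size and flip decision, so the same image is seen at different scales across iterations.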
Inference speed is measured by `tools/train_net.py --eval-only`, or `inference_on_dataset()`,
with batch size 1 in detectron2 directly.
Actual deployment should in general be faster than the given inference
speed because it can apply more optimizations.
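
To make the "s/im" unit concrete, the following is a minimal sketch of timing a model one image at a time (batch size 1). It is not the measurement code used for this page: `seconds_per_image` and `dummy_model` are hypothetical names, and the warm-up count is an assumption. On a GPU one would typically also synchronize the device before reading the clock.

```python
import time


def seconds_per_image(model, images, warmup=2):
    """Average wall-clock seconds per image, feeding one image at a time."""
    for image in images[:warmup]:
        model(image)  # warm-up calls are excluded from the timing
    timed = images[warmup:]
    start = time.perf_counter()
    for image in timed:
        model(image)
    return (time.perf_counter() - start) / len(timed)


def dummy_model(image):
    # Hypothetical stand-in for a detector; it just sums the pixel values.
    return sum(image)


images = [[0.5] * 1000 for _ in range(12)]
speed = seconds_per_image(dummy_model, images)
```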
Training speed is averaged across the entire training.
We keep updating the speed numbers with the latest versions of detectron2, PyTorch, etc.,
so they might differ from the values in the metrics file.
All COCO models were trained on train2017 and evaluated on val2017.
For Faster/Mask R-CNN, we provide baselines based on 3 different backbone combinations: C4, DC5, and FPN (the suffixes used in the Name column of the tables).
Most models are trained with the 3x schedule (~37 COCO epochs).
Although 1x models are heavily under-trained, we provide some ResNet-50 models with the 1x (~12 COCO epochs)
training schedule for comparison when doing quick research iteration.
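
The epoch figures above can be sanity-checked with a little arithmetic. This sketch assumes iteration counts of 90,000 (1x) and 270,000 (3x) and a train2017 size of 118,287 images; none of these three numbers is stated in this document, so treat them as assumptions.

```python
# Assumptions not stated in this document: the 1x schedule runs 90,000
# iterations, the 3x schedule 270,000, and COCO train2017 has 118,287 images.
ITERATIONS = {"1x": 90_000, "3x": 270_000}
TRAIN2017_IMAGES = 118_287
IMAGES_PER_ITERATION = 16  # total minibatch size used for all runs


def coco_epochs(schedule):
    """Approximate number of passes over train2017 for a schedule."""
    return ITERATIONS[schedule] * IMAGES_PER_ITERATION / TRAIN2017_IMAGES


for name in ITERATIONS:
    print(f"{name}: ~{coco_epochs(name):.1f} epochs")
```

Under these assumptions the two schedules come out at roughly 12 and 37 epochs, in line with the figures quoted above.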
The model id column is provided for ease of reference.
To let you check the integrity of a downloaded file, every model on this page contains its md5 prefix in its file name.
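
A minimal sketch of such an integrity check is shown below. It assumes the file name ends in `_<hex prefix>.<extension>`; that naming pattern and the helper name `md5_prefix_matches` are illustrative assumptions rather than something this page specifies.

```python
import hashlib
import os
import re
import tempfile


def md5_prefix_matches(path):
    """Compare the hex suffix in a file name with the file's md5 digest."""
    # Assumed naming pattern: "<anything>_<hex prefix>.<extension>".
    match = re.search(r"_([0-9a-f]+)\.[^.]+$", os.path.basename(path))
    if match is None:
        raise ValueError(f"no hex suffix found in {path!r}")
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large checkpoints do not fill memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest().startswith(match.group(1))


# Self-check on temporary files: one named after its own md5 prefix, one not.
payload = b"example weights"
prefix = hashlib.md5(payload).hexdigest()[:6]
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, f"model_final_{prefix}.pkl")
    with open(good, "wb") as f:
        f.write(payload)
    ok = md5_prefix_matches(good)
    bad = os.path.join(tmp, "model_final_000000.pkl")
    with open(bad, "wb") as f:
        f.write(payload)
    mismatch = md5_prefix_matches(bad)
```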
Each model also comes with a metrics file with all the training statistics and evaluation curves.
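
If the metrics file is stored as JSON lines (one JSON object per line, an assumption here), a curve can be extracted with a few lines of Python. The field names in the sample are hypothetical and only illustrate the shape of the data.

```python
import io
import json


def read_metrics(lines):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def series(records, key):
    """Collect (iteration, value) pairs for records that contain `key`."""
    return [(r["iteration"], r[key]) for r in records if key in r]


# Hypothetical sample; the field names are illustrative only.
sample = io.StringIO(
    '{"iteration": 19, "total_loss": 1.92}\n'
    '{"iteration": 39, "total_loss": 1.47}\n'
    '{"iteration": 39, "bbox/AP": 12.3}\n'
)
records = read_metrics(sample)
loss_curve = series(records, "total_loss")
```

For a real file, pass an open file handle to `read_metrics` instead of the in-memory sample.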
We provide backbone models pretrained on the ImageNet-1k dataset.
These models are different from those provided in Detectron: we do not fuse BatchNorm into an affine layer.
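
For readers unfamiliar with the term, fusing BatchNorm into an affine layer means replacing the inference-time normalization with a single per-channel scale and bias. The sketch below shows the standard identity for one channel; the function names, sample values, and `eps` are illustrative assumptions. The models on this page keep the BatchNorm statistics separate instead.

```python
import math


def batchnorm(x, gamma, beta, mean, var, eps=1e-5):
    """Inference-time BatchNorm for one channel, using running statistics."""
    return gamma * (x - mean) / math.sqrt(var + eps) + beta


def fuse_to_affine(gamma, beta, mean, var, eps=1e-5):
    """Fold the four BatchNorm values into a scale and a bias."""
    scale = gamma / math.sqrt(var + eps)
    bias = beta - mean * scale
    return scale, bias


# Illustrative values for a single channel.
params = dict(gamma=1.5, beta=0.2, mean=0.4, var=2.0)
scale, bias = fuse_to_affine(**params)
x = 3.0
fused_out = scale * x + bias
bn_out = batchnorm(x, **params)
```

Both paths give the same output up to floating-point error, which is why the two weight formats can be converted into each other.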
Pretrained models in Detectron's format can still be used.
All models available for download through this document are licensed under the
Creative Commons Attribution-ShareAlike 3.0 license.

| Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | model id | download |
|---|---|---|---|---|---|---|---|
| R50-C4 | 1x | 0.551 | 0.110 | 4.8 | 35.7 | 137257644 | model \| metrics |
| R50-DC5 | 1x | 0.380 | 0.068 | 5.0 | 37.3 | 137847829 | model \| metrics |
| R50-FPN | 1x | 0.210 | 0.055 | 3.0 | 37.9 | 137257794 | model \| metrics |
| R50-C4 | 3x | 0.543 | 0.110 | 4.8 | 38.4 | 137849393 | model \| metrics |
| R50-DC5 | 3x | 0.378 | 0.073 | 5.0 | 39.0 | 137849425 | model \| metrics |
| R50-FPN | 3x | 0.209 | 0.047 | 3.0 | 40.2 | 137849458 | model \| metrics |
| R101-C4 | 3x | 0.619 | 0.149 | 5.9 | 41.1 | 138204752 | model \| metrics |
| R101-DC5 | 3x | 0.452 | 0.082 | 6.1 | 40.6 | 138204841 | model \| metrics |
| R101-FPN | 3x | 0.286 | 0.063 | 4.1 | 42.0 | 137851257 | model \| metrics |
| X101-FPN | 3x | 0.638 | 0.120 | 6.7 | 43.0 | 139173657 | model \| metrics |


| Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | model id | download |
|---|---|---|---|---|---|---|---|
| R50 | 1x | 0.200 | 0.062 | 3.9 | 36.5 | 137593951 | model \| metrics |
| R50 | 3x | 0.201 | 0.063 | 3.9 | 37.9 | 137849486 | model \| metrics |
| R101 | 3x | 0.280 | 0.080 | 5.1 | 39.9 | 138363263 | model \| metrics |


| Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | prop. AR | model id | download |
|---|---|---|---|---|---|---|---|---|
| RPN R50-C4 | 1x | 0.130 | 0.051 | 1.5 | | 51.6 | 137258005 | model \| metrics |
| RPN R50-FPN | 1x | 0.186 | 0.045 | 2.7 | | 58.0 | 137258492 | model \| metrics |
| Fast R-CNN R50-FPN | 1x | 0.140 | 0.035 | 2.6 | 37.8 | | 137635226 | model \| metrics |


| Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | mask AP | model id | download |
|---|---|---|---|---|---|---|---|---|
| R50-C4 | 1x | 0.584 | 0.117 | 5.2 | 36.8 | 32.2 | 137259246 | model \| metrics |
| R50-DC5 | 1x | 0.471 | 0.074 | 6.5 | 38.3 | 34.2 | 137260150 | model \| metrics |
| R50-FPN | 1x | 0.261 | 0.053 | 3.4 | 38.6 | 35.2 | 137260431 | model \| metrics |
| R50-C4 | 3x | 0.575 | 0.118 | 5.2 | 39.8 | 34.4 | 137849525 | model \| metrics |
| R50-DC5 | 3x | 0.470 | 0.075 | 6.5 | 40.0 | 35.9 | 137849551 | model \| metrics |
| R50-FPN | 3x | 0.261 | 0.055 | 3.4 | 41.0 | 37.2 | 137849600 | model \| metrics |
| R101-C4 | 3x | 0.652 | 0.155 | 6.3 | 42.6 | 36.7 | 138363239 | model \| metrics |
| R101-DC5 | 3x | 0.545 | 0.155 | 7.6 | 41.9 | 37.3 | 138363294 | model \| metrics |
| R101-FPN | 3x | 0.340 | 0.070 | 4.6 | 42.9 | 38.6 | 138205316 | model \| metrics |
| X101-FPN | 3x | 0.690 | 0.129 | 7.2 | 44.3 | 39.5 | 139653917 | model \| metrics |


| Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | kp. AP | model id | download |
|---|---|---|---|---|---|---|---|---|
| R50-FPN | 1x | 0.315 | 0.083 | 5.0 | 53.6 | 64.0 | 137261548 | model \| metrics |
| R50-FPN | 3x | 0.316 | 0.076 | 5.0 | 55.4 | 65.5 | 137849621 | model \| metrics |
| R101-FPN | 3x | 0.390 | 0.090 | 6.1 | 56.4 | 66.1 | 138363331 | model \| metrics |
| X101-FPN | 3x | 0.738 | 0.142 | 8.7 | 57.3 | 66.0 | 139686956 | model \| metrics |


| Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | mask AP | PQ | model id | download |
|---|---|---|---|---|---|---|---|---|---|
| R50-FPN | 1x | 0.304 | 0.063 | 4.8 | 37.6 | 34.7 | 39.4 | 139514544 | model \| metrics |
| R50-FPN | 3x | 0.302 | 0.063 | 4.8 | 40.0 | 36.5 | 41.5 | 139514569 | model \| metrics |
| R101-FPN | 3x | 0.392 | 0.078 | 6.0 | 42.4 | 38.5 | 43.0 | 139514519 | model \| metrics |

Mask R-CNN baselines on the LVIS dataset, v0.5.
These baselines are described in Table 3(c) of the LVIS paper.
NOTE: the 1x schedule here has the same number of iterations as the COCO 1x baselines,
which is roughly 24 epochs of LVIS v0.5 data.
The final results of these configs have large variance across different runs.

| Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | mask AP | model id | download |
|---|---|---|---|---|---|---|---|---|
| R50-FPN | 1x | 0.292 | 0.127 | 7.1 | 23.6 | 24.4 | 144219072 | model \| metrics |
| R101-FPN | 1x | 0.371 | 0.124 | 7.8 | 25.6 | 25.9 | 144219035 | model \| metrics |
| X101-FPN | 1x | 0.712 | 0.166 | 10.2 | 26.7 | 27.1 | 144219108 | model \| metrics |

Simple baselines for Cityscapes and Pascal VOC:

| Name | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | box AP50 | mask AP | model id | download |
|---|---|---|---|---|---|---|---|---|
| R50-FPN, Cityscapes | 0.240 | 0.092 | 4.4 | | | 36.5 | 142423278 | model \| metrics |
| R50-C4, VOC | 0.537 | 0.086 | 4.8 | 51.9 | 80.3 | | 142202221 | model \| metrics |

Ablations for Deformable Conv and Cascade R-CNN:

| Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | mask AP | model id | download |
|---|---|---|---|---|---|---|---|---|
| Baseline R50-FPN | 1x | 0.261 | 0.053 | 3.4 | 38.6 | 35.2 | 137260431 | model \| metrics |
| Deformable Conv | 1x | 0.342 | 0.061 | 3.5 | 41.5 | 37.5 | 138602867 | model \| metrics |
| Cascade R-CNN | 1x | 0.317 | 0.066 | 4.0 | 42.1 | 36.4 | 138602847 | model \| metrics |
| Baseline R50-FPN | 3x | 0.261 | 0.055 | 3.4 | 41.0 | 37.2 | 137849600 | model \| metrics |
| Deformable Conv | 3x | 0.349 | 0.066 | 3.5 | 42.7 | 38.5 | 144998336 | model \| metrics |
| Cascade R-CNN | 3x | 0.328 | 0.075 | 4.0 | 44.3 | 38.5 | 144998488 | model \| metrics |

Ablations for normalization methods:
(Note: The baseline uses a 2fc head while the others use a 4conv1fc head. According to the
GroupNorm paper, the change in head does not improve the baseline by much.)

| Name | lr sched | train time (s/iter) | inference time (s/im) | train mem (GB) | box AP | mask AP | model id | download |
|---|---|---|---|---|---|---|---|---|
| Baseline R50-FPN | 3x | 0.261 | 0.055 | 3.4 | 41.0 | 37.2 | 137849600 | model \| metrics |
| SyncBN | 3x | 0.464 | 0.063 | 5.6 | 42.0 | 37.8 | 143915318 | model \| metrics |
| GN | 3x | 0.356 | 0.077 | 7.3 | 42.6 | 38.6 | 138602888 | model \| metrics |
| GN (scratch) | 3x | 0.400 | 0.077 | 9.8 | 39.9 | 36.6 | 138602908 | model \| metrics |

A few very large models trained for a long time, for demo purposes:

| Name | inference time (s/im) | train mem (GB) | box AP | mask AP | PQ | model id | download |
|---|---|---|---|---|---|---|---|
| Panoptic FPN R101 | 0.123 | 11.4 | 47.4 | 41.3 | 46.1 | 139797668 | model \| metrics |
| Mask R-CNN X152 | 0.281 | 15.1 | 50.2 | 44.0 | | 18131413 | model \| metrics |
| above + test-time aug. | | | 51.9 | 45.9 | | | |