Models (and their sub-models) in detectron2 are built by
functions such as build_model, build_backbone, build_roi_heads:
from detectron2.modeling import build_model
model = build_model(cfg) # returns a torch.nn.Module
To load an existing checkpoint to the model, use
DetectionCheckpointer(model).load(file_path).
Detectron2 recognizes models in PyTorch's .pth format, as well as the .pkl files
in our model zoo.
A model can be used by simply calling outputs = model(inputs).
Next, we explain the inputs/outputs format used by the builtin models in detectron2.
All builtin models take a list[dict] as the inputs. Each dict
corresponds to information about one image.
The dict may contain the following keys:

- "image": Tensor in (C, H, W) format. The meaning of channels is defined by cfg.INPUT.FORMAT.
- "instances": an Instances object, with the following fields:
  - "gt_boxes": a Boxes object storing N boxes, one for each instance.
  - "gt_classes": a Tensor of long type, a vector of N labels, in range [0, num_categories).
  - "gt_masks": a PolygonMasks object storing N masks, one for each instance.
  - "gt_keypoints": a Keypoints object storing N keypoint sets, one for each instance.
- "proposals": an Instances object used in Fast R-CNN style models, with the following fields:
  - "proposal_boxes": a Boxes object storing P proposal boxes.
  - "objectness_logits": a Tensor, a vector of P scores, one for each proposal.
- "height", "width": the desired output height and width of the image, not necessarily the same
  as the height or width of the image tensor fed into the model, which might have been resized.
  For example, they can be the original image height and width before resizing.
  If provided, the model will produce output in this resolution,
  rather than in the resolution of the image as input into the model. This is more efficient and accurate.
- "sem_seg": Tensor[int] in (H, W) format. The semantic segmentation ground truth.
The output of the default DatasetMapper is a dict
that follows the above format.
After the data loader performs batching, the data becomes a list[dict], which the builtin models support.
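This batching behavior can be sketched with a plain PyTorch DataLoader whose collate function simply keeps the dicts in a list (the in-memory dataset below is a stand-in for a real mapped dataset):

```python
import torch
from torch.utils.data import DataLoader

# stand-in dataset: four dicts in the per-image format described above
dataset = [{"image": torch.zeros(3, 480, 640), "height": 480, "width": 640}
           for _ in range(4)]

# a trivial collate function keeps the dicts in a list instead of stacking tensors
loader = DataLoader(dataset, batch_size=2, collate_fn=lambda batch: batch)
batch = next(iter(loader))  # a list[dict], ready to pass to a builtin model
```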
When in training mode, the builtin models output a dict[str->ScalarTensor] with all the losses.
When in inference mode, the builtin models output a list[dict], one dict for each image. Each dict may contain:
- "instances": an Instances object, with the following fields:
  - "pred_boxes": a Boxes object storing N boxes, one for each detected instance.
  - "scores": a Tensor, a vector of N scores.
  - "pred_classes": a Tensor, a vector of N labels in range [0, num_categories).
  - "pred_masks": a Tensor of shape (N, H, W), masks for each detected instance.
  - "pred_keypoints": a Tensor of shape (N, num_keypoint, 3).
- "sem_seg": a Tensor of (num_categories, H, W), the semantic segmentation prediction.
- "panoptic_seg": a tuple of (Tensor, list[dict]). The tensor has shape (H, W), where each element
  is the segment id of the pixel. Each dict describes one segment id and has the following fields:
  - "id": the segment id.
  - "isthing": whether the segment is a thing or stuff.
  - "category_id": the category id of this segment. It represents the thing class id when
    isthing==True, and the stuff class id otherwise.

To run the builtin models on your own data, construct your own list[dict] with the necessary keys.
For example, for inference, provide dicts with "image", and optionally "height" and "width".
Note that when in training mode, all models are required to be used under an EventStorage.
The training statistics will be put into the storage:
from detectron2.utils.events import EventStorage
with EventStorage() as storage:
    losses = model(inputs)