Benchmarking Detection Transfer Learning With Vision Transformers
Yanghao Li Saining Xie Xinlei Chen Piotr Dollár Kaiming He Ross Girshick
Facebook AI Research (FAIR)
Abstract
Object detection is a central downstream task used to test if pre-trained network parameters confer benefits, such as improved accuracy or training speed. The complexity of object detection methods can make this benchmarking non-trivial when new architectures, such as Vision Transformer (ViT) models, arrive. These difficulties (e.g., architectural incompatibility, slow training, high memory consumption, unknown training formulae, etc.) have prevented recent studies from benchmarking detection transfer learning with standard ViT models. In this paper, we present training techniques that overcome these challenges, enabling the use of standard ViT models as the backbone of Mask R-CNN. These tools facilitate the primary goal of our study: we compare five ViT initializations, including recent state-of-the-art self-supervised learning methods, supervised initialization, and a strong random initialization baseline. Our results show that recent masking-based unsupervised learning methods may, for the first time, provide convincing transfer learning improvements on COCO, increasing APbox up to 4% (absolute) over supervised and prior self-supervised pre-training methods. Moreover, these masking-based initializations scale better, with the improvement growing as model size increases.

1. Introduction

Unsupervised/self-supervised deep learning is commonly used as a pre-training step that initializes model parameters before they are transferred to a downstream task, such as image classification or object detection, for fine-tuning. The utility of an unsupervised learning algorithm is judged by downstream task metrics (e.g., accuracy, convergence speed, etc.) in comparison to baselines, such as supervised pre-training or no pre-training at all, i.e., random initialization (often called training "from scratch").

Unsupervised deep learning in computer vision typically uses standard convolutional network (CNN) models [25], such as ResNets [20]. Transferring these models is relatively straightforward because CNNs are in widespread use in most downstream tasks, and thus benchmarking protocols are easy to define and baselines are plentiful (e.g., [17]). In other words, unsupervised learning with CNNs produces a plug-and-play parameter initialization.

We are now witnessing the growth of unsupervised learning with Vision Transformer (ViT) models [10], and while the high-level transfer learning methodology remains the same, the low-level details and baselines for some important downstream tasks have not been established. Notably, object detection, which has played a central role in the study of transfer learning over the last decade (e.g., [35, 14, 9, 17]), was not explored in the pioneering work on ViT training [10, 7, 5], supervised or unsupervised, due to the challenges (described shortly) of integrating ViTs into common detection models, like Mask R-CNN [19].

To bridge this gap, this paper establishes a transfer learning protocol for evaluating ViT models on object detection and instance segmentation using the COCO dataset [28] and the Mask R-CNN framework. We focus on standard ViT models, with minimal modifications, as defined in the original ViT paper [10], because we expect this architecture will remain popular in unsupervised learning work over the next few years due to its simplicity and flexibility when exploring new techniques, e.g., masking-based methods [1, 16].

Establishing object detection baselines for ViT is challenging due to technical obstacles that include mitigating ViT's large memory requirements when processing detection-sized inputs (e.g., ∼20× more patches than in pre-training), architectural incompatibilities (e.g., single-scale ViT vs. a multi-scale detector), and developing effective training formulae (i.e., learning schedules, regularization and data augmentation methods, etc.) for numerous pre-trained initializations, as well as random initialization. We overcome these obstacles and present strong ViT-based Mask R-CNN baselines on COCO when initializing ViT from scratch [18], with pre-trained ImageNet [8] supervision, and with unsupervised pre-training using recent methods like MoCo v3 [7], BEiT [1], and MAE [16].

Looking beyond ViT, we hope our practices and observations will serve as a blueprint for future work comparing pre-training methods for more advanced ViT derivatives, like Swin [29] and MViT [12]. To facilitate community development we will release code in Detectron2 [40].
Figure 1. ViT-based Mask R-CNN. In §2 we describe how a standard ViT model can be used effectively as the backbone in Mask R-CNN. To save time and memory, we modify the ViT to use non-overlapping windowed attention in all but four of its Transformer blocks, spaced at an interval of d/4, where d is the total number of blocks (blue) [26]. To adapt the single-scale ViT to the multi-scale FPN (yellow), we make use of upsampling and downsampling modules (green) [11]. The rest of the system (light red) uses upgraded, but standard, Mask R-CNN components (RPN, box head, mask head).

2. Approach

We select the Mask R-CNN [19] framework due to its ubiquitous presence in object detection and transfer learning research. Mask R-CNN is the foundation of higher complexity/higher performing systems, such as Cascade R-CNN [4] and HTC/HTC++ [6, 29], which may improve upon the results presented here at the cost of additional complexity that is orthogonal to the goal of benchmarking transfer learning. Our choice attempts to balance (relative) simplicity vs. complexity while providing compelling, even though not entirely state-of-the-art, results.

We configure Mask R-CNN with a number of upgraded modules (described in §2.2) and training procedures (described in §2.3) relative to the original publication. These upgrades, developed primarily in [39, 18, 13], allow the model to be trained effectively from random initialization, thus enabling a meaningful from-scratch baseline. Next, we will discuss how the backbone, which would typically be a ResNet, can be replaced with a Vision Transformer.

2.1. ViT Backbone

In this section we address two technical obstacles when using ViT as the backbone in Mask R-CNN: (1) how to adapt it to work with a feature pyramid network (FPN) [27] and (2) how to reduce its memory footprint and runtime to make benchmarking large ViT backbones tractable.

FPN Compatibility. Mask R-CNN can work with a backbone that either produces a single-scale feature map or feature maps at multiple scales that can be input into an FPN. Since FPN typically provides better detection results with minimal time and memory overhead, we adopt it.

However, using FPN presents a problem because ViT produces feature maps at a single scale (e.g., 1/16th), in contrast to the multi-scale feature maps produced by typical CNNs.¹ To address this discrepancy, we employ a simple technique from [11] (used for the single-scale XCiT backbone) to either upsample or downsample intermediate ViT feature maps by placing four resolution-modifying modules at equally spaced intervals of d/4 transformer blocks, where d is the total number of blocks. See Figure 1 (green blocks).

The first of these modules upsamples the feature map by a factor of 4 using a stride-two 2×2 transposed convolution, followed by group normalization [39] and GeLU [21], and finally another stride-two 2×2 transposed convolution. The next d/4th block's output is upsampled by 2× using a single stride-two 2×2 transposed convolution (without normalization and non-linearity). The next d/4th block's output is taken as is and the final ViT block's output is downsampled by a factor of two using stride-two 2×2 max pooling. Each of these modules preserves the ViT's embedding/channel dimension. Assuming a patch size of 16, these modules produce feature maps with strides of 4, 8, 16, and 32 pixels, w.r.t. the input image, that are ready to input into an FPN.
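As a concrete illustration, the PyTorch sketch below builds these four resolution-modifying modules; the 768-d ViT-B embedding dimension, the 32-group GroupNorm, and the exact module composition are our assumptions for the sketch, not the released implementation:

```python
import torch
from torch import nn

def make_multiscale_adaptors(embed_dim: int = 768):
    """Four modules mapping the 1/16-scale ViT feature map to strides 4, 8, 16, 32
    (a sketch of the scheme adapted from [11])."""
    up4 = nn.Sequential(  # 1/16 -> 1/4: two stride-2 transposed convs
        nn.ConvTranspose2d(embed_dim, embed_dim, kernel_size=2, stride=2),
        nn.GroupNorm(32, embed_dim),
        nn.GELU(),
        nn.ConvTranspose2d(embed_dim, embed_dim, kernel_size=2, stride=2),
    )
    up2 = nn.ConvTranspose2d(embed_dim, embed_dim, kernel_size=2, stride=2)  # 1/16 -> 1/8
    identity = nn.Identity()                                                 # 1/16 -> 1/16
    down2 = nn.MaxPool2d(kernel_size=2, stride=2)                            # 1/16 -> 1/32
    return nn.ModuleList([up4, up2, identity, down2])

# Example: apply each adaptor to the output of the corresponding d/4-th block,
# faked here with 64x64 feature maps (1024/16 = 64 patches per side).
feats = [torch.randn(1, 768, 64, 64) for _ in range(4)]
pyramid = [m(x) for m, x in zip(make_multiscale_adaptors(), feats)]
print([tuple(p.shape[-2:]) for p in pyramid])  # [(256, 256), (128, 128), (64, 64), (32, 32)]
```

The resulting four maps have the 4/8/16/32-pixel strides that a standard FPN expects, so the FPN and the rest of Mask R-CNN remain unchanged.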
We note that recent work, such as Swin [29] and MViT [12], address the single- vs. multi-scale feature map problem by modifying the core ViT architecture (in pre-training) so it is inherently multi-scale. This is an important direction, but it also complicates the simple ViT design and may impede the exploration of new unsupervised learning directions, such as methods that sparsely process unmasked patches [16]. Therefore, we focus on external additions to ViTs that allow them to integrate into multi-scale detection systems. We also note that Beal et al. [2] integrate standard ViT models with Faster R-CNN [34], but report substantially lower APbox compared to our results (>10 points lower), which suggests that our design is highly effective.

Reducing Memory and Time Complexity. Using ViT as a backbone in Mask R-CNN introduces memory and runtime challenges. Each self-attention operation in ViT takes O(h²w²) space and time for an image tiled (or "patchified") into h × w non-overlapping patches [38].

During pre-training, this complexity is manageable as h = w = 14 is a typical setting (a 224 × 224 pixel image patchified into 16 × 16 pixel patches). In object detection, a standard image size is 1024 × 1024, approximately 21× more pixels and patches. This higher resolution is needed in order to detect relatively small objects as well as larger ones. Due to the quadratic complexity of self-attention, even the "base" size ViT-B may consume ∼20–30GB of GPU memory when used in Mask R-CNN with a single-image minibatch and half-precision floating point numbers.

¹ We view the natural 2D spatial arrangement of intermediate ViT patch embeddings as a standard 2D feature map.
To reduce space and time complexity we use restricted (or "windowed") self-attention [38], which saves both space and time by replacing global computation with local computation. We partition the h × w patchified image into r × r patch non-overlapping windows and compute self-attention independently within each of these windows. This windowed self-attention has O(r²hw) space and time complexity (from O(r⁴) per-window complexity and h/r × w/r windows). We set r to the global self-attention size used in pre-training (e.g., r = 14 is typical).

A drawback of windowed self-attention is that the backbone does not integrate information across windows. Therefore we adopt the hybrid approach from [26] that includes four global self-attention blocks placed evenly at each d/4th block (these coincide with the up-/downsampling locations used for FPN integration; see Figure 1).
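As a minimal sketch of the windowed variant (our illustration, not the released code), the patch grid is simply reshaped into non-overlapping r × r windows before attention. The helper below assumes h and w are divisible by r, so the accounting example uses r = 16 on the 64 × 64 grid; with r = 14 the grid is first padded, which we omit here:

```python
import torch

def window_partition(x: torch.Tensor, r: int) -> torch.Tensor:
    """Split a (B, h, w, C) patch-token grid into (B * h/r * w/r, r*r, C)
    non-overlapping windows; attention inside each window then costs O(r^4)
    per window, i.e. O(r^2 * h * w) overall instead of O(h^2 * w^2)."""
    B, h, w, C = x.shape
    assert h % r == 0 and w % r == 0, "pad the grid so h and w are multiples of r"
    x = x.view(B, h // r, r, w // r, r, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, r * r, C)

# Rough accounting for a 1024x1024 image (64x64 patches), per attention head:
tokens = 64 * 64
global_attn = tokens ** 2                # ~16.8M attention weights
windowed_attn = (64 // 16) ** 2 * 16**4  # 16 windows of 256 tokens: ~1.0M
print(global_attn / windowed_attn)       # ~16x fewer attention weights
```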
2.2. Upgraded Modules

Relative to the original Mask R-CNN in [19], we modernize several of its modules. Concisely, the modifications include: (1) following the convolutions in FPN with batch normalization (BN) [23], (2) using two convolutional layers in the region proposal network (RPN) [33] instead of one, (3) using four convolutional layers with BN followed by one linear layer for the region-of-interest (RoI) classification and box regression head [39] instead of a two-layer MLP without normalization, and (4) following the convolutions in the standard mask head with BN. Wherever BN is applied, we use synchronous BN across all GPUs. These upgrades are implemented in the Detectron2 model zoo.²
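As an illustration of upgrade (3), the box head could be built as below; the 256-d convolutions, 1024-d linear layer, 7 × 7 RoI size, and ReLU activations are assumptions for the sketch, not values taken from the released configuration:

```python
from torch import nn

def upgraded_box_head(in_channels: int = 256, fc_dim: int = 1024, roi_size: int = 7):
    """RoI box head per upgrade (3): four 3x3 convs, each followed by BN
    (synchronized across GPUs in the actual multi-GPU setup), then a single
    linear layer, replacing the normalization-free two-layer MLP."""
    layers = []
    for _ in range(4):
        layers += [
            nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
        ]
    layers += [nn.Flatten(),
               nn.Linear(in_channels * roi_size * roi_size, fc_dim),
               nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)
```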
2.3. Training Formula

We adopt an upgraded training formula compared to the original Mask R-CNN. This formula was developed in [18], which demonstrated good from-scratch performance when training with normalization layers and for long enough, and [13], which demonstrated that a simple data augmentation method called large-scale jitter (LSJ) is effective at preventing overfitting and improves results when models are trained for very long schedules (e.g., 400 epochs).

We aim to keep the number of hyperparameters low and therefore resist adopting additional data augmentation and regularization techniques. However, we found that drop path regularization [24, 22] is highly effective for ViT backbones and therefore we include it (e.g., it improves from-scratch training by up to 2 APbox).
In summary, we train all models with the same simple formula: LSJ (1024 × 1024 resolution, scale range [0.1, 2.0]), AdamW [30] (β1, β2 = 0.9, 0.999) with half-period cosine learning rate decay, linear warmup [15] for 0.25 epochs, and drop path regularization. When using a pre-trained initialization, we fine-tune Mask R-CNN for up to 100 epochs. When training from scratch, we consider schedules of up to 400 epochs since convergence is slower than when using pre-training. We distribute training over 32 or 64 GPUs (NVIDIA V100-32GB) and always use a minibatch size of 64 images. We use PyTorch's automatic mixed precision. Additional hyperparameters are tuned by the consistent application of a protocol, described next.
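A minimal PyTorch sketch of this optimization recipe is shown below; the function names, the iteration-level scheduling, and the epoch accounting are our own illustrative assumptions (the actual training runs in Detectron2):

```python
import math
import torch

def build_optimizer_and_schedule(params, total_iters: int, warmup_iters: int,
                                 lr: float = 1.6e-4, wd: float = 0.1):
    """AdamW with linear warmup followed by half-period cosine decay, matching
    the formula in Sec. 2.3 (lr/wd defaults are the grid-search centers of Sec. 2.4)."""
    opt = torch.optim.AdamW(params, lr=lr, betas=(0.9, 0.999), weight_decay=wd)

    def lr_lambda(it: int) -> float:
        if it < warmup_iters:                       # linear warmup (0.25 epochs)
            return (it + 1) / warmup_iters
        t = (it - warmup_iters) / max(1, total_iters - warmup_iters)
        return 0.5 * (1.0 + math.cos(math.pi * t))  # half-period cosine decay

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched

# Example: 100 epochs of COCO (~118k images) at a 64-image minibatch.
iters_per_epoch = 118_000 // 64
opt, sched = build_optimizer_and_schedule(
    [torch.nn.Parameter(torch.zeros(1))],
    total_iters=100 * iters_per_epoch,
    warmup_iters=int(0.25 * iters_per_epoch),
)
```

The scheduler is stepped once per iteration, so the cosine half-period spans the full training run regardless of its length.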
2.4. Hyperparameter Tuning Protocol

To adapt the training formula to each model, we tune three hyperparameters, learning rate (lr), weight decay (wd), and drop path rate (dp), while keeping all others the same for all models. We conducted pilot experiments using ViT-B pre-trained with MoCo v3 to estimate reasonable hyperparameter ranges. Based on these estimates we established the following tuning protocol:

(1) For each initialization (from-scratch, supervised, etc.), we fix dp at 0.0 and perform a grid search over lr and wd using ViT-B and a 25 epoch schedule (or 100 epochs when initializing from scratch). We center a 3 × 3 grid at lr, wd = 1.6e−4, 0.1 and use doubled and halved values around the center. If a local optimum is not found (i.e., the best value is a boundary value), we expand the search.

(2) For ViT-B, we select dp from {0.0, 0.1, 0.2, 0.3} using a 50 epoch schedule for pre-trained initializations. The shorter 25 epoch schedule was unreliable and 100 epochs was deemed impractical. For random initialization we're forced to use 100 epochs due to slow convergence. We found that dp = 0.1 is optimal for all initializations.

(3) For ViT-L, we adopt the optimal lr and wd from ViT-B (searching with ViT-L is impractical) and find dp = 0.3 is best using the same procedure as for ViT-B.
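Step (1) of the protocol amounts to the small grid below (a sketch; the training launch itself is only indicated by a comment):

```python
def lr_wd_grid(center_lr: float = 1.6e-4, center_wd: float = 0.1):
    """3x3 grid of (lr, wd) pairs: the center value and its doubled/halved
    neighbors, as used in step (1) of the tuning protocol."""
    lrs = [center_lr / 2, center_lr, center_lr * 2]
    wds = [center_wd / 2, center_wd, center_wd * 2]
    return [(lr, wd) for lr in lrs for wd in wds]

for lr, wd in lr_wd_grid():
    # Launch a 25-epoch fine-tuning run (100 epochs from scratch) with dp = 0.0,
    # record APbox on COCO val, and expand the grid if the best point is on the boundary.
    print(f"lr={lr:.1e} wd={wd}")
```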
Limitations. The procedure above takes practical shortcuts to reduce the full hyperparameter tuning space. In particular, lr and wd are optimized separately from dp, thus the combination may be suboptimal. Further, we only tune lr and wd using ViT-B, therefore the choice may be suboptimal for ViT-L. We also tune lr and wd using a schedule that is 4× shorter than the longest schedule we eventually train at, which again may be suboptimal. Given these limitations we aim to avoid biasing results by applying the same tuning protocol to all initializations.

Finally, we note that we tune lr, wd, and dp on the COCO 2017 val split and report results on the same split. While technically not an ML best-practice, a multitude of comparisons on COCO val vs. test-dev results over many years demonstrate that overfitting is not a concern for this kind of low-degree-of-freedom hyperparameter tuning.³

² https://github.com/facebookresearch/detectron2/blob/main/MODEL_ZOO.md#new-baselines-using-large-scale-jitter-and-longer-training-schedule
³ E.g., Table 2 in [29] (version 1) shows that test-dev APbox is systematically higher than val APbox in seven system-level comparisons.
2.5. Additional Implementation Details

Images are padded during training and inference to form a 1024 × 1024 resolution input. During training, padding is necessary for batching. During (unbatched) inference, the input only needs to be a multiple of the ViT patch size on each side, which is possibly less than 1024 on one side. However, we found that such reduced padding performs worse (e.g., a decrease of ∼0.5–1 APbox) than padding to the same resolution used during training, likely due to ViT's use of positional information. Therefore, we use a 1024 × 1024 resolution input at inference time, even though the extra padding slows inference time by ∼30% on average.
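A minimal sketch of this padding rule follows; the CHW tensor layout, bottom/right padding, and zero fill value are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def pad_for_inference(img: torch.Tensor, size: int = 1024) -> torch.Tensor:
    """Pad a (C, H, W) image on the bottom/right to the full 1024x1024 training
    resolution; padding only to a multiple of the patch size is possible but was
    found to cost ~0.5-1 APbox."""
    _, h, w = img.shape
    return F.pad(img, (0, size - w, 0, size - h), value=0.0)

padded = pad_for_inference(torch.rand(3, 800, 1024))  # e.g., an 800x1024 input
print(padded.shape)  # torch.Size([3, 1024, 1024])
```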
3. Initialization Methods

We compare five initialization methods, which we briefly summarize below.

Random: All network weights are randomly initialized and no pre-training is used. The ViT backbone initialization follows the code of [1] and the Mask R-CNN initialization uses the defaults in Detectron2 [40].

Supervised: The ViT backbone is pre-trained for supervised classification using ImageNet-1k images and labels. We use the DeiT released weights [36] for ViT-B and the ViT-L weights from [16], which uses an even stronger training formula than DeiT to avoid overfitting (moreover, the DeiT release does not include ViT-L). ViT-B and ViT-L were pre-trained for 300 and 200 epochs, respectively.

MoCo v3: We use the unsupervised ImageNet-1k pre-trained ViT-B and ViT-L weights from the authors of [7] (ViT-B is public; ViT-L was provided via private communication). These models were pre-trained for 300 epochs.

BEiT: Since ImageNet-1k pre-trained weights are not available, we use the official BEiT code release [1] to train ViT-B and ViT-L ourselves for 800 epochs (the default training length used in [1]) on unsupervised ImageNet-1k.

MAE: We use the ViT-B and ViT-L weights pre-trained on unsupervised ImageNet-1k from the authors of [16]. These models were pre-trained for 1600 epochs using normalized pixels as the target.

3.1. Nuisance Factors in Pre-training

We attempt to make comparisons as equally matched as possible, yet there are pre-training nuisance factors, listed below, that differ across methods.

(1) Different pre-training methods may use different numbers of epochs. We adopt the default number of pre-training epochs from the respective papers. While these values may not appear comparable, the reality is unclear: not all methods may benefit equally from longer training and not all methods have the same per-epoch training cost (e.g., BEiT uses roughly 3× more flops than MAE).

(2) BEiT uses learned relative position biases that are added to the self-attention logits [31] in each block, instead of the absolute position embeddings used by the other methods. To account for this, albeit imperfectly, we include both relative position biases and absolute position embeddings in all detection models regardless of their use in pre-training. For BEiT, we transfer the pre-trained biases and randomly initialize the absolute position embeddings. For all other methods, we zero-initialize the relative position biases and transfer the pre-trained absolute position embeddings. Relative position biases are shared across windowed attention blocks and (separately) shared across global attention blocks. When there is a spatial dimension mismatch between pre-training and fine-tuning, we resize the pre-trained parameters to the required fine-tuning resolution (see the sketch after this list).

(3) BEiT makes use of layer scale [37] in pre-training, while the other methods do not. During fine-tuning, the BEiT-initialized model must also be parameterized to use layer scale, with the layer scaling parameters initialized from the pre-trained model. All other models do not use layer scale in pre-training or in fine-tuning.

(4) We try to standardize pre-training data to ImageNet-1k; however, BEiT uses the DALL·E [32] discrete VAE (dVAE), which was trained on ∼250 million proprietary and undisclosed images, as an image tokenizer. The impact of this additional training data is not fully understood.
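The spatial resizing mentioned in (2) is commonly done by interpolating the pre-trained absolute position embeddings to the fine-tuning patch grid. The sketch below is our own illustration (bicubic interpolation, a square grid, and the absence of a class-token embedding are assumptions, not necessarily the exact procedure used here):

```python
import torch
import torch.nn.functional as F

def resize_abs_pos_embed(pos: torch.Tensor, new_hw: int = 64) -> torch.Tensor:
    """Interpolate (1, N, C) absolute position embeddings from a square
    pre-training grid (e.g., 14x14 for 224/16) to the fine-tuning grid
    (e.g., 64x64 for 1024/16)."""
    _, n, c = pos.shape
    old_hw = int(n ** 0.5)
    grid = pos.reshape(1, old_hw, old_hw, c).permute(0, 3, 1, 2)   # -> (1, C, h, w)
    grid = F.interpolate(grid, size=(new_hw, new_hw), mode="bicubic", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, new_hw * new_hw, c)

print(resize_abs_pos_embed(torch.randn(1, 14 * 14, 768)).shape)  # (1, 4096, 768)
```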
initialization   pre-training data   APbox ViT-B   APbox ViT-L   APmask ViT-B   APmask ViT-L
supervised       IN1k w/ labels      47.9          49.3          42.9           43.9
random           none                48.9          50.7          43.6           44.9
MoCo v3          IN1k                47.9          49.3          42.7           44.0
BEiT             IN1k+DALL·E         49.8          53.3          44.4           47.1
MAE              IN1k                50.3          53.3          44.9           47.2

Table 1. COCO object detection and instance segmentation using our ViT-based Mask R-CNN baseline. Results are reported on COCO 2017 val using the best schedule length (see Figure 2). Random initialization does not use any pre-training data, supervised initialization uses IN1k with labels, and all other initializations use IN1k without labels. Additionally, BEiT uses a dVAE trained on the proprietary DALL·E dataset of ∼250M images [32].

4. Experiments and Analysis

4.1. Comparing Initializations

Results. In Table 1, we compare COCO fine-tuning results using the pre-trained initializations and random initialization described in §3. We show results after maximizing APbox over the considered training lengths: 25, 50, or 100 epochs for pre-trained initializations, and 100, 200, or 400 epochs for random initialization. (We discuss convergence below.) Next, we make several observations.

(1) Our updated Mask R-CNN trains smoothly with ViT-B and ViT-L backbones regardless of the initialization method.
It does not exhibit instabilities nor does it require stabilizing techniques like gradient clipping.

(2) Training from scratch yields up to 1.4 higher APbox than fine-tuning from supervised IN1k pre-training (50.7 vs. 49.3). While the higher AP may sound surprising, the same trend is observed in [13]. Supervised pre-training is not always a stronger baseline than random initialization.

(3) The contrastive learning-based MoCo v3 underperforms random initialization's AP and has similar results compared to supervised initialization.

(4) For ViT-B, BEiT and MAE outperform both random initialization by up to 1.4 APbox (50.3 vs. 48.9) and supervised initialization by up to 2.4 APbox (50.3 vs. 47.9).

(5) For ViT-L, the APbox gap increases, with BEiT and MAE substantially outperforming both random initialization by up to 2.6 APbox (53.3 vs. 50.7) and supervised initialization by up to 4.0 APbox (53.3 vs. 49.3).

Figure 2. Impact of fine-tuning epochs. Convergence plots for fine-tuning from 25 to 400 epochs on COCO (ViT-B, top; ViT-L, bottom). All pre-trained initializations converge much faster (∼4×) compared to random initialization, though they achieve varied peak APbox. The performance gap between the masking-based methods (MAE and BEiT) and all others is visually evident. When increasing model scale from ViT-B to ViT-L, this gap also increases, suggesting that these methods may have superior scaling properties.

Convergence. In Figure 2 we show how pre-training impacts fine-tuning convergence. Given the tuned hyperparameters for each initialization method, we train models for 2× and 4× longer (and also 0.5× for random initialization). Generally, we find that all pre-trained initializations significantly accelerate convergence compared to random initialization, as observed in [18]. Most methods show signs of overfitting when the training schedule is made sufficiently long, typically by 100 epochs for pre-trained initializations and 400 epochs for random initialization. Based on this data, pre-training tends to accelerate training on COCO by roughly 4× compared to random initialization.

We also note two caveats about these results: (i) The drop path rate should ideally be tuned for each training duration, as we have observed that the optimal dp value may need to increase when models are trained for longer. (However, performing an exhaustive dp sweep for all initializations, model sizes, and training durations is likely computationally impractical.) (ii) Moreover, it may be possible to achieve better results in all cases by training for longer under a more complex training formula that employs heavier regularization and stronger data augmentation.

Discussion. The COCO dataset is a challenging setting for transfer learning. Due to the large training set (∼118k images with ∼0.9M annotated objects), it is possible to achieve strong results when training from random initialization. We find that existing methods, like supervised IN1k or unsupervised MoCo v3 pre-training, actually underperform the AP of the random initialization baseline (though they yield faster convergence). Prior works reporting unsupervised transfer learning improvements on COCO (e.g., [17]) tend to show modest gains over supervised pre-training (e.g., ∼1 APbox) and do not include a strong random initialization baseline as we do here (because strong training formulae based on large-scale jitter had not yet been developed). Moreover, they use weaker models and report results that are overall much lower (e.g., ∼40 APbox), making it unclear how well the findings translate to state-of-the-art practices.

We find that MAE and BEiT provide the first convincing results of substantial COCO AP improvements due to pre-training. Moreover, these masking-based methods show the potential to improve detection transfer learning as model size increases. We do not observe this important scaling trend with either supervised IN1k pre-training or unsupervised contrastive learning, as represented by MoCo v3.

4.2. Ablations and Analysis

We ablate several factors involved in the system comparison, analyze model complexity, and report tuned hyperparameter values. For these experiments, we use MAE and 50 epoch fine-tuning by default.

Single-scale vs. Multi-scale. In Table 2 we compare our default FPN-based multi-scale detector to a single-scale variant. The single-scale variant simply applies RPN and RoIAlign [19] to the final 1/16th resolution feature map generated by the ViT backbone. The RoI heads and all other choices are the same between the systems (in particular, note that both use the same hybrid windowed/global attention). We observe that the multi-scale FPN design increases APbox consistently (Table 2).
FPN    APbox ViT-B   APbox ViT-L
yes*   50.1          53.3
no     48.4          52.0

Table 2. Single-scale vs. multi-scale (FPN) ablation. FPN yields consistent improvements. Our default setting is marked with *.

self-attention            act checkpt   APbox   train mem   train time   test time
(1) windowed              no            50.7    16GB        0.67s        0.34s
(2) windowed, 4 global*   no            53.3    27GB        0.93s        0.40s
(3) global                yes           53.1    14GB        2.26s        0.65s
(4) global                no            -       OOM         -            -

Table 3. Memory and time reduction strategies. We compare methods for reducing memory and time when using ViT-L in Mask R-CNN. The strategies include: (1) replace all global self-attention with 14 × 14 non-overlapping windowed self-attention, (2) a hybrid that uses both windowed and global self-attention, or (3) all global attention with activation checkpointing. Without any of these strategies (row 4) an out-of-memory (OOM) error prevents training. We report APbox, peak GPU training memory, average per-iteration training time, and average per-image inference time using NVIDIA V100-32GB GPUs. The per-GPU batch size is 1. Our default (row 2, marked with *) achieves a good balance between memory, time, and APbox metrics. In fact, our hybrid approach achieves comparable APbox to full global attention, while being much faster.
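Strategy (3) in Table 3 relies on activation checkpointing; a minimal sketch of wrapping a transformer block this way (our own illustration, with a generic block module, not the benchmarked implementation) is:

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

class CheckpointedBlock(nn.Module):
    """Run a block under activation checkpointing: activations are recomputed in
    the backward pass, cutting peak memory at the cost of extra training time
    (cf. rows (3) vs. (4) in Table 3)."""
    def __init__(self, block: nn.Module):
        super().__init__()
        self.block = block

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            return checkpoint(self.block, x)
        return self.block(x)
```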
initialization   pt: abs   pt: rel   ft: abs   ft: rel   APbox ViT-B   APbox ViT-L
(1) BEiT*        no        yes       rand      pt        49.8          53.3
(2) BEiT         no        yes       rand      zero      49.5          53.2
(3) BEiT†        yes       no        pt        zero      -             53.1
(4) MAE*         yes       no        pt        zero      50.1          53.3
(5) MAE          yes       no        pt        no        49.9          53.0

Table 4. Positional information ablation. In the BEiT code, the ViT is modified to use relative position biases (rel) instead of absolute position embeddings (abs). We study how these components impact results based on their use in pre-training (pt) and under various treatments in fine-tuning (ft): (i) pt: initialized with pre-trained values; (ii) rand: random initialization; (iii) zero: initialized at zero; and (iv) no: this positional information is not used in the fine-tuned model. For BEiT† (row 3), we pre-train an additional model (ViT-L only) that, like MAE, uses absolute position embeddings instead of relative position biases. Our default settings are marked with *. Comparing (1) and (2), we observe that pre-trained relative position bias initialization provides a slight benefit over zero initialization. Comparing (1, 2) to (3), we see that BEiT pre-trained with absolute position embeddings performs similarly (perhaps slightly worse) to pre-training with relative position biases. Comparing (4) and (5), we see that including relative position biases in addition to absolute position embeddings provides a small improvement.
Figure 3. Impact of pre-training epochs. Increasing MAE pre-training from 100 to 800 epochs confers large transfer learning gains. The improvements start to plateau after 800 epochs.

backbone     params (M)   acts (M)     flops (G)    fps
ResNet-101   65           426 ± 43     422 ± 35     13.7
ViT-B        116          1532 ± 11    853 ± 13     5.1
ViT-L        339          2727 ± 10    1907 ± 12    2.5

Table 5. Model complexity for inference with the specific Mask R-CNN configuration used in this report. For ViT, the image resolution is 1024 × 1024 (padded as necessary). The flop and activation counts are measured at runtime and vary based on the number of detected objects. We report the mean ± one standard deviation from 100 validation images. Results change very slightly when using different initializations. For reference, we report results using the ResNet-101 backbone, which can (and does) use non-square inputs at inference time (longest side is 1024); otherwise inference settings are the same. The ResNet-101 based Mask R-CNN achieves 48.9 APbox when trained from scratch for 400 epochs. We also report wall-clock speed in frames-per-second (fps) on an NVIDIA V100-32GB GPU.

[Figure: ∆APbox at an IoU threshold of 0.5 by initialization (MAE, BEiT, random, MoCo v3, supervised IN1k).]
References

[1] Hangbo Bao, Li Dong, and Furu Wei. BEiT: BERT pre-training of image transformers. arXiv:2106.08254, 2021.
[2] Josh Beal, Eric Kim, Eric Tzeng, Dong Huk Park, Andrew Zhai, and Dmitry Kislyuk. Toward transformer-based object detection. arXiv:2012.09958, 2020.
[3] Daniel Bolya, Sean Foley, James Hays, and Judy Hoffman. TIDE: A general toolbox for identifying object detection errors. In ECCV, 2020.
[4] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In CVPR, 2018.
[5] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021.
[6] Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, et al. Hybrid task cascade for instance segmentation. In CVPR, 2019.
[7] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised Vision Transformers. In ICCV, 2021.
[8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[9] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.
[10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[11] Alaaeldin El-Nouby, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, et al. XCiT: Cross-covariance image transformers. arXiv:2106.09681, 2021.
[12] Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer. Multiscale vision transformers. arXiv:2104.11227, 2021.
[13] Golnaz Ghiasi, Yin Cui, Aravind Srinivas, Rui Qian, Tsung-Yi Lin, Ekin D Cubuk, Quoc V Le, and Barret Zoph. Simple copy-paste is a strong data augmentation method for instance segmentation. In CVPR, 2021.
[14] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[15] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv:1706.02677, 2017.
[16] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. arXiv:2111.06377, 2021.
[17] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
[18] Kaiming He, Ross Girshick, and Piotr Dollár. Rethinking ImageNet pre-training. In ICCV, 2019.
[19] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[21] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). arXiv:1606.08415, 2016.
[22] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In ECCV, 2016.
[23] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[24] Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. FractalNet: Ultra-deep neural networks without residuals. In ICLR, 2016.
[25] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989.
[26] Yanghao Li, Chao-Yuan Wu, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, and Christoph Feichtenhofer. Improved multiscale vision transformers for classification and detection. In preparation, 2021.
[27] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
[28] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[29] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. arXiv:2103.14030, 2021.
[30] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
[31] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv:1910.10683, 2019.
[32] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. arXiv:2102.12092, 2021.
[33] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.
[34] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. TPAMI, 2017.
[35] Pierre Sermanet, Koray Kavukcuoglu, Sandhya Chintala, and Yann LeCun. Pedestrian detection with unsupervised multi-stage feature learning. In CVPR, 2013.
[36] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. arXiv:2012.12877, 2020.
[37] Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going deeper with image transformers. arXiv:2103.17239, 2021.
[38] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[39] Yuxin Wu and Kaiming He. Group normalization. In ECCV, 2018.
[40] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.