Benchmarking Detection Transfer Learning With Vision Transformers
Yanghao Li Saining Xie Xinlei Chen Piotr Dollár Kaiming He Ross Girshick
Facebook AI Research (FAIR)
Abstract
Object detection is a central downstream task used to test if pre-trained network parameters confer benefits, such as improved accuracy or training speed. The complexity of object detection methods can make this benchmarking non-trivial when new architectures, such as Vision Transformer (ViT) models, arrive. These difficulties (e.g., architectural incompatibility, slow training, high memory consumption, unknown training formulae, etc.) have prevented recent studies from benchmarking detection transfer learning with standard ViT models. In this paper, we present training techniques that overcome these challenges, enabling the use of standard ViT models as the backbone of Mask R-CNN. These tools facilitate the primary goal of our study: we compare five ViT initializations, including recent state-of-the-art self-supervised learning methods, supervised initialization, and a strong random initialization baseline. Our results show that recent masking-based unsupervised learning methods may, for the first time, provide convincing transfer learning improvements on COCO, increasing APbox up to 4% (absolute) over supervised and prior self-supervised pre-training methods. Moreover, these masking-based initializations scale better, with the improvement growing as model size increases.

1. Introduction

Unsupervised/self-supervised deep learning is commonly used as a pre-training step that initializes model parameters before they are transferred to a downstream task, such as image classification or object detection, for fine-tuning. The utility of an unsupervised learning algorithm is judged by downstream task metrics (e.g., accuracy, convergence speed, etc.) in comparison to baselines, such as supervised pre-training or no pre-training at all, i.e., random initialization (often called training "from scratch").

Unsupervised deep learning in computer vision typically uses standard convolutional network (CNN) models [25], such as ResNets [20]. Transferring these models is relatively straightforward because CNNs are in widespread use in most downstream tasks, and thus benchmarking protocols are easy to define and baselines are plentiful (e.g., [17]). In other words, unsupervised learning with CNNs produces a plug-and-play parameter initialization.

We are now witnessing the growth of unsupervised learning with Vision Transformer (ViT) models [10], and while the high-level transfer learning methodology remains the same, the low-level details and baselines for some important downstream tasks have not been established. Notably, object detection, which has played a central role in the study of transfer learning over the last decade (e.g., [35, 14, 9, 17]), was not explored in the pioneering work on ViT training [10, 7, 5], supervised or unsupervised, due to the challenges (described shortly) of integrating ViTs into common detection models, like Mask R-CNN [19].

To bridge this gap, this paper establishes a transfer learning protocol for evaluating ViT models on object detection and instance segmentation using the COCO dataset [28] and the Mask R-CNN framework. We focus on standard ViT models, with minimal modifications, as defined in the original ViT paper [10], because we expect this architecture will remain popular in unsupervised learning work over the next few years due to its simplicity and flexibility when exploring new techniques, e.g., masking-based methods [1, 16].

Establishing object detection baselines for ViT is challenging due to technical obstacles that include mitigating ViT's large memory requirements when processing detection-sized inputs (e.g., ∼20× more patches than in pre-training), architectural incompatibilities (e.g., single-scale ViT vs. a multi-scale detector), and developing effective training formulae (i.e., learning schedules, regularization and data augmentation methods, etc.) for numerous pre-trained initializations, as well as random initialization. We overcome these obstacles and present strong ViT-based Mask R-CNN baselines on COCO when initializing ViT from scratch [18], with pre-trained ImageNet [8] supervision, and with unsupervised pre-training using recent methods like MoCo v3 [7], BEiT [1], and MAE [16].

Looking beyond ViT, we hope our practices and observations will serve as a blueprint for future work comparing pre-training methods for more advanced ViT derivatives, like Swin [29] and MViT [12]. To facilitate community development we will release code in Detectron2 [40].
Figure 1. ViT-based Mask R-CNN. In §2 we describe how a standard ViT model can be used effectively as the backbone in Mask R-CNN. To save time and memory, we modify the ViT to use non-overlapping windowed attention in all but four of its Transformer blocks, spaced at an interval of d/4, where d is the total number of blocks (blue) [26]. To adapt the single-scale ViT to the multi-scale FPN (yellow), we make use of upsampling and downsampling modules (green) [11]. The rest of the system (light red) uses upgraded, but standard, Mask R-CNN components (RPN, box head, mask head).

2. Approach

We select the Mask R-CNN [19] framework due to its ubiquitous presence in object detection and transfer learning research. Mask R-CNN is the foundation of higher complexity/higher performing systems, such as Cascade R-CNN [4] and HTC/HTC++ [6, 29], which may improve upon the results presented here at the cost of additional complexity that is orthogonal to the goal of benchmarking transfer learning. Our choice attempts to balance (relative) simplicity vs. complexity while providing compelling, even though not entirely state-of-the-art, results.

We configure Mask R-CNN with a number of upgraded modules (described in §2.2) and training procedures (described in §2.3) relative to the original publication. These upgrades, developed primarily in [39, 18, 13], allow the model to be trained effectively from random initialization, thus enabling a meaningful from-scratch baseline. Next, we will discuss how the backbone, which would typically be a ResNet, can be replaced with a Vision Transformer.

2.1. ViT Backbone

In this section we address two technical obstacles when using ViT as the backbone in Mask R-CNN: (1) how to adapt it to work with a feature pyramid network (FPN) [27] and (2) how to reduce its memory footprint and runtime to make benchmarking large ViT backbones tractable.

FPN Compatibility. Mask R-CNN can work with a backbone that either produces a single-scale feature map or feature maps at multiple scales that can be input into an FPN. Since FPN typically provides better detection results with minimal time and memory overhead, we adopt it.

However, using FPN presents a problem because ViT produces feature maps at a single scale (e.g., 1/16th), in contrast to the multi-scale feature maps produced by typical CNNs.¹ To address this discrepancy, we employ a simple technique from [11] (used for the single-scale XCiT backbone) to either upsample or downsample intermediate ViT feature maps by placing four resolution-modifying modules at equally spaced intervals of d/4 transformer blocks, where d is the total number of blocks. See Figure 1 (green blocks).

The first of these modules upsamples the feature map by a factor of 4 using a stride-two 2×2 transposed convolution, followed by group normalization [39] and GeLU [21], and finally another stride-two 2×2 transposed convolution. The next d/4th block's output is upsampled by 2× using a single stride-two 2×2 transposed convolution (without normalization and non-linearity). The next d/4th block's output is taken as is and the final ViT block's output is downsampled by a factor of two using stride-two 2×2 max pooling. Each of these modules preserves the ViT's embedding/channel dimension. Assuming a patch size of 16, these modules produce feature maps with strides of 4, 8, 16, and 32 pixels, w.r.t. the input image, that are ready to input into an FPN.
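As a concrete illustration, the PyTorch sketch below builds these four resolution-modifying modules; the 768-d ViT-B embedding dimension, the 32-group GroupNorm, and the exact module composition are our assumptions for the sketch, not the released implementation:

```python
import torch
from torch import nn

def make_multiscale_adaptors(embed_dim: int = 768):
    """Four modules mapping the 1/16-scale ViT feature map to strides 4, 8, 16, 32
    (a sketch of the scheme adapted from [11])."""
    up4 = nn.Sequential(  # 1/16 -> 1/4: two stride-2 transposed convs
        nn.ConvTranspose2d(embed_dim, embed_dim, kernel_size=2, stride=2),
        nn.GroupNorm(32, embed_dim),
        nn.GELU(),
        nn.ConvTranspose2d(embed_dim, embed_dim, kernel_size=2, stride=2),
    )
    up2 = nn.ConvTranspose2d(embed_dim, embed_dim, kernel_size=2, stride=2)  # 1/16 -> 1/8
    identity = nn.Identity()                                                 # 1/16 -> 1/16
    down2 = nn.MaxPool2d(kernel_size=2, stride=2)                            # 1/16 -> 1/32
    return nn.ModuleList([up4, up2, identity, down2])

# Example: apply each adaptor to the output of the corresponding d/4-th block,
# faked here with 64x64 feature maps (1024/16 = 64 patches per side).
feats = [torch.randn(1, 768, 64, 64) for _ in range(4)]
pyramid = [m(x) for m, x in zip(make_multiscale_adaptors(), feats)]
print([tuple(p.shape[-2:]) for p in pyramid])  # [(256, 256), (128, 128), (64, 64), (32, 32)]
```

The resulting four maps have the 4/8/16/32-pixel strides that a standard FPN expects, so the FPN and the rest of Mask R-CNN remain unchanged.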
We note that recent work, such as Swin [29] and MViT [12], address the single- vs. multi-scale feature map problem by modifying the core ViT architecture (in pre-training) so it is inherently multi-scale. This is an important direction, but it also complicates the simple ViT design and may impede the exploration of new unsupervised learning directions, such as methods that sparsely process unmasked patches [16]. Therefore, we focus on external additions to ViTs that allow them to integrate into multi-scale detection systems. We also note that Beal et al. [2] integrate standard ViT models with Faster R-CNN [34], but report substantially lower APbox compared to our results (>10 points lower), which suggests that our design is highly effective.

Reducing Memory and Time Complexity. Using ViT as a backbone in Mask R-CNN introduces memory and runtime challenges. Each self-attention operation in ViT takes O(h²w²) space and time for an image tiled (or "patchified") into h × w non-overlapping patches [38].

During pre-training, this complexity is manageable as h = w = 14 is a typical setting (a 224 × 224 pixel image patchified into 16 × 16 pixel patches). In object detection, a standard image size is 1024 × 1024, approximately 21× more pixels and patches. This higher resolution is needed in order to detect relatively small objects as well as larger ones. Due to the quadratic complexity of self-attention, even the "base" size ViT-B may consume ∼20–30GB of GPU memory when used in Mask R-CNN with a single-image minibatch and half-precision floating point numbers.

¹ We view the natural 2D spatial arrangement of intermediate ViT patch embeddings as a standard 2D feature map.
To reduce space and time complexity we use restricted (or "windowed") self-attention [38], which saves both space and time by replacing global computation with local computation. We partition the h × w patchified image into r × r patch non-overlapping windows and compute self-attention independently within each of these windows. This windowed self-attention has O(r²hw) space and time complexity (from O(r⁴) per-window complexity and h/r × w/r windows). We set r to the global self-attention size used in pre-training (e.g., r = 14 is typical).

A drawback of windowed self-attention is that the backbone does not integrate information across windows. Therefore we adopt the hybrid approach from [26] that includes four global self-attention blocks placed evenly at each d/4th block (these coincide with the up-/downsampling locations used for FPN integration; see Figure 1).
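As a minimal sketch of the windowed variant (our illustration, not the released code), the patch grid is simply reshaped into non-overlapping r × r windows before attention. The helper below assumes h and w are divisible by r, so the accounting example uses r = 16 on the 64 × 64 grid; with r = 14 the grid is first padded, which we omit here:

```python
import torch

def window_partition(x: torch.Tensor, r: int) -> torch.Tensor:
    """Split a (B, h, w, C) patch-token grid into (B * h/r * w/r, r*r, C)
    non-overlapping windows; attention inside each window then costs O(r^4)
    per window, i.e. O(r^2 * h * w) overall instead of O(h^2 * w^2)."""
    B, h, w, C = x.shape
    assert h % r == 0 and w % r == 0, "pad the grid so h and w are multiples of r"
    x = x.view(B, h // r, r, w // r, r, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, r * r, C)

# Rough accounting for a 1024x1024 image (64x64 patches), per attention head:
tokens = 64 * 64
global_attn = tokens ** 2                # ~16.8M attention weights
windowed_attn = (64 // 16) ** 2 * 16**4  # 16 windows of 256 tokens: ~1.0M
print(global_attn / windowed_attn)       # ~16x fewer attention weights
```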
2.2. Upgraded Modules

Relative to the original Mask R-CNN in [19], we modernize several of its modules. Concisely, the modifications include: (1) following the convolutions in FPN with batch normalization (BN) [23], (2) using two convolutional layers in the region proposal network (RPN) [33] instead of one, (3) using four convolutional layers with BN followed by one linear layer for the region-of-interest (RoI) classification and box regression head [39] instead of a two-layer MLP without normalization, and (4) following the convolutions in the standard mask head with BN. Wherever BN is applied, we use synchronous BN across all GPUs. These upgrades are implemented in the Detectron2 model zoo.²
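As an illustration of upgrade (3), the box head could be built as below; the 256-d convolutions, 1024-d linear layer, 7 × 7 RoI size, and ReLU activations are assumptions for the sketch, not values taken from the released configuration:

```python
from torch import nn

def upgraded_box_head(in_channels: int = 256, fc_dim: int = 1024, roi_size: int = 7):
    """RoI box head per upgrade (3): four 3x3 convs, each followed by BN
    (synchronized across GPUs in the actual multi-GPU setup), then a single
    linear layer, replacing the normalization-free two-layer MLP."""
    layers = []
    for _ in range(4):
        layers += [
            nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
        ]
    layers += [nn.Flatten(),
               nn.Linear(in_channels * roi_size * roi_size, fc_dim),
               nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)
```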
2.3. Training Formula

We adopt an upgraded training formula compared to the original Mask R-CNN. This formula was developed in [18], which demonstrated good from-scratch performance when training with normalization layers and for long enough, and [13], which demonstrated that a simple data augmentation method called large-scale jitter (LSJ) is effective at preventing overfitting and improves results when models are trained for very long schedules (e.g., 400 epochs).

We aim to keep the number of hyperparameters low and therefore resist adopting additional data augmentation and regularization techniques. However, we found that drop path regularization [24, 22] is highly effective for ViT backbones and therefore we include it (e.g., it improves from-scratch training by up to 2 APbox).
In summary, we train all models with the same simple formula: LSJ (1024 × 1024 resolution, scale range [0.1, 2.0]), AdamW [30] (β1, β2 = 0.9, 0.999) with half-period cosine learning rate decay, linear warmup [15] for 0.25 epochs, and drop path regularization. When using a pre-trained initialization, we fine-tune Mask R-CNN for up to 100 epochs. When training from scratch, we consider schedules of up to 400 epochs since convergence is slower than when using pre-training. We distribute training over 32 or 64 GPUs (NVIDIA V100-32GB) and always use a minibatch size of 64 images. We use PyTorch's automatic mixed precision. Additional hyperparameters are tuned by the consistent application of a protocol, described next.
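A minimal PyTorch sketch of this optimization recipe is shown below; the function names, the iteration-level scheduling, and the epoch accounting are our own illustrative assumptions (the actual training runs in Detectron2):

```python
import math
import torch

def build_optimizer_and_schedule(params, total_iters: int, warmup_iters: int,
                                 lr: float = 1.6e-4, wd: float = 0.1):
    """AdamW with linear warmup followed by half-period cosine decay, matching
    the formula in Sec. 2.3 (lr/wd defaults are the grid-search centers of Sec. 2.4)."""
    opt = torch.optim.AdamW(params, lr=lr, betas=(0.9, 0.999), weight_decay=wd)

    def lr_lambda(it: int) -> float:
        if it < warmup_iters:                       # linear warmup (0.25 epochs)
            return (it + 1) / warmup_iters
        t = (it - warmup_iters) / max(1, total_iters - warmup_iters)
        return 0.5 * (1.0 + math.cos(math.pi * t))  # half-period cosine decay

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched

# Example: 100 epochs of COCO (~118k images) at a 64-image minibatch.
iters_per_epoch = 118_000 // 64
opt, sched = build_optimizer_and_schedule(
    [torch.nn.Parameter(torch.zeros(1))],
    total_iters=100 * iters_per_epoch,
    warmup_iters=int(0.25 * iters_per_epoch),
)
```

The scheduler is stepped once per iteration, so the cosine half-period spans the full training run regardless of its length.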
2.4. Hyperparameter Tuning Protocol

To adapt the training formula to each model, we tune three hyperparameters, learning rate (lr), weight decay (wd), and drop path rate (dp), while keeping all others the same for all models. We conducted pilot experiments using ViT-B pre-trained with MoCo v3 to estimate reasonable hyperparameter ranges. Based on these estimates we established the following tuning protocol:

(1) For each initialization (from-scratch, supervised, etc.), we fix dp at 0.0 and perform a grid search over lr and wd using ViT-B and a 25 epoch schedule (or 100 epochs when initializing from scratch). We center a 3 × 3 grid at lr, wd = 1.6e−4, 0.1 and use doubled and halved values around the center. If a local optimum is not found (i.e., the best value is a boundary value), we expand the search.

(2) For ViT-B, we select dp from {0.0, 0.1, 0.2, 0.3} using a 50 epoch schedule for pre-trained initializations. The shorter 25 epoch schedule was unreliable and 100 epochs was deemed impractical. For random initialization we're forced to use 100 epochs due to slow convergence. We found that dp = 0.1 is optimal for all initializations.

(3) For ViT-L, we adopt the optimal lr and wd from ViT-B (searching with ViT-L is impractical) and find dp = 0.3 is best using the same procedure as for ViT-B.
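Step (1) of the protocol amounts to the small grid below (a sketch; the training launch itself is only indicated by a comment):

```python
def lr_wd_grid(center_lr: float = 1.6e-4, center_wd: float = 0.1):
    """3x3 grid of (lr, wd) pairs: the center value and its doubled/halved
    neighbors, as used in step (1) of the tuning protocol."""
    lrs = [center_lr / 2, center_lr, center_lr * 2]
    wds = [center_wd / 2, center_wd, center_wd * 2]
    return [(lr, wd) for lr in lrs for wd in wds]

for lr, wd in lr_wd_grid():
    # Launch a 25-epoch fine-tuning run (100 epochs from scratch) with dp = 0.0,
    # record APbox on COCO val, and expand the grid if the best point is on the boundary.
    print(f"lr={lr:.1e} wd={wd}")
```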
Limitations. The procedure above takes practical shortcuts to reduce the full hyperparameter tuning space. In particular, lr and wd are optimized separately from dp, thus the combination may be suboptimal. Further, we only tune lr and wd using ViT-B, therefore the choice may be suboptimal for ViT-L. We also tune lr and wd using a schedule that is 4× shorter than the longest schedule we eventually train at, which again may be suboptimal. Given these limitations we aim to avoid biasing results by applying the same tuning protocol to all initializations.

Finally, we note that we tune lr, wd, and dp on the COCO 2017 val split and report results on the same split. While technically not an ML best-practice, a multitude of comparisons on COCO val vs. test-dev results over many years demonstrate that overfitting is not a concern for this kind of low-degree-of-freedom hyperparameter tuning.³

² https://github.com/facebookresearch/detectron2/blob/main/MODEL_ZOO.md#new-baselines-using-large-scale-jitter-and-longer-training-schedule
³ E.g., Table 2 in [29] (version 1) shows that test-dev APbox is systematically higher than val APbox in seven system-level comparisons.
2.5. Additional Implementation Details

Images are padded during training and inference to form a 1024 × 1024 resolution input. During training, padding is necessary for batching. During (unbatched) inference, the input only needs to be a multiple of the ViT patch size on each side, which is possibly less than 1024 on one side. However, we found that such reduced padding performs worse (e.g., a decrease of ∼0.5–1 APbox) than padding to the same resolution used during training, likely due to ViT's use of positional information. Therefore, we use a 1024 × 1024 resolution input at inference time, even though the extra padding slows inference time by ∼30% on average.
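A minimal sketch of this padding rule follows; the CHW tensor layout, bottom/right padding, and zero fill value are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def pad_for_inference(img: torch.Tensor, size: int = 1024) -> torch.Tensor:
    """Pad a (C, H, W) image on the bottom/right to the full 1024x1024 training
    resolution; padding only to a multiple of the patch size is possible but was
    found to cost ~0.5-1 APbox."""
    _, h, w = img.shape
    return F.pad(img, (0, size - w, 0, size - h), value=0.0)

padded = pad_for_inference(torch.rand(3, 800, 1024))  # e.g., an 800x1024 input
print(padded.shape)  # torch.Size([3, 1024, 1024])
```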
3. Initialization Methods

We compare five initialization methods, which we briefly summarize below.

Random: All network weights are randomly initialized and no pre-training is used. The ViT backbone initialization follows the code of [1] and the Mask R-CNN initialization uses the defaults in Detectron2 [40].

Supervised: The ViT backbone is pre-trained for supervised classification using ImageNet-1k images and labels. We use the DeiT released weights [36] for ViT-B and the ViT-L weights from [16], which uses an even stronger training formula than DeiT to avoid overfitting (moreover, the DeiT release does not include ViT-L). ViT-B and ViT-L were pre-trained for 300 and 200 epochs, respectively.

MoCo v3: We use the unsupervised ImageNet-1k pre-trained ViT-B and ViT-L weights from the authors of [7] (ViT-B is public; ViT-L was provided via private communication). These models were pre-trained for 300 epochs.

BEiT: Since ImageNet-1k pre-trained weights are not available, we use the official BEiT code release [1] to train ViT-B and ViT-L ourselves for 800 epochs (the default training length used in [1]) on unsupervised ImageNet-1k.

MAE: We use the ViT-B and ViT-L weights pre-trained on unsupervised ImageNet-1k from the authors of [16]. These models were pre-trained for 1600 epochs using normalized pixels as the target.

3.1. Nuisance Factors in Pre-training

We attempt to make comparisons as equally matched as possible, yet there are pre-training nuisance factors, listed below, that differ across methods.

(1) Different pre-training methods may use different numbers of epochs. We adopt the default number of pre-training epochs from the respective papers. While these values may not appear comparable, the reality is unclear: not all methods may benefit equally from longer training and not all methods have the same per-epoch training cost (e.g., BEiT uses roughly 3× more flops than MAE).

(2) BEiT uses learned relative position biases that are added to the self-attention logits [31] in each block, instead of the absolute position embeddings used by the other methods. To account for this, albeit imperfectly, we include both relative position biases and absolute position embeddings in all detection models regardless of their use in pre-training. For BEiT, we transfer the pre-trained biases and randomly initialize the absolute position embeddings. For all other methods, we zero-initialize the relative position biases and transfer the pre-trained absolute position embeddings. Relative position biases are shared across windowed attention blocks and (separately) shared across global attention blocks. When there is a spatial dimension mismatch between pre-training and fine-tuning, we resize the pre-trained parameters to the required fine-tuning resolution (see the sketch after this list).

(3) BEiT makes use of layer scale [37] in pre-training, while the other methods do not. During fine-tuning, the BEiT-initialized model must also be parameterized to use layer scale, with the layer scaling parameters initialized from the pre-trained model. All other models do not use layer scale in pre-training or in fine-tuning.

(4) We try to standardize pre-training data to ImageNet-1k; however, BEiT uses the DALL·E [32] discrete VAE (dVAE), which was trained on ∼250 million proprietary and undisclosed images, as an image tokenizer. The impact of this additional training data is not fully understood.
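The spatial resizing mentioned in (2) is commonly done by interpolating the pre-trained absolute position embeddings to the fine-tuning patch grid. The sketch below is our own illustration (bicubic interpolation, a square grid, and the absence of a class-token embedding are assumptions, not necessarily the exact procedure used here):

```python
import torch
import torch.nn.functional as F

def resize_abs_pos_embed(pos: torch.Tensor, new_hw: int = 64) -> torch.Tensor:
    """Interpolate (1, N, C) absolute position embeddings from a square
    pre-training grid (e.g., 14x14 for 224/16) to the fine-tuning grid
    (e.g., 64x64 for 1024/16)."""
    _, n, c = pos.shape
    old_hw = int(n ** 0.5)
    grid = pos.reshape(1, old_hw, old_hw, c).permute(0, 3, 1, 2)   # -> (1, C, h, w)
    grid = F.interpolate(grid, size=(new_hw, new_hw), mode="bicubic", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, new_hw * new_hw, c)

print(resize_abs_pos_embed(torch.randn(1, 14 * 14, 768)).shape)  # (1, 4096, 768)
```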
initialization   pre-training data   APbox ViT-B   APbox ViT-L   APmask ViT-B   APmask ViT-L
supervised       IN1k w/ labels      47.9          49.3          42.9           43.9
random           none                48.9          50.7          43.6           44.9
MoCo v3          IN1k                47.9          49.3          42.7           44.0
BEiT             IN1k+DALL·E         49.8          53.3          44.4           47.1
MAE              IN1k                50.3          53.3          44.9           47.2

Table 1. COCO object detection and instance segmentation using our ViT-based Mask R-CNN baseline. Results are reported on COCO 2017 val using the best schedule length (see Figure 2). Random initialization does not use any pre-training data, supervised initialization uses IN1k with labels, and all other initializations use IN1k without labels. Additionally, BEiT uses a dVAE trained on the proprietary DALL·E dataset of ∼250M images [32].

4. Experiments and Analysis

4.1. Comparing Initializations

Results. In Table 1, we compare COCO fine-tuning results using the pre-trained initializations and random initialization described in §3. We show results after maximizing APbox over the considered training lengths: 25, 50, or 100 epochs for pre-trained initializations, and 100, 200, or 400 epochs for random initialization. (We discuss convergence below.) Next, we make several observations.

(1) Our updated Mask R-CNN trains smoothly with ViT-B and ViT-L backbones regardless of the initialization method.
It does not exhibit instabilities nor does it require stabilizing techniques like gradient clipping.

(2) Training from scratch yields up to 1.4 higher APbox than fine-tuning from supervised IN1k pre-training (50.7 vs. 49.3). While the higher AP may sound surprising, the same trend is observed in [13]. Supervised pre-training is not always a stronger baseline than random initialization.

(3) The contrastive learning-based MoCo v3 underperforms random initialization's AP and has similar results compared to supervised initialization.

(4) For ViT-B, BEiT and MAE outperform both random initialization by up to 1.4 APbox (50.3 vs. 48.9) and supervised initialization by up to 2.4 APbox (50.3 vs. 47.9).

(5) For ViT-L, the APbox gap increases, with BEiT and MAE substantially outperforming both random initialization by up to 2.6 APbox (53.3 vs. 50.7) and supervised initialization by up to 4.0 APbox (53.3 vs. 49.3).

Figure 2. Impact of fine-tuning epochs. Convergence plots for fine-tuning from 25 to 400 epochs on COCO (ViT-B, top; ViT-L, bottom). All pre-trained initializations converge much faster (∼4×) compared to random initialization, though they achieve varied peak APbox. The performance gap between the masking-based methods (MAE and BEiT) and all others is visually evident. When increasing model scale from ViT-B to ViT-L, this gap also increases, suggesting that these methods may have superior scaling properties.

Convergence. In Figure 2 we show how pre-training impacts fine-tuning convergence. Given the tuned hyperparameters for each initialization method, we train models for 2× and 4× longer (and also 0.5× for random initialization). Generally, we find that all pre-trained initializations significantly accelerate convergence compared to random initialization, as observed in [18]. Most methods show signs of overfitting when the training schedule is made sufficiently long, typically by 100 epochs for pre-trained initializations and 400 epochs for random initialization. Based on this data, pre-training tends to accelerate training on COCO by roughly 4× compared to random initialization.

We also note two caveats about these results: (i) The drop path rate should ideally be tuned for each training duration, as we have observed that the optimal dp value may need to increase when models are trained for longer. (However, performing an exhaustive dp sweep for all initializations, model sizes, and training durations is likely computationally impractical.) (ii) Moreover, it may be possible to achieve better results in all cases by training for longer under a more complex training formula that employs heavier regularization and stronger data augmentation.

Discussion. The COCO dataset is a challenging setting for transfer learning. Due to the large training set (∼118k images with ∼0.9M annotated objects), it is possible to achieve strong results when training from random initialization. We find that existing methods, like supervised IN1k or unsupervised MoCo v3 pre-training, actually underperform the AP of the random initialization baseline (though they yield faster convergence). Prior works reporting unsupervised transfer learning improvements on COCO (e.g., [17]) tend to show modest gains over supervised pre-training (e.g., ∼1 APbox) and do not include a strong random initialization baseline as we do here (because strong training formulae based on large-scale jitter had not yet been developed). Moreover, they use weaker models and report results that are overall much lower (e.g., ∼40 APbox), making it unclear how well the findings translate to state-of-the-art practices.

We find that MAE and BEiT provide the first convincing results of substantial COCO AP improvements due to pre-training. Moreover, these masking-based methods show the potential to improve detection transfer learning as model size increases. We do not observe this important scaling trend with either supervised IN1k pre-training or unsupervised contrastive learning, as represented by MoCo v3.

4.2. Ablations and Analysis

We ablate several factors involved in the system comparison, analyze model complexity, and report tuned hyperparameter values. For these experiments, we use MAE and 50 epoch fine-tuning by default.

Single-scale vs. Multi-scale. In Table 2 we compare our default FPN-based multi-scale detector to a single-scale variant. The single-scale variant simply applies RPN and RoIAlign [19] to the final 1/16th resolution feature map generated by the ViT backbone. The RoI heads and all other choices are the same between the systems (in particular, note that both use the same hybrid windowed/global attention). We observe that the multi-scale FPN design increases APbox consistently (Table 2).
FPN    APbox ViT-B   APbox ViT-L
yes*   50.1          53.3
no     48.4          52.0

Table 2. Single-scale vs. multi-scale (FPN) ablation. FPN yields consistent improvements. Our default setting is marked with *.

self-attention            act checkpt   APbox   train mem   train time   test time
(1) windowed              no            50.7    16GB        0.67s        0.34s
(2) windowed, 4 global*   no            53.3    27GB        0.93s        0.40s
(3) global                yes           53.1    14GB        2.26s        0.65s
(4) global                no            -       OOM         -            -

Table 3. Memory and time reduction strategies. We compare methods for reducing memory and time when using ViT-L in Mask R-CNN. The strategies include: (1) replace all global self-attention with 14 × 14 non-overlapping windowed self-attention, (2) a hybrid that uses both windowed and global self-attention, or (3) all global attention with activation checkpointing. Without any of these strategies (row 4) an out-of-memory (OOM) error prevents training. We report APbox, peak GPU training memory, average per-iteration training time, and average per-image inference time using NVIDIA V100-32GB GPUs. The per-GPU batch size is 1. Our default (row 2, marked with *) achieves a good balance between memory, time, and APbox metrics. In fact, our hybrid approach achieves comparable APbox to full global attention, while being much faster.
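Strategy (3) in Table 3 relies on activation checkpointing; a minimal sketch of wrapping a transformer block this way (our own illustration, with a generic block module, not the benchmarked implementation) is:

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

class CheckpointedBlock(nn.Module):
    """Run a block under activation checkpointing: activations are recomputed in
    the backward pass, cutting peak memory at the cost of extra training time
    (cf. rows (3) vs. (4) in Table 3)."""
    def __init__(self, block: nn.Module):
        super().__init__()
        self.block = block

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            return checkpoint(self.block, x)
        return self.block(x)
```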
initialization   pt: abs   pt: rel   ft: abs   ft: rel   APbox ViT-B   APbox ViT-L
(1) BEiT*        no        yes       rand      pt        49.8          53.3
(2) BEiT         no        yes       rand      zero      49.5          53.2
(3) BEiT†        yes       no        pt        zero      -             53.1
(4) MAE*         yes       no        pt        zero      50.1          53.3
(5) MAE          yes       no        pt        no        49.9          53.0

Table 4. Positional information ablation. In the BEiT code, the ViT is modified to use relative position biases (rel) instead of absolute position embeddings (abs). We study how these components impact results based on their use in pre-training (pt) and under various treatments in fine-tuning (ft): (i) pt: initialized with pre-trained values; (ii) rand: random initialization; (iii) zero: initialized at zero; and (iv) no: this positional information is not used in the fine-tuned model. For BEiT† (row 3), we pre-train an additional model (ViT-L only) that, like MAE, uses absolute position embeddings instead of relative position biases. Our default settings are marked with *. Comparing (1) and (2), we observe that pre-trained relative position bias initialization provides a slight benefit over zero initialization. Comparing (1, 2) to (3), we see that BEiT pre-trained with absolute position embeddings performs similarly (perhaps slightly worse) to pre-training with relative position biases. Comparing (4) and (5), we see that including relative position biases in addition to absolute position embeddings provides a small improvement.
Figure 3. Impact of pre-training epochs. Increasing MAE pre-training from 100 to 800 epochs confers large transfer learning gains. The improvements start to plateau after 800 epochs.

backbone     params (M)   acts (M)     flops (G)    fps
ResNet-101   65           426 ± 43     422 ± 35     13.7
ViT-B        116          1532 ± 11    853 ± 13     5.1
ViT-L        339          2727 ± 10    1907 ± 12    2.5

Table 5. Model complexity for inference with the specific Mask R-CNN configuration used in this report. For ViT, the image resolution is 1024 × 1024 (padded as necessary). The flop and activation counts are measured at runtime and vary based on the number of detected objects. We report the mean ± one standard deviation from 100 validation images. Results change very slightly when using different initializations. For reference, we report results using the ResNet-101 backbone, which can (and does) use non-square inputs at inference time (longest side is 1024); otherwise inference settings are the same. The ResNet-101 based Mask R-CNN achieves 48.9 APbox when trained from scratch for 400 epochs. We also report wall-clock speed in frames-per-second (fps) on an NVIDIA V100-32GB GPU.

[Figure: ∆APbox at an IoU threshold of 0.5 by initialization (MAE, BEiT, random, MoCo v3, supervised IN1k).]
References

[1] Hangbo Bao, Li Dong, and Furu Wei. BEiT: BERT pre-training of image transformers. arXiv:2106.08254, 2021.
[2] Josh Beal, Eric Kim, Eric Tzeng, Dong Huk Park, Andrew Zhai, and Dmitry Kislyuk. Toward transformer-based object detection. arXiv:2012.09958, 2020.
[3] Daniel Bolya, Sean Foley, James Hays, and Judy Hoffman. TIDE: A general toolbox for identifying object detection errors. In ECCV, 2020.
[4] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In CVPR, 2018.
[5] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In ICCV, 2021.
[6] Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, et al. Hybrid task cascade for instance segmentation. In CVPR, 2019.
[7] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised Vision Transformers. In ICCV, 2021.
[8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[9] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In ICCV, 2015.
[10] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[11] Alaaeldin El-Nouby, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, et al. XCiT: Cross-covariance image transformers. arXiv:2106.09681, 2021.
[12] Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer. Multiscale vision transformers. arXiv:2104.11227, 2021.
[13] Golnaz Ghiasi, Yin Cui, Aravind Srinivas, Rui Qian, Tsung-Yi Lin, Ekin D Cubuk, Quoc V Le, and Barret Zoph. Simple copy-paste is a strong data augmentation method for instance segmentation. In CVPR, 2021.
[14] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[15] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv:1706.02677, 2017.
[16] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. arXiv:2111.06377, 2021.
[17] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
[18] Kaiming He, Ross Girshick, and Piotr Dollár. Rethinking ImageNet pre-training. In ICCV, 2019.
[19] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[21] Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs). arXiv:1606.08415, 2016.
[22] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In ECCV, 2016.
[23] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
[24] Gustav Larsson, Michael Maire, and Gregory Shakhnarovich. FractalNet: Ultra-deep neural networks without residuals. In ICLR, 2016.
[25] Yann LeCun, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard, and Lawrence D Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1989.
[26] Yanghao Li, Chao-Yuan Wu, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, and Christoph Feichtenhofer. Improved multiscale vision transformers for classification and detection. In preparation, 2021.
[27] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
[28] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[29] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. arXiv:2103.14030, 2021.
[30] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2019.
[31] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv:1910.10683, 2019.
[32] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. arXiv:2102.12092, 2021.
[33] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.
[34] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. TPAMI, 2017.
[35] Pierre Sermanet, Koray Kavukcuoglu, Sandhya Chintala, and Yann LeCun. Pedestrian detection with unsupervised multi-stage feature learning. In CVPR, 2013.
[36] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. arXiv:2012.12877, 2020.
[37] Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going deeper with image transformers. arXiv:2103.17239, 2021.
[38] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[39] Yuxin Wu and Kaiming He. Group normalization. In ECCV, 2018.
[40] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.