DETRs With Collaborative Hybrid Assignments Training
Abstract

In this paper, we provide the observation that too few queries are assigned as positive samples in DETR's one-to-one set matching, leading to sparse supervision on the encoder's output that considerably hurts the discriminative feature learning of the encoder, and vice versa for attention learning in the decoder. To alleviate this, we present a novel collaborative hybrid assignments training scheme, namely Co-DETR, to learn more efficient and effective DETR-based detectors from versatile label assignment manners. This new training scheme can easily enhance the encoder's learning ability in end-to-end detectors by training multiple parallel auxiliary heads supervised by one-to-many label assignments such as ATSS and Faster R-CNN. In addition, we conduct extra customized positive queries by extracting the positive coordinates from these auxiliary heads to improve the training efficiency of positive samples in the decoder. In inference, these auxiliary heads are discarded and thus our method introduces no additional parameters and computational cost to the original detector while requiring no hand-crafted non-maximum suppression (NMS). We conduct extensive experiments to evaluate the effectiveness of the proposed approach on DETR variants, including DAB-DETR, Deformable-DETR, and DINO-Deformable-DETR. The state-of-the-art DINO-Deformable-DETR with Swin-L can be improved from 58.5% to 59.5% AP on COCO val. Surprisingly, incorporated with a ViT-L backbone, we achieve 66.0% AP on COCO test-dev and 67.9% AP on LVIS val, outperforming previous methods by clear margins with much smaller model sizes. Codes are available at https://github.com/Sense-X/Co-DETR.

[Figure 1. Performance of models with ResNet-50 on COCO val (AP versus training epoch for Co-DETR, DINO-Deformable-DETR, H-Deformable-DETR, Deformable-DETR, DAB-DETR, DN-DETR, Faster-RCNN, and HTC). Co-DETR outperforms other counterparts by a large margin.]

1. Introduction

Object detection is a fundamental task in computer vision, which requires us to localize the object and classify its category. The seminal R-CNN families [11, 14, 27] and a series of variants [31, 37, 44] such as ATSS [41], RetinaNet [21], FCOS [32], and PAA [17] lead to the significant breakthrough of the object detection task. One-to-many label assignment is their core scheme, where each ground-truth box is assigned to multiple coordinates in the detector's output as the supervised target, cooperated with proposals [11, 27], anchors [21], or window centers [32]. Despite their promising performance, these detectors heavily rely on many hand-designed components like a non-maximum suppression procedure or anchor generation [1]. To conduct a more flexible end-to-end detector, the DEtection TRansformer (DETR) [1] is proposed to view object detection as a set prediction problem and introduces a one-to-one set matching scheme based on a transformer encoder-decoder architecture. In this manner, each ground-truth box will only be assigned to one specific query, and multiple hand-designed components that encode prior knowledge are no longer needed. This approach introduces a flexible detection pipeline and encourages many DETR variants to further improve it. However, the performance of the vanilla end-to-end object detector is still inferior to traditional detectors with one-to-many label assignments.

* Corresponding author.
[Figure 2. IoF-IoB curves (axes: IoB versus IoF) comparing ATSS, Deformable-DETR, Group-DETR, and Co-Deformable-DETR for the encoder's feature discriminability scores and the decoder's cross-attention scores.]
[Figure 3. Input image and discriminability score maps for ATSS, Co-Deformable-DETR, and Deformable-DETR.]
In this paper, we try to make DETR-based detectors superior to conventional detectors while maintaining their end-to-end merit. To address this challenge, we focus on the intuitive drawback of one-to-one set matching: it explores too few positive queries. This leads to severely inefficient training. We analyze this in detail from two aspects, the latent representation generated by the encoder and the attention learning in the decoder. We first compare the discriminability score of the latent features between Deformable-DETR [43] and a one-to-many label assignment method where we simply replace the decoder with the ATSS head. The feature l2-norm in each spatial coordinate is utilized to represent the discriminability score. Given the encoder's output F ∈ R^{C×H×W}, we can obtain the discriminability score map S ∈ R^{1×H×W}. The object can be better detected when the scores in the corresponding area are higher. As shown in Figure 2, we demonstrate the IoF-IoB curve (IoF: intersection over foreground, IoB: intersection over background) by applying different thresholds on the discriminability scores (details in Section 3.4). The higher IoF-IoB curve of ATSS indicates that it is easier to distinguish the foreground and background. We further visualize the discriminability score map S in Figure 3. It is obvious that the features in some salient areas are fully activated in the one-to-many label assignment method but less explored in one-to-one set matching. For the exploration of decoder training, we also demonstrate the IoF-IoB curve of the cross-attention score in the decoder based on Deformable-DETR and Group-DETR [5], which introduces more positive queries into the decoder. The illustration in Figure 2 shows that too few positive queries also influence attention learning, and introducing more positive queries in the decoder can slightly alleviate this.

This significant observation motivates us to present a simple but effective method, a collaborative hybrid assignment training scheme (Co-DETR). The key insight of Co-DETR is to use versatile one-to-many label assignments to improve the training efficiency and effectiveness of both the encoder and decoder. More specifically, we integrate auxiliary heads with the output of the transformer encoder. These heads can be supervised by versatile one-to-many label assignments such as ATSS [41], FCOS [32], and Faster R-CNN [27]. Different label assignments enrich the supervisions on the encoder's output, which forces it to be discriminative enough to support the training convergence of these heads. To further improve the training efficiency of the decoder, we elaborately encode the coordinates of positive samples in these auxiliary heads, including the positive anchors and positive proposals. They are sent to the original decoder as multiple groups of positive queries to predict the pre-assigned categories and bounding boxes. Positive coordinates in each auxiliary head serve as an independent group that is isolated from the other groups. Versatile one-to-many label assignments can introduce lavish (positive query, ground-truth) pairs to improve the decoder's training efficiency. Note that only the original decoder is used during inference, thus the proposed training scheme only introduces extra overheads during training.

We conduct extensive experiments to evaluate the efficiency and effectiveness of the proposed method. As illustrated in Figure 3, Co-DETR greatly alleviates the poor encoder feature learning in one-to-one set matching. As a plug-and-play approach, we easily combine it with different DETR variants, including DAB-DETR [23], Deformable-DETR [43], and DINO-Deformable-DETR [39]. As shown in Figure 1, Co-DETR achieves faster training convergence and even higher performance. Specifically, we improve the basic Deformable-DETR by 5.8% AP in 12-epoch training and 3.2% AP in 36-epoch training. The state-of-the-art DINO-Deformable-DETR with Swin-L [25] can still be improved from 58.5% to 59.5% AP on COCO val. Surprisingly, incorporated with a ViT-L [8] backbone, we achieve 66.0% AP on COCO test-dev and 67.9% AP on LVIS val, establishing the new state-of-the-art detector with a much smaller model size.

2. Related Works

One-to-many label assignment. For one-to-many label assignment in object detection, multiple box candidates can be assigned to the same ground-truth box as positive samples in the training phase. In classic anchor-based detectors, such as Faster-RCNN [27] and RetinaNet [21], the sample selection is guided by the predefined IoU threshold and the matching IoU between anchors and annotated boxes. The anchor-free FCOS [32] leverages the center priors and assigns spatial locations near the center of each bounding box as positives.
[Figure 4. Framework of our Collaborative Hybrid Assignments Training. The backbone and transformer encoder feed the original one-to-one set matching branch and K training-only branches with one-to-many label assignments A_1, ..., A_K; the positive coordinates B_1^{pos}, ..., B_K^{pos} from these branches generate customized positive query groups Q_1, ..., Q_K for the shared transformer decoder. The auxiliary branches are discarded during evaluation.]
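For concreteness, the following sketch outlines one training step under the framework in Figure 4. It only illustrates the data flow; every module and argument name here (backbone, encoder, decoder, aux_heads, make_pos_queries, one_to_one_loss, decoder_loss, extra_queries) is a hypothetical placeholder and not the authors' released implementation.

```python
# Illustrative sketch of one Co-DETR training step (all names are placeholders,
# not the authors' released implementation).

def co_detr_training_step(images, targets, backbone, encoder, decoder,
                          aux_heads, make_pos_queries, one_to_one_loss,
                          decoder_loss, lambda1=1.0, lambda2=2.0):
    feats = encoder(backbone(images))              # latent features F

    # Original one-to-one set matching branch (the only branch kept at inference).
    main_out = decoder(feats)                      # learnable object queries inside
    loss = one_to_one_loss(main_out, targets)      # Hungarian matching + losses

    for head in aux_heads:                         # K auxiliary one-to-many heads
        aux_loss, pos_coords = head(feats, targets)   # A_i assigns pos/neg samples
        loss = loss + lambda2 * aux_loss              # denser encoder supervision

        # Customized positive queries from B_i^{pos}, decoded as an isolated group
        # by the shared decoder and supervised with their pre-assigned targets.
        pos_queries = make_pos_queries(pos_coords, feats)
        aux_out = decoder(feats, extra_queries=pos_queries)
        loss = loss + lambda1 * decoder_loss(aux_out, pos_coords)

    return loss
```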
Moreover, adaptive mechanisms have been incorporated into one-to-many label assignments to overcome the limitation of fixed label assignments. ATSS [41] performs adaptive anchor selection using the statistical dynamic IoU values of the top-k closest anchors. PAA [17] adaptively separates anchors into positive and negative samples in a probabilistic manner. In this paper, we propose a collaborative hybrid assignment scheme to improve encoder representations via auxiliary heads with one-to-many label assignments.

One-to-one set matching. The pioneering transformer-based detector, DETR [1], incorporates the one-to-one set matching scheme into object detection and performs fully end-to-end object detection. The one-to-one set matching strategy first calculates the global matching cost via Hungarian matching and assigns only one positive sample with the minimum matching cost to each ground-truth box. DN-DETR [18] demonstrates that the slow convergence results from the instability of one-to-one set matching, thus introducing denoising training to eliminate this issue. DINO [39] inherits the advanced query formulation of DAB-DETR [23] and incorporates an improved contrastive denoising technique to achieve state-of-the-art performance. Group-DETR [5] constructs group-wise one-to-many label assignment to exploit multiple positive object queries, which is similar to the hybrid matching scheme in H-DETR [16]. In contrast with the above follow-up works, we present a new perspective of collaborative optimization for one-to-one set matching.

3. Method

3.1. Overview

Following the standard DETR protocol, the input image is fed into the backbone and encoder to generate latent features. Multiple predefined object queries then interact with them in the decoder via cross-attention. We introduce Co-DETR to improve the feature learning in the encoder and the attention learning in the decoder via the collaborative hybrid assignments training scheme and the customized positive queries generation. We describe these modules in detail below and give insights into why they work well.

3.2. Collaborative Hybrid Assignments Training

To alleviate the sparse supervision on the encoder's output caused by the few positive queries in the decoder, we incorporate versatile auxiliary heads with different one-to-many label assignment paradigms, e.g., ATSS and Faster R-CNN. Different label assignments enrich the supervisions on the encoder's output, which forces it to be discriminative enough to support the training convergence of these heads. Specifically, given the encoder's latent feature F, we first transform it into the feature pyramid {F_1, ..., F_J} via the multi-scale adapter, where F_j denotes the feature map with downsampling stride 2^{2+j}. Similar to ViTDet [20], the feature pyramid is constructed from a single feature map in the single-scale encoder, where we use bilinear interpolation and 3×3 convolution for upsampling. For instance, with the single-scale feature from the encoder, we successively apply downsampling (3×3 convolution with stride 2) or upsampling operations to produce a feature pyramid. For the multi-scale encoder, we only downsample the coarsest feature in the multi-scale encoder features F to build the feature pyramid. Given K collaborative heads with corresponding label assignment manners A_k, {F_1, ..., F_J} is sent to the i-th collaborative head to obtain the predictions $\hat{P}_i$. At the i-th head, A_i is used to compute the supervised targets for the positive and negative samples in $\hat{P}_i$. Denoting G as the ground-truth set, this procedure can be formulated as:

$\{P_i^{\{pos\}},\, B_i^{\{pos\}},\, P_i^{\{neg\}}\} = \mathcal{A}_i(\hat{P}_i, G),$   (1)

where {pos} and {neg} indicate the pair sets of (j, positive coordinates or negative coordinates in F_j) determined by $\mathcal{A}_i$, and j denotes the feature index in {F_1, ..., F_J}.
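As an illustration of the multi-scale adapter described above, the sketch below builds a small feature pyramid from a single-scale encoder map using stride-2 3×3 convolutions for coarser levels and bilinear interpolation followed by a 3×3 convolution for finer levels. Channel sizes and module structure are assumptions, not the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleScaleAdapter(nn.Module):
    """Builds {F_1, ..., F_J} from one encoder feature map (illustrative sketch)."""

    def __init__(self, channels=256, num_down=2, num_up=1):
        super().__init__()
        # stride-2 3x3 convolutions produce the coarser pyramid levels
        self.down = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, stride=2, padding=1)
            for _ in range(num_down)
        )
        # a 3x3 convolution refines each bilinearly upsampled (finer) level
        self.up = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1) for _ in range(num_up)
        )

    def forward(self, feat):
        finer, x = [], feat
        for conv in self.up:                       # finer levels via upsampling
            x = F.interpolate(x, scale_factor=2.0, mode="bilinear",
                              align_corners=False)
            finer.insert(0, conv(x))
        coarser, x = [], feat
        for conv in self.down:                     # coarser levels via stride-2 convs
            x = conv(x)
            coarser.append(x)
        return finer + [feat] + coarser            # ordered fine -> coarse

adapter = SingleScaleAdapter()
pyramid = adapter(torch.randn(1, 256, 64, 64))
print([p.shape for p in pyramid])                  # e.g. 128x128, 64x64, 32x32, 16x16
```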
| Head i | Loss L_i | Assignment A_i: {pos}, {neg} generation | P_i^{pos} generation | B_i^{pos} generation |
| --- | --- | --- | --- | --- |
| Faster-RCNN [27] | cls: CE loss; reg: GIoU loss | {pos}: IoU(proposal, gt) > 0.5; {neg}: IoU(proposal, gt) < 0.5 | {pos}: gt labels, offset(proposal, gt); {neg}: gt labels | positive proposals (x1, y1, x2, y2) |
| ATSS [41] | cls: Focal loss; reg: GIoU, BCE loss | {pos}: IoU(anchor, gt) > (mean+std); {neg}: IoU(anchor, gt) < (mean+std) | {pos}: gt labels, offset(anchor, gt), centerness; {neg}: gt labels | positive anchors (x1, y1, x2, y2) |
| RetinaNet [21] | cls: Focal loss; reg: GIoU loss | {pos}: IoU(anchor, gt) > 0.5; {neg}: IoU(anchor, gt) < 0.4 | {pos}: gt labels, offset(anchor, gt); {neg}: gt labels | positive anchors (x1, y1, x2, y2) |
| FCOS [32] | cls: Focal loss; reg: GIoU, BCE loss | {pos}: points inside gt center area; {neg}: points outside gt center area | {pos}: gt labels, ltrb distance, centerness; {neg}: gt labels | FCOS point (cx, cy), w = h = 8 × 2^{2+j} |

Table 1. Detailed information of auxiliary heads. The auxiliary heads include Faster-RCNN [27], ATSS [41], RetinaNet [21], and FCOS [32]. If not otherwise specified, we follow the original implementations, e.g., anchor generation.
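To make the {pos}/{neg} generation rules in Table 1 concrete, here is a minimal sketch of the simplest of them, a fixed IoU-threshold assignment in the RetinaNet style (positive above 0.5, negative below 0.4). It is illustrative only and omits details such as forcing the best anchor per ground truth to be positive.

```python
import torch

def box_iou(boxes1, boxes2):
    """Pairwise IoU between [N, 4] and [M, 4] boxes in (x1, y1, x2, y2) format."""
    area1 = (boxes1[:, 2] - boxes1[:, 0]) * (boxes1[:, 3] - boxes1[:, 1])
    area2 = (boxes2[:, 2] - boxes2[:, 0]) * (boxes2[:, 3] - boxes2[:, 1])
    lt = torch.max(boxes1[:, None, :2], boxes2[None, :, :2])
    rb = torch.min(boxes1[:, None, 2:], boxes2[None, :, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area1[:, None] + area2[None, :] - inter)

def iou_threshold_assign(anchors, gt_boxes, pos_thr=0.5, neg_thr=0.4):
    """Return indices of positive/negative anchors and the matched gt per anchor."""
    iou = box_iou(anchors, gt_boxes)          # [num_anchors, num_gt]
    max_iou, matched_gt = iou.max(dim=1)      # best ground truth for every anchor
    pos = (max_iou >= pos_thr).nonzero(as_tuple=True)[0]
    neg = (max_iou < neg_thr).nonzero(as_tuple=True)[0]
    return pos, neg, matched_gt               # anchors in between are ignored

anchors = torch.tensor([[0., 0., 10., 10.], [5., 5., 15., 15.], [40., 40., 60., 60.]])
gts = torch.tensor([[0., 0., 9., 9.]])
print(iou_threshold_assign(anchors, gts))
```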
$B_i^{\{pos\}}$ is the set of spatial positive coordinates. $P_i^{\{pos\}}$ and $P_i^{\{neg\}}$ are the supervised targets in the corresponding coordinates, including the categories and regressed offsets. To be specific, we describe the detailed information about each variable in Table 1. The loss functions can be defined as:

$\mathcal{L}_i^{enc} = \mathcal{L}_i(\hat{P}_i^{\{pos\}}, P_i^{\{pos\}}) + \mathcal{L}_i(\hat{P}_i^{\{neg\}}, P_i^{\{neg\}}).$   (2)

Note that the regression loss is discarded for negative samples. The training objective of the optimization for the K auxiliary heads is formulated as:

$\mathcal{L}^{enc} = \sum_{i=1}^{K} \mathcal{L}_i^{enc}.$   (3)

3.3. Customized Positive Queries Generation

In the one-to-one set matching paradigm, each ground-truth box will only be assigned to one specific query as the supervised target. Too few positive queries lead to inefficient cross-attention learning in the transformer decoder, as shown in Figure 2. To alleviate this, we elaborately generate sufficient customized positive queries according to the label assignment $\mathcal{A}_i$ in each auxiliary head. Specifically, given the positive coordinate set $B_i^{\{pos\}} \in \mathbb{R}^{M_i \times 4}$ in the i-th auxiliary head, where $M_i$ is the number of positive samples, the extra customized positive queries $Q_i \in \mathbb{R}^{M_i \times C}$ can be generated by:

$Q_i = \mathrm{Linear}(\mathrm{PE}(B_i^{\{pos\}})) + \mathrm{Linear}(\mathrm{E}(\{\mathcal{F}_{*}\}, \{pos\})),$   (4)

where $\mathrm{PE}(\cdot)$ stands for positional encodings and we select the corresponding features from $\mathrm{E}(\cdot)$ according to the index pair (j, positive coordinates in $\mathcal{F}_j$) determined by $\mathcal{A}_i$.

As a result, there are K + 1 groups of queries that contribute to a single one-to-one set matching branch and K branches with one-to-many label assignments during training. The auxiliary one-to-many label assignment branches share the same parameters with the L decoder layers in the original main branch. All the queries in the auxiliary branches are regarded as positive queries, thus the matching process is discarded. To be specific, the loss of the l-th decoder layer in the i-th auxiliary branch can be formulated as:

$\mathcal{L}_{i,l}^{dec} = \widetilde{\mathcal{L}}(\widetilde{P}_{i,l}, P_i^{\{pos\}}).$   (5)

$\widetilde{P}_{i,l}$ refers to the output predictions of the l-th decoder layer in the i-th auxiliary branch. Finally, the training objective for Co-DETR is:

$\mathcal{L}^{global} = \sum_{l=1}^{L}\big(\widetilde{\mathcal{L}}_l^{dec} + \lambda_1 \sum_{i=1}^{K} \mathcal{L}_{i,l}^{dec} + \lambda_2 \mathcal{L}^{enc}\big),$   (6)

where $\widetilde{\mathcal{L}}_l^{dec}$ stands for the loss in the original one-to-one set matching branch [1], and $\lambda_1$ and $\lambda_2$ are the coefficients balancing the losses.
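A minimal sketch of the customized positive query generation in Eq. (4) above: positive box coordinates are encoded with an assumed sinusoidal positional encoding and combined with encoder features gathered at the positive locations through two linear layers. The helper names and encoding details are assumptions for illustration, not the released code.

```python
import torch
import torch.nn as nn

def sine_pos_encoding(boxes, dim=256, temperature=10000.0):
    """Sinusoidal encoding of normalized (x1, y1, x2, y2) boxes -> [M, dim]."""
    freqs = torch.arange(dim // 8, dtype=torch.float32)
    freqs = temperature ** (freqs / (dim // 8))
    enc = boxes[:, :, None] / freqs                      # [M, 4, dim//8]
    enc = torch.cat([enc.sin(), enc.cos()], dim=-1)      # [M, 4, dim//4]
    return enc.flatten(1)                                # [M, dim]

class PositiveQueryGenerator(nn.Module):
    """Q_i = Linear(PE(B_i^{pos})) + Linear(E(F, pos)) -- illustrative sketch."""

    def __init__(self, dim=256):
        super().__init__()
        self.proj_pos = nn.Linear(dim, dim)
        self.proj_feat = nn.Linear(dim, dim)

    def forward(self, pos_boxes, gathered_feats):
        # pos_boxes:      [M, 4] normalized positive coordinates B_i^{pos}
        # gathered_feats: [M, dim] encoder features sampled at the positive locations,
        #                 assumed to be selected beforehand by the assignment A_i
        q = self.proj_pos(sine_pos_encoding(pos_boxes, self.proj_pos.in_features))
        return q + self.proj_feat(gathered_feats)

gen = PositiveQueryGenerator()
queries = gen(torch.rand(7, 4), torch.randn(7, 256))
print(queries.shape)   # torch.Size([7, 256])
```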
3.4. Why Co-DETR works

Co-DETR leads to evident improvements for DETR-based detectors. In the following, we investigate its effectiveness qualitatively and quantitatively. We conduct a detailed analysis based on Deformable-DETR with a ResNet-50 [15] backbone using the 36-epoch setting.

Enrich the encoder's supervisions. Intuitively, too few positive queries lead to sparse supervisions, as only one query is supervised by the regression loss for each ground-truth box. The positive samples in one-to-many label assignment manners receive more localization supervisions, which helps enhance the latent feature learning. To further explore how the sparse supervisions impede model training, we investigate the latent features produced by the encoder in detail. We introduce the IoF-IoB curve to quantize the discriminability score of the encoder's output. Specifically, given the latent feature $\mathcal{F}$ of the encoder, inspired by the feature visualization in Figure 3, we compute the IoF (intersection over foreground) and IoB (intersection over background). Given the encoder's feature $\mathcal{F}_j \in \mathbb{R}^{C \times H_j \times W_j}$ at level j, we first calculate the $l_2$-norm $\hat{\mathcal{F}}_j \in \mathbb{R}^{1 \times H_j \times W_j}$ and resize it to the image size H × W. The discriminability score $\mathcal{D}(\mathcal{F})$ is computed by averaging the scores from all levels:

$\mathcal{D}(\mathcal{F}) = \frac{1}{J}\sum_{j=1}^{J} \frac{\hat{\mathcal{F}}_j}{\max(\hat{\mathcal{F}}_j)}.$   (7)
process. Furthermore, in order to quantify how well cross-attention is being optimized, we also calculate the IoF-IoB curve for the attention score. Similar to the feature discriminability score computation, we set different thresholds for
| Method | K | #epochs | AP |
| --- | --- | --- | --- |
| Conditional DETR-C5 [26] | 0 | 36 | 39.4 |
| Conditional DETR-C5 [26] | 1 | 36 | 41.5 (+2.1) |
| Conditional DETR-C5 [26] | 2 | 36 | 41.8 (+2.4) |
| DAB-DETR-C5 [23] | 0 | 36 | 41.2 |
| DAB-DETR-C5 [23] | 1 | 36 | 43.1 (+1.9) |
| DAB-DETR-C5 [23] | 2 | 36 | 43.5 (+2.3) |
| Deformable-DETR [43] | 0 | 12 | 37.1 |
| Deformable-DETR [43] | 1 | 12 | 42.3 (+5.2) |
| Deformable-DETR [43] | 2 | 12 | 42.9 (+5.8) |
| Deformable-DETR [43] | 0 | 36 | 43.3 |
| Deformable-DETR [43] | 1 | 36 | 46.8 (+3.5) |
| Deformable-DETR [43] | 2 | 36 | 46.5 (+3.2) |

Table 2. Results of plain baselines on COCO val.

| Method | K | #epochs | AP |
| --- | --- | --- | --- |
| Deformable-DETR++ [43] | 0 | 12 | 47.1 |
| Deformable-DETR++ [43] | 1 | 12 | 48.7 (+1.6) |
| Deformable-DETR++ [43] | 2 | 12 | 49.5 (+2.4) |
| DINO-Deformable-DETR† [39] | 0 | 12 | 49.4 |
| DINO-Deformable-DETR† [39] | 1 | 12 | 51.0 (+1.6) |
| DINO-Deformable-DETR† [39] | 2 | 12 | 51.2 (+1.8) |
| Deformable-DETR++‡ [43] | 0 | 12 | 55.2 |
| Deformable-DETR++‡ [43] | 1 | 12 | 56.4 (+1.2) |
| Deformable-DETR++‡ [43] | 2 | 12 | 56.9 (+1.7) |
| DINO-Deformable-DETR†‡ [39] | 0 | 12 | 58.5 |
| DINO-Deformable-DETR†‡ [39] | 1 | 12 | 59.3 (+0.8) |
| DINO-Deformable-DETR†‡ [39] | 2 | 12 | 59.5 (+1.0) |

Table 3. Results of strong baselines on COCO val. Methods with † use 5 feature levels. ‡ refers to Swin-L backbone.
test-dev (20K images) are also reported. LVIS v1.0 is a large-scale and long-tail dataset with 1203 categories for large vocabulary instance segmentation. To verify the scalability of Co-DETR, we further apply it to a large-scale object detection benchmark, namely Objects365 [30]. There are 1.7M labeled images used for training and 80K images for validation in the Objects365 dataset. All results follow the standard mean Average Precision (AP) under IoU thresholds ranging from 0.5 to 0.95 at different object scales.

Implementation Details. We incorporate our Co-DETR into current DETR-like pipelines and keep the training setting consistent with the baselines. We adopt ATSS and Faster-RCNN as the auxiliary heads for K = 2 and only keep ATSS for K = 1. More details about our auxiliary heads can be found in the supplementary materials. We set the number of learnable object queries to 300 and set {λ1, λ2} to {1.0, 2.0} by default. For Co-DINO-Deformable-DETR++, we use large-scale jitter with copy-paste [10].

4.2. Main Results

In this section, we empirically analyze the effectiveness and generalization ability of Co-DETR on different DETR variants in Table 2 and Table 3. All results are reproduced using mmdetection [4]. We first apply the collaborative hybrid assignments training to single-scale DETRs with C5 features. Surprisingly, Conditional-DETR and DAB-DETR obtain 2.4% and 2.3% AP gains over the baselines with a long training schedule. For Deformable-DETR with multi-scale features, the detection performance is significantly boosted from 37.1% to 42.9% AP. The overall improvement (+3.2% AP) still holds when the training time is increased to 36 epochs. Moreover, we conduct experiments on the improved Deformable-DETR (denoted as Deformable-DETR++) following [16], where a +2.4% AP gain is observed. The state-of-the-art DINO-Deformable-DETR equipped with our method can achieve 51.2% AP, which is +1.8% AP higher than the competitive baseline.

We further scale up the backbone capacity from ResNet-50 to Swin-L [25] based on two state-of-the-art baselines. As presented in Table 3, Co-DETR achieves 56.9% AP and surpasses the Deformable-DETR++ baseline by a large margin (+1.7% AP). The performance of DINO-Deformable-DETR with Swin-L can still be boosted from 58.5% to 59.5% AP.

4.3. Comparisons with the state-of-the-art

We apply our method with K = 2 to Deformable-DETR++ and DINO. Besides, the quality focal loss [19] and NMS are adopted for our Co-DINO-Deformable-DETR. We report the comparisons on COCO val in Table 4. Compared with other competitive counterparts, our method converges much faster. For example, Co-DINO-Deformable-DETR readily achieves 52.1% AP when using only 12 epochs with a ResNet-50 backbone. Our method with Swin-L obtains 58.9% AP under the 1× schedule, even surpassing other state-of-the-art frameworks under the 3× schedule. More importantly, our best model Co-DINO-Deformable-DETR++ achieves 54.8% AP with ResNet-50 and 60.7% AP with Swin-L under 36-epoch training, outperforming all existing detectors with the same backbone by clear margins.

To further explore the scalability of our method, we extend the backbone capacity to 304 million parameters. This large-scale backbone ViT-L [7] is pre-trained using a self-supervised learning method (EVA-02 [8]). We first pre-train Co-DINO-Deformable-DETR with ViT-L on Objects365 for 26 epochs, then fine-tune it on the COCO dataset for 12 epochs. In the fine-tuning stage, the input resolution is randomly selected between 480×2400 and 1536×2400. The detailed settings are available in the supplementary materials. Our results are evaluated with test-time augmentation.
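Purely as a summary of the hyper-parameters listed in the Implementation Details paragraph above, the following illustrative configuration object gathers the default values; the field names are assumptions and do not mirror the released configuration files.

```python
from dataclasses import dataclass

@dataclass
class CoDETRTrainConfig:
    """Illustrative summary of the default training settings described above."""
    num_queries: int = 300                 # learnable object queries
    num_aux_heads: int = 2                 # K = 2 -> ATSS + Faster-RCNN, K = 1 -> ATSS
    aux_heads: tuple = ("ATSS", "Faster-RCNN")
    lambda1: float = 1.0                   # weight of the auxiliary decoder losses
    lambda2: float = 2.0                   # weight of the auxiliary encoder losses
    batch_size: int = 32                   # total batch size used in the ablations
    lsj_copy_paste: bool = False           # True for Co-DINO-Deformable-DETR++

print(CoDETRTrainConfig())
```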
Method Backbone Multi-scale #query #epochs AP AP50 AP75 APS APM APL
Conditional-DETR [26] R50 ✗ 300 108 43.0 64.0 45.7 22.7 46.7 61.5
Anchor-DETR [35] R50 ✗ 300 50 42.1 63.1 44.9 22.3 46.2 60.0
DAB-DETR [23] R50 ✗ 900 50 45.7 66.2 49.0 26.1 49.4 63.1
AdaMixer [9] R50 ✓ 300 36 47.0 66.0 51.1 30.1 50.2 61.8
Deformable-DETR [43] R50 ✓ 300 50 46.9 65.6 51.0 29.6 50.1 61.6
DN-Deformable-DETR [18] R50 ✓ 300 50 48.6 67.4 52.7 31.0 52.0 63.7
DINO-Deformable-DETR† [39] R50 ✓ 900 12 49.4 66.9 53.8 32.3 52.5 63.9
DINO-Deformable-DETR† [39] R50 ✓ 900 36 51.2 69.0 55.8 35.0 54.3 65.3
DINO-Deformable-DETR† [39] Swin-L (IN-22K) ✓ 900 36 58.5 77.0 64.1 41.5 62.3 74.0
Group-DINO-Deformable-DETR [5] Swin-L (IN-22K) ✓ 900 36 58.4 - - 41.0 62.5 73.9
H-Deformable-DETR [16] R50 ✓ 300 12 48.7 66.4 52.9 31.2 51.5 63.5
H-Deformable-DETR [16] Swin-L (IN-22K) ✓ 900 36 57.9 76.8 63.6 42.4 61.9 73.4
Co-Deformable-DETR R50 ✓ 300 12 49.5 67.6 54.3 32.4 52.7 63.7
Co-Deformable-DETR Swin-L (IN-22K) ✓ 900 36 58.5 77.1 64.5 42.4 62.4 74.0
Co-DINO-Deformable-DETR† R50 ✓ 900 12 52.1 69.4 57.1 35.4 55.4 65.9
Co-DINO-Deformable-DETR† Swin-L (IN-22K) ✓ 900 12 58.9 76.9 64.8 42.6 62.7 75.1
Co-DINO-Deformable-DETR† Swin-L (IN-22K) ✓ 900 24 59.8 77.7 65.5 43.6 63.5 75.5
Co-DINO-Deformable-DETR† Swin-L (IN-22K) ✓ 900 36 60.0 77.7 66.1 44.6 63.9 75.7
Co-DINO-Deformable-DETR++† R50 ✓ 900 12 52.1 69.3 57.3 35.4 55.5 67.2
Co-DINO-Deformable-DETR++† R50 ✓ 900 36 54.8 72.5 60.1 38.3 58.4 69.6
Co-DINO-Deformable-DETR++† Swin-L (IN-22K) ✓ 900 12 59.3 77.3 64.9 43.3 63.3 75.5
Co-DINO-Deformable-DETR++† Swin-L (IN-22K) ✓ 900 24 60.4 78.3 66.4 44.6 64.2 76.5
Co-DINO-Deformable-DETR++† Swin-L (IN-22K) ✓ 900 36 60.7 78.5 66.7 45.1 64.7 76.4
†: 5 feature levels.
Table 4. Comparison to the state-of-the-art DETR variants on COCO val.
Table 5. Comparison to the state-of-the-art frameworks on COCO.
Table 6. Comparison to the state-of-the-art frameworks on LVIS.
Table 5 presents the state-of-the-art comparisons on the COCO test-dev benchmark. With a much smaller model size (304M parameters), Co-DETR sets a new record of 66.0% AP on COCO test-dev, outperforming the previous best model InternImage-G [34] by +0.5% AP.

We also demonstrate the best results of Co-DETR on the long-tailed LVIS detection dataset. In particular, we use the same Co-DINO-Deformable-DETR++ as the model on COCO but choose FedLoss [42] as the classification loss to remedy the impact of the unbalanced data distribution. Here, we only apply bounding box supervision and report the object detection results. The comparisons are available in Table 6. Co-DETR with Swin-L yields 56.9% and 62.3% AP on LVIS val and minival, surpassing ViTDet with MAE-pretrained [13] ViT-H and GLIPv2 [40] by +3.5% and +2.5% AP, respectively. We further finetune the Objects365-pretrained Co-DETR on this dataset. Without elaborate test-time augmentation, our approach achieves the best detection performance of 67.9% and 71.9% AP on LVIS val and minival. Compared to the 3-billion-parameter InternImage-G with test-time augmentation, we obtain +4.7% and +6.1% AP gains on LVIS val and minival while reducing the model size to 1/10.

4.4. Ablation Studies

Unless stated otherwise, all experiments for ablations are conducted on Deformable-DETR with a ResNet-50 backbone. We set the number of auxiliary heads K to 1 by default and set the total batch size to 32. More ablations and analyses can be found in the supplementary materials.
| Method | K | Auxiliary head | Memory (MB) | GPU hours | AP |
| --- | --- | --- | --- | --- | --- |
| Deformable-DETR++ | 0 | - | 12808 | 70 | 47.1 |
| H-Deformable-DETR | 0 | - | 15307 | 104 | 48.4 |
| Deformable-DETR++ | 1 | ATSS | 13947 | 86 | 48.7 |
| Deformable-DETR++ | 2 | ATSS + PAA | 14629 | 124 | 49.0 |
| Deformable-DETR++ | 2 | ATSS + Faster-RCNN | 14387 | 120 | 49.5 |
| Deformable-DETR++ | 3 | ATSS + Faster-RCNN + PAA | 15263 | 150 | 49.5 |
| Deformable-DETR++ | 6 | ATSS + Faster-RCNN + PAA + RetinaNet + FCOS + GFL | 19385 | 280 | 48.9 |

Table 7. Experimental results of K varying from 1 to 6.

[Figure 6. The distance when varying K from 1 to 6: average head distance (y-axis: Distance, roughly 0.00 to 0.02) for K = 1, 2, 3, 6, plotted over ATSS, Faster-RCNN, PAA, GFL, FCOS, and RetinaNet.]
| Auxiliary head | #epochs | AP | AP50 | AP75 |
| --- | --- | --- | --- | --- |
| Baseline | 36 | 43.3 | 62.3 | 47.1 |
| RetinaNet [21] | 36 | 46.1 | 64.2 | 50.1 |
| Faster-RCNN [27] | 36 | 46.3 | 64.7 | 50.5 |
| Mask-RCNN [14] | 36 | 46.5 | 65.0 | 50.6 |
| FCOS [32] | 36 | 46.5 | 64.8 | 50.7 |
| PAA [17] | 36 | 46.5 | 64.6 | 50.7 |
| GFL [19] | 36 | 46.5 | 65.0 | 51.0 |
| ATSS [41] | 36 | 46.8 | 65.1 | 51.5 |

Table 8. Performance of our approach with various auxiliary one-to-many heads on COCO val.

| aux head | pos queries | #epochs | AP | AP50 | AP75 |
| --- | --- | --- | --- | --- | --- |
| ✗ | ✗ | 12 | 37.1 | 55.5 | 40.0 |
| ✗ | ✗ | 36 | 43.3 | 62.3 | 47.1 |
| ✓ | ✗ | 12 | 41.6 (+4.5) | 59.8 | 45.6 |
| ✓ | ✗ | 36 | 46.2 (+2.9) | 64.7 | 50.9 |
| ✗ | ✓ | 12 | 40.5 (+3.4) | 58.8 | 44.4 |
| ✗ | ✓ | 36 | 45.3 (+2.0) | 63.5 | 49.8 |
| ✓ | ✓ | 12 | 42.3 (+5.2) | 60.5 | 46.1 |
| ✓ | ✓ | 36 | 46.8 (+3.5) | 65.1 | 51.5 |

Table 9. "aux head" denotes training with an auxiliary head and "pos queries" means the customized positive queries generation.
Criteria for choosing auxiliary heads. We further delve into the criteria for choosing auxiliary heads in Tables 7 and 8. The results in Table 8 reveal that any auxiliary head with one-to-many label assignments consistently improves the baseline, and ATSS achieves the best performance. We find the accuracy continues to increase as K increases when K is smaller than 3. It is worth noting that performance degradation occurs when K = 6, and we speculate that severe conflicts among the auxiliary heads cause this. If the feature learning is inconsistent across the auxiliary heads, the continuous improvement as K becomes larger is destroyed. We also analyze the optimization consistency of multiple heads next and in the supplementary materials. In summary, we can choose any head as the auxiliary head, and we regard ATSS and Faster-RCNN as the common practice to achieve the best performance when K ≤ 2. We do not use too many different heads, e.g., 6 different heads, to avoid optimization conflicts.

Conflicts analysis. Conflicts emerge when the same spatial coordinate is assigned to different foreground boxes or treated as background in different auxiliary heads, which can confuse the training of the detector. We first define the distance between head $\mathcal{H}_i$ and head $\mathcal{H}_j$, and the average distance of $\mathcal{H}_i$, to measure the optimization conflicts as:

$S_{i,j} = \frac{1}{|D|}\sum_{I \in D} \mathrm{KL}(\mathcal{C}(\mathcal{H}_i(I)), \mathcal{C}(\mathcal{H}_j(I))),$   (9)

$S_i = \frac{1}{2(K-1)}\sum_{j \neq i}(S_{i,j} + S_{j,i}),$   (10)

where KL, D, I, C refer to the KL divergence, the dataset, the input image, and class activation maps (CAM) [29]. As illustrated in Figure 6, we compute the average distances among auxiliary heads for K > 1 and the distance between the DETR head and the single auxiliary head for K = 1. We find the distance metric is insignificant for each auxiliary head when K = 1, and this observation is consistent with our results in Table 8: the DETR head can be collaboratively improved with any head when K = 1. When K is increased to 2, the distance metrics increase slightly and our method achieves the best performance, as shown in Table 7. The distance surges when K is increased from 3 to 6, indicating that severe optimization conflicts among these auxiliary heads lead to a decrease in performance. However, the baseline with 6 ATSS heads achieves 49.5% AP and can be decreased to 48.9% AP by replacing them with 6 different heads. Accordingly, we speculate that too many diverse auxiliary heads, e.g., more than 3 different heads, exacerbate the conflicts. In summary, optimization conflicts are influenced by the number of different auxiliary heads and the relations among these heads.

Should the added heads be different? Collaborative training with two ATSS heads (49.2% AP) still improves the model with one ATSS head (48.7% AP), as ATSS is complementary to the DETR head in our analysis. Besides, introducing a diverse and complementary auxiliary head rather than the same one as the original head, e.g., Faster-RCNN, can bring better gains (49.5% AP).
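The conflict metric of Eqs. (9)-(10) can be sketched as follows: per-image class activation maps from two heads are normalized into distributions, compared with a KL divergence, averaged over the dataset, and then averaged over head pairs. CAM extraction is abstracted away and the helper names are illustrative assumptions.

```python
import torch

def cam_kl(cam_a, cam_b, eps=1e-8):
    """KL divergence between two class activation maps of the same image."""
    p = cam_a.flatten() + eps
    q = cam_b.flatten() + eps
    p, q = p / p.sum(), q / q.sum()
    return torch.sum(p * (p.log() - q.log()))

def pairwise_distance(cams_i, cams_j):
    """Eq. (9): S_{i,j} averaged over the dataset D (lists of per-image CAMs)."""
    return sum(cam_kl(a, b) for a, b in zip(cams_i, cams_j)) / len(cams_i)

def average_distance(all_cams, i):
    """Eq. (10): S_i = (1 / (2(K-1))) * sum over j != i of (S_{i,j} + S_{j,i})."""
    K = len(all_cams)
    total = sum(pairwise_distance(all_cams[i], all_cams[j])
                + pairwise_distance(all_cams[j], all_cams[i])
                for j in range(K) if j != i)
    return total / (2 * (K - 1))

# toy example: CAMs of 3 heads on a 2-image "dataset"
heads = [[torch.rand(32, 32) for _ in range(2)] for _ in range(3)]
print(average_distance(heads, i=0))
```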
[Table 10. Columns: Method, K, #epochs, GPU hours, AP.]
Table 11. Collaborative training consistently improves performances of all branches on Deformable-DETR++ with ResNet-50.

Note that this is not contradictory to the above conclusion; instead, we can obtain the best performance with a few different heads (K ≤ 2), as the conflicts are insignificant, but we face severe conflicts when using many different heads (K > 3).

The effect of each component. We perform a component-wise ablation to thoroughly analyze the effect of each component in Table 9. Incorporating the auxiliary head yields significant gains since the dense spatial supervision makes the encoder features more discriminative. Alternatively, introducing customized positive queries also contributes remarkably to the final results, while improving the training efficiency of the one-to-one set matching. Both techniques accelerate convergence and improve performance. In summary, we observe that the overall improvements stem from more discriminative features for the encoder and more efficient attention learning for the decoder.

Comparisons to the longer training schedule. As presented in Table 10, we find Deformable-DETR cannot benefit from longer training as the performance saturates. On the contrary, Co-DETR greatly accelerates the convergence as well as increasing the peak performance.

Performance of auxiliary branches. Surprisingly, we observe Co-DETR also brings consistent gains for the auxiliary heads in Table 11. This implies our training paradigm contributes to more discriminative encoder representations, which improves the performance of both the decoder and the auxiliary heads.

Difference in distribution of original and customized positive queries. We visualize the positions of original positive queries and customized positive queries in Figure 7a. We only show one object (green box) per image. Positive queries assigned by Hungarian matching in the decoder are marked in red. We mark positive queries extracted from Faster-RCNN and ATSS in blue and orange, respectively. These customized queries are distributed around the center region of the instance and provide sufficient supervision signals for the detector.

Does distribution difference lead to instability? We compute the average distance between original and customized queries in Figure 7b. The average distance between original negative queries and customized positive queries is significantly larger than the distance between original and customized positive queries. As this distribution gap between original and customized queries is marginal, there is no instability encountered during training.

5. Conclusions

In this paper, we present a novel collaborative hybrid assignments training scheme, namely Co-DETR, to learn more efficient and effective DETR-based detectors from versatile label assignment manners. This new training scheme can easily enhance the encoder's learning ability in end-to-end detectors by training multiple parallel auxiliary heads supervised by one-to-many label assignments. In addition, we conduct extra customized positive queries by extracting the positive coordinates from these auxiliary heads to improve the training efficiency of positive samples in the decoder. Extensive experiments on the COCO dataset demonstrate the efficiency and effectiveness of Co-DETR. Surprisingly, incorporated with a ViT-L backbone, we achieve 66.0% AP on COCO test-dev and 67.9% AP on LVIS val, establishing the new state-of-the-art detector with a much smaller model size.

References

[1] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. ArXiv, abs/2005.12872, 2020. 1, 3, 4
[2] Fangyi Chen, Han Zhang, Kai Hu, Yu-kai Huang, Chenchen Zhu, and Marios Savvides. Enhanced training of query-based object detection via selective query recollection. arXiv preprint arXiv:2212.07593, 2022. 5
[3] Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, et al. Hybrid task cascade for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4974–4983, 2019. 7
[4] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, et al. Mmdetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019. 6
[5] Qiang Chen, Xiaokang Chen, Gang Zeng, and Jingdong Wang. Group detr: Fast training convergence with decoupled one-to-many label assignment. arXiv preprint arXiv:2207.13085, 2022. 2, 3, 7
[6] Qiang Chen, Jian Wang, Chuchu Han, Shan Zhang, Zexian Li, Xiaokang Chen, Jiahui Chen, Xiaodi Wang, Shuming Han, Gang Zhang, et al. Group detr v2: Strong object detector with encoder-decoder pretraining. arXiv preprint arXiv:2211.03594, 2022. 7
[7] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ArXiv, abs/2010.11929, 2021. 6, 7
[8] Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva-02: A visual representation for neon genesis. arXiv preprint arXiv:2303.11331, 2023. 2, 6, 7
[9] Ziteng Gao, Limin Wang, Bing Han, and Sheng Guo. Adamixer: A fast-converging query-based object detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5364–5373, 2022. 7
[10] Golnaz Ghiasi, Yin Cui, Aravind Srinivas, Rui Qian, Tsung-Yi Lin, Ekin D Cubuk, Quoc V Le, and Barret Zoph. Simple copy-paste is a strong data augmentation method for instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2918–2928, 2021. 6
[11] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015. 1
[12] Agrim Gupta, Piotr Dollar, and Ross Girshick. Lvis: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5356–5364, 2019. 5
[13] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000–16009, 2022. 7
[14] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017. 1, 8, 13
[15] Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016. 4
[16] Ding Jia, Yuhui Yuan, Haodi He, Xiaopei Wu, Haojun Yu, Weihong Lin, Lei Sun, Chao Zhang, and Han Hu. Detrs with hybrid matching. arXiv preprint arXiv:2207.13080, 2022. 3, 6, 7, 12
[17] Kang Kim and Hee Seok Lee. Probabilistic anchor assignment with iou prediction for object detection. In European Conference on Computer Vision, pages 355–371. Springer, 2020. 1, 3, 8, 13
[18] Feng Li, Hao Zhang, Shilong Liu, Jian Guo, Lionel M Ni, and Lei Zhang. Dn-detr: Accelerate detr training by introducing query denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13619–13627, 2022. 3, 5, 7
[19] Xiang Li, Wenhai Wang, Lijun Wu, Shuo Chen, Xiaolin Hu, Jun Li, Jinhui Tang, and Jian Yang. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. Advances in Neural Information Processing Systems, 33:21002–21012, 2020. 6, 8, 13
[20] Yanghao Li, Hanzi Mao, Ross Girshick, and Kaiming He. Exploring plain vision transformer backbones for object detection. arXiv preprint arXiv:2203.16527, 2022. 3, 7
[21] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017. 1, 2, 4, 8, 13
[22] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014. 5
[23] Shilong Liu, Feng Li, Hao Zhang, Xiao Yang, Xianbiao Qi, Hang Su, Jun Zhu, and Lei Zhang. Dab-detr: Dynamic anchor boxes are better queries for detr. arXiv preprint arXiv:2201.12329, 2022. 2, 3, 6, 7
[24] Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12009–12019, 2022. 7
[25] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, S. Lin, and B. Guo. Swin transformer: Hierarchical vision transformer using shifted windows. ArXiv, abs/2103.14030, 2021. 2, 6, 7
[26] Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, and Jingdong Wang. Conditional detr for fast training convergence. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3651–3660, 2021. 6, 7
[27] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28, 2015. 1, 2, 4, 8, 13
[28] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 658–666, 2019. 13
[29] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pages 618–626, 2017. 8, 13
[30] Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A large-scale, high-quality dataset for object detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8430–8439, 2019. 6
[31] Guanglu Song, Yu Liu, and Xiaogang Wang. Revisiting the sibling head in object detector. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11563–11572, 2020. 1
[32] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. Fcos: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9627–9636, 2019. 1, 2, 4, 8, 13
[33] Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, et al. Image as a foreign language: Beit pretraining for all vision and vision-language tasks. arXiv preprint arXiv:2208.10442, 2022. 7
[34] Wenhai Wang, Jifeng Dai, and Zhe Chen. Internimage: Exploring large-scale vision foundation models with deformable convolutions. arXiv preprint arXiv:2211.05778, 2022. 7
[35] Yingming Wang, Xiangyu Zhang, Tong Yang, and Jian Sun. Anchor detr: Query design for transformer-based detector. In Proceedings of the AAAI conference on artificial intelligence, volume 36, pages 2567–2575, 2022. 7
[36] Yixuan Wei, Han Hu, Zhenda Xie, Zheng Zhang, Yue Cao, Jianmin Bao, Dong Chen, and Baining Guo. Contrastive learning rivals masked image modeling in fine-tuning via feature distillation. arXiv preprint arXiv:2205.14141, 2022. 7
[37] Zeyue Xue, Jianming Liang, Guanglu Song, Zhuofan Zong, Liang Chen, Yu Liu, and Ping Luo. Large-batch optimization for dense visual predictions. In Advances in Neural Information Processing Systems, 2022. 1
[38] Jianwei Yang, Chunyuan Li, and Jianfeng Gao. Focal modulation networks. arXiv preprint arXiv:2203.11926, 2022. 7
[39] Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M Ni, and Heung-Yeung Shum. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv preprint arXiv:2203.03605, 2022. 2, 3, 6, 7
[40] Haotian Zhang, Pengchuan Zhang, Xiaowei Hu, Yen-Chun Chen, Liunian Li, Xiyang Dai, Lijuan Wang, Lu Yuan, Jenq-Neng Hwang, and Jianfeng Gao. Glipv2: Unifying localization and vision-language understanding. Advances in Neural Information Processing Systems, 35:36067–36080, 2022. 7
[41] Shifeng Zhang, Cheng Chi, Yongqiang Yao, Zhen Lei, and Stan Z Li. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9759–9768, 2020. 1, 2, 3, 4, 8, 13
[42] Xingyi Zhou, Vladlen Koltun, and Philipp Krähenbühl. Probabilistic two-stage detection. arXiv preprint arXiv:2103.07461, 2021. 7
[43] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020. 2, 6, 7
[44] Zhuofan Zong, Qianggang Cao, and Biao Leng. Rcnet: Reverse feature pyramid and cross-scale shift network for object detection. In Proceedings of the 29th ACM International Conference on Multimedia, pages 5637–5645, 2021. 1
DETRs with Collaborative Hybrid Assignments Training
Supplementary Material
| #convs | 0 | 1 | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- | --- |
| AP | 41.8 | 42.3 | 41.9 | 42.1 | 42.3 | 42.0 |

Table 12. Influence of the number of convolutions in the auxiliary head.

[Figure 8. CAM KL divergence distances for K = 2 over the DETR, ATSS, and Faster-RCNN heads; recovered rows: DETR (0.0000, 0.0078, 0.0085), ATSS (0.0073, 0.0000, 0.0062), Faster-RCNN (0.0079, 0.0059, 0.0000).]

Figure 9. Distances among 7 various heads in our model with K = 6. [CAM KL divergence over the DETR, ATSS, Faster-RCNN, PAA, GFL, FCOS, and RetinaNet heads; recovered rows: DETR (0.0000, 0.0062, 0.0090, 0.0129, 0.0079, 0.0069, 0.0083), ATSS (0.0066, 0.0000, 0.0097, 0.0132, 0.0077, 0.0045, 0.0083).]

The number of customized positive queries. We compute the average ratio of positive samples in one-to-many label assignment to the ground-truth boxes. For instance, the ratio is 18.7 for Faster-RCNN and 8.8 for ATSS on the COCO dataset, indicating that more than 8× extra positive queries are introduced when K = 1.

Effectiveness of collaborative one-to-many label assignments. To verify the effectiveness of our feature learning

{λ1, λ2} to {1.0, 2.0} by default.

where KL, D, I, C refer to the KL divergence, the dataset, the input image, and class activation maps (CAM) [29]. In our implementation, we choose the validation set COCO val as D and Grad-CAM as C. We use the output features of the DETR encoder to compute the CAM maps. More specifically, we show the detailed distances for K = 2 and K = 6 in Figure 8 and Figure 9, respectively. A larger distance metric S_{i,j} indicates that H_i is less consistent with H_j and contributes to the optimization inconsistency.

16 epochs, where the batch size is set to 64, and the initial learning rate is set to 5 × 10^{-5}, which is reduced by a factor of 0.1 at the 9-th and 15-th epochs.