Incremental-DETR: Incremental Few-Shot Object Detection via Self-Supervised Learning
Na Dong1,2*, Yongqiang Zhang2, Mingli Ding2, Gim Hee Lee1
1 Department of Computer Science, National University of Singapore
2 School of Instrument Science and Engineering, Harbin Institute of Technology
{dongna1994, zhangyongqiang, dingml}@hit.edu.cn, [email protected]
…accessible when the novel classes are introduced in incremental few-shot object detection. However, the training data of the base classes is still accessible when training the novel model in few-shot object detection, which makes it easier for the novel model to maintain the performance of the base classes.

Incremental Few-Shot Object Detection. Incremental few-shot object detection is first explored in ONCE (Perez-Rua et al. 2020), which uses CenterNet (Zhou, Wang, and Krähenbühl 2019) as a backbone to learn a class-agnostic feature extractor and a per-class code generator network for the novel classes. CenterNet (Zhou, Wang, and Krähenbühl 2019) is well-suited for incremental few-shot object detection since a separate heatmap, which makes independent detections by activation thresholding, is maintained for each individual object class. MTFA (Ganea, Boom, and Poppe 2021) extends TFA (Wang et al. 2020) in a similar way as Mask R-CNN extends Faster R-CNN. MTFA is then adapted into incremental MTFA (iMTFA) by learning discriminative embeddings for object instances that are merged into class representatives. In this paper, we do incremental few-shot object detection on the DETR object detector. However, DETR does not have a separate heatmap for each class and cannot be easily decoupled into independent detections for each class. Furthermore, DETR is a deeper network, and thus significantly more training data and computation time are needed for convergence. Since we only have access to a few samples of the novel classes in the incremental few-shot setting and the data of the base classes is no longer accessible, DETR is more prone to catastrophically forgetting the base classes due to the absence of the base data and to overfitting to the novel classes due to the novel data scarcity.

Problem Definition

Let (x, y) ∈ D denote a dataset D which contains images x and their corresponding ground truth sets of objects y. We further denote the training datasets of the base classes and the novel classes as D_base and D_novel, respectively. Following the definition of class-incremental learning, we only have access to the novel class data D_novel, where y_novel ∈ {C_{B+1}, . . . , C_{B+N}}. The base class data D_base, where y_base ∈ {C_1, . . . , C_B}, is no longer accessible when the novel classes are introduced. D_base and D_novel have no overlapping classes, i.e. {C_1, . . . , C_B} ∩ {C_{B+1}, . . . , C_{B+N}} = ∅. The training dataset D_base has abundant annotated instances of {C_1, . . . , C_B}. In contrast, D_novel has very few annotated instances of {C_{B+1}, . . . , C_{B+N}} and is often described as an N-way K-shot training set, where there are N novel classes and each novel class has K annotated object instances. The goal of incremental few-shot object detection is to continually learn the novel classes {C_{B+1}, . . . , C_{B+N}} from only a few training examples without forgetting knowledge of the base classes {C_1, . . . , C_B}.
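To make the N-way K-shot setup concrete, the sketch below (not from the paper) subsamples a K-shot novel training set from COCO-style annotations; the annotation dictionary keys and the fixed random seed are assumptions for illustration only.

```python
import random
from collections import defaultdict


def build_k_shot_set(annotations, novel_classes, k, seed=0):
    """Build an N-way K-shot novel set D_novel: keep (up to) K annotated
    object instances for each of the N novel classes.

    `annotations` is assumed to be a list of COCO-style dicts with
    'image_id' and 'category_id' keys (an illustrative assumption).
    """
    rng = random.Random(seed)
    per_class = defaultdict(list)
    for ann in annotations:
        if ann["category_id"] in novel_classes:
            per_class[ann["category_id"]].append(ann)

    k_shot = []
    for cls in novel_classes:
        candidates = per_class[cls]
        # Sample exactly K instances when enough are available.
        k_shot.extend(rng.sample(candidates, min(k, len(candidates))))
    return k_shot
```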
Our Methodology

Base Model Training

As shown in Figure 1, we split the first stage of our Incremental-DETR into base model pre-training and fine-tuning. In base model pre-training, we train the whole model only on the base classes {C_1, . . . , C_B} with the same loss functions used in DETR. Following DETR, we denote the set of M predictions for the base classes as ŷ = {ŷ_i}_{i=1}^{M} = {(ĉ_i, b̂_i)}_{i=1}^{M}, and the ground truth set of objects as y = {y_i}_{i=1}^{M} = {(c_i, b_i)}_{i=1}^{M} padded with ∅ (no object). For each element i of the ground truth set, c_i is the target class label (which may be ∅) and b_i ∈ [0, 1]^4 is a 4-vector that defines the ground truth bounding box center coordinates and its height and width relative to the image size. We adopt a pair-wise matching cost L_match(y_i, ŷ_σ(i)) between the ground truth y_i and a prediction ŷ_σ(i) with index σ(i) to search for a bipartite matching with the lowest cost:

$$\hat{\sigma} = \arg\min_{\sigma} \sum_{i=1}^{M} L_{match}(y_i, \hat{y}_{\sigma(i)}) \qquad (1)$$

The matching cost takes into account both the class prediction ĉ_σ(i) and the similarity between the predicted box b̂_σ(i) and the ground truth box b_i. Specifically, it is defined as:

$$L_{match}(y_i, \hat{y}_{\sigma(i)}) = \mathbb{1}_{\{c_i \neq \varnothing\}} L_{cls}(c_i, \hat{c}_{\sigma(i)}) + \mathbb{1}_{\{c_i \neq \varnothing\}} L_{box}(b_i, \hat{b}_{\sigma(i)}) \qquad (2)$$

Given the above definitions, the Hungarian loss (Kuhn 1955) for all pairs matched in the previous step is defined as:

$$L_{hg}(y, \hat{y}) = \sum_{i=1}^{M} \big[ L_{cls}(c_i, \hat{c}_{\hat{\sigma}(i)}) + \mathbb{1}_{\{c_i \neq \varnothing\}} L_{box}(b_i, \hat{b}_{\hat{\sigma}(i)}) \big] \qquad (3)$$

where σ̂ denotes the optimal assignment between predictions and targets. L_cls is the sigmoid focal loss (Lin et al. 2017b). L_box is a linear combination of the ℓ1 loss and the generalized IoU loss (Rezatofighi et al. 2019) with the same weight hyperparameters as DETR.
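Equations (1)-(3) follow DETR's set-based matching. As a minimal illustration (not the authors' implementation), the sketch below builds a pair-wise cost matrix with a simplified negative-probability class cost in place of the focal-style cost, plus ℓ1 and generalized IoU box costs, and solves the assignment of Eq. (1) with the Hungarian algorithm; the cost weights are the usual DETR defaults and are assumptions here.

```python
import torch
from scipy.optimize import linear_sum_assignment
from torchvision.ops import generalized_box_iou, box_convert


def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes,
                    w_cls=2.0, w_l1=5.0, w_giou=2.0):
    """Match M predictions to the ground truth objects of one image.

    pred_logits: (M, C) raw class logits.
    pred_boxes:  (M, 4) predicted boxes in normalized (cx, cy, w, h).
    gt_labels:   (G,)   ground truth class indices (long tensor).
    gt_boxes:    (G, 4) ground truth boxes in normalized (cx, cy, w, h).
    The weights w_cls / w_l1 / w_giou follow common DETR defaults
    (assumed here, not taken from the paper).
    """
    prob = pred_logits.sigmoid()                       # (M, C)
    cost_cls = -prob[:, gt_labels]                     # (M, G): higher prob -> lower cost
    cost_l1 = torch.cdist(pred_boxes, gt_boxes, p=1)   # (M, G): l1 distance between boxes
    giou = generalized_box_iou(
        box_convert(pred_boxes, "cxcywh", "xyxy"),
        box_convert(gt_boxes, "cxcywh", "xyxy"))
    cost_giou = -giou                                  # (M, G)
    cost = w_cls * cost_cls + w_l1 * cost_l1 + w_giou * cost_giou
    # The Hungarian algorithm returns the assignment with the lowest total cost (Eq. 1).
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return torch.as_tensor(pred_idx), torch.as_tensor(gt_idx)
```

The matched predictions are then supervised with L_cls and L_box as in Eq. (3), while the remaining, unmatched predictions are trained towards the ∅ (no object) class.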
To make the parameters of the class-specific components of DETR generalize well to the novel classes with only a few samples in the incremental few-shot fine-tuning stage, we propose to fine-tune the class-specific components in a self-supervised way while keeping the class-agnostic components frozen during the base model fine-tuning of the first stage, where the parameters of the model are initialized from the pre-trained base model. The base model fine-tuning relies on making predictions on other potential class-agnostic objects in the base data along with the ground truth objects of the base classes. Inspired by R-CNN (Girshick et al. 2014a) and Fast R-CNN (Girshick 2015), we use the selective search algorithm (Uijlings et al. 2013) to generate a set of class-agnostic object proposals for each of the raw images. Selective search is a well-established and very effective unsupervised object proposal generation algorithm which uses color similarity, texture similarity, size of region and fit between regions to generate object proposals. However, the number of object proposals is large and the ranking is not precise. To circumvent this problem, we use the object ranking from the selective search algorithm to prune away imprecise object proposals. Specifically, we select the top O objects in the ranking list that also do not overlap with the ground truth objects of the base classes as the pseudo ground truth object proposals. We represent the pseudo ground truth object proposals as b′. To make predictions on these selected object proposals, we follow the prediction head of DETR and introduce a new class label c′ for all the selected object proposals as the pseudo ground truth label along with the labels of the base classes. Specifically, c′ depends on the number of categories in the different sets: if there are n categories, we set c′ to n + 1 along with the ground truth labels (1, . . . , n).
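The pseudo ground truth generation described above can be sketched as follows, assuming OpenCV's selective search implementation (opencv-contrib-python) and torchvision's IoU utility; the value of O and the strict zero-overlap test are illustrative assumptions rather than the paper's exact settings.

```python
import cv2
import numpy as np
import torch
from torchvision.ops import box_iou


def pseudo_proposals(image_bgr, gt_boxes_xyxy, top_o=30, iou_thresh=0.0):
    """Keep the top-O selective search proposals that do not overlap the
    base-class ground truth boxes (a sketch of the pseudo ground truth b').

    `gt_boxes_xyxy` is a (G, 4) float tensor in (x1, y1, x2, y2) format.
    """
    ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    ss.setBaseImage(image_bgr)
    ss.switchToSelectiveSearchFast()
    rects = ss.process()                               # (R, 4) as (x, y, w, h), ranked
    boxes = np.stack([rects[:, 0], rects[:, 1],
                      rects[:, 0] + rects[:, 2],
                      rects[:, 1] + rects[:, 3]], axis=1).astype(np.float32)
    boxes = torch.from_numpy(boxes)                    # (R, 4) in (x1, y1, x2, y2)

    keep = []
    for box in boxes:                                  # keep the selective search ranking order
        if len(gt_boxes_xyxy) == 0 or box_iou(box[None], gt_boxes_xyxy).max() <= iou_thresh:
            keep.append(box)
        if len(keep) == top_o:
            break
    return torch.stack(keep) if keep else boxes.new_zeros((0, 4))
```

Each selected box b′ would then be assigned the extra pseudo label c′ = n + 1 alongside the n base-class labels when forming the targets for the matching in Eq. (4) below.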
Let us now denote the pseudo ground truth set of the selected object proposals as y′ = {y′_i}_{i=1}^{P} = {(c′_i, b′_i)}_{i=1}^{P} padded with ∅ (no object), and the set of P predictions as ŷ = {ŷ_i}_{i=1}^{P} = {(ĉ_i, b̂_i)}_{i=1}^{P}. The same pair-wise matching cost L_match(y′_i, ŷ_σ′(i)) between a pseudo ground truth y′_i and a prediction ŷ_σ′(i) with index σ′(i) is also adopted to search for a bipartite matching with the lowest cost. Finally, the Hungarian loss for all the matched pairs is defined as:

$$L_{hg}(y', \hat{y}) = \sum_{i=1}^{P} \big[ L_{cls}(c'_i, \hat{c}_{\hat{\sigma}'(i)}) + \mathbb{1}_{\{c'_i \neq \varnothing\}} L_{box}(b'_i, \hat{b}_{\hat{\sigma}'(i)}) \big] \qquad (4)$$

where σ̂′ denotes the optimal assignment between the predictions and the pseudo ground truths. L_cls is the sigmoid focal loss which does binary classification between the selected object proposals and the background. L_box is a linear combination of the ℓ1 loss and the generalized IoU loss with the same weight hyperparameters as DETR.

The overall loss L^base_total to fine-tune the base model on the abundant base data D_base is given by:

$$L^{base}_{total} = L_{hg}(y, \hat{y}) + \lambda' L_{hg}(y', \hat{y}) \qquad (5)$$

where λ′ is the hyperparameter to balance the loss terms.

Figure 2: Overview of our proposed incremental few-shot fine-tuning stage. Parameters of the modules shaded in green are frozen during training. Refer to the text for more details.
Incremental Few-Shot Fine-Tuning

As shown in Figure 2, we first initialize the parameters of the novel model from the fine-tuned base model of the first stage. We then propose to fine-tune the class-specific projection layer and classification head with a few samples of the novel classes while keeping the class-agnostic components frozen. However, the constant updating of the projection layer and classification head in the process of novel class learning can aggravate catastrophic forgetting of the base classes. Therefore, we propose to use knowledge distillation to mitigate catastrophic forgetting. Specifically, the base model is utilized to prevent the projection layer output features of the novel model from deviating too much from the projection layer output features of the base model. However, a direct knowledge distillation on the full features causes conflicts and thus hurts the performance of the novel classes. We thus use the ground truth bounding boxes of the novel classes as a binary mask mask^novel (1 within ground truth boxes, 0 otherwise) to prevent negative influence on the novel class learning from the features of the base model. The distillation loss with the mask on the features is written as:

$$L^{kd}_{feat} = \frac{1}{2N^{novel}} \sum_{i=1}^{w} \sum_{j=1}^{h} \sum_{k=1}^{c} (1 - mask^{novel}_{ij}) \left\| f^{novel}_{ijk} - f^{base}_{ijk} \right\|^2 \qquad (6)$$

where N^novel = Σ_{i=1}^{w} Σ_{j=1}^{h} (1 − mask^novel_ij). f^base and f^novel denote the features of the base and novel models, respectively. w, h and c are the width, height and number of channels of the feature.
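A minimal PyTorch sketch of the masked feature distillation in Eq. (6) is given below; it assumes a single projection-layer feature map of shape (C, H, W) and novel-class boxes already scaled to that feature resolution, which are assumptions for illustration.

```python
import torch


def masked_feature_distillation(f_novel, f_base, novel_gt_boxes_xyxy):
    """Masked feature distillation of Eq. (6): distill the base model's
    projection-layer features only outside the novel-class ground truth boxes.

    f_novel, f_base:     (C, H, W) feature maps from the novel / base models.
    novel_gt_boxes_xyxy: (G, 4) boxes in feature-map coordinates (assumed).
    """
    c, h, w = f_novel.shape
    mask = torch.zeros(h, w, device=f_novel.device)      # mask^novel: 1 inside novel GT boxes
    for x1, y1, x2, y2 in novel_gt_boxes_xyxy.long().tolist():
        mask[y1:y2, x1:x2] = 1.0

    keep = 1.0 - mask                                     # distill only where keep == 1
    n_novel = keep.sum().clamp(min=1.0)                   # N^novel in Eq. (6)
    diff = (f_novel - f_base) ** 2                        # squared per-element difference
    loss = (keep.unsqueeze(0) * diff).sum() / (2.0 * n_novel)
    return loss
```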
For the knowledge distillation on the classification head of DETR, we first select prediction outputs from the M prediction outputs of the base model as pseudo ground truths of the base classes. Specifically, for an input novel image, we consider a prediction output of the base model as a pseudo ground truth of the base classes when its class probability is greater than 0.5 and its bounding box has no overlap with the ground truth bounding boxes of the novel classes. We then adopt a pair-wise matching cost to find the bipartite matching between the pseudo ground truths and the predictions of the novel model. Subsequently, the classification outputs of the base and novel models are compared in the distillation loss function given by:

$$L^{kd}_{cls} = L_{kl\_div}(\log(q^{novel}), q^{base}) \qquad (7)$$

where we follow (Hinton, Vinyals, and Dean 2015) in the definition of the KL-divergence loss L_kl_div between the class probabilities of the novel and base models. q denotes the class probability.

The Hungarian loss L_hg is also applied to the ground truth set y and the predictions ŷ of the novel data D_novel. The overall loss L^novel_total to train the novel model on the novel data D_novel is given by:

$$L^{novel}_{total} = L_{hg}(y, \hat{y}) + \lambda_{feat} L^{kd}_{feat} + \lambda_{cls} L^{kd}_{cls} \qquad (8)$$

where λ_feat and λ_cls are hyperparameters to balance the loss terms.
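The classification-head distillation of Eq. (7) and the overall objective of Eq. (8) can be sketched as follows. Since DETR uses sigmoid (focal) classification, the renormalization of the sigmoid scores into distributions for the KL term is an assumption here, and the λ values follow the implementation details reported later.

```python
import torch.nn.functional as F


def classification_distillation(novel_logits, base_logits):
    """KL-divergence distillation of Eq. (7) between the class probabilities
    of the novel and base models on the matched query pairs (a sketch)."""
    q_novel = novel_logits.sigmoid()
    q_base = base_logits.sigmoid()
    # Renormalize sigmoid scores into distributions (illustrative assumption).
    q_novel = q_novel / q_novel.sum(dim=-1, keepdim=True)
    q_base = q_base / q_base.sum(dim=-1, keepdim=True)
    # PyTorch's kl_div expects log-probabilities as the first argument, cf. Eq. (7).
    return F.kl_div(q_novel.log(), q_base, reduction="batchmean")


def novel_total_loss(l_hg, l_kd_feat, l_kd_cls, lambda_feat=0.1, lambda_cls=2.0):
    """Overall loss of Eq. (8); the default lambda values follow the paper's
    implementation details."""
    return l_hg + lambda_feat * l_kd_feat + lambda_cls * l_kd_cls
```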
Experiments

Experimental Setup

Incremental object detection. We first evaluate the performance of our proposed method on the incremental setting studied in (Shmelkov, Schmid, and Alahari 2017; Kj et al. 2021). We conduct the evaluation on the popular object detection benchmark MS COCO 2017 (Lin et al. 2014), which covers 80 object classes. The train set of COCO serves as training data and the val set serves as testing data. Standard evaluation metrics for COCO are adopted. Following (Shmelkov, Schmid, and Alahari 2017; Kj et al. 2021), we report results on the incremental learning setting of adding a group of novel classes (40+40). Furthermore, results on the setting of adding one novel class (40+1) are also reported to prove the effectiveness of our method.

Incremental few-shot object detection. We follow the data setups of prior works on incremental few-shot object detection, e.g. (Perez-Rua et al. 2020). Specifically, we conduct the experimental evaluations on two widely used object detection benchmarks: MS COCO 2017 and PASCAL VOC 2007 (Everingham et al. 2010). The train set of COCO serves as training data and the val set serves as testing data. The trainval set of VOC serves as training data and the test set serves as testing data. Standard evaluation metrics for COCO are adopted. COCO contains objects from 80 different classes, including 20 classes that intersect with VOC. We adopt the 20 shared classes as novel classes and the remaining 60 classes as base classes. Following (Perez-Rua et al. 2020), there are two dataset splits: same-dataset on COCO and cross-dataset with 60 COCO classes as base classes. For the same-dataset evaluation on COCO, the model is pre-trained and fine-tuned on the COCO base training data. The novel class learning is then conducted on the COCO novel training data and evaluated on the COCO testing data. We use the same setup as above for the cross-dataset evaluation from COCO to VOC, where the model is pre-trained and fine-tuned on the COCO base training data. However, the incremental few-shot fine-tuning stage is conducted on the VOC training data and evaluated on the VOC testing data. We evaluate incremental few-shot object detection with 1, 5 and 10 shots per novel class, and the base training data is no longer accessible during novel class learning, following the definition of class-incremental learning.

Implementation Details

We use ResNet-50 as the feature extractor. The network architectures and hyperparameters of the transformer encoder and decoder remain the same as Deformable DETR (Zhu et al. 2020). λ′, λ_feat and λ_cls are set to 1, 0.1 and 2, respectively. We report the results of our proposed method with single-scale features. The training is carried out on 8 RTX 6000 GPUs with a batch size of 2 per GPU. In the base model pre-training stage, we train our model using the AdamW optimizer with an initial learning rate of 2 × 10^−4 and a weight decay of 1 × 10^−4. We train our model for 50 epochs, and the learning rate is decayed at the 40th epoch by a factor of 0.1. In the base model fine-tuning stage, the model is initialized from the pre-trained base model. The parameters of the projection layer and classification head are then fine-tuned while keeping the other parameters frozen. We fine-tune the model for 1 epoch with a learning rate of 2 × 10^−4. We apply the same setting as the base model pre-training stage in the incremental few-shot fine-tuning stage.
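As a concrete illustration of the fine-tuning recipe above, the sketch below freezes the class-agnostic components and builds an AdamW optimizer over only the class-specific projection layer and classification head; the attribute names input_proj and class_embed mirror common Deformable DETR implementations and are assumptions here, not the paper's code.

```python
import torch


def build_finetune_optimizer(model, lr=2e-4, weight_decay=1e-4):
    """Freeze the class-agnostic components and fine-tune only the
    class-specific projection layer and classification head, as in the base
    model fine-tuning and incremental few-shot fine-tuning stages (a sketch).
    """
    # Freeze everything first (CNN backbone, transformer, regression head, ...).
    for p in model.parameters():
        p.requires_grad_(False)

    # Unfreeze only the class-specific modules (assumed attribute names).
    trainable = []
    for module in (model.input_proj, model.class_embed):
        for p in module.parameters():
            p.requires_grad_(True)
            trainable.append(p)

    return torch.optim.AdamW(trainable, lr=lr, weight_decay=weight_decay)
```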
Table 1: Results of incremental few-shot object detection on the COCO val set (AP / AP50 in %).

Shot  Method                                 Base AP / AP50   Novel AP / AP50   All AP / AP50
Base  iMTFA (Ganea, Boom, and Poppe 2021)    38.2 / 58.0      - / -             - / -
Base  (Zhu et al. 2020)                      36.9 / 56.7      - / -             - / -
1     ONCE (Perez-Rua et al. 2020)           17.9 / -         0.7 / -           13.6 / -
1     iMTFA (Ganea, Boom, and Poppe 2021)    27.8 / 40.1      3.2 / 5.9         21.7 / 31.6
1     Ours                                   29.4 / 47.1      1.9 / 2.7         22.5 / 36.0
5     ONCE (Perez-Rua et al. 2020)           17.9 / -         1.0 / -           13.7 / -
5     iMTFA (Ganea, Boom, and Poppe 2021)    24.1 / 33.7      6.1 / 11.2        19.6 / 28.1
5     Ours                                   30.5 / 48.4      8.3 / 13.3        24.9 / 39.6
10    ONCE (Perez-Rua et al. 2020)           17.9 / -         1.2 / -           13.7 / -
10    iMTFA (Ganea, Boom, and Poppe 2021)    23.4 / 32.4      7.0 / 12.7        19.3 / 27.5
10    Ours                                   27.3 / 44.0      14.4 / 22.4       24.1 / 38.6
Incremental Object Detection

Addition of one class. Table 2 shows the results on the incremental learning setting of one novel class addition. We take the first 40 classes from MS COCO as the base classes, and the 41st class is used as the novel class. We report the average precision (AP) and the average precision at 0.5 IoU threshold (AP50) of the base, novel and all classes, respectively. "1-41" and "1-40" are baselines trained without using incremental learning. "41 (fine-tune)" and "41 (scratch)" are the model trained with and without initialization from the pre-trained base model, respectively. The results of "41 (scratch)" are significantly lower than "41 (fine-tune)" due to the scarcity of the novel data. Our approach achieves high results with AP and AP50 of 41.9% | 64.1%, 29.2% | 50.3%, 41.6% | 63.8% for the base, novel and all classes, respectively. Our results are almost on par with the results from baseline training on the novel class "41" with fine-tuning (Ours: 29.2% | 50.3% vs. "41 (fine-tune)": …).

Addition of a group of classes. We take the first 40 classes from MS COCO as the base classes, and the next 40 classes as the additional group of novel classes. "1-80" and "1-40" are baselines trained without using incremental learning. "41-80 (fine-tune)" and "41-80 (scratch)" are the model trained with and without initialization from the pre-trained base model, respectively. The performance of "41-80 (scratch)" is better than "41-80 (fine-tune)" due to the abundance of novel class data. Our method achieves results with AP and AP50 that are almost on par with the baseline training on base classes "1-40" without using incremental learning (Ours: 44.0% | 66.1% vs. "1-40": 44.8% | 67.8%). Additionally, our method significantly outperforms (Kj et al. 2021) on all classes (Ours: 37.3% | 56.6% vs. (Kj et al. 2021): 23.8% | 40.5%). Our method also outperforms (Shmelkov, Schmid, and Alahari 2017) under the incremental setting (Ours: 37.3% | 56.6% vs. (Shmelkov, Schmid, and Alahari 2017): 21.3% | 37.4%). These results show the effectiveness of our approach when a group of novel classes is added.

Class              Method             Base AP / AP50   Novel AP / AP50   All AP / AP50
1-80               (Zhu et al. 2020)  46.8 / 69.4      36.3 / 54.7       41.4 / 61.8
1-40               (Zhu et al. 2020)  44.8 / 67.8      - / -             - / -
41-80 (fine-tune)  (Zhu et al. 2020)  0 / 0            33.0 / 49.6       16.5 / 24.8
Incremental Few-Shot Object Detection

Table 1 shows the results of incremental few-shot object detection on the COCO val set, where all methods perform poorly on the novel classes due to the extreme data scarcity. Particularly, our method significantly outperforms iMTFA on the AP and AP50 for all classes, i.e. iMTFA: (21.7% | 31.6%, 19.6% | 28.1%, 19.3% | 27.5%) vs. Ours: (22.5% | 36.0%, 24.9% | 39.6%, 24.1% | 38.6%) under the 1, 5 and 10-shot settings, respectively. These results show the superiority of our method for incremental few-shot object detection.

Cross-dataset. We report the performance of incremental few-shot object detection in a cross-dataset setting from COCO to VOC to demonstrate the domain adaptation ability of our method, where the domain of the novel model training dataset is different from that of the base model training dataset. We only report the performance of the novel classes since there are no base classes in VOC. As shown in Table 6, our method achieves a much better result of 16.6% AP compared to ONCE (Perez-Rua et al. 2020) with 2.4% AP in the 5-shot setting. Furthermore, we can see that our method also significantly outperforms ONCE (Ours: 24.6% vs. ONCE: 2.6%) in the 10-shot setting.

Shot  Method                        Novel AP / AP50
1     ONCE (Perez-Rua et al. 2020)  - / -
1     Ours                          4.1 / 6.6
5     ONCE (Perez-Rua et al. 2020)  2.4 / -
5     Ours                          16.6 / 26.3
10    ONCE (Perez-Rua et al. 2020)  2.6 / -
10    Ours                          24.6 / 38.4

Table 6: Results of incremental few-shot object detection on the VOC test set.

Ablation Studies

Table 4: Ablation studies with 5-shot per novel class on COCO val set.

Table 4 shows the ablation studies to understand the effectiveness of each component in our framework. The first row is the result of the model initialized from the pre-trained base model and then trained with the standard Deformable DETR loss on a few novel samples, without using any component introduced in our framework. We can see that this setting gives the lowest performance, which is evidence of the importance of the components of our method. The subsequent settings in Rows 2-8 include various combinations of the components from our framework. Our complete framework achieves the best performance on all classes. These results further show the effectiveness of the two-stage fine-tuning, self-supervised learning and knowledge distillation strategies in our method.

Table 5: Ablation studies with 5-shot per novel class on COCO val set.

We combine the two steps of the base model pre-training stage into a single step and train the model from scratch with the pseudo targets generated by the selective search algorithm and the targets from the abundant base class data. As shown in Row 1 of Table 5, we can see that the results of the base and novel classes drop with the one-step base model pre-training (Ours: 30.5% vs. one-stage: 28.3% AP for the base classes; Ours: 8.3% vs. one-stage: 7.1% AP for the novel classes). Furthermore, we report the effectiveness of keeping the class-agnostic modules of the model frozen in the incremental few-shot fine-tuning stage. As shown in Row 2 of Table 5, we can see that the results of the base and novel classes drop significantly when unfreezing the class-agnostic CNN backbone, transformer and regression head in the incremental few-shot learning stage. These results demonstrate that each component in our method has a critical role for incremental few-shot learning.

Conclusion

In this paper, we present a novel incremental few-shot object detection framework, Incremental-DETR, for the more challenging and realistic scenario with no samples from the base classes and only a few samples from the novel classes in the training dataset. To circumvent the challenges in DETR for incremental few-shot object detection, we identify through empirical studies that the projection layer and the classification layer of the DETR architecture are class-specific, while other modules such as the transformer are class-agnostic. We then propose the use of a two-stage fine-tuning strategy and self-supervised learning to retain the knowledge of the base classes and to learn a better generalizable projection layer to adapt to the novel classes. Furthermore, we utilize a knowledge distillation strategy to transfer knowledge from the base to the novel model using the few samples from the novel classes. Extensive experimental results on the
benchmark datasets show the effectiveness of our proposed method. We hope our work can provide good insights and inspire further research on the relatively under-explored but important problem of incremental few-shot object detection.

Acknowledgements

The first author is funded by a scholarship from the China Scholarship Council (CSC). This research/project is supported by the National Research Foundation Singapore and DSO National Laboratories under the AI Singapore Programme (AISG Award No: AISG2-RP-2020-016) and the Tier 2 grant MOE-T2EP20120-0011 from the Singapore Ministry of Education.

References

Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; and Zagoruyko, S. 2020. End-to-end object detection with transformers. In European Conference on Computer Vision, 213–229. Springer.
Carlucci, F. M.; D'Innocente, A.; Bucci, S.; Caputo, B.; and Tommasi, T. 2019. Domain generalization by solving jigsaw puzzles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2229–2238.
Caron, M.; Bojanowski, P.; Joulin, A.; and Douze, M. 2018. Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), 132–149.
Caron, M.; Misra, I.; Mairal, J.; Goyal, P.; Bojanowski, P.; and Joulin, A. 2020. Unsupervised learning of visual features by contrasting cluster assignments. arXiv preprint arXiv:2006.09882.
Chen, T.; Kornblith, S.; Norouzi, M.; and Hinton, G. 2020. A simple framework for contrastive learning of visual representations. In International conference on machine learning, 1597–1607. PMLR.
Doersch, C.; Gupta, A.; and Efros, A. A. 2015. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE international conference on computer vision, 1422–1430.
Everingham, M.; Van Gool, L.; Williams, C. K.; Winn, J.; and Zisserman, A. 2010. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2): 303–338.
Ganea, D. A.; Boom, B.; and Poppe, R. 2021. Incremental Few-Shot Instance Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1185–1194.
Gidaris, S.; Bursuc, A.; Komodakis, N.; Pérez, P.; and Cord, M. 2019. Boosting few-shot visual learning with self-supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 8059–8068.
Gidaris, S.; Singh, P.; and Komodakis, N. 2018. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728.
Girshick, R. 2015. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, 1440–1448.
Girshick, R.; Donahue, J.; Darrell, T.; and Malik, J. 2014a. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, 580–587.
Girshick, R.; Donahue, J.; Darrell, T.; and Malik, J. 2014b. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, 580–587.
He, K.; Fan, H.; Wu, Y.; Xie, S.; and Girshick, R. 2020. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 9729–9738.
Hinton, G.; Vinyals, O.; and Dean, J. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
Kang, B.; Liu, Z.; Wang, X.; Yu, F.; Feng, J.; and Darrell, T. 2019. Few-shot object detection via feature reweighting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 8420–8429.
Kj, J.; Rajasegaran, J.; Khan, S.; Khan, F. S.; and Balasubramanian, V. N. 2021. Incremental Object Detection via Meta-Learning. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Kuhn, H. W. 1955. The Hungarian method for the assignment problem. Naval research logistics quarterly, 2(1-2): 83–97.
Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; and Belongie, S. 2017a. Feature Pyramid Networks for Object Detection. In IEEE Conference on Computer Vision and Pattern Recognition.
Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; and Dollár, P. 2017b. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, 2980–2988.
Lin, T.-Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; and Zitnick, C. L. 2014. Microsoft coco: Common objects in context. In European conference on computer vision, 740–755. Springer.
Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C. Y.; and Berg, A. C. 2016. SSD: Single Shot MultiBox Detector. In European Conference on Computer Vision.
McCloskey, M.; and Cohen, N. J. 1989. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, 109–165. Elsevier.
Pathak, D.; Krahenbuhl, P.; Donahue, J.; Darrell, T.; and Efros, A. A. 2016. Context encoders: Feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2536–2544.
Perez-Rua, J.-M.; Zhu, X.; Hospedales, T. M.; and Xiang, T. 2020. Incremental Few-Shot Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13846–13855.
Ratcliff, R. 1990. Connectionist models of recognition memory: constraints imposed by learning and forgetting functions. Psychological review, 97(2): 285.
Redmon, J.; Divvala, S.; Girshick, R.; and Farhadi, A. 2016.
You only look once: Unified, real-time object detection. In
Proceedings of the IEEE conference on computer vision and
pattern recognition, 779–788.
Ren, S.; He, K.; Girshick, R.; and Jian, S. 2015. Faster
R-CNN: Towards Real-Time Object Detection with Region
Proposal Networks. In International Conference on Neural
Information Processing Systems.
Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.;
and Savarese, S. 2019. Generalized intersection over union: A
metric and a loss for bounding box regression. In Proceedings
of the IEEE/CVF conference on computer vision and pattern
recognition, 658–666.
Shmelkov, K.; Schmid, C.; and Alahari, K. 2017. Incremental
learning of object detectors without catastrophic forgetting.
In Proceedings of the IEEE International Conference on
Computer Vision, 3400–3409.
Sun, B.; Li, B.; Cai, S.; Yuan, Y.; and Zhang, C. 2021. FSCE:
Few-shot object detection via contrastive proposal encoding.
In Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, 7352–7362.
Uijlings, J. R.; Van De Sande, K. E.; Gevers, T.; and Smeul-
ders, A. W. 2013. Selective search for object recognition.
International journal of computer vision, 104(2): 154–171.
Wang, X.; Huang, T. E.; Darrell, T.; Gonzalez, J. E.; and
Yu, F. 2020. Frustratingly simple few-shot object detection.
arXiv preprint arXiv:2003.06957.
Wang, Y.-X.; Ramanan, D.; and Hebert, M. 2019. Meta-
learning to detect rare objects. In Proceedings of the
IEEE/CVF International Conference on Computer Vision,
9925–9934.
Wu, J.; Liu, S.; Huang, D.; and Wang, Y. 2020. Multi-
scale positive sample refinement for few-shot object detec-
tion. In European Conference on Computer Vision, 456–472.
Springer.
Wu, X.; Sahoo, D.; and Hoi, S. 2020. Meta-RCNN: Meta
learning for few-shot object detection. In Proceedings of the
28th ACM International Conference on Multimedia, 1679–
1687.
Zhai, X.; Oliver, A.; Kolesnikov, A.; and Beyer, L. 2019. S4l:
Self-supervised semi-supervised learning. In Proceedings of
the IEEE/CVF International Conference on Computer Vision,
1476–1485.
Zhang, G.; Luo, Z.; Cui, K.; and Lu, S. 2021. Meta-detr: Few-
shot object detection via unified image-level meta-learning.
arXiv preprint arXiv:2103.11731, 2.
Zhou, X.; Wang, D.; and Krähenbühl, P. 2019. Objects as
points. arXiv preprint arXiv:1904.07850.
Zhu, F.; Zhang, X.-Y.; Wang, C.; Yin, F.; and Liu, C.-L.
2021. Prototype Augmentation and Self-Supervision for
Incremental Learning. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition,
5871–5880.
Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; and Dai, J. 2020.
Deformable detr: Deformable transformers for end-to-end
object detection. arXiv preprint arXiv:2010.04159.